Diagnosing a 6038 Error in Identity Manager

geoffc

By: geoffc

January 8, 2010 2:35 pm

Reads: 281

Comments:0

Rating:0

Diagnosing an 6038 error in Identity Manager:

In many fields, just knowing how to install and manage an application does not really tell you a lot about your abilities to troubleshoot a problem.

For example in the case of ZENworks (be it ZENworks 7 and lower, or ZENworks Configuration Management 10.x and higher) while there are basic rules to building workstation images and application install MSI files, it turns out to be more of an art than a science. The more experienced you are at building images, the better a job you usually do. This is because there are often many ways to do something, each with its own subtle strengths and weaknesses. The only effective way to evaluate when to use each, is to have tried it. (In theory some one ought to be able to write it all down, but then there is so much information involved, it gets lost in the noise.

Here is the email message I got, that started me looking into it. We have a rule in our drivers that any time a <status> document comes through the driver, with a level=”error” we send an email, and I review them to be sure it is not important. You can do this yourself, in the Input Transform, add a rule that looks for if operation is status, and XPATH is true of @level=”error” and you can select for this event too! (We then XML Parse the XPATH of . (period) and use that as the body of the email to convert the node set of the error document, into a text value that email can handle.

Well this one sure was interesting to start with!

Identity Vault event from driver \ACME-LDAP\acme\Drivers\IDM\eDirectory at Wednesday, July 22, 2009 7:48:21 PM EDT
 on  object  ==>  has returned an error.

 = = = = = = = = = = = = = = = = = = = = = = = = = = = =
<status level="error">Code(-9010) An exception occurred: novell.jclient.JCException: initVlistIterator -6038 UNKNOWN ERROR<application>DirXML</application>
       <module>eDirectory</module>
       <object-dn></object-dn>
       <component>Publisher</component>
</status>
 = = = = = = = = = = = = = = = = = = = = = = = = = = = =

 

Hmm, well 6038 is not a well defined error. Besides, you have to love errors that are defined as “UNKNOWN ERROR”, how can it get better or more helpful than that! Argh! I hate these useless error code messages. If you had the time to define it as a value, it should have a definition as well.

Ah well. Accept, and move on.

On a side note, a great reference for Novell error codes is this page:
http://www.novell.com/documentation/nwec/nwec_enu/nwec_list_of_all_codes.html

You can get to this, and sub page versions of it, from TID 10080995 which might work from this link:
http://www.novell.com/support/search.do?cmd=displayKC&docType=kc&externalId=10080995&sliceId=&dialogID=31426067&stateId=1%200%201784344

I have bad luck linking to TID’s over the past year or so, so I offer both approaches for you.

Of course those error code pages only list up to error 6030 and this is a 6038, so even less help than usual!

Next step since this message came from a driver in our LDAP tree lets go to that tree and do the first thing I always do when troubleshooting an eDirectory issue or application and check time synchronization.

So I connected to the LDAP tree IDM server, and checked time sync with the following command:

acm901lnx:~ # ndsrepair -T


[1] Instance at /var/opt/novell/eDirectory/conf/nds.conf:  acm901lnx.OU=Servers.O=acme.ACME-LDAP
Repair utility for Novell eDirectory 8.8 - 8.8 SP3 v20214.27
DS Version 20216.89  Tree name: ACME-LDAP
Server name: .acm901lnx.Servers.acme

Size of /opt/novell/eDirectory/data/ACME-LDAP/log/ndsrepair.log = 5670 bytes.

Building server list
Please Wait... 
Preparing Log File "/opt/novell/eDirectory/data/ACME-LDAP/log/ndsrepair.log"
Please Wait... 
Collecting time synchronization and server status
Time synchronization and server status information 
Start:  Wednesday, July 22, 2009 19:56:14 Local Time

---------------------------+---------+---------+-----------+--------+-------
                              DS      Replica   Time        Time is     Time
Server name                 Version   Depth     Source      in sync      +/-
---------------------------+---------+---------+-----------+--------+-------
Processing server: .LOCAL006W03E-NDS.Servers.acme
.LOCAL006W03E-NDS.Serv...   20216.89    1       Non-NetWare No           +54
Processing server: .and600lnx.Servers.acme
.and600lnx.Servers.acme    20216.83    0       Non-NetWare No           +54
Processing server: .acm600lnx.Servers.acme
.acm600lnx.Servers.acme    20216.83    0       Non-NetWare No           +54
Processing server: .and901lnx.Servers.acme
.and901lnx.Servers.acme    20216.89    0       Non-NetWare No           +54
Processing server: .acm901lnx.Servers.acme
.acm901lnx.Servers.acme    20216.89    0       Non-NetWare Yes            0
---------------------------+---------+---------+-----------+--------+-------
      Total errors: 0
NDSRepair process completed.

Ok, this is not good, but not end of the world. We can fix this later. Time is 54 seconds out of sync, but you can see from the pattern that basically it is the acme901lnx server that is the issue, and it is probably 54 seconds BEHIND the other servers, since all but that one show the same time offset.

Next, lets check up the replica synchronization with ndsrepair, as that is always step two in this process.

acm901lnx:~ # ndsrepair -E


Collecting replica synchronization status
Start:  Wednesday, July 22, 2009 19:56:47 Local Time
Retrieve replica status

Partition: .[Root].
  Replica on server: .and600lnx.Servers.acme
  Replica: .and600lnx.Servers.acme        07-22-2009 19:57:38
  Replica on server: .acm600lnx.Servers.acme
  Replica: .acm600lnx.Servers.acme        07-22-2009 19:57:39
  Replica on server: .and901lnx.Servers.acme
  Replica: .and901lnx.Servers.acme        07-22-2009 19:57:38
  Replica on server: .acm901lnx.Servers.acme
  Replica: .acm901lnx.Servers.acme        07-22-2009 19:56:44
All servers synchronized up to time:         07-22-2009 19:56:44  
Partition: .Security
  Replica on server: .and600lnx.Servers.acme
  Replica: .and600lnx.Servers.acme        07-22-2009 19:09:58
  Replica on server: .acm600lnx.Servers.acme
  Replica: .acm600lnx.Servers.acme        07-22-2009 19:09:58
  Replica on server: .LOCAL006W03E-NDS.Servers.acme
  Replica: .LOCAL006W03E-NDS.Servers.acme  07-22-2009 19:10:00
  Replica on server: .and901lnx.Servers.acme
  Replica: .and901lnx.Servers.acme        07-22-2009 19:09:59
  Replica on server: .acm901lnx.Servers.acme
  Replica: .acm901lnx.Servers.acme        07-22-2009 19:09:04
All servers synchronized up to time:         07-22-2009 19:09:04  
Partition: .asiapac.acme.corp
  Replica on server: .and600lnx.Servers.acme
  Replica: .and600lnx.Servers.acme        07-22-2009 19:53:33
  Replica on server: .acm600lnx.Servers.acme
  Replica: .acm600lnx.Servers.acme        07-22-2009 19:53:35
  Replica on server: .LOCAL006W03E-NDS.Servers.acme
  Replica: .LOCAL006W03E-NDS.Servers.acme  07-22-2009 19:53:30
  Replica on server: .and901lnx.Servers.acme
  Replica: .and901lnx.Servers.acme        07-22-2009 19:53:36
  Replica on server: .acm901lnx.Servers.acme
  Replica: .acm901lnx.Servers.acme        07-22-2009 19:52:42
All servers synchronized up to time:         07-22-2009 19:52:42  
Partition: .Servers.acme
  Replica on server: .and600lnx.Servers.acme
  Replica: .and600lnx.Servers.acme        07-22-2009 19:53:34
  Replica on server: .acm600lnx.Servers.acme
  Replica: .acm600lnx.Servers.acme        07-22-2009 19:53:34
  Replica on server: .LOCAL006W03E-NDS.Servers.acme
  Replica: .LOCAL006W03E-NDS.Servers.acme  07-22-2009 19:53:29
  Replica on server: .and901lnx.Servers.acme
  Replica: .and901lnx.Servers.acme        07-22-2009 19:53:37
  Replica on server: .acm901lnx.Servers.acme
  Replica: .acm901lnx.Servers.acme        07-22-2009 19:52:43
All servers synchronized up to time:         07-22-2009 19:52:43  
Partition: .Admins.acme
  Replica on server: .and600lnx.Servers.acme
  Replica: .and600lnx.Servers.acme        07-22-2009 19:53:35
  Replica on server: .acm600lnx.Servers.acme
  Replica: .acm600lnx.Servers.acme        07-22-2009 19:53:36
  Replica on server: .LOCAL006W03E-NDS.Servers.acme
  Replica: .LOCAL006W03E-NDS.Servers.acme  07-22-2009 19:53:30
  Replica on server: .and901lnx.Servers.acme
  Replica: .and901lnx.Servers.acme        07-22-2009 19:53:41
  Replica on server: .acm901lnx.Servers.acme
  Replica: .acm901lnx.Servers.acme        07-22-2009 19:52:44
All servers synchronized up to time:         07-22-2009 19:52:44  
Partition: .grpwise.acme
  Replica on server: .and600lnx.Servers.acme
  Replica: .and600lnx.Servers.acme        07-22-2009 19:53:39
  Replica on server: .LOCAL006W03E-NDS.Servers.acme
  Replica: .LOCAL006W03E-NDS.Servers.acme  07-22-2009 19:53:31
  Replica on server: .acm600lnx.Servers.acme
  Replica: .acm600lnx.Servers.acme        07-22-2009 19:53:40
  Replica on server: .and901lnx.Servers.acme
  Replica: .and901lnx.Servers.acme        07-22-2009 19:53:42
  Replica on server: .acm901lnx.Servers.acme
  Replica: .acm901lnx.Servers.acme        07-22-2009 19:52:49
All servers synchronized up to time:         07-22-2009 19:52:49  
Partition: .corp
  Replica on server: .and600lnx.Servers.acme
  Replica: .and600lnx.Servers.acme        07-22-2009 19:51:35
  Replica on server: .acm600lnx.Servers.acme
  Replica: .acm600lnx.Servers.acme        07-22-2009 19:51:36
  Replica on server: .and901lnx.Servers.acme
  Replica: .and901lnx.Servers.acme        07-22-2009 19:51:36
  Replica on server: .acm901lnx.Servers.acme
  Replica: .acm901lnx.Servers.acme        07-22-2009 19:50:37
All servers synchronized up to time:         07-22-2009 19:50:37  
Finish:  Wednesday, July 22, 2009 19:56:52 Local Time

      Total errors: 0
NDSRepair process completed.

Well that looks perfect. This is not the issue. (I.e. eDir looks ok, once we fix the time sync issue). Thus this is probably not eDirectory. Well Identity Manager is an application that runs in the eDirectory memory space so checking eDirectory first makes sense to be sure the base is solid.

Next up is looking at LDAP tree, eDirectory driver log, to get some context from driver for the error event:

[07/22/09 18:52:42.282]:eDirectory ST:: Sending...
[07/22/09 18:52:42.282]:eDirectory ST:
<nds dtdversion="3.5" ndsversion="8.x">
  <source>
    <product version="3.6.1.4427">DirXML</product>
    <contact>Novell, Inc.</contact>
  </source>
  <input>
    <modify cached-time="20090722225241.928Z" class-name="User" event-id="acm901lnx#20090722225241#3#1" qualified-src-dn="dc=corp\dc=acme\dc=asiapac\OU=Migrated\CN=lius" src-dn="\ACME-LDAP\corp\acme\asiapac\Migrated\lius" src-entry-id="35465" timestamp="1248303202#2">
      <association state="associated">{94B63074-1027-9941-23BA-94B630741027}</association>
      <modify-attr attr-name="Login Time">
        <remove-value>
          <value timestamp="1248222145#8" type="time">1248222141</value>
        </remove-value>
        <add-value>
          <value timestamp="1248303202#2" type="time">1248303198</value>
        </add-value>
      </modify-attr>
      <modify-attr attr-name="Last Login Time">
        <remove-value>
          <value timestamp="1248222145#7" type="time">1248216491</value>
        </remove-value>
        <add-value>
          <value timestamp="1248303202#1" type="time">1248303191</value>
        </add-value>
      </modify-attr>
    </modify>
  </input>
</nds>
[07/22/09 18:52:42.286]:eDirectory ST:: Document sent.
[07/22/09 18:52:42.286]:eDirectory ST:: Waiting for receive...
[07/22/09 18:52:42.293]:eDirectory ST:: Received.
[07/22/09 18:52:42.303]:eDirectory ST:
<nds dtdversion="3.5" ndsversion="8.x">
  <source>
    <product version="3.6.1.4427">DirXML</product>
    <contact>Novell, Inc.</contact>
  </source>
  <output>
    <status level="error">Code(-9010) An exception occurred: novell.jclient.JCException: initVlistIterator -6038 UNKNOWN ERROR<application>DirXML</application>
      <module>eDirectory</module>
      <object-dn></object-dn>
      <component>Publisher</component>
    </status>
  </output>
</nds>

This trace shows that in the LDAP tree, a user logged in, and thus the Login Time and Last Login Time attributes are updated, and we are trying to synchronize that over to the vault tree. This way, we collate the last logins and keep a single reference for the last one between many systems. We do this for SOX compliance reasons, as we need to disable users who have been inactive for more than 90 days, and if we can collect their last login time (Login Time attribute actually, the Last Login Time attribute is the second last login. Poorly named attributes, alas) from all the systems, with only the latest stored, then we can trivially enforce this rule that otherwise would be quite difficult to do.

This is in the Subscriber channel, (The bit “eDirectory ST:” tag in the trace tells us that, ST for Sub channel, I forget what the T stands for). It gets a response from the IDV tree, with the 6038 error, which is actually eDir is locked in the file system. (In hindsight, this was spot on to the issue!!)

Thus the error is on the IDV eDir driver side of the equation,

Well on to the next place to look. Lets look at the IDV tree, and connect to ACM900LNX, which is the server with ndsd (ndsd is the name of eDirectory process on Linux, on Windows it would be dhost.exe and on Netware it would be ds.nlm) and IDM running.

Look at /var/opt/novell/log/idv-edir.log which is where we set DSTrace to log output to a file, and at the end of the file see:

[07/22/09 14:47:51.580]:eDirectory PT:Modifying entry Acme\Users\smithj.
[07/22/09 14:47:51.583]:eDirectory PT:
DirXML Log Event -------------------
     Driver:   \ACME-IDV\Acme\Drivers\IDM\eDirectory
     Channel:  Publisher
     Object:   \ACME-LDAP\corp\acme\americas\Migrated\smithj (Acme\Users\smithj)
     Status:   Success
[07/22/09 14:47:51.584]:eDirectory PT:Fixing up association references.
[07/22/09 14:47:51.584]:eDirectory PT:No schema mapping policies.
[07/22/09 14:47:51.584]:eDirectory PT:Applying output transformation policies.
[07/22/09 14:47:51.585]:eDirectory PT:Applying policy: %+C%14Csub-otp-EmailOnFailedPasswordPub%-C.
[07/22/09 14:47:51.585]:eDirectory PT:  Applying to status #1.
[07/22/09 14:47:51.585]:eDirectory PT:    Evaluating selection criteria for rule 'Send e-mail for a failed publish password operation'.
[07/22/09 14:47:51.585]:eDirectory PT:      (if-global-variable 'notify-user-on-password-dist-failure' equal "true") = TRUE.
[07/22/09 14:47:51.586]:eDirectory PT:      (if-operation equal "status") = TRUE.
[07/22/09 14:47:51.586]:eDirectory PT:      (if-xpath true "self::status[@level != 'success']/operation-data/password-publish-status") = FALSE.
[07/22/09 14:47:51.586]:eDirectory PT:    Rule rejected.
[07/22/09 14:47:51.587]:eDirectory PT:Policy returned:
[07/22/09 14:47:51.587]:eDirectory PT:
<nds dtdversion="3.5" ndsversion="8.x">
  <source>
    <product version="3.6.1.4427">DirXML</product>
    <contact>No^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
idv-edir.log lines 646950-646981/646981 (END) 

Oh boy, that cannot be good! This was a new one for me, and that cannot be good, seeing garbage like that in a log file!

Well lets start the diagnosis process on the IDV tree server, that we had started with in the LDAP tree.

Lets check time synchronization first.

acm900lnx:/var/opt/novell/log # ndsrepair -T

[1] Instance at /var/opt/novell/eDirectory/conf/nds.conf:  acm900lnx.OU=Servers.O=Acme.ACME-IDV
Unable to connect to NDS Server.
Read-only file system.

Hmm, read only file system, that cannot be good.

Well is it really a read only file system? The mount command should at least hint at it.

acm900lnx:/var/opt/novell/log # mount
/dev/xvda2 on / type ext3 (rw,acl,user_xattr)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
debugfs on /sys/kernel/debug type debugfs (rw)
udev on /dev type tmpfs (rw)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/xvda1 on /boot type ext3 (rw,acl,user_xattr)
/dev/xvda6 on /home type ext3 (rw,acl,user_xattr)
/dev/xvda3 on /tmp type ext3 (rw,acl,user_xattr)
securityfs on /sys/kernel/security type securityfs (rw)

Well eDirectory is running in a directory that is part of the root file system (/) and the mount command seems to claim it is a rw (read write) file system. Ok, so not something as obvious as that.

Maybe we ran out of disk space? We really need a log server for this environment, with more disk space, since we like leaving Dstrace running, even if it eats into performance because it makes troubleshooting issues that come up much more possible. The thinking is to mount this log server into the file system of each IDM server thus collecting all the logs in one spot on a server that is easier to add space too.

Also, we are not doing enough events to matter, in terms of Dstrace affecting performance. We have seen in the past that during a driver resync event where every object has to be checked, it definitely makes sense to disable trace. When you are doing that much driver traffic the performance is really affected. In normal day to day operations it does not cause enough of a performance hit to be worth disabling.

Use the df command (disk free?) to show usage on volumes. I like the -h for friendly results in megs and gigs, rather than blocks, which is the default.

acm900lnx:/var/opt/novell/log # df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda2            7.9G  6.4G  1.2G  86% /
udev                  1.1G   80K  1.1G   1% /dev
/dev/xvda1            487M   26M  436M   6% /boot
/dev/xvda6           1012M   34M  927M   4% /home
/dev/xvda3            2.0G  384M  1.6G  20% /tmp
acm900lnx:/var/opt/novell/log # 

Nope, and not out of space either. 1.5 gigs free on the root file system.

Ok, so trying to run a time sync check via ndsrepair -T told us it was a read only file system, lets see if the ndsd process is even running, using the init.d run script to check.

acm900lnx:/var/opt/novell/log # /etc/init.d/ndsd status
DDCGetServerName failed: -6038

                                                                      dead

Is eDir (ndsd) it running? Does not look like it, or is it? Lets double check to see if the ndsd process itself is running, using the ps command.

acm900lnx:/var/opt/novell/log # ps -ef | grep ndsd
root      2349     1  3 Jun28 ?        20:56:30 /opt/novell/eDirectory/sbin/ndsd

Well it does seem to be running, since we do have such a process. So the plot thickens! (Soon we shall have gravy!)

Something makes it think the file system is read only… Stopping ndsd with the command /etc/init.d/ndsd stop reports the following:

Executing customized settings before stopping the Novell eDirectory server...
Stopping Novell eDirectory server...
................/etc/init.d/ndsd: line 279: /opt/novell/eDirectory/data/ACME-IDV/log/ndsd.log: Read-only file system
rm: cannot remove `/opt/novell/eDirectory/data/ACME-IDV/data/ndsd.pid': Read-only file system
rm: cannot remove `/var/lock/subsys/ndsd': Read-only file system

The script tries to clean up after the process is killed, and cannot, claiming a read only file system again.

Starting ndsd with the command /etc/init.d/ndsd start shows:

Starting Novell eDirectory server...
A previous instance of ndsd was not shutdown cleanly. Ignoring old pid file.
                                                                      done
touch: cannot touch `/var/lock/subsys/ndsd': Read-only file system

Well how many times do I need to be told something before I believe it? Hmm, let me go count…

Lets test this the simplest way and copy a simple file:

acm900lnx:/var/opt/novell/log # cp list.txt list2.txt
cp: cannot create regular file `list2.txt': Read-only file system

Ok, so the file system really is read only. What can cause that? Well lets try to remount it just in case it decided to go all wonky on us for some reason.

Remounting fails:

acm900lnx:/var/opt/novell/log # mount -o remount,rw /
mount: block device /dev/xvda2 is write-protected, mounting read-only

Look at the dmesg output to see if there are any errors right now:

acm900lnx:/var/opt/novell/log # dmesg  | grep -i remount
Remounting filesystem read-only
EXT3-fs error (device xvda2): ext3_remount: Abort forced by user
EXT3-fs error (device xvda2): ext3_remount: Abort forced by user
EXT3-fs error (device xvda2): ext3_remount: Abort forced by user
EXT3-fs error (device xvda2): ext3_remount: Abort forced by user

Ok, definitely something going on! Any one think this is good news yet?

Like an onion, peeling back the layers, your eyes start to tear up more and more… Seeing file system errors often brings tears to my eyes, depending on the state of the backups we have.

dmesg without a grep to find just the remount related issues, but seeking to the remount issues in the log, and we see:

end_request: I/O error, dev xvda, sector 6175592
Buffer I/O error on device xvda2, logical block 643429
lost page write due to I/O error on xvda2
end_request: I/O error, dev xvda, sector 4300784
Buffer I/O error on device xvda2, logical block 409078
lost page write due to I/O error on xvda2
Buffer I/O error on device xvda2, logical block 409079
lost page write due to I/O error on xvda2
Aborting journal on device xvda2.
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
journal commit I/O error
journal commit I/O error
ext3_abort called.
EXT3-fs error (device xvda2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
end_request: I/O error, dev xvda, sector 4093160
Buffer I/O error on device xvda2, logical block 383125
lost page write due to I/O error on xvda2

Oh boy, this can not be good news.

Bad enough is the I/O error but on top of that journal commit errors means who knows how bad it is going to be!

The file system definitely has real issues. This typically looks like a hardware issue for it to be this bad, this server is running as a virtual machine an Citrix XEN Server (a commercial implementation of the open source XEN hypervisor), using a mounted file system on the XEN server, that is coming from a reasonably large EMC SAN, so if the SAN is having disk issues that have percolated up to the file system, and not been hidden by the SAN hardware then this is going to be such a bad day, there are few words to describe it.

But first, lets be sure it is not a software, file system issue, which a repair tool might be able to approach. I had better be sure I can get to the console, since fsck (the file system repair tool for Linux), normally resides as a binary file, inside the / (root) filesystem, so I better be sure I can get to the console, in case a reboot causes a problem.

Also, if I need too, the SLES10 SP2 DVD has a bootable partition on it, that can run fsck, for just this sort of case.

Time to launch XEN Center for all that needed access (Console screen, mount DVD’s from ISO’s etc) and some more news:

The good news here is that XEN Center reports a multi path failure in the logs, at the same time as the last event in Dstrace log. Later I found out that there was some kind of fibre channel path failure, that caused a multi path change. One of the four redundant paths (the active one of course, it never ever happens to the inactive ones, does it?) to the XEN servers had a problem and swapped over. Now this is a nice low level thing that you would think should have never percolated up as high in the stack as it did.

Well then lets try a simple restart of the server since we need to do it anyway to try the fsck approach, and the really good news is that everything came back up fine. The OS booted, ran an fsck on the file system since it detected errors, but eDirectory and Identity Manager started properly and caught up to the cached events that started to flow in. Since this eDirectory server was not accepting events due to a read only file system, the other servers with replicas were holding onto the events, and they started to synchronize back to this box once eDirectory loaded. Then those events got logged to the TAO file (cache file where Identity Manager events are held until processed) and finally processed by IDM.

This problem happened on a XEN Server with a number of VM’s, and in fact in a XEN VM cluster, with about 30 servers. Next step is to check all the other servers we are responsible for, and it looks like several others in the XEN cluster suffered similar problems, that simple reboots cleared up nice and quickly.

The other sites XEN was not affected, since it was clearly a multi path fail over at this end of the SAN that caused the issue.

This has also happened when a file system with eDirectory and Identity Manager on it get filled by logs, with basically similar symptoms.

Hopefully both the ultimate resolution, and the steps taken along the way to get there will be of help, should you ever run into a problem that is similar but not identical. Hope that my approach to troubleshooting this type of problem will help give you ideas when you are trying to troubleshoot errors you might have run into.

VN:F [1.9.22_1171]
Rating: 0.0/5 (0 votes cast)

Tags: , ,
Categories: Uncategorized

Disclaimer: As with everything else at NetIQ Cool Solutions, this content is definitely not supported by NetIQ, so Customer Support will not be able to help you if it has any adverse effect on your environment.  It just worked for at least one person, and perhaps it will be useful for you too.  Be sure to test in a non-production environment.

Comment