With the release of the User Application, starting I guess with IDM 3.0 or so, and its evolution to its current state, there is a fair bit of stuff going on with it, that can be complex to troubleshoot. Lots of new features have been added, lots of legacy features that are probably not commonly used are maintained. Additionally there is a fair bit of frame work stuff that is used but is not something you would think of as relevant in troubleshooting.
One very important step is to enable logging to see what is going on. However, there are so many log levels available (every class in use has the ability to individually log, which is normally awesome, but in this case overwhelming).
After a fair bit of digging (and to be honest, help from someone else) I found where the latest docs list the various logging classes, but I was kind of disappointed by the brevity of the information. For example, most of the documentation for these log levels consists of stuff like:
Messages related to the core portlet API
That is not very helpful to me, alas, as I do not know what uses the portlet API in the User App, so do I need to log that to troubleshoot?
Like DStrace, you could try turning everything on, but then the log is so overwhelmingly large, it is impossible to find issues in it, since not every thing that is an issue, says Error that can be searched for. However, often the ‘shape’ of the text changes when there is an error, and does some indenting, so that you could potentially page through the thousands of pages, just watching for ‘shape’ changes in the text. This turns out to be a surprisingly effective way to find stuff in trace. Hold down the space bar in ‘less’ (on a Linux box) and it scrolls by so fast, all you can see is shapes. Then stop when you see a change, back up slower and you can find it pretty quick.
Also, with everything tracing, performance is going to totally suck. Just like with DStrace on, but I suspect not quite as bad. Part of the reason DStrace slows the engine down so much is that the XML stuff in memory is stored in an efficient binary format, that is NOT the text version we like to see. But when you trace at level 2 or higher, the engine needs to spend time converting from that in-memory format to the text format, and that really slows it down.
Anyway, since I could not get the information I needed in the documentation, I figured I would start collecting what I have learned, and try to encourage others to join me, by starting a Wiki page on the topic. Go take a look at:
If you have been troubleshooting a User App issue, and found that a particular log level set to DEBUG helps you figure out what is going on, could you please update the Wiki (Just need a Novell or NetIQ login account, like you do for almost everything else to edit it) with the information you have. This way, we all can benefit from some shared knowledge. Alas, if you look at the page, still lots more possibilities to fill out of interest, so if you can help, please consider doing so.
As always remember, these are for troubleshooting, and should be left at the INFO level, for the production server and day to day running. The more debugging enabled, the slower it will be.
I was working on a User App 4.02 that had been upgraded recently from 4.01 and then patched to the 4.02 Patch C by my colleague. I was pretty sure it was working, until I went to go look at the Roles and Resource Catalogs.
The GUI gave me an error and would not show anything in either catalog. Looking in trace I found on User App startup I was seeing this error message:
Working with Support we were able to get to the point where we realized we were missing columns or more in the database tables. We were able to get the DBA to look at the DB and the named columns were just plain missing.
I do not have the complete log snippet, but the installer log showed that a Table Update process started, but the completion is not recorded in the log, which I suspect is what our issue was.
You can find the command you need to run, around some text that looks like this:
************************************************** If a failure is encountered while creating the tables, verify that this string is correct If not, you can modify this string and copy/paste to a command line to run **************************************************
I also suspect that it was a timing thing. When I finally fixed it, as you will see below it took 11 minutes in Dev, and oddly 16 minutes in Production. I imagine that looks like a hung instller/upgrader at some point, or else maybe something timed out.
The error we saw looks like this in the User App trace. (On a JBoss system that would be the server.log in the /opt/novell/rbpm/jboss/server/IDMProv/log/server.log or some path like that. We were running on WebSphere, so the file was the SystemOut.log in a funny path buried deep under the WebSphere structure. Ask your WebSphere guy where to find the SystemOut logs and that will point you in the right direction. Mine is: /home/websphere/IBM/WebSphere/Dmgr/profiles/idm/logs/server1/).
[12/10/13 22:07:06:213 EST] 0000000d SystemOut O INFO [RBPM] [com.novell.soa.persist.DatabaseSchemaUpdate:analyzeSchema] Checking schema for table SEC_DELPROXY_SRV_CFG....found [12/10/13 22:07:08:202 EST] 0000000d SystemOut O INFO [RBPM] [com.novell.soa.persist.DatabaseSchemaUpdate:analyzeSchema] Checking schema for table SEC_SYNC_CLEANUP_QUEUE....found [12/10/13 22:07:10:754 EST] 0000000d SystemOut O ERROR [RBPM] [com.sssw.fw.servlet.EboBootServlet:init] Runtime exception initializing. com.novell.soa.persist.PersistenceException: liquibase.exception.DatabaseException: Error executing SQL SELECT FILENAME,AUTHOR,ID,MD5SUM,DATEEXECUTED,ORDEREXECUTED,TAG,EXECTYPE FROM DATABASECHANGELOG ORDER BY DATEEXECUTED ASC, ORDEREXECUTED ASC: ORA-00904: "EXECTYPE": invalid identifier at com.novell.soa.persist.DatabaseSchemaUpdate.unappliedChangesets(DatabaseSchemaUpdate.java:346) at com.novell.soa.persist.DatabaseSchemaUpdate.validateDatabaseSchema(DatabaseSchemaUpdate.java:212) at com.sssw.fw.servlet.EboBootServlet.init(EboBootServlet.java:98) at com.sssw.portal.servlet.EboPortalBootServlet.init(EboPortalBootServlet.java:59) at javax.servlet.GenericServlet.init(GenericServlet.java:241) at com.ibm.ws.webcontainer.servlet.ServletWrapper.init(ServletWrapper.java:359) at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.init(ServletWrapperImpl.java:171)
The solution was to run the update command, which I actually got from the installer log (though the username info was x’ed out as –username=xxxxxx and the –password=xxxxxx
/home/websphere/IBM/WebSphere/Dmgr/java/jre/bin/java -Xms256m -Xmx256m -Dwar.context.name=IDMProv -jar /opt/novell/idm/liquibase.jar --databaseClass=com.novell.soa.persist.OracleUnicodeDatabase --driver=oracle.jdbc.driver.OracleDriver --classpath=/home/websphere/ojdbc6.jar:/opt/novell/idm/IDMProv.war --changeLogFile=DatabaseChangeLog.xml --url="jdbc:oracle:thin:@dbserver.myhost.com:1521:idmprd" --contexts="prov,updatedb" --logLevel=finer --logFile=/opt/novell/idm/db.out --username=idmapp --password=password update
There are two possible command modifiers for this function, the final ‘update’ and ‘???’
Some of this is documented here:
But alas, not the exact command, it only tells you to go look in the install log file to get the exact command. The primary reason for that, is the connect string for the database is of course a little hard to predict or even explain in a simple fashion, since so many different databases are supported, and thus the JDBC drivers and what not. You can see these three fields in the command that are specific to your install case.
--databaseClass=com.novell.soa.persist.OracleUnicodeDatabase --driver=oracle.jdbc.driver.OracleDriver --url="jdbc:oracle:thin:@dbserver.myhost.com:1521:idmprd"
However, I would still like to see some explanation of the command and its options in the docs, but that is my perenial argument on this topic. I recognize how hard it is to write docs the way I would like, but I can dream!
This example is a WebSphere application server against an Oracle database. I do not know if the WebSphere vs JBoss issue makes a difference, but you can imagine the matrix of possibilities. (Of course the installer has this information, since it is the thing that actually builds this string in the first place, so not like the information is unknown.)
You can see in the command that it specifies an output file ( –logFile=/opt/novell/idm/db.out ) and if you look, you will see some interesting information. I have snipped out bits of this that are very redundant. I.e. For every table, and whatnot. There were a surpsing number of those.
From the db.out file:
INFO 12/1/13 4:22 PM:liquibase: Successfully acquired change log lock INFO 12/1/13 4:22 PM:liquibase: Reading from DATABASECHANGELOG INFO 12/1/13 4:22 PM:liquibase: Updating null or out of date checksum on changeSet DbDropForeignKeys.xml::100::IDMRBPM::(Checksum: 3:62ab1433cd27b570f7ea6ad2a0cad894) to correct value INFO 12/1/13 4:22 PM:liquibase: Updating null or out of date checksum on changeSet DbDropForeignKeys.xml::1900::IDMRBPM::(Checksum: 3:bf06fa9dff29f0496de7cc91cc2c7ecc) to correct value
There were 400+ lines of this stuff! So be glad I trimmed this down for brevity’s sake.
INFO 12/1/13 4:22 PM:liquibase: Reading from DATABASECHANGELOG INFO 12/1/13 4:35 PM:liquibase: Marking ChangeSet: DbDropPrimaryKeys.xml::100::IDMRBPM::(Checksum: 3:fe60aab4237791809d4c2caefdec4744) ran despite precondition failure due to onFail='MARK_RAN': DatabaseChangeLog.xml : Table AFACTIVITY_ORIG does not exist INFO 12/1/13 4:35 PM:liquibase: Marking ChangeSet: DbDropPrimaryKeys.xml::200::IDMRBPM::(Checksum: 3:af4b4eb1eb37ce75b16edcc8cb39d417) ran despite precondition failure due to onFail='MARK_RAN': DatabaseChangeLog.xml : Table AFACTIVITYTIMERTASKS_ORIG does not exist
Then there were 126 lines of this stuff. Though each instance was three or four lines, which means it is not too bad.
Then we get some steps that offer some timing info, alas, not the parts that would be interesting but at least some timing info.
INFO 12/1/13 4:36 PM:liquibase: ChangeSet DbDropPrimaryKeys.xml::1300::IDMRBPM ran successfully in 60183ms INFO 12/1/13 4:36 PM:liquibase: ChangeSet DbConstraints.xml::3510::IDMRBPM ran successfully in 46ms INFO 12/1/13 4:36 PM:liquibase: ChangeSet DbConstraints.xml::3710::IDMRBPM ran successfully in 19ms INFO 12/1/13 4:36 PM:liquibase: Successfully released change log lock
As I look at this trace, I used to think that liquidbase was some open source project that the IDM team consumed to do stuff for them, but the more I look at it in trace, I wonder if this is an internal development they maintain themselves. I would like to know more about this class and the sort of things it is used for. It appears to be an abstraction for database stuff, which is always a wise thing to have. Makes handling diverse databases much easier for the developers, so a good thing. Sort of like how node,js and jQuery and others are set up to handle ECMA Script pecularities between different browser implementations, rather than making you figure it all out, every time, on your own. Abstraction layers can be very helpful.
Now that command to update the tables worked in the development system. Which was quite a relief. When I tried it on the Production system, when I discovered it too had the same issue, I got an error in the db.out file that I foolishly did not grab. As usual, when in doubt, try again, right? So when I tried running it a second time, it overwrote the db.out file so I lost the errors. Alas the third and fourth time did little else to help as well. Here is where I wish there had been a good explanation of the command line options so that I perhaps could have more easily recreated the tables.
In the end, we had to get the DBA to drop the tables entirely. This probably did not matter too much to the production system, since the database is mostly (as far as I know) used to store the state of running workflows. The good news is that this customer did not have any long running workflows. However I once worked on a project where they had workflows that could run for up to 6 weeks. Conceeded that was a bit strange, but it worked for them. (New users, would get an email, once a week for 4 weeks, and then if they do not do something, eventually it writes out to file, for snail mail delivery to their home address. In that case, killing the database would have really stunk, since we would have lost, with no good way to recover, all the running workflows. But in this case it did not really matter.
It is worth noting that the CN=AppConfig tree structure that resides under the User Application driver object in eDirectory is much more important and stores the Roles and Resources definitions, the DAL, and much much more content. So do not in any way misconstrue this article to suggest touching the AppConfig container.
Anyway when we next started WebSphere the logs showed that it noticed the missing tables and recreated them. Again it took quite a bit of time, more than I would have expected personally, but it did complete successfully, which was good news on a producion system.
Here is what we saw in the log when this happens. The User app starts up, and you can see here that it misses some tables, (I snipped out many of the repetitive lines, there are probably 20-30 tables it is looking for.
[12/11/13 13:32:28:641 EST] 0000000b SystemOut O INFO [RBPM] [com.novell.soa.persist.DatabaseSchemaUpdate:analyzeSchema] Checking schema for table SEC_SYNC_CLEANUP_QUEUE....missing [12/11/13 13:32:28:642 EST] 0000000b SystemOut O INFO [RBPM] [com.novell.soa.persist.DatabaseSchemaUpdate:updateSchema] Creating database schema
Then it sat there, chugging away, not much to see for about 11 minutes, which is to me quite surprising. How big a database can a production server generate in 11 minutes? This is a dedicated WebSphere box just for the User App, and is not a low spec server either.
Amusingly, WebSphere figured there was an issue as well and notes that this thread has been sitting there for 716122 milliseconds, about 12 minutes, but it let it run, just threw the error message.
[12/11/13 13:43:41:287 EST] 00000017 ThreadMonitor W WSVR0605W: Thread "server.startup : 2" (0000000b) has been active for 716122 milliseconds and may be hung. There is/are 1 thread(s) in total in the server that may be hung. at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:140) at oracle.net.ns.Packet.receive(Packet.java:283)
Finally, almost 19 minutes after it began we see this message:
[12/11/13 13:51:05:723 EST] 0000000b SystemOut O INFO [RBPM] [com.novell.soa.persist.DatabaseSchemaUpdate:validateDatabaseSchema] All requiredtables found. Schema is valid. [12/11/13 13:51:05:726 EST] 0000000b SystemOut O INFO [RBPM] [com.sssw.portal.servlet.EboPortalBootServlet:init] Identity Manager Roles Based Provisioning Module Version 4.0.2 [12/11/13 13:51:05:726 EST] 0000000b SystemOut O INFO [RBPM] [com.sssw.portal.servlet.EboPortalBootServlet:init] Patch C [12/11/13 13:51:05:726 EST] 0000000b SystemOut O INFO [RBPM] [com.sssw.portal.servlet.EboPortalBootServlet:init] Driver Version 2.2 [12/11/13 13:51:05:727 EST] 0000000b SystemOut O INFO [RBPM] [com.sssw.portal.servlet.EboPortalBootServlet:init] Build Revision 39446
You can see that the database looks good, the schema is valid, and this is a RBPM Provisioning instance (I assume that means Advanced Edition, not sure what Standard Edition would show here), version 4.02 running Patch C.
What is interesting to note is the report of the Driver Version. This actually matches the package level of the driver in Designer. It is worth noting that these package builds seem to be hard coded into the User App and it is quite picky that the Package version match the version User App wants. Which is a fair requirement. There does appear to be an attribute in the AppCOnfig container somewhere that it is looking at, and I have been meaning to go spelunking for it, but no time to do that lately.
After the User App has successully started you can get most of this information from two different places. Once you are logged in, in the upper right hand corner you can click on XXXXXX and it will show you the User App and patch levels.
In the Administration top tab, there is a Driver side tab, that will show you the driver version as well.
Hopefully all these error messages, examples, and trace samples will help someone in the future who runs into a similar issue.
Disclaimer: As with everything else at NetIQ Cool Solutions, this content is definitely not supported by NetIQ, so Customer Support will not be able to help you if it has any adverse effect on your environment. It just worked for at least one person, and perhaps it will be useful for you too. Be sure to test in a non-production environment.