In the first two parts of this series I looked at the various events that appear in the trace of the Roles and Resources Service Driver (RRSD). It turns out this driver does not quite work like other drivers: it starts with an event-driven approach, but then converts each event into a command instead. In the previous articles it was pretty clear that an event on an object (nrfRequest, User, nrfResourceAssociation, nrfRole, or nrfResourceRequest) triggers the driver via the filter, but the output is not what you would necessarily expect from an IDM driver. Rather, you get a command in the format of nrf:identity (for User), nrf:request (for nrfRequest), and so on.
Then it uses JClient calls to write directly back to the directory. JClient is the Java version of DClient, an internal, unpublished API that NetIQ (and Novell before them) have used for their internal applications. In principle, JClient should be faster than most other approaches and, you would think, should outperform an IDM event approach.
However, at the end of the second article I was looking at the process that granting a Role to a User follows. Under the covers, when you do that in the User Application GUI, an eDirectory object of class nrfRequest is created, which the RRSD driver reacts to.
In that process, we watched 11 different events happen in the trace for the one ‘action’ we wanted enacted.
That seems like a lot of work for simply assigning a Role, with an associated Resource, to a user. If you look at the directory side of it, you will see the following attributes written to the user object:
We might get nrfMemberOf, but I am not sure if that is only for Roles or Resources inherited via a Group or OU assignment.
On the RBPM side, you will see writes to the following objects and attributes at a minimum.
We start with the creation of an nrfRequest object to kick the process off. There are lots of attributes on that one, with references to the target object to be granted, the target Role, the status, possibly a time range, and whatnot.
We will see this object written to at least three times after creation, as nrfStatus transitions from 0 to 30 (possibly twice, per the trace) and then from 30 to 50 when it is complete. The process of fulfilling this request to grant the Role will also write to the user as seen above, and to the Resource side as we will see below.
Along the way another nrfResourceRequest is created, which holds the DN of the target User (or Group/OU if appropriate), the DN of the target Resource, and the nrfEntitlementRef with the payload to be written to the user's entitlement attribute (DirXML-EntitlementRef). This too will get written to several times as it transitions through the nrfStatus levels: from 0 to 30 at least twice, and then on completion from 30 to 50.
The Resource will not get updated with the DN of the user; I do not see anything obvious in the schema suggesting there is a reciprocal mapping of Role and Resource assignments like we are used to with groups. With groups there are four attributes involved: two on the User, two on the Group. The Group gets a Member list and a Security Equal To Me list of assigned users. The User gets Group Membership updated with the DN of the Group object, as well as a Security Equals attribute value.
But in the case of Roles and Resources, it does not look like they implemented the link from the Role or Resource back to the User; rather, the values are stored on the users. This means that when you use the GUI to look at the Assignments tab of a Role or Resource, the User App is probably doing an LDAP query to find all Users, Groups, and Org Units whose nrfAssigned values contain the DN of that Role or Resource object. There is something to be said for dynamic querying like that, but also something to be said for simply reading a single attribute list off the Role or Resource object, or the User itself.

This is in fact more like the Active Directory model for Groups, where the Group has the editable Member list, while the Users have a read-only attribute (in later AD versions) of MemberOf that is evaluated on the fly when you look at it. To be fair, that is a backwards example, since in AD the membership is on the Group, while here in Roles the membership is on the Users, which is actually less efficient, since a User, Group, or Org Unit could be a member. Additionally, this is why assigning a Role to a dynamic group (one whose membership is defined by an LDAP filter) is so slow: it must be 'unrolled' on a regular interval and statically stamped with the Role and Resource assignments. Thus, rather than a simple, single attribute read off the Role or Resource object, a series of queries must be run, and the results paged in, whenever you look at the Assignments tab.
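To make the cost of that approach concrete, here is a minimal sketch of the kind of LDAP filter the User App would presumably have to build for the Assignments tab. The attribute name (nrfAssignedRoles) and the object classes here are assumptions for illustration, not confirmed internals; the real attribute uses Path syntax, so an actual query may need a substring match rather than plain equality.

```python
def assignments_filter(role_dn: str) -> str:
    """Match any User, Group, or Org Unit whose (assumed) assignment
    attribute references the given Role DN."""
    # Escape characters that are special in LDAP filters (RFC 4515).
    escaped = (role_dn.replace("\\", "\\5c")
                      .replace("*", "\\2a")
                      .replace("(", "\\28")
                      .replace(")", "\\29"))
    return ("(&(|(objectClass=User)(objectClass=Group)"
            "(objectClass=organizationalUnit))"
            f"(nrfAssignedRoles={escaped}))")

# Hypothetical Role DN, for illustration only.
print(assignments_filter(
    "cn=Manager,cn=Level20,cn=RoleDefs,cn=RoleConfig,cn=AppConfig"))
```

Contrast that three-class search, run against the whole user container, with a single attribute read off the Role object, which is what a reciprocal link would have allowed.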
I am sure there is an interesting reason for this, and I would love to hear it if anyone has any ideas. My current theory is that the developers were more used to a database approach than a directory approach when they wrote this. It seems like a desire to avoid references between tables. I am not sure, from a database perspective, why a foreign key reference would be worse than an entirely separate table holding the linkage between entries in two tables. I guess multi-valued attributes of this type would be somewhat odd to represent in a database as well. I make no claims to being a database expert, so if someone has a good explanation I would love to hear it.
Now, from a performance perspective, the most expensive operation in eDirectory is a write. Worse than that, eDirectory has a single-writer model, which means writes, even if individually quick, can become a bottleneck as events build up, so the number of writes really does matter. There was talk at BrainShare 2012, in an eDirectory futures session, about changing the single-writer model, but as you can imagine, they talked about the difficulties of doing that. GroupWise, which uses FLAIM as its underlying database just as eDirectory does, took an interesting approach to this issue: it uses 256 separate FLAIM databases, each with its own single-writer limitation, but with 256 databases you can have 256 writer threads. Of course that approach may not apply well to eDirectory. You could imagine, in theory, each partition being spawned off into its own database, until you consider the work needed to handle merge and split events in the directory, where partition operations would also require database splits and merges at the same time.
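To make the single-writer point concrete, here is a toy sketch, in plain Python with nothing eDirectory-specific about it: any number of producers can enqueue write events, but exactly one consumer applies them, so total write throughput is capped by that one thread no matter how fast each individual write is.

```python
import queue
import threading

# Illustration only: many producers, exactly one writer thread.
write_queue = queue.Queue()
database = {}  # stands in for the DIB

def writer():
    # The lone writer: drains queued events strictly one at a time.
    while True:
        item = write_queue.get()
        if item is None:  # sentinel: shut down
            break
        key, value = item
        database[key] = value

t = threading.Thread(target=writer)
t.start()

# Producers can enqueue concurrently, but the writes serialize
# behind the single consumer and build up in the queue.
for i in range(10):
    write_queue.put(("user%d" % i, "role-%d" % i))

write_queue.put(None)
t.join()
print(len(database))  # 10
```

The GroupWise trick described above amounts to running 256 of these queues side by side, one per database, so 256 writers can drain events in parallel.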
I know someone who has been testing ways to boost eDirectory (and IDM) performance, and who took some amusingly extreme steps. For example, knowing that the bottleneck is writes, you can imagine the simplest change is to move to a Solid State Disk (SSD) array. This is slower than a RAM disk, but not by all that much. (Well, writes are faster on a RAM disk than on an SSD, but reads should be reasonably comparable in speed, and SSDs are way cheaper than having that much RAM for the DIB.) If you are considering running eDirectory with its database files on an SSD, do consider running the latest release: eDirectory 8.8 SP8 added support for SSDs. It would work before, but apparently some operations were optimized to do a better job on an SSD.
I have heard anecdotally of extreme cases where you can see huge performance increases using an SSD, crazy numbers like contrived cases that were over 1000 times faster. So if you have a large IDM system and worry about performance, definitely consider using SSDs, or more specifically an array of SSDs, since a disk failure on a large system would be pretty ugly; mirrored disks, or maybe RAID 50 or other combinations, would be best. If you still need better performance, try to see if you can move the Roll Forward Logs somewhere else, so as not to share disk bandwidth on writes. You would want to get as many write-heavy parts of the system onto their own disks as possible. The back-end engineers know about these issues and have been thinking about them, so I expect to see some tricks coming from them eventually. You can imagine that changes to the underlying database engine in eDirectory take a fair bit of testing and care before release.
I was thinking about ways (short of large expenditures on SSD arrays) of speeding up the RRSD driver. None of these are supported, of course, since they require modifications to driver policies we are not supposed to touch.
With really large user bases (in the hundreds of thousands or millions) and large dynamic groups, performance really starts to become an issue.
Officially, there can be only one RRSD driver per set, sort of like the Highlander. (Anyone else hearing the Queen music in their head as I make that joke? I preferred the TV show to the movie: http://en.wikipedia.org/wiki/Highlander:_The_Series) Though I cannot quite imagine the engine standing there with a sword attacking drivers, so my metaphor fails.
But I was thinking about how you could run more than one RRSD in a tree/driver set. If you watch the trace shown in the previous articles, you will see that, policy-wise, XDS events get remapped to RRSD commands, and all that reaches the RRSD driver shim are 'commands', which then seem to be handled outside the scope of the IDM engine approach; I presume JClient writes them back to eDirectory.
The Work Order driver also suffers from a performance issue, and I am surprised I never wrote an article about the package I built to speed it up. In that case, the bottleneck is in the IDM engine's Query functionality: an IDM query cannot compare time syntax attributes the way LDAP can. In LDAP you can query with a filter like (DirXML-nwoDueDate<=20140501121200Z), which would return only the work order objects whose Due Date has now passed. The shim instead has to query for EVERY Work Order object, look at each DirXML-nwoDueDate, see if it is now in the past, and throw away the rest. If you have twenty thousand pending work orders, this can be slow enough to be annoying. I have a package that intercepts the queries and replaces them with LDAP queries, and the shim cannot tell the difference. But the RRSD driver cannot be helped with this sort of approach, alas.
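The shape of that server-side filter substitution can be sketched like this. This is an illustration of the filter the package would generate, not the package itself; note that LDAP filters have no strict '<' or '>', so the comparison must use '<=' against an LDAP generalizedTime value.

```python
from datetime import datetime, timezone

def overdue_workorder_filter(now: datetime) -> str:
    """Build an LDAP filter matching only Work Order objects whose
    Due Date has already passed, instead of fetching every one."""
    # generalizedTime format as exposed over LDAP: YYYYMMDDHHMMSSZ
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d%H%M%SZ")
    return ("(&(objectClass=DirXML-nwoWorkOrder)"
            f"(DirXML-nwoDueDate<={stamp}))")

print(overdue_workorder_filter(
    datetime(2014, 5, 1, 12, 12, tzinfo=timezone.utc)))
# -> (&(objectClass=DirXML-nwoWorkOrder)(DirXML-nwoDueDate<=20140501121200Z))
```

With twenty thousand pending work orders, the difference between this one filtered search and a full read-and-discard pass is exactly the annoying slowness described above.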
What is left? The only thing, short of code changes in the shim is to try and scope the different events to distribute load.
There are really four different 'commands' the RRSD supports. (Look at the mapping table inside the driver configuration to see them.)
nrfRequest for Role
nrfRequest for Resource
Role to Resource association.
Now, in principle it looks like you could run at least four drivers, each one processing only one type of command event, with a trivial scoping rule in the Subscriber Event Transformation policy set. The trick is to scope each command to the proper driver, before it gets to the shim, so as to share the load.
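As a thought experiment, the scoping idea looks like the sketch below. The command-type names are made up for illustration (in a real driver this would be a DirXML-Script rule vetoing the commands that are out of scope), but the logic is the same: every driver instance sees the full command stream and keeps only its slice.

```python
# Hypothetical command types, one per RRSD driver instance.
COMMAND_TYPES = [
    "role-request",          # nrfRequest for a Role
    "resource-request",      # nrfRequest for a Resource
    "resource-association",  # Role-to-Resource association
    "identity",              # user-side (nrf:identity) events
]

def should_process(driver_scope: str, command_type: str) -> bool:
    """The trivial scoping rule: pass the command through only if it
    matches this driver's scope; veto (drop) everything else."""
    return command_type == driver_scope

# Each driver instance filters the same stream down to its own queue.
events = ["role-request", "identity", "role-request", "resource-association"]
for scope in COMMAND_TYPES:
    kept = [e for e in events if should_process(scope, e)]
    print(scope, len(kept))
```

The payoff is in the next point: each driver's queue can still back up, but a flood of one command type no longer delays the other three.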
However, the real bottleneck is Dynamic Group role assignment processing, so where does a Dynamic Group Role assignment fall among those command types? That is an example I did not reproduce, and I will probably work through it in another article at a later date.
With Dynamic Groups, where you will get in trouble is that all the drivers would be trying to re-evaluate them every XX minutes. There is a driver configuration setting, dyn-group-interval, that defaults to 60 minutes. I wonder if 0 is supported, to disable the re-evaluation in some of the drivers. If 0 is not supported, that would probably limit you to one driver. But if it were, then it seems like you could have a single driver handling the Dynamic Group evaluation, and then four more drivers for the other event types with their Dynamic Group evaluation set to zero.
Each queue could still get backed up, but that would be isolated from other classes of events.
This whole discussion is hypothetical, but it seems to me like it would be workable. If anyone has run into an issue like this and considered trying this approach, let me know; I am very curious to see if it can be made to work.
Disclaimer: As with everything else at NetIQ Cool Solutions, this content is definitely not supported by NetIQ, so Customer Support will not be able to help you if it has any adverse effect on your environment. It just worked for at least one person, and perhaps it will be useful for you too. Be sure to test in a non-production environment.