In the previous blogs I covered options for moving toward a common topology for managing servers by breaking the management feeds into categories such as network related, help desk tickets, performance related, and so on. I showed an example of how you can automate the building of these views so they are data driven and self maintaining. The next step is to put rules in place that control how state (Critical, Major, Minor, etc.) is propagated up from the different categories, and how it impacts the individual servers.
Novell Operations Center controls state propagation through a technology we refer to as Algorithms. Algorithms give the administrator a way to describe different scenarios in an XML file and instruct the system on how to propagate state. One common example involves a cluster. Suppose a cluster has, say, ten active (hot) nodes with users load balanced across them. If one of those nodes goes down, the underlying management tools issue a critical alarm. While the alarm is critical, the load balancers will automatically move users from the down node to the other nodes, so the service is still up and running. With the standard out-of-the-box rules, any critical alarm and/or critical child element automatically floats its highest condition up the topology. For this specific use case, we would instead want to show the service as OK/Green for service availability, and put a rule (Algorithm) in place that shows the service as Minor/Yellow only when a certain percentage of those nodes, such as 40%, are offline.
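To make the cluster example concrete, such a rule could be sketched using the same gather/band commands shown later in this post. This is illustrative only: the algorithm name, the 40% threshold and the reason text are assumptions for this example, and the exact attributes should be verified against the Operations Center documentation for your version.

```xml
<algorithm name="Cluster Rule">
    <!-- Collect the cluster node children -->
    <exec command="gather" relationship="NAM" />
    <split>
        <branch>
            <!-- If 40% or more of the nodes are Critical, surface the service as Minor -->
            <exec command="band" testCondition="CRITICAL" amount="40%" result="MINOR" reason="Multiple cluster nodes offline" />
        </branch>
        <branch>
            <!-- Otherwise the load balancers keep the service available -->
            <set result="OK" reason="Service available" />
        </branch>
    </split>
</algorithm>
```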
For this blog I will continue with the use case of an individual, server-by-server rule that controls how the underlying management metrics (alarms, KPIs, etc.) are propagated. A cluster rule is another layer that would be applied across multiple servers, providing similar processing for a higher-level service.
For our example, we have Network, Performance, Backup, Help Desk, Change Management and Process. OK, so I have a few more child containers than in the previous blogs, but I'm sure we are all fine with that. Now that we have the categories, let's go over the high-level rules we want to set up.
Algorithms are stored as an XML file under the base install directory, at database/shadowed/Algorithms.xml. This file is monitored for changes and re-read periodically, and it is also stored in the backend database. A line is written to the trace file when changes are noticed, which is good feedback when you are testing edits to individual rules. There is a UI for editing Algorithms, accessible under Administration/Server/Algorithms (right-click and choose Edit Algorithms), but most administrators prefer to edit the file directly… up to you.
Below is the algorithm built from the instructions above. The algo starts with a tag that names the algorithm; this name is what becomes selectable within the console interface. This algo is called "Server Rule".
The next section does a gather, which is how the children of the server are collected. Since children can be direct/real children (NAM) or, in other use cases, linked children (ORG), we gather both. For our specific use case, gathering on just NAM or just ORG would not produce different results.
From there we go into a split/branch; think of this like a case statement. Only one of the branches will evaluate as true.
The first branch reduces the list of children (Network, Performance, Help Desk, etc.) down to *just* the Network child. A test is then performed on the condition of the Network category: if it is Critical (testCondition), we float a critical state up to the server (result) and put a note on the server (reason) identifying a network-related issue.
The next branch, for Performance, follows the same general lead-in, but a Critical Performance status is elevated to the server as a Major condition instead. Process, the next branch, elevates a Minor condition when Process is Critical.
The last branch is a catch-all, kind of like an "else" statement. No conditions, properties or anything else are tested in this section; we simply default the server to an OK/Green status.
<algorithm name="Server Rule">
    <exec command="gather" relationship="NAM" />
    <exec command="gather" relationship="ORG" />
    <split>
        <branch>
            <exec command="reduce" invert="yes" property="name" value="Network" />
            <exec command="band" testCondition="CRITICAL" amount="100%" result="CRITICAL" reason="Network related issue identified" />
        </branch>
        <branch>
            <exec command="reduce" invert="yes" property="name" value="Performance" />
            <exec command="band" testCondition="CRITICAL" amount="100%" result="MAJOR" reason="Performance impact identified" />
        </branch>
        <branch>
            <exec command="reduce" invert="yes" property="name" value="Process" />
            <exec command="band" testCondition="CRITICAL" amount="100%" result="MINOR" reason="Process issue identified" />
        </branch>
        <branch>
            <set result="OK" reason="Online" />
        </branch>
    </split>
</algorithm>
A couple of closing thoughts on algorithms. Algos should be generalized so you can write one algo and use it many times. One common practice is to base an algo on the class of the element, such as router, server or database. There are use-case-specific situations where you may want a different algorithm, but many times we try to address those within the tree/topology instead, as we did for this blog.
Algos can do more than I covered here; you can even jump right into JavaScript and do all kinds of crazy stuff. Just be careful: every time there is an alarm update for an underlying element, the parents recalculate their conditions. Putting JavaScript in an algorithm, while acceptable, can therefore have performance impacts.
The last piece of this series is to assign the Server Rule algorithm to our servers. Since we are driving toward an automated build and ongoing update of our server view, we will use Service Configuration Manager. Within the existing Service Configuration Definition that built the server view, there is an Algorithm section under Modeling Policies; the next step is to set up a new Algorithm Policy there. Since we have a predictive class for our servers (if I remember correctly, I used server_host), remove the .* expression in the Name Matcher for this algorithm, set up a class match for server_host, and then select the Server Rule algorithm in the drop-down. Whenever a new server is added to the view, the Server Rule algorithm will then be applied automatically by default.
This concludes the series on a common approach to monitoring and portraying servers within Novell Operations Center. There are many ways this can be accomplished; this blog was intended to present one approach that I have seen work well for several customers.
Disclaimer: As with everything else at NetIQ Cool Solutions, this content is definitely not supported by NetIQ, so Customer Support will not be able to help you if it has any adverse effect on your environment. It just worked for at least one person, and perhaps it will be useful for you too. Be sure to test in a non-production environment.