Why do I get Consolidated AM Health Check Events when there shouldn't be any? (NETIQKB73166)

  • 7773166
  • 30-Aug-2011
  • 01-Sep-2011

Environment

NetIQ AppManager 8.0
NetIQ Control Center 8.0
NetIQ AppManager AM Health Check

Situation

Why do I get Consolidated AM Health Check Events when there shouldn't be any?
Consolidated AM Health Check Events are being generated when they should not be.

Resolution

This is expected behavior.  Each Management Server (MS) reports the AM Health conditions only for the Agents for which it is currently responsible.

As an example:  You have an environment with 4 Agents and 2 MSs.  Each MS is the Primary MS for two Agents and the Secondary MS for the other two, so that the load is evenly balanced.  Assume in this example that all Agents are currently talking to their Primary MS.  Each Agent is therefore 25% of the total Agents in the environment, but 50% of the Agents assigned to a given MS.

Using this example, if you have configured this threshold to be 50% and any one of those four Agents is down, then an Event will be generated against that Agent's current MS indicating that 50% of its Agents are down.  Note that this is 50% of the Agents currently being handled by that specific MS, not 50% of all Agents in the environment.

The Events generated will appear similar to the following:

Short Message: Agent health check event: The number of agents which have not sent the heartbeat has exceeded the threshold.

Long Message: Agent health check event: 50% of agents have not communicated with the MS "NETIQMS01".  The following agents are not able to communicate with the management server: NETIQMC01  (where NETIQMS01 is the name of the Management Server, and NETIQMC01 is the name or names of the affected Agents)

Likewise, if you have configured this threshold to be 100%, and two Agents that share the same current MS are both down, then you would get an Event indicating that 100% of the Agents are down, even though the other two Agents are still up.
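
The arithmetic in the example above can be sketched in a few lines of Python.  This is only an illustration of the per-MS calculation described in this article, not AppManager code: the function name down_percent_per_ms and the Agent names AGENT01 through AGENT04 are hypothetical, and the sketch assumes an Event is raised when the percentage is at or above the threshold.

```python
from collections import defaultdict


def down_percent_per_ms(current_ms, down_agents):
    """Return {MS name: percent of that MS's current Agents that are down}.

    current_ms  -- dict mapping each Agent to the MS it currently reports to
    down_agents -- set of Agents that have stopped sending heartbeats
    """
    totals = defaultdict(int)
    down = defaultdict(int)
    for agent, ms in current_ms.items():
        totals[ms] += 1
        if agent in down_agents:
            down[ms] += 1
    return {ms: 100.0 * down[ms] / totals[ms] for ms in totals}


# Hypothetical environment: 4 Agents, 2 MSs, two Agents per MS.
current_ms = {
    "AGENT01": "NETIQMS01",
    "AGENT02": "NETIQMS01",
    "AGENT03": "NETIQMS02",
    "AGENT04": "NETIQMS02",
}
down_agents = {"AGENT01"}
threshold = 50  # percent, as configured in the example above

per_ms = down_percent_per_ms(current_ms, down_agents)
overall = 100.0 * len(down_agents) / len(current_ms)

for ms, pct in sorted(per_ms.items()):
    # Assumption: "a specific percentage (or more)" means >= threshold.
    result = "consolidated event" if pct >= threshold else "no consolidated event"
    print(f"{ms}: {pct:.0f}% of its Agents down -> {result}")
print(f"Environment-wide: {overall:.0f}% of all Agents down")
```

With one Agent down, the MS that owns it sees half of its Agents down and meets a 50% threshold, while the environment-wide figure is only a quarter.  That gap is why the Event can look wrong at first glance.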

Cause

In Control Center 8.0, Administrators can configure AM Health Check to generate a single consolidated Event if a specific percentage (or more) of Agents are down.  The default value for this percentage is 30%.  However, what may not be clear is how the percentage of "down" Agents is determined.  The percentage used for the consolidated Event is calculated per MS, based on which MS is the current MS for each Agent.  This can lead to Events that do not appear to be accurate when compared with the overall number of Agents in the environment.

When the conditions exist for a Consolidated AM Health Check Event, individual Events for each affected Agent are suspended.  Once the number of "down" Agents falls below the user-defined threshold, you will receive a State Change Event, closing out the original Consolidated Event, and individual Agent-based Events will resume for any remaining "down" Agents.
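
The switch between consolidated and individual Events described above can be pictured as a small state check that each MS applies to its own Agents.  The Python sketch below is a hypothetical model written for this article, not the product's implementation: the function name evaluate and the action strings are invented for illustration, and it assumes the consolidated Event opens at or above the threshold and closes once the percentage falls below it.

```python
def evaluate(pct_down, threshold, consolidated_open):
    """Return (consolidated_open, action) for one health check pass on one MS.

    pct_down          -- percent of this MS's current Agents that are down
    threshold         -- configured consolidated-event percentage
    consolidated_open -- True if a consolidated Event is already open
    """
    if pct_down >= threshold:
        if consolidated_open:
            return True, "consolidated event stays open"
        return True, "raise consolidated event, suspend individual events"
    if consolidated_open:
        return False, "state change closes consolidated event, individual events resume"
    return False, "individual events only"


# Hypothetical sequence for one MS with two Agents and a 100% threshold:
# one Agent down, both down, then one recovers.
is_open = False
for pct in (50.0, 100.0, 50.0):
    is_open, action = evaluate(pct, 100, is_open)
    print(f"{pct:.0f}% down: {action}")
```

In this model the third pass still has one Agent down, so after the consolidated Event closes that Agent would again be reported through an individual Event.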

Additional Information

Formerly known as NETIQKB73166

You can disable Consolidated AM Health Check Events, or alter their threshold in the Control Center Console, under Tools -> Options -> Health Check -> General -> 'Raise consolidated event if X percent of agents are down'.  From this location you may also adjust how frequently the MSs are to check on Agent Health, as well as the Severity Level of any resulting AM Health Check Events.