3.4 HeartbeatWin

Use this Knowledge Script to test the heartbeat of the AppManager Windows agent computer. A heartbeat is a periodic signal generated by an AppManager agent computer to indicate that it is still running. If an AppManager agent fails to send either data or an event to the QDB within the specified grace period, the AMHealth_HeartbeatWin job considers the agent to be offline.
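The offline rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not the script's implementation: the function name `is_agent_offline`, its arguments, and the 10-minute grace period in the usage lines are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def is_agent_offline(last_contact: datetime, now: datetime,
                     grace_period: timedelta) -> bool:
    """Return True when the agent has sent neither data nor an event
    within the grace period (hypothetical helper, not product code)."""
    return now - last_contact > grace_period


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    grace = timedelta(minutes=10)  # assumed value, for illustration only
    print(is_agent_offline(now - timedelta(minutes=12), now, grace))  # True
    print(is_agent_offline(now - timedelta(minutes=3), now, grace))   # False
```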

This script raises events if the heartbeat for the agent computer stops or restarts, and it generates a data point about the heartbeat events. You can also use this script to track whether jobs finish in the expected time frame or exceed the maximum run time.

You can set this script to raise an event for the following conditions:

  • Heartbeat fails

  • An agent is healthy, such as when the heartbeat returns after failing

  • The agent heartbeat fails a user-specified number of times

  • Jobs take longer than expected to execute

  • Jobs exceed maximum run time

  • No jobs are found

If an agent computer is offline, you can specify that the management server take additional steps to diagnose the level of non-connectivity that exists between the agent and the QDB.

If you use this Knowledge Script with the AppManager Operator Console, you can access the Actions and Advanced tabs, but the options on those two tabs will not function.

This script generates data for consolidated events, which can be managed with the Health Check options found in Control Center. To access Health Check, click Options on the Main tab, and then click Health Check. For more information, see the online Help for Health Check.

3.4.1 Resource Objects

Windows servers

3.4.2 Default Schedule

The default interval for this script is every five minutes.

3.4.3 Setting Parameter Values

Set the following parameters as needed:

Parameter

How to Set It

General Settings

Job Failure Notification

Raise event if job fails unexpectedly?

Select Yes to raise an event if the AMHealth_HeartbeatWin job fails unexpectedly. The default is Yes.

Event severity when job fails unexpectedly

Set the event severity level, from 1 to 40, to indicate the importance of an event in which this Knowledge Script job fails or any other unexpected event occurs. The default is 5.

Additional Settings

Event Details

Event detail format

Select the format in which you want to display the event detail. You can select from HTML Table or Plain Text. The default is HTML Table.

Heartbeat Options

Raise an event if the agent heartbeat fails?

Select Yes to raise an event for the selected agent if the agent is detected as having failed the heartbeat. The default is Yes.

If several agents fail the heartbeat at the same time, the script raises a single event for those agents, and the event message lists all the offline agents. By default, you receive a single, consolidated event instead of multiple, individual events when 30% of the agents go offline. You can change this setting in Control Center by clicking Options on the Main tab, and then clicking Health Check.

If you set the Monitor individual jobs? parameter in this script to Yes, you can control the severity of all heartbeat failure events using the Event severity when the agent heartbeat fails parameter instead of using the same severity for all consolidated events.
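As a rough illustration of the consolidation rule described above, the following Python sketch decides between one consolidated event and individual events. The function name, the message text, and the choice to consolidate when the offline share is greater than or equal to the threshold are assumptions; the guide only states that consolidation happens by default when 30% of the agents go offline.

```python
def plan_heartbeat_events(offline_agents, total_agents, threshold_pct=30.0):
    """Return the event messages to raise for the offline agents.

    Hypothetical helper: one consolidated message when the offline share
    reaches threshold_pct (assumed to be inclusive), otherwise one
    message per offline agent.
    """
    if not offline_agents:
        return []
    if total_agents <= 0:
        raise ValueError("total_agents must be positive")
    share = 100.0 * len(offline_agents) / total_agents
    names = sorted(offline_agents)
    if share >= threshold_pct:
        return ["Heartbeat failed for agents: " + ", ".join(names)]
    return ["Heartbeat failed for agent: " + name for name in names]


if __name__ == "__main__":
    # 3 of 10 agents offline (30%): one consolidated message
    print(plan_heartbeat_events(["srv03", "srv01", "srv02"], 10))
    # 1 of 10 agents offline (10%): one individual message
    print(plan_heartbeat_events(["srv01"], 10))
```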

Event severity when the agent heartbeat fails

Set the event severity level, from 1 to 40, to indicate the importance of an event in which an agent failed the heartbeat. The default is 5.

Raise an event when agent heartbeat restarts?

Select Yes to raise an event if the heartbeat starts again after stopping. The default is Yes.

Event severity when the agent heartbeat restarts

Set the event severity level, from 1 to 40, to indicate the importance of an event in which the heartbeat starts again after stopping. The default is 25.

Number of consecutive heartbeat failures before raising an event

Specify the number of times the heartbeat must fail before raising an event. The default is 2.
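The interaction between the consecutive-failure count and the failure and restart events can be sketched as a small state tracker. This is a minimal sketch under stated assumptions: the class name `HeartbeatTracker` and the return labels are hypothetical, and the choice to raise the failure event only once per outage (rather than on every later failed check) is an assumption that the guide does not confirm.

```python
class HeartbeatTracker:
    """Track heartbeat checks for one agent (hypothetical, illustrative)."""

    def __init__(self, failures_before_event=2):
        self.failures_before_event = failures_before_event
        self.consecutive_failures = 0
        self.failure_event_raised = False

    def record(self, heartbeat_ok):
        """Record one check; return "failed", "restarted", or None."""
        if heartbeat_ok:
            # A restart event only makes sense after a reported failure.
            restarted = self.failure_event_raised
            self.consecutive_failures = 0
            self.failure_event_raised = False
            return "restarted" if restarted else None
        self.consecutive_failures += 1
        if (self.consecutive_failures >= self.failures_before_event
                and not self.failure_event_raised):
            self.failure_event_raised = True
            return "failed"
        return None


if __name__ == "__main__":
    tracker = HeartbeatTracker(failures_before_event=2)
    for ok in [False, False, False, True, True]:
        print(ok, tracker.record(ok))
```

With the default of 2, the first missed heartbeat raises nothing, the second raises the failure event, and the first successful check afterward raises the restart event.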

Generate heartbeat data?

Select Yes to enable the heartbeat check. If you select Yes and the data point from this job is missing, AppManager raises an event. If you select No, the heartbeat check will not look for the data point from this job. The default is Yes.

Job Monitoring Options

Monitor individual jobs?

Select Yes to monitor all jobs running on the agent that came from the same QDB as the heartbeat job. If you select No, the heartbeat job simply sends the heartbeat event and data according to the heartbeat-related parameters you set for this Knowledge Script. The default is Yes.

Raise an event if jobs take longer than average to execute?

Select Yes to raise an event that lists all jobs that are taking longer to execute than their average execution time. The agent stores a list of the average times jobs take to execute. The default is Yes.

Ignore jobs running for less than this amount of time

If you want to ignore jobs that have been running for less than a certain length of time, specify that length of time. The default is 30 seconds.

Grace period

Specify a number to represent the grace period for job execution. The grace period is a multiple of the average time a job takes to execute. The agent stores a list of the average times jobs take to execute.

For example, if you specified a grace period of 5, this script would take that value and multiply it by the average time a job takes to execute. If a job took one second to execute on average, the grace period would be 5 seconds. If the job takes longer than 5 seconds, the script raises an event.

The default grace period is 5.
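The grace period arithmetic can be sketched as follows. The function and parameter names are hypothetical, and the way the check is combined with the Ignore jobs running for less than this amount of time parameter (short-running jobs are skipped before the comparison) is an assumption, not documented behavior.

```python
def exceeds_grace_period(elapsed_s, average_s, grace_multiplier=5,
                         ignore_below_s=30):
    """Return True when a job has run longer than its allowed time.

    Allowed time = grace_multiplier * average execution time.
    Jobs that have run for less than ignore_below_s seconds are skipped
    (assumed interaction between the two parameters).
    """
    if elapsed_s < ignore_below_s:
        return False
    return elapsed_s > grace_multiplier * average_s


if __name__ == "__main__":
    # Example from the text: average 1 s, grace period 5, so the limit is 5 s.
    # The ignore threshold is set to 0 here so the short job is evaluated.
    print(exceeds_grace_period(6, 1, grace_multiplier=5, ignore_below_s=0))  # True
    # Average 60 s gives a 300 s limit; a 400 s run exceeds it.
    print(exceeds_grace_period(400, 60))  # True
```

Note that with the default 30-second ignore threshold, the one-second job in the example above would not be evaluated at all under this sketch's assumption.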

Event severity when jobs take longer than average to execute

Specify the event severity, from 1 to 40, to indicate the importance of an event in which the execution time for Knowledge Script jobs is longer than the average execution time for that job. The default is 5.

Raise an event if jobs take longer than their schedule to execute?

Select Yes to raise an event that lists all jobs that are taking longer to execute than their scheduled time to execute. The scheduled time is how often the job is set to run, such as every five minutes. The default is Yes.

Event severity when jobs take longer than their schedule to execute

Specify the event severity, from 1 to 40, to indicate the importance of an event in which the execution time for a Knowledge Script job is longer than the script’s schedule. The default is 5.

Raise an event if job exceeds maximum job run time?

Select Yes to raise an event if the run time for a Knowledge Script job exceeds the Maximum job run time threshold. The default is Yes.

List of Knowledge Scripts to skip "Maximum job run time" check

Provide a comma-separated list of the Knowledge Scripts that you do not want to compare to the Maximum job run time threshold.

Maximum job run time

Specify the maximum number of seconds a Knowledge Script job can run before an event is raised. The default is 180 seconds.

Event severity when job exceeds maximum job run time

Set the event severity, from 1 to 40, to indicate the importance of an event in which the run time for a Knowledge Script job exceeds the Maximum job run time threshold. The default is 5.
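The schedule check and the maximum run time check described above can be sketched together. This is an illustrative sketch: the function names, the finding labels, and the Knowledge Script names in the usage lines are hypothetical, and exact, case-sensitive matching of names in the skip list is an assumption.

```python
def parse_skip_list(text):
    """Split a comma-separated list of script names into a set."""
    return {name.strip() for name in text.split(",") if name.strip()}


def check_job_run_time(script_name, elapsed_s, schedule_interval_s,
                       max_run_time_s=180, skip_list=""):
    """Return the run-time conditions a job currently violates."""
    findings = []
    if elapsed_s > schedule_interval_s:
        findings.append("exceeds schedule")
    # Scripts named in the skip list are exempt from the maximum check only.
    skipped = script_name in parse_skip_list(skip_list)
    if not skipped and elapsed_s > max_run_time_s:
        findings.append("exceeds maximum run time")
    return findings


if __name__ == "__main__":
    # Hypothetical script names; five-minute schedule (300 s).
    print(check_job_run_time("Example_ScriptA", 400, 300))
    print(check_job_run_time("Example_ScriptA", 400, 300,
                             skip_list="Example_ScriptA, Example_ScriptB"))
```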

Raise an event if no jobs found?

Select Yes to raise an event if no Knowledge Script jobs are running. The default is No.

Event severity when no jobs found

Set the event severity level, from 1 to 40, to indicate the importance of an event in which no Knowledge Script jobs are running. The default is 35.

Timeout when processing jobs

Specify how long AppManager should wait for the agent to process jobs before assuming the agent either will not respond or has timed out. Use this parameter to monitor agents that are consistently taking longer than expected to respond.

If the agent does not respond before your specified timeout value, AppManager raises an event stating that it was unable to process this command and suggesting you increase the timeout value. The event might also include data about a Windows error, if one was generated.

The default timeout is 10 seconds.

Heartbeat Investigation Steps (Used by Management Server)

AppManager performs the following steps only if the heartbeat event or the heartbeat data is missing.

Attempt to contact agent computer by ICMP ping?

Select Yes to send an ICMP ping request to the agent computer. If AppManager cannot contact an agent with an ICMP ping, the agent computer might have been shut down or disconnected from the network, or a firewall is blocking the ICMP communication.

Perform tracert diagnostic if ICMP ping fails?

Select Yes to run a tracert (traceroute) diagnostic test if the ping request fails. The default is Yes. A traceroute test helps you troubleshoot network routing problems that can block ICMP traffic.

This script raises an event if the tracert fails.

Connect to agent NetIQmc port if ICMP ping succeeds?

Select Yes to attempt a connection to the NetIQmc port on the agent computer. The default is Yes. The connection is attempted only if the ping attempt succeeds.

This script raises an event if the port connection fails.

Use RPC to probe agent if port check succeeds?

Select Yes to send a Remote Procedure Call (RPC) to the agent computer. The default is Yes. The RPC is sent only if the port connection succeeds.

This script raises an event if the RPC probe fails.

Test agent computer registry if RPC probe succeeds?

Select Yes to allow the management server to attempt to use the Remote Registry Service to connect to the Windows Registry on the agent computer.

The connection is attempted only if the RPC probe succeeds. The management server must have sufficient privileges to connect to the Registry. The default is No.

This script raises an event if the management server cannot connect to the Registry.

Check status of agent services if registry test succeeds?

Select Yes to allow the management server to verify whether the NetIQ agent services, NetIQccm and NetIQmc, are running. This test is attempted only if the registry test succeeds.

The management server must have sufficient privileges to access the agent services. The default is No.

This script raises an event that reports whether the agent services are up or down.
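The order of the investigation steps above can be sketched as a chain in which each step runs only if the previous one succeeded. This is a sketch under stated assumptions, not the management server's implementation: the step names, the `investigate` function, and the probe callables are hypothetical, no real network calls are made, and treating a disabled step as the end of the chain is an assumption. The guide does not state a default for the ICMP ping step, so the Yes used here is also an assumption.

```python
# Steps that each depend on the previous step succeeding.
CHAIN = ["ping", "port", "rpc", "registry", "services"]

# Defaults from the parameter descriptions; "ping" is an assumed default.
DEFAULT_ENABLED = {
    "ping": True, "tracert": True, "port": True,
    "rpc": True, "registry": False, "services": False,
}


def investigate(probes, enabled=None):
    """Run the enabled probes in order; return a list of (step, passed).

    probes maps each step name to a zero-argument callable returning bool.
    """
    enabled = DEFAULT_ENABLED if enabled is None else enabled
    log = []
    for step in CHAIN:
        if not enabled.get(step, False):
            break  # assumption: a disabled step ends the chain
        passed = probes[step]()
        log.append((step, passed))
        if not passed:
            # tracert runs only as a follow-up to a failed ping
            if step == "ping" and enabled.get("tracert", False):
                log.append(("tracert", probes["tracert"]()))
            break
    return log


if __name__ == "__main__":
    # Stub probes standing in for real checks.
    probes = {name: (lambda: True) for name in CHAIN + ["tracert"]}
    probes["ping"] = lambda: False
    print(investigate(probes))  # ping fails, so only tracert follows
```

Passing the probes in as callables keeps the ordering logic separate from the checks themselves, so each step can be replaced by a real probe or a stub.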