3.3 HeartbeatUNIX

Use this Knowledge Script to monitor the heartbeat of the AppManager agent running on a UNIX or Linux server. A heartbeat is a periodic signal generated by an AppManager agent computer to indicate that it is still running. If an AppManager agent fails to send either data or an event to the QDB within the specified grace period, this script considers the agent to be offline.
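
The offline check described above can be pictured with a minimal Python sketch. This is an illustration only, not the AppManager implementation: the function names, the 10-minute grace period, and the check history are hypothetical values chosen for the example. The threshold of 1 matches the documented default for the consecutive-failure parameter.

```python
from datetime import datetime, timedelta


def agent_is_offline(last_contact: datetime, now: datetime, grace: timedelta) -> bool:
    """True when the last data point or event is older than the grace period."""
    return now - last_contact > grace


def count_consecutive_failures(check_results: list[bool]) -> int:
    """Count failed checks at the end of the history (True = heartbeat seen)."""
    failures = 0
    for seen in reversed(check_results):
        if seen:
            break
        failures += 1
    return failures


# Hypothetical values: a 10-minute grace period and three checks in a row
# without a heartbeat.
now = datetime(2014, 8, 3, 20, 33, 6)
last_contact = datetime(2014, 8, 3, 20, 20, 0)
offline = agent_is_offline(last_contact, now, timedelta(minutes=10))
failures = count_consecutive_failures([True, False, False, False])

threshold = 1  # documented default for consecutive failures before an event
raise_event = offline and failures >= threshold
```

Whether the real comparison is strict or inclusive at the grace-period boundary is not stated in this guide; the sketch assumes a strict comparison.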

This Knowledge Script also monitors the health of jobs running on UNIX and Linux agents. Use this script to monitor Knowledge Script job run time against job schedule and against a user-specified maximum run time.

NOTE: Job monitoring is available only for UNIX agent 8.0 and above.

You can set this script to raise an event for the following conditions:

  • Agent Heartbeat

    • The heartbeat fails

    • An agent is healthy, such as when the heartbeat returns after failing

    • The agent heartbeat fails a user-specified number of times

  • Job Monitoring (UNIX agent 8.0 and above)

    • A job takes longer than its schedule to execute

    • A job exceeds the user-specified maximum job run time

  • Knowledge Script

    • An unexpected error occurs in this Knowledge Script

  • Heartbeat Investigation

    • Attempts to contact an offline agent computer with both ICMP ping and the traceroute diagnostic tool fail

If you use this Knowledge Script with the AppManager Operator Console, you can access the Actions and Advanced tabs, but the options on those two tabs will not function.

This script generates data for consolidated events, which can be managed with the Health Check options found in Control Center. To access Health Check, click Options on the Main tab, and then click Health Check. For more information, see the online Help for Health Check.

3.3.1 Resource Objects

UNIX and Linux servers

3.3.2 Default Schedule

The default interval for this script is every five minutes.

3.3.3 Setting Parameter Values

Set the following parameters as needed:

Parameter

How to Set It

Heartbeat Options

Raise event if the agent heartbeat fails?

Select Yes to raise an event if the heartbeat for the AppManager agent server stops. The default is Yes.

Event severity when the agent heartbeat fails

Set the event severity level, from 1 to 40, to indicate the importance of an event in which the heartbeat stops. The default is 5.

Raise an event when agent heartbeat restarts?

Select Yes to raise an event if the heartbeat starts again after stopping. The default is Yes.

Event severity when the agent heartbeat restarts

Set the event severity level, from 1 to 40, to indicate the importance of an event in which the heartbeat starts again after stopping. The default is 25.

Number of consecutive heartbeat failures before raising an event

Specify the number of times the heartbeat must fail before raising an event. The default is 1.

Generate heartbeat data?

Select Yes to enable the heartbeat check. If you select Yes and the data point from this job is missing, AppManager raises an event. If you select No, the heartbeat check will not look for the data point from this job. The default is Yes.

Job Monitoring Options

Monitor individual jobs?

Select Yes to monitor individual AppManager jobs. The default is Yes.

Raise an event if jobs take longer than their schedule to execute?

Select Yes to raise an event when a monitored AppManager job takes longer than its scheduled time to execute. The default is Yes.

Event severity

Set the event severity level, from 1 to 40, to indicate the importance of an event in which a monitored AppManager job took longer than its scheduled time to execute. The default is 5.

Raise an event if job exceeds maximum job run time?

Select Yes to raise an event when a monitored AppManager job exceeds the maximum job run time you set. The default is Yes.

List of Knowledge Scripts to skip (comma-separated)

Enter one or more comma-separated Knowledge Script names to exempt from job monitoring.

Maximum job run time

Set the number of seconds, from 1 to 32767, to indicate the maximum run time for a monitored Knowledge Script job before an event is raised. The default is 180 seconds.

Event severity

Set the event severity level, from 1 to 40, to indicate the importance of an event where a monitored Knowledge Script job exceeded the maximum job run time you set. The default is 5.

Knowledge Script Options

Event severity when unexpected error occurs

Set the event severity level, from 1 to 40, to indicate the importance of an event where an unexpected error occurs in the HeartbeatUNIX Knowledge Script. The default is 35.

Heartbeat Investigation Steps (Used by Management Server)

Attempt to contact agent computer by ICMP ping?

Select Yes to send an ICMP ping request to the agent computer. If AppManager cannot contact an agent with an ICMP ping, the agent computer might have been shut down or disconnected from the network, or a firewall is blocking the ICMP communication.

The default is Yes.

Perform tracert diagnostic if ICMP ping fails?

Select Yes to run a tracert (traceroute) diagnostic test if the ping request fails. The default is Yes.

A traceroute test helps you troubleshoot network routing problems that can block ICMP traffic. This script raises an event if the tracert fails.
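
The order of the two investigation steps can be sketched as follows. This is a hedged illustration of the decision flow only: the function name, the returned strings, and the injected `ping` and `traceroute` callables are hypothetical, and the sketch does not send any network traffic itself.

```python
from typing import Callable


def investigate_offline_agent(
    host: str,
    ping: Callable[[str], bool],
    traceroute: Callable[[str], bool],
    use_ping: bool = True,         # "Attempt to contact agent computer by ICMP ping?"
    use_traceroute: bool = True,   # "Perform tracert diagnostic if ICMP ping fails?"
) -> str:
    """Return a short description of the investigation outcome."""
    if not use_ping:
        return "not investigated"
    if ping(host):
        return "reachable by ping"
    if not use_traceroute:
        return "ping failed"
    # The traceroute step runs only after the ping has failed.
    if traceroute(host):
        return "ping failed, traceroute succeeded"
    return "ping and traceroute failed: raise event"


# Stub probes stand in for real network checks (assumption for the example).
always_fail = lambda host: False
always_ok = lambda host: True

result = investigate_offline_agent("agent01", always_fail, always_fail)
```

Passing the probes in as callables keeps the flow testable without network access; a real deployment would rely on the management server's own checks.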

3.3.4 Understanding the Event Detail Messages

Many AppManager jobs run periodically as a series of iterations. The time that should elapse between the start of one iteration and the start of the next is the job schedule. The maximum amount of time the job should run in any of its iterations is the maximum job run time.

A job has a state, typically Running or Stopped. A job whose state is Running can have iteration status Currently Running when a job iteration is in progress, or Completed when a job is inactive between iterations.

This script can raise events for the following job run time conditions:

  • When a job exceeds the Maximum job run time parameter you set in this Knowledge Script.

  • When a job iteration takes longer than its schedule to execute.

  • When a job iteration takes longer than its schedule time to execute and is still running.

The following are examples of the event detail messages this script creates when it detects these conditions in running jobs.

Job Exceeded Maximum Run Time

This Knowledge Script raises an event when one or more jobs exceed the Maximum job run time parameter you set. The following is an example of the detail message.

Details for jobs exceeding maximum run time (180 seconds) at Sun Aug  3 20:33:06 2014 are :

Job ID 1269 (UNIX_ExecUtil)
      Iteration number : 759 (Completed)
      Iteration start time : Sun Aug  3 20:29:06 2014
      Iteration execution time : 3 minutes 33 seconds
Job ID 1271 (UNIX_ExecUtil)
      Iteration number : 380 (Currently Running)
      Iteration start time : Sun Aug  3 20:29:11 2014
      Iteration execution time : 3 minutes 55 seconds
      Last iteration run time : 4 minutes 13 seconds
Job ID 1273 (UNIX_ExecUtil)
      Iteration number : 758 (Completed)
      Iteration start time : Sun Aug  3 20:32:15 2014
      Iteration execution time : 4 minutes 28 seconds

The first line details the current Maximum job run time value and the time the report was generated. Each job that exceeds the maximum run time has its own entry in the detail message. Each entry starts with the agent-assigned job identifier and the Knowledge Script name.

The job entry details the iteration number, that is, how many iterations the job has run, and the current job iteration status. A job is Completed if it completed its iteration and is awaiting the next iteration. A job is Currently Running if the iteration is still in progress.

NOTE: Iteration status is separate from the job state. The job state is Running to indicate the job is either actively running or waiting to run. The iteration status is either Completed or Currently Running to indicate the current job iteration is either finished or is still active.

The job entry also details the iteration execution time and, in the case of a job currently running, the execution time for the previous iteration.
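
To see by how much each job in the example exceeded the limit, the execution times can be converted to seconds and compared with the 180-second maximum. The following Python sketch is one way to do that arithmetic; the helper name is hypothetical, and support for an "hours" unit is an assumption, since the sample message shows only minutes and seconds.

```python
import re


def duration_to_seconds(text: str) -> int:
    """Convert text such as '3 minutes 33 seconds' to a number of seconds."""
    units = {"hour": 3600, "minute": 60, "second": 1}
    total = 0
    for amount, unit in re.findall(r"(\d+)\s+(hour|minute|second)s?", text):
        total += int(amount) * units[unit]
    return total


MAX_RUN_TIME = 180  # seconds, as reported in the first line of the sample message

# Iteration execution times copied from the sample detail message.
execution_times = {
    1269: "3 minutes 33 seconds",
    1271: "3 minutes 55 seconds",
    1273: "4 minutes 28 seconds",
}

# Seconds by which each job exceeded the maximum run time.
overruns = {
    job_id: duration_to_seconds(text) - MAX_RUN_TIME
    for job_id, text in execution_times.items()
}
```

For the sample values, the three jobs exceed the limit by 33, 55, and 88 seconds respectively.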

Completed Job Exceeded Scheduled Run Time

This Knowledge Script raises an event when a job exceeds its scheduled run time. For example, if a job is scheduled to run every 30 seconds and iteration n runs for 40 seconds, iteration n + 1 cannot start at its scheduled time; it must start at least 10 seconds later. The following is an example of the detail message.

Details for jobs exceeding scheduled run time at Mon Aug  4 00:33:06 2014 are :

 Job ID 1273 (UNIX_ExecUtil)
      Iteration number : 807 (Completed)
      Iteration start time : Mon Aug  4 00:32:02 2014
      Execution time : 6 minutes 56 seconds

      Next iteration schedule time : Mon Aug  4 00:37:02 2014
      Next Iteration schedule delayed by : 1 minutes 56 seconds
Note: Please increase the job schedule by at least 1 minutes 56 seconds to correct this problem

The first line details the time the report was generated. Each job that exceeded its scheduled run time has an entry in the detail message. Each entry starts with the agent-assigned job identifier and the Knowledge Script name.

The entry details the execution time and the schedule time for the next iteration. A note indicates how much you should add to the current schedule to bring the iterations back on schedule and prevent future schedule overruns.
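
The delay reported for a completed iteration follows from simple date arithmetic: the iteration finishes at its start time plus its execution time, and the delay is how far that finish falls past the next scheduled start. The sketch below reproduces the figure from the sample message; the function name is hypothetical and the calculation is an inference from the sample values, not a documented formula.

```python
from datetime import datetime, timedelta


def schedule_delay(
    start: datetime, execution: timedelta, next_scheduled: datetime
) -> timedelta:
    """Delay of the next iteration caused by a completed iteration."""
    finished = start + execution
    # An iteration that finishes before the next scheduled start causes no delay.
    return max(finished - next_scheduled, timedelta(0))


# Values copied from the sample detail message for Job ID 1273.
start = datetime(2014, 8, 4, 0, 32, 2)
execution = timedelta(minutes=6, seconds=56)
next_scheduled = datetime(2014, 8, 4, 0, 37, 2)

delay = schedule_delay(start, execution, next_scheduled)  # 1 minute 56 seconds

# The 30-second schedule with a 40-second iteration from the text above.
t0 = datetime(2014, 8, 4, 0, 0, 0)
example_delay = schedule_delay(t0, timedelta(seconds=40), t0 + timedelta(seconds=30))
```

The resulting delay matches the amount the note in the message recommends adding to the job schedule.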

Running Job Exceeded Scheduled Run Time

This Knowledge Script raises an event when a job exceeds its schedule and is still running. The following is an example of the event detail message.

Details for jobs exceeding scheduled run time at Thu Aug 21 05:55:49 2014 are :

 Job ID 679 (UNIX_ExecUtil)
      Iteration number : 4 (Currently Running)
      Iteration start time : Thu Aug 21 05:55:34 2014
      Next iteration Schedule time : Thu Aug 21 05:55:44 2014
      Next iteration schedule delayed by : 5 seconds

The first line details the time the report was generated. Each job that exceeded its scheduled run time has an entry in the detail message. Each entry starts with the agent-assigned job identifier and the Knowledge Script name.

The Next iteration schedule delayed by value is based on the time the event was raised. It represents the amount of time the next iteration would be delayed if the current iteration completed now. For example, Job ID 679 is still in progress and there is no way to determine exactly when it will end. But based on the overrun at the time of this event detail message, the next iteration will be delayed by at least 5 seconds.
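
For a job that is still running, the same idea applies with the report time standing in for the unknown finish time. The following sketch, with a hypothetical function name, reproduces the 5-second value from the sample message; it is an inference from the sample values rather than a documented formula.

```python
from datetime import datetime, timedelta


def minimum_delay_so_far(report_time: datetime, next_scheduled: datetime) -> timedelta:
    """Lower bound on the next iteration's delay while the job is still running."""
    return max(report_time - next_scheduled, timedelta(0))


# Values copied from the sample detail message for Job ID 679.
report_time = datetime(2014, 8, 21, 5, 55, 49)
next_scheduled = datetime(2014, 8, 21, 5, 55, 44)

delay_so_far = minimum_delay_so_far(report_time, next_scheduled)  # 5 seconds
```

Because the iteration has not finished, this value only grows until the job completes, which is why the message reports it as a minimum.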