Use this Knowledge Script to monitor the heartbeat of the AppManager agent running on a UNIX or Linux server. A heartbeat is a periodic signal generated by an AppManager agent computer to indicate that it is still running. If an AppManager agent fails to send either data or an event to the QDB within the specified grace period, this script considers the agent to be offline.
This Knowledge Script also monitors the health of jobs running on UNIX and Linux agents. It compares each Knowledge Script job's run time against the job's schedule and against a user-specified maximum run time.
NOTE: Job monitoring is only available for UNIX agent 8.0 and above.
You can set this script to raise an event for the following conditions:
Agent Heartbeat
- The heartbeat fails
- An agent is healthy, such as when the heartbeat returns after failing
- The agent heartbeat fails a user-specified number of times

Job Monitoring (UNIX agent 8.0 and above)
- A job takes longer than its schedule to execute
- A job exceeds the user-specified maximum job run time

Knowledge Script
- An unexpected error occurs in this Knowledge Script

Heartbeat Investigation
- Attempts to contact an offline agent computer using ICMP ping and the traceroute diagnostic tool both fail

If you use this Knowledge Script with the AppManager Operator Console, you can access the Actions and Advanced tabs, but the options on those two tabs will not function.
This script generates data for consolidated events, which can be managed with the Health Check options found in Control Center. To access Health Check, click Options on the Main tab, and then click Health Check. For more information, see the online Help for Health Check.
UNIX and Linux servers
The default interval for this script is every five minutes.
Set the following parameters as needed:
| Parameter | How to Set It |
|---|---|
| Heartbeat Options | |
| Raise event if the agent heartbeat fails? | Select Yes to raise an event if the heartbeat for the AppManager agent server stops. The default is Yes. |
| Event severity when the agent heartbeat fails | Set the event severity level, from 1 to 40, to indicate the importance of an event in which the heartbeat stops. The default is 5. |
| Raise an event when agent heartbeat restarts? | Select Yes to raise an event if the heartbeat starts again after stopping. The default is Yes. |
| Event severity when the agent heartbeat restarts | Set the event severity level, from 1 to 40, to indicate the importance of an event in which the heartbeat starts again after stopping. The default is 25. |
| Number of consecutive heartbeat failures before raising an event | Specify the number of times the heartbeat must fail before raising an event. The default is 1. |
| Generate heartbeat data? | Select Yes to enable the heartbeat check. If you select Yes and the data point from this job is missing, AppManager raises an event. If you select No, the heartbeat check will not look for the data point from this job. The default is Yes. |
| Job Monitoring Options | |
| Monitor individual jobs? | Select Yes to monitor individual AppManager jobs. The default is Yes. |
| Raise an event if jobs take longer than their schedule to execute? | Select Yes to raise an event when a monitored AppManager job takes longer than its scheduled time to execute. The default is Yes. |
| Event severity | Set the event severity level, from 1 to 40, to indicate the importance of an event in which a monitored AppManager job took longer than its scheduled time to execute. The default is 5. |
| Raise an event if job exceeds maximum job run time? | Select Yes to raise an event when a monitored AppManager job exceeds the maximum job run time you set. The default is Yes. |
| List of Knowledge Scripts to skip (comma-separated) | Enter one or more comma-separated Knowledge Script names to exempt from job monitoring. |
| Maximum job run time | Set the number of seconds, from 1 to 32767, to indicate the maximum run time for a monitored Knowledge Script job before an event is raised. The default is 180 seconds. |
| Event severity | Set the event severity level, from 1 to 40, to indicate the importance of an event where a monitored Knowledge Script job exceeded the maximum job run time you set. The default is 5. |
| Knowledge Script Options | |
| Event severity when unexpected error occurs | Set the event severity level, from 1 to 40, to indicate the importance of an event where an unexpected error occurs in the HeartbeatUNIX Knowledge Script. The default is 35. |
| Heartbeat Investigation Steps (Used by Management Server) | |
| Attempt to contact agent computer by ICMP ping? | Select Yes to send an ICMP ping request to the agent computer. If AppManager cannot contact an agent with an ICMP ping, the agent computer might have been shut down or disconnected from the network, or a firewall is blocking the ICMP communication. The default is Yes. |
| Perform tracert diagnostic if ICMP ping fails? | Select Yes to run a tracert (traceroute) diagnostic test if the ping request fails. The default is Yes. A traceroute test helps you troubleshoot network routing problems that can block ICMP traffic. This script raises an event if the tracert fails. |
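
The way the heartbeat parameters interact can be pictured with a small model. The following Python sketch is illustrative only and is not AppManager code: the class `HeartbeatTracker`, its `observe` method, and the message strings are hypothetical names. It also assumes that a failure event is raised once per outage rather than on every failed check, which the parameter descriptions do not state.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class HeartbeatTracker:
    """Hypothetical model of the heartbeat event parameters (not AppManager code)."""
    failures_before_event: int = 1   # "Number of consecutive heartbeat failures before raising an event"
    failure_severity: int = 5        # "Event severity when the agent heartbeat fails"
    restart_severity: int = 25       # "Event severity when the agent heartbeat restarts"
    consecutive_failures: int = 0
    offline_reported: bool = False

    def observe(self, heartbeat_ok: bool) -> Optional[Tuple[int, str]]:
        """Record one check; return (severity, message) if an event is due, else None."""
        if heartbeat_ok:
            self.consecutive_failures = 0
            if self.offline_reported:
                # The heartbeat returned after a reported failure.
                self.offline_reported = False
                return (self.restart_severity, "heartbeat restarted")
            return None
        self.consecutive_failures += 1
        # Assumption: report an outage once, when the threshold is first reached.
        if (self.consecutive_failures >= self.failures_before_event
                and not self.offline_reported):
            self.offline_reported = True
            return (self.failure_severity, "heartbeat failed")
        return None


tracker = HeartbeatTracker(failures_before_event=2)
events = [tracker.observe(ok) for ok in (True, False, False, False, True)]
```

With a threshold of 2, the first failed check raises nothing, the second raises a failure event at the default severity of 5, and the recovery raises a restart event at the default severity of 25.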
Many AppManager jobs run periodically as a series of iterations. The time that should elapse between subsequent iterations is the job schedule. The maximum amount of time the job should run in any of its iterations is the maximum job run time.
A job has a state, typically Running or Stopped. A job whose state is Running can have iteration status Currently Running when a job iteration is in progress, or Completed when a job is inactive between iterations.
This script can raise events for the following job run time conditions:
When a job exceeds the Maximum job run time parameter you set in this Knowledge Script.
When a job iteration takes longer than its schedule to execute.
When a job iteration takes longer than its schedule time to execute and is still running.
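
The three conditions above can be expressed as simple comparisons. The sketch below is a hypothetical illustration, not the agent's actual logic: the `Iteration` class, its field names, and the condition strings are assumptions made for this example. The default of 180 seconds mirrors the Maximum job run time default.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Iteration:
    """Hypothetical snapshot of one job iteration (field names are illustrative)."""
    schedule_seconds: int   # time that should elapse between iteration starts
    elapsed_seconds: int    # run time so far, or total run time if finished
    still_running: bool     # True for Currently Running, False for Completed


def run_time_conditions(it: Iteration, max_run_seconds: int = 180) -> List[str]:
    """Return the run time conditions that this iteration meets."""
    conditions = []
    if it.elapsed_seconds > max_run_seconds:
        conditions.append("exceeded maximum job run time")
    if it.elapsed_seconds > it.schedule_seconds:
        if it.still_running:
            conditions.append("exceeded schedule and still running")
        else:
            conditions.append("exceeded schedule")
    return conditions


# A 30-second schedule with a 40-second iteration overruns the schedule only.
short_overrun = run_time_conditions(Iteration(30, 40, False))
# A 5-minute schedule with a 416-second iteration meets two conditions.
long_overrun = run_time_conditions(Iteration(300, 416, False))
# A 10-second schedule with an iteration still active after 15 seconds.
active_overrun = run_time_conditions(Iteration(10, 15, True))
```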
The following are examples of the event detail messages this script creates when it detects these conditions in running jobs.
This Knowledge Script raises an event when one or more jobs exceed the Maximum job run time parameter you set. The following is an example of the detail message.

    Details for jobs exceeding maximum run time (180 seconds) at Sun Aug 3 20:33:06 2014 are :
    Job ID 1269 (UNIX_ExecUtil)
      Iteration number : 759 (Completed)
      Iteration start time : Sun Aug 3 20:29:06 2014
      Iteration execution time : 3 minutes 33 seconds
    Job ID 1271 (UNIX_ExecUtil)
      Iteration number : 380 (Currently Running)
      Iteration start time : Sun Aug 3 20:29:11 2014
      Iteration execution time : 3 minutes 55 seconds
      Last iteration run time : 4 minutes 13 seconds
    Job ID 1273 (UNIX_ExecUtil)
      Iteration number : 758 (Completed)
      Iteration start time : Sun Aug 3 20:32:15 2014
      Iteration execution time : 4 minutes 28 seconds

The first line details the current Maximum job run time value and the time the report was generated. Each job that exceeds the maximum run time has its own entry in the detail message. Each entry starts with the agent-assigned job identifier and the Knowledge Script name.
The job entry details the iteration number, that is, how many iterations the job has run, and the current job iteration status. A job is Completed if it completed its iteration and is awaiting the next iteration. A job is Currently Running if the iteration is still in progress.
NOTE: Iteration status is separate from the job state. The job state is Running to indicate the job is either actively running or waiting to run. The iteration status is either Completed or Currently Running to indicate the current job iteration is either finished or is still active.
The job entry also details the iteration execution time and, in the case of a job currently running, the execution time for the previous iteration.
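
For Job ID 1271 in the example, the reported execution time matches the difference between the report time and the iteration start time. The Python sketch below reproduces that arithmetic. It assumes, based only on those figures, that the execution time of a Currently Running iteration is measured up to the report time; the function names `elapsed_seconds` and `format_duration` are hypothetical.

```python
from datetime import datetime

MAX_RUN_SECONDS = 180  # Maximum job run time shown in the example message


def elapsed_seconds(start: datetime, report_time: datetime) -> int:
    """Seconds between the iteration start and the time of the report."""
    return int((report_time - start).total_seconds())


def format_duration(seconds: int) -> str:
    """Render seconds in the 'M minutes S seconds' style of the detail message."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} minutes {secs} seconds"


# Values taken from the Job ID 1271 entry in the example message.
report_time = datetime(2014, 8, 3, 20, 33, 6)
iteration_start = datetime(2014, 8, 3, 20, 29, 11)

elapsed = elapsed_seconds(iteration_start, report_time)
exceeds_maximum = elapsed > MAX_RUN_SECONDS
summary = format_duration(elapsed)
```

Here `elapsed` is 235 seconds, which is above the 180-second maximum, and `summary` reads "3 minutes 55 seconds", as in the message.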
This Knowledge Script raises an event when a job exceeds its scheduled run time. For example, if a job is scheduled to run every 30 seconds and iteration n runs for 40 seconds, iteration n + 1 cannot start at its scheduled time; it must start at least 10 seconds later. The following is an example of the detail message.

    Details for jobs exceeding scheduled run time at Mon Aug 4 00:33:06 2014 are :
    Job ID 1273 (UNIX_ExecUtil)
      Iteration number : 807 (Completed)
      Iteration start time : Mon Aug 4 00:32:02 2014
      Execution time : 6 minutes 56 seconds
      Next iteration schedule time : Mon Aug 4 00:37:02 2014
      Next Iteration schedule delayed by : 1 minutes 56 seconds
    Note: Please increase the job schedule by at least 1 minutes 56 seconds to correct this problem

The first line details the time the report was generated. Each job that exceeded its scheduled run time has an entry in the detail message. Each entry starts with the agent-assigned job identifier and the Knowledge Script name.
The entry details the execution time and the schedule time for the next iteration. A note indicates how much you should add to the current schedule to bring the iterations back on schedule and prevent future schedule overruns.
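
The delay figure in the example can be reproduced from the other fields of the entry. The following Python sketch is a worked calculation under one assumption: that the delay is the amount by which the iteration's finish time (start time plus execution time) passes the next scheduled start. The function name `schedule_delay` is hypothetical.

```python
from datetime import datetime, timedelta


def schedule_delay(start: datetime, next_scheduled: datetime,
                   execution: timedelta) -> timedelta:
    """How far the next iteration slips past its scheduled start (never negative)."""
    finished = start + execution
    return max(finished - next_scheduled, timedelta(0))


# Values taken from the Job ID 1273 entry in the example message.
delay = schedule_delay(
    start=datetime(2014, 8, 4, 0, 32, 2),
    next_scheduled=datetime(2014, 8, 4, 0, 37, 2),
    execution=timedelta(minutes=6, seconds=56),
)

# The 30-second schedule with a 40-second iteration from the text.
t0 = datetime(2014, 8, 4, 0, 0, 0)
small_delay = schedule_delay(t0, t0 + timedelta(seconds=30), timedelta(seconds=40))
```

The first call yields 1 minute 56 seconds, matching the message, and the second yields 10 seconds.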
This Knowledge Script raises an event when a job exceeds its schedule and is still running. The following is an example of the event detail message.

    Details for jobs exceeding scheduled run time at Thu Aug 21 05:55:49 2014 are :
    Job ID 679 (UNIX_ExecUtil)
      Iteration number : 4 (Currently Running)
      Iteration start time : Thu Aug 21 05:55:34 2014
      Next iteration Schedule time : Thu Aug 21 05:55:44 2014
      Next iteration schedule delayed by : 5 seconds

The first line details the time the report was generated. Each job that exceeded its scheduled run time has an entry in the detail message. Each entry starts with the agent-assigned job identifier and the Knowledge Script name.
The Next iteration schedule delayed by value is based on the time the event was raised. It represents the amount of time the next iteration would be delayed if the current iteration completed now. For example, Job ID 679 is still in progress and there is no way to determine exactly when it will end. However, based on the overrun at the time of this event detail message, the next iteration will be delayed by at least 5 seconds.
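
For an iteration that is still running, the minimum delay is the gap between the event time and the next scheduled start. The Python sketch below checks this against the Job ID 679 figures; the function name `minimum_delay` is hypothetical, and the sketch is an illustration of the arithmetic rather than the agent's implementation.

```python
from datetime import datetime, timedelta


def minimum_delay(event_time: datetime, next_scheduled: datetime) -> timedelta:
    """Lower bound on the next iteration's delay while the current one is still running."""
    return max(event_time - next_scheduled, timedelta(0))


# Values taken from the Job ID 679 entry in the example message.
event_time = datetime(2014, 8, 21, 5, 55, 49)
next_scheduled = datetime(2014, 8, 21, 5, 55, 44)

lower_bound = minimum_delay(event_time, next_scheduled)
```

The result is 5 seconds. The actual delay grows for as long as the iteration keeps running, which is why the message reports only a minimum.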