4.6 CpuLoaded

Use this Knowledge Script to monitor average CPU usage and average queue length to determine whether the CPU is overloaded. You can monitor the average usage on each processor or the average usage across all processors in a computer. If both the CPU usage and CPU queue length thresholds are exceeded, the CPU is overloaded and AppManager raises an event.

On some systems, the CPU queue length does not rise easily and you might want to ignore it. If you do not want to monitor the CPU queue length, set the Maximum number of processes in the queue threshold parameter to -1.
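The event condition described above can be sketched as follows. This is a minimal illustration, not AppManager's implementation: the function name `is_cpu_overloaded` and its parameters are hypothetical, and it assumes that "exceeded" means strictly greater than the threshold and that a queue threshold of -1 leaves CPU usage as the only check.

```python
def is_cpu_overloaded(cpu_usage_pct, queue_length,
                      max_usage_pct=90.0, max_queue_length=2):
    """Return True when the CPU counts as overloaded.

    Hypothetical helper: mirrors the documented rule that both thresholds
    must be exceeded, and that a queue threshold of -1 disables the
    queue-length check. The strict ">" comparison is an assumption.
    """
    usage_exceeded = cpu_usage_pct > max_usage_pct
    if max_queue_length == -1:
        # Queue length is ignored, so usage alone decides.
        return usage_exceeded
    return usage_exceeded and queue_length > max_queue_length


print(is_cpu_overloaded(95.0, 1))                       # False: queue is below threshold
print(is_cpu_overloaded(95.0, 4))                       # True: both thresholds exceeded
print(is_cpu_overloaded(95.0, 0, max_queue_length=-1))  # True: queue check disabled
```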

4.6.1 Resource Objects

CPU folder or any individual CPU icon (for multiprocessor systems).

4.6.2 Default Schedule

The default interval for this script is Every 15 minutes.

4.6.3 Setting Parameter Values

Set the following parameters as needed:

Description

How to Set It

Event if CPU usage and queue are over thresholds? (y/n)

Set to y to raise events. The default is y.

Collect data? (y/n)

Set to y to collect data for charts and reports. When set to y, this script returns the average CPU utilization percentage (%) and the average CPU run queue length. The default is n.

HINT: If you only want to collect run queue length data, use the UNIX_GeneralCounter Knowledge Script.

Monitor overall CPU load? (y/n)

Set to y to monitor the average load across all processors in a computer. If you are collecting data, setting this option to y creates a single data stream for all processors.

Set to n to monitor the average load for each processor separately. If you are collecting data, setting this option to n creates a separate data stream for each processor.

The default is y.

NOTE: For a single CPU system, monitoring all CPUs produces the same results as monitoring an individual CPU.

Maximum CPU usage (%) threshold

Specify the maximum CPU utilization (user plus kernel). The default is 90%.

Maximum number of processes in the queue threshold

Specify the maximum number of processes allowed in the CPU queue. CPU queue length indicates how many processes are ready to run. The default is 2.

HINT: If you do not want to monitor the CPU queue length, set the threshold to -1.

Event severity level

Set the event severity level, from 1 to 40, to indicate the importance of the event. The default is 5.

Event severity for internal failure

Set the event severity level, from 1 to 40, to indicate the importance of an event in which this job experienced an internal error. The default is 5.

Enable debugging? (y/n)

Set to y to enable debugging. The default is n.

4.6.4 Example of How This Script Is Used

This script monitors both the percentage of CPU used and processor queue length because, by itself, high CPU usage might not indicate a problem. Instead, you need to consider several factors, including:

  • Queue length (Load average)

  • How you are using the computers monitored

  • Your overall strategy for the environment

For example, if a computer in a transactional environment consistently uses 90% of the CPU, the computer is running at capacity. However, if the queue length remains low and stable (for example, never more than 2 processes waiting), the computer might be sized well for maximum efficiency. If the queue length increases and processes are waiting, you likely have a problem that you need to address.

In a batch environment, consider setting the thresholds differently. For example, during down times when batch jobs are not running, you might want an event if CPU usage is over 50% and any process is waiting (queue length threshold set to 0), to ensure the computer has enough CPU headroom when the batch jobs are running.
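The two scenarios above can be expressed as two threshold sets. The following sketch is illustrative only: the `Thresholds` class and the profile names are hypothetical, and it assumes that a threshold is exceeded when the measured value is strictly greater than it, so a queue threshold of 0 triggers as soon as one process is waiting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical pair of thresholds for one monitoring job."""
    max_usage_pct: float
    max_queue_length: int

    def exceeded(self, usage_pct, queue_length):
        # Both thresholds must be exceeded (strict comparison is an assumption).
        return (usage_pct > self.max_usage_pct
                and queue_length > self.max_queue_length)


# Values taken from the scenarios in the text.
TRANSACTIONAL = Thresholds(max_usage_pct=90.0, max_queue_length=2)
BATCH_IDLE_WINDOW = Thresholds(max_usage_pct=50.0, max_queue_length=0)

sample = (60.0, 1)  # 60% CPU usage, one process waiting
print(TRANSACTIONAL.exceeded(*sample))      # False
print(BATCH_IDLE_WINDOW.exceeded(*sample))  # True
```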

Other factors to consider are long-range plans, such as the number of users you expect to support, for how long, and how much room for growth you need. For example, you might want to set the CPU usage lower to give you an early warning that you need to off-load some processing or order new systems.

4.6.5 Selecting Overall or Individual CPU Load

Monitoring load for each CPU individually provides more specific information about what is happening on a system. For example, if you monitor average load and see that CPU usage is 50%, it does not tell you as much about resource usage as seeing that CPU 0 is running at 90% and CPU 1 is running at 10%.

4.6.6 Handling Spikes

Because CPU and queue length are often subject to temporary spikes, you should set a short interval, such as every 3 to 5 minutes, but raise an event only after thresholds are exceeded in 3 consecutive periods.
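The idea of raising an event only after several consecutive breaches can be sketched as a small counter. This illustrates the logic only and is not how AppManager implements it; the class `ConsecutiveBreachFilter` and its method names are hypothetical.

```python
class ConsecutiveBreachFilter:
    """Signal only after `required` consecutive breaching intervals (hypothetical helper)."""

    def __init__(self, required=3):
        self.required = required
        self.streak = 0

    def update(self, breached):
        """Record one monitoring interval; return True when an event should be raised."""
        # Any interval without a breach resets the streak.
        self.streak = self.streak + 1 if breached else 0
        return self.streak >= self.required


spike_filter = ConsecutiveBreachFilter(required=3)
samples = [True, True, False, True, True, True]  # one flag per interval
results = [spike_filter.update(s) for s in samples]
print(results)  # [False, False, False, False, False, True]
```

In this run, the two-interval spike at the start does not raise an event because the third interval is below the thresholds; only the later run of three breaches does.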

4.6.7 Collecting Data

This Knowledge Script is typically used to raise events, but if you collect data, you can use the information to identify usage trends. For example, seeing the CPU usage growing steadily can help you plan for growth. If you want to do this type of analysis, consider running a second job at a less frequent interval.

You can configure this Knowledge Script to collect data on the average CPU utilization percentage (%) and the average CPU run queue length. You can collect data for the average usage on each processor or the average usage across all processors in a computer.
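The difference between the two collection modes can be pictured as follows. This is a sketch under stated assumptions: the function `build_data_streams` and the stream names (`all_cpus`, `cpu0`, and so on) are hypothetical and do not reflect the names AppManager actually assigns.

```python
def build_data_streams(per_cpu_usage_pct, monitor_overall=True):
    """Return a mapping of stream name to CPU usage value (hypothetical helper).

    monitor_overall=True  -> one stream with the average across all processors
    monitor_overall=False -> one stream per processor
    """
    if not per_cpu_usage_pct:
        raise ValueError("at least one CPU sample is required")
    if monitor_overall:
        average = sum(per_cpu_usage_pct) / len(per_cpu_usage_pct)
        return {"all_cpus": average}
    return {f"cpu{i}": usage for i, usage in enumerate(per_cpu_usage_pct)}


print(build_data_streams([90.0, 10.0]))                         # {'all_cpus': 50.0}
print(build_data_streams([90.0, 10.0], monitor_overall=False))  # {'cpu0': 90.0, 'cpu1': 10.0}
```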

4.6.8 Working with Multi-Processor Systems

On a multi-processor system, the total CPU utilization is the average percentage of time that all the processors on the system are busy executing non-idle threads. For example:

  • If all processors are always busy, this is 100%.

  • If all processors are 50% busy, this is 50%.

  • If 25% of the processors are busy and all processors use a single queue in which threads wait for a processor cycle, this is 25%.
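The three cases above follow from a plain average of per-processor busy percentages. The sketch below assumes a four-processor system for illustration; the function name `total_cpu_utilization` is hypothetical.

```python
def total_cpu_utilization(per_cpu_busy_pct):
    """Average busy percentage across all processors (hypothetical helper)."""
    if not per_cpu_busy_pct:
        raise ValueError("at least one processor is required")
    return sum(per_cpu_busy_pct) / len(per_cpu_busy_pct)


# Assumed four-processor system, one entry per processor.
print(total_cpu_utilization([100, 100, 100, 100]))  # 100.0: all processors always busy
print(total_cpu_utilization([50, 50, 50, 50]))      # 50.0: all processors 50% busy
print(total_cpu_utilization([100, 0, 0, 0]))        # 25.0: 25% of the processors busy
```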