4.3 Implementing Core Monitoring Support

In planning an AppManager deployment, you should first identify a specific set of Knowledge Scripts you want to run. Although the list is likely to change over time, your initial core set of Knowledge Scripts should monitor basic server health and availability and your most important application resources. At a minimum, for example, most organizations monitor CPU and memory usage, disk space, disk I/O activity, network connections or activity, and the availability of specific computers or specific processes.

In addition, many organizations monitor computer hardware components and application-specific resources, such as mailbox size for messaging servers and database connections for database servers.

HINT: The core set of Knowledge Scripts should consist of the Knowledge Scripts you want to run at regular intervals for monitoring performance and availability. In general, you should identify a relatively simple set of scripts to act as the core set. You can then extend the core set with additional Knowledge Scripts to perform more detailed analysis, assist you in troubleshooting, or collect data for reports.

In a typical environment, you run approximately 20 jobs on each agent computer at regular intervals to ensure basic operational health and availability, and you run additional jobs less frequently to diagnose problems or take corrective action. Your initial core set of Knowledge Scripts, however, might include fewer jobs than this.

NetIQ Corporation recommends initially running a core set of Knowledge Scripts from the General and NT Knowledge Script categories. The following list describes the recommended core set of Knowledge Scripts. For more information about using these Knowledge Scripts and setting parameters, see the AppManager Knowledge Script Reference Guide, available on the AppManager Documentation page.

  • General_EventLog: Monitors and filters information in the Windows Event Log and allows you to track log entries that match filtering criteria. Initially, NetIQ Corporation recommends monitoring all logs for error events. You can further filter the log entries to include or exclude other criteria such as specific IDs, descriptions, user names, or computer names.

  • General_MachineDown: Detects whether the computer on which you run the script can communicate with one or more specified Windows computers and raises an event if communication attempts fail.

  • NT_MemUtil: Monitors physical and virtual memory and the paging files and raises an event if a monitored metric exceeds the threshold.

  • NT_DiskSpace: Monitors logical drives for disk utilization, the amount of free space available, and the percentage of disk growth.

  • NT_CpuLoaded: Monitors total CPU usage and queue length to determine whether the CPU is overloaded and raises an event when both the total CPU usage and the CPU queue length exceed the thresholds.

  • NT_LogicalDiskStats: Monitors logical disk reads, writes, and transfers per second, disk operation time, and queue length.

  • NT_PhysicalDiskStats: Monitors physical disk reads, writes, and transfers per second, disk operation time, and queue length.

  • NT_ServiceDown: Monitors whether specified Microsoft Windows services are stopped or started and, optionally, starts any stopped service.

  • NT_TrustRelationship: Tests the domain trust relationship from the computer on which you run the script to a specified domain and raises an event if a problem exists with the domain trust.

4.3.1 Collecting Data

To identify normal baseline operating values before you set thresholds for events, set all Knowledge Scripts only to collect data (that is, not to raise events) and run reports for at least one week. From the reports, you can review the high, low, and average values for core statistics. You can configure several basic report Knowledge Scripts to create reports.

To create reports about your environment:

  1. Install at least one report-enabled agent.

  2. Run the Discovery_ReportAgent Knowledge Script on the report-enabled agent computer.

  3. In the Report view, click through tabs in the Knowledge Script pane to select the reports to run.

At the end of the collection period, evaluate the information to determine a baseline for a normal operating environment. After you complete your evaluation, remove the data you collected from the QDB.
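As a rough illustration of this baseline review, the following Python sketch summarizes a week of collected values by metric. It assumes you have exported the collected data points to a CSV file; the file name baseline.csv and the metric and value column names are hypothetical, and AppManager does not generate this file or ship this script.

    import csv
    import statistics
    from collections import defaultdict

    # Hypothetical export of one week of collected data points:
    # one row per data point, with a metric name and a numeric value.
    values_by_metric = defaultdict(list)
    with open("baseline.csv", newline="") as f:
        for row in csv.DictReader(f):
            values_by_metric[row["metric"]].append(float(row["value"]))

    # Report the low, high, and average values for each core statistic,
    # plus the 95th percentile as one possible starting point for a threshold.
    for metric, values in sorted(values_by_metric.items()):
        values.sort()
        p95 = values[int(0.95 * (len(values) - 1))]
        print(f"{metric}: low={min(values):.1f} high={max(values):.1f} "
              f"avg={statistics.mean(values):.1f} p95={p95:.1f}")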

When you are ready to raise events, set only those Knowledge Scripts that address critical issues in your environment to raise events, and set the remaining Knowledge Scripts to collect data. You can employ this approach enterprise-wide or only on the computers you identify as needing immediate attention. To help you tune your system later, track the frequency of events and the number of data points collected.

Based on the data you collect, you can adjust thresholds to more accurately reflect your environment’s specific characteristics. If you see too many events, the thresholds might be too low for your environment, the intervals might be too short, or you might need to address critical resource issues.
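If you prefer a formula-driven starting point, one common rule of thumb (not an AppManager feature) is to place the initial threshold a couple of standard deviations above the observed average and clamp it to limits that make sense for the metric. A minimal Python sketch, using made-up CPU samples:

    import statistics

    def suggest_threshold(values, floor, ceiling):
        """Suggest an initial event threshold from baseline samples using
        mean + 2 standard deviations, clamped to a sensible range."""
        candidate = statistics.mean(values) + 2 * statistics.stdev(values)
        return min(max(candidate, floor), ceiling)

    # Example: one week of CPU usage samples (percent) for a server group.
    cpu_samples = [22, 31, 28, 45, 38, 52, 41, 35, 60, 47]
    print(round(suggest_threshold(cpu_samples, floor=50, ceiling=95)))

Treat the result only as a first estimate; the pilot-group process described in the next section is what validates it.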

Basic AppManager reporting provides detailed information about the computers in a single management site. When you expand your deployment to multiple management sites with multiple QDBs, you might want the more sophisticated reporting available with NetIQ Analysis Center.

4.3.2 Setting and Adjusting Event Thresholds

Once you have identified a core set of Knowledge Scripts and baseline operating values for monitoring basic computer resources, such as CPU, memory, and disk, and critical application resources, create a Knowledge Script Group from those Knowledge Scripts and run them on a pilot group of computers.

The servers in your pilot group should have similar configurations and be similarly loaded. For example, you may want to set different event thresholds for servers that perform transactional operations than for servers that perform batch operations, so you would organize transactional and batch servers into separate management groups or views.

With a group of similarly configured and loaded servers, you should run the core set of Knowledge Scripts to raise events only for critical issues in your environment. You can use the default threshold values or your own estimation for initial threshold settings based on the results of your initial data collection.

HINT: Using a monitoring policy may simplify event threshold configuration. With a monitoring policy, the jobs are started automatically, changes to Knowledge Script group member properties are automatically propagated to policy-based jobs, and when you remove the policy, the jobs are automatically stopped and deleted.

The process of establishing effective event thresholds includes several basic steps. By following these steps with a pilot group of servers, you establish threshold values you can use throughout the rest of your enterprise:

  • Identify a group of servers that have a similar configuration.

  • Identify the event conditions most relevant to you for those servers.

  • Identify the Knowledge Scripts you want to run to monitor the event conditions you identified.

  • Run monitoring jobs and make adjustments to the event conditions and event thresholds as needed. The goal is to set event thresholds you believe to be accurate for the servers and applications most critical to your business.

The purpose of running a core set of jobs on a pilot group of computers is to reveal:

  • Serious problems that need immediate attention—for example, computers that are dangerously low on disk space or that have high CPU usage

  • Any environmental issues you need to address—for example, problems with insufficient account privileges, network instability, or the availability of SNMP or other services that need to be installed

  • Threshold levels and job properties that are appropriate to your specific environment and that you can standardize, either across your entire organization or across specific departmental or functional groups

If you are seeing too many events, the thresholds may be set too low for your environment, or the interval for running the job may be too short. Events should not be raised unless something has happened that merits a response. Responses include acknowledging the event, running another Knowledge Script to remotely diagnose the problem, or diagnosing the system in person.
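To make that adjustment concrete, the following illustrative Python sketch raises a threshold step by step until the number of baseline samples that would have breached it falls to a level your staff can reasonably respond to. The function, the step size, and the sample values are hypothetical and are not part of AppManager.

    def raise_threshold_until_quiet(samples, threshold, max_breaches, step=5):
        """Raise a threshold until the number of baseline samples that
        would breach it drops to a manageable count."""
        while sum(value > threshold for value in samples) > max_breaches:
            threshold += step
        return threshold

    # Example: disk-usage samples (percent) that fire too often at a 70% threshold.
    usage = [62, 71, 74, 68, 73, 77, 70, 75, 72, 79]
    print(raise_threshold_until_quiet(usage, threshold=70, max_breaches=2))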

Deploying a core set of Knowledge Scripts also prevents your staff from being overwhelmed by a sudden barrage of events. By focusing on a limited number of key Knowledge Scripts and the most critical problems you need to address early in the deployment, you can develop an understanding of the events generated, implement a methodology for responding to those events, and effectively troubleshoot any issues that arise.

In your initial deployment, therefore, the core Knowledge Scripts should not perform responsive actions when events are raised. Avoiding actions in the earliest stages of deployment prevents an unnecessary surge of e-mail or pager messages being sent for events caused by thresholds that have been set too high or too low. Once you have determined appropriate thresholds for your environment, you can test responsive actions and choose an appropriate notification method, such as MAPI mail, SMTP mail, or a paging system.

4.3.3 Establishing a Manageable Level of Event Activity

If you are receiving too many events, you might need to do some or all of the following:

  • Adjust thresholds. Whether they need to be higher or lower depends on your environment, on your reasons for monitoring a particular computer, and on how particular computers are being used. For example, when monitoring the computers in a lab to determine when you are nearing capacity, you might set thresholds lower than when monitoring users' desktop computers or computers that store archived information that rarely changes.

  • Change the job schedule (increase or decrease the monitoring interval).

  • Change the number of consecutive times that a condition must be detected before an event is raised; the sketch after this list illustrates the idea. For more information, see Adjusting Consecutive Intervals.

  • Modify the computer configuration to bring non-conforming computers in line with the benchmark settings or manage the non-conforming servers using another management group.
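The consecutive-detection option suppresses events for transient spikes. The following Python sketch shows the general idea of such a counter; it is a conceptual illustration with made-up values, not AppManager's implementation.

    def should_raise_event(samples, threshold, required_consecutive):
        """Return True once the threshold has been exceeded on the required
        number of consecutive monitoring intervals."""
        consecutive = 0
        for value in samples:
            if value > threshold:
                consecutive += 1
                if consecutive >= required_consecutive:
                    return True
            else:
                consecutive = 0  # a single normal reading resets the count
        return False

    # A brief spike does not raise an event, but a sustained condition does.
    print(should_raise_event([95, 40, 96, 50], threshold=90, required_consecutive=3))  # False
    print(should_raise_event([91, 93, 95, 92], threshold=90, required_consecutive=3))  # True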

4.3.4 Developing a Data Collection Strategy

Once you are monitoring for events on your core systems and applications, you are ready to collect data for charts and reports. When considering your reporting needs, determine the following information:

  • Standard AppManager reports to generate and the Knowledge Scripts required to generate those reports

  • Who should receive the reports and how frequently

  • Whether to generate reports automatically on a scheduled basis or manually on demand

  • Who will generate reports

    For example, you might want to restrict access to the Report view or assign Exchange reports to an Exchange administrator and SQL Server reports to your DBA group.

  • Whether to format reports in table format, in charts, or both

  • Whether to deliver reports through e-mail, a Web site, or the Report Viewer

The following list describes report Knowledge Scripts that NetIQ Corporation recommends running to generate standard reports. For more information about using these Knowledge Scripts and setting parameters, see the AppManager Knowledge Script Reference Guide, available on the AppManager Documentation page.

  • ReportAM_EventSummary: Summarizes events per computer.

  • ReportAM_SystemUpTime: Details the uptime and downtime of monitored computers.

  • ReportAM_CompDeploy: Details the number of instances of each AppManager component installed on computers in an AppManager site.

  • ReportAM_WatchList: Details the top or bottom n computers (by number or percent) generating the selected data streams.

  • NT_Report_CPULoadSummary: Summarizes CPU usage and queue length for selected computers.

  • NT_Report_LogicalDiskUsageSummary: Summarizes the percentage of disk space used and the amount of free space (in MB) for selected computers.

When collecting data, familiarize yourself with how AppManager collects data for charts and reports, and set repository preferences and job properties so that you collect and maintain only the data you need. Storing data you do not need can quickly consume repository resources and degrade performance. For more information, see Managing Data and Managing a QDB.
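To get a feel for how quickly collected data accumulates, consider the back-of-the-envelope Python sketch below. Every figure in it (the number of computers, jobs per computer, iteration interval, and retention period) is an illustrative assumption, not an AppManager default.

    # Back-of-the-envelope estimate of how quickly collected data accumulates.
    computers = 100          # monitored agent computers
    jobs_per_computer = 20   # core jobs that collect data
    interval_minutes = 5     # job iteration interval
    retention_days = 90      # how long the data is kept in the QDB

    points_per_day = computers * jobs_per_computer * (24 * 60 // interval_minutes)
    total_points = points_per_day * retention_days
    print(f"{points_per_day:,} data points per day, "
          f"{total_points:,} retained over {retention_days} days")

Even with these modest assumptions, the repository accumulates more than half a million data points a day, which is why collecting only the data you need matters.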

If you need to report on more than three months' worth of data, consider using AppManager Analysis Center. The aggregate reporting capabilities of Analysis Center are powerful and help you avoid the performance problems associated with storing large amounts of AppManager data for reports.