3.2 Deploying to a Pilot Group

Depending on the size of your organization, the importance of your monitoring needs, the expertise of your deployment team, and the resources available to you, the pilot deployment might involve a small but representative number of computers or all of the computers you intend to monitor. NetIQ Corporation recommends installing on enough computers to get a realistic view of the full-scale deployment. The pilot deployment should last from two to four weeks and reveal the following information:

  • Problems that need immediate attention, such as computers that are low on disk space

  • Environmental issues you need to address, such as insufficient privileges or instability

  • How closely the computers you want to monitor conform to your expectations

During the pilot deployment, focus on the following goals:

  • Running the recommended core set of Knowledge Scripts on agent computers

    For more information about working with Knowledge Scripts and jobs, see the Control Center User Guide for AppManager, available on the AppManager Documentation page. For more information about the recommended core set of Knowledge Scripts, see Running the Recommended Core Knowledge Scripts.

  • Identifying and correcting problems with running the core set of jobs

    For example, you might find problems with the required accounts and permissions.

  • Gaining experience viewing and responding to events

    For more information about how AppManager raises events and using the Control Center console to view and respond to them, see the Control Center User Guide for AppManager, available on the AppManager Documentation page.

  • Identifying normal operating values and adjusting thresholds for your environment

    For more information about identifying normal operating values, see Collecting Data.

  • Gathering operational data, such as disk space and CPU utilization, for charting and reporting

    For more information about using AppManager to generate charts and reports, see the Control Center User Guide for AppManager, available on the AppManager Documentation page.

3.2.1 Running the Recommended Core Knowledge Scripts

During the pilot deployment, NetIQ Corporation recommends that you run only a core set of Knowledge Scripts and restrict the number of users allowed to perform activities such as acknowledging and closing events or starting and stopping jobs. Running only a core set of Knowledge Scripts keeps a large number of events from overwhelming your staff. It also gives you time to understand the events the jobs generate, develop a methodology for responding to them, and troubleshoot issues.

In a typical environment, you run approximately 20 Knowledge Script jobs on each agent computer at regular intervals to ensure basic operational health and availability. You run additional jobs less frequently to diagnose problems or take corrective action. Although running around 20 jobs is typical, the core set of Knowledge Scripts you initially run might include fewer jobs.

NetIQ Corporation recommends initially running a core set of Knowledge Scripts from the General and NT Knowledge Script categories. The following table describes the recommended core set of Knowledge Scripts. For more information about using these Knowledge Scripts and setting parameters, see the AppManager Knowledge Script Reference Guide, available on the AppManager Documentation page.

Knowledge Script | Description
--- | ---
General_EventLog | Monitors and filters information in the Windows Event Log and allows you to track log entries that match filtering criteria. Initially, NetIQ Corporation recommends monitoring all logs for error events. You can further filter the log entries to include or exclude other criteria such as specific IDs, descriptions, user names, or computer names.
General_MachineDown | Detects whether the computer on which you run the script can communicate with one or more specified Windows computers and raises an event if communication attempts fail
NT_MemUtil | Monitors physical and virtual memory and the paging files and raises an event if a monitored metric exceeds the threshold
NT_DiskSpace | Monitors logical drives for disk utilization, the amount of free space available, and the percentage of disk growth
NT_CpuLoaded | Monitors total CPU usage and queue length to determine whether the CPU is overloaded and raises an event when both the total CPU usage and CPU queue length exceed the thresholds
NT_LogicalDiskStats | Monitors logical disk reads, writes, and transfers per second, disk operation time, and queue length
NT_PhysicalDiskStats | Monitors physical disk reads, writes, and transfers per second, disk operation time, and queue length
NT_ServiceDown | Monitors whether specified Microsoft Windows services are stopped or started, and, optionally, starts any stopped service
NT_TrustRelationship | Tests the domain trust relationship from the computer on which you run the script to a specified domain and raises an event if a problem exists with the domain trust
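
The NT_CpuLoaded description combines two conditions: an event is raised only when both total CPU usage and CPU queue length exceed their thresholds. The following Python sketch illustrates that combined check. It is not the Knowledge Script's implementation; the function name cpu_overloaded, its parameters, and the threshold values in the usage lines are hypothetical.

```python
def cpu_overloaded(cpu_percent, queue_length, cpu_threshold, queue_threshold):
    """Return True only when BOTH metrics exceed their thresholds.

    All names here are hypothetical; this only illustrates the
    'both conditions' rule described for NT_CpuLoaded.
    """
    return cpu_percent > cpu_threshold and queue_length > queue_threshold


# Hypothetical thresholds: 90 percent CPU usage, queue length of 2
print(cpu_overloaded(95.0, 5, cpu_threshold=90.0, queue_threshold=2))
print(cpu_overloaded(95.0, 1, cpu_threshold=90.0, queue_threshold=2))
```

Requiring both conditions means that a short CPU spike with an empty queue, or a long queue on a lightly used CPU, does not by itself count as an overload in this sketch.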

3.2.2 Collecting Data

To identify normal baseline operating values before you set thresholds for events, set all Knowledge Scripts to collect data only (that is, not to raise events) and run reports for at least one week. From the reports, you can review the high, low, and average values for core statistics. You can configure several basic report Knowledge Scripts to create these reports.

To create reports about your environment:

  1. Install at least one report-enabled agent.

    For more information about enabling reporting capability for an agent, see Understanding Agent Reporting Capabilities.

  2. Run the Discovery_ReportAgent Knowledge Script on the report-enabled agent computer.

  3. In the Report view, click through tabs in the Knowledge Script pane to select the reports to run.

At the end of the collection period, evaluate the information to determine a baseline for a normal operating environment. After you complete your evaluation, remove the data you collected from the QDB. For information about removing data from the QDB, see the Administrator Guide for AppManager, available on the AppManager Documentation page.
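
As a rough illustration of the evaluation step, the following Python sketch computes the high, low, and average values from a list of data points. It is not part of AppManager. The function name summarize_baseline and the sample values are hypothetical, and the sketch assumes you have already copied the collected values from a report into a plain list of numbers.

```python
from statistics import mean


def summarize_baseline(samples):
    """Return the high, low, and average of a list of numeric data points."""
    if not samples:
        raise ValueError("no data points collected")
    return {
        "high": max(samples),
        "low": min(samples),
        "average": mean(samples),
    }


# Hypothetical CPU utilization samples (percent), one per day of the week
cpu_samples = [12.0, 18.5, 35.0, 22.5, 90.0, 14.0, 18.0]
baseline = summarize_baseline(cpu_samples)
print(baseline)
```

A single outlier, such as the 90.0 value in the sample data, raises the high value sharply while moving the average much less, which is one reason to review all three values before choosing a threshold.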

When you are ready to raise events, set only those Knowledge Scripts that address critical issues in your environment to raise events, and set the remaining Knowledge Scripts to collect data. You can employ this approach enterprise-wide or only on the computers you identify as needing immediate attention. To help tune your system later, track the frequency of events and the number of data points collected.
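
One simple way to track event frequency is to count events per Knowledge Script over the tracking period. The following Python sketch assumes you have a list of event records as (Knowledge Script name, computer name) pairs. The record layout, the computer names, and the variable names are hypothetical and do not reflect an AppManager export format.

```python
from collections import Counter

# Hypothetical event records: (knowledge_script_name, computer_name)
events = [
    ("NT_DiskSpace", "SRV01"),
    ("NT_CpuLoaded", "SRV01"),
    ("NT_DiskSpace", "SRV02"),
    ("NT_DiskSpace", "SRV01"),
]

# Count how many events each Knowledge Script raised
events_per_script = Counter(script for script, _ in events)

# Count how many events each computer produced
events_per_computer = Counter(computer for _, computer in events)

# The script that raised the most events is a candidate for tuning
noisiest_script, noisiest_count = events_per_script.most_common(1)[0]
print(noisiest_script, noisiest_count)
```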

Based on the data you collect, you can adjust thresholds to more accurately reflect specific characteristics of your environment. If you see too many events, the thresholds might be too sensitive for your environment (for example, an upper limit set too low), the job intervals might be too short, or you might need to address critical resource issues.
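
To reason about a threshold before you change it, you can check how many collected data points would have exceeded a candidate value. The following Python sketch assumes the threshold is an upper limit, so that an event corresponds to a value above it. The function names, the nearest-rank percentile approach, and the sample data are hypothetical illustrations, not AppManager features.

```python
import math


def count_exceedances(samples, threshold):
    """Count the data points that are strictly above the threshold."""
    return sum(1 for value in samples if value > threshold)


def suggest_threshold(samples, percent):
    """Return the nearest-rank percentile of the samples.

    With percent=90, roughly 10 percent of the collected data points
    lie above the returned value.
    """
    if not samples:
        raise ValueError("no data points collected")
    ordered = sorted(samples)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical CPU utilization data points (percent)
cpu_samples = [40, 45, 50, 55, 60, 65, 70, 75, 80, 95]

current = 60
candidate = suggest_threshold(cpu_samples, 90)
print(count_exceedances(cpu_samples, current))
print(candidate, count_exceedances(cpu_samples, candidate))
```

In this sample data, half of the data points exceed a threshold of 60, while only one exceeds the candidate value. For a metric where low values are the problem, such as free disk space, the comparison would be reversed.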

Basic AppManager reporting provides detailed information about the computers in a single management site. When you expand your deployment to multiple management sites with multiple QDBs, you might want the more sophisticated reporting available with NetIQ Analysis Center.