NetIQ Documentation: NetIQ Privileged User Manager 2.3.3 Administration Guide

3.9 Troubleshooting

3.9.1 Promoting Managers When the Primary Manager Fails

If you have multiple Framework Managers deployed, the first manager installed is defined as the primary manager by default, and its packages are defined as primary. Manager packages on all other manager hosts act as backups. If your primary manager becomes unavailable, you can select one or more manager packages on a host to be promoted to primary status.

The Framework continues to function when the primary manager is unavailable, but no changes can be made to the Framework. Changes can only be written to the databases on the primary manager, which are then replicated to the backup managers. The only exception to this is the audit database. Each audit agent is responsible for sending its audit messages to each audit manager. This ensures that audit data is not lost.

NetIQ recommends having one host designated as a complete mirror of your primary manager. In the event of a total failure of the primary manager, you can log in to the backup console and promote it to primary status with no disruption of Privileged User Manager services.

1. Click Hosts on the home page of the console.
2. In the navigation pane, select the host where you want to promote a manager.
3. With the host’s packages displayed, select the manager packages you want to promote.
   To select multiple manager packages, press the Ctrl key and select the packages one at a time, or press the Shift key to select a consecutive list of manager packages.
4. Click Promote Manager in the task pane.
5. Review the list of manager packages you have selected.
6. Click Finish.
7. View the host’s packages again and verify that the Status of the promoted manager packages has changed to Primary.

To use a command line option to promote a backup host to primary status, see Section 10.6, Registry Manager Options.

3.9.2 Viewing Store and Forward Messages

Messages from one host to another are stored if the sending host cannot communicate with the receiving host, and forwarded when the communication link is restored. You can view these messages and delete them if you do not need them.

You can use this feature to analyze a host and to discover whether it is having problems contacting a particular host. This problem usually occurs when a host is down or when the DNS name of a host cannot be resolved.

1. Click Hosts on the home page of the console.
   The navigation pane displays the current hierarchy for your Framework.
2. Select the host for which you want to view store and forward messages.
3. Click the host’s Packages icon (click the arrow next to the host’s name to display it).
4. Select View Messages in the task pane.
   If any stored messages exist, they are displayed. Information about each message is shown, including the time the message was sent, the host the message was being sent to, the module that sent the message, the type of message (method), the number of failed attempts at sending the message, and the next scheduled attempt to send the message, if any.
5. To attempt to send one or more messages again, select the messages and click Retry.
6. To delete one or more messages, select the messages and click Delete.
7. To refresh the screen, click Refresh.
8. Click Close.

3.9.3 Managing Low Disk Space

In previous releases of Privileged User Manager, usrun sessions were terminated with an auditing error when the server ran out of disk space. For long-running processes, this is not ideal.

You can now use Command Control scripts to slow down or freeze input/output under the following conditions:

In usrun sessions when disk space is low.
When the store and forward process cannot contact an audit manager and its queue size is increasing.

You can control what happens under these conditions by configuring the following attributes:

disk_min_free (default: 1MB): Minimum free disk space. When free disk space goes below this level, the action defined in backoff_action is applied.

If the backoff_action is block, the audit message is paused until disk space becomes available.
If the backoff_action is fail, the request to store the audit message fails with an error, and the user session is terminated.
If the backoff_action is allow, the session is unaffected.

disk_wm_free (default: 2MB): Free disk space watermark. When the free disk space goes below this level, the delay defined in backoff_delay is applied between each audited message.

queue_max_size (default: 250MB): Maximum queue size. When the queue size goes above this level, the action defined in backoff_action is applied.

If the backoff_action is block, the audit message is paused until the queue size falls below this level.
If the backoff_action is fail, the request to store the audit message fails with an error, and the user session is terminated.
If the backoff_action is allow, the session is unaffected.

queue_wm_size (default: 100MB): Queue size watermark. When the queue size goes above this level, the delay defined in backoff_delay is applied between each audited message.

backoff_divisor (default: 1): Provides the ability to increase the delay as the free disk space decreases or the queue size increases. The range between disk_wm_free and disk_min_free (or between queue_max_size and queue_wm_size) is divided by backoff_divisor, and the delay is then applied for each resulting increment.

backoff_delay (default: 500ms): Time in milliseconds to delay the audit request.

backoff_action (default: block): Either block, fail, or allow.

The following Command Control script illustrates how to change these settings:

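# Look up the Audit node under $meta; create it if it does not exist.
my $t=$meta->child("Audit");
$t=$meta->add_node("Audit") if(! $t);
# Set the minimum free disk space and the free disk space watermark (in MB).
$t->arg("disk_min_free","10");
$t->arg("disk_wm_free","20");
return 1;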

This script sets the disk_min_free attribute to 10MB and the disk_wm_free attribute to 20MB. You can assign this script to any rule, or to a rule at the top of the tree that all commands pass through.
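
The queue attributes can presumably be set in the same way. The following is a minimal sketch rather than a tested policy. It reuses only the calls shown in the script above, and it assumes that queue sizes are given as plain numbers in MB and that backoff_delay is given as a plain number of milliseconds, by analogy with the disk values above. The values themselves are illustrative only.

```perl
# Look up the Audit node under $meta; create it if it does not exist.
my $t=$meta->child("Audit");
$t=$meta->add_node("Audit") if(! $t);
# Assumption: queue sizes are plain numbers in MB, like the disk values above.
$t->arg("queue_wm_size","200");    # delay audited messages above this queue size
$t->arg("queue_max_size","500");   # apply backoff_action above this queue size
# Assumption: backoff_delay is a plain number of milliseconds.
$t->arg("backoff_delay","1000");
$t->arg("backoff_action","fail");  # fail the audit request instead of blocking
return 1;
```

With backoff_action set to fail, the user session is terminated once a limit is exceeded, as described in the attribute list above, so verify the behavior on a test host before assigning a script like this to a rule that all commands pass through.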

You should create an emergency policy that allows administrators to access the machine when disk space is low or the store-and-forward queue size is large. Such a script would look similar to the following:

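# Look up the Audit node under $meta; create it if it does not exist.
my $t=$meta->child("Audit");
$t=$meta->add_node("Audit") if(! $t);
# Set all disk and queue thresholds to 0.
$t->arg("disk_min_free","0");
$t->arg("disk_wm_free","0");
$t->arg("queue_max_size","0");
$t->arg("queue_wm_size","0");
# Leave the session unaffected when a limit is reached.
$t->arg("backoff_action","allow");
return 1;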

You can assign this script to any rule, or to a rule at the top of the tree that all commands pass through.

3.9.4 Restarting the Agent

If you are having problems, Novell Support might ask you to restart an agent.

1. Click Hosts on the home page of the console.
2. In the navigation pane, select the host on which you want to restart the agent.
   To select multiple hosts in a domain, select the domain, then press the Ctrl key and select the hosts one at a time, or press the Shift key to select a consecutive list of hosts. To select all hosts in a domain, use Ctrl+A.
3. Click Restart Agent in the task pane.
4. Select the type of restart you want to perform, as advised by Novell Support.
   Soft restart: Reloads the module libraries and resets the service uptime.
   Hard restart: Restarts the daemon, reloads all modules, and resets the service uptime.
5. Click Finish.

3.9.5 Managing the Registry Cache

The registry cache is held by the Registry Agent on each host, and it contains a list of the packages deployed on each host in your Framework. This list is a copy of part of the information held by the Registry Manager, and it enables Framework components to locate and communicate with each other, according to their position in the hierarchy created when you add domains and hosts to your Framework. Agents send requests to managers in the immediate subdomain, and if a request is unsuccessful, they try a manager higher up in the hierarchy. See Section 8.0, Load Balancing and Failover for details.

You can view the registry cache to check hosts in your Framework to see if a specific manager or agent module is installed, and check the order in the Framework hierarchy according to the hosts the modules are installed on. See Viewing the Registry Cache.
If the registry cache becomes out of date, communication problems can occur. To fix this, try clearing the registry cache on the Registry Agent so that the Registry Manager can update it. See Clearing the Registry Cache.

Viewing the Registry Cache

When viewing the registry cache, you can use the stale cache (the default option). The cache is considered stale if it has not been updated by the Registry Manager for 2 hours, and this is usually adequate. If you deselect the Use Stale Cache check box, the information is provided by the Registry Manager.

1. Click Hosts on the home page of the console.
   The navigation pane displays the current hierarchy for your Framework.
2. Select the host for which you want to view the registry cache.
3. Click the host’s Packages icon (click the arrow next to the host’s name to display it).
4. Click View Cache in the task pane.
5. From the drop-down list, select the package you want to look up in the registry cache.
6. If you want to view the latest information from the Registry Manager, deselect the Use stale cache check box, then click Lookup.
   Details of the hosts where the module is installed are displayed in order according to their position in the Framework hierarchy. Information shown includes the Framework agent name, IP address, port number, and whether the host has the primary manager component installed (indicated by 1 in the Primary column) or not (indicated by 0).
7. (Optional) To clear the registry cache, click Clear Cache.
   This marks the cache as stale, and it is automatically updated by the Registry Manager. You can also clear the cache by using the Clear Cache option in the task pane.
8. Click Close.

Clearing the Registry Cache

Novell Support might advise you to try clearing the registry cache if you have communication problems among Privileged User Manager components. The registry cache is held by the Registry Agent and contains a list of managers and agents in your Framework, copied from the Registry Manager. See Managing the Registry Cache for more details.

1. Click Hosts on the home page of the console.
   The navigation pane displays the current hierarchy for your Framework.
2. Select the host for which you want to clear the registry cache.
3. Click the host’s Packages icon (click the arrow next to the host’s name to display it).
4. Click Clear Cache in the task pane.
   The registry cache is marked as stale and is updated by the Registry Manager. You can also clear the registry cache by using the View Cache option (see Viewing the Registry Cache).

3.9.6 Time Synchronization

All agents should be configured to use a Network Time Protocol (NTP) server. Agents must have their time synchronized with the primary registry manager so that the time difference is less than two hours.

If the time difference is greater than two hours, the agent can appear offline and Command Control requests can fail.