A.5 Backup and Recovery

The highly available failover cluster described in this document provides a level of redundancy so that if the service fails on one node in the cluster, it will automatically failover and recover on another node in the cluster. When an event like this happens, it's important to bring the node that failed back into an operational state so that the redundancy in the system can be restored and protect in the case of another failure. This section talks about restoring the failed node under a variety of failure conditions.

A.5.1 Backup

While a highly available failover cluster like the one described in this document provides a layer of redundancy, it is still important to regularly take a traditional backup of the configuration and data, which would not be easy to recover from if lost or corrupted. The section Backing Up and Restoring Data in the NetIQ Sentinel Administration Guide describes how to use Sentinel's built-in tools for creating a backup. These tools should be used on the active node in the cluster because the passive node in the cluster will not have the required access to the shared storage device. Other commercially available backup tools could be used instead and may have different requirements on which node they can be used.

A.5.2 Recovery

Transient Failure

If the failure was a temporary failure and there is no apparent corruption to the application and operating system software and configuration, then simply clearing the temporary failure, for example rebooting the node, will restore the node to an operational state. The cluster management user interface can be used to fail back the running service back to the original cluster node, if desired.

Node Corruption

If the failure caused a corruption in the application or operating system software or configuration that is present on the node's storage system, then the corrupted software will need to be reinstalled. Repeating the steps for adding a node to the cluster described earlier in this document will restore the node to an operational state. The cluster management user interface can be used to fail back the running service back to the original cluster node, if desired.

Cluster Data Configuration

If data corruption occurs on the shared storage device in a way that the shared storage device can't recover from, this would result in the corruption affecting the entire cluster in a way that cannot be automatically recovered from using the highly available failover cluster described in this document. The section Backing Up and Restoring Data in the NetIQ Sentinel Administration Guide describes how to use Sentinel's built-in tools for restoring from a backup. These tools should be used on the active node in the cluster because the passive node in the cluster will not have the required access to the shared storage device. Other commercially available backup and restore tools could be used instead and may have different requirements on which node they can be used.