A.1 Concepts

High availability refers to a design methodology that is intended to keep a system available for use as much as is practicable. The intent is to minimize the causes of downtime, such as system failures and maintenance, and to minimize the time needed to detect and recover from downtime events that do occur. In practice, automated means of detecting and recovering from downtime events become necessary as higher levels of availability are required.

A.1.1 External Systems

Sentinel is a complex multi-tier application that depends upon and provides a wide variety of services. Additionally, it integrates with multiple external third-party systems for data collection, data sharing, and incident remediation. Most HA solutions allow implementers to declare dependencies between the services that should be highly available, but this applies only to services running on the cluster itself. Systems external to Sentinel, such as event sources, must be configured separately to be as available as the organization requires, and must also be configured to properly handle situations where Sentinel is unavailable for some period of time, such as during a failover event. If access rights are tightly restricted, for example if authenticated sessions are used to send or receive data between the third-party system and Sentinel, the third-party system must be configured to accept sessions from, or initiate sessions to, any cluster node (configure Sentinel with a virtual IP for this purpose).
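
For example, an event source that forwards events over TCP should reconnect and retry, rather than discard data, when its connection drops during a failover. The following is a minimal sketch of that behavior; the virtual IP address, port number, and send_event function are illustrative placeholders, not Sentinel defaults.

    import socket
    import time

    VIRTUAL_IP = "10.0.0.100"   # placeholder for the cluster's virtual IP
    PORT = 1468                 # placeholder for the collection port in use

    def send_event(line, retries=30, delay=10):
        # Retry for up to retries * delay seconds so that events sent during
        # a failover are delayed rather than lost.
        for _ in range(retries):
            try:
                with socket.create_connection((VIRTUAL_IP, PORT), timeout=5) as conn:
                    conn.sendall(line.encode("utf-8") + b"\n")
                    return True
            except OSError:
                # Sentinel may be unavailable while the cluster fails over.
                time.sleep(delay)
        return False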

A.1.2 Shared Storage

All HA clusters require some form of shared storage so that application data can be quickly moved from one cluster node to another, in case of a failure of the originating node. The storage itself should be highly available; this is usually achieved by using Storage Area Network (SAN) technology connected to the cluster nodes using a Fibre Channel network. Other systems use Network Attached Storage (NAS), iSCSI, or other technologies that allow for remote mounting of shared storage. The fundamental requirement of the shared storage is that the cluster can cleanly move the storage from a failed cluster node to a new cluster node.

NOTE: For iSCSI, use the largest Message Transfer Unit (MTU) supported by your hardware. Larger MTUs benefit storage performance. Sentinel might experience issues if storage latency is higher or bandwidth is lower than recommended.

There are two basic approaches Sentinel can use for the shared storage. The first locates all components (application binaries, configuration, and event data) on the shared storage. On failover, the storage is unmounted from the primary node and moved to the backup node, which loads the entire application and configuration from the shared storage. The second approach stores the event data on shared storage, while the application binaries and configuration reside on each cluster node. On failover, only the event data moves to the backup node.

Each approach has advantages and disadvantages, but the second approach allows the Sentinel installation to use standard FHS-compliant install paths, permits verification of the RPM packaging, and supports warm patching and reconfiguration to minimize downtime.

This solution guides you through an example installation on a cluster that uses iSCSI shared storage and keeps the application binaries and configuration on each cluster node.

A.1.3 Service Monitoring

A key component of any highly available environment is a reliable, consistent way to monitor the resources that should be highly available, along with any resources they depend on. The SLE HAE uses a component called a Resource Agent to perform this monitoring; the Resource Agent's job is to report the status of each resource and, when asked, to start or stop that resource.

Resource Agents must provide a reliable status for monitored resources in order to prevent unnecessary downtime. False positives (when a resource is deemed to have failed but would in fact recover on its own) can cause service migration (and related downtime) when it is not actually necessary, and false negatives (when the Resource Agent reports that a resource is functioning when in fact it is not operating properly) can prevent proper use of the service. On the other hand, external monitoring of a service can be quite difficult: a web service port might respond to a simple ping, for example, but might not provide correct data when a real query is issued. In many cases, self-test functionality must be built into the service itself to provide a truly accurate measurement.
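
To illustrate the contract a Resource Agent fulfills, the sketch below shows the general shape of an OCF-style agent (OCF agents can be any executable, including a Python script). The action name arrives as the first argument, and the standard OCF exit codes (0 for success, 7 for a cleanly stopped resource, 1 for a generic error) report the result. This is a simplified illustration under those assumptions, not the Resource Agent shipped with this solution; the start_sentinel, stop_sentinel, and sentinel_is_running helpers are hypothetical placeholders.

    import sys

    # Standard OCF exit codes (subset)
    OCF_SUCCESS = 0        # action succeeded / resource is running
    OCF_ERR_GENERIC = 1    # generic failure
    OCF_NOT_RUNNING = 7    # resource is cleanly stopped

    def start_sentinel():
        # Placeholder: start the Sentinel service on this node.
        return True

    def stop_sentinel():
        # Placeholder: stop the Sentinel service on this node.
        return True

    def sentinel_is_running():
        # Placeholder: probe the service (see the port-probe sketch below).
        return True

    def main():
        action = sys.argv[1] if len(sys.argv) > 1 else "monitor"
        if action == "start":
            sys.exit(OCF_SUCCESS if start_sentinel() else OCF_ERR_GENERIC)
        if action == "stop":
            sys.exit(OCF_SUCCESS if stop_sentinel() else OCF_ERR_GENERIC)
        if action == "monitor":
            sys.exit(OCF_SUCCESS if sentinel_is_running() else OCF_NOT_RUNNING)
        sys.exit(OCF_ERR_GENERIC)

    if __name__ == "__main__":
        main()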

This solution provides a basic OCF Resource Agent for Sentinel that can monitor for major hardware, operating system, or Sentinel system failures. At this time, the external monitoring capabilities for Sentinel are based on IP port probes, so there is some potential for false positive and false negative readings. We plan to enhance both Sentinel and the Resource Agent over time to increase the accuracy of this component.
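
As a concrete illustration of the port probe's limits, the minimal sketch below (using a placeholder host and port) only confirms that something is accepting TCP connections on that port; it cannot confirm that Sentinel is actually processing events, nor distinguish a transient refusal from a real failure.

    import socket

    def port_is_open(host, port, timeout=5):
        # Return True if a TCP connection to host:port succeeds.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # A successful connection does not prove Sentinel is processing events
    # (the probe can miss a real failure), and a transient refusal does not
    # prove Sentinel has failed (the probe can trigger an unnecessary failover).
    print(port_is_open("localhost", 8443))  # placeholder host and port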

A.1.4 Fencing

Within an HA cluster, critical services are constantly monitored and restarted automatically on other nodes in the case of failure. This automation can introduce problems, however, if a communication problem occurs with the primary node: although the service running on that node appears to be down, it in fact continues to run and write data to the shared storage. In this case, starting a new set of services on a backup node could easily cause data corruption.

Clusters use a variety of techniques collectively called fencing to prevent this from happening, including Split Brain Detection (SBD) and Shoot The Other Node In The Head (STONITH). The primary goal is to prevent data corruption on the shared storage.