35.3 Service Monitoring

A key component of any highly available environment is a reliable, consistent way to monitor the resources that should be highly available, along with any resources that they depend on. The SLE HAE uses a component called a Resource Agent to perform this monitoring. The Resource Agent's job is to report the status of each resource and, when asked, to start or stop that resource.

Resource Agents must report a reliable status for monitored resources in order to prevent unnecessary downtime. False positives (when a resource is deemed to have failed, but would in fact recover on its own) can cause service migration, and the related downtime, when it is not actually necessary. False negatives (when the Resource Agent reports that a resource is functioning when in fact it is not operating properly) can prevent proper use of the service. However, accurate external monitoring of a service can be quite difficult. A web service port might respond to a simple connection probe, for example, but might not return correct data when a real query is issued. In many cases, self-test functionality must be built into the service itself to provide a truly accurate measurement.
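One common way to reduce false positives is to declare a failure only after several consecutive checks fail, so that a resource that recovers on its own is not migrated unnecessarily. The following Python sketch illustrates that idea only. It is not part of the Sentinel Resource Agent, and the function name `is_healthy` and its parameters are hypothetical.

```python
import time
from typing import Callable


def is_healthy(check: Callable[[], bool], attempts: int = 3,
               delay_seconds: float = 1.0) -> bool:
    """Return True as soon as one check succeeds.

    Returns False only if every one of `attempts` checks fails.
    `check` is any caller-supplied probe (hypothetical; for example,
    a port probe or an application-level query).
    """
    for attempt in range(attempts):
        if check():
            return True
        # Wait before retrying, but not after the final attempt.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

Retrying trades detection speed for accuracy: a larger `attempts` value lowers the chance of a false positive but delays the reaction to a real failure. It does nothing for false negatives, which require a deeper check of the service itself.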

This solution provides a basic OCF Resource Agent for Sentinel that can monitor for major hardware, operating system, or Sentinel system failures. At this time, the external monitoring capabilities for Sentinel are based on IP port probes, so there is some potential for false positive and false negative readings. We plan to enhance both Sentinel and the Resource Agent over time to improve the accuracy of this component.
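To illustrate what an IP port probe involves, the following Python sketch attempts a TCP connection and maps the result to the standard OCF exit codes `OCF_SUCCESS` (0) and `OCF_NOT_RUNNING` (7). It is a minimal illustration and not the Resource Agent shipped with this solution. The function name `probe_port`, its parameters, and the choice to report any connection error as `OCF_NOT_RUNNING` are assumptions.

```python
import socket

# Exit codes defined by the OCF Resource Agent API.
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


def probe_port(host: str, port: int, timeout_seconds: float = 5.0) -> int:
    """Try to open a TCP connection to host:port.

    Returns OCF_SUCCESS if the connection is accepted and
    OCF_NOT_RUNNING if it is refused or times out. The mapping of
    every connection error to OCF_NOT_RUNNING is an assumption
    made for this sketch.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return OCF_SUCCESS
    except OSError:
        return OCF_NOT_RUNNING
```

A probe of this kind confirms only that something is listening on the port. It cannot tell whether the service behind the port returns correct data, which is why both false positive and false negative readings remain possible.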