6.1 Introduction to Availability, Fault Tolerance, and Redundancy

High Availability (HA) and Fault Tolerance (FT) are terms used to convey the importance of the total amount of time that an application is up and running. Availability requirements might cover specific days and times, or they might be 24 hours a day, 7 days a week. In addition, there is usually a percentage of uptime that must be met, such as “five nines” (99.999%) or “three nines” (99.9%).
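
To make the “nines” concrete, the following minimal sketch converts an uptime percentage into the amount of downtime it permits. The function name and the default one-year period are illustrative assumptions, not part of Operations Center. The period can be shortened when the requirement covers only specific days and times.

```python
# Illustrative helper (hypothetical name): downtime budget for an uptime target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime permitted within the service period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_minutes * (1 - uptime_percent / 100)


# 24x7 service measured over one year.
for target in (99.9, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")

# Assumed example of a limited service window: 10 hours a day, 5 days a week.
business_minutes = 10 * 60 * 5 * 52
print(f"{allowed_downtime_minutes(99.9, business_minutes):.0f} minutes per year")
```

Under these assumptions, three nines leaves roughly 8.76 hours of downtime per year, while five nines leaves only about 5.26 minutes.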

While the cost of achieving the goal rises with the number of “nines” required, specific cost calculations aren’t available because HA and FT solutions differ from application to application. However, because implementations are the same in concept, take the following points into account when planning your solution:

  • Fault Tolerance: This design allows the system to continue operating, possibly at reduced levels, when specific technologies fail.

    For example, suppose a single disk fails in a RAID 5 volume set. RAID 5 allows a single disk to fail while the entire volume continues to operate without data loss. If a second disk in the same volume fails before the first is replaced and rebuilt, the volume is no longer available and the applications fail. At this point, the recovery time is the time required to replace two drives, restore the file system, and bring the server back online.

    Requirements dictate the level of redundancy necessary to achieve high levels of Availability. Configuring mirrored drives is good; configuring mirrored RAID drives is better. Configuring dual servers with mirrored RAID drives could be better still, but at double the cost.

    Redundancy is not just servers and hardware; it also touches the networking components, other applications in the environment, and possibly even other companies that are providing a service to you. The Availability requirements might dictate that all of it be protected, or it might put “fences” around specific technologies that must be protected.

  • High Availability: This term typically refers to the running state of an application and generally treats the whole system as a single entity that must be available.

    Dual Web servers providing the same Web-based applications make an application more available if something on one of the servers (or the server itself) fails in a way that stops the application from working. Users then use the secondary server while the primary is being repaired. Providing dual Web servers is the first step toward High Availability. The next step is delivering the Web service so that users do not notice that a server failed and that they are now running on the backup.

  • Downtime: Refers to periods of time when the system is not available.

    Planned downtime is the result of a logical, management-initiated event, usually maintenance that cannot be avoided. Unplanned downtime typically arises from some physical event, such as a hardware or software failure. Planned downtime is often excluded from availability calculations because it is scheduled for times when it has little or no impact on users. In addition, if a backup server can be used during these downtime periods, users might not even be aware that the server was down.
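
The effect of excluding planned downtime from an availability calculation can be shown with a short sketch. The function name, its parameters, and the sample figures (a 30-day month with 4 hours of planned maintenance and 1 hour of unplanned outage) are assumptions chosen for illustration; your own service agreement defines which convention applies.

```python
# Illustrative helper (hypothetical name): measured availability for a period.
def availability_percent(period_hours: float,
                         unplanned_hours: float,
                         planned_hours: float = 0.0,
                         exclude_planned: bool = True) -> float:
    """Return availability as a percentage of the measured service time."""
    if exclude_planned:
        # Planned maintenance is removed from the measured service time.
        scheduled = period_hours - planned_hours
        down = unplanned_hours
    else:
        # Planned maintenance counts as downtime like any other outage.
        scheduled = period_hours
        down = unplanned_hours + planned_hours
    if scheduled <= 0 or down < 0 or down > scheduled:
        raise ValueError("downtime must fit within the measured period")
    return 100 * (scheduled - down) / scheduled


month = 30 * 24  # 720 hours
print(f"{availability_percent(month, 1, 4):.3f}%")                         # planned excluded
print(f"{availability_percent(month, 1, 4, exclude_planned=False):.3f}%")  # planned included
```

With these sample figures, the same month reports about 99.860% when planned downtime is excluded and about 99.306% when it is counted, which is why the convention should be agreed on before an uptime target is set.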

The following technologies and Operations Center products commonly have HA/FT requirements:

  • Operations Center Server

  • Management Systems integrated with Operations Center

  • Databases integrated with Operations Center

  • Secondary applications and/or tools launched or integrated with Operations Center, such as Help Desk software, Knowledgebase systems, Reporting software, and so on

  • Networking components (LAN and/or WAN)
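
Because the application depends on every item in a list like this one, the availability that users experience is limited by the whole chain, not by any single server. The sketch below uses a common reliability approximation and is not an Operations Center formula. It assumes that failures are independent, that failover between redundant copies is instantaneous and always succeeds, and that one working copy is enough. The function names and all availability figures are hypothetical.

```python
from math import prod


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` identical units when any one of them suffices.

    Assumes independent failures and perfect failover.
    """
    if not 0 <= single <= 1 or copies < 1:
        raise ValueError("single must be in [0, 1] and copies must be >= 1")
    return 1 - (1 - single) ** copies


def chain_availability(components) -> float:
    """Availability of a chain in which every component must be up.

    `components` is an iterable of (single_unit_availability, copies) pairs.
    """
    return prod(redundant_availability(a, n) for a, n in components)


# Hypothetical figures for illustration only.
stack = [
    (0.999, 2),   # application server, two redundant copies
    (0.999, 1),   # integrated database, single instance
    (0.9995, 1),  # network path, single link
]
print(f"End-to-end availability: {chain_availability(stack) * 100:.3f}%")
```

In this sketch, doubling the application server barely improves the end-to-end figure because the single database and network link still cap it. This illustrates why availability requirements often extend to every component in the chain, or explicitly fence off the ones that must be protected.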