6.2 Architecting a High Availability Solution

A standalone system with some fault tolerance, such as a single Operations Center server on a RAID volume with a UPS, cannot achieve 99.999% availability. Any single failure (software, motherboard, network card, and so on) makes the service unavailable to end users until the problem is resolved. In addition, in an Operations Center integration, the management systems that Operations Center integrates with are likely not highly available either.

Achieving five nines with little or no maintenance window, while still being able to upgrade systems or deploy new features, is a challenge on a single system. Therefore, HA could require one or more of the following:

  • Dual (or more) servers for Operations Center (physical or clustered)

  • Dual (or more) management systems (OV, Netcool, and so on; physical or clustered)

  • Dual (or more) backend applications (help desk, knowledge base, and so on)

  • Potentially dual networking components

Gathering specific requirements and reviewing budget constraints are both required steps for success. A system that can be down for 20 minutes of unscheduled outage in a week has vastly different budgetary requirements from one that can be down for only 20 minutes in an entire month.
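To make the difference concrete, the following minimal sketch converts the two outage allowances above into availability percentages and shows the yearly downtime budget implied by five nines. The function names and the 30-day month are assumptions made for illustration only.

```python
MINUTES_PER_WEEK = 7 * 24 * 60             # 10,080
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60    # 43,200 (assumes a 30-day month)
MINUTES_PER_YEAR = 365 * 24 * 60           # 525,600


def availability_percent(downtime_minutes, period_minutes):
    """Percentage of the period during which the service is up."""
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


def downtime_budget_minutes(target_percent, period_minutes):
    """Minutes of downtime allowed in the period at the given target."""
    return period_minutes * (1.0 - target_percent / 100.0)


weekly = availability_percent(20, MINUTES_PER_WEEK)
monthly = availability_percent(20, MINUTES_PER_30_DAY_MONTH)
five_nines_budget = downtime_budget_minutes(99.999, MINUTES_PER_YEAR)

print(f"20 min/week  -> {weekly:.3f}% available")
print(f"20 min/month -> {monthly:.3f}% available")
print(f"99.999% allows about {five_nines_budget:.2f} minutes of downtime per year")
```

Twenty minutes per week works out to roughly 99.80% availability, and twenty minutes per month to roughly 99.95%. Both fall well short of five nines, which allows only about five minutes of downtime in an entire year.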

6.2.1 Hardware Versus Software Solutions

High availability solutions can be implemented using hardware, software, or both. Some solutions are provided directly by the hardware manufacturers, while other technologies are add-ons from third parties (hardware and/or software).

  • Clusters: Multiple servers are configured identically and provide the same type of service, with a front-door component that makes them appear as a single server. In this configuration, only one server provides the service, and the secondary server takes over if the first one fails. End users do not need to reconfigure or reconnect to the server or service, and it often appears as if the service never failed. Cluster solutions can be implemented with hardware or software: the software-based solution typically relies on physical servers being configured, while the hardware-based solution relies on an individual server appearing as multiple servers.

  • Load Balancers: Load balancers are commonly used to distribute end-user load evenly across multiple servers. Smart technologies monitor the amount of resources being used on each server at any given time and automatically direct users to the server with the lowest utilization rate. Although not necessarily a requirement for HA or FT, a load balancer can be a useful addition that provides the best possible end-user experience.

  • Redirectors: Sometimes built into load balancers, redirector technology is used to implement software-based clustering, although a redirector can also be a physical box. Redirectors send existing or new users to other available servers when a service is no longer available on a particular server.
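The selection rule described for load balancers and redirectors can be sketched as follows: skip servers whose service is unavailable, then choose the one with the lowest utilization. This is an illustrative sketch only, not how any particular product works; the Server class, the pick_server function, and the server names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    available: bool      # result of the most recent health check (assumed)
    utilization: float   # fraction of resources in use, 0.0 to 1.0 (assumed)


def pick_server(servers: List[Server]) -> Optional[Server]:
    """Return the available server with the lowest utilization, if any."""
    candidates = [s for s in servers if s.available]
    if not candidates:
        return None  # nothing to redirect to; the service is down
    return min(candidates, key=lambda s: s.utilization)


pool = [
    Server("oc-server-1", available=True, utilization=0.72),
    Server("oc-server-2", available=True, utilization=0.35),
    Server("oc-server-3", available=False, utilization=0.05),
]

chosen = pick_server(pool)
print(chosen.name if chosen else "no server available")
```

Here oc-server-3 reports the lowest utilization but is ignored because it is unavailable, so users are sent to oc-server-2.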

6.2.2 Availability Levels on Servers

When designing your HA environment, use the following descriptions to determine the availability level of a failover server:

  • Hot: The failover server is up and running with all data updated and ready to be used.

  • Warm: The failover server is up and running, but data is not current and an update is needed to synchronize the data or information.

  • Cold: The failover server is off (not running) and does not have any of the data.

Many clustering systems are Hot/Cold: when the production server process (hot) stops working, the cluster automatically activates the secondary/backup server process (cold) to take over. The drawback of a Hot/Cold solution is that the cold process starts from the ground up, as if the computer had just been turned on. If an application server typically takes 20 minutes to start up, the HA environment must allow for an outage of at least 20 minutes for any type of failure.

While Hot/Warm configurations are faster, they are still not transparent to end users because of the synchronization or update needed to bring the data current. Hot/Hot remains the best option available: although it does not always provide a seamless failover, it can come close to meeting the goal of no downtime.
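As a rough check on how failover time interacts with an availability target, the sketch below counts how many failovers of a given duration fit into a yearly downtime budget. The 20-minute figure is the cold start-up example used above; the function names and the list of targets are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def yearly_budget_minutes(target_percent):
    """Minutes of downtime allowed per year at the given availability target."""
    return MINUTES_PER_YEAR * (1.0 - target_percent / 100.0)


def failovers_within_budget(target_percent, failover_minutes):
    """Whole number of failovers per year that fit in the downtime budget."""
    if failover_minutes <= 0:
        raise ValueError("failover_minutes must be positive")
    return int(yearly_budget_minutes(target_percent) // failover_minutes)


COLD_START_MINUTES = 20  # start-up time from the Hot/Cold example

for target in (99.9, 99.99, 99.999):
    count = failovers_within_budget(target, COLD_START_MINUTES)
    print(f"{target}% target: {count} cold failover(s) per year fit in the budget")
```

Under these assumptions, a single 20-minute cold start already exceeds the yearly budget for five nines, which is why a Hot/Cold design cannot meet that target on its own.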

6.2.3 Example of an Implementation with Multiple Levels of Failover

Figure 6-1 is an example of a multisite, multi-integration implementation that has multiple levels of failover in order to provide HA and FT:

Figure 6-1 Multiple Levels of Failover

In this example, users access an operationscenter.myCompany.com URL. A redirector (not shown in Figure 6-1) sends the user to Operations Center Server 1 (left side). In the event of a failure, users are directed to Failover Operations Center Server 1. This achieves the first level of HA.

Both Operations Center servers are configured and running at the same time (Hot/Hot) with the same data because of their dual connections to the underlying management systems. This assumes that the underlying management systems are configured in the same manner: one Operations Center server can be configured with dual adapters into the primary and backup of each management system.
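The first level of failover in this example can be sketched as an ordered preference list: the redirector tries the primary server first and falls back to the failover server only when the primary fails its health check. This is a simplified illustration, not the behavior of any specific redirector; the choose_target function, the hostnames, and the health check are hypothetical.

```python
from typing import Callable, Optional, Sequence


def choose_target(
    servers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first server, in preference order, that passes its health check."""
    for server in servers:
        if is_healthy(server):
            return server
    return None  # every server failed its check


# Hypothetical hostnames standing in for Operations Center Server 1
# and Failover Operations Center Server 1.
preference_order = ["oc-server-1.example.com", "oc-failover-1.example.com"]

# Simulated health state; a real check might be an HTTP or TCP probe.
servers_down = {"oc-server-1.example.com"}

target = choose_target(preference_order, lambda s: s not in servers_down)
print(target)
```

Because both servers run Hot/Hot with the same data, the user who is redirected to the failover server sees the same information as on the primary.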

6.2.4 Integrating Processes

Architecting a solution is only one of the steps in achieving a highly available implementation. Processes must also be implemented around changes to the environment to ensure that changes are propagated, along with guidelines for where changes are made in general.

When implementing a Hot/Cold solution, the process of keeping configuration files, security settings, scripts, and so on up to date is relatively straightforward. Regular backups or copies from the primary (hot) server to the backup (cold) server can keep these files up to date.
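A regular copy from the primary to the backup server could be scripted along the following lines. This is a minimal sketch using only the Python standard library; the copy_to_standby function and the file names are hypothetical, and the demonstration uses temporary directories in place of real server paths.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_standby(primary_root, standby_root, relative_paths):
    """Copy each listed file or directory from the primary tree to the standby tree.

    Returns the relative paths that were actually copied.
    """
    copied = []
    for rel in relative_paths:
        src = Path(primary_root) / rel
        dst = Path(standby_root) / rel
        if not src.exists():
            continue  # nothing to copy; a real process would likely log this
        dst.parent.mkdir(parents=True, exist_ok=True)
        if src.is_dir():
            # dirs_exist_ok requires Python 3.8 or later
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(rel)
    return copied


# Demonstration with temporary directories standing in for the two servers.
with tempfile.TemporaryDirectory() as tmp:
    primary = Path(tmp) / "primary"
    standby = Path(tmp) / "standby"
    (primary / "scripts").mkdir(parents=True)
    (primary / "scripts" / "startup.sh").write_text("echo start\n")
    (primary / "security.conf").write_text("mode=strict\n")

    copied = copy_to_standby(
        primary, standby, ["scripts", "security.conf", "missing.conf"]
    )
    print(copied)
```

Note that this sketch only adds and overwrites files. It does not remove files from the standby that were deleted on the primary, so a production process would need to handle deletions as well.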