1.1. High Availability and Linux FailSafe

In mission-critical computing, the availability of information and computing resources is extremely important. The availability of a system is affected by how long it is unavailable after a failure in any of its components. Different degrees of availability are provided by different types of systems:

The Silicon Graphics® Linux FailSafe product is a general facility for providing highly available services on a cluster that contains multiple nodes (N-node configuration). Using Linux FailSafe, you can configure a highly available system in any of the following topologies:

These configurations provide redundancy of processors and I/O controllers. Storage redundancy can be obtained either through multi-hosted RAID disk devices and mirrored disks or through redundant disk systems that are kept in synchronization.

If one of the nodes in the cluster or one of the nodes' components fails, a different node in the cluster restarts the highly available services of the failed node. To clients, the services on the replacement node are indistinguishable from the original services before the failure occurred. It appears as if the original node crashed and rebooted quickly, so clients notice only a brief interruption in the highly available service.

In a Linux FailSafe highly available system, nodes can serve as backup for other nodes. Unlike the backup resources in a fault-tolerant system, which serve purely as redundant hardware for backup in case of failure, the resources of each node in a highly available system can be used during normal operation to run other applications that are not necessarily highly available services. All highly available services are owned and accessed by one node at a time.

Highly available services are monitored by the Linux FailSafe software. During normal operation, if a failure is detected on any monitored component, a failover process is initiated. This process consists of resetting the failed node (to ensure data consistency), performing any recovery required by the failed-over services, and quickly restarting the services on the node that takes them over. Using Linux FailSafe, you can define a failover policy that establishes which node takes over the services and under what conditions.
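The failover sequence described above can be illustrated with a minimal Python sketch. It is not Linux FailSafe code: the FailoverPolicy class, the fail_over function, and the idea of a policy as a simple ordered list of candidate nodes are all assumptions made for illustration. The sketch only records the steps it would take, in the order the text gives them (reset, recover, restart).

```python
from dataclasses import dataclass


@dataclass
class FailoverPolicy:
    """Hypothetical policy: candidate nodes, most preferred first."""
    ordered_nodes: list

    def pick_target(self, failed_node, live_nodes):
        # Choose the first preferred node that is alive and is not the failed one.
        for node in self.ordered_nodes:
            if node != failed_node and node in live_nodes:
                return node
        return None


def fail_over(services, failed_node, live_nodes, policy, log):
    """Record the failover steps in `log` and return the takeover node."""
    target = policy.pick_target(failed_node, live_nodes)
    if target is None:
        log.append("no node available for takeover")
        return None
    # Step 1: reset the failed node to ensure data consistency.
    log.append(f"reset {failed_node}")
    # Step 2: perform any recovery the failed-over services require.
    for service in services:
        log.append(f"recover {service}")
    # Step 3: restart the services on the node that takes them over.
    for service in services:
        log.append(f"start {service} on {target}")
    return target
```

In this sketch the policy decides only which node takes over; the conditions under which a real failover policy applies are not modeled.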

Linux FailSafe supports selective failover, in which individual highly available applications can be failed over to a backup node independently of the other highly available applications on that node.
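As a rough illustration of selective failover, the following Python sketch models ownership as a mapping from application name to owning node and moves a single application while leaving the others in place. The selective_failover function and the ownership mapping are hypothetical names used only for this example; they do not correspond to any Linux FailSafe interface.

```python
def selective_failover(ownership, application, backup_node):
    """Return a new ownership map with only `application` moved.

    `ownership` maps application name -> owning node (hypothetical model).
    The input mapping is left unchanged.
    """
    if application not in ownership:
        raise KeyError(f"unknown application: {application}")
    updated = dict(ownership)
    updated[application] = backup_node
    return updated
```

For example, starting from {"apache": "node1", "nfs": "node1"} and failing over only "apache" to "node2" leaves "nfs" owned by "node1".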

Linux FailSafe highly available services fall into two groups: highly available resources and highly available applications. Highly available resources include network interfaces, logical volumes, and filesystems such as ext2 or reiserfs that have been configured for Linux FailSafe. Silicon Graphics has also developed Linux FailSafe NFS. Highly available applications can include NFS, Apache, and others.

Linux FailSafe provides a framework for making additional applications into highly available services. If you want to add highly available applications to a Linux FailSafe cluster, you must write scripts to handle application monitoring functions. Developing these scripts is described in the Linux FailSafe Programmer's Guide. If you need assistance in this regard, contact SGI Global Services, which offers custom Linux FailSafe agent development and HA integration services.
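To give a sense of what an application monitoring function does, here is a minimal Python sketch of a health probe that checks whether a TCP service accepts connections. This is an assumption-laden illustration, not the script interface Linux FailSafe expects: the function name monitor_tcp_service, its parameters, and the 0/1 return convention are hypothetical. The actual script conventions are defined in the Linux FailSafe Programmer's Guide.

```python
import socket


def monitor_tcp_service(host, port, timeout=2.0):
    """Hypothetical probe: return 0 if the service accepts a TCP
    connection within `timeout` seconds, 1 otherwise."""
    try:
        # create_connection raises OSError (or a subclass) on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return 0
    except OSError:
        return 1
```

A successful connection shows only that something is listening on the port. A production monitor would usually also verify that the application responds correctly.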