This document describes the architecture and functional specification of Linux FailSafe. We provide a brief overview of the layered high availability (HA) infrastructure architecture and specify the functionality of the Linux FailSafe product that is layered over this architecture.
[Figure: layered architecture, showing the Cluster Manager GUI & CLI, the Cluster Infrastructure (CHAOS), and the Cluster Administration Services (CAS)]
Linux FailSafe and related high availability products share a common infrastructure layer called the Clustered High Availability Operating Services (CHAOS), which provides the support needed to build high availability solutions. All CHAOS layers reside in user space.
The cluster infrastructure layer comprises the Cluster Membership Services (CMS), the Group Communication Services (GCS), the System Resource Manager (SRM), and the Cluster Reset Services (CRS).
In addition, Linux FailSafe has an administration layer - Cluster Administration Services (CAS) - made up of the Cluster Administration Daemon (CAD), the Cluster Configuration Database Daemon and the Cluster Configuration Database (CDB).
Cluster Membership Services (CMS) is the lowest level of the CHAOS infrastructure, providing cluster node membership to upper layers and offering its clients the up/down status of nodes within a cluster. CMS contains a membership server and a library that defines the interface to the clients. There is a membership server per node in the cluster. These servers monitor each other by exchanging heartbeat messages and are responsible for providing the clients with the information about the status of nodes. They are also responsible for stability of membership views and the agreement of membership history. The membership service also has an I/O clean property - nodes that are not members will not be allowed to access the shared resources in the cluster.
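As a rough illustration of how heartbeat exchange can yield an up/down status per node, the following is a minimal sketch. The class name, the timeout parameter, and the use of plain timestamps are assumptions made for this example and are not the CMS interface; the real service also ensures agreement on membership history and the I/O clean property, which this sketch omits.

```python
from dataclasses import dataclass, field


@dataclass
class MembershipView:
    """Hypothetical per-node view built from received heartbeats."""
    timeout: float                                   # seconds of silence before a node counts as down
    last_seen: dict = field(default_factory=dict)    # node name -> time of last heartbeat

    def record_heartbeat(self, node: str, now: float) -> None:
        # Remember when we last heard from this peer.
        self.last_seen[node] = now

    def status(self, now: float) -> dict:
        # A node is "up" if its last heartbeat is recent enough, else "down".
        return {
            node: ("up" if now - seen <= self.timeout else "down")
            for node, seen in self.last_seen.items()
        }


view = MembershipView(timeout=5.0)
view.record_heartbeat("A", now=0.0)
view.record_heartbeat("B", now=8.0)
print(view.status(now=10.0))
```

Timestamps are passed in explicitly so the logic can be exercised without waiting on a real clock.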
Group Communication Services (GCS) is a distributed service layered on top of Cluster Membership Services (CMS). GCS provides an abstraction of process groups: collections of processes distributed across a cluster, cooperating to provide a distributed application service. GCS presents clients with:
Distributed applications use GCS for notification of the normal termination and abnormal failure of individual application instances running on a cluster node. While distributed applications must still undertake the task of instance recovery and reintegration, GCS relieves the application of the task of monitoring instances and their liveness in a distributed environment.
System Resource Manager (SRM) manages resources across the cluster. It uses Group Communication Services for communication and to provide exclusive access to the resources. SRM contains detailed information pertinent to managing system resources such as disks, volumes, file systems, network adapters, IP addresses and other user-defined resources. The SRM module is responsible for starting, stopping and monitoring HA resources in the cluster. It also attempts to restart resources locally on a node before attempting to move the resources or resource groups to another node in the cluster.
SRM also provides the user with an interface to add more resource types to the cluster.
Linux FailSafe is a policy module. It is the policy module that instructs the System Resource Manager to allocate and release resources in a cluster. It makes allocation decisions on the resource groups in the event of failures or changes in the cluster configuration.
Linux FailSafe allows users to provide failover policy modules for resource groups.
The Cluster Administration Services supports the Graphical User Interface (GUI) and the Command Line Interface (CLI) allowing users to administer, configure and monitor CHAOS and other applications controlled by FailSafe. All configuration information is stored in a configuration database (CDB).
The Cluster Configuration Database (CDB) is made up of two parts:
The Cluster Configuration Database is replicated on a set of nodes and provides transaction semantics for read-only and update operations. If a node goes down or the configuration database on a node becomes obsolete, it gets a new copy of the database when it joins the database membership.
FailSafe's continuing goal is to provide highly available services in a client/server environment. In general, this means that the client side of a client/server application will have uninterrupted service while nodes in the server cluster have changing availability states. A node is a set of hardware which runs one copy of the operating system. In past architectures, a node would have been synonymous with some physical unit. This is no longer necessarily the case, since a physical unit may now be divided into multiple nodes.
A server is a collection of nodes which make up an array or cluster. Services are provided by the server component of an application. In the remainder of this document, we refer to these components simply as applications. An application is essentially a set of threads (or processes) executing multiple tasks to provide a single service.
FailSafe provides the ability to monitor an application. This means that the FailSafe product performs a periodic operation to monitor the health of a specified application. Monitoring can be enabled or disabled dynamically. The operation can be intrusive, such as a database query, or non-intrusive, such as checking the existence of a process (pid).
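A non-intrusive check of the kind mentioned above can be sketched as follows. This is an illustrative example rather than FailSafe's monitor implementation; the function name is hypothetical, and it assumes a POSIX system, where sending signal 0 tests for a process without delivering anything to it.

```python
import os


def process_exists(pid: int) -> bool:
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)          # signal 0: existence and permission check only
    except ProcessLookupError:
        return False             # no such process
    except PermissionError:
        return True              # process exists but is owned by another user
    return True


print(process_exists(os.getpid()))
```

An intrusive monitor, such as a database query, would replace the body of the check with a request to the application itself.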
FailSafe also provides the ability to fail over an application. In other words, if a FailSafe monitor of an application reports a failure, FailSafe will clean up and restart the application as defined by the configuration parameters. This process is defined as a failover.
An application failover domain (AFD) is a set of nodes (possibly a subset of the N-node cluster) in which a given application can execute across failures. In the simplest case, an application starts executing on a single node A and can fail over to another node B if node A fails. The application failover domain is defined to be nodes A and B even if another node C is present in the cluster. To fail over, an application will usually require appropriate connectivity to storage, network and other devices across failures. Hence, many of the devices in a FailSafe environment will require shared, but not necessarily concurrent, access to each node within an AFD. These shared devices are called FailSafe devices or resources.
The CHAOS infrastructure will support up to 16 nodes in a cluster. A star configuration of a cluster is also possible. This means that a single node will be allowed to be backup for more than one node. Also, there could be M failure domains in an N-node configuration, where M can be from 1 to N.
Local failover of an application is defined as the act of failing over an application on the same node after detecting a failure through the local monitor check. Failover may require clean up of the application before restarting. This feature allows Linux FailSafe to support a single node configuration (single node FailSafe).
In a multi-node configuration, FailSafe can be configured to perform X local failovers prior to failing over an application to another node, where X (>= 0) is configurable on a per application basis. From this point forward in this specification, we assume that an application failure implies that local restart attempts have failed.
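The escalation from local restarts to a failover on another node can be sketched as below. The function and its callback parameters are hypothetical names chosen for this example; only the rule itself (X local attempts, then move) comes from the text.

```python
def handle_failure(restart_locally, fail_over, max_local_restarts: int) -> str:
    """Try up to max_local_restarts local restarts, then fail over to another node.

    restart_locally: callable returning True if the local restart succeeded.
    fail_over: callable that moves the application to another node.
    """
    for _ in range(max_local_restarts):
        if restart_locally():
            return "restarted locally"
    # All local attempts failed, or X was configured as 0.
    fail_over()
    return "failed over"


# Example: X = 2, and every local restart attempt fails.
print(handle_failure(lambda: False, lambda: None, max_local_restarts=2))
```

With `max_local_restarts=0` the loop body never runs, so the application is moved to another node immediately.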
Rather than causing an entire node to failover when a failure is detected, Linux FailSafe allows a specific application to be failed over from Node 1 to Node 2, while other FailSafe applications continue to run on Node 1 (if possible). We define this process as fine grain failover.
For instance, take two FailSafe applications, A and B, running on Node 1. Application A fails and cannot be restarted locally. FailSafe must now attempt to restart application A on another node. If application A has no resources in common with application B, FailSafe will restart application A on another node while application B continues to execute on Node 1. If applications A and B have common resources, an administrator must decide whether to fail over both applications or to leave application A not running. The decision is communicated to FailSafe through configuration information.
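The decision in this example can be expressed as a small planning function. This is a sketch only: the function name, the `move_together` flag standing in for the administrator's configured choice, and the representation of resources as sets of names are all assumptions made for illustration.

```python
def plan_failover(failed: str, resources: dict, move_together: bool) -> list:
    """Return the applications to move when `failed` cannot be restarted locally.

    resources: application name -> set of resource names it uses.
    move_together: hypothetical config flag; if True, applications that share
    resources with the failed one are moved with it, otherwise nothing moves.
    """
    sharing = sorted(
        app for app, res in resources.items()
        if app != failed and res & resources[failed]
    )
    if not sharing:
        return [failed]                      # fine grain failover: only the failed app moves
    return [failed] + sharing if move_together else []


# A and B have no resources in common: only A moves.
print(plan_failover("A", {"A": {"fs1", "ip1"}, "B": {"fs2"}}, move_together=False))
# A and B share fs1 and the administrator chose to move both.
print(plan_failover("A", {"A": {"fs1"}, "B": {"fs1"}}, move_together=True))
```

An empty result corresponds to leaving application A not running.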
Fine grain failover can be used by system administrators to perform load balancing.
With Linux FailSafe, an administrator is allowed to specify a list of nodes which represent a failover ordering. The act of following this order is called ordered failover. Given an ordered list of nodes, FailSafe will make sure an HA application is executing on one of the nodes, using the first node in the list which can satisfy execution requirements. Each ordered list also represents an application failover domain. We refer to the first node in an ordered list as an application's primary node.
Suppose multiple HA applications are primary to node A and the local monitor of one of them fails. Instead of failing over all FailSafe resources owned by node A and all the HA applications primary to node A, FailSafe allows a priority to be specified for these applications. As a result, an application that would normally move to another node because of a failure might not be restarted on any node in its ordered list.
Given two HA applications, Hi and Low, where Hi has a higher priority than Low and both share a filesystem, a failure of Hi or of any resource on which Hi depends causes Hi and Low to be moved to another node. Any FailSafe resource used by Hi or Low would need to change ownership. However, if application Low were to fail, FailSafe would not attempt to move either Hi or Low to another node because Hi has a higher priority and is still able to execute properly. This feature is called selective failover.
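One way to read the Hi/Low rule is sketched below: applications that share resources move only if no higher-priority member of the group is still running properly. The function name and the numeric priority encoding (larger means higher priority) are assumptions for this example, not FailSafe configuration syntax.

```python
def should_move_group(failed_app: str, priorities: dict, healthy: set) -> bool:
    """Decide whether applications sharing resources should move to another node.

    priorities: application name -> priority (larger number = higher priority).
    healthy: applications in the sharing group that are still running properly.
    """
    # If any healthy application outranks the failed one, leave the group in place.
    return not any(
        priorities[app] > priorities[failed_app]
        for app in healthy
        if app != failed_app
    )


priorities = {"Hi": 2, "Low": 1}
print(should_move_group("Hi", priorities, healthy={"Low"}))   # Hi failed: move both
print(should_move_group("Low", priorities, healthy={"Hi"}))   # Low failed: stay put
```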
FailSafe also allows for a concept called cascading failures. For example, given an ordered list of nodes, A, B & C, for an HA application, FailSafe will first attempt to start the application on node A. If that fails, FailSafe will attempt to use node B. If that fails and node A is still not an alternative, FailSafe will attempt to start the application on node C. In other words, FailSafe will handle certain cases of multiple failures, assuming that enough connectivity is present to continue down an ordered list.
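The walk down an ordered list, including the cascading case, can be sketched as follows. The function and its two callbacks are hypothetical; how FailSafe actually determines node availability and start success is not specified here.

```python
def start_on_first_available(ordered_nodes, is_available, try_start):
    """Walk the ordered list and return the node where the application started.

    is_available: callable(node) -> bool, e.g. node is up and has connectivity.
    try_start: callable(node) -> bool, True if the application started there.
    Returns None if no node in the list could run the application.
    """
    for node in ordered_nodes:
        if is_available(node) and try_start(node):
            return node
    return None


# Node A is down, the start on B fails, so the application ends up on C.
available = {"A": False, "B": True, "C": True}
start_ok = {"B": False, "C": True}
print(start_on_first_available(["A", "B", "C"], available.get, lambda n: start_ok[n]))
```

Because the walk always begins at the head of the list, the first node, which is the application's primary node, is preferred whenever it is available again.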
This feature allows FailSafe monitoring of any HA application to be turned on and off while FailSafe continues to run. In Linux FailSafe, when the local monitoring of an application reports a failure, that event is escalated according to configurable parameters set by the system administrator.
If an application's monitoring is turned off, failures caught by that monitoring will be filtered out. Other failures, like a lost heartbeat or failure in another application, would still be caught by FailSafe. Those failures could still cause application failover to occur.
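The filtering rule can be sketched as a predicate over failure events. The event fields (`kind`, `app`) and the function name are assumptions introduced for this example; they do not describe FailSafe's internal event format.

```python
def is_acted_on(event: dict, monitoring_enabled: dict) -> bool:
    """Return True if FailSafe should react to this failure event.

    event: {"kind": ..., "app": ...}; "local_monitor" marks a failure caught by
    an application's own monitor (hypothetical encoding).
    monitoring_enabled: application name -> whether its monitoring is on.
    """
    if event["kind"] == "local_monitor":
        # Filtered out when monitoring for that application is switched off.
        return monitoring_enabled.get(event["app"], True)
    # Other failures, such as a lost heartbeat, are always handled.
    return True


enabled = {"db": False, "web": True}
print(is_acted_on({"kind": "local_monitor", "app": "db"}, enabled))
print(is_acted_on({"kind": "lost_heartbeat", "app": "db"}, enabled))
```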
The main purpose of this feature is to allow on-line upgrades of HA applications while FailSafe continues to operate. In essence, an application is allowed to leave a cluster dynamically to undergo an upgrade or reconfiguration.
FailSafe allows resources and resource groups to be added to the cluster while the cluster is providing other HA services. The resource configuration can be changed without impacting the availability of other resources.
Given a cluster of N nodes, the FailSafe architecture handles FailSafe software upgrades one node at a time while leaving the cluster mostly intact.
Before upgrading the FailSafe software on a node, say FOO, the administrator must migrate the HA applications running on FOO to another node, say BAR, in the AFD, so that the applications continue to be monitored for availability. Once this is done, FailSafe on node FOO can be shut down, upgraded, and restarted. All HA applications previously owned by FOO can then be migrated back to FOO.
A web-based GUI tool for setup and configuration is supported in the FailSafe product. A non-GUI interface is also provided. The GUI is an extension of the non-GUI interface and, as such, provides insight into the higher-level problems that FailSafe administrators are trying to solve. For example, the FailSafe heartbeat timing parameters have a single timeout value.
The remote node reset mechanism on which Cluster Membership Services (CMS) depends has been abstracted into another module. The Cluster Reset Services Daemon (CRSD) allows other pieces of cluster software to use the remote reset mechanism through a library interface. Separating out the remote reset functionality allows other mechanisms for enforcing the I/O clean property to be added easily.