To use Linux FailSafe, you must understand the concepts in this section.
A cluster node is a single Linux execution environment, that is, a single physical or virtual machine. In current Linux environments this is always an individual computer. For brevity, this guide uses the term node in this sense, not in other senses such as a network node.
A pool is the entire set of nodes having membership in a group of clusters. The clusters are usually close together and should always serve a common purpose. A replicated cluster configuration database is stored on each node in the pool.
A cluster is a collection of one or more nodes coupled to each other by networks or other similar interconnections. A cluster belongs to one and only one pool. A cluster is identified by a simple name; this name must be unique within the pool. A particular node may be a member of only one cluster. All nodes in a cluster are also in the pool; however, not all nodes in the pool are necessarily in the cluster.
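The membership rules above can be sketched as a small in-memory model. This is only an illustration of the constraints (unique cluster names within a pool, one cluster per node, cluster members drawn from the pool); the `Pool` class and its method names are hypothetical and are not part of any Linux FailSafe interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Pool:
    """Hypothetical model of a pool; not a Linux FailSafe API."""
    nodes: Set[str] = field(default_factory=set)
    clusters: Dict[str, Set[str]] = field(default_factory=dict)

    def add_cluster(self, name: str) -> None:
        # A cluster name must be unique within the pool.
        if name in self.clusters:
            raise ValueError(f"cluster name {name!r} is already used in this pool")
        self.clusters[name] = set()

    def add_node_to_cluster(self, node: str, cluster: str) -> None:
        # Every node in a cluster must also be in the pool.
        if node not in self.nodes:
            raise ValueError(f"node {node!r} is not in the pool")
        # A particular node may be a member of only one cluster.
        for other, members in self.clusters.items():
            if other != cluster and node in members:
                raise ValueError(f"node {node!r} already belongs to cluster {other!r}")
        self.clusters[cluster].add(node)
```

A pool may contain nodes that belong to no cluster at all, which is why `nodes` is tracked separately from the cluster memberships.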
A node membership is the list of nodes in a cluster on which Linux FailSafe can allocate resource groups.
A process membership is the list of process instances in a cluster that form a process group. There can be multiple process groups per node.
A resource is a single physical or logical entity that provides a service to clients or other resources. For example, a resource can be a single disk volume, a particular network address, or an application such as a web server. A resource is generally available for use over time on two or more nodes in a cluster, although it can only be allocated to one node at any given time.
Resources are identified by a resource name and a resource type. One resource can depend on one or more other resources; if so, it cannot start (that is, be made available for use) unless the resources it depends on are also started. The resources it depends on must be part of the same resource group and are identified in a resource dependency list.
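One way to picture the start constraint is as a topological ordering of the dependency lists. The sketch below uses Python's standard `graphlib` module; the resource names come from the WebGroup example in this section, but the specific dependency edges are assumed for illustration, and the code does not reflect how Linux FailSafe itself orders resource startup.

```python
from graphlib import TopologicalSorter

# Assumed dependency lists: each resource maps to the resources it depends on.
dependencies = {
    "web1": {"/fs1", "10.10.48.22"},
    "/fs1": {"vol1"},
    "vol1": set(),
    "10.10.48.22": set(),
}


def start_order(deps):
    """Return resources ordered so that each one follows everything it depends on."""
    return list(TopologicalSorter(deps).static_order())


order = start_order(dependencies)
```

In the resulting list, `vol1` precedes `/fs1`, and both `/fs1` and `10.10.48.22` precede `web1`. The relative order of resources that do not depend on each other is not significant.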
A resource type is a particular class of resource. All of the resources in a particular resource type can be handled in the same way for the purposes of failover. Every resource is an instance of exactly one resource type.
A resource type is identified by a simple name; this name should be unique within the cluster. A resource type can be defined for a specific node, or it can be defined for an entire cluster. A resource type definition for a specific node overrides a clusterwide resource type definition with the same name; this allows an individual node to override global settings from a clusterwide resource type definition.
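The override rule amounts to a two-level lookup: a node-specific definition wins when one exists, and the clusterwide definition is used otherwise. The following sketch assumes definitions are held in plain dictionaries; the function name and the `monitor_interval` setting are hypothetical and chosen only to show the lookup order.

```python
def resolve_resource_type(name, node, node_defs, cluster_defs):
    """Return the definition of resource type `name` that applies on `node`."""
    node_specific = node_defs.get(node, {})
    if name in node_specific:
        # A node-specific definition overrides the clusterwide one.
        return node_specific[name]
    return cluster_defs[name]


# Hypothetical definitions; the setting name is illustrative only.
cluster_defs = {"volume": {"monitor_interval": 30}}
node_defs = {"node2": {"volume": {"monitor_interval": 10}}}

on_node1 = resolve_resource_type("volume", "node1", node_defs, cluster_defs)
on_node2 = resolve_resource_type("volume", "node2", node_defs, cluster_defs)
```

Here `node1` falls back to the clusterwide definition, while `node2` uses its own.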
Like resources, a resource type can depend on one or more other resource types. If such a dependency exists, at least one instance of each resource type it depends on must be defined. For example, a resource type named Netscape_web might have resource type dependencies on resource types named IP_address and volume. If a resource named web1 is defined with the Netscape_web resource type, then the resource group containing web1 must also contain at least one resource of the type IP_address and one resource of the type volume.
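The group-level requirement in the web1 example can be checked mechanically. The sketch below is a minimal illustration under the assumption that a resource group is represented as a mapping from resource name to resource type; `missing_types` is a hypothetical helper, not a Linux FailSafe command.

```python
# Resource type dependencies from the example: Netscape_web needs IP_address and volume.
TYPE_DEPENDENCIES = {"Netscape_web": {"IP_address", "volume"}}


def missing_types(group, type_deps):
    """Return the resource types a group still needs to satisfy its type dependencies.

    `group` maps each resource name to its resource type.
    """
    present = set(group.values())
    missing = set()
    for rtype in present:
        missing |= type_deps.get(rtype, set()) - present
    return missing


incomplete = {"web1": "Netscape_web", "10.10.48.22": "IP_address"}
complete = {"web1": "Netscape_web", "10.10.48.22": "IP_address", "vol1": "volume"}
```

For `incomplete`, the helper reports that a resource of type `volume` is still needed; for `complete`, it reports nothing missing.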
The Linux FailSafe software includes some predefined resource types. If these types fit the application you want to make highly available, you can reuse them. If none fit, you can create additional resource types by using the instructions in the Linux FailSafe Programmer's Guide.
A resource name identifies a specific instance of a resource type. A resource name must be unique for a given resource type.
A resource group is a collection of interdependent resources. A resource group is identified by a simple name; this name must be unique within a cluster. Table 1-1 shows an example of the resources and their corresponding resource types for a resource group named WebGroup.
Table 1-1. Example Resource Group
| Resource | Resource Type |
|---|---|
| 10.10.48.22 | IP_address |
| /fs1 | filesystem |
| vol1 | volume |
| web1 | Netscape_web |
If any individual resource in a resource group becomes unavailable for its intended use, then the entire resource group is considered unavailable. Therefore, a resource group is the unit of failover.
Resource groups cannot overlap; that is, two resource groups cannot contain the same resource.
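These two rules, that a group is available only when every resource in it is available and that no resource may appear in two groups, can be expressed as simple checks. The sketch below assumes groups are plain collections of resource names; both function names are hypothetical, and the second group in the example is invented to show an overlap.

```python
def group_is_available(group, resource_up):
    """A resource group is available only if every resource in it is available."""
    return all(resource_up.get(name, False) for name in group)


def shared_resources(groups):
    """Return the resources that appear in more than one resource group."""
    owner = {}
    shared = set()
    for group_name, resources in groups.items():
        for resource in resources:
            if resource in owner and owner[resource] != group_name:
                shared.add(resource)
            owner.setdefault(resource, group_name)
    return shared


web_group = ["10.10.48.22", "/fs1", "vol1", "web1"]
status = {"10.10.48.22": True, "/fs1": True, "vol1": False, "web1": True}
```

With `vol1` unavailable, `group_is_available(web_group, status)` is false, so the whole group would be a candidate for failover rather than the single resource.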
A resource dependency list is a list of resources upon which a resource depends. Each resource instance must have resource dependencies that satisfy its resource type dependencies before it can be added to a resource group.
A resource type dependency list is a list of resource types upon which a resource type depends. For example, the filesystem resource type depends upon the volume resource type, and the Netscape_web resource type depends upon the filesystem and IP_address resource types.
For example, suppose a file system instance fs1 is mounted on volume vol1. Before fs1 can be added to a resource group, fs1 must be defined to depend on vol1. Linux FailSafe knows only that a file system instance must have one volume instance in its dependency list; it infers this requirement from the resource type dependency list.
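The fs1 and vol1 example can be sketched as a check of one resource's dependency list against its resource type's dependency list. The data structures and the `unsatisfied_type_deps` helper are assumptions made for illustration; they are not how Linux FailSafe stores or validates its configuration.

```python
def unsatisfied_type_deps(resource, resource_types, resource_deps, type_deps):
    """Return the resource types that `resource` must still depend on.

    resource_types: resource name -> resource type
    resource_deps:  resource name -> list of resources it depends on
    type_deps:      resource type -> set of resource types it depends on
    """
    required = type_deps.get(resource_types[resource], set())
    provided = {resource_types[dep] for dep in resource_deps.get(resource, [])}
    return required - provided


resource_types = {"fs1": "filesystem", "vol1": "volume"}
type_deps = {"filesystem": {"volume"}}

before = unsatisfied_type_deps("fs1", resource_types, {}, type_deps)
after = unsatisfied_type_deps("fs1", resource_types, {"fs1": ["vol1"]}, type_deps)
```

Before fs1 is defined to depend on vol1, the check reports that a `volume` dependency is missing; afterward, nothing is missing and fs1 could be added to a resource group.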
A failover is the process of allocating a resource group (or application) to another node, according to a failover policy. A failover may be triggered by the failure of a resource, a change in the node membership (such as when a node fails or starts), or a manual request by the administrator.
A failover policy is the method used by Linux FailSafe to determine the destination node of a failover. A failover policy consists of the following:
- Failover domain
- Failover attributes
- Failover script
Linux FailSafe uses the failover domain output from a failover script along with failover attributes to determine on which node a resource group should reside.
The administrator must configure a failover policy for each resource group. A failover policy name must be unique within the pool. Linux FailSafe includes predefined failover policies, but you can define your own failover algorithms as well.
A failover domain is the ordered list of nodes on which a given resource group can be allocated. The nodes listed in the failover domain must be within the same cluster; however, the failover domain does not have to include every node in the cluster.
The administrator defines the initial failover domain when creating a failover policy. This list is transformed into a run-time failover domain by the failover script; Linux FailSafe uses the run-time failover domain along with failover attributes and the node membership to determine the node on which a resource group should reside. Linux FailSafe stores the run-time failover domain and uses it as input to the next failover script invocation. Depending on the run-time conditions and contents of the failover script, the initial and run-time failover domains may be identical.
In general, Linux FailSafe allocates a given resource group to the first node listed in the run-time failover domain that is also in the node membership; the point at which this allocation takes place is affected by the failover attributes.
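The general selection rule can be sketched in a few lines. This illustration deliberately ignores failover attributes, which affect when the allocation takes place, and the `pick_node` function is a hypothetical name rather than part of Linux FailSafe.

```python
def pick_node(runtime_domain, node_membership):
    """Return the first node in the run-time failover domain that is in the node membership."""
    for node in runtime_domain:
        if node in node_membership:
            return node
    return None  # No node in the domain is currently a member.


runtime_domain = ["node1", "node2", "node3"]
chosen = pick_node(runtime_domain, {"node2", "node3"})
```

Because `node1` is not in the node membership in this example, the resource group would be placed on `node2`.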
A failover attribute is a string that affects the allocation of a resource group in a cluster. The administrator must specify system attributes (such as Auto_Failback or Controlled_Failback), and can optionally supply site-specific attributes.
A failover script is a shell script that generates a run-time failover domain and returns it to the Linux FailSafe process. The Linux FailSafe process ha_fsd applies the failover attributes and then selects the first node in the returned failover domain that is also in the current node membership.
The following failover scripts are provided with the Linux FailSafe release:
- ordered, which never changes the initial failover domain. When using this script, the initial and run-time failover domains are equivalent.
- round-robin, which selects the resource group owner in a round-robin (circular) fashion. This policy can be used for resource groups that can run on any node in the cluster.
If these scripts do not meet your needs, you can create a new failover script using the information in this guide.
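To make the difference between the two provided scripts concrete, the sketch below models each as a function from the previous failover domain to a new run-time failover domain. This is a conceptual model only: real failover scripts are shell scripts, their calling interface is not described in this section, and the rotate-by-one behavior shown for round-robin is an assumed interpretation of "circular" selection, not the shipped script's logic.

```python
def ordered(domain):
    """Model of the ordered script: the run-time domain equals the input domain."""
    return list(domain)


def round_robin(domain):
    """Assumed model of round-robin: rotate the domain by one position per invocation."""
    domain = list(domain)
    return domain[1:] + domain[:1]


initial = ["node1", "node2", "node3"]

# The run-time domain from one invocation is the input to the next one.
first = round_robin(initial)
second = round_robin(first)
third = round_robin(second)
```

Under this model, successive invocations of `round_robin` move each node to the head of the list in turn and return to the initial order after one full cycle, while `ordered` always returns the initial order.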
The action scripts are the set of scripts that determine how a resource is started, monitored, and stopped. There must be a set of action scripts specified for each resource type.
The following is the complete set of action scripts that can be specified for each resource type:
- exclusive, which verifies that a resource is not already running
- start, which starts a resource
- stop, which stops a resource
- monitor, which monitors a resource
- restart, which restarts a resource on the same node after a monitoring failure occurs
The release includes action scripts for predefined resource types. If these scripts fit the resource type that you want to make highly available, you can reuse them, copying and modifying them as needed. If none fit, you can create additional action scripts by using the instructions in the Linux FailSafe Programmer's Guide.