1.10. Failover and Recovery Processes

When a failure is detected on one node (the node has crashed, hung, or been shut down, or a highly available service on it is no longer operating), a different node performs a failover of the highly available services provided by the node with the failure (called the failed node). Failover allows all of the highly available services, including those that were provided by the failed node, to remain available within the cluster.
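The effect of a failover can be pictured as a reassignment of services from the failed node to the surviving nodes. The following is a minimal sketch of that idea only, not the Linux FailSafe implementation: the function name fail_over, the service and node names, and the round-robin choice of target node are all assumptions made for illustration. In a real cluster the target node is determined by the failover policy you define.

```python
from typing import Dict, List


def fail_over(assignments: Dict[str, str],
              failed_node: str,
              surviving_nodes: List[str]) -> Dict[str, str]:
    """Return a new service-to-node map with the failed node's services moved.

    Hypothetical illustration: targets are picked round-robin, which is an
    assumption and not how Linux FailSafe selects a node.
    """
    candidates = [n for n in surviving_nodes if n != failed_node]
    if not candidates:
        raise ValueError("no surviving node is available to take over services")

    new_assignments = dict(assignments)
    moved = 0
    for service, node in assignments.items():
        if node == failed_node:
            # Move the service to the next surviving node in turn.
            new_assignments[service] = candidates[moved % len(candidates)]
            moved += 1
    return new_assignments


if __name__ == "__main__":
    before = {"nfs": "node1", "web": "node1", "db": "node2"}
    after = fail_over(before, "node1", ["node2", "node3"])
    for service, node in after.items():
        print(service, "->", node)
```

After the call, every service that was listed before the failure still has a node assigned to it, which is the property the paragraph above describes.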

A failure in a highly available service can be detected by Linux FailSafe processes running on the same node or on another node. The sequence of actions that follows the failure depends on which node detects it.

If the failure is detected by the Linux FailSafe software running on the same node, the failed node performs these operations:

When it receives the message, the node that is taking over the resource group performs these operations:

If the failure is detected by Linux FailSafe software running on a different node, the node detecting the failure performs these operations:

When a failed node comes back up, whether the node automatically starts to provide highly available services again depends on the failover policy you define. For information on defining failover policies, see Section 5.5.12.

Normally, a node that experiences a failure automatically reboots and resumes providing highly available services. This scenario works well for transient errors (as well as for planned outages for equipment and software upgrades). However, if the errors are persistent, an automatic reboot can lead to recovery followed immediately by another failover. To prevent this cycle, the Linux FailSafe software checks how long the rebooted node has been up since the last time it was started. If the interval is less than five minutes (the default), the Linux FailSafe software automatically prevents Linux FailSafe from starting at boot on the failed node and does not start the Linux FailSafe software on this node. It also writes error messages to /var/log/failsafe and to the appropriate log file.
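The uptime check described above amounts to comparing one interval against a threshold. The following is a minimal sketch of that decision, assuming the interval is already known in seconds. The names decide_restart, RestartDecision, and MIN_UPTIME_SECONDS are hypothetical and are not part of Linux FailSafe; only the five-minute default comes from the text.

```python
from dataclasses import dataclass

# Default threshold stated in the text: five minutes.
MIN_UPTIME_SECONDS = 5 * 60


@dataclass
class RestartDecision:
    start_failsafe: bool  # whether the software should be started on this node
    message: str          # text that would be written to the log


def decide_restart(interval_seconds: float,
                   min_uptime_seconds: float = MIN_UPTIME_SECONDS) -> RestartDecision:
    """Decide whether to start the software after a reboot.

    Hypothetical illustration: interval_seconds is the "how long the node
    has been up" interval from the text; how it is measured is not shown.
    """
    if interval_seconds < min_uptime_seconds:
        # A short interval suggests a persistent error, so do not start and
        # avoid another recovery followed by an immediate failover.
        return RestartDecision(
            start_failsafe=False,
            message=(f"node was up {interval_seconds:.0f}s, less than "
                     f"{min_uptime_seconds:.0f}s; automatic start disabled"),
        )
    return RestartDecision(start_failsafe=True, message="starting normally")


if __name__ == "__main__":
    for seconds in (120, 300, 900):
        decision = decide_restart(seconds)
        print(seconds, decision.start_failsafe, decision.message)
```

In this sketch an interval of exactly five minutes does not trigger the safeguard, because the text says "less than" the threshold.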