When a failure is detected on one node (the node has crashed, hung, or been shut down, or a highly available service is no longer operating), a different node performs a failover of the highly available services that were being provided by the node with the failure (called the failed node). Failover allows all of the highly available services, including those previously provided by the failed node, to remain available within the cluster.
A failure in a highly available service can be detected by Linux FailSafe processes running on the node itself or on another node in the cluster. The sequence of actions that follows the failure depends on which node detects it.
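Detection by another node can be pictured as a heartbeat check: each node periodically reports that it is alive, and a peer that goes silent for too long is declared failed. The following Python sketch only illustrates that idea; the interval, miss limit, and function name are assumptions, not the actual Linux FailSafe implementation.

```
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats (assumed value)
MISSED_LIMIT = 5           # consecutive misses before a peer is declared failed

def peer_has_failed(last_heartbeat, now,
                    interval=HEARTBEAT_INTERVAL, missed_limit=MISSED_LIMIT):
    """Return True if the peer has missed enough heartbeats to be declared failed."""
    missed = (now - last_heartbeat) / interval
    return missed >= missed_limit

# Example: a peer last heard from 6 seconds ago has missed more than 5 beats.
if peer_has_failed(last_heartbeat=time.time() - 6.0, now=time.time()):
    print("peer declared failed; initiating failover")
```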
If the failure is detected by the Linux FailSafe software running on the same node, the failed node performs these operations:
Stops the highly available resource group running on the node
Moves the highly available resource group to a different node, according to the defined failover policy for the resource group
Sends a message to the node that will take over the services to start providing all resource group services previously provided by the failed node
When it receives the message, the node that is taking over the resource group performs these operations (the complete hand-off is sketched after this list):
Transfers ownership of the resource group from the failed node to itself
Starts offering the resource group services that were running on the failed node
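Taken together, the two lists above describe a simple hand-off: the failed node stops its resource group and notifies a peer chosen by the failover policy, and the peer assumes ownership and restarts the services. The Python sketch below models that exchange under those assumptions; the class and method names are illustrative placeholders, not Linux FailSafe interfaces.

```
class Node:
    """Minimal model of a cluster node for the hand-off described above."""

    def __init__(self, name):
        self.name = name
        self.resource_groups = {}   # resource group name -> list of services

    # Behaviour of the node that detected its own failure.
    def fail_over(self, group, candidates, policy):
        services = self.resource_groups.pop(group)        # stop the group locally
        target = policy(candidates)                        # choose a node per the failover policy
        target.receive_takeover(group, services, origin=self.name)   # notify the takeover node

    # Behaviour of the node taking over the resource group.
    def receive_takeover(self, group, services, origin):
        self.resource_groups[group] = services             # transfer ownership to this node
        for service in services:                           # start offering each service
            print(f"{self.name}: starting {service} (previously on {origin})")


# Example: node-a detects its own failure and hands "webgroup" over to node-b.
a, b = Node("node-a"), Node("node-b")
a.resource_groups["webgroup"] = ["httpd", "shared-ip"]
a.fail_over("webgroup", candidates=[b], policy=lambda nodes: nodes[0])
```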
If the failure is detected by the Linux FailSafe software running on a different node, the node detecting the failure performs these operations (the ordering is sketched after this list):
Using the serial connection between the nodes, reboots the failed node to prevent corruption of data
Transfers ownership of the resource group from the failed node to the other nodes in the cluster, based on the resource group failover policy
Starts offering the resource group services that were running on the failed node
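The order of these steps matters: the failed node is reset over the serial line before ownership moves, so it cannot keep writing to shared data while the services are restarted elsewhere. The Python sketch below shows only that ordering; the three hooks are hypothetical placeholders, not Linux FailSafe commands.

```
def handle_remote_failure(failed_node, takeover_node, resource_group,
                          reset_via_serial, transfer_ownership, start_services):
    """Ordering of the recovery steps when another node detects the failure.

    The three callables are assumed hooks; the point is that the failed node
    is reset *before* ownership of the resource group is transferred.
    """
    reset_via_serial(failed_node)                       # 1. reboot the failed node over the serial line
    transfer_ownership(resource_group, takeover_node)   # 2. move the group per the failover policy
    start_services(resource_group, takeover_node)       # 3. offer the services on the new node


# Example with stub hooks that just report what they would do.
handle_remote_failure(
    "node-a", "node-b", "webgroup",
    reset_via_serial=lambda node: print(f"resetting {node} via serial line"),
    transfer_ownership=lambda group, node: print(f"moving {group} to {node}"),
    start_services=lambda group, node: print(f"starting services of {group} on {node}"),
)
```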
When a failed node comes back up, whether the node automatically starts to provide highly available services again depends on the failover policy you define. For information on defining failover policies, see Section 5.5.12.
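As a rough illustration of what a failover policy decides, the sketch below pairs an ordered list of candidate nodes with a flag controlling whether a recovered node automatically takes its resource group back. The attribute names here are assumptions made for illustration; the actual policy attributes are described in Section 5.5.12.

```
from dataclasses import dataclass

@dataclass
class FailoverPolicy:
    """Illustrative failover policy; attribute names are assumed, not FailSafe's."""
    ordered_nodes: list          # preferred nodes, most preferred first
    auto_failback: bool = False  # return the group to a recovered node automatically?

    def choose_node(self, up_nodes):
        """Pick the most preferred node that is currently up."""
        for node in self.ordered_nodes:
            if node in up_nodes:
                return node
        return None


policy = FailoverPolicy(ordered_nodes=["node-a", "node-b"], auto_failback=False)
print(policy.choose_node(up_nodes={"node-b"}))   # node-a is down, so node-b is chosen
```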
Normally, a node that experiences a failure automatically reboots and resumes providing highly available services. This works well for transient errors (as well as for planned outages for equipment and software upgrades). With persistent errors, however, an automatic reboot would lead to recovery followed immediately by another failover, and the cycle would repeat. To prevent this, the Linux FailSafe software checks how long the rebooted node has been up since it was last started. If the interval is less than five minutes (by default), the Linux FailSafe software automatically disables Linux FailSafe from starting at boot on the failed node and does not start the Linux FailSafe software on that node. It also writes error messages to /var/log/failsafe and to the appropriate log file.
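The reboot-loop guard amounts to comparing the node's uptime against a threshold at startup. The sketch below reads /proc/uptime and applies the five-minute default described above; the disable and logging actions are stand-ins, since the exact mechanism Linux FailSafe uses to disable itself is not shown here.

```
REBOOT_LOOP_THRESHOLD = 5 * 60   # five minutes, the documented default

def uptime_seconds():
    """Seconds since the node was started, from /proc/uptime."""
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])

def check_for_reboot_loop(threshold=REBOOT_LOOP_THRESHOLD):
    """If the node rebooted less than `threshold` seconds ago, assume a
    persistent error: skip HA startup (the actions below are stand-ins)."""
    if uptime_seconds() < threshold:
        print("node restarted too recently; not starting HA services")
        print("would disable HA startup and log to /var/log/failsafe")
        return False
    return True

# Called once at startup; True means it is safe to bring up HA services.
check_for_reboot_loop()
```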