When you configure the components of a Linux FailSafe system, you set various timeout values and monitoring intervals that determine the application downtime of a highly available system when there is a failure. To determine reasonable values for your system, consider the following equation:
application downtime = failure detection + time to handle failure + failure recovery
Failure detection depends on the type of failure that is detected:
When a node goes down, the node failure is detected after the node timeout expires; the node timeout is an HA parameter that you can modify. All failures that translate into a node failure (such as heartbeat failure and OS failure) fall into this category. The node timeout has a default value of 15 seconds. For information on modifying the node timeout value, see Section 5.4.4.
When a resource fails, the failure is detected when monitoring of that resource fails. The amount of time this takes is determined by the following:
The monitoring interval for the resource type
The monitor timeout for the resource type
The number of restarts defined for the resource type, if restart mode is enabled
For information on setting values for a resource type, see Section 5.5.6.
Reducing these values shortens the failover time, but it can also significantly increase the overhead that Linux FailSafe places on system performance and can lead to false failovers.
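To get a feel for how the monitoring interval, monitor timeout, and restart count combine, the following Python sketch estimates a rough upper bound on resource failure detection time. The formula is an assumption made for illustration and is not taken from Linux FailSafe: it supposes that each monitoring cycle costs at most one interval plus one timeout, and that each configured restart adds one more cycle before the failure is reported. The function and parameter names are hypothetical.

```python
def resource_detection_upper_bound(monitor_interval_s, monitor_timeout_s, restarts=0):
    """Rough upper bound, in seconds, on resource failure detection time.

    Assumption (not from Linux FailSafe): every monitoring cycle takes at
    most one interval plus one timeout, and each restart adds one cycle.
    """
    if monitor_interval_s < 0 or monitor_timeout_s < 0 or restarts < 0:
        raise ValueError("all arguments must be non-negative")
    cycles = restarts + 1  # the initial failed check plus one per restart
    return cycles * (monitor_interval_s + monitor_timeout_s)


# Hypothetical values: 10 s interval, 5 s timeout, 2 restarts
print(resource_detection_upper_bound(10, 5, restarts=2))
```

Under this model, halving the interval and timeout halves the bound, which is the trade-off against overhead and false failovers described above.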
The time to handle a failure is something that the user cannot control. In general, this should take a few seconds.
The failure recovery time is determined by the total time it takes for Linux FailSafe to perform the following:
Execute the failover policy script (approximately five seconds).
Run the stop action script for all resources in the resource group. This is not required for node failure; the failing node will be reset.
Run the start action script for all resources in the resource group.
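The downtime equation and the recovery steps above can be combined into a back-of-the-envelope estimate. The Python sketch below is illustrative only: the function and parameter names are hypothetical, the 3-second handling time is an assumed stand-in for "a few seconds", the script durations are placeholder values you would measure on your own system, and the action scripts are assumed to run one after another. The 5-second failover policy script time comes from the text above.

```python
def estimate_downtime(detection_s, start_scripts_s, stop_scripts_s=(),
                      node_failure=False, handling_s=3.0, policy_script_s=5.0):
    """Estimate application downtime in seconds.

    downtime = failure detection + time to handle failure + failure recovery
    """
    # On node failure the failing node is reset, so no stop scripts run.
    stop_s = 0.0 if node_failure else sum(stop_scripts_s)
    # Assumption: action scripts run sequentially, so their times add up.
    recovery_s = policy_script_s + stop_s + sum(start_scripts_s)
    return detection_s + handling_s + recovery_s


# Node failure: detection equals the default node timeout of 15 seconds.
print(estimate_downtime(15.0, start_scripts_s=[4.0, 6.0], node_failure=True))

# Resource failure: hypothetical 45 s detection time, plus stop scripts.
print(estimate_downtime(45.0, start_scripts_s=[4.0, 6.0], stop_scripts_s=[2.0, 1.0]))
```

With these placeholder numbers, the node failure case comes to 33 seconds and the resource failure case to 66 seconds; substituting measured script times gives a more realistic figure for a given resource group.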