When a FailSafe system experiences problems, you can use FailSafe features and commands to determine where the problem is.
FailSafe provides the following tools to evaluate and recover from system failure:
- Log files
- Commands to monitor the status of system components
- Commands to start, stop, and fail over highly available services
Keep in mind that the FailSafe logs may not record system problems that do not translate into FailSafe problems. For example, if a CPU fails or hardware maintenance is required, FailSafe may not be able to detect and log the condition.
In general, when evaluating system problems of any nature on a FailSafe configuration, you should determine whether you need to shut down a node to address those problems. When you shut down a node, perform the following steps in order:
1. Stop FailSafe services on that node
2. Shut down the node to perform needed maintenance and repair
3. Start up the node
4. Start FailSafe services on that node
It is important that you explicitly stop FailSafe services before shutting down a node, where possible, so that FailSafe does not interpret the node shutdown as a node failure. If FailSafe interprets the service interruption as a node failure, there could be unexpected consequences, depending on how you have configured your resource groups and your application failover domain.
When you shut down a node to perform maintenance, you may need to change your FailSafe configuration to keep your system running.
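The ordering of the node shutdown procedure described above can be sketched as a small Python wrapper. This is only an illustration of the sequence, not a FailSafe interface: the function `run_node_maintenance`, the step names, and the `actions` mapping are all hypothetical, and each action is a placeholder that you would replace with whatever command or manual procedure your site uses for that step.

```python
from typing import Callable, Dict, List

# Order of operations taken from the procedure in this section.
# The step names are hypothetical labels, not FailSafe commands.
MAINTENANCE_STEPS: List[str] = [
    "stop_failsafe_services",   # so the shutdown is not seen as a node failure
    "shut_down_node",           # perform maintenance and repair while down
    "start_node",
    "start_failsafe_services",
]


def run_node_maintenance(
    node: str, actions: Dict[str, Callable[[str], None]]
) -> List[str]:
    """Run the four maintenance steps for `node` in the required order.

    `actions` maps each step name to a caller-supplied function
    (an assumption of this sketch). An exception raised by any action
    stops the sequence, so later steps never run after a failed one.
    """
    missing = [step for step in MAINTENANCE_STEPS if step not in actions]
    if missing:
        raise ValueError("no action supplied for: " + ", ".join(missing))

    completed: List[str] = []
    for step in MAINTENANCE_STEPS:
        actions[step](node)
        completed.append(step)
    return completed


if __name__ == "__main__":
    # Placeholder actions that only announce the step they stand in for.
    def announce(step: str) -> Callable[[str], None]:
        return lambda node: print(f"{step}: {node}")

    demo_actions = {step: announce(step) for step in MAINTENANCE_STEPS}
    run_node_maintenance("node1", demo_actions)
```

Checking for missing actions before running anything means the sequence is never started unless every step, including the restart of FailSafe services, has been accounted for.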