9.6. Recovery Procedures

The following sections describe recovery procedures that you can perform when various FailSafe components fail. Procedures are provided for cluster, node, resource group, and resource errors; for control network and serial cable failures; for CDB maintenance and recovery; and for inconsistencies between the FailSafe Cluster Manager GUI and CLI.

9.6.1. Cluster Error Recovery

Follow this procedure if the status of the cluster is UNKNOWN on all nodes in the cluster.

  1. Check to see if there are control networks that have failed (see Section 9.6.5).

  2. At least 50% of the nodes in the cluster must be able to communicate with each other for the cluster to be active (the quorum requirement). For example, in a four-node cluster at least two nodes must be able to communicate with each other. If there are not enough nodes in the cluster that can communicate with each other using control networks, stop HA services on some of the nodes so that the quorum requirement is satisfied.

  3. If there are no hardware configuration problems, detach all resource groups that are online in the cluster (if any), stop HA services in the cluster, and restart HA services in the cluster.

The following cluster_mgr command detaches the resource group web-rg in cluster web-cluster:

cmgr> admin detach resource_group web-rg in cluster web-cluster

To stop HA services in the cluster web-cluster and ignore errors (force option), use the following cluster_mgr command:

cmgr> stop ha_services for cluster web-cluster force

To start HA services in the cluster web-cluster, use the following cluster_mgr command:

cmgr> start ha_services for cluster web-cluster
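
After HA services restart, you can confirm that the cluster is active again by querying its status. The following is a sketch, assuming your cluster_mgr release supports the show status of cluster query (the exact syntax may vary between releases):

cmgr> show status of cluster web-cluster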

9.6.2. Node Error Recovery

Follow this procedure if the status of a node is UNKNOWN in an active cluster:

  1. Check to see if the control networks in the node are working (see Section 9.6.5).

  2. Check to see if the serial reset cables to reset the node are working (see Section 9.6.6).

  3. If there are no hardware configuration problems, stop HA services in the node and restart HA services.

    To stop HA services in the node web-node3 in the cluster web-cluster, ignoring errors (force option), use the following cluster_mgr command:

    cmgr> stop ha_services in node web-node3 for cluster web-cluster force

    To start HA services in the node web-node3 in the cluster web-cluster, use the following cluster_mgr command:

    cmgr> start ha_services in node web-node3 for cluster web-cluster

9.6.3. Resource Group Maintenance and Error Recovery

To perform simple maintenance on an application that is part of a resource group, use the following procedure. Turning maintenance mode on stops the monitoring of the resources in the resource group; turn maintenance mode off when the application maintenance is done.

Caution

If the node on which resource group maintenance is being performed fails, the resource group is moved to another node in the failover policy domain.

  1. To put a resource group web-rg in maintenance mode, use the following cluster_mgr command:

    cmgr> admin maintenance_on resource_group web-rg in cluster web-cluster

  2. The resource group state changes to ONLINE_MAINTENANCE. Do whatever application maintenance is required. (Rotating application logs is an example of simple application maintenance.)

  3. To remove a resource group web-rg from maintenance mode, use the following cluster_mgr command:

    cmgr> admin maintenance_off resource_group web-rg in cluster web-cluster

    The resource group state changes back to ONLINE. The example after this procedure shows one way to verify the resource group state.
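
To confirm the state transitions described in steps 2 and 3, you can query the resource group status. The following is a sketch, assuming your cluster_mgr release supports the show status of resource_group query (check your release's cluster_mgr reference for the exact syntax):

cmgr> show status of resource_group web-rg in cluster web-cluster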

You perform the following procedure when a resource group is in an ONLINE state and has an SRMD EXECUTABLE ERROR.

  1. Look at the SRM logs (default location: /var/log/failsafe/srmd_nodename) to determine the cause of the failure and the resource that has failed (see the example log search after this procedure).

  2. Fix the cause of failure. This might require changes to resource configuration or changes to resource type stop/start/failover action timeouts.

  3. After fixing the problem, move the resource group offline with the force option and then move the resource group online.

    The following cluster_mgr command moves the resource group web-rg in the cluster web-cluster offline and ignores any errors:

    cmgr> admin offline resource_group web-rg in cluster web-cluster force

    The following cluster_mgr command moves the resource group web-rg in the cluster web-cluster online:

    cmgr> admin online resource_group web-rg in cluster web-cluster

    The resource group web-rg should be in an ONLINE state with no error.
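
As an illustration of step 1, the following shell commands list recent entries in the SRM log and search it for messages that mention errors. The node name web-node3 is taken from the examples in this section; substitute your own node name:

# tail -50 /var/log/failsafe/srmd_web-node3
# grep -i error /var/log/failsafe/srmd_web-node3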

You use the following procedure when a resource group is not online but is in an error state. Most of these errors occur as a result of the exclusivity process. This process, run when a resource group is brought online, determines if any resources are already allocated somewhere in the failure domain of a resource group. Note that exclusivity scripts return that a resource is allocated on a node if the script fails in any way. In other words, unless the script can determine that a resource is not present, it returns a value indicating that the resource is allocated.

Some possible error states include: SPLIT RESOURCE GROUP (EXCLUSIVITY), NODE NOT AVAILABLE (EXCLUSIVITY), NO AVAILABLE NODES in failure domain. See Section 7.4.3 for explanations of resource group error codes.

  1. Look at the failsafe and SRM logs (default directory: /var/log/failsafe, files: failsafe_nodename, srmd_nodename) to determine the cause of the failure and the resource that failed.

    For example, say the task of moving a resource group online results in a resource group with error state SPLIT RESOURCE GROUP (EXCLUSIVITY). This means that parts of a resource group are allocated on at least two different nodes. One of the failsafe logs will have the description of which nodes are believed to have the resource group partially allocated.

    At this point, look at the srmd logs on each of these machines to see which resources are believed to be allocated. In some cases, a misconfigured resource will show up as a resource that is allocated. This is especially true for Netscape_Web resources.

  2. Fix the cause of the failure. This might require changes to resource configuration or changes to resource type start/stop/exclusivity timeouts.

  3. After fixing the problem, move the resource group offline with the force option and then move the resource group online.

A few double failures that can occur in the cluster will cause resource groups to remain in a non-highly-available state. At times a resource group might get stuck in an offline state. A resource group might also stay in an error state on a node even when a new node joins the cluster and the resource group could migrate to that node to clear the error.

When these circumstances arise, the correct action should be as follows:

  1. Try to move the resource group online if it is offline.

  2. If the resource group is stuck on a node, detach the resource group and then bring it online again (see the example after this list). This should clear many errors.

  3. If detaching the resource group does not work, force the resource group offline, then bring it back online.

  4. If commands appear to be hanging or not working properly, detach all resource groups, then shut down the cluster and bring all resource groups back online.
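
For example, using the commands shown earlier in this section, step 2 for the resource group web-rg in the cluster web-cluster would be:

cmgr> admin detach resource_group web-rg in cluster web-cluster
cmgr> admin online resource_group web-rg in cluster web-cluster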

See Section 7.5.2 for information on detaching resource groups and forcing resource groups offline.

9.6.4. Resource Error Recovery

You use this procedure when a resource that is not part of a resource group is in an ONLINE state with an error. This can happen when adding resources to or removing resources from a resource group fails.

  1. Look at the SRM logs (default location: /var/log/failsafe/srmd_nodename) to determine the cause of failure and the resource that has failed.

  2. Fix the cause of failure. This might require changes to resource configuration or changes to resource type stop/start/failover action timeouts.

  3. After fixing the problem, move the resource offline by using the Cluster Manager CLI admin offline_force command:

    cmgr> admin offline_force resource web-srvr of resource_type Netscape_Web in cluster web-cluster

    Executing this command removes the error state of resource web-srvr of type Netscape_Web, making it available to be added to a resource group.

    You can also use the Cluster Manager GUI to clear the error state for the resource. To do this, you select the “Recover a Resource” task from the “Resources and Resource Types” category of the FailSafe Manager.

9.6.5. Control Network Failure Recovery

Control network failures are reported in the cmsd logs. The default location of the cmsd log is /var/log/failsafe/cmsd_nodename. Follow this procedure when a control network fails:

  1. Use the ping command to check whether the control network IP address is configured in the node (see the example after this procedure).

  2. Check the node configuration to see whether the control network IP addresses are correctly specified.

    The following cluster_mgr command displays node configuration for web-node3:

    cmgr> show node web-node3

  3. If IP names are specified for the control networks instead of IP addresses in XX.XX.XX.XX notation, check whether the IP names can be resolved using DNS. Using IP addresses instead of IP names is recommended.

  4. Check whether the heartbeat interval and the node timeouts are correctly set for the cluster. These HA parameters can be seen using the cluster_mgr show ha_parameters command.
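
As an illustration of steps 1 and 3, the following shell commands check whether a control network address responds and whether an IP name resolves through DNS. The address 192.0.2.3 and the name web-node3 are placeholders; substitute the values from your node configuration:

# ping -c 3 192.0.2.3
# nslookup web-node3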

9.6.6. Serial Cable Failure Recovery

Serial cables are used for resetting a node when there is a node failure. Serial cable failures are reported in the crsd logs. The default location for the crsd log is /var/log/failsafe/crsd_nodename.

  1. Check the node configuration to see whether the serial cable connection is correctly configured.

    The following cluster_mgr command displays the node configuration for web-node3:

    cmgr> show node web-node3

  2. Use the cluster_mgr admin ping command to verify the serial cables.

    cmgr> admin ping node web-node3

The above command reports serial cable problems in node web-node3.

9.6.7. CDB Maintenance and Recovery

When the entire configuration database (CDB) must be reinitialized, execute the following command:

# /usr/cluster/bin/cdbreinit /var/cluster/cdb/cdb.db

This command will restart all cluster processes. The contents of the configuration database will be automatically synchronized with other nodes if other nodes in the pool are available.

Otherwise, the CDB will need to be restored from backup at this point. For instructions on backing up and restoring the CDB, see Section 7.8.

9.6.8. FailSafe Cluster Manager GUI and CLI Inconsistencies

If the FailSafe Cluster Manager GUI displays information that is inconsistent with the output of the FailSafe cluster_mgr command, restart the cad process on the node to which the Cluster Manager GUI is connected by executing the following command:

# killall cad

The cluster administration daemon (cad) is restarted automatically by the cmond process.
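
To confirm that cmond has restarted the daemon, you can check for a running cad process from the shell. The brackets in the pattern simply keep the grep command itself out of the listing:

# ps -ef | grep '[c]ad'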