Linux FailSafe System Operation

This chapter describes administrative tasks you perform to operate and monitor a Linux FailSafe system. It describes how to perform tasks using the FailSafe Cluster Manager Graphical User Interface (GUI) and the FailSafe Cluster Manager Command Line Interface (CLI). The major sections in this chapter are as follows:

Setting System Operation Defaults
System Operation Considerations
Activating (Starting) Linux FailSafe
System Status
Resource Group Failover
Deactivating (Stopping) Linux FailSafe
Resetting Nodes

Setting System Operation Defaults

Several commands that you perform on a running system allow you the option of specifying a node or cluster. You can set a default node or cluster to be used when you do not specify the node or cluster explicitly.

Setting Default Cluster with Cluster Manager GUI

The Cluster Manager GUI prompts you to enter the name of the default cluster when you have not specified one. Alternately, you can set the default cluster by clicking the “Select Cluster...” button at the bottom of the FailSafe Manager window. When using the Cluster Manager GUI, there is no need to set a default node.

Setting Defaults with Cluster Manager CLI

When you are using the Cluster Manager CLI, you can use the following commands to specify default values. The first command sets the default cluster; the second sets the default node:

cmgr> set cluster A
cmgr> set node A

System Operation Considerations

Once a Linux FailSafe command is started, it may partially complete even if you interrupt the command by typing Ctrl-c. If you halt the execution of a command this way, you may leave the cluster in an indeterminate state and you may need to use the various status commands to determine the actual state of the cluster and its components.

Activating (Starting) Linux FailSafe

After you have configured your Linux FailSafe system and run diagnostic tests on its components, you can activate the highly available services by starting Linux FailSafe.
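For scripted use of the CLI, the default-setting commands shown above can be generated programmatically. The following is an illustrative Python sketch, not part of Linux FailSafe; the cluster and node names are hypothetical, and feeding a command file to cmgr is an assumption about your workflow rather than a documented feature.

```python
# Illustrative sketch (not part of Linux FailSafe): build the cmgr commands
# that establish session defaults. The command syntax comes from this
# chapter; the names used below are hypothetical examples.

def cmgr_default_commands(cluster, node=None):
    """Return the cmgr commands that set the default cluster (and node, if given)."""
    commands = ["set cluster " + cluster]
    if node is not None:
        commands.append("set node " + node)
    return commands

# Example: write the commands to a file that could serve as scripted cmgr input.
script = "\n".join(cmgr_default_commands("test-cluster", node="hans1"))
```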
You can start Linux FailSafe on a systemwide basis, on all of the nodes in a cluster, or on a specified node only. When you start HA services on a subset of the nodes, you should make sure that resource groups are not running on other nodes in the cluster. For example, if a cluster contains nodes N1, N2, and N3 and HA services are started on nodes N1 and N2 but not on node N3, you should make sure that resource groups are not running on node N3. Linux FailSafe will not perform exclusivity checks on nodes where HA services are not started.

When you start HA services, the following actions are performed:

All nodes in the cluster in the CDB (cluster database) are enabled.
Linux FailSafe returns success to the user after modifying the CDB.
The local CMOND gets notification from cdbd.
The local CMOND starts all HA processes (CMSD, GCD, SRMD, FSD) and IFD.
CMOND sets the failsafe2 chkconfig flag to on.

Activating Linux FailSafe with the Cluster Manager GUI

To start Linux FailSafe services using the Cluster Manager GUI, perform the following steps:

On the left side of the display, click on the “Nodes & Cluster” category.
On the right side of the display, click on the “Start FailSafe HA Services” task link to launch the task.
Enter the selected inputs.
Click on “OK” at the bottom of the screen to complete the task.

Activating Linux FailSafe with the Cluster Manager CLI

To activate Linux FailSafe in a cluster, use the following command:

cmgr> start ha_services [on node A] [for cluster B]

System Status

While the Linux FailSafe system is running, you can monitor the status of the Linux FailSafe components to determine the state of each component. Linux FailSafe allows you to view the system status in the following ways:

You can keep continuous watch on the state of a cluster using the FailSafe Cluster View of the Cluster Manager GUI.
You can query the status of an individual resource group, node, or cluster using either the Cluster Manager GUI or the Cluster Manager CLI.

You can use the haStatus script provided with the Cluster Manager CLI to see the status of all clusters, nodes, resources, and resource groups in the configuration.

The following sections describe the procedures for performing each of these tasks.

Monitoring System Status with the Cluster Manager GUI

The easiest way to keep a continuous watch on the state of a cluster is to use the FailSafe Cluster View of the Cluster Manager GUI. In the FailSafe Cluster View window, system components that are experiencing problems appear as blinking red icons. Components in transitional states also appear as blinking icons. If there is a problem in a resource group or node, the FailSafe Cluster View icon for the cluster turns red and blinks, as does the icon for that resource group or node.

The full color legend for component states in the FailSafe Cluster View is as follows:

grey: healthy but not online or active
green: healthy and active or online
blinking green: transitioning to green
blinking red: problems with the component
black and white outline: resource type
grey with yellow wrench: maintenance mode; may or may not be currently monitored by Linux FailSafe

If you minimize the FailSafe Cluster View window, the minimized icon shows the current state of the cluster. When the cluster has Linux FailSafe HA services active and there is no error, the icon shows a green cluster. When the cluster goes into an error state, the icon shows a red cluster. When the cluster has Linux FailSafe HA services inactive, the icon shows a grey cluster.

Monitoring Resource and Reset Serial Line with the Cluster Manager CLI

You can use the CLI to query the status of a resource or to ping the system controller at a node, as described in the following subsections.
Querying Resource Status with the Cluster Manager CLI

To query resource status, use the following CLI command:

cmgr> show status of resource A of resource_type B [in cluster C]

If you have specified a default cluster, you do not need to specify a cluster when you use this command; it will show the status of the indicated resource in the default cluster.

Pinging a System Controller with the Cluster Manager CLI

To perform a ping operation on a system controller by providing the device name, use the following CLI command:

cmgr> admin ping dev_name A of dev_type B with sysctrl_type C

Resource Group Status

To query the status of a resource group, you provide the name of the resource group and the cluster that includes the resource group. Resource group status includes the following components:

Resource group state
Resource group error state
Resource owner

These components are described in the following subsections. If a node that has a resource group online has a status of UNKNOWN, the status of the resource group will not be available or ONLINE-READY.

Resource Group State

A resource group state can be one of the following:

ONLINE: Linux FailSafe is running on the local nodes. The resource group is allocated on a node in the cluster and is being monitored by Linux FailSafe. It is fully allocated if there is no error; otherwise, some resources may not be allocated or some resources may be in an error state.

ONLINE-PENDING: Linux FailSafe is running on the local nodes and the resource group is in the process of being allocated. This is a transient state.

OFFLINE: The resource group is not running or the resource group has been detached, regardless of whether Linux FailSafe is running. When Linux FailSafe starts up, it will not allocate this resource group.
OFFLINE-PENDING: Linux FailSafe is running on the local nodes and the resource group is in the process of being released (becoming offline). This is a transient state.

ONLINE-READY: Linux FailSafe is not running on the local node. When Linux FailSafe starts up, it will attempt to bring this resource group online. No Linux FailSafe process is running on the current node if this state is returned.

ONLINE-MAINTENANCE: The resource group is allocated on a node in the cluster but is not being monitored by Linux FailSafe. If a node failure occurs while a resource group in ONLINE-MAINTENANCE state resides on that node, the resource group will be moved to another node and monitoring will resume. An administrator may move a resource group to the ONLINE-MAINTENANCE state for upgrade or testing purposes, or if there is any reason that Linux FailSafe should not act on that resource for a period of time.

INTERNAL ERROR: An internal Linux FailSafe error has occurred and Linux FailSafe does not know the state of the resource group. Error recovery is required.

DISCOVERY (EXCLUSIVITY): The resource group is in the process of going online; Linux FailSafe is determining whether any resource in the resource group is already allocated on any node in the resource group's application failure domain. This is a transient state.

INITIALIZING: Linux FailSafe on the local node has yet to get any information about this resource group. This is a transient state.

Resource Group Error State

When a resource group is ONLINE, its error status is continually being monitored. A resource group error status can be one of the following:

NO ERROR: The resource group has no error.

INTERNAL ERROR - NOT RECOVERABLE: Notify Silicon Graphics if this condition arises.
NODE UNKNOWN: The node that had the resource group online is in an unknown state. This occurs when the node is not part of the cluster. The last known state of the resource group is ONLINE, but the system cannot talk to the node.

SRMD EXECUTABLE ERROR: The start or stop action has failed for a resource in the resource group.

SPLIT RESOURCE GROUP (EXCLUSIVITY): Linux FailSafe has determined that part of the resource group was running on at least two different nodes in the cluster.

NODE NOT AVAILABLE (EXCLUSIVITY): Linux FailSafe has determined that one of the nodes in the resource group's application failure domain was not in the membership. Linux FailSafe cannot bring the resource group online until that node is removed from the application failure domain or HA services are started on that node.

MONITOR ACTIVITY UNKNOWN: An error occurred while turning maintenance mode on or off. Linux FailSafe can no longer determine whether monitoring is enabled or disabled. Retry the operation. If the error continues, report it to Silicon Graphics.

NO AVAILABLE NODES: A monitoring error has occurred on the last valid node in the cluster's membership.

Resource Owner

The resource owner is the logical node name of the node that currently owns the resource.

Monitoring Resource Group Status with the Cluster Manager GUI

You can use the FailSafe Cluster View to monitor the status of the resources in a Linux FailSafe configuration. You can launch the FailSafe Cluster View directly, or you can bring it up at any time by clicking on “FailSafe Cluster View” at the bottom of the “FailSafe Manager” display.

From the View menu, select “Resources in Groups” to see the resources organized by the groups they belong to, or select “Groups owned by Nodes” to see where the online groups are running.
This view lets you observe failovers as they occur.

Querying Resource Group Status with the Cluster Manager CLI

To query resource group status, use the following CLI command:

cmgr> show status of resource_group A [in cluster B]

If you have specified a default cluster, you do not need to specify a cluster when you use this command; it will show the status of the indicated resource group in the default cluster.

Node Status

To query the status of a node, you provide the logical node name of the node. The node status can be one of the following:

UP: This node is part of the cluster membership.

DOWN: This node is not part of the cluster membership (no heartbeats) and this node has been reset. This is a transient state.

UNKNOWN: This node is not part of the cluster membership (no heartbeats) and this node has not been reset (the reset attempt has failed).

INACTIVE: HA services have not been started on this node.

When you start HA services, node states transition from INACTIVE to UP. A node state may also transition from INACTIVE to UNKNOWN to UP.

Monitoring Cluster Status with the Cluster Manager GUI

You can use the FailSafe Cluster View to monitor the status of the clusters in a Linux FailSafe configuration. You can launch the FailSafe Cluster View directly, or you can bring it up at any time by clicking on “FailSafe Cluster View” at the bottom of the “FailSafe Manager” display.

From the View menu, select “Groups owned by Nodes” to monitor the health of the default cluster, its resource groups, and each group's resources.
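The resource group and node states described in the preceding sections can be summarized in a small model. This is an illustrative sketch, not FailSafe code; it encodes only facts stated in this chapter (which resource group states are transient, and which node states are part of the cluster membership).

```python
# Illustrative model (not FailSafe code) of the states described in this
# chapter. Transient resource-group states settle on their own; UNKNOWN is
# the node state that indicates no heartbeats plus a failed reset attempt.

TRANSIENT_GROUP_STATES = {
    "ONLINE-PENDING",
    "OFFLINE-PENDING",
    "DISCOVERY (EXCLUSIVITY)",
    "INITIALIZING",
}
STABLE_GROUP_STATES = {
    "ONLINE",
    "OFFLINE",
    "ONLINE-READY",
    "ONLINE-MAINTENANCE",
    "INTERNAL ERROR",
}

# Node states mapped to whether the node is part of the cluster membership.
NODE_IN_MEMBERSHIP = {"UP": True, "DOWN": False, "UNKNOWN": False, "INACTIVE": False}

def group_is_settled(state):
    """Deactivating HA services fails while resource groups are in transition."""
    return state in STABLE_GROUP_STATES

def node_needs_attention(state):
    """UNKNOWN means no heartbeats and a failed reset: operator action is needed."""
    return state == "UNKNOWN"
```

A check like `group_is_settled` reflects the note later in this chapter that a deactivate of HA services may fail until resource groups have settled into a stable state.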
Querying Node Status with the Cluster Manager CLI

To query node status, use the following CLI command:

cmgr> show status of node A

Pinging the System Controller with the Cluster Manager CLI

When Linux FailSafe is running, you can determine whether the system controller on a node is responding with the following Cluster Manager CLI command:

cmgr> admin ping node A

This command uses the Linux FailSafe daemons to test whether the system controller is responding.

You can verify reset connectivity on a node in a cluster even when the Linux FailSafe daemons are not running by using the standalone option of the admin ping command of the CLI:

cmgr> admin ping standalone node A

This command does not go through the Linux FailSafe daemons, but calls the ping command directly to test whether the system controller on the indicated node is responding.

Cluster Status

To query the status of a cluster, you provide the name of the cluster. The cluster status can be one of the following:

ACTIVE
INACTIVE

Querying Cluster Status with the Cluster Manager GUI

You can use the Cluster View of the Cluster Manager GUI to monitor the status of the clusters in a Linux FailSafe system.

Querying Cluster Status with the Cluster Manager CLI

To query cluster status, use the following CLI command:

cmgr> show status of cluster A

Viewing System Status with the haStatus CLI Script

The haStatus script provides status and configuration information about clusters, nodes, resources, and resource groups in the configuration. This script is installed in the /var/cluster/cmgr-scripts directory. You can modify this script to suit your needs. See the haStatus(1M) man page for further information about this script.

The following examples show the output of the different options of the haStatus script.
# haStatus -help
Usage: haStatus [-a|-i] [-c clustername]
where,
  -a prints detailed cluster configuration information and cluster status.
  -i prints detailed cluster configuration information only.
  -c can be used to specify a cluster for which status is to be printed.
  “clustername” is the name of the cluster for which status is to be printed.
# haStatus
Tue Nov 30 14:12:09 PST 1999
Cluster test-cluster:
  Cluster state is ACTIVE.
Node hans2:
  State of machine is UP.
Node hans1:
  State of machine is UP.
Resource_group nfs-group1:
  State: Online
  Error: No error
  Owner: hans1
  Failover Policy: fp_h1_h2_ord_auto_auto
  Resources:
    /hafs1 (type: NFS)
    /hafs1/nfs/statmon (type: statd)
    150.166.41.95 (type: IP_address)
    /hafs1 (type: filesystem)
    havol1 (type: volume)
# haStatus -i
Tue Nov 30 14:13:52 PST 1999
Cluster test-cluster:
Node hans2:
  Logical Machine Name: hans2
  Hostname: hans2.engr.sgi.com
  Is FailSafe: true
  Is Cellular: false
  Nodeid: 32418
  Reset type: powerCycle
  System Controller: msc
  System Controller status: enabled
  System Controller owner: hans1
  System Controller owner device: /dev/ttyd2
  System Controller owner type: tty
  ControlNet Ipaddr: 192.26.50.15
  ControlNet HB: true
  ControlNet Control: true
  ControlNet Priority: 1
  ControlNet Ipaddr: 150.166.41.61
  ControlNet HB: true
  ControlNet Control: false
  ControlNet Priority: 2
Node hans1:
  Logical Machine Name: hans1
  Hostname: hans1.engr.sgi.com
  Is FailSafe: true
  Is Cellular: false
  Nodeid: 32645
  Reset type: powerCycle
  System Controller: msc
  System Controller status: enabled
  System Controller owner: hans2
  System Controller owner device: /dev/ttyd2
  System Controller owner type: tty
  ControlNet Ipaddr: 192.26.50.14
  ControlNet HB: true
  ControlNet Control: true
  ControlNet Priority: 1
  ControlNet Ipaddr: 150.166.41.60
  ControlNet HB: true
  ControlNet Control: false
  ControlNet Priority: 2
Resource_group nfs-group1:
  Failover Policy: fp_h1_h2_ord_auto_auto
  Version: 1
  Script: ordered
  Attributes: Auto_Failback Auto_Recovery
  Initial AFD: hans1 hans2
  Resources:
    /hafs1 (type: NFS)
    /hafs1/nfs/statmon (type: statd)
    150.166.41.95 (type: IP_address)
    /hafs1 (type: filesystem)
    havol1 (type: volume)
Resource /hafs1 (type NFS):
  export-info: rw,wsync
  filesystem: /hafs1
  Resource dependencies
    statd /hafs1/nfs/statmon
    filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd):
  InterfaceAddress: 150.166.41.95
  Resource dependencies
    IP_address 150.166.41.95
    filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
  NetworkMask: 0xffffff00
  interfaces: ef1
  BroadcastAddress: 150.166.41.255
  No resource dependencies
Resource /hafs1 (type filesystem):
  volume-name: havol1
  mount-options: rw,noauto
  monitoring-level: 2
  Resource dependencies
    volume havol1
Resource havol1 (type volume):
  devname-group: sys
  devname-owner: root
  devname-mode: 666
  No resource dependencies
Failover_policy fp_h1_h2_ord_auto_auto:
  Version: 1
  Script: ordered
  Attributes: Auto_Failback Auto_Recovery
  Initial AFD: hans1 hans2
# haStatus -a
Tue Nov 30 14:45:30 PST 1999
Cluster test-cluster:
  Cluster state is ACTIVE.
Node hans2:
  State of machine is UP.
  Logical Machine Name: hans2
  Hostname: hans2.engr.sgi.com
  Is FailSafe: true
  Is Cellular: false
  Nodeid: 32418
  Reset type: powerCycle
  System Controller: msc
  System Controller status: enabled
  System Controller owner: hans1
  System Controller owner device: /dev/ttyd2
  System Controller owner type: tty
  ControlNet Ipaddr: 192.26.50.15
  ControlNet HB: true
  ControlNet Control: true
  ControlNet Priority: 1
  ControlNet Ipaddr: 150.166.41.61
  ControlNet HB: true
  ControlNet Control: false
  ControlNet Priority: 2
Node hans1:
  State of machine is UP.
  Logical Machine Name: hans1
  Hostname: hans1.engr.sgi.com
  Is FailSafe: true
  Is Cellular: false
  Nodeid: 32645
  Reset type: powerCycle
  System Controller: msc
  System Controller status: enabled
  System Controller owner: hans2
  System Controller owner device: /dev/ttyd2
  System Controller owner type: tty
  ControlNet Ipaddr: 192.26.50.14
  ControlNet HB: true
  ControlNet Control: true
  ControlNet Priority: 1
  ControlNet Ipaddr: 150.166.41.60
  ControlNet HB: true
  ControlNet Control: false
  ControlNet Priority: 2
Resource_group nfs-group1:
  State: Online
  Error: No error
  Owner: hans1
  Failover Policy: fp_h1_h2_ord_auto_auto
  Version: 1
  Script: ordered
  Attributes: Auto_Failback Auto_Recovery
  Initial AFD: hans1 hans2
  Resources:
    /hafs1 (type: NFS)
    /hafs1/nfs/statmon (type: statd)
    150.166.41.95 (type: IP_address)
    /hafs1 (type: filesystem)
    havol1 (type: volume)
Resource /hafs1 (type NFS):
  State: Online
  Error: None
  Owner: hans1
  Flags: Resource is monitored locally
  export-info: rw,wsync
  filesystem: /hafs1
  Resource dependencies
    statd /hafs1/nfs/statmon
    filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd):
  State: Online
  Error: None
  Owner: hans1
  Flags: Resource is monitored locally
  InterfaceAddress: 150.166.41.95
  Resource dependencies
    IP_address 150.166.41.95
    filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
  State: Online
  Error: None
  Owner: hans1
  Flags: Resource is monitored locally
  NetworkMask: 0xffffff00
  interfaces: ef1
  BroadcastAddress: 150.166.41.255
  No resource dependencies
Resource /hafs1 (type filesystem):
  State: Online
  Error: None
  Owner: hans1
  Flags: Resource is monitored locally
  volume-name: havol1
  mount-options: rw,noauto
  monitoring-level: 2
  Resource dependencies
    volume havol1
Resource havol1 (type volume):
  State: Online
  Error: None
  Owner: hans1
  Flags: Resource is monitored locally
  devname-group: sys
  devname-owner: root
  devname-mode: 666
  No resource dependencies
# haStatus -c test-cluster
Tue Nov 30 14:42:04 PST 1999
Cluster test-cluster:
  Cluster state is ACTIVE.
Node hans2:
  State of machine is UP.
Node hans1:
  State of machine is UP.
Resource_group nfs-group1:
  State: Online
  Error: No error
  Owner: hans1
  Failover Policy: fp_h1_h2_ord_auto_auto
  Resources:
    /hafs1 (type: NFS)
    /hafs1/nfs/statmon (type: statd)
    150.166.41.95 (type: IP_address)
    /hafs1 (type: filesystem)
    havol1 (type: volume)

Resource Group Failover

While a Linux FailSafe system is running, you can move a resource group online to a particular node, or you can take a resource group offline. In addition, you can move a resource group from one node in a cluster to another node in a cluster. The following subsections describe these tasks.

Bringing a Resource Group Online

Before you bring a resource group online for the first time, you should run the diagnostic tests on that resource group. Diagnostics check system configurations and perform some validations that are not performed when you bring a resource group online.

To bring a resource group online, you specify the name of the resource group and the name of the cluster that contains the node. You cannot bring a resource group online if the resource group has no members.

To bring a resource group fully online, HA services must be active. When HA services are active, an attempt is made to allocate the resource group in the cluster. However, you can also execute a command to bring the resource group online when HA services are not active. In that case, the resource group is marked to be brought online when HA services become active.

Before bringing a resource group online in the cluster, you must be sure that the resource group is not running on a disabled node (where HA services are not running).
Bringing a resource group online while it is running on a disabled node could cause data corruption. For information on detached resource groups, see .

Bringing a Resource Group Online with the Cluster Manager GUI

To bring a resource group online using the Cluster Manager GUI, perform the following steps:

On the left side of the display, click on the “Failover Policies & Resource Groups” category.
On the right side of the display, click on the “Bring a Resource Group Online” task link to launch the task.
Enter the selected inputs.
Click on “OK” at the bottom of the screen to complete the task.

Bringing a Resource Group Online with the Cluster Manager CLI

To bring a resource group online, use the following CLI command:

cmgr> admin online resource_group A [in cluster B]

If you have specified a default cluster, you do not need to specify a cluster when you use this command.

Taking a Resource Group Offline

When you take a resource group offline, Linux FailSafe takes each resource in the resource group offline in a predefined order. If any single resource gives an error during this process, the process stops, leaving all remaining resources allocated.

You can take a Linux FailSafe resource group offline in any of three ways:

Take the resource group offline. This physically stops the processes for that resource group and does not reset any error conditions. If this operation fails, the resource group will be left online in an error state.

Force the resource group offline. This physically stops the processes for that resource group but resets any error conditions. This operation cannot fail.

Detach the resource group. This causes Linux FailSafe to stop monitoring the resource group, but does not physically stop the processes in that group. Linux FailSafe will report the status as offline and will not have any control over the group. This operation should rarely fail.
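The three offline variants above differ in whether they stop the group's processes and whether they clear error conditions. The following Python sketch summarizes those semantics and assembles the corresponding cmgr command; it is illustrative only (the command syntax is from this chapter, the helper itself is hypothetical).

```python
# Illustrative summary (not FailSafe code) of the three offline variants
# described above, plus a helper that builds the matching cmgr command.

OFFLINE_VARIANTS = {
    # cmgr verb:       stops processes?   clears error conditions?
    "offline":        {"stops_processes": True,  "resets_errors": False},
    "offline_force":  {"stops_processes": True,  "resets_errors": True},
    "offline_detach": {"stops_processes": False, "resets_errors": False},
}

def offline_command(variant, resource_group, cluster=None):
    """Build 'admin <variant> resource_group <name> [in cluster <name>]'."""
    if variant not in OFFLINE_VARIANTS:
        raise ValueError("unknown offline variant: " + variant)
    command = "admin %s resource_group %s" % (variant, resource_group)
    if cluster is not None:
        command += " in cluster %s" % cluster
    return command
```

For example, `offline_command("offline_detach", "nfs-group1", "test-cluster")` yields the detach form of the command shown in the next section.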
If you do not need to stop the resource group and do not want Linux FailSafe to monitor the resource group while you make changes, but you would still like to have administrative control over the resource group (for instance, to move that resource group to another node), you can put the resource group in maintenance mode using the “Suspend Monitoring a Resource Group” task on the GUI or the admin maintenance_on command of the CLI, as described in .

Detaching a resource group leaves the resources in the resource group running on the cluster node where it was online. After stopping HA services on that cluster node, you should not bring the resource group online on another node in the cluster, as this may cause data corruption.

Taking a Resource Group Offline with the Cluster Manager GUI

To take a resource group offline using the Cluster Manager GUI, perform the following steps:

Launch the Cluster Manager.
On the left side of the display, click on the “Failover Policies & Resource Groups” category.
On the right side of the display, click on the “Take a Resource Group Offline” task link to launch the task.
Enter the selected inputs.
Click on “OK” at the bottom of the screen to complete the task.

Taking a Resource Group Offline with the Cluster Manager CLI

To take a resource group offline, use the following CLI command:

cmgr> admin offline resource_group A [in cluster B]

If you have specified a default cluster, you do not need to specify a cluster in this command and the CLI will use the default.

To take a resource group offline with the force option in effect, use the following CLI command:

cmgr> admin offline_force resource_group A [in cluster B]

To detach a resource group, use the following CLI command:

cmgr> admin offline_detach resource_group A [in cluster B]

Moving a Resource Group

While Linux FailSafe is active, you can move a resource group to another node in the same cluster.
When you move a resource group, you specify the following:

The name of the resource group.
The logical name of the destination node (optional). When you do not provide a logical destination name, Linux FailSafe chooses the destination based on the failover policy.
The name of the cluster that contains the nodes.

Moving a Resource Group with the Cluster Manager GUI

To move a resource group using the Cluster Manager GUI, perform the following steps:

On the left side of the display, click on the “Failover Policies & Resource Groups” category.
On the right side of the display, click on the “Move a Resource Group” task link to launch the task.
Enter the selected inputs.
Click on “OK” at the bottom of the screen to complete the task.

Moving a Resource Group with the Cluster Manager CLI

To move a resource group to another node, use the following CLI command:

cmgr> admin move resource_group A [in cluster B] [to node C]

Stop Monitoring of a Resource Group (Maintenance Mode)

You can temporarily stop Linux FailSafe from monitoring a specific resource group, which puts the resource group in maintenance mode. The resource group remains on the same node in the cluster but is no longer monitored by Linux FailSafe for resource failures.

You can put a resource group into maintenance mode if you do not want Linux FailSafe to monitor the group for a period of time. You may want to do this for upgrade or testing purposes, or if there is any reason that Linux FailSafe should not act on that resource group. When a resource group is in maintenance mode, it is not being monitored and it is not highly available. If the resource group's owner node fails, Linux FailSafe will move the resource group to another node and resume monitoring.

When you put a resource group into maintenance mode, resources in the resource group are in ONLINE-MAINTENANCE state.
The ONLINE-MAINTENANCE state for the resource is seen only on the node that has the resource online. All other nodes will show the resource as ONLINE. The resource group, however, appears as being in ONLINE-MAINTENANCE state on all nodes.

Putting a Resource Group into Maintenance Mode with the Cluster Manager GUI

To put a resource group into maintenance mode using the Cluster Manager GUI, perform the following steps:

On the left side of the display, click on the “Failover Policies & Resource Groups” category.
On the right side of the display, click on the “Suspend Monitoring a Resource Group” task link to launch the task.
Enter the selected inputs.

Resume Monitoring of a Resource Group with the Cluster Manager GUI

To resume monitoring a resource group using the Cluster Manager GUI, perform the following steps:

On the left side of the display, click on the “Failover Policies & Resource Groups” category.
On the right side of the display, click on the “Resume Monitoring a Resource Group” task link to launch the task.
Enter the selected inputs.

Putting a Resource Group into Maintenance Mode with the Cluster Manager CLI

To put a resource group into maintenance mode, use the following CLI command:

cmgr> admin maintenance_on resource_group A [in cluster B]

If you have specified a default cluster, you do not need to specify a cluster when you use this command.

Resume Monitoring of a Resource Group with the Cluster Manager CLI

To move a resource group back online from maintenance mode, use the following CLI command:

cmgr> admin maintenance_off resource_group A [in cluster B]

Deactivating (Stopping) Linux FailSafe

You can stop the execution of Linux FailSafe on a systemwide basis, on all the nodes in a cluster, or on a specified node only. Deactivating a node or a cluster is a complex operation that involves several steps and can take several minutes.
Aborting a deactivate operation can leave the nodes and the resources in an unintended state.

When deactivating HA services on a node or for a cluster, the operation may fail if any resource groups are not in a stable, clean state. Resource groups that are in transition will cause any deactivate HA services command to fail. In many cases, the command may succeed at a later time, after resource groups have settled into a stable state.

After you have successfully deactivated a node or a cluster, the node or cluster should have no resource groups and all HA services should be gone.

Serially stopping HA services on every node in a cluster is not the same as stopping HA services for the entire cluster. In the former case, an attempt is made to keep resource groups online and highly available; in the latter case, resource groups are moved offline, as described in the following sections.

When you stop HA services, the Linux FailSafe daemons perform the following actions:

A shutdown request is sent to Linux FailSafe (FSD).
FSD releases all resource groups and puts them in ONLINE-READY state.
All nodes in the cluster in the configuration database are disabled (one node at a time and the local node last).
Linux FailSafe waits until the node is removed from the cluster membership before disabling the node. The shutdown is successful only when all nodes are no longer part of the cluster membership.
CMOND receives notification from the configuration database when nodes are disabled.
The local CMOND sends SIGTERM to all HA processes and IFD.
All HA processes clean up and exit with a “don't restart” code.
All other CMSD daemons remove the disabled node from the cluster membership.

Deactivating HA Services on a Node

The operation of deactivating a node tries to move all resource groups from the node to some other node and then tries to disable the node in the cluster, subsequently killing all HA processes.
When HA services are stopped on a node, all resource groups owned by that node are moved to some other node in the cluster that is capable of maintaining these resource groups in a highly available state. This operation will fail if there is no node that can take over these resource groups. This condition always occurs when you deactivate HA services on the last node in a cluster. In this circumstance, you can specify the force option to shut down the node even if resource groups cannot be moved or released. This will normally leave resource groups allocated in a non-highly-available state on that same node. Using the force option might result in the node being reset. In order to guarantee that the resource groups remain allocated on the last node in a cluster, all online resource groups should be detached. If you wish to move resource groups offline that are owned by the node being shut down, you must do so prior to deactivating the node. Deactivating HA Services in a Cluster The operation of deactivating a cluster attempts to release all resource groups and disable all nodes in the cluster, subsequently killing all HA processes. When a cluster is deactivated and the Linux FailSafe HA services are stopped on that cluster, resource groups are moved offline or deallocated. If you want the resource groups to remain allocated, you must detach the resource groups before attempting to deactivate the cluster. Serially stopping HA services on every node in a cluster is not the same as stopping HA services for the entire cluster. In the former case, an attempt is made to keep resource groups online and highly available, while in the latter case resource groups are moved offline. Deactivating Linux FailSafe with the Cluster Manager GUI To stop Linux FailSafe services using the Cluster Manager GUI, perform the following steps: On the left side of the display, click on the “Nodes & Cluster” category. 
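For the last node in a cluster, where no takeover node exists, one possible pattern is to try a clean stop first and fall back to the force option. This is a hypothetical sketch (node and cluster names are examples, cmgr is assumed on the PATH), and it accepts the consequences the text describes: remaining resource groups are left in a non-highly-available state, and the node may be reset.

```shell
#!/bin/sh
# Hypothetical last-node shutdown: attempt a clean stop first; if no other
# node can take over the resource groups, fall back to the force option.
# Node and cluster names are examples; cmgr is assumed to be on the PATH.

stop_last_node() {
    node=$1
    cluster=$2

    if echo "stop ha_services on node $node for cluster $cluster" | cmgr; then
        return 0
    fi
    echo "clean stop failed; forcing HA shutdown on $node" >&2
    echo "(resource groups may stay allocated; the node may be reset)" >&2
    echo "stop ha_services on node $node for cluster $cluster force" | cmgr
}
```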
On the right side of the display click on the “Stop FailSafe HA Services” task link to launch the task. Enter the selected inputs. Click on “OK” at the bottom of the screen to complete the task. Deactivating Linux FailSafe with the Cluster Manager CLI To deactivate Linux FailSafe in a cluster and stop Linux FailSafe processing, use the following command: cmgr> stop ha_services [on node A] [for cluster B] [force] Resetting Nodes You can use Linux FailSafe to reset nodes in a cluster. This sends a reset command to the system controller port on the specified node. When the node is reset, other nodes in the cluster will detect this and remove the node from the active cluster, reallocating any resource groups that were allocated on that node onto a backup node. The backup node used depends on how you have configured your system. Once the node reboots, it will rejoin the cluster. Some resource groups might move back to the node, depending on how you have configured your system. Resetting a Node with the Cluster Manager GUI To reset a Linux FailSafe node using the Cluster Manager GUI, perform the following steps: On the left side of the display, click on the “Nodes & Cluster” category. On the right side of the display click on the “Reset a Node” task link to launch the task. Enter the node to reset. Click on “OK” at the bottom of the screen to complete the task. Resetting a Node with the Cluster Manager CLI When Linux FailSafe is running, you can reboot a node with the following Cluster Manager CLI command: cmgr> admin reset node A This command uses the Linux FailSafe daemons to reset the specified node. You can reset a node in a cluster even when the Linux FailSafe daemons are not running by using the standalone option of the admin reset command of the CLI: cmgr> admin reset standalone node A This command does not go through the Linux FailSafe daemons. 
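A wrapper can choose between the two reset forms based on whether the FailSafe daemons appear to be running. This is a sketch under stated assumptions: the daemon process name checked here ("fsd") is an assumption drawn from the daemon list earlier in this chapter, and cmgr is assumed to be on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: use the normal reset when the FailSafe daemons are
# up, otherwise fall back to the standalone reset, which bypasses them.
# The process name "fsd" checked below is an assumption of this example.

reset_node() {
    node=$1
    if pgrep -x fsd >/dev/null 2>&1; then
        echo "admin reset node $node" | cmgr
    else
        echo "admin reset standalone node $node" | cmgr
    fi
}
```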
Backing Up and Restoring the Configuration with the Cluster Manager CLI The Cluster Manager CLI provides scripts that you can use to back up and restore your configuration: cdbDump and cdbRestore. These scripts are installed in the /var/cluster/cmgr-scripts directory. You can modify these scripts to suit your needs. The cdbDump script, as provided, creates compressed tar files of the /var/cluster/cdb/cdb.db# directory and the /var/cluster/cdb.db file. The cdbRestore script, as provided, restores the compressed tar files of the /var/cluster/cdb/cdb.db# directory and the /var/cluster/cdb.db file. When you use the cdbDump and cdbRestore scripts, observe the following guidelines: Run the cdbDump and cdbRestore scripts only when no administrative commands are running; otherwise, the backup may be inconsistent. You must back up the configuration of each node in the cluster separately. The configuration information is different for each node, and all node-specific information is stored locally only. Run the backup procedure whenever you change your configuration. Backups of all nodes in the pool that were taken at the same time should be restored together. Cluster and Linux FailSafe processes must not be running when you restore your configuration. In addition to the above restrictions, you should not perform a cdbDump while information is changing in the CDB. Check SYSLOG for information to help determine when CDB activity is occurring. As a rule of thumb, you should be able to perform a cdbDump safely if at least 15 minutes have passed since the last node joined the cluster or the last administration command was run.
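As an illustration of what a cdbDump-style backup does, the following sketch archives the two CDB paths named above into a timestamped, host-named tarball. It is not the shipped script: the destination argument and the CDB_ROOT override (useful for testing) are inventions of this example.

```shell
#!/bin/sh
# Illustrative cdbDump-style backup (not the shipped script): archive the
# configuration database into a timestamped, host-named tarball.
# CDB_ROOT can be overridden for testing; it defaults to /var/cluster.

backup_cdb() {
    dest=$1
    root=${CDB_ROOT:-/var/cluster}
    archive="$dest/cdb-$(hostname)-$(date +%Y%m%d-%H%M%S).tar.gz"

    # The shipped cdbDump archives the cdb/cdb.db# directory and the
    # cdb.db file. Remember to run this on every node, and only when no
    # administrative commands are active and the CDB is quiescent.
    tar czf "$archive" -C "$root" 'cdb/cdb.db#' cdb.db && echo "$archive"
}
```

Printing the archive path on success makes the helper easy to compose with a restore or off-host copy step.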