While the Linux FailSafe system is running, you can monitor the status of the Linux FailSafe components to determine the state of each component. Linux FailSafe allows you to view the system status in the following ways:
You can keep continuous watch on the state of a cluster using the FailSafe Cluster View of the Cluster Manager GUI.
You can query the status of an individual resource group, node, or cluster using either the Cluster Manager GUI or the Cluster Manager CLI.
You can use the haStatus script provided with the Cluster Manager CLI to see the status of all clusters, nodes, resources, and resource groups in the configuration.
The following sections describe the procedures for performing each of these tasks.
The easiest way to keep a continuous watch on the state of a cluster is to use the FailSafe Cluster View of the Cluster Manager GUI.
In the FailSafe Cluster View window, system components that are experiencing problems appear as blinking red icons. Components in transitional states also appear as blinking icons. If there is a problem in a resource group or node, the FailSafe Cluster View icon for the cluster turns red and blinks, as does the icon for that resource group or node.
The full color legend for component states in the FailSafe Cluster View is as follows:
healthy but not online or active
healthy and active or online
transitioning to green
problems with component
resource type
maintenance mode, may or may not be currently monitored by Linux FailSafe
If you minimize the FailSafe Cluster View window, the minimized icon shows the current state of the cluster. When the cluster has Linux FailSafe HA services active and there is no error, the icon shows a green cluster. When the cluster goes into an error state, the icon shows a red cluster. When the cluster has Linux FailSafe HA services inactive, the icon shows a grey cluster.
You can use the CLI to query the status of a resource or to ping the system controller at a node, as described in the following subsections.
To query a resource status, use the following CLI command:
cmgr> show status of resource A of resource_type B [in cluster C]
If you have specified a default cluster, you do not need to specify a cluster when you use this command; it shows the status of the indicated resource in the default cluster.
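For example, to check the NFS resource /hafs1 in the cluster test-cluster used in the sample haStatus output later in this section, you might enter the following (substitute your own resource, resource type, and cluster names):

cmgr> show status of resource /hafs1 of resource_type NFS in cluster test-cluster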
To perform a ping operation on a system controller by providing the device name, use the following CLI command:
cmgr> admin ping dev_name A of dev_type B with sysctrl_type C
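For example, using the system controller settings that appear in the sample haStatus output later in this section (owner device /dev/ttyd2, device type tty, system controller type msc), the command might look like the following; adjust the values to match your own hardware:

cmgr> admin ping dev_name /dev/ttyd2 of dev_type tty with sysctrl_type msc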
To query the status of a resource group, you provide the name of the resource group and the cluster which includes the resource group. Resource group status includes the following components:
Resource group state
Resource group error state
Resource owner
These components are described in the following subsections.
If a node on which a resource group is online has a status of UNKNOWN, the status of the resource group will not be available or ONLINE-READY.
A resource group state can be one of the following:
Linux FailSafe is running on the local nodes. The resource group is allocated on a node in the cluster and is being monitored by Linux FailSafe. It is fully allocated if there is no error; otherwise, some resources may not be allocated or some resources may be in error state.
Linux FailSafe is running on the local nodes and the resource group is in the process of being allocated. This is a transient state.
The resource group is not running or the resource group has been detached, regardless of whether Linux FailSafe is running. When Linux FailSafe starts up, it will not allocate this resource group.
Linux FailSafe is running on the local nodes and the resource group is in the process of being released (becoming offline). This is a transient state.
Linux FailSafe is not running on the local node. When Linux FailSafe starts up, it will attempt to bring this resource group online. No Linux FailSafe process is running on the current node if this state is returned.
The resource group is allocated in a node in the cluster but it is not being monitored by Linux FailSafe. If a node failure occurs while a resource group in ONLINE-MAINTENANCE state resides on that node, the resource group will be moved to another node and monitoring will resume. An administrator may move a resource group to an ONLINE-MAINTENANCE state for upgrade or testing purposes, or if there is any reason that Linux FailSafe should not act on that resource for a period of time.
An internal Linux FailSafe error has occurred and Linux FailSafe does not know the state of the resource group. Error recovery is required.
The resource group is in the process of going online if Linux FailSafe can correctly determine whether any resource in the resource group is already allocated on all nodes in the resource group's application failure domain. This is a transient state.
Linux FailSafe on the local node has yet to get any information about this resource group. This is a transient state.
When a resource group is ONLINE, its error status is continually being monitored. A resource group error status can be one of the following:
Resource group has no error.
Notify Silicon Graphics if this condition arises.
Node that had the resource group online is in unknown state. This occurs when the node is not part of the cluster. The last known state of the resource group is ONLINE, but the system cannot talk to the node.
The start or stop action has failed for a resource in the resource group.
Linux FailSafe has determined that part of the resource group was running on at least two different nodes in the cluster.
Linux FailSafe has determined that one of the nodes in the resource group's application failure domain was not in the membership. Linux FailSafe cannot bring the resource group online until that node is removed from the application failure domain or HA services are started on that node.
In the process of turning maintenance mode on or off, an error occurred. Linux FailSafe can no longer determine if monitoring is enabled or disabled. Retry the operation. If the error continues, report the error to Silicon Graphics.
A monitoring error has occurred on the last valid node in the cluster's membership.
The resource owner is the logical node name of the node that currently owns the resource.
You can use the FailSafe Cluster View to monitor the status of the resources in a Linux FailSafe configuration. You can launch the FailSafe Cluster View directly, or you can bring it up at any time by clicking on “FailSafe Cluster View” at the bottom of the “FailSafe Manager” display.
From the View menu, select “Resources in Groups” to see the resources organized by the groups they belong to, or select “Groups owned by Nodes” to see where the online groups are running. This view lets you observe failovers as they occur.
To query a resource group status, use the following CLI command:
cmgr> show status of resource_group A [in cluster B]
If you have specified a default cluster, you do not need to specify a cluster when you use this command; it shows the status of the indicated resource group in the default cluster.
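For example, to check the resource group nfs-group1 from the sample configuration shown later in this section, you might enter:

cmgr> show status of resource_group nfs-group1 in cluster test-cluster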
To query the status of a node, you provide the logical node name of the node. The node status can be one of the following:
This node is part of cluster membership.
This node is not part of cluster membership (no heartbeats) and this node has been reset. This is a transient state.
This node is not part of cluster membership (no heartbeats) and this node has not been reset (reset attempt has failed).
HA services have not been started on this node.
When you start HA services, node states transition from INACTIVE to UP. A node state may also transition from INACTIVE to UNKNOWN to UP.
You can use the FailSafe Cluster View to monitor the status of the clusters in a Linux FailSafe configuration. You can launch the FailSafe Cluster View directly, or you can bring it up at any time by clicking on “FailSafe Cluster View” at the bottom of the “FailSafe Manager” display.
From the View menu, select “Groups owned by Nodes” to monitor the health of the default cluster, its resource groups, and the group's resources.
To query node status, use the following CLI command:
cmgr> show status of node A
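For example, to query the status of a node with the logical node name hans1 (one of the nodes in the sample configuration shown later in this section), you might enter:

cmgr> show status of node hans1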
When Linux FailSafe is running, you can determine whether the system controller on a node is responding with the following Cluster Manager CLI command:
cmgr> admin ping node A
This command uses the Linux FailSafe daemons to test whether the system controller is responding.
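For example, to test the system controller on the sample node hans2, you might enter:

cmgr> admin ping node hans2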
You can verify reset connectivity on a node in a cluster even when the Linux FailSafe daemons are not running by using the standalone option of the admin ping command of the CLI:
cmgr> admin ping standalone node A
This command does not go through the Linux FailSafe daemons, but calls the ping command directly to test whether the system controller on the indicated node is responding.
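For example, to verify reset connectivity to the sample node hans2 while the Linux FailSafe daemons are not running, you might enter:

cmgr> admin ping standalone node hans2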
To query the status of a cluster, you provide the name of the cluster. The cluster status can be one of the following:
ACTIVE
INACTIVE
You can use the Cluster View of the Cluster Manager GUI to monitor the status of the clusters in a Linux FailSafe system.
To query node and cluster status, use the following CLI command:
cmgr> show status of cluster A
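For example, to query the sample cluster test-cluster, you might enter:

cmgr> show status of cluster test-cluster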
The haStatus script provides status and configuration information about clusters, nodes, resources, and resource groups in the configuration. This script is installed in the /var/cluster/cmgr-scripts directory. You can modify this script to suit your needs. See the haStatus(1M) man page for further information about this script.
The following examples show the output of the different options of the haStatus script.
# haStatus -help
Usage: haStatus [-a|-i] [-c clustername]
where,
    -a prints detailed cluster configuration information and cluster status.
    -i prints detailed cluster configuration information only.
    -c can be used to specify a cluster for which status is to be printed.
    "clustername" is the name of the cluster for which status is to be printed.

# haStatus
Tue Nov 30 14:12:09 PST 1999
Cluster test-cluster:
    Cluster state is ACTIVE.
Node hans2:
    State of machine is UP.
Node hans1:
    State of machine is UP.
Resource_group nfs-group1:
    State: Online
    Error: No error
    Owner: hans1
    Failover Policy: fp_h1_h2_ord_auto_auto
    Resources:
        /hafs1 (type: NFS)
        /hafs1/nfs/statmon (type: statd)
        150.166.41.95 (type: IP_address)
        /hafs1 (type: filesystem)
        havol1 (type: volume)

# haStatus -i
Tue Nov 30 14:13:52 PST 1999
Cluster test-cluster:
Node hans2:
    Logical Machine Name: hans2
    Hostname: hans2.engr.sgi.com
    Is FailSafe: true
    Is Cellular: false
    Nodeid: 32418
    Reset type: powerCycle
    System Controller: msc
    System Controller status: enabled
    System Controller owner: hans1
    System Controller owner device: /dev/ttyd2
    System Controller owner type: tty
    ControlNet Ipaddr: 192.26.50.15
    ControlNet HB: true
    ControlNet Control: true
    ControlNet Priority: 1
    ControlNet Ipaddr: 150.166.41.61
    ControlNet HB: true
    ControlNet Control: false
    ControlNet Priority: 2
Node hans1:
    Logical Machine Name: hans1
    Hostname: hans1.engr.sgi.com
    Is FailSafe: true
    Is Cellular: false
    Nodeid: 32645
    Reset type: powerCycle
    System Controller: msc
    System Controller status: enabled
    System Controller owner: hans2
    System Controller owner device: /dev/ttyd2
    System Controller owner type: tty
    ControlNet Ipaddr: 192.26.50.14
    ControlNet HB: true
    ControlNet Control: true
    ControlNet Priority: 1
    ControlNet Ipaddr: 150.166.41.60
    ControlNet HB: true
    ControlNet Control: false
    ControlNet Priority: 2
Resource_group nfs-group1:
    Failover Policy: fp_h1_h2_ord_auto_auto
        Version: 1
        Script: ordered
        Attributes: Auto_Failback Auto_Recovery
        Initial AFD: hans1 hans2
    Resources:
        /hafs1 (type: NFS)
        /hafs1/nfs/statmon (type: statd)
        150.166.41.95 (type: IP_address)
        /hafs1 (type: filesystem)
        havol1 (type: volume)
Resource /hafs1 (type NFS):
    export-info: rw,wsync
    filesystem: /hafs1
    Resource dependencies
        statd /hafs1/nfs/statmon
        filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd):
    InterfaceAddress: 150.166.41.95
    Resource dependencies
        IP_address 150.166.41.95
        filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
    NetworkMask: 0xffffff00
    interfaces: ef1
    BroadcastAddress: 150.166.41.255
    No resource dependencies
Resource /hafs1 (type filesystem):
    volume-name: havol1
    mount-options: rw,noauto
    monitoring-level: 2
    Resource dependencies
        volume havol1
Resource havol1 (type volume):
    devname-group: sys
    devname-owner: root
    devname-mode: 666
    No resource dependencies
Failover_policy fp_h1_h2_ord_auto_auto:
    Version: 1
    Script: ordered
    Attributes: Auto_Failback Auto_Recovery
    Initial AFD: hans1 hans2

# haStatus -a
Tue Nov 30 14:45:30 PST 1999
Cluster test-cluster:
    Cluster state is ACTIVE.
Node hans2:
    State of machine is UP.
    Logical Machine Name: hans2
    Hostname: hans2.engr.sgi.com
    Is FailSafe: true
    Is Cellular: false
    Nodeid: 32418
    Reset type: powerCycle
    System Controller: msc
    System Controller status: enabled
    System Controller owner: hans1
    System Controller owner device: /dev/ttyd2
    System Controller owner type: tty
    ControlNet Ipaddr: 192.26.50.15
    ControlNet HB: true
    ControlNet Control: true
    ControlNet Priority: 1
    ControlNet Ipaddr: 150.166.41.61
    ControlNet HB: true
    ControlNet Control: false
    ControlNet Priority: 2
Node hans1:
    State of machine is UP.
    Logical Machine Name: hans1
    Hostname: hans1.engr.sgi.com
    Is FailSafe: true
    Is Cellular: false
    Nodeid: 32645
    Reset type: powerCycle
    System Controller: msc
    System Controller status: enabled
    System Controller owner: hans2
    System Controller owner device: /dev/ttyd2
    System Controller owner type: tty
    ControlNet Ipaddr: 192.26.50.14
    ControlNet HB: true
    ControlNet Control: true
    ControlNet Priority: 1
    ControlNet Ipaddr: 150.166.41.60
    ControlNet HB: true
    ControlNet Control: false
    ControlNet Priority: 2
Resource_group nfs-group1:
    State: Online
    Error: No error
    Owner: hans1
    Failover Policy: fp_h1_h2_ord_auto_auto
        Version: 1
        Script: ordered
        Attributes: Auto_Failback Auto_Recovery
        Initial AFD: hans1 hans2
    Resources:
        /hafs1 (type: NFS)
        /hafs1/nfs/statmon (type: statd)
        150.166.41.95 (type: IP_address)
        /hafs1 (type: filesystem)
        havol1 (type: volume)
Resource /hafs1 (type NFS):
    State: Online
    Error: None
    Owner: hans1
    Flags: Resource is monitored locally
    export-info: rw,wsync
    filesystem: /hafs1
    Resource dependencies
        statd /hafs1/nfs/statmon
        filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd):
    State: Online
    Error: None
    Owner: hans1
    Flags: Resource is monitored locally
    InterfaceAddress: 150.166.41.95
    Resource dependencies
        IP_address 150.166.41.95
        filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
    State: Online
    Error: None
    Owner: hans1
    Flags: Resource is monitored locally
    NetworkMask: 0xffffff00
    interfaces: ef1
    BroadcastAddress: 150.166.41.255
    No resource dependencies
Resource /hafs1 (type filesystem):
    State: Online
    Error: None
    Owner: hans1
    Flags: Resource is monitored locally
    volume-name: havol1
    mount-options: rw,noauto
    monitoring-level: 2
    Resource dependencies
        volume havol1
Resource havol1 (type volume):
    State: Online
    Error: None
    Owner: hans1
    Flags: Resource is monitored locally
    devname-group: sys
    devname-owner: root
    devname-mode: 666
    No resource dependencies

# haStatus -c test-cluster
Tue Nov 30 14:42:04 PST 1999
Cluster test-cluster:
    Cluster state is ACTIVE.
Node hans2:
    State of machine is UP.
Node hans1:
    State of machine is UP.
Resource_group nfs-group1:
    State: Online
    Error: No error
    Owner: hans1
    Failover Policy: fp_h1_h2_ord_auto_auto
    Resources:
        /hafs1 (type: NFS)
        /hafs1/nfs/statmon (type: statd)
        150.166.41.95 (type: IP_address)
        /hafs1 (type: filesystem)
        havol1 (type: volume)