Linux FailSafe System Operation
This chapter describes administrative tasks you perform to operate and
monitor a Linux FailSafe system. It describes how to perform tasks using the
FailSafe Cluster Manager Graphical User Interface (GUI) and the FailSafe Cluster
Manager Command Line Interface (CLI). The major sections in this chapter are
as follows:
Setting System Operation Defaults
Several
commands that you perform on a running system allow you the option of specifying
a node or cluster. You can specify a node or a cluster to use as the default
if you do not specify the node or cluster explicitly.
Setting Default Cluster with Cluster Manager GUI
The Cluster Manager GUI prompts you to enter the name of the default
cluster when you have not specified one. Alternatively, you can set the default
cluster by clicking the “Select Cluster...” button at the bottom
of the FailSafe Manager window.
When using the Cluster Manager GUI, there is no need to set a default
node.
Setting Defaults with Cluster Manager CLI
When you are using the Cluster Manager CLI, you can use the following
commands to specify default values. Use the following command to
specify a default cluster:
cmgr> set cluster A
Use the following command to specify a default node:
cmgr> set node A
System Operation Considerations
Once a Linux FailSafe command is started, it may partially complete
even if you interrupt the command by typing Ctrl-c. If you
halt the execution of a command this way, you may leave the cluster in an
indeterminate state and you may need to use the various status commands to
determine the actual state of the cluster and its components.
Activating (Starting) Linux FailSafe
After you have configured your Linux FailSafe system
and run diagnostic tests on its components, you can activate the highly available
services by starting Linux FailSafe. You can start Linux FailSafe on a systemwide
basis, on all of the nodes in a cluster, or on a specified node only.
When you start HA services on a subset of the nodes, you should make
sure that resource groups are not running on other nodes in the cluster. For
example, if a cluster contains nodes N1, N2, and N3 and HA services are started
on nodes N1 and N2 but not on node N3, you should make sure that resource
groups are not running on node N3. Linux FailSafe will not perform exclusivity
checks on nodes where HA services are not started.
When you start HA services, the following actions are performed:
All nodes in the cluster in the CDB are enabled
Linux FailSafe returns success to the user after modifying
the CDB
The local CMOND gets notification from cdbd
The local CMOND starts all HA processes (CMSD, GCD, SRMD,
FSD) and IFD.
CMOND sets failsafe2 chkconfig
flag to on.
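The startup actions above mean that a fully activated node should be running a known set of HA daemons. As a sanity-check sketch (the helper function is hypothetical; only the daemon names come from the list above), you can compare the processes actually running on a node against the expected set:

```python
# Expected Linux FailSafe HA daemons per the startup actions above;
# IFD is started by the local CMOND alongside the others.
EXPECTED_HA_DAEMONS = {"cmsd", "gcd", "srmd", "fsd", "ifd"}

def missing_ha_daemons(running_processes):
    """Return the set of expected HA daemons not present in the
    given iterable of running process names (case-insensitive)."""
    running = {name.lower() for name in running_processes}
    return EXPECTED_HA_DAEMONS - running

# Example: only cmsd and gcd are up, so srmd, fsd, and ifd are missing.
print(sorted(missing_ha_daemons(["cmsd", "gcd", "cmond", "cdbd"])))
# → ['fsd', 'ifd', 'srmd']
```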
Activating Linux FailSafe with the Cluster Manager GUI
To start Linux FailSafe services using the Cluster Manager GUI, perform
the following steps:
On the left side of the display, click on the “Nodes
& Cluster” category.
On the right side of the display click on the “Start
FailSafe HA Services” task link to launch the task.
Enter the selected inputs.
Click on “OK” at the bottom of the screen to complete
the task.
Activating Linux FailSafe with the Cluster Manager CLI
To activate Linux FailSafe in a cluster, use the following command:
cmgr> start ha_services [on node A] [for cluster B]
System Status
While the Linux FailSafe system is running, you can monitor the
status of the Linux FailSafe components to determine the state of each component.
Linux FailSafe allows you to view the system status in the following ways:
You can keep continuous watch on the state of a cluster using
the FailSafe Cluster View of the Cluster Manager GUI.
You can query the status of an individual resource group,
node, or cluster using either the Cluster Manager GUI or the Cluster Manager
CLI.
You can use the haStatus script provided
with the Cluster Manager CLI to see the status of all clusters, nodes, resources,
and resource groups in the configuration.
The following sections describe the procedures for performing each of
these tasks.
Monitoring System Status with the Cluster Manager GUI
The easiest way to keep a continuous watch on the state of a cluster
is to use the FailSafe Cluster View of the Cluster Manager GUI.
In the FailSafe Cluster View window, system components that are experiencing
problems appear as blinking red icons. Components in transitional states
also appear as blinking icons. If there is a problem in a resource group or
node, the icon for that resource group or node blinks red, and so does
the FailSafe Cluster View icon for the cluster.
The full color legend for component states in the FailSafe Cluster View
is as follows:
grey
healthy but not online or active
green
healthy and active or online
blinking green
transitioning to green
blinking red
problems with component
black and white outline
resource type
grey with yellow wrench
maintenance mode, may or may not be currently monitored by Linux FailSafe
If you minimize the FailSafe Cluster View window, the minimized icon
shows the current state of the cluster. When the cluster has Linux FailSafe
HA services active and there is no error, the icon shows a green cluster.
When the cluster goes into error state, the icon shows a red cluster. When
the cluster has Linux FailSafe HA services inactive, the icon shows a grey
cluster.
Monitoring Resource and Reset Serial Line with the Cluster Manager
CLI
You can use the CLI to query the
status of a resource or to ping the system controller at a node, as described
in the following subsections.
Querying Resource Status with the Cluster Manager CLI
To query a resource status, use the following
CLI command:
cmgr> show status of resource A of resource_type B [in cluster C]
If you have specified a default cluster, you do not need to specify
a cluster when you use this command and it will show the status of the indicated
resource in the default cluster.
Pinging a System Controller with the Cluster Manager CLI
To perform
a ping operation on a system controller by providing the device name, use
the following CLI command:
cmgr> admin ping dev_name A of dev_type B with sysctrl_type C
Resource Group Status
To query the status of a resource group, you
provide the name of the resource group and the cluster which includes the
resource group. Resource group status includes the following components:
Resource group state
Resource group error state
Resource owner
These components are described in the following subsections.
If a node that has a resource group online has a status of
UNKNOWN, the status of the resource group will not be available
or ONLINE-READY.
Resource Group State
A resource group state can be one of the following:
ONLINE
Linux FailSafe is running on the local nodes. The resource group is
allocated on a node in the cluster and is being monitored by Linux FailSafe.
It is fully allocated if there is no error; otherwise, some resources may
not be allocated or some resources may be in error state.
ONLINE-PENDING
Linux FailSafe is running on the local nodes and the resource group
is in the process of being allocated. This is a transient state.
OFFLINE
The resource group is not running or the resource group has been detached,
regardless of whether Linux FailSafe is running. When Linux FailSafe starts
up, it will not allocate this resource group.
OFFLINE-PENDING
Linux FailSafe is running on the local nodes and the resource group
is in the process of being released (becoming offline). This is a transient
state.
ONLINE-READY
Linux FailSafe is not running on the local node. When Linux FailSafe
starts up, it will attempt to bring this resource group online. No Linux FailSafe
process is running on the current node if this state is returned.
ONLINE-MAINTENANCE
The resource group is allocated on a node in the cluster but is not
being monitored by Linux FailSafe. If a node failure occurs while a resource
group in ONLINE-MAINTENANCE state resides on that node,
the resource group will be moved to another node and monitoring will resume.
An administrator may move a resource group to an ONLINE-MAINTENANCE
state for upgrade or testing purposes, or if there is any reason
that Linux FailSafe should not act on that resource for a period of time.
INTERNAL ERROR
An internal Linux FailSafe error has occurred and Linux FailSafe does
not know the state of the resource group. Error recovery is required.
DISCOVERY (EXCLUSIVITY)
The resource group is in the process of going online: Linux FailSafe
is determining whether any resource in the resource group is already
allocated on a node in the resource group's application failure domain.
This is a transient state.
INITIALIZING
Linux FailSafe on the local node has yet to get any information about
this resource group. This is a transient state.
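The states above can be summarized as a small lookup table. The following sketch is illustrative only (the table and helper are not part of FailSafe); the transient/stable classification is taken directly from the descriptions above:

```python
# Resource group states per the descriptions above.
# True means the description calls the state transient.
RESOURCE_GROUP_STATES = {
    "ONLINE": False,
    "ONLINE-PENDING": True,
    "OFFLINE": False,
    "OFFLINE-PENDING": True,
    "ONLINE-READY": False,
    "ONLINE-MAINTENANCE": False,
    "INTERNAL ERROR": False,
    "DISCOVERY (EXCLUSIVITY)": True,
    "INITIALIZING": True,
}

def is_transient(state):
    """Return True if the given resource group state is transient."""
    return RESOURCE_GROUP_STATES.get(state.upper(), False)

print(is_transient("online-pending"))  # → True
print(is_transient("ONLINE"))          # → False
```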
Resource Group Error State
When a resource group is ONLINE, its error status is continually being
monitored. A resource group error status can be one of the following:
NO ERROR
Resource group has no error.
INTERNAL ERROR - NOT RECOVERABLE
Notify Silicon Graphics if this condition arises.
NODE UNKNOWN
The node that had the resource group online is in an unknown state. This occurs
when the node is not part of the cluster. The last known state of the resource
group is ONLINE, but the system cannot talk to the node.
SRMD EXECUTABLE ERROR
The start or stop action has failed for a resource in the resource group.
SPLIT RESOURCE GROUP (EXCLUSIVITY)
Linux FailSafe has determined that part of the resource group was running
on at least two different nodes in the cluster.
NODE NOT AVAILABLE (EXCLUSIVITY)
Linux FailSafe has determined that one of the nodes in the resource
group's application failure domain was not in the membership. Linux FailSafe
cannot bring the resource group online until that node is removed from the
application failure domain or HA services are started on that node.
MONITOR ACTIVITY UNKNOWN
In the process of turning maintenance mode on or off, an error occurred.
Linux FailSafe can no longer determine if monitoring is enabled or disabled.
Retry the operation. If the error continues, report the error to Silicon Graphics.
NO AVAILABLE NODES
A monitoring error has occurred on the last valid node in the cluster's
membership.
Resource Owner
The resource owner is the logical node name of
the node that currently owns the resource.
Monitoring Resource Group Status with the Cluster Manager GUI
You can use the FailSafe Cluster View to monitor the status of the resources
in a Linux FailSafe configuration. You can launch the FailSafe Cluster View
directly, or you can bring it up at any time by clicking on “FailSafe
Cluster View” at the bottom of the “FailSafe Manager” display.
From the View menu, select “Resources in Groups” to see
the resources organized by the groups they belong to, or select “Groups
owned by Nodes” to see where the online groups are running. This view
lets you observe failovers as they occur.
Querying Resource Group Status with the Cluster Manager CLI
To query a resource group status, use the following CLI command:
cmgr> show status of resource_group A [in cluster B]
If you have specified a default cluster, you do not need to specify
a cluster when you use this command and it will show the status of the indicated
resource group in the default cluster.
Node Status
To query the status of a node, you provide the logical node name of
the node. The node status can be one of the following:
UP
This node is part of cluster membership.
DOWN
This node is not part of cluster membership (no heartbeats) and this
node has been reset. This is a transient state.
UNKNOWN
This node is not part of cluster membership (no heartbeats) and this
node has not been reset (reset attempt has failed).
INACTIVE
HA services have not been started on this node.
When you start HA services, node states transition from INACTIVE
to UP. A node state may also transition from
INACTIVE to UNKNOWN to UP.
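The transitions described above can be captured in a small table. In this sketch (the table and helper are hypothetical and cover only the transitions this section mentions), a proposed transition is checked against the documented ones:

```python
# Node-state transitions described above: starting HA services moves a
# node from INACTIVE to UP, possibly passing through UNKNOWN.
DOCUMENTED_TRANSITIONS = {
    "INACTIVE": {"UP", "UNKNOWN"},
    "UNKNOWN": {"UP"},
}

def is_documented_transition(old, new):
    """Return True if old -> new is one of the transitions described above."""
    return new in DOCUMENTED_TRANSITIONS.get(old, set())

print(is_documented_transition("INACTIVE", "UP"))  # → True
print(is_documented_transition("UP", "INACTIVE"))  # → False
```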
Monitoring Cluster Status with the Cluster Manager GUI
You can use the FailSafe Cluster View to monitor the status of the clusters
in a Linux FailSafe configuration. You can launch the FailSafe Cluster View
directly, or you can bring it up at any time by clicking on “FailSafe
Cluster View” at the bottom of the “FailSafe Manager” display.
From the View menu, select “Groups owned by Nodes” to monitor
the health of the default cluster, its resource groups, and the group's resources.
Querying Node Status with the Cluster Manager CLI
To query node status, use the following CLI command:
cmgr> show status of node A
Pinging the System Controller with the Cluster Manager CLI
When Linux FailSafe is running, you can determine whether the system
controller on a node is responding with the following Cluster Manager CLI command:
cmgr> admin ping node A
This command uses the Linux FailSafe daemons to test whether the system
controller is responding.
You can verify reset connectivity on a node in a cluster even when the
Linux FailSafe daemons are not running by using the standalone
option of the admin ping command of the CLI:
cmgr> admin ping standalone node A
This command does not go through the Linux FailSafe daemons, but calls
the ping command directly to test whether the system controller
on the indicated node is responding.
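The two ping variants differ only in whether the FailSafe daemons are involved. As a sketch (the helper is hypothetical; the command syntax is taken from the CLI commands above), a script might select between them based on whether the daemons are running:

```python
def ping_command(node, daemons_running=True):
    """Build the cmgr command to test a node's system controller.

    When the Linux FailSafe daemons are not running, fall back to the
    standalone variant, which calls ping directly rather than going
    through the daemons."""
    if daemons_running:
        return "admin ping node {}".format(node)
    return "admin ping standalone node {}".format(node)

print(ping_command("hans1"))
# → admin ping node hans1
print(ping_command("hans1", daemons_running=False))
# → admin ping standalone node hans1
```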
Cluster Status
To query the status of a cluster, you
provide the name of the cluster. The cluster status can be one of the following:
ACTIVE
INACTIVE
Querying Cluster Status with the Cluster Manager GUI
You can use the Cluster View of the Cluster Manager GUI to monitor the
status of the clusters in a Linux FailSafe system.
Querying Cluster Status with the Cluster Manager CLI
To query node and cluster status, use the following CLI command:
cmgr> show status of cluster A
Viewing System Status with the haStatus CLI Script
The haStatus script provides status and configuration information about
clusters, nodes, resources, and resource groups in the configuration. This
script is installed in the /var/cluster/cmgr-scripts
directory. You can modify this script to suit
your needs. See the haStatus (1M) man page for further
information about this script.
The following examples show the output of the different options of the
haStatus script.
# haStatus -help
Usage: haStatus [-a|-i] [-c clustername]
where,
-a prints detailed cluster configuration information and cluster
status.
-i prints detailed cluster configuration information only.
-c can be used to specify a cluster for which status is to be printed.
“clustername” is the name of the cluster for which status is to be
printed.
# haStatus
Tue Nov 30 14:12:09 PST 1999
Cluster test-cluster:
Cluster state is ACTIVE.
Node hans2:
State of machine is UP.
Node hans1:
State of machine is UP.
Resource_group nfs-group1:
State: Online
Error: No error
Owner: hans1
Failover Policy: fp_h1_h2_ord_auto_auto
Resources:
/hafs1 (type: NFS)
/hafs1/nfs/statmon (type: statd)
150.166.41.95 (type: IP_address)
/hafs1 (type: filesystem)
havol1 (type: volume)
# haStatus -i
Tue Nov 30 14:13:52 PST 1999
Cluster test-cluster:
Node hans2:
Logical Machine Name: hans2
Hostname: hans2.engr.sgi.com
Is FailSafe: true
Is Cellular: false
Nodeid: 32418
Reset type: powerCycle
System Controller: msc
System Controller status: enabled
System Controller owner: hans1
System Controller owner device: /dev/ttyd2
System Controller owner type: tty
ControlNet Ipaddr: 192.26.50.15
ControlNet HB: true
ControlNet Control: true
ControlNet Priority: 1
ControlNet Ipaddr: 150.166.41.61
ControlNet HB: true
ControlNet Control: false
ControlNet Priority: 2
Node hans1:
Logical Machine Name: hans1
Hostname: hans1.engr.sgi.com
Is FailSafe: true
Is Cellular: false
Nodeid: 32645
Reset type: powerCycle
System Controller: msc
System Controller status: enabled
System Controller owner: hans2
System Controller owner device: /dev/ttyd2
System Controller owner type: tty
ControlNet Ipaddr: 192.26.50.14
ControlNet HB: true
ControlNet Control: true
ControlNet Priority: 1
ControlNet Ipaddr: 150.166.41.60
ControlNet HB: true
ControlNet Control: false
ControlNet Priority: 2
Resource_group nfs-group1:
Failover Policy: fp_h1_h2_ord_auto_auto
Version: 1
Script: ordered
Attributes: Auto_Failback Auto_Recovery
Initial AFD: hans1 hans2
Resources:
/hafs1 (type: NFS)
/hafs1/nfs/statmon (type: statd)
150.166.41.95 (type: IP_address)
/hafs1 (type: filesystem)
havol1 (type: volume)
Resource /hafs1 (type NFS):
export-info: rw,wsync
filesystem: /hafs1
Resource dependencies
statd /hafs1/nfs/statmon
filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd):
InterfaceAddress: 150.166.41.95
Resource dependencies
IP_address 150.166.41.95
filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
NetworkMask: 0xffffff00
interfaces: ef1
BroadcastAddress: 150.166.41.255
No resource dependencies
Resource /hafs1 (type filesystem):
volume-name: havol1
mount-options: rw,noauto
monitoring-level: 2
Resource dependencies
volume havol1
Resource havol1 (type volume):
devname-group: sys
devname-owner: root
devname-mode: 666
No resource dependencies
Failover_policy fp_h1_h2_ord_auto_auto:
Version: 1
Script: ordered
Attributes: Auto_Failback Auto_Recovery
Initial AFD: hans1 hans2
# haStatus -a
Tue Nov 30 14:45:30 PST 1999
Cluster test-cluster:
Cluster state is ACTIVE.
Node hans2:
State of machine is UP.
Logical Machine Name: hans2
Hostname: hans2.engr.sgi.com
Is FailSafe: true
Is Cellular: false
Nodeid: 32418
Reset type: powerCycle
System Controller: msc
System Controller status: enabled
System Controller owner: hans1
System Controller owner device: /dev/ttyd2
System Controller owner type: tty
ControlNet Ipaddr: 192.26.50.15
ControlNet HB: true
ControlNet Control: true
ControlNet Priority: 1
ControlNet Ipaddr: 150.166.41.61
ControlNet HB: true
ControlNet Control: false
ControlNet Priority: 2
Node hans1:
State of machine is UP.
Logical Machine Name: hans1
Hostname: hans1.engr.sgi.com
Is FailSafe: true
Is Cellular: false
Nodeid: 32645
Reset type: powerCycle
System Controller: msc
System Controller status: enabled
System Controller owner: hans2
System Controller owner device: /dev/ttyd2
System Controller owner type: tty
ControlNet Ipaddr: 192.26.50.14
ControlNet HB: true
ControlNet Control: true
ControlNet Priority: 1
ControlNet Ipaddr: 150.166.41.60
ControlNet HB: true
ControlNet Control: false
ControlNet Priority: 2
Resource_group nfs-group1:
State: Online
Error: No error
Owner: hans1
Failover Policy: fp_h1_h2_ord_auto_auto
Version: 1
Script: ordered
Attributes: Auto_Failback Auto_Recovery
Initial AFD: hans1 hans2
Resources:
/hafs1 (type: NFS)
/hafs1/nfs/statmon (type: statd)
150.166.41.95 (type: IP_address)
/hafs1 (type: filesystem)
havol1 (type: volume)
Resource /hafs1 (type NFS):
State: Online
Error: None
Owner: hans1
Flags: Resource is monitored locally
export-info: rw,wsync
filesystem: /hafs1
Resource dependencies
statd /hafs1/nfs/statmon
filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd):
State: Online
Error: None
Owner: hans1
Flags: Resource is monitored locally
InterfaceAddress: 150.166.41.95
Resource dependencies
IP_address 150.166.41.95
filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
State: Online
Error: None
Owner: hans1
Flags: Resource is monitored locally
NetworkMask: 0xffffff00
interfaces: ef1
BroadcastAddress: 150.166.41.255
No resource dependencies
Resource /hafs1 (type filesystem):
State: Online
Error: None
Owner: hans1
Flags: Resource is monitored locally
volume-name: havol1
mount-options: rw,noauto
monitoring-level: 2
Resource dependencies
volume havol1
Resource havol1 (type volume):
State: Online
Error: None
Owner: hans1
Flags: Resource is monitored locally
devname-group: sys
devname-owner: root
devname-mode: 666
No resource dependencies
# haStatus -c test-cluster
Tue Nov 30 14:42:04 PST 1999
Cluster test-cluster:
Cluster state is ACTIVE.
Node hans2:
State of machine is UP.
Node hans1:
State of machine is UP.
Resource_group nfs-group1:
State: Online
Error: No error
Owner: hans1
Failover Policy: fp_h1_h2_ord_auto_auto
Resources:
/hafs1 (type: NFS)
/hafs1/nfs/statmon (type: statd)
150.166.41.95 (type: IP_address)
/hafs1 (type: filesystem)
havol1 (type: volume)
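Because the haStatus output shown above is line-oriented, it is straightforward to post-process in a script. The following sketch is an assumption derived only from the sample output above (not a documented interface): it extracts node states and resource group states from default haStatus output.

```python
# A fragment of the default haStatus output shown above.
SAMPLE = """\
Tue Nov 30 14:12:09 PST 1999
Cluster test-cluster:
    Cluster state is ACTIVE.
Node hans2:
    State of machine is UP.
Node hans1:
    State of machine is UP.
Resource_group nfs-group1:
    State: Online
    Error: No error
    Owner: hans1
"""

def parse_hastatus(text):
    """Parse default haStatus output into node and resource group states."""
    nodes, groups = {}, {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Node ") and stripped.endswith(":"):
            current = ("node", stripped[5:-1])
        elif stripped.startswith("Resource_group ") and stripped.endswith(":"):
            current = ("group", stripped[15:-1])
        elif stripped.startswith("State of machine is ") and current and current[0] == "node":
            nodes[current[1]] = stripped[len("State of machine is "):].rstrip(".")
        elif stripped.startswith("State: ") and current and current[0] == "group":
            groups[current[1]] = stripped[len("State: "):]
    return nodes, groups

nodes, groups = parse_hastatus(SAMPLE)
print(nodes)   # → {'hans2': 'UP', 'hans1': 'UP'}
print(groups)  # → {'nfs-group1': 'Online'}
```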
Resource Group Failover
While a Linux FailSafe
system is running, you can move a resource group online to a particular node,
or you can take a resource group offline. In addition, you can move a resource
group from one node in a cluster to another node in a cluster. The following
subsections describe these tasks.
Bringing a Resource Group Online
Before you bring a resource group online for the first time, you should
run the diagnostic tests on that resource group. Diagnostics check system
configurations and perform some validations that are not performed when you
bring a resource group online.
To bring a resource group online, you specify the name of the resource
group and the name of the cluster that contains it.
You cannot bring a resource group online if the resource group has no
members.
To bring a resource group fully online, HA services must be active.
When HA services are active, an attempt is made to allocate the resource group
in the cluster. However, you can also execute a command to bring the resource
group online when HA services are not active. When HA services are not active,
the resource group is marked to be brought online when HA services become
active.
Before bringing a resource group online in the cluster, you must be
sure that the resource group is not running on a disabled node (where HA services
are not running). Bringing a resource group online while it is running on
a disabled node could cause data corruption. For information on detached resource
groups, see .
Bringing a Resource Group Online with the Cluster Manager GUI
To bring a resource group online using the Cluster Manager GUI, perform
the following steps:
On the left side of the display, click on the “Failover
Policies & Resource Groups” category.
On the right side of the display click on the “Bring
a Resource Group Online” task link to launch the task.
Enter the selected inputs.
Click on “OK” at the bottom of the screen to complete
the task.
Bringing a Resource Group Online with the Cluster Manager CLI
To bring a resource group online, use the following CLI command:
cmgr> admin online resource_group A [in cluster B]
If you have specified a default cluster, you do not need to specify
a cluster when you use this command.
Taking a Resource Group Offline
When you take a resource group offline, FailSafe takes each resource
in the resource group offline in a predefined order. If any single resource
gives an error during this process, the process stops, leaving all remaining
resources allocated.
You can take a Linux FailSafe resource group offline in any of three
ways:
Take the resource group offline. This physically stops the
processes for that resource group and does not reset any error conditions.
If this operation fails, the resource group will be left online in an error
state.
Force the resource group offline. This physically stops the
processes for that resource group but resets any error conditions. This operation
cannot fail.
Detach the resource groups. This causes Linux FailSafe to
stop monitoring the resource group, but does not physically stop the processes
on that group. Linux FailSafe will report the status as offline and will not
have any control over the group. This operation should rarely fail.
If you do not need to stop the resource group and do not want Linux
FailSafe to monitor the resource group while you make changes but you would
still like to have administrative control over the resource group (for instance,
to move that resource group to another node), you can put the resource group
in maintenance mode using the “Suspend Monitoring a Resource Group”
task on the GUI or the admin maintenance_on command of
the CLI, as described in .
Detaching a resource group leaves the resources in the resource group
running at the cluster node where it was online. After stopping HA services
on that cluster node, you should not bring the resource group online onto
another node in the cluster, as this may cause data corruption.
Taking a Resource Group Offline with the Cluster Manager GUI
To take a resource group offline using the Cluster Manager GUI, perform
the following steps:
Launch the Cluster Manager.
On the left side of the display, click on the “Failover
Policies & Resource Groups” category.
On the right side of the display click on the “Take
a Resource Group Offline” task link to launch the task.
Enter the selected inputs.
Click on “OK” at the bottom of the screen to complete
the task.
Taking a Resource Group Offline with the Cluster Manager CLI
To take a resource group offline, use the following CLI command:
cmgr> admin offline resource_group A [in cluster B]
If you have specified a default cluster, you do not need to specify
a cluster in this command and the CLI will use the default.
To take a resource group offline with the force option in effect, use
the following CLI command:
cmgr> admin offline_force resource_group A [in cluster B]
To detach a resource group, use the following CLI command:
cmgr> admin offline_detach resource_group A [in cluster B]
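The three offline variants map onto three cmgr commands. As a sketch (the helper is hypothetical; the command strings come from the CLI syntax above), a wrapper might build the appropriate command line from the desired behavior:

```python
# Map the three offline behaviors described above to their cmgr commands.
OFFLINE_COMMANDS = {
    "offline": "admin offline resource_group",        # stop; keep error state
    "force": "admin offline_force resource_group",    # stop; reset errors
    "detach": "admin offline_detach resource_group",  # stop monitoring only
}

def offline_command(group, mode="offline", cluster=None):
    """Build the cmgr command line for taking a resource group offline."""
    cmd = "{} {}".format(OFFLINE_COMMANDS[mode], group)
    if cluster:
        cmd += " in cluster {}".format(cluster)
    return cmd

print(offline_command("nfs-group1", "detach", "test-cluster"))
# → admin offline_detach resource_group nfs-group1 in cluster test-cluster
```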
Moving a Resource Group
While Linux FailSafe is active, you can move
a resource group to another node in the same cluster. When you move a resource
group, you specify the following:
The name of the resource group.
The logical name of the destination node (optional). When
you do not provide a logical destination name, Linux FailSafe chooses the
destination based on the failover policy.
The name of the cluster that contains the nodes.
Moving a Resource Group with the Cluster Manager GUI
To move a resource group using the Cluster Manager GUI, perform the
following steps:
On the left side of the display, click on the “Failover
Policies & Resource Groups” category.
On the right side of the display click on the “Move
a Resource Group” task link to launch the task.
Enter the selected inputs.
Click on “OK” at the bottom of the screen to complete
the task.
Moving a Resource Group with the Cluster Manager CLI
To move a resource group to another node, use the following CLI command:
cmgr> admin move resource_group A [in cluster B] [to node C]
Stop Monitoring of a Resource Group (Maintenance
Mode)
You
can temporarily stop Linux FailSafe from monitoring a specific resource group,
which puts the resource group in maintenance mode. The resource group remains
on its same node in the cluster but is no longer monitored by Linux FailSafe
for resource failures.
You can put a resource group into maintenance mode if you do not want
Linux FailSafe to monitor the group for a period of time. You may want to
do this for upgrade or testing purposes, or if there is any reason that Linux
FailSafe should not act on that resource group. When a resource group is in
maintenance mode, it is not being monitored and it is not highly available.
If the resource group's owner node fails, Linux FailSafe will move the resource
group to another node and resume monitoring.
When you put a resource group into maintenance mode, resources in the
resource group are in ONLINE-MAINTENANCE state. The
ONLINE-MAINTENANCE state for the resource is seen only on the node
that has the resource online. All other nodes will show the resource as ONLINE.
The resource group, however, should appear as being in ONLINE-MAINTENANCE
state in all nodes.
Putting a Resource Group into Maintenance Mode with the Cluster Manager
GUI
To put a resource group into maintenance mode using the Cluster Manager
GUI, perform the following steps:
On the left side of the display, click on the “Failover
Policies & Resource Groups” category.
On the right side of the display click on the “Suspend
Monitoring a Resource Group” task link to launch the task.
Enter the selected inputs.
Resume Monitoring of a Resource Group with the Cluster Manager GUI
To resume monitoring a resource group using the Cluster Manager GUI,
perform the following steps:
On the left side of the display, click on the “Failover
Policies & Resource Groups” category.
On the right side of the display click on the “Resume
Monitoring a Resource Group” task link to launch the task.
Enter the selected inputs.
Putting a Resource Group into Maintenance Mode with the Cluster Manager
CLI
To put a resource group into maintenance mode, use the following CLI
command:
cmgr> admin maintenance_on resource_group A [in cluster B]
If you have specified a default cluster, you do not need to specify
a cluster when you use this command.
Resume Monitoring of a Resource Group with the Cluster Manager CLI
To move a resource group back online
from maintenance mode, use the following CLI command:
cmgr> admin maintenance_off resource_group A [in cluster B]
Deactivating (Stopping) Linux FailSafe
You can stop the execution of Linux FailSafe on a systemwide
basis, on all the nodes in a cluster, or on a specified node only.
Deactivating a node or a cluster is a complex operation that involves
several steps and can take several minutes. Aborting a deactivate operation
can leave the nodes and the resources in an unintended state.
When deactivating HA services on a node or for a cluster, the operation
may fail if any resource groups are not in a stable clean state. Resource
groups which are in transition will cause any deactivate HA services command
to fail. In many cases, the command may succeed at a later time after resource
groups have settled into a stable state.
After you have successfully deactivated a node or a cluster, the node
or cluster should have no resource groups and all HA services should be gone.
Serially stopping HA services on every node in a cluster is not the
same as stopping HA services for the entire cluster. In the former case, an
attempt is made to keep resource groups online and highly available, while
in the latter case resource groups are moved offline, as described in the
following sections.
When you stop HA services, the Linux FailSafe daemons perform the following
actions:
A shutdown request is sent to Linux FailSafe (FSD)
FSD releases all resource groups and puts them in
ONLINE-READY state
All nodes in the cluster in the configuration database are
disabled (one node at a time and the local node last)
Linux FailSafe waits until the node is removed from cluster
membership before disabling the node
The shutdown is successful only when no node remains part
of the cluster membership
CMOND receives notification from the configuration database
when nodes are disabled
The local CMOND sends SIGTERM to all HA processes and IFD.
All HA processes clean up and exit with “don't restart”
code
All other CMSD daemons remove the disabled node from the cluster
membership
Deactivating HA Services on a Node
The operation of deactivating a node tries to move all resource groups
from the node to some other node and then tries to disable the node in the
cluster, subsequently killing all HA processes.
When HA services are stopped on a node, all resource groups owned by
the node are moved to some other node in the cluster that is capable of maintaining
these resource groups in a highly available state. This operation will fail
if there is no node that can take over these resource groups. This is always
the case when you deactivate HA services on the last node in a cluster.
In this circumstance, you can specify the force option
to shut down the node even if resource groups cannot be moved or released.
This will normally leave resource groups allocated in a non-highly-available
state on that same node. Using the force option might
result in the node getting reset. In order to guarantee that the resource
groups remain allocated on the last node in a cluster, all online resource
groups should be detached.
If you wish to move resource groups offline that are owned by the node
being shut down, you must do so prior to deactivating the node.
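For example, the following session first moves a resource group offline
and then stops HA services on the node. The resource group, cluster, and node
names are illustrative, not taken from any actual configuration:

```
cmgr> admin offline resource_group web-rg in cluster test-cluster
cmgr> stop ha_services on node node1 for cluster test-cluster
```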
Deactivating HA Services in a Cluster
The operation of deactivating a cluster attempts to release all resource
groups and disable all nodes in the cluster, subsequently killing all HA processes.
When a cluster is deactivated and the Linux FailSafe HA services are
stopped on that cluster, resource groups are moved offline or deallocated.
If you want the resource groups to remain allocated, you must detach the resource
groups before attempting to deactivate the cluster.
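For example, the following session detaches a resource group (so that it
remains allocated) before stopping HA services for the cluster. The resource
group and cluster names are illustrative:

```
cmgr> admin detach resource_group web-rg in cluster test-cluster
cmgr> stop ha_services for cluster test-cluster
```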
Deactivating Linux FailSafe with the Cluster Manager GUI
To stop Linux FailSafe services using the Cluster Manager GUI, perform
the following steps:
On the left side of the display, click on the “Nodes
& Cluster” category.
On the right side of the display click on the “Stop
FailSafe HA Services” task link to launch the task.
Enter the selected inputs.
Click on “OK” at the bottom of the screen to complete
the task.
Deactivating Linux FailSafe with the Cluster Manager CLI
To deactivate Linux FailSafe in a cluster and stop Linux FailSafe processing,
use the following command:
cmgr> stop ha_services [on node A] [for cluster B] [force]
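For example, to stop HA services on a single node, forcing the shutdown
even if resource groups cannot be moved or released (the node name is
illustrative):

```
cmgr> stop ha_services on node node1 force
```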
Resetting Nodes
You can use Linux FailSafe to reset nodes in a cluster. This sends a
reset command to the system controller port on the specified node. When the
node is reset, other nodes in the cluster will detect this and remove the
node from the active cluster, reallocating any resource groups that were allocated
on that node onto a backup node. The backup node used depends on how you have
configured your system.
Once the node reboots, it will rejoin the cluster. Some resource groups
might move back to the node, depending on how you have configured your system.
Resetting a Node with the Cluster Manager GUI
To reset a Linux FailSafe node using the Cluster Manager GUI, perform
the following steps:
On the left side of the display, click on the “Nodes
& Cluster” category.
On the right side of the display click on the “Reset
a Node” task link to launch the task.
Enter the node to reset.
Click on “OK” at the bottom of the screen to complete
the task.
Resetting a Node with the Cluster Manager CLI
When Linux FailSafe is running, you can reboot a node with the following
Cluster Manager CLI command:
cmgr> admin reset node A
This command uses the Linux FailSafe daemons to reset the specified
node.
You can reset a node in a cluster even when the Linux FailSafe daemons
are not running by using the standalone option of the
admin reset command of the CLI:
cmgr> admin reset standalone node A
This command does not go through the Linux FailSafe daemons.
Backing Up and Restoring Configuration With Cluster
Manager CLI
The
Cluster Manager CLI provides scripts that you can use to back up and restore
your configuration: cdbDump and cdbRestore.
These scripts are installed in the /var/cluster/cmgr-scripts
directory. You can modify these scripts to
suit your needs.
The cdbDump script, as provided, creates compressed
tar files of the /var/cluster/cdb/cdb.db# directory
and the /var/cluster/cdb.db file.
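The core of such a backup can be sketched as a small shell function. This
is an illustrative sketch, not the shipped cdbDump script; the paths shown
in the comments are only the defaults named above:

```shell
# Sketch of a cdbDump-style backup (illustrative, not the shipped script).
# Arguments: CDB directory, CDB file, destination archive path.
dump_cdb() {
    cdb_dir=$1      # e.g. /var/cluster/cdb/cdb.db#
    cdb_file=$2     # e.g. /var/cluster/cdb.db
    archive=$3      # destination compressed tar file
    tar czf "$archive" "$cdb_dir" "$cdb_file"
}
```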
The cdbRestore script, as provided, restores the
compressed tar files of the /var/cluster/cdb/cdb.db#
directory and the
/var/cluster/cdb.db file.
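The restore side can be sketched the same way. Again, this is an
illustrative sketch rather than the shipped cdbRestore script; the optional
extraction root defaults to / because the archived paths are relative to it:

```shell
# Sketch of a cdbRestore-style restore (illustrative, not the shipped script).
# Extracts a backup archive under the given root (normally /).
restore_cdb() {
    archive=$1
    root=${2:-/}    # extraction root; archived paths are relative to /
    tar xzf "$archive" -C "$root"
}
```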
When you use the cdbDump and cdbRestore
scripts, you should follow the following procedures:
Run the cdbDump and cdbRestore
scripts only when no administrative commands are running; performing a backup
while commands are running could result in an inconsistent backup.
You must back up the configuration of each node in the cluster
separately. The configuration information is different for each node, and
all node-specific information is stored locally only.
Run the backup procedure whenever you change your configuration.
The backups of all nodes in the pool taken at the same time
should be restored together.
Cluster and Linux FailSafe processes should not be running when
you restore your configuration.
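A restore wrapper might guard against running daemons before extracting a
backup. The process names below (cmond, cmsd) are assumptions drawn from the
daemons mentioned earlier in this chapter and may differ on your installation:

```shell
# Return success only if the assumed cluster daemons are not running.
# Daemon process names are illustrative; verify them on your installation.
cluster_daemons_stopped() {
    for daemon in cmond cmsd; do
        if pgrep -x "$daemon" >/dev/null 2>&1; then
            echo "$daemon is still running; stop it before cdbRestore" >&2
            return 1
        fi
    done
    return 0
}
```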
In addition to the above restrictions, you should not perform a
cdbDump while information is changing in the CDB. Check SYSLOG for
information to help determine when CDB activity is occurring. As a rule of
thumb, you should be able to perform a cdbDump if at least
15 minutes have passed since the last node joined the cluster or the last
administration command was run.