Upgrading and Maintaining Active Clusters
When a Linux FailSafe system is running, you may need to perform various
administration procedures without shutting down the entire cluster. This chapter
provides instructions for performing upgrade and maintenance procedures on
active clusters. It includes the following procedures:
- "Adding a Node to an Active Cluster"
- "Deleting a Node from an Active Cluster"
- "Changing Control Networks in a Cluster"
- "Upgrading OS Software in an Active Cluster"
- "Upgrading FailSafe Software in an Active Cluster"
- "Adding New Resource Groups or Resources in an Active Cluster"
- "Adding a New Hardware Device in an Active Cluster"
Adding a Node to an Active Cluster
Use the following procedure to add a node to an active cluster. This
procedure begins with the assumption that the cluster_admin,
cluster_control, cluster_ha, and
failsafe2 products are already installed on the node.
1. Check the control network connections from the node to the rest
of the cluster using the ping command. Note the list of control
network IP addresses.
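For example, if 192.26.50.14 and 192.26.50.15 were the control network IP
addresses of a node that is already in the cluster (both addresses are placeholders
for this example), the connections could be checked from the new node with:
# ping -c 3 192.26.50.14
# ping -c 3 192.26.50.15
Each address should respond without packet loss.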
2. Check the serial connections used to reset this node. Note the
name of the node that can reset this node.
3. Run node diagnostics. For information on FailSafe diagnostic
commands, see .
4. Make sure that sgi-cad, sgi-crsd,
sgi-cmsd, and sgi-gcd
entries are present in the /etc/services file.
The port numbers for these processes should match the port numbers used on the
other nodes in the cluster.
Example entries:
sgi-cad 7200/tcp # SGI cluster admin daemon
sgi-crsd 7500/udp # SGI cluster reset services daemon
sgi-cmsd 7000/udp # SGI cluster membership Daemon
sgi-gcd 8000/udp # SGI group communication Daemon
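One quick way to verify that all four entries are present is to search
/etc/services with a standard grep command:
# grep -E 'sgi-(cad|crsd|cmsd|gcd)' /etc/services
Four lines should be reported; compare the port numbers with those used on the
existing cluster nodes.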
5. Check whether the cluster processes (cad,
cmond, and crsd) are running:
# ps -ef | grep cad
If cluster processes are not running, run the cdbreinit
command.
# /usr/lib/failsafe/bin/cdbreinit /var/lib/failsafe/cdb/cdb.db
Killing cdbd...
Removing database header file /var/lib/failsafe/cdb/cdb.db...
Preparing to delete database directory /var/lib/failsafe/cdb/cdb.db# !!
Continue[y/n]y
Removing database directory /var/lib/failsafe/cdb/cdb.db#...
Deleted CDB database at /var/lib/failsafe/cdb/cdb.db
Recreating new CDB database at /var/lib/failsafe/cdb/cdb.db with cdb-exitop...
cdbd
Created standard CDB database in /var/lib/failsafe/cdb/cdb.db
Please make sure that "sgi-cad" service is added to /etc/services file
If not, add the entry and restart cluster processes.
Please refer to FailSafe administration manual for more
information.
Modifying CDB database at /var/lib/failsafe/cdb/cdb.db with cluster_ha-exitop...
Modified standard CDB database in /var/lib/failsafe/cdb/cdb.db
Please make sure that "sgi-cmsd" and "sgi-gcd" services are added
to /etc/services file before starting HA services.
Please refer to FailSafe administration manual for more
information.
Starting cluster control processes with cluster_control-exitop...
Please make sure that "sgi-crsd" service is added to /etc/services file
If not, add the entry and restart cluster processes.
Please refer to FailSafe administration manual for more
information.
Started cluster control processes
Restarting cluster admin processes with failsafe-exitop...
6. Use the cluster_mgr template
(/usr/lib/failsafe/cmgr-templates/cmgr-create-node)
or the cluster_mgr command to define the node.
The node must be defined from one of the nodes that is already in the cluster.
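A typical way to use the template is sketched below; it assumes that cluster_mgr
accepts a command file with the -f option, that the cluster_mgr binary lives under
/usr/lib/failsafe/bin as the other commands in this chapter do, and that
/tmp/web-node3-def is simply a scratch file name chosen for this example:
# cp /usr/lib/failsafe/cmgr-templates/cmgr-create-node /tmp/web-node3-def
Edit the copy to fill in the node name, control network NICs, and reset
information, then run:
# /usr/lib/failsafe/bin/cluster_mgr -f /tmp/web-node3-def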
7. Use the cluster_mgr command to add the
node to the cluster.
For example, the following cluster_mgr command adds
the node web-node3 to the cluster web-cluster:
cmgr> modify cluster web-cluster
Enter commands, when finished enter either "done" or "cancel"
web-cluster ? add node web-node3
web-cluster ? done
8. Start HA services on this node using the
cluster_mgr command. For example, the following cluster_mgr
command starts HA services on node web-node3
in the cluster web-cluster:
cmgr> start ha_services on node web-node3 for cluster web-cluster
Remember to add this node to the failure domain of the relevant
failover policy. To do this, the entire failover policy must be redefined,
including the additional node in the failure domain.
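Before redefining the policy, it can help to review its current definition. The
sketch below assumes a failover policy named web-fp (a hypothetical name) and
assumes that cmgr provides show subcommands for failover policies; confirm the
exact form against your cluster_mgr reference:
cmgr> show failover_policies
cmgr> show failover_policy web-fp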
Deleting a Node from an Active Cluster
Use the following procedure to delete a node from an active cluster.
This procedure begins with the assumption that the node status is UP.
1. If resource groups are online on the node, use the
cluster_mgr command to move them to another node in the cluster.
To move the resource groups to another node in the cluster, there must
be another node available in the failover policy domain of the resource group.
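For example, the following command (using the web-rg resource group and
web-cluster names that appear elsewhere in this chapter) requests that FailSafe
move the resource group to another node in its failover domain:
cmgr> admin move resource_group web-rg in cluster web-cluster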
If you want to leave the resource groups running on the same node, use the
cluster_mgr command to detach the resource group. For example, the
following command would leave the resource group web-rg
running on the same node in the cluster web-cluster:
cmgr> admin detach resource_group "web-rg" in cluster web-cluster
2. Delete the node from the failure domains of any failover policies
that use the node. To do this, the entire failover policy must be
redefined, deleting the affected node from the failure domain.
3. To stop HA services on the node web-node3,
use the following cluster_mgr command. This command will
move all the resource groups that are online on this node to other nodes in the
cluster, if possible.
cmgr> stop ha_services on node web-node3 for cluster web-cluster
If it is not possible to move the resource groups that are online on node
web-node3, the above command will fail. The force
option is available to stop HA services on a node even in the case of an error.
Should there be any resources that cannot be moved offline or deallocated
properly, a side effect of the offline force command is to leave these
resources allocated on the node.
Perform Steps 4, 5, 6, and 7 if the node must be deleted from the configuration
database.
4. Delete the node from the cluster. To delete node
web-node3 from the web-cluster configuration,
use the following cluster_mgr command:
cmgr> modify cluster web-cluster
Enter commands, when finished enter either "done" or "cancel"
web-cluster ? remove node web-node3
web-cluster ? done
5. Remove the node configuration from the configuration database.
The following cluster_mgr command deletes the
web-node3 node definition from the configuration database:
cmgr> delete node web-node3
6. Stop all cluster processes and delete the configuration database.
The following commands stop the cluster processes on the node and delete
the configuration database:
# /etc/rc.d/init.d/failsafe stop
# killall cdbd
# cdbdelete /var/lib/failsafe/cdb/cdb.db
7. Disable cluster and HA processes from starting when the node
boots. The following command performs that task:
# fsconfig failsafe off
Changing Control Networks in a Cluster
Use the following procedure to change the control
networks in a currently active cluster. This procedure is valid for a two-node
cluster consisting of nodes node1 and node2.
In this procedure, you must complete each step before proceeding to the next
step.
Do not perform any other administration operations during this procedure.
1. From any node, stop HA services on the cluster. Make sure
all HA processes have exited on both nodes.
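For example, assuming the cluster is named test-cluster (a placeholder name for
this two-node cluster), HA services can be stopped with the same stop ha_services
syntax used elsewhere in this chapter, omitting the on node clause so that the
entire cluster is affected:
cmgr> stop ha_services for cluster test-cluster
Then, on both node1 and node2, verify that no HA daemons remain. FailSafe HA
process names typically begin with ha_ (for example, ha_cmsd, ha_gcd, ha_srmd,
and ha_fsd), although the exact set can vary by release:
# ps -ef | grep ha_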
2. From node2, stop the cluster processes
on node2:
# /etc/rc.d/init.d/fs_cluster stop
# killall cdbd
Make sure the cdbd process has been killed on
node2.
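The same ps pattern used earlier in this chapter can be used for the check; only
the grep command itself should appear in the output:
# ps -ef | grep cdbd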
3. From node1, modify the node1
and node2 definitions. Use the following cmgr
commands:
cmgr> modify node node1
Enter commands, when finished enter either "done" or "cancel"
node1 ? remove nic <old nic address>
node1 ? add nic <new nic address>
NIC - <new nic address> ? set heartbeat to ...
NIC - <new nic address> ? set ctrl_msgs to ...
NIC - <new nic address> ? set priority to ...
NIC - <new nic address> ? done
node1 ? done
Repeat the same procedure to modify node2.
4. From node1, check whether the node1
and node2 definitions are correct. Using
cmgr on node1, execute the following commands
to view the node definitions:
cmgr> show node node1
cmgr> show node node2
5. On both node1 and node2,
modify the network interface IP addresses in /etc/failsafe/config/netif.options
and execute ifconfig to configure the new IP
addresses on node1 and node2. Verify
that the IP addresses match the node definitions in the CDB.
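For example, if the new control network address on node1 were 192.26.50.61 on
interface eth1 (both values are placeholders for this example), the interface
could be configured with:
# ifconfig eth1 192.26.50.61 netmask 255.255.255.0 up
The address can then be compared against the output of the show node commands
from the previous step.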
6. From node1, stop the cluster processes on
node1:
# /etc/rc.d/init.d/fs_cluster stop
# killall cdbd
Make sure the cdbd process has been killed on
node1.
7. From node2, execute the following command
to start the cluster processes on node2:
# /usr/lib/failsafe/bin/cdbreinit /var/lib/failsafe/cdb/cdb.db
Answer y to the prompt that appears.
8. From node1, start the cluster processes on
node1:
# /etc/rc.d/init.d/fs_cluster start
The following messages should appear in the SYSLOG
on node2:
Starting to receive CDB sync series from machine <node1_node_id>
...
Finished receiving CDB sync series from machine <node1_node_id>
Wait for approximately sixty seconds for the sync to complete.
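To watch for these messages on node2 while node1 starts its cluster processes,
the system log can be followed; /var/log/messages is assumed here as the SYSLOG
destination, which depends on your syslog configuration:
# tail -f /var/log/messages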
9. From any node, start HA services in the cluster.
Upgrading OS Software in an Active Cluster
When you upgrade your OS software in an active cluster, you perform
the upgrade on one node at a time.
If the OS software upgrade does not require a reboot and does not impact
the FailSafe software, there is no need to use the OS upgrade procedure. If
you do not know whether the upgrade will impact FailSafe software, or if the
OS upgrade requires a machine reboot, follow the upgrade procedure described
below.
The following procedure upgrades the OS software on node
web-node3.
1. If resource groups are online on the node, use a
cluster_mgr command to move them to another node in the cluster. To
move a resource group to another node in the cluster, there must be another
node available in the failover policy domain of the resource group.
The following cluster_mgr command moves resource
group web-rg to another node in the cluster
web-cluster:
cmgr> admin move resource_group web-rg in cluster web-cluster
2. To stop HA services on the node web-node3,
use the following cluster_mgr command. This command will
move all the resource groups that are online on this node to other nodes in the
cluster, if possible.
cmgr> stop ha_services on node web-node3 for cluster web-cluster
If it is not possible to move the resource groups that are online on node
web-node3, the above command will fail. You can use the
force option to stop HA services on a node even in the case of an
error.
3. Perform the OS upgrade on the node web-node3.
4. After the OS upgrade, make sure the cluster processes
(cmond, cad, crsd) are running.
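A single check such as the following (a generic ps pipeline; the egrep process
itself may also appear in the output) can confirm that all three daemons are
present:
# ps -ef | egrep '(cad|cmond|crsd)'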
5. Restart HA services on the node. The following cluster_mgr
command restarts HA services on the node:
cmgr> start ha_services on node web-node3 for cluster web-cluster
6. Make sure the resource groups are running on the most appropriate node
after restarting HA services.
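One way to confirm where a resource group is running is a cmgr status query. The
line below is a sketch that assumes a show status of resource_group subcommand
and reuses the web-rg and web-cluster names from this example; confirm the exact
syntax against your cluster_mgr reference:
cmgr> show status of resource_group web-rg in cluster web-cluster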
Upgrading FailSafe Software in an Active Cluster
When you upgrade FailSafe software in an active cluster, you upgrade
one node at a time in the cluster.
The following procedure upgrades FailSafe on node web-node3.
1. If resource groups are online on the node, use a
cluster_mgr command to move them to another node in the cluster. To
move a resource group to another node in the cluster, there must be another
node available in the failover policy domain of the resource group.
The following cluster_mgr command moves resource
group web-rg to another node in the cluster
web-cluster:
cmgr> admin move resource_group web-rg in cluster web-cluster
2. To stop HA services on the node web-node3,
use the following cluster_mgr command. This command will
move all the resource groups that are online on this node to other nodes in the
cluster, if possible.
cmgr> stop ha_services on node web-node3 for cluster web-cluster
If it is not possible to move the resource groups that are online on node
web-node3, the above command will fail. You can use the
force option to stop HA services on a node even in the case of an
error.
3. Stop all cluster processes running on the node:
# /etc/rc.d/init.d/failsafe stop
4. Perform the FailSafe upgrade on the node web-node3.
5. After the FailSafe upgrade, check whether the cluster processes
(cmond, cad, crsd)
are running. If not, restart the cluster processes:
# fsconfig failsafe on; /etc/rc.d/init.d/failsafe start
6. Restart HA services on the node. The following cluster_mgr
command restarts HA services on the node:
cmgr> start ha_services on node web-node3 for cluster web-cluster
7. Make sure the resource groups are running on the most appropriate node
after restarting HA services.
Adding New Resource Groups or Resources in an Active Cluster
The following procedure describes how to add a resource group and resources
to an active cluster. To add resources to an existing resource group, perform
only resource configuration (Step 4), resource diagnostics (Step 5), and the
addition of resources to the resource group (Step 6).
1. Identify all the resources that have to be moved together.
These resources, running on a node, should be able to provide a service to the
client. These resources should be placed in a resource group. For example,
the Netscape web server mfg-web, its IP address 192.26.50.40,
and the filesystem /shared/mfg-web containing the web
configuration and document pages should be placed in the same resource group
(for example, mfg-web-rg).
2. Configure the resources on all nodes in the cluster where
the resource group is expected to be online. For example, this might involve
configuring the Netscape web server mfg-web on nodes
web-node1 and web-node2 in the cluster.
3. Create a failover policy. Determine the type of failover attribute
required for the resource group. The cluster_mgr template
(/usr/lib/failsafe/cmgr-templates/cmgr-create-failover_policy)
can be used to create the failover policy.
4. Configure the resources in the configuration database. There are
cluster_mgr templates to create resources of various resource types
in the /usr/lib/failsafe/cmgr-templates directory. For example,
the volume resource, the /shared/mfg-web filesystem resource,
the 192.26.50.40 IP_address resource, and the
mfg-web Netscape_web resource have to be created in the configuration
database. Create the resource dependencies for these resources.
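To see which resource-type templates are installed on the node (the exact set of
template files depends on the resource types installed on your system), the
directory can simply be listed:
# ls /usr/lib/failsafe/cmgr-templates/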
5. Run resource diagnostics. For information on the diagnostic
commands, see .
6. Create the resource group and add resources to the resource group.
The cluster_mgr template (/usr/lib/failsafe/cmgr-templates/cmgr-create-resource_group)
can be used to create the resource group and add resources to it.
All resources that are dependent on each other should be added to the
resource group at the same time. If resources are added to an existing resource
group that is online on a node in the cluster, the resources are also brought
online on the same node.
Adding a New Hardware Device in an Active Cluster
When you add hardware devices to an active cluster, you add them one
node at a time.
To add hardware devices to a node in an active cluster, follow the same
procedure as when you upgrade OS software in an active cluster, as described
in "Upgrading OS Software in an Active Cluster". In summary:
- You must move the resource groups offline and stop HA services
on the node before adding the hardware device.
- After adding the hardware device, make sure the cluster processes
are running, and start HA services on the node.
To include the new hardware device in the configuration database, you
must modify your resource configuration and your node configuration, where
appropriate.