File: [Development] / failsafe / FailSafe-books / LnxFailSafe_AG / clustproc.sgml (download)

Revision 1.1, Wed Nov 29 21:58:28 2000 UTC (16 years, 10 months ago) by vasa
Branch: MAIN
CVS Tags: HEAD

New documentation files for the Admin Guide.

<!-- Fragment document type declaration subset:
ArborText, Inc., 1988-1997, v.4001
<!DOCTYPE SET PUBLIC "-//Davenport//DTD DocBook V3.0//EN" [
<!ENTITY ha.cluster.messages SYSTEM "figures/ha.cluster.messages.eps" NDATA eps>
<!ENTITY machine.not.in.ha.cluster SYSTEM "figures/machine.not.in.ha.cluster.eps" NDATA eps>
<!ENTITY ha.cluster.config.info.flow SYSTEM "figures/ha.cluster.config.info.flow.eps" NDATA eps>
<!ENTITY software.layers SYSTEM "figures/software.layers.eps" NDATA eps>
<!ENTITY n1n4 SYSTEM "figures/n1n4.eps" NDATA eps>
<!ENTITY example.sgml SYSTEM "example.sgml">
<!ENTITY appupgrade.sgml SYSTEM "appupgrade.sgml">
<!ENTITY a1-1.failsafe.components SYSTEM "figures/a1-1.failsafe.components.eps" NDATA eps>
<!ENTITY a1-6.disk.storage.takeover SYSTEM "figures/a1-6.disk.storage.takeover.eps" NDATA eps>
<!ENTITY a2-3.non.shared.disk.config SYSTEM "figures/a2-3.non.shared.disk.config.eps" NDATA eps>
<!ENTITY a2-4.shared.disk.config SYSTEM "figures/a2-4.shared.disk.config.eps" NDATA eps>
<!ENTITY a2-5.shred.disk.2active.cnfig SYSTEM "figures/a2-5.shred.disk.2active.cnfig.eps" NDATA eps>
<!ENTITY a2-1.examp.interface.config SYSTEM "figures/a2-1.examp.interface.config.eps" NDATA eps>
<!ENTITY intro.sgml SYSTEM "intro.sgml">
<!ENTITY overview.sgml SYSTEM "overview.sgml">
<!ENTITY planning.sgml SYSTEM "planning.sgml">
<!ENTITY nodeconfig.sgml SYSTEM "nodeconfig.sgml">
<!ENTITY admintools.sgml SYSTEM "admintools.sgml">
<!ENTITY config.sgml SYSTEM "config.sgml">
<!ENTITY operate.sgml SYSTEM "operate.sgml">
<!ENTITY diag.sgml SYSTEM "diag.sgml">
<!ENTITY recover.sgml SYSTEM "recover.sgml">
<!ENTITY appfiles.sgml SYSTEM "appfiles.sgml">
<!ENTITY gloss.sgml SYSTEM "gloss.sgml">
<!ENTITY preface.sgml SYSTEM "preface.sgml">
<!ENTITY index.sgml SYSTEM "index.sgml">
]>
-->
<chapter id="LE55630-PARENT">
<title id="LE55630-TITLE">Upgrading and Maintaining Active Clusters</title>
<para>When a Linux FailSafe system is running, you may need to perform various
administration procedures without shutting down the entire cluster. This chapter
provides instructions for performing upgrade and maintenance procedures on
active clusters. It includes the following procedures:</para>
<itemizedlist>
<listitem><para><xref linkend="LE40594-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE15663-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE26765-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE31814-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE18685-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE32198-PARENT"></para>
</listitem>
</itemizedlist>
<sect1 id="LE40594-PARENT">
<title id="LE40594-TITLE">Adding a Node to an Active Cluster</title>
<para>Use the following procedure to add a node to an active cluster. This
procedure assumes that the <filename>cluster_admin</filename>, <filename>cluster_control</filename>,
<filename>cluster_ha</filename>, and <filename>failsafe2</filename> products
are already installed on this node.</para>
<orderedlist>
<listitem><para>Check the control network connections from the node to the rest
of the cluster using the <command>ping</command> command. Note the list of control
network IP addresses.</para>
</listitem>
<listitem><para>Check the serial connections to reset this node. Note the
name of the node that can reset this node.</para>
</listitem>
<listitem><para>Run node diagnostics. For information on FailSafe diagnostic
commands, see <xref linkend="LE56830-PARENT">.&ensp;</para>
</listitem>
<listitem><para>Make sure <filename>sgi-cad</filename>, <filename>sgi-crsd
</filename>, <filename>sgi-cmsd</filename>, and <filename>sgi-gcd</filename>
entries are present in the <?Pub _nolinebreak><filename>/etc/services</filename><?Pub /_nolinebreak> file.
The port numbers for these processes should match the port numbers in other
nodes in the cluster.</para>
<para>Example entries:</para>
<programlisting>sgi-cad         7200/tcp      # SGI cluster admin daemon
sgi-crsd        7500/udp      # SGI cluster reset services daemon
sgi-cmsd        7000/udp      # SGI cluster membership daemon
sgi-gcd         8000/udp      # SGI group communication daemon</programlisting>
</listitem>
<listitem><para>Check if cluster processes (<filename>cad</filename>, <filename>
cmond</filename>, <filename>crsd</filename>) are running.</para>
<programlisting># <userinput>ps -ef | grep cad</userinput></programlisting>
<para>If cluster processes are not running, run the <command>cdbreinit</command>
command.</para>
<programlisting># <userinput>/usr/lib/failsafe/bin/cdbreinit /var/lib/failsafe/cdb/cdb.db
</userinput>&ensp;
	Killing cdbd...
	Removing database header file /var/lib/failsafe/cdb/cdb.db...
	Preparing to delete database directory /var/lib/failsafe/cdb/cdb.db# !!
	Continue[y/n]<userinput>y</userinput>
	Removing database directory /var/lib/failsafe/cdb/cdb.db#...
	Deleted CDB database at /var/lib/failsafe/cdb/cdb.db
	Recreating new CDB database at /var/lib/failsafe/cdb/cdb.db with cdb-exitop...
	 cdbd
	 Created standard CDB database in /var/lib/failsafe/cdb/cdb.db

	 Please make sure that "sgi-cad" service is added to /etc/services file
	 If not, add the entry and restart cluster processes.
	 Please refer to FailSafe administration manual for more
	 information.

	 Modifying CDB database at /var/lib/failsafe/cdb/cdb.db with cluster_ha-exitop...
	 Modified standard CDB database in /var/lib/failsafe/cdb/cdb.db

	 Please make sure that "sgi-cmsd" and "sgi-gcd" services are added
	 to /etc/services file before starting HA services.
	 Please refer to FailSafe administration manual for more
	 information.

	 Starting cluster control processes with cluster_control-exitop...

	 Please make sure that "sgi-crsd" service is added to /etc/services file
	 If not, add the entry and restart cluster processes.
	 Please refer to FailSafe administration manual for more
	 information.

	 Started cluster control processes
	 Restarting cluster admin processes with failsafe-exitop...</programlisting>
</listitem>
<listitem><para>Use the <command>cluster_mgr</command> template (<?Pub _nolinebreak><filename>
/usr/lib/failsafe/cmgr-templates/cmgr-create-node</filename><?Pub /_nolinebreak>)
or the <command>cluster_mgr</command> command to define the node.</para>
<note>
<para>This node must be defined from one of the nodes that is already in the cluster.
</para>
</note>
</listitem>
<listitem><para>Use the <command>cluster_mgr</command> command to add the
node to the cluster.</para>
<para>For example, the following <command>cluster_mgr</command> command adds
the node <filename>web-node3</filename> to the cluster <filename>web-cluster
</filename>:</para>
<programlisting>cmgr> <userinput>modify cluster web-cluster</userinput>
Enter commands, when finished enter either "done" or "cancel"</programlisting>
<screen>web-cluster ? <userinput>add node web-node3</userinput>
web-cluster ? <userinput>done</userinput></screen>
</listitem>
<listitem><para>You can start HA services on this node using the <command>
cluster_mgr</command> command. For example, the following <command>cluster_mgr
</command> command starts HA services on node <filename>web-node3</filename>
in cluster <filename>web-cluster</filename>:</para>
<screen>cmgr> <userinput>start ha_services on node web-node3 for cluster web-cluster
</userinput></screen>
</listitem>
<listitem><para>Remember to add this node to the failure domain of the relevant
failover policy. In order to do this, the entire failover policy must be re-defined,
including the additional node in the failure domain.</para>
</listitem>
</orderedlist>
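<para>The <filename>/etc/services</filename> check in step 4 can be scripted.
The following is a minimal sketch, not part of the FailSafe software: the function
names <literal>check_failsafe_services</literal> and <literal>failsafe_service_ports</literal>
are hypothetical, and the only facts taken from this procedure are the four
service names. The first function reports which entries are missing; the second
prints the name and port of each entry so that the output from two nodes can
be compared.</para>

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with FailSafe) for step 4 of this procedure.

# Report which FailSafe service entries are missing from a services file.
# Usage: check_failsafe_services [FILE]    (FILE defaults to /etc/services)
check_failsafe_services() {
    services_file=${1:-/etc/services}
    missing=""
    for name in sgi-cad sgi-crsd sgi-cmsd sgi-gcd; do
        # Match the service name only when it is the first field of a line,
        # so commented-out entries such as "#sgi-cad" do not count.
        if ! awk -v n="$name" '$1 == n { found = 1 } END { exit !found }' "$services_file"; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all FailSafe service entries present"
}

# Print "name port/protocol" for each FailSafe entry, sorted, so that the
# output collected on two nodes can be compared with diff.
# Usage: failsafe_service_ports [FILE]
failsafe_service_ports() {
    awk '$1 ~ /^sgi-(cad|crsd|cmsd|gcd)$/ { print $1, $2 }' "${1:-/etc/services}" | sort
}
```

<para>Run <literal>check_failsafe_services</literal> on the new node, then save
the output of <literal>failsafe_service_ports</literal> on the new node and on
an existing cluster node and compare the two files. Any difference means the
port numbers do not match.</para>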
</sect1>
<sect1 id="LE15663-PARENT">
<title id="LE15663-TITLE">Deleting a Node from an Active Cluster</title>
<para>Use the following procedure to delete a node from an active cluster.
This procedure begins with the assumption that the node status is UP.</para>
<orderedlist>
<listitem><para>If resource groups are online on the node, use the <command>
cluster_mgr</command> command to move them to another node in the cluster.
</para>
<para>To move the resource groups to another node in the cluster, another
node must be available in the failover policy domain of the resource group.
If you want to leave the resource groups running on the same node, use the <command>
cluster_mgr</command> command to detach the resource group. For example, the
following command leaves the resource group <filename>web-rg</filename>
running on the same node in the cluster <filename>web-cluster</filename>:
</para>
<screen>cmgr> <userinput>admin detach resource_group "web-rg" in cluster web-cluster
</userinput></screen>
</listitem>
<listitem><para>Delete the node from the failure domains of any failover policies
which use the node. In order to do this, the entire failover policy must be
re-defined, deleting the affected node from the failure domain.</para>
</listitem>
<listitem><para>To stop HA services on the node <filename>web-node3</filename>,
use the following <command>cluster_mgr</command> command. This command will
move all the resource groups online on this node to other nodes in the cluster
if possible.</para>
<screen>cmgr> <userinput>stop ha_services on node web-node3 for cluster web-cluster
</userinput></screen>
<para>If it is not possible to move resource groups that are online on node <filename>
web-node3</filename>, the above command will fail. The <command>force</command>
option is available to stop HA services on a node even in the case of an error.
If any resources cannot be taken offline or deallocated properly, a side effect
of the forced stop is that these resources remain allocated on the node.</para>
</listitem>
</orderedlist>
<para>Perform Steps 4, 5, 6, and 7 if the node must be deleted from the configuration
database.</para>
<orderedlist continuation="continues">
<listitem><para>Delete the node from the cluster. To delete node <filename>
web-node3</filename> from <filename>web-cluster</filename> configuration,
use the following <command>cluster_mgr</command> command:</para>
<screen>cmgr> <userinput>modify cluster web-cluster</userinput>
Enter commands, when finished enter either "done" or "cancel"
web-cluster ? <userinput>remove node web-node3</userinput>
web-cluster ? <userinput>done</userinput></screen>
</listitem>
<listitem><para>Remove node configuration from the configuration database.
</para>
<para>The following <command>cluster_mgr</command> command deletes the <filename>
web-node3</filename> node definition from the configuration database.</para>
<screen>cmgr> <userinput>delete node web-node3</userinput></screen>
</listitem>
<listitem><para>Stop all cluster processes and delete the configuration database.
</para>
<para>The following commands stop cluster processes on the node and delete
the configuration database.</para>
<screen># <userinput>/etc/rc.d/init.d/failsafe stop</userinput>
# <userinput>killall cdbd</userinput>
# <userinput>cdbdelete /var/lib/failsafe/cdb/cdb.db</userinput></screen>
</listitem>
<listitem><para>Disable cluster and HA processes from starting when the node
boots. The following command performs this task:</para>
<screen># <userinput>fsconfig failsafe off</userinput></screen>
</listitem>
</orderedlist>
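<para>When the same removal is repeated for several nodes, it can help to generate
the <command>cluster_mgr</command> input for review before typing it. The following
is a minimal sketch under that assumption: the function name <literal>print_node_removal_commands</literal>
is hypothetical, and the function only prints the commands shown in steps 3,
4, and 5 of this procedure. It does not run <command>cluster_mgr</command>,
and it does not cover moving or detaching resource groups or redefining failover
policies.</para>

```shell
#!/bin/sh
# Hypothetical helper: print, for review, the cmgr commands from steps 3 to 5.
# Nothing is executed against the cluster.
# Usage: print_node_removal_commands NODE CLUSTER
print_node_removal_commands() {
    if [ "$#" -ne 2 ]; then
        echo "usage: print_node_removal_commands NODE CLUSTER" >&2
        return 2
    fi
    node=$1
    cluster=$2
    # Step 3: stop HA services on the node.
    printf 'stop ha_services on node %s for cluster %s\n' "$node" "$cluster"
    # Step 4: remove the node from the cluster definition.
    printf 'modify cluster %s\n' "$cluster"
    printf 'remove node %s\n' "$node"
    printf 'done\n'
    # Step 5: delete the node definition from the configuration database.
    printf 'delete node %s\n' "$node"
}
```

<para>For example, <literal>print_node_removal_commands web-node3 web-cluster</literal>
prints the five command lines used in this section, in order, ready to be checked
and entered at the <literal>cmgr></literal> prompt.</para>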
</sect1>
<sect1>
<title>Changing Control Networks in a Cluster</title>
<para><indexterm><primary>control network</primary><secondary>changing in
cluster</secondary></indexterm>Use the following procedure to change the control
networks in a currently active cluster. This procedure is valid for a two-node
cluster consisting of nodes <literal>node1</literal> and <literal>node2</literal>.
In this procedure, you must complete each step before proceeding to the next
step.</para>
<note>
<para>Do not perform any other administration operations during this procedure.
</para>
</note>
<orderedlist>
<listitem><para>From any node, stop HA services on the cluster. Make sure
all HA processes have exited on both nodes.</para>
</listitem>
<listitem><para>From <literal>node2</literal>, stop the cluster processes
on <literal>node2</literal>:</para>
<screen># <userinput>/etc/rc.d/init.d/fs_cluster stop</userinput>
# <userinput>killall cdbd</userinput></screen>
<para>Make sure the <literal>cdbd</literal> process has been killed on <literal>
node2</literal>.</para>
</listitem>
<listitem><para>From <literal>node1</literal>, modify the <literal>node1</literal>
and <literal>node2</literal> definitions. Use the following <command>cmgr</command>
commands:</para>
<screen><?Pub Dtl>cmgr> <userinput>modify node node1</userinput>
Enter commands, when finished enter either "done" or "cancel"
node1? <userinput>remove nic <replaceable>old nic address</replaceable></userinput>
node1? <userinput>add nic <replaceable>new nic address</replaceable></userinput>
NIC - <replaceable>new nic address</replaceable> <userinput>set heartbeat to ...
</userinput>
NIC - <replaceable>new nic address</replaceable> <userinput>set ctrl_msgs to ...
</userinput>
NIC - <replaceable>new nic address</replaceable> <userinput>set priority to ...
</userinput>
NIC - <replaceable>new nic address</replaceable> <userinput>done</userinput>
node1? <userinput>done</userinput></screen>
<para>Repeat the same procedure to modify <literal>node2</literal>.</para>
</listitem>
<listitem><para>From <literal>node1</literal>, check if the <literal>node1
</literal> and <literal>node2</literal> definitions are correct. Using <command>
cmgr</command> on <literal>node1</literal>, execute the following commands
to view the node definitions:</para>
<screen>cmgr> <userinput>show node node1</userinput>
cmgr> <userinput>show node node2</userinput></screen>
</listitem>
<listitem><para>On both <literal>node1</literal> and <literal>node2</literal>,
modify the network interface IP addresses in <filename>/etc/failsafe/config/netif.options
</filename> and execute <command>ifconfig</command> to configure the new IP
addresses on <literal>node1</literal> and <literal>node2</literal>. Verify
that the IP addresses match the node definitions in the CDB.</para>
</listitem>
<listitem><para>From <literal>node1</literal>, stop the cluster processes on <literal>
node1</literal>:</para>
<screen># <userinput>/etc/rc.d/init.d/fs_cluster stop</userinput>
# <userinput>killall cdbd</userinput></screen>
<para>Make sure the <literal>cdbd</literal> process has been killed on <literal>
node1</literal>.</para>
</listitem>
<listitem><para>From <literal>node2</literal>, execute the following command
to start the cluster processes on <literal>node2</literal>:</para>
<screen># <userinput>/usr/lib/failsafe/bin/cdbreinit /var/lib/failsafe/cdb/cdb.db</userinput></screen>
<para>Answer <userinput>y</userinput> to the prompt that appears.</para>
</listitem>
<listitem><para>From <literal>node1</literal>, start cluster processes on <literal>
node1</literal>:</para>
<screen># <userinput>/etc/rc.d/init.d/fs_cluster start</userinput></screen>
<para>The following messages should appear in the <filename>SYSLOG</filename>
on <literal>node2</literal>:</para>
<screen>Starting to receive CDB sync series from machine <replaceable>node1_node_id
</replaceable>
...
Finished receiving CDB sync series from machine <replaceable>node1_node_id
</replaceable></screen>
<para>Wait for approximately sixty seconds for the sync to complete.</para>
</listitem>
<listitem><para>From any node, start HA services in the cluster.</para>
</listitem>
</orderedlist>
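<para>The wait for the CDB sync messages in step 8 can be checked with a short
script instead of by eye. The following is a minimal sketch: the function name <literal>cdb_sync_finished</literal>
is hypothetical, and the path of the system log file is passed in as an argument
because this procedure does not state where <filename>SYSLOG</filename> is
stored on a given installation. The two message strings are the ones quoted
in step 8.</para>

```shell
#!/bin/sh
# Hypothetical helper: look for the CDB sync messages quoted in step 8.
# Usage: cdb_sync_finished LOGFILE
#   LOGFILE is the system log on node2; its location is installation-specific.
cdb_sync_finished() {
    log=$1
    if ! grep -q 'Starting to receive CDB sync series from machine' "$log"; then
        echo "sync not started"
        return 1
    fi
    if ! grep -q 'Finished receiving CDB sync series from machine' "$log"; then
        echo "sync started but not finished"
        return 1
    fi
    echo "sync finished"
}
```

<para>The check matches any occurrence of the messages in the file, including
ones left from an earlier sync. Compare the timestamps of the matching lines
with the time at which cluster processes were started on <literal>node1</literal>
before starting HA services.</para>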
</sect1>
<sect1 id="LE26765-PARENT">
<title id="LE26765-TITLE">Upgrading OS Software in an Active Cluster</title>
<para>When you upgrade your OS software in an active cluster, you perform
the upgrade on one node at a time.</para>
<para>If the OS software upgrade does not require reboot or does not impact
the FailSafe software, there is no need to use the OS upgrade procedure. If
you do not know whether the upgrade will impact FailSafe software or if the
OS upgrade requires a machine reboot, follow the upgrade procedure described
below.</para>
<para>The following procedure upgrades the OS software on node <filename>
web-node3</filename>.</para>
<orderedlist>
<listitem><para>If resource groups are online on the node, use a <command>
cluster_mgr</command> command to move them to another node in the cluster. To
move a resource group to another node in the cluster, another node must be
available in the failover policy domain of the resource group.</para>
<para>The following <command>cluster_mgr</command> command moves resource
group <filename>web-rg</filename> to another node in the cluster <filename>
web-cluster</filename>:</para>
<screen>cmgr> <userinput>admin move resource_group web-rg in cluster web-cluster
</userinput></screen>
</listitem>
<listitem><para>To stop HA services on the node <filename>web-node3</filename>,
use the following <command>cluster_mgr</command> command. This command will
move all the resource groups online on this node to other nodes in the cluster
if possible.</para>
<screen>cmgr> <userinput>stop ha_services on node web-node3 for cluster web-cluster
</userinput></screen>
<para>If it is not possible to move resource groups that are online on node <filename>
web-node3</filename>, the above command will fail. You can use the <command>
force</command> option to stop HA services on a node even in the case of an
error.</para>
</listitem>
<listitem><para>Perform the OS upgrade on the node <filename>web-node3</filename>.
</para>
</listitem>
<listitem><para>After the OS upgrade, make sure cluster processes (<command>
cmond</command>, <command>cad</command>, <command>crsd</command>) are running.
</para>
</listitem>
<listitem><para>Restart HA services on the node. The following <command>cluster_mgr
</command> command restarts HA services on the node:</para>
<screen>cmgr> <userinput>start ha_services on node web-node3 for cluster web-cluster
</userinput></screen>
<para>Make sure the resource groups are running on the most appropriate node
after restarting HA services.</para>
</listitem>
</orderedlist>
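<para>The check in step 4 that cluster processes are running can be made less
error-prone than a plain <command>grep</command> over the <command>ps</command>
output, which can also match the <command>grep</command> process itself. The
following is a minimal sketch: the function name <literal>missing_processes</literal>
is hypothetical, and it assumes that the local <command>ps</command> can print
one bare command name per line (for example with <literal>ps -e -o comm=</literal>).
</para>

```shell
#!/bin/sh
# Hypothetical helper: read process names, one per line, on standard input
# and print each expected name that is absent from the listing.
# Usage: ps -e -o comm= | missing_processes cmond cad crsd
missing_processes() {
    listing=$(cat)
    status=0
    for name in "$@"; do
        # -x matches whole lines and -F disables regular expressions, so
        # "cad" does not match a line such as "grep cad" or "sgi-cad".
        if ! printf '%s\n' "$listing" | grep -qxF "$name"; then
            echo "$name"
            status=1
        fi
    done
    return $status
}
```

<para>The function prints nothing and returns zero when every named process
is present. Any name it prints is a cluster process that still has to be started
before HA services are restarted on the node.</para>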
</sect1>
<sect1 id="LE31814-PARENT">
<title id="LE31814-TITLE">Upgrading FailSafe Software in an Active Cluster
</title>
<para>When you upgrade FailSafe software in an active cluster, you upgrade
one node at a time in the cluster.</para>
<para>The following procedure upgrades FailSafe on node <filename>web-node3
</filename>.</para>
<orderedlist>
<listitem><para>If resource groups are online on the node, use a <command>
cluster_mgr</command> command to move them to another node in the cluster. To
move a resource group to another node in the cluster, another node must be
available in the failover policy domain of the resource group.</para>
<para>The following <command>cluster_mgr</command> command moves resource
group <filename>web-rg</filename> to another node in the cluster <filename>
web-cluster</filename>:</para>
<screen>cmgr> <userinput>admin move resource_group web-rg in cluster web-cluster
</userinput></screen>
</listitem>
<listitem><para>To stop HA services on the node <filename>web-node3</filename>,
use the following <command>cluster_mgr</command> command. This command will
move all the resource groups online on this node to other nodes in the cluster
if possible.</para>
<screen>cmgr> <userinput>stop ha_services on node web-node3 for cluster web-cluster
</userinput></screen>
<para>If it is not possible to move resource groups that are online on node <filename>
web-node3</filename>, the above command will fail. You can use the <command>
force</command> option to stop HA services on a node even in the case of an
error.</para>
</listitem>
<listitem><para>Stop all cluster processes running on the node.</para>
<screen># <userinput>/etc/rc.d/init.d/failsafe stop</userinput></screen>
</listitem>
<listitem><para>Perform the FailSafe upgrade on the node <filename>web-node3
</filename>.</para>
</listitem>
<listitem><para>After the FailSafe upgrade, check whether cluster processes
(<command>cmond</command>, <command>cad</command>, <command>crsd</command>)
are running. If not, restart cluster processes:</para>
<screen># <userinput>fsconfig failsafe on; /etc/rc.d/init.d/failsafe start
</userinput></screen>
</listitem>
<listitem><para>Restart HA services on the node. The following <command>cluster_mgr
</command> command restarts HA services on the node:</para>
<screen>cmgr> <userinput>start ha_services on node web-node3 for cluster web-cluster
</userinput></screen>
<para>Make sure the resource groups are running on the most appropriate node
after restarting HA services.</para>
</listitem>
</orderedlist>
</sect1>
<sect1 id="LE18685-PARENT">
<title id="LE18685-TITLE">Adding New Resource Groups or Resources in an Active
Cluster</title>
<para>The following procedure describes how to add a resource group and resources
to an active cluster. To add resources to an existing resource group, perform
resource configuration (Step 4), resource diagnostics (Step 5) and add resources
to the resource group (Step 6).</para>
<orderedlist>
<listitem><para>Identify all the resources that have to be moved together.
Running together on a node, these resources should be able to provide a service
to the client. Place these resources in one resource group. For example,
Netscape web server <filename>mfg-web</filename>, its IP address 192.26.50.40,
and the filesystem <filename>/shared/mfg-web</filename> containing the web
configuration and document pages should be placed in the same resource group
(for example, <filename>mfg-web-rg</filename>).</para>
</listitem>
<listitem><para>Configure the resources on all nodes in the cluster where
the resource group is expected to be online. For example, this might involve
configuring Netscape web server <filename>mfg-web</filename> on nodes <filename>
web-node1</filename> and <filename>web-node2</filename> in the cluster.</para>
</listitem>
<listitem><para>Create a failover policy. Determine the type of failover attribute
required for the resource group. The <command>cluster_mgr</command> template
(<?Pub _nolinebreak><filename>/usr/lib/failsafe/cmgr-templates/cmgr-create-failover_policy
</filename><?Pub /_nolinebreak>) can be used to create the failover policy.
</para>
</listitem>
<listitem><para>Configure the resources in the configuration database. There are <command>
cluster_mgr</command> templates to create resources of various resource types
in the <filename>/usr/lib/failsafe/cmgr-templates</filename> directory. For example,
the volume resource, the <filename>/shared/mfg-web</filename> filesystem,
the <literal>192.26.50.40</literal> IP_address resource, and the <filename>
mfg-web</filename> Netscape_web resource have to be created in the configuration
database. Create the resource dependencies for these resources.</para>
</listitem>
<listitem><para>Run resource diagnostics. For information on the diagnostic
commands, see <xref linkend="LE56830-PARENT">.</para>
</listitem>
<listitem><para>Create the resource group and add resources to it.
The <command>cluster_mgr</command> template (<filename>/usr/lib/failsafe/cmgr-templates/cmgr-create-resource_group
</filename>) can be used to create the resource group and add resources to
it.</para>
<para>All resources that are dependent on each other should be added to the
resource group at the same time. If resources are added to an existing resource
group that is online on a node in the cluster, the resources are also made
online on the same node.</para>
</listitem>
</orderedlist>
</sect1>
<sect1 id="LE32198-PARENT">
<title id="LE32198-TITLE">Adding a New Hardware Device in an Active Cluster
</title>
<para>When you add hardware devices to an active cluster, you add them one
node at a time.</para>
<para>To add hardware devices to a node in an active cluster, follow the same
procedure as when you upgrade OS software in an active cluster, as described
in <xref linkend="LE26765-PARENT">. In summary:</para>
<itemizedlist>
<listitem><para>You must move the resource groups offline and stop HA services
on the node before adding the hardware device.</para>
</listitem>
<listitem><para>After adding the hardware device, make sure cluster processes
are running and start HA services on the node.</para>
</listitem>
</itemizedlist>
<para>To include the new hardware device in the configuration database, you
must modify your resource configuration and your node configuration, where
appropriate.</para>
</sect1>
</chapter>