<!-- Fragment document type declaration subset:
ArborText, Inc., 1988-1997, v.4001
<!DOCTYPE SET PUBLIC "-//Davenport//DTD DocBook V3.0//EN" [
<!ENTITY ha.cluster.messages SYSTEM "figures/ha.cluster.messages.eps" NDATA eps>
<!ENTITY machine.not.in.ha.cluster SYSTEM "figures/machine.not.in.ha.cluster.eps" NDATA eps>
<!ENTITY ha.cluster.config.info.flow SYSTEM "figures/ha.cluster.config.info.flow.eps" NDATA eps>
<!ENTITY software.layers SYSTEM "figures/software.layers.eps" NDATA eps>
<!ENTITY n1n4 SYSTEM "figures/n1n4.eps" NDATA eps>
<!ENTITY example.sgml SYSTEM "example.sgml">
<!ENTITY appupgrade.sgml SYSTEM "appupgrade.sgml">
<!ENTITY a1-1.failsafe.components SYSTEM "figures/a1-1.failsafe.components.eps" NDATA eps>
<!ENTITY a1-6.disk.storage.takeover SYSTEM "figures/a1-6.disk.storage.takeover.eps" NDATA eps>
<!ENTITY a2-3.non.shared.disk.config SYSTEM "figures/a2-3.non.shared.disk.config.eps" NDATA eps>
<!ENTITY a2-4.shared.disk.config SYSTEM "figures/a2-4.shared.disk.config.eps" NDATA eps>
<!ENTITY a2-5.shred.disk.2active.cnfig SYSTEM "figures/a2-5.shred.disk.2active.cnfig.eps" NDATA eps>
<!ENTITY a2-1.examp.interface.config SYSTEM "figures/a2-1.examp.interface.config.eps" NDATA eps>
<!ENTITY intro.sgml SYSTEM "intro.sgml">
<!ENTITY overview.sgml SYSTEM "overview.sgml">
<!ENTITY planning.sgml SYSTEM "planning.sgml">
<!ENTITY nodeconfig.sgml SYSTEM "nodeconfig.sgml">
<!ENTITY admintools.sgml SYSTEM "admintools.sgml">
<!ENTITY config.sgml SYSTEM "config.sgml">
<!ENTITY operate.sgml SYSTEM "operate.sgml">
<!ENTITY diag.sgml SYSTEM "diag.sgml">
<!ENTITY clustproc.sgml SYSTEM "clustproc.sgml">
<!ENTITY appfiles.sgml SYSTEM "appfiles.sgml">
<!ENTITY gloss.sgml SYSTEM "gloss.sgml">
<!ENTITY preface.sgml SYSTEM "preface.sgml">
<!ENTITY index.sgml SYSTEM "index.sgml">
]>
-->
<chapter id="LE28716-PARENT">
<title id="LE28716-TITLE">Linux FailSafe Recovery</title>
<para>This chapter provides information on FailSafe system recovery, and includes
sections on the following topics:</para>
<itemizedlist>
<listitem><para><xref linkend="LE14340-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE28847-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE17012-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE13884-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE35544-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE26593-PARENT"></para>
</listitem>
</itemizedlist>
<sect1 id="LE14340-PARENT">
<title id="LE14340-TITLE">Overview of FailSafe System Recovery</title>
<para><indexterm id="ITrecover-0"><primary>recovery</primary><secondary>overview
</secondary></indexterm>When a FailSafe system experiences problems, you can
use some of the FailSafe features and commands to determine where the problem
is.</para>
<para>FailSafe provides the following tools to evaluate and recover from system
failure:</para>
<itemizedlist>
<listitem><para>Log files</para>
</listitem>
<listitem><para>Commands to monitor status of system components</para>
</listitem>
<listitem><para>Commands to start, stop, and fail over highly available services
</para>
</listitem>
</itemizedlist>
<para>Keep in mind that the FailSafe logs may not detect system problems that
do not translate into FailSafe problems. For example, if a CPU goes bad, or
hardware maintenance is required, FailSafe may not be able to detect and log
these failures.</para>
<para>In general, when evaluating system problems of any nature on a FailSafe
configuration, you should determine whether you need to shut down a node to
address those problems. When you shut down a node, perform the following steps:
</para>
<orderedlist>
<listitem><para>Stop FailSafe services on that node</para>
</listitem>
<listitem><para>Shut down the node to perform needed maintenance and repair
</para>
</listitem>
<listitem><para>Start up the node</para>
</listitem>
<listitem><para>Start FailSafe services on that node</para>
</listitem>
</orderedlist>
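<para>For example, assuming a node named <filename>web-node3</filename> in a
cluster named <filename>web-cluster</filename> (the sample names used in the
recovery procedures later in this chapter), steps 1 and 4 might be performed
with the following <command>cluster_mgr</command> commands, with the shutdown,
maintenance, and reboot taking place between them:</para>
<screen>cmgr> <userinput>stop ha_services in node web-node3 for cluster web-cluster</userinput>
cmgr> <userinput>start ha_services in node web-node3 for cluster web-cluster</userinput></screen>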
<para>It is important that you explicitly stop FailSafe services before shutting
down a node, where possible, so that FailSafe does not interpret the node
shutdown as node failure. If FailSafe interprets the service interruption
as node failure, there could be unexpected ramifications, depending on how
you have configured your resource groups and your application failover domain.
</para>
<para>When you shut down a node to perform maintenance, you may need to change
your FailSafe configuration to keep your system running.</para>
</sect1>
<sect1 id="LE28847-PARENT">
<title id="LE28847-TITLE">FailSafe Log Files</title>
<para><indexterm id="ITrecover-1"><primary>log files</primary></indexterm>Linux
FailSafe maintains system logs for each of the FailSafe daemons. You can customize
the system logs according to the level of logging you wish to maintain.</para>
<para>For information on setting up log configurations, see  <xref linkend="fs-setlogparams">.
</para>
<para>Log messages can be of the following types:</para>
<variablelist>
<varlistentry><term>Normal</term>
<listitem>
<para>Normal messages report on the successful completion of a task. An example
of a normal message is as follows:</para>
<programlisting>Wed Sep 2 11:57:25.284 &lt;N ha_gcd cms 10185:0> 
Delivering TOTAL membership (S# 1, GS# 1)</programlisting>
</listitem>
</varlistentry>
<varlistentry><term>Error/Warning</term>
<listitem>
<para>Error or warning messages indicate that an error has occurred or may
occur soon. These messages may result from using the wrong command or improper
syntax. An example of a warning message is as follows:</para>
<programlisting>Wed Sep 2 13:45:47.199 &lt;W crsd crs 9908:0 
crs_config.c:634> CI_ERR_NOTFOUND, safer - no 
such node</programlisting>
</listitem>
</varlistentry>
<varlistentry><term>Syslog Messages</term>
<listitem>
<para>All normal and error messages are also logged to <filename>syslog</filename>.
Syslog messages include the symbol <command>&lt;CI></command> in the header
to indicate they are cluster-related messages. An example of a syslog message
is as follows:</para>
<programlisting>Wed Sep 2 12:22:57 6X:safe syslog: &lt;&lt;CI> 
ha_cmsd misc 10435:0> CI_FAILURE, I am not part 
of the enabled cluster anymore</programlisting>
</listitem>
</varlistentry>
<varlistentry><term>Debug</term>
<listitem>
<para>Debug messages appear in the log group file when the logging level is
set to debug0 or higher (using the GUI) or 10 or higher (using the CLI).</para>
<note>
<para>Many megabytes of disk space can be consumed on the server when debug
levels are used in a log configuration.</para>
</note>
</listitem>
</varlistentry>
</variablelist>
<para>Examining the log files should enable you to see the nature of the system
error. By noting the time of the error and reviewing the activity of the various
daemons in the log files immediately before the error occurred, you may be able
to determine what situation existed that caused the failure.</para>
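<para>As a minimal sketch, assuming the default log directory
<filename>/var/log/failsafe</filename> and a node named
<filename>web-node3</filename> (the sample node name used elsewhere in this
chapter), you might list the daemon logs and search them for failure messages
as follows:</para>
<screen># <userinput>ls -l /var/log/failsafe</userinput>
# <userinput>grep CI_FAILURE /var/log/failsafe/cmsd_web-node3</userinput>
# <userinput>grep CI_ERR /var/log/failsafe/crsd_web-node3</userinput></screen>
<para>Error and warning messages are also duplicated in <filename>syslog</filename>
with the <command>&lt;CI></command> marker in their headers.</para>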
</sect1>
<sect1 id="LE17012-PARENT">
<title id="LE17012-TITLE">Node Membership and Resets</title>
<para>When reviewing the actions of a FailSafe system after a failure to determine
what has gone wrong and how processes have transferred, it is important to
consider the concept of node membership. When failover occurs, the runtime
failover domain can include only those nodes that are in the cluster membership.<indexterm
id="ITrecover-2"><primary>membership</primary><secondary>cluster</secondary>
</indexterm> <indexterm id="ITrecover-3"><primary>cluster</primary><secondary>
membership</secondary></indexterm><indexterm id="ITrecover-4"><primary>membership
</primary><secondary>node</secondary></indexterm> <indexterm id="ITrecover-5">
<primary>node</primary><secondary>membership</secondary></indexterm></para>
<sect2>
<title>Node Membership and Tie-Breaker Node</title>
<para><?Pub Dtl><indexterm><primary>node</primary><secondary>membership</secondary>
</indexterm>Nodes can enter into the cluster membership only when they are
not disabled and they are in a known state. This ensures that data integrity
is maintained because only nodes within the cluster membership can access
the shared storage. If nodes outside the membership and not controlled by
FailSafe were able to access the shared storage, two nodes might try to access
the same data at the same time, a situation that would result in data corruption.
For this reason, disabled nodes do not participate in the membership computation.
Note that no attempt is made to reset nodes that are configured as disabled before
confirming the cluster membership.</para>
<para><?Pub Dtl>Node membership in a cluster is based on a quorum majority.
For a cluster to be enabled, more than 50% of the nodes in the cluster must
be in a known state, able to talk to each other, using heartbeat control networks.
This quorum determines which nodes are part of the cluster membership that
is formed.</para>
<para><?Pub Dtl><indexterm><primary>tie-breaker node</primary></indexterm>If
there is an even number of nodes in the cluster, it is possible that there
will be no majority quorum; there could be two sets of nodes, each consisting
of 50% of the total number of nodes, unable to communicate with the other set
of nodes. For example, a network failure in a four-node cluster could leave
two sets of two nodes that cannot communicate with each other. In this case,
FailSafe uses the node that has been configured as the tie-breaker node when
you configured your FailSafe parameters. If no tie-breaker node was configured,
FailSafe uses the enabled node with the lowest node ID number.</para>
<para>For information on setting tie-breaker nodes, see <xref linkend="fs-setfsparameters">.
</para>
<para><?Pub Dtl><indexterm id="ITrecover-6"><primary>node</primary><secondary>
reset</secondary></indexterm> <indexterm id="ITrecover-7"><primary>resetting
nodes</primary></indexterm>The nodes in a quorum attempt to reset the nodes
that are not in the quorum. Nodes that can be reset are declared <literal>
DOWN</literal> in the membership; nodes that cannot be reset are declared <literal>
UNKNOWN</literal>. Nodes in the quorum are <literal>UP</literal>.</para>
<para><?Pub Dtl>If a new majority quorum is computed, a new membership is
declared whether any node could be reset or not.</para>
<para><?Pub Dtl>If at least one node in the current quorum has a current membership,
the nodes will proceed to declare a new membership if they can reset at least
one node.</para>
<para><?Pub Dtl>If all nodes in the new tied quorum are coming up for the
first time, they will try to reset and proceed with a new membership only
if the quorum includes the tie-breaker node.</para>
<para><?Pub Dtl>If a tied subset of nodes in the cluster had no previous membership,
then the subset of nodes in the cluster with the tie-breaker node attempts
to reset nodes in the other subset of nodes in the cluster. If at least one
node reset succeeds, a new membership is confirmed.</para>
<para><?Pub Dtl>If a tied subset of nodes in the cluster had previous membership,
the nodes in one subset of nodes in the cluster attempt to reset nodes in
the other subset of nodes in the cluster. If at least one node reset succeeds,
a new membership is confirmed. The subset of nodes in the cluster with the
tie-breaker node resets immediately; the other subset of nodes in the cluster
attempts to reset after some time.</para>
<para><?Pub Dtl>Resets are performed through system controllers connected to tty
ports over serial lines. Periodic serial line monitoring never stops. If
the estimated serial line monitoring failure interval and the estimated heartbeat
loss interval overlap, FailSafe suspects a power failure at the node being reset.
</para>
</sect2>
<sect2>
<title>No Membership Formed</title>
<para><indexterm><primary>cluster membership</primary></indexterm>When no
cluster membership is formed, you should check the following areas for possible
problems:</para>
<itemizedlist>
<listitem><para>Is the cluster membership daemon, <command>ha_cmsd</command>,
running? Is the database daemon, <command>cdbd</command>, running? (See the
example following this list.)</para>
</listitem>
<listitem><para>Can the nodes communicate with each other?</para>
<itemizedlist>
<listitem><para>Are the control networks configured as heartbeat networks?
</para>
</listitem>
</itemizedlist>
</listitem>
<listitem><para>Can the control network addresses be pinged from peer nodes?
</para>
</listitem>
<listitem><para>Are the quorum majority or tie rules satisfied?</para>
<para>Look at the <filename>cmsd</filename> log to determine membership status.
</para>
</listitem>
<listitem><para>If a reset is required, are the following conditions met?
</para>
<itemizedlist>
<listitem><para>Is the node control daemon, <command>crsd</command>,
up and running?</para>
</listitem>
<listitem><para>Is the reset serial line in good health?</para>
<para>You can look at the <filename>crsd</filename> log for the node you are
concerned with, or execute an <command>admin ping</command> and <command>
admin reset</command> command on the node to check this.</para>
</listitem>
</itemizedlist>
</listitem>
</itemizedlist>
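<para>As a minimal sketch of the first check, you can verify that the daemons
are running with a standard process listing; the exact output depends on your
system:</para>
<screen># <userinput>ps -ef | grep ha_cmsd</userinput>
# <userinput>ps -ef | grep cdbd</userinput>
# <userinput>ps -ef | grep crsd</userinput></screen>
<para>To verify the reset serial line, use the <command>admin ping</command>
command described in <xref linkend="LE33694-PARENT">.</para>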
</sect2>
</sect1>
<sect1 id="LE13884-PARENT">
<title id="LE13884-TITLE">Status Monitoring</title>
<para>FailSafe allows you to monitor and check the status of specified clusters,
nodes, resources, and resource groups. You can use this feature to isolate
where your system is encountering problems.</para>
<para>With the FailSafe Cluster Manager GUI Cluster View, you can monitor
the status of the FailSafe components continuously through their visual representation.
Using the FailSafe Cluster Manager CLI, you can display the status of the individual
components by using the <command>show</command> command.</para>
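<para>For example, the following <command>cluster_mgr</command> command, which
is also used in the recovery procedures later in this chapter, displays the
configuration of a node named <filename>web-node3</filename> (a sample node
name):</para>
<screen>cmgr> <userinput>show node web-node3</userinput></screen>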
<para>For information on status monitoring and on the meaning of the states
of the FailSafe components, see <xref linkend="LE16877-PARENT">.</para>
</sect1>
<sect1 id="LE35544-PARENT">
<title id="LE35544-TITLE">Dynamic Control of FailSafe Services</title>
<para>FailSafe allows you to perform a variety of administrative tasks that
can help you troubleshoot a system with problems without bringing down the
entire system. These tasks include the following:</para>
<itemizedlist>
<listitem><para>You can add or delete nodes from a cluster without affecting
the FailSafe services and the applications running in the cluster</para>
</listitem>
<listitem><para>You can add or delete a resource group without affecting other
online resource groups</para>
</listitem>
<listitem><para>You can add or delete resources from a resource group while
it is still online</para>
</listitem>
<listitem><para>You can change FailSafe parameters such as the heartbeat interval
and the node timeout and have those values take immediate effect while the
services are up and running</para>
</listitem>
<listitem><para>You can start and stop FailSafe services on specified nodes
</para>
</listitem>
<listitem><para>You can move a resource group online, or take it offline</para>
</listitem>
<listitem><para>You can stop the monitoring of a resource group by putting
the resource group into maintenance mode. This is not an expensive operation
because it does not stop and start the resource group; it simply puts the resource
group in a state where it is not available to FailSafe.</para>
</listitem>
<listitem><para>You can reset individual nodes</para>
</listitem>
</itemizedlist>
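<para>For example, assuming a resource group named <filename>web-rg</filename>
in a cluster named <filename>web-cluster</filename> (the sample names used in
the recovery procedures later in this chapter), moving the resource group online
and then placing it in maintenance mode might look like the following <command>
cluster_mgr</command> commands:</para>
<screen>cmgr> <userinput>admin online resource_group web-rg in cluster web-cluster</userinput>
cmgr> <userinput>admin maintenance_on resource_group web-rg in cluster web-cluster</userinput></screen>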
<para>For information on how to perform these tasks, see <xref linkend="LE94219-PARENT">,
and <xref linkend="LE99367-PARENT">.</para>
</sect1>
<sect1 id="LE26593-PARENT">
<title id="LE26593-TITLE">Recovery Procedures</title>
<para><indexterm id="ITrecover-8"><primary>recovery</primary><secondary>procedures
</secondary></indexterm>The following sections describe various recovery procedures
you can perform when different FailSafe components fail. Procedures for the
following situations are provided:</para>
<itemizedlist>
<listitem><para><xref linkend="LE37488-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE22743-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE13349-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE15209-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE32749-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE33694-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE32026-PARENT"></para>
</listitem>
<listitem><para><xref linkend="LE19553-PARENT"></para>
</listitem>
</itemizedlist>
<sect2 id="LE37488-PARENT">
<title id="LE37488-TITLE">Cluster Error Recovery</title>
<para>Follow this procedure if the status of the cluster is UNKNOWN on all nodes
in the cluster.<indexterm id="ITrecover-9"><primary>cluster</primary><secondary>
error recovery</secondary></indexterm></para>
<orderedlist>
<listitem><para>Check to see if there are control networks that have failed
(see <xref linkend="LE32749-PARENT">).</para>
</listitem>
<listitem><para>At least 50% of the nodes in the cluster must be able to communicate
with each other to have an active cluster (the quorum requirement). If there are
not enough nodes in the cluster that can communicate with each other using
control networks, stop HA services on some of the nodes so that the quorum
requirement is satisfied.</para>
</listitem>
<listitem><para>If there are no hardware configuration problems, detach all
resource groups that are online in the cluster (if any), stop HA services
in the cluster, and restart HA services in the cluster.</para>
</listitem>
</orderedlist>
<para>The following <command>cluster_mgr</command> command detaches the resource
group <?Pub _nolinebreak><filename>web-rg</filename><?Pub /_nolinebreak> in
cluster <?Pub _nolinebreak><filename>web-cluster</filename><?Pub /_nolinebreak>:
</para>
<screen>cmgr> <userinput>admin detach resource_group web-rg in cluster web-cluster
</userinput></screen>
<para>To stop HA services in the cluster <filename>web-cluster</filename>
and ignore errors (<command>force</command> option), use the following <command>
cluster_mgr</command> command:</para>
<screen>cmgr> <userinput>stop ha_services for cluster web-cluster force</userinput></screen>
<para>To start HA services in the cluster <filename>web-cluster</filename>,
use the following <command>cluster_mgr</command> command:</para>
<programlisting>cmgr> <userinput>start ha_services for cluster web-cluster
</userinput></programlisting>
</sect2>
<sect2 id="LE22743-PARENT">
<title id="LE22743-TITLE">Node Error recovery</title>
<para>Follow this procedure if the status of a node is UNKNOWN in an active
cluster:<indexterm id="ITrecover-10"><primary>node</primary><secondary>error
recovery</secondary></indexterm></para>
<orderedlist>
<listitem><para>Check to see if the control networks in the node are working
(see <xref linkend="LE32749-PARENT">).</para>
</listitem>
<listitem><para>Check to see if the serial reset cables to reset the node
are working (see <xref linkend="LE33694-PARENT">).</para>
</listitem>
<listitem><para>If there are no hardware configuration problems, stop HA services
in the node and restart HA services.</para>
<para>To stop HA services in the node <filename>web-node3</filename> in the
cluster <filename>web-cluster</filename>, ignoring errors (<command>force
</command> option), use the following <command>cluster_mgr</command> command:
</para>
<programlisting>cmgr> <userinput>stop ha_services in node web-node3 for cluster web-cluster 
force</userinput></programlisting>
<para>To start HA services in the node <filename>web-node3 </filename>in the
cluster <filename>web-cluster</filename>, use the following <command>cluster_mgr
</command> command:</para>
<programlisting>cmgr> <userinput>start ha_services in node web-node3 for cluster web-cluster
</userinput></programlisting>
</listitem>
</orderedlist>
</sect2>
<sect2 id="LE13349-PARENT">
<title id="LE13349-TITLE">Resource Group Maintenance and Error Recovery</title>
<para>To do simple maintenance on an application that is part of the resource
group, use the following procedure. This procedure stops monitoring the resources
in the resource group when maintenance mode is on. You need to turn maintenance
mode off when application maintenance is done.<indexterm id="ITrecover-11">
<primary>resource group</primary><secondary>recovery</secondary></indexterm></para>
<caution>
<para>If there is node failure on the node where resource group maintenance
is being performed, the resource group is moved to another node in the failover
policy domain.</para>
</caution>
<orderedlist>
<listitem><para>To put a resource group <filename>web-rg</filename> in maintenance
mode, use the following <command>cluster_mgr</command> command:</para>
<screen>cmgr> <userinput>admin maintenance_on resource_group web-rg in cluster web-cluster
</userinput></screen>
</listitem>
<listitem><para>The resource group state changes to <literal>ONLINE_MAINTENANCE
</literal>. Do whatever application maintenance is required. (Rotating application
logs is an example of simple application maintenance).</para>
</listitem>
<listitem><para>To remove a resource group <filename>web-rg</filename> from
maintenance mode, use the following <?Pub _nolinebreak><command>cluster_mgr
</command><?Pub /_nolinebreak><?Pub Caret> command:</para>
<screen>cmgr> <userinput>admin maintenance_off resource_group web-rg in cluster 
web-cluster</userinput></screen>
<para>The resource group state changes back to <literal>ONLINE</literal>.
</para>
</listitem>
</orderedlist>
<para>You perform the following procedure when a resource group is in an <literal>
ONLINE</literal> state and has an SRMD <literal>EXECUTABLE ERROR</literal>.
</para>
<orderedlist>
<listitem><para>Look at the SRM logs (default location: <filename>/var/log/failsafe/srmd_
</filename><replaceable>nodename</replaceable>) to determine the cause of failure
and the resource that has failed.</para>
</listitem>
<listitem><para>Fix the cause of failure. This might require changes to resource
configuration or changes to resource type stop/start/failover action timeouts.
</para>
</listitem>
<listitem><para>After fixing the problem, move the resource group offline
with the <command>force</command> option and then move the resource group
online.</para>
<para>The following <command>cluster_mgr</command> command moves the resource
group <filename>web-rg</filename> in the cluster <filename>web-cluster</filename>
offline and ignores any errors:</para>
<screen>cmgr> <userinput>admin offline resource_group web-rg in cluster web-cluster 
force</userinput></screen>
<para>The following <command>cluster_mgr</command> command moves the resource
group <filename>web-rg</filename> in the cluster <filename>web-cluster</filename>
online:</para>
<screen>cmgr> <userinput>admin online resource_group web-rg in cluster web-cluster
</userinput></screen>
<para>The resource group <filename>web-rg</filename> should be in an <literal>
ONLINE</literal> state with no error.</para>
</listitem>
</orderedlist>
<para>You use the following procedure when a resource group is not online
but is in an error state. Most of these errors occur as a result of the exclusivity
process. This process, run when a resource group is brought online, determines
if any resources are already allocated somewhere in the failure domain of
a resource group. Note that exclusivity scripts return that a resource is
allocated on a node if the script fails in any way. In other words, unless
the script can determine that a resource is not present, it returns a value
indicating that the resource is allocated.</para>
<para>Some possible error states include: <literal>SPLIT RESOURCE GROUP (EXCLUSIVITY)
</literal>, <literal>NODE NOT AVAILABLE (EXCLUSIVITY)</literal>, <literal>
NO AVAILABLE NODES</literal> in failure domain. See <xref linkend="LE29367-PARENT">,
for explanations of resource group error codes.</para>
<orderedlist>
<listitem><para>Look at the <filename>failsafe</filename> and SRM logs (default
directory: <filename>/var/log/failsafe</filename>, files: <filename>failsafe_
</filename><replaceable>nodename</replaceable>, <filename>srmd_</filename><replaceable>
nodename</replaceable>) to determine the cause of the failure and the resource
that failed.</para>
<para>For example, say the task of moving a resource group online results
in a resource group with error state <literal>SPLIT RESOURCE GROUP (EXCLUSIVITY)
</literal>. This means that parts of a resource group are allocated on at
least two different nodes. One of the failsafe logs will have the description
of which nodes are believed to have the resource group partially allocated.
</para>
<para>At this point, look at the <filename>srmd</filename> logs on each of
these machines to see what resources are believed to be allocated. In some
cases, a misconfigured resource will show up as a resource which is allocated.
This is especially true for <filename>Netscape_web</filename> resources.</para>
</listitem>
<listitem><para>Fix the cause of the failure. This might require changes to
resource configuration or changes to resource type start/stop/exclusivity
timeouts.</para>
</listitem>
<listitem><para>After fixing the problem, move the resource group offline
with the <command>force</command> option and then move the resource group
online.</para>
</listitem>
</orderedlist>
<para>There are a few double failures that can occur in the cluster which
will cause resource groups to remain in a non-highly-available state. At times
a resource group might get stuck in an offline state. A resource group might
also stay in an error state on a node even when a new node joins the cluster
and the resource group can migrate to that node to clear the error.</para>
<para>When these circumstances arise, the correct action should be as follows:
</para>
<orderedlist>
<listitem><para>Try to move the resource group online if it is offline.</para>
</listitem>
<listitem><para>If the resource group is stuck on a node, detach the resource
group, then bring it online again (see the example following this list). This
should clear many errors.</para>
</listitem>
<listitem><para>If detaching the resource group does not work, force the resource
group offline, then bring it back online.</para>
</listitem>
<listitem><para>If commands appear to be hanging or not working properly,
detach all resource groups, then shut down the cluster and bring all resource
groups back online.</para>
</listitem>
</orderedlist>
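<para>As a minimal sketch, assuming the resource group <filename>web-rg</filename>
in the cluster <filename>web-cluster</filename> used in the preceding examples,
detaching a stuck resource group and bringing it back online might look like
the following <command>cluster_mgr</command> commands:</para>
<screen>cmgr> <userinput>admin detach resource_group web-rg in cluster web-cluster</userinput>
cmgr> <userinput>admin online resource_group web-rg in cluster web-cluster</userinput></screen>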
<para>See <xref linkend="fs-takeresgroupoffline">, for information on detaching
resource groups and forcing resource groups offline.</para>
</sect2>
<sect2 id="LE15209-PARENT" role="fs-recresource">
<title id="LE15209-TITLE">Resource Error Recovery</title>
<para>You use this procedure when a resource that is not part of a resource
group is in an <literal>ONLINE</literal> state with error. This can happen
when the addition or removal of resources from a resource group fails.<indexterm
id="ITrecover-12"><primary>resource</primary><secondary>recovery</secondary>
</indexterm></para>
<orderedlist>
<listitem><para>Look at the SRM logs (default location: <filename>/var/log/failsafe/srmd_
</filename><replaceable>nodename</replaceable>) to determine the cause of
failure and the resource that has failed.</para>
</listitem>
<listitem><para>Fix the cause of failure. This might require changes to resource
configuration or changes to resource type stop/start/failover action timeouts.
</para>
</listitem>
<listitem><para>After fixing the problem, move the resource offline with the <command>
force</command> option of the Cluster Manager CLI <command>admin offline</command>
command:</para>
<screen>cmgr><userinput>&ensp;admin offline_force resource web-srvr of resource_type 
Netscape_Web in cluster web-cluster</userinput></screen>
<para>Executing this command removes the error state of resource <filename>
web-srvr</filename> of type <filename>Netscape_Web</filename>, making it available
to be added to a resource group.</para>
<para>You can also use the Cluster Manager GUI to clear the error state for
the resource. To do this, you select the &ldquo;Recover a Resource&rdquo;
task from the &ldquo;Resources and Resource Types&rdquo; category of the FailSafe
Manager.</para>
</listitem>
</orderedlist>
</sect2>
<sect2 id="LE32749-PARENT">
<title id="LE32749-TITLE">Control Network Failure Recovery</title>
<para>Control network failures are reported in <filename>cmsd</filename> logs.
The default location of the <filename>cmsd</filename> log is <?Pub _nolinebreak><filename>
/var/log/failsafe/cmsd_<replaceable>nodename</replaceable></filename><?Pub /_nolinebreak>.
Follow this procedure when the control network fails:<indexterm id="ITrecover-13">
<primary>control network</primary><secondary>recovery</secondary></indexterm></para>
<orderedlist>
<listitem><para>Use the <command>ping</command> command to check whether the
control network IP address is configured in the node.</para>
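<para>For example, assuming a control network address of <literal>192.26.50.14</literal>
(a hypothetical address used here only for illustration), you can check whether
it responds from a peer node as follows:</para>
<screen># <userinput>ping -c 3 192.26.50.14</userinput></screen>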
</listitem>
<listitem><para>Check node configuration to see whether the control network
IP addresses are correctly specified.</para>
<para>The following <command>cluster_mgr</command> command displays node configuration
for <filename>web-node3:</filename></para>
<screen>cmgr> <userinput>show node web-node3</userinput></screen>
</listitem>
<listitem><para>If IP names are specified for control networks instead of
IP addresses in XX.XX.XX.XX notation, check to see whether IP names can be
resolved using DNS. It is recommended that IP addresses be used instead of
IP names.</para>
</listitem>
<listitem><para>Check whether the heartbeat interval and node timeouts are
correctly set for the cluster. These HA parameters can be seen using the <command>
cluster_mgr show ha_parameters</command> command.</para>
</listitem>
</orderedlist>
</sect2>
<sect2 id="LE33694-PARENT">
<title id="LE33694-TITLE">Serial Cable Failure Recovery</title>
<para>Serial cables are used for resetting a node when there is a node failure.
Serial cable failures are reported in <filename>crsd</filename> logs. The
default location for the <filename>crsd</filename> log is <?Pub _nolinebreak><filename>
/var/log/failsafe/crsd_<replaceable>nodename</replaceable></filename><?Pub /_nolinebreak>.<indexterm
id="ITrecover-14"><primary>serial cable recovery</primary></indexterm></para>
<orderedlist>
<listitem><para>Check the node configuration to see whether serial cable connection
is correctly configured.</para>
<para>The following <command>cluster_mgr</command> command displays node configuration
for <filename>web-node3</filename>:</para>
<screen>cmgr> <userinput>show node web-node3</userinput></screen>
<para>Use the <command>cluster_mgr admin ping</command> command to verify
the serial cables.</para>
<screen>cmgr> <userinput>admin ping node web-node3</userinput></screen>
</listitem>
</orderedlist>
<para>The above command reports serial cable problems in node <filename>
web-node3</filename>.</para>
</sect2>
<sect2 id="LE32026-PARENT">
<title id="LE32026-TITLE">CDB Maintenance and Recovery</title>
<para>When the entire configuration database (CDB) must be reinitialized,
execute the following command:<indexterm id="ITrecover-15"><primary>CDB</primary>
<secondary>recovery</secondary></indexterm> <indexterm id="ITrecover-16">
<primary>CDB</primary><secondary>maintenance</secondary></indexterm></para>
<screen># <userinput>/usr/cluster/bin/cdbreinit /var/cluster/cdb/cdb.db</userinput></screen>
<para>This command will restart all cluster processes. The contents of the
configuration database will be automatically synchronized with other nodes
if other nodes in the pool are available.</para>
<para>Otherwise, the CDB will need to be restored from backup at this point.
For instructions on backing up and restoring the CDB, see <xref linkend="LE37674-PARENT">.
</para>
</sect2>
<sect2 id="LE19553-PARENT">
<title id="LE19553-TITLE">FailSafe Cluster Manager GUI and CLI Inconsistencies
</title>
<para>If the FailSafe Cluster Manager GUI is displaying information that is
inconsistent with the FailSafe <command>cluster_mgr</command> command, restart
the <command>cad</command> process on the node to which the Cluster Manager GUI is connected by executing
the following command:</para>
<screen># <userinput>killall cad</userinput></screen>
<para>The cluster administration daemon is restarted automatically by the <command>
cmond</command> process.</para>
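<para>As a quick check, assuming <command>cmond</command> has restarted the daemon,
a process listing should show a new <command>cad</command> process:</para>
<screen># <userinput>ps -ef | grep cad</userinput></screen>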
</sect2>
</sect1>
</chapter>
<?Pub *0000033543>