To set up a Linux FailSafe system, you configure the cluster that will support the highly available services. This requires the following steps:
Defining the local host
Defining any additional nodes that are eligible to be included in the cluster
Defining the cluster
The following subsections describe these tasks.
A cluster node is a single Linux image; usually, a cluster node is an individual computer. For brevity, this guide also uses the shorter term node.
The pool is the entire set of nodes available for clustering.
The first node you define must be the local host, which is the host you have logged into to perform cluster administration.
When you are defining multiple nodes, it is advisable to wait for a minute or so between each node definition. When nodes are added to the configuration database, the contents of the configuration database are also copied to the node being added. The node definition operation is completed when the new node configuration is added to the database, at which point the database configuration is synchronized. If you define two nodes one after another, the second operation might fail because the first database synchronization is not complete.
To add a logical node definition to the pool of nodes that are eligible to be included in a cluster, you must provide the following information about the node:
Logical name: This name can contain letters and numbers but not spaces or pound signs. The name must be composed of no more than 255 characters. Any legal hostname is also a legal node name. For example, for a node whose hostname is “venus.eng.company.com” you can use a node name of “venus”, “node1”, or whatever is most convenient.
Hostname: The fully qualified name of the host, such as “server1.company.com”. Hostnames cannot begin with an underscore, include any whitespace, or be longer than 255 characters. This hostname should be the same as the output of the hostname command on the node you are defining. The IP address associated with this hostname should not be the same as any IP address you define as highly available when you define a Linux FailSafe IP address resource. Linux FailSafe will not accept an IP address (such as “192.0.2.22”) for this input.
Node ID: This number must be unique for each node in the pool and be in the range 1 through 32767.
System controller information: If the node has a system controller and you want Linux FailSafe to use the controller to reset the node, you must provide the following information about the system controller:
Type of system controller: chall, msc, or mmsc
System controller port password (optional)
Administrative status, which you can set to determine whether Linux FailSafe can use the port: enabled, disabled
Logical node name of system controller owner (i.e. the system that is physically attached to the system controller)
Device name of port on owner node that is attached to the system controller
Type of owner device: tty
A list of control networks, which are the networks used for heartbeats, reset messages, and other Linux FailSafe messages. For each network, provide the following:
Hostname or IP address. This address must not be the same as any IP address you define as highly available when you define a Linux FailSafe IP address resource, and it must be resolved in the /etc/hosts file.
Flags (hb for heartbeats, ctrl for control messages, priority). At least two control networks must use heartbeats, and at least one must use control messages.
Linux FailSafe requires multiple heartbeat networks. Usually a node sends heartbeat messages to another node on only one network at a time. However, a node might sometimes send heartbeat messages to another node on multiple networks simultaneously; this happens when the sender node does not know which networks are up and which are down. This is a transient state, and the heartbeat traffic eventually converges on the highest-priority network that is up.
Note that at any time different pairs of nodes might be using different networks for heartbeats.
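The convergence rule above can be sketched as follows. This is an illustrative sketch, not FailSafe code; it assumes, as the examples later in this chapter suggest, that a lower priority number means a more preferred network.

```python
# Sketch of heartbeat-network selection between a pair of nodes.
# Assumption (suggested by the examples in this chapter): a lower
# priority number means a more preferred network.

def heartbeat_network(networks, up):
    """networks: {name: priority}; up: set of network names currently up.

    While the sender does not know which networks are up, it may send on
    several of them; once link state is known, heartbeats converge on the
    highest-priority network that is up.
    """
    candidates = [name for name in networks if name in up]
    if not candidates:
        return None                      # no usable network between this pair
    return min(candidates, key=lambda name: networks[name])

nets = {"cm1-net": 0, "cm2-net": 1}
print(heartbeat_network(nets, {"cm1-net", "cm2-net"}))   # cm1-net
print(heartbeat_network(nets, {"cm2-net"}))              # cm2-net
```

Note that different pairs of nodes may arrive at different answers, since each pair converges on the best network that is up between them.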
Although all nodes in the Linux FailSafe cluster should have two control networks, it is possible to define a node to add to the pool with one control network.
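The naming and numbering rules listed above can be sketched as a small validation routine. The helper function and its messages are hypothetical, written only to restate the documented constraints; they are not part of Linux FailSafe.

```python
import re

# Hypothetical validator for the node-definition rules described above.
# It restates the documented constraints; it is not part of Linux FailSafe.

def validate_node(logical_name, hostname, node_id):
    """Return a list of rule violations; an empty list means the inputs pass."""
    errors = []
    # Logical name: letters and numbers, no spaces or pound signs, <= 255 chars.
    if not re.fullmatch(r"[^\s#]{1,255}", logical_name):
        errors.append("logical name must be 1-255 characters with no spaces or '#'")
    # Hostname: no leading underscore, no whitespace, <= 255 chars,
    # and not a raw IP address such as "192.0.2.22".
    if (hostname.startswith("_") or re.search(r"\s", hostname)
            or len(hostname) > 255
            or re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", hostname)):
        errors.append("hostname is malformed or is a raw IP address")
    # Node ID: must be unique per pool (not checked here) and in 1-32767.
    if not 1 <= node_id <= 32767:
        errors.append("node ID must be in the range 1 through 32767")
    return errors

print(validate_node("venus", "venus.eng.company.com", 1))   # []
print(validate_node("bad name", "192.0.2.22", 0))           # three violations
```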
To define a node with the Cluster Manager GUI, perform the following steps:
Launch the FailSafe Manager.
On the left side of the display, click on the “Nodes & Cluster” category.
On the right side of the display click on the “Define a Node” task link to launch the task.
Enter the selected inputs on this screen. Click on “Next” at the bottom of the screen and continue inputting information on the second screen.
Click on “OK” at the bottom of the screen to complete the task, or click on “Cancel” to cancel.
Use the following command to add a logical node definition:
cmgr> define node A
Entering this command specifies the name of the node you are defining and puts you in a mode that enables you to define the parameters of the node. These parameters correspond to the items defined in Section 5.4.1. The following prompts appear:
Enter commands, when finished enter either "done" or "cancel"
A?
When this node-name prompt appears, enter the node parameters in the following format:
set hostname to B
set nodeid to C
set sysctrl_type to D
set sysctrl_password to E
set sysctrl_status to F
set sysctrl_owner to G
set sysctrl_device to H
set sysctrl_owner_type to I
add nic J
You use the add nic J command to define the network interfaces; use this command once for each network interface you want to define. When you enter this command, the following prompt appears:
Enter network interface commands, when finished enter "done" or "cancel"

NIC - J?
When this prompt appears, you use the following commands to specify the flags for the control network:
set heartbeat to K
set ctrl_msgs to L
set priority to M
After you have defined a network interface, you can use the following command from the node-name prompt to remove it:
cmgr> remove nic N
When you have finished defining a node, enter done.
The following example defines a node called cm1a, with one controller:
cmgr> define node cm1a
Enter commands, when finished enter either "done" or "cancel"

cm1a? set hostname to cm1a
cm1a? set nodeid to 1
cm1a? set sysctrl_type to msc
cm1a? set sysctrl_password to [ ]
cm1a? set sysctrl_status to enabled
cm1a? set sysctrl_owner to cm2
cm1a? set sysctrl_device to /dev/ttyd2
cm1a? set sysctrl_owner_type to tty
cm1a? add nic cm1
Enter network interface commands, when finished enter "done" or "cancel"

NIC - cm1 > set heartbeat to true
NIC - cm1 > set ctrl_msgs to true
NIC - cm1 > set priority to 0
NIC - cm1 > done
cm1a? done
cmgr>
If you have invoked the Cluster Manager CLI with the -p option, or you have entered the set prompting on command, the display appears as in the following example:
cmgr> define node cm1a
Enter commands, when finished enter either "done" or "cancel"

Nodename [optional]? cm1a
Node ID? 1
Do you wish to define system controller info[y/n]:y
Sysctrl Type <null>? (null)
Sysctrl Password[optional]? ( )
Sysctrl Status <enabled|disabled>? enabled
Sysctrl Owner? cm2
Sysctrl Device? /dev/ttyd2
Sysctrl Owner Type <tty>? (tty)
Number of Network Interfaces ? (1)
NIC 1 - IP Address? cm1
NIC 1 - Heartbeat HB (use network for heartbeats) <true|false>? true
NIC 1 - Priority <1,2,...>? 0
NIC 2 - IP Address? cm2
NIC 2 - Heartbeat HB (use network for heartbeats) <true|false>? true
NIC 2 - (use network for control messages) <true|false>? false
NIC 2 - Priority <1,2,...>? 1
After you have defined a cluster node, you can modify or delete the node with the Cluster Manager GUI or the Cluster Manager CLI. You must remove a node from a cluster before you can delete the node.
To modify a node with the Cluster Manager GUI, perform the following steps:
Launch the FailSafe Manager.
On the left side of the display, click on the “Nodes & Cluster” category.
On the right side of the display click on the “Modify a Node Definition” task link to launch the task.
Modify the node parameters.
Click on “OK” at the bottom of the screen to complete the task, or click on “Cancel” to cancel.
You can use the following command to modify an existing node. After entering this command, you can execute any of the commands you use to define a node.
cmgr> modify node A
To delete a node with the Cluster Manager GUI, perform the following steps:
Launch the FailSafe Manager.
On the left side of the display, click on the “Nodes & Cluster” category.
On the right side of the display click on the “Delete a Node” task link to launch the task.
Enter the name of the node to delete.
Click on “OK” at the bottom of the screen to complete the task, or click on “Cancel” to cancel.
After defining a node, you can delete it with the following command:
cmgr> delete node A
You can delete a node only if the node is not currently part of a cluster. You must first modify the cluster that contains the node so that it no longer includes that node; then you can delete the node.
After you define cluster nodes, you can perform the following display tasks:
display the attributes of a node
display the nodes that are members of a specific cluster
display all the nodes that have been defined
You can perform any of these tasks with the FailSafe Cluster Manager GUI or the Linux FailSafe Cluster Manager CLI.
The Cluster Manager GUI provides a convenient graphic display of the defined nodes of a cluster and the attributes of those nodes through the FailSafe Cluster View. You can launch the FailSafe Cluster View directly, or you can bring it up at any time by clicking on “FailSafe Cluster View” at the bottom of the “FailSafe Manager” display.
From the View menu of the FailSafe Cluster View, you can select “Nodes in Pool” to view all nodes defined in the Linux FailSafe pool. You can also select “Nodes In Cluster” to view all nodes that belong to the default cluster. Click any node's name or icon to view detailed status and configuration information about the node.
After you have defined a node, you can display the node's parameters with the following command:
cmgr> show node A
A show node command on node cm1 would yield the following display:
cmgr> show node cm1

Logical Node Name: cm1
Hostname: cm1
Nodeid: 1
Reset type: reset
System Controller: msc
System Controller status: enabled
System Controller owner: cm2
System Controller owner device: /dev/ttyd2
System Controller owner type: tty
ControlNet Ipaddr: cm1
ControlNet HB: true
ControlNet Control: true
ControlNet Priority: 0
You can see a list of all of the nodes that have been defined with the following command:
cmgr> show nodes in pool
You can see a list of all of the nodes that have been defined for a specified cluster with the following command:
cmgr> show nodes [in cluster A]
If you have specified a default cluster, you do not need to specify a cluster when you use this command; it displays the nodes defined in the default cluster.
There are several parameters that determine the behavior of the nodes in a cluster of a Linux FailSafe system.
The Linux FailSafe parameters are as follows:
The tie-breaker node, which is the logical name of a machine used to compute node membership in situations where exactly 50% of the nodes in a cluster can talk to each other. If you do not specify a tie-breaker node, the node with the lowest node ID number is used.
The tie-breaker node is a cluster-wide parameter.
It is recommended that you configure a tie-breaker node even if there is an odd number of nodes in the cluster, since one node may be deactivated, leaving an even number of nodes to determine membership.
In a heterogeneous cluster, where the nodes are of different sizes and capabilities, the largest node in the cluster with the most important application or the maximum number of resource groups should be configured as the tie-breaker node.
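The tie-breaker rule described above can be sketched as follows. The helper is hypothetical, written only to restate the documented rule; it is not FailSafe code.

```python
# Sketch of the tie-breaker rule described above: when exactly half the
# nodes can talk to each other, the configured tie-breaker node decides
# membership; if none is configured, the node with the lowest node ID is used.

def tie_breaker(node_ids, configured=None):
    """node_ids: {logical_name: node_id}; configured: optional logical name."""
    if configured is not None and configured in node_ids:
        return configured
    return min(node_ids, key=lambda name: node_ids[name])

pool = {"cm1": 1, "cm2": 2, "cm3": 3, "cm4": 4}
print(tie_breaker(pool, configured="cm3"))  # cm3
print(tie_breaker(pool))                    # cm1 (lowest node ID)
```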
Node timeout, which is the timeout period, in milliseconds. If no heartbeat is received from a node in this period of time, the node is considered to be dead and is not considered part of the cluster membership.
The node timeout must be at least 5 seconds. In addition, the node timeout must be at least 10 times the heartbeat interval for proper Linux FailSafe operation; otherwise, false failovers may be triggered.
Node timeout is a cluster-wide parameter.
The heartbeat interval, which is the time, in milliseconds, between heartbeat messages. This interval must be greater than 500 milliseconds, and it must not be greater than one-tenth the value of the node timeout period. By default, this interval is set to one second. Heartbeat interval is a cluster-wide parameter.
The higher the number of heartbeats (smaller heartbeat interval), the greater the potential for slowing down the network. Conversely, the fewer the number of heartbeats (larger heartbeat interval), the greater the potential for reducing availability of resources.
The node wait time, in milliseconds, which is the time a node waits for other nodes to join the cluster before declaring a new cluster membership. If the value is not set for the cluster, Linux FailSafe assumes the value to be the node timeout times the number of nodes.
The powerfail mode, which indicates whether a special power failure algorithm should be run when no response is received from a system controller after a reset request. This can be set to ON or OFF. Powerfail is a node-specific parameter, and should be defined for the machine that performs the reset operation.
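The timer constraints above can be summarized in a small check. All values are in milliseconds, and the helper names are illustrative, not part of Linux FailSafe.

```python
# Sketch of the documented timer rules (all values in milliseconds).
# These checks restate the constraints above; they are not FailSafe code.

def check_timers(node_timeout, heartbeat_interval):
    errors = []
    if node_timeout < 5000:
        errors.append("node timeout must be at least 5 seconds")
    if heartbeat_interval <= 500:
        errors.append("heartbeat interval must be greater than 500 ms")
    if node_timeout < 10 * heartbeat_interval:
        errors.append("node timeout must be at least 10x the heartbeat interval")
    return errors

def default_node_wait(node_timeout, num_nodes):
    # If node wait time is not set, Linux FailSafe assumes
    # node timeout times the number of nodes.
    return node_timeout * num_nodes

print(check_timers(10000, 1000))    # []
print(check_timers(4000, 600))      # two violations
print(default_node_wait(10000, 4))  # 40000
```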
To set Linux FailSafe parameters with the Cluster Manager GUI, perform the following steps:
Launch the FailSafe Manager from a menu or the command line.
On the left side of the display, click on the “Nodes & Cluster” category.
On the right side of the display click on the “Set Linux FailSafe HA Parameters” task link to launch the task.
Enter the selected inputs.
Click on “OK” at the bottom of the screen to complete the task, or click on “Cancel” to cancel.
You can modify the Linux FailSafe parameters with the following command:
cmgr> modify ha_parameters [on node A] [in cluster B]
If you have specified a default node or a default cluster, you do not have to specify a node or a cluster in this command. Linux FailSafe will use the default.
Enter commands, when finished enter either "done" or "cancel"
A?
When this node-name prompt appears, enter the Linux FailSafe parameters you wish to modify in the following format:
set node_timeout to A
set heartbeat to B
set run_pwrfail to C
set tie_breaker to D
A cluster is a collection of one or more nodes coupled with each other by networks or other similar interconnects. In Linux FailSafe, a cluster is identified by a simple name. A given node may be a member of only one cluster.
To define a cluster, you must provide the following information:
The logical name of the cluster, with a maximum length of 255 characters.
The mode of operation: normal (the default) or experimental. Experimental mode allows you to configure a Linux FailSafe cluster in which resource groups do not fail over when a node failure is detected. This mode can be useful when you are tuning node timeouts or heartbeat values. When a cluster is configured in normal mode, Linux FailSafe fails over resource groups when it detects failure in a node or resource group.
(Optional) The email address to use to notify the system administrator when problems occur in the cluster (for example, root@system).
(Optional) The email program to use to notify the system administrator when problems occur in the cluster (for example, /usr/bin/mail).
Specifying the email program is optional; you can specify only the notification address in order to receive notifications by mail. If an address is not specified, notification is not sent.
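The notification rules can be sketched as follows. This is an illustrative sketch; the fallback to /usr/bin/mail is an assumption based on the example program named above, and the function itself is hypothetical.

```python
# Sketch of the notification rules described above: the address is what
# enables notification; the mail program is optional. The fallback to
# /usr/bin/mail is an assumption for illustration, not documented behavior.

def notification_plan(notify_addr=None, notify_cmd=None):
    """Return (program, address) if notification would be sent, else None."""
    if not notify_addr:
        return None                      # no address: notification is not sent
    return (notify_cmd or "/usr/bin/mail", notify_addr)

print(notification_plan())                                    # None
print(notification_plan("root@system"))                       # mail to root@system
print(notification_plan("root@system", "/usr/sbin/sendmail"))
```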
After you have added nodes to the pool and defined a cluster, you must provide the names of the nodes to include in the cluster.
To define a cluster with the Cluster Manager GUI, perform the following steps:
Launch the Linux FailSafe Manager.
On the left side of the display, click on “Guided Configuration”.
On the right side of the display click on “Set Up a New Cluster” to launch the task link.
In the resulting window, click each task link in turn, as it becomes available. Enter the selected inputs for each task.
When finished, click “OK” to close the taskset window.
When you define a cluster with the CLI, you define the cluster and add nodes to it with the same command.
Use the following Cluster Manager CLI command to define a cluster:
cmgr> define cluster A
Entering this command specifies the name of the cluster you are defining and puts you in a mode that allows you to add nodes to the cluster. The following prompt appears:
cluster A?
When this prompt appears during cluster creation, you can specify nodes to include in the cluster and you can specify an email address to direct messages that originate in this cluster.
You specify nodes to include in the cluster with the following command:
cluster A? add node C
cluster A?
You can add as many nodes as you want to include in the cluster.
You specify an email program to use to direct messages with the following command:
cluster A? set notify_cmd to B
cluster A?
You specify an email address to direct messages with the following command:
cluster A? set notify_addr to B
cluster A?
You specify a mode for the cluster (normal or experimental) with the following command:
cluster A? set ha_mode to D
cluster A?
When you are finished defining the cluster, enter done to return to the cmgr prompt.
After you have defined a cluster, you can modify the attributes of the cluster or you can delete the cluster. You cannot delete a cluster that contains nodes; you must move those nodes out of the cluster first.
To modify a cluster with the Cluster Manager GUI, perform the following procedure:
Launch the Linux FailSafe Manager.
On the left side of the display, click on the “Nodes & Cluster” category.
On the right side of the display click on the “Modify a Cluster Definition” task link to launch the task.
Enter the selected inputs.
Click on “OK” at the bottom of the screen to complete the task, or click on “Cancel” to cancel.
To delete a cluster with the Cluster Manager GUI, perform the following procedure:
Launch the Linux FailSafe Manager.
On the left side of the display, click on the “Nodes & Cluster” category.
On the right side of the display click on the “Delete a Cluster” task link to launch the task.
Enter the selected inputs.
Click on “OK” at the bottom of the screen to complete the task, or click on “Cancel” to cancel.
To modify an existing cluster, enter the following command:
cmgr> modify cluster A
Entering this command specifies the name of the cluster you are modifying and puts you in a mode that allows you to modify the cluster. The following prompt appears:
cluster A?
When this prompt appears, you can modify the cluster definition with the following commands:
cluster A? set notify_addr to B
cluster A? set notify_cmd to B
cluster A? add node C
cluster A? remove node D
cluster A?
When you are finished modifying the cluster, enter done to return to the cmgr prompt.
You can delete a defined cluster with the following command:
cmgr> delete cluster A
You can display defined clusters with the Cluster Manager GUI or the Cluster Manager CLI.
The Cluster Manager GUI provides a convenient display of a cluster and its components through the FailSafe Cluster View. You can launch the FailSafe Cluster View directly, or you can bring it up at any time by clicking on the “FailSafe Cluster View” prompt at the bottom of the “FailSafe Manager” display.
From the View menu of the FailSafe Cluster View, you can choose elements within the cluster to examine. To view details of the cluster, click on the cluster name or icon. Status and configuration information will appear in a new window. To view this information within the FailSafe Cluster View window, select Options. When you then click on the Show Details option, the status details will appear in the right side of the window.
After you have defined a cluster, you can display the nodes in that cluster with the following command:
cmgr> show cluster A
You can see a list of the clusters that have been defined with the following command:
cmgr> show clusters