Glossary

action scripts

The set of scripts that determine how a resource is started, monitored, and stopped. There must be a set of action scripts specified for each resource type. The possible set of action scripts is: probe, exclusive, start, stop, monitor, and restart.
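
To make the roles of these scripts concrete, the following is a purely illustrative Python sketch (not part of Linux FailSafe, which uses separate scripts per resource type) that models the action-script set for one resource type as a mapping from action name to handler; the handler behavior shown is hypothetical.

    # Illustrative sketch only: models the action-script set for one resource
    # type as a mapping from action name to a handler. Names and behavior of
    # the handlers are hypothetical.

    ACTIONS = ("probe", "exclusive", "start", "stop", "monitor", "restart")

    def run_action(handlers, action, resource):
        """Invoke the handler for one of the defined actions on a resource."""
        if action not in ACTIONS:
            raise ValueError("unknown action: %s" % action)
        return handlers[action](resource)

    # Hypothetical handlers for a filesystem-like resource type.
    filesystem_handlers = {
        "probe":     lambda r: True,   # is the resource configured on this node?
        "exclusive": lambda r: True,   # verify the resource is not already running
        "start":     lambda r: True,   # e.g. mount the filesystem
        "stop":      lambda r: True,   # e.g. unmount the filesystem
        "monitor":   lambda r: True,   # is the resource healthy?
        "restart":   lambda r: True,   # stop followed by start
    }

    print(run_action(filesystem_handlers, "monitor", "/fs1"))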

cluster

A collection of one or more cluster nodes coupled to each other by networks or other similar interconnections. A cluster is identified by a simple name; this name must be unique within the pool. A particular node may be a member of only one cluster.

cluster administrator

The person responsible for managing and maintaining a Linux FailSafe cluster.

cluster configuration database

Contains configuration information about all resources, resource types, resource groups, failover policies, nodes, and clusters.

cluster node

A single Linux image. Usually, a cluster node is an individual computer. The term node is also used in this guide for brevity.

control messages

Messages that cluster software sends between the cluster nodes to request operations on or distribute information about cluster nodes and resource groups. Linux FailSafe sends control messages for the purpose of ensuring nodes and groups remain highly available. Control messages and heartbeat messages are sent through a node's network interfaces that have been attached to a control network. A node can be attached to multiple control networks.

A node's control networks should not be set to accept control messages if the node is not a dedicated Linux FailSafe node. Otherwise, end users who run non-Linux FailSafe jobs on the machine can have their jobs killed unexpectedly when Linux FailSafe resets the node.

control network

The network that connects nodes through their network interfaces (typically Ethernet) such that Linux FailSafe can maintain a cluster's high availability by sending heartbeat messages and control messages through the network to the attached nodes. Linux FailSafe uses the highest priority network interface on the control network; it uses a network interface with lower priority when all higher-priority network interfaces on the control network fail.

A node must have at least one control network interface for heartbeat messages and one for control messages (both heartbeat and control messages can be configured to use the same interface). A node can have no more than eight control network interfaces.

dependency list

See resource dependency list or resource type dependency list.

failover

The process of allocating a resource group to another node, according to a failover policy. A failover may be triggered by the failure of a resource, a change in the node membership (such as when a node fails or starts), or a manual request by the administrator.

failover attribute

A string that affects the allocation of a resource group in a cluster. The administrator must specify system-defined attributes (such as AutoFailback or ControlledFailback), and can optionally supply site-specific attributes.

failover domain

The ordered list of nodes on which a particular resource group can be allocated. The nodes listed in the failover domain must be within the same cluster; however, the failover domain does not have to include every node in the cluster. The administrator defines the initial failover domain when creating a failover policy. This list is transformed into the runtime failover domain by the failover script; the runtime failover domain is what is actually used to select the failover node. Linux FailSafe stores the runtime failover domain and uses it as input to the next failover script invocation. The initial and runtime failover domains may be identical, depending upon the contents of the failover script. In general, Linux FailSafe allocates a given resource group to the first node listed in the runtime failover domain that is also in the node membership; the point at which this allocation takes place is affected by the failover attributes.
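
As an illustration of the allocation rule described above, the following Python sketch (hypothetical, not FailSafe code) picks the first node of a runtime failover domain that is also in the current node membership; the node names are invented for the example.

    # Illustrative sketch of the allocation rule: a resource group goes to the
    # first node in the runtime failover domain that is also in the current
    # node membership. Names and data are hypothetical.

    def select_node(runtime_domain, membership):
        for node in runtime_domain:      # ordered list produced by the failover script
            if node in membership:       # node must currently be in the node membership
                return node
        return None                      # no eligible node: the group cannot be placed

    runtime_domain = ["node-a", "node-b", "node-c"]
    membership = {"node-b", "node-c"}
    print(select_node(runtime_domain, membership))   # -> "node-b"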

failover policy

The method used by Linux FailSafe to determine the destination node of a failover. A failover policy consists of a failover domain, failover attributes, and a failover script. A failover policy name must be unique within the pool.

failover script

A failover policy component that generates a runtime failover domain and returns it to the Linux FailSafe process. The Linux FailSafe process applies the failover attributes and then selects the first node in the returned failover domain that is also in the current node membership.

heartbeat messages

Messages that cluster software sends between the nodes that indicate a node is up and running. Heartbeat messages and control messages are sent through a node's network interfaces that have been attached to a control network. A node can be attached to multiple control networks.

heartbeat interval

Interval between heartbeat messages. The node timeout value must be at least 10 times the heartbeat interval for proper Linux FailSafe operation (otherwise false failovers may be triggered). The higher the number of heartbeats (smaller heartbeat interval), the greater the potential for slowing down the network. Conversely, the fewer the number of heartbeats (larger heartbeat interval), the greater the potential for reducing availability of resources.
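
For example, the guideline above implies that a 1-second heartbeat interval calls for a node timeout of at least 10 seconds. A hypothetical check of the relationship, with invented values, might look like this:

    # Illustrative check of the guideline that the node timeout should be at
    # least 10 times the heartbeat interval. The values are hypothetical.

    HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeat messages
    NODE_TIMEOUT = 10.0        # seconds without a heartbeat before a node is presumed dead

    if NODE_TIMEOUT < 10 * HEARTBEAT_INTERVAL:
        raise ValueError("node timeout should be at least 10x the heartbeat "
                         "interval to avoid false failovers")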

initial failover domain

The ordered list of nodes, defined by the administrator when a failover policy is first created, that is used the first time a cluster is booted. The ordered list specified by the initial failover domain is transformed into a runtime failover domain by the failover script; the runtime failover domain is used along with failover attributes to determine the node on which a resource group should reside. With each failure, the failover script takes the current runtime failover domain and potentially modifies it; the initial failover domain is never used again. Depending on the runtime conditions and contents of the failover script, the initial and runtime failover domains may be identical. See also runtime failover domain.

key/value attribute

A set of information that must be defined for a particular resource type. For example, for the resource type filesystem, one key/value pair might be mount_point=/fs1, where mount_point is the key and /fs1 is the value specific to the particular resource being defined. Depending on the value, you specify either a string or integer data type. In the previous example, you would specify string as the data type for the value /fs1.

log configuration

A log configuration has two parts: a log level and a log file, both associated with a log group. The cluster administrator can customize the location and amount of log output, and can specify a log configuration for all nodes or for only one node. For example, the crsd log group can be configured to log detailed level-10 messages to the /var/log/failsafe/crsd-foo log only on the node foo, and to write only minimal level-1 messages to the crsd log on all other nodes.

log file

A file containing Linux FailSafe notifications for a particular log group. A log file is part of the log configuration for a log group. By default, log files reside in the /var/log/failsafe directory, but the cluster administrator can customize this. Note: Linux FailSafe logs both normal operations and critical errors to /var/log/messages, as well as to individual logs for specific log groups.

log group

A set of one or more Linux FailSafe processes that use the same log configuration. A log group usually corresponds to one Linux FailSafe daemon, such as gcd.

log level

A number that controls how many log messages Linux FailSafe writes to the log file of the associated log group. A log level is part of the log configuration for a log group.

node

See cluster node.

node ID

A 16-bit positive integer that uniquely defines a cluster node. During node definition, Linux FailSafe will assign a node ID if one has not been assigned by the cluster administrator. Once assigned, the node ID cannot be modified.

node membership

The list of nodes in a cluster on which Linux FailSafe can allocate resource groups.

node timeout

If no heartbeat is received from a node in this period of time, the node is considered to be dead. The node timeout value must be at least 10 times the heartbeat interval for proper Linux FailSafe operation (otherwise false failovers may be triggered).

notification command

The command used to notify the cluster administrator of changes or failures in the cluster, nodes, and resource groups. The command must exist on every node in the cluster.

offline resource group

A resource group that is not highly available in the cluster. To put a resource group in offline state, Linux FailSafe stops the group (if needed) and stops monitoring the group. An offline resource group can be running on a node, yet not under Linux FailSafe control. If the cluster administrator specifies the detach only option while taking the group offline, then Linux FailSafe will not stop the group but will stop monitoring the group.

online resource group

A resource group that is highly available in the cluster. When Linux FailSafe detects a failure that degrades the resource group availability, it moves the resource group to another node in the cluster. To put a resource group in online state, Linux FailSafe starts the group (if needed) and begins monitoring the group. If the cluster administrator specifies the attach only option while bringing the group online, then Linux FailSafe will not start the group but will begin monitoring the group.

owner host

A system that can control a Linux FailSafe node remotely (for example, by power-cycling the node). Serial cables must physically connect the two systems through the node's system controller port. At run time, the owner host must be defined as a node in the Linux FailSafe pool.

owner TTY name

The device file name of the terminal port (TTY) on the owner host to which the system controller serial cable is connected. The other end of the cable connects to the Linux FailSafe node with the system controller port, so the node can be controlled remotely by the owner host.

pool

The entire set of nodes involved with a group of clusters. The clusters in a pool are usually close together and should always serve a common purpose. A replicated database is stored on each node in the pool.

port password

The password for the system controller port, usually set once in firmware or by setting jumper wires. (This is not the same as the node's root password.)

powerfail mode

When powerfail mode is turned on, Linux FailSafe tracks the response from a node's system controller as it makes reset requests to a cluster node. When these requests fail to reset the node successfully, Linux FailSafe uses heuristics to try to estimate whether the machine has been powered down. If the heuristic algorithm returns with success, Linux FailSafe assumes the remote machine has been reset successfully. When powerfail mode is turned off, the heuristics are not used and Linux FailSafe may not be able to detect node power failures.

process membership

A list of process instances in a cluster that form a process group. There can be one or more processes per node.

resource

A single physical or logical entity that provides a service to clients or other resources. For example, a resource can be a single disk volume, a particular network address, or an application such as a web server. A resource is generally available for use over time on two or more nodes in a cluster, although it can be allocated to only one node at any given time. Resources are identified by a resource name and a resource type. Dependent resources must be part of the same resource group and are identified in a resource dependency list.

resource dependency

The condition in which a resource requires the existence of other resources.

resource group

A collection of resources. A resource group is identified by a simple name; this name must be unique within a cluster. Resource groups cannot overlap; that is, two resource groups cannot contain the same resource. All interdependent resources must be part of the same resource group. If any individual resource in a resource group becomes unavailable for its intended use, then the entire resource group is considered unavailable. Therefore, a resource group is the unit of failover for Linux FailSafe.

resource keys

Variables that define a resource of a given resource type. The action scripts use this information to start, stop, and monitor a resource of this resource type.

resource name

The simple name that identifies a specific instance of a resource type. A resource name must be unique within a cluster.

resource type

A particular class of resource. All of the resources in a particular resource type can be handled in the same way for the purposes of failover. Every resource is an instance of exactly one resource type. A resource type is identified by a simple name; this name must be unique within a cluster. A resource type can be defined for a specific node or for an entire cluster. A resource type that is defined for a node overrides a cluster-wide resource type definition with the same name; this allows an individual node to override global settings from a cluster-wide resource type definition. This override behavior is sketched below.
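
The override rule can be pictured with a small, purely illustrative Python lookup sketch; the dictionaries and the monitor_level attribute are hypothetical and not part of the Linux FailSafe configuration database.

    # Illustrative sketch of the override rule: a node-specific resource type
    # definition takes precedence over the cluster-wide definition of the same
    # name. The data and attribute names are hypothetical.

    cluster_wide = {"filesystem": {"monitor_level": 1}}
    node_specific = {"node-a": {"filesystem": {"monitor_level": 2}}}

    def resource_type_definition(name, node):
        per_node = node_specific.get(node, {})
        return per_node.get(name, cluster_wide.get(name))

    print(resource_type_definition("filesystem", "node-a"))  # node override  -> level 2
    print(resource_type_definition("filesystem", "node-b"))  # cluster-wide   -> level 1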

resource type dependency

A set of resource types upon which a resource type depends. For example, the filesystem resource type depends upon the volume resource type, and the Netscape_web resource type depends upon the filesystem and IP_address resource types.

runtime failover domain

The ordered set of nodes on which the resource group can execute upon failures, as modified by the failover script. The runtime failover domain is used along with failover attributes to determine the node on which a resource group should reside. See also initial failover domain.

start/stop order

Each resource type has a start/stop order, which is a non-negative integer. In a resource group, the start/stop orders of the resource types determine the order in which the resources will be started when Linux FailSafe brings the group online and will be stopped when Linux FailSafe takes the group offline. The group's resources are started in increasing order, and stopped in decreasing order; resources of the same type are started and stopped in indeterminate order. For example, if resource type volume has order 10 and resource type filesystem has order 20, then when Linux FailSafe brings a resource group online, all volume resources in the group will be started before all filesystem resources in the group.
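
A hypothetical Python sketch of this ordering rule, using the example orders above and invented resource names:

    # Illustrative sketch of start/stop ordering: resources start in increasing
    # order of their resource type's start/stop order and stop in decreasing
    # order. The orders and resource names are hypothetical.

    type_order = {"volume": 10, "filesystem": 20, "IP_address": 30}
    group = [("fs1", "filesystem"), ("vol1", "volume"), ("ip1", "IP_address")]

    start_sequence = sorted(group, key=lambda r: type_order[r[1]])
    stop_sequence = sorted(group, key=lambda r: type_order[r[1]], reverse=True)

    print([name for name, _ in start_sequence])  # ['vol1', 'fs1', 'ip1']
    print([name for name, _ in stop_sequence])   # ['ip1', 'fs1', 'vol1']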

system controller port

A port located on a node that provides a way to power-cycle the node remotely. Enabling or disabling a system controller port in the cluster configuration database (CDB) tells Linux FailSafe whether it can perform operations on the system controller port. (When the port is enabled, serial cables must attach the port to another node, the owner host.) System controller port information is optional for a node in the pool, but is required if the node will be added to a cluster; otherwise, resources running on that node will never be highly available.

tie-breaker node

A node identified as a tie-breaker for Linux FailSafe to use in the process of computing node membership for the cluster, when exactly half the nodes in the cluster are up and can communicate with each other. If a tie-breaker node is not specified, Linux FailSafe will use the node with the lowest node ID in the cluster as the tie-breaker node.

type-specific attribute

Required information used to define a resource of a particular resource type. For example, for a resource of type filesystem, you must enter attributes for the resource's volume name (where the filesystem is located) and specify options for how to mount the filesystem (for example, as readable and writable).