When you examine the actions taken by a FailSafe system after a failure to determine what went wrong and how processes were transferred, it is important to consider the concept of node membership. When failover occurs, the runtime failover domain can include only those nodes that are in the cluster membership.
Nodes can enter the cluster membership only when they are not disabled and they are in a known state. This ensures that data integrity is maintained, because only nodes within the cluster membership can access the shared storage. If nodes outside the membership and not controlled by FailSafe were able to access the shared storage, two nodes might try to access the same data at the same time, a situation that would result in data corruption. For this reason, disabled nodes do not participate in the membership computation. Note that no attempt is made to reset nodes that are configured as disabled before confirming the cluster membership.
Node membership in a cluster is based on a majority quorum. For a cluster to be enabled, more than 50% of the nodes in the cluster must be in a known state and able to communicate with each other over the heartbeat control networks. This quorum determines which nodes are part of the cluster membership that is formed.
If there is an even number of nodes in the cluster, it is possible that there will be no majority quorum: there could be two sets of nodes, each consisting of 50% of the total number of nodes, unable to communicate with the other set. In this case, FailSafe uses the node that you configured as the tie-breaker node when you set your FailSafe parameters. If no tie-breaker node was configured, FailSafe uses the enabled node with the lowest node ID.
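To make the quorum and tie-break rules concrete, the following Python sketch computes whether a set of mutually reachable nodes could form a membership. The node records, IDs, and the has_quorum helper are illustrative assumptions; FailSafe performs this computation internally and does not expose such an interface.

    # Illustrative sketch of the majority-quorum and tie-breaker rules described
    # above. The node records and this helper are hypothetical examples; FailSafe
    # performs the equivalent computation internally.

    def has_quorum(reachable_nodes, all_enabled_nodes, tiebreaker_id=None):
        """Return True if the mutually reachable set can form a membership."""
        total = len(all_enabled_nodes)
        count = len(reachable_nodes)

        if count * 2 > total:
            # More than 50% of the enabled nodes are in a known state and can
            # communicate with each other: a majority quorum exists.
            return True

        if count * 2 == total:
            # Exact 50/50 split: the tie-breaker node decides which half wins.
            # If no tie-breaker node was configured, the enabled node with the
            # lowest node ID is used instead.
            if tiebreaker_id is None:
                tiebreaker_id = min(n["id"] for n in all_enabled_nodes)
            return any(n["id"] == tiebreaker_id for n in reachable_nodes)

        # Fewer than half of the enabled nodes: no membership can be formed.
        return False

    # Example: a four-node cluster split into two halves of two nodes each.
    nodes = [{"id": i, "name": "node%d" % i} for i in range(1, 5)]
    half = nodes[:2]                   # node1 and node2 can see only each other
    print(has_quorum(half, nodes))     # True: node1 holds the lowest node ID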
For information on setting tie-breaker nodes, see Section 5.4.4.
The nodes in a quorum attempt to reset the nodes that are not in the quorum. Nodes that can be reset are declared DOWN in the membership; nodes that cannot be reset are declared UNKNOWN. Nodes in the quorum are UP.
If a new majority quorum is computed, a new membership is declared whether any node could be reset or not.
If at least one node in the current quorum has a current membership, the nodes will proceed to declare a new membership if they can reset at least one node.
If all nodes in the new tied quorum are coming up for the first time, they will try to reset and proceed with a new membership only if the quorum includes the tie-breaker node.
If a tied subset of nodes in the cluster had no previous membership, the subset that contains the tie-breaker node attempts to reset nodes in the other subset. If at least one node reset succeeds, a new membership is confirmed.
If a tied subset of nodes in the cluster had a previous membership, the nodes in one subset attempt to reset nodes in the other subset. If at least one node reset succeeds, a new membership is confirmed. The subset that contains the tie-breaker node attempts its resets immediately; the other subset attempts its resets only after a delay.
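The reset and state-classification rules above can be summarized in a short sketch. The classify_membership helper, the node names, and the stubbed reset callback are assumptions made for illustration; the UP, DOWN, and UNKNOWN values correspond to the membership states described in the text.

    # Illustrative sketch of how nodes outside the quorum are classified after
    # a reset attempt. The reset_node callback and node names are hypothetical.

    def classify_membership(quorum_nodes, other_nodes, reset_node):
        """Return a mapping of node name to membership state."""
        states = {n: "UP" for n in quorum_nodes}     # nodes in the quorum are UP
        for node in other_nodes:
            # A node that can be reset is DOWN; one that cannot is UNKNOWN.
            states[node] = "DOWN" if reset_node(node) else "UNKNOWN"
        return states

    # Example with a stubbed reset that succeeds only for node3.
    print(classify_membership(["node1", "node2"], ["node3", "node4"],
                              reset_node=lambda n: n == "node3"))
    # {'node1': 'UP', 'node2': 'UP', 'node3': 'DOWN', 'node4': 'UNKNOWN'}

In a 50/50 tie, only the subset containing the tie-breaker node attempts its resets right away; the delay on the other side keeps both halves from resetting each other at the same time.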
Resets are performed through system controllers connected to tty ports over serial lines. Periodic serial line monitoring never stops. If the estimated serial line monitoring failure interval and the estimated heartbeat loss interval overlap, a power failure is suspected at the node being reset.
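The power-failure heuristic amounts to a simple overlap test on two estimated time intervals, as in the sketch below; the interval endpoints and the intervals_overlap helper are hypothetical, not FailSafe parameters.

    # Illustrative interval-overlap check for the power-failure heuristic above.
    # The (start, end) pairs are hypothetical estimates, not FailSafe parameters.

    def intervals_overlap(serial_failure, heartbeat_loss):
        """True if the two estimated failure intervals overlap in time."""
        s_start, s_end = serial_failure
        h_start, h_end = heartbeat_loss
        return s_start <= h_end and h_start <= s_end

    # If the serial line monitoring failure and the heartbeat loss are estimated
    # to have occurred at about the same time, a power failure is suspected.
    print(intervals_overlap((100.0, 105.0), (103.0, 108.0)))   # True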
When no cluster membership is formed, you should check the following areas for possible problems:
Is the cluster membership daemon, ha_cmsd, running? Is the database daemon, cdbd, running?
Can the nodes communicate with each other?
Are the control networks configured as heartbeat networks?
Can the control network addresses be pinged from peer nodes?
Are the quorum majority or tie rules satisfied?
Look at the cmsd log to determine membership status.
If a reset is required, are the following conditions met?
Is the node control daemon, crsd, up and running?
Is the reset serial line in good health?
You can look at the crsd log for the node you are concerned with, or execute the admin ping and admin reset commands on the node to check this; a rough scripted version of these checks is sketched after this list.
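As a rough companion to this checklist, the following sketch runs the reachability and daemon checks from a script. The control-network addresses are placeholders, and the ping and ps invocations are generic UNIX commands rather than FailSafe tools; the admin ping and admin reset checks themselves are performed with the FailSafe administration commands as described above.

    # Rough companion to the checklist above: ping the peer control-network
    # addresses and confirm that the FailSafe daemons are running. Addresses
    # are placeholders; ps and ping are generic UNIX commands, not FailSafe tools.
    import subprocess

    PEER_ADDRESSES = ["192.0.2.10", "192.0.2.11"]   # placeholder control-network addresses
    DAEMONS = ["ha_cmsd", "cdbd", "crsd"]           # membership, database, node control

    def daemon_running(name):
        """Return True if a process with this name appears in ps output."""
        out = subprocess.run(["ps", "-e"], capture_output=True, text=True).stdout
        return any(name in line for line in out.splitlines())

    def peer_reachable(address):
        """Return True if the control-network address answers a single ping."""
        return subprocess.run(["ping", "-c", "1", address],
                              capture_output=True).returncode == 0

    for d in DAEMONS:
        print("%s: %s" % (d, "running" if daemon_running(d) else "NOT running"))
    for addr in PEER_ADDRESSES:
        print("%s: %s" % (addr, "reachable" if peer_reachable(addr) else "NOT reachable"))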