Linux FailSafe™ Programmer's Guide
Chapter 2. Writing the Action Scripts and Adding Monitoring Agents
Before you can write the action scripts, you must do the following:
Understand the scriptlib functions described in Appendix B.
Familiarize yourself with the script templates provided in the following directory: /usr/lib/failsafe/resource_types/template
Read the man pages for the following commands:
cluster_mgr
cdbd
ha_cilog
ha_cmsd
ha_exec2
ha_fsd
ha_gcd
ha_ifd
ha_ifdadmin
ha_macconfig2
ha_srmd
ha_statd2
haStatus
Familiarize yourself with the action scripts for other highly available services in /usr/lib/failsafe/resource_types that are similar to the scripts you wish to create.
Understand how to do the following actions for your application:
Verify that the resource is running
Verify that the resource can be run
Start the resource
Stop the resource
Check for the server processes
Do a simple query as a client and understand the expected response
Check for configuration file or directory existence (as needed)
Determine whether or not monitoring is required (see Section 2.3.1). Even if monitoring is not needed, a monitor script is still required; in that case, it need contain only a function that returns success.
Determine if a resource type must be added to the cluster configuration database.
Understand the vendor-supplied startup and shutdown procedures.
Determine the configuration parameters for the application; these may be used in the action script and should be stored in the CDB.
Determine whether the resource type can be restarted in its local node, and whether this action makes sense.
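For a resource that needs no monitoring, the required monitor script can be reduced to a function that reports success. The following is a minimal sketch only: the name `monitor_noop` is hypothetical, and the sketch assumes that an exit status of 0 means success. A real script should follow the conventions of the templates in /usr/lib/failsafe/resource_types/template and the scriptlib functions described in Appendix B.

```shell
#!/bin/sh
# Sketch of a monitor script for a resource that needs no monitoring.
# Assumption: an exit status of 0 is treated as success.
# monitor_noop is a hypothetical name, not a scriptlib function.

monitor_noop() {
    # Nothing to check, so report success unconditionally.
    return 0
}

# The script's exit status is the status of this last command.
monitor_noop
```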
In the following situations, you may not need to perform application monitoring:
Heartbeat monitoring is sufficient; that is, simply verifying that the node is alive (provided automatically by the base software) determines the health of the highly available service.
There is no process or resource that can be monitored. For example, the Linux kernel ipchains filtering software performs IP filtering on firewall nodes. Because the filtering is done in the kernel, there is no process or resource to monitor.
A resource on which the application depends is already monitored. For example, monitoring some client-node resources might best be done by monitoring the file systems, volumes, and network interfaces they use. Because this is already done by the base software, additional monitoring is not required.
Caution: Monitoring should be as lightweight as possible so that it does not affect system performance. Security issues may also make monitoring difficult. If you are unable to provide a monitoring script with appropriate performance and security, consider a monitoring agent; see Section 2.7.
There are two types of monitoring that may be accomplished in a monitor script:
Is the resource present?
Is the resource responding?
You can define multiple levels of monitoring within the monitor script, and the administrator can choose the desired level by configuring the resource definition in the cluster configuration database. Ensure that the monitoring level chosen does not affect system performance. For more information, see the Linux FailSafe Administrator's Guide.
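One way to structure multiple monitoring levels is a single entry point that dispatches on a level number. The sketch below is illustrative only. It assumes level 1 answers "is the resource present?" and level 2 answers "is the resource responding?", and it uses a file path as a stand-in for the resource. The names `check_present`, `check_responding`, and `monitor_resource` are hypothetical; how the chosen level reaches the script from the cluster configuration database is not shown here.

```shell
#!/bin/sh
# Sketch of a monitor routine with two levels.
# Assumptions: level 1 = presence check, level 2 = presence + response check.
# All function names are hypothetical placeholders, and a file path stands
# in for the real resource.

check_present() {
    # Stand-in presence test: the path exists.
    [ -e "$1" ]
}

check_responding() {
    # Stand-in response test: the path is readable and non-empty.
    [ -r "$1" ] && [ -s "$1" ]
}

monitor_resource() {
    # Usage: monitor_resource LEVEL RESOURCE
    level=$1
    resource=$2
    case "$level" in
        1) check_present "$resource" ;;
        2) check_present "$resource" && check_responding "$resource" ;;
        *) echo "unknown monitoring level: $level" >&2
           return 2 ;;
    esac
}
```

Keeping the cheaper check at the lower level lets the administrator trade thoroughness for a smaller performance impact.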
Possible symptoms of failure include the following:
The resource returns an error code
The resource returns the wrong result
The resource does not return quickly enough
You must determine the monitoring interval and time-out values for the monitor script. The time-out must be long enough that occasional anomalies do not cause false failovers. To choose it, it is useful to determine the peak load that the resource may need to sustain.
You must also determine if the monitor test should execute multiple times so that an application is not declared dead after a single failure. In general, testing more than once before declaring failure is a good idea.
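Testing more than once before declaring failure can be sketched as a small retry wrapper around the check. The function name `run_with_retries` and its arguments are hypothetical; the attempt count and delay are values you would choose for your application.

```shell
#!/bin/sh
# Sketch of retrying a check before declaring failure.
# run_with_retries is a hypothetical helper, not part of scriptlib.
# Usage: run_with_retries ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]

run_with_retries() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Succeed as soon as one attempt succeeds.
        if "$@"; then
            return 0
        fi
        # Pause between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    # Every attempt failed.
    return 1
}
```

The worst-case run time is roughly the number of attempts multiplied by the check time plus the delay, so these values must fit within the time-out chosen for the monitor script.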
The test should be simple and should complete quickly, whether it succeeds or fails. Some examples of tests are as follows:
For a client/server application that follows a well-defined protocol, the monitor script can make a simple request and verify that the proper response is received.
For a web server application, the monitor script can request a home page, verify that the connection was made, and ignore the resulting home page.
For a database, a simple request such as querying a table can be made.
For NFS, more complicated end-to-end monitoring is required. The test might consist of mounting an exported file system, checking access to it with a stat() system call on the root of the file system, and then unmounting it.
For a resource that writes to a log file, check that the size of the log file is increasing or use the grep command to check for a particular message.
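As an illustration of the log-file test in the last item, the following sketch compares the current size of a log file with a previously recorded size and searches the file for a message. The helper names `log_size`, `log_has_grown`, and `log_contains` are hypothetical, and where the previous size is stored between monitor runs is left to the script author.

```shell
#!/bin/sh
# Sketch of log-file checks for a monitor script.
# All function names are hypothetical helpers.

log_size() {
    # Print the size of the file in bytes. Reading from stdin keeps
    # wc from printing the file name; tr strips any padding.
    wc -c < "$1" | tr -d ' '
}

log_has_grown() {
    # Usage: log_has_grown LOGFILE PREVIOUS_SIZE
    # Succeeds if the file is readable and larger than PREVIOUS_SIZE.
    [ -r "$1" ] || return 1
    current=$(log_size "$1")
    [ "$current" -gt "$2" ]
}

log_contains() {
    # Usage: log_contains LOGFILE PATTERN
    # Succeeds if any line of the file matches PATTERN.
    grep -q -- "$2" "$1"
}
```

A size comparison is cheaper than scanning the file with grep, which matters for large logs because the test should complete quickly.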
The following command can be used to determine quickly whether a process exists:
/usr/bin/killall -0 process_name
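Inside a script, the killall check can be wrapped in a small function whose exit status tells the caller whether the process exists. This is a sketch: the function name `process_exists` and the `KILLALL` variable are hypothetical, and the variable exists only so that the path can be overridden on systems where killall is installed elsewhere.

```shell
#!/bin/sh
# Sketch of a process-existence check based on "killall -0".
# process_exists and KILLALL are hypothetical names.
# Signal 0 delivers no signal; it only tests whether a matching
# process exists.

KILLALL=${KILLALL:-/usr/bin/killall}

process_exists() {
    # Usage: process_exists PROCESS_NAME
    # Output is discarded; only the exit status matters.
    "$KILLALL" -0 "$1" >/dev/null 2>&1
}
```

A monitor script might then call `process_exists process_name` and treat a non-zero status as a failed check.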
You can also use the ha_exec2 command to check if a process is running.
The ha_exec2 command differs from killall in that it performs a more exhaustive check, matching on the process arguments as well as the process name; killall searches for the process using the process name only. The ha_exec2 command line is as follows:
/usr/lib/failsafe/bin/ha_exec2 -s 0 -t process_name
Note: Do not use the ps command to check on a particular process because its execution can be too slow.