Writing the Action Scripts and Adding Monitoring Agents

This chapter provides information about writing the action scripts required to make an application highly available and how to add monitoring agents. It discusses the following topics:

Set of Action Scripts

Multiple instances of scripts may be executed at the same time. For more information, see .

The following set of action scripts can be provided for each resource:

exclusive, which verifies that the resource is not already running

start, which starts the resource

stop, which stops the resource

monitor, which monitors the resource

restart, which restarts the resource on the same node when a monitoring failure occurs

The start, stop, and exclusive scripts are required for every resource type. The start and stop scripts must be idempotent; that is, an action requested multiple times in succession should continue to return success and should have no side effects. For example, if the start script is run for a resource that is already started, the script must not return an error.

A monitor script is also required, but if you wish, it may contain only a return-success function. A restart script is required if the application must be restartable on the same node in case of failure; however, the restart script may contain only a return-success function.

Understanding the Execution of Action Scripts

Before you can write a new action script, you must understand how action scripts are executed. This section covers the following topics:

Multiple Instances of a Script Executed at the Same Time

Multiple instances of the same script may be executed at the same time. 
To avoid problems this may cause, you can use the ha_filelock and ha_execute_lock commands to achieve sequential execution of commands in different instances of the same script.

For example, consider a script that modifies a configuration file to start a new application instance. Multiple instances of the script modifying the file simultaneously could cause file corruption and data loss. The start script for the application should use ha_execute_lock when executing the modification script to ensure correct configuration file modification. Assuming the script is named modify_configuration_file, the start script would contain a statement similar to the following:

${HA_CMDSPATH}/ha_execute_lock 30 ${HA_SCRIPTTMPDIR}/lock.volume_assemble "modify_configuration_file"

The ha_execute_lock command takes three arguments:

Number of seconds before the command times out waiting for the file lock

File to be used for locking

Command to be executed

The ha_execute_lock command tries to obtain a lock on the file every second for timeout seconds. After obtaining a lock on the file, it executes the command argument. On command completion, it releases the lock on the file.

Differences between the <filename>exclusive</filename> and <filename>monitor</filename> Scripts

Although the same check can be used in the monitor and exclusive action scripts, they are used for different purposes. summarizes the differences between the scripts.

Differences Between the <literal>monitor</literal> and <literal>exclusive</literal> Action Scripts

exclusive: Executed on all nodes in the cluster. monitor: Executed only on the node where the resource group (which contains the resource) is online.

exclusive: Executed before the resource is started in the cluster. monitor: Executed while the resource is online in the cluster. (The monitor script could degrade the services provided by the HA server. Therefore, the check performed by the monitor script should be lightweight and less time consuming than the check performed by the exclusive script.)

exclusive: Executed only once, before the resource group is made online in the cluster. monitor: Executed periodically.

exclusive: Failure results in the resource group not becoming online in the cluster. monitor: Failure causes a resource group failover to another node or a restart of the resource on the local node. An error will cause false resource group failovers in the cluster.
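The obtain-lock-then-execute behavior that ha_execute_lock provides can be illustrated with a portable sketch. The following is a minimal illustration of its semantics (try once per second until a timeout, then run the command while holding the lock) using an atomic mkdir as the lock primitive; the function name and lock path are invented for this example and are not part of FailSafe:

```shell
#!/bin/sh
# Illustrative sketch of ha_execute_lock semantics using a mkdir-based
# lock; in real action scripts, use ${HA_CMDSPATH}/ha_execute_lock.
execute_with_lock() {
    timeout=$1; lockdir=$2; shift 2
    tries=0
    # Try once per second, up to $timeout seconds, to obtain the lock.
    while ! mkdir "$lockdir" 2>/dev/null; do
        tries=$((tries + 1))
        [ "$tries" -ge "$timeout" ] && return 1
        sleep 1
    done
    # Lock held: run the command, then release the lock.
    "$@"
    status=$?
    rmdir "$lockdir"
    return $status
}

execute_with_lock 5 "/tmp/lock.demo.$$" echo "configuration file updated"
```

Because the lock directory is created atomically, only one instance at a time can hold the lock, which is the same serialization guarantee the real command provides for concurrent script instances.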
Successful Execution of Action Scripts

shows the state of a resource group after the successful execution of an action script for every resource within a resource group. To view the state of a resource group, use the Cluster Manager graphical user interface (GUI) or the cluster_mgr command.

Successful Action Script Results

Resource group is made online on a node: the start script executes; the resource group state is online.

Resource group is made offline on a node: the stop script executes; the resource group state is offline.

Online status of the resource group is checked: the exclusive script executes; there is no effect on the resource group state.

Normal monitoring of an online resource group: the monitor script executes; the resource group state is online.

Resource group monitoring failure: the restart script executes; the resource group state is online.
Failure of Action Scripts

shows the state of the resource group and the error state when an action script fails.

Failure of an Action Script

exclusive fails: the resource group state is online; the error state is exclusivity.

monitor fails: the resource group state is online; the error state is monitoring failure.

restart fails: the resource group state is online; the error state is monitoring failure.

start fails: the resource group state is online; the error state is srmd executable error.

stop fails: the resource group state is online; the error state is srmd executable error.
Implementing Timeouts and Retrying a Command

You can use the ha_exec2 command to execute action scripts with timeouts. This allows the action script to be completed within the specified time and permits proper error messages to be logged on failure or timeout. The retry capability is especially useful in monitor and exclusive action scripts.

To retry a command, use the following syntax:

/usr/lib/failsafe/bin/ha_exec2 timeout_in_seconds number_of_retries command_to_be_executed

For example:

${HA_CMDSPATH}/ha_exec2 30 2 "umount /fs"

The above ha_exec2 command executes the umount /fs command line. If the command does not complete within 30 seconds, it kills the umount command and retries the command. The ha_exec2 command retries the umount command 2 times if it times out or fails. For more information, see the ha_exec2 man page.

Sending UNIX Signals

You can use the ha_exec2 command to send UNIX signals to a specific process. A process is identified by its name or its arguments. For example:

${HA_CMDSPATH}/ha_exec2 -s 0 -t "knfsd"

The above command sends signal 0 (which checks whether the process exists) to all processes whose name or arguments match the string knfsd. The command returns 0 on success.

For performance and speed considerations, you should use the ha_exec2 command to check for server processes in the monitor script instead of using a ps -ef | grep command line construction. For more information, see the ha_exec2 man page.
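The timeout-and-retry behavior described above can be approximated with standard tools. The following minimal sketch uses the coreutils timeout(1) command; the wrapper name is invented for this illustration, and real action scripts should call ha_exec2 itself:

```shell
#!/bin/sh
# Illustrative wrapper mimicking ha_exec2 timeout/retry semantics.
# Usage: exec_with_retries SECONDS RETRIES COMMAND [ARGS...]
exec_with_retries() {
    secs=$1; retries=$2; shift 2
    attempt=0
    while [ "$attempt" -le "$retries" ]; do
        # Kill the command if it runs longer than $secs seconds,
        # then fall through to the retry logic.
        if timeout "$secs" "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# A command that succeeds immediately needs no retries:
exec_with_retries 30 2 true && echo "command completed"
```

As with ha_exec2, the command is attempted once plus the given number of retries, and the wrapper's exit status tells the calling script whether to log a failure.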
Preparation

Before you can write the action scripts, you must do the following:

Understand the scriptlib functions described in .

Familiarize yourself with the script templates provided in the following directory: /usr/lib/failsafe/resource_types/template

Read the man pages for the following commands: cluster_mgr cdbd ha_cilog ha_cmsd ha_exec2 ha_fsd ha_gcd ha_ifd ha_ifdadmin ha_macconfig2 ha_srmd ha_statd2 haStatus

Familiarize yourself with the action scripts for other highly available services in /usr/lib/failsafe/resource_types that are similar to the scripts you wish to create.

Understand how to do the following actions for your application:

Verify that the resource is running

Verify that the resource can be run

Start the resource

Stop the resource

Check for the server processes

Do a simple query as a client and understand the expected response

Check for configuration file or directory existence (as needed)

Determine whether or not monitoring is required (see ). However, even if monitoring is not needed, a monitor script is still required; in this case, it can contain only a return-success function.

Determine if a resource type must be added to the cluster configuration database.

Understand the vendor-supplied startup and shutdown procedures.

Determine the configuration parameters for the application; these may be used in the action script and should be stored in the CDB.

Determine whether the resource type can be restarted on its local node, and whether this action makes sense.

Is Monitoring Necessary? 
In the following situations, you may not need to perform application monitoring:

Heartbeat monitoring is sufficient; that is, simply verifying that the node is alive (provided automatically by the base software) determines the health of the highly available service.

There is no process or resource that can be monitored. For example, the Linux kernel ipchains filtering software performs IP filtering on firewall nodes. Because the filtering is done in the kernel, there is no process or resource to monitor.

A resource on which the application depends is already monitored. For example, monitoring some client-node resources might best be done by monitoring the file systems, volumes, and network interfaces they use. Because this is already done by the base software, additional monitoring is not required.

Be aware that monitoring should be as lightweight as possible so that it does not affect system performance. Also, security issues may make monitoring difficult. If you are unable to provide a monitoring script with appropriate performance and security, consider a monitoring agent; see .

Types of Monitoring

There are two types of monitoring that may be accomplished in a monitor script:

Is the resource present?

Is the resource responding?

You can define multiple levels of monitoring within the monitor script, and the administrator can choose the desired level by configuring the resource definition in the cluster configuration database. Ensure that the monitoring level chosen does not affect system performance. For more information, see the Linux FailSafe Administrator's Guide.

What are the Symptoms of Monitoring Failure?

Possible symptoms of failure include the following:

The resource returns an error code

The resource returns the wrong result

The resource does not return quickly enough

How Often Should Monitoring Occur? 
You must determine the monitoring interval and timeout values for the monitor script. The timeout must be long enough to guarantee that occasional anomalies do not cause false failovers. It will be useful for you to determine the peak load that the resource may need to sustain.

You must also determine whether the monitor test should execute multiple times so that an application is not declared dead after a single failure. In general, testing more than once before declaring failure is a good idea.

Examples of Testing for Monitoring Failure

The test should be simple and should complete quickly, whether it succeeds or fails. Some examples of tests are as follows:

For a client/server application that follows a well-defined protocol, the monitor script can make a simple request and verify that the proper response is received.

For a web server application, the monitor script can request a home page, verify that the connection was made, and ignore the resulting home page.

For a database, a simple request such as querying a table can be made.

For NFS, more complicated end-to-end monitoring is required. The test might consist of mounting an exported file system, checking access to the file system with a stat() system call to the root of the file system, and undoing the mount.

For a resource that writes to a log file, check that the size of the log file is increasing, or use the grep command to check for a particular message.

The following command can be used to determine quickly whether a process exists:

/usr/bin/killall -0 process_name

You can also use the ha_exec2 command to check if a process is running. The ha_exec2 command differs from killall in that it performs a more exhaustive check on the process name as well as the process arguments; killall searches for the process using the process name only. 
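The signal-0 existence check behind killall -0 can be sketched with the shell's built-in kill. The function name and the use of a known PID below are illustrative; a monitor script checking by process name would use killall or ha_exec2 as described:

```shell
#!/bin/sh
# Signal 0 delivers no signal; it only tests whether the process
# exists (and whether we are permitted to signal it).
process_alive() {
    kill -0 "$1" 2>/dev/null
}

# Demonstrate against a short-lived background process.
sleep 60 &
demo_pid=$!
if process_alive "$demo_pid"; then
    echo "process $demo_pid is running"
fi
kill "$demo_pid" 2>/dev/null
```

Unlike killall or ha_exec2, this checks a specific PID rather than matching by name or arguments, so a script using it would typically obtain the PID from a pid file.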
The command line is as follows:

/usr/lib/failsafe/bin/ha_exec2 -s 0 -t process_name

Do not use the ps command to check on a particular process, because its execution can be too slow.

Script Format

Templates for the action scripts are provided in the following directory:

/usr/lib/failsafe/resource_types/template

The template scripts have the same general format. Following is the order in which the information appears in the script:

Header information

Set local variables

Read resource information

Exit status

Perform the basic action of the script, which is the customized area you must provide

Set global variables

Verify arguments

Read input file

Action scripts can take any form, such as a Bourne shell script, a Perl script, or a C language program. The following sections show examples from the NFS start script. Note that the contents of these examples may not match the latest software.

Header Information

The header information contains comments about the resource type, script type, and resource configuration format. You must modify the code as needed.

Following is the header for the NFS start script:

#!/bin/sh
# **************************************************************************
# *                                                                        *
# * Copyright (C) 1998 Silicon Graphics, Inc.                              *
# *                                                                        *
# * These coded instructions, statements, and computer programs contain    *
# * unpublished proprietary information of Silicon Graphics, Inc., and     *
# * are protected by Federal copyright law. They may not be disclosed      *
# * to third parties or copied or duplicated in any form, in whole or      *
# * in part, without the prior written consent of Silicon Graphics, Inc. 
*
# *                                                                        *
# **************************************************************************
#ident "$Revision: 1.1 $"

# Resource type: NFS
# Start script NFS
#
# Test resource configuration information is present in the database in
# the following format
#
# resource-type.NFS

Set Local Variables

The set_local_variables() section of the script defines all of the variables that are local to the script, such as temporary file names or database keys. All local variables should use the LOCAL_ prefix. You must modify the code as needed.

Following is the set_local_variables() section from the NFS start script:

set_local_variables()
{
    LOCAL_TEST_KEY=NFS
}

Read Resource Information

The get_xxx_info() function, such as get_nfs_info(), reads the resource information from the cluster configuration database. $1 is the test resource name. If the operation is successful, a value of 0 is returned; if the operation fails, 1 is returned. The information is returned in the HA_STRING variable. For more information about HA_STRING, see .

Following is the get_nfs_info() section from the NFS start script:

get_nfs_info ()
{
    ha_get_info ${LOCAL_TEST_KEY} $1
    if [ $? -ne 0 ]; then
        return 1;
    else
        return 0;
    fi
}

If you wish to get resource dependency information, you can call ha_get_info with a third argument of any value. The resource dependency list will be returned in the HA_STRING variable.

Exit Status

In the exit_script() function, $1 contains the exit_status value. 
If cleanup actions are required, such as the removal of temporary files that were created as part of the process, place them before the exit line.

Following is the exit_script() section from the NFS start script:

exit_script()
{
    exit $1;
}

If you call the exit_script function prior to normal termination, it should be preceded by the ha_write_status_for_resource function, and you should use the same return code that is logged to the output file. For more information, see .

Basic Action

This area of the script is the portion you must customize. The templates provide a minimal framework.

Following is the framework for the basic action from the start template:

start_template()
{
    # for all template resources passed as parameter
    for TEMPLATE in $HA_RES_NAMES
    do
        #HA_CMD="command to start $TEMPLATE resource on the local machine ";
        #ha_execute_cmd "string to describe the command being executed ";
        ha_write_status_for_resource $TEMPLATE $HA_SUCCESS;
    done
}

When testing the script, you can obtain debugging information by adding the shell command set -x to this section. For examples of this area, see .

Set Global Variables

The following lines set all of the global and local variables and store the resource names in $HA_RES_NAMES.

Following is the set_global_variables() function from the NFS start script:

set_global_variables()
{
    HA_DIR=/usr/lib/failsafe
    COMMON_LIB=${HA_DIR}/common_scripts/scriptlib

    # Execute the common library file
    . $COMMON_LIB

    ha_set_global_defs;
}

Verify Arguments

The ha_check_args() function verifies the arguments and stores them in the $HA_INFILE and $HA_OUTFILE variables. 
It returns 1 on error and 0 on success.

Following is the section from the NFS start script that calls ha_check_args:

ha_check_args $*;
if [ $? -ne 0 ]; then
    exit $HA_INVAL_ARGS;
fi

Read Input File

The ha_read_infile() function reads the input file and stores the resource names in the $HA_RES_NAMES variable.

Following is the ha_read_infile() function from the common library file scriptlib:

ha_read_infile()
{
    HA_RES_NAMES="";

    for HA_RESOURCE in `cat ${HA_INFILE}`
    do
        HA_TMP="${HA_RES_NAMES} ${HA_RESOURCE}";
        HA_RES_NAMES=${HA_TMP};
    done
}

Complete the Action

Located at the bottom of the script file are the lines that perform the actual work of the requested action using the prior sections and provided tools. The results are written as output to $HA_OUTFILE:

action_resourcetype;

exit_script $HA_SUCCESS

Following is the completion from the NFS start script:

start_nfs;

exit_script $HA_SUCCESS;

Steps in Writing a Script

Multiple copies of action scripts can execute at the same time. Therefore, all temporary file names must be unique within the storage space used. Often, adding a $$ (process ID) suffix to the name is sufficient. If multiple nodes share a temporary directory, you will also want to incorporate a host identifier to ensure uniqueness. Another method is to use the resource name, because it must be unique within the cluster.

For each script, you must do the following:

Get the required variables

Check the variables

Perform the action

Check the action

The start and stop scripts are required to be idempotent; that is, they have the appearance of being run once but can in fact be run multiple times. For example, if the start script is run for a resource that is already started, the script must not return an error. 
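The steps above can be sketched as a minimal, self-contained skeleton. The ha_write_status_for_resource stub, the application name, and the resource names here are illustrative stand-ins for the real scriptlib functions and CDB data:

```shell
#!/bin/sh
# Hedged sketch of the per-script steps: get the variables, check them,
# perform the action, and check the action. Stubs stand in for scriptlib.
HA_SUCCESS=0

ha_write_status_for_resource() {
    # Stub: the real function records status for the output file.
    echo "status: resource=$1 code=$2"
}

start_myapp() {
    for resource in $HA_RES_NAMES; do
        # Temporary file name made unique per resource, host, and
        # script instance, as recommended above.
        tmpfile="/tmp/${resource}.$(hostname).$$"
        : > "$tmpfile"                  # perform the action (stubbed)
        [ -f "$tmpfile" ] || return 1   # check the action
        rm -f "$tmpfile"
        ha_write_status_for_resource "$resource" "$HA_SUCCESS"
    done
}

HA_RES_NAMES="res1 res2"   # normally read from $HA_INFILE by ha_read_infile
start_myapp
```

The resource-plus-host-plus-PID naming keeps the temporary file unique even when multiple instances of the script run concurrently on nodes that share a temporary directory.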
All action scripts must return the status to the /var/log/failsafe/script_nodename file.

Examples of Action Scripts

The following sections use portions of the NFS scripts as examples. The examples in this guide may not exactly match the released system.

<literal>start</literal> Script

The NFS start script does the following:

Creates a resource-specific NFS status directory.

Exports the specified export-point with the specified export-options.

Following is a section from the NFS start script:

# Start the resource on the local machine.
# Return HA_SUCCESS if the resource has been successfully started on the local
# machine and HA_CMD_FAILED otherwise.
#
start_nfs()
{
    ${HA_DBGLOG} "Entry: start_nfs()";

    # for all nfs resources passed as parameter
    for resource in ${HA_RES_NAMES}
    do
        NFSFILEDIR=${HA_SCRIPTTMPDIR}/${LOCAL_TEST_KEY}$resource
        HA_CMD="mkdir -p $NFSFILEDIR";
        ha_execute_cmd "creating nfs status file directory";
        if [ $? -ne 0 ]; then
            ${HA_LOG} "Failed to create ${NFSFILEDIR} directory";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script $HA_NOCFGINFO
        fi

        get_nfs_info $resource
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: $resource parameters not present in CDB";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi

        ha_get_field "${HA_STRING}" export-info
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: export-info not present in CDB for resource $resource";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi
        export_opts="$HA_FIELD_VALUE"

        ha_get_field "${HA_STRING}" filesystem
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: filesystem-info not present in CDB for resource $resource";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi
        filesystem="$HA_FIELD_VALUE"

        # Before we try and export the NFS resource, make sure
        # filesystem is mounted. 
        HA_CMD="grep $filesystem /etc/mtab > /dev/null 2>&1";
        ha_execute_cmd "check if the filesystem $filesystem is mounted";
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: filesystem $filesystem not mounted";
            ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
            exit_script ${HA_CMD_FAILED};
        fi

        # Now do the job: export the new directory
        # Note: the export_dir command will check whether this directory
        # is already exported or not.
        HA_CMD="export_dir ${resource} ${export_opts}";
        ha_execute_cmd "export $resource directories to NFS clients";
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: could not export resource ${resource}"
            ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
            exit_script ${HA_CMD_FAILED};
        else
            ha_write_status_for_resource ${resource} ${HA_SUCCESS};
        fi
    done
}

<literal>stop</literal> Script

The NFS stop script does the following:

Unexports the specified export-point.

Removes the NFS status directory.

Following is an example from the NFS stop script:

# Stop the nfs resource on the local machine.
# Return HA_SUCCESS if the resource has been successfully stopped on the local
# machine and HA_CMD_FAILED otherwise.
#
stop_nfs()
{
    ${HA_DBGLOG} "Entry: stop_nfs()";

    # for all nfs resources passed as parameter
    for resource in ${HA_RES_NAMES}
    do
        get_nfs_info ${resource}
        if [ $? -ne 0 ]; then
            # NFS resource information not available.
            ${HA_LOG} "NFS: $resource parameters not present in CDB";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi

        ha_get_field "${HA_STRING}" export-info
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: export-info not present in CDB for resource $resource";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi
        export_opts="$HA_FIELD_VALUE"

        # Unexport the directory
        HA_CMD="unexport_dir ${resource}"
        ha_execute_cmd "unexport ${resource} directory to NFS clients"
        if [ $? 
-ne 0 ]; then
            ${HA_LOG} "NFS: Failed to unexport resource ${resource}"
            ha_write_status_for_resource ${resource} ${HA_CMD_FAILED}
        fi
        ha_write_status_for_resource ${resource} ${HA_SUCCESS}
    done
}

<literal>monitor</literal> Script

The NFS monitor script does the following:

Verifies that the file system is mounted at the correct mount point.

Requests the status of the exported file system.

Checks the export-point.

Requests NFS statistics and (based on the results) makes a Remote Procedure Call (RPC) to NFS as needed.

Following is an example from the NFS monitor script:

# Check if the nfs resource is allocated in the local node
# This check must be light weight and less intrusive compared to
# exclusive check. This check is done when the resource has been
# allocated in the local node.
# Return HA_SUCCESS if the resource is running in the local node
# and HA_CMD_FAILED if the resource is not running in the local node
# The list of the resources passed as input is in variable
# $HA_RES_NAMES
#
monitor_nfs()
{
    ${HA_DBGLOG} "Entry: monitor_nfs()";

    for resource in ${HA_RES_NAMES}
    do
        get_nfs_info ${resource}
        if [ $? -ne 0 ]; then
            # No resource information available.
            ${HA_LOG} "NFS: ${resource} parameters not present in CDB";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi

        ha_get_field "${HA_STRING}" filesystem
        if [ $? -ne 0 ]; then
            # filesystem not available.
            ${HA_LOG} "NFS: filesystem not present in CDB for resource $resource";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi
        fs="$HA_FIELD_VALUE";

        # Check to see if the filesystem is mounted
        HA_CMD="mount | grep ${fs} >/dev/null 2>&1"
        ha_execute_cmd "check to see if $fs is mounted"
        if [ $? 
-ne 0 ]; then
            ${HA_LOG} "NFS: ${fs} not mounted";
            ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
            exit_script $HA_CMD_FAILED;
        fi

        # stat the filesystem
        HA_CMD="fs_stat -r ${resource} >/dev/null 2>&1";
        ha_execute_cmd "stat mount point $resource"
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: cannot stat ${resource} NFS export point";
            ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
            exit_script $HA_CMD_FAILED;
        fi

        # check the filesystem is exported
        showmount -e | grep "${resource} " >/dev/null 2>&1
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: failed to find ${resource} in exported filesystem list:-"
            ${HA_LOG} "`showmount -e`"
            ha_write_status_for_resource ${resource} ${HA_CMD_FAILED}
            exit_script ${HA_CMD_FAILED}
        fi

        # check the NFS daemon is still alive and responding
        exec_rpcinfo;
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: exec_rpcinfo failed";
            ha_write_status_for_resource ${resource} ${HA_CMD_FAILED}
            exit_script $HA_CMD_FAILED
        fi

        # Check the stats ?
        # To Be Done... but there is no nfsstat command
        # for the user space NFS daemon.

        ha_write_status_for_resource $resource $HA_SUCCESS;
    done
}

<literal>exclusive</literal> Script

The NFS exclusive script determines whether the file system is already exported. The check made by an exclusive script can be more expensive than a monitor check. Linux FailSafe uses this script to determine if resources are running on a node in the cluster, and thereby to prevent starting resources on multiple nodes in the cluster.

Following is an example from the NFS exclusive script:

# Check if the nfs resource is running in the local node. This check can be
# more intrusive than the monitor check. This check is used to determine
# if the resource has to be started on a machine in the cluster. 
# Return HA_NOT_RUNNING if the resource is not running in the local node
# and HA_RUNNING if the resource is running in the local node
# The list of nfs resources passed as input is in variable
# $HA_RES_NAMES
#
exclusive_nfs()
{
    ${HA_DBGLOG} "Entry: exclusive_nfs()";

    # for all resources passed as parameter
    for resource in ${HA_RES_NAMES}
    do
        get_nfs_info $resource
        if [ $? -ne 0 ]; then
            # No resource information available
            ${HA_LOG} "NFS: $resource parameters not present in CDB";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi

        # Check if resource is already exported by the NFS server
        showmount -e | grep "${resource} " >/dev/null 2>&1
        if [ $? -eq 0 ]; then
            ha_write_status_for_resource ${resource} ${HA_RUNNING};
            ha_print_exclusive_status ${resource} ${HA_RUNNING};
        else
            ha_write_status_for_resource ${resource} ${HA_NOT_RUNNING};
            ha_print_exclusive_status ${resource} ${HA_NOT_RUNNING};
        fi
    done
}

<literal>restart</literal> Script

The NFS restart script exports the specified export-point with the specified export-options.

Following is an example from the restart script for NFS:

# Restart nfs resource
# Return HA_SUCCESS if nfs resource failed over successfully or
# return HA_CMD_FAILED if nfs resource could not be failed over locally.
# The list of nfs resources passed as input is in variable
# $HA_RES_NAMES
#
restart_nfs()
{
    ${HA_DBGLOG} "Entry: restart_nfs()";

    # for all nfs resources passed as parameter
    for resource in ${HA_RES_NAMES}
    do
        get_nfs_info $resource
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: $resource parameters not present in CDB";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi

        ha_get_field "${HA_STRING}" export-info
        if [ $? 
-ne 0 ]; then
            ${HA_LOG} "NFS: export-info not present in CDB for resource $resource";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi
        export_opts="$HA_FIELD_VALUE"

        # Note: the export_dir command will check whether this directory
        # is already exported or not.
        HA_CMD="export_dir ${resource} ${export_opts}";
        ha_execute_cmd "export $resource directories to NFS clients";
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: could not export resource ${resource}"
            ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
            exit_script ${HA_CMD_FAILED};
        else
            ha_write_status_for_resource ${resource} ${HA_SUCCESS};
        fi
    done
}

Monitoring Agents

If resources cannot be monitored using a lightweight check, you should use a monitoring agent. The monitor action script contacts the monitoring agent to determine the status of the resource on the node. The monitoring agent in turn periodically monitors the resource. shows the monitoring process.
Monitoring Process
Monitoring agents are useful for monitoring database resources. In databases, creating the database connection is costly and time consuming. The monitoring agent maintains connections to the database, and it queries the database using the connection in response to the monitor action script request.

Monitoring agents are independent processes and can be started by the cmond process, although this is not required. For example, if a monitoring agent must be started when activating highly available services on a node, information about that agent can be added to the cmond configuration on that node. The cmond configuration is located in the /etc/failsafe/cmon_process_groups directory. Information about different agents should go into different files. The name of the file is not relevant to the activate/deactivate procedure.

If a monitoring agent exits or aborts, cmond will automatically restart the monitoring agent. This prevents monitor action script failures due to monitoring agent failures.

For example, the /etc/failsafe/cmon_process_groups/ip_addresses file contains information about the ha_ifd process that monitors network interfaces. It contains the following, where ACTIONS represents what cmond can perform on the agents (which will be the same for all scripts):

TYPE = cluster_agent
PROCS = ha_ifd
ACTIONS = start stop restart attach detach
AUTOACTION = attach

If you create a new monitoring agent, you must also create a corresponding file in the /etc/failsafe/cmon_process_groups directory that contains similar information about the new agent. To do this, you can copy the ip_addresses file and modify the PROCS line to list the executables that constitute your new agent. These processes must be located in the /usr/lib/failsafe/bin directory. You should not modify the other configuration lines (TYPE, ACTIONS, and AUTOACTION). 
Suppose you need to add a new agent called newagent that consists of the processes ha_x and ha_y. The configuration information for this agent will be located in the /etc/failsafe/cmon_process_groups/newagent file, which will contain the following:

TYPE = cluster_agent
PROCS = ha_x ha_y
ACTIONS = start stop restart attach detach
AUTOACTION = attach

In this case, the software will expect two executables (/usr/lib/failsafe/bin/ha_x and /usr/lib/failsafe/bin/ha_y) to be present.
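The division of labor described above (a long-lived agent performs the expensive check, while the monitor action script makes only a cheap status query) can be sketched in shell. Everything here, including the status-file mechanism and the function names, is an illustration of the pattern rather than the actual FailSafe agent protocol:

```shell
#!/bin/sh
# Sketch of the monitoring-agent pattern: the agent performs the
# expensive check periodically and records the result; the monitor
# script's check is then just a cheap read of the last recorded status.
STATUS_FILE=/tmp/agent_status.$$

agent_check() {
    # Stand-in for an expensive check (e.g. a query over a database
    # connection that the agent keeps open between checks).
    echo "OK" > "$STATUS_FILE"
}

monitor_resource() {
    # The monitor script's side: lightweight, no connection setup.
    [ -f "$STATUS_FILE" ] && [ "$(cat "$STATUS_FILE")" = "OK" ]
}

agent_check
if monitor_resource; then
    echo "resource healthy"
fi
rm -f "$STATUS_FILE"
```

This keeps the per-invocation cost of the monitor script low, which matters because the monitor script runs periodically and must not degrade the services it is checking.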