As well as computing arithmetic and logical values, pmie can execute actions (raise popup alarms, write system log messages, and launch programs) in response to specified conditions. Such actions are extremely useful in detecting, monitoring and correcting performance-related problems.
The expressions to be evaluated are read from configuration files specified by one or more filename arguments. In the absence of any filename, expressions are read from standard input.
A description of the command line options specific to pmie follows:
expr_1 (Tue Feb 6 19:55:10 2001): 12
The -S, -T, -O, and -A options may be used to define a time window to restrict the samples retrieved, set an initial origin within the time window, or specify a ``natural'' alignment of the sample times; refer to PCPIntro(1) for a complete description of these options.
Output from pmie is directed to standard output and standard error as follows:
The directory $PCP_DEMOS_DIR/pmie contains a number of other annotated examples of pmie expressions.
The variable delta controls expression evaluation frequency. Specify that subsequent expressions be evaluated once a second, until further notice:
delta = 1 sec;
If total syscall rate exceeds 5000 per second per CPU, then display an alarm notifier:
kernel.all.syscall / hinv.ncpu > 5000 count/sec -> alarm "high syscall rate";
If the high syscall rate is sustained for 10 consecutive samples, then launch top(1) in an xwsh(1G) window to monitor processes, but do this at most once every 5 minutes:
all_sample ( kernel.all.syscall @0..9 > 5000 count/sec * hinv.ncpu ) -> shell 5 min "xwsh -e 'top'";
The following rules are evaluated once every 20 seconds:
delta = 20 sec;
If any disk is performing more than 60 I/Os per second, then print a message identifying the busy disk to standard output and launch dkvis(1):
some_inst ( disk.dev.total > 60 count/sec ) -> print "disk %i busy " & shell 5 min "dkvis";
Refine the preceding rule to apply only between the hours of 9am and 5pm, and to require 3 of 4 consecutive samples to exceed the threshold before executing the action:
$hour >= 9 && $hour <= 17 && some_inst ( 75 %_sample ( disk.dev.total @0..3 > 60 count/sec ) ) -> print "disk %i busy ";
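The "3 of 4 consecutive samples" condition can be modelled outside pmie to check a threshold choice. The following Python sketch is not pmie code: it assumes that N %_sample is true when at least N percent of the samples satisfy the condition, and the disk names and rates are made-up illustration data.

```python
def percent_sample(samples, percent):
    """True if at least `percent` percent of the boolean samples are true."""
    return 100 * sum(samples) >= percent * len(samples)


def some_inst(per_instance):
    """Return the instance names whose condition holds (empty list: rule is false)."""
    return [name for name, ok in per_instance.items() if ok]


# Hypothetical I/O rates (count/sec) for the 4 most recent samples of each disk.
rates = {
    "disk0": [75.0, 62.0, 10.0, 90.0],  # 3 of 4 samples exceed the threshold
    "disk1": [65.0, 12.0, 8.0, 61.0],   # only 2 of 4 samples exceed it
}
threshold = 60.0

busy = some_inst({
    name: percent_sample([rate > threshold for rate in values], 75)
    for name, values in rates.items()
})
print(busy)
```

In this model only disk0 satisfies the rule, because 2 of 4 samples (50%) falls short of the 75% requirement.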
The following rules are evaluated once every 10 minutes:
delta = 10 min;
If either the / or the /usr filesystem is more than 95% full, display an alarm popup, but not if it has already been displayed during the last 4 hours:
filesys.free #'/dev/root' / filesys.capacity #'/dev/root' < 0.05 -> alarm 4 hour "root filesystem (almost) full";
filesys.free #'/dev/usr' / filesys.capacity #'/dev/usr' < 0.05 -> alarm 4 hour "/usr filesystem (almost) full";
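The interaction between the threshold test and the 4 hour suppression time can be sketched as follows. This is an illustrative Python model, not pmie's implementation: the class name Suppressor, the function almost_full and the use of plain seconds for timestamps are all assumptions made for the example.

```python
class Suppressor:
    """Allow an action to fire at most once per hold-off period."""

    def __init__(self, holdoff_seconds):
        self.holdoff = holdoff_seconds
        self.last_fired = None  # time of the most recent firing, if any

    def should_fire(self, now):
        """Return True (and record the firing) unless still inside the hold-off."""
        if self.last_fired is not None and now - self.last_fired < self.holdoff:
            return False
        self.last_fired = now
        return True


def almost_full(free, capacity, min_free_fraction=0.05):
    """True when the free fraction drops below the limit (95% full by default)."""
    return free / capacity < min_free_fraction


alarm = Suppressor(4 * 3600)           # "alarm 4 hour ..."
for now in (0, 600, 1200, 4 * 3600):   # evaluations at 10 minute multiples
    if almost_full(free=4, capacity=100) and alarm.should_fire(now):
        print(f"t={now}s: root filesystem (almost) full")
```

With these made-up values the condition is true at every evaluation, but the message is printed only at t=0 and again once 4 hours have passed.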
The following rule requires a machine that supports the PCP environment metrics. If the machine environment temperature rises more than 2 degrees over a 10 minute interval, write an entry in the system log:
environ.temp @0 - environ.temp @1 > 2 -> alarm "temperature rising fast" & syslog "machine room temperature rise alarm";
And last, something interesting if you have performance problems with your Oracle database:
db = "oracle.ptg1";
host = ":moomba.melbourne.sgi.com";
lru = "#'cache buffers lru chain'";
gets = "$db.latch.gets $host $lru";
total = "$db.latch.gets $host $lru + $db.latch.misses $host $lru + $db.latch.immisses $host $lru";
$total > 100 && $gets / $total < 0.2 -> alarm "high lru latch contention";
To expedite rapid development of pmie rules, the pmieconf(1) tool provides a facility for generating a pmie configuration file from a set of generalized pmie rules. The supplied set of rules covers a wide range of performance scenarios.
The pmrules(1) tool provides a GUI-based facility for generating pmie rules from parametrized templates. The supplied templates cover a wide range of performance scenarios.
The development efforts of the PCP engineering team are focused on pmieconf rather than pmrules, and thus pmieconf is the recommended tool for quickly deploying useful pmie rules.
The Performance Co-Pilot User's and Administrator's Guide provides a detailed tutorial-style chapter covering pmie.
A pmie specification is a sequence of semicolon terminated expressions.
Basic operators are modeled on the arithmetic, relational and Boolean operators of the C programming language. Precedence rules are as expected, although the use of parentheses is encouraged to enhance readability and remove ambiguity.
Operands are performance metric names (see pmns(4)) and the normal literal constants.
Operands involving performance metrics may produce sets of values, as a result of enumeration in the dimensions of hosts, instances and time. Special qualifiers may appear after a performance metric name to define the enumeration in each dimension. For example,
kernel.percpu.cpu.user :foo :bar #cpu0 @0..2
defines 6 values corresponding to the time spent executing in user mode on CPU 0 on the hosts ``foo'' and ``bar'' over the last 3 consecutive samples. The default interpretation in the absence of : (host), # (instance) and @ (time) qualifiers is all instances at the most recent sample time for the default source of PCP performance metrics.
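The count of 6 comes from multiplying the sizes of the three dimensions. A small Python sketch of that enumeration (an illustration only; the variable names are not part of pmie):

```python
from itertools import product

hosts = ["foo", "bar"]       # :foo :bar
instances = ["cpu0"]         # #cpu0
sample_offsets = [0, 1, 2]   # @0..2, where 0 is the most recent sample

# One value per (host, instance, sample) combination.
value_slots = list(product(hosts, instances, sample_offsets))
print(len(value_slots))
for slot in value_slots:
    print(slot)
```

Two hosts, one instance and three samples give 2 x 1 x 3 = 6 values.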
Host and instance names that do not follow the rules for variables in programming languages, i.e. alphabetic optionally followed by alphanumerics, should be enclosed in single quotes.
Expression evaluation follows the law of ``least surprises''. Where performance metrics have the semantics of a counter, pmie will automatically convert to a rate based upon consecutive samples and the time interval between these samples. All expressions are evaluated in double precision, and where appropriate, automatically scaled into canonical units of ``bytes'', ``seconds'' and ``counts''.
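The counter-to-rate conversion described above amounts to dividing the change in value by the elapsed time between two consecutive samples. A minimal Python sketch, assuming timestamps in seconds and ignoring counter wrap (the function name is hypothetical, not a pmie or PCP API):

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time):
    """Rate of change of a counter between two consecutive samples."""
    interval = curr_time - prev_time
    if interval <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr_value - prev_value) / interval


# Hypothetical kernel.all.syscall counter values sampled 2 seconds apart.
rate = counter_rate(prev_value=120000, prev_time=10.0,
                    curr_value=131000, curr_time=12.0)
print(rate)  # syscalls per second
```

This is why a rule can compare a counter metric such as kernel.all.syscall directly against a threshold expressed in count/sec.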
A rule is a special form of expression that specifies a condition or logical expression, a special operator (->) and actions to be performed when the condition is found to be true.
The following table summarizes the basic pmie operators:
Aggregate operators may be used to aggregate or summarize along one dimension of a set-valued expression. The following aggregate operators map from a logical expression to a logical expression of lower dimension.
The following instantial operators may be used to filter or limit a set-valued logical expression, based on regular expression matching of instance names. The logical expression must be a set involving the dimension of instances, and the regular expression is of the form used by egrep(1) or the Extended Regular Expressions of regcomp(3G).
For example, the expression below will be ``true'' for disks attached to controllers 2 or 3 performing more than 20 operations per second:
match_inst "^dks[23]d" disk.dev.total > 20;
The following aggregate operators map from an arithmetic expression to an arithmetic expression of lower dimension.
The aggregate operators count_inst, count_host and count_sample map from a logical expression to an arithmetic expression of lower dimension by counting the number of set members for which the expression is true in the associated dimension.
For action rules, the following actions are defined:
Multiple actions may be separated by the & and | operators to specify, respectively, sequential execution (both actions are executed) and alternate execution (the second action is executed only if the first action returns a non-zero error status).
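The difference between the two operators can be modelled with ordinary functions that return an exit status. The Python sketch below is only an illustration of the described behaviour; the helper names and the choice to return the list of statuses are assumptions, not pmie internals.

```python
def run_sequential(first, second):
    """Model of `first & second`: both actions are always executed."""
    return [first(), second()]


def run_alternate(first, second):
    """Model of `first | second`: second runs only if first returns non-zero."""
    statuses = [first()]
    if statuses[0] != 0:
        statuses.append(second())
    return statuses


def succeed():
    """Hypothetical action that works (status 0)."""
    return 0


def fail():
    """Hypothetical action that fails (status 1)."""
    return 1


print(run_sequential(fail, succeed))  # both ran
print(run_alternate(succeed, fail))   # fallback skipped
print(run_alternate(fail, succeed))   # fallback ran
```

A typical use of alternate execution is a fallback, for example trying a preferred notification action first and writing to the system log only if it fails.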
Arguments to actions are an optional suppression time, and then one or more expressions (a string is an expression in this context). Strings appearing as arguments to an action may include the following special selectors that will be replaced at the time the action is executed.
Note that expansion of the special selectors is done by repeating the whole argument once for each unique binding to any of the qualifying special selectors. For example if a rule were true for the host mumble with instances grunt and snort, and for host fumble the instance puff makes the rule true, then the action
... -> shell myscript "Warning: %h-%i busy ";
will execute myscript with the argument string "Warning: mumble-grunt busy Warning: mumble-snort busy Warning: fumble-puff busy".
By comparison, if the action
... -> shell myscript "'Warning! busy:" " %i@%h" "'";
were executed under the same circumstances, then myscript would be executed with the argument string '"Warning! busy: grunt@mumble snort@mumble puff@fumble"'.
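The two expansions above can be reproduced with a short model. This Python sketch is an approximation of the described behaviour, not pmie's code: it handles only the %h and %i selectors used in these examples, assumes the arguments are joined without separators, and uses hypothetical function names.

```python
def expand_argument(template, bindings):
    """Repeat the whole argument once per (host, instance) binding.

    Arguments without a selector are constants and are returned unchanged.
    """
    if "%h" not in template and "%i" not in template:
        return template
    return "".join(
        template.replace("%h", host).replace("%i", inst)
        for host, inst in bindings
    )


def build_argument_string(arguments, bindings):
    """Expand each argument independently, then concatenate the results."""
    return "".join(expand_argument(arg, bindings) for arg in arguments)


# Bindings for which the rule is true, as in the example above.
bindings = [("mumble", "grunt"), ("mumble", "snort"), ("fumble", "puff")]

single = build_argument_string(["Warning: %h-%i busy "], bindings)
split = build_argument_string(["'Warning! busy:", " %i@%h", "'"], bindings)
print(repr(single))
print(repr(split))
```

In this model the single-argument form repeats the word "Warning:" for every binding and ends with the template's trailing space, while the three-argument form repeats only the short selector string between the constant prefix and postscript.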
The semantics of the expansion of the special selectors lead to a common usage, where the first argument is a constant (contains no special selectors), the second argument contains the desired special selectors with minimal separator characters, and an optional third argument provides a constant postscript (e.g. to terminate any argument quoting from the first argument). If necessary, post-processing (e.g. in myscript) can provide the necessary enumeration over each unique expansion of the string containing just the special selectors.
For complex conditions, the bindings to these selectors are not obvious. It is strongly recommended that pmie be used in the debugging mode (specify the -W command line option in particular) during rule development.
name = constexpr;
Where name follows the normal rules for variables in programming languages, i.e. alphabetic optionally followed by alphanumerics. constexpr must be a constant expression, either a string (enclosed in double quotes) or an arithmetic expression optionally followed by a scale factor.
Macros are expanded when their name, prefixed by a dollar ($) appears in an expression, and macros may be nested within a constexpr string.
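Nested macro expansion can be pictured as repeated textual substitution. The Python sketch below is a simplified model and not pmie's lexer: it assumes macro names are a letter followed by letters or digits, leaves unknown names (such as reserved macros) untouched, and uses a hypothetical depth limit to stop on cyclic definitions. The macro values are taken from the Oracle example earlier in this page.

```python
import re

# A macro reference: '$' followed by an alphabetic, then alphanumerics.
MACRO_REF = re.compile(r"\$([A-Za-z][A-Za-z0-9]*)")


def expand_macros(text, macros, depth=0):
    """Replace $name references in text, expanding nested macros recursively."""
    if depth > 20:
        raise RecursionError("macro nesting too deep (possible cycle)")

    def substitute(match):
        name = match.group(1)
        if name not in macros:
            return match.group(0)  # leave unknown names as written
        return expand_macros(macros[name], macros, depth + 1)

    return MACRO_REF.sub(substitute, text)


macros = {
    "db": "oracle.ptg1",
    "host": ":moomba.melbourne.sgi.com",
    "lru": "#'cache buffers lru chain'",
    "gets": "$db.latch.gets $host $lru",
}
print(expand_macros("$gets > 0", macros))
```

Here $gets expands to a string that itself contains $db, $host and $lru, which are expanded in turn to give a fully qualified metric specification.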
The following reserved macro names are understood.
Dates and times are presented in the reporting time zone (see description of -Z and -z command line options above).
pmie fits into this model when rules use the syslog action. Note that if the action string begins with -p (priority) and/or -t (tag) then these are extracted from the string and treated in the same way as in logger(1) and pcp-eventlog(1).
However, it is common to have other event monitoring frameworks as well, into which you may wish to incorporate performance events from pmie. You can often use the shell action to send events to these frameworks, as they usually provide their own program for injecting events into the framework from external sources.
A final option is the stomp (Streaming Text Oriented Messaging Protocol) action, which allows pmie to connect to a central JMS (Java Message Service) server and send events to the PMIE topic. Tools can be written to extract these text messages and present them to operations people (via desktop popup windows, etc). Use of the stomp action requires a stomp configuration file to be set up, which specifies the location of the JMS server host, port number, and username/password.
The format of this file is as follows:
host=messages.sgi.com   # this is the JMS server (required)
port=61616              # and it's listening here (required)
timeout=2               # seconds to wait for server (optional)
username=joe            # (required)
password=j03ST0MP       # (required)
topic=PMIE              # JMS topic for pmie messages (optional)
The timeout value specifies the time (in seconds) that pmie should wait for acknowledgements from the JMS server after sending a message (as required by the STOMP protocol). Note that on startup, pmie will wait indefinitely for a connection, and will not begin rule evaluation until that initial connection has been established. Should the connection to the JMS server be lost at any time while pmie is running, pmie will attempt to reconnect on each subsequent truthful evaluation of a rule with a stomp action, but not more than once per minute. This is to avoid contributing to network congestion. In this situation, where the STOMP connection to the JMS server has been severed, the stomp action will return a non-zero error value.
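A file in this format can be sanity-checked before pmie is started. The Python sketch below is a hypothetical checker, not part of PCP: it assumes one key=value setting per line, assumes that '#' starts a comment (so values cannot contain '#'), and treats host, port, username and password as required, as the comments in the example indicate.

```python
REQUIRED_KEYS = ("host", "port", "username", "password")

SAMPLE = """\
host=messages.sgi.com   # this is the JMS server (required)
port=61616              # listening port (required)
timeout=2               # seconds to wait for server (optional)
username=joe            # (required)
password=j03ST0MP       # (required)
topic=PMIE              # JMS topic for pmie messages (optional)
"""


def parse_stomp_config(text):
    """Parse key=value lines into a dict and check the required settings."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comment (assumption)
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {raw_line!r}")
        settings[key.strip()] = value.strip()
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    return settings


settings = parse_stomp_config(SAMPLE)
print(settings["host"], settings["port"])
```

Values are kept as strings; a caller that needs the port or timeout as numbers would convert them after parsing.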