[pcp] Introducing pmrep - Performance Metrics Reporter
Marko Myllynen
myllynen at redhat.com
Wed Sep 23 05:03:01 CDT 2015
Hi!
Today, there's too much complexity in using performance tools. Sure, one
can become an expert after investing a lot of time in studying different
tools, understanding the basics of each subsystem they're meant to
monitor, figuring out the appropriate metrics to monitor, and resolving
the tool set needed to display all the relevant metrics.
But in many cases this is not really needed. There are two large
groups with similar requirements for tools. The first are those who
want to monitor and understand some set of application level metrics
and the associated system level metrics (for example, an application,
CPU usage, and disk IO). The second are those who want to see the
overall system status and, if the need arises, dig deeper into a
subsystem or a certain area of the environment (for example,
networking, a file system, or a supporting system level service like
NFS or LDAP).
The common denominator here is that these groups don't merrily go around
peeking at random performance metrics and testing different tools.
Rather, after establishing a rough understanding of a situation, it's
often a limited set of metrics they're interested in. And in both cases
once such a set of metrics has been identified, it is likely to be
helpful for themselves and others in the future, too. However, currently
even with such a set of metrics defined, monitoring it might involve
using various standard tools in combination (like sar+vmstat or such),
making things unnecessarily complex. And many of these tools support a
live monitoring mode with fixed output units only (bytes it will be,
even with your multi-GB/multi-TB system).
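To make the fixed-unit problem concrete, here is a minimal Python sketch of scale conversion. The helper name scale_bytes and the choice of 1024-based multiples are assumptions made for this illustration; none of it is taken from the tools mentioned above.

```python
# Hypothetical helper, for illustration only: express a raw byte count
# in a larger unit. The 1024-based multiples are an assumption.
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}


def scale_bytes(value, unit="MB"):
    """Return a byte count expressed in the requested unit."""
    # float() keeps the division exact-ish under both Python 2 and 3.
    return float(value) / UNITS[unit]


free_bytes = 16 * 1024 ** 3          # 16 GB of free memory, as raw bytes
print(free_bytes)                    # the hard-to-read fixed-unit form
print("%.1f GB" % scale_bytes(free_bytes, "GB"))
```

The raw value is an eleven-digit number, while the scaled one is readable at a glance, which is the point of letting the user pick the unit.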
PCP already provides access to virtually all metrics provided by all the
most commonly used performance tools. Unfortunately, the current PCP
tooling is somewhat suboptimal; pmprobe(1)/pminfo(1) are one-shot tools
and pmval(1) can monitor only one metric at a time. pmdumptext(1) has
some nice features, like the possibility to define several metrics to
monitor, live or archive mode, or to use separate configuration files to
monitor predefined metric sets. But it does not support scale
conversion, customized output, a single configuration file, a recording
mode, or combining different metric sets, and being implemented in
C++/Qt it's not very hackable or extendable for most people.
pmrep is a new PCP command line tool implemented in Python to support
all the above features and more. With its per-metric level customization
capabilities, it allows quickly defining new metric sets for monitoring
or recording.
Perhaps the following examples best illustrate its usage. The
standard PCP metrics are denoted as usual (like mem.util.free) while
custom metric sets are prefixed with a colon (like :hpc-app1-io).
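The naming convention is simple enough to sketch in a few lines of Python. This is only an illustration of the colon-prefix rule described above, not pmrep's actual argument handling; the function name split_specs is hypothetical.

```python
# Illustration of the colon-prefix convention only; split_specs is a
# hypothetical name and this is not how pmrep itself is implemented.
def split_specs(args):
    """Separate plain PCP metric names from ':'-prefixed metric sets."""
    metrics, metric_sets = [], []
    for arg in args:
        if arg.startswith(":"):
            metric_sets.append(arg[1:])   # drop the leading colon
        else:
            metrics.append(arg)
    return metrics, metric_sets


metrics, metric_sets = split_specs(["mem.util.free", ":hpc-app1-io"])
print(metrics)        # plain metric names
print(metric_sets)    # metric set names, to be looked up in a config file
```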
The first example is very simple and displays network interface statistics:
$ pmrep network.interface.total.bytes
Next, we display per-device disk reads and writes from the host
"server1" using a two-second interval and CSV output format:
$ pmrep -h server1 -o csv -t 2s disk.dev.read disk.dev.write
Then how about displaying timestamped vmstat-like information using MBs
instead of bytes and also including the number of in-use inodes? Easy:
$ pmrep -p -b MB vfs.inodes.count :vmstat
Or sar -w and sar -W information at the same time from the PCP archive
./20150921.09.13, showing values recorded between 3 PM and 5 PM?
$ pmrep -a ./20150921.09.13 -S @15:00 -T @17:00 :sar-w :sar-W
If there's no time for in-depth investigation before reboot^Wrecovery
procedures, capturing the most essential data for later analysis might
be crucial. For this, pmrep allows instant collection of metric sets to
standard PCP archives. The following example collects all 389 Directory
Server, XFS file system, and cpu/disk/memory related statistics every 5
seconds for the next 5 minutes to a PCP log archive ./dump:
$ pmrep -o archive -F ./dump -t5s -R5m ds389 xfs kernel.all.cpu disk mem
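As a quick sanity check on how much data such a run gathers, the arithmetic can be sketched in Python. The to_seconds helper is hypothetical and handles only the simple "5s" / "5m" forms used above; it is not how pmrep or PCP parse intervals.

```python
# Simplified, hypothetical parser for interval strings such as "5s" or
# "5m"; real PCP interval parsing accepts many more forms.
SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600}


def to_seconds(spec):
    """Convert a '<number><unit>' string to seconds."""
    return int(spec[:-1]) * SECONDS_PER_UNIT[spec[-1]]


interval = to_seconds("5s")   # the -t value from the example
runtime = to_seconds("5m")    # the -R value from the example
print(runtime // interval)    # whole sampling intervals in the run
```

So the run covers 60 sampling intervals per metric instance, which gives a rough idea of the archive size to expect.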
Ok, that's enough of marketing :-)
So far I've developed pmrep on my own but now I'd be interested to hear
whether there's interest in including pmrep as part of PCP proper. If
so, I think we could finalize the user interface first (which I think
is in good shape already), then I could write a pmrep man page (and I
have a slight suspicion Nathan would like to see a QA test or two :).
The code is a bit less than 1k LOC so while not tiny anymore it's
still smaller than some of the current tools. It could probably be
modularized and rearranged a bit here and there and perhaps the
internal data structures could be reviewed, but so far the internals
haven't become an obstacle. (I learned about the existence of pmcc.py
only after writing most of the code, so in theory it could have helped
a bit, but OTOH using standard Python structures instead of lots of
PCP/PMAPI layers in the middle might keep the code easier to approach
for drive-by contributors. YMMV.)
I've also collected some development ideas, mainly aimed at a more
pleasant user experience, at the top of the script. For example,
something like bash/zsh completion is certainly not a technically
ground-breaking feature but it would make usage much smoother for the
uninitiated: think of typing pmrep vfs<TAB> or pmrep :<TAB> and seeing
all the available VFS metrics or metric sets with their descriptions.
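The matching behind such a completion is plain prefix filtering, sketched below in Python. The candidate list is made up for the example (apart from names already used in this mail), the function name complete is hypothetical, and a real implementation would have to fetch the names from PCP and the pmrep configuration file and hook into the shell.

```python
# Sketch of the prefix matching a completion helper would need. The
# candidate names below are example data, not a real metric listing.
def complete(prefix, candidates):
    """Return the candidates that start with the typed prefix, sorted."""
    return sorted(c for c in candidates if c.startswith(prefix))


candidates = [
    "vfs.inodes.count", "vfs.files.count", "mem.util.free",
    ":vmstat", ":sar-w", ":sar-W",
]
print(complete("vfs", candidates))   # what "pmrep vfs<TAB>" might offer
print(complete(":", candidates))     # what "pmrep :<TAB>" might offer
```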
And since pmrep should be easily extendable, it'd be interesting to hear
whether some user-friendly features could be added to lower the entry
barrier for cases like PCP+SystemTap.
I've only tested this on RHEL 7 / PCP 3.10.6 / Python 2.7 so use at your
own risk on other platforms.
https://myllynen.fedorapeople.org/pmrep.py
https://myllynen.fedorapeople.org/pmrep.conf
Comments and feedback are welcome.
Thanks,
--
Marko Myllynen