Hi Nathan, thanks for the quick feedback! Comments below.
On 10/12/2010 05:09 AM, nathans@xxxxxxxxxx wrote:
Hi,
We are currently running collectl on our nodes here and collect data
from the moab/torque logs to gather statistics about what is happening
on our clusters. I currently generate heatmaps and various other charts
using some tools I've written (they are open source so let me know if
you are interested). I was wondering if PCP might provide us with any
additional statistics beyond what we are getting via collectl and moab.
I've not come across moab before, can't really comment there (google
is not helping me - got a pointer?). I have had a look at collectl in
the past though, which is a 6000+ line perl script designed primarily
(solely?) for Linux kernel metrics. It lays claim to being "better than
sar" which no doubt it has achieved.
Moab is a policy based job scheduler for clusters. It schedules jobs
and provides logs with details about how resources were allocated. I
collect and graph this information and am working on comparing and
contrasting what was allocated with what was used (as reported by
collectl). One of the things I'm especially interested in doing is
looking at resource allocation vs resource usage on a per job basis.
Right now I can use the moab logs and collectl logs to trace process
statistics and tie them back to specific users, but it is very difficult
to tie processes back to their associated moab jobs.
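To make the difficulty concrete, here is a minimal sketch of the kind of
heuristic matching involved, assuming each job record carries a user, a set
of nodes, and start/end times, and each process sample carries a user, a node,
and a timestamp. All of the record layouts and field names below are made up
for illustration; they are not the actual moab or collectl log formats.

```python
from collections import namedtuple

# Hypothetical record layouts, NOT the real moab/collectl log formats.
Job = namedtuple("Job", "job_id user nodes start end")
Sample = namedtuple("Sample", "pid user node timestamp")


def candidate_jobs(sample, jobs):
    """Return the ids of every job that could own this process sample.

    A job is a candidate if it belongs to the same user, was allocated
    the node the process ran on, and was running at the sample time.
    """
    return [
        job.job_id
        for job in jobs
        if job.user == sample.user
        and sample.node in job.nodes
        and job.start <= sample.timestamp <= job.end
    ]


jobs = [
    Job("1001", "alice", {"node01", "node02"}, 0, 3600),
    Job("1002", "alice", {"node02"}, 1800, 5400),
    Job("1003", "bob", {"node02"}, 0, 7200),
]
sample = Sample(pid=4242, user="alice", node="node02", timestamp=2000)
matches = candidate_jobs(sample, jobs)
```

When a user has overlapping jobs on the same node, as in this example, more
than one job id comes back. That ambiguity is what makes the per-job
accounting hard with only user, node, and time to go on.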
PCP is a cross-platform extensible framework. So, it supports kernel
metrics from Linux, Windows, Solaris, Mac OS X, and other kernels.
It has an extensible architecture and is focused not only on the kernel
but on the entire system - there are collection mechanisms for many
subsystems that ship out-of-the-box with PCP, and an ideal production
deployment would extend that set with custom site-specific metrics.
Some of the areas PCP ships optional metrics for are mysql, named, samba,
apache, kvm, netfilter, vmware, sendmail, web logs, lmsensors, cisco,
lustre, pdns, zimbra, memcache, postfix... so, to answer your question:
yes, PCP will provide you with additional statistics beyond what you have
now with collectl.
A lot of the things you mention here sound like they would be extremely
useful for monitoring a data center with various servers fulfilling
different roles (database servers, web servers, mail servers, DNS, etc).
Is this kind of the direction PCP is going? Our infrastructure group
would be very interested in this but I'm not sure PCP is quite the right
fit for what I'm trying to do.
FWIW, even out-of-the-box PCP supports many hundreds of Linux kernel
metrics (PCP has an architecture which lends itself to extension):
$ cd src/pmdas/linux
$ pminfo -n root_linux | wc -l
744
So, that's 744 kernel metrics out of the box. The set grows if you make
use of cgroups, and some kernel metrics are in separate PMDAs - infiniband,
Lustre, etc.
I've included some samples below of the kind of information that I'm
gathering and plotting now:
http://www.msi.umn.edu/~mark/msica/mirror_20090928-20100927.png
http://www.msi.umn.edu/~mark/msica/2GB-block_64MB_directIO_posix_nocache.png
Interesting stuff. We've spoken about heat maps before and it's an area of
interest for some - the native PCP charting tool, pmchart, doesn't support
heatmaps at this time, but Qwt on which it is built does ... so it's a Simple
Matter of Coding to add that functionality into pmchart - I'm interested in
seeing someone tackle that and attempt to solve it in a generic way (i.e.
heatmaps for arbitrary metrics & completely runtime configurable like the
other chart types there).
I doubt it will really help you, but feel free to take a look at how I
am doing heatmaps in msica (the software I've been writing). Right now
I only really deal with time-series data but I'm working on changing
that. I'm still putting documentation together so if you have any
questions feel free to ask.
http://code.google.com/p/msica
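For reference, a generic way to turn time-series samples into a heatmap grid
is to count samples per (value bin, time bin) cell. The sketch below is
illustrative only and is not taken from msica; the function name, parameters,
and sample data are all made up for the example.

```python
def bin_time_series(samples, t_start, t_end, n_time_bins,
                    v_min, v_max, n_value_bins):
    """Count (timestamp, value) samples into a value-by-time grid.

    Rows are value bins (low to high), columns are time bins (early to
    late). Samples outside the given ranges are skipped.
    """
    grid = [[0] * n_time_bins for _ in range(n_value_bins)]
    t_width = (t_end - t_start) / n_time_bins
    v_width = (v_max - v_min) / n_value_bins
    for t, v in samples:
        if not (t_start <= t <= t_end and v_min <= v <= v_max):
            continue
        # Clamp so a sample exactly on the upper edge lands in the last bin.
        col = min(int((t - t_start) / t_width), n_time_bins - 1)
        row = min(int((v - v_min) / v_width), n_value_bins - 1)
        grid[row][col] += 1
    return grid


# Made-up (seconds, percent-utilisation) samples.
samples = [(0, 10.0), (5, 12.0), (15, 90.0), (19, 95.0), (20, 100.0)]
grid = bin_time_series(samples, 0, 20, 2, 0.0, 100.0, 2)
```

Each cell count can then be mapped to a colour by whatever plotting layer is
in use; the binning itself does not depend on the charting toolkit.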
Mark
--
Mark Nelson, Lead Software Developer
Minnesota Supercomputing Institute
Phone: (612)626-4479
Email: mark@xxxxxxxxxxx