Hi Joe,
----- Original Message -----
> Hi,
>
> We have been using the linux_proc pmda to monitor the processes on our
> computing cluster. The cluster uses SLURM as the resource manager and is
> configured to use cgroups to isolate the processes for different users on a
> given processing node. SLURM creates cgroups for each job step and then
> removes them after the job ends. There are approximately 7 thousand jobs per
> day on the cluster and usually a couple of job steps in each job. Job
> duration ranges from a few seconds to 72 hours. Therefore, the active
> cgroups on each node change regularly.
Yep - I've recently been hacking on cgroups too, and ran into similar
kinds of issues. Another problem is that the way group names are encoded
in the metric namespace means not all cgroups can be represented that
way, and those that can't are silently ignored as well.
I tend to agree that using an instance domain is the right way to go here.
The backward compatibility scenario is a concern, but there are things we can
do. One approach would be to use a differently named metric hierarchy, so
not cgroup.groups.* (which we could drop when we move to the new mechanism;
the PMAPI provides clear error handling semantics for unavailable metric
names) - perhaps cgroup.pergroup.* or something like that.
One big fly in the ointment is that the dev branch has moved ahead of your
patch, and blkio cgroup metrics have been added. These metrics have an
instance domain too (disks), so that complication has grown. That indom is
slightly problematic as well, since the PMDA has to transform each device's
major:minor pair into the external disk name people expect to see.
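For the record, the transformation is essentially "find the matching
major:minor pair and take the corresponding device name". A rough sketch of
one way to do that lookup via /proc/partitions follows (hypothetical helper
name, not necessarily how the dev branch actually does it):

/* Sketch: resolve a major:minor pair to a disk name by scanning
 * /proc/partitions ("major minor #blocks name" per entry line). */
#include <stdio.h>
#include <string.h>

static int devname_from_majmin(unsigned int major, unsigned int minor,
                               char *name, size_t namelen)
{
    FILE *fp = fopen("/proc/partitions", "r");
    char line[256], devname[128];
    unsigned int maj, min;
    unsigned long long blocks;

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (sscanf(line, "%u %u %llu %127s", &maj, &min, &blocks, devname) != 4)
            continue;       /* skip the header and blank lines */
        if (maj == major && min == minor) {
            strncpy(name, devname, namelen - 1);
            name[namelen - 1] = '\0';
            fclose(fp);
            return 0;
        }
    }
    fclose(fp);
    return -1;              /* no matching device found */
}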
> [...]
> In order to tackle these problems, I changed the linux_proc pmda so that the
> cgroups code uses instance domains. This solves the above issues: there is no
> limit on the number of cgroups, the logger infrastructure already knows how to
> process data when instance domains change, and the pmdaCache is used to
> guarantee that the instance ids for each cgroup remain constant.
*nod*, good approach.
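For the archives, the usual pattern for a dynamic indom backed by the
pmdaCache looks roughly like this (a minimal sketch with made-up names, not
your patch verbatim):

/* Sketch of a cgroup indom refresh using the pmdaCache, so instance
 * ids stay stable for as long as a cgroup name is known. */
#include <stdio.h>
#include <pcp/pmapi.h>
#include <pcp/pmda.h>

static void refresh_cgroup_indom(pmInDom indom, const char **groups, int ngroups)
{
    int i, inst;

    /* mark every cached instance inactive to start with ... */
    pmdaCacheOp(indom, PMDA_CACHE_INACTIVE);

    /* ... then (re)activate the cgroups found in this refresh; the
     * cache hands back the same instance id for a name it has seen
     * before, which is what keeps the ids constant */
    for (i = 0; i < ngroups; i++) {
        inst = pmdaCacheStore(indom, PMDA_CACHE_ADD, groups[i], NULL);
        if (inst < 0)
            fprintf(stderr, "pmdaCacheStore(%s): %s\n",
                    groups[i], pmErrStr(inst));
    }

    /* persist the name<->id mapping across PMDA restarts */
    pmdaCacheOp(indom, PMDA_CACHE_SAVE);
}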
> Disadvantages of this approach:
> 1) It's a non-backwards-compatible change to a core PCP component.
> 2) I was not able to keep the format of the usage_percpu metric. In the
> original implementation, this used the CPU_INDOM to report metrics for each
> cpu. Since you can't have nested instance domains, I chose to serialise the
> percpu data into a string.
>
> I've attached a patch file with all my changes (this can be used to patch the
> main code using "patch -p4" from the src/pmdas directory). Would you
> consider making this non-backwards-compatible change to the linux_proc pmda
> or does it make more sense to create a new linux_cgroups pmda? Also any
> thoughts as to alternative ways to expose the percpu metrics for each
> cgroup?
>
(see detailed discussion below)
> P.S. The patch also includes support for grabbing the process environment and
> the set of allowed cpus for all processes. These changes are independent of
> the cgroup changes.
(yep - what state are those changes in? Could you send them separately, with
an outline describing the changes in more detail? thanks.)
> P.P.S. Obviously the QA tests and documentation also need updating, but the
> scope of these updates will depend on whether the existing pmda gets updated
> or a new pmda created.
I think we should fix what we have, and move to the much-better-sounding
approach you have outlined here, with some indom handling tweaks.
The first issue will be moving your patch forward to the current dev branch,
where the earth has moved significantly under your feet.
Then there's the instance naming problem, for some metrics. One possible way
to go there would be to append the current instance name onto the group name,
with an unlikely-to-conflict separator (so, not a space, nor a slash).
$ pminfo -f cgroup.groups.cpuacct.libvirt.lxc.usage_percpu
cgroup.groups.cpuacct.libvirt.lxc.usage_percpu
inst [0 or "cpu0"] value 0
inst [1 or "cpu1"] value 0
inst [2 or "cpu2"] value 0
inst [3 or "cpu3"] value 0
inst [4 or "cpu4"] value 0
inst [5 or "cpu5"] value 0
inst [6 or "cpu6"] value 0
inst [7 or "cpu7"] value 0
$ pminfo -f cgroup.groups.blkio.io_wait_time.total
cgroup.groups.blkio.io_wait_time.total
inst [0 or "sda"] value 32426373113
inst [3 or "sdb"] value 14667215520290
inst [5 or "sr0"] value 0
So, we could move to instance names like "libvirt/lxc::cpu0" for the
cgroup.pergroup.cpuacct metrics, and "libvirt/lxc::sda", "libvirt/lxc::sdb",
and so on for cgroup.pergroup.blkio ...?
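i.e. the external instance name would just be the two parts joined by the
separator, something like this (sketch only, hypothetical helper):

/* Build the proposed compound external instance name. */
#include <stdio.h>

static void compound_inst_name(const char *group, const char *inst,
                               char *buf, size_t buflen)
{
    /* "::" should not appear in cgroup paths or in cpu/disk names,
     * so the two components can be split apart again if needed */
    snprintf(buf, buflen, "%s::%s", group, inst);
}

/* e.g. compound_inst_name("libvirt/lxc", "cpu0", buf, sizeof(buf))
 * yields "libvirt/lxc::cpu0" */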
cheers.
--
Nathan