
Proposed updates to cgroup support in the linux_proc PMDA

To: "pcp@xxxxxxxxxxx" <pcp@xxxxxxxxxxx>
Subject: Proposed updates to cgroup support in the linux_proc PMDA
From: "White, Joseph" <jpwhite4@xxxxxxxxxxx>
Date: Tue, 1 Jul 2014 17:01:15 +0000
Hi,

We have been using the linux_proc PMDA to monitor the processes on our computing cluster. The cluster uses SLURM as the resource manager and is configured to use cgroups to isolate the processes of different users on a given processing node. SLURM creates cgroups for each job step and then removes them after the job ends. There are approximately 7,000 jobs per day on the cluster, usually with a couple of job steps per job, and job duration ranges from a few seconds to 72 hours. As a result, the set of active cgroups on each node changes regularly.

We found that the cgroups implementation in the linux_proc PMDA didn't function properly in this scenario. The main issues are:

- It is not unusual to have more active cgroups than the PMDA's limit of 32, which causes information loss.
- When the cgroups change, the PMDA rebuilds the namespace, but there is no mechanism to notify existing pmlogger instances that the namespace has changed. This causes information loss, as the new or updated namespace names do not get stored in the log.
- When a single cgroup is added or removed, all of the PMIDs in the namespace can change. This causes information loss, as the log entries for one cgroup can be incorrectly associated with another.

To tackle these problems, I changed the linux_proc PMDA so that the cgroups code uses instance domains. This solves the issues above: there is no limit on the number of cgroups, the logger infrastructure already knows how to process data when instance domains change, and the pmdaCache is used to guarantee that the instance IDs for each cgroup remain constant.
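For illustration only, here is roughly how the pmdaCache keeps the instance identifiers stable across refreshes. This is a simplified sketch rather than code lifted from the patch; cgroup_indom and cgroup_indom_refresh() are made-up names:

#include <stdio.h>
#include <pcp/pmapi.h>
#include <pcp/pmda.h>

/* assumed to have been set up via pmdaInit() elsewhere */
static pmInDom cgroup_indom;

static void
cgroup_indom_refresh(const char **paths, int npaths)
{
    int i, sts;

    /* mark every cached cgroup inactive; entries keep their instance ids */
    pmdaCacheOp(cgroup_indom, PMDA_CACHE_INACTIVE);

    for (i = 0; i < npaths; i++) {
        /* add or re-activate this cgroup; the cache hands back the same
         * instance id it assigned the first time the name was seen */
        sts = pmdaCacheStore(cgroup_indom, PMDA_CACHE_ADD, paths[i], NULL);
        if (sts < 0)
            fprintf(stderr, "pmdaCacheStore(%s): %s\n", paths[i], pmErrStr(sts));
    }

    /* persist the name<->instance id mapping across PMDA restarts */
    pmdaCacheOp(cgroup_indom, PMDA_CACHE_SAVE);
}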

Disadvantages of this approach:
1) It's a non-backwards-compatible change to a core PCP component.
2) I was not able to keep the format of the usage_percpu metric. In the original implementation this used the CPU_INDOM to report a value for each CPU. Since instance domains cannot be nested, I chose to serialise the per-CPU data into a string (roughly as sketched below).
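To show what I mean by serialising, something along these lines (a hypothetical sketch, not the exact code in the patch) turns the per-CPU counters for one cgroup into a single space-separated string value:

#include <stdio.h>
#include <stdint.h>

/* encode ncpu per-cpu usage counters as a space-separated string;
 * returns the number of characters written into buf */
static int
usage_percpu_to_string(char *buf, size_t buflen, const uint64_t *usage, int ncpu)
{
    int i, len = 0;

    buf[0] = '\0';
    for (i = 0; i < ncpu && len < (int)buflen; i++)
        len += snprintf(buf + len, buflen - len,
                        i ? " %llu" : "%llu", (unsigned long long)usage[i]);
    return len;
}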

I've attached a patch file with all my changes (it can be applied with "patch -p4" from the src/pmdas directory). Would you consider making this non-backwards-compatible change to the linux_proc PMDA, or does it make more sense to create a new linux_cgroups PMDA? Also, any thoughts on alternative ways to expose the per-CPU metrics for each cgroup?

Joe

P.S. The patch also includes support for grabbing the process environment and the set of allowed CPUs for all processes. These changes are independent of the cgroup changes.
P.P.S. Obviously the QA tests and documentation also need updating, but the scope of those updates will depend on whether the existing PMDA is updated or a new PMDA is created.
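For the allowed-CPUs part of the P.S., the idea is simply to pick the value out of /proc/<pid>/status. A rough sketch (assumed, not copied from the patch) would be:

#include <stdio.h>
#include <string.h>

/* copy e.g. "0-3,8-11" from /proc/<pid>/status into buf;
 * returns 0 on success, -1 on error */
static int
read_cpus_allowed(int pid, char *buf, size_t buflen)
{
    char path[64], line[256];
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%d/status", pid);
    if ((fp = fopen(path, "r")) == NULL)
        return -1;
    buf[0] = '\0';
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (strncmp(line, "Cpus_allowed_list:", 18) == 0) {
            char *p = line + 18;
            while (*p == ' ' || *p == '\t')    /* skip whitespace after the tag */
                p++;
            p[strcspn(p, "\n")] = '\0';        /* strip the trailing newline */
            snprintf(buf, buflen, "%s", p);
            break;
        }
    }
    fclose(fp);
    return buf[0] ? 0 : -1;
}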


Attachment: linux_proc_cgroups.patch
