Having recently looked into the requirements for supporting Linux kernel
container technologies, I propose here an extension to the way in which pmcd
operates. This extension would also provide a mechanism for tackling a couple
of other pre-existing issues in pmcd, so hopefully the fairly small bump in
complexity is worth the cost.
Firstly, a bit of background. Linux container technologies like Docker, LXC,
Linux-VServer, lmctfy, OpenVZ and others make extensive use of two kernel
features in particular - cgroups and namespaces. We have for some time had
an initial implementation providing cgroup metrics, via the cgroup.* metric
tree that the pmda_proc(1) agent exports. However, we currently have no
support for the concept of namespaces at all; this proposal aims to provide
a mechanism whereby we could start to give PMDAs the ability to set the
namespace(s) they use.
The concept of namespaces is important to us because we'd like to be able
to peek into a container to obtain performance information about it.
Important stuff, like the hostname the container presents to its processes
(pmcd.hostname, kernel.uname.*), the filesystem mount namespace (filesys.*),
the networking namespace (network.interface.*), the IPC namespace (ipc.*),
and the running process (PID) namespace - these are all namespaces that can
and will differ within the container.
Ideally, we would not force the installation of PCP inside the containers,
but instead provide mechanisms whereby client tools can obtain performance
data about the environment within a container, from the containing host.
The attached little programs are closely based on the examples in the
clone(2) and setns(2) man pages; these are the two fundamental system calls
for manipulating namespaces. The programs show the sort of things that must
be done for a process to change its reporting namespace(s)...
[nathans@boing namespaces]$ sudo ./newuts bizarro &
[1] 31609
[nathans@boing namespaces]$ clone() returned 31614
uts.nodename in child: bizarro
uts.nodename in parent: boing
Fundamentally, we need an open file descriptor on a procfs file
(/proc/pid/ns/{ipc,mnt,net,pid,uts}) belonging to a contained process, so
that a PCP collector can (temporarily) switch its reporting namespace to
that of the container, allowing clients to request values specific to a
named container running on a PCP collector host.
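As a concrete (if simplified) illustration of that flow, here's a sketch of
the per-fetch dance a collector could do for the UTS namespace. The function
name is mine, not anything that exists in PCP today, and it assumes the
procfs file can actually be opened - which is exactly the privilege problem
discussed next:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/utsname.h>

int
sample_container_nodename(pid_t pid, char *buf, size_t buflen)
{
    char path[64];
    struct utsname uts;
    int self, target, sts = -1;

    /* remember our own UTS namespace so we can switch back afterwards */
    self = open("/proc/self/ns/uts", O_RDONLY);
    snprintf(path, sizeof(path), "/proc/%d/ns/uts", (int)pid);
    target = open(path, O_RDONLY);    /* needs privilege, or a passed-in fd */

    if (self >= 0 && target >= 0 && setns(target, CLONE_NEWUTS) == 0) {
        if (uname(&uts) == 0) {       /* now reports the container's view */
            snprintf(buf, buflen, "%s", uts.nodename);
            sts = 0;
        }
        setns(self, CLONE_NEWUTS);    /* ... and restore our own namespace */
    }
    if (self >= 0)
        close(self);
    if (target >= 0)
        close(target);
    return sts;
}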
Thus we hit our first stumbling block. Since pmcd runs as an unprivileged
process, we will often not be able to open these files in order to briefly
assume the namespaces of the container for metrics which would differ inside
it. There is a secondary, related issue: many PMDAs also run as unprivileged
processes, and they may wish to call setns(2) for their own metrics (IOW,
beyond the scope of the pmcd(1) process, which hosts pmda_linux.so for the
metrics above). Naturally, it would be highly undesirable to require all
PMDAs (and pmcd) to run as root once again just to manipulate namespaces.
The setns(2) man page documents a mechanism we could use to tackle this
problem, however: if the procfs file is opened by one (privileged) process,
the file descriptor can be passed across an AF_UNIX socket using SCM_RIGHTS,
to be used by an unprivileged pmcd or PMDA process.
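For reference, that is the standard sendmsg(2)/recvmsg(2) ancillary-data
dance; something along these lines (helper names are mine, error handling
kept to a minimum):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int
send_fd(int sock, int fd)
{
    char data = 'F';
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;          /* pass the descriptor itself */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

static int
recv_fd(int sock)
{
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
    struct cmsghdr *cmsg;
    int fd = -1;

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;                             /* the kernel hands the receiver
                                              its own duplicate of the fd */
}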
To fit this into the PCP collector system of pmcd and PMDAs, I propose we
fork(2) pmcd early during its startup, have one process continue as today
(continuing on to make itself unprivileged) and have the other co-process
continue to run as root.
This co-process would not accept any network connections, and would have a
tightly-defined, restricted interface to the primary pmcd. It's not clear
whether it would be best for this co-process to exec(2) a helper or to
continue on with the pmcd code - I suspect the latter may be simpler. The
co-process would communicate with the primary pmcd via an AF_UNIX socket
only, and would be responsible for resolving container names and the
associated namespace requests into file descriptors. This same socket would
be available for any PMDA to use - as mentioned above, daemon PMDAs may have
similar needs to the in-process DSO PMDAs. New APIs in libpcp_pmda could
hide these differences, as well as the internal details of the socket
communication, and would be able to return file descriptors for a requested
namespace.
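To make that last point concrete, the kind of interface I have in mind might
look something like the following - these names are purely hypothetical,
nothing like them exists in libpcp_pmda today:

/*
 * Purely hypothetical API sketch, shown only to illustrate the sort of
 * interface that could hide the co-process socket from PMDAs.
 */
#include <sys/types.h>

/* namespace selector bits, mirroring the /proc/<pid>/ns entries */
#define PMDA_NS_IPC  (1<<0)
#define PMDA_NS_MNT  (1<<1)
#define PMDA_NS_NET  (1<<2)
#define PMDA_NS_PID  (1<<3)
#define PMDA_NS_UTS  (1<<4)

/*
 * Ask the privileged co-process (via its AF_UNIX socket) for open
 * namespace descriptors belonging to the named container; one fd is
 * returned per bit set in nsflags, in the caller-supplied array.
 */
extern int pmdaOpenContainerNamespaces(const char *container,
                                       int nsflags, int *fds, int maxfds);

/* enter/leave the namespaces around a fetch, restoring our own on exit */
extern int pmdaEnterNamespaces(int *fds, int nfds);
extern int pmdaLeaveNamespaces(void);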
From the client tool and user perspective, the starting point would be the
request to source metrics related to a named container. This name needs to
be passed to pmcd (presumably once per PMAPI context would suffice) - a
connection attribute could provide an appropriate mechanism, e.g.
pcp --host pcp://bigiron?container=www.acme.com
pcp --host local://?container=buildserver
I'll skip over the details of resolving a container name to an individual
process running within that named container (which is needed in order for
the co-process to obtain the namespace file descriptors) as an
implementation detail. The implementations I looked at have cgroups that
represent the containers; processes can be found by marrying cgroup paths,
container names, and the processes listed in the cgroup tasks files, along
the lines of the sketch below. More research is needed there, to see if a
general mechanism can be found for all container implementations
(optimistically, so far I think so - there also appears to be a
libcontainer effort underway that might help us out here).
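For the Docker-style layouts I looked at, that resolution step could be as
simple as something like this - note the cgroup mount point and the
per-container directory layout are assumptions, and would need to be
discovered for each container implementation:

#include <stdio.h>
#include <sys/types.h>

/* Return the first task listed in the container's cgroup tasks file -
 * one live PID is enough to reach /proc/<pid>/ns/* for that container. */
static pid_t
container_to_pid(const char *cgroup_mount, const char *container_id)
{
    char path[1024];
    FILE *fp;
    int pid;

    /* e.g. /sys/fs/cgroup/memory/docker/<id>/tasks - an assumption */
    snprintf(path, sizeof(path), "%s/docker/%s/tasks",
             cgroup_mount, container_id);
    if ((fp = fopen(path, "r")) == NULL)
        return -1;
    if (fscanf(fp, "%d", &pid) != 1)
        pid = -1;
    fclose(fp);
    return (pid_t)pid;
}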
So, that's the meat of accessing metric values within containers from the
outside (well, one approach anyway). I mentioned earlier other issues in
pmcd(1) we could tackle using the kind of privileged co-process described
here, in addition to container namespaces.
2. Restarting / Installing PMDAs
We introduced a limitation when making pmcd unprivileged, in that it can no
longer start arbitrary PMDAs after it has dropped privileges. If a PMDA
needs to run as a user other than the unprivileged pcp user, it now requires
a full pmcd(1) restart instead of the lighter SIGHUP mechanism previously
available. A hack was added to the PMDA installation process
(forced_restart) to work around this, but it's far from ideal. This
limitation could be lifted if the co-process were used to start
out-of-process PMDAs, and pass the open file descriptors back to pmcd(1)
over the AF_UNIX socket.
3. Authentication
We have a problem using SASL with pmcd(1) in that it cannot perform all
forms of authentication - some popular forms require privilege to access
/etc/shadow, for example, so these cannot work out-of-the-box and would
instead require saslauthd, which we'd ideally not depend on. If the libsasl
calls were made from the privileged co-process on accepting new connections,
instead of from unprivileged pmcd, this issue could be resolved.
These three desires would together define a co-process/pmcd/PMDA protocol -
a request type and parameters (container name and namespace types, or SASL
user, password, etc) would be sent one way; a series of integers would be
sent back (open file descriptors for namespaces or new agents, user IDs
and/or group IDs). That is probably a bit of a simplification, certainly
for SASL - but roughly something equivalent should do the trick.
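As a strawman, the messages might look something along these lines - names
and layout are illustrative only, with the descriptors themselves travelling
alongside as SCM_RIGHTS ancillary data:

#include <sys/types.h>

enum coprocess_request {
    COPROC_NAMESPACE_FDS,   /* resolve a container name to namespace fds   */
    COPROC_START_AGENT,     /* start a PMDA and hand its pipe fds back     */
    COPROC_AUTHENTICATE,    /* run the libsasl exchange on pmcd's behalf   */
};

struct coprocess_hdr {
    int type;               /* one of the request codes above              */
    int length;             /* bytes of parameter payload following        */
    /* payload: container name + namespace flags, PMDA command line,
     * or SASL user/credentials, depending on the request type */
};

struct coprocess_reply {
    int   status;           /* zero on success, negative errno otherwise   */
    int   nfds;             /* descriptors passed alongside via SCM_RIGHTS */
    uid_t uid;              /* authenticated identity (authentication case) */
    gid_t gid;
};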
Thanks for reading this far. :) Any thoughts or insights you might have
would be much appreciated - please send 'em through!
cheers.
--
Nathan
Attachments: makefile, newuts.c, ns_exec.c