pcp
[Top] [All Lists]

Re: [pcp] pcp updates: fix kernel.pernode.cpu metrics on single CPU syst

To: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>, pcp@xxxxxxxxxxx
Subject: Re: [pcp] pcp updates: fix kernel.pernode.cpu metrics on single CPU systems
From: Mark Goodwin <mgoodwin@xxxxxxxxxx>
Date: Fri, 02 Jan 2015 11:25:53 +1100
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <54A5AD0C.9010105@xxxxxxxxxxxxxxxx>
References: <54A4EE5F.2000108@xxxxxxxxxx> <54A5AD0C.9010105@xxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0
On 01/02/2015 07:24 AM, Ken McDonell wrote:
On 01/01/15 17:51, Mark Goodwin wrote:
Fix for the kernel.pernode.cpu problem reported by Shirshendu Chakrabarti
where kernel.pernode.cpu.user was incorrectly zero on single CPU systems.
Plus a new QA test to verify the fix.
[...]
1. The last sprintf() in map_cpu_nodes() should be a snprintf() ... I think this
is part of our de facto virtual style guide, so I've made that change.

yep ok, thanks. (I just pulled from your tree, but you haven't committed yet?)


2. This part at the head of map_cpu_nodes() confused me ...
     pmdaIndom *idp = PMDAINDOM(NODE_INDOM);

     for (i = 0; i < proc_cpuinfo->cpuindom->it_numinst; i++)
because proc_cpuinfo->cpuindom must be == idp for this to all make sense, so I'd
expect the references to the indom control struct in this routine to all be
consistent.  I think using idp->it_numinst in the loop control will work, and
removes the inconsistency.

yep, that's outside this bug fix, but agree and ok I'll make that change
and a few others to make the code more robust in general.


3. But my real question is about the sizing of the two arrays (which need to end
up the same), proc_cpuinfo->cpuinfo[] and idp->it_set[]. The latter is resized
and initialized at the end of map_cpu_nodes() but on entry to map_cpu_nodes()
there is an assumption that proc_cpuinfo->cpuinfo[] is already the big enough to
accommodate all the cpus found in the directory traversal.  I think there needs
to be a comment describing why this assumption is expected to be true and/or a
check in the loop before
     proc_cpuinfo->cpuinfo[cpu].node = node;
to ensure cpu is not too large

yes, this could be more robust and defensive too. If it's busted then the
kernel sysfs code is busted (someone else's problem!). On the other hand,
I'm not sure how this code would tolerate cpu-hotplug - the bug has been
there since the numa epoch (circa 1995 or so, when cpu hotplug wasn't
feasible) but I guess nobody noticed the per-node cpu metrics were broken
on single CPU systems.

Any thoughts on how we might test cpu hotplug/unplug? Do we need to even
support it? I suspect that if the per-cpu and per-node indoms need to
be dynamic then this isn't the only codxe that will need some attention ...


Cheers and Happy New Year, Ken.

and to you too :)

Cheers
-- Mark

<Prev in Thread] Current Thread [Next in Thread>