Re: [pcp] Fwd: Re: proc pmda oddness - qa 022

To: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Subject: Re: [pcp] Fwd: Re: proc pmda oddness - qa 022
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Sun, 10 Nov 2013 23:35:44 -0500 (EST)
Cc: PCP Mailing List <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <527C11EC.1030604@xxxxxxxxxxxxxxxx>
References: <527C0D86.4080107@xxxxxxxxxxxxxxxx> <527C11EC.1030604@xxxxxxxxxxxxxxxx>
Reply-to: Nathan Scott <nathans@xxxxxxxxxx>
Thread-index: U1QKk8xKZ4iGiejcb/v15m+ICNbYDA==
Thread-topic: proc pmda oddness - qa 022

Hi Ken,

----- Original Message -----
> oops ... meant this to go to the list.
> 
> 
> -------- Original Message --------
> Subject: Re: [pcp] proc pmda oddness - qa 022
> Date: Fri, 08 Nov 2013 09:00:38 +1100
> From: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
> To: Nathan Scott <nathans@xxxxxxxxxx>
> 
> I've tracked this one down (I think).
> 
> There appears to be a logic error in fetch_proc_pid_stat() when handling
> a zero sized "wchan" file.
> 
> When the read returns zero, following the "eh?" comment we set sts to -1
> ... this just seems wrong ... if the wchan is not available, the rest of
> the proc stat info should be ok.  This is especially so as the code
> behaves this way if the wchan file cannot be opened (see the check
> earlier in the code after the proc_open() call for the wchan case).
> 
> The attached patch (which includes a lot of new DESPERATE debugging code
> to help identify the problem) works for me, and QA 022 passes on the
> hosts it was previously failing on.  And check -g pmda.proc runs on
> these same hosts with no new failures, so no obvious regressions that I
> can see.

Looks good.
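
For anyone following along without the patch in front of them, the
behaviour being argued for can be sketched roughly as below.  This is not
the actual fetch_proc_pid_stat() code or Ken's patch; read_optional_file()
is a hypothetical helper made up for illustration.  The idea is simply
that a wchan file which cannot be opened, or which reads back zero bytes,
yields an empty value instead of failing the whole fetch.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the real pmdaproc code): read an optional
 * per-process file such as /proc/PID/wchan into buf.
 *
 * Returns 0 when the caller may carry on, with buf holding the file
 * contents or an empty string if the file is missing or zero sized.
 * Returns -1 only for a bad buffer or a genuine read() error.
 */
static int
read_optional_file(const char *path, char *buf, size_t buflen)
{
    int fd;
    ssize_t n;

    if (buf == NULL || buflen == 0)
        return -1;
    buf[0] = '\0';

    if ((fd = open(path, O_RDONLY)) < 0)
        return 0;       /* cannot open: value unavailable, not fatal */

    n = read(fd, buf, buflen - 1);
    close(fd);

    if (n < 0) {
        buf[0] = '\0';
        return -1;      /* real I/O error */
    }
    buf[n] = '\0';      /* n == 0 simply leaves an empty string */
    return 0;
}
```

With something like this, the caller would only give up on the process
entry when the stat read itself fails, and would report an empty wchan
otherwise.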

> Before committing this change, I'd appreciate some feedback.
> 
> The bit I _really_ don't understand is why this has not bitten before

Very likely related to the changes around threads vs non-threads in the
per-process indom.  To confirm, you could experiment with the -L option
to pmdaproc on those hosts where it's failing, which is more like the old
behaviour (but that exposes its own set of cputime accounting issues for
all kernel versions).

> and why now it appears to be hard fail on some systems and hard pass on
> others and what has changed (this may be related to the relatively
> recent change to use /proc/PID/task/NNN and maybe wchan there has
> different semantics and state to wchan below /proc/PID that we would
> have been using previously).

Suggests a kernel bug in some versions?  wchan I would say is relatively
less important than getting the correct cputime numbers by default, so I
think we should go ahead with your patch.

cheers.

--
Nathan
