On Thu, 2009-09-03 at 11:03 +1000, Nathan Scott wrote:
> ----- "Martin Hicks" <mort@xxxxxxxx> wrote:
>
> > On Wed, Sep 02, 2009 at 09:53:51AM +1000, Nathan Scott wrote:
> > > ----- "Martin Hicks" <mort@xxxxxxx> wrote:
> > > > Tests 166 and 176 are showing one less sample than the expected
> > > > output with pcp 2.9.0. Not sure what's going on here.
> > > >
> > > > Is this just a bug that's been fixed? If I pmdumptext the original
> > > > archive I see 54 samples, but the first one is a "?". Maybe pmval
> > > > is now reporting how many intervals (samples-1) ?
> > >
> > > I can't think of any changes that could have caused this, don't think
> > > there have been any pmval changes, nor libpcp fetch related changes.
> > > When was the last successful run of this test for you? I'll try to
> > > get to doing a full QA run today/tomorrow, see if it happens here too.
> >
> > This is the first time I've tried to use the QA stuff. I'll see if the
> > older PCP that we're still shipping has the bug. (2.7.8 right around
> > the 2.8.0 release time. March-ish)
>
> Hoo-boy, this was a good one. AFAICT, this is a day-1 pmval bug. Ken
> & Martin, could you guys review the attached patch carefully?
>
> What I've found, is the problem either exists/doesn't exist depending on
> several factors:
> - running on a 32 bit machine, Debian unstable, the problem never occurs
> no matter how pmval is built.
> - running on a 64 bit machine, Debian unstable, the problem only happens
> with the packaged binaries. The key factor appears to be the introduction
> of -O2, which looks like it reliably changes certain values on the stack,
> that would otherwise be zero (used uninitialised, in either case).
>
> There's two problems here, I believe:
> - in pmval.c the getargs() call initialises the value of "smpls" (number
> of samples to take) based on cntxt->desc.sem. However, at that point,
> we have not established a context yet, and so that field is uninitialised.
> - this test doesn't match the comment above it, and doesn't make sense:
>
> /* counters require 2 samples to produce reported sample */
> if (*smpls > 0 && cntxt->desc.sem != PM_SEM_COUNTER)
>
> that would seem to me that it should be "cntxt->desc.sem == PM_SEM_COUNTER"
> should it not?
>
> Anyway, attached patch fixes test case 125, which is a simple test case for
> this. Some of the other QA tests now fail, like 261 - the sample.drift bit
> fails though, which is an instantaneous metric, showing we used to go down
> the wrong pmval code path in that case.
Looking at qa/166 ...
$ pmdumplog -L -z src-oss/reduce-gap
Note: timezone set to local timezone of host "kenj-pc" from archive
Log Label (Log Format Version 2)
Performance metrics from host kenj-pc
commencing Wed Jan 19 21:56:48.422 2005
ending Wed Jan 19 22:10:26.410 2005
Archive timezone: EST-11
PID for pmlogger: 28161
So archive starts at 21:56:48.422.
pmval -A 1m moves this to 21:57:00.000
First reported sample would be 21:57:15 (the metric is a counter)
and last sample would be 22:10:15 ... which is 13mins + 1 sample
or 53 samples ... which should be the number of samples reported by
pmval.
So the summary (samples: 54) at the start is wrong and should
be 53.
Nathan's explanation is correct and a good catch.
My (similar) patch is attached for comparison ... we've made mistakes
in this section of the code before so I've added some verbose comments
(those who know me will understand just how unnatural this feels!).
I'll work through the qa issues to independently verify the changes
Nathan has found ... since I expect these to be correct, Nathan would
you like to commit yours to the oss pcpqa tree?
pmval.patch
Description: Text Data
|