pcp
[Top] [All Lists]

Re: [pcp] pcp updates - pmdapapi update

To: Lukas Berk <lberk@xxxxxxxxxx>
Subject: Re: [pcp] pcp updates - pmdapapi update
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Fri, 14 Nov 2014 03:12:50 -0500 (EST)
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <87y4repuvj.fsf@xxxxxxxxxx>
References: <87oascow3f.fsf@xxxxxxxxxx> <462023254.13251879.1415861264683.JavaMail.zimbra@xxxxxxxxxx> <87y4repuvj.fsf@xxxxxxxxxx>
Reply-to: Nathan Scott <nathans@xxxxxxxxxx>
Thread-index: 3AgRh8OKpu9EoOMf1JSXKlbNhhab7g==
Thread-topic: pcp updates - pmdapapi update
Hi Lukas,

----- Original Message -----
> Hey,
> 
> Nathan Scott <nathans@xxxxxxxxxx> writes:
> 
> > Fabulous - awesome effort esp. on the QA front.  I ran out of time to
> > review it all today (maybe someone else will?), but did sneak a quick
> > background QA run in on RHEL 6 today.  I'm seeing a few new failures
> > there - see attached .bad files - any ideas on possible root causes?
> 
> Thanks for looking things over.  I've pushed fixes for 3/4 of the
> failures (diffstat and change updates below).  For 967 and 813 the issue
> was a difference in pmid's for the TOT_INS metric.  This was due to the
> dynamic pmns.  Functionally, the tests were identical (and running
> properly), to fix this I added an additional regex to match the
> papi.system.* pmid's to output them as 126.0.NUMBER.

OK, sounds good.  Is the plan to fix this more, later? (if I followed
the other discussion with Ken the other day?  these PMID values need to
be as stable as we can make 'em)

> Testcase 903 was failing partly due to the dynamic pmns as well.
> Apparently the number of metrics on that box was much lower, so it
> didn't trigger the regex (which would have swapped the number for an
> 'X').  After testing it on a vm with no papi metrics available, I
> lowered the regex to match 7 or greater.  This provides matches for the
> 5 papi.control metrics, 1 papi.available metric, and at least one actual
> papi.system.* metric.

*nod*, looks good.

> Testcase 799 failed for the same reason I mentioned in my original
> email, and I'd be open to advice on how to fix it.  The metrics I used
> to force a ECNFLCT (if multiplexing is disable) on my machine, may not
> exist on other machines.  Being able to find a combination of metrics
> which would cause such an error, programatically, on the host qa
> machine, is something I'm not sure how to do yet.

Having reviewed the code now, hmm - there's little point to this test is
there?  It seems the error handling policy has become as anticipated for
auto-enable/disable - none at all.  So if an ECNFLCT or some other error
did occur at the point we were concerned about, we're never going to see
it anyway.  IOW, I guess you could just remove this test if we're going
ahead with that (its not a problem in the non-auto-enable case, so maybe
not worth testing in that mode, as you had it).

So, handle_papi_error() can do nothing useful for regular users?  What can
it sensibly do?  I guess it should really log problems during auto-enable/
disable, right?  Otherwise, there's no way to inform anyone that bad stuff
happened in refresh_metrics() - since the return code is ignored in that
situation - and one or more of the counters spontaneously became inactive
or failed to be added into the set?

Ignoring errors is not really good enough, I think - its going to make it
really hard to debug when people do have problems with it.

Anyway - I've merged it all, and marked 799 as retired for now so that it
doesn't fail for others ... feel free to update and turn it into a more
specific test (maybe not for ECNFLCT, but for some other reproducible error
case) - to check error handling for the non-auto-enabled counters?

I'll send a couple more review notes separately, but its all small followup
stuff; I think you've done a good job here Lukas - nice one.

cheers.

--
Nathan

<Prev in Thread] Current Thread [Next in Thread>