Hi Lukas,
----- Original Message -----
> Hey,
>
> Nathan Scott <nathans@xxxxxxxxxx> writes:
>
> > Fabulous - awesome effort esp. on the QA front. I ran out of time to
> > review it all today (maybe someone else will?), but did sneak a quick
> > background QA run in on RHEL 6 today. I'm seeing a few new failures
> > there - see attached .bad files - any ideas on possible root causes?
>
> Thanks for looking things over. I've pushed fixes for 3/4 of the
> failures (diffstat and change updates below). For 967 and 813 the issue
> was a difference in pmid's for the TOT_INS metric. This was due to the
> dynamic pmns. Functionally, the tests were identical (and running
> properly), to fix this I added an additional regex to match the
> papi.system.* pmid's to output them as 126.0.NUMBER.
OK, sounds good. Is the plan to fix this more, later? (if I followed
the other discussion with Ken the other day? these PMID values need to
be as stable as we can make 'em)
> Testcase 903 was failing partly due to the dynamic pmns as well.
> Apparently the number of metrics on that box was much lower, so it
> didn't trigger the regex (which would have swapped the number for an
> 'X'). After testing it on a vm with no papi metrics available, I
> lowered the regex to match 7 or greater. This provides matches for the
> 5 papi.control metrics, 1 papi.available metric, and at least one actual
> papi.system.* metric.
*nod*, looks good.
> Testcase 799 failed for the same reason I mentioned in my original
> email, and I'd be open to advice on how to fix it. The metrics I used
> to force a ECNFLCT (if multiplexing is disable) on my machine, may not
> exist on other machines. Being able to find a combination of metrics
> which would cause such an error, programatically, on the host qa
> machine, is something I'm not sure how to do yet.
Having reviewed the code now, hmm - there's little point to this test is
there? It seems the error handling policy has become as anticipated for
auto-enable/disable - none at all. So if an ECNFLCT or some other error
did occur at the point we were concerned about, we're never going to see
it anyway. IOW, I guess you could just remove this test if we're going
ahead with that (its not a problem in the non-auto-enable case, so maybe
not worth testing in that mode, as you had it).
So, handle_papi_error() can do nothing useful for regular users? What can
it sensibly do? I guess it should really log problems during auto-enable/
disable, right? Otherwise, there's no way to inform anyone that bad stuff
happened in refresh_metrics() - since the return code is ignored in that
situation - and one or more of the counters spontaneously became inactive
or failed to be added into the set?
Ignoring errors is not really good enough, I think - its going to make it
really hard to debug when people do have problems with it.
Anyway - I've merged it all, and marked 799 as retired for now so that it
doesn't fail for others ... feel free to update and turn it into a more
specific test (maybe not for ECNFLCT, but for some other reproducible error
case) - to check error handling for the non-auto-enabled counters?
I'll send a couple more review notes separately, but its all small followup
stuff; I think you've done a good job here Lukas - nice one.
cheers.
--
Nathan
|