pcp
[Top] [All Lists]

another round of review for lberk/papi

To: lberk@xxxxxxxxxx
Subject: another round of review for lberk/papi
From: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Date: Fri, 10 Oct 2014 15:06:42 -0400
Cc: pcp developers <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
User-agent: Mutt/1.4.2.2i
Hi -

My usual channel for this was unavailable today, so here goes in plain
email.  Took another look at pcpfans.git lberk/papi (up to commit
05c56e4f18deaa9).  Many good fixes from the previous reviews, thanks!


At least one more correctness issue is still present in the code, by
inspection: the way the add_metric & remove_metric functions (and
subsidiaries) use the papi_info[].position number both as an index
(into values[]) and as the "user wishes this counter activated" flag.
This fails if the set-of-activated-counters is manipulated via pmStore
in a non-LIFO way.  It's because the remove path largely preserves the
previous .position numbers even though it recreates the EventSet in
increasing-index order (thus invalidating those position#'s).  Any
time after that the values[] will be mis-ordered.

What the papi_info[] struct needs is an added simple on/off flag that
the .enable/.disable callback can set/clear, and a single common
papi-eventset-refresh function that shuts everything down, saves
counter values from former values[.position]'s, and *recalculates*
brand a new .position for all surviving events (in whatever order
they're being added to the new eventset).


A small memory-corruption bug with the new status_string[]: strdup()
it and PMDA_FETCH_DYNAMIC as the return code, or else make it a static
char[] and keep PMDA_FETCH_STATIC.  (PMDA_FETCH_STATIC on a stack
local would ask libpcp_pmda to access memory from a stack frame that's
already been unwound.)


I've suggested nuking the hard-coded-papi-event parts of metrictab[],
but it's clear now why you can't quite do that yet: there is no other
source of pmDesc type data for the metrics.  You'll need to intercept
the pmda dispatch->version.any.desc field with a callback function
that fills in the given pmDesc*, very much the same way as your
papi_text() function returns the helptext stuff.  Then metrictab[] can
shrink to the bare minimum.


Some ideas regarding testing.


First of all, the existing test cases (qa/914 and qa/799) simply must
stop relying on the -L local context this way.  As we discussed on IRC
last week, -L makes no sense at all for papi (since it's so stateful;
each "pmCLIENT -L ..." loads the pmda.so+papi.so to form a new blank
slate, does one thing, and immediately dies).  So no such test will
ever see counter numbers, or results other than those related to
exactly one initialization-type operation -- plus any __pmNotifyErr
noise that would've gone to a pmcd/papi.log file.

In fact, the only way anyone could have ever seen the pmda work and produce
counter *numbers*, was by -not- using the -L mode, which means its core
was only ever "hand tested" and qa/914 was to that extent a talisman.


Anyway, as we were saying on IRC, one plausible way of using -L is via
dbpmda, instead of a PMAPI client.  That could work ... but OTOH might
be tricky to feed it a fixed dbpmda command script (like a few other
qa/NNN tests do) and get meaningful answers out, at least because the
available counters=metrics & runtime environment can so easily vary
between machines.  A simple testcase driver is probably not up to the job.


If you agree a more complex driver seems needed, we could try something
using plain default "-h local:" rather than "-L", chooses a bunch of
counters, fetches some samples, waits awhile to generate load, fetches
more samples (sanity checking that they are growing at reasonable
rates), turning some off / on in various orderings, fetching them,
checking that the survivors are still growing at reasonable rates.
It might be a job for a python script rather than bash.


(When writing tests for pmwebd, I found it helpful to compile the pmda
with the gcc test-coverage options.  It made it possible to see that
all the working parts of the code were in fact touched by the
testsuite's execution.  You might find it useful here too, considering
the tricky state manipulation required.)


- FChE

<Prev in Thread] Current Thread [Next in Thread>