Hey,
Thanks for the review, comments in-line.
"Frank Ch. Eigler" <fche@xxxxxxxxxx> writes:
[...]
> At least one more correctness issue is still present in the code, by
> inspection: the way the add_metric & remove_metric functions (and
> subsidiaries) use the papi_info[].position number both as an index
> (into values[]) and as the "user wishes this counter activated" flag.
> This fails if the set-of-activated-counters is manipulated via pmStore
> in a non-LIFO way. It's because the remove path largely preserves the
> previous .position numbers even though it recreates the EventSet in
> increasing-index order (thus invalidating those position#'s). Any
> time after that the values[] will be mis-ordered.
Ah good catch, thanks.
> What the papi_info[] struct needs is an added simple on/off flag that
> the .enable/.disable callback can set/clear, and a single common
> papi-eventset-refresh function that shuts everything down, saves
> counter values from former values[.position]'s, and *recalculates*
> brand a new .position for all surviving events (in whatever order
> they're being added to the new eventset).
I've fixed this in commit 324ec32f62a8. I implemented your suggestion
of a 'metric_enabled' variable in papi_info. This way, the 'position'
variable can serve its single purpose, and we can first assign it once
in the add_metric function, and then reassign as needed (when we're
recreating the eventset) in remove_metric. I plan on writing a testcase
for this as well.
> A small memory-corruption bug with the new status_string[]: strdup()
> it and PMDA_FETCH_DYNAMIC as the return code, or else make it a static
> char[] and keep PMDA_FETCH_STATIC. (PMDA_FETCH_STATIC on a stack
> local would ask libpcp_pmda to access memory from a stack frame that's
> already been unwound.)
Fixed in commit 7cf0569611c.
> I've suggested nuking the hard-coded-papi-event parts of metrictab[],
> but it's clear now why you can't quite do that yet: there is no other
> source of pmDesc type data for the metrics. You'll need to intercept
> the pmda dispatch->version.any.desc field with a callback function
> that fills in the given pmDesc*, very much the same way as your
> papi_text() function returns the helptext stuff. Then metrictab[] can
> shrink to the bare minimum.
Yes, you've mentioned this before and I still have it marked as TODO.
With the previous two items fixed it's my on my plate now.
> Some ideas regarding testing.
I notice there have been several emails about this already, just going
to sum up my thoughts here.
> First of all, the existing test cases (qa/914 and qa/799) simply must
> stop relying on the -L local context this way. As we discussed on IRC
> last week, -L makes no sense at all for papi (since it's so stateful;
> each "pmCLIENT -L ..." loads the pmda.so+papi.so to form a new blank
> slate, does one thing, and immediately dies). So no such test will
> ever see counter numbers, or results other than those related to
> exactly one initialization-type operation -- plus any __pmNotifyErr
> noise that would've gone to a pmcd/papi.log file.
I didn't write qa/914 so I won't speak to that case. In the case of
qa/799, Nathan pointed out that trying to add conflicting metrics should
be tested. When I was trying to reproduce the condition, it found a bug in
the way I handled that case. Even if fixed now, it needs a testcase imo
and this is the goal of qa/799. Only reason it was commited was so I
could continue to build pcp with Makepkgs. This is a situation where (as
Frank has mentioned) state of the pmda and papi counters is required to
have a meaningful test, so, aiui, using -L as currently in the testcase
will not work. If the answer is a combination of dbpmda/pmcd and a
helper PAPI test program, then so be it, and I'll be happy to work on an
approach that way!
As somebody new to PCP and to the inner workings of its testsuite, I
intend to make testcases a priority when introducing new functionality,
and will make every effort to do so. If I may humbly inquire, would it
not make sense to test how the pmda reacts when "papi proper" (ie, not the
test papi) returns, say, an PAPI_ECNFLCT error? This could even be in
_addition_ to the test against the qa/src/*.c papi. I imagine this would
only increase the level of test coverage the testsuite offers?
[...]
> Anyway, as we were saying on IRC, one plausible way of using -L is via
> dbpmda, instead of a PMAPI client. That could work ... but OTOH might
> be tricky to feed it a fixed dbpmda command script (like a few other
> qa/NNN tests do) and get meaningful answers out, at least because the
> available counters=metrics & runtime environment can so easily vary
> between machines. A simple testcase driver is probably not up to the job.
After looking over qa/617 as a quick example of using dbpmda (yes, I'm
aware there are some additional intricacies when linking a test papi).
I think that we could properly script a case where the test papi.c
returns a PAPI_ECNFLCT error after a set number of metrics are added, I
assume this is mainly in writing a proper qa/src/*.c file which throws
PAPI_ECNFLCT at a 'synthetic' point (ie, adding TOT_CYC and TOT_INS
metrics may never actually cause an ECNFLCT error, but our test papi
could throw it).
> If you agree a more complex driver seems needed, we could try something
> using plain default "-h local:" rather than "-L", chooses a bunch of
> counters, fetches some samples, waits awhile to generate load, fetches
> more samples (sanity checking that they are growing at reasonable
> rates), turning some off / on in various orderings, fetching them,
> checking that the survivors are still growing at reasonable rates.
> It might be a job for a python script rather than bash.
I think an approach like this could make sense, and looking over the
further emails, adding *additional* tests that do this, while keeping
the older ones in place seems agreeable. In this case I think the
hard(er?) part is reliably finding a case that could cause an ECNFLCT,
aiui, that changes based on the hardware. My initial thought is to find
a way to iterate over the available metrics, until we reach a limit, and
then adding another to set us 'over'. I'm also aware I need to pay
attention to installing and removing the pmda as part of the test.
In regards to the "sanity checking that they are growing at reasonable
rates". Considering that this will be run on different hardware,
potentially with vastly different counter performance and capabilities,
do you have any thoughts on how we could estimate what a 'reasonable
rate' would be? My first thought is to perhaps compare rates/ratios of
different, specific, metrics. However, I'm not sure how much those vary
per architecture. Thoughts?
> (When writing tests for pmwebd, I found it helpful to compile the pmda
> with the gcc test-coverage options. It made it possible to see that
> all the working parts of the code were in fact touched by the
> testsuite's execution. You might find it useful here too, considering
> the tricky state manipulation required.)
Definitely something I'll look at, thanks.
Cheers,
Lukas
|