Hi Lukas,
----- Original Message -----
> ----- Original Message -----
> The direct git URL is here - I'm lazy & tend to cut+paste :) -
> for anyone wanting to review or try out pmdapapi:
>
> git://sourceware.org/git/pcpfans.git lberk/papi
>
Thanks for sending this code out. Some smaller code notes follow, but
lets go through what we'll need to get an initial merge happening, and
then followup steps to build on that. My preference is to start small
and build up, rather than a "fully-featured" arrival several months
later; after which time the requirements will probably have changed
again! :P
There's a handful of immediate issues that we've (collectively) talked
about that we need to plan our landing approach for, trying to cater
for everyone as best we can:
1. privileged user requirement
The code thats there now, I'm not really following; which is always a
concern for security-sensitive stuff. :)
Are the wheel and adm groups able to access the performance counters
normally? Or are we saying, 'these groups are "close enough" to root,
so let em have access anyway'. If it is the latter.... hmmm, I don't
think we should be making that kind of decision without approval from
the sysadmin ("adm" doesn't exist on some Linux distros too, almost
certainly not there on other platforms).
For the initial pmdapapi lets set the bar at "you *must* be root user"
(so simple "attribute uid==0 or go away" kinds of tests). In followup
work on pmdapapi, we may be able to use the access.c interfaces from
libpcp to give a configurable method of enabling access (sort of like
has been discussed re the pmdapipe RFC, worth a read later if not done
so yet, for this aspect). But lets go for a simple mechanism first,
and then build on it later.
2. enabling counters
OK, so, much discussion around this ... but we must find a way to move
forward. In practice, I think we might be able to make both explicit
enabling and auto-enabling work (which would be good, cos we can then
choose what the default mode of behaviour will be - having our cake &
eating it too - but more on that later. Mmmm, cake).
Initially, lets keep it simple. There has been some doubt expressed
whether the auto-enabling will actually work with good error-handling
semantics - noones ever done this before - and it doesn't function at
all for short-lived PMAPI clients (*) - see example below. Its a fine
area to experiment in, but that level of uncertainty is unhelpful and
unnecessary for the initial version of your first PMDA.
The explicit-enabling method (pmstore) can map directly onto the PAPI
interfaces, so it comes with a 100% success guarantee - there is no
controversy over whether this approach will work, we all know that it
does and it has before. And for a few of us old-timers, this is also
the preferred approach anyway.
So, I'd like to initially see the store-based approach in v1, and then
follow-up experimentation to see if things like auto-enabling all-of-
the-counters-all-at-once with multiplexing works at all, and then more
experimentation around if auto-enabling all-counters-on-all-processes
works at all ... all of which are up in the air, AIUI, for the reasons
discussed on the dev call (e.g. PAPI_ECNFLCT and so on).
(*) e.g., this script should work, but cannot with auto-enabling:
#!/bin/sh
pmstore papi.control.enable "TOT_INS,L1_DCM"
pmstore papi.control.reset "TOT_INS,L1_DCM"
/opt/research/bin/gather_experimental_data >& phase1.log
echo "--- Phase 1 (gather) results ---"
pmprobe -v papi.preset.total_inst papi.preset.L1_DCM mem.util.shmem
cat phase1.log
threads=20
pmstore papi.control.reset "L1_DCM"
/opt/research/bin/search_experimental_data -p $threads >& phase2.log
echo "--- Phase 2 (search x $threads) results ---"
pmprobe -v papi.preset.total_inst papi.preset.L1_DCM mem.util.shmem
cat phase2.log
[...and so on, and so forth...]
This would be a neat little easy-access interface to have for the PAPI
counters, independent of whether auto-everything works out or not. So
I think its a good idea to start out with pmstore-enabling and move to
experimenting with auto-enabling as a second step. We can then let it
be optionally enabled/disabled initially (always having that known-to-
work fallback, in case of semantic issues, problematic hardware, etc).
Later still, after experience has been gained, lots of experimentation
on different hardware, etc - we can then make the decision to default
it to on or not. Then there's no emotion / ownership in the decision;
each approach will have had plenty of opportunity to shine and a well-
informed choice can be made (plus, a fallback mode will be available
either way - so, all good, and everyone's happy - I hope!).
Alrighty then, moving right along... ;)
3. missing bits
QA, man page ... we'll want the initial version documented (this is a
straightforward step - just use an existing pmda*.1 as a template),
and tested. I've not looked deeply enough yet, but it sounds like the
approach Joe has used in pmdaperfevent might be right for you here as
well (might be possible to share some code there - I'll send ya a note
after detailed review of his work).
4. next steps...
We'll need to sort out packaging too, configure-checking for papi lib
and headers, pcp-pmda-papi rpm/deb package with the libpapi dependency
& so on - I'll help you directly with all that though.
So, at that point - basic access control, basic enabling, missing bits
done, QA, QA - we'll work our way through the initial merge together
(3.9.8 seems a well-placed timeframe - middle of August).
diff --git a/src/pmdas/papi/GNUmakefile b/src/pmdas/papi/GNUmakefile
new file mode 100644
index 0000000..93feca4
--- /dev/null
+++ b/src/pmdas/papi/GNUmakefile
@@ -0,0 +1,45 @@
+...
+DFILES = README help
+#LSRCFILES = Install Remove pmns root $(DFILES) GNUmakefile.install
Yep, wont ever need the above line, it can safely go.
+LSRCFILES = Install Remove pmns root $(DFILES)
+ ...
+install_pcp install: default
+ $(INSTALL) -m 755 -d $(PMDADIR)
+ $(INSTALL) -m 755 Install Remove $(CMDTARGET) $(PMDADIR)
+# $(INSTALL) -m 644 GNUmakefile.install $(PMDADIR)/Makefile
Likewise - only needed if we install the sources too, as some PMDAs do.
diff --git a/src/pmdas/papi/help b/src/pmdas/papi/help
[...]
+@ papi.kernel.L1_DCM Level 1 data cache misses.
+@ papi.kernel.L1_ICM Level 1 instruction cache misses.
[...]
This info looks like it can be obtained at run time via PAPI_get_event_info
(via info.name, info.short_descr and info.long_descr). If we go that route,
a custom dp->version.any.text callback would be used (see pmdasample for an
example - sample_text() function).
diff --git a/src/pmdas/papi/papi.c b/src/pmdas/papi/papi.c
new file mode 100644
index 0000000..939af43
--- /dev/null
+++ b/src/pmdas/papi/papi.c
@@ -0,0 +1,691 @@
+ [...]
+#define CLUSTER_PAPI 0 //we should define this in a header for when these
exand possible values
+#define NumEvents 17
Have a peek at the shiny new pmdanvidia, it has a fairly clean way
of avoiding the hard-coding (17) here (and in some other places a
bit later).
+static char *username = "adm";
"root"? this is the default username pmdapapi will run under,
if I'm reading the code correctly, which our discussions so far
led me to believe was root.
+struct uid_gid_tuple {
(so, simplify this struct a whole lot in the initial version)
+static unsigned int queue;
(leftover from earlier experimentation, I think)
+/*
+ * There will be two domain instances, one for kernel counters
+ * and one for process counters.
+ */
+
+static pmdaIndom indomtab[] = {
+#define PAPI_KERNEL 0
+ { PAPI_KERNEL, 1, 0 },
+ //#define PAPI_PROC 1
+ // {},
+};
(wont need this in a v1, nuke)
+/* XXX not entirely sure what this is */
+static pmInDom *kernel_indom = &indomtab[PAPI_KERNEL].it_indom;
(ditto)
+/*
+ * A list of all the papi metrics we support -
+ */
+static pmdaMetric metrictab[] = {
+
+ { NULL,
+ { PMDA_PMID(CLUSTER_PAPI,0), PM_TYPE_U64, PM_INDOM_NULL, PM_SEM_COUNTER,
+ PMDA_PMUNITS(0, 0, 0, 0, 0, 0) } }, /* papi.kernel.total_inst */
As discussed on IRC, I think we should go for names that map directly
to PAPI concepts - so, initial version of the PMDA could have a fixed
metrictable with papi.presets.{total_inst,L1_DCM,...}. Then, we could
build on top of that with papi.native.* - expanded dynamically to only
be actually available metrics (as papi_avail(1) shows).
+ { NULL,
+ { PMDA_PMID(CLUSTER_PAPI,1), PM_TYPE_U32, PM_INDOM_NULL, PM_SEM_DISCRETE,
+ PMDA_PMUNITS(0, 0, 0, 0, 0, 0) } }, /* papi.enable_counters */
Common convention for storable mertics like this is to have a "control"
subtree in your namespace (e.g. gfs2.control.*, trace.control.*, xfs,
pmcd, proc, sample, ... etc, etc).
Also, PAPI has a PAPI_reset() interface to reset the counters to zero -
a common idiom, some people just like to start from zero. :) (that's
something else we can't do any way other than via pmstore btw).
So, in this case, you might want something like ...
papi.control.reset [Reset specified counters to zero using PAPI_reset]
papi.control.enable [Enable specified list of hardware counters]
papi.control.disable [Disable specified list of hardware counters]
These could possibly be string-type metrics, as in my example above, if
you like - comma-separated list of events to zero/enable/disable. You
could use the first "NULL" in each metrictable entry (e.g. in the code
snippet for total_inst above - the m_user field) to hold a pointer to a
name, and walk the table finding which metrics are affected.
PAPI_event_name_to_code(3) might be able to hook into that too, for
adding/removing events (metrics) in the enabled PAPI event set?
In the case where multiple counters arrive, multiplex'ing may optionally
come into play using PAPI_set_multiplex? (that could be another one of
those "not in v1" things too though - adds more work & esp. testing).
+static int
+papi_fetch(int numpmid, pmID pmidlist[], pmResult **resp, pmdaExt *pmda)
+{
+ if ( ctxtab[pmda->e_context].uid == 0 || ctxtab[pmda->e_context].gid == 0 )
I'd turn the above line into a clearly-named function, and pass back the
error code or zero from it.
+ return pmdaFetch(numpmid, pmidlist, resp, pmda);
+ else
+ return PM_ERR_PERMISSION;
+}
+
+#if 0
+static int
+papi_store(pmResult *result, pmdaExt *pmda)
+{
+ int i = 0;
+ //XXX add check for access
(cos you'll need it here too, and in one or two other spots).
+ break;
+ }//switch item
+ }
+
+ PAPI_read(EventSet, values);
Not sure what the above is for at the end of store? (needs a comment,
or maybe it was a debug leftover?)
+ if (retval != PAPI_VER_CURRENT) {
+ __pmNotifyErr(LOG_DEBUG, "PAPI library init error!\n");
+ return retval;
+ }
+
+ /* Check to see if the preset, PAPI_TOT_INS, exists */
+ retval = PAPI_query_event( PAPI_TOT_INS );
+ if (retval != PAPI_OK) {
+ __pmNotifyErr(LOG_DEBUG, "No instruction counter? How lame.\n");
(btw, __pmNotifyErr will insert its own EOL markers for you)
+ if(PAPI_create_eventset(&EventSet) != PAPI_OK){
+ __pmNotifyErr(LOG_DEBUG, "create eventset error!\n");
Sounds like you might want more than a DEBUG level there?
+ /* We can't just assume everything is here, add checks
+ all this papi stuff should probably be done in its own function
+ The papi_add_event can probably just be done in papi_add_events
+ I need to think of a better, more maliable way to specify what
+ event counts we want to enable, having a 'enable ALL the counters'
+ approach isn't proper */
Hopefully the earlier essay helps with this. :)
+ /* PAPI_add_event(EventSet, PAPI_L1_DCM);
+ PAPI_add_event(EventSet, PAPI_L1_ICM);
+ PAPI_add_event(EventSet, PAPI_L2_DCM);
+ PAPI_add_event(EventSet, PAPI_L2_ICM);
+ PAPI_add_event(EventSet, PAPI_L3_DCM);
+ PAPI_add_event(EventSet, PAPI_L3_ICM);
+ PAPI_add_event(EventSet, PAPI_L1_TCM);
+ PAPI_add_event(EventSet, PAPI_L2_TCM);
+ PAPI_add_event(EventSet, PAPI_L3_TCM);
+ PAPI_add_event(EventSet, PAPI_TLB_DM);
+ PAPI_add_event(EventSet, PAPI_TLB_IM);
+ PAPI_add_event(EventSet, PAPI_TLB_TL);
+ PAPI_add_event(EventSet, PAPI_L1_LDM);
+ PAPI_add_event(EventSet, PAPI_L1_STM);
+ PAPI_add_event(EventSet, PAPI_L2_LDM);
+ PAPI_add_event(EventSet, PAPI_L2_STM);*/
Yeah, so we'd have an initally empty event set and add/remove to it
on the fly later (in v1 via pmstore, later on via other means).
+/* use documented in pmdaAttribute(3) */
+static int
+papi_contextAttributeCallBack(int context, int attr,
+ const char *value, int length, pmdaExt *pmda)
+{
+ static int rootlike_gids_found = 0;
+ static int adm_gid = -1;
+ static int wheel_gid = -1;
+ int id;
+ /* Look up root-like gids if needed. A later PCP client that
+ matches any of these group-id's is treated as if root/adm,
+ i.e., journal records are not filtered for them (wildcard_p).
+ XXX: we could examine group-membership lists and check against
I think we should go for an access control mechanism here, like the
one described in the pmdapipe RFC (see list archive from a week or
three back). There's code in src/libpcp/src/access.c that can help
us here. This will simplify all of this code (rootlike*, wheel*,
adm* - can go at that point).
But, that can be a secondary step - to get an initially working PMDA,
as discussed, just track root UID here perhaps.
+ if (! rootlike_gids_found) {
+ struct group *grp;
+ grp = getgrnam("adm");
+ if (grp) adm_gid = grp->gr_gid;
+ grp = getgrnam("wheel");
+ if (grp) wheel_gid = grp->gr_gid;
+ rootlike_gids_found = 1;
+ }
+
+ enlarge_ctxtab(context);
+ assert (ctxtab != NULL && context < ctxtab_size);
+
+ /* NB: we maintain separate uid_p and gid_p for filtering
+ purposes; it's possible that a pcp client might send only
+ PCP_ATTR_USERID, leaving gid=0, possibly leading us to
+ misinterpret that as GROUPID=0 (root) and sending back _GID=0
+ records. *
FWIW, that doesn't happen, if one arrives the other is guaranteed to as
well - unless pmcd or the PMDA dies in-between ;)
+static void
+papi_end_contextCallBack(int context)
+{
+ int i, j;
+ for(i = 0; i < NumEvents; i++){
+ if (ctxtab[context].trackers[i] == 1){
+ ctxtab[context].trackers[i] = 0; // turn it off for us
+ for(j = 0; j < ctxtab_size; j++){
+ if(ctxtab[j].trackers[i] == 1){
+ return; // this shouldn't be return, but we should skip
this metric for now, continue on and check there aren't any other value's we're
no longer using, and that nobody else is either.
+ }
+ // if nobody else is using the metric, we can turn it off now
+ }
+ enablers[i] = 0;
+ switch (i) {
+ case 0:
+ PAPI_remove_event(EventSet, PAPI_TOT_INS);
+ break;
+ case 2:
+ PAPI_remove_event(EventSet, PAPI_L1_DCM);
+ break;
BTW, for this sort of switch/case, I've been starting to use enums,
see pmdanvidia again for example - makes things a bit more readable
I find.
Also, multiplexing will be needed for this approach when that time
comes, and some deep soul/code searching, testing, etc to see under
what situations PAPI_ECNFLCT happens and how we can respond to that
(amongst other things, but all later-on things).
+static void
+usage(void)
+{
+ fprintf(stderr,
+ "Usage: %s [options]\n\n"
+ "Options:\n"
+ " -d domain use domain (numeric) for metrics domain of PMDA\n"
+ " -l logfile write log into logfile rather than using default log
name\n",
+ pmProgname);
+ exit(1);
+}
[...]
+ snprintf(helppath, sizeof(helppath), "%s%c" "papi" "%c" "help",
+ pmGetConfig("PCP_PMDAS_DIR"), sep, sep);
+ pmdaDaemon(&dispatch, PMDA_INTERFACE_6, pmProgname, PAPI, "papi.log",
helppath);
+ while ((c = pmdaGetOpt(argc, argv, "D:d:l:U:?", &dispatch, &err)) != EOF) {
(all the cool kids are using the pmdaGetOptions and pmdaUsageMessage
combo nowadays - e.g. pmdanvidia in dev again)
+ papi_init(&dispatch);
+ pmdaConnect(&dispatch);
You can switch the order of the above two lines, to protect yourself
from slow PAPI init (dunno if that ever happens, but theres no harm
defending from it).
That's it so far - looking good mate. Please keep sending out regular
progress posts to the list like this one as you get stuff working, or
if you hit roadblocks, and definitely with any experimentation results
on the auto-enabling stuff - I'm really keen to hear how that pans out.
cheers.
--
Nathan
|