Comment # 5
on bug 1100
from Ken McDonell
(In reply to comment #3)
> Is it possible that a single pmlogger did a pmReconnectContext
> to a changed version of pmcd (with different metric configuration),
> and since that worked, it didn't see the need to fetch then-new pmDesc
> etc. metadata?
This is not supposed to be possible, *by design*, to avoid exactly the problem
you're seeing.
There is pmReconnectContext() code in pmlogger, but it should be guarded by #if
CAN_RECONNECT and CAN_RECONNECT is not set in the build ... could you check, to
be sure, to be sure, that the text symbol reconnect is not defined in your
pmlogger binary?
Having pmlogger reconnect to pmcd is never going to be a good idea because
there is only one copy of the metadata in an archive (ignoring instance domains
that can change over time), so there is no way that a single archive can
accommodate a metric that is, say, a 64-bit counter at one point and a 32-bit
instantaneous value later in the archive.
Ahh ... I've found it!
It is not the connection to pmcd that is lost, rather a PMDA is removed from
pmcd ... this sends a special error PDU with error code > 0 (the bitwise OR
of the PMCD_ADD_AGENT, PMCD_RESTART_AGENT, PMCD_DROP_AGENT flags) ... pmlogger
catches this as per this fragment in the code (in myFetch()) ...
    if (changed & PMCD_ADD_AGENT) {
        /*
         * PMCD_DROP_AGENT does not matter, no values are returned.
         * Trying to restart (PMCD_RESTART_AGENT) is less interesting
         * than when we actually start (PMCD_ADD_AGENT) ... the latter
         * is also set when a successful restart occurs, but more
         * to the point the sequence Install-Remove-Install does
         * not involve a restart ... it is the second Install that
         * generates the second PMCD_ADD_AGENT that we need to be
         * particularly sensitive to, as this may reset counter
         * metrics ...
         */
This causes a <mark> record to be added when the PMCD_ADD_AGENT state change is
seen, so the <mark> records in your archive probably came from pmlogger, not
pmlogextract.
Of course, if a PMDA that was being logged is dropped and a new version is
added, then the new one might have different metadata and ... kaboom.
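To make the flag test described above concrete, here is a minimal sketch of the decision "does this fetch status call for a <mark> record?". The SKETCH_* names and their values are stand-ins I made up for illustration, and sketch_needs_mark() is a hypothetical helper, not a function in pmlogger; the real PMCD_* flags come from the PCP headers.

```c
/*
 * Stand-in flag values for illustration only (assumption); the real
 * PMCD_ADD_AGENT, PMCD_RESTART_AGENT and PMCD_DROP_AGENT definitions
 * come from the PCP headers.
 */
#define SKETCH_ADD_AGENT	1
#define SKETCH_RESTART_AGENT	2
#define SKETCH_DROP_AGENT	4

/*
 * Hypothetical helper: given the status returned by a fetch, return 1
 * if a <mark> record should be written, else 0.
 *
 * Assumed convention, following the comment text: a value > 0 is the
 * bitwise OR of the state change flags, 0 means no state change, and
 * a negative value is an ordinary error that is handled elsewhere.
 */
int
sketch_needs_mark(int changed)
{
    if (changed <= 0)
	return 0;
    /* only an added agent matters; drop and restart alone do not */
    return (changed & SKETCH_ADD_AGENT) ? 1 : 0;
}
```

With this, a status of SKETCH_DROP_AGENT alone produces no <mark>, while SKETCH_DROP_AGENT | SKETCH_ADD_AGENT (the Remove-then-Install sequence) does.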
The only way I can see to fix this is for pmlogger to be super paranoid after
the PMCD_ADD_AGENT state change and mistrust all previously fetched metric
metadata until it is reverified (this is all of the pmDesc and the PMNS, the
indoms will look after themselves) ... if a change in metadata is observed,
pmlogger must exit and rely on the higher level cron or pmmgr functions to
notice and start a new pmlogger with new metadata.
Of course, then we'll trip into the pmlogextract check (as expected), unless
the metadata change comes with an associated pmlogrewrite config file change,
as it should.
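The reverification step proposed above might look roughly like the following. This is only a sketch under assumptions: sketch_desc_t is a simplified stand-in for pmDesc (with the units packed into a single integer), and both helper functions are hypothetical. In pmlogger the fresh descriptors would presumably come from pmLookupDesc() after the PMCD_ADD_AGENT state change, and the PMNS check is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified stand-in for pmDesc (assumption): the real structure is
 * defined by the PCP headers; units are packed into one integer here.
 */
typedef struct {
    uint32_t	pmid;	/* metric identifier */
    int		type;	/* data type, e.g. 32-bit vs 64-bit */
    uint32_t	indom;	/* instance domain identifier */
    int		sem;	/* semantics, e.g. counter vs instantaneous */
    uint32_t	units;	/* packed dimension and scale */
} sketch_desc_t;

/*
 * Hypothetical helper: return 1 if the freshly fetched descriptor
 * differs from the cached one in any field, else 0.  Fields are
 * compared one by one so structure padding cannot affect the result.
 */
int
sketch_desc_changed(const sketch_desc_t *cached, const sketch_desc_t *fresh)
{
    return cached->pmid != fresh->pmid ||
	   cached->type != fresh->type ||
	   cached->indom != fresh->indom ||
	   cached->sem != fresh->sem ||
	   cached->units != fresh->units;
}

/*
 * Hypothetical helper: count how many of the n logged metrics have
 * changed metadata.  A non-zero result is the case where pmlogger
 * would have to exit and leave the restart to cron or pmmgr.
 */
size_t
sketch_count_changed(const sketch_desc_t *cached,
		     const sketch_desc_t *fresh, size_t n)
{
    size_t	i, changed = 0;

    for (i = 0; i < n; i++) {
	if (sketch_desc_changed(&cached[i], &fresh[i]))
	    changed++;
    }
    return changed;
}
```

The caller would run this once after each PMCD_ADD_AGENT state change and before writing any further results to the archive, so a mismatch is caught before inconsistent data is logged.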
I am going to let this suggested plan of attack "brew" for a while before I
make any changes.
Thanks, Frank, for the hint that indirectly pointed to the probable root cause.