On 17/07/13 23:15, Frank Ch. Eigler wrote:
..
OK, as long as we observe the requirement that we do not accidentally
regenerate / modify any files that a sysadmin has created (whether
that was by hand or by a prior interactive pm*conf run).
Just to reinforce the mail exchange Nathan and I had yesterday, I now
believe there to be no evidence of regeneration or modification of files
in the case I reported that triggered this whole discussion. I did not
follow a sanctioned upgrade path, which led to version-mismatched pieces
being installed ... the cause was idiot user error, not PCP error.
*nod* - I have seen this problem in real production environments, and it
is sorta-handled in a non-intuitive way - as soon as the remote host goes
away, pmlogger loses the connection and exits [...]
This sounds like an unfortunate policy, if for example there are
temporary network glitches or a quick reboot. A 30-minute re-poll is
IMO too slow.
Let me outline the constraints here, and maybe we can brainstorm a
better approach.
1. When a host is indeed rebooted, we have to start a new PCP archive
... the hardware config may be different (even without human intervention in
some HA worlds), instance domains may be different, etc., so all of the
PCP metadata needs to be written afresh and the "log once" metrics
written to the new archive.
2. pmlogger cannot tell the difference between a network outage and a
remote host reboot, so if the connection to pmcd is closed, or a PDU
get/put times out, then pmlogger must finish the current PCP archive. But
we could/should consider setting $PMCD_REQUEST_TIMEOUT to something
larger than the default 10 seconds ... pmlogger in particular is tolerant
of delayed PDUs coming back from pmcd, so this would leave pmlogger less
exposed to short-term network glitches (see the first sketch after this
list).
3. pmlogger knows nothing of the date+timestamp[+sequence#] naming
convention that the scripts around pmlogger use to name the PCP archives
(the second sketch after this list illustrates the idea).
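
To make constraint 2 concrete, here is a minimal sketch (Python, purely
illustrative) of starting pmlogger with a larger pmcd request timeout.
PMCD_REQUEST_TIMEOUT and its 10 second default are as described above;
the 30 second value, the host name, the config file and the archive name
are assumptions for the example only, not a proposal for the real scripts.

    import os
    import subprocess

    # raise the pmcd request timeout from the 10 second default so that a
    # short network glitch is less likely to look like a lost pmcd
    env = dict(os.environ)
    env["PMCD_REQUEST_TIMEOUT"] = "30"

    # hypothetical invocation; real deployments start pmlogger from the
    # rc/cron scripts and a control file, not like this
    subprocess.Popen(["pmlogger", "-h", "somehost", "-c", "config.default",
                      "20130718.10.00"], env=env)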
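
And for constraint 3, a sketch of the date+timestamp[+sequence#] idea,
which lives in the scripts rather than in pmlogger itself. The exact
format and clash handling in the real scripts may differ; this only shows
how a fresh, non-clashing archive basename could be chosen when a new
archive has to be started.

    import os
    import time

    def new_archive_basename(logdir):
        # date + timestamp part of the name, e.g. 20130718.10.00
        base = time.strftime("%Y%m%d.%H.%M")
        name, seq = base, 0
        # append a sequence number if an archive with this basename is
        # already present (checking the .index component of the archive)
        while os.path.exists(os.path.join(logdir, name + ".index")):
            seq += 1
            name = "%s-%02d" % (base, seq)
        return name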