Comment # 1
on bug 1036
from Nathan Scott
So, my 2c - as discussed on IRC, I don't really think of this as improving the
situation. PMDA tardiness is a domain-specific problem, and to my mind
tackling it (as best one can; it's not generally solvable, and ultimately practical
tradeoffs end up being made) is thus best left to the individual PMDAs - IMO.
It'd also add more code to pmcd, for what I feel is an error/corner case, which
also adds to my reluctance.
Finally, it feels like we'd be taking a stance of being more accepting of
mediocrity (these delays reduce the quality/accuracy of the data we export) -
if we are to add code, I'd prefer it to be along the lines of helping to find
root causes of those latency problems. Perhaps timing mechanisms and new pmcd
metrics to help identify those PMDAs which are suffering latency spikes. Even
if the timeouts (the point where pmcd is seen to "overreact") are not reached,
those PMDAs are still contributing to an overall reduction in quality in terms
of value/timestamp accuracy.
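To make the "timing mechanisms" idea a bit more concrete, here is a rough sketch of the sort of per-PMDA bookkeeping I have in mind. It is Python purely for illustration (pmcd is C), and the class name, method names and the 1 second spike threshold are all made up for this example; nothing here is existing pmcd code.

```python
import time
from collections import defaultdict


class AgentLatencyTracker:
    """Hypothetical per-agent latency bookkeeping; not part of pmcd."""

    def __init__(self, spike_threshold=1.0):
        # Seconds above which one response counts as a spike (assumed value).
        self.spike_threshold = spike_threshold
        self.stats = defaultdict(
            lambda: {"count": 0, "total": 0.0, "max": 0.0, "spikes": 0}
        )

    def record(self, agent, seconds):
        """Account for one response from `agent` that took `seconds`."""
        s = self.stats[agent]
        s["count"] += 1
        s["total"] += seconds
        s["max"] = max(s["max"], seconds)
        if seconds > self.spike_threshold:
            s["spikes"] += 1

    def timed(self, agent, request):
        """Call request() and record how long the agent took to answer."""
        start = time.monotonic()
        try:
            return request()
        finally:
            self.record(agent, time.monotonic() - start)

    def spiky_agents(self):
        """Names of agents with at least one spike, worst offender first."""
        hit = [(s["max"], name) for name, s in self.stats.items() if s["spikes"] > 0]
        return [name for _, name in sorted(hit, reverse=True)]
```

The point is that the counters get updated on every request, not just when a timeout fires, so a PMDA that is routinely slow but never quite times out still shows up.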
Also as discussed on IRC, we could implement a scheme where pmie is used to
trigger restarts on those PMDAs that are timed out, using pmcd.agent.status.
Counter-point being: pmie runs unprivileged, thus it can only send pmcd a
SIGHUP and not restart it (which means root/non-pcp PMDAs get no love).
Counter-counter-point: pmie could touch a file in a safe place (probably not a
world-writable, sticky-bit directory), and a trivial root cronjob could check
for that file and restart pmcd accordingly. Pretty horrifying, but then so is
being accepting of PMDA tardiness IMO. :)
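For what it's worth, the root cronjob half of that scheme could be about this small. This is only a sketch: the flag file path and the restart command are assumptions I've picked for illustration (they would need to match the local install and init system), and the pmie rule that creates the file is not shown.

```python
import os
import subprocess

# Hypothetical path and command, chosen for illustration only.
FLAG_FILE = "/var/lib/pcp/restart-requested"  # assumed; directory must not be world-writable
RESTART_CMD = ["systemctl", "restart", "pmcd"]  # assumed init system


def check_and_restart(flag_file=FLAG_FILE, restart_cmd=RESTART_CMD, run=subprocess.run):
    """Restart pmcd if the flag file exists.

    Returns True when a restart was attempted, False when no flag was found.
    """
    try:
        # Remove the flag first, so each request is acted on at most once.
        os.unlink(flag_file)
    except FileNotFoundError:
        return False
    run(restart_cmd, check=True)
    return True
```

A root crontab entry would then just call check_and_restart() every minute or so. The `run` parameter is only there so the logic can be exercised without actually restarting anything.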
In other news, I wonder if we should consider adding pmcd.agent.uid metrics to
export the user account identifier under which each PMDA is running?