Comment #2 on bug 1036 from Frank Ch. Eigler
(In reply to comment #1)
> So, my 2c - as discussed on IRC, I don't really think of this as improving
> the situation. PMDA tardiness is a domain-specific problem, and to my mind
> tackling it [there is better...]
Not necessarily. As we discussed, momentary system overload (whether
deliberately induced by an attacker, or accidental) can affect all PMDAs.
So can a VM or power suspend/resume. So can systemwide clock jumps. So
can some administrator's misguided debugging activity.
> Finally, it feels like we'd be taking a stance of being more accepting of
> mediocrity
I suspect a lot of your concerns come from here, but IMHO it is a separate
subject. The issue is not whether to shorten or lengthen
timeouts to a PMAPI client; the current default five seconds seems too
long for that, if anything. The issue is how to handle the occasional
excursion beyond the timeout: how to optimally *recover*.
> Perhaps timing mechanisms and
> new pmcd metrics to help identify those PMDAs which are suffering latency
> spikes. Even if timeouts are not reached, where pmcd is seen to
> "overreact", those PMDAs are still contributing to overall reduction of
> quality in terms of value/timestamp accuracy.
Sure, such introspection metrics would be nice to have.
> Also as discussed on IRC, we could implement a scheme where pmie is used to
> trigger restarts [...] by a trivial root cronjob and restarted thusly.
Neither scheme can do the job well. Between the time that a PMDA first
exceeds the timeout and the time that the cron job eventually runs,
the PMDA has been DoS'd. Restarting the entire PMCD
and its fleet of PMDAs means an even larger temporary DoS and impact
on unrelated PMDAs.
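
To make the contrast concrete, here is a rough toy model (not pmcd code) of
the kind of recovery I have in mind, assuming that "recover" means restarting
only the one tardy PMDA at the moment its timeout is detected. The Agent
class, fetch_all() and the latency numbers are all hypothetical; only the
five second default comes from the discussion above.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical stand-in for a PMDA; not a real PCP structure."""
    name: str
    latency: float      # simulated seconds needed to answer one fetch
    restarts: int = 0   # how many times this agent has been restarted


def fetch_all(agents, timeout=5.0):
    """Fetch from every agent; restart only those that exceed the timeout.

    Returns the names of the agents that answered in time.
    """
    answered = []
    for agent in agents:
        if agent.latency <= timeout:
            answered.append(agent.name)
        else:
            # Targeted recovery: only the tardy agent is touched, so
            # unrelated agents keep serving without interruption.
            agent.restarts += 1
            # Assumption for this toy model: a fresh agent is prompt again.
            agent.latency = 0.1
    return answered


agents = [Agent("prompt", 0.2), Agent("tardy", 9.0)]
first = fetch_all(agents)    # "tardy" misses this fetch and is restarted
second = fetch_all(agents)   # both agents answer on the very next fetch
```

In this model the tardy agent is unavailable for exactly one fetch and the
prompt one is never disturbed, whereas a cron-driven restart of everything
would leave the tardy agent dead until the job runs and then interrupt
both.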