Hi -
nathans wrote:
>> [...]
>> Over time, we should move away from DSO's anyway,
>
> The components that are DSO's today have been carefully selected as
> such, and those decisions are sound. I agree people should have the
> choice to change from DSO to daemon if they wish (and they do). The
> defaults are working well though and there isn't a compelling reason
> to change 'em - e.g. ...
We agree there's no emergency, but as you know we have encountered
real problems even recently (bz10555818).
>> We've seen pmda bugs crash & bring down pmcd.
>> I think we've seen memory leaks.
>
> In the case of the kernel PMDA in particular, these are not arguments
> for using a daemon. The level of importance of the kernel metrics is
> so great that a crash in pmdalinux (whether daemon or DSO) may as well
> take out pmcd, cos not much useful work is going to happen beyond that
> point either way. [...]
We agree that the mem/hinv/kernel/disk/network stats form the core,
yet they are not all jointly crucial. There may well be sysadmins who
would agree "give me everything or give me nothing" - and there are
others who say "give me whatever you can". It's not obvious whether a
failure-amplification approach based on the former is any wiser than a
failure-resilience approach based on the latter.
If we had sufficient isolation, we could continue producing partial
results. If we had automated pmcd/pmda restartability, we could
continue producing full results, in case the underlying problem was
temporary. See for example bz1065803 (in the non-dso proc pmda), where
failures are intermittent and a manual restart restores function for
a while. (Maybe per-pmda control via systemd could do the heavy
lifting for us.)
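
To make the isolation/restart idea concrete, here is a minimal sketch
in Python. It is not PCP code: the agent names, the AGENTS table, and
the fetch_one/fetch_all helpers are all made up for illustration. Each
"agent" runs in its own child process, a failed agent is restarted
once, and whatever the surviving agents report is still returned.

```python
import json
import subprocess
import sys

# Hypothetical stand-ins for PMDAs: each is a tiny program run in its
# own process, so a failure in one cannot take down the others.
AGENTS = {
    "mem": "import json; print(json.dumps({'mem.free': 1024}))",
    "disk": "import json; print(json.dumps({'disk.reads': 7}))",
    "proc": "import sys; sys.exit(1)",  # simulates an agent that dies
}


def fetch_one(code, retries=1, timeout=5.0):
    """Run one agent in a child process, restarting it on failure.

    Returns the agent's values as a dict, or None if every attempt fails.
    """
    for _ in range(retries + 1):
        try:
            done = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            continue  # treat a hung agent like a crashed one
        if done.returncode == 0:
            return json.loads(done.stdout)
    return None


def fetch_all(agents):
    """Collect from every agent; report partial results plus failures."""
    values, failed = {}, []
    for name, code in agents.items():
        result = fetch_one(code)
        if result is None:
            failed.append(name)
        else:
            values.update(result)
    return values, failed


if __name__ == "__main__":
    values, failed = fetch_all(AGENTS)
    print("values:", values)
    print("failed agents:", failed)
```

With the table above, the mem and disk values come back and only
"proc" is listed as failed. A real deployment would presumably leave
the restart policy to a supervisor rather than hand-roll it like this.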
> [...] DSO mode makes this form of [pmda] memory-check testing much
> easier too, ironically (valgrind->pminfo).
FWIW, I have also had success running pipe type pmdas under analysis
by interposing valgrind at the /etc/pcp/pmcd/pmcd.conf command line level.
(Note that the absence of valgrind reports is only weak evidence of an
absence of bugs.)
> Also, bear in mind we have many hosts running pmcd with the current
> DSO PMDA set in 24x7 production operation, and have for many years.
> We should (and do) have high levels of confidence in this code.
> [...]
High levels of confidence are totally justified by such experience,
especially in relatively static / homogeneous environments. And yet we
would be remiss not to consider implications of actual failures seen
outside those particular 24x7 operations.
- FChE