
Re: DSO PMDAs

To: Nathan Scott <nathans@xxxxxxxxxx>
Subject: Re: DSO PMDAs
From: fche@xxxxxxxxxx (Frank Ch. Eigler)
Date: Tue, 25 Mar 2014 07:54:08 -0400
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <2130236572.5100446.1395723217046.JavaMail.zimbra@xxxxxxxxxx> (Nathan Scott's message of "Tue, 25 Mar 2014 00:53:37 -0400 (EDT)")
References: <532C975F.4020808@xxxxxxxxxxx> <532F56BC.9040500@xxxxxxxxxxxxxxxx> <y0mzjkf1uf7.fsf@xxxxxxxx> <2130236572.5100446.1395723217046.JavaMail.zimbra@xxxxxxxxxx>
User-agent: Gnus/5.1008 (Gnus v5.10.8) Emacs/21.4 (gnu/linux)
Hi -


nathans wrote:

>> [...]
>> Over time, we should move away from DSO's anyway,
>
> The components that are DSO's today have been carefully selected as
> such, and those decisions are sound.  I agree people should have the
> choice to change from DSO to daemon if they wish (and they do).  The
> defaults are working well though and there isn't a compelling reason
> to change 'em - e.g. ...

We agree there's no emergency, but as you know we have encountered
real problems even recently (bz10555818).


>> We've seen pmda bugs crash & bring down pmcd.
>> I think we've seen memory leaks.
>
> In the case of the kernel PMDA in particular, these are not arguments
> for using a daemon.  The level of importance of the kernel metrics is
> so great that a crash in pmdalinux (whether daemon or DSO) may as well
> take out pmcd, cos not much useful work is going to happen beyond that
> point either way.  [...]

We agree that the mem/hinv/kernel/disk/network stats form the core,
yet they are not all jointly crucial.  There may well be sysadmins who
would agree "give me everything or give me nothing" - and there are
others who say "give me whatever you can".  It's not obvious whether a
failure-amplification approach based on the former is any wiser than a
failure-resilience approach based on the latter.

If we had sufficient isolation, we could continue producing partial
results.  If we had automated pmcd/pmda restartability, we could
continue producing full results, in case the underlying problem was
temporary.  See for example how bz1065803 (in the non-dso proc pmda)
failures are intermittent, and a manual restart restores function for
a while.  (Maybe per-pmda control via systemd could do the heavy
lifting for us.)
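
To make the isolation-plus-restart idea concrete, here is a minimal
Python sketch.  It is not how pmcd behaves today; Agent, collect and
make_flaky are hypothetical names, and each "agent" is just a callable
standing in for a pmda running in its own process.  A failing agent
costs only its own metrics for that round, and it is restarted before
the next fetch.

```python
class Agent:
    """Hypothetical stand-in for one pmda isolated in its own process."""

    def __init__(self, name, fetch):
        self.name = name
        self.fetch = fetch      # callable returning this agent's metrics
        self.alive = True
        self.restarts = 0

    def restart(self):
        # A real supervisor (systemd, say) would respawn the process here.
        self.alive = True
        self.restarts += 1


def collect(agents):
    """Fetch from every agent; a failure in one does not affect the rest."""
    results = {}
    for agent in agents:
        if not agent.alive:
            agent.restart()                 # automated restartability
        try:
            results[agent.name] = agent.fetch()
        except Exception:
            agent.alive = False             # contain the failure
            results[agent.name] = None      # partial results for this round
    return results


def make_flaky(failures, value):
    """Build a fetch callable that raises `failures` times, then works."""
    state = {"left": failures}

    def fetch():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("agent crashed")
        return value

    return fetch
```

With one healthy agent and one built by make_flaky(1, ...), the first
collect() returns a value for the healthy agent and None for the flaky
one; the second returns both, after one restart.  That is the
"temporary underlying problem" case described above.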


> [...] DSO mode makes this form of [pmda] memory-check testing much
> easier too, ironically (valgrind->pminfo).

FWIW, I have also had success running pipe-type pmdas under analysis
by interposing valgrind at the /etc/pcp/pmcd/pmcd.conf command line level.
(Note that the absence of valgrind reports is only weak evidence of an
absence of bugs.)
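
As a rough illustration of that interposition, the Python sketch below
rewrites one pmcd.conf agent line so that valgrind wraps the pmda
command.  It assumes the usual pipe-agent layout (name, domain, "pipe",
protocol, command and arguments); interpose_valgrind and the log file
path are hypothetical, and the sample paths in the note that follows
are assumptions that will vary by installation.

```python
def interpose_valgrind(conf_line, log_file="/tmp/pmda-%p.valgrind"):
    """Return conf_line with valgrind inserted ahead of a pipe pmda command.

    Lines that are comments, non-pipe agents (e.g. dso), or already
    wrapped are returned unchanged.
    """
    stripped = conf_line.strip()
    if not stripped or stripped.startswith("#"):
        return conf_line

    fields = stripped.split()
    # Assumed layout: name domain pipe protocol command [args...]
    if len(fields) < 5 or fields[2] != "pipe":
        return conf_line

    head, command = fields[:4], fields[4:]
    if command[0].endswith("valgrind"):
        return conf_line                    # already interposed

    # valgrind expands %p in --log-file to the process id.
    wrapper = ["valgrind", "--leak-check=full", "--log-file=" + log_file]
    return " ".join(head + wrapper + command)
```

For example, a line such as
"proc 3 pipe binary /var/lib/pcp/pmdas/proc/pmdaproc -d 3" comes back
with "valgrind --leak-check=full --log-file=..." placed between
"binary" and the pmdaproc path, while a dso line is left alone.  pmcd
would still need to be restarted to pick up the edited file.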


> Also, bear in mind we have many hosts running pmcd with the current
> DSO PMDA set in 24x7 production operation, and have for many years.
> We should (and do) have high levels of confidence in this code.
> [...]

High levels of confidence are totally justified by such experience,
especially in relatively static / homogeneous environments.  And yet we
would be remiss not to consider implications of actual failures seen
outside those particular 24x7 operations.


- FChE
