
Re: [pcp] Debugging sigpipe in pmda

To: Jeff Hanson <jhanson@xxxxxxx>, pcp@xxxxxxxxxxx
Subject: Re: [pcp] Debugging sigpipe in pmda
From: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Wed, 17 Aug 2016 09:24:41 +1000
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <7b627a8b-f775-5f28-423e-b33d63fffad8@xxxxxxx>
References: <df62753e-0d3d-3626-cd6e-ed1f8e17fd2e@xxxxxxx> <1831980510.1015515.1470956662271.JavaMail.zimbra@xxxxxxxxxx> <b735a150-5aa2-04f0-d9df-f4e8eb699c19@xxxxxxx> <y0m1t1ophga.fsf@xxxxxxxx> <83f3710f-d758-6f7d-d9af-480fb897f4c8@xxxxxxxxxxxxxxxx> <7b627a8b-f775-5f28-423e-b33d63fffad8@xxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0
On 17/08/16 07:45, Jeff Hanson wrote:
> ...
> The behavior seems only to have started when I increased the number of
> systems from which metrics could be fetched.  In my two tests it took
> 20 minutes and then 30+ minutes to replicate with two
> pmval <cluster metric> running at default sample rate.

This makes sense.  More nodes => more work => longer delay if the work is done 
in the PDU handler of the PMDA.

> What then would be the preferred architecture?

Not sure if I have the latest cluster PMDA source (it is outside the core PCP 
tree), but ... I'd start here:

        if (FD_ISSET(tc->fd, &readableFds)) {
                /* TODO: This should be made non-blocking so that failure
                 * to get a response from a single client doesn't cause the
                 * PMDA to block and get killed
                 */
                cluster_client_read(tc, i);
                ...

and much more seriously, the PMDA needs a second thread with a timer loop, 
pulling all the cluster metric refreshing logic into this thread and then 
guarding all the pmdaCache*() calls with a local mutex.
