Re: [pcp] pmcd gets stuck with pmda kill

To: pcp@xxxxxxxxxxx
Subject: Re: [pcp] pmcd gets stuck with pmda kill
From: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Wed, 28 Jan 2015 09:15:59 +1100
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <54C7FF66.5090503@xxxxxxxxxxx>
References: <54C7FF66.5090503@xxxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
On 28/01/15 08:13, Martins Innus wrote:
Hi,
     I am trying to work around the "slow proc pmda gets killed by pmcd"
issue by using pmie and have run into an issue with pmcd getting "stuck"
in some way.

I have the following rule for pmie:

*************
delta = 2 min;
pmcd.agent.status #'proc' > 1
-> shell "/etc/pcp/pmie/procpmda_check.sh";
*************

Where for now, all the shell script does is log an event.  This is kind
of convoluted, but I have boiled the failure down to the following:


1. kill <pid_of_proc_pmda>

2. The following appears in the pmie.log when the pmie check is supposed
to occur:
[Tue Jan 27 15:30:12] pmie(16399) Error: pmFetch from d13n01 failed: IPC
protocol failure
[Tue Jan 27 15:30:12] pmie(16399) Info: Lost connection to pmcd on host
d13n01

How is the proc PMDA installed (process or dso)? This suggests a pmcd <--> client timeout, not a pmcd <--> pmda timeout (which cannot happen for dso pmdas!).
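
One way to answer the process-or-dso question is to read the IPC column of pmcd.conf, where each PMDA line starts with name, domain and IPC type (dso, pipe or socket). A rough sketch follows; the helper name pmda_ipc_type is made up for illustration, and the default path /etc/pcp/pmcd/pmcd.conf is an assumption (the real location is given by $PCP_PMCDCONF_PATH in /etc/pcp.conf).

```shell
# Hypothetical helper: print the IPC type (dso, pipe or socket) that
# pmcd.conf records for one PMDA.
# Usage: pmda_ipc_type NAME [CONFFILE]
pmda_ipc_type() {
    name=$1
    # Assumed default location; adjust if $PCP_PMCDCONF_PATH differs.
    conf=${2:-/etc/pcp/pmcd/pmcd.conf}
    # Field 1 is the PMDA name, field 3 the IPC type. Comment lines never
    # match because their first field starts with '#'.
    awk -v n="$name" '
        $1 == n { print $3; found = 1; exit }
        END     { exit !found }
    ' "$conf"
}
```

Running "pmda_ipc_type proc" would then print the IPC type for the proc PMDA, and return non-zero if there is no such entry.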

There are two different timeouts in play here: the pmcd <--> pmda timeout (-t or pmcd.control.timeout for pmcd) and the client-side $PMCD_*_TIMEOUT family of environment variables. Which are you using, and what values are they set to?
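
To report the client-side settings, something like the sketch below could be run in the environment pmie is started from. The function name show_client_timeouts is made up for illustration; the three variables listed are the members of the $PMCD_*_TIMEOUT family I am assuming are relevant here, and "unset" just means the library default applies.

```shell
# Hypothetical helper: print each client-side pmcd timeout variable,
# or "unset" when the environment does not define it.
show_client_timeouts() {
    for name in PMCD_CONNECT_TIMEOUT PMCD_REQUEST_TIMEOUT PMCD_RECONNECT_TIMEOUT
    do
        # POSIX sh has no indirect expansion, so use eval on these
        # fixed, known-safe variable names.
        eval "value=\${$name:-unset}"
        printf '%s=%s\n' "$name" "$value"
    done
}
```

The pmcd side can be read from the running daemon with "pminfo -f pmcd.control.timeout", assuming pmcd is up and reachable.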

[Tue Jan 27 15:30:17] pmie(16399) Info: Re-established connection to
pmcd on host d13n01
3. The procpmda_check.sh never gets called, even after waiting 10
minutes, by which time several 2 minute cycles should have occurred.

Suggests pmcd and the proc PMDA are both being restarted.