On 28/01/15 08:13, Martins Innus wrote:
Hi,
I am trying to work around the "slow proc pmda gets killed by pmcd"
issue by using pmie and have run into an issue with pmcd getting "stuck"
in some way.
I have the following rule for pmie:
*************
delta = 2 min;
pmcd.agent.status #'proc' > 1
-> shell "/etc/pcp/pmie/procpmda_check.sh";
*************
Where for now, all the shell script does is log an event. This is kind
of convoluted, but I have boiled the failure down to the following:
1. kill <pid_of_proc_pmda>
2. The following appears in the pmie.log when the pmie check is supposed
to occur:
[Tue Jan 27 15:30:12] pmie(16399) Error: pmFetch from d13n01 failed: IPC
protocol failure
[Tue Jan 27 15:30:12] pmie(16399) Info: Lost connection to pmcd on host
d13n01
How is the proc PMDA installed (process or dso)? This suggests a pmcd
<--> client timeout, not a pmcd <--> pmda timeout (which cannot happen for
dso pmdas!).
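One way to answer the process-or-dso question is to look at the agent's line in pmcd.conf, where the third field gives the IPC type (dso, pipe or socket). A minimal sketch, assuming the common config location /etc/pcp/pmcd/pmcd.conf; the function name pmda_ipc_type is made up for illustration and is not part of PCP:

```shell
# Hypothetical helper: print the IPC type (dso, pipe or socket) that
# pmcd.conf records for the named PMDA.
# usage: pmda_ipc_type <path-to-pmcd.conf> <agent-name>
pmda_ipc_type() {
    awk -v name="$2" '
        $1 ~ /^#/  { next }                       # skip comment lines
        $1 == name { print $3; found = 1; exit }  # third field is the IPC type
        END        { exit !found }                # non-zero status if agent not listed
    ' "$1"
}

# Example, assuming the default config location:
#   pmda_ipc_type /etc/pcp/pmcd/pmcd.conf proc
```

An answer of "pipe" or "socket" means the PMDA runs as a separate process, so a pmcd <--> pmda timeout is possible; "dso" rules it out.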
There are two different timeouts in play here: -t or
pmcd.control.timeout for pmcd (pmcd <--> pmda) and the $PMCD_*_TIMEOUT
family for clients (pmcd <--> client). Which are you using, and what
values are they set to?
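As a rough aid for collecting those values, the sketch below prints the client-side environment variables. The function name show_client_timeouts is hypothetical, and the three variable names are the ones I believe make up the $PMCD_*_TIMEOUT family; check PCPIntro(1) on your system before relying on the list.

```shell
# Hypothetical helper: show the client-side PMCD_*_TIMEOUT environment
# variables as seen by the current shell.
show_client_timeouts() {
    for var in PMCD_CONNECT_TIMEOUT PMCD_REQUEST_TIMEOUT PMCD_RECONNECT_TIMEOUT; do
        eval "val=\${$var-}"            # indirect lookup, empty if unset
        if [ -n "$val" ]; then
            printf '%s=%s\n' "$var" "$val"
        else
            printf '%s is unset (library default applies)\n' "$var"
        fi
    done
}

# The pmcd-side timeout can be read from the running pmcd with:
#   pminfo -f pmcd.control.timeout
```

Note that pmie sees the environment it was started with, which may differ from an interactive shell, so this is best run from wherever pmie is launched.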
[Tue Jan 27 15:30:17] pmie(16399) Info: Re-established connection to
pmcd on host d13n01
3. The procpmda_check.sh never gets called, even after waiting 10
minutes, by which time several 2 minute cycles should have occurred.
Suggests pmcd and the proc PMDA are both being restarted.
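To make it easy to confirm whether the pmie action fires at all, a logging-only check script along the lines described above could append one timestamped line per invocation. This is only a sketch, not the poster's actual procpmda_check.sh; the function name and the log path in the example are assumptions.

```shell
# Hypothetical stand-in for procpmda_check.sh: append one timestamped
# line per invocation so it is easy to see whether pmie ran the action.
# usage: log_check_event <logfile> <message...>
log_check_event() {
    logfile=$1
    shift
    printf '%s procpmda_check: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$logfile"
}

# Example (log path is an assumption):
#   log_check_event /tmp/procpmda_check.log "proc PMDA status is non-zero"
```

If the log stays empty across several 2 minute cycles, the rule's predicate is not evaluating true (or not being evaluated), as opposed to the script itself failing.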