Hi Martins,
----- Original Message -----
> Hi,
> I am trying to work around the "slow proc pmda gets killed by pmcd"
> issue by using pmie and have run into an issue with pmcd getting "stuck"
> in some way.
Your pmie rule is working here with a default setup (i.e. pmdaproc
running as a privileged child process of pmcd). If I run the rule
below and issue "killall -v pmdaproc && pminfo -v proc" in a separate
window, this is what happens:
echo "stopped = pmcd.agent.status #'proc' > 1 -> print \"%i PMDA failed\" &
shell \"service pmcd restart &\";" | sudo pmie -ve -t 2
stopped (Wed Jan 28 09:33:29 2015): false
stopped (Wed Jan 28 09:33:31 2015): false
stopped (Wed Jan 28 09:33:33 2015): false
stopped (Wed Jan 28 09:33:35 2015): false
stopped (Wed Jan 28 09:33:37 2015): false
stopped (Wed Jan 28 09:33:39 2015): false
stopped (Wed Jan 28 09:33:41 2015): false
stopped (Wed Jan 28 09:33:43 2015): false
stopped (Wed Jan 28 09:33:45 2015): false
stopped (Wed Jan 28 09:33:47 2015): false
Wed Jan 28 09:33:49 2015: proc PMDA failed
stopped (Wed Jan 28 09:33:49 2015): true
Waiting for pmcd to terminate ...[Wed Jan 28 09:33:51] pmie(15086) Error:
pmFetch from smash failed: IPC protocol failure
[Wed Jan 28 09:33:51] pmie(15086) Info: Lost connection to pmcd on host smash
stopped (Wed Jan 28 09:33:51 2015): unknown
Starting pmcd ...
stopped (Wed Jan 28 09:33:53 2015): unknown
stopped (Wed Jan 28 09:33:55 2015): unknown
[Wed Jan 28 09:33:56] pmie(15086) Info: Re-established connection to pmcd on
host smash
stopped (Wed Jan 28 09:33:57 2015): false
stopped (Wed Jan 28 09:33:59 2015): false
stopped (Wed Jan 28 09:34:01 2015): false
stopped (Wed Jan 28 09:34:03 2015): false
stopped (Wed Jan 28 09:34:05 2015): false
> I have the following rule for pmie:
>
> *************
> delta = 2 min;
> pmcd.agent.status #'proc' > 1
> -> shell "/etc/pcp/pmie/procpmda_check.sh";
> *************
>
> Where for now, all the shell script does is log an event. This is kind
> of convoluted, but I have boiled the failure down to the following:
>
>
> 1. kill <pid_of_proc_pmda>
>
> 2. The following appears in the pmie.log when the pmie check is supposed
> to occur:
> [Tue Jan 27 15:30:12] pmie(16399) Error: pmFetch from d13n01 failed: IPC
> protocol failure
> [Tue Jan 27 15:30:12] pmie(16399) Info: Lost connection to pmcd on host
> d13n01
> [Tue Jan 27 15:30:17] pmie(16399) Info: Re-established connection to
> pmcd on host d13n01
(this certainly suggests *something* is restarting pmcd, because
pmie has to reconnect to it - if pmie is not doing the pmcd restart,
it must be something else...?)
> 3. The procpmda_check.sh never gets called. Even after waiting 10
> minutes, so several 2 minute cycles should have occurred.
>
> If instead I do the following:
>
> 1. kill <pid_of_proc_pmda>
>
> 2. pminfo
> Error: Broken pipe
(so the difference here is the addition of a request to the failed
PMDA, I think? That's missing in the first scenario anyway)
> 3. pminfo
> <correct pminfo output>
>
> 4. procpmda_check.sh gets called properly to signal that the pmda has
> died at the appropriate time.
That makes sense, I think - pmcd is only noticing that the PMDA is
gone once you request a metric from the failed PMDA.
> practice, but it was the only way I could think of simulating the
> behavior. Any thoughts on the best way to debug this?
I guess I'm not really seeing the problem (not able to reproduce it
here anyway). I think your simulation is fine, and it's OK that we
only notice the failed PMDA when something requests values from it
(since that's exactly what would happen in practice).
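
If you want to force that request in your simulation rather than wait
for some client to come along, a small helper along these lines might
do. This is only a sketch: poke_proc_pmda is a made-up name (not part
of PCP), and it just repeats the "pminfo -v proc" request I used above,
throwing away the output and the exit status. I am assuming here that
any request routed to the proc PMDA is enough for pmcd to notice that
the agent has gone.

```shell
#!/bin/sh
# Sketch only: send one request to the proc PMDA so that pmcd has a
# chance to notice the agent has died. poke_proc_pmda is a hypothetical
# helper name, not something shipped with PCP.
poke_proc_pmda() {
    # Same request as used earlier in this thread. The output and the
    # exit status are deliberately ignored; the only goal is to make
    # pmcd talk to the PMDA.
    pminfo -v proc >/dev/null 2>&1 || true
}
```

Calling poke_proc_pmda right after the kill in your first scenario
should make it behave like your second one, where the extra pminfo
request was the only difference.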
cheers.
--
Nathan