Re: [pcp] pmcd gets stuck with pmda kill

To: Martins Innus <minnus@xxxxxxxxxxx>
Subject: Re: [pcp] pmcd gets stuck with pmda kill
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Tue, 27 Jan 2015 17:44:05 -0500 (EST)
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <54C7FF66.5090503@xxxxxxxxxxx>
References: <54C7FF66.5090503@xxxxxxxxxxx>
Reply-to: Nathan Scott <nathans@xxxxxxxxxx>

Hi Martins,

----- Original Message -----
> Hi,
>      I am trying to work around the "slow proc pmda gets killed by pmcd"
> issue by using pmie and have run into an issue with pmcd getting "stuck"
> in some way.

Your pmie rule is working here with a default setup (i.e. pmdaproc as a
privileged child process of pmcd).  If I run the rule below and issue
"killall -v pmdaproc && pminfo -v proc" in a separate window, this is
what happens...

echo "stopped = pmcd.agent.status #'proc' > 1 -> print \"%i PMDA failed\" & 
shell \"service pmcd restart &\";" | sudo pmie -ve -t 2 
stopped (Wed Jan 28 09:33:29 2015): false

stopped (Wed Jan 28 09:33:31 2015): false

stopped (Wed Jan 28 09:33:33 2015): false

stopped (Wed Jan 28 09:33:35 2015): false

stopped (Wed Jan 28 09:33:37 2015): false

stopped (Wed Jan 28 09:33:39 2015): false

stopped (Wed Jan 28 09:33:41 2015): false

stopped (Wed Jan 28 09:33:43 2015): false

stopped (Wed Jan 28 09:33:45 2015): false

stopped (Wed Jan 28 09:33:47 2015): false

Wed Jan 28 09:33:49 2015: proc PMDA failed
stopped (Wed Jan 28 09:33:49 2015): true

Waiting for pmcd to terminate ...[Wed Jan 28 09:33:51] pmie(15086) Error: 
pmFetch from smash failed: IPC protocol failure
[Wed Jan 28 09:33:51] pmie(15086) Info: Lost connection to pmcd on host smash
stopped (Wed Jan 28 09:33:51 2015): unknown


Starting pmcd ... 
stopped (Wed Jan 28 09:33:53 2015): unknown

stopped (Wed Jan 28 09:33:55 2015): unknown

[Wed Jan 28 09:33:56] pmie(15086) Info: Re-established connection to pmcd on 
host smash
stopped (Wed Jan 28 09:33:57 2015): false

stopped (Wed Jan 28 09:33:59 2015): false

stopped (Wed Jan 28 09:34:01 2015): false

stopped (Wed Jan 28 09:34:03 2015): false

stopped (Wed Jan 28 09:34:05 2015): false


> I have the following rule for pmie:
> 
> *************
> delta = 2 min;
> pmcd.agent.status #'proc' > 1
> -> shell "/etc/pcp/pmie/procpmda_check.sh";
> *************
> 
> Where for now, all the shell script does is log an event.  This is kind
> of convoluted, but I have boiled the failure down to the following:
> 
> 
> 1. kill <pid_of_proc_pmda>
> 
> 2. The following appears in the pmie.log when the pmie check is supposed
> to occur:
> [Tue Jan 27 15:30:12] pmie(16399) Error: pmFetch from d13n01 failed: IPC
> protocol failure
> [Tue Jan 27 15:30:12] pmie(16399) Info: Lost connection to pmcd on host
> d13n01
> [Tue Jan 27 15:30:17] pmie(16399) Info: Re-established connection to
> pmcd on host d13n01

(This certainly suggests *something* is restarting pmcd, because
pmie has to reconnect to it.  If pmie is not doing the pmcd restart,
it must be something else...?)

> 3. The procpmda_check.sh never gets called. Even after waiting 10
> minutes, so several 2 minute cycles should have occurred.
> 
> If instead I do the following:
> 
> 1. kill <pid_of_proc_pmda>
> 
> 2. pminfo
>        Error: Broken pipe

(So the difference here is the addition of a request to the failed
PMDA, I think?  That's missing in the first scenario, anyway.)

> 3. pminfo
>        <correct pminfo output>
> 
> 4. procpmda_check.sh gets called properly to signal that the pmda has
> died at the appropriate time.

That makes sense, I think - pmcd is only noticing that the PMDA is
gone once you request a metric from the failed PMDA.
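
If you want to force that noticing rather than wait for a client, a
rough sketch along these lines might help.  The function name and the
PMINFO and PROBE_METRIC variables are my own inventions for
illustration, and proc.nprocs is just an assumed stand-in for any
metric served by the proc PMDA:

```shell
#!/bin/sh
# Sketch only: poke the proc PMDA so pmcd has a chance to notice a dead
# agent, then print the agent status values.
# PMINFO and PROBE_METRIC are hypothetical knobs, not PCP settings.
PMINFO=${PMINFO:-pminfo}
PROBE_METRIC=${PROBE_METRIC:-proc.nprocs}

probe_proc_pmda() {
    # This fetch may fail (e.g. "Broken pipe") when the PMDA is gone;
    # the failure is what makes pmcd update pmcd.agent.status.
    "$PMINFO" -f "$PROBE_METRIC" >/dev/null 2>&1

    # Report the status of every agent, including proc.
    "$PMINFO" -f pmcd.agent.status
}
```

Calling probe_proc_pmda from cron or by hand before the pmie rule
fires should mimic what your manual pminfo runs did in the second
scenario.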

> practice, but it was the only way I could think of simulating the
> behavior. Any thoughts on the best way to debug this?

I guess I'm not really seeing the problem (not able to reproduce it
here anyway).  I think your simulation is fine, and it's OK that we
only notice the failed PMDA when something requests values from it
(since that's exactly what would happen in practice).
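
For the logging in your check script, it may be worth decoding the
status value instead of only testing "> 1".  Below is a minimal
sketch; the function name is hypothetical, and the bit layout is what
I recall from the metric's help text (check "pminfo -T
pmcd.agent.status" on your build before relying on it):

```shell
#!/bin/sh
# Sketch only: split a pmcd.agent.status value into its bit fields.
# Layout assumed from the metric help text:
#   bits 7..0   agent state flags (0 means the agent is active)
#   bits 15..8  exit() status of the PMDA
#   bits 23..16 number of the signal that terminated the PMDA
decode_agent_status() {
    status=$1
    state=$(( status & 0xff ))
    exit_code=$(( (status >> 8) & 0xff ))
    signal=$(( (status >> 16) & 0xff ))
    echo "state=$state exit=$exit_code signal=$signal"
}
```

For example, "decode_agent_status 983042" prints "state=2 exit=0
signal=15", which is easier to read in a log than the raw number.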

cheers.

--
Nathan
