
Re: [pcp] pmcd gets stuck with pmda kill

To: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>, Nathan Scott <nathans@xxxxxxxxxx>
Subject: Re: [pcp] pmcd gets stuck with pmda kill
From: Martins Innus <minnus@xxxxxxxxxxx>
Date: Wed, 28 Jan 2015 15:29:27 -0500
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <54C9441E.4060302@xxxxxxxxxxxxxxxx>
References: <54C7FF66.5090503@xxxxxxxxxxx> <1902595642.1770600.1422398645794.JavaMail.zimbra@xxxxxxxxxx> <54C9441E.4060302@xxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
Ken,

On 1/28/15 3:18 PM, Ken McDonell wrote:
On 28/01/15 09:44, Nathan Scott wrote:
....
That makes sense, I think - pmcd is only noticing that the PMDA is
gone once you request a metric from the failed PMDA.
But in the original scenario, pmcd KNOWS the PMDA is gone (it terminated the 
PMDA after the timeout) ... is it possible we have a permissions issue here?  
What are the effective uids of pmcd and the proc pmda process at the point of 
the timeout?  On my system, it looks like this ...

kenj@bozo:~$ ps -ef | egrep '/[p](mcd|mdaproc)'
pcp      26238  9047  0 Jan28 ?        00:00:02 /usr/lib/pcp/bin/pmcd -T 3
root     26253 26238  0 Jan28 ?        00:00:31 /var/lib/pcp/pmdas/proc/pmdaproc -d 3

which means pmcd cannot kill the proc PMDA ... but we're OK!  I checked the
pmcd code and we don't kill the timed-out PMDA, we just close all the IPC
channels (pipes in this case) to it, which will cause it to shut down of its
own accord.
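
The "close the channel and let it exit" behaviour can be sketched in isolation. This is not pmcd code: `cat` is a hypothetical stand-in for the PMDA and a FIFO stands in for the request pipe, so it only illustrates the general mechanism of a reader exiting on EOF without being signalled.

```shell
#!/bin/sh
# Sketch only: `cat` is a hypothetical stand-in for a PMDA,
# and the FIFO is a stand-in for its request pipe.
dir=$(mktemp -d)
mkfifo "$dir/req"

# "Agent": reads requests until it sees EOF on its input, then exits by itself.
cat "$dir/req" >/dev/null &
agent=$!

# "Daemon": open the write end, send one request, then close the channel.
exec 3>"$dir/req"
echo "fetch" >&3
exec 3>&-

# No signal was sent; the agent exits because its input reached EOF.
wait "$agent"
status=$?
echo "agent exit status: $status"
rm -rf "$dir"
```

Because nothing is signalled, the uid mismatch shown in the ps output above does not matter for this path: the writer only needs to close its own file descriptor.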

I tested this by pausing the proc PMDA with gdb, running a pminfo -v proc and 
waiting for the timeout.  pmcd and the proc PMDA both behaved as expected, and 
after this

kenj@bozo:~$ pminfo -f pmcd.agent.status

pmcd.agent.status
     inst [1 or "root"] value 0
     inst [2 or "pmcd"] value 0
     inst [3 or "proc"] value 8  <====== correct
     inst [11 or "xfs"] value 0
     inst [29 or "sample"] value 0
     inst [30 or "sampledso"] value 0
     inst [60 or "linux"] value 0
     inst [70 or "mmv"] value 0
     inst [122 or "jbd2"] value 0
     inst [253 or "simple"] value 0

So, I am having trouble understanding this "extra fetch" line of reasoning, 
except in the case where you kill (as opposed to suspend) the PMDA process, which is not 
the original scenario.


Based on your analysis, this is correct for the slow pmda case. I was trying to come up with a test case to simulate this, since I have not yet been able to reliably reproduce the case where pmcd closes the pmda. So I had assumed that just doing a kill on the pmda would trigger the same response. Clearly not true. But now I can use your gdb trick, thanks!
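
For the record, the difference between the two test approaches can be shown without pcp at all. In this sketch `sleep` is a hypothetical stand-in for the PMDA, and SIGSTOP is swapped in for attaching gdb. It is only an assumption that a SIGSTOP'd PMDA looks the same to pmcd as a gdb-paused one; that has not been checked here.

```shell
#!/bin/sh
# Sketch only: `sleep` is a hypothetical stand-in for the PMDA process.
sleep 30 &
pid=$!

# Suspend: the process stays alive but cannot answer anything.
kill -STOP "$pid"
sleep 1
if kill -0 "$pid" 2>/dev/null; then
    suspended_alive=yes
else
    suspended_alive=no
fi
echo "after SIGSTOP, still present: $suspended_alive"

# Kill: the process is gone and the parent can reap its status.
kill -KILL "$pid"
wait "$pid"
status=$?
echo "after SIGKILL, wait status: $status"
```

A suspended process is still present and simply stops responding, which is what triggers a timeout in the parent. A killed process is gone, and the parent sees that only when it reaps the child or next touches the pipe.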

But is it valid to assume that, as a separate case, pmcd should continue to function if a pmda gets "killed" in some other way, for example by the OOM killer or because of some other error?

Thanks

Martins
