Ken,
On 1/28/15 3:18 PM, Ken McDonell wrote:
On 28/01/15 09:44, Nathan Scott wrote:
....
That makes sense, I think - pmcd is only noticing that the PMDA is
gone once you request a metric from the failed PMDA.
But in the original scenario, pmcd KNOWS the PMDA is gone (it terminated the
PMDA after the timeout) ... is it possible we have a permissions issue here?
What are the effective uids of pmcd and the proc pmda process at the point of
the timeout? On my system, it looks like this ...
kenj@bozo:~$ ps -ef | egrep '/[p](mcd|mdaproc)'
pcp 26238 9047 0 Jan28 ? 00:00:02 /usr/lib/pcp/bin/pmcd -T 3
root 26253 26238 0 Jan28 ? 00:00:31 /var/lib/pcp/pmdas/proc/pmdaproc -d 3
which means pmcd cannot kill the proc PMDA ... but we're OK! I checked the
pmcd code and we don't kill the timed-out PMDA, we just close all the IPC
channels (pipes in this case) to it, which will cause it to shut down of its
own accord.
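As a rough illustration of why closing the pipes is enough, and why the differing uids do not matter, here is a minimal Python sketch. The child below is a hypothetical stand-in for a PMDA, not the real pmdaproc, and `close_channels_and_wait` is a name made up for this example. No signal is sent; the parent only closes its end of the pipe, and the child exits when it sees end-of-file.

```python
import subprocess
import sys

# Hypothetical stand-in for a PMDA (an assumption, not pmdaproc itself):
# it consumes requests from stdin until end-of-file, then exits with status 0.
CHILD = "import sys\nfor line in sys.stdin:\n    pass\nsys.exit(0)\n"


def close_channels_and_wait(timeout=5.0):
    """Start the stand-in agent, close its input pipe, return its exit status."""
    agent = subprocess.Popen([sys.executable, "-c", CHILD],
                             stdin=subprocess.PIPE)
    # No kill() here, so no permission is needed over the child process.
    # Closing the pipe makes the child's read loop hit end-of-file.
    agent.stdin.close()
    return agent.wait(timeout=timeout)


if __name__ == "__main__":
    print("agent exit status:", close_channels_and_wait())
```

The same reasoning should hold whatever user the child runs as, since the parent only acts on its own file descriptors.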
I tested this by pausing the proc PMDA with gdb, running a pminfo -v proc and
waiting for the timeout. pmcd and the proc PMDA both behaved as expected, and
after this
kenj@bozo:~$ pminfo -f pmcd.agent.status
pmcd.agent.status
inst [1 or "root"] value 0
inst [2 or "pmcd"] value 0
inst [3 or "proc"] value 8 <====== correct
inst [11 or "xfs"] value 0
inst [29 or "sample"] value 0
inst [30 or "sampledso"] value 0
inst [60 or "linux"] value 0
inst [70 or "mmv"] value 0
inst [122 or "jbd2"] value 0
inst [253 or "simple"] value 0
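For checking this kind of output in a script, a small Python sketch along these lines could pick out the agents whose status is not 0. It assumes only the line layout shown above; `nonzero_agents` and `SAMPLE` are names made up for this example, and it does not try to interpret what each non-zero value means.

```python
import re

# Matches lines such as:  inst [3 or "proc"] value 8
INST_RE = re.compile(r'inst \[(\d+) or "([^"]+)"\] value (-?\d+)')


def nonzero_agents(pminfo_output):
    """Return {agent_name: status} for every instance whose value is not 0."""
    result = {}
    for match in INST_RE.finditer(pminfo_output):
        _domain, name, value = match.groups()
        if int(value) != 0:
            result[name] = int(value)
    return result


# Shortened copy of the output quoted above, used as sample input.
SAMPLE = '''pmcd.agent.status
inst [1 or "root"] value 0
inst [3 or "proc"] value 8
inst [60 or "linux"] value 0
'''

if __name__ == "__main__":
    print(nonzero_agents(SAMPLE))
```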
So, I am having trouble understanding this "extra fetch" line of reasoning,
except in the case where you kill (as opposed to suspend) the PMDA process, which is not
the original scenario.
Based on your analysis, this is correct for the slow pmda case. I was
trying to come up with a test case to simulate this, since I have not yet
been able to reliably reproduce the case where pmcd closes the pmda.
So I had assumed that just doing a kill on the pmda would trigger the
same response. Clearly not true. But now I can use your gdb trick, thanks!
But is it valid to assume that, as a separate case, pmcd should continue
to function if a pmda gets "killed" in some other way? OOM killer, some
other error?
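To make the two cases concrete, here is a minimal Python sketch of the difference as a client on a pipe would see it. It is only a model: the child is a hypothetical echo process, not a real PMDA, `start_agent` and `request` are names made up for this example, and SIGSTOP is used as an assumed substitute for pausing the process under gdb. A suspended child leaves the pipe open, so the request simply times out; a killed child closes the pipe, so the failure is visible immediately.

```python
import os
import select
import signal
import subprocess
import sys

# Hypothetical stand-in agent (not a real PMDA): echoes each request line.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line)\n"
    "    sys.stdout.flush()\n"
)


def start_agent():
    # bufsize=0 gives unbuffered pipes, so errors surface on write().
    return subprocess.Popen([sys.executable, "-c", CHILD], bufsize=0,
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)


def request(agent, timeout):
    """Send one request; return 'reply', 'timeout', or 'gone'."""
    try:
        agent.stdin.write(b"fetch\n")
    except BrokenPipeError:
        return "gone"            # read end closed: the agent has died
    ready, _, _ = select.select([agent.stdout], [], [], timeout)
    if not ready:
        return "timeout"         # pipe still open, but nobody answered
    data = agent.stdout.read(4096)
    return "reply" if data else "gone"   # empty read means end-of-file


if __name__ == "__main__":
    slow = start_agent()
    print("healthy:", request(slow, 5.0))
    os.kill(slow.pid, signal.SIGSTOP)    # suspend, like pausing under gdb
    print("suspended:", request(slow, 0.5))
    slow.kill()
    slow.wait()

    dead = start_agent()
    request(dead, 5.0)                   # make sure it is up and running
    dead.kill()                          # terminate, like kill or the OOM killer
    dead.wait()
    print("killed:", request(dead, 0.5))
```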
Thanks
Martins