Nathan,
On 1/27/15 5:44 PM, Nathan Scott wrote:
Your pmie rule is working here with a default setup (ie pmdaproc as a
privileged child process of pmcd). If I run the rule below and issue
"killall -v pmdaproc && pminfo -v proc" in a separate window, this is
what happens...
OK, this part makes sense to me now. I had misunderstood what
would alert pmcd to the disappearance of a pmda.
3. The procpmda_check.sh never gets called, even after waiting 10
minutes, during which several 2-minute cycles should have occurred.
If instead I do the following:
1. kill <pid_of_proc_pmda>
2. pminfo
Error: Broken pipe
(so the difference here is an addition of a request to the failed
PMDA, I think? That's missing in the first scenario anyway)
Correct
3. pminfo
<correct pminfo output>
4. procpmda_check.sh gets called properly to signal that the pmda has
died at the appropriate time.
That makes sense, I think - pmcd is only noticing that the PMDA is
gone once you request a metric from the failed PMDA.
OK. I had assumed that the pmie call to:
pmcd.agent.status #'proc'
would trigger some sort of request to the proc pmda, but on further
thought, I guess it makes sense that this would not occur and pmcd's
knowledge of the proc pmda state would only change on a real metric request.
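
To illustrate the general mechanism being discussed (this is an analogy, not pmcd code): when a child process connected by a pipe dies, the parent's write end stays open, and the failure only surfaces as "Broken pipe" when the parent next writes to it. The sleeping child below is a hypothetical stand-in for a PMDA.

```python
import subprocess
import sys


def write_to_dead_child():
    """Return the exception raised when writing to a killed child's stdin."""
    # Stand-in for a PMDA: a child that just sleeps, connected by a pipe.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        stdin=subprocess.PIPE,
        bufsize=0,  # unbuffered, so write() goes straight to the pipe
    )
    child.kill()
    # wait() is used only to make this demo deterministic; the pipe itself
    # reports nothing until the next write.
    child.wait()
    try:
        child.stdin.write(b"request\n")
    except BrokenPipeError as err:
        return err
    finally:
        child.stdin.close()
    return None


if __name__ == "__main__":
    print(repr(write_to_dead_child()))
```

Python ignores SIGPIPE by default, so the failed write shows up as an exception rather than a signal.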
practice, but it was the only way I could think of simulating the
behavior. Any thoughts on the best way to debug this?
I guess I'm not really seeing the problem (not able to reproduce it
here anyway). I think your simulation is fine, and it's OK that we only
notice the failed PMDA when something requests values (since
that's exactly what would happen in practice).
OK. Sounds good. As you say, in actual practice, a metric fetch should
cause pmcd to notice what is going on, and after that, the pmie rule
will fire. I'm all set on the pmie side. But as I discovered below,
only certain types of fetches will cause this to happen.
It turns out that pmie has nothing to do with the problem I was seeing
and was just adding noise, so I stripped away more variables.
With pmie, pmlogger, pmproxy and pmwebd shut down (so nothing is
interacting with pmcd on a regular basis), on a fresh VM (CentOS 6.5)
running git dev from this morning with the out-of-the-box config after
doing Makepkgs:
>killall -v pmdaproc
Killed pmdaproc(23682) with signal 15
>pmval pmcd.agent.status
pmval: pmLookupDesc: IPC protocol failure
>pmval hinv.ncpu
pmval: pmLookupDesc: IPC protocol failure
>pminfo hinv.ncpu
Error: hinv.ncpu: Broken pipe
>pmval proc.nprocs
pmval: pmLookupDesc: IPC protocol failure
>pminfo proc.nprocs
Error: proc.nprocs: Broken pipe
I can't get anything out of pcp until I run either "pminfo" or "pminfo
proc". Then it all works fine again. So even without pmie to restart
it, if I do "pminfo proc", the proc pmda remains dead (as expected)
and doesn't return valid information (as expected), but this
gets pmcd back into a usable state. My concern is that on
some machines we log proc infrequently compared to other metrics, so if
the proc_pmda dies, anybody's queries for non-proc metrics will fail
until something queries the dead pmda in an appropriate way.
It's important to run this test without pmlogger active for any proc
metrics, as that kicks pmcd back into a good state when those metrics are
logged. In practice I can work around this by just having pmlogger log a
lightweight proc metric fairly frequently, but I wanted to know if this
is expected behavior when a pmda goes away. It's strange that the act
of logging a proc metric brings pmcd back but a pmval doesn't.
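
As a rough sketch of the same workaround done outside pmlogger, the script below periodically issues the request that was run by hand above ("pminfo proc"). The function names, the 30-second interval and the 10-second timeout are assumptions for illustration only. Whether a periodic request like this is the right fix, rather than a stopgap, is exactly the open question here.

```python
import subprocess
import sys
import time


def poke(command=("pminfo", "proc"), timeout=10):
    """Run one request against pmcd; return (exit_status, combined_output)."""
    result = subprocess.run(
        list(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        timeout=timeout,
        check=False,
    )
    return result.returncode, result.stdout.decode(errors="replace")


def poke_forever(interval_seconds=30, command=("pminfo", "proc")):
    """Hypothetical keepalive loop: issue the request at a fixed interval."""
    while True:
        status, output = poke(command)
        if status != 0:
            # Surface whatever the tool printed; no specific exit codes assumed.
            print(output, file=sys.stderr)
        time.sleep(interval_seconds)
```

Calling poke_forever() from a small service or cron-style wrapper would stand in for the frequent pmlogger entry described above.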
Finally, the exact same thing happens if I kill the sample or linux
pmdas, but not other pmdas. There are no problems killing simple, xfs,
ib, or any of the perl pmdas. I don't understand this part.
OK, hmm, the backtrace on pmcd when doing "pmval hinv.ncpu" after the
proc_pmda has died:
Program received signal SIGPIPE, Broken pipe.
0x00007fc064b4c520 in __write_nocancel () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fc064b4c520 in __write_nocancel () from /lib64/libc.so.6
#1 0x00007fc065023c03 in __pmXmitPDU (fd=11, pdubuf=0x7fc06686b000) at pdu.c:338
#2 0x00007fc065022da7 in __pmSendAttr (fd=11, from=<value optimized out>, attr=14, value=<value optimized out>, length=6) at p_attr.c:64
#3 0x00007fc0656c1b5f in DoAttributes (ap=0x7fc066864f60, clientID=0) at config.c:1391
#4 0x00007fc0656c2355 in AgentsAttributes (clientID=0) at config.c:1411
#5 0x00007fc0656c8018 in DoCreds (cp=0x7fc066869e50, pb=<value optimized out>) at dopdus.c:1139
#6 0x00007fc0656be942 in HandleClientInput (fdsPtr=0x7fffb73f77d0) at pmcd.c:369
#7 0x00007fc0656bf317 in ClientLoop (argc=<value optimized out>, argv=<value optimized out>) at pmcd.c:699
#8 main (argc=<value optimized out>, argv=<value optimized out>) at pmcd.c:890
(gdb) up
#1 0x00007fc065023c03 in __pmXmitPDU (fd=11, pdubuf=0x7fc06686b000) at pdu.c:338
338 n = socketipc ? __pmSend(fd, p, len-off, 0) : write(fd, p, len-off);
(gdb) up
#2 0x00007fc065022da7 in __pmSendAttr (fd=11, from=<value optimized out>, attr=14, value=<value optimized out>, length=6) at p_attr.c:64
64 sts = __pmXmitPDU(fd, (__pmPDU *)pp);
(gdb) up
#3 0x00007fc0656c1b5f in DoAttributes (ap=0x7fc066864f60, clientID=0) at config.c:1391
1391 if ((sts = __pmSendAttr(ap->inFd,
(gdb) print *ap
$3 = {pmDomainId = 3, ipcType = 2, pduVersion = 2, inFd = 8, outFd = 9,
done = 0, profClient = 0x0, profIndex = 0, pmDomainLabel =
0x7f126cdaf470 "proc", status = {connected = 1, busy = 0, isChild = 1,
madeDsoResult = 0, restartKeep = 0,
notReady = 0, startNotReady = 0, unused = 0, flags = 68}, reason =
0, ipc = {dso = {pathName = 0x7f126cdaf0e0
"/var/lib/pcp/pmdas/proc/pmdaproc -d 3", xlatePath = 1826288160,
entryPoint = 0x6e95 <Address 0x6e95 out of bounds>, dlHandle = 0x0,
initFn = 0, dispatch = {domain = 0, comm = {pmda_interface = 0,
pmapi_version = 0, flags = 0}, status = 0, version = {any = {ext = 0x0,
profile = 0, fetch = 0, desc = 0, instance = 0, text = 0, store = 0},
two = {ext = 0x0, profile = 0,
fetch = 0, desc = 0, instance = 0, text = 0, store = 0},
three = {ext = 0x0, profile = 0, fetch = 0, desc = 0, instance = 0, text
= 0, store = 0}, four = {ext = 0x0, profile = 0, fetch = 0, desc = 0,
instance = 0, text = 0, store = 0,
pmid = 0, name = 0, children = 0}, five = {ext = 0x0,
profile = 0, fetch = 0, desc = 0, instance = 0, text = 0, store = 0,
pmid = 0, name = 0, children = 0}, six = {ext = 0x0, profile = 0, fetch
= 0, desc = 0, instance = 0, text = 0,
store = 0, pmid = 0, name = 0, children = 0, attribute =
0}}}}, socket = {addrDomain = 1826287840, port = 32530, name =
0x7f126cdaf220 "@\363\332l\022\177", commandLine = 0x6e95 <Address
0x6e95 out of bounds>, argv = 0x0,
agentPid = 0}, pipe = {commandLine = 0x7f126cdaf0e0
"/var/lib/pcp/pmdas/proc/pmdaproc -d 3", argv = 0x7f126cdaf220, agentPid
= 28309}}}
Since pmcd doesn't know that the proc_pmda has gone, AgentsAttributes
tries to send it a message and then boom. I am pretty lost in this code
and don't know how to proceed.
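
For discussion, here is a minimal sketch of one common defensive pattern for this kind of loop: treat a broken pipe on one agent as that agent's failure, mark it disconnected, and carry on with the rest. This is not pmcd's code or a proposed patch; the Agent class and send_to_agents function are hypothetical stand-ins for the per-agent record and the AgentsAttributes/DoAttributes path seen in the backtrace.

```python
import os


class Agent:
    """Hypothetical stand-in for a per-agent record (not pmcd's real struct)."""

    def __init__(self, name, in_fd):
        self.name = name
        self.in_fd = in_fd
        self.connected = True


def send_to_agents(agents, payload):
    """Send payload to every connected agent; return names of agents found dead."""
    dead = []
    for agent in agents:
        if not agent.connected:
            continue  # already known to be gone, do not write again
        try:
            os.write(agent.in_fd, payload)
        except BrokenPipeError:
            # The peer has gone: record that and keep serving the others.
            agent.connected = False
            os.close(agent.in_fd)
            dead.append(agent.name)
    return dead


def demo():
    live_r, live_w = os.pipe()
    dead_r, dead_w = os.pipe()
    os.close(dead_r)  # simulate the agent at the other end having died
    agents = [Agent("proc", dead_w), Agent("linux", live_w)]
    first = send_to_agents(agents, b"attr")   # discovers the dead agent
    second = send_to_agents(agents, b"attr")  # skips it from now on
    received = os.read(live_r, 64)
    os.close(live_r)
    os.close(live_w)
    return first, second, received
```

In this sketch the first pass reports "proc" as dead and the second pass no longer touches it, while the live agent receives both messages.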
Thanks
Martins