Hi,
I am trying to work around the "slow proc pmda gets killed by pmcd"
issue by using pmie and have run into an issue with pmcd getting "stuck"
in some way.
I have the following rule for pmie:
*************
delta = 2 min;
pmcd.agent.status #'proc' > 1
-> shell "/etc/pcp/pmie/procpmda_check.sh";
*************
For now, all the shell script does is log an event. This is kind
of convoluted, but I have boiled the failure down to the following:
1. kill <pid_of_proc_pmda>
2. The following appears in the pmie.log when the pmie check is supposed
to occur:
[Tue Jan 27 15:30:12] pmie(16399) Error: pmFetch from d13n01 failed: IPC protocol failure
[Tue Jan 27 15:30:12] pmie(16399) Info: Lost connection to pmcd on host d13n01
[Tue Jan 27 15:30:17] pmie(16399) Info: Re-established connection to pmcd on host d13n01
3. procpmda_check.sh never gets called, even after waiting 10 minutes,
by which time several 2-minute cycles should have occurred.
If instead I do the following:
1. kill <pid_of_proc_pmda>
2. pminfo
Error: Broken pipe
3. pminfo
<correct pminfo output>
4. procpmda_check.sh gets called properly to signal that the pmda has
died at the appropriate time.
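For reference, the check script does roughly what the minimal sketch
below does. The log path, the PROCPMDA_CHECK_LOG override, and the
message text are illustrative assumptions, not the exact contents of
the real script.

```shell
#!/bin/sh
# Hypothetical sketch of /etc/pcp/pmie/procpmda_check.sh.
# It only records that the pmie rule fired; it takes no corrective action.

# Assumed log location; override with PROCPMDA_CHECK_LOG if desired.
LOGFILE="${PROCPMDA_CHECK_LOG:-/tmp/procpmda_check.log}"

log_event() {
    # Append one timestamped line per invocation.
    printf '[%s] procpmda_check: proc PMDA status is abnormal\n' \
        "$(date '+%a %b %d %H:%M:%S')" >> "$LOGFILE"
}

log_event
```

With this in place, each time the rule fires one new line should show
up in the log file, which makes it easy to see whether pmie ran the
action at all.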
There is also no pmlogger running when I do this, and nothing
interesting shows up in pmcd.log or proc.log. The pmcd process and all
other PMDA processes keep running the whole time. I know that killing
the proc PMDA by hand is not the same as what pmcd would do in
practice, but it was the only way I could think of to simulate the
behavior. Any thoughts on the best way to debug this?
Thanks
Martins