Hi Nathan/Lukas,
On 2/18/16 12:53 AM, Nathan Scott wrote:
Hi Martins,
----- Original Message -----
Nathan,
On 1/29/2015 4:42 PM, Nathan Scott wrote:
I sent that mail from the time warp that is labelled "it is OK for all
PCP processes to run as root" ... later I realized that in the brave new
world where running as root has become less fashionable this won't work
if the PMDA needs root privileges, because once pmcd is able to accept
the SIGHUP it has downgraded itself to user "pcp" ... so restarting
_pmcd_ (as root) is the only option in your case.
This is now fixable, happily. See point #2 here:
http://oss.sgi.com/archives/pcp/2014-06/msg00111.html
from "2. Restarting / Installing PMDAs", and:
$ grep STARTPMDA src/include/pcp/pmda.h
/*#define PDUROOT_STARTPMDA_REQ 0x9007*/
/*#define PDUROOT_STARTPMDA 0x9008*/
If anyone wants to hack on this, please send me a note - I have some
sample code that will help. It would be good to have this functionality
back; the building blocks are now in place (since pcp-3.10.2) and it'll
be an interesting little hacking project I think.
This would be great to have. I won't have time to take this on for the
next couple of weeks. I will ping you for the sample code then, unless
someone else looks at it in the meantime.
Lukas ended up working on this and getting it all in for the last release
(pcp-3.11.0). As of today (so, next release) we have a pmie rule that'll
automate the restart of failed PMDAs by sighup'ing pmcd if you chkconfig
pmie on.
Works nicely here, and can restart PMDAs running under any user account now
that the above is all in place (and basically the same pmie rule as the one
we discussed in this thread - by default it will also log to syslog whenever
it kicks pmcd).
cheers.
--
Nathan
Thanks for working on this! I can't seem to get it to work though. I
distilled the pmie rule down to this:
#################
delta = 1 min;
some_inst (
    pmcd.agent.status != 0
) -> shell 10 min "pmsignal -s HUP -a pmcd"
   & syslog 10 min "Restart unresponsive PMDAs" " pmda%i[%v]";
#################
Then attached gdb to a pmda to simulate something being stuck:
sudo gdb /var/lib/pcp/pmdas/proc/pmdaproc -p 24657
The rule fires properly:
Feb 22 13:26:55 cpn-d13-17 pcp-pmie[5673]: Restart unresponsive PMDAs pmdaproc[8]
But the pmda is still dead:
[minnus@cpn-d13-17:tmp]$ pminfo -f proc.nprocs
proc.nprocs: pmLookupDesc: No PMCD agent for domain of request
pmcd.log has the expected message that the pmda is dead:
###################
[Mon Feb 22 13:26:25] pmcd(24652) Warning: pduread: timeout (after 5.000
sec) while attempting to read 12 bytes out of 12 in HDR on fd=12
[Mon Feb 22 13:26:25] pmcd(24652) Info: CleanupAgent ...
Cleanup "proc" agent (dom 3): protocol failure for fd=12
###################
But nothing to indicate that a new one is started.
The pids and start times of the pmcd and pmdaproc processes don't change
throughout this process.
Hmm, OK
Running pmsignal directly DOES seem to work:
-bash-4.2$ whoami
pcp
-bash-4.2$ /usr/libexec/pcp/bin/pmsignal -s HUP -a pmcd
-bash-4.2$
Should have looked at the logs first (pmie.log):
####################
[minnus@cpn-d13-17]$ more pmie.log
Log for pmie on cpn-d13-17.int.ccr.buffalo.edu started Mon Feb 22
13:24:55 2016
pmie: PID = 5673, via local:
sh: pmsignal: command not found
####################
Should that be in the path? Or maybe I have some misconfiguration somewhere?
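
A quick way to check the PATH question from a shell is sketched below. It
assumes (not confirmed anywhere in this thread) that pmie runs its shell
actions with a PATH that does not include /usr/libexec/pcp/bin. The helper
name resolve_cmd is made up for illustration, and the fallback path is simply
the one that worked when pmsignal was run by hand above.

```shell
#!/bin/sh
# resolve_cmd NAME FALLBACK
# Hypothetical helper: print the location of NAME if it is on $PATH,
# otherwise print FALLBACK if that file is executable; fail if neither.
resolve_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        command -v "$1"
    elif [ -x "$2" ]; then
        printf '%s\n' "$2"
    else
        return 1
    fi
}

# Example: the fallback is the absolute path used in the manual run above.
resolve_cmd pmsignal /usr/libexec/pcp/bin/pmsignal \
    || echo "pmsignal not found" >&2
```

If this only succeeds through the fallback branch, spelling out the full path
in the rule's shell action would be the obvious thing to try next.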
Thanks
Martins