Hi Scott,
On Thu, 2009-02-26 at 12:11 -0600, Scott Emery wrote:
> obtained pcp 2.7.8 from git to get at the perl PMDA bits. Built their
> own perl-based Lustre PMDA. This combination worked for many weeks. Then
> I configured pmlogger.
>
> service100 /var/log/pcp/pmcd # ls -altr
> total 5992
> -rw-r--r-- 1 root root 106 Nov 19 11:24 simple.log
> drwxr-xr-x 6 root root 4096 Dec 3 15:11 ..
> -rw-r--r-- 1 root root 790 Feb 25 18:41 pmcd.log.prev
> -rw-r--r-- 1 root root 939 Feb 25 18:41 lustre.log.prev
> -rw-r--r-- 1 root root 790 Feb 26 00:41 pmcd.log
> -rw-r--r-- 1 root root 939 Feb 26 00:41 lustre.log
> -rw------- 1 root root 5828608 Feb 26 00:41 core
> -rwxr-xr-x 1 root root 200006 Feb 26 07:32 pmcd
Could you mail "core" and "pmcd" to me please?
> [Thu Feb 26 00:41:01] pmcd(22456) Error: Unexpected signal 11 ...
OK, pmcd took SIGSEGV ... (by definition, this is not the fault
of their Perl PMDA, BTW, which is a separate process).
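
To illustrate the separate-process point, here is a minimal sketch (not PCP code; the function name child_fatal_signal is made up for this example). A child process stands in for an agent and dies from SIGSEGV, while the parent, standing in for the daemon, keeps running and merely observes the termination via waitpid():

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that terminates itself with SIGSEGV
 * and report what the parent sees.
 * Returns the signal number that killed the child, 0 if the child
 * exited normally, or -1 if fork()/waitpid() failed.
 */
int child_fatal_signal(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: make sure the default action applies, then fault. */
        signal(SIGSEGV, SIG_DFL);
        raise(SIGSEGV);
        _exit(0);               /* not reached if the signal is delivered */
    }

    /* Parent: still alive, collects the child's termination status. */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return 0;
}
```

The caller gets SIGSEGV back as an ordinary return value, i.e. a segfault in one process does not by itself raise a signal in another.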
> [Wed Feb 25 22:07:49] lustre(22470) Info: lustre_refresh_fsnames()
> Use of uninitialized value in hash element at
> /var/lib/pcp/pmdas/lustre/pmdalustre.pl line 292.
> Use of uninitialized value in concatenation (.) or string at
> /var/lib/pcp/pmdas/lustre/pmdalustre.pl line 293.
> Use of uninitialized value in hash element at
> /var/lib/pcp/pmdas/lustre/pmdalustre.pl line 293.
The above does point to bugs in the PMDA too, though. FWIW
(not much help here), Martin's tree has a Lustre PMDA too:
http://oss.sgi.com/projects/pcp/source.html points to his
git tree. I don't think the PMDA is the cause of their
failure here though.
> service100 /var/log/pcp/pmcd # gdb pmcd core
> warning: exec file is newer than core file.
That's just because you copied it from $PCP_BINADM_DIR, right?
> #4 0x0000000000410823 in AcceptNewClient (reqfd=0) at client.c:69
A "git-checkout pcp-2.7.8-20081117" points at this line:
    FD_SET(fd, &clientFds);
    __pmSetVersionIPC(fd, UNKNOWN_VERSION); /* before negotiation */
>>> client[i].fd = fd;
    client[i].status.connected = 1;
    client[i].status.changes = 0;
Can you double-check that the code you built from matches that?
The only access there that could have SIGSEGV'd is client[i], yet
we had already accessed client[i].addr a few lines higher up in the
accept call ... very odd. Hopefully I'll be able to diagnose further
with the binary & core file.
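
For reference, a store like the one marked above can only fault if the table pointer or the index is bad. The sketch below is purely illustrative and is not pmcd's code: the client_slot type and the mark_connected() helper are hypothetical, simplified stand-ins showing the kind of check that would turn such a fault into an error return.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-client record. */
typedef struct {
    int fd;
    struct {
        unsigned int connected : 1;
        unsigned int changes : 1;
    } status;
} client_slot;

/*
 * Hypothetical helper: fill in slot i of the table for a new connection.
 * Returns 0 on success, or -1 if the slot cannot be addressed safely
 * (NULL table pointer or index beyond the allocated length).
 */
int mark_connected(client_slot *table, size_t table_len, size_t i, int fd)
{
    if (table == NULL || i >= table_len)
        return -1;

    table[i].fd = fd;
    table[i].status.connected = 1;
    table[i].status.changes = 0;
    return 0;
}
```

Whether either condition applies to this crash is unknown until the core file has been examined.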
cheers.
--
Nathan