pcp

RE: [pcp] pcp 3.3.3-1 problem

To: "nathans@xxxxxxxxxx" <nathans@xxxxxxxxxx>
Subject: RE: [pcp] pcp 3.3.3-1 problem
From: "Siekas, Greg" <greg.siekas@xxxxxxxxxx>
Date: Wed, 18 Aug 2010 06:22:45 -0700
Accept-language: en-US
Cc: "pcp@xxxxxxxxxxx" <pcp@xxxxxxxxxxx>
In-reply-to: <2048008363.101171282094441734.JavaMail.root@xxxxxxxxxxxxxxxxxx>
References: <903179697.100891282094241959.JavaMail.root@xxxxxxxxxxxxxxxxxx> <2048008363.101171282094441734.JavaMail.root@xxxxxxxxxxxxxxxxxx>
Thread-index: Acs+c5z6VbvWVIPASWGzldZaOk16EgAY9W8w
Thread-topic: [pcp] pcp 3.3.3-1 problem

Nathan,

Thanks for the reply; here are the details you requested.

It's failing on the kernel.pernode.cpu.nice metric.

...
kernel.pernode.cpu.user
    inst [0 or "node0"] value 6776690
    inst [1 or "node1"] value 6913290
kernel.pernode.cpu.nice
kernel.pernode.cpu.nice: pmFetch: IPC protocol failure
...

# cat /sys/devices/system/node/node*/{meminfo,numastat}

Node 0 MemTotal:      4718588 kB
Node 0 MemFree:         76700 kB
Node 0 MemUsed:       4641888 kB
Node 0 HighTotal:           0 kB
Node 0 HighFree:            0 kB
Node 0 LowTotal:      4718588 kB
Node 0 LowFree:         76700 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

Node 1 MemTotal:      4194300 kB
Node 1 MemFree:         64876 kB
Node 1 MemUsed:       4129424 kB
Node 1 HighTotal:           0 kB
Node 1 HighFree:            0 kB
Node 1 LowTotal:      4194300 kB
Node 1 LowFree:         64876 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB
numa_hit 1159487313
numa_miss 340994015
numa_foreign 301737818
interleave_hit 268
local_node 1159486710
other_node 340994618
numa_hit 1192803001
numa_miss 301737818
numa_foreign 340994015
interleave_hit 228
local_node 1192801767
other_node 301739052

# gdb --args /usr/libexec/pcp/bin/pmcd -f
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
(gdb) run
Starting program: /usr/libexec/pcp/bin/pmcd -f
*** glibc detected *** corrupted double-linked list: 0x0000000000538640 ***

Program received signal SIGSEGV, Segmentation fault.
0x0000002a96070c49 in linux_table_scan (fp=0x539aa0, table=0x21)
    at linux_table.c:81
81      linux_table.c: No such file or directory.
        in linux_table.c
(gdb) bt
#0  0x0000002a96070c49 in linux_table_scan (fp=0x539aa0, table=0x21)
    at linux_table.c:81
#1  0x0000002a960710a6 in refresh_numa_meminfo (numa_meminfo=0x2a961818b0)
    at numa_meminfo.c:129
#2  0x0000002a96059b8b in linux_refresh (pmda=0x52c640, 
    need_refresh=0x7fbfffe980) at pmda.c:4156
#3  0x0000002a96063c8d in linux_fetch (numpmid=20, pmidlist=0x52da30, 
    resp=0x7fbfffeac8, pmda=0x52c640) at pmda.c:6401
#4  0x000000000040ced5 in SendFetch (dpList=0x52da10, aPtr=0x528e28, 
    cPtr=0x52cf90, ctxnum=0) at dofetch.c:274
#5  0x000000000040d574 in DoFetch (cip=0x52cf90, pb=0x530000) at dofetch.c:430
#6  0x0000000000405658 in HandleClientInput (fdsPtr=0x7fbfffed60) at pmcd.c:458
#7  0x00000000004064d7 in ClientLoop () at pmcd.c:849
#8  0x0000000000406ea3 in main (argc=2, argv=0x7fbfffef98) at pmcd.c:1137

-----Original Message-----
From: nscott@xxxxxxxxxx [mailto:nscott@xxxxxxxxxx] On Behalf Of nathans@xxxxxxxxxx
Sent: Tuesday, August 17, 2010 6:21 PM
To: Siekas, Greg
Cc: pcp@xxxxxxxxxxx
Subject: Re: [pcp] pcp 3.3.3-1 problem


----- "Greg Siekas" <greg.siekas@xxxxxxxxxx> wrote:

> I’m running pcp 3.3.3-1 under SLES9 SP4 x86_64. I’ve found that if I do
> a pminfo -F -h <hostname>, the pmcd dies on the client.  The pmcd.log
> shows the following.

If you run:
  for m in `pminfo -h <hostname>`; do
    echo $m
    pminfo -F -h <hostname> $m
  done

It should tell us which specific metric is causing the problem.
Probably one of the newer NUMA metrics, at a guess, mem.numa.xxx,
so could you also send copies of your
  /sys/devices/system/node/node*/{meminfo,numastat}
files please?
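
The loop above prints every metric and its values. A variant that prints only the names of the metrics whose fetch fails could be sketched as below. The function name find_failing_metrics is hypothetical, and the sketch assumes a failed fetch is reported as "<metric>: pmFetch: <reason>" in pminfo's output, as in the kernel.pernode.cpu.nice error seen in this thread. It does not rely on pminfo's exit status.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each metric whose individual
# fetch reports a pmFetch error.
# Assumption: failures appear as "<metric>: pmFetch: <reason>" in the
# combined stdout/stderr of pminfo.
find_failing_metrics() {
    host=$1
    # Listing metric names (no -F) does not fetch any values.
    for m in $(pminfo -h "$host"); do
        # Fetch one metric at a time and capture all of its output.
        out=$(pminfo -F -h "$host" "$m" 2>&1)
        case $out in
            *": pmFetch: "*) printf '%s\n' "$m" ;;
        esac
    done
}
```

It would be called as "find_failing_metrics <hostname>". If pmcd exits after the first failing fetch, later fetches would presumably fail with a connection error instead, which this pattern does not match, so only the metric that triggered the crash should be listed.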

To debug further, if you stop pmcd, then run "gdb --args pmcd -f" on
the failing host, when it gets SIGSEGV it should stop in the failing
code (a backtrace - gdb "bt" command output - would be great, then).

cheers.

-- 
Nathan