Comment # 1
on bug 1100
from Ken McDonell
Well, Frank ....
First off, I need to know how this archive was created. It looks to be bad on
several fronts.
The most glaring is that the encoding of some metrics changes in the middle of
the archive ... looks like a linux PMDA update happened or possibly some deep
bug in the Linux PMDA.
I've focussed on mem.numa.alloc.other_node based on this output:
$ ./pmlogcheck -v -z ../../qa/archives/archive-20141125.000009
Note: timezone set to local timezone of host "vm-rawhide-64" from archive
[21:02:46.203] mem.numa.alloc.other_node: pmExtractValue failed: Impossible
value or scale conversion
[21:02:46.203] mem.numa.alloc.local_node: pmExtractValue failed: Impossible
value or scale conversion
[... lots of other mem.*, netstat.*, ... errors, but I think these are all
collateral damage from the PDU buffer encoding not matching the metadata]
Drilling a bit further I've isolated just the mem.numa.alloc.other_node metric
in the region of 21:00 (your local time) ...
__pmLogRead: PDU buffer
38 7001 0 8a917254 2d1c0300 1000000 2590000f 1000000
008: 1000000 0 b000000 c000003 0 0
vlist[0] pmid: 60.36.37 numval: 1 valfmt: ptr len: 8 type: 3
need: 44 vsize: 20 nvsize: 32 vbsize: 12 offset: 44 hdr.len: 56 pduend:
0x2265038 vsplit: 0x226502c (diff 12) pdubuf: 0x2265000 (diff 56)
new vlist[0] pmid: 60.36.37 numval: 1 valfmt: ptr len: 8 type: 3 value:
0x0000000000000000
pmResult dump from 0x22646d0 timestamp: 1416794506.203821 21:01:46.203 numpmid:
1
60.36.37 (mem.numa.alloc.other_node): numval: 1 valfmt: 1 vlist[]:
inst [0 or "node0"] value 0
__pmLogRead timestamp=21:01:46.203 0x226500c ... 0x2265040
000: 8a917254 2d1c0300 1000000 2590000f 1000000 1000000 0 b000000
008: 300000c 0 0
__pmLogRead: PDU buffer
18 7001 0 8a917254 15200300 0
need: 0 vsize: 0 nvsize: 0 vbsize: 0 offset: 24 hdr.len: 24 pduend: 0x2264018
vsplit: 0x2264018 (diff 0) pdubuf: 0x2264000 (diff 24)
pmResult dump from 0x2263ea0 timestamp: 1416794506.204821 21:01:46.204 numpmid:
0
__pmLogRead timestamp=21:01:46.204 0x226400c ... 0x2264020
000: 8a917254 15200300 0
__pmLogRead: PDU buffer
2c 7001 0 c6917254 2e1c0300 1000000 2590000f 1000000
008: 0 0 6564005f
vlist[0] pmid: 60.36.37 numval: 1 valfmt: insitu
need: 32 vsize: 20 nvsize: 32 vbsize: 0 offset: 44 hdr.len: 44 pduend:
0x226402c vsplit: 0x226402c (diff 0) pdubuf: 0x2264000 (diff 44)
new vlist[0] pmid: 60.36.37 numval: 1 valfmt: insitu value: 1593861221
pmResult dump from 0x2264940 timestamp: 1416794566.203822 21:02:46.203 numpmid:
1
60.36.37 (mem.numa.alloc.other_node): numval: 1 valfmt: 0 vlist[]:
inst [0 or "node0"] value 1593861221 9.2516306e+18 0x5f006465
__pmLogRead timestamp=21:02:46.203 0x226400c ... 0x2264034
000: c6917254 2e1c0300 1000000 2590000f 1000000 0 0 6564005f
Pulling this apart,
@21:00:46.206 the value is encoded as "dptr" (10000000), i.e. valfmt=1 with the
value block type=3 (PM_TYPE_U64) data length=8 and value=0. This encoding
matches the metadata:
$ pminfo -d -a archive-20141125.000009 mem.numa.alloc.other_node
mem.numa.alloc.other_node
Data Type: 64-bit unsigned int InDom: 60.19 0xf000013
Semantics: counter Units: count
@21:01:46.204 there is a <mark> record, so probably pmcd stopped and restarted
(after the linux PMDA was changed?)
@21:02:46.203 the value is encoded as "insitu" (valfmt=0) (THIS IS PLAIN WRONG)
and the value looks to be rubbish.
So this is a really bad archive ... there other things wrong with it that may
be the root cause of the segv, but for the moment I need to know how this
archive was created.
Even if it was from multiple pmlogger instances (as pmcd restarted), stitching
them together with pmlogextract should have failed when the metadata was
different in different archives (assuming that is the cause).
So I'm confused and need some more information, please, Frank.