pmie cannot get instance names for single instance metrics

To: pcp@xxxxxxxxxxx
Subject: pmie cannot get instance names for single instance metrics
From: Max Matveev <makc@xxxxxxxxx>
Date: Fri, 4 Jun 2010 15:56:08 +1000
It all started with pmie dumping core on me while evaluating a rule
which looks like

some_inst ( match_inst "^someinst_" metric.foo != 1) -> print "%i is bad";

It dumped core inside cndMatch_inst():

(dbx) where
  [1] _lwp_kill(0x1, 0x6, 0xffffff03e2a3e1e0, 0xfffffd7fff284c0e,
  0xfffffd7f00000012, 0x0), at 0xfffffd7fff2842aa 
  [2] thr_kill(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff2788cd 
  [3] raise(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff227511 
  [4] abort(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff1fda41 
  [5] sigbadproc(sig = 11), line 456 in "pmie.c"
  [6] __sighndlr(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff27b076 
  [7] call_user_handler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff26dfaf 
  [8] sigacthandler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff26e1be 
  ---- called from signal handler with signal 11 (SIGSEGV) ------
=>[9] cndMatch_inst(x = 0x465530), line 85 in "match_inst.c"
  [10] cndSome_inst(x = 0x4655b0), line 3810 in "fun.c"
  [11] rule(x = 0x4657b0), line 524 in "fun.c"
  [12] eval(task = 0x45f800), line 207 in "eval.c"
  [13] run(), line 764 in "eval.c"
  [14] main(argc = 8, argv = 0xfffffd7fffdff588), line 986 in "pmie.c"

The metric in m in cndMatch_inst() looked suspicious:

(dbx) print *m
*m = {
    expr     = 0x101010101010101
    profile  = 0x101010101010101
    host     = 0x101010101010101
    next     = 0x101010101010101
    prev     = 0x101010101010101
    mname    = 0x101010101010101
    hname    = 0x101010101010101
....

It took me a while to dig through the debris to find the metric name
it was dealing with (the fact that gdb on Solaris cannot follow the
signal stack frames did not help; I was doing manual disassembly until
I remembered about dbx). Once I had the metric name, I knew that there
was absolutely no chance that this metric would ever have more than 3
instances, so a tspan of 10 in cndMatch_inst's argument is clearly
bogus.

(dbx) print *x
*x = {
    op      = 45
    arg1    = 0x4652f0
    arg2    = 0x4640b0
    parent  = 0x4655b0
    eval    = 0x422260 = &cndMatch_inst(Expr *x)
    valid   = 1
    hdom    = 1
    e_idom  = 10
    tdom    = 1
    tspan   = 10
    nsmpls  = 1
    nvals   = 10
    metrics = 0x465140
    sem     = 11
....

I tried to find where it could have come from, and one possible
suspect was the regex in match_inst:

(dbx) print *x->arg2
*x->arg2 = {
    op      = 80
    arg1    = (nil)
    arg2    = (nil)
    parent  = 0x465530
    eval    = (nil)
    valid   = 10
    hdom    = -1
    e_idom  = 10
    tdom    = -1
    tspan   = 10
    nsmpls  = 1
    nvals   = 10
    metrics = (nil)

I decided to test the hypothesis that it could have come from instExpr
when the wrong 'primary' is picked, by using a metric which has an
instance domain but only one instance in it, and this is what I got:

$ pminfo -f network.link.state

network.link.state
    inst [0 or "e1000g0"] value 1
$ cat ~/nx
some_inst ( match_inst "^e1000" network.link.state != 0) -> print "%i is bad";
$ pmie -T2s -t1s -c ~/nx
Fri Jun  4 15:53:45 2010: ??? unknown %i is bad
Fri Jun  4 15:53:46 2010: ??? unknown %i is bad
Fri Jun  4 15:53:47 2010: ??? unknown %i is bad
[Fri Jun  4 15:53:47] pmie(19351) Info: evaluator exiting

If I change primary() to not pick arguments of NOP type I get the
"correct" result.

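A minimal sketch of the kind of change described above, not the actual
pmie source. The Expr struct is a cut-down stand-in using field names
from the dbx dump; the opcode names, the "larger tspan wins" rule in
primary_naive(), and both function names are assumptions made purely
for illustration.

```c
#include <assert.h>

/* Hypothetical opcodes; the real pmie values are not reproduced here. */
typedef enum { OP_NOP, OP_METRIC } Op;

/* Cut-down, assumed stand-in for pmie's expression node. */
typedef struct {
    Op  op;
    int tspan;  /* for a string operand this appears to hold the string size */
} Expr;

/* Assumed naive selection: the operand with the larger tspan wins,
 * so a 10-character pattern can beat a one-instance metric. */
const Expr *primary_naive(const Expr *a, const Expr *b)
{
    return (b->tspan > a->tspan) ? b : a;
}

/* Guarded selection: a NOP operand is never picked as primary
 * when the other operand is a real expression. */
const Expr *primary_guarded(const Expr *a, const Expr *b)
{
    if (a->op == OP_NOP && b->op != OP_NOP)
        return b;
    if (b->op == OP_NOP && a->op != OP_NOP)
        return a;
    return primary_naive(a, b);
}

/* Usage: a string pattern with tspan 10 against a metric with tspan 1. */
void self_check(void)
{
    Expr pattern = { OP_NOP, 10 };
    Expr metric  = { OP_METRIC, 1 };

    assert(primary_naive(&metric, &pattern) == &pattern);   /* wrong pick */
    assert(primary_guarded(&metric, &pattern) == &metric);  /* metric wins */
}
```

With the guard in place the instance domain size would come from the
metric rather than from the pattern string, which is consistent with
the "correct" result mentioned above.
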
Ken, what's the idea behind using the tspan and nvals of arguments
which hold the size of a string instead of the number of metrics?

max
