It all started with pmie dumping core on me while evaluating a rule
that looks like this:
some_inst ( match_inst "^someinst_" metric.foo != 1) -> print "%i is bad";
It dumped core inside cndMatch_inst():
(dbx) where
[1] _lwp_kill(0x1, 0x6, 0xffffff03e2a3e1e0, 0xfffffd7fff284c0e,
0xfffffd7f00000012, 0x0), at 0xfffffd7fff2842aa
[2] thr_kill(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff2788cd
[3] raise(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff227511
[4] abort(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff1fda41
[5] sigbadproc(sig = 11), line 456 in "pmie.c"
[6] __sighndlr(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff27b076
[7] call_user_handler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff26dfaf
[8] sigacthandler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff26e1be
---- called from signal handler with signal 11 (SIGSEGV) ------
=>[9] cndMatch_inst(x = 0x465530), line 85 in "match_inst.c"
[10] cndSome_inst(x = 0x4655b0), line 3810 in "fun.c"
[11] rule(x = 0x4657b0), line 524 in "fun.c"
[12] eval(task = 0x45f800), line 207 in "eval.c"
[13] run(), line 764 in "eval.c"
[14] main(argc = 8, argv = 0xfffffd7fffdff588), line 986 in "pmie.c"
The metric m in cndMatch_inst() looked suspicious:
(dbx) print *m
*m = {
expr = 0x101010101010101
profile = 0x101010101010101
host = 0x101010101010101
next = 0x101010101010101
prev = 0x101010101010101
mname = 0x101010101010101
hname = 0x101010101010101
....
It took me a while to dig through the debris to find the metric name
it was dealing with (the fact that gdb on Solaris cannot follow the
signal stack frames did not help - I was doing manual disassembly
until I remembered about dbx). Once I had the metric name, I knew
there was absolutely no chance that this metric would ever have more
than 3 instances, so the tspan of 10 in cndMatch_inst's argument is
clearly bogus:
(dbx) print *x
*x = {
op = 45
arg1 = 0x4652f0
arg2 = 0x4640b0
parent = 0x4655b0
eval = 0x422260 = &cndMatch_inst(Expr *x)
valid = 1
hdom = 1
e_idom = 10
tdom = 1
tspan = 10
nsmpls = 1
nvals = 10
metrics = 0x465140
sem = 11
....
I was trying to find where it could come from, and one possible
suspect was the regex in match_inst:
(dbx) print *x->arg2
*x->arg2 = {
op = 80
arg1 = (nil)
arg2 = (nil)
parent = 0x465530
eval = (nil)
valid = 10
hdom = -1
e_idom = 10
tdom = -1
tspan = 10
nsmpls = 1
nvals = 10
metrics = (nil)
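For what it's worth, the 10 in both dumps matches the length of the
pattern string: "^someinst_" is exactly 10 characters. Below is a
minimal sketch of the suspected failure mode. The Operand struct and
the helper names are hypothetical simplifications, not the real pmie
Expr/Metric layout, and the idea that a string operand carries its
length in tspan is an assumption based on the dumps above. If the
per-instance loop bound is inherited from the string operand rather
than from the metric, a metric with 3 instances gets walked 10 times.

```c
#include <string.h>

/* Hypothetical, simplified stand-in for a pmie expression operand. */
typedef struct {
    int is_string;  /* 1 for the regex operand, 0 for a metric operand */
    int tspan;      /* number of values the operand claims per sample */
} Operand;

/* Assumed behaviour: a string operand reports its length as tspan. */
static Operand string_operand(const char *s)
{
    Operand o = { 1, (int)strlen(s) };
    return o;
}

/* A metric operand reports its real number of instances. */
static Operand metric_operand(int ninst)
{
    Operand o = { 0, ninst };
    return o;
}

/* How many slots past the end of the metric's data a loop bounded by
 * `bound.tspan` would touch. Zero means the loop stays in bounds. */
static int overrun(Operand bound, Operand metric)
{
    return bound.tspan > metric.tspan ? bound.tspan - metric.tspan : 0;
}
```

With these definitions, overrun(string_operand("^someinst_"),
metric_operand(3)) comes out as 7, i.e. seven reads past the last
valid instance, which would be consistent with the garbage in *m.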
I decided to test the hypothesis that it could have come from instExpr
when the wrong 'primary' is picked, by using a metric which has an
instance domain with only one instance in it. This is what I got:
$ pminfo -f network.link.state
network.link.state
inst [0 or "e1000g0"] value 1
$ cat ~/nx
some_inst ( match_inst "^e1000" network.link.state != 0) -> print "%i is bad";
$ pmie -T2s -t1s -c ~/nx
Fri Jun 4 15:53:45 2010: ??? unknown %i is bad
Fri Jun 4 15:53:46 2010: ??? unknown %i is bad
Fri Jun 4 15:53:47 2010: ??? unknown %i is bad
[Fri Jun 4 15:53:47] pmie(19351) Info: evaluator exiting
If I change primary() so that it does not pick arguments of NOP type,
I get the "correct" result.
Ken, what's the idea behind using the tspan and nvals of arguments
that hold the size of a string rather than the number of metrics?
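To make the primary() experiment concrete, here is a minimal sketch of
the kind of change I mean. It is not the actual pmie code: the Node
struct, the op codes and pick_primary() are hypothetical, and the
"larger tspan wins" tie-break is only an assumption about how the
primary gets chosen. The point is just that a NOP leaf (such as the
regex string) is never allowed to supply the dimensions.

```c
#include <stddef.h>

/* Hypothetical op codes, not the values pmie uses. */
enum { OP_NOP = 0, OP_METRIC = 1 };

/* Hypothetical, simplified expression node. */
typedef struct Node {
    int op;     /* OP_NOP for constant/string leaves */
    int tspan;  /* number of values per sample */
} Node;

/* Pick the operand whose dimensions the parent inherits.
 * NOP leaves are skipped; returns NULL if neither operand is usable. */
static Node *pick_primary(Node *a, Node *b)
{
    if (a != NULL && a->op == OP_NOP)
        a = NULL;
    if (b != NULL && b->op == OP_NOP)
        b = NULL;
    if (a == NULL)
        return b;
    if (b == NULL)
        return a;
    /* Assumed tie-break: prefer the operand with more values. */
    return (b->tspan > a->tspan) ? b : a;
}
```

With a string node of tspan 10 and a metric node of tspan 3, this
returns the metric node regardless of argument order, so the parent
would be sized by the metric and not by the pattern length.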
max