Hi Ken,
Keen to pick your brain on some of the things I've been looking
into today with regard to Fedora/EPEL bug #958745 (as mentioned
on IRC - https://bugzilla.redhat.com/show_bug.cgi?id=958745).
From what I can tell, the root cause appears to be a PDU buffer
pin count accounting issue. I've attached a patch that cranks
the volume up to 11 in the relevant areas of the code, running
the recipe Marko describes in the bug (play, stop after a small
number of steps) with this command line:
gdb --args pmchart -v 5 -s 5 -a gfs -t 1min -c gfs.view -DPDUBUF
(gfs.view attached, archive in the bug report).
So, we crash in pmFreeResult. Both metrics are doubles, and we
end up taking the code path below "/* not created from a pdubuf,
really free the memory */" in freeresult.c - incorrectly, I think.
I suspect we should not be getting there at all as this buffer was
indeed created from a pdubuf. However, the __pmUnpinPDUBuf in the
same routine "fails" (to find the active pdubuf), and hence we end
up attempting to free memory that's actually part of a pdubuf - the
free() call at line 64 in freeresult.c consistently blows up in
our test case.
I added all the extra diagnostics in the patch in an attempt to
follow who pins/unpins this buffer. The buffer is always one of
those cached from interp.c cache_read(), and the situation where it
is freed arises from pressing Stop in the UI. That issues
a pmSetMode(), which calls __pmLogResetInterp() and I suspect the
two calls to __pmUnpinPDUBuf() in there to be problematic.
In our case, we have two metrics - so two hash entries, both are
fetched via a single __pmLogRead and in a single pmResult. The
hash table thus has two hash entries (keyed by PMID) which point
at the same pdubuf. When we walk the hash table we end up doing
two unpins on the one pdubuf, which drops its reference count to
zero and ultimately exposes us to a subsequent pmFreeResult call
which gets confused.
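
To make the suspected sequence concrete, here is a toy model of the
accounting (a sketch only, not libpcp code). Every toy_* name is
hypothetical, buffers are identified by index rather than address,
and the number of pins the cache takes per read is a parameter,
since that is exactly the open question.

```c
#include <assert.h>

/* Toy model only: none of these names are real libpcp API. */
#define NBUF 4

static int pins[NBUF];            /* pin count per toy buffer */

static void toy_pin(int buf)
{
    assert(buf >= 0 && buf < NBUF);
    pins[buf]++;
}

/* Returns 1 if the buffer was active and a pin was dropped,
 * 0 if no active buffer was found (the "failed" unpin case). */
static int toy_unpin(int buf)
{
    assert(buf >= 0 && buf < NBUF);
    if (pins[buf] <= 0)
        return 0;
    pins[buf]--;
    return 1;
}

/* One cache entry per metric; several entries may share a buffer. */
struct toy_entry { int pmid; int buf; };

/* Reset walker as suspected: one unpin per hash entry,
 * regardless of how many entries share the same buffer. */
static void toy_reset_per_entry(const struct toy_entry *e, int n)
{
    for (int i = 0; i < n; i++)
        toy_unpin(e[i].buf);
}

/* Free-result decision: returns 1 if the unpin fails and we
 * would fall through to the "really free the memory" path. */
static int toy_free_result_would_free(int buf)
{
    return toy_unpin(buf) ? 0 : 1;
}

/* Two metrics fetched by one read into buffer 0. The result holds
 * one pin; the cache holds cache_pins (an assumption to vary). */
static int toy_scenario(int cache_pins)
{
    struct toy_entry entries[2] = { { 1, 0 }, { 2, 0 } };

    pins[0] = 0;
    toy_pin(0);                       /* pin held on behalf of the result */
    for (int i = 0; i < cache_pins; i++)
        toy_pin(0);                   /* pin(s) held by the cache */

    toy_reset_per_entry(entries, 2);  /* Stop pressed: walk the hash */
    return toy_free_result_would_free(0);
}
```

In this model toy_scenario(1) returns 1: one cache pin plus two
per-entry unpins leaves the count at zero, so the later unpin fails
and the free path is taken. toy_scenario(2) returns 0: with one cache
pin per entry the counts balance. Which of the two matches the real
code is the question below.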
Would love to hear your thoughts on all this! Should this buffer
have a higher pin count from elsewhere, or do you reckon the hash
walker in __pmLogResetInterp is doing the wrong thing here? Thanks!
cheers.
--
Nathan
gfs.view
Description: Binary data
verbose-interp-tracing.patch
Description: Text Data