Re: pcp updates: libpcp, pcp-uptime, docs, qa

To: Nathan Scott <nathans@xxxxxxxxxx>
Subject: Re: pcp updates: libpcp, pcp-uptime, docs, qa
From: fche@xxxxxxxxxx (Frank Ch. Eigler)
Date: Tue, 15 Sep 2015 18:06:45 -0400
Cc: pcp <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <484231901.32444837.1442300926325.JavaMail.zimbra@xxxxxxxxxx> (Nathan Scott's message of "Tue, 15 Sep 2015 03:08:46 -0400 (EDT)")
References: <1653198047.32442774.1442300655332.JavaMail.zimbra@xxxxxxxxxx> <484231901.32444837.1442300926325.JavaMail.zimbra@xxxxxxxxxx>
User-agent: Gnus/5.1008 (Gnus v5.10.8) Emacs/21.4 (gnu/linux)
> Nathan Scott (5):
>       libpcp: temporarily revert pdubuf tsearch-based optimisation

It turns out this is the right thing to do.  I expected that, like in
a few other cases, the new pdubuf code just brought to light latent
memory handling bugs elsewhere (since it uses exact size memory
allocations, and no deferred frees, so overwrites or use-after-free
bugs can be caught right away).  However, this is not one of those
cases.

The instant freeing & exact allocation do help, but the bug was in the
new pdubuf itself.  Some searching through the logs showed anomalies
in the failing case.  There's an off-by-one bug in the way preexisting
pdubufs are matched against pointers-into-pdubufs.  This part of the
trace shows the problem:

[... some operation causes a pdubufdump():]
   pinned pdubuf[size](pincnt):
     0x10050d8a0...0x10050da2f[400](10)
     0x10050dbb0...0x10050dbdf[48](1)
     0x10050dc90...0x10050ddf7[360](1)
     0x10050df10...0x10050e09f[400](9)
     0x10050e530...0x10050e8d7[936](3)
     0x10050ebc0...0x10050ef67[936](4)
     0x10080a210...0x10080c07f[7792](18)
     0x10080e210...0x10081007f[7792](18)
[... immediately afterwards:]
__pmUnpinPDUBuf(0x10050da30) -> pdubuf=0x10050d8a0, pincnt=9

So the pointer 0x10050da30 was deemed to fall into the interval of
that first pdubuf 0x10050d8a0...0x10050da2f[400](10), when it is
actually outside (by one byte).  Things go downhill very gradually
from this point.  I'm testing a fix and qa, should be ready tomorrow.
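
To illustrate the kind of off-by-one described above, here is a minimal
sketch in C.  It is not the libpcp code: the struct and both function
names are hypothetical, and the "buggy" comparison is only one assumed
way such a mismatch could arise.  It shows that a buffer starting at
0x10050d8a0 with size 400 ends at 0x10050da2f, so 0x10050da30 matches
only if the end of the interval is wrongly treated as inclusive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one buffer; NOT the libpcp structure. */
struct buf_range {
    uint64_t start;  /* address of the first byte */
    size_t   size;   /* byte count; last valid byte is start + size - 1 */
};

/* Assumed buggy variant: treats start + size as inside the buffer. */
static int contains_buggy(const struct buf_range *b, uint64_t p)
{
    assert(b->size > 0);
    return p >= b->start && p <= b->start + b->size;
}

/* Half-open variant: the buffer covers [start, start + size). */
static int contains_fixed(const struct buf_range *b, uint64_t p)
{
    assert(b->size > 0);
    return p >= b->start && p < b->start + b->size;
}
```

With the values from the trace, contains_buggy() accepts 0x10050da30
for the first pdubuf while contains_fixed() rejects it, which is the
one-byte difference in question.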

(As to why only the mac is affected?  I speculate the mac libc memory
allocator may pack or fragment-manage its memory more tightly.  So, it
can return contiguous regions for separate mallocs (one from a
pdubuf-managed one, one from a direct malloc).  This apparently doesn't
happen on the other OSs.)

- FChE
