
[Bug 1057] New: memory leak & increasing slowdown in pdubuf.c buf_pin / interp.c update_bounds

To: pcp@xxxxxxxxxxx
Subject: [Bug 1057] New: memory leak & increasing slowdown in pdubuf.c buf_pin / interp.c update_bounds
From: bugzilla-daemon@xxxxxxxxxxx
Date: Sun, 08 Jun 2014 02:06:04 +0000
Auto-submitted: auto-generated
Delivered-to: pcp@xxxxxxxxxxx
Bug ID: 1057
Summary: memory leak & increasing slowdown in pdubuf.c buf_pin / interp.c update_bounds
Product: pcp
Version: unspecified
Hardware: All
OS: Linux
Status: NEW
Severity: major
Priority: P5
Component: pcp
Assignee: pcp@kenj.com.au
Reporter: fche@redhat.com
CC: pcp@oss.sgi.com
Classification: Unclassified

While working on the graphite-integrated pmwebd, it was found that the
daemon responded to identical queries slower and slower and slower
(going from 20ms to 800ms in the space of a few minutes' operation),
despite all pmcontexts being destroyed & recreated anew for each
operation.

perf top / systemtap pf* indicated an increasing amount of time
spent in the __pmPinPDUBuf & __pmUnpinPDUBuf functions, with pdubuf.c
line 178 showing up prominently:

00167 int
00168 __pmUnpinPDUBuf(void *handle)
00169 {
00170     bufctl_t    *pcp;
00171     bufctl_t    *prior = NULL;
00172 [...]
00177     for (pcp = buf_pin; pcp != NULL; pcp = pcp->bc_next) {
00178         if (pcp->bc_buf <= (char *)handle && (char *)handle < &pcp->bc_buf[pcp->bc_size])
00179             break;
00180         prior = pcp;
00181     }

It's a linked list traversal, as the pcp->bc_next list grew and grew
and grew and grew and grew and grew and grew and grew some and grew
and grew and grew and grew and grew some more and grew and grew and so
on.
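The growth matters because of how that list is searched. As a minimal standalone sketch (not PCP's actual implementation; the bufctl_t fields here are simplified stand-ins): every pin prepends a control node to buf_pin, and unpinning must walk the list to find the buffer containing the handle, so the scan cost rises linearly with the number of leaked pins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for pdubuf.c's bufctl_t. */
typedef struct bufctl {
    struct bufctl *bc_next;
    size_t         bc_size;
    char          *bc_buf;
} bufctl_t;

static bufctl_t *buf_pin = NULL;    /* head of the pinned-buffer list */

/* Pin a buffer: prepend a control node to the list. */
static void pin(char *buf, size_t size)
{
    bufctl_t *pcp = malloc(sizeof(*pcp));
    pcp->bc_buf = buf;
    pcp->bc_size = size;
    pcp->bc_next = buf_pin;
    buf_pin = pcp;
}

/* Walk the list as the loop at pdubuf.c:177-181 does; return how many
 * nodes were visited before the buffer containing 'handle' was found. */
static int unpin_scan(char *handle)
{
    int visited = 0;
    bufctl_t *pcp;
    for (pcp = buf_pin; pcp != NULL; pcp = pcp->bc_next) {
        visited++;
        if (pcp->bc_buf <= handle && handle < &pcp->bc_buf[pcp->bc_size])
            break;
    }
    return visited;
}
```

With one leaked pin per fetch, the oldest still-pinned buffers sink toward the tail, so every later unpin pays for all the leaks that preceded it; that linear cost is consistent with the 20ms-to-800ms slowdown observed above.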

Suspecting a memory leak, another systemtap script was used to track
any pdubufs that were pinned but not promptly unpinned:

% stap -e 'global pins
probe process("/usr/lib64/libpcp.so.3").function("__pmPinPDUBuf").call
    { pins[$handle] = sprint_ubacktrace() }
probe process("/usr/lib64/libpcp.so.3").function("__pmUnpinPDUBuf").call
    { delete pins[$handle]}
probe timer.s(10)
    { foreach (s in pins) { println(s,"\n",pins[s]) }; delete pins }' \
-v -t -x `pgrep pmwebd` -d PATH/TO/pmwebd  -g --suppress-time-limits

which pinpointed the call site for the leaking PDUbufs:

47554733567564  // <--- the PDU handle/pointer
__pmPinPDUBuf+0x0 [libpcp.so.3]
update_bounds.isra.1+0x536 [libpcp.so.3]   // <---- the guilty party
__pmLogFetchInterp+0x1a84 [libpcp.so.3]
__pmLogFetch+0x68f [libpcp.so.3]
pmFetch+0xb6 [libpcp.so.3]
_Z23pmgraphite_fetch_seriesP14MHD_ConnectionRKSslll+0xc36 [pmwebd]
_Z30pmgraphite_respond_render_jsonP14MHD_ConnectionRK11http_paramsRKSt6vectorISsSaISsEEb+0x7b5
[pmwebd]
_Z18pmgraphite_respondP14MHD_ConnectionRK11http_paramsRKSt6vectorISsSaISsEE+0x283
[pmwebd]
_ZL11mhd_respondPvP14MHD_ConnectionPKcS3_S3_S3_PmPS_+0x53f [pmwebd]

The guilty party is this block in interp.c:

444     if (pcp->valfmt == PM_VAL_INSITU)
445         icp->v_prior.lval = logrp->vset[k]->vlist[i].value.lval;
446     else {
447         if (icp->v_prior.pval != NULL)
448             __pmUnpinPDUBuf((void *)icp->v_prior.pval);
449         icp->v_prior.pval = logrp->vset[k]->vlist[i].value.pval;
450         __pmPinPDUBuf((void *)icp->v_prior.pval);   // <<<< here!
451     }

Not every call at that spot necessarily generates a leak, but all the
leaks appear to come from that spot.  A gdb pdubufdump() inferior-function
call after pmDestroyContext shows a growing list of pinned pdubuf[]s,
each with pincnt=3 in my particular case.

(gdb) p pdubufdump()
   free pdubuf[size]: 0x555555808000[28672] 0x5555558fa000[28672]
0x5555558db000[28672] 0x5555558cb000[28672] 0x555555819000[17408]
0x555555814000[17408]
   pinned pdubuf[pincnt]: 0x555555852000[3] 0x555555842000[3] 0x55555586e000[3]
0x55555582a000[3] 0x55555588d000[3] 0x55555584a000[3] 0x555555862000[3]
0x555555832000[3] 0x5555558a4000[3] 0x555555885000[3] 0x5555558c3000[3]
0x55555583a000[3] 0x55555589c000[3] 0x5555557fb000[3] 0x5555558bb000[3]
0x5555558d3000[3] 0x55555585a000[3] 0x5555558f2000[3]
$7 = void
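The leftover pins are consistent with the pairing in the interp.c block above: each update unpins the prior value and pins the new one, so the last value stored in each slot holds a pin that only a teardown-time unpin could release. A hypothetical miniature of that pairing (pdubuf_t, instcntl_t, and store_prior are illustrative stand-ins, not the real libpcp types):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a PDU buffer: just its pin reference count. */
typedef struct { int pincnt; } pdubuf_t;

static void pin(pdubuf_t *p)   { p->pincnt++; }
static void unpin(pdubuf_t *p) { p->pincnt--; }

/* Stand-in for the per-instance control block holding v_prior.pval. */
typedef struct { pdubuf_t *v_prior; } instcntl_t;

/* Mirrors the shape of interp.c lines 447-450: release the previous
 * value's pin, then pin the newly stored one. */
static void store_prior(instcntl_t *icp, pdubuf_t *newval)
{
    if (icp->v_prior != NULL)
        unpin(icp->v_prior);
    icp->v_prior = newval;
    pin(icp->v_prior);
}
```

After any sequence of updates exactly one pin per slot survives; unless context teardown walks the slots and unpins each v_prior, those pins accumulate across contexts, matching the ever-growing pinned list above (a pincnt of 3 would then reflect several slots sharing one buffer).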


A variant of this leak can probably be demonstrated with plain pmval.
With some random pcp archive (one available on request):

gdb --args pmval -t 2344 -S '@Thu May  8 21:34:56 EDT 2014'
    -T '@Sat Jun  7 21:34:56 EDT 2014' -a /PATH/TO/FOO.meta -i '1'
'kernel.all.load'

(gdb) break pmval.c:ALMOST_LAST_LINE_NUMBER // basically, end of main()
(gdb) run
(gdb) p pdubufdump()
   free pdubuf[size]: 0x640000[28672] 0x648000[28672] 0x62b000[17408]
0x619000[17408] 0x626000[17408]
   pinned pdubuf[pincnt]: 0x638000[1] 0x61e000[1] 0x650000[4] 0x630000[4]

where one might expect an empty pinned-pdubuf[] list.  With pmval being
limited to a single traversal of a single archive, it may not be an
exact match, and of course the leak is not significant there.  With pmwebd
traversing dozens of archives numerous times, the leak is crippling.

