> [...]
> Some systemtapping on a busy pmwebd shows a ginormous amount of
> traffic flowing through libpcp/src/pdubuf.c, [...]
> I'm thinking of redoing this module as a <search.h> (binary tree)
> [...]
It turns out a first cut of this wasn't too hard; see (RFC only)
pcpfans.git fche/multithread
commit 2aa8ca43c1a873c2ccd85d3890d3dc671d2b78d3
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Mon Mar 2 15:55:02 2015 -0500
libpcp pdubuf: rewrite using <search.h> binary trees
The former single pdubuf-tracking structure with linear searches was a
serious bottleneck in multithreaded clients such as pmwebd running
graphite graphing jobs. This version does away with that, and instead
uses the balanced binary trees provided by libc (<search.h>, SVr4,
POSIX.1-2001) to store nonoverlapping segments. This makes pin/unpin
lookups considerably faster. The nodes in the tree are malloc'd with
their buffers immediately adjacent (as sketched below), which reduces
malloc traffic/fragmentation. There is no free/pin-list kept any more,
which seems to speed things up further.
Single-threaded clients do not appear to suffer for it (as confirmed
with perf-stat reports of larger archive-processing tasks);
multithreaded clients seem to benefit; the code is also simpler. It
seems to be a win-win.
The proof is in the pudding: over a relatively large dataset, a random
pmwebd graphite view ran as follows with the single-threaded old code
(line up the metric-count rows for comparison):
[] digested 1688 metrics, in 43325.5ms
[] digested 199 metrics, in 2028.53ms
[] digested 231 metrics, in 9893.74ms
[] digested 390 metrics, in 4028.63ms
[] digested 398 metrics, in 3815.58ms
-M8-threaded with previous libpcp
[] digested 1688 metrics, in 51467.3ms
[] digested 199 metrics, in 1795.19ms
[] digested 231 metrics, in 4131.55ms
[] digested 390 metrics, in 2215.19ms
[] digested 398 metrics, in 1375.55ms
-M8-threaded with this libpcp
[] digested 1688 metrics, in 27656ms
[] digested 199 metrics, in 1232.82ms
[] digested 231 metrics, in 4859.89ms
[] digested 390 metrics, in 2052.02ms
[] digested 398 metrics, in 1715.61ms
There is considerable variation from run to run (and an anomaly or
two), but the trend is consistent: the benefits increase with larger
parallelism & job size.
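
To make the commit message above more concrete, here is a rough,
self-contained sketch of the kind of <search.h> segment tree it
describes. This is not the actual libpcp code: all names (bufnode,
segcmp, sketch_pdubuf_*) are made up for illustration, locking is
omitted, and the real pdubuf implementation differs in detail.

#include <search.h>
#include <stdlib.h>

struct bufnode {
    char    *addr;      /* start of the payload, just past this header */
    size_t   length;    /* payload size in bytes */
    unsigned pincount;  /* pin/unpin reference count */
    /* payload bytes follow immediately, so one malloc covers both */
};

static void *root;      /* tsearch()/tfind()/tdelete() tree root */

/* Order segments by address; any address inside [addr, addr+length)
 * compares equal to the segment, so a lookup on an interior pointer
 * finds its owning buffer.  Segments never overlap, so this is a
 * consistent total order. */
static int
segcmp(const void *a, const void *b)
{
    const struct bufnode *x = a, *y = b;
    size_t xlen = x->length ? x->length : 1;  /* length-0 keys act as one address */
    size_t ylen = y->length ? y->length : 1;
    if (x->addr + xlen <= y->addr)
        return -1;
    if (y->addr + ylen <= x->addr)
        return 1;
    return 0;
}

/* Allocate a tracked buffer: one malloc holds both the node header and
 * the payload, and the node is inserted into the balanced tree. */
void *
sketch_pdubuf_alloc(size_t need)
{
    struct bufnode *n = malloc(sizeof(*n) + need);
    if (n == NULL)
        return NULL;
    n->addr = (char *)(n + 1);
    n->length = need;
    n->pincount = 1;            /* new buffers start out pinned */
    tsearch(n, &root, segcmp);
    return n->addr;             /* caller sees only the payload */
}

/* Pin: bump the refcount of whichever segment contains p. */
int
sketch_pdubuf_pin(void *p)
{
    struct bufnode key = { .addr = p, .length = 0 };
    void *v = tfind(&key, &root, segcmp);
    if (v == NULL)
        return -1;              /* not a tracked buffer */
    (*(struct bufnode **)v)->pincount++;
    return 0;
}

/* Unpin: drop the refcount; on the last unpin, remove the node from
 * the tree and free node+payload in one step. */
int
sketch_pdubuf_unpin(void *p)
{
    struct bufnode key = { .addr = p, .length = 0 };
    void *v = tfind(&key, &root, segcmp);
    if (v == NULL)
        return -1;
    struct bufnode *n = *(struct bufnode **)v;
    if (--n->pincount == 0) {
        tdelete(n, &root, segcmp);
        free(n);
    }
    return 0;
}

The two points the commit message leans on show up directly here: the
containment-aware comparator lets pin/unpin on an interior pointer find
its owning buffer in O(log n), and the single malloc carries both the
tree node and the buffer, so there is no separate free/pin-list to
maintain.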