
Re: multithreading bottleneck: pdubuf.c

To: pcp developers <pcp@xxxxxxxxxxx>
Subject: Re: multithreading bottleneck: pdubuf.c
From: fche@xxxxxxxxxx (Frank Ch. Eigler)
Date: Mon, 02 Mar 2015 16:36:19 -0500
In-reply-to: <20150302015436.GB21203@xxxxxxxxxx> (Frank Ch. Eigler's message of "Sun, 1 Mar 2015 20:54:36 -0500")
References: <20150302015436.GB21203@xxxxxxxxxx>
User-agent: Gnus/5.1008 (Gnus v5.10.8) Emacs/21.4 (gnu/linux)
> [...]
> Some systemtapping on a busy pmwebd shows a ginormous amount of
> traffic flowing through libpcp/src/pdubuf.c, [...]
> I'm thinking of redoing this module as a <search.h> (binary tree)
> [...]

It turns out a first cut of this wasn't too hard; see (RFC only)
pcpfans.git fche/multithread

commit 2aa8ca43c1a873c2ccd85d3890d3dc671d2b78d3
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Mon Mar 2 15:55:02 2015 -0500

    libpcp pdubuf: rewrite using <search.h> binary trees
    
    The former single pdubuf-tracking structure with linear searches was a
    serious bottleneck in multithreaded clients such as pmwebd running
    graphite graphing jobs.  This version does away with that, and instead
    uses the balanced binary trees provided by libc (SVr4, POSIX.1-2001),
    storing non-overlapping segments.  This makes pin/unpin lookups
    considerably faster.  The nodes in the tree are malloc'd with their
    buffers immediately adjacent, which reduces malloc traffic and
    fragmentation.  There is no free/pin list kept any more, which seems
    to speed things up further.
    
    Single-threaded clients do not appear to suffer for it (as confirmed
    with perf-stat reports of larger archive-processing tasks);
    multithreaded clients seem to benefit; the code is also simpler.  It
    seems to be a win-win.
    
    The proof is in the pudding: over a relatively large dataset, a random
    pmwebd graphite view ran as follows.  First, single-threaded with the
    old code (line up the metric-count rows for comparison):
    
    [] digested 1688 metrics, in 43325.5ms
    [] digested 199 metrics, in 2028.53ms
    [] digested 231 metrics, in 9893.74ms
    [] digested 390 metrics, in 4028.63ms
    [] digested 398 metrics, in 3815.58ms
    
    -M8-threaded with previous libpcp
    
    [] digested 1688 metrics, in 51467.3ms
    [] digested 199 metrics, in 1795.19ms
    [] digested 231 metrics, in 4131.55ms
    [] digested 390 metrics, in 2215.19ms
    [] digested 398 metrics, in 1375.55ms
    
    -M8-threaded with this libpcp
    
    [] digested 1688 metrics, in 27656ms
    [] digested 199 metrics, in 1232.82ms
    [] digested 231 metrics, in 4859.89ms
    [] digested 390 metrics, in 2052.02ms
    [] digested 398 metrics, in 1715.61ms
    
    There is considerable variation from run to run (and an anomaly or
    two), but the trend is consistently that the benefits increase with
    greater parallelism and job size.
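
For the curious, the scheme looks roughly like the sketch below.  The
names (seg, seg_root, seg_new, ...) are illustrative only, not the
actual branch code, and a real implementation must serialize tree
access (the <search.h> trees are not thread-safe by themselves; libpcp
takes its own lock around these operations).  The main trick is the
comparator: since tracked buffers never overlap, "overlaps" can stand
in for "equal", so a pin/unpin lookup on an interior pointer reduces
to probing with a one-byte segment.

#include <search.h>
#include <stdlib.h>

typedef struct {
    char  *addr;     /* start of the pdu buffer */
    size_t len;      /* buffer length in bytes */
    int    pincnt;   /* pin reference count */
} seg;

static void *seg_root;   /* root of the tsearch()-managed tree */

/* Segments never overlap, so "a overlaps b" can serve as equality. */
static int
seg_cmp(const void *pa, const void *pb)
{
    const seg *a = pa, *b = pb;
    if (a->addr + a->len <= b->addr) return -1;  /* a entirely below b */
    if (b->addr + b->len <= a->addr) return 1;   /* a entirely above b */
    return 0;                                    /* same segment */
}

/* Allocate the control node and the buffer in one malloc, so the
   buffer sits immediately after the node (less fragmentation). */
static seg *
seg_new(size_t len)
{
    seg *s = malloc(sizeof(seg) + len);
    if (s == NULL)
        return NULL;
    s->addr = (char *)(s + 1);
    s->len = len;
    s->pincnt = 1;
    if (tsearch(s, &seg_root, seg_cmp) == NULL) {  /* out of memory */
        free(s);
        return NULL;
    }
    return s;
}

/* Find the segment containing pointer p (which may point anywhere
   inside a tracked buffer): probe with a one-byte segment at p. */
static seg *
seg_find(void *p)
{
    seg key = { .addr = p, .len = 1 };
    seg **found = tfind(&key, &seg_root, seg_cmp);
    return found ? *found : NULL;
}

/* Unpin; drop the node (and its adjacent buffer) at refcount zero. */
static void
seg_unpin(void *p)
{
    seg *s = seg_find(p);
    if (s != NULL && --s->pincnt == 0) {
        tdelete(s, &seg_root, seg_cmp);
        free(s);
    }
}

glibc implements tsearch() as a red-black tree, so these lookups are
O(log n) rather than the old O(n) linear scan, and there is no
separate free/pin list left to maintain.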
