
Re: [pcp] multithreading bottleneck: pdubuf.c

To: pcp@xxxxxxxxxxx
Subject: Re: [pcp] multithreading bottleneck: pdubuf.c
From: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Wed, 04 Mar 2015 07:22:41 +1100
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <20150302015436.GB21203@xxxxxxxxxx>
References: <20150302015436.GB21203@xxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
On 02/03/15 12:54, Frank Ch. Eigler wrote:
> ...While holding the lock, pinning/unpinning does a linear search
> of all already-allocated buffers.  Needs much improvement!
>
> I'm thinking of redoing this module as a <search.h> (binary tree)
> lookup (for identifying allocated pdubufs by bc_buf[]-contained
> address during pin/unpin), and ditching the free-list entirely (just
> do straight malloc/free, which is well-tuned for single+multi-threaded
> apps).  Any suggestions/concerns?
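
For concreteness, a minimal sketch of the kind of <search.h> range
lookup being proposed might look like the code below; the bufctl_t
layout and the buflookup() helper are illustrative only, not the
actual pdubuf.c declarations:

#include <stdio.h>
#include <stdlib.h>
#include <search.h>

/* Illustrative per-buffer record; these names are made up for the
 * sketch and are not the real pdubuf.c structures. */
typedef struct {
    char   *buf;    /* start of the allocated PDU buffer */
    size_t  len;    /* buffer length in bytes */
    int     pincnt; /* pin count */
} bufctl_t;

static void *root;  /* tsearch() tree root */

/* Order records by address range.  Any address inside [buf, buf+len)
 * compares equal to the owning record, so tfind() on an interior
 * pointer locates the right buffer in O(log n) instead of a linear
 * walk.  (Comparing pointers from different allocations is formally
 * unspecified in ISO C, but is the standard trick here.) */
static int
bufcmp(const void *a, const void *b)
{
    const bufctl_t *ka = a;
    const bufctl_t *kb = b;

    if (ka->buf + ka->len <= kb->buf)
        return -1;
    if (ka->buf >= kb->buf + kb->len)
        return 1;
    return 0;       /* ranges overlap => same buffer */
}

/* Hypothetical helper for pin/unpin: resolve an interior address. */
static bufctl_t *
buflookup(void *addr)
{
    bufctl_t key = { .buf = addr, .len = 1 };
    void *node = tfind(&key, &root, bufcmp);
    return node ? *(bufctl_t **)node : NULL;
}

int
main(void)
{
    bufctl_t *bc = malloc(sizeof(*bc));
    bc->len = 1024;
    bc->buf = malloc(bc->len);      /* straight malloc, no free-list */
    bc->pincnt = 1;
    tsearch(bc, &root, bufcmp);     /* O(log n) insert */

    /* unpin via a bc_buf[]-contained pointer, as the pdubuf API allows */
    bufctl_t *found = buflookup(bc->buf + 100);
    printf("found buffer, pincnt=%d\n", found ? found->pincnt : -1);
    return 0;
}

The comparator is the whole point: it makes any interior address
compare equal to the owning record, which is what lets tfind() replace
the linear search during pin/unpin.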

A bit of history might help explain the status quo and guide reimplementation ...

+ the PDU buffers are page-size aligned for a reason ... they are used for direct I/O calls, and some operating systems can expedite I/O on page-aligned buffers (avoiding the copy altogether in some cases) ... this was _really_ important in the early days, when peak performance was a major goal and the embryonic PCP project's survival depended on comparing favourably with sar(1) and vmstat(1)

+ page alignment means the buffers should also be sized in whole-page multiples, but we did not have easy access to the underlying VM page size in those days (remember, this was only shortly after the first versions of Linux began to appear), so 1024 was chosen

+ together these mean we have a pool of in-play buffers in a small number of sizes ... 1K, 2K, 3K, etc

+ the PCP PDU mix was also more restricted (no distributed namespace operations in particular)

+ because the range of PDU sizes was small, there was no multithreading, and buffers did not remain pinned for long, the number of buffers in the pool was expected to be small

+ we did some empirical experiments and, for the expected operating environment, showed that a simple pool allocator was faster than malloc/free and, unlike malloc in those days, provided page-aligned allocations (a sketch of this sort of allocator follows the list)
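
For reference, the scheme those points describe looks roughly like the
sketch below, rendered with today's posix_memalign(3) and sysconf(3)
(neither was portably available when pdubuf.c was written, hence the
hard-coded 1024); the structures and names are made up for
illustration, and locking is deliberately omitted, matching the
single-threaded assumptions of the era:

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <unistd.h>

#define MAXPAGES 8      /* size classes: 1, 2, ..., MAXPAGES pages */

/* Illustrative free-list node, threaded through recycled buffers. */
typedef struct node { struct node *next; } node_t;

static node_t *freelist[MAXPAGES];  /* one free list per size class */

/* Round a request up to whole pages and satisfy it from the matching
 * size-class free list, falling back to a fresh page-aligned
 * allocation. */
static void *
pool_alloc(size_t want)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (want + pagesz - 1) / pagesz;
    void *p;

    if (npages == 0 || npages > MAXPAGES)
        return NULL;                    /* outside the pool's classes */
    if (freelist[npages-1] != NULL) {   /* fast path: recycle */
        node_t *n = freelist[npages-1];
        freelist[npages-1] = n->next;
        return n;
    }
    if (posix_memalign(&p, pagesz, npages * pagesz) != 0)
        return NULL;                    /* slow path: new aligned pages */
    return p;
}

/* Return a buffer to its size-class free list rather than free()ing
 * it; "want" must match the size passed to pool_alloc(). */
static void
pool_free(void *p, size_t want)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (want + pagesz - 1) / pagesz;
    node_t *n = p;

    n->next = freelist[npages-1];
    freelist[npages-1] = n;
}

int
main(void)
{
    void *b = pool_alloc(3000);     /* rounded up to whole pages */
    pool_free(b, 3000);             /* recycled, not free()d */
    return (pool_alloc(3000) == b) ? 0 : 1;  /* fast path reuses it */
}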

Now maybe the assumptions from 20+ years ago are no longer valid, in which case other implementation strategies are warranted, but this needs to be assessed against the real requirements: not only for clients making new demands on the buffer code, but also for the traditional (and, I suggest, more common) code paths used by pmcd, pmlogger and pmie.

Hope the history helps.

I encourage Frank to investigate here, but before committing to a new regime I'd hope to see some empirical evidence of new vs old performance.
