On Thu, Aug 25, 2011 at 03:56:18PM -0500, Alex Elder wrote:
> On Thu, 2011-08-25 at 17:17 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> >
> > Stats show that for an 8-way unlink @ ~80,000 unlinks/s we are doing
> > ~1 million cache hit lookups to ~3000 buffer creates. That's almost
> > 3 orders of magnitude more cache hits than misses, so optimising for
> > cache hits is quite important. In the cache hit case we do not need
> > the new buffer that is allocated up front to handle a cache miss, so
> > we are effectively hitting the allocator for no good reason for the
> > vast majority of calls to _xfs_buf_find. 8-way create workloads are
> > showing similar cache hit/miss ratios.
> >
> > The result is profiles that look like this:
> >
> >     samples  pcnt function                        DSO
> >     _______ _____ _______________________________ _________________
> >
> >     1036.00 10.0% _xfs_buf_find                   [kernel.kallsyms]
> >      582.00  5.6% kmem_cache_alloc                [kernel.kallsyms]
> >      519.00  5.0% __memcpy                        [kernel.kallsyms]
> >      468.00  4.5% __ticket_spin_lock              [kernel.kallsyms]
> >      388.00  3.7% kmem_cache_free                 [kernel.kallsyms]
> >      331.00  3.2% xfs_log_commit_cil              [kernel.kallsyms]
> >
> >
> > Further, there is a fair bit of work involved in initialising a new
> > buffer once a cache miss has occurred and we currently do that under
> > the rbtree spinlock. That increases spinlock hold time on what are
> > heavily used trees.
> >
> > To fix this, remove the initialisation of the buffer from
> > _xfs_buf_find() and only allocate the new buffer once we've had a
> > cache miss. Initialise the buffer immediately after allocating it in
> > xfs_buf_get, too, so that it is ready for insert if we get another
> > cache miss after allocation. This minimises lock hold time and
> > avoids unnecessary allocator churn. The resulting profiles look
> > like:
> >
> >     samples  pcnt function                    DSO
> >     _______ _____ ___________________________ _________________
> >
> >     8111.00  9.1% _xfs_buf_find               [kernel.kallsyms]
> >     4380.00  4.9% __memcpy                    [kernel.kallsyms]
> >     4341.00  4.8% __ticket_spin_lock          [kernel.kallsyms]
> >     3401.00  3.8% kmem_cache_alloc            [kernel.kallsyms]
> >     2856.00  3.2% xfs_log_commit_cil          [kernel.kallsyms]
> >     2625.00  2.9% __kmalloc                   [kernel.kallsyms]
> >     2380.00  2.7% kfree                       [kernel.kallsyms]
> >     2016.00  2.3% kmem_cache_free             [kernel.kallsyms]
> >
> > This shows a significant reduction in time spent doing allocation and
> > freeing from slabs (kmem_cache_alloc and kmem_cache_free).
> >
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
>
> This is a good change, but I found one bug (of omission).
> I also have a pretty harmless suggestion, plus suggest
> some type changes.
>
> For now I have corrected the bug and implemented the
> one suggestion but not the type changes in my own copy
> of this patch and am testing with it. If you are
> comfortable with that, I can commit my modified version.
> The type changes can go in separately (they might expand
> a bit to affect other code anyway).
>
> Otherwise if you fix the bug you can consider this
> reviewed by me.
>
> Reviewed-by: Alex Elder <aelder@xxxxxxx>
>
> > ---
> >  fs/xfs/xfs_buf.c |   87 +++++++++++++++++++++++++++++++-----------------------
> >  1 files changed, 50 insertions(+), 37 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index c57836d..6fffa06 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -171,10 +171,16 @@ STATIC void
> >  _xfs_buf_initialize(
> >  	xfs_buf_t		*bp,
> >  	xfs_buftarg_t		*target,
> > -	xfs_off_t		range_base,
> > -	size_t			range_length,
> > +	xfs_off_t		bno,
> > +	size_t			num_blocks,
>
> Since you are now passing block numbers and block counts
> rather than byte offsets and counts the types of these
> arguments should be changed accordingly. I believe the
> right types are xfs_daddr_t and xfs_filblks_t; the latter
> doesn't exactly fit the usage but it's consistent with
> how it's used elsewhere.
I considered changing to xfs_daddr_t, but then thought "that's a
change to the external interface" and "there's more than just this
that needs fixing", so I figured I'd leave that to a followup
patchset. Hence I'd prefer to keep the type/API changes separate
from the functional changes.
> Fixing this maybe ought to be done more pervasively; the types
> for values passed in the num_blocks argument are a mix of __u64,
> int and size_t.
Exactly.
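
Just to make sure we're talking about the same thing, I'd expect the
eventual prototype to look something like the below - illustration
only, not part of this patch, and as you say the count type is
debatable:

STATIC void
_xfs_buf_initialize(
	xfs_buf_t		*bp,
	xfs_buftarg_t		*target,
	xfs_daddr_t		bno,		/* disk address in basic blocks */
	xfs_filblks_t		num_blocks,	/* count of basic blocks */
	xfs_buf_flags_t		flags)

with the same pair of types pushed through _xfs_buf_find() and
xfs_buf_get() so the callers stop passing a mix of __u64, int and
size_t.
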
> >  	xfs_buf_flags_t		flags)
> >  {
> > -	xfs_buf_t		*bp, *new_bp;
> > +	struct xfs_buf		*bp;
> > +	struct xfs_buf		*new_bp = NULL;
> >  	int			error = 0;
> >
> > +	bp = _xfs_buf_find(target, bno, num_blocks, flags, new_bp);
>
> I'd rather an explicit NULL be used above for the last argument.
> (I've made this change to my own version of this patch.)
Good point.
> > +	if (likely(bp))
> > +		goto found;
> > +
> >  	new_bp = xfs_buf_allocate(flags);
> >  	if (unlikely(!new_bp))
> >  		return NULL;
> >
> > -	bp = _xfs_buf_find(target, ioff, isize, flags, new_bp);
> > +	_xfs_buf_initialize(new_bp, target, bno, num_blocks, flags);
> > +
>
> . . .
>
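
As an aside, the shape of the change is: look up the cache first, and
only allocate and initialise a new buffer once we've missed, before we
go back to the rbtree. The tail of the hunk trimmed above does roughly
the following (a sketch from memory, not the literal patch):

	/* second lookup: insert new_bp into the rbtree if we miss again */
	bp = _xfs_buf_find(target, bno, num_blocks, flags, new_bp);
	if (!bp) {
		xfs_buf_deallocate(new_bp);
		return NULL;
	}
	if (bp != new_bp)	/* someone else won the insert race */
		xfs_buf_deallocate(new_bp);

i.e. all the allocator work now happens outside the rbtree spinlock,
and _xfs_buf_find() only has to link a pre-initialised buffer into the
tree on a miss.
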
> > @@ -790,7 +803,7 @@ xfs_buf_get_uncached(
> >  	bp = xfs_buf_allocate(0);
> >  	if (unlikely(bp == NULL))
> >  		goto fail;
> > -	_xfs_buf_initialize(bp, target, 0, len, 0);
> > +	_xfs_buf_initialize(bp, target, 0, BTOBB(len), 0);
> >
> >  	error = _xfs_buf_get_pages(bp, page_count, 0);
> >  	if (error)
>
> And the only remaining problem is the bug. You need to make
> a change comparable to what's right here in xfs_buf_get_empty().
> I.e., that function needs to pass a block count rather than
> a byte length. (I have made this change in my own copy.)
Oh, right, I missed the call in xfs_buf_get_empty(). It took me a
minute to work out that you were talking about a bug in
xfs_buf_get_empty() rather than in the above hunk :).
Will fix.
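
For the archive, it's the same conversion as in the uncached hunk
above - from memory the call in xfs_buf_get_empty() currently passes
the byte length straight through, so it becomes something like:

-	_xfs_buf_initialize(bp, target, 0, len, 0);
+	_xfs_buf_initialize(bp, target, 0, BTOBB(len), 0);
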
Cheers,
Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx