
Re: [PATCH 4/4] xfs: convert buffer cache hash to rbtree

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH 4/4] xfs: convert buffer cache hash to rbtree
From: Alex Elder <aelder@xxxxxxx>
Date: Mon, 13 Sep 2010 11:53:49 -0500
Cc: xfs@xxxxxxxxxxx
In-reply-to: <1283958778-28610-5-git-send-email-david@xxxxxxxxxxxxx>
References: <1283958778-28610-1-git-send-email-david@xxxxxxxxxxxxx> <1283958778-28610-5-git-send-email-david@xxxxxxxxxxxxx>
Reply-to: aelder@xxxxxxx
On Thu, 2010-09-09 at 01:12 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> The buffer cache hash is starting to show typical hash scalability
> problems.  Large scale testing is showing the number of cached items
> growing far larger than the hash can efficiently handle. Hence we
> need to move to a self-scaling cache indexing mechanism.
> 
> I have selected rbtrees for indexing because they can have O(log n)
> search scalability, and insert and remove cost is not excessive,
> even on large trees. Hence we should be able to cache large numbers
> of buffers without incurring the excessive cache miss search
> penalties that the hash is imposing on us.
> 
> To ensure we still have parallel access to the cache, we need
> multiple trees. Rather than hashing the buffers by disk address to
> select a tree, it seems more sensible to separate trees by typical
> access patterns. Most operations use buffers from within a single AG
> at a time, so rather than searching lots of different lists,
> separate the buffer indexes out into per-AG rbtrees. This means that
> searches during metadata operations have a much higher chance of
> hitting cache resident nodes, and that updates of the tree are less
> likely to disturb trees being accessed on other CPUs doing
> independent operations.

I found a bug in this, and have a bunch of other less
important comments for you to consider.  I haven't spent
any time contemplating your decision to use per-AG RB
trees, but it seems quite reasonable.

> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/linux-2.6/xfs_buf.c   |  139 +++++++++++++++++++++--------------------
>  fs/xfs/linux-2.6/xfs_buf.h   |   14 ++--
>  fs/xfs/linux-2.6/xfs_super.c |    6 +-
>  fs/xfs/xfs_ag.h              |    4 +
>  fs/xfs/xfs_mount.c           |    4 +-
>  5 files changed, 87 insertions(+), 80 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> index b2b5dea..facd37e 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.c
> +++ b/fs/xfs/linux-2.6/xfs_buf.c

. . .

> @@ -432,14 +434,36 @@ _xfs_buf_find(
>       ASSERT(!(range_length < (1 << btp->bt_sshift)));
>       ASSERT(!(range_base & (xfs_off_t)btp->bt_smask));
>  
> -     hash = &btp->bt_hash[hash_long((unsigned long)ioff, btp->bt_hashshift)];
> -
> -     spin_lock(&hash->bh_lock);
> -
> -     list_for_each_entry_safe(bp, n, &hash->bh_list, b_hash_list) {
> -             ASSERT(btp == bp->b_target);
> -             if (bp->b_file_offset == range_base &&
> -                 bp->b_buffer_length == range_length) {
> +     /* get tree root */
> +     pag = xfs_perag_get(btp->bt_mp, xfs_daddr_to_agno(btp->bt_mp, ioff));
> +
> +     /* walk tree */
> +     spin_lock(&pag->pagbuf_lock);
> +     rbp = &pag->pagbuf_tree.rb_node;
> +     parent = NULL;

Could drop the above assignment and change the while
statement to read:  while ((parent = *rbp)) {
(I know that leads to a mildly controversial style.)
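
Something like this (untested, just to show the shape; names are
the ones from your patch):

        while ((parent = *rbp)) {
                bp = rb_entry(parent, struct xfs_buf, b_rbnode);
                /* ... offset comparisons as below ... */
        }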

> +     bp = NULL;
> +     while (*rbp) {
> +             parent = *rbp;
> +             bp = rb_entry(parent, struct xfs_buf, b_rbnode);

Below here you might as well make use of the
value of "parent" in places where (*rbp) is used.
I realize you're just mimicking what's in the
rbtree.h file though...  If the result seems to
read funny, maybe you could rename "parent" to
be "entry" or something.

Here's the BUG:

> +             if (bp->b_file_offset < range_base)
> +                     rbp = &(*rbp)->rb_left;
> +             else if (bp->b_file_offset > range_base)
> +                     rbp = &(*rbp)->rb_right;

Your comparisons here are reversed.  In other words,
I believe it should be:

        if (range_base < bp->b_file_offset)     
                rbp = &parent->rb_left;
        else if (range_base > bp->b_file_offset)
                rbp = &parent->rb_right;

Maybe it mostly works in the order you have it, since both
insert and lookup use the same test and the tree would just
be consistently mirrored, but I'm pretty sure it's wrong.
(The "continue searching to the right" logic below would be
wrong in that case, though.)


> +             else {
> +                     /*
> +                      * found a block offset match. If the range doesn't
> +                      * match, the only way this is allowed is if the buffer
> +                      * in the cache is stale and the transaction that made
> +                      * it stale has not yet committed. i.e. we are
> +                      * reallocating a busy extent. Skip this buffer and
> +                      * continue searching to the right for an exact match.
> +                      */
> +                     if (bp->b_buffer_length != range_length) {
> +                             ASSERT(bp->b_flags & XBF_STALE);

This assertion is new.  I trust it's correct, but
maybe it should go in separately (first).

> +                             rbp = &(*rbp)->rb_right;
> +                             continue;
> +                     }
>                       atomic_inc(&bp->b_hold);
>                       goto found;
>               }

When I first read the next hunk I thought you were
mistakenly not dropping the perag reference.  Later
I realized it was intentional--that the buffer holds
a reference on the perag until it (the buffer) is
released.  This is a minor point but I think it would
be helpful to see a comment explaining that.
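
Something along these lines (wording is just a suggestion):

        /*
         * The buffer holds a reference on the perag until it is
         * released; the reference is dropped in xfs_buf_rele().
         */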

> @@ -449,17 +473,20 @@ _xfs_buf_find(
>       if (new_bp) {
>               _xfs_buf_initialize(new_bp, btp, range_base,
>                               range_length, flags);
> -             new_bp->b_hash = hash;
> -             list_add(&new_bp->b_hash_list, &hash->bh_list);
> +             new_bp->b_pag = pag;
> +             rb_link_node(&new_bp->b_rbnode, parent, rbp);
> +             rb_insert_color(&new_bp->b_rbnode, &pag->pagbuf_tree);
> +             spin_unlock(&pag->pagbuf_lock);
>       } else {
>               XFS_STATS_INC(xb_miss_locked);
> +             spin_unlock(&pag->pagbuf_lock);
> +             xfs_perag_put(pag);
>       }
> -
> -     spin_unlock(&hash->bh_lock);
>       return new_bp;
>  
>  found:
> -     spin_unlock(&hash->bh_lock);
> +     spin_unlock(&pag->pagbuf_lock);
> +     xfs_perag_put(pag);
>  
>       /* Attempt to get the semaphore without sleeping,
>        * if this does not work then we need to drop the
> @@ -810,27 +837,30 @@ void
>  xfs_buf_rele(
>       xfs_buf_t               *bp)
>  {
> -     xfs_bufhash_t           *hash = bp->b_hash;
> +     struct xfs_perag        *pag = bp->b_pag;
>  
>       trace_xfs_buf_rele(bp, _RET_IP_);
>  
> -     if (unlikely(!hash)) {
> +     if (!pag) {

I know this is not related to your change, but I'm
curious whether this can even happen, and if so, when?

>               ASSERT(!bp->b_relse);
> +             ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
>               if (atomic_dec_and_test(&bp->b_hold))
>                       xfs_buf_free(bp);
>               return;
>       }
>  
> +     ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
>       ASSERT(atomic_read(&bp->b_hold) > 0);
> -     if (atomic_dec_and_lock(&bp->b_hold, &hash->bh_lock)) {
> +     if (atomic_dec_and_lock(&bp->b_hold, &pag->pagbuf_lock)) {
>               if (bp->b_relse) {
>                       atomic_inc(&bp->b_hold);
> -                     spin_unlock(&hash->bh_lock);
> +                     spin_unlock(&pag->pagbuf_lock);
>                       (*(bp->b_relse)) (bp);

Drop the excess parentheses here (above) while you're at it.
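
i.e. just:

                        bp->b_relse(bp);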

>               } else {
>                       ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q)));
> -                     list_del_init(&bp->b_hash_list);
> -                     spin_unlock(&hash->bh_lock);
> +                     rb_erase(&bp->b_rbnode, &pag->pagbuf_tree);
> +                     spin_unlock(&pag->pagbuf_lock);
> +                     xfs_perag_put(pag);
>                       xfs_buf_free(bp);
>               }
>       }
> @@ -1433,57 +1463,25 @@ xfs_buf_iomove(
>   */
>  void
>  xfs_wait_buftarg(
> -     xfs_buftarg_t   *btp)
> +     struct xfs_buftarg      *btp)
>  {
> -     xfs_bufhash_t   *hash;
> -     uint            i;
> +     struct xfs_perag        *pag;
> +     uint                    i;
>  
> -     for (i = 0; i < (1 << btp->bt_hashshift); i++) {
> -             hash = &btp->bt_hash[i];
> -             spin_lock(&hash->bh_lock);
> -             while (!list_empty(&hash->bh_list)) {
> -                     spin_unlock(&hash->bh_lock);
> +     for (i = 0; i < btp->bt_mp->m_sb.sb_agcount; i++) {
> +             pag = xfs_perag_get(btp->bt_mp, i);
> +             spin_lock(&pag->pagbuf_lock);
> +             while (rb_first(&pag->pagbuf_tree)) {
> +                     spin_unlock(&pag->pagbuf_lock);
>                       delay(100);
> -                     spin_lock(&hash->bh_lock);
> +                     spin_lock(&pag->pagbuf_lock);
>               }
> -             spin_unlock(&hash->bh_lock);
> +             spin_unlock(&pag->pagbuf_lock);
> +             xfs_perag_put(pag);
>       }
>  }

This suggestion is again not related to your change, but...
Would this basic structure (not tested) be better than
the above?

        int more;

        do {
                more = 0;
                for (i = 0; i < btp->bt_mp->m_sb.sb_agcount; i++) {
                        pag = xfs_perag_get(btp->bt_mp, i);
                        spin_lock(&pag->pagbuf_lock);
                        if (rb_first(&pag->pagbuf_tree))
                                more++;
                        spin_unlock(&pag->pagbuf_lock);
                        xfs_perag_put(pag);
                }
                if (more)
                        delay(100);
        } while (more);
>  
>  /*
> - *   Allocate buffer hash table for a given target.
> - *   For devices containing metadata (i.e. not the log/realtime devices)
> - *   we need to allocate a much larger hash table.

. . . 

> @@ -1645,6 +1643,12 @@ xfs_alloc_buftarg(
>  
>       btp = kmem_zalloc(sizeof(*btp), KM_SLEEP);
>  
> +     /*
> +      * The buftarg cache should never be used by external devices.

I suggest that "buftarg cache" is not a good name for the
(now per-AG) buffer cache.

> +      * Ensure we catch any users with extreme prejudice.
> +      */
> +     btp->bt_mp = external ? NULL : mp;
> +
>       btp->bt_dev =  bdev->bd_dev;
>       btp->bt_bdev = bdev;
>       if (xfs_setsize_buftarg_early(btp, bdev))
> @@ -1653,7 +1657,6 @@ xfs_alloc_buftarg(
>               goto error;
>       if (xfs_alloc_delwrite_queue(btp, fsname))
>               goto error;
> -     xfs_alloc_bufhash(btp, external);
>       return btp;
>  
>  error:
> diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
> index 802dc5e..3797ee8 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.h
> +++ b/fs/xfs/linux-2.6/xfs_buf.h
. . .
> @@ -171,8 +170,8 @@ typedef struct xfs_buf {
>       wait_queue_head_t       b_waiters;      /* unpin waiters */
>       struct list_head        b_list;
>       xfs_buf_flags_t         b_flags;        /* status flags */
> -     struct list_head        b_hash_list;    /* hash table list */
> -     xfs_bufhash_t           *b_hash;        /* hash table list start */
> +     struct rb_node          b_rbnode;       /* rbtree node */
> +     struct xfs_perag        *b_pag;         /* rbtree root */

Rather, "contains rbtree root" (in the comment).

>       xfs_buftarg_t           *b_target;      /* buffer target (device) */
>       atomic_t                b_hold;         /* reference count */
>       xfs_daddr_t             b_bn;           /* block number for I/O */

. . . 

> diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
> index 4917d4e..e01d4cf 100644
> --- a/fs/xfs/xfs_ag.h
> +++ b/fs/xfs/xfs_ag.h
> @@ -230,6 +230,10 @@ typedef struct xfs_perag {
>       rwlock_t        pag_ici_lock;   /* incore inode lock */
>       struct radix_tree_root pag_ici_root;    /* incore inode cache root */
>       int             pag_ici_reclaimable;    /* reclaimable inodes */
> +
> +     /* buffer cache index */
> +     spinlock_t      pagbuf_lock;    /* lock for pagbuf_tree */
> +     struct rb_root  pagbuf_tree;    /* ordered tree of active buffers */

I know the names are in some way consistent with the rest
of the naming scheme for fields in this structure, but maybe
these could be named pag_buf_lock and pag_buf_tree.  (The rest
of the fields here have a sort of strange convention and
maybe they've got strong enough history that they should
be left alone.)
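
For what it's worth, that would look like (naming change only):

        /* buffer cache index */
        spinlock_t      pag_buf_lock;   /* lock for pag_buf_tree */
        struct rb_root  pag_buf_tree;   /* ordered tree of active buffers */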

>  #endif
>       int             pagb_count;     /* pagb slots in use */
>  } xfs_perag_t;

. . .
