xfs
[Top] [All Lists]

[PATCH 3/5] xfs: exact busy extent tracking

To: xfs@xxxxxxxxxxx
Subject: [PATCH 3/5] xfs: exact busy extent tracking
From: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date: Mon, 28 Mar 2011 17:06:17 -0400
References: <20110328210614.832613417@xxxxxxxxxxxxxxxxxxxxxx>
User-agent: quilt/0.48-1
Update the extent tree in case we have to reuse a busy extent, so that it
always is kept uptodate.  This is done by replacing the busy list searches
with a new xfs_alloc_busy_reuse helper, which updates the busy extent tree
in case of a reuse.   Also replace setting transactions to sync with forcing
the log out in case we found a busy extent to reuse.  This makes the code a
lot more simple, and is required for discard support later on.  While it
will cause performance regressios with just this patch applied, the impact
is more than mitigated by the next patch in the series.

Signed-off-by: Christoph Hellwig <hch@xxxxxx>

Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c 2011-03-28 16:06:17.806838570 +0200
+++ xfs/fs/xfs/xfs_alloc.c      2011-03-28 16:09:32.489839705 +0200
@@ -1396,8 +1396,8 @@ xfs_alloc_ag_vextent_small(
                if (error)
                        goto error0;
                if (fbno != NULLAGBLOCK) {
-                       if (xfs_alloc_busy_search(args->mp, args->agno, fbno, 
1))
-                               xfs_trans_set_sync(args->tp);
+                       xfs_alloc_busy_reuse(args->tp, args->agno, fbno, 1);
+
                        if (args->userdata) {
                                xfs_buf_t       *bp;
 
@@ -2459,100 +2459,6 @@ error0:
        return error;
 }
 
-
-/*
- * AG Busy list management
- * The busy list contains block ranges that have been freed but whose
- * transactions have not yet hit disk.  If any block listed in a busy
- * list is reused, the transaction that freed it must be forced to disk
- * before continuing to use the block.
- *
- * xfs_alloc_busy_insert - add to the per-ag busy list
- * xfs_alloc_busy_clear - remove an item from the per-ag busy list
- * xfs_alloc_busy_search - search for a busy extent
- */
-
-/*
- * Insert a new extent into the busy tree.
- *
- * The busy extent tree is indexed by the start block of the busy extent.
- * there can be multiple overlapping ranges in the busy extent tree but only
- * ever one entry at a given start block. The reason for this is that
- * multi-block extents can be freed, then smaller chunks of that extent
- * allocated and freed again before the first transaction commit is on disk.
- * If the exact same start block is freed a second time, we have to wait for
- * that busy extent to pass out of the tree before the new extent is inserted.
- * There are two main cases we have to handle here.
- *
- * The first case is a transaction that triggers a "free - allocate - free"
- * cycle. This can occur during btree manipulations as a btree block is freed
- * to the freelist, then allocated from the free list, then freed again. In
- * this case, the second extxpnet free is what triggers the duplicate and as
- * such the transaction IDs should match. Because the extent was allocated in
- * this transaction, the transaction must be marked as synchronous. This is
- * true for all cases where the free/alloc/free occurs in the one transaction,
- * hence the addition of the ASSERT(tp->t_flags & XFS_TRANS_SYNC) to this case.
- * This serves to catch violations of the second case quite effectively.
- *
- * The second case is where the free/alloc/free occur in different
- * transactions. In this case, the thread freeing the extent the second time
- * can't mark the extent busy immediately because it is already tracked in a
- * transaction that may be committing.  When the log commit for the existing
- * busy extent completes, the busy extent will be removed from the tree. If we
- * allow the second busy insert to continue using that busy extent structure,
- * it can be freed before this transaction is safely in the log.  Hence our
- * only option in this case is to force the log to remove the existing busy
- * extent from the list before we insert the new one with the current
- * transaction ID.
- *
- * The problem we are trying to avoid in the free-alloc-free in separate
- * transactions is most easily described with a timeline:
- *
- *      Thread 1       Thread 2        Thread 3        xfslogd
- *     xact alloc
- *     free X
- *     mark busy
- *     commit xact
- *     free xact
- *                     xact alloc
- *                     alloc X
- *                     busy search
- *                     mark xact sync
- *                     commit xact
- *                     free xact
- *                     force log
- *                     checkpoint starts
- *                     ....
- *                                     xact alloc
- *                                     free X
- *                                     mark busy
- *                                     finds match
- *                                     *** KABOOM! ***
- *                                     ....
- *                                                     log IO completes
- *                                                     unbusy X
- *                     checkpoint completes
- *
- * By issuing a log force in thread 3 @ "KABOOM", the thread will block until
- * the checkpoint completes, and the busy extent it matched will have been
- * removed from the tree when it is woken. Hence it can then continue safely.
- *
- * However, to ensure this matching process is robust, we need to use the
- * transaction ID for identifying transaction, as delayed logging results in
- * the busy extent and transaction lifecycles being different. i.e. the busy
- * extent is active for a lot longer than the transaction.  Hence the
- * transaction structure can be freed and reallocated, then mark the same
- * extent busy again in the new transaction. In this case the new transaction
- * will have a different tid but can have the same address, and hence we need
- * to check against the tid.
- *
- * Future: for delayed logging, we could avoid the log force if the extent was
- * first freed in the current checkpoint sequence. This, however, requires the
- * ability to pin the current checkpoint in memory until this transaction
- * commits to ensure that both the original free and the current one combine
- * logically into the one checkpoint. If the checkpoint sequences are
- * different, however, we still need to wait on a log force.
- */
 void
 xfs_alloc_busy_insert(
        struct xfs_trans        *tp,
@@ -2564,9 +2470,7 @@ xfs_alloc_busy_insert(
        struct xfs_busy_extent  *busyp;
        struct xfs_perag        *pag;
        struct rb_node          **rbp;
-       struct rb_node          *parent;
-       int                     match;
-
+       struct rb_node          *parent = NULL;
 
        new = kmem_zalloc(sizeof(struct xfs_busy_extent), KM_MAYFAIL);
        if (!new) {
@@ -2583,66 +2487,28 @@ xfs_alloc_busy_insert(
        new->agno = agno;
        new->bno = bno;
        new->length = len;
-       new->tid = xfs_log_get_trans_ident(tp);
-
        INIT_LIST_HEAD(&new->list);
 
        /* trace before insert to be able to see failed inserts */
        trace_xfs_alloc_busy(tp, agno, bno, len, 0);
 
        pag = xfs_perag_get(tp->t_mountp, new->agno);
-restart:
        spin_lock(&pag->pagb_lock);
        rbp = &pag->pagb_tree.rb_node;
-       parent = NULL;
-       busyp = NULL;
-       match = 0;
-       while (*rbp && match >= 0) {
+       while (*rbp) {
                parent = *rbp;
                busyp = rb_entry(parent, struct xfs_busy_extent, rb_node);
 
                if (new->bno < busyp->bno) {
-                       /* may overlap, but exact start block is lower */
                        rbp = &(*rbp)->rb_left;
-                       if (new->bno + new->length > busyp->bno)
-                               match = busyp->tid == new->tid ? 1 : -1;
+                       ASSERT(new->bno + new->length <= busyp->bno);
                } else if (new->bno > busyp->bno) {
-                       /* may overlap, but exact start block is higher */
                        rbp = &(*rbp)->rb_right;
-                       if (bno < busyp->bno + busyp->length)
-                               match = busyp->tid == new->tid ? 1 : -1;
+                       ASSERT(bno >= busyp->bno + busyp->length);
                } else {
-                       match = busyp->tid == new->tid ? 1 : -1;
-                       break;
+                       ASSERT(0);
                }
        }
-       if (match < 0) {
-               /* overlap marked busy in different transaction */
-               spin_unlock(&pag->pagb_lock);
-               xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
-               goto restart;
-       }
-       if (match > 0) {
-               /*
-                * overlap marked busy in same transaction. Update if exact
-                * start block match, otherwise combine the busy extents into
-                * a single range.
-                */
-               if (busyp->bno == new->bno) {
-                       busyp->length = max(busyp->length, new->length);
-                       spin_unlock(&pag->pagb_lock);
-                       ASSERT(tp->t_flags & XFS_TRANS_SYNC);
-                       xfs_perag_put(pag);
-                       kmem_free(new);
-                       return;
-               }
-               rb_erase(&busyp->rb_node, &pag->pagb_tree);
-               new->length = max(busyp->bno + busyp->length,
-                                       new->bno + new->length) -
-                               min(busyp->bno, new->bno);
-               new->bno = min(busyp->bno, new->bno);
-       } else
-               busyp = NULL;
 
        rb_link_node(&new->rb_node, parent, rbp);
        rb_insert_color(&new->rb_node, &pag->pagb_tree);
@@ -2650,7 +2516,6 @@ restart:
        list_add(&new->list, &tp->t_busy);
        spin_unlock(&pag->pagb_lock);
        xfs_perag_put(pag);
-       kmem_free(busyp);
 }
 
 /*
@@ -2705,6 +2570,162 @@ xfs_alloc_busy_search(
 }
 
 /*
+ * The found free extent [fbno, fend] overlaps part or all of the given busy
+ * extent.  If the overlap covers the beginning, the end, or all of the busy
+ * extent, the overlapping portion can be made unbusy and used for the
+ * allocation.  We can't split a busy extent because we can't modify a
+ * transaction/CIL context busy list, but we can update an entries block
+ * number or length.
+ *
+ * The caller will force the log and re-check the busy list after returning
+ * from this function.
+ */
+STATIC void
+xfs_alloc_busy_update_extent(
+       struct xfs_perag        *pag,
+       struct xfs_busy_extent  *busyp,
+       xfs_agblock_t           fbno,
+       xfs_agblock_t           fend)
+{
+       xfs_agblock_t   bbno = busyp->bno;
+       xfs_agblock_t   bend = bbno + busyp->length;
+
+       if (bbno < fbno && bend > fend) {
+               /*
+                * Case 1:
+                *    bbno           bend
+                *    +BBBBBBBBBBBBBBBBB+
+                *        +---------+
+                *        fbno   fend
+                */
+
+               /*
+                * We would have to split the busy extent to be able to track
+                * it correct, which we cannot do because we would have to
+                * modify the list of busy extents attached to the transaction
+                * or CIL context, which is immutable.
+                *
+                * Let the caller force out the log to clear the busy extents
+                * and retry the search.
+                */
+       } else if (bbno >= fbno && bend <= fend) {
+               /*
+                * Case 2:
+                *    bbno           bend
+                *    +BBBBBBBBBBBBBBBBB+
+                *    +-----------------+
+                *    fbno           fend
+                *
+                * Case 3:
+                *    bbno           bend
+                *    +BBBBBBBBBBBBBBBBB+
+                *    +--------------------------+
+                *    fbno                    fend
+                *
+                * Case 4:
+                *             bbno           bend
+                *             +BBBBBBBBBBBBBBBBB+
+                *    +--------------------------+
+                *    fbno                    fend
+                *
+                * Case 5:
+                *             bbno           bend
+                *             +BBBBBBBBBBBBBBBBB+
+                *    +-----------------------------------+
+                *    fbno                             fend
+                *
+                */
+
+               /*
+                * The busy extent is fully covered by the extent we are
+                * allocating, and can simply be removed from the rbtree.
+                * However we cannot remove it from the immutable list
+                * tracking busy extents in the transaction or CIL context,
+                * so set the length to zero to mark it invalid.
+                */
+               rb_erase(&busyp->rb_node, &pag->pagb_tree);
+               busyp->length = 0;
+       } else if (bbno == fbno) {
+               /*
+                * Case 6:
+                *              bbno           bend
+                *             +BBBBBBBBBBBBBBBBB+
+                *             +---------+
+                *             fbno   fend
+                *
+                * Case 7:
+                *             bbno           bend
+                *             +BBBBBBBBBBBBBBBBB+
+                *    +------------------+
+                *    fbno            fend
+                *
+                */
+               busyp->bno = fend;
+       } else if (bend == fend) {
+               /*
+                * Case 8:
+                *    bbno           bend
+                *    +BBBBBBBBBBBBBBBBB+
+                *        +-------------+
+                *        fbno       fend
+                *
+                * Case 9:
+                *    bbno           bend
+                *    +BBBBBBBBBBBBBBBBB+
+                *        +----------------------+
+                *        fbno                fend
+                */
+
+               busyp->length = fbno - busyp->bno;
+       } else {
+               ASSERT(0);
+       }
+}
+
+
+/*
+ * For a given extent [fbno, flen], make sure we can reuse it safely.
+ */
+void
+xfs_alloc_busy_reuse(
+       struct xfs_trans        *tp,
+       xfs_agnumber_t          agno,
+       xfs_agblock_t           fbno,
+       xfs_extlen_t            flen)
+{
+       struct xfs_perag        *pag;
+       struct rb_node          *rbp;
+
+       ASSERT(flen > 0);
+
+       pag = xfs_perag_get(tp->t_mountp, agno);
+       spin_lock(&pag->pagb_lock);
+       rbp = pag->pagb_tree.rb_node;
+       while (rbp) {
+               struct xfs_busy_extent *busyp =
+                       rb_entry(rbp, struct xfs_busy_extent, rb_node);
+               xfs_agblock_t   fend = fbno + flen;
+               xfs_agblock_t   bbno = busyp->bno;
+               xfs_agblock_t   bend = bbno + busyp->length;
+
+               if (fend <= bbno) {
+                       rbp = rbp->rb_left;
+                       continue;
+               } else if (fbno >= bend) {
+                       rbp = rbp->rb_right;
+                       continue;
+               }
+
+               xfs_alloc_busy_update_extent(pag, busyp, fbno, fbno + flen);
+
+               spin_unlock(&pag->pagb_lock);
+               xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
+       }
+       spin_unlock(&pag->pagb_lock);
+       xfs_perag_put(pag);
+}
+
+/*
  * For a given extent [fbno, flen], search the busy extent list
  * to find a subset of the extent that is not busy.
  */
@@ -2886,14 +2907,12 @@ xfs_alloc_busy_clear(
        trace_xfs_alloc_unbusy(mp, busyp->agno, busyp->bno,
                                                busyp->length);
 
-       ASSERT(xfs_alloc_busy_search(mp, busyp->agno, busyp->bno,
-                                               busyp->length) == 1);
-
        list_del_init(&busyp->list);
 
        pag = xfs_perag_get(mp, busyp->agno);
        spin_lock(&pag->pagb_lock);
-       rb_erase(&busyp->rb_node, &pag->pagb_tree);
+       if (busyp->length)
+               rb_erase(&busyp->rb_node, &pag->pagb_tree);
        spin_unlock(&pag->pagb_lock);
        xfs_perag_put(pag);
 
Index: xfs/fs/xfs/xfs_alloc.h
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.h 2011-03-28 16:06:17.822840017 +0200
+++ xfs/fs/xfs/xfs_alloc.h      2011-03-28 16:06:23.013344460 +0200
@@ -145,6 +145,10 @@ xfs_alloc_busy_clear(struct xfs_mount *m
 int
 xfs_alloc_busy_search(struct xfs_mount *mp, xfs_agnumber_t agno,
        xfs_agblock_t bno, xfs_extlen_t len);
+
+void
+xfs_alloc_busy_reuse(struct xfs_trans *tp, xfs_agnumber_t agno,
+       xfs_agblock_t fbno, xfs_extlen_t flen);
 #endif /* __KERNEL__ */
 
 /*
Index: xfs/fs/xfs/xfs_alloc_btree.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc_btree.c   2011-03-28 16:06:17.830839717 +0200
+++ xfs/fs/xfs/xfs_alloc_btree.c        2011-03-28 16:06:23.025342125 +0200
@@ -94,8 +94,8 @@ xfs_allocbt_alloc_block(
                *stat = 0;
                return 0;
        }
-       if (xfs_alloc_busy_search(cur->bc_mp, cur->bc_private.a.agno, bno, 1))
-               xfs_trans_set_sync(cur->bc_tp);
+
+       xfs_alloc_busy_reuse(cur->bc_tp, cur->bc_private.a.agno, bno, 1);
 
        xfs_trans_agbtree_delta(cur->bc_tp, 1);
        new->s = cpu_to_be32(bno);
Index: xfs/fs/xfs/xfs_ag.h
===================================================================
--- xfs.orig/fs/xfs/xfs_ag.h    2011-03-28 16:06:17.842839071 +0200
+++ xfs/fs/xfs/xfs_ag.h 2011-03-28 16:06:23.037341448 +0200
@@ -187,7 +187,6 @@ struct xfs_busy_extent {
        xfs_agnumber_t  agno;
        xfs_agblock_t   bno;
        xfs_extlen_t    length;
-       xlog_tid_t      tid;            /* transaction that created this */
 };
 
 /*
Index: xfs/fs/xfs/xfs_log.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log.c   2011-03-28 16:06:17.854837887 +0200
+++ xfs/fs/xfs/xfs_log.c        2011-03-28 16:06:20.390839609 +0200
@@ -3248,13 +3248,6 @@ xfs_log_ticket_get(
        return ticket;
 }
 
-xlog_tid_t
-xfs_log_get_trans_ident(
-       struct xfs_trans        *tp)
-{
-       return tp->t_ticket->t_tid;
-}
-
 /*
  * Allocate and initialise a new log ticket.
  */
Index: xfs/fs/xfs/xfs_log.h
===================================================================
--- xfs.orig/fs/xfs/xfs_log.h   2011-03-28 16:06:17.866839600 +0200
+++ xfs/fs/xfs/xfs_log.h        2011-03-28 16:06:20.394838660 +0200
@@ -189,8 +189,6 @@ void          xlog_iodone(struct xfs_buf *);
 struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
 void     xfs_log_ticket_put(struct xlog_ticket *ticket);
 
-xlog_tid_t xfs_log_get_trans_ident(struct xfs_trans *tp);
-
 void   xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
                                struct xfs_log_vec *log_vector,
                                xfs_lsn_t *commit_lsn, int flags);
Index: xfs/fs/xfs/xfs_log_priv.h
===================================================================
--- xfs.orig/fs/xfs/xfs_log_priv.h      2011-03-28 16:06:17.878839137 +0200
+++ xfs/fs/xfs/xfs_log_priv.h   2011-03-28 16:06:20.398888649 +0200
@@ -144,6 +144,8 @@ static inline uint xlog_get_client_id(__
 #define        XLOG_RECOVERY_NEEDED    0x4     /* log was recovered */
 #define XLOG_IO_ERROR          0x8     /* log hit an I/O error, and being
                                           shutdown */
+typedef __uint32_t xlog_tid_t;
+
 
 #ifdef __KERNEL__
 /*
Index: xfs/fs/xfs/xfs_types.h
===================================================================
--- xfs.orig/fs/xfs/xfs_types.h 2011-03-28 16:06:17.894838713 +0200
+++ xfs/fs/xfs/xfs_types.h      2011-03-28 16:06:20.402904478 +0200
@@ -73,8 +73,6 @@ typedef       __int32_t       xfs_tid_t;      /* transact
 typedef        __uint32_t      xfs_dablk_t;    /* dir/attr block number (in 
file) */
 typedef        __uint32_t      xfs_dahash_t;   /* dir/attr hash value */
 
-typedef __uint32_t     xlog_tid_t;     /* transaction ID type */
-
 /*
  * These types are 64 bits on disk but are either 32 or 64 bits in memory.
  * Disk based types:

<Prev in Thread] Current Thread [Next in Thread>