xfs
[Top] [All Lists]

[PATCH 1/6] xfs: introduce inode cluster buffer trylocks for xfs_iflush

To: xfs@xxxxxxxxxxx
Subject: [PATCH 1/6] xfs: introduce inode cluster buffer trylocks for xfs_iflush
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Thu, 10 Mar 2011 11:05:24 +1100
In-reply-to: <1299715529-11026-1-git-send-email-david@xxxxxxxxxxxxx>
References: <1299715529-11026-1-git-send-email-david@xxxxxxxxxxxxx>
From: Dave Chinner <dchinner@xxxxxxxxxx>

There is an ABBA deadlock between synchronous inode flushing in
xfs_reclaim_inode and xfs_icluster_free. xfs_icluster_free locks the
buffer, then takes inode ilocks, whilst synchronous reclaim takes
the ilock followed by the buffer lock in xfs_iflush().

To avoid this deadlock, separate the inode cluster buffer locking
semantics from the synchronous inode flush semantics, allowing
callers to attempt to lock the buffer but still issue synchronous IO
if it can get the buffer. This requires xfs_iflush() calls that
currently use non-blocking semantics to pass SYNC_TRYLOCK rather
than 0 as the flags parameter.

This allows xfs_reclaim_inode to avoid the deadlock on the buffer
lock and detect the failure so that it can drop the inode ilock and
restart the reclaim attempt on the inode. This allows
xfs_ifree_cluster to obtain the inode lock, mark the inode stale and
release it and hence defuse the deadlock situation. It also has the
pleasant side effect of avoiding IO in xfs_reclaim_inode when it
tries to next reclaim the inode as it is now marked stale.

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/xfs/linux-2.6/xfs_super.c |    2 +-
 fs/xfs/linux-2.6/xfs_sync.c  |   30 +++++++++++++++++++++++++++---
 fs/xfs/xfs_inode.c           |    2 +-
 fs/xfs/xfs_inode_item.c      |    6 +++---
 4 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 818c4cf..8a70b2a 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1078,7 +1078,7 @@ xfs_fs_write_inode(
                        error = 0;
                        goto out_unlock;
                }
-               error = xfs_iflush(ip, 0);
+               error = xfs_iflush(ip, SYNC_TRYLOCK);
        }
 
  out_unlock:
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 6c10f1d..594cd82 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -761,8 +761,10 @@ xfs_reclaim_inode(
        struct xfs_perag        *pag,
        int                     sync_mode)
 {
-       int     error = 0;
+       int     error;
 
+restart:
+       error = 0;
        xfs_ilock(ip, XFS_ILOCK_EXCL);
        if (!xfs_iflock_nowait(ip)) {
                if (!(sync_mode & SYNC_WAIT))
@@ -788,9 +790,31 @@ xfs_reclaim_inode(
        if (xfs_inode_clean(ip))
                goto reclaim;
 
-       /* Now we have an inode that needs flushing */
-       error = xfs_iflush(ip, sync_mode);
+       /*
+        * Now we have an inode that needs flushing.
+        *
+        * We do a nonblocking flush here even if we are doing a SYNC_WAIT
+        * reclaim as we can deadlock with inode cluster removal.
+        * xfs_ifree_cluster() can lock the inode buffer before it locks the
+        * ip->i_lock, and we are doing the exact opposite here. As a result,
+        * doing a blocking xfs_itobp() to get the cluster buffer will result
+        * in an ABBA deadlock with xfs_ifree_cluster().
+        *
+        * As xfs_ifree_cluser() must gather all inodes that are active in the
+        * cache to mark them stale, if we hit this case we don't actually want
+        * to do IO here - we want the inode marked stale so we can simply
+        * reclaim it. Hence if we get an EAGAIN error on a SYNC_WAIT flush,
+        * just unlock the inode, back off and try again. Hopefully the next
+        * pass through will see the stale flag set on the inode.
+        */
+       error = xfs_iflush(ip, SYNC_TRYLOCK | sync_mode);
        if (sync_mode & SYNC_WAIT) {
+               if (error == EAGAIN) {
+                       xfs_iunlock(ip, XFS_ILOCK_EXCL);
+                       /* backoff longer than in xfs_ifree_cluster */
+                       delay(2);
+                       goto restart;
+               }
                xfs_iflock(ip);
                goto reclaim;
        }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index da871f5..742c833 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2835,7 +2835,7 @@ xfs_iflush(
         * Get the buffer containing the on-disk inode.
         */
        error = xfs_itobp(mp, NULL, ip, &dip, &bp,
-                               (flags & SYNC_WAIT) ? XBF_LOCK : XBF_TRYLOCK);
+                               (flags & SYNC_TRYLOCK) ? XBF_TRYLOCK : 
XBF_LOCK);
        if (error || !bp) {
                xfs_ifunlock(ip);
                return error;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index fd4f398..46cc401 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -760,11 +760,11 @@ xfs_inode_item_push(
         * Push the inode to it's backing buffer. This will not remove the
         * inode from the AIL - a further push will be required to trigger a
         * buffer push. However, this allows all the dirty inodes to be pushed
-        * to the buffer before it is pushed to disk. THe buffer IO completion
-        * will pull th einode from the AIL, mark it clean and unlock the flush
+        * to the buffer before it is pushed to disk. The buffer IO completion
+        * will pull the inode from the AIL, mark it clean and unlock the flush
         * lock.
         */
-       (void) xfs_iflush(ip, 0);
+       (void) xfs_iflush(ip, SYNC_TRYLOCK);
        xfs_iunlock(ip, XFS_ILOCK_SHARED);
 }
 
-- 
1.7.2.3

<Prev in Thread] Current Thread [Next in Thread>