xfs
[Top] [All Lists]

Re: long hangs when deleting large directories (3.0-rc3)

To: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Subject: Re: long hangs when deleting large directories (3.0-rc3)
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Wed, 29 Jun 2011 22:10:01 +1000
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20110629074127.GA1746@xxxxxxxxxxxxxx>
References: <20110621212201.GA1755@xxxxxxxxxxxxxx> <20110621185701.GB1723@xxxxxxxxxxxxxx> <20110622000449.GQ32466@dastard> <20110622070647.GA1744@xxxxxxxxxxxxxx> <20110622073047.GT32466@dastard> <20110629043143.GA1026@dastard> <20110629061954.GA1711@xxxxxxxxxxxxxx> <20110629072446.GR561@dastard> <20110629074127.GA1746@xxxxxxxxxxxxxx>
User-agent: Mutt/1.5.20 (2009-06-14)
On Wed, Jun 29, 2011 at 09:41:27AM +0200, Markus Trippelsdorf wrote:
> On 2011.06.29 at 17:24 +1000, Dave Chinner wrote:
> > On Wed, Jun 29, 2011 at 08:19:54AM +0200, Markus Trippelsdorf wrote:
> > > On 2011.06.29 at 14:31 +1000, Dave Chinner wrote:
> > > > On Wed, Jun 22, 2011 at 05:30:47PM +1000, Dave Chinner wrote:
> > > > > Jun 22 08:53:09 x4 kernel: XFS (sdb1): ail: ooo splice, tail 
> > > > > 0x12000156e7, item 0x12000156e6
> > > > > Jun 22 08:53:09 x4 kernel: XFS (sdb1): ail: ooo splice, walked 15503 
> > > > > items      
> > > > > .....
> > > > > Jun 22 08:53:12 x4 kernel: XFS (sdb1): ail: ooo splice, tail 
> > > > > 0x12000156e7, item 0x12000156e6
> > > > > Jun 22 08:53:12 x4 kernel: XFS (sdb1): ail: ooo splice, walked 16945 
> > > > > items
> > > > > 
> > > > > Interesting is the LSN of the tail - it's only one sector further on
> > > > > than the items being inserted. That's what I'd expect from a commit
> > > > > record write race between two checkpoints. I'll have a deeper look
> > > > > into whether this can be avoided later tonight and also whether I
> > > > > can easily implement a "last insert cursor" easily so subsequent
> > > > > inserts at the same LSN avoid the walk....
> > > > 
> > > > Ok, so here's a patch that does just this. I should probably also do
> > > > a little bit of cleanup on the cursor code as well, but this avoids
> > > > the repeated walks of the AIL to find the insert position.
> > > > 
> > > > Can you try it without the WQ changes you made, Marcus, and see if
> > > > the interactivity problems go away?
> > > 
> > > Sorry to be the bringer of bad news, but this made things much worse:
> > > 
> > > -------cpu0-usage--------------cpu1-usage--------------cpu2-usage--------------cpu3-usage------
> > >  --dsk/sdc-- ---system-- ---load-avg--- --dsk/sdc--
> > > usr sys idl wai hiq siq:usr sys idl wai hiq siq:usr sys idl wai hiq 
> > > siq:usr sys idl wai hiq siq| read  writ| int   csw | 1m   5m  15m |reads 
> > > writs
> > >   1   1  98   0   0   0:  0   1  99   0   0   0:  0   1  99   0   0   0:  
> > > 0   1  99   0   0   0|   0     0 | 603   380 |0.66 0.55 0.28|   0     0 
> > >   1   0  99   0   0   0:  1   0  99   0   0   0:  1  19  80   0   0   0:  
> > > 0   0 100   0   0   0|   0     0 | 719   383 |0.66 0.55 0.28|   0     0 
> > >   3   1  96   0   0   0:  3   1  96   0   0   0:  1  52  47   0   0   0:  
> > > 0   0 100   0   0   0|   0  6464k|1847   919 |0.66 0.55 0.28|   0   202 
> > >   2  13  85   0   0   0:  2   2  96   0   0   0:  1  56  43   0   0   0:  
> > > 1  31  69   0   0   0|4096B  256k|1910  1280 |0.68 0.56 0.28|   1     8 
> > > > 0   1  99   0   0   0:  0   0 100   0   0   0:  0   1  99   0   0   0:  
> > > > 0 100   0   0   0   0|   0     0 |1256   170 |0.68 0.56 0.28|   0     0 
> > > > 0   1  99   0   0   0:  1   1  98   0   0   0:  1   0  99   0   0   0:  
> > > > 0  99   0   0   0   1|   0     0 |1395   229 |0.68 0.56 0.28|   0     0 
> > > > 0   0 100   0   0   0:  0   0 100   0   0   0:  0   3  97   0   0   0:  
> > > > 0 100   0   0   0   0|   0   512B|1304   167 |0.68 0.56 0.28|   0     1 
> > > > 1   1  98   0   0   0:  1   1  98   0   0   0:  0   0 100   0   0   0:  
> > > > 0  99   0   0   0   1|   0     0 |1211   146 |0.68 0.56 0.28|   0     0 
> > > > 0   0 100   0   0   0:  0   0 100   0   0   0:  0   1  99   0   0   0:  
> > > > 0  97   0   0   0   3|   0     0 |1270   149 |0.87 0.60 0.30|   0     0 
> > >   5   2  65  29   0   0:  2   3  95   0   0   0:  1   0  99   0   0   0:  
> > > 2  24  72   0   0   1|   0  8866k|2654  2398 |0.87 0.60 0.30|   0   496 
> > >   6   2  25  67   0   0:  3   1  59  37   0   0:  0   0 100   0   0   0:  
> > > 4   4  92   0   0   0|   0  4554k|2224  2494 |0.87 0.60 0.30|   0   399 
> > >   1   1  98   0   0   0:  0   0  83  17   0   0:  1   3  96   0   0   0:  
> > > 0   1  99   0   0   0|   0  2270k|1079  1030 |0.87 0.60 0.30|   0   200 
> > >   1   1  98   0   0   0:  1   1  98   0   0   0:  0   1  99   0   0   0:  
> > > 1   0  99   0   0   0|   0  9216B| 713   567 |0.87 0.60 0.30|   0     2 
> > >   0   0 100   0   0   0:  1   1  98   0   0   0:  0   0 100   0   0   0:  
> > > 0   1  99   0   0   0|   0     0 | 492   386 |0.80 0.59 0.30|   0     0 
> > > 
> > > As you can see in the table above (resolution 1sec) the hang is now
> > > 5-6 seconds long, instead of the 1-3 seconds seen before.
> > 
> > Interesting. I checked that the ordering was correct in each case
> > adn that it was behaving correctly here.
> > 
> > Can you add the following patch and send me the dmesg output over a
> > hang? It will tell me where the cursor is being initialised and when
> > it is being dropped, so should indicate if a specific insert chain
> > is getting stuck or doing something stoopid.
> 
> The kernel log is attached.
> rm -fr && sync starts at Jun 29 09:32:24.

Add this patch on top of the first one I sent. If it doesn't fix the
problem, can you readd the debug patch and send the log again?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


xfs: unpin stale inodes directly in IOP_COMMITTED

From: Dave Chinner <dchinner@xxxxxxxxxx>

When inodes are marked stale in a transaction, they are treated
specially when the iinode log item is being inserted into the AIL.
It trieѕ to avoid moving the log item forward in the AIL due to a
race condition with the writing the underlying buffer back to disk.
The was "fixed" in commit de25c18 ("xfs: avoid moving stale inodes in
the AIL").

To avoid moving the item forward, we return a LSN smaller than the
commit_lsn of the completing transaction, thereby trying to trick
the commit code into not moving the inode forward at all. I'm not
sure this ever worked as intended - it assumes the inode is already
in the AIL, but I don't think the returned LSN would have been small
enough to prevent moving the inode. It appears that the reason it
worked is that the lower LSN of the inodes meant they were inserted
into the AIL and flushed before the inode buffer (which was moved to
the commit_lsn of the transaction).

The big problem is that with delayed logging, the returning of the
different LSN means insertion takes the slow, non-bulk path.  Worse
yet is that insertion is to a position -before- the commit_lsn so it
is doing a AIL traversal on every insertion, and has to walk over
all the items that have already been inserted into the AIL. It's
expensive.

To compound the matter further, with delayed logging inodes are
likely to go from clean to stale in a single checkpoint, which means
they aren't even in the AIL at all when we come across them at AIL
insertion time. Hence these were all getting inserted into the AIL
when they simply do not need to be as inodes marked XFS_ISTALE are
never written back.

Transactional/recovery integrity is maintained in this case by the
other items in the unlink transaction that were modified (e.g. the
AGI btree blocks) and committed in the same checkpoint.

So to fix this, simply unpin the stale inodes directly in
xfs_inode_item_committed() and return -1 to indicate that the AIL
insertion code does not need to do any further processing of these
inodes.

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/xfs/xfs_inode_item.c |   14 ++++++++------
 fs/xfs/xfs_trans.c      |    2 +-
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 09983a3..b1e88d5 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -681,15 +681,15 @@ xfs_inode_item_unlock(
  * where the cluster buffer may be unpinned before the inode is inserted into
  * the AIL during transaction committed processing. If the buffer is unpinned
  * before the inode item has been committed and inserted, then it is possible
- * for the buffer to be written and IO completions before the inode is inserted
+ * for the buffer to be written and IO completes before the inode is inserted
  * into the AIL. In that case, we'd be inserting a clean, stale inode into the
  * AIL which will never get removed. It will, however, get reclaimed which
  * triggers an assert in xfs_inode_free() complaining about freein an inode
  * still in the AIL.
  *
- * To avoid this, return a lower LSN than the one passed in so that the
- * transaction committed code will not move the inode forward in the AIL but
- * will still unpin it properly.
+ * To avoid this, just unpin the inode directly and return a LSN of -1 so the
+ * transaction committed code knows that it does not need to do any further
+ * processing on the item.
  */
 STATIC xfs_lsn_t
 xfs_inode_item_committed(
@@ -699,8 +699,10 @@ xfs_inode_item_committed(
        struct xfs_inode_log_item *iip = INODE_ITEM(lip);
        struct xfs_inode        *ip = iip->ili_inode;
 
-       if (xfs_iflags_test(ip, XFS_ISTALE))
-               return lsn - 1;
+       if (xfs_iflags_test(ip, XFS_ISTALE)) {
+               xfs_inode_item_unpin(lip, 0);
+               return -1;
+       }
        return lsn;
 }
 
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 7d60d7e..d5d5708 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -1487,7 +1487,7 @@ xfs_trans_committed_bulk(
                        lip->li_flags |= XFS_LI_ABORTED;
                item_lsn = IOP_COMMITTED(lip, commit_lsn);
 
-               /* item_lsn of -1 means the item was freed */
+               /* item_lsn of -1 means the item needs no further processing */
                if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0)
                        continue;
 

<Prev in Thread] Current Thread [Next in Thread>