[Top] [All Lists]

Re: [xfs-masters] xfs deadlock in stable kernel 3.0.4

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [xfs-masters] xfs deadlock in stable kernel 3.0.4
From: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date: Wed, 21 Sep 2011 08:26:49 -0400
Cc: Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx>, "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
In-reply-to: <20110921114237.GP15688@dastard>
References: <20110918230245.GF15688@dastard> <4E78665E.8030409@xxxxxxxxxxxx> <20110920160226.GA25542@xxxxxxxxxxxxx> <4E78CBF4.1030505@xxxxxxxxxxxx> <20110920172455.GA30757@xxxxxxxxxxxxx> <4E78CEFD.9030603@xxxxxxxxxxxx> <20110920223047.GA13758@xxxxxxxxxxxxx> <20110921021133.GM15688@dastard> <4E7994D3.5020103@xxxxxxxxxxxx> <20110921114237.GP15688@dastard>
User-agent: Mutt/1.5.21 (2010-09-15)
On Wed, Sep 21, 2011 at 09:42:37PM +1000, Dave Chinner wrote:
> So, the log force not triggering in the AIL code looks to be the
> problem. That, I simply cannot explain right now - it makes no sense
> but that is what all the stats and trace events point to. I need to
> do more investigation.

Could it be that we have a huge amount of instances of xfs_ail_worker
running at the same time?  xfs_sync_wq is marked as WQ_CPU_INTENSIVE,
so running/runnable workers are not counted towards the concurrency
limit.  From my look at the workqueue code this means we'll spawn new
instances fairly quickly if the others are stuck.  This means more
and more of them hammering the pinned items, and we'll rarely reach
the limit where we'd need to do a log force.

What is also strange is that we allocate a xfs_ail_wq, but don't
actually use it, although it would have the same idea.  Stefan,
can you try the following patch?  This moves the ail work to it's
explicit queue, and makes sure we never have the same work item
(= same fs to be pushed) concurrently.

Note that before Linux 3.1-rc you'll need to edit fs/xfs/xfs_super.c
to be fs/xfs/linux-2.6/xfs_super.c in the patch manually.

Index: linux-2.6/fs/xfs/xfs_super.c
--- linux-2.6.orig/fs/xfs/xfs_super.c   2011-09-21 08:00:01.864768359 -0400
+++ linux-2.6/fs/xfs/xfs_super.c        2011-09-21 08:04:01.335266079 -0400
@@ -1654,7 +1654,7 @@ xfs_init_workqueues(void)
        if (!xfs_syncd_wq)
                goto out;
-       xfs_ail_wq = alloc_workqueue("xfsail", WQ_CPU_INTENSIVE, 8);
+       xfs_ail_wq = alloc_workqueue("xfsail", WQ_NON_REENTRANT, 8);
        if (!xfs_ail_wq)
                goto out_destroy_syncd;
Index: linux-2.6/fs/xfs/xfs_trans_ail.c
--- linux-2.6.orig/fs/xfs/xfs_trans_ail.c       2011-09-21 08:02:28.172765827 
+++ linux-2.6/fs/xfs/xfs_trans_ail.c    2011-09-21 08:02:46.843266108 -0400
@@ -538,7 +538,7 @@ out_done:
        /* There is more to do, requeue us.  */
-       queue_delayed_work(xfs_syncd_wq, &ailp->xa_work,
+       queue_delayed_work(xfs_ail_wq, &ailp->xa_work,
@@ -575,7 +575,7 @@ xfs_ail_push(
        xfs_trans_ail_copy_lsn(ailp, &ailp->xa_target, &threshold_lsn);
        if (!test_and_set_bit(XFS_AIL_PUSHING_BIT, &ailp->xa_flags))
-               queue_delayed_work(xfs_syncd_wq, &ailp->xa_work, 0);
+               queue_delayed_work(xfs_ail_wq, &ailp->xa_work, 0);

<Prev in Thread] Current Thread [Next in Thread>