Re: [RFC PATCH v3 2/2] xfs: fix xfsaild hang due to lost wake ups

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [RFC PATCH v3 2/2] xfs: fix xfsaild hang due to lost wake ups
From: Mark Tinguely <tinguely@xxxxxxx>
Date: Thu, 24 May 2012 09:38:08 -0500
Cc: Brian Foster <bfoster@xxxxxxxxxx>, xfs@xxxxxxxxxxx
In-reply-to: <20120523235314.GN25351@dastard>
References: <1337704714-50235-1-git-send-email-bfoster@xxxxxxxxxx> <1337704714-50235-3-git-send-email-bfoster@xxxxxxxxxx> <20120523005830.GL25351@dastard> <4FBD2306.8090000@xxxxxxxxxx> <4FBD2A33.8080403@xxxxxxx> <20120523235314.GN25351@dastard>
User-agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:9.0) Gecko/20120122 Thunderbird/9.0
On 05/23/12 18:53, Dave Chinner wrote:
On Wed, May 23, 2012 at 01:19:31PM -0500, Mark Tinguely wrote:
On 05/23/12 12:48, Brian Foster wrote:
On 05/22/2012 08:58 PM, Dave Chinner wrote:
snip

Finally, rather than calling wake_up_process() in the
xfs_ail_push*() functions, call wake_up(&ailp->xa_idle); There can
only be one thread sleeping on that (the xfsaild) so there is no
need to use the wake_up_all() variant...
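
For readers unfamiliar with the lost wake-up problem the suggestion above addresses, here is a minimal userspace sketch in Python. It is only an analogy, not kernel code and not the patch under discussion: the class `IdleWaiter` and its methods `push()` and `wait_for_work()` are hypothetical names, and `threading.Condition` stands in for a kernel wait queue. The point it illustrates is that the sleeper re-checks a recorded condition under a lock, so a wake-up that arrives before it goes to sleep is not lost.

```python
import threading


class IdleWaiter:
    """Userspace analogy of a single-sleeper wait queue (hypothetical names)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._work_pending = False  # the condition the sleeper re-checks

    def push(self):
        """Waker side: record the work first, then wake the sleeper."""
        with self._cond:
            self._work_pending = True
            # Only one thread ever sleeps here, so notify() is enough;
            # notify_all() is not needed.
            self._cond.notify()

    def wait_for_work(self, timeout=None):
        """Sleeper side: sleep only while no work has been recorded."""
        with self._cond:
            # wait_for() evaluates the predicate under the lock before
            # sleeping, so a push() that already ran is still observed.
            got = self._cond.wait_for(lambda: self._work_pending, timeout)
            if got:
                self._work_pending = False  # consume the pending work
            return got
```

A push that happens before the worker sleeps is still seen:

```python
w = IdleWaiter()
w.push()                     # wake-up arrives before the worker sleeps
print(w.wait_for_work(1.0))  # True
```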

FWIW, you might be able to do this without the idle wait queue and
just use wake_up_process() -


Hi Dave,

I have a working version of your suggested algorithm. It looks mostly the same
with the exception of a spin_unlock fix. I also have the version below, which
uses a wait_queue and which I plan to test overnight:

...

FYI: running test 273 in a loop will still cause the sync_worker to lock up
when it tries to allocate a dummy transaction.

PID: 29214  TASK: ffff8807e66404c0  CPU: 1   COMMAND: "kworker/1:15"
 #0 [ffff88081f551b60] __schedule at ffffffff814175d0
 #1 [ffff88081f551ca8] schedule at ffffffff81417944
 #2 [ffff88081f551cb8] xlog_grant_head_wait at ffffffffa055a6d5 [xfs]
 #3 [ffff88081f551d08] xlog_grant_head_check at ffffffffa055a856 [xfs]
 #4 [ffff88081f551d48] xfs_log_reserve at ffffffffa055a95f [xfs]
 #5 [ffff88081f551d88] xfs_trans_reserve at ffffffffa0557ee4 [xfs]
 #6 [ffff88081f551dd8] xfs_fs_log_dummy at ffffffffa050cf88 [xfs]
 #7 [ffff88081f551df8] xfs_sync_worker at ffffffffa0518454 [xfs]
 #8 [ffff88081f551e18] process_one_work at ffffffff810564ad
 #9 [ffff88081f551e68] worker_thread at ffffffff81059203
#10 [ffff88081f551ee8] kthread at ffffffff8105dd2e
#11 [ffff88081f551f48] kernel_thread_helper at ffffffff81421a64

I understand why the dummy transaction was added and I think we can
anticipate the hang before it happens and avoid it.

I don't think this hang has anything to do with the idle patches -
it is most likely related to the CIL stall we are chasing down.

Cheers,

Dave.

Correct, this problem is neither caused by nor fixable by the idle
patches. See the thread:

 Subject: Still seeing hangs in xlog_grant_log_space

Brian, the FYI is just a warning that your reproducer, running XFS
test 273 in a loop, is triggering dummy ticket allocation stalls in the
sync_worker. Most of the time they are quickly given space, but
eventually things will line up and XFS will lock up.

It took me over 200 iterations of test 273 to get the above lock up,
and yes, your v2 patches were applied, but that does not matter.

I did not want you to mistake a sync_worker lock up as being caused by
your code.

--Mark.
