On Thu, Oct 17, 2013 at 10:54:29AM -0500, Eric Sandeen wrote:
> On 10/14/13 5:17 PM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > Recent analysis of a deadlocked XFS filesystem from a kernel
> > crash dump indicated that the filesystem was stuck waiting for log
> > space. The short story of the hang on the RHEL6 kernel is this:
> > 	- the tail of the log is pinned by an inode
> > 	- the inode has been pushed by the xfsaild
> > 	- the inode has been flushed to its backing buffer and is
> > 	  currently flush locked and hence waiting for backing
> > 	  buffer IO to complete and remove it from the AIL
> > 	- the backing buffer is marked for write - it is on the
> > 	  delayed write queue
> > 	- the inode buffer has been modified directly and logged
> > 	  recently due to unlinked inode list modification
> > 	- the backing buffer is pinned in memory as it is in the
> > 	  active CIL context.
> > 	- the xfsbufd won't start buffer writeback because it is
> > 	  pinned
> > 	- xfssyncd won't force the log because it sees the log as
> > 	  needing to be covered and hence wants to issue a dummy
> > 	  transaction to move the log covering state machine along.
> > Hence there is no trigger to force the CIL to the log and hence
> > unpin the inode buffer and therefore complete the inode IO, remove
> > it from the AIL and hence move the tail of the log along, allowing
> > transactions to start again.
> > Mainline kernels also have the same deadlock, though the signature
> > is slightly different - the inode buffer never reaches the delayed
> > write lists because xfs_buf_item_push() sees that it is pinned and
> > hence never adds it to the delayed write list that the xfsaild
> > flushes.
> > There are two possible solutions here. The first is to simply force
> > the log before trying to cover the log and so ensure that the CIL is
> > emptied before we try to reserve space for the dummy transaction in
> > the xfs_log_worker(). While this might work most of the time, it is
> > still racy and is no guarantee that we don't get stuck in
> > xfs_trans_reserve waiting for log space to come free. Hence it's not
> > the best way to solve the problem.
> > The second solution is to modify xfs_log_need_covered() to be aware
> > of the CIL. We should only attempt to cover the log if there
> > is no current activity in the log - covering the log is the process
> > of ensuring that the head and tail in the log on disk are identical
> > (i.e. the log is clean and at idle). Hence, by definition, if there
> > are items in the CIL then the log is not at idle and so we don't
> > need to attempt to cover it.
> > When we don't need to cover the log because it is active or idle, we
> > issue a log force from xfs_log_worker() - if the log is idle, then
> > this does nothing. However, if the log is active due to there being
> > items in the CIL, it will force the items in the CIL to the log and
> > unpin them.
> > In the case of the above deadlock scenario, instead of
> > xfs_log_worker() getting stuck in xfs_trans_reserve() attempting to
> > cover the log, it will instead force the log, thereby unpinning the
> > inode buffer, allowing IO to be issued and complete and hence
> > removing the inode that was pinning the tail of the log from the
> > AIL. At that point, everything will start moving along again. i.e.
> > the xfs_log_worker turns back into a watchdog that can alleviate
> > deadlocks based around pinned items that prevent the tail of the log
> > from being moved...
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> Reviewed-by: Eric Sandeen <sandeen@xxxxxxxxxx>
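
For anyone skimming the thread, the decision described in the second solution can be sketched as a small userspace toy model. This is not the patch itself and not kernel code: struct toy_log, its fields, and the toy_* functions are hypothetical names made up for illustration, and the real covering state machine is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy state; this is NOT the kernel's struct xlog. */
struct toy_log {
	bool		writable;	 /* filesystem can be written to */
	size_t		cil_items;	 /* items sitting in the CIL */
	bool		covering_needed; /* state machine wants a dummy transaction */
	size_t		pinned_buffers;	 /* buffers pinned by the CIL context */
	unsigned int	forces;		 /* log forces that did real work */
};

/* Only report that covering is needed when the CIL is empty. */
static bool toy_log_need_covered(const struct toy_log *log)
{
	assert(log != NULL);
	if (!log->writable)
		return false;
	if (log->cil_items > 0)
		return false;	/* log is active, so it is not idle */
	return log->covering_needed;
}

/* A log force: a no-op when idle, otherwise flush the CIL and unpin. */
static void toy_log_force(struct toy_log *log)
{
	assert(log != NULL);
	if (log->cil_items == 0)
		return;
	log->cil_items = 0;
	log->pinned_buffers = 0;
	log->forces++;
}

/* Periodic worker: cover the log if needed, otherwise force it.
 * Returns true when it chose to cover the log. */
static bool toy_log_worker(struct toy_log *log)
{
	if (toy_log_need_covered(log)) {
		/* The real code would reserve and commit a dummy
		 * transaction here; the toy just clears the flag. */
		log->covering_needed = false;
		return true;
	}
	toy_log_force(log);
	return false;
}
```

In this model, a log with one CIL item and one pinned buffer is forced on the first toy_log_worker() call rather than covered, which empties the CIL and drops the pin; only a later call, with the CIL empty, goes on to cover the log.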