To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: XFS hung task in xfs_ail_push_all_sync() when unmounting FS after disk failure/recovery
From: Alex Lyakas <alex@xxxxxxxxxxxxxxxxx>
Date: Sun, 10 Apr 2016 21:40:29 +0300
Cc: Shyam Kaushik <shyam@xxxxxxxxxxxxxxxxx>, Brian Foster <bfoster@xxxxxxxxxx>, xfs@xxxxxxxxxxx
In-reply-to: <20160408224648.GD567@dastard>
References: <f049419a2ab10f8e3c4fef0e4f4ca1ba@xxxxxxxxxxxxxx> <20160322121922.GA53693@xxxxxxxxxxxxxxx> <232dd85fde11d4ef1625f070eabfd167@xxxxxxxxxxxxxx> <20160408224648.GD567@dastard>
Hello Dave,

On Sat, Apr 9, 2016 at 1:46 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Fri, Apr 08, 2016 at 04:21:02PM +0530, Shyam Kaushik wrote:
>> Hi Dave, Brian, Carlos,
>>
>> While trying to reproduce this issue I have been running into different
>> issues that are similar. Underlying issue remains the same when backend to
>> XFS is failed & we unmount XFS, we run into hung-task timeout (180-secs)
>> with stack like
>>
>> kernel: [14952.671131]  [<ffffffffc06a5f59>] xfs_ail_push_all_sync+0xa9/0xe0 [xfs]
>> kernel: [14952.671139]  [<ffffffff810b26b0>] ? prepare_to_wait_event+0x110/0x110
>> kernel: [14952.671181]  [<ffffffffc0690111>] xfs_unmountfs+0x61/0x1a0 [xfs]
>>
>> while running trace-events, XFS ail push keeps looping around
>>
>>    xfsaild/dm-10-21143 [001] ...2 17878.555133: xfs_ilock_nowait: dev 253:10 ino 0x0 flags ILOCK_SHARED caller xfs_inode_item_push [xfs]
>
> Looks like either a stale inode (which should never reach the AIL)
> or it's an inode that's been reclaimed and this is a use after free
> situation. Given that we are failing IOs here, I'd suggest it's more
> likely to be an IO failure that's caused a writeback problem, not an
> interaction with stale inodes.
>
> So, look at xfs_iflush. If an IO fails, it is supposed to unlock the
> inode by calling xfs_iflush_abort(), which will also remove it from
> the AIL. This can also happen on reclaim of a dirty inode, and if so
> we'll still reclaim the inode because reclaim assumes xfs_iflush()
> cleans up properly.
>
> Which, apparently, it doesn't:
>
>         /*
>          * Get the buffer containing the on-disk inode.
>          */
>         error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &bp,
>                                XBF_TRYLOCK, 0);
>         if (error || !bp) {
>                 xfs_ifunlock(ip);
>                 return error;
>         }
>
> This looks like a bug - xfs_iflush hasn't aborted the inode
> writeback on failure - it's just unlocked the flush lock. Hence it
> has left the inode dirty in the AIL, and then the inode has probably
> then been reclaimed, setting the inode number to zero.
In our case, we do not reach this call, because XFS is already marked
as shut down. We take this path instead:
    /*
     * This may have been unpinned because the filesystem is shutting
     * down forcibly. If that's the case we must not write this inode
     * to disk, because the log record didn't make it to disk.
     *
     * We also have to remove the log item from the AIL in this case,
     * as we wait for an empty AIL as part of the unmount process.
     */
    if (XFS_FORCED_SHUTDOWN(mp)) {
        error = -EIO;
        goto abort_out;
    }

So we call xfs_iflush_abort(), but because "iip" is NULL (as Shyam
mentioned earlier in this thread), we proceed directly to
xfs_ifunlock(ip), which leaves us in the same situation that you
described above: the flush lock is released, but nothing is removed
from the AIL.
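
To make the suspected sequence concrete, here is a small user-space
model of it. This is only a sketch of our understanding, not the kernel
code: every name prefixed with "model_" is made up for illustration,
and the AIL is reduced to a plain item counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the AIL: only counts tracked items. */
struct model_ail {
    int nitems;
};

/* Hypothetical stand-in for the inode log item. */
struct model_log_item {
    bool in_ail;                    /* item is tracked by the AIL */
};

/* Hypothetical stand-in for the in-core inode. */
struct model_inode {
    struct model_log_item *iip;     /* NULL in the case we observe */
    bool flush_locked;
};

/*
 * Model of the abort path as we understand it: the AIL removal is only
 * attempted when a log item is attached to the inode, while the flush
 * lock is dropped unconditionally.
 *
 * Returns true if an item was removed from the AIL.
 */
bool model_iflush_abort(struct model_ail *ail, struct model_inode *ip)
{
    bool removed = false;

    assert(ip->flush_locked);       /* caller must hold the flush lock */

    if (ip->iip != NULL && ip->iip->in_ail) {
        ip->iip->in_ail = false;    /* stands in for the AIL removal */
        ail->nitems--;
        removed = true;
    }

    ip->flush_locked = false;       /* stands in for xfs_ifunlock(ip) */
    return removed;
}
```

With iip set to NULL, the model drops the flush lock but leaves the AIL
item count unchanged, so anything that waits for an empty AIL would
wait forever. With a log item attached and marked as being in the AIL,
the count drops to zero as expected.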

The comment clearly says "We also have to remove the log item from the
AIL in this case, as we wait for an empty AIL as part of the unmount
process." Could you perhaps point us at the code that is supposed to
remove the log item from the AIL? Apparently this is not happening.

Thanks,
Alex.



>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
