xfs
[Top] [All Lists]

Re: deadlock with latest xfs

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: deadlock with latest xfs
From: Timothy Shimmin <tes@xxxxxxx>
Date: Mon, 27 Oct 2008 13:30:58 +1100
Cc: Lachlan McIlroy <lachlan@xxxxxxx>, Christoph Hellwig <hch@xxxxxxxxxxxxx>, xfs-oss <xfs@xxxxxxxxxxx>
In-reply-to: <20081026223940.GN18495@disturbed>
References: <4900412A.2050802@xxxxxxx> <20081023205727.GA28490@xxxxxxxxxxxxx> <49013C47.4090601@xxxxxxx> <20081026223940.GN18495@disturbed>
User-agent: Thunderbird 2.0.0.17 (Macintosh/20080914)
Dave Chinner wrote:
> Ok, I think I've found the regression - it's introduced by the AIL
> cursor modifications. The patch below has been running for 15
> minutes now on my UML box that would have hung in a couple of
> minutes otherwise.
> 
> FYI, the way I found this was:
> 
>       - put a breakpoint on xfs_create() once the fs hung
>       - `touch /mnt/xfs2/fred` to trigger the break point.
>       - look at:
>               - mp->m_ail->xa_target
>               - mp->m_ail->xa_ail.next->li_lsn
>               - mp->m_log->l_tail_lsn
>         which indicated the push target was way ahead the
>         tail of the log, so AIL pushing was obviously not
>         happening otherwise we'd be making progress.
>       - added breakpoint on xfsaild_push() and continued
>       - xfsaild_push() bp triggered, looked at *last_lsn
>         and found it way behind the tail of the log (like
>         3 cycle behind), which meant that would return
>         NULL instead of the first object and AIL pushing
>         would abort. Confirmed with single stepping.
> 
> Cheers,
> 
> Dave.
> XFS: correctly select first log item to push
> 
> Under heavy metadata load we are seeing log hangs. The
> AIL has items in it ready to be pushed, and they are within
> the push target window. However, we are not pushing them
> when the last pushed LSN is less than the LSN of the
> first log item on the AIL. This is a regression introduced
> by the AIL push cursor modifications.
> ---
>  fs/xfs/xfs_trans_ail.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 67ee466..2d47f10 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -228,7 +228,7 @@ xfs_trans_ail_cursor_first(
>  
>       list_for_each_entry(lip, &ailp->xa_ail, li_ail) {
>               if (XFS_LSN_CMP(lip->li_lsn, lsn) >= 0)
> -                     break;
> +                     goto out;
>       }
>       lip = NULL;
>  out:

Yeah, the fix looks good. The previous code is pretty
obviously broken - a search which always returns NULL.

Which begs the question on the best way of testing this ail code.
I dunno - it would be nice for independent testing of data structures
but perhaps that is too ambitious.

OOC, so the call path for this code....
xfsaild -> xfsaild_push(ailp, &last_pushed_lsn)
           -> lip = xfs_trans_ail_cursor_first(ailp, cur, *last_lsn)
Initially, last_lsn = 0 in xfsaild
but it will be updated via last_pushed_lsn.
So it looks like things will work initially when lsn==0, because
xfs_trans_ail_cursor_first special cases that and uses the min.
But as soon as the lsn is set to non-zero,
xfs_trans_ail_cursor_first will return NULL,
and xfsaild_push will return early.

--Tim

<Prev in Thread] Current Thread [Next in Thread>