[PATCH] xfs: log mount failures don't wait for buffers to be released
Brian Foster
bfoster at redhat.com
Tue Jan 12 08:15:41 CST 2016
On Tue, Jan 12, 2016 at 12:35:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner at redhat.com>
>
> Recently I've been seeing xfs/051 fail on 1k block size filesystems.
> Trying to trace the events during the test lead to the problem going
> away, indicating that it was a race condition that lead to this
> ASSERT failure:
>
> XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156
> .....
> [<ffffffff814e1257>] xfs_free_perag+0x87/0xb0
> [<ffffffff814e21b9>] xfs_mountfs+0x4d9/0x900
> [<ffffffff814e5dff>] xfs_fs_fill_super+0x3bf/0x4d0
> [<ffffffff811d8800>] mount_bdev+0x180/0x1b0
> [<ffffffff814e3ff5>] xfs_fs_mount+0x15/0x20
> [<ffffffff811d90a8>] mount_fs+0x38/0x170
> [<ffffffff811f4347>] vfs_kern_mount+0x67/0x120
> [<ffffffff811f7018>] do_mount+0x218/0xd60
> [<ffffffff811f7e5b>] SyS_mount+0x8b/0xd0
>
> When I finally caught it with tracing enabled, I saw that AG 2 had
> an elevated reference count and a buffer was responsible for it. I
> tracked down the specific buffer, and found that it was missing the
> final reference count release that would put it back on the LRU and
> hence be found by xfs_wait_buftarg() calls in the log mount failure
> handling.
>
> The last four traces for the buffer before the assert were (trimmed
> for relevance)
>
> kworker/0:1-5259 xfs_buf_iodone: hold 2 lock 0 flags ASYNC
> kworker/0:1-5259 xfs_buf_ioerror: hold 2 lock 0 error -5
> mount-7163 xfs_buf_lock_done: hold 2 lock 0 flags ASYNC
> mount-7163 xfs_buf_unlock: hold 2 lock 1 flags ASYNC
>
> This is an async write that is completing, so there's nobody waiting
> for it directly. Hence we call xfs_buf_relse() once all the
> processing is complete. That does:
>
> static inline void xfs_buf_relse(xfs_buf_t *bp)
> {
> xfs_buf_unlock(bp);
> xfs_buf_rele(bp);
> }
>
> Now, it's clear that mount is waiting on the buffer lock, and that
> it has been released by xfs_buf_relse() and gained by mount. This is
> expected, because at this point the mount process is in
> xfs_buf_delwri_submit() waiting for all the IO it submitted to
> complete.
>
> The mount process, however, is waiting on the lock for the buffer
> because it is in xfs_buf_delwri_submit(). This waits for IO
> completion, but it doesn't wait for the buffer reference owned by
> the IO to go away. The mount process collects all the completions,
> fails the log recovery, and the higher level code then calls
> xfs_wait_buftarg() to free all the remaining buffers in the
> filesystem.
>
> The issue is that on unlocking the buffer, the scheduler has decided
> that the mount process has higher priority than the the kworker
> thread that is running the IO completion, and so immediately
> switched contexts to the mount process from the semaphore unlock
> code, hence preventing the kworker thread from finishing the IO
> completion and releasing the IO reference to the buffer.
>
> Hence by the time that xfs_wait_buftarg() is run, the buffer still
> has an active reference and so isn't on the LRU list that the
> function walks to free the remaining buffers. Hence we miss that
> buffer and continue onwards to tear down the mount structures,
> at which time we get find a stray reference count on the perag
> structure. On a non-debug kernel, this will be ignored and the
> structure torn down and freed. Hence when the kworker thread is then
> rescheduled and the buffer released and freed, it will access a
> freed perag structure.
>
> The problem here is that when the log mount fails, we still need to
> quiesce the log to ensure that the IO workqueues have returned to
> idle before we run xfs_wait_buftarg(). By synchronising the
> workqueues, we ensure that all IO completions are fully processed,
> not just to the point where buffers have been unlocked. This ensures
> we don't end up in the situation above.
>
> Signed-off-by: Dave Chinner <dchinner at redhat.com>
> ---
Reviewed-by: Brian Foster <bfoster at redhat.com>
> fs/xfs/xfs_buf.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 04d2bdd..08ad701 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1529,6 +1529,16 @@ xfs_wait_buftarg(
> LIST_HEAD(dispose);
> int loop = 0;
>
> + /*
> + * We need to flush the buffer workqueue to ensure that all IO
> + * completion processing is 100% done. Just waiting on buffer locks is
> + * not sufficient for async IO as the reference count held over IO is
> + * not released until after the buffer lock is dropped. Hence we need to
> + * ensure here that all reference counts have been dropped before we
> + * start walking the LRU list.
> + */
> + drain_workqueue(btp->bt_mount->m_buf_workqueue);
> +
> /* loop until there is nothing left on the lru list. */
> while (list_lru_count(&btp->bt_lru)) {
> list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
> --
> 2.5.0
>
> _______________________________________________
> xfs mailing list
> xfs at oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
More information about the xfs
mailing list