[PATCH] [RFC] xfs: prevent bmap btree from using alloc btree reserved blocks
Dave Chinner
david at fromorbit.com
Sun Sep 19 19:58:07 CDT 2010
ping?
On Tue, Aug 24, 2010 at 11:45:33PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner at redhat.com>
>
> Recently we've had WANT_CORRUPTED_GOTO filesystem shutdowns reported
> on filesystems with large numbers of small AGs. RedHat QA found a
> simple test case at:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=626244
>
> Which was further refined to:
>
> # mkfs.xfs -f -d agsize=16m,size=50g <dev>
> # mount <dev> /mnt
> # xfs_io -f -c 'resvsp 0 40G' /mnt/foo
>
> The initial analysis is in the above bug. The fundamental problem is
> that the data extent allocation is using all the free blocks in the
> allocation group, and then the bmap btree block allocation is
> dipping into the reserved block pool that each AG holds for
> freespace btree manipulations. This results in failures when
> multiple bmap btree blocks are needed for a multilevel split of the
> bmap btree.
>
> The reason this is occurring is that when we do a btree block
> allocation after a data extent allocation we run down the path that
> does not set up the minleft allocation parameter. That is, we allow
> the btree block allocation to use up all the blocks in the current
> AG if that is what will make the current allocation succeed. This
> what we are supposed to only allow the low space allocation
> algorithm to do, not a normal allocation. The result is that we can
> allocate the first block from the AG freelist, and then the second
> block allocation in the split will fail in the same AG because we do
> not have a minleft parameter set and hence will not pass the checks
> in xfs_alloc_fix_freelist() and hence the allocation will fail.
> Further, because no minleft parameter is set, the btree block
> allocation will not retry the allocation with different parameters
> and (potentially) enable the low space algorithm.
>
> The result is that we fail a btree block allocation and hence fail
> the extent allocation with a dirty btree and transaction. The dirty
> btree causes the WANT_CORRUPTED_GOTO warning, and cancelling the
> dirty transaction triggers the shutdown.
>
> The fix appears to be to ensure that we set the minleft parameter
> correctly to reflect the potential number of btree blocks we still
> need to allocate from the same AG if we are doing a worst case
> split. By doing so, the particular case results in the first btree
> block allocation setting minleft to the number of blocks needed for
> a split. Hence the AG that we just allocated all the free blocks out
> of for the data extent will fail because there are not enough free
> blocks for all the blocks in the split in the AG. It will fail this
> without dirtying anything, and because minleft is now > 0, will
> trigger the retry algorithm.
>
> The fallback algorithm also needs a slight modification. It
> currently assumes that if minleft is set, no allocation has been
> done yet, so it can scan from AG 0 to find a free block. If it is
> left like this, it can trigger deadlocks from the new case as we
> have a prior allocation and hence cannot allocate from an AG less
> than the current AG. Hence it is modified to use a START_AG
> allocation to scan upwards from the current AG, hence avoiding the
> known AG locking deadlocks.
>
> As far as I can tell, the bmap btree code has never handled this
> case properly - I checked as far back as 1995 and minleft has never
> been set to avoid selecting an AG that cannot be allocated out of...
>
> Signed-off-by: Dave Chinner <dchinner at redhat.com>
> ---
> fs/xfs/xfs_bmap_btree.c | 51 ++++++++++++++++++++++++++++++++++++++--------
> 1 files changed, 42 insertions(+), 9 deletions(-)
>
> diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
> index 87d3c10..d5ef4e3 100644
> --- a/fs/xfs/xfs_bmap_btree.c
> +++ b/fs/xfs/xfs_bmap_btree.c
> @@ -538,6 +538,25 @@ xfs_bmbt_alloc_block(
> args.type = XFS_ALLOCTYPE_START_BNO;
> } else {
> args.type = XFS_ALLOCTYPE_NEAR_BNO;
> +
> + /*
> + * we've come in here because this is the second or subsequent
> + * btree block we need to allocate for the bmap btree
> + * modification. If we've just emptied the AG and there are
> + * only free list blocks left, we need to make sure that we
> + * take into account the minleft value that was reserved on the
> + * first allocation through here (the NULLFSBLOCK branch
> + * above). In that case, minleft will already take into account
> + * the maximum number of blocks needed for a btree split, and
> + * the number of blocks already allocated is recorded in the
> + * cursor. From that, we can work out exactly how much smaller
> + * the minleft should be so that we don't select an AG that
> + * does not have enough blocks available to continue the entire
> + * btree split.
> + */
> + args.minleft = XFS_BM_MAXLEVELS(args.mp,
> + (int)cur->bc_private.b.whichfork) - 1 -
> + cur->bc_private.b.allocated;
> }
>
> args.minlen = args.maxlen = args.prod = 1;
> @@ -550,15 +569,29 @@ xfs_bmbt_alloc_block(
> if (error)
> goto error0;
>
> - if (args.fsbno == NULLFSBLOCK && args.minleft) {
> - /*
> - * Could not find an AG with enough free space to satisfy
> - * a full btree split. Try again without minleft and if
> - * successful activate the lowspace algorithm.
> - */
> - args.fsbno = 0;
> - args.type = XFS_ALLOCTYPE_FIRST_AG;
> - args.minleft = 0;
> + while (args.fsbno == NULLFSBLOCK && args.minleft) {
> + if (cur->bc_private.b.firstblock == NULLFSBLOCK) {
> + /*
> + * Could not find an AG with enough free space to satisfy
> + * a full btree split. Try again without minleft and if
> + * successful activate the lowspace algorithm.
> + */
> + args.type = XFS_ALLOCTYPE_FIRST_AG;
> + args.fsbno = 0;
> + args.minleft = 0;
> + } else {
> + /*
> + * Failed to find enough space for a btree block after
> + * a extent allocation has already occurred. Continue
> + * searching other AGs that can hold the remaining
> + * blocks. If we fail with minleft set, then clear it
> + * and try again.
> + */
> + args.type = XFS_ALLOCTYPE_START_AG;
> + args.fsbno = cur->bc_private.b.firstblock;
> + if (cur->bc_private.b.flist->xbf_low)
> + args.minleft = 0;
> + }
> error = xfs_alloc_vextent(&args);
> if (error)
> goto error0;
> --
> 1.7.1
>
> _______________________________________________
> xfs mailing list
> xfs at oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
>
--
Dave Chinner
david at fromorbit.com
More information about the xfs
mailing list