xfs
[Top] [All Lists]

Re: [PATCH] xfs: fix bmbt block allocation failures

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH] xfs: fix bmbt block allocation failures
From: Lachlan McIlroy <lmcilroy@xxxxxxxxxx>
Date: Fri, 15 Apr 2011 03:29:02 -0400 (EDT)
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20110412065328.GG21395@dastard>
Reply-to: Lachlan McIlroy <lmcilroy@xxxxxxxxxx>
----- Original Message -----
> On Mon, Apr 11, 2011 at 08:55:33PM -0400, Lachlan McIlroy wrote:
> > ----- Original Message -----
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > >
> > > When a multilevel bmbt split occurs, we can be asked to allocate
> > > blocks from an AG that has no space left available. In the case of
> > > an extent just being allocated, the first bmbt block allocation
> > > sees
> > > the firstblock parameter is set and does not set a minleft
> > > parameter
> > > for the allocation. The allocation also does not set the total
> > > number of blocks required by the allocation, either.
> > >
> > > If the extent allocation used all the free space in the AG, the
> > > lack
> > > of a total allocation size results in a block being reserved for
> > > the
> > > AG freespace btree manipulations being used incorrectly.
> > >
> > > Secondly, the allocation is only allowed to be allocated in the
> > > current AG (NEAR_BNO) because the low space algorithm has not been
> > > activated. As a result of this, the second block allocation in the
> > > split fails to find sufficient space in the current AG and fails,
> > > resulting in a dirty transaction cancellation and a filesytem
> > > shutdown.
> > >
> > > There are two problems here - both pointed out by Lachlan McIlroy
> > > when we were first discussing a proposed fix for the problem (See
> > > http://oss.sgi.com/archives/xfs/2010-09/msg00438.html). Firstly,
> > > the missing args->total was causing the initial block to be
> > > allocated from the freespace reserve. Secondly that the NEAR_BNO
> > > allocation should probably be a START_BNO allocation to allow
> > > blocks
> > > to be taken from a higher numbered AG.
> > >
> > > With these changes, the allocation failure goes away, but we
> > > trigger
> > > asserts in xfs_bmapi() that try to ensure that bmbt block
> > > allocations only occur in the same AG as the extent allocated,
> > > except if the low space algorithm is active. In this case, we
> > > don't
> > > need to activate the low space algorithm - if a START_BNO
> > > allocation
> > > fails with minleft = 0, then there is no space we can allocate at
> > > all, and hence the low space algorithm is no help. Therefore the
> > > asserts need modification reflect the new state of affairs.
> > >
> > > This commit fixes the problem demonstrated by the test case in
> > > xfstests 250.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > ---
> > > fs/xfs/xfs_bmap.c | 8 ++------
> > > fs/xfs/xfs_bmap_btree.c | 7 ++-----
> > > 2 files changed, 4 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
> > > index fa00788..50eed97 100644
> > > --- a/fs/xfs/xfs_bmap.c
> > > +++ b/fs/xfs/xfs_bmap.c
> > > @@ -4917,13 +4917,9 @@ error0:
> > > if (cur) {
> > > if (!error) {
> > > ASSERT(*firstblock == NULLFSBLOCK ||
> > > - XFS_FSB_TO_AGNO(mp, *firstblock) ==
> > > + XFS_FSB_TO_AGNO(mp, *firstblock) <=
> > > XFS_FSB_TO_AGNO(mp,
> > > - cur->bc_private.b.firstblock) ||
> > > - (flist->xbf_low &&
> > > - XFS_FSB_TO_AGNO(mp, *firstblock) <
> > > - XFS_FSB_TO_AGNO(mp,
> > > - cur->bc_private.b.firstblock)));
> > > + cur->bc_private.b.firstblock));
> > > *firstblock = cur->bc_private.b.firstblock;
> > > }
> > > xfs_btree_del_cursor(cur,
> >
> > There's another ASSERT like this earlier in xfs_bmapi() after the
> > call
> > to xfs_bmap_alloc() which may need the same attention.
> 
> No, it checks the allocated extent against any existing firstblock
> and has nothing to do with the the bmbt allocations during extent
> insertion.
> >
> > > diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
> > > index 87d3c10..b2cdbfe 100644
> > > --- a/fs/xfs/xfs_bmap_btree.c
> > > +++ b/fs/xfs/xfs_bmap_btree.c
> > > @@ -518,10 +518,11 @@ xfs_bmbt_alloc_block(
> > > args.mp = cur->bc_mp;
> > > args.fsbno = cur->bc_private.b.firstblock;
> > > args.firstblock = args.fsbno;
> > > + args.total = 1;
> > > + args.type = XFS_ALLOCTYPE_START_BNO;
> > >
> > > if (args.fsbno == NULLFSBLOCK) {
> > > args.fsbno = be64_to_cpu(start->l);
> > > - args.type = XFS_ALLOCTYPE_START_BNO;
> > > /*
> > > * Make sure there is sufficient room left in the AG to
> > > * complete a full tree split for an extent insert. If
> > > @@ -534,10 +535,6 @@ xfs_bmbt_alloc_block(
> > > * block allocation here and corrupt the filesystem.
> > > */
> > > args.minleft = xfs_trans_get_block_res(args.tp);
> > > - } else if (cur->bc_private.b.flist->xbf_low) {
> > > - args.type = XFS_ALLOCTYPE_START_BNO;
> > > - } else {
> > > - args.type = XFS_ALLOCTYPE_NEAR_BNO;
> > > }
> > >
> > > args.minlen = args.maxlen = args.prod = 1;
> >
> > I think that's exactly what I had in mind when we looked at this
> > last
> > time but now I'm not so sure. I think I understand now why there is
> > a
> > requirement that all subsequent allocations must (read 'can') come
> > from
> > the same AG.
> >
> > The first time we try to allocate blocks we look for an AG with all
> > the
> > space we need for the reservation to succeed.
> 
> Of course, that explains the near bno allocation them. It's
> basically saying "we only ever have a reservation for a freespace
> btree split in a single AG". Ok, so it seems we need to retain that,
> and the problem of not being able to allocate in the current AG is
> elsewhere.
> 
> Thanks for pointing that out to me, Lachlan - I should have realised
> that is what it was for in the first place.
> 
> > If we don't find a single
> > AG that will satisfy the full request then we use the low space
> > algorithm
> > and start from AG 0. But if we do find such an AG with all the space
> > we
> > need it should be fine to use NEAR_BNO on subsequent allocations
> > because
> > the space is there, right?
> 
> But for the case we are dealing with here - bmbt insertion after an
> allocation - we already have firstblock set and hence never traverse
> the args.fsbno == NULLFSBLOCK case as the cursor firstblock is
> always set to the block that has already been allocated. Hence we
> never hit the low space algorithm unless it was triggered by the
> extent allocation.
> 
> > If this logic fails to hold true then there are other problems we
> > need to
> > deal with including;
> >
> > If we've already allocated a block from AG N then all subsequent
> > allocations must come from AGs >= N to prevent locking issues. If
> > the
> > blocks we need are in AGs < N we're in trouble.
> 
> Agreed - this is what minleft is trying to avoid from happening
> ahead of the first allocations.
> 
> > We need to fall back to
> > the low space algorithm and start from AG 0 but we can't do this if
> > we
> > have already allocated from AG N.
> 
> Right, even the low space algorithm cannot avoid the need to avoid
> lock inversions in the AGs.
> 
> > Also if we pass START_BNO to xfs_alloc_vextent() then it will fall
> > through to the ANY/START/FIRST_AG case and set flags to TRYLOCK. It
> 
> That's not a problem.
> 
> > will then scan AGs from N to the last AG and skip any that are
> > already
> > locked. It may skip AGs that have the space we need and we may get
> > to
> > the last AG without finding all required blocks.
> 
> Sure, but....
> 
> > Then xfs_alloc_vextent() will loop through the AGs a second time if
> > we
> > didn't get all we needed the first time but this time without the
> > TRYLOCK.
> > This could result in locking AGs out of order.
> 
> It loops the second time from the start AG, not AG zero so we should
> not be locking AGs out of order.
> 
> > So why is the extent allocation using all of the remaining blocks in
> > the
> > AG and not leaving any for the split?
> 
> That's the big question, isn't it? I'll come back to this.
> 
> > Are we not reserving the split
> > space with the conversion transaction?
> 
> I don't think this is relevant - the problem is allocation, not
> extent conversion.
> 
> > Is it because the space needed
> > for splits is not accounted for when delayed allocations are
> > reserved so
> > we oversubscribe ourselves?
> 
> This is preallocation into a hole, so delalloc is not involved
> here.
> 
> > Or are we not activating the low space
> > algorithm early enough?
> 
> There's plenty of free space, just not in the AG _after_ the extent
> allocation, and so there's no reason to trigger the low space
> algorithm.
> 
> > I need to look at the code a bit more to understand all this. The
> > same
> > allocation logic appears to be used in xfs_bmap_extents_to_btree(),
> > xfs_bmap_local_to_extents() and xfs_bmap_btalloc() so maybe there's
> > some
> > logic in the logic.
> 
> I think they are all fine, now that I think I understand the reason
> for the logic - that we only reserve enough blocks in the
> transaction for freespace tree splits in a single AG. New blocks in
> the bmbt can be triggered when:
> 
> 1. extent conversion splits a record into two (or three)
> records.
> 2. extent removal split a record into two (punch a hole in
> the middle of an extent)
> 3. extent allocation inserts a new record
> 4. converting from extent to btree format
> 
> Note that the local to extent change is actually a data extent
> allocation (same as xfs_bmap_btalloc), not a bmbt allocation.
> 
> So, in the cases of #1 and #2, it is unlikely that there has been a
> previous allocation in the transaction, and so we take the first
> case of args.fsbno == NULLFSBLOCK to chose an AG with enough space
> for an entire bmbt split. Otherwise, it's the same case as #3 and
> #4.
> 
> In the case of #3 and #4, the reason for the bmbt allocation is that
> we have allocated a new extent, and as a result firstblock _will_ be
> set. Hence we should have either space in the AG available for the
> bmbt split _or_ the low space algorithm is active.
> 
> This problem case is that for #3/4 we can enter xfs_bmbt_alloc_block
> with the initial condition of firstblock set, not enough space in
> the current AG and the low space algorithm not active. For the test
> case (250), we still have lots of free space in the filesystem so we
> should not be activating the low space algorithm. That leaves the
> fact we are not leaving enough space in the AG for the bmbt split
> after the extent is allocated. AFAICT, this code in xfs_bmapi()
> handles it:
> 
> 4452 if (wr && *firstblock == NULLFSBLOCK) {
> 4453 if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
> 4454 minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
> 4455 else
> 4456 minleft = 1;
> 4457 } else
> 4458 minleft = 0;
> 
> So it seems to me that in all cases the setting of minleft is
> correct, as are the fallbacks that clear minleft then set xbf_low in
> xfs_bmap_btalloc(). So I don't think the bug actually resides at the
> xfs-bmapi/bmbt level. The question I'm now trying to answer is how
> we can allocate the entire AG if minleft is actually set.
> 
> The answer, it appears, is in xfs_alloc_fix_minleft(). If we've
> allocated the largest extent possible, we'll have a situation where
> the only free space remaining is on the freelist. To use numbers
> relevant to the test case:
> 
> agsize = 4096 blocks (16MB)
> max ext len = 4096 - 8 blocks
> = 4088 blocks
> AGF freeblks = 4088
> AGF flcount = 4
> minleft = 2;
> 
> extent allocation length = 4088
> 
> diff = 4088 + 4 - 4088 - 2
> = 2
> 
> And because diff >= 0, xfs_alloc_fix_minleft() says that the extent
> length is OK and does not trim it. Then when we go to allocate bmbt
> blocks, args->total is not set so the first block is allocated from
> the free list, and the second then fails at the
> xfs_alloc_fix_freelist checks because we don't have space available
> in the AG....
> 
> IOWs, we've set minleft appropriately, but xfs_alloc_fix_minleft()
> assumes that the AGFL blocks are part of the space available for
> the allocation. This, it seems, is wrong. If minleft is set, then we
> are not allowing the allocation to dip into any reserves, so we
> should not be considering the freelist blocks as available for
> minleft reservations.
> 
> So, the patch below fixes the test 250 assert failure as well, and
> to me seems much more likely as the root cause of the bug.
> 
> What do you think, Lachlan?

I think you're onto something Dave and it makes a lot of sense.  It
would appear that the checks in xfs_alloc_fix_freelist() really only
apply when allocating extents and maybe should be skipped when
allocating the minleft blocks.  Oh, unless the low space algorithm is
being used.  I think args.total represents the full extent allocation
request and doesn't apply to minleft allocations so I don't think we
need that change to xfs_bmbt_alloc_block().  Is just the fix to
xfs_alloc_fix_minleft() enough here?

> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
> 
> 
> ---
> fs/xfs/xfs_alloc.c | 1 -
> fs/xfs/xfs_bmap_btree.c | 1 +
> 2 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
> index 27d64d7..8946464 100644
> --- a/fs/xfs/xfs_alloc.c
> +++ b/fs/xfs/xfs_alloc.c
> @@ -280,7 +280,6 @@ xfs_alloc_fix_minleft(
> return 1;
> agf = XFS_BUF_TO_AGF(args->agbp);
> diff = be32_to_cpu(agf->agf_freeblks)
> - + be32_to_cpu(agf->agf_flcount)
> - args->len - args->minleft;
> if (diff >= 0)
> return 1;
> diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
> index 87d3c10..470047e 100644
> --- a/fs/xfs/xfs_bmap_btree.c
> +++ b/fs/xfs/xfs_bmap_btree.c
> @@ -518,6 +518,7 @@ xfs_bmbt_alloc_block(
> args.mp = cur->bc_mp;
> args.fsbno = cur->bc_private.b.firstblock;
> args.firstblock = args.fsbno;
> + args.total = 1;
> 
> if (args.fsbno == NULLFSBLOCK) {
> args.fsbno = be64_to_cpu(start->l);
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

<Prev in Thread] Current Thread [Next in Thread>