xfs
[Top] [All Lists]

Re: [PATCH 2/2] repair: AGFL rebuild fails if btree split required

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH 2/2] repair: AGFL rebuild fails if btree split required
From: Brian Foster <bfoster@xxxxxxxxxx>
Date: Wed, 29 Oct 2014 15:26:10 -0400
Cc: xfs@xxxxxxxxxxx, bvowk@xxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <1414552144-12627-3-git-send-email-david@xxxxxxxxxxxxx>
References: <1414552144-12627-1-git-send-email-david@xxxxxxxxxxxxx> <1414552144-12627-3-git-send-email-david@xxxxxxxxxxxxx>
User-agent: Mutt/1.5.23 (2014-03-12)
On Wed, Oct 29, 2014 at 02:09:04PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> In phase 5 we rebuild the freelist btrees, the AGF and the AGFL from
> all the free space information we have gathered and resolved during
> pahses 3 and 4. If the freespace information is laid out just right,
> we end up having to allocate free space for the AGFL blocks.
> 
> If the size of the free space we allocate from is larger than the
> space we need, then we have to insert the remainder back into the
> freespace btree. For the by-size tree, this means we are likely to
> be removing a record from one leaf, and then inserting the remainder
> - a smaller size - into another leaf.
> 
> The issue is that the leaf blocks to the left of the original leaf
> block we removed the extent record from are full and hence require a
> split to insert the new record. That, of course, requires a free
> block in the AGFL to allocate from, and now we have a chicken and
> egg situation: there are no free blocks in the AGFL because we are
> setting it up.
> 
> As a result, setting up the free list silently fails, leaving the
> freespace btrees in an inconsistent state and the AGFL in question
> empty. When the filesystem is next mounted, the first allocation
> from that AGF results in attempting to fix the AGFL, and it then
> does exactly the same thing as the repair code, fails to allocate a
> block during the split and fails. This results in an immediate
> shutdown because the transaction doing the allocation is dirty by
> this stage.
> 
> The fix for the problem is to make repair handle rebulding the btree
> differently. If we leave ispace for a couple of records in each
> btree leaf and node, there is never a need for a split to occur when
> initially setting up the AGFL. This results in repair doing the
> right thing, and hence the runtime problems after mount don't occur.
> Further, add error checking the the AGFL setup code and abort repair
> if we have a failure to correctly set up the AGFL so we catch this
> problem at repair time, not mount time...
> 

Interesting problem, thanks for the breakdown. I suppose the only
interesting side effect is that the alloc btrees might require a bit
more space after repair than before. I wonder if we need to take that
into consideration here for certain cases. E.g., does this have
ramifications for running repair on a clean, but completely full fs, or
should that be generally handled by the existence of reserved blocks?

A couple minor nits below...

> Reported-by: Barkley Vowk <bvowk@xxxxxxx>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  repair/phase5.c | 44 +++++++++++++++++++++++++++++++++++---------
>  1 file changed, 35 insertions(+), 9 deletions(-)
> 
> diff --git a/repair/phase5.c b/repair/phase5.c
> index d6d3c6d..3d58936 100644
> --- a/repair/phase5.c
> +++ b/repair/phase5.c
> @@ -335,11 +335,22 @@ finish_cursor(bt_status_t *curs)
>  }
>  
>  /*
> + * We need to leave some free records in the tree for the corner case of
> + * setting up the AGFL. This may require allocation of blocks, and as
> + * such can require insertion of new records into the tree (e.g. moving
> + * a record in the by-count tree when a long extent is shortened). If we
> + * pack the records into the leaves with no slack space, this requires a
> + * leaf split to occur and a block to be allocated from the free list.
> + * If we don't have any blocks on the free list (because we are setting
> + * it up!), then we fail, and the filesystem will fail with the same
> + * failure at runtime. Hence leave a couple of records slack space in
> + * each block to allow immediate modification of the tree without
> + * requiring splits to be done.
> + *
>   * XXX(hch): any reason we don't just look at mp->m_alloc_mxr?

I guess we could kill this comment now. ;)

>   */
>  #define XR_ALLOC_BLOCK_MAXRECS(mp, level) \
> -                     xfs_allocbt_maxrecs((mp), (mp)->m_sb.sb_blocksize, \
> -                                             (level) == 0)
> +     (xfs_allocbt_maxrecs((mp), (mp)->m_sb.sb_blocksize, (level) == 0) - 2)
>  
>  /*
>   * this calculates a freespace cursor for an ag.
> @@ -361,10 +372,6 @@ calculate_freespace_cursor(xfs_mount_t *mp, 
> xfs_agnumber_t agno,
>       bt_stat_level_t         *p_lptr;
>       extent_tree_node_t      *ext_ptr;
>       int                     level;
> -#ifdef XR_BLD_FREE_TRACE
> -     fprintf(stderr,
> -             "in init_freespace_cursor, agno = %d\n", agno);
> -#endif
>  
>       num_extents = *extents;
>       extents_used = 0;
> @@ -385,6 +392,13 @@ calculate_freespace_cursor(xfs_mount_t *mp, 
> xfs_agnumber_t agno,
>       lptr->num_recs_tot = num_extents;
>       level = 1;
>  
> +#ifdef XR_BLD_FREE_TRACE
> +     fprintf(stderr, "%s 0 %d %d %d %d\n", __func__,

What's the 0 for?

Brian

> +                     lptr->num_blocks,
> +                     lptr->num_recs_pb,
> +                     lptr->modulo,
> +                     lptr->num_recs_tot);
> +#endif
>       /*
>        * if we need more levels, set them up.  # of records
>        * per level is the # of blocks in the level below it
> @@ -402,6 +416,14 @@ calculate_freespace_cursor(xfs_mount_t *mp, 
> xfs_agnumber_t agno,
>                       lptr->num_recs_pb = p_lptr->num_blocks
>                                       / lptr->num_blocks;
>                       lptr->num_recs_tot = p_lptr->num_blocks;
> +#ifdef XR_BLD_FREE_TRACE
> +                     fprintf(stderr, "%s %d %d %d %d %d\n", __func__,
> +                                     level,
> +                                     lptr->num_blocks,
> +                                     lptr->num_recs_pb,
> +                                     lptr->modulo,
> +                                     lptr->num_recs_tot);
> +#endif
>               }
>       }
>  
> @@ -546,8 +568,7 @@ calculate_freespace_cursor(xfs_mount_t *mp, 
> xfs_agnumber_t agno,
>                               lptr = &btree_curs->level[level];
>                               p_lptr = &btree_curs->level[level-1];
>                               lptr->num_blocks = howmany(p_lptr->num_blocks,
> -                                             XR_ALLOC_BLOCK_MAXRECS(mp,
> -                                                             level));
> +                                     XR_ALLOC_BLOCK_MAXRECS(mp, level));
>                               lptr->modulo = p_lptr->num_blocks
>                                               % lptr->num_blocks;
>                               lptr->num_recs_pb = p_lptr->num_blocks
> @@ -1434,6 +1455,7 @@ build_agf_agfl(xfs_mount_t      *mp,
>               xfs_alloc_arg_t args;
>               xfs_trans_t     *tp;
>               struct xfs_trans_res tres = {0};
> +             int             error;
>  
>               memset(&args, 0, sizeof(args));
>               args.tp = tp = libxfs_trans_alloc(mp, 0);
> @@ -1442,8 +1464,12 @@ build_agf_agfl(xfs_mount_t     *mp,
>               args.alignment = 1;
>               args.pag = xfs_perag_get(mp,agno);
>               libxfs_trans_reserve(tp, &tres, XFS_MIN_FREELIST(agf, mp), 0);
> -             libxfs_alloc_fix_freelist(&args, 0);
> +             error = libxfs_alloc_fix_freelist(&args, 0);
>               xfs_perag_put(args.pag);
> +             if (error) {
> +                     do_error(_("failed to fix AGFL on AG %d, error %d\n"),
> +                                     agno, error);
> +             }
>               libxfs_trans_commit(tp, 0);
>       }
>  
> -- 
> 2.0.0
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

<Prev in Thread] Current Thread [Next in Thread>