xfs
[Top] [All Lists]

Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reser

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH v3 04/11] xfs: update inode allocation/free transaction reservations for finobt
From: Brian Foster <bfoster@xxxxxxxxxx>
Date: Tue, 18 Feb 2014 12:10:16 -0500
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20140211064609.GE13647@dastard>
References: <1391536182-9048-1-git-send-email-bfoster@xxxxxxxxxx> <1391536182-9048-5-git-send-email-bfoster@xxxxxxxxxx> <20140211064609.GE13647@dastard>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
On 02/11/2014 01:46 AM, Dave Chinner wrote:
> On Tue, Feb 04, 2014 at 12:49:35PM -0500, Brian Foster wrote:
>> Create the xfs_calc_finobt_res() helper to calculate the finobt log
>> reservation for inode allocation and free. Update
>> XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
>> insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
>> reserve blocks for the potential finobt record insertion on inode
>> free (i.e., if an inode chunk was previously fully allocated).
>>
>> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
>> ---
>>  fs/xfs/xfs_inode.c       |  4 +++-
>>  fs/xfs/xfs_trans_resv.c  | 47 
>> +++++++++++++++++++++++++++++++++++++++++++----
>>  fs/xfs/xfs_trans_space.h |  7 ++++++-
>>  3 files changed, 52 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 001aa89..57c77ed 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -1730,7 +1730,9 @@ xfs_inactive_ifree(
>>      int                     error;
>>  
>>      tp = xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
>> -    error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, 0, 0);
>> +    tp->t_flags |= XFS_TRANS_RESERVE;
>> +    error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree,
>> +                              XFS_IFREE_SPACE_RES(mp), 0);
> 
> Can you add a comment explaining why the XFS_TRANS_RESERVE flag is
> used here, and why it's use won't lead to accelerated reserve pool
> depletion?
> 

So this aspect of things appears to be a bit more interesting than I
originally anticipated. I "reserve enabled" this transaction to
facilitate the ability to free up inodes under ENOSPC conditions.
Without this, the problem of failing out of xfs_inactive_ifree() (and
leaving an inode chained on the unlinked list) is easily reproducible
with generic/083.

The basic argument for why this is reasonable is that releasing an inode
releases used space (i.e., file blocks and potentially directory blocks
and inode chunks over time). That said, I can manufacture situations
where this is not the case. E.g., allocate a bunch of 0-sized files,
consume remaining free space in some separate file, start removing
inodes in a manner that removes a single inode per chunk or so. This
creates a scenario where the inobt can be very large and the finobt very
small (likely a single record). Removing the inodes in this manner
reduces the likelihood of freeing up any space and thus rapidly grows
the finobt towards the size of the inobt without any free space
available. This might or might not qualify as sane use of the fs, but I
don't think the failure scenario is acceptable as things currently stand.

I think there are several ways this can go from here. A couple ideas
that have crossed my mind:

- Find a way to variably reserve the number of blocks that would be
required to grow the finobt to the finobt, based on current state. This
would require the total number of blocks (not just enough for a split),
so this could get complex and somewhat overbearing (i.e., a lot of space
could be quietly reserved, current tracking might not be sufficient and
the allocation paths could get hairy).

- Work to push the ifree transaction allocation and reservation to the
unlink codepath rather than the eviction codepath. Under normal
circumstances, chain the tp to the xfs_inode such that the eviction code
path can grab it and run. This prevents us going into the state where an
inode is unlinked without having enough space to free up. On the flip
side, ENOSPC on unlink isn't very forgiving behavior to the user.

I think the former approach is probably overkill for something that
might be a pathological situation. The latter approach is more simple,
but it feels like a bit of a hack. I've experimented with it a bit, but
I'm not quite sure yet if it introduces any transaction issues by
allocating the unlink and ifree transactions at the same time.

Perhaps another argument could be made that it's rather unlikely we run
into an fs with as many 0-sized (or sub-inode chunk sized) files as
required to deplete the reserve pool without freeing any space, and we
should just touch up the failure handling. E.g.,

1.) Continue to reserve enable the ifree transaction. Consider expanding
the reserve pool on finobt-enabled fs' if appropriate. Note that this is
not guaranteed to provide enough resources to populate the finobt to the
level of the inobt without freeing up more space.
2.) Attempt a !XFS_TRANS_RESERVE tp reservation in xfs_inactive_ifree().
If fails, xfs_warn()/notice() and enable XFS_TRANS_RESERVE.
3.) Attempt XFS_TRANS_RESERVE reservation. If fails, xfs_notice() and
shutdown.

And this could probably be made more intelligent to bail out sooner if
we repeat XFS_TRANS_RESERVE reservations without freeing up any space,
etc. Before going too far in one direction... thoughts?

Brian

>>      if (error) {
>>              ASSERT(XFS_FORCED_SHUTDOWN(mp));
>>              xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES);
>> diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
>> index 2fd59c0..32f35c1 100644
>> --- a/fs/xfs/xfs_trans_resv.c
>> +++ b/fs/xfs/xfs_trans_resv.c
>> @@ -98,6 +98,37 @@ xfs_calc_inode_res(
>>  }
>>  
>>  /*
>> + * The free inode btree is a conditional feature and the log reservation
>> + * requirements differ slightly from that of the traditional inode 
>> allocation
>> + * btree. The finobt tracks records for inode chunks with at least one free 
>> inode.
>> + * Therefore, a record can be removed from the tree for an inode allocation 
>> or
>> + * free and the associated merge reservation is unconditional. This also 
>> covers
>> + * the possibility of a split on record insertion.
> 
> Slightly wider than 80 columns here. FWIW, if you use vim, add this
> rule to have it add a red line at the textwidth you have set:
> 
> " highlight textwidth
> set cc=+1
> 
> And that will point out lines that are too long quite obviously ;)
> 
>> + *
>> + * the free inode btree: max depth * block size
>> + * the free inode btree entry: block size
>> + *
>> + * TODO: is the modify res really necessary? covered by the merge/split res?
>> + * This seems to be the pattern of ifree, but not create_resv_alloc. Why?
> 
> The modify case is for an allocation that only updates an inobt
> record (i.e. chunk already allocated, free inodes in it). Because
> we can remove a finobt record when "modifying" the last free inode
> record in a chunk, "modify" can cause a redcord removal and hence a
> tree merge. In which case it's no different of any of the other
> finobt reservations....
> 
>> @@ -267,6 +298,7 @@ xfs_calc_remove_reservation(
>>   *    the superblock for the nlink flag: sector size
>>   *    the directory btree: (max depth + v2) * dir block size
>>   *    the directory inode's bmap btree: (max depth + v2) * block size
>> + *    the finobt
>>   */
>>  STATIC uint
>>  xfs_calc_create_resv_modify(
>> @@ -275,7 +307,8 @@ xfs_calc_create_resv_modify(
>>      return xfs_calc_inode_res(mp, 2) +
>>              xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
>>              (uint)XFS_FSB_TO_B(mp, 1) +
>> -            xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
>> +            xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
>> +            xfs_calc_finobt_res(mp, 1);
>>  }
> 
> And this is where is starts to get complex. The modify operation can
> now cause a finobt merge, when means blocks will be allocated/freed.
> That means we now need to take into account:
> 
>  *    the allocation btrees: 2 trees * (max depth - 1) * block size
> 
> and anything else freeing an extent requires.
> 
>>  /*
>> @@ -285,6 +318,7 @@ xfs_calc_create_resv_modify(
>>   *    the inode blocks allocated: XFS_IALLOC_BLOCKS * blocksize
>>   *    the inode btree: max depth * blocksize
>>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
>> + *    the finobt
>>   */
>>  STATIC uint
>>  xfs_calc_create_resv_alloc(
>> @@ -295,7 +329,8 @@ xfs_calc_create_resv_alloc(
>>              xfs_calc_buf_res(XFS_IALLOC_BLOCKS(mp), XFS_FSB_TO_B(mp, 1)) +
>>              xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
>>              xfs_calc_buf_res(XFS_ALLOCFREE_LOG_COUNT(mp, 1),
>> -                             XFS_FSB_TO_B(mp, 1));
>> +                             XFS_FSB_TO_B(mp, 1)) +
>> +            xfs_calc_finobt_res(mp, 0);
>>  }
> 
> This reservation is only for v4 superblocks - the icreate
> transaction reservation is used for v5 superblocks, so that's the
> only one you need to modify.
> 
> Cheers,
> 
> Dave.
> 

<Prev in Thread] Current Thread [Next in Thread>