[Top] [All Lists]

Re: Questions about XFS discard and xfs_free_extent() code (newbie)

To: "Dave Chinner" <david@xxxxxxxxxxxxx>
Subject: Re: Questions about XFS discard and xfs_free_extent() code (newbie)
From: "Alex Lyakas" <alex@xxxxxxxxxxxxxxxxx>
Date: Wed, 8 Jan 2014 20:13:38 +0200
Cc: <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
Importance: Normal
In-reply-to: <20131226230018.GJ20579@dastard>
References: <AD612A564BB84E75B010AE37687DFC8E@alyakaslap> <20131218230615.GQ31386@dastard> <78FC295EC7FF48C987266DC48B183930@alyakaslap> <20131219105513.GZ31386@dastard> <8155F3F9D6094F40B4DA71BD561D2DE8@alyakaslap> <20131226230018.GJ20579@dastard>
Hello Dave,

Currently I am working on the following approach:

Basic idea: make each xfs_extent_busy struct carry potential information about a larger extent-for-discard, i.e., one that is a multiple of discard_granularity.

In detail this looks like this:
# xfs_free_ag_extent attempts to merge the freed extent into a larger free-extent in the by-bno btree. When this function completes its work, we have [nbno, nlen], which is potentially a larger free-extent. At this point we also know that AGF is locked. # We trim [nbno, nlen] to be a multiple-of and aligned-by discard_granularity (if possible), and we receive [dbno, dlen], which is a very nice extent to discard. # When calling xfs_extent_busy_insert(), we add these two values to the xfs_extent_busy struct. # When the extent-free operation is committed for this busy extent, we know that we can discard this [dbno, dlen] area, unless somebody have allocated an extent, which overlaps this area.

To address that, at the end of  xfs_alloc_fixup_trees() we do the following:
# We know that we are going to allocate [rbno, rlen] from appropriate AG. So at this point, we search the busy extents tree to check if there is a busy extent that holds [dbno, dlen] (this is a multiple-of and aligned-by discard granularity), which overlaps [rbno, rlen] fully or partially. If found, we shrink the [dbno, dlen] area to be still a multiple-of and aligned by discard-granularity, if possible. So we have a new smaller [dbno, dlen] that we still can discard, attached to the same busy extent. Or we discover that the new area is too small to discard, so we forget about it. # The allocation flow anyways searches the busy extents tree, so we should be ok WRT to locking order, but adding some extra work.

This way, we basically track larger chunks, which are nice to discard.

I am aware that I need to handle additional issues like:
# A busy extent can be "unbusyied" or shrunk by xfs_extent_busy_update_extent(). We need to update [dbno, dlen] accordingly or delete it fully # To be able to search for [dbno, dlen], we probably need another rbtree (under the same pag->pagb_lock), which tracks large extents for discard. xfs_extent_busy needs additional rbnode. # If during xfs_alloc_fixup_trees() we discover that extent is already being discarded, we need to wait. Assuming we have asynchronous discard, this wait will be short - we only need the block device to queue the discard request, and then we are good to allocate from that area again.

One thing I am unsure about, is a scenario like this:
# assume discard-granularity=1MB
# we have a 1MB almost free, except two 4K blocks, somewhere in the free space # Transaction t1 comes and frees 4K block A, but the 1MB extent is not fully free yet, so nothing to discard # Transaction t2 frees the second 4K block B, now 1MB is free and we attach a [dbno, dlen] to the second busy extent

However, I think there is no guarantee that t1 will commit before t2; is that right? But we cannot discard the 1MB extent, before both transactions commit. (One approach to solve this, is to give a sequence number for each xfs_extent_busy extent, and have a background thread that does delayed discards, once all needed busy extents are committed. The delayed discards are also considered in the check that xfs_alloc_fixup_trees() does).

What do you think overall about this approach? Is there something fundamental that prevents it from working?

Also (if you are still reading:), can you kindly comment this question that I have: # xfs_free_ag_extent() has a "isfl" parameter. If it is "true", then this extent is added as usual to the free-space btrees, but the caller doesn't add it as a busy extent. This means that such extent is suitable for allocation right away, without waiting for the log commit?

Thank you for helping,

-----Original Message----- From: Dave Chinner
Sent: 27 December, 2013 1:00 AM
To: Alex Lyakas
Cc: xfs@xxxxxxxxxxx
Subject: Re: Questions about XFS discard and xfs_free_extent() code (newbie)

On Tue, Dec 24, 2013 at 08:21:50PM +0200, Alex Lyakas wrote:
Hi Dave,
Reading through the code some more, I see that the extent that is
freed through xfs_free_extent() can be an XFS metadata extent as
For example, xfs_inobt_free_block() frees a block of the AG's
free-inode btree. Also, xfs_bmbt_free_block() frees a generic btree
block by putting it onto the cursor's "to-be-freed" list, which will
be dropped into the free-space btree (by xfs_free_extent) in
xfs_bmap_finish(). If we discard such metadata block before the
transaction is committed to the log and we crash, we might not be
able to properly mount after reboot, is that right?

Yes. The log stores a delta of the transactional changes, and so
requires th eprevious version of the block to be intact for revoery
to take place.

I mean it's not
that some file's data block will show 0s to the user instead of
before-delete data, but some XFS btree node (for example) will be
wiped in such case. Can this happen?

Yes, it could. That's what I meant by:


> IOWs, issuing the discard before the transaction that frees the
> extent is on stable storage means we are discarding user data or
> metadata before we've guaranteed that the extent free transaction
> is permanent and that means we violate certain guarantees with
> respect to crash recovery...

The "or metadata" part of the above sentence.


Dave Chinner
<Prev in Thread] Current Thread [Next in Thread>