From: Dave Chinner <dchinner@xxxxxxxxxx>
For CRC enabled filesystems, we can't just swap inode forks from one inode to
another when defragmenting a file - the blocks in the inode fork bmap btree
contain pointers back to the owner inode. Hence if we are to swap the inode
forks we have to atomically modify every block in the btree during the
transaction.
There are two approaches to doing this. Firstly, if we are doing an entire fork
swap, we could create a new transaction item type that indicates we are changing
the owner of a certain structure from one value to another, and then use ordered
buffer logging to modify all the buffers in the tree without needing to log
them. This would then require log recovery to perform the modification of the
owner information of the objects/structures in question.
This does introduce some interesting ordering details into recovery - we have to
make sure that the owner change replay occurs after the change that moves the
objects is made, not before. Hence we can't use a separate log item for this as
we have no guarantee of strict ordering between multiple items in the log due to
the relogging action of asynchronous transaction commits. Hence there is no
"generic" method we can use for changing the ownership of arbitrary metadata
structures.
For inode forks, however, there is a simple method of communicating that the
fork contents need the owner rewritten - we can pass an inode log format flag
for the fork for the transaction that does a fork swap. This flag will then
follow the inode fork through relogging actions so when the swap actually gets
replayed the ownership can be changed immediately by log recovery. So that gives
us a simple method of "whole fork" exchange between two inodes.
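As a rough illustration of the flag-based approach, the sketch below models
the idea in plain userspace C. It is not the XFS implementation: every name
here (demo_inode, DEMO_ILOG_DOWNER, demo_relog and so on) is hypothetical, and
the "log" is reduced to a flags word on the inode. The point it tries to show
is that relogging only ever ORs flags in, so the owner-change flag set by the
swap is still present when recovery replays the inode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log format flags; not the real XFS values. */
#define DEMO_ILOG_CORE   (1u << 0)	/* inode core was logged */
#define DEMO_ILOG_DOWNER (1u << 1)	/* data fork owner must be rewritten */

struct demo_block {
	uint64_t owner;			/* back pointer to the owning inode */
};

struct demo_inode {
	uint64_t ino;
	struct demo_block *fork;	/* stand-in for the bmbt blocks */
	size_t nblocks;
	unsigned int log_flags;		/* accumulated across relogging */
};

/* Relogging accumulates flags; it never clears the owner-change flag. */
static void demo_relog(struct demo_inode *ip, unsigned int flags)
{
	ip->log_flags |= flags;
}

/* Whole-fork swap: exchange the forks and flag both for owner rewrite. */
static void demo_swap_forks(struct demo_inode *a, struct demo_inode *b)
{
	struct demo_block *fork = a->fork;
	size_t nblocks = a->nblocks;

	assert(a != b);
	a->fork = b->fork;
	a->nblocks = b->nblocks;
	b->fork = fork;
	b->nblocks = nblocks;
	demo_relog(a, DEMO_ILOG_CORE | DEMO_ILOG_DOWNER);
	demo_relog(b, DEMO_ILOG_CORE | DEMO_ILOG_DOWNER);
}

/*
 * Recovery side: if the flag is set, rewrite the owner of every block in
 * the fork. Returns the number of blocks that were changed.
 */
static size_t demo_recover_inode(struct demo_inode *ip)
{
	size_t i, changed = 0;

	if (!(ip->log_flags & DEMO_ILOG_DOWNER))
		return 0;
	for (i = 0; i < ip->nblocks; i++) {
		if (ip->fork[i].owner != ip->ino) {
			ip->fork[i].owner = ip->ino;
			changed++;
		}
	}
	return changed;
}
```

Because the flag travels with the inode rather than in a separate log item,
there is no ordering problem between "move the fork" and "change the owner"
in this model: both are replayed from the same item.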
This is relatively simple to implement, so it makes sense to do this as an
initial implementation to support xfs_fsr on CRC enabled filesystems in the same
manner as we do on existing filesystems.
The second approach is to implement a proper extent swap transaction which moves
an arbitrary range of a fork from one inode to another. This would need to be
done as a permanent rolling transaction that moves a fixed number of extents at
a time between the two inode forks. The local/extent format implementation is
trivial - we only need to modify the inode forks and log the inodes to implement
it - but the btree implementation is much, much harder.
The first thing to note is that the two inodes that are being swapped do not
necessarily contain the same data, and hence we cannot assume that we are making
a symmetrical modification. Hence we have to involve an intermediate inode fork
to stage the movement of extents. That is, we move extents from the source to
the intermediate record, move the extents on the target to the source, and then
move the intermediate record extents to the target. Because of the nature of the
movement, we want all three movements in a single transaction but we do not want
the intermediate record to show up in any transactions.
This is made complex due to the fact that the extents being swapped might be of
different offsets and lengths, and hence the movement per transaction may
require swapping of partial extent ranges on one side where one inode has a
large contiguous extent and the other has lots of small extents in the same
range. This means that the number of transactions we need to do the swap is
not clearly defined before we start the operation. This is very similar to the
problem truncate has - it has to string multiple extent manipulation operations
together into a single atomic operation.
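To make the partial-range problem concrete, here is a small userspace sketch
that walks two extent lists and emits the ranges that could each be swapped in
one step. It is only an illustration under simplifying assumptions: both lists
are sorted, contain no holes, start at the same offset and cover the same
total range. The names demo_extent and demo_swap_steps are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_extent {
	uint64_t off;	/* file offset of the extent */
	uint64_t len;	/* length of the extent */
};

/*
 * Split the common range at every extent boundary found in either list.
 * Each emitted range lies within a single extent on both sides, so one
 * side may contribute only part of a large extent. Returns the number of
 * ranges written to out (at most max_out).
 */
static size_t demo_swap_steps(const struct demo_extent *a, size_t na,
			      const struct demo_extent *b, size_t nb,
			      struct demo_extent *out, size_t max_out)
{
	size_t i = 0, j = 0, n = 0;
	uint64_t pos;

	if (na == 0 || nb == 0)
		return 0;
	assert(a[0].off == b[0].off);	/* assumption of this sketch */
	pos = a[0].off;

	while (i < na && j < nb && n < max_out) {
		uint64_t end_a = a[i].off + a[i].len;
		uint64_t end_b = b[j].off + b[j].len;
		uint64_t end = end_a < end_b ? end_a : end_b;

		out[n].off = pos;
		out[n].len = end - pos;
		n++;
		pos = end;
		if (end == end_a)
			i++;	/* finished this extent on side a */
		if (end == end_b)
			j++;	/* finished this extent on side b */
	}
	return n;
}
```

The number of steps depends on how the boundaries of the two lists interleave,
which is only known once both lists have been walked. That is the sense in
which the number of transactions is not defined up front.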
The extent freeing code does this via a pair of intent/done items that wrap the
entire operation - the EFI/EFD items. To do a co-ordinated, atomic extent swap,
we are going to need an equivalent intent/done pair of log items to indicate
that the upcoming stream of extent manipulations needs to be replayed
completely. This is necessary as the individual extent movement transactions can
result in bmbt blocks being allocated and freed, and hence can be rolling
transactions themselves made atomic via EFI/EFD intents in xfs_bmap_finish().
Hence, at minimum, we need to ensure that each extent that is swapped is fully
and correctly replayed and to do that we need a Swap Extent Intent (ESI) and
Swap Extent Done (ESD) pair of log items.
Like the EFI/EFD items, however, these intents can record multiple extents to be
swapped at a time, and hence this allows us some flexibility in determining how
to batch up modifications for efficiency purposes. The ESI would record the
exact extent records being swapped between inodes and be committed, after which
we can then swap in a multi-transaction loop (to handle bmap btree
allocation/free operations during insert/remove operations) that updates the ESD
after each extent range in the ESI is swapped successfully.
As a result, recovery would be very similar to EFI/EFD recovery - as each ESD
is seen, it cancels the completed range of the related ESI, and when all ranges
are cancelled the ESI/ESD are removed from the recovery list. If there are ESIs
left at the end of the recovery pass, we then need to run a loop that completes
them and so leaves the inodes in a known correct state.
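The recovery pass described above could be modelled along the following lines.
This is a userspace sketch only, not proposed kernel code: the record layout,
the fixed-size table and all demo_* names are assumptions made for
illustration. An ESI record adds an intent to the recovery list, each ESD
record cancels one range of the matching intent, and whatever is still pending
at the end is what the completion loop would have to finish.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_RANGES  8	/* ranges one intent may record */
#define DEMO_MAX_INTENTS 4	/* intents tracked during recovery */

enum demo_rec_type { DEMO_ESI, DEMO_ESD };

struct demo_rec {
	enum demo_rec_type type;
	uint32_t id;		/* links an ESD to its ESI */
	uint32_t nranges;	/* ESI only: number of ranges recorded */
	uint32_t range;		/* ESD only: index of the completed range */
};

struct demo_intent {
	uint32_t id;
	uint32_t nranges;
	uint32_t done_mask;	/* bit n set: range n was cancelled */
	int live;		/* still on the recovery list */
};

struct demo_recovery {
	struct demo_intent list[DEMO_MAX_INTENTS];
};

/* Process one log record in the order it was found in the log. */
static void demo_recover_rec(struct demo_recovery *r,
			     const struct demo_rec *rec)
{
	size_t i;

	if (rec->type == DEMO_ESI) {
		assert(rec->nranges >= 1 && rec->nranges <= DEMO_MAX_RANGES);
		for (i = 0; i < DEMO_MAX_INTENTS; i++) {
			if (r->list[i].live)
				continue;
			r->list[i].id = rec->id;
			r->list[i].nranges = rec->nranges;
			r->list[i].done_mask = 0;
			r->list[i].live = 1;
			return;
		}
		return;		/* table full: dropped in this sketch */
	}

	/* ESD: cancel one range; drop the intent once all are cancelled. */
	for (i = 0; i < DEMO_MAX_INTENTS; i++) {
		struct demo_intent *it = &r->list[i];

		if (!it->live || it->id != rec->id)
			continue;
		assert(rec->range < it->nranges);
		it->done_mask |= 1u << rec->range;
		if (it->done_mask == (1u << it->nranges) - 1u)
			it->live = 0;
		return;
	}
}

/* Ranges still to be completed once the recovery pass has finished. */
static uint32_t demo_pending_ranges(const struct demo_recovery *r)
{
	uint32_t bit, pending = 0;
	size_t i;

	for (i = 0; i < DEMO_MAX_INTENTS; i++) {
		const struct demo_intent *it = &r->list[i];

		if (!it->live)
			continue;
		for (bit = 0; bit < it->nranges; bit++)
			if (!(it->done_mask & (1u << bit)))
				pending++;
	}
	return pending;
}
```

A per-range bitmask is just one way to track partial completion; a real
implementation would more likely follow whatever the EFI/EFD code does for
matching done items to intents.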
This is, overall, much more complex than what is currently needed for xfs_fsr
support, so this is more documentation of how we would implement generic ranged
extent swap support for XFS.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
fs/xfs/xfs_dfrag.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_dfrag.h b/fs/xfs/xfs_dfrag.h
index 20bdd93..ad688fd 100644
@@ -19,7 +19,7 @@
- * Structure passed to xfs_swapext
+ * Structure passed to xfs_swapext, currently only supports full file
typedef struct xfs_swapext