Here is a little writeup I did about how we handle dirty metadata
flushing in XFS currently, and how we can improve on it in the
relatively short term:
Metadata flushing in XFS
This document describes how dirty in-core XFS metadata is currently
handled and how it gets flushed to disk, as well as ideas on how to
simplify this in the future.
All metadata in XFS is read and written using buffers as the lowest layer.
There are two ways to write a buffer back to disk: delwri and sync.
Delwri means the buffer gets added to a delayed write list, which a
background thread writes back periodically or when forced to. A synchronous
write means the buffer is written back immediately, and the caller waits
for completion.
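To make the difference between the two write paths concrete, here is a minimal user-space sketch. All names in it (toy_buf, toy_delwri_list and the toy_* functions) are made up for illustration and are not the kernel's buffer cache interfaces; "writing" a buffer is reduced to clearing a dirty flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy buffer; not the kernel's buffer structure. */
struct toy_buf {
    bool dirty;             /* contents differ from what is on "disk" */
    bool queued;            /* already on the delayed write list */
    struct toy_buf *next;   /* link in the delayed write list */
};

/* Hypothetical delayed write list, drained by a background flusher. */
struct toy_delwri_list {
    struct toy_buf *head;
    size_t count;
};

/* Stand-in for submitting the I/O and waiting for it to complete. */
static void toy_write_and_wait(struct toy_buf *bp)
{
    bp->dirty = false;
}

/* Synchronous write: the caller returns only once the buffer is clean. */
static void toy_buf_write_sync(struct toy_buf *bp)
{
    toy_write_and_wait(bp);
}

/* Delayed write: only queue the buffer; it stays dirty until a flush. */
static void toy_buf_write_delwri(struct toy_delwri_list *list,
                                 struct toy_buf *bp)
{
    assert(bp->dirty);
    if (bp->queued)
        return;             /* queue each buffer at most once */
    bp->queued = true;
    bp->next = list->head;
    list->head = bp;
    list->count++;
}

/* What the background thread does periodically or when forced to. */
static size_t toy_delwri_flush(struct toy_delwri_list *list)
{
    size_t written = 0;

    while (list->head) {
        struct toy_buf *bp = list->head;

        list->head = bp->next;
        bp->next = NULL;
        bp->queued = false;
        toy_write_and_wait(bp);
        written++;
    }
    list->count = 0;
    return written;
}
```

The point of the sketch is only the ordering of events: after toy_buf_write_sync the buffer is already clean, while after toy_buf_write_delwri it is still dirty and becomes clean only when toy_delwri_flush runs.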
Logging and the Active Item List (AIL)
The prime method of metadata writeback in XFS is by logging the changes
into the transaction log, and writing back the changes to the original
location in the background. The prime data structure to drive the
asynchronous write back is the Active Item List or AIL. The AIL contains
a list of all changes in the log that need to be written back, ordered
by the time they were committed to the log using the Log Sequence
Number (LSN). The AIL is periodically pushed out to try to move the
log tail LSN forward. In addition, the sync worker periodically attempts
to push out all items in the AIL.
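The relationship between the LSN ordering, pushing and the log tail can be sketched roughly as follows. This is a simplified user-space model, not the kernel code: the toy_* names are hypothetical, and a push here removes items immediately, whereas the real AIL only starts writeback and items leave the list once their I/O completes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_lsn_t;     /* hypothetical stand-in for an LSN */

struct toy_ail_item {
    toy_lsn_t lsn;              /* LSN at which the change was committed */
    struct toy_ail_item *next;
};

struct toy_ail {
    struct toy_ail_item *head;  /* lowest LSN first */
};

/* Insert an item, keeping the list sorted by ascending LSN. */
static void toy_ail_insert(struct toy_ail *ail, struct toy_ail_item *item)
{
    struct toy_ail_item **pp = &ail->head;

    assert(item->lsn != 0);     /* 0 is reserved for "AIL is empty" */
    while (*pp && (*pp)->lsn <= item->lsn)
        pp = &(*pp)->next;
    item->next = *pp;
    *pp = item;
}

/* The log tail is the lowest LSN still in the AIL, or 0 if it is empty. */
static toy_lsn_t toy_ail_tail_lsn(const struct toy_ail *ail)
{
    return ail->head ? ail->head->lsn : 0;
}

/*
 * Push: "write back" and remove every item with an LSN at or below
 * the target. Returns the number of items removed.
 */
static size_t toy_ail_push(struct toy_ail *ail, toy_lsn_t target)
{
    size_t pushed = 0;

    while (ail->head && ail->head->lsn <= target) {
        struct toy_ail_item *item = ail->head;

        ail->head = item->next;
        item->next = NULL;
        pushed++;
    }
    return pushed;
}
```

In this model, pushing to a target LSN moves the tail forward to the first item above the target, and pushing with UINT64_MAX as the target corresponds to the sync worker trying to push out everything.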
Non-transaction metadata updates
XFS still has a few places where metadata is updated non-transactionally.
The prime causes for non-transaction metadata updates are timestamps in the
inode, and inode size updates from extending writes. These are handled
by marking the inode dirty in the VFS and XFS inodes, and either relying
on transactional updates to piggy-back these updates, or on the VFS
periodic writeback thread to call into the ->write_inode method in
XFS to write these changes back. ->write_inode either starts delwri
buffer writeback on the inode, or starts a new transaction to log
the inode core containing these changes.
The dquot structures may be scheduled for delwri writeback after a
quota check during an unclean mount.
Extended attribute payloads that are stored outside the main attribute
btree are written back synchronously using buffers.
New allocation group headers written during a filesystem resize are
written synchronously using buffers.
The superblock is written synchronously using buffers during umount
and sync operations.
Log recovery writes back various pieces of metadata synchronously
or using delwri buffers.
Other flushing methods
For historical reasons we still have a few places that flush XFS metadata
using methods other than logging and the AIL or explicit synchronous
or delwri writes.
Filesystem freezing loops over all inodes in the system to flush out
inodes marked dirty directly using xfs_iflush.
The quotacheck code marks dquots dirty, just to flush them at the end of
the quotacheck operation.
The periodic and explicit sync code walks through all dquots and writes
back all dirty dquots directly.
We should get rid of both the reliance on the VFS writeback tracking, and
XFS-internal non-AIL metadata flushing.
To get rid of the VFS writeback we'll just need to log all timestamp and size
updates explicitly when they happen. This could be done today, but the
overhead for frequent transactions in that area is deemed too high, especially
with delayed logging enabled. We plan to deprecate the non-delaylog mode
by Linux 3.3, and introduce a new fast-path for inode core updates that
will allow us to use direct logging for these updates without introducing
The explicit inode flushing using xfs_sync_attr looks like an attempt to
make sure we do not have any inodes in the AIL when freezing a filesystem.
A better replacement would be a call into the AIL code that allows us to
completely empty the AIL before a freeze.
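As a rough illustration of what "completely empty the AIL" could look like, here is a small user-space sketch. It is not a proposal for the actual interface: the toy_* names are hypothetical, and the busy_passes counter is an assumed stand-in for items that cannot be written on a given pass (for example because they are locked). It also assumes nothing new is inserted into the AIL while the loop runs, as would be the case once new transactions are blocked for the freeze.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy log item. */
struct toy_item {
    bool in_ail;        /* still waiting for writeback */
    int busy_passes;    /* passes during which it cannot be written */
};

/* One push pass over all items; returns how many remain in the AIL. */
static size_t toy_push_pass(struct toy_item *items, size_t n)
{
    size_t remaining = 0;

    for (size_t i = 0; i < n; i++) {
        if (!items[i].in_ail)
            continue;
        if (items[i].busy_passes > 0) {
            items[i].busy_passes--;     /* retry on a later pass */
            remaining++;
        } else {
            items[i].in_ail = false;    /* written back, leaves the AIL */
        }
    }
    return remaining;
}

/*
 * Keep pushing until nothing is left in the AIL.
 * Returns the number of passes that were needed.
 */
static size_t toy_ail_empty_for_freeze(struct toy_item *items, size_t n)
{
    size_t passes = 0;

    assert(items != NULL || n == 0);
    do {
        passes++;
    } while (toy_push_pass(items, n) > 0);
    return passes;
}
```

The design point is that a single push is not enough: the caller has to loop (or sleep and be woken) until the list is really empty, which is something only the AIL code itself can know reliably.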
The explicit quota flushing needs a bit more work. First, quota check needs
to be converted to queue up inodes to the delwri list immediately when
updating the dquot for each inode. Second, the code in xfs_qm_scall_setqlim
that attaches a dquot to the transaction, but marks it dirty manually instead
of through the transaction interface, needs a detailed audit. After this
we should be able to get rid of all explicit xfs_qm_sync calls.