On Tue, 2003-05-13 at 00:31, Andi Kleen wrote:
> > You need to try the current cvs kernel, there was a major rework of xfs
> > sync recently which should have fixed this.
> Does it simply flush less often and make the zero windows smaller
> or did you implement something more clever?
Complete change in approach is the short answer.
It used to be that we flushed inodes out of write super via the
xfs_syncsub function - which, if you look at the code, was big
and hairy. Now we use the dirty inode list correctly, and
xfs_syncsub is mostly used for flushing the log to disk.
The lifecycle of an xfs metadata buffer looks somewhat like this:
o read in and locked by a thread
o modified and added to a transaction
o copied into the in memory log buffer and pinned
(pinning means it cannot be flushed).
When the log buffer is flushed to disk:
o added to something called the active item list or AIL
(metadata whose most up-to-date copy is in the log)
o unpinned - which means it can be flushed
When the metadata is flushed to disk (caused by the
age of the metadata exceeding a limit, or by
demand for log space):
o removed from the AIL
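
To make the ordering concrete, here is a toy model of the lifecycle above. It is only a sketch in Python, not the kernel code: the class and method names are made up for illustration and do not correspond to real XFS functions.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataBuffer:
    """Toy model of one metadata buffer; every name here is hypothetical."""
    name: str
    pinned: bool = False
    in_ail: bool = False
    history: list = field(default_factory=list)

    def modify_in_transaction(self):
        # Read in, locked by a thread, modified and added to a transaction.
        self.history.append("modified")

    def commit_to_log_buffer(self):
        # Copied into the in-memory log buffer; pinned means "cannot be flushed".
        self.pinned = True
        self.history.append("pinned")

    def log_buffer_written(self):
        # The log now holds the most up-to-date copy: track the buffer in the
        # AIL and unpin it so that it may be written back.
        self.in_ail = True
        self.pinned = False
        self.history.append("in_ail")

    def can_flush(self):
        return not self.pinned

    def flush_to_disk(self):
        if self.pinned:
            raise RuntimeError(f"{self.name} is pinned and cannot be flushed")
        # Once the metadata itself is on disk it leaves the AIL.
        self.in_ail = False
        self.history.append("flushed")


def run_lifecycle():
    buf = MetadataBuffer("inode 42")
    buf.modify_in_transaction()
    buf.commit_to_log_buffer()
    flushable_while_pinned = buf.can_flush()
    buf.log_buffer_written()
    flushable_after_log_write = buf.can_flush()
    buf.flush_to_disk()
    return flushable_while_pinned, flushable_after_log_write, buf
```

The point of the model is only that a flush attempted between the commit and the log write has nothing it is allowed to write.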
Inodes go through this cycle. It used to be that the inode was
marked dirty during phase A (the first group of steps above), as was
the super block. But the inode cannot be written at this point since
it is pinned in memory. Also, once the super_block was marked clean,
nothing was getting flushed.
Now we mark the super block dirty when we commit a transaction, and
an inode dirty when we unpin it.
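
The difference between the two schemes can be sketched as below. This is a deliberately simplified model, assuming one inode and one flush pass before and one after the unpin; the helper names are hypothetical and the real code paths are more involved.

```python
class Inode:
    def __init__(self):
        self.pinned = False
        self.on_disk = False


def write_super_pass(sb, dirty_inodes):
    """Old scheme (simplified): inode flushing hangs off the superblock flag."""
    if not sb["dirty"]:
        return
    for ino in list(dirty_inodes):
        if not ino.pinned:
            ino.on_disk = True
            dirty_inodes.remove(ino)
    sb["dirty"] = False  # marked clean even if a pinned inode was skipped


def dirty_list_pass(dirty_inodes):
    """New scheme (simplified): walk the dirty inode list directly."""
    for ino in list(dirty_inodes):
        if not ino.pinned:
            ino.on_disk = True
            dirty_inodes.remove(ino)


def old_scheme():
    ino, sb, dirty = Inode(), {"dirty": False}, []
    # Transaction: inode and superblock marked dirty while the inode is pinned.
    ino.pinned = True
    dirty.append(ino)
    sb["dirty"] = True
    write_super_pass(sb, dirty)   # inode is pinned, so it is skipped
    ino.pinned = False            # log buffer hits disk, inode unpinned
    write_super_pass(sb, dirty)   # superblock is clean, nothing happens
    return ino.on_disk


def new_scheme():
    ino, sb, dirty = Inode(), {"dirty": False}, []
    # Transaction commit: only the superblock is marked dirty here.
    ino.pinned = True
    sb["dirty"] = True
    dirty_list_pass(dirty)        # nothing on the list yet
    ino.pinned = False
    dirty.append(ino)             # inode marked dirty at unpin time
    dirty_list_pass(dirty)        # now it can actually be written
    return ino.on_disk
```

In this model old_scheme() leaves the inode unwritten and new_scheme() writes it, which is the behaviour change described above.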
We also have a thread which calls the xfs_sync code on a timer,
and this matters for crash recovery. On crash recovery, the head
and tail of the log are identified and this range of the log is
replayed. Each time we write out a log record, it contains the
tail - which is the oldest record in the AIL. If a filesystem goes
quiet, then there may be old metadata still in the AIL, so the
last log record written out will point a long way back in time,
even if the metadata in the AIL was flushed before the crash.
By adding the periodic activity thread we activate some code which
looks for an empty AIL and writes out a dummy log record to record
the new tail of the log.
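
A small model of the tail problem and the dummy record, again only a sketch: LSNs are plain integers, every record adds one item to the AIL, and the class and method names are invented for the example rather than taken from XFS.

```python
class Log:
    def __init__(self):
        self.head = 0       # LSN the next record will get
        self.ail = set()    # LSNs of items logged but not yet flushed in place
        self.records = []   # (lsn, tail) pairs as they would sit on disk

    def write_record(self, dummy=False):
        lsn = self.head
        self.head += 1
        if not dummy:
            self.ail.add(lsn)
        # The tail stored in the record is the oldest item still in the AIL,
        # or the record itself when the AIL is empty.
        tail = min(self.ail) if self.ail else lsn
        self.records.append((lsn, tail))
        return lsn

    def flush_metadata(self):
        # Metadata written in place: the AIL empties, but nothing new is
        # written to the log, so the on-disk tail does not move.
        self.ail.clear()

    def periodic_sync(self):
        # Stand-in for the timer thread: if the AIL is empty and the last
        # record still points at an old tail, write a dummy record.
        if not self.records:
            return
        last_lsn, last_tail = self.records[-1]
        if not self.ail and last_tail != last_lsn:
            self.write_record(dummy=True)

    def recovery_range(self):
        # What recovery would replay: from the tail named in the last
        # record up to that record.
        lsn, tail = self.records[-1]
        return list(range(tail, lsn + 1))


def demo():
    log = Log()
    for _ in range(3):
        log.write_record()
    log.flush_metadata()
    before = log.recovery_range()
    log.periodic_sync()
    after = log.recovery_range()
    return before, after
```

Here demo() shows recovery covering all three records before the timer fires and only the dummy record afterwards.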
There are still windows when zero filled files are possible, as the
updated inode size can make it out to disk in a transaction before
all the extents do. Doing the 100% solution will require some brain
I can say though, that after sync returns a linux xfs filesystem is
now on disk to the point where it will look the same after a reboot.
Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: lord@xxxxxxx