On Fri, Jan 22, 2010 at 12:38:48PM -0600, bpm@xxxxxxx wrote:
> Hey Emmanuel,
> I did some research on this in April last year on an old, old kernel.
> One of the codepaths I flagged:
> There were small gains to be had by reordering the parent and child
> syncs where the two inodes were in the same cluster. The larger
> problem seemed to be that we're not treating the log as stable storage.
> By calling write_inode_now we've written the changes to the log first
> and then gone and also written them out to the inode.
Pretty much right, but there are historical reasons for that
behaviour. The ->write_inode() path is the only
method for the higher layers to say "write this inode to disk".
That's how XFS has been treating it for a long time - as a command
to _physically_ write a dirty inode some time after it was first
changed and the transaction is already on disk.
Unfortunately, NFS is using the same call as a method for saying
"commit this changed inode to disk immediately", which is a
different semantic to the way the sync code uses it, and physical
inode IO really hurts here.
> nfsd_create, nfsd_link, and nfsd_setattr all do this (or do in the old
> kernel I'm looking at). I have a patchset that changes
> this to an fsync so we force the log and call it good. I'll be happy to
> dust it off if someone hasn't already addressed this situation.
The delayed write inode flushing patchset I'm finalising does this.
We now have reliable tracking of dirty inodes in XFS and a method
for efficient physical writeback, so we no longer need to rely on
->write_inode to tell us to write inodes to disk. Hence the patchset
turns the inode write into an xfs_fsync() if it is a sync write or
a delayed write if it is async. I'm hoping to have that ready for
.34 inclusion sometime next week...
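
To illustrate the split, here is a rough userspace model of that
dispatch. It is only a sketch and not code from the patchset:
toy_inode, toy_write_inode() and the TOY_* values are made-up names,
and the flags merely stand in for the real log force and delayed
write queueing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes, for illustration only. */
enum toy_action {
	TOY_LOG_FORCE,	/* sync: make the logged change stable, skip inode IO */
	TOY_DELWRI	/* async: queue the inode for later physical writeback */
};

/* Hypothetical stand-in for an in-core inode. */
struct toy_inode {
	int dirty;		/* inode has changes not yet written back */
	int log_forced;		/* set when we pretend to force the log */
	int delwri_queued;	/* set when we pretend to queue a delayed write */
};

/*
 * Model of the ->write_inode() behaviour described above.
 * A sync request only needs the transaction to be stable in the log,
 * so no physical inode write is issued. An async request just marks
 * the inode for delayed write. Leaving "dirty" untouched in both
 * cases is an assumption of this toy model.
 */
static enum toy_action toy_write_inode(struct toy_inode *ip, int sync)
{
	assert(ip != NULL);

	if (sync) {
		ip->log_forced = 1;
		return TOY_LOG_FORCE;
	}

	ip->delwri_queued = 1;
	return TOY_DELWRI;
}
```

In this model the NFS-style "commit this now" request takes the sync
branch and never touches the inode buffer, which is the point of the
change.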