XFS metadata flushing design - current and future
david at fromorbit.com
Tue Aug 30 02:06:42 CDT 2011
On Tue, Aug 30, 2011 at 01:09:59AM -0400, Christoph Hellwig wrote:
> On Tue, Aug 30, 2011 at 11:28:21AM +1000, Dave Chinner wrote:
> > > One really stupid thing we do in that area is that the xfs_iflush from
> > > xfs_inode_item_push puts the buffer at the end of the delwri list and
> > > expects it to be aged, just so that the first xfs_inode_item_pushbuf
> > > can promote it to the front of the list. Now that we mostly write
> > > metadata from AIL pushing
> > Actually, when we have lots of data to be written, we still call
> > xfs_iflush() a lot from .write_inode. That's where the delayed write
> > buffer aging has significant benefit.
> One thing we could try is to log the inode there even for non-blocking
> write_inode calls to unifty the code path. But I suspect simply waiting
> for 3.3 and making future progress based on the removal of ->write_inode
> is going to to save us a lot of work and deal with different behaviour.
> > [ IOWs, the xfs_inode_item_push() simply locks the inode and returns
yeah, sorry, got it mixed up.
> > PUSHBUF if it needs flushing, then xfs_inode_item_pushbuf() calls
> > xfs_iflush() and gets the dirty buffer back, which it then adds the
> > buffer to a local dispatch list rather than submitting IO directly. ]
> Why not keep using _push for this?
Because if we are returning a buffer, then it makes sense to use
pushbuf and change the prototype for that operation and leave
IOP_PUSH completely unchanged...
> > FWIW, what this discussion is indicating to me is that we should
> > timestamp entries in the AIL so we can push it to a time threshold
> > as well as a LSN threshold.
> > That is, whenever we insert a new entry into the AIL, we not only
> > update the LSN and position, we also update the insert time of the
> > buffer. We can then update the time based threshold every few
> > seconds and have the AIL wq walker walk until the time threshold is
> > reached pushing items to disk. This would also make the xfssyncd
> > "push the entire AIL" change to "push anything older than 30s" which
> > is much more desirable from a sustained workload POV.
> Sounds reasonable. Except that we need to do the timestamp on the log
> item and not the buffer given that we might often not even have a
> buffer at AIL insertation time.
yeah, that's what i intended that to mean, sorry if it wasn't clear.
> > If we then modify the way all buffers are treated such that
> > the AIL is the "delayed write list" (i.e. we take a reference count
> > to the buffer when it is first logged and not in the AIL) and the
> > pushbuf operations simply add the buffer to the local dispatch list,
> > we can get rid of the delwri buffer list altogether. That also gets
> > rid of the xfsbufd, too, as the AIL handles the aging, reference
> > counting and writeback of the buffers entirely....
> That would be nice. We'll need some ways to deal with the delwri
> buffers from quotacheck and log recovery if we do this, but we could
> just revert to the good old async buffers if we want to keep things
> simple. Alternatively we could keep local buffer submission lists
> in quotacheck and pass2 of log recovery, similar to what you suggested
> for the AIL worked.
> We could also check if we can get away with not needing lists managed
> by us at all and rely on the on-stack plugging, which I'm about to move
> up from the request layer to the bio layer and thus making generally
The advantage of using our own list is that we can still then sort
them (might be thousands of buffers we queue in a single pass)
before submitting them for IO. The on-stack plugging doesn't allow
this at all, IIUC, as itis really just a FIFO list above the IO
david at fromorbit.com
More information about the xfs