Marcelo Tosatti wrote:
> On Fri, 19 Jan 2001, Steve Lord wrote:
> > Getting to this stage would involve abstracting knowledge of buffer
> > heads out of the vm and hiding them behind another flush method (I
> > think).
> That's ->writepage(), basically. The problem with write clustering right
> now is that there is no abstraction defined.
> > The flush call would have to be free to flush more or different data
> > than it was requested to.
> The VM must "control" what is flushed, IMO:
> - the address space owner does not have access to the aging information VM
> code has. For example, we probably want to only cluster pages which are on
> the inactive dirty list, and this information belongs to VM.
> - In low memory conditions, the VM knows the right balancing between
> swapping and syncing of pages.
> - If the write clustering gets done at VM level, we potentially avoid code
There are other aspects of write clustering which are important, at least
as far as XFS is concerned. When XFS is asked to flush a delayed allocate
page, it allocates space for that page and for all other delalloc space
contiguous with it in the file. Even if some of this space is not
aged enough for the VM to consider it a candidate for flushing, if we
are already doing a disk I/O for pages which are contiguous with it on
disk, we can clean it as part of the same disk operation.
Breaking this disk I/O into multiple ones based on age is a
trade-off between the page getting remodified later on (requiring another
write) and having to do more disk I/O later to clean the page. I would
contend that writing the extra pages (within some reasonable bound, to
avoid saturating the devices with a huge request) is going to end up
with less disk I/O than we would otherwise have.
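
To make the "reasonable bound" concrete, here is a minimal sketch in plain
user-space C. It is not XFS or VM code: the names (bounded_cluster,
MAX_CLUSTER_PAGES) and the cap of 16 pages are assumptions for illustration,
and the dirty[] array stands in for "dirty and contiguous on disk with its
neighbours". Starting from the page the VM asked for, it grows the range over
adjacent dirty pages in both directions until it hits a clean page or the cap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cap on pages per clustered request (not from the thread). */
#define MAX_CLUSTER_PAGES 16

struct cluster_range {
    size_t first;   /* index of the first page to write */
    size_t count;   /* number of pages to write */
};

/*
 * dirty[i] != 0 means page i is dirty and (by assumption) contiguous on
 * disk with its dirty neighbours. 'target' is the page the VM requested.
 */
static struct cluster_range
bounded_cluster(const unsigned char *dirty, size_t npages, size_t target)
{
    struct cluster_range r;
    size_t lo = target, hi = target;

    assert(target < npages && dirty[target]);

    /* Each pass tries one step forward, then one step backward. */
    while (hi - lo + 1 < MAX_CLUSTER_PAGES) {
        int grew = 0;

        if (hi + 1 < npages && dirty[hi + 1]) {
            hi++;
            grew = 1;
        }
        if (hi - lo + 1 < MAX_CLUSTER_PAGES && lo > 0 && dirty[lo - 1]) {
            lo--;
            grew = 1;
        }
        if (!grew)
            break;  /* clean page or file boundary on both sides */
    }

    r.first = lo;
    r.count = hi - lo + 1;
    return r;
}
```

The age of the neighbouring pages is deliberately ignored here; a real policy
could weigh it against the cost of a second I/O later.
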
> Obviously the VM does not have knowledge about the filesystem low-level
> information, which is also needed to get write clustering right.
> To solve that, we can add a new operation to the address-space structure
> which allows the address space owner to inform the VM whether clustering is
> worthwhile in a given range of disk. Something like this:
> int (*cluster)(struct page *page, unsigned long *boffset,
> unsigned long *foffset);
> page = page currently being written by VM
> b/foffset = Pointers passed to the address space owner so it can report
> the backward/forward offsets, starting from the logical offset in the inode
> (page->index), up to which write clustering is worth doing.
> I hope to have something similar working soon.
Having cluster and bmap as distinct operations means two calls into the
filesystem to work out where pages live on disk. If we are going to
cluster the actual I/O and have the VM system do the clustering work,
then these may be better combined into a single call.
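
A combined call might look something like the sketch below. This is a toy
user-space model, not a proposed kernel interface: toy_extent, cluster_map
and toy_cluster_map are hypothetical names, and it assumes one disk block
per page. One lookup returns both the backward/forward clustering offsets
(as in the cluster prototype quoted above) and the disk block for the page.

```c
#include <assert.h>
#include <stddef.h>

/* Toy extent: 'len' pages starting at file page 'file_start' live at
 * consecutive disk blocks starting at 'disk_start' (1 block per page). */
struct toy_extent {
    unsigned long file_start;
    unsigned long disk_start;
    unsigned long len;
};

/* Result of the combined lookup (hypothetical layout). */
struct cluster_map {
    unsigned long boffset;     /* contiguous pages available before index */
    unsigned long foffset;     /* contiguous pages available after index */
    unsigned long disk_block;  /* disk block backing the page itself */
};

/* Returns 0 and fills *out if 'index' is mapped, -1 otherwise. */
static int toy_cluster_map(const struct toy_extent *ext, size_t next,
                           unsigned long index, struct cluster_map *out)
{
    size_t i;

    for (i = 0; i < next; i++) {
        const struct toy_extent *e = &ext[i];

        if (index >= e->file_start && index - e->file_start < e->len) {
            out->boffset = index - e->file_start;
            out->foffset = e->len - 1 - out->boffset;
            out->disk_block = e->disk_start + out->boffset;
            return 0;
        }
    }
    return -1;  /* hole or beyond the mapped range */
}
```

The point is only that the extent lookup which answers "where is this page"
already knows how far the contiguous range runs in each direction.
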
There is one other snag which this will cause XFS. There is a feature we
have not implemented yet on Linux which will require XFS to obtain control
again after the I/O for a write has completed in some cases. XFS supports
space preallocation, where an extent is created in advance for data. When
an application knows it will require a set amount of space, this gives the
allocator a better chance of producing a good on-disk layout. Since this
space has not been written to, we need to prevent it from being read,
because it could contain any data. Therefore the extent is marked as
unwritten when we preallocate, and unwritten extents always read as zeros.
We need to change this state after the write completes. From the point of
view of XFS, letting it initiate the actual write would let it do this.
All this may mean is redirecting the actual I/O request through a vector
which would have a default implementation.
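
As a rough illustration of such a vector, the following user-space sketch
routes the write through a function pointer with a default fallback. All of
the names (write_ops, submit_write, prealloc_extent and so on) are
hypothetical and the "I/O" is just a counter; it only shows how a filesystem
override could flip an extent from unwritten to written once the write has
completed, while other filesystems keep the default path.

```c
#include <assert.h>
#include <stddef.h>

enum ext_state { EXT_UNWRITTEN, EXT_WRITTEN };

/* Toy preallocated extent. */
struct prealloc_extent {
    enum ext_state state;
    int writes_done;    /* writes that reached the pretend device */
};

/* Hypothetical per-filesystem vector; a NULL hook means "use the default". */
struct write_ops {
    int (*submit_write)(struct prealloc_extent *ext);
};

/* Default path: stands in for issuing the disk I/O and waiting for it. */
static int default_submit_write(struct prealloc_extent *ext)
{
    ext->writes_done++;
    return 0;
}

/* Filesystem override: do the write, then convert the extent state. */
static int fs_submit_write(struct prealloc_extent *ext)
{
    int err = default_submit_write(ext);

    if (err == 0 && ext->state == EXT_UNWRITTEN)
        ext->state = EXT_WRITTEN;   /* data is now safe to read back */
    return err;
}

/* What the VM side would call: dispatch through the vector if present. */
static int vm_write_extent(const struct write_ops *ops,
                           struct prealloc_extent *ext)
{
    if (ops && ops->submit_write)
        return ops->submit_write(ext);
    return default_submit_write(ext);
}

/* Unwritten extents must read back as zeros rather than stale disk data. */
static int read_is_zero_filled(const struct prealloc_extent *ext)
{
    return ext->state == EXT_UNWRITTEN;
}
```

In a real implementation the state change would happen in an I/O completion
path rather than synchronously, which this sketch does not attempt to model.
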
> > Second, your initial comment misses one of the points of the page cleaner:
> > it is the only thread of activity which is going to move delalloc pages
> > out to disk. If it were based purely on aging out pages due to pressure, then
> > you could end up with data written to files not getting flushed to disk
> > for a very long time. Delayed allocate data needs to be treated more like
> > other writes to disk.