On Fri, 19 Jan 2001, Steve Lord wrote:
> There are other aspects of write clustering which are important - at least
> as far as xfs is concerned.
I would like to have a write clustering scheme which can work with advanced
features (such as delayed allocation) even if we don't have those features
right now in the kernel. Otherwise we'll end up having different
implementations (hum, kludges) for the same thing which are hard to
maintain.
> When xfs is asked to flush a delayed allocate
> page it goes and allocates space for it and all other delalloc space
> contiguous with it in the file. Even if some of this space is not
> aged enough for the vm to consider it a candidate for flushing, if we
> are going to be doing a disk I/O for pages which are on disk contiguous
> with it, we can as part of the same disk operation clean this one too.
>
> Deciding to break this disk I/O into multiple ones based on age is a
> trade off between the page getting remodified later on (requiring another
> write) and having to do more disk I/O later to clean the page. I would
> contend that writing the extra pages (within some reasonable bound to
> avoid saturating the devices with a huge request) is going to end up
> with less disk I/O than we would otherwise have.
With the ->cluster() operation, XFS can allocate these contiguous pages
and tell the VM subsystem about them (with the "boffset" and "foffset"
pointers).
What do you think?
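To make the idea concrete, here is a rough userspace sketch of what such a ->cluster() hook could compute. It is not kernel code: the toy_page and toy_file structures, the max_cluster bound and the toy_cluster() name are all made up for illustration, and "worth clustering" is reduced to "dirty and adjacent in the file". A real filesystem would also have to check that the pages are contiguous on disk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model, not kernel code: one entry per file page. */
struct toy_page {
    unsigned long index;   /* logical page offset in the file (page->index) */
    int dirty;             /* page holds data that is not yet on disk */
};

struct toy_file {
    struct toy_page *pages;
    unsigned long nr_pages;
    unsigned long max_cluster; /* per-direction bound, so that one request
                                  cannot saturate the device */
};

/*
 * Models the proposed ->cluster() hook. Starting from the page at 'index',
 * report how many pages backward (*boffset) and forward (*foffset) are
 * worth writing in the same I/O. Returns 0 on success, -1 if the page is
 * out of range or clean. Assumption: "worth" means dirty and adjacent.
 */
static int toy_cluster(const struct toy_file *f, unsigned long index,
                       unsigned long *boffset, unsigned long *foffset)
{
    unsigned long b = 0, fw = 0;

    assert(f != NULL && boffset != NULL && foffset != NULL);

    if (index >= f->nr_pages || !f->pages[index].dirty)
        return -1;

    /* Walk backward while the previous page is dirty and within the bound. */
    while (b < f->max_cluster && index - b > 0 &&
           f->pages[index - b - 1].dirty)
        b++;

    /* Walk forward while the next page is dirty and within the bound. */
    while (fw < f->max_cluster && index + fw + 1 < f->nr_pages &&
           f->pages[index + fw + 1].dirty)
        fw++;

    *boffset = b;
    *foffset = fw;
    return 0;
}
```

The VM would then write pages index - boffset through index + foffset in one go, regardless of how old each individual page is.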
> >
> > Obviously the VM does not have knowledge about the filesystem low-level
> > information, which is also needed to get write clustering right.
> >
> > To solve that, we can add a new operation to the address-space structure
> > which can allow the address space owner to inform the VM if clustering is
> > worthwhile in a given range of disk. Something like this:
> >
> > int (*cluster)(struct page *page, unsigned long *boffset,
> > unsigned long *foffset);
> >
> > page = page currently being written by VM
> >
> > b/foffset = Pointers passed to the address space owner so it can report
> > the backward/forward offsets, starting from the logical offset in the
> > inode (page->index), describing up to where write clustering is
> > worthwhile.
> >
> > I hope to have something similar working soon.
> >
> > Comments?
>
> Having cluster and a bmap operation as distinct operations means two calls
> into the filesystem to work out where pages live on disk. If we are going
> to cluster the actual I/O and have the vm system do the clustering work,
> then this may be better combined into a single call.
They are different (for example the delayed allocation issue).
>
> There is one other snag this will cause for XFS: there is a feature we
> have not implemented yet on Linux which will require xfs to obtain control
> again after the I/O for a write has completed in some cases. XFS supports
> space preallocation, where an extent is created in advance for data - when
> an application knows it will require a set amount of space, this gives the
> allocator a better chance of producing a good on-disk layout. Now since this
> space has not been written to, we need to prevent it from being read,
> since it could contain any data. Therefore the extent is marked as unwritten
> when we preallocate; unwritten extents always read as zeros. We need to
> change this state after the write completes. From the point of view of XFS,
> letting it initiate the actual write would let it do this. All this may
> mean is redirecting the actual I/O request through a vector which would
> have a default implementation.
Do you think it's OK for XFS to handle this special case in its own
->writepage()?
Do you have more special cases? :)
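For reference, this is roughly how I picture the unwritten extent case: the write goes through a completion vector with a default implementation, and XFS plugs in its own. This is only a userspace toy under my own assumptions. The toy_extent structure, the toy_end_io_t callback type and all the toy_* functions are invented here and do not correspond to any existing kernel or XFS interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a preallocated extent, not an XFS structure. */
enum toy_extent_state { TOY_EXT_UNWRITTEN, TOY_EXT_WRITTEN };

struct toy_extent {
    enum toy_extent_state state;
    char data;   /* stand-in for whatever is on disk in that space */
};

/* Completion vector: runs after the I/O for a write has finished. */
typedef void (*toy_end_io_t)(struct toy_extent *ext);

/* Default implementation: nothing to do once the write is complete. */
static void toy_default_end_io(struct toy_extent *ext)
{
    (void)ext;
}

/* XFS-like implementation: the extent becomes readable only now. */
static void toy_convert_end_io(struct toy_extent *ext)
{
    if (ext->state == TOY_EXT_UNWRITTEN)
        ext->state = TOY_EXT_WRITTEN;
}

/* Toy ->writepage(): do the "disk write", then hand control back. */
static void toy_writepage(struct toy_extent *ext, char byte,
                          toy_end_io_t end_io)
{
    assert(ext != NULL && end_io != NULL);
    ext->data = byte;
    end_io(ext);
}

/* Unwritten extents always read as zeros, whatever is on disk. */
static char toy_read(const struct toy_extent *ext)
{
    return ext->state == TOY_EXT_WRITTEN ? ext->data : 0;
}
```

With the default completion the extent would stay unwritten and keep reading as zeros even after the write, which is exactly why XFS needs to get control back at that point.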