On Fri, Apr 15, 2011 at 04:28:37PM -0600, Andreas Dilger wrote:
> On 2011-04-15, at 11:26 AM, Christoph Hellwig wrote:
> > On Fri, Apr 15, 2011 at 11:24:19AM -0600, Eric Blake wrote:
> >> Would it be worth borrowing from Solaris' semantics and adding SEEK_HOLE
> >> and SEEK_DATA to lseek(2), as a higher level (less-detailed, but easier
> >> to define and easier to use) interface for discovering the regions of a
> >> file that only contain NUL bytes?
> > Yes, I've already suggested that both in this thread and on IRC.
> > For efficient copies it's the only usable interface.
> I suspect that these bugs would have still existed whether the
> interface is SEEK_HOLE/SEEK_DATA, or FIEMAP. The main problem is
> that the delalloc pages were not accounted for correctly during
> layout traversal..
It's not delalloc that is the problem - XFS accounts for them just
fine in the extent map when asked. However, XFS does speculative
delayed allocation over regions that contain no data, so if the
core-utils folk are assuming that delalloc extents contain data and
need to be copied, they're in for a nasty surprise.
However, every example I've seen in this thread has had to do with
unwritten extents not changing state when data is written into the
page cache. i.e. people are struggling with the expected behaviour
of unwritten extents.
That is, unwritten extent remain unwritten extents until data has
been _physically_ written to them. If there is data in the page
cache over the unwritten extent, it is still an unwritten extent.
If the system crashes while in this state, then the extent _must_
remain an unwritten extent after recovery, otherwise it exposes
Further, using FIEMAP to determine where the data is that needs
copying is extremely fragile. What happens when FIEMAP grows a
different type of extent that contains data? cp will break, because
it doesn't think it needs to copy data in extents of an unknown
type. Or it will break because it thinks it needs to copy it and
there's something in it that should not be copied.
Also, cp shoul dnot be trying to replicate the physical layout of
the file when copying it - that's for the filesystem to decide and
having userspace try to do this is a sure recipe for causing severe
filesystem fragmentation. The filesystems already do an excellent
job of optimising allocation - userspace should not be trying to
second guess what is optimal layout for the filesystem.
Fundamentally, what the core-utils guys want is FIEMAP to tell them
where data is in the file, regardless of whether it is in memory or
on disk. That is not what FIEMAP is intended for and matches
SEEK_HOLE/SEEK_DATA have very well understood semantics and is
designed specifically for optimising acceess to sparse files. This
interface abstracts all the details of how different filesystems
store their data so the application doesn't need to care about it.
The API is so, so much simpler to use and understand, to. And if the
filesystem has data in cache over an unwritten extent, then by
definition it's still data to be returned by SEEK_DATA. If it fails
to return the range as such then the implementation is broken.