On Thu, Jul 29, 2010 at 08:53:24PM -0600, Matthew Wilcox wrote:
> On Fri, Jul 30, 2010 at 08:45:16AM +1000, Dave Chinner wrote:
> > If we get two unaligned direct IO's to the same filesystem block
> > that is marked as a new allocation (i.e. buffer_new), then both IOs will
> > zero the portion of the block they are not writing data to. As a
> > result, when the IOs complete there will be a portion of the block
> > that contains zeros from the last IO to complete rather than the
> > data that should be there.
> Urgh. Yuck.
> > This is easily manifested by qemu using aio+dio with an unaligned
> > guest filesystem - every IO is unaligned and fileystem corruption is
> > encountered in the guest filesystem. xfstest 240 (from Eric Sandeen)
> > is also a simple reproducer.
> > To avoid this problem, track unaligned IO that triggers sub-block zeroing
> > and
> > check new incoming unaligned IO that require sub-block zeroing against that
> > list. If we get an overlap where the start and end of unaligned IOs hit the
> > same filesystem block, then we need to block the incoming IOs until the IO
> > that
> > is zeroing the block completes. The blocked IO can then continue without
> > needing to do any zeroing and hence won't overwrite valid data with zeros.
> Urgh. Yuck.
It's better than silent data corruption.
> Could we perhaps handle this by making an IO instantiate a page cache
> page for partial writes, and forcing that portion of the IO through the
> page cache? The second IO would hit the same page and use the existing
> O_DIRECT vs page cache paths.
I don't want any direct IO for XFS to go through the page cache -
unaligned or not. using the page cache for the unaligned blocks
would also be much worse for performance that this method because it
turns unaligned direct IO into 3 IOs - the unaligned head block, the
aligned body and the unaligned tail block. It would also be a
performance hit you take on every single dio, whereas this way the
hit is only taken when an overlap is detected.
And besides, such decisions on whether to use buffered IO have to be
made high up in the filesystem when we are deciding how to lock the
inode for the dio - buffered IO requires exclusive locking, not the
shared locking we do for dio writes. That further reduces the
performance of unaligned direct IO even when there are no overlaps,
and it's a solution that every filesystem needs to implement
themselves in some way.
I've looked at a couple of different ways to fix this (e.g. passing
the unaligned state to get_blocks() to allow the filesystem to
serialise there) but they've all died a horrible death of dodgy
locking and hacky IO completion detection. not to mention
requiring a different solution in every filesystem.
This way may be a bit ugly, but it works, is isolated and seems to
me to be the least-worst way of solving the problem. It could be
made to scale better by making the tracking per-inode, but I don't
think that growing the struct inode for this corner case is a good