On Sun, Jan 16, 2011 at 07:28:27PM -0500, Lachlan McIlroy wrote:
> ----- Original Message -----
> > Hi Dave,
> > On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> > > I've noticed a few suspicious things trying to reproduce the
> > > allocate-in-the-middle-of-a-delalloc-extent,
> > ...
> > > Secondly, I think we have the same expose-the-entire-delalloc-extent
> > > -to-stale-data-exposure problem in ->writepage. This onnne, however,
> > > is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
> > > the first time any part of it is written to. Even if we are only
> > > writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
> > > extent covers gigabytes. So, same problem when we crash.
> > >
> > > Finally, I think the extsize based problem exposed by test 229 is a
> > > also a result of allocating space we have no pages covering in the
> > > page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
> > > space is never zeroed and hence exposes stale data.
> > This is precisely the bug I was going after when I hit the
> > allocate-in-the-middle-of-a-delalloc-extent bug. This is a race
> > between
> > block_prepare_write/__xfs_get_blocks and writepage/xfs_page_state
> > convert. When xfs_page_state_convert allocates a real extent for a
> > page
> > toward the beginning of a delalloc extent, XFS_BMAPI converts the
> > entire
> > delalloc extent. Any subsequent writes into the page cache toward the
> > end of this freshly allocated extent will see a written extent instead
> > of delalloc and read the block from disk into the page before writing
> > over it. If the write does not cover the entire page garbage from disk
> > will be exposed into the page cache.
> Ben, take a look at the XFS gap list code in IRIX - this code was
> designed specifically to handle this problem. It's also implemented in
> several of the cxfs clients too. On entry to a write() it will create a
> list of all the holes that exist in the write range before any delayed
> allocations are created. The first call to xfs_bmapi() sets up the delayed
> allocation for the entire write (even if the bmaps returned don't cover the
> full write range). If the write results in a fault from disk then it checks
> the gap list to see if any sections of the buffer used to cover a hole and
> if so it ignores the state of the extent and zeroes those region(s) in the
> buffer that match the pre-existing holes. If the buffer has multiple
> non-hole sections that need to be read from disk the whole buffer will be
> read from disk and the zeroing of the holes is done post I/O - this reduces
> the number of I/Os to be done. The whole delayed allocation can be safely
> converted at any time without risk of reading exposed data (assuming no
> crash that is).
IMO, that caveat exposes the problem with this approach - it is not
persistent. It adds a lot of complexity but doesn't solve the
underlying problem: that we are exposing real extents via allocation
without having written any data to them.
> As the write progresses through the range it removes
> sections from the front of the gap list so by the time the write is complete
> the gap list is empty. The gap list does not have a dedicated lock to
> protect it but instead relies on the iolock to ensure that only one write
> operation occurs at once (so it's not appropriate for direct I/O).
> > <snip>
> > > I'm sure there are other ways to solve these problems, but these two
> > > are the ones that come to mind for me. I'm open to other solutions
> > > or ways to improve on these ones, especially if they are simpler. ;)
> > > Anyone got any ideas or improvements?
> > The direction I've been taking is to use XFS_BMAPI_EXACT in
> > *xfs_iomap_write_allocate to ensure that an extent covering exactly
> > the
> > pages we're prepared to write out immediately is allocated and the
> > rest
> > of the delalloc extent is left as is. This exercises some of the btree
> > code more heavily and led to the discovery of the
> > allocate-in-the-middle-of-a-delalloc-extent bug. It also presents a
> > performance issue which I've tried to resolve by extending
> > xfs_probe_cluster to probe delalloc extents-- lock up all of the pages
> > to be converted before performing the allocation and hold those locks
> > until they are submitted for writeback. It's not very pretty but it
> > resolves the corruption.
> > There is still the issue of crashes... This could be solved by
> > converting from delalloc to unwritten in xfs_page_state_convert in
> > this
> > very exact way and then to written in the io completion handler. Never
> > go delalloc->written directly.
> That solution would probably make the gap list redundant too.
Yes, I think you are right. The "gaps" would get mapped as unwritten
rather than as normal extents and hence get zereod appropriately. It
would also preventing exposure after a crash due to being