On Fri, Jan 14, 2011 at 11:29:00AM +1100, Dave Chinner wrote:
> I've noticed a few suspicious things trying to reproduce the
> Secondly, I think we have the same expose-the-entire-delalloc-extent
> -to-stale-data-exposure problem in ->writepage. This onnne, however,
> is due to using BMAPI_ENTIRE to allocate the entire delalloc extent
> the first time any part of it is written to. Even if we are only
> writing a single page (i.e. wbc->nr_to_write = 1) and the delalloc
> extent covers gigabytes. So, same problem when we crash.
> Finally, I think the extsize based problem exposed by test 229 is a
> also a result of allocating space we have no pages covering in the
> page cache (triggered by BMAPI_ENTIRE allocation) so the allocated
> space is never zeroed and hence exposes stale data.
This is precisely the bug I was going after when I hit the
allocate-in-the-middle-of-a-delalloc-extent bug. This is a race between
block_prepare_write/__xfs_get_blocks and writepage/xfs_page_state
convert. When xfs_page_state_convert allocates a real extent for a page
toward the beginning of a delalloc extent, XFS_BMAPI converts the entire
delalloc extent. Any subsequent writes into the page cache toward the
end of this freshly allocated extent will see a written extent instead
of delalloc and read the block from disk into the page before writing
over it. If the write does not cover the entire page garbage from disk
will be exposed into the page cache.
> I'm sure there are other ways to solve these problems, but these two
> are the ones that come to mind for me. I'm open to other solutions
> or ways to improve on these ones, especially if they are simpler. ;)
> Anyone got any ideas or improvements?
The direction I've been taking is to use XFS_BMAPI_EXACT in
*xfs_iomap_write_allocate to ensure that an extent covering exactly the
pages we're prepared to write out immediately is allocated and the rest
of the delalloc extent is left as is. This exercises some of the btree
code more heavily and led to the discovery of the
allocate-in-the-middle-of-a-delalloc-extent bug. It also presents a
performance issue which I've tried to resolve by extending
xfs_probe_cluster to probe delalloc extents-- lock up all of the pages
to be converted before performing the allocation and hold those locks
until they are submitted for writeback. It's not very pretty but it
resolves the corruption.
There is still the issue of crashes... This could be solved by
converting from delalloc to unwritten in xfs_page_state_convert in this
very exact way and then to written in the io completion handler. Never
go delalloc->written directly.
I have not had luck reproducing this on TOT xfs and have come to realize
that this is because it doesn't do speculative preallocation of larger
delalloc extents unless you are using extsize... which I haven't tried.