xfs
[Top] [All Lists]

Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH v2 2/2] dax: move writeback calls into the filesystems
From: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Thu, 11 Feb 2016 14:59:14 -0800
Cc: Jan Kara <jack@xxxxxxx>, Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>, "linux-kernel@xxxxxxxxxxxxxxx" <linux-kernel@xxxxxxxxxxxxxxx>, "Theodore Ts'o" <tytso@xxxxxxx>, Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>, Andreas Dilger <adilger.kernel@xxxxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Jan Kara <jack@xxxxxxxx>, Matthew Wilcox <willy@xxxxxxxxxxxxxxx>, linux-ext4 <linux-ext4@xxxxxxxxxxxxxxx>, linux-fsdevel <linux-fsdevel@xxxxxxxxxxxxxxx>, Linux MM <linux-mm@xxxxxxxxx>, "linux-nvdimm@xxxxxxxxxxxx" <linux-nvdimm@xxxxxxxxxxxx>, XFS Developers <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=intel-com.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=pWlY/wAm4epetfdu5faDBRBQBKKZVi21avR8ela4rhE=; b=d0FFHnqoGHoFIhv64YzqBrrj71T7tgxs6zghGxonio69hFN0eKihydEUhtx+fsPdAQ WNAPExq4PFZwv6CSAy706aewRsirq9V3CnfTRdKpRFzOSl6syX7TFKMA52xQs6cCLIyt XhHVF6v5IRoLlTay36Et2deZauBUsxbVNSO6En3VpiKtTmml6nTZYuPeTl9ukZsnzkuY 1K79kANVKHdWglcHaenIme8+uY/qrBMh2ujf82chB0YY/qIhLtLyih+VTo+t+tOCQi+X v+JJ/lrV3TNjwl7gw0WEQhTSrTTFALCGUOpJp97DGAqih2d/IPUYaHeeAd5TqMEjWx6y uFMA==
In-reply-to: <20160211224616.GL19486@dastard>
References: <1455137336-28720-1-git-send-email-ross.zwisler@xxxxxxxxxxxxxxx> <1455137336-28720-3-git-send-email-ross.zwisler@xxxxxxxxxxxxxxx> <20160210220312.GP14668@dastard> <20160210224340.GA30938@xxxxxxxxxxxxxxx> <20160211125044.GJ21760@xxxxxxxxxxxxx> <CAPcyv4g60iOTd-ShBCfsK+B7xArcc5pWXWktNop53otDbUW-3g@xxxxxxxxxxxxxx> <20160211204635.GI19486@dastard> <CAPcyv4h4u+LB5U5nm4Jo32r=33D02yv36k5QxmJoy3DRiHmQEQ@xxxxxxxxxxxxxx> <20160211224616.GL19486@dastard>
On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote:
>> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> [..]
>> >> It seems to me we need to modify the
>> >> metadata i/o paths to bypass the page cache,
>> >
>> > XFS doesn't use the block device page cache for it's metadata - it
>> > has it's own internal metadata cache structures and uses get_pages
>> > or heap memory to back it's metadata. But that doesn't make mixing
>> > DAX and pages in the block device mapping tree sane.
>> >
>> > What you are missing here is that the underlying architecture of
>> > journalling filesystems mean they can't use DAX for their metadata.
>> > Modifications have to be buffered, because they have to be written
>> > to the journal first before they are written back in place. IOWs, we
>> > need to buffer changes in volatile memory for some time, and that
>> > means we can't use DAX during transactional modifications.
>> >
>> > And to put the final nail in that coffin, metadata in XFS can be
>> > discontiguous multi-block objects - in those situations we vmap the
>> > underlying pages so they appear to the code to be a contiguous
>> > buffer, and that's something we can't do with DAX....
>>
>> Sorry, I wasn't clear when I said "bypass page cache" I meant a
>> solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition
>> table reads".
>
> So there's already bandaids to prevent bad shit from happening in
> the block layer, let alone when we consider all the ways that
> userspace can screw this all up.
>
>> However, I suspect that is broken if the filesystem is not ready
>> to see a new page allocated for every I/O.  I assume one
>> thread will want to insert a page in the radix for another thread
>> to find/manipulate before metadata gets written back to storage.
>
> Right, you can't do that, especially as the struct page has a 1-1
> relationship with the bufferhead that is attached to it as the
> bufferhead carries the filesystem state for the given cached page.
>
>> >> or teach the fsync code how to flush populated data pages out
>> >> of the radix.
>> >
>> > That doesn't solve the problem. Filesystems free and reallocate
>> > filesystem blocks without intermediate block device mapping
>> > invalidation calls, so what is one minute a data block accessed
>> > by DAX may become a metadata block that accessed via buffered
>> > IO.  It all goes to crap very quickly....
>> >
>> > However, I'd say fsync is not the place to address this. This
>> > block device cache aliasing issue is supposed to be what
>> > unmap_underlying_metadata() solves, right?
>>
>> I'll take a look at this.  Right now I'm trying to implement the
>> "clear block-device-inode S_DAX on fs mount" approach.  My concern
>> though is that  we need to disable block device mmap while a
>> filesystem is mounted...
>
> /me chokes on his coffee.
>
> When did mmaping the block device behind the back of a mounted
> fileystem become a valid use case? It's not supported for normal
> block devices and for the same reasons it won't be supported for DAX
> enabled block devices, either. i.e. I'm going to tell anyone who has
> an application that does this to go and take a hike when (not if!)
> they report filesystem corruption problems.

Right, but we need to not confuse the fsync code regardless of how bad
of an idea this is ::-).

>> Maybe I don't need to worry because it's already the case that a
>> mmap of the raw device may not see the most up to date data for a
>> file that has dirty fs-page-cache data.
>
> It goes both ways. What happens if mkfs or fsck modifies the
> block device via mmap+DAX and then the filesystem mounts the block
> device and tries to read that metadata via the block device page
> cache?
>
> Quite frankly, DAX on the block device is a can of worms we really
> don't need to deal with right now. IMO it's a solution looking for a
> problem to solve,

Virtualization use cases want to give large ranges to guest-VMs, and
it is currently the only way to reliably get 1GiB mappings.

> the "default to on" policy is wrong (DAX is
> opt-in, not opt-out) and given this we should turn it off until
> we've solved the more important problems we need to solve. i.e. We
> need to concentrate on getting data integrity working correctly
> first, then address the cache aliasing issues, then address the
> "safe access" issues, and then we can re-introduce block device DAX
> access...

Agreed.

Note that the "default-on policy" came from commit bbab37ddc20b
"block: Add support for DAX reads/writes to block devices" way back in
4.2.  We're just now noticing.  Credit Ross for good sanity checking.

<Prev in Thread] Current Thread [Next in Thread>