xfs
[Top] [All Lists]

Re: [PATCH v8 6/9] dax: add support for fsync/msync

To: Jan Kara <jack@xxxxxxx>
Subject: Re: [PATCH v8 6/9] dax: add support for fsync/msync
From: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
Date: Wed, 13 Jan 2016 11:58:02 -0700
Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, "H. Peter Anvin" <hpa@xxxxxxxxx>, "J. Bruce Fields" <bfields@xxxxxxxxxxxx>, Theodore Ts'o <tytso@xxxxxxx>, Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>, Andreas Dilger <adilger.kernel@xxxxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Dan Williams <dan.j.williams@xxxxxxxxx>, Dave Chinner <david@xxxxxxxxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Jan Kara <jack@xxxxxxxx>, Jeff Layton <jlayton@xxxxxxxxxxxxxxx>, Matthew Wilcox <matthew.r.wilcox@xxxxxxxxx>, Matthew Wilcox <willy@xxxxxxxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxxxxx>, linux-ext4@xxxxxxxxxxxxxxx, linux-fsdevel@xxxxxxxxxxxxxxx, linux-mm@xxxxxxxxx, linux-nvdimm@xxxxxxxxxxxx, x86@xxxxxxxxxx, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20160113093525.GD14630@xxxxxxxxxxxxx>
Mail-followup-to: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>, Jan Kara <jack@xxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, "H. Peter Anvin" <hpa@xxxxxxxxx>, "J. Bruce Fields" <bfields@xxxxxxxxxxxx>, Theodore Ts'o <tytso@xxxxxxx>, Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>, Andreas Dilger <adilger.kernel@xxxxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Dan Williams <dan.j.williams@xxxxxxxxx>, Dave Chinner <david@xxxxxxxxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Jan Kara <jack@xxxxxxxx>, Jeff Layton <jlayton@xxxxxxxxxxxxxxx>, Matthew Wilcox <matthew.r.wilcox@xxxxxxxxx>, Matthew Wilcox <willy@xxxxxxxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxxxxx>, linux-ext4@xxxxxxxxxxxxxxx, linux-fsdevel@xxxxxxxxxxxxxxx, linux-mm@xxxxxxxxx, linux-nvdimm@xxxxxxxxxxxx, x86@xxxxxxxxxx, xfs@xxxxxxxxxxx
References: <1452230879-18117-1-git-send-email-ross.zwisler@xxxxxxxxxxxxxxx> <1452230879-18117-7-git-send-email-ross.zwisler@xxxxxxxxxxxxxxx> <20160112105716.GT6262@xxxxxxxxxxxxx> <20160113073019.GB30496@xxxxxxxxxxxxxxx> <20160113093525.GD14630@xxxxxxxxxxxxx>
User-agent: Mutt/1.5.24 (2015-08-30)
On Wed, Jan 13, 2016 at 10:35:25AM +0100, Jan Kara wrote:
> On Wed 13-01-16 00:30:19, Ross Zwisler wrote:
> > > And secondly: You must write-protect all mappings of the flushed range so
> > > that you get fault when the sector gets written-to again. We spoke about
> > > this in the past already but somehow it got lost and I forgot about it as
> > > well. You need something like rmap_walk_file()...
> > 
> > The code that write protected mappings and then cleaned the radix tree 
> > entries
> > did get written, and was part of v2:
> > 
> > https://lkml.org/lkml/2015/11/13/759
> > 
> > I removed all the code that cleaned PTE entries and radix tree entries for 
> > v3.
> > The reason behind this was that there was a race that I couldn't figure out
> > how to solve between the cleaning of the PTEs and the cleaning of the radix
> > tree entries.
> > 
> > The race goes like this:
> > 
> > Thread 1 (write)                    Thread 2 (fsync)
> > ================                    ================
> > wp_pfn_shared()
> > pfn_mkwrite()
> > dax_radix_entry()
> > radix_tree_tag_set(DIRTY)
> >                                     dax_writeback_mapping_range()
> >                                     dax_writeback_one()
> >                                     radix_tag_clear(DIRTY)
> >                                     pgoff_mkclean()
> > ... return up to wp_pfn_shared()
> > wp_page_reuse()
> > pte_mkdirty()
> > 
> > After this sequence we end up with a dirty PTE that is writeable, but with a
> > clean radix tree entry.  This means that users can write to the page, but 
> > that
> > a follow-up fsync or msync won't flush this dirty data to media.
> > 
> > The overall issue is that in the write path that goes through 
> > wp_pfn_shared(),
> > the DAX code has control over when the radix tree entry is dirtied but not
> > when the PTE is made dirty and writeable.  This happens up in 
> > wp_page_reuse().
> > This means that we can't easily add locking, etc. to protect ourselves.
> > 
> > I spoke a bit about this with Dave Chinner and with Dave Hansen, but no 
> > really
> > easy solutions presented themselves in the absence of a page lock.  I do 
> > have
> > one idea, but I think it's pretty invasive and will need to wait for another
> > kernel cycle.
> > 
> > The current code that leaves the radix tree entry will give us correct
> > behavior - it'll just be less efficient because we will have an ever-growing
> > dirty set to flush.
> 
> Ahaa! Somehow I imagined tag_pages_for_writeback() clears DIRTY radix tree
> tags but it does not (I should have known, I have written that functions
> few years ago ;). Makes sense. Thanks for clarification.
> 
> > > > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > > >   * dax_pfn_mkwrite - handle first write to DAX page
> > > >   * @vma: The virtual memory area where the fault occurred
> > > >   * @vmf: The description of the fault
> > > > - *
> > > >   */
> > > >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > >  {
> > > > -       struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > > +       struct file *file = vma->vm_file;
> > > >  
> > > > -       sb_start_pagefault(sb);
> > > > -       file_update_time(vma->vm_file);
> > > > -       sb_end_pagefault(sb);
> > > > +       dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, 
> > > > true);
> > > 
> > > Why is NO_SECTOR argument correct here?
> > 
> > Right - so NO_SECTOR means "I expect there to already be an entry in the 
> > radix
> > tree - just make that entry dirty".  This works because pfn_mkwrite() always
> > follows a normal __dax_fault() or __dax_pmd_fault() call.  These fault calls
> > will insert the radix tree entry, regardless of whether the fault was for a
> > read or a write.  If the fault was for a write, the radix tree entry will 
> > also
> > be made dirty.
> >
> > For reads the radix tree entry will be inserted but left clean.  When the
> > first write happens we will get a pfn_mkwrite() call, which will call
> > dax_radix_entry() with the NO_SECTOR argument.  This will look up the radix
> > tree entry & set the dirty tag.
> 
> So the explanation of this should be somewhere so that everyone knows that
> we must have radix tree entries even for clean mapped blocks. Because upto
> know that was not clear to me.  Also __dax_pmd_fault() seems to insert
> entries only for write fault so the assumption doesn't seem to hold there?

Ah, right, sorry, the read fault() -> pfn_mkwrite() sequence only happens for
4k pages.  You are right about our handling of 2MiB pages - for a read
followed by a write we will just call into the normal __dax_pmd_fault() code
again, which will do the get_block() call and insert a dirty radix tree entry.
Because we have to go all the way through the fault handler again at write
time there isn't a benefit to inserting a clean radix tree entry on read, so
we just skip it.

> I'm somewhat uneasy that a bug in this logic can be hidden as a simple race
> with hole punching. But I guess I can live with that.
>  
>                                                               Honza
> -- 
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR

<Prev in Thread] Current Thread [Next in Thread>