
Re: [BUG] ext2/3/4: dio reads stale data when we do some append dio writ

To: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Subject: Re: [BUG] ext2/3/4: dio reads stale data when we do some append dio writes
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Tue, 19 Nov 2013 23:01:12 +1100
Cc: linux-ext4@xxxxxxxxxxxxxxx, linux-fsdevel@xxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20131119111826.GA20485@xxxxxxxxxxxxx>
References: <20131119095302.GA4534@xxxxxxxxx> <20131119102235.GA5010@xxxxxxxxxxxxx> <20131119104508.GA4630@xxxxxxxxx> <20131119110147.GA3323@xxxxxxxxxxxxx> <20131119111947.GA4782@xxxxxxxxx> <20131119111826.GA20485@xxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Tue, Nov 19, 2013 at 03:18:26AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 19, 2013 at 07:19:47PM +0800, Zheng Liu wrote:
> > Yes, I know that XFS has a shared/exclusive lock.  I guess that is why
> > it can pass the test.  But another question is why xfs fails when we do
> > some append dio writes with doing buffered read.
> Can you provide a test case for that issue?

For XFS, appending direct IO writes only hold the IOLOCK exclusive
for as long as it takes to guarantee that the region between the
old EOF and the new EOF is full of zeros before it is demoted.  i.e.
once the region is guaranteed not to expose stale data, the
exclusive IO lock is demoted to a shared lock and a buffered read
is then allowed to proceed concurrently with the DIO write.

Hence even appending writes occur concurrently with buffered reads,
and if the read overlaps the block at the old EOF then the page
brought into the page cache will have zeros in it.

FWIW, there's a wonderful comment in generic_file_direct_write()
that pretty much covers this case:

         * Finally, try again to invalidate clean pages which might have been
         * cached by non-direct readahead, or faulted in by get_user_pages()
         * if the source of the write was an mmap'ed region of the file
         * we're writing.  Either one is a pretty crazy thing to do,
         * so we don't support it 100%.  If this invalidation
         * fails, tough, the write still worked...
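
To make the sequence concrete, here is a rough single-threaded toy
model of it in Python. This is a sketch only, not kernel code: ToyFile,
its methods, the 8-byte block size and the point at which the file
size is updated are all assumptions made for illustration. It shows
the exclusive-then-shared lock demotion, a buffered read caching the
block at the old EOF while the DIO write is still in flight, and what
a later buffered read returns depending on whether the
post-write invalidation succeeds.

```python
BLOCK = 8  # toy block size in bytes; real filesystems use far larger blocks


class ToyFile:
    """Single-threaded toy model; every name here is hypothetical."""

    def __init__(self, data: bytes):
        # Disk is block-granular; bytes past EOF in the last block are zero.
        nblocks = max(1, -(-len(data) // BLOCK))
        self.disk = bytearray(data.ljust(nblocks * BLOCK, b"\0"))
        self.size = len(data)      # current EOF
        self.page_cache = {}       # block index -> cached block contents
        self.iolock = "unlocked"   # "unlocked", "excl" or "shared"

    def buffered_read(self, block: int) -> bytes:
        # A buffered reader takes the lock shared, so it must wait while
        # a writer holds it exclusive.
        if self.iolock == "excl":
            raise BlockingIOError("IOLOCK held exclusive")
        if block not in self.page_cache:
            start = block * BLOCK
            self.page_cache[block] = bytes(self.disk[start:start + BLOCK])
        end = min(BLOCK, self.size - block * BLOCK)
        return self.page_cache[block][:max(end, 0)]

    def begin_append_dio(self, length: int) -> int:
        # Exclusive only while the old-EOF..new-EOF range is made zero.
        self.iolock = "excl"
        offset, new_eof = self.size, self.size + length
        need = -(-new_eof // BLOCK) * BLOCK
        self.disk.extend(b"\0" * (need - len(self.disk)))
        self.disk[offset:new_eof] = b"\0" * length
        self.iolock = "shared"     # demote: buffered reads may now proceed
        return offset

    def finish_append_dio(self, offset: int, data: bytes, invalidate_ok: bool):
        self.disk[offset:offset + len(data)] = data  # bypasses the page cache
        self.size = max(self.size, offset + len(data))
        if invalidate_ok:
            # Drop cached blocks that overlap the range just written.
            first = offset // BLOCK
            last = (offset + len(data) - 1) // BLOCK
            for b in range(first, last + 1):
                self.page_cache.pop(b, None)
        self.iolock = "unlocked"


def run(invalidate_ok: bool) -> bytes:
    f = ToyFile(b"AAAA")
    off = f.begin_append_dio(4)   # zero the new range, demote to shared
    f.buffered_read(0)            # racing read caches the block at old EOF
    f.finish_append_dio(off, b"BBBB", invalidate_ok)
    return f.buffered_read(0)     # what a later buffered reader sees
```

In this model run(True) returns b"AAAABBBB", while run(False) returns
b"AAAA" followed by four zero bytes: the cached page still holds the
zeros it was filled with while the write was in flight.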

The kernel code simply does not have the exclusion mechanisms to
make concurrent buffered and direct IO robust. This is one of the
problems (amongst many) that we've been looking to solve with a VFS
level IO range lock of some kind....


Dave Chinner
