To: Brian Foster <bfoster@xxxxxxxxxx>
Subject: Re: fsx failure on 3.10.0-rc1+ (xfstests 263) -- Mapped Read: non-zero data past EOF
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Tue, 11 Jun 2013 07:31:00 +1000
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <51B5D1EB.9080200@xxxxxxxxxx>
References: <51B5D1EB.9080200@xxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Mon, Jun 10, 2013 at 09:17:31AM -0400, Brian Foster wrote:
> Hi guys,
> I wanted to get this onto the list... I suspect this could be
> similar/related to the issue reported here:
> http://oss.sgi.com/archives/xfs/2013-06/msg00066.html

Unlikely - generic/263 tests mmap IO vs direct IO, and Sage's
problem has neither...

> While running xfstests, the only apparent regression I hit from 3.9.0
> was generic/263. This test fails due to the following command (and
> resulting output):

Not a regression - 263 has been failing ever since it was introduced
in 2011 by:

commit 0d69e10ed15b01397e8c6fd7833fa3c2970ec024
Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date:   Mon Oct 10 18:22:16 2011 +0000

    split mapped writes vs direct I/O tests from 091
    This effectively reverts
        xfstests: add mapped write fsx operations to 091
    and adds a new test case for it.  It tests something slightly
    different, and regressions in existing tests due to new features
    are pretty nasty in a test suite.
    Signed-off-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Signed-off-by: Alex Elder <aelder@xxxxxxx>

It is testing mmap() writes vs direct IO, something that is known to
be fundamentally broken (i.e. racy) as the mmap() page fault path does
not hold the XFS_IOLOCK or i_mutex in any way.  The direct IO path
tries to work around this by flushing and invalidating cached pages
before IO submission, but the lack of locking in the page fault path
means we can't avoid the race entirely.
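
For reference, the symptom named in the subject line ("Mapped Read:
non-zero data past EOF") concerns the tail of the last mapped page: the
bytes between EOF and the end of that page are expected to read back as
zero. A minimal sketch of that kind of check follows. It is not fsx's
actual code; the function name, its parameters, and the 4096-byte default
page size are assumptions made for illustration only.

```python
def first_nonzero_past_eof(page, file_size, page_size=4096):
    """Return the in-page offset of the first non-zero byte past EOF.

    `page` holds the contents of the file's last page as seen through a
    mapping (hypothetical input for this sketch). Returns None when the
    tail is clean or when EOF is page-aligned and there is no tail.
    """
    eof_in_page = file_size % page_size
    if eof_in_page == 0:
        return None  # EOF sits on a page boundary: nothing past it to check
    for offset in range(eof_in_page, min(len(page), page_size)):
        if page[offset] != 0:
            return offset  # stale data survived beyond EOF
    return None


# A 100-byte file: bytes 100..4095 of the mapped page should be zero.
clean_page = b"A" * 100 + bytes(4096 - 100)
stale_page = bytearray(clean_page)
stale_page[200] = 0x5A  # simulate data left behind past EOF

print(first_nonzero_past_eof(clean_page, 100))
print(first_nonzero_past_eof(bytes(stale_page), 100))
```

A stale byte like the one simulated above is the sort of thing a racing
mmap write could leave in a page that direct IO has already flushed and
invalidated, which is why the check can trip even though each path looks
correct on its own.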

> P.S., I also came across the following thread which, if related,
> suggests this might be known/understood to a degree:
> http://oss.sgi.com/archives/xfs/2012-04/msg00703.html

Yup, that's potentially one aspect of it. However, have you run the
test code on ext3/4? It works just fine - it's only XFS that has
problems with this case, so it's not clear that this is a DIO
problem. I was never able to work out where ext3/ext4 were zeroing
the part of the page beyond EOF, and I couldn't ever make the DIO
code reliably do the right thing. It's one of the reasons that led
to this discussion at LSFMM:



Dave Chinner
