
Re: [PATCH] xfs: don't zero partial page cache pages during O_DIRECT

To: Chris Mason <clm@xxxxxx>
Subject: Re: [PATCH] xfs: don't zero partial page cache pages during O_DIRECT
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Sat, 9 Aug 2014 10:36:28 +1000
Cc: xfs@xxxxxxxxxxx, Eric Sandeen <sandeen@xxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <53E4E03A.7050101@xxxxxx>
References: <53E4E03A.7050101@xxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Fri, Aug 08, 2014 at 10:35:38AM -0400, Chris Mason wrote:
> xfs is using truncate_pagecache_range to invalidate the page cache
> during DIO reads.  This is different from the other filesystems who only
> invalidate pages during DIO writes.

Historical oddity thanks to wrapper functions that were kept way
longer than they should have been.

> truncate_pagecache_range is meant to be used when we are freeing the
> underlying data structs from disk, so it will zero any partial ranges
> in the page.  This means a DIO read can zero out part of the page cache
> page, and it is possible the page will stay in cache.

commit fb59581 ("xfs: remove xfs_flushinval_pages") also removed
the offset masks that seem to be the issue here. Classic case of a
regression caused by removing 10+ year old code that was not clearly
documented and didn't appear important.
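
[Editorial sketch, not part of the original mail.] The effect of the missing offset mask can be modelled in user space. The code below is only an illustration under stated assumptions: a 4 KiB page, and a truncate-style operation that zeroes the tail of the page containing an unaligned start offset, as Chris describes above. SIM_PAGE_SIZE, SIM_PAGE_MASK and sim_truncate_partial() are hypothetical names, not kernel identifiers.

```c
#include <stddef.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096UL                  /* assumed page size */
#define SIM_PAGE_MASK (~(SIM_PAGE_SIZE - 1))  /* clears the in-page offset bits */

/*
 * Model of a truncate-style range operation starting at byte offset
 * 'start' and running to end of file. 'page' stands in for the cached
 * page that contains 'start'.
 *
 * If 'start' is page aligned, the whole page would be dropped from the
 * cache and nothing is zeroed in place. Otherwise the bytes from
 * 'start' to the end of that page are zeroed and the page stays cached.
 *
 * Returns the number of bytes zeroed in place.
 */
static size_t sim_truncate_partial(unsigned char *page, unsigned long start)
{
	size_t off = start & (SIM_PAGE_SIZE - 1);

	if (off == 0)
		return 0;
	memset(page + off, 0, SIM_PAGE_SIZE - off);
	return SIM_PAGE_SIZE - off;
}
```

With start = 512 the call zeroes bytes 512..4095 of the model page. With start = 512 & SIM_PAGE_MASK (that is, 0) nothing is zeroed in place, which is the behaviour a page-aligned start offset gives.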

The real question is why aren't fsx and other corner case data
integrity tools tripping over this?

> buffered reads will find an up to date page with zeros instead of the
> data actually on disk.
> This patch fixes things by leaving the page cache alone during DIO
> reads.
> We discovered this when our buffered IO program for distributing
> database indexes was finding zero filled blocks.  I think writes
> are broken too, but I'll leave that for a separate patch because I don't
> fully understand what XFS needs to happen during a DIO write.
> Test program:

Encapsulate it in a generic xfstest, please, and send it to


> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 1f66779..8d25d98 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -295,7 +295,11 @@ xfs_file_read_iter(
>                               xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
>                               return ret;
>                       }
> -                     truncate_pagecache_range(VFS_I(ip), pos, -1);
> +
> +                     /* we don't remove any pages here.  A direct read
> +                      * does not invalidate any contents of the page
> +                      * cache
> +                      */

I guarantee you that there are applications out there that rely on
the implicit invalidation behaviour for performance. There are also
applications out there that rely on it for correctness, too, because the
OS is not the only source of data in the filesystem the OS has

Besides, XFS's direct IO semantics are far saner, more predictable
and hence are more widely useful than the generic code. As such,
we're not going to regress semantics that have been unchanged
over 20 years just to match whatever insanity the generic Linux code
does right now.

Go on, call me a deranged monkey on some serious mind-controlling
substances. I don't care. :)

I think the fix should probably just be:

-                       truncate_pagecache_range(VFS_I(ip), pos, -1);
+                       invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
+                                       pos >> PAGE_CACHE_SHIFT, -1);
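
[Editorial sketch, not part of the original mail.] The suggested call takes page indices rather than byte offsets, which is why pos is shifted right. A minimal user-space illustration of that conversion, assuming 4 KiB pages (a shift of 12); IDX_PAGE_SHIFT and sim_page_index() are hypothetical names used only here.

```c
#include <stdint.h>

#define IDX_PAGE_SHIFT 12  /* assumed: log2 of a 4096-byte page */

/*
 * Convert a byte offset in a file to the index of the page that
 * contains it. Every offset inside the same page maps to the same
 * index, so an unaligned offset selects the whole page.
 */
static uint64_t sim_page_index(uint64_t pos)
{
	return pos >> IDX_PAGE_SHIFT;
}
```

Under these assumptions offsets 0, 512 and 4095 all map to page index 0, and offset 4096 maps to index 1. Because the range is expressed in whole pages, there is no partial page left to zero.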


Dave Chinner
