On 08/08/2014 10:35 AM, Chris Mason wrote:
> xfs is using truncate_pagecache_range to invalidate the page cache
> during DIO reads. This is different from the other filesystems, which only
> invalidate pages during DIO writes.
> truncate_pagecache_range is meant to be used when we are freeing the
> underlying data structs from disk, so it will zero any partial ranges
> in the page. This means a DIO read can zero out part of the page cache
> page, and it is possible the page will stay in cache.
> Buffered reads will find an up-to-date page with zeros instead of the
> data actually on disk.
> This patch fixes things by leaving the page cache alone during DIO reads.
> We discovered this when our buffered IO program for distributing
> database indexes was finding zero filled blocks. I think writes
> are broken too, but I'll leave that for a separate patch because I don't
> fully understand what XFS needs to happen during a DIO write.
I stuck a cc: stable@xxxxxxxxxxxxxxx after my sob, but then inserted a
giant test program. Just realized the cc might get lost...sorry I
wasn't trying to sneak it in.
I've been trying to figure out why this bug doesn't show up in our 3.2
kernels but does show up now. Today xfs does this:
truncate_pagecache_range(VFS_I(ip), pos, -1);
But in 3.2 we did this:
ret = -xfs_flushinval_pages(ip,
(iocb->ki_pos & PAGE_CACHE_MASK),
Since we've done pos & PAGE_CACHE_MASK, the 3.2 code never sent a
partial offset. So it never zeroed partial pages.
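To make the difference concrete, here is a small userspace sketch of the arithmetic, not kernel code. It assumes 4 KiB pages, and the names SKETCH_PAGE_SIZE, SKETCH_PAGE_MASK, page_align_down() and partial_bytes_zeroed() are made up for illustration. The second helper models my understanding that a range truncate starting mid-page zeroes from that offset to the end of the page when the page is cached.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel takes the real value from the arch. */
#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Round a file offset down to the start of its page, which is what
 * masking with PAGE_CACHE_MASK did in the 3.2 code. */
uint64_t page_align_down(uint64_t pos)
{
	return pos & SKETCH_PAGE_MASK;
}

/* Bytes between pos and the end of its page. This models how much of a
 * cached page would be zeroed by a truncate that starts at pos and runs
 * to the end of the file. Returns 0 when pos is already page aligned. */
uint64_t partial_bytes_zeroed(uint64_t pos)
{
	uint64_t in_page = pos & (SKETCH_PAGE_SIZE - 1);

	return in_page ? SKETCH_PAGE_SIZE - in_page : 0;
}
```

With pos = 5000, the unmasked offset leaves the tail of the page (3192 bytes in this model) to be zeroed, while page_align_down(5000) gives 4096 and partial_bytes_zeroed(4096) is 0, so no partial page is touched.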
> Signed-off-by: Chris Mason <clm@xxxxxx>
> cc: stable@xxxxxxxxxxxxxxx
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 1f66779..8d25d98 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -295,7 +295,11 @@ xfs_file_read_iter(
> xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
> return ret;
> - truncate_pagecache_range(VFS_I(ip), pos, -1);
> + /* we don't remove any pages here. A direct read
> + * does not invalidate any contents of the page
> + * cache
> + */
> xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);