On Sat, Mar 29, 2014 at 11:14:19AM -0400, Brian Foster wrote:
> On Fri, Mar 21, 2014 at 09:11:46PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> >
> > If we fail a write beyond EOF and have to handle it in
> > xfs_vm_write_begin(), we truncate the inode back to the current inode
> > size. This doesn't take into account the fact that we may have
> > already made successful writes to the same page (in the case of block
> > size < page size) and hence we can truncate the page cache away from
> > blocks with valid data in them. If these blocks are delayed
> > allocation blocks, we now have a mismatch between the page cache and
> > the extent tree, and this will trigger - at minimum - a delayed
> > block count mismatch assert when the inode is evicted from the cache.
> > We can also trip over it when block mapping for direct IO - this is
> > the most common symptom seen from fsx and fsstress when run from
> > xfstests.
> >
> > Fix it by only truncating away the exact range we are updating state
> > for in this write_begin call.
> >
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > ---
> > fs/xfs/xfs_aops.c | 10 ++++++++--
> > 1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index e810243..6b4ecc8 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1609,12 +1609,18 @@ xfs_vm_write_begin(
> >          status = __block_write_begin(page, pos, len, xfs_get_blocks);
> >          if (unlikely(status)) {
> >                  struct inode *inode = mapping->host;
> > +                size_t isize = i_size_read(inode);
> >
> >                  xfs_vm_write_failed(inode, page, pos, len);
> >                  unlock_page(page);
> >
> > -                if (pos + len > i_size_read(inode))
> > -                        truncate_pagecache(inode, i_size_read(inode));
>
> From what I can see, we have a write_begin/copy/write_end sequence per
> page and the inode size is updated down in the write_end path. If the
> write fails at write_begin time, then we haven't copied any data and the
> inode size hasn't changed.
>
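(For reference, the per-page loop in generic_perform_write() does roughly
the following, heavily simplified:

        do {
                status = a_ops->write_begin(file, mapping, pos, bytes,
                                            flags, &page, &fsdata);
                if (unlikely(status))
                        break;

                copied = iov_iter_copy_from_user_atomic(page, i, offset,
                                                        bytes);

                /* i_size is only pushed out here, via ->write_end */
                status = a_ops->write_end(file, mapping, pos, bytes,
                                          copied, page, fsdata);
                if (status < 0)
                        break;
                pos += copied;
        } while (iov_iter_count(i));

so a failure in ->write_begin means no data was copied and i_size has not
moved yet.)
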
> The original code looks like it intends to tear down any blocks that
> might have been set up beyond EOF before the write happened to fail.
> Could you elaborate on how this can kill blocks that were written
> successfully? Doesn't this only affect post-EOF metadata?
>
> > +                /*
> > +                 * If the write is beyond EOF, we only want to kill blocks
> > +                 * allocated in this write, not blocks that were previously
> > +                 * written successfully.
> > +                 */
> > +                if (pos + len > isize)
> > +                        truncate_pagecache_range(inode, pos, pos + len);
>
> So the cache truncate now covers the range of the write. What happens if
> part of the write is an overwrite? This seems similar to the problem
> you've described above... should the truncate start at isize instead?
>
I ran an experiment on this to confirm my suspicion. I added a quick
little hack to fail any write_begin (set status = -1) for which pos != 0,
roughly like this (sketch of the idea, not the exact diff):
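        status = __block_write_begin(page, pos, len, xfs_get_blocks);
        /* debug hack: force failure of any write_begin beyond offset 0 */
        if (pos != 0)
                status = -1;

With that in place: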
# xfs_io -fc "pwrite -b 2k 0 2k" /mnt/file
# hexdump /mnt/file
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0000800

# xfs_io -c "pwrite -b 2k 1k 2k" /mnt/file
(fails)
# hexdump /mnt/file
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0000400 0000 0000 0000 0000 0000 0000 0000 0000
*
0000800
The failed extending write trashes the data over the previously valid
region: the 1k-2k range (0x400-0x800 above), which was written
successfully by the first command, now reads back as zeroes.
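
FWIW, only tossing the part of the failed write that lies beyond the old
EOF looks like it would avoid this. Something along these lines (untested
sketch):

                if (pos + len > isize) {
                        /*
                         * Only kill blocks allocated by this failed write,
                         * starting at the old EOF, so data written
                         * successfully before it is left alone.
                         */
                        ssize_t start = max_t(ssize_t, pos, isize);

                        truncate_pagecache_range(inode, start, pos + len);
                }

In the example above, that would only toss the 2k-3k range (from the old
EOF to the end of the failed write) and leave the 1k-2k data intact.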
Brian
> Brian
>
> >
> >                  page_cache_release(page);
> >                  page = NULL;
> > --
> > 1.9.0
> >