Re: [PATCH 2/6] xfs: write failure beyond EOF truncates too much data

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH 2/6] xfs: write failure beyond EOF truncates too much data
From: Brian Foster <bfoster@xxxxxxxxxx>
Date: Fri, 4 Apr 2014 11:26:36 -0400
Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20140329151419.GA33827@xxxxxxxxxxxxxxx>
References: <1395396710-3824-1-git-send-email-david@xxxxxxxxxxxxx> <1395396710-3824-3-git-send-email-david@xxxxxxxxxxxxx> <20140329151419.GA33827@xxxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Sat, Mar 29, 2014 at 11:14:19AM -0400, Brian Foster wrote:
> On Fri, Mar 21, 2014 at 09:11:46PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > If we fail a write beyond EOF and have to handle it in
> > xfs_vm_write_begin(), we truncate the inode back to the current inode
> > size. This doesn't take into account the fact that we may have
> > already made successful writes to the same page (in the case of block
> > size < page size) and hence we can truncate the page cache away from
> > blocks with valid data in them. If these blocks are delayed
> > allocation blocks, we now have a mismatch between the page cache and
> > the extent tree, and this will trigger - at minimum - a delayed
> > block count mismatch assert when the inode is evicted from the cache.
> > We can also trip over it when block mapping for direct IO - this is
> > the most common symptom seen from fsx and fsstress when run from
> > xfstests.
> > 
> > Fix it by only truncating away the exact range we are updating state
> > for in this write_begin call.
> > 
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_aops.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index e810243..6b4ecc8 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1609,12 +1609,18 @@ xfs_vm_write_begin(
> >     status = __block_write_begin(page, pos, len, xfs_get_blocks);
> >     if (unlikely(status)) {
> >             struct inode    *inode = mapping->host;
> > +           size_t          isize = i_size_read(inode);
> >  
> >             xfs_vm_write_failed(inode, page, pos, len);
> >             unlock_page(page);
> >  
> > -           if (pos + len > i_size_read(inode))
> > -                   truncate_pagecache(inode, i_size_read(inode));
> 
> From what I can see, we have a write_begin/copy/write_end sequence per
> page and the inode size is updated down in the write_end path. If the
> write fails at write_begin time, then we haven't copied any data and the
> inode size hasn't changed.
> 
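> A rough sketch of that loop, paraphrasing generic_perform_write() in
> mm/filemap.c from memory (so treat the details as approximate):
> 
>         status = a_ops->write_begin(file, mapping, pos, bytes, flags,
>                                     &page, &fsdata);
>         if (unlikely(status))
>                 break;
>         copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
>         /* i_size is only pushed out on the ->write_end side (e.g. in
>          * generic_write_end()), after a successful copy */
>         status = a_ops->write_end(file, mapping, pos, bytes, copied,
>                                   page, fsdata);
> 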
> The intent of the original code looks like it wants to break off any
> blocks that might have been set up beyond EOF before the write happened
> to fail. Could you elaborate on how this can kill blocks that might have
> been written successfully? Doesn't this only affect post-EOF metadata?
> 
> > +           /*
> > +            * If the write is beyond EOF, we only want to kill blocks
> > +            * allocated in this write, not blocks that were previously
> > +            * written successfully.
> > +            */
> > +           if (pos + len > isize)
> > +                   truncate_pagecache_range(inode, pos, pos + len);
> 
> So the cache truncate now covers the range of the write. What happens if
> part of the write is an overwrite of existing data? This seems similar to
> the problem you've described above... should the truncate start at isize
> instead?
> 

I ran an experiment on this to confirm my suspicion here. I added a
quick little hack in xfs_vm_write_begin() to fail any write_begin (set
status = -1) for which pos != 0.
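Roughly the following, dropped in right after the __block_write_begin()
call (debug-only sketch, not a real patch; the bare -1 is really -EPERM,
which is good enough for a test):

        status = __block_write_begin(page, pos, len, xfs_get_blocks);
        /* debug hack: fail any write that doesn't start at offset 0 */
        if (pos != 0)
                status = -1;

With that in place: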

xfs_io -fc "pwrite -b 2k 0 2k" /mnt/file

# hexdump /mnt/file 
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0000800

xfs_io -c "pwrite -b 2k 1k 2k" /mnt/file
(fails)

# hexdump /mnt/file 
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0000400 0000 0000 0000 0000 0000 0000 0000 0000
*
0000800

The failed extending write trashes the previously valid data between the
start of the failed write (1k) and the old EOF (2k).
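
Something along these lines is what I was getting at, i.e. only truncate
the post-EOF portion of the failed write (untested sketch on top of this
patch):

        /*
         * If the write is beyond EOF, we only want to kill blocks
         * allocated in this write, not blocks that were previously
         * written successfully.
         */
        if (pos + len > isize) {
                ssize_t start = max_t(ssize_t, pos, isize);

                truncate_pagecache_range(inode, start, pos + len);
        }

That way an overwrite that happens to extend EOF never truncates cached
data below the old isize.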

Brian

> Brian
> 
> >  
> >             page_cache_release(page);
> >             page = NULL;
> > -- 
> > 1.9.0
> > 
