On Fri, Apr 27, 2012 at 07:45:21PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> When a partial write inside EOF fails, it can leave delayed
> allocation blocks lying around because they don't get punched back
> out. This leads to assert failures like:
>
> XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks
> == 0, file: fs/xfs/xfs_super.c, line: 847
>
> when evicting inodes from the cache. This can be trivially triggered
> by xfstests 083, which takes between 5 and 15 executions on a 512
> byte block size filesystem to trip over this. Debugging shows a
> failed write due to ENOSPC calling xfs_vm_write_failed such as:
>
> [ 5012.329024] ino 0xa0026: vwf to 0x17000, sze 0x1c85ae
>
> and no action is taken on it. This leaves behind a delayed
> allocation extent that has no page covering it and no data in it:
>
> [ 5015.867162] ino 0xa0026: blks: 0x83 delay blocks 0x1, size 0x2538c0
> [ 5015.868293] ext 0: off 0x4a, fsb 0x50306, len 0x1
> [ 5015.869095] ext 1: off 0x4b, fsb 0x7899, len 0x6b
> [ 5015.869900] ext 2: off 0xb6, fsb 0xffffffffe0008, len 0x1
> ^^^^^^^^^^^^^^^
> [ 5015.871027] ext 3: off 0x36e, fsb 0x7a27, len 0xd
> [ 5015.872206] ext 4: off 0x4cf, fsb 0x7a1d, len 0xa
>
> So the delayed allocation extent is one block long at offset
> 0x16c00. Tracing shows that a bigger write:
>
> xfs_file_buffered_write: size 0x1c85ae offset 0x959d count 0x1ca3f ioflags
>
> allocates the block, and then fails with ENOSPC trying to allocate
> the last block on the page, leading to a failed write with stale
> delalloc blocks on it.
>
> Because we've had an ENOSPC when trying to allocate 0x16e00, it
> means that we are never goinge to call ->write_end on the page and
going
> so the allocated new buffer will not get marked dirty or have the
> buffer_new state cleared. In other words, what the above write is
> supposed to end up with is this mapping for the page:
>
> +------+------+------+------+------+------+------+------+
>   UMA    UMA    UMA    UMA    UMA    UMA    UND   FAIL
>
> where: U = uptodate
> M = mapped
> N = new
> A = allocated
> D = delalloc
> FAIL = block we ENOSPC'd on.
>
> and the key point being the buffer_new() state for the newly
> allocated delayed allocation block. Except it doesn't - we're not
> marking buffers new correctly.
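For anyone following along: the single-letter states above map directly onto
the generic buffer_head flag helpers from include/linux/buffer_head.h. A
minimal sketch of the state tests being discussed:

	/* U = buffer_uptodate(bh)  - contains valid data
	 * M = buffer_mapped(bh)    - mapped to an allocated block
	 * N = buffer_new(bh)       - block newly allocated by this write
	 * A/D distinguish a real allocation from a delalloc reservation,
	 *     the latter tested with buffer_delay(bh)
	 */
	if (buffer_delay(bh) && buffer_new(bh)) {
		/* the UND case: a delalloc reservation that exists only
		 * because of the current - now failed - write */
	}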
>
> That buffer_new() problem goes back to the xfs_iomap removal days,
> where xfs_iomap() used to return a "new" status for any map with
> newly allocated blocks, so that __xfs_get_blocks() could call
> set_buffer_new() on it. We still have the "new" variable and the
> check for it in the set_buffer_new() logic - except we never set it
> now!
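From memory, the tail of __xfs_get_blocks() has a check along these lines,
and nothing above it assigns to 'new' any more (roughly, not quoted verbatim
from the tree):

	if (create &&
	    ((!buffer_mapped(bh_result) && !buffer_uptodate(bh_result)) ||
	     (offset >= i_size_read(inode)) ||
	     (new || ISUNWRITTEN(&imap))))
		set_buffer_new(bh_result);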
>
> Hence that newly allocated delalloc block doesn't have the new flag
> set on it, so when the write fails we cannot tell which blocks we
> are supposed to punch out. WHy do we need the buffer_new flag? Well,
Why
> that's because we can have this case:
>
> +------+------+------+------+------+------+------+------+
>   UMD    UMD    UMD    UMD    UMD    UMD    UND   FAIL
>
> where all the UMD buffers contain valid data from a previously
> successful write() system call. We only want to punch the UND buffer
> because that's the only one that we added in this write and it was
> only this write that failed.
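If I'm reading the patch below correctly, the buffer selection it implements
boils down to something like this sketch (needs_punch() is just a name for
illustration, not a function in the patch):

	/*
	 * Punch only delalloc buffers that this write created (UND above)
	 * or that lie beyond EOF; never touch valid UMD data.
	 */
	static bool
	needs_punch(
		struct buffer_head	*bh,
		loff_t			block_offset,
		loff_t			isize)
	{
		if (!buffer_delay(bh))
			return false;		/* not a delalloc buffer */
		if (buffer_new(bh))
			return true;		/* created by the failed write */
		return block_offset >= isize;	/* speculative, beyond EOF */
	}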
>
> That implies that even the old buffer_new() logic was wrong -
> because it would result in all those UMD buffers on the page having
> set_buffer_new() called on them even though they aren't new. Hence
> we shoul donly be calling set_buffer_new() for delalloc buffers that
should only
> were allocated (i.e. were a hole before xfs_iomap_write_delay() was
> called).
>
> So, fix this set_buffer_new logic according to how we need it to
> work for handling failed writes correctly. Also, restore the new
> buffer logic handling for blocks allocated via
> xfs_iomap_write_direct(), because it should still set the buffer_new
> flag appropriately for newly allocated blocks, too.
>
> So, now that we have buffer_new() being set appropriately in
> __xfs_get_blocks(), we can detect the exact delalloc ranges that
> we allocated in a failed write, and hence can now do a walk of the
> buffers on a page to find them.
>
> Except, it's not that easy. When block_write_begin() fails, it
> unlocks and releases the page that we just had an error on, so we
> can't use that page to handle errors anymore. We have to get access
> to the page while it is still locked to walk the buffers. Hence we
> have to open code block_write_begin() in xfs_vm_write_begin() to be
> able to insert xfs_vm_write_failed() in the right place.
>
> With that, we can pass the page and write range to
> xfs_vm_write_failed() and walk the buffers on the page, looking for
> delalloc buffers that are either new or beyond EOF and punch them
> out. Handling buffers beyond EOF ensures we still handle the
> existing case that xfs_vm_write_failed() handles.
>
> Of special note is the truncate_pagecache() handling - that should only
> be done for pages outside EOF - pages within EOF can still
> contain valid, dirty data so we must not punch them out of the
> cache.
>
> That just leaves the xfs_vm_write_end() failure handling.
> The only failure case here is that we didn't copy the entire range,
> and generic_write_end() handles that by zeroing the region of the
> page that wasn't copied,
Are you referring to
xfs_vm_write_end
generic_write_end
block_write_end
page_zero_new_buffers?
> so we don't have to punch out blocks within
> the file because they are guaranteed to contain zeros. Hence we only
> have to handle the existing "beyond EOF" case and don't need access
> to the buffers on the page. Hence it remains largely unchanged.
>
> Note that xfs_getbmap() can still trip over delalloc blocks beyond
> EOF that are left there by speculative delayed allocation. Hence
> this bug fix does not solve all known issues with bmap vs delalloc,
> but it does fix all the known accidental occurrences of the
> problem.
>
> Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
> ---
> fs/xfs/xfs_aops.c | 173 +++++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 127 insertions(+), 46 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 64ed87a..ae31c31 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1184,11 +1184,18 @@ __xfs_get_blocks(
> &imap, nimaps);
> if (error)
> return -error;
> + new = 1;
> } else {
> /*
> * Delalloc reservations do not require a transaction,
> - * we can go on without dropping the lock here.
> + * we can go on without dropping the lock here. If we
> + * are allocating a new delalloc block, make sure that
> + * we set the new flag so that we mark the buffer new so
> + * that we know that it is newly allocated if the write
> + * fails.
> */
> + if (nimaps && imap.br_startblock == HOLESTARTBLOCK)
> + new = 1;
> error = xfs_iomap_write_delay(ip, offset, size, &imap);
> if (error)
> goto out_unlock;
> @@ -1405,52 +1412,91 @@ out_destroy_ioend:
> return ret;
> }
>
> +/*
> + * Punch out the delalloc blocks we have already allocated.
This language is confusing. I suggest that delay blocks are reserved and real
blocks are allocated. Tomato, tomahto.
> + *
> + * Don't bother with xfs_setattr given that nothing can have made it to disk yet
> + * as the page is still locked at this point.
> + */
> +STATIC void
> +xfs_vm_kill_delalloc_range(
> + struct inode *inode,
> + loff_t start,
> + loff_t end)
> +{
> + struct xfs_inode *ip = XFS_I(inode);
> + xfs_fileoff_t start_fsb;
> + xfs_fileoff_t end_fsb;
> + int error;
> +
> + start_fsb = XFS_B_TO_FSB(ip->i_mount, start);
> + end_fsb = XFS_B_TO_FSB(ip->i_mount, end);
> + if (end_fsb <= start_fsb)
> + return;
> +
> + xfs_ilock(ip, XFS_ILOCK_EXCL);
> + error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> + end_fsb - start_fsb);
> + if (error) {
> + /* something screwed, just bail */
> + if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> + xfs_alert(ip->i_mount,
> + "xfs_vm_write_failed: unable to clean up ino %lld",
Consider updating the function name in this error message and printing out the
value of error.
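Something like this, perhaps (untested, just illustrating the suggestion):

	xfs_alert(ip->i_mount,
	"xfs_vm_kill_delalloc_range: unable to clean up ino %lld, error %d",
			ip->i_ino, error);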
> + ip->i_ino);
> + }
> + }
> + xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +}
> +
> STATIC void
> xfs_vm_write_failed(
> - struct address_space *mapping,
> - loff_t to)
> + struct inode *inode,
> + struct page *page,
> + loff_t pos,
> + unsigned len)
> {
> - struct inode *inode = mapping->host;
> + loff_t block_offset = pos & PAGE_MASK;
> + loff_t block_start;
> + loff_t block_end;
> + loff_t from = pos & (PAGE_CACHE_SIZE - 1);
> + loff_t to = from + len;
> + struct buffer_head *bh, *head;
>
> - if (to > inode->i_size) {
> - /*
> - * Punch out the delalloc blocks we have already allocated.
> - *
> - * Don't bother with xfs_setattr given that nothing can have
> - * made it to disk yet as the page is still locked at this
> - * point.
> - */
> - struct xfs_inode *ip = XFS_I(inode);
> - xfs_fileoff_t start_fsb;
> - xfs_fileoff_t end_fsb;
> - int error;
> + ASSERT(block_offset + from == pos);
>
> - truncate_pagecache(inode, to, inode->i_size);
> + head = page_buffers(page);
> + block_start = 0;
> + for (bh = head; bh != head || !block_start;
> + bh = bh->b_this_page, block_start = block_end,
> + block_offset += bh->b_size) {
> + block_end = block_start + bh->b_size;
>
> - /*
> - * Check if there are any blocks that are outside of i_size
> - * that need to be trimmed back.
> - */
> - start_fsb = XFS_B_TO_FSB(ip->i_mount, inode->i_size);
> - end_fsb = XFS_B_TO_FSB(ip->i_mount, to);
> - if (end_fsb <= start_fsb)
> - return;
> -
> - xfs_ilock(ip, XFS_ILOCK_EXCL);
> - error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> - end_fsb - start_fsb);
> - if (error) {
> - /* something screwed, just bail */
> - if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> - xfs_alert(ip->i_mount,
> - "xfs_vm_write_failed: unable to clean up ino %lld",
> - ip->i_ino);
> - }
> - }
> - xfs_iunlock(ip, XFS_ILOCK_EXCL);
> + /* skip buffers before the write */
> + if (block_end <= from)
> + continue;
> +
> + /* if the buffer is after the write, we're done */
> + if (block_start >= to)
*blink* I was looking pretty hard at that for an off-by-one. Mark
straightened me out. Eesh.
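For anyone else who blinks at it: block_start/block_end describe a half-open
[start, end) range. With 512 byte buffers and a write spanning from=512,
to=1024, the buffer at [0,512) has block_end == from and is skipped, the
buffer at [1024,1536) has block_start == to and terminates the walk, and only
the buffer at [512,1024) is examined. No off-by-one.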
> + break;
> +
> + if (!buffer_delay(bh))
> + continue;
> +
> + if (!buffer_new(bh) && block_offset < i_size_read(inode))
> + continue;
> +
> + xfs_vm_kill_delalloc_range(inode, block_offset,
> + block_offset + bh->b_size);
> }
> +
> }
>
> +/*
> + * This used to call block_write_begin(), but it unlocks and releases the page
> + * on error, and we need that page to be able to punch stale delalloc blocks out
> + * on failure. Hence we copy-n-waste it here and call xfs_vm_write_failed() at
> + * the appropriate point.
> + */
> STATIC int
> xfs_vm_write_begin(
> struct file *file,
> @@ -1461,15 +1507,40 @@ xfs_vm_write_begin(
> struct page **pagep,
> void **fsdata)
> {
> - int ret;
> + pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> + struct page *page;
> + int status;
>
> - ret = block_write_begin(mapping, pos, len, flags | AOP_FLAG_NOFS,
> - pagep, xfs_get_blocks);
> - if (unlikely(ret))
> - xfs_vm_write_failed(mapping, pos + len);
> - return ret;
> + ASSERT(len <= PAGE_CACHE_SIZE);
> +
> + page = grab_cache_page_write_begin(mapping, index,
> + flags | AOP_FLAG_NOFS);
> + if (!page)
> + return -ENOMEM;
> +
> + status = __block_write_begin(page, pos, len, xfs_get_blocks);
> + if (unlikely(status)) {
> + struct inode *inode = mapping->host;
> +
> + xfs_vm_write_failed(inode, page, pos, len);
> + unlock_page(page);
Consistent with block_write_begin.
> +
> + if (pos + len > i_size_read(inode))
> + truncate_pagecache(inode, pos + len, i_size_read(inode));
> +
> + page_cache_release(page);
> + page = NULL;
> + }
> +
> + *pagep = page;
> + return status;
> }
>
> +/*
> + * On failure, we only need to kill delalloc blocks beyond EOF because they
> + * will never be written. For blocks within EOF, generic_write_end() zeros them
> + * so they are safe to leave alone and be written with all the other valid data.
> + */
> STATIC int
> xfs_vm_write_end(
> struct file *file,
> @@ -1482,9 +1553,19 @@ xfs_vm_write_end(
> {
> int ret;
>
> + ASSERT(len <= PAGE_CACHE_SIZE);
> +
> ret = generic_write_end(file, mapping, pos, len, copied, page, fsdata);
> - if (unlikely(ret < len))
> - xfs_vm_write_failed(mapping, pos + len);
> + if (unlikely(ret < len)) {
> + struct inode *inode = mapping->host;
> + size_t isize = i_size_read(inode);
> + loff_t to = pos + len;
> +
> + if (to > isize) {
> + truncate_pagecache(inode, to, isize);
> + xfs_vm_kill_delalloc_range(inode, isize, to);
> + }
> + }
> return ret;
> }
Aside from a few nits this is looking good.
Reviewed-by: Ben Myers <bpm@xxxxxxx>