xfs
[Top] [All Lists]

Re: [PATCH 5/5] xfs: log file size updates at I/O completion time

To: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Subject: Re: [PATCH 5/5] xfs: log file size updates at I/O completion time
From: Ben Myers <bpm@xxxxxxx>
Date: Thu, 17 Nov 2011 13:25:48 -0600
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20111115201427.299361653@xxxxxxxxxxxxxxxxxxxxxx>
References: <20111115201407.038216766@xxxxxxxxxxxxxxxxxxxxxx> <20111115201427.299361653@xxxxxxxxxxxxxxxxxxxxxx>
User-agent: Mutt/1.5.18 (2008-05-17)
Hey Christoph,

On Tue, Nov 15, 2011 at 03:14:12PM -0500, Christoph Hellwig wrote:
> Do not use unlogged metadata updates and the VFS dirty bit for updating
> the file size after writeback.  In addition to causing various problems
> with updates getting delayed for far too log this also drags in the
> unscalable VFS dirty tracking, and is one of the few remaining unlogged
> metadata updates.
> 
> Note that we allocate a new transaction from the I/O completion handler.
> While this sounds fairly dangerous it isn't an issue in practice given
> that any appending write alreay had to start a transaction in writepages
> to allocate blocks, and we'll start throtteling there if we run low on
> log space or memory.
> 
> We could still occasionally stall in the completion handler, but given
> that we have per-filesystems workqueues for the I/O completions,
> and completions that do not have to either convert unwritten extents
> or update the file size are processed from interrupt context we do not
> have to worry about this stalling a system to death.
> 
> In addition to that implementing log reservations from ->writepage that
> are only released by a different thread requires a lot of work, and even
> with that wasn't quite doable in a deadlock-free manner.
> 
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> 
> ---
>  fs/xfs/xfs_aops.c |   49 ++++++++++++++++++++++++++++---------------------
>  fs/xfs/xfs_file.c |   36 +++++++++++++++++++++++++++++++++++-
>  2 files changed, 63 insertions(+), 22 deletions(-)
> 
> Index: linux-2.6/fs/xfs/xfs_aops.c
> ===================================================================
> --- linux-2.6.orig/fs/xfs/xfs_aops.c  2011-11-15 18:43:04.183113001 +0100
> +++ linux-2.6/fs/xfs/xfs_aops.c       2011-11-15 18:43:05.059779662 +0100
> @@ -26,6 +26,7 @@
>  #include "xfs_bmap_btree.h"
>  #include "xfs_dinode.h"
>  #include "xfs_inode.h"
> +#include "xfs_inode_item.h"
>  #include "xfs_alloc.h"
>  #include "xfs_error.h"
>  #include "xfs_rw.h"
> @@ -114,22 +115,39 @@ static inline bool xfs_ioend_is_append(s
>   * not extend all the way to the valid file size then restrict this update to
>   * the end of the write.
>   */
> -STATIC void
> +STATIC int
>  xfs_setfilesize(
>       struct xfs_ioend        *ioend)
>  {
>       struct xfs_inode        *ip = XFS_I(ioend->io_inode);
> +     struct xfs_mount        *mp = ip->i_mount;
> +     struct xfs_trans        *tp;
>       xfs_fsize_t             isize;
> +     int                     error = 0;
>  
>       xfs_ilock(ip, XFS_ILOCK_EXCL);
>       isize = xfs_new_eof(ip, ioend->io_offset + ioend->io_size);
> -     if (isize) {
> -             trace_xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
> -             ip->i_d.di_size = isize;
> -             xfs_mark_inode_dirty(ip);
> +     xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +     if (!isize)
> +             return 0;
> +
> +     trace_xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
> +
> +     tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
> +     error = xfs_trans_reserve(tp, 0, XFS_FSYNC_TS_LOG_RES(mp), 0, 0, 0);
> +     if (error) {
> +             xfs_trans_cancel(tp, 0);
> +             return error;
>       }
>  
> -     xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +     xfs_ilock(ip, XFS_ILOCK_EXCL);
> +
> +     ip->i_d.di_size = isize;

Make this assignment above, before dropping the ilock, so that multiple
updates to di_size cannot be reordered while allocating a transaction.

> +     xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> +     xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> +
> +     return xfs_trans_commit(tp, 0);
>  }
>  
>  /*
> @@ -183,12 +201,10 @@ xfs_end_io(
>                       ioend->io_error = -error;
>                       goto done;
>               }
> -     } else {
> -             /*
> -              * We might have to update the on-disk file size after
> -              * extending writes.
> -              */
> -             xfs_setfilesize(ioend);
> +     } else if (xfs_ioend_is_append(ioend)) {

I guess I am harping on the ilock today.  You already have this
optimisation in xfs_setfilesize, and it is clearly correct in the sense
that it takes the ilock while reading from i_d.di_size, and returns if
no update is necessary.  Is taking the ilock here really so expensive
that this additional level of optimisation is necessary?

xfs_ioend_is_append doesn't take the ilock and it really isn't obvious
why (if) that is ok.  Your explanation in reply to my earlier question
about xfs_ioend_is_append helped but the idea still isn't fully formed
for me yet.  Anyway...  I suggest a comment be added with an
explanation.

> +             error = xfs_setfilesize(ioend);
> +             if (error)
> +                     ioend->io_error = error;
>       }
>  
>  done:
> @@ -345,19 +361,10 @@ xfs_submit_ioend_bio(
>       xfs_ioend_t             *ioend,
>       struct bio              *bio)
>  {
> -     struct xfs_inode        *ip = XFS_I(ioend->io_inode);
> -
>       atomic_inc(&ioend->io_remaining);
>       bio->bi_private = ioend;
>       bio->bi_end_io = xfs_end_bio;
>  
> -     /*
> -      * If the I/O is beyond EOF we mark the inode dirty immediately
> -      * but don't update the inode size until I/O completion.
> -      */
> -     if (xfs_new_eof(ip, ioend->io_offset + ioend->io_size))
> -             xfs_mark_inode_dirty(ip);
> -
>       submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);
>  }

We don't need to mark dirty here because we're gonna log the size update
in the completion handler.  That looks good.

> Index: linux-2.6/fs/xfs/xfs_file.c
> ===================================================================
> --- linux-2.6.orig/fs/xfs/xfs_file.c  2011-11-15 10:03:17.539965975 +0100
> +++ linux-2.6/fs/xfs/xfs_file.c       2011-11-15 18:43:05.059779662 +0100
> @@ -436,6 +436,36 @@ xfs_aio_write_isize_update(
>       }
>  }
>  
> +STATIC int
> +xfs_aio_write_isize_reset(
> +     struct xfs_inode        *ip)
> +{
> +     struct xfs_mount        *mp = ip->i_mount;
> +     struct xfs_trans        *tp;
> +     int                     error = 0;
> +
> +     tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
> +     error = xfs_trans_reserve(tp, 0, XFS_FSYNC_TS_LOG_RES(mp), 0, 0, 0);
> +     if (error) {
> +             xfs_trans_cancel(tp, 0);
> +             return error;
> +     }
> +
> +     xfs_ilock(ip, XFS_ILOCK_EXCL);
> +
> +     if (ip->i_d.di_size <= ip->i_size) {
> +             xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +             xfs_trans_cancel(tp, 0);
> +             return 0;
> +     }
> +
> +     ip->i_d.di_size = ip->i_size;
> +     xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> +     xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> +
> +     return xfs_trans_commit(tp, 0);
> +}
> +
>  /*
>   * If this was a direct or synchronous I/O that failed (such as ENOSPC) then
>   * part of the I/O may have been written to disk before the error occurred.  
> In
> @@ -447,14 +477,18 @@ xfs_aio_write_newsize_update(
>       struct xfs_inode        *ip,
>       xfs_fsize_t             new_size)
>  {
> +     bool                    reset = false;

add a blank line here

>       if (new_size == ip->i_new_size) {
>               xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
>               if (new_size == ip->i_new_size)
>                       ip->i_new_size = 0;
>               if (ip->i_d.di_size > ip->i_size)
> -                     ip->i_d.di_size = ip->i_size;
> +                     reset = true;
>               xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
>       }
> +
> +     if (reset)
> +             xfs_aio_write_isize_reset(ip);
>  }

Wow..  Size updates are complicated.  I have more studying to do.

-Ben

<Prev in Thread] Current Thread [Next in Thread>