xfs
[Top] [All Lists]

Re: [PATCH 7/7] xfs: optimise away log forces on timestamp updates for f

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH 7/7] xfs: optimise away log forces on timestamp updates for fdatasync
From: Brian Foster <bfoster@xxxxxxxxxx>
Date: Mon, 2 Nov 2015 09:15:39 -0500
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <1446435735-1526-8-git-send-email-david@xxxxxxxxxxxxx>
References: <1446435735-1526-1-git-send-email-david@xxxxxxxxxxxxx> <1446435735-1526-8-git-send-email-david@xxxxxxxxxxxxx>
User-agent: Mutt/1.5.23 (2014-03-12)
On Mon, Nov 02, 2015 at 02:42:15PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> xfs: timestamp updates cause excessive fdatasync log traffic
> 
> Sage Weil reported that a ceph test workload was writing to the
> log on every fdatasync during an overwrite workload. Event tracing
> showed that the only metadata modification being made was the
> timestamp updates during the write(2) syscall, but fdatasync(2)
> is supposed to ignore them. The key observation was that the
> transactions in the log all looked like this:
> 
> INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32
> 
> And contained a flags field of 0x45 or 0x85, and had data and
> attribute forks following the inode core. This means that the
> timestamp updates were triggering dirty relogging of previously
> logged parts of the inode that hadn't yet been flushed back to
> disk.
> 
> There are two parts to this problem. The first is that XFS relogs
> dirty regions in subsequent transactions, so it carries around the
> fields that have been dirtied since the last time the inode was
> written back to disk, not since the last time the inode was forced
> into the log.
> 
> The second part is that on v5 filesystems, the inode change count
> update during inode dirtying also sets the XFS_ILOG_CORE flag, so
> on v5 filesystems this makes a timestamp update dirty the entire
> inode.
> 
> As a result when fdatasync is run, it looks at the dirty fields in
> the inode, and sees more than just the timestamp flag, even though
> the only metadata change since the last fdatasync was just the
> timestamps. Hence we force the log on every subsequent fdatasync
> even though it is not needed.
> 
> To fix this, add a new field to the inode log item that tracks
> changes since the last time fsync/fdatasync forced the log to flush
> the changes to the journal. This flag is updated when we dirty the
> inode, but we do it before updating the change count so it does not
> carry the "core dirty" flag from timestamp updates. The fields are
> zeroed when the inode is marked clean (due to writeback/freeing) or
> when an fsync/datasync forces the log. Hence if we only dirty the
> timestamps on the inode between fsync/fdatasync calls, the fdatasync
> will not trigger another log force.
> 
> Over 100 runs of the test program:
> 
> Ext4 baseline:
>       runtime: 1.63s +/- 0.24s
>       avg lat: 1.59ms +/- 0.24ms
>       iops: ~2000
> 
> XFS, vanilla kernel:
>         runtime: 2.45s +/- 0.18s
>       avg lat: 2.39ms +/- 0.18ms
>       log forces: ~400/s
>       iops: ~1000
> 
> XFS, patched kernel:
>         runtime: 1.49s +/- 0.26s
>       avg lat: 1.46ms +/- 0.25ms
>       log forces: ~30/s
>       iops: ~1500
> 
> Reported-by: Sage Weil <sage@xxxxxxxxxx>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---

Looks good:

Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx>

>  fs/xfs/xfs_file.c        | 21 ++++++++++++++++-----
>  fs/xfs/xfs_inode.c       |  2 ++
>  fs/xfs/xfs_inode_item.c  |  1 +
>  fs/xfs/xfs_inode_item.h  |  1 +
>  fs/xfs/xfs_trans_inode.c |  9 +++++++++
>  5 files changed, 29 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 0045b0a..39743ef 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -242,19 +242,30 @@ xfs_file_fsync(
>       }
>  
>       /*
> -      * All metadata updates are logged, which means that we just have
> -      * to flush the log up to the latest LSN that touched the inode.
> +      * All metadata updates are logged, which means that we just have to
> +      * flush the log up to the latest LSN that touched the inode. If we have
> +      * concurrent fsync/fdatasync() calls, we need them to all block on the
> +      * log force before we clear the ili_fsync_fields field. This ensures
> +      * that we don't get a racing sync operation that does not wait for the
> +      * metadata to hit the journal before returning. If we race with
> +      * clearing the ili_fsync_fields, then all that will happen is the log
> +      * force will do nothing as the lsn will already be on disk. We can't
> +      * race with setting ili_fsync_fields because that is done under
> +      * XFS_ILOCK_EXCL, and that can't happen because we hold the lock shared
> +      * until after the ili_fsync_fields is cleared.
>        */
>       xfs_ilock(ip, XFS_ILOCK_SHARED);
>       if (xfs_ipincount(ip)) {
>               if (!datasync ||
> -                 (ip->i_itemp->ili_fields & ~XFS_ILOG_TIMESTAMP))
> +                 (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
>                       lsn = ip->i_itemp->ili_last_lsn;
>       }
> -     xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
> -     if (lsn)
> +     if (lsn) {
>               error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> +             ip->i_itemp->ili_fsync_fields = 0;
> +     }
> +     xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
>       /*
>        * If we only have a single device, and the log force about was
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index a0f2bae..8ee3939 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2365,6 +2365,7 @@ retry:
>  
>                       iip->ili_last_fields = iip->ili_fields;
>                       iip->ili_fields = 0;
> +                     iip->ili_fsync_fields = 0;
>                       iip->ili_logged = 1;
>                       xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>                                               &iip->ili_item.li_lsn);
> @@ -3560,6 +3561,7 @@ xfs_iflush_int(
>        */
>       iip->ili_last_fields = iip->ili_fields;
>       iip->ili_fields = 0;
> +     iip->ili_fsync_fields = 0;
>       iip->ili_logged = 1;
>  
>       xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 62bd80f..d14b12b 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -719,6 +719,7 @@ xfs_iflush_abort(
>                * attempted.
>                */
>               iip->ili_fields = 0;
> +             iip->ili_fsync_fields = 0;
>       }
>       /*
>        * Release the inode's flush lock since we're done with it.
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 488d812..4c7722e 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item {
>       unsigned short          ili_logged;        /* flushed logged data */
>       unsigned int            ili_last_fields;   /* fields when flushed */
>       unsigned int            ili_fields;        /* fields to be logged */
> +     unsigned int            ili_fsync_fields;  /* logged since last fsync */
>  } xfs_inode_log_item_t;
>  
>  static inline int xfs_inode_clean(xfs_inode_t *ip)
> diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
> index 17280cd..b97f1df 100644
> --- a/fs/xfs/xfs_trans_inode.c
> +++ b/fs/xfs/xfs_trans_inode.c
> @@ -108,6 +108,15 @@ xfs_trans_log_inode(
>       ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  
>       /*
> +      * Record the specific change for fdatasync optimisation. This
> +      * allows fdatasync to skip log forces for inodes that are only
> +      * timestamp dirty. We do this before the change count so that
> +      * the core being logged in this case does not impact on fdatasync
> +      * behaviour.
> +      */
> +     ip->i_itemp->ili_fsync_fields |= flags;
> +
> +     /*
>        * First time we log the inode in a transaction, bump the inode change
>        * counter if it is configured for this to occur. We don't use
>        * inode_inc_version() because there is no need for extra locking around
> -- 
> 2.5.0
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

<Prev in Thread] Current Thread [Next in Thread>