On Mon, Nov 02, 2015 at 02:42:15PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> xfs: timestamp updates cause excessive fdatasync log traffic
>
> Sage Weil reported that a ceph test workload was writing to the
> log on every fdatasync during an overwrite workload. Event tracing
> showed that the only metadata modification being made was the
> timestamp updates during the write(2) syscall, but fdatasync(2)
> is supposed to ignore them. The key observation was that the
> transactions in the log all looked like this:
>
> INODE: #regs: 4 ino: 0x8b flags: 0x45 dsize: 32
>
> And contained a flags field of 0x45 or 0x85, and had data and
> attribute forks following the inode core. This means that the
> timestamp updates were triggering dirty relogging of previously
> logged parts of the inode that hadn't yet been flushed back to
> disk.
>
> There are two parts to this problem. The first is that XFS relogs
> dirty regions in subsequent transactions, so it carries around the
> fields that have been dirtied since the last time the inode was
> written back to disk, not since the last time the inode was forced
> into the log.
>
> The second part is that on v5 filesystems, the inode change count
> update during inode dirtying also sets the XFS_ILOG_CORE flag, so
> on v5 filesystems this makes a timestamp update dirty the entire
> inode.
>
> As a result when fdatasync is run, it looks at the dirty fields in
> the inode, and sees more than just the timestamp flag, even though
> the only metadata change since the last fdatasync was just the
> timestamps. Hence we force the log on every subsequent fdatasync
> even though it is not needed.
>
> To fix this, add a new field to the inode log item that tracks
> changes since the last time fsync/fdatasync forced the log to flush
> the changes to the journal. This flag is updated when we dirty the
> inode, but we do it before updating the change count so it does not
> carry the "core dirty" flag from timestamp updates. The fields are
> zeroed when the inode is marked clean (due to writeback/freeing) or
> when an fsync/datasync forces the log. Hence if we only dirty the
> timestamps on the inode between fsync/fdatasync calls, the fdatasync
> will not trigger another log force.
>
> Over 100 runs of the test program:
>
> Ext4 baseline:
> runtime: 1.63s +/- 0.24s
> avg lat: 1.59ms +/- 0.24ms
> iops: ~2000
>
> XFS, vanilla kernel:
> runtime: 2.45s +/- 0.18s
> avg lat: 2.39ms +/- 0.18ms
> log forces: ~400/s
> iops: ~1000
>
> XFS, patched kernel:
> runtime: 1.49s +/- 0.26s
> avg lat: 1.46ms +/- 0.25ms
> log forces: ~30/s
> iops: ~1500
>
> Reported-by: Sage Weil <sage@xxxxxxxxxx>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
Looks good:
Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx>
> fs/xfs/xfs_file.c | 21 ++++++++++++++++-----
> fs/xfs/xfs_inode.c | 2 ++
> fs/xfs/xfs_inode_item.c | 1 +
> fs/xfs/xfs_inode_item.h | 1 +
> fs/xfs/xfs_trans_inode.c | 9 +++++++++
> 5 files changed, 29 insertions(+), 5 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 0045b0a..39743ef 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -242,19 +242,30 @@ xfs_file_fsync(
> }
>
> /*
> - * All metadata updates are logged, which means that we just have
> - * to flush the log up to the latest LSN that touched the inode.
> + * All metadata updates are logged, which means that we just have to
> + * flush the log up to the latest LSN that touched the inode. If we have
> + * concurrent fsync/fdatasync() calls, we need them to all block on the
> + * log force before we clear the ili_fsync_fields field. This ensures
> + * that we don't get a racing sync operation that does not wait for the
> + * metadata to hit the journal before returning. If we race with
> + * clearing the ili_fsync_fields, then all that will happen is the log
> + * force will do nothing as the lsn will already be on disk. We can't
> + * race with setting ili_fsync_fields because that is done under
> + * XFS_ILOCK_EXCL, and that can't happen because we hold the lock shared
> + * until after the ili_fsync_fields is cleared.
> */
> xfs_ilock(ip, XFS_ILOCK_SHARED);
> if (xfs_ipincount(ip)) {
> if (!datasync ||
> - (ip->i_itemp->ili_fields & ~XFS_ILOG_TIMESTAMP))
> + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> lsn = ip->i_itemp->ili_last_lsn;
> }
> - xfs_iunlock(ip, XFS_ILOCK_SHARED);
>
> - if (lsn)
> + if (lsn) {
> error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> + ip->i_itemp->ili_fsync_fields = 0;
> + }
> + xfs_iunlock(ip, XFS_ILOCK_SHARED);
>
> /*
> * If we only have a single device, and the log force about was
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index a0f2bae..8ee3939 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2365,6 +2365,7 @@ retry:
>
> iip->ili_last_fields = iip->ili_fields;
> iip->ili_fields = 0;
> + iip->ili_fsync_fields = 0;
> iip->ili_logged = 1;
> xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
> &iip->ili_item.li_lsn);
> @@ -3560,6 +3561,7 @@ xfs_iflush_int(
> */
> iip->ili_last_fields = iip->ili_fields;
> iip->ili_fields = 0;
> + iip->ili_fsync_fields = 0;
> iip->ili_logged = 1;
>
> xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 62bd80f..d14b12b 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -719,6 +719,7 @@ xfs_iflush_abort(
> * attempted.
> */
> iip->ili_fields = 0;
> + iip->ili_fsync_fields = 0;
> }
> /*
> * Release the inode's flush lock since we're done with it.
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 488d812..4c7722e 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -34,6 +34,7 @@ typedef struct xfs_inode_log_item {
> unsigned short ili_logged; /* flushed logged data */
> unsigned int ili_last_fields; /* fields when flushed */
> unsigned int ili_fields; /* fields to be logged */
> + unsigned int ili_fsync_fields; /* logged since last fsync */
> } xfs_inode_log_item_t;
>
> static inline int xfs_inode_clean(xfs_inode_t *ip)
> diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
> index 17280cd..b97f1df 100644
> --- a/fs/xfs/xfs_trans_inode.c
> +++ b/fs/xfs/xfs_trans_inode.c
> @@ -108,6 +108,15 @@ xfs_trans_log_inode(
> ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>
> /*
> + * Record the specific change for fdatasync optimisation. This
> + * allows fdatasync to skip log forces for inodes that are only
> + * timestamp dirty. We do this before the change count so that
> + * the core being logged in this case does not impact on fdatasync
> + * behaviour.
> + */
> + ip->i_itemp->ili_fsync_fields |= flags;
> +
> + /*
> * First time we log the inode in a transaction, bump the inode change
> * counter if it is configured for this to occur. We don't use
> * inode_inc_version() because there is no need for extra locking around
> --
> 2.5.0
>
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs
|