[RFC] Delayed logging

To: xfs@xxxxxxxxxxx
Subject: [RFC] Delayed logging
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Mon, 15 Mar 2010 15:30:00 +1100
User-agent: Mutt/1.5.20 (2009-06-14)

Hi Folks,

You've all heard me talking about delayed logging, but there hasn't
been any code yet. Well, here's the first code drop - see the git
tree reference at the end of the email to get it.

If you want to know what delayed logging is and how it works, pull
the tree and read the documentation in:

        Documentation/filesystems/xfs-delayed-logging-design.txt

or navigate to it via gitweb from here:

        http://git.kernel.org/?p=linux/kernel/git/dgc/xfs.git

The delayed-logging branch that the code lives in may be rebased at
any time, hence I'm not going to point you at commits because they
won't be stable. It also means any time you want to update, you need
to pull into a clean new branch.

Overall, it's not a huge change:

 19 files changed, 2594 insertions(+), 580 deletions(-)

Especially when you take away the 819 lines of design documentation.
It's still a large change, though, when you consider how critical
this code is. :/

Now you know what it is, the code in the tree implements the
documented design. While the code passes XFSQA on my test systems,
there are still occasional failures that have not been resolved and
almost no stress testing of the code has been done. Hence:

*** USE THIS CODE AT YOUR OWN RISK ***

At present, I have done no performance testing on production kernel
configurations - all my testing has been done with CONFIG_XFS_DEBUG
enabled along with various other kernel debugging features as well.
Hence I've really only been looking at significant deviations in
performance (up or down) to determine whether the code is meeting
design goals or not.

The following results are from a synthetic test designed to show
just the impact of delayed logging on the amount of metadata
written to the log.

load:   Sequential create 100k zero-length files in a directory per
        thread, no fsync between create and unlink.
        (./fs_mark -S0 -n 100000 -s 0 -d ....)

measurement: via PCP. XFS specific metrics:

        xfs.log.blocks
        xfs.log.writes
        xfs.log.noiclogs
        xfs.log.force
        xfs.transactions.*
        xfs.dir_ops.create
        xfs.dir_ops.remove


machine:

2GHz Dual core opteron, 3GB RAM
single 36GB 15krpm scsi drive w/ CTQ depth=32
mkfs.xfs -f -l size=128m /dev/sdb2

Current code:

mount -o "logbsize=262144" /dev/sdb2 /mnt/scratch

threads:         fs_mark        CPU     create log      unlink log
                throughput              bandwidth       bandwidth
1                 2900/s         75%       34MB/s        34MB/s
2                 2850/s         75%       33MB/s        33MB/s
4                 2800/s         80%       30MB/s        30MB/s

Delayed logging:

mount -o "delaylog,logbsize=262144" /dev/sdb2 /mnt/scratch

threads:         fs_mark        CPU     create log      unlink log
                throughput              bandwidth       bandwidth
1                 4300/s        110%       1.5MB/s       <1MB/s
2                 7900/s        195%       <4MB/s        <1MB/s
4                 7500/s        200%       <5MB/s        <1.5MB/s

I think it's pretty clear that the design goal of "an order of
magnitude less log IO bandwidth" is being met here. Scalability is
looking promising, but a 2p machine is not large enough to make any
definitive statements about that. Hence from these results the
implementation is meeting or exceeding its design goals.
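
As a rough cross-check of the "order of magnitude" claim, the sketch
below divides create log bandwidth by create rate to get the average
log traffic per create for each row of the two tables. The figures
are copied from the tables. Treating the "<N" entries as exactly N is
an assumption that makes the delayed logging side look worse than it
is, so the 2 and 4 thread factors are lower bounds. The function
names are made up for illustration.

```python
# Figures copied from the two tables above. Values reported as "<N"
# are treated as N, i.e. an upper bound on delayed logging bandwidth.
# Rows: threads -> (creates per second, create log bandwidth in MB/s)
current = {1: (2900, 34.0), 2: (2850, 33.0), 4: (2800, 30.0)}
delaylog = {1: (4300, 1.5), 2: (7900, 4.0), 4: (7500, 5.0)}


def log_kb_per_create(creates_per_sec, log_mb_per_sec):
    """Average log traffic per create in KB (assuming 1MB = 1024KB)."""
    return log_mb_per_sec * 1024 / creates_per_sec


def reduction_factors():
    """Map thread count -> current KB/create over delaylog KB/create."""
    return {
        threads: log_kb_per_create(*current[threads])
        / log_kb_per_create(*delaylog[threads])
        for threads in current
    }


if __name__ == "__main__":
    for threads, factor in reduction_factors().items():
        print(f"{threads} thread(s): {factor:.1f}x less log per create")
```

With these inputs the factors come out at roughly 34x, 23x and 16x
for 1, 2 and 4 threads, so the log traffic per create drops by more
than an order of magnitude in every row.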

Known issues that need to be resolved:

        - xfslogd can effectively lock up spinning for 10s of
          seconds at a time under heavy load. Cause unknown,
          needs analysis and fixing.
        - leaks memory in some error paths.
        - occasional recovery failure with recovery reading an inode
          buffer that does not contain inodes. Cause unknown, tends
          to be reproduced by xfsqa test 121 semi-reliably. Needs
          further analysis and fixing. May already be fixed with a
          recent fix to commit record synchronisation.
        - Checkpoint log ticket allocation is less than ideal - can
          also trigger lockdep warnings if we re-enter the FS. =>
          needs KM_NOFS and a cleanup.
        - stress will probably break it. Need to run a variety of
          workloads/benchmarks and sort out issues that are
          uncovered.
        - scalability, while improved, is still largely an unknown.
          Will need to run tests on big machines to find where new
          contention points have been introduced.
        - impact on sync/fsync heavy workloads largely unknown. It
          should not be significant, but needs testing and analysis.
        - determine if the current checkpoint sizing is appropriate,
          or whether further dynamic sizing (e.g. based on log size)
          needs investigation.

Further algorithmic optimisations:

        - busy extent tracking is still not ideal - we can get lots
          (thousands) of adjacent single extents in the same
          transaction so combining them at transaction commit would
          be advantageous.
        - Don't need barriers on every single log IO. Indeed,
          funnelling 8MB of IO through 8x256k buffers is not really
          ideal. Only really need barrier on first IO of checkpoint
          (to ensure all the changes we are about to overwrite are on
          disk already) and last IO (to ensure commit record hits
          the disk).
        - commit record synchronisation is simplistic and can cause
          too many wakeups. Needs to be smarter about finding
          previous sequences to wait on.
        - AIL pushing can trigger far too many log forces in a short
          period of time.
        - start looking at areas where CPU usage is excessive and
          try to trim it.
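
The busy extent combining idea in the first item above could look
something like the following. This is only a sketch in Python, not
the kernel code: it assumes each busy extent is a (start block,
length) pair within a single allocation group, and
merge_adjacent_extents is a hypothetical name.

```python
def merge_adjacent_extents(extents):
    """Coalesce (start_block, length) extents that touch or overlap.

    Hypothetical helper: all extents are assumed to live in the same
    allocation group, so block numbers are directly comparable.
    """
    merged = []
    for start, length in sorted(extents):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            # This extent begins at or before the end of the previous
            # one, so grow the previous extent instead of adding a new one.
            prev_start, prev_len = merged[-1]
            new_end = max(prev_start + prev_len, start + length)
            merged[-1] = (prev_start, new_end - prev_start)
        else:
            merged.append((start, length))
    return merged


if __name__ == "__main__":
    # 1000 single-block extents freed back to back, plus one far away.
    busy = [(100 + i, 1) for i in range(1000)] + [(5000, 4)]
    print(merge_adjacent_extents(busy))
```

In this example a thousand adjacent single-block extents collapse
into one 1000-block extent, leaving two entries to track instead of
1001.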

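To put a number on the barrier item in the list above: assuming an
8MB checkpoint written through 256k log buffers (the logbsize used in
the tests), every checkpoint is 32 log IOs. The hypothetical helper
below only counts barriers for the two schemes; it does not model the
IO itself.

```python
def checkpoint_barriers(checkpoint_bytes, log_buffer_bytes):
    """Return (log IOs, barriers with one per IO, barriers on first/last only)."""
    ios = -(-checkpoint_bytes // log_buffer_bytes)  # ceiling division
    barriers_every_io = ios
    # A checkpoint that fits in a single IO needs only one barrier.
    barriers_first_last = min(ios, 2)
    return ios, barriers_every_io, barriers_first_last


if __name__ == "__main__":
    print(checkpoint_barriers(8 * 1024 * 1024, 256 * 1024))
```

Under those assumptions the count goes from 32 barriers per
checkpoint to 2.
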
There's still a lot of work to do before this is production ready,
but I think it's stable enough now that the code is not going to
change significantly as a result of trying to fix bugs that are
lurking.  Currently I'm aiming for experimental inclusion into
mainline for 2.6.35, with the aim for it to be production ready by
2.6.37 and the default for 2.6.39.

Anyway, here are the details of the tree. Note that this branch
includes a merge of the trans-cleanup branch as it is dependent on
those changes.

The following changes since commit 5077f72749e6a78eb57211caf337cda8297bf882:
  Dave Chinner (1):
        xfs: don't warn about page discards on shutdown

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfs.git delayed-logging

Dave Chinner (19):
      xfs: introduce new internal log vector structure
      xfs: factor xlog_write and make use of new log vector structure
      xfs: Delayed logging design documentation
      xfs: introduce delayed logging mount option
      xfs: extend the log item to support delayed logging
      xfs: Introduce the Committed Item List
      xfs: Add delayed logging checkpoint context infrastructure
      xfs: introduce new chained log vector transaction formatting code
      xfs: format and insert log vectors into the CIL
      xfs: attach transactions to the checkpoint context
      xfs: checkpoint transaction infrastructure
      xfs: Allow multiple in-flight checkpoints
      xfs: forced unmounts need to push the CIL
      xfs: enable background pushing of the CIL
      xfs: modify buffer item reference counting for delayed logging
      XFS: replace fixed size busy extent array with an rbtree
      XFS: Don't use log forces when busy extents are allocated
      XFS: Simplify transaction busy extent tracking
      xfs: cluster fsync transaction

 .../filesystems/xfs-delayed-logging-design.txt     |  819 ++++++++++++++++++++
 fs/xfs/Makefile                                    |    1 +
 fs/xfs/linux-2.6/xfs_buf.c                         |    9 +
 fs/xfs/linux-2.6/xfs_file.c                        |   65 ++-
 fs/xfs/linux-2.6/xfs_super.c                       |    9 +
 fs/xfs/linux-2.6/xfs_trace.h                       |   80 ++-
 fs/xfs/xfs_ag.h                                    |   21 +-
 fs/xfs/xfs_alloc.c                                 |  257 ++++---
 fs/xfs/xfs_alloc.h                                 |    5 +-
 fs/xfs/xfs_buf_item.c                              |   33 +-
 fs/xfs/xfs_log.c                                   |  679 +++++++++++------
 fs/xfs/xfs_log.h                                   |   15 +-
 fs/xfs/xfs_log_cil.c                               |  698 +++++++++++++++++
 fs/xfs/xfs_log_priv.h                              |  117 +++-
 fs/xfs/xfs_mount.h                                 |    1 +
 fs/xfs/xfs_trans.c                                 |  193 ++++-
 fs/xfs/xfs_trans.h                                 |   53 +-
 fs/xfs/xfs_trans_item.c                            |  109 ---
 fs/xfs/xfs_trans_priv.h                            |   10 +-
 19 files changed, 2594 insertions(+), 580 deletions(-)
 create mode 100644 Documentation/filesystems/xfs-delayed-logging-design.txt
 create mode 100644 fs/xfs/xfs_log_cil.c

Anyway that's it for now - comments, thoughts, bug fixes, etc are
welcome. :)

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx