[XFS updates] XFS development tree branch, master, updated. v2.6.34-1974

To: xfs@xxxxxxxxxxx
Subject: [XFS updates] XFS development tree branch, master, updated. v2.6.34-19744-ga9c7b13
From: xfs@xxxxxxxxxxx
Date: Tue, 28 Sep 2010 15:57:01 -0500
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "XFS development tree".

The branch, master has been updated
  a9c7b13 xfs: pack xfs_buf structure more tightly
  c6942de xfs: convert buffer cache hash to rbtree
  6c97772 xfs: serialise inode reclaim within an AG
  e1a48db xfs: batch inode reclaim lookup
  c727163 xfs: implement batched inode lookups for AG walking
  7227905 xfs: split out inode walk inode grabbing
  fa78a91 xfs: split inode AG walking into separate code for reclaim
  7608770 xfs: remove buftarg hash for external devices
  00d42de xfs: use unhashed buffers for size checks
  ec09a3c xfs: kill XBF_FS_MANAGED buffers
  075a968 xfs: store xfs_mount in the buftarg instead of in the xfs_buf
  0c6b79a xfs: introduced uncached buffer read primitive
  e601d2f xfs: rename xfs_buf_get_nodaddr to be more appropriate
  0c9a0e0 xfs: don't use vfs writeback for pure metadata modifications
  ec9cb17 xfs: lockless per-ag lookups
  c07719e xfs: remove debug assert for per-ag reference counting
  1c34652 xfs: reduce the number of CIL lock round trips during commit
  3881f5f xfs: force background CIL push under sustained load
      from  e89318c670af3959db3aa483da509565f5a2536c (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit a9c7b1373fab80a039c11af9683d49a557825f61
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 19:59:15 2010 +1000

    xfs: pack xfs_buf structure more tightly
    
    pahole reports the struct xfs_buf has quite a few holes in it, so
    packing the structure better will reduce the size of it by 16 bytes.
    Also, move all the fields used in cache lookups into the first
    cacheline.
    
    Before on x86_64:
    
        /* size: 320, cachelines: 5 */
        /* sum members: 298, holes: 6, sum holes: 22 */
    
    After on x86_64:
    
        /* size: 304, cachelines: 5 */
        /* padding: 6 */
        /* last cacheline: 48 bytes */
    
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>
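
The effect of reordering members can be seen with a small standalone
sketch. The two structs below are hypothetical and are not the real
xfs_buf layout; they only show how placing the largest members first
removes holes and puts the lookup keys at the start of the structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 1-byte flags between 8-byte fields force padding. */
struct demo_loose {
    uint8_t  locked;
    uint64_t blkno;
    uint8_t  stale;
    uint64_t length;
    uint32_t flags;
};

/* Same members, largest first, with the lookup keys at the front. */
struct demo_packed {
    uint64_t blkno;
    uint64_t length;
    uint32_t flags;
    uint8_t  locked;
    uint8_t  stale;
};

/* Bytes saved by reordering alone; the exact value depends on the ABI. */
size_t demo_bytes_saved(void)
{
    assert(sizeof(struct demo_packed) <= sizeof(struct demo_loose));
    return sizeof(struct demo_loose) - sizeof(struct demo_packed);
}
```

On a typical LP64 ABI such as x86_64 the two structs should come out at
40 and 24 bytes respectively; running pahole on the compiled object
would list the holes in the first one.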

commit c6942de96cd4b9cd03f26fd016a6fb7d275992d4
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 19:59:04 2010 +1000

    xfs: convert buffer cache hash to rbtree
    
    The buffer cache hash is showing typical hash scalability problems.
    In large scale testing the number of cached items is growing far larger
    than the hash can efficiently handle. Hence we need to move to a
    self-scaling cache indexing mechanism.
    
    I have selected rbtrees for indexing because they can have O(log n)
    search scalability, and insert and remove cost is not excessive,
    even on large trees. Hence we should be able to cache large numbers
    of buffers without incurring the excessive cache miss search
    penalties that the hash is imposing on us.
    
    To ensure we still have parallel access to the cache, we need
    multiple trees. Rather than hashing the buffers by disk address to
    select a tree, it seems more sensible to separate trees by typical
    access patterns. Most operations use buffers from within a single AG
    at a time, so rather than searching lots of different lists,
    separate the buffer indexes out into per-AG rbtrees. This means that
    searches during metadata operation have a much higher chance of
    hitting cache resident nodes, and that updates of the tree are less
    likely to disturb trees being accessed on other CPUs doing
    independent operations.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>
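
A minimal sketch of the per-AG indexing idea follows. It uses a plain
unbalanced binary search tree as a stand-in for the kernel rbtree, and
the names (demo_cache, demo_lookup, demo_insert, DEMO_AG_COUNT) are
invented for illustration; none of this is the actual xfs_buf code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_AG_COUNT 4   /* hypothetical number of allocation groups */

struct demo_buf {
    uint64_t blkno;                  /* search key: disk block number */
    struct demo_buf *left, *right;
};

struct demo_cache {
    struct demo_buf *ag_root[DEMO_AG_COUNT];   /* one tree per AG */
};

/* Find the buffer for blkno in the tree of the given AG, or NULL. */
struct demo_buf *demo_lookup(struct demo_cache *c, unsigned agno,
                             uint64_t blkno)
{
    assert(agno < DEMO_AG_COUNT);
    struct demo_buf *n = c->ag_root[agno];
    while (n && n->blkno != blkno)
        n = blkno < n->blkno ? n->left : n->right;
    return n;
}

/* Insert blkno into the AG's tree; return the existing node if cached. */
struct demo_buf *demo_insert(struct demo_cache *c, unsigned agno,
                             uint64_t blkno)
{
    assert(agno < DEMO_AG_COUNT);
    struct demo_buf **link = &c->ag_root[agno];
    while (*link) {
        if ((*link)->blkno == blkno)
            return *link;
        link = blkno < (*link)->blkno ? &(*link)->left : &(*link)->right;
    }
    struct demo_buf *n = calloc(1, sizeof(*n));
    if (!n)
        return NULL;
    n->blkno = blkno;
    *link = n;
    return n;
}
```

Because each AG has its own root, an insert in one AG never touches the
nodes of another AG's tree, which is the property the commit message
relies on for parallel access.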

commit 6c977723efe0db8f028f674f2701a7f8ddb5d258
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Mon Sep 27 11:09:51 2010 +1000

    xfs: serialise inode reclaim within an AG
    
    Memory reclaim via shrinkers has a terrible habit of having N+M
    concurrent shrinker executions (N = num CPUs, M = num kswapds) all
    trying to shrink the same cache. When the cache they are all working
    on is protected by a single spinlock, massive contention and
    slowdowns occur.
    
    Wrap the per-ag inode caches with a reclaim mutex to serialise
    reclaim access to the AG. This will block concurrent reclaim in each
    AG but still allow reclaim to scan multiple AGs concurrently. Allow
    shrinkers to move on to the next AG if it can't get the lock, and if
    we can't get any AG, then start blocking on locks.
    
    To prevent reclaimers from continually scanning the same inodes in
    each AG, add a cursor that tracks where the last reclaim got up to
    and start from that point on the next reclaim. This should avoid
    only ever scanning a small number of inodes at the start of each AG
    and not making progress. If we have a non-shrinker based reclaim
    pass, ignore the cursor and reset it to zero once we are done.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>
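
The skip-busy-AGs pass and the reclaim cursor can be sketched in a
single-threaded simulation. Everything here is hypothetical: a boolean
stands in for the per-AG reclaim mutex, the names are invented, and the
wrap-to-zero rule at the end of an AG is an assumption made for the
example rather than a description of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_AGS 4   /* hypothetical number of allocation groups */

struct demo_ag {
    bool   reclaim_busy;  /* stands in for a held per-AG reclaim mutex */
    size_t cursor;        /* where the previous reclaim pass stopped */
    size_t ninodes;       /* number of reclaimable slots in this AG */
};

/* Trylock pass: claim the first AG nobody else is reclaiming. Returns -1
 * if all are busy (a real caller would then block on a lock instead). */
int demo_pick_ag(struct demo_ag *ags, int start)
{
    for (int i = 0; i < DEMO_AGS; i++) {
        int agno = (start + i) % DEMO_AGS;
        if (!ags[agno].reclaim_busy) {
            ags[agno].reclaim_busy = true;
            return agno;
        }
    }
    return -1;
}

/* Scan nr_to_scan slots from the cursor, remember where we stopped, and
 * release the AG. Returns the first index that was scanned. */
size_t demo_scan(struct demo_ag *ag, size_t nr_to_scan)
{
    assert(ag->reclaim_busy);
    size_t first = ag->cursor;
    size_t end = first + nr_to_scan;
    ag->cursor = end >= ag->ninodes ? 0 : end;  /* assumed wrap rule */
    ag->reclaim_busy = false;
    return first;
}
```

Two callers starting at AG 0 end up on different AGs, and repeated scans
of one AG advance through it rather than rescanning the first few slots.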

commit e1a48dbec9ba6aa24ae61d4b8d412b2b39b2baa9
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 19:51:50 2010 +1000

    xfs: batch inode reclaim lookup
    
    Batch and optimise the per-ag inode lookup for reclaim to minimise
    scanning overhead. This involves gang lookups on the radix trees to
    get multiple inodes during each tree walk, and tighter validation of
    what inodes can be reclaimed without blocking before we take any
    locks.
    
    This is based on ideas suggested in a proof-of-concept patch
    posted by Nick Piggin.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>
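
As a rough illustration of a gang lookup, the sketch below fetches
several entries per search instead of one. A sorted array stands in for
the radix tree, and demo_gang_lookup, demo_walk and DEMO_BATCH are
hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BATCH 3   /* hypothetical batch size */

/* Copy up to max entries with value >= first into batch; return count. */
size_t demo_gang_lookup(const uint64_t *inos, size_t n, uint64_t first,
                        uint64_t *batch, size_t max)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {                       /* binary search lower bound */
        size_t mid = lo + (hi - lo) / 2;
        if (inos[mid] < first)
            lo = mid + 1;
        else
            hi = mid;
    }
    size_t found = 0;
    while (found < max && lo + found < n) {
        batch[found] = inos[lo + found];
        found++;
    }
    return found;
}

/* Visit every entry in batches. Returns the number of entries visited and
 * stores the number of lookups performed in *lookups. */
size_t demo_walk(const uint64_t *inos, size_t n, size_t *lookups)
{
    uint64_t batch[DEMO_BATCH];
    uint64_t next = 0;
    size_t visited = 0;

    assert(lookups != NULL);
    *lookups = 0;
    for (;;) {
        size_t got = demo_gang_lookup(inos, n, next, batch, DEMO_BATCH);
        (*lookups)++;
        if (got == 0)
            break;
        visited += got;
        if (batch[got - 1] == UINT64_MAX)   /* avoid wrapping the key */
            break;
        next = batch[got - 1] + 1;          /* resume after the last hit */
    }
    return visited;
}
```

With seven entries and a batch size of three, the walk needs four
searches (the last one finds nothing) rather than one search per entry.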

commit c7271639bcbc3246e8afbd74746d32f1a507782e
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue Sep 28 12:28:19 2010 +1000

    xfs: implement batched inode lookups for AG walking
    
    With the reclaim code separated from the generic walking code, it is
    simple to implement batched lookups for the generic walk code.
    Separate out the inode validation from the execute operations and
    modify the tree lookups to get a batch of inodes at a time.
    
    Reclaim operations will be optimised separately.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 722790573bde4611dd1a3439d6f4e42d3c0cc65f
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue Sep 28 12:28:06 2010 +1000

    xfs: split out inode walk inode grabbing
    
    When doing read side inode cache walks, the code to validate and
    grab an inode is common to all callers. Split it out of the execute
    callbacks in preparation for batching lookups. Similarly, split out
    the inode reference dropping from the execute callbacks into the
    main lookup loop to be symmetric with the grab.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit fa78a9124f57e85382b942b183ce2cf0a691d71a
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 18:40:15 2010 +1000

    xfs: split inode AG walking into separate code for reclaim
    
    The reclaim walk requires different locking and has a slightly
    different walk algorithm, so separate it out so that it can be
    optimised separately.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 7608770b317d97702410477db31c159739171b00
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Wed Sep 22 10:47:20 2010 +1000

    xfs: remove buftarg hash for external devices
    
    For RT and external log devices, we never use hashed buffers on them
    now.  Remove the buftarg hash tables that are set up for them.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 00d42de4a2117d16c16750718242819e65889262
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Wed Sep 22 10:47:20 2010 +1000

    xfs: use unhashed buffers for size checks
    
    When we are checking that we can access the last block of each device, we
    do not need to use cached buffers as they will be tossed away
    immediately. Use uncached buffers for size checks so that all IO
    prior to full in-memory structure initialisation does not use the
    buffer cache.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit ec09a3c36986a2bf2431e835870f499ba0074991
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Wed Sep 22 10:47:20 2010 +1000

    xfs: kill XBF_FS_MANAGED buffers
    
    Filesystem level managed buffers are buffers that have their
    lifecycle controlled by the filesystem layer, not the buffer cache.
    We currently cache these buffers, which makes cleanup and cache
    walking somewhat troublesome. Convert the fs managed buffers to
    uncached buffers obtained via xfs_buf_get_uncached(), and remove
    the XBF_FS_MANAGED special cases from the buffer cache.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 075a96845b43ff609476cc26d466d2e6c020eac5
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Wed Sep 22 10:47:20 2010 +1000

    xfs: store xfs_mount in the buftarg instead of in the xfs_buf
    
    Each buffer contains both a buftarg pointer and a mount pointer. If
    we add a mount pointer into the buftarg, we can avoid needing the
    b_mount field in every buffer and grab it from the buftarg when
    needed instead. This shrinks the xfs_buf by 8 bytes.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 0c6b79a05107490af559c9e5bfa6b906e910e1bf
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 21:58:31 2010 +1000

    xfs: introduced uncached buffer read primitive
    
    To avoid the need to use cached buffers for single-shot or buffers
    cached at the filesystem level, introduce a new buffer read
    primitive that bypasses the cache and reads directly from disk.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit e601d2feccfb957cc95dbb151f434ca390b43949
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 20:07:47 2010 +1000

    xfs: rename xfs_buf_get_nodaddr to be more appropriate
    
    xfs_buf_get_nodaddr() is really used to allocate a buffer that is
    uncached. While it is not directly assigned a disk address, the fact
    that they are not cached is a more important distinction. With the
    upcoming uncached buffer read primitive, we should be consistent
    with this distinction.
    
    While there, make page allocation in xfs_buf_get_nodaddr() safe
    against memory reclaim re-entrancy into the filesystem by allowing
    a flags parameter to be passed.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 0c9a0e0cdba9677ff78a2ec28f5ff8b4db530dd6
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Tue Sep 28 12:27:25 2010 +1000

    xfs: don't use vfs writeback for pure metadata modifications
    
    Under heavy multi-way parallel create workloads, the VFS struggles
    to write back all the inodes that have been changed in age order.
    The bdi flusher thread becomes CPU bound, spending 85% of its time
    in the VFS code, mostly traversing the superblock dirty inode list
    to separate dirty inodes old enough to flush.
    
    We already keep an index of all metadata changes in age order - in
    the AIL - and continued log pressure will do age ordered writeback
    without any extra overhead at all. If there is no pressure on the
    log, the xfssyncd will periodically write back metadata in ascending
    disk address offset order so will be very efficient.
    
    Hence we can stop marking VFS inodes dirty during transaction commit
    or when changing timestamps during transactions. This will keep the
    inodes in the superblock dirty list to those containing data or
    unlogged metadata changes.
    
    However, the timestamp changes are slightly more complex than this -
    there are a couple of places that do unlogged updates of the
    timestamps, and the VFS needs to be informed of these. Hence add a
    new function xfs_trans_ichgtime() for transactional changes,
    and leave xfs_ichgtime() for the non-transactional changes.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>

commit ec9cb17171ce6179f788a28a3bf4614678305715
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Wed Sep 22 10:47:20 2010 +1000

    xfs: lockless per-ag lookups
    
    When we start taking a reference to the per-ag for every cached
    buffer in the system, kernel lockstat profiling on an 8-way create
    workload shows the mp->m_perag_lock has higher acquisition rates
    than the inode lock and has significantly more contention. That is,
    it becomes the highest contended lock in the system.
    
    The perag lookup is trivial to convert to lock-less RCU lookups
    because perag structures never go away. Hence the only thing we need
    to protect against is tree structure changes during a grow. This can
    be done simply by replacing the locking in xfs_perag_get() with RCU
    read locking. This removes the mp->m_perag_lock completely from this
    path.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit c07719e7fe1ca3bf98b89e8798ded068fe911ea1
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Wed Sep 22 10:47:20 2010 +1000

    xfs: remove debug assert for per-ag reference counting
    
    When we start taking references per cached buffer to the perag
    it is cached on, it will blow the current debug maximum reference
    count assert out of the water. The assert has never caught a bug,
    and we have tracing to track changes if there ever is a problem,
    so just remove it.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>

commit 1c34652755dd670b6a1db00c7d14f9511eeecc00
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 18:14:13 2010 +1000

    xfs: reduce the number of CIL lock round trips during commit
    
    When committing a transaction, we do a CIL state lock round trip
    on every single log vector we insert into the CIL. This is resulting
    in the lock being as hot as the inode and dcache locks on 8-way
    create workloads. Rework the insertion loops to bring the number
    of lock round trips to one per transaction for log vectors, and one
    more for the busy extents.
    
    Also change the allocation of the log vector buffer not to zero it
    as we copy over the entire allocated buffer anyway.
    
    This patch also includes a structural cleanup to the CIL item
    insertion provided by Christoph Hellwig.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>
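
The difference in lock traffic can be shown with a toy model that simply
counts lock acquisitions. The demo_cil structure and both commit
functions are hypothetical; the "lock" is a counter, not a real
spinlock, and the sketch says nothing about how xfs_log_cil.c is
actually organised.

```c
#include <assert.h>

struct demo_cil {
    unsigned lock_trips;   /* how many times the state lock was taken */
    unsigned items;        /* log vectors inserted so far */
    unsigned busy;         /* busy extents moved so far */
};

static void demo_lock(struct demo_cil *cil)   { cil->lock_trips++; }
static void demo_unlock(struct demo_cil *cil) { (void)cil; }

/* Old pattern: one lock round trip for every log vector inserted. */
void demo_commit_per_vector(struct demo_cil *cil, unsigned nvecs)
{
    assert(cil != 0);
    for (unsigned i = 0; i < nvecs; i++) {
        demo_lock(cil);
        cil->items++;
        demo_unlock(cil);
    }
}

/* New pattern: one round trip for all log vectors of the transaction,
 * and one more for its busy extents. */
void demo_commit_batched(struct demo_cil *cil, unsigned nvecs,
                         unsigned nbusy)
{
    assert(cil != 0);
    demo_lock(cil);
    for (unsigned i = 0; i < nvecs; i++)
        cil->items++;
    demo_unlock(cil);

    demo_lock(cil);
    cil->busy += nbusy;
    demo_unlock(cil);
}
```

For a transaction with five log vectors the first function takes the
lock five times and the second takes it twice, regardless of how many
vectors the transaction carries.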

commit 3881f5f7fc84d444a0ff45b4bffc3c2d012703ce
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Fri Sep 24 18:13:44 2010 +1000

    xfs: force background CIL push under sustained load
    
    I have been seeing occasional pauses in transaction throughput up to
    30s long under heavy parallel workloads. The only notable thing was
    that the xfsaild was trying to be active during the pauses, but
    making no progress. It was running exactly 20 times a second (on the
    50ms no-progress backoff), and the number of pushbuf events was
    constant across this time as well.  IOWs, the xfsaild appeared to be
    stuck on buffers that it could not push out.
    
    Further investigation indicated that it was trying to push out inode
    buffers that were pinned and/or locked. The xfsbufd was also getting
    woken at the same frequency (by the xfsaild, no doubt) to push out
    delayed write buffers. The xfsbufd was not making any progress
    because all the buffers in the delwri queue were pinned. This scan-
    and-make-no-progress dance went on in the trace for some seconds,
    before the xfssyncd came along and issued a log force, and then
    things started going again.
    
    However, I noticed something strange about the log force - there
    were way too many IO's issued. 516 log buffers were written, to be
    exact. That added up to 129MB of log IO, which got me very
    interested because it's almost exactly 25% of the size of the log.
    The delayed logging code is supposed to aggregate the minimum of 25%
    of the log or 8MB worth of changes before flushing. That's what
    really puzzled me - why did a log force write 129MB instead of only
    8MB?
    
    Essentially what has happened is that no CIL pushes had occurred
    since the previous tail push which cleared out 25% of the log space.
    That caused all the new transactions to block because there wasn't
    log space for them, but they kick the xfsaild to push the tail.
    However, the xfsaild was not making progress because there were
    buffers it could not lock and flush, and the xfsbufd could not flush
    them because they were pinned. As a result, both the xfsaild and the
    xfsbufd could not move the tail of the log forward without the CIL
    first committing.
    
    The cause of the problem was that the background CIL push, which
    should happen when 8MB of aggregated changes have been committed, is
    being held off by the concurrent transaction commit load. The
    background push does a down_write_trylock() which will fail if there
    is a concurrent transaction commit holding the push lock in read
    mode. With 8 CPUs all doing transactions as fast as they can, there
    were enough concurrent transaction commits to hold off the background
    push until tail-pushing could no longer free log space, and the halt
    would occur.
    
    It should be noted that there is no reason why it would halt at 25%
    of log space used by a single CIL checkpoint. This bug could
    definitely violate the "no transaction should be larger than half
    the log" requirement and hence result in corruption if the system
    crashed under heavy load. This sort of bug is exactly the reason why
    delayed logging was tagged as experimental....
    
    The fix is to start blocking background pushes once the threshold
    has been exceeded. Rework the threshold calculations to keep the
    amount of log space a CIL checkpoint can use to below that of the
    AIL push threshold to avoid the problem completely.
    
    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Alex Elder <aelder@xxxxxxx>
    Reviewed-by: Christoph Hellwig <hch@xxxxxx>

-----------------------------------------------------------------------

Summary of changes:
 fs/xfs/linux-2.6/xfs_buf.c     |  200 +++++++++++---------
 fs/xfs/linux-2.6/xfs_buf.h     |   50 +++---
 fs/xfs/linux-2.6/xfs_ioctl.c   |    2 +-
 fs/xfs/linux-2.6/xfs_iops.c    |   35 ----
 fs/xfs/linux-2.6/xfs_super.c   |   15 +-
 fs/xfs/linux-2.6/xfs_sync.c    |  413 +++++++++++++++++++++++-----------------
 fs/xfs/linux-2.6/xfs_sync.h    |    4 +-
 fs/xfs/linux-2.6/xfs_trace.h   |    4 +-
 fs/xfs/quota/xfs_qm_syscalls.c |   14 +--
 fs/xfs/xfs_ag.h                |    9 +
 fs/xfs/xfs_attr.c              |   31 +--
 fs/xfs/xfs_buf_item.c          |    3 +-
 fs/xfs/xfs_fsops.c             |   11 +-
 fs/xfs/xfs_inode.h             |    1 -
 fs/xfs/xfs_inode_item.c        |    9 -
 fs/xfs/xfs_log.c               |    3 +-
 fs/xfs/xfs_log_cil.c           |  244 +++++++++++++-----------
 fs/xfs/xfs_log_priv.h          |   37 ++--
 fs/xfs/xfs_log_recover.c       |   19 +-
 fs/xfs/xfs_mount.c             |  152 ++++++++-------
 fs/xfs/xfs_mount.h             |    2 +
 fs/xfs/xfs_rename.c            |   12 +-
 fs/xfs/xfs_rtalloc.c           |   29 ++--
 fs/xfs/xfs_trans.h             |    1 +
 fs/xfs/xfs_trans_inode.c       |   30 +++
 fs/xfs/xfs_utils.c             |    4 +-
 fs/xfs/xfs_vnodeops.c          |   23 ++-
 27 files changed, 732 insertions(+), 625 deletions(-)


hooks/post-receive
-- 
XFS development tree
