This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "XFS development tree".
The branch, master has been updated
a9c7b13 xfs: pack xfs_buf structure more tightly
c6942de xfs: convert buffer cache hash to rbtree
6c97772 xfs: serialise inode reclaim within an AG
e1a48db xfs: batch inode reclaim lookup
c727163 xfs: implement batched inode lookups for AG walking
7227905 xfs: split out inode walk inode grabbing
fa78a91 xfs: split inode AG walking into separate code for reclaim
7608770 xfs: remove buftarg hash for external devices
00d42de xfs: use unhashed buffers for size checks
ec09a3c xfs: kill XBF_FS_MANAGED buffers
075a968 xfs: store xfs_mount in the buftarg instead of in the xfs_buf
0c6b79a xfs: introduced uncached buffer read primitive
e601d2f xfs: rename xfs_buf_get_nodaddr to be more appropriate
0c9a0e0 xfs: don't use vfs writeback for pure metadata modifications
ec9cb17 xfs: lockless per-ag lookups
c07719e xfs: remove debug assert for per-ag reference counting
1c34652 xfs: reduce the number of CIL lock round trips during commit
3881f5f xfs: force background CIL push under sustained load
from e89318c670af3959db3aa483da509565f5a2536c (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
- Log -----------------------------------------------------------------
commit a9c7b1373fab80a039c11af9683d49a557825f61
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Fri Sep 24 19:59:15 2010 +1000
xfs: pack xfs_buf structure more tightly
pahole reports the struct xfs_buf has quite a few holes in it, so
packing the structure better will reduce the size of it by 16 bytes.
Also, move all the fields used in cache lookups into the first
cacheline.
Before on x86_64:
/* size: 320, cachelines: 5 */
/* sum members: 298, holes: 6, sum holes: 22 */
After on x86_64:
/* size: 304, cachelines: 5 */
/* padding: 6 */
/* last cacheline: 48 bytes */
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
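The effect of member ordering on structure size can be illustrated with a small sketch. The field names below are hypothetical and are not the real struct xfs_buf members; the sketch only shows how interleaving small and large members creates the kind of holes pahole reports, and how reordering removes them.

```python
import ctypes


class LooseBuf(ctypes.Structure):
    # Hypothetical fields, NOT the real struct xfs_buf: small members are
    # interleaved with 8-byte members, so alignment padding is required.
    _fields_ = [
        ("flags", ctypes.c_uint32),
        ("file_offset", ctypes.c_uint64),
        ("error", ctypes.c_uint16),
        ("block_number", ctypes.c_uint64),
        ("hold_count", ctypes.c_uint32),
        ("address", ctypes.c_uint64),
    ]


class TightBuf(ctypes.Structure):
    # Same members, ordered largest first so no interior holes are needed.
    _fields_ = [
        ("file_offset", ctypes.c_uint64),
        ("block_number", ctypes.c_uint64),
        ("address", ctypes.c_uint64),
        ("flags", ctypes.c_uint32),
        ("hold_count", ctypes.c_uint32),
        ("error", ctypes.c_uint16),
    ]


def layout_report(struct_type):
    """Return (total size, sum of member sizes, bytes lost to holes/padding)."""
    members = sum(ctypes.sizeof(ftype) for _name, ftype in struct_type._fields_)
    size = ctypes.sizeof(struct_type)
    return size, members, size - members


if __name__ == "__main__":
    for struct_type in (LooseBuf, TightBuf):
        print(struct_type.__name__, layout_report(struct_type))
```

On an ABI where 64-bit integers are 8-byte aligned (x86_64, for example), the first layout occupies 48 bytes and the second 40, with the same 34 bytes of members in both.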
commit c6942de96cd4b9cd03f26fd016a6fb7d275992d4
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Fri Sep 24 19:59:04 2010 +1000
xfs: convert buffer cache hash to rbtree
The buffer cache hash is showing typical hash scalability problems.
In large scale testing the number of cached items grows far larger
than the hash can efficiently handle. Hence we need to move to a
self-scaling cache indexing mechanism.
I have selected rbtrees for indexing because they can have O(log n)
search scalability, and insert and remove cost is not excessive,
even on large trees. Hence we should be able to cache large numbers
of buffers without incurring the excessive cache miss search
penalties that the hash is imposing on us.
To ensure we still have parallel access to the cache, we need
multiple trees. Rather than hashing the buffers by disk address to
select a tree, it seems more sensible to separate trees by typical
access patterns. Most operations use buffers from within a single AG
at a time, so rather than searching lots of different lists,
separate the buffer indexes out into per-AG rbtrees. This means that
searches during metadata operation have a much higher chance of
hitting cache resident nodes, and that updates of the tree are less
likely to disturb trees being accessed on other CPUs doing
independent operations.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
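The idea of one ordered index per AG, rather than one shared hash, can be sketched as follows. This is only a model: it swaps in a sorted list searched with bisect in place of an rbtree (lookups are O(log n), but list insertion is O(n), unlike a real rbtree), and the class and method names are made up for illustration.

```python
import bisect


class PerAgBufferIndex:
    """One sorted index per allocation group (AG), keyed by block number."""

    def __init__(self, ag_count):
        # Parallel lists per AG: sorted block numbers and their buffers.
        self._blocks = [[] for _ in range(ag_count)]
        self._buffers = [[] for _ in range(ag_count)]

    def insert(self, agno, blkno, buf):
        blocks = self._blocks[agno]
        pos = bisect.bisect_left(blocks, blkno)
        if pos < len(blocks) and blocks[pos] == blkno:
            raise KeyError(f"block {blkno} already cached in AG {agno}")
        blocks.insert(pos, blkno)
        self._buffers[agno].insert(pos, buf)

    def find(self, agno, blkno):
        # Binary search touches only this AG's index.
        blocks = self._blocks[agno]
        pos = bisect.bisect_left(blocks, blkno)
        if pos < len(blocks) and blocks[pos] == blkno:
            return self._buffers[agno][pos]
        return None

    def remove(self, agno, blkno):
        blocks = self._blocks[agno]
        pos = bisect.bisect_left(blocks, blkno)
        if pos < len(blocks) and blocks[pos] == blkno:
            blocks.pop(pos)
            return self._buffers[agno].pop(pos)
        return None
```

Because each AG has its own index, an update in one AG never modifies the structure that a lookup in another AG is walking.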
commit 6c977723efe0db8f028f674f2701a7f8ddb5d258
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Mon Sep 27 11:09:51 2010 +1000
xfs: serialise inode reclaim within an AG
Memory reclaim via shrinkers has a terrible habit of having N+M
concurrent shrinker executions (N = num CPUs, M = num kswapds) all
trying to shrink the same cache. When the cache they are all working
on is protected by a single spinlock, massive contention and
slowdowns occur.
Wrap the per-ag inode caches with a reclaim mutex to serialise
reclaim access to the AG. This will block concurrent reclaim in each
AG but still allow reclaim to scan multiple AGs concurrently. Allow
shrinkers to move on to the next AG if they can't get the lock, and if
we can't get any AG, then start blocking on locks.
To prevent reclaimers from continually scanning the same inodes in
each AG, add a cursor that tracks where the last reclaim got up to
and start from that point on the next reclaim. This should avoid
only ever scanning a small number of inodes at the start of each AG
and not making progress. If we have a non-shrinker based reclaim
pass, ignore the cursor and reset it to zero once we are done.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
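A rough model of the locking and cursor behaviour described above, assuming a trylock pass over all AGs followed by a blocking pass only if no AG lock could be taken. The AG class, reclaim_pass() and its parameters are hypothetical names for this sketch, and "reclaim" is reduced to returning the inode numbers that were scanned.

```python
import threading


class AG:
    def __init__(self, inodes):
        self.inodes = list(inodes)            # reclaimable inodes, index order
        self.reclaim_lock = threading.Lock()  # serialises reclaim in this AG
        self.cursor = 0                       # where the last shrinker pass stopped


def reclaim_pass(ags, nr_to_scan, from_shrinker=True):
    """Scan up to nr_to_scan inodes; return the inode numbers scanned."""
    scanned = []
    # First pass uses trylock so a busy AG is skipped; block only if
    # the trylock pass could not get any AG at all.
    for blocking in (False, True):
        got_lock = False
        for ag in ags:
            if len(scanned) >= nr_to_scan:
                break
            if not ag.reclaim_lock.acquire(blocking=blocking):
                continue
            got_lock = True
            try:
                # Non-shrinker passes ignore the cursor and start from zero.
                start = ag.cursor if from_shrinker else 0
                batch = ag.inodes[start:start + nr_to_scan - len(scanned)]
                scanned.extend(batch)
                end = start + len(batch)
                # Remember progress for the next shrinker call; wrap at the
                # end of the AG, and reset after a non-shrinker pass.
                ag.cursor = end if from_shrinker and end < len(ag.inodes) else 0
            finally:
                ag.reclaim_lock.release()
        if got_lock:
            break
    return scanned
```

Two successive shrinker calls with nr_to_scan=2 on an AG holding inodes 0..3 scan [0, 1] and then [2, 3], rather than rescanning the start of the AG each time.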
commit e1a48dbec9ba6aa24ae61d4b8d412b2b39b2baa9
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Fri Sep 24 19:51:50 2010 +1000
xfs: batch inode reclaim lookup
Batch and optimise the per-ag inode lookup for reclaim to minimise
scanning overhead. This involves gang lookups on the radix trees to
get multiple inodes during each tree walk, and tighter validation of
what inodes can be reclaimed without blocking before we take any
locks.
This is based on ideas suggested in a proof-of-concept patch
posted by Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
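The batching pattern can be sketched like this: fetch several entries per index lookup, filter them cheaply, then resume from the last key seen. The sketch uses a sorted list of inode numbers in place of the radix tree, and gang_lookup(), walk_batched() and can_reclaim are illustrative names, not the kernel interfaces.

```python
import bisect


def gang_lookup(sorted_inos, first_ino, max_items):
    """Return up to max_items inode numbers >= first_ino, in order."""
    start = bisect.bisect_left(sorted_inos, first_ino)
    return sorted_inos[start:start + max_items]


def walk_batched(sorted_inos, batch_size, can_reclaim):
    """Walk the whole index in batches; return (candidates, lookup count)."""
    candidates = []
    lookups = 0
    next_ino = 0
    while True:
        batch = gang_lookup(sorted_inos, next_ino, batch_size)
        lookups += 1
        if not batch:
            break
        # Cheap validation first, before any (modelled) locks would be taken.
        candidates.extend(ino for ino in batch if can_reclaim(ino))
        # Resume the walk just past the last inode returned.
        next_ino = batch[-1] + 1
    return candidates, lookups
```

With seven inodes and a batch size of three, the walk needs four lookups (including the final empty one) instead of one lookup per inode.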
commit c7271639bcbc3246e8afbd74746d32f1a507782e
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Tue Sep 28 12:28:19 2010 +1000
xfs: implement batched inode lookups for AG walking
With the reclaim code separated from the generic walking code, it is
simple to implement batched lookups for the generic walk code.
Separate out the inode validation from the execute operations and
modify the tree lookups to get a batch of inodes at a time.
Reclaim operations will be optimised separately.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
commit 722790573bde4611dd1a3439d6f4e42d3c0cc65f
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Tue Sep 28 12:28:06 2010 +1000
xfs: split out inode walk inode grabbing
When doing read side inode cache walks, the code to validate and
grab an inode is common to all callers. Split it out of the execute
callbacks in preparation for batching lookups. Similarly, split out
the inode reference dropping from the execute callbacks into the
main lookup loop to be symmetric with the grab.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
commit fa78a9124f57e85382b942b183ce2cf0a691d71a
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Fri Sep 24 18:40:15 2010 +1000
xfs: split inode AG walking into separate code for reclaim
The reclaim walk requires different locking and has a slightly
different walk algorithm, so separate it out so that it can be
optimised separately.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
commit 7608770b317d97702410477db31c159739171b00
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Wed Sep 22 10:47:20 2010 +1000
xfs: remove buftarg hash for external devices
For RT and external log devices, we never use hashed buffers on them
now. Remove the buftarg hash tables that are set up for them.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
commit 00d42de4a2117d16c16750718242819e65889262
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Wed Sep 22 10:47:20 2010 +1000
xfs: use unhashed buffers for size checks
When we are checking that we can access the last block of each device, we
do not need to use cached buffers as they will be tossed away
immediately. Use uncached buffers for size checks so that all IO
prior to full in-memory structure initialisation does not use the
buffer cache.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
commit ec09a3c36986a2bf2431e835870f499ba0074991
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Wed Sep 22 10:47:20 2010 +1000
xfs: kill XBF_FS_MANAGED buffers
Filesystem level managed buffers are buffers that have their
lifecycle controlled by the filesystem layer, not the buffer cache.
We currently cache these buffers, which makes cleanup and cache
walking somewhat troublesome. Convert the fs managed buffers to
uncached buffers obtained via xfs_buf_get_uncached(), and remove
the XBF_FS_MANAGED special cases from the buffer cache.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
commit 075a96845b43ff609476cc26d466d2e6c020eac5
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Wed Sep 22 10:47:20 2010 +1000
xfs: store xfs_mount in the buftarg instead of in the xfs_buf
Each buffer contains both a buftarg pointer and a mount pointer. If
we add a mount pointer into the buftarg, we can avoid needing the
b_mount field in every buffer and grab it from the buftarg when
needed instead. This shrinks the xfs_buf by 8 bytes.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
commit 0c6b79a05107490af559c9e5bfa6b906e910e1bf
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Fri Sep 24 21:58:31 2010 +1000
xfs: introduced uncached buffer read primitive
To avoid the need to use cached buffers for single-shot buffers or
buffers cached at the filesystem level, introduce a new buffer read
primitive that bypasses the cache and reads directly from disk.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
commit e601d2feccfb957cc95dbb151f434ca390b43949
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Fri Sep 24 20:07:47 2010 +1000
xfs: rename xfs_buf_get_nodaddr to be more appropriate
xfs_buf_get_nodaddr() is really used to allocate a buffer that is
uncached. While such a buffer is not directly assigned a disk address,
the fact that it is not cached is a more important distinction. With
the upcoming uncached buffer read primitive, we should be consistent
with this distinction.
While there, make page allocation in xfs_buf_get_nodaddr() safe
against memory reclaim re-entrancy into the filesystem by allowing
a flags parameter to be passed.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
commit 0c9a0e0cdba9677ff78a2ec28f5ff8b4db530dd6
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Tue Sep 28 12:27:25 2010 +1000
xfs: don't use vfs writeback for pure metadata modifications
Under heavy multi-way parallel create workloads, the VFS struggles
to write back all the inodes that have been changed in age order.
The bdi flusher thread becomes CPU bound, spending 85% of its time
in the VFS code, mostly traversing the superblock dirty inode list
to separate dirty inodes old enough to flush.
We already keep an index of all metadata changes in age order - in
the AIL - and continued log pressure will do age ordered writeback
without any extra overhead at all. If there is no pressure on the
log, the xfssyncd will periodically write back metadata in ascending
disk address offset order so will be very efficient.
Hence we can stop marking VFS inodes dirty during transaction commit
or when changing timestamps during transactions. This will keep the
inodes in the superblock dirty list to those containing data or
unlogged metadata changes.
However, the timestamp changes are slightly more complex than this -
there are a couple of places that do unlogged updates of the
timestamps, and the VFS needs to be informed of these. Hence add a
new function xfs_trans_ichgtime() for transactional changes,
and leave xfs_ichgtime() for the non-transactional changes.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
commit ec9cb17171ce6179f788a28a3bf4614678305715
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Wed Sep 22 10:47:20 2010 +1000
xfs: lockless per-ag lookups
When we start taking a reference to the per-ag for every cached
buffer in the system, kernel lockstat profiling on an 8-way create
workload shows the mp->m_perag_lock has higher acquisition rates
than the inode lock and has significantly more contention. That is,
it becomes the highest contended lock in the system.
The perag lookup is trivial to convert to lock-less RCU lookups
because perag structures never go away. Hence the only thing we need
to protect against is tree structure changes during a grow. This can
be done simply by replacing the locking in xfs_perag_get() with RCU
read locking. This removes the mp->m_perag_lock completely from this
path.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
commit c07719e7fe1ca3bf98b89e8798ded068fe911ea1
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Wed Sep 22 10:47:20 2010 +1000
xfs: remove debug assert for per-ag reference counting
When we start taking references per cached buffer to the perag
it is cached on, it will blow the current debug maximum reference
count assert out of the water. The assert has never caught a bug,
and we have tracing to track changes if there ever is a problem,
so just remove it.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
commit 1c34652755dd670b6a1db00c7d14f9511eeecc00
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Fri Sep 24 18:14:13 2010 +1000
xfs: reduce the number of CIL lock round trips during commit
When committing a transaction, we do a CIL state lock round trip
on every single log vector we insert into the CIL. This is resulting
in the lock being as hot as the inode and dcache locks on 8-way
create workloads. Rework the insertion loops to bring the number
of lock round trips to one per transaction for log vectors, and one
more for the busy extents.
Also change the allocation of the log vector buffer not to zero it
as we copy over the entire allocated buffer anyway.
This patch also includes a structural cleanup to the CIL item
insertion provided by Christoph Hellwig.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
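The difference in lock traffic can be shown with a counting lock. This is a toy model under stated assumptions: CountingLock and both insert functions are hypothetical, and the "CIL" and busy extent list are plain Python lists.

```python
import threading


class CountingLock:
    """A mutex that counts how many acquire/release round trips it sees."""

    def __init__(self):
        self._lock = threading.Lock()
        self.round_trips = 0

    def __enter__(self):
        self._lock.acquire()
        self.round_trips += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def insert_per_vector(cil, lock, log_vectors):
    # Old pattern: one lock round trip for every log vector inserted.
    for lv in log_vectors:
        with lock:
            cil.append(lv)


def insert_per_transaction(cil, busy_list, lock, log_vectors, busy_extents):
    # Reworked pattern: one round trip for all log vectors of the
    # transaction, and one more for its busy extents.
    with lock:
        cil.extend(log_vectors)
    with lock:
        busy_list.extend(busy_extents)
```

For a transaction carrying ten log vectors, the first pattern takes the lock ten times and the second takes it twice, regardless of how many vectors there are.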
commit 3881f5f7fc84d444a0ff45b4bffc3c2d012703ce
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Fri Sep 24 18:13:44 2010 +1000
xfs: force background CIL push under sustained load
I have been seeing occasional pauses in transaction throughput up to
30s long under heavy parallel workloads. The only notable thing was
that the xfsaild was trying to be active during the pauses, but
making no progress. It was running exactly 20 times a second (on the
50ms no-progress backoff), and the number of pushbuf events was
constant across this time as well. IOWs, the xfsaild appeared to be
stuck on buffers that it could not push out.
Further investigation indicated that it was trying to push out inode
buffers that were pinned and/or locked. The xfsbufd was also getting
woken at the same frequency (by the xfsaild, no doubt) to push out
delayed write buffers. The xfsbufd was not making any progress
because all the buffers in the delwri queue were pinned. This
scan-and-make-no-progress dance went on in the trace for some seconds,
before the xfssyncd came along and issued a log force, and then
things started going again.
However, I noticed something strange about the log force - there
were way too many IOs issued. 516 log buffers were written, to be
exact. That added up to 129MB of log IO, which got me very
interested because it's almost exactly 25% of the size of the log.
The delayed logging code is supposed to aggregate the minimum of 25%
of the log or 8MB worth of changes before flushing. That's what
really puzzled me - why did a log force write 129MB instead of only
8MB?
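The numbers above can be checked with a little arithmetic. The 256KB log buffer size is an inference from 129MB / 516 buffers, not something stated in this message, and cil_push_threshold() is a hypothetical helper that just encodes the "minimum of 25% of the log or 8MB" rule as described.

```python
KIB = 1024
MIB = 1024 * KIB

log_buffers_written = 516            # from the trace described above
assumed_log_buffer_size = 256 * KIB  # assumption: inferred from 129MB / 516

log_io_bytes = log_buffers_written * assumed_log_buffer_size

# If 129MB is about 25% of the log, the log is roughly four times that.
implied_log_size = log_io_bytes * 4


def cil_push_threshold(log_size_bytes):
    """Intended background push point: min(25% of the log, 8MB)."""
    return min(log_size_bytes // 4, 8 * MIB)


if __name__ == "__main__":
    print("log IO issued (MB):", log_io_bytes // MIB)
    print("implied log size (MB):", implied_log_size // MIB)
    print("intended push threshold (MB):",
          cil_push_threshold(implied_log_size) // MIB)
```

Under these assumptions the log force wrote about sixteen times the 8MB that a background push should have allowed to accumulate.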
Essentially what has happened is that no CIL pushes had occurred
since the previous tail push which cleared out 25% of the log space.
That caused all the new transactions to block because there wasn't
log space for them, but they kick the xfsaild to push the tail.
However, the xfsaild was not making progress because there were
buffers it could not lock and flush, and the xfsbufd could not flush
them because they were pinned. As a result, both the xfsaild and the
xfsbufd could not move the tail of the log forward without the CIL
first committing.
The cause of the problem was that the background CIL push, which
should happen when 8MB of aggregated changes have been committed, is
being held off by the concurrent transaction commit load. The
background push does a down_write_trylock() which will fail if there
is a concurrent transaction commit holding the push lock in read
mode. With 8 CPUs all doing transactions as fast as they can, there
were enough concurrent transaction commits to hold off the background
push until tail-pushing could no longer free log space, and the halt
would occur.
It should be noted that there is no reason why it would halt at 25%
of log space used by a single CIL checkpoint. This bug could
definitely violate the "no transaction should be larger than half
the log" requirement and hence result in corruption if the system
crashed under heavy load. This sort of bug is exactly the reason why
delayed logging was tagged as experimental....
The fix is to start blocking background pushes once the threshold
has been exceeded. Rework the threshold calculations to keep the
amount of log space a CIL checkpoint can use to below that of the
AIL push threshold to avoid the problem completely.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Reviewed-by: Alex Elder <aelder@xxxxxxx>
Reviewed-by: Christoph Hellwig <hch@xxxxxx>
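The starvation and the fix can be modelled with a deterministic, single-threaded simulation. This is a deliberately crude sketch: peak_cil_bytes() and its parameters are invented for illustration, a trylock is modelled as "succeeds only when no other commit holds the push lock in read mode", and a blocking push is modelled as always succeeding.

```python
KIB = 1024
MIB = 1024 * KIB


def peak_cil_bytes(load, bytes_per_commit, threshold, block_when_over):
    """Return the largest CIL size seen while replaying a commit load.

    load: for each commit, how many other commits hold the push lock
          in read mode at that moment.
    block_when_over: False models the trylock-only background push,
                     True models blocking once the threshold is exceeded.
    """
    cil = 0
    peak = 0
    for readers in load:
        cil += bytes_per_commit
        peak = max(peak, cil)
        if cil < threshold:
            continue
        trylock_ok = readers == 0  # write trylock fails while readers exist
        if trylock_ok or block_when_over:
            cil = 0                # background push empties the CIL
    return peak


if __name__ == "__main__":
    busy = [7] * 1000              # 8 CPUs: 7 other commits always in flight
    print(peak_cil_bytes(busy, 64 * KIB, 8 * MIB, False) // KIB, "KiB")
    print(peak_cil_bytes(busy, 64 * KIB, 8 * MIB, True) // KIB, "KiB")
```

With a sustained load the trylock-only variant never pushes and the CIL grows without bound, while the blocking variant keeps it at the threshold.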
-----------------------------------------------------------------------
Summary of changes:
fs/xfs/linux-2.6/xfs_buf.c | 200 +++++++++++---------
fs/xfs/linux-2.6/xfs_buf.h | 50 +++---
fs/xfs/linux-2.6/xfs_ioctl.c | 2 +-
fs/xfs/linux-2.6/xfs_iops.c | 35 ----
fs/xfs/linux-2.6/xfs_super.c | 15 +-
fs/xfs/linux-2.6/xfs_sync.c | 413 +++++++++++++++++++++++-----------------
fs/xfs/linux-2.6/xfs_sync.h | 4 +-
fs/xfs/linux-2.6/xfs_trace.h | 4 +-
fs/xfs/quota/xfs_qm_syscalls.c | 14 +--
fs/xfs/xfs_ag.h | 9 +
fs/xfs/xfs_attr.c | 31 +--
fs/xfs/xfs_buf_item.c | 3 +-
fs/xfs/xfs_fsops.c | 11 +-
fs/xfs/xfs_inode.h | 1 -
fs/xfs/xfs_inode_item.c | 9 -
fs/xfs/xfs_log.c | 3 +-
fs/xfs/xfs_log_cil.c | 244 +++++++++++++-----------
fs/xfs/xfs_log_priv.h | 37 ++--
fs/xfs/xfs_log_recover.c | 19 +-
fs/xfs/xfs_mount.c | 152 ++++++++-------
fs/xfs/xfs_mount.h | 2 +
fs/xfs/xfs_rename.c | 12 +-
fs/xfs/xfs_rtalloc.c | 29 ++--
fs/xfs/xfs_trans.h | 1 +
fs/xfs/xfs_trans_inode.c | 30 +++
fs/xfs/xfs_utils.c | 4 +-
fs/xfs/xfs_vnodeops.c | 23 ++-
27 files changed, 732 insertions(+), 625 deletions(-)
hooks/post-receive
--
XFS development tree