Folks,
FYI, here is my current XFS patch stack that I'll be trying to get ready in
time for the 2.6.38 merge window. Note that the first two patches are
candidates for 2.6.37-rc. They are a perag reference counting fix and the
movement of a trace point.
My tree is currently based on the VFS locking changes I have out for review,
so there are a couple of patches that won't apply sanely to a mainline or OSS xfs
dev tree. See below for a pointer to a git tree with all the patches in it.
First patch is a per-cpu superblock counter rewrite. This uses the generic
per-cpu counter infrastructure to do the heavy lifting. Needs to be split into
two patches.
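To illustrate the idea behind a batched per-cpu counter, here is a minimal userspace sketch. It is not the code from the patch or from lib/percpu_counter.c: the names (pcp_counter, pcp_add, pcp_read, pcp_sum), the CPU count and the batch size are all made up for illustration, and the locking the kernel does around the global update is left out.

```c
#include <stdint.h>

#define NR_CPUS 8   /* hypothetical CPU count for this sketch */
#define BATCH   32  /* hypothetical fold threshold */

struct pcp_counter {
    int64_t global;          /* approximate global value */
    int32_t local[NR_CPUS];  /* per-CPU deltas not yet folded in */
};

/*
 * Add 'amount' on behalf of 'cpu'. The shared global count is only
 * touched once the local delta reaches the batch size; a real
 * implementation would take a lock around that update.
 */
static void pcp_add(struct pcp_counter *c, int cpu, int32_t amount)
{
    int32_t v = c->local[cpu] + amount;

    if (v >= BATCH || v <= -BATCH) {
        c->global += v;
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = v;
    }
}

/* Cheap read: may be off by up to NR_CPUS * (BATCH - 1). */
static int64_t pcp_read(const struct pcp_counter *c)
{
    return c->global;
}

/* Exact read: walks every CPU's local delta, so it is slower. */
static int64_t pcp_sum(const struct pcp_counter *c)
{
    int64_t sum = c->global;

    for (int i = 0; i < NR_CPUS; i++)
        sum += c->local[i];
    return sum;
}
```

The trade-off is that the fast path only touches CPU-local state, at the cost of the cheap read being approximate; callers that need an exact value near a limit have to pay for the full sum.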
Following this are the dynamic speculative allocation patches. These have been
rewritten to be based on the current inode size rather than a thumb-in-the-air
how-many-preallocs-have-we-already-done algorithm. There are also some fixes in
the second patch that fix assumptions about ip->i_delayed_blks being zero after
a flush.
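As a rough sketch of what "based on the current inode size" could look like, the following sizes the speculative preallocation as the file size rounded down to a power of two, clamped between a floor and a ceiling. The rounding rule, the limits and the function names are assumptions made for illustration, not the heuristic from the patches.

```c
#include <stdint.h>

/* Hypothetical limits, chosen only for this sketch. */
#define PREALLOC_MIN_BYTES (64ULL * 1024)
#define PREALLOC_MAX_BYTES (8ULL * 1024 * 1024 * 1024)

/* Largest power of two that is <= v, or 0 when v is 0. */
static uint64_t rounddown_pow2(uint64_t v)
{
    uint64_t p = 1;

    if (v == 0)
        return 0;
    while (p <= v / 2)
        p *= 2;
    return p;
}

/*
 * Size the speculative preallocation from the current file size, so
 * that larger files get larger preallocations beyond EOF.
 */
static uint64_t spec_prealloc_bytes(uint64_t isize)
{
    uint64_t size = rounddown_pow2(isize);

    if (size < PREALLOC_MIN_BYTES)
        size = PREALLOC_MIN_BYTES;
    if (size > PREALLOC_MAX_BYTES)
        size = PREALLOC_MAX_BYTES;
    return size;
}
```

With these assumed limits, a 3MB file would get a 2MB preallocation, and the size depends only on the file itself rather than on a count of earlier preallocations.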
Next up we have the inode cache RCU freeing and lookup patches, including one
that avoids putting the inode in the VFS hash (similar to Christoph's patch,
but using the different VFS code).
Then there are buffer cache reclaim changes. First is a per-buftarg shrinker
interface, followed by a lazily updated per-buftarg buffer LRU. Building on
this, the prioritised buffer reclaim hooks are connected up to ensure that more
critical buffers are harder to reclaim.
AIL lock contention fixes are next, with bulk AIL insert and removal functions
being implemented and connected up to the transaction commit and inode buffer
IO completion routines. These significantly reduce AIL lock contention, and
combined with a reduction in the granularity of xfsaild push wakeups, the AIL
lock drops out of the "top 10" contended locks on 8-way workloads.
There's a fix to avoid error injection from burning CPU on debug kernels - with
a badly fragmented freespace tree, the btree block validation was taking ~60%
of the CPU time, with most of that running error injection checks.
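The general shape of that fix can be sketched as a cheap "is anything armed?" check in front of the expensive per-call test. Everything below (errinj_active, errinj_arm, errinj_slow_test, errinj_should_fail, and the call counter used to show the effect) is hypothetical and only illustrates the gating pattern, not the actual xfs_error_test code.

```c
#include <stdbool.h>

static int errinj_active;                /* number of armed injection tags */
static int errinj_armed_tag = -1;        /* single armed tag, for simplicity */
static unsigned long errinj_slow_calls;  /* counts trips through the slow path */

/* Arm error injection for one tag. */
static void errinj_arm(int tag)
{
    errinj_armed_tag = tag;
    errinj_active++;
}

/* Stand-in for the expensive per-call check. */
static bool errinj_slow_test(int tag)
{
    errinj_slow_calls++;
    return tag == errinj_armed_tag;
}

/*
 * Fast-path gate: when nothing is armed, return immediately without
 * entering the slow test at all.
 */
static bool errinj_should_fail(int tag)
{
    if (!errinj_active)
        return false;
    return errinj_slow_test(tag);
}
```

With nothing armed, hot callers such as btree block validation would pay for a single integer test instead of the full check.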
Finally, there's a patch to split up the log grant lock. This needs splitting
into 4 or 5 smaller patches (as you can see from the commit log, that is how it
started out). It splits the grant lock into two list locks (reserve and write queues),
and converts all the other variables that the grant lock protected into atomic
variables. Grant head calculations are made atomic by converting them into 64
bit "LSNs" and the use of cmpxchg loops on atomic 64 bit variables. All log
tail and sync LSN updates are made atomic via conversion to atomic variables.
With this, the grant lock goes away completely, and the transaction reserve
fast path now only has two cmpxchg loops instead of a heavily contended spin
lock.
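For anyone unfamiliar with the technique, here is a minimal userspace sketch of a grant head update done as a cmpxchg loop on a 64 bit value, using C11 atomics. The packing (cycle in the high 32 bits, byte offset in the low 32 bits), the helper names and the assumption that a single reservation never exceeds the log size are illustrative choices, not the code from the patch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Pack a cycle number and a byte offset into one 64 bit value. */
static inline uint64_t grant_pack(uint32_t cycle, uint32_t bytes)
{
    return ((uint64_t)cycle << 32) | bytes;
}

static inline uint32_t grant_cycle(uint64_t v) { return (uint32_t)(v >> 32); }
static inline uint32_t grant_bytes(uint64_t v) { return (uint32_t)v; }

/*
 * Advance the grant head by 'bytes', wrapping into the next cycle at
 * 'log_size'. Assumes bytes <= log_size. No lock is taken: if another
 * thread moved the head first, the compare-exchange fails, 'old' is
 * refreshed with the current value and the calculation is redone.
 */
static void grant_add_space(_Atomic uint64_t *head, uint32_t log_size,
                            uint32_t bytes)
{
    uint64_t old = atomic_load(head);
    uint64_t next;

    do {
        uint32_t cycle = grant_cycle(old);
        uint32_t space = grant_bytes(old);
        uint32_t left = log_size - space;

        if (bytes >= left) {
            /* Wrap around the end of the log into the next cycle. */
            cycle++;
            space = bytes - left;
        } else {
            space += bytes;
        }
        next = grant_pack(cycle, space);
    } while (!atomic_compare_exchange_weak(head, &old, next));
}
```

Because cycle and offset live in one word, both are updated in a single atomic step, which is what lets the calculation run without a lock.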
The result of all this is raw CPU-bound 8-way create performance of just over
100,000 inodes/s, and unlink performance of over 90,000 inodes/s. 8-way dbench
performance is improved from ~1150MB/s to ~1650MB/s by this patchset.
For 8-way creation and unlink of small files (~50 million), the lockstat
profiles look like:
                                   contended         total
Lock                            acquisitions  acquisitions  Description
------------------------------  ------------  ------------  -----------------------
inode_wb_list_lock:                496330785     836287347  VFS
dcache_lock:                       116299583     681450027  VFS
&(&vblk->lock)->rlock:              52829329     131054495  virtio block device
&sb->s_type->i_lock_key#1:          41772196    2375571240  VFS (inode->i_lock)
&(&cil->xc_cil_lock)->rlock:        29549897     410553961  XFS (CIL commit lock)
&irq_desc_lock_class:               27520142      63908701  IRQ edge lock
&(&pag->pag_buf_lock)->rlock:       11756249    1838039685  XFS (buffer cache lock)
&(&dentry->d_lock)->rlock:           5735657    1225028487  VFS
&(&parent->list_lock)->rlock:        4356293     249408696  VM (SLAB list lock)
inode_sb_list_lock:                  3616366     203712449  VFS
key#5:                               2075310     139221312  XFS SB percpu counter
inode_hash_lock:                     1529969     102359626  VFS
rcu_node_level_0:                    1363470      13730113  RCU
&(&zone->lock)->rlock:               1247467      16469316  VM (free list lock)
&(&pag->pag_ici_lock)->rlock:         770880     337090972  XFS (inode cache lock)
&rq->lock:                            589111     184220946  Scheduler
inode_lru_lock:                       527163     102791204  VFS
g->l_grant_write_lock)->rlock:        526471      51279626  XFS (grant write lock)
&(&pag->pagb_lock)->rlock:            402878     208861744  XFS (busy extent list)
&(&zone->lru_lock)->rlock:            167692      25383748  VM (page cache LRU)
&on_slab_l3_key:                      166183      58470153  VM (slab cache)
semaphore->lock#2:                    161321    3659173925  ???
&(&ailp->xa_lock)->rlock:             143859     164470123  XFS (AIL lock)
&cil->xc_ctx_lock-W:                   32850        173279  XFS (CIL push lock)
&cil->xc_ctx_lock-R:                   90868     357572724  XFS (CIL push lock)
I have yet to determine whether I'll have the time to finish removing the page
cache from the buffer cache - for pure inode create/unlink workloads the buftarg
mapping tree lock is the second most heavily contended lock in the system.
Hence this definitely needs solving in some way or another....
Anyway, comments are welcome - just keep in mind that there is still some
polish required for these patches. ;)
If you want the git version, everything is here:
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git working
Dave Chinner (16):
xfs: fix per-ag reference counting in inode reclaim tree walking
xfs: move delayed write buffer trace
[RFC] xfs: use generic per-cpu counter infrastructure
xfs: dynamic speculative EOF preallocation
xfs: don't truncate prealloc from frequently accessed inodes
patch xfs-inode-hash-fake
xfs: convert inode cache lookups to use RCU locking
xfs: convert pag_ici_lock to a spin lock
xfs: convert xfsbufd shrinker to a per-buftarg shrinker.
xfs: add a lru to the XFS buffer cache
xfs: connect up buffer reclaim priority hooks
xfs: bulk AIL insertion during transaction commit
xfs: reduce the number of AIL push wakeups
xfs: remove all the inodes on a buffer from the AIL in bulk
xfs: only run xfs_error_test if error injection is active
xfs: make xlog_space_left() independent of the grant lock
fs/xfs/linux-2.6/xfs_buf.c | 239 ++++++++----
fs/xfs/linux-2.6/xfs_buf.h | 43 ++-
fs/xfs/linux-2.6/xfs_iops.c | 11 +-
fs/xfs/linux-2.6/xfs_linux.h | 9 -
fs/xfs/linux-2.6/xfs_super.c | 22 +-
fs/xfs/linux-2.6/xfs_sync.c | 28 +-
fs/xfs/linux-2.6/xfs_trace.h | 36 +-
fs/xfs/quota/xfs_dquot.c | 2 +-
fs/xfs/quota/xfs_qm_syscalls.c | 3 +
fs/xfs/xfs_ag.h | 2 +-
fs/xfs/xfs_alloc.c | 4 +-
fs/xfs/xfs_bmap.c | 9 +-
fs/xfs/xfs_btree.c | 11 +-
fs/xfs/xfs_buf_item.c | 17 +-
fs/xfs/xfs_da_btree.c | 4 +-
fs/xfs/xfs_dfrag.c | 13 +
fs/xfs/xfs_error.c | 3 +
fs/xfs/xfs_error.h | 5 +-
fs/xfs/xfs_extfree_item.c | 85 +++--
fs/xfs/xfs_extfree_item.h | 12 +-
fs/xfs/xfs_fsops.c | 4 +-
fs/xfs/xfs_ialloc.c | 2 +-
fs/xfs/xfs_iget.c | 55 ++-
fs/xfs/xfs_inode.c | 24 +-
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_inode_item.c | 112 +++++-
fs/xfs/xfs_iomap.c | 53 ++-
fs/xfs/xfs_log.c | 678 +++++++++++++++++---------------
fs/xfs/xfs_log_cil.c | 9 +-
fs/xfs/xfs_log_priv.h | 40 ++-
fs/xfs/xfs_log_recover.c | 27 +-
fs/xfs/xfs_mount.c | 837 +++++++++++-----------------------------
fs/xfs/xfs_mount.h | 80 +---
fs/xfs/xfs_trans.c | 70 ++++-
fs/xfs/xfs_trans.h | 2 +-
fs/xfs/xfs_trans_ail.c | 189 ++++++++-
fs/xfs/xfs_trans_extfree.c | 4 +-
fs/xfs/xfs_trans_priv.h | 13 +-
fs/xfs/xfs_vnodeops.c | 61 ++-
include/linux/percpu_counter.h | 16 +
lib/percpu_counter.c | 79 ++++
41 files changed, 1593 insertions(+), 1321 deletions(-)