[PATCH 0/18] xfs: metadata and buffer cache scalability improvements

To: xfs@xxxxxxxxxxx
Subject: [PATCH 0/18] xfs: metadata and buffer cache scalability improvements
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Tue, 14 Sep 2010 20:55:59 +1000

This patchset has grown quite a bit - it started out as a "convert
the buffer cache to rbtrees" patch, and has gotten bigger as I
peeled the onion from one bottleneck to another.

Performance numbers here are 8-way fs_mark create to 50M files, and
8-way rm -rf to remove the files created.

                        wall time       fs_mark rate
        create:         13m10s          65k files/s
        unlink:         23m58s          N/A

The first set of patches are generic infrastructure changes that
address pain points the rbtree based buffer cache introduces. I've
put them first because they are simpler to review and have immediate
impact on performance. These patches address lock contention as
measured by the kernel lockstat infrastructure.

xfs: single thread inode cache shrinking.
        - prevents per-ag contention during cache shrinking

xfs: reduce the number of CIL lock round trips during commit
        - reduces lock traffic on the xc_cil_lock by two orders of
          magnitude

xfs: remove debug assert for per-ag reference counting
xfs: lockless per-ag lookups
        - hottest lock in the system with buffer cache rbtree path
        - converted to use RCU.

xfs: convert inode cache lookups to use RCU locking
xfs: convert pag_ici_lock to a spin lock
        - addresses lookup vs reclaim contention on pag_ici_lock
        - converted to use RCU.

xfs: don't use vfs writeback for pure metadata modifications
        - inode writeback does not keep up with dirtying 100,000
          inodes a second. Avoids the superblock dirty list where
          possible by using the AIL as the age-order flusher.

Performance with these patches:

2.6.36-rc4 + shrinker + CIL + RCU:
        create:         11m38s          80k files/s
        unlink:         14m29s          N/A

Create rate has improved by 20%, unlink time has almost halved. On
large numbers of inodes, the unlink rate improves even more.

The buffer cache to rbtree series currently stands at:

xfs: rename xfs_buf_get_nodaddr to be more appropriate
xfs: introduced uncached buffer read primitive
xfs: store xfs_mount in the buftarg instead of in the xfs_buf
xfs: kill XBF_FS_MANAGED buffers
xfs: use unhashed buffers for size checks
xfs: remove buftarg hash for external devices
        - preparatory buffer cache API cleanup patches

xfs: convert buffer cache hash to rbtree
        - what it says ;)
        - includes changes based on Alex's review.

xfs: pack xfs_buf structure more tightly
        - memory usage reduction, means adding the LRU list head is
          effectively memory usage neutral.

xfs: convert xfsbufd shrinker to a per-buftarg shrinker.
xfs: add a lru to the XFS buffer cache
        - Add an LRU for reclaim

xfs: stop using the page cache to back the buffer cache
        - kill all the page cache code

2.6.36-rc4 + shrinker + CIL + RCU + rbtree:
        create:          9m47s          95k files/s
        unlink:         14m16s          N/A

Create rate has improved by another 20%, unlink rate has improved
marginally (noise, really).
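
As a rough illustration of what a hash to rbtree conversion means for
lookups, here is a minimal user-space sketch. It is not the XFS code:
struct toy_buf, toy_buf_find() and toy_buf_get() are made-up names,
keying on a start block number is an assumption, and rebalancing is
left out, so this is a plain binary search tree rather than a real
red-black tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a cached buffer. */
struct toy_buf {
	uint64_t	blkno;	/* start block: the search key (assumed) */
	struct toy_buf	*left;
	struct toy_buf	*right;
};

/* Find the buffer starting at blkno, or NULL if it is not cached. */
struct toy_buf *toy_buf_find(struct toy_buf *root, uint64_t blkno)
{
	while (root) {
		if (blkno < root->blkno)
			root = root->left;
		else if (blkno > root->blkno)
			root = root->right;
		else
			return root;
	}
	return NULL;
}

/*
 * Return the buffer for blkno, inserting a new one if none exists.
 * Returns NULL only if allocation fails. The rebalancing step that a
 * red-black tree would do after linking the node is omitted here.
 */
struct toy_buf *toy_buf_get(struct toy_buf **rootp, uint64_t blkno)
{
	struct toy_buf **link = rootp;

	assert(rootp != NULL);
	while (*link) {
		if (blkno < (*link)->blkno)
			link = &(*link)->left;
		else if (blkno > (*link)->blkno)
			link = &(*link)->right;
		else
			return *link;	/* cache hit */
	}
	*link = calloc(1, sizeof(**link));
	if (*link)
		(*link)->blkno = blkno;
	return *link;
}
```

Unlike a fixed-size hash, the cost of a lookup in a balanced tree
grows with the logarithm of the number of cached buffers and needs no
table resizing as the cache grows.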

There are two remaining parts to the buffer cache conversions:

        1. work out how to efficiently support block size smaller
        than page size. The current code works, but uses a page per
        sub-page buffer.  A set of slab caches would be perfect for
        this use, but I'm not sure that we are allowed to use them
        for IO anymore. Christoph?

        2. Connect up the buffer type specific reclaim priority
        reference counting and convert the LRU reclaim to a cursor
        based walk that simply drops reclaim reference counts and
        frees anything that has a zero reclaim reference.
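
One possible reading of the cursor based walk in item 2, as a small
user-space sketch. Everything here is hypothetical: struct toy_entry,
struct toy_lru and toy_lru_scan() are illustrative names, a fixed
array stands in for the LRU list, and the order of operations (free
at zero, otherwise drop one reference) is an assumption about how
such a walk could behave, not a description of the eventual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache entry; the names are not taken from XFS. */
struct toy_entry {
	bool		in_use;		/* slot holds a live buffer */
	unsigned int	reclaim_refs;	/* higher = survives more passes */
};

struct toy_lru {
	struct toy_entry	*slots;
	size_t			nr_slots;
	size_t			cursor;	/* where the next scan resumes */
};

/*
 * Scan up to nr_to_scan slots starting at the cursor. A live entry
 * with no reclaim references left is freed; any other live entry
 * loses one reference. Returns the number of entries freed.
 */
size_t toy_lru_scan(struct toy_lru *lru, size_t nr_to_scan)
{
	size_t freed = 0;

	assert(lru != NULL && lru->nr_slots > 0);
	while (nr_to_scan-- > 0) {
		struct toy_entry *e = &lru->slots[lru->cursor];

		if (e->in_use) {
			if (e->reclaim_refs == 0) {
				e->in_use = false; /* stand-in for freeing */
				freed++;
			} else {
				e->reclaim_refs--;
			}
		}
		/* Advance and wrap, so the next call resumes from here. */
		lru->cursor = (lru->cursor + 1) % lru->nr_slots;
	}
	return freed;
}
```

In this model a buffer type with a higher starting reclaim_refs value
simply survives more passes of the cursor before it becomes eligible
to be freed.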

Overall, I can swap the order of the two patch sets, and the
incremental performance increases for create are pretty much
identical. For unlink, the benefit comes from the shrinker
modification. For those that care, the rbtree patch set in isolation
results in a time of 4h38m to create 1 billion inodes on my 8p/4GB
RAM test VM. I haven't run this test with the RCU and writeback
modifications yet.
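
For a sense of scale, a wall time can be turned into a whole-run
average rate with a few lines of C. The helper name avg_rate() is
made up for this sketch, and the result is a plain average over the
wall time, so it is not directly comparable to the rate column that
fs_mark reports.

```c
#include <assert.h>

/* Average items per second over a wall-clock interval. */
double avg_rate(double items, int hours, int minutes, int seconds)
{
	long total = hours * 3600L + minutes * 60L + seconds;

	assert(total > 0);
	return items / (double)total;
}
```

avg_rate(1e9, 4, 38, 0) divides 1 billion inodes by 16680 seconds,
which comes out at roughly 60k inodes/s sustained over the whole run.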

The next step from this point is to start testing against Nick Piggin's
VFS scalability tree, as the inode_lock and dcache_lock are now the
performance limiting factors. That will, without doubt, bring new
hotspots out in XFS so I'll be starting this cycle over again soon.

Overall diffstat at this point is:

 fs/xfs/linux-2.6/kmem.h        |    1 +
 fs/xfs/linux-2.6/xfs_buf.c     |  588 ++++++++++++++--------------------------
 fs/xfs/linux-2.6/xfs_buf.h     |   61 +++--
 fs/xfs/linux-2.6/xfs_iops.c    |   18 +-
 fs/xfs/linux-2.6/xfs_super.c   |   11 +-
 fs/xfs/linux-2.6/xfs_sync.c    |   49 +++-
 fs/xfs/linux-2.6/xfs_trace.h   |    2 +-
 fs/xfs/quota/xfs_qm_syscalls.c |    4 +-
 fs/xfs/xfs_ag.h                |    9 +-
 fs/xfs/xfs_buf_item.c          |    3 +-
 fs/xfs/xfs_fsops.c             |   11 +-
 fs/xfs/xfs_iget.c              |   46 +++-
 fs/xfs/xfs_inode.c             |   22 +-
 fs/xfs/xfs_inode_item.c        |    9 -
 fs/xfs/xfs_log.c               |    3 +-
 fs/xfs/xfs_log_cil.c           |  116 +++++----
 fs/xfs/xfs_log_recover.c       |   18 +-
 fs/xfs/xfs_mount.c             |  126 ++++-----
 fs/xfs/xfs_mount.h             |    2 +
 fs/xfs/xfs_rtalloc.c           |   29 +-
 fs/xfs/xfs_vnodeops.c          |    2 +-
 21 files changed, 502 insertions(+), 628 deletions(-)

So it is improving performance, removing code and fixing
longstanding bugs all at the same time. ;)
