

To: xfs@xxxxxxxxxxx
Subject: Re: Some baseline tests on new hardware (was Re: [PATCH] xfs: optimise CIL insertion during transaction commit [RFC])
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Tue, 9 Jul 2013 18:26:21 +1000
Cc: linux-fsdevel@xxxxxxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20130708124453.GC3438@dastard>
References: <1372657476-9241-1-git-send-email-david@xxxxxxxxxxxxx> <20130708124453.GC3438@dastard>
User-agent: Mutt/1.5.21 (2010-09-15)
On Mon, Jul 08, 2013 at 10:44:53PM +1000, Dave Chinner wrote:
> [cc fsdevel because after all the XFS stuff I did some testing on
> mmotm w.r.t per-node LRU lock contention avoidance, and also some
> scalability tests against ext4 and btrfs for comparison on some new
> hardware. That bit ain't pretty. ]

A quick follow-up on mmotm:

> FWIW, the mmotm kernel (which has a fair bit of debug enabled, so
> not quite comparative) doesn't have any LRU lock contention to speak
> of. For create:
> 
> -   7.81%  [kernel]  [k] __ticket_spin_trylock
>    - __ticket_spin_trylock
>       - 70.98% _raw_spin_lock
>          + 97.55% xfs_log_commit_cil
>          + 0.93% __d_instantiate
>          + 0.58% inode_sb_list_add
>       - 29.02% do_raw_spin_lock
>          - _raw_spin_lock
>             + 41.14% xfs_log_commit_cil
>             + 8.29% _xfs_buf_find
>             + 8.00% xfs_iflush_cluster

So I just ported all my prototype sync and inode_sb_list_lock
changes across to mmotm, as well as the XFS CIL optimisations.

-   2.33%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 70.14% do_raw_spin_lock
         - _raw_spin_lock
            + 16.91% _xfs_buf_find
            + 15.20% list_lru_add
            + 12.83% xfs_log_commit_cil
            + 11.18% d_alloc
            + 7.43% dput
            + 4.56% __d_instantiate
....

Most of the spinlock contention has gone away.
 

> And the walk:
> 
> -  26.37%  [kernel]  [k] __ticket_spin_trylock
>    - __ticket_spin_trylock
>       - 49.10% _raw_spin_lock
>          - 50.65% evict
...
>          - 26.99% list_lru_add
>             + 89.01% inode_add_lru
>             + 10.95% dput
>          + 7.03% __remove_inode_hash
>       - 40.65% do_raw_spin_lock
>          - _raw_spin_lock
>             - 41.96% evict
....
>             - 13.55% list_lru_add
>                  84.33% inode_add_lru
....
>             + 10.10% __remove_inode_hash

-  15.44%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 46.59% _raw_spin_lock
         + 69.40% list_lru_add
           17.65% list_lru_del
           5.70% list_lru_count_node
           2.44% shrink_dentry_list
              prune_dcache_sb
              super_cache_scan
              shrink_slab
           0.86% __page_check_address
      - 33.06% do_raw_spin_lock
         - _raw_spin_lock
            + 36.96% list_lru_add
            + 11.98% list_lru_del
            + 6.68% shrink_dentry_list
            + 6.43% d_alloc
            + 4.79% _xfs_buf_find
.....
      + 11.48% do_raw_spin_trylock
      + 8.87% _raw_spin_trylock

So now we see that CPU wasted on contention is down by 40%.
Observation shows that most of the list_lru_add/list_lru_del
contention occurs when reclaim is running - before memory filled
up the lookup rate was on the high side of 600,000 inodes/s, but
fell back to about 425,000/s once reclaim started working.

> 
> There's quite a different pattern of contention - it has moved
> inward to evict which implies the inode_sb_list_lock is the next
> obvious point of contention. I have patches in the works for that.
> Also, the inode_hash_lock is causing some contention, even though we
> fake inode hashing. I have a patch to fix that for XFS as well.
> 
> I also note an interesting behaviour of the per-node inode LRUs -
> the contention is coming from the dentry shrinker on one node
> freeing inodes allocated on a different node during reclaim. There's
> scope for improvement there.
> 
> But here's the interesting part:
> 
> Kernel       create              walk        unlink
>              time(s)  rate       time(s)     time(s)
> 3.10-cil     222      266k+-32k  170         295
> mmotm        251      222k+-16k  128         356

mmotm-cil      225      258k+-26k  122         296

So even with all the debug on, the mmotm kernel carrying most of the
mods I was running in 3.10-cil, plus the s_inodes -> list_lru
conversion, gets the same throughput for create and unlink and has
much better walk times.

> Even with all the debug enabled, the overall walk time dropped by
> 25% to 128s. So performance in this workload has substantially
> improved because of the per-node LRUs and variability is also down
> as well, as predicted. Once I add all the tweaks I have in the
> 3.10-cil tree to mmotm, I expect significant improvements to create
> and unlink performance as well...
> 
> So, let's look at ext4 vs btrfs vs XFS at 16-way (this is on the
> 3.10-cil kernel I've been testing XFS on):
> 
>           create               walk           unlink
>        time(s)   rate         time(s)         time(s)
> xfs     222   266k+-32k         170             295
> ext4    978    54k+- 2k         325            2053
> btrfs  1223    47k+- 8k         366           12000(*)
> 
> (*) Estimate based on a removal rate of 18.5 minutes for the first
> 4.8 million inodes.

So, let's run these again on my current mmotm tree - it has the ext4
extent tree fixes in it and my rcu inode hash lookup patch...

            create               walk           unlink
         time(s)   rate         time(s)         time(s)
xfs       225   258k+-26k         122             296
ext4      456   118k+- 4k         128            1632
btrfs    1122    51k+- 3k         281            3200(*)

(*) about 4.7 million inodes removed in 5 minutes.

ext4 is a lot healthier: create speed doubles from the extent cache
lock contention fixes, and the walk time halves due to the rcu inode
cache lookup. That said, it is still burning a huge amount of CPU on
the inode_hash_lock when adding and removing inodes. Unlink perf is a
bit faster, but still slow.  So, yeah, things will get better in the
not-too-distant future...

And for btrfs? Well, create is a tiny bit faster, the walk is 20%
faster thanks to the rcu hash lookups, and unlinks are markedly
faster (3x). Still not fast enough for me to hang around waiting for
them to complete, though.

FWIW, while the results are a lot better for ext4, let me just point
out how hard it is driving the storage to get that performance:

load    |    create  |      walk    |           unlink
IO type |    write   |      read    |      read     |      write
        | IOPS   BW  |   IOPS    BW |   IOPS    BW  |    IOPS    BW
--------+------------+--------------+---------------+--------------
xfs     |  900  200  |  18000   140 |   7500    50  |     400    50
ext4    |23000  390  |  55000   200 |   2000    10  |   13000   160
btrfs(*)|peaky   75  |  26000   100 |   decay   10  |   peaky peaky

ext4 is hammering the SSDs far harder than XFS, both in terms of
IOPS and bandwidth. You do not want to run ext4 on your SSD if you
have a metadata intensive workload as it will age the SSD much, much
faster than XFS with that sort of write behaviour.

(*) the btrfs create IO pattern is 5s peaks of write IOPS every 30s.
The baseline is about 500 IOPS, but the peaks reach upwards of
30,000 write IOPS. Unlink does this as well.  There are also short
bursts of 2-3000 read IOPS just before the write IOPS bursts in the
create workload. For the unlink, it starts off with about 10,000
read IOPS, and goes quickly into exponential decay down to about
2000 read IOPS in 90s.  Then it hits some trigger and the cycle
starts again. The trigger appears to coincide with 1-2 million
dentries being reclaimed.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
