Debunking myths about metadata CRC overhead

To: xfs@xxxxxxxxxxx
Subject: Debunking myths about metadata CRC overhead
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Mon, 3 Jun 2013 17:44:52 +1000
Delivered-to: xfs@xxxxxxxxxxx
User-agent: Mutt/1.5.21 (2010-09-15)
Hi folks,

There have been some assertions made recently that metadata CRCs have
too much overhead to always be enabled.  So I'll run some quick
benchmarks to demonstrate the "too much overhead" assertions are
completely unfounded.

These are some numbers from my usual performance test VM. Note that
as this is a VM, it's not running the hardware CRC instructions so
I'm benchmarking the worst case overhead here. i.e. the kernel's
software CRC32c algorithm.
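
For anyone unfamiliar with what is actually being computed, here is a minimal bit-at-a-time sketch of CRC32c (the Castagnoli polynomial). It is an illustration only and is not the kernel's implementation, which I'd expect to be table driven and far faster; the function name and structure are made up for the example.

```python
# Reflected form of the Castagnoli polynomial used by CRC32c.
CASTAGNOLI_REFLECTED = 0x82F63B78


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32c. Pass a previous result as `crc` to continue a running checksum."""
    crc ^= 0xFFFFFFFF                      # initial inversion
    for byte in data:
        crc ^= byte
        for _ in range(8):                 # process one bit at a time, LSB first
            if crc & 1:
                crc = (crc >> 1) ^ CASTAGNOLI_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF                # final inversion


if __name__ == "__main__":
    # "123456789" is the conventional check input for CRC algorithms.
    print(hex(crc32c(b"123456789")))       # 0xe3069283
```

A loop like this touches every bit of every metadata buffer, which is why the software path is the worst case for overhead.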

The VM is 8p, 8GB RAM, 4 node fake-numa config with a 100TB XFS
filesystem being used for testing. The fs is backed by 4x64GB SSDs
sliced via LVM into a 160GB RAID0 device with an XFS filesystem on
it to host the sparse 100TB image file. KVM is using
virtio,cache=none to use direct IO to write to the image file, and
the host is running a 3.8.5 kernel.

Baseline CRC32c performance
---------------------------

The VM runs the xfsprogs selftest program in:

crc32c: tests passed, 225944 bytes in 212 usec

so it can calculate CRCs at roughly 1GB/s on small, random chunks of
data through the software algorithm according to this. Given the
fsmark create workload only drives around 100MB/s of metadata and
journal IO, the minimum CRC32c overhead we should see on a load
spread across 8 CPUs is roughly:

        100MB/s / 1000MB/s / 8p * 100% = 1.25% per CPU

So, in a perfect world, that's what we should see from the kernel
profiles. It's not a perfect world, though, so it will never be
this low (4 cores all trying to use the same memory bus at the same
time, perhaps?), so if we get anywhere near that number I'd be very
happy.
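
The estimate above can be reproduced with a few lines of Python. This is just the same back-of-envelope arithmetic, assuming the load spreads perfectly evenly across CPUs; the function names are made up for the sketch, and the inputs are the selftest and fsmark figures quoted above.

```python
def throughput_mb_per_s(num_bytes: int, usec: int) -> float:
    """Bytes per microsecond is numerically the same as MB/s (10^6 bytes per second)."""
    return num_bytes / usec


def ideal_crc_overhead_pct(metadata_mb_per_s: float,
                           crc_mb_per_s: float,
                           cpus: int) -> float:
    """Best-case per-CPU time spent in CRC code, as a percentage."""
    return metadata_mb_per_s / crc_mb_per_s / cpus * 100.0


if __name__ == "__main__":
    measured = throughput_mb_per_s(225944, 212)
    print(f"selftest throughput: {measured:.0f} MB/s")          # ~1066 MB/s
    # Rounded to 1GB/s as in the text:
    print(f"{ideal_crc_overhead_pct(100, 1000, 8):.2f}% per CPU")      # 1.25%
    # Using the measured selftest rate instead of the rounded one:
    print(f"{ideal_crc_overhead_pct(100, measured, 8):.2f}% per CPU")  # 1.17%
```

Plugging a higher CRC rate into the same function shows how quickly the ideal overhead shrinks as the checksum gets faster.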

Note that a hardware implementation should be faster than the SSE
optimised RAID5/6 calculations on the CPU, which come in at:

[    0.548004] raid6: sse2x4    7221 MB/s

which is a *lot* faster than the software CRC32c rate measured above.
So it's probably reasonable to assume similar throughput for hardware
CRC32c. Hence Intel servers will have substantially lower CRC overhead
than the software CRC32c implementation being measured here.

fs_mark workload
----------------

$ sudo mkfs.xfs -f -m crc=1 -l size=512m,sunit=8 /dev/vdc

vs

$ sudo mkfs.xfs -f -l size=512m,sunit=8 /dev/vdc

8-way 50 million zero-length file create, 8-way
find+stat of all the files, 8-way unlink of all the files:

                no CRCs         CRCs            Difference
create  (time)  483s            510s            +5.2%   (slower)
        (rate)  109k+/-6k       105k+/-5.4k     -3.8%   (slower)

walk            339s            494s            -30.3%  (slower)
     (sys cpu)  1134s           1324s           +14.4%  (slower)

unlink          692s            959s            -27.8%(*) (slower)

(*) All the slowdown here is from the traversal slowdown as seen in
the walk phase. i.e. not related to the unlink operations.

On the surface, it looks like there's a huge impact on the walk and
unlink phases from CRC calculations, but these numbers don't tell
the whole story. Let's look deeper:

Create phase top CPU users (>1% total):

  5.59%  [kernel]  [k] _xfs_buf_find
  5.52%  [kernel]  [k] xfs_dir2_node_addname
  4.58%  [kernel]  [k] memcpy
  3.28%  [kernel]  [k] xfs_dir3_free_hdr_from_disk
  3.05%  [kernel]  [k] __ticket_spin_trylock
  2.94%  [kernel]  [k] __slab_alloc
  1.96%  [kernel]  [k] xfs_log_commit_cil
  1.93%  [kernel]  [k] __slab_free
  1.90%  [kernel]  [k] kmem_cache_alloc
  1.72%  [kernel]  [k] xfs_next_bit
  1.65%  [kernel]  [k] __crc32c_le
  1.52%  [kernel]  [k] _raw_spin_unlock_irqrestore
  1.50%  [kernel]  [k] do_raw_spin_lock
  1.42%  [kernel]  [k] kmem_cache_free
  1.32%  [kernel]  [k] native_read_tsc
  1.28%  [kernel]  [k] __kmalloc
  1.17%  [kernel]  [k] xfs_buf_offset
  1.14%  [kernel]  [k] delay_tsc
  1.14%  [kernel]  [k] kfree
  1.10%  [kernel]  [k] xfs_buf_item_format
  1.06%  [kernel]  [k] xfs_btree_lookup

CRC overhead is at 1.65%, not much higher than the optimum 1.25%
overhead calculated above. So the overhead really isn't that
significant - it's far less overhead than, say, the 1.2 million
buffer lookups a second we are doing (_xfs_buf_find overhead) in
this workload...

Walk phase top CPU users:

  6.64%  [kernel]  [k] __ticket_spin_trylock
  6.05%  [kernel]  [k] _xfs_buf_find
  5.58%  [kernel]  [k] _raw_spin_unlock_irqrestore
  4.88%  [kernel]  [k] _raw_spin_unlock_irq
  3.30%  [kernel]  [k] native_read_tsc
  2.93%  [kernel]  [k] __crc32c_le
  2.87%  [kernel]  [k] delay_tsc
  2.32%  [kernel]  [k] do_raw_spin_lock
  1.98%  [kernel]  [k] blk_flush_plug_list
  1.79%  [kernel]  [k] __slab_alloc
  1.76%  [kernel]  [k] __d_lookup_rcu
  1.56%  [kernel]  [k] kmem_cache_alloc
  1.25%  [kernel]  [k] kmem_cache_free
  1.25%  [kernel]  [k] xfs_da_read_buf
  1.11%  [kernel]  [k] xfs_dir2_leaf_search_hash
  1.08%  [kernel]  [k] flat_send_IPI_mask
  1.02%  [kernel]  [k] radix_tree_lookup_element
  1.00%  [kernel]  [k] do_raw_spin_unlock

There's more CRC32c overhead indicating lower efficiency, but
there's an obvious cause for that - the CRC overhead is dwarfed by
something else new: lock contention.  A quick 30s call graph profile
during the middle of the walk phase shows:

-  12.74%  [kernel]  [k] __ticket_spin_trylock
   - __ticket_spin_trylock
      - 60.49% _raw_spin_lock
         + 91.79% inode_add_lru                 >>> inode_lru_lock
         + 2.98% dentry_lru_del                 >>> dcache_lru_lock
         + 1.30% shrink_dentry_list
         + 0.71% evict
      - 20.42% do_raw_spin_lock
         - _raw_spin_lock
            + 13.41% inode_add_lru              >>> inode_lru_lock
            + 10.55% evict
            + 8.26% dentry_lru_del              >>> dcache_lru_lock
            + 7.62% __remove_inode_hash
....
      - 10.37% do_raw_spin_trylock
         - _raw_spin_trylock
            + 79.65% prune_icache_sb            >>> inode_lru_lock
            + 11.04% shrink_dentry_list
            + 9.24% prune_dcache_sb             >>> dcache_lru_lock
      - 8.72% _raw_spin_trylock
         + 46.33% prune_icache_sb               >>> inode_lru_lock
         + 46.08% shrink_dentry_list
         + 7.60% prune_dcache_sb                >>> dcache_lru_lock

So the lock contention is variable - it's twice as high in this
short sample as the overall profile I measured above. It's also
pretty much all VFS cache LRU lock contention that is causing the
problems here. IOWs, the slowdowns are not related to the overhead
of CRC calculations; it's the change in memory access patterns that
lowers the threshold of catastrophic lock contention and so causes
the slowdown. This VFS LRU problem is being fixed independently by the
generic numa-aware LRU list patchset I've been doing with Glauber
Costa.
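
To put a rough number on "pretty much all", the annotated rows of the call graph can be flattened into a per-lock share of the __ticket_spin_trylock samples. This is a sketch under assumptions: it treats each nested percentage as relative to its parent entry, it only counts the rows marked with ">>>" above, and it ignores the rows elided by "....". The dictionary layout and function name are made up for the example.

```python
# branch -> (share of __ticket_spin_trylock samples, {caller: share of that branch})
CALL_GRAPH = {
    "_raw_spin_lock":      (0.6049, {"inode_add_lru": 0.9179, "dentry_lru_del": 0.0298}),
    "do_raw_spin_lock":    (0.2042, {"inode_add_lru": 0.1341, "dentry_lru_del": 0.0826}),
    "do_raw_spin_trylock": (0.1037, {"prune_icache_sb": 0.7965, "prune_dcache_sb": 0.0924}),
    "_raw_spin_trylock":   (0.0872, {"prune_icache_sb": 0.4633, "prune_dcache_sb": 0.0760}),
}

# Which lock each annotated caller takes, as marked in the profile.
LOCK_OF = {
    "inode_add_lru": "inode_lru_lock",
    "prune_icache_sb": "inode_lru_lock",
    "dentry_lru_del": "dcache_lru_lock",
    "prune_dcache_sb": "dcache_lru_lock",
}


def lock_shares(graph, lock_of):
    """Sum branch_share * caller_share for every annotated caller, grouped by lock."""
    totals = {}
    for branch_share, callers in graph.values():
        for caller, caller_share in callers.items():
            lock = lock_of[caller]
            totals[lock] = totals.get(lock, 0.0) + branch_share * caller_share
    return totals


if __name__ == "__main__":
    for lock, share in lock_shares(CALL_GRAPH, LOCK_OF).items():
        print(f"{lock}: {share * 100:.1f}% of __ticket_spin_trylock samples")
    # inode_lru_lock: 70.6%, dcache_lru_lock: 5.1%
```

On those assumptions the two annotated LRU locks account for roughly three quarters of the spinlock samples, with the inode LRU lock alone at about 70%.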

Therefore, it is clear that the slowdown in this phase is not caused
by the overhead of CRCs, but that of lock contention elsewhere in
the kernel.  The unlink profiles show the same thing as the walk
profiles - additional lock contention on the lookup phase of the
unlink walk.

----

Dbench:

$ sudo mkfs.xfs -f -m crc=1 -l size=128m,sunit=8 /dev/vdc

vs

$ sudo mkfs.xfs -f -l size=128m,sunit=8 /dev/vdc

Running:

$ dbench -t 120 -D /mnt/scratch 8

                no CRCs         CRCs            Difference
thruput         1098.06 MB/s    1229.65 MB/s    +10% (faster)
latency (max)   22.385 ms       22.661 ms       +1.3% (noise)

Well, now that's an interesting result, isn't it? CRC enabled
filesystems are 10% faster than non-crc filesystems. Again, let's
not take that number at face value, but ask ourselves why adding
CRCs improves performance (a.k.a. "know your benchmark")...

It's pretty obvious why - dbench uses xattrs and performance is
sensitive to how many attributes can be stored inline in the inode.
And CRCs increase the inode size to 512 bytes meaning attributes are
probably never out of line. So, let's make it an even playing field
and compare:

$ sudo mkfs.xfs -f -m crc=1 -l size=128m,sunit=8 /dev/vdc

vs

$ sudo mkfs.xfs -f -i size=512 -l size=128m,sunit=8 /dev/vdc

                no CRCs         CRCs            Difference
thruput         1273.22 MB/s    1229.65 MB/s    -3.5% (slower)
latency (max)   25.455 ms       22.661 ms       -12.4% (better)

So, we're back to the same relatively small difference seen in the
fsmark create phase, with similar CRC overhead being shown in the
profiles.

----

Compilebench

Testing the same filesystems with 512 byte inodes as for dbench:

$ ./compilebench -D /mnt/scratch
using working directory /mnt/scratch, 30 intial dirs 100 runs
.....

test                            no CRCs         CRCs
                        runs    avg             avg
==========================================================================
intial create           30      92.12 MB/s      90.24 MB/s
create                  14      61.91 MB/s      61.13 MB/s
patch                   15      41.04 MB/s      38.00 MB/s
compile                 14      278.74 MB/s     262.00 MB/s
clean                   10      1355.30 MB/s    1296.17 MB/s
read tree               11      25.68 MB/s      25.40 MB/s
read compiled tree      4       48.74 MB/s      48.65 MB/s
delete tree             10      2.97 seconds    3.05 seconds
delete compiled tree    4       2.96 seconds    3.05 seconds
stat tree               11      1.33 seconds    1.36 seconds
stat compiled tree      7       1.86 seconds    1.64 seconds

The numbers are so close that the differences are in the noise, and
the CRC overhead doesn't even show up in the ">1% usage" section
of the profile output.

----

Looking at these numbers realistically, dbench and compilebench
model two fairly common metadata intensive workloads - file servers
and code tree manipulations that developers tend to use all the
time. The difference that CRCs make to performance in these
workloads on equivalently configured filesystems varies between
0-5%, and for most operations they are small enough that they can
just about be considered to be noise.

Yes, we could argue over the fsmark walk/unlink phase results, but
the synthetic fsmark workload is designed to push the system to its
limits and it's obvious that the addition of CRCs pushes the VFS into
lock contention hell. Further, we have to recognise that the same
workload on a 12p VM (run 12-way instead of 8-way) without CRCs hits
the same lock contention problem. IOWs, the slowdown is most
definitely not caused by the addition of CRC calculations to XFS
metadata.

The CPU overhead of CRCs is small and may be outweighed by other
changes for CRC filesystems that improve performance far more than
the cost of CRC calculations degrades it.  The numbers above simply
don't support the assertion that metadata CRCs have "too much
overhead".

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
