On 09/08/13 20:33, Dave Chinner wrote:
From: Dave Chinner <dchinner@xxxxxxxxxx>
The CPU overhead of buffer lookups dominates most metadata-intensive
workloads. The thing is, most such workloads are hitting a
relatively small number of buffers repeatedly, and so caching
recently hit buffers is a good idea.
Add a hashed lookaside buffer that records the recent buffer
lookup successes and is searched first before doing a rb-tree
lookup. If we get a hit, we avoid the expensive rbtree lookup and
greatly reduce the overhead of the lookup. If we get a cache miss,
then we've added an extra CPU cacheline miss into the lookup.
In cold cache lookup cases, this extra cache line miss is irrelevant
as we need to read or allocate the buffer anyway, and the setup time
for that dwarfs the cost of the miss.
In the case that we miss the lookaside cache and find the buffer in
the rbtree, the cache line miss overhead will be noticeable only if
we don't see any lookaside cache hits at all in subsequent
lookups. We don't tend to do random cache walks in performance
critical paths, so the net result is that the extra CPU cacheline
miss will be lost in the reduction of misses due to cache hits. This
hit/miss case is what we'll see with file removal operations.
A simple prime number hash was chosen for the cache (i.e. modulo 37)
because it is fast, simple, and works really well with block numbers
that tend to be aligned to a multiple of 8. No attempt to optimise
this has been made - it's just a number I picked out of thin air
given that most repetitive workloads have a working set of buffers
that is significantly smaller than 37 per AG and should hold most of
the AG header buffers permanently in the lookaside cache.
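To make the mechanism concrete, here is a simplified, standalone sketch of
how such a prime-modulo lookaside cache could work. The names
(buf_lookaside, lookaside_find, xfs_buf_model, LOOKASIDE_SIZE) are
illustrative stand-ins, not the identifiers used in the actual patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define LOOKASIDE_SIZE	37	/* prime modulus from the description above */

/* Minimal stand-in for struct xfs_buf; only the block number matters here. */
struct xfs_buf_model {
	uint64_t	bn;	/* block number (cf. b_bn) */
};

/* One lookaside cache per buffer cache (e.g. per-AG), indexed by hash. */
struct buf_lookaside {
	struct xfs_buf_model	*slots[LOOKASIDE_SIZE];
};

/*
 * Simple prime-modulo hash. Because 37 is coprime with 8, block
 * numbers aligned to multiples of 8 still spread across all slots.
 */
static inline unsigned int lookaside_hash(uint64_t bn)
{
	return bn % LOOKASIDE_SIZE;
}

/*
 * Search the lookaside cache first. A hit avoids the expensive
 * rbtree walk entirely; a miss costs one extra cacheline read
 * before the caller falls back to the rbtree lookup.
 */
static struct xfs_buf_model *
lookaside_find(struct buf_lookaside *la, uint64_t bn)
{
	struct xfs_buf_model *bp = la->slots[lookaside_hash(bn)];

	if (bp && bp->bn == bn)
		return bp;
	return NULL;
}

/* Record a successful rbtree lookup so repeat lookups are cheap. */
static void
lookaside_insert(struct buf_lookaside *la, struct xfs_buf_model *bp)
{
	la->slots[lookaside_hash(bp->bn)] = bp;
}
```

Note that a colliding insert simply replaces the previous occupant of the
slot; with a working set smaller than 37 buffers per AG, replacement is
rare and the hot buffers stay resident.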
The result is that on a typical concurrent create fsmark benchmark I
run, the profile of CPU usage went from having _xfs_buf_find() as
the number one CPU consumer:
6.55% [kernel] [k] _xfs_buf_find
4.94% [kernel] [k] xfs_dir3_free_hdr_from_disk
4.77% [kernel] [k] native_read_tsc
4.67% [kernel] [k] __ticket_spin_trylock
to this, at about #8 and #30 in the profile:
2.56% [kernel] [k] _xfs_buf_find
0.55% [kernel] [k] _xfs_buf_find_lookaside
So the lookaside cache has halved the CPU overhead of looking up
buffers for this workload.
On a buffer hit/miss workload like the followup concurrent removes,
_xfs_buf_find() went from #1 in the profile again at:
9.13% [kernel] [k] _xfs_buf_find
to #6 and #23 respectively:
2.82% [kernel] [k] _xfs_buf_find
0.78% [kernel] [k] _xfs_buf_find_lookaside
This is also a significant reduction in CPU overhead for buffer
lookups, and shows the benefit on mixed cold/hot cache lookup
workloads.
Performance differential, as measured with -m crc=1,finobt=1:

		create			remove
		time	rate		time
xfsdev		4m16s	221k/s		6m18s
patched		3m59s	236k/s		5m56s
So less CPU time spent on lookups translates directly to better
performance.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
Low cost, possible higher return. Idea looks good to me.
What happens in xfs_buf_get_map() when we lose the xfs_buf_find() race?
I don't see a removal of the losing lookaside entry that was inserted
before the race was lost.
I will let it run for a while.
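One way to address the race-loss question above, sketched in the same
simplified model as before (buf_lookaside, lookaside_remove and friends are
hypothetical names, not the patch's actual identifiers): when
xfs_buf_get_map() loses the insert race, the loser's lookaside entry could
be cleared before the buffer is freed, taking care not to evict an entry
that a concurrent lookup has since replaced:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define LOOKASIDE_SIZE	37

struct xfs_buf_model {
	uint64_t	bn;
};

struct buf_lookaside {
	struct xfs_buf_model	*slots[LOOKASIDE_SIZE];
};

static inline unsigned int lookaside_hash(uint64_t bn)
{
	return bn % LOOKASIDE_SIZE;
}

static void
lookaside_insert(struct buf_lookaside *la, struct xfs_buf_model *bp)
{
	la->slots[lookaside_hash(bp->bn)] = bp;
}

static struct xfs_buf_model *
lookaside_find(struct buf_lookaside *la, uint64_t bn)
{
	struct xfs_buf_model *bp = la->slots[lookaside_hash(bn)];

	if (bp && bp->bn == bn)
		return bp;
	return NULL;
}

/*
 * Invalidate the entry for a buffer that lost the insert race.
 * Clear the slot only if it still points at this exact buffer,
 * so we never evict the winner (or any other buffer) that has
 * been cached in the same slot since.
 */
static void
lookaside_remove(struct buf_lookaside *la, struct xfs_buf_model *bp)
{
	unsigned int	h = lookaside_hash(bp->bn);

	if (la->slots[h] == bp)
		la->slots[h] = NULL;
}
```

The conditional store is the important part: an unconditional clear would
discard a valid entry for the race winner sharing the same hash slot.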