On 07/01/13 00:44, Dave Chinner wrote:
From: Dave Chinner<dchinner@xxxxxxxxxx>
Note: This is an RFC right now - it'll need to be broken up into
several patches for final submission.
The CIL insertion during transaction commit currently does multiple
passes across the transaction objects and requires multiple memory
allocations per object that is to be inserted into the CIL. It is
quite inefficient, and as such xfs_log_commit_cil() and it's
children show up quite highly in profiles under metadata
modification intensive workloads.
The current insertion tries to minimise ithe number of times the
xc_cil_lock is grabbed and the hold times via a couple of methods:
1. an initial loop across the transaction items outside the
lock to allocate log vectors, buffers and copy the data into
them.
2. a second pass across the log vectors that then inserts
them into the CIL, modifies the CIL state and frees the old
vectors.
This is somewhat inefficient. While it minimises lock grabs, the
hold time is still quite high because we are freeing objects with
the spinlock held and so the hold times are much higher than they
need to be.
Optimisations that can be made:
1. change IOP_SIZE() to return the size of the metadata to
be logged with the number of vectors required. This means we
can allocate the buffer to hold the metadata as part of the
allocation for the log vector array. Once allocation instead
of two.
2. Store the size of the allocated region for the log vector
in the log vector itself. This allows us to determine if we
can just reuse the existing log vector for the new
transactions. Avoids all memory allocation for that item.
3. when setting up a new log vector, we don't need to
hold the CIL lock while determining the difference in it's
size when compared to the old one. Hence we can do all the
accounting, swapping of the log vectors and freeing the old
log vectors outside the xc_cil_lock.
4. We can do all the CIL insertation of items and busy
extents and the accounting under a single grab of the
xc_cil_lock. This minimises the number of times we need to
grab it and hence contention points are kept to a minimum
5. the xc_cil_lock is used for serialising both the CIL and
the push/commit ordering. Separate the two functions into
different locks so push/commit ordering does not affect CIL
insertion
6. the xc_cil_lock shares cachelines with other locks.
Separate them onto different cachelines.
The result is that my standard fsmark benchmark (8-way, 50m files)
on my standard test VM (8-way, 4GB RAM, 4xSSD in RAID0, 100TB fs)
gives the following results with a xfs-oss tree. No CRCs:
vanilla patched Difference
create (time) 483s 435s -10.0% (faster)
(rate) 109k+/6k 122k+/-7k +11.9% (faster)
walk 339s 335s (noise)
(sys cpu) 1134s 1135s (noise)
unlink 692s 645s -6.8% (faster)
So it's significantly faster than the current code, and lock_stat
reports lower contention on the xc_cil_lock, too. So, big win here.
With CRCs:
vanilla patched Difference
create (time) 510s 460s -9.8% (faster)
(rate) 105k+/5.4k 117k+/-5k +11.4% (faster)
walk 494s 486s (noise)
(sys cpu) 1324s 1290s (noise)
unlink 959s 889s -7.3% (faster)
Gains are of the same order, with walk and unlink still affected by
VFS LRU lock contention. IOWs, with this changes, filesystems with
CRCs enabled will still be faster than the old non-CRC kernels...
Signed-off-by: Dave Chinner<dchinner@xxxxxxxxxx>
Dave,
The idea to separate the transaction commits from the CIL push looks good.
I am still concerned that the xc_ctx->space_used comparison to
XLOG_CIL_SPACE_LIMIT(log) does not also contain any information on
the "stolen" ticket log space. Maybe the future Liu/Chinner minimum
log space series is the best place to address this and my other log
space concerns.
--Mark.
|