On Tue, May 18, 2010 at 09:24:16AM +1000, Dave Chinner wrote:
> Hi Folks,
> This is version 6 of the delayed logging series and is the first
> release candidate for inclusion in the xfs-dev tree and 2.6.35-rc1.
BTW, here are a couple of quick benchmarks I've run over the last
couple of days to check comparative performance. I found that the
previous scalability testing I did was limited by two factors:
1. Only 4 AGs in the test filesystem, so only 4-way
parallelism on allocation/freeing. Hence it won't scale to 8
threads no matter what I do....
2. lockdep checking limits scalability to around 4 threads.
So I re-ran the pure-metadata, sequential create/remove fs_mark
tests I've previously run, with the following results. Barriers
were disabled on both XFS and ext4, and XFS was configured with
MKFS_OPTIONS="-l size=128m -d agcount=16".
(./fs_mark -S0 -n 100000 -s 0 -d /mnt/scratch/0 ...)
threads   nodelaylog   delaylog    ext4
   1        17k/s       19k/s     33k/s
   2        30k/s       35k/s     66k/s
   4        42k/s       63k/s     80k/s
   8        39k/s       97k/s     45k/s
This shows that pure metadata operations scale much, much better
with delayed logging, especially for multithreaded workloads. The
log throughput at 8 threads is 3.5x lower for a 2.5x improvement
in performance.
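As a sanity check on the scaling claims, the per-config speedups fall
straight out of the table above (a quick throwaway script, not part of
the benchmark harness):

```python
# Creates/sec from the fs_mark table above, keyed by thread count.
results = {
    "nodelaylog": {1: 17e3, 2: 30e3, 4: 42e3, 8: 39e3},
    "delaylog":   {1: 19e3, 2: 35e3, 4: 63e3, 8: 97e3},
    "ext4":       {1: 33e3, 2: 66e3, 4: 80e3, 8: 45e3},
}

for fs, rates in results.items():
    base = rates[1]
    # Speedup relative to the single-threaded rate for the same config.
    speedups = {t: round(r / base, 1) for t, r in sorted(rates.items())}
    print(fs, speedups)
```

Only delaylog keeps scaling past 4 threads (~5x at 8 threads); both
nodelaylog and ext4 regress going from 4 to 8.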
Also worth noting is that the performance is competitive with ext4
and exceeds it at higher parallelism, as are the disk subsystem
requirements needed to sustain that performance. These numbers were
read off pmchart graphs with a 5s sample interval, so they should be
considered ballpark figures (i.e. close, but not perfectly accurate):
                 IOPS @ MB/s
threads   nodelaylog     delaylog        ext4
   1      1.0k @ 280      50 @  10     50 @  20
   2      2.0k @ 460     100 @  20    500 @  75
   4      2.5k @ 520     300 @  50   6.5k @ 150
   8      3.7k @ 480     900 @ 150   9.8k @ 180
We can see why the current XFS journalling mode is slow - it requires
500MB/s of log throughput to get to 40k creates/s, and almost all
of the IOPS are servicing log IO.
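To put that in perspective, the average log bandwidth consumed per
create can be derived from the two tables (back-of-the-envelope
arithmetic using the 8-thread numbers above):

```python
# 8-thread numbers from the tables above: creates/sec and log MB/s.
creates_per_sec = {"nodelaylog": 39e3, "delaylog": 97e3}
log_mb_per_sec  = {"nodelaylog": 480,  "delaylog": 150}

for cfg in creates_per_sec:
    # Average log bandwidth consumed per create operation.
    kb_per_create = log_mb_per_sec[cfg] * 1024 / creates_per_sec[cfg]
    print(f"{cfg}: {kb_per_create:.1f} KB of log written per create")
```

That works out to roughly 12.6 KB of log IO per create for the
existing code versus about 1.6 KB with delayed logging - an 8x
reduction in log traffic per operation.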
ext4, on the other hand, really strains the IOPS capability of the
disk subsystem, and that is the limiting factor at more than two
threads. It's an interesting progression, too, in that the IOPS go
up by an order of magnitude each time the thread count doubles,
until the disk subsystem saturates.
The best IO behaviour comes from the delayed logging version of XFS,
which sustains the highest performance with the lowest bandwidth and
IOPS. All the IO is to the log - no metadata is written to disk at
all, which is the way this test should execute. As a result, the
delayed logging code was the only configuration not limited by the
IO subsystem - instead it was completely CPU bound (all 8 CPUs busy).
However, it's not all roses, as dbench will show:
# MKFS_OPTIONS="-l size=128m -d agcount=16" \
#   MOUNT_OPTIONS="-o nodelaylog,nobarrier" \
#   ./bench 1 dave dave dbenchmulti
Throughput is in MB/s, latency in ms. The first column pair is the
existing logging code (nodelaylog), the second is delayed logging:

                nodelaylog            delaylog
Threads     thruput   max-lat     thruput   max-lat
    1       153.011    45.450     157.036    59.685
    2       319.534    18.096     330.062    41.458
   10       1350.31    46.631     726.075   303.434
   20       1497.93   365.092     547.380   541.223
  100       1410.42  2488.105     477.964   177.471
  200       1232.97   297.982     457.641   447.060
There is no difference at 1-2 threads (within dbench's error
margin), but delayed logging shows significant throughput reductions
(>60% degradation) at higher thread counts. This appears to be due
to the unoptimised log force implementation that the delayed logging
code currently has. I'll probably use dbench as a benchmark over the
next few weeks to bring this part of the delayed logging code up to
the same level of performance.
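For reference, the degradation figures quoted above fall straight out
of the dbench table (assuming, as in the run shown, that the first
column pair is nodelaylog and the second is delayed logging):

```python
# dbench throughput in MB/s from the table above, keyed by thread count.
nodelaylog = {1: 153.011, 2: 319.534, 10: 1350.31,
              20: 1497.93, 100: 1410.42, 200: 1232.97}
delaylog   = {1: 157.036, 2: 330.062, 10: 726.075,
              20: 547.380, 100: 477.964, 200: 457.641}

for t in nodelaylog:
    # Relative throughput change of delayed logging vs the existing code.
    change = (delaylog[t] - nodelaylog[t]) / nodelaylog[t] * 100
    print(f"{t:>3} threads: {change:+.0f}%")
```

The 1-2 thread results are within a few percent of each other, while
20 threads and above all show losses of more than 60%.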