
To: xfs@xxxxxxxxxxx
Subject: Re: [PATCH 0/12] xfs: delayed logging V6
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Mon, 24 May 2010 10:30:39 +1000
In-reply-to: <1274138668-1662-1-git-send-email-david@xxxxxxxxxxxxx>
References: <1274138668-1662-1-git-send-email-david@xxxxxxxxxxxxx>
User-agent: Mutt/1.5.20 (2009-06-14)
On Tue, May 18, 2010 at 09:24:16AM +1000, Dave Chinner wrote:
> 
> Hi Folks,
> 
> This is version 6 of the delayed logging series and is the first
> release candidate for inclusion in the xfs-dev tree and 2.6.35-rc1.

BTW, here's a couple of quick benchmarks I've run over the last
couple of days to check comparative performance. I found that the
previous scalability testing I did was limited by two factors:

        1. Only 4 AGs in the test filesystem, so only 4-way
        parallelism on allocation/freeing. Hence it won't scale to 8
        threads no matter what I do....
        2. lockdep checking limits scalability to around 4 threads.

So I re-ran the pure-metadata, sequential create/remove fs_mark
tests I've previously run, with the following results. Barriers
were disabled on both XFS and ext4, and XFS was also configured
with:

MKFS_OPTIONS="-l size=128m -d agcount=16"
MOUNT_OPTIONS="-o [no]delaylog,logbsize=262144,nobarrier"

(./fs_mark -S0 -n 100000 -s 0 -d /mnt/scratch/0 ...)
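
(fs_mark runs one worker per -d directory, so the higher thread
counts are just more -d directories. As a sketch - the directory
layout below is assumed, not the exact command - the 8-thread run
looks something like:

        # one -d directory per worker; paths assumed
        ./fs_mark -S0 -n 100000 -s 0 \
                -d /mnt/scratch/0 -d /mnt/scratch/1 \
                -d /mnt/scratch/2 -d /mnt/scratch/3 \
                -d /mnt/scratch/4 -d /mnt/scratch/5 \
                -d /mnt/scratch/6 -d /mnt/scratch/7

-S0 disables the per-file syncs and -s 0 creates zero-length files,
so no file data is written at all - the workload is pure metadata.)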

                   fs_mark rate
threads      nodelaylog       delaylog          ext4
  1             17k/s           19k/s           33k/s
  2             30k/s           35k/s           66k/s
  4             42k/s           63k/s           80k/s
  8             39k/s           97k/s           45k/s

This shows that pure metadata operations scale much, much better
with delayed logging, especially for multithreaded workloads. The
log throughput at 8 threads is 3.5x lower for a 2.5x improvement in
performance.

Also worth noting is that the performance is competitive with ext4
and exceeds it at higher parallelism. Equally worth noting are the
disk subsystem requirements to sustain this performance. These
numbers were read off pmchart graphs w/ a 5s sample interval, so
they should be considered ballpark figures (i.e. close but not
perfectly accurate); a command-line sketch for sampling the same
counters follows the table:

                           IOPS @ MB/s
threads         nodelaylog      delaylog          ext4
  1             1.0k @ 280       50 @ 10           50 @ 20
  2             2.0k @ 460      100 @ 20          500 @ 75
  4             2.5k @ 520      300 @ 50         6.5k @ 150
  8             3.7k @ 480      900 @ 150        9.8k @ 180
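
(If you want to watch the same counters without pmchart, the raw PCP
metrics can be sampled from the command line as well. A minimal
sketch, assuming the graphs were built from the standard Linux PMDA
disk.dev metrics:

        # 5 second samples of per-device operation and write rates
        pmval -t 5 disk.dev.total
        pmval -t 5 disk.dev.write_bytes

pmval rate-converts the counters, so the output corresponds to the
IOPS and bandwidth columns above.)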

We can see why the current XFS journalling mode is slow - it
requires 500MB/s of log throughput to get to 40k creates/s, and
almost all of the IOPS are servicing log IO.

ext4, on the other hand, really strains the IOPS capability of the
disk subsystem, and that is the limiting factor at greater than two
threads. It's an interesting progression, too, in that the IOPS go
up by an order of magnitude each time the thread count doubles,
until the IO subsystem saturates at 8 threads.

The best IO behaviour comes from the delayed logging version of
XFS, which sustains the highest performance with the lowest
bandwidth and IOPS. All the IO is to the log - no metadata is
written to disk at all, which is the way this test should execute.
As a result, the delayed logging code was the only configuration
not limited by the IO subsystem - instead it was completely CPU
bound (8 CPUs worth)...

However, it's not all roses, as dbench will show:

# MKFS_OPTIONS="-l size=128m -d agcount=16" \
  MOUNT_OPTIONS="-o nodelaylog,nobarrier" \
  ./bench 1 dave dave dbenchmulti
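
(For anyone without the bench wrapper script, a roughly equivalent
set of direct dbench runs - the 300s runtime is an assumption, not
necessarily what the wrapper uses - would be:

        # one pass per client count; remake/remount the filesystem
        # with the options above between nodelaylog and delaylog runs
        for nproc in 1 2 10 20 100 200; do
                dbench -D /mnt/scratch -t 300 $nproc
        done
)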

Throughput is in MB/s, latency in ms.

                  nodelaylog              delaylog
Threads         thruput  max-lat        thruput max-lat
   1            153.011   45.450        157.036  59.685
   2            319.534   18.096        330.062  41.458
  10            1350.31   46.631        726.075 303.434
  20            1497.93  365.092        547.380 541.223
 100            1410.42 2488.105        477.964 177.471
 200            1232.97  297.982        457.641 447.060

There is no difference for 1-2 threads (within the error margin of
dbench), but delayed logging shows significant throughput reductions
(>60% degradation) at higher thread counts. This appears to be due
to the unoptimised log force implementation that the delayed logging
code currently has. I'll probably use dbench as a benchmark over the
next few weeks to try to bring this part of the delayed logging code
up to the same level of performance.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
