Re: allocsize mount option

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: allocsize mount option
From: Gim Leong Chin <chingimleong@xxxxxxxxxxxx>
Date: Fri, 15 Jan 2010 01:25:15 +0800 (SGT)
Cc: Eric Sandeen <sandeen@xxxxxxxxxxx>, xfs@xxxxxxxxxxx
Hi Dave,

> fragmented, it just means that there are 19% more fragments
> than the ideal. In 4TB of data with 1GB sized files, that would mean
> there are 4800 extents (average length ~800MB, which is excellent)
> instead of the perfect 4000 extents (@1GB each). Hence you can see
> how misleading this "19% fragmentation" number can be on an extent
> based filesystem...
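
The arithmetic in the quoted example can be checked quickly. This is only a sketch using the round numbers above, with 1 GB taken as 1024 MB (an assumption); the function names are made up for illustration. With these round figures the excess comes out at 20%, close to the 19% quoted.

```shell
# Figures from the quoted example; 1 GB = 1024 MB is an assumption.
ideal_extents=4000
actual_extents=4800
data_mb=$((ideal_extents * 1024))

# Average extent length in MB: total data divided by extent count.
avg_extent_mb() { echo $(( $1 / $2 )); }

# Percentage of extents in excess of the ideal count.
excess_pct() { echo $(( ($1 - $2) * 100 / $2 )); }

echo "average extent: $(avg_extent_mb "$data_mb" "$actual_extents") MB"
echo "extents over ideal: $(excess_pct "$actual_extents" "$ideal_extents")%"
```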

There are many files that are 128 GB.

When I did the tests with dd on this computer, the 20 GB files had up to > 50 

> This all looks good - it certainly seems that you have done your
> research. ;) The only thing I'd do differently is that if you have
> only one partition on the drives, I wouldn't even put a
> partition on it.

I just learnt from you that I can have a filesystem without a partition table!  
That takes care of having to calculate the start of the partition!  Are there 
any other benefits?  But are there any down sides to not having a partition 

> I'd significantly reduce the size of that buffer - too large a
> buffer can slow down IO due to the memory it consumes and TLB misses
> it causes. I'd typically use something like:
> $ dd if=/dev/zero of=bigfile bs=1024k count=20k
> Which does 20,000 writes of 1MB each and ensures the dd process
> doesn't consume over a GB of RAM.
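
For reference, the sizes in that dd command work out as follows. This is a small sketch, assuming dd's binary suffixes (k = 1024), so count=20k is 20480 writes; dd_total_mib is a name made up here.

```shell
# dd size suffixes are binary: bs=1024k is 1 MiB, count=20k is 20480.
# Total written in MiB = block size in KiB * number of blocks / 1024.
dd_total_mib() { echo $(( $1 * $2 / 1024 )); }

bs_kib=1024
count=$((20 * 1024))
total_mib=$(dd_total_mib "$bs_kib" "$count")

echo "writes: $count"
echo "buffer: $((bs_kib / 1024)) MiB"
echo "total:  $((total_mib / 1024)) GiB"
```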

I did try with 1 MB.  I have attached the raw test result file.  As you can see 
from line 261, in writing 10 GB with bs=1MB, the speed was no faster two out of 
three times, so I dropped it.  I could re-try that next time.

> This seems rather low for a buffered write on hardware that can
> clearly go faster. SLED11 is based on 2.6.27, right? I suspect that
> many of the buffered writeback issues that have been fixed since
> 2.6.30 are present in the SLED11 kernel, and if that is the case I
> can see why the allocsize mount option makes such a big
> difference.

Is it possible for the fixes in the 2.6.30 kernel to be backported to the 
2.6.27 kernel in SLE 11?
If so, I would like to open a service request to Novell to do that to fix the 
performance issues in SLE 11.

> It might be worth checking how well direct IO writes run to take
> buffered writeback issues out of the equation. In that case, I'd use
> stripe width multiple sized buffers like:
> $ dd if=/dev/zero of=bigfile bs=3072k count=7k oflag=direct
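
The "stripe width multiple" can be checked with a little arithmetic. The sketch below assumes the 1536k stripe width mentioned further down this thread; is_stripe_multiple is a name made up for illustration.

```shell
stripe_kib=1536          # stripe width, as given later in this thread
bs_kib=3072              # bs=3072k
count=$((7 * 1024))      # count=7k, with dd's binary k

# Succeeds when the buffer size is a whole number of stripes.
is_stripe_multiple() { [ $(( $1 % $2 )) -eq 0 ]; }

if is_stripe_multiple "$bs_kib" "$stripe_kib"; then
    echo "each write covers $((bs_kib / stripe_kib)) full stripes"
else
    echo "buffer is not a multiple of the stripe width"
fi
echo "total written: $(( bs_kib * count / 1024 / 1024 )) GiB"
```

With these values each write covers two full stripes and the run writes 21 GiB in total.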

I would like to do that tomorrow when I go back to work, but on my openSUSE 
11.1 AMD Turion RM-74 notebook with kernel, on the system 
WD Scorpio Black 7200 RPM drive, I get 62 MB/s with dd bs=1GB when writing a
20 GB file with Direct IO, and 56 MB/s without Direct IO.  You are on to something!

As for the hardware performance potential, see below.

> I'd suggest that you might need to look at increasing the maximum IO
> size for the block device (/sys/block/sdb/queue/max_sectors_kb),
> maybe the request queue depth as well to get larger IOs to be pushed
> to the raid controller. If you can, at least get it to the stripe
> width of 1536k....

Could you give a good reference for performance tuning of these parameters?  I 
am at a total loss here.
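
For anyone following along, the settings mentioned above can at least be inspected from sysfs. A minimal sketch, assuming the array is sdb as in the quoted path; show_queue and its optional second argument are made up here, and nr_requests is my reading of "request queue depth".

```shell
# Print the request-size and queue-depth settings for a block device.
# The optional second argument (sysfs root) only exists so the function
# can be pointed at another directory; it defaults to /sys.
show_queue() {
    for f in max_sectors_kb max_hw_sectors_kb nr_requests; do
        path="${2:-/sys}/block/$1/queue/$f"
        if [ -r "$path" ]; then
            echo "$f=$(cat "$path")"
        else
            echo "$f=unavailable"
        fi
    done
}

show_queue sdb

# Raising the maximum IO size to the 1536k stripe width would be done as
# root, and only up to the limit reported in max_hw_sectors_kb:
#   echo 1536 > /sys/block/sdb/queue/max_sectors_kb
```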

As seen from the results file, I have tried different configurations of RAID 0, 
5 and 6, with different numbers of drives.  I am pretty confused by the results 
I see, although only the 20 GB file writes were done with allocsize=1g.  I also 
did not lock the CPU frequency governor at the top clock except for the RAID 6 

I decided on the allocsize=1g after checking that the multiple instance 30 MB 
writes have only one extent for each file, without holes or unused space.
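
One way to make that kind of check is to count the mapped extents that xfs_bmap reports for a file. This is only a sketch: count_extents is a made-up helper, and the sample listing below is illustrative (the block numbers are invented), not output from this system.

```shell
# Count mapped extents in xfs_bmap output read from stdin.
# Extent lines look like "N: [start..end]: startblock..endblock";
# holes carry the word "hole" instead of a block range and are skipped.
count_extents() {
    awk '/^[ \t]*[0-9]+: \[/ && !/hole/ { n++ } END { print n + 0 }'
}

# Illustrative listing for a single-extent 30 MB file (61440 512-byte
# blocks); the block numbers are invented for the example.
printf 'testfile:\n\t0: [0..61439]: 96..61535\n' | count_extents
```

On a real file this would be run as `xfs_bmap somefile | count_extents`.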

It appears that RAID 6 writes are faster than RAID 5!  And RAID 6 can even 
match RAID 0!  The system seems to thrive on throughput, when doing multiple 
instances of writes, for getting high aggregate bandwidth.

I will put the performance potential of the system in context by giving some 

The system has four Kingston DDR2-800 MHz CL6 4 GB unbuffered ECC DIMMs, set to 
unganged mode, so each thread has up to 6.4 GB/s of memory bandwidth, from one 
of two independent memory channels.

The AMD Phenom II X4 965 has three levels of cache, and data from memory goes 
directly to the L1 caches. The four cores have dedicated L1 and L2 caches, and 
a shared 6 MB L3.  Thread switching will result in cache misses if more than 
four threads are running.

The IO through the HyperTransport 3.0 from CPU to the AMD 790FX chipset is at 8 
GB/s.  The Areca ARC-1680ix-16 is PCI-E Gen 1 x8, so the maximum bandwidth is 2 
GB/s.  The cache is Kingston DDR-667 CL5 4 GB unbuffered ECC, although it runs 
at 533 MHz, so the maximum bandwidth is 4.2 GB/s.  The Intel IOP 348 1200 MHz 
on the card has two cores.
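
The peak figures above follow from simple arithmetic. A rough sketch, assuming 64-bit (8-byte) wide memory channels and 250 MB/s per lane in each direction for PCI-E Gen 1; the function names are made up for illustration.

```shell
# Peak bandwidth in MB/s of a 64-bit memory channel.
# $1 = transfer rate in millions of transfers per second.
mem_bw_mb() { echo $(( $1 * 8 )); }

# Peak one-way bandwidth in MB/s of a PCI-E Gen 1 link.
# $1 = number of lanes, at 250 MB/s per lane.
pcie1_bw_mb() { echo $(( $1 * 250 )); }

echo "DDR2-800 channel:      $(mem_bw_mb 800) MB/s"
echo "controller cache @533: $(mem_bw_mb 533) MB/s"
echo "PCI-E Gen 1 x8:        $(pcie1_bw_mb 8) MB/s"
```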

There are sixteen WD Caviar Black 1 TB drives in the Lian-Li PC-V2110 chassis.  
For the folks reading this, please do not follow this set-up, as the Caviar 
Blacks are a mistake.  WD quietly disabled the use of its time limited error 
recovery (TLER) utility on Caviar Black drives manufactured from September 2009 
onwards, so I have an array of drives that can pop out of the RAID any time if 
I am unlucky, and I got screwed here.

There is a battery back-up module for the cache, and the drive caches are 
disabled.  Tests run with the drive caches enabled showed quite a bit of 
speed-up in RAID 0.

We previously did tests of the Caviar Black 1 TB writing 100 MB chunks to the 
device without a file system, with the drive connected to the SATA ports on a 
Tyan Opteron motherboard with nVidia nForce 4 Professional chipset.  With the 
drive cache disabled, the sequential write speed was 30+ MB/s if I remember 
correctly, versus sub 100 MB/s with cache enabled.  That is a big fall-off in 
speed, and that was writing at the outer diameter of the platter; speed would 
be halved at the inner diameter.  It seems the controller firmware is meant to 
work with cache enabled for proper functioning.

The desktop Caviar Black also does not have rotary vibration compensation, 
unlike the Caviar RE nearline drives.  WD has a document showing the 
performance difference that rotary vibration compensation makes.  I am not 
trying to save pennies here, but the local distributor refuses to bring in the 
Caviar REs, and I am stuck in no man's land.

The system has sixteen hard drives, and ten fans of different sizes and 
purposes in total, so that is quite a bit of rotary vibration, which I can 
feel when I place my hand on the side panels.  I really do not know how badly 
the drive performance suffers as a result. The drives are attached with rubber 
dampers on the mounting screws.

I did the 20 GB dd test on the RAID 1 system drive, also with XFS, and got 53 
MB/s with disabled drive caches, 63 MB/s enabled.  That is pretty 
disappointing, but in light of all the above considerations, plus the kernel 
buffer issues, I do not really know if that is a good figure.

NCQ is enabled at depth 32.  NCQ should cause performance loss for single 
writes, but gains for multiple writes.

Areca has a document showing that this card can do RAID 6 at 800 MB/s with 
Seagate nearline drives, with the standard 512 MB cache.  That is in Windows 
Server.  I do not know if the caches are disabled.  The benchmark is IO Meter 
workstation sequential write.  IO Meter requires Windows for the front end, 
which causes me great difficulties, so I gave up trying to figure it out and I do not 
understand what the workstation test does.  However, in writing 30 MB files, I 
already exceed 1 GB/s.


Attachment: xfstesting
Description: Binary data
