Re: allocsize mount option

To: Gim Leong Chin <chingimleong@xxxxxxxxxxxx>
Subject: Re: allocsize mount option
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Fri, 15 Jan 2010 10:28:09 +1100
Cc: Eric Sandeen <sandeen@xxxxxxxxxxx>, xfs@xxxxxxxxxxx
In-reply-to: <264613.60659.qm@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
References: <264613.60659.qm@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
User-agent: Mutt/1.5.18 (2008-05-17)

On Fri, Jan 15, 2010 at 01:25:15AM +0800, Gim Leong Chin wrote:
> Hi Dave,
> > fragmented, it just means that there are 19% more fragments
> > than the ideal. In 4TB of data with 1GB sized files, that would mean
> > there are 4800 extents (average length ~800MB, which is excellent)
> > instead of the perfect 4000 extents (@1GB each). Hence you can see
> > how misleading this "19% fragmentation" number can be on an extent
> > based filesystem...
> There are many files that are 128 GB.
> When I did the tests with dd on this computer, the 20 GB files had
> up to 50 extents.

That's still an average of 400MB extents, which is more than large
enough to guarantee optimal disk bandwidth when reading or writing them
on your setup....
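
The arithmetic behind these averages can be sketched in a few lines of
shell. This is only an illustration: the function name avg_extent_mb is
made up here, and it assumes decimal units (1 GB = 1000 MB) and integer
division.

```shell
# avg_extent_mb SIZE_MB EXTENTS: average extent length in MB
# (hypothetical helper; integer division, 1 GB = 1000 MB assumed).
avg_extent_mb() {
        echo $(( $1 / $2 ))
}

avg_extent_mb 20000 50       # a 20 GB file split into 50 extents
avg_extent_mb 4000000 4800   # 4 TB of data split into 4800 extents
```

The two calls print 400 and 833, matching the ~400MB and ~800MB
averages discussed above.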

> > This all looks good - it certainly seems that you have done your
> > research. ;) The only thing I'd do differently is that if you have
> > only one partition on the drives, I wouldn't even put a
> > partition on it.
> > 
> I just learnt from you that I can have a filesystem without a
> partition table!  That takes care of having to calculate the start
> of the partition!  Are there any other benefits?  But are there
> any down sides to not having a partition table?

That's the main benefit, though there are others like no limit on
the partition size (e.g. msdos partitions are a max of 2TB) but
you avoided most of those problems by using GPT labels.

There aren't any real downsides that I am aware of, except
maybe that future flexibility of the volume is reduced. e.g.
if you grow the volume, then you can still only have one filesystem
on it....

> > This seems rather low for a buffered write on hardware that can
> > clearly go faster. SLED11 is based on 2.6.27, right? I suspect that
> > many of the buffered writeback issues that have been fixed since
> > 2.6.30 are present in the SLED11 kernel, and if that is the case I
> > can see why the allocsize mount option makes such a big
> > difference.
> Is it possible for the fixes in the 2.6.30 kernel to be backported to the 
> 2.6.27 kernel in SLE 11?
> If so, I would like to open a service request to Novell to do that to fix the 
> performance issues in SLE 11.

You'd have to get all the fixes from 2.6.30 to 2.6.32, and the
backport would be very difficult to get right. Better would
be just to upgrade the kernel to 2.6.32 ;)

> > I'd suggest that you might need to look at increasing the maximum IO
> > size for the block device (/sys/block/sdb/queue/max_sectors_kb),
> > maybe the request queue depth as well to get larger IOs to be pushed
> > to the raid controller. If you can, at least get it to the stripe
> > width of 1536k....
> Could you give a good reference for performance tuning of these
> parameters?  I am at a total loss here.

Welcome to the black art of storage subsystem tuning ;)

I'm not sure there is a good reference for tuning the block device
parameters - most of what I know was handed down by word of mouth
from gurus on high mountains.

The overriding principle, though, is to try to ensure that the
stripe width sized IOs can be issued right through the IO stack to
the hardware, and that those IOs are correctly aligned to the
stripes. You've got the filesystem configuration and layout part
correct; now it's just tuning the block layer to pass the IOs through.

I'd be looking in the Documentation/block directory
of the kernel source and googling for other documentation....
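
As a starting point, the block layer settings can be inspected with
something like the sketch below. It assumes the device is sdb (as in
the quoted path) and that the target is the 1536k stripe width.
max_hw_sectors_kb and nr_requests are standard queue attributes in
sysfs that sit next to max_sectors_kb. The helper name clamp_kb is made
up for this example. The script only reports what it would do; the
write is left commented out.

```shell
#!/bin/sh
# Assumed device name; override with DEV=sdX.
dev=${DEV:-sdb}
q=/sys/block/$dev/queue

# clamp_kb WANT HW: the request size to ask for, limited to the
# hardware maximum (max_sectors_kb cannot exceed max_hw_sectors_kb).
clamp_kb() {
        if [ "$1" -gt "$2" ]; then echo "$2"; else echo "$1"; fi
}

if [ -r "$q/max_hw_sectors_kb" ]; then
        hw=$(cat "$q/max_hw_sectors_kb")
        echo "current max_sectors_kb: $(cat "$q/max_sectors_kb")"
        echo "current nr_requests:    $(cat "$q/nr_requests")"
        echo "would set max_sectors_kb to $(clamp_kb 1536 "$hw")"
        # To apply (as root), uncomment the next line:
        # clamp_kb 1536 "$hw" > "$q/max_sectors_kb"
fi
```

If the hardware limit is below 1536, a full stripe width cannot be
issued as a single IO on that device, whatever the other settings are.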

> As seen from the results file, I have tried different
> configurations of RAID 0, 5 and 6, with different number of
> drives.  I am pretty confused by the results I see, although only
> the 20 GB file writes were done with allocsize=1g.  I also did not
> lock the CPU frequency governor at the top clock except for the
> RAID 6 tests.

FWIW, your tests are not timing how long it takes for all the
data to hit the disk, only how long it takes to get into cache.
For single threads you really need to do:

$ time (dd if=/dev/zero of=<file> bs=XXX count=YYY; sync)

and something like this for multiple (N) threads:

$ time (
        for i in `seq 1 N`; do
                dd if=/dev/zero of=<file>.$i bs=XXX count=YYY &
        done
        wait; sync      # wait for every writer, then flush to disk
  )

And that will give you a much more accurate measure across all file
sizes of the throughput rate. You'll need to manually calculate
the rate from the output of the time command and the amount of
data that the test runs.
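
A minimal sketch of that calculation, assuming the block size is given
in bytes and the elapsed ("real") time in seconds. The function name
rate_mb_s and the numbers in the example call are made up for
illustration; they are not measurements.

```shell
# rate_mb_s BS_BYTES COUNT NFILES ELAPSED_SECONDS
# Aggregate throughput in MB/s (1 MB = 1000000 bytes) for NFILES
# writers that each wrote COUNT blocks of BS_BYTES bytes.
rate_mb_s() {
        awk -v bs="$1" -v count="$2" -v n="$3" -v secs="$4" \
            'BEGIN { printf "%.1f\n", bs * count * n / secs / 1000000 }'
}

# Hypothetical run: one writer, 20480 blocks of 1 MiB, 100 s elapsed.
rate_mb_s 1048576 20480 1 100
```

Because the elapsed time includes the final sync, the figure reflects
data reaching the disk and not just the page cache.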

Or, alternatively, you could just use direct IO, which avoids
such cache effects by bypassing the page cache....
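
For illustration, both approaches can be wrapped in small helpers. This
assumes GNU dd: oflag=direct opens the output with O_DIRECT, and
conv=fsync makes dd fsync the output file before it exits. The function
names are made up for this sketch.

```shell
# write_direct FILE BS COUNT: bypass the page cache (O_DIRECT).
# BS must be a multiple of the device's logical block size, and
# some filesystems reject O_DIRECT altogether.
write_direct() {
        dd if=/dev/zero of="$1" bs="$2" count="$3" oflag=direct
}

# write_synced FILE BS COUNT: buffered write, but dd fsyncs the file
# before exiting, so the elapsed time includes the flush to disk.
write_synced() {
        dd if=/dev/zero of="$1" bs="$2" count="$3" conv=fsync
}
```

In bash, either helper can be timed directly, e.g.
"time write_direct <file> 1M YYY".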

> I decided on the allocsize=1g after checking that the multiple
> instance 30 MB writes have only one extent for each file, without
> holes or unused space.
> It appears that RAID 6 writes are faster than RAID 5!  And RAID 6
> can even match RAID 0!  The system seems to thrive on throughput,
> when doing multiple instances of writes, for getting high
> aggregate bandwidth.

Given my above comments, that may not be true.


> We previously did tests of the Caviar Black 1 TB writing 100 MB
> chunks to the device without a file system, with the drive
> connected to the SATA ports on a Tyan Opteron motherboard with
> nVidia nForce 4 Professional chipset.  With the drive cache
> disabled, the sequential write speed was 30+ MB/s if I remember
> correctly, versus sub 100 MB/s with cache enabled.  That is a big
> fall-off in speed, and that was writing at the outer diameter of
> the platter; speed would be halved at the inner diameter.  It
> seems the controller firmware is meant to work with cache enabled
> for proper functioning.

That sounds wrong - it sounds like NCQ is not functioning properly,
as with NCQ enabled, disabling the drive cache should not impact
throughput at all....
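
One quick check, sketched below, is to look at the queue depth the
kernel is using for the drive. This assumes a libata-driven SATA disk
named sdb, where a depth of 1 generally means commands are not being
queued. The helper name ncq_state is made up for this example.

```shell
#!/bin/sh
# ncq_state DEPTH: interpret a device queue depth value.
ncq_state() {
        if [ "$1" -gt 1 ]; then
                echo "queueing active (depth $1)"
        else
                echo "no queueing (depth $1)"
        fi
}

# Assumed device name; override with DEV=sdX.
dev=${DEV:-sdb}
f=/sys/block/$dev/device/queue_depth
if [ -r "$f" ]; then
        ncq_state "$(cat "$f")"
fi
```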

FWIW, for SAS and SCSI drives, I recommend turning the drive caches
off as the impact of filesystem issued barrier writes on performance
is worse than disabling the drive caches....

> The desktop Caviar Black also does not have rotary vibration
> compensation, unlike the Caviar RE nearline drives.  WD has a
> document showing the performance difference having rotary
> vibration compensation makes.  I am not trying to save pennies
> here, but the local distributor refuses to bring in the Caviar
> REs, and I am stuck in one man's land.

I'd suggest trying to find another distributor that will bring them
in for you. Putting that many drives in a single chassis is almost
certainly going to cause vibration problems, especially if you get
all the disk heads moving in close synchronisation (which is what
happens when you get all your IO sizing and alignment right).


Dave Chinner
