
Re: Need advice on building a new XFS setup for large files

To: Alvin Ong <alvin.ong@xxxxxxxxxxxxxxxxx>
Subject: Re: Need advice on building a new XFS setup for large files
From: Emmanuel Florac <eflorac@xxxxxxxxxxxxxx>
Date: Tue, 22 Jan 2013 22:51:46 +0100
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <CAMX-HjqMCm8CdekPhJRPyvsb3v1pDKOonp5JDeOT9NBDD-T=+g@xxxxxxxxxxxxxx>
Organization: Intellique
References: <CAMX-HjqMCm8CdekPhJRPyvsb3v1pDKOonp5JDeOT9NBDD-T=+g@xxxxxxxxxxxxxx>
On Tue, 22 Jan 2013 12:22:42 +0800 you wrote:

> We plan to use a 6+2 RAID6 to start off with.

Bad, bad idea. 6+2 RAID-6 performance sucks. If you want decent
performance, use arrays of at least 12 to 16 drives. See these tests
(all using the same HGST HUA 2 TB drives, Adaptec 5xx5 controllers,
and XFS):

6+2  2TB RAID-6 sequential performance: write 580 MB/s, read  660 MB/s
10+2 2TB RAID-6 sequential performance: write 800 MB/s, read 1250 MB/s
14+2 2TB RAID-6 sequential performance: write 830 MB/s, read 1300 MB/s
22+2 2TB RAID-6 sequential performance: write 900 MB/s, read 1400 MB/s

As you can see, maximum controller throughput is reached around 12 to 16
drives. And the difference in IOPS is even more obvious. Better
controllers will give you more oomph with wider arrays (but at higher
cost, obviously).
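
To make the diminishing returns concrete, here is a small sketch that
divides the figures above by the number of data drives. The helper name
per_drive is only an illustration; the numbers are the ones from the
table, nothing new is measured.

```shell
#!/bin/sh
# Figures copied from the table above: data drives, write MB/s, read MB/s.
results="6 580 660
10 800 1250
14 830 1300
22 900 1400"

# per_drive (hypothetical helper): sequential MB/s divided by data drives.
per_drive() {
    echo "$results" | awk '{
        printf "%d+2: write %.1f MB/s per data drive, read %.1f MB/s per data drive\n",
               $1, $2 / $1, $3 / $1
    }'
}

per_drive
```

Write throughput per data drive falls from roughly 97 MB/s on the 6+2
array to roughly 41 MB/s on the 22+2 array, which is what you would
expect once the controller rather than the drives is the limit.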

> Then when it gets
> filled up to maybe 60-70% we will
> expand by adding another 6+2 RAID6 to the array.

> The max we can grow this configuration is up to 252TB usable which
> should be enough for a year.
> Our requirements might grow up to 2PB in 2 years time if all goes
> well.

And you'll always write to the most recently added array only. Plan for
that: your base array must be fast enough to serve the traffic you
expect in two years' time, else you'll have to ditch everything and
rebuild it from the ground up.

> So I have been testing all of this out on a VM running 3 vmdk's and
> using LVM to create a single logical volume of the 3 disks.
> I noticed that out of sdb, sdc and sdd, files keep getting written to
> sdc. This is probably due to our web app creating a single folder and
> all files are written under that folder.

No, this is due to the fact that LVM can't stripe across physical
volumes if you keep adding them, and that XFS can't optimize AGs after
extension. If you start with one volume then add one, and another one,
you'll always be writing to the volumes sequentially. Therefore your
maximum write performance will always be that of the current volume,
which makes it all the more important to get the fastest single-volume
performance from the start.
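
As a rough sketch of the difference, the two layouts look like this on
the command line. The VG, LV and mount point names are taken from your
mount output and the disks from your test VM; the 256k stripe size and
the run/grow_linear/create_striped helpers are my assumptions for
illustration. The script is a dry run: it only prints the commands.

```shell
#!/bin/sh
# Dry run: every command is printed, nothing is executed.
run() { echo "+ $*"; }

# Grow-as-you-go layout: each new PV is appended to the end of the LV,
# so the LV stays linear and new data lands on the newest volume.
grow_linear() {
    new_pv=$1
    run pvcreate "$new_pv"
    run vgextend vg_xfs "$new_pv"
    run lvextend -l +100%FREE /dev/vg_xfs/lv_xfs
    run xfs_growfs /xfs
}

# Striped layout: all PVs must be present when the LV is created.
# -i is the number of stripes, -I the stripe size (256k is an assumption).
create_striped() {
    run pvcreate /dev/sdb /dev/sdc /dev/sdd
    run vgcreate vg_xfs /dev/sdb /dev/sdc /dev/sdd
    run lvcreate -i 3 -I 256k -l 100%FREE -n lv_xfs vg_xfs
}

grow_linear /dev/sdd
create_striped
```

The second form only helps if every disk is there from day one, which is
exactly what a grow-later plan rules out.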

> Is LVM a good choice of doing this configuration? Or do you have a
> better recommendation?

You could try a parallel filesystem like Lustre, PVFS2, Gluster,
Ceph... These are made precisely to overcome this kind of problem and
to scale by adding more nodes.
Lustre and PVFS2 are HPC-oriented. Lustre is a complete PITA, best left
to specialists. PVFS2 is relatively easy to set up, run and extend (if
properly planned beforehand), and can run for years without a glitch.
Gluster and Ceph are more "internet oriented" and won't give as much
performance (well, actually they're pretty slow) but provide redundancy
and on-the-fly expansion.

> Mount options:
> /dev/mapper/vg_xfs-lv_xfs on /xfs type xfs
> (rw,noatime,nodiratime,logdev=/dev/vg_xfs/lv_log_xfs,nobarrier,inode64,logbsize=262144,allocsize=512m)
> If I were to use the 8-disk RAID6 array with a 256kB stripe size, it will
> have a sunit of 512 and a swidth of (8-2)*512=3072.
> # mkfs.xfs -d sunit=512,swidth=3072 /dev/mapper/vg_xfs-lv_xfs
> # mount -o remount,sunit=512,swidth=3072
> Correct?

The numbers are right (sunit and swidth count 512-byte sectors), but
don't bother with them. Use su and sw instead: su=256k,sw=6
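
For the record, a minimal sketch of how the two notations relate.
stripe_geometry is a hypothetical helper, not an XFS tool; it converts
a stripe unit in KiB and a data-disk count into the sector-based
values. The script only prints the two mkfs.xfs command lines, it does
not run them.

```shell
#!/bin/sh
# stripe_geometry (hypothetical helper): converts a stripe unit in KiB and
# a number of data disks into mkfs.xfs sunit/swidth values, which are
# expressed in 512-byte sectors.
stripe_geometry() {
    su_kib=$1
    data_disks=$2
    sunit=$(( su_kib * 1024 / 512 ))
    swidth=$(( sunit * data_disks ))
    echo "sunit=$sunit,swidth=$swidth"
}

# 8-disk RAID-6 (6 data + 2 parity) with a 256 KiB stripe unit.
# Both lines describe the same geometry; the first is easier to read.
echo "mkfs.xfs -d su=256k,sw=6 /dev/mapper/vg_xfs-lv_xfs"
echo "mkfs.xfs -d $(stripe_geometry 256 6) /dev/mapper/vg_xfs-lv_xfs"
```

With su and sw you state the stripe size and the number of data disks
directly, so there is no sector arithmetic to get wrong.
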
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |   <eflorac@xxxxxxxxxxxxxx>
                    |   +33 1 78 94 84 02
