
Re: Need advice on building a new XFS setup for large files

To: Alvin Ong <alvin.ong@xxxxxxxxxxxxxxxxx>
Subject: Re: Need advice on building a new XFS setup for large files
From: Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx>
Date: Tue, 22 Jan 2013 06:49:46 -0600
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <CAMX-HjqMCm8CdekPhJRPyvsb3v1pDKOonp5JDeOT9NBDD-T=+g@xxxxxxxxxxxxxx>
References: <CAMX-HjqMCm8CdekPhJRPyvsb3v1pDKOonp5JDeOT9NBDD-T=+g@xxxxxxxxxxxxxx>
Reply-to: stan@xxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20130107 Thunderbird/17.0.2
On 1/21/2013 10:22 PM, Alvin Ong wrote:
> Hi,
> We are building a solution with a web front end for our users to store
> large files.
> Large files starting from the size of 500GB and above and can grow up to
> 1-2TB's per file.
> This is the reason we are trying out XFS to see if we can get a test system
> running.

Tell us more about these files.  Is this simply bulk file storage?
Start at 500GB and append until 2TB?  How often will the files be
appended and at what rate?  I.e. will it take 3 days to append from
500GB to 2TB or take 3 months?  The answer to this dictates how the
files and filesystem will fragment over time.  Constantly expanding with
additional 6 spindle constituent arrays, LVM concatenation, and
xfs_growfs may leave you with an undesirable, possibly disastrous,
fragmentation pattern.

Will any of these files ever be deleted or moved to tape silo via HSM,
or manually?  Deletion also greatly affects fragmentation patterns.

What is the planned read workload of these files once written?  High
performance parallel read of a single file, i.e. TPC-H data mining, is
not feasible with this configuration.

> We plan to use a 6+2 RAID6 to start off with. Then when it gets filled up
> to maybe 60-70% we will
> expand by adding another 6+2 RAID6 to the array.
> The max we can grow this configuration is up to 252TB usable which should
> be enough for a year.
> Our requirements might grow up to 2PB in 2 years time if all goes well.

I'd not attempt growing a single XFS to the scale you're describing, via
the methods you describe.  The odds of catastrophe are too great.

Important question:  What make/model of storage array are you using?
The quality and reliability of it makes a difference in choosing a
proper architecture and expansion methodology.  Is this a NetApp filer
or other?  How is the front end web host connected?  4/8Gb Fibre
Channel?  1Gb iSCSI?  10Gb iSCSI?

> So I have been testing all of this out on a VM running 3 vmdk's and using
> LVM to create a single logical volume of the 3 disks.
> I noticed that out of sdb, sdc and sdd, files keep getting written to sdc.
> This is probably due to our web app creating a single folder and all files
> are written under that folder.
> This is the nature of the Allocation Group of XFS? Is there a way to avoid
> this? 


1.  Don't put all files in a single directory.

2.  Use the inode32 allocator on a filesystem greater than 1TB in size.
 This will cause inodes to be located in the first 1TB and files to be
allocated round robin across the AGs via rotor stepping.  See page 10:

From what you've stated so far, inode32 would seem ideal for your
workload, as you have relatively few massive files, very little
metadata, and would like all files in a single directory.  Inode32 on a
huge XFS can give you that.

> As we will have files keep writing to the same disk thus creating a
> hot spot.
> Although it might not hurt us that much if we fill up a single RAID6 to
> 60-70% then adding another RAID6 to the mix. We could go up to a total of
> 14 RAID6 sets.

Again, you probably don't want to do this.  Too many eggs in one basket.

You should investigate using GlusterFS to tie multiple XFS storage
servers together into a single file tree.  A proper Gluster/XFS
architecture provides for better resiliency, failover, throughput, etc.
 Start with 4 Gluster nodes each with a 6+2 RAID6, expanding all 4 nodes
simultaneously, resulting in each node with a max 63TB XFS.  Gluster
provides the ability to mirror files across nodes as well as some other
tricks which increase resiliency to failures.

Running an xfs_repair on a single filesystem denies all access, and with
a 252TB XFS this could take some time.  With the Gluster architecture,
you can take a Gluster node offline to run the xfs_check and users never
know the difference as the other 3 nodes handle the load.

> Is LVM a good choice of doing this configuration? Or do you have a better
> recommendation?
> The reason we thought LVM would be good was so that we could easily grow
> XFS.

Why not do the concatenation within the SAN array controller?

> Is I was to use the 8-disk RAID6 array with a 256kB stripe size will have a
> sunit of 512 and a swidth of (8-2)*512=3072.

So a 256KB strip and a 1.5MB stripe.  With RAID6 RMW?  I wouldn't
recommend this.
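
For reference, the arithmetic behind those numbers can be sketched in a few
lines of shell.  This is only an illustration: the function names are made up
for this example, and it assumes the sunit/swidth values are counted in
512-byte sectors, which is how mkfs.xfs interprets them.

```shell
#!/bin/sh
# Hypothetical helpers, named for this example only.

strip_to_sunit() {
    # $1 = strip (stripe unit) size in KB; one KB is two 512-byte sectors
    echo $(( $1 * 1024 / 512 ))
}

stripe_swidth() {
    # $1 = strip size in KB, $2 = data disks (total disks minus parity)
    echo $(( $(strip_to_sunit "$1") * $2 ))
}

stripe_kb() {
    # Full stripe size in KB: strip size times data disks
    echo $(( $1 * $2 ))
}

# The 6+2 RAID6 with a 256KB strip from the quoted question
echo "sunit=$(strip_to_sunit 256) swidth=$(stripe_swidth 256 6) stripe=$(stripe_kb 256 6)KB"
```

A 1536KB stripe is the 1.5MB figure mentioned above.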

> # mkfs.xfs -d sunit=512,swidth=3072 /dev/mapper/vg_xfs-lv_xfs
> # mount -o remount,sunit=512,swidth=3072
> Correct?

It appears most of your writes will be appends, meaning little
allocation, which means little stripe aligned write out.  Here you are
trying to optimize for large IOs which would be fine if you had an all
or mostly allocation workload, but you don't.  You have an append heavy
workload.

Using large strips (stripe units, chunks) with parity RAID, especially
RAID6, will simply murder your append performance due to massive
read-modify-write operations on large strips.
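
A very simplified way to see the effect is to count how much of an append
lands in full stripes and how much is left over as a partial stripe, since
only the partial part needs a parity update against existing data.  The
sketch below assumes the append starts on a stripe boundary and uses a 1MB
append purely as a hypothetical example size; the function names are made up
for illustration, and real controller behavior (caching, coalescing) is not
modeled.

```shell
#!/bin/sh
# Hypothetical helpers, named for this example only.

full_stripes() {
    # $1 = write size in KB, $2 = strip size in KB, $3 = data disks
    echo $(( $1 / ($2 * $3) ))
}

tail_kb() {
    # KB left over after the last full stripe; this part is a partial
    # stripe write and is where read-modify-write comes into play
    echo $(( $1 % ($2 * $3) ))
}

# Hypothetical 1MB (1024KB) append on a 6+2 RAID6
echo "256KB strip: $(full_stripes 1024 256 6) full stripes, $(tail_kb 1024 256 6)KB partial"
echo "32KB strip:  $(full_stripes 1024 32 6) full stripes, $(tail_kb 1024 32 6)KB partial"
```

Under these assumptions the whole 1MB append is a partial stripe write with
the 256KB strip, while with a 32KB strip all but 64KB of it goes out as full
stripes.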

With RAID6 with a mostly append workload, you should be using a small
strip size.  This has been discussed here at length and the consensus is
anything over a 32KB strip size doesn't improve performance, but can
hurt performance, especially with parity RAID.  Thus you should create
your 6+2 arrays with a 32KB strip and (6*32)=192KB stripe, and create
your XFS with "-d su=32k,sw=6".  This should yield significantly better
append performance.
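
As a dry run, the geometry above can be turned into the mkfs.xfs command line
like this.  The script only prints the command, it does not run it.  The
device path is the one from your quoted commands, and the function name is
made up for this example.

```shell
#!/bin/sh
# Dry run: prints the mkfs.xfs command for a given RAID geometry.

STRIP_KB=32                          # strip (stripe unit) size in KB
DATA_DISKS=6                         # 6+2 RAID6: 8 disks minus 2 parity
DEVICE=/dev/mapper/vg_xfs-lv_xfs     # path taken from the quoted commands

mkfs_cmd() {
    # $1 = strip size in KB, $2 = data disks, $3 = block device
    echo "mkfs.xfs -d su=${1}k,sw=${2} ${3}"
}

echo "full stripe: $(( STRIP_KB * DATA_DISKS ))KB"
mkfs_cmd "$STRIP_KB" "$DATA_DISKS" "$DEVICE"
```

Note that su takes a byte size with a suffix and sw is a count of data disks,
unlike sunit/swidth, which are both given in 512-byte sectors.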

> What about the logdev option? What is the optimal size to create for it?

You don't use an external log device for workloads that have no metadata
operations.  By your account above you'll have approximately 125-500
files stored in 252TB net of disk space.  Which means you'll update the
directory trees with something like one write every few days.
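
The file count estimate is just capacity divided by file size.  A quick
sketch, assuming decimal units (1TB = 1000GB) and a filesystem filled
entirely with files of one size; the function name is made up for this
example.

```shell
#!/bin/sh
# Hypothetical helper, named for this example only.

files_at() {
    # $1 = usable capacity in TB, $2 = size of each file in GB
    echo $(( $1 * 1000 / $2 ))
}

echo "all 2TB files:   $(files_at 252 2000)"
echo "all 500GB files: $(files_at 252 500)"
```

That gives 126 and 504 files respectively, which is where the rough 125-500
range comes from.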

External log devices are for systems that modify metadata at rates of
hundreds of IOs per second.  So don't specify a log device.

