
Re: Need advice on building a new XFS setup for large files

To: stan@xxxxxxxxxxxxxxxxx
Subject: Re: Need advice on building a new XFS setup for large files
From: Alvin Ong <alvin.ong@xxxxxxxxxxxxxxxxx>
Date: Wed, 23 Jan 2013 23:09:18 +0800
Cc: Dave Chinner <david@xxxxxxxxxxxxx>, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <50FFD911.3@xxxxxxxxxxxxxxxxx>
References: <CAMX-HjqMCm8CdekPhJRPyvsb3v1pDKOonp5JDeOT9NBDD-T=+g@xxxxxxxxxxxxxx> <50FE8AEA.7020300@xxxxxxxxxxxxxxxxx> <20130122220511.GN2498@dastard> <50FFD911.3@xxxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130107 Thunderbird/17.0.2
Thanks Stan, Dave and Emmanuel for such informative replies. I will take some time to digest this information and weigh the options. As for the files, they start at a minimum of 500GB. The rate of growth is not yet known, but the load won't be high. The idea is something like cloud storage where the customer can dump data. That said, we also do not want to run into fragmentation issues or any failure that could cause data loss in the future.


On 23-Jan-13 8:35 PM, Stan Hoeppner wrote:
On 1/22/2013 4:05 PM, Dave Chinner wrote:
On Tue, Jan 22, 2013 at 06:49:46AM -0600, Stan Hoeppner wrote:
On 1/21/2013 10:22 PM, Alvin Ong wrote:

We are building a solution with a web front end for our users to store
large files.
The files start at 500GB and can grow up to
1-2TB per file.
This is the reason we are trying out XFS to see if we can get a test system
Tell us more about these files.  Is this simply bulk file storage?
Start at 500GB and append until 2TB?  How often will the files be
appended and at what rate?  I.e. will it take 3 days to append from
500GB to 2TB or take 3 months?  The answer to this dictates how the
files and filesystem will fragment over time.  Constantly expanding with
additional 6 spindle constituent arrays, LVM concatenation, and
xfs_growfs may leave you with an undesirable, possibly disastrous,
fragmentation pattern.
I'd say it's guaranteed, not a possibility.

We plan to use a 6+2 RAID6 to start off with. Then when it gets filled up
to maybe 60-70% we will
expand by adding another 6+2 RAID6 to the array.
The max we can grow this configuration is up to 252TB usable which should
be enough for a year.
Our requirements might grow up to 2PB in 2 years time if all goes well.
I'd not attempt growing a single XFS to the scale you're describing, via
the methods you describe.  The odds of catastrophe are too great.
It's a recipe for disaster and not recommended at all.

So I have been testing all of this out on a VM running 3 vmdk's and using
LVM to create a single logical volume of the 3 disks.
I noticed that out of sdb, sdc and sdd, files keep getting written to sdc.
This is probably due to our web app creating a single folder and all files
are written under that folder.
This is the nature of the Allocation Group of XFS? Is there a way to avoid

1.  Don't put all files in a single directory.

2.  Use the inode32 allocator on a filesystem greater than 1TB in size.
  This will cause inodes to be located in the first 1TB and files to be
allocated round robin across the AGs via rotor stepping.  See page 10:
3: Use a storage layout that is not affected by hotspots due to
filesystem locality.

That is, build the storage to the scale that you are likely to need
in the future. i.e. use all 112 disks (14x 6+2 RAID = 112 disks) to
begin with and lay the storage and filesystem out optimally
accordingly.  I'd build seven 14+2 hardware RAID6 luns (112 disks)
and stripe them in RAID0, setting the XFS stripe unit to be the
width of a hardware RAID6 lun. That way sequential IO to a single
region of the disk still hits every single disk in the array, and
hotspots don't occur.

If you do this, it doesn't matter if you use inode64 or inode32 from
a hotspot perspective, only from a file fragmentation perspective. This
is the way XFS has been used for exactly this sort of storage for
the last 15 years....
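
As a rough sketch of the arithmetic behind the layout suggested above (seven 14+2 RAID6 luns striped in RAID0, with the XFS stripe unit equal to the width of one lun), something like the following could be used. The per-disk strip size is not given anywhere in this thread, so STRIP_KIB = 64 is purely an assumed value for illustration, and the function name is hypothetical.

```python
# Sketch of the layout arithmetic for seven 14+2 RAID6 luns striped in RAID0.
# STRIP_KIB is an assumption for illustration; the thread does not state one.

LUNS = 7
DATA_DISKS_PER_LUN = 14
PARITY_DISKS_PER_LUN = 2
STRIP_KIB = 64  # hypothetical per-disk strip size


def layout(luns, data_disks, parity_disks, strip_kib):
    """Return disk counts and XFS stripe geometry for a RAID0 of RAID6 luns."""
    # Width of one RAID6 lun = data disks * strip size (parity excluded).
    lun_width_kib = data_disks * strip_kib
    return {
        "total_disks": luns * (data_disks + parity_disks),
        "data_disks": luns * data_disks,
        "su_kib": lun_width_kib,   # XFS stripe unit = one full RAID6 lun width
        "sw": luns,                # XFS stripe width = number of luns in the RAID0
        "full_stripe_kib": lun_width_kib * luns,
    }


geom = layout(LUNS, DATA_DISKS_PER_LUN, PARITY_DISKS_PER_LUN, STRIP_KIB)
print(geom)
print("mkfs.xfs -d su=%dk,sw=%d" % (geom["su_kib"], geom["sw"]))
```

With a different strip size only su changes; sw stays at 7 because it counts the luns in the RAID0 stripe, not the individual disks.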

As files will keep being written to the same disk, this creates a
hot spot.
It might not hurt us that much, though, if we fill up a single RAID6 to
60-70% and then add another RAID6 to the mix. We could go up to a total of
14 RAID6 sets.
Again, you probably don't want to do this.  Too many eggs in one basket.

You should investigate using GlusterFS to tie multiple XFS storage
servers together into a single file tree.
Another possible solution. You should talk to RedHat (says the
RedHat employee ;)....
I get the impression the "grow as you go" mindset here is probably due
to budget/cash flow issues, as well as evaluating the system at small
scale before committing to going larger.  Thus I'd guess building the
112 drive system up front isn't a real possibility.  And this is where
something like Gluster atop XFS would really come in handy, as it would
make "grow as you go" much more feasible, while avoiding the 'game over'
fragmentation issue with simply growing XFS in the manner described by
the OP.

Emmanuel states Gluster is slow, but that's a very relative statement.
For clients streaming single large files over GbE or slower links it
should be plenty fast.  Gluster and similar network file systems tend to
be slow with metadata intensive or transactional workloads.

Running an xfs_repair on a single filesystem denies all access, and with
a 252TB XFS this could take some time.
For a filesystem with 1-2TB files, it'll take 30s to run. That's not
an issue.
For some reason I was thinking data size instead of metadata.  With only
a few hundred to low thousand files it would be quick indeed, a non issue.

Is LVM a good choice of doing this configuration? Or do you have a better
The reason we thought LVM would be good was so that we could easily grow
Why not do the concatenation within the SAN array controller?
Same problem as LVM concatenation. Hot spots.
I was simply suggesting hardware vs software concatenation here
unrelated to his current flawed expansion path idea, as his SAN
controller probably has some nice features and performance here.

If I were to use the 8-disk RAID6 array with a 256kB stripe size, it would
have a sunit of 512 and a swidth of (8-2)*512=3072.
So a 256KB strip and a 1.5MB stripe.  With RAID6 RMW?  I wouldn't
recommend this.
Large files, sequential IO, there will be no RMW cycles in the RAID.
The write cache of RAID controller will do the aggregation of
individual IOs into full stripe writes just fine.
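
For reference, the sunit/swidth figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes sunit and swidth are expressed in 512-byte sectors, and the function name is made up for illustration.

```python
SECTOR_BYTES = 512  # sunit/swidth are assumed to be in 512-byte sectors


def stripe_geometry(strip_kib, total_disks, parity_disks):
    """Return (sunit, swidth, full_stripe_kib) for a parity RAID array."""
    data_disks = total_disks - parity_disks
    sunit = strip_kib * 1024 // SECTOR_BYTES   # one strip, in sectors
    swidth = sunit * data_disks                # one full stripe, in sectors
    stripe_kib = strip_kib * data_disks        # one full stripe, in KiB
    return sunit, swidth, stripe_kib


# 8-disk RAID6 (6 data + 2 parity) with a 256KB strip
sunit, swidth, stripe_kib = stripe_geometry(256, 8, 2)
print(sunit, swidth, stripe_kib)
```

For the 6+2 array this gives a sunit of 512, a swidth of 3072 and a 1536KB (1.5MB) full stripe, matching the numbers in the exchange above.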

It appears most of your writes will be appends, meaning little
allocation, which means little stripe aligned write out.  Here you are
trying to optimize for large IOs which would be fine if you had an all
or mostly allocation workload, but you don't.  You have an append heavy

Using large strips (stripe units, chunks) with parity RAID, especially
RAID6, will simply murder your append performance due to massive
read-modify-write operations on large strips.
No, that's wrong. Sequential IO will always fill full stripes in the
cache, so RMW cycles simply will not happen. Remember that RMW
occurs when the cache has to be flushed to the back end disks, not
when writes come in to the front end cache....
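
A very simplified model of the point above, assuming the controller cache holds one contiguous dirty byte range and flushes it in whole stripes where it can. Real controllers are more sophisticated; the function name and the 10MB example are illustrative assumptions only.

```python
def split_full_stripes(offset, length, stripe_bytes):
    """Split a contiguous dirty range into (full_stripe_bytes, partial_bytes).

    Full stripes can be written with freshly computed parity; only the
    partial head/tail would need a read-modify-write if flushed as-is.
    """
    end = offset + length
    first_full = -(-offset // stripe_bytes) * stripe_bytes  # round offset up
    last_full = (end // stripe_bytes) * stripe_bytes        # round end down
    full = max(0, last_full - first_full)
    return full, length - full


STRIPE = 6 * 256 * 1024  # 6 data disks * 256KB strip = 1.5MB full stripe

# 10MB of sequential data cached, starting on a stripe boundary
full, partial = split_full_stripes(0, 10 * 1024 * 1024, STRIPE)
print(full // STRIPE, "full stripes,", partial, "bytes left over")
```

In this model the leftover tail can simply stay in cache until further sequential writes complete its stripe, which is why a steady sequential stream rarely has to fall back to RMW.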

With RAID6 with a mostly append workload, you should be using a small
strip size.  This has been discussed here at length and the consensus is
anything over a 32KB strip size doesn't improve performance, but can
hurt performance, especially with parity RAID.  Thus you should create
your 6+2 arrays with a 32KB strip and (6*32)=192KB stripe, and create
your XFS with "-d su=32k,sw=6".  This should yield significantly better
append performance.
That's a tuning for an IOPS intensive workload, not a large scale,
large file storage workload.

While sequential writes are an append workload, it's an append
workload that the RAID controller is optimised to avoid causing RMW
cycles for. As such, the above is bad advice for large files with
sequential IO workloads. Large files, large filesystem, sequential
IO is ideal for large RAID6 widths....
Yes, of course.  WRT XFS you've drilled "allocation=aligned" and "non
allocation=unaligned" so thoroughly into my head that I failed to
actually think for a second about what the hardware does with this type
of large append data stream.  I feel a bit silly making this juvenile
oversight.  Won't happen again. ;)

External log devices are for systems that modify metadata at rates of
hundreds of IOs per second.  So don't specify a log device.
Even at hundreds of thousands of IOs per second, external logs don't
provide much in the way of benefit thanks to delayed logging. The only
reason for using an external log these days is a fsync heavy or
synchronous write workload. And in most cases a BBWC means even
those workloads don't need an external log...
Which bloke provided us with this journal magic code again?  Can't
recall his name... ;)
