Is XFS suitable for 350 million files on 20TB storage?
Sean Caron
scaron at umich.edu
Fri Sep 5 17:39:42 CDT 2014
Hi Stefan,
Generally speaking, this is a situation that you want to avoid. At 350
million files and 20 TB, you're looking at something like 57 KB per file
on average (20 TB / 350M files)? That's pretty small. And with 350M
files, a fair number of those must be on the smaller side of things.
Memory is cheap these days... if people roll their data up into 50 GB or
100 GB files, they can go ahead and read those things directly into
memory. And CPU cycles are pretty cheap, too. You certainly get more
bang for your buck there than in IOPS on your storage system!
Empirically, I have found (currently running Linux 3.4.61; many
historical revs previous to that) in a reasonably large-scale (up to
~0.5 PB in a single file system, up to 270 JBOD spindles on one
machine), high-I/O (jobs running on a few-hundred-node compute cluster,
or a few hundred threads running locally on the server) environment
that XFS (and things running on top of it, ESPECIALLY rsync) will
perform MUCH better on small numbers of very large files than it will
on very large numbers of small files (I'm always trying to reinforce
this to our end users).
I'm not really even saying XFS is to blame here... in fact, in 3.4.61
it has been very well-behaved; but Linux has many warts: weak I/O and
CPU scheduling algorithms; a kernel that does not degrade gracefully in
resource-constrained settings; and if you are ultimately using this
data store as a file share, the protocol implementations (NFS, CIFS,
etc.) have their own issues. Not trying to dog all the hardworking free
software devs out there, but clearly much work remains to be done in
many areas to make Linux really ready to play in the "big leagues" of
computing (unless you have a local staff of good systems programmers
with some free time on their hands...). XFS is just one piece of the
puzzle we have to work with in trying to integrate a Linux system as a
good high-throughput storage machine.
If there is any way that you can use simple concatenation or some kind
of archiver... even things as simple as shar, tar, zip... to get the
file sizes up and the absolute number of files down, you should notice
some big performance gains when trying to process your 20 TB worth of
stuff.
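For example, something along these lines (just a rough sketch; the
/data/projects and /archive paths are made up, and you'd want to verify
each archive before removing any originals) rolls each top-level
subdirectory into a single tar file:

  cd /data/projects
  for d in */ ; do
      # one archive per subdirectory; layout is preserved inside the tar
      tar -cf /archive/"${d%/}".tar "$d"
  done

Jobs that need the data can then read out of one big file instead of
stat'ing and opening tens of thousands of little ones.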
If you can't dramatically increase individual file size while dramatically
reducing the absolute number of files for whatever reason in your
environment, I think you can still win by trying to reduce the number of
files in any one directory. You want to look out for directories that
have five or six figures worth of files in them; those can be real
performance killers. If your claim of no more than 5,000 files per
directory is accurate, that shouldn't be a big deal for XFS at all; I
don't think you're in bad shape there.
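If you want to sanity-check that, a quick way to spot the heaviest
directories (a sketch assuming GNU find; /srv/data stands in for your
mount point) is:

  # count files per directory, largest counts first
  find /srv/data -xdev -type f -printf '%h\n' \
      | sort | uniq -c | sort -rn | head -20

That prints the twenty directories holding the most files, which are
the ones worth splitting up first.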
Rsync can be just the worst in this kind of scenario. It runs so slowly
that you sometimes feel like you might as well be on 10 Mbit Ethernet
(or worse).
I'm not sure exactly what your application is here... it sounds backup
related. If you're doing rsync, you can win a little bit by dropping
down a level or two in your directory hierarchy from the top of the
tree where XFS is mounted, and running a number of rsync processes in
parallel, one per directory, instead of just one top-level rsync for
the entire filesystem. Experiment to find the best degree of
parallelism; run too many and they can deadlock, or just step all over
one another.
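A minimal sketch of that idea (assuming GNU xargs; backuphost and the
paths are made up, and it assumes directory names without spaces):

  cd /srv/data
  # up to 8 rsyncs at once, one per top-level directory; tune -P to taste
  ls -d */ | xargs -P 8 -I{} rsync -a {} backuphost:/backup/data/{}

Each rsync then walks a much smaller subtree, and the streams overlap
their I/O and network time.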
Also, I have a suspicion (sorry, I can't back this up quantitatively)
that if you are just trying to do a straight copy from here to there, a
'cp -Rp' will be faster than an rsync. You might be better off doing an
initial copy with 'cp -Rp' and then just synchronizing the diffs at the
end with an rsync pass, rather than trying to do the whole thing with
rsync.
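In other words, something like this rough two-pass sketch (the paths
are placeholders):

  # bulk copy first; cp skips rsync's per-file comparison machinery
  cp -Rp /srv/data/. /mnt/backup/data/
  # then one rsync pass to pick up whatever changed during the copy
  rsync -a /srv/data/ /mnt/backup/data/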
Hope some of this might help... just casual thoughts from a daily
XFS-wrangler ;)
Best,
Sean
On Fri, Sep 5, 2014 at 5:24 PM, Brian Foster <bfoster at redhat.com> wrote:
> On Fri, Sep 05, 2014 at 10:14:51PM +0200, Stefan Priebe wrote:
> >
> > Am 05.09.2014 21:18, schrieb Brian Foster:
> > ...
> >
> > >On Fri, Sep 05, 2014 at 08:07:38PM +0200, Stefan Priebe wrote:
> > >Interesting, that seems like a lot of free inodes. That's 1-2 million in
> > >each AG that we have to look around for each time we want to allocate an
> > >inode. I can't say for sure that's the source of the slowdown, but this
> > >certainly looks like the kind of workload that inspired the addition of
> > >the free inode btree (finobt) to more recent kernels.
> > >
> > >It appears that you still have quite a bit of space available in
> > >general. Could you run some local tests on this filesystem to try and
> > >quantify how much of this degradation manifests on sustained writes vs.
> > >file creation? For example, how is throughput when writing a few GB to a
> > >local test file?
> >
> > Not sure if this is what you expect:
> >
> > # dd if=/dev/zero of=bigfile oflag=direct,sync bs=4M count=1000
> > 1000+0 records in
> > 1000+0 records out
> > 4194304000 bytes (4,2 GB) copied, 125,809 s, 33,3 MB/s
> >
> > or without sync
> > # dd if=/dev/zero of=bigfile oflag=direct bs=4M count=1000
> > 1000+0 records in
> > 1000+0 records out
> > 4194304000 bytes (4,2 GB) copied, 32,5474 s, 129 MB/s
> >
> > > How about with that same amount of data broken up
> > >across a few thousand files?
> >
> > This results in heavy kworker usage.
> >
> > 4GB in 32kb files
> > # time (mkdir test; for i in $(seq 1 1 131072); do dd if=/dev/zero
> > of=test/$i bs=32k count=1 oflag=direct,sync 2>/dev/null; done)
> >
> > ...
> >
> > 55 min
> >
>
> Both seem pretty slow in general. Any way you can establish a baseline
> for these tests on this storage? If not, the only other suggestion I
> could make is to allocate inodes until all of those freecount numbers
> are accounted for and see if anything changes. That could certainly take
> some time and it's not clear it will actually help.
>
> > >Brian
> > >
> > >P.S., Alternatively if you wanted to grab a metadump of this filesystem
> > >and compress/upload it somewhere, I'd be interested to take a look at
> > >it.
> >
> > I think there might be file and directory names in it. If this is the
> > case I can't do it.
> >
>
> It should enable obfuscation by default, but I would suggest restoring
> it yourself and verifying that it meets your expectations.
>
> Brian
>
> > Stefan
> >
> >
> > >
> > >>Thanks!
> > >>
> > >>Stefan
> > >>
> > >>
> > >>
> > >>>Brian
> > >>>
> > >>>>>... as well as what your typical workflow/dataset is for this fs. It
> > >>>>>seems like you have relatively small files (15TB used across 350m
> > >>>>>files is around 46k per file), yes?
> > >>>>
> > >>>>Yes - most of them are even smaller. And some files are > 5GB.
> > >>>>
> > >>>>>If so, I wonder if something like the
> > >>>>>following commit introduced in 3.12 would help:
> > >>>>>
> > >>>>>133eeb17 xfs: don't use speculative prealloc for small files
> > >>>>
> > >>>>Looks interesting.
> > >>>>
> > >>>>Stefan
> >
>
>