
Re: creating a new 80 TB XFS

To: Linux fs XFS <xfs@xxxxxxxxxxx>
Subject: Re: creating a new 80 TB XFS
From: pg_xf2@xxxxxxxxxxxxxxxxxx (Peter Grandi)
Date: Fri, 24 Feb 2012 14:52:47 +0000
In-reply-to: <4F478818.4050803@xxxxxxxxxxxxxxxxx>
References: <4F478818.4050803@xxxxxxxxxxxxxxxxx>
[ ... ]

> We are getting now 32 x 3 TB Hitachi SATA HDDs. I plan to
> configure them in a single RAID 6 set with one or two
> hot-standby discs. The raw storage space will then be 28 x 3
> TB = 84 TB.  On this one RAID set I will create only one
> volume.  Any thoughts on this?

Well, many storage experts would be impressed by and support
such an audacious plan...

But I think that wide RAID6 sets and large RAID6 stripes are a
phenomenally bad idea, and large filetrees are also strikingly bad,
and the two combined seem to me almost the worst possible setup.
It is also remarkably brave to use 32 identical drives in a RAID
set. But all this is very popular because in the beginning "it
works" and is really cheap.

The proposed setup has only 7% redundancy, RMW issues with large
stripe sizes, and 'fsck' time and space issues with large trees.
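To see why the RMW (read-modify-write) issue bites, here is a quick back-of-the-envelope calculation; the 64 KiB chunk size is a hypothetical value, not anything from the original post:

```shell
# Hypothetical RAID6 geometry: 64 KiB chunk per drive, 28 data drives.
chunk_kib=64
data_drives=28
stripe_kib=$((chunk_kib * data_drives))
echo "full stripe = ${stripe_kib} KiB"
# Any write smaller than a full stripe forces the RAID layer to read the
# old data and parity, recompute parity, and write everything back, so
# small metadata updates pay a large IO penalty on such a wide set.
```

With these numbers a full stripe is 1792 KiB, so even a 1 MiB write is a partial-stripe write and triggers RMW.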

Consider this series of blog notes:


> This storage will be used as secondary storage for backups. We
> use dirvish (www.dirvish.org, which uses rsync) to run our
> daily backups.

So it will be lots and lots of metadata (mostly directory)
updates. Not a very good match there, especially considering
that almost all access will be writes, even for data, and
presumably from multiple hosts concurrently. You may benefit
considerably from putting the XFS log on a separate disk, and,
if you use Linux MD for RAID, the write-intent bitmaps on a
separate disk too.
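A sketch of that log/bitmap split; '/dev/sdY1' and the bitmap path are hypothetical placeholders, and the log size is just an illustrative choice:

```shell
# External XFS log on a separate (ideally small, fast) device:
mkfs.xfs -l logdev=/dev/sdY1,size=128m /dev/sdX1
mount -o logdev=/dev/sdY1 /dev/sdX1 /mount_point

# With Linux MD, a write-intent bitmap can likewise live on another
# filesystem (it must not be on the array it tracks):
mdadm --grow /dev/md0 --bitmap=/some/other/fs/md0-bitmap
```

The point of both moves is the same: keep the frequently rewritten small records off the wide RAID6 set, where every small write pays the RMW penalty.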

> *MKFS* We also heavily use ACLs for almost all of our files.

That's a daring choice.

> [ ... ] "-i size=512" on XFS creation, so my mkfs.xfs would look
> something like: mkfs.xfs -i size=512 -d su=stripe_size,sw=28
> -L Backup_2 /dev/sdX1

As a rule I specify a sector size of 4096, and in your case
perhaps an inode size of 2048 might be appropriate, to raise the
chance of ACLs and directories being fully stored in inode tails,
which seems particularly important in your case. Something like:

  -s size=4096 -b size=4096 -i size=2048,attr=2
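Putting those options together with your stripe parameters gives something like the following; 'su=64k' is a placeholder that must match the actual RAID chunk size, and 'sw=28' the data-disk count:

```shell
# Sketch only: adjust su/sw to the real RAID geometry before use.
mkfs.xfs -s size=4096 -b size=4096 -i size=2048,attr=2 \
         -d su=64k,sw=28 -L Backup_2 /dev/sdX1
```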

> mount -o noatime,nobarrier,nofail,logbufs=8,logbsize=256k,inode64
> /dev/sdX1 /mount_point

'nobarrier' seems rather optimistic unless you are very very
sure there won't be failures.
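If you are not certain about power-loss protection on the controller cache, a more conservative mount line simply drops 'nobarrier' and keeps barriers at their default:

```shell
# Barriers stay enabled (the default) unless the write cache is known
# to be battery- or flash-backed:
mount -o noatime,nofail,logbufs=8,logbsize=256k,inode64 \
      /dev/sdX1 /mount_point
```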

There are many other details to look into, from readahead to
flusher frequency.
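For example, readahead and writeback flushing can be tuned like this; the values are illustrative starting points, not recommendations:

```shell
# Readahead, in 512-byte sectors (here 2 MiB), per block device:
blockdev --setra 4096 /dev/sdX

# Start background writeback sooner and expire dirty pages faster,
# so huge bursts of dirty metadata do not pile up:
sysctl vm.dirty_background_ratio=5
sysctl vm.dirty_expire_centisecs=1500
```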
