Re: Using xfs_growfs on SSD raid-10

To: Alexey Zilber <alexeyzilber@xxxxxxxxx>
Subject: Re: Using xfs_growfs on SSD raid-10
From: Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx>
Date: Wed, 09 Jan 2013 23:48:57 -0600
Cc: xfs@xxxxxxxxxxx
In-reply-to: <CAGdvdE3eeQY1xX0Zdskr461D6ag+JC4tWEozhK32108G3y_=9A@xxxxxxxxxxxxxx>
References: <CAGdvdE3VnYKg8OXFZ-0eALuhK=Qdt-Apj0uwrB8Yfs=4Uun3UA@xxxxxxxxxxxxxx> <50EE33BC.8010403@xxxxxxxxxxxxxxxxx> <CAGdvdE3eeQY1xX0Zdskr461D6ag+JC4tWEozhK32108G3y_=9A@xxxxxxxxxxxxxx>
Reply-to: stan@xxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20130107 Thunderbird/17.0.2
On 1/9/2013 9:50 PM, Alexey Zilber wrote:
> Hi Stan,
>   Please see in-line:
> On Thu, Jan 10, 2013 at 11:21 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
>> On 1/9/2013 7:23 PM, Alexey Zilber wrote:
>>> Hi All,
>>>   I've read the FAQ on 'How to calculate the correct sunit,swidth values
>>> for optimal performance' when setting up xfs on a RAID.  Thing is, I'm
>>> using LVM, and with the colo company we use, the easiest thing I've found,
>>> when adding more space, is to have another RAID added to the system, then
>>> I'll just pvcreate, expand the vgroup over it, lvextend and xfs_growfs and
>>> I'm done.  That is probably sub-optimal on an SSD raid.
>>> Here's the example situation.  I start off with a 6 (400GB) raid-10.  It's
>>> got 1M stripe sizes.  So I would start with pvcreate --dataalignment 1M
>>> /dev/sdb
>>> after all the lvm stuff I would do: mkfs.xfs -L mysql -d su=1m,sw=3
>>> /dev/mapper/db-mysql
>>> (so the above reflects the 3 active drives, and 1m stripe. So far so good?)
>>> Now, I need more space. We have a second raid-10 added, that's 4 (400gb)
>>> drives. So I do the same pvcreate --dataalignment 1M /dev/sdc
>>> then vgextend and lvextend, and finally; with xfs_growfs, there's no way to
>>> specify, change su/sw values.  So how do I do this?  I'd rather not use
>>> mount options, but is that the only way, and would that work?
>> It's now impossible to align the second array.  You have a couple of
>> options:
> Only the sw=3 is no longer valid, correct?  There's no way to add sw=5?

You don't have a 5 spindle array.  You have a 3 spindle array and a 2
spindle array.  Block striping occurs WITHIN an array, not ACROSS arrays.
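
To make the spindle counts concrete, here is a minimal shell sketch,
assuming a conventional RAID10 built from mirrored pairs.  The function
name raid10_data_spindles is made up for illustration; it is not part of
any tool.

```shell
# Data-bearing spindles in a RAID10 set, assuming mirrored pairs:
# half of the drives hold mirror copies, so only half carry unique strips.
raid10_data_spindles() {
    drives=$1
    if [ "$drives" -lt 2 ] || [ $((drives % 2)) -ne 0 ]; then
        echo "expected an even drive count of 2 or more (got $drives)" >&2
        return 1
    fi
    echo $((drives / 2))
}

raid10_data_spindles 6   # the first array in this thread
raid10_data_spindles 4   # the second array in this thread
```

The two results are per-array values.  They are never added together,
because each array stripes only across its own members.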

>> 1.  Mount with "noalign", but that only affects data, not journal writes
> Is "noalign" the default when no sw/su option is specified at all?

No.  When no alignment options are specified during creation, neither
journal nor data writes are aligned.  This mount option acts only on
data writes.

>> 2.  Forget LVM and use the old tried and true UNIX method of expansion:
>>  create a new XFS on the new array and simply mount it at a suitable
>> place in the tree.
> Not a possible solution.  The space is for a database and must be
> contiguous.

So you have a single file that is many hundreds of GBs in size, is that
correct?  If this is the case then XFS alignment does not matter.
Writes to existing files are unaligned.  Alignment is used during
allocation only.  And since you're working with a single large file you
will likely have few journal writes, so again, alignment isn't a problem.

*** Thus, you're fretting over nothing, as alignment doesn't matter for
your workload. ***

Since we've established this, everything below is for your education,
and doesn't necessarily pertain to your expansion project.

>> 3.  Add 2 SSDs to the new array and rebuild it as a 6 drive RAID10 to
>> match the current array.  This would be the obvious and preferred path,
> How is this the obvious and preferred path when I still can't modify the sw
> value?

It is preferred precisely because you don't need to specify new su/sw
values in this situation:  they haven't changed.  You now have two
identical arrays glued together with LVM.  XFS writes to its allocation
groups.  Each AG exists on only one RAID array or the other (when set up
properly).  It is the allocation group that is aligned to the RAID, not
the "entire XFS filesystem" as you might see it.  Thus, since both
arrays have identical su/sw, nothing changes.  When you grow the XFS, it
simply creates new AGs in the space on the new RAID, and the new AGs are
properly aligned automatically, as both arrays are identical.
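
As a rough sketch of that expansion, the function below only prints the
commands rather than running them.  The volume group "db" and logical
volume "mysql" are inferred from /dev/mapper/db-mysql, the mount point
/var/lib/mysql is an assumption, and grow_plan is a made-up helper name.
The alignment value must be whatever you have verified for the first
array; 1M is simply the figure quoted earlier in this thread.

```shell
# Dry run: print the expansion steps for a second, identical array.
# Nothing here touches a disk; review the output before running any of it.
grow_plan() {
    new_pv=$1; vg=$2; lv=$3; mnt=$4; align=$5
    echo "pvcreate --dataalignment $align $new_pv"   # same alignment as the first PV
    echo "vgextend $vg $new_pv"                      # add the new PV to the volume group
    echo "lvextend -l +100%FREE /dev/$vg/$lv"        # grow the LV over the new space
    echo "xfs_growfs $mnt"                           # grow the mounted filesystem
    echo "xfs_info $mnt"                             # inspect the resulting geometry
}

# Assumed names; the alignment must be verified against the controller first.
grow_plan /dev/sdc db mysql /var/lib/mysql 1M
```

Note that no su/sw values appear anywhere in the plan, which is the
point: with identical arrays the existing values still apply.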

>> assuming you actually mean 1MB STRIP above, not 1MB stripe.  If you
> Stripesize 1MB

You're telling us something you have not verified, which cannot possibly
be correct.  Log into the RAID controller firmware and confirm this, or,
if this is mdraid, run mdadm --detail /dev/mdX.  It's simply not possible
to have a 1MB stripe with 3 devices, because 1,048,576 / 3 = 349,525.3
bytes, which is not a whole number.

>> actually mean 1MB hardware RAID stripe, then the controller would have
>> most likely made a 768KB stripe with 256KB strip, as 1MB isn't divisible
>> by 3.  Thus you've told LVM to ship 1MB writes to a device expecting
>> 256KB writes.  In that case you've already horribly misaligned your LVM
>> volumes to the hardware stripe, and everything is FUBAR already.  You
>> probably want to verify all of your strip/stripe configuration before
>> moving forward.

> I don't believe you're correct here.  

This is due to your current lack of knowledge of the technology.

> The SSD Erase Block size for the
> drives we're using is 1MB.   

The erase block size of the SSD device is irrelevant.  What is relevant
is how the RAID controller (or mdraid) is configured.

> Why does being divisible by 3 matter?  Because
> of the number of drives?  

Because the data is written in strips, one device at a time.  Those
strips must be divisible by 512 (hardware sector size) and/or 4,096
(filesystem block size).  349,525.3 is not divisible by either.
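
The arithmetic can be checked with a few lines of shell.  This is only a
sketch of the divisibility argument above; strip_is_valid and check are
hypothetical helper names, and the 768KB case is the geometry guessed at
earlier in the thread, not a confirmed controller setting.

```shell
# Can a full stripe of stripe_bytes be split evenly across the data
# spindles into strips that are whole multiples of unit bytes?
strip_is_valid() {
    stripe_bytes=$1; spindles=$2; unit=$3
    [ $((stripe_bytes % spindles)) -eq 0 ] || return 1
    strip=$((stripe_bytes / spindles))
    [ $((strip % unit)) -eq 0 ]
}

# Print a one-line verdict for a given stripe size, spindle count and unit.
check() {
    if strip_is_valid "$1" "$2" "$3"; then
        echo "stripe=$1 spindles=$2 unit=$3: ok, strip=$(( $1 / $2 ))"
    else
        echo "stripe=$1 spindles=$2 unit=$3: not possible"
    fi
}

check 1048576 3 512    # 1MB stripe over 3 spindles
check 786432 3 4096    # 768KB stripe over 3 spindles, i.e. a 256KB strip
```

The first call reports "not possible" and the second reports a strip of
262144 bytes (256KB).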

> Nowhere online have I seen anything about a
> 768KB+256KB stripe.  All the performance info I've seen point to it being
> the fastest.  I'm sure that wouldn't be the case if the controller had to
> deal with two stripes.

Maybe you've not explained things correctly.  You said you had a 6 drive
RAID10, and then added a 4 drive RAID10.  That means the controller is
most definitely dealing with two stripes.  And, from a performance
perspective, to a modern controller, two RAID sets are actually better
than one, as multiple cores in the RAID ASIC come into play.

> So essentially, my take-away here is that xfs_growfs doesn't work properly
> when adding more logical raid drives?   What kind of a performance hit am I
> looking at if sw is wrong?  How about this.  If I know that the maximum
> number of drives I can add is say 20 in a RAID-10.  Can I format with sw=10
> (even though sw should be 3) in the eventual expectation of expanding it?
>  What would be the downside of doing that?

xfs_growfs works properly and gives the performance one seeks when the
underlying storage layers have been designed and configured properly.
Yours haven't, though it'll probably work well enough given your
workload.  In your situation, your storage devices are SSDs, with
20-50K+ IOPS and 300-500MB/s throughput, and your application is a large
database, which means random unaligned writes and random reads, a
workload that doesn't require striping for performance with SSDs.  So
you might have designed the storage more optimally for your workload,
something like this:

1.  Create 3 hardware RAID1 mirrors in the controller, then concatenate
    them into a single device
2.  Create your XFS filesystem atop the device with no alignment

There is no need for LVM unless you need snapshot capability.  Say this
works fine, and later you need to expand storage space.  You simply add
your 4 SSDs to the controller, creating two more RAID1s, and add them to
the concatenation.  Since you've simply made the disk device bigger from
Linux's point of view, all you do now is xfs_growfs and you're done.  No
alignment issues to fret over, and your performance will be no worse
than before, maybe better, as you'll get all the cores in the RAID ASIC
into play with so many RAID1s.

Now, I'm going to guess that since you mentioned "colo provider" that
you may not have access to the actual RAID configuration, or
that possibly they are giving you "cloud" storage, not direct attached
SSDs.  In this case you absolutely want to avoid specifying alignment,
because the RAID information they are providing you is probably not
accurate.  Which is probably why you told me "1MB stripe" twice, when we
all know for a fact that's impossible.
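
One way to see what geometry the kernel itself believes it has been given
is to read the I/O size hints under sysfs.  This is a hedged sketch:
io_hints is a made-up helper, SYSFS_ROOT is an override added here only so
the function can be tried without real hardware, and sdb is the device
name used earlier in the thread.  Hardware RAID controllers and virtual
disks often report zeros or defaults, so treat the numbers as a hint, not
as confirmation of the controller configuration.

```shell
# Print the I/O size hints the block layer exposes for a device.
# SYSFS_ROOT defaults to /sys/block and exists only to allow testing.
io_hints() {
    dev=$1
    root=${SYSFS_ROOT:-/sys/block}
    for f in minimum_io_size optimal_io_size; do
        path="$root/$dev/queue/$f"
        if [ -r "$path" ]; then
            echo "$f=$(cat "$path")"
        else
            echo "$f=unavailable"
        fi
    done
}

io_hints sdb
```

If both values come back as 0, 512 or 4096, the device is advertising no
stripe geometry at all, and guessing su/sw values is worse than leaving
them out.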

