
Re: Using xfs_growfs on SSD raid-10

To: stan@xxxxxxxxxxxxxxxxx
Subject: Re: Using xfs_growfs on SSD raid-10
From: Alexey Zilber <alexeyzilber@xxxxxxxxx>
Date: Thu, 10 Jan 2013 15:19:12 +0800
Cc: xfs@xxxxxxxxxxx

Hi Stan,

   Thanks for the details btw, really appreciate it.  Responses inline below:

> Only the sw=3 is no longer valid, correct?  There's no way to add sw=5?

You don't have a 5 spindle array.  You have a 3 spindle array and a 2
spindle array.  Block striping occurs WITHIN an array, not ACROSS arrays.

That's correct, but I was going by the description of the sw option as "number of data disks", which keeps increasing as you add disks.  I realize that block striping occurs independently within each array, but I do not know how that translates into parity with the way xfs works with the logical disks.  How badly does alignment get messed up when you have sw=3 but you have 6 disks?  Or vice versa, if you specify sw=6 but only have 3 disks?

>> 1.  Mount with "noalign", but that only affects data, not journal writes
> Is "noalign" the default when no sw/su option is specified at all?

No.  When no alignment options are specified during creation, neither
journal nor data writes are aligned.  This mount option acts only on
data writes.

>> 2.  Forget LVM and use the old tried and true UNIX method of expansion:
>>  create a new XFS on the new array and simply mount it at a suitable
>> place in the tree.
> Not a possible solution.  The space is for a database and must be
> contiguous.

So you have a single file that is many hundreds of GBs in size, is that
correct?  If this is the case then XFS alignment does not matter.
Writes to existing files are unaligned.  Alignment is used during
allocation only.  And since you're working with a single large file you
will likely have few journal writes, so again, alignment isn't a problem.

*** Thus, you're fretting over nothing, as alignment doesn't matter for
your workload. ***

Since we've established this, everything below is for your education,
and doesn't necessarily pertain to your expansion project.

It's mostly correct.  We're using mysql with innodb_file_per_table, so there's maybe a hundred files or so in a few directories, some quite big.
I'm guessing though that that's still not going to be a huge issue.  I've actually been using both LVM and XFS on ssd raid without aligning for a while on a few other databases and the performance has been exceptional.  I've decided for this round to go deeper into actual alignment to see if I can get extra performance/life out of the drives.

>> 3.  Add 2 SSDs to the new array and rebuild it as a 6 drive RAID10 to
>> match the current array.  This would be the obvious and preferred path,
> How is this the obvious and preferred path when I still can't modify the sw
> value?

It is precisely because you don't specify new su/sw values in this
situation, because they haven't changed:  you now have two identical
arrays glued together with LVM.  XFS writes to its allocation groups.
Each AG exists on only one RAID array or the other (when setup
properly).  It is the allocation group that is aligned to the RAID, not
the "entire XFS filesystem" as you might see it.  Thus, since both
arrays have identical su/sw, nothing changes.  When you grow the XFS, it
simply creates new AGs in the space on the new RAID, and the new AGs are
properly aligned automatically, as both arrays are identical.
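The point that each AG should sit on only one array can be sketched with a simplified linear model. This is only an illustration: the function name and the sizes are made up, and the model ignores LVM extents, metadata, and how XFS actually sizes its AGs.

```python
def ag_placement(ag_size, ag_count, array_sizes):
    """Return, for each AG, the index of the array that fully contains it,
    or None if the AG straddles the boundary between two arrays."""
    # Build the [start, end) byte range each array occupies in the concatenation.
    boundaries = []
    start = 0
    for size in array_sizes:
        boundaries.append((start, start + size))
        start += size

    placement = []
    for ag in range(ag_count):
        lo, hi = ag * ag_size, (ag + 1) * ag_size
        owner = None
        for idx, (b_lo, b_hi) in enumerate(boundaries):
            if b_lo <= lo and hi <= b_hi:
                owner = idx
                break
        placement.append(owner)
    return placement


# Hypothetical sizes (arbitrary units): two arrays of 1200 units each.
print(ag_placement(300, 8, [1200, 1200]))  # every AG lands on exactly one array
print(ag_placement(500, 4, [1200, 1200]))  # the third AG straddles the boundary
```

In this model an AG size that divides the array size evenly keeps every AG on a single array, while one that does not leaves an AG spanning both.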

But the two arrays are not identical, i.e. the second array has fewer (or more) drives, for example.  How does the sw value affect it then?


>> assuming you actually mean 1MB STRIP above, not 1MB stripe.  If you
> Stripesize 1MB

You're telling us something you have not verified, which cannot possibly
be correct.  Log into the RAID controller firmware and confirm this, or
do an mdadm --examine /dev/mdX.  It's simply not possible to have a 1MB
stripe with 3 devices, because 1,048,576 / 3 = 349,525.3

# /usr/StorMan/arcconf getconfig 1 LD
Logical device number 1
   Logical device name                      : RAID10-B
   RAID level                               : 10
   Status of logical device                 : Optimal
   Size                                     : 1142774 MB
   Stripe-unit size                         : 1024 KB
   Read-cache mode                          : Enabled
   MaxCache preferred read cache setting    : Disabled
   MaxCache read cache setting              : Disabled
   Write-cache mode                         : Enabled (write-back)
   Write-cache setting                      : Enabled (write-back) when protected by battery/ZMM
   Partitioned                              : No
   Protected by Hot-Spare                   : No
   Bootable                                 : No
   Failed stripes                           : No
   Power settings                           : Disabled
   Logical device segment information
   Group 0, Segment 0                       : Present (Controller:1,Enclosure:0,Slot:2)             FG001MMV
   Group 0, Segment 1                       : Present (Controller:1,Enclosure:0,Slot:3)             FG001MNW
   Group 1, Segment 0                       : Present (Controller:1,Enclosure:0,Slot:4)             FG001MMT
   Group 1, Segment 1                       : Present (Controller:1,Enclosure:0,Slot:5)             FG001MNY
   Group 2, Segment 0                       : Present (Controller:1,Enclosure:0,Slot:6)             FG001DH8
   Group 2, Segment 1                       : Present (Controller:1,Enclosure:0,Slot:7)             FG001DKK

The controller itself is: Adaptec 5805Z

"The second important parameter is the stripe size of the array, sometimes also referred to by terms such as block size, chunk size, stripe length or granularity. This term refers to the size of the stripes written to each disk. RAID arrays that stripe in blocks typically allow the selection of block sizes in kiB ranging from 2 kiB to 512 kiB (or even higher) in powers of two (meaning 2 kiB, 4 kiB, 8 kiB and so on.) Byte-level striping (as in RAID 3) uses a stripe size of one byte or perhaps a small number like 512, usually not selectable by the user."

So they're talking about powers of 2, not powers of 3. 1MB would definitely work then.   
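For reference, the relationship between the per-disk strip and the full stripe can be sketched in a few lines of Python. This assumes the "Stripe-unit size" reported by arcconf is the per-disk strip (chunk), and that the 6-drive RAID10 stripes across 3 mirror pairs; the helper name is made up for illustration.

```python
def full_stripe_kib(strip_kib: int, data_spindles: int) -> int:
    """Full stripe width = per-disk strip (chunk) x number of data spindles."""
    return strip_kib * data_spindles


# Assumption: "Stripe-unit size: 1024 KB" is the per-disk strip, and a
# 6-drive RAID10 has 3 data-bearing mirror pairs.
strip_kib = 1024
data_spindles = 6 // 2

stripe_kib = full_stripe_kib(strip_kib, data_spindles)
print(f"strip = {strip_kib} KiB, full stripe = {stripe_kib} KiB")

# In mkfs.xfs terms, su is the strip and sw is the number of data spindles.
print(f"su={strip_kib}k,sw={data_spindles}")
```

Under those assumptions the 1MB figure is the strip, and the full stripe across the three mirror pairs is 3MB.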

>> actually mean 1MB hardware RAID stripe, then the controller would have
>> most likely made a 768KB stripe with 256KB strip, as 1MB isn't divisible
>> by 3.  Thus you've told LVM to ship 1MB writes to a device expecting
>> 256KB writes.  In that case you've already horribly misaligned your LVM
>> volumes to the hardware stripe, and everything is FUBAR already.  You
>> probably want to verify all of your strip/stripe configuration before
>> moving forward.

> I don't believe you're correct here.

This is due to your current lack of knowledge of the technology.

Please educate me then.  Where can I find more information showing that stripes are calculated by a power of 3?  The article above references powers of 2.

> The SSD Erase Block size for the
> drives we're using is 1MB.

The erase block size of the SSD device is irrelevant.  What is relevant
is how the RAID controller (or mdraid) is configured.

Right, but say that 1MB stripe is a single stripe.  I'm guessing it would fit within a single erase block?  Or should I just use 512k stripes to be safe?


> Why does being divisible by 3 matter?  Because
> of the number of drives?

Because the data is written in strips, one device at a time.  Those
strips must be divisible by 512 (hardware sector size) and/or 4,096
(filesystem block size).  349,525.3 is not divisible by either.
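The divisibility argument above can be checked with a short sketch. It assumes the stricter reading, that a strip must be a whole multiple of both the 512-byte sector and the 4,096-byte filesystem block; the function name is hypothetical.

```python
SECTOR = 512      # hardware sector size in bytes
FS_BLOCK = 4096   # filesystem block size in bytes


def strip_is_valid(stripe_bytes: int, data_spindles: int) -> bool:
    """True if the stripe splits evenly into strips that are whole
    multiples of both the sector size and the filesystem block size."""
    strip, remainder = divmod(stripe_bytes, data_spindles)
    return remainder == 0 and strip % SECTOR == 0 and strip % FS_BLOCK == 0


# A 1 MiB full stripe over 3 spindles does not divide evenly.
print(strip_is_valid(1024 * 1024, 3))

# A 768 KiB full stripe over 3 spindles gives 256 KiB strips.
print(strip_is_valid(768 * 1024, 3))
```

The first call reports that a 1MB full stripe cannot be split across 3 spindles, and the second that 768KB (3 x 256KB) can.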

> Nowhere online have I seen anything about a
> 768KB+256KB stripe.  All the performance info I've seen point to it being
> the fastest.  I'm sure that wouldn't be the case if the controller had to
> deal with two stripes.

Maybe you've not explained things correctly.  You said you had a 6 drive
RAID10, and then added a 4 drive RAID10.  That means the controller is
most definitely dealing with two stripes.  And, from a performance
perspective, to a modern controller, two RAID sets is actually better
than one, as multiple cores in the RAID ASIC come into play.

Ok, so with the original 6 drives, if it's a raid 10, that would give 3 mirrors to stripe into 1 logical drive.
Is this where the power of 3 comes from?

> So essentially, my take-away here is that xfs_growfs doesn't work properly
> when adding more logical raid drives?   What kind of a performance hit am I
> looking at if sw is wrong?  How about this.  If I know that the maximum
> number of drives I can add is say 20 in a RAID-10.  Can I format with sw=10
> (even though sw should be 3) in the eventual expectation of expanding it?
>  What would be the downside of doing that?

xfs_growfs works properly and gives the performance one seeks when the
underlying storage layers have been designed and configured properly.
Yours hasn't, though it'll probably work well enough given your
workload.  In your situation, your storage devices are SSDs, with
20-50K+ IOPS and 300-500MB/s throughput, and your application is a large
database, which means random unaligned writes and random reads, a
workload that doesn't require striping for performance with SSDs.  So
you might have designed the storage more optimally for your workload,
something like this:

1.  Create 3 hardware RAID1 mirrors in the controller then concatenate
2.  Create your XFS filesystem atop the device with no alignment

There is no need for LVM unless you need snapshot capability.  This
works fine and now you need to expand storage space.  So you simply add
your 4 SSDs to the controller, creating two more RAID1s and add them to
the concatenation.  Since you've simply made the disk device bigger from
Linux' point of view, all you do now is xfs_growfs and you're done.  No
alignment issues to fret over, and your performance will be no worse
than before, maybe better, as you'll get all the cores in the RAID ASIC
into play with so many RAID1s.

Now, I'm going to guess that since you mentioned "colo provider" that
you may not actually have access to the actual RAID configuration, or
that possibly they are giving you "cloud" storage, not direct attached
SSDs.  In this case you absolutely want to avoid specifying alignment,
because the RAID information they are providing you is probably not
accurate.  Which is probably why you told me "1MB stripe" twice, when we
all know for a fact that's impossible.

That's a good idea, though I would use LVM for the concatenation.  I just don't trust the hardware to concatenate existing disks to more disks, I'd rather leave that up to LVM to handle, AND be able to take snapshots, etc.

Thanks Stan, very informative!


