Using xfs_growfs on SSD raid-10

Alexey Zilber alexeyzilber at gmail.com
Thu Jan 10 01:19:12 CST 2013


Hi Stan,

   Thanks for the details btw, really appreciate it.  Responses inline
below:

> > Only the sw=3 is no longer valid, correct?  There's no way to add sw=5?
>
> You don't have a 5 spindle array.  You have a 3 spindle array and a 2
> spindle array.  Block striping occurs WITHIN an array, not ACROSS arrays.
>

That's correct, but I was going by the description of the sw option as
"number of data disks", which keeps increasing as you add disks.  I
realize that block striping occurs independently within each array, but I
don't know how that maps to the way XFS aligns to the logical disks.  How
badly does alignment get messed up when you have sw=3 but 6 data disks?
Or vice versa, if you specify sw=6 but you only have 3 disks?
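
Just so we're on the same page, my understanding is that these values get
baked in at mkfs time.  For the current 6-drive RAID10 (3 data spindles,
1MB strip) I'd expect the creation and check to look roughly like this
(the LV and mount paths are just placeholders):

# mkfs.xfs -d su=1024k,sw=3 /dev/vg_db/lv_data
# xfs_info /var/lib/mysql

with xfs_info reporting sunit=256 and swidth=768 in 4k filesystem blocks,
if I've done the conversion right.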


>
> >> 1.  Mount with "noalign", but that only affects data, not journal writes
> >
> > Is "noalign" the default when no sw/su option is specified at all?
>
> No.  When no alignment options are specified during creation, neither
> journal nor data writes are aligned.  This mount option acts only on
> data writes.
>
> >> 2.  Forget LVM and use the old tried and true UNIX method of expansion:
> >>  create a new XFS on the new array and simply mount it at a suitable
> >> place in the tree.
> >>
> >> Not a possible solution.  The space is for a database and must be
> > contiguous.
>
> So you have a single file that is many hundreds of GBs in size, is that
> correct?  If this is the case then XFS alignment does not matter.
> Writes to existing files are unaligned.  Alignment is used during
> allocation only.  And since you're working with a single large file you
> will likely have few journal writes, so again, alignment isn't a problem.
>
> *** Thus, you're fretting over nothing, as alignment doesn't matter for
> your workload. ***
>
> Since we've established this, everything below is for your education,
> and doesn't necessarily pertain to your expansion project.
>

That's mostly correct.  We're using MySQL with innodb_file_per_table, so
there are maybe a hundred files or so in a few directories, some quite
big.  I'm guessing that's still not going to be a huge issue.  I've
actually been using both LVM and XFS on SSD RAID without aligning for a
while on a few other databases, and the performance has been exceptional.
For this round I've decided to dig deeper into actual alignment to see if
I can get extra performance and life out of the drives.


>
> >> 3.  Add 2 SSDs to the new array and rebuild it as a 6 drive RAID10 to
> >> match the current array.  This would be the obvious and preferred path,
> >
> > How is this the obvious and preferred path when I still can't modify the
> sw
> > value?
>
> It is precisely because you don't specify new su/sw values in this
> situation, because they haven't changed:  you now have two identical
> arrays glued together with LVM.  XFS writes to its allocation groups.
> Each AG exists on only one RAID array or the other (when setup
> properly).  It is the allocation group that is aligned to the RAID, not
> the "entire XFS filesystem" as you might see it.  Thus, since both
> arrays have identical su/sw, nothing changes.  When you grow the XFS, it
> simply creates new AGs in the space on the new RAID, and the new AGs are
> properly aligned automatically, as both arrays are identical.
>

But the two arrays are not identical, i.e. the second array has fewer (or
more) drives, for example.  How does the sw value affect things then?
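
From what I can tell, the filesystem carries a single sunit/swidth pair,
so if the geometry of the second array really were different, the only
knob would be overriding it at mount time, e.g. (values in 512-byte
sectors, if I'm reading the docs right, so su=1024k,sw=3 becomes):

# mount -o sunit=2048,swidth=6144 /dev/vg_db/lv_data /var/lib/mysql

and even then that only changes the hint used for new allocations, not
data that's already on disk.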



>
> >> assuming you actually mean 1MB STRIP above, not 1MB stripe.  If you
> >
> > Stripesize 1MB
>
> You're telling us something you have not verified, which cannot possibly
> be correct.  Log into the RAID controller firmware and confirm this, or
> do an mdadm --examine /dev/mdX.  It's simply not possible to have a 1MB
> stripe with 3 devices, because 1,048,576 / 3 = 349,525.3
>

# /usr/StorMan/arcconf getconfig 1 LD
Logical device number 1
   Logical device name                      : RAID10-B
   RAID level                               : 10
   Status of logical device                 : Optimal
   Size                                     : 1142774 MB
   Stripe-unit size                         : 1024 KB
   Read-cache mode                          : Enabled
   MaxCache preferred read cache setting    : Disabled
   MaxCache read cache setting              : Disabled
   Write-cache mode                         : Enabled (write-back)
   Write-cache setting                      : Enabled (write-back) when protected by battery/ZMM
   Partitioned                              : No
   Protected by Hot-Spare                   : No
   Bootable                                 : No
   Failed stripes                           : No
   Power settings                           : Disabled
   --------------------------------------------------------
   Logical device segment information
   --------------------------------------------------------
   Group 0, Segment 0                       : Present (Controller:1,Enclosure:0,Slot:2)             FG001MMV
   Group 0, Segment 1                       : Present (Controller:1,Enclosure:0,Slot:3)             FG001MNW
   Group 1, Segment 0                       : Present (Controller:1,Enclosure:0,Slot:4)             FG001MMT
   Group 1, Segment 1                       : Present (Controller:1,Enclosure:0,Slot:5)             FG001MNY
   Group 2, Segment 0                       : Present (Controller:1,Enclosure:0,Slot:6)             FG001DH8
   Group 2, Segment 1                       : Present (Controller:1,Enclosure:0,Slot:7)             FG001DKK


The controller itself is an Adaptec 5805Z.

According to this article:
http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html

"The second important parameter is the *stripe size* of the array,
sometimes also referred to by terms such as *block size*, *chunk size*, *stripe
length *or *granularity*. This term refers to the size of the stripes
written to each disk. RAID arrays that stripe in blocks typically allow the
selection of block sizes in
kiB<http://www.pcguide.com/intro/fun/bindec.htm> ranging
from 2 kiB to 512 kiB (or even higher) in powers of two (meaning 2 kiB, 4
kiB, 8 kiB and so on.) Byte-level striping (as in RAID
3<http://www.pcguide.com/ref/hdd/perf/raid/levels/single_Level3.htm>)
uses a stripe size of one byte or perhaps a small number like 512, usually
not selectable by the user."

So they're talking about powers of 2, not divisibility by 3, and a 1MB
setting would definitely work then.
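
If I read the arcconf output right, the 1024 KB "Stripe-unit size" is the
per-disk strip (what mkfs.xfs calls su), so for the 6-drive RAID10 the
numbers would work out as:

   per-disk strip (su)   = 1024 KB
   data spindles (sw)    = 3
   full stripe width     = 3 x 1024 KB = 3072 KB

i.e. the power-of-two sizes the article talks about would be the per-disk
value, and the full stripe is just three times that.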



> >> actually mean 1MB hardware RAID stripe, then the controller would have
> >> most likely made a 768KB stripe with 256KB strip, as 1MB isn't divisible
> >> by 3.  Thus you've told LVM to ship 1MB writes to a device expecting
> >> 256KB writes.  In that case you've already horribly misaligned your LVM
> >> volumes to the hardware stripe, and everything is FUBAR already.  You
> >> probably want to verify all of your strip/stripe configuration before
> >> moving forward.
>
> > I don't believe you're correct here.
>
> This is due to your current lack of knowledge of the technology.
>

Please educate me then.  Where can I find more information saying that
the stripe size has to be divisible by the number of data disks (3 here)?
The article above only talks about powers of 2.


>
> > The SSD Erase Block size for the
> > drives we're using is 1MB.
>
> The erase block size of the SSD device is irrelevant.  What is relevant
> is how the RAID controller (or mdraid) is configured.
>

Right, but say the 1MB stripe unit is written to a single drive in one
go.  I'm guessing it would fit within a single erase block?  Or should I
just use 512k stripe units to be safe?
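
On that note, I'm also planning to make sure the LVM data area itself
starts on a 1MB boundary, so the metadata at the front of the PV doesn't
throw the erase-block math off.  Something like this, if I have the
options right (device name is just an example):

# pvcreate --dataalignment 1m /dev/sdb
# pvs -o +pe_start /dev/sdb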



>
> > Why does being divisible by 3 matter?  Because
> > of the number of drives?
>
> Because the data is written in strips, one device at a time.  Those
> strips must be divisible by 512 (hardware sector size) and/or 4,096
> (filesystem block size).  349,525.3 is not divisible by either.
>
> > Nowhere online have a seen anything about a
> > 768MB+256MB stripe.  All the performance info I've seen point to it being
> > the fastest.  I'm sure that wouldn't be the case if the controller had to
> > deal with two stripes.
>
> Maybe you've not explained things correctly.  You said you had a 6 drive
> RAID10, and then added a 4 drive RAID10.  That means the controller is
> most definitely dealing with two stripes.  And, from a performance
> perspective, to a modern controller, two RAID sets is actually better
> than one, as multiple cores in the RAID ASIC come into play.
>

OK, so with the original 6 drives, if it's a RAID 10, that gives 3 mirror
pairs striped into 1 logical drive.
Is this where the divisible-by-3 requirement comes from?


>
> > So essentially, my take-away here is that xfs_growfs doesn't work
> properly
> > when adding more logical raid drives?   What kind of a performance hit
> am I
> > looking at if sw is wrong?  How about this.  If I know that the maximum
> > number of drives I can add is say 20 in a RAID-10.  Can I format with
> sw=10
> > (even though sw should be 3) in the eventual expectation of expanding it?
> >  What would be the downside of doing that?
>
> xfs_growfs works properly and gives the performance one seeks when the
> underlying storage layers have been designed and configured properly.
> Yours hasn't, though it'll probably work well enough given your
> workload.  In your situation, your storage devices are SSDs, with
> 20-50K+ IOPS and 300-500MB/s throughput, and your application is a large
> database, which means random unaligned writes and random reads, a
> workload that doesn't require striping for performance with SSDs.  So
> you might have designed the storage more optimally for your workload,
> something like this:
>
> 1.  Create 3 hardware RAID1 mirrors in the controller then concatenate
> 2.  Create your XFS filesystem atop the device with no alignment
>
> There is no need for LVM unless you need snapshot capability.  This
> works fine and now you need to expand storage space.  So you simply add
> your 4 SSDs to the controller, creating two more RAID1s and add them to
> the concatenation.  Since you've simply made the disk device bigger from
> Linux' point of view, all you do now is xfs_growfs and you're done.  No
> alignment issues to fret over, and your performance will be no worse
> than before, maybe better, as you'll get all the cores in the RAID ASIC
> into play with so many RAID1s.
>
> Now, I'm going to guess that since you mentioned "colo provider" that
> you may not actually have access to the actual RAID configuration, or
> that possibly they are giving you "cloud" storage, not direct attached
> SSDs.  In this case you absolutely want to avoid specifying alignment,
> because the RAID information they are providing you is probably not
> accurate.  Which is probably why you told me "1MB stripe" twice, when we
> all know for a fact that's impossible.
>
>
That's a good idea, though I would use LVM for the concatenation.  I just
don't trust the hardware to concatenate an existing array with new disks;
I'd rather leave that to LVM, and also keep the ability to take
snapshots, etc.
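
For the record, the expansion I have in mind would look roughly like this
(VG/LV/device names are placeholders; the new 4-drive RAID10 shows up as
/dev/sdc in this sketch):

# pvcreate --dataalignment 1m /dev/sdc
# vgextend vg_db /dev/sdc
# lvextend -l +100%FREE /dev/vg_db/lv_data
# xfs_growfs /var/lib/mysql

Since the LV isn't striped, lvextend just appends the new PV's extents,
so this is effectively the concatenation you describe, and xfs_growfs
then creates new AGs in that space.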

Thanks Stan, very informative!

-Alex


> --
> Stan