On 1/10/2013 1:19 AM, Alexey Zilber wrote:
> Hi Stan,
> Thanks for the details btw, really appreciate it. Responses inline
>>> Only the sw=3 is no longer valid, correct? There's no way to add sw=5?
>> You don't have a 5 spindle array. You have a 3 spindle array and a 2
>> spindle array. Block striping occurs WITHIN an array, not ACROSS arrays.
> That's correct, but I was going with the description of the sw option as
> "number of data disks" which is constantly increasing as you're adding
That is correct. But "data disks" is in the context of a single striped
array. Remember, what you're doing here is aligning XFS write out to
the stripe size of the array.
> disks. I realize that block striping occurs independently within each
> array, but I do not know how that translates into parity with the way xfs
> works with the logical disks.
The takeaway here is this: If you concatenate two or more striped
arrays together with a single XFS atop, you must use identical arrays,
or at least arrays with the same overall stripe size. Otherwise you
cannot achieve or maintain alignment with XFS.
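
To make that rule concrete, here is a small Python sketch. It is only an illustration, not an XFS or LVM tool: the function names are made up for the example, and the 1024 KiB stripe unit on both arrays is an assumption. It compares the overall stripe size (su * sw) of each array in a concatenation.

```python
KIB = 1024

def stripe_size(su_bytes, sw):
    """Overall stripe size: stripe unit times number of data spindles."""
    return su_bytes * sw

def same_stripe_geometry(arrays):
    """True if every (su_bytes, sw) array has the same overall stripe size."""
    sizes = {stripe_size(su, sw) for su, sw in arrays}
    return len(sizes) <= 1

# Hypothetical arrays: 3 and 2 data spindles, assumed 1024 KiB stripe unit.
mismatched = [(1024 * KIB, 3), (1024 * KIB, 2)]
# Two identical 3-spindle arrays, as in the "rebuild to match" option.
identical = [(1024 * KIB, 3), (1024 * KIB, 3)]

print(same_stripe_geometry(mismatched))  # False: 3 MiB vs 2 MiB stripes
print(same_stripe_geometry(identical))   # True: one su/sw fits both
```

With the mismatched pair there is no single su/sw that describes both arrays, which is why alignment cannot be maintained across them.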
> How badly does alignment get messed up if
> you have sw=3 but you have 6 disks? Or vice versa, if you specify sw=6,
> but you only have 3 disks?
Instead of attempting to figure out how bad a wrong configuration is,
which is impossible, let's concentrate on how to do proper configurations.
> It's mostly correct. We're using mysql with innodb_file_per_table, so
> there's maybe a hundred files or so in a few directories, some quite big.
> I'm guessing though that that's still not going to be a huge issue. I've
> actually been using both LVM and XFS on ssd raid without aligning for a
> while on a few other databases and the performance has been exceptional.
With a database workload it's probably best to not align the XFS,
especially if you intend to expand storage the way you have been.
> I've decided for this round to go deeper into actual alignment to see if I
> can get extra performance/life out of the drives.
But you made this decision before knowing that XFS only does alignment
during allocation or writing to the journal. And your workload does no
aligned writes. So not only did you gain nothing by aligning here, but
you've caused yourself some grief. On the plus side, you're learning
much valuable information about RAID and XFS.
>>>> 3. Add 2 SSDs to the new array and rebuild it as a 6 drive RAID10 to
>>>> match the current array. This would be the obvious and preferred path,
>>> How is this the obvious and preferred path when I still can't modify the
>> It is precisely because you don't specify new su/sw values in this
>> situation, because they haven't changed: you now have two identical
>> arrays glued together with LVM. XFS writes to its allocation groups.
>> Each AG exists on only one RAID array or the other (when set up
>> properly). It is the allocation group that is aligned to the RAID, not
>> the "entire XFS filesystem" as you might see it. Thus, since both
>> arrays have identical su/sw, nothing changes. When you grow the XFS, it
>> simply creates new AGs in the space on the new RAID, and the new AGs are
>> properly aligned automatically, as both arrays are identical.
> But both arrays are not identical, i.e. the second array has fewer (or more)
> drives for example. How does the sw value affect it then?
Note my comments at the start above. What I'm doing here is
telling you how to do it correctly and fix the alignment problem you
have with your 3x and 2x spindle arrays. Again, I'm not going to
attempt to explain the negative effects of a wrong configuration,
especially given that it won't affect your non-allocation workload.
>>>> assuming you actually mean 1MB STRIP above, not 1MB stripe. If you
>>> Stripesize 1MB
>> You're telling us something you have not verified, which cannot possibly
>> be correct. Log into the RAID controller firmware and confirm this, or
>> do an mdadm --examine /dev/mdX. It's simply not possible to have a 1MB
>> stripe with 3 devices, because 1,048,576 / 3 = 349,525.3
> Stripe-unit size : 1024 KB
The problem here is a lack of understanding/use of terminology, and this
article you reference explains your misunderstanding. The author
doesn't understand the terminology himself. He's misusing "stripe size"
and I've never heard of "stripe length". He's probably using "stripe
length" for "stripe width". Obviously the author is not versed in the
SNIA RAID specifications, or those generally accepted/used by the
storage community. To clarify:
"Stripe unit" or "strip" or "chunk" is the portion that resides on a
single spindle. "Stripe" or "Stripe size" is (stripe unit * spindle
count), or in XFS terminology, (su * sw), or (stripe unit * stripe
width). You've apparently created your XFS correctly, but you used the
wrong terminology in your post, which is confusing. You keep saying
your "stripe size" is 1MB, when in fact it is 3MB. It is your "stripe
unit" that is 1MB. But again, for a database workload, it makes no
difference WRT performance.
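
As a rough illustration of the terms above, here is a short Python sketch. The class and field names are hypothetical, chosen only to mirror "su" and "sw"; the 1024 KB stripe unit and 3 spindles are the values from this thread.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Geometry:
    stripe_unit_kib: int  # "su": the strip/chunk on a single spindle
    stripe_width: int     # "sw": number of data spindles in the array

    @property
    def stripe_size_kib(self):
        """Stripe size = stripe unit * stripe width (su * sw)."""
        return self.stripe_unit_kib * self.stripe_width

g = Geometry(stripe_unit_kib=1024, stripe_width=3)
print(g.stripe_unit_kib)  # 1024 KiB: the "stripe unit" (1MB)
print(g.stripe_size_kib)  # 3072 KiB: the "stripe size" (3MB)
```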
> So they're talking about powers of 2, not powers of 3. 1MB would definitely
> work then.
Digest the explanation above and this should all be clear. Again, a 1MB
stripe size with 3 spindles is not possible. You have a 3MB stripe size
with 3 spindles, which is possible. "Stripe unit" and "stripe size" are
two different quantities/parameters, one being wholly contained within
the other.
>>>> actually mean 1MB hardware RAID stripe, then the controller would have
>>>> most likely made a 768KB stripe with 256KB strip, as 1MB isn't divisible
>>>> by 3. Thus you've told LVM to ship 1MB writes to a device expecting
>>>> 256KB writes. In that case you've already horribly misaligned your LVM
>>>> volumes to the hardware stripe, and everything is FUBAR already. You
>>>> probably want to verify all of your strip/stripe configuration before
>>>> moving forward.
>>> I don't believe you're correct here.
>> This is due to your current lack of knowledge of the technology.
> Please educate me then.
> Where can I find more information that stripes are
> calculated by a power of 3? The article above references power of 2.
It seems you confused "divisible by 3" and "power of 3".
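
The difference is easy to check with a few lines of Python. This is just a sketch of the arithmetic; the helper name is made up for the example.

```python
def is_power_of(n, base):
    """True if n equals base**k for some integer k >= 0."""
    if n < 1:
        return False
    while n % base == 0:
        n //= base
    return n == 1

one_mib = 1024 * 1024
stripe_768k = 768 * 1024

print(one_mib % 3)                  # 1: 1MB does not split evenly over 3 spindles
print(is_power_of(one_mib, 2))      # True: 1MB is a power of 2
print(stripe_768k % 3)              # 0: 768KB is divisible by 3
print(stripe_768k // 3 // 1024)     # 256: three 256KB strips
print(is_power_of(stripe_768k, 3))  # False: divisible by 3, not a power of 3
```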
>>> The SSD Erase Block size for the
>>> drives we're using is 1MB.
>> The erase block size of the SSD device is irrelevant. What is relevant
>> is how the RAID controller (or mdraid) is configured.
> Right, but say that 1MB stripe is a single stripe. I'm guessing it would
> fit within a single erase block? Or should I just use 512k stripes to be
Again you're saying "stripe size" when you actually mean "stripe unit".
With some workloads there can be an advantage to using a stripe unit
the same size as the erase block. But with a random write database
workload, where small records, far less than 1MB, are being rewritten in
place, or appended, then making your stripe unit the size of the erase
block gains you nothing.
> Ok, so with the original 6 drives, if it's a raid 10, that would give 3
> mirrors to stripe into 1 logical drive.
> Is this where the power of 3 comes from?
Again, not power of 3, but divisible by 3. This should be obvious.
>>> So essentially, my take-away here is that xfs_growfs doesn't work
>>> when adding more logical raid drives? What kind of a performance hit am I
>>> looking at if sw is wrong? How about this. If I know that the maximum
>>> number of drives I can add is say 20 in a RAID-10. Can I format with
>>> (even though sw should be 3) in the eventual expectation of expanding it?
>>> What would be the downside of doing that?
>> xfs_growfs works properly and gives the performance one seeks when the
>> underlying storage layers have been designed and configured properly.
>> Yours hasn't, though it'll probably work well enough given your
>> workload. In your situation, your storage devices are SSDs, with
>> 20-50K+ IOPS and 300-500MB/s throughput, and your application is a large
>> database, which means random unaligned writes and random reads, a
>> workload that doesn't require striping for performance with SSDs. So
>> you might have designed the storage more optimally for your workload,
>> something like this:
>> 1. Create 3 hardware RAID1 mirrors in the controller then concatenate them
>> 2. Create your XFS filesystem atop the device with no alignment
>> There is no need for LVM unless you need snapshot capability. This
>> works fine and now you need to expand storage space. So you simply add
>> your 4 SSDs to the controller, creating two more RAID1s and add them to
>> the concatenation. Since you've simply made the disk device bigger from
>> Linux' point of view, all you do now is xfs_growfs and you're done. No
>> alignment issues to fret over, and your performance will be no worse
>> than before, maybe better, as you'll get all the cores in the RAID ASIC
>> into play with so many RAID1s.
>> Now, I'm going to guess that since you mentioned "colo provider" that
>> you may not actually have access to the actual RAID configuration, or
>> that possibly they are giving you "cloud" storage, not direct attached
>> SSDs. In this case you absolutely want to avoid specifying alignment,
>> because the RAID information they are providing you is probably not
>> accurate. Which is probably why you told me "1MB stripe" twice, when we
>> all know for a fact that's impossible.
> That's a good idea, though I would use LVM for the concatenation. I just
> don't trust the hardware to concatenate existing disks to more disks, I'd
But you trust it to do striped RAID, which is far more complicated? And
your description is inaccurate. It works identically to md linear and
LVM concatenation. You make your RAID1 pairs, then add them to a volume
group. That volume group becomes your disk device. When you want to
expand, you simply add a new RAID1 pair to the volume group. I'm using
generic terminology here. Different RAID vendors use different terms.
Also, you do understand I'm talking about a fresh architecture here.
I'm not talking about concatenating anything onto arrays you already
have. With this you start from scratch.
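
A simplified way to picture a concatenation, as a Python sketch. This is not how any particular controller, md linear, or LVM is implemented; the function name and the device sizes (in blocks) are made up for illustration. Each RAID1 pair just contributes a contiguous range of LBAs, and adding a pair extends the range without moving anything already there.

```python
def locate(lba, device_sizes):
    """Map an LBA on a linear concatenation to (device_index, offset)."""
    if lba < 0:
        raise ValueError("LBA must be non-negative")
    start = 0
    for index, size in enumerate(device_sizes):
        if lba < start + size:
            return index, lba - start
        start += size
    raise ValueError("LBA beyond end of concatenated device")

pairs = [1000, 1000, 1000]   # three hypothetical RAID1 pairs
print(locate(2500, pairs))   # (2, 500)

pairs.append(1000)           # expand: add one more RAID1 pair
print(locate(2500, pairs))   # (2, 500): existing blocks stay put
print(locate(3500, pairs))   # (3, 500): new space lands on the new pair
```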
> rather leave that up to LVM to handle, AND be able to take snapshots, etc.
Just remember this is a concatenation. There is no striping, no special
block size, nothing. You're simply increasing the available LBA numbers
in a block device. There is nothing to tune nor optimize, no alignment,
WRT the disks, middle layers, or XFS. For maximum potential database
performance all you need to do is put the most frequently accessed files
in different directories. Once you understand the XFS allocation group
architecture and allocators you'll understand why. Start here:
> Thanks Stan, very informative!
You're welcome. I am a bit curious as to why you didn't simply grow the
original 6 disk RAID10 array into a 10 disk RAID10 array. Then all
you'd need to do is mount with the new alignment. The 5805 supports
only 8 disks without expanders. Do you have an expander, or do you have
multiple RAID cards in the box?
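
For reference, the mount-time alignment for such a 10 disk RAID10 could be worked out roughly as in the Python sketch below. It assumes the 1024 KB stripe unit reported earlier in this thread stays the same after the grow, and it relies on the XFS sunit/swidth mount options being given in 512-byte units. The function name is made up for the example.

```python
SECTOR = 512  # XFS sunit/swidth mount options are in 512-byte units

def xfs_mount_alignment(stripe_unit_bytes, data_spindles):
    """Build an 'sunit=...,swidth=...' mount option string."""
    if stripe_unit_bytes % SECTOR != 0:
        raise ValueError("stripe unit must be a multiple of 512 bytes")
    sunit = stripe_unit_bytes // SECTOR
    swidth = sunit * data_spindles
    return f"sunit={sunit},swidth={swidth}"

# A 10 disk RAID10 has 5 data spindles (each mirror pair counts once).
# The 1024 KiB stripe unit is an assumption taken from this thread.
print(xfs_mount_alignment(1024 * 1024, 10 // 2))  # sunit=2048,swidth=10240
```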