<div>Hi Stan,</div><div><br></div><div> Thanks for the details btw, really appreciate it. Responses inline below:</div><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb"><div class="h5">
>><br>
> Only the sw=3 is no longer valid, correct? There's no way to add sw=5?<br>
<br>
</div></div>You don't have a 5 spindle array. You have a 3 spindle array and a 2<br>
spindle array. Block striping occurs WITHIN an array, not ACROSS arrays.<br></blockquote><div><br></div><div>That's correct, but I was going with the description of the sw option as "number of data disks" which is constantly increasing as you're adding disks. I realize that block striping occurs independently within each array, but I do not know how that translates into parity with the way xfs works with the logical disks. How badly does alignment get messed up with you have sw=3 but you have 6 disks? Or vice/versa, if you specify sw=6, but you only have 3 disks?</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im"><br>
>> 1. Mount with "noalign", but that only affects data, not journal writes<br>
><br>
> Is "noalign" the default when no sw/su option is specified at all?<br>
<br>
</div>No. When no alignment options are specified during creation, neither<br>
journal nor data writes are aligned. This mount option acts only on<br>
data writes.<br>
<div class="im"><br>
>> 2. Forget LVM and use the old tried and true UNIX method of expansion:<br>
>> create a new XFS on the new array and simply mount it at a suitable<br>
>> place in the tree.<br>
>><br>
>> Not a possible solution. The space is for a database and must be<br>
> contiguous.<br>
<br>
</div>So you have a single file that is many hundreds of GBs in size, is that<br>
correct? If this is the case then XFS alignment does not matter.<br>
Writes to existing files are unaligned. Alignment is used during<br>
allocation only. And since you're working with a single large file you<br>
will likely have few journal writes, so again, alignment isn't a problem.<br>
<br>
*** Thus, you're fretting over nothing, as alignment doesn't matter for<br>
your workload. ***<br>
<br>
Since we've established this, everything below is for your education,<br>
and doesn't necessarily pertain to your expansion project.<br></blockquote><div><br></div><div>It's mostly correct. We're using mysql with innodb_file_per_table, so there's maybe a hundred files or so in a few directories, some quite big.</div>
<div>I'm guessing though that that's still not going to be a huge issue. I've actually been using both LVM and XFS on ssd raid without aligning for a while on a few other databases and the performance has been exceptional. I've decided for this round to go deeper into actual alignment to see if I can get extra performance/life out of the drives.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im"><br>
>> 3. Add 2 SSDs to the new array and rebuild it as a 6 drive RAID10 to<br>
>> match the current array. This would be the obvious and preferred path,<br>
><br>
> How is this the obvious and preferred path when I still can't modify the sw<br>
> value?<br>
<br>
</div>It is precisely because you don't specify new su/sw values in this<br>
situation, because they haven't changed: you now have two identical<br>
arrays glued together with LVM. XFS writes to its allocation groups.<br>
Each AG exists on only one RAID array or the other (when setup<br>
properly). It is the allocation group that is aligned to the RAID, not<br>
the "entire XFS filesystem" as you might see it. Thus, since both<br>
arrays have identical su/sw, nothing changes. When you grow the XFS, it<br>
simply creates new AGs in the space on the new RAID, and the new AGs are<br>
properly aligned automatically, as both arrays are identical.<br></blockquote><div><br></div><div>But both arrays are not identical, ie. the second array has less (or more) drives for example. How does the sw value affect it then?</div>
<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im"><br>
>> assuming you actually mean 1MB STRIP above, not 1MB stripe. If you<br>
><br>
> Stripesize 1MB<br>
<br>
</div>You're telling us something you have not verified, which cannot possibly<br>
be correct. Log into the RAID controller firmware and confirm this, or<br>
do an mdadm --examine /dev/mdX. It's simply not possible to have a 1MB<br>
stripe with 3 devices, because 1,048,576 / 3 = 349,525.3<br></blockquote><div><br></div><div># /usr/StorMan/arcconf getconfig 1 LD</div><div><div>Logical device number 1</div><div> Logical device name : RAID10-B</div>
<div> RAID level : 10</div><div> Status of logical device : Optimal</div><div> Size : 1142774 MB</div><div> Stripe-unit size : 1024 KB</div>
<div> Read-cache mode : Enabled</div><div> MaxCache preferred read cache setting : Disabled</div><div> MaxCache read cache setting : Disabled</div><div> Write-cache mode : Enabled (write-back)</div>
<div> Write-cache setting : Enabled (write-back) when protected by battery/ZMM</div><div> Partitioned : No</div><div> Protected by Hot-Spare : No</div>
<div> Bootable : No</div><div> Failed stripes : No</div><div> Power settings : Disabled</div><div> --------------------------------------------------------</div>
<div> Logical device segment information</div><div> --------------------------------------------------------</div><div> Group 0, Segment 0 : Present (Controller:1,Enclosure:0,Slot:2) FG001MMV</div>
<div> Group 0, Segment 1 : Present (Controller:1,Enclosure:0,Slot:3) FG001MNW</div><div> Group 1, Segment 0 : Present (Controller:1,Enclosure:0,Slot:4) FG001MMT</div>
<div> Group 1, Segment 1 : Present (Controller:1,Enclosure:0,Slot:5) FG001MNY</div><div> Group 2, Segment 0 : Present (Controller:1,Enclosure:0,Slot:6) FG001DH8</div>
<div> Group 2, Segment 1 : Present (Controller:1,Enclosure:0,Slot:7) FG001DKK</div></div><div><br></div><div><br></div><div>The controller itself is: Adaptec 5805Z</div><div><br></div>
<div>
According to this article: <a href="http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html">http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html</a></div><div><br></div><div>"<span style="font-family:'Times New Roman';font-size:medium">The second important parameter is the </span><em style="font-family:'Times New Roman';font-size:medium">stripe size</em><span style="font-family:'Times New Roman';font-size:medium"> of the array, sometimes also referred to by terms such as </span><em style="font-family:'Times New Roman';font-size:medium">block size</em><span style="font-family:'Times New Roman';font-size:medium">, </span><em style="font-family:'Times New Roman';font-size:medium">chunk size</em><span style="font-family:'Times New Roman';font-size:medium">, </span><em style="font-family:'Times New Roman';font-size:medium">stripe length </em><span style="font-family:'Times New Roman';font-size:medium">or </span><em style="font-family:'Times New Roman';font-size:medium">granularity</em><span style="font-family:'Times New Roman';font-size:medium">. This term refers to the size of the stripes written to each disk. RAID arrays that stripe in blocks typically allow the selection of block sizes in </span><a href="http://www.pcguide.com/intro/fun/bindec.htm" style="font-family:'Times New Roman';font-size:medium">kiB</a><span style="font-family:'Times New Roman';font-size:medium"> ranging from 2 kiB to 512 kiB (or even higher) in powers of two (meaning 2 kiB, 4 kiB, 8 kiB and so on.) Byte-level striping (as in </span><a href="http://www.pcguide.com/ref/hdd/perf/raid/levels/single_Level3.htm" style="font-family:'Times New Roman';font-size:medium">RAID 3</a><span style="font-family:'Times New Roman';font-size:medium">) uses a stripe size of one byte or perhaps a small number like 512, usually not selectable by the user."</span></div>
<div><br></div><div>So they're talking about powers of 2, not powers of 3. 1MB would definitely work then. </div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im"><br>
>> actually mean 1MB hardware RAID stripe, then the controller would have<br>
>> most likely made a 768KB stripe with 256KB strip, as 1MB isn't divisible<br>
>> by 3. Thus you've told LVM to ship 1MB writes to a device expecting<br>
>> 256KB writes. In that case you've already horribly misaligned your LVM<br>
>> volumes to the hardware stripe, and everything is FUBAR already. You<br>
>> probably want to verify all of your strip/stripe configuration before<br>
>> moving forward.<br>
<br>
> I don't believe you're correct here.<br>
<br>
</div>This is due to your current lack of knowledge of the technology.<br></blockquote><div><br></div><div>Please educate me then. Where can I find more information that stripes are calculated by a power of 3? The article above references power of 2.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im"><br>
> The SSD Erase Block size for the<br>
> drives we're using is 1MB.<br>
<br>
</div>The erase block size of the SSD device is irrelevant. What is relevant<br>
is how the RAID controller (or mdraid) is configured.<br></blockquote><div><br></div><div>Right, but say that 1MB stripe is a single stripe. I'm guessing it would fit within a single erase block? Or should I just use 512k stripes to be safe?</div>
<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im"><br>
> Why does being divisible by 3 matter? Because<br>
> of the number of drives?<br>
<br>
</div>Because the data is written in strips, one device at a time. Those<br>
strips must be divisible by 512 (hardware sector size) and/or 4,096<br>
(filesystem block size). 349,525.3 is not divisible by either.<br>
<div class="im"><br>
> Nowhere online have a seen anything about a<br>
> 768MB+256MB stripe. All the performance info I've seen point to it being<br>
> the fastest. I'm sure that wouldn't be the case if the controller had to<br>
> deal with two stripes.<br>
<br>
</div>Maybe you've not explained things correctly. You said you had a 6 drive<br>
RAID10, and then added a 4 drive RAID10. That means the controller is<br>
most definitely dealing with two stripes. And, from a performance<br>
perspective, to a modern controller, two RAID sets is actually better<br>
than one, as multiple cores in the RAID ASIC come into play.<br></blockquote><div><br></div><div>Ok, so with the original 6 drives, if it's a raid 10, that would give 3 mirrors to stripe into 1 logical drive.</div><div>
Is this where the power of 3 comes from?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im"><br>
> So essentially, my take-away here is that xfs_growfs doesn't work properly<br>
> when adding more logical raid drives? What kind of a performance hit am I<br>
> looking at if sw is wrong? How about this. If I know that the maximum<br>
> number of drives I can add is say 20 in a RAID-10. Can I format with sw=10<br>
> (even though sw should be 3) in the eventual expectation of expanding it?<br>
> What would be the downside of doing that?<br>
<br>
</div>xfs_growfs works properly and gives the performance one seeks when the<br>
underlying storage layers have been designed and configured properly.<br>
Yours hasn't, though it'll probably work well enough given your<br>
workload. In your situation, your storage devices are SSDs, with<br>
20-50K+ IOPS and 300-500MB/s throughput, and your application is a large<br>
database, which means random unaligned writes and random reads, a<br>
workload that doesn't require striping for performance with SSDs. So<br>
you might have designed the storage more optimally for your workload,<br>
something like this:<br>
<br>
1. Create 3 hardware RAID1 mirrors in the controller then concatenate<br>
2. Create your XFS filesystem atop the device with no alignment<br>
<br>
There is no need for LVM unless you need snapshot capability. This<br>
works fine and now you need to expand storage space. So you simply add<br>
your 4 SSDs to the controller, creating two more RAID1s and add them to<br>
the concatenation. Since you've simply made the disk device bigger from<br>
Linux' point of view, all you do now is xfs_growfs and you're done. No<br>
alignment issues to fret over, and your performance will be no worse<br>
than before, maybe better, as you'll get all the cores in the RAID ASIC<br>
into play with so many RAID1s.<br>
<br>
Now, I'm going to guess that since you mentioned "colo provider" that<br>
you may not actually have access to the actual RAID configuration, or<br>
that possibly they are giving you "cloud" storage, not direct attached<br>
SSDs. In this case you absolutely want to avoid specifying alignment,<br>
because the RAID information they are providing you is probably not<br>
accurate. Which is probably why you told me "1MB stripe" twice, when we<br>
all know for a fact that's impossible.<br>
<span class="HOEnZb"><font color="#888888"><br></font></span></blockquote><div><br></div><div>That's a good idea, though I would use LVM for the concatenation. I just don't trust the hardware to concatenate existing disks to more disks, I'd rather leave that up to LVM to handle, AND be able to take snapshots, etc.</div>
<div><br></div><div>Thanks Stan, very informative!</div><div><br></div><div>-Alex </div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class="HOEnZb"><font color="#888888">
--<br>
Stan<br>
<br>
</font></span></blockquote></div><br>