
Re: RAID setups, usage, Q's' effect of spindle groups...etc...blah blahb

To: Linda Walsh <xfs@xxxxxxxxx>
Subject: Re: RAID setups, usage, Q's' effect of spindle groups...etc...blah blahblah...
From: Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx>
Date: Mon, 21 Jan 2013 07:38:09 -0600
Cc: xfs-oss <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <50FCB3B8.5000404@xxxxxxxxx>
References: <50FAF860.3000702@xxxxxxxxx> <20130119231644.GX2498@dastard> <50FB3265.8060506@xxxxxxxxx> <20130120004638.GZ2498@dastard> <50FBB9D4.9060109@xxxxxxxxxxxxxxxxx> <50FCB3B8.5000404@xxxxxxxxx>
Reply-to: stan@xxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20130107 Thunderbird/17.0.2

On 1/20/2013 9:19 PM, Linda Walsh wrote:

> 2) Downloaded+online media+SW. RAID5: 4-data spindles using 2tB(1.819TB)
> Hitachi Ultrastar 7.2K SATA's (note, the disks in #3 & #4 are the same
> type).
> 3) Main data+devel disk: RAID50 12-data spindles in 3 groups of 4. 
> NOTE: I tried
> and benched RAID60 but wasn't happy with the performance, 

The 2108 ROC ASIC in the 9280 doesn't have sufficient horsepower for
good performance with dual parity arrays, but that pales in comparison
to the performance drop due to the RMW induced seek latency.

> not to mention
> the diskspace
> hit RAID10 would be a bit too decadent for my usage/budget.

When one perceives the capacity overhead of RAID1/10 as an intolerable
cost, instead of a benefit, one is forever destined to suffer from the
poor performance of parity RAID schemes.

> On #3 currently using 12.31tB in 20 partitions
> Note -- I generally Like the RAID50's
> Cards, 1 internal: Dell Perc 6/i (serving #1 & #2 above -- all internal)
> 1 LSI MR9280DE-8e (serving #3+4)
> 2 Enclosures LSI-DE1600-SAS (12x3.5" ea)

So you have 24x 2TB 7.2K SATA drives total in two 630Js, correct?

>     3 years from now?  Ha!.  Lets just say that with the dollar dropping
> as fast as disk prices over the past 4 years has flamboozled any normal
> planning.

I feel ya.

>     I was mostly interested in how increasing number of spindles
> in a Raid50 would help parallelism.
> My thoughts on that
> was that since each member of a RAID0, can be read or written independently
> of any other member (as there is no parity to check), that IF I wanted to
> increase parallelism (while hurting maximum throughput AND disk space), I
> **could** reconfigure to .. well extreme would be 5 groups of 2-data/3disk
> RAID5's.  That would, I think, theoretically  (and if the controller is
> up to
> it, which I think it is), allow *up_to* 5 separate reads/writes to be
> served
> in parallel, vs. now, I think it should be 3.
>     It was, I thought, a fairly simply question, but I have a history
> of sometimes thinking things will be easier than they are proportional to
> how far away (in future or someone else doing it! ;-)) something is...

The answer is simple too:  Parity RAID sucks.  If you want anything more
than a trivial increase in performance, you need to ditch parity RAID.
Given the time and effort involved in rearranging all of your disks to
get one or two more RAID5 arrays with fewer disks per array into a
RAID50, it doesn't make sense to do so when you can simply create one
large RAID10, and be done monkeying around and second guessing.  You'll
have the performance you're seeking.  Actually far, far more.

>     My **GENERAL** plan if prices had cooperated was to move
> to 3TB SATA's and **mabye** a 3rd enclosure -- I sorta like the LSI ones..

Using 3TB (or larger) drives simply increases the probability of losing
an entire RAID5 array if you ever have to rebuild a dead drive.  See the
linux-raid archives for the past few days for a good discussion on the

LSI sold the Engenio division to NetApp in early 2011, only a ~year
after they started selling the 620/630 in the channel--very short
product life.  The DE1600 (LSI 630J) 12 bay enclosure you have is still
available from NetApp but I doubt you'd want to pay NetApp's price, if
they'd even sell it bare without drives, or without a mandatory service

> they seem pretty solid.   Have tried a few others and generally found them
> not as good, but have looked on the economical side since this is for
> a home office^h^h^h^h^h^hlab^h^h^hplay setup....

Norco's units have a decent rep, especially given the price point.  They
use LSI's SAS expander ASIC:

No dual hot swap PSUs or expander module slots compared to the LSI, but
probably 2:1 lower price.  I didn't list the 12 bay unit as it's only
$230 less, $1369 vs $1139.

>     Consider this -- my max read and write (both), on my
> large array is 1GB/s.  There's no way I could get that with a RAID10 setup
> without a much larger number of disks.  

On the contrary.  The same disks in RAID10 will walk all over your
RAID50 setup.  Let's discuss practical use and performance instead of
peak optimums shall we?  Note that immediately below I'm simply
educating you, not recommending a 12 drive RAID10.  Recommendations come

In this one array you have 12 drives, 3x 4 drive RAID5 arrays in RAID50,
for 9 effective data spindles.  An equivalent 12 drive RAID10 would
yield 6 data spindles.
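
The spindle counts above can be checked with a few lines of Python.  This
is only a sketch: the function name and arguments are made up for
illustration, and it assumes one drive's worth of parity per RAID5 group
and plain two-way mirrors for RAID10.

```python
def data_spindles(level: str, drives: int, group_size: int = 0) -> int:
    """Return how many drives' worth of data an array holds (hypothetical helper)."""
    if level == "raid10":
        # Two-way mirrors: half of the drives carry unique data.
        if drives < 2 or drives % 2:
            raise ValueError("RAID10 needs an even number of drives")
        return drives // 2
    if level == "raid50":
        # Each RAID5 group gives up one drive's worth of space to parity.
        if group_size < 3 or drives % group_size:
            raise ValueError("drives must split evenly into RAID5 groups of 3+")
        groups = drives // group_size
        return drives - groups
    raise ValueError(f"unsupported level: {level}")


raid50 = data_spindles("raid50", drives=12, group_size=4)
raid10 = data_spindles("raid10", drives=12)
ratio = raid50 / raid10
print(raid50, raid10, ratio)  # 9 6 1.5
```

The 9:6 ratio is where a best-case streaming advantage of roughly 50% for
the RAID50 would come from.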

For a pure streaming read workload with all drives evenly in play, the
RAID50 might be ~50% faster.  For a purely random read workload about
the same, although in both cases 50x or more slower than the streaming
read case due to random seeks.

With a pure streaming allocation write workload with perfect stripe
filling, no RMW, the RAID50 will be faster, but less than the 50% above
due to parity calcs in the ASIC.

Now it gets interesting.  With a purely random write non aligned non
allocation workload on the RAID50, RMW cycles will abound driving seek
latency through the roof, while the ASIC is performing a parity calc on
each stripe update.  Throughput here will be in the low tens of MB/s,
tops.  RAID10 simply writes each sector--done.  Throughput will
be in the high tens to 100s of MB/s.  So in this scenario RAID10 will be
anywhere from 5-10x or more faster depending on the distribution of the
writes across the drives.  Another factor here is that RMW reads from
the disks go into the LSI cache for parity recalculation, eating cache
bandwidth and capacity, decreasing the writeback efficiency.  With
RAID10 you get full cache bandwidth for sinking incoming writes and
performing flush scheduling, both being extremely important for random
write workloads.

Food for thought:  A random write workload of ~500MB with RAID10 will
complete almost instantly after the controller cache consumes it.  With
RAID50 you have to go through the hundreds or thousands of RMW cycles on
the disks, so the same operation will take many minutes.

Let's look at more real world scenarios.  Take your example of the
nightly background processes kicking in.  This constitutes a mixed
random read and write workload.  In this situation every RMW can create
3 seeks per drive write:  read, write, parity write.  Now you have a
seek for a pending read operation, making 4 seeks.  But the problem
isn't just the seeks, it is the inter-seek latency due to the slow 7.2K
RPM platters having to spin under the head for the next read or write.
This scenario makes scheduling in the controller and the drives
themselves very difficult, adding more latency.

With RAID10 in this scenario, you simply have write/read/write/read/etc.
 You're not performing 2 extra seeks for each write, so you're not
incurring that latency between operations, nor the scheduling
complexity, thus driving throughput much higher.  In this scenario, the
6 spindle RAID10 may be 10x to as much as 50x faster than the RAID50
depending on the access/seek patterns.

I've obviously not covered this in much technical detail as storage
behavior is quite complex.  I've attempted to give you a high level
overview of the behavioral differences between parity and non parity
RAID, and the potential performance differences with various workloads,
and the differences between "peak" performance and actual performance.

While your RAID50 may have greater theoretical peak streaming
performance, the RAID10 will typically run circles around it
with most day-to-day mixed IO workloads.  While the RAID50 may have a
peak throughput of ~1GB/s, it may only attain that 1-10% of the time.
The RAID10 may have a peak throughput of "only" ~700MB/s, but will likely
achieve that more than 60% of the time.  And as a result its performance
degradation will be much more graceful with concurrent workloads due to
the dramatically lower IO completion latencies.

> Though I admit, concurrency would
> rise... but I generate most of my workload, so usually I don't have
> too many things going on at the same time... a few maybe...

But I'd guess it's at times like this when you bog down the RAID50 with
mixed workloads and become annoyed.  You typically don't see that with
the non-parity arrays.

>     When an xfs_fsr kicks in and starts swallowing disk-cache, *ahem*,
> and the daily backup kicks in, AND the daily 'rsync' to create a static
> snapshot... things can slow down a bit.. but rare am I up at those hours...

And this is one scenario where the RAID10 would run circles around the

>     The most intensive is the xfs_fsr, partly due to it swallowing
> up disk cache (it runs at nice -19 ionice -c3, and I can still feel it!)...

As Dave and possibly other devs have stated, cron'ing xfs_fsr is not
recommended.  While it defrags your files it fragments free space.
Fragmented free space tends to kill performance more than file
fragmentation.  It also puts extra wear & tear on your drives,
especially when using parity RAID due to the things mentioned above.

>     I might play more with putting it in it's own blkio cgroup.
> and just limiting the overall disk transactions...(not to mention
> fixing that disk-buffer usage issue)...

On the off chance that xfs_fsr has completion timers or some such, I'd
ask around before doing that.  Sufficiently limiting its IO rate may
have unintended consequences.

>>  You'll need more drives to  maintain the same usable capacity,
> ---
> (oh, a minor detail! ;^))...

Well how much space do you really need in a one person development
operation plus home media/etc storage system?  10TB, 24TB, 48TB?

Assuming you have both 630Js filled with 24x 2TB drives, that's 48TB
raw.  If you have 6x 4 drive RAID5s in multiple RAID50 spans, you have
18x 2TB = 36TB of capacity.  Your largest array is 12 drives with 9
effective spindles of throughput.  You've split up your arrays for
different functions, limiting some workloads to fewer spindles of
performance, and having spindles sit idle that could otherwise be
actively adding performance to active workloads.  You've created
partitions directly on the array disk devices and have various LVM
devices and filesystems on those for various purposes, again limiting
some filesystems to less performance than your total spindles can give you.
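
The capacity arithmetic for the current layout, as a small Python sketch.
It assumes, as above, that both enclosures are full of nominal 2TB drives
arranged as six 4-drive RAID5 groups; the helper name is hypothetical.

```python
DRIVE_TB = 2  # nominal drive size assumed throughout this thread


def raid5_usable_tb(drives_per_group: int, groups: int, drive_tb: int = DRIVE_TB) -> int:
    """Usable TB across several equal RAID5 groups (one parity drive each)."""
    if drives_per_group < 3:
        raise ValueError("RAID5 needs at least 3 drives per group")
    return (drives_per_group - 1) * groups * drive_tb


total_drives = 24
raw_tb = total_drives * DRIVE_TB                           # 48
usable_tb = raid5_usable_tb(drives_per_group=4, groups=6)  # 36
parity_tb = raw_tb - usable_tb                             # 12
print(raw_tb, usable_tb, parity_tb)  # 48 36 12
```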

The change I recommend you consider is to do something similar to what
we do with SAN storage consolidation.  Create a single large spindle
count non-parity array on the LSI.  In this case that would be a 24
drive RAID10 with a strip (sunit) of 32KB, yielding a stripe width
(swidth) of 384KB, which should work very well with all of your
filesystems and workloads, giving a good probability of full stripe
writes.  You'd have ~24TB of usable space.  All of your workloads would
have 12 spindles of non-parity performance, peak streaming read/write of
~1.4GB/s, and random read/write mixed workload throughput of a few
hundred MB/s, simply stomping what you have now.  You'd be very hard
pressed to bog down this 12 spindle non-parity array.  Making a
conservative guesstimate, I'd say the mixed random IO throughput would
be on the order of 30x-50x that of your current RAID5/50 arrays combined.
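
A rough Python sketch of the geometry for that 24 drive RAID10.  The
helper is hypothetical, and /dev/sdX is a placeholder for whatever device
the controller exports.  With hardware RAID the stripe geometry generally
has to be given to mkfs.xfs by hand, which is what the su/sw values are
for; treat the generated command as something to check against the
mkfs.xfs man page, not to paste blindly.

```python
def raid10_geometry(drives: int, strip_kb: int, drive_tb: int) -> dict:
    """Stripe geometry and usable space for a two-way-mirrored RAID10."""
    if drives < 2 or drives % 2:
        raise ValueError("RAID10 needs an even number of drives")
    spindles = drives // 2  # one data spindle per mirror pair
    return {
        "data_spindles": spindles,
        "swidth_kb": spindles * strip_kb,
        "usable_tb": spindles * drive_tb,
        # mkfs.xfs sunit/swidth are expressed in 512-byte sectors.
        "sunit_sectors": strip_kb * 1024 // 512,
        "swidth_sectors": spindles * strip_kb * 1024 // 512,
    }


geom = raid10_geometry(drives=24, strip_kb=32, drive_tb=2)
# su is the strip size, sw the number of data spindles.
mkfs_cmd = f"mkfs.xfs -d su=32k,sw={geom['data_spindles']} /dev/sdX"
print(geom)
print(mkfs_cmd)  # mkfs.xfs -d su=32k,sw=12 /dev/sdX
```

This reproduces the numbers above: 12 data spindles, a 384KB stripe
width, and ~24TB usable.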

In summary, you'd gain a staggering performance increase you simply
wouldn't have considered possible with your current hardware.  You'd
"sacrifice" 12TB of your 36TB of usable space to achieve it.  That 30-50x
increase in random IOPS is exactly why many folks gladly "waste money on
extra drives".  After you see the dramatic performance increase you'll
wonder why you ever considered spending money on high RPM SAS drives to
reduce RAID5 latency.  Put these 24 7.2K SATA drives in this RAID10 up
against 24 15K SAS drives in a 6x4 RAID50.  Your big slow Hitachis will
best the nimble SAS 15ks in random IOPS, probably by a wide margin.
Simply due to RMW.  Yes, RMW will hammer 15K drives that much.  RMW
hammers all spinning rust, everything but SSDs.

> Don't spend much time on this.. (well if you read it, that might be too
> much
> already! ;-))... As I said it's not THAT important...and was mostly about
> the effect of groups in a RAID50 relating to performance tradeoffs.

Optimizing the spindle count of constituent RAID5s in a RAID50 to gain
performance is akin to a downhill skier manically waxing his skis every
day, hoping to shave 2 seconds off a 2 minute course.

> Thanks for any insights...(I'm always open to learning how wrong I am!
> ;-))...

If nothing else I hopefully got the point across as to how destructive
parity RAID read-modify-write operations are to performance.  It's
simply impossible to get good mixed IO performance from parity RAID
unless one's workloads always fit in controller write cache, or if one
has SSD storage.

