
To: Arkadiusz Miśkiewicz <arekm@xxxxxxxx>
Subject: md-RAID5/6 stripe_cache_size default value vs performance vs memory footprint
From: Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx>
Date: Sat, 21 Dec 2013 05:18:42 -0600
Cc: linux-raid@xxxxxxxxxxxxxxx, "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <201312202343.47895.arekm@xxxxxxxx>
References: <52B102FF.8040404@xxxxxxxxxxx> <52B2FE9E.50307@xxxxxxxxxxxxxxxxx> <52B41B67.9030308@xxxxxxxxxxx> <201312202343.47895.arekm@xxxxxxxx>
Reply-to: stan@xxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
I renamed the subject as your question doesn't really apply to XFS, or
the OP, but to md-RAID.

On 12/20/2013 4:43 PM, Arkadiusz Miśkiewicz wrote:

> I wonder why the kernel is giving defaults that everyone repeatedly 
> recommends to change/increase? Has anyone tried to file a bug report for 
> the stripe_cache_size case?

The answer is balancing default md-RAID5/6 write performance against
kernel RAM consumption, with more weight given to the latter.  The formula:

(4096 * stripe_cache_size) * num_drives = RAM consumed for stripe cache

High stripe_cache_size values will cause the kernel to eat non-trivial
amounts of RAM for the stripe cache buffer.  This table demonstrates the
effect today for typical RAID5/6 drive counts.

stripe_cache_size       drives  RAM consumed
256                      4        4 MB
                         8        8 MB
                        16       16 MB
512                      4        8 MB
                         8       16 MB
                        16       32 MB
1024                     4       16 MB
                         8       32 MB
                        16       64 MB
2048                     4       32 MB
                         8       64 MB
                        16      128 MB
4096                     4       64 MB
                         8      128 MB
                        16      256 MB
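The table rows fall straight out of the formula, since each cache entry
is one 4KiB page per member device.  A quick sanity check of one row
(stripe_cache_size=1024, 8 drives) in shell arithmetic:

```shell
#!/bin/sh
# Recompute one row of the table: 1024 cache entries, 8 member drives.
# Each entry costs one page (4096 bytes) per drive.
ram_bytes=$((4096 * 1024 * 8))
echo "$((ram_bytes / 1024 / 1024)) MB"   # prints "32 MB"
```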

The powers that be, Linus in particular, are not fond of default
settings that create a lot of kernel memory structures.  The default
md-RAID5/6 stripe_cache_size yields 1MB consumed per member device.

With SSDs becoming mainstream, and ever faster, at some point the
md-RAID5/6 architecture will have to be redesigned because of the
memory footprint required for performance.  Currently the required size
of the stripe cache appears to be directly proportional to the aggregate
write throughput of the RAID devices, so the optimal value will vary
greatly from one system to another depending on the throughput of the
drives.

For example, I assisted a user with 5x Intel SSDs back in January whose
system required a stripe_cache_size of 4096, i.e. 80MB of RAM for stripe
cache, to reach the maximum write throughput of the devices.  This
yielded 600MB/s, 60% greater throughput than with 2048 (40MB of RAM for
cache).  In his case the extra 75MB of RAM over the 5MB default was well
worth it, as the machine was an iSCSI target server with 8GB RAM.
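For anyone wanting to experiment, the knob lives in sysfs.  A minimal
sketch, assuming the array is /dev/md0 (adjust the device name for your
system; the write needs root and does not persist across reboot):

```shell
#!/bin/sh
# Inspect and raise stripe_cache_size for an md array.
# /dev/md0 is an assumed example device, not from the thread above.
md=md0
sysfs=/sys/block/$md/md/stripe_cache_size

cat "$sysfs"            # show current value (the default is 256)
echo 4096 > "$sysfs"    # raise to 4096 entries, per the SSD example
```

To make the value survive a reboot, the same echo typically goes in a
boot script or udev rule, since md itself does not store it.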

In the previous case, with 5x rust in RAID6, the value 2048 seemed
optimal (though not yet verified), requiring 40MB less RAM than the 5x
Intel SSDs.  For a 3-drive modern rust RAID5 the default of 256, or 3MB,
is close to optimal but maybe a little low.  Consider that 256 has been
the default for a very long time: it was selected back when average
drive throughput was much, much lower, as in 50MB/s or less, SSDs hadn't
yet been invented, and system memories were much smaller.

Due to the massive difference in throughput between rust and SSD, any
meaningful change to the default would really require new code to sniff
out what type of devices constitute the array (if that's even possible,
and it probably isn't) and set a lowish default accordingly.  Again,
SSDs didn't exist when md-RAID was coded, nor when this default was set,
and that throws a big monkey wrench into the spokes.

-- 
Stan
