
Re: How to deal with XFS stripe geometry mismatch with hardware RAID5

To: Linux fs XFS <xfs@xxxxxxxxxxx>
Subject: Re: How to deal with XFS stripe geometry mismatch with hardware RAID5
From: pg_xf2@xxxxxxxxxxxxxxxxxx (Peter Grandi)
Date: Wed, 14 Mar 2012 22:48:36 +0000
In-reply-to: <33504048.post@xxxxxxxxxxxxxxx>
References: <33498437.post@xxxxxxxxxxxxxxx> <4F605877.2030304@xxxxxxxxxxxxxxxxx> <33504048.post@xxxxxxxxxxxxxxx>
>>>> I have a 30TB XFS filesystem created on CentOS 5.4 X86_64,
>>>> kernel 2.6.39, using xfsprogs 2.9.4. The underlying hardware
>>>> is 12 3TB SATA drives on a Dell PERC 700 controller with 1GB
>>>> cache. [ ... ]

>>>> [ ... ] set up the RAID BIOS to use a small stripe element
>>>> of 8KB per drive, [ ... ] The filesystem contains a MongoDB
>>>> installation consisting of roughly 13000 2GB files which are
>>>> already allocated. The application is almost exclusively
>>>> inserting data, there are no updates, and files are written
>>>> pretty much sequentially. [ ... ]

How many of the 13,000 are being written at roughly the same
time? Because if you are logging 100K to each of them all the
time, that is a heavily random access workload. Each file may be
written sequentially, but the *disk* would be subject to a storm
of seeks.

>>>> When I set up the fstab entry I believed that it would
>>>> inherit the stripe geometry automatically, however now I
>>>> understand that is not the case with XFS version 2.

'mkfs.xfs' asks the kernel about drive geometry. If the kernel
could read it off the PERC 700 it would have been fine. The
kernel can easily read geometry off MD etc. RAID sets because the
relevant info is already in the system state.
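
As an illustration of what "asking the kernel" can look like: on
reasonably recent kernels block devices export I/O size hints under
/sys/block/<dev>/queue/. The sketch below reads 'minimum_io_size'
and 'optimal_io_size' from there. The device name 'sdb' and the
helper names are hypothetical, and whether a particular 'mkfs.xfs'
version consults these values is not something the sketch shows.

```python
from pathlib import Path


def read_io_hints(device, sysfs_root="/sys/block"):
    """Return the kernel's I/O size hints for a block device, in bytes.

    A missing file yields None rather than an error, since older
    kernels do not export these attributes.
    """
    queue = Path(sysfs_root) / device / "queue"
    hints = {}
    for name in ("minimum_io_size", "optimal_io_size"):
        path = queue / name
        hints[name] = int(path.read_text().strip()) if path.exists() else None
    return hints


def looks_striped(hints):
    """True only if the device advertises an optimal I/O size larger
    than its minimum I/O size, i.e. something resembling chunk/stripe."""
    minimum = hints.get("minimum_io_size")
    optimal = hints.get("optimal_io_size")
    return bool(minimum) and bool(optimal) and optimal > minimum


if __name__ == "__main__":
    hints = read_io_hints("sdb")  # 'sdb' is a placeholder device name
    print(hints, "striped:", looks_striped(hints))
```

If the host adapter does not pass its geometry through, the second
value will typically not be usable and the geometry has to be given
by hand.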

>>>> What I'm seeing now is average request sizes which are about
>>>> 100KB, half the stripe size.

But writes from what to what? From Linux to the PERC 700 cache or
from the PERC 700 cache to the RAID set drives?

>>>> With a typical write volume around 5MB per second I am
>>>> getting wait times around 50ms, which appears to be
>>>> degrading performance. [ ... ]

5MB per second in aggregate is hardly worth worrying about.
What do the 50ms mean as wait times? Again, it matters a great
deal whether it is Linux->PERC or PERC->drives.

If you have barriers enabled, and the MongoDB is 'fsync'ing every
100K, then 100K will be the transaction size.

Also, with a 100K append size, and 5MB/s aggregate, you are doing
50 transactions/s and it matters a great deal whether all are to
the same file, sequentially, or each is to a different file, etc.
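
The arithmetic behind that figure, as a small sketch (decimal units
assumed, the function name is made up for illustration):

```python
def transactions_per_second(aggregate_bytes_per_s, append_bytes):
    """Number of appends per second needed to sustain the aggregate rate."""
    return aggregate_bytes_per_s / append_bytes


# 5MB/s aggregate, 100K per append, both taken from the thread.
rate = transactions_per_second(5_000_000, 100_000)
print(rate)  # 50.0
```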

>>>> [ ... ] Is there a danger of filesystem corruption if I give
>>>> fstab a mount geometry that doesn't match the values used at
>>>> filesystem creation time?

No, those values are purely advisory.

>>>> I'm unclear on the role of the RAID hardware cache in
>>>> this. Since the writes are sequential, and since the volume
>>>> of data written is such that it would take about 3 minutes
>>>> to actually fill the RAID cache, I would think the data
>>>> would be resident in the cache long enough to assemble a
>>>> full-width stripe at the hardware level and avoid the 4 I/O
>>>> RAID5 penalty.

Sure, if the cache is configured right and barriers are not
invoked every 100KiB.
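
For scale, a rough sketch of the numbers involved. It assumes the
12 drives are arranged as 11 data + 1 parity, takes the 8KB stripe
element and 1GB cache from the quoted description, and treats them
as binary units; the function names are only for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def full_stripe_bytes(chunk_bytes, data_drives):
    """Amount of data that makes up one full-width stripe."""
    return chunk_bytes * data_drives


def seconds_to_fill(cache_bytes, write_bytes_per_s):
    """Time to fill the cache if nothing were destaged meanwhile."""
    return cache_bytes / write_bytes_per_s


stripe = full_stripe_bytes(8 * KIB, 11)       # 90112 bytes, i.e. 88KiB
fill = seconds_to_fill(1024 * MIB, 5 * MIB)   # 204.8 seconds

print(stripe // KIB, "KiB per full stripe")
print(round(fill / 60, 1), "minutes to fill the cache")
```

So the "about 3 minutes" estimate is consistent, and a full stripe
is small compared with what the cache can hold.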

[ ... ]

> Mongo pre-allocates its datafiles and zero-fills them (there is
> a short header at the start of each, not rewritten as far as I
> know) and then writes to them sequentially, wrapping around
> when it hits the end.

Preallocating is good.

> In this case the entire load is inserts, no updates, hence the
> sequential writes.

So it is not random access, if it is a log-like operation. If it
is a lot of 100K appends, things look a lot better.

> [ ... ] The BBU is functioning and the cache is set to
> write-back.

That's good. Check whether XFS has barriers enabled. It might also
help to make sure that the host adapter really knows the geometry
of the RAID set and, if there is a parameter for how much
unwritten data to buffer, to set it high, to maximize the chances
that it will do what it should and issue whole-stripe writes.
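
One way to check the barrier setting is to look at the mount options
in /proc/mounts. The sketch below assumes that an XFS mount with
barriers turned off shows 'nobarrier' there and that the absence of
that option means barriers are on (the default on kernels of this
era); the function name is hypothetical.

```python
def xfs_barrier_status(mounts_text):
    """Map each XFS mount point to True (barriers on) or False (off).

    Expects text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    status = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "xfs":
            continue
        options = fields[3].split(",")
        status[fields[1]] = "nobarrier" not in options
    return status


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print(xfs_barrier_status(f.read()))
```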

> [ ... ] Flushing is done about every 30 seconds and takes
> about 8 seconds.

I usually prefer nearly continuous flushing (at the Linux level
too, and in particular there), in part to avoid the 8s pauses, even
if that partly defeats the XFS delayed allocation logic.

However there is a contradiction here between seeing 100K
transactions and flushing taking 8s times a write rate of 5MB/s,
every 30s. The latter would imply 40MB of writes every 30s.
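
Spelled out as a sketch, using only the figures quoted above and
decimal units (the variable names are just for illustration):

```python
write_rate = 5_000_000      # bytes/s, aggregate write rate
flush_seconds = 8           # reported duration of each flush
transaction = 100_000       # bytes, observed request size

# Data moved during one flush if it proceeds at the average write rate.
flush_bytes = write_rate * flush_seconds
# How many 100K requests that one flush would correspond to.
requests_per_flush = flush_bytes // transaction

print(flush_bytes // 1_000_000, "MB per flush")
print(requests_per_flush, "requests of 100K per flush")
```

A flush of that size is hard to square with the application
committing every 100K.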

> One thing I'm wondering is whether the incorrect stripe
> structure I specified with mkfs

Probably the incorrect stripe structure here is mostly not
that important; there are bigger factors at play.

> is actually written into the file system structure or
> effectively just a hint to the kernel for what to use for a
> write size.

The stripe parameters have static and dynamic effects:

    - The metadata allocator attempts to interleave metadata at
      chunk ('sunit') boundaries to parallelize access.
    - The data allocator attempts to allocate extents on stripe
      ('swidth') aligned boundaries to maximize the chances of
      doing stripe aligned IO.
    These allocations are aligned according to the stripe
    parameters current when the metadata and data extents were
    allocated.

    - The block IO bottom end attempts to generate bulk IO
      requests aligned on stripe boundaries.
    These requests are aligned according to the stripe
    parameters current at the moment the IO occurs. The metadata
    and data extents may well have been allocated with alignment
    different from that on which IO requests are aligned.

> If not, could I specify the correct stripe width in the mount
> options and override the incorrect width used by mkfs?

Sure, but the space already allocated is already on the "wrong"
boundaries, even if XFS supposedly will try to issue IOs on the
as-mounted stripe alignment.
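
A sketch of how the mount option values could be worked out. The
'sunit' and 'swidth' mount options are given in 512-byte units. The
11 data drives are an assumption (12 drives as 11+1 RAID5), the 8KB
chunk comes from the quoted description, and the function name is
made up.

```python
SECTOR = 512  # sunit/swidth mount options are in 512-byte units


def xfs_mount_geometry(chunk_bytes, data_drives):
    """Return the 'sunit=...,swidth=...' mount option string."""
    if chunk_bytes % SECTOR != 0:
        raise ValueError("chunk size must be a multiple of 512 bytes")
    sunit = chunk_bytes // SECTOR
    swidth = sunit * data_drives  # swidth is a whole multiple of sunit
    return f"sunit={sunit},swidth={swidth}"


print(xfs_mount_geometry(8 * 1024, 11))  # sunit=16,swidth=176
```

The resulting string would go in the options field of the fstab
entry, alongside whatever options are already there.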

> Since the current average write size is only about half the
> specified stripe size, and since I'm not using md or xfs v.3
> it seems the kernel is ignoring it for now.

All the kernel does is to upload a bunch of blocks to the PERC,
and all the RAID optimization is done by the PERC.

> The choice of RAID5 was a compromise due to the need to store
> 30TB of data on each of 2 systems (a master and a replicated
> slave) - we couldn't afford that much space on our SAN for this
> application, but we could afford a 12-bay system with 3TB SATA
> drives.

Still an 11+1 RAID5 is a brave option to take.

> My hope was that since the write pattern was expected to be
> large sequential writes with no updates that the RAID5 penalty
> would not be significant.

That was a reasonable hope, but 11+1 RAID5 has other downsides.

> And it's quite possible that would be the case if I had got the
> stripe width right.

Uh, I suspect that stripe alignment here is not that important.
That 50ms after 100k sounds much much worse than RMW. On 15k
drives 50ms are about 4-6 seek times, which is way more than RMW
would take.

> The 8K element size was chosen because the actual average
> request size I was seeing on previous installations of the
> database was around 60K, which is still smaller than the stripe
> width over 12 drives even using 8K.

That is not necessarily the right logic, but for bulk sequential
transfers a small chunk size is a good idea. In general, other
things being equal, the smaller the chunk and the stripe size the
better.

> I did try btrfs early on to take advantage of compression, but
> it failed. This was about six months ago, though.

"failed" sounds a bit strange, and note that BTRFS has much
larger overheads than other filesystems. But your application
seems ideal for XFS. Instead of using some weird kernel like
2.6.39 with EL5, you might want to try an "official" EL5 kernel
like the Oracle 2.6.32 one, or switch to EL6/CentOS6.
