Re: Question regarding XFS on LVM over hardware RAID.

To: stan <stan@xxxxxxxxxxxxxxxxx>
Subject: Re: Question regarding XFS on LVM over hardware RAID.
From: "C. Morgan Hamill" <chamill@xxxxxxxxxxxx>
Date: Fri, 31 Jan 2014 16:14:31 -0500
Cc: Dave Chinner <david@xxxxxxxxxxxxx>, xfs <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <52EB3B96.7000103@xxxxxxxxxxxxxxxxx>
References: <1391005406-sup-1881@xxxxxxxxxxxxxxx> <52E91923.4070706@xxxxxxxxxxx> <1391022066-sup-5863@xxxxxxxxxxxxxxx> <52E99504.4030902@xxxxxxxxxxxxxxxxx> <1391090527-sup-4664@xxxxxxxxxxxxxxx> <20140130202819.GO2212@dastard> <52EB3B96.7000103@xxxxxxxxxxxxxxxxx>
User-agent: Sup/git
Excerpts from Stan Hoeppner's message of 2014-01-31 00:58:46 -0500:
> RAID60 is a nested RAID level just like RAID10 and RAID50.  It is a
> stripe, or RAID0, across multiple primary array types, RAID6 in this
> case.  The stripe width of each 'inner' RAID6 becomes the stripe unit of
> the 'outer' RAID0 array:
> RAID6 geometry     128KB * 12 = 1536KB
> RAID0 geometry  1536KB * 3  = 4608KB
> If you are creating your RAID60 array with a proprietary hardware
> RAID/SAN management utility it may not be clearly showing you the
> resulting nested geometry I've demonstrated above, which is correct for
> your RAID60.
> It is possible with software RAID to continue nesting stripe upon stripe
> to build infinitely large nested arrays.  It is not practical to do so
> for many reasons, but I'll not express those here as it is out of scope
> for this discussion.  I am simply attempting to explain how nested RAID
> levels are constructed.
> > So optimised for sequential IO. The time-honoured method of setting
> > up XFS for this if the workload is large files is to use a stripe
> > unit that is equal to the width of the underlying RAID6 volumes with
> > a stripe width of 3. That way XFS tries to align files to the start
> > of each RAID6 volume, and allocate in full RAID6 stripe chunks. This
> > mostly avoids RMW cycles for large files and sequential IO. i.e. su
> > = 1536k, sw = 3.

Makes perfect sense.
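
For reference, the nested geometry arithmetic quoted above can be written out as a small shell sketch. It assumes the figures given in this thread (128KB chunk, 12 data disks per RAID6, 3 RAID6 arrays under the RAID0); the helper name stripe_width_kb is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper; the numbers are the ones quoted in this thread.

# Stripe width of one array = stripe unit (KB) * number of data members.
stripe_width_kb() {
    echo $(( $1 * $2 ))
}

# Inner RAID6: 128KB chunk across 12 data disks.
raid6_width=$(stripe_width_kb 128 12)

# Outer RAID0: the RAID6 stripe width becomes the stripe unit, across 3 arrays.
raid60_width=$(stripe_width_kb "$raid6_width" 3)

echo "su=${raid6_width}k sw=3 (full stripe ${raid60_width}KB)"
```

With these inputs the script prints su=1536k sw=3 and a full stripe of 4608KB, matching the values Dave and Stan give.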

> As Dave demonstrates, your hardware geometry is 1536*3=4608KB.  Thus,
> when you create your logical volumes they each need to start and end on
> a 4608KB boundary, and be evenly divisible by 4608KB.  This will ensure
> that all of your logical volumes are aligned to the RAID60 geometry.
> When formatting the LVs with XFS you will use:
> ~# mkfs.xfs -d su=1536k,sw=3 /dev/[lv_device_path]
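
The same geometry shows up in other units elsewhere: the sunit/swidth forms used by mkfs.xfs and the mount options are expressed in 512-byte sectors, while xfs_info reports them in filesystem blocks. A rough conversion sketch follows, assuming a 4096-byte filesystem block size (an assumption, not something stated in this thread):

```shell
#!/bin/sh
# Convert the su/sw values from this thread into the other units XFS uses.
su_kb=1536      # stripe unit in KB (from the thread)
sw=3            # stripe width as a multiple of su (from the thread)
fs_block=4096   # ASSUMPTION: filesystem block size in bytes

# sunit/swidth in 512-byte sectors.
sunit_sectors=$(( su_kb * 1024 / 512 ))
swidth_sectors=$(( sunit_sectors * sw ))

# sunit/swidth in filesystem blocks, the unit xfs_info prints.
sunit_blks=$(( su_kb * 1024 / fs_block ))
swidth_blks=$(( sunit_blks * sw ))

echo "sectors: sunit=${sunit_sectors} swidth=${swidth_sectors}"
echo "blocks:  sunit=${sunit_blks} swidth=${swidth_blks}"
```

Comparing the block figures against what xfs_info shows for a freshly made filesystem is one way to confirm that the intended geometry was actually applied.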


> This aligns XFS to the RAID60 geometry.  Geometry alignment must be
> maintained throughout the entire storage stack.  If a single layer is
> not aligned properly, every layer will be misaligned.  When this occurs
> performance will suffer, and could suffer tremendously.
> You'll want to add "inode64" to your fstab mount options for these
> filesystems.  This has nothing to do with geometry, but how XFS
> allocates inodes and how/where files are written to AGs.  It is the
> default in very recent kernels but I don't know in which it was made so.

Yes, I was aware of this.

> LVM typically affords you much more flexibility here than your RAID/SAN
> controller.  Just be mindful that when you expand you need to keep your
> geometry, i.e. stripe width, the same.  Let's say some time in the
> future you want to expand but can only afford, or only need, one 14 disk
> chassis at the time, not another 3 for another RAID60.  Here you could
> create a single 14 drive RAID6 with stripe geometry 384KB * 12 = 4608KB.
> You could then carve it up into 1-3 pieces, each aligned to the
> start/end of a 4608KB stripe and evenly divisible by 4608KB, and add
> them to one or more of your LVs/XFS filesystems.  This maintains the
> same overall stripe width geometry as the RAID60 to which all of your
> XFS filesystems are already aligned.

OK, so the upshot is that any additions to the volume group must be
arrays with su*sw=4608k, and all logical volumes and filesystems must
begin and end on multiples of 4608k from the start of the block device.

As long as these things hold true, is it all right for logical
volumes/filesystems to begin on one physical device and end on another?
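
The "multiple of 4608k" rule above is easy to check mechanically. This is only a sketch: the function names are invented for illustration, and the offsets and sizes fed to it are example values in KB, not measurements from the actual arrays.

```shell
#!/bin/sh
# Check whether an offset or size (in KB) falls on a full-stripe boundary.
STRIPE_KB=4608   # su * sw from the thread: 1536KB * 3

# Succeeds (exit 0) when the value is a whole number of full stripes.
is_aligned() {
    [ $(( $1 % STRIPE_KB )) -eq 0 ]
}

# Round a size up to the next full-stripe boundary.
round_up_kb() {
    echo $(( ( ($1 + STRIPE_KB - 1) / STRIPE_KB ) * STRIPE_KB ))
}

report() {
    if is_aligned "$1"; then
        echo "${1}KB: aligned"
    else
        echo "${1}KB: NOT aligned, next boundary is $(round_up_kb "$1")KB"
    fi
}

report $(( 384 * 12 ))     # the single 14-drive RAID6 from Stan's example
report $(( 1024 * 1024 ))  # a "round" 1GiB size, which is not a stripe multiple
```

The second case illustrates why sizes that look round in binary units still need to be adjusted: 1GiB is 1048576KB, which is not a multiple of 4608KB.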

> If you remember only 3 words of my post, remember:
> Alignment, alignment, alignment.

Yes, I am hearing you. :-)

> For a RAID60 setup such as you're describing, you'll want to use LVM,
> and you must maintain consistent geometry throughout the stack, from
> array to filesystem.  This means every physical volume you create must
> start and end on a 4608KB stripe boundary.  Every volume group you
> create must do the same.  And every logical volume must also start and
> end on a 4608KB stripe boundary.  If you don't verify each layer is
> aligned all of your XFS filesystems will likely be unaligned.  And
> again, performance will suffer, possibly horribly so.
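
At the physical volume layer, the requirement above can be approached with pvcreate's --dataalignment option, and the result inspected through the pe_start field of pvs. The sketch below is a dry run that only prints the commands. /dev/sdX is a placeholder device name, not one from this thread, and the helper names are hypothetical.

```shell
#!/bin/sh
# Dry run: this only prints commands; nothing is executed against a device.
STRIPE_KB=4608       # full RAID60 stripe from the thread
PV_DEVICE=/dev/sdX   # PLACEHOLDER device name, not from the thread

# Build the pvcreate command that aligns the PV data area to the full stripe.
pvcreate_cmd() {
    echo "pvcreate --dataalignment ${STRIPE_KB}k $1"
}

# pe_start is the offset (KB) where the first physical extent begins;
# it should be a whole number of full stripes.
pe_start_ok() {
    [ $(( $1 % STRIPE_KB )) -eq 0 ]
}

pvcreate_cmd "$PV_DEVICE"
# Command to inspect the resulting data offset after the PV exists:
echo "pvs -o +pe_start --units k $PV_DEVICE"
```

Feeding the pe_start value reported by pvs into pe_start_ok is one way to confirm the PV layer before moving on to volume groups and logical volumes.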

So, basically, --dataalignment is my friend during pvcreate and

Thanks so much for your and Dave's help; this has been tremendously
helpful.

Morgan Hamill
