Alignment: XFS + LVM2

Marc Caubet mcaubet at pic.es
Thu May 8 08:52:09 CDT 2014


Hi Stan,

once again, thanks for your answer.

> > Hi Stan,
> >
> > thanks for your answer.
> >
> >> Everything begins and ends with the workload.
> >>
> >> On 5/7/2014 7:43 AM, Marc Caubet wrote:
> >>> Hi all,
> >>>
> >>> I am trying to setup a storage pool with correct disk alignment and I
> >> hope
> >>> somebody can help me to understand some unclear parts to me when
> >>> configuring XFS over LVM2.
> >>
> >> I'll try.  But to be honest, after my first read of your post, a few
> >> things jump out as breaking traditional rules.
> >>
> >> The first thing you need to consider is your workload and the type of
> >> read/write patterns it will generate.  This document is unfinished, and
> >> unformatted, but reading what is there should be informative:
> >>
> >> http://www.hardwarefreak.com/xfs/storage-arch.txt
> >>
> >
> > Basically we are moving a lot of data :) That means parallel large files
> > (GBs) are being written and read all the time. We have a batch
> > farm with 3.5k cores processing jobs that are constantly reading and
> > writing to the storage pools (4 PB). Only a few pools (~5% of the total)
> > contain small files (and only small files).
>
> And these pools are tied together with?  Gluster?  Ceph?
>

We are using dCache (http://www.dcache.org/), where each file is written to a
single pool rather than being split across pools as Ceph or Hadoop do. So
large files go entirely to one pool.


> >>> Actually we have few storage pools with the following settings each:
> >>>
> >>> - LSI Controller with 3xRAID6
> >>> - Each RAID6 is configured with 10 data disks + 2 for double-parity.
> >>> - Each disk has a capacity of 4TB, 512e and physical sector size of 4K.
> >>
> >> 512e drives may cause data loss.  See:
> >> http://docs.oracle.com/cd/E26502_01/html/E28978/gmkgj.html#gmlfz
> >>
> >
> > Haven't experienced this yet. But good to know thanks :)  On the other
> > hand, we do not use zfs
>
> This problem affects all filesystems.  If the drive loses power during
> an RMW cycle the physical sector is corrupted.  As noted, not all 512e
> drives may have this problem.  And for the bulk of your workload this
> shouldn't be an issue.  If you have sufficient and properly functioning
> UPS it shouldn't be an issue either.
>

Actually, all our LSI controllers have batteries, so I hope this will not
happen. It is a good problem to keep in mind when we purchase new storage
machines, so thanks :)


>
> >>> - 3x(10+2) configuration was considered in order to gain best
> >>> performance and data safety (fewer disks per RAID, less probability of
> >>> data corruption)
> >>
> >> RAID6 is the worst performer of all the RAID levels but gives the best
> >> resilience to multiple drive failure.  The reason for using fewer drives
> >> per array has less to do with probability of corruption, but
> >>
> >> 1. Limiting RMW operations to as few drives as possible, especially for
> >> controllers that do full stripe scrubbing on RMW
> >>
> >> 2.  Lowering bandwidth and time required to rebuild a dead drive, fewer
> >> drives tied up during a rebuild
> >>
> >
> >>> From the O.S. side we see:
> >>>
> >>> [root at stgpool01 ~]# fdisk -l /dev/sda /dev/sdb /dev/sdc
> >> ...
> >>
> >> You omitted crucial information.  What is the stripe unit size of each
> >> RAID6?
> >>
> >
> > Actually the stripe size for each RAID6 is 256KB, but we plan to increase
> > some pools to 1MB for all their RAIDs. This is in order to compare
> > performance for pools containing large files and, if this improves, we
> > will apply it to the other systems in the future.
>
> So currently you have a 2.5MB stripe width per RAID6 and you plan to
> test with a 10MB stripe width.
>
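
For reference, this is how I plan to double-check the geometry reported by
the controller (assuming MegaCli is the right management tool for these LSI
controllers; the binary name may differ on our installation):

$ MegaCli64 -LDInfo -Lall -aALL | grep -i 'strip size'

It should report the per-disk strip size (256 KB today, 1 MB after the
change), and the RAID6 stripe width is then that value times the 10 data
disks.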

> >>> The idea is to aggregate the above devices and show only 1 storage
> >>> space.
> >>> We did as follows:
> >>>
> >>> vgcreate dcvg_a /dev/sda /dev/sdb /dev/sdc
> >>> lvcreate -i 3 -I 4096 -n dcpool -l 100%FREE -v dcvg_a
> >>
> >> You've told LVM that its stripe unit is 4MB, and thus the stripe width
> >> of each RAID6 is 4MB.  This is not possible with 10 data spindles.
> >> Again, show the RAID geometry from the LSI tools.
> >>
> >> When creating a nested stripe, the stripe unit of the outer stripe (LVM)
> >> must equal the stripe width of each inner stripe (RAID6).
> >>
> >
> > Great. Hence, if the RAID6 stripe size is 256k then the LVM should be
> > defined with 256k as well, isn't it?
>
> No.  And according to lvcreate(8) you cannot use LVM for the outer
> stripe because you have 10 data spindles per RAID6.  "StripeSize" is
> limited to power of 2 values.  Your RAID6 stripe width is 2560 KB which
> is not a power of 2 value.  So you must use md.  See mdadm(8).
>

Great, thanks, this is exactly what I needed and I think I am starting to
understand :) So a RAID6 of 16+2 disks with a 256KB stripe unit will have a
stripe width of 256*16=4096 KB, which is a power of 2. In that case LVM2 can
be used. Am I correct? Then it seems clear to me that new purchases will go
this way (we have a new purchase planned for next month and I am trying to
understand this).


> And be careful with terminology.  "Stripe unit" is per disk, called
> "chunk" by mdadm.  "Stripe width" is per array.  "Stripe size" is
> ambiguous.
>

Yes, correct, sorry for the wrong terminology; it is not something I usually
deal with :)


>
> When nesting stripes, the "stripe width" of the RAID6 becomes the
> "stripe unit" of the outer stripe of the resulting RAID60.  In essence,
> each RAID6 is treated as a "drive" in the outer stripe.  For example:
>
> RAID6  stripe unit  =  256 KB
> RAID6  stripe width = 2560 KB
> RAID60 stripe unit  = 2560 KB
> RAID60 stripe width = 7680 KB
>
> For RAID6 w/1MB stripe unit
>
> RAID6  stripe unit  =  1 MB
> RAID6  stripe width = 10 MB
> RAID60 stripe unit  = 10 MB
> RAID60 stripe width = 30 MB
>
> This is assuming your stated configuration of 12 drives per RAID6, 10
> data spindles, and 3 RAID6 arrays per nested stripe.
>
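
If I follow the arithmetic, the general rule (my own summary) would be:
RAID60 stripe unit = RAID6 stripe unit x number of data disks per RAID6,
and RAID60 stripe width = RAID60 stripe unit x number of RAID6 arrays.
With our current setup: 256 KB x 10 = 2560 KB, and 2560 KB x 3 = 7680 KB.
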
> >> Hence, stripe of the 3 RAID6 in a LV.
> >>
> >> Each RAID6 has ~1.3GB/s of throughput.  By striping the 3 arrays into a
> >> nested RAID60 this suggests you need single file throughput greater than
> >> 1.3GB/s and that all files are very large.  If not, you'd be better off
> >> using a concatenation, and using md to accomplish that instead of LVM.
> >>
> >>> And here is my first question: How can I check if the storage and the
> >>> LV are correctly aligned?
> >>
> >> Answer is above.  But the more important question is whether your
> >> workload wants a stripe or a concatenation.
> >>
> >>> On the other hand, I have formatted XFS as follows:
> >>>
> >>> mkfs.xfs -d su=256k,sw=10 -l size=128m,lazy-count=1 /dev/dcvg_a/dcpool
>
> lazy-count=1 is the default.  No need to specify it.
>

Ok thanks :)


>
> >> This alignment is not correct.  XFS must be aligned to the LVM stripe
> >> geometry.  Here you apparently aligned XFS to the RAID6 geometry
> >> instead.  Why are you manually specifying a 128M log?  If you knew your
> >> workload that well, you would not have made these other mistakes.
> >>
> >
> > We receive several parallel writes all the time, and afaik filesystems
> > with such write load benefit from a larger log. 128M is the maximum log
> > size.
>
> Metadata is journaled, file data is not.  Filesystems experiencing a
> large amount of metadata modification may benefit from a larger journal
> log, however writing many large files in parallel typically doesn't
> generate much metadata modification.  In addition, with delayed logging
> now the default, the amount of data written to the journal is much less
> than it used to be.  So specifying a log size should not be necessary
> with your workload.
>

Ok. Then I'll try to remove that.


> > So how should XFS be formatted then? As you say, it should be aligned
> > with the LVM stripe; as we have a LV with 3 stripes, then 256k*3 and
> > sw=30?
>
> It must be aligned to the outer stripe in the nest, which would be the
> LVM geometry if it could work.  However, as stated, it appears you
> cannot use lvcreate to make the outer stripe because it does not allow a
> 2560 KiB StripeSize.  Destroy the LVM volume and create an md RAID0
> device of the 3 RAID6 devices, eg:
>
> $ mdadm -C /dev/md0 --raid-devices=3 --chunk=2560 --level=0 /dev/sd[abc]
>
> For making the filesystem and aligning it to the md nested stripe
> RAID60, this is all that is required:
>
> $ mkfs.xfs -d su=2560k,sw=3 /dev/md0
>

Perfect! I'll try this with the current server having 3xRAID6(10+2). You
really helped me with that.
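
To double-check the alignment afterwards (just my own sanity check, assuming
the standard tools report the geometry as I expect, and where
/path/to/mountpoint is wherever the filesystem gets mounted), I suppose I can
run:

$ mdadm --detail /dev/md0        # should show Chunk Size : 2560K for the RAID0
$ xfs_info /path/to/mountpoint   # with a 4 KB block size I would expect
                                 # sunit=640 blks (2560 KB) and swidth=1920 blks (7680 KB)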

Just one final question: if I had 3*RAID6(16+2), the stripe width would be
4096 KB (256KB*16), and when applying this to LVM2 it should be:

lvcreate -i 3 -I 4096 -n dcpool -l 100%FREE -v dcvg_a

And then the XFS format should be:

mkfs.xfs -d su=4096k,sw=3 /dev/dcvg_a/dcpool

Is this correct?
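
(As a sanity check on my side, following the same arithmetic as above: each
16+2 RAID6 would have a stripe width of 256 KB x 16 = 4096 KB, that value
becomes the stripe unit of the outer LVM stripe, and the resulting RAID60
stripe width would be 4096 KB x 3 = 12288 KB.)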

Thanks a lot for your help,
-- 
Marc Caubet Serrabou
PIC (Port d'Informació Científica)
Campus UAB, Edificio D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 22
Fax: +34 93 581 41 10
http://www.pic.es
Avis - Aviso - Legal Notice: http://www.ifae.es/legal.html