Alignment: XFS + LVM2
Marc Caubet
mcaubet at pic.es
Thu May 8 08:52:09 CDT 2014
Hi Stan,
once again, thanks for your answer.
> > Hi Stan,
> >
> > thanks for your answer.
> >
> >> Everything begins and ends with the workload.
> >>
> >> On 5/7/2014 7:43 AM, Marc Caubet wrote:
> >>> Hi all,
> >>>
> >>> I am trying to setup a storage pool with correct disk alignment and I
> >> hope
> >>> somebody can help me to understand some unclear parts to me when
> >>> configuring XFS over LVM2.
> >>
> >> I'll try. But to be honest, after my first read of your post, a few
> >> things jump out as breaking traditional rules.
> >>
> >> The first thing you need to consider is your workload and the type of
> >> read/write patterns it will generate. This document is unfinished, and
> >> unformatted, but reading what is there should be informative:
> >>
> >> http://www.hardwarefreak.com/xfs/storage-arch.txt
> >>
> >
> > Basically we are moving a lot of data :) It means, parallel large files
> > (GBs) are being written and read all the time. Basically we have a batch
> > farm with 3,5k cores processing jobs that are constantly reading and
> > writing to the storage pools (4PBs). Only few pools (~5% of the total)
> > contain small files (and only small files).
>
> And these pools are tied together with? Gluster? Ceph?
>
We are using dCache (http://www.dcache.org/), where each file is written to a
single pool rather than being split into parts and spread across pools as
Ceph or Hadoop do. So large files go entirely to a single pool.
> >>> Actually we have few storage pools with the following settings each:
> >>>
> >>> - LSI Controller with 3xRAID6
> >>> - Each RAID6 is configured with 10 data disks + 2 for double-parity.
> >>> - Each disk has a capacity of 4TB, 512e and physical sector size of 4K.
> >>
> >> 512e drives may cause data loss. See:
> >> http://docs.oracle.com/cd/E26502_01/html/E28978/gmkgj.html#gmlfz
> >>
> >
> > Haven't experienced this yet. But good to know thanks :) On the other
> > hand, we do not use zfs
>
> This problem affects all filesystems. If the drive loses power during
> an RMW cycle the physical sector is corrupted. As noted, not all 512e
> drives may have this problem. And for the bulk of your workload this
> shouldn't be an issue. If you have sufficient and properly functioning
> UPS it shouldn't be an issue either.
>
Actually all our LSI controllers have batteries, so I hope this will not
happen. Still, it is good to keep this problem in mind when we purchase new
storage machines, so thanks :)
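For reference, here is a minimal sketch of how I plan to double-check the
sector sizes the kernel reports for these volumes (device names are just
examples, and since the LSI controller exports a virtual drive this shows
what the controller advertises rather than the individual 512e disks):

$ cat /sys/block/sda/queue/logical_block_size   # 512 on a 512e device
$ cat /sys/block/sda/queue/physical_block_size  # 4096 if the 4K physical sector is exposed
$ blockdev --getss --getpbsz /dev/sda           # same values via blockdev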
>
> >>> - 3x(10+2) configuration was considered in order to gain best performance
> >>> and data safety (less disks per RAID less probability of data corruption)
> >>
> >> RAID6 is the worst performer of all the RAID levels but gives the best
> >> resilience to multiple drive failure. The reason for using fewer drives
> >> per array has less to do with probability of corruption, but
> >>
> >> 1. Limiting RMW operations to as few drives as possible, especially for
> >> controllers that do full stripe scrubbing on RMW
> >>
> >> 2. Lowering bandwidth and time required to rebuild a dead drive, fewer
> >> drives tied up during a rebuild
> >>
> >
> >>> From the O.S. side we see:
> >>>
> >>> [root@stgpool01 ~]# fdisk -l /dev/sda /dev/sdb /dev/sdc
> >> ...
> >>
> >> You omitted crucial information. What is the stripe unit size of each
> >> RAID6?
> >>
> >
> > Actually the stripe size for each RAID6 is 256KB but we plan to increase
> > some pools to 1MB for all their RAIDs. It will be in order to compare
> > performance for pools containing large files and if this improves, we will
> > apply it to the other systems in the future.
>
> So currently you have a 2.5MB stripe width per RAID6 and you plan to
> test with a 10MB stripe width.
>
> >>> The idea is to aggregate the above devices and show only 1 storage space.
> >>> We did as follows:
> >>>
> >>> vgcreate dcvg_a /dev/sda /dev/sdb /dev/sdc
> >>> lvcreate -i 3 -I 4096 -n dcpool -l 100%FREE -v dcvg_a
> >>
> >> You've told LVM that its stripe unit is 4MB, and thus the stripe width
> >> of each RAID6 is 4MB. This is not possible with 10 data spindles.
> >> Again, show the RAID geometry from the LSI tools.
> >>
> >> When creating a nested stripe, the stripe unit of the outer stripe (LVM)
> >> must equal the stripe width of each inner stripe (RAID6).
> >>
> >
> > Great. Hence, if the RAID6 stripe size is 256k then the LVM should be
> > defined with 256k as well, isn't it?
>
> No. And according to lvcreate(8) you cannot use LVM for the outer
> stripe because you have 10 data spindles per RAID6. "StripeSize" is
> limited to power of 2 values. Your RAID6 stripe width is 2560 KB which
> is not a power of 2 value. So you must use md. See mdadm(8).
>
Great, thanks, this is exactly what I needed and I think I am starting to
understand :) So a RAID6 of 16+2 disks with a stripe unit of 256KB will have
a stripe width of 256*16=4096KB, which is a power of 2. In that case LVM2
can be used. Am I correct? Then it seems clear to me that new purchases will
go in this direction (we have a new purchase planned for next month and I am
trying to understand this).
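If I understand the rule correctly, the nested geometry for that hypothetical
3x RAID6 (16+2) purchase would be as follows (a sketch only, assuming we keep
the 256KB stripe unit):

RAID6 stripe unit   = 256 KB
RAID6 stripe width  = 16 * 256 KB = 4096 KB  (a power of 2, so lvcreate -I 4096 is allowed)
RAID60 stripe unit  = 4096 KB
RAID60 stripe width = 3 * 4096 KB = 12288 KB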
> And be careful with terminology. "Stripe unit" is per disk, called
> "chunk" by mdadm. "Stripe width" is per array. "Stripe size" is
> ambiguous.
>
Yes, correct, sorry for the wrong terminology; it is something I am not used
to dealing with :)
>
> When nesting stripes, the "stripe width" of the RAID6 becomes the
> "stripe unit" of the outer stripe of the resulting RAID60. In essence,
> each RAID6 is treated as a "drive" in the outer stripe. For example:
>
> RAID6 stripe unit = 256 KB
> RAID6 stripe width = 2560 KB
> RAID60 stripe unit = 2560 KB
> RAID60 stripe width = 7680 KB
>
> For RAID6 w/1MB stripe unit
>
> RAID6 stripe unit = 1 MB
> RAID6 stripe width = 10 MB
> RAID60 stripe unit = 10 MB
> RAID60 stripe width = 30 MB
>
> This is assuming your stated configuration of 12 drives per RAID6, 10
> data spindles, and 3 RAID6 arrays per nested stripe.
>
> >> Hence, stripe of the 3 RAID6 in a LV.
> >>
> >> Each RAID6 has ~1.3GB/s of throughput. By striping the 3 arrays into a
> >> nested RAID60 this suggests you need single file throughput greater than
> >> 1.3GB/s and that all files are very large. If not, you'd be better off
> >> using a concatenation, and using md to accomplish that instead of LVM.
> >>
> >>> And here is my first question: How can I check if the storage and the LV
> >>> are correctly aligned?
> >>
> >> Answer is above. But the more important question is whether your
> >> workload wants a stripe or a concatenation.
> >>
> >>> On the other hand, I have formatted XFS as follows:
> >>>
> >>> mkfs.xfs -d su=256k,sw=10 -l size=128m,lazy-count=1 /dev/dcvg_a/dcpool
>
> lazy-count=1 is the default. No need to specify it.
>
Ok thanks :)
>
> >> This alignment is not correct. XFS must be aligned to the LVM stripe
> >> geometry. Here you apparently aligned XFS to the RAID6 geometry
> >> instead. Why are you manually specifying a 128M log? If you knew your
> >> workload that well, you would not have made these other mistakes.
> >>
> >
> > We receive several parallel writes all the time, and afaik filesystems with
> > such write load benefit from a larger log. 128M is the maximum log size.
>
> Metadata is journaled, file data is not. Filesystems experiencing a
> large amount of metadata modification may benefit from a larger journal
> log, however writing many large files in parallel typically doesn't
> generate much metadata modification. In addition, with delayed logging
> now the default, the amount of data written to the journal is much less
> than it used to be. So specifying a log size should not be necessary
> with your workload.
>
Ok. Then I'll try to remove that.
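For my own reference, a minimal way to check which log size mkfs.xfs actually
picked, assuming the filesystem is mounted at /dcpool (just an example path):

$ xfs_info /dcpool   # the "log =internal ... blocks=" line shows the journal size in fs blocks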
> > So how XFS should be formatted then? As you specify, should be aligned with
> > the LVM stripe, as we have a LV with 3 stripes then 256k*3 and sw=30?
>
> It must be aligned to the outer stripe in the nest, which would be the
> LVM geometry if it could work. However, as stated, it appears you
> cannot use lvcreate to make the outer stripe because it does not allow a
> 2560 KiB StripeSize. Destroy the LVM volume and create an md RAID0
> device of the 3 RAID6 devices, eg:
>
> $ mdadm -C /dev/md0 --raid-devices=3 --chunk=2560 --level=0 /dev/sd[abc]
>
> For making the filesystem and aligning it to the md nested stripe
> RAID60, this is all that is required:
>
> $ mkfs.xfs -d su=2560k,sw=3 /dev/md0
>
Perfect! I'll try this on the current server, which has 3x RAID6 (10+2). You
really helped me with that.
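For my own notes, a minimal sketch of how I plan to verify the nested
geometry afterwards (assuming /dev/md0 as in your example and a mount point
of /dcpool, which is just an example):

$ mdadm --detail /dev/md0   # should report "Raid Level : raid0" and "Chunk Size : 2560K"
$ xfs_info /dcpool          # with 4096-byte blocks, expect sunit=640 and swidth=1920 blks
                            # (640*4KB = 2560KB, 1920*4KB = 7680KB)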
Just one final question: if I had 3x RAID6 (16+2), the stripe width would be
4096KB (256KB*16), and when applying this to LVM2 it should be:
lvcreate -i 3 -I 4096 -n dcpool -l 100%FREE -v dcvg_a
And then the XFS format should be:
mkfs.xfs -d su=4096k,sw=3 /dev/dcvg_a/dcpool
Is this correct?
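And if that is right, a minimal sketch of how I would double-check the LVM
geometry before running mkfs.xfs (the report field names are an assumption on
my side and may differ between lvm2 versions):

$ lvs -o +stripes,stripe_size dcvg_a/dcpool   # expect 3 stripes of 4.00m
$ lvdisplay -m /dev/dcvg_a/dcpool             # the segment map also lists stripes and stripe size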
Thanks a lot for your help,
--
Marc Caubet Serrabou
PIC (Port d'Informació Científica)
Campus UAB, Edificio D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 22
Fax: +34 93 581 41 10
http://www.pic.es
Avis - Aviso - Legal Notice: http://www.ifae.es/legal.html