
Re: XFS and Raid

To: Dan Yocum <yocum@xxxxxxxx>
Subject: Re: XFS and Raid
From: Steve Lord <lord@xxxxxxx>
Date: 14 Nov 2001 11:40:27 -0600
Cc: "Martin K. Petersen" <mkp@xxxxxxxxxxxxx>, Anuradha Ratnaweera <anuradha@xxxxxxx>, linux-xfs@xxxxxxxxxxx
In-reply-to: <3BF2A847.2DD759A4@fnal.gov>
References: <20011108171403.A28816@bee.lk> <yq1hes5xrtj.fsf@jaguar.mkp.net> <3BF2A847.2DD759A4@fnal.gov>
Sender: owner-linux-xfs@xxxxxxxxxxx
On Wed, 2001-11-14 at 11:22, Dan Yocum wrote:

> > 
> > Have you read the sections about sunit and swidth?
> 
> 
> Yup.  Still doesn't make much sense to me.  It sounds like swidth is
> analogous to chunk-size in software raid, but what is sunit?
> 
> And are 'sw' and 'su' just the abbreviated forms of swidth and sunit,
> respectively?

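Not exactly. They are variants that take different units: sunit and
swidth are given in 512 byte blocks, su is given in bytes, and sw is a
multiple of the stripe unit. xfs_info reports sunit and swidth in
filesystem blocks.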

> 
> 
> > 
> > In general mkfs.xfs will do the right thing.  As the man page states,
> > when you run mkfs on an LVM or MD device it will automagically extract
> > the stripe unit and stripe width.
> > 
> > If you have a hardware RAID device, however, you'll have to specify
> > these parameters manually to match the configuration of your device.
> 
> 
> So, to wit, in our systems we have 2, 8 disk HW RAID5 arrays which are SW
> RAID0 (striped) together.  The HW chunk size is 64k (this is hardcoded). 
> The SW chunk size is 512k.  I wish that I could make this 448k (so one SW
> chunk goes to one array, with the left over used as the parity chunk), but
> that's not possible.  
> 
> So, xfs_info shows this:
> 
> [root@sdssdp10 dp]# xfs_info /export/data/dp10.a/
> meta-data=/export/data/dp10.a    isize=512    agcount=268, agsize=1048576 blks
> data     =                       bsize=4096   blocks=280145408, imaxpct=25
>          =                       sunit=128    swidth=256 blks, unwritten=0
> naming   =version 2              bsize=4096  
> log      =internal               bsize=4096   blocks=32768
> realtime =none                   extsz=1048576 blocks=0, rtextents=0

This is picking up info from the software raid; it does not see any
info from the hardware. As far as it is concerned, you have two devices.
The stripe unit (the amount of data written to one device before it
switches to the next device) is 128 filesystem blocks, or 512 Kbytes. The
stripe width (the amount of data written before it cycles back to the
first device again) is twice this (2 devices), or 1 Mbyte.
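
As a rough check of that arithmetic, here is a small Python sketch that
converts the xfs_info figures (reported in filesystem blocks) into bytes.
The function name is made up for illustration; the numbers are the ones
from the xfs_info output quoted above.

```python
def blocks_to_bytes(blocks, bsize):
    """Convert a count of filesystem blocks to bytes."""
    return blocks * bsize

bsize = 4096        # bsize from the data section of xfs_info
sunit_blks = 128    # sunit as reported by xfs_info (filesystem blocks)
swidth_blks = 256   # swidth as reported by xfs_info (filesystem blocks)

sunit_bytes = blocks_to_bytes(sunit_blks, bsize)
swidth_bytes = blocks_to_bytes(swidth_blks, bsize)
members = swidth_blks // sunit_blks  # devices the stripe cycles over

print(sunit_bytes // 1024, "KiB stripe unit")
print(swidth_bytes // 1024, "KiB stripe width")
print(members, "stripe members")
```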

You can override the automatically selected values at mkfs time; the
tricky part is working out what values will work for you. To quote from
the man page:

              The sunit suboption is used to specify  the  stripe
              unit  for  a  RAID device or a logical volume.  The
              suboption value has to  be  specified  in  512-byte
              block  units.   Use the su suboption to specify the
              stripe unit size in bytes.  This suboption  ensures
              that  data  allocations will be stripe unit aligned
              when the current end of file is being extended  and
              the  file  size  is  larger than 512KB.  Also inode
              allocations and the internal  log  will  be  stripe
              unit aligned.

              The  su suboption is an alternative to using sunit.
              The su suboption is used to specify the stripe unit
              for a RAID device or a striped logical volume.  The
              suboption value has to be specified in bytes, (usually
              using the m or g suffixes).  This value must be a
              multiple of the filesystem block size.

              The swidth suboption is used to specify the stripe
              width for a RAID device or a striped logical volume.
              The suboption value has to be specified in 512-byte
              block units.  Use the sw suboption to specify the
              stripe width size in bytes.  This suboption is
              required if -d sunit has been specified and it has
              to be a multiple of the -d sunit suboption.  The
              stripe width will be the preferred iosize returned
              in the stat(2) system call.

              The sw suboption is an alternative to using swidth.
              The  sw  suboption  is  used  to specify the stripe
              width for a RAID device or striped logical  volume.
              The suboption value is expressed as a multiplier of
              the stripe unit, usually the same as the number  of
              stripe members in the logical volume configuration,
              or data disks in a RAID device.

              When a filesystem is created on a logical volume
              device, mkfs.xfs will automatically query the logical
              volume for appropriate sunit and swidth values.


You could specify -d su=64k,sw=14 (a stripe width of 14 x 64k = 896k)

To be honest I am not totally sure what your sw value should be. I
presume an 8 disk raid is 7 data, one parity, so with two arrays I
multiplied the stripe unit by 14. The stripe unit is the only one which
really matters.
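
For what it is worth, the same numbers can be worked out mechanically.
The Python sketch below is only an illustration under the assumptions
stated above (two arrays, 7 data disks each, 64k hardware chunk); the
helper name is made up. Per the man page text quoted earlier, sw is a
multiplier of the stripe unit, while sunit and swidth are counted in
512-byte blocks, so it prints both notations.

```python
SECTOR = 512  # sunit/swidth are given in 512-byte units per the man page

def stripe_options(chunk_bytes, data_disks_per_array, arrays):
    """Return the -d stripe suboptions in both notations (illustrative helper)."""
    sw_multiplier = data_disks_per_array * arrays   # data disks in total
    width_bytes = chunk_bytes * sw_multiplier       # full stripe width
    su_sw = "su=%dk,sw=%d" % (chunk_bytes // 1024, sw_multiplier)
    sunit_swidth = "sunit=%d,swidth=%d" % (
        chunk_bytes // SECTOR, width_bytes // SECTOR)
    return su_sw, sunit_swidth, width_bytes

# Assumed layout: 2 arrays, 7 data disks each, 64k hardware chunk
su_sw, sunit_swidth, width_bytes = stripe_options(64 * 1024, 7, 2)

print("-d " + su_sw)
print("-d " + sunit_swidth)
print(width_bytes // 1024, "KiB stripe width")
```

If the 7 data + 1 parity guess is wrong for your controller, only the
second argument needs to change.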

Also, we discovered a problem with the latest version of mkfs in
how it lays out allocation groups onto the stripes. On really
large filesystems it attempts to make the allocation groups 4G
in size, which typically makes them all start on the same LUN,
and that is not good. You should make the allocation group size
one stripe unit less than 4G, so for a 64k stripe unit this
would be

        -d agsize=4194240k

That is 4G - 64k (4194304k - 64k)

This will tend to spread data over all the LUNs better I think.
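
To see why the smaller allocation group size helps, here is a rough
Python sketch. It assumes the filesystem starts on a stripe boundary, that
allocation groups are laid out back to back, and a hypothetical 4 LUNs
(picked only to make the pattern obvious; it is not your configuration).
It lists which LUN holds the first byte of each allocation group, and
works out 4G - 64k in kilobytes.

```python
KIB = 1024
GIB = 1024 ** 3

def ag_start_members(agsize_bytes, su_bytes, members, count):
    """LUN index holding the first byte of each of the first `count` AGs."""
    return [(i * agsize_bytes // su_bytes) % members for i in range(count)]

su = 64 * KIB
members = 4  # hypothetical LUN count, for illustration only

aligned = ag_start_members(4 * GIB, su, members, 8)       # AGs of exactly 4G
shifted = ag_start_members(4 * GIB - su, su, members, 8)  # one stripe unit less
agsize_k = (4 * GIB - su) // KIB                          # 4G - 64k in kilobytes

print(aligned)   # every AG starts on the same LUN
print(shifted)   # AG starts rotate across the LUNs
print("agsize=%dk" % agsize_k)
```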

Steve

p.s. interested in testing some code to allow you to use 256 byte inodes
on a device bigger than 1 Tbyte?



> 
> 
> Why?  Shouldn't swidth be 1000?  And about sunit... well, I'm just confused
> about what that should be.
> 
> 
> > 
> > --
> > Martin K. Petersen, Principal Linux Consultant, Linuxcare, Inc.
>                       ^^^^^^^^^

maybe that should read:

                        Only ;-)

Steve

-- 

Steve Lord                                      voice: +1-651-683-3511
Principal Engineer, Filesystem Software         email: lord@xxxxxxx

