
Re: XFS and I/O alignment

To: Steve Lord <lord@xxxxxxx>
Subject: Re: XFS and I/O alignment
From: Luciano Chavez <lnx1138@xxxxxxxxxx>
Date: 18 Jun 2002 13:52:35 -0500
Cc: linux-xfs@xxxxxxxxxxx, evms-devel@xxxxxxxxxxxxxxxxxxxxx
In-reply-to: <1024423467.1392.50.camel@jen.americas.sgi.com>
References: <1024422375.30888.142.camel@chavez> <1024423467.1392.50.camel@jen.americas.sgi.com>
Sender: owner-linux-xfs@xxxxxxxxxxx
On Tue, 2002-06-18 at 13:04, Steve Lord wrote:
> On Tue, 2002-06-18 at 12:46, Luciano Chavez wrote:
> > Hello,
> > 
> > Recently on the EVMS mailing list, we had a gentleman report a problem
> > using Linux XFS 1.1 on a RAID5 storage object (Linux MD compatibility
> > storage object). See
> > http://sourceforge.net/mailarchive/forum.php?thread_id=799287&forum_id=2003 
> > for the initial post.
> > 
> > After some research I found that moving the internal log to another
> > device worked around the problem.
> > 
> > In short the problem appears to be related to I/O requests of 4K in
> > length coming in on devices sensitive to alignment such as striped LVs
> > or MD devices (specifically when these unaligned I/O requests cross
> > boundaries like outside a chunksize). This problem should also manifest
> > itself on non-striped entities such as fragmented LVs where a PE may get
> > an unaligned I/O request that may span into a PE corresponding to a
> > different LV.
> > 
> > Also, the problem manifested itself most easily with striped devices. I
> > found explanations under man mkfs.xfs of some options specific to
> > striping so I experimented. Below is the output of the several mkfs.xfs
> > attempts on a RAID 5 storage object composed of 6 partitions.
> >       
> > root@gunslinger ~ # mkfs.xfs -f /dev/evms/md/md0
> > meta-data=/dev/evms/md/md0       isize=256    agcount=8, agsize=31360 blks
> > data     =                       bsize=4096   blocks=250880, imaxpct=25
> >          =                       sunit=0      swidth=0 blks, unwritten=0
> > naming   =version 2              bsize=4096  
> > log      =internal log           bsize=4096   blocks=1200
> > realtime =none                   extsz=65536  blocks=0, rtextents=0
> > root@gunslinger ~ # mkfs.xfs -f /dev/evms/md/md0 -d sunit=8,swidth=40
> > meta-data=/dev/evms/md/md0       isize=256    agcount=8, agsize=31360 blks
> > data     =                       bsize=4096   blocks=250880, imaxpct=25
> >          =                       sunit=1      swidth=5 blks, unwritten=0
> > naming   =version 2              bsize=4096  
> > log      =internal log           bsize=4096   blocks=1200
> > realtime =none                   extsz=65536  blocks=0, rtextents=0
> > 
> > root@gunslinger ~ # mkfs.xfs -f /dev/evms/md/md0 -d su=32768,sw=5    
> > meta-data=/dev/evms/md/md0       isize=256    agcount=8, agsize=31360 blks
> > data     =                       bsize=4096   blocks=250880, imaxpct=25
> >          =                       sunit=8      swidth=40 blks, unwritten=0
> > naming   =version 2              bsize=4096  
> > log      =internal log           bsize=4096   blocks=1200
> > realtime =none                   extsz=65536  blocks=0, rtextents=0
> > 
> > None of these helped. Not even specifying the same option on the mount.
> > I still ended up with unaligned I/O coming in and crossing chunksize
> > stripe boundaries essentially corrupting data. I also tried the mkfs.xfs
> > options set to sunit=64,swidth=320 which produced sunit=8 and swidth=40
> > on output and still didn't help.
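
The numbers in the mkfs.xfs output above are consistent once units are taken into account: per the mkfs.xfs man page, sunit/swidth are given in 512-byte sectors, su is in bytes and sw is a multiple of su, while the output is reported in filesystem blocks (bsize=4096 here). A minimal sketch of that arithmetic follows; the helper names are hypothetical and are not part of xfsprogs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a count of 512-byte sectors (the unit
 * used by -d sunit= and -d swidth=) into filesystem blocks. */
static uint32_t sectors_to_fsblocks(uint32_t sectors, uint32_t bsize)
{
    assert(bsize >= 512 && bsize % 512 == 0);
    return sectors / (bsize / 512);
}

/* Hypothetical helper: convert a byte count (the unit used by -d su=)
 * into filesystem blocks. */
static uint32_t bytes_to_fsblocks(uint32_t bytes, uint32_t bsize)
{
    assert(bsize != 0);
    return bytes / bsize;
}
```

So sunit=8,swidth=40 gives 1 and 5 blocks, su=32768,sw=5 gives 8 and 5 * 8 = 40 blocks, and sunit=64,swidth=320 gives 8 and 40 blocks, matching what mkfs.xfs printed.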
> > 
> > I noticed that the xfsprogs libdisk source files test whether the device
> > is an MD or LV striped device so that the sunit and swidth values in
> > the superblock can be set automatically, for example to provide proper
> > alignment on log I/O. From my attempts to isolate this, there must also
> > be a mount-time check somewhere that determines whether to use these
> > values: even after formatting correctly and mounting with these options
> > on the EVMS MD plug-in, they do not seem to be honored.
> > 
> > I would appreciate any help the XFS developers could offer in allowing
> > XFS to work on top of block devices sensitive to alignment under Linux.
> > 
> > Please cross-post any responses to the evms-devel@xxxxxxxxxxxxxxxxxxxxx
> > so that others not subscribed to the linux-xfs list can see them.
> > 
> > We (EVMS) will offer any assistance we can as we would like to see
> > customers using XFS and EVMS together seamlessly and happily on their
> > enterprise systems.
> > 
> > -- 
> > regards,
> > 
> > Luciano Chavez
> > 
> > lnx1138@xxxxxxxxxx          
> > http://evms.sourceforge.net
> 
> Hi,
> 
> The answer to this problem is sitting on my workstation right now, and
> I am trying to decide if pushing it out into the world just before I
> leave for OLS followed by a week's vacation is a good idea or not.
> 
> The stripe alignment code in xfs does not apply to the log. The log is
> written in chunks of up to 32K which can be any multiple of 512 bytes and
> can start on any 512 byte boundary. The only 'safe' way now to make this
> work with volumes where that can end up crossing device boundaries is to
> do all the I/O in 512 byte buffer heads, which as you are probably aware
> is not the best thing in the world to do from a cpu and memory usage
> standpoint. This is why moving the log to a different device made the
> problem go away.
> 
> A quick check of whether this is going to fix things for EVMS is to take this
> code in fs/xfs/pagebuf/page_buf.c:
> 
>         if ((MAJOR(dev) != LVM_BLK_MAJOR) && (MAJOR(dev) != MD_MAJOR)) {
>                 sector = blk_length << SECTOR_SHIFT;
>                 blk_length = 1;
>         } else if ((MAJOR(dev) == MD_MAJOR) && (pg_offset == 0) &&
>                    (pg_length == PAGE_CACHE_SIZE) &&
>                    (((unsigned int) bn) & BN_ALIGN_MASK) == 0) {
>                 sector = blk_length << SECTOR_SHIFT;
>                 blk_length = 1;
>         } else {
>                 sector = SECTOR_SIZE;
>         }
> 
> and replace it with:
> 
>       sector = SECTOR_SIZE;
> 
> ------------------------
> 

Steve,

Thank you very much for the speedy reply! My page_buf.c didn't quite look
like yours (I assume this was the _pagebuf_page_io routine). I made the
following change to my version of the source and it now appears to be
working.

        int concat_ok=0; /* <---- I initialized this to zero */
/*
        if ((MAJOR(dev) != LVM_BLK_MAJOR) && (MAJOR(dev) != MD_MAJOR)) {
                concat_ok = 1;
        } else if ((MAJOR(dev) == MD_MAJOR) && (pg_offset == 0) &&
                   (pg_length == PAGE_CACHE_SIZE) &&
                   ((bn & ((page_buf_daddr_t)(PAGE_CACHE_SIZE - 1) >>
                           9)) == 0)) {
                concat_ok = 1;
        } else {
                concat_ok = 0;
        }
*/

Will the recommended code change be the permanent fix?
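
For anyone following along, here is a minimal sketch of why forcing single-sector I/O sidesteps the problem. It assumes the chunk size is a multiple of 512 bytes; the function name and parameters are hypothetical and are not taken from the XFS, MD, or EVMS sources.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u  /* bytes per sector */

/*
 * Return 1 if an I/O of len_bytes starting at start_sector touches more
 * than one chunk of chunk_bytes, 0 otherwise. chunk_bytes is assumed to
 * be a non-zero multiple of SECTOR_SIZE.
 */
static int crosses_chunk(uint64_t start_sector, uint32_t len_bytes,
                         uint32_t chunk_bytes)
{
    assert(len_bytes != 0 && chunk_bytes != 0);
    uint64_t first = start_sector * SECTOR_SIZE;  /* first byte touched */
    uint64_t last = first + len_bytes - 1;        /* last byte touched */
    return (first / chunk_bytes) != (last / chunk_bytes);
}
```

With a 32K chunk, a 4K request starting at sector 56 stays inside one chunk, while the same request starting at sector 60 spills into the next one. A 512-byte request that starts on a sector boundary can never cross, which is why doing all the I/O in 512 byte buffer heads is safe, if expensive.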

> The code I have sitting here introduces a new log format in xfs which can
> be aligned on different boundaries. It introduces new mkfs options:
> 
>       -l version=2,sunit=xxxx
> 
> Log writes then become aligned on and padded to the stripe unit specified;
> 4K is enough in most cases. You can also do larger log writes with this code,
> but that is not the issue here.
> 

What about non-striped devices? How are they aligned now?
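
A minimal sketch of what "aligned on and padded to the stripe unit" could mean arithmetically, assuming the log stripe unit is given in bytes. This is only an illustration of the rounding involved, not the code Steve describes, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of unit (unit must be non-zero). */
static uint64_t round_up(uint64_t x, uint64_t unit)
{
    assert(unit != 0);
    return ((x + unit - 1) / unit) * unit;
}

/* Hypothetical helper: length of a log write once it has been padded
 * out to a whole number of log stripe units. */
static uint64_t padded_log_len(uint64_t len_bytes, uint64_t log_sunit_bytes)
{
    return round_up(len_bytes, log_sunit_bytes);
}
```

With a 4K log stripe unit, a 512-byte log write would occupy 4096 bytes and a 4608-byte write would occupy 8192 bytes, so if every write also starts on a 4K boundary, none of them can straddle a boundary that is itself a multiple of 4K.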

> Steve
> 
> p.s. LVM2 has hit exactly the same problem.
> 
> 
> Steve Lord                                      voice: +1-651-683-3511
> Principal Engineer, Filesystem Software         email: lord@xxxxxxx
> 
-- 
regards,

Luciano Chavez

lnx1138@xxxxxxxxxx          
http://evms.sourceforge.net

