
Re: RAID5 internal log question

To: Austin Gonyou <austin@xxxxxxxxxxxxxxx>
Subject: Re: RAID5 internal log question
From: "Martin K. Petersen" <mkp@xxxxxxxxxxxxx>
Date: 17 Feb 2002 15:14:51 -0500
Cc: Seth Mos <knuffie@xxxxxxxxx>, Mihai RUSU <dizzy@xxxxxxxxx>, Linux XFS List <linux-xfs@xxxxxxxxxxx>
In-reply-to: <1013814364.16590.12.camel@UberGeek>
Organization: Linuxcare, Inc.
References: <4.3.2.7.2.20020215223304.036f5cd0@pop.xs4all.nl> <1013814364.16590.12.camel@UberGeek>
Sender: owner-linux-xfs@xxxxxxxxxxx
User-agent: Gnus/5.0808 (Gnus v5.8.8) XEmacs/21.4 (Civil Service)
>>>>> "Austin" == Austin Gonyou <austin@xxxxxxxxxxxxxxx> writes:

[XFS vs. RAID5 performance]

Seth>> Just with software RAID. Most hardware RAID controllers have
Seth>> onboard cache, which will make it far less noticeable.

Austin> I disagree a little bit. If you have poor or "low-cost"
Austin> hardware driving your RAID subsystem, then I'd say you will
Austin> see bad I/O in those cases.

Austin> In general, if you buy decent hardware, there shouldn't be a
Austin> problem with this. Some caching RAID controllers, though,
Austin> still use SIMMs and, in my experience, perform far slower than
Austin> the newer ones with DIMMs (PC100, etc.).

This seems to be an area of constant confusion.  Let me describe why
XFS sucks on Linux software RAID5.  It has nothing to do with
controllers, physical disk layout, or anything like that.

RAID5 works by saving N-1 chunks of data followed by a chunk of
parity information (the location of the parity chunk is actually
rotated between the devices in RAID5, but never mind that).  These
N-1 data chunks + the parity chunk make up a stripe.
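
To make that a bit more concrete, here is a rough sketch of how the
parity chunk relates to the N-1 data chunks in a stripe.  This is
purely illustrative - it is not the md driver's code, and the chunk
size and names are made up:

    #include <stddef.h>

    #define CHUNK_SIZE 4096   /* hypothetical chunk size in bytes */

    /* parity[i] = data[0][i] ^ data[1][i] ^ ... ^ data[nr_data-1][i] */
    static void compute_parity(unsigned char *parity,
                               unsigned char *const data[],
                               size_t nr_data)
    {
            size_t i, d;

            for (i = 0; i < CHUNK_SIZE; i++) {
                    unsigned char p = 0;

                    for (d = 0; d < nr_data; d++)
                            p ^= data[d][i];
                    parity[i] = p;
            }
    }

Lose any single chunk and you can XOR the surviving chunks back
together to reconstruct it - that's the whole point of the parity.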

Every time you update any chunk of data you need to read in the rest
of the data chunks in that stripe, calculate the parity, and then
write out the modified data chunk + parity.

This sucks performance-wise because, in the worst case, a single
write can end up causing N-2 reads (at this point you already have
your updated chunk in memory) followed by 2 writes.  The Linux RAID5
personality isn't quite that stupid and actually uses a slightly
different algorithm: it reads the old data chunk + old parity off
disk, XORs the old data out and the new data in, and then writes the
new data + parity back.
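
A minimal sketch of that read-modify-write shortcut, under the same
caveats as above (illustrative only, not the actual RAID5 personality
code): read the old copy of the chunk being rewritten plus the old
parity, XOR the old data back out and the new data in, then write the
new data and the new parity.  Two reads and two writes, no matter how
wide the stripe is:

    #include <stddef.h>

    #define CHUNK_SIZE 4096   /* hypothetical chunk size in bytes */

    /* parity is the old parity chunk, updated in place */
    static void rmw_update_parity(unsigned char *parity,
                                  const unsigned char *old_data,
                                  const unsigned char *new_data)
    {
            size_t i;

            for (i = 0; i < CHUNK_SIZE; i++)
                    parity[i] ^= old_data[i] ^ new_data[i];
    }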

In any case Linux software RAID keeps a stripe cache around to cut
down on the disk I/Os caused by parity updates.  And this cache really
improves performance.

Now.  Unlike the other Linux filesystems, XFS does not stick to one
I/O size.  The filesystem data blocks are 4K (on a PC, anyway), but
log entries are written in 512-byte chunks.

Unfortunately these 512-byte I/Os will cause the RAID5 code to flush
its entire stripe cache and reconfigure it for 512-byte I/O sizes.
Then, a few ms later, we come back and do a 4K data write, causing the
damn thing to be flushed again.  And so on.

IOW, the Linux software RAID5 code was written for filesystems like
ext2 that only do fixed-size I/Os.

So the real problem is that because XFS keeps switching the I/O size,
the RAID5 code effectively runs without a stripe cache.  And that's
what's making the huge sucking sound.  This will be fixed -
eventually...

By moving the XFS journal to a different device (like a software RAID1
as we suggest in the FAQ), you can work around this problem.
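
Something along these lines (the device names are just an example -
/dev/md0 being the RAID5 array and /dev/md1 a small RAID1 holding the
external log; see the FAQ for the exact recipe):

    mkfs.xfs -l logdev=/dev/md1 /dev/md0
    mount -t xfs -o logdev=/dev/md1 /dev/md0 /mnt/data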

And finally - all the hardware RAID controllers I have worked with
stick to one I/O size internally and don't have this problem.  They do
read-modify-write at their own preferred I/O size anyway.

-- 
Martin K. Petersen, Principal Linux Consultant, Linuxcare, Inc.
mkp@xxxxxxxxxxxxx, http://www.linuxcare.com/
SGI XFS for Linux Developer, http://oss.sgi.com/projects/xfs/

