On Wed, Nov 24, 2010 at 02:12:32PM +0100, Spelic wrote:
> On 11/24/2010 01:20 AM, Dave Chinner wrote:
> >512MB of BBWC backing the disks. The BBWC does a much better job of
> >reordering out-of-order writes than the Linux elevators because
> >512MB is a much bigger window than a couple of thousand 4k IOs.
> Hmmm very interesting...
> so you are using a MD or DM raid-0 above a SATA controller with a BBWC?
> That would probably be a RAID controller used as SATA because I have
> never seen SATA controllers with a BBWC. I'd be interested in the
> brand if you don't mind.
Actually, it's a SAS RAID controller:
03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev
With each disk exported as a RAID0 LUN because the RAID controller
does not do JBOD.
> Also I wanted to know... the requests to the drives are really only
> 4K in size for linux?
No, they can be much larger than that. However, for small file
workloads the IO size is determined mostly by the file size.
> Then what purpose do the elevators' merges
> have? When the elevator merges two 4k requests doesn't it create an
> 8k request for the drive?
Yes. But when the two 4k blocks are not adjacent, they can't be
merged and hence are two IOs. And if the block that separated them
is then written 5ms after the other two completed, it's three IOs
that get combined in the BBWC into one...
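
A toy sketch of that scenario, purely for illustration. It is not how the
Linux elevators or the controller firmware are implemented, and
merge_adjacent() is a hypothetical helper that works on bare block numbers:

```python
def merge_adjacent(blocks):
    """Coalesce block numbers into (start, count) runs of adjacent blocks."""
    runs = []
    for b in sorted(set(blocks)):
        if runs and runs[-1][0] + runs[-1][1] == b:
            # Block directly follows the previous run, so extend that run.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # Gap before this block, so it starts a new IO.
            runs.append((b, 1))
    return runs

# The elevator sees blocks 100 and 102 together: not adjacent, so two IOs.
first_batch = merge_adjacent([100, 102])
# Block 101 shows up after the first two have completed: a third IO.
second_batch = merge_adjacent([101])
# A cache that still holds all three blocks sees one contiguous run.
cache_view = merge_adjacent([100, 102, 101])

print(len(first_batch) + len(second_batch), "IOs issued by the host")
print(len(cache_view), "write needed from the cache:", cache_view)
```

The only difference between the two views is how long the blocks stay
visible to whatever is doing the merging, which is the window size
argument made earlier.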
> Also look at this competitor's link:
> post #9
> these scalability patches submit larger i/o than 4k. I can confirm
> that from within iostat -x 1 (I can't understand what he means with
> "bypasses the buffer cache layer" though, does it mean it's only for
> DIRECTIO? it does not seem to me).
It means he's calling submit_bio() rather than submit_bh(). Most of
that "new" code in ext4 was copied directly from XFS - XFS has been
using submit_bio() for large IO submission since roughly 2.6.15.
> When such large requests go into
> the elevator, are they broken up into 4K requests again?
No, the opposite used to happen - ext4 would break large contiguous
regions into 4k IOs (because submit_bh could only handle one block at
a time), and then the elevator would re-merge them into a large IO.
The issue here is CPU overhead of merging thousands of blocks