On Mon, Nov 18, 2013 at 7:57 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Nov 18, 2013 at 12:28:21PM -0600, Eric Sandeen wrote:
>> On 11/18/13, 10:02 AM, Martin Boutin wrote:
>> > Dear list,
>> > I am writing about an apparent issue (or maybe it is normal, that's my
>> > question) regarding filesystem write speed in in a linux raid device.
>> > More specifically, I have linux-3.10.10 running in an Intel Haswell
>> > embedded system with 3 HDDs in a RAID-5 configuration.
>> > The hard disks have 4k physical sectors which are reported as 512
>> > logical size. I made sure the partitions underlying the raid device
>> > start at sector 2048.
>> (fixed cc: to xfs list)
>> > The RAID device has version 1.2 metadata and 4k (bytes) of data
>> > offset, therefore the data should also be 4k aligned. The raid chunk
>> > size is 512K.
>> > I have the md0 raid device formatted as ext3 with a 4k block size, and
>> > stride and stripes correctly chosen to match the raid chunk size, that
>> > is, stride=128,stripe-width=256.
>> > While I was working in a small university project, I just noticed that
>> > the write speeds when using a filesystem over raid are *much* slower
>> > than when writing directly to the raid device (or even compared to
>> > filesystem read speeds).
>> > The command line for measuring filesystem read and write speeds was:
>> > $ dd if=/tmp/diskmnt/filerd.zero of=/dev/null bs=1M count=1000 iflag=direct
>> > $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>> > The command line for measuring raw read and write speeds was:
>> > $ dd if=/dev/md0 of=/dev/null bs=1M count=1000 iflag=direct
>> > $ dd if=/dev/zero of=/dev/md0 bs=1M count=1000 oflag=direct
>> > Here are some speed measures using dd (an average of 20 runs).:
>> > device raw/fs mode speed (MB/s) slowdown (%)
>> > /dev/md0 raw read 207
>> > /dev/md0 raw write 209
>> > /dev/md1 raw read 214
>> > /dev/md1 raw write 212
> So, that's writing to the first 1GB of /dev/md0, and all the writes
> are going to be aligned to the MD stripe.
>> > /dev/md0 xfs read 188 9
>> > /dev/md0 xfs write 35 83o
> And these will not be written to the first 1GB of the block device
> but somewhere else. Most likely a region that hasn't otherwise been
> used, and so isn't going to be overwriting the same blocks like the
> /dev/md0 case is going to be. Perhaps there's some kind of stripe
> caching effect going on here? Was the md device fully initialised
> before you ran these tests?
>> > /dev/md1 ext3 read 199 7
>> > /dev/md1 ext3 write 36 83
>> > /dev/md0 ufs read 212 0
>> > /dev/md0 ufs write 53 75
>> > /dev/md0 ext2 read 202 2
>> > /dev/md0 ext2 write 34 84
> I suspect what you are seeing here is either the latency introduced
> by having to allocate blocks before issuing the IO, or the file
> layout due to allocation is not idea. Single threaded direct IO is
> latency bound, not bandwidth bound and, as such, is IO size
> sensitive. Allocation for direct IO is also IO size sensitive -
> there's typically an allocation per IO, so the more IO you have to
> do, the more allocation that occurs.
I just did a few more tests, this time with ext4:
device raw/fs mode speed (MB/s) slowdown (%)
/dev/md0 ext4 read 199 4%
/dev/md0 ext4 write 210 0%
This time, no slowdown at all on ext4. I believe this is due to the
multiblock allocation feature of ext4 (I'm using O_DIRECT, so it
should be it). So I guess for the other filesystems, it was indeed
the latency introduced by block allocation.
> So, on XFS, what does "xfs_bmap -vvp /tmp/diskmnt/filewr.zero"
> output for the file you wrote? Specifically, I'm interested whether
> it aligned the allocations to the stripe unit boundary, and if so,
> what offset into the device those extents sit at....
> Also, you should run iostat and blktrace to determine if MD is
> doing RMW cycles when being written to through the filesystem.
>> > Is it possible that the filesystem has such enormous impact in the
>> > write speed? We are talking about a slowdown of 80%!!! Even a
>> > filesystem as simple as ufs has a slowdown of 75%! What am I missing?
>> One thing you're missing is enough info to debug this.
>> /proc/mdstat, kernel version, xfs_info output, mkfs commandlines used,
>> partition table details, etc.
> THere's a good list here:
>> If something is misaligned and you are doing RMW for these IOs it could
>> hurt a lot.
>> > Thank you,
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Dave Chinner