On Fri, Jun 15, 2012 at 11:52:17AM +0200, Michael Monnerie wrote:
> On Friday, 15 June 2012, 10:16:02, Dave Chinner wrote:
> > So, the average service time for an IO is 10-16ms, which is a seek
> > per IO. You're doing primarily 128k read IOs, and maybe one or 2
> > writes a second. You have a very deep request queue: > 512 requests.
> > Have you tuned /sys/block/sda/queue/nr_requests up from the default
> > of 128? This is going to be one of the causes of your problems - you
> > have 511 outstanding write requests, and only one read at a time.
> > Reduce the io scheduler queue depth, and potentially also the device
> > CTQ depth.
> Dave, I'm puzzled by this. I'd have thought that a higher nr_requests
> would help the block layer re-sort I/O in the elevator, and so help to
> gain throughput. Why would 128 be better than 512 here?
512 * 16ms per IO = 7-8s IO latency.
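
To put numbers on that, here is a back-of-the-envelope sketch in Python. The function name is purely illustrative, and it assumes the queue is full and serviced one IO at a time at the 10-16ms service times quoted above:

```python
def worst_case_queue_latency_s(queue_depth, service_time_ms):
    """Seconds until the last request in a full queue completes,
    assuming the device services one IO at a time (an assumption)."""
    return queue_depth * service_time_ms / 1000.0

for ms in (10, 16):
    latency = worst_case_queue_latency_s(512, ms)
    print(f"{ms}ms per IO, 512 deep: {latency:.1f}s")
```

A read that lands behind a full queue of writes waits for most of that time before it is even dispatched.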
Fundamentally, deep queues are as harmful to latency as shallow
queues are to throughput. Everyone says "make the queues deeper" to
get the highest benchmark numbers, but in reality most benchmarks
measure throughput and aren't IO latency sensitive.
I did a bunch of measurements 7 or 8 years ago on high end FC HW RAID,
and found that a CTQ depth per lun of 4 was all that was needed to
reach maximum write bandwidth under almost all circumstances. When
doing concurrent read and write with a CTQ depth of 4, the balance
was roughly 50/50 read/write. All things the same except for a CTQ
depth of 6, and it was 30/70 read/write. And at any CTQ depth deeper
than 8 it was roughly 10/90 read/write. That hardware supported a
CTQ depth of 240 IOs per lun....
So even high end hardware that can support a maximum CTQ depth of
256 IOs will see this problem - you'll get 255 writes and a single
read at a time, resulting in terrible read IO latency. There is
always another async write ready to be queued, but the application
doesn't queue another read until the first one completes. Hence
reads always are issued in small numbers and when any IO is
completed, there isn't another read queued ready for dispatch. Hence
all that happens is that async writes are sent to the drive.
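
The starvation mechanism described above can be illustrated with a toy model. This is only a sketch under stated assumptions (one synchronous reader, an endless supply of async writes, strict FIFO service), not a model of any real controller, so it will not reproduce the measured 50/50, 30/70 and 10/90 ratios. It only shows the read share shrinking as the queue gets deeper:

```python
from collections import deque

def read_share(queue_depth, total_ios=10000):
    """Fraction of completed IOs that are reads in a toy FIFO queue.

    Assumptions (illustrative only): a single reader that issues its next
    read only after the previous one completes, async writes that are
    always available, and strictly in-order service.
    """
    queue = deque()
    read_outstanding = False
    reads_done = 0
    for _ in range(total_ios):
        # Refill free slots: at most one read, writes take everything else.
        while len(queue) < queue_depth:
            if not read_outstanding:
                queue.append("R")
                read_outstanding = True
            else:
                queue.append("W")
        # Complete the IO at the head of the queue.
        if queue.popleft() == "R":
            reads_done += 1
            read_outstanding = False
    return reads_done / total_ios

for depth in (1, 4, 8, 128, 512):
    print(f"depth {depth:3d}: reads are {read_share(depth):.1%} of IOs")
```

In this model the reader never holds more than one slot, so every extra slot in the queue goes to a write.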
And then when the BBWC fills up and has to flush all those writes,
everything slows right down because the cache effectively becomes
a write-through cache - it can't take another read or write until
the flush completes another IO and space is freed in the BBWC for
the next IO.
> And maybe Matthew could profit from limiting vm.dirty_bytes. I've
> seen that when this value is too high the server gets stuck on lots
> of writes; for streaming it's better to have this smaller so the disk
> writes can keep up and delays are not too long.
I pretty much never tune dirty limits anymore - most writeback
problems are storage stack related these days...
> > Oh, I just noticed you might be using CFQ (it's the default in
> > dmesg). Don't - CFQ is highly unsuited for hardware RAID - it's
> > heuristically tuned to work well on single SATA drives. Use deadline,
> > or preferably for hardware RAID, noop.
> Wouldn't deadline be better with a larger request queue? As I understand
> it, noop only groups adjacent I/Os together, while deadline does a bit
> more and should be able to get bigger adjacent I/O areas because it
> waits a bit longer before a flush.
The BBWC does a much better job of sorting and batching IOs than the
io scheduler can ever possibly hope to. Think about it - 512MB can
hold over 100,000 4k IOs and reorder and batch them far more
effectively than an io scheduler with even a 512 request deep queue.
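
As a quick check of that comparison, here is the arithmetic as a small Python sketch. It assumes, for illustration only, that the whole 512MB of cache is available for IO data:

```python
cache_bytes = 512 * 1024 * 1024   # 512MB BBWC (assumed fully usable)
io_bytes = 4 * 1024               # one 4k IO
scheduler_depth = 512             # requests the io scheduler can hold

cache_ios = cache_bytes // io_bytes
print(f"BBWC can hold {cache_ios} 4k IOs")
print(f"that is {cache_ios // scheduler_depth}x a {scheduler_depth} deep queue")
```
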
That's why making the IO scheduler queue deeper with HW RAID is
harmful - it's not needed to reach maximum performance for almost
all workloads, and all it does is add latency to the IO path...
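
For reference, a minimal sketch of checking and changing these settings through sysfs from Python. The helper names are made up for illustration, and "sda", "noop" and 128 are just the example values from this thread. Writing the files needs root, and the available schedulers depend on the kernel. On SCSI devices the CTQ depth is usually exposed as device/queue_depth, but that depends on the driver, so it is left out here:

```python
from pathlib import Path

def active_scheduler(scheduler_line):
    """Return the bracketed (active) name from e.g. 'noop deadline [cfq]'."""
    for name in scheduler_line.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None

def show_queue_settings(dev="sda", sysfs=Path("/sys/block")):
    """Print the active io scheduler and nr_requests for one block device."""
    queue = sysfs / dev / "queue"
    sched = active_scheduler((queue / "scheduler").read_text())
    nr_requests = int((queue / "nr_requests").read_text())
    print(f"{dev}: scheduler={sched} nr_requests={nr_requests}")

def set_queue_settings(dev="sda", scheduler="noop", nr_requests=128,
                       sysfs=Path("/sys/block")):
    """Select an io scheduler and queue depth (requires root)."""
    queue = sysfs / dev / "queue"
    (queue / "scheduler").write_text(scheduler)
    (queue / "nr_requests").write_text(str(nr_requests))
```

Calling show_queue_settings() before and after set_queue_settings() confirms whether the change took effect. The settings do not persist across reboots.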