To: stan@xxxxxxxxxxxxxxxxx
Subject: Re: does having ~Ncore+1? kworkers flushing XFS to 1 disk improve throughput?
From: Linda Walsh <xfs@xxxxxxxxx>
Date: Sat, 24 Aug 2013 16:22:11 -0700
Cc: xfs-oss <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <521904F4.90208@xxxxxxxxxxxxxxxxx>
References: <52181B69.6060707@xxxxxxxxx> <52183194.2060008@xxxxxxxxxxxxxxxxx> <5218EADD.4000704@xxxxxxxxx> <521904F4.90208@xxxxxxxxxxxxxxxxx>
User-agent: Thunderbird

Stan Hoeppner wrote:
On 8/24/2013 12:18 PM, Linda Walsh wrote:

Stan Hoeppner wrote:
On 8/23/2013 9:33 PM, Linda Walsh wrote:

So what are all the kworkers doing and does having 6 of them do
things at the same time really help disk-throughput?

Seems like they would conflict w/each other, cause disk
contention, and extra fragmentation as they do things?  If they
were all writing to separate disks, that would make sense, but do
that many kworker threads need to be finishing out disk I/O on
1 disk?

Thanks for the pointer.

I see ways to limit #workers/cpu if they were hogging too much cpu,
which isn't the problem.  My concern is that the work they are
doing is all writing info back to the same physical disk -- and that
while >1 writer can improve throughput, generally, it would be best
if the pending I/O was sorted in disk order and written out using
the elevator algorithm.  I.e. I can't imagine that it takes 6-8
processes (mostly limiting themselves to 1 NUMA node) to keep the
elevator filled?

You're making a number of incorrect assumptions here.  The work queues
are generic, which is clearly spelled out in the document above.  The
kworker threads are just that, kernel threads, not processes as you
assume above.
        Sorry, terminology.  Linux threads are implemented as processes with
minor differences -- they are threads, though, as the kernel sees them.

 XFS is not the only subsystem that uses them.  Any
subsystem or driver can use work queues.  You can't tell what's
executing within a kworker thread from top or ps output.  You must look
at the stack trace.
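
As a minimal sketch of looking at those stack traces, assuming a Linux /proc
that exposes per-task `comm` and `stack` files (reading `stack` normally needs
root and a kernel that provides it), the Python below collects the name and
kernel stack of every task whose name starts with "kworker".  The function
name `kworker_stacks` and its `proc_root` parameter are illustrative only, not
part of any existing tool.

```python
import os


def kworker_stacks(proc_root="/proc"):
    """Return {pid: (comm, stack_text_or_None)} for tasks named kworker*.

    proc_root is a hypothetical parameter so the function can be pointed
    at a test directory instead of the real /proc.
    """
    results = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # the task exited while we were scanning
        if not comm.startswith("kworker"):
            continue
        try:
            with open(os.path.join(base, "stack")) as f:
                stack = f.read()
        except OSError:
            stack = None  # unreadable without sufficient privileges
        results[int(entry)] = (comm, stack)
    return results
```

Calling `kworker_stacks()` as root and printing each entry shows which
functions a given kworker is executing at that moment, which is how XFS work
can be told apart from block-driver or other work.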

The work you are seeing in those 7 or 8 kworker threads is not all
parallel XFS work.  Your block device driver, whether libata, SCSI, or
proprietary RAID card driver, is placing work in these queues as well.
Hmmm.... I hadn't thought of the driver doing that... I sort of thought
it just took blocks as fed by the kernel, and when it was done with
a DMA it told the kernel it was done and was ready for another.

I thought such drivers did direct IO at that point -- i.e. they are below
the elevator algorithm?

The work queues are not limited to filesystems and block device drivers.
 Any device driver or kernel subsystem can use work queues.
        True, but when I see a specific number of them come up and work
constantly while I unpack a tar, I see them as related to that
command.   What other things would use that much cpu?

Nothing bypasses the elevator; sectors are still sorted.  But keep in
mind if you're using a hardware RAID controller -it- does the final
sorting of writeback anyway, so this is a non issue.
It's an LSI RAID controller.

So in a nutshell, whatever performance issue you're having, if you
indeed have an issue, isn't caused by work queues or the number of
kworker threads on your system, per CPU, or otherwise.

Um... but it could be made worse by having an excessive number of
threads all contending for a limited resource.   The more contenders
for a limited resource, the more the scheduler has to sort out who
gets access to the resource next.

If you have 6 threads dumping sectors to different areas of the
disk, and a seek is needed each time one thread's output completes
and the next begins, then you pay a seek penalty with each thread
switch -- vs. if they were coalesced and sorted into 1 queue, 1 worker
could do the work of the 6 without the extra seeks between the
different kworkers emptying their queues.
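
To make the concern concrete, here is a toy model, not a description of what
the kernel does: it assumes six hypothetical workers, each holding a sorted
queue for its own region of the disk, and compares total head travel when
requests reach the disk round-robin (one per worker in turn) against a single
merged queue sorted by offset.  All names and sizes are made up for
illustration, and the reply quoted above says the elevator does sort requests,
so the round-robin case is the hypothetical worst case only.

```python
import random


def head_travel(offsets, start=0):
    """Total distance the head moves servicing offsets in the given order."""
    total, pos = 0, start
    for off in offsets:
        total += abs(off - pos)
        pos = off
    return total


def compare_orders(workers=6, per_worker=50, region=100_000, seed=1):
    """Return (round_robin_travel, merged_travel) for the toy workload."""
    rng = random.Random(seed)
    # Each worker owns one disk region and keeps its own queue sorted.
    queues = [
        sorted(w * region + rng.randrange(region) for _ in range(per_worker))
        for w in range(workers)
    ]
    # One request from each worker in turn: every request changes region.
    round_robin = [q[i] for i in range(per_worker) for q in queues]
    # A single queue sorted by offset: one ascending sweep of the disk.
    merged = sorted(round_robin)
    return head_travel(round_robin), head_travel(merged)
```

In this model the merged order never travels further than the highest offset,
while the round-robin order crosses most of the disk on every cycle.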

You need to look
elsewhere for the bottleneck.  Given it's lightning fast up to the point
buffers start flushing to disk it's pretty clear your spindles simply
can't keep up.
        That's not the point (though it is a given).  What I'm focusing on
is how the kernel handles a backlog.

        If I want throughput, I use 1 writer -- to an unfragmented file that
won't require seeks.  If I try to use 2 writers -- each to unfrag'd files --
and run them at the same time, it's almost certain that the throughput will
drop, since the disk will have to seek back and forth between the two files
to give "disk-write-resources" to each writer.

        It would be faster if I did both files sequentially rather than trying
to do them in parallel.  The disk is limited to ~1GB/s -- every seek that needs to
be done to get files out reduces that.  So tar splats 5000 files into memory.
Then it takes time for those to be written.   If I write 5000 files sequentially
with 1 writer, I will get faster performance than if I use 25 threads each
dumping 50 files in parallel.  The disk subsystem's responsiveness drops
due to all the seeks between writes, whereas if it was 1 big sorted write --
it could be written out in 1-2 elevator passes... I don't think it is being
that efficient.  Thus my Q about whether or not it was really the optimal way
to improve throughput to have "too many writers" accessing a resource at the
same time.
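
The arithmetic behind that worry can be sketched as follows.  The ~1GB/s
ceiling is the figure given above; the seek count and per-seek time passed in
are pure assumptions for illustration, and the function name is hypothetical.

```python
def effective_throughput(total_bytes, bandwidth, seeks, seek_time):
    """Bytes per second once seek time is added to pure transfer time.

    bandwidth is the sequential ceiling in bytes/s; seeks and seek_time
    (seconds per seek) are assumed values, not measurements.
    """
    transfer_time = total_bytes / bandwidth
    return total_bytes / (transfer_time + seeks * seek_time)
```

For example, writing 1 GB at a 1 GB/s ceiling with no seeks takes 1 second;
add 200 seeks at an assumed 5 ms each and the same data takes about 2 seconds,
halving the effective rate.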

        I'm not saying there is a "problem" per se, I'm just asking/wondering
how so many writers won't have the disk seeking all over the place to 
service their requests.

        FWIW, the disk could probably handle 2-3 writers and show improvement
over a single writer -- but with anything over that, I have started to see an
overall drop in throughput.
