does having ~Ncore+1? kworkers flushing XFS to 1 disk improve throughput?
Linda Walsh
xfs at tlinx.org
Sat Aug 24 18:22:11 CDT 2013
Stan Hoeppner wrote:
> On 8/24/2013 12:18 PM, Linda Walsh wrote:
>>
>> Stan Hoeppner wrote:
>>> On 8/23/2013 9:33 PM, Linda Walsh wrote:
>>>
>>>> So what are all the kworkers doing and does having 6 of them do
>>>> things at the same time really help disk-throughput?
>>>>
>>>> Seems like they would conflict w/each other, cause disk
>>>> contention, and extra fragmentation as they do things? If they
>>>> were all writing to separate disks, that would make sense, but do
>>>> that many kworker threads need to be finishing out disk I/O on
>>>> 1 disk?
>>> https://raw.github.com/torvalds/linux/master/Documentation/workqueue.txt
>> ----
>>
>> Thanks for the pointer.
>>
>> I see ways to limit #workers/CPU if they were hogging too much CPU,
>> which isn't the problem. My concern is that the work they are
>> doing is all writing info back to the same physical disk -- and that
>> while >1 writer can improve throughput, generally, it would be best
>> if the pending I/O was sorted in disk order and written out using
>> the elevator algorithm. I.e. I can't imagine that it takes 6-8
>> processes (mostly limiting themselves to 1 NUMA node) to keep the
>> elevator filled?
>
> You're making a number of incorrect assumptions here. The work queues
> are generic, which is clearly spelled out in the document above. The
> kworker threads are just that, kernel threads, not processes as you
> assume above.
----
Sorry -- terminology. Linux threads are implemented as processes with
only minor differences; they are threads, though, as far as the kernel sees them.
> XFS is not the only subsystem that uses them. Any
> subsystem or driver can use work queues. You can't tell what's
> executing within a kworker thread from top or ps output. You must look
> at the stack trace.
>
> The work you are seeing in those 7 or 8 kworker threads is not all
> parallel XFS work. Your block device driver, whether libata, SCSI, or
> proprietary RAID card driver, is placing work in these queues as well.
---
Hmmm.... I hadn't thought of the driver doing that... I sort of thought
it just took blocks as fed by the kernel, and when it was done with
a DMA it told the kernel it was done and was ready for another.
I thought such drivers did direct I/O at that point -- i.e., that they
sit below the elevator algorithm?
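(If I wanted to check that, I suppose I could watch the merge counters in
/proc/diskstats -- they show how many requests the block layer coalesced
before the driver ever saw them. A rough sketch; 'sdb' is just a placeholder
for whatever device node the RAID controller presents:)

#!/usr/bin/env python
# Rough sketch: sample /proc/diskstats twice and report how many write
# requests the block layer merged (elevator coalescing) before handing
# them to the driver.  'sdb' is only a placeholder device name.
import time

DEV = 'sdb'

def sample():
    with open('/proc/diskstats') as f:
        for line in f:
            fields = line.split()
            if fields[2] == DEV:
                # field 8 = writes completed, field 9 = writes merged
                return int(fields[7]), int(fields[8])
    raise SystemExit('device %s not found' % DEV)

w0, m0 = sample()
time.sleep(10)
w1, m1 = sample()
print('writes issued to driver: %d, writes merged by elevator: %d'
      % (w1 - w0, m1 - m0))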
> The work queues are not limited to filesystems and block device drivers.
> Any device driver or kernel subsystem can use work queues.
---
True, but when I see a specific number of them come up and work
constantly while I unpack a tar, I see them as related to that
command. What else would be using that much CPU?
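(To partly answer my own question -- I suppose I could just dump the stacks
of the busy kworkers and see whose code is actually running in them, as Stan
suggests. Something like this rough sketch, assuming the kernel exposes
/proc/<pid>/stack, i.e. CONFIG_STACKTRACE:)

#!/usr/bin/env python
# Rough sketch: print the kernel stack of every kworker thread, to see what
# is actually running inside them.  Needs root, and a kernel built with
# CONFIG_STACKTRACE so that /proc/<pid>/stack exists.
import os

for pid in sorted((p for p in os.listdir('/proc') if p.isdigit()), key=int):
    try:
        with open('/proc/%s/comm' % pid) as f:
            comm = f.read().strip()
        if not comm.startswith('kworker'):
            continue
        with open('/proc/%s/stack' % pid) as f:
            stack = f.read()
        print('=== %s (pid %s) ===' % (comm, pid))
        print(stack)
    except IOError:
        # thread exited, or not enough permission -- skip it
        continue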
>
> Nothing bypasses the elevator; sectors are still sorted. But keep in
> mind if you're using a hardware RAID controller -it- does the final
> sorting of writeback anyway, so this is a non issue.
LSI RAID.
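(FWIW, which elevator the kernel itself is using on top of that, and how deep
its request queue is, can be read straight out of sysfs -- a trivial sketch,
with 'sdb' again standing in for whatever device node the controller presents:)

#!/usr/bin/env python
# Trivial sketch: show the I/O scheduler (elevator) and request-queue depth
# the kernel is using for a block device.  'sdb' is a placeholder -- use
# whatever device node sits on top of the RAID controller.
dev = 'sdb'
for attr in ('scheduler', 'nr_requests'):
    path = '/sys/block/%s/queue/%s' % (dev, attr)
    with open(path) as f:
        print('%s: %s' % (path, f.read().strip()))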
>
> So in a nutshell, whatever performance issue you're having, if you
> indeed have an issue, isn't caused by work queues or the number of
> kworker threads on your system, per CPU, or otherwise.
Um... but it could be made worse by having an excessive number of
threads all contending for a limited resource. The more contenders
for a limited resource, the more work the scheduler has to do sorting
out who gets access to it next.
If you have 6 threads dumping sectors to different areas of the disk,
so that a seek is needed to move from one thread's output to the next,
then you pay a seek penalty on every thread switch -- whereas if the
work were coalesced and sorted into 1 queue, 1 worker could do
the work of the 6 without the extra seeks between the different
kworkers emptying their queues.
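(A toy model of what I mean, just to make the seek argument concrete -- the
numbers are purely illustrative, and it assumes a "seek" is any jump to a
non-adjacent sector, which is obviously much cruder than real writeback:)

#!/usr/bin/env python
# Toy model of the seek argument: 6 writers each own their own contiguous
# region of the disk.  Count "seeks" (any jump to a non-adjacent sector)
# when their chunks are submitted round-robin vs. merged into one sorted
# stream first.  Numbers are purely illustrative.
REGION = 1000        # sectors per writer's region
WRITERS = 6
CHUNK = 8            # sectors written per submission

def count_seeks(starts):
    seeks, last_end = 0, None
    for start in starts:
        if last_end is not None and start != last_end:
            seeks += 1
        last_end = start + CHUNK
    return seeks

per_writer = [[w * REGION + i * CHUNK for i in range(REGION // CHUNK)]
              for w in range(WRITERS)]

# interleaved: each writer submits one chunk per turn, round-robin
interleaved = [per_writer[w][i] for i in range(REGION // CHUNK)
               for w in range(WRITERS)]
# coalesced: everything merged and sorted into one stream before issue
coalesced = sorted(sum(per_writer, []))

print('round-robin from 6 queues: %d seeks' % count_seeks(interleaved))
print('one merged, sorted stream: %d seeks' % count_seeks(coalesced))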
> You need to look
> elsewhere for the bottleneck. Given it's lightning fast up to the point
> buffers start flushing to disk it's pretty clear your spindles simply
> can't keep up.
----
That's not the point (though it is a given). What I'm focusing on
is how the kernel handles a backlog.
If I want throughput, I use 1 writer -- to an unfragmented file that
won't require seeks. If I try to use 2 writers -- each to unfragmented
files -- and run them at the same time, it's almost certain that throughput
will drop, since the disk has to seek back and forth between the two files
to give "disk-write resources" to each writer.
It would be faster to do both files sequentially than to try to do them
in parallel. The disk is limited to ~1GB/s, and every seek needed to get
the files out reduces that. So tar splats 5000 files into memory; then it
takes time for those to be written. If I write 5000 files sequentially
with 1 writer, I will get better performance than if I use 25 threads each
dumping 50 files in parallel. The disk subsystem's responsiveness drops
due to all the seeks between writes, whereas if it were 1 big sorted write,
it could be written out in 1-2 elevator passes... I don't think it is being
that efficient. Thus my question about whether having "too many writers"
accessing a resource at the same time is really the optimal way to improve
throughput.
I'm not saying there is a "problem" per se; I'm just asking/wondering
how so many writers can avoid having the disk seek all over the place to
round-robin service their requests.
FWIW, the disk could probably handle 2-3 writers and show an improvement
over a single one -- but with anything over that, I have started to see an
overall drop in throughput.
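(For anyone who wants to poke at it, the kind of crude measurement I have in
mind is below -- TARGET_DIR, file sizes and writer counts are all made-up
knobs for a dedicated scratch directory on the array, and it only means
anything if the data actually reaches the platters, hence the fsync per file
and the sync at the end:)

#!/usr/bin/env python
# Crude sketch: write the same total amount of data with 1..N concurrent
# writer threads and report MB/s for each writer count.  TARGET_DIR is a
# hypothetical scratch directory on the array and gets wiped between runs.
import os, time, threading

TARGET_DIR = '/mnt/scratch'        # hypothetical scratch dir (gets wiped!)
FILES_TOTAL = 200
FILE_SIZE = 8 * 1024 * 1024        # 8 MiB per file
BLOCK = b'\0' * (1024 * 1024)

def write_files(names):
    # each writer thread writes its share of the files and fsyncs each one
    for name in names:
        with open(name, 'wb') as f:
            for _ in range(FILE_SIZE // len(BLOCK)):
                f.write(BLOCK)
            f.flush()
            os.fsync(f.fileno())

def run(nthreads):
    names = [os.path.join(TARGET_DIR, 'f%04d' % i) for i in range(FILES_TOTAL)]
    threads = [threading.Thread(target=write_files, args=(names[i::nthreads],))
               for i in range(nthreads)]
    t0 = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    os.system('sync')
    mb = FILES_TOTAL * FILE_SIZE / (1024.0 * 1024.0)
    print('%2d writers: %6.1f MB/s' % (nthreads, mb / (time.time() - t0)))

for n in (1, 2, 4, 8, 16):
    run(n)
    for name in os.listdir(TARGET_DIR):   # clean up between runs
        os.unlink(os.path.join(TARGET_DIR, name))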