netdev

Re: RFC: NAPI packet weighting patch

To: jamal <hadi@xxxxxxxxxx>
Subject: Re: RFC: NAPI packet weighting patch
From: Jesse Brandeburg <jesse.brandeburg@xxxxxxxxx>
Date: Thu, 9 Jun 2005 14:37:31 -0700 (PDT)
Cc: "David S. Miller" <davem@xxxxxxxxxxxxx>, "Brandeburg, Jesse" <jesse.brandeburg@xxxxxxxxx>, "Ronciak, John" <john.ronciak@xxxxxxxxx>, shemminger@xxxxxxxx, "Williams, Mitch A" <mitch.a.williams@xxxxxxxxx>, mchan@xxxxxxxxxxxx, buytenh@xxxxxxxxxxxxxx, jdmason@xxxxxxxxxx, netdev@xxxxxxxxxxx, Robert.Olsson@xxxxxxxxxxx, "Venkatesan, Ganesh" <ganesh.venkatesan@xxxxxxxxx>
In-reply-to: <1118237775.6382.34.camel@xxxxxxxxxxxxxxxxxxxxx>
References: <468F3FDA28AA87429AD807992E22D07E0450C01F@orsmsx408> <20050607.132159.35660612.davem@xxxxxxxxxxxxx> <Pine.LNX.4.62.0506071852290.31708@ladlxr> <20050607.204339.21591152.davem@xxxxxxxxxxxxx> <1118237775.6382.34.camel@xxxxxxxxxxxxxxxxxxxxx>
Reply-to: "Jesse Brandeburg" <jesse.brandeburg@xxxxxxxxx>
Sender: netdev-bounce@xxxxxxxxxxx
On Wed, 8 Jun 2005, jamal wrote:
> > Something is up, if a single gigabit TCP stream can fully CPU
> > load your machine.  10 gigabit, yeah, definitely all current
> > generation machines are cpu limited over that link speed, but
> > 1 gigabit should be no problem.
>

> Yes, sir.
> BTW, all along I thought the sender and receiver were hooked up directly
> (there was some mention of Chariot a while back).

Okay, let me clear this up once and for all. Here is our test setup:

* ten 1U rack machines (dual P3, 1250 MHz), with both Windows and Linux installed (running Windows now)
* one Extreme 1-gig switch
* dual 2.8 GHz P4 server, RHEL3 base, running 2.6.12-rc5 either plain or with the supertso patch

* the test entails transferring 1 MB files of zeros from memory to memory, using TCP, with each client doing primarily either send or recv, not both
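For anyone wanting to reproduce something similar, here is a single-connection loopback sketch of that memory-to-memory transfer (a hypothetical stand-in, not the actual multi-client harness or Chariot endpoints used above):

```python
import socket
import threading

CHUNK = 64 * 1024
FILE_SIZE = 1 * 1024 * 1024   # 1 MB of zeros, as in the test described above

def recv_all(conn):
    """Count bytes until the peer closes the connection."""
    total = 0
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        total += len(data)
    return total

def run_once():
    # loopback stand-in for one client/server pair in the rack setup
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    result = {}
    def server():
        conn, _ = srv.accept()
        with conn:
            result["bytes"] = recv_all(conn)

    t = threading.Thread(target=server)
    t.start()

    # the sender side: push FILE_SIZE bytes of zeros, then close
    with socket.create_connection(("127.0.0.1", port)) as c:
        buf = bytes(CHUNK)
        sent = 0
        while sent < FILE_SIZE:
            c.sendall(buf)
            sent += CHUNK
    t.join()
    srv.close()
    return result["bytes"]
```

Timing `run_once()` and dividing `FILE_SIZE` by the elapsed time gives a crude per-connection throughput number; the real test ran ten such clients against one server.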

> Even if they did have some smart-ass thing in the middle that reorders,
> it is still surprising that such a fast CPU can't handle a mere one gig of
> what seem to be MTU=1500-byte-sized packets.

It can handle a single thread (or even six) just fine; it's after that that we get into trouble somewhere.
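As an aside on "a mere one gig of MTU=1500 packets": the implied packet rate is easy to bound with standard Ethernet framing overhead (header + FCS, preamble, inter-frame gap):

```python
# Maximum MTU-sized frame rate at 1 Gbit/s line rate.  On the wire each
# frame carries, besides the 1500-byte payload, 18 bytes of Ethernet
# header + FCS, an 8-byte preamble, and a 12-byte inter-frame gap.
LINE_RATE_BPS = 10**9
WIRE_BYTES = 1500 + 18 + 8 + 12          # 1538 bytes per frame on the wire

pps = LINE_RATE_BPS // (WIRE_BYTES * 8)
print(pps)                               # 81274 packets/s per direction
```

So the rx path here only has to keep up with roughly 81k packets/s per direction, which is indeed modest for a 2.8 GHz P4.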

> I suppose a netstat -s would help for visualization, in addition to those
> dumps.

Okay, I have that data. Do you want it for the old TSO, supertso, or no TSO at all?

> Here's what I am deducing from their data; correct me if I am wrong:
> -> The evidence is that something is expensive in their code path (duh).

Actually, I've found that adding more threads (10 total) sending to the server, while keeping the transmit thread count constant, yields an increase in our throughput all the way to 1750+ Mb/s (with supertso).

> -> Whatever that expensive code is, it is not helped by them
> replenishing the descriptors only after all the budget is exhausted, since the
> descriptor departure rate is much slower than the packet arrival rate.

I'm running all my tests with the replenish patch mentioned earlier in this thread.

> ---> This is why they would be seeing that the reduction of weight
> improves performance, since the replenishing happens sooner with a
> smaller weight.

Seems like we're past the weight problem now; should I start a new thread?

> ------> Clearly the driver needs some fixing - if they could do what

I'm not convinced it is the driver that is having issues. We might be hitting some complex interaction with the stack, but I definitely think we have a lot of onion layers to hack through here, all of which are probably relevant.

> Even if they SACKed for every packet, this still would not make any
> sense. So I think a profile of where the cycles are spent would also
> help. I am suspecting the driver at this point, but I could be wrong.

I have profile data. Here is an example with 5 tx / 5 rx threads, where the total throughput was 1236 Mb/s (936 tx, 300 rx), on 2.6.12-rc5 with the old TSO (the original problem case). We are at 100% CPU and generating 3289 ints/s, with no hardware drops reported, probably due to my replenish patch.
CPU: P4 / Xeon with 2 hyper-threads, speed 2791.36 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) 
with a unit mask of 0x01 (mandatory) count 100000
samples  %        image name       symbol name
533687    8.1472  vmlinux          pskb_expand_head
428726    6.5449  vmlinux          __copy_user_zeroing_intel
349934    5.3421  vmlinux          _read_lock_irqsave
313667    4.7884  vmlinux          csum_partial
218870    3.3413  vmlinux          _spin_lock
214302    3.2715  vmlinux          __copy_user_intel
193662    2.9564  vmlinux          skb_release_data
177755    2.7136  vmlinux          ipt_do_table
148445    2.2662  vmlinux          _write_lock_irqsave
148080    2.2606  vmlinux          _read_unlock_bh
143308    2.1877  vmlinux          tcp_sendmsg
115745    1.7670  vmlinux          ip_queue_xmit
111487    1.7020  vmlinux          __kfree_skb
108383    1.6546  vmlinux          _spin_lock_irqsave
108071    1.6498  e1000.ko         e1000_xmit_frame
107850    1.6464  vmlinux          tcp_clean_rtx_queue
104552    1.5961  e1000.ko         e1000_clean_tx_irq
101308    1.5466  e1000.ko         e1000_clean_rx_irq
94297     1.4395  vmlinux          __copy_from_user_ll
85170     1.3002  vmlinux          kfree
76730     1.1714  vmlinux          tcp_transmit_skb
70976     1.0835  vmlinux          eth_type_trans
67381     1.0286  vmlinux          tcp_rcv_established
64670     0.9872  vmlinux          sub_preempt_count
64451     0.9839  vmlinux          dev_queue_xmit
64010     0.9772  vmlinux          skb_clone
62314     0.9513  vmlinux          tcp_v4_rcv
61980     0.9462  vmlinux          nf_iterate
60374     0.9217  vmlinux          ip_finish_output
57407     0.8764  vmlinux          _write_unlock_bh
56165     0.8574  vmlinux          mark_offset_tsc
54673     0.8346  endpoint         (no symbols)
52662     0.8039  vmlinux          __kmalloc
50112     0.7650  vmlinux          sock_wfree
50001     0.7633  vmlinux          _spin_trylock
47053     0.7183  vmlinux          _read_lock_bh
45988     0.7021  vmlinux          tcp_write_xmit
44229     0.6752  vmlinux          kmem_cache_alloc
43506     0.6642  vmlinux          smp_processor_id
42401     0.6473  vmlinux          ip_conntrack_find_get
42095     0.6426  vmlinux          alloc_skb
40619     0.6201  vmlinux          tcp_in_window
38098     0.5816  vmlinux          add_preempt_count
37701     0.5755  vmlinux          __copy_to_user_ll
31529     0.4813  vmlinux          ip_conntrack_in
31314     0.4780  vmlinux          kmem_cache_free
30954     0.4725  vmlinux          __ip_conntrack_find
30863     0.4712  vmlinux          local_bh_enable
30774     0.4698  vmlinux          tcp_packet
29426     0.4492  vmlinux          _spin_unlock_irqrestore
28716     0.4384  vmlinux          hash_conntrack
27073     0.4133  vmlinux          ip_route_input
26540     0.4052  e1000.ko         e1000_clean
25817     0.3941  vmlinux          nf_hook_slow
23395     0.3571  vmlinux          schedule
22981     0.3508  vmlinux          tcp_v4_send_check
22139     0.3380  vmlinux          __mod_timer
22126     0.3378  vmlinux          timer_interrupt
21511     0.3284  vmlinux          cache_alloc_refill
21161     0.3230  vmlinux          netif_receive_skb
20418     0.3117  vmlinux          _write_lock_bh
19443     0.2968  vmlinux          skb_copy_datagram_iovec
19100     0.2916  vmlinux          ip_nat_fn
18784     0.2868  vmlinux          ip_local_deliver
18251     0.2786  vmlinux          _read_lock
17513     0.2674  vmlinux          nat_packet
17124     0.2614  e1000.ko         e1000_intr
16357     0.2497  vmlinux          default_idle
15358     0.2345  vmlinux          qdisc_restart
14564     0.2223  vmlinux          _read_unlock
14360     0.2192  vmlinux          tcp_recvmsg
13853     0.2115  oprofiled        odb_insert
13374     0.2042  e1000.ko         e1000_alloc_rx_buffers
13321     0.2034  vmlinux          apic_timer_interrupt
12668     0.1934  vmlinux          pfifo_fast_enqueue
12618     0.1926  vmlinux          tcp_sack
12180     0.1859  vmlinux          ip_nat_local_fn
11434     0.1746  vmlinux          system_call
11426     0.1744  vmlinux          free_block
11377     0.1737  vmlinux          try_to_wake_up
11138     0.1700  vmlinux          irq_entries_start
11017     0.1682  vmlinux          ipt_route_hook
10987     0.1677  vmlinux          dev_queue_xmit_nit
10970     0.1675  vmlinux          tcp_push_one
10508     0.1604  vmlinux          tcp_error
10365     0.1582  vmlinux          pfifo_fast_dequeue
10323     0.1576  vmlinux          ip_rcv
10022     0.1530  vmlinux          ip_output