
Re: bad TSO performance in 2.6.9-rc2-BK

To: "David S. Miller" <davem@xxxxxxxxxxxxx>
Subject: Re: bad TSO performance in 2.6.9-rc2-BK
From: "David S. Miller" <davem@xxxxxxxxxxxxx>
Date: Thu, 30 Sep 2004 18:12:48 -0700
Cc: herbert@xxxxxxxxxxxxxxxxxxx, jheffner@xxxxxxx, ak@xxxxxxx, niv@xxxxxxxxxx, andy.grover@xxxxxxxxx, anton@xxxxxxxxx, netdev@xxxxxxxxxxx
In-reply-to: <20040930173439.3e0d2799.davem@xxxxxxxxxxxxx>
References: <20040929162923.796d142e.davem@xxxxxxxxxxxxx> <Pine.NEB.4.33.0409291945100.3434-100000@xxxxxxxxxxxxxx> <20040929170310.46c58095.davem@xxxxxxxxxxxxx> <20040930001007.GB10496@xxxxxxxxxxxxxxxxxxx> <20040930173439.3e0d2799.davem@xxxxxxxxxxxxx>
Sender: netdev-bounce@xxxxxxxxxxx
On Thu, 30 Sep 2004 17:34:39 -0700
"David S. Miller" <davem@xxxxxxxxxxxxx> wrote:

> If I disable /proc/sys/net/tcp_moderate_rcvbuf performance
> goes down from ~634Mbit/sec to ~495Mbit/sec.
> 
> Andi, I know you said that with TSO disabled things go 
> more smoothly.  But could you try upping the TCP socket
> receive buffer sizes on the 2.6.5 box to see if that gives
> you the performance back with TSO enabled?

Ok, here is something to play with.  This adds a sysctl
to moderate the percentage of the congestion window we'll
limit TSO segmenting to.

It defaults to 2, but a setting of 3 or 4 seems to make
Andi's case behave much better.
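
To put rough numbers on it (purely an illustration, nothing
below is from the patch itself): assume a 1448 byte MSS and a
congestion window of 32 segments, then the cap on a single TSO
super-frame works out as follows:

/* Illustration only: how the proposed shift caps the size of a
 * single TSO super-frame.  The MSS and cwnd values are made up;
 * the real computation is the tcp_output.c hunk below.
 */
#include <stdio.h>

int main(void)
{
	unsigned int mss = 1448;	/* typical ethernet MSS */
	unsigned int snd_cwnd = 32;	/* congestion window, in segments */
	int shift;

	for (shift = 2; shift <= 4; shift++) {
		unsigned int limit = snd_cwnd >> shift;

		if (limit < 1)
			limit = 1;
		printf("shift %d: %u segments per TSO frame (%u bytes)\n",
		       shift, limit, limit * mss);
	}
	return 0;
}

So going from a shift of 2 to a shift of 4 cuts the largest
possible burst per super-frame from 8 segments down to 2, a
factor of four.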

With such small receive buffers, netperf simply can't clear
the receive queue fast enough when a burst of TSO-created
frames comes in.

This is also where the stretch ACKs come from.  We defer
the ACK until recvmsg makes progress, because we cannot
advertise a larger window and thus the connection is
application limited.

I'm also thinking about whether this sysctl should be
a divisor instead of a shift, and also whether it should
be in terms of snd_cwnd or the advertised receiver
window, whichever is smaller.
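
Just to make that concrete, a divisor variant bounded by the
smaller of the two windows might look roughly like the sketch
below.  This is only a sketch, not part of the patch; the helper
name, the divisor value, and the assumption that tp->snd_wnd
(the window the peer advertised, in bytes) is the right input
for the receiver side are all hypothetical.

/* Sketch only: divisor variant, limited by whichever of the
 * congestion window and the receiver's advertised window is
 * smaller.  Names are hypothetical.
 */
static unsigned int tso_factor_limit(unsigned int snd_cwnd,
				     unsigned int snd_wnd_bytes,
				     unsigned int mss_now,
				     unsigned int divisor)
{
	unsigned int wnd = snd_wnd_bytes / mss_now; /* rcv window, in segments */
	unsigned int limit;

	if (snd_cwnd < wnd)
		wnd = snd_cwnd;		/* take the smaller window */
	limit = wnd / divisor;
	return limit ? limit : 1;	/* never drop below one segment */
}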

Basically, receivers with too small socket receive buffers
crap out if TSO bursts are too large.  This effect is
minimized the further the receiver is (RTT-wise) from
the sender, since the path tends to smooth out the bursts.
But on local gigabit LANs, the effect is quite pronounced.

Ironically, this case is a great example of how powerful
and incredibly effective John's receive buffer moderation
code is.  2.6.5 performance is severely hampered by the
lack of this code.
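
With the patch applied, the new knob should show up alongside
the other TCP sysctls as /proc/sys/net/ipv4/tcp_tso_cwnd_shift,
so it can be played with at run time, e.g.:

	echo 3 > /proc/sys/net/ipv4/tcp_tso_cwnd_shift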

===== include/linux/sysctl.h 1.88 vs edited =====
--- 1.88/include/linux/sysctl.h 2004-09-23 14:34:12 -07:00
+++ edited/include/linux/sysctl.h       2004-09-30 17:17:49 -07:00
@@ -341,6 +341,7 @@
        NET_TCP_BIC_LOW_WINDOW=104,
        NET_TCP_DEFAULT_WIN_SCALE=105,
        NET_TCP_MODERATE_RCVBUF=106,
+       NET_TCP_TSO_CWND_SHIFT=107,
 };
 
 enum {
===== include/net/tcp.h 1.92 vs edited =====
--- 1.92/include/net/tcp.h      2004-09-29 21:11:52 -07:00
+++ edited/include/net/tcp.h    2004-09-30 17:18:02 -07:00
@@ -609,6 +609,7 @@
 extern int sysctl_tcp_bic_fast_convergence;
 extern int sysctl_tcp_bic_low_window;
 extern int sysctl_tcp_moderate_rcvbuf;
+extern int sysctl_tcp_tso_cwnd_shift;
 
 extern atomic_t tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
===== net/ipv4/sysctl_net_ipv4.c 1.25 vs edited =====
--- 1.25/net/ipv4/sysctl_net_ipv4.c     2004-08-26 13:55:36 -07:00
+++ edited/net/ipv4/sysctl_net_ipv4.c   2004-09-30 17:19:32 -07:00
@@ -674,6 +674,14 @@
                .mode           = 0644,
                .proc_handler   = &proc_dointvec,
        },
+       {
+               .ctl_name       = NET_TCP_TSO_CWND_SHIFT,
+               .procname       = "tcp_tso_cwnd_shift",
+               .data           = &sysctl_tcp_tso_cwnd_shift,
+               .maxlen         = sizeof(int),
+               .mode           = 0644,
+               .proc_handler   = &proc_dointvec,
+       },
        { .ctl_name = 0 }
 };
 
===== net/ipv4/tcp_output.c 1.65 vs edited =====
--- 1.65/net/ipv4/tcp_output.c  2004-09-29 21:11:53 -07:00
+++ edited/net/ipv4/tcp_output.c        2004-09-30 17:27:32 -07:00
@@ -44,6 +44,7 @@
 
 /* People can turn this off for buggy TCP's found in printers etc. */
 int sysctl_tcp_retrans_collapse = 1;
+int sysctl_tcp_tso_cwnd_shift = 2;
 
 static __inline__
 void update_send_head(struct sock *sk, struct tcp_opt *tp, struct sk_buff *skb)
@@ -673,7 +674,7 @@
                    !tp->urg_mode);
 
        if (do_large) {
-               int large_mss, factor;
+               int large_mss, factor, limit;
 
                large_mss = 65535 - tp->af_specific->net_header_len -
                        tp->ext_header_len - tp->ext2_header_len -
@@ -688,8 +689,10 @@
                 * can keep the ACK clock ticking.
                 */
                factor = large_mss / mss_now;
-               if (factor > (tp->snd_cwnd >> 2))
-                       factor = max(1, tp->snd_cwnd >> 2);
+               limit = tp->snd_cwnd >> sysctl_tcp_tso_cwnd_shift;
+               limit = max(1, limit);
+               if (factor > limit)
+                       factor = limit;
 
                tp->mss_cache = mss_now * factor;
 
