Re: your mail

To: Ronald G Minnich <rminnich@xxxxxxxx>
Subject: Re: your mail
From: Andi Kleen <ak@xxxxxxx>
Date: Tue, 5 Feb 2002 22:42:34 +0100
Cc: Andi Kleen <ak@xxxxxxx>, kuznet@xxxxxxxxxxxxx, netdev@xxxxxxxxxxx, nivedita@xxxxxxxxxxx, hendriks@xxxxxxxx, matt@xxxxxxxx
In-reply-to: <Pine.LNX.4.33.0202051400580.8565-200000@snaresland.acl.lanl.gov>
References: <20020205214910.B17532@wotan.suse.de> <Pine.LNX.4.33.0202051400580.8565-200000@snaresland.acl.lanl.gov>
Sender: owner-netdev@xxxxxxxxxxx
User-agent: Mutt/1.3.22.1i
On Tue, Feb 05, 2002 at 02:16:19PM -0700, Ronald G Minnich wrote:
> blue is sending a request to xed for a packet. Supermon (on xed) in turn
> sends status requests to several other nodes. Xed formats each "blob" of
> data from the other nodes and sends it to blue. The 'S' to supermon
> results in one byte of 'S' to the other nodes, and the data comes back
> from the other nodes as a large S-expression (in other words the delays
> you see in this tcpdump are NOT from the other nodes).
> 
> Here's one cycle. Blue sends 'S' (one byte) to 'xed', and 'xed' is sending
> data back. As you can see, in the middle of xed sending lots of data back,
> there is this fine ~40 ms. delay. This delay occurs for each transaction
> at least once, and limits us to 25 hz. request processing. We also had
> this problem before with the 'mon' programs on the nodes but have
> developed a workaround, which we will also be applying to supermon.
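
For reference, the 25 Hz ceiling quoted above follows directly from one stall of about 40 ms per transaction when requests are issued one after another. A minimal sketch of that arithmetic; the helper name max_rate_hz is made up for illustration and is not part of supermon or the kernel:

```c
#include <assert.h>

/* Hypothetical helper: upper bound on sequential transactions per second
 * when every transaction stalls at least once for stall_ms milliseconds.
 * It ignores all other per-transaction costs, so it is only a ceiling. */
static double max_rate_hz(double stall_ms)
{
    assert(stall_ms > 0.0);
    return 1000.0 / stall_ms;
}
```

Calling max_rate_hz(40.0) gives the 25 requests per second mentioned in the report.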

I remember at least one other report of such a spurious delay too. 

I suspect the problem is not the delayed ACK algorithm as such, but the
new user-context TCP fast path. 2.4 added a somewhat experimental
variant of Van Jacobson style user-context TCP. When the receiver
is blocking in recvmsg(), the packet is not processed directly in softirq
context, but is put on a special prequeue and the user process is woken up.
When the user process gets scheduled, it runs the TCP input path in its
own process context. If this doesn't work out (the process is scheduled too
late), the delayed ACK timer does the TCP input processing (it's admittedly
a bit of a hack).
The reason for this is to do csum-copy to user space, but when your NICs
do hardware checksumming anyway it shouldn't make much difference.
In addition we have an adaptive delayed ACK, which can make the delay a
bit unpredictable. It should probably be shorter.
I guess your workload for some reason prevents the fast wakeup of the
user context, and you frequently run into the timer-delayed TCP processing.

You can test the theory by turning off the fast path.

In tcp_v4_rcv() in net/ipv4/tcp_ipv4.c, replace

        if (!sk->lock.users) {
                if (!tcp_prequeue(sk, skb))
                        ret = tcp_v4_do_rcv(sk, skb);
        } else

with

        if (!sk->lock.users) {
                ret = tcp_v4_do_rcv(sk, skb);
        } else

Does this make the delay go away?
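
To make the difference between the two variants easier to see, here is a toy user-space model of the dispatch decision. It is only a sketch under stated assumptions, not kernel code: struct toy_sock, enum rx_path and toy_rcv() are invented for illustration, the real tcp_prequeue() checks more than "is a reader blocked", and the else branch (not shown in the snippets above) is modelled simply as "deferred".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Where a received segment ends up in this toy model. */
enum rx_path {
    RX_SOFTIRQ,   /* processed immediately in softirq context */
    RX_PREQUEUE,  /* parked on the prequeue; the blocked reader is woken */
    RX_DEFERRED   /* socket held by a user; the else branch, not modelled */
};

/* Hypothetical stand-in for the few bits of socket state that matter here. */
struct toy_sock {
    bool locked_by_user;   /* models sk->lock.users != 0 */
    bool reader_blocked;   /* a process is sleeping in recvmsg() */
    bool prequeue_enabled; /* false models the patched tcp_v4_rcv */
};

/* Decide which path a segment takes. With prequeue_enabled set to false
 * the segment is always handled in softirq context when the socket is not
 * locked, which is what removing the tcp_prequeue() call achieves. */
static enum rx_path toy_rcv(const struct toy_sock *sk)
{
    assert(sk != NULL);
    if (sk->locked_by_user)
        return RX_DEFERRED;
    if (sk->prequeue_enabled && sk->reader_blocked)
        return RX_PREQUEUE;
    return RX_SOFTIRQ;
}
```

In this model a segment that takes RX_PREQUEUE depends on the reader being scheduled promptly; if it is not, processing waits for the delayed ACK timer, which is the suspected source of the ~40 ms gap. With the patch applied, that path is never taken.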


-Andi


