Problems with TCP + netem packet loss simulation

From: Elvis Pfützenreuter <epx@xxxxxxxxxxxxxxxxxxx>
Date: Thu, 11 Nov 2004 21:15:23 -0200

I am experiencing a strange problem when I try to simulate a lossy channel
with the netem scheduler module. Whatever the packet loss percentage, all
TCP connections eventually stall.
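
For reference, a netem loss setup of this kind might look like the sketch
below. The interface name (eth0) and the 5% rate are assumptions chosen for
illustration, not the exact values used in these tests. The commands need
root privileges.

```shell
# Assumption: eth0 is the interface facing the other host.
# Attach netem as the root qdisc and drop 5% of outgoing packets at random.
tc qdisc add dev eth0 root netem loss 5%

# Show the qdisc and its counters (sent and dropped packets).
tc -s qdisc show dev eth0

# Remove netem again to restore the default qdisc.
tc qdisc del dev eth0 root
```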

The TCP connection exchanges no packets after the stall (not even when the
process is killed), but ping traffic to the other host flows normally, with
the expected loss rate.

The first thing to be suspected would be my throughput testing application, 
but other TCP connections (including ssh sessions) will stall too, more often 
if the packet loss is high.

I tried kernels 2.6.9-rc4, 2.6.9, and 2.6.10-rc1, with the same results. No
other netem function (delay, reordering [2.6.10-rc1], etc.) stalls the TCP
connection.

I suspected a bug in the TCP stack (SCTP connections sometimes stall for 1
second, but they recover and go on), but if I generate packet loss with an
"nth" iptables rule, the connection does not stall. I even tried pulling
network cables, turning off the switch, etc. to see if any "real" packet loss
would stall TCP the same way, but it didn't happen.
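
As an illustration of the iptables comparison, a rule along these lines drops
every 20th incoming TCP packet. This is a sketch under assumptions: the chain,
the protocol filter, and the interval of 20 are made up for the example, and
the "nth" match comes from patch-o-matic, so it must be present in both the
kernel and the iptables binary. Newer kernels provide the same behaviour
through the "statistic" match in nth mode.

```shell
# Assumption: drop 1 out of every 20 incoming TCP packets (about 5% loss).
# Requires the patch-o-matic "nth" match and root privileges.
iptables -A INPUT -p tcp -m nth --every 20 -j DROP

# Inspect the rule and its packet counters.
iptables -L INPUT -v -n

# Delete the rule when the test is done.
iptables -D INPUT -p tcp -m nth --every 20 -j DROP
```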

And, since I use packet loss-based techniques (tc ingress filter) to do QoS on
my ADSL, I would have been experiencing problems with TCP well before I tried
netem. So it seems to be some interaction between TCP and netem.

I am using 2 machines on a local Ethernet network. I haven't yet tried putting
netem on a third machine (acting as the router) to see if it changes
anything; I am going to try this later. But if anyone is aware of the cause
of the problem I am experiencing, I'd like to hear about it.



