To whom it may concern,
We have observed TCP bandwidth problems when running benchmarks that
gradually increase their payload size and have (most of) their data flowing
in one direction. The degradation is observed between two dual Xeon/E7500
machines connected by GbE over a cross-over cable. The GbE interface is eth1
and has an MTU of 9000; eth0 is Fast Ethernet with an MTU of 1500. The
machines are running Linux 2.4.20, and the NIC in question is an Intel Corp.
82544GC Gigabit Ethernet Controller (rev 02).
While changing TCP parameter settings and systematically varying the socket
rx/tx buffer sizes, I discovered that once in a while the benchmark ran
well. The good runs were not deterministic and had nothing to do with the
parameter changes. Hence, I let the machines run over the weekend, taking
tcpdumps and saving the one from a successful run. Comparing the "OK" run
with the "BAD" run reveals a couple of interesting things. The first problem
is that the receiver's advertised window does not increase. The second
problem, which also affects the "OK" run, is that the ratio between the
number of packets received (#advertisements) and the number of packets sent
(from src to dst) is 1 for the "BAD" scenario (this is probably a
consequence of the window being only ~2 x MTU in size). Surprisingly, it is
as large as 0.5 for the "OK" scenario. IMHO, this violates RFC 813, which
states:

    There are two reasons for prompt acknowledgement. One is to
    prevent retransmission. We will discuss later how to determine whether
    unnecessary retransmission is occurring. The other reason one
    acknowledges promptly is to permit further data to be sent. However,
    the previous section makes quite clear that it is not always desirable
    to send a little bit of data, even though the receiver may have room for
    it. Therefore, one can state a general rule that under normal
    operation, the receiver of data need not, and for efficiency reasons
    should not, acknowledge the data unless either the acknowledgement is
    intended to produce an increased useable window, is necessary in order
    to prevent retransmission or is being sent as part of a reverse
    direction segment being sent for some other reason. We will consider an
    algorithm to achieve these goals.

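
To illustrate why a window stuck at about 2 x MTU matters: a sender can have
at most one advertised window outstanding per round trip, so throughput is
bounded by window / RTT. The sketch below works through that bound. The
round-trip time is a hypothetical value chosen for illustration (it was not
measured in these runs), and the function name is mine.

```python
def window_limited_mb_per_s(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput in MB/s (1 MB = 1e6 bytes) when at most
    one advertised window can be in flight per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 1e-6 / rtt_seconds


MTU = 9000            # eth1 MTU from the setup described above
ASSUMED_RTT = 200e-6  # hypothetical round-trip time in seconds, NOT measured

# Compare a window of ~2 x MTU with the largest unscaled TCP window.
for label, window in (("~2 x MTU", 2 * MTU), ("65535", 65535)):
    bound = window_limited_mb_per_s(window, ASSUMED_RTT)
    print(f"{label:>9}: window={window:6d} bytes -> at most {bound:.1f} MB/s")
```

Under this assumed RTT, the small window alone would keep the link below
gigabit line rate; with a different RTT the numbers change, but the bound
still scales linearly with the advertised window.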
The two tcpdumps have been analyzed by a simple awk script, which reports
the following for every 1/10 of a second of the run time:

  time:     elapsed time relative to the first packet
  MB/s:     sum of TCP payload (based on TCP sequence numbers) * 1e-6 /
            delta time
  avg_len:  average packet length (TCP payload) sent
  nsent:    number of packets sent from src to dst
  avg_win:  average window size advertised by the src
  adv/sent: ratio between #advertisements (sum of prompt acks and
            piggybacked acks) and #packets sent
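
For readers who want to reproduce the per-interval figures, here is a
minimal Python sketch of the same kind of analysis. It is not the awk script
used for the enclosed files. The Packet record and the summarize function
are hypothetical names, the input is assumed to be already parsed from the
tcpdump output, the payload is summed from per-packet lengths rather than
from sequence-number differences, and avg_win is taken from the window field
of the packets flowing back from dst to src, which is my assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Packet:
    time: float     # capture timestamp in seconds
    from_src: bool  # True for src -> dst, False for dst -> src
    payload: int    # TCP payload length in bytes
    window: int     # advertised window carried in this packet


def summarize(packets, interval=0.1):
    """Return one dict per time interval with the columns listed above."""
    if not packets:
        return []
    t0 = min(p.time for p in packets)
    buckets = defaultdict(list)
    for p in packets:
        buckets[int((p.time - t0) / interval)].append(p)

    rows = []
    for idx in sorted(buckets):
        sent = [p for p in buckets[idx] if p.from_src]
        advs = [p for p in buckets[idx] if not p.from_src]
        payload = sum(p.payload for p in sent)
        rows.append({
            "time": idx * interval,
            "MB/s": payload * 1e-6 / interval,
            "avg_len": payload / len(sent) if sent else 0.0,
            "nsent": len(sent),
            "avg_win": sum(p.window for p in advs) / len(advs) if advs else 0.0,
            "adv/sent": len(advs) / len(sent) if sent else 0.0,
        })
    return rows


# Made-up sample: one ack per packet at first, then one ack per two packets.
sample = [
    Packet(0.00, True, 8948, 0), Packet(0.01, False, 0, 17896),
    Packet(0.02, True, 8948, 0), Packet(0.03, False, 0, 17896),
    Packet(0.12, True, 8948, 0), Packet(0.13, True, 8948, 0),
    Packet(0.14, False, 0, 35792),
]
for row in summarize(sample):
    print(row)
```

The sample data is invented purely to exercise the code; it is not taken
from "ok.txt" or "bad.txt".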
Extracts of the analysis are enclosed as "ok.txt" and "bad.txt". The same
information is also presented as graphs in "bw_winsiz.png" and
"bw_adv-sent.png". In the graphs, the "OK" data uses the bottom x-axis and
the "BAD" data uses the upper x-axis (the "BAD" run time is longer due to
the lower bandwidth). Bandwidth uses the left y-axis, whereas the average
advertised window size and the ratio of #advertisements to #packets sent use
the right y-axis.
I do hope this information is useful to you and that the problem can be
fixed. Since I do not read the mailing lists, I would appreciate a note back
by email if someone picks up on this.
Cheers, Håkon
--
Håkon Bugge; VP Product Development; Scali AS;
mailto:hob@xxxxxxxx; http://www.scali.com; fax: +47 22 62 89 51;
Voice: +47 22 62 89 50; Cellular (Europe+US): +47 924 84 514;
Visiting Addr: Olaf Helsets vei 6, Bogerud, N-0621 Oslo, Norway;
Mail Addr: Scali AS, Postboks 150, Oppsal, N-0619 Oslo, Norway;
Attachments:
  bw_adv-sent.png (PNG image)
  bad.txt (text document)
  bw_winsiz.png (PNG image)
  ok.txt (text document)