John Heffner wrote:
On Sat, 19 Mar 2005, Andi Kleen wrote:
Stephen Hemminger <shemminger@xxxxxxxx> writes:
Since developers want to experiment with different congestion
control mechanisms, and the kernel is getting bloated with overlapping
data structure and code for multiple algorithms; here is a patch to
split out the Reno, Vegas, Westwood, BIC congestion control stuff
into an infrastructure similar to the I/O schedulers.
Did you do any benchmarks to check that it won't slow things down?
I would recommend trying it on an IA64 machine if possible. In the
past we found that adding indirect function calls to networking on IA64
caused measurable slowdowns in macrobenchmarks.
In that case it was LSM callbacks, but your code looks like it will
add even more.
Is there a canonical benchmark?
I would put forth netperf - but then I'm of course biased. It is reasonably
straightforward to run, is sophisticated enough to look for interesting things,
and not so big as some benchmarketing benchmarks that require other software
besides the stack (e.g. web servers and whatnot).
If using netperf (version numbers not to be confused with Linux kernel
versions) at versions < 2.4.0, make sure it is compiled with the makefile
edited to have -DUSE_PROC_STAT and _NOT_ have -DHISTOGRAM or -DINTERVALS. If
using the rc1 of 2.4.0, just typing "configure" after unpacking the tar file
should suffice under Linux, but before compiling make sure config.h has a
"USE_PROC_STAT" in it. If it is missing USE_PROC_STAT, then add
--enable-cpuutil=procstat to the configure step.
Be certain to request CPU utilization numbers with the -c/-C options. Probably
best to request confidence intervals. I'd suggest a "128x32" TCP_STREAM test and
a "1x1" TCP_RR test. So, something along the lines of:
netperf -H <remote> -i 10,3 -I 99,5 -l 60 -t TCP_STREAM -- -s 128K -S 128K -m 32K
to have netperf request 128KB socket buffers and pass 32KB in each call to
send. Each iteration lasts 60 seconds, and netperf runs at least three and no
more than 10 iterations to get to the point where it is 99% certain
("confident") that the reported means for throughput and CPU util are within
+/- 2.5% of the actual means. You can make that -I 99,2 to be +/- 1%, at the
risk of having a harder time hitting the confidence intervals. If at first you
do not hit the confidence intervals, you can increase the maximum in -i up to
30 and/or increase the iteration run time with -l.
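For the curious, the -i/-I behavior can be thought of as a stopping rule: keep
iterating until the confidence half-width falls within the requested fraction
of the running mean, or give up at the iteration cap. A rough Python sketch of
that rule - not netperf's actual code; the function names and the small t-table
here are illustrative:

```python
import math
import statistics

# Two-sided Student-t critical values for 99% confidence, indexed by
# degrees of freedom (illustrative subset; netperf's real tables differ).
T_99 = {2: 9.925, 3: 5.841, 4: 4.604, 5: 4.032, 6: 3.707,
        7: 3.499, 8: 3.355, 9: 3.250}

def run_until_confident(sample_iteration, min_iters=3, max_iters=10,
                        interval=0.05):
    """Mimic `-i 10,3 -I 99,5`: iterate until the 99% confidence
    half-width is within interval/2 (i.e. +/- 2.5%) of the running
    mean, or until max_iters is reached."""
    results = []
    for _ in range(max_iters):
        results.append(sample_iteration())
        n = len(results)
        if n < min_iters:
            continue
        mean = statistics.mean(results)
        half_width = T_99[n - 1] * statistics.stdev(results) / math.sqrt(n)
        if half_width <= (interval / 2) * mean:
            return mean, n              # hit the confidence interval
    return statistics.mean(results), max_iters  # gave up; report anyway
```

With low-variance data it stops at the minimum iteration count; noisy data
pushes it toward the cap, which is why bumping -i up to 30 or lengthening -l
helps when the intervals are hard to hit.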
For the TCP_RR test:
netperf -H <remote> -i 10,3 -I 99,5 -l 60 -t TCP_RR
which will be as above except running a TCP_RR test. The default in a TCP_RR
test is to have a single-byte request and a single-byte response.
If you grab 2.4.0rc1 and run on an MP system, it may be good for
reproducibility to use the -T option to pin netperf and/or netserver to
specific CPUs.
-T 0 will attempt to bind both netperf and netserver to CPU 0
-T 1,0 will attempt to bind netperf to CPU 1 and netserver to CPU 0
-T 0, will bind netperf to CPU 0 and leave netserver floating
-T ,1 will bind netserver to CPU 1 and leave netperf floating
I would suggest two situations - one with netperf/netserver bound to the same
CPU as the one taking interrupts from the NIC, and one where they are not. How
broad the "where it is not" case needs/wants to be depends on just how many
degrees of "not the same CPU" one has on the system (thinking NUMA).
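One way to find which CPU is "the one taking interrupts from the NIC" is to
look at /proc/interrupts and see which CPU column is accumulating counts for
the NIC's IRQ line. A small illustrative Python parser - the sample layout and
device names are assumptions, and eyeballing the file works just as well:

```python
def busiest_irq_cpu(interrupts_text, device):
    """Return the index of the CPU taking the most interrupts for the
    /proc/interrupts line whose label mentions `device` (e.g. a NIC
    driver or interface name)."""
    for line in interrupts_text.splitlines():
        if device in line:
            # Layout: "IRQ:  per-CPU counts...  controller  device"
            counts = [int(f) for f in line.split()[1:] if f.isdigit()]
            return max(range(len(counts)), key=counts.__getitem__)
    raise ValueError("no interrupt line for %s" % device)
```

Whatever index comes back is the CPU to name with -T for the "same CPU" case;
the "not the same CPU" cases pick from the remaining indices.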
netperf bits can be found at:
with the 2.4.0rc1 bits in the experimental/ subdirectory. There is a Debian
package floating around somewhere, but I cannot recall the revision of netperf
on which it is based, so it is probably best to grab source bits and compile
them.
Interrupt avoidance/coalescing may have a noticeable effect on single-stream
netperf TCP_RR performance, capping it at a lower transactions-per-second
rating no matter the increase in CPU util. So, it is very important to include
the CPU util measurements. Similarly, if a system can already max-out a GbE
link, just looking at bits per second does not suffice.
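To see why coalescing can cap a single synchronous stream no matter how much
CPU headroom remains, a back-of-the-envelope sketch - the latency figures
below are made-up illustrations, not measurements of any particular NIC:

```python
def max_single_stream_tps(rtt_us, holdoff_us_per_side):
    """Upper bound on transactions/second for one synchronous
    request/response stream: every transaction pays the round-trip
    time plus one interrupt-holdoff delay at each end, regardless of
    how idle the CPUs are."""
    return 1_000_000.0 / (rtt_us + 2 * holdoff_us_per_side)

# Made-up illustration: a 50 us RTT allows up to 20000 trans/s with no
# coalescing, but only 4000 trans/s with a 100 us holdoff per side.
```

The CPUs may be nearly idle in the second case, which is exactly why the CPU
util numbers have to accompany the transaction rate.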
For situations where the CPU utilization measurement mechanism is questionable
(I'm still not sure about the -DUSE_PROC_STAT stuff and interrupt time...any
comments there most welcome) it may be preferable to run aggregate tests.
Netperf2 has no explicit synchronization, but if one is content with
"stopwatch" accuracy, aggregate performance along the lines of:
for i in `seq 1 $N`; do
    netperf -t TCP_RR -H <remote> -i 30 -l 60 -P 0 -v 0 &
done
may suffice. The -P 0 stuff disables output of the test headers. The -v 0 will
cause just the Single Figure of Merit (SFM) to be displayed - in this case the
transactions-per-second rate. Here the -i 30 is to make each instance of
netperf run 30 iterations, the intent being that at least 28 of those will
take place while the other N-1 netperfs are running. And hitting the (default
-I 99,5) confidence interval gives us some confidence that any skew is
reasonably close to epsilon.
The idea is to take N high enough to saturate the CPU(s) in the system and peak
the aggregate transaction rate. Single-byte is used to avoid pegging the link
on bits per second. Since this is "stopwatch" I tend to watch to make sure that
they all start and end "close" to one another. (NB the combination of -i 30
and -l 60 means each netperf will run for about half an hour...alter at your
discretion)
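With -P 0 -v 0 each instance prints only its single figure of merit, so
tallying the aggregate rate afterwards is trivial. A sketch, assuming one
number per instance has been collected (this helper is hypothetical, not part
of netperf):

```python
def aggregate_sfm(instance_outputs):
    """Sum the single figure of merit (here, transactions/second)
    printed by each netperf instance run with -P 0 -v 0; blank
    entries from instances that failed are skipped."""
    return sum(float(s) for s in instance_outputs if s.strip())
```

Plotting that sum against N shows where the SUT's aggregate transaction rate
peaks.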
For aggregate tests it is generally best to have three systems - the System
Under Test (SUT) and a pair or more of load generators (LGs) - with just a
pair of systems, the lone LG can saturate before the SUT would when driven by
two or more LGs.
Would you really expect a single extra indirect call per ack to have a
significant performance impact? This is surprising to me. Where does the
cost come from? Replacing instruction cache lines?
I don't have specific data on hand, but the way the selinux stuff used to be
(still is?) implemented did indeed not run very well at all, even when selinux
was disabled (enabled was another story entirely...)
Even if a single extra indirect call costs nearly epsilon, the "thousand cuts"
principle would apply. Enough of them and the claims about other OSes having
faster networking may actually become true - if they aren't true already. But I may