On 22 Mar 2005 08:41:22 +0100
Andi Kleen <ak@xxxxxx> wrote:
> On Mon, Mar 21, 2005 at 04:25:56PM -0500, John Heffner wrote:
> > On Sat, 19 Mar 2005, Andi Kleen wrote:
> > > Stephen Hemminger <shemminger@xxxxxxxx> writes:
> > >
> > > > Since developers want to experiment with different congestion
> > > > control mechanisms, and the kernel is getting bloated with overlapping
> > > > data structure and code for multiple algorithms; here is a patch to
> > > > split out the Reno, Vegas, Westwood, BIC congestion control stuff
> > > > into an infrastructure similar to the I/O schedulers.
> > >
> > > [...]
> > >
> > > Did you do any benchmarks to check that it won't slow things down?
> > >
> > > I would recommend trying it on an IA64 machine if possible. In the
> > > past we found that adding indirect function calls on IA64 to networking
> > > caused measurable slowdowns in macrobenchmarks.
> > > In that case it was LSM callbacks, but your code looks like it will
> > > add even more.
> > Is there a canonical benchmark?
> For the LSM case we saw the problem with running netperf over loopback.
> It added one or two hooks per packet, but it already made a noticeable
> difference on IA64 boxes.
> On other systems it is unnoticeable.
> > Would you really expect a single extra indirect call per ack to have a
> > significant performance impact? This is surprising to me. Where does the
> > cost come from? Replacing instruction cache lines?
> I was never quite clear. Some instruction stalls in the CPUs.
> One not very good theory was that McKinley really likes
> to have its jump registers loaded early for indirect calls, and gcc
> doesn't even attempt this.
Running netperf in loopback mode on a 2-CPU Opteron shows that the change is
very small when averaged over 10 runs. Overall there is
a 0.28% decrease in CPU usage and a 0.96% loss in throughput, but both of those
values are less than twice the standard deviation, which was 0.4% for the CPU
and 0.8% for the throughput measurements. I can't see it as worth
bothering about unless there is some big-money benchmark on the line, in which case
it would make more sense to look at other optimizations of the loopback path.