I apologize for the over 10K email.. consider this documentation ;->
This is cross-posted to l-k; i would prefer the discussions on netdev
or cc netdev (i am not subscribed to l-k)
This is a port against 2.4.0-test8 based on the OLS presentation i made
"Fast Forwarding the Bird" available at:
There are lotsa improvements since OLS in a collaboration involving
Robert Olsson and myself.
Current snapshot is available via ftp from:
This includes a patched tulip driver; tested only on a DEC21143-based
This RFC at:
To make things interesting robur.slu.se is currently running these patches.
1) Alexey Kuznetsov : Without his FF and FC these thoughts might never have
been born. Alexey is still involved whenever time permits.
2) Robert Olsson : Many insights and current partner-in-crime
3) Donald Becker : Well thought network driver design.
Test updates September/2000:
Robert Olsson and I decided after the OLS that we were going to try to
hit the 100Mbps (148.8Kpps) routing peak by year end. I am afraid the
bar has been raised. Robert is already hitting ~148Kpps with 2.4.0-test7
on an ASUS CUBX motherboard carrying a PIII 700 MHz Coppermine, at
about 65% CPU utilization.
With a single PII based Dell machine i was able to get a consistent value of
So the new goal is to go to about 500Kpps ;-> (maybe not by year end, but
surely by that next random Linux hacker conference)
A sample modified tulip driver (hacked by Alexey for 2.2 and mod'ed by Robert
and myself over a period of time) is supplied as an example on how to use the
To begin, i have to say that forwarding 100Mbps of 64byte packets is _not_
a problem at all in Linux; Alexey's Fast Forwarding does a fine job.
FF, as in routers, is _not_ subject to firewalling (e.g. CISCO Fast routing).
So the challenge was to try and do it without FF turned on so we could do
** Very important to note:
Although the tests and improvements were for packet forwarding, the technique
used also applies to servers under heavy load. System congestion is moved down
to the hardware, thereby alleviating system overload.
I believe we could have done better with the mindcraft tests with these
changes in 2.2 (and HW FC turned on).
[Fact is, HW FC was there but i suppose no-one knew you could
use Alexey's hacked version of the Tulip ;->]
The changes proposed below are transparent to drivers that don't use them.
It is however highly encouraged they take advantage of the supplied
interface. This does not break anything in 2.4. It is a clean patch.
- This is a first cut, hopefully discussions will ensue and maybe a
revision of the patch. In particular of interest are the recent weird
requirements by the X.25 code.
Refer to thread on l-k "Re: Q: sock output serialization"
Henner, hoping to hear from you.
- I intend to submit this patch for inclusion in 2.4 since it is
non-intrusive. I suspect there will be about one more revision.
*** Proposed changes:
netif_rx() now returns a value to the driver (change from void).
The queue is divided into 4 configurable threshold regions:
* a no congestion zone
* a low congestion zone
* a high congestion zone
* a drop zone where packets are dropped
(the configuration interface is via /proc/sys/net/core/)
A positive value (different for each of the regions) implies that the packet
was successfully queued whereas a negative value implies it was dropped.
The default congestion threshold values are based on engineering
experimentation and not on theoretical scientific proofs. There are
probably better ways of drawing up the associated thresholds.
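
To make the four regions concrete, here is a minimal sketch of how a
queue length could be mapped to a feedback value. It is not the code from
the patch: the names (rx_feedback, rx_thresholds, classify_backlog) and
the use of three boundary values are assumptions made for illustration.
Only the sign convention (positive = queued, negative = dropped) comes
from the description above.

```c
#include <assert.h>

/* Hypothetical return codes: positive = queued, negative = dropped. */
enum rx_feedback {
    RX_DROP    = -1,  /* drop zone: packet was not queued */
    RX_SUCCESS =  1,  /* no congestion zone */
    RX_CN_LOW  =  2,  /* low congestion zone */
    RX_CN_HIGH =  3   /* high congestion zone */
};

/* Hypothetical thresholds, expressed as backlog queue lengths. */
struct rx_thresholds {
    int no_cong;  /* below this: no congestion */
    int lo_cong;  /* below this: low congestion */
    int hi_cong;  /* below this: high congestion; at or above: drop */
};

/* Map the current backlog length to one of the four regions. */
static int classify_backlog(int qlen, const struct rx_thresholds *t)
{
    /* The three boundaries must be ordered for the zones to make sense. */
    assert(t->no_cong <= t->lo_cong && t->lo_cong <= t->hi_cong);

    if (qlen >= t->hi_cong)
        return RX_DROP;
    if (qlen >= t->lo_cong)
        return RX_CN_HIGH;
    if (qlen >= t->no_cong)
        return RX_CN_LOW;
    return RX_SUCCESS;
}
```

A driver would then only need to test the sign to learn the fate of the
packet, and compare the positive values to gauge how congested the stack is.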
I would like to add that the HW FC feature is _a necessary requirement_
to complement this. Maybe i should say this complements HW FC.
If a driver keeps sending packets up to netif_rx() even when it's been given
feedback to stop, HW FC kicks in and the device is shut up; "sent to its room"
so to speak. So the HW FC is considered the "when all else fails" rule.
A separate document will describe how to use HW FC for driver authors.
I think it should be a *mandatory* interface to drivers.
The netif_rx() feedback helps the driver get 2 insights:
i) understand the fate of the packets it sent up the stack. No point
in continuing to blast packets to the stack when they are being dropped.
(which happens today)
ii) smartly gauge the congestion level on the stack and adjust the
rate at which packets are being sent to the stack to reduce overall
system load. A scheme which moves the congestion away from the system and
onto the driver is considered a bonus. The tulip does this in two ways
selectable at compile time: 1) kill the rx thread or 2) kill the rx interrupt.
Both work extremely well.
[The sample tulip driver can be made to use 2) by undefining D_INT in
The driver uses the feedback information to intelligently adjust its
sending rate (i.e. reduce or increase calls to netif_rx(), or send a
congestion-experienced frame to its peer, e.g. in X.25).
In the sample tulip driver, dynamic mitigation based on congestion feedback
is used. Under low congestion, the mitigation parameters are turned off
(default behavior as is today); under heavy congestion we dynamically move
up to 16 packet times. It is not mandatory to use this scheme; however,
it serves as a good example. A word of caution: This scheme is still being
experimented on; we feel we could do better. Look at the code.
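
As a rough illustration of dynamic mitigation driven by feedback, the
sketch below picks the next mitigation level from the current one and the
last value returned by netif_rx(). This is not the tulip code: the FB_*
codes, next_mitigation() and the doubling step are all assumptions. Only
"off under low congestion" and the cap of 16 packet times come from the
text above.

```c
#include <assert.h>

#define MIT_MAX 16  /* upper bound from the text: 16 packet times */

/* Hypothetical feedback codes; positive = queued, negative = dropped. */
enum { FB_DROP = -1, FB_NONE = 1, FB_LOW = 2, FB_HIGH = 3 };

/* Return the next mitigation level (0 = off) for the given feedback. */
static int next_mitigation(int cur, int feedback)
{
    int next;

    assert(cur >= 0 && cur <= MIT_MAX);

    if (feedback == FB_NONE || feedback == FB_LOW)
        return 0;               /* little or no congestion: mitigation off */
    if (feedback == FB_DROP)
        return MIT_MAX;         /* packets are being dropped: back off fully */

    /* High congestion: ramp up by doubling (an assumed policy), capped. */
    next = cur ? cur * 2 : 1;
    return next > MIT_MAX ? MIT_MAX : next;
}
```

Doubling is just one way to ramp up; a linear step or a table lookup
would fit the same interface.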
The backlog queue is now getting sampled.
This helps in detecting incipient congestion by the top layer.
Essentially, the sampler is a low-pass filter which weeds out
"congestion detected" wolf-cries due to sudden short bursts which
fill the backlog.
I have experimented with two schemes: one which samples the queue via
a timer and one which does it per-packet and found that the per-packet
sampler gave better results (more samples, Shannon's theorem applies).
It didn't matter whether HZ was 100 or 1024 during the tests.
The measure of "better" was throughput.
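
For readers who want to see what a per-packet low-pass filter looks like,
here is a minimal sketch using an integer exponentially weighted moving
average. The filter in the patch may well differ; the struct, the function
names and the weight (AVG_SHIFT) are assumptions chosen for illustration.

```c
#include <assert.h>

#define AVG_SHIFT 3  /* hypothetical weight: each sample contributes 1/8 */

struct backlog_sampler {
    long avg_scaled;  /* average queue length, scaled by 2^AVG_SHIFT */
};

/* Fold one instantaneous queue length (>= 0) into the running average. */
static void sampler_update(struct backlog_sampler *s, int qlen)
{
    /* Equivalent to avg += (qlen - avg) / 2^AVG_SHIFT, in scaled form. */
    s->avg_scaled += qlen - (s->avg_scaled >> AVG_SHIFT);
}

/* Read back the smoothed queue length. */
static int sampler_avg(const struct backlog_sampler *s)
{
    return (int)(s->avg_scaled >> AVG_SHIFT);
}
```

Calling sampler_update() once per packet means a single burst that
briefly fills the backlog barely moves the average, while a sustained
load pulls it up to the real queue length.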
Introduce a scheme which occasionally sends a "random-lie"
when around moderate to high congestion.
The motivation for this is to reduce the unfairness that arises when many
devices share the same backlog queue.
e.g. if eth0 is blasting 70Kpps to the backlog queue and eth1 is merely
sending 10pps, then under the current system setup, eth0 will pretty
much fill up the backlog (every time it gets drained) and leave eth1
very few opportunities to queue. Imagine scaling this to over
10 interfaces with eth0 blasting at that rate.
Solution i have devised:
- Randomly lie in the feedback to the driver when under moderate to high
congestion levels (tell device there is a higher congestion than really is).
Theory is that "the harder they come, the harder they fall". Packets from
eth0 in the above example are far more likely to be hit by the randomness
than packets from eth1 (simply because there are more of them to take
target shots at).
My testing with the included scheme (#ifdef RAND_LIE) indicates that fairness
does in fact go up; however, the overall throughput when only one interface
is utilizing the system goes down under moderate to heavy congestion.
I am including it here as a way to highlight the problem. I think there could
be better ways to do this.
Code is included and can be turned on by defining RAND_LIE in dev.c
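
The idea can be sketched as below. This is not the RAND_LIE code from
dev.c: the FB_* codes, the lie rate (LIE_ONE_IN) and feedback_with_lie()
are assumptions for illustration. The random number is passed in by the
caller so the behaviour is easy to check by hand.

```c
#include <assert.h>

/* Hypothetical feedback codes; positive = queued, negative = dropped. */
enum { FB_DROP = -1, FB_NONE = 1, FB_LOW = 2, FB_HIGH = 3 };

#define LIE_ONE_IN 8  /* hypothetical: lie on roughly 1 in 8 packets */

/*
 * Return the congestion level to report to the driver. 'real' is the
 * true level and 'rnd' is any random number from the caller. The packet
 * itself is handled according to 'real'; only the report is inflated.
 */
static int feedback_with_lie(int real, unsigned int rnd)
{
    if (real != FB_LOW && real != FB_HIGH)
        return real;        /* only lie under moderate to high congestion */
    if (rnd % LIE_ONE_IN != 0)
        return real;        /* most of the time, tell the truth */

    /* Claim one level worse than reality. */
    return real == FB_LOW ? FB_HIGH : FB_DROP;
}
```

Because the lie is drawn per packet, a device sending thousands of times
more packets collects thousands of times more lies, which is where the
fairness gain is expected to come from.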
After a brief talk with Alexey and Robert at the OLS, i am withdrawing this
Currently after netdev_dropping is raised, all incoming packets to the
backlog are dropped even if there was only a single packet on the queue.
I had proposed removing that check. In a world with good driver-zens
(system-zens?) where every driver backs off once system congestion is detected,
the change would make sense. It is, however, unfair to good citizens to back off
while the bad guys are filling the backlog which is what would happen if the
change is made.
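
The check being discussed can be pictured with the simplified sketch
below. It is a userspace model, not the kernel code: struct backlog and
backlog_enqueue() are assumed names, and the 'dropping' flag merely stands
in for netdev_dropping. I am also assuming the flag clears only once the
queue has fully drained, which is what the description above implies.

```c
#include <assert.h>
#include <stdbool.h>

struct backlog {
    int qlen;       /* packets currently queued */
    int max;        /* hypothetical queue limit */
    bool dropping;  /* stands in for netdev_dropping */
};

/* Try to enqueue one packet; return true if queued, false if dropped. */
static bool backlog_enqueue(struct backlog *b)
{
    if (b->qlen >= b->max) {
        b->dropping = true;   /* overflow: start dropping */
        return false;
    }
    if (b->qlen == 0)
        b->dropping = false;  /* queue fully drained: accept again */
    if (b->dropping)
        return false;         /* still dropping, even if qlen is only 1 */

    b->qlen++;
    return true;
}
```

Removing the 'dropping' test would let packets back in as soon as there
is room, which is exactly the opening a badly behaved driver would exploit.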