route cache DoS testing and softirqs

To: linux-kernel@xxxxxxxxxxxxxxx
Subject: route cache DoS testing and softirqs
From: Dipankar Sarma <dipankar@xxxxxxxxxx>
Date: Tue, 30 Mar 2004 00:15:50 +0530
Cc: netdev@xxxxxxxxxxx, Robert Olsson <Robert.Olsson@xxxxxxxxxxx>, Andrea Arcangeli <andrea@xxxxxxx>, "Paul E. McKenney" <paulmck@xxxxxxxxxx>, Dave Miller <davem@xxxxxxxxxx>, Alexey Kuznetsov <kuznet@xxxxxxxxxxxxx>, Andrew Morton <akpm@xxxxxxxx>
Reply-to: dipankar@xxxxxxxxxx
Sender: netdev-bounce@xxxxxxxxxxx
User-agent: Mutt/1.4.1i
Robert Olsson noticed dst cache overflows while doing DoS stress testing in a
2.6-based router setup a few months ago, and davem, alexey, robert and I
have been discussing this privately since then (198 mails, no less!!).
Recently, I set up an environment to test Robert's problem and have
been characterizing it. My setup is -
pktgen box --- in router out --
eth0           eth0 <-> dummy0
The router box is a 2-way P4 xeon 2.4 GHz with 256MB memory. I use
Robert's pktgen script -
CLONE_SKB="clone_skb 1"
PKT_SIZE="pkt_size 60"
#COUNT="count 0"
COUNT="count 10000000"
IPG="ipg 0"
echo "Configuring $PGDEV"
pgset "$COUNT"
pgset "$CLONE_SKB"
pgset "$PKT_SIZE"
pgset "$IPG"
pgset "flag IPDST_RND"
pgset "dst_min"
pgset "dst_max"
pgset "flows 32768"
pgset "flowlen 10"
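
The script above calls a pgset helper and uses $PGDEV without defining either. A minimal sketch of such a helper follows, modelled on the one in the kernel's pktgen documentation. It assumes $PGDEV names the pktgen control file for the device being configured; the path in the usage comment is only an illustration and is not taken from this setup.

```shell
# Sketch of the helper the script assumes; not copied from Robert's script.
pgset() {
    local result
    # Write one command to the pktgen control file named by $PGDEV.
    echo "$1" > "$PGDEV"
    # pktgen reports the status of the last command in the same file.
    result=$(grep -F "Result: OK:" "$PGDEV")
    if [ -z "$result" ]; then
        # Command was not acknowledged: show whatever result line exists.
        grep -F "Result:" "$PGDEV" || true
    fi
}

# Usage (assumed path, adjust to the local pktgen layout):
#   PGDEV=/proc/net/pktgen/eth0
#   pgset "count 10000000"
```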
With this, within a few seconds of starting pktgen, I get dst cache
overflow messages. I use the following instrumentation patch
to look at what's happening -
I tried both vanilla 2.6.0 and 2.6.0 + throttle-rcu patch which limits
RCU to 4 updates per RCU tasklet. The results are here -

This graph shows the maximum grace period in each ~4ms time bucket along the x-axis.

A couple of things are clear from this -

1. RCU grace periods of up to 300ms are seen. 300ms at 100Kpps amounts
   to about 30000 pending dst entries, which results in route cache
   overflow.

2. throttle-rcu is only marginally better (10% shorter worst-case grace period).
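
The estimate in point 1 is just grace period times packet rate. The sketch below spells that out, assuming (as the estimate does) roughly one dst entry waiting on RCU per packet received during the grace period. The function name is made up for illustration.

```shell
# Entries left pending when frees are deferred for one grace period.
# $1 = grace period in milliseconds, $2 = packet rate in packets/second.
pending_dst_entries() {
    local grace_ms=$1 pps=$2
    echo $(( grace_ms * pps / 1000 ))
}

pending_dst_entries 300 100000   # 300ms at 100Kpps, prints 30000
```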

So, what causes RCU to stall for 300ms or so? I did some measurements
using the following patch -

It applies on top of the 15-rcu-debug patch. This counts the number of
softirqs (in effect, an approximation) during ~4ms time buckets. The
result is here -

The RCU grace period spikes are always accompanied by softirq frequency
spikes. So, this indicates that it is the large number of quick-running
softirqs that causes userland starvation, which in turn results in RCU
delays. This raises a fundamental question - should we work around
this by providing a quiescent point at the end of every softirq handler
(giving softirqs their own RCU mechanism), or should we address the wider
problem, the system getting overwhelmed by heavy softirq load, and
try to implement a real softirq throttling mechanism that balances
cpu use?

Some time ago, Robert demonstrated to us, using a small timestamping
user program, that the program can get starved for more than
6 seconds on his system. So userland starvation is an
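
Robert's timestamping program is not shown here. As a rough stand-in only, a userland probe of this kind can be sketched as a loop that sleeps briefly, takes a timestamp, and remembers the largest gap between consecutive timestamps. The function name and the 10ms sleep are assumptions, and it relies on GNU date (%N) and a sleep that accepts fractional seconds.

```shell
# Hypothetical starvation probe, not Robert's program.
# $1 = number of iterations (default 500).
starvation_probe() {
    iterations=${1:-500}
    max_gap_ms=0
    i=0
    prev=$(date +%s%N)            # nanoseconds since the epoch (GNU date)
    while [ "$i" -lt "$iterations" ]; do
        sleep 0.01                # nominal 10ms between timestamps
        now=$(date +%s%N)
        gap_ms=$(( (now - prev) / 1000000 ))
        if [ "$gap_ms" -gt "$max_gap_ms" ]; then
            max_gap_ms=$gap_ms    # remember the worst gap seen so far
        fi
        prev=$now
        i=$(( i + 1 ))
    done
    echo "max gap: ${max_gap_ms} ms over ${iterations} iterations"
}
```

On an idle box the reported gap should stay close to the sleep interval plus process start-up cost; a gap of several seconds while pktgen is running would point to userland starvation of the kind described above.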

