netdev

Re: 2.6.0-test11: dst_cache_overflow causing unresponsive box

To: Robert Olsson <Robert.Olsson@xxxxxxxxxxx>
Subject: Re: 2.6.0-test11: dst_cache_overflow causing unresponsive box
From: "David S. Miller" <davem@xxxxxxxxxx>
Date: Tue, 2 Dec 2003 03:26:06 -0800
Cc: francois@xxxxxxxxxxxx, netdev@xxxxxxxxxxx
In-reply-to: <16332.27919.502097.988522@xxxxxxxxxxxx>
References: <072501c3b874$2542ae70$15fea8c0@fortress> <16332.27919.502097.988522@xxxxxxxxxxxx>
Sender: netdev-bounce@xxxxxxxxxxx
On Tue, 2 Dec 2003 11:44:31 +0100
Robert Olsson <Robert.Olsson@xxxxxxxxxxx> wrote:

> No experience with 90k TCP-flows but it seems GC is not able to free some
> of the dst-entries for some reason. This will slowly kill your box with
> the symptoms you describe. We have to ask TCP-experts for timer settings to
> avoid pending sessions etc. Also check slab for any other objects growing, as
> dst cache overflow is most likely a secondary effect in your case. rtstat
> looks sane except for the high number of dst-entries. Tuning is another
> story.

Let us assume, for the sake of back of the envelope calculations, that
all 90k TCP connections speak to unique destinations.  Let us further
assume that all of them have at least one packet in flight.

This means the routing cache must be able to hold at least 90k entries.
All of these routing cache entries will be referenced by the packets
in the TCP retransmission queues of all the sockets, and thus the
entries are unreclaimable.

You are setting net.ipv4.route.max_size to 655360, which should be more
than enough.  But you also have to set net.ipv4.route.gc_thresh to
something more reasonable, perhaps 90K as a test.

If net.ipv4.route.gc_thresh is lower than 90K and my assertions above
hold, then the kernel will try to garbage collect too early, all the
routing cache entries will be in use and therefore uncollectable,
and you'll get the message you're seeing.
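
As a rough sketch of that reasoning, the check below compares the two
settings against the number of entries assumed to be pinned by in-flight
packets.  It is only a model of the argument above, not the kernel's GC
logic; the names RouteCacheSettings and assess are made up for
illustration, and the gc_thresh value of 4096 in the usage note is a
hypothetical starting point, not a value taken from the report.

```python
from dataclasses import dataclass


@dataclass
class RouteCacheSettings:
    """Hypothetical container for the two sysctls discussed above."""
    gc_thresh: int  # net.ipv4.route.gc_thresh
    max_size: int   # net.ipv4.route.max_size


def assess(settings, pinned_entries):
    """Return warnings for settings that cannot work if `pinned_entries`
    routing cache entries are referenced and therefore unreclaimable."""
    warnings = []
    if settings.max_size < pinned_entries:
        warnings.append(
            "max_size is below the pinned entry count: "
            "the cache cannot hold all in-use routes"
        )
    if settings.gc_thresh < pinned_entries:
        warnings.append(
            "gc_thresh is below the pinned entry count: "
            "GC starts while every entry is still referenced"
        )
    return warnings
```

With 90000 pinned entries, assess(RouteCacheSettings(gc_thresh=4096,
max_size=655360), 90000) returns only the gc_thresh warning, and raising
gc_thresh to 90000 clears it.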

Try to pump up gc_thresh and see if that helps.
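
One way to check the current values before changing anything is sketched
below.  It assumes the settings are exposed as files under
/proc/sys/net/ipv4/route, and the helper names read_route_sysctl and
suggest_gc_thresh are made up for illustration.  It only prints a
suggested "sysctl -w" command; it does not change anything itself.

```python
from pathlib import Path

# Assumed procfs location of the net.ipv4.route.* sysctls.
DEFAULT_BASE = "/proc/sys/net/ipv4/route"


def read_route_sysctl(name, base=DEFAULT_BASE):
    """Read one integer-valued route sysctl, e.g. 'gc_thresh'."""
    return int((Path(base) / name).read_text().split()[0])


def suggest_gc_thresh(expected_flows, base=DEFAULT_BASE):
    """Return a sysctl command raising gc_thresh to `expected_flows`,
    or None if the current value is already high enough."""
    gc_thresh = read_route_sysctl("gc_thresh", base)
    max_size = read_route_sysctl("max_size", base)
    if expected_flows > max_size:
        raise ValueError("expected flows exceed max_size; raise that first")
    if gc_thresh >= expected_flows:
        return None
    return f"sysctl -w net.ipv4.route.gc_thresh={expected_flows}"


if __name__ == "__main__":
    # 90000 matches the 90k-connection test value suggested above.
    print(suggest_gc_thresh(90000) or "gc_thresh already at or above 90000")
```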
