
Re: fealnx oopses (with [PATCH])

To: Jeff Garzik <jgarzik@xxxxxxxxx>
Subject: Re: fealnx oopses (with [PATCH])
From: Denis Vlasenko <vda@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 31 Mar 2004 18:39:33 +0200
Cc: Francois Romieu <romieu@xxxxxxxxxxxxx>, Andreas Henriksson <andreas@xxxxxxxxxxxx>, netdev@xxxxxxxxxxx, Denis Vlasenko <vda@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Sender: netdev-bounce@xxxxxxxxxxx
User-agent: KMail/1.5.4
> > > Oopses are gone but it looks like box is so much interrupt
> > > flooded that userspace has no chance of processing ctrl-C.
> > > What can we do? I think driver can do something useful
> > > when it detects 'too much work in interrupt'. Disabling rx
> > > for several ms seems like 'quick and dirty' way.
> > >
> > > Francois what do you think? Can you code something up
> > > for me to test?
> >
> > Andreas had a francois+jgarzik patch, I thought...?
> >
> > Looking good...
> Yes, I did all these tests with this patch.

Ok, here is what I have now.

What we had before:
original behaviour: Andreas had tx timeouts, I had a fatal oops.
francois+jgarzik patch: Andreas happy, I had a lockup (endless 'Too much work')

Now, with the attached patch, I *don't* lock up. I can ctrl-C
netcat, UDP flood stops and card is alive, tested with pinging.

I modified the 'Too much work in interrupt' code.
I added code which completely stops rx and tx and schedules
a card reset, a la the reset previously used in the tx_timeout
code path. There is a 1 second delay before the reset.
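
For illustration only, the idea can be modelled in userspace roughly
like this. It is not the code from the attached patch: every name
(sim_nic, sim_interrupt, sim_timer_tick) is hypothetical, and the work
budget of 20 is an assumption. Only the 1 second delay comes from the
description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the idea; NOT the fealnx driver code. */
#define MAX_INTR_WORK   20      /* assumed per-interrupt work budget */
#define RESET_DELAY_MS  1000    /* the 1 second delay mentioned above */

struct sim_nic {
    bool rx_enabled;
    bool tx_enabled;
    bool reset_pending;
    unsigned long reset_at_ms;    /* time at which the deferred reset fires */
    unsigned int pending_events;  /* events the simulated card has queued */
    unsigned int resets;          /* number of resets performed so far */
};

/* Interrupt handler model: handle events until the budget runs out. */
static void sim_interrupt(struct sim_nic *nic, unsigned long now_ms)
{
    unsigned int work_done = 0;

    assert(nic != NULL);
    if (nic->reset_pending)
        return;                   /* card is stopped, nothing to service */

    while (nic->pending_events > 0) {
        if (work_done >= MAX_INTR_WORK) {
            /* 'Too much work': stop rx and tx, schedule a reset. */
            nic->rx_enabled = false;
            nic->tx_enabled = false;
            nic->reset_pending = true;
            nic->reset_at_ms = now_ms + RESET_DELAY_MS;
            return;
        }
        nic->pending_events--;
        work_done++;
    }
}

/* Timer model: once the delay has elapsed, reset and re-enable the card. */
static void sim_timer_tick(struct sim_nic *nic, unsigned long now_ms)
{
    assert(nic != NULL);
    if (!nic->reset_pending || now_ms < nic->reset_at_ms)
        return;

    nic->pending_events = 0;      /* the reset drops whatever was queued */
    nic->rx_enabled = true;
    nic->tx_enabled = true;
    nic->reset_pending = false;
    nic->resets++;
}
```

The point of the delay is that, while rx and tx are stopped, the card
raises no further interrupts, so userspace gets a chance to run (and to
process ctrl-C) before the card comes back.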

In testing, I saw this code triggering a number of times.
It works as intended.

I have also seen tx timeout events. (I touched that code a bit,
factoring out the reset code, but without changing the logic). They
did not hang the box, either, but the box was unable to do any tx.
The tx timeouts just happened again and again.
Remote side decided that we are gone and stopped UDP flood,
starting ARP address resolution instead, but test box could
not answer even that.

In this state I can ctrl-C local netcat and box recovers
after several seconds. Tx is working again, I can ping other
hosts, etc.

I conclude that the tx timeout logic does not handle the situation
where a *local* process generates and submits tons of packets
at an enormous rate.

Now, the patch itself. It is against 2.4.25.
Sorry, it contains some debugging stuff and some future work
(e.g. #defines). I decided not to remove it now, to avoid
brown paper bag bugs. (I can remove it and retest, but testing can
take another couple of hours...)

I can code tx timeout similarly to 'Too much work', but I
retained it 'as is' in hopes of your comments on both
code paths.

Attachment: fealnx-jgarzik-fromieu-vda.24.patch
Description: Text Data
