
Re: Queue and SMP locking discussion (was Re: 3c59x.c)

To: Andrew Morton <andrewm@xxxxxxxxxx>
Subject: Re: Queue and SMP locking discussion (was Re: 3c59x.c)
From: Donald Becker <becker@xxxxxxxxx>
Date: Fri, 31 Mar 2000 10:08:00 -0500 (EST)
Cc: jamal <hadi@xxxxxxxxxx>, netdev@xxxxxxxxxxx, kuznet@xxxxxxxxxxxxx
In-reply-to: <38E40D26.D34D72CB@xxxxxxxxxx>
Sender: owner-netdev@xxxxxxxxxxx
On Fri, 31 Mar 2000, Andrew Morton wrote:

> I believe that this tx starvation is due to the decision to schedule the
> tx in the device ISR, for BH handling, rather than to actually dequeue
> and send packets within the Tx ISR.  I can see why the bh scheduling is
> simpler...

Not only is the Linux scheme simpler, it's much better.  

The BSD stack uses the scheme of dequeuing packets in the ISR.  This was a
good design in the VAX days, with primitive hardware that handled only
single packets.  But it has horrible cache behavior, needs an extra lock,
and can result in the interrupt service routine running for a very long
time with interrupts blocked.
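
Roughly, the contrast looks like this (a sketch only, with made-up helper
names; tx_slot_free(), dequeue_next_packet() and hand_to_hardware() stand
in for the real machinery):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* BSD-style: dequeue packets and feed the NIC from inside the Tx ISR. */
static void bsd_style_tx_isr(struct net_device *dev)
{
	struct sk_buff *skb;

	while (tx_slot_free(dev) && (skb = dequeue_next_packet(dev)) != NULL)
		hand_to_hardware(dev, skb);	/* cache-cold work, IRQs blocked */
}

/* Linux-style: the ISR only notes that there is room again; the
 * networking bottom half dequeues later and calls hard_start_xmit(). */
static void linux_style_tx_isr(struct net_device *dev)
{
	netif_wake_queue(dev);	/* clear "stopped" and schedule the BH */
}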

I think the problem is that our receive routines, eth_type_trans() and
netif_rx(), have grown to be too complex, and they have similar bad cache
behavior to Tx queuing in the ISR.  At least Tx starvation is better than Rx
deafness.
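
To make the Rx point concrete, the usual per-packet path from the ISR looks
something like this (again a sketch; rx_descriptor_done() and
dequeue_rx_skb() are made-up stand-ins).  Both calls walk freshly-DMAed
data and shared tables, so the cache misses land in interrupt context:

#include <linux/netdevice.h>
#include <linux/etherdevice.h>

static void sketch_rx(struct net_device *dev)
{
	while (rx_descriptor_done(dev)) {
		struct sk_buff *skb = dequeue_rx_skb(dev);

		skb->protocol = eth_type_trans(skb, dev);  /* reads the header */
		netif_rx(skb);	/* queue for net_bh; touches shared state */
	}
}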

> I like the loop-until-max_interrupt_work-exceeded architecture.  It's
> _very_ efficient compared with interrupt per packet, and it kicks in
> when the system is under stress.  But it's not being leveraged for
> transmits.

Thanks.  (I obviously think it's a good design, or I wouldn't have done it.)
Many people just want to turn it off, but the network usually isn't
the only part of the system that needs to get work done.
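
For anyone who hasn't read one of the drivers, the shape of that loop is
as follows (a sketch: read_and_ack_status() and the event bits are made
up, though the boguscnt idiom is real):

#include <linux/netdevice.h>
#include <linux/interrupt.h>

static int max_interrupt_work = 20;	/* usually a module parameter */

static void sketch_interrupt(int irq, void *dev_instance, struct pt_regs *regs)
{
	struct net_device *dev = dev_instance;
	int boguscnt = max_interrupt_work;

	do {
		int status = read_and_ack_status(dev);
		if (status == 0)
			break;			/* device is quiescent */
		if (status & RX_EVENT)
			sketch_rx(dev);		/* per-packet Rx work, as above */
		if (status & TX_EVENT)
			sketch_reap_tx(dev);	/* reap finished Tx buffers */
	} while (--boguscnt > 0);
	/* If boguscnt hit zero the device is still busy: return anyway and
	 * let the next interrupt (or the watchdog) pick up the rest. */
}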

> The 3c59x driver's ISR does this:
> while (stuff_to_do && (count++ < max_interrupt_work))
> {
>       if (the device has room for a tx packet)
>               netif_wake_queue()
..
> It appears to me that netif_wake_queue can be called multiple times
> within this loop, at considerable expense, when the system is under Rx
> stress.

Not quite: it only does the clear-tbusy step (== netif_wake_queue()) once.
The TxAvailable indication is only used on the 3c590 series Vortex, and it
only triggers once.  The 3c900 series has a modern descriptor list and uses
the other block of code, where netif_wake_queue() is called only when the
16/32-entry Tx queue was full.

Most drivers, including pci-skeleton.c, have some hysteresis so that we
don't pound against the full limit.  We should have two or four Tx slots
free before we transition to non-full.
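
Spelled out, with made-up ring bookkeeping (cur_tx/dirty_tx counters in the
style of pci-skeleton.c; tx_entry_completed() is a stand-in for the real
descriptor-status test), the reap side looks like:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define TX_RING_SIZE	16
#define TX_WAKE_FREE	4	/* slots that must be free before waking */

struct sketch_private {
	struct sk_buff *tx_skbuff[TX_RING_SIZE];
	unsigned int cur_tx, dirty_tx;	/* producer / consumer counts */
	int tx_full;		/* set by start_xmit when the ring fills */
};

static void sketch_reap_tx(struct net_device *dev)
{
	struct sketch_private *np = (struct sketch_private *)dev->priv;

	while (np->cur_tx != np->dirty_tx &&
	       tx_entry_completed(np, np->dirty_tx % TX_RING_SIZE)) {
		dev_kfree_skb_irq(np->tx_skbuff[np->dirty_tx % TX_RING_SIZE]);
		np->dirty_tx++;
	}
	/* Hysteresis: leave the "full" state only when several slots have
	 * opened up, instead of waking the queue for every reaped packet. */
	if (np->tx_full &&
	    np->cur_tx - np->dirty_tx < TX_RING_SIZE - TX_WAKE_FREE) {
		np->tx_full = 0;
		netif_wake_queue(dev);
	}
}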

Acckk!!! I just saw that someone put netif_wake_queue() in the normal path
of the 3c59x.c Tx routine!  This is BAD.  It puts an expensive call in the
critical path, and it's not even the right semantics.

The original intent of netif_wake_queue() was that it would be very
lightweight.  It should only clear the flag and set the BH bit.  But it
seems to have grown in complexity...
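
For reference, what the old clear-tbusy idiom amounted to, in 2.2-era
terms (a sketch; this is the weight the call was supposed to have):

#include <linux/netdevice.h>
#include <linux/interrupt.h>

static inline void lightweight_wake_queue(struct device *dev)
{
	dev->tbusy = 0;		/* Tx queue no longer stopped */
	mark_bh(NET_BH);	/* set the BH bit; net_bh() runs soon */
}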


Donald Becker
Scyld Computing Corporation, becker@xxxxxxxxx

