On Mon, May 24, 2004 at 10:06:34AM -0700, David S. Miller wrote:
> The most relevant (and accurate) piece of documentation for the chip
> is Broadcom's own driver :-)
The Windows-flavoured driver you decided to replace in Linux?
> And they do not account at all for such
> a case of partial-packet TX completion indication. If the first frag
> is ACK'd they assume the whole packet has been taken. Here is the
> relevant code from the bcm5700 driver in LM_ServiceTxInterrupt():
> while(SwConIdx != HwConIdx)
> pPacket = pDevice->SendRing[SwConIdx];
> pDevice->SendRing[SwConIdx] = 0;
> /* Set the return status. */
> pPacket->PacketStatus = LM_STATUS_SUCCESS;
> /* Put the packet in the TxPacketXmittedQ for indication later. */
> QQ_PushTail(&pDevice->TxPacketXmittedQ.Container, pPacket);
> /* Move to the next packet's BD. */
> SwConIdx = (SwConIdx + pPacket->u.Tx.FragCount) &
> /* Update the number of unused BDs. */
> MM_ATOMIC_ADD(&pDevice->SendBdLeft, pPacket->u.Tx.FragCount);
> /* Get the new updated HwConIdx. */
> HwConIdx = pDevice->pStatusBlkVirt->Idx.SendConIdx;
> } /* while */
> Imagine how badly this piece of code would fail if partial ACK'ing of
> TX packets actually occurred. It would loop past HwConIdx and thus
> ACK really-not-completed packets, potentially colliding with what
> the chip is transmitting and thus causing massive data corruption
> and likely a crash. Actually, it would most likely loop past all
> valid TX packets and dereference a pPacket NULL pointer.
I agree that this code appears to implictly rely on always getting
complete send ring updates.
So has this driver been tested in 2.6? I did notice that the bug
occurs far more frequently in 2.6 because better zero-copy code
means that the driver actually sees skbs with frags instead of just
the header. The driver might be accidentally working because
pPacket->u.Tx.FragCount=1 during all its testing.
Also, I'm not familiar with this driver's source (I haven't looked
at it for a long time), but I can see that there are behaviour
differences between this driver and the tg3 driver which might affect
the visibility of the bug.
For one thing, this code fetches a new sample of the hardware consumer
index on every loop iteration. This may result in accidentally
bumping up HwConIdx to avoid apparent partial completions. Also note
that the queuing step (QQ_PushTail) might add enough delay to make
a second status block update more likely.
Without looking at the remainder of the driver, I can't say if this
code is called more or less frequently than tg3_tx(). If it's using
any interrupt coalescing at all it will be called much less frequently
and thus have a smaller window to see a partial complete. On my
hardware I see interrupt rates of 10000-30000 per second per card,
and tracing shows that tg3_tx() is called on most of these interrupts.
It could be the case that the huge interrupt rate in the tg3 driver
is banging hard on a race condition which the bcom driver avoids.
Also, the NAPI code in the tg3 driver will presumably set up the
card's interrupt coalescing engine with different parameters, which
might have an effect on the timing of status block updates.
Finally, the tg3 driver departs from the recommended ISR flow control
diagram and handles a status block update during the ISR by using
the SETINT bit to tell the card to force a new interrupt (instead of
restarting the ISR). This might also have an effect.
In short, even if the implicit assumptions in the bcom driver are
correct in that driver, I don't see how you can argue that they can
be carried across to the tg3 driver.
BTW, at least one other person has reported what appears to be the
exactly this bug:
He found the same bug in both bcom and tg3 drivers, and his hardware
has little in common with mine (apart from having >1 fast CPUs).
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.