From: Andi Kleen <ak@xxxxxxx>
Date: Wed, 22 Jun 2005 20:06:55 +0200
> However it is tricky because CPUs have only a limited number of load
> queue entries and doing too many prefetches will just overflow that.
Several processors can queue about 8 prefetch requests, and
these slots are independent of those consumed by a load.
Yes, if you queue too many prefetches, the queue overflows.
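One way to stay under that limit is to bound how far ahead you prefetch,
instead of prefetching the whole ring at once. A minimal toy sketch of the
idea (the names, the ring size, and the lookahead distance of 4 are my own
illustration, not anything from the driver):

```c
#include <assert.h>

#define PREFETCH_AHEAD 4  /* stay well under the ~8 outstanding slots */

/* Walk a ring of buffer pointers, keeping at most PREFETCH_AHEAD
 * prefetches in flight ahead of the entry we are touching now. */
static long sum_ring(long *ring[], int count)
{
	long sum = 0;
	int i;

	for (i = 0; i < count; i++) {
		if (i + PREFETCH_AHEAD < count)
			__builtin_prefetch(ring[i + PREFETCH_AHEAD]);
		sum += *ring[i];
	}
	return sum;
}
```

The payload work here (the addition) stands in for whatever real work hides
the prefetch latency; the point is only that the lookahead distance, not the
ring size, bounds the number of outstanding prefetches.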
I think the optimal scheme would be:
1) eth_type_trans() info in RX descriptor
2) prefetch(skb->data) done as early as possible in driver
Actually, I believe the most optimal scheme is:
	skb = driver_priv->rx_skbs[index];
	prefetch(skb->data);
	skb = realloc_or_recycle_rx_descriptor(skb, index);
	if (skb == NULL)
		goto next_pkt;	/* alloc failed, old skb was recycled */
	skb->protocol = eth_type_trans(skb, driver_priv->dev);
The idea is that first the prefetch goes into flight, then you do the
recycle or reallocation of the RX descriptor SKB, then you try to
touch the data.
This makes it very likely the prefetched data will be in the CPU's
cache by the time we touch it.
Everyone seems to have this absolute fetish about batching the RX
descriptor refilling work. It's wrong, it should be done when you
pull a receive packet off the ring, for many reasons. Off the top of
my head:
1) Descriptors are refilled as soon as possible, decreasing
the chance of the device hitting the end of the RX ring
and thus unable to receive a packet.
2) As shown above, it gives you compute time which can be used to
schedule the prefetch. This nearly makes RX replenishment free.
Instead of having the CPU spin on a cache miss when we run
eth_type_trans(), we do useful work during those cycles.
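Putting both points together, a per-packet refill loop can be sketched as a
self-contained toy (all names here are hypothetical stand-ins, not the real
driver or skb API):

```c
#include <assert.h>
#include <stdlib.h>

#define RING 8

/* Simplified stand-in for an skb and the driver's RX ring. */
struct pkt { unsigned char data[64]; };

static struct pkt *rx_ring[RING];
static int delivered;

/* Stand-in for allocating a fresh RX buffer; may return NULL. */
static struct pkt *alloc_pkt(void)
{
	return calloc(1, sizeof(struct pkt));
}

static void deliver(struct pkt *p)
{
	delivered++;		/* would be netif_receive_skb() etc. */
	free(p);
}

/* Per-packet refill: prefetch first, refill the slot, then touch data. */
static void rx_poll(void)
{
	int i;

	for (i = 0; i < RING; i++) {
		struct pkt *p = rx_ring[i];
		struct pkt *np;

		__builtin_prefetch(p->data);	/* miss goes into flight */
		np = alloc_pkt();		/* refill this slot now */
		if (np == NULL) {
			/* Allocation failed: recycle the old buffer in
			 * place and drop this packet, rather than let
			 * the ring run empty. */
			continue;
		}
		rx_ring[i] = np;
		deliver(p);			/* data likely cached now */
	}
}
```

The refill happens between the prefetch and the first touch of the data, so
the allocation cost overlaps the memory latency instead of adding to it.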
I'm going to play around with these ideas in the tg3 driver.
Obvious patch below.
--- 1/drivers/net/tg3.c.~1~ 2005-06-22 12:33:07.000000000 -0700
+++ 2/drivers/net/tg3.c 2005-06-22 13:19:13.000000000 -0700
@@ -2772,6 +2772,13 @@
+	/* Prefetch now.  The recycle/realloc of the RX
+	 * entry is moderately expensive, so by the time
+	 * that is complete the data should have reached
+	 * the cpu.
+	 */
+	prefetch(skb->data);
+
 	work_mask |= opaque_key;

 	if ((desc->err_vlan & RXD_ERR_MASK) != 0 &&