netdev

Re: [PATCH] loop unrolling in net/sched/sch_generic.c

To: Thomas Graf <tgraf@xxxxxxx>
Subject: Re: [PATCH] loop unrolling in net/sched/sch_generic.c
From: Eric Dumazet <dada1@xxxxxxxxxxxxx>
Date: Wed, 06 Jul 2005 02:32:24 +0200
Cc: "David S. Miller" <davem@xxxxxxxxxxxxx>, netdev@xxxxxxxxxxx
In-reply-to: <20050705234104.GR16076@xxxxxxxxxxxxxx>
References: <20050705173411.GK16076@xxxxxxxxxxxxxx> <20050705.142210.14973612.davem@xxxxxxxxxxxxx> <20050705213355.GM16076@xxxxxxxxxxxxxx> <20050705.143548.28788459.davem@xxxxxxxxxxxxx> <42CB14B2.5090601@xxxxxxxxxxxxx> <20050705234104.GR16076@xxxxxxxxxxxxxx>
Sender: netdev-bounce@xxxxxxxxxxx
User-agent: Mozilla Thunderbird 1.0 (Windows/20041206)
Thomas Graf wrote:

I still think we can fix this performance issue without manually
unrolling the loop, or we should at least try to. In the end gcc
should notice the constant part of the loop and move it out, so
basically the only difference should be the additional prio++ and
possibly a failing branch prediction.

What about this? I'm still not sure where exactly all the time
is lost so this is a shot in the dark.

Index: net-2.6/net/sched/sch_generic.c
===================================================================
--- net-2.6.orig/net/sched/sch_generic.c
+++ net-2.6/net/sched/sch_generic.c
@@ -330,10 +330,11 @@ static struct sk_buff *pfifo_fast_dequeu
 {
        int prio;
        struct sk_buff_head *list = qdisc_priv(qdisc);
+       struct sk_buff *skb;
-       for (prio = 0; prio < PFIFO_FAST_BANDS; prio++, list++) {
-               struct sk_buff *skb = __qdisc_dequeue_head(qdisc, list);
-               if (skb) {
+       for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+               if (!skb_queue_empty(list + prio)) {
+                       skb = __qdisc_dequeue_head(qdisc, list);
                        qdisc->q.qlen--;
                        return skb;
                }



Hum... shouldn't it be:

+                       skb = __qdisc_dequeue_head(qdisc, list + prio);


?

Anyway, the branch mispredictions come from the fact that most packets are
queued in the prio=2 list.

So each time this function is called, a non-unrolled version has to pay 2 to 5
branch mispredictions.

if (!skb_queue_empty(list + prio))  /* branch not taken, mispredict when prio=0 */

if (!skb_queue_empty(list + prio))  /* branch not taken, mispredict when prio=1 */

if (!skb_queue_empty(list + prio))  /* branch taken (or not if queue is really
empty), mispredict when prio=2 */

Maybe we can rewrite the whole thing without branches, examining prio from
PFIFO_FAST_BANDS-1 down to 0, at least for modern CPUs with a conditional
move (cmov):

struct sk_buff_head *best = NULL;
struct sk_buff_head *list = qdisc_priv(qdisc) + PFIFO_FAST_BANDS - 1;
if (!skb_queue_empty(list)) best = list;
list--;
if (!skb_queue_empty(list)) best = list;
list--;
if (!skb_queue_empty(list)) best = list;
if (best != NULL) {
        qdisc->q.qlen--;
        return __qdisc_dequeue_head(qdisc, best);
}

This version should have one branch.
I will test this after some sleep :)
See you
Eric
