netdev
[Top] [All Lists]

[PATCH] NULL pointer deref in tcp_do_twkill_work()

To: davem@xxxxxxxxxx
Subject: [PATCH] NULL pointer deref in tcp_do_twkill_work()
From: olof@xxxxxxxxxxxxxx
Date: Fri, 27 Feb 2004 16:09:21 -0600 (CST)
Cc: linux-kernel@xxxxxxxxxxxxxxx, <netdev@xxxxxxxxxxx>, <niv@xxxxxxxxxx>, <anton@xxxxxxxxx>, <milliner@xxxxxxxxxx>
Sender: netdev-bounce@xxxxxxxxxxx
This bug has been encountered before (ref 
http://bugme.osdl.org/show_bug.cgi?id=1627),
but that time it went away by itself. We're now seeing it again with
another heavy network load on 2.6.3.

Recap:

We're crashing in tcp_do_twkill_work():

        tw_for_each_inmate(tw, node, safe,
                           &tcp_tw_death_row[slot]) {
                __tw_del_dead_node(tw);
                spin_unlock(&tw_death_lock);
                tcp_timewait_kill(tw);
                tcp_tw_put(tw);
                killed++;
                spin_lock(&tw_death_lock);
                if (killed > quota) {
                        ret = 1;
                        break;
                }
        }


The crash is in __tw_del_dead_node, where tw is taken off of the
tw_death_node list. At the time of the crash, the pprev list pointer is
NULL, which indicates that tw is no longer on the list.

I'm suspicious of the fact that the tw_death_lock is released, and since
it's released, the "safe" inmate could be descheduled on another CPU. Once
the lock is reaquired and the loop is redone, the new tw (old safe) is
now already taken off the list so we go astray.

Shouldn't the loop always restart from the beginning instead of using the
(not thread-)"safe" list iteration? Since it keeps taking the entries off
the list, the behaviour should be identical but it wouldn't be racy.

The alternative is to not drop the lock, but I'm guessing we need to do
that to avoid deadlocks. I don't know the TCP code well enough to tell for
sure.

Proposed patch is attached. We've given it a bit of runtime here but the
crashes happen randomly so it'll take a while to gain full confidence that
it's fixed. But the above reasoning around the patch seems to make sense.


Thanks,

Olof

Olof Johansson                                        Office: 4E002/905
Linux on Power Development                            IBM Systems Group
Email: olof@xxxxxxxxxxxxxx                          Phone: 512-838-9858
All opinions are my own and not those of IBM


Attachment: twpatch
Description: Text document

<Prev in Thread] Current Thread [Next in Thread>