Dave,
We're investigating a hang in TCP that a clustered node
is running into, and I'd appreciate any help whatsoever
on this...
System is running SLES8 + patches (including latest
fixes in timewait stuff) - but is pretty equivalent
to mainline 2.4 kernel from what I can tell.
Problem is reproducible, takes anywhere from several
hours to a day.
The hang occurs because the while loop in tcp_twkill()
goes into an infinite loop:
        while((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
                tcp_tw_death_row[tcp_tw_death_row_slot] = tw->next_death;
                if (tw->next_death)
                        tw->next_death->pprev_death = tw->pprev_death;
                tw->pprev_death = NULL;
                spin_unlock(&tw_death_lock);
                tcp_timewait_kill(tw);
                tcp_tw_put(tw);
                killed++;
                spin_lock(&tw_death_lock);
        }
Thanks to some neat detective work by Beth Kon and Joe
Garvey, the culprit seems to be a tw node pointing to
itself. See attached note from Beth at end.
This is possible if a tcp_tw_bucket is freed prematurely
before being taken off the death list. If the node is
at the head of the list, and is freed and then later
reallocated in tcp_time_wait() and reinserted into the
list, (now linked to a new sk) it will end up pointing at
itself. [There might be other ways to end up like this,
but I'm not seeing them]
We come into tcp_tw_schedule() (which puts it into the
death list) with pprev_death cleared by tcp_time_wait().
        tcp_tw_schedule() {
                if (tw->pprev_death) {
                        ...
                } else
                        atomic_inc(&tw->refcnt);
                ...
                if((tw->next_death = *tpp) != NULL)
                        (*tpp)->pprev_death = &tw->next_death;
                *tpp = tw;
                tw->pprev_death = tpp;
If tw is at the head of the list, (*tpp == tw), then
we just created a loop of tw->next_death pointing at tw.
If tw is in other places on the death list, we could
potentially have Y-shaped chains and other garbage...
Does that seem correct, or am I barking up the wrong
tree here?
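To make the head-of-list case concrete, here is a minimal
userspace sketch of it. The struct tw_model and both function
names are made up for illustration and are not kernel code;
only the four insertion statements are taken from the
tcp_tw_schedule() fragment above. It assumes the stale node is
still the head of its slot when it gets reinserted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for tcp_tw_bucket: only the death-list links. */
struct tw_model {
        struct tw_model *next_death;
        struct tw_model **pprev_death;
};

/* Same head-insertion steps as the tcp_tw_schedule() fragment above. */
void tw_model_insert(struct tw_model **tpp, struct tw_model *tw)
{
        if ((tw->next_death = *tpp) != NULL)
                (*tpp)->pprev_death = &tw->next_death;
        *tpp = tw;
        tw->pprev_death = tpp;
}

/* Returns 1 if reinserting a stale head node leaves it pointing at itself. */
int tw_model_self_loop(void)
{
        struct tw_model *slot = NULL;           /* one death-row slot */
        struct tw_model tw = { NULL, NULL };

        tw_model_insert(&slot, &tw);            /* first, legitimate insertion */
        assert(slot == &tw && tw.next_death == NULL);

        /* Model the premature free and reallocation: pprev_death is
         * cleared (as tcp_time_wait() does), but the slot still points
         * at the same memory. */
        tw.pprev_death = NULL;

        tw_model_insert(&slot, &tw);            /* same node inserted again */
        return tw.next_death == &tw;
}
```

In this model the second insertion sets tw.next_death to the
current head, which is tw itself, so the list can never drain.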
Just checking at this point for a node pointing to
itself is rather late - the damage has been done in
losing the original linkages from the tcp_tw_bucket
to the other structures which we need to remove as
well, so as to not cause a further mess in the hash
table and death list pointers.
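As a debugging aid only, the double insertion could be caught
before the old linkages are overwritten. The sketch below is a
userspace model under my own assumptions (struct tw_model and
tw_model_insert_checked() are hypothetical names, and it only
covers the head-of-list case); in the kernel the failure branch
would presumably be a printk or BUG() rather than a return code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for tcp_tw_bucket: only the death-list links. */
struct tw_model {
        struct tw_model *next_death;
        struct tw_model **pprev_death;
};

/*
 * Head insertion with a sanity check. Returns 0 on success, or -1
 * (leaving the list untouched) if tw is already the head of the slot.
 */
int tw_model_insert_checked(struct tw_model **tpp, struct tw_model *tw)
{
        assert(tpp != NULL && tw != NULL);

        if (*tpp == tw)
                return -1;      /* would create tw->next_death == tw */

        if ((tw->next_death = *tpp) != NULL)
                (*tpp)->pprev_death = &tw->next_death;
        *tpp = tw;
        tw->pprev_death = tpp;
        return 0;
}
```

This does not repair anything and does not catch a stale node
sitting deeper in the list; it would only flag the reinsertion
at the point where the original state is still intact.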
So the question is, is there any path that leads to
us erroneously freeing tcp_tw_bucket without taking it
off the death list?
I've been looking at the tw refcount manipulation
and am trying to identify any possible gratuitous
tcp_tw_put() calls, but haven't successfully isolated
any one yet.
Any ideas or pointers would be very much appreciated!
thanks,
Nivedita
---
From Beth Kon:
I see what is going on here... not sure how it got to this state.
Joe Garvey did excellent work gathering kdb info (and
graciously taught me a lot as he went along) and confirming that the
while loop in tcp_twkill is in an infinite loop.
Here is the code in tcp_twkill that is in an infinite loop:
        while((tw = tcp_tw_death_row[tcp_tw_death_row_slot]) != NULL) {
                tcp_tw_death_row[tcp_tw_death_row_slot] = tw->next_death;
                if (tw->next_death)
                        tw->next_death->pprev_death = tw->pprev_death;
                tw->pprev_death = NULL;
                spin_unlock(&tw_death_lock);
                tcp_timewait_kill(tw);
                tcp_tw_put(tw);
                killed++;
                spin_lock(&tw_death_lock);
        }
Using the data Joe gathered, here is what I see...
[0]kdb> rd
eax = 0x00000001 ebx = 0xc50a7840 ecx = 0xdf615478 edx = 0x00000001
esi = 0x061c3332 edi = 0x00000000 esp = 0xc03e7f10 eip = 0xc02be950
ebp = 0x00000000 xss = 0xc02e0018 xcs = 0x00000010 eflags = 0x00000282
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xc03e7edc
In the above register dump, the pointer to the tw being handled in the
tcp_twkill loop is in ebx.
The contents of the tw struct (annotated by me) are:
[0]kdb> mds %ebx tw
0xc50a7840 260f3c09 .<.& daddr
0xc50a7844 6d0f3c09 .<.m rcv_saddr
0xc50a7848 8200a3e5 å£.. dport, num
0xc50a784c 00000000 .... bound_dev_if
0xc50a7850 00000000 .... next
0xc50a7854 00000000 .... pprev
0xc50a7858 00000000 .... bindnext
0xc50a785c c26dcbc8 ÈËm bind_pprev
[0]kdb>
0xc50a7860 00820506 .... state, substate, sport
0xc50a7864 00000002 .... family
0xc50a7868 f9e3ccd0 ÐÌãù refcnt
0xc50a786c 00002a8f .*.. hashent
0xc50a7870 00001770 p... timeout
0xc50a7874 d4ad3cee î<Ô rcv_next
0xc50a7878 878fe09e .à.. send_next
0xc50a787c 000016d0 Ð... rcv_wnd
[0]kdb>
0xc50a7880 00000000 .... ts_recent
0xc50a7884 00000000 .... ts_recent_stamp
0xc50a7888 000353c1 ÁS.. ttd
0xc50a788c 00000000 .... tb
0xc50a7890 c50a7840 @x.Å next_death
0xc50a7894 00000000 .... pprev_death
0xc50a7898 00000000 ....
0xc50a789c 00000000 ....
The above shows that next_death in the structure == ebx, which means this
element of the linked list is pointing to itself. So it is in an infinite loop.
Assuming this is the last element on the linked list, next_death should be null.