(more info on my problems below, including original email
for Donald Becker's benefit)
On Mon, 10 Jun 2002, Adam Wozniak wrote:
> On a machine with an 82559ER ethernet chip
> Linux 2.5.5, with the 6/5/2002 eepro100.c
>
> Connect to some other machine via a crossover cable
>
> run iperf as a server on both machines
> run iperf as a client to the server on the other machine
>
> (i.e. we want an isolated network, full duplex, full traffic in both
> directions)
>
> [awozniak@rangers ~/linux-2.5.5]$ ksymoops -VKOL -m System239.map < oops239.txt
> ksymoops 2.4.0 on i686 2.4.2-2. Options used
> -V (specified)
> -K (specified)
> -L (specified)
> -O (specified)
> -m System239.map (specified)
>
> Unable to handle kernel paging request at virtual address 3938377f
> c01d1716
> *pde = 00000000
> Oops: 0000
> CPU: 0
> EIP: 0010:[<c01d1716>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010217
> eax: cdc42a08 ebx: 39383736 ecx: 00000004 edx: 00000000
> esi: cdc42b10 edi: cdc429a0 ebp: 00000009 esp: c0251ef0
> ds: 0018 es: 0018 ss: 0018
> Stack: cdc429a0 cdc42b10 cdc42a08 cdc429a0 c01d9dee cdc429a0 00000000 cda8e5c0
> 00000000 00000001 00000000 cdc429a0 0000e8c8 00000000 00000046 c01d9fc6
> cdc429a0 00000001 00000000 c0251fac 00000000 cdc429a0 c01d9f20 c011a933
> Call Trace: [<c01d9dee>] [<c01d9fc6>] [<c01d9f20>] [<c011a933>] [<c011779b>]
> [<c01176a4>] [<c01174cb>] [<c0109eac>] [<c0106cc0>] [<c0106cc0>]
> [<c0106cc0>]
> [<c0106cc0>] [<c0106ce4>] [<c0106d62>] [<c0105000>]
> Code: 0f b6 4b 49 45 f6 c1 82 74 0c 31 d2 89 96 70 01 00 00 0f b6
>
> >>EIP; c01d1716 <tcp_enter_loss+c6/1a0> <=====
> Trace; c01d9dee <tcp_retransmit_timer+17e/2b0>
> Trace; c01d9fc6 <tcp_write_timer+a6/f0>
> Trace; c01d9f20 <tcp_write_timer+0/f0>
> Trace; c011a933 <timer_bh+213/250>
> Trace; c011779b <bh_action+1b/50>
> Trace; c01176a4 <tasklet_hi_action+44/70>
> Trace; c01174cb <do_softirq+4b/90>
> Trace; c0109eac <do_IRQ+9c/b0>
> Trace; c0106cc0 <default_idle+0/30>
> Trace; c0106cc0 <default_idle+0/30>
> Trace; c0106cc0 <default_idle+0/30>
> Trace; c0106cc0 <default_idle+0/30>
> Trace; c0106ce4 <default_idle+24/30>
> Trace; c0106d62 <cpu_idle+32/50>
> Trace; c0105000 <_stext+0/0>
> Code; c01d1716 <tcp_enter_loss+c6/1a0>
> 00000000 <_EIP>:
> Code; c01d1716 <tcp_enter_loss+c6/1a0> <=====
> 0: 0f b6 4b 49 movzbl 0x49(%ebx),%ecx <=====
> Code; c01d171a <tcp_enter_loss+ca/1a0>
> 4: 45 inc %ebp
> Code; c01d171b <tcp_enter_loss+cb/1a0>
> 5: f6 c1 82 test $0x82,%cl
> Code; c01d171e <tcp_enter_loss+ce/1a0>
> 8: 74 0c je 16 <_EIP+0x16> c01d172c <tcp_enter_loss+dc/1a0>
> Code; c01d1720 <tcp_enter_loss+d0/1a0>
> a: 31 d2 xor %edx,%edx
> Code; c01d1722 <tcp_enter_loss+d2/1a0>
> c: 89 96 70 01 00 00 mov %edx,0x170(%esi)
> Code; c01d1728 <tcp_enter_loss+d8/1a0>
> 12: 0f b6 00 movzbl (%eax),%eax
>
> <0> Kernel panic: Aiee, killing interrupt handler!
> [awozniak@rangers ~/linux-2.5.5]$
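One way to read the oops above, offered as a sketch and not a diagnosis: the faulting instruction is "movzbl 0x49(%ebx),%ecx", so the bad address should be ebx plus 0x49, and the bytes in ebx can be checked for printable text. The helper name decode_register is made up for this illustration; the values are copied from the register dump.

```python
def decode_register(value):
    """Return the four bytes of a 32-bit register value as text,
    least significant byte first (i386 memory order).
    Non-printable bytes are shown as '.'."""
    raw = value.to_bytes(4, "little")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


ebx = 0x39383736       # ebx from the register dump
displacement = 0x49    # from "movzbl 0x49(%ebx),%ecx"

fault_address = ebx + displacement
print(f"{fault_address:08x}")   # 3938377f, the address in the oops
print(decode_register(ebx))     # 6789
```

The sum matches the reported paging-request address, and ebx holds four ASCII digits. That would be consistent with a pointer having been overwritten by data, but the oops alone does not show where such data came from.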
(Someone asked if we had enabled preemption; we had not.)
(This bug manifested in several forms. A common one is above.
Another common one was a hard lockup of the machine, no oops.)
Here are some details about our resolution. Dunno if they help
anyone or not. (These comments were not written by me.)
} ------- Additional Comments From XXXXXXXXX 2002-06-10 18:41 -------
}
} Long story short, they replaced the BIOS on the board (General Software)
} with a new one (Award) and we re-ran the test. It came up and ran for
} over an hour before we took it down to swap out the eepro100 driver
} (see below.) After bringing it back up, it ran again for 50 minutes
} until we packed it up and came home.
}
} After upgrading to the Award BIOS, we started getting an error message
} generated by the eepro100 driver. Something along the lines of "Command
} 0x0080 not accepted immediately, x ticks!" where x is some number larger
} than 100. Originally, there was a bug in this reporting code that made it
} report Command 0x0000 every time. We fixed that, which is why the test
} was interrupted as stated above: I needed to load the new module.
} This code is called when the ethernet chip doesn't respond to a
} command very quickly. A "tick" is an undefined amount of time: basically,
} how many times the code goes through a loop waiting for the command to
} finish. 100 seems to be an arbitrary number. All the numbers we saw were
} less than 110, so it's just barely missing the arbitrary cut-off point.
}
} The interesting bit is that the messages come at about the same interval
} as the crashes used to happen. With the PCI bus analyzer on the card,
} we saw that when the system crashed, the Ethernet chip was repeatedly
} requesting memory access from the North Bridge, which kept rejecting
} the requests, and neither would let up. It sorta makes sense that if the
} chip was programmed one way by the old BIOS to keep trying and never give
} up, it would slam the bus, but being programmed differently by the new
} BIOS, it'll retry or something similar.
}
} We never found out exactly what WAS happening, and what change in the
} BIOS fixed the problem. But we know that the Award BIOS appears to work
} and the General Software BIOS does not.
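The "tick" counting described in the comments above can be modelled as a polling loop. This is only a sketch in Python of the behaviour as described, not the eepro100 source: the names wait_for_cmd_done and fake_chip, the hard limit of 10000, and the choice to record the pending command at the moment the 100-tick threshold is crossed are all assumptions made for illustration.

```python
def fake_chip(cmd, busy_polls):
    """Hypothetical stand-in for the chip's command register:
    reads return cmd for the first busy_polls polls, then 0."""
    remaining = busy_polls

    def read_cmd():
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return cmd
        return 0

    return read_cmd


def wait_for_cmd_done(read_cmd, soft_limit=100, hard_limit=10000):
    """Poll until the command register reads 0 (command accepted).

    Returns (ticks, slow_cmd). ticks is the number of polls that found
    a command still pending. slow_cmd is the pending command seen when
    soft_limit was first exceeded, or None if it never was.
    """
    ticks = 0
    slow_cmd = None
    while ticks < hard_limit:
        cmd = read_cmd()
        if cmd == 0:
            break
        ticks += 1
        if ticks > soft_limit and slow_cmd is None:
            slow_cmd = cmd   # remember it while it is still pending
    return ticks, slow_cmd


ticks, slow_cmd = wait_for_cmd_done(fake_chip(0x0080, 105))
if slow_cmd is not None:
    print(f"Command {slow_cmd:#06x} not accepted immediately, {ticks} ticks!")
```

A tick here is one pass through the loop, so its duration depends on how long a register read takes on the machine in question. That is why a count such as 105 says little in absolute terms, only that the chip answered slightly later than the cut-off.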
--
Adam Wozniak (KG6GZR) COM DEV Broadband - Digital and Software Systems
awozniak@xxxxxxxxx 805 Aerovista Place, San Luis Obispo, CA 93401
http://www.comdev.cc
Voice: (805) 544-1089 Fax: (805) 544-2055