netdev

Re: Hard freeze (linux 2.6.7) or OOPS (linux 2.6.8.1) with e1000 + vlan, possible bug

To: Patrick <patrick@xxxxxxxxxxxxxx>
Subject: Re: Hard freeze (linux 2.6.7) or OOPS (linux 2.6.8.1) with e1000 + vlan, possible bug
From: Ben Greear <greearb@xxxxxxxxxxxxxxx>
Date: Mon, 13 Sep 2004 09:35:31 -0700
Cc: davem@xxxxxxxxxx, linux.nics@xxxxxxxxx, "'netdev@xxxxxxxxxxx'" <netdev@xxxxxxxxxxx>
In-reply-to: <20040913141059.GJ21600@xxxxxxxxxxxxxxxxxx>
Organization: Candela Technologies
References: <20040913141059.GJ21600@xxxxxxxxxxxxxxxxxx>
Sender: netdev-bounce@xxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.2) Gecko/20040803
Patrick wrote:
Hello,

I'm contacting you both because I believe there may be a problem in
the e1000 driver for linux, the vlan module or both.

There were some recent locking changes, which included a bug,
in the VLAN code.  This was fixed late last week, but I don't know
if the fix is in the version that you are running.

The 2.6.8.1 oops looks like it could be the bug introduced recently,
but I don't think that bug exists at all in 2.6.7.

I'm cc'ing netdev as well, maybe someone else has some better
ideas.  To trouble-shoot, any chance you could try with a different
NIC (maybe broadcom running the tg3 driver)?  Can you reproduce if
you do not use SAMBA?


I have a box with an Intel Xeon 2.40 GHz with two on-board Intel gigabit
ports and an additional two-port gigabit PCI card.
I'm using 3 of those 4 gigabit ports with the e1000 driver, and
some VLANs.
e1000 and 8021q are compiled as modules (loaded at boot via /etc/modules,
with 8021q listed before e1000).
Kernel output:
Linux version 2.6.8.1 (root@zatras) (gcc version 3.3.4 (Debian 1:3.3.4-4)) #1 SMP Mon Sep 13 10:31:31 CEST 2004
[..]
511MB LOWMEM available.
[..]
802.1Q VLAN Support v1.8 Ben Greear <greearb@xxxxxxxxxxxxxxx>
All bugs added by David S. Miller <davem@xxxxxxxxxx>
[..]
Intel(R) PRO/1000 Network Driver - version 5.2.52-k4
Copyright (c) 1999-2004 Intel Corporation.
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth3: e1000_probe: Intel(R) PRO/1000 Network Connection
[..]
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
e1000: eth3: e1000_watchdog: NIC Link is Up 10 Mbps Half Duplex
[..]
e1000: eth2: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex


The eth2 NIC currently has 3 VLANs.


Here is what is happening:
- with kernels 2.6.6 or 2.6.7: 2 or 3 times per day, the box freezes
completely (keyboard unresponsive), with nothing printed on the console or in
the log files. It does not seem to be related to network traffic (which is
very low) or anything else.

- with kernel 2.6.8.1: I get an OOPS right at boot and many
problems just after, which may give a hint about the problem with
the previous kernels

Here is the relevant log:
Sep 13 12:30:02 whitestar kernel: e08390f5
Sep 13 12:30:02 whitestar kernel: SMP
Sep 13 12:30:02 whitestar kernel: Modules linked in: af_packet md5 ipv6 8250 serial_core ipt_multiport ipt_MASQUERADE ipt_REJECT ipt_state ipt_limit ipt_LOG ip_nat_irc ip_nat_ftp iptable_nat iptable_mangle iptable_filter ip_conntrack_irc ip_conntrack_ftp ip_conntrack ip_tables dm_mod p4_clockmod speedstep_lib w83627hf_wdt w83627hf i2c_sensor i2c_isa i2c_core e1000 8021q
Sep 13 12:30:02 whitestar kernel: CPU:    0
Sep 13 12:30:02 whitestar kernel: EIP:    0060:[__crc_scm_detach_fds+103817/677563]    Not tainted
Sep 13 12:30:02 whitestar kernel: EFLAGS: 00010212   (2.6.8.1)
Sep 13 12:30:02 whitestar kernel: EIP is at e1000_shift_out_mdi_bits+0x22/0x8c [e1000]
Sep 13 12:30:02 whitestar kernel: eax: fffffffc   ebx: 00000001   ecx: 0000001f   edx: 00000000
Sep 13 12:30:02 whitestar kernel: esi: de70bc10   edi: dca05e6c   ebp: ffffffff   esp: dca05e64
Sep 13 12:30:02 whitestar kernel: ds: 007b   es: 007b   ss: 0068
Sep 13 12:30:02 whitestar kernel: Process snmpd (pid: 1025, threadinfo=dca04000 task=dc9f1390)
Sep 13 12:30:02 whitestar kernel: Stack: 00000000 c0374000 0000000a 00001820 de70bc10 dca05ee2 dca05f30 e0839301
Sep 13 12:30:02 whitestar kernel:        de70bc10 ffffffff 00000020 dca05ecc de70ba20 dca05edc e0836a0c de70bc10
Sep 13 12:30:02 whitestar kernel:        00000000 dca05ee2 dca05ecc de903005 dca05edc e0814688 de70b800 dca05ecc
Sep 13 12:30:02 whitestar kernel: Call Trace:
Sep 13 12:30:02 whitestar kernel:  [__crc_scm_detach_fds+104341/677563] e1000_read_phy_reg_ex+0x92/0xb3 [e1000]
Sep 13 12:30:02 whitestar kernel:  [__crc_scm_detach_fds+93856/677563] e1000_mii_ioctl+0x1c8/0x1ca [e1000]
Sep 13 12:30:02 whitestar kernel:  [__crc_journal_load+4760390/4806698] vlan_dev_ioctl+0xb5/0xe9 [8021q]
Sep 13 12:30:02 whitestar kernel:  [dev_ifsioc+851/957] dev_ifsioc+0x353/0x3bd
Sep 13 12:30:02 whitestar kernel:  [dev_ioctl+355/618] dev_ioctl+0x163/0x26a
Sep 13 12:30:02 whitestar kernel:  [inet_ioctl+142/158] inet_ioctl+0x8e/0x9e
Sep 13 12:30:02 whitestar kernel:  [sock_ioctl+238/641] sock_ioctl+0xee/0x281
Sep 13 12:30:02 whitestar kernel:  [sys_ioctl+273/605] sys_ioctl+0x111/0x25d
Sep 13 12:30:02 whitestar kernel:  [syscall_call+7/11] syscall_call+0x7/0xb
Sep 13 12:30:02 whitestar kernel: Code: 8b 02 d3 e3 0d 00 00 00 03 85 db 89 44 24 08 74 47 85 eb 74
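
The call path in the trace above (an ioctl from snmpd passing through vlan_dev_ioctl into the e1000 MII code) can be pulled out of the raw syslog text mechanically. The following is only a minimal sketch in Python: the `parse_frames` helper, the regular expression and the abbreviated `SAMPLE` text are illustrative assumptions, not part of any kernel tool, and frames without a module suffix are simply labelled "kernel".

```python
import re

# Matches entries such as "e1000_mii_ioctl+0x1c8/0x1ca [e1000]" or
# "dev_ioctl+0x163/0x26a" (no module suffix for built-in code).
FRAME_RE = re.compile(
    r"\b([A-Za-z_]\w*)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:[ \t]+\[(\w+)\])?"
)


def parse_frames(text):
    """Return (symbol, offset, size, module) tuples in trace order."""
    frames = []
    for sym, off, size, mod in FRAME_RE.findall(text):
        frames.append((sym, int(off, 16), int(size, 16), mod or "kernel"))
    return frames


# Abbreviated copy of the trace from the log (timestamps removed).
SAMPLE = """\
 e1000_read_phy_reg_ex+0x92/0xb3 [e1000]
 e1000_mii_ioctl+0x1c8/0x1ca [e1000]
 vlan_dev_ioctl+0xb5/0xe9 [8021q]
 [dev_ifsioc+851/957] dev_ifsioc+0x353/0x3bd
 [dev_ioctl+355/618] dev_ioctl+0x163/0x26a
"""

if __name__ == "__main__":
    for sym, off, size, mod in parse_frames(SAMPLE):
        print(f"{mod:>6}  {sym} (+{off:#x} of {size:#x})")
```

The decimal `[symbol+N/M]` annotations added by the syslog daemon are skipped on purpose, since only the `+0x` form carries the kernel's own symbol resolution.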

ksymoops says:
Error (regular_file): read_ksyms stat /proc/ksyms failed
ksymoops: No such file or directory
No modules in ksyms, skipping objects
No ksyms, skipping lsmod
Sep 13 12:30:02 whitestar kernel: e08390f5
Sep 13 12:30:02 whitestar kernel: CPU:    0
Sep 13 12:30:02 whitestar kernel: EIP:    0060:[__crc_scm_detach_fds+103817/677563]    Not tainted
Sep 13 12:30:02 whitestar kernel: EFLAGS: 00010212   (2.6.8.1)
Sep 13 12:30:02 whitestar kernel: eax: fffffffc   ebx: 00000001   ecx: 0000001f   edx: 00000000
Sep 13 12:30:02 whitestar kernel: esi: de70bc10   edi: dca05e6c   ebp: ffffffff   esp: dca05e64
Sep 13 12:30:02 whitestar kernel: ds: 007b   es: 007b   ss: 0068
Sep 13 12:30:02 whitestar kernel: Stack: 00000000 c0374000 0000000a 00001820 de70bc10 dca05ee2 dca05f30 e0839301
Sep 13 12:30:02 whitestar kernel:        de70bc10 ffffffff 00000020 dca05ecc de70ba20 dca05edc e0836a0c de70bc10
Sep 13 12:30:02 whitestar kernel:        00000000 dca05ee2 dca05ecc de903005 dca05edc e0814688 de70b800 dca05ecc
Sep 13 12:30:02 whitestar kernel: Call Trace:
Warning (Oops_read): Code line not seen, dumping what data is available



eax; fffffffc <__kernel_rt_sigreturn+1bbc/????>
esi; de70bc10 <__crc_cap_inode_removexattr+6c19a/188f0f>
edi; dca05e6c <__crc_wait_on_sync_kiocb+1c4c11/294abb>
ebp; ffffffff <__kernel_rt_sigreturn+1bbf/????>
esp; dca05e64 <__crc_wait_on_sync_kiocb+1c4c09/294abb>


Sep 13 12:30:02 whitestar kernel: Code: 8b 02 d3 e3 0d 00 00 00 03 85 db 89 44 24 08 74 47 85 eb 74
Using defaults from ksymoops -t elf32-i386 -a i386


Code;  00000000 Before first symbol
00000000 <_EIP>:
Code;  00000000 Before first symbol
   0:   8b 02                     mov    (%edx),%eax
Code;  00000002 Before first symbol
   2:   d3 e3                     shl    %cl,%ebx
Code;  00000004 Before first symbol
   4:   0d 00 00 00 03            or     $0x3000000,%eax
Code;  00000009 Before first symbol
   9:   85 db                     test   %ebx,%ebx
Code;  0000000b Before first symbol
   b:   89 44 24 08               mov    %eax,0x8(%esp,1)
Code;  0000000f Before first symbol
   f:   74 47                     je     58 <_EIP+0x58>
Code;  00000011 Before first symbol
  11:   85 eb                     test   %ebp,%ebx
Code;  00000013 Before first symbol
  13:   74 00                     je     15 <_EIP+0x15>


1 warning and 1 error issued.  Results may not be reliable.
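
One thing the decode above allows, despite the ksymoops warnings, is cross-checking the first instruction against the register dump. The sketch below is a rough illustration in Python, assuming that the Code: bytes begin at the faulting EIP (which is how ksymoops labels them); the `parse_registers` helper and the `DUMP` string are made up for this example. Under that assumption, `mov (%edx),%eax` is a load through edx, so the value of edx in the dump is the address being read.

```python
import re

# Register names as printed in the i386 oops register dump.
REG_RE = re.compile(r"\b(eax|ebx|ecx|edx|esi|edi|ebp|esp): ([0-9a-f]{8})\b")

# Register lines copied from the oops (timestamps removed).
DUMP = """\
eax: fffffffc   ebx: 00000001   ecx: 0000001f   edx: 00000000
esi: de70bc10   edi: dca05e6c   ebp: ffffffff   esp: dca05e64
"""


def parse_registers(text):
    """Map register name to its integer value."""
    return {name: int(value, 16) for name, value in REG_RE.findall(text)}


if __name__ == "__main__":
    regs = parse_registers(DUMP)
    # Assumption: the first decoded instruction, mov (%edx),%eax,
    # is the one that faulted, so edx holds the address being loaded.
    print(f"mov (%edx),%eax reads from address {regs['edx']:#010x}")
```

This only narrows down which pointer was bad at the time of the oops; it does not say why that pointer ended up with that value.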



I've tried with and without HyperThreading enabled in the BIOS, and with the
nosmp flag at boot, but I get the same results in all cases.
I've also tried the boot options noapic nolapic noacpi,
without change.

This message comes exactly 5 minutes after boot (probably due to snmp/mrtg
generating network traffic).
After that I run into problems: ifconfig hangs, for example
(it runs correctly with the other kernels). Here is the end of the strace:

uname({sys="Linux", node="whitestar", ...}) = 0
access("/proc/net", R_OK)               = 0
access("/proc/net/unix", R_OK)          = 0
socket(PF_FILE, SOCK_DGRAM, 0)          = 3
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 4
access("/proc/net/if_inet6", R_OK)      = 0
socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP) = 5
access("/proc/net/ax25", R_OK)          = -1 ENOENT (No such file or directory)
access("/proc/net/nr", R_OK)            = -1 ENOENT (No such file or directory)
access("/proc/net/rose", R_OK)          = -1 ENOENT (No such file or directory)
access("/proc/net/ipx", R_OK)           = -1 ENOENT (No such file or directory)
access("/proc/net/appletalk", R_OK)     = -1 ENOENT (No such file or directory)
access("/proc/sys/net/econet", R_OK)    = -1 ENOENT (No such file or directory)
access("/proc/sys/net/ash", R_OK)       = -1 ENOENT (No such file or directory)
access("/proc/net/x25", R_OK)           = -1 ENOENT (No such file or directory)
open("/proc/net/dev", O_RDONLY)         = 6
fstat64(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40018000
read(6, "Inter-|   Receive               "..., 1024) = 1024
read(6, "44    0    0    0     0       0 "..., 1024) = 292
read(6, "", 1024)                       = 0
close(6)                                = 0
munmap(0x40018000, 4096)                = 0
ioctl(4, SIOCGIFCONF, {


and sits there indefinitely.

The Samba process (nmbd) is then in uninterruptible sleep (according to
ps), whereas it runs correctly under previous kernel versions.
When I try to shut down, the box hangs while deconfiguring all network
interfaces.


When I stress test with multiple ping -f/crashme/bonnie++ runs in parallel,
the box has no problem and does not freeze.


Can you please let me know if you believe this to be a kernel bug and
in which part exactly, and/or what I can do to alleviate the problem?
The box is used in production as a firewall and was running correctly
until I started to use VLANs (3 currently) and Samba.

Thanks for your help in advance, and do not hesitate to let me know
if I have forgotten to include needed information.

Regards.
Patrick Mevzek.



--
Ben Greear <greearb@xxxxxxxxxxxxxxx>
Candela Technologies Inc  http://www.candelatech.com

