Suggestions with hard lockup on 4 systems, have oops report

To: kaos@xxxxxxxxxx, <linux-kernel@xxxxxxxxxxxxxxx>, <akpm@xxxxxxxxxx>, <jgarzik@xxxxxxxxx>, <netdev@xxxxxxxxxxx>
Subject: Suggestions with hard lockup on 4 systems, have oops report
From: Brian McEntire <brianm@xxxxxxxxxxxxxxxxx>
Date: Fri, 16 Jul 2004 11:01:39 -0400 (EDT)
Sender: netdev-bounce@xxxxxxxxxxx
Thank you for taking time from your busy days to read this. You all
(kernel maintainers) rock!  :)

I have four Linux hosts, with identical hardware and OSs, that exhibit a
very tough-to-troubleshoot hang/freeze. Typically about once every two
weeks (though occasionally as often as a couple of times a day) these
systems lock up. I
cannot ping them, cannot toggle caps lock or num lock, nor get any mouse
movement. Even the Magic SysRQ key, enabled just while troubleshooting
this issue, does not respond. (I have tested to make sure it works when
the systems are not frozen.) The screen can be blank or frozen with a
GNOME desktop visible.

When it locks, it does not write an OOPS to the screen, nor to
/var/log/messages, nor to the remote log host, although syslog.conf is set
up to log to it and does so under normal conditions. I connected a serial
terminal and was able to capture an oops report there, and have run it
through ksymoops on the system where it was captured. The ksymoops output
is attached to this e-mail.

This has proven impossible (so far) to replicate on demand. I've tried
looping kernel compiles to stress test the CPUs, and repeated that test
with the source on an NFS-mounted file system to stress the network
subsystem, but I can't force the hang to happen. I ran memtest86 and the
RAM checks out okay. I don't think it could be a hardware issue since it
affects all 4 identical systems. I also remade the swap partition with -c
to check and mark bad blocks. None of this fixed the hang.

The systems are:
  Dual Xeon 2.4GHz processors
  2 GB RAM
  2 GB swap
  Ethernet controller: PCI device 14e4:16a7 (BROADCOM Corporation) (rev 2)
  Dual channel SCSI storage controller: PCI device 1000:0030 (Symbios
    Logic Inc. (formerly NCR)) (rev 7)
  VGA compatible controller: nVidia Corporation NV11
  Matrox Graphics, Inc. MGA G400 AGP

The OS specifics:
  RH 7.2 with latest patches, except running kernel 2.4.9-31enterprise
    for CM reasons. (At one point I tried the latest available RH 7.2
    kernel, but it did not improve stability, so I went back.)
  bcm5700-7.1.22-1
  nvidia ??  (no RPM listed, didn't know where to find the version.)


I have been logging /proc/meminfo and `uptime` into a file in /tmp every
minute. Load isn't usually above 1 when the systems lock. 0-10 people are
on the system. Sometimes it happens during work hours, other times over
night or over the weekend when no one is running any interactive commands.
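
For anyone who wants to set up the same kind of once-a-minute logging, here
is a minimal sketch in Python. It is not the script used on these hosts. It
assumes a Python 3 interpreter is available (not a given on RH 7.2), reads
/proc/loadavg instead of running `uptime`, and the names parse_meminfo,
format_sample, log_forever and the default interval are made up for
illustration. Each sample is fsync'ed so the last line before a hang has a
better chance of reaching the disk.

```python
import os
import time

# Fields to record; LowFree/HighFree exist on highmem-enabled kernels.
WANTED = ("LowFree", "HighFree", "SwapTotal", "SwapFree")

def parse_meminfo(text):
    """Return {name: kB} for the 'Name:   12345 kB' lines of /proc/meminfo.

    Lines in other shapes (such as the byte-count summary rows that
    2.4 kernels print at the top) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def format_sample(stamp, loadavg_text, meminfo):
    """Build one log line from a timestamp, /proc/loadavg text and parsed meminfo."""
    load1 = loadavg_text.split()[0]  # first field is the 1-minute load average
    cols = ["%s=%s" % (key, meminfo.get(key, "?")) for key in WANTED]
    return "%s load1=%s %s" % (stamp, load1, " ".join(cols))

def log_forever(path, interval=60):
    """Append one sample to `path` every `interval` seconds until killed."""
    while True:
        with open("/proc/meminfo") as f:
            mem = parse_meminfo(f.read())
        with open("/proc/loadavg") as f:
            load = f.read()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(path, "a") as out:
            out.write(format_sample(stamp, load, mem) + "\n")
            out.flush()
            os.fsync(out.fileno())  # push the sample to disk before sleeping
        time.sleep(interval)
```

Calling log_forever("/tmp/memlog.txt") (a hypothetical path) would start the
loop; nothing runs until that call is made.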

* HighFree and especially LowFree usually approach zero just before the
hang. Also, although swap is enabled (`free` shows it and kswapd is
running), SwapUsed never goes above 0 for the duration of the uptime.
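
The pattern described above (LowFree near zero while no swap is in use) can
be checked mechanically over logged samples. The following is only a sketch:
it assumes each sample is a dict of kB values keyed by the /proc/meminfo
field names LowFree, SwapTotal and SwapFree, derives swap used as SwapTotal
minus SwapFree, and uses an arbitrary 4096 kB floor that is a placeholder,
not a known threshold. Low free memory by itself can simply reflect page
cache, so a flagged sample is a hint, not a diagnosis.

```python
def swap_used_kb(sample):
    """Swap in use, derived from the SwapTotal and SwapFree fields (kB)."""
    return sample["SwapTotal"] - sample["SwapFree"]

def low_memory_without_swap(sample, low_free_floor_kb=4096):
    """True if LowFree is under the floor while no swap at all is in use.

    The 4096 kB default is an arbitrary illustration value.
    """
    return sample["LowFree"] < low_free_floor_kb and swap_used_kb(sample) == 0

def flag_samples(samples, low_free_floor_kb=4096):
    """Return the indexes of the samples that match the pattern."""
    return [
        i for i, sample in enumerate(samples)
        if low_memory_without_swap(sample, low_free_floor_kb)
    ]
```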

Any ideas? I understand this is an old kernel and BCM5700 is a proprietary
driver module so there may be little anyone can offer. But, just in case
the swap is an issue, or the bug stems from something other than just the
network module, I wanted to send a report in and see if anyone has ideas
or fixes.

Thank you so much!
  Brian McEntire

[ksymoops output attached.]

Also, I tried Linus's trick to 'disassemble the "Code:" part' ... this
doesn't mean anything to me, but maybe it's a big clue to someone else:

(gdb) disassemble str
Dump of assembler code for function str:
0x8049384 <str>:        movl   $0x0,(%ebx)
0x804938a <str+6>:      mov    0xffffffec(%ebp),%ecx
0x804938d <str+9>:      mov    %ebx,0x4(%esp,1)
0x8049391 <str+13>:     mov    %ecx,(%esp,1)
0x8049394 <str+16>:     call   0x804fd15
0x8049399 <str+21>:     add    %al,(%eax)
0x804939b <str+23>:     add    %al,(%eax)
End of assembler dump.
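
For reference, the trick amounts to putting the bytes from the oops "Code:"
line into a C string named str, compiling it, and asking gdb to disassemble
that symbol. A small Python sketch that generates such a C file is below.
The helper names code_bytes and c_source are invented for illustration, and
the parsing assumes the usual space-separated two-digit hex bytes, with the
faulting byte optionally wrapped in angle brackets.

```python
import re

def code_bytes(code_line):
    """Extract byte values from an oops 'Code:' line.

    Tokens that are not exactly two hex digits (after stripping the
    <..> marker around the faulting byte) are ignored.
    """
    text = code_line.split("Code:", 1)[-1]
    values = []
    for token in text.split():
        token = token.strip("<>")
        if re.fullmatch(r"[0-9a-fA-F]{2}", token):
            values.append(int(token, 16))
    return values

def c_source(byte_values):
    """Return C source defining `char str[]` that holds the given bytes."""
    escaped = "".join("\\x%02x" % b for b in byte_values)
    return (
        'char str[] = "%s";\n'
        "\n"
        "int main(void)\n"
        "{\n"
        "    return 0;\n"
        "}\n"
    ) % escaped
```

Writing the result to a file (say codedump.c, a hypothetical name),
building it with `gcc -g -o codedump codedump.c`, and running
`disassemble str` inside `gdb codedump` gives a dump like the one above.
The addresses shown belong to the test program, not to the kernel, and the
trailing `add %al,(%eax)` lines are most likely zero bytes past the end of
the captured code.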

Attachment: oopstrace
Description: Text document
