netdev

Re: Suggestions with hard lockup on 4 systems, have oops report

To: Brian McEntire <brianm@xxxxxxxxxxxxxxxxx>
Subject: Re: Suggestions with hard lockup on 4 systems, have oops report
From: Adam Kropelin <akropel1@xxxxxxxxxxxxxxxx>
Date: Fri, 16 Jul 2004 14:08:12 -0400
Cc: kaos@xxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, akpm@xxxxxxxxxx, jgarzik@xxxxxxxxx, netdev@xxxxxxxxxxx
In-reply-to: <Pine.LNX.4.44.0407161025030.20914-200000@xxxxxxxxxxxxxxxxx>; from brianm@xxxxxxxxxxxxxxxxx on Fri, Jul 16, 2004 at 11:01:39AM -0400
References: <Pine.LNX.4.44.0407161025030.20914-200000@xxxxxxxxxxxxxxxxx>
Sender: netdev-bounce@xxxxxxxxxxx
User-agent: Mutt/1.2.5.1i

On Fri, Jul 16, 2004 at 11:01:39AM -0400, Brian McEntire wrote:
> Thank you for taking time from your busy days to read this. You all
> (kernel maintainers) rock!  :)
> 
> I have four Linux hosts, with identical hardware and OSs, that exhibit a
> very tough to troubleshoot hang/freeze.  About once every two weeks (and

<snip>

> The OS specifics:
>   RH 7.2 with latest patches except running kernel 2.4.9-31enterprise for 
> CM reasons (at one point, I tried the latest available RH 7.2 kernel but 
> it did not improve stability so I went back.)
>  bcm5700-7.1.22-1
>  nvidia ??  (no RPM listed, didn't know where to find the version.)

You've really got to eliminate the binary bcm5700 and nvidia modules in
order to diagnose this. Based on the oops, bcm5700 looks suspect, but it
could just be the unlucky guy whose memory was stepped on by nvidia or
some other part of the kernel.

Switch to a NIC with an open-source driver, like e1000, temporarily (or
better yet, permanently) and see if the lockup persists. Do the same with
nvidia. If you can reproduce the problem on a boot where neither module
was ever loaded (unloading a module after it has been loaded is not
sufficient), post the new oops and you'll have a solid foundation for
debugging.
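
As a rough way to confirm a test boot is clean, something like the sketch
below could be run right after booting. It only parses /proc/modules, whose
first field on each line is the module name, so it shows what is loaded
right now and cannot prove a module was never loaded earlier in the same
boot. The names "bcm5700" and "nvidia" are assumptions taken from the
package names above (check lsmod for the names your system actually uses),
the helper names are made up for illustration, and it assumes a reasonably
modern Python.

```python
# Hypothetical helper names; module names are assumptions, verify with lsmod.
SUSPECT_MODULES = ("bcm5700", "nvidia")


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated field is the module name.
            names.add(fields[0])
    return names


def suspects_present(proc_modules_text, suspects=SUSPECT_MODULES):
    """Return the suspect modules that are currently loaded, sorted."""
    loaded = loaded_modules(proc_modules_text)
    return sorted(name for name in suspects if name in loaded)


if __name__ == "__main__":
    try:
        with open("/proc/modules") as f:
            text = f.read()
    except OSError:
        # Not a Linux host, or /proc is not mounted.
        text = ""
    found = suspects_present(text)
    if found:
        print("binary modules loaded: " + ", ".join(found))
    else:
        print("no suspect modules currently loaded")
```

Run it once per fresh boot, before starting X or bringing up the network
by hand, and keep the output alongside any new oops you post.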

--Adam

