
Re: Kernel dump from 2.4.18-27SGI_XFS1.2.0

To: Seth Mos <knuffie@xxxxxxxxx>
Subject: Re: Kernel dump from 2.4.18-27SGI_XFS1.2.0
From: Austin Gonyou <austin@xxxxxxxxxxxxxxx>
Date: 04 Jun 2003 09:49:52 -0500
Cc: XFS List <linux-xfs@xxxxxxxxxxx>
In-reply-to: <4.3.2.7.2.20030604143455.024e1e88@xxxxxxxxxxxxx>
Organization: Coremetrics, Inc.
References: <4.3.2.7.2.20030604143455.024e1e88@xxxxxxxxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
On Wed, 2003-06-04 at 07:39, Seth Mos wrote:
> At 13:57 2-6-2003 -0500, Austin Gonyou wrote:
> >Here is the stack trace from this kernel. This is 2.4.18-27 errata + XFS
> >1.2.0 release. I did re-work the spec file, but only to disable options
> >we don't want and also re-worked the config to our liking. I have one
> >patch I apply, but it is to scsi_scan.c for a BLIST entry for our
> >hardware. Overall, the core kernel config and source is relatively
> >unchanged. We patch nothing else, and just use RH's src.rpm to create
> >our i686 rpm. Usually, right before the crash, all the fiber channel
> >devices go unaccessible, local are still ok, and whole system is XFS,
> >then the poof. This only seems to occur during the load test I put this
> >thing through. If anyone would like to see it, I'd be happy to provide
> >the info.
> 
> I'm afraid I can't help you with that. I did patch the kernel but I'm not
> skilled enough to actually be a kernel hacker. :-/

Understood. Though you seem to have some pretty good skillz either way.
;) 

> No one else has responded yet, which is a shame.

Yeah, that does suck, but I'm glad you did. I think in time this will
get sorted out, but this is one of the best stack traces I've been able
to get from any of the crashes I've ever had. :) 

> Can you tell me what hardware you are using? 

This is a Dell 6650 4x 1.6GHz and a single qla2310. (our production
configuration is qla2200F's but will eventually get back to 2300
series.)

> It doesn't really sound like an XFS issue is the problem here, my best
> guess is that something might be causing a reset in the fibre channel
> array which is causing a driver to stall.

I believe it may not be, since I've noticed that only the FC attached
LUNs have the problem, prior to the box hanging. Since I'm load-testing
this box right now, it is a short period of time between storage outage,
and crash. That was why I was very happy to even get a stack trace at
all. 

> I currently have the same problem with megaraid cards without
> optimizations and the megaraid v2 driver. During high disk IO it stalls
> so long that the write eventually returns an error and the box hangs.

IC. I have changed the driver set I'm using from the 6.01-FO drivers, to
the 6.04-FO drivers. The box actually seems *much* more responsive but
I'm going to watch the swap and see if there's some kind of memory leak
issue possibly with the amount of IO I'm doing. (i.e. driver leaking
under the load possibly?)

Overall, I was just interested to see if anyone on this list can say
what they believe the problem might be, given the stack trace and any
other information I can offer. I think XFS was noted because the IO was
going on, and all the FS are XFS anyway. Just a thought.

> Cheers
> 
> --
> Seth
> It might just be your lucky day, if you only knew.
-- 
Austin Gonyou <austin@xxxxxxxxxxxxxxx>
Coremetrics, Inc.

