
RE: Xfs Access to block zero exception and system crash

To: "Dave Chinner" <dchinner@xxxxxxxxx>
Subject: RE: Xfs Access to block zero exception and system crash
From: "Sagar Borikar" <Sagar_Borikar@xxxxxxxxxxxxxx>
Date: Wed, 25 Jun 2008 23:46:59 -0700
Cc: <xfs@xxxxxxxxxxx>
In-reply-to: <20080625084931.GI16257@xxxxxxxxxxxxxxxxxxxxx>
References: <340C71CD25A7EB49BFA81AE8C839266701323BD8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <20080625084931.GI16257@xxxxxxxxxxxxxxxxxxxxx>
Sender: xfs-bounce@xxxxxxxxxxx
Thread-index: AcjWoHzZdt/yFnu+SPeMBnqVvfPNWQAp11BA
Thread-topic: Xfs Access to block zero exception and system crash

Thanks Dave.

>> with 2.6.18 kernel, 128 MB of RAM, MIPS architecture and XFS version
>> 2.8.11.

> [...]

>> Can anyone let me know what could be the probable cause of this issue.

> They are all from corrupted extent btrees.
> There are many possible causes of this that we've fixed over the past years
> since 2.6.18 was released. Indeed, we are currently discussing fixes for a
> bunch of problems that lead to corrupted extent btrees and problems like
> this. I'd suggest that you should probably start with a more recent kernel,
> make sure you have a serial console and set the xfs_error_level to 11 so that
> it gives as much information as possible on the console when the error is
> hit.
> If that doesn't give a stack trace, then you need to set the xfs_panic_mask
> to crash the machine on block zero accesses and report the stack traces
> that it outputs...
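
For reference, a minimal sketch of applying the two settings suggested above. It assumes the XFS sysctls are exposed under /proc/sys/fs/xfs as error_level and panic_mask, and that the block zero panic bit is 0x80 (the mainline value of XFS_PTAG_FSBLOCK_ZERO). Both are assumptions for a 2.6.18-based tree and should be checked against the running kernel's xfs_error.h; the helper names are hypothetical.

```python
from pathlib import Path

# Assumed location of the XFS sysctls; verify on the target kernel.
XFS_SYSCTL_DIR = Path("/proc/sys/fs/xfs")

# Assumed panic-mask bit for block zero accesses. 0x80 is the mainline
# value of XFS_PTAG_FSBLOCK_ZERO; an older or vendor tree may differ
# or may reject it, so check xfs_error.h before relying on it.
XFS_PTAG_FSBLOCK_ZERO = 0x80


def desired_settings(error_level=11, panic_bits=XFS_PTAG_FSBLOCK_ZERO):
    """Return a mapping of sysctl name to the string value to write."""
    return {"error_level": str(error_level), "panic_mask": str(panic_bits)}


def apply_settings(settings, base=XFS_SYSCTL_DIR):
    """Write each setting under base; return names that could not be written."""
    failed = []
    for name, value in settings.items():
        try:
            (base / name).write_text(value + "\n")
        except OSError:
            # Missing sysctl, no permission, or value rejected by the kernel.
            failed.append(name)
    return failed


if __name__ == "__main__":
    not_set = apply_settings(desired_settings())
    if not_set:
        print("could not set:", ", ".join(not_set))
```

This needs to run as root on the affected machine with the serial console already attached. With the panic bit set, the machine is expected to crash on the next block zero access, so it is not something to leave enabled on a unit that must stay up.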

Yes, I went through the changes between 2.6.18 and 2.6.24 and there are quite a
few. But as this is a production system deployed in the field, it's not viable
to upgrade the kernel. I do understand that there could be many places that can
cause the corruption. Unfortunately, three different systems have shown
corruption in three different places, as stated earlier. For now I sleep and
reschedule in the access to block zero exception path so that it won't stall
the system and I can monitor the state of the filesystem. Since the error shows
up only about once every 2.5 days under extreme stress, if you could point me
to the most probable place to look, I can narrow down the debugging path.

Thanks in advance
