Re: Xfs Access to block zero exception and system crash

To: Sagar Borikar <Sagar_Borikar@xxxxxxxxxxxxxx>
Subject: Re: Xfs Access to block zero exception and system crash
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Thu, 26 Jun 2008 17:02:15 +1000
Cc: xfs@xxxxxxxxxxx
In-reply-to: <340C71CD25A7EB49BFA81AE8C839266701323BE8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Mail-followup-to: Sagar Borikar <Sagar_Borikar@xxxxxxxxxxxxxx>, xfs@xxxxxxxxxxx
References: <340C71CD25A7EB49BFA81AE8C839266701323BD8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <20080625084931.GI16257@xxxxxxxxxxxxxxxxxxxxx> <340C71CD25A7EB49BFA81AE8C839266701323BE8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Sender: xfs-bounce@xxxxxxxxxxx
User-agent: Mutt/1.5.17+20080114 (2008-01-14)
[please wrap your replies at 72 columns]

On Wed, Jun 25, 2008 at 11:46:59PM -0700, Sagar Borikar wrote:
> >> with 2.6.18 kernel,128 MB of RAM, MIPS architecture and XFS version 
> >> 2.8.11.
> > [...]
> >> Can anyone let me know what could be the probable cause of this issue.
> > they are all from corrupted extent btrees. There are many
> > possible causes of this that we've fixed over the years since
> > 2.6.18 was released. Indeed, we are currently discussing fixes
> > for a bunch of problems that lead to corrupted extent btrees
> > and problems like this. I'd suggest that you start with a more
> > recent kernel, make sure you have a serial console, and set
> > xfs_error_level to 11 so that it gives as much information as
> > possible on the console when the error is hit. If that doesn't
> > give a stack trace, then you need to set xfs_panic_mask to
> > crash the machine on block-zero accesses and report the stack
> > traces that it outputs...
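
For reference, those knobs are exposed as sysctls; a minimal
sketch, assuming a 2.6.x kernel where they live under fs.xfs. The
0xff mask value is an assumption - check fs/xfs/xfs_sysctl.h in
your kernel tree for the exact XFS_PTAG_* bits you want:

```shell
# Raise XFS error verbosity so corruption reports dump as much
# state as possible to the (serial) console.
sysctl fs.xfs.error_level=11

# Make XFS panic on tagged errors so a stack trace is emitted;
# 0xff (all bits set) is an assumption - pick the XFS_PTAG_*
# bits you need from your kernel's fs/xfs/xfs_sysctl.h.
sysctl fs.xfs.panic_mask=0xff
```

The same values can be written directly, e.g.
echo 11 > /proc/sys/fs/xfs/error_level.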
> Yes, I went through the changes between 2.6.18 and 2.6.24 and
> there are quite a few. But as this is a production system in
> the field, it's not viable to upgrade the kernel.

Well, you're pretty much on your own then :/

> I do understand that there
> could be many places which can cause the corruption.
> Unfortunately, three different systems have given three different
> places of corruption as stated.

Yes, but they all show the same pattern of corruption, so it is
likely a single problem.

> Now I am sleeping in the access-to-block-zero exception and
> rescheduling so that it won't stall the system and I can
> monitor the state of the filesystem. As the error is only hit
> once every 2.5 days under extreme stress, if you could point me
> to the probable place to look, I can narrow down the debugging
> path.

Like I said - it's a corrupt bmap btree. It could be a bug in the
bmap btree code, the alloc btree code, or the inode data fork
manipulation code; it could be a block device bug returning bad
data to XFS on a cancelled btree readahead, etc. IOWs, there are
so many possible causes of a corrupted btree that a bug report by
itself is mostly useless.

All I can suggest is working out a reproducible test case in your
development environment, attaching a debugger, and digging around
in memory when the problem is hit to find out exactly what is
corrupted. If you can't reproduce it or work out what is
triggering the problem, then we're not going to be able to find
the cause...
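If you can capture the filesystem after a hit, a read-only pass
with the standard xfsprogs tools may at least narrow down which
on-disk structure is damaged. A sketch, with /dev/sdb1 standing
in for your actual device and inode 128 as a hypothetical suspect
inode:

```shell
# Read-only consistency check (run with the filesystem
# unmounted); reports corrupt extent/bmap btrees without
# modifying the disk.
xfs_repair -n /dev/sdb1

# Open the device read-only in xfs_db and dump the block map of
# a suspect inode to see where its extent btree points.
xfs_db -r -c 'inode 128' -c 'bmap' /dev/sdb1
```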


Dave Chinner
