Re: raid5: I lost a XFS file system due to a minor IDE cable problem

To: Pallai Roland <dap@xxxxxxxxxxxxx>
Subject: Re: raid5: I lost a XFS file system due to a minor IDE cable problem
From: David Chinner <dgc@xxxxxxx>
Date: Tue, 29 May 2007 09:36:17 +1000
Cc: David Chinner <dgc@xxxxxxx>, Linux-Raid <linux-raid@xxxxxxxxxxxxxxx>, xfs@xxxxxxxxxxx
In-reply-to: <200705281730.53343.dap@xxxxxxxxxxxxx>
References: <200705241318.30711.dap@xxxxxxxxxxxxx> <20070525000547.GH85884050@xxxxxxx> <200705281453.55618.dap@xxxxxxxxxxxxx> <200705281730.53343.dap@xxxxxxxxxxxxx>
Sender: xfs-bounce@xxxxxxxxxxx
User-agent: Mutt/1.4.2.1i

On Mon, May 28, 2007 at 05:30:52PM +0200, Pallai Roland wrote:
> 
> On Monday 28 May 2007 14:53:55 Pallai Roland wrote:
> > On Friday 25 May 2007 02:05:47 David Chinner wrote:
> > > "-o ro,norecovery" will allow you to mount the filesystem and get any
> > > uncorrupted data off it.
> > >
> > > You still may get shutdowns if you trip across corrupted metadata in
> > > the filesystem, though.
> >
> > This filesystem is completely dead.
> > [...]
> 
>  I tried to make a md patch to stop writes if a raid5 array got 2+ failed 
> drives, but I found it's already done, oops. :) handle_stripe5() ignores 
> writes in this case quietly, I tried and works.

Hmmm - it clears the uptodate bit on the bio, which is supposed to
make the bio return EIO. That looks to be doing the right thing...
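
For illustration, here is a minimal userspace sketch of what "return EIO" should look like to a caller: a failed read surfaces as an error, never as data. The helper name read_block is hypothetical, and the code only shows the errno plumbing through os.pread (Unix only); it does not reproduce md's or the bio layer's behaviour.

```python
import errno
import os


def read_block(fd, offset, length):
    """Read `length` bytes at `offset` from an open file descriptor.

    Returns (data, None) on success, or (None, errno_value) when the
    kernel reports a failure. On a block device whose lower layer failed
    the bio, the expected errno_value would be errno.EIO.
    """
    try:
        data = os.pread(fd, length, offset)
    except OSError as exc:
        return None, exc.errno
    return data, None


def is_io_error(err):
    """True if the error code means the device could not supply the data."""
    return err == errno.EIO
```

A caller that gets (None, errno.EIO) knows the data is unavailable and can stop; a layer that hands back bytes instead gives the caller no way to tell.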

>  There's an another layer I used on this box between md and xfs: loop-aes. I 

Oh, that's a kind of important thing to forget to mention....

> used it since years and rock stable, but now it's my first suspect, cause I 
> found a bug in it today:
>  I assembled my array from n-1 disks, and I failed a second disk for a test 
> and I found /dev/loop1 still provides *random* data where /dev/md1 serves 
> nothing, it's definitely a loop-aes bug:

.....

>  It's not an explanation to my screwed up file system, but for me it's enough 
> to drop loop-aes. Eh.

If you can get random data back instead of an error from the block device,
then I'm not surprised your filesystem is toast. If it's one sector in a
larger block that is corrupted, then the only thing that will protect you from
this sort of corruption causing problems is metadata checksums (yet another
thing on my list of stuff to do).
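
A minimal sketch of the metadata checksum idea, under loud assumptions: a 4096-byte block made of 512-byte sectors, a 4-byte CRC stored at the start of the block, and zlib's plain CRC32 as a stand-in checksum. None of this is XFS's on-disk format; the names seal, verify and corrupt_sector are hypothetical.

```python
import zlib

SECTOR = 512    # assumed sector size
BLOCK = 4096    # assumed metadata block size
CRC_LEN = 4     # checksum stored in the first 4 bytes of the block


def seal(payload):
    """Prefix the payload with its CRC32 so the result fills one block."""
    if len(payload) != BLOCK - CRC_LEN:
        raise ValueError("payload must be BLOCK - CRC_LEN bytes")
    return zlib.crc32(payload).to_bytes(CRC_LEN, "little") + payload


def verify(block):
    """Recompute the CRC over the payload and compare with the stored one."""
    stored = int.from_bytes(block[:CRC_LEN], "little")
    return zlib.crc32(block[CRC_LEN:]) == stored


def corrupt_sector(block, index, filler=b"\xa5"):
    """Return a copy of the block with one whole sector overwritten."""
    start = index * SECTOR
    return block[:start] + filler * SECTOR + block[start + SECTOR:]


payload = (bytes(range(256)) * 16)[:BLOCK - CRC_LEN]
good = seal(payload)
bad = corrupt_sector(good, 3)   # one sector of garbage inside the block

print("intact block verifies:", verify(good))
print("damaged block verifies:", verify(bad))
```

The point of the check is that a reader can reject the damaged block on read instead of acting on garbage metadata. A CRC only detects the damage; it cannot repair it.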

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
