
Re: Segfault of xfs_repair during repair of a xfs filesystem

To: Rainer Krienke <krienke@xxxxxxxxxxxxxx>
Subject: Re: Segfault of xfs_repair during repair of a xfs filesystem
From: Eric Sandeen <sandeen@xxxxxxx>
Date: Mon, 5 Jan 2004 09:13:08 -0600 (CST)
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <200401050850.03928.krienke@uni-koblenz.de>
Sender: linux-xfs-bounce@xxxxxxxxxxx
On Mon, 5 Jan 2004, Rainer Krienke wrote:

> Hello,
> 
> a happy new year to everyone on this list ...
> 
> well, for some of my xfs filesystems the year had a bad start. Due to a 
> power failure on some systems that possibly rebooted and later crashed again 
> due to another power failure, some xfs filesystems were damaged. I could not 
> mount these filesystems and so ran xfs_repair -L.

What was the failure when you first tried to mount?

> For one filesystem (~150GB) this worked. xfs_repair reported some errors in 
> the filesystem but finished its work. Next I tried to mount this filesystem, 
> but mount complained that it could not find a valid superblock. So I ran 
> xfs_repair once again. It still found some errors (but fewer than before). 
> Next I rebooted the machine and the filesystem was mounted.

Unable to find a superblock immediately after repair?  I have never
seen this before; it sounds very odd.  BTW, running repair twice in
succession will keep finding "errors" because the first run
creates /lost+found, and the second run unlinks /lost+found and then
rediscovers everything that was in it as disconnected.  This
is a "feature" that I'd like to change some day.  Moving /lost+found
out of the way after the first run should make repair run cleanly the
next time, unless something is really still wrong with the filesystem.

In any case a successful xfs_repair run followed by a bad superblock
is a big red flag indicating... something.  :)

> Another filesystem on a second machine (~40GB), which strangely enough had 
> been mounted on reboot but was not accessible (ls /filesystem: input/output 
> error), could not be repaired by xfs_repair:

Sounds like the filesystem shut down due to some error; can you check
your logs?  In fact, checking your logs in general might be useful
here; I wonder if there is anything else going on.
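
Something along these lines, assuming a stock syslog setup (the log
file name may differ on SuSE):

    dmesg | grep -i xfs
    grep -i xfs /var/log/messages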

> again I had to use -L since after unmounting I was no longer able to mount 
> it. xfs_repair reported a lot more errors compared to the first filesystem 
> above, and then, in pass 4 I think, when traversing the directory hierarchy 
> from /, it segfaulted. A second and third run of xfs_repair each produced 
> more errors but always ended in a segfault. In the meantime I recovered the 
> data from tape, but I still have the old broken filesystem for further 
> investigation if needed. 

Can you convince it to dump a corefile?  We could then have a better
idea of what's going wrong.
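
Roughly, in the shell you run repair from (the device name is just a
placeholder):

    ulimit -c unlimited
    xfs_repair /dev/vgXX/lvXX   # a "core" file should land in the cwd if it segfaults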

> Now I would like to know if this behaviour is "normal":
> 
> - Can a filesystem with transaction logging like xfs become inconsistent 
> because of a power failure? There is no disk failure!

It should not be inconsistent (disk+log should always be consistent).
You may have data loss, though, due to cached data that never made it
to disk.

> - Should xfs_repair find all errors in one run (like a regular fsck does) or 
> do I have to run it again and again until it reports no more errors?

See above; in general it should find all errors except for the
/lost+found "feature."

> - Is it a known issue that xfs_repair segfaults sometimes, or is it perhaps a 
> problem with my version (see below)?

I wonder if maybe it's failing a memory allocation for the large
filesystem; I thought I remembered a (fixed) problem like this
but I don't see it in the changelogs.  A little gdb debugging
would be a big help.
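
For instance, assuming xfs_repair is in /sbin and you kept a core file
from the crash:

    gdb /sbin/xfs_repair core
    (gdb) bt

or run it under gdb directly and grab a backtrace when it dies:

    gdb /sbin/xfs_repair
    (gdb) run /dev/vgXX/lvXX
    (gdb) bt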

> - Can I do something else to avoid corrupted xfs filesystems in case of a 
> crash.

You shouldn't have to do anything special; let's try to get to the
bottom of what happened here.

> The machine is a SuSE Linux 8.2 multiprocessor system with a SuSE-patched 
> kernel 2.4.21-144 (from a SuSE 9.0 system). The xfs kernel driver reports 
> version 1.3.1, but I think the filesystems in question were created with an 
> earlier kernel, I guess with driver version 1.3beta.  xfs_repair is version 
> 2.5.6. All filesystems are on logical volumes that are handled by lvm2. The 
> underlying (lvm) physical volume is a software RAID 1, to prevent data loss 
> due to a disk failure. 

Those look like pretty recent versions of everything.

-Eric


