
Re: Segfault of xfs_repair during repair of a xfs filesystem

To: linux-xfs@xxxxxxxxxxx
Subject: Re: Segfault of xfs_repair during repair of a xfs filesystem
From: Rainer Krienke <krienke@xxxxxxxxxxxxxx>
Date: Tue, 6 Jan 2004 09:05:41 +0100
Cc: Eric Sandeen <sandeen@xxxxxxx>
In-reply-to: <Pine.LNX.4.44.0401050900160.12604-100000@stout.americas.sgi.com>
Organization: Uni Koblenz
References: <Pine.LNX.4.44.0401050900160.12604-100000@stout.americas.sgi.com>
Sender: linux-xfs-bounce@xxxxxxxxxxx
User-agent: KMail/1.5.4
On Monday, 5 January 2004 16:13, Eric Sandeen wrote:
> On Mon, 5 Jan 2004, Rainer Krienke wrote:
> > Hello,
> >
> > a happy new year to everyone on this list ...
> >
> > well, for some of my xfs filesystems the year had a bad start. Due to a
> > power failure, some systems possibly rebooted and later crashed again
> > in another power failure, and some xfs filesystems were damaged. I could
> > not mount these filesystems and so ran xfs_repair -L.
>
> What was the failure when you first tried to mount?

I think originally, after I had started the machines (from power off), the 
filesystem on server1 (the one that could later be repaired) was mounted but 
not accessible: when I tried to list the directory, ls reported an I/O error. 
So I unexported it, unmounted it, and then tried to run xfs_repair, which 
reported that there was still a log on the filesystem that should be replayed 
by mounting the filesystem again, or discarded with xfs_repair -L. So I tried 
to mount it again, and now mount said that either there were too many mounts, 
a wrong filesystem type, or an invalid superblock. I tried another mount point 
and specified that this is an xfs filesystem (-t xfs), but this did not make 
any difference. The error message was the same as for the other (smaller) 
corrupted filesystem on server2, which could not be repaired. 
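
For completeness, the sequence was roughly the following (a minimal sketch; 
the device name and mount point are examples, not the real ones):

---------------------------------------------
# after un-exporting the filesystem from NFS:
umount /data

# first repair attempt: refused because of the dirty log
xfs_repair /dev/sdb1

# trying to replay the log by mounting again: failed with the
# "too many mounts / wrong fs type / bad superblock" message
mount -t xfs /dev/sdb1 /data

# last resort: zero the log, then repair
xfs_repair -L /dev/sdb1
---------------------------------------------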
...
>
> Unable to find a superblock immediately after repair?  I have never
> seen this before, sounds very odd.  BTW running repair twice in
> succession will keep finding "errors" because the first run
> creates /lost+found, and the second run unlinks /lost+found then
> rediscovers everything that was in it as disconnected.  This
> is a "feature" that I'd like to change some day.  Moving /lost+found
> after the first run should make repair run cleanly the next time,
> unless something is really still wrong with the filesystem.
>
> In any case a successful xfs_repair run followed by a bad superblock
> is a big red flag indicating... something.  :)
>
> > Another filesystem on a second machine (~40GB) that strangely enough had
> > been mounted on reboot but was not accessible (ls /filesystem:
> > input/output error) could not be repaired by xfs_repair:
>
> Sounds like the filesystem shut down due to some error, can you check
> your logs?  In fact checking your logs in general might be useful
> here, I wonder if there is anything else going on.

On the first machine (server1) I found a sequence of messages like the log 
attached to this mail. But these messages were generated upon startup after 
the power failure, not before. Before the power failure there is nothing 
xfs-related in the logs.

>
> > again I had to use -L since after unmounting I was no longer able to
> > mount it. xfs_repair reported a lot more errors compared to the first
> > filesystem from above and then, in pass 4 I think, when traversing the
> > directory hierarchy from /, it segfaulted. A second and third run of
> > xfs_repair produced more errors each time but always ended in a segfault.
> > In between I recovered the data from tape, but I still have the old
> > broken filesystem for further investigation if needed.
>
> Can you convince it to dump a corefile?  We could then have a better
> idea of what's going wrong.

Yes, I ran xfs_repair once again (perhaps the 5th time) on the corrupted 
filesystem on server2 and raised the ulimit for core dumps. It produced one. 
You can download it under

http://www.uni-koblenz.de/~krienke/core-xfs_repair.gz
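
(For reference, this is roughly how the core file was produced; a sketch, 
with an example device name:)

---------------------------------------------
ulimit -c unlimited    # allow core dumps of unlimited size in this shell
xfs_repair /dev/sdc1   # segfaults in phase 6 and leaves a core file behind
gzip -c core > core-xfs_repair.gz
---------------------------------------------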

It happens during phase 6. Part of the xfs_repair output of this run follows 
(the complete log is available under
http://www.uni-koblenz.de/~krienke/xfs_repair.log):
---------------------------------------------
...
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - ensuring existence of lost+found directory
        - traversing filesystem starting at / ...
rebuilding directory inode 128
empty data block 7 in directory inode 85967685: junking block
free block 16777216 entry 7 for directory ino 85967685 bad
rebuilding directory inode 85967685
rebuilding directory inode 132008213
...
rebuilding directory inode 158259325
        - traversal finished ...
        - traversing all unattached subtrees ...
segmentation fault
-------------------------------------------
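
In case it saves you a step: a backtrace can presumably be obtained from the 
core with something like the following (assuming xfs_repair is installed as 
/sbin/xfs_repair):

---------------------------------------------
gunzip core-xfs_repair.gz
gdb /sbin/xfs_repair core-xfs_repair
(gdb) bt    # print the stack at the point of the segfault
---------------------------------------------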

> > Now I would like to know if this behaviour is "normal":
> >
> > - Can a filesystem with transaction logging like xfs become inconsistent
> > because of a power failure? There is no disk failure!
>
> It should not be inconsistent (disk+log should always be consistent).
> You may have data loss though due to cached data which never makes it
> to disk.

Thanks for these explanations. Perhaps the caches in the hardware RAIDs that 
are used as the basis for data storage are to blame (IFT 7250 RAID (level 5) 
with 12 160GB IDE disks inside). I'll try to find out whether this cache is 
read-only or read/write.
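
If it turns out to be plain IDE write caching, the setting can at least be 
inspected and changed on directly attached disks with hdparm (device name is 
an example; the disks inside the IFT 7250 would presumably have to be checked 
through the controller's own firmware or utility instead):

---------------------------------------------
hdparm -W /dev/hda     # show the current write-cache setting
hdparm -W0 /dev/hda    # switch the drive's write cache off
---------------------------------------------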

Thanks
Rainer
-- 
---------------------------------------------------------------------------
Rainer Krienke, Universitaet Koblenz, Rechenzentrum, Raum A022
Universitaetsstrasse 1, 56070 Koblenz, Tel: +49 261287 -1312, Fax: -1001312
Mail: krienke@xxxxxxxxxxxxxx, Web: http://www.uni-koblenz.de/~krienke
Get my public PGP key: http://www.uni-koblenz.de/~krienke/mypgp.html
---------------------------------------------------------------------------

Attachment: pgpCB7YP7eqBD.pgp
Description: signature

Attachment: syslog.xfserror
Description: Text document
