Re: Bug in xfs_repair

To: Nathan Scott <nathans@xxxxxxx>
Subject: Re: Bug in xfs_repair
From: erik@xxxxxxxxxxxxxxxxx (Erik Tews)
Date: Tue, 17 Jun 2003 14:26:02 +0200
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <20030617065643.GG794@frodo>
References: <20030616205718.GA6783@xxxxxxxxxxxxxxxxx> <20030617065643.GG794@frodo>
Sender: linux-xfs-bounce@xxxxxxxxxxx
User-agent: Mutt/1.5.4i
On Tue, Jun 17, 2003 at 04:56:43PM +1000, Nathan Scott wrote:
> On Mon, Jun 16, 2003 at 10:57:18PM +0200, Erik Tews wrote:
> > Hi
> hi there,
> > I think I found a bug in xfs_repair. It segfaults when I run it on a
> Yep, that looks like a new bug.
> > filesystem which is a little bit corrupt. After I ran it with efence, I
> > got these backtraces. I think it cannot handle this block-out-of-range
> > condition correctly. I have attached all the important information.
> Hmm.. these are always difficult to diagnose when I haven't got the
> filesystem right in front of me (so I can sit in gdb and xfs_repair
> at the same time for interactive debugging).

I have gdb here, so if you tell me what information you need, I can see
if I can get it.

> What is almost certainly happening is we are moving past the end of
> the buffer we've read in (ie. the one pointed to by "ablock" below).
> This is likely because of either corruption in values in the buffer
> itself which repair has not catered for, or corruption of some other
> related control value we're using (many of these will be hanging off
> the "mp" variable you see in the stack trace).
> If you can figure out what's causing the pointer to walk past the end
> of the buffer, you've nailed the problem.

I think it is wrong metadata. I resized the filesystem without any
errors, but the resize could be what caused it.
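
If the metadata really is wrong, one way the pointer could run past the
end of the buffer is a record count in the block header that is larger
than what fits in one block. A minimal sketch of clamping such a count
before walking the records follows. All names here (struct rec,
max_recs, safe_numrecs) and the header size passed in are hypothetical
and are not taken from the xfs_repair sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical freespace record: start block plus length, 8 bytes. */
struct rec {
    uint32_t startblock;
    uint32_t blockcount;
};

/* How many records of rec_size bytes fit after a header of hdr_size bytes. */
static size_t max_recs(size_t blocksize, size_t hdr_size, size_t rec_size)
{
    assert(rec_size > 0);
    if (blocksize < hdr_size)
        return 0;
    return (blocksize - hdr_size) / rec_size;
}

/* Never trust the on-disk count beyond what the block can physically hold. */
static size_t safe_numrecs(size_t claimed, size_t blocksize, size_t hdr_size)
{
    size_t limit = max_recs(blocksize, hdr_size, sizeof(struct rec));

    return claimed > limit ? limit : claimed;
}
```

A caller would loop up to safe_numrecs(...) instead of the raw on-disk
value and report the block as suspect whenever the two differ.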

> Hmmm, what else?  From the trace we can see we're walking the by-block
> freespace btree in the ninth allocation group, and in particular we're
> up to (ag-relative) block number 338 (you can use the xfs_db "convert"
> command to get the real disk address - iirc, there's an example on the
> man page describing how to do that).  For a quick fix, you may be able
> to zero that block using xfs_db/dd, but it'd be even better to figure
> out the underlying cause of the segv...

I have a full backup of the filesystem, so there is no need to fix the
filesystem, but it could be interesting for you to fix xfs_repair so
that it does not segfault on similar filesystems.
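
For anyone who does want to locate that block on disk, the arithmetic
behind turning an (AG number, AG-relative block) pair into a 512-byte
sector address can be sketched as below. This is only an illustration:
struct fs_geom and agb_to_daddr are made-up names, the geometry has to
be read from the superblock of the filesystem in question, and the
result is relative to the start of the data device.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two superblock values needed here. */
struct fs_geom {
    uint32_t agblocks;   /* filesystem blocks per allocation group */
    uint32_t blocksize;  /* filesystem block size in bytes */
};

/* Convert (agno, agbno) to a disk address counted in 512-byte sectors. */
static uint64_t agb_to_daddr(const struct fs_geom *g,
                             uint32_t agno, uint32_t agbno)
{
    uint64_t block_index;

    assert(g->blocksize % 512u == 0);

    /* Linear block index: whole AGs before this one, plus the offset. */
    block_index = (uint64_t)agno * g->agblocks + agbno;

    /* Scale from filesystem blocks to 512-byte sectors. */
    return block_index * (g->blocksize / 512u);
}
```

With an assumed geometry of 65536 blocks per AG and 4096-byte blocks,
agno 9 and agbno 338 give sector 4721296.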

The b variable is at 65536, so it is the following line from xfs_check
that xfs_repair has its problems with:

block 9/65536 out of range

Could we perhaps add a safety check to get_agbno_state so that it does
not touch blocks that have such errors?
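
Something along these lines is what I have in mind. It is a rough
sketch only: struct ag_map, STATE_BAD and checked_agbno_state are
invented for illustration and do not match the real get_agbno_state
or its data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel returned for blocks outside the AG. */
#define STATE_BAD (-1)

/* Hypothetical per-AG block map: one state entry per block. */
struct ag_map {
    uint32_t nblocks;      /* number of blocks in this allocation group */
    const int8_t *state;   /* nblocks entries */
};

/* Return the recorded state, or STATE_BAD when agbno is out of range. */
static int checked_agbno_state(const struct ag_map *map, uint32_t agbno)
{
    if (map == NULL || agbno >= map->nblocks)
        return STATE_BAD;  /* e.g. "block 9/65536 out of range" */

    assert(map->state != NULL);
    return map->state[agbno];
}
```

The caller would then skip the block and flag the btree as suspect when
it gets STATE_BAD back, instead of indexing past the end of the map.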

> > (gdb) bt
> > #0  0x0807918f in scanfunc_bno (ablock=0x405bd000, level=0, bno=338, 
> > agno=9, suspect=0, isroot=1) at scan.c:569
> > #1  0x0807760a in scan_sbtree (root=338, nlevels=1, agno=9, suspect=0, 
> > func=0x8078d05 <scanfunc_bno>, isroot=1) at scan.c:84
> > #2  0x0807b401 in scan_ag (agno=9) at swab.h:125
> > #3  0x0806570f in phase2 (mp=0xbfffdc70) at phase2.c:149
> > #4  0x0807c9e9 in main (argc=2, argv=0xbfffdc70) at xfs_repair.c:506
> > ...
> > Phase 2 - using internal log
> >         - zero log...
> >         - scan filesystem freespace and inode maps...
> cheers.
> -- 
> Nathan