On Tue, Jun 17, 2003 at 04:56:43PM +1000, Nathan Scott wrote:
> On Mon, Jun 16, 2003 at 10:57:18PM +0200, Erik Tews wrote:
> > Hi
> hi there,
> > I think I found a bug in xfs_repair. It segfaults when I run it on a
> Yep, that looks like a new bug.
> > filesystem which is a little bit corrupt. After I ran it with efence, I
> > got these backtraces. I think it cannot handle this block-out-of-range
> > condition correctly. I have attached all important information.
> Hmm.. these are always difficult to diagnose when I haven't got the
> filesystem right in front of me (so I can sit in gdb and xfs_repair
> at the same time for interactive debugging).
I have gdb here, so if you tell me what info you need, I can see if I
can get it.
> What is almost certainly happening is we are moving past the end of
> the buffer we've read in (ie. the one pointed to by "ablock" below).
> This is likely because of either corruption in values in the buffer
> itself which repair has not catered for, or corruption of some other
> related control value we're using (many of these will be hanging off
> the "mp" variable you see in the stack trace).
> If you can figure out what's causing the pointer to walk past the end
> of the buffer, you've nailed the problem.
I think it is wrong metadata. I resized the filesystem without any
errors, but the resize could still be the cause.
> Hmmm, what else? From the trace we can see we're walking the by-block
> freespace btree in the ninth allocation group, and in particular we're
> up to (ag-relative) block number 338 (you can use the xfs_db "convert"
> command to get the real disk address - iirc, there's an example on the
> man page describing how to do that). For a quick fix, you may be able
> to zero that block using xfs_db/dd, but it'd be even better to figure
> out the underlying cause of the segv...
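For reference, the arithmetic behind that conversion can be sketched roughly as below. This is only my understanding of the layout, not code taken from xfsprogs: the struct and the function names are made up for illustration, and the geometry values have to come from the superblock (sb_agblocks, sb_agblklog and the block size).

```c
#include <stdint.h>

/* Hypothetical geometry holder; the real values live in the superblock. */
struct fs_geom {
    uint32_t agblocks;   /* blocks per allocation group (sb_agblocks) */
    uint32_t agblklog;   /* log2 of agblocks, rounded up (sb_agblklog) */
    uint32_t blocksize;  /* filesystem block size in bytes */
};

/* Filesystem block number: AG number in the high bits,
 * AG-relative block number in the low bits. */
static uint64_t agb_to_fsb(const struct fs_geom *g, uint32_t agno, uint32_t agbno)
{
    return ((uint64_t)agno << g->agblklog) | agbno;
}

/* Disk address in 512-byte sectors, counted linearly from the
 * start of the filesystem. */
static uint64_t agb_to_daddr(const struct fs_geom *g, uint32_t agno, uint32_t agbno)
{
    uint64_t linear = (uint64_t)agno * g->agblocks + agbno;
    return linear * (g->blocksize / 512);
}
```

With an assumed geometry of 65536 blocks per AG and 4096-byte blocks (example values, not read from this filesystem), AG 9 block 338 would be filesystem block 590162.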
I have a full backup of the filesystem, so there is no need to fix the
filesystem, but it could be interesting for you to fix xfs_repair so
that it does not segfault on similar filesystems.
The b variable is at 65536, so it is the following line from xfs_check
where xfs_repair has its problems:
block 9/65536 out of range
Could we perhaps add a safety check to get_agbno_state so that it does
not touch blocks with these errors?
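Something along these lines is what I have in mind. It is only a rough sketch: the struct, the function name and the STATE_BAD value are made up for illustration and are not the real xfs_repair code or its data structures.

```c
#include <stddef.h>
#include <stdint.h>

#define STATE_BAD (-1)  /* hypothetical marker for "block is out of range" */

/* Hypothetical per-AG state table; not the real xfs_repair structure. */
struct ag_state {
    uint32_t      agblocks;  /* number of valid blocks in this AG */
    const int8_t *state;     /* one state entry per block */
};

/* Range-checked lookup: refuses to index past the end of the AG
 * instead of reading whatever memory happens to follow the table. */
static int checked_agbno_state(const struct ag_state *ag, uint32_t agbno)
{
    if (ag == NULL || ag->state == NULL || agbno >= ag->agblocks)
        return STATE_BAD;
    return ag->state[agbno];
}
```

A caller that gets STATE_BAD back could then report the block as out of range and skip it, instead of crashing.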
> > (gdb) bt
> > #0 0x0807918f in scanfunc_bno (ablock=0x405bd000, level=0, bno=338,
> > agno=9, suspect=0, isroot=1) at scan.c:569
> > #1 0x0807760a in scan_sbtree (root=338, nlevels=1, agno=9, suspect=0,
> > func=0x8078d05 <scanfunc_bno>, isroot=1) at scan.c:84
> > #2 0x0807b401 in scan_ag (agno=9) at swab.h:125
> > #3 0x0806570f in phase2 (mp=0xbfffdc70) at phase2.c:149
> > #4 0x0807c9e9 in main (argc=2, argv=0xbfffdc70) at xfs_repair.c:506
> > ...
> > Phase 2 - using internal log
> > - zero log...
> > - scan filesystem freespace and inode maps...