Re: bad fs - xfs_repair 3.01 crashes on it

To: Michael Monnerie <michael.monnerie@xxxxxxxxxxxxxxxxxxx>
Subject: Re: bad fs - xfs_repair 3.01 crashes on it
From: Eric Sandeen <sandeen@xxxxxxxxxxx>
Date: Sat, 04 Jul 2009 00:43:02 -0500
Cc: xfs mailing list <xfs@xxxxxxxxxxx>
In-reply-to: <200907031320.48358@xxxxxx>
References: <200907031320.48358@xxxxxx>
User-agent: Thunderbird (Macintosh/20090605)

Michael Monnerie wrote:
> Tonight our server rebooted, and I found in /var/log/warn that he was crying 
> a lot about xfs since June 7 already:


> But XFS didn't go offline, so nobody found this messages. There are a lot of 
> them.
> They obviously are generated by the nightly "xfs_fsr -v -t 7200" which we run
> since then. It would have been nice if xfs_fsr could have displayed
> a message, so we would have received the cron mail. (But it got killed
> by the kernel, that's a good excuse)

ok yeah we should see why fsr didn't print anything ...

> Anyway, so I went to xfs_repair (3.01) and got this:
> Phase 3 - for each AG...
>         - scan and clear agi unlinked lists...
>         - process known inodes and perform inode discovery...
> [snip]
>         - agno = 14
> local inode 3857051697 attr too small (size = 3, min size = 4)
> bad attribute fork in inode 3857051697, clearing attr fork
> clearing inode 3857051697 attributes
> cleared inode 3857051697
> [snip]
> Phase 4 - check for duplicate blocks...
> [snip]
>         - agno = 15
> data fork in regular inode 3857051697 claims used block 537147998
> xfs_repair: dinode.c:2108: process_inode_data_fork: Assertion `err == 0' failed.

OK, so this is essentially some code which first does a scan-only pass;
if that finds an error it bails out and clears the inode.  If not, it
calls essentially the same function again (the comments say "set bitmaps
this time"), but here the 2nd call finds an error, which isn't handled
well.  The ASSERT(err == 0) bit is presumably there because if the first
scan didn't find anything, the 2nd call shouldn't either, but ... not
the case here :(  There are more checks that can go wrong -after- the
scan-only portion.

So either the caller needs to cope w/ the error at this point, or the
scan-only pass needs to do all the checks, I think.
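
To make the first option concrete, here is a toy sketch of the
scan-then-set-bitmaps pattern.  This is NOT the dinode.c code; the
names (toy_inode, process_fork, process_inode, block_used) and the
checks themselves are made up for illustration.  The only point is
that the second pass does a check the scan-only pass skips, so the
caller handles its failure instead of asserting on it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 16

/* Hypothetical stand-in for the block-usage bitmap repair builds up. */
static bool block_used[NBLOCKS];

/* Hypothetical inode: just the block numbers its data fork claims. */
struct toy_inode {
    const int *blocks;
    size_t nblocks;
};

/*
 * Walk the fork.  With check_only set, only range-check the block
 * numbers.  Without it, also claim each block in the bitmap, which is
 * where a duplicate claim shows up: a check the scan-only pass never
 * made.  Returns 0 on success, 1 on error.
 */
static int process_fork(const struct toy_inode *ip, bool check_only)
{
    for (size_t i = 0; i < ip->nblocks; i++) {
        int b = ip->blocks[i];

        if (b < 0 || b >= NBLOCKS)
            return 1;
        if (check_only)
            continue;
        if (block_used[b]) {
            /* Undo this inode's earlier claims before failing. */
            while (i-- > 0)
                block_used[ip->blocks[i]] = false;
            return 1;
        }
        block_used[b] = true;
    }
    return 0;
}

/* Returns true if the inode is kept, false if it should be cleared. */
static bool process_inode(const struct toy_inode *ip)
{
    if (process_fork(ip, true))
        return false;   /* scan-only pass found a problem */

    /* Cope with a second-pass error rather than assert(err == 0). */
    if (process_fork(ip, false))
        return false;

    return true;
}
```

With an inode claiming blocks {1, 2, 3} processed first, a second
inode claiming {4, 2} passes the scan-only call but fails the second
one, and process_inode() reports it for clearing rather than aborting.
The other option would be to move the duplicate check into the
check_only path as well.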

Where's Barry when you need him ....

Also I need to look at when the ASSERTs are active and when they should
be; the Fedora-packaged xfsprogs doesn't have the ASSERTs active, and so
this doesn't trip.  After 2 runs of xfs_repair on Fedora, w/o the
ASSERTs active, it checks clean on the 3rd (!).  Not great.  Not sure
how much was cleared out in the process either...
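
As a rough illustration of why the same source can abort in one build
and run on in another, the sketch below uses the standard C assert()
and NDEBUG as a stand-in.  It makes no claim about how the xfsprogs
ASSERT macro is actually gated; asserts_active() is a made-up helper:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Probe whether assert() is compiled in.  The side effect inside the
 * assert is deliberate here: when NDEBUG is defined, assert() expands
 * to nothing, the assignment never runs, and this returns false.
 */
static bool asserts_active(void)
{
    bool evaluated = false;

    assert((evaluated = true));
    return evaluated;
}
```

A build with -DNDEBUG would silently skip a failed err == 0 check and
carry on with whatever state the second pass left behind.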

