
Re: xfs_repair 3.2.0 cannot (?) fix fs

To: Arkadiusz Miśkiewicz <arekm@xxxxxxxx>
Subject: Re: xfs_repair 3.2.0 cannot (?) fix fs
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Mon, 30 Jun 2014 22:06:13 +1000
Cc: xfs@xxxxxxxxxxx, Alex Elder <elder@xxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <201406301353.10829.arekm@xxxxxxxx>
References: <20140630031810.GC4453@dastard> <201406300736.24291.arekm@xxxxxxxx> <20140630111214.GE4453@dastard> <201406301353.10829.arekm@xxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Mon, Jun 30, 2014 at 01:53:10PM +0200, Arkadiusz Miśkiewicz wrote:
> On Monday 30 of June 2014, Dave Chinner wrote:
> > On Mon, Jun 30, 2014 at 07:36:24AM +0200, Arkadiusz Miśkiewicz wrote:
> > > On Monday 30 of June 2014, Dave Chinner wrote:
> > > > but right now only user quotas are enabled.  It's only AGs 1-15 that
> > > > show this, so this seems to me that it is likely that this
> > > > filesystem was originally only 16 AGs and it's been grown many times
> > > > since?
> > > 
> > > Quotas were running fine until some repair run (i.e. both before and
> > > after the first repair, mounting with quota succeeded) - some later
> > > xfs_repair run broke this.
> > 
> > Actually, it looks more likely that a quotacheck has failed part way
> > though, leaving the quota in an indeterminate state and then repair
> > has been run, messing things up more...
> 
> Hm, the only quotacheck I see in logs from that day reported "Done". I assume 
> it wouldn't report that if some problem occurred in the middle?
> 
> Jun 28 00:57:36 web2 kernel: [736161.906626] XFS (dm-1): Quotacheck needed: 
> Please wait.
> Jun 28 01:09:10 web2 kernel: [736855.851555] XFS (dm-1): Quotacheck: Done.

If there was an error, it should report it and say that quotas are
being turned off.

> [...] there were a few "Internal error xfs_bmap_read_extents(1)" while doing 
> xfs_dir_lookup (I assume due to the unfixed directory entry problem). 
> xfs_repair was also run a few times and then...
> 
> Jun 28 23:16:50 web2 kernel: [816515.898210] XFS (dm-1): Mounting Filesystem
> Jun 28 23:16:50 web2 kernel: [816515.915356] XFS (dm-1): Ending clean mount
> Jun 28 23:16:50 web2 kernel: [816515.940008] XFS (dm-1): Failed to initialize 
> disk quotas.

I haven't tracked down what the error here is yet - I'm still
working on the repair side of things before I even try to mount the
images you sent me. :/

Once I get repair running cleanly, I'll look at why this is failing.

> > > > > Invalid inode number 0xfeffffffffffffff
> > > > > xfs_dir_ino_validate: XFS_ERROR_REPORT
> > > > > Metadata corruption detected at block 0x11fbb698/0x1000
> > > > > libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000
> > > > > done
> > > > 
> > > > Not sure what that is yet, but it looks like writing a directory
> > > > block found entries with invalid inode numbers in it. i.e. it's
> > > > telling me that there's something not been fixed up.
> > > > 
> > > > I'm actually seeing this in phase4:
> > > >         - agno = 148
> > > > 
> > > > Invalid inode number 0xfeffffffffffffff
> > > > xfs_dir_ino_validate: XFS_ERROR_REPORT
> > > > Metadata corruption detected at block 0x11fbb698/0x1000
> > > > libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000
> > > > 
> > > > Second time around, this does not happen, so the error has been
> > > > corrected in a later phase of the first pass.
> > > 
> > > Here on two runs I got exactly the same report:
> > > 
> > > Phase 7 - verify and correct link counts...
> > > 
> > > Invalid inode number 0xfeffffffffffffff
> > > xfs_dir_ino_validate: XFS_ERROR_REPORT
> > > Metadata corruption detected at block 0x11fbb698/0x1000
> > > libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000
> > > Invalid inode number 0xfeffffffffffffff
> > > xfs_dir_ino_validate: XFS_ERROR_REPORT
> > > Metadata corruption detected at block 0x11fbb698/0x1000
> > > libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000
> > > 
> > > but there were more errors like this earlier, so repair fixed some but
> > > was left with these two.
> > 
> > Right, I suspect that I've got a partial fix for this already in
> > place - I was having xfs_repair -n ... SEGV when parsing the
> > broken directory in phase 6, so I have some code that prevents that
> > crash which might also be partially fixing this.
> 
> Nice :-) Do you also know why 3.1.11 doesn't have this problem with 
> xfs_dir_ino_validate: XFS_ERROR_REPORT?

Oh, that's easy: 3.1.11 doesn't have write verifiers, so it would
never know that it wrote a bad inode number to disk. Like the kernel
code, the write verifiers actually check that the modifications
being made result in valid on-disk values, and that's something
repair never had before 3.2.0.

IOWs, 3.1.11 could well be writing inodes with 0xfeffffffffffffff in
them, but there's nothing to catch that in repair or libxfs on read
or write. Hence we could be tripping over an old bug we never knew
existed until now...
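The write-verifier idea described above is roughly: before a buffer goes to
disk, run the same sanity checks the kernel would run on read, and fail the
write if any value is impossible. A minimal sketch of that check in C - this
is not the actual xfsprogs code, and `fs_geom`, `max_ino` and `ino_plausible`
are names invented for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t xfs_ino_t;

/* Hypothetical filesystem geometry: the only fact we need here is an
 * upper bound on valid inode numbers for this filesystem. */
struct fs_geom {
    xfs_ino_t max_ino;
};

/* Write-verifier-style check: reject inode numbers that cannot possibly
 * be valid, instead of silently writing them to disk. */
static bool ino_plausible(const struct fs_geom *geom, xfs_ino_t ino)
{
    if (ino == 0 || ino == (xfs_ino_t)-1)   /* zero / all-ones sentinels */
        return false;
    if (ino > geom->max_ino)                /* beyond the fs's inode range */
        return false;
    return true;
}
```

A value like the 0xfeffffffffffffff seen in the logs above lies far beyond
any real filesystem's inode range, so a check like this rejects it at write
time - which is why 3.2.0's repair reports the corruption where 3.1.11 would
have written it out silently.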

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
