Re: xfs_repair 3.2.0 cannot (?) fix fs

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: xfs_repair 3.2.0 cannot (?) fix fs
From: Arkadiusz Miśkiewicz <arekm@xxxxxxxx>
Date: Mon, 30 Jun 2014 13:53:10 +0200
Cc: xfs@xxxxxxxxxxx, Alex Elder <elder@xxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20140630111214.GE4453@dastard>
References: <20140630031810.GC4453@dastard> <201406300736.24291.arekm@xxxxxxxx> <20140630111214.GE4453@dastard>
User-agent: KMail/1.13.7 (Linux/3.16.0-rc3-00006-g16874b2; KDE/4.13.2; x86_64; ; )
On Monday 30 of June 2014, Dave Chinner wrote:
> On Mon, Jun 30, 2014 at 07:36:24AM +0200, Arkadiusz Miśkiewicz wrote:
> > On Monday 30 of June 2014, Dave Chinner wrote:
> > > [Compendium reply to all 3 emails]
> > > 
> > > On Sat, Jun 28, 2014 at 01:41:54AM +0200, Arkadiusz Miśkiewicz wrote:
> > > > reset bad sb for ag 5
> > > >
> > > > non-null group quota inode field in superblock 7
> > > 
> > > OK, so this is indicative of something screwed up a long time ago.
> > > Firstly, the primary superblock shows:
> > > 
> > > uquotino = 4077961
> > > gquotino = 0
> > > qflags = 0
> > > 
> > > i.e. user quota @ inode 4077961, no group quota. The secondary
> > > superblocks that are being warned about show:
> > > 
> > > uquotino = 0
> > > gquotino = 4077962
> > > qflags = 0
> > > 
> > > Which is clearly wrong. They should have been overwritten during the
> > > growfs operation to match the primary superblock.
> > > 
> > > The similarity in inode number leads me to believe at some point
> > > both user and group/project quotas were enabled on this filesystem,
> > 
> > Both user and project quotas were enabled on this fs for the last few years.
> > 
> > > but right now only user quotas are enabled.  It's only AGs 1-15 that
> > > show this, so this seems to me that it is likely that this
> > > filesystem was originally only 16 AGs and it's been grown many times
> > > since?
> > 
> > The quotas were running fine until some repair run (i.e. mounting with
> > quota succeeded both before and after the first repair); some later
> > xfs_repair run broke this.
> 
> Actually, it looks more likely that a quotacheck has failed part way
> through, leaving the quota in an indeterminate state and then repair
> has been run, messing things up more...
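
For reference, the primary/secondary disagreement described above can be spotted with a simple field-by-field comparison. This is only a minimal Python sketch, assuming the superblock fields have already been read into dictionaries (for example by hand from xfs_db output). The helper name quota_mismatches and the dictionary layout are made up for illustration, and AG 7 merely stands in for the affected secondaries.

```python
# Hypothetical helper; the field names mirror the values quoted in this thread.
QUOTA_FIELDS = ("uquotino", "gquotino", "qflags")


def quota_mismatches(primary, secondaries):
    """Return (agno, field, primary_value, secondary_value) for every quota
    field where a secondary superblock disagrees with the primary."""
    problems = []
    for agno, sb in sorted(secondaries.items()):
        for field in QUOTA_FIELDS:
            if sb[field] != primary[field]:
                problems.append((agno, field, primary[field], sb[field]))
    return problems


# Values taken from the thread; AG 7 is just one example secondary.
primary = {"uquotino": 4077961, "gquotino": 0, "qflags": 0}
secondaries = {7: {"uquotino": 0, "gquotino": 4077962, "qflags": 0}}

for agno, field, want, got in quota_mismatches(primary, secondaries):
    print(f"ag {agno}: {field} is {got}, primary has {want}")
```

A secondary that matches the primary yields an empty list, so the same helper can be run over all AGs at once.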

Hm, the only quotacheck I see in the logs from that day reported "Done". I
assume it wouldn't report that if some problem occurred in the middle?

Jun 28 00:57:36 web2 kernel: [736161.906626] XFS (dm-1): Quotacheck needed: Please wait.
Jun 28 01:09:10 web2 kernel: [736855.851555] XFS (dm-1): Quotacheck: Done.

[...] here there were a few "Internal error xfs_bmap_read_extents(1)" reports
while doing xfs_dir_lookup (I assume due to the unfixed directory entries
problem). xfs_repair was also run a few times and then...

Jun 28 23:16:50 web2 kernel: [816515.898210] XFS (dm-1): Mounting Filesystem
Jun 28 23:16:50 web2 kernel: [816515.915356] XFS (dm-1): Ending clean mount
Jun 28 23:16:50 web2 kernel: [816515.940008] XFS (dm-1): Failed to initialize disk quotas.
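
The timeline above can also be pulled out of the log mechanically. This is a rough Python sketch, assuming syslog lines in exactly the format quoted here, each on a single line. The regular expression and the helper name xfs_events are my own and not part of any XFS tool.

```python
import re

# Matches lines such as:
# "Jun 28 01:09:10 web2 kernel: [736855.851555] XFS (dm-1): Quotacheck: Done."
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] XFS \((?P<dev>[^)]+)\): (?P<msg>.*)$"
)


def xfs_events(lines, device):
    """Return (timestamp, uptime_seconds, message) for XFS lines of one device."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("dev") == device:
            events.append(
                (m.group("stamp"), float(m.group("uptime")), m.group("msg"))
            )
    return events


log = [
    "Jun 28 00:57:36 web2 kernel: [736161.906626] XFS (dm-1): Quotacheck needed: Please wait.",
    "Jun 28 01:09:10 web2 kernel: [736855.851555] XFS (dm-1): Quotacheck: Done.",
    "Jun 28 23:16:50 web2 kernel: [816515.940008] XFS (dm-1): Failed to initialize disk quotas.",
]

for stamp, uptime, msg in xfs_events(log, "dm-1"):
    print(stamp, msg)
```

The uptime column makes it easy to see how long the quotacheck ran before it reported "Done".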


> 
> > > > Invalid inode number 0xfeffffffffffffff
> > > > xfs_dir_ino_validate: XFS_ERROR_REPORT
> > > > Metadata corruption detected at block 0x11fbb698/0x1000
> > > > libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000
> > > > done
> > > 
> > > Not sure what that is yet, but it looks like writing a directory
> > > block found entries with invalid inode numbers in it. i.e. it's
> > > telling me that there's something not been fixed up.
> > > 
> > > I'm actually seeing this in phase4:
> > >         - agno = 148
> > > 
> > > Invalid inode number 0xfeffffffffffffff
> > > xfs_dir_ino_validate: XFS_ERROR_REPORT
> > > Metadata corruption detected at block 0x11fbb698/0x1000
> > > libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000
> > > 
> > > Second time around, this does not happen, so the error has been
> > > corrected in a later phase of the first pass.
> > 
> > Here on two runs I got exactly the same report:
> > 
> > Phase 7 - verify and correct link counts...
> > 
> > Invalid inode number 0xfeffffffffffffff
> > xfs_dir_ino_validate: XFS_ERROR_REPORT
> > Metadata corruption detected at block 0x11fbb698/0x1000
> > libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000
> > Invalid inode number 0xfeffffffffffffff
> > xfs_dir_ino_validate: XFS_ERROR_REPORT
> > Metadata corruption detected at block 0x11fbb698/0x1000
> > libxfs_writebufr: write verifer failed on bno 0x11fbb698/0x1000
> > 
> > but there were more errors like this earlier, so repair fixed some but
> > left these two.
> 
> Right, I suspect that I've got a partial fix for this already in
> place - I was having xfs_repair -n ... SEGV when parsing the
> broken directory in phase 6, so I have some code that prevents that
> crash which might also be partially fixing this.

Nice :-) Do you also know why 3.1.11 doesn't have this problem with 
xfs_dir_ino_validate: XFS_ERROR_REPORT ?
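
To illustrate why a value like 0xfeffffffffffffff gets rejected: an XFS inode number encodes the AG number, the block within the AG and the inode's slot within that block, so a number whose AG part exceeds the AG count cannot be valid. Below is a rough Python sketch of that kind of range check. It is not the actual xfs_dir_ino_validate code. agcount = 391 comes from the repair output quoted in this thread, while agblocks, agblklog and inopblog are hypothetical values picked for the example.

```python
def split_inode(ino, agblklog, inopblog):
    """Split an inode number into (agno, agbno, offset)."""
    offset = ino & ((1 << inopblog) - 1)
    agbno = (ino >> inopblog) & ((1 << agblklog) - 1)
    agno = ino >> (inopblog + agblklog)
    return agno, agbno, offset


def inode_number_plausible(ino, agcount, agblocks, agblklog, inopblog):
    """Simplified range check: the AG must exist and the block must lie
    inside the AG (block 0 of an AG never holds inodes)."""
    agno, agbno, _offset = split_inode(ino, agblklog, inopblog)
    return agno < agcount and 0 < agbno < agblocks


AGCOUNT = 391       # from the xfs_repair output in this thread
AGBLOCKS = 700000   # hypothetical
AGBLKLOG = 20       # hypothetical
INOPBLOG = 4        # hypothetical (16 inodes per block)

for ino in (0xfeffffffffffffff, 4077961):
    ok = inode_number_plausible(ino, AGCOUNT, AGBLOCKS, AGBLKLOG, INOPBLOG)
    print(hex(ino), "plausible" if ok else "invalid")
```

With any realistic geometry the AG part of 0xfeffffffffffffff is far larger than 391, so the exact hypothetical values do not change the outcome for that number.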

> > > > 5)Metadata CRC error detected at block 0x0/0x200
> > > > but it is not CRC enabled fs
> > > 
> > > That's typically caused by junk in the superblock beyond the end
> > > of the v4 superblock structure. It should be followed by "zeroing
> > > junk ..."
> > 
> > Shouldn't repair fix superblocks when noticing v4 fs?
> 
> It does.
> 
> > I mean 3.2.0 repair reports:
> > 
> > $ xfs_repair -v ./1t-image
> > Phase 1 - find and verify superblock...
> > 
> >         - reporting progress in intervals of 15 minutes
> >         - block cache size set to 748144 entries
> > 
> > Phase 2 - using internal log
> > 
> >         - zero log...
> > 
> > zero_log: head block 2 tail block 2
> > 
> >         - scan filesystem freespace and inode maps...
> > 
> > Metadata CRC error detected at block 0x0/0x200
> > zeroing unused portion of primary superblock (AG #0)
> > 
> >         - 07:20:11: scanning filesystem freespace - 391 of 391 allocation groups done
> > 
> >         - found root inode chunk
> > 
> > Phase 3 - for each AG...
> > 
> >         - scan and clear agi unlinked lists...
> >         - 07:20:11: scanning agi unlinked lists - 391 of 391 allocation groups done
> > 
> >         - process known inodes and perform inode discovery...
> >         - agno = 0
> > 
> > [...]
> > 
> > but if I run 3.1.11 after running 3.2.0 then superblocks get fixed:
> > 
> > $ ./xfsprogs/repair/xfs_repair -v ./1t-image
> > Phase 1 - find and verify superblock...
> > 
> >         - block cache size set to 748144 entries
> > 
> > Phase 2 - using internal log
> > 
> >         - zero log...
> > 
> > zero_log: head block 2 tail block 2
> > 
> >         - scan filesystem freespace and inode maps...
> > 
> > zeroing unused portion of primary superblock (AG #0)
> 
> ...
> 
> > Shouldn't these be "unused" for 3.2.0, too (since v4 fs) ?
> 
> I'm pretty sure that's indicative of older xfs_repair code
> not understanding that sb_badfeatures2 didn't need to be zeroed.
> It wasn't until:
> 
> cbd7508 xfs_repair: zero out unused parts of superblocks
> 
> that xfs_repair correctly sized the unused area of the superblock.
> You'll probably find that mounting this filesystem resulted in
> "sb_badfeatures2 mismatch detected. Correcting." or something
> similar in dmesg because of this (now fixed) repair bug.

Tested 3.1.11 with cbd7508 applied and indeed there is no "zeroing unused
portion of primary superblock" message anymore.
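
The effect of sizing the unused area wrongly can be shown with a toy example. This is a minimal Python sketch, assuming the 0x200 in "block 0x0/0x200" is a 512-byte length. USED_LEN is an assumed structure length, not a value taken from this thread, and both helper names are made up.

```python
SECTOR_SIZE = 0x200  # assumed: the 0x200 in "block 0x0/0x200" read as a byte length
USED_LEN = 208       # assumed length of the in-use superblock fields


def unused_tail_is_zero(sector, used_len):
    """True if every byte after the first used_len bytes is zero."""
    return not any(sector[used_len:])


def zero_unused_tail(sector, used_len):
    """Return a copy of sector with everything past used_len zeroed."""
    return sector[:used_len] + bytes(len(sector) - used_len)


# Toy sector: the last 4 in-use bytes hold a non-zero field, the rest is zero.
sector = bytes(USED_LEN - 4) + b"\x00\x00\x00\x08" + bytes(SECTOR_SIZE - USED_LEN)

# With the correct size there is nothing to zero.
print(unused_tail_is_zero(sector, USED_LEN))
# With a size that is 4 bytes too small, the last real field looks like junk
# and zero_unused_tail() would wipe it.
print(unused_tail_is_zero(sector, USED_LEN - 4))
```

A tool with the too-small size would report "zeroing unused portion" whenever that last field is set, which would fit the behaviour seen with unpatched 3.1.11 here.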

> > > > Made xfs metadump without file obfuscation and I'm able to reproduce
> > > > the problem reliably on the image (if some xfs developer wants
> > > > metadump image then please mail me - I don't want to put it for
> > > > everyone due to obvious reasons).
> > > > 
> > > > So additional bug in xfs_metadump where file obfuscation "fixes" some
> > > > issues. Does it obfuscate but keep invalid conditions (like keeping
> > > > "/" in file name) ? I guess it is not doing that.
> > > 
> > > I doubt it handles a "/" in a file name properly - that's rather
> > > illegal, and the obfuscation code probably doesn't handle it at all.
> > 
> > Would be nice to keep these bad conditions. The obfuscated metadump
> > behaves differently from the non-obfuscated one with xfs_repair here
> > (fewer issues with obfuscated than non-obfuscated), so obfuscation simply
> > hides problems.
> 
> Sure, but we didn't even know this was a problem until now, so that
> will have to wait....
> 
> > I assume that you do testing on the non-obfuscated dump I gave on irc?
> 
> Yes, but I've been cross checking against the obfuscated one with
> xfs_db....
> 
> Cheers,
> 
> Dave.


-- 
Arkadiusz Miśkiewicz, arekm / maven.pl
