[Bisected] Corruption of root fs during git bisect of drm system hang

Mark Tinguely tinguely at sgi.com
Sat Jul 20 12:21:47 CDT 2013


On 07/19/13 22:18, Dave Chinner wrote:
> On Fri, Jul 19, 2013 at 04:11:28PM -0500, Mark Tinguely wrote:
>> On 07/19/13 07:22, Markus Trippelsdorf wrote:
>>>
>>> I've bisected this issue to the following commit:
>>>
>>>   commit cca9f93a52d2ead50b5da59ca83d5f469ee4be5f
>>>   Author: Dave Chinner <dchinner at redhat.com>
>>>   Date:   Thu Jun 27 16:04:49 2013 +1000
>>>
>>>       xfs: don't do IO when creating an new inode
>>>
>>> Reverting this commit on top of the Linus tree "solves" all problems for
>>> me. IOW I no longer lose my KDE and LibreOffice config files during a
>>> crash. Log recovery now works fine and xfs_repair shows no issues.
>>>
>>> So users of 3.11.0-rc1 beware. Only run this version if you have
>>> up-to-date backups handy.
>>>
>>
>> I reviewed the above patch and liked it, but I think I recreated the
>> above mentioned problem with a simple script:
>>
>> cp /root/.bash_history /root/.lesshst /root/.pwclientrc \
>>    /root/.viminfo /root/.bash_profile /root/.lesshst.YCJCDz \
>>    /root/.quiltrc /somexfsdir
>> sync
>> echo c > /proc/sysrq-trigger
>> .... reboot, remount ...
>> cd /somexfsdir
>
> I've only reproduced the problem *once* with this method - the first
> time I tried. Then I mkfs'd the filesystem rather than repairing it
> and I haven't been able to reproduce it since.  So the problem is
> far more subtle than just copying some files, running sync and
> crashing the machine - there's some kind of initial or timing
> condition that we are missing that triggers it...
>
> The one interesting thing I noticed was that the generation number
> in the crash case was non-zero. That's an important piece of
> information, and:
>
>> # cat .bash_history
>> cat: .bash_history: No such file or directory
>>
>> xfs_db>  inode 131
>> xfs_db>  p
>> core.magic = 0x494e
>> core.mode = 0
>
> That's a "free" inode, and why XFS considers it invalid when the
> lookup sees it.
>
>> core.gen = 3707503345
>
> You saw it as well, Mark.
>
> That means it has actually been allocated and written to disk at
> some point in time. That is, inodes allocated by mkfs in the root
> inode chunk have a generation number of zero. For this to have a
> non-zero generation number, it means that it had to be written after
> allocation - either before the sync or during log recovery.
>
> Unfortunately, without the 'xfs_logprint -t -i <dev>' output from
> prior to mounting the filesystem which demonstrates the problem, I
> can't tell if the issue is a recovery problem or something that
> happened before the crash....
>
>> revert the above commit and the problem goes away.
> ....
>> core.mode = 0100600
>
> Not a free inode...
>
>> core.gen = 0
>
> And, importantly, the generation number is zero, as would be
> expected for an inode in the root chunk.
>
> FWIW, if you can reproduce this on demand, Mark, it would be worth
> seeing if mounting "-o ikeep" makes the problem go away, as this
> optimisation is only used on filesystems that are configured to free
> inode chunks...
>
> Cheers,
>
> Dave.


Yeah, I thought of the logprint and the ikeep afterwards.
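
For the next attempt, the ordering asked for above could be scripted
roughly as below. This is only a sketch under assumptions: DEV, MNT and
LOGDIR are made-up placeholders (the thread never names the device, and
/somexfsdir is assumed here to be the mount point), and it defaults to a
dry run that only prints the commands. The point it illustrates is that
xfs_logprint has to run before the mount, because mounting replays the
log, and that ikeep can be passed as the mount option on both sides of
the crash.

```shell
#!/bin/sh
# Sketch only. DEV, MNT and LOGDIR are hypothetical placeholders,
# not values taken from the thread.
DEV="${DEV:-/dev/sdX1}"
MNT="${MNT:-/somexfsdir}"
LOGDIR="${LOGDIR:-/var/tmp}"
DRY_RUN="${DRY_RUN:-1}"      # 1 = print commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        sh -c "$*"
    fi
}

# Before the copy/sync/crash: mount the test filesystem,
# optionally with an extra option such as ikeep.
before_crash() {
    run "mount -t xfs ${1:+-o $1 }$DEV $MNT"
}

# After the reboot: save the log contents first, then mount.
after_crash() {
    # Must come before the mount, since mounting runs log recovery.
    run "xfs_logprint -t -i $DEV > $LOGDIR/logprint.before-mount.txt"
    run "mount -t xfs ${1:+-o $1 }$DEV $MNT"
}
```

Sourced into a shell, `after_crash ikeep` prints the two commands in
the order they would run; setting DRY_RUN=0 executes them instead.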

I tried the script today and it did not reproduce the problem. The
logprint and the mounted filesystem were empty. I will rebuild the
sources to eliminate some patched kernel versions on that box and
experiment with the sync and the shooting of the kernel.
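
To check each attempt quickly, the test described in the quoted
analysis (core.mode of 0 means a free inode; a non-zero core.gen on
such an inode is the crash signature) could be applied to saved xfs_db
output with something like the following. It is a rough sketch: the
function name and the result labels are invented for illustration, and
it only looks at the two fields discussed in this thread.

```shell
#!/bin/sh
# Sketch only: classify_inode and its labels are hypothetical.
# Reads "xfs_db> p" style output on stdin and reports what the
# core.mode / core.gen pair looks like.
classify_inode() {
    awk '
        $1 == "core.mode" { mode = $3; have_mode = 1 }
        $1 == "core.gen"  { gen = $3;  have_gen = 1 }
        END {
            if (!have_mode || !have_gen) {
                print "incomplete"
                exit 1
            }
            if (mode + 0 != 0) {
                # Non-zero mode: the inode is allocated.
                print "in-use"
            } else if (gen + 0 != 0) {
                # Free inode that has nevertheless been written.
                print "free-nonzero-gen"
            } else {
                print "free-zero-gen"
            }
        }'
}
```

For example, piping the output of `xfs_db -r -c "inode 131" -c "p"`
on the test device into classify_inode would give free-nonzero-gen for
the bad case shown above and in-use for the reverted kernel.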

--Mark.


