To: Eric Sandeen <sandeen@xxxxxxxxxxx>
Subject: Re: [Bisected] Corruption of root fs during git bisect of drm system hang
From: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Fri, 19 Jul 2013 21:53:12 +0200
Cc: Stefan Ring <stefanrin@xxxxxxxxx>, Ben Myers <bpm@xxxxxxx>, Mark Tinguely <tinguely@xxxxxxx>, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx>, Linux fs XFS <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <51E99216.9060609@xxxxxxxxxxx>
References: <20130713090523.GA362@x4> <20130712070721.GA359@x4> <20130715022841.GH5228@dastard> <20130715064734.GA361@x4> <20130719122235.GA360@x4> <CAAxjCExBi-4Qgf6-=MBzdkzBmMtu=GTURu46DoD2CzpnF2dinw@xxxxxxxxxxxxxx> <20130719125149.GB360@x4> <51E9630A.3070201@xxxxxxxxxxx> <20130719163220.GA363@x4> <51E99216.9060609@xxxxxxxxxxx>
On 2013.07.19 at 14:23 -0500, Eric Sandeen wrote:
> On 7/19/13 11:32 AM, Markus Trippelsdorf wrote:
> > On 2013.07.19 at 11:02 -0500, Eric Sandeen wrote:
> >> On 7/19/13 7:51 AM, Markus Trippelsdorf wrote:
> >>> On 2013.07.19 at 14:41 +0200, Stefan Ring wrote:
> >>>>> I've bisected this issue to the following commit:
> >>>>>
> >>>>>  commit cca9f93a52d2ead50b5da59ca83d5f469ee4be5f
> >>>>>  Author: Dave Chinner <dchinner@xxxxxxxxxx>
> >>>>>  Date:   Thu Jun 27 16:04:49 2013 +1000
> >>>>>
> >>>>>      xfs: don't do IO when creating an new inode
> >>>>>
> >>>>> Reverting this commit on top of the Linus tree "solves" all problems for
> >>>>> me. IOW I no longer lose my KDE and LibreOffice config files during a
> >>>>> crash. Log recovery now works fine and xfs_repair shows no issues.
> >>>>>
> >>>>> So users of 3.11.0-rc1 beware. Only run this version if you have
> >>>>> up-to-date backups handy.
> >>
> >> Are you certain about that bisection point?  All that does is
> >> say:  When we allocate a new inode, assign it a random generation
> >> number, rather than reading it from disk & incrementing the
> >> older generation number, AFAICS.  So it simply avoids a read IO.
> > 
> > Yes, I'm sure. 
> > As I wrote above I also double-checked by reverting the commit on top of
> > the current Linus tree.
> > 
> >> I wonder if simply changing IO patterns on the SSD changes how
> >> it's doing caching & destaging <handwave>.
> > 
> > No. The corruption also happens on my conventional (spinning) drives.
> > 
> >>>> What I miss in this thread is a distinction between filesystem
> >>>> corruption on the one hand and a few zeroed files on the other. The
> >>>> latter may be a nuisance, but it is expected behavior, while the
> >>>> former should never happen, period, if I'm not mistaken.
> >>>
> >>> Well, it is natural that fs developers at first try to blame userspace.
> >>
> >> I disagree with that, we just need to be clear about your scenarios,
> >> and what integrity guarantees should apply.
> >>
> >>> Unfortunately it turned out that in this case there is filesystem
> >>> corruption. (Fortunately this normally happens only very rarely on rc1
> >>> kernels).
> >>
> >> Corruption is when you get back data that you did not write,
> >> or metadata which is inconsistent or unreadable even after a proper
> >> log replay.
> >>
> >> Corruption is _not_ unsynced, buffered data that was lost on a
> >> crash or poweroff.
> >>
> >> But I might not have followed the thread properly, and I might
> >> misunderstand your situation.
> >>
> >> When you experience this lost file [data] scenario, was it after an
> >> orderly reboot, or after a crash and/or system reset?
> > 
> > To reproduce this issue simply boot into your desktop and then hit
> > sysrq-c and reboot. 
> 
> Ok, a crash, so at a minimum, some buffered data loss is 100% expected.

Sure.

> > After log replay without error messages, the
> > filesystem is in an inconsistent state
> 
> What exactly do you mean by inconsistent state?  Sorry to be pedantic here.

By inconsistent state I mean a filesystem state that forces you to run
xfs_repair to get back to normal.

> > and many small config files are
> > lost. 
> 
> Written how long ago?  Were they fsynced?
> I suppose you are unsure about that, if they're app-written.

I hit sysrq-c ~10 seconds after the KDE session is fully functional.
As I wrote above, I added an fsync to the KDE config file handler. So
the files should be fsynced.
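
For readers following along, here is a minimal sketch of the write pattern
usually meant by "fsynced" in this context: write to a temporary file, fsync
it, rename it over the target, then fsync the containing directory. This is
not the actual KDE change. The function name write_config_durably and the
".tmp" suffix are made up for illustration, and the directory fsync assumes a
POSIX system such as Linux.

```python
import os


def write_config_durably(path, data):
    """Atomically replace `path` with `data` (bytes) and flush it to disk."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temp name, not what KDE uses

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # os.write may write fewer bytes than requested, so loop until done.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Flush the file contents before the rename makes them visible.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Atomically swap the new file into place.
    os.replace(tmp_path, path)

    # Flush the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this pattern a crash should leave either the old or the new file
contents, never a zero-length file, provided the filesystem and the drive
honour the flushes.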

> > There are also undeletable files.
> 
> What happens when you try to delete them?

They show up as "?????? ??????" in "ls -l" and I get an error when I
try to delete them. (I don't recall the exact error message.)
See for example the /tmp/.X0-lock file that I mentioned earlier in this
thread.
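
"ls -l" prints question marks in place of the metadata when it can read a
directory entry's name but the stat call on that entry fails. Assuming that
is what is happening here, a small sketch like the following would list the
affected entries together with the error that ls hides. The helper name
find_unstattable is hypothetical.

```python
import os


def find_unstattable(directory):
    """Return (name, error message) pairs for entries whose lstat() fails."""
    broken = []
    for name in os.listdir(directory):
        try:
            # lstat does not follow symlinks, so dangling links are fine.
            os.lstat(os.path.join(directory, name))
        except OSError as exc:
            broken.append((name, exc.strerror))
    return broken
```

Calling find_unstattable("/tmp") on an affected system would be one way to
capture the exact error for a file such as /tmp/.X0-lock.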

> > You need to run xfs_repair
> > manually to bring the filesystem back to normal.
> 
> And what is the repair output?

See the outputs I've posted in this thread before. It's always a
variation thereof.

> Can you show an exact sequence of events, capturing all relevant output from 
> repair and/or dmesg, etc, just so we see exactly what you see?

I already did that. 

-- 
Markus
