
Re: Got "Internal error XFS_WANT_CORRUPTED_GOTO". Filesystem needs refor

To: XFS mailing list <xfs@xxxxxxxxxxx>
Subject: Re: Got "Internal error XFS_WANT_CORRUPTED_GOTO". Filesystem needs reformatting to correct issue.
From: "Carlos E. R." <carlos.e.r@xxxxxxxxxxxx>
Date: Fri, 4 Jul 2014 01:34:52 +0200 (CEST)
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20140703094347.GU4453@dastard>
References: <alpine.LSU.2.11.1407021104480.9881@xxxxxxxxxxxxxxxxx> <20140702120441.GA51757@xxxxxxxxxxxxxxx> <alpine.LSU.2.11.1407030057310.9881@xxxxxxxxxxxxxxxxx> <20140703094347.GU4453@dastard>
Sender: Carlos Robinson <robin.listas@xxxxxxxxx>
User-agent: Alpine 2.11 (LSU 23 2013-08-11)

On Thursday, 2014-07-03 at 19:43 +1000, Dave Chinner wrote:
On Thu, Jul 03, 2014 at 05:00:47AM +0200, Carlos E. R. wrote:
On Wednesday, 2014-07-02 at 08:04 -0400, Brian Foster wrote:
On Wed, Jul 02, 2014 at 11:57:25AM +0200, Carlos E. R. wrote:


hibernated at least once a day, perhaps three times if I have to go
out several times. It makes no sense to me to leave the machine
powered doing nothing, if hibernating is so easy and reliable - till
now. If I have to leave for more than a week, I tend to do a full

Hibernation has always been suspect w.r.t. flushing filesystem
metadata. It does not guarantee that the filesystem is quiesced
and idle, it just does a sync() and hopes that is sufficient to get
the filesystem into a consistent state. The mess that this leaves is
then left to filesystem developers to play whack-a-mole with when
users have problems.

Ah, but then my problem would not always happen on the same partition. It would affect others, would it not?

But soon after, it oopses:

Point of note: there is no oops or crash occurring. XFS dumps the
stack when a corruption occurs to tell us where it was detected
and then shuts down the filesystem. Your system is still just fine
apart from not being able to access that filesystem until you
unmount it, repair it and mount it again.

Ok, true, there is no formal "Oops".

But no, the system does not remain fine: I had to hit the hardware reset or power-off button to get out.

3 PID: 57 Comm: kworker/3:1 Tainted: P           O 3.11.10-7-desktop

What's tainting your kernel? If you remove that taint, does the
problem still occur?

Sorry, I can't find that out. It is either the nvidia driver, or the vmware kernel module. I can temporarily remove it for some days, but hardly for a month. I agree that it might have unknown influence on the initial corruption, but not on doing the repair, which I do in text mode, or with another boot partition that doesn't have that driver.

That is, it would have no influence on "xfs_repair" when run on a non-tainted system.

I don't know of a way to provoke the problem at will, in order to remove the taint for a brief period :-?
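
For reference, the letters after "Tainted:" can be decoded mechanically. What follows is only a minimal Python sketch: the flag meanings are the ones given in the kernel's tainted-kernels documentation (a subset, not the full table), and `decode_taint` is a hypothetical helper name, not part of any tool.

```python
import re

# Subset of the kernel's documented taint letters (not the full table).
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "F": "module was force-loaded",
    "O": "out-of-tree module was loaded",
    "W": "a kernel warning was issued earlier",
    "D": "kernel died recently (oops or BUG)",
}

def decode_taint(line):
    """Return (letter, meaning) pairs for the 'Tainted:' field of a stack-dump header."""
    # The field is space-padded and followed by the kernel version string.
    match = re.search(r"Tainted:\s+([A-Z ]+?)\s+\d", line)
    if match is None:
        return []  # e.g. a "Not tainted" header
    letters = [c for c in match.group(1) if c != " "]
    return [(c, TAINT_FLAGS.get(c, "not in this table")) for c in letters]

# Header line as quoted in this thread.
HEADER = "3 PID: 57 Comm: kworker/3:1 Tainted: P           O 3.11.10-7-desktop"

for letter, meaning in decode_taint(HEADER):
    print(letter, "-", meaning)
```

For the header quoted above this reports P and O. The flags alone do not say which module set them.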

<0.4> 2014-04-17 22:47:08 Telcontar kernel - - - [280270.081655] Restarting kernel threads ... done.
<0.4> 2014-04-17 22:47:08 Telcontar kernel - - - [280270.086714] Restarting tasks ... done.
<0.1> 2014-04-17 22:47:08 Telcontar kernel - - - [280271.851374] XFS: Internal error XFS_WANT_CORRUPTED_GOTO at line 1602 of file
Caller 0xffffffffa0c54fe9

So the corruption occurred within 2s of the kernel restarting tasks
after a hibernation. It's really looking like a hibernation issue.

It's got to be related, of course.
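
The "within 2s" figure can be checked from the bracketed kernel timestamps in the log lines quoted above. A small Python sketch, assuming the bracketed value is seconds on the kernel's log clock; `kernel_seconds` is a hypothetical helper name.

```python
import re

# The two relevant lines from the log excerpt, with the wrapped lines joined.
LOG_LINES = [
    "<0.4> 2014-04-17 22:47:08 Telcontar kernel - - - [280270.086714] "
    "Restarting tasks ... done.",
    "<0.1> 2014-04-17 22:47:08 Telcontar kernel - - - [280271.851374] "
    "XFS: Internal error XFS_WANT_CORRUPTED_GOTO at line 1602 of file",
]

def kernel_seconds(line):
    """Extract the bracketed kernel timestamp, e.g. [280270.086714], as a float."""
    match = re.search(r"\[\s*(\d+\.\d+)\]", line)
    if match is None:
        raise ValueError("no kernel timestamp in line: " + line)
    return float(match.group(1))

restart = kernel_seconds(LOG_LINES[0])
error = kernel_seconds(LOG_LINES[1])
delta = error - restart
print(f"XFS error logged {delta:.2f} s after tasks were restarted")
```

That gives a gap of roughly 1.76 s between "Restarting tasks ... done." and the XFS error.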


As this always happens on recovery from hibernation, and seeing the message
"Corruption of in-memory data detected", could it be that thawing does a bad
memory recovery from the swap?  I thought that the procedure includes some
checksum, but I don't know for sure.

It's the fact that the filesystem is still running and modifying
state when the snapshot is being taken that results in the snapshot
image containing an inconsistent snapshot. That then gets loaded
on thaw and it goes boom.

But it only happens on the /home partition, not on the email partition, for instance, which is on the same hard disk.

Unless... there are probably more things writing to the home partition than to the mail partition at any given time.

To me, there are two problems:

 1) The corruption itself.
 2) That xfs_repair fails to repair the filesystem. In fact, I believe
    it does not detect it!

That's because the filesystem is likely to be consistent on disk.
The issue is in-memory corruption, not on-disk corruption, like
the messages are telling us:

No, the on-disk filesystem is not healthy. If I continue using it after rebooting and running "xfs_repair" several times, it fails again within a day.

I got after booting (the first event):

<0.1> 2014-03-15 03:53:47 Telcontar kernel - - - [  301.857523] XFS: Internal error XFS_WANT_CORRUPTED_RETURN at line 350 of file

And some hours later:

<0.1> 2014-03-15 22:20:34 Telcontar kernel - - - [20151.298345] XFS: Internal error XFS_WANT_CORRUPTED_GOTO at line 1602 of file

So, instead of using xfs_repair, I reformatted and restored from backup, which worked for a month until the next event.

--
Cheers,
       Carlos E. R.
       (from 13.1 x86_64 "Bottle" at Telcontar)


