To: XFS mailing list <xfs@xxxxxxxxxxx>
Subject: Re: Got "Internal error XFS_WANT_CORRUPTED_GOTO". Filesystem needs reformatting to correct issue.
From: "Carlos E. R." <carlos.e.r@xxxxxxxxxxxx>
Date: Fri, 4 Jul 2014 23:32:26 +0200 (CEST)
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20140703173916.GB3452@xxxxxxxxxxxxxx>
References: <alpine.LSU.2.11.1407021104480.9881@xxxxxxxxxxxxxxxxx> <20140702120441.GA51757@xxxxxxxxxxxxxxx> <alpine.LSU.2.11.1407030057310.9881@xxxxxxxxxxxxxxxxx> <20140703173916.GB3452@xxxxxxxxxxxxxx>
Sender: Carlos Robinson <robin.listas@xxxxxxxxx>
User-agent: Alpine 2.11 (LSU 23 2013-08-11)

[This email has been delayed, while I thought about where to upload metadata file - see near the end]

On Thursday, 2014-07-03 at 13:39 -0400, Brian Foster wrote:
On Thu, Jul 03, 2014 at 05:00:47AM +0200, Carlos E. R. wrote:

Ok, so there's a lot going on. I was mainly curious to see what was
causing lingering preallocations, but it could be anything extending a
file multiple times.


AFAIK, xfsdump can not carry over a filesystem corruption, right?

I think that's accurate, though it might complain/fail in the act of
dumping an fs that is corrupted. The behavior here suggests there might
not be on disk corruption, however.

At least, not a detectable one.

If I don't do that backup-format-restore, I get issues soon, and it crashes within a day. This is what I got after booting (the first event):

<0.1> 2014-03-15 03:53:47 Telcontar kernel - - - [  301.857523] XFS: Internal 
error XFS_WANT_CORRUPTED_RETURN at line 350 of file 

And some hours later:

<0.1> 2014-03-15 22:20:34 Telcontar kernel - - - [20151.298345] XFS: Internal 
error XFS_WANT_CORRUPTED_GOTO at line 1602 of file 

It was here that I decided to backup-format-restore instead.

Maybe next time I can take an image with dd before doing anything else (it
takes about 80 minutes), or simply do an "xfs_metadump", which should be
faster.  And I might not then have 500 GiB of free space to make a dd copy,

xfs_metadump should be faster. It will grab the metadata only and
obfuscate filenames so as to hide sensitive information.
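
A minimal sketch of capturing such a dump, assuming the filesystem is unmounted or mounted read-only when it is taken. The wrapper name capture_metadump and the paths in the usage comment are hypothetical; -g only prints progress, and obfuscation is the default unless -o is given.

```shell
# capture_metadump is a hypothetical wrapper; adjust device and output paths.
capture_metadump() {
    dev=$1      # block device or image file holding the XFS filesystem
    out=$2      # where to write the metadata dump

    # -g shows progress; file names are obfuscated unless -o is given.
    xfs_metadump -g "$dev" "$out" || return 1

    # Compress the dump, keeping the uncompressed copy (-k).
    xz -k "$out"
}

# Usage (as root, filesystem not mounted read-write):
#   capture_metadump /dev/sdXN /tmp/home.metadump
```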

Ok, I have a post-it label on the monitor so that I remember - my notes are typically stored in the home partition :-)

But the obfuscation is not complete, I can recognize file names:

00008DC0   .leeme.kfPTgt . ....... .2aujzfJ.%;u. .   .0...
00008DF0    .pepe_after_gnome.tar.bz2.vcTJ8c.@.. . .......
00008E20   .amyN3xYjaldFXYpeUry. 3;&.K.. ..  .0... !.pepe_j
00008E50   ust_created.tar.bz2.JlyD0W .. .@....... .NGb0URO
00008E80   C0Bh9cHwp-hBh.6wMS .. .p  . ... ..registro.0DPzS
00008EB0   G  .. . ....... .8n-.w$.9. .. .   .8... +.suse_u
00008EE0   pgrade_to_102_pkglist-bis.txt.tcFUKq. . .......
00008F10   #B-XqcrWP4cqsw77yv8UsYbcCa-D76q..(#.. ..  .8...
00008F40   '.suse_upgrade_to_102_pkglist.txt.0KTuDa  7.. .8

I just had a quick look with 'mc', the dump is too large to inspect it all.
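
Rather than paging through the dump by hand, one could search it for names known to exist on the filesystem. This is only a sketch: check_leaks is a hypothetical helper, not part of xfsprogs, and it assumes a grep that supports -a (treat binary data as text), as GNU and BSD grep do.

```shell
# check_leaks DUMP NAME...
# Prints "leaked: NAME" for every NAME found verbatim inside DUMP.
check_leaks() {
    dump=$1
    shift
    for name in "$@"; do
        # -a: scan binary data as text, -F: fixed string, -q: no output
        if grep -a -q -F -- "$name" "$dump"; then
            printf 'leaked: %s\n' "$name"
        fi
    done
}

# Usage:
#   check_leaks tgtfile pepe_after_gnome.tar.bz2 registro
```

A name that is not reported here may still be present in some other encoding, so an empty result is not proof that nothing leaked.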


As this always happens on recovery from hibernation, and seeing the message
"Corruption of in-memory data detected", could it be that thawing does a bad
memory recovery from the swap?  I thought that the procedure includes some
checksum, but I don't know for sure.

Not sure, though if so I would think that might be a more common source
of problems.

And it only affects my /home partition - although it may be the busiest one.

To me, there are two problems:

 1) The corruption itself.
 2) That xfs_repair fails to repair the filesystem. In fact, I believe
    it does not detect it!

To me, #2 is the worst, and it is what makes me do the backup, format,
restore cycle for recovery. An occasional kernel crash is somewhat
acceptable :-}

Well it could be that the "corruption" is gone at the point of a
remount. E.g., something becomes inconsistent in memory, the fs detects
it and shuts down before going any further. That's actually a positive.

That also means it's probably not necessary to do a full backup,
reformat and restore sequence as part of your routine here. xfs_repair
should scour through all of the allocation metadata and yell if it finds
something like free blocks allocated to a file.

No, if I don't backup-format-restore it happens again within a day. There is something lingering. Unless that was just chance... :-?

It is true that during that day I hibernated several times more than needed to see if it happened again - and it did.

I'm curious if something like an 'rm -rf *' on the metadump
would catch any other corruptions or if this is indeed limited to
something associated with recent (pre)allocations.

Sorry, run 'rm -rf *' where???

On the metadump... mainly just to see whether freeing all of the used
blocks in the fs triggered any other errors (i.e., a brute force way to
check for further corruptions).
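
The brute-force check described above could look roughly like the following sketch: restore the metadump into a scratch image, loop-mount it, delete everything, then let xfs_repair look at the result. The function name brute_force_check and all paths are placeholders. A restored metadump carries metadata only, so the file contents are not real data, but freeing the blocks still exercises the allocation structures. Any XFS errors of interest would show up in the kernel log while the delete runs.

```shell
# brute_force_check is a hypothetical helper; all paths are placeholders.
brute_force_check() {
    dump=$1     # file produced by xfs_metadump
    img=$2      # scratch image file to restore into
    mnt=$3      # empty directory used as mount point

    xfs_mdrestore "$dump" "$img" || return 1
    mount -o loop "$img" "$mnt" || return 1

    # Remove everything below the mount point, including dot files.
    find "$mnt" -mindepth 1 -delete
    status=$?

    umount "$mnt" || return 1

    # -n: report problems without modifying; -f: target is a regular file.
    xfs_repair -n -f "$img" || return 1
    return "$status"
}

# Usage (as root):
#   brute_force_check home.metadump scratch.img /mnt/scratch
```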

Sorry, but I fail to see how to do it. I may be thick, or I lack the context.

If I run:

Telcontar:/data/storage_d/old_backup # ls -lh
total 604G
drwxr-xr-x 22 root root  4.0K Mar  8 20:30 home
drwxr-xr-x  3 root root    16 Sep 25  2010 home1
drwxr-xr-x  2 root root     6 Jul  3 02:36 mount
-rw-r--r--  1 root root    45 Jul  3 04:25 procedure
-rw-r--r--  1 root root  388M Jul  3 02:42 tgtfile
-rw-r--r--  1 root root   11M Jul  3 02:50 tgtfile2.xz
-rw-r--r--  1 root users 489G Mar 16 05:42 xfs_copy_home
-rw-r--r--  1 root root  489G Jul  3 04:40 xfs_copy_home_workonit
-rw-r--r--  1 root users  39G Mar 16 05:49 xfsdump__home
-rw-r--r--  1 root users  39G Mar 16 05:57 xfsdump__home1
Telcontar:/data/storage_d/old_backup # rm -rf *

that would destroy my entire backup!

If you mean:

 rm -rf tgtfile

I fail to see what that would accomplish, except to remove a file that is 
actually on a different partition, not home.

However, I can do:

Telcontar:/data/storage_d/old_backup # mount -v xfs_copy_home_workonit mount/
mount: /dev/loop0 mounted on /data/storage_d/old_backup/mount.
Telcontar:/data/storage_d/old_backup # cd mount
Telcontar:/data/storage_d/old_backup/mount # time rm -r 

real    2m45.380s
user    0m0.265s
sys     0m6.878s
Telcontar:/data/storage_d/old_backup/mount #
Telcontar:/data/storage_d/old_backup/mount # ls -la
total 4
drwxr-xr-x 2 root root    6 Jul  4 01:56 .
drwxr-xr-x 5 root root 4096 Jul  3 04:25 ..
Telcontar:/data/storage_d/old_backup/mount #
Telcontar:/data/storage_d/old_backup/mount # df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      489G   33M  489G   1% /data/storage_d/old_backup/mount
Telcontar:/data/storage_d/old_backup/mount #

And I do not see anything on the log, only that it mounted cleanly.

Meanwhile, I have done an xfs_metadump of the image, and compressed it with
xz. It is 10834536 bytes. What do I do with it? I'm not sure I can email
that, and even less to a mailing list.

Do you still have a bugzilla system where I can upload it? I had an account
at <http://oss.sgi.com/bugzilla/>, created in 2010. I don't know if it still
runs :-?

I have an active bugzilla account at <http://oss.sgi.com/bugzilla/>, and I'm logged in there now. I haven't checked whether I can create a bug, as I'm not sure which parameters to use (product, component, whom to assign it to). I think that would be the most appropriate place.

Meanwhile, I have uploaded the file to my Google Drive account, so I can share it with anybody on request. That is, it is not public: I need to add a Gmail address to the list of people that can read the file.

Alternatively, I could just email the file to people asking for it, offlist, but not in a single email, in chunks limited to 1.5 MB per email.
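
Chunking for email could be done with split and cat. The sketch below is an assumption about how one might do it, not something already set up: the function names are hypothetical, and the default of 1500000 bytes simply mirrors the 1.5 MB figure above. Mail encoding usually inflates attachments by about a third, so a smaller chunk size may be needed in practice.

```shell
# split_for_mail FILE [BYTES]
# Writes FILE.part.aa, FILE.part.ab, ... each at most BYTES long.
split_for_mail() {
    src=$1
    size=${2:-1500000}
    split -b "$size" "$src" "$src.part."
}

# join_from_mail FILE OUTPUT
# Concatenates FILE.part.* (the glob sorts them in order) into OUTPUT.
join_from_mail() {
    cat "$1".part.* > "$2"
}

# Usage:
#   split_for_mail tgtfile2.xz
#   join_from_mail tgtfile2.xz rebuilt.xz && cmp tgtfile2.xz rebuilt.xz
```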

I think http://bugzilla.redhat.com should allow you to file a bug and
attach the file.

Sorry, I don't have an account there...

I do have one at openSUSE, though, and it does allow me to attach files, up to a limit. If the file is too big, it can be split into pieces. But I will not use it unless you people say that you have an account there.

For using a bugzilla, the most appropriate one would be at SGI, IMHO, if they are still supporting this project.

--
Cheers,
       Carlos E. R.
       (from 13.1 x86_64 "Bottle" at Telcontar)

