
Re: Got "Internal error XFS_WANT_CORRUPTED_GOTO". Filesystem needs refor

To: XFS mailing list <xfs@xxxxxxxxxxx>
Subject: Re: Got "Internal error XFS_WANT_CORRUPTED_GOTO". Filesystem needs reformatting to correct issue.
From: "Carlos E. R." <carlos.e.r@xxxxxxxxxxxx>
Date: Sat, 12 Jul 2014 02:30:45 +0200 (CEST)
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20140705122833.GA3573@xxxxxxxxxxxxxxx>
References: <alpine.LSU.2.11.1407021104480.9881@xxxxxxxxxxxxxxxxx> <20140702120441.GA51757@xxxxxxxxxxxxxxx> <alpine.LSU.2.11.1407030057310.9881@xxxxxxxxxxxxxxxxx> <20140703173916.GB3452@xxxxxxxxxxxxxx> <alpine.LSU.2.11.1407040135140.9881@xxxxxxxxxxxxxxxxx> <20140705122833.GA3573@xxxxxxxxxxxxxxx>
Sender: Carlos Robinson <robin.listas@xxxxxxxxx>
User-agent: Alpine 2.11 (LSU 23 2013-08-11)

On Saturday, 2014-07-05 at 08:28 -0400, Brian Foster wrote:
On Fri, Jul 04, 2014 at 11:32:26PM +0200, Carlos E. R. wrote:

If I don't do that backup-format-restore, I get issues soon, and it crashes
within a day - I got after booting (the first event):

I echo Dave's previous question... within a day of doing what? Just
using the system or doing more hibernation cycles?

It is in the long post with the logs I posted.

The first time it crashed, I rebooted, got some errors I probably did not see, managed to mount the device, and I used the machine normally, doing several hibernation cycles. On one of these, it crashed, within the day.

As explained in this part of the previous post:

<0.1> 2014-03-15 03:53:47 Telcontar kernel - - - [  301.857523] XFS: Internal 
error XFS_WANT_CORRUPTED_RETURN at line 350 of file 

And some hours later:

<0.1> 2014-03-15 22:20:34 Telcontar kernel - - - [20151.298345] XFS: Internal 
error XFS_WANT_CORRUPTED_GOTO at line 1602 of file 

It was here that I decided to backup-format-restore instead.

That also means it's probably not necessary to do a full backup,
reformat and restore sequence as part of your routine here. xfs_repair
should scour through all of the allocation metadata and yell if it finds
something like free blocks allocated to a file.

No, if I don't backup-format-restore it happens again within a day. There is
something lingering. Unless that was just chance... :-?

It is true that during that day I hibernated several times more than needed
to see if it happened again - and it did.

This depends on what causes this to happen, not how frequently it happens.
Does it continue to happen along with hibernation, or do you start
seeing these kinds of errors during normal use?

Except the first time that this happened, the sequence is this:

I use the machine for weeks, without event, booting once, then hibernating at least once per day. I finally reboot when I have to apply some system update, or something special.

Till one day, this "thing" happens. It happens immediately after coming out of hibernation, and puts the affected partition, always /home, in read-only mode. When it happens, I reboot, repair the partition manually if needed, then I back up the files, format it, and restore all the files from the backup just made, with xfsdump. Well, this last time, I used rsync instead.

It has happened "only" four times:

2014-03-15 03:35:17
2014-03-15 22:20:34
2014-04-17 22:47:08
2014-06-29 12:32:18
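
A list of events like the one above can be pulled out of the system log with a simple filter. The following is only a sketch: the function name is made up for illustration, and the log location (/var/log/messages) is an assumption that depends on how syslog is configured on the machine.

```shell
# Hypothetical helper: print log lines that carry either of the two
# XFS internal error tags quoted in this thread.
# Reads the files given as arguments, or stdin if none are given.
xfs_corruption_events() {
    grep -E 'XFS_WANT_CORRUPTED_(RETURN|GOTO)' "$@"
}

# Example (log path is an assumption, adjust to your syslog setup):
#   xfs_corruption_events /var/log/messages
```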

If the latter, that could suggest something broken on disk.

That was my first thought, because it started happening after replacing the hard disk, but also after a kernel update. But I have tested that disk several times, with smartctl and with the manufacturer's test tool, and nothing came out.

If the
former, that could simply suggest the fs (perhaps on-disk) has made it
into some kind of state that makes this easier to reproduce, for
whatever reason. It could be timing, location of metadata,
fragmentation, or anything really for that matter, but it doesn't
necessarily mean corruption (even though it doesn't rule it out).
Perhaps the clean regeneration of everything by a from-scratch recovery
simply makes this more difficult to reproduce until the fs naturally
becomes more aged/fragmented, for example.

This probably makes a pristine, pre-repair metadump of the reproducing
fs more interesting. I could try some of my previous tests against a
restore of that metadump.

Well, I suggest that, unless you can find something in the metadata (I just sent you the link via email from google), we wait till the next event. I will at that time take an intact metadata image. But this can take a month or two to happen again, if the pattern holds.

I was somewhat thinking out loud originally discussing this topic. I was
suggesting to run this against a restored metadump, not the primary
dataset or a backup.

The metadump creates an image of the metadata of the source fs in a file
(no data is copied). This metadump image can be restored at will via
'xfs_mdrestore.' This allows restoring to a file, mounting the file
loopback, and performing experiments or investigation on the fs
generally as it existed when the shutdown was reproducible.

Ah... I see.

So basically:

- xfs_mdrestore <mdimgfile> <tmpfileimg>
- mount <tmpfileimg> /mnt
- rm -rf /mnt/*

... was what I was suggesting. <tmpfileimg> can be recreated from the
metadump image afterwards to get back to square one.
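
The steps above can be sketched as a small script. This is an illustration under stated assumptions, not a tested procedure: the function names, the device name and the file names are hypothetical placeholders, the capture step assumes the source filesystem is unmounted, and the real commands need root. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch only. All device, file and mount point names are placeholders.
# DRY_RUN=1 prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Save the metadata of the source filesystem to a file (no file data is copied).
capture_metadump() {   # usage: capture_metadump <device> <mdimgfile>
    run xfs_metadump "$1" "$2"
}

# Rebuild a scratch image from the metadump, mount it loopback, and run
# the experiment (here: removing everything, as in the steps above).
# Each step aborts the sequence on failure so that rm never runs against
# a mount point where the image did not get mounted.
replay_metadump() {    # usage: replay_metadump <mdimgfile> <tmpfileimg> <mountpoint>
    run xfs_mdrestore "$1" "$2" || return 1
    run mount -o loop "$2" "$3" || return 1
    run rm -rf "${3:?}"/*
    run umount "$3"
}
```

Calling replay_metadump again with the same metadump file recreates the scratch image, which is the "back to square one" step. To execute for real, set DRY_RUN=0 and run as root.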

I see.

Well, I tried this on a copy of the 'dd' image days ago, and nothing happened. I guess the procedure above would be the same.

I have an active bugzilla account at <http://oss.sgi.com/bugzilla/>, I'm
logged in there now. I haven't checked if I can create a bug, not being sure
what parameters to use (product, component, whom to assign to). I think that
would be the most appropriate place.

Meanwhile, I have uploaded the file to my google drive account, so I can
share it with anybody on request - ie, it is not public, I need to add a
gmail address to the list of people that can read the file.

Alternatively, I could just email the file to people asking for it, offlist,
but not in a single email, in chunks limited to 1.5 MB per email.
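
For the chunked-email option, splitting and reassembling could look like the sketch below. The function names and the file prefix are made up for illustration. The size 1500k means 1500 KiB per piece, which is roughly the 1.5 MB mentioned above; if the limit applies to the encoded message rather than the raw attachment, a smaller piece size would be needed.

```shell
# Sender side: cut the file into pieces of at most 1500 KiB each.
# Pieces are named <prefix>aa, <prefix>ab, ...
split_for_mail() {   # usage: split_for_mail <file> <prefix>
    split -b 1500k "$1" "$2"
}

# Receiver side: concatenate the pieces in name order.
# The output file name must not start with <prefix>, or it would be
# matched by the glob and read as one of its own inputs.
join_from_mail() {   # usage: join_from_mail <prefix> <outfile>
    cat "$1"* > "$2"
}
```

Comparing a checksum of the original and the rebuilt file (for example with sha1sum) confirms that no piece was lost or reordered.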

Either of the bugzilla or google drive options works OK for me.

It's here:


Whoever wants to read it has to tell me the address to add to it; access is not public.

-- Cheers,
       Carlos E. R.
       (from 13.1 x86_64 "Bottle" at Telcontar)

