-----BEGIN PGP SIGNED MESSAGE-----
On Saturday, 2014-07-05 at 08:28 -0400, Brian Foster wrote:
On Fri, Jul 04, 2014 at 11:32:26PM +0200, Carlos E. R. wrote:
If I don't do that backup-format-restore, I get issues soon, and it crashes
within a day - this is what I got after booting (the first event):
I echo Dave's previous question... within a day of doing what? Just
using the system or doing more hibernation cycles?
It is in the long post with the logs I posted.
The first time it crashed, I rebooted, got some errors I probably did not
see, managed to mount the device, and I used the machine normally, doing
several hibernation cycles. On one of these, it crashed, within the day.
As explained in this part of the previous post:
<0.1> 2014-03-15 03:53:47 Telcontar kernel - - - [ 301.857523] XFS: Internal
error XFS_WANT_CORRUPTED_RETURN at line 350 of file
And some hours later:
<0.1> 2014-03-15 22:20:34 Telcontar kernel - - - [20151.298345] XFS: Internal
error XFS_WANT_CORRUPTED_GOTO at line 1602 of file
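For anyone wanting to line these events up, here is a minimal sketch that pulls the fields out of such log lines. It assumes the syslog layout seen in the two lines above, rejoined onto one line each; the file path stays elided as in the post, and the names XFS_ERROR_RE and parse_xfs_error are only illustrative.

```python
import re
from datetime import datetime

# Assumed layout, taken from the two log lines quoted in this post:
# <prio> DATE TIME HOST kernel - - - [kernel timestamp] XFS: Internal error MACRO at line N of file ...
XFS_ERROR_RE = re.compile(
    r"<(?P<prio>[\d.]+)>\s+"
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+kernel\b.*?"
    r"\[\s*(?P<ktime>\d+\.\d+)\]\s+"
    r"XFS: Internal error (?P<macro>\w+) at line (?P<line>\d+)"
)


def parse_xfs_error(log_line):
    """Return a dict describing one XFS internal error, or None if no match."""
    match = XFS_ERROR_RE.search(log_line)
    if match is None:
        return None
    return {
        "when": datetime.strptime(match.group("stamp"), "%Y-%m-%d %H:%M:%S"),
        "host": match.group("host"),
        "kernel_time": float(match.group("ktime")),  # bracketed kernel timestamp
        "macro": match.group("macro"),
        "line": int(match.group("line")),
    }


# The two events from this post (source file path elided, as in the original).
SAMPLE_LINES = [
    "<0.1> 2014-03-15 03:53:47 Telcontar kernel - - - [ 301.857523] XFS: "
    "Internal error XFS_WANT_CORRUPTED_RETURN at line 350 of file",
    "<0.1> 2014-03-15 22:20:34 Telcontar kernel - - - [20151.298345] XFS: "
    "Internal error XFS_WANT_CORRUPTED_GOTO at line 1602 of file",
]

events = [parse_xfs_error(line) for line in SAMPLE_LINES]

if __name__ == "__main__":
    for event in events:
        print(event["when"], event["macro"], event["line"])
    print("wall-clock gap:", events[1]["when"] - events[0]["when"])
```

Fed a whole log file line by line, the same function skips everything that is not an XFS internal error, which makes it easy to list all the events next to the hibernation times.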
It was here that I decided to backup-format-restore instead.
That also means it's probably not necessary to do a full backup,
reformat and restore sequence as part of your routine here. xfs_repair
should scour through all of the allocation metadata and yell if it finds
something like free blocks allocated to a file.
No, if I don't backup-format-restore it happens again within a day. There is
something lingering. Unless that was just chance... :-?
It is true that during that day I hibernated several times more than needed
to see if it happened again - and it did.
This depends on what causes this to happen, not how frequently it happens.
Does it continue to happen along with hibernation, or do you start
seeing these kinds of errors during normal use?
Except the first time that this happened, the sequence is this:
I use the machine for weeks, without event, booting once, then hibernating
at least once per day. I finally reboot when I have to apply some
system update, or something special.
Till one day, this "thing" happens. It happens immediately after coming
out from hibernation, and puts the affected partition, always /home, in
read only mode. When it happens, I reboot, repair partition manually if
needed, then I back up the files, format it, and replace all the files
from the backup just made, with xfsdump. Well, this last time, I used
It has happened "only" four times:
If the latter, that could suggest something broken on disk.
That was my first thought, because it started happening after replacing the
hard disk, but also after a kernel update. But I have tested that disk
several times, with smartctl and with the manufacturer test tool, and
nothing came out.
If the former, that could simply suggest the fs (perhaps on-disk) has made it
into some kind of state that makes this easier to reproduce, for
whatever reason. It could be timing, location of metadata,
fragmentation, or anything really for that matter, but it doesn't
necessarily mean corruption (even though it doesn't rule it out).
Perhaps the clean regeneration of everything by a from-scratch recovery
simply makes this more difficult to reproduce until the fs naturally
becomes more aged/fragmented, for example.
This probably makes a pristine, pre-repair metadump of the reproducing
fs more interesting. I could try some of my previous tests against a
restore of that metadump.
Well, I suggest that, unless you can find something in the metadata (I
just sent you the link via email from google), we wait till the next
event. I will at that time take an intact metadata snapshot. But this can
take a month or two to happen again, if the pattern holds.
I was somewhat thinking out loud originally discussing this topic. I was
suggesting to run this against a restored metadump, not the primary
dataset or a backup.
The metadump creates an image of the metadata of the source fs in a file
(no data is copied). This metadump image can be restored at will via
'xfs_mdrestore.' This allows restoring to a file, mounting the file
loopback, and performing experiments or investigation on the fs
generally as it existed when the shutdown was reproducible.
Ah... I see.
- xfs_mdrestore <mdimgfile> <tmpfileimg>
- mount <tmpfileimg> /mnt
- rm -rf /mnt/*
... was what I was suggesting. <tmpfileimg> can be recreated from the
metadump image afterwards to get back to square one.
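To make the sequence above concrete, here is a minimal sketch that only builds and prints the commands, so nothing is touched until someone runs them by hand as root. The helper name experiment_steps and the file names are hypothetical; the explicit "-o loop" option and the final umount are my additions to the three steps listed above.

```python
import shlex


def experiment_steps(md_image, scratch_image, mount_point):
    """Return the commands for one throwaway experiment, as argv lists.

    md_image      -- metadump image of the fs (metadata only, no file data)
    scratch_image -- temporary file the metadump is restored into
    mount_point   -- where the scratch image gets loop-mounted
    """
    return [
        # Recreate the filesystem image from the metadump.
        ["xfs_mdrestore", md_image, scratch_image],
        # Loop-mount the restored image (explicit "-o loop" is an addition).
        ["mount", "-o", "loop", scratch_image, mount_point],
        # The experiment itself: remove everything on the scratch fs.
        # A shell is needed here so that the "*" glob gets expanded.
        ["sh", "-c", "rm -rf -- " + shlex.quote(mount_point) + "/*"],
        # Not in the original steps: unmount before restoring again.
        ["umount", mount_point],
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for argv in experiment_steps("home.metadump", "/tmp/home.img", "/mnt"):
        print(shlex.join(argv))
```

Since the scratch image is rebuilt from the metadump on every run, the steps can be repeated as often as needed without going near the real partition or the backup.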
Well, I tried this on a copy of the 'dd' image days ago, and nothing
happened. I guess the procedure above would be the same.
I have an active bugzilla account at <http://oss.sgi.com/bugzilla/>, and I'm
logged in there now. I haven't checked if I can create a bug, as I'm not sure
what parameters to use (product, component, whom to assign to). I think that
would be the most appropriate place.
Meanwhile, I have uploaded the file to my google drive account, so I can
share it with anybody on request - i.e., it is not public, I need to add a
gmail address to the list of people that can read the file.
Alternatively, I could just email the file to people asking for it, offlist,
but not in a single email, in chunks limited to 1.5 MB per email.
Either of the bugzilla or google drive options works OK for me.
Whoever wants to read it has to tell me the address to add to it, as access
is not public.
Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
-----END PGP SIGNATURE-----