
Re: Corrupted files

To: Emmanuel Florac <eflorac@xxxxxxxxxxxxxx>, Sean Caron <scaron@xxxxxxxxx>
Subject: Re: Corrupted files
From: Sean Caron <scaron@xxxxxxxxx>
Date: Wed, 10 Sep 2014 10:49:32 -0400
Cc: Leslie Rhorer <lrhorer@xxxxxxxxxxxx>, Roger Willcocks <roger@xxxxxxxxxxxxxxxx>, Eric Sandeen <sandeen@xxxxxxxxxxx>, "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20140910162411.07ef02a5@xxxxxxxxxxxxxxxxxxxx>
References: <540F1B01.3020700@xxxxxxxxxxxx> <CAA43vkXwHF9RHW-cbTZ91_vF6wiQ6o_+TQDL3=7kD9P4tErCNQ@xxxxxxxxxxxxxx> <CAA43vkWgh8-EjDXjkySUn+y18W1O+v_W5j+fQankRTgDCmc8tw@xxxxxxxxxxxxxx> <540F7E37.7020500@xxxxxxxxxxx> <540F9FE9.7070500@xxxxxxxxxxxx> <3E40936B-A1F2-424D-B0B3-54B6C7B50B13@xxxxxxxxxxxxxxxx> <540FA9DB.20000@xxxxxxxxxxxx> <20140910162411.07ef02a5@xxxxxxxxxxxxxxxxxxxx>
I don't want to bloviate too much and drag this completely off topic, esp. since the OP's query is resolved, but please allow me just one anecdote :)

Earlier this year, I had one of our project file servers (450 TB) go down. It didn't go down because the array spuriously just lost a bunch of disks; it was simply your usual sort of Linux kernel panic... you go to the console and it's just black screen and unresponsive, or maybe you can see the tail end of a backtrace and it's unresponsive. So, OK, issue a quick remote IPMI reboot of the machine, it comes up...

I'm in single user mode, bringing up each sub-RAID6 in our RAID60 by hand, no problem. Bring up the top level RAID0. OK. Then I go to mount the XFS... no go. Apparently the log somehow got corrupted in the crash?

So I try to mount ro, no dice, but I _can_ mount ro,norecovery and I see good files here! Thank goodness. I start scavenging to a spare host...
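A minimal sketch of that kind of read-only salvage, for illustration only. It assumes XFS's norecovery mount option (which skips log replay and has to be combined with ro), and the device, mount point and rsync destination are hypothetical placeholders, not the names used on the real server. By default it only prints the commands; set RUN=1 to actually execute them.

```sh
#!/bin/sh
# Sketch only: DEV, MNT and DEST are hypothetical placeholders.
DEV=${DEV:-/dev/md0}             # assumed top-level RAID device
MNT=${MNT:-/mnt/rescue}          # assumed mount point
DEST=${DEST:-spare:/srv/rescue}  # assumed rsync target on the spare host

# Print a command; execute it only when RUN=1 is set.
run() {
    echo "$*"
    if [ "${RUN:-0}" = 1 ]; then "$@"; fi
}

salvage() {
    # norecovery skips log replay; XFS only allows it on a read-only mount.
    run mount -t xfs -o ro,norecovery "$DEV" "$MNT"
    # -a preserves permissions, owners and times; -H preserves hard links.
    run rsync -aH "$MNT/" "$DEST/"
}

salvage
```

Nothing here writes to the damaged filesystem, so whatever xfs_repair might or might not do later, the copy on the spare host is unaffected.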

A few weeks later, after the scavenge is done, I did a few xfs_repair runs just for the sake of experimentation. Using both in dry run mode, I tried the version that shipped with Ubuntu 12.04, as well as the latest xfs_repair I could pull from the source tree. I redirected the output of both runs to file and watched them with 'tail -f'.
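A rough sketch of a two-version dry-run comparison like the one described above. It relies on xfs_repair's -n (no-modify) and -v (verbose) flags; the device name, the two binary paths and the log file names are assumptions made up for the example. As written it only prints what it would do; set RUN=1 to run it for real.

```sh
#!/bin/sh
# Sketch only: every path below is a hypothetical placeholder.
DEV=${DEV:-/dev/md0}                          # assumed device
OLD=${OLD:-/sbin/xfs_repair}                  # assumed distro-shipped binary
NEW=${NEW:-$HOME/xfsprogs/repair/xfs_repair}  # assumed fresh build from source

# dry_run <xfs_repair binary> <logfile>
# -n makes no changes to the filesystem; -v asks for verbose output.
dry_run() {
    echo "$1 -n -v $DEV > $2 2>&1"
    if [ "${RUN:-0}" = 1 ]; then "$1" -n -v "$DEV" > "$2" 2>&1; fi
}

compare_repairs() {
    dry_run "$OLD" old.log
    dry_run "$NEW" new.log
    echo "diff -u old.log new.log"
    if [ "${RUN:-0}" = 1 ]; then diff -u old.log new.log; fi
}

compare_repairs
```

While a run is in progress, 'tail -f old.log' (or new.log) in another terminal shows the output as it is written.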

Diffing the output when they were done, it didn't look like they were behaving much differently. Both files had thousands or tens of thousands of lines worth of output in them, bad this, bad that... (I always run in verbose mode) Since the filesystem was hosed anyway and I was going to rebuild it, I decided to let the new xfs_repair run "for real" just to see what would happen, for kicks. And who knows? Maybe I could recover even more than I already had ...? (I wasn't just totally wasting time)

I think it took maybe a week for it to run on a 450 TB volume? At least a week. Maybe I was being a teensy bit hyperbolic in my previous descriptions of runtime, LOL. After it was done?

... almost everything was obliterated. I had tens of millions of zero-length files, and tens of millions of bits of anonymous scrambled junk in lost+found.

So, I chuckled a bit (thankful for my hard-won previous experience) before reformatting the array and copying back the results of my scavenging. Just by ro-mounting and copying what I could, I was able to save around 90% of the data by volume on the array (it was a little more than half full when it failed... ~290 TB? There was only ~30 TB that I couldn't salvage); good clean files that passed validation from their respective users. I think 80-90% recovery rates are very commonly achievable just mounting ro,norecovery and getting what you can with cp -R or rsync, provided there wasn't grievous failure of the underlying storage system.

If I had depended on xfs_repair, or blithely run it as a first line of response as the documentation might intimate (hey, it's called xfs_repair, right?) like you would casually think to do; run it like people run fsck or CHKDSK... I would have been hosed, big time.

Best,

Sean







On Wed, Sep 10, 2014 at 10:24 AM, Emmanuel Florac <eflorac@xxxxxxxxxxxxxx> wrote:
On Tue, 09 Sep 2014 20:31:07 -0500
Leslie Rhorer <lrhorer@xxxxxxxxxxxx> wrote:

> More
> importantly, is there some reason 3.1.7 would make things worse while
> 3.2.1 would not? If not, then I can always try 3.1.7 and then try
> 3.2.1 if that does not help.

I don't know about these particular versions; however, in the past
I've confirmed that a later version of xfs_repair performed way better
(salvaged more files from lost+found, in particular).

At some point in the distant past, some versions of xfs_repair were
buggy and would happily throw away TB of perfectly sane data... I had
this very problem once on Christmas eve in 2005 IIRC :/

--
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |   <eflorac@xxxxxxxxxxxxxx>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------
