
Re: Corrupted files

To: Sean Caron <scaron@xxxxxxxxx>
Subject: Re: Corrupted files
From: Roger Willcocks <roger@xxxxxxxxxxxxxxxx>
Date: Wed, 10 Sep 2014 02:00:18 +0100
Cc: Roger Willcocks <roger@xxxxxxxxxxxxxxxx>, Eric Sandeen <sandeen@xxxxxxxxxxx>, Leslie Rhorer <lrhorer@xxxxxxxxxxxx>, "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <CAA43vkX8Ve3g7=w16742b4vT3=yRdCMR7m66g2M7fYPmEMmctA@xxxxxxxxxxxxxx>
References: <540F1B01.3020700@xxxxxxxxxxxx> <CAA43vkXwHF9RHW-cbTZ91_vF6wiQ6o_+TQDL3=7kD9P4tErCNQ@xxxxxxxxxxxxxx> <CAA43vkWgh8-EjDXjkySUn+y18W1O+v_W5j+fQankRTgDCmc8tw@xxxxxxxxxxxxxx> <540F7E37.7020500@xxxxxxxxxxx> <CAA43vkX8Ve3g7=w16742b4vT3=yRdCMR7m66g2M7fYPmEMmctA@xxxxxxxxxxxxxx>
I normally watch quietly from the sidelines but I think it's important to get some balance here; our customers between them run many hundreds of multi-terabyte arrays and when something goes badly awry it generally falls to me to sort it out. In my experience xfs_repair does exactly what it says on the tin.

I can recall only a couple of instances where we elected to reformat and reload from backups, and they were both due to human error: somebody deleted the wrong raid unit when doing routine maintenance, and then tried to fix it up themselves.

In theory of course xfs_repair shouldn't be needed if the write barriers work properly (it's a journalled filesystem), but low-level corruption does creep in due to power failures / kernel crashes and it's this which xfs_repair is intended to address; not massive data corruption due to failed hardware or careless users.


On 9 Sep 2014, at 23:57, Sean Caron <scaron@xxxxxxxxx> wrote:

Hey, just sharing some hard-won (believe me) professional experience. I have seen xfs_repair take a bad situation and make it worse many times. I don't know that a filesystem fuzzer or any other simulation can ever truly simulate users absolutely pounding the tar out of a system. There seems to be a real disconnect between what developers are able to test and observe directly, and what happens in a very high-throughput production environment.



On Tue, Sep 9, 2014 at 6:24 PM, Eric Sandeen <sandeen@xxxxxxxxxxx> wrote:
On 9/9/14 11:03 AM, Sean Caron wrote:

Barring rare cases, xfs_repair is bad juju.

No, it's not.  It is the appropriate tool to use for filesystem repair.

But it is not the appropriate tool for recovery from mangled storage.

I've actually been running a filesystem fuzzer over xfs images, randomly
corrupting data and testing repair, 1000s of times over.  It does
remarkably well.

If you scramble your raid, which means your block device is no longer
an xfs filesystem, but is instead a random tangle of bits and pieces of
other things, of course xfs_repair won't do well, but it's not the right
tool for the job at that stage.


