
Re: Corrupted files

To: Roger Willcocks <roger@xxxxxxxxxxxxxxxx>, Sean Caron <scaron@xxxxxxxxx>
Subject: Re: Corrupted files
From: Leslie Rhorer <lrhorer@xxxxxxxxxxxx>
Date: Tue, 09 Sep 2014 20:23:51 -0500
Cc: Eric Sandeen <sandeen@xxxxxxxxxxx>, "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <62B15E94-1944-457F-B298-89EDEE3EC70D@xxxxxxxxxxxxxxxx>
References: <540F1B01.3020700@xxxxxxxxxxxx> <CAA43vkXwHF9RHW-cbTZ91_vF6wiQ6o_+TQDL3=7kD9P4tErCNQ@xxxxxxxxxxxxxx> <CAA43vkWgh8-EjDXjkySUn+y18W1O+v_W5j+fQankRTgDCmc8tw@xxxxxxxxxxxxxx> <540F7E37.7020500@xxxxxxxxxxx> <CAA43vkX8Ve3g7=w16742b4vT3=yRdCMR7m66g2M7fYPmEMmctA@xxxxxxxxxxxxxx> <62B15E94-1944-457F-B298-89EDEE3EC70D@xxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
On 9/9/2014 8:00 PM, Roger Willcocks wrote:
I normally watch quietly from the sidelines but I think it's important
to get some balance here

That is almost always wise advice. Shooting from the hip often has regrettable consequences, yet being too cautious can have its downside, too. In this case, things are working very well at the moment, and the apparent issues are reasonably small, so there is no need for panic.

our customers between them run many hundreds
of multi-terabyte arrays and when something goes badly awry it generally
falls to me to sort it out. In my experience xfs_repair does exactly
what it says on the tin.

I couldn't say. This is only the second time I have ever had an array drop, and the first time it was completely unrecoverable. Less than 5 minutes after I had started a RAID upgrade from RAID5 to RAID6, there was a protracted power outage. I shut down the system cleanly and after the outage restarted the reshape. The recovery had only been running a few minutes when the system suffered a kernel panic - I never did find out why. Every single structure on the array larger than the stripe size (16K, I think) was garbage.

I can recall only a couple of instances where we elected to reformat and
reload from backups and they were both due to human error: somebody
deleted the wrong raid unit when doing routine maintenance, and then
tried to fix it up themselves.

In theory of course xfs_repair shouldn't be needed if the write barriers
work properly (it's a journalled filesystem), but low-level corruption
does creep in due to power failures / kernel crashes and it's this which
xfs_repair is intended to address; not massive data corruption due to
failed hardware or careless users.

Oh, yeah, like losing 3 out of 8 drives in the array after a drive controller replacement...
