Re: xfs_repair of critical volume

Date: Sun, 31 Oct 2010 12:56:33 -0700
> Hi,
> I have a large XFS filesystem (60 TB) that is composed of 5 hardware RAID 6 
> volumes. One of those volumes had several drives fail in a very short time 
> and we lost that volume. However, four of the volumes seem OK. We are in a 
> worse state because our backup unit failed a week later when four drives 
> simultaneously went offline. So we are in a very bad state. I am able to 
> mount the filesystem that consists of the four remaining volumes. I was 
> thinking about running xfs_repair on the filesystem in hopes it would recover 
> all the files that were not on the bad volume; those that were are obviously gone. 
> Since our backup is gone, I'm very concerned about doing anything to lose the 
> data that we still have. I ran xfs_repair with the -n flag and I have a 
> lengthy file of things that the program would do to our filesystem. I don't have 
> the expertise to decipher the output and figure out if xfs_repair would fix 
> the filesystem in a way that would retain our remaining data or if it would, 
> let's say, truncate the filesystem at the data loss boundary (our lost volume was the 
> middle one of the five volumes), returning 2/5 of the filesystem or some 
> other undesirable result. I would post the xfs_repair -n output here, but it 
> is more than a megabyte. I'm hoping one of you XFS gurus will take pity 
> on me and let me send you the output to look at, or give me an idea as to what 
> you think xfs_repair is likely to do if I should run it, or if anyone has any 
> suggestions as to how to get back as much data as possible in this recovery.
> thanks very much,
> Eli

Hi guys,

Thanks for all the responses. On the XFS volume that I'm trying to recover 
here, I've already re-initialized the RAID, so I've kissed that data goodbye. I 
am using LVM2. Each of the 5 RAID volumes is a physical volume. Then a logical 
volume is created out of those, and then the filesystem lies on top of that. So 
now we have, in order, 2 intact PVs, 1 healthy but blank PV, and 2 intact PVs. On the 
RAID where we lost the drives, replacements are in place and I created a now 
healthy volume. Through LVM, I was then able to create a new PV from the 
re-constituted RAID volume and put that into our logical volume in place of the 
destroyed PV. So now, I have a logical volume that I can activate and I can see 
the filesystem. It still reports having all the old files as before, 
although it no longer does. So the hardware is now OK. The question is just what 
to do with our damaged filesystem that has a huge chunk missing out of it. I put the 
xfs_repair trial output on an http server, as suggested (good suggestion) and 
it is here:


Now I also have the problem of our backup RAID unit that failed. That one 
failed after I re-initialized the primary RAID, but before I could restore the 
backups to the primary. I'm having some good luck, huh? On that RAID unit, 
everything was fine, but the next time I looked at it, a couple of hours 
later, 4 drives had gone offline and it reported the volume as lost. On that 
unit, the only thing I have done so far is to power cycle it a couple of times. 
Other than that, it is untouched. In it we are using the Caviar Green 2 TB 
drives, which our vendor told us were fine to use. However, I have read in the 
last couple of days that they have an issue with timing out as they remap 
sectors, as noted here:


Thus, I've learned that they are not recommended for use in RAID volumes. So I 
am looking hard into ways to try to recover that data as well, although it 
is only a partial backup of our main volume. It contains about 10 TB of the 
most critical files from the main volume. Fortunately, this isn't the human 
genome, but it is climate modeling data that graduate students have been 
generating for years. So losing all this could set them back years on their 
PhDs. So I take the situation pretty seriously. In this case, we are thinking 
about going with a data recovery company, but this isn't industry. Our lab 
doesn't have very deep pockets. $10K would be a huge chunk of money to spend. 
So, I would welcome suggestions for this unit as well. I believe the drives 
themselves in this unit are OK, as four going out within one minute, as the log 
shows, is not something that makes a lot of sense.  My guess is that they were 
under heavy load for the first time in a few months and four of the drives 
started remapping sectors at pretty much the same time. The RAID controller in 
this DAS 16 drive box tried to contact the drives and reached a timeout and 
marked them all as dead. We are also considering whether we have some sort 
of power problem, as we seem to have been unusually unlucky in the last couple of weeks, 
although we do have everything behind a pretty nice $7K UPS that isn't 
reporting any problems. 

OK, that's a long tale of woe. Thanks for any advice. 
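
For anyone trying to picture the layout described above (2 intact PVs, 1 blank PV, 2 intact PVs), here is a rough Python sketch of the arithmetic for estimating which part of the filesystem sat on the blank PV. It assumes the logical volume is a plain linear concatenation of the five PVs in order, and it ignores LVM metadata areas and extent rounding. The sizes, the allocation group count, and the function names are placeholders made up for illustration; the real agcount and agsize would have to come from xfs_info. Treat the result as an estimate only, since a file's inode and its data blocks need not sit in the same allocation group.

```python
# Sketch only: every number below is a hypothetical placeholder,
# not a measurement from the real system.

def missing_range(pv_sizes, blank_index):
    """Return (start, end) byte offsets of the blank PV inside the LV,
    assuming the PVs are concatenated in order (linear mapping)."""
    start = sum(pv_sizes[:blank_index])
    return start, start + pv_sizes[blank_index]


def affected_ags(ag_size, ag_count, start, end):
    """Return the allocation group numbers whose byte range
    overlaps the half-open interval [start, end)."""
    hit = []
    for ag in range(ag_count):
        ag_start = ag * ag_size
        ag_end = ag_start + ag_size
        if ag_start < end and ag_end > start:
            hit.append(ag)
    return hit


TB = 10**12
pv_sizes = [12 * TB] * 5   # hypothetical: five equal 12 TB PVs (60 TB total)
ag_size = 1 * TB           # hypothetical; real value is agsize * bsize from xfs_info
ag_count = 60              # hypothetical; real value is agcount from xfs_info

start, end = missing_range(pv_sizes, blank_index=2)  # middle PV of five
ags = affected_ags(ag_size, ag_count, start, end)

print(f"blank region: {start // TB} TB to {end // TB} TB of the LV")
print(f"allocation groups overlapping it: {len(ags)} of {ag_count}")
```

If the real PVs are not equal in size, or the LV is striped rather than linear, the mapping is different and this sketch does not apply.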

