
Re: xfs_repair of critical volume

To: xfs@xxxxxxxxxxx
Subject: Re: xfs_repair of critical volume
From: Steve Costaras <stevecs@xxxxxxxxxx>
Date: Sun, 31 Oct 2010 09:41:37 -0500
In-reply-to: <20101031151000.70dcd6b9@xxxxxxxxxxxxxx>
References: <75C248E3-2C99-426E-AE7D-9EC543726796@xxxxxxxx> <20101031151000.70dcd6b9@xxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv: Gecko/20100608 Lightning/1.0b2 Thunderbird/3.1

On 2010-10-31 09:10, Emmanuel Florac wrote:
Did you try to actually freeze the failed drives (it may revive them for a while)?

Do NOT try this. It is only good for some /very/ specific types of failures on older drives. With an array of your size you are probably running relatively current drives (i.e. from the past 5-7 years), and freezing them has a very large probability of causing more damage.

The other questions are to the point: they help determine the circumstances around the failure and what state the array was in at the time. Take your time and do not rush anything; you are already hanging over a cliff.

The first thing to do, if you are able, is a bit-for-bit copy of the physical drives to spares, so that you can always get back to the point where you are now. This may not be practical with such a large array, but if you have the means it is worth it.
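
To illustrate what a bit copy of a drive involves, here is a minimal Python sketch, not a recommendation of a specific tool. It reads the source block by block and zero-fills any block that cannot be read, so that offsets in the copy still line up with the original. The function name, the paths and the block size are all assumptions made for the example; in practice a dedicated tool such as GNU ddrescue is the more usual choice.

```python
import os

def copy_device(src_path, dst_path, block_size=1024 * 1024):
    """Copy src to dst block by block (hypothetical helper).

    Blocks that raise a read error are replaced with zeros so that
    offsets in the copy match the source. Returns the list of byte
    offsets that could not be read.
    """
    bad_offsets = []
    src = os.open(src_path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and,
        # on Linux, for block devices as well.
        size = os.lseek(src, 0, os.SEEK_END)
        with open(dst_path, "wb") as dst:
            offset = 0
            while offset < size:
                want = min(block_size, size - offset)
                try:
                    data = os.pread(src, want, offset)
                except OSError:
                    # Unreadable block: note it and keep going.
                    bad_offsets.append(offset)
                    data = b"\0" * want
                if not data:
                    break  # unexpected end of input
                dst.write(data)
                offset += len(data)
    finally:
        os.close(src)
    return bad_offsets
```

The source is only ever opened read-only, and the returned offsets tell you which regions of the copy are filler rather than real data.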

You want to start from the lowest component and work your way up: make sure the RAID array itself is sane before trying to fix anything at the volume management layer, and fix that before looking at your file systems. When dealing with degraded or failed arrays, be careful about what you do if write cache is enabled on your controllers. Talk to the vendor! Whatever operations you perform on the card could cause the cached data to be lost, and that can be substantial with some controllers (MiB to GiB ranges). Normally we run with write cache disabled (both on the drives and on the RAID controllers) for critical data, to avoid having too much data in flight if a problem ever does occur.
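
The bottom-up order can be written down as a small Python sketch. This is only an illustration of the ordering: the function name is made up, and the per-layer checks are placeholders for whatever your controller, volume manager and file system tools actually report.

```python
def first_unhealthy_layer(checks):
    """Run checks from the lowest layer upward (hypothetical helper).

    checks is an ordered list of (layer_name, check_fn) pairs, where
    check_fn() returns True when that layer looks sane. Returns the
    name of the first layer that fails, or None if all pass. Layers
    above a failing one are deliberately not examined.
    """
    for name, check in checks:
        if not check():
            return name
    return None

# Placeholder results; in reality each would come from the vendor's
# tools for that layer.
layers = [
    ("raid array", lambda: False),
    ("volume management", lambda: True),
    ("file system", lambda: True),
]
print(first_unhealthy_layer(layers))
```

Stopping at the first failing layer is the point: there is no use repairing a file system that sits on top of an array that is not yet sane.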

The points that Emmanuel mentioned are valid, though I would hold off on powering down until you have all the geometry information from your RAIDs (unless you already have it). I would also hold off until you determine whether there are any dirty caches on the RAID controllers. Most controllers keep a rotating buffer of events, including failure pointers. If you reboot, the re-scanning of drives at startup may push that pointer further down the stack until it gets lost, and then you won't be able to recover the outstanding data. I've seen this buffer set at 128 - 256 entries on various systems, which is another reason to keep the number of drives per controller down.
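
The rotating buffer problem is easy to see in a toy model. The Python sketch below assumes a 128-entry log and one new event per re-scanned drive; both are assumptions for illustration only, since real controllers differ in what they log and how much.

```python
from collections import deque

def surviving_events(events, capacity=128):
    """Replay events into a fixed-size rotating log (toy model).

    Once the log is full, each new event silently pushes out the
    oldest one. Returns what is still in the log at the end.
    """
    log = deque(maxlen=capacity)
    for event in events:
        log.append(event)
    return list(log)

# One failure record followed by 128 re-scan events after a reboot.
events = ["drive-failure"] + [f"rescan-{i}" for i in range(128)]
kept = surviving_events(events)
print("drive-failure" in kept)
```

With 127 re-scan events the failure record would still be the oldest entry in the log; the 128th pushes it out, and nothing signals that it is gone.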

