
Re: xfs data loss

To: Linux XFS <xfs@xxxxxxxxxxx>
Subject: Re: xfs data loss
From: pg_xf2@xxxxxxxxxxxxxxxxxx (Peter Grandi)
Date: Sat, 29 Aug 2009 14:08:33 +0000
In-reply-to: <B9A7B002C7FAFC469D4229539E909760308DA651DE@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
References: <B9A7B002C7FAFC469D4229539E909760308DA651DE@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
> Dear xfs developers, we have a SUN X4500 with 48 x 500 GB
> drives, which we configured under SUSE SLES 10.

> Among others, we have 3 RAID5 xfs filesystems: /dev/md4 with
> 20 units (9.27 TB), /dev/md5 with 20 units (9.27 TB), and
> /dev/md6 with 5 units (1.95 TB).

AAAAA-Amazing! 19+1 RAID5s. With all identical drives in the same
box. What a wondersome, challenging configuration!

> These units are not backed up.

Many would say that RAID means that backup is not necessary.

> Due to a power shock, suddenly and without log messages, about
> one half (5 TB) of the user directories on /dev/md4 have
> disappeared. [ ... ]

That was utterly unforeseeable! Power problems never happen to
computers, especially those that fry electronics or cause write
errors on multiple drives, so why worry?

> Upon reboot, /dev/md6 showed only 3 units, and after an
> xfs_repair it was again ok.

Uh, that was the "/dev/md6 with 5 units", so I guess Elvis
personally came to deliver an exact copy of drive #4, or else
someone has invented a new algorithm to ensure that a RAID5 can
lose 2 drives and still be fine. In the latter case rush to the
patent office -- you shall become billionaires.
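
More prosaically, before trusting any "again ok" it is worth
asking MD itself how many members each array really has. An
untested sketch (the device names are just placeholders):

  # Summary of every MD array and which member devices are still in it.
  cat /proc/mdstat
  # Detailed state of one array: active, failed and spare devices.
  mdadm --detail /dev/md6
  # Per-member superblock view, handy when an array refuses to assemble.
  mdadm --examine /dev/sdX

A single-parity RAID5 that has lost 2 of its 5 members is not
degraded, it is gone, whatever 'xfs_repair' reports afterwards.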

> /dev/md4 mounted immediately, but always with only one half of
> the directories. 1) xfs_check reports no problems on /dev/md4,
> but 2) xfs_logprint [ ... ]

> 3) xfs_ncheck eats a lot of memory and freezes after 6-7 hours
> without giving output

I wonder whether the even more advanced technique of creating a
multi-TB filesystem on a 32-bit kernel was also used, for extra
sophistication.
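
For the record, a couple of untested lines to check whether that
is indeed the corner xfs_ncheck has been painted into:

  uname -m          # i686 or similar means a 32-bit kernel, x86_64 a 64-bit one
  getconf LONG_BIT  # word size of the userland tools: 32 or 64
  free -m           # total and free RAM/swap in MiB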

> What can I do? [ ... ]

Well, I am not at the level of those who can develop such an
advanced understanding of storage systems that includes 19+1
arrays and RAID5s that can lose 2 drives and be "again OK".

But I'd look at how many controllers and drives are actually
still working, by doing a read test of all the drives (see the
sketch below). Once that is known, perhaps some recovery strategy
can be discerned. If several drives have failed, it might take
several weeks or months or even years to do a partial recovery.
If things are really very lucky, only some of the 6 host adapters
have failed, or only one drive per array, and replacing the host
adapters will get things working again, except that running
'xfs_repair' on an incomplete array will have made things even
better, more challenging and advanced.
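
As a sketch of such a read test (untested, and assuming the 48
drives appear as /dev/sda ... /dev/sdav or similar; adjust the
globs to whatever the box actually shows):

  # Read every whole-disk device end to end; dd exits non-zero on a read error.
  for d in /dev/sd? /dev/sd??; do
      if dd if="$d" of=/dev/null bs=1M >/dev/null 2>&1; then
          echo "$d: reads cleanly"
      else
          echo "$d: READ ERRORS"
      fi
  done

If smartmontools is installed, 'smartctl -H' on each drive is a
much quicker, if shallower, first pass.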
