
Re: HELP!!!

To: Knuth Posern <posern@xxxxxxxxxxxxxxxxxxxxxxxxx>, <linux-xfs@xxxxxxxxxxx>
Subject: Re: HELP!!!
From: Seth Mos <knuffie@xxxxxxxxx>
Date: Tue, 16 Oct 2001 11:05:21 +0200
Cc: <posern@xxxxxxxxxxxxxxxxxxxxxxxxxx>
In-reply-to: <Pine.LNX.4.33.0110160151240.4568-100000@lagos>
Sender: owner-linux-xfs@xxxxxxxxxxx
At 02:47 16-10-2001 +0200, Knuth Posern wrote:
Hi.

I have (or had?!) a software RAID-5 with the following /etc/raidtab

Short answer: You had.

I built the RAID half a year ago and formatted it with XFS (under a 2.4.5
kernel). At the moment I use 2.4.10-xfs.

The machine is a debian-unstable linux-server.

The following happened to me:

While I was playing an mp3 file on a console I got the following kernel
message(s) dumped onto the console:
___________________________________________________________________________
hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=56410433,
sector=56410368
end_request: I/O error, dev 21:01 (hde), sector 56410368
raid5: Disk failure on hde1, disabling device. Operation continuing on 2
devices
md: recovery thread got woken up ...
md0: no spare disk to reconstruct array! -- continuing in degraded mode
md: recovery thread finished ...
md: updating md0 RAID superblock on device
md: hdi1 [events: 000000de](write) hdi1's sb offset: 45034816
md: hdg1 [events: 000000de](write) hdg1's sb offset: 45034816
md: (skipping faulty hde1 )
XFS: device 0x900- XFS write error in file system meta-data block 0x40 in
md(9,0)
XFS: device 0x900- XFS write error in file system meta-data block 0x40 in
md(9,0)
XFS: device 0x900- XFS write error in file system meta-data block 0x40 in
md(9,0)
XFS: device 0x900- XFS write error in file system meta-data block 0x40 in
md(9,0)
XFS: device 0x900- XFS write error in file system meta-data block 0x40 in
md(9,0)

This should not happen with an md RAID-5; it means corruption. Normally, when a disk fails in an md RAID-1/5 set, the OS is unaffected, XFS or not.
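
For reference, a degraded but still-running array is easy to spot. A minimal check (a sketch, assuming the array is /dev/md0):

  cat /proc/mdstat
  # a raid5 set that lost one member should still show md0 as active,
  # typically with the failed member marked (F) and one device missing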

This is the first alarming sign.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I then switched to runlevel 0:
____________________________________________________________________________
Give root password for maintenance
(or type Control-D for normal startup):
jolie:~# umount /raid

My intuition says you should have had errors in your log before unmounting.
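
A quick way to check would have been something like this (a sketch; /var/log/syslog is the usual Debian location, adjust if yours differs):

  dmesg | grep -i -e raid -e xfs | tail -20
  grep -i -e raid -e xfs /var/log/syslog | tail -20
  # earlier I/O or raid errors here would explain what state the
  # filesystem was in before the umount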

xfs_unmount: xfs_ibusy says error/16
XFS unmount got error 16
linvfs_put_super: vfsp/0xdf467520 left dangling!
VFS: Busy inodes after unmount. Self-destruct in 5 seconds.  Have a nice
day...
jolie:~# mount
/dev/hda3 on / type ext2 (rw,errors=remount-ro,errors=remount-ro)
proc on /proc type proc (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/md0 on /mnt/raid type xfs (rw)
jolie:~# lsof
...
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The lsof did NOT show any open files on /mnt/raid.
So I tried again to unmount /mnt/raid:
_____________________________________________________________________________
jolie:~#
jolie:~# umount /mnt/raid
umount: /mnt/raid: not mounted
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
But now it was unmounted already?!

It was unmounted, with errors that is. The message you got about inodes left dangling was the result of an unclean unmount.

So I tried to mount it...

Bad idea; I would have run xfs_repair first.
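
Something along these lines (a sketch, assuming the array is /dev/md0 and the filesystem is not mounted):

  xfs_repair -n /dev/md0   # -n: no-modify mode, only report what it finds
  xfs_repair /dev/md0      # only run the real repair if the report looks sane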

jolie:~#
jolie:~# mount /mnt/raid
XFS: SB read failed
I/O error in filesystem ("md(9,0)") meta-data dev 0x900 block 0x0
       ("xfs_readsb") error 5 buf count 512
mount: wrong fs type, bad option, bad superblock on /dev/md0,
       or too many mounted file systems
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I rebooted the computer - and got the following during bootup:

<snip>

XFS: SB read failed
I/O error in filesystem ("md(9,0)") meta-data dev 0x900 block 0x0
       ("xfs_readsb") error 5 buf count 512
mount: wrong fs type, bad option, bad superblock on /dev/md0,
       or too many mounted file systems
       (could this be the IDE device where you in fact use
       ide-scsi so that sr0 or sda or so is needed?)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This means it at least needs repair. Sometimes xfs_repair can recover a secondary superblock.
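
You can also peek at the primary superblock yourself before running xfs_repair; a healthy XFS superblock starts with the magic string "XFSB" at offset 0 (a sketch, again assuming the array is /dev/md0):

  dd if=/dev/md0 bs=512 count=1 2>/dev/null | od -c | head
  # if "X F S B" does not appear at the start, the primary superblock is
  # gone and xfs_repair will have to hunt for a secondary copy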

I logged in and I edited the /etc/raidtab to have a SPARE-DISC:

Don't do that.

I connected an identical harddrive (like the other raid-harddiscs) as
/dev/hdc.

Course of action on a failed disk (see the sketch below):
Power off the box.
Remove the failed disk from hde.
Insert the new disk as hde.
Power on the box.
raidhotadd the new partition back into the array.
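
The whole sequence, roughly (a sketch, assuming the failed member was /dev/hde1 and the array is /dev/md0):

  fdisk /dev/hde                  # recreate the single partition, type fd (Linux raid autodetect)
  raidhotadd /dev/md0 /dev/hde1   # add the new partition back into the degraded array
  cat /proc/mdstat                # the reconstruction progress shows up here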

And rebooted again - without any changes.

I then read in the Software-RAID-Howto (from January 2000) to just remove
the faulty drive and instead connect a new drive.

Correct.

So I connected the hard disk from /dev/hdc as /dev/hde (and edited
/etc/raidtab back to what it was before, without spare disks!).

Don't touch the raidtab file when something goes wrong. Unless you really know what you are doing, it will make things worse than they were.

And rebooted.

md0 didn't start the array - because /dev/hde is 0K big (or something like
that).
That was because I had forgotten to build a partition on /dev/hde - so I
built the one partition (as on the other RAID drives too).

That is not fatal, it happened to me once as well.

And rebooted again - but md0 had a "Failed autostart of /dev/md0" again.

That is normal. It does not rebuild fully automatically; you have to instruct it yourself.

And the Software-RAID-Howto told me to "raidhotadd /dev/md0 /dev/hde1".

Correct.

Which I tried, but it said something like: "/dev/md0 - no such raid is
running".

What did /proc/mdstat say?

So I tried to get /dev/md0 RUNNING again.

In section 6.1 of the Software-RAID-Howto there was something about "mkraid
/dev/md0 --force".

DON'T DO THIS UNLESS YOU ARE BUILDING THE ARRAY!

So I tried:

<snip>

And tried again to hotadd the /dev/hde1:

You just remade the md0 array, which means the disks will be syncing.

___________________________________________________________________________
jolie:~# raidhotadd /dev/md0 /dev/hde1
md: trying to hot-add hde1 to md0 ...
/dev/md0: can not hot-add disk: disk busy!
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Then I checked /proc/mdstat.
It said something about reconstructing - which hopefully sounds good...?!

It means your data is gone.

But:
___________________________________________________________________________
jolie:~# mount /mnt/raid
XFS: bad magic number
XFS: SB validate failed
mount: wrong fs type, bad option, bad superblock on /dev/md0,
       or too many mounted file systems
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
So I just rebooted again - in the hope that the RAID autostart during boot
time would bring some new/other results. But a mount /mnt/raid still gives
the same results!!!???

It just synced random parts of the other disks and constructed parity out of that.

What can I do? Is my data lost? If so: is there ANY CHANCE to get at
least SOME of it BACK SOMEHOW (it doesn't matter how difficult)!?

No. :-(

???

Help would be VERY, VERY, VERY appreciated!!!

I am very afraid that I cannot help you anymore.

You can try xfs_repair and see if it turns anything up or repairs anything at all, but I don't have high hopes.

Cheers
--
Seth
Every program has two purposes: one for which
it was written and another for which it wasn't.
I use the last kind.

