
Re: total/partial fs corruption

To: Nigel Kukard <nkukard@xxxxxxxx>, linux-xfs@xxxxxxxxxxx
Subject: Re: total/partial fs corruption
From: Seth Mos <knuffie@xxxxxxxxx>
Date: Sun, 14 Oct 2001 01:46:13 +0200
In-reply-to: <Pine.LNX.4.21.0110140124420.1797-100000@xxxxxxxxxxxxx>
Sender: owner-linux-xfs@xxxxxxxxxxx
At 01:37 14-10-2001 +0200, Nigel Kukard wrote:
Hi,

ok, firstly i'm NOT joking!

i'm the designer of our company's linux distribution, based on vanilla source
of just over 220 packages. we strongly support XFS, but one thing scares me:
we're running vanilla 2.4.12 (and 2.4.10) with the latest XFS patches along with
the latest xfsprogs...etc.

Sounds good.

i installed our distro for the 10th time or so yesterday and rebooted a few
times. after about 3 reboots the entire / partition was blank... so i thought
ok, i must have done something wrong... so i re-installed & rebooted a few times
over, SAME thing!! i then tried on another server, with a different hard drive
and totally different hardware. rebooted about 12 times, did a halt (got a few
990 errors?), and the hard drive was BLANK! (by blank i mean when i boot with a
rescue cd and mount it there are no files, no dirs, nothing... blank). sometimes
after i find it's blank i reboot again & some files are on it, sometimes not. if
i run xfs_check or xfs_repair it seems to find a lot of errors. but what really
gets me is that next to NO changes are made during a reboot & all FS's are
unmounted.

Error 990 means that XFS detected corruption. If it happens a lot, something is horribly wrong. What compiler did you use? (Use egcs-1.1.2 == 2.91.66 for production systems.)
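
The compiler question above can be checked on the box itself. This is a minimal sketch, assuming a Linux system where /proc/version carries the compiler banner. The helper name built_with_egcs is made up for illustration, and the string it looks for is the one egcs-1.1.2 is expected to report.

```shell
# Hypothetical helper: succeed if a /proc/version style banner names
# the egcs-2.91.66 compiler (the version string of egcs-1.1.2).
built_with_egcs() {
    case "$1" in
        *egcs-2.91.66*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a live system, compare the running kernel's banner against it.
if [ -r /proc/version ]; then
    banner=$(cat /proc/version)
    if built_with_egcs "$banner"; then
        echo "kernel was built with egcs-2.91.66"
    else
        echo "kernel was built with a different compiler:"
        echo "$banner"
    fi
fi
```

Running gcc --version in the build environment shows which compiler the next kernel build would use.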

i thought i'd fixed the problem when i compiled 2.4.12 (from 2.4.10), but i
enabled quota support & rebooted... BLANK! as i said before i have been getting
error 990's and once an in-memory data corruption. i'll have you know the
ram is 100% ok, even tried different cpu's, ram modules, motherboards, hdd's,
everything. this error is 100% reproducible. how you guys can reproduce it
i'm not entirely sure. i could understand it if i just turned off the pc while
it was working, but this is doing a proper reboot. :(

If it detects fs corruption, the fs is disabled to prevent further corruption. This is why you don't see your files. It is there to protect you from making the mess larger than it is.

i just ran xfs_check on it now, clean after i rebooted to find the hdd blank and
i get the following... hda1 = /boot, hda2 = 128MB swap, hda3 = /


[root@localhost root]# xfs_check /dev/hda3
bad directory data magic # 0x44510101 for dir ino 128 block 0
no . entry for direcotry 128
no .. entry for directory 128
block 0/220 expected type unknown got dir
block 0/220 claimed by inode 131, previous inum 128
link count mismatch for inode 128 (name ?), nlink 15, counted 13
disconnected inode 132, nlink 1
link count mismatch for inode 4194432 (name ?), nlink 2, counted 1
link count mismatch for inode 4212876 (name ?), nlink 4, counted 3
.
.
.


i would very very greatly appreciate any input as i have some very very important
servers running XFS on 1TB+ raid arrays, and i'm very scared for them!

You should not need to be, since a lot of people are running XFS on production systems and are not seeing these problems. They do see the occasional 990 with some less tested kernel releases, but that's it.

by the way, i'm running an SMP kernel on the test box which has one cpu, all the
high end servers we have running XFS are dual+ cpu. could this maybe be it?

Yes. Athlon systems react very badly to this, and it amounts to suicide.
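
To confirm whether a box is in that situation, something like the following might help. It is a rough sketch, assuming the kernel version banner from uname -v contains "SMP" on SMP builds and that /proc/cpuinfo lists one "processor" line per CPU. The helper name kernel_is_smp is made up for illustration.

```shell
# Hypothetical helper: succeed if a kernel version banner mentions SMP.
kernel_is_smp() {
    case "$1" in
        *SMP*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a live system, flag an SMP kernel that sees exactly one CPU.
if [ -r /proc/cpuinfo ]; then
    cpus=$(grep -c '^processor' /proc/cpuinfo || true)
    if kernel_is_smp "$(uname -v)" && [ "$cpus" -eq 1 ]; then
        echo "SMP kernel running on a single CPU"
    fi
fi
```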

Cheers

--
Seth
Every program has two purposes one for which
it was written and another for which it wasn't
I use the last kind.

