At 01:37 14-10-2001 +0200, Nigel Kukard wrote:
Hi,
ok, firstly i'm NOT joking!
i'm the designer of our company's linux distribution, based on vanilla source
of just over 220 packages. we strongly support XFS, but one thing scares me:
we're running vanilla 2.4.12 (and 2.4.10) with the latest XFS patches along
with the latest xfsprogs, etc.
Sounds good.
i installed our distro for the 10th time or something yesterday and rebooted a
few times. after about 3 reboots the entire / partition was blank... so i
thought ok, i must have done something wrong... so i re-installed & rebooted a
few times over, SAME thing!! i then tried on another server, used a different
hard drive and totally different hardware. rebooted about 12 times, did a halt
(got a few 990 errors?), and the hard drive was BLANK! (by blank i mean when i
boot with a rescue cd and mount it there are no files, no dirs, nothing...
blank). sometimes after i find it's blank i reboot again & some files are on
it, sometimes not. if i run xfs_check or xfs_repair it seems to find a lot of
errors. but what really gets me is that next to NO changes are made during a
reboot & all FS's are unmounted.
Error 990 means that it detected corruption. Something is horribly wrong in
this case if it happens a lot. What compiler did you use? (Use egcs-1.1.2
== 2.91.66 for production systems.)
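One way to answer the compiler question is to read /proc/version, which records
the compiler that built the running kernel. The following is a minimal Python
sketch, assuming the 2.4-era layout in which that line contains the text
"gcc version"; the function names kernel_compiler and read_kernel_compiler are
hypothetical, and kernels that format the line differently simply yield None.

```python
import re
from typing import Optional


def kernel_compiler(version_line: str) -> Optional[str]:
    """Return the token that follows 'gcc version' in a /proc/version line.

    Assumption: the line looks like
    'Linux version X (user@host) (gcc version <token> ...) #1 ...'.
    Returns None when no such text is present.
    """
    match = re.search(r"gcc version ([^\s)]+)", version_line)
    return match.group(1) if match else None


def read_kernel_compiler(path: str = "/proc/version") -> Optional[str]:
    """Read the given file and extract the compiler token from it."""
    with open(path) as handle:
        return kernel_compiler(handle.read())
```

Calling read_kernel_compiler() on the affected box and comparing the result
with the recommended egcs release is enough to tell whether the kernel was
built with a different compiler.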
i thought i'd fixed the problem when i compiled 2.4.12 (from 2.4.10), but i
enabled quota support & rebooted... BLANK! as i said before i have been getting
error 990's and once an in-memory data corruption. i'll have you know the
ram is 100% ok, even tried different cpu's, ram modules, motherboards, hdd's
everything. this error is 100% reproducible. how you guys can reproduce it
i'm not entirely sure. i could understand it if i just turned off the pc while
it was working, but this is doing a proper reboot. :(
If it detects fs corruption the fs is disabled to prevent further
corruption. This is why you don't see it. It is there to protect you from
making the mess larger than it is.
i just ran xfs_check on it now, clean after i rebooted to find the hdd blank
and i get the following... hda1 = /boot, hda2 = 128Mb swap, hda3 = /
[root@localhost root]# xfs_check /dev/hda3
bad directory data magic # 0x44510101 for dir ino 128 block 0
no . entry for direcotry 128
no .. entry for directory 128
block 0/220 expected type unknown got dir
block 0/220 claimed by inode 131, previous inum 128
link count mismatch for inode 128 (name ?), nlink 15, counted 13
disconnected inode 132, nlink 1
link count mismatch for inode 4194432 (name ?), nlink 2, counted 1
link count mismatch for inode 4212876 (name ?), nlink 4, counted 3
.
.
.
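A long xfs_check report like the one above is easier to compare between
reboots when the messages are tallied by kind. The following is a rough Python
sketch, assuming only the message shapes quoted above; the labels and the
summarize function are hypothetical, and real xfs_check output contains other
message kinds, which end up under "other".

```python
import re
from collections import Counter

# Patterns cover only the messages quoted in this thread (an assumption,
# not a full list of what xfs_check can print).
PATTERNS = [
    ("bad directory magic", re.compile(r"^bad directory data magic")),
    ("missing dot entry", re.compile(r"^no \.{1,2} entry for")),
    ("block conflict", re.compile(r"^block \d+/\d+ (expected type|claimed by)")),
    ("link count mismatch", re.compile(r"^link count mismatch for inode \d+")),
    ("disconnected inode", re.compile(r"^disconnected inode \d+")),
]


def summarize(lines):
    """Count xfs_check messages per label; unmatched non-empty lines are 'other'."""
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        for label, pattern in PATTERNS:
            if pattern.search(line):
                counts[label] += 1
                break
        else:
            counts["other"] += 1
    return counts
```

Feeding it the saved output of each run, one line per list element, gives a
quick view of whether the same kinds of damage come back after every reboot.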
i would very very greatly appreciate any input as i have some very very
important servers running XFS on 1Tb+ raid arrays, and i'm very scared for
them!
You should not need to be since a lot of people are running XFS on
production systems and are not seeing these problems. They do see the
occasional 990 with some less tested kernel releases but that's that.
by the way, i'm running an SMP kernel on the test box which has one cpu, all
the high end servers we have running XFS are dual+ cpu. could this maybe be it?
Yes, Athlon systems react very badly to this; it equals suicide.
Cheers
--
Seth
Every program has two purposes one for which
it was written and another for which it wasn't
I use the last kind.