
Re: total/partial fs corruption

To: Seth Mos <knuffie@xxxxxxxxx>
Subject: Re: total/partial fs corruption
From: Nigel Kukard <nkukard@xxxxxxxx>
Date: Sun, 14 Oct 2001 01:52:43 +0200 (SAST)
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <4.3.2.7.2.20011014014116.0304b538@pop.xs4all.nl>
Sender: owner-linux-xfs@xxxxxxxxxxx

wow, i'm impressed by the response time!!! faster than the alarm company! hehe

On Sun, 14 Oct 2001, Seth Mos wrote:

> At 01:37 14-10-2001 +0200, Nigel Kukard wrote:
> >Hi,
> >
> >ok, firstly i'm NOT joking!
> >
> >i'm the designer of our company's linux distribution, based on vanilla source
> >of just over 220 packages. we strongly support XFS, but one thing scares me,
> >we're running vanilla 2.4.12 (and 2.4.10) with the latest XFS patches along
> >with the latest xfsprogs...etc.
> 
> Sounds good.
> 

yeah, that's what scares me

> >i installed our distro for the 10th time or so yesterday and rebooted a few
> >times. after about 3 reboots the entire / partition was blank... so i thought
> >ok, i must have done something wrong... so i re-installed & rebooted a few
> >times over, SAME thing!! i then tried on another server, used a different
> >hard drive and totally different hardware. rebooted about 12 times, did a
> >halt (got a few 990 errors?), and the hard drive was BLANK! (by blank i mean
> >when i boot with a rescue cd and mount it there are no files, no dirs,
> >nothing... blank). sometimes after i find it's blank i reboot again & some
> >files are on it, sometimes not. if i run xfs_check or xfs_repair it seems to
> >find a lot of errors. but what really gets me is that next to NO changes are
> >made during a reboot & all FS's are unmounted.
> 
> error 990 means that it detected corruption. Something is horribly wrong in 
> this case if it happens a lot. What compiler did you use? (Use egcs-1.1.2 
> == 2.91.66 for production systems.)

[nkukard@devel source]$ gcc -v
Reading specs from /usr/lib/gcc-lib/i586-pc-linux/2.96/specs
gcc version 2.96 20000731 (IDMS Linux 2.96-5)


that is basically the same "strain" of gcc that redhat use, as i pulled it out
of their srpm a few months ago.

could it really be this that is the problem?
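
to compare compilers across several boxes, the `gcc -v` banner can be parsed
and checked against the egcs release suggested above. a minimal python sketch;
the function names are made up for illustration, and the egcs banner format it
accepts ("gcc version egcs-2.91.66 ...") is an assumption from memory:

```python
import re

# Assumption: 2.91.66 (egcs-1.1.2) is the compiler suggested in this thread
# for production systems; change this if the advice changes.
RECOMMENDED = (2, 91, 66)


def parse_gcc_version(text):
    """Return the version from `gcc -v` style output as a tuple, or None."""
    # The optional "egcs-" prefix is an assumption about how egcs reports itself.
    match = re.search(r"gcc version (?:egcs-)?(\d+(?:\.\d+)*)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_recommended(text):
    """True only if the banner reports exactly the recommended version."""
    return parse_gcc_version(text) == RECOMMENDED


banner = "gcc version 2.96 20000731 (IDMS Linux 2.96-5)"
print(parse_gcc_version(banner), is_recommended(banner))
```

this only tells you which compiler is installed now, not which one built the
running kernel.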

> 
> >i thought i'd fixed the problem when i compiled 2.4.12 (from 2.4.10), but i
> >enabled quota support & rebooted... BLANK! as i said before i have been
> >getting error 990's and once an in-memory data corruption. i'll have you know
> >the ram is 100% ok, i even tried different cpu's, ram modules, motherboards,
> >hdd's, everything. this error is 100% reproducible. how you guys can
> >reproduce it i'm not entirely sure. i could understand it if i just turned
> >off the pc while it was working, but this is doing a proper reboot. :(
> 
> If it detects fs corruption the fs is disabled to prevent further 
> corruption. This is why you don't see it. It is there to protect you from 
> making the mess larger than it is.
>

aha, anything i can do to help find the source of the corruption? i'm not
going to touch the test box in case you guys want me to try anything out.
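
one thing that might help narrow it down is tallying which inodes the
xfs_check output (quoted further down) complains about most. a rough python
sketch; the function name and the regex are my own guess at the line formats
seen in this thread, not anything from xfsprogs:

```python
import re
from collections import Counter

# A few lines copied from the xfs_check run quoted in this thread.
SAMPLE = """\
bad directory data magic # 0x44510101 for dir ino 128 block 0
no .. entry for directory 128
block 0/220 claimed by inode 131, previous inum 128
link count mismatch for inode 128 (name ?), nlink 15, counted 13
disconnected inode 132, nlink 1
"""


def tally_by_inode(report):
    """Count how many times each inode number is mentioned in the report."""
    counts = Counter()
    for line in report.splitlines():
        # Assumption: inode numbers follow "inode", "ino", "inum" or "directory".
        for number in re.findall(r"\b(?:inode|ino|inum|directory)\s+(\d+)", line):
            counts[int(number)] += 1
    return counts


for inode, hits in tally_by_inode(SAMPLE).most_common():
    print(inode, hits)
```

on the sample lines inode 128 (the root dir here) comes out on top, which fits
with / showing up blank.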

 
> >i just ran xfs_check on it now, clean after i rebooted to find the hdd blank
> >and i get the following...  hda1 = /boot, hda2 = 128Mb swap, hda3 = /
> >
> >
> >[root@localhost root]# xfs_check /dev/hda3
> >bad directory data magic # 0x44510101 for dir ino 128 block 0
> >no . entry for direcotry 128
> >no .. entry for directory 128
> >block 0/220 expected type unknown got dir
> >block 0/220 claimed by inode 131, previous inum 128
> >link count mismatch for inode 128 (name ?), nlink 15, counted 13
> >disconnected inode 132, nlink 1
> >link count mismatch for inode 4194432 (name ?), nlink 2, counted 1
> >link count mismatch for inode 4212876 (name ?), nlink 4, counted 3
> >.
> >.
> >.
> >
> >
> >i would very very greatly appreciate any input as i have some very very
> >important servers running XFS on 1TB+ raid arrays, and i'm very scared for
> >them!
> 
> You should not need to be since a lot of people are running XFS on 
> production systems and are not seeing these problems. They do see the 
> occasional 990 with some less tested kernel releases but that's that.
> 

ok, so 990 not THAT bad.... 99% chance it will happen here before i get
a blank / though

> >by the way, i'm running an SMP kernel on the test box which has one cpu, all
> >the high end servers we have running XFS are dual+ cpu. could this maybe be
> >it?
> 
> Yes, Athlon systems react very badly to this; it equals suicide.
> 

i wouldn't touch athlon even if it was free! our systems are intel based,
celerons & pIII's  :)

> Cheers
> 
> --
> Seth
> Every program has two purposes one for which
> it was written and another for which it wasn't
> I use the last kind.
> 

-- 


================================================================================

Contact Details
---------------
Name: Nigel Kukard
GSM Mobile: (+27) 082 564 2120
GSM Fax: (+27) 082 131 564 2120
Email: nkukard@xxxxxxxxxxxxxxxx

Organizations
-------------
 - LinuxRulz
     Url: http://www.linuxrulz.za.net
     Position: Owner
 - Linux Based Systems Design
     Url: http://www.lbsd.net
     Position: Systems Designer, Programmer
 - Lando Technologies
     Url: http://www.lando.co.za
     Position: Linux Systems/Network Administrator

