hi,
This thread is all over the show - it's hard to find somewhere
to start trying to help, after all the suggestions and various
experiments. Anyway, here goes...
Some general notes - following from the large amount of success
others have with XFS and the lack of any corruption showing up
on developers' boxes, you want to look very closely at "outside"
influences, esp. the compiler version used to build the kernel
and your hardware... as a starting point. You also want to do
lots of cross checking to ensure the running kernel is the one
you think it is and is built with the compiler you think it is.
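A rough sketch of that cross-check: /proc/version records the compiler
that built the running kernel, so it can be compared with what gcc
reports now. The helper name built_with is made up for illustration,
and the sed pattern assumes the "gcc version X.Y" wording that kernels
of this vintage print; other kernels may word it differently.

```shell
# Hypothetical helper: pull the compiler version out of a /proc/version
# style string read from stdin. Assumes the "gcc version X.Y" wording.
built_with() {
    sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p'
}

# Release of the kernel that is actually running
uname -r

# Compiler the running kernel was built with (prints nothing if the
# wording in /proc/version does not match the assumption above)
if [ -r /proc/version ]; then
    built_with < /proc/version
fi

# Compiler a fresh build would pick up from PATH, for comparison
if command -v gcc >/dev/null 2>&1; then
    gcc -dumpversion
fi
```

If the last two lines of output disagree, the running kernel was not
built with the compiler you are looking at now.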
On Sun, Oct 14, 2001 at 01:37:19AM +0200, Nigel Kukard wrote:
> Hi,
>
> i installed our distro for the 10th time or summin yesterday and rebooted a
> few
> times, after about 3 reboots the entire / partition was blank... so i thought
> ok, i must of done summin wrong...
At this first point of corruption, you need to stop and take stock.
Trying to mount this again when it is known to be corrupt is a bad plan.
There would have been a console message from XFS to the effect of
"Structure XYZ is corrupt. Unmount and run xfs_repair.", and XFS
would have gone into forced shutdown mode for that filesystem.
I think the unmount will still write out the unmount record, so
the log doesn't need to be replayed (I haven't looked at that code
for a while though, so I'm not 100% sure there), and you must repair
before attempting to mount again.
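As a sketch of "repair before mounting", something like the following
could be run from a rescue environment. The function name and its
DEVICE argument are illustrative only; the -n option is xfs_repair's
no-modify mode, which reports problems without writing anything.

```shell
# Sketch of a repair-before-mount sequence, meant for a rescue environment.
# The function name and its argument are illustrative, not an existing tool.
repair_before_mount() {
    dev="$1"
    if [ -z "$dev" ]; then
        echo "usage: repair_before_mount DEVICE" >&2
        return 2
    fi
    # Refuse to touch a filesystem that is currently mounted
    if grep -q "^$dev " /proc/mounts 2>/dev/null; then
        echo "$dev is mounted; unmount it first" >&2
        return 1
    fi
    # Dry run first: -n reports problems without modifying anything
    xfs_repair -n "$dev"
    # Then the real repair
    xfs_repair "$dev"
}
```

For the layout quoted below that would be "repair_before_mount /dev/hda3",
and only after it completes should the filesystem be mounted again.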
> so i re-installed & rebooted a few times over, SAME thing!!
> i then tried on another server, used a diff harddrive and totally
> diff hardware. rebooted about 12 times, did a halt (got a few 990 errors?),
> and
Again, you would have had console/syslog messages, and XFS would have gone
into forced shutdown for that filesystem. The (extremely useful!)
"990" message really means "Filesystem is corrupted".
> the harddrive was BLANK! (by blank i mean when i boot with a rescue cd and
> mount
> it there is no files, no dirs nothing... blank). sum times after i find its
> blank i reboot again & some files are on it, sometimes not. if i run xfs_check
> or xfs_repair it seems to find alot of errors.
> but what really gets me is next
> to NO changes are made during a reboot & all FS's are unmounted.
Hmm - small changes are made during reboot - an unmount log record
is written for one, and the shutdown scripts themselves are possibly
writing to / as you go through the shutdown phase too (creating or
removing temporary files, creating/removing lock files, etc, etc).
So, there's an opportunity for corruption at least.
> i thought i'd fixed the problem when i compiled 2.4.12 (from 2.4.10), but i
> enabled quota support & rebooted... BLANK! as i said before i have been
> getting
> error 990's and once an in-memory data corruption. i must have you know the
These will be accompanied by a console message, and Eric recently
added additional diagnostics in here - I'm sure he'd be interested
in seeing the exact error messages (cut&pasted) that you see.
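One possible way to grab those messages verbatim instead of retyping
them is to filter the kernel ring buffer. The xfs_messages name and the
two search words are my own choice here, not anything XFS defines, so
widen the pattern if it misses something.

```shell
# Hypothetical filter: keep only lines mentioning XFS or corruption
xfs_messages() {
    grep -i -e 'xfs' -e 'corrupt'
}

# Kernel ring buffer, filtered; the fallback message covers the empty case
dmesg 2>/dev/null | xfs_messages || echo "no XFS messages in the ring buffer"
```

Redirect the output to a file and paste that into the mail unedited.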
> ram is 100% ok, even tried different cpu's, ram modules, motherboards, hdd's
> everything. this error is 100% reproducable. how you guys can reproduce it
> i'm not entirely sure. i could understand it if i just turned off the pc while
> it was working, but this is doing a proper reboot. :(
>
> i just ran xfs_check on it now, clean after i rebooted to find the hdd blank
> and
> i get the following... hda1 = /boot, hda2 = 128Mb swap, hda3 = /
>
>
> [root@localhost root]# xfs_check /dev/hda3
> bad directory data magic # 0x44510101 for dir ino 128 block 0
> no . entry for direcotry 128
> no .. entry for directory 128
[Should cut and paste the output - fewer typos that way. ;-)]
The root inode is corrupt - this is certainly why you don't see any
files after the next mount. xfs_repair should be able to fix this, but
it sounds like your kernel will just do it again, so fixing the kernel
is the major issue here.
A few additional notes that might help here...
For each of your experiments with new gcc versions, did you update
the top level kernel Makefile every time? (seems obvious, but gotta
ask - ensure you're definitely building with the gcc version you
expect). Double-check lilo.conf entries too.
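A quick way to eyeball both, as a sketch only: the makefile_cc helper
and the /usr/src/linux path are assumptions on my part, so point it at
wherever your tree actually lives.

```shell
# Hypothetical helper: print the compiler assignments from a kernel Makefile
makefile_cc() {
    grep -E '^(HOSTCC|CC)[[:space:]]*[:+]?=' "$1"
}

# Assumed source tree location; adjust to wherever you build
src=/usr/src/linux
if [ -r "$src/Makefile" ]; then
    makefile_cc "$src/Makefile"
fi

# Boot entries lilo knows about. After editing lilo.conf, /sbin/lilo has to
# be rerun, or the old kernel image is still the one that gets booted.
if [ -r /etc/lilo.conf ]; then
    grep -E '^[[:space:]]*(default|image|label)[[:space:]]*=' /etc/lilo.conf
fi
```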
As to a known stable release of XFS - use the 1.0.1 installer if
you can. It has binary kernels built and extensively tested with
known good compiler versions. At 2.4.5(?) it's a bit older now,
but is a known-sane starting point, and removes the subtle issues
involved with compiling your own kernel.
For notes on enabling quota, read README.quota in xfsprogs also.
The man pages you referenced are less extensive. As another data
point, I have run XFS quota on my / and /build for many months now,
without any corruption (and this is a development box, so it crashes
in bizarre and wonderful ways, spends lots of time in kdb, and does
lots of XFS recovery), so quota is more likely to be a symptom than
the actual problem.
Hope this helps.
cheers.
--
Nathan