Stephen Lord <lord@xxxxxxx> writes:
> Karl M. Hegbloom wrote:
>
> >Several times I've had my SMP machine with EVMS and XFS filesystems lock
> >up and need to be reset. After the reboot, sometimes filesystems won't
> >mount (it tends to be the "/var" partition) and I must boot to single
> >user mode.
> >

Note that the problem is a lockup that I suspect is SMP + EVMS +
XFS(?) related. It only seems to happen when I am compiling software,
that is, when there is a large amount of file system activity, so I
suspect there's a race condition someplace that ends in a deadlock.

When it happens, the "gkrellm" meter keeps running, and I can see
there is some CPU activity. It seems like the system is running
smoothly, quiescently, and all is fairly normal. The mouse pointer
moves around, and I can still type in my editor. However, I cannot
save the file I'm working on -- the editor blocks then. I cannot
start another command in a terminal -- it blocks and never runs. I
can use Alt-SysReq-F1 to snick down to a Linux VT, but cannot log in
-- it blocks. The gnome desk guide that I keep in a panel drawer will
display, but when I click on it, nothing seems to happen; it's blocked
too.

Tell me if I'm wrong, but it sounds like any file system activity is
blocking someplace. That's why I suspect that there is a deadlock.

I am running a Debian 2.4.18 kernel, with the EVMS and XFS patches
supplied by Debian. I've upgraded those kernel-patch patches a few
times, and the last build failed entirely due to the unaligned access
bug. I don't have the version information for the patches in the
currently running kernel, which runs fine unless I try to build
software (which makes it difficult to rebuild the kernel).

Luckily it has never locked up like that during a Debian upgrade.

> >When I attempt to mount the filesystem to cause the log replay, it
> >refuses to do so, giving an error about an "invalid client id". I must
> >then run "xfs_repair -L" on it to get the machine back online.
> >
> >What does that error mean, and is there a way to make it mount anyway
> >and replay the log?
> >
> >How can I debug system lockups like that, so that I can give a
> >meaningful and useful bug report? Will someone please work with me in
> >private mail (or direct me to the relevant mailing list or
> >documentation) and teach me how to debug the Linux kernel, to find
> >lockups and such?
> >

... and that I want to learn how I can DEBUG it -- can I set my system
up so that if it happens, a debugger is run or at least I can cause a
core dump?

What do you folks do, as developers, to find out what's happening in
cases like this? Will you please help me learn how to be of more
assistance? I care about this more than about whether the darn thing
locks up on me once in a while.

> This sounds like a corrupted log, the question is, who is doing the
> corruption here. Bad clientid means there was something in the log
> which was not recognized. Since EVMS and XFS have not had a lot of
> exposure with each other, I would suspect EVMS is not taking well to
> the XFS log writes, they are variable in size, between 512 bytes and
> 32K, and they can start on any 512 byte boundary. Not much else in
> Linux does I/O like this, possibly EVMS is dropping part of the I/O.
>
> I would raise this on the evms mailing list as well as the xfs one,
> we can then work out between us what is going on.
>
> Next time it happens, try xfs_logprint -t /dev/xxx before you mount
> the filesystem. It might fail in a similar manner to the kernel
> though.
>
> Steve
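
To make the point about the log I/O pattern concrete, here is a minimal
Python sketch. It only models the write pattern described above (sizes
between 512 bytes and 32K, starting on any 512-byte boundary) and counts
how many such writes would not line up with a 4K block. The 4K block
size, the assumption that sizes are multiples of 512, the 1 MB log
region, and all of the names are illustrative assumptions; none of this
is taken from the XFS or EVMS code.

```python
import random

SECTOR = 512                 # log writes may start on any 512-byte boundary
MAX_LOG_WRITE = 32 * 1024    # and are between 512 bytes and 32K long
BLOCK = 4096                 # ASSUMPTION: a layer that expects 4K-aligned I/O
LOG_REGION = 1 << 20         # ASSUMPTION: hypothetical 1 MB log area


def random_log_write(rng):
    """Return (offset, length) in bytes for one hypothetical log write.

    ASSUMPTION: lengths are whole multiples of 512 bytes.
    """
    offset = rng.randrange(0, LOG_REGION, SECTOR)
    length = rng.randrange(SECTOR, MAX_LOG_WRITE + SECTOR, SECTOR)
    return offset, length


def is_block_aligned(offset, length, block=BLOCK):
    """True if the write starts on a block boundary and covers whole blocks."""
    return offset % block == 0 and length % block == 0


def unaligned_fraction(n=10000, seed=0):
    """Fraction of n random log-style writes that are not block aligned."""
    rng = random.Random(seed)
    writes = [random_log_write(rng) for _ in range(n)]
    bad = sum(1 for off, ln in writes if not is_block_aligned(off, ln))
    return bad / n


if __name__ == "__main__":
    print("fraction of writes not 4K-aligned:", unaligned_fraction())
```

Under these assumptions nearly every write is unaligned, so a layer
that quietly assumed block-sized, block-aligned requests would be
exercised on almost every log write. This says nothing about whether
EVMS actually makes such an assumption; that is only the suspicion
raised above.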
--
As any limb well and duly exercised, grows stronger,
the nerves of the body are corroborated thereby. --I. Watts. .''`.
We are deB.ORG; You will be freed. : :' :
<URL:http://www.debian.org/social_contract> `. `'