Re: "Invalid client ID" after system lockup and subsequent reset ?

From: Karl M. Hegbloom
Date: 26 Jul 2002 14:05:54 -0700
Stephen Lord <lord@xxxxxxx> writes:

> Karl M. Hegbloom wrote:
> >Several times I've had my SMP machine with EVMS and XFS filesystems lock
> >up and need to be reset.  After the reboot, sometimes filesystems won't
> >mount (it tends to be the "/var" partition) and I must boot to single
> >user mode.
> >

Note that the underlying problem is a lockup that I suspect is related
to the combination of SMP, EVMS, and XFS(?).  It only seems to happen
when I am compiling software, that is, when there is a large amount of
file system activity, so I suspect a race condition somewhere that
leads to a deadlock.

When it happens, the "gkrellm" meter keeps running, and I can see
there is some CPU activity.  It seems like the system is running
smoothly, quiescently, and all is fairly normal.  The mouse pointer
moves around, and I can still type in my editor.  However, I cannot
save the file I'm working on -- the editor blocks then.  I cannot
start another command in a terminal -- it blocks and never runs.  I
can use Alt-SysReq-F1 to snick down to a Linux VT, but cannot log in
-- it blocks.  The gnome desk guide that I keep in a panel drawer will
display, but when I click on it, nothing seems to happen; it's blocked.

Tell me if I'm wrong, but it sounds like any file system activity is
blocking someplace.  That's why I suspect that there is a deadlock.

I am running a Debian 2.4.18 kernel, with the EVMS and XFS patches
supplied by Debian.  I've upgraded those kernel-patch patches a few
times, and the last build failed entirely due to the unaligned access
bug.  I don't have the version information for the patches in the
currently running kernel, which runs fine unless I try to build
software (which makes it difficult to rebuild the kernel).

Luckily it has never locked up like that during a Debian upgrade.

> >When I attempt to mount the filesystem to cause the log replay, it
> >refuses to do so, giving an error about an "invalid client id".  I must
> >then run "xfs_repair -L" on it to get the machine back online.
> >
> >What does that error mean, and is there a way to make it mount anyway
> >and replay the log?
> >
> >How can I debug system lockups like that, so that I can give a
> >meaningful and useful bug report?  Will someone please work with me in
> >private mail (or direct me to the relevant mailing list or
> >documentation) and teach me how to debug the Linux kernel, to find
> >lockups and such?
> >

... and that I want to learn how I can DEBUG it -- can I set my system
up so that if it happens, a debugger is run or at least I can cause a
core dump?

What do you folks do, as developers, to find out what's happening in
cases like this?  Will you please help me learn how to be of more
assistance?  I care about this more than about whether the darn thing
locks up on me once in a while.

> This sounds like a corrupted log, the question is, who is doing the
> corruption here. Bad clientid means there was something in the log
> which was not recognized. Since EVMS and XFS have not had a lot of
> exposure with each other, I would suspect EVMS is not taking well to
> the XFS log writes, they are variable in size, between 512 bytes and
> 32K, and they can start on any 512 byte boundary. Not much else in
> Linux does I/O like this, possibly EVMS is dropping part of the I/O.
> I would raise this on the evms mailing list as well as the xfs one,
> we can then work out between us what is going on.
> Next time it happens, try xfs_logprint -t /dev/xxx before you mount
> the filesystem. It might fail in a similar manner to the kernel
> though.
> Steve
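
The write pattern described above (variable-sized writes between 512
bytes and 32K, starting on any 512-byte boundary) can be approximated
from user space to see whether a volume drops part of such I/O.  What
follows is only a minimal Python sketch, not the XFS log code: the
function names are hypothetical, write lengths are assumed to be whole
multiples of 512 bytes, and the writes go through an ordinary scratch
file rather than the path the XFS log actually takes.

```python
import os
import random

SECTOR = 512            # alignment unit from the description above
MAX_WRITE = 32 * 1024   # largest single write from the description above


def make_writes(count, span, seed=0):
    """Build `count` deterministic (offset, data) writes inside `span` bytes.

    Assumption: lengths are whole multiples of 512 bytes.
    """
    rng = random.Random(seed)
    writes = []
    for _ in range(count):
        length = rng.randint(1, MAX_WRITE // SECTOR) * SECTOR
        offset = rng.randint(0, (span - length) // SECTOR) * SECTOR
        data = rng.getrandbits(length * 8).to_bytes(length, "little")
        writes.append((offset, data))
    return writes


def apply_writes(path, writes, span):
    """Create a zero-filled scratch file, write every chunk, then fsync."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.ftruncate(fd, span)
        for offset, data in writes:
            if os.pwrite(fd, data, offset) != len(data):
                raise OSError("short write at offset %d" % offset)
        os.fsync(fd)
    finally:
        os.close(fd)


def damaged_sectors(path, writes, span):
    """Return offsets of 512-byte sectors that differ from what was written."""
    expected = bytearray(span)
    for offset, data in writes:      # later writes overwrite earlier ones
        expected[offset:offset + len(data)] = data
    with open(path, "rb") as f:
        actual = f.read()
    return [off for off in range(0, span, SECTOR)
            if actual[off:off + SECTOR] != bytes(expected[off:off + SECTOR])]
```

Because the data is derived from the seed, `apply_writes` can be run
on a scratch file on the suspect volume, and `damaged_sectors` can be
run later (for example after a remount or reboot, so the page cache
does not hide anything) with the same `make_writes` arguments.  An
empty result only means this approximation found nothing; it does not
clear the volume layer.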

