On Sunday 27 April 2003 12:12 pm, Federico Sevilla III wrote:
> I used to experience these problems quite often (eg: circa
> July 2002, 2.4.18-xfs), but have not experienced it at all
> with 2.4.20-xfs that I went back to after a couple of months
> using ext3. The major change in the system is the fact that I
> changed RAM after finding out using MemTest86 that I had a
> subtle problem with one of the memory modules. This may or may
> not be the case with you, but if you haven't yet you may want
> to do at least two full passes of MemTest86[1].
>
> [1] http://www.memtest86.com
Unfortunately, memtest86 only works on x86. What I have for my
EV56 is the SRM-embedded memory test, which is supposedly about
as good. (The box appears to pass this test.)
> BTW, right now I am using the same patchset you are using,
> made over kernel-source-2.4.20_2.4.20-6 from Debian, built
> using GCC 2.95.4 from Debian (even if gcc 3.2 is the new
> default in Sid because of a message[2] from Wessel Dankers to
> this list).
I hear you on that. I keep my x86 kernels compiled with
gcc-2.95.3, but I'm standardizing my Alpha kernels to build with
gcc-3.2.x. gcc-2.9x/alpha has bugs of its own that prevent it
from building SMP-enabled Alpha kernels. :(
As it is, this box didn't actually have any problems until now,
and the only thing that changed was the kernel+XFS patchset.
It's been a build host for some time now, doing compile jobs
day-in and day-out. I could probably take the drive to another
machine and run a media check on it, but the timing leads me to
strongly suspect the kernel.
The problem also seems very reproducible. I've already hit
it again under strace...it's syscall_377 that's hanging
(apparently getdents64, obviously a very commonly used syscall
when doing directory recursion). It's probably not often hit on
x86, because x86 Linux stuff usually uses the bog-standard 32-bit
getdents() instead of getdents64(). Anything that explicitly
enables large-file support would hit it.
The problem isn't a full deadlock. It eventually breaks out of
it (especially if I send more processes to mess with the
affected directory), but it takes WAY too long to complete such
a simple syscall. Something in the kernel is turning
getdents64() into I/O quicksand.
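
For anyone who wants to check whether a tree of theirs shows the
same symptom, here is a rough sketch that times each directory
listing and reports the slow ones. The names timed_scandir and
find_slow_dirs and the 1-second threshold are made up for
illustration. It also assumes that os.scandir() ends up in
getdents64 through the C library's readdir(), which depends on
the libc and is not something established in this thread; strace
remains the way to confirm which syscall is actually stalling.

```python
import os
import time


def timed_scandir(path):
    """List one directory and return (entries, elapsed_seconds)."""
    start = time.perf_counter()
    with os.scandir(path) as it:
        entries = list(it)  # forces the whole directory to be read
    return entries, time.perf_counter() - start


def find_slow_dirs(root, threshold=1.0):
    """Walk root; return [(path, seconds)] for listings that took
    at least `threshold` seconds (hypothetical default: 1.0)."""
    slow = []
    stack = [root]
    while stack:
        path = stack.pop()
        try:
            entries, elapsed = timed_scandir(path)
        except OSError:
            continue  # unreadable or vanished directory; skip it
        if elapsed >= threshold:
            slow.append((path, elapsed))
        for entry in entries:
            try:
                # Do not follow symlinks, to avoid loops.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
            except OSError:
                continue
    return slow
```

Calling find_slow_dirs("/path/to/build/tree") returns the
directories whose listing took a second or more, each with its
elapsed time. On a healthy filesystem the list should normally
come back empty.
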
--
Kelledin
"If a server crashes in a server farm and no one pings it, does
it still cost four figures to fix?"