Re: getdents64 hangs [WAS: More processes hanging in 'D' state.]

To: linux-xfs@xxxxxxxxxxx
Subject: Re: getdents64 hangs [WAS: More processes hanging in 'D' state.]
From: Kelledin <kelledin+XFS@xxxxxxxxxxxxxxxxxxx>
Date: Sun, 27 Apr 2003 13:31:05 -0500
Sender: linux-xfs-bounce@xxxxxxxxxxx
User-agent: KMail/1.5.1
On Sunday 27 April 2003 12:12 pm, Federico Sevilla III wrote:
> I used to experience these problems quite often (eg: circa
> July 2002, 2.4.18-xfs), but have not experienced it at all
> with 2.4.20-xfs that I went back to after a couple of months
> using ext3. The major change in the system is the fact that I
> changed RAM after finding out using MemTest86 that I had a
> subtle problem with one of the memory modules. This may or may
> not be the case with you, but if you haven't yet you may want
> to do at least two full passes of MemTest86[1].
>
> [1] http://www.memtest86.com

Unfortunately, memtest86 only works on x86.  What I have for my
EV56 is the SRM-embedded memory test, which is supposedly about
as good.  (The box appears to pass this test.)

> BTW, right now I am using the same patchset you are using,
> made over kernel-source-2.4.20_2.4.20-6 from Debian, built
> using GCC 2.95.4 from Debian (even if gcc 3.2 is the new
> default in Sid because of a message[2] from Wessel Dankers to
> this list).

I hear you on that.  I keep my x86 kernels compiled with
gcc-2.95.3, but I'm standardizing my Alpha kernels to build with
gcc-3.2.x.  gcc-2.9x/alpha has bugs of its own that prevent it
from building SMP-enabled Alpha kernels. :(

As it is, this box didn't actually have any problems until now,
and the only thing that changed was the kernel+XFS patchset.
It's been a build host for some time now, doing compile jobs
day-in and day-out.  I could probably take the drive to another
machine and run a media check on it, but the timing leads me to
strongly suspect the kernel.

The problem also seems very reproducible.  I've already hit it
again under strace...it's syscall_377 that's hanging (apparently
getdents64, which is obviously used very heavily when recursing
through directories).  It's probably not hit often on x86,
because x86 Linux stuff usually uses the bog-standard 32-bit
getdents() instead of getdents64().  Stuff that explicitly
enables large-file support would hit it.

The problem isn't a full deadlock.  The call eventually returns
(especially if I send more processes to mess with the affected
directory), but it takes WAY too long for such a simple syscall.
Something in the kernel is turning getdents64() into I/O
quicksand.
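
For anyone who wants to check whether a tree has a directory that
behaves this way, here is a rough sketch that times each directory
listing and reports the slow ones.  It is not the test used above
(that was plain strace); the function names and the one-second
threshold are made up for illustration.  It assumes Python 3.6+ on
Linux, where os.scandir() goes through libc readdir(), which is
expected (but not guaranteed here) to end up in getdents64().

```python
import os
import sys
import time


def time_listing(path):
    """List `path` once; return (entry_count, seconds, subdirs).

    Only the directory read itself is timed. Classifying entries as
    directories may need an extra lstat() on filesystems that do not
    report d_type, so that step is kept outside the timed region.
    """
    start = time.monotonic()
    with os.scandir(path) as it:
        entries = list(it)
    elapsed = time.monotonic() - start
    subdirs = [e.path for e in entries if e.is_dir(follow_symlinks=False)]
    return len(entries), elapsed, subdirs


def find_slow_dirs(root, threshold=1.0):
    """Walk `root`; return (path, count, seconds) for slow listings.

    `threshold` is an arbitrary cutoff in seconds (an assumption,
    not a value taken from the report above).
    """
    slow = []
    pending = [root]
    while pending:
        path = pending.pop()
        try:
            count, elapsed, subdirs = time_listing(path)
        except OSError:
            continue  # unreadable or vanished directory: skip it
        if elapsed >= threshold:
            slow.append((path, count, elapsed))
        pending.extend(subdirs)
    return slow


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, count, elapsed in find_slow_dirs(root):
        print(f"{elapsed:8.3f}s  {count:6d} entries  {path}")
```

Each directory is read exactly once, so the timing is not hidden by
an earlier pass having already pulled the directory into cache.  A
hang inside the kernel will still stall the script at that
directory; running it under strace -T shows which syscall is
sitting there.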

--
Kelledin
"If a server crashes in a server farm and no one pings it, does
it still cost four figures to fix?"