Re: State D with 2.4.18, XFS-1.1 and GCC-3.0.4 on Debian Sid

To: linux-xfs@xxxxxxxxxxx
Subject: Re: State D with 2.4.18, XFS-1.1 and GCC-3.0.4 on Debian Sid
From: Federico Sevilla III <jijo@xxxxxxxxxxx>
Date: Mon, 3 Jun 2002 00:22:11 +0800
In-reply-to: <87elfpiul5.fsf@xxxxxxxxx>
References: <87elfpiul5.fsf@xxxxxxxxx>
Sender: owner-linux-xfs@xxxxxxxxxxx
User-agent: Internet Messaging Program (IMP) 3.0
Quoting Carl Lunde <carll+news@xxxxxxxxx>:
> First I want to say that the contents and of this machine is not important,
> but I assume you are interested in any problems.  I do not know if this
> is XFS-related, but
>  $ mount
>  /dev/hda1 on / type xfs (rw)
>  proc on /proc type proc (rw)
>  devpts on /dev/pts type devpts (rw,gid=5,mode=620)
>  it's the only file system I use.

On a server I maintain I experienced this problem a while back, too.
Unfortunately I could neither isolate the problem nor reproduce it on demand,
even though it happened maybe four times on various snapshots of
2.4.18-xfs, all compiled with gcc-3.0, at various uptimes (two days,
five days, and at other times more than ten days). I also use
Debian Sid (crazy, I know).

The server is now on a CVS snapshot of 2.4.18-xfs checked out on 2002-05-13,
with RML's preempt-4 patch, compiled with gcc 2.95.4, and it's been up and
running stably since. So what fixed it is either something that made it into
CVS before 2002-05-13, or the fact that I went back to gcc 2.95.4 from gcc
3.0.4. I haven't tried gcc 3.1 for the kernels of any of my XFS systems yet,
and don't plan to until the coast seems clear enough.

I'm sure it wasn't the preempt patch's fault. The first time things froze I had
the preempt patch in, so I pulled it out right away (keeping the CVS snapshot of
that date... which unfortunately I did not jot down). And things still froze.
The system now has the preempt patch in and things have been working really
well so far. Uptime is 12 days, and it is only that short because of a blackout
that outlasted our UPSs.

> I noticed the following process;
>  USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
>  root      6172  0.0  0.1  1464  496 ?        D    May30   0:04 find / (..)
> and I found out that this problem isn't new, this had happened to every
> updatedb-job started by crond, and it happened to my `find /' process too.

The server here used to run X, and our only sign of trouble would be a
completely cold freeze. No logs. No errors. Nothing. Which is also why I hadn't
reported anything to this list. It could have been anything under the sun. I
thought it might have been hardware, specifically something with X and my video
card. I dropped X and got myself a workstation to use, instead.

The "lockup" happened once when X was not running. Although thankfully (?)
because X was not running it didn't freeze cold. Instead I noticed state D
processes creeping in slowly. My "ps ax"'s would freeze, and later on a "ps ax"
would not freeze and showed a growing number of state D processes. Most of them
at the time were hung "ps ax"'s of mine, or instances of uvscan scanning mail,
stuck way beyond their expected run time.

Lately (with my new kernel) I notice that my uvscan calls modprobe to do some
sort of scan every time it's run. This causes a delay on my current box but is
no major problem with our loads. However it is possible, in hindsight, that it
was these modprobes, and not the uvscans themselves, that made the uvscan
processes get stuck in state D. Just maybe.

As in your case, none of these state D processes could be killed.

This last "creeping death" happened in mid-day so I had all users (thankfully
we're a relatively small firm) save their work and I issued a "shutdown -r now".
The shutdown did not complete successfully. It froze while attempting to kill
all processes. A reboot required recovery of the partitions, but this thankfully
went well.

> Further checking showed;
>  # ls -l /proc/6172/cwd
>  cwd -> /usr/share/texmf/fonts/tfm/cg/
> 
>  # ls -l /proc/6172/fd/
>  ...
>  4 -> /usr/share/texmf/fonts/tfm/cg/times/
> 
>  # strace ls -l /usr/share/texmf/fonts/tfm/cg/times/
>  open("/usr/share/texmf/fonts/tfm/cg/times/",
>   O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
>  fstat64(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>  fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
>  brk(0x8059000)                          = 0x8059000
>  getdents64(0x3, 0x8056848, 0x1000, 0
>                                      ^-- end of output

I'm glad you were able to get stuff like this. I wasn't. Does the XFS team have
any suggestions for standard debugging tasks to do when things go wrong (without
a kernel oops or anything like that, but these hung processes) to see what's
goofing up and help you help us all?

> So - this happens to every process doing that there, I haven't found
> anything similar elsewhere so far.  These processes does not respond to
> kill -9, and I assumed you wanted to know about this before I'll reboot
> and see if the problem persists.

In my case the uptime before a freeze was random, but it would always
eventually happen again. The "creeping death with state D's" only happened
once, though, before I upgraded my kernel with a new snapshot and a downgraded
compiler, which has kept things working thus far. Before that all we had were
cold freezes.

> I cannot remember why I rebooted, but I think it shut down as normal
> but not completely, it froze for more than 30 sec without making a sound
> so I `pulled the plug'. At least this[1] shows the problem didn't start
> directly after the unclean shutdown.

This sounds similar to what happened to my box. The three-finger salute
didn't do the job, and neither did the Magic SysRq. I had to hit the reset
button. It was a server (still is) so I couldn't take my time. Unfortunately
that also means I kept quiet until you came along, and all I had to do was add
more info. :(

> [1] 
>  $ ps aux |grep D
>  USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
> [..not State D..] 
>  root      6172  0.0  0.1  1464  496 ?        D    May30   0:04 find / -xdev
> ( -false ) -prune -o ( -type f -perm +06000 -o ( ( -type b -o -type c ) -a
> -not ( -false ) ) ) -printf %8i %5m %3n %-10u %-10g %9s %t %h/%f?n
>  [...]
> Cron probably started several similar find-processes before without any
> problems.

Yes. You can't just reproduce them at will, at least in my experience. But when
the "creeping death" came, things just started getting stuck. Now it's standard
procedure for me to run "ps ax" and check for processes that stay in state D
for too long.
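
For what it's worth, that check can be scripted. Below is a rough sketch in
Python, not something I actually run: it takes two "ps" snapshots some time
apart and reports the processes that were in state D in both. The function
names (dstate_pids, still_stuck, take_snapshot, watch) and the 60-second
interval are my own assumptions for illustration, and it assumes a procps-style
"ps" that accepts "-eo pid,stat,args".

```python
import subprocess
import time


def dstate_pids(ps_output):
    """Parse 'ps -eo pid,stat,args' output into {pid: command} for state D."""
    stuck = {}
    for line in ps_output.splitlines()[1:]:  # first line is the header
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue
        pid, stat = fields[0], fields[1]
        command = fields[2].strip() if len(fields) > 2 else ""
        # STAT may carry extra flags (e.g. "D<" or "D+"), so test the prefix.
        if stat.startswith("D"):
            stuck[int(pid)] = command
    return stuck


def still_stuck(earlier, later):
    """Return entries that are in state D, with the same command, in both."""
    return {
        pid: cmd
        for pid, cmd in later.items()
        if earlier.get(pid) == cmd
    }


def take_snapshot():
    """Run ps once and return the current state D processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,args"],
        capture_output=True, text=True, check=True,
    )
    return dstate_pids(result.stdout)


def watch(interval=60):
    """Report processes that stay in state D across `interval` seconds."""
    earlier = take_snapshot()
    time.sleep(interval)
    return still_stuck(earlier, take_snapshot())
```

Calling watch() from a cron job and mailing any non-empty result would give an
early warning. A short stay in state D is normal during disk I/O, which is why
the sketch only flags a process seen in state D in both snapshots. Note that if
"ps" itself hangs, as mine did, the script hangs too, which is a symptom in
its own right.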

 --> Jijo

-- 
Federico Sevilla III   :  <http://jijo.free.net.ph/>
Network Administrator  :  The Leather Collection, Inc.
GnuPG Key ID           :  0x93B746BE

