Date: Wed, 28 May 2003 08:53:32 +0800
From: Federico Sevilla III
To: xfs mailing list
Subject: Re: D State and XFS 1.2
Message-ID: <20030528005332.GD14513@leathercollection.ph>
References: <20030527200806.LTIO17201.imf43bis.bellsouth.net@tiger2>
In-Reply-To: <20030527200806.LTIO17201.imf43bis.bellsouth.net@tiger2>

On Tue, May 27, 2003 at 04:13:00PM -0400, Greg Freemyer wrote:
> I'm running a vanilla 2.4.19 kernel with xfs 1.2 patched in. xfsdump
> from 1am Monday morning is stuck in D state. The server has been up
> and running for 40 days. The xfsdump is of a lvm snapshot. The base
> FS is working fine. I remember seeing threads about getting stuck in
> D state, but did not realize it affected the 1.2 release. (I thought
> it was cvs only.) Is this a known/resolved issue, or is there some
> interest in troubleshooting the issue.

In my case faulty RAM hit me.
Even the "extensive" BIOS check didn't find the problem: I had to do
two full passes of MemTest86 to find the minor corruption. With the
memory replaced, our server has been running smoothly so far. I don't
use LVM though.

> I assume I can kill -9 the stuck processes, unmount the FS and kill
> the snapshot to restore normal operation.

In my case I could not kill the 'stuck in D' processes, and as the
number of them grew, more and more processes would get stuck along
with them until the system became intolerably unresponsive, requiring
a forced unclean shutdown (read: turn off the switch).

What first looked to me like an XFS problem turned out to be
filesystem-independent. I also got hit by this with ext3, albeit after
a much longer time, probably because XFS's algorithms use memory more
aggressively.

Hopefully this is it. If you can afford to take your box down to do a
full memory scan, or perhaps if you can change the RAM and then do a
memory scan of it elsewhere, it's much easier to fix this than to find
a potential bug somewhere.

 --> Jijo

-- 
Federico Sevilla III  : http://jijo.free.net.ph      : When we speak of free
Network Administrator : The Leather Collection, Inc. : software we refer to
GnuPG Key ID          : 0x93B746BE                   : freedom, not price.
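
For anyone who wants to watch whether the number of processes stuck in D
state is growing, as described above, here is a minimal sketch that reads
the state field from /proc/<pid>/stat on Linux. The function names
(parse_stat_line, uninterruptible_processes) and the proc_root parameter
are made up for illustration. It only shows which processes are stuck;
it says nothing about why, so it is no substitute for the memory test.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while reading, or entry unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Running it every few minutes and comparing the output should show whether
the same PIDs stay stuck and whether new ones keep joining them.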