
To: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Subject: Re: Filesystem kernel hangup, 2.6.3 (bad: scheduling while atomic!)
From: Mikael Wahlberg <mikael.wahlberg@xxxxxxxxxx>
Date: Mon, 23 Feb 2004 14:08:09 +0100
Cc: linux-kernel@xxxxxxxxxxxxxxx, linux-xfs@xxxxxxxxxxx, Per Lejontand <pele@xxxxxxxxxx>, Jonas Engström <jonas@xxxxxxxxxx>
In-reply-to: <20040223121959.A8354@infradead.org>
Organization: Ardendo
References: <20040222164941.D6046@foo.ardendo.se> <20040223121959.A8354@infradead.org>
Sender: linux-xfs-bounce@xxxxxxxxxxx
On Mon, 2004-02-23 at 13:19, Christoph Hellwig wrote:
> On Sun, Feb 22, 2004 at 04:49:41PM +0100, Mikael Wahlberg wrote:
> > Description:
> > 
> > Under heavy FTP load (about 1 Gbit/s), running both reads and writes
> > on two ServeRAID-6M RAID5 controllers merged into one filesystem with
> > raidtools, we see the error below. The filesystem gets totally hung
> > up. Currently this is with XFS, but JFS hits the same problem
> > (actually even more often).
> 
> What does the JFS oops look like?  

Unfortunately I cannot find the JFS oops right now. We can try to
reproduce it on one of the machines (we have four identical setups). It
seems the JFS oops was never written to the messages file.

We are running the system on two CPUs (no HyperThreading) right now,
and it has not crashed for about 18 hours, so it seems a bit more
stable with fewer CPUs. We will try JFS without HT today.
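(For anyone trying to reproduce this: one way to boot the same kernel
with fewer CPUs is the maxcpus= boot parameter, e.g. in a GRUB entry
like the one below. The image path and root device are placeholders,
not our actual setup.)

	kernel /boot/vmlinuz-2.6.3 root=/dev/md0 ro maxcpus=2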

> > Feb 22 15:00:53 mserv1 kernel:  [<c011e54a>] __wake_up_common+0x3a/0x60
> > Feb 22 15:00:53 mserv1 kernel:  [<c011e5af>] __wake_up+0x3f/0x70
> 
> This doesn't make a lot of sense; there are only two mrlocks in XFS,
> and they're in the inodes, which have well-defined and understood
> lifetime rules.
> 
> OTOH the previous oops might have messed up quite a bit in your system.

Ok.. 
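(A note for the archives: below is a rough sketch of what Christoph is
referring to, paraphrased from the 2.6.x XFS sources -- the field and
flag names here are from memory and may not match the tree exactly.
Each xfs_inode embeds exactly two mrlocks, and xfs_iunlock() drops
them through mrunlock(), which is the path visible in the trace above.)

	typedef struct xfs_inode {
		/* ... */
		mrlock_t	i_lock;		/* inode metadata lock */
		mrlock_t	i_iolock;	/* inode I/O path lock */
		/* ... */
	} xfs_inode_t;

	/* Sketch of the unlock path; the real code has more cases. */
	void
	xfs_iunlock(xfs_inode_t *ip, uint lock_flags)
	{
		if (lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL))
			mrunlock(&ip->i_iolock); /* may __wake_up() waiters */
		if (lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL))
			mrunlock(&ip->i_lock);
	}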

> Did you run memtest86 on the box?  Do you have some strange patches
> applied or external modules loaded?  What's your .config?

No strange patches; a pure 2.6.3 distribution kernel. We haven't run
memtest86, but as I mentioned above, we have four identical machines
with error-correcting memory, so I find it unlikely to be a memory
problem.

The .config is attached.

> > Feb 22 15:00:54 mserv1 kernel:  [<c011b150>] do_page_fault+0x0/0x523
> > Feb 22 15:00:54 mserv1 kernel:  [<c010baf5>] error_code+0x2d/0x38
> > Feb 22 15:00:54 mserv1 kernel:  [<c011e5b5>] __wake_up+0x45/0x70
> > Feb 22 15:00:54 mserv1 kernel:  [<c011e54a>] __wake_up_common+0x3a/0x60
> > Feb 22 15:00:55 mserv1 kernel:  [<c011e5af>] __wake_up+0x3f/0x70
> > Feb 22 15:00:55 mserv1 kernel:  [<c0259e62>] mrunlock+0x82/0xb0
> > Feb 22 15:00:55 mserv1 kernel:  [<c0259b00>] mraccessf+0xc0/0xe0
> > Feb 22 15:00:55 mserv1 kernel:  [<c023038e>] xfs_iunlock+0x3e/0x80
> > Feb 22 15:00:55 mserv1 kernel:  [<c023727b>] xfs_iomap+0x3bb/0x540
> > Feb 22 15:00:55 mserv1 kernel:  [<c0163fc7>] bio_alloc+0xd7/0x1c0
> > Feb 22 15:00:55 mserv1 kernel:  [<c025a17a>] map_blocks+0x7a/0x170
> > Feb 22 15:00:55 mserv1 kernel:  [<c025b40b>] page_state_convert+0x52b/0x6d0
> > Feb 22 15:00:55 mserv1 kernel:  [<c0236cb9>] xfs_imap_to_bmap+0x39/0x240
> > Feb 22 15:00:55 mserv1 kernel:  [<c025be48>] linvfs_release_page+0xa8/0xb0
> > Feb 22 15:00:55 mserv1 kernel:  [<c025bce0>] linvfs_writepage+0x60/0x120
> > Feb 22 15:00:55 mserv1 kernel:  [<c014990c>] shrink_list+0x41c/0x710
> > Feb 22 15:00:55 mserv1 kernel:  [<c0149df8>] shrink_cache+0x1f8/0x3d0
> > Feb 22 15:00:55 mserv1 kernel:  [<c01b3a00>] journal_stop+0x220/0x330
> > Feb 22 15:00:55 mserv1 kernel:  [<c014a6dc>] shrink_zone+0xbc/0xc0
> > Feb 22 15:00:55 mserv1 kernel:  [<c014a7a5>] shrink_caches+0xc5/0xe0
> > Feb 22 15:00:55 mserv1 kernel:  [<c014a87c>] try_to_free_pages+0xbc/0x190
> > Feb 22 15:00:55 mserv1 kernel:  [<c0143043>] __alloc_pages+0x203/0x370
> > Feb 22 15:00:55 mserv1 kernel:  [<c01431d5>] __get_free_pages+0x25/0x40
> 
> Hmm, from the trace it looks like ->release_page was called from a
> context where we can't sleep.  XFS definitely doesn't handle that, so
> the question is whether the kernel should do it.

Ok..
If you need any more information, please tell us. This is quite urgent
for us, since we really don't want to go back to 2.4; the performance
increase with 2.6 is really impressive (except when it crashes :)
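
(For reference, the sketch below is not the actual XFS code, just the
defensive check Christoph's comment seems to imply: in 2.6 the
->releasepage() operation receives the caller's gfp mask, so an
implementation that may need to sleep -- e.g. to take an inode mrlock
-- could bail out when the VM calls it from atomic context. The "bad:
scheduling while atomic!" line itself comes from the in_atomic() check
at the top of schedule(). The function name and body here are purely
illustrative.)

	#include <linux/mm.h>
	#include <linux/buffer_head.h>

	/* Hypothetical ->releasepage, not the real XFS one. */
	static int
	example_releasepage(struct page *page, int gfp_mask)
	{
		if (!(gfp_mask & __GFP_WAIT))	/* caller cannot sleep */
			return 0;		/* refuse, don't block */

		/* ... safe to take sleeping locks / flush state ... */
		return try_to_free_buffers(page);
	}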

/Mikael

-- 
-----------------------------------------------------------------------
 Mikael Wahlberg,  M.Sc.                  Ardendo
 Unit Manager Professional Services/      e-mail: mikael@xxxxxxxxxx
 Technical Project Manager                GSM:    +46 733 279 274

Attachment: kernel-config-mserv2
Description: Text document
