RE: Re-occurance of NFS server panics

To: I.D.Hardy@xxxxxxxxxxx
Subject: RE: Re-occurance of NFS server panics
From: Steve Lord <lord@xxxxxxx>
Date: 19 Sep 2002 14:22:05 -0500
Cc: linux-xfs@xxxxxxxxxxx, O.G.Parchment@xxxxxxxxxxx, "'Russell Cattelan'" <cattelan@xxxxxxxxxxx>
In-reply-to: <E5CC9E66DAF2D411A0D700B0D079331B41F1FE@exchange2.soton.ac.uk>
References: <E5CC9E66DAF2D411A0D700B0D079331B41F1FE@exchange2.soton.ac.uk>
Sender: linux-xfs-bounce@xxxxxxxxxxx
On Thu, 2002-09-19 at 12:17, Ian D. Hardy wrote:
> Steve, +
> 
> I'm not sure if this provides any more info but I've just had another
> crash on this server, again caught by 'slab.c', this time from an 'nfsd'
> process:

Well, either someone trampled on some memory allocation headers they
should not have, corrupting them in just the way needed to trip this
check, or the pointer passed into the free was slightly outside the
range it should have been in.
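
As a rough cross-check of the register dump quoted below, here is a
small Python sketch of the index arithmetic the disassembly appears to
perform (imul %ecx,%eax; lea (%eax,%esi,1),%eax; cmp %eax,%ebp). The
mapping of registers to roles (ebp = pointer being freed, esi = slab
base address, ecx = object size) is an assumption read off the
disassembly, not something the oops states, and slab_index and
is_repeated_byte are illustrative helper names only.

```python
# Register values copied from the oops quoted in this mail.
regs = {
    "ebx": 0x00091858,
    "ecx": 0x00001000,
    "edx": 0x000005A6,
    "esi": 0x5A5A5A5A,
    "ebp": 0xEBDFE000,
}

MASK32 = 0xFFFFFFFF  # the kernel in question is 32-bit i386


def slab_index(objp, base, objsize):
    """Return (index, remainder) of objp relative to base, in 32-bit math."""
    offset = (objp - base) & MASK32
    return divmod(offset, objsize)


def is_repeated_byte(value, byte):
    """True if the 32-bit value is the same byte repeated four times."""
    return value == int.from_bytes(bytes([byte]) * 4, "big")


# ASSUMPTION: ebp = pointer being freed, esi = slab base, ecx = object size.
index, remainder = slab_index(regs["ebp"], regs["esi"], regs["ecx"])

# Rebuild the address the check would expect for that index.
rebuilt = (index * regs["ecx"] + regs["esi"]) & MASK32

print("index     =", hex(index))
print("remainder =", hex(remainder))
print("rebuilt   =", hex(rebuilt), "vs ebp =", hex(regs["ebp"]))
print("base is a repeated 0x5a byte:", is_repeated_byte(regs["esi"], 0x5A))
```

Under that assumed mapping the quotient and remainder come out equal to
the ebx and edx values in the dump, so the registers are at least
self-consistent with an index computed from a base of 0x5a5a5a5a. On a
default i386 2.4 configuration kernel addresses start at 0xc0000000, so
that base does not look like a real pointer; a repeated 0x5a byte would
fit a poison fill pattern, which is again an assumption here.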

I do not suppose there is any way you can remove the binary-only modules
from your kernel? From the sound of it they are pretty fundamental to
your setup - the disk driver and the network driver. If the address
being freed really is bogus in some form, then what is being freed is a
network packet, and XFS definitely does not handle those. Also, the fact
that this corruption seems specific to your setup suggests that
investigating those drivers some more might be beneficial.
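
To make the point about the call path concrete, here is a minimal
Python sketch that parses the 'Trace;' lines from the ksymoops output
quoted below and lists which modules appear. The regular expression and
the parse_trace helper are illustrative assumptions about the ksymoops
line format as it appears in this mail, not part of any tool.

```python
import re

# 'Trace;' lines copied from the ksymoops output quoted in this mail.
TRACE = """\
Trace; c0132e32 <free_block+162/210>
Trace; c013360a <kfree+14a/180>
Trace; c010cef8 <call_do_IRQ+6/e>
Trace; c02558a2 <skb_release_data+72/80>
Trace; c02558bc <kfree_skbmem+c/70>
Trace; c0255a36 <__kfree_skb+116/120>
Trace; c02557ea <skb_drop_fraglist+3a/50>
Trace; c0255898 <skb_release_data+68/80>
Trace; c0255e10 <skb_linearize+90/f0>
Trace; f89d0488 <[sunrpc]svc_udp_recvfrom+128/380>
Trace; f89d1820 <[sunrpc]svc_send+70/1a0>
Trace; f89d14d8 <[sunrpc]svc_recv+2c8/470>
Trace; f8a252b4 <[nfsd]nfsd+114/350>
Trace; c0107296 <kernel_thread+26/30>
Trace; f8a251a0 <[nfsd]nfsd+0/350>
"""

# address, optional [module] prefix, then the symbol name up to '+'
LINE_RE = re.compile(r"Trace;\s+([0-9a-f]+)\s+<(?:\[(\w+)\])?(\w+)\+")


def parse_trace(text):
    """Return a list of (module, symbol); built-in code is tagged 'kernel'."""
    frames = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            _addr, module, symbol = match.groups()
            frames.append((module or "kernel", symbol))
    return frames


frames = parse_trace(TRACE)
modules = sorted({module for module, _ in frames})
xfs_frames = [symbol for _, symbol in frames if symbol.startswith("xfs")]

print("modules in trace:", modules)
print("xfs symbols in trace:", xfs_frames)
```

Every frame in this trace belongs to the core kernel, sunrpc or nfsd,
and the free itself is reached through the skb (network buffer)
routines. That only describes where the bad free was detected; it says
nothing about which code did the original damage.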

Steve


> 
> Reading Oops report from the terminal
>  kernel BUG at slab.c:1218!
> invalid operand: 0000
> CPU:    0
> EIP:    0010:[<c0132aeb>]    Tainted: P 
> EFLAGS: 00010002
> eax: c1c0e060   ebx: 00091858   ecx: 00001000   edx: 000005a6
> esi: 5a5a5a5a   edi: c96c8bcc   ebp: ebdfe000   esp: f381fe8c
> ds: 0018   es: 0018   ss: 0018
> Process nfsd (pid: 856, stackpage=f381f000)
> Stack: ebdfe000 c96c8bcc ebdff000 ebdfe000 c0132e33 c1c0e060 c96c8bcc
> ebdfe000 
>        0000001b f7ee5720 f7ee5694 c1c0e060 008c0f70 eeafd000 c013360a
> c1c0e060 
>        f7ee5714 0000001e 00000286 eeafd000 ed79ffff c1c0e4d0 c010cef8
> 000003a0 
> Call Trace:    [<c0132e33>] [<c013360a>] [<c010cef8>] [<c02558a2>]
> [<c02558bc>]
>   [<c0255a36>] [<c02557eb>] [<c0255898>] [<c0255e11>] [<f89d0488>]
> [<f89d1820>]
>   [<f89d14d8>] [<f8a252b4>] [<c0107296>] [<f8a251a0>]
> 
> Code: 0f 0b c2 04 a0 54 2b c0 89 d8 0f af c1 8d 04 30 39 c5 74 08 
>  
> Entering kdb (current=0xf381e000, pid 856) on processor 0 Oops: invalid
> operand
> due to oops @ 0xc0132aeb
> eax = 0xc1c0e060 ebx = 0x00091858 ecx = 0x00001000 edx = 0x000005a6 
> esi = 0x5a5a5a5a edi = 0xc96c8bcc esp = 0xf381fe8c eip = 0xc0132aeb 
> ebp = 0xebdfe000 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010002 
> xds = 0xc0250018 xes = 0x00000018 origeax = 0xffffffff &regs =
> 0xf381fe58
> [0]kdb> 
>  kernel BUG at slab.c:1218!
> invalid operand: 0000
> CPU:    0
> EIP:    0010:[<c0132aeb>]    Tainted: P 
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010002
> eax: c1c0e060   ebx: 00091858   ecx: 00001000   edx: 000005a6
> esi: 5a5a5a5a   edi: c96c8bcc   ebp: ebdfe000   esp: f381fe8c
> ds: 0018   es: 0018   ss: 0018
> Process nfsd (pid: 856, stackpage=f381f000)
> Stack: ebdfe000 c96c8bcc ebdff000 ebdfe000 c0132e33 c1c0e060 c96c8bcc
> ebdfe000 
>        0000001b f7ee5720 f7ee5694 c1c0e060 008c0f70 eeafd000 c013360a
> c1c0e060 
>        f7ee5714 0000001e 00000286 eeafd000 ed79ffff c1c0e4d0 c010cef8
> 000003a0 
> Call Trace:    [<c0132e33>] [<c013360a>] [<c010cef8>] [<c02558a2>]
> [<c02558bc>]
>   [<c0255a36>] [<c02557eb>] [<c0255898>] [<c0255e11>] [<f89d0488>]
> [<f89d1820>]
>   [<f89d14d8>] [<f8a252b4>] [<c0107296>] [<f8a251a0>]
> Code: 0f 0b c2 04 a0 54 2b c0 89 d8 0f af c1 8d 04 30 39 c5 74 08 
> 
> >>EIP; c0132aea <kmem_extra_free_checks+2a/70>   <=====
> Trace; c0132e32 <free_block+162/210>
> Trace; c013360a <kfree+14a/180>
> Trace; c010cef8 <call_do_IRQ+6/e>
> Trace; c02558a2 <skb_release_data+72/80>
> Trace; c02558bc <kfree_skbmem+c/70>
> Trace; c0255a36 <__kfree_skb+116/120>
> Trace; c02557ea <skb_drop_fraglist+3a/50>
> Trace; c0255898 <skb_release_data+68/80>
> Trace; c0255e10 <skb_linearize+90/f0>
> Trace; f89d0488 <[sunrpc]svc_udp_recvfrom+128/380>
> Trace; f89d1820 <[sunrpc]svc_send+70/1a0>
> Trace; f89d14d8 <[sunrpc]svc_recv+2c8/470>
> Trace; f8a252b4 <[nfsd]nfsd+114/350>
> Trace; c0107296 <kernel_thread+26/30>
> Trace; f8a251a0 <[nfsd]nfsd+0/350>
> Code;  c0132aea <kmem_extra_free_checks+2a/70>
> 00000000 <_EIP>:
> Code;  c0132aea <kmem_extra_free_checks+2a/70>   <=====
>    0:   0f 0b                     ud2a      <=====
> Code;  c0132aec <kmem_extra_free_checks+2c/70>
>    2:   c2 04 a0                  ret    $0xa004
> Code;  c0132aee <kmem_extra_free_checks+2e/70>
>    5:   54                        push   %esp
> Code;  c0132af0 <kmem_extra_free_checks+30/70>
>    6:   2b c0                     sub    %eax,%eax
> Code;  c0132af2 <kmem_extra_free_checks+32/70>
>    8:   89 d8                     mov    %ebx,%eax
> Code;  c0132af4 <kmem_extra_free_checks+34/70>
>    a:   0f af c1                  imul   %ecx,%eax
> Code;  c0132af6 <kmem_extra_free_checks+36/70>
>    d:   8d 04 30                  lea    (%eax,%esi,1),%eax
> Code;  c0132afa <kmem_extra_free_checks+3a/70>
>   10:   39 c5                     cmp    %eax,%ebp
> Code;  c0132afc <kmem_extra_free_checks+3c/70>
>   12:   74 08                     je     1c <_EIP+0x1c> c0132b06
> <kmem_extra_free_checks+46/70>
> 
> Entering kdb (current=0xf381e000, pid 856) on processor 0 Oops: invalid
> operand
> eax = 0xc1c0e060 ebx = 0x00091858 ecx = 0x00001000 edx = 0x000005a6 
> esi = 0x5a5a5a5a edi = 0xc96c8bcc esp = 0xf381fe8c eip = 0xc0132aeb 
> ebp = 0xebdfe000 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010002 
> 
> 
> -----Original Message-----
> From: I.D.Hardy@xxxxxxxxxxx [mailto:I.D.Hardy@xxxxxxxxxxx] 
> Sent: 18 September 2002 21:49
> 
> Steve +,
> 
> > 
> > On Wed, 2002-09-18 at 13:31, Ian D. Hardy wrote:
> > > Steve,
> > > 
> > > >On Mon, 2002-09-16 at 11:56, Ian D. Hardy wrote:
> > > >> Steve,
> > > >> 
> > > >> Thanks for the quick response. I don't always get an Oops output
> > > >> (sometimes the server just hangs and requires a reboot). However,
> > > >> as it happens the server has just crashed again with the following
> > > >> Oops (through 'ksymoops'):
> > > >
> > > >This one suggests heap corruption more than anything else.
> > > >
> > > >Steve
> > > 
> > > I upgraded the kernel to the current CVS (2.4.19-xfs) tree (as of
> > > Monday 16th Sept.) today and got a very similar looking Oops to the
> > > one I reported on Monday, see below. I guess it is very difficult to
> > > know what would have caused any heap corruption. As I understand it
> > > there's nothing in these latest panics to directly link them with
> > > XFS? I need to do some more thinking on this. Any pointers would be
> > > very welcome.
> > > 
> > > Regards Ian Hardy
> > > 
> > > 
> > >  kernel BUG at slab.c:1439!
> > 
> > 
> > OK, so you have slab debugging turned on, which I was going to ask you
> > to do. Looks like someone walked off the end of an allocation here,
> > that is progress.
> > 
> > So, the question is what did it, that's the really hard part! Is this
> > machine just running XFS via NFS, or is it doing anything else? Also,
> > which options do you have turned on in XFS? In fact, sending the whole
> > kernel config might be an idea.
> > 
> > Steve
> > 
> 
> Thanks for your continued help/interest, it is much appreciated.
> 
> The server (dual 1GHz PIII, with 1Gbyte memory) is a dedicated NFS
> fileserver (no users have direct access to it). It is serving ~260 NFS
> clients (part of a computational/Beowulf system). The server has a
> 40Gbyte IDE system disk and 2 FC-IDE connected RAID units (each with
> ~500Gbytes usable storage - RAID 5 configuration). The RAID units are
> connected via a Qlogic QLA2200 HBA and a FC switch. The 2 RAID units are
> striped together (RAID 0) using the kernel 'md' RAID driver.
> 
> Below is the kernel configuration file:
> 
> 
> -- 
> 
> Regards and thanks
> 
> Ian
> 
> --
> Ian Hardy                                   Tel: 023 80593577
> Research Services                           Fax: 023 80593131
> Information Systems Services                email: i.d.hardy@xxxxxxxxxxx
> 
> Southampton University                     
> Southampton  S017 1BJ, UK.
> 
-- 

Steve Lord                                      voice: +1-651-683-3511
Principal Engineer, Filesystem Software         email: lord@xxxxxxx

