On Thu, 2002-09-19 at 12:17, Ian D. Hardy wrote:
> Steve, +
>
> I'm not sure if this provides any more info but I've just had another
> crash on this server, again caught by 'slab.c', this time from an 'nfsd'
> process:
Well, either someone trampled on some memory allocation headers they
should not have and confused the code in a precise direction, or the
pointer passed into the free was a little outside the range it should
have been in.
I do not suppose there is any way you can remove the binary-only modules
from your kernel? From the sound of it they are pretty fundamental to
your setup - the disk driver and the network driver. If the free really
is a bogus address of some form, then what is being freed is a network
packet, and we definitely do not handle those. Also, the fact that this
corruption seems specific to your setup suggests that investigating
those drivers some more might be beneficial.
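
As a rough cross-check of the "pointer out of range" idea, the register
dump quoted below can be replayed by hand. This is only a sketch: the
mapping of registers to variables (ebp = object pointer, esi = base of
the slab's objects, ecx = object size, ebx/edx = quotient/remainder) is
an assumption read off the disassembly, not something the oops states,
and 0x5a is assumed to be the slab-debug poison byte.

```python
# Register values copied from the oops; the variable each one holds is
# an ASSUMPTION inferred from the disassembly around the BUG.
EBP = 0xEBDFE000  # assumed: pointer being freed
ESI = 0x5A5A5A5A  # assumed: base address of the objects in the slab
ECX = 0x00001000  # assumed: object size of the cache (4096 bytes)
EBX = 0x00091858  # assumed: computed object index
EDX = 0x000005A6  # assumed: remainder left over from the divide

POISON_WORD = 0x5A5A5A5A  # four bytes of 0x5a (assumed poison pattern)


def object_index(objp, base, objsize):
    """Return (index, remainder) of objp relative to base, 32-bit wrap."""
    return divmod((objp - base) & 0xFFFFFFFF, objsize)


objnr, rem = object_index(EBP, ESI, ECX)
print(hex(objnr), hex(rem))  # prints: 0x91858 0x5a6

# The recomputed values match ebx and edx in the dump.
print(objnr == EBX and rem == EDX)

# The "base" is nothing but poison bytes, so the index is meaningless.
print(ESI == POISON_WORD)
```

If that reading is right, the freed pointer itself (0xebdfe000) is
page aligned and looks sane, while the slab bookkeeping it was checked
against had already been filled with 0x5a, which fits the first
possibility above (trampled or already-freed allocator headers) better
than the second.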
Steve
>
> Reading Oops report from the terminal
> kernel BUG at slab.c:1218!
> invalid operand: 0000
> CPU: 0
> EIP: 0010:[<c0132aeb>] Tainted: P
> EFLAGS: 00010002
> eax: c1c0e060 ebx: 00091858 ecx: 00001000 edx: 000005a6
> esi: 5a5a5a5a edi: c96c8bcc ebp: ebdfe000 esp: f381fe8c
> ds: 0018 es: 0018 ss: 0018
> Process nfsd (pid: 856, stackpage=f381f000)
> Stack: ebdfe000 c96c8bcc ebdff000 ebdfe000 c0132e33 c1c0e060 c96c8bcc
> ebdfe000
> 0000001b f7ee5720 f7ee5694 c1c0e060 008c0f70 eeafd000 c013360a
> c1c0e060
> f7ee5714 0000001e 00000286 eeafd000 ed79ffff c1c0e4d0 c010cef8
> 000003a0
> Call Trace: [<c0132e33>] [<c013360a>] [<c010cef8>] [<c02558a2>]
> [<c02558bc>]
> [<c0255a36>] [<c02557eb>] [<c0255898>] [<c0255e11>] [<f89d0488>]
> [<f89d1820>]
> [<f89d14d8>] [<f8a252b4>] [<c0107296>] [<f8a251a0>]
>
> Code: 0f 0b c2 04 a0 54 2b c0 89 d8 0f af c1 8d 04 30 39 c5 74 08
>
> Entering kdb (current=0xf381e000, pid 856) on processor 0 Oops: invalid
> operand
> due to oops @ 0xc0132aeb
> eax = 0xc1c0e060 ebx = 0x00091858 ecx = 0x00001000 edx = 0x000005a6
> esi = 0x5a5a5a5a edi = 0xc96c8bcc esp = 0xf381fe8c eip = 0xc0132aeb
> ebp = 0xebdfe000 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010002
> xds = 0xc0250018 xes = 0x00000018 origeax = 0xffffffff &regs =
> 0xf381fe58
> [0]kdb>
> kernel BUG at slab.c:1218!
> invalid operand: 0000
> CPU: 0
> EIP: 0010:[<c0132aeb>] Tainted: P
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010002
> eax: c1c0e060 ebx: 00091858 ecx: 00001000 edx: 000005a6
> esi: 5a5a5a5a edi: c96c8bcc ebp: ebdfe000 esp: f381fe8c
> ds: 0018 es: 0018 ss: 0018
> Process nfsd (pid: 856, stackpage=f381f000)
> Stack: ebdfe000 c96c8bcc ebdff000 ebdfe000 c0132e33 c1c0e060 c96c8bcc
> ebdfe000
> 0000001b f7ee5720 f7ee5694 c1c0e060 008c0f70 eeafd000 c013360a
> c1c0e060
> f7ee5714 0000001e 00000286 eeafd000 ed79ffff c1c0e4d0 c010cef8
> 000003a0
> Call Trace: [<c0132e33>] [<c013360a>] [<c010cef8>] [<c02558a2>]
> [<c02558bc>]
> [<c0255a36>] [<c02557eb>] [<c0255898>] [<c0255e11>] [<f89d0488>]
> [<f89d1820>]
> [<f89d14d8>] [<f8a252b4>] [<c0107296>] [<f8a251a0>]
> Code: 0f 0b c2 04 a0 54 2b c0 89 d8 0f af c1 8d 04 30 39 c5 74 08
>
> >>EIP; c0132aea <kmem_extra_free_checks+2a/70> <=====
> Trace; c0132e32 <free_block+162/210>
> Trace; c013360a <kfree+14a/180>
> Trace; c010cef8 <call_do_IRQ+6/e>
> Trace; c02558a2 <skb_release_data+72/80>
> Trace; c02558bc <kfree_skbmem+c/70>
> Trace; c0255a36 <__kfree_skb+116/120>
> Trace; c02557ea <skb_drop_fraglist+3a/50>
> Trace; c0255898 <skb_release_data+68/80>
> Trace; c0255e10 <skb_linearize+90/f0>
> Trace; f89d0488 <[sunrpc]svc_udp_recvfrom+128/380>
> Trace; f89d1820 <[sunrpc]svc_send+70/1a0>
> Trace; f89d14d8 <[sunrpc]svc_recv+2c8/470>
> Trace; f8a252b4 <[nfsd]nfsd+114/350>
> Trace; c0107296 <kernel_thread+26/30>
> Trace; f8a251a0 <[nfsd]nfsd+0/350>
> Code; c0132aea <kmem_extra_free_checks+2a/70>
> 00000000 <_EIP>:
> Code; c0132aea <kmem_extra_free_checks+2a/70> <=====
> 0: 0f 0b ud2a <=====
> Code; c0132aec <kmem_extra_free_checks+2c/70>
> 2: c2 04 a0 ret $0xa004
> Code; c0132aee <kmem_extra_free_checks+2e/70>
> 5: 54 push %esp
> Code; c0132af0 <kmem_extra_free_checks+30/70>
> 6: 2b c0 sub %eax,%eax
> Code; c0132af2 <kmem_extra_free_checks+32/70>
> 8: 89 d8 mov %ebx,%eax
> Code; c0132af4 <kmem_extra_free_checks+34/70>
> a: 0f af c1 imul %ecx,%eax
> Code; c0132af6 <kmem_extra_free_checks+36/70>
> d: 8d 04 30 lea (%eax,%esi,1),%eax
> Code; c0132afa <kmem_extra_free_checks+3a/70>
> 10: 39 c5 cmp %eax,%ebp
> Code; c0132afc <kmem_extra_free_checks+3c/70>
> 12: 74 08 je 1c <_EIP+0x1c> c0132b06
> <kmem_extra_free_checks+46/70>
>
> Entering kdb (current=0xf381e000, pid 856) on processor 0 Oops: invalid
> operand
> eax = 0xc1c0e060 ebx = 0x00091858 ecx = 0x00001000 edx = 0x000005a6
> esi = 0x5a5a5a5a edi = 0xc96c8bcc esp = 0xf381fe8c eip = 0xc0132aeb
> ebp = 0xebdfe000 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010002
>
>
> -----Original Message-----
> From: I.D.Hardy@xxxxxxxxxxx [mailto:I.D.Hardy@xxxxxxxxxxx]
> Sent: 18 September 2002 21:49
>
> Steve +,
>
> >
> > On Wed, 2002-09-18 at 13:31, Ian D. Hardy wrote:
> > > Steve,
> > >
> > > >On Mon, 2002-09-16 at 11:56, Ian D. Hardy wrote:
> > > >> Steve,
> > > >>
> > > >> Thanks for the quick response. I don't always get an Oops output
> > > >> (sometimes the server just hangs and requires a reboot). However,
> > > >> as it happens, the server has just crashed again with the
> > > >> following Oops (through 'ksymoops'):
> > > >
> > > >This one suggests heap corruption more than anything else.
> > > >
> > > >Steve
> > >
> > > I upgraded the kernel to the current CVS (2.4.19-xfs) tree (as of
> > > Monday 16th Sept.) today and got a very similar looking Oops to the
> > > one I reported on Monday, see below. I guess it is very difficult to
> > > know what would have caused any heap corruption. As I understand it
> > > there's nothing in these latest panics to directly link them with
> > > XFS? I need to do some more thinking on this. Any pointers would be
> > > very welcome.
> > >
> > > Regards Ian Hardy
> > >
> > >
> > > kernel BUG at slab.c:1439!
> >
> >
> > OK, so you have slab debugging turned on, which I was going to ask you
> > to do. Looks like someone walked off the end of an allocation here,
> > that is progress.
> >
> > So, the question is what did it, that's the really hard part! Is this
> > machine just running XFS via NFS, or is it doing anything else? Also,
> > which options do you have turned on in XFS? In fact, sending the whole
> > kernel config might be an idea.
> >
> > Steve
> >
>
> Thanks for your continued help/interest, it is much appreciated.
>
> The server (dual 1GHz PIII, with 1Gbyte memory) is a dedicated NFS
> fileserver (no users have direct access to it). It is serving ~260 NFS
> clients (part of a computational/Beowulf system). The server has a
> 40Gbyte IDE system disk and 2 FC-IDE connected RAID units (each with
> ~500Gbytes usable storage - RAID 5 configuration). The RAID units are
> connected via a Qlogic QLA2200 HBA and a FC switch. The 2 RAID units are
> striped together (RAID 0) using the kernel 'md' RAID driver.
>
> Below is the kernel configuration file:
>
>
> --
>
> Regards and thanks
>
> Ian
>
> --
> Ian Hardy Tel: 023 80593577
> Research Services Fax: 023 80593131
> Information Systems Services email: i.d.hardy@xxxxxxxxxxx
>
> Southampton University
> Southampton S017 1BJ, UK.
>
--
Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: lord@xxxxxxx