
RE: Re-occurance of NFS server panics

To: I.D.Hardy@xxxxxxxxxxx
Subject: RE: Re-occurance of NFS server panics
From: Stephen Lord <lord@xxxxxxx>
Date: 16 Sep 2002 12:16:14 -0500
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <E5CC9E66DAF2D411A0D700B0D079331B41F1F6@xxxxxxxxxxxxxxxxxxxxx>
References: <E5CC9E66DAF2D411A0D700B0D079331B41F1F6@xxxxxxxxxxxxxxxxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
On Mon, 2002-09-16 at 11:56, Ian D. Hardy wrote:
> Steve,
> 
> Thanks for the quick response. I don't always get an Oops output
> (sometimes the server just hangs and requires a reboot). However, as it
> happens, the server has just crashed again with the following Oops
> (run through 'ksymoops'):

This one suggests heap corruption more than anything else: the 0x5a2cf071 in
eax is the slab allocator's red-zone magic, and the ud2a is a BUG() going off
inside kmem_cache_reap, which points at something having scribbled over a slab
object's guard words.
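
Very roughly, the check that fires amounts to something like the sketch below.
This is from memory and purely illustrative (the struct layout and names are
made up, not the real mm/slab.c), but it shows why a neighbour writing past
the end of its allocation ends up as a BUG() inside the allocator rather than
in the guilty code:

    /* Illustrative sketch of a slab red-zone check; not the real mm/slab.c. */
    #define RED_MAGIC1 0x5A2CF071UL        /* guard-word value (the eax above) */
    #define BUG()      __builtin_trap()    /* stands in for the kernel BUG(),
                                              which is a ud2 on i386 */

    struct redzoned_obj {
            unsigned long red_before;      /* guard word in front of the object */
            char          payload[32];     /* memory handed out to the caller */
            unsigned long red_after;       /* guard word behind the object */
    };

    /* Called when the allocator walks its slabs, e.g. while reaping. */
    void check_red_zones(struct redzoned_obj *obj)
    {
            /* If anything wrote outside its own allocation, one of the
             * guard words no longer holds the magic value. */
            if (obj->red_before != RED_MAGIC1 || obj->red_after != RED_MAGIC1)
                    BUG();                 /* this is the ud2a in the oops */
    }

So the oops in kswapd is just where the corruption got noticed, not
necessarily where it happened.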

Steve

> 
> EFLAGS: 00010016
> eax: 5a2cf071   ebx: 003a0dc0   ecx: f7ee3dd0   edx: c200b060
> esi: ce837000   edi: ce8372b4   ebp: ce837ce0   esp: f7ee1f38
> ds: 0018   es: 0018   ss: 0018
> Process kswapd (pid: 5, stackpage=f7ee1000)
> Stack: f7ee0000 00000009 f7ee3dd0 f7ee3c00 00000000 00000007 00000000 00000000
>        00000000 c200b060 00000020 000001d0 00000006 00000000 c01325d9 c036aa70
>        00000006 000001d0 c036aa70 00000000 c013268c 00000020 c036aa70 00000002
> Call Trace: [<c01325d9>] [<c013268c>] [<c0132731>] [<c01327a6>] [<c01328e1>]
>    [<c0132840>] [<c0105000>] [<c0105836>] [<c0132840>]
> Code: 0f 0b 8b 44 24 24 89 ea 8b 48 18 b8 71 f0 2c 5a 01 ca 87 42
> 
> >>EIP; c013127e <kmem_cache_reap+1ae/460>   <=====
> Trace; c01325d8 <shrink_caches+18/90>
> Trace; c013268c <try_to_free_pages+3c/60>
> Trace; c0132730 <kswapd_balance_pgdat+50/a0>
> Trace; c01327a6 <kswapd_balance+26/40>
> Trace; c01328e0 <kswapd+a0/ba>
> Trace; c0132840 <kswapd+0/ba>
> Trace; c0105000 <_stext+0/0>
> Trace; c0105836 <kernel_thread+26/30>
> Trace; c0132840 <kswapd+0/ba>
> Code;  c013127e <kmem_cache_reap+1ae/460>
> 00000000 <_EIP>:
> Code;  c013127e <kmem_cache_reap+1ae/460>   <=====
>    0:   0f 0b                     ud2a      <=====
> Code;  c0131280 <kmem_cache_reap+1b0/460>
>    2:   8b 44 24 24               mov    0x24(%esp,1),%eax
> Code;  c0131284 <kmem_cache_reap+1b4/460>
>    6:   89 ea                     mov    %ebp,%edx
> Code;  c0131286 <kmem_cache_reap+1b6/460>
>    8:   8b 48 18                  mov    0x18(%eax),%ecx
> Code;  c0131288 <kmem_cache_reap+1b8/460>
>    b:   b8 71 f0 2c 5a            mov    $0x5a2cf071,%eax
> Code;  c013128e <kmem_cache_reap+1be/460>
>   10:   01 ca                     add    %ecx,%edx
> Code;  c0131290 <kmem_cache_reap+1c0/460>
>   12:   87 42 00                  xchg   %eax,0x0(%edx)
> 
> Entering kdb (current=0xf7ee0000, pid 5) on processor 1 Oops: invalid operand
> eax = 0x5a2cf071 ebx = 0x003a0dc0 ecx = 0xf7ee3dd0 edx = 0xc200b060
> esi = 0xce837000 edi = 0xce8372b4 esp = 0xf7ee1f38 eip = 0xc013127e
> ebp = 0xce837ce0 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010016
> 
> ->======================================================================
> 
> This is a different trace from the last one I posted; this new one makes
> no direct reference to XFS, which may mean that it is not XFS related (in
> which case apologies....) or that another, non-XFS process (kswapd) ran
> into difficulty obtaining kernel memory as a result of XFS/extent issues.
> I seem to remember that there was an issue with the XFS/KDB implementation
> around the time of the CVS kernel I'm currently using. (No, the server
> shouldn't be running out of memory: it has 1 Gbyte of RAM, does nothing
> but NFS serving, and normally has ~957684K cached.)
> 
> If I remember correctly, we have tried the fix you refer to regarding
> memory getting freed to the wrong pool:
> 
> > 
> > Only happens under high memory pressure, Ian, I think this was
> > your original oops.
> > 
> > Date:  Wed Mar 20 11:06:42 PST 2002
> > Workarea:  jen.americas.sgi.com:/src/lord/xfs-baseline
> > 
> > The following file(s) were checked into:
> >   bonnie.engr.sgi.com:/isms/slinx/2.4.x-xfs
> > 
> > Modid:  2.4.x-xfs:slinx:114531a
> > linux/fs/xfs_support/kmem.c - 1.22
> >         - Fix the case where we used vmalloc to allocate memory under pressure,
> >           we need to free it that way
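
The shape of that fix, paraphrased from memory rather than being the literal
kmem.c diff: the wrapper has to be able to tell whether an allocation came
from kmalloc or fell back to vmalloc, because handing a vmalloc'd pointer to
kfree() stomps on the slab allocator's bookkeeping. The function names below
are illustrative, but the real code keys off the pointer's address range in
much the same way:

    /* Sketch of the xfs_support kmem wrapper behaviour; not the real code. */
    #include <linux/slab.h>
    #include <linux/vmalloc.h>
    #include <asm/pgtable.h>    /* VMALLOC_START / VMALLOC_END */

    void *kmem_alloc_sketch(size_t size)
    {
            void *ptr = kmalloc(size, GFP_KERNEL);

            /* Large or pressure-time allocations can fail to get physically
             * contiguous memory; fall back to vmalloc. */
            if (ptr == NULL)
                    ptr = vmalloc(size);
            return ptr;
    }

    void kmem_free_sketch(void *ptr)
    {
            /* The bug was freeing everything with kfree(); anything that
             * lives in the vmalloc window has to go back through vfree(),
             * otherwise the slab heap gets corrupted. */
            if ((unsigned long)ptr >= VMALLOC_START &&
                (unsigned long)ptr < VMALLOC_END)
                    vfree(ptr);
            else
                    kfree(ptr);
    }

Hard to say from the kswapd trace alone whether this is what is biting you
now, but the symptom would look like exactly that kind of slab corruption.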
> 
> Anyway, I think I need to get the current 2.4.19-XFS tree installed; even
> if this doesn't fix the problem, it will give us a better base to work
> from.
> 
> Regards
> 
> Ian Hardy 
> 
> -----Original Message-----
> From: Stephen Lord [mailto:lord@xxxxxxx] 
> Sent: 16 September 2002 16:36
> To: I.D.Hardy@xxxxxxxxxxx
> Cc: linux-xfs@xxxxxxxxxxx
> Subject: RE: Re-occurance of NFS server panics
> 
> 
> On Mon, 2002-09-16 at 09:52, Ian D. Hardy wrote:
> > Steve +,
> > 
> > Sorry to bother you again. You may remember that we've corresponded
> > several times over the past ~9 months with regard to kernel memory
> > allocation problems and fragmented files (see below).
> > 
> > We had a period of relative stability; however, over the last few weeks
> > we have gone back to having one or more crashes/hangs every week and are
> > now having to review our continued use of XFS again. Therefore any
> > update on progress towards a fix for these problems would be very useful
> > (I'd hate to go through the pain of converting our ~1 Tbyte filesystem
> > to ReiserFS or ext3 if fixes are imminent).
> > 
> > We have been running a 2.4.18 XFS CVS kernel from mid-May for some time
> > now; I'm just in the process of compiling and testing the current 2.4.19
> > XFS CVS. Is this likely to help? (Looking through the list archive I
> > can't find anything of direct relevance, but I may have missed
> > something.)
> > 
> > We appear to be running at a lower overall filesystem fragmentation
> > level now (currently 13%; in the past it has been 28% or more), though I
> > guess it is possible for just a couple of large, very fragmented files
> > to cause kernel memory allocation problems while the overall FS
> > fragmentation level stays reasonably low?
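
It does not take many. The in-core extent list is one record per extent (16
bytes apiece, if I remember right) and it has to live in a single contiguous
buffer, so one badly fragmented file can push the allocation well past what
kmalloc will hand out, which is exactly when the vmalloc fallback and the
realloc path in your earlier trace come into play. Back-of-the-envelope
(userspace, sizes from memory):

    #include <stdio.h>

    int main(void)
    {
            unsigned long nextents = 250000;        /* one badly fragmented file */
            unsigned long bytes    = nextents * 16; /* ~16 bytes per extent record */

            printf("%lu extents -> %lu KB contiguous\n", nextents, bytes / 1024);
            /* ~3906 KB, far beyond the ~128 KB kmalloc limit in 2.4. */
            return 0;
    }

So the overall fragmentation percentage can look fine while a handful of
files still drive very large extent-list allocations.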
> > 
> > Unfortunately the NFS load on our server is such that it is
> > difficult/impossible to predict times of light NFS load in which to run
> > fsr, and as reported before, we've had several incidents of filesystem
> > corruption and of the kernel taking the FS offline when running fsr
> > under an NFS load.
> > 
> > Thanks for your time. (BTW: we've persevered with XFS for so long
> > because it seems to give better performance for our workload than ext3
> > or ReiserFS; however, stability is again becoming a problem.)
> > 
> 
> Nothing immediately rings a bell. There have been some recent changes
> which fixed some hangs HP was seeing while doing large-scale NFS
> benchmarking; these might be beneficial to you. The last oops output I
> have from you looked like this:
> 
> >>EIP; c012ff76 <kfree+66/14c>   <=====
> Trace; c01fd146 <kmem_free+22/28>
> Trace; c01fd1a4 <kmem_realloc+58/68>
> Trace; c01d04f0 <xfs_iext_realloc+f0/108>
> Trace; c01a6996 <xfs_bmap_delete_exlist+6a/74>
> Trace; c01a5f12 <xfs_bmap_del_extent+58a/f68>
> Trace; c01d6474 <xlog_state_do_callback+2a4/2ec>
> Trace; c01fd2e0 <kmem_zone_zalloc+44/d0>
> Trace; c01aac14 <xfs_bunmapi+b78/fd0>
> Trace; c01cf962 <xfs_itruncate_finish+23e/3e0>
> Trace; c01e6e22 <xfs_setattr+ae2/f7c>
> Trace; c01e6340 <xfs_setattr+0/f7c>
> Trace; c026cf44 <qdisc_restart+14/178>
> Trace; c01f696e <linvfs_setattr+152/17c>
> Trace; c01e6340 <xfs_setattr+0/f7c>
> Trace; c014f45c <notify_change+7c/2a4>
> Trace; f8d2e972 <[nfsd]nfsd_setattr+3ea/524>
> Trace; f8d33f7a <[nfsd]nfsd3_proc_setattr+b6/c4>
> Trace; f8d3b4a0 <[nfsd]nfsd_procedures3+40/2c0>
> Trace; f8d2b5d2 <[nfsd]nfsd_dispatch+d2/19a>
> Trace; f8d3b4a0 <[nfsd]nfsd_procedures3+40/2c0>
> Trace; f8cf6f88 <[sunrpc]svc_process+28c/51c>
> Trace; f8d3b400 <[nfsd]nfsd_svcstats+0/40>
> Trace; f8d3aed8 <[nfsd]nfsd_version3+0/10>
> Trace; f8d2b348 <[nfsd]nfsd+1b8/370>
> Trace; c01057ea <kernel_thread+22/30>
> Code;  c012ff76 <kfree+66/14c>
> 00000000 <_EIP>:
> Code;  c012ff76 <kfree+66/14c>   <=====
>    0:   0f 0b                     ud2a      <=====
> Code;  c012ff78 <kfree+68/14c>
>    2:   83 c4 08                  add    $0x8,%esp
> Code;  c012ff7a <kfree+6a/14c>
>    5:   8b 15 2c 95 3f c0         mov    0xc03f952c,%edx
> Code;  c012ff80 <kfree+70/14c>
>    b:   8b 2c 1a                  mov    (%edx,%ebx,1),%ebp
> Code;  c012ff84 <kfree+74/14c>
>    e:   89 7c 24 14               mov    %edi,0x14(%esp,1)
> Code;  c012ff88 <kfree+78/14c>
>   12:   b8 00 00 00 00            mov    $0x0,%eax
> 
> Is this still what you see? 
>  
> This one does bear the symptoms of a problem which was fixed a while ago
> - memory was freed into the wrong pool.
> 
> Steve
> 
> 


