
To: Steve Lord <lord@xxxxxxx>, linux-xfs@xxxxxxxxxxx
Subject: Re: Re-occurrence of NFS server panics
From: "Ian D. Hardy" <i.d.hardy@xxxxxxxxxxx>
Date: Wed, 26 Jun 2002 18:36:03 +0100
Cc: idh@xxxxxxxxxxx, oz@xxxxxxxxxxx
Organization: University of Southampton
References: <200203201906.g2KJ69q10974@xxxxxxxxxxxxxxxxxxxx> <3CB5B736.3F588C69@xxxxxxxxxxx> <1018964322.24401.0.camel@xxxxxxxxxxxxxxxxxxxx> <3CFFABF9.5BFD2B80@xxxxxxxxxxx>
Sender: owner-linux-xfs@xxxxxxxxxxx
Steve ++ Colleagues,

Sorry to bother you (I understand that you're busy and
short-staffed), but it would be useful to get some feedback on
the problems/issues I raised a couple of weeks ago (I did
note that you mentioned continuing problems due to
fragmentation in another thread a few days ago). Do you
have any idea if/when it should be possible to fix this
problem? (I feel bad asking, but I'm getting pressure to
look again at alternatives ... which I'd rather not do, as I'm
sure they have their own problems!)

FYI: in the last ~20 days we've had another panic that looked
like another memory-allocation error (I was on leave, so didn't
get the full details), plus a couple of system lockups (high
load average and failure to serve files), though these are
possibly not related. We reduced the load by introducing another
server/filesystem (reiserfs!!) and moving some users onto that.
Today we had some scheduled maintenance time and did an offline
defrag of the XFS filesystem, bringing it down from ~28% to <1%.
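
For reference, this is roughly the procedure we used during the
maintenance window (a sketch only; the device name /dev/md0 and the
mount point /export are my stand-ins here, based on the md(9,0)
device in the logs):

  # check the fragmentation factor (read-only, safe while mounted)
  xfs_db -r -c frag /dev/md0

  # reorganise the files; -v reports each file as it is defragmented
  xfs_fsr -v /export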

Is there anything that I can do (remember, I'm not a kernel
writer/expert) to help? Are there any further diagnostics that
would be useful?

Again many thanks for your help.

Ian Hardy

"Ian D. Hardy" wrote:
> 
> Steve, ++++
> 
> Some bad news: you may remember that a couple of months ago I was
> having problems with an NFS server that kept panicking. You
> diagnosed a number of potential problems and patches, and these
> seemed to help (many thanks)! Indeed, this did work for ~45
> days, but recently we started to get a recurrence of these
> failures (again with crashes roughly every 4-6 days).
> 
> I then remembered that one of your suggestions/fixes concerned
> potential problems with high levels of filesystem fragmentation
> (at the time I ran 'xfs_fsr' to defrag the filesystem, and you
> also introduced a change to the CVS tree intended to help
> prevent these crashes). Anyway, I then noticed that at the
> time my crashes started to recur, the filesystem
> fragmentation had increased to ~60% (!!!) as reported via
> 'xfs_db' (some files having >3000 extents). Is it possible that
> there is still an issue in the XFS kernel (I'm currently using a
> kernel compiled from the CVS tree as of 17th May)? Oops/ksymoops
> output below.
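> 
> For reference, this is roughly how I counted extents per file (the
> path below is an example, not one of our real files):
> 
>   # one line of output per extent, after the filename header line
>   xfs_bmap /home/someuser/app.log | tail -n +2 | wc -l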
> 
> I've run xfs_fsr to defrag the filesystem (getting it back
> down to <1%); however, I've (again) had a couple of instances
> in which, shortly after running 'xfs_fsr', the filesystem has
> gone off-line. The system log shows:
> 
> Jun  6 07:53:18 blue00 kernel: xfs_force_shutdown(md(9,0),0x8) called from line 1039 of file xfs_trans.c.  Return address = 0xc01e1fb9
> Jun  6 07:53:18 blue00 kernel: Corruption of in-memory data detected.  Shutting down filesystem: md(9,0)
> 
> This was on an active filesystem. Is it possible that there
> is an interaction between xfs_fsr and NFS? (I'd guess that xfs_fsr
> may have a harder time detecting that an NFS client is accessing
> a file than it does for local file access?)
> 
> Our current intention, now we've defragmented the filesystem, is to
> see how it goes for a couple of weeks; if it seems better we will
> then schedule regular (every 2-4 weeks) maintenance periods to run
> 'xfs_fsr' on the filesystem without other filesystem activity. Does
> this seem sensible to you?
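> 
> If we automate it, I have in mind something like the following
> crontab entry (the mount point, log path, and two-hour limit are
> illustrative only):
> 
>   # run xfs_fsr for at most 7200s in a Sunday 03:00 window
>   0 3 * * 0  /usr/sbin/xfs_fsr -t 7200 -v /export >> /var/log/xfs_fsr.log 2>&1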
> 
> One question: if there is a problem associated with high levels
> of fragmentation, is it overall high levels of fragmentation within
> the filesystem that cause the failure, or could it potentially be a
> single file with a large number of extents? (I believe that we are
> getting some highly fragmented files, as we have a number of users/
> applications that frequently open, append to, and then close a log file;
> these files soon get fragmented, as the close releases any pre-allocated
> blocks.)
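> 
> A quick way to see that effect in isolation (illustrative only;
> /export/tmp is a made-up path on the XFS filesystem):
> 
>   # each '>>' redirection is a separate open/append/close cycle, so
>   # any speculative preallocation is dropped at each close
>   for i in `seq 1 1000`; do echo "entry $i" >> /export/tmp/frag-test.log; done
>   xfs_bmap /export/tmp/frag-test.log | tail -n +2 | wc -l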
> 
> For various reasons I've not managed to capture the Oops output from
> most of the recent crashes, but here's one (passed through ksymoops)
> that I did get. It may help to identify the problem (it does look to me
> as though it is related to filesystem extents/fragmentation, given
> 'xfs_iext_realloc' in the trace, though I'm not a kernel code expert!).
> 
> Using defaults from ksymoops -t elf32-i386 -a i386
> invalid operand: 0000
> CPU:    1
> EIP:    0010:[<c012ff76>]    Not tainted
> EFLAGS: 00010086
> eax: 0000001d   ebx: 00e350c0   ecx: 0000002e   edx: 00000026
> esi: f8d4b000   edi: f8d43000   ebp: 00010000   esp: f3ba5a44
> ds: 0018   es: 0018   ss: 0018
> Process nfsd (pid: 615, stackpage=f3ba5000)
> Stack: c02b4082 f8d43000 00008000 f8d4b000 c0e20000 00010000 00000286 f8d43000
>        c01fd146 f8d43000 c01fd1a4 f8d43000 00010000 f4a5224c f8d43000 f8d4b000
>        00008000 c0e18000 c01d04f1 f8d43000 00008000 00010000 00000001 ffffffff
> Call Trace: [<c01fd146>] [<c01fd1a4>] [<c01d04f1>] [<c01a6996>] [<c01a5f12>]
>    [<c01d6474>] [<c01fd2e0>] [<c01aac15>] [<c01cf963>] [<c01e6e23>] [<c01e6340>]
>    [<c026cf44>] [<c01f696f>] [<c01e6340>] [<c014f45c>] [<f8d2e973>] [<f8d33f7b>]
>    [<f8d3b4a0>] [<f8d2b5d3>] [<f8d3b4a0>] [<f8cf6f89>] [<f8d3b400>] [<f8d3aed8>]
>    [<f8d2b349>] [<c01057eb>]
> Code: 0f 0b 83 c4 08 8b 15 2c 95 3f c0 8b 2c 1a 89 7c 24 14 b8 00
> 
> >>EIP; c012ff76 <kfree+66/14c>   <=====
> Trace; c01fd146 <kmem_free+22/28>
> Trace; c01fd1a4 <kmem_realloc+58/68>
> Trace; c01d04f0 <xfs_iext_realloc+f0/108>
> Trace; c01a6996 <xfs_bmap_delete_exlist+6a/74>
> Trace; c01a5f12 <xfs_bmap_del_extent+58a/f68>
> Trace; c01d6474 <xlog_state_do_callback+2a4/2ec>
> Trace; c01fd2e0 <kmem_zone_zalloc+44/d0>
> Trace; c01aac14 <xfs_bunmapi+b78/fd0>
> Trace; c01cf962 <xfs_itruncate_finish+23e/3e0>
> Trace; c01e6e22 <xfs_setattr+ae2/f7c>
> Trace; c01e6340 <xfs_setattr+0/f7c>
> Trace; c026cf44 <qdisc_restart+14/178>
> Trace; c01f696e <linvfs_setattr+152/17c>
> Trace; c01e6340 <xfs_setattr+0/f7c>
> Trace; c014f45c <notify_change+7c/2a4>
> Trace; f8d2e972 <[nfsd]nfsd_setattr+3ea/524>
> Trace; f8d33f7a <[nfsd]nfsd3_proc_setattr+b6/c4>
> Trace; f8d3b4a0 <[nfsd]nfsd_procedures3+40/2c0>
> Trace; f8d2b5d2 <[nfsd]nfsd_dispatch+d2/19a>
> Trace; f8d3b4a0 <[nfsd]nfsd_procedures3+40/2c0>
> Trace; f8cf6f88 <[sunrpc]svc_process+28c/51c>
> Trace; f8d3b400 <[nfsd]nfsd_svcstats+0/40>
> Trace; f8d3aed8 <[nfsd]nfsd_version3+0/10>
> Trace; f8d2b348 <[nfsd]nfsd+1b8/370>
> Trace; c01057ea <kernel_thread+22/30>
> Code;  c012ff76 <kfree+66/14c>
> 00000000 <_EIP>:
> Code;  c012ff76 <kfree+66/14c>   <=====
>    0:   0f 0b                     ud2a      <=====
> Code;  c012ff78 <kfree+68/14c>
>    2:   83 c4 08                  add    $0x8,%esp
> Code;  c012ff7a <kfree+6a/14c>
>    5:   8b 15 2c 95 3f c0         mov    0xc03f952c,%edx
> Code;  c012ff80 <kfree+70/14c>
>    b:   8b 2c 1a                  mov    (%edx,%ebx,1),%ebp
> Code;  c012ff84 <kfree+74/14c>
>    e:   89 7c 24 14               mov    %edi,0x14(%esp,1)
> Code;  c012ff88 <kfree+78/14c>
>   12:   b8 00 00 00 00            mov    $0x0,%eax
> 
> 
> Many thanks for your past (and continued) support. I'll be
> away from the 9th to the 23rd of June; I'd therefore be
> grateful if you could copy any replies to my colleague Oz
> Parchment at 'O.G.Parchment@xxxxxxxxxxx'.
> 
> Thanks again.
> 
> Ian Hardy
> 
> Steve Lord wrote:
> >
> > On Thu, 2002-04-11 at 11:17, Ian D. Hardy wrote:
> > > Steve, +++
> > >
> > > Good news! (though posting this is tempting fate a bit). Since
> > > installing the XFS/CVS containing this fix, my server has now
> > > been up for 21+ days; this is a record (its average was ~4 days,
> > > but it did manage up to 14 days).
> > >
> > > I also applied the 'vnode.patch' that you posted in response
> > > to my problems on 6th March; as far as I'm aware, that has not
> > > gone into the CVS tree? Is this still a valid patch? My
> > > understanding is that I'd probably seen at least these two bugs
> > > at various times.
> >
> > The vnode code changes should be in the tree as well.
> >
> > >
> > > Many thanks for all your help.
> >
> > Thanks for persevering with xfs!
> >
> > Steve
> >
> > --
> >
> > Steve Lord                                      voice: +1-651-683-3511
> > Principal Engineer, Filesystem Software         email: lord@xxxxxxx
> 
> --
> 
> /////////////Technical Coordination, Research Services////////////////////
> Ian Hardy                                   Tel: 023 80 593577
> Computing Services
> Southampton University                      email: idh@xxxxxxxxxxx
> Southampton  SO17 1BJ, UK.                         i.d.hardy@xxxxxxxxxxx
> \\'BUGS: The notion of errors is ill-defined' (IRIX man page for netstat)\

-- 

/////////////Technical Coordination, Research Services////////////////////
Ian Hardy                                   Tel: 023 80 593577
Computing Services                          Mobile: 0709 2127503    
Southampton University                      email: idh@xxxxxxxxxxx
Southampton  SO17 1BJ, UK.                         i.d.hardy@xxxxxxxxxxx
\\'BUGS: The notion of errors is ill-defined' (IRIX man page for netstat)\

