
Recurrence of NFS server panics

To: Steve Lord <lord@xxxxxxx>, linux-xfs@xxxxxxxxxxx
Subject: Recurrence of NFS server panics
From: "Ian D. Hardy" <i.d.hardy@xxxxxxxxxxx>
Date: Thu, 06 Jun 2002 19:37:45 +0100
Cc: idh@xxxxxxxxxxx, oz@xxxxxxxxxxx
Organization: University of Southampton
References: <200203201906.g2KJ69q10974@xxxxxxxxxxxxxxxxxxxx> <3CB5B736.3F588C69@xxxxxxxxxxx> <1018964322.24401.0.camel@xxxxxxxxxxxxxxxxxxxx>
Sender: owner-linux-xfs@xxxxxxxxxxx
Steve, ++++

Some bad news: you may remember that a couple of months ago I was
having problems with an NFS server that kept panicking. You
diagnosed a number of potential problems and supplied patches, and
these seemed to help (many thanks!); indeed the server ran for ~45
days, but recently these failures have started to recur (again
with crashes roughly every 4-6 days).

I then remembered that one of your suggestions/fixes concerned
potential problems with high levels of filesystem fragmentation
(at the time I ran 'xfs_fsr' to defrag the filesystem, and you
also introduced a change to the CVS tree intended to help
prevent these crashes). Anyway, I have since noticed that around the
time the crashes started to recur, the filesystem fragmentation
had increased to ~60% (!!!) as reported via 'xfs_db', with some
files having >3000 extents. Is it possible that there is still an
issue in the XFS kernel (I'm currently using a kernel compiled
from the CVS tree as of 17th May)? Oops/ksymoops output is below.
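
For reference, the figures above come from commands roughly like
the following (the device and file names are just placeholders
for our setup):

  # overall fragmentation factor, via a read-only open of the device
  xfs_db -r -c frag /dev/md0

  # extent map of one suspect file; each numbered line is one extent
  xfs_bmap -v /data/users/app.log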

I've run xfs_fsr to defrag the filesystem (getting fragmentation
back down to <1%); however, I've again had a couple of instances
in which, shortly after running 'xfs_fsr', the filesystem has
gone offline. The system log shows:

Jun  6 07:53:18 blue00 kernel: xfs_force_shutdown(md(9,0),0x8) called from line 1039 of file xfs_trans.c.  Return address = 0xc01e1fb9
Jun  6 07:53:18 blue00 kernel: Corruption of in-memory data detected.  Shutting down filesystem: md(9,0)

This was on an active filesystem. Is it possible that there
is an interaction between xfs_fsr and NFS? (I'd guess that xfs_fsr
may have a harder time detecting that an NFS client is accessing
a file than it does for local file access.)

Our current intention, now that we've defragged the filesystem, is
to see how it goes for a couple of weeks. If it seems better, we
will then schedule regular (every 2-4 weeks) maintenance periods to
run 'xfs_fsr' on the filesystem without other filesystem activity.
Does this seem sensible to you?
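
For what it's worth, the sort of maintenance run I have in mind
looks roughly like this (the mount point and the two-hour limit
are just examples for our setup):

  # quiesce NFS clients for the maintenance window
  exportfs -ua

  # reorganise the whole filesystem, verbosely, for at most two hours
  xfs_fsr -v -t 7200 /export

  # re-export the filesystems to the clients
  exportfs -a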

One question: if there is a problem associated with high levels
of fragmentation, is it the overall level of fragmentation within
the filesystem that causes the failure, or could it potentially be a
single file with a large number of extents? (I believe we are
getting some highly fragmented files, as we have a number of users/
applications that frequently open, append to, and then close a log
file; these files soon become fragmented since the close releases
any pre-allocated blocks.)
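
If it is a single-file issue, presumably something like the
following would let us reorganise just the worst log files rather
than the whole filesystem (the path is only an example):

  # defragment a single heavily-appended log file in place
  xfs_fsr -v /data/users/app.log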

For various reasons I've not managed to capture the Oops output from
most of the recent crashes, but here's one (passed through ksymoops)
that I did get. It may help to identify the problem; it does look to
me as though it is related to filesystem extents/fragmentation, given
'xfs_iext_realloc' in the trace, though I'm not a kernel code expert!

invalid operand: 0000
CPU:    1
EIP:    0010:[<c012ff76>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010086
eax: 0000001d   ebx: 00e350c0   ecx: 0000002e   edx: 00000026
esi: f8d4b000   edi: f8d43000   ebp: 00010000   esp: f3ba5a44
ds: 0018   es: 0018   ss: 0018
Process nfsd (pid: 615, stackpage=f3ba5000)
Stack: c02b4082 f8d43000 00008000 f8d4b000 c0e20000 00010000 00000286 f8d43000 
       c01fd146 f8d43000 c01fd1a4 f8d43000 00010000 f4a5224c f8d43000 f8d4b000 
       00008000 c0e18000 c01d04f1 f8d43000 00008000 00010000 00000001 ffffffff 
Call Trace: [<c01fd146>] [<c01fd1a4>] [<c01d04f1>] [<c01a6996>] [<c01a5f12>]
   [<c01d6474>] [<c01fd2e0>] [<c01aac15>] [<c01cf963>] [<c01e6e23>] [<c01e6340>]
   [<c026cf44>] [<c01f696f>] [<c01e6340>] [<c014f45c>] [<f8d2e973>] [<f8d33f7b>]
   [<f8d3b4a0>] [<f8d2b5d3>] [<f8d3b4a0>] [<f8cf6f89>] [<f8d3b400>] [<f8d3aed8>]
   [<f8d2b349>] [<c01057eb>]
Code: 0f 0b 83 c4 08 8b 15 2c 95 3f c0 8b 2c 1a 89 7c 24 14 b8 00 

>>EIP; c012ff76 <kfree+66/14c>   <=====
Trace; c01fd146 <kmem_free+22/28>
Trace; c01fd1a4 <kmem_realloc+58/68>
Trace; c01d04f0 <xfs_iext_realloc+f0/108>
Trace; c01a6996 <xfs_bmap_delete_exlist+6a/74>
Trace; c01a5f12 <xfs_bmap_del_extent+58a/f68>
Trace; c01d6474 <xlog_state_do_callback+2a4/2ec>
Trace; c01fd2e0 <kmem_zone_zalloc+44/d0>
Trace; c01aac14 <xfs_bunmapi+b78/fd0>
Trace; c01cf962 <xfs_itruncate_finish+23e/3e0>
Trace; c01e6e22 <xfs_setattr+ae2/f7c>
Trace; c01e6340 <xfs_setattr+0/f7c>
Trace; c026cf44 <qdisc_restart+14/178>
Trace; c01f696e <linvfs_setattr+152/17c>
Trace; c01e6340 <xfs_setattr+0/f7c>
Trace; c014f45c <notify_change+7c/2a4>
Trace; f8d2e972 <[nfsd]nfsd_setattr+3ea/524>
Trace; f8d33f7a <[nfsd]nfsd3_proc_setattr+b6/c4>
Trace; f8d3b4a0 <[nfsd]nfsd_procedures3+40/2c0>
Trace; f8d2b5d2 <[nfsd]nfsd_dispatch+d2/19a>
Trace; f8d3b4a0 <[nfsd]nfsd_procedures3+40/2c0>
Trace; f8cf6f88 <[sunrpc]svc_process+28c/51c>
Trace; f8d3b400 <[nfsd]nfsd_svcstats+0/40>
Trace; f8d3aed8 <[nfsd]nfsd_version3+0/10>
Trace; f8d2b348 <[nfsd]nfsd+1b8/370>
Trace; c01057ea <kernel_thread+22/30>
Code;  c012ff76 <kfree+66/14c>
00000000 <_EIP>:
Code;  c012ff76 <kfree+66/14c>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  c012ff78 <kfree+68/14c>
   2:   83 c4 08                  add    $0x8,%esp
Code;  c012ff7a <kfree+6a/14c>
   5:   8b 15 2c 95 3f c0         mov    0xc03f952c,%edx
Code;  c012ff80 <kfree+70/14c>
   b:   8b 2c 1a                  mov    (%edx,%ebx,1),%ebp
Code;  c012ff84 <kfree+74/14c>
   e:   89 7c 24 14               mov    %edi,0x14(%esp,1)
Code;  c012ff88 <kfree+78/14c>
  12:   b8 00 00 00 00            mov    $0x0,%eax

 
Many thanks for your past (and continued) support. I'll be
away from the 9th to the 23rd of June, so I'd be
grateful if you could copy any replies to my colleague Oz
Parchment at 'O.G.Parchment@xxxxxxxxxxx'.

Thanks again.

Ian Hardy


Steve Lord wrote:
> 
> On Thu, 2002-04-11 at 11:17, Ian D. Hardy wrote:
> > Steve, +++
> >
> > Good news! (though posting this is tempting fate a bit). Since
> > installing the XFS/CVS containing this fix my server has now
> > been up for 21+ days, which is a record (its average was ~4 days
> > but it did manage up to 14 days).
> >
> > I also applied the 'vnode.patch' that you posted in response
> > to my problems on 6th March, as far as I'm aware that has not
> > gone into the CVS tree? Is this still a valid patch? My
> > understanding is that I'd probably seen at least these two bugs
> > at various times?
> 
> The vnode code changes should be in the tree as well.
> 
> >
> > Many thanks for all your help.
> 
> Thanks for persevering with xfs!
> 
> Steve
> 
> --
> 
> Steve Lord                                      voice: +1-651-683-3511
> Principal Engineer, Filesystem Software         email: lord@xxxxxxx

-- 

/////////////Technical Coordination, Research Services////////////////////
Ian Hardy                                   Tel: 023 80 593577
Computing Services                              
Southampton University                      email: idh@xxxxxxxxxxx
Southampton  SO17 1BJ, UK.                         i.d.hardy@xxxxxxxxxxx
\\'BUGS: The notion of errors is ill-defined' (IRIX man page for netstat)\

