It's an old server. I'm willing to try (and have already compiled) 2.4.31
(unmodified source from kernel.org), but from some searches of the archive
and SGI's bugzilla, this looks like a long-standing unresolved bug
affecting XFS in both 2.4 and 2.6 kernels. At this point, I'm mostly
curious whether anyone has ideas for solutions/workarounds, or if it's time
to simply abandon use of XFS.
http://oss.sgi.com/bugzilla/show_bug.cgi?id=375
http://oss.sgi.com/bugzilla/show_bug.cgi?id=272
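
For what it's worth, the versions involved here are easy to confirm; this
is what I'm looking at (the rpm query just assumes xfsprogs was installed
as an RPM, which it is on this Red Hat box):

  uname -r          # running kernel: 2.4.20-35_39.rh8.0.atsmp
  xfs_repair -V     # repair tool from xfsprogs; reports version 2.6.9 here
  rpm -q xfsprogs   # packaged xfsprogs version
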
On Thu, 16 Jun 2005, Net Llama! wrote:
> You realize you're using a truly ancient kernel, right? Are you at least
> using a relatively current version of xfs-progs?
>
> On Thu, 16 Jun 2005, Jon Lewis wrote:
>
> > This xfs_repair did eventually finish after spinning in Phase 5 for a
> > couple hours. Unfortunately, it doesn't appear to have done us any good,
> > and the system is still doing
> >
> > xfs_force_shutdown(md(9,2),0x8) called from line 1071 of file
> > xfs_trans.c. Return address = 0xf8dbe6eb
> > Filesystem "md(9,2)": Corruption of in-memory data detected. Shutting down
> > filesystem: md(9,2)
> > Please umount the filesystem, and rectify the problem(s)
> >
> > pretty frequently. I've tried running a non-SMP kernel, but that didn't
> > help. Next, I decreased the read/write block sizes for the NFS clients
> > mounting this XFS filesystem. I don't know yet whether that makes any
> > difference.
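> >
> > (The client-side change is just the standard NFS rsize/wsize mount
> > options, e.g. in each client's /etc/fstab; the hostname, export path,
> > and the 8k values below are only placeholders:
> >
> >   fileserver:/export  /mnt/export  nfs  rsize=8192,wsize=8192,hard,intr  0 0
> >
> > followed by an umount/mount on each client so the new sizes take effect.)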
> >
> > On Wed, 15 Jun 2005, Jon Lewis wrote:
> >
> > > After having the system crash twice today with messages like the following
> > > (from the first crash):
> > >
> > > xfs_iget_core: ambiguous vns: vp/0xc6f0e680, invp/0xecbed200
> > > ------------[ cut here ]------------
> > > kernel BUG at debug.c:106!
> > > invalid operand: 0000
> > > nfsd lockd sunrpc autofs eepro100 mii ipt_REJECT iptable_filter ip_tables
> > > xfs raid5 xor ext3 jbd raid1 isp_mod sd_mod scsi_mod
> > > CPU: 1
> > > EIP: 0010:[<f8dbf16e>] Not tainted
> > > EFLAGS: 00010246
> > >
> > > EIP is at cmn_err [xfs] 0x9e (2.4.20-35_39.rh8.0.atsmp)
> > > eax: 00000000 ebx: 00000000 ecx: 00000096 edx: 00000001
> > > esi: f8dd9412 edi: f8dec63e ebp: 00000293 esp: f5d2bd44
> > > ds: 0018 es: 0018 ss: 0018
> > > Process nfsd (pid: 661, stackpage=f5d2b000)
> > > Stack: f8dd9412 f8dd93e8 f8dec600 ecbed220 7b1f202d 00000000 e4cca100 f8d8aeac
> > >        00000000 f8dda160 c6f0e680 ecbed200 f65d0c00 7b1f202d f7bfcc38 c62aea90
> > >        f65d0924 00000000 00000003 c62aea8c 00000000 00000000 e4cca100 ecbed220
> > > Call Trace: [<f8dd9412>] .rodata.str1.1 [xfs] 0x11c2 (0xf5d2bd44))
> > > [<f8dd93e8>] .rodata.str1.1 [xfs] 0x1198 (0xf5d2bd48))
> > > [<f8dec600>] message [xfs] 0x0 (0xf5d2bd4c))
> > > [<f8d8aeac>] xfs_iget_core [xfs] 0x45c (0xf5d2bd60))
> > > [<f8dda160>] .rodata.str1.32 [xfs] 0x5a0 (0xf5d2bd68))
> > > [<f8d8b0c3>] xfs_iget [xfs] 0x143 (0xf5d2bdb0))
> > > [<f8da8247>] xfs_vget [xfs] 0x77 (0xf5d2bdf0))
> > > [<f8dbe563>] vfs_vget [xfs] 0x43 (0xf5d2be20))
> > > [<f8dbdc9d>] linvfs_fh_to_dentry [xfs] 0x5d (0xf5d2be30))
> > > [<f8e3a8c6>] nfsd_get_dentry [nfsd] 0xb6 (0xf5d2be5c))
> > > [<f8e3ad17>] find_fh_dentry [nfsd] 0x57 (0xf5d2be80))
> > > [<f8e3b1b9>] fh_verify [nfsd] 0x189 (0xf5d2beb0))
> > > [<f8e19616>] svc_sock_enqueue [sunrpc] 0x1b6 (0xf5d2befc))
> > > [<f8e42bdf>] nfsd3_proc_getattr [nfsd] 0x6f (0xf5d2bf10))
> > > [<f8e44a93>] nfs3svc_decode_fhandle [nfsd] 0x33 (0xf5d2bf28))
> > > [<f8e4b384>] nfsd_procedures3 [nfsd] 0x24 (0xf5d2bf3c))
> > > [<f8e3863e>] nfsd_dispatch [nfsd] 0xce (0xf5d2bf48))
> > > [<f8e4ac98>] nfsd_version3 [nfsd] 0x0 (0xf5d2bf5c))
> > > [<f8e38570>] nfsd_dispatch [nfsd] 0x0 (0xf5d2bf60))
> > > [<f8e1927f>] svc_process_Rsmp_9d8bc81a [sunrpc] 0x45f (0xf5d2bf64))
> > > [<f8e4b384>] nfsd_procedures3 [nfsd] 0x24 (0xf5d2bf84))
> > > [<f8e4acb8>] nfsd_program [nfsd] 0x0 (0xf5d2bf88))
> > > [<f8e38404>] nfsd [nfsd] 0x224 (0xf5d2bfa4))
> > > [<c010758e>] arch_kernel_thread [kernel] 0x2e (0xf5d2bff0))
> > > [<f8e381e0>] nfsd [nfsd] 0x0 (0xf5d2bff8))
> > >
> > >
> > > Code: 0f 0b 6a 00 08 94 dd f8 83 c4 0c 5b 5e 5f 5d c3 89 f6 55 b8
> > > <5>xfs_force_shutdown(md(9,2),0x8) called from line 1071 of file
> > > xfs_trans.c. Return address = 0xf8dbe6eb
> > > Filesystem "md(9,2)": Corruption of in-memory data detected. Shutting
> > > down
> > > filesystem: md(9,2)
> > > Please umount the filesystem, and rectify the problem(s)
> > >
> > > I figured it'd be a good idea to xfs_repair it. That was a little more
> > > than 4 hours ago. The fs is a software RAID5:
> > > md2 : active raid5 sdn2[13] sdg2[12] sdm2[11] sdl2[10] sdk2[9] sdj2[8] sdi2[7] sdh2[6] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1] sda2[0]
> > >       385414656 blocks level 5, 64k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
> > > md0 : active raid1 sdn1[1] sdg1[0]
> > > 803136 blocks [2/2] [UU]
> > >
> > > xfs_repair [version 2.6.9] has gotten to:
> > >
> > > Phase 5 - rebuild AG headers and trees...
> > >
> > > and seems to have stopped progressing.
> > >
> > > root 798 91.8 1.0 45080 41576 pts/1 R 15:57 242:04 xfs_repair -l /dev/md0 /dev/md2
> > >
> > > It's still using lots of CPU, but there is no disk activity. Further
> > > searching suggests this might be a kernel issue rather than actual fs
> > > corruption. I'd like to upgrade from 2.4.20-35_39.rh8.0.atsmp to
> > > 2.4.20-43_41.rh8.0.atsmp, but the question is: is it safe to stop (kill)
> > > xfs_repair? Will the fs be mountable if I interrupt xfs_repair at this
> > > point?
> >
>
> --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Lonni J Friedman netllama@xxxxxxxxxxxxx
> LlamaLand http://netllama.linux-sxs.org
>
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________