Re: Major XFS problems...

To: linux-xfs@xxxxxxxxxxx
Subject: Re: Major XFS problems...
From: Anders Saaby <as@xxxxxxxxxxxx>
Date: Wed, 08 Sep 2004 17:23:43 +0200
Delivered-to: news2mail@xxxxxxxxxxxxxxxxx
Organization: Cohaesio A/S
References: <20040908133954.GB390@xxxxxxxxxxxxx> <413F1C6E.9040009@xxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
Hi Eric,

I am the primary technician on the "large" system Jakob wrote about. I can
give you the details regarding the problems.

Eric Sandeen wrote:

> Jakob Oestergaard wrote:
> 
>> Second XFS bug:
>> ---------------
>> Also causes the 'kernel BUG at fs/xfs/support/debug.c:106' message to be
>> printed. This bug is not solved by applying the simple patch to the
>> first problem.
>> 
>> How well known this problem is, I don't know - I can get more details on
>> this if anyone is actually interested in working on fixing XFS.
> 
> Do you have -any- details on this problem... pretty much nothing to go
> on here.

I have some details here. The following is a snippet of the kernel log from
right before the server reboots itself:

<SNIP>
Sep  4 16:20:31 st1 kernel: xfs_iget_core: ambiguous vns: vp/0xe7993980,
invp/0xeab6d380
Sep  4 16:20:31 st1 kernel: ------------[ cut here ]------------
Sep  4 16:20:31 st1 kernel: kernel BUG at fs/xfs/support/debug.c:106!
Sep  4 16:20:31 st1 kernel: invalid operand: 0000 [#1]
Sep  4 16:20:31 st1 kernel: SMP
Sep  4 16:20:31 st1 kernel: Modules linked in: nfs e1000 rtc
Sep  4 16:20:31 st1 kernel: CPU:    1
Sep  4 16:20:31 st1 kernel: EIP:    0060:[<c021111c>]    Not tainted
Sep  4 16:20:31 st1 kernel: EFLAGS: 00010246   (2.6.8.1)
Sep  4 16:20:31 st1 kernel: EIP is at cmn_err+0x8c/0xa0
Sep  4 16:20:31 st1 kernel: eax: 00000040   ebx: 00000293   ecx: 00000000  
edx: c0351544
Sep  4 16:20:31 st1 kernel: esi: c03145f1   edi: c042e0fe   ebp: 00000000  
esp: f3913b50
Sep  4 16:20:31 st1 kernel: ds: 007b   es: 007b   ss: 0068
Sep  4 16:20:31 st1 kernel: Process nfsd (pid: 1297, threadinfo=f3912000
task=f38d60b0)
Sep  4 16:20:31 st1 kernel: Stack: f3912000 eab6d380 cd4168d0 f6f71928
c01e3f12 00000000 c031bc00 e7993980
Sep  4 16:20:31 st1 kernel:        eab6d380 40853aca f7fa8e00 c0161d6d
00000000 00000000 40853aca 00000000
Sep  4 16:20:31 st1 kernel:        cd4168d0 c0162315 f7fa8e00 c2536a7c
40853aca eab6d3a0 40853aca eab6d380
Sep  4 16:20:31 st1 kernel: Call Trace:
Sep  4 16:20:31 st1 kernel:  [<c01e3f12>] xfs_iget_core+0x1a2/0x590
Sep  4 16:20:31 st1 kernel:  [<c0161d6d>] find_inode_fast+0x4d/0x60
Sep  4 16:20:31 st1 kernel:  [<c0162315>] iget_locked+0x95/0xa0
Sep  4 16:20:31 st1 kernel:  [<c01e43a4>] xfs_iget+0xa4/0x170
Sep  4 16:20:31 st1 kernel:  [<c01fff6b>] xfs_vget+0x4b/0xc0
Sep  4 16:20:31 st1 kernel:  [<c0210631>] vfs_vget+0x21/0x30
Sep  4 16:20:31 st1 kernel:  [<c0210098>] linvfs_get_dentry+0x48/0x80
Sep  4 16:20:31 st1 kernel:  [<c02ab1e0>] pfifo_fast_enqueue+0x0/0x90
Sep  4 16:20:31 st1 kernel:  [<c018ef98>] find_exported_dentry+0x38/0x5d0
Sep  4 16:20:31 st1 kernel:  [<c02b717b>] ip_finish_output2+0x13b/0x18f
Sep  4 16:20:31 st1 kernel:  [<c02a9eed>] nf_iterate+0x3d/0xa0
Sep  4 16:20:31 st1 kernel:  [<c02b7040>] ip_finish_output2+0x0/0x18f
Sep  4 16:20:31 st1 kernel:  [<c02b7040>] ip_finish_output2+0x0/0x18f
Sep  4 16:20:31 st1 kernel:  [<c02aa213>] nf_hook_slow+0x63/0xe0
Sep  4 16:20:31 st1 kernel:  [<c02b7040>] ip_finish_output2+0x0/0x18f
Sep  4 16:20:31 st1 kernel:  [<c02aa24e>] nf_hook_slow+0x9e/0xe0
Sep  4 16:20:31 st1 kernel:  [<c02e5fea>] ipt_do_table+0x35a/0x370
Sep  4 16:20:31 st1 kernel:  [<c02b6fd0>] dst_output+0x0/0x20
Sep  4 16:20:31 st1 kernel:  [<c02b4de7>] ip_finish_output+0x1c7/0x1e0
Sep  4 16:20:31 st1 kernel:  [<c02b7040>] ip_finish_output2+0x0/0x18f
Sep  4 16:20:31 st1 kernel:  [<c02b6fd0>] dst_output+0x0/0x20
Sep  4 16:20:31 st1 kernel:  [<c02e84bc>] ipt_local_out_hook+0x5c/0x60
Sep  4 16:20:31 st1 kernel:  [<c02a9eed>] nf_iterate+0x3d/0xa0
Sep  4 16:20:31 st1 kernel:  [<c02b6fd0>] dst_output+0x0/0x20
Sep  4 16:20:31 st1 kernel:  [<c02b6fd0>] dst_output+0x0/0x20
Sep  4 16:20:31 st1 kernel:  [<c02aa213>] nf_hook_slow+0x63/0xe0
Sep  4 16:20:31 st1 kernel:  [<c02b6fd0>] dst_output+0x0/0x20
Sep  4 16:20:31 st1 kernel:  [<c02b6fe1>] dst_output+0x11/0x20
Sep  4 16:20:31 st1 kernel:  [<c02aa24e>] nf_hook_slow+0x9e/0xe0
Sep  4 16:20:31 st1 kernel:  [<c0112625>] find_busiest_group+0x105/0x310
Sep  4 16:20:31 st1 kernel:  [<c01112f4>] recalc_task_prio+0x134/0x140
Sep  4 16:20:31 st1 kernel:  [<c011138b>] activate_task+0x8b/0xa0
Sep  4 16:20:31 st1 kernel:  [<c010d089>] smp_send_reschedule+0x19/0x20
Sep  4 16:20:31 st1 kernel:  [<c011186c>] try_to_wake_up+0x25c/0x290
Sep  4 16:20:31 st1 kernel:  [<c0196640>] exp_find_key+0x90/0xa0
Sep  4 16:20:31 st1 kernel:  [<c018f852>] export_decode_fh+0x62/0x6a
Sep  4 16:20:31 st1 kernel:  [<c0191600>] nfsd_acceptable+0x0/0xe0
Sep  4 16:20:31 st1 kernel:  [<c0191a73>] fh_verify+0x393/0x540
Sep  4 16:20:31 st1 kernel:  [<c0191600>] nfsd_acceptable+0x0/0xe0
Sep  4 16:20:31 st1 kernel:  [<c02f9f6c>] svcauth_unix_accept+0x22c/0x2b0
Sep  4 16:20:31 st1 kernel:  [<c01999f4>] nfsd3_proc_getattr+0x74/0x80
Sep  4 16:20:31 st1 kernel:  [<c018fee6>] nfsd_dispatch+0xc6/0x16c
Sep  4 16:20:31 st1 kernel:  [<c02f653a>] svc_process+0x40a/0x618
Sep  4 16:20:31 st1 kernel:  [<c018fca7>] nfsd+0x1f7/0x370
Sep  4 16:20:31 st1 kernel:  [<c018fab0>] nfsd+0x0/0x370
Sep  4 16:20:31 st1 kernel:  [<c01024bd>] kernel_thread_helper+0x5/0x18
Sep  4 16:20:31 st1 kernel: Code: 0f 0b 6a 00 f5 45 31 c0 5b 5e 5f 5d c3 8d
b4 26 00 00 00 00
</SNIP>
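
For anyone reading a trace like the one above, here is a minimal Python sketch
that pulls the frames out of the "Call Trace:" section so the XFS and NFS
entries can be picked out from the networking ones. The function names
(parse_call_trace, frames_matching) are hypothetical helpers, not part of any
kernel tool, and the sketch assumes each frame sits on its own log line in the
"[<address>] symbol+0xoffset/0xsize" form shown in the log. It only lists
frames; it does not try to decide which of them are live.

```python
import re

# Matches frames such as "[<c01e3f12>] xfs_iget_core+0x1a2/0x590".
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_call_trace(log_text):
    """Return (address, symbol, offset, size) tuples for the frames
    that follow the first 'Call Trace:' line in log_text."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if not in_trace:
            # Skip everything (EIP line, registers, stack dump) before the trace.
            if "Call Trace:" in line:
                in_trace = True
            continue
        match = FRAME_RE.search(line)
        if match is None:
            # First non-frame line (for example "Code: ...") ends the trace.
            break
        addr, symbol, offset, size = match.groups()
        frames.append((int(addr, 16), symbol, int(offset, 16), int(size, 16)))
    return frames


def frames_matching(frames, prefixes):
    """Return the symbol names whose name starts with any of the prefixes."""
    wanted = tuple(prefixes)
    return [symbol for _, symbol, _, _ in frames if symbol.startswith(wanted)]
```

Calling frames_matching(parse_call_trace(text), ["xfs_", "nfsd", "linvfs_"])
on the saved log text would list only the filesystem and NFS server frames, in
the order they appear in the trace.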

> 
>> Third XFS bug:
>> --------------
>> XFS causes lowmem oom, triggering the OOM killer. Reported by
>> as@xxxxxxxxxxxx on the 18th of August.
>> 
>> On the 24th of August, William Lee Irwin gives some suggestions and
>> mentions "xfs has some known bad slab behavior."
> 
> I'm curious to know what that means... :)

So are we :-)

> 
>> So, it's normal to OOM the lowmem with XFS? Again, more info can be
>> presented if anyone cares about fixing this.
> 
> of course, please file a bug with all info you have.  How do you know
> it's xfs causing the oom killer to kick in?  Surely there are other
> memory consumers on the box; also how much memory is in the box to start
> with?

Actually, no. This server is *only* serving files over NFS, nothing else. Of
course it's running the usual stuff like sysklogd, ssh, and so on, but
nothing memory-consuming.

The server has 2.5G memory and is configured with 4G swap (of which we use
~1M :-))

> 
> This may have as much to do with the way linux (2.4, anyway) caches
> dentries; xfs has structures that can't be freed as long as the dentry
> still has a reference.

You can see the thread I started on LKML on 18/8, subject "oom-killer 2.6.8.1"
(archive: http://lkml.org/lkml/2004/8/18/72).
It explains what we are seeing. Any comments?
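
To put numbers on the slab question quoted earlier, here is a minimal Python
sketch that sums the memory held by slab caches whose names start with a given
prefix. It assumes the /proc/slabinfo 2.x text layout (name, active_objs,
num_objs, objsize, ...). The function name slab_usage and the default "xfs"
prefix are assumptions for illustration, not something taken from this thread,
and the result ignores per-slab overhead, so treat it as a rough lower bound.

```python
def slab_usage(slabinfo_text, prefixes=("xfs",)):
    """Return {cache_name: approx_bytes} for caches matching the prefixes.

    approx_bytes is num_objs * objsize, taken from columns 3 and 4 of the
    /proc/slabinfo 2.x layout (an assumption about the input format).
    """
    wanted = tuple(prefixes)
    usage = {}
    for line in slabinfo_text.splitlines():
        # Skip blank lines, the version banner, and the column header.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) < 4 or not fields[0].startswith(wanted):
            continue
        try:
            num_objs = int(fields[2])
            objsize = int(fields[3])
        except ValueError:
            # Not a data line in the expected layout; ignore it.
            continue
        usage[fields[0]] = num_objs * objsize
    return usage
```

On the server this could be fed the contents of /proc/slabinfo, for example
slab_usage(open("/proc/slabinfo").read()), and sampled periodically while the
NFS load runs to see which caches grow before the OOM killer triggers.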

> 
>> Stability on large filesystems:
>> -------------------------------
>> On a 600+G filesystem with some 17M files, we are currently unable to
>> run a backup of the filesystem.
>> 
>> Some 4-8 hours after the backup has started, the dreaded 'debug.c:106'
>> message will appear (at some random place thru the filesystem - it is
>> not a consistent error in one specific location in the filesystem), and
>> the server will need a reboot.
> 
> a report of "debug.c:106 message" is not helpful; this is a generic
> error printing routine which will BUG() the box if CE_PANIC was
> specified in the error.  We need all error messages leading up to this
> to know how you got here.
> 
OK, I hope you can use the kernel log snippet. If you need more info, I will
be happy to send it to you.

>> Does anyone actually use XFS for serious file-serving?  (yes, I run it
>> on my desktop at home and I don't have problems there - such reports are
>> not really relevant).
> 
> yes.  http://oss.sgi.com/projects/xfs/xfs_users.html
>
OK
 
>> Is anyone actually maintaining/bugfixing XFS?  Yes, I know the
>> MAINTAINERS file, but I am a little bit confused here - seeing that
>> trivial-to-trigger bugs that crash the system and have simple fixes,
>> have not been fixed in current mainline kernels.
> 
> Yes, sgi is maintaining it.  Perhaps you've missed the large volume of
> commits on the linux-xfs list and on lkml.  :)
> 

:-)

> -Eric

-- 
/Saaby

