Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer de

To: Fengguang Wu <fengguang.wu@xxxxxxxxx>
Subject: Re: [XFS on bad superblock] BUG: unable to handle kernel NULL pointer dereference at 00000003
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Thu, 10 Oct 2013 14:15:15 +1100
Cc: Dave Chinner <dchinner@xxxxxxxxxx>, linux-fsdevel@xxxxxxxxxxxxxxx, Ben Myers <bpm@xxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20131010014117.GA6017@localhost>
References: <20131009073910.GA387@localhost> <20131010005900.GE2025@xxxxxxxxxxxxxxxxx> <20131010011640.GA5726@localhost> <20131010014117.GA6017@localhost>
User-agent: Mutt/1.5.21 (2010-09-15)
On Thu, Oct 10, 2013 at 09:41:17AM +0800, Fengguang Wu wrote:
> On Thu, Oct 10, 2013 at 09:16:40AM +0800, Fengguang Wu wrote:
> > On Thu, Oct 10, 2013 at 11:59:00AM +1100, Dave Chinner wrote:
> > > [add xfs@xxxxxxxxxxx to cc]
> > 
> > Thanks.
> > 
> > To help debug the problem, I searched XFS in my tests' oops database
> > and find one kernel that failed 4 times (out of 12 total boots) with
> > basically the same error:
> > 
> >       4 BUG: sleeping function called from invalid context at kernel/workqueue.c:2810
> >       1 WARNING: CPU: 1 PID: 372 at lib/debugobjects.c:260 debug_print_object+0x94/0xa2()
> >       1 WARNING: CPU: 1 PID: 360 at lib/debugobjects.c:260 debug_print_object+0x94/0xa2()
> >       1 WARNING: CPU: 0 PID: 381 at lib/debugobjects.c:260 debug_print_object+0x94/0xa2()
> >       1 WARNING: CPU: 0 PID: 361 at lib/debugobjects.c:260 debug_print_object+0x94/0xa2()
> 

Fengguang, I'm having real trouble associating these with the XFS
code path that is seeing the problems. These look like a use after
free or a double free, but that isn't possible in the XFS code paths
that are showing up in the traces.

> And some other messages in an older kernel:
> 
> [   39.004416] F2FS-fs (nbd2): unable to read second superblock
> [   39.005088] XFS: Assertion failed: read && bp->b_ops, file: fs/xfs/xfs_buf.c, line: 1036

This cannot possibly occur on the superblock read path, as
bp->b_ops in that case is *always* initialised, as is XBF_READ.

So this implies something else has modified the struct xfs_buf.

> [   41.550471] ------------[ cut here ]------------
> [   41.550476] WARNING: CPU: 1 PID: 878 at lib/list_debug.c:33 __list_add+0xac/0xc0()
> [   41.550478] list_add corruption. prev->next should be next (ffff88000f3d7360), but was           (null). (prev=ffff880008786a30).

And this is a smoking gun - list corruption...

> [   41.550481] CPU: 1 PID: 878 Comm: mount Not tainted 3.11.0-rc1-00667-gf70eb07 #64
> [   41.550482] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   41.550485]  0000000000000009 ffff880007d6fb08 ffffffff824044a1 ffff880007d6fb50
> [   41.550488]  ffff880007d6fb40 ffffffff8109a0a8 ffff880007c6b530 ffff88000f3d7360
> [   41.550491]  ffff880008786a30 0000000000000007 0000000000000000 ffff880007d6fba0
> [   41.550491] Call Trace:
> [   41.550499]  [<ffffffff824044a1>] dump_stack+0x4e/0x82
> [   41.550503]  [<ffffffff8109a0a8>] warn_slowpath_common+0x78/0xa0
> [   41.550505]  [<ffffffff8109a14c>] warn_slowpath_fmt+0x4c/0x50
> [   41.550509]  [<ffffffff81101359>] ? get_lock_stats+0x19/0x60
> [   41.550511]  [<ffffffff8163434c>] __list_add+0xac/0xc0
> [   41.550515]  [<ffffffff810ba453>] insert_work+0x43/0xa0
> [   41.550518]  [<ffffffff810bb22b>] __queue_work+0x11b/0x510
> [   41.550520]  [<ffffffff810bb936>] queue_work_on+0x96/0xa0
> [   41.550526]  [<ffffffff813d2096>] ? _xfs_buf_ioend.constprop.15+0x26/0x30
> [   41.550529]  [<ffffffff813d1f6c>] xfs_buf_ioend+0x15c/0x260

... in the workqueue code on a work item in the struct xfs_buf .....

> [   41.550531]  [<ffffffff813d2f92>] ? xfsbdstrat+0x22/0x170
> [   41.550534]  [<ffffffff813d2096>] _xfs_buf_ioend.constprop.15+0x26/0x30
> [   41.550537]  [<ffffffff813d2873>] xfs_buf_iorequest+0x73/0x1a0
> [   41.550539]  [<ffffffff813d2f92>] xfsbdstrat+0x22/0x170
> [   41.550542]  [<ffffffff813d3832>] xfs_buf_read_uncached+0x72/0xa0
> [   41.550546]  [<ffffffff81445846>] xfs_readsb+0x176/0x250

... in the very context that we allocated the struct xfs_buf. It's
not a use after free or memory corruption caused by XFS you are
seeing here.
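
For reference, the warning above comes from the consistency check done
when a new entry is linked between two existing list nodes. A minimal
userspace sketch of that kind of check follows. It is not the kernel
source: list_add_valid() and checked_list_add() are names made up for
illustration, and the sketch simply refuses the insertion when a check
fails, which is a simplification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Doubly linked list node, same shape as the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/*
 * Hypothetical helper: verify that prev and next still point at each
 * other before a new entry is linked in between them.
 */
static bool list_add_valid(struct list_head *prev, struct list_head *next)
{
	if (next->prev != prev) {
		fprintf(stderr,
			"list_add corruption. next->prev should be prev (%p), but was %p. (next=%p).\n",
			(void *)prev, (void *)next->prev, (void *)next);
		return false;
	}
	if (prev->next != next) {
		fprintf(stderr,
			"list_add corruption. prev->next should be next (%p), but was %p. (prev=%p).\n",
			(void *)next, (void *)prev->next, (void *)prev);
		return false;
	}
	return true;
}

/* Hypothetical wrapper: link new_entry between prev and next if they are sane. */
static bool checked_list_add(struct list_head *new_entry,
			     struct list_head *prev,
			     struct list_head *next)
{
	if (!list_add_valid(prev, next))
		return false;

	next->prev = new_entry;
	new_entry->next = next;
	new_entry->prev = prev;
	prev->next = new_entry;
	return true;
}
```

With prev->next overwritten by NULL, as in the trace, it is the second
check that trips: something zeroed the link after the list was set up.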

I note that you have CONFIG_SLUB=y, which means that the cache slabs
are shared with objects of other types. That means that the memory
corruption problem is likely to be caused by one of the other
filesystems that is probing the block device(s), not XFS.
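
To illustrate the mechanism, here is a minimal userspace sketch of how
a stale pointer held by one object type can scribble over a different
type once both share the same cache memory. Everything in it is
hypothetical (struct other_fs_obj, struct buf_like, the one-slot
cache); it models the idea only and does not reproduce the actual bug
or the SLUB allocator.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical object belonging to some other filesystem's probe code. */
struct other_fs_obj {
	void *private_data;
	unsigned long flags;
};

/* Hypothetical stand-in for a buffer with an embedded work-list link. */
struct buf_like {
	struct buf_like *work_next;
	unsigned long flags;
};

/* One slot of a shared cache: either type can be handed this memory. */
union slot {
	struct other_fs_obj other;
	struct buf_like buf;
};

static union slot cache[1];
static int slot_in_use;

static union slot *cache_alloc(void)
{
	if (slot_in_use)
		return NULL;
	slot_in_use = 1;
	return &cache[0];
}

static void cache_free(union slot *s)
{
	(void)s;
	slot_in_use = 0;
}

/* Returns 1 if the buffer's link was clobbered through the stale pointer. */
static int demo_stale_write(void)
{
	union slot *a = cache_alloc();
	struct other_fs_obj *stale = &a->other;

	cache_free(a);                    /* freed, but 'stale' is kept around */

	union slot *b = cache_alloc();    /* the same memory is handed out again */
	b->buf.work_next = &b->buf;       /* correctly initialised by its new owner */

	memset(stale, 0, sizeof(*stale)); /* use after free by the *other* type */

	return b->buf.work_next == NULL;
}
```

The new owner of the memory did everything right and still ends up
with a NULL link, which is the kind of symptom the traces show.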

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
