xfs
[Top] [All Lists]

Re: xfs_efi_item slab corruption. (v3.9-10936-g51a26ae)

To: Mark Tinguely <tinguely@xxxxxxx>
Subject: Re: xfs_efi_item slab corruption. (v3.9-10936-g51a26ae)
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Wed, 8 May 2013 08:22:56 +1000
Cc: Dave Jones <davej@xxxxxxxxxx>, CAI Qian <caiqian@xxxxxxxxxx>, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <518962FC.2060509@xxxxxxx>
References: <20130507133707.GA18301@xxxxxxxxxx> <51895025.2010709@xxxxxxx> <20130507190731.GA15528@xxxxxxxxxx> <518954DE.4070803@xxxxxxx> <20130507193146.GA7539@xxxxxxxxxx> <51895CD7.7040806@xxxxxxx> <20130507195954.GA8384@xxxxxxxxxx> <51895E51.2050508@xxxxxxx> <20130507202217.GA9883@xxxxxxxxxx> <518962FC.2060509@xxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Tue, May 07, 2013 at 03:24:28PM -0500, Mark Tinguely wrote:
> On 05/07/13 15:22, Dave Jones wrote:
> >On Tue, May 07, 2013 at 03:04:33PM -0500, Mark Tinguely wrote:
> >  >  On 05/07/13 14:59, Dave Jones wrote:
> >  >  >  On Tue, May 07, 2013 at 02:58:15PM -0500, Mark Tinguely wrote:
> >  >  >
> >  >  >    >   >   I can hit this almost instantly with fsx. I'll do a 
> > bisect, though
> >  >  >    >   >   it sounds like you already have a suspect.
> >  >  >    >   >
> >  >  >    >
> >  >  >    >   If you want to try kmem debug of Linux 3.8 that would help.
> >  >  >
> >  >  >  I'm not sure what that is.
> >  >
> >  >  Sorry, if you would test Linux 3.8 with "CONFIG_DEBUG_SLAB=y".
> >
> >Ah, done that. (I pretty much always run with it).
> >
> >This is something new. Even 3.9 was fine. It's only since
> >the recent xfs merge.
> >
> >     Dave
> >
> 
> git revert 666d644cd72a9ec58b353209ff191d7430f3b357

That won't prevent the use after free. That commit fixed a problem
that could lead to a use after free, but what we are seeing here is
that it has ultimately exposed a previously unknown issue that
causes the use after free.

Basically what is happening is that there are two commits for the
EFD being processed, when there should only be one. I'm not sure how
this is happening yet, but these three traces came out from my debug
sequentially when running generic/006:

__xfs_efi_release 0xffff88006c4d7058 2
Pid: 4756, comm: kworker/1:1H Tainted: G    B        3.9.0-rc8-dgc+ #592
Call Trace:
 [<ffffffff814e43a5>] __xfs_efi_release+0x35/0x80
 [<ffffffff814e444d>] xfs_efi_item_unpin+0x5d/0x70
 [<ffffffff814dbb88>] xfs_trans_committed_bulk+0x248/0x2e0
 [<ffffffff814e18db>] xlog_cil_committed+0x3b/0x140
 [<ffffffff814ddfae>] xlog_state_do_callback+0x19e/0x3a0
.....

That's the EFI being unpinned and it's reference going away during
transaction commit completion procesing.

__xfs_efi_release 0xffff88006c4d7058 1
Pid: 4756, comm: kworker/1:1H Tainted: G    B        3.9.0-rc8-dgc+ #592
Call Trace:
 [<ffffffff814e43a5>] __xfs_efi_release+0x35/0x80
 [<ffffffff814e4620>] xfs_efi_release+0x60/0x80
 [<ffffffff814e46a6>] xfs_efd_item_committed+0x26/0x40
 [<ffffffff814dba21>] xfs_trans_committed_bulk+0xe1/0x2e0
 [<ffffffff814e18db>] xlog_cil_committed+0x3b/0x140
 [<ffffffff814ddfae>] xlog_state_do_callback+0x19e/0x3a0
.....

That's the EFD releasing the EFI, which drops it's reference count
and frees the both the EFI and the EFD.

__xfs_efi_release 0xffff88006c4d7058 1802201963
Pid: 4756, comm: kworker/1:1H Tainted: G    B        3.9.0-rc8-dgc+ #592
Call Trace:
 [<ffffffff814e43a5>] __xfs_efi_release+0x35/0x80
 [<ffffffff814e4630>] xfs_efi_release+0x70/0x80
 [<ffffffff814e46a6>] xfs_efd_item_committed+0x26/0x40
 [<ffffffff814dba21>] xfs_trans_committed_bulk+0xe1/0x2e0
 [<ffffffff814e18db>] xlog_cil_committed+0x3b/0x140
.....

And there's the EFI being released by an EFD after it has been
freed. It's definitely been freed because 1802201963 =
0x6b6b6b6b....

Like I said, I don't understand how this is happening yet, but just
seeing this behaviour explains an awful lot about why we've had long
term issues with this code....

FWIW, I used to run with slab debugging turned on for all my debug
builds. I switched my .config from SLAB to SLUB a few months ago,
and didn't realise this silently turned of memory poisoning.  That's
how it slipped through my testing....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>