On Tue, Mar 22, 2011 at 01:57:36PM +0100, Johannes Weiner wrote:
> On Mon, Mar 21, 2011 at 01:25:26PM +0100, Johannes Weiner wrote:
> > On Fri, Mar 11, 2011 at 09:49:45AM +1100, Dave Chinner wrote:
> > > FWIW, while the VM folk might be paying attention about vmap realted
> > > stuff, this vmap BUG() also needs triage:
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=27002
> > I stared at this bug and the XFS code for a while over the weekend.
> > What you are doing in there is really scary!
> > So xfs_buf_free() does vm_unmap_ram if the buffer has the XBF_MAPPED
> > flag set and spans multiple pages (b_page_count > 1).
> > In xlog_sync() you have that split case where you do XFS_BUF_SET_PTR
> > on that in-core log's l_xbuf which changes that buffer to, as far as I
> > could understand, linear kernel memory. Later in xlog_dealloc_log you
> > call xfs_buf_free() on that buffer.
> > I was unable to determine if this can ever be more than one page in
> > the buffer for the split case. But if this is the case, you end up
> > invoking vm_unmap_ram() on something you never vm_map_ram'd, which
> > could explain why this triggers the BUG_ON() for the dirty area map.
> Blech, that's bogus, please pardon my rashness.
> I looked over the vmalloc side several times but could not spot
> anything that would explain this crash.
> However, when you switched from vunmap to vm_unmap_ram you had to add
> the area size parameter.
> I am guessing that the base address was always correct, vunmap would
> have caught an error with it. But the new size argument could be too
> large and crash the kernel when it would reach into the next area that
> had already been freed (and marked in the dirty bitmap).
> I have given up on verifying that what xlog_sync() does to l_xbuf is
> okay. It would be good if you could confirm that it leaves the buffer
> in a state so that its b_addr - b_offset, b_page_count are correctly
> describing the exact vmap area.
Thanks for looking at this, Hannes. A fresh set of eyes always
helps. However, I don't think that l_xbuf is the only source of
potential problems w.r.t. the mapped region size when the buffer is
freed. This was reported on #xfs overnight:
[ 248.794327] XFS mounting filesystem md0
[ 248.970190] Starting XFS recovery on filesystem: md0 (logdev: internal)
[ 249.434782] ------------[ cut here ]------------
[ 249.434962] kernel BUG at mm/vmalloc.c:942!
[ 249.435053] invalid opcode: 0000 [#1] SMP
[ 249.435200] last sysfs file: /sys/devices/virtual/block/dm-5/dm/name
[ 249.435291] CPU 1
[ 249.435324] Modules linked in: arc4 ecb ves1820 rt61pci crc_itu_t
eeprom_93cx6 rt2x00pci rt2x00lib mac80211 budget budget_core saa7146
ttpci_eeprom dvb_core ftdi_sio usbserial evdev button cfg80211 shpchp
pci_hotplug r8168 serio_raw pcspkr e1000e edac_core ohci_hcd
[ 249.436597] Pid: 2739, comm: mount Not tainted 2.6.38 #31 System
manufacturer System Product Name/M4A785D-M PRO
[ 249.436893] RIP: 0010:[<ffffffff810cb44e>] [<ffffffff810cb44e>]
[ 249.437078] RSP: 0018:ffff8801156fba88 EFLAGS: 00010246
[ 249.437168] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 249.437260] RDX: 0000000000000000 RSI: 0000000000000041 RDI: 0000000000000001
[ 249.437353] RBP: ffff8801156fbaa8 R08: 0000000000000000 R09: ffff8801125c4490
[ 249.437445] R10: ffff880114ff5780 R11: dead000000200200 R12: 0000000000000006
[ 249.437537] R13: ffffc900106e6000 R14: 0000000000040000 R15: ffff880114ff0dc0
[ 249.437631] FS: 00007fba61615740(0000) GS:ffff8800dfd00000(0000)
[ 249.437777] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 249.437867] CR2: 00007fba61627000 CR3: 0000000114ff3000 CR4: 00000000000006e0
[ 249.437959] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 249.438051] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 249.438144] Process mount (pid: 2739, threadinfo ffff8801156fa000, task
[ 249.438288] Stack:
[ 249.438374] ffff880114ff0dc0 ffff880114ff0c80 0000000000000008
[ 249.438637] ffff8801156fbac8 ffffffff811c167a ffff880119293800
[ 249.438677] ffff8801156fbad8 ffffffff811b1b79 ffff8801156fbbf8
[ 249.438677] Call Trace:
[ 249.438677] [<ffffffff811c167a>] xfs_buf_free+0x38/0x78
[ 249.438677] [<ffffffff811b1b79>] xlog_put_bp+0x9/0xb
[ 249.438677] [<ffffffff811b4b37>] xlog_do_recovery_pass+0x5c8/0x5f4
[ 249.438677] [<ffffffff811b4bbb>] xlog_do_log_recovery+0x58/0x91
[ 249.438677] [<ffffffff811b300a>] ? xlog_find_tail+0x2a6/0x2fb
[ 249.438677] [<ffffffff811b4c07>] xlog_do_recover+0x13/0xed
[ 249.438677] [<ffffffff811b4e1b>] xlog_recover+0x7e/0x89
[ 249.438677] [<ffffffff811aedb0>] xfs_log_mount+0xdb/0x149
[ 249.438677] [<ffffffff811b714e>] xfs_mountfs+0x310/0x5c3
[ 249.438677] [<ffffffff811b7de1>] ? xfs_mru_cache_create+0x126/0x173
[ 249.438677] [<ffffffff811c8ecb>] xfs_fs_fill_super+0x183/0x2c4
[ 249.438677] [<ffffffff810e2d11>] mount_bdev+0x147/0x1ba
[ 249.438677] [<ffffffff811c8d48>] ? xfs_fs_fill_super+0x0/0x2c4
[ 249.438677] [<ffffffff811c7259>] xfs_fs_mount+0x10/0x12
[ 249.438677] [<ffffffff810e1f4f>] vfs_kern_mount+0x61/0x132
[ 249.438677] [<ffffffff810e207e>] do_kern_mount+0x48/0xda
[ 249.438677] [<ffffffff810f8aff>] do_mount+0x6ae/0x71b
[ 249.438677] [<ffffffff810f8dfd>] sys_mount+0x87/0xc8
[ 249.438677] [<ffffffff8102a8bb>] system_call_fastpath+0x16/0x1b
[ 249.438677] Code: d1 e8 75 f8 48 be 00 00 00 00 00 37 00 00 48 c7 c7 b0 98
62 81 49 8d 74 35 00 48 c1 ee 16 e8 a2 8c 12 00 48 85 c0 48 89 c3 75 02 <0f> 0b
4b 8d 74 35 00 4c 89 ef e8 2d fa ff ff 48 89 df e8 21 3a
[ 249.438677] RIP [<ffffffff810cb44e>] vm_unmap_ram+0x9a/0x133
[ 249.438677] RSP <ffff8801156fba88>
[ 249.443421] ---[ end trace 2360f16b307700c6 ]---
Which is basically reading the log, not writing to it where l_xbuf
comes into play. The log reading code plays a _lot_ of tricks with
buffer offsets and sizes (if the simple l_xbuf tricks scare you, do
not look at this code ;). Hence it's definitely possible that the
size of the region being passed back to vm_unmap_ram() is wrong in
some of the error cases. I'll spend some more time to verify whether
they are restoring the buffer correctly or not before freeing it.