xfs umount with i/o error hang/memory corruption

Bob Mastors bob.mastors at solidfire.com
Fri Apr 4 16:40:38 CDT 2014


On Fri, Apr 4, 2014 at 3:20 PM, Dave Chinner <david at fromorbit.com> wrote:

> On Fri, Apr 04, 2014 at 12:15:23PM -0600, Bob Mastors wrote:
> > Greetings,
> >
> > I am new to xfs and am running into a problem
> > and would appreciate any guidance on how to proceed.
> >
> > After an i/o error from the block device that xfs is using,
> > an umount results in a message like:
> > [  370.636473] XFS (sdx): Log I/O Error Detected.  Shutting down
> filesystem
> > [  370.644073] XFS (h           ���h"h          ���H#h          ���bsg):
> > Please umount the filesystem and rectify the problem(s)
> > Note the garbage on the previous line which suggests memory corruption.
> > About half the time I get the garbled log message. About half the time
> > umount hangs.
>
> I got an email about this last night with a different trigger - thin
> provisioning failing log IO in the unmount path. I know what the
> problem is, I just don't have a fix for it yet.
>
> To confirm it's the same problem, can you post the entirity of the
> dmesg where the error occurs?
>
Here is the console output from putting a BUG() call  in
xfs_do_force_shutdown
after the "Please umount ..." message.
Let me know if want something else.

I don't know how much help I can be on this, but I can run more tests
or try out fixes.

Thanks,
Bob

[  111.004665] XFS (sdb): Mounting Filesystem
[  111.099442] XFS (sdb): Ending clean mount
[  129.548644] XFS (sdb): metadata I/O error: block 0x19022 ("xlog_iodone")
error 19 numblks 64
[  129.618739] XFS (sdb): xfs_do_force_shutdown(0x2) called from line 1170
of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa04c8001
[  129.699768] XFS (sdb): Log I/O Error Detected.  Shutting down filesystem
[  129.699924] XFS (sdb): Unable to update superblock counters. Freespace
may not be correct on next mount.
[  129.699927] XFS (sdb): xfs_log_force: error 5 returned.
[  129.716031] XFS (P専/): Please umount the filesystem and rectify the
problem(s)
[  129.718411] ------------[ cut here ]------------
[  129.719120] Kernel BUG at ffffffffa04726ac [verbose debug info
unavailable]
[  129.720016] invalid opcode: 0000 [#1] SMP
[  129.720016] Modules linked in: xfs(F) libcrc32c ib_srpt tcm_qla2xxx
qla2xxx tcm_loop tcm_fc libfc scsi_transport_fc scsi_tgt iscsi_target_mod
target_core_pscsi target_core_file target_core_iblock target_core_mod
configfs ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp
libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs_acl nfs lockd fscac\
he sunrpc snd_intel8x0 snd_ac97_codec parport_pc parport ac97_bus snd_pcm
psmouse snd_timer serio_raw snd joydev evbug mac_hid pcspkr i2c_piix4
soundcore hid_generic usbhid hid e1000 ahci libahci
[  129.720016] CPU: 1 PID: 234 Comm: kworker/1:1H Tainted: GF
 3.14.0+ #1
[  129.720016] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox
12/01/2006
[  129.720016] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[  129.720016] task: ffff88002f598000 ti: ffff88002fecc000 task.ti:
ffff88002fecc000
[  129.720016] RIP: 0010:[<ffffffffa04726ac>]  [<ffffffffa04726ac>]
xfs_do_force_shutdown+0x9c/0x170 [xfs]
[  129.720016] RSP: 0018:ffff88002fecdd68  EFLAGS: 00010296
[  129.720016] RAX: 0000000000000043 RBX: 0000000000000002 RCX:
0000000000000000
[  129.720016] RDX: ffff88003fd0efb0 RSI: ffff88003fd0d3d8 RDI:
ffff88003fd0d3d8
[  129.720016] RBP: ffff88002fecdd98 R08: 0000000000000000 R09:
0000000000000000
[  129.720016] R10: 00000000000001cf R11: 00000000000001ce R12:
ffff880028865000
[  129.720016] R13: 0000000000000002 R14: 0000000000000000 R15:
0000000000000000
[  129.720016] FS:  0000000000000000(0000) GS:ffff88003fd00000(0000)
knlGS:0000000000000000
[  129.720016] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  129.720016] CR2: 00000000006105e8 CR3: 0000000001c0e000 CR4:
00000000000006e0
[  129.720016] Stack:
[  129.720016]  ffffffffa04c8001 0000000000000000 ffff880030140300
ffff88003026dc00
[  129.720016]  ffff88002fa13980 0000000000000000 ffff88002fecddc8
ffffffffa04c8001
[  129.720016]  ffff88003653cf48 ffff88002fa13a20 ffff88002fa13980
ffffe8ffffd01e00
[  129.720016] Call Trace:
[  129.720016]  [<ffffffffa04c8001>] ? xlog_iodone+0x91/0xa0 [xfs]
[  129.720016]  [<ffffffffa04c8001>] xlog_iodone+0x91/0xa0 [xfs]
[  129.720016]  [<ffffffffa046ac3e>] xfs_buf_iodone_work+0x5e/0xc0 [xfs]
[  129.720016]  [<ffffffff8106ad04>] process_one_work+0x174/0x4a0
[  129.720016]  [<ffffffff8106c0ab>] worker_thread+0x11b/0x370
[  129.720016]  [<ffffffff8106bf90>] ? manage_workers.isra.20+0x2d0/0x2d0
[  129.720016]  [<ffffffff81072c29>] kthread+0xc9/0xe0
[  129.720016]  [<ffffffff81072b60>] ? flush_kthread_worker+0xb0/0xb0
[  129.720016]  [<ffffffff816d6aec>] ret_from_fork+0x7c/0xb0
[  129.720016]  [<ffffffff81072b60>] ? flush_kthread_worker+0xb0/0xb0
[  129.720016] Code: d0 45 85 ed 0f 85 a5 00 00 00 f6 c3 20 66 90 75 7e 83
e3 10 0f 84 b5 00 00 00 48 c7 c6 18 8c 4e a0 4c 89 e7 31 c0 e8 54 92 00 00
<0f> 0b 66 90 48 8b 45 08 49 89 d1 41 89 c8 48 c7 c2 d0 4f 4e a0
[  129.720016] RIP  [<ffffffffa04726ac>] xfs_do_force_shutdown+0x9c/0x170
[xfs]
[  129.720016]  RSP <ffff88002fecdd68>
[  129.777164] ---[ end trace d335180e3c345156 ]---
[  129.777953] BUG: unable to handle kernel paging request at
ffffffffffffffd8
[  129.779087] IP: [<ffffffff81072c80>] kthread_data+0x10/0x20
[  129.779960] PGD 1c0f067 PUD 1c11067 PMD 0
[  129.780900] Oops: 0000 [#2] SMP
[  129.781150] Modules linked in: xfs(F) libcrc32c ib_srpt tcm_qla2xxx
qla2xxx tcm_loop tcm_fc libfc scsi_transport_fc scsi_tgt iscsi_target_mod
target_core_pscsi target_core_file target_core_iblock target_core_mod
configfs ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp
libiscsi_tcp libiscsi scsi_transport_iscsi nfsd nfs_acl nfs lockd fscac\
he sunrpc snd_intel8x0 snd_ac97_codec parport_pc parport ac97_bus snd_pcm
psmouse snd_timer serio_raw snd joydev evbug mac_hid pcspkr i2c_piix4
soundcore hid_generic usbhid hid e1000 ahci libahci
[  129.781150] CPU: 1 PID: 234 Comm: kworker/1:1H Tainted: GF     D
 3.14.0+ #1
[  129.781150] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox
12/01/2006
[  129.781150] task: ffff88002f598000 ti: ffff88002fecc000 task.ti:
ffff88002fecc000
[  129.781150] RIP: 0010:[<ffffffff81072c80>]  [<ffffffff81072c80>]
kthread_data+0x10/0x20
[  129.781150] RSP: 0018:ffff88002fecd9f8  EFLAGS: 00010046
[  129.781150] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffffffff81e75380
[  129.781150] RDX: 0000000000000001 RSI: 0000000000000001 RDI:
ffff88002f598000
[  129.781150] RBP: ffff88002fecd9f8 R08: 0000000000000000 R09:
0000000000000164
[  129.781150] R10: ffffffff8104c8b4 R11: 0000000000000000 R12:
0000000000000001
[  129.781150] R13: ffff88002f5984e0 R14: ffff88003d9f8000 R15:
ffff88002f5983d8
[  129.781150] FS:  0000000000000000(0000) GS:ffff88003fd00000(0000)
knlGS:0000000000000000
[  129.781150] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  129.781150] CR2: 0000000000000028 CR3: 0000000001c0e000 CR4:
00000000000006e0
[  129.781150] Stack:
[  129.781150]  ffff88002fecda18 ffffffff8106c526 ffff88003fd13140
0000000000000001
[  129.781150]  ffff88002fecda98 ffffffff816ca7cf 0000000000000001
ffff88002f598000
[  129.781150]  ffff88002f598000 0000000000013140 ffff88002fecdfd8
0000000000013140
[  129.781150] Call Trace:
[  129.781150]  [<ffffffff8106c526>] wq_worker_sleeping+0x16/0xa0
[  129.781150]  [<ffffffff816ca7cf>] __schedule+0x56f/0x7c0
[  129.781150]  [<ffffffff816caaf9>] schedule+0x29/0x70
[  129.781150]  [<ffffffff81052075>] do_exit+0x6e5/0xa70
[  129.781150]  [<ffffffff816c2303>] ? printk+0x4d/0x4f
[  129.781150]  [<ffffffff816cf0e1>] oops_end+0xa1/0xf0
[  129.781150]  [<ffffffff81006208>] die+0x58/0x90
[  129.781150]  [<ffffffff816ce93b>] do_trap+0xcb/0x170
[  129.781150]  [<ffffffff8100308d>] do_invalid_op+0xad/0x100
[  129.781150]  [<ffffffffa04726ac>] ? xfs_do_force_shutdown+0x9c/0x170
[xfs]
[  129.781150]  [<ffffffff816c2303>] ? printk+0x4d/0x4f
[  129.781150]  [<ffffffff816d8088>] invalid_op+0x18/0x20
[  129.781150]  [<ffffffffa04726ac>] ? xfs_do_force_shutdown+0x9c/0x170
[xfs]
[  129.781150]  [<ffffffffa04726ac>] ? xfs_do_force_shutdown+0x9c/0x170
[xfs]
[  129.781150]  [<ffffffffa04c8001>] ? xlog_iodone+0x91/0xa0 [xfs]
[  129.781150]  [<ffffffffa04c8001>] xlog_iodone+0x91/0xa0 [xfs]
[  129.781150]  [<ffffffffa046ac3e>] xfs_buf_iodone_work+0x5e/0xc0 [xfs]
[  129.781150]  [<ffffffff8106ad04>] process_one_work+0x174/0x4a0
[  129.781150]  [<ffffffff8106c0ab>] worker_thread+0x11b/0x370
[  129.781150]  [<ffffffff8106bf90>] ? manage_workers.isra.20+0x2d0/0x2d0
[  129.781150]  [<ffffffff81072c29>] kthread+0xc9/0xe0
[  129.781150]  [<ffffffff81072b60>] ? flush_kthread_worker+0xb0/0xb0
[  129.781150]  [<ffffffff816d6aec>] ret_from_fork+0x7c/0xb0
[  129.781150]  [<ffffffff81072b60>] ? flush_kthread_worker+0xb0/0xb0
[  129.781150] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66
2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 88 04 00 00 55 48 89 e5
<48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[  129.781150] RIP  [<ffffffff81072c80>] kthread_data+0x10/0x20
[  129.781150]  RSP <ffff88002fecd9f8>
[  129.781150] CR2: ffffffffffffffd8
[  129.781150] ---[ end trace d335180e3c345157 ]---
[  129.781150] Fixing recursive fault but reboot is needed!


> In essence, the log IO failure is triggering a shutdown, and as part
> of the shutdown process it wakes anyone waiting on a log force.
> The log quiesce code that waits for log completion during unmount
> uses a log force to ensure the log is idle before tearing down all
> the log structures and finishing the unmount. Unfortunatey, the log
> force the unmount blocks on is woken prematurely by the shutdown,
> and hence it runs before the log IO processing is completed. Hence
> the use after free.
>
> > And then I get this kind of error and the system is unresponsive:
> > Message from syslogd at debian at Apr  4 09:27:40 ...
> >  kernel:[  680.080022] BUG: soft lockup - CPU#2 stuck for 22s!
> [umount:2849]
> >
> > The problem appears to be similar to this issue:
> > http://www.spinics.net/lists/linux-xfs/msg00061.html
>
> Similar symptoms, but not the same bug.
>
> > The problem is triggered by doing an iscsi logout which causes
> > the block device to return i/o errors to xfs.
> > Steps to reproduce the problem are below.
>
> Seems similar to the thinp ENOSPC issue I mentioned above - data IO
> errors occur, then you do an unmount, which causes a log IO error
> writing the superblock, and then this happens....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david at fromorbit.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://oss.sgi.com/pipermail/xfs/attachments/20140404/ddcd860a/attachment-0001.html>


More information about the xfs mailing list