Hi all,
We've got a system running Linux 3.0.12 with Christoph's
xfs-bulletproof-sync patch, described as follows (a rough sketch of the
->sync_fs pass it adds is included after the description):
This patch changes the XFS sync code to:
- always log the inode from ->write_inode, no matter if it was a blocking
call or not. This means we might stall the writeback thread on the
inode lock for a short time, but except for that it should not cause
problems as long as the delaylog option is enabled given that we do
not cause additional log traffic by logging the inode multiple times
in a single checkpoint. This should solve issue (1).
- add a pass to ->sync_fs to log all inodes that still have unlogged
changes. This should solve issue (2) at the expense of possibly
logging inodes that have only been dirtied after the sync call was
issued. Given that logging the inode is cheap, and the radix tree
gang lookup isn't livelockable, this should be fine, too.
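For reference, the ->sync_fs pass works roughly like the sketch below. This
is my paraphrase of the 3.0-era XFS code, not the patch itself, so the helper
and flag names (xfs_log_dirty_inode, XFS_TRANS_FSYNC_TS, i_update_core,
xfs_inode_ag_iterator) are assumptions: each inode that still has unlogged
in-core changes is joined to a small transaction and logged, and the per-AG
radix-tree iterator drives the walk over all in-core inodes.

    /*
     * Sketch only -- paraphrased from the 3.0-era XFS code, not the exact
     * patch.  Assumes the usual kernel-internal xfs headers (xfs.h,
     * xfs_trans.h, xfs_inode.h, ...).  Log an inode that still has
     * unlogged in-core changes (i_update_core) via a small transaction.
     */
    static int
    xfs_log_dirty_inode(
            struct xfs_inode        *ip,
            struct xfs_perag        *pag,
            int                     flags)
    {
            struct xfs_mount        *mp = ip->i_mount;
            struct xfs_trans        *tp;
            int                     error;

            if (!ip->i_update_core)
                    return 0;               /* nothing unlogged, skip */

            tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
            error = xfs_trans_reserve(tp, 0, XFS_FSYNC_TS_LOG_RES(mp),
                                      0, 0, 0);
            if (error) {
                    xfs_trans_cancel(tp, 0);
                    return error;
            }

            xfs_ilock(ip, XFS_ILOCK_EXCL);
            xfs_trans_ijoin_ref(tp, ip, XFS_ILOCK_EXCL);
            xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
            return xfs_trans_commit(tp, 0); /* commit releases the ilock */
    }

    /*
     * Invoked from the ->sync_fs / quiesce-data path; the per-AG
     * radix-tree gang lookup visits every in-core inode:
     *
     *         xfs_inode_ag_iterator(mp, xfs_log_dirty_inode, 0);
     */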
The main use of XFS on this system is as a home directory filesystem
on a 40 TB Fibre Channel LUN that's hosted on a Promise vTrak NAS
device.
The system has been running well, and the patch has fixed the data loss
issues we were seeing on various 2.6.x kernels during
crashes/shutdowns/reboots.
This weekend, I got reports of I/O errors on the home volume on this
system. I took a look at the dmesg (attached), and it's full of
tracebacks with XFS_WANT_CORRUPTED_RETURN errors.
At that time, I rebooted the system and it seems to have come back up
fine without (so far) any reports of filesystem damage or lost files,
thank goodness.
Today, I've seen this error pop up again in dmesg (though no one has
reported I/O errors again yet). This is roughly 24-36 hours after we
first rebooted the system due to this error:
[134031.131435] XFS: Internal error XFS_WANT_CORRUPTED_RETURN at line 341 of file fs/xfs/xfs_alloc.c. Caller 0xffffffffa017fff9
[134031.131437]
[134031.131465] Pid: 11482, comm: flush-251:6 Tainted: P 3.0.12 #1
[134031.131467] Call Trace:
[134031.131499] [<ffffffffa01a948f>] xfs_error_report+0x3f/0x50 [xfs]
[134031.131511] [<ffffffffa017fff9>] ? xfs_alloc_ag_vextent_size+0x489/0x660 [xfs]
[134031.131523] [<ffffffffa017da09>] ? xfs_alloc_lookup_eq+0x19/0x20 [xfs]
[134031.131534] [<ffffffffa017dc96>] xfs_alloc_fixup_trees+0x236/0x350 [xfs]
[134031.131546] [<ffffffffa017fff9>] xfs_alloc_ag_vextent_size+0x489/0x660 [xfs]
[134031.131557] [<ffffffffa0180f4d>] xfs_alloc_ag_vextent+0xad/0x100 [xfs]
[134031.131569] [<ffffffffa01817e4>] xfs_alloc_vextent+0x2a4/0x600 [xfs]
[134031.131581] [<ffffffffa018ba57>] xfs_bmap_btalloc+0x257/0x720 [xfs]
[134031.131586] [<ffffffff810fbbbd>] ? mempool_alloc+0x5d/0x140
[134031.131598] [<ffffffffa018c231>] xfs_bmap_alloc+0x21/0x40 [xfs]
[134031.131610] [<ffffffffa0192bf0>] xfs_bmapi+0x9b0/0x1150 [xfs]
[134031.131626] [<ffffffffa01cecf7>] ? kmem_zone_alloc+0x77/0xf0 [xfs]
[134031.131641] [<ffffffffa01b4aee>] xfs_iomap_write_allocate+0x17e/0x350 [xfs]
[134031.131656] [<ffffffffa01d0d91>] xfs_map_blocks+0x1d1/0x260 [xfs]
[134031.131671] [<ffffffffa01d0fea>] xfs_vm_writepage+0x1ca/0x510 [xfs]
[134031.131674] [<ffffffff81102397>] __writepage+0x17/0x40
[134031.131676] [<ffffffff81103683>] write_cache_pages+0x243/0x4d0
[134031.131691] [<ffffffffa01c6877>] ? xfs_trans_free_items+0x87/0xb0 [xfs]
[134031.131693] [<ffffffff81102380>] ? set_page_dirty+0x70/0x70
[134031.131696] [<ffffffff81103961>] generic_writepages+0x51/0x80
[134031.131711] [<ffffffffa01cfa9d>] xfs_vm_writepages+0x5d/0x80 [xfs]
[134031.131713] [<ffffffff811039b1>] do_writepages+0x21/0x40
[134031.131715] [<ffffffff8117669e>] writeback_single_inode+0x10e/0x280
[134031.131718] [<ffffffff81176ad3>] writeback_sb_inodes+0xe3/0x1b0
[134031.131720] [<ffffffff81176f64>] writeback_inodes_wb+0xa4/0x170
[134031.131722] [<ffffffff811779b3>] wb_writeback+0x2f3/0x430
[134031.131725] [<ffffffff815a30df>] ? _raw_spin_lock_irqsave+0x2f/0x40
[134031.131727] [<ffffffff81177d0f>] wb_do_writeback+0x21f/0x270
[134031.131729] [<ffffffff81177e0a>] bdi_writeback_thread+0xaa/0x270
[134031.131731] [<ffffffff81177d60>] ? wb_do_writeback+0x270/0x270
[134031.131734] [<ffffffff81080a96>] kthread+0x96/0xa0
[134031.131737] [<ffffffff815ac124>] kernel_thread_helper+0x4/0x10
[134031.131739] [<ffffffff81080a00>] ? kthread_worker_fn+0x190/0x190
[134031.131741] [<ffffffff815ac120>] ? gs_change+0x13/0x13
[134031.131745] XFS (dm-6): page discard on page ffffea000772f358, inode 0x9472fbb16, offset 0.
I haven't seen the device itself start throwing I/O errors yet, though.
Can anyone tell me anything about what may be causing this error? Is
there anything I can do to provide more debugging information?
We're in the middle of a rolling upgrade to 3.0.23; maybe this is
fixed in the later kernel? I haven't seen it on any 3.0.23 machines
yet (though this is also the first time I've seen this particular
error at all).
This filesystem is at 97% capacity; could that be contributing to
this error?
How should I respond to this? I'm hoping this is more a kernel glitch
than filesystem corruption...
Any assistance would be greatly appreciated.
Please send any responses to myself directly as I'm not subscribed to
the list.
Thank you!
-Sean
[Attachment: crash_20120407.txt]