To: xfs@xxxxxxxxxxx
Subject: Internal error XFS_WANT_CORRUPTED_RETURN, Linux 3.0.12 with bulletproof-sync patch
From: Sean Thomas Caron <scaron@xxxxxxxxx>
Date: Mon, 09 Apr 2012 14:16:29 -0400
Cc: scaron@xxxxxxxxx
User-agent: Internet Messaging Program (IMP) H3 (4.3.5)
Hi all,

We've got a system running Linux 3.0.12 with Christoph's xfs-bulletproof-sync patch:


This patch changes the XFS sync code to:

 - always log the inode from ->write_inode, no matter if it was a blocking
   call or not.  This means we might stall the writeback thread on the
   inode lock for a short time, but except for that it should not cause
   problems as long as the delaylog option is enabled given that we do
   not cause additional log traffic by logging the inode multiple times
   in a single checkpoint.  This should solve issue (1)
 - add a pass to ->sync_fs to log all inodes that still have unlogged
   changes.  This should solve issue (2) at the expense of possibly
   logging inodes that have only been dirtied after the sync call was
   issued.  Given that logging the inode is cheap, and the radix tree
   gang lookup isn't livelockable this should be fine, too.


The main use of XFS on this system is as a home directory filesystem on a 40 TB Fibre Channel LUN that's hosted on a Promise vTrak NAS device.

The system has been running well and the patch has fixed the data loss issues that we were seeing on various 2.6.x kernels on crashes/shutdowns/reboots.

This weekend, I got reports of I/O errors on the home volume on this system. I took a look at the dmesg output (attached) and it's full of tracebacks with XFS_WANT_CORRUPTED_RETURN errors.

At that time, I rebooted the system and it seems to have come back up fine without (so far) any reports of filesystem damage or lost files, thank goodness.

Today, I've seen this error pop up once again in dmesg (though nobody has reported I/O errors again yet). This is roughly 24-36 hours after the reboot that followed the first occurrence:

[134031.131435] XFS: Internal error XFS_WANT_CORRUPTED_RETURN at line 341 of file fs/xfs/xfs_alloc.c. Caller 0xffffffffa017fff9
[134031.131437]
[134031.131465] Pid: 11482, comm: flush-251:6 Tainted: P            3.0.12 #1
[134031.131467] Call Trace:
[134031.131499]  [<ffffffffa01a948f>] xfs_error_report+0x3f/0x50 [xfs]
[134031.131511]  [<ffffffffa017fff9>] ? xfs_alloc_ag_vextent_size+0x489/0x660 [xfs]
[134031.131523]  [<ffffffffa017da09>] ? xfs_alloc_lookup_eq+0x19/0x20 [xfs]
[134031.131534]  [<ffffffffa017dc96>] xfs_alloc_fixup_trees+0x236/0x350 [xfs]
[134031.131546]  [<ffffffffa017fff9>] xfs_alloc_ag_vextent_size+0x489/0x660 [xfs]
[134031.131557]  [<ffffffffa0180f4d>] xfs_alloc_ag_vextent+0xad/0x100 [xfs]
[134031.131569]  [<ffffffffa01817e4>] xfs_alloc_vextent+0x2a4/0x600 [xfs]
[134031.131581]  [<ffffffffa018ba57>] xfs_bmap_btalloc+0x257/0x720 [xfs]
[134031.131586]  [<ffffffff810fbbbd>] ? mempool_alloc+0x5d/0x140
[134031.131598]  [<ffffffffa018c231>] xfs_bmap_alloc+0x21/0x40 [xfs]
[134031.131610]  [<ffffffffa0192bf0>] xfs_bmapi+0x9b0/0x1150 [xfs]
[134031.131626]  [<ffffffffa01cecf7>] ? kmem_zone_alloc+0x77/0xf0 [xfs]
[134031.131641]  [<ffffffffa01b4aee>] xfs_iomap_write_allocate+0x17e/0x350 [xfs]
[134031.131656]  [<ffffffffa01d0d91>] xfs_map_blocks+0x1d1/0x260 [xfs]
[134031.131671]  [<ffffffffa01d0fea>] xfs_vm_writepage+0x1ca/0x510 [xfs]
[134031.131674]  [<ffffffff81102397>] __writepage+0x17/0x40
[134031.131676]  [<ffffffff81103683>] write_cache_pages+0x243/0x4d0
[134031.131691]  [<ffffffffa01c6877>] ? xfs_trans_free_items+0x87/0xb0 [xfs]
[134031.131693]  [<ffffffff81102380>] ? set_page_dirty+0x70/0x70
[134031.131696]  [<ffffffff81103961>] generic_writepages+0x51/0x80
[134031.131711]  [<ffffffffa01cfa9d>] xfs_vm_writepages+0x5d/0x80 [xfs]
[134031.131713]  [<ffffffff811039b1>] do_writepages+0x21/0x40
[134031.131715]  [<ffffffff8117669e>] writeback_single_inode+0x10e/0x280
[134031.131718]  [<ffffffff81176ad3>] writeback_sb_inodes+0xe3/0x1b0
[134031.131720]  [<ffffffff81176f64>] writeback_inodes_wb+0xa4/0x170
[134031.131722]  [<ffffffff811779b3>] wb_writeback+0x2f3/0x430
[134031.131725]  [<ffffffff815a30df>] ? _raw_spin_lock_irqsave+0x2f/0x40
[134031.131727]  [<ffffffff81177d0f>] wb_do_writeback+0x21f/0x270
[134031.131729]  [<ffffffff81177e0a>] bdi_writeback_thread+0xaa/0x270
[134031.131731]  [<ffffffff81177d60>] ? wb_do_writeback+0x270/0x270
[134031.131734]  [<ffffffff81080a96>] kthread+0x96/0xa0
[134031.131737]  [<ffffffff815ac124>] kernel_thread_helper+0x4/0x10
[134031.131739]  [<ffffffff81080a00>] ? kthread_worker_fn+0x190/0x190
[134031.131741]  [<ffffffff815ac120>] ? gs_change+0x13/0x13
[134031.131745] XFS (dm-6): page discard on page ffffea000772f358, inode 0x9472fbb16, offset 0.

So far, though, I haven't seen the device start throwing I/O errors again.
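
For triaging a dmesg capture like the one above, here is a minimal sketch of how the "Internal error" lines could be pulled out and tallied by error tag and source location. The regular expression is inferred only from the single log line quoted in this message, so treat it as an assumption; other kernel versions may format the line differently. The function names are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the one log line quoted above (an assumption, not a
# documented kernel format):
# [134031.131435] XFS: Internal error XFS_WANT_CORRUPTED_RETURN at line 341
#   of file fs/xfs/xfs_alloc.c. Caller 0xffffffffa017fff9
ERROR_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+XFS.*?Internal error (?P<tag>\S+) "
    r"at line (?P<line>\d+) of file (?P<file>\S+?)\.\s+"
    r"Caller (?P<caller>0x[0-9a-fA-F]+)"
)


def parse_xfs_errors(lines):
    """Return one dict per 'Internal error' line found in an iterable of log lines."""
    errors = []
    for raw in lines:
        match = ERROR_RE.match(raw.strip())
        if match is None:
            continue  # call-trace frames and unrelated messages are skipped
        record = match.groupdict()
        record["ts"] = float(record["ts"])    # seconds since boot
        record["line"] = int(record["line"])  # source line that reported the error
        errors.append(record)
    return errors


def summarize(errors):
    """Count occurrences per (error tag, source file, source line)."""
    return Counter((e["tag"], e["file"], e["line"]) for e in errors)
```

Something like `summarize(parse_xfs_errors(open("crash_20120407.txt")))` would then show whether every traceback in the attachment comes from the same check in fs/xfs/xfs_alloc.c or from several different ones.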

Can anyone tell me anything about what may be causing this error? Is there anything I can do to provide more debugging information?

We're in the middle of a rolling upgrade to 3.0.23. Might this already be fixed in the later kernel? I haven't seen it on any 3.0.23 machines yet (though this is the first time I've seen this particular error at all).

This filesystem is at 97% capacity. Could running that close to full be what triggers this message?
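
To keep an eye on how full the volume is while this gets sorted out, a minimal sketch along these lines could be run from cron. The 95% threshold and the function names are arbitrary assumptions for illustration, and whether the fill level has anything to do with the error is still an open question. Note that the percentage here is used/total, which can differ slightly from what df reports.

```python
import shutil


def percent_used(total, used):
    """Percentage of the filesystem in use; 0.0 if the total size is unknown."""
    return 100.0 * used / total if total else 0.0


def check_fill(path, threshold=95.0):
    """Return (percent used, True if at or above the threshold) for a mount point.

    The default threshold is an arbitrary illustrative value.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    return pct, pct >= threshold
```

For example, `check_fill("/home")` would return the current fill percentage of the home volume (assuming it is mounted at /home) together with a flag saying whether it has crossed the threshold.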

How should I respond to this? I'm hoping this is more a kernel glitch than filesystem corruption...

Any assistance would be greatly appreciated.

Please send any responses to me directly, as I'm not subscribed to the list.

Thank you!

-Sean


Attachment: crash_20120407.txt
Description: Text document
