Bugzilla – Bug 765
xfs_force_shutdown(dm-0,0x8) called from line 1091 of file fs/xfs/xfs_trans.c
Last modified: 2009-06-22 06:11:29 CDT
Kernel: 2.6.12.6 with some tweaks, but none to XFS.

We are getting the following repeatedly on our NAS appliances:

Jun 1 10:18:46 localhost kernel: xfs_force_shutdown(dm-0,0x8) called from line 1091 of file fs/xfs/xfs_trans.c. Return address = 0xf9003a1b
Jun 1 10:18:46 localhost kernel: Filesystem "dm-0": Corruption of in-memory data detected. Shutting down filesystem: dm-0
Jun 1 10:18:46 localhost kernel: Please umount the filesystem, and rectify the problem(s)
Jun 21 10:39:06 localhost kernel: xfs_force_shutdown(dm-0,0x8) called from line 1091 of file fs/xfs/xfs_trans.c. Return address = 0xf9003a1b
Jun 21 10:39:06 localhost kernel: Filesystem "dm-0": Corruption of in-memory data detected. Shutting down filesystem: dm-0
Jun 21 10:39:06 localhost kernel: Please umount the filesystem, and rectify the problem(s)

Our implementation runs XFS on top of LVM. We're unsure what is going on, but running xfs_repair does not remedy the situation. We cannot make the systems fail at will, but the time between failures can be as short as two weeks, which is unacceptable. We've had more than one customer site fail in exactly the same way. Many of these customers use our NAS appliance with tools such as Adobe Premiere for video editing.

What can we do up front to provide you with more information about the failure? These are systems out in the field (we are unable to recreate the failure here), so are there any switches we can set in the fs layer to turn on more debugging?
Looks like it's canceling a dirty transaction:

1084         /*
1085          * See if the caller is relying on us to shut down the
1086          * filesystem. This happens in paths where we detect
1087          * corruption and decide to give up.
1088          */
1089         if ((tp->t_flags & XFS_TRANS_DIRTY) &&
1090             !XFS_FORCED_SHUTDOWN(tp->t_mountp))
1091                 xfs_force_shutdown(tp->t_mountp, XFS_CORRUPT_INCORE);

2.6.12 is awfully old, but I understand that it's an embedded product. There may have been a fix for this since then; it rings a bell, but I don't recall offhand.

You could set the panic mask sysctl:

fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127)
    Causes certain error conditions to call BUG(). Value is a bitmask; OR together the tags which represent errors which should cause panics:

    XFS_NO_PTAG                  0
    XFS_PTAG_IFLUSH              0x00000001
    XFS_PTAG_LOGRES              0x00000002
    XFS_PTAG_AILDELETE           0x00000004
    XFS_PTAG_ERROR_REPORT        0x00000008
    XFS_PTAG_SHUTDOWN_CORRUPT    0x00000010
    XFS_PTAG_SHUTDOWN_IOERROR    0x00000020
    XFS_PTAG_SHUTDOWN_LOGERROR   0x00000040

to trip a panic when you get a shutdown; you could then get a BUG, backtrace, and perhaps a dump at the moment shutdown was called... though I don't know if that's feasible in the field.

-Eric
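As a rough illustration of how the panic_mask bitmask is built, the sketch below ORs together the tag values listed in the comment above and prints the resulting decimal value. The helper names ptag_value and panic_mask are made up for this example; only the tag names and their values come from the sysctl description, and the choice of the three shutdown tags is just one plausible selection.

```shell
#!/bin/sh
# Hypothetical helpers; tag values are copied from the
# fs.xfs.panic_mask description quoted in this bug.

# Map a tag name to its numeric value (decimal form of the hex values).
ptag_value() {
    case "$1" in
        XFS_PTAG_IFLUSH)            echo 1 ;;
        XFS_PTAG_LOGRES)            echo 2 ;;
        XFS_PTAG_AILDELETE)         echo 4 ;;
        XFS_PTAG_ERROR_REPORT)      echo 8 ;;
        XFS_PTAG_SHUTDOWN_CORRUPT)  echo 16 ;;
        XFS_PTAG_SHUTDOWN_IOERROR)  echo 32 ;;
        XFS_PTAG_SHUTDOWN_LOGERROR) echo 64 ;;
        *) echo "unknown tag: $1" >&2; return 1 ;;
    esac
}

# OR the named tags together and print the mask in decimal.
panic_mask() {
    mask=0
    for tag in "$@"; do
        value=$(ptag_value "$tag") || return 1
        mask=$(( mask | value ))
    done
    echo "$mask"
}

# Panic on any of the three shutdown conditions: 16 | 32 | 64 = 112.
panic_mask XFS_PTAG_SHUTDOWN_CORRUPT XFS_PTAG_SHUTDOWN_IOERROR XFS_PTAG_SHUTDOWN_LOGERROR
```

The printed value could then be applied as root with something like 'sysctl -w fs.xfs.panic_mask=112', or by echoing it into /proc/sys/fs/xfs/panic_mask.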
Thanks, Eric, for responding so soon. I'll have them set the panic mask sysctl, but will they be able to relay that information to me easily when the failure occurs? They're not running a kdb-enabled kernel, and since we are not there locally, using one would be difficult anyway. Is this information written to /var/log/messages or something similar that they can access after they reboot the appliance?
If it triggers a BUG() you might get it in /var/log/messages, or it might only get as far as the console. I don't know what you guys have in place for remote debugging or post-mortems... that part is up to you, I think. :) -Eric
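For post-mortem collection after a reboot, a minimal sketch like the following could pull the shutdown messages out of a saved log. It assumes the messages made it into a syslog-style file such as /var/log/messages (which, as noted above, is not guaranteed); the function names are hypothetical, and the patterns are taken from the log lines in the original report.

```shell
#!/bin/sh
# Hypothetical helpers for scanning a syslog-style file for the
# XFS forced-shutdown messages quoted in this bug report.

# Print every line that belongs to an XFS forced shutdown.
xfs_shutdown_events() {
    grep -E 'xfs_force_shutdown\(|Corruption of in-memory data detected' "$1"
}

# Print how many forced shutdowns the log records.
xfs_shutdown_count() {
    grep -c 'xfs_force_shutdown(' "$1"
}
```

For example, 'xfs_shutdown_events /var/log/messages' would list the matching lines, and 'xfs_shutdown_count /var/log/messages' would report how many shutdowns were logged.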
We're back chasing this again. Is there a reliable way to force a crashdump of xfs so we can ensure our kdb/collection are working for us?
See Eric's original reply (comment #1), i.e. 'echo 255 > /proc/sys/fs/xfs/panic_mask' will cause the machine to panic on detection of the problem. FWIW, are the machines at/near ENOSPC when this shutdown occurs?
Yes, I have the panic_mask set to all flags on every bootup. Filesystems can be as little as 7% full when they fail, so they are not near ENOSPC. We're actually putting 2.6.21 with all of our special sauce on them at that level. KDB is in place and on by default. My hope is we never see a failure again because, if it is even an XFS issue, the problem has been fixed since 2.6.12.6. Will keep you informed.
If you want to test crash collection, 'echo c > /proc/sysrq-trigger' should start a crash sequence.
Also look into netdump, diskdump, kdump, or whatnot for collecting a core from the field (I suppose in an embedded product, diskdump or kdump would be the options... though if this is an ARM box, I don't know if such things work there).
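A cautious way to rehearse the sysrq crash test is sketched below. By default it only prints what it would do; the '--really-crash' argument is a made-up flag for this example, and passing it as root crashes the machine immediately, so it should only be used on a box where a dump collector is already configured and data loss is acceptable.

```shell
#!/bin/sh
# Hypothetical wrapper around the sysrq crash trigger.
# DANGER: with --really-crash this panics the machine on purpose.
crash_test() {
    mode=${1:-dry-run}
    if [ "$mode" = "--really-crash" ]; then
        sync                              # flush dirty data first
        echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
        echo c > /proc/sysrq-trigger      # trigger the crash
    else
        echo "would run: echo 1 > /proc/sys/kernel/sysrq"
        echo "would run: echo c > /proc/sysrq-trigger"
    fi
}

# Default invocation is a dry run and changes nothing.
crash_test
```

If the dump tooling is working, the machine should come back up with a core or crash log to inspect after the real run.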
Did you guys make any progress on this one?
Closed due to lack of feedback. Please re-open if you have new information.