Bug 765 - xfs_force_shutdown(dm-0,0x8) called from line 1091 of file fs/xfs/xfs_trans.c
Status: RESOLVED WORKSFORME
Product: XFS
Classification: Unclassified
Component: XFS kernel code
Version: unspecified
Hardware: PC Linux
Importance: P1 blocker
Target Milestone: ---
Assigned To: XFS power people
Depends on:
Blocks:
Reported: 2007-08-13 17:23 CDT by Peter Nealy
Modified: 2009-06-22 06:11 CDT

Description Peter Nealy 2007-08-13 17:23:46 CDT
Kernel: 2.6.12.6 with some tweaks but none to xfs.

We are repeatedly getting the following on our NAS appliances:

Jun  1 10:18:46 localhost kernel: xfs_force_shutdown(dm-0,0x8) called from line
1091 of file fs/xfs/xfs_trans.c.  Return address = 0xf9003a1b
Jun  1 10:18:46 localhost kernel: Filesystem "dm-0": Corruption of in-memory
data detected.  Shutting down filesystem: dm-0
Jun  1 10:18:46 localhost kernel: Please umount the filesystem, and rectify the
problem(s)

Jun 21 10:39:06 localhost kernel: xfs_force_shutdown(dm-0,0x8) called from line
1091 of file fs/xfs/xfs_trans.c.  Return address = 0xf9003a1b
Jun 21 10:39:06 localhost kernel: Filesystem "dm-0": Corruption of in-memory
data detected.  Shutting down filesystem: dm-0
Jun 21 10:39:06 localhost kernel: Please umount the filesystem, and rectify the
problem(s)

Our implementation runs XFS on top of LVM.  We're unsure what is going on, but
running xfs_repair does not remedy the situation.  We cannot make them fail at
will, but the time between failures can be as short as two weeks, which is
unacceptable.

We've had more than one customer site fail in exactly the same way.  Many of
these customers use our NAS appliance with tools such as Adobe Premiere for
video editing.

What can we do up front to provide you with more information about the failure?
These are systems out in the field (we are unable to reproduce the failure
here), so are there any switches we can set in the fs layer to turn on more
debugging?
Comment 1 Eric Sandeen 2007-08-14 12:25:50 CDT
Looks like it's canceling a dirty transaction:

  1084          /*
  1085           * See if the caller is relying on us to shut down the
  1086           * filesystem.  This happens in paths where we detect
  1087           * corruption and decide to give up.
  1088           */
  1089          if ((tp->t_flags & XFS_TRANS_DIRTY) &&
  1090              !XFS_FORCED_SHUTDOWN(tp->t_mountp))
  1091                  xfs_force_shutdown(tp->t_mountp, XFS_CORRUPT_INCORE);

2.6.12 is awfully old, but I understand that it's an embedded product.

There may have been a fix for this since then, it rings a bell, but I don't
recall offhand.

You could set the panic mask sysctl:

  fs.xfs.panic_mask             (Min: 0  Default: 0  Max: 127)
        Causes certain error conditions to call BUG(). Value is a bitmask;
        OR together the tags which represent errors which should cause panics:

                XFS_NO_PTAG                     0
                XFS_PTAG_IFLUSH                 0x00000001
                XFS_PTAG_LOGRES                 0x00000002
                XFS_PTAG_AILDELETE              0x00000004
                XFS_PTAG_ERROR_REPORT           0x00000008
                XFS_PTAG_SHUTDOWN_CORRUPT       0x00000010
                XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
                XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040

to trip a panic when you get a shutdown; you could then get a BUG, backtrace,
and perhaps a dump at the moment shutdown was called... though I don't know if
that's feasible in the field.

-Eric
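
As a rough illustration of how the panic_mask value is built, the sketch below ORs tag values together in a POSIX shell. The tag values are copied from the list in comment 1; the helper name xfs_panic_mask is made up for this example and is not part of any XFS tooling.

```shell
#!/bin/sh
# Tag values as listed in the fs.xfs.panic_mask description (comment 1).
XFS_PTAG_IFLUSH=$((0x01))
XFS_PTAG_LOGRES=$((0x02))
XFS_PTAG_AILDELETE=$((0x04))
XFS_PTAG_ERROR_REPORT=$((0x08))
XFS_PTAG_SHUTDOWN_CORRUPT=$((0x10))
XFS_PTAG_SHUTDOWN_IOERROR=$((0x20))
XFS_PTAG_SHUTDOWN_LOGERROR=$((0x40))

# Hypothetical helper: OR together any number of tag values and
# print the resulting mask in decimal.
xfs_panic_mask() {
    mask=0
    for tag in "$@"; do
        mask=$(( $mask | $tag ))
    done
    echo "$mask"
}

# Panic on any of the three shutdown types: 0x10 | 0x20 | 0x40.
SHUTDOWN_MASK=$(xfs_panic_mask \
    "$XFS_PTAG_SHUTDOWN_CORRUPT" \
    "$XFS_PTAG_SHUTDOWN_IOERROR" \
    "$XFS_PTAG_SHUTDOWN_LOGERROR")
echo "fs.xfs.panic_mask = $SHUTDOWN_MASK"
```

The printed value could then be written to /proc/sys/fs/xfs/panic_mask as root. The script itself only computes and prints the number; it does not touch the sysctl.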
Comment 2 Peter Nealy 2007-08-14 20:24:52 CDT
Thanks Eric for responding so soon.  I'll have them set the panic mask sysctl,
but will they be able to relay that information to me easily when the failure
occurs? 

They're not running a kdb-enabled kernel, and since we are not on site, that
would be difficult anyway.  Is this information written to /var/log/messages or
somewhere similar that we can access after they reboot the appliance?
Comment 3 Eric Sandeen 2007-08-14 22:22:58 CDT
If it triggers a BUG() you might get it in /var/log/messages or maybe only as
far as the console.  I don't know what you guys have in place for remote
debugging or post-mortems... that part is up to you I think.  :)

-Eric
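
For post-mortem collection, one possible approach is to pull the shutdown events out of a saved syslog file after the appliance reboots. The sketch below assumes each kernel message sits on a single line in the log (the wrapping in the description looks like an artifact of the bug tracker) and that the message format matches the one quoted in the description. The function name xfs_shutdown_events is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "device flags file:line" for every
# xfs_force_shutdown message found in the given log files, or on
# standard input when no file is given.
xfs_shutdown_events() {
    sed -n 's/.*xfs_force_shutdown(\([^,]*\),\(0x[0-9a-fA-F]*\)) called from line \([0-9]*\) of file \([^ ]*\)\..*/\1 \2 \4:\3/p' "$@"
}

# Example using one of the messages from the description.
printf '%s\n' 'Jun  1 10:18:46 localhost kernel: xfs_force_shutdown(dm-0,0x8) called from line 1091 of file fs/xfs/xfs_trans.c.  Return address = 0xf9003a1b' \
    | xfs_shutdown_events
```

On an appliance this might be run as `xfs_shutdown_events /var/log/messages`. Whether a BUG() backtrace reaches that file at all depends on how far logging gets before the machine stops, as noted in comment 3.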
Comment 4 Peter Nealy 2007-11-17 14:39:06 CST
We're back chasing this again.  Is there a reliable way to force a crashdump of
xfs so we can ensure our kdb/collection are working for us?
Comment 5 Dave Chinner 2007-11-18 15:40:16 CST
See Eric's original reply (comment #1), i.e.
 
'echo 255 > /proc/sys/fs/xfs/panic_mask'
 
will cause the machine to panic on detection of the problem.
 
FWIW, are the machines at/near ENOSPC when this shutdown occurs? 
Comment 6 Peter Nealy 2007-11-18 15:43:56 CST
Yes, I have the panic_mask set to all flags on every bootup.

Filesystems can be as little as 7% full when they fail, so they are not near
ENOSPC.

We're now putting 2.6.21, with all of our special sauce, on them.  KDB is in
place and on by default.

My hope is that we never see a failure again because, if it is even an XFS
issue, the problem has been fixed since 2.6.12.6.

Will keep you informed.
Comment 7 Eric Sandeen 2007-11-18 17:31:56 CST
If you want to test crash collection, echo c > /proc/sysrq-trigger should start
a crash sequence.
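
Because writing 'c' to the sysrq trigger crashes the machine immediately, a small guard around it may be useful when testing crash collection. The sketch below is only an illustration: the function name trigger_test_crash, the --really-crash argument, and the SYSRQ_TRIGGER override (handy for trying the script against an ordinary file) are all made up for this example.

```shell
#!/bin/sh
# Hypothetical wrapper around "echo c > /proc/sysrq-trigger".
# Without --really-crash it only reports what it would do.
trigger_test_crash() {
    # SYSRQ_TRIGGER is an illustrative override, not a kernel setting.
    trigger=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
    if [ "$1" != "--really-crash" ]; then
        echo "dry run: would write 'c' to $trigger"
        return 0
    fi
    if [ ! -w "$trigger" ]; then
        echo "cannot write to $trigger (not root, or sysrq support missing)" >&2
        return 1
    fi
    sync                  # flush dirty data before the deliberate crash
    echo c > "$trigger"
}

# Safe by default: prints the dry-run message only.
trigger_test_crash
```

Running `trigger_test_crash --really-crash` as root on a test appliance should then exercise whatever dump collection is configured.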
Comment 8 Eric Sandeen 2007-11-18 17:35:52 CST
Also look into netdump, diskdump, kdump, or whatnot for collecting a core from
the field (I suppose in an embedded product, diskdump or kdump would be the
options... though if this is an arm box, I don't know if such things work there).
Comment 9 Christoph Hellwig 2008-12-27 03:41:45 CST
Did you guys make any progress on this one?
Comment 10 Christoph Hellwig 2009-06-22 06:11:29 CDT
Closed due to lack of feedback.  Please re-open if you have new information.