To: linux-xfs@xxxxxxxxxxx
Subject: Corruption of in-memory data detected.
From: "Jeffrey W. Baker" <jwbaker@xxxxxxx>
Date: Mon, 09 Feb 2004 17:15:02 -0800
Sender: linux-xfs-bounce@xxxxxxxxxxx

We are running a system on 2.4 kernels: previously the XFS 2.4.23 tree,
currently mainline 2.4.25-rc1.  We have no additional patches applied to
either kernel.  Our storage is a SCSI-SATA RAID controller attached to the
host via an Adaptec 39320D HBA, driver revision 1.3.10.  Our XFS filesystem
is a 700GB volume which we use for local databases and also export via
NFS.

We can fairly reliably reproduce this message (kernel 2.4.23-xfs):

xfs_force_shutdown(sd(8,1),0x8) called from line 1070 of file xfs_trans.c.  Return address = 0xc0229d9c
Filesystem "sd(8,1)": Corruption of in-memory data detected.  Shutting down filesystem: sd(8,1)
Please umount the filesystem, and rectify the problem(s)

There are *no* I/O errors preceding this message.  After rebooting the
system, xfs_check shows errors and xfs_repair claims to repair them.
Some files turn up in lost+found.
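
For the record, the check/repair pass after a shutdown is nothing unusual,
roughly the following (sd(8,1) should correspond to /dev/sda1; the mount
point below is just a placeholder for ours):

  umount /dev/sda1
  xfs_check /dev/sda1      # reports the errors
  xfs_repair /dev/sda1     # claims to fix them; stray files land in lost+found
  mount /dev/sda1 /mnt     # /mnt stands in for our actual mount point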

We can cause this error fairly reliably.  We have also seen this dump
from the kernel:

Feb  7 06:39:11 prime kernel: 0x0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Feb  7 06:39:11 prime kernel: Filesystem "sd(8,1)": XFS internal error xfs_da_do_buf(2) at line 2273 of file xfs_da_btree.c.  Caller 0xc01dfa17
Feb  7 06:39:11 prime kernel: f0e4bcb0 c01df425 c0364e3d 00000001 f7626400 c0364d7f 000008e1 c01dfa17
Feb  7 06:39:11 prime kernel:        c01dfa17 f0e4bd18 00000000 007fffbf 00000000 00000041 00000003 0100032a
Feb  7 06:39:11 prime kernel:        00000018 00000000 0000002d 00000001 00000000 f7626400 f0e4bd34 00000001
Feb  7 06:39:11 prime kernel: Call Trace: [<c01df425>]  [<c01dfa17>]  [<c01dfa17>]  [<c01dfa17>]  [<c01e67bf>]  [<c01e67bf>]  [<c01e28e0>]  [<c01e28e0>]  [<c01e20e8>]  [<c01e28e0>]  [<c02192e0>]  [<c02220f9>]  [<c0151b6e>]  [<c0152270>]  [<c01523db>]  [<c0152270>]  [<c0135065>]  [<c010762f>]
Feb  7 19:19:53 prime kernel: 0x0: 1f 8b 08 08 6e 9b 09 40 00 03 32 33 37 30 36 30
Feb  7 19:19:53 prime kernel: Filesystem "sd(8,1)": XFS internal error xfs_da_do_buf(2) at line 2273 of file xfs_da_btree.c.  Caller 0xc01dfa17
Feb  7 19:19:53 prime kernel: f760dcb0 c01df425 c0364e3d 00000001 f7626400 c0364d7f 000008e1 c01dfa17
Feb  7 19:19:53 prime kernel:        c01dfa17 f760dd18 00000000 007fffbf 00000000 00000041 00000003 0100032a
Feb  7 19:19:53 prime kernel:        00000018 00000000 0000002d 00000001 00000000 f7626400 f760dd34 00000001
Feb  7 19:19:53 prime kernel: Call Trace: [<c01df425>]  [<c01dfa17>]  [<c01dfa17>]  [<c01dfa17>]  [<c01e67bf>]  [<c01e67bf>]  [<c01e28e0>]  [<c01e28e0>]  [<c01e20e8>]  [<c01e28e0>]  [<c02192e0>]  [<c02220f9>]  [<c0286d38>]  [<c0151b6e>]  [<c0152270>]  [<c01523db>]  [<c0152270>]  [<c0135065>]  [<c010762f>]

ksymoops translates these to roughly the following (code lines omitted):

Trace; c01df425 <xfs_da_do_buf+395/8d0>
Trace; c01dfa17 <xfs_da_read_buf+57/60>
Trace; c01dfa17 <xfs_da_read_buf+57/60>
Trace; c01dfa17 <xfs_da_read_buf+57/60>
Trace; c01e67bf <xfs_dir2_leaf_getdents+38f/b80>
Trace; c01e67bf <xfs_dir2_leaf_getdents+38f/b80>
Trace; c01e28e0 <xfs_dir2_put_dirent64_direct+0/c0>
Trace; c01e28e0 <xfs_dir2_put_dirent64_direct+0/c0>
Trace; c01e20e8 <xfs_dir2_getdents+f8/150>
Trace; c01e28e0 <xfs_dir2_put_dirent64_direct+0/c0>
Trace; c02192e0 <xfs_readdir+60/c0>
Trace; c02220f9 <linvfs_readdir+109/220>
Trace; c0151b6e <vfs_readdir+ae/110>
Trace; c0152270 <filldir64+0/110>
Trace; c01523db <sys_getdents64+5b/c0>
Trace; c0152270 <filldir64+0/110>
Trace; c0135065 <sys_mremap+55/80>
Trace; c010762f <system_call+33/38>

and:

Trace; c01df425 <xfs_da_do_buf+395/8d0>
Trace; c01dfa17 <xfs_da_read_buf+57/60>
Trace; c01dfa17 <xfs_da_read_buf+57/60>
Trace; c01dfa17 <xfs_da_read_buf+57/60>
Trace; c01e67bf <xfs_dir2_leaf_getdents+38f/b80>
Trace; c01e67bf <xfs_dir2_leaf_getdents+38f/b80>
Trace; c01e28e0 <xfs_dir2_put_dirent64_direct+0/c0>
Trace; c01e28e0 <xfs_dir2_put_dirent64_direct+0/c0>
Trace; c01e20e8 <xfs_dir2_getdents+f8/150>
Trace; c01e28e0 <xfs_dir2_put_dirent64_direct+0/c0>
Trace; c02192e0 <xfs_readdir+60/c0>
Trace; c02220f9 <linvfs_readdir+109/220>
Trace; c0286d38 <scsi_finish_command+98/c0>
Trace; c0151b6e <vfs_readdir+ae/110>
Trace; c0152270 <filldir64+0/110>
Trace; c01523db <sys_getdents64+5b/c0>
Trace; c0152270 <filldir64+0/110>
Trace; c0135065 <sys_mremap+55/80>
Trace; c010762f <system_call+33/38>
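
For reference, the decoding above comes from feeding the raw oops text
through ksymoops, roughly along these lines (the System.map path and the
file name are just placeholders):

  ksymoops -m /boot/System.map < oops.txt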

To be fair, we have had SCSI errors on this device before, so device
corruption is not out of the question.  But we have also seen that after
repairing the FS and testing again, we can corrupt the filesystem
without ever seeing a single hardware I/O error.  So what started as an
attempt to nail down a SCSI bus problem now seems to have narrowed down
to a purely software problem.

We reproduce the problem by running three parallel postgresql bulk loads
(using pg_dump | psql) where the source and target databases, as well as
the transaction logs, all live on the XFS volume in question.  In
parallel with that, we run find -type f | xargs cat > /dev/null, which
should exercise the metadata by traversing the entire FS, which holds
several million files.  My impression is that the proximate cause of the
corruption is the synchronous writes done by postgresql.  I don't have
hard evidence for that; it just seems that starting a postgresql
operation can trigger the problem.
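
Concretely, the workload looks roughly like the following sketch (the
database names and the mount point are placeholders, not our real ones):

  cd /data                                  # the 700GB XFS mount point
  for db in db1 db2 db3; do
      pg_dump "$db" | psql "${db}_copy" &   # three bulk loads in parallel
  done
  find . -type f | xargs cat > /dev/null &  # walk every file to exercise metadata
  wait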

I am more than willing to provide as much additional information as is
needed to help diagnose the problem.  I can't collect xfs_info output
right now, as xfs_repair is still slogging through a lengthy Phase 5.

-jwb

