
[xfs-masters] [Bug 720] New: xfs_da_do_buf error under load on Core 5 current kernel - XFS over LVM on 3Ware 9500 HW RAID

To: xfs-master@xxxxxxxxxxx
Subject: [xfs-masters] [Bug 720] New: xfs_da_do_buf error under load on Core 5 current kernel - XFS over LVM on 3Ware 9500 HW RAID
From: bugzilla-daemon@xxxxxxxxxxx
Date: Mon, 18 Sep 2006 13:47:39 -0700
Reply-to: xfs-masters@xxxxxxxxxxx
Sender: xfs-masters-bounce@xxxxxxxxxxx
http://oss.sgi.com/bugzilla/show_bug.cgi?id=720

           Summary: xfs_da_do_buf error under load on Core 5 current kernel
                    - XFS over LVM on 3Ware 9500 HW RAID
           Product: Linux XFS
           Version: unspecified
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: XFS kernel code
        AssignedTo: xfs-master@xxxxxxxxxxx
        ReportedBy: blackavr@xxxxxxxxxxxxx


This is a reasonably large box with large devices:
2x 7-drive 3000GB RAID5 arrays, built from 16x500GB drives on two 3Ware 9500
controllers. Write cache is active on the 3Ware cards. (I know, I know!)
2GB ECC memory
2x 2.8GHz Xeons on a Tyan server board.
We're also running vm/swappiness=0 and vm/min_free_kbytes=32768, since those
settings avoided some truly ugly XFS-related crashes on the previous Core 3
system this box replaces.
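For clarity, those two VM settings correspond to a sysctl fragment like this
(same values as above):

```
# /etc/sysctl.conf excerpt matching the settings described above
vm.swappiness = 0
vm.min_free_kbytes = 32768
```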

The filesystems are all under 1.5TB, though, to avoid the "can't repair the
filesystem on a 32-bit platform if it's too big" issue, and most are under
500GB. dm-16, where the error occurred, is 230GB. The filesystems are built on
LVM logical volumes (4 PVs per raidset), with the filesystems in this group
striped across the 3Ware cards. A similar error has occurred on a single
smaller filesystem on a test machine as well; I can add those messages. This
system was built two days before the failure and has always run
2.6.17-1.2157_FC5smp: it was kickstarted with that kernel, and the storage
volumes and filesystems were built under that kernel.
The access pattern on this filesystem is as follows: it's a queue filesystem.
Postfix is dumping a number of files into it via the pipe delivery mechanism,
and our code is creating those files using mktemp. Another process runs as a
daemon watching the filesystem; when it finds a new file, it opens it, reads
it, and copies it off to another location. Other filesystems on the machine
are also under load, including database access. However, the database
filesystem is in a different volume group, and /var/spool/postfix is on a
separate filesystem in another volume group.
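For context, the consumer side of that pattern is roughly the following. This
is a minimal sketch, not our actual daemon; the directory paths and the
drain_queue name are made up for illustration:

```python
import os
import shutil

QUEUE_DIR = "/queue/incoming"   # hypothetical queue filesystem path
DEST_DIR = "/queue/processed"   # hypothetical copy-off destination

def drain_queue(queue_dir=QUEUE_DIR, dest_dir=DEST_DIR):
    """Scan the queue directory once; open, read, and copy off each file."""
    for name in sorted(os.listdir(queue_dir)):
        src = os.path.join(queue_dir, name)
        if not os.path.isfile(src):
            continue
        with open(src, "rb") as f:
            data = f.read()                      # read the whole message
        with open(os.path.join(dest_dir, name), "wb") as out:
            out.write(data)                      # copy to the other location
        os.unlink(src)                           # remove it from the queue
```

The real daemon runs this kind of scan continuously, so the directory sees a
steady stream of create/read/unlink operations while Postfix is delivering.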
Under a load test, feeding Postfix as quickly as it'll take mail, the following
series of errors was logged:

Sep 17 22:17:05 defendercore5 kernel: xfs_da_do_buf: bno 2698
Sep 17 22:17:05 defendercore5 kernel: dir: inode 128
Sep 17 22:17:05 defendercore5 kernel: Filesystem "dm-16": XFS internal error
xfs_da_do_buf(1) at line 2119 of file fs/xfs/xfs_da_btree.c.  Caller 0xf8a6a44e
Sep 17 22:17:05 defendercore5 kernel:  <f8a69ff6> xfs_da_do_buf+0x45b/0x829
[xfs]  <f8a6a44e> xfs_da_read_buf+0x30/0x35 [xfs]
Sep 17 22:17:05 defendercore5 kernel:  <f8a6a44e> xfs_da_read_buf+0x30/0x35
[xfs]  <f8a74329> xfs_dir2_node_addname+0x76d/0xa41 [xfs]
Sep 17 22:17:05 defendercore5 kernel:  <f8a74329>
xfs_dir2_node_addname+0x76d/0xa41 [xfs]  <f8a54b2a> xfs_attr_fetch+0xb4/0x244 
[xfs]
Sep 17 22:17:05 defendercore5 kernel:  <f8a8b458> xlog_grant_push_ail+0x34/0xf2
[xfs]  <f8a6e031> xfs_dir2_isleaf+0x1b/0x50 [xfs]
Sep 17 22:17:05 defendercore5 kernel:  <f8a6e86b>
xfs_dir2_createname+0x101/0x109 [xfs]  <f8a7fd9a> xfs_ilock+0x8c/0xd4 [xfs]
Sep 17 22:17:05 defendercore5 kernel:  <f8a9fb2e> xfs_link+0x342/0x496 [xfs] 
<f8aa819b> xfs_vn_permission+0x0/0x13 [xfs]
Sep 17 22:17:05 defendercore5 kernel:  <c0480a4a> dput+0x35/0x230  <f8aa7f13>
xfs_vn_link+0x41/0x8e [xfs]
Sep 17 22:17:05 defendercore5 kernel:  <c0439acb>
debug_mutex_add_waiter+0x97/0xa9  <c0477c13> vfs_link+0xdd/0x190
Sep 17 22:17:05 defendercore5 kernel:  <c06171af>
__mutex_lock_slowpath+0x339/0x439  <c0477c13> vfs_link+0xdd/0x190
Sep 17 22:17:05 defendercore5 kernel:  <c0477c13> vfs_link+0xdd/0x190 
<c0477c52> vfs_link+0x11c/0x190
Sep 17 22:17:06 defendercore5 kernel:  <c047a7ea> sys_linkat+0xb1/0xf0 
<c0480b0d> dput+0xf8/0x230
Sep 17 22:17:06 defendercore5 kernel:  <c046c4ce> __fput+0x146/0x170  <c047a858>
sys_link+0x2f/0x33
Sep 17 22:17:06 defendercore5 kernel:  <c0403dd5> sysenter_past_esp+0x56/0x79
Sep 17 22:17:06 defendercore5 kernel: xfs_da_do_buf: bno 2698
Sep 17 22:17:06 defendercore5 kernel: dir: inode 128
Sep 17 22:17:06 defendercore5 kernel: Filesystem "dm-16": XFS internal error
xfs_da_do_buf(1) at line 2119 of file fs/xfs/xfs_da_btree.c.  Caller 0xf8a6a44e
Sep 17 22:17:06 defendercore5 kernel:  <f8a69ff6> xfs_da_do_buf+0x45b/0x829
[xfs]  <f8a6a44e> xfs_da_read_buf+0x30/0x35 [xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8a6a44e> xfs_da_read_buf+0x30/0x35
[xfs]  <f8a74329> xfs_dir2_node_addname+0x76d/0xa41 [xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8a74329>
xfs_dir2_node_addname+0x76d/0xa41 [xfs]  <f8a80a06> xfs_iget+0x57a/0x621 [xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8aa1b29> kmem_zone_zalloc+0x1d/0x41
[xfs]  <f8a97c53> xfs_trans_iget+0x10a/0x143 [xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8aaae83> vfs_init_vnode+0x21/0x25 [xfs]
 <f8a6e031> xfs_dir2_isleaf+0x1b/0x50 [xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8a6e86b>
xfs_dir2_createname+0x101/0x109 [xfs]  <f8a98584> xfs_dir_ialloc+0x7b/0x28f 
[xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8a962d7> xfs_trans_reserve+0xc7/0x18f
[xfs]  <f8a6e76a> xfs_dir2_createname+0x0/0x109 [xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8a9f13a> xfs_create+0x40b/0x665 [xfs] 
<f8aa7bde> xfs_vn_mknod+0x1a9/0x398 [xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8a74c79>
xfs_dir2_leafn_lookup_int+0x3e/0x455 [xfs]  <f8aa42bf> xfs_buf_rele+0x25/0x7d 
[xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8a699d4> xfs_da_brelse+0x6b/0x8f [xfs]
 <f8a7319a> xfs_dir2_node_lookup+0x8c/0x95 [xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8a6e94e> xfs_dir2_lookup+0xdb/0xfe
[xfs]  <c0479a73> __link_path_walk+0xc5c/0xd31
Sep 17 22:17:06 defendercore5 kernel:  <c0480a4a> dput+0x35/0x230  <c04850d5>
mntput_no_expire+0x11/0x6e
Sep 17 22:17:06 defendercore5 kernel:  <f8a98437> xfs_dir_lookup_int+0x30/0xd8
[xfs]  <c047811a> vfs_create+0xce/0x12e
Sep 17 22:17:06 defendercore5 kernel:  <c047ab59> open_namei+0x176/0x5db 
<c046a00a> do_filp_open+0x25/0x39
Sep 17 22:17:06 defendercore5 kernel:  <c0618c53> do_page_fault+0x2e7/0x6b2 
<c0469da7> get_unused_fd+0xb9/0xc3
Sep 17 22:17:06 defendercore5 kernel:  <c046a060> do_sys_open+0x42/0xb5 
<c046a10c> sys_open+0x1c/0x1e
Sep 17 22:17:06 defendercore5 kernel:  <c0403dd5> sysenter_past_esp+0x56/0x79
Sep 17 22:17:06 defendercore5 kernel: Filesystem "dm-16": XFS internal error
xfs_trans_cancel at line 1150 of file fs/xfs/xfs_trans.c.  Caller 0xf8a9f34c
Sep 17 22:17:06 defendercore5 kernel:  <f8a963f8> xfs_trans_cancel+0x59/0xe5
[xfs]  <f8a9f34c> xfs_create+0x61d/0x665 [xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8a9f34c> xfs_create+0x61d/0x665 [xfs] 
<f8aa7bde> xfs_vn_mknod+0x1a9/0x398 [xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8a74c79>
xfs_dir2_leafn_lookup_int+0x3e/0x455 [xfs]  <f8aa42bf> xfs_buf_rele+0x25/0x7d 
[xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8a699d4> xfs_da_brelse+0x6b/0x8f [xfs]
 <f8a7319a> xfs_dir2_node_lookup+0x8c/0x95 [xfs]
Sep 17 22:17:06 defendercore5 kernel:  <f8a6e94e> xfs_dir2_lookup+0xdb/0xfe
[xfs]  <c0479a73> __link_path_walk+0xc5c/0xd31
Sep 17 22:17:06 defendercore5 kernel:  <c0480a4a> dput+0x35/0x230  <c04850d5>
mntput_no_expire+0x11/0x6e
Sep 17 22:17:06 defendercore5 kernel:  <f8a98437> xfs_dir_lookup_int+0x30/0xd8
[xfs]  <c047811a> vfs_create+0xce/0x12e
Sep 17 22:17:06 defendercore5 kernel:  <c047ab59> open_namei+0x176/0x5db 
<c046a00a> do_filp_open+0x25/0x39
Sep 17 22:17:06 defendercore5 kernel:  <c0618c53> do_page_fault+0x2e7/0x6b2 
<c0469da7> get_unused_fd+0xb9/0xc3
Sep 17 22:17:06 defendercore5 kernel:  <c046a060> do_sys_open+0x42/0xb5 
<c046a10c> sys_open+0x1c/0x1e
Sep 17 22:17:06 defendercore5 kernel:  <c0403dd5> sysenter_past_esp+0x56/0x79
Sep 17 22:17:06 defendercore5 kernel: xfs_force_shutdown(dm-16,0x8) called from
line 1151 of file fs/xfs/xfs_trans.c.  Return address = 0xf8aaaea8
Sep 17 22:17:06 defendercore5 kernel: Filesystem "dm-16": Corruption of
in-memory data detected.  Shutting down filesystem: dm-16
Sep 17 22:17:06 defendercore5 kernel: Please umount the filesystem, and rectify
the problem(s)
 
At this point, the filesystem was offlined as expected. 

Upon umounting and running xfs_check, it asked for a remount to play back the
journal. I did that, then umounted again and ran xfs_check and xfs_repair. The
repair created lost+found and found some inconsistencies, but fixed all of
them, and the filesystem remounted as expected. There were 10 files full of
nulls, which was also expected, since we're not (yet) forcing a flush on
incoming queue files. However, to do that we'll have to flock first, due to
our code, and that will mean more disk access.
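The flush-plus-flock change described above would look something like this.
This is only a sketch; the locking protocol and the write_queue_file helper
are assumptions for illustration, not something from our delivery code:

```python
import fcntl
import os

def write_queue_file(path, data):
    """Write a queue file and force it to disk, holding an exclusive flock
    so a reader honoring the lock never sees a half-written file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # take the lock before writing
        os.write(fd, data)
        os.fsync(fd)                     # the flush: avoids null-filled files
        fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

The fsync is what prevents the null-filled files after a crash, and the flock
is the extra disk access mentioned above, since the watcher would have to take
the same lock before reading.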

The following seemingly ordinary lines were logged during the umount/mount
process, and are included here in the interest of erring on the side of too much
information: 

Sep 18 01:13:08 defendercore5 kernel: xfs_force_shutdown(dm-16,0x1) called from
line 338 of file fs/xfs/xfs_rw.c.  Return address = 0xf8aaaea8
Sep 18 01:13:08 defendercore5 kernel: xfs_force_shutdown(dm-16,0x1) called from
line 338 of file fs/xfs/xfs_rw.c.  Return address = 0xf8aaaea8
Sep 18 01:13:52 defendercore5 kernel: Filesystem "dm-16": Disabling barriers,
not supported by the underlying device
Sep 18 01:13:52 defendercore5 kernel: XFS mounting filesystem dm-16
Sep 18 01:13:52 defendercore5 kernel: Starting XFS recovery on filesystem: dm-16
(logdev: internal)
Sep 18 01:13:52 defendercore5 kernel: Ending XFS recovery on filesystem: dm-16
(logdev: internal)

I know that Fedora uses a smaller stack, and that each subsystem layered above
the disk adds stack-overflow risk, but I don't see any stack errors being
dumped.

Any ideas? Is there a known bug in the Fedora 2.6.17-1.2157_FC5smp kernel? I've
done some searching, but this error usually seems to be associated with
pre-existing corruption, if I'm searching for the right things. This is a new
filesystem on a kernel that supposedly has the fix for the directory corruption
bug, so I hope it's not a case of that bug still being present in this kernel.

-- 
Configure bugmail: http://oss.sgi.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

