xfs and raid5 - "Structure needs cleaning for directory open"
Rainer Fuegenstein
rfu at kaneda.iguw.tuwien.ac.at
Sun May 9 13:48:00 CDT 2010
today in the morning some daemon processes terminated because of
errors in the xfs file system on top of a software raid5, consisting
of 4*1.5TB WD caviar green SATA disks.
current OS is centos 5.4, kernel is:
Linux alfred 2.6.18-164.15.1.el5xen #1 SMP Wed Mar 17 12:04:23 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
the history: this raid was originally created on an ASUS M2N-X plus
mainboard with all 4 drives connected to the on-board controller.
(centos 5.4, current i386 kernel). it
worked fine first, but after some months problems occured when copying
files via SMB, in these situations dmesg showed a stack trace,
starting with an interrupt problem deep in the kernel and reaching up
to the xfs filesystem code.
a few months ago the weekly raid check
(/etc/cron.weekly/99-raid-check) started a re-sync of the raid which
(on the M2N-X board) took about 2.5 to 3 days to complete.
to overcome the interrupt problems, I recently bought an intel D510
atom mainboard and a "Promise Technology, Inc. PDC40718 (SATA 300 TX4)
(rev 02))" sata controller, reinstalled centos 5.4 from scratch
(x86_64 version) and attached the 4 sata disks which worked fine until
this sunday night the 99-raid-check started again at 4:00 in the
morning and lasted until just now (19:00 o'clock).
around 12:00 noon (resync at about 50%) I noticed the first problems,
namely "Structure needs cleaning for directory open" messages. at this
time, a "du -sh *" revealed that around 50% of the data stored on the
xfs was lost (due to directories that couldn't be read because of the
"needs cleaning ..." error. a daring xfs_repair on the unmounted, but
still syncing filesystem revealed & fixed no errors (see output
below).
after painfully waiting 7 hours for the resync to complete, it looks
like the filesystem is OK and back to normal again: du shows the
expected 3.5TB usage, there are no more "needs cleaning ..." errors
and a quick check into the previously lost directories seems to show
that the files contained within seem to be OK.
I wonder what caused this behaviour (and how to prevent it in the
future):
1) damages done to the xfs filesystem on the old board? shouldn't
xfs_repair find & repair them?
2) does a re-syncing raid deliver bad/corrupt data to the filesystem
layer above?
3) may this be a hardware/memory problem since xfs reports "Corruption
of in-memory data detected". ?
4) is the Promise SATA controller to blame ?
here's some output that may help. please let me know if you need more:
*** this is where it started:
May 9 04:22:01 alfred kernel: md: syncing RAID array md0
May 9 04:22:01 alfred kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
May 9 04:22:01 alfred kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstructio
n.
May 9 04:22:01 alfred kernel: md: using 128k window, over a total of 1465135936 blocks.
May 9 04:24:06 alfred kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 4565 of file fs/xfs/xfs_bmap.c. Caller 0xffff
ffff8835dba8
May 9 04:24:06 alfred kernel:
May 9 04:24:06 alfred kernel: Call Trace:
May 9 04:24:06 alfred kernel: [<ffffffff8833f15e>] :xfs:xfs_bmap_read_extents+0x361/0x384
May 9 04:24:06 alfred kernel: [<ffffffff8835dba8>] :xfs:xfs_iread_extents+0xac/0xc8
May 9 04:24:06 alfred kernel: [<ffffffff883448c3>] :xfs:xfs_bmapi+0x226/0xe79
May 9 04:24:06 alfred kernel: [<ffffffff8021c4c6>] generic_make_request+0x211/0x228
May 9 04:24:06 alfred kernel: [<ffffffff882edd2e>] :raid456:handle_stripe+0x20a6/0x21ff
May 9 04:24:06 alfred kernel: [<ffffffff88361a2b>] :xfs:xfs_iomap+0x144/0x2a5
May 9 04:24:06 alfred kernel: [<ffffffff88376c38>] :xfs:__xfs_get_blocks+0x7a/0x1bf
May 9 04:24:06 alfred kernel: [<ffffffff882eebdb>] :raid456:make_request+0x4ba/0x4f4
May 9 04:24:06 alfred kernel: [<ffffffff8029bfc3>] autoremove_wake_function+0x0/0x2e
May 9 04:24:06 alfred kernel: [<ffffffff80228a95>] do_mpage_readpage+0x167/0x474
May 9 04:24:06 alfred kernel: [<ffffffff88376d8e>] :xfs:xfs_get_blocks+0x0/0xe
May 9 04:24:06 alfred kernel: [<ffffffff88376d8e>] :xfs:xfs_get_blocks+0x0/0xe
May 9 04:24:06 alfred kernel: [<ffffffff8020cc70>] add_to_page_cache+0xb9/0xc5
May 9 04:24:06 alfred kernel: [<ffffffff88376d8e>] :xfs:xfs_get_blocks+0x0/0xe
May 9 04:24:06 alfred kernel: [<ffffffff8023a3d8>] mpage_readpages+0x91/0xd9
May 9 04:24:06 alfred kernel: [<ffffffff88376d8e>] :xfs:xfs_get_blocks+0x0/0xe
May 9 04:24:06 alfred kernel: [<ffffffff8020f66f>] __alloc_pages+0x65/0x2ce
May 9 04:24:06 alfred kernel: [<ffffffff802137d1>] __do_page_cache_readahead+0x130/0x1ab
May 9 04:24:06 alfred kernel: [<ffffffff802336c8>] blockable_page_cache_readahead+0x53/0xb2
May 9 04:24:06 alfred kernel: [<ffffffff802147a4>] page_cache_readahead+0xd6/0x1af
May 9 04:24:06 alfred kernel: [<ffffffff8020c6d7>] do_generic_mapping_read+0xc6/0x38a
May 9 04:24:06 alfred kernel: [<ffffffff8020d693>] file_read_actor+0x0/0x101
May 9 04:24:06 alfred kernel: [<ffffffff8020cae7>] __generic_file_aio_read+0x14c/0x198
May 9 04:24:06 alfred kernel: [<ffffffff8837d7de>] :xfs:xfs_read+0x187/0x209
May 9 04:24:06 alfred kernel: [<ffffffff8837a4d8>] :xfs:xfs_file_aio_read+0x63/0x6b
May 9 04:24:06 alfred kernel: [<ffffffff8020d3d2>] do_sync_read+0xc7/0x104
May 9 04:24:06 alfred kernel: [<ffffffff8021ecec>] __dentry_open+0x101/0x1dc
May 9 04:24:06 alfred kernel: [<ffffffff8029bfc3>] autoremove_wake_function+0x0/0x2e
May 9 04:24:06 alfred kernel: [<ffffffff80227a40>] do_filp_open+0x2a/0x38
May 9 04:24:06 alfred kernel: [<ffffffff8020bbaf>] vfs_read+0xcb/0x171
May 9 04:24:06 alfred kernel: [<ffffffff80212495>] sys_read+0x45/0x6e
May 9 04:24:06 alfred kernel: [<ffffffff8026168d>] ia32_sysret+0x0/0x5
May 9 04:24:06 alfred kernel:
May 9 04:24:06 alfred kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 4565 of file fs/xfs/xfs_bmap.c. Caller 0xffff
ffff8835dba8
*** (many, many more)
May 9 06:19:16 alfred kernel: Filesystem "md0": corrupt dinode 1610637790, (btree extents). Unmount and run xfs_repair.
May 9 06:19:16 alfred kernel: Filesystem "md0": XFS internal error xfs_bmap_read_extents(1) at line 4560 of file fs/xfs/xfs_bma
p.c. Caller 0xffffffff8835dba8
May 9 06:19:16 alfred kernel:
May 9 06:19:16 alfred kernel: Call Trace:
May 9 06:19:16 alfred kernel: [<ffffffff8833f15e>] :xfs:xfs_bmap_read_extents+0x361/0x384
May 9 06:19:16 alfred kernel: [<ffffffff8835dba8>] :xfs:xfs_iread_extents+0xac/0xc8
May 9 06:19:16 alfred kernel: [<ffffffff883448c3>] :xfs:xfs_bmapi+0x226/0xe79
May 9 06:19:16 alfred kernel: [<ffffffff8866ac47>] :ip_conntrack:tcp_pkt_to_tuple+0x0/0x61
May 9 06:19:16 alfred kernel: [<ffffffff8866883d>] :ip_conntrack:__ip_conntrack_find+0xd/0xb7
May 9 06:19:16 alfred kernel: [<ffffffff8023f750>] lock_timer_base+0x1b/0x3c
May 9 06:19:16 alfred kernel: [<ffffffff8021ce99>] __mod_timer+0xb0/0xbe
May 9 06:19:16 alfred kernel: [<ffffffff88668e71>] :ip_conntrack:__ip_ct_refresh_acct+0x10f/0x152
May 9 06:19:16 alfred kernel: [<ffffffff8866b8a8>] :ip_conntrack:tcp_packet+0xa5f/0xa9f
May 9 06:19:16 alfred kernel: [<ffffffff88361a2b>] :xfs:xfs_iomap+0x144/0x2a5
May 9 06:19:16 alfred kernel: [<ffffffff88376c38>] :xfs:__xfs_get_blocks+0x7a/0x1bf
May 9 06:19:16 alfred kernel: [<ffffffff802235ae>] alloc_buffer_head+0x31/0x36
May 9 06:19:16 alfred kernel: [<ffffffff8022fa7a>] alloc_page_buffers+0x81/0xd3
May 9 06:19:16 alfred kernel: [<ffffffff8020ea95>] __block_prepare_write+0x1ad/0x375
May 9 06:19:16 alfred kernel: [<ffffffff88376d8e>] :xfs:xfs_get_blocks+0x0/0xe
May 9 06:19:16 alfred kernel: [<ffffffff802bda81>] add_to_page_cache_lru+0x1c/0x22
May 9 06:19:16 alfred kernel: [<ffffffff802d3456>] block_write_begin+0x80/0xcf
May 9 06:19:16 alfred kernel: [<ffffffff8837637d>] :xfs:xfs_vm_write_begin+0x19/0x1e
May 9 06:19:16 alfred kernel: [<ffffffff88376d8e>] :xfs:xfs_get_blocks+0x0/0xe
May 9 06:19:16 alfred kernel: [<ffffffff8021072e>] generic_file_buffered_write+0x14b/0x60c
May 9 06:19:16 alfred kernel: [<ffffffff80209e60>] __d_lookup+0xb0/0xff
May 9 06:19:16 alfred kernel: [<ffffffff80264931>] _spin_lock_irqsave+0x9/0x14
May 9 06:19:16 alfred kernel: [<ffffffff8837dcfe>] :xfs:xfs_write+0x49e/0x69e
May 9 06:19:16 alfred kernel: [<ffffffff8022d090>] mntput_no_expire+0x19/0x89
May 9 06:19:16 alfred kernel: [<ffffffff8020edf0>] link_path_walk+0xa6/0xb2
May 9 06:19:16 alfred kernel: [<ffffffff8837a470>] :xfs:xfs_file_aio_write+0x65/0x6a
May 9 06:19:16 alfred kernel: [<ffffffff802185e8>] do_sync_write+0xc7/0x104
May 9 06:19:16 alfred kernel: [<ffffffff8021ecec>] __dentry_open+0x101/0x1dc
May 9 06:19:16 alfred kernel: [<ffffffff8029bfc3>] autoremove_wake_function+0x0/0x2e
May 9 06:19:16 alfred kernel: [<ffffffff80227a40>] do_filp_open+0x2a/0x38
May 9 06:19:16 alfred kernel: [<ffffffff802171aa>] vfs_write+0xce/0x174
May 9 06:19:16 alfred kernel: [<ffffffff802179e2>] sys_write+0x45/0x6e
May 9 06:19:16 alfred kernel: [<ffffffff8026168d>] ia32_sysret+0x0/0x5
*** also many, many more, always the same dinode
May 9 12:53:32 alfred kernel: Filesystem "md0": XFS internal error xfs_btree_check_sblock at line 307 of file fs/xfs/xfs_btree.
c. Caller 0xffffffff88358eb7
May 9 12:53:32 alfred kernel:
May 9 12:53:32 alfred kernel: Call Trace:
May 9 12:53:32 alfred kernel: [<ffffffff88349bee>] :xfs:xfs_btree_check_sblock+0xaf/0xbe
May 9 12:53:32 alfred kernel: [<ffffffff88358eb7>] :xfs:xfs_inobt_increment+0x156/0x17e
May 9 12:53:32 alfred kernel: [<ffffffff88358920>] :xfs:xfs_dialloc+0x4d0/0x80c
May 9 12:53:32 alfred kernel: [<ffffffff802260ff>] find_or_create_page+0x3f/0xab
May 9 12:53:32 alfred kernel: [<ffffffff8835eafc>] :xfs:xfs_ialloc+0x5f/0x57f
May 9 12:53:32 alfred kernel: [<ffffffff8805c02a>] :ext3:ext3_get_acl+0x63/0x310
May 9 12:53:32 alfred kernel: [<ffffffff8020b242>] kmem_cache_alloc+0x62/0x6d
May 9 12:53:32 alfred kernel: [<ffffffff88370b23>] :xfs:xfs_dir_ialloc+0x86/0x2b7
May 9 12:53:32 alfred kernel: [<ffffffff883654c0>] :xfs:xlog_grant_log_space+0x204/0x25c
May 9 12:53:32 alfred kernel: [<ffffffff883735f8>] :xfs:xfs_create+0x237/0x45c
May 9 12:53:32 alfred kernel: [<ffffffff88338d47>] :xfs:xfs_attr_get+0x8e/0x9f
May 9 12:53:32 alfred kernel: [<ffffffff8837cd38>] :xfs:xfs_vn_mknod+0x144/0x215
May 9 12:53:32 alfred kernel: [<ffffffff8023bdcb>] vfs_create+0xe6/0x158
May 9 12:53:32 alfred kernel: [<ffffffff8021b38f>] open_namei+0x1a1/0x6ed
May 9 12:53:32 alfred kernel: [<ffffffff80227a32>] do_filp_open+0x1c/0x38
May 9 12:53:32 alfred kernel: [<ffffffff8021a1a0>] do_sys_open+0x44/0xbe
May 9 12:53:32 alfred kernel: [<ffffffff8026168d>] ia32_sysret+0x0/0x5
May 9 12:53:32 alfred kernel:
*** also many, many more
May 9 13:44:35 alfred kernel: 00000000: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 ÿÿÿÿÿÿÿÿ........
May 9 13:44:35 alfred kernel: Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2112 of file fs/xfs/xfs_da_btree.c.
Caller 0xffffffff8834b82e
May 9 13:44:35 alfred kernel:
May 9 13:44:35 alfred kernel: Call Trace:
May 9 13:44:35 alfred kernel: [<ffffffff8834b72d>] :xfs:xfs_da_do_buf+0x503/0x5b1
May 9 13:44:35 alfred kernel: [<ffffffff8834b82e>] :xfs:xfs_da_read_buf+0x16/0x1b
May 9 13:44:35 alfred kernel: [<ffffffff8020cb6c>] _atomic_dec_and_lock+0x39/0x57
May 9 13:44:35 alfred kernel: [<ffffffff8834b82e>] :xfs:xfs_da_read_buf+0x16/0x1b
May 9 13:44:35 alfred kernel: [<ffffffff88350b0c>] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec
May 9 13:44:35 alfred kernel: [<ffffffff88350b0c>] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec
May 9 13:44:35 alfred kernel: [<ffffffff88379c6c>] :xfs:xfs_hack_filldir+0x0/0x5b
May 9 13:44:35 alfred kernel: [<ffffffff88379c6c>] :xfs:xfs_hack_filldir+0x0/0x5b
May 9 13:44:35 alfred kernel: [<ffffffff8834d868>] :xfs:xfs_readdir+0xa7/0xb6
May 9 13:44:35 alfred kernel: [<ffffffff8837a301>] :xfs:xfs_file_readdir+0xff/0x14c
May 9 13:44:35 alfred kernel: [<ffffffff80225d93>] filldir+0x0/0xb7
May 9 13:44:35 alfred kernel: [<ffffffff80225d93>] filldir+0x0/0xb7
May 9 13:44:35 alfred kernel: [<ffffffff802366f7>] vfs_readdir+0x77/0xa9
May 9 13:44:35 alfred kernel: [<ffffffff80239f2a>] sys_getdents+0x75/0xbd
May 9 13:44:35 alfred kernel: [<ffffffff80260295>] tracesys+0x47/0xb6
May 9 13:44:35 alfred kernel: [<ffffffff802602f9>] tracesys+0xab/0xb6
May 9 13:44:35 alfred kernel:
May 9 13:51:24 alfred kernel: Filesystem "md0": Disabling barriers, trial barrier write failed
May 9 13:51:24 alfred kernel: XFS mounting filesystem md0
*** these xfs_da_do_buf errors appear at a rate of about 5 per second
until 14:40 o'clock, then stop. file system was still mounted, maybe
one daemon was still accessing it.
*** xfs_repair performed when raid was at 50% resync and filesystem was
corrupted:
[root at alfred ~]# xfs_repair /dev/md0
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
[...]
- agno = 62
- agno = 63
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
[...]
- agno = 61
- agno = 62
- agno = 63
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
raid output after sync was finished:
[root at alfred md]# cat /sys/block/md0/md/array_state
clean
[root at alfred md]# cat /sys/block/md0/md/mismatch_cnt
0
tnx & cu
More information about the xfs
mailing list