Weird XFS Corruption Error

Date: Wed, 22 Jan 2014 17:09:10 +0100
Hi everybody,

We experienced a weird XFS corruption yesterday and I am desperately trying to 
find out what happened.
First, the setup:

* ProLiant DL380p Gen8
* 256GB RAM
* HP SmartArray P420i Controller
** 1 GB BBWC
** Firmware Version 4.68
** 20x MK0100GCTYU 100GB SSD Drives
** Raid 1+0
* Ubuntu 12.04 LTS
* Kernel 3.11.0-15-generic #23~precise1-Ubuntu

fstab entry:
/dev/vg00/opt_mysqlbackup   /opt/mysqlbackup   xfs   nobarrier,noatime,nodiratime,logbufs=8,logbsize=256k   0 2

We created a 120GB LV mounted on /opt/mysqlbackup, which (obviously) 
temporarily hosts our MariaDB backups until they are transferred to tape. We 
use mylvmbackup (http://www.lenzg.net/mylvmbackup/) to create an (approx. 55GB) 
tar.gz file containing the dump. While testing, I created hardlinks for two 
files in a subdir („safe“) and forgot about them for a day, while the „original“ 
files were deleted and replaced by the next day’s backup.
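
For anyone trying to reproduce the file layout, here is a minimal Python sketch of the hardlink scenario described above. It only recreates the directory entries (a hardlink in „safe“, then the original deleted and recreated); it is not a reproducer for the corruption itself. The function name and the file contents are made up for illustration; only the file name is taken from the error message.

```python
import os
import tempfile


def hardlink_survives_replacement(workdir):
    """Link a file into safe/, then delete and recreate the original name."""
    original = os.path.join(workdir, "daily_snapshot.tar.gz.md5")
    safe_dir = os.path.join(workdir, "safe")
    linked = os.path.join(safe_dir, "daily_snapshot.tar.gz.md5")

    os.mkdir(safe_dir)
    with open(original, "w") as f:
        f.write("checksum of day 1\n")  # placeholder content

    # Second directory entry pointing at the same inode.
    os.link(original, linked)
    links_before = os.stat(linked).st_nlink

    # "Next day's backup": the original name is removed and recreated.
    os.remove(original)
    with open(original, "w") as f:
        f.write("checksum of day 2\n")  # placeholder content

    with open(linked) as f:
        linked_content = f.read()

    return {
        "links_before": links_before,
        "links_after": os.stat(linked).st_nlink,
        "same_file": os.path.samefile(original, linked),
        "linked_content": linked_content,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(hardlink_survives_replacement(tmp))
```

After the replacement, the entry in „safe“ is the only remaining name for the old inode, so removing it is what finally frees that file's blocks.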

When I tried cleaning up the no longer needed files, I encountered the 
following error:

me@hsoi-gts3-de02:/opt/mysqlbackup$ sudo rm -rf safe/
[sudo] password for saskani:
rm: cannot remove `safe/daily_snapshot.tar.gz.md5': Input/output error

dmesg told me:
[964199.138848] XFS (dm-8): Internal error xfs_bmbt_read_verify at line 789 of file /build/buildd/linux-lts-saucy-3.11.0/fs/xfs/xfs_bmap_btree.c.  Caller 
[964199.138850] CPU: 1 PID: 3694 Comm: kworker/1:1H Tainted: GF            3.11.0-15-generic #23~precise1-Ubuntu
[964199.138851] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 09/18/2013
[964199.138874] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[964199.138876]  0000000000000001 ffff881c6be6fd18 ffffffff8173bc0e 
[964199.138878]  ffff883f9061c000 ffff881c6be6fd38 ffffffffa016629f 
[964199.138879]  0000000000000001 ffff881c6be6fd78 ffffffffa016630e 
[964199.138880] Call Trace:
[964199.138886]  [<ffffffff8173bc0e>] dump_stack+0x46/0x58
[964199.138906]  [<ffffffffa016629f>] xfs_error_report+0x3f/0x50 [xfs]
[964199.138913]  [<ffffffffa0164495>] ? xfs_buf_iodone_work+0x95/0xc0 [xfs]
[964199.138921]  [<ffffffffa016630e>] xfs_corruption_error+0x5e/0x90 [xfs]
[964199.138928]  [<ffffffffa0164495>] ? xfs_buf_iodone_work+0x95/0xc0 [xfs]
[964199.138939]  [<ffffffffa01944d6>] xfs_bmbt_read_verify+0x76/0xf0 [xfs]
[964199.138946]  [<ffffffffa0164495>] ? xfs_buf_iodone_work+0x95/0xc0 [xfs]
[964199.138949]  [<ffffffff81095bb2>] ? finish_task_switch+0x52/0xf0
[964199.138969]  [<ffffffffa0164495>] xfs_buf_iodone_work+0x95/0xc0 [xfs]
[964199.138972]  [<ffffffff81081060>] process_one_work+0x170/0x4a0
[964199.138973]  [<ffffffff81082121>] worker_thread+0x121/0x390
[964199.138975]  [<ffffffff81082000>] ? manage_workers.isra.21+0x170/0x170
[964199.138977]  [<ffffffff81088fe0>] kthread+0xc0/0xd0
[964199.138979]  [<ffffffff81088f20>] ? flush_kthread_worker+0xb0/0xb0
[964199.138981]  [<ffffffff817508ac>] ret_from_fork+0x7c/0xb0
[964199.138983]  [<ffffffff81088f20>] ? flush_kthread_worker+0xb0/0xb0
[964199.138984] XFS (dm-8): Corruption detected. Unmount and run xfs_repair
[964199.139014] XFS (dm-8): metadata I/O error: block 0x1f0 ("xfs_trans_read_buf_map") error 117 numblks 8
[964199.139016] XFS (dm-8): xfs_do_force_shutdown(0x1) called from line 367 of file /build/buildd/linux-lts-saucy-3.11.0/fs/xfs/xfs_trans_buf.c.  Return address = 0xffffffffa01cadbc
[964199.139324] XFS (dm-8): I/O Error Detected. Shutting down filesystem
[964199.139325] XFS (dm-8): Please umount the filesystem and rectify the 
[964212.367300] XFS (dm-8): xfs_log_force: error 5 returned.
[964242.477283] XFS (dm-8): xfs_log_force: error 5 returned.
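
The numeric codes in the log above can be decoded with a short Python sketch. On Linux, error 117 is EUCLEAN, whose message is „Structure needs cleaning“, and error 5 is EIO, the „Input/output error“ that rm reported. The byte offsets assume that the block number and numblks are given in 512-byte basic blocks; that unit is an assumption here, not something the log states. The helper names are made up for illustration.

```python
import errno
import os

# Assumption: XFS buffer addresses in dmesg are in 512-byte basic blocks.
BASIC_BLOCK_BYTES = 512


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)


def buffer_extent(block, numblks, block_bytes=BASIC_BLOCK_BYTES):
    """Return (byte offset, byte length) of a buffer on the device."""
    return block * block_bytes, numblks * block_bytes


if __name__ == "__main__":
    for code in (117, 5):  # values from the dmesg output
        print(code, describe_errno(code))
    offset, length = buffer_extent(0x1F0, 8)
    print(f"buffer at byte offset {offset}, length {length} bytes")
```

Under that assumption, the buffer that failed verification is a single 4096-byte block located 253952 bytes into dm-8.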

After that, I tried the following (in order):

1. Ran xfs_repair, which did not find the superblock and started scanning the 
LV. After finding a secondary superblock, it told me there was still something 
in the log, so I
2. mounted the filesystem, which gave me „Structure needs cleaning“ after a 
couple of seconds,
3. tried mounting again for good measure, with the same „Structure needs 
cleaning“ error,
4. ran xfs_repair -L, which repaired everything and effectively wiped my 
filesystem in the process,
5. mounted the filesystem, only to find it empty.
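
For comparison, a more conservative ordering of these steps can be sketched as a dry-run plan in Python. This is not what was run above. It is only an illustration, and it assumes enough scratch space for an image. The image, metadump and mountpoint paths are hypothetical; the device is the LV from the fstab entry. The script only prints the commands and executes nothing.

```python
import shlex

DEVICE = "/dev/vg00/opt_mysqlbackup"            # LV from the fstab entry
IMAGE = "/mnt/scratch/opt_mysqlbackup.img"      # hypothetical path
METADUMP = "/mnt/scratch/opt_mysqlbackup.meta"  # hypothetical path
MOUNTPOINT = "/mnt/recovery"                    # hypothetical path


def recovery_plan(device=DEVICE, image=IMAGE, metadump=METADUMP,
                  mountpoint=MOUNTPOINT):
    """Return the commands as argv lists, least destructive first."""
    return [
        # Keep a raw copy so later steps can be retried.
        ["dd", f"if={device}", f"of={image}", "bs=1M", "conv=noerror,sync"],
        # Save the metadata separately for later analysis.
        ["xfs_metadump", device, metadump],
        # No-modify mode: report problems without writing to the device.
        ["xfs_repair", "-n", device],
        # Read-only mount without log replay, to copy off what is readable.
        ["mount", "-t", "xfs", "-o", "ro,norecovery", device, mountpoint],
        ["umount", mountpoint],
        # Last resort: zero the log and repair.
        ["xfs_repair", "-L", device],
    ]


if __name__ == "__main__":
    for argv in recovery_plan():
        print(shlex.join(argv))
```

The point of the ordering is that xfs_repair -L, which discards the log, comes only after a copy of the device exists.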

Since then, I have been desperately trying to reproduce the problem, but 
unfortunately to no avail. Can somebody give some insight into the errors I 
encountered? I have previously operated 4.5 PB worth of XFS filesystems for 
3 years and never got an error similar to this.

Best regards

