XFS data corruption with high I/O even on hardware raid
Steve Costaras
stevecs at chaven.com
Wed Jan 13 19:11:27 CST 2010
OK, I've been seeing a problem here since I had to move over to XFS from
JFS due to file system size issues: XFS data corruption under heavy I/O.
Basically, under heavy load (e.g. if I'm running an xfs_fsr, which nearly
always triggers the freeze issue, on a volume), the system hovers around
90% utilization on the dm device for a while (sometimes an hour or more,
sometimes minutes), then the subsystem jumps to 100% utilization and
freezes solid, forcing me to do a hard reboot of the box. When the box
comes back up, the XFS volumes are generally really screwed up (see
below). The Areca cards all have BBUs, and the only write cache is on the
BBU (drive cache disabled). The systems are all UPS protected as well.
These freezes have happened too frequently, and unfortunately nothing is
logged anywhere. It's not worth doing a repair, as the amount of
corruption is too extensive, so it requires a complete restore from
backup. I just mention xfs_fsr here as it /seems/ to generate an I/O
pattern that nearly always results in a freeze; I have triggered it with
other high-I/O workloads as well, though not as reliably.
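
For reference, the run that triggers it is nothing exotic; it's just
something like the following (the path here is only an example), watched
from another terminal:

$ xfs_fsr -v /home      (online defragmentation of a mounted XFS volume)
$ iostat -m -x 15       (I/O statistics, watched right up to the freeze)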
I don't know what else can be done to remove this issue (and I'm not sure
it's directly related to XFS, as LVM and the Areca driver are also
involved), but the main result is that XFS gets really screwed up. I did
NOT have these issues with JFS (same subsystem, same LVM + Areca setup,
so it /seems/ to point to XFS, or at least XFS is tied in there
somewhere). Unfortunately JFS has issues with file systems larger than
32TiB, so the only file system I can use is XFS.
Since I'm using hardware RAID with BBUs, when I reboot and the system
comes back up, the RAID controller writes any outstanding data in its
cache out to the drives, and from the hardware point of view (as well as
LVM's point of view) the array is fine. The file system, however,
generally can't be mounted (about 4 out of 5 times; sometimes it does get
auto-mounted, but when I then run an xfs_repair -n -v in those cases
there are pages of errors: badly aligned inode rec, bad starting inode
#'s, and dubious inode btree block headers, among others). When I did let
a repair actually run, in one case it re-linked about 2,000,000 or so of
4,500,000 files, but there was no way to identify and verify file
integrity; the others were just lost.
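
For reference, the check/repair steps after one of these crashes are
roughly the following (device and mount point are from the setup data
below); the read-only check is what produces the pages of errors:

$ umount /var/ftp                                (if the volume got auto-mounted)
$ xfs_repair -n -v /dev/vg_media/lv_ftpshare     (dry run, read-only check; makes no changes)
$ xfs_repair -v /dev/vg_media/lv_ftpshare        (actual repair; usually not worth it vs. a restore)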
This is not limited to large volume sizes; I have seen similar corruption
on small (~2TiB) file systems as well. Also, in a couple of cases, while
one file system was taking the I/O (say, xfs_fsr -v /home), another XFS
file system on the same system which was NOT taking much if any I/O (say,
/var/test) got badly corrupted. Both would be using the same Areca
controllers and the same physical discs (same PVs and same VGs, but
different LVs).
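
For what it's worth, the shared layout can be confirmed with the standard
LVM reporting tools, e.g.:

$ pvs                           (lists the MediaVol hardware RAIDs used as PVs)
$ lvs -o +devices vg_media      (shows which PVs back each LV in the VG)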
Any suggestions on how to isolate or eliminate this would be greatly
appreciated.
Steve
Technical data is below:
==============
$ iostat -m -x 15
(iostat capture right up to a freeze event)
(The system sits here for a long while, hovering around 90% for the DM
device and about 30% for the PVs:)
Device:   rrqm/s  wrqm/s      r/s      w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await   svctm   %util
sda         0.00    7.80     0.07     2.00    0.00    0.04     38.19      0.00    2.26    0.97    0.20
sdb       120.07   34.47   253.00   706.67   24.98   28.96    115.11      1.06    1.10    0.28   26.87
sdc        48.80   28.93   324.73   730.87   24.98   28.94    104.62      1.19    1.13    0.29   30.60
sdd       121.73   33.13   251.60   700.40   24.99   28.94    116.01      1.11    1.17    0.29   27.40
sde        49.00   28.60   324.33   731.47   24.99   28.95    104.65      1.22    1.15    0.26   27.53
sdf       120.27   33.20   253.00   701.00   24.99   28.97    115.84      1.14    1.20    0.33   31.67
sdg        48.80   29.07   324.73   731.80   25.00   28.95    104.59      1.37    1.29    0.35   36.93
sdh       120.47   33.47   252.73   702.53   25.00   28.96    115.68      1.24    1.30    0.35   33.67
sdi        50.73   28.27   322.73   735.13   24.99   29.01    104.54      1.34    1.26    0.31   32.27
dm-0        0.00    0.00     0.13     0.13    0.00    0.00     12.00      0.01   25.00   25.00    0.67
dm-1        0.00    0.00  1602.67   992.73  199.93  231.69    340.59      4.12    1.59    0.34   88.40
dm-2        0.00    0.00     0.00     0.00    0.00    0.00      0.00      0.00    0.00    0.00    0.00
(Then it jumps up to 99-100% for the majority of devices; here sdf, sdg,
sdh, and sdi are all on the same physical Areca card.)
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.60 24.71 0.00 74.69
Device:   rrqm/s  wrqm/s      r/s      w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await   svctm   %util
sda         0.00    4.07     0.00     1.13    0.00    0.02     36.71      0.00    1.18    1.18    0.13
sdb         2.07    1.93     8.00    17.00    0.63    0.84    120.33      0.04    1.49    0.35    0.87
sdc         2.87    1.20     7.40    22.13    0.63    0.83    101.86      0.04    1.49    0.25    0.73
sdd         2.13    1.80     8.07    17.20    0.63    0.84    119.64      0.04    1.45    0.32    0.80
sde         2.93    1.07     7.20    21.80    0.63    0.83    103.65      0.05    1.89    0.34    1.00
sdf         1.93    1.87     8.13    13.67    0.63    0.64    119.78     46.58    2.35   45.63   99.47
sdg         2.87    1.00     7.13    17.80    0.62    0.64    104.04     64.12    2.41   39.84   99.33
sdh         2.07    1.67     7.93    13.47    0.62    0.64    121.22     47.85    2.12   46.39   99.27
sdi         2.93    1.07     7.07    18.47    0.62    0.64    101.77     62.15    2.32   38.83   99.13
dm-0        0.00    0.00     0.20     0.07    0.00    0.00     10.00      0.00    2.50    2.50    0.07
dm-1        0.00    0.00    40.20    30.13    5.03    6.68    340.96     74.73    2.13   14.19   99.80
dm-2        0.00    0.00     0.00     0.00    0.00    0.00      0.00      0.00    0.00    0.00    0.00
(Then it hits 100% and the system locks up:)
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.81 24.95 0.00 74.24
Device:   rrqm/s  wrqm/s      r/s      w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await   svctm   %util
sda         0.00    8.40     0.00     2.13    0.00    0.04     39.50      0.00    1.88    0.63    0.13
sdb         0.00    0.00     0.00     0.00    0.00    0.00      0.00      0.00    0.00    0.00    0.00
sdc         0.00    0.00     0.00     0.00    0.00    0.00      0.00      0.00    0.00    0.00    0.00
sdd         0.00    0.00     0.00     0.07    0.00    0.00     16.00      0.00    0.00    0.00    0.00
sde         0.00    0.00     0.00     0.00    0.00    0.00      0.00      0.00    0.00    0.00    0.00
sdf         0.00    0.00     0.00     0.00    0.00    0.00      0.00     50.00    0.00    0.00  100.00
sdg         0.00    0.00     0.00     0.00    0.00    0.00      0.00     69.00    0.00    0.00  100.00
sdh         0.00    0.00     0.00     0.00    0.00    0.00      0.00     50.00    0.00    0.00  100.00
sdi         0.00    0.00     0.00     0.00    0.00    0.00      0.00     65.00    0.00    0.00  100.00
dm-0        0.00    0.00     0.00     0.07    0.00    0.00     16.00      0.00    0.00    0.00    0.00
dm-1        0.00    0.00     0.00     0.00    0.00    0.00      0.00     85.00    0.00    0.00  100.00
dm-2        0.00    0.00     0.00     0.00    0.00    0.00      0.00      0.00    0.00    0.00    0.00
============ (System)
(Ubuntu 8.04.3 LTS):
Linux loki 2.6.24-26-server #1 SMP Tue Dec 1 18:26:43 UTC 2009 x86_64
GNU/Linux
--------------
xfs_repair version 2.9.4
============= (modinfo's)
filename: /lib/modules/2.6.24-26-server/kernel/fs/xfs/xfs.ko
license: GPL
description: SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
author: Silicon Graphics, Inc.
srcversion: A2E6459B3A4C96355F95E61
depends:
vermagic: 2.6.24-26-server SMP mod_unload
============
filename: /lib/modules/2.6.24-26-server/kernel/drivers/scsi/arcmsr/arcmsr.ko
version: Driver Version 1.20.00.15 2007/08/30
license: Dual BSD/GPL
description: ARECA (ARC11xx/12xx/13xx/16xx) SATA/SAS RAID HOST Adapter
author: Erich Chen <support at areca.com.tw>
srcversion: 38E576EB40C1A58E8B9E007
alias: pci:v000017D3d00001681sv*sd*bc*sc*i*
alias: pci:v000017D3d00001680sv*sd*bc*sc*i*
alias: pci:v000017D3d00001381sv*sd*bc*sc*i*
alias: pci:v000017D3d00001380sv*sd*bc*sc*i*
alias: pci:v000017D3d00001280sv*sd*bc*sc*i*
alias: pci:v000017D3d00001270sv*sd*bc*sc*i*
alias: pci:v000017D3d00001260sv*sd*bc*sc*i*
alias: pci:v000017D3d00001230sv*sd*bc*sc*i*
alias: pci:v000017D3d00001220sv*sd*bc*sc*i*
alias: pci:v000017D3d00001210sv*sd*bc*sc*i*
alias: pci:v000017D3d00001202sv*sd*bc*sc*i*
alias: pci:v000017D3d00001201sv*sd*bc*sc*i*
alias: pci:v000017D3d00001200sv*sd*bc*sc*i*
alias: pci:v000017D3d00001170sv*sd*bc*sc*i*
alias: pci:v000017D3d00001160sv*sd*bc*sc*i*
alias: pci:v000017D3d00001130sv*sd*bc*sc*i*
alias: pci:v000017D3d00001120sv*sd*bc*sc*i*
alias: pci:v000017D3d00001110sv*sd*bc*sc*i*
depends: scsi_mod
vermagic: 2.6.24-26-server SMP mod_unload
===========
filename: /lib/modules/2.6.24-26-server/kernel/drivers/md/dm-mod.ko
license: GPL
author: Joe Thornber <dm-devel at redhat.com>
description: device-mapper driver
srcversion: A7E89E997173E41CB6AAF04
depends:
vermagic: 2.6.24-26-server SMP mod_unload
parm: major:The major number of the device mapper (uint)
===========
============
mounted with:
/dev/vg_media/lv_ftpshare /var/ftp xfs defaults,relatime,nobarrier,logbufs=8,logbsize=256k,sunit=256,swidth=2048,inode64,noikeep,largeio,swalloc,allocsize=128k 0 2
============
XFS info:
meta-data=/dev/mapper/vg_media-lv_ftpshare isize=2048   agcount=41, agsize=268435424 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=10737418200, imaxpct=1
         =                       sunit=32     swidth=256 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=32 blks, lazy-count=0
realtime =none                   extsz=1048576 blocks=0, rtextents=0
=============
XFS is running on top of LVM:
--- Logical volume ---
LV Name /dev/vg_media/lv_ftpshare
VG Name vg_media
LV UUID MgEBWv-x9fn-KUoJ-3y5X-snlk-7F9E-A3CiHh
LV Write Access read/write
LV Status available
# open 1
LV Size 40.00 TB
Current LE 40960
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 254:1
==============
LVM is using as its base physical volumes 8 hardware RAIDs
(MediaVol00-70 inclusive):
[ 175.320738] ARECA RAID ADAPTER4: FIRMWARE VERSION V1.47 2009-07-16
[ 175.336238] scsi4 : Areca SAS Host Adapter RAID Controller( RAID6 capable)
[ 175.336239] Driver Version 1.20.00.15 2007/08/30
[ 175.336387] ACPI: PCI Interrupt 0000:0a:00.0[A] -> GSI 17 (level, low) -> IRQ 17
[ 175.336395] PCI: Setting latency timer of device 0000:0a:00.0 to 64
[ 175.336990] scsi 4:0:0:0: Direct-Access     Areca    BootVol#00       R001 PQ: 0 ANSI: 5
[ 175.337096] scsi 4:0:0:1: Direct-Access     Areca    MediaVol#00      R001 PQ: 0 ANSI: 5
[ 175.337169] scsi 4:0:0:2: Direct-Access     Areca    MediaVol#10      R001 PQ: 0 ANSI: 5
[ 175.337240] scsi 4:0:0:3: Direct-Access     Areca    MediaVol#20      R001 PQ: 0 ANSI: 5
[ 175.337312] scsi 4:0:0:4: Direct-Access     Areca    MediaVol#30      R001 PQ: 0 ANSI: 5
[ 175.337907] scsi 4:0:16:0: Processor        Areca    RAID controller  R001 PQ: 0 ANSI: 0
[ 175.356231] ARECA RAID ADAPTER5: FIRMWARE VERSION V1.47 2009-10-22
[ 175.376144] scsi5 : Areca SAS Host Adapter RAID Controller( RAID6 capable)
[ 175.376145] Driver Version 1.20.00.15 2007/08/30
[ 175.377354] scsi 5:0:0:5: Direct-Access     Areca    MediaVol#40      R001 PQ: 0 ANSI: 5
[ 175.377434] scsi 5:0:0:6: Direct-Access     Areca    MediaVol#50      R001 PQ: 0 ANSI: 5
[ 175.377495] scsi 5:0:0:7: Direct-Access     Areca    MediaVol#60      R001 PQ: 0 ANSI: 5
[ 175.377587] scsi 5:0:1:0: Direct-Access     Areca    MediaVol#70      R001 PQ: 0 ANSI: 5
[ 175.378156] scsi 5:0:16:0: Processor        Areca    RAID controller  R001 PQ: 0 ANSI: 0
=================