<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#ffffff" text="#000000">
<font face="Courier New">Ok, I've been seeing a problem here since had
to move over to XFS from JFS due to file system size issues. I am
seeing XFS Data corruption under ?heavy io? Basically, what happens
is that under heavy load (i.e. if I'm doing say a xfs_fsr (which nearly
always triggers the freeze issue) on a volume the system hovers around
90% utilization for the dm device for a while (sometimes an hour+,
sometimes minutes) the subsystem goes into 100% utilization and then
freezes solid forcing me to do a hard reboot of the box. When coming
back up generally the XFS volumes are really screwed up (see below).
Areca cards all have BBU's and the only write cache is on the BBU
(drive cache disabled). Systems are all UPS protected as well.
These freezes have happened too frequently and unfortunatly nothing is
logged anywhere. It's not worth doing a repair as the amount of
corruption is too extensive so requires a complete restore from
backup. I just mention xfs_fsr here as that /seems/ to generate an
I/O pattern that nearly always results in a freeze. I have done it
with other high-i/o functions though not as reliably.<br>
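<br>
For reference, the reproduction is basically just the following; the /var/ftp target here is only
an example (it has happened on other mounts too):<br>
<br>
$ xfs_fsr -v /var/ftp<br>
$ iostat -m -x 15<br>
<br>
(xfs_fsr running against a mounted volume in one terminal, iostat in another to watch %util until
the freeze.)<br>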
<br>
I don't know what else can be done to remove this issue (and I'm not really sure it's directly
related to XFS, since LVM and the Areca driver are also involved), but the main result is that XFS
ends up badly corrupted. I did NOT have these issues with JFS on the same LVM + Areca setup, so it
/seems/ to point to XFS, or at least XFS is tied in there somewhere. Unfortunately JFS has issues
with file systems larger than 32 TiB, so the only file system I can use here is XFS.<br>
<br>
Since I'm using hardware RAID with a BBU, when I reboot and the system comes back up the RAID
controller writes any outstanding data in its cache out to the drives, and from the hardware point
of view (as well as LVM's point of view) the array is fine. The file system, however, generally
can't be mounted (about 4 out of 5 times). Sometimes it does get auto-mounted, but when I then run
an xfs_repair -n -v in those cases there are pages of errors (badly aligned inode rec, bad
starting inode numbers, dubious inode btree block headers, among others). In the one case where I
let a repair actually run, out of about 4,500,000 files it linked roughly 2,000,000, but there was
no way to identify and verify file integrity; the rest were simply lost.<br>
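<br>
The post-reboot check is roughly the following (device and mount point as in the fstab line
further down; xfs_repair is run with -n so it doesn't modify anything):<br>
<br>
$ mount /var/ftp<br>
$ xfs_repair -n -v /dev/vg_media/lv_ftpshare<br>
<br>
Roughly 4 times out of 5 the mount fails outright, and the no-modify repair pass then prints the
pages of errors mentioned above.<br>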
<br>
This is not limited to large volume sizes; I have seen similar corruption on small ~2 TiB file
systems as well. In a couple of cases, while one file system was taking the I/O (say xfs_fsr -v
/home), another XFS file system on the same machine that was NOT taking much if any I/O (say
/var/test) also got badly corrupted. Both were on the same Areca controllers and the same physical
discs (same PVs and same VGs, but different LVs).<br>
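<br>
To be clear about how they share storage, both LVs sit in the same volume group on the same set of
PVs; something like the following (output omitted here, vg_media used as the example VG) shows the
layout:<br>
<br>
$ pvs<br>
$ vgs<br>
$ lvs -o +devices vg_media<br>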
<br>
Any suggestions on how to isolate or eliminate this would be greatly
appreciated.<br>
<br>
<br>
Steve<br>
<br>
<br>
</font><font face="Courier New">Technical data is below:</font><br>
<font face="Courier New">==============<br>
$iostat -m -x 15<br>
(IOSTAT capture right up to a freeze event:)<br>
(system sits here for a long while, hovering around 90% for the DM device
and about 30% for the PVs)<br>
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util<br>
sda 0.00 7.80 0.07 2.00 0.00 0.04 38.19 0.00 2.26 0.97 0.20<br>
sdb 120.07 34.47 253.00 706.67 24.98 28.96 115.11 1.06 1.10 0.28 26.87<br>
sdc 48.80 28.93 324.73 730.87 24.98 28.94 104.62 1.19 1.13 0.29 30.60<br>
sdd 121.73 33.13 251.60 700.40 24.99 28.94 116.01 1.11 1.17 0.29 27.40<br>
sde 49.00 28.60 324.33 731.47 24.99 28.95 104.65 1.22 1.15 0.26 27.53<br>
sdf 120.27 33.20 253.00 701.00 24.99 28.97 115.84 1.14 1.20 0.33 31.67<br>
sdg 48.80 29.07 324.73 731.80 25.00 28.95 104.59 1.37 1.29 0.35 36.93<br>
sdh 120.47 33.47 252.73 702.53 25.00 28.96 115.68 1.24 1.30 0.35 33.67<br>
sdi 50.73 28.27 322.73 735.13 24.99 29.01 104.54 1.34 1.26 0.31 32.27<br>
dm-0 0.00 0.00 0.13 0.13 0.00 0.00 12.00 0.01 25.00 25.00 0.67<br>
dm-1 0.00 0.00 1602.67 992.73 199.93 231.69 340.59 4.12 1.59 0.34 88.40<br>
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00<br>
<br>
<br>
(Then it jumps up to 99-100% for the majority of devices; here sdf, sdg,
sdh, and sdi are all on the same physical Areca card.)<br>
avg-cpu: %user %nice %system %iowait %steal %idle<br>
0.00 0.00 0.60 24.71 0.00 74.69<br>
<br>
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util<br>
sda 0.00 4.07 0.00 1.13 0.00 0.02 36.71 0.00 1.18 1.18 0.13<br>
sdb 2.07 1.93 8.00 17.00 0.63 0.84 120.33 0.04 1.49 0.35 0.87<br>
sdc 2.87 1.20 7.40 22.13 0.63 0.83 101.86 0.04 1.49 0.25 0.73<br>
sdd 2.13 1.80 8.07 17.20 0.63 0.84 119.64 0.04 1.45 0.32 0.80<br>
sde 2.93 1.07 7.20 21.80 0.63 0.83 103.65 0.05 1.89 0.34 1.00<br>
sdf 1.93 1.87 8.13 13.67 0.63 0.64 119.78 46.58 2.35 45.63 99.47<br>
sdg 2.87 1.00 7.13 17.80 0.62 0.64 104.04 64.12 2.41 39.84 99.33<br>
sdh 2.07 1.67 7.93 13.47 0.62 0.64 121.22 47.85 2.12 46.39 99.27<br>
sdi 2.93 1.07 7.07 18.47 0.62 0.64 101.77 62.15 2.32 38.83 99.13<br>
dm-0 0.00 0.00 0.20 0.07 0.00 0.00 10.00 0.00 2.50 2.50 0.07<br>
dm-1 0.00 0.00 40.20 30.13 5.03 6.68 340.96 74.73 2.13 14.19 99.80<br>
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00<br>
<br>
(Then here it hits 100% and the system locks)<br>
avg-cpu: %user %nice %system %iowait %steal %idle<br>
0.00 0.00 0.81 24.95 0.00 74.24<br>
<br>
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util<br>
sda 0.00 8.40 0.00 2.13 0.00 0.04 39.50 0.00 1.88 0.63 0.13<br>
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00<br>
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00<br>
sdd 0.00 0.00 0.00 0.07 0.00 0.00 16.00 0.00 0.00 0.00 0.00<br>
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00<br>
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50.00 0.00 0.00 100.00<br>
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 69.00 0.00 0.00 100.00<br>
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50.00 0.00 0.00 100.00<br>
sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 65.00 0.00 0.00 100.00<br>
dm-0 0.00 0.00 0.00 0.07 0.00 0.00 16.00 0.00 0.00 0.00 0.00<br>
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 85.00 0.00 0.00 100.00<br>
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00<br>
<br>
<br>
<br>
============ (System)<br>
(Ubuntu 8.04.3 LTS):<br>
Linux loki 2.6.24-26-server #1 SMP Tue Dec 1 18:26:43 UTC 2009 x86_64 GNU/Linux<br>
<br>
--------------<br>
xfs_repair version 2.9.4<br>
<br>
============= (modinfo's)<br>
filename: /lib/modules/2.6.24-26-server/kernel/fs/xfs/xfs.ko<br>
license: GPL<br>
description: SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled<br>
author: Silicon Graphics, Inc.<br>
srcversion: A2E6459B3A4C96355F95E61<br>
depends:<br>
vermagic: 2.6.24-26-server SMP mod_unload<br>
============<br>
filename: /lib/modules/2.6.24-26-server/kernel/drivers/scsi/arcmsr/arcmsr.ko<br>
version: Driver Version 1.20.00.15 2007/08/30<br>
license: Dual BSD/GPL<br>
description: ARECA (ARC11xx/12xx/13xx/16xx) SATA/SAS RAID HOST Adapter<br>
author: Erich Chen <a class="moz-txt-link-rfc2396E" href="mailto:support@areca.com.tw"><support@areca.com.tw></a><br>
srcversion: 38E576EB40C1A58E8B9E007<br>
alias: pci:v000017D3d00001681sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001680sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001381sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001380sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001280sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001270sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001260sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001230sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001220sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001210sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001202sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001201sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001200sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001170sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001160sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001130sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001120sv*sd*bc*sc*i*<br>
alias: pci:v000017D3d00001110sv*sd*bc*sc*i*<br>
depends: scsi_mod<br>
vermagic: 2.6.24-26-server SMP mod_unload<br>
===========<br>
filename: /lib/modules/2.6.24-26-server/kernel/drivers/md/dm-mod.ko<br>
license: GPL<br>
author: Joe Thornber <a class="moz-txt-link-rfc2396E" href="mailto:dm-devel@redhat.com"><dm-devel@redhat.com></a><br>
description: device-mapper driver<br>
srcversion: A7E89E997173E41CB6AAF04<br>
depends:<br>
vermagic: 2.6.24-26-server SMP mod_unload<br>
parm: major:The major number of the device mapper (uint)<br>
===========<br>
<br>
============<br>
mounted with:<br>
/dev/vg_media/lv_ftpshare /var/ftp xfs defaults,relatime,nobarrier,logbufs=8,logbsize=256k,sunit=256,swidth=2048,inode64,noikeep,largeio,swalloc,allocsize=128k 0 2<br>
<br>
============<br>
XFS info:<br>
meta-data=/dev/mapper/vg_media-lv_ftpshare isize=2048 agcount=41, agsize=268435424 blks<br>
= sectsz=512 attr=0<br>
data = bsize=4096 blocks=10737418200, imaxpct=1<br>
= sunit=32 swidth=256 blks, unwritten=1<br>
naming =version 2 bsize=4096<br>
log =internal bsize=4096 blocks=32768, version=2<br>
= sectsz=512 sunit=32 blks, lazy-count=0<br>
realtime =none extsz=1048576 blocks=0, rtextents=0<br>
<br>
=============<br>
XFS is running on top of LVM:<br>
--- Logical volume ---<br>
LV Name /dev/vg_media/lv_ftpshare<br>
VG Name vg_media<br>
LV UUID MgEBWv-x9fn-KUoJ-3y5X-snlk-7F9E-A3CiHh<br>
LV Write Access read/write<br>
LV Status available<br>
# open 1<br>
LV Size 40.00 TB<br>
Current LE 40960<br>
Segments 1<br>
Allocation inherit<br>
Read ahead sectors 0<br>
Block device 254:1<br>
<br>
==============<br>
LVM is using 8 hardware RAID volumes (MediaVol#00-#70 inclusive) as its base
physical volumes:<br>
[ 175.320738] ARECA RAID ADAPTER4: FIRMWARE VERSION V1.47 2009-07-16<br>
[ 175.336238] scsi4 : Areca SAS Host Adapter RAID Controller( RAID6 capable)<br>
[ 175.336239] Driver Version 1.20.00.15 2007/08/30<br>
[ 175.336387] ACPI: PCI Interrupt 0000:0a:00.0[A] -> GSI 17 (level, low) -> IRQ 17<br>
[ 175.336395] PCI: Setting latency timer of device 0000:0a:00.0 to 64<br>
[ 175.336990] scsi 4:0:0:0: Direct-Access Areca BootVol#00 R001 PQ: 0 ANSI: 5<br>
[ 175.337096] scsi 4:0:0:1: Direct-Access Areca MediaVol#00 R001 PQ: 0 ANSI: 5<br>
[ 175.337169] scsi 4:0:0:2: Direct-Access Areca MediaVol#10 R001 PQ: 0 ANSI: 5<br>
[ 175.337240] scsi 4:0:0:3: Direct-Access Areca MediaVol#20 R001 PQ: 0 ANSI: 5<br>
[ 175.337312] scsi 4:0:0:4: Direct-Access Areca MediaVol#30 R001 PQ: 0 ANSI: 5<br>
[ 175.337907] scsi 4:0:16:0: Processor Areca RAID controller R001 PQ: 0 ANSI: 0<br>
[ 175.356231] ARECA RAID ADAPTER5: FIRMWARE VERSION V1.47 2009-10-22<br>
[ 175.376144] scsi5 : Areca SAS Host Adapter RAID Controller( RAID6 capable)<br>
[ 175.376145] Driver Version 1.20.00.15 2007/08/30<br>
[ 175.377354] scsi 5:0:0:5: Direct-Access Areca MediaVol#40 R001 PQ: 0 ANSI: 5<br>
[ 175.377434] scsi 5:0:0:6: Direct-Access Areca MediaVol#50 R001 PQ: 0 ANSI: 5<br>
[ 175.377495] scsi 5:0:0:7: Direct-Access Areca MediaVol#60 R001 PQ: 0 ANSI: 5<br>
[ 175.377587] scsi 5:0:1:0: Direct-Access Areca MediaVol#70 R001 PQ: 0 ANSI: 5<br>
[ 175.378156] scsi 5:0:16:0: Processor Areca RAID controller R001 PQ: 0 ANSI: 0<br>
<br>
=================<br>
<br>
</font>
</body>
</html>