
XFS data corruption with high I/O even on hardware raid

To: xfs@xxxxxxxxxxx
Subject: XFS data corruption with high I/O even on hardware raid
From: Steve Costaras <stevecs@xxxxxxxxxx>
Date: Wed, 13 Jan 2010 19:11:27 -0600
Ok, I've been seeing a problem here since I had to move over to XFS from JFS due to file system size limits: XFS data corruption under "heavy I/O".  Basically, under heavy load (e.g. if I'm running an xfs_fsr, which nearly always triggers the freeze, on a volume) the system hovers around 90% utilization on the dm device for a while (sometimes an hour or more, sometimes minutes), then the subsystem goes to 100% utilization and freezes solid, forcing me to hard-reboot the box.  When it comes back up the XFS volumes are generally badly corrupted (see below).  The Areca cards all have BBUs, the only write cache is on the BBU (drive cache disabled), and the systems are all UPS protected as well.  These freezes have happened too frequently, and unfortunately nothing is logged anywhere.  It's usually not worth doing a repair, as the corruption is so extensive that a complete restore from backup is required.  I mention xfs_fsr here only because it /seems/ to generate an I/O pattern that nearly always results in a freeze; other high-I/O workloads have triggered it too, though not as reliably.
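(For reference, the rough sequence that triggers it; the mount point here is just an example, and I watch utilization with iostat in a second terminal:)

$ xfs_fsr -v /var/ftp      # reorganizer run on a busy volume; this is what usually provokes the freeze
$ iostat -m -x 15          # in another terminal: the dm device sits near 90% util, then spikes to 100% and the box locks up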

I don't know what else can be done to isolate this (and I'm not sure it's directly related to XFS, since LVM and the Areca driver are also involved), but the end result is that XFS gets badly corrupted.  I did NOT have these issues with JFS on the same LVM + Areca setup, so it /seems/ to point to XFS, or at least XFS is tied in somewhere.  Unfortunately JFS has problems with file systems larger than 32 TiB, so XFS is the only file system I can use here.

Since I'm using hardware RAID with BBU, when I reboot the RAID controller writes any outstanding data in its cache out to the drives, and from the hardware point of view (as well as LVM's) the array is fine.  The file system, however, generally can't be mounted (about 4 out of 5 times).  Sometimes it does get auto-mounted, but when I then run xfs_repair -n -v there are pages of errors (badly aligned inode rec, bad starting inode numbers, dubious inode btree block headers, among others).  In one case where I let a repair actually run, out of about 4,500,000 files it linked roughly 2,000,000, but there was no way to identify and verify file integrity; the rest were just lost.
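(The post-crash check looks roughly like this; the device and mount point below are my /var/ftp volume as an example:)

$ mount /var/ftp                                 # usually fails outright after one of these freezes
$ xfs_repair -n -v /dev/vg_media/lv_ftpshare     # read-only check: pages of inode rec / inode btree errors as described above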

This is not limited to large volumes; I have seen the same on small ~2 TiB file systems as well.  In a couple of cases, while one file system was taking the I/O (say xfs_fsr -v /home), another XFS file system on the same machine that was NOT taking much if any I/O also got badly corrupted (say /var/test).  Both were on the same Areca controllers and the same physical disks (same PVs and same VGs, but different LVs).
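(For reference, the shared layout can be seen with the standard LVM tools; vg_media is the VG shown in the details further below:)

$ pvs                          # lists the 8 Areca MediaVol physical volumes
$ lvs -o +devices vg_media     # shows each LV and which PVs its extents are allocated on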

Any suggestions on how to isolate or eliminate this would be greatly appreciated.


Steve


Technical data is below:
==============
$iostat -m -x 15
(iostat capture right up to a freeze event.)
(The system sits here for a long while, hovering around 90% for the DM device and about 30% for the PVs:)
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     7.80    0.07    2.00     0.00     0.04    38.19     0.00    2.26   0.97   0.20
sdb             120.07    34.47  253.00  706.67    24.98    28.96   115.11     1.06    1.10   0.28  26.87
sdc              48.80    28.93  324.73  730.87    24.98    28.94   104.62     1.19    1.13   0.29  30.60
sdd             121.73    33.13  251.60  700.40    24.99    28.94   116.01     1.11    1.17   0.29  27.40
sde              49.00    28.60  324.33  731.47    24.99    28.95   104.65     1.22    1.15   0.26  27.53
sdf             120.27    33.20  253.00  701.00    24.99    28.97   115.84     1.14    1.20   0.33  31.67
sdg              48.80    29.07  324.73  731.80    25.00    28.95   104.59     1.37    1.29   0.35  36.93
sdh             120.47    33.47  252.73  702.53    25.00    28.96   115.68     1.24    1.30   0.35  33.67
sdi              50.73    28.27  322.73  735.13    24.99    29.01   104.54     1.34    1.26   0.31  32.27
dm-0              0.00     0.00    0.13    0.13     0.00     0.00    12.00     0.01   25.00  25.00   0.67
dm-1              0.00     0.00 1602.67  992.73   199.93   231.69   340.59     4.12    1.59   0.34  88.40
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00


(Then it jumps up to 99-100% for the majority of devices; here sdf, sdg, sdh and sdi are all on the same physical Areca card:)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.60   24.71    0.00   74.69

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     4.07    0.00    1.13     0.00     0.02    36.71     0.00    1.18   1.18   0.13
sdb               2.07     1.93    8.00   17.00     0.63     0.84   120.33     0.04    1.49   0.35   0.87
sdc               2.87     1.20    7.40   22.13     0.63     0.83   101.86     0.04    1.49   0.25   0.73
sdd               2.13     1.80    8.07   17.20     0.63     0.84   119.64     0.04    1.45   0.32   0.80
sde               2.93     1.07    7.20   21.80     0.63     0.83   103.65     0.05    1.89   0.34   1.00
sdf               1.93     1.87    8.13   13.67     0.63     0.64   119.78    46.58    2.35  45.63  99.47
sdg               2.87     1.00    7.13   17.80     0.62     0.64   104.04    64.12    2.41  39.84  99.33
sdh               2.07     1.67    7.93   13.47     0.62     0.64   121.22    47.85    2.12  46.39  99.27
sdi               2.93     1.07    7.07   18.47     0.62     0.64   101.77    62.15    2.32  38.83  99.13
dm-0              0.00     0.00    0.20    0.07     0.00     0.00    10.00     0.00    2.50   2.50   0.07
dm-1              0.00     0.00   40.20   30.13     5.03     6.68   340.96    74.73    2.13  14.19  99.80
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

(Then it hits 100% and the system locks up:)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.81   24.95    0.00   74.24

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     8.40    0.00    2.13     0.00     0.04    39.50     0.00    1.88   0.63   0.13
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.07     0.00     0.00    16.00     0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00    50.00    0.00   0.00 100.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00    69.00    0.00   0.00 100.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00    50.00    0.00   0.00 100.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00    65.00    0.00   0.00 100.00
dm-0              0.00     0.00    0.00    0.07     0.00     0.00    16.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00    85.00    0.00   0.00 100.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00



============ (System)
(Ubuntu 8.04.3 LTS):
Linux loki 2.6.24-26-server #1 SMP Tue Dec 1 18:26:43 UTC 2009 x86_64 GNU/Linux

--------------
xfs_repair version 2.9.4

============= (modinfo's)
filename:       /lib/modules/2.6.24-26-server/kernel/fs/xfs/xfs.ko
license:        GPL
description:    SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
author:         Silicon Graphics, Inc.
srcversion:     A2E6459B3A4C96355F95E61
depends:
vermagic:       2.6.24-26-server SMP mod_unload
============
filename:       /lib/modules/2.6.24-26-server/kernel/drivers/scsi/arcmsr/arcmsr.ko
version:        Driver Version 1.20.00.15 2007/08/30
license:        Dual BSD/GPL
description:    ARECA (ARC11xx/12xx/13xx/16xx) SATA/SAS RAID HOST Adapter
author:         Erich Chen <support@xxxxxxxxxxxx>
srcversion:     38E576EB40C1A58E8B9E007
alias:          pci:v000017D3d00001681sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001680sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001381sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001380sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001280sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001270sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001260sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001230sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001220sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001210sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001202sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001201sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001200sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001170sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001160sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001130sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001120sv*sd*bc*sc*i*
alias:          pci:v000017D3d00001110sv*sd*bc*sc*i*
depends:        scsi_mod
vermagic:       2.6.24-26-server SMP mod_unload
===========
filename:       /lib/modules/2.6.24-26-server/kernel/drivers/md/dm-mod.ko
license:        GPL
author:         Joe Thornber <dm-devel@xxxxxxxxxx>
description:    device-mapper driver
srcversion:     A7E89E997173E41CB6AAF04
depends:
vermagic:       2.6.24-26-server SMP mod_unload
parm:           major:The major number of the device mapper (uint)
===========

============
mounted with:
/dev/vg_media/lv_ftpshare       /var/ftp        xfs     defaults,relatime,nobarrier,logbufs=8,logbsize=256k,sunit=256,swidth=2048,inode64,noikeep,largeio,swalloc,allocsize=128k       0       2
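
(For reference, the equivalent manual mount with the same options, assuming the device and mount point from the fstab line above:)

$ mount -t xfs -o relatime,nobarrier,logbufs=8,logbsize=256k,sunit=256,swidth=2048,inode64,noikeep,largeio,swalloc,allocsize=128k /dev/vg_media/lv_ftpshare /var/ftp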

============
XFS info:
meta-data=/dev/vg_media/lv_ftpshare isize=2048   agcount=41, agsize=268435424 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=10737418200, imaxpct=1
         =                       sunit=32     swidth=256 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=32 blks, lazy-count=0
realtime =none                   extsz=1048576 blocks=0, rtextents=0

=============
XFS is running on top of LVM:
 --- Logical volume ---
  LV Name                /dev/vg_media/lv_ftpshare
  VG Name                vg_media
  LV UUID                MgEBWv-x9fn-KUoJ-3y5X-snlk-7F9E-A3CiHh
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                40.00 TB
  Current LE             40960
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           254:1

==============
LVM is using as its base physical volumes 8 hardware RAID volumes (MediaVol#00 through MediaVol#70 inclusive):
[  175.320738] ARECA RAID ADAPTER4: FIRMWARE VERSION V1.47 2009-07-16
[  175.336238] scsi4 : Areca SAS Host Adapter RAID Controller( RAID6 capable)
[  175.336239]  Driver Version 1.20.00.15 2007/08/30
[  175.336387] ACPI: PCI Interrupt 0000:0a:00.0[A] -> GSI 17 (level, low) -> IRQ 17
[  175.336395] PCI: Setting latency timer of device 0000:0a:00.0 to 64
[  175.336990] scsi 4:0:0:0: Direct-Access     Areca    BootVol#00       R001 PQ: 0 ANSI: 5
[  175.337096] scsi 4:0:0:1: Direct-Access     Areca    MediaVol#00      R001 PQ: 0 ANSI: 5
[  175.337169] scsi 4:0:0:2: Direct-Access     Areca    MediaVol#10      R001 PQ: 0 ANSI: 5
[  175.337240] scsi 4:0:0:3: Direct-Access     Areca    MediaVol#20      R001 PQ: 0 ANSI: 5
[  175.337312] scsi 4:0:0:4: Direct-Access     Areca    MediaVol#30      R001 PQ: 0 ANSI: 5
[  175.337907] scsi 4:0:16:0: Processor         Areca    RAID controller  R001 PQ: 0 ANSI: 0
[  175.356231] ARECA RAID ADAPTER5: FIRMWARE VERSION V1.47 2009-10-22
[  175.376144] scsi5 : Areca SAS Host Adapter RAID Controller( RAID6 capable)
[  175.376145]  Driver Version 1.20.00.15 2007/08/30
[  175.377354] scsi 5:0:0:5: Direct-Access     Areca    MediaVol#40      R001 PQ: 0 ANSI: 5
[  175.377434] scsi 5:0:0:6: Direct-Access     Areca    MediaVol#50      R001 PQ: 0 ANSI: 5
[  175.377495] scsi 5:0:0:7: Direct-Access     Areca    MediaVol#60      R001 PQ: 0 ANSI: 5
[  175.377587] scsi 5:0:1:0: Direct-Access     Areca    MediaVol#70      R001 PQ: 0 ANSI: 5
[  175.378156] scsi 5:0:16:0: Processor         Areca    RAID controller  R001 PQ: 0 ANSI: 0

=================
