XFS data corruption with high I/O even on Areca hardware raid
Steve Costaras
stevecs at chaven.com
Thu Jan 14 20:15:25 CST 2010
On 01/14/2010 19:35, Dave Chinner wrote:
> The stack traces do - everything is waiting on IO completion to
> occur. The elevator queues are full, the block device is congested
> and lots of XFS code is waiting on IO completion to occur.
>
>
>> I don't like the abort device command messages from arcmsr; I still
>> have not heard anything back from Areca support, whom I asked to
>> look at them.
>>
>> -------------
>> [ 3494.731923] arcmsr6: abort device command of scsi id = 0 lun = 5
>> [ 3511.746966] arcmsr6: abort device command of scsi id = 0 lun = 5
>> [ 3511.746978] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3528.759509] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3528.759520] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3545.782040] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3545.782052] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3562.785862] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3562.785872] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3579.798404] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3579.798410] arcmsr6: abort device command of scsi id = 0 lun = 5
>>
> Yea, that looks bad - the driver appears to have aborted some IOs
> (no idea why) but probably hasn't handled the error correctly and
> completed the aborted IOs with an error status (which would have
> caused XFS to shut down the filesystem rather than freeze like
> this). So it looks like there is a buggy error-handling path in the
> driver being triggered by some kind of hardware problem.
>
> I note that it is the same raid controller that had problems in the
> last report. It might just be a bad RAID card or the SATA cables
> from that card. I'd work out which card it is, replace it and the
> cables, and see if that makes the problem go away....
>
Yeah, actually this IS a new raid card & cables since the last failure,
so I don't think (statistically) that it's hardware. It could be
firmware or the driver. I've forwarded this over to Areca and hopefully
they can come up with something.
Right now I'm testing with the raid in write-through mode in the hope
that, if it doesn't avoid the problem entirely, it will at least
minimize its effect. Thanks to the abort messages above, I also found
some references about lengthening the scsi command timeout
(/sys/class/scsi_device/{?}/device/timeout), which may help under
'heavy' load; it looks to default to 30 seconds on at least my kernel.
I haven't read up in detail yet on why it is set to 30, or whether that
was just an arbitrary number someone picked. To me it seems long,
considering we're talking 10-13ms service times generally on a 7200rpm
drive, but who knows.
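For reference, here is a rough sketch (Python, untested here) of how one
could check and raise those per-device timeouts through sysfs. The
60-second value is just an example, not something I've settled on, and
the change does not persist across reboots:

#!/usr/bin/env python
# Rough sketch (untested): walk /sys/class/scsi_device, report each
# device's SCSI command timeout, and optionally raise it.
import glob
import sys

NEW_TIMEOUT = 60  # seconds; example value only, not a recommendation

def main(apply_change=False):
    for path in sorted(glob.glob('/sys/class/scsi_device/*/device/timeout')):
        with open(path) as f:
            current = int(f.read().strip())
        print('%-55s %3d s' % (path, current))
        if apply_change and current < NEW_TIMEOUT:
            # Writing needs root; the new value applies to commands
            # issued to that device from then on.
            with open(path, 'w') as f:
                f.write(str(NEW_TIMEOUT))
            print('  -> raised to %d s' % NEW_TIMEOUT)

if __name__ == '__main__':
    main(apply_change='--apply' in sys.argv[1:])

(One could of course just echo a value into each file by hand; this just
saves walking every LUN on the controller.)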
Steve