
Re: XFS data corruption with high I/O even on Areca hardware raid

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: XFS data corruption with high I/O even on Areca hardware raid
From: Steve Costaras <stevecs@xxxxxxxxxx>
Date: Thu, 14 Jan 2010 20:15:25 -0600
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20100115013507.GB28498@xxxxxxxxxxxxxxxx>
References: <4B4E6F3F.1090901@xxxxxxxxxx> <20100114022409.GW17483@xxxxxxxxxxxxxxxx> <4B4FBC3F.6050904@xxxxxxxxxx> <20100115013507.GB28498@xxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1.5) Gecko/20091204 Lightning/1.0b2pre Thunderbird/3.0


On 01/14/2010 19:35, Dave Chinner wrote:
These stack traces do - everything is waiting on IO completion to
occur. The elevator queues are full, the block device is congested
and lots of XFS code is waiting on IO completion to occur.

I don't like the abort device commands to arcmsr, and I still have not
heard anything from Areca support about them looking at it.

-------------
[ 3494.731923] arcmsr6: abort device command of scsi id = 0 lun = 5
[ 3511.746966] arcmsr6: abort device command of scsi id = 0 lun = 5
[ 3511.746978] arcmsr6: abort device command of scsi id = 0 lun = 7
[ 3528.759509] arcmsr6: abort device command of scsi id = 0 lun = 7
[ 3528.759520] arcmsr6: abort device command of scsi id = 0 lun = 7
[ 3545.782040] arcmsr6: abort device command of scsi id = 0 lun = 7
[ 3545.782052] arcmsr6: abort device command of scsi id = 0 lun = 6
[ 3562.785862] arcmsr6: abort device command of scsi id = 0 lun = 6
[ 3562.785872] arcmsr6: abort device command of scsi id = 0 lun = 6
[ 3579.798404] arcmsr6: abort device command of scsi id = 0 lun = 6
[ 3579.798410] arcmsr6: abort device command of scsi id = 0 lun = 5
Yeah, that looks bad - the driver appears to have aborted some IOs
(no idea why) but probably hasn't handled the error correctly and
completed the IOs it aborted with an error status (which would cause
XFS to shut down the filesystem but not freeze like this).  So it
looks like there is a buggy error handling path in the driver being
triggered by some kind of hardware problem.

I note that it is the same raid controller that has had problems
as the last report. It might just be a bad RAID card or SATA cables
from that RAID card. I'd work out which card it is, replace it
and the cables and see if that makes the problem go away....

Yeah, actually this IS a new RAID card & cables since the last failure, so I don't think (statistically) that it's hardware. It could be firmware or the driver. I've forwarded this over to Areca and hopefully they can come up with something.

Right now I'm testing with the RAID in write-through mode in the hope that, if it doesn't avoid the problem, it will at least minimize its effect. Thanks to the abort messages above, I also found some references about lengthening the SCSI command timeout (/sys/class/scsi_device/{?}/device/timeout), which may help under 'heavy' load; it looks to default to 30 seconds on at least my kernel. I haven't yet read up in detail on why it is set to 30, or whether that was just some arbitrary number someone picked. To me it seems long (considering we're generally talking 10-13ms service times on a 7200rpm drive), but who knows.
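
For reference, a minimal sketch of how the per-device timeouts could be inspected and raised, assuming the sysfs layout mentioned above. The function names and the 60-second value in the usage note are only illustrative, not a recommendation, and writing to these files on a real system needs root:

```python
import glob

# Glob for the per-device SCSI command timeout files mentioned above.
SYSFS_PATTERN = "/sys/class/scsi_device/*/device/timeout"


def read_timeouts(pattern=SYSFS_PATTERN):
    """Return {path: timeout_in_seconds} for every matching device."""
    timeouts = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            timeouts[path] = int(f.read().strip())
    return timeouts


def set_timeouts(seconds, pattern=SYSFS_PATTERN):
    """Write a new timeout (in seconds) to every matching device."""
    for path in glob.glob(pattern):
        with open(path, "w") as f:
            f.write(str(seconds))
```

Calling read_timeouts() prints nothing by itself; it returns a dict that can be looped over to show the current value per device. Something like set_timeouts(60) would then double the 30-second default seen here. The setting is not persistent, so it would have to be reapplied after a reboot.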

Steve
