| To: | Dave Chinner <david@xxxxxxxxxxxxx> |
|---|---|
| Subject: | Re: XFS data corruption with high I/O even on Areca hardware raid |
| From: | Steve Costaras <stevecs@xxxxxxxxxx> |
| Date: | Thu, 14 Jan 2010 20:15:25 -0600 |
| Cc: | xfs@xxxxxxxxxxx |
| In-reply-to: | <20100115013507.GB28498@xxxxxxxxxxxxxxxx> |
| References: | <4B4E6F3F.1090901@xxxxxxxxxx> <20100114022409.GW17483@xxxxxxxxxxxxxxxx> <4B4FBC3F.6050904@xxxxxxxxxx> <20100115013507.GB28498@xxxxxxxxxxxxxxxx> |
| User-agent: | Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1.5) Gecko/20091204 Lightning/1.0b2pre Thunderbird/3.0 |
On 01/14/2010 19:35, Dave Chinner wrote:
> The stack traces do - everything is waiting on IO completion to
> occur. The elevator queues are full, the block device is congested,
> and lots of XFS code is waiting on IO completion to occur.
>
>> I don't like the abort device commands from arcmsr; I still have
>> not heard anything back from Areca support about them.
>> -------------
>> [ 3494.731923] arcmsr6: abort device command of scsi id = 0 lun = 5
>> [ 3511.746966] arcmsr6: abort device command of scsi id = 0 lun = 5
>> [ 3511.746978] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3528.759509] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3528.759520] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3545.782040] arcmsr6: abort device command of scsi id = 0 lun = 7
>> [ 3545.782052] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3562.785862] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3562.785872] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3579.798404] arcmsr6: abort device command of scsi id = 0 lun = 6
>> [ 3579.798410] arcmsr6: abort device command of scsi id = 0 lun = 5
>
> Yea, that looks bad - the driver appears to have aborted some IOs (no
> idea why) but probably hasn't handled the error correctly and
> completed the IOs it aborted with an error status (which would cause
> XFS to shut down the filesystem, but not freeze like this). So it
> looks like there is a buggy error-handling path in the driver being
> triggered by some kind of hardware problem.
>
> I note that it is the same raid controller that had problems as in
> the last report. It might just be a bad RAID card or SATA cables from
> that RAID card. I'd work out which card it is, replace it and the
> cables, and see if that makes the problem go away....

Yeah, actually this IS a new raid card & cables since the last failure, so I don't think (statistically) that it's hardware. It could be firmware or driver. I've forwarded this over to Areca and hopefully they can come up with something.
Right now I'm testing with the raid in write-through mode, in the hope that, if it doesn't avoid the problem entirely, it will at least minimize its effects. Thanks to the abort messages above, I also found some references about lengthening the SCSI command timeouts, which may help under 'heavy' load: /sys/class/scsi_device/{?}/device/timeout, which appears to default to 30 seconds, at least on my kernel. I haven't read up in detail on why it is set to 30, or whether that was just an arbitrary number someone picked; it seems long to me (considering we're generally talking 10-13ms service times on a 7200rpm drive, but who knows).

Steve
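For what it's worth, the timeout knob above can be inspected and changed through sysfs. A minimal sketch, assuming a standard Linux sysfs layout (the `BASE` override, the helper function names, and the `6:0:0:5` host:channel:target:lun address are all illustrative assumptions, not values from this thread - the real address depends on your own SCSI topology):

```shell
#!/bin/sh
# Sketch: list and raise per-device SCSI command timeouts via sysfs.
# BASE is overridable for experimentation; defaults to the real path.
BASE="${BASE:-/sys/class/scsi_device}"

# Print each device's timeout file and its current value in seconds.
list_timeouts() {
    for f in "$BASE"/*/device/timeout; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        printf '%s %s\n' "$f" "$(cat "$f")"
    done
}

# set_timeout <h:c:t:l> <seconds> - needs root on the real sysfs.
set_timeout() {
    echo "$2" > "$BASE/$1/device/timeout"
}

list_timeouts
```

Note that writing the sysfs file takes effect immediately but does not persist across reboots; a udev rule or init script would be needed to make it permanent.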