XFS/driver bug or bad drive?
Eric Sandeen
sandeen at sandeen.net
Thu Oct 1 19:39:54 CDT 2009
David Engel wrote:
> Hi,
>
> I've been trying to diagnose a suspected disk drive problem for about
> a week. I now think the problem might be a known (and fixed) xfs or
> driver bug, but I'm not 100% sure. I'm hoping someone here can
> confirm the problem is or isn't an xfs bug.
>
> The drive in question is a Samsung HD753LJ. I have two of these
> drives and have had to do three replacements for various reasons in
> <10 months of use. In short, I don't have a lot of confidence in the
> drive, even though recent evidence seems to point elsewhere.
>
> The problem occurs when I copy several hundred gigabytes of large
> files (MythTV recordings, to be specific) to the troublesome drive
> from another drive. When using a stock 2.6.30.8 kernel and xfs, the
> copy eventually fails because the drive quits responding (and won't
> respond again until it is power cycled). The failure doesn't always
> occur at the same point in the copy, but it does always occur. Here
> is a log sample of one of the failures.
>
> Sep 29 17:59:34 tux kernel: XFS mounting filesystem sdb1
> Sep 29 17:59:34 tux kernel: Ending clean XFS mount for filesystem: sdb1
> Sep 29 18:32:07 tux kernel: ata2.00: exception Emask 0x0 SAct 0xffff SErr 0x0 action 0x6 frozen
> Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:00:af:02:eb/04:00:17:00:00/40 tag 0 ncq 524288 out
> Sep 29 18:32:07 tux kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
...
> Sep 29 18:32:07 tux kernel: ata2: hard resetting link
> Sep 29 18:32:17 tux kernel: ata2: softreset failed (device not ready)
...
> Sep 29 18:33:07 tux kernel: ata2.00: disabled
> Sep 29 18:33:07 tux kernel: ata2.00: device reported invalid CHS sector 0
> Sep 29 18:33:07 tux last message repeated 15 times
> Sep 29 18:33:07 tux kernel: ata2: EH complete
> Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Unhandled error code
> Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
> Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401276591
> Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Unhandled error code
> Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
> Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401275567
These are all storage errors, not xfs errors. Differing IO patterns from
one filesystem to another could be what trips it up, but nothing above
points to an xfs bug; any xfs problems you see are a response to these IO
errors. The cause could be hardware or a driver, I'm not sure, but I
think a hardware issue is most likely. You might point smartctl at the
drive and see what it says.
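
To see at a glance that the failures sit below the filesystem, the log can
be tallied by libata port and block device. The following is only a minimal
sketch, assuming syslog lines in the same format as the sample quoted above;
the function name summarize_storage_errors and both regular expressions are
illustrative and not part of any existing tool.

```python
import re

# Block-layer error lines, e.g.
#   "... kernel: end_request: I/O error, dev sdb, sector 401276591"
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")

# libata exception lines, e.g.
#   "... kernel: ata2.00: exception Emask 0x0 SAct 0xffff ..."
ATA_EXCEPTION = re.compile(r"(ata\d+(?:\.\d+)?): exception Emask")


def summarize_storage_errors(lines):
    """Count ATA exceptions per port and collect failed sectors per device.

    Hypothetical helper: it only recognizes the two line formats above.
    """
    exceptions = {}
    failed_sectors = {}
    for line in lines:
        match = ATA_EXCEPTION.search(line)
        if match:
            port = match.group(1)
            exceptions[port] = exceptions.get(port, 0) + 1
            continue
        match = IO_ERROR.search(line)
        if match:
            device, sector = match.group(1), int(match.group(2))
            failed_sectors.setdefault(device, []).append(sector)
    return exceptions, failed_sectors


# Lines copied from the log sample in this thread.
sample = [
    "Sep 29 18:32:07 tux kernel: ata2.00: exception Emask 0x0 SAct 0xffff SErr 0x0 action 0x6 frozen",
    "Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401276591",
    "Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401275567",
]

exceptions, failed_sectors = summarize_storage_errors(sample)
print(exceptions)
print(failed_sectors)
```

If every error line falls into one of these two buckets and none comes from
xfs itself, that is consistent with the problem being in the drive, cabling,
or driver rather than in the filesystem.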
-Eric
> I finally decided to give some other filesystems a try to see if
> anything changed. Lo and behold, it did. Still using a stock
> 2.6.30.8 kernel, but with ext3, ext4 and jfs filesystems, the large
> copy succeeded every time! I then decided to try a stock 2.6.31.1
> kernel with xfs. It worked fine, too!
>
> My question, now, is -- is this problem a known xfs bug that was fixed
> in 2.6.31.x? I glanced through the code changes and git log and
> didn't see any smoking gun. If it's not an xfs bug, does anyone know
> if it might be a block driver bug (ata/ahci, in this case) that was
> only tickled by xfs?
>
> David