
Re: XFS/driver bug or bad drive?

To: David Engel <david@xxxxxxxxxx>
Subject: Re: XFS/driver bug or bad drive?
From: Eric Sandeen <sandeen@xxxxxxxxxxx>
Date: Thu, 01 Oct 2009 19:39:54 -0500
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20091001232759.GA12832@xxxxxxxxxxxxxxx>
References: <20091001232759.GA12832@xxxxxxxxxxxxxxx>
User-agent: Thunderbird 2.0.0.23 (Macintosh/20090812)
David Engel wrote:
Hi,

I've been trying to diagnose a suspected disk drive problem for about
a week.  I now think the problem might be a known (and fixed) xfs or
driver bug, but I'm not 100% sure.  I'm hoping someone here can
confirm the problem is or isn't an xfs bug.

The drive in question is a Samsung HD753LJ.  I have two of these
drives and have had to do three replacements for various reasons in
<10 months of use.  In short, I don't have a lot of confidence in the
drive, even though recent evidence seems to point elsewhere.

The problem occurs when I copy several hundred gigabytes of large
files (MythTV recordings, to be specific) to the troublesome drive
from another drive.  When using a stock 2.6.30.8 kernel and xfs, the
copy eventually fails because the drive quits responding (and won't
respond again until it is power cycled).  The failure doesn't always
occur at the same point in the copy, but it does always occur.  Here
is a log sample of one of the failures.

Sep 29 17:59:34 tux kernel: XFS mounting filesystem sdb1
Sep 29 17:59:34 tux kernel: Ending clean XFS mount for filesystem: sdb1
Sep 29 18:32:07 tux kernel: ata2.00: exception Emask 0x0 SAct 0xffff SErr 0x0 action 0x6 frozen
Sep 29 18:32:07 tux kernel: ata2.00: cmd 61/00:00:af:02:eb/04:00:17:00:00/40 tag 0 ncq 524288 out
Sep 29 18:32:07 tux kernel:          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 29 18:32:07 tux kernel: ata2.00: status: { DRDY }
...
Sep 29 18:32:07 tux kernel: ata2: hard resetting link
Sep 29 18:32:17 tux kernel: ata2: softreset failed (device not ready)
...

Sep 29 18:33:07 tux kernel: ata2.00: disabled
Sep 29 18:33:07 tux kernel: ata2.00: device reported invalid CHS sector 0
Sep 29 18:33:07 tux last message repeated 15 times
Sep 29 18:33:07 tux kernel: ata2: EH complete
Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Unhandled error code
Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401276591
Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Unhandled error code
Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401275567

These are all storage errors, not xfs errors. Differing IO patterns from one filesystem to another might be what trips it up, but nothing above points to an xfs bug; any xfs problems would be a response to the IO errors above. The cause could be hardware or a driver, I'm not sure, but I think a hardware issue is most likely. You might point smartctl at the drive and see what it says.
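
One way to see which layer each message comes from is to bucket the log lines by their prefix. The following is only a rough sketch: the layer names and regular expressions are assumptions chosen to fit the excerpt above, not an official taxonomy of kernel messages, and `classify`/`summarize` are hypothetical helper names.

```python
import re
from collections import Counter

# Hypothetical prefix-to-layer mapping, tuned only to the log excerpt above.
LAYER_PATTERNS = [
    ("xfs", re.compile(r"kernel: .*\bXFS\b")),
    ("libata", re.compile(r"kernel: ata\d+(\.\d+)?:")),
    ("scsi-disk", re.compile(r"kernel: sd \d+:\d+:\d+:\d+:")),
    ("block", re.compile(r"kernel: end_request:")),
]


def classify(line):
    """Return the (assumed) kernel layer that emitted a syslog line."""
    for layer, pattern in LAYER_PATTERNS:
        if pattern.search(line):
            return layer
    return "other"


def summarize(lines):
    """Count how many lines each layer contributed."""
    return Counter(classify(line) for line in lines)


# A few lines copied from the log above, with wrapped lines rejoined.
SAMPLE = [
    "Sep 29 17:59:34 tux kernel: XFS mounting filesystem sdb1",
    "Sep 29 18:32:07 tux kernel: ata2.00: exception Emask 0x0 SAct 0xffff SErr 0x0 action 0x6 frozen",
    "Sep 29 18:32:07 tux kernel: ata2: hard resetting link",
    "Sep 29 18:33:07 tux kernel: sd 1:0:0:0: [sdb] Unhandled error code",
    "Sep 29 18:33:07 tux kernel: end_request: I/O error, dev sdb, sector 401276591",
]

if __name__ == "__main__":
    for layer, count in summarize(SAMPLE).items():
        print(layer, count)
```

In this excerpt the only xfs lines are the two mount messages; everything from the timeout onward is reported by the ata, sd and block layers.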

-Eric

I finally decided to give some other filesystems a try to see if
anything changed.  Lo and behold, it did.  Still using a stock
2.6.30.8 kernel, but with ext3, ext4 and jfs filesystems, the large
copy succeeded every time!  I then decided to try a stock 2.6.31.1
kernel with xfs.  It worked fine, too!

My question, now, is -- is this problem a known xfs bug that was fixed
in 2.6.31.x?  I glanced through the code changes and git log and
didn't see any smoking gun.  If it's not an xfs bug, does anyone know
if it might be a block driver bug (ata/ahci, in this case) that was
only tickled by xfs?

David
