Re: Have the velociraptors in a test system now, checkout the errors.

To: Redeeman <redeeman@xxxxxxxxxxx>
Subject: Re: Have the velociraptors in a test system now, checkout the errors.
From: Justin Piszcz <jpiszcz@xxxxxxxxxxxxxxx>
Date: Sun, 14 Dec 2008 04:05:11 -0500 (EST)
Cc: Bill Davidsen <davidsen@xxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, linux-raid@xxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx, smartmontools-support@xxxxxxxxxxxxxxxxxxxxx
In-reply-to: <1229225303.16555.149.camel@localhost>
References: <alpine.DEB.1.10.0812060426280.3494@xxxxxxxxxxxxxxxx> <49405A94.8080601@xxxxxxx> <1229225303.16555.149.camel@localhost>
User-agent: Alpine 1.10 (DEB 962 2008-03-14)

On Sun, 14 Dec 2008, Redeeman wrote:

On Wed, 2008-12-10 at 19:11 -0500, Bill Davidsen wrote:
Justin Piszcz wrote:
Point of thread: two problems, described in detail below. One, NCQ in
Linux when used in a RAID configuration; and two, something in how Linux
interacts with the drives causes lots of errors, yet when I run the WD
tools on the disks, they do not show any.

If anyone would like me to run any debugging/patches/etc. on this
system, feel free to suggest or send me things to try out.  After I put
the VRs in a test system, I left NCQ enabled and made a 10-disk RAID5
to see how fast I could get it to fail.  I ran bonnie++, shown below,
as a disk benchmark/stress test:

For the next test I will repeat this one but with NCQ disabled; having
NCQ enabled makes it fail very easily.  Then I want to re-run the test
with RAID6.
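
For reference, one common way to run with NCQ effectively disabled is to
drop each drive's queue depth to 1 through sysfs. The sketch below assumes
libata-driven SATA disks that expose /sys/block/<dev>/device/queue_depth;
the function name set_queue_depth, the SYSFS_ROOT override, and the device
names in the usage note are illustrative assumptions, not taken from the
test system.

```shell
#!/bin/sh
# Sketch: set the queue depth for a list of block devices.
# A depth of 1 effectively disables NCQ on a libata SATA disk.
# SYSFS_ROOT is a hypothetical override so the function can be
# tried against a scratch directory instead of the real /sys.
set_queue_depth() {
    depth=$1
    shift
    root=${SYSFS_ROOT:-/sys}
    rc=0
    for dev in "$@"; do
        f="$root/block/$dev/device/queue_depth"
        if [ -w "$f" ]; then
            echo "$depth" > "$f"
            echo "$dev: queue_depth now $(cat "$f")"
        else
            # Not root, or the device does not exist.
            echo "$dev: cannot write $f" >&2
            rc=1
        fi
    done
    return $rc
}
```

Run as root, something like `set_queue_depth 1 sdb sdc` would cover two
array members. The setting does not survive a reboot, so it has to be
reapplied before each test run.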

bonnie++ -d /r1/test -s 1000G -m p63 -n 16:100000:16:64

$ df -h
/dev/md3              2.5T  5.5M  2.5T   1% /r1

And the results?  Two disk "failures" according to md/Linux within a
few hours as shown below:

Note, the NCQ-related errors are what I talk about all of the time: if
you use NCQ and Linux in a RAID environment with WD drives, well-- good
luck.

Two disks failed out of the RAID5 and I currently cannot even 'see'
one of the drives with smartctl; I will reboot the host and check sde.

After a reboot, it comes up and has no errors, which really makes one
wonder where/what the bugs are. There are two I can see:
1. NCQ issue on at least WD drives in Linux in SW md/RAID
2. Velociraptor/other disks reporting all kinds of sector errors etc.,
but when you use the WD 11.x disk tools program and run all of their
tests it says the disks have no problems whatsoever!  The SMART
statistics do confirm this.  Currently, TLER is on for all disks, for
the duration of these tests.
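
To see which members md has kicked out without trawling the kernel log,
the failed devices can be pulled from /proc/mdstat, where md marks them
with an (F) suffix. This is a minimal sketch; the function name
failed_members and its optional file argument are illustrative
assumptions.

```shell
#!/bin/sh
# Sketch: list array members that md has marked as failed.
# Reads /proc/mdstat by default; a different file can be passed
# as the first argument (handy for saved copies of mdstat).
failed_members() {
    mdstat=${1:-/proc/mdstat}
    awk '
        # Array lines look like:
        #   md3 : active raid5 sdb1[0] sde1[3](F) ...
        /^md/ {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    name = $i
                    sub(/\[.*/, "", name)   # strip "[3](F)"
                    print $1, name
                }
            }
        }
    ' "$mdstat"
}
```

Each output line is the array name followed by the failed member, which
gives a short list of devices to check next with smartctl.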

Just a few comments on this: I have several RAID arrays built on Seagate
using NCQ, and have yet to have a problem. I have NCQ on with my WD drives,
non-RAID, and haven't had an issue with them either. The WDs run a lot
cooler than the SG, but they are probably getting less use, as well. If
the WD are still on sale after the holiday I may grab a few more and run
RAID; by then I will have some small sense of trusting them.
Velociraptors, or which WD?
Velociraptors, Raptors, or 750GiB disks (in Linux SW RAID). The NCQ
issue does not appear to occur on Raptors on a 3ware card, though, only
in Linux+SW_raid. The regular Raptor/750GiB disks also run just fine
standalone with NCQ.


