No subject


Tue Jan 31 03:57:03 CST 2012


A.  A defective Areca card
B.  Firmware issue (card and/or drives)
C.  Driver issue
D.  More than one of the above

> Areca Technology Corp. ARC-1160 16-Port PCI-X to SATA RAID Controller
> Firmware Version   : V1.42 2006-10-13

That's a 16 port card.  How many total drives do you have connected?

Are they all the same model/firmware rev?  If different models, do you at
least have identical models in each RAID pack? Mixing different
brands/models/firmware revs within a RAID pack is always a very bad idea.
In fact, using anything but identical drives/firmware on a single controller
card is a bad idea.  Some cards are more finicky than others, but almost all
of them will have problems of one kind or another with a mixed bag o'
drives.  They can even have problems with all-identical drives if the drive
firmware isn't to the card firmware's liking (see below).
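
If it helps to answer the question above, here is a minimal sketch of the
check I have in mind: group the drives by RAID pack and flag any pack that
contains more than one model/firmware combination.  The inventory below is
entirely made up (slot numbers, pack names, model and firmware strings are
placeholders), and the function name mixed_packs is just something I picked
for illustration.  You'd fill the list in by hand from whatever the card
BIOS reports for each drive.

```python
from collections import defaultdict


def mixed_packs(drives):
    """Return {pack: set of (model, firmware)} for packs that are not uniform."""
    seen = defaultdict(set)
    for drive in drives:
        seen[drive["pack"]].add((drive["model"], drive["firmware"]))
    # A pack is suspect when it holds more than one model/firmware combination.
    return {pack: combos for pack, combos in seen.items() if len(combos) > 1}


# Hypothetical inventory: every value here is a placeholder, not real data.
inventory = [
    {"slot": 1, "pack": "raid6-a", "model": "DRIVE-X", "firmware": "1.0"},
    {"slot": 2, "pack": "raid6-a", "model": "DRIVE-X", "firmware": "1.0"},
    {"slot": 3, "pack": "raid6-a", "model": "DRIVE-X", "firmware": "1.1"},
    {"slot": 4, "pack": "raid6-b", "model": "DRIVE-Y", "firmware": "2.0"},
]

for pack, combos in sorted(mixed_packs(inventory).items()):
    print(pack, "mixes:", sorted(combos))
```

An empty result means every pack is uniform.  That doesn't prove the drive
firmware and the card firmware get along, it only rules out the mixed bag.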

> After the hard reset, one disk was reported as 'faild' and the rebuild
> started.

Unfortunately the errors reported weren't indicative of a single bad drive,
but of multiple bad drives.  In fact, none of the drives are bad.  The
controller/firmware/driver have a problem, or have a problem with the
drive(s) firmware.  The Areca firmware marked one drive as bad because its
logic says something besides the card/firmware/driver _must_ be bad.  So, it
marked one of the drives as bad and started rebuilding it.

Back in the late '90s I had Mylex DAC960 cards doing exactly the same thing
due to a problem with firmware on the Seagate ST118202 Cheetah drives.  The
DAC960 would just kick a drive offline willy-nilly.  This was with 8 drives
with identical firmware in RAID5 arrays on a single SCSI channel.  It was
really annoying.  I was at customer sites twice weekly replacing and
rebuilding drives until Seagate finally admitted the firmware bug and
advance shipped us 50 new 3 series Cheetah drives.  That was really fun,
replacing drives one by one and rebuilding the arrays after each drive swap.
We lost a lot of labor $$ over that and had some less than happy customers.
Once all the drives were replaced with the 3 series, we never had another
problem with any of those arrays.  I'm still surprised I was able to rebuild
the arrays without issues after adding each new drive, which was a slightly
different size with a different firmware.  I was just sure the rebuilds
would puke.  I got lucky.  These systems were in production, which is why
we didn't restore from tape, which would have saved a lot of time.

>> What is the status of the RAID6 volume as reported by the RAID card BIOS?
> 
> By now, the rebuild finished, therefor the volume is in normal
> non-degraded state.

That's good.

>> What is the status of each of your EVMS volumes as reported by the EVMS UI?
> 
> They're all active. Do you need more informations here? There are
> approximately 45 active volumes on this server.

No.  Just wanted to know if they're all reported as healthy.

>> I'm asking all of these questions because it seems rather clear that the
>> root cause of your problem lies at a layer well below the XFS filesystem.
> 
> Yes, I never blamed XFS for being the cause of the problem.

I should have worded that differently.  I didn't mean to imply that you were
blaming XFS.  I meant that I wanted to help you figure out the root cause
which wasn't XFS.

>> You have two layers of physical disk abstraction below XFS:  a hardware
>> RAID6 and a software logical volume manager.  You've apparently suffered a
>> storage system hardware failure, according to your description.  You haven't
>> given any details of the current status of the hardware RAID, or of the
>> logical volumes, merely that XFS is having problems.  I think a "Well duh!"
>> is in order.
>>
>> Please provide _detailed_ information from the RAID card BIOS and the EVMS
>> UI.  Even if the problem isn't XFS related I for one would be glad to assist
>> you in getting this fixed.  Right now we don't have enough information.  At
>> least I don't.

On second read, this looks rather preachy and antagonistic.  I truly did not
intend that tone.  Please accept my apology if this came across that way.  I
think I was starting to get frustrated because I wanted to troubleshoot this
further but didn't feel I had enough info.  Again, this was less than
professional, and I apologize.

-- 
Stan



