On Tue, Mar 5, 2013 at 12:23 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Mar 04, 2013 at 10:03:29AM +0100, Ole Tange wrote:
>> On Fri, Mar 1, 2013 at 9:53 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > What filesystem errors occurred
> when the drives went offline?
>> See http://dna.ku.dk/~tange/tmp/syslog.3
> You log is full of this:
> mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
> What's that mean?
We do not know, but it is something we are continually trying to find
out. We have 5 other systems using the same setup, and 1 of these 5
systems drops disks off the RAID while the rest work fine. In other
words: we do not experience data corruption - only disks dropping off
the RAID. That leads me to believe it is some kind of hardware or
controller issue rather than a filesystem problem.
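For what it is worth, the three fields in that message are just bit
slices of the raw 32-bit log_info word. A minimal decode sketch in
Python (the masks are my reading of the mpt2sas driver source, so
treat them as an assumption):

  ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}

  def decode_log_info(log_info):
      # Field layout as printed by mpt2sas: bus type in the top nibble
      # (0x3 = SAS), then originator, code, and a 16-bit sub_code.
      bus_type = (log_info & 0xF0000000) >> 28
      originator = (log_info & 0x0F000000) >> 24
      code = (log_info & 0x00FF0000) >> 16
      sub_code = log_info & 0x0000FFFF
      return "bus_type(0x%x), originator(%s), code(0x%02x), sub_code(0x%04x)" % (
          bus_type, ORIGINATORS.get(originator, hex(originator)), code, sub_code)

  print(decode_log_info(0x31120303))
  # -> bus_type(0x3), originator(PL), code(0x12), sub_code(0x0303)

So the word itself only tells us it is a SAS Protocol Layer event; what
code 0x12 / sub_code 0x0303 actually mean is what we have not been able
to pin down.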
>> Feb 26 00:46:52 franklin kernel: [556238.429259] XFS (md5p1): metadata
>> I/O error: block 0x459b8 ("xfs_buf_iodone_callbacks") error 5 buf
>> count 4096
> So, the first IO errors appear at 23:00 on /dev/sdb, and the
> controller does a full reset and reprobe. Looks like a port failure
> of some kind. Notable:
> mpt2sas1: LSISAS2008: FWVersion(07.00.00.00), ChipRevision(0x03),
> From a quick google, that firmware looks out of date (current
> LSISAS2008 firmwares are numbered 10 or 11, and bios versions are at
We have tried updating the firmware using LSI's own tool. That fails
because the tool says the firmware is not signed correctly.
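Until we find a firmware image that flashes cleanly, we at least keep
track of what each controller is running. A small sketch that reads the
version out of sysfs (the version_fw attribute name is what mpt2sas
exposes on our kernels; verify on yours):

  import glob, os

  # Print the firmware version reported by each SCSI host that
  # exposes one; the mpt2sas HBAs show up here.
  for host in sorted(glob.glob("/sys/class/scsi_host/host*")):
      fw = os.path.join(host, "version_fw")
      if os.path.exists(fw):
          print(os.path.basename(host), open(fw).read().strip())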
> /dev/md4 had a failure (/dev/sds) around 00:19, no rebuild started.
The rebuild of md4 is now complete.
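To spot the next failure or rebuild early we now poll /proc/mdstat. A
minimal progress scrape (the resync/recovery line format matches what
our kernels print, but treat the regex as an assumption):

  import re

  current = None
  for line in open("/proc/mdstat"):
      m = re.match(r"^(md\d+)\s*:", line)
      if m:
          current = m.group(1)
      # progress lines look like "[==>....]  recovery = 12.6% (...)"
      m = re.search(r"(resync|recovery)\s*=\s*([\d.]+)%", line)
      if m and current:
          print("%s: %s at %s%%" % (current, m.group(1), m.group(2)))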
> /dev/md1 had another failure (/dev/sdj) around 00:46, this time on a
> SYNCHRONISE CACHE command (i.e. log write). This IO failure caused
> the shutdown to occur. And this is the result:
> [556219.292225] end_request: I/O error, dev sdj, sector 10
> [556219.292275] md: super_written gets error=-5, uptodate=0
> [556219.292283] md/raid:md1: Disk failure on sdj, disabling device.
> [556219.292286] md/raid:md1: Operation continuing on 7 devices.
> At this point, /dev/md1 is reporting 7 working disks and has had an
> EIO on its superblock write, which means it's probably in an
> inconsistent state. Further, it's only got 8 disks associated with
> it and as a rebuild is in progress it means that data loss has
> occurred with this failure. There's your problem.
Yep. What I would like to see from xfs_repair is salvaging the part
that is not affected - which ought to be the primary part of the 100
TB filesystem.
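Until the arrays are stable I will stick to a no-modify pass, so we at
least see what xfs_repair thinks is salvageable without letting it
write. A minimal sketch of the dry run (-n makes no modifications; the
device path is the one from the log above):

  import subprocess

  # Run xfs_repair in no-modify mode against the unmounted volume;
  # it reports problems without writing anything to the device.
  subprocess.call(["xfs_repair", "-n", "/dev/md5p1"])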
> Essentially, you need to fix your hardware before you do anything
> else. Get it all back fully online and fix whatever the problems are
> that are causing IO errors, then you can worry about recovering the
> filesystem and your data. Until the hardware is stable and not
> throwing errors, recovery is going to be unreliable (if not
> impossible).
As that has been an ongoing effort, it is unlikely to be solved within
a short timeframe.