
Re: xfs_repair segfaults

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: xfs_repair segfaults
From: Ole Tange <tange@xxxxxxxxxx>
Date: Fri, 8 Mar 2013 11:09:40 +0100
Cc: xfs@xxxxxxxxxxx
On Tue, Mar 5, 2013 at 12:23 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Mar 04, 2013 at 10:03:29AM +0100, Ole Tange wrote:
>> On Fri, Mar 1, 2013 at 9:53 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> :
>> > What filesystem errors occurred
>> > when the drives went offline?
>>
>> See http://dna.ku.dk/~tange/tmp/syslog.3
>
> Your log is full of this:
>
> mpt2sas1: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
>
> What's that mean?

We do not know, but it is something we are continually trying to find
out. We have 5 other systems using the same setup, and they all log
the same message.
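
For reference, the fields the driver prints can be pulled out of the
raw log_info word like this. A minimal sketch in Python; the bit
layout is my reading of the public mpt2sas driver source
(_base_sas_log_info() in mpt2sas_base.c), so treat it as an
assumption:

    loginfo = 0x31120303                  # value from our syslog

    bus_type   = (loginfo >> 28) & 0xF    # 0x3 = SAS
    originator = (loginfo >> 24) & 0xF    # 0 = IOP, 1 = PL, 2 = IR
    code       = (loginfo >> 16) & 0xFF   # here 0x12
    sub_code   =  loginfo        & 0xFFFF # here 0x0303

    print("originator=0x%x code=0x%02x sub_code=0x%04x"
          % (originator, code, sub_code))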

One of these 5 systems drops disks off the RAID, but the rest work
fine. In other words: we do not experience data corruption, only
disks being dropped from the RAID. That leads me to believe it is
some kind of timeout error.
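
If it is a timeout, the per-device SCSI command timeout is at least
easy to inspect. A small sketch, assuming the standard
/sys/block/sdX/device/timeout sysfs attribute:

    import glob, os

    # Print the SCSI command timeout (in seconds) for every sd device.
    for dev in sorted(glob.glob("/sys/block/sd*")):
        path = os.path.join(dev, "device", "timeout")
        try:
            with open(path) as f:
                print("%s: %s s" % (os.path.basename(dev), f.read().strip()))
        except IOError:
            pass  # device dropped off or has no timeout attribute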

>> Feb 26 00:46:52 franklin kernel: [556238.429259] XFS (md5p1): metadata
>> I/O error: block 0x459b8 ("xfs_buf_iodone_callbacks") error 5 buf
>> count 4096
>
> So, the first IO errors appear at 23:00 on /dev/sdb, and the
> controller does a full reset and reprobe. Looks like a port failure
> of some kind. Notable:
>
> mpt2sas1: LSISAS2008: FWVersion(07.00.00.00), ChipRevision(0x03), 
> BiosVersion(07.11.10.00)
>
> From a quick google, that firmware looks out of date (current
> LSISAS2008 firmwares are numbered 10 or 11, and bios versions are at
> 7.21).

We have tried updating the firmware using LSI's own tool. That fails
because the tool says the firmware is not signed correctly.

> /dev/md4 had a failure (/dev/sds) around 00:19, no rebuild started.

The rebuild of md4 is now complete.

> /dev/md1 had another failure (/dev/sdj) around 00:46, this time on a
> SYNCHRONISE CACHE command (i.e. log write). This IO failure caused
> the shutdown to occur. And this is the result:
>
> [556219.292225] end_request: I/O error, dev sdj, sector 10
> [556219.292275] md: super_written gets error=-5, uptodate=0
> [556219.292283] md/raid:md1: Disk failure on sdj, disabling device.
> [556219.292286] md/raid:md1: Operation continuing on 7 devices.
>
> At this point, /dev/md1 is reporting 7 working disks and has had an
> EIO on its superblock write, which means it's probably in an
> inconsistent state. Further, it's only got 8 disks associated with
> it and as a rebuild is in progress it means that data loss has
> occurred with this failure. There's your problem.

Yep. What I would like to see from xfs_repair is that it salvages the
unaffected part, which ought to be the vast majority of the 100 TB.
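
The first step on our side will be a dry run with xfs_repair's
no-modify mode. A minimal sketch, with our device path filled in:

    import subprocess, sys

    # -n: no-modify mode -- xfs_repair only reports what it would change.
    ret = subprocess.call(["xfs_repair", "-n", "/dev/md5p1"])
    # With -n, xfs_repair exits 1 if corruption was found, 0 otherwise;
    # we will review the report before running a repair that writes.
    sys.exit(ret)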

> Essentially, you need to fix your hardware before you do anything
> else. Get it all back fully online and fix whatever the problems are
> that are causing IO errors, then you can worry about recovering the
> filesystem and your data. Until the hardware is stable and not
> throwing errors, recovery is going to be unreliable (if not
> impossible).

As that has been an ongoing effort, it is unlikely to be solved
within a short timeframe.


/Ole
