xfs Digest, Vol 79, Issue 19

David Raffelt david.raffelt at florey.edu.au
Wed Jan 7 01:05:14 CST 2015


Hi Chris,
Thanks for your time. I have responded to your suggestions below.


> Date: Tue, 6 Jan 2015 19:35:34 -0700
> From: Chris Murphy <lists at colorremedies.com>
> To: "xfs at oss.sgi.com" <xfs at oss.sgi.com>
> Subject: Re: XFS corrupt after RAID failure and resync
> Message-ID: <CAJCQCtS1p080M9dvuAJWBoafCcoOo2SaVLEhdVQV6QA8RC54cA at mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> On Tue, Jan 6, 2015 at 1:34 PM, David Raffelt
> <david.raffelt at florey.edu.au> wrote:
> > Hi Brian and Stefan,
> > Thanks for your reply.  I checked the status of the array after the
> > rebuild (and before the reset).
> >
> > md0 : active raid6 sdd1[8] sdc1[4] sda1[3] sdb1[7] sdi1[5] sde1[1]
> >       14650667520 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [UUUUUU_]
> >
> > However given that I've never had any problems before with mdadm
> > rebuilds I did not think to check the data before rebooting.  Note that
> > the array is still in this state. Before the reboot I tried to run a
> > smartctl check on the failed drives and it could not read them. When I
> > rebooted I did not actually replace any drives, I just power cycled to
> > see if I could re-access the drives that were thrown out of the array.
> > According to smartctl they are completely fine.
> >
> > I guess there is no way I can re-add the old drives and remove the newly
> > synced drive?  Even though I immediately kicked all users off the system
> > when I got the mdadm alert, it's possible a small amount of data was
> > written to the array during the resync.
>
> Well it sounds like there's more than one possibility here. If I
> follow correctly, you definitely had a working degraded 5/7 drive
> array, correct? In which case at least it should be possible to get
> that back, but I don't know what was happening at the time the system
> hung up on poweroff.
>
>
Yes, after the 2 disks were dropped I definitely had a working degraded
array with 5/7 drives. I only see XFS errors in the kernel log soon AFTER
the hot spare finished syncing. Here are the errors.

Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and run xfs_repair
Jan 06 00:00:27 server kernel: XFS (md0): metadata I/O error: block 0x36b106c00 ("xfs_trans_read_buf_map") error 117 numblks 16
Jan 06 00:00:27 server kernel: XFS (md0): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.
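
For what it's worth, error 117 appears to be EUCLEAN ("Structure needs
cleaning"), which as far as I can tell is the errno XFS reuses for its
EFSCORRUPTED code, so the filesystem itself is reporting corrupt metadata
rather than passing up a fresh I/O error from the disks. A quick way to
check the errno mapping from a shell:

    $ python -c "import os; print(os.strerror(117))"
    Structure needs cleaning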

If it helps, about a minute before the drives were dropped from the array I
got many errors like the ones below in the log. Here is a link to the
complete log if needed: https://dl.dropboxusercontent.com/u/1156508/journalctl_dump.txt

Jan 05 11:40:45 server kernel: ata11.00: status: { DRDY }
Jan 05 11:40:45 server kernel: ata11.00: cmd 60/10:00:70:60:0c/00:00:71:00:00/40 tag 0 ncq 8192 in res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 05 11:40:45 server kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 05 11:40:45 server kernel: ata11.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
...
Jan 05 11:41:15 server kernel: ata12: limiting SATA link speed to 3.0 Gbps
Jan 05 11:41:15 server kernel: ata12: hard resetting link
Jan 05 11:41:16 server kernel: ata12: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Jan 05 11:41:38 server kernel: ata11.00: qc timeout (cmd 0xec)
Jan 05 11:41:38 server kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jan 05 11:41:38 server kernel: ata11.00: revalidation failed (errno=-5)
Jan 05 11:41:38 server kernel: ata11.00: disabled
Jan 05 11:41:38 server kernel: ata11.00: device reported invalid CHS sector 0
...
Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] Unhandled error code
Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf]
Jan 05 11:41:40 server kernel: Result: hostbyte=0x04 driverbyte=0x00
Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] CDB:
Jan 05 11:41:40 server kernel: cdb[0]=0x88: 88 00 00 00 00 00 71 0c 60 40 00 00 00 10 00 00
Jan 05 11:41:40 server kernel: end_request: I/O error, dev sdf, sector 1896636480
...
Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] Unhandled error code
Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf]
Jan 05 11:41:40 server kernel: Result: hostbyte=0x04 driverbyte=0x00
Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] CDB:
Jan 05 11:41:40 server kernel: cdb[0]=0x88: 88 00 00 00 00 00 7f 00 0b 00 00 00 00 08 00 00
Jan 05 11:41:40 server kernel: md: super_written gets error=-5, uptodate=0
Jan 05 11:41:40 server kernel: md/raid:md0: Disk failure on sdf1, disabling device.
                               md/raid:md0: Operation continuing on 6 devices.




> It's not rare for smart to not test for certain failure vectors so it
> might say the drive is fine when it isn't. But what you should do next
> is
>
> mdadm -Evv /dev/sd[abcdefg]1   ##use actual drive letters
>
> Are you able to get information on all seven drives? Or do you
> definitely have at least one drive failed?
>

I get all the information from all 8 drives (7 + the hot spare). The 6
drives currently in the array all have the same event count (80664).
However, the 2 drives that were dropped have an event count of 80327.
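
In case the raw numbers help, I pulled the counters with something like the
loop below (drive letters illustrative; the current members are as in the
mdstat output above, plus sdf and the other dropped drive):

    for d in /dev/sd[abcdef]1 /dev/sdi1; do
        echo "== $d =="
        mdadm --examine "$d" | grep -E 'Events|Device Role'
    done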


>
> If the event counter from the above examine is the same for at least 5
> drives, you should be able to assemble the array with this command:
>
> mdadm --assemble --verbose /dev/mdX /dev/sd[bcdef]1
>
> You have to feed the drive letter designation with the right letters
> for drives with the same event counter. If that's 5 drives, use that.
> If it's 6 drives, use that. If the event counters are all off, then
> it's a matter of what they are so you may just post the event counters
> so we can see this. This isn't going to write anything to the array,
> the fs isn't mounted. So if it fails, nothing is worse off. If it
> works, then you can run xfs_repair -n and see if you get a sane
> result. If that works you can mount it in this degraded state and
> maybe extract some of the more important data before proceeding to the
> next step.
>

Not sure if I'm missing something, but the array is already assembled with
6/7 drives, all with the same event count. In any case, I have stopped the
array and assembled it again, with no luck.
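
For reference, the stop/re-assemble was along these lines (member names
taken from the mdstat output above), with your read-only xfs_repair check
as the obvious next step:

    mdadm --stop /dev/md0
    mdadm --assemble --verbose /dev/md0 /dev/sd[abcde]1 /dev/sdi1
    xfs_repair -n /dev/md0    # per your suggestion: dry run only, no writes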


>
> In the meantime I'm also curious about:
>
> smartctl -l scterc /dev/sdX
>

I'm ashamed to say that this command only works on 1 of the 8 drives, since
that is the only enterprise-class drive (we are funded by small science
grants). We have been gradually replacing the desktop-class drives as they
fail.

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
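
To save typing I just checked them all with a small loop rather than one
drive at a time (letters illustrative):

    for d in /dev/sd[a-f] /dev/sdi; do
        echo "== $d =="
        smartctl -l scterc "$d"
    done

The desktop-class drives simply report that SCT Error Recovery Control is
not supported.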


>
> This has to be issued per drive, no shortcut available by specifying
> all letters at once in brackets. And then lastly this one:
>
> cat /sys/block/sd[abcdefg]/device/timeout
>
> Again plug in the correct letters.
>

All devices are set to 30 seconds.
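
If I have understood the usual advice for drives without SCT ERC correctly,
the mitigation is to raise the kernel's command timeout well above the
drive's internal error-recovery time, so a slow drive is not reset and
kicked out mid-recovery. Something like this per drive, from a boot script
or udev rule (happy to be corrected on this):

    # illustrative only: raise the SCSI command timer from 30s to 180s
    echo 180 > /sys/block/sdX/device/timeout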


>
>
>
> > Unfortunately this 15TB RAID was part of a 45TB GlusterFS distributed
> > volume. It was only ever meant to be a scratch drive for intermediate
> > scientific results, however inevitably most users used it to store lots
> of
> > data. Oh well.
>
> Right, well it's not for sure toast yet. Also, one of the things
> gluster is intended to mitigate is the loss of an entire brick, which
> is what happened, but you need another 15TB of space to do
> distributed-replicated on your scratch space. If you can tolerate
> upwards of 48 hour single disk rebuild times, there are now 8TB HGST
> Helium drives :-P
>

Just to confirm, we have 3x15TB bricks in a 45TB volume. Don't we need
complete duplication in a distributed-replicated Gluster volume, or can we
get away with only 1 more brick?

Those HGST 8TB drives do look very tempting!
Thanks again,
Dave


>
> --
> Chris Murphy
>
>
>
>
>