
Re: xfs Digest, Vol 79, Issue 19

To: "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>, lists@xxxxxxxxxxxxxxxxx
Subject: Re: xfs Digest, Vol 79, Issue 19
From: David Raffelt <david.raffelt@xxxxxxxxxxxxx>
Date: Wed, 7 Jan 2015 18:05:14 +1100
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <9cb00ded133a452e9fed635bd0094885@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
References: <9cb00ded133a452e9fed635bd0094885@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Hi Chris,
Thanks for your time. I have responded to your suggestions below.
Date: Tue, 6 Jan 2015 19:35:34 -0700
From: Chris Murphy <lists@xxxxxxxxxxxxxxxxx>
To: "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Subject: Re: XFS corrupt after RAID failure and resync
Message-ID:
    <CAJCQCtS1p080M9dvuAJWBoafCcoOo2SaVLEhdVQV6QA8RC54cA@xxxxxxxxxxxxxx>
Content-Type: text/plain; charset=UTF-8

On Tue, Jan 6, 2015 at 1:34 PM, David Raffelt
<david.raffelt@xxxxxxxxxxxxx> wrote:
> Hi Brian and Stefan,
> Thanks for your reply. I checked the status of the array after the rebuild
> (and before the reset).
>
> md0 : active raid6 sdd1[8] sdc1[4] sda1[3] sdb1[7] sdi1[5] sde1[1]
>       14650667520 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6]
> [UUUUUU_]
>
> However given that I've never had any problems before with mdadm rebuilds I
> did not think to check the data before rebooting. Note that the array is
> still in this state. Before the reboot I tried to run a smartctl check on
> the failed drives and it could not read them. When I rebooted I did not
> actually replace any drives, I just power cycled to see if I could re-access
> the drives that were thrown out of the array. According to smartctl they are
> completely fine.
>
> I guess there is no way I can re-add the old drives and remove the newly
> synced drive? Even though I immediately kicked all users off the system
> when I got the mdadm alert, it's possible a small amount of data was written
> to the array during the resync.

Well it sounds like there's more than one possibility here. If I
follow correctly, you definitely had a working degraded 5/7 drive
array, correct? In which case at least it should be possible to get
that back, but I don't know what was happening at the time the system
hung up on poweroff.


Yes, after the 2 disks were dropped I definitely had a working degraded array with 5/7 drives. I only saw XFS errors in the kernel log soon AFTER the hot spare finished syncing. Here are the errors:

Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and run xfs_repair
Jan 06 00:00:27 server kernel: XFS (md0): metadata I/O error: block 0x36b106c00 ("xfs_trans_read_buf_map") error 117 numblks 16
Jan 06 00:00:27 server kernel: XFS (md0): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.

If it helps, about a minute before the drives were dropped from the array I got many of the errors below in the log. Here is a link to the complete log if needed: https://dl.dropboxusercontent.com/u/1156508/journalctl_dump.txt

Jan 05 11:40:45 server kernel: ata11.00: status: { DRDY }
Jan 05 11:40:45 server kernel: ata11.00: cmd 60/10:00:70:60:0c/00:00:71:00:00/40 tag 0 ncq 8192 in res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 05 11:40:45 server kernel: ata11.00: failed command: READ FPDMA QUEUED
Jan 05 11:40:45 server kernel: ata11.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
...
Jan 05 11:41:15 server kernel: ata12: limiting SATA link speed to 3.0 Gbps
Jan 05 11:41:15 server kernel: ata12: hard resetting link
Jan 05 11:41:16 server kernel: ata12: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Jan 05 11:41:38 server kernel: ata11.00: qc timeout (cmd 0xec)
Jan 05 11:41:38 server kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jan 05 11:41:38 server kernel: ata11.00: revalidation failed (errno=-5)
Jan 05 11:41:38 server kernel: ata11.00: disabled
Jan 05 11:41:38 server kernel: ata11.00: device reported invalid CHS sector 0
....
Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] Unhandled error code
Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf]
Jan 05 11:41:40 server kernel: Result: hostbyte=0x04 driverbyte=0x00
Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] CDB:
Jan 05 11:41:40 server kernel: cdb[0]=0x88: 88 00 00 00 00 00 71 0c 60 40 00 00 00 10 00 00
Jan 05 11:41:40 server kernel: end_request: I/O error, dev sdf, sector 1896636480
...
Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] Unhandled error code
Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf]
Jan 05 11:41:40 server kernel: Result: hostbyte=0x04 driverbyte=0x00
Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] CDB:
Jan 05 11:41:40 server kernel: cdb[0]=0x88: 88 00 00 00 00 00 7f 00 0b 00 00 00 00 08 00 00
Jan 05 11:41:40 server kernel: md: super_written gets error=-5, uptodate=0
Jan 05 11:41:40 server kernel: md/raid:md0: Disk failure on sdf1, disabling device.
                         md/raid:md0: Operation continuing on 6 devices.


It's not rare for smart to not test for certain failure vectors so it
might say the drive is fine when it isn't. But what you should do next
is

mdadm -Evv /dev/sd[abcdefg]1   ## use actual drive letters

Are you able to get information on all seven drives? Or do you
definitely have at least one drive failed?

I get all the information from all 8 drives (7 + the hot spare). The 6 drives currently in the array all have the same event count (80664); however, the 2 drives that were dropped have event count 80327.
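For comparing the counters across many drives, the Events line can be pulled out of the `mdadm -E` output with a small loop. A sketch, run here against a canned sample line since the real output needs the actual disks:

```shell
# Sketch: extract the Events counter from `mdadm -E`-style output.
# A canned sample line stands in for the real command; on the server
# you would use:  mdadm -E /dev/sdX1 | awk '/Events/ {print $3}'
sample='         Events : 80664'
events=$(printf '%s\n' "$sample" | awk '/Events/ {print $3}')
echo "$events"   # -> 80664
```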

If the event counter from the above examine is the same for at least 5
drives, you should be able to assemble the array with this command:

mdadm --assemble --verbose /dev/mdX /dev/sd[bcdef]1

You have to feed the drive letter designation with the right letters
for drives with the same event counter. If that's 5 drives, use that.
If it's 6 drives, use that. If the event counters are all off, then
it's a matter of what they are so you may just post the event counters
so we can see this. This isn't going to write anything to the array,
the fs isn't mounted. So if it fails, nothing is worse off. If it
works, then you can run xfs_repair -n and see if you get a sane
result. If that works you can mount it in this degraded state and
maybe extract some of the more important data before proceeding to the
next step.
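Written out, that suggested sequence looks something like the following. The device names are placeholders, and the commands are only echoed here as a dry run so they can be reviewed before being run for real against the array:

```shell
# Dry-run sketch of the suggested recovery steps; /dev/md0 and the
# drive letters are placeholders taken from the advice above. Remove
# the echo wrappers only after confirming letters with `mdadm -Evv`.
echo 'mdadm --stop /dev/md0'
echo 'mdadm --assemble --verbose /dev/md0 /dev/sd[bcdef]1'
echo 'xfs_repair -n /dev/md0'   # -n: check only, change nothing
```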

Not sure if I'm missing something, but the array is already assembled with 6/7 drives, all having the same event count. In any case I have stopped the array and assembled it again, with no luck.

In the meantime I'm also curious about:

smartctl -l scterc /dev/sdX

I'm ashamed to say that this command only works on 1 of the 8 drives, since it is the only enterprise-class drive (we are funded by small science grants). We have been gradually replacing the desktop-class drives as they fail.

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

This has to be issued per drive, no shortcut available by specifying
all letters at once in brackets. And then lastly this one:

cat /sys/block/sd[abcdefg]/device/timeout

Again plug in the correct letters.

All devices are set to 30 seconds.
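For reference, the per-drive timeout check can be looped in the shell. This sketch builds a fake sysfs tree under a temp dir so the loop itself is runnable as shown; on the real machine the path is /sys/block/$d/device/timeout, and `smartctl -l scterc` has to be run per drive the same way. A commonly recommended mitigation for drives that lack SCT ERC is to raise this kernel timeout (e.g. to 180 seconds) so the drive's long internal error recovery can finish before the link gets reset:

```shell
# Sketch: loop the per-drive timeout check. A fake sysfs tree under a
# temp dir stands in for /sys/block so the loop is demonstrable here;
# the drive letters are placeholders.
root=$(mktemp -d)
for d in sda sdb; do
  mkdir -p "$root/$d/device"
  echo 30 > "$root/$d/device/timeout"   # mimic the default 30s
done
for d in sda sdb; do
  printf '%s: %ss\n' "$d" "$(cat "$root/$d/device/timeout")"
done
rm -rf "$root"
```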



> Unfortunately this 15TB RAID was part of a 45TB GlusterFS distributed
> volume. It was only ever meant to be a scratch drive for intermediate
> scientific results, however inevitably most users used it to store lots of
> data. Oh well.

Right, well it's not for sure toast yet. Also, one of the things
gluster is intended to mitigate is the loss of an entire brick, which
is what happened, but you need another 15TB of space to do
distributed-replicated on your scratch space. If you can tolerate
upwards of 48 hour single disk rebuild times, there are now 8TB HGST
Helium drives :-P

Just to confirm, we have 3x15TB bricks in a 45TB volume. Don't we need complete duplication in a distributed-replicated Gluster volume, or can we get away with only 1 more brick?

Those HGST 8TB drives do look very tempting!
Thanks again,
Dave



--
Chris Murphy



