<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>Hi Chris,</div><div>Thanks for your time. I have responded to your suggestions below.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
Date: Tue, 6 Jan 2015 19:35:34 -0700<br>
From: Chris Murphy <<a href="mailto:lists@colorremedies.com">lists@colorremedies.com</a>><br>
To: "<a href="mailto:xfs@oss.sgi.com">xfs@oss.sgi.com</a>" <<a href="mailto:xfs@oss.sgi.com">xfs@oss.sgi.com</a>><br>
Subject: Re: XFS corrupt after RAID failure and resync<br>
Message-ID:<br>
<<a href="mailto:CAJCQCtS1p080M9dvuAJWBoafCcoOo2SaVLEhdVQV6QA8RC54cA@mail.gmail.com">CAJCQCtS1p080M9dvuAJWBoafCcoOo2SaVLEhdVQV6QA8RC54cA@mail.gmail.com</a>><br>
Content-Type: text/plain; charset=UTF-8<br>
<br>
On Tue, Jan 6, 2015 at 1:34 PM, David Raffelt<br>
<<a href="mailto:david.raffelt@florey.edu.au">david.raffelt@florey.edu.au</a>> wrote:<br>
> Hi Brian and Stefan,<br>
> Thanks for your reply. I checked the status of the array after the rebuild<br>
> (and before the reset).<br>
><br>
> md0 : active raid6 sdd1[8] sdc1[4] sda1[3] sdb1[7] sdi1[5] sde1[1]<br>
> 14650667520 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6]<br>
> [UUUUUU_]<br>
><br>
> However given that I've never had any problems before with mdadm rebuilds I<br>
> did not think to check the data before rebooting. Note that the array is<br>
> still in this state. Before the reboot I tried to run a smartctl check on<br>
> the failed drives and it could not read them. When I rebooted I did not<br>
> actually replace any drives, I just power cycled to see if I could re-access<br>
> the drives that were thrown out of the array. According to smartctl they are<br>
> completely fine.<br>
><br>
> I guess there is no way I can re-add the old drives and remove the newly<br>
> synced drive? Even though I immediately kicked all users off the system<br>
> when I got the mdadm alert, it's possible a small amount of data was written<br>
> to the array during the resync.<br>
<br>
Well it sounds like there's more than one possibility here. If I<br>
follow correctly, you definitely had a working degraded 5/7 drive<br>
array, correct? In which case at least it should be possible to get<br>
that back, but I don't know what was happening at the time the system<br>
hung up on poweroff.<br>
<br></blockquote><div><br></div><div>Yes, after the 2 disks were dropped I definitely had a working degraded array with 5/7 drives. I only see XFS errors in the kernel log soon AFTER the hot spare finished syncing. Here are the errors. </div><div><br></div>
<div style="font-size:13px">Jan 06 00:00:27 server kernel: XFS (md0): Corruption detected. Unmount and run xfs_repair</div><div style="font-size:13px">Jan 06 00:00:27 server kernel: XFS (md0): metadata I/O error: block 0x36b106c00 ("xfs_trans_read_buf_map") error 117 numblks 16</div><div style="font-size:13px">Jan 06 00:00:27 server kernel: XFS (md0): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.</div><div><br></div>
<div>If it helps, about 1 minute before the drives were dropped from the array I got many of the below errors in the log. Here is a link to the complete log if needed. <a href="https://dl.dropboxusercontent.com/u/1156508/journalctl_dump.txt">https://dl.dropboxusercontent.com/u/1156508/journalctl_dump.txt</a></div><div><br></div>
<div><div style="font-size:13px">Jan 05 11:40:45 server kernel: ata11.00: status: { DRDY }</div><div style="font-size:13px">Jan 05 11:40:45 server kernel: ata11.00: cmd 60/10:00:70:60:0c/00:00:71:00:00/40 tag 0 ncq 8192 in res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)</div><div style="font-size:13px">Jan 05 11:40:45 server kernel: ata11.00: failed command: READ FPDMA QUEUED</div><div style="font-size:13px">Jan 05 11:40:45 server kernel: ata11.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen</div><div style="font-size:13px">...</div>
<div style="font-size:13px">Jan 05 11:41:15 server kernel: ata12: limiting SATA link speed to 3.0 Gbps</div><div style="font-size:13px">Jan 05 11:41:15 server kernel: ata12: hard resetting link</div><div style="font-size:13px">Jan 05 11:41:16 server kernel: ata12: SATA link up 6.0 Gbps (SStatus 133 SControl 320)</div><div style="font-size:13px">Jan 05 11:41:38 server kernel: ata11.00: qc timeout (cmd 0xec)</div><div style="font-size:13px">Jan 05 11:41:38 server kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x4)</div><div style="font-size:13px">Jan 05 11:41:38 server kernel: ata11.00: revalidation failed (errno=-5)</div><div style="font-size:13px">Jan 05 11:41:38 server kernel: ata11.00: disabled</div><div style="font-size:13px">Jan 05 11:41:38 server kernel: ata11.00: device reported invalid CHS sector 0</div><div style="font-size:13px">....</div>
<div style="font-size:13px">Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] Unhandled error code</div><div style="font-size:13px">Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] </div><div style="font-size:13px">Jan 05 11:41:40 server kernel: Result: hostbyte=0x04 driverbyte=0x00</div><div style="font-size:13px">Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] CDB: </div><div style="font-size:13px">Jan 05 11:41:40 server kernel: cdb[0]=0x88: 88 00 00 00 00 00 71 0c 60 40 00 00 00 10 00 00</div><div style="font-size:13px">Jan 05 11:41:40 server kernel: end_request: I/O error, dev sdf, sector 1896636480</div><div style="font-size:13px">...</div>
<div style="font-size:13px">Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] Unhandled error code</div><div style="font-size:13px">Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] </div><div style="font-size:13px">Jan 05 11:41:40 server kernel: Result: hostbyte=0x04 driverbyte=0x00</div><div style="font-size:13px">Jan 05 11:41:40 server kernel: sd 11:0:0:0: [sdf] CDB: </div><div style="font-size:13px">Jan 05 11:41:40 server kernel: cdb[0]=0x88: 88 00 00 00 00 00 7f 00 0b 00 00 00 00 08 00 00</div><div style="font-size:13px">Jan 05 11:41:40 server kernel: md: super_written gets error=-5, uptodate=0</div><div style="font-size:13px">Jan 05 11:41:40 server kernel: md/raid:md0: Disk failure on sdf1, disabling device.</div><div style="font-size:13px"> md/raid:md0: Operation continuing on 6 devices.</div></div>
<div style="font-size:13px"><br></div><div style="font-size:13px"><br></div><div> </div>
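<div>(For reference, the relevant lines above were pulled out of the linked dump with nothing fancier than a grep along these lines:)</div>
<div style="font-size:13px">grep -E "XFS \(md0\)|ata1[12]|sdf" journalctl_dump.txt</div>
<div><br></div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">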
It's not rare for smart to not test for certain failure vectors so it<br>
might say the drive is fine when it isn't. But what you should do next<br>
is<br>
<br>
mdadm -Evv /dev/sd[abcdefg]1 ##use actual drive letters<br>
<br>
Are you able to get information on all seven drives? Or do you<br>
definitely have at least one drive failed?<br></blockquote><div><br></div><div>I get all the information from all 8 drives (7 + the hot spare). The 6 drives currently in the array all have the same event number (80664). However, the 2 drives that were dropped have event number 80327. </div><div> <br></div>
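<div>(In case it is useful, that was gathered with something along these lines; the bracket range just stands in for our actual drive letters:)</div>
<div style="font-size:13px">mdadm -Evv /dev/sd[a-i]1 | grep -E "^/dev/|Events"</div>
<div><br></div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">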
<br>
If the event counter from the above examine is the same for at least 5<br>
drives, you should be able to assemble the array with this command:<br>
<br>
mdadm --assemble --verbose /dev/mdX /dev/sd[bcdef]1<br>
<br>
You have to feed the drive letter designation with the right letters<br>
for drives with the same event counter. If that's 5 drives, use that.<br>
If it's 6 drives, use that. If the event counters are all off, then<br>
it's a matter of what they are so you may just post the event counters<br>
so we can see this. This isn't going to write anything to the array,<br>
the fs isn't mounted. So if it fails, nothing is worse off. If it<br>
works, then you can run xfs_repair -n and see if you get a sane<br>
result. If that works you can mount it in this degraded state and<br>
maybe extract some of the more important data before proceeding to the<br>
next step.<br></blockquote><div><br></div><div>Not sure if I'm missing something, but the array is already assembled with 6/7 drives, all having the same event number. In any case, I have stopped the array and assembled it again, with no luck.</div><div> </div>
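<div>(For the record, the reassembly attempt looked roughly like this, with the device letters taken from the mdstat output above:)</div>
<div style="font-size:13px">mdadm --stop /dev/md0</div>
<div style="font-size:13px">mdadm --assemble --verbose /dev/md0 /dev/sd[abcde]1 /dev/sdi1</div>
<div><br></div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">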
<br>
In the meantime I'm also curious about:<br>
<br>
smartctl -l scterc /dev/sdX<br></blockquote><div><br></div><div>I'm ashamed to say that this command only works on 1 of the 8 drives, since it is the only enterprise-class drive (we are funded by small science grants). We have been gradually replacing the desktop-class drives as they fail. Here is its output:</div><div><div><br></div><div>SCT Error Recovery Control:</div><div> Read: 70 (7.0 seconds)</div><div> Write: 70 (7.0 seconds)</div><div> <br></div></div>
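<div>(My understanding is that on drives which do support it, the 7-second read/write error recovery would be set per drive with something like the command below, but the desktop drives here just reject it:)</div>
<div style="font-size:13px">smartctl -l scterc,70,70 /dev/sdX</div>
<div><br></div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">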
<br>
This has to be issued per drive, no shortcut available by specifying<br>
all letters at once in brackets. And then lastly this one:<br>
<br>
cat /sys/block/sd[abcdefg]/device/timeout<br>
<br>
Again plug in the correct letters.<br></blockquote><div><br></div><div>All devices are set to 30 seconds. </div><div> <br></div>
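<div>(If the consensus is that these should be raised on the desktop drives that lack SCT ERC, I assume it would be something along these lines for each device, with 180 seconds just as an example value:)</div>
<div style="font-size:13px">echo 180 > /sys/block/sdX/device/timeout</div>
<div><br></div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">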
<br>
<br>
<br>
> Unfortunately this 15TB RAID was part of a 45TB GlusterFS distributed<br>
> volume. It was only ever meant to be a scratch drive for intermediate<br>
> scientific results, however inevitably most users used it to store lots of<br>
> data. Oh well.<br>
<br>
Right, well it's not for sure toast yet. Also, one of the things<br>
gluster is intended to mitigate is the loss of an entire brick, which<br>
is what happened, but you need another 15TB of space to do<br>
distributed-replicated on your scratch space. If you can tolerate<br>
upwards of 48 hour single disk rebuild times, there are now 8TB HGST<br>
Helium drives :-P<br></blockquote><div><br></div><div>Just to confirm, we have 3x15TB bricks in a 45TB volume. Don't we need complete duplication in a distributed-replicated Gluster volume, or can we get away with only 1 more brick? </div><div><br></div><div>Those HGST 8TB drives do look very tempting!</div><div>Thanks again,</div><div>Dave</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
<br>
--<br>
Chris Murphy<br>
<br>
<br>
<br><br></blockquote></div>
</div></div>