<div dir="ltr">Hi Brian, <div><br></div><div>Here's a gist that includes the sysrq-trigger output and an strace of one of the hanging $ls runs. This is from another problematic disk (d817) on the same server.</div><div><br></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div><a href="https://gist.github.com/HugoKuo/8eb8208bbb7a7f562a6c9a3eafa8f37f">https://gist.github.com/HugoKuo/8eb8208bbb7a7f562a6c9a3eafa8f37f</a></div><div><br></div></blockquote><div>It looks like the hanging $ls is stuck getting an extended attribute of a file on this disk. The full output can be found at the link above. </div><div><br></div><div><table class="" style="border-collapse:collapse;border-spacing:0px;color:rgb(51,51,51);font-family:Helvetica,arial,nimbussansl,liberationsans,freesans,clean,sans-serif,'Apple Color Emoji','Segoe UI Emoji','Segoe UI Symbol';font-size:13px;line-height:18.2px"><tbody style=""><tr style=""><td id="file-hanging-L143" class="" style="padding:0px 10px;width:50px;min-width:50px;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);text-align:right;white-space:nowrap;vertical-align:top;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="file-hanging-LC143" class="" style="padding:0px 10px;vertical-align:top;overflow:visible;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;word-wrap:normal;white-space:pre">lstat("/srv/node/d864/tmp/tmpIRYFaW", {st_mode=S_IFREG|0600, st_size=0, ...}) = 0</td></tr><tr style=""><td id="file-hanging-L144" class="" style="padding:0px 10px;width:50px;min-width:50px;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);text-align:right;white-space:nowrap;vertical-align:top;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="file-hanging-LC144" class="" style="padding:0px 
10px;vertical-align:top;overflow:visible;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;word-wrap:normal;white-space:pre">capget(0x20080522, 0, NULL) = -1 EFAULT (Bad address)</td></tr><tr style=""><td id="file-hanging-L145" class="" style="padding:0px 10px;width:50px;min-width:50px;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:18px;color:rgba(0,0,0,0.298039);text-align:right;white-space:nowrap;vertical-align:top;border-style:solid;border-color:rgb(238,238,238);border-width:0px 1px 0px 0px"></td><td id="file-hanging-LC145" class="" style="padding:0px 10px;vertical-align:top;overflow:visible;font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;word-wrap:normal;white-space:pre">getxattr("/srv/node/d864/tmp/tmpIRYFaW", "security.capability"</td></tr></tbody></table></div><div><br></div><div><br></div><div>As for the xfs_repair output at <a href="https://gist.github.com/HugoKuo/76f65bdc0b860ca6ed5e786f8c43da0e" rel="noreferrer" target="_blank">https://gist.github.com/HugoKuo/76f65bdc0b860ca6ed5e786f8c43da0e</a>: you asked whether the node had been force rebooted. The answer is NO. I<b> didn't reboot</b> this server yet. I unmounted it with <i>$umount -l &lt;dev&gt;</i> (a lazy unmount) and then ran xfs_repair. 
</div><div><br></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div><font color="#000000">$ls /srv/node/d864/tmp > test.d864</font></div><div><font color="#000000">$ls /srv/node/d864/tmp</font></div></blockquote><div><div class="gmail_extra"><br></div><div class="gmail_extra">Here are the contents of test.d864: <a href="https://gist.github.com/HugoKuo/25f93cd6daf5b0666a2ab85defd63a56">https://gist.github.com/HugoKuo/25f93cd6daf5b0666a2ab85defd63a56</a></div><div class="gmail_extra"><br></div><div class="gmail_extra">Thanks // Hugo</div><div class="gmail_extra"><br><div class="gmail_quote"><font color="#cccccc">On Tue, Apr 19, 2016 at 7:30 PM, Brian Foster <span dir="ltr">&lt;<a href="mailto:bfoster@redhat.com" target="_blank">bfoster@redhat.com</a>&gt;</span> wrote:<br></font><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><font color="#cccccc"><span class="">On Tue, Apr 19, 2016 at 05:56:19PM +0800, Hugo Kuo wrote:<br>
> Hi XFS team,<br>
><br>
> We encountered a problem frequently in past three weeks. Our daemons store<br>
> data to XFS partition associate with xattr.<br>
><br>
> Disk seems not responding since all processes to this disk in D state and<br>
> can't be killed at all.<br>
><br>
</span>> - It happens on several disks. I feel it's randomly.<br>
> - Reboot seems solve the problem temporarily.<br>
> - All disks are multipath devices.<br>
<span class="">><br>
><br>
> I suspected that's an issue from disk corrupted at beginning. But smartctl<br>
> doesn't show any clue about disk bad. And reboot makes the problem gone<br>
> away.<br>
><br>
><br>
</span>> - Any process to this disk is blocked. Even a simple $ls . Kernel log<br>
> <<a href="https://gist.github.com/HugoKuo/f87748786b26ea04fd9e1d86d9538293" rel="noreferrer" target="_blank">https://gist.github.com/HugoKuo/f87748786b26ea04fd9e1d86d9538293</a>><br>
<br>
Looks like it's waiting on an AGF buffer. The buffer could be held by<br>
something else, but we don't have enough information from that one<br>
trace. Could you get all of the blocked tasks when in this state (e.g.,<br>
"echo w > /proc/sysrq-trigger")?<br></font></blockquote><div><font color="#cccccc"><br></font></div><div><font color="#cccccc"><br></font></div><div><font color="#cccccc"> </font></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">
<font color="#cccccc"><br>
> - I tested the disk by read bytes on block via $dd . It works fine<br>
<span class="">> without any error in dmesg.<br>
</span>> - The `xfs_repair -n` output of a problematic mount point [xfs_repair -n]<br>
> <<a href="https://gist.github.com/HugoKuo/76f65bdc0b860ca6ed5e786f8c43da0e" rel="noreferrer" target="_blank">https://gist.github.com/HugoKuo/76f65bdc0b860ca6ed5e786f8c43da0e</a>> . It<br>
> is still processing.<br>
<br>
I presume this was run after a forced reboot..? If so, was the<br>
filesystem remounted first to replay the log (xfs_repair -n doesn't<br>
detect/warn about a dirty log, iirc). If the log was dirty, then repair<br>
is a bit less interesting simply because some corruption is to be<br>
expected in that scenario.<br>
<br>
> - Kernel : Linux node9 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov 10<br>
<span class="">> 18:01:38 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux<br>
</span>> - OS : CentOS release 6.5 (Final)<br>
> - XFS : xfsprogs.x86_64 3.1.1-14.el6<br>
<span class="">><br>
><br>
> There's an interesting behaviour of $ls command.<br>
><br>
> * This is completed in 1sec. Very quick and give me the result in the<br>
> test.d864 file $ls /srv/node/d864/tmp > test.d864<br>
> * This is hanging $ls /srv/node/d864/tmp<br>
><br>
<br>
</span>I'm not following you here. Are you missing an attachment (test.d864)?<br>
<br>
Brian<br>
<span class=""><br>
> [image: Inline image 1]<br>
><br>
> I suspect there's something wrong with imap. Is there a known bug ?<br>
><br>
> Thanks // Hugo<br>
<br>
<br>
<br>
</span>> _______________________________________________<br>
> xfs mailing list<br>
> <a href="mailto:xfs@oss.sgi.com">xfs@oss.sgi.com</a><br>
> <a href="http://oss.sgi.com/mailman/listinfo/xfs" rel="noreferrer" target="_blank">http://oss.sgi.com/mailman/listinfo/xfs</a><br>
<br>
</font></blockquote></div><br></div></div></div>
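<div>A possible reading of the strace excerpt in this message: the redirected $ls only has to read directory entries, while the interactive $ls also issues lstat() and getxattr("security.capability") for each file, which is typical when ls colorizes output on a terminal. Whether ls is aliased to a colorizing variant on this node is an assumption, not something the thread confirms. The sketch below mimics the two access patterns in Python; the function names are hypothetical, and printing each path before it is touched is only a way to see which entry a blocked call was working on.</div>

```python
import os
import time


def list_names_only(directory):
    """Roughly what `ls DIR > file` needs: directory entries only, no per-file calls."""
    return sorted(os.listdir(directory))


def probe_entries(directory):
    """Roughly what the strace shows for the hanging `ls`: for each entry,
    lstat() it and then request its "security.capability" xattr.

    Each path is printed before it is touched, so if a call blocks in the
    kernel, the last printed path is the one being accessed at that moment.
    """
    results = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        print("probing", path, flush=True)
        start = time.monotonic()
        os.lstat(path)
        try:
            os.getxattr(path, "security.capability", follow_symlinks=False)
            has_cap = True
        except OSError:
            # Typically ENODATA (attribute not set) or ENOTSUP (no xattr support).
            has_cap = False
        results[name] = (has_cap, time.monotonic() - start)
    return results
```

<div>If list_names_only() returns promptly on an affected mount while probe_entries() stalls, that would be consistent with the hang being in per-inode access rather than in reading the directory itself. Note that a process blocked this way may end up in D state like the others.</div>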