<div dir="ltr"><div>Hi XFS team, </div><div><br></div><div><br></div><div>Here's the lsof output, grouped by the open files on the problematic disks. The full log of xfs_repair -n is included in the same gist. At the end of its run, xfs_repair recommends contacting the XFS mailing list. </div><div><br></div><div><a href="https://gist.github.com/HugoKuo/95613d7864aa0a1343615642b3309451">https://gist.github.com/HugoKuo/95613d7864aa0a1343615642b3309451</a><br></div><div><br></div><div>Perhaps I should go ahead, reboot the machine, and run xfs_repair again. Please find my answers inline. </div><br><div class="gmail_extra"><br><div class="gmail_quote"><font color="#cccccc">On Wed, Apr 20, 2016 at 3:34 AM, Brian Foster <span dir="ltr"><<a href="mailto:bfoster@redhat.com" target="_blank">bfoster@redhat.com</a>></span> wrote:<br></font><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><font color="#cccccc"><span class=""><br>
</span>So there's definitely some traces waiting on AGF locks and whatnot, but<br>
also many traces that appear to be waiting on I/O. For example:<br></font></blockquote><div><br></div><div>Yes, those I/O waits are the original problem of this thread. It looks like the disk was locked up: all of the waiting I/O targets the same disk (a single multipath entry).</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><font color="#cccccc">
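</font></blockquote><div>For reference, the grouping in the gist was produced roughly like the sketch below: count open files per DEVICE column of lsof, so the stuck processes all pile up under one device number. The sample lines here are hypothetical stand-ins for real lsof output, and the field positions assume lsof's default column layout.</div>

```shell
# Sketch: group `lsof` output by DEVICE (field 6 in the default layout) to
# see how many open files pile up on each disk. The sample text below is a
# hypothetical stand-in for real `lsof` output; on a live host, pipe
# `lsof` itself into the awk one-liner instead.
lsof_sample='COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
swift-obj 2096 swift 12r REG 253,4 4096 111 /srv/node/d1/objects/a
swift-obj 2097 swift 13r REG 253,4 4096 112 /srv/node/d1/objects/b
rsync 3001 swift 9r REG 253,7 1024 213 /srv/node/d2/x'

printf '%s\n' "$lsof_sample" \
    | awk 'NR > 1 { count[$6]++ } END { for (d in count) print d, count[d] }' \
    | sort
```

<div>On the real output, every stuck swift process grouped under the same device-mapper device.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><font color="#cccccc">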
<br>
kernel: swift-object- D 0000000000000008 0 2096 1605 0x00000000<br>
kernel: ffff8877cc2378b8 0000000000000082 ffff8877cc237818 ffff887ff016eb68<br>
kernel: ffff883fd4ab6b28 0000000000000046 ffff883fd4bd9400 00000001e7ea49d0<br>
kernel: ffff8877cc237848 ffffffff812735d1 ffff885fa2e4a5f8 ffff8877cc237fd8<br>
kernel: Call Trace:<br>
kernel: [<ffffffff812735d1>] ? __blk_run_queue+0x31/0x40<br>
kernel: [<ffffffff81539455>] schedule_timeout+0x215/0x2e0<br>
kernel: [<ffffffff812757c9>] ? blk_peek_request+0x189/0x210<br>
kernel: [<ffffffff8126d9b3>] ? elv_queue_empty+0x33/0x40<br>
kernel: [<ffffffffa00040a0>] ? dm_request_fn+0x240/0x340 [dm_mod]<br>
kernel: [<ffffffff815390d3>] wait_for_common+0x123/0x180<br>
kernel: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20<br>
kernel: [<ffffffffa0001036>] ? dm_unplug_all+0x36/0x50 [dm_mod]<br>
kernel: [<ffffffffa0415b56>] ? _xfs_buf_read+0x46/0x60 [xfs]<br>
kernel: [<ffffffffa040b417>] ? xfs_trans_read_buf+0x197/0x410 [xfs]<br>
kernel: [<ffffffff815391ed>] wait_for_completion+0x1d/0x20<br>
kernel: [<ffffffffa041503b>] xfs_buf_iowait+0x9b/0x100 [xfs]<br>
kernel: [<ffffffffa040b417>] ? xfs_trans_read_buf+0x197/0x410 [xfs]<br>
kernel: [<ffffffffa0415b56>] _xfs_buf_read+0x46/0x60 [xfs]<br>
kernel: [<ffffffffa0415c1b>] xfs_buf_read+0xab/0x100 [xfs]<br><br>
<br>
Are all of these swift processes running against independent storage, or<br>
one big array? Also, can you tell (e.g., with iotop) whether progress is<br>
being made here, albiet very slowly, or if the storage is indeed locked<br>
up..?<br>
<br></font></blockquote><div>There are 240+ swift processes running. </div><div>All of the stuck swift processes were attempting to access the same disk. Monitoring I/O with iotop, I can confirm the disk is indeed locked up rather than merely slow: there is zero activity on the problematic mount point. </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><font color="#cccccc">
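</font></blockquote><div>As a sanity check beyond iotop, one can diff the completed-I/O counters for the device itself. A minimal sketch follows; the device name dm-4 and the stat lines are hypothetical, and on a live box the two snapshots would be the matching line from /proc/diskstats taken a few seconds apart:</div>

```shell
# Sketch: show a device is locked up (not just slow) by comparing the
# reads-completed (field 4) and writes-completed (field 8) counters from
# two /proc/diskstats snapshots. These sample lines are hypothetical;
# identical counters across samples mean zero I/O completed in between.
snap1='253 4 dm-4 120 0 960 30 500 0 4000 200 0 100 230'
snap2='253 4 dm-4 120 0 960 30 500 0 4000 200 0 100 230'

reads_delta=$(( $(echo "$snap2" | awk '{print $4}') - $(echo "$snap1" | awk '{print $4}') ))
writes_delta=$(( $(echo "$snap2" | awk '{print $8}') - $(echo "$snap1" | awk '{print $8}') ))

if [ "$reads_delta" -eq 0 ] && [ "$writes_delta" -eq 0 ]; then
    echo "dm-4: no I/O completed between samples"
fi
```

<div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><font color="#cccccc">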
In any event, given the I/O hangs, the fact that you're on an old distro<br>
kernel and you have things like multipath enabled, it might be<br>
worthwhile to see if you can rule out any multipath issues.<br><br></font></blockquote><div><br></div><div>Upgrading the kernel on CentOS 6.5 may not be an option for the time being, but it's definitely worth a try later on one of the nodes. As for multipath, yes, I did suspect some mysterious problem with multipath + XFS under a certain load. But the symptoms look more XFS- and inode-related, hence I started investigating from the XFS side. If there's no way to make progress on the XFS side, I might break the multipath setup and observe the result for a while. </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><font color="#cccccc"><span class=""></span><br>
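</font></blockquote><div>Before tearing multipath down entirely, it may be worth checking whether any path is already flagged as failed. A sketch of scanning multipath -ll output for bad paths (the sample output below is hypothetical, not captured from this host):</div>

```shell
# Sketch: flag failed/faulty paths in `multipath -ll` output. On a real
# host, feed the actual command output in instead of this sample; the
# map name, WWID, and SCSI addresses below are all hypothetical.
mp_sample='mpathb (3600000000000000000000001) dm-4 VENDOR,MODEL
size=2.7T features="0" hwhandler="0" wp=rw
`-+- policy="round-robin 0" prio=1 status=active
  |- 1:0:0:1 sdb 8:16  failed faulty running
  `- 2:0:0:1 sdj 8:144 active ready  running'

printf '%s\n' "$mp_sample" | grep -E 'failed|faulty' || echo "no failed paths"
```

<div>A path stuck in "failed faulty" while the filesystem waits on I/O would point at multipath rather than XFS.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><font color="#cccccc">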
'umount -l' doesn't necessarily force anything. It just lazily unmounts<br>
the fs from the namespace and cleans up the mount once all references<br>
are dropped. I suspect the fs is still mounted internally.<br>
<span class=""><br>
Brian<br>
</span></font><div class=""><div class="h5"><font color="#cccccc"><br></font></div></div></blockquote><div><br></div><div>Thanks // Hugo </div></div></div></div>