<div dir="ltr">Hi Brian,<div><br></div><div>Here's the result of xfs_repair on the same disk after rebooting. <a href="https://gist.github.com/HugoKuo/e1d683d9653e66a80dfcfcbee4294fe8">https://gist.github.com/HugoKuo/e1d683d9653e66a80dfcfcbee4294fe8</a> </div><div>It's looks normal. </div><div><br></div><div>We rebooted the server and no hanging process in past 12hrs. I'll keep eyes on the server. </div><div>I know it's helpful to to trace-cmd for xfs. But there's too may xfs operations are happening in the server. It's crazy like 100MB data per second. I stopped the trace-cmd. As you said, we need to find out the source of the problem. trace-cmd would be a nice option. Is there a way to flush recored data if nothing happens in the past hours ?</div><div><br></div><pre style="font-family:monospace,Courier;padding:1em;border:1px dashed rgb(47,111,171);color:rgb(0,0,0);line-height:1.1em;font-size:12.7px;background-color:rgb(249,249,249)">trace-cmd record -e xfs\*<br></pre><div><br></div><div><br></div><div>Regards // Hugo</div><div><br></div><div> <div><br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 20, 2016 at 7:24 PM, Brian Foster <span dir="ltr"><<a href="mailto:bfoster@redhat.com" target="_blank">bfoster@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Wed, Apr 20, 2016 at 01:49:49PM +0800, Hugo Kuo wrote:<br>
> Hi XFS team,<br>
><br>
><br>
> Here's the grouped lsof output of every open file on the problematic<br>
> disks. The full log of xfs_repair -n is included in this gist as well. At<br>
> the end of its run, xfs_repair recommends contacting the xfs mailing list.<br>
><br>
> <a href="https://gist.github.com/HugoKuo/95613d7864aa0a1343615642b3309451" rel="noreferrer" target="_blank">https://gist.github.com/HugoKuo/95613d7864aa0a1343615642b3309451</a><br>
><br>
> Perhaps I should go ahead and reboot the machine, then run xfs_repair<br>
> again. Please find my answers inline.<br>
><br>
<br>
</span>Yes, repair is crashing in this case. Best to try xfs_repair after<br>
you've rebooted and mounted/umounted the fs to replay the log. If it's<br>
still crashing at that point, we'll probably want a metadata image of<br>
the fs, if possible (though there's a good chance a newer xfsprogs has<br>
the problem fixed).<br>
<div><div class="h5"><br>
><br>
> On Wed, Apr 20, 2016 at 3:34 AM, Brian Foster <<a href="mailto:bfoster@redhat.com">bfoster@redhat.com</a>> wrote:<br>
><br>
> ><br>
> > So there's definitely some traces waiting on AGF locks and whatnot, but<br>
> > also many traces that appear to be waiting on I/O. For example:<br>
> ><br>
><br>
> Yes, that I/O wait is the original problem of this thread. It looks like<br>
> the disk was locked up; all of the waiting I/O targets the same disk (a<br>
> multipath entry).<br>
><br>
><br>
> ><br>
> > kernel: swift-object- D 0000000000000008 0 2096 1605 0x00000000<br>
> > kernel: ffff8877cc2378b8 0000000000000082 ffff8877cc237818 ffff887ff016eb68<br>
> > kernel: ffff883fd4ab6b28 0000000000000046 ffff883fd4bd9400 00000001e7ea49d0<br>
> > kernel: ffff8877cc237848 ffffffff812735d1 ffff885fa2e4a5f8 ffff8877cc237fd8<br>
> > kernel: Call Trace:<br>
> > kernel: [<ffffffff812735d1>] ? __blk_run_queue+0x31/0x40<br>
> > kernel: [<ffffffff81539455>] schedule_timeout+0x215/0x2e0<br>
> > kernel: [<ffffffff812757c9>] ? blk_peek_request+0x189/0x210<br>
> > kernel: [<ffffffff8126d9b3>] ? elv_queue_empty+0x33/0x40<br>
> > kernel: [<ffffffffa00040a0>] ? dm_request_fn+0x240/0x340 [dm_mod]<br>
> > kernel: [<ffffffff815390d3>] wait_for_common+0x123/0x180<br>
> > kernel: [<ffffffff810672b0>] ? default_wake_function+0x0/0x20<br>
> > kernel: [<ffffffffa0001036>] ? dm_unplug_all+0x36/0x50 [dm_mod]<br>
> > kernel: [<ffffffffa0415b56>] ? _xfs_buf_read+0x46/0x60 [xfs]<br>
> > kernel: [<ffffffffa040b417>] ? xfs_trans_read_buf+0x197/0x410 [xfs]<br>
> > kernel: [<ffffffff815391ed>] wait_for_completion+0x1d/0x20<br>
> > kernel: [<ffffffffa041503b>] xfs_buf_iowait+0x9b/0x100 [xfs]<br>
> > kernel: [<ffffffffa040b417>] ? xfs_trans_read_buf+0x197/0x410 [xfs]<br>
> > kernel: [<ffffffffa0415b56>] _xfs_buf_read+0x46/0x60 [xfs]<br>
> > kernel: [<ffffffffa0415c1b>] xfs_buf_read+0xab/0x100 [xfs]<br>
> ><br>
> ><br>
> > Are all of these swift processes running against independent storage, or<br>
> > one big array? Also, can you tell (e.g., with iotop) whether progress is<br>
> > being made here, albeit very slowly, or if the storage is indeed locked<br>
> > up..?<br>
> ><br>
> There are 240+ swift processes running.<br>
> All of the stuck swift processes were attempting to access the same disk.<br>
> By monitoring I/O via iotop, I can confirm it is indeed locked up rather<br>
> than just slow: there is zero activity on the problematic mount point.<br>
><br>
><br>
> > In any event, given the I/O hangs, the fact that you're on an old distro<br>
> > kernel, and that you have things like multipath enabled, it might be<br>
> > worthwhile to see if you can rule out any multipath issues.<br>
> ><br>
> ><br>
> Upgrading the kernel on CentOS 6.5 may not be an option for the time<br>
> being, but it's definitely worth a try by picking one of the nodes for<br>
> testing later. As for multipath, yes, I did suspect some mysterious<br>
> problem with multipath + XFS under a certain load, but the symptoms looked<br>
> more XFS- and inode-related, hence I started investigating from the XFS<br>
> side. If there's no way to move forward there, I might break the multipath<br>
> setup and observe the result for a while.<br>
><br>
<br>
</div></div>It's hard to pinpoint something to the fs when there's a bunch of hung<br>
I/Os. You probably want to track down the source of those problems<br>
first.<br>
<span class="HOEnZb"><font color="#888888"><br>
Brian<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
><br>
> ><br>
> > 'umount -l' doesn't necessarily force anything. It just lazily unmounts<br>
> > the fs from the namespace and cleans up the mount once all references<br>
> > are dropped. I suspect the fs is still mounted internally.<br>
> ><br>
> > Brian<br>
> ><br>
> ><br>
> Thanks // Hugo<br>
</div></div></blockquote></div><br></div>
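<div><br></div><div>A possible way to handle the trace-cmd question above: leave the xfs events spinning in the in-kernel ring buffer instead of streaming ~100MB/s to disk, and only pull the buffer out when a hang actually occurs. This is a sketch, assuming the installed trace-cmd supports start/extract/reset; the 20000 KB per-CPU buffer size and the output file name are arbitrary examples.</div><pre style="font-family:monospace,Courier;padding:1em;border:1px dashed rgb(47,111,171);color:rgb(0,0,0);line-height:1.1em;font-size:12.7px;background-color:rgb(249,249,249)"># Enable xfs events in the kernel ring buffer only; nothing is written to
# disk yet. -b sets the per-CPU buffer size in KB, so once the buffer is
# full the oldest events are simply overwritten.
trace-cmd start -b 20000 -e xfs\*

# When a hang is observed, snapshot whatever is currently in the buffer:
trace-cmd extract -o trace-hang.dat

# If nothing interesting happened, discard the data and stop tracing:
trace-cmd reset
</pre>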
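<div><br></div><div>For the repair sequence suggested in the quoted message (replay the log with a mount/umount cycle, run xfs_repair, and capture a metadata image if repair still crashes), a sketch follows; /dev/mapper/mpathX and /mnt/check are placeholders for the actual multipath device and a scratch mount point.</div><pre style="font-family:monospace,Courier;padding:1em;border:1px dashed rgb(47,111,171);color:rgb(0,0,0);line-height:1.1em;font-size:12.7px;background-color:rgb(249,249,249)"># Mount and unmount once so the XFS log is replayed, then repair for real:
mount /dev/mapper/mpathX /mnt/check
umount /mnt/check
xfs_repair /dev/mapper/mpathX

# If xfs_repair still crashes, capture a metadata image of the unmounted
# filesystem (filenames are obfuscated by default) to share with the list:
xfs_metadump -g /dev/mapper/mpathX /tmp/mpathX.metadump
gzip /tmp/mpathX.metadump
</pre>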
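<div><br></div><div>And for ruling out multipath problems before digging further into XFS, as suggested in the quoted thread, a few basic checks on the paths behind the hung device; mpathX is again a placeholder for the actual map name.</div><pre style="font-family:monospace,Courier;padding:1em;border:1px dashed rgb(47,111,171);color:rgb(0,0,0);line-height:1.1em;font-size:12.7px;background-color:rgb(249,249,249)"># Show path and path-group states for every multipath map; failed/faulty
# paths, or a map with all paths down, would point at the storage layer:
multipath -ll

# Device-mapper status for the suspect map:
dmsetup status mpathX

# Look for SCSI, device-mapper, or multipath errors around the hang:
dmesg | tail -n 200
grep -iE 'multipath|mpath|i/o error' /var/log/messages | tail -n 50
</pre>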