[XFS] Any process to a particular XFS device hung in D state forever.
Hugo Kuo
hugo at swiftstack.com
Tue Apr 19 08:24:55 CDT 2016
Hi Brian,
Here's a gist that includes the sysrq-trigger output and an strace of one of
the hanging $ls runs. This is from another problematic disk (d817) on the
same server.
https://gist.github.com/HugoKuo/8eb8208bbb7a7f562a6c9a3eafa8f37f
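For reference, tasks in D state can also be listed straight from /proc, without going through sysrq. The following is only a rough Python sketch and is not taken from the gist. It assumes Python 3 on Linux and the usual /proc/<pid>/stat layout, and the helper names parse_stat and blocked_tasks are made up for illustration. It reports only the pid and command name, so it does not replace the kernel stacks that "echo w > /proc/sysrq-trigger" prints.

```python
import glob


def parse_stat(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state).

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so the split is done on the last ')'.
    """
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks():
    """Return (pid, comm) for every process currently in state 'D'."""
    tasks = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if state == "D":
            tasks.append((pid, comm))
    return tasks
```

Calling blocked_tasks() while the disk is unresponsive should list the stuck $ls along with any daemon processes blocked on the same device.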
It looks like the hanging $ls is stuck getting an extended attribute of a
file on this disk. The full output can be found in the link above.
lstat("/srv/node/d864/tmp/tmpIRYFaW", {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
capget(0x20080522, 0, NULL) = -1 EFAULT (Bad address)
getxattr("/srv/node/d864/tmp/tmpIRYFaW", "security.capability"
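One way to check whether the getxattr call itself hangs, independently of ls, is to issue it from a throwaway child process and stop waiting after a timeout. This is a rough Python sketch, assuming Python 3 on Linux (os.getxattr is Linux-only); probe_xattr is a hypothetical helper name. A child stuck in D state cannot be killed, so the parent deliberately neither kills nor reaps it and simply reports "hung".

```python
import subprocess
import sys

# Code run in the child: exit 0 if the xattr was read, 2 if it is simply
# not set (ENODATA), 3 for any other OSError.
_CHILD_CODE = (
    "import errno, os, sys\n"
    "try:\n"
    "    os.getxattr(sys.argv[1], sys.argv[2], follow_symlinks=False)\n"
    "except OSError as e:\n"
    "    sys.exit(2 if e.errno == errno.ENODATA else 3)\n"
)


def probe_xattr(path, name="security.capability", timeout=5.0):
    """Return 'ok', 'absent', 'error', or 'hung' for one getxattr call."""
    child = subprocess.Popen([sys.executable, "-c", _CHILD_CODE, path, name])
    try:
        rc = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Do not kill or wait again: a task in D state ignores signals,
        # and waiting for it would block this process as well.
        return "hung"
    return {0: "ok", 2: "absent"}.get(rc, "error")
```

For the file in the strace above this would be probe_xattr("/srv/node/d864/tmp/tmpIRYFaW"); a "hung" result would point at the xattr read rather than at ls.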
As for the xfs_repair output at
https://gist.github.com/HugoKuo/76f65bdc0b860ca6ed5e786f8c43da0e : you
asked whether the node had been force rebooted. The answer is NO. I *haven't
rebooted* this server yet. I force unmounted it via *$umount -l <dev>* and
then ran xfs_repair.
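Regarding the $ls behaviour, these are the two commands. The first completes
within 1 sec and the second hangs:
$ls /srv/node/d864/tmp > test.d864
$ls /srv/node/d864/tmp
Here are the contents of test.d864: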
https://gist.github.com/HugoKuo/25f93cd6daf5b0666a2ab85defd63a56
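A possible explanation for why the redirected $ls returns while the plain one hangs, stated as an assumption rather than something confirmed on this server: if $ls is aliased to "ls --color=auto", colour is enabled only when stdout is a terminal, and with colour enabled GNU ls looks up file capabilities, which matches the getxattr("security.capability") call in the strace. The Python sketch below compares the two modes under a timeout. It assumes Python 3 and GNU ls, and ls_completes is a hypothetical helper name.

```python
import subprocess


def ls_completes(directory, color, timeout=10.0):
    """Return True if `ls --color=<color> directory` exits within timeout.

    Only completion is checked; the exit status of ls is ignored.
    """
    child = subprocess.Popen(
        ["ls", "--color=" + color, directory],
        stdout=subprocess.DEVNULL,  # discard output so ls cannot block on a pipe
        stderr=subprocess.DEVNULL,
    )
    try:
        child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # still running, possibly stuck in D state
    return True
```

If ls_completes("/srv/node/d864/tmp", "never") returns True while ls_completes("/srv/node/d864/tmp", "always") returns False, that would support the colour/capability explanation. Each hung run leaves one more unkillable ls behind, so it is worth running sparingly.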
Thanks // Hugo
On Tue, Apr 19, 2016 at 7:30 PM, Brian Foster <bfoster at redhat.com> wrote:
> On Tue, Apr 19, 2016 at 05:56:19PM +0800, Hugo Kuo wrote:
> > Hi XFS team,
> >
> > We encountered a problem frequently in past three weeks. Our daemons store
> > data to XFS partition associate with xattr.
> >
> > Disk seems not responding since all processes to this disk in D state and
> > can't be killed at all.
> >
> > - It happens on several disks. I feel it's randomly.
> > - Reboot seems solve the problem temporarily.
> > - All disks are multipath devices.
> >
> >
> > I suspected that's an issue from disk corrupted at beginning. But smartctl
> > doesn't show any clue about disk bad. And reboot makes the problem gone
> > away.
> >
> >
> > - Any process to this disk is blocked. Even a simple $ls . Kernel log
> > <https://gist.github.com/HugoKuo/f87748786b26ea04fd9e1d86d9538293>
>
> Looks like it's waiting on an AGF buffer. The buffer could be held by
> something else, but we don't have enough information from that one
> trace. Could you get all of the blocked tasks when in this state (e.g.,
> "echo w > /proc/sysrq-trigger")?
>
>
> > - I tested the disk by read bytes on block via $dd . It works fine
> > without any error in dmesg.
> > - The `xfs_repair -n` output of a problematic mount point [xfs_repair -n]
> > <https://gist.github.com/HugoKuo/76f65bdc0b860ca6ed5e786f8c43da0e> . It
> > is still processing.
>
> I presume this was run after a forced reboot..? If so, was the
> filesystem remounted first to replay the log (xfs_repair -n doesn't
> detect/warn about a dirty log, iirc). If the log was dirty, then repair
> is a bit less interesting simply because some corruption is to be
> expected in that scenario.
>
> > - Kernel : Linux node9 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov 10
> > 18:01:38 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> > - OS : CentOS release 6.5 (Final)
> > - XFS : xfsprogs.x86_64 3.1.1-14.el6
> >
> >
> > There's an interesting behaviour of $ls command.
> >
> > * This is completed in 1sec. Very quick and give me the result in the
> > test.d864 file $ls /srv/node/d864/tmp > test.d864
> > * This is hanging $ls /srv/node/d864/tmp
> >
>
> I'm not following you here. Are you missing an attachment (test.d864)?
>
> Brian
>
> > [image: Inline image 1]
> >
> > I suspect there's something wrong with imap. Is there a known bug ?
> >
> > Thanks // Hugo
>
>
>