Re: XFS appears to cause strange hang with md raid1 on reboot

To: <david@xxxxxxxxxxxxx>
Subject: Re: XFS appears to cause strange hang with md raid1 on reboot
From: "Tom" <storm9c1@xxxxxxxxxxxx>
Date: Tue, 5 Feb 2013 18:05:20 -0500 (EST)
Cc: <storm9c1@xxxxxxxxxxxx>, <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
Importance: Normal
In-reply-to: <20130205213206.GP2667@dastard>
References: <32271.> <20130129151833.GF27055@xxxxxxx> <42720.> <20130130234650.GE32297@xxxxxxxxxxxxxxxxxx> <45702.> <20130204125510.GL2667@dastard> <11083.> <20130205213206.GP2667@dastard>
In a previous message, Dave Chinner wrote:
>> Any suggestions on how I would debug this?
> Find out if the unmount is returning an error first. If there is no
> error, then you need to find what is doing bind mounts on your
> system and make sure they are unmounted properly before the final
> unmount is done. If lazy unmount is being done, make it a normal
> unmount and see where the unmount is getting stuck or taking time to
> complete by using sysrq-w if it gets delayed for any length of time.

I agree.  However, I may be at the mercy of the upstream vendor.
This is a completely stock CentOS system with minimal packages
installed (save using XFS as the root fs).  Oracle Enterprise Linux
and Scientific Linux also suffer from the same problem.
It's curious because the upstream vendor must be doing something
at shutdown to trigger this.  I will get my hands dirty tonight and
see how the system performs these umounts at shutdown.
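
As a starting point for spotting candidate bind mounts, here is a minimal
sketch that groups /proc/mounts entries by device.  It is not from this
thread: the helper names are hypothetical, and it assumes a Python 3
interpreter, which is not the stock one on CentOS 5.  A device listed at
more than one mount point is only a hint of a bind mount, not proof.

```python
from collections import defaultdict


def parse_mounts(text):
    """Parse /proc/mounts-style text into (device, mountpoint, fstype) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        entries.append((fields[0], fields[1], fields[2]))
    return entries


def repeated_devices(entries):
    """Return {device: [mountpoints]} for block devices mounted more than once.

    Heuristic only: pseudo filesystems (proc, sysfs, ...) are ignored by
    keeping just the devices whose name starts with "/".
    """
    by_device = defaultdict(list)
    for device, mountpoint, _fstype in entries:
        if device.startswith("/"):
            by_device[device].append(mountpoint)
    return {dev: mps for dev, mps in by_device.items() if len(mps) > 1}


def report(path="/proc/mounts"):
    """Print every device that appears at more than one mount point."""
    with open(path) as f:
        repeats = repeated_devices(parse_mounts(f.read()))
    for device, mountpoints in sorted(repeats.items()):
        print(device, "->", ", ".join(mountpoints))
```

Calling report() just before the final umount would list the devices that
still appear at more than one mount point at that moment.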

I am also prepared to blacklist the entire CentOS/RHEL 5.9 release, since
I've been struggling with this for a few weeks now.  I am getting
close to giving up and moving on.  I mean, I ran into this during
regression testing before moving my "latest" pointer from 5.8 to 5.9.
I can easily keep latest at 5.8 and not cause impact for now (which is
why I use the version "pointer" mentality).  But I'm also worried about
no upgrade path until this is resolved.  I am stubborn about using XFS
exclusively.  Otherwise I could just deploy on ext3 and never see this.

> FWIW, because this is an old, old kernel, event tracing is not
> available, so the single most useful tool for tracking this down is not
> available...

Yeah, I know.  Welcome to the RHEL/CentOS/OEL/SL world.  That's why I use
Ubuntu for the desktop (easy to play with, easy to break, easy to fix) and
RHEL for servers (hard to break, but when it does break, watch out).
It's been a long time since I saw RHEL break this badly... I'm usually pretty
good at working around problems like this.  But this one has me stumped.

Thanks again.  I'll keep you posted.

-- Tom
