Re: Still seeing hangs in xlog_grant_log_space

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: Still seeing hangs in xlog_grant_log_space
From: Juerg Haefliger <juergh@xxxxxxxxx>
Date: Mon, 23 Apr 2012 17:33:40 +0200
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20120423143843.GN9541@dastard>
References: <CADLDEKsP4DsXf_G07ub+a-ODbrJbsiprRJUX1fJdaQ41TB7+Xg@xxxxxxxxxxxxxx> <20120423143843.GN9541@dastard>

Hi Dave,

On Mon, Apr 23, 2012 at 4:38 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Apr 23, 2012 at 02:09:53PM +0200, Juerg Haefliger wrote:
>> Hi,
>> I have a test system that I'm using to try to force an XFS filesystem
>> hang since we're encountering that problem sporadically in production
>> running a 2.6.38-8 Natty kernel. The original idea was to use this
>> system to find the patches that fix the issue but I've tried a whole
>> bunch of kernels and they all hang eventually (anywhere from 5 to 45
>> mins) with the stack trace shown below.
> If you kill the workload, does the file system recover normally?

The workload can't be killed.

>> Only an emergency flush will
>> bring the filesystem back. I tried kernels 3.0.29, 3.1.10, 3.2.15,
>> 3.3.2. From reading through the mail archives, I get the impression
>> that this should be fixed in 3.1.
> What you see is not necessarily a hang. It may just be that you've
> caused your IO subsystem to have so much IO queued up it's completely
> overwhelmed. How much RAM do you have in the machine?

When it hangs, there are zero IOs going to the disk. The machine has
100GB of RAM.

>> What makes the test system special is:
>> 1) The test partition uses 1024 block size and 576b log size.
> So you've made the log as physically small as possible on a tiny
> (9GB) filesystem. Why?

:-) Because that breaks it. Somebody on the list mentioned that he
experienced hangs with that configuration so I gave it a shot.

>> 2) The RAID controller cache is disabled.
> And you've made the storage subsystem as slow as possible. What type
> of RAID are you using, how many disks in the RAID volume, which type
> of disks, etc?

Four 2TB SAS 6Gb/s 7.2K RPM disks in a RAID10 config.

>> I can't seem to hit the problem without the above modifications.
> How on earth did you come up with this configuration?

Just plain ol' luck. I was looking for a configuration that would
allow me to reproduce the hangs and I accidentally picked a machine
with a faulty controller battery which disabled the cache.

>> For the IO workload I pre-create 8000 files with random content and
>> sizes between 1k and 128k on the test partition. Then I run a tool
>> that spawns a bunch of threads which just copy these files to a
>> different directory on the same partition.
> So, your workload also has a significant amount of parallelism and
> concurrency on a filesystem with only 4 AGs?

Yes. Excuse my ignorance but what are AGs?

>> At the same time there are
>> other threads that rename, remove and overwrite random files in the
>> destination directory keeping the file count at around 500.
> And you've added as much concurrent metadata modification as
> possible, too, which makes me wonder.....
>> Let me know what other information I can provide to pin this down.
> .... exactly what are you trying to achieve with this test?  From my
> point of view, you're doing something completely and utterly insane.
> Your filesystem config and workload are so far outside normal
> configurations and workloads that I'm not surprised you're seeing
> some kind of problem.....

No objection from my side. It's a silly configuration but it's the
only one I've found that lets me reproduce a hang at will. Here's the
deal. We see sporadic hangs in xlog_grant_log_space on production
machines. I cannot just roll out a new kernel on 1000+ production
machines, impacting I don't know how many customers, and just cross my
fingers hoping that it fixes the problem. I need to verify that the
new kernel indeed behaves better. I was hoping to use the above setup
to test a patched kernel but now all kernels up to the latest stable
one hang sooner or later. I agree that I should see problems with this
setup, but the worst I would expect is horrible performance,
certainly not a filesystem hang. I'm more than open to any suggestions
for doing the verification differently.
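
For anyone who wants to try something similar, here is a rough Python
sketch of the kind of workload described above. It is not the actual
test tool, which is not shown in this thread. All function names, the
thread counts, and the policy used to hold the destination directory
near the target file count (copiers pause at the target, mutators
remove files freely) are assumptions made for illustration only.

```python
import os
import random
import shutil
import threading
import time


def create_source_files(src_dir, count, min_size, max_size):
    """Pre-create `count` files with random content and random sizes."""
    os.makedirs(src_dir, exist_ok=True)
    for i in range(count):
        size = random.randint(min_size, max_size)
        with open(os.path.join(src_dir, f"src_{i:05d}"), "wb") as f:
            f.write(os.urandom(size))


def copier(src_dir, dst_dir, stop, target_count):
    """Copy random source files into dst_dir while it is below the target."""
    sources = os.listdir(src_dir)
    n = 0
    while not stop.is_set():
        if len(os.listdir(dst_dir)) >= target_count:
            time.sleep(0.001)  # assumed throttling policy, not from the thread
            continue
        name = f"copy_{threading.get_ident()}_{n}"
        shutil.copyfile(os.path.join(src_dir, random.choice(sources)),
                        os.path.join(dst_dir, name))
        n += 1


def mutator(dst_dir, stop, max_size):
    """Rename, remove, or overwrite random files in dst_dir."""
    n = 0
    while not stop.is_set():
        names = os.listdir(dst_dir)
        if not names:
            time.sleep(0.001)
            continue
        path = os.path.join(dst_dir, random.choice(names))
        op = random.choice(("rename", "remove", "overwrite"))
        try:
            if op == "rename":
                new = os.path.join(dst_dir, f"ren_{threading.get_ident()}_{n}")
                os.rename(path, new)
                n += 1
            elif op == "remove":
                os.remove(path)
            else:
                # No O_CREAT: fail instead of re-creating a vanished file.
                fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
                with os.fdopen(fd, "wb") as f:
                    f.write(os.urandom(random.randint(1, max_size)))
        except OSError:
            pass  # another thread renamed or removed the file first


def run_workload(root, duration_s=60.0, n_files=8000, min_size=1024,
                 max_size=128 * 1024, copiers=8, mutators=8, target_count=500):
    """Run the copy/mutate workload under `root`; return the final file count."""
    src_dir = os.path.join(root, "src")
    dst_dir = os.path.join(root, "dst")
    create_source_files(src_dir, n_files, min_size, max_size)
    os.makedirs(dst_dir, exist_ok=True)
    stop = threading.Event()
    threads = [threading.Thread(target=copier,
                                args=(src_dir, dst_dir, stop, target_count))
               for _ in range(copiers)]
    threads += [threading.Thread(target=mutator, args=(dst_dir, stop, max_size))
                for _ in range(mutators)]
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()
    return len(os.listdir(dst_dir))
```

Calling run_workload("/mnt/test") would run with the defaults above
(the mount point is a placeholder); raise duration_s for longer runs.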

Thanks, I sure appreciate the help.


> Cheers,
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
