
Re: Still seeing hangs in xlog_grant_log_space

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: Still seeing hangs in xlog_grant_log_space
From: Juerg Haefliger <juergh@xxxxxxxxx>
Date: Tue, 24 Apr 2012 10:55:22 +0200
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20120423235840.GQ9541@dastard>
References: <CADLDEKsP4DsXf_G07ub+a-ODbrJbsiprRJUX1fJdaQ41TB7+Xg@xxxxxxxxxxxxxx> <20120423143843.GN9541@dastard> <CADLDEKvFF3FvEHVtmwdWhbM58_jrCRX+Uk9vLBg1hA8sizh5BQ@xxxxxxxxxxxxxx> <20120423235840.GQ9541@dastard>
On Tue, Apr 24, 2012 at 1:58 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Apr 23, 2012 at 05:33:40PM +0200, Juerg Haefliger wrote:
>> Hi Dave,
>>
>>
>> On Mon, Apr 23, 2012 at 4:38 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > On Mon, Apr 23, 2012 at 02:09:53PM +0200, Juerg Haefliger wrote:
>> >> Hi,
>> >>
>> >> I have a test system that I'm using to try to force an XFS filesystem
>> >> hang since we're encountering that problem sporadically in production
>> >> running a 2.6.38-8 Natty kernel. The original idea was to use this
>> >> system to find the patches that fix the issue but I've tried a whole
>> >> bunch of kernels and they all hang eventually (anywhere from 5 to 45
>> >> mins) with the stack trace shown below.
>> >
>> > If you kill the workload, does the file system recover normally?
>>
>> The workload can't be killed.
>
> OK.
>
>> >> Only an emergency flush will
>> >> bring the filesystem back. I tried kernels 3.0.29, 3.1.10, 3.2.15,
>> >> 3.3.2. From reading through the mail archives, I get the impression
>> >> that this should be fixed in 3.1.
>> >
>> > What you see is not necessarily a hang. It may just be that you've
>> > caused your IO subsystem to have so much IO queued up it's completely
>> > overwhelmed. How much RAM do you have in the machine?
>>
>> When it hangs, there are zero IOs going to the disk. The machine has
>> 100GB of RAM.
>
> Can you get an event trace across the period where the hang occurs?
>
> ....
>
>> >> I can't seem to hit the problem without the above modifications.
>> >
>> > How on earth did you come up with this configuration?
>>
>> Just plain ol' luck. I was looking for a configuration that would
>> allow me to reproduce the hangs and I accidentally picked a machine
>> with a faulty controller battery which disabled the cache.
>
> Wonderful.
>
>> >> For the IO workload I pre-create 8000 files with random content and
>> >> sizes between 1k and 128k on the test partition. Then I run a tool
>> >> that spawns a bunch of threads which just copy these files to a
>> >> different directory on the same partition.
>> >
>> > So, your workload also has a significant amount of parallelism and
>> > concurrency on a filesystem with only 4 AGs?
>>
>> Yes. Excuse my ignorance but what are AGs?
>
> Allocation groups.
>
>> >> At the same time there are
>> >> other threads that rename, remove and overwrite random files in the
>> >> destination directory keeping the file count at around 500.
>> >
>> > And you've added as much concurrent metadata modification as
>> > possible, too, which makes me wonder.....
>> >
>> >> Let me know what other information I can provide to pin this down.
>> >
>> > .... exactly what are you trying to achieve with this test?  From my
>> > point of view, you're doing something completely and utterly insane.
>> > Your filesystem config and workload are so far outside normal
>> > configurations and workloads that I'm not surprised you're seeing
>> > some kind of problem.....
>>
>> No objection from my side. It's a silly configuration but it's the
>> only one I've found that lets me reproduce a hang at will.
>
> Ok, that's fair enough - it's handy to tell us that up front,
> though.  ;)

Ah, sorry for not being clear enough. I thought my intentions could be
deduced from the information that I provided :-)
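
For reference, the workload quoted above could be approximated roughly
as follows. This is only a sketch, not the actual tool: the function
names, the thread counts, and the way the destination file count is held
near the target are all assumptions made for illustration.

```python
import os
import random
import shutil
import threading
import time


def create_source_files(src_dir, count=8000, min_size=1024, max_size=128 * 1024):
    """Pre-create `count` files with random content, sized 1k to 128k."""
    os.makedirs(src_dir, exist_ok=True)
    for i in range(count):
        size = random.randint(min_size, max_size)
        with open(os.path.join(src_dir, f"src_{i:05d}"), "wb") as f:
            f.write(os.urandom(size))


def copier(src_dir, dst_dir, stop, target):
    """Copy random source files while dst_dir holds fewer than `target` files."""
    names = os.listdir(src_dir)
    while not stop.is_set():
        if len(os.listdir(dst_dir)) >= target:
            time.sleep(0.001)  # destination is full enough, back off briefly
            continue
        name = random.choice(names)
        dst = os.path.join(dst_dir, f"{name}.{random.getrandbits(32):08x}")
        try:
            shutil.copyfile(os.path.join(src_dir, name), dst)
        except OSError:
            pass  # ignore transient errors in this sketch


def mutator(src_dir, dst_dir, stop):
    """Rename, remove or overwrite random files in the destination directory."""
    names = os.listdir(src_dir)
    while not stop.is_set():
        entries = os.listdir(dst_dir)
        if not entries:
            time.sleep(0.001)
            continue
        victim = os.path.join(dst_dir, random.choice(entries))
        op = random.choice(("rename", "remove", "overwrite"))
        try:
            if op == "rename":
                new = os.path.join(dst_dir, f"renamed.{random.getrandbits(32):08x}")
                os.rename(victim, new)
            elif op == "remove":
                os.remove(victim)
            else:
                shutil.copyfile(os.path.join(src_dir, random.choice(names)), victim)
        except OSError:
            pass  # another thread renamed or removed the file first


def run_workload(root, files=8000, copiers=8, mutators=4, seconds=60.0, target=500):
    """Run the workload under `root`; return the final destination file count."""
    src, dst = os.path.join(root, "src"), os.path.join(root, "dst")
    create_source_files(src, files)
    os.makedirs(dst, exist_ok=True)
    stop = threading.Event()
    threads = [threading.Thread(target=copier, args=(src, dst, stop, target))
               for _ in range(copiers)]
    threads += [threading.Thread(target=mutator, args=(src, dst, stop))
                for _ in range(mutators)]
    for t in threads:
        t.start()
    time.sleep(seconds)
    stop.set()
    for t in threads:
        t.join()
    return len(os.listdir(dst))
```

Calling run_workload() with a directory on the test partition as `root`
runs it for a fixed time. The destination count only stays near `target`
and can overshoot by a few files, because the threads race each other.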


> Alright, then I need all the usual information. I suspect an event
> trace is the only way I'm going to see what is happening. I just
> updated the FAQ entry, so all the necessary info for gathering a
> trace should be there now.
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

Very good. Will do. What kernel do you want me to run? I would prefer
our current production kernel (2.6.38-8-server) but I understand if
you want something newer.

...Juerg


> --
> Dave Chinner
> david@xxxxxxxxxxxxx
