On Mon, Apr 23, 2012 at 02:09:53PM +0200, Juerg Haefliger wrote:
> I have a test system that I'm using to try to force an XFS filesystem
> hang since we're encountering that problem sporadically in production
> running a 2.6.38-8 Natty kernel. The original idea was to use this
> system to find the patches that fix the issue but I've tried a whole
> bunch of kernels and they all hang eventually (anywhere from 5 to 45
> mins) with the stack trace shown below.
If you kill the workload, does the file system recover normally?
> Only an emergency flush will
> bring the filesystem back. I tried kernels 3.0.29, 3.1.10, 3.2.15,
> 3.3.2. From reading through the mail archives, I get the impression
> that this should be fixed in 3.1.
What you see is not necessarily a hang. It may just be that you've
caused your IO subsystem to have so much IO queued up that it's
completely overwhelmed. How much RAM do you have in the machine?
> What makes the test system special is:
> 1) The test partition uses 1024 block size and 576b log size.
So you've made the log as physically small as possible on a tiny
(9GB) filesystem. Why?
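To put a number on that, here is a rough back-of-the-envelope
calculation. It assumes "576b" means 576 filesystem blocks (the usual
meaning of the "b" suffix in mkfs.xfs size arguments) and that "9GB"
means 9 GiB; both are my assumptions, not something stated above.

```python
# Rough size of the log relative to the filesystem.
BLOCK_SIZE = 1024        # bytes, from "1024 block size"
LOG_BLOCKS = 576         # assumption: "576b" = 576 filesystem blocks
FS_BYTES = 9 * 1024**3   # assumption: "9GB" = 9 GiB

log_bytes = BLOCK_SIZE * LOG_BLOCKS
log_fraction = log_bytes / FS_BYTES

print(f"log size: {log_bytes} bytes ({log_bytes // 1024} KiB)")
print(f"fraction of filesystem: {log_fraction:.4%}")
```

Under those assumptions the log is well under a megabyte, which is not
much room for the transactions of a heavily concurrent metadata
workload.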
> 2) The RAID controller cache is disabled.
And you've made the storage subsystem as slow as possible. What type
of RAID are you using, how many disks in the RAID volume, which type
of disks, etc?
> I can't seem to hit the problem without the above modifications.
How on earth did you come up with this configuration?
> For the IO workload I pre-create 8000 files with random content and
> sizes between 1k and 128k on the test partition. Then I run a tool
> that spawns a bunch of threads which just copy these files to a
> different directory on the same partition.
So, your workload also has a significant amount of parallelism and
concurrency on a filesystem with only 4 AGs?
> At the same time there are
> other threads that rename, remove and overwrite random files in the
> destination directory keeping the file count at around 500.
And you've added as much concurrent metadata modification as
possible, too, which makes me wonder.....
> Let me know what other information I can provide to pin this down.
.... exactly what are you trying to achieve with this test? From my
point of view, you're doing something completely and utterly insane.
Your filesystem config and workload are so far outside normal
configurations and workloads that I'm not surprised you're seeing
some kind of problem.....
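
For anyone trying to follow along, the workload as described above
might look roughly like the sketch below. The actual test tool was not
posted, so this is only a guess at its shape: every function name,
the thread counts, the run time and the rename/overwrite mix are
hypothetical. Only the file count (8000), the size range (1k to 128k)
and the ~500 file target in the destination directory come from the
description.

```python
import os
import random
import shutil
import threading
import time


def create_source_files(src_dir, count=8000, min_size=1024, max_size=128 * 1024):
    """Pre-create `count` files with random content and random sizes."""
    os.makedirs(src_dir, exist_ok=True)
    for i in range(count):
        size = random.randint(min_size, max_size)
        with open(os.path.join(src_dir, f"file{i:05d}"), "wb") as f:
            f.write(os.urandom(size))


def copier(src_dir, dst_dir, stop):
    """Copy random source files into the destination directory until stopped."""
    names = os.listdir(src_dir)
    while not stop.is_set():
        name = random.choice(names)
        dst = os.path.join(dst_dir, f"{name}.{random.getrandbits(32):08x}")
        try:
            shutil.copyfile(os.path.join(src_dir, name), dst)
        except OSError:
            pass  # the destination was renamed or removed underneath us


def mutator(dst_dir, stop, target=500):
    """Remove, rename or overwrite random destination files.

    Files are removed whenever the directory holds more than `target`
    entries, which keeps the file count at around `target`.
    """
    while not stop.is_set():
        names = os.listdir(dst_dir)
        if not names:
            time.sleep(0.001)
            continue
        path = os.path.join(dst_dir, random.choice(names))
        try:
            if len(names) > target:
                os.remove(path)
            elif random.random() < 0.5:
                new_name = f"renamed.{random.getrandbits(32):08x}"
                os.rename(path, os.path.join(dst_dir, new_name))
            else:
                with open(path, "wb") as f:
                    f.write(os.urandom(random.randint(1024, 128 * 1024)))
        except OSError:
            pass  # another thread got to the file first


def run_workload(src_dir, dst_dir, copiers=8, mutators=4, seconds=60.0, target=500):
    """Run copier and mutator threads concurrently for `seconds` seconds."""
    os.makedirs(dst_dir, exist_ok=True)
    stop = threading.Event()
    threads = [threading.Thread(target=copier, args=(src_dir, dst_dir, stop))
               for _ in range(copiers)]
    threads += [threading.Thread(target=mutator, args=(dst_dir, stop, target))
                for _ in range(mutators)]
    for t in threads:
        t.start()
    time.sleep(seconds)
    stop.set()
    for t in threads:
        t.join()
```

With the test partition mounted at some path of your choosing, you
would call create_source_files() once on a source directory and then
run_workload() with a destination directory on the same partition.
Python threads are serialised by the interpreter for everything except
the IO itself, so this probably generates less concurrency than a
native tool would.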