xfs
[Top] [All Lists]

Re: [PATCH 0/3] xfs: allocation worker causes freelist buffer lock hang

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH 0/3] xfs: allocation worker causes freelist buffer lock hang
From: Mark Tinguely <tinguely@xxxxxxx>
Date: Thu, 27 Sep 2012 15:10:38 -0500
Cc: bpm@xxxxxxx, xfs@xxxxxxxxxxx
In-reply-to: <20120926234149.GJ29154@dastard>
References: <20120924171159.GG1140@xxxxxxx> <201209241809.q8OI94s3003323@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <20120925005632.GB23520@dastard> <5061CA48.3040202@xxxxxxx> <20120925220110.GF29154@dastard> <50630DB6.4070405@xxxxxxx> <20120926234149.GJ29154@dastard>
User-agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:9.0) Gecko/20120122 Thunderbird/9.0
On 09/26/12 18:41, Dave Chinner wrote:
On Wed, Sep 26, 2012 at 09:14:14AM -0500, Mark Tinguely wrote:
On 09/25/12 17:01, Dave Chinner wrote:
On Tue, Sep 25, 2012 at 10:14:16AM -0500, Mark Tinguely wrote:

<deletes>


As a bonus, consolidating the loops into one worker actually gives a slight
performance advantage.

Can you quantify it?

I was comparing the bonnie and iozone benchmarks outputs. I will see
if someone can enlighten me on how to quantify those numbers.

Ugh.

Don't bother. Those are two of the worst offenders in the "useless
benchmarks for regression testing" category. Yeah, they *look* like
they give decent numbers, but I've wasted so much time looking at
results from these benhmarks only to find they do basic things wrong
and give numbers that vary simple because you've made a change that
increases or decreases the CPU cache footprint of a code path.

e.g. IOZone uses the same memory buffer as the source/destination of
all it's IO, and does not touch the contents of it at all. Hence for
small IO, the buffer stays resident in the CPU caches and gives
unrealsitically high throughput results. Worse is the fact that CPU
cache residency of the buffer can change according to the kernel
code path taken, so you can get massive changes in throughput just
by changing the layout of the code without changing any logic....

IOZone can be useful if you know exactly what you are doing and
using it to test a specific code path with a specific set of
configurations. e.g. comparing ext3/4/xfs/btrfs on the same kernel
and storage is fine. However, the moment you start using it to
compare different kernels, it's a total crap shoot....

does anyone have a good benchmark XFS should use to share
performance results? A number that we can agree a series does not
degrade the filesystem..

ffsb and fio with benchmark style workload definitions, fsmark when
using total runtime rather than files/s for comparison. I also have a
bunch of hand written scripts that I've written specifically for
purpose for testing stuff like directory traversal and IO
parallelism scalability.

The problem is that common benchmarks like iozone, postmark,
bonnie++ and dbench are really just load generators that output
numbers - there are all very sensitive to many non-filesystem
changes (e.g. VM tunings) and so have to be run "just right" to
produce meaningful numbers.  That "just right" changes from kernel
to kernel, machine to machine (running the same kernel!) and
filesystem to filesystem.

Dbench is particularly sensitive to VM changes and tunings, as well as
log IO latency.  e.g.  running it on the same system on different
region of the same disk give significantly different results because
writes to the log are slower on the inner regions of disk. Indeed,
just changing the log stripe unit without changing anything else can
have a marked affect on results...

I have machines here that I can't use IOzone at all for IO sizes of
less than 4MB because the timing loops it uses don't have enough
granularity to accurately measure syscall times, and hence it
doesn't report throughputs reliably. The worst part is that in the
results output, it doesn't tell you it had problems - youhave to
look inteh log files to get that information and realise that the
entire benchmark run was compromised...

lies, damn lies, statistics and then filesystem benchmarks?! :)

I think I've said that myself many times in the past. ;)

Thank-you for the information.


I guess I don't understand what you mean by "loop on
xfs_alloc_vextent()" then.

The problem I see above is this:

thread 1                worker 1                worker 2..max
xfs_bmapi_write(userdata)

     loops here calling xfs_bmapi_alloc()

I don't think it can for userdata allocations. Metadata, yes,
userdata, no.

   xfs_bmapi_allocate(user)
     xfs_alloc_vextent(user)
       wait

                        _xfs_alloc_vextent()
                        locks AGF

                        first loop it takes the lock

                        one of the next times through the above
                         loop it cannot get a worker. deadlock here.


                        I saved the xfs_bmalloca and fs_alloc_arg when
                         allocating a buffer to verify the paths.

                                                _xfs_alloc_vextent()
                                                blocks on AGF lock

                        completes allocation

       <returns with AGF locked in transaction>

And then xfs_bmapi_allocate() increments bma->nallocs after each
successful allocation. Hence the only way we can loop there for
userdata allocations is if the nmap parameter passes in is>  1.

When it returns to xfs-bmapi_write(), it trims and updates the map,
and then this code prevents it from looping:

                 /*
                  * If we're done, stop now.  Stop when we've allocated
                  * XFS_BMAP_MAX_NMAP extents no matter what.  Otherwise
                  * the transaction may get too big.
                  */
                 if (bno>= end || n>= *nmap || bma.nallocs>= *nmap)
                         break;

Because the userdata callers use these values for *nmap:

        xfs_iomap_write_direct          nimaps = 1;
        xfs_iomap_write_allocate        nimaps = 1;
        xfs_iomap_write_unwritten       nimaps = 1;
        xfs_alloc_file_space            nimaps = 1;

So basically there will break out of the loop after a single
allocation, and hence should not be getting stuck on a second
allocation in xfs_bmapi_allocate()/xfs_alloc_vextent() with the AGF
locked from this code path.


     xfs_bmap_add_extent_hole_real
       xfs_bmap_extents_to_btree
         xfs_alloc_vextent(user)
           wait

        this does not need a worker, and since in the same

I agree that it does not *need* a worker, but it will *use* a worker
thread if args->userdata is not initialised to zero.

            transaction all locks to the AGF buffer are recursive locks.

Right, the worker won't block on the AGF lock, but if it can't get a
worker because they are all blocked on the AGF lock, then it
deadlocks.

As I said before, I cannot see any other path that will trigger a
userdata allocation with the AGF locked. So does patch 3 by itself
solve the problem?



I thought I tried it, but I will retry it when I have a free machine.


--Mark.

<Prev in Thread] Current Thread [Next in Thread>