[Top] [All Lists]

[patch 0/4] 50% faster writing to your USB drive!*

To: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Subject: [patch 0/4] 50% faster writing to your USB drive!*
From: Johannes Weiner <jweiner@xxxxxxxxxx>
Date: Tue, 20 Sep 2011 15:45:11 +0200
Cc: Mel Gorman <mgorman@xxxxxxx>, Christoph Hellwig <hch@xxxxxxxxxxxxx>, Dave Chinner <david@xxxxxxxxxxxxx>, Wu Fengguang <fengguang.wu@xxxxxxxxx>, Jan Kara <jack@xxxxxxx>, Rik van Riel <riel@xxxxxxxxxx>, Minchan Kim <minchan.kim@xxxxxxxxx>, Chris Mason <chris.mason@xxxxxxxxxx>, "Theodore Ts'o" <tytso@xxxxxxx>, Andreas Dilger <adilger.kernel@xxxxxxxxx>, xfs@xxxxxxxxxxx, linux-btrfs@xxxxxxxxxxxxxxx, linux-ext4@xxxxxxxxxxxxxxx, linux-mm@xxxxxxxxx, linux-fsdevel@xxxxxxxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx
*if you use ntfs-3g copy files larger than main memory

or: per-zone dirty limits

There have been several discussions and patches around the issue of
dirty pages being written from page reclaim, that is, they reach the
end of the LRU list before they are cleaned.

Proposed reasons for this are the divergence of dirtying age from page
cache age, on one hand, and unequal distribution of the globally
limited dirty memory across the LRU lists of different zones.

Mel's recent patches to reduce writes from reclaim, by simply skipping
over dirty pages until a certain amount of memory pressure builds up,
do help quite a bit.  But they can only deal with a limited length of
runs of dirty pages before kswapd goes to lower priority levels to
balance the zone and begins writing.

The unequal distribution of dirty memory between zones is easily
observable through the statistics in /proc/zoneinfo, but the test
results varied between filesystems.  To get an overview of where and
how often different page cache pages are created and dirtied, I hacked
together an object tracker that remembers the instantiator of a page
cache page and associates with it the paths that dirty or activate the
page, together with counters that indicate how often those operations

Btrfs, for example, appears to be activating a significant amount of
regularly written tree data with mark_page_accessed(), even with a
purely linear, page-aligned write load.  So in addition to the already
unbounded dirty memory on smaller zones, this is a divergence between
page age and dirtying age and leads to a situation where the pages
reclaimed next are not the ones that are also flushed next:

                             min|  median|     max
              xfs:         5.000|   6.500|  20.000
        fuse-ntfs:         5.000|  19.000| 275.000
             ext4:         2.000|  67.000| 810.000 
            btrfs:      2915.000|3316.500|5786.000

ext4's delalloc, on the other hand, refuses regular write attemps from
kjournald, but the write index of the inode is still advanced for
cyclic write ranges and so the pages are not even immediately written
when the inode is selected again.

I cc'd the filesystem people because it is at least conceivable that
things could be improved on their side, but I do think the problem is
mainly with the VM and needs fixing there.

This patch series implements per-zone dirty limits, derived from the
configured global dirty limits and the individual zone size, that the
page allocator uses to distribute pages allocated for writing across
the allowable zones.  Even with pages dirtied out of the inactive LRU
order this gives page reclaim a minimum number of clean pages on each
LRU so that balancing a zone should no longer require writeback in the
common case.

The previous version included code to wake the flushers and stall the
allocation on NUMA setups where the load is bound to a node that is in
itself not large enough to reach the global dirty limits, but I am
still trying to get it to work reliably and dropped it for now, the
series has merits even without it.

                        Test results

15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

                seconds                 nr_vmscan_write
                        (stddev)               min|     median|        max
vanilla:         549.747( 3.492)             0.000|      0.000|      0.000
patched:         550.996( 3.802)             0.000|      0.000|      0.000

vanilla:        1183.094(53.178)         54349.000|  59341.000|  65163.000
patched:         558.049(17.914)             0.000|      0.000|     43.000

vanilla:         573.679(14.015)        156657.000| 460178.000| 606926.000
patched:         563.365(11.368)             0.000|      0.000|   1362.000

vanilla:         561.197(15.782)             0.000|2725438.000|4143837.000
patched:         568.806(17.496)             0.000|      0.000|      0.000

Even though most filesystems already ignore the write request from
reclaim, we were reluctant in the past to remove it, as it was still
theoretically our only means to stay on top of the dirty pages on a
per-zone basis.  This patchset should get us closer to removing the
dreaded writepage call from page reclaim altogether.


<Prev in Thread] Current Thread [Next in Thread>