xfs
[Top] [All Lists]

Re: How to handle TIF_MEMDIE stalls?

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: How to handle TIF_MEMDIE stalls?
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Fri, 6 Mar 2015 19:20:55 -0500
Cc: Vlastimil Babka <vbabka@xxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>, mhocko@xxxxxxx, dchinner@xxxxxxxxxx, linux-mm@xxxxxxxxx, rientjes@xxxxxxxxxx, oleg@xxxxxxxxxx, mgorman@xxxxxxx, torvalds@xxxxxxxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20150302223154.GJ18360@dastard>
References: <20150217125315.GA14287@xxxxxxxxxxxxxxxxxxxxxx> <20150217225430.GJ4251@dastard> <20150219102431.GA15569@xxxxxxxxxxxxxxxxxxxxxx> <20150219225217.GY12722@dastard> <20150221235227.GA25079@xxxxxxxxxxxxxxxxxxxxxx> <20150223004521.GK12722@dastard> <20150222172930.6586516d.akpm@xxxxxxxxxxxxxxxxxxxx> <20150223073235.GT4251@dastard> <54F42FEA.1020404@xxxxxxx> <20150302223154.GJ18360@dastard>
On Tue, Mar 03, 2015 at 09:31:54AM +1100, Dave Chinner wrote:
> What we don't know is how many objects we might need to scan to find
> the objects we will eventually modify.  Here's an (admittedly
> extreme) example to demonstrate a worst case scenario: allocate a
> 64k data extent. Because it is an exact size allocation, we look it
> up in the by-size free space btree. Free space is fragmented, so
> there are about a million 64k free space extents in the tree.
> 
> Once we find the first 64k extent, we search them to find the best
> locality target match.  The btree records are 16 bytes each, so we
> fit roughly 500 to a 4k block. Say we search half the extents to
> find the best match - i.e. we walk a thousand leaf blocks before
> finding the match we want, and modify that leaf block.
> 
> Now, the modification removed an entry from the leaf and tht
> triggers leaf merge thresholds, so a merge with the 1002nd block
> occurs. That block now demand pages in and we then modify and join
> it to the transaction. Now we walk back up the btree to update
> indexes, merging blocks all the way back up to the root.  We have a
> worst case size btree (5 levels) and we merge at every level meaning
> we demand page another 8 btree blocks and modify them.
> 
> In this case, we've demand paged ~1010 btree blocks, but only
> modified 10 of them. i.e. the memory we consumed permanently was
> only 10 4k buffers (approx. 10 slab and 10 page allocations), but
> the allocation demand was 2 orders of magnitude more than the
> unreclaimable memory consumption of the btree modification.
> 
> I hope you start to see the scope of the problem now...

Isn't this bounded one way or another?  Sure, the inaccuracy itself is
high, but when you put the absolute numbers in perspective it really
doesn't seem to matter: with your extreme case of 3MB per transaction,
you can still run 5k+ of them in parallel on a small 16G machine.
Occupy a generous 75% of RAM with anonymous pages, and you can STILL
run over a thousand transactions concurrently.  That would seem like a
decent pipeline to keep the storage device occupied.

The level of precision that you are asking for comes with complexity
and fragility that I'm not convinced is necessary, or justified.

<Prev in Thread] Current Thread [Next in Thread>