On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote:
> On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
> > I will actively work around aanything that causes filesystem memory
> > pressure to increase the chance of oom killer invocations. The OOM
> > killer is not a solution - it is, by definition, a loose cannon and
> > so we should be reducing dependencies on it.
> Once we have a better-working alternative, sure.
Great, but first a simple request: please stop writing code and
instead start architecting a solution to the problem. i.e. we need a
design and have that documented before code gets written. If you
watched my recent LCA talk, then you'll understand what I mean
when I say: stop programming and start engineering.
> > I really don't care about the OOM Killer corner cases - it's
> > completely the wrong way line of development to be spending time on
> > and you aren't going to convince me otherwise. The OOM killer a
> > crutch used to justify having a memory allocation subsystem that
> > can't provide forward progress guarantee mechanisms to callers that
> > need it.
> We can provide this. Are all these callers able to preallocate?
Anything that allocates in transaction context (and therefor is
GFP_NOFS by definition) can preallocate at transaction reservation
time. However, preallocation is dumb, complex, CPU and memory
intensive and will have a *massive* impact on performance.
Allocating 10-100 pages to a reserve which we will almost *never
use* and then free them again *on every single transaction* is a lot
of unnecessary additional fast path overhead. Hence a "preallocate
for every context" reserve pool is not a viable solution.
And, really, "reservation" != "preallocation".
Maybe it's my filesystem background, but those to things are vastly
Reservations are simply an *accounting* of the maximum amount of a
reserve required by an operation to guarantee forwards progress. In
filesystems, we do this for log space (transactions) and some do it
for filesystem space (e.g. delayed allocation needs correct ENOSPC
detection so we don't overcommit disk space). The VM already has
such concepts (e.g. watermarks and things like min_free_kbytes) that
it uses to ensure that there are sufficient reserves for certain
types of allocations to succeed.
A reserve memory pool is no different - every time a memory reserve
occurs, a watermark is lifted to accommodate it, and the transaction
is not allowed to proceed until the amount of free memory exceeds
that watermark. The memory allocation subsystem then only allows
*allocations* marked correctly to allocate pages from that the
reserve that watermark protects. e.g. only allocations using
__GFP_RESERVE are allowed to dip into the reserve pool.
By using watermarks, freeing of memory will automatically top
up the reserve pool which means that we guarantee that reclaimable
memory allocated for demand paging during transacitons doesn't
deplete the reserve pool permanently. As a result, when there is
plenty of free and/or reclaimable memory, the reserve pool
watermarks will have almost zero impact on performance and
Further, because it's just accounting and behavioural thresholds,
this allows the mm subsystem to control how the reserve pool is
accounted internally. e.g. clean, reclaimable pages in the page
cache could serve as reserve pool pages as they can be immediately
reclaimed for allocation. This could be acheived by setting reclaim
targets first to the reserve pool watermark, then the second target
is enough pages to satisfy the current allocation.
And, FWIW, there's nothing stopping this mechanism from have order
based reserve thresholds. e.g. IB could really do with a 64k reserve
pool threshold and hence help solve the long standing problems they
have with filling the receive ring in GFP_ATOMIC context...
Sure, that's looking further down the track, but my point still
remains: we need a viable long term solution to this problem. Maybe
reservations are not the solution, but I don't see anyone else who
is thinking of how to address this architectural problem at a system
level right now. We need to design and document the model first,
then review it, then we can start working at the code level to
implement the solution we've designed.