
Re: How to handle TIF_MEMDIE stalls?

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: How to handle TIF_MEMDIE stalls?
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Sat, 28 Feb 2015 11:29:43 -0500
Cc: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>, mhocko@xxxxxxx, dchinner@xxxxxxxxxx, linux-mm@xxxxxxxxx, rientjes@xxxxxxxxxx, oleg@xxxxxxxxxx, akpm@xxxxxxxxxxxxxxxxxxxx, mgorman@xxxxxxx, torvalds@xxxxxxxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20150223004521.GK12722@dastard>
References: <201502102258.IFE09888.OVQFJOMSFtOLFH@xxxxxxxxxxxxxxxxxxx> <20150210151934.GA11212@xxxxxxxxxxxxxxxxxxxxxx> <201502111123.ICD65197.FMLOHSQJFVOtFO@xxxxxxxxxxxxxxxxxxx> <201502172123.JIE35470.QOLMVOFJSHOFFt@xxxxxxxxxxxxxxxxxxx> <20150217125315.GA14287@xxxxxxxxxxxxxxxxxxxxxx> <20150217225430.GJ4251@dastard> <20150219102431.GA15569@xxxxxxxxxxxxxxxxxxxxxx> <20150219225217.GY12722@dastard> <20150221235227.GA25079@xxxxxxxxxxxxxxxxxxxxxx> <20150223004521.GK12722@dastard>
On Mon, Feb 23, 2015 at 11:45:21AM +1100, Dave Chinner wrote:
> On Sat, Feb 21, 2015 at 06:52:27PM -0500, Johannes Weiner wrote:
> > On Fri, Feb 20, 2015 at 09:52:17AM +1100, Dave Chinner wrote:
> > > I will actively work around anything that causes filesystem memory
> > > pressure to increase the chance of oom killer invocations. The OOM
> > > killer is not a solution - it is, by definition, a loose cannon and
> > > so we should be reducing dependencies on it.
> > 
> > Once we have a better-working alternative, sure.
> Great, but first a simple request: please stop writing code and
> instead start architecting a solution to the problem. i.e. we need a
> design and have that documented before code gets written. If you
> watched my recent LCA talk, then you'll understand what I mean
> when I say: stop programming and start engineering.

This code was for the sake of argument, see below.

> > > I really don't care about the OOM Killer corner cases - it's
> > > completely the wrong line of development to be spending time on
> > > and you aren't going to convince me otherwise. The OOM killer is a
> > > crutch used to justify having a memory allocation subsystem that
> > > can't provide forward progress guarantee mechanisms to callers that
> > > need it.
> > 
> > We can provide this.  Are all these callers able to preallocate?
> Anything that allocates in transaction context (and therefore is
> GFP_NOFS by definition) can preallocate at transaction reservation
> time. However, preallocation is dumb, complex, CPU and memory
> intensive and will have a *massive* impact on performance.
> Allocating 10-100 pages to a reserve which we will almost *never
> use* and then free them again *on every single transaction* is a lot
> of unnecessary additional fast path overhead.  Hence a "preallocate
> for every context" reserve pool is not a viable solution.
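For illustration, here is a minimal userspace C sketch of the per-transaction preallocation being objected to. All names (struct txn_prealloc, txn_prealloc_fill() and so on) are hypothetical, malloc() stands in for the page allocator, and SKETCH_PAGE_SIZE is an assumed 4 KiB page. This is not XFS or kernel code. It only shows where the cost comes from: every transaction allocates its full worst case up front and frees the unused remainder at commit, even though the reserve is almost never touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096 /* assumed page size, stand-in for PAGE_SIZE */

/* Hypothetical per-transaction reserve: real memory, allocated up front. */
struct txn_prealloc {
    void **pages; /* array of preallocated page-sized buffers */
    size_t nr;    /* buffers still unused in the pool */
};

/* Commit path: free every buffer the transaction did not use.
 * Returns how many buffers were allocated for nothing. */
static size_t txn_prealloc_release(struct txn_prealloc *p)
{
    size_t unused = p->nr;

    while (p->nr > 0)
        free(p->pages[--p->nr]);
    free(p->pages);
    p->pages = NULL;
    return unused;
}

/* Reservation path: allocate the worst case immediately.
 * All or nothing: returns 0 on success, -1 after undoing a partial fill. */
static int txn_prealloc_fill(struct txn_prealloc *p, size_t count)
{
    p->nr = 0;
    p->pages = calloc(count ? count : 1, sizeof(*p->pages));
    if (p->pages == NULL)
        return -1;
    while (p->nr < count) {
        void *page = malloc(SKETCH_PAGE_SIZE);

        if (page == NULL) {
            txn_prealloc_release(p);
            return -1;
        }
        p->pages[p->nr++] = page;
    }
    return 0;
}

/* Allocation inside the transaction: cannot fail while the pool lasts.
 * The caller owns (and must eventually free) the returned buffer. */
static void *txn_prealloc_take(struct txn_prealloc *p)
{
    return p->nr > 0 ? p->pages[--p->nr] : NULL;
}
```

A transaction that fills a 64-page pool and takes one page still performs 64 allocations and 63 frees on the fast path, which is the overhead described above.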

You are missing the point of my question.  Whether we allocate right
away or make sure the memory is allocatable later on is a matter of
cost, but the logical outcome is the same.  That is not my concern
right now.

An OOM killer allows transactional allocation sites to get away
without planning ahead.  You are arguing that the OOM killer is a
cop-out on the MM side but I see it as the opposite: it puts a lot of
complexity in the allocator so that callsites can maneuver themselves
into situations where they absolutely need to get memory - or corrupt
user data - without actually making sure their needs will be covered.

If we replace __GFP_NOFAIL + OOM killer with a reserve system, we are
putting the full responsibility on the user.  Are you sure this is
going to reduce our kernel-wide error rate?

> And, really, "reservation" != "preallocation".

That's an implementation detail.  Yes, the example implementation was
dumb and heavy-handed, but a reservation system that works based on
watermarks, and considers clean cache readily allocatable, is not much
more complex than that.
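A minimal sketch of that distinction, assuming a single-threaded model in which nothing else allocates or dirties page cache: a reservation only moves a counter, and admission is checked against a watermark that counts clean cache as allocatable. The structure and function names are hypothetical and do not correspond to any kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of one memory zone, in pages. */
struct mem_model {
    long free_pages;    /* allocatable right now */
    long clean_cache;   /* clean page cache: droppable without I/O */
    long min_watermark; /* floor that reservations may not eat into */
    long reserved;      /* promised to outstanding reservations */
};

/* Pages that could still be promised: free plus clean cache, minus the
 * watermark and whatever is already promised. */
static long mem_available(const struct mem_model *m)
{
    return m->free_pages + m->clean_cache - m->min_watermark - m->reserved;
}

/* Reservation: admission check plus a counter update. No page moves,
 * so an unused reservation costs almost nothing. */
static bool mem_reserve(struct mem_model *m, long nr)
{
    if (nr < 0 || mem_available(m) < nr)
        return false;
    m->reserved += nr;
    return true;
}

/* Allocate one page against a reservation. Admission guaranteed
 * free_pages + clean_cache >= min_watermark + reserved, so this cannot
 * fail: use a free page above the watermark, else drop one clean cache
 * page and hand that to the caller. */
static void mem_alloc_reserved(struct mem_model *m)
{
    assert(m->reserved > 0);
    if (m->free_pages > m->min_watermark) {
        m->free_pages--;
    } else {
        assert(m->clean_cache > 0);
        m->clean_cache--;
    }
    m->reserved--;
}

/* Commit: return the part of the reservation that was not used. */
static void mem_unreserve(struct mem_model *m, long unused)
{
    assert(unused >= 0 && unused <= m->reserved);
    m->reserved -= unused;
}
```

Because a reservation is admitted only while free pages plus clean cache stay at or above min_watermark plus everything reserved, each later allocation against it is satisfied either by a free page or by dropping one clean page. Keeping that invariant true while other tasks allocate and dirty cache concurrently is the part this sketch deliberately leaves out.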

I'm trying to figure out if the current nofail allocators can get
their memory needs figured out beforehand.  And reliably so - what
good are estimates that are right 90% of the time, when failing the
allocation means corrupting user data?  What is the contingency plan?
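To make the question concrete, here is a hypothetical C sketch (all names invented, userspace only) of the two places a reservation-based transaction can fail. Failing in txn_begin() is cheap because nothing has been modified yet, so the caller can simply return -ENOMEM. A false return from txn_alloc() means the worst-case estimate was wrong partway through the transaction, which is exactly the case that needs a contingency plan.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical pool of pages available for reservation. */
struct pool {
    long avail;
};

/* Hypothetical transaction handle. */
struct txn {
    struct pool *pool;
    long reserved; /* worst-case estimate taken at begin */
    long used;     /* pages actually consumed so far */
};

/* Reserve the worst case before any metadata is modified. Failing here
 * is harmless: nothing has been dirtied yet. */
static int txn_begin(struct txn *t, struct pool *p, long worst_case)
{
    if (p->avail < worst_case)
        return -ENOMEM;
    p->avail -= worst_case;
    t->pool = p;
    t->reserved = worst_case;
    t->used = 0;
    return 0;
}

/* Consume one reserved page. Returns false once the estimate is
 * exhausted: the transaction is already half-done at this point. */
static bool txn_alloc(struct txn *t)
{
    if (t->used >= t->reserved)
        return false;
    t->used++;
    return true;
}

/* Commit: hand the unused part of the reservation back to the pool. */
static void txn_commit(struct txn *t)
{
    t->pool->avail += t->reserved - t->used;
    t->reserved = t->used;
}
```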
