Re: How to handle TIF_MEMDIE stalls?

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: How to handle TIF_MEMDIE stalls?
From: Theodore Ts'o <tytso@xxxxxxx>
Date: Sat, 28 Feb 2015 10:17:48 -0500
Cc: Vlastimil Babka <vbabka@xxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Johannes Weiner <hannes@xxxxxxxxxxx>, Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>, mhocko@xxxxxxx, dchinner@xxxxxxxxxx, linux-mm@xxxxxxxxx, rientjes@xxxxxxxxxx, oleg@xxxxxxxxxx, mgorman@xxxxxxx, torvalds@xxxxxxxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx
In-reply-to: <20150228000359.GL4251@dastard>
References: <20150217125315.GA14287@xxxxxxxxxxxxxxxxxxxxxx> <20150217225430.GJ4251@dastard> <20150219102431.GA15569@xxxxxxxxxxxxxxxxxxxxxx> <20150219225217.GY12722@dastard> <20150221235227.GA25079@xxxxxxxxxxxxxxxxxxxxxx> <20150223004521.GK12722@dastard> <20150222172930.6586516d.akpm@xxxxxxxxxxxxxxxxxxxx> <20150223073235.GT4251@dastard> <54F0B662.8020508@xxxxxxx> <20150228000359.GL4251@dastard>
User-agent: Mutt/1.5.23 (2014-03-12)
On Sat, Feb 28, 2015 at 11:03:59AM +1100, Dave Chinner wrote:
> > I think the best way is if slab could also learn to provide reserves for
> > individual objects. Either just mark internally how many of them are
> > reserved, if sufficient number is free, or translate this to the page
> > allocator reserves, as slab knows which order it uses for the given
> > objects.
> Which is effectively what a slab based mempool is. Mempools don't
> guarantee a reserve is available once it's been resized, however,
> and we'd have to have mempools configured for every type of
> allocation we are going to do. So from that perspective it's not
> really a solution.

The bigger problem is that the upper layer which is making the
reservation before it starts taking locks won't necessarily know
exactly which slab objects it and all of the lower layers might need.

So it's much more flexible, and requires less accuracy, if we can just
(a) request that the mm subsystem reserve at least N pages, and (b)
tell it that at this point in time, it's safe for the requesting
subsystem to block until N pages are available.
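
As a rough illustration of (a) and (b), here is a minimal userspace
sketch in C. Every name in it (page_pool, reservation, reserve_pages,
alloc_reserved_page, release_reservation) is hypothetical and none of
it is an existing kernel interface. It only models the bookkeeping:
reserve up front while it is still safe to wait, draw from the
reservation once locks are held, and hand back whatever was not used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct page_pool {
    size_t free_pages;      /* pages not yet promised to anyone */
    size_t reserved_pages;  /* pages promised to reservation holders */
};

struct reservation {
    size_t pages_left;      /* pages this holder may still draw */
};

/*
 * Try to set aside n pages before any locks are taken. Returns true on
 * success. A real implementation would sleep here until n pages become
 * free; this model only reports that the caller would have to wait.
 */
bool reserve_pages(struct page_pool *pool, struct reservation *res, size_t n)
{
    if (pool->free_pages < n)
        return false;
    pool->free_pages -= n;
    pool->reserved_pages += n;
    res->pages_left = n;
    return true;
}

/*
 * Draw one page from the reservation. Returns false when the estimate
 * turned out to be too low and the reservation is exhausted.
 */
bool alloc_reserved_page(struct page_pool *pool, struct reservation *res)
{
    if (res->pages_left == 0)
        return false;
    res->pages_left--;
    pool->reserved_pages--;
    return true;
}

/* Return the unused part of the reservation once the work is done. */
void release_reservation(struct page_pool *pool, struct reservation *res)
{
    assert(pool->reserved_pages >= res->pages_left);
    pool->reserved_pages -= res->pages_left;
    pool->free_pages += res->pages_left;
    res->pages_left = 0;
}
```

The point of the split is that the only place the caller can wait is
reserve_pages(), which runs before any locks are held; an exhausted
reservation is reported to the caller rather than hidden.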

Can this be guaranteed to be accurate?  No, of course not.  And in
some cases, an accurate estimate may be impossible, since it might
depend on whether the iSCSI device needs to reconnect to the target,
or do some sort of exception handling, before it can complete its I/O
request.

But it's better than what we have now, which is that once we've taken
certain locks, and/or started a complex transaction, we can't really
back out, so we end up looping, either by using GFP_NOFAIL, or by
looping around the memory allocation request ourselves if there are
still mm developers who are delusional enough to declare, like King
Canute, "You must always be able to handle memory allocation failure
at any point in the kernel, and GFP_NOFAIL is an indication of a
subsystem bug!"

I can imagine using some adjustment factors, where a particularly
voracious device might require a hint to the file system to boost its
memory allocation estimate by 30%, or 50%.  So yes, it's a very,
*very* rough estimate.  And if we guess wrong, we might end up having
to loop a la GFP_NOFAIL anyway.  But it's better than not having such
an estimate.
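
To make the adjustment factor concrete, here is a small sketch in C.
The helper name boosted_page_estimate and the percentage-based
interface are assumptions made for illustration only; nothing like
this exists in the kernel today.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: scale a file system's base page estimate by a
 * per-device boost, given in percent (30 means "reserve 30% more").
 */
size_t boosted_page_estimate(size_t base_pages, unsigned int boost_percent)
{
    /* This sketch only models boosts of up to 100%. */
    assert(boost_percent <= 100);

    /* Round up so a small nonzero boost never rounds away to nothing. */
    size_t extra = (base_pages * boost_percent + 99) / 100;

    return base_pages + extra;
}
```

With a base estimate of 100 pages, a 30% boost gives 130 pages and a
50% boost gives 150.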

I also grant that this doesn't work very well for emergency writeback,
or background writeback, where we can't and shouldn't block waiting
for enough memory to become free, since page cleaning is one of the
ways that we might be able to make memory available.  But if that's
the only problem we have, we're in good shape, since that can be
solved by either (a) doing a better job throttling memory allocations
or memory reservation requests in the first place, and/or (b) starting
the background writeback much more aggressively and earlier.

                                                - Ted