Re: How to handle TIF_MEMDIE stalls?

To: Michal Hocko <mhocko@xxxxxxx>
Subject: Re: How to handle TIF_MEMDIE stalls?
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Sat, 21 Feb 2015 10:09:10 +1100
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>, Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>, dchinner@xxxxxxxxxx, linux-mm@xxxxxxxxx, rientjes@xxxxxxxxxx, oleg@xxxxxxxxxx, akpm@xxxxxxxxxxxxxxxxxxxx, mgorman@xxxxxxx, torvalds@xxxxxxxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20150220124849.GH21248@xxxxxxxxxxxxxx>
References: <201502172123.JIE35470.QOLMVOFJSHOFFt@xxxxxxxxxxxxxxxxxxx> <20150217125315.GA14287@xxxxxxxxxxxxxxxxxxxxxx> <20150217225430.GJ4251@dastard> <20150218082502.GA4478@xxxxxxxxxxxxxx> <20150218104859.GM12722@dastard> <20150218121602.GC4478@xxxxxxxxxxxxxx> <20150219110124.GC15569@xxxxxxxxxxxxxxxxxxxxxx> <20150219122914.GH28427@xxxxxxxxxxxxxx> <20150219214356.GW12722@dastard> <20150220124849.GH21248@xxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)

On Fri, Feb 20, 2015 at 01:48:49PM +0100, Michal Hocko wrote:
> On Fri 20-02-15 08:43:56, Dave Chinner wrote:
> > On Thu, Feb 19, 2015 at 01:29:14PM +0100, Michal Hocko wrote:
> > > On Thu 19-02-15 06:01:24, Johannes Weiner wrote:
> > > [...]
> > > > Preferably, we'd get rid of all nofail allocations and replace them
> > > > with preallocated reserves.  But this is not going to happen anytime
> > > > soon, so what other option do we have than resolving this on the OOM
> > > > killer side?
> > > 
> > > As I've mentioned in another email, we might give GFP_NOFAIL allocator
> > > access to memory reserves (by giving it __GFP_HIGH).
> > 
> > Won't work when you have thousands of concurrent transactions
> > running in XFS and they are all doing GFP_NOFAIL allocations.
> Is there any bound on how many transactions can run at the same time?

Yes. As many as have reservations that fit in the available log space.

The log can be sized up to 2GB, and for filesystems larger than 4TB
it defaults to 2GB. Log space reservations depend on the operation
being done - an inode timestamp update requires about 5kB of
reservation, and a rename requires about 200kB. Hence we can easily
have thousands of active transactions, even in the worst-case
log space reservation scenarios.
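To make the arithmetic behind "thousands of active transactions" concrete, here is a small sketch that divides the maximum log size by the per-operation reservation sizes quoted above. It assumes kB means 1024 bytes and ignores any log overhead, so it is only an upper bound under those assumptions; max_concurrent() is a hypothetical helper, not XFS code.

```c
#include <assert.h>

/* Figures quoted in the mail; kB is taken as 1024 bytes (an assumption).
 * Real XFS reservations depend on filesystem geometry. */
#define LOG_SIZE_BYTES      (2ULL * 1024 * 1024 * 1024) /* 2GB maximum log */
#define TIMESTAMP_RES_BYTES (5ULL * 1024)               /* ~5kB timestamp update */
#define RENAME_RES_BYTES    (200ULL * 1024)             /* ~200kB rename */

/* Hypothetical helper: upper bound on concurrent transactions if every one
 * holds a reservation of res_bytes and together they may fill the log. */
static unsigned long long max_concurrent(unsigned long long log_bytes,
                                         unsigned long long res_bytes)
{
    assert(res_bytes > 0);
    return log_bytes / res_bytes;
}
```

With these figures, max_concurrent(LOG_SIZE_BYTES, RENAME_RES_BYTES) comes to 10485 and the timestamp-update case to 419430, so even a log full of renames leaves room for over ten thousand concurrent transactions.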

You're saying it would be insane to have hundreds or thousands of
threads doing GFP_NOFAIL allocations concurrently. Reality check:
XFS has been operating successfully under such workload conditions
in production systems for many years.

> > That's why I suggested the per-transaction reserve pool - we can use
> > that
> I am still not sure what you mean by reserve pool (API wise). How
> does it differ from pre-allocating memory before the "may not fail
> context"? Could you elaborate on it, please?

It is preallocating memory, but into a reserve pool associated with
the transaction, done as part of the transaction reservation mechanism
we already have in XFS. The allocator then allocates from that reserve
pool if an allocation would otherwise fail.
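A minimal user-space sketch of that ordering, under stated assumptions: tx_reserve, tx_reserve_init(), tx_alloc() and tx_reserve_release() are hypothetical names, not an existing XFS or kernel API, and malloc() stands in for the kernel allocator. The reserve is filled at reservation time, when failure can still be handled, and is only consumed when the normal allocation path fails inside the transaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-transaction reserve pool. */
struct tx_reserve {
    unsigned char *base;   /* preallocated backing memory */
    size_t size;           /* total bytes reserved */
    size_t used;           /* bytes handed out from the reserve so far */
};

/* Round up so every object carved from the pool stays suitably aligned. */
static size_t align_up(size_t n)
{
    size_t a = _Alignof(max_align_t);
    return (n + a - 1) / a * a;
}

/* Reservation time: this may fail, so the caller can still back out. */
static bool tx_reserve_init(struct tx_reserve *r, size_t bytes)
{
    r->base = malloc(bytes);
    r->size = r->base ? bytes : 0;
    r->used = 0;
    return r->base != NULL;
}

/* Stand-in for the normal allocator; simulate_failure models memory
 * pressure so the fallback path can be exercised. */
static void *try_normal_alloc(size_t size, bool simulate_failure)
{
    return simulate_failure ? NULL : malloc(size);
}

/* Inside the transaction: try the normal allocator first and dip into the
 * reserve only if that fails. *from_reserve tells the caller whether the
 * object must be free()d (normal path) or is released with the pool. */
static void *tx_alloc(struct tx_reserve *r, size_t size, bool simulate_failure,
                      bool *from_reserve)
{
    void *p = try_normal_alloc(size, simulate_failure);
    *from_reserve = false;
    if (p)
        return p;
    size = align_up(size);
    assert(size <= r->size - r->used);  /* reserve sized too small: a bug */
    p = r->base + r->used;
    r->used += size;
    *from_reserve = true;
    return p;
}

/* Commit or cancel: the whole reserve is released in one go. */
static void tx_reserve_release(struct tx_reserve *r)
{
    free(r->base);
    r->base = NULL;
    r->size = r->used = 0;
}
```

The design choice this illustrates is that the transaction never has to name the objects it will need up front; it only commits to a byte budget, and the fallback path turns an allocation failure into a bounded draw on that budget.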

There is no way we can preallocate specific objects before the
transaction - that's just insane, especially given the unbounded,
demand-paged object requirement. Hence the need for a "preallocated
reserve pool" that the allocator can dip into, covering the memory
we need to *allocate and can't reclaim* during the course of the
transaction.

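Why specific objects cannot be preallocated can be sketched with a toy demand-paged lookup: a buffer is only allocated on a cache miss, so the same sequence of lookups consumes a different amount of memory depending on what happens to be cached when the transaction runs. This is a hypothetical illustration, not XFS code; buf_cache, tx_budget, get_buf() and the 4096-byte BUF_SIZE are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SLOTS 8
#define BUF_SIZE    4096   /* hypothetical metadata buffer size */

/* Hypothetical direct-mapped buffer cache: one block per slot. */
struct buf_cache {
    unsigned long blkno[CACHE_SLOTS];
    bool valid[CACHE_SLOTS];
};

/* Hypothetical byte budget taken at transaction reservation time. */
struct tx_budget {
    size_t reserved;
    size_t consumed;
};

/* Look up a block; only a miss allocates, charging one buffer against the
 * budget. Which lookups miss is unknown until the transaction runs. */
static void get_buf(struct buf_cache *c, struct tx_budget *b,
                    unsigned long blkno)
{
    size_t slot = blkno % CACHE_SLOTS;
    if (c->valid[slot] && c->blkno[slot] == blkno)
        return;                            /* hit: nothing to allocate */
    assert(b->consumed + BUF_SIZE <= b->reserved);
    b->consumed += BUF_SIZE;               /* miss: drawn from the reserve */
    c->blkno[slot] = blkno;
    c->valid[slot] = true;
}
```

Looking up blocks 1, 2 and 1 against a cold cache consumes two buffers, while the same lookups against a warm cache consume none, so only a byte budget sized for the worst case covers every run.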
Dave Chinner