xfs
[Top] [All Lists]

Re: memory reclaim problems on fs usage

To: david@xxxxxxxxxxxxx
Subject: Re: memory reclaim problems on fs usage
From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Date: Fri, 13 Nov 2015 21:19:46 +0900
Cc: arekm@xxxxxxxx, htejun@xxxxxxxxx, cl@xxxxxxxxx, mhocko@xxxxxxxx, linux-mm@xxxxxxxxx, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20151112200641.GR19199@dastard>
References: <201511111719.44035.arekm@xxxxxxxx> <201511120719.EBF35970.OtSOHOVFJMFQFL@xxxxxxxxxxxxxxxxxxx> <201511120706.10739.arekm@xxxxxxxx> <56449E44.7020407@xxxxxxxxxxxxxxxxxxx> <20151112200641.GR19199@dastard>
Dave Chinner wrote:
> So why have we only scanned *176* pages* during reclaim?  On other
> OOM reports in this trace it's as low as 12.  Either that stat is
> completely wrong, or we're not doing sufficient page LRU reclaim
> scanning....
> 
> > [ 9662.234685] MemAlloc-Info: 3 stalling task, 0 dying task, 0 victim task.
> > 
> > vmstat_update() and submit_flushes() remained pending for about 110 seconds.
> > If xlog_cil_push_work() were spinning inside GFP_NOFS allocation, it should 
> > be
> > reported as MemAlloc: traces, but no such lines are recorded. I don't know 
> > why
> > xlog_cil_push_work() did not call schedule() for so long.
> 
> I'd say it is repeatedly waiting for IO completion on log buffers to
> write out the checkpoint. It's making progress, just if it's taking
> multiple second per journal IO it will take a long time to write a
> checkpoint. All the other blocked tasks in XFS inode reclaim are
> either waiting directly on IO completion or waiting for the log to
> complete a flush, so this really just looks like an overloaded IO
> subsystem to me....

The vmstat statistics can become wrong when vmstat_update() workqueue item
cannot be processed due to in-flight workqueue item not calling schedule().
If in-flight workqueue item (in this case xlog_cil_push_work()) called
schedule(), the pending vmstat_update() workqueue item will be processed
and the vmstat becomes up to dated. Like you expect that xlog_cil_push_work()
was waiting for IO completion on log buffers rather than spinning inside
GFP_NOFS allocation, what should happened is xlog_cil_push_work() called
schedule() and vmstat_update() was processed. But vmstat_update() remained
pending for about 110 seconds. That's strange...

Arkadiusz is trying http://marc.info/?l=linux-mm&m=144725782107096&w=2
which is for making sure that vmstat_update() workqueue item is processed
by changing wait_iff_congested() to call schedule(), and we are waiting
for test results.

Well, one of dependent patches "vmstat: explicitly schedule per-cpu work
on the CPU we need it to run on" might be relevant to this problem.

If http://sprunge.us/GYBb and http://sprunge.us/XWUX solve the problem
(for both with swap case and without swap case), the vmstat statistics
was wrong.

<Prev in Thread] Current Thread [Next in Thread>