Thank you for your comment, and I apologize for my delayed response.
Following your comment, I investigated the RHEL7 crash dump again to
find out why the processes doing direct memory reclaim are stuck in
shrink_inactive_list(). I found the reason: the processes and kswapd
are trying to free page cache from a zone even though the number of
inactive file pages in that zone is very small (40 pages).
kswapd moved the inactive file pages to isolated file pages in
shrink_inactive_list() in order to free them. As a result,
NR_INACTIVE_FILE was 0 and NR_ISOLATED_FILE was 40.
Therefore, no one can increase NR_INACTIVE_FILE or decrease
NR_ISOLATED_FILE, so the system hangs.
In such a situation, we should not try to free inactive file pages,
because kswapd and the direct memory reclaimers can move up to 32
inactive file pages to isolated file pages.
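
To make the state above concrete, here is a small userspace model in
Python. It is only a sketch, not kernel code. It assumes that the wait
in shrink_inactive_list() is governed by a check like upstream's
too_many_isolated() in mm/vmscan.c (more pages isolated than left on
the inactive list), and that a reclaimer isolates at most 32 pages per
pass. The function names and the ISOLATE_BATCH constant are
illustrative, not kernel identifiers.

```python
ISOLATE_BATCH = 32  # assumed per-pass isolation limit (the "up to 32 pages" above)


def isolate(inactive, isolated, batch=ISOLATE_BATCH):
    """Move up to `batch` pages from the inactive count to the isolated count."""
    moved = min(batch, inactive)
    return inactive - moved, isolated + moved


def must_wait(inactive, isolated):
    """Assumed stall condition: more pages isolated than left on the inactive list."""
    return isolated > inactive


# Start from a zone with only 40 inactive file pages, as in the crash dump.
inactive, isolated = 40, 0
inactive, isolated = isolate(inactive, isolated)  # first pass takes 32
inactive, isolated = isolate(inactive, isolated)  # second pass takes the last 8
print(inactive, isolated, must_wait(inactive, isolated))
```

Under these assumptions the model ends with inactive = 0 and
isolated = 40, the same counters as in the dump, and the wait condition
stays true until something puts the isolated pages back or frees them.
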
I also found why the problem does not happen on the upstream kernel.
It does not happen there because of the following commit.
Author: Johannes Weiner <hannes@xxxxxxxxxxx>
Date: Tue May 6 12:50:07 2014 -0700
revert "mm: vmscan: do not swap anon pages just because free+file
Thank you so much!
On Wed, 25 Jun 2014 08:05:30 +1000 Dave Chinner wrote:
On Mon, Jun 23, 2014 at 04:27:08PM +0900, Masayoshi Mizuma wrote:
(I removed CCing xfs and linux-mm. And I changed your email address
to @redhat.com because this email includes RHEL7 kernel stack traces.)
Please don't do that. There's nothing wrong with posting RHEL7 stack
traces to public lists (though I'd prefer you to reproduce this
problem on a 3.15 or 3.16-rc kernel), and breaking the thread of
discussion makes it impossible to involve the people necessary to
solve this problem.
I've re-added xfs and linux-mm to the cc list, and taken my redhat
address off it...
<snip the 3 process back traces>
[looks at sysrq-w output]
kswapd0 is blocked in shrink_inactive_list/congestion_wait().
kswapd1 is blocked waiting for log space from
kthreadd is blocked in shrink_inactive_list/congestion_wait trying
to fork another process.
xfsaild is in uninterruptible sleep, indicating that there is still
metadata to be written to push the log tail to its required target,
and it will retry again in less than 20ms.
xfslogd is not blocked, indicating the log has not deadlocked
due to lack of space.
There are lots of timestamp updates waiting for log space.
There is one kworker stuck in data IO completion on an inode lock.
There are several threads blocked on an AGF lock trying to free
The bdi writeback thread is blocked waiting for allocation.
A single xfs_alloc_wq kworker is blocked in
shrink_inactive_list/congestion_wait while trying to read in btree
blocks for transactional modification. Indicative of memory pressure
trashing the working set of cached metadata. waiting for memory
- holds agf lock, blocks unlinks
There are 113 (!) blocked sadc processes - why are there so many
stats gathering processes running? If you stop gathering stats, does
the problem go away?
There are 54 mktemp processes blocked - what is generating them?
what filesystem are they actually running on? i.e. which XFS
filesystem in the system is having log space shortages? And what is
the xfs_info output of that filesystem i.e. have you simply
oversubscribed a tiny log and so it crawls along at a very slow
All of the blocked processes are on CPUs 0-3 i.e. on node 0, which
is handled by kswapd0, which is not blocked waiting for log
space. Hmmm - what is the value of /proc/sys/vm/zone_reclaim_mode?
If it is not zero, does setting it to zero make the problem go away?
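
A quick way to check that setting is to read the sysctl file directly.
The sketch below is only an illustration: the helper name
read_vm_sysctl and its root parameter are made up for this example.
The only real interface it relies on is the
/proc/sys/vm/zone_reclaim_mode file mentioned above, which holds a
single integer.

```python
from pathlib import Path


def read_vm_sysctl(name, root="/proc/sys/vm"):
    """Return the integer stored in a single-valued vm sysctl file."""
    return int(Path(root, name).read_text().split()[0])


if __name__ == "__main__":
    try:
        mode = read_vm_sysctl("zone_reclaim_mode")
    except OSError as err:
        # The file is absent on kernels built without NUMA support.
        print("cannot read zone_reclaim_mode:", err)
    else:
        print("zone_reclaim_mode =", mode)
```

Writing 0 to the same file as root (or running
`sysctl -w vm.zone_reclaim_mode=0`) turns zone reclaim off until the
next reboot.
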
Interestingly enough, for a system under extreme memory pressure,
I don't see any processes blocked waiting for swap space or swap IO.
Do you have any swap space configured on this machine? If you
don't, does the problem go away when you add a swap device?
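
Whether swap is configured can be read from /proc/meminfo. A minimal
sketch, assuming the usual "SwapTotal: <n> kB" line format; the
function name swap_total_kb is hypothetical.

```python
def swap_total_kb(meminfo_text):
    """Return SwapTotal in kB from /proc/meminfo text, or None if the line is absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    return None


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            total = swap_total_kb(f.read())
    except OSError as err:
        print("cannot read /proc/meminfo:", err)
    else:
        # A total of 0 (or a missing line) means no swap device is active.
        print("swap configured" if total else "no swap configured")
```
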
Overall, I can't see anything that indicates that the filesystem has
actually hung. I can see it having trouble allocating the memory it
needs to make forwards progress, but the system itself is not
deadlocked. Is there any IO being issued when the system is in this
state? If there is IO being issued, then progress is being made and
the system is merely slow because of the extreme memory pressure
generated by the stress test.
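
One way to answer the "is any IO being issued" question is to compare
two snapshots of /proc/diskstats. The sketch below assumes the standard
layout (major, minor, device name, then reads completed as the first
counter and writes completed as the fifth). The function names and the
one second interval are illustrative choices, not part of any tool.

```python
import time


def completed_ios(diskstats_text):
    """Map device name -> reads completed + writes completed."""
    totals = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or unexpected lines
        # fields[2] = device name, fields[3] = reads completed,
        # fields[7] = writes completed
        totals[fields[2]] = int(fields[3]) + int(fields[7])
    return totals


def devices_with_io(before, after):
    """Return devices whose completed-IO count grew between two snapshots."""
    return sorted(dev for dev, n in after.items() if n > before.get(dev, 0))


def snapshot(path="/proc/diskstats"):
    """Read the diskstats file and return per-device completed-IO counts."""
    with open(path) as f:
        return completed_ios(f.read())


if __name__ == "__main__":
    try:
        first = snapshot()
        time.sleep(1.0)  # assumed sampling interval
        second = snapshot()
    except OSError as err:
        print("cannot read diskstats:", err)
    else:
        print("devices completing IO:", devices_with_io(first, second) or "none")
```

An empty result over several intervals suggests no IO is completing at
all, while a non-empty one suggests the system is slow rather than
hung.
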
If there is not IO being issued, does the system start making
progress again if you kill one of the memory hogs? i.e. does the
equivalent of triggering an OOM-kill make the system responsive
again? If it does, then the filesystem is not hung and the problem
is that there isn't enough free memory to allow the filesystem to do
IO and hence allow memory reclaim to make progress. In which case,
does increasing /proc/sys/vm/min_free_kbytes make the problem go