Re: xfs: two deadlock problems occur when kswapd writebacks XFS pages.

To: Masayoshi Mizuma <m.mizuma@xxxxxxxxxxxxxx>
Subject: Re: xfs: two deadlock problems occur when kswapd writebacks XFS pages.
From: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Wed, 25 Jun 2014 08:05:30 +1000
Cc: xfs@xxxxxxxxxxx, linux-mm@xxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <53A7D6CC.1040605@xxxxxxxxxxxxxx>
References: <53A0013A.1010100@xxxxxxxxxxxxxx> <20140617132609.GI9508@dastard> <53A15DC7.50001@xxxxxxxxxxxxxx> <53A7D6CC.1040605@xxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Mon, Jun 23, 2014 at 04:27:08PM +0900, Masayoshi Mizuma wrote:
> Hi Dave,
> 
> (I removed CCing xfs and linux-mm. And I changed your email address
>  to @redhat.com because this email includes RHEL7 kernel stack traces.)

Please don't do that. There's nothing wrong with posting RHEL7 stack
traces to public lists (though I'd prefer you to reproduce this
problem on a 3.15 or 3.16-rc kernel), and breaking the thread of
discussion makes it impossible to involve the people necessary to
solve this problem.

I've re-added xfs and linux-mm to the cc list, and taken my redhat
address off it...

<snip the 3 process back traces>

[looks at sysrq-w output]

kswapd0 is blocked in shrink_inactive_list/congestion_wait().

kswapd1 is blocked waiting for log space from
shrink_inactive_list().

kthreadd is blocked in shrink_inactive_list/congestion_wait trying
to fork another process.

xfsaild is in uninterruptible sleep, indicating that there is still
metadata to be written to push the log tail to its required target,
and it will retry again in less than 20ms.

xfslogd is not blocked, indicating the log has not deadlocked
due to lack of space.

There are lots of timestamp updates waiting for log space.

There is one kworker stuck in data IO completion on an inode lock.

There are several threads blocked on an AGF lock trying to free
extents.

The bdi writeback thread is blocked waiting for allocation.

A single xfs_alloc_wq kworker is blocked in
shrink_inactive_list/congestion_wait while trying to read in btree
blocks for transactional modification. This is indicative of memory
pressure trashing the working set of cached metadata. The kworker is
waiting for memory reclaim:
        - it holds the AGF lock, which blocks the unlinks

There are 113 (!) blocked sadc processes - why are there so many
stats gathering processes running? If you stop gathering stats, does
the problem go away?

There are 54 mktemp processes blocked - what is generating them?
What filesystem are they actually running on? i.e. which XFS
filesystem in the system is having log space shortages? And what is
the xfs_info output of that filesystem i.e. have you simply
oversubscribed a tiny log and so it crawls along at a very slow
pace?
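
To see which commands account for the blocked tasks without reading a
full sysrq-w dump, something along these lines may help. It is only a
rough sketch: it counts processes in uninterruptible sleep (state D)
per command name by reading /proc/<pid>/stat. The function names are
hypothetical, and it looks at processes only, not at individual
threads.

```python
from collections import Counter
from pathlib import Path


def parse_stat_line(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line.

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the first '(' and the last ')'.
    """
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def count_blocked(stat_lines):
    """Count tasks in uninterruptible sleep (state 'D') per command name."""
    counts = Counter()
    for line in stat_lines:
        _pid, comm, state = parse_stat_line(line)
        if state == "D":
            counts[comm] += 1
    return counts


def read_all_stat_lines(proc=Path("/proc")):
    """Collect stat lines for every process; tasks may exit while we scan."""
    lines = []
    if not proc.is_dir():
        return lines
    for entry in proc.iterdir():
        if entry.name.isdigit():
            try:
                lines.append((entry / "stat").read_text())
            except OSError:
                continue  # process exited or is not readable
    return lines


if __name__ == "__main__":
    for comm, n in count_blocked(read_all_stat_lines()).most_common():
        print(f"{n:5d} {comm}")
```

Running it a few times while the system is in the stalled state shows
whether the set of blocked commands is changing or static.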

All of the blocked processes are on CPUs 0-3 i.e. on node 0, which
is handled by kswapd0, which is not blocked waiting for log
space. Hmmm - what is the value of /proc/sys/vm/zone_reclaim_mode?
If it is not zero, does setting it to zero make the problem go away?
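
As a rough sketch, the current setting can be read and decoded like
this. The bit meanings are taken from the kernel's vm sysctl
documentation; the function names are hypothetical.

```python
from pathlib import Path

# Bit meanings as described in the kernel's vm sysctl documentation.
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def decode_zone_reclaim_mode(value):
    """Return descriptions of the bits set in a zone_reclaim_mode value."""
    return [text for bit, text in ZONE_RECLAIM_BITS.items() if value & bit]


def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Read the current value; returns None if the file is unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    mode = read_zone_reclaim_mode()
    if mode is None:
        print("zone_reclaim_mode not available on this system")
    elif mode == 0:
        print("zone_reclaim_mode = 0 (zone reclaim disabled)")
    else:
        print(f"zone_reclaim_mode = {mode}: "
              + ", ".join(decode_zone_reclaim_mode(mode)))
```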

Interestingly enough, for a system under extreme memory pressure,
I don't see any processes blocked waiting for swap space or swap IO.
Do you have any swap space configured on this machine?  If you
don't, does the problem go away when you add a swap device?
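
Whether swap is configured, and how much of it is in use, can be read
from the SwapTotal and SwapFree fields of /proc/meminfo. A minimal
sketch, with hypothetical function names:

```python
from pathlib import Path


def parse_meminfo(text):
    """Map /proc/meminfo field names to integer values (kB where a unit is given)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_summary(fields):
    """Return (total_kb, used_kb); total_kb == 0 means no swap is configured."""
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return total, total - free


if __name__ == "__main__":
    try:
        text = Path("/proc/meminfo").read_text()
    except OSError:
        text = ""
    total_kb, used_kb = swap_summary(parse_meminfo(text))
    if total_kb == 0:
        print("no swap configured")
    else:
        print(f"swap: {used_kb} kB used of {total_kb} kB")
```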

Overall, I can't see anything that indicates that the filesystem has
actually hung. I can see it having trouble allocating the memory it
needs to make forward progress, but the system itself is not
deadlocked. Is there any IO being issued when the system is in this
state? If there is IO being issued, then progress is being made and
the system is merely slow because of the extreme memory pressure
generated by the stress test.
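
One way to answer the IO question, sketched under the assumption that
completed reads and writes in /proc/diskstats are a good enough proxy
for IO being issued: sample the file twice and report the devices
whose counters moved. The function names and the one second interval
are hypothetical choices.

```python
import time
from pathlib import Path


def parse_diskstats(text):
    """Map device name to (reads_completed, writes_completed)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 14:
            continue  # not a full major/minor/name + 11 counter line
        stats[parts[2]] = (int(parts[3]), int(parts[7]))
    return stats


def io_delta(before, after):
    """Return {device: (reads, writes)} completed between two samples.

    Devices with no completed IO in the interval are omitted.
    """
    delta = {}
    for dev, (r1, w1) in after.items():
        r0, w0 = before.get(dev, (r1, w1))
        if (r1 - r0, w1 - w0) != (0, 0):
            delta[dev] = (r1 - r0, w1 - w0)
    return delta


def sample(path="/proc/diskstats", interval=1.0):
    """Take two samples `interval` seconds apart and return the delta."""
    first = parse_diskstats(Path(path).read_text())
    time.sleep(interval)
    second = parse_diskstats(Path(path).read_text())
    return io_delta(first, second)


if __name__ == "__main__":
    try:
        active = sample()
    except OSError:
        active = {}
    if not active:
        print("no IO completed during the sample interval")
    for dev, (reads, writes) in sorted(active.items()):
        print(f"{dev}: {reads} reads, {writes} writes completed")
```

An empty result over several samples suggests no IO is completing; any
non-zero delta on the device backing the XFS filesystem suggests
progress is still being made, however slowly.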

If there is no IO being issued, does the system start making
progress again if you kill one of the memory hogs? i.e. does the
equivalent of triggering an OOM-kill make the system responsive
again? If it does, then the filesystem is not hung and the problem
is that there isn't enough free memory to allow the filesystem to do
IO and hence allow memory reclaim to make progress. In which case,
does increasing /proc/sys/vm/min_free_kbytes make the problem go
away?

Cheers,

Dave.
-- 
Dave Chinner
dchinner@xxxxxxxxxx
