I found the problem that is causing this issue. The logic around the
threshold calculation works as expected.
I saw the problem even when there is a lot of space left and
xlog_grant_push_ail() returns with free space available.
The problem is in the way the l_reserveq and l_writeq are handled.
When we wake the processes that are sleeping on l_reserveq and l_writeq
through wake_up(), we do not remove them from the queue; we expect each
process to remove itself from the list (and we drop the lock). But,
before the woken processes get a chance to remove themselves, some other
process p1 comes in, sees that the queue is not empty and puts itself
at the end of the queue. All the woken processes then remove themselves
from the queue and move on, whereas p1 stays asleep in the queue. Any
new process that comes in also goes to the end of the queue, and all of
them get stuck.
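
To make the sequence concrete, here is a small single-threaded userspace
model of it. This is only a sketch of the ordering described above, not
the XFS code: struct waiter, struct grant_queue and every function name
are made up for illustration, and the "processes" are plain structs
stepped through by hand rather than real tasks.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 8

/* Hypothetical stand-in for a ticket sleeping on l_reserveq/l_writeq. */
struct waiter {
	bool queued;	/* still linked on the queue */
	bool awake;	/* has been woken up */
};

/* Hypothetical stand-in for the queue itself (FIFO, array backed). */
struct grant_queue {
	struct waiter *slot[MAX_WAITERS];
	size_t len;
};

static bool queue_empty(const struct grant_queue *q)
{
	return q->len == 0;
}

static void queue_add_tail(struct grant_queue *q, struct waiter *w)
{
	q->slot[q->len++] = w;
	w->queued = true;
}

static void queue_del(struct grant_queue *q, struct waiter *w)
{
	size_t i, j;

	for (i = 0; i < q->len; i++) {
		if (q->slot[i] != w)
			continue;
		for (j = i + 1; j < q->len; j++)
			q->slot[j - 1] = q->slot[j];
		q->len--;
		w->queued = false;
		return;
	}
}

/* Modelled current behaviour: wake everyone, but leave them queued. */
static void wake_all_leave_queued(struct grant_queue *q)
{
	size_t i;

	for (i = 0; i < q->len; i++)
		q->slot[i]->awake = true;
}

/* A new arrival sleeps at the tail whenever the queue is not empty. */
static void arrive(struct grant_queue *q, struct waiter *w)
{
	if (!queue_empty(q)) {
		w->awake = false;
		queue_add_tail(q, w);
	} else {
		w->awake = true;
	}
}

/* A woken waiter gets to run and unlinks itself from the queue. */
static void run_if_awake(struct grant_queue *q, struct waiter *w)
{
	if (w->awake && w->queued)
		queue_del(q, w);
}

/* Replays the sequence; returns how many waiters are queued and asleep. */
size_t simulate_race(void)
{
	struct grant_queue q = { .len = 0 };
	struct waiter w1 = {0}, w2 = {0}, p1 = {0}, p2 = {0};
	size_t i, stuck = 0;

	queue_add_tail(&q, &w1);	/* two existing sleepers */
	queue_add_tail(&q, &w2);
	wake_all_leave_queued(&q);	/* woken, but still on the queue */
	arrive(&q, &p1);		/* sees a non-empty queue, sleeps */
	run_if_awake(&q, &w1);		/* woken waiters unlink and move on */
	run_if_awake(&q, &w2);
	arrive(&q, &p2);		/* queue holds p1, so p2 sleeps too */

	for (i = 0; i < q.len; i++)
		if (!q.slot[i]->awake)
			stuck++;
	return stuck;
}
```

In this model simulate_race() ends with both p1 and p2 queued and asleep
even though nobody ahead of them is waiting any more.
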
The problem doesn't happen if there is a lot of activity, which makes
some process call xfs_log_move_tail() or push the AIL (through
xlog_grant_push_ail()). But, with no activity, all these processes are
never woken up.
IMO, the right solution is to remove the items from the list when we wake
them up. I tried the change and it works as expected.
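
The idea can be sketched in the same kind of userspace model. Again this
is an illustration under my own assumptions, not the patch: struct
ticket, struct tq and the tq_* helpers are hypothetical names. The only
point is that the waker unlinks each entry as it wakes it, so a late
arrival sees an empty queue.

```c
#include <stdbool.h>
#include <stddef.h>

#define QMAX 8

/* Hypothetical stand-in for a ticket on l_reserveq/l_writeq. */
struct ticket {
	bool queued;
	bool awake;
};

/* Hypothetical stand-in for the queue (FIFO, array backed). */
struct tq {
	struct ticket *t[QMAX];
	size_t n;
};

static void tq_sleep_on(struct tq *q, struct ticket *t)
{
	t->queued = true;
	t->awake = false;
	q->t[q->n++] = t;
}

/* Proposed behaviour: the waker unlinks each ticket as it wakes it. */
static void tq_wake_all_and_unlink(struct tq *q)
{
	size_t i;

	for (i = 0; i < q->n; i++) {
		q->t[i]->queued = false;
		q->t[i]->awake = true;
	}
	q->n = 0;
}

/* Returns true if the caller had to go to sleep on the queue. */
static bool tq_arrive(struct tq *q, struct ticket *t)
{
	if (q->n == 0) {
		t->awake = true;
		return false;
	}
	tq_sleep_on(q, t);
	return true;
}

/* Same ordering as the failing case; returns the number of stuck tickets. */
size_t simulate_fix(void)
{
	struct tq q = { .n = 0 };
	struct ticket w1 = {0}, w2 = {0}, p1 = {0};
	size_t stuck = 0;

	tq_sleep_on(&q, &w1);		/* two existing sleepers */
	tq_sleep_on(&q, &w2);
	tq_wake_all_and_unlink(&q);	/* queue is empty right away */
	if (tq_arrive(&q, &p1))		/* p1 no longer queues behind them */
		stuck++;
	return stuck;
}
```

With the entries unlinked by the waker, p1 finds the queue empty and
proceeds without sleeping, so nothing is left stranded in the model.
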
Will send the patch to the list shortly.