On Sat, Jan 10, 2015 at 10:28:15AM +1100, Dave Chinner wrote:
> process A kworker (1..N)
> queue work(allocwq)
> (work queued as no kworker
> threads available)
> execute work from xfsbuf-wq
> (blocks waiting on queued work)
> No new kworkers are started, so the queue never makes progress,
> we deadlock.
But allocwq is a separate workqueue from xfsbuf-wq and should have its
own rescuer. The work item queued by process on A is guaranteed to
make forward progress no matter what work items on xfsbuf-wq are
doing. The deadlock as depicted above cannot happen. A workqueue
with WQ_MEM_RECLAIM can deadlock iff an executing work item on the
> AFAICT, the only way we can get here is that we have N idle
> kworkers, and N+M works get queued where the allocwq work is at the
> tail end of the queue. This results in newly queued work is not
> kicking a new kworker threadi as there are idle threads, and as
> works are executed the are all for the xfsbuf-wq and blocking on the
The workqueue doesn't work that way. Forward progress guarantee is
separate for each workqueue with WQ_MEM_RECLAIM set. Each is
guaranteed to make forward progress indepdently of what others are
> We eventually get to the point where there are no more idle
> kworkers, but we still have works queued, and progress is still
> dependent the queued works completing....
> This is actually not an uncommon queuing occurrence, because we
> can get storms of end-io works queued from batched IO completion
And all these have nothing to with why the deadlock is happening.
> > Ummm, this feel pretty voodoo. In practice, it'd change the order of
> > things being executed and may make certain deadlocks unlikely enough,
> > but I don't think this can be a proper fix.
> Right, that's why Eric approached about this a few weeks ago asking
> whether it could be fixed in the workqueue code.
> As I've said before (in years gone by), we've got multiple levels of
> priority needed for executing XFS works because of lock ordering
> requirements. We *always* want the allocation workqueue work to run
There's no problem there - create a separate WQ_MEM_RECLAIM workqueue
at each priority level. Each is guaranteed to make forward progress
and as long as the dependency graph isn't looped, the whole thing is
guaranteed to make forward progress.
> before the end-io processing of the xfsbuf-wq and unwritten-wq
> because of this lock inversion, just like we we always want the
> xfsbufd to run before the unwritten-wq because unwritten extent
> conversion may block waiting for metadata buffer IO to complete, and
> we always want the the xfslog-wq works to run before all of them
> because metadata buffer IO may get blocked waiting for buffers
> pinned by the log to be unpinned for log Io completion...
I'm not really following your logic here. Are you saying that xfs is
trying to work around cyclic dependency by manipulating execution
order of specific work items?
> We solve these dependencies in a sane manner with a single high
> priority workqueue level, so we're stuck with hacking around the
> worst of the problems for the moment.
I'm afraid I'm lost. Sans bugs in the rescuer logic, each workqueue
with WQ_MEM_RECLAIM guarantees forward progress of a single work item
at any given time. If there are dependencies among the work items,
each node of the dependency graph should correspond to a separate
workqueue. As long as the dependency graph isn't cyclic, the whole is
guaranteed to make forward progress.
There no reason to play with priorities to avoid deadlock. That
doesn't make any sense to me. Priority or chained queueing, which is
a lot better way to do it, can be used to optimize queueing and
execution behavior if one set of work items are likely to wait on
another set, but that should have *NOTHING* to do with deadlock
If the dependency graph is cyclic, it will deadlock. If not, it
won't. It's as simple as that.