On Fri, Jan 09, 2015 at 01:23:10PM -0500, Tejun Heo wrote:
> Hello, Eric.
> On Fri, Jan 09, 2015 at 12:12:04PM -0600, Eric Sandeen wrote:
> > I had a case reported where a system under high stress
> > got deadlocked. A btree split was handed off to the xfs
> > allocation workqueue, and it is holding the xfs_ilock
> > exclusively. However, other xfs_end_io workers are
> > not running, because they are waiting for that lock.
> > As a result, the xfs allocation workqueue never gets
> > run, and everything grinds to a halt.
> I'm having a difficult time following the exact deadlock. Can you
> please elaborate in more detail?
	process A			kworker (1..N)
	(work queued as no kworker
					execute work from xfsbuf-wq
					(blocks waiting on queued work)
No new kworkers are started, so the queue never makes progress.
AFAICT, the only way we can get here is that we have N idle
kworkers, and N+M works get queued where the allocwq work is at the
tail end of the queue. This results in newly queued work not
kicking a new kworker thread as there are idle threads, and as
works are executed they are all for the xfsbuf-wq and blocking on the
We eventually get to the point where there are no more idle
kworkers, but we still have works queued, and progress is still
dependent on the queued works completing....
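
To make the queueing pattern above concrete, here is a minimal userspace
sketch of it. This is not kernel code and not how the workqueue
implementation is structured. All names (work_kind, alloc_work_runs,
simulate, MAX_WORKS) are hypothetical, and the model assumes exactly what
is described above: a fixed pool of workers, no new worker ever started,
and every end-io work blocking its worker until the alloc work has run.
The alloc_at_head case models the front-of-list queueing that the high
priority change in Eric's patch relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORKS 64	/* hypothetical cap on the pending list */

enum work_kind { WORK_ENDIO, WORK_ALLOC };

/*
 * Walk the pending list front to back with a fixed pool of workers.
 * Model assumption: each end-io work blocks its worker on the ilock
 * until the alloc work has run, and no new worker is ever started.
 * Returns true if the alloc work gets a worker, false otherwise.
 */
static bool alloc_work_runs(const enum work_kind *pending, size_t nr_works,
			    size_t nr_workers)
{
	size_t blocked = 0;

	for (size_t i = 0; i < nr_works; i++) {
		if (blocked == nr_workers)
			return false;	/* every worker is stuck: deadlock */
		if (pending[i] == WORK_ALLOC)
			return true;	/* lock holder progresses, waiters wake */
		blocked++;		/* end-io work blocks on the lock */
	}
	return false;			/* no alloc work was queued */
}

/* Queue nr_endio end-io works plus one alloc work at the head or the tail. */
static bool simulate(size_t nr_workers, size_t nr_endio, bool alloc_at_head)
{
	enum work_kind pending[MAX_WORKS];
	size_t n = 0;

	assert(nr_endio + 1 <= MAX_WORKS);
	if (alloc_at_head)
		pending[n++] = WORK_ALLOC;
	for (size_t i = 0; i < nr_endio; i++)
		pending[n++] = WORK_ENDIO;
	if (!alloc_at_head)
		pending[n++] = WORK_ALLOC;
	return alloc_work_runs(pending, n, nr_workers);
}
```

In this model, simulate(4, 6, false) reports that the alloc work never
runs (four workers block on end-io works queued ahead of it), while
simulate(4, 6, true) lets it run first. That is also why head insertion
only reorders things rather than removing the dependency.
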
This is actually not an uncommon queuing occurrence, because we
can get storms of end-io works queued from batched IO completion.
> > To be honest, it's not clear to me how the workqueue
> > subsystem manages this sort of thing. But in testing,
> > making the allocation workqueue high priority so that
> > it gets added to the front of the pending work list,
> > resolves the problem. We did similar things for
> > the xfs-log workqueues, for similar reasons.
> Ummm, this feel pretty voodoo. In practice, it'd change the order of
> things being executed and may make certain deadlocks unlikely enough,
> but I don't think this can be a proper fix.
Right, that's why Eric approached you about this a few weeks ago asking
whether it could be fixed in the workqueue code.
As I've said before (in years gone by), we've got multiple levels of
priority needed for executing XFS works because of lock ordering
requirements. We *always* want the allocation workqueue work to run
before the end-io processing of the xfsbuf-wq and unwritten-wq
because of this lock inversion, just like we always want the
xfsbufd to run before the unwritten-wq because unwritten extent
conversion may block waiting for metadata buffer IO to complete, and
we always want the xfslog-wq works to run before all of them
because metadata buffer IO may get blocked waiting for buffers
pinned by the log to be unpinned for log IO completion...
We can't solve these dependencies in a sane manner with a single high
priority workqueue level, so we're stuck with hacking around the
worst of the problems for the moment.
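
As a rough illustration of why one extra priority level is not enough,
the sketch below encodes the "runs before" requirements listed above as
edges and counts the longest chain, i.e. the number of distinct priority
levels a strict ordering would need. It is a userspace sketch only. The
enum and function names are hypothetical, and it assumes that "xfsbufd"
and "xfsbuf-wq" refer to the same queue, which may be a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the queues discussed above. */
enum xfs_queue { Q_LOG, Q_ALLOC, Q_BUF, Q_UNWRITTEN, NR_Q };

struct runs_before {
	enum xfs_queue first;	/* must run before ... */
	enum xfs_queue later;	/* ... this one */
};

/* Ordering requirements as described above (Q_BUF stands for xfsbufd/xfsbuf-wq). */
static const struct runs_before xfs_deps[] = {
	{ Q_ALLOC, Q_BUF },		/* allocation before buffer end-io */
	{ Q_ALLOC, Q_UNWRITTEN },	/* allocation before unwritten end-io */
	{ Q_BUF,   Q_UNWRITTEN },	/* buffer IO before unwritten conversion */
	{ Q_LOG,   Q_ALLOC },		/* log works before all of them */
	{ Q_LOG,   Q_BUF },
	{ Q_LOG,   Q_UNWRITTEN },
};
static const size_t nr_xfs_deps = sizeof(xfs_deps) / sizeof(xfs_deps[0]);

/* Length of the longest "runs before" chain, assuming the edges are acyclic. */
static int levels_needed(const struct runs_before *deps, size_t nr_deps)
{
	int depth[NR_Q];
	int max = 1;

	for (int q = 0; q < NR_Q; q++)
		depth[q] = 1;

	/* NR_Q - 1 relaxation passes are enough for an acyclic graph. */
	for (int pass = 0; pass < NR_Q - 1; pass++) {
		for (size_t i = 0; i < nr_deps; i++) {
			assert(deps[i].first < NR_Q && deps[i].later < NR_Q);
			if (depth[deps[i].later] < depth[deps[i].first] + 1)
				depth[deps[i].later] = depth[deps[i].first] + 1;
		}
	}

	for (int q = 0; q < NR_Q; q++)
		if (depth[q] > max)
			max = depth[q];
	return max;
}
```

Under these assumptions levels_needed(xfs_deps, nr_xfs_deps) comes out
at 4, against the two levels (normal and high priority) a workqueue can
be given.
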