On Mon, Oct 10, 2011 at 09:26:11AM -0400, Christoph Hellwig wrote:
> On Mon, Oct 10, 2011 at 07:55:46AM +0200, Markus Trippelsdorf wrote:
> > Wouldn't it be possible to verify that the problem also goes away with
> > this simple one liner?
> We've been through a few variants, and none fixed it while Stefan had
> to try them on production machines.
> To be honest I'm not convinced at all that a workqueue was such a good
> idea for the AIL in particular. It works extremely well for things where
> we can easily define a work item, e.g. an object that gets queued up
> and a method on it gets executed. But for the AIL we really have
> a changing target that needs more or less constant pushing, and the
> target keeps changing while executing our work. Conceptually it fits
> the idea of a thread much better, with the added benefit of not relying
> on finding a combination of workqueue flags that gets the exact
> behaviour (execution ASAP without any limits because of other items
> or required memory allocation).
> And unlike the various per-cpu threads we used to have it is only one
> thread per filesystem anyway.
I don't know xfs internals at all so I don't have too strong an
opinion at this point but don't we at least need to understand what's
going on? CPU_INTENSIVE / HIGHPRI flags shouldn't cause deadlock
unless some work items are doing busy looping waiting for another work
item to do something (busy yielding might achieve a similar effect though).
They don't change the forward progress guarantee.
The only thing which can cause a stall is the lack of MEM_RECLAIM. One
thing to be careful about is that each wq has only one rescuer, so if
two or more work items depend on each other, it might still lead to
deadlock, and they need to be served by different workqueues.
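
To make the single-rescuer point concrete, here is a toy userspace model
of the worst case, where memory pressure leaves each workqueue with only
its rescuer to execute items. This is a sketch, not kernel code: struct
work, rescuers_make_progress() and the queue indices are hypothetical
names made up for illustration, and the real rescuer logic is more
involved than this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the kernel. */
struct work {
    int wq;         /* index of the workqueue this item is queued on */
    int waits_for;  /* index of the item it blocks on, or -1 for none */
    bool done;
};

/*
 * Assume each workqueue is served by exactly one rescuer, which runs that
 * queue's items one at a time in array order. An item that blocks on an
 * unfinished item ties up its rescuer until the other item completes.
 * Returns true if every item eventually finishes, false on a stall.
 */
bool rescuers_make_progress(struct work *items, size_t n, int nr_wq)
{
    bool progress = true;

    while (progress) {
        progress = false;
        for (int q = 0; q < nr_wq; q++) {
            struct work *cur = NULL;

            /* The rescuer is stuck on the first unfinished item. */
            for (size_t i = 0; i < n; i++) {
                if (items[i].wq == q && !items[i].done) {
                    cur = &items[i];
                    break;
                }
            }
            if (!cur)
                continue;

            /* It can only finish once its dependency has finished. */
            if (cur->waits_for < 0 || items[cur->waits_for].done) {
                cur->done = true;
                progress = true;
            }
        }
    }

    for (size_t i = 0; i < n; i++)
        if (!items[i].done)
            return false;
    return true;
}
```

With item A waiting on item B and both queued on the same wq, the lone
rescuer picks up A, blocks, and never reaches B, so the function reports
a stall. Put B on a second wq and its own rescuer completes it, which
unblocks A.
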
The reasons for moving away from using kthread directly are twofold:
resources and correctness. I've gone through a number of kthread
users while auditing freezer usage recently and more than half of
them get the synchronization against kthread_stop() or freezer wrong
(to be fair, the rules are quite tricky). The problem with those bugs
is that they are really obscure race conditions and won't trigger