
Re: On-stack work item completion race? (was Re: XFS crash?)

To: Tejun Heo <tj@xxxxxxxxxx>
Subject: Re: On-stack work item completion race? (was Re: XFS crash?)
From: Austin Schuh <austin@xxxxxxxxxxxxxxxx>
Date: Tue, 24 Jun 2014 20:05:07 -0700
Cc: Dave Chinner <david@xxxxxxxxxxxxx>, xfs <xfs@xxxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20140624032521.GA12164@xxxxxxxxxxxxxx>
References: <CANGgnMb=2dYGQO4K36pQ9LEb8E4rT6S_VskLF+n=ndd0_kJr_g@xxxxxxxxxxxxxx> <CANGgnMa80WwQ8zSkL52yYegmQURVQeZiBFv41=FQXMZJ_NaEDw@xxxxxxxxxxxxxx> <20140513034647.GA5421@dastard> <CANGgnMZ0q9uE3NHj2i0SBK1d0vdKLx7QBJeFNb+YwP-5EAmejQ@xxxxxxxxxxxxxx> <20140513063943.GQ26353@dastard> <CANGgnMYn++1++UyX+D2d9GxPxtytpQJv0ThFwdxM-yX7xDWqiA@xxxxxxxxxxxxxx> <20140513090321.GR26353@dastard> <CANGgnMZqQc_NeaDpO_aX+bndmHrQ9VWo9mkfxhPBkRD-J=N6sQ@xxxxxxxxxxxxxx> <CANGgnMZ8OwzfBj5m9H7c6q2yahGhU7oFZLsJfVxnWoqZExkZmQ@xxxxxxxxxxxxxx> <20140624030240.GB9508@dastard> <20140624032521.GA12164@xxxxxxxxxxxxxx>
[Adding tglx to the cc]

On Mon, Jun 23, 2014 at 8:25 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello,
>
> On Tue, Jun 24, 2014 at 01:02:40PM +1000, Dave Chinner wrote:
> > start_flush_work() is effectively a special queue_work()
> > implementation, so if it's not safe to call complete() from the
> > workqueue as the above patch implies then this code has the same
> > problem.
> >
> > Tejun - is this "do it yourself completion" a known issue w.r.t.
> > workqueues? I can't find any documentation that says "don't do
> > that" so...?
>
> It's more complex than using flush_work() but there's nothing
> fundamentally wrong with it. A work item is completely unlinked
> before its execution starts. It's safe to free the work item once its
> work function has started, whether through kfree() or returning.
>
> One difference between flush_work() and manual completion would be
> that if the work item gets requeued, flush_work() would wait for the
> queued one to finish, but given the work item is one-shot this doesn't
> make any difference.
>
> I can see no reason why manual completion would behave differently
> from flush_work() in this case.

I went looking for a short trace in my original log to show the problem, and instead found evidence of the second problem. I still like the shorter flush_work call, but that's not my call.
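
For anyone following along, here is roughly what the two patterns being compared look like. This is only a sketch with made-up names, not the actual xfs_alloc code:

/* Sketch only, hypothetical names -- not the actual XFS code. */
#include <linux/completion.h>
#include <linux/workqueue.h>

struct alloc_args {
	struct work_struct	work;
	struct completion	*done;	/* NULL when the caller uses flush_work() */
};

static void alloc_worker(struct work_struct *work)
{
	struct alloc_args *args = container_of(work, struct alloc_args, work);

	/* ... do the allocation ... */

	/*
	 * Manual completion: once complete() is called the waiter may
	 * return and the on-stack args can disappear, so nothing may
	 * touch *args after this point.  Per Tejun above, the workqueue
	 * core is already done with the work item before the function
	 * is even entered.
	 */
	if (args->done)
		complete(args->done);
}

/* The "do it yourself" completion pattern under discussion. */
static void submit_and_wait_manual(struct workqueue_struct *wq)
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct alloc_args args = { .done = &done };

	INIT_WORK_ONSTACK(&args.work, alloc_worker);
	queue_work(wq, &args.work);
	wait_for_completion(&done);
	destroy_work_on_stack(&args.work);
}

/* The shorter flush_work() form I was referring to. */
static void submit_and_wait_flush(struct workqueue_struct *wq)
{
	struct alloc_args args = { .done = NULL };

	INIT_WORK_ONSTACK(&args.work, alloc_worker);
	queue_work(wq, &args.work);
	flush_work(&args.work);
	destroy_work_on_stack(&args.work);
}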

I did find this comment in the process_one_work function. Sounds like this could be better documented.

	/*
	 * It is permissible to free the struct work_struct from
	 * inside the function that is called from it, this we need to
	 * take into account for lockdep too.  To avoid bogus "held
	 * lock freed" warnings as well as problems when looking into
	 * work->lockdep_map, make a copy and use that here.
	 */
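
In other words, something like the following (hypothetical names, my own sketch) is explicitly allowed, and it is the same property the manual-completion code relies on:

/* Hypothetical example of the pattern that comment permits. */
#include <linux/slab.h>
#include <linux/workqueue.h>

struct async_job {
	struct work_struct	work;
	/* ... job data ... */
};

static void async_job_fn(struct work_struct *work)
{
	struct async_job *job = container_of(work, struct async_job, work);

	/* ... do the work ... */

	/*
	 * Allowed: once the work function has been entered, the
	 * workqueue core no longer touches the work_struct, so the
	 * function may free the structure that embeds it.
	 */
	kfree(job);
}

static int submit_async_job(struct workqueue_struct *wq)
{
	struct async_job *job = kzalloc(sizeof(*job), GFP_KERNEL);

	if (!job)
		return -ENOMEM;
	INIT_WORK(&job->work, async_job_fn);
	queue_work(wq, &job->work);
	return 0;
}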

> > As I understand it, what then happens is that the workqueue code
> > grabs another kworker thread and runs the next work item in its
> > queue. IOWs, work items can block, but doing that does not prevent
> > execution of other work items queued on other work queues or even on
> > the same work queue. Tejun, did I get that correct?
>
> Yes, as long as the workqueue is under its @max_active limit and has
> access to an existing kworker or can create a new one, it'll start
> executing the next work item immediately; however, the guaranteed
> level of concurrency is 1 even for WQ_RECLAIM workqueues. IOW, the
> work items queued on a workqueue must be able to make forward progress
> with a single work item if the work items are being depended upon for
> memory reclaim.
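
To restate that guarantee in code form (hypothetical names, my own sketch; xfs_alloc_wq in the real code is a WQ_MEM_RECLAIM workqueue, see below):

/* Sketch of the guarantee Tejun describes (hypothetical names). */
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *reclaim_wq;

static int __init reclaim_wq_init(void)
{
	/*
	 * WQ_MEM_RECLAIM attaches a rescuer thread, so even when no new
	 * kworker can be created under memory pressure, one work item
	 * queued here is still guaranteed to execute.  max_active = 0
	 * selects the default limit, but the guarantee is still only
	 * one item at a time.
	 */
	reclaim_wq = alloc_workqueue("reclaim_wq", WQ_MEM_RECLAIM, 0);
	if (!reclaim_wq)
		return -ENOMEM;
	return 0;
}

/*
 * Consequence: if work item A queued on reclaim_wq blocks waiting for
 * work item B queued on the same workqueue, forward progress depends
 * on a second worker being available -- exactly the handoff that
 * appears to be going wrong in the trace below.
 */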

I mentioned this to Dave when I initially started this thread, but I'm running an RT-patched kernel. I don't see forward progress. The workqueue is only used in one spot (xfs_alloc_wq), and has WQ_MEM_RECLAIM set. I started with a 10,000,000-line trace and pulled out the entries that reference the workqueue and pool leading up to the lockup.

          scp-4110  [001] ....1.3   101.184640: workqueue_queue_work: work struct=ffff8803c782d900 function=xfs_bmapi_allocate_worker workqueue=ffff88040c9f5a00 pool=ffff88042dae3fc0 req_cpu=512 cpu=1
          scp-4110  [001] ....1.3   101.184641: workqueue_activate_work: work struct ffff8803c782d900
    kworker/1:1-89  [001] ....1.1   101.184883: workqueue_nr_running: pool=ffff88042dae3fc0 nr_running=1
    kworker/1:1-89  [001] ....1..   101.184885: workqueue_execute_start: work struct ffff8803c782d900: function xfs_bmapi_allocate_worker
   irq/44-ahci-275  [001] ....1.5   101.185086: workqueue_queue_work: work struct=ffff8800ae3f01e0 function=xfs_end_io workqueue=ffff88040b459a00 pool=ffff88042dae3fc0 req_cpu=512 cpu=1
   irq/44-ahci-275  [001] ....1.5   101.185088: workqueue_activate_work: work struct ffff8800ae3f01e0
          scp-4110  [001] ....1..   101.187911: xfs_ilock: dev 8:5 ino 0xf9e flags ILOCK_EXCL caller xfs_iomap_write_allocate
          scp-4110  [001] ....1.3   101.187914: workqueue_queue_work: work struct=ffff8803c782d900 function=xfs_bmapi_allocate_worker workqueue=ffff88040c9f5a00 pool=ffff88042dae3fc0 req_cpu=512 cpu=1
          scp-4110  [001] ....1.3   101.187915: workqueue_activate_work: work struct ffff8803c782d900
          scp-4110  [001] ....1.4   101.187926: workqueue_queue_work: work struct=ffff88040a6a01c8 function=blk_delay_work workqueue=ffff88040c9f4a00 pool=ffff88042dae44c0 req_cpu=512 cpu=1
          scp-4110  [001] ....1.4   101.187926: workqueue_activate_work: work struct ffff88040a6a01c8
    kworker/1:1-89  [001] ....1..   101.187998: workqueue_execute_end: work struct ffff8803c782d900
    kworker/1:1-89  [001] ....1..   101.188000: workqueue_execute_start: work struct ffff8800ae3f01e0: function xfs_end_io
    kworker/1:1-89  [001] ....1..   101.188001: xfs_ilock: dev 8:5 ino 0xf9e flags ILOCK_EXCL caller xfs_setfilesize

The last queued work never runs. Hangcheck triggers shortly after.

I spent some more time debugging, and I am seeing that tsk_is_pi_blocked is returning 1 in sched_submit_work (kernel/sched/core.c). It looks like sched_submit_work is not detecting that the worker task is blocked on a mutex.
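
For reference, the sched_submit_work() I'm looking at is roughly the following. This is paraphrased from my RT-patched tree, not copied verbatim, so the details may differ in other versions:

/* Paraphrase of sched_submit_work() in my RT-patched kernel/sched/core.c. */
static inline void sched_submit_work(struct task_struct *tsk)
{
	if (!tsk->state || tsk_is_pi_blocked(tsk))
		return;	/* bails out here when pi_blocked_on is set */

	/*
	 * If a worker is going to sleep, notify the workqueue so it can
	 * wake another worker to maintain concurrency -- this is the
	 * step that never happens in the trace above.
	 */
	if (tsk->flags & PF_WQ_WORKER)
		wq_worker_sleeping(tsk);

	/*
	 * If we are going to sleep and have plugged IO queued, submit
	 * it to avoid deadlocks.
	 */
	if (blk_needs_flush_plug(tsk))
		blk_schedule_flush_plug(tsk);
}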

This looks very RT-related right now. I see two problems from my reading (and experimentation). The first is that the second worker isn't getting started: the tsk_is_pi_blocked check makes sched_submit_work bail out early, so the workqueue is never told that the original worker is blocked on a mutex. The second is that even when another worker does need to be scheduled because the original worker is blocked on a mutex, we need the pool lock to schedule it. The pool lock can be acquired by any CPU and is a spin_lock, which on RT is an rtmutex. If we end up on the slow path for the pool lock, we hit BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)) in task_blocks_on_rt_mutex in rtmutex.c. I'm not sure how to deal with either problem.

Hopefully I've got all my facts right... Debugging kernel code is a whole new world from userspace code.

Thanks!
Austin