
Re: Filesystem lockup with CONFIG_PREEMPT_RT

To: Richard Weinberger <richard.weinberger@xxxxxxxxx>, Austin Schuh <austin@xxxxxxxxxxxxxxxx>
Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT
From: John Blackwood <john.blackwood@xxxxxxxx>
Date: Wed, 21 May 2014 14:30:23 -0500
Cc: <linux-kernel@xxxxxxxxxxxxxxx>, <xfs@xxxxxxxxxxx>, <linux-rt-users@xxxxxxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0
> Date: Wed, 21 May 2014 03:33:49 -0400
> From: Richard Weinberger <richard.weinberger@xxxxxxxxx>
> To: Austin Schuh <austin@xxxxxxxxxxxxxxxx>
> CC: LKML <linux-kernel@xxxxxxxxxxxxxxx>, xfs <xfs@xxxxxxxxxxx>, rt-users
>    <linux-rt-users@xxxxxxxxxxxxxxx>
> Subject: Re: Filesystem lockup with CONFIG_PREEMPT_RT
>
> CC'ing RT folks
>
> On Wed, May 21, 2014 at 8:23 AM, Austin Schuh <austin@xxxxxxxxxxxxxxxx> wrote:
> > > On Tue, May 13, 2014 at 7:29 PM, Austin Schuh <austin@xxxxxxxxxxxxxxxx> wrote:
> >> >> Hi,
> >> >>
> >> >> I am observing a filesystem lockup with XFS on a CONFIG_PREEMPT_RT
> >> >> patched kernel.  I have currently only triggered it using dpkg.  Dave
> >> >> Chinner on the XFS mailing list suggested that it was a rt-kernel
> >> >> workqueue issue as opposed to a XFS problem after looking at the
> >> >> kernel messages.
> >> >>
> >> >> The only modification to the kernel besides the RT patch is that I
> >> >> have applied tglx's "genirq: Sanitize spurious interrupt detection of
> >> >> threaded irqs" patch.
> > >
> > > I upgraded to 3.14.3-rt4, and the problem still persists.
> > >
> > > I turned on event tracing and tracked it down further.  I'm able to
> > > lock it up by scping a new kernel debian package to /tmp/ on the
> > > machine.  scp is locking the inode, and then scheduling
> > > xfs_bmapi_allocate_worker in the work queue.  The work then never gets
> > > run.  The kworkers then lock up waiting for the inode lock.
> > >
> > > Here are the relevant events from the trace.  ffff8803e9f10288
> > > (blk_delay_work) gets run later on in the trace, but ffff8803b4c158d0
> > > (xfs_bmapi_allocate_worker) never does.  The kernel then warns about
> > > blocked tasks 120 seconds later.

Austin and Richard,

I'm not 100% sure that the patch below will fix your problem, but we
saw something that sounds quite similar to your issue involving the
nvidia driver and the preempt-rt patch.  The nvidia driver uses the
completion support to implement its own internal semaphore.

Some tasks were failing to ever wakeup from wait_for_completion() calls
due to a race in the underlying do_wait_for_common() routine.

This is the patch that we used to fix this issue:

------------------- -------------------

Fix a race in the PREEMPT_RT wait-for-completion simple wait code.

A wait_for_completion() waiter task can be awoken by a task calling
complete(), but fail to consume the 'done' completion resource if it
loses a race with another task calling wait_for_completion() just as
it is waking up.

In this case, the awoken task will call schedule_timeout() again
without being in the simple wait queue.

So if the awoken task is unable to claim the 'done' completion resource,
check to see if it needs to be re-inserted into the wait list before
waiting again in schedule_timeout().

Fix-by: John Blackwood <john.blackwood@xxxxxxxx>
Index: b/kernel/sched/core.c
===================================================================
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3529,11 +3529,19 @@ static inline long __sched
 do_wait_for_common(struct completion *x,
                   long (*action)(long), long timeout, int state)
 {
+       int again = 0;
+
        if (!x->done) {
                DEFINE_SWAITER(wait);

                swait_prepare_locked(&x->wait, &wait);
                do {
+                       /* Check to see if we lost race for 'done' and are
+                        * no longer in the wait list.
+                        */
+                       if (unlikely(again) && list_empty(&wait.node))
+                               swait_prepare_locked(&x->wait, &wait);
+
                        if (signal_pending_state(state, current)) {
                                timeout = -ERESTARTSYS;
                                break;
@@ -3542,6 +3550,7 @@ do_wait_for_common(struct completion *x,
                        raw_spin_unlock_irq(&x->wait.lock);
                        timeout = action(timeout);
                        raw_spin_lock_irq(&x->wait.lock);
+                       again = 1;
                } while (!x->done && timeout);
                swait_finish_locked(&x->wait, &wait);
                if (!x->done)
