xfs

To: Alex Elder <elder@xxxxxxxxxxx>
Subject: Re: BUG: workqueue leaked lock or atomic
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Wed, 19 Dec 2012 07:43:30 +1100
Cc: xfs@xxxxxxxxxxx
In-reply-to: <50D07CC2.3020508@xxxxxxxxxxx>
References: <50D07CC2.3020508@xxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Tue, Dec 18, 2012 at 08:25:06AM -0600, Alex Elder wrote:
> I was running xfstests on a 3.6-derived kernel and injecting
> some errors.  At some point a few of these surfaced as I/O
> errors, which the generic buffer code complained about.
> That's all fine (well, I think).  An example:
> 
>   Buffer I/O error on device rbd2, logical block 3072
>   Buffer I/O error on device rbd2, logical block 3073
>   ...
> 
> However, after a string of these, I got this:
> 
>   BUG: workqueue leaked lock or atomic: kworker/0:1/0x00000000/17554
>       last function: xfs_end_io+0x0/0x110 [xfs]

What are the errors leading up to this, and the full stack of the
oops?

> I haven't looked very hard at this yet because I wanted to
> see if anyone had some quick info that would avoid me going
> off in the wrong direction.
> 
> The I/O error messages are generated in two spots (sadly,
> identical error messages):
> 
>     end_buffer_write_sync()
>     end_buffer_async_write()
> 
> The workqueue leaked message comes from process_one_work(), so the
> xfs_end_io() is being called by the ioend work queue (not from
> xfs_finish_ioend_sync()).
> 
> So...  I want to report this in case it's not been seen before.

No, I haven't seen it before. Do you know what test is triggering
it? If it's direct IO, I'm wondering if it might be caused by the
nested transaction problem I recently fixed leaving an elevated
freeze count behind....

> But I'm also trying to figure out whether the problem is likely
> to lie in XFS, the generic buffer code, or the underlying
> block device code.  The latter is (of course) my assumption...
> Any useful insights or suggestions on how to proceed?

I'd start by finding out which workqueue and which work item had just
finished processing when the error occurred, e.g. whether it was an
unwritten extent conversion, a buffered IO append transaction or a
direct IO size update.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
