xfs
[Top] [All Lists]

Re: BUG: workqueue leaked lock or atomic

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: BUG: workqueue leaked lock or atomic
From: Alex Elder <elder@xxxxxxxxxxx>
Date: Wed, 19 Dec 2012 08:15:09 -0600
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20121218204330.GD15182@dastard>
References: <50D07CC2.3020508@xxxxxxxxxxx> <20121218204330.GD15182@dastard>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/17.0 Thunderbird/17.0
On 12/18/2012 02:43 PM, Dave Chinner wrote:
> On Tue, Dec 18, 2012 at 08:25:06AM -0600, Alex Elder wrote:
>> I was running xfstests on a 3.6-derived kernel and injecting
>> some errors.  At some point a few of these surfaced as I/O
>> errors, which the generic buffer code complained about.
>> That's all fine (well, I think).  An example:
>>
>>   Buffer I/O error on device rbd2, logical block 3072
>>   Buffer I/O error on device rbd2, logical block 3073
>>   ...
>>
>> However, after a string of these, I got this:
>>
>>   BUG: workqueue leaked lock or atomic: kworker/0:1/0x00000000/17554
>>       last function: xfs_end_io+0x0/0x110 [xfs]
> 
> What are the errors leading up to this, and the full stack of the
> oops?

Below is the first one.  Lots more followed, but I think it's
simply because the same workqueue task has the same lock or
preempt count imbalance left over from the original problem.

[ 6867.442761] XFS (rbd1): Mounting Filesystem
[ 6867.659093] XFS (rbd1): Ending clean mount
[ 6878.963950] XFS (rbd2): Mounting Filesystem
[ 6880.812367] XFS (rbd2): Ending clean mount
[ 6885.855893] libceph: osd0 10.214.133.37:6800 socket closed
[ 6893.441466] quiet_error: 48 callbacks suppressed
[ 6893.441471] Buffer I/O error on device rbd2, logical block 3072
[ 6893.456297] lost page write due to I/O error on rbd2
[ 6893.456410] Buffer I/O error on device rbd2, logical block 3073
[ 6893.471008] lost page write due to I/O error on rbd2
[ 6893.471013] Buffer I/O error on device rbd2, logical block 3074
[ 6893.485604] lost page write due to I/O error on rbd2
[ 6893.485608] Buffer I/O error on device rbd2, logical block 3075
[ 6893.500332] lost page write due to I/O error on rbd2
[ 6893.500337] Buffer I/O error on device rbd2, logical block 3076
[ 6893.514902] lost page write due to I/O error on rbd2
[ 6893.514906] Buffer I/O error on device rbd2, logical block 3077
[ 6893.529342] lost page write due to I/O error on rbd2
[ 6893.529346] Buffer I/O error on device rbd2, logical block 3078
[ 6893.543652] lost page write due to I/O error on rbd2
[ 6893.543656] Buffer I/O error on device rbd2, logical block 3079
[ 6893.557807] lost page write due to I/O error on rbd2
[ 6893.557812] Buffer I/O error on device rbd2, logical block 3080
[ 6893.571898] lost page write due to I/O error on rbd2
[ 6893.571902] Buffer I/O error on device rbd2, logical block 3081
[ 6893.585884] lost page write due to I/O error on rbd2
[ 6893.585942] BUG: workqueue leaked lock or atomic:
kworker/0:1/0x00000000/17554
[ 6893.611486]     last function: xfs_end_io+0x0/0x110 [xfs]
[ 6893.627968] 1 lock held by kworker/0:1/17554:
[ 6893.627973]  #0:  (sb_internal#2){.+.+.+}, at: [<ffffffffa03a232b>]
xfs_end_io+0x2b/0x110 [xfs]
[ 6893.628000] Pid: 17554, comm: kworker/0:1 Not tainted
3.6.0-ceph-00174-g2978257 #1
[ 6893.628005] Call Trace:
[ 6893.628016]  [<ffffffff810719d7>] process_one_work+0x397/0x5f0
[ 6893.628020]  [<ffffffff81071776>] ? process_one_work+0x136/0x5f0
[ 6893.628043]  [<ffffffffa03a2300>] ? xfs_destroy_ioend+0x90/0x90 [xfs]
[ 6893.628050]  [<ffffffff810735fd>] worker_thread+0x18d/0x4f0
[ 6893.628055]  [<ffffffff81073470>] ? manage_workers+0x320/0x320
[ 6893.628060]  [<ffffffff810791fe>] kthread+0xae/0xc0
[ 6893.628066]  [<ffffffff810b393d>] ? trace_hardirqs_on+0xd/0x10
[ 6893.628072]  [<ffffffff8163f3c4>] kernel_thread_helper+0x4/0x10
[ 6893.628077]  [<ffffffff816360b0>] ? retint_restore_args+0x13/0x13
[ 6893.628081]  [<ffffffff81079150>] ? flush_kthread_work+0x1a0/0x1a0
[ 6893.628085]  [<ffffffff8163f3c0>] ? gs_change+0x13/0x13


And this:
    1 lock held by kworker/0:1/17554:
     #0:  (sb_internal#2){.+.+.+}, at: [<ffffffffa03a232b>]
xfs_end_io+0x2b/0x110 [xfs]

corresponds to this call to rwsem_acquire_read():

        if (ioend->io_append_trans) {
                /*
                 * We've got freeze protection passed with the transaction.
                 * Tell lockdep about it.
                 */
                rwsem_acquire_read(       <--------

&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
                        0, 1, _THIS_IP_);
        }

So maybe it's a false report?  I don't know.

>> I haven't looked very hard at this yet because I wanted to
>> see if anyone had some quick info that would avoid me going
>> off in the wrong direction.
>>
>> The I/O error messages are generated in two spots (sadly,
>> identical error messages):
>>
>>     end_buffer_write_sync()
>>     end_buffer_async_write()
>>
>> The workqueue leaked message comes from process_one_work(), so the
>> xfs_end_io() is being called by the ioend work queue (not from
>> xfs_finish_ioend_sync()).
>>
>> So...  I want to report this in case it's not been seen before.
> 
> No, I haven't seen it before. Do you know what test is triggering
> it? If it's direct IO, I'm wondering if it might be caused by the
> nested transaction problem I recently fixed leaving an elevated
> freeze count behind....

Based on some time stamps I see, it appears it might have been
running test 137, and it may have triggered right after godown
was called.  That test just creates 1000 64KB files with 64KB
buffers, sleeps 10 seconds, and does a forced file system
shutdown.  Direct I/O is not used when creating the files.

2012-12-14T00:57:37.694297-08:00 plana78 logger: run xfstest 135
2012-12-14T00:57:42.473536-08:00 plana78 logger: run xfstest 137
2012-12-14T00:58:08.099443-08:00 plana78 godown: xfstests-induced forced
shutdown of /tmp/cephtest/scratch_mnt.ii0ZXcUyaf:
   --- here ---
2012-12-14T00:58:21.308086-08:00 plana78 logger: run xfstest 138

Here's the line in the syslog, including the full time stamp:

2012-12-14T00:58:08.429389-08:00 plana78 kernel: [ 6893.585942] BUG:
workqueue leaked lock or atomic: kworker/0:1/0x00000000/17554

As indicated above, the workqueue function that had just returned
at the time the error detected was:
[ 6893.611486]     last function: xfs_end_io+0x0/0x110 [xfs]

>> But I'm also trying to figure out whether the problem is likely
>> to lie in XFS, the generic buffer, code, or in the underlying
>> block device code.  The latter is (of course) my assumption...
>> And any useful insights or suggestions how to proceed?
> 
> I'd start by finding out what workqueue and work was just finished
> processed when the error occurs e.g. is it unwritten conversion, a
> buffered IO append transaction or a direct IO size update.

I don't have the live machine any more, and so the only evidence
I have is in the logs, unfortunately.

I'm more convinced now that I've looked a little closer that
this is an xfs problem so I hope this information helps you
confirm that.  If there's more I should do or more info I
can provide let me know.

                                        -Alex

> Cheers,
> 
> Dave.
> 

<Prev in Thread] Current Thread [Next in Thread>