On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote:
> On Thu, Mar 05, 2015 at 12:02:17PM -0500, J. Bruce Fields wrote:
> > On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote:
> > > On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote:
> > > > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote:
> > > > > Ah-hah:
> > > > >
> > > > > static void
> > > > > nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
> > > > > {
> > > > > ...
> > > > > nfsd4_cb_layout_fail(ls);
> > > > >
> > > > > That'd do it!
> > > > >
> > > > > Haven't tried to figure out why exactly that's getting called, and why
> > > > > only rarely. Some intermittent problem with the callback path, I
> > > > > guess.
> > > > >
> > > > > Anyway, I think that solves most of the mystery....
> > > >
> > > > Ooops, that was a nasty git merge error in the last rebase, see the fix
> > > > below.
> > >
> > > Thanks!
> >
> > And with that fix things look good.
> >
> > I'm still curious why the callbacks are failing. It's also logging
> > "nfsd: client 192.168.122.32 failed to respond to layout recall".
>
> I spoke too soon, I'm still not getting through my usual test run--the most
> recent run is hanging in generic/247 with the following in the server logs.
>
> But I probably still won't get a chance to look at this any closer till after
> vault.
>
> --b.
>
> nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing..
> nfsd: fence failed for client 192.168.122.32: -2!
> nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing..
> nfsd: fence failed for client 192.168.122.32: -2!
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
> ffff88005639a000 xid c21abd62
> kswapd0: page allocation failure: order:0, mode:0x120
[snip network driver memory allocation failure]
> active_anon:7053 inactive_anon:2435 isolated_anon:0
> active_file:88743 inactive_file:89505 isolated_file:32
> unevictable:0 dirty:9786 writeback:0 unstable:0
> free:3571 slab_reclaimable:227807 slab_unreclaimable:75772
> mapped:21010 shmem:380 pagetables:1567 bounce:0
> free_cma:0
Looks like there should be heaps of reclaimable memory...
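To put rough numbers on that, the counters above are in pages. A minimal sketch of the conversion, assuming 4 KiB pages (the dump does not state the page size, and pages_to_mib is just a helper name made up for illustration):

```shell
#!/bin/sh
# Convert a page count to MiB (integer division, rounds down).
# ASSUMPTION: 4 KiB pages; the allocation-failure dump does not say.
pages_to_mib() {
    echo $(( $1 * 4 / 1024 ))
}

pages_to_mib 227807   # slab_reclaimable
pages_to_mib 89505    # inactive_file
pages_to_mib 3571     # free
```

Under that assumption it comes to roughly 890 MiB of reclaimable slab and about 350 MiB of inactive file pages, against about 14 MiB free.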
> nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing..
So there's a layout recall pending...
> nfsd: fence failed for client 192.168.122.32: -2!
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
> ffff880051dfc000 xid 8ff02aaf
> INFO: task nfsd:17653 blocked for more than 120 seconds.
> Not tainted 4.0.0-rc2-09922-g26cbcc7 #89
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> nfsd D ffff8800753a7848 11720 17653 2 0x00000000
> ffff8800753a7848 0000000000000001 0000000000000001 ffffffff82210580
> ffff88004e9bcb50 0000000000000006 ffffffff8119d9f0 ffff8800753a7848
> ffff8800753a7fd8 ffff88002e5e3d70 0000000000000246 ffff88004e9bcb50
> Call Trace:
> [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
> [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
> [<ffffffff81a95737>] schedule+0x37/0x90
> [<ffffffff81a95ac8>] schedule_preempt_disabled+0x18/0x30
> [<ffffffff81a97756>] mutex_lock_nested+0x156/0x400
> [<ffffffff813a0d5a>] ? xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
> [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
> [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
> [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
> [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130
> [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0
and the nfsd got hung up on the inode mutex during a write.
Which means there's some other process blocked holding the i_mutex.
sysrq-w and sysrq-t are probably going to tell us more here.
I suspect we'll have another write stuck in break_layout().....
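
If there's no console handy for the key combination, the same dumps can be requested through /proc/sysrq-trigger. A minimal sketch; dump_task_state is a made-up helper name, and the optional path argument is only there so it can be pointed at a scratch file instead of the real trigger:

```shell
#!/bin/sh
# Hypothetical helper: ask the kernel to dump task state via magic SysRq.
# 'w' dumps blocked (uninterruptible, state D) tasks; 't' dumps all tasks.
# The real trigger is /proc/sysrq-trigger and writing to it needs root.
dump_task_state() {
    trigger="${1:-/proc/sysrq-trigger}"
    for key in w t; do
        echo "$key" > "$trigger" || return 1
    done
}

# Usage (as root on the hung server), then read the kernel log:
#   dump_task_state && dmesg | less
```

The stack traces end up in the kernel log, so grab dmesg right after the hang shows up.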
Cheers,
Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx