The real problem here is that two threads can end up attempting to manipulate
inode state based on the reference count crossing through 1. What we actually
need is synchronization between these threads.
Based on the code after your patch:
Consider vn_hold - which wraps around igrab and vn_rele which calls iput
and ends up in vn_put_inode. These both manipulate the inode count.
A thread doing a vn_hold racing with a thread in vn_put_inode will block
vn_put_inode is checking the hold count. But once we release the lock in the
vnode and call xfs_inactive the vn_hold will continue and bump the count.
So we now have one thread doing the inactive processing and another thread
bumping the hold count. If this was an unlinked inode then inactive will start
the work of removing it from the disk. At the same time the thread doing the
vn_hold will carry on using the inode, this is bad.
The code currently in the tree does have a similar race in it - I am not sure
yet if it is a narrower window than with your patch, it survives a dbench load
of 100, I will test the patch which will help the NFS server end of things.
There is still work to do on this either way, and trying to do this without
modifying iput is non-trivial to say the least.