[PATCH 2/3] xfs: convert inode cache lookups to use RCU locking
Paul E. McKenney
paulmck at linux.vnet.ibm.com
Wed Dec 15 00:34:35 CST 2010
On Wed, Dec 15, 2010 at 01:50:49PM +1100, Dave Chinner wrote:
> On Tue, Dec 14, 2010 at 05:05:36PM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 15, 2010 at 10:00:47AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote:
> > > > > + /*
> > > > > + * check for re-use of an inode within an RCU grace period due to the
> > > > > + * radix tree nodes not being updated yet. We monitor for this by
> > > > > + * setting the inode number to zero before freeing the inode structure.
> > > > > + * If the inode has been reallocated and set up, then the inode number
> > > > > + * will not match, so check for that, too.
> > > > > + */
> > > > > spin_lock(&ip->i_flags_lock);
> > > > > + if (ip->i_ino != ino) {
> > > > > + trace_xfs_iget_skip(ip);
> > > > > + XFS_STATS_INC(xs_ig_frecycle);
> > > > > + spin_unlock(&ip->i_flags_lock);
> > > > > + rcu_read_unlock();
> > > > > + /* Expire the grace period so we don't trip over it again. */
> > > > > + synchronize_rcu();
> > > >
> > > > Hmmm... Interesting. Wouldn't the fact that we acquired the same lock
> > > > that was held after removing the inode guarantee that an immediate retry
> > > > would manage not to find this same inode again?
> > >
> > > That is what I'm not sure of. I was more worried about resolving the
> > > contents of the radix tree nodes, not so much the inode itself. If a
> > > new traversal will resolve the tree correctly (which is what you are
> > > implying), then synchronize_rcu() is not needed....
> >
> > Here is the sequence of events that I believe must be in place:
>
> s/vfs_inode/xfs_inode/g
Good point!
> > 1. Some CPU removes the vfs_inode from the radix tree, presumably
> > while holding some lock whose identity does not matter here.
>
> Yes, the ->pag_ici_lock.
OK.
> > 2. Before invoking call_rcu() on a given vfs_inode, this same
> > CPU clears the inode number while holding ->i_flags_lock.
>
> Not necessarily the same CPU - there are points where we take
> sleeping locks for synchronisation with any remaining users, and
> we don't use preempt_disable() to prevent a change of CPU on a
> preemptible kernel.
>
> > If the CPU in #1 and #2 might be different, then either
> > CPU #1 must hold ->i_flags_lock while removing the vfs_inode
> > from the radix tree, or I don't understand how this can work
> > even with the synchronize_rcu().
>
> I'm probably missing something, but why does the CPU we run
> call_rcu() on to free the inode matter w.r.t. the CPU the inode
> was deleted from the radix tree on?
I was oversimplifying. What matters is that the item was deleted
from the radix tree unambiguously before it was passed to call_rcu().
There are a number of ways to make this happen:
1. Do the removal and the call_rcu() on the same CPU, in that
order.
2. Do the removal while holding a given lock, and do the call_rcu()
under a later critical section for that same lock.
3. Do the removal while holding lock A on CPU 1, then later
acquire lock B on CPU 1, and then do the call_rcu() after
a later acquisition of lock B on some other CPU.
There are a bunch of other variations on this theme, but the key
requirement is again that the call_rcu() happen unambiguously after
the removal. Otherwise, how is the grace period supposed to
guarantee that all RCU readers that might be accessing the removed
xfs_inode really have completed?
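To make pattern #2 concrete, here is a minimal sketch -- all of the
names below (my_tree_lock, my_tree_root, my_item_free_rcu) are made
up for illustration, not real XFS code:

	struct my_item {
		unsigned long	index;
		struct rcu_head	rcu_head;
	};

	static void my_item_free_rcu(struct rcu_head *head)
	{
		kfree(container_of(head, struct my_item, rcu_head));
	}

	/* CPU 1: remove the item while holding the lock. */
	spin_lock(&my_tree_lock);
	radix_tree_delete(&my_tree_root, item->index);
	spin_unlock(&my_tree_lock);

	/* Any CPU, later: the unlock->lock pair on my_tree_lock
	 * guarantees that this critical section -- and therefore
	 * the call_rcu() -- happens unambiguously after the
	 * removal above. */
	spin_lock(&my_tree_lock);
	call_rcu(&item->rcu_head, my_item_free_rcu);
	spin_unlock(&my_tree_lock);

The same reasoning extends to variation #3, just with the extra
unlock->lock step through lock B in the middle.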
> There is this comment in include/linux/radix-tree.h:
>
> * It is still required that the caller manage the synchronization and lifetimes
> * of the items. So if RCU lock-free lookups are used, typically this would mean
> * that the items have their own locks, or are amenable to lock-free access; and
> * that the items are freed by RCU (or only freed after having been deleted from
> * the radix tree *and* a synchronize_rcu() grace period).
>
> There is nothing there that mentions the items need to be freed on
> the same CPU they were removed from the radix tree on, or that the
> item lock needs to be held when the item is removed from the tree.
> AFAICT, the XFS code is following these guidelines.
Well, the grace period (from either synchronize_rcu() or call_rcu())
does need to start unambiguously after the deletion from the radix tree.
Should we upgrade the comment?
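For reference, the lookup side that leans on that guideline -- which
is essentially what the patch at the top of this thread does, with
the tracing and stats trimmed out -- looks like:

	rcu_read_lock();
	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
	if (ip) {
		spin_lock(&ip->i_flags_lock);
		if (ip->i_ino != ino) {
			/* Freed (i_ino zeroed) or reallocated while
			 * we were walking the tree: retry the lookup. */
			spin_unlock(&ip->i_flags_lock);
			rcu_read_unlock();
			goto again;
		}
		/* ... validate and take a reference ... */
	}
	rcu_read_unlock();

(Simplified, of course -- the real code has more to do before it can
drop the RCU read lock.)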
> FWIW, this is where I got the idea of using synchronize_rcu() to
> ensure a lookup retry wouldn't see the same freed inode. I'm
> thinking that simply re-running the lookup will give the same
> guarantee because of the memory barriers in the radix tree lookup
> code...
Maybe... But we do need to be sure, right?
> > 3. Some CPU, possibly different than that in #1 and #2 above,
> > executes the above code. If locking works correctly, it
> > must see the earlier changes, so a subsequent access should
> > see the results of #1 above, so that it won't see the element
> > that was removed there.
>
> Isn't the item validity considered separately to the tree traversal
> consistency? i.e. the item lock (i.e. ->i_flags_lock) provides a
> safe item validity check via the unlock->lock memory barrier, whilst
> the radix tree uses rcu_dereference() to provide memory barriers
> against the modifications?
But the safe validity check assumes that the RCU grace period starts
unambiguously after the item has been removed. Violate that assumption,
and all bets are off.
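In other words, the freeing side needs to look something like the
following -- the rcu_head field and callback name here are stand-ins
for whatever XFS actually uses:

	/* Zero the inode number while holding the item lock, *before*
	 * handing the inode to call_rcu(). */
	spin_lock(&ip->i_flags_lock);
	ip->i_ino = 0;
	spin_unlock(&ip->i_flags_lock);
	call_rcu(&ip->i_rcu, xfs_inode_free_callback);

The reader's later spin_lock(&ip->i_flags_lock) pairs with the
spin_unlock() here, so the reader is guaranteed to see i_ino == 0 and
reject the stale item -- but only if the call_rcu() itself follows
the radix-tree deletion, as above.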
> > That said, I don't claim to understand either vfs or xfs very well, so
> > I would be arbitrarily deeply confused.
>
> We might both be confused ;)
Sounds like the most likely possibility, now that you mention it. ;-)
Thanx, Paul