xfs
[Top] [All Lists]

Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH 2/3] xfs: convert inode cache lookups to use RCU locking
From: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
Date: Tue, 14 Dec 2010 17:05:36 -0800
Cc: xfs@xxxxxxxxxxx, eric.dumazet@xxxxxxxxx
In-reply-to: <20101214230047.GC16267@dastard>
References: <1292203957-15819-1-git-send-email-david@xxxxxxxxxxxxx> <1292203957-15819-3-git-send-email-david@xxxxxxxxxxxxx> <20101214211801.GH2161@xxxxxxxxxxxxxxxxxx> <20101214230047.GC16267@dastard>
Reply-to: paulmck@xxxxxxxxxxxxxxxxxx
User-agent: Mutt/1.5.20 (2009-06-14)
On Wed, Dec 15, 2010 at 10:00:47AM +1100, Dave Chinner wrote:
> On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote:
> > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > 
> > > With delayed logging greatly increasing the sustained parallelism of inode
> > > operations, the inode cache locking is showing significant read vs write
> > > contention when inode reclaim runs at the same time as lookups. There is
> > > also a lot more write lock acquistions than there are read locks (4:1 
> > > ratio)
> > > so the read locking is not really buying us much in the way of 
> > > parallelism.
> > > 
> > > To avoid the read vs write contention, change the cache to use RCU 
> > > locking on
> > > the read side. To avoid needing to RCU free every single inode, use the 
> > > built
> > > in slab RCU freeing mechanism. This requires us to be able to detect 
> > > lookups of
> > > freed inodes, so enѕure that ever freed inode has an inode number of zero 
> > > and
> > > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in 
> > > cache hit
> > > lookup path, but also add a check for a zero inode number as well.
> > > 
> > > We canthen convert all the read locking lockups to use RCU read side 
> > > locking
> > > and hence remove all read side locking.
> > 
> > OK, so the xfs_inode uses straight RCU, and there fore cannot be freed and
> > immediately reallocated, while the inode itself uses SLAB_DESTROY_BY_RCU,
> > which does allow the inode to be freed and immediately reallocated,
> > correct?
> 
> No, the struct inode is embedded in the struct xfs_inode, so they
> have the same lifecycle. i.e. we don't separately allocate and free
> the struct inode. So it is all using straight RCU.
> 
> > Some questions and comments below.
> > 
> >                                                     Thanx, Paul
> > 
> > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > Reviewed-by: Alex Elder <aelder@xxxxxxx>
> > > ---
> > >  fs/xfs/linux-2.6/xfs_sync.c |   27 ++++++++++++++++-----
> > >  fs/xfs/xfs_iget.c           |   50 
> > > +++++++++++++++++++++++++++++++----------
> > >  fs/xfs/xfs_inode.c          |   52 
> > > +++++++++++++++++++++++++++++++++----------
> > >  3 files changed, 98 insertions(+), 31 deletions(-)
> > > 
> > > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> > > index afb0d7c..5ee02d7 100644
> > > --- a/fs/xfs/linux-2.6/xfs_sync.c
> > > +++ b/fs/xfs/linux-2.6/xfs_sync.c
> > > @@ -53,14 +53,20 @@ xfs_inode_ag_walk_grab(
> > >  {
> > >   struct inode            *inode = VFS_I(ip);
> > > 
> > > + /* check for stale RCU freed inode */
> > > + spin_lock(&ip->i_flags_lock);
> > > + if (!ip->i_ino)
> > > +         goto out_unlock_noent;
> > > +
> > > + /* avoid new or reclaimable inodes. Leave for reclaim code to flush */
> > > + if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
> > > +         goto out_unlock_noent;
> > > + spin_unlock(&ip->i_flags_lock);
> > > +
> > 
> > OK, this works because the xfs_inode cannot be freed and reallocated (which
> > the RCU change should enforce).  It is not clear to me that the above
> > would work if using the SLAB_DESTROY_BY_RCU approach, at least not unless
> > some higher-level checks can reliably catch an inode changing identity
> > due to quick free and reallocation.
> 
> In this case, I don't believe it matters if the inode has changed
> identity - we are walking for writeback, so if we get a reallocated
> inode we'll write it back if it is dirty. If it has not been
> reallocated or still being initialised, the !ip->i_ino and
> XFS_INEW|XFS_IRECLAIM checks are sufficient to avoid using the inode.
> 
> I probably should add a comment to this effect, yes?

One vote in favor from me.  ;-)

> > Also, all calls to xfs_inode_ag_walk_grab() must be in RCU read-side
> > critical sections...  I suggest a debug check for rcu_read_lock_held() to
> > catch any call paths that might have slipped through. 
> 
> Yes, good idea.
> 
> > At first glance,
> > it appears that RCU is replacing some of ->pag_ici_lock, but I could
> > easily be mistaken.
> 
> Correct, it is replacing the read side of the ->pag_ici_lock.

OK, good!

> > > @@ -639,6 +649,9 @@ xfs_reclaim_inode_grab(
> > >   struct xfs_inode        *ip,
> > >   int                     flags)
> > >  {
> > > + /* check for stale RCU freed inode */
> > > + if (!ip->i_ino)
> > > +         return 1;
> > 
> > Does this mean that we need to be under rcu_read_lock() here?  If not,
> > how do we keep the inode from really being freed out from under us?
> 
> Hmmm, I think I've mismerged a patch somewhere along the line. In
> this patch the reclaim tree walk is under the ->pag_ici_lock(), when
> in fact it should be under the rcu_read_lock(). Good catch, Paul.
> 
> As it is, being under the ->pag_ici_lock means that the tree is
> consistent and we won't be seeing RCU freed inodes in the walk.
> Hence the code is functioning correctly, just not as wasss intended.
> 
> > (Again, if we do need to be under rcu_read_lock(), I highly recommend
> > a debug check for rcu_read_lock_held().)
> 
> Yup, which would have caught the merge screwup...

;-)

> ....
> > > 
> > > + /*
> > > +  * check for re-use of an inode within an RCU grace period due to the
> > > +  * radix tree nodes not being updated yet. We monitor for this by
> > > +  * setting the inode number to zero before freeing the inode structure.
> > > +  * If the inode has been reallocated and set up, then the inode number
> > > +  * will not match, so check for that, too.
> > > +  */
> > >   spin_lock(&ip->i_flags_lock);
> > > + if (ip->i_ino != ino) {
> > > +         trace_xfs_iget_skip(ip);
> > > +         XFS_STATS_INC(xs_ig_frecycle);
> > > +         spin_unlock(&ip->i_flags_lock);
> > > +         rcu_read_unlock();
> > > +         /* Expire the grace period so we don't trip over it again. */
> > > +         synchronize_rcu();
> > 
> > Hmmm...  Interesting.  Wouldn't the fact that we acquired the same lock
> > that was held after removing the inode guarantee that an immediate retry
> > would manage not to find this same inode again?
> 
> That is what I'm not sure of. I was more worried about resolving the
> contents of the radix tree nodes, not so much the inode itself. If a
> new traversal will resolve the tree correctly (which is what you are
> implying), then synchronize_rcu() is not needed....

Here is the sequence of events that I believe must be in place:

1.      Some CPU removes the vfs_inode from the radix tree, presumably
        while holding some lock whose identity does not matter here.

2.      Before invoking call_rcu() on a given vfs_inode, this same
        CPU clears the inode number while holding ->i_flags_lock.

        If the CPU in #1 and #2 might be different, then either
        CPU #1 must hold ->i_flags_lock while removing the vfs_inode
        from the radix tree, or I don't understand how this can work
        even with the synchronize_rcu().

3.      Some CPU, possibly different than that in #1 and #2 above,
        executes the above code.  If locking works correctly, it
        must see the earlier changes, so a subsequent access should
        see the results of #1 above, so that it won't see the element
        that was removed there.

That said, I don't claim to understand either vfs or xfs very well, so
I would be arbitrarily deeply confused.

> > If this is not the case, then readers finding it again will not be
> > protected by the RCU grace period, right?
> > 
> > In short, I don't understand why the synchronize_rcu() is needed.
> > If it is somehow helping, that sounds to me like it is covering up
> > a real bug that should be fixed separately.
> 
> It isn't covering up a bug, it was more tryingt o be consistent with
> the rest of the xfs_inode lookup failures - we back off and try
> again later. If that is unnecessary resolve the RCU lookup race,
> then it can be dropped.

OK, please let me know whether my sequence of steps above makes sense.

> > > @@ -397,7 +423,7 @@ xfs_iget(
> > >   xfs_agino_t     agino;
> > > 
> > >   /* reject inode numbers outside existing AGs */
> > > - if (XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
> > > + if (!ino || XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
> > 
> > For the above check to be safe, don't we need to already be in an
> > RCU read-side critical section?  Or is something else protecting us?
> 
> "ino" is the inode number used as the lookup key to find the struct
> xfs_inode. This is ensuring we don't try to look up an inode number
> of zero given it's new special meaning as a freed inode. Hence it
> can be safely validated outside the RCU read-side critical sectioni
> as it is a constant.

Ah, OK, got it!

                                                        Thanx, Paul

> Thanks for the review, Paul. I'll fix up the issues you've pointed
> out and retest.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>