
Re: [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking

To: Alex Elder <aelder@xxxxxxx>
Subject: Re: [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Wed, 15 Sep 2010 09:42:24 +1000
Cc: xfs@xxxxxxxxxxx
In-reply-to: <1284499421.9701.69.camel@doink>
References: <1284461777-1496-1-git-send-email-david@xxxxxxxxxxxxx> <1284461777-1496-6-git-send-email-david@xxxxxxxxxxxxx> <1284499421.9701.69.camel@doink>
User-agent: Mutt/1.5.20 (2009-06-14)
On Tue, Sep 14, 2010 at 04:23:41PM -0500, Alex Elder wrote:
> On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > With delayed logging greatly increasing the sustained parallelism of inode
> > operations, the inode cache locking is showing significant read vs write
> > contention when inode reclaim runs at the same time as lookups. There are
> > also a lot more write lock acquisitions than there are read locks (4:1 ratio)
> > so the read locking is not really buying us much in the way of parallelism.
> > 
> > To avoid the read vs write contention, change the cache to use RCU locking on
> > the read side. To avoid needing to RCU free every single inode, use the built
> > in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> > freed inodes, so ensure that every freed inode has an inode number of zero and
> > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in the cache
> > hit lookup path, but also add a check for a zero inode number as well.
> > 
> > We can then convert all the read locking lookups to use RCU read side locking
> > and hence remove all read side locking.
> > 
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> I confess that I'm a little less than solid on this, but
> that's a comment on me, not your code.  (After writing
> all this I feel a bit better.)
> 
> I'll try to describe my understanding and you can reassure
> me all is well...  It's quite a lot, but I'll call attention
> to two things to look for:  a question about something in
> xfs_reclaim_inode(); and a comment related to
> xfs_iget_cache_hit().
> 
> 
> First, you are replacing the use of a single rwlock for
> protecting access to the per-AG in-core inode radix tree
> with RCU for readers and a spinlock for writers.

Correct.

> This initially seemed strange to me, and unsafe, but I
> now think it's OK because:
> - the spinlock protects against concurrent writers
>   interfering with each other
> - the rcu_read_lock() is sufficient for ensuring readers
>   have valid pointers, because the underlying structure
>   is a radix tree, which uses rcu_assign_pointer() in
>   order to change anything in the tree.

Correct.

> I'm still unsettled about the protection readers have
> against a concurrent writer, but it's probably just
> because this particular usage is new to me.

The protection is provided by the fact that the radix tree
node connectivity is protected by RCU read locking, so the only
thing we have to worry about on lookup is whether we have a valid
inode or not.

> Second, you are exploiting the SLAB_DESTROY_BY_RCU
> feature in order to avoid having to have each inode
> wait an RCU grace period when it's freed.  To use
> that we need to check for and recognize a freed
> inode after looking it up, since we have no guarantee
> it's updated in the radix tree after it's freed until
> after an RCU grace period has passed.  So zeroing the
> i_ino field and setting XFS_IRECLAIM handles that.

Yes. However, that is not specific to the use of
SLAB_DESTROY_BY_RCU. Even just using call_rcu() to free the inodes,
we'd still need to detect freed inodes on lookup in some way because
the lookup can return those inodes due to the radix tree nodes being
RCU freed. That is, a lockless RCU cache lookup of any kind needs to
be able to safely detect a freed structure and avoid re-using it.
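
To illustrate that last point, here is a minimal userspace sketch of the
validity check a lockless lookup needs. It is not the XFS code: the names
lk_inode, LK_IRECLAIM, lk_lookup_valid() and lk_mark_freed() are
hypothetical stand-ins, locking and memory ordering are left out, and the
comparison against the wanted inode number is an extra assumption that
covers a slab object being reused for a different inode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the XFS_IRECLAIM flag. */
#define LK_IRECLAIM 0x1u

/* Hypothetical stand-in for the in-core inode. */
struct lk_inode {
	uint64_t ino;	/* zeroed before the structure is freed */
	unsigned flags;	/* LK_IRECLAIM is set before the structure is freed */
};

/* What the free path is assumed to do so that lookups can spot it. */
void lk_mark_freed(struct lk_inode *ip)
{
	assert(ip != NULL);
	ip->flags |= LK_IRECLAIM;
	ip->ino = 0;
}

/*
 * Decide whether an object returned by a lockless lookup may be used.
 * A stale tree entry can still point at a freed (or reused) object, so
 * the object itself has to say whether it is the inode we asked for.
 */
bool lk_lookup_valid(const struct lk_inode *ip, uint64_t want_ino)
{
	assert(ip != NULL);
	if (ip->ino == 0)		/* freed within the grace period */
		return false;
	if (ip->flags & LK_IRECLAIM)	/* being torn down */
		return false;
	return ip->ino == want_ino;	/* assumption: guards against reuse */
}
```

A caller that gets false back would drop out of the read-side critical
section and retry the lookup rather than touch the object.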

> So I see these lookups:
> - Two gang lookups in xfs_inode_ag_lookup(), which
>   is called only by xfs_inode_ag_walk(), in turn
>   called only by xfs_inode_ag_iterator().  The
>   check in this case has to happen in the "execute"
>   function passed in to xfs_inode_ag_walk() via
>   xfs_inode_ag_iterator().  The affected functions
>   are:
>     - xfs_sync_inode_data().  This one calls
>       xfs_sync_inode_valid() right away, which in
>       your change now checks for a zero i_ino.
>     - xfs_sync_inode_attr().  Same as above,
>       handled by xfs_sync_inode_valid().
>     - xfs_reclaim_inode().  This one should
>       be fine, because it already has a test
>       for the XFS_IRECLAIM flag being set, and
>       ignores the inode if it is.  However, it
>       has this line also:
>         ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
>       Your change doesn't set XFS_IRECLAIMABLE, so
> *     I imagine if we get here inside that RCU window
> *     we'd have a problem.  Am I wrong about this?

Good question. I think you are right - we didn't actually clear that
flag anywhere until this patch, so yes, it could trigger. I think
I'll add a check for ip->i_ino == 0 before we take the lock and
in that case I can leave the assert there.
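
Roughly, the ordering described above would look like the userspace sketch
below. It is an illustration under assumptions, not the actual change:
rc_inode, the RC_* flags and rc_reclaim_inode() are hypothetical names, no
locks are taken, and the sketch assumes the assert runs before the
"already under reclaim" test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for XFS_IRECLAIMABLE and XFS_IRECLAIM. */
#define RC_IRECLAIMABLE	0x1u
#define RC_IRECLAIM	0x2u

/* Hypothetical stand-in for the in-core inode. */
struct rc_inode {
	uint64_t ino;
	unsigned flags;
};

/* Returns 1 if this caller claimed the inode for reclaim, 0 if skipped. */
int rc_reclaim_inode(struct rc_inode *ip)
{
	assert(ip != NULL);

	/*
	 * A freed inode seen inside the RCU window has a zero inode
	 * number and no longer carries RC_IRECLAIMABLE, so bail out
	 * before the assert below can see it.
	 */
	if (ip->ino == 0)
		return 0;

	/* The real code would take its locks at this point. */
	assert(ip->flags & RC_IRECLAIMABLE);

	if (ip->flags & RC_IRECLAIM)
		return 0;	/* already under reclaim, ignore it */

	ip->flags |= RC_IRECLAIM;
	return 1;
}
```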

>     - xfs_dqrele_inode().  This one again calls
>       xfs_sync_inode_valid(), so should be covered.
> - A lookup in xfs_iget().  This is handled by
>   your change, by looking for a zero i_ino in
> * xfs_iget_cache_hit().  (Please see the comment
>   on this function in-line, below.)
> - A lookup in xfs_ifree_cluster().  Handled by
>   your change (now checks for zero i_ino).
> - And a gang lookup in xfs_iflush_cluster().  This
>   one is handled by your change (now checks each
>   inode for a zero i_ino field).
> 
> OK, so I think that covers everything, but I have
> that one question about xfs_reclaim_inode(), and
> then I have one more comment below.
> 
> 
> 
> Despite all my commentary above...  The patch looks
> good (consistent) to me.  I'm interested to hear
> your feedback though.  And unless there is something
> major changed, or I'm fundamentally misguided about
> this stuff, you can consider it:
> 
> Reviewed-by: Alex Elder <aelder@xxxxxxx>
> 
> 
> > ---
> >  fs/xfs/linux-2.6/kmem.h        |    1 +
> >  fs/xfs/linux-2.6/xfs_super.c   |    3 ++-
> >  fs/xfs/linux-2.6/xfs_sync.c    |   12 ++++++------
> >  fs/xfs/quota/xfs_qm_syscalls.c |    4 ++--
> 
> . . .
> 
> > diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> > index b1ecc6f..f3a46b6 100644
> > --- a/fs/xfs/xfs_iget.c
> > +++ b/fs/xfs/xfs_iget.c
> 
> . . .
> 
> > @@ -145,12 +153,26 @@ xfs_iget_cache_hit(
> >     struct xfs_perag        *pag,
> >     struct xfs_inode        *ip,
> >     int                     flags,
> > -   int                     lock_flags) __releases(pag->pag_ici_lock)
> > +   int                     lock_flags) __releases(RCU)
> >  {
> >     struct inode            *inode = VFS_I(ip);
> >     struct xfs_mount        *mp = ip->i_mount;
> >     int                     error;
> >  
> > +   /*
> > +    * check for re-use of an inode within an RCU grace period due to the
> > +    * radix tree nodes not being updated yet. We monitor for this by
> > +    * setting the inode number to zero before freeing the inode structure.
> > +    */
> > +   if (ip->i_ino == 0) {
> > +           trace_xfs_iget_skip(ip);
> > +           XFS_STATS_INC(xs_ig_frecycle);
> > +           rcu_read_unlock();
> > +           /* Expire the grace period so we don't trip over it again. */
> > +           synchronize_rcu();
> 
> Since you're waiting for the end of the grace period here,
> it seems a shame that the caller (xfs_iget()) will still
> end up calling delay(1) before trying again.  It would
> be nice if the delay could be avoided in that case.

True. I'll see if I can come up with a simple way of doing this.
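
One possible shape for this, purely as a sketch and not necessarily what
the final patch will do: let the cache hit path report whether it has
already waited, and have the retry loop skip the delay in that case. All
names below (rt_result, rt_iget(), rt_delay(), the scripted helper) are
hypothetical, and the delay is modelled as a counter so that the control
flow can be exercised in userspace.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical results of one lookup attempt. */
enum rt_result {
	RT_OK,			/* found a usable inode */
	RT_AGAIN,		/* retry after the usual delay */
	RT_AGAIN_WAITED,	/* retry now, a grace period was already waited out */
};

typedef enum rt_result (*rt_try_fn)(void *ctx);

struct rt_stats {
	unsigned attempts;
	unsigned delays;
};

/* Stand-in for delay(1): only counts how often we would have slept. */
void rt_delay(struct rt_stats *st)
{
	st->delays++;
}

/* Retry loop: sleep only when the attempt has not already waited. */
void rt_iget(rt_try_fn try_once, void *ctx, struct rt_stats *st)
{
	assert(try_once != NULL && st != NULL);
	for (;;) {
		enum rt_result r = try_once(ctx);

		st->attempts++;
		if (r == RT_OK)
			return;
		if (r == RT_AGAIN)
			rt_delay(st);
		/* RT_AGAIN_WAITED falls through and retries immediately. */
	}
}

/* Helper for trying the loop: replays a fixed list of results. */
struct rt_script {
	const enum rt_result *steps;
	size_t next;
};

enum rt_result rt_scripted_try(void *ctx)
{
	struct rt_script *s = ctx;

	return s->steps[s->next++];
}
```

Replaying RT_AGAIN_WAITED, RT_AGAIN, RT_OK through rt_iget() makes three
attempts and counts a single delay, for the plain RT_AGAIN case only.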

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
