
Re: Warning from unlock_new_inode

To: Ben Myers <bpm@xxxxxxx>
Subject: Re: Warning from unlock_new_inode
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Thu, 1 Mar 2012 14:58:12 +1100
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>, Jan Kara <jack@xxxxxxx>, xfs@xxxxxxxxxxx
In-reply-to: <20120301030324.GX28391@xxxxxxx>
References: <20120222220137.GB3650@xxxxxxxxxxxxx> <20120228083444.GB22995@xxxxxxxxxxxxx> <20120229005351.GV3592@dastard> <20120229014906.GX3592@dastard> <20120301030324.GX28391@xxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Wed, Feb 29, 2012 at 09:03:24PM -0600, Ben Myers wrote:
> > The race condition is between the insertion of the inode into the
> > cache in the case of a cache miss and a concurrent lookup:
> > 
> > Thread 1                    Thread 2
> > xfs_iget()
> >   xfs_iget_cache_miss()
> >     xfs_iread()
> >     lock radix tree
> >     radix_tree_insert()
> >                                     rcu_read_lock
> >                             radix_tree_lookup
> >                             lock inode flags
> >                             XFS_INEW not set
> >                             igrab()
> >                             unlock inode flags
> >                             rcu_read_unlock
> >                             use uninitialised inode
> >                             .....
> >     lock inode flags
> >     set XFS_INEW
> >     unlock inode flags
> >     unlock radix tree
> >   xfs_setup_inode()
> >     inode flags = I_NEW
> >     unlock_new_inode()
> >       WARNING as inode flags != I_NEW

.....

> > -   spin_lock(&pag->pag_ici_lock);
> > +   /* These values _must_ be set before inserting the inode into the radix
> > +    * tree as the moment it is inserted a concurrent lookup (allowed by the
> > +    * RCU locking mechanism) can find it and that lookup must see that this
> > +    * is an inode currently under construction (i.e. that XFS_INEW is set).
> > +    * The ip->i_flags_lock that protects the XFS_INEW flag forms the
> > +    * memory barrier that ensures this detection works correctly at lookup
> > +    * time.
> > +    */
> > +   xfs_iflags_set(ip, XFS_INEW);
> > +   ip->i_udquot = ip->i_gdquot = NULL;
> >  
> >     /* insert the new inode */
> > +   spin_lock(&pag->pag_ici_lock);
> >     error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
> >     if (unlikely(error)) {
> >             WARN_ON(error != -EEXIST);
> > @@ -360,11 +370,6 @@ xfs_iget_cache_miss(
> >             error = EAGAIN;
> >             goto out_preload_end;
> >     }
> > -
> > -   /* These values _must_ be set before releasing the radix tree lock! */
>                                                          ^^^ 
> So, in this comment 'radix tree lock' refers to pag->pag_ici_lock?

Right. "ici" is short for "in-core inode". So pag_ici_lock reads as
"per-ag in-core inode lock". The lock is used to protect
modifications to the in-core inode cache which is radix tree indexed
and the root is at pag_ici_root...

> And, pag_ici_lock lock provides no exclusion with radix_tree_lookup.

Right, but RCU-based radix tree traversal is safe as long as the
objects being indexed are RCU freed and have some external method of
validating their freed/in-use state independent of the radix tree
lookup. IOWs, tree lookups are not synchronised with inserts,
deletes or tag modifications - they are synchronised with RCU freeing
of the objects the tree points to. Hence we only need to validate
the objects against races with RCU freeing after the tree lookup -
we don't need to care about the state of the tree at all....

> I believe I understand.  That isn't to say that I couldn't use a
> brush-up on RCU.  Awesome.  ;)

RCU is tricky. It took me a long time to convince myself that we
could use RCU lookups here because it relies on memory barriers
rather than locking to ensure coherency across all CPUs.  I firmly
believe that when a concept is difficult to understand, it shouldn't
be made harder to understand by optimising it with a smart and/or
tricky implementation. Spin locks provide memory barriers and they
are something simple that everyone understands, therefore all we
need to concentrate on is the order in which operations occur....

FWIW, once I had decided on the memory barrier implementation
(ip->i_flags_lock), there were three basic principles I followed to
prove to myself that the inode cache lookup is RCU safe:

        1. that the state of inodes in the process of removal from
        the tree is detectable during lookup (i.e. via the
        XFS_IRECLAIM flag).

        2. that inodes in the process of being freed (regardless of
        removal state) in the RCU grace period can be detected (i.e.
        by clearing ip->i_ino before calling rcu_free)

        3. that the state of new inodes is detectable during lookup
        so we can avoid them until they are fully initialised. i.e.
        the XFS_INEW flag.

That's why xfs_iget_cache_hit() takes the i_flags_lock, then checks
ip->i_ino and the XFS_INEW | XFS_IRECLAIM flags, and
xfs_inode_free() ensures that ip->i_ino = 0 and XFS_IRECLAIM is set
before calling rcu_free() on the inode. The problem was that I
hadn't got the ordering of operations for #3 correct, and that's
what the bug fix is for....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
