[PATCH 01/15] xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering
Ben Myers
bpm at sgi.com
Mon Nov 4 17:10:58 CST 2013

On Thu, Oct 31, 2013 at 10:15:57AM +1100, Dave Chinner wrote:
> On Wed, Oct 30, 2013 at 05:39:04PM -0500, Ben Myers wrote:
> > On Tue, Oct 29, 2013 at 10:11:44PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner at redhat.com>
> > >
> > > Removing an inode from the namespace involves removing the directory
> > > entry and dropping the link count on the inode. Removing the
> > > directory entry can result in locking an AGF (directory blocks were
> > > freed) and removing a link count can result in placing the inode on
> > > an unlinked list which results in locking an AGI.
> > >
> > > The big problem here is that we have an ordering constraint on AGF
> > > and AGI locking - inode allocation locks the AGI, then can allocate
> > > a new extent for new inodes, locking the AGF after the AGI.
> > > Similarly, freeing the inode removes the inode from the unlinked
> > > list, requiring that we lock the AGI first, and then freeing the
> > > inode can result in an inode chunk being freed and hence freeing
> > > disk space requiring that we lock an AGF.
> > >
> > > Hence the ordering that is imposed by other parts of the code is AGI
> > > before AGF. This means we cannot remove the directory entry before
> > > we drop the inode reference count and put it on the unlinked list as
> > > this results in a lock order of AGF then AGI, and this can deadlock
> > > against inode allocation and freeing. Therefore we must drop the
> > > link counts before we remove the directory entry.
> > >
> > > This is still safe from a transactional point of view - it is not
> > > until we get to xfs_bmap_finish() that we have the possibility of
> > > multiple transactions in this operation. Hence as long as we remove
> > > the directory entry and drop the link count in the first transaction
> > > of the remove operation, there are no transactional constraints on
> > > the ordering here.
> > >
> > > Change the ordering of the operations in the xfs_remove() function
> > > to align the ordering of AGI and AGF locking to match that of the
> > > rest of the code.
> > >
> > > Signed-off-by: Dave Chinner <dchinner at redhat.com>
> >
> > These two codepaths look plausible for the deadlock you described:
> >
> > inode allocation locking:
> >   xfs_create
> >     xfs_dir_ialloc
> >       xfs_ialloc
> >         xfs_dialloc
> >           xfs_ialloc_read_agi         * takes agi
> >           xfs_ialloc_ag_alloc
> >             xfs_alloc_vextent
> >               xfs_alloc_fix_freelist
> >                 xfs_alloc_read_agf    * takes agf
> >
> > vs
> >
> > xfs_remove
> >   xfs_dir_removename
> >     xfs_dir2_node_removename
> >       xfs_dir2_leafn_remove
> >         xfs_dir2_shrink_inode
> >           xfs_bunmapi
> >           . xfs_bmap_del_extent
> >           .   xfs_btree_delete
> >           .     xfs_btree_delrec
> >           .       .free_block
> >           .         xfs_bmbt_free_block
> >           .           xfs_bmap_add_free   * adds to free list, doesn't take agf
> >             xfs_bmap_extents_to_btree
> >               xfs_alloc_vextent           * takes agf
>
> Yeah, that's not the obvious or common path, but the allocation has
> the same cause - it's a bmbt block that gets allocated. i.e.
> removing a block from the middle of a contiguous extent can result
> in the extent tree growing, and hence needing allocation of a block
> for the new entry. This is the path I was hitting:
>
> ....
> xfs_dir2_shrink_inode
>   xfs_bunmapi
>     xfs_bmap_del_extent
>       case 0: /* delete middle of extent */
>         xfs_btree_update
>         xfs_btree_increment
>         xfs_btree_insert
>           xfs_btree_insrec
>             xfs_btree_make_block_unfull
>               xfs_btree_split
>                 .alloc_block
>                   xfs_bmbt_alloc_block
>                     xfs_alloc_vextent   * takes agf
>
>
> > I was thinking I'd find something in .free_block, but I didn't.
>
> Right, data extents are added to the free list. That list is later
> walked and freed via xfs_bmap_finish(), after it adds an EFI matching
> the free list to the transaction the free list belongs to and commits
> that transaction.
>
> > But it does look like we'll take the agf if we have to convert
> > between directory formats in xfs_dir2_leafn_remove, and it looks
> > like there are a few more opportunities to take the agf in
> > xfs_bunmapi...
>
> Yup, but with the above call chain, any random block removal can
> cause a bmbt allocation to occur, so we don't really need to look
> any further. Indeed, you should just assume that any call to
> xfs_bunmapi() to free an extent will require block allocation....

Applied this. Thanks Dave.

-Ben
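
For readers following the lock-ordering argument in the quoted commit message, here is a minimal userspace sketch of it. This is not kernel code and not the patch itself: the rank values, `struct lock_ctx`, `take_lock()`, and the `model_*` and `remove_*_order()` helpers are all hypothetical names invented for illustration. It assumes the worst case described in the thread, where removing the directory entry ends up taking an AGF and dropping the link count ends up taking an AGI, and it simply counts acquisitions that break the "AGI before AGF" rule.

```c
#include <assert.h>

/* Hypothetical lock ranks: the lower rank must be taken first (AGI before AGF). */
enum lock_rank { RANK_AGI = 0, RANK_AGF = 1 };

/* Minimal record of what one modelled operation currently holds. */
struct lock_ctx {
    int highest_held;   /* -1 when nothing is held */
    int violations;     /* out-of-order acquisitions seen so far */
};

static void ctx_init(struct lock_ctx *ctx)
{
    ctx->highest_held = -1;
    ctx->violations = 0;
}

/* Record taking a lock; count a violation if a higher-ranked lock is already held. */
static void take_lock(struct lock_ctx *ctx, enum lock_rank rank)
{
    assert(rank == RANK_AGI || rank == RANK_AGF);
    if ((int)rank < ctx->highest_held)
        ctx->violations++;
    else
        ctx->highest_held = (int)rank;
}

/* Stand-in for removing the directory entry: assume it frees or allocates blocks (AGF). */
static void model_dir_removename(struct lock_ctx *ctx)
{
    take_lock(ctx, RANK_AGF);
}

/* Stand-in for dropping the link count: assume the inode goes on the unlinked list (AGI). */
static void model_droplink(struct lock_ctx *ctx)
{
    take_lock(ctx, RANK_AGI);
}

/* Ordering before the patch: directory entry first, then link count. */
int remove_old_order(void)
{
    struct lock_ctx ctx;

    ctx_init(&ctx);
    model_dir_removename(&ctx);   /* AGF */
    model_droplink(&ctx);         /* AGI after AGF: inverted */
    return ctx.violations;
}

/* Ordering after the patch: link count first, then directory entry. */
int remove_new_order(void)
{
    struct lock_ctx ctx;

    ctx_init(&ctx);
    model_droplink(&ctx);         /* AGI */
    model_dir_removename(&ctx);   /* AGF after AGI: matches allocation and freeing */
    return ctx.violations;
}
```

Calling `remove_old_order()` reports one ordering violation and `remove_new_order()` reports none. The sketch only checks ordering within a single caller; the actual deadlock needs a second thread (for example inode allocation) taking the same two locks in AGI-then-AGF order at the same time.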