On Wed, Oct 04, 2006 at 06:20:14PM +0900, Takenori Nagano wrote:
> The patch attached to this mail fixes a race between xfs_iunpin()
> and generic_delete_inode().
Ahh, that problem, still. :/
> If __mark_inode_dirty() runs concurrently between clear_inode() and
> the BUG_ON() in generic_delete_inode(), the BUG_ON() trips. We think
> this is the cause of the bug.
> xfs_fs_clear_inode() calls xfs_reclaim(). We think the recent fixes
> to xfs_iunpin() were not correct. With those patches, xfs_iunpin()
> can now determine whether an xfs_inode has been recycled or not, but
> that is not the essential way to fix this bug.
Agreed - it was always a pretty ugly way to fix the problem.
> xfs_iunpin() must never touch an xfs_inode
> that has already been freed.
It must never touch the linux inode, not the xfs_inode....
> If try_to_free_page() reclaims slabs
> containing a pinned inode, memory corruption can result.
It's the linux inode that gets used after it's been freed, not the
xfs_inode (which doesn't get freed until after it is unpinned
and the async reclaim is run).
> We come up with three possible solutions:
> 1. xfs_fs_clear_inode() waits for i_pincount to become 0.
> 2. xfs_fs_clear_inode() syncs the in-core log if i_pincount is not 0.
> 3. xfs_fs_clear_inode() invalidates the in-core log entries that
> relate to the deleted inode.
> We chose 2, because the frequency of the sync is almost the same as
> that of the BUG(), and it is the same way the in-core log is synced
> in xfs_fsync() when an inode is still pinned. It has very little
> effect on XFS performance.
Option 1 is the correct solution and we already have a function for
doing that - xfs_iunpin_wait(). This also forces the log in the
optimal manner (i.e. only up to the LSN of the pinned log item)
before waiting for the pin count to fall to zero, so it covers
option 2 as well.
> This patch syncs the in-core log in xfs_fs_clear_inode() if
> i_pincount is not 0. We think this is the essential fix.
The fix should go into xfs_reclaim() rather than xfs_fs_clear_inode(),
as that is where it is important to wait.
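Roughly, the shape of the change would be the hunk below - file and
exact placement are guesses on my part, and the real patch will differ:

```diff
 xfs_reclaim(...)
 {
 	...
+	/*
+	 * Force the log up to the inode's LSN and wait for the pin
+	 * count to drop to zero before we tear the inode down.
+	 */
+	xfs_iunpin_wait(ip);
 	...
 }
```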
I think this is a much better way of fixing the problem, but it needs
a little tweaking. Also, it indicates that we can probably revert
some of the previous changes made in attempting to fix this bug.
I'll put together a new patch with this fix and as much of the
other fixes removed as possible and run some tests on it here.
It'll be a day or two before I have a tested patch ready....
Thanks for persisting with this, Takenori - this looks like
it will be a good, robust solution to the problem.
SGI Australian Software Group