I've been looking into a case of filesystem corruption and found
that we are flushing unlinked inodes after the inode cluster has
been freed - and potentially reallocated as something else. The
case happens when we unlink the last inode in a cluster and that
triggers the cluster to be released.
The code path of interest here is:
-> queues inode on deleted inodes list
... and later on
When the inode is unlinked it gets logged in a transaction, so
xfs_iflush() considers it dirty and writes it out, but by this
time the cluster may already have been freed and reallocated. If
the cluster is reallocated as user data then the checks in
xfs_imap_to_bp() will complain because the inode magic will be
incorrect, but if the cluster is reallocated as another inode
cluster then these checks won't detect that.
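
To make the detection gap concrete, here is a small userspace toy
model, not kernel code. Every toy_* name and the struct layout are
made up for illustration; only the magic value is meant to match
XFS_DINODE_MAGIC. It shows a magic-only check rejecting a cluster
that was reused for data but accepting one that was reused for a
different set of inodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumption: same value as XFS_DINODE_MAGIC ("IN"). */
#define TOY_DINODE_MAGIC        0x494e
#define TOY_INODES_PER_CLUSTER  4

/* Hypothetical, heavily simplified on-disk inode. */
struct toy_dinode {
    uint16_t magic;
    uint32_t nlink;
    uint64_t owner_ino;   /* illustrative only, not a claim about the real layout */
};

struct toy_cluster {
    struct toy_dinode slot[TOY_INODES_PER_CLUSTER];
};

/* Stand-in for a magic-only sanity check on a cluster buffer. */
static bool toy_magic_ok(const struct toy_cluster *c)
{
    for (int i = 0; i < TOY_INODES_PER_CLUSTER; i++) {
        if (c->slot[i].magic != TOY_DINODE_MAGIC)
            return false;
    }
    return true;
}

/* The blocks are reused for file data: arbitrary bytes, no inode magic. */
static void toy_realloc_as_data(struct toy_cluster *c)
{
    memset(c, 0xab, sizeof(*c));
}

/* The blocks are reused for a new inode cluster: magic is valid again. */
static void toy_realloc_as_inodes(struct toy_cluster *c, uint64_t first_ino)
{
    memset(c, 0, sizeof(*c));
    for (int i = 0; i < TOY_INODES_PER_CLUSTER; i++) {
        c->slot[i].magic = TOY_DINODE_MAGIC;
        c->slot[i].nlink = 1;
        c->slot[i].owner_ino = first_ino + (uint64_t)i;
    }
}
```

In this model a late flush of the old inode into a cluster that
passed toy_magic_ok() would silently overwrite a live inode, which
is the second case described above.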
I modified xfs_iflush() to bail out if we try to flush an
unlinked inode (i.e. nlink == 0) and that avoids the corruption,
but xfs_repair now has problems with inodes marked as free but
with non-zero nlink counts. Do we really want to write out
unlinked inodes? Seems a bit redundant.
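
The experiment can be sketched as a userspace toy, again not the
real xfs_iflush(). The demo_* names and both structs are
hypothetical. The sketch assumes the xfs_repair complaint comes
from the on-disk copy keeping its old link count once the flush is
skipped, which is my reading of the symptom rather than something
I have confirmed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-core inode: only the fields this sketch needs. */
struct demo_inode {
    uint32_t nlink;
    bool     dirty;
};

/* Hypothetical on-disk copy of the same inode. */
struct demo_disk_inode {
    uint32_t nlink;
};

/*
 * Toy flush: copy the in-core link count to disk unless the inode
 * is unlinked. Returns true if the on-disk copy was written.
 */
static bool demo_iflush(struct demo_inode *ip, struct demo_disk_inode *dp)
{
    if (ip->nlink == 0)
        return false;       /* bail out: never write an unlinked inode */

    dp->nlink = ip->nlink;
    ip->dirty = false;
    return true;
}
```

With the bail-out in place the final nlink == 0 state never
reaches the on-disk copy in this model, so the disk keeps the last
non-zero count that was flushed.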
Other options could be to delay the release of the inode cluster
until the inode has been flushed or move the flush into xfs_ifree()
before releasing the cluster. Looking at xfs_ifree_cluster(), it
scans the inodes in a cluster and tries to lock them and mark them
stale - maybe we can leverage this and avoid flushing stale inodes.
If so, we'd need to tighten up the locking.
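
A rough userspace sketch of the stale-flag idea follows. It is a
toy under stated assumptions: TOY_ISTALE, struct stale_inode and
the stale_* helpers are invented here and only mimic the "try to
lock, then mark stale" behaviour described above. It also shows
why the locking matters: an inode whose lock could not be taken
during the scan is never marked stale, so it still gets flushed.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ISTALE 0x1u   /* hypothetical flag for this sketch */

/* Hypothetical in-core inode: a fake lock, a flag word, a dirty bit. */
struct stale_inode {
    bool     locked;
    unsigned flags;
    bool     dirty;
};

static bool stale_trylock(struct stale_inode *ip)
{
    if (ip->locked)
        return false;
    ip->locked = true;
    return true;
}

static void stale_unlock(struct stale_inode *ip)
{
    ip->locked = false;
}

/*
 * Toy cluster free: mark every inode we manage to lock as stale.
 * Returns how many inodes were marked.
 */
static int stale_ifree_cluster(struct stale_inode *inodes, int n)
{
    int marked = 0;

    for (int i = 0; i < n; i++) {
        if (!stale_trylock(&inodes[i]))
            continue;       /* lock held elsewhere: inode is skipped */
        inodes[i].flags |= TOY_ISTALE;
        stale_unlock(&inodes[i]);
        marked++;
    }
    return marked;
}

/* Toy flush: refuse to write stale inodes. Returns true if written. */
static bool stale_iflush(struct stale_inode *ip)
{
    if (ip->flags & TOY_ISTALE)
        return false;
    ip->dirty = false;
    return true;
}
```

In this model the skipped inode is exactly the one that would be
written into the freed cluster, so the scan would have to either
wait for the lock or record the skipped inodes some other way.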
Does anyone have suggestions which direction we should take?