On Thu, Nov 10, 2011 at 10:11:33AM +1100, Dave Chinner wrote:
> You're making the assumption that log recovery has done the correct
> thing any only replayed entire unlink transactions and hence the
> filesystem is otherwise consistent (i.e that there are no other
> references). I think that's a bad assumption - there's no guarantee
> that the unlinked list only contains unreferenced inodes if there's
> been corruption and/or log replay was not able to be run.
We add inodes to the uncertain list if any of the following applies
a) are found in an inode btree record reachable from the root in
phase2, but they are suspect based on certain factors - else
we add them to the inode tree directly.
b) are found on the unlinked inodes list in phase3
c) a directory found in an reachable inode btree record points to
them in phase3
so any inodes that either has a link pointing to it, or an inode
allocation btree record pointing to it will still be added to the
uncertain inode list if they aren't on the actual inode btree yet.
Then later in phase3 we move all uncertain inodes that appear fine
back into the main inode record tree.
> I also think there's more to it than that. The walk of the inode list
> also marks all the blocks in the block map as containing inodes, and
> all the blocks still used by those inodes as data/bmap/attr types.
> This change removes that, so we're going to potentially lose that
> state if all the inodes in a block are on the unlinked list.
We still do that walk if we have any genuine reference to the inode.
If we don't have any reference but the unlinked list they can be
considered free - we'd free every ressources assoicated with them
on log recovery anyway.
> Hence we'll end up with blocks containing inodes that are still
> marked as used in the AGINO btree, but are marked as free space in
> the block map.
They aren't. We completely rebuild both the inode allocation and
space allocation bitmaps from the information we gather in the earlier
repair phases, and they will be in sync.
> the AG or filesystem (e.g. agi->agi_count - agi->agi_free). Yes, it will
> still spin for some time on this sort of corruption, but it won't
> get stuck, and it won't add new holes into our block/inode usage
This would basically take forever with thinkgs like Arek's filesystem
with almost 11 million inodes in each AG.