Document the new sparse inodes feature and how it affects the inobt records.
Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
---
.../allocation_groups.asciidoc | 157 ++++++++++++++++++++
design/XFS_Filesystem_Structure/docinfo.xml | 1
2 files changed, 155 insertions(+), 3 deletions(-)
diff --git a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
index 845b359..ca3210c 100644
--- a/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
+++ b/design/XFS_Filesystem_Structure/allocation_groups.asciidoc
@@ -388,6 +388,18 @@ Directory file type. Each directory entry tracks the type of the inode to
which the entry points. This is a performance optimization to remove the need
to load every inode into memory to iterate a directory.
+| +XFS_SB_FEAT_INCOMPAT_SPINODES+ |
+Sparse inodes. This feature relaxes the requirement to allocate inodes in
+chunks of 64. When the free space is heavily fragmented, there might exist
+plenty of free space but not enough contiguous free space to allocate a new
+inode chunk. With this feature, the user can continue to create files until
+all free space is exhausted.
+
+Unused space in the inode B+tree records is used to track which parts of the
+inode chunk are not inodes.
+
+See the chapter on xref:Sparse_Inodes[Sparse Inodes] for more information.
+
| +XFS_SB_FEAT_INCOMPAT_META_UUID+ |
Metadata UUID. The UUID stamped into each metadata block must match the value
in +sb_meta_uuid+. This enables the administrator to change +sb_uuid+ at will
@@ -977,9 +989,9 @@ Specifies the number of levels in the free inode B+tree.
[[Inode_Btrees]]
== Inode B+trees
-Inodes are allocated in chunks of 64, and a B+tree is used to track these chunks
-of inodes as they are allocated and freed. The block containing root of the
-B+tree is defined by the AGI's +agi_root+ value. If the
+Inodes are traditionally allocated in chunks of 64, and a B+tree is used to
+track these chunks of inodes as they are allocated and freed. The block
+containing the root of the B+tree is defined by the AGI's +agi_root+ value. If the
+XFS_SB_FEAT_RO_COMPAT_FINOBT+ feature is enabled, a second B+tree is used to
track the chunks containing free inodes; this is an optimization to speed up
inode allocation.
@@ -1111,6 +1123,145 @@ recs[1] = [startino,freecount,free]
1:[5792,9,0xff80000000000000]
Observe also that the AGI's +agi_newino+ points to this chunk, which has never
been fully allocated.
+[[Sparse_Inodes]]
+== Sparse Inodes
+
+As mentioned in the previous section, XFS allocates inodes in chunks of 64. If
+there are no free extents large enough to hold a full chunk of 64 inodes, the
+inode allocation fails and XFS claims to have run out of space. On a
+filesystem with highly fragmented free space, this can lead to out of space
+errors long before the filesystem runs out of free blocks.
+
+The sparse inode feature tracks inode chunks in the inode B+tree as if they
+were full chunks but uses some previously unused bits in the freecount field to
+track which parts of the inode chunk are not allocated for use as inodes. This
+allows XFS to allocate inodes one block at a time if absolutely necessary.
+
+The inode and free inode B+trees operate in the same manner as they do without
+the sparse inode feature; the B+tree header for the nodes and leaves uses the
++xfs_btree_sblock+ structure, which is the same as the header used in the
+xref:AG_Free_Space_Btrees[AGF B+trees].
+
+Leaves contain an array of the following structure:
+
+[source,c]
+----
+struct xfs_inobt_rec {
+ __be32 ir_startino;
+ __be16 ir_holemask;
+ __u8 ir_count;
+ __u8 ir_freecount;
+ __be64 ir_free;
+};
+----
+
+*ir_startino*::
+The lowest-numbered inode in this chunk, rounded down to the nearest multiple
+of 64, even if the start of this chunk is sparse.
+
+*ir_holemask*::
+A 16 element bitmap showing which parts of the chunk are not allocated to
+inodes. Each bit represents four inodes; if a bit is marked here, the
+corresponding bits in ir_free must also be marked.
+
+*ir_count*::
+Number of inodes allocated to this chunk.
+
+*ir_freecount*::
+Number of free inodes in this chunk.
+
+*ir_free*::
+A 64 element bitmap showing which inodes in this chunk are free. Bits that
+correspond to sparse regions are also set, even though those inodes do not
+exist and cannot be allocated.
+
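The relationship between the record fields can be sketched in C. This is a minimal illustration, not kernel code: the type `inobt_rec_host` and the helpers `holemask_to_inode_mask`, `popcount64`, and `inobt_rec_is_consistent` are hypothetical names, and the record is assumed to have already been converted from big-endian to host byte order. The rule that every hole must be marked in ir_free comes from the field description above; the checks on ir_count and ir_freecount are assumptions inferred from the example records in this section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INODES_PER_CHUNK        64
#define INODES_PER_HOLEMASK_BIT 4   /* 64 inodes / 16 holemask bits */

/* Host-endian copy of one inode B+tree record (hypothetical helper type). */
struct inobt_rec_host {
    uint32_t startino;
    uint16_t holemask;
    uint8_t  count;
    uint8_t  freecount;
    uint64_t free;
};

/* Expand the 16-bit holemask into a 64-bit mask with one bit per inode. */
uint64_t holemask_to_inode_mask(uint16_t holemask)
{
    uint64_t mask = 0;

    for (int bit = 0; bit < 16; bit++) {
        if (holemask & (1u << bit))
            mask |= (uint64_t)0xf << (bit * INODES_PER_HOLEMASK_BIT);
    }
    return mask;
}

/* Count the set bits in a 64-bit value. */
int popcount64(uint64_t v)
{
    int n = 0;

    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Check the field relationships described in the text. */
bool inobt_rec_is_consistent(const struct inobt_rec_host *rec)
{
    uint64_t holes = holemask_to_inode_mask(rec->holemask);

    /* Every inode covered by a hole must also be marked in ir_free. */
    if ((rec->free & holes) != holes)
        return false;
    /* Assumption: ir_count covers exactly the inodes outside the holes. */
    if (rec->count != INODES_PER_CHUNK - popcount64(holes))
        return false;
    /* Assumption: ir_freecount counts only free inodes outside the holes. */
    if (rec->freecount != popcount64(rec->free & ~holes))
        return false;
    return true;
}
```

For a record with holemask 0xff, `holemask_to_inode_mask` yields 0xffffffff, so the low 32 inodes of the chunk are holes and a consistent record would carry a count of 32.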
+=== xfs_db Sparse Inode AGI Example
+
+This example derives from an AG that has been deliberately fragmented. First,
+the AGI:
+
+----
+xfs_db> agi 0
+xfs_db> p
+magicnum = 0x58414749
+versionnum = 1
+seqno = 0
+length = 6400
+count = 10432
+root = 2381
+level = 2
+freecount = 0
+newino = 14912
+dirino = null
+unlinked[0-63] =
+uuid = b9b4623b-f678-4d48-8ce7-ce08950e3cd6
+lsn = 0x600000ac4
+crc = 0xef550dbc (correct)
+free_root = 4
+free_level = 1
+----
+
+This AGI was formatted on a v5 filesystem; notice the extra v5 fields. So far
+everything else looks much the same as always.
+
+----
+xfs_db> addr root
+magic = 0x49414233
+level = 1
+numrecs = 2
+leftsib = null
+rightsib = null
+bno = 19048
+lsn = 0x50000192b
+uuid = b9b4623b-f678-4d48-8ce7-ce08950e3cd6
+owner = 0
+crc = 0xd98cd2ca (correct)
+keys[1-2] = [startino] 1:[128] 2:[35136]
+ptrs[1-2] = 1:3 2:2380
+xfs_db> addr ptrs[1]
+xfs_db> p
+magic = 0x49414233
+level = 0
+numrecs = 159
+leftsib = null
+rightsib = 2380
+bno = 24
+lsn = 0x600000ac4
+uuid = b9b4623b-f678-4d48-8ce7-ce08950e3cd6
+owner = 0
+crc = 0x836768a6 (correct)
+recs[1-159] = [startino,holemask,count,freecount,free]
+ 1:[128,0,64,0,0]
+ 2:[14912,0xff,32,0,0xffffffff]
+ 3:[15040,0,64,0,0]
+ 4:[15168,0xff00,32,0,0xffffffff00000000]
+ 5:[15296,0,64,0,0]
+ 6:[15424,0xff,32,0,0xffffffff]
+ 7:[15552,0,64,0,0]
+ 8:[15680,0xff00,32,0,0xffffffff00000000]
+ 9:[15808,0,64,0,0]
+ 10:[15936,0xff,32,0,0xffffffff]
+----
+
+Here we see the difference in the inode B+tree records. For example, in record
+2, we see that the holemask has a value of 0xff. Each holemask bit covers four
+inodes, so the first thirty-two inodes in this chunk record do not actually map
+to inode blocks; the first inode in this chunk is actually inode 14944:
+
+----
+xfs_db> inode 14912
+Metadata corruption detected at block 0x3a40/0x2000
+...
+Metadata CRC error detected for ino 14912
+xfs_db> p core.magic
+core.magic = 0
+xfs_db> inode 14944
+xfs_db> p core.magic
+core.magic = 0x494e
+----
+
+The chunk record also indicates that this chunk has 32 inodes, and that the
+missing inodes are also marked ``free'' in the free bitmap even though
+freecount is 0.
+
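The arithmetic behind inode 14944 can be sketched as follows. The helpers `sparse_inode_count` and `first_present_inode` are hypothetical names introduced only for illustration; the sketch relies solely on the rule stated earlier that each holemask bit represents four inodes, with bit 0 covering the lowest-numbered inodes in the chunk (which is what the xfs_db output above suggests).

```c
#include <assert.h>
#include <stdint.h>

#define INODES_PER_HOLEMASK_BIT 4   /* each holemask bit represents four inodes */

/* Number of inodes in the chunk that fall inside holes. */
unsigned int sparse_inode_count(uint16_t holemask)
{
    unsigned int n = 0;

    for (int bit = 0; bit < 16; bit++) {
        if (holemask & (1u << bit))
            n += INODES_PER_HOLEMASK_BIT;
    }
    return n;
}

/*
 * Lowest inode number in the chunk that is backed by an inode block,
 * or UINT32_MAX if the holemask covers the entire chunk.
 */
uint32_t first_present_inode(uint32_t startino, uint16_t holemask)
{
    for (int bit = 0; bit < 16; bit++) {
        if (!(holemask & (1u << bit)))
            return startino + (uint32_t)bit * INODES_PER_HOLEMASK_BIT;
    }
    return UINT32_MAX;
}
```

With record 2 (startino 14912, holemask 0xff), `sparse_inode_count` gives 32 and `first_present_inode` gives 14912 + 32 = 14944. With record 4 (startino 15168, holemask 0xff00), the holes cover the upper half of the chunk, so the first present inode is 15168 itself.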
[[Real-time_Devices]]
== Real-time Devices
diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
index 6189fd6..ba97809 100644
--- a/design/XFS_Filesystem_Structure/docinfo.xml
+++ b/design/XFS_Filesystem_Structure/docinfo.xml
@@ -104,6 +104,7 @@
<member>Discuss metadata integrity.</member>
<member>Document the free inode B+tree.</member>
<member>Create an index of magic numbers.</member>
+ <member>Document sparse inodes.</member>
</simplelist>
</revdescription>
</revision>