Dave,
Great read; thanks! It makes sense that the 1TB AG limit gets hit far faster
than the depth. Given those numbers, XFS does look feasible for billions of
files, for at least this scaling.
And as you point out, the arbitrary worstcase scenario hits other performance
and logic issues anyhow (pathological allocation patterns essentially).
Is there any xfstest which tries to see what happens when you fill up the AG
group? 1TB wouldn't be an entirely impossible amount of data to write, although
it would take a long time of course, and whatever it took to generate that
could take a while. It might function as an overall stress test in a worstcase
type of scenario? This is an XFS 101 question, but does / can it split the AG
once it approaches that limit? (And then of course the Btree will be reduced
as well, so it doesn't result in further issue, just curious about how the AG
filling up is handled.)
Cheers,
Shaun
Original Message
From: xfsbounces@xxxxxxxxxxx [mailto:xfsbounces@xxxxxxxxxxx] On Behalf Of
Dave Chinner
Sent: Wednesday, February 12, 2014 7:03 PM
To: Eric Sandeen
Cc: xfsoss
Subject: Re: Limits on agi>agi_level (and other btree depths?)
On Wed, Feb 12, 2014 at 05:54:19PM 0600, Eric Sandeen wrote:
> If agi>agi_level exceeds XFS_BTREE_MAXLEVELS (8), bad things happen.
> For example in xfs_inobt_init_cursor() we read it directly off disk
> into a btree cursor:
>
> xfs_inobt_init_cursor()
> cur>bc_nlevels = be32_to_cpu(agi>agi_level);
>
> and then when it's time to tear it down we'll index into bc_bufs[] buy
> whatever it said:
>
> xfs_btree_del_cursor()
> for (i = 0; i < cur>bc_nlevels; i++) {
> if (cur>bc_bufs[i])
> xfs_trans_brelse(cur>bc_tp, cur>bc_bufs[i]);
>
> but bc_bufs[] in the xfs_btree_cur is of fixed size:
>
> struct xfs_buf *bc_bufs[XFS_BTREE_MAXLEVELS]; /* buf ptr per
> level */
>
> where
>
> #define XFS_BTREE_MAXLEVELS 8 /* max of all btrees */
>
> (which means this limits any btree depth, not just agi, right...)
>
> ...
> So I ran across this on an intentionally corrupted image, but I don't
> know what stops us from going past XFS_BTREE_MAXLEVELS in normal
> operations (unless we just hit filesystem limits before
> then?)
Right, we hit filesystem limits before we get deeper than 8 levels.
For an AGI btree, ptr/key pairs in a node use 8 bytes, while records use 16
bytes. Hence worst case is a 512 byte blocksize filesystem where we get roughly
30 records/leaf and 60 ptr/key pairs per node.
So, the number of extents we can reference at different levels are:
level number
0 30
1 60 * 30
2 60^2 * 30
....
n 60^n * 30
In 1TB AG, worst case freespace is alternate single block freespace, so that's
1TB/blocksize/2 = (2^31*2^9) / 2^9 / 2^1 = 2^30 extent records for a 512 byte
blocksize filesystem.
2^30 : 1,073,741,824 records
n = 5 : 23,328,000,000 records
So the maximum possible number of levels for an AGI btree is 6 (5 node levels +
leaf level). The AGF btrees are the same (32 bit key/ptrs, 16 byte records)
The AGF freespace trees are more dense  the records are only 8 bytes so
there's 60/leaf. It still needs 6 levels though.
For the extent btree (bmbt) it can index 54 bits of file offset, so worst case
is single block fragments so 2^54 extents. Records are 16 bytes, key and
pointers are 8 bytes each. Hence 30/30 are the numbers for a 512 byte block
size fs. At level n, the extents are 30^n * 30 = 30^(n+1). So, solving 2^54 <=
30^(n+1) gives n = 11.
So in theory we could overflow XFS_BTREE_MAXLEVELS here, but in practice this
worst case requires 30^8 extents in memory, and that requires this much RAM:
656,100,000,000 * sizeof(struct xfs_bmbt_irec) bytes
= 656,100,000,000 * 32 bytes
~= 19TiB
And requires reading in from disk 512 bytes at a time. Nothing in XFS^WLinux
scales to indexing 19TiB of extent metadata with any efficiency in this manner.
And let's face it, if you have a 300TiB file in single 512 byte block
fragments, you've got bigger problems.
The least of which being that you should be using a larger block size...
Back in reality, if we take a 4k block size, the bmbt tree has a
240/240 breakdown, which means that the equation is actually 2^54 <= 240^(n+1),
and in that case n = 6, so we don't overflow XFS_BTREE_MAXLEVELS at all for the
normal mkfs cases.
> i.e. xfs_btree_new_root() does:
>
> /* Set the root in the holding structure increasing the level by 1.
> */
> cur>bc_ops>set_root(cur, &lptr, 1);
>
> and >set_root / xfs_inobt_set_root() will happily increase agi_level;
> I don't see anything limiting it to XFS_BTREE_MAXLEVELS.
Physical limits of the AGI due to the 1TB size of the AG.
> I guess XFS_BTREE_MAXLEVELS is just an arbitrary inmemory limit, not
> a limit of the underlying disk structures, but as it stands, we should
> be sure that we don't exceed it, right?
If you really want to enforce XFS_BTREE_MAXLEVELS, add checks into
xfs_alloc_compute_maxlevels(), xfs_ialloc_compute_maxlevels() and
xfs_bmap_compute_maxlevels() to constrain the limits in the struct xfs_mount
and validate the ondisk values based on the values in the struct xfs_mount.
> I was going to put that limit into xfs_agi_verify, but realized that I
> wasn't sure if we could actually exceed that depth in normal
> operations.
>
> (cue dchinner working out that 9 levels is 59 bazillion jillion items,
> and will never be hit?)
Yep, done that ;)
Cheers,
Dave.

Dave Chinner
david@xxxxxxxxxxxxx
_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
