On 11/11/13, 4:45 PM, Dave Chinner wrote:
> On Fri, Nov 08, 2013 at 12:21:20PM -0600, Eric Sandeen wrote:
>> On 10/31/13, 11:27 PM, Dave Chinner wrote:
>>> So, this patch removes most of the performance and CPU usage
>>> differential between v4 and v5 filesystems on traversal related
>>> workloads.
...
>> Just thinking out loud here: So this is runtime only; nothing on disk sets
>> the cluster size explicitly (granted, it never did).
>>
>> So moving back and forth across newer/older kernels will create clusters
>> of different sizes on the same filesystem, right?
>
> No - inodes are allocated in chunks, not clusters. Inode clusters
> are the unit of IO we read and write inodes in.
Hohum, confused that fundamental difference. Sorry. Kind of
invalidates my other questions doesn't it.
>> (In the very distant past, this same change could have happened if
>> the amount of memory in a box changed (!) - see commit
>> 425f9ddd534573f58df8e7b633a534fcfc16d44d; prior to that we set
>> m_inode_cluster_size on the fly as well).
>
> Right, I think I've already pointed that out.
>
>> But sb_inoalignmt is a mkfs-set, on-disk feature. So we might start with
>> i.e. this, where A1 are 8k alignment points, and 512 byte inodes, in clusters
>> of size 8k / 16 inodes:
>>
>> A1 A1 A1 A1
>> [ 16 inodes ][ 16 inodes ] [ 16 inodes ]
>
> Ok, here's where you go wrong. Inode chunks are always 64 inodes,
> and so what you have on disk after any inode allocation is:
>
> A1 A1 A1 A1
> [ 16 inodes ][ 16 inodes ][ 16 inodes ][ 16 inodes ]
>
> and sb_inoalign determines where A1 lands in terms of filesystem
> blocks. With sb_inoalign = 2 and a 4k filesystem block size, you can
> only align inode *chunks* to even filesystem blocks like so:
>
> ODD EVEN ODD EVEN ODD EVEN ODD EVEN ODD EVEN
> A1 A1 A1 A1 A1
> [ 16 inodes ][ 16 inodes ][ 16 inodes ][ 16 inodes ]
>
<snip a few examples>
> To be able to bump up the inode cluster size, what we have to
> guarantee is that the inode chunks align to the the larger cluster
> size like so:
>
> A2 A2
> A1 A1 A1 A1
> [ 16 inodes ][ 16 inodes ][ 16 inodes ][ 16 inodes ] <--- existing
> [ 32 inodes ][ 32 inodes ] <--- new
>
> i.e. inode chunk allocation needs to be aligned to A2, not A1 for
> the correct alignment of the larger clusters.
>
> If we align to A1, then this will happen:
>
> A2 A2 A2
> A1 A1 A1 A1 A1
> [ 16 inodes ][ 16 inodes ][ 16 inodes ][ 16 inodes ]
> [ 32 inodes ][ 32 inodes ] <--- new
>
> And that is clearly broken. Hence, to ensure we can use larger inode
> clusters, we have to ensure that the inode chunks are aligned
> appropriately for those cluster sizes. If the chunks are
> appropriately aligned for larger inode clusters (e.g. sb_inoalign =
> 4), then they are also appropriately aligned for inode cluster sizes
> older kernels support.
*nod* Ok.
>> So the only other thing I wonder about is when we are handling
>> pre-existing, smaller-than m_inode_cluster_size clusters.
>>
>> i.e. xfs_ifree_cluster() figures out the number of blocks &
>> number of inodes in a cluster, based on the (now not
>> constant) m_inode_cluster_size.
>>
>> What stops us from going off the end of a smaller cluster?
>
> The fact that we calculate the number of inodes to process per
> cluster based on the size of the cluster buffer (in blocks)
> multiplied by the number of inodes per block. If the code didn't
> work, we'd have found out a long time ago ;)
And I was rather stupidly confusing clusters & chunks, so never mind...
-Eric
|