[Top] [All Lists]

Re: [PATCH 5/5] xfs: increase inode cluster size for v5 filesystems

To: Ben Myers <bpm@xxxxxxx>
Subject: Re: [PATCH 5/5] xfs: increase inode cluster size for v5 filesystems
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Thu, 7 Nov 2013 11:32:50 +1100
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20131106213159.GV1935@xxxxxxx>
References: <1383280040-21979-1-git-send-email-david@xxxxxxxxxxxxx> <1383280040-21979-6-git-send-email-david@xxxxxxxxxxxxx> <20131105164307.GD32110@xxxxxxxxxxxxx> <20131105195650.GA6188@dastard> <20131106213159.GV1935@xxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Wed, Nov 06, 2013 at 03:31:59PM -0600, Ben Myers wrote:
> Hi Dave,
> On Wed, Nov 06, 2013 at 06:56:50AM +1100, Dave Chinner wrote:
> > On Tue, Nov 05, 2013 at 08:43:07AM -0800, Christoph Hellwig wrote:
> > > > +       if (xfs_sb_version_hascrc(&mp->m_sb)) {
> > > > +               int     new_size = mp->m_inode_cluster_size;
> > > > +
> > > > +               new_size *= mp->m_sb.sb_inodesize / XFS_DINODE_MIN_SIZE;
> > > > +               if (mp->m_sb.sb_inoalignmt >= XFS_B_TO_FSBT(mp, 
> > > > new_size))
> > > > +                       mp->m_inode_cluster_size = new_size;
> > > > +               xfs_info(mp, "Using inode cluster size of %d bytes",
> > > > +                        mp->m_inode_cluster_size);
> > > 
> > > printing this on every mount seem a bit too verbose.
> > 
> > I'd like to leave it there until we remove the experimental tag from
> > the v5 superblock configuration, as there is no good way of
> > determining that someone is using a mkfs patched to enable this
> > feature yet...
> I don't think I have a problem with bumping up the inode cluster size, but I 
> am
> a little concerned about two aspects of this patch:
> 1) Backward compatability issue with earlier v5 filesystems that don't support
> the larger inode cluster.  I know it's experimental, but what do those 
> failures
> look like?  This strikes me as being kind of scary.

There shouldn't be any failures - it should just work.

That is, larger alignment on older kernels will only affect
allocation alignment of new chunks, not the cluster size, and so
everything should work just fine because it's inode chunk alignment
that matters for larger inode clusters to work....

Also, log recovery knows about the fact that inode clusters could
potentially change size from mount to mount and handles such cases
appropriately e.g. see the comment in xlog_recover_buffer_pass2().

Hence there should be no problems at all.

If there are, discovering flaws in my understanding of how inode
alignment and clusters interact is part of reason I have not enabled
this feature on v4 superblocks (as the original commit explained).
If we discover one, then the worst case is that we change mkfs back
to setting sb_inoalign = 2 for the xfsprogs 3.2.0 release and this
code in the kernel becomes a no-op...

> 2) I don't like to overload the inode alignment mkfs option to do this.  I
> think it would be better if we explicitly set the inode cluster size at mkfs
> time.

Overload? Not at all - inode alignment was introduced back in 1996
specifically to alleviate performance issues with mapping inode
numbers to cluster buffers way back in 1996. The two have been
intimately related for a long time, and you can't use larger inode
clusters without first adjusting the inode alignment to support
those larger cluster sizes.

FWIW, the kernel is free to use whatever cluster size it likes as
they are an in-memory construct - the kernel can do inode IO in
single blocks if it wants to or needs to. However, it can't do
correct inode number to cluster offset calculations if the inode
chunk block number is not appropriately aligned for the size of the
cluster it wants to use.

Originally (1995), clusters were 4k:


months later, 8k:


But we don't have the history of mkfs available to what inode
alignment was used at the time.

Then when the various Irix trees got merged into a combined tree a
few months later, both 4k and 8k clusters were supported:


And so you can see that the cluster size was chosen base on the
amount of RAM the system had. However, because mkfs never set an
alignment of more than 2 blocks, the cluster size couldn't be
increased to be any larger....

> Or maybe this one should have an incompatability bit.  Maybe it doesn't need 
> to
> be a separate mkfs option.

It shouldn't need an incompatibility bit, nor should it be a
separate mkfs option for v5 filesystems. Whether either are
necessary for v4 filesystems is separate discussion.


Dave Chinner

<Prev in Thread] Current Thread [Next in Thread>