On Fri, Jul 25, 2014 at 12:30:57PM -0400, Brian Foster wrote:
> On Fri, Jul 25, 2014 at 08:32:11AM +1000, Dave Chinner wrote:
> > On Thu, Jul 24, 2014 at 10:22:50AM -0400, Brian Foster wrote:
> > > Hi all,
> > >
> > > This is a first pass at sparse inode chunk support for XFS. Some
> > > background on this work is available here:
> > >
> > > http://oss.sgi.com/archives/xfs/2013-08/msg00346.html
> > >
> > > The basic idea is to allow the partial allocation of inode chunks into
> > > fragmented regions of free space. This is accomplished through addition
> > > of a holemask field into the inobt record that defines what portion(s)
> > > of an inode chunk are invalid (i.e., holes in the chunk). This work is
> > > not quite complete, but is at a point where I'd like to start getting
> > > feedback on the design and what direction to take for some of the known
> > > gaps.
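[As a point of reference for the discussion that follows, here's a rough
sketch of the holemask idea in C. The field names, widths and the
bit-per-4-inodes convention are illustrative only; they are not lifted
from the patches.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch, not the exact on-disk layout from the patches:
 * a 64-inode chunk record where each set bit in ir_holemask marks a
 * 4-inode region that was never allocated (a hole in the chunk).
 */
struct inobt_rec_sketch {
	uint32_t ir_startino;	/* first inode number in the chunk */
	uint16_t ir_holemask;	/* 1 bit per 4 inodes; set = hole */
	uint8_t	 ir_count;	/* inodes actually allocated */
	uint8_t	 ir_freecount;	/* allocated inodes that are free */
	uint64_t ir_free;	/* free-inode bitmap, 1 bit per inode */
};

#define INODES_PER_CHUNK	64
#define INODES_PER_HOLEMASK_BIT	(INODES_PER_CHUNK / 16)

/* Return nonzero if inode @offset within the chunk falls in a hole. */
static int inode_is_hole(const struct inobt_rec_sketch *rec, int offset)
{
	return (rec->ir_holemask >> (offset / INODES_PER_HOLEMASK_BIT)) & 1;
}

/* Count the inodes that physically exist in a sparse chunk. */
static int chunk_inode_count(const struct inobt_rec_sketch *rec)
{
	int i, count = 0;

	for (i = 0; i < INODES_PER_CHUNK; i += INODES_PER_HOLEMASK_BIT)
		if (!inode_is_hole(rec, i))
			count += INODES_PER_HOLEMASK_BIT;
	return count;
}
```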
> > >
> > > The basic breakdown of functionality in this set is as follows:
> > >
> > > - Patches 1-2 - A couple generic cleanups that are dependencies for later
> > > patches in the series.
> > > - Patches 3-5 - Basic data structure update, feature bit and minor
> > > helper introduction.
> > > - Patches 6-7 - Update v5 icreate logging and recovery to handle sparse
> > > inode records.
> > > - Patches 8-13 - Allocation support for sparse inode records. Physical
> > > chunk allocation and individual inode allocation.
> > > - Patches 14-16 - Deallocation support for sparse inode chunks. Physical
> > > chunk deallocation, individual inode free and cluster free.
> > > - Patch 17 - Fixes for bulkstat/inumbers.
> > > - Patch 18 - Activate support for sparse chunk allocation and
> > > processing.
> > >
> > > This work is lightly tested for regression (some xfstests failures due
> > > to repair) and basic functionality. I have a new xfstests test I'll
> > > forward along for demonstration purposes.
> > >
> > > Some notes on gaps in the design:
> > >
> > > - Sparse inode chunk allocation granularity:
> > >
> > > The current minimum sparse chunk allocation granularity is the cluster
> > > size.
> > Looking at the patchset (I got to patch 5 that first uses this),
> > this is problematic. The cluster size is currently a kernel
> > implementation detail, and not something defined by the on-disk
> > format. We can change the cluster size in the kernel and not affect
> > the format on disk. Making the cluster size a part of the disk
> > format by defining it to be the resolution of sparse inode chunks
> > changes that - it's now a part of the on-disk inode format, and that
> > greatly limits what we can do with it.
> I was going off the inoalignmt bit that's set at mkfs time, but looking
> at the code I think I see what you mean...
> > > My initial attempts at this work tried to redefine to the minimum
> > > chunk length based on the holemask granularity (a la the stale macro I
> > > seemingly left in this series ;), but this involves tweaking the
> > > codepaths that use the cluster size (i.e., imap) which proved rather
> > > hairy.
> > This is where we need to head towards, though. The cluster size is
> > currently the unit of inode IO, so that needs to be influenced by the
> > sparse inode chunk granularity. Yes, we can define the inode chunk
> > granularity to be the same as the cluster size, but that simply
> > means we need to configure the cluster size appropriately at mount.
> > It doesn't mean we need to change what cluster size means or its
> > implementation....
> I don't think I've necessarily encoded the cluster size into the disk
> format explicitly, but rather used it as a heuristic for how to size the
> sparse inode chunk allocations. I think what you're saying is that the
> cluster size can change across a mount or code update, while the on-disk
> allocation state will not. Thus we can't always go on the assumption
> that the on-disk allocations will be sane with regard to the current
> cluster size. For example, I suppose if the cluster size increased after
> we have some existing sparse allocations we'd end up with some broken
> buffer mappings for inode chunks.
Yup, that's the problem in a nutshell: cluster size is not fixed.
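To make that concrete, here's a hypothetical helper (assuming a 64-inode
chunk with a 16-bit holemask where each set bit marks a 4-inode hole;
the names are made up for illustration). A cluster-sized buffer mapping
is only sane if every inode it covers physically exists, so once the
kernel's cluster grows past the granularity the chunk was allocated at,
a mapping can end up spanning a hole:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical check: is a cluster-sized inode buffer mapping valid,
 * i.e. does the cluster containing inode @offset avoid every hole in
 * the chunk? @offset and @cluster_inodes are in units of inodes within
 * a single 64-inode chunk; each holemask bit covers 4 inodes.
 */
static int cluster_mapping_valid(uint16_t holemask, int cluster_inodes,
				 int offset)
{
	int start = offset - (offset % cluster_inodes);
	int i;

	for (i = start; i < start + cluster_inodes; i += 4)
		if ((holemask >> (i / 4)) & 1)
			return 0;	/* mapping spans a hole */
	return 1;
}
```

E.g. with the upper half of a chunk sparse (holemask 0xff00), a 32-inode
cluster mapping over the allocated half is fine, but grow the cluster to
64 inodes and the same chunk can no longer be mapped without covering a
hole.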
> > Again, that's kernel inode buffer cache implementation details, not
> > something that matters for the on-disk format. So really these need
> > to be separated. Probably means we need a "sparse inode allocation
> > alignment" field in the superblock to define this. Having
> > the kernel reject sparse alignments it can't support from the
> > initial implementation means we can improve the kernel
> > implementation over time and (eventually) support sub-cluster sized
> > sparse inode allocation.
> > i.e. initial implementation only supports sparse alignment ==
> > cluster size, and rejects everything else....
> Yeah, that's pretty much what I was trying to accomplish, just not
> encoded properly with regard to the superblock options. E.g., I
> overloaded the cluster size in a fragile way. Instead, explicitly set
> the sparse inode allocation granularity and let that influence the
> cluster size. I suspect that means we could even still use the cluster
> size heuristic to define the default alloc. granularity, but of course
> we aren't married to it so can evaluate that independently. That sounds
> much nicer, thanks.
Right. We may not do anything else in the short or medium term, but
having a separate superblock field gives us the flexibility to
change the inode cluster implementation/behaviour in the future...
> > dynamically, because it means the same upgrade/downgrade cycle can
> > have different results simply due to filesystem freespace
> > fragmentation patterns...
> > Perhaps an xfs_admin command to turn the feature on dynamically for
> > existing filesystems, kind of like what we did with lazy superblock
> > counters when they were introduced?
> Hmm, I suppose that does create a new and interesting dynamic with
> regard to the feature bit (non-deterministic backwards compatibility).
> One could certainly value backwards compatibility over this particular
> feature, and there is currently no way to control it. I'll look into
> doing something with xfs_admin. In fact, I was thinking of adding
> something to tune the cluster size bit to get around the v5 scaling
> issue anyways.
What v5 scalability issue is that? I don't recall any outstanding
issues with inode cluster IO....