On Tue, Mar 17, 2015 at 09:25:15AM -0400, Brian Foster wrote:
> On Mon, Mar 16, 2015 at 05:00:20PM +1100, Dave Chinner wrote:
> > Hi Folks,
> > As I told many people at Vault last week, I wrote a document
> > outlining how we should modify the on-disk structures of XFS to
> > support host aware SMR drives on the (long) plane flights to Boston.
> > TL;DR: not a lot of change to the XFS kernel code is required, no
> > specific SMR awareness is needed by the kernel code. Only
> > relatively minor tweaks to the on-disk format will be needed and
> > most of the userspace changes are relatively straightforward, too.
> > The source for that document can be found in this git tree here:
> > git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> > in the file design/xfs-smr-structure.asciidoc. Alternatively,
> > pull it straight from cgit:
> > https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> > Or there is a pdf version built from the current TOT on the xfs.org
> > wiki here:
> > http://xfs.org/index.php/Host_Aware_SMR_architecture
> > Happy reading!
> Hi Dave,
> Thanks for sharing this. Here are some thoughts/notes/questions/etc.
> from a first pass. This is mostly XFS oriented and I'll try to break it
> down by section.
> I've also attached a diff to the original doc with some typo fixes and
> whatnot. Feel free to just fold it into the original doc if you like.
> == Concepts
> - With regard to the assumption that the CMR region is not spread around
> the drive, I saw at least one presentation at Vault that suggested
> otherwise (the skylight one iirc). That said, it was theoretical and
> based on a drive-managed drive. It is in no way clear to me whether that
> is something to expect for host-managed drives.
AFAIK, the CMR region is contiguous. The skylight paper spells it
out pretty clearly that it is a contiguous 20-25GB region on the
outer edge of the seagate drives. Other vendors I've spoken to
indicate that the region in host managed drives is also contiguous
and at the outer edge, and some vendors have indicated they have
much more of it than the seagate drives analysed in the skylight
paper.
If it is not contiguous, then we can use DM to make that problem go
away. i.e. use DM to stitch the CMR zones back together into a
contiguous LBA region. Then we can size AGs in the data device to
map to the size of the individual disjoint CMR regions, and we
have a neat, well aligned, isolated solution to the problem without
having to modify the XFS code at all.
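To make the DM stitching idea concrete, here's a hypothetical sketch
that builds dm-linear table lines mapping disjoint CMR regions
back-to-back into one contiguous LBA range. The device name and the
region geometry are invented for the example - real drives would
report their own zone layout:

```python
# Hypothetical sketch: stitch disjoint CMR regions into one contiguous
# logical device via dm-linear. Region offsets/lengths are made up.

def dm_linear_table(dev, cmr_regions):
    """Build dmsetup table lines, one per CMR region.

    cmr_regions is a list of (physical_start, length) in 512-byte
    sectors; each line maps the next logical range onto that region.
    Line format: <logical_start> <num_sectors> linear <dev> <phys_start>
    """
    lines, logical_start = [], 0
    for phys_start, length in cmr_regions:
        lines.append("%d %d linear %s %d"
                     % (logical_start, length, dev, phys_start))
        logical_start += length
    return lines

# Two example 10GiB CMR regions (20971520 sectors each):
for line in dm_linear_table("/dev/sdb", [(0, 20971520),
                                         (1000000000, 20971520)]):
    print(line)
```

Feeding the output to `dmsetup create` would yield the stitched
device; AGs in the data device could then be sized to match the
individual region boundaries.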
> - It isn't clear to me here and in other places whether you propose to
> use the CMR regions as a "metadata device" or require some other
> randomly writeable storage to serve that purpose.
Yes - we'd use the CMR region as the "metadata device" if there is
nothing else we can use.
I'd really like to see hybrid drives with the "CMR" zone being the
flash region in the drive....
> == Journal modifications
> - The tail->head log zeroing behavior on mount comes to mind here. Maybe
> the writes are still sequential and it's not a problem, but we should
> consider that with the proposition. It's probably not critical as we do
> have the out of using the cmr region here (as noted). I assume we can
> also cleanly relocate the log without breaking anything else (e.g., the
> current location is performance oriented rather than architectural).
We can place the log anywhere in the data device LBA space. You might
want to go look up what L_AGNUM does in mkfs. :)
And if we can use the CMR region for the log, then that's what we'll
do - "no modifications required" is always the best solution.
> == Data zones
> - Will this actually support data overwrite or will that return error?
We'll support data overwrite. xfs_get_blocks() will need to detect
overwrites so that new blocks can be allocated and the extent records
rewritten, rather than writing in place.
> - TBH, I've never looked at realtime functionality so I don't grok the
> high level approach yet. I'm wondering... have you considered a design
> based on reflink and copy-on-write?
Yes, I have. Complex, invasive and we don't even have basic reflink
infrastructure yet. Such a solution pushes us a couple of years
out, as opposed to having something before the end of the year...
> I know the current plan is to
> disentangle the reflink tree from the rmap tree, but my understanding is
> the reflink tree is still in the pipeline. Assuming we have that
> functionality, it seems like there's potential to use it to overcome
> some of the overwrite complexity.
There isn't much overwrite complexity - it's simply clearing bits
in a zone bitmap to indicate free space, allocating new blocks and
then rewriting bmbt extent records. It's fairly simple, really ;)
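To illustrate that three-step flow, here's a rough sketch in Python.
The structure and names are purely illustrative stand-ins, not actual
XFS code - the point is just the ordering: allocate new blocks at the
zone's write pointer, clear the old blocks' bits in the zone bitmap,
then rewrite the extent record:

```python
# Illustrative sketch of out-of-place overwrite in a sequential zone.
# Names (Zone, overwrite_extent) are hypothetical, not XFS structures.

class Zone:
    def __init__(self, start, size):
        self.start = start
        self.size = size
        self.write_ptr = 0            # next sequentially writable block
        self.used = [False] * size    # bitmap: True = live data

    def alloc(self, count):
        """Sequential-only allocation at the write pointer."""
        if self.write_ptr + count > self.size:
            raise OSError("ENOSPC: zone full, needs cleaning/reset")
        blk = self.start + self.write_ptr
        for i in range(self.write_ptr, self.write_ptr + count):
            self.used[i] = True
        self.write_ptr += count
        return blk

    def free(self, blk, count):
        """Clear bitmap bits; blocks are stale until the zone is reset."""
        off = blk - self.start
        for i in range(off, off + count):
            self.used[i] = False

def overwrite_extent(extent_map, fileoff, zone, old_blk, count):
    new_blk = zone.alloc(count)             # new blocks at write pointer
    zone.free(old_blk, count)               # old blocks become stale
    extent_map[fileoff] = (new_blk, count)  # rewrite bmbt-style record
    return new_blk
```

A zone bitmap plus a write pointer is all the state this needs, which
is why the simple bitmap allocator suffices here.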
> Just as a handwaving example, use the
> per-zone inode to hold an additional reference to each allocated extent
> in the zone, thus all writes are handled as if the file had a clone. If
> the only reference drops to the zoneino, the extent is freed and thus
> stale wrt to the zone cleaner logic.
> I suspect we would still need an allocation strategy, but I expect we're
> going to have zone metadata regardless that will help deal with that.
> Note that the current sparse inode proposal includes an allocation range
> limit mechanism (for the inode record overlaps an ag boundary case),
> which could potentially be used/extended to build something on top of
> the existing allocator for zone allocation (e.g., if we had some kind of
> zone record with the write pointer that indicated where it's safe to
> allocate from). Again, just thinking out loud here.
Yup, but the bitmap allocator doesn't have support for many of the
btree allocator controls. It's a simple, fast, deterministic
allocator, and all we need it to do is track freed space in the zones
as all allocation from the zones is going to be sequential...
> == Zone cleaner
> - Paragraph 3 - "fixpel?" I would have just fixed this, but I can't
> figure out what it's supposed to say. ;)
> - The idea sounds sane, but the dependency on userspace for a critical
> fs mechanism sounds a bit scary to be honest. Is in kernel allocation
> going to throttle/depend on background work in the userspace cleaner in
> the event of low writeable free space?
Of course. ENOSPC always throttles ;)
I expect the cleaner will work zone group at a time; locking new,
non-cleaner based allocations out of the zone group while it cleans
zones. This means the cleaner should always be able to make progress
w.r.t. ENOSPC - it gets triggered on a zone group before it runs out
of clean zones for freespace defrag purposes....
I also expect that the cleaner won't be used in many bulk storage
applications as data is never deleted. I also expect that XFS-SMR
won't be used for general purpose storage applications - that's what
solid state storage will be used for - and so the cleaner is not
something we need to focus a lot of time and effort on.
And the thing that distributed storage guys should love: if we put
the cleaner in userspace, then they can *write their own cleaners*
that are customised to their own storage algorithms.
> What if that userspace thing
> dies, etc.? I suppose an implementation with as much mechanism in libxfs
> as possible allows us greatest flexibility to go in either direction.
If the cleaner dies or can't make progress, we ENOSPC. Whether the
cleaner is in kernel or userspace is irrelevant to how we handle
that.
> - I'm also wondering how much real overlap there is in xfs_fsr (another
> thing I haven't really looked at :) beyond that it calls swapext.
> E.g., cleaning a zone sounds like it must map back to N files that could
> have allocated extents in the zone vs. considering individual files for
> defragmentation, fragmentation of the parent file may not be as much of
> a consideration as resetting zones, etc. It sounds like a separate tool
> might be warranted, even if there is code to steal from fsr. :)
As I implied above, zone cleaning is addressing exactly the same
problem as we are currently working on in xfs_fsr: defragmenting
free space.
> == Reverse mapping btrees
> - This is something I still need to grok, perhaps just because the rmap
> code isn't available yet. But I'll note that this does seem like
> another bit that could be unnecessary if we could get away with using
> the traditional allocator.
> == Mkfs
> - We have references to the "metadata device" as well as random write
> regions. Similar to my question above, is there an expectation of a
> separate physical metadata device or is that terminology for the random
> write regions?
"metadata device" == "data device" == "CMR" == "random write region"
> Finally, some general/summary notes:
> - Some kind of data structure outline would eventually make a nice
> addition to this document. I understand it's probably too early yet,
> but we are talking about new per-zone inodes, new and interesting
> relationships between AGs and zones (?), etc. Fine grained detail is not
> required, but an outline or visual that describes the high-level
> mappings goes a long way to facilitate reasoning about the design.
Sure - a plane flight is not long enough to do this. It will come in
future revisions, as the structure is clarified.
> - A big question I had (and something that is touched on down thread wrt
> to embedded flash) is whether the random write zones are runtime
> configurable. If so, couldn't this facilitate use of existing AG
> metadata (now that I think of it, it's not clear to me whether the
> realtime mechanism excludes or coexists with AGs)?
The "realtime device" contains only user data. It contains no
filesystem metadata at all. That separation of user data and
filesystem metadata is what makes it so appealing for supporting SMR
drives.
> IOW, we obviously
> need this kind of space for inodes, dirs, xattrs, btrees, etc.
> regardless. It would be interesting if we had the added flexibility to
> align it with AGs.
I'm trying to keep the solution as simple as possible. No alignment,
single whole disk only, metadata in the "data device" on CMR and
user data in "real time" zones on SMR.
> diff --git a/design/xfs-smr-structure.asciidoc
> index dd959ab..2fea88f 100644
Oh, there's a patch. Thanks! ;)