On Mon, Mar 16, 2015 at 05:00:20PM +1100, Dave Chinner wrote:
> Hi Folks,
> As I told many people at Vault last week, I wrote a document
> outlining how we should modify the on-disk structures of XFS to
> support host aware SMR drives on the (long) plane flights to Boston.
> TL;DR: not a lot of change to the XFS kernel code is required, no
> specific SMR awareness is needed by the kernel code. Only
> relatively minor tweaks to the on-disk format will be needed and
> most of the userspace changes are relatively straightforward, too.
> The source for that document can be found in this git tree here:
> in the file design/xfs-smr-structure.asciidoc. Alternatively,
> pull it straight from cgit:
> Or there is a pdf version built from the current TOT on the xfs.org
> wiki here:
> Happy reading!
Thanks for sharing this. Here are some thoughts/notes/questions/etc.
from a first pass. This is mostly XFS oriented and I'll try to break it
down by section.
I've also attached a diff to the original doc with some typo fixes and
whatnot. Feel free to just fold it into the original doc if you like.
- With regard to the assumption that the CMR region is not spread around
the drive, I saw at least one presentation at Vault that suggested
otherwise (the Skylight one, IIRC). That said, it was theoretical and
based on a drive-managed drive, so it is in no way clear to me whether
that is something to expect for host-managed drives.
- It isn't clear to me here and in other places whether you propose to
use the CMR regions as a "metadata device" or require some other
randomly writeable storage to serve that purpose.
== Journal modifications
- The tail->head log zeroing behavior on mount comes to mind here. Maybe
the writes are still sequential and it's not a problem, but we should
consider that with the proposition. It's probably not critical, as we do
have the out of using the CMR region here (as noted). I assume we can
also cleanly relocate the log without breaking anything else (e.g., the
current location is performance oriented rather than architectural).
== Data zones
- Will this actually support data overwrite or will that return error?
- TBH, I've never looked at realtime functionality so I don't grok the
high level approach yet. I'm wondering... have you considered a design
based on reflink and copy-on-write? I know the current plan is to
disentangle the reflink tree from the rmap tree, but my understanding is
the reflink tree is still in the pipeline. Assuming we have that
functionality, it seems like there's potential to use it to overcome
some of the overwrite complexity. Just as a handwaving example, use the
per-zone inode to hold an additional reference to each allocated extent
in the zone, so all writes are handled as if the file had a clone. Once
the only remaining reference is the zoneino's, the extent is freed and
thus stale wrt the zone cleaner logic.
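To make the handwaving slightly more concrete, here is a minimal sketch
of that extra-reference idea. All of the names (zone_extent, zext_*) are
invented for illustration and do not correspond to any existing XFS code;
the point is just that allocation takes a file ref plus a zoneino ref, and
an extent only becomes cleanable once the zoneino ref is the last one:

```c
/* Hypothetical sketch of the per-zone-inode reference idea: every
 * extent allocated in a zone gets an extra reference held by the
 * zone inode, so overwrites behave like writes to a cloned file and
 * an extent only becomes "stale" (reclaimable by the cleaner) once
 * the zone inode holds the sole remaining reference. */
#include <assert.h>
#include <stdbool.h>

struct zone_extent {
	unsigned long	start;		/* first block of the extent */
	unsigned long	len;		/* length in blocks */
	unsigned int	refcount;	/* file refs + the zoneino ref */
	bool		stale;		/* only the zoneino ref remains */
};

/* Allocation takes one ref for the owning file plus one for zoneino. */
static void zext_alloc(struct zone_extent *ze, unsigned long start,
		       unsigned long len)
{
	ze->start = start;
	ze->len = len;
	ze->refcount = 2;
	ze->stale = false;
}

/* Overwrite/unlink drops the file's ref; when only the zoneino ref
 * is left, the extent is dead space the cleaner may reclaim. */
static void zext_put(struct zone_extent *ze)
{
	if (--ze->refcount == 1)
		ze->stale = true;
}
```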
I suspect we would still need an allocation strategy, but I expect we're
going to have zone metadata regardless that will help deal with that.
Note that the current sparse inode proposal includes an allocation range
limit mechanism (for the case where an inode record overlaps an AG boundary),
which could potentially be used/extended to build something on top of
the existing allocator for zone allocation (e.g., if we had some kind of
zone record with the write pointer that indicated where it's safe to
allocate from). Again, just thinking out loud here.
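Still thinking out loud, a sketch of what such a zone record might look
like, assuming it carries the SMR write pointer. The struct and function
names here are hypothetical; the idea is that a sequential zone can only
hand out blocks at the write pointer, so a range-limited allocator maps
naturally onto it:

```c
/* Hypothetical per-zone record: in a host-managed sequential zone,
 * the only block it is safe to allocate is the one at the write
 * pointer, so allocation degenerates to bumping the pointer until
 * the zone fills and the caller must pick another zone. */
#include <assert.h>

struct zone_rec {
	unsigned long	zr_start;	/* first block of the zone */
	unsigned long	zr_blocks;	/* zone size in blocks */
	unsigned long	zr_wptr;	/* next safely writeable block */
};

/* Return the start block of a @len allocation, or -1 if the zone
 * cannot satisfy it. Only ever allocates at the write pointer. */
static long zone_alloc(struct zone_rec *zr, unsigned long len)
{
	if (zr->zr_wptr + len > zr->zr_start + zr->zr_blocks)
		return -1;	/* zone full: caller picks another zone */
	long ret = (long)zr->zr_wptr;
	zr->zr_wptr += len;
	return ret;
}
```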
== Zone cleaner
- Paragraph 3 - "fixpel?" I would have just fixed this, but I can't
figure out what it's supposed to say. ;)
- The idea sounds sane, but the dependency on userspace for a critical
fs mechanism sounds a bit scary, to be honest. Is in-kernel allocation
going to throttle on/depend on background work in the userspace cleaner in
the event of low writeable free space? What if that userspace thing
dies, etc.? I suppose an implementation with as much mechanism in libxfs
as possible allows us the greatest flexibility to go in either direction.
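To illustrate the worry: a toy policy function, with entirely made-up
counters and thresholds, showing how the kernel side ends up blocked on
the userspace cleaner once writeable zones run out, with no way to make
progress if that process has died:

```c
/* Toy allocation policy against hypothetical counters: below a low
 * threshold we kick the cleaner, at zero free zones we must throttle
 * and wait for it, and if the cleaner is dead we are simply stuck. */
#include <assert.h>
#include <stdbool.h>

struct zone_fs {
	unsigned int	free_zones;	/* zones still writeable */
	unsigned int	low_threshold;	/* kick the cleaner below this */
	bool		cleaner_alive;	/* userspace cleaner heartbeat */
};

/*  0: allocate now
 *  1: allocate, but kick background cleaning
 * -1: throttle and wait for the cleaner to free a zone
 * -2: no free zones and no cleaner: allocation cannot proceed
 */
static int zone_alloc_policy(const struct zone_fs *fs)
{
	if (fs->free_zones == 0)
		return fs->cleaner_alive ? -1 : -2;
	if (fs->free_zones < fs->low_threshold)
		return 1;
	return 0;
}
```

The -2 case is exactly the "what if that userspace thing dies" scenario;
keeping the mechanism in libxfs at least leaves open the option of an
in-kernel fallback.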
- I'm also wondering how much real overlap there is in xfs_fsr (another
thing I haven't really looked at :) beyond that it calls swapext.
E.g., cleaning a zone sounds like it must map back to the N files that
could have allocated extents in the zone, rather than considering
individual files for defragmentation; fragmentation of the parent file
may not be as much of a consideration as resetting zones, etc. It sounds
like a separate tool
might be warranted, even if there is code to steal from fsr. :)
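A sketch of that inverted lookup, under the assumption that something
rmap-like is available to the cleaner. The flat array and the names
(rmap_rec, zone_victims) are stand-ins, not real XFS structures; the
point is that the cleaner starts from a physical zone and asks "which
owners have extents in here", the reverse of fsr's file-first view:

```c
/* Given reverse-mapping records (physical extent -> owner inode),
 * count the extents overlapping a zone: the set a cleaner would have
 * to migrate elsewhere before the zone could be reset. */
#include <assert.h>
#include <stddef.h>

struct rmap_rec {
	unsigned long	start;	/* physical start block */
	unsigned long	len;	/* length in blocks */
	unsigned long	owner;	/* inode number owning the extent */
};

static size_t zone_victims(const struct rmap_rec *recs, size_t nrecs,
			   unsigned long zstart, unsigned long zlen)
{
	size_t n = 0;

	for (size_t i = 0; i < nrecs; i++)
		if (recs[i].start < zstart + zlen &&
		    recs[i].start + recs[i].len > zstart)
			n++;
	return n;
}
```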
== Reverse mapping btrees
- This is something I still need to grok, perhaps just because the rmap
code isn't available yet. But I'll note that this does seem like
another bit that could be unnecessary if we could get away with using
the traditional allocator.
- We have references to the "metadata device" as well as random write
regions. Similar to my question above, is there an expectation of a
separate physical metadata device, or is that terminology for the random
write regions?
Finally, some general/summary notes:
- Some kind of data structure outline would eventually make a nice
addition to this document. I understand it's probably too early yet,
but we are talking about new per-zone inodes, new and interesting
relationships between AGs and zones (?), etc. Fine grained detail is not
required, but an outline or visual that describes the high-level
mappings goes a long way to facilitate reasoning about the design.
- A big question I had (and something that is touched on down-thread
wrt embedded flash) is whether the random write zones are runtime
configurable. If so, couldn't this facilitate use of existing AG
metadata (now that I think of it, it's not clear to me whether the
realtime mechanism excludes or coexists with AGs)? IOW, we obviously
need this kind of space for inodes, dirs, xattrs, btrees, etc.
regardless. It would be interesting if we had the added flexibility to
align it with AGs.
> Dave Chinner