diff --git a/design/xfs-smr-structure.asciidoc b/design/xfs-smr-structure.asciidoc
index dd959ab..2fea88f 100644
--- a/design/xfs-smr-structure.asciidoc
+++ b/design/xfs-smr-structure.asciidoc
@@ -95,7 +95,7 @@ going to need a special directory to expose this information. It would be
 useful to have a ".zones" directory hanging off the root directory that
 contains all the zone allocation inodes so userspace can simply open them.
 
-THis biggest issue that has come to light here is the number of zones in a
+The biggest issue that has come to light here is the number of zones in a
 device. Zones are typically 256MB in size, and so we are looking at 4,000
 zones/TB. For a 10TB drive, that's 40,000 zones we have to keep track of. And
 if the devices keep getting larger at the expected rate, we're going to have to
@@ -112,24 +112,24 @@ also have other benefits...
 While it seems like tracking free space is trivial for the purposes of
 allocation (and it is!), the complexity comes when we start to delete or
 overwrite data. Suddenly zones no longer contain contiguous ranges of valid
-data; they have "freed" extents in the middle of them that contian stale data.
+data; they have "freed" extents in the middle of them that contain stale data.
 We can't use that "stale space" until the entire zone is made up of "stale"
 extents. Hence we need a Cleaner.
 
 === Zone Cleaner
 
 The purpose of the cleaner is to find zones that are mostly stale space and
-consolidate the remaining referenced data into a new, contigious zone, enabling
+consolidate the remaining referenced data into a new, contiguous zone, enabling
 us to then "clean" the stale zone and make it available for writing new data
 again.
 
-The real complexity here is finding the owner of the data that needs to be move,
-but we are in the process of solving that with the reverse mapping btree and
-parent pointer functionality. This gives us the mechanism by which we can
+The real complexity here is finding the owner of the data that needs to be
+moved, but we are in the process of solving that with the reverse mapping btree
+and parent pointer functionality. This gives us the mechanism by which we can
 quickly re-organise files that have extents in zones that need cleaning.
 
 The key word here is "reorganise". We have a tool that already reorganises file
-layout: xfs_fsr. The "Cleaner" is a finely targetted policy for xfs_fsr -
+layout: xfs_fsr. The "Cleaner" is a finely targeted policy for xfs_fsr -
 instead of trying to minimise fixpel fragments, it finds zones that need
 cleaning by reading their summary info from the /.zones/ directory and analysing
 the free bitmap state if there is a high enough percentage of stale blocks. From
@@ -142,7 +142,7 @@ Hence we don't actually need any major new data moving functionality in the
 kernel to enable this, except maybe an event channel for the kernel to tell
 xfs_fsr it needs to do some cleaning work.
 
-If we arrange zones into zoen groups, we also have a method for keeping new
+If we arrange zones into zone groups, we also have a method for keeping new
 allocations out of regions we are re-organising. That is, we need to be able to
 mark zone groups as "read only" so the kernel will not attempt to allocate from
 them while the cleaner is running and re-organising the data within the zones in
@@ -166,17 +166,17 @@ inode to track the zone's owner information.
 == Mkfs
 
 Mkfs is going to have to integrate with the userspace zbc libraries to query the
-layout of zones from the underlying disk and then do some magic to lay out al
+layout of zones from the underlying disk and then do some magic to lay out all
 the necessary metadata correctly. I don't see there being any significant
 challenge to doing this, but we will need a stable libzbc API to work with and
-it will need ot be packaged by distros.
+it will need to be packaged by distros.
 
-If mkfs cannot find ensough random write space for the amount of metadata we
-need to track all the space in the sequential write zones and a decent amount of
-internal fielsystem metadata (inodes, etc) then it will need to fail. Drive
-vendors are going to need to provide sufficient space in these regions for us
-to be able to make use of it, otherwise we'll simply not be able to do what we
-need to do.
+If mkfs cannot find enough random write space for the amount of metadata we need
+to track all the space in the sequential write zones and a decent amount of
+internal filesystem metadata (inodes, etc) then it will need to fail. Drive
+vendors are going to need to provide sufficient space in these regions for us to
+be able to make use of it, otherwise we'll simply not be able to do what we need
+to do.
 
 mkfs will need to initialise all the zone allocation inodes, reset all the zone
 write pointers, create the /.zones directory, place the log in an appropriate
@@ -187,13 +187,13 @@ place and initialise the metadata device as well.
 Because we've limited the metadata to a section of the drive that can be
 overwritten, we don't have to make significant changes to xfs_repair. It will
 need to be taught about the multiple zone allocation bitmaps for it's space
-reference checking, but otherwise all the infrastructure we need ifor using
+reference checking, but otherwise all the infrastructure we need for using
 bitmaps for verifying used space should already be there.
 
-THere be dragons waiting for us if we don't have random write zones for
+There be dragons waiting for us if we don't have random write zones for
 metadata. If that happens, we cannot repair metadata in place and we will have
 to redesign xfs_repair from the ground up to support such functionality. That's
-jus tnot going to happen, so we'll need drives with a significant amount of
+just not going to happen, so we'll need drives with a significant amount of
 random write space for all our metadata......
 
 == Quantification of Random Write Zone Capacity
@@ -214,7 +214,7 @@ performance, replace the CMR region with a SSD....
 
 The allocator will need to learn about multiple allocation zones based on
 bitmaps. They aren't really allocation groups, but the initialisation and
-iteration of them is going to be similar to allocation groups. To get use going
+iteration of them is going to be similar to allocation groups. To get us going
 we can do some simple mapping between inode AG and data AZ mapping so that we
 keep some form of locality to related data (e.g. grouping of data by parent
 directory).
@@ -273,19 +273,19 @@ location, the current location or anywhere in between. The only guarantee that
 we have is that if we flushed the cache (i.e. fsync'd a file) then they will at
 least be in a position at or past the location of the fsync.
 
-Hence before a filesystem runs journal recovery, all it's zone allocation write
+Hence before a filesystem runs journal recovery, all its zone allocation write
 pointers need to be set to what the drive thinks they are, and all of the zone
 allocation beyond the write pointer need to be cleared. We could do this during
 log recovery in kernel, but that means we need full ZBC awareness in log
 recovery to iterate and query all the zones.
 
-Hence it's not clear if we want to do this in userspace as that has it's own
-problems e.g. we'd need to have xfs.fsck detect that it's a smr filesystem and
+Hence it's not clear if we want to do this in userspace as that has its own
+problems e.g. we'd need to have xfs.fsck detect that it's an smr filesystem and
 perform that recovery, or write a mount.xfs helper that does it prior to
 mounting the filesystem. Either way, we need to synchronise the on-disk
 filesystem state to the internal disk zone state before doing anything else.
 
-This needs more thought, because I have a nagging suspiscion that we need to do
+This needs more thought, because I have a nagging suspicion that we need to do
 this write pointer resynchronisation *after log recovery* has completed so we
 can determine if we've got to now go and free extents that the filesystem has
 allocated and are referenced by some inode out there. This, again, will require
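
The write pointer resynchronisation described in the final hunk might end up
looking roughly like the sketch below: after log recovery, pull each zone's
filesystem-side write pointer back to whatever the drive reports, clear the
allocation state beyond it, and queue any extents referenced past that point for
a later fixup pass (which would walk the owners via the rmap btree). This is
only an illustrative sketch; every type and helper in it (struct
zone_alloc_info, query_drive_write_pointer(), clear_alloc_beyond(),
queue_extent_fixup()) is a hypothetical stand-in, not an existing XFS or libzbc
interface.

[source,c]
----
/*
 * Illustrative sketch only. Reconcile the filesystem's record of each zone's
 * write pointer with what the drive reports after a crash. All structures
 * and helpers are hypothetical stand-ins for whatever the zone allocation
 * inodes and the ZBC query path end up providing.
 */
#include <stdint.h>

struct zone_alloc_info {
	uint64_t	zone_start;	/* first block of the zone */
	uint64_t	zone_blocks;	/* zone size in filesystem blocks */
	uint64_t	fs_write_ptr;	/* write pointer recorded by the fs */
};

/* Hypothetical helpers assumed to exist for the purposes of this sketch. */
uint64_t query_drive_write_pointer(const struct zone_alloc_info *zone);
void clear_alloc_beyond(struct zone_alloc_info *zone, uint64_t drive_wp);
void queue_extent_fixup(const struct zone_alloc_info *zone,
			uint64_t from, uint64_t to);

void resync_zone_write_pointers(struct zone_alloc_info *zones, int nr_zones)
{
	for (int i = 0; i < nr_zones; i++) {
		struct zone_alloc_info *zone = &zones[i];
		uint64_t drive_wp = query_drive_write_pointer(zone);

		if (zone->fs_write_ptr > drive_wp) {
			/*
			 * Space the filesystem allocated beyond the drive's
			 * write pointer never made it to media; hand those
			 * ranges to a fixup pass that can walk the owners
			 * and free or rewrite the affected extents.
			 */
			queue_extent_fixup(zone, drive_wp, zone->fs_write_ptr);
			clear_alloc_beyond(zone, drive_wp);
		}

		/* Either way, adopt the drive's view of the write pointer. */
		zone->fs_write_ptr = drive_wp;
	}
}
----

Whether this runs in the kernel, in xfs.fsck, or in a mount.xfs helper is
exactly the open question the last paragraph raises; the loop itself would look
much the same in any of those places.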