On Wed, Apr 22, 2015 at 11:40:16PM -0700, Darrick J. Wong wrote:
> On Thu, Apr 23, 2015 at 11:13:45AM +1000, Dave Chinner wrote:
> > On Wed, Apr 22, 2015 at 05:44:26PM -0700, Darrick J. Wong wrote:
> > > On Wed, Apr 22, 2015 at 08:27:38AM +1000, Dave Chinner wrote:
> > > > On Mon, Apr 20, 2015 at 09:28:20PM -0700, Darrick J. Wong wrote:
> > > > > On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@xxxxxxxxxxx wrote:
> > > > crashing between the breaking of the shared reference and data
> > > > writeback doesn't leave us with a hole instead of data. To deal with
> > > > that, I think that we're going to have to break shared extents
> > > > during writeback, not during the write. However, we are going to
> > > > need a delalloc reservation to do that.
> > > >
> > > > So I suspect we need a new type of extent in the in-core extent tree
> > > > - a "delalloc overwrite" extent - so that when we map it in writeback
> > > > we can allocate the new extent, do the write to it, and then on IO
> > > > completion do the BMBT manipulation to break the shared reference
> > > > and insert the new extent. That solves the atomicity problem, and it
> > > > allows us to track COW data on a per-inode basis without having
> > > > to care about all the other reflink contexts to that same data.
> > >
> > > I think that'll work... in xfs_vm_writepage (more probably
> > > xfs_map_blocks) if
> > > the refcount > 2, allocate a new block, insert a new delalloc-overwrite
> > > in-core
> Speaking of which, should I add a XFS_DIFLAG_ to indicate that a file has
> (or has had) reflinked blocks? In theory this would save us a trip through
> the reflinkbt for "normal" files when the reflink feature is set, but
> we'd then have to maintain it (and repair would have to check it).
Yes, we probably should have a flag like this. It makes it simple to
determine where to look for extent info, but it also gives us some
redundant information as to whether the inode should have shared
extents rather than have them trashed as duplicate references to an
unshared block in repair...
> > > extent with the new block number and set a flag in the ioend to remind
> > > ourselves to update the bookkeeping later. During xfs_end_io if that
> > > flag is
> > > set, commit the new in-core extent to disk, decrement the refcounts, and
> > > free the block if the refcount is 1.
> > If we are going to track the overwrite in-core, then we are probably
> > going to need some form of intent/done transaction structure so that
> > we don't leak the allocated block if we crash before the completion
> > runs and commits the extent swap. I'd prefer to do that than require
> > on-disk state to prevent free space leakage in this case.
> > We could, potentially, abuse the EFI for this. i.e. record an EFI
> > for the extent in the allocation transaction, then in the completion
> > record a matching EFD. That way recovery will free the allocated
> > extent if we don't complete it....
> Clever! I was looking around to see if XFS had something that could
> take care of cleaning up orphans like that.
It's not intended for this purpose, but I think it will work just
fine - as long as the extent is not actually added to the on-disk
bmbt by the allocation transaction. The EFI is currently committed in
the same transaction that removes the extent from the BMBT to
indicate it is not referenced on disk and then it is freed in the
following transaction that also commits the EFD to indicate it is
referenced again (by the free space tree). The above would work the
opposite way around - EFI commited on removal from the free space
tree, EFD committed on addition to the BMBT...