On Thu, Apr 30, 2015 at 04:26:14PM +1000, Dave Chinner wrote:
> On Wed, Apr 29, 2015 at 07:53:30PM -0500, xfs@xxxxxxxxxxx wrote:
> > On Mon, Apr 20, 2015 at 09:28:20PM -0700, Darrick J. Wong wrote:
> > > On Mon, Apr 20, 2015 at 08:06:46PM -0500, xfs@xxxxxxxxxxx wrote:
> > > > My prototype already mostly works just using xfs_alloc_file_space() to
> > > > allocate the appropriate space in the destination inode, but I need to
> > > > get that allocated space populated from the shared inode's extents.
> > >
> > > I think you're asking how to copy extent map entries from one file to
> > > another?
> >
> > What I really wanted was something analogous to do_splice_direct() but for
> > operating on the inodes/address_space structs. I ended up just hacking
> > something together which does the copy ad-hoc directly via the address_space
> > structs and calling xfs_get_blocks() on buffer heads of the destination
> > pages,
> > without any readahead or other optimizations, at least it reads from and
> > populates the page caches.
> >
> > It looks like what you guys are working on is a more granular/block-level
> > COW
> > reflink implementation, which I assumed would be significantly more
> > challenging
> > and well beyond my ability to hack up quickly for experimentation.
> >
> > Below I'll summarize what I've hacked together. It's probably
> > inappropriate to
> > refer to this as a reflink, I've been referring to it as a COW-file in the
> > code.
>
> It's the same thing, just using a different COW mechanism to break
> the reflink.
>
> > A COW-file is a new inode which refers to another (shared) inode for its
> > data
> > until the COW-file is opened for writing. At that point it clones the
> > shared
> > inode's data as its own.
> >
> > Here's the mid-level details of the hack:
> >
> > 1. Add two members to the inode in the available padding:
> > be32 nlink_cow: Number of COW-file links to the inode
> > be64 inumber_cow: Number of backing inode if inode is a COW-file
>
> FYI, inode number alone is not enough to uniquely identify an inode.
> Needs the ino/gen pair as inode numbers can be reused.
>
> > 2. Introduc a new ioctl for creating a COW-file:
> > #define XFS_IOC_CREATE_COWFILE_AT _IOW ('X', 126, struct
> > xfs_create_cowfile_at)
> >
> > typedef struct xfs_create_cowfile_at
> > {
> > int dfd; /* parent directory */
> > char name[MAXNAMELEN]; /* name to create */
> > umode_t mode; /* mode */
> > } xfs_create_cowfile_at_t;
>
> Let's not create yet another incompatible reflink API. Please use
> OCFS2_IOC_REFLINK as the API and pull the ioctl and infrastructure
> within fs/ocfs2/refcounttree.c up to the VFS, add a ->reflink method
> to the operations structure and then hook ocfs2 and the new XFS code
> up to that interface. We don't want to duplicate all that code
> unnecessarily in an incompatible fashion.
I've been using BTRFS_IOC_CLONE{,_RANGE} so far, and eyeing zab's proposed
copy_file_range syscall.
> > 3. Derive a xfs_create_cowfile() from xfs_create() and xfs_link():
> > xfs_create_cowfile(
> > xfs_inode_t *dp, /* parent directory */
> > xfs_inode_t *sip, /* shared inode (ioctl callee) */
> > struct xfs_name *name, /* name of cowfile to create in dp */
> > umode_t mode, /* mode */
> > xfs_dev_t rdev,
> > xfs_inode_t **ipp) /* new inode */
> >
> > - ipp = allocate inode
> > - ipp->i_mapping->host = sip
> > - bump sip->nlink_cow
> > - ipp->inumber_cow = sip->di_ino
> > - create name in dp referencing ipp
> >
> > ipp starts out with the shared inode hosting i_mapping
>
> That's kinda messy. I'd like to leave sharing page cache completely
> out of the picture first, and get all the inode and extent
> manipulations correct first. most important is all the inode life
> cycle and unlink/unshare operations sorted, especially w.r.t.
> locking and transactions.
Heh heh heh. I found another one hidden one of those just this evening
(zeroing ranges for punch/zerorange). In addition to buffered writes,
dio writes (which lazily fall back to buffered for now), rm, and block-punch.
There was also the little matter of making FIEMAP work.
> Once we have that sanely under control, page cache sharing is a
> matter of reflecting the shared extents in the page cache via
> multiple page references (i.e. one for each mapping tree the page is
> inserted into) rather than playing games trying to share mappings.
>
> > 4. Modify xfs_setup_inode() to be inumber_cow-aware, opening the shared
> > inode
> > when set, and assigning to i_mapping->host.
>
> So you've added an xfs_iget() call to xfs_setup_inode() if the inode
> is a reflink inode? How does the locking work?
>
> > 5. Modify xfs_file_open() S_ISREG && inumber_cow && FMODE_WRITE:
> > - clear inumber_cow
> > - restore i_mapping->host to the inode being opened
> > - invalidate_inode_pages2(i_mapping)
> > - allocate all needed space in this inode
> > - copy size from shared inode to this inode
> > - copy all pages from the previously shared inode to this one
>
> So the copy is done on open(O_WR) of the file. files get opened in
> write mode for more than just writing date to them. e.g. fchmod,
> writing new attributes, etc. We don't want a data cow if this is the
> first operation that is done to the inode after the reflink is
> made...
That could be a lot of stuff to copy?
Anyway, I might have a RFC(rap) posting of reflink patches tomorrow.
I haven't added the "this is reflinked" inode flag optimization yet. I bet the
performance hit is awful. But, no more patching at 1am.
--D
> > 6. Modify xfs_vn_getattr() to copy stat->size from the shared inode if
> > inumer_cow is set.
>
> Why wouldn't you just put the size in the COW inode?
>
> > 7. Modify unlink paths to be nlink_cow-aware
>
> What happens when parent is unlinked? It stays on the unlinked list?
> Or do we know have the possibility that unlink can have two inodes
> to free in xfs-inactive()? What's the implication for transaction
> reservations? What's the locking order?
>
> reflink doesn't have this "readonly parent" wart. If the parent is
> modified, it breaks it's side of the reflink via COW. Simple,
> symmetric handling of the problem - it doesn't matter who writes to
> what, the unmodified inodes do not see any change.
>
> > 8. To simplify things, inodes that have nlink_cow no longer can be opened
> > for
> > writing, they've become immutable backing stores of sorts for other
> > inodes.
>
> So, someone marks a file as COW, and that makes it read only? What
> happens if there is already an open fd with write permissions when
> the COW ioctl is called? So, where does the immutable state of the
> inode get enforced? What's preventing it from being unlinked?
>
> > The one major question mark I see in this achieving correctness is the live
> > manipulation of i_mapping->host. It seems to work in my casual testing on
> > x86_64, this actually all works surprisingly well considering it's a fast
> > and
> > nasty hack. I abuse invalidate_inode_pages2() as if a truncate has
> > occurred,
> > forcing subsequent page accesses to fault and revisit i_mapping->host.
>
> I don't think you can't change mapping->host, as there is no one
> lock that protects access to it.
>
> > The goal here was to achieve something overlayfs-like but with inodes
> > capable
> > of being chowned/chmodded without triggering the copy_up, operations likely
> > necessary for supporting user namespaces in containers.
>
> Yup, reflink gives us this.
>
> > Additionally,
> > overlayfs has some serious correctness issues WRT multiply-opened files
> > spanning the lower and upper layers due to one of the subsequent opens
> > being a
> > writer. Since my hack creates distinct inodes from the start, no such issue
> > exists.
>
> Right, but it's not something that overlayfs can use unconditionally
> (reflink or your COW file) because overlayfs can have upper and
> lower directories on different filesystems. And, as has also been
> pointed out, it doesn't work on lower filesystems that are
> read-only.
>
> As such, I'm not sure that overlayfs will ever support this change
> of behaviour as it restricts overlayfs to a single writeable
> filesystem. I know, most people only need overlayfs on a single
> writeable filesystem, but....
>
> > However, one of the attractive things about overlayfs is the page cache
> > sharing
> > which my XFS hack doesn't enable due to the distinct struct addres_space's
> > and
> > i_mapping->host host exchanging. I had hoped to explore something KSM-like
> > with maybe some hints from XFS for these shared inode pagess saying "hey
> > these
> > are read-only pages in a shared inode, try deduplicate them!" but KSM is
> > purely
> > in vma land so that seems like a lot of work to make happen.
>
> As I mentioned before, page cache sharing needs to reflect shared
> blocks on disk. Playing games with KSM or mappings isn't a viable
> solution.
>
> > In any case, thanks for the responses and any further input you guys have.
> > Obviously a more granular btrfs-like reflink is preferable, and I'd welcome
> > it.
> > It just seemed like doing something overlayfs-like would be a whole lot
> > easier
> > to get working in a relatively short time.
>
> Overlayfs is creating more problems than it is solving. Hacking
> random one-off functionality into filesystems is not the way to fix
> overlayfs problems. Let's do reflink properly, then we can modify
> overlayfs to be able to use reflink when all it's working
> directories are on the same filesystem. The we can add page cache
> sharing to reflink files, and all of a sudden, the overlayfs page
> cache sharing problem is solved. Not to mention it is solved for all
> the other people that want reflink files for their applications,
> too.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
|