On Tue, Jan 14, 2014 at 05:20:33PM +0000, Al Viro wrote:
> On Tue, Jan 14, 2014 at 05:22:07AM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2014 at 11:56:46PM +0000, Al Viro wrote:
> > > On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote:
> > > > ping? Would be nice to get this into 3.14
> > >
> > > Umm... The reason for pipe_lock outside of ->i_mutex is this:
> > > default_file_splice_write() calls splice_from_pipe() with
> > > write_pipe_buf for callback. splice_from_pipe() calls that
> > > callback under pipe_lock(pipe). And write_pipe_buf() calls
> > > __kernel_write(), which certainly might want to take ->i_mutex.
> > >
> > > Now, this codepath isn't taken for files that have non-NULL
> > > ->splice_write(), so that's not an issue for XFS and OCFS2,
> > > but having pipe_lock nest differently relative to ->i_mutex for
> > > filesystems that do and do not have ->splice_write()... Ouch...
> >
> > What would be the alternative? Duplicating the code in even more
> > filesystems to enforce a non-natural locking order for filesystems
> > actually implementing splice? There don't actually seem to be a whole
> > lot of real filesystems not implementing splice_write, the prime use
> > would be for device drivers or synthetic ones. I'm not even sure
> > how much that fallback gets used in practice.

Hmm... In principle, the following would be no worse than what
generic_file_splice_write() is doing: confirm and map the pages, build
an iovec and use ->aio_write() to write it out, then unmap the suckers,
release the ones entirely written to the file and adjust the partially
written one. All under pipe_lock(). Hell, if we introduce
kernel_writev() (either by calling vfs_writev() or by taking
do_readv_writev() sans copying the iovec, and using that under set_fs()),
we could switch default_file_splice_write() to that and get rid of
->splice_write() for the majority of filesystems, if not all of them.

Sure, it means copying from pipe buffers to pagecache, but we have
generic_file_splice_write() do that copy anyway - the supposedly conditional
memcpy() in pipe_to_file() is actually unconditional; the
if (page != buf->page) in there had just been left behind by Nick back in
2007 ("1/2 splice: dont steal").

The problem Christoph was talking about is that generic_file_splice_write()
plays with ->i_mutex and both gets/drops it for each page of IO *and*
causes PITA for any fs that wants some locks of its own taken in addition
to ->i_mutex on the write paths. What ->splice_write() without page
stealing is doing is pretty much a writev() from an array of pages in kernel
space; so it looks like we might as well just reuse writev() guts for that...
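
To make the "build an iovec, write it out, release the fully written
buffers and adjust the partial one" idea concrete, here is a rough
userspace sketch. It is only an analogy, not kernel code: struct
fake_pipe, struct fake_pipe_buf, fake_pipe_consume() and
fake_pipe_writev() are all made-up names, plain writev(2) stands in for
the proposed kernel_writev(), and there is no locking, page mapping or
->confirm() step at all.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define FAKE_PIPE_BUFS 16	/* arbitrary size for this sketch */

/* Hypothetical stand-in for a pipe buffer: one "page" plus a window. */
struct fake_pipe_buf {
	char *data;		/* start of the backing memory */
	size_t offset;		/* first byte not yet written out */
	size_t len;		/* bytes still to be written out */
};

/* Hypothetical stand-in for the pipe: a linear (not circular) array. */
struct fake_pipe {
	struct fake_pipe_buf bufs[FAKE_PIPE_BUFS];
	unsigned int head;	/* index of the first occupied slot */
	unsigned int nrbufs;	/* number of occupied slots */
};

/*
 * Account for 'written' bytes: release the buffers that were written
 * out entirely and adjust the one that was only partially written.
 */
static void fake_pipe_consume(struct fake_pipe *p, size_t written)
{
	while (written > 0 && p->nrbufs > 0) {
		struct fake_pipe_buf *buf = &p->bufs[p->head];

		if (written >= buf->len) {
			/* entirely written: drop it from the pipe */
			written -= buf->len;
			buf->len = 0;
			p->head++;
			p->nrbufs--;
		} else {
			/* partially written: shrink the window */
			buf->offset += written;
			buf->len -= written;
			written = 0;
		}
	}
}

/*
 * Build an iovec over all occupied buffers, push it out with a single
 * writev(), then fix up the pipe according to what was accepted.
 * Returns whatever writev() returned.
 */
static ssize_t fake_pipe_writev(struct fake_pipe *p, int fd)
{
	struct iovec iov[FAKE_PIPE_BUFS];
	unsigned int i, n = p->nrbufs;
	ssize_t ret;

	assert(p->head + n <= FAKE_PIPE_BUFS);
	if (n == 0)
		return 0;

	for (i = 0; i < n; i++) {
		struct fake_pipe_buf *buf = &p->bufs[p->head + i];

		iov[i].iov_base = buf->data + buf->offset;
		iov[i].iov_len = buf->len;
	}

	ret = writev(fd, iov, (int)n);
	if (ret > 0)
		fake_pipe_consume(p, (size_t)ret);
	return ret;
}
```

A caller would fill in bufs[], set head and nrbufs, and call
fake_pipe_writev() in a loop until nrbufs drops to zero or an error comes
back. The point of the sketch is that a short write needs no per-buffer
callback; one pass over the array after the write is enough.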