[Top] [All Lists]

Re: [RFC] unifying write variants for filesystems

To: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Subject: Re: [RFC] unifying write variants for filesystems
From: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Date: Sat, 1 Feb 2014 22:43:01 +0000
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>, Jens Axboe <axboe@xxxxxxxxx>, Mark Fasheh <mfasheh@xxxxxxxx>, Joel Becker <jlbec@xxxxxxxxxxxx>, linux-fsdevel <linux-fsdevel@xxxxxxxxxxxxxxx>, xfs@xxxxxxxxxxx, Sage Weil <sage@xxxxxxxxxxx>, Steve French <sfrench@xxxxxxxxx>, Dave Kleikamp <shaggy@xxxxxxxxxx>, Anton Altaparmakov <anton@xxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <CA+55aFzEA-eM9v2PvsWx4v4ANaKXuRGYyGCkegJg++rhtHvnig@xxxxxxxxxxxxxx>
References: <20140114132207.GA25170@xxxxxxxxxxxxx> <20140114172033.GU10323@xxxxxxxxxxxxxxxxxx> <20140118064040.GE10323@xxxxxxxxxxxxxxxxxx> <CA+55aFw4LgyYEkygxHUnpKZg3jMACGzsyENc9a9rWFmLcaRefQ@xxxxxxxxxxxxxx> <20140118074649.GF10323@xxxxxxxxxxxxxxxxxx> <CA+55aFzM0N7WjqnLNnuqTkbj3iws9f3bYxei=ZBCM8hvps4zYg@xxxxxxxxxxxxxx> <20140118201031.GI10323@xxxxxxxxxxxxxxxxxx> <20140119051335.GN10323@xxxxxxxxxxxxxxxxxx> <20140120135514.GA21567@xxxxxxxxxxxxx> <CA+55aFzEA-eM9v2PvsWx4v4ANaKXuRGYyGCkegJg++rhtHvnig@xxxxxxxxxxxxxx>
Sender: Al Viro <viro@xxxxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Mon, Jan 20, 2014 at 12:32:01PM -0800, Linus Torvalds wrote:
> On Mon, Jan 20, 2014 at 5:55 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> > On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote:
> >> Folks, what do you think about the following:
> >
> > That's very much what Shaggy did in the aio-direct tree, except that
> > it kept using a single set of methods.  Linus really didn't like it
> > unfortunately.
> Umm. That wasn't what I objected to.
> I objected to the incredibly ugly implementation, the crazy new flags
> arguments for core helper functions, ugly naming etc etc. I even
> outlined what might fix it.
> In other words, I thought the code was shit and ugly. Not that
> iterators per se would be wrong. Just doing them badly is wrong.

Gyahhh...  OK, I should've known better than go looking into that thing
after such warning.  Some relatively printable notes (i.e. about 10%
of the comments I really had about that) follow:

* WTF bother passing 'pos' separately?  It's the same mistake that was
made with ->aio_read/->aio_write and just as with those, *all* callers
provably have pos == iocb->ki_pos.

* I'm not sure that iov_iter is a good fit here.  OTOH, it probably could
be reused (it has damn few users right now and they are on the codepaths
involved into that thing).

* We *definitely* want a variant structure with tag - unsigned long thing
was just plain insane.  I see at least two variants - array of iovecs
and array of (at least) triples <page, offset, length>.  Quite possibly -
quadruples, with "here's how to try to steal this page" thrown in, if
we want that as replacement for ->splice_write() as well (it looks like
the few instances that do steal on pipe-to-file splices could be dealt
with the same way as the dumb ones, provided that ->write_iter or whatever
we end up calling it is allowed to try and steal pages).   Possibly more
variants on the read side of things...  FWIW, I'm not sure that bio_vec
makes a lot of sense here.

* direction (i.e. source vs. destination) also should be a part of that
tag.  Which, BTW, turns ->direct_IO() into iocb x iov_iter -> int;
the situation with pos is identical to aio_read/aio_write.  While we
are at it, KERNEL_WRITE thing (used only for __swap_writepage()) looks
very much like a special case of "array of <page,offset,size>" we want
for splice_write...

* having looked through the ->read and ->write instances in drivers,
I'd say that surprisingly few helpers would go a _long_ way towards
converting those guys to the same methods.  And we need such helpers
anyway - there's a whole lot of (badly) open-coded "copy the whole
user buffer into kmalloc'ed array and NUL-terminate the sucker" logics
in ->write() instances, for example.  Even more "copy up to that much
into given array and NUL-terminate it", with rather amusing bugs in
there - e.g. NUL going into the end of array, regardless of the actual
amount of data to copy; junk is left in the middle, complete with
printk of the entire thing if it doesn't make sense.  IOW, spewing
random piece of kernel stack into dmesg.  Off-by-ones galore, too...

BTW, speaking of bogosities - grep for generic_write_checks().  Exactly
one caller (in __generic_file_aio_write()) has any business looking
at S_ISBLK(inode->i_mode) - it can be called by blkdev_aio_write().
All other callers have copied that, even though it makes absolutely
no sense for them...

* file_readable/file_writable looks really wrong; if nothing else, I would
rather check that after open() and set a couple of FMODE bits, then check
those.  Possibly even "knock FMODE_READ/FMODE_WRITE out if there's no

* pipe_buffer_operations ->map()/->unmap() should die; let the caller do
k{,un}map{,_atomic}().  All instances have the same method there and
there's no point to make it different.  PIPE_BUF_FLAG_ATOMIC should also

* WTF do iov_iter_copy_from_user_atomic() callers keep doing pagefault_disable
and pagefault_enable around it?  The sucker starts with kmap_atomic() and
ends with kunmap_atomic(); all instances of those guys have pagefaul
disabling/enabling (and I suspect that it might make sense to lift it
out of arch-specific variants - rename them to e.g. __kmap_atomic(),
rip pagefault_disable() out of those and put kmap_atomic() into highmem.h
outside of ifdef, with !HIGHMEM side of ifdef having #define __kmap_atomic
page_address; move pagefault_enable() from __kunmap_atomic() to
kunmap_atomic() while we are at it).

Note that e.g. pipe.c relies on kmap_atomic() disabling pagefaults - we
have __copy_from_user_inatomic() done there under kmap_atomic(), and we
really don't want to block in such conditions.

* pipe_iov_copy_from_user() ought to return how much has it managed to
bring in, not 0 or -EFAULT as it does now.  Then it won't need the
non-atomic side, AFAICS.  Moreover, we'll just be able to use
iov_iter_copy_from_user_atomic() (which badly needs a more palatable
name, BTW).

* sure, we want to call do_generic_file_read() once, passing the entire
iovec to file_read_actor().  But what the hell does that have to do with
introduction of new methods?  It's a change that makes sense on its
own.  Moreover, it's a damn good preparation to adding those - we
get generic_file_aio_read() into "set iov_iter up, then do <this>",
with <this> using iov_iter instead of iovec.  Once we get to introducing
those methods, it's just a matter of taking the rest of generic_file_aio_read()
into a separate function and making that function an instance of new

* Unrelated to this patchset, but... may I politely inquire about the reasons
why ->is_partially_uptodate() gets read_descriptor_t?  The whole point
of read_descriptor_t (if it has any) is that its interpretation is up to
whoever's doing the reading, _not_ what they are reading from.  So desc->arg
is off-limits for any instance of ->is_partially_uptodate().  desc->written is
obviously pointless for them; the need (or lack thereof) to do something
to the page doesn't depend on how much have we already read from the
file.  Moreover, reporting an error is rather dubious in such method;
if there's something fishy with the page, we'll just return "no" and let
->readpage() complain.  Which leaves only desc->count, which, unsurprisingly,
is the only thing looked at by (the only) instance of that method.  And
"tell me if the part of this page that long starting from that offset is
up to date" is much more natural that "is what this read_descriptor_t would
have had us read from this offset in this page up to date?"...

* do we really need separate non-atomic variant of iov_iter_copy_from_user()?
We have only two users right now (cifs and ceph) and both can use the
fault-in / atomic copy loop without much pain...

* in addition to having too many methods, I'm not convinced that we want
them to be methods.  Let's start with explicit checks in the primitives and
see where it goes from there; if we start to grow too many variants,
we can always introduce some methods, but then we'll be in better position
to decide what is and what is not a good method...

* on the read side, I really don't believe that exposing atomic and
non-atomic variants is a good idea.  Look at the existing users of
__copy_to_user_inatomic(); leaving aside the i915_gem weirdness,
all of them are used to implement the exact same thing: given a page,
offset and length, feed its contents to iovec/buffer/whatnot.  Unlike
the write side of things, there's nothing between prefaulting pages
and actual attempts to copy.  So let's make _that_ an exposed primitive
and let it deal with kmap/kmap_atomic/etc.  Variants that don't have
to worry about blocking (vector of <page,offset,length>, etc.) would
simply not bother with non-atomic kmap, etc.  Sure, it should take
iov_iter as destination.  And deal with mapping the damn thing internally...

* ntfs_file_buffered_write() should switch to iov_iter as well.  It's
open-coding a lot of iov_iter stuff.  It's not entirely trivial and
I'd really like to hear from ntfs folks on that, though, and the current
code looks deadlock-prone.  We prefault everything, then lock the pages
to which we'll be copying, then attempt to do __copy_from_user_inatomic(),
falling back to __copy_from_user() if that fails.  Fine, but what happens
if the source of write() is mmaped from the same file and we lose CPU while
prefaulting the last page, have memory pressure evict the first one, then
have __ntfs_grab_cache_pages() allocate and lock (nonuptodate) pages
we'll be copying to and have __copy_from_user() try to copy *from* those
same pages?  We are doing it while holding these pages locked, so pagefault
will have really fun time with them...  Anton?

BTW, Linus, when did you comment on that patchset?  I've found an iteration
of that patchset circa last October (v9, apparently the latest posted),
but it looks like your comments had either got lost or had been on the
earlier iteration of that thing...

<Prev in Thread] Current Thread [Next in Thread>