On Tue 05-06-12 15:51:50, Dave Chinner wrote:
> On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
> > > To me the issue at hand is that we have no method of serialising
> > > multi-page operations on the mapping tree between the filesystem and
> > > the VM, and that seems to be the fundamental problem we face in this
> > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> > > Hence it might be better to try to work out how to fix this entire
> > > class of problems rather than just adding a complex kuldge that just
> > > papers over the current "hot" symptom....
> > Yes, looking at the above table, the amount of different synchronization
> > mechanisms is really striking. So probably we should look at some
> > possibility of unifying at least some cases.
> It seems to me that we need some thing in between the fine grained
> page lock and the entire-file IO exclusion lock. We need to maintain
> fine grained locking for mmap scalability, but we also need to be
> able to atomically lock ranges of pages.
Yes, we also need to keep things fine grained to keep scalability of
direct IO and buffered reads...
> I guess if we were to nest a fine grained multi-state lock
> inside both the IO exclusion lock and the mmap_sem, we might be able
> to kill all problems in one go.
> Exclusive access on a range needs to be granted to:
> - direct IO
> - truncate
> - hole punch
> so they can be serialised against mmap based page faults, writeback
> and concurrent buffered IO. Serialisation against themselves is an
> IO/fs exclusion problem.
> Shared access for traversal or modification needs to be granted to:
> - buffered IO
> - mmap page faults
> - writeback
> Each of these cases can rely on the existing page locks or IO
> exclusion locks to provide safety for concurrent access to the same
> ranges. This means that once we have access granted to a range we
> can check truncate races once and ignore the problem until we drop
> the access. And the case of taking a page fault within a buffered
> IO won't deadlock because both take a shared lock....
You cannot just use a lock (not even a shared one) both above and under
mmap_sem. That is deadlockable in presence of other requests for exclusive
locking... Luckily, with buffered writes the situation isn't that bad. You
need mmap_sem only before each page is processed (in
iov_iter_fault_in_readable()). Later on in the loop we use
iov_iter_copy_from_user_atomic() which doesn't need mmap_sem. So we can
just get our shared lock after iov_iter_fault_in_readable() (or simply
leave it for ->write_begin() if we want to give control over the locking to
> We'd need some kind of efficient shared/exclusive range lock for
> this sort of exclusion, and it's entirely possible that it would
> have too much overhead to be acceptible in the page fault path. It's
> the best I can think of right now.....
> As it is, a range lock of this kind would be very handy for other
> things, too (like the IO exclusion locks so we can do concurrent
> buffered writes in XFS ;).
Yes, that's what I thought as well. In particular it should be pretty
efficient in locking a single page range because that's going to be
majority of calls. I'll try to write something and see how fast it can
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR