Re: Hole punching and mmap races

To: Dave Chinner
Subject: Re: Hole punching and mmap races
From: Marco Stornelli
Date: Tue, 5 Jun 2012 08:22:29 +0200
Cc: Jan Kara <jack@xxxxxxx>, linux-fsdevel@xxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx, linux-ext4@xxxxxxxxxxxxxxx, Hugh Dickins <hughd@xxxxxxxxxx>, linux-mm@xxxxxxxxx
2012/6/5 Dave Chinner <david@xxxxxxxxxxxxx>:
> On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
>> On Sat 19-05-12 11:40:24, Dave Chinner wrote:
>> > So let's step back a moment and have a look at how we've got here.
>> > The problem is that we've optimised ourselves into a corner with the
>> > way we handle page cache truncation - we don't need mmap
>> > serialisation because the combination of i_size and page locks
>> > means we can detect truncated pages safely at page fault time. With
>> > hole punching, we don't have that i_size safety blanket, and so we
>> > need some other serialisation mechanism to safely detect whether a
>> > page is valid or not at any given point in time.
>> >
>> > Because it needs to serialise against IO operations, we need a
>> > sleeping lock of some kind, and it can't be the existing IO lock.
>> > And now we are looking at needing a new lock for hole punching, I'm
>> > really wondering if the i_size/page lock truncation optimisation
>> > should even continue to exist. i.e. replace it with a single
>> > mechanism that works for both hole punching, truncation and other
>> > functions that require exclusive access or exclusion against
>> > modifications to the mapping tree.
>> >
>> > But this is only one of the problems in this area. The way I see it
>> > is that we have many kludges in the area of page invalidation w.r.t.
>> > different types of IO, the page cache and mmap, especially when we
>> > take into account direct IO. What we are seeing here is we need
>> > some level of _mapping tree exclusion_ between:
>> >
>> >     1. mmap vs hole punch (broken)
>> >     2. mmap vs truncate (i_size/page lock)
>> >     3. mmap vs direct IO (non-existent)
>> >     4. mmap vs buffered IO (page lock)
>> >     5. writeback vs truncate (i_size/page lock)
>> >     6. writeback vs hole punch (page lock, possibly broken)
>> >     7. direct IO vs buffered IO (racy - flush cache before/after DIO)
>>   Yes, this is a nice summary of the most interesting cases. For
>> completeness, here are the remaining cases:
>>   8. mmap vs writeback (page lock)
>>   9. writeback vs direct IO (as direct IO vs buffered IO)
>>  10. writeback vs buffered IO (page lock)
>>  11. direct IO vs truncate (dio_wait)
>>  12. direct IO vs hole punch (dio_wait)
>>  13. buffered IO vs truncate (i_mutex for writes, i_size/page lock for reads)
>>  14. buffered IO vs hole punch (fs dependent, broken for ext4)
>>  15. truncate vs hole punch (fs dependent)
>>  16. mmap vs mmap (page lock)
>>  17. writeback vs writeback (page lock)
>>  18. direct IO vs direct IO (i_mutex or fs dependent)
>>  19. buffered IO vs buffered IO (i_mutex for writes, page lock for reads)
>>  20. truncate vs truncate (i_mutex)
>>  21. punch hole vs punch hole (fs dependent)

I think we also have the XIP (execute-in-place) cases to consider here.
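
To make the i_size point from the quoted discussion concrete, here is a
rough userspace model of the check a page fault can do once it holds the
page lock. This is only a sketch, not kernel code: the model_* names,
the structs and MODEL_PAGE_SIZE are all made up for illustration, and
the locking itself is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical, simplified stand-ins for the kernel's mapping and page. */
struct model_mapping {
	uint64_t i_size;		/* file size in bytes */
};

struct model_page {
	struct model_mapping *mapping;	/* NULL once dropped from the cache */
	uint64_t index;			/* page offset within the file */
};

/* Truncate: shrink i_size, then detach pages wholly beyond the new size. */
static void model_truncate(struct model_mapping *m, struct model_page *pages,
			   size_t n, uint64_t new_size)
{
	m->i_size = new_size;
	for (size_t i = 0; i < n; i++)
		if (pages[i].index * MODEL_PAGE_SIZE >= new_size)
			pages[i].mapping = NULL;
}

/* Hole punch: detach cached pages fully inside [start, end), i_size untouched. */
static void model_punch_hole(struct model_mapping *m, struct model_page *pages,
			     size_t n, uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t off = pages[i].index * MODEL_PAGE_SIZE;

		if (pages[i].mapping == m && off >= start &&
		    off + MODEL_PAGE_SIZE <= end)
			pages[i].mapping = NULL;
	}
}

/* What the fault path can verify after taking the page lock. */
static bool model_fault_page_valid(const struct model_mapping *m,
				   const struct model_page *page)
{
	assert(m != NULL && page != NULL);
	if (page->mapping != m)
		return false;	/* page left the cache: caller must retry */
	return page->index * MODEL_PAGE_SIZE < m->i_size;
}
```

In this model a page detached by either operation fails the check, and a
page instantiated beyond the new i_size after a truncate fails it too.
But a page instantiated inside a punched range after the cache
invalidation still passes, because i_size never changed. That is the
missing safety blanket the quoted text talks about.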


