xfs
[Top] [All Lists]

Re: [PATCH v4 3/7] mm: Allow filesystems to defer cmtime updates

To: Jan Kara <jack@xxxxxxx>
Subject: Re: [PATCH v4 3/7] mm: Allow filesystems to defer cmtime updates
From: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
Date: Wed, 4 Sep 2013 13:05:27 -0700
Cc: "linux-kernel@xxxxxxxxxxxxxxx" <linux-kernel@xxxxxxxxxxxxxxx>, "linux-ext4@xxxxxxxxxxxxxxx" <linux-ext4@xxxxxxxxxxxxxxx>, Dave Chinner <david@xxxxxxxxxxxxx>, "Theodore Ts'o" <tytso@xxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>, xfs@xxxxxxxxxxx, Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>, Christoph Hellwig <hch@xxxxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20130904192048.GM3996@xxxxxxxxxxxxx>
References: <cover.1377193658.git.luto@xxxxxxxxxxxxxx> <e1620c8e4909a65e270c8e9590e307c22fd96a44.1377193658.git.luto@xxxxxxxxxxxxxx> <20130904145740.GE3996@xxxxxxxxxxxxx> <CALCETrUA653bw3C=XP=Es9UApqsKsDG-UAzuZ=qS1RA1LeoKYw@xxxxxxxxxxxxxx> <20130904192048.GM3996@xxxxxxxxxxxxx>
On Wed, Sep 4, 2013 at 12:20 PM, Jan Kara <jack@xxxxxxx> wrote:
> On Wed 04-09-13 10:54:50, Andy Lutomirski wrote:
>> >> @@ -1970,6 +1988,39 @@ int write_one_page(struct page *page, int wait)
>> >>  }
>> >>  EXPORT_SYMBOL(write_one_page);
>> >>
>> >> +void mapping_flush_cmtime(struct address_space *mapping)
>> >> +{
>> >> +     if (mapping_test_clear_cmtime(mapping) &&
>> >> +         mapping->a_ops->update_cmtime_deferred)
>> >> +             mapping->a_ops->update_cmtime_deferred(mapping);
>> >> +}
>> >> +EXPORT_SYMBOL(mapping_flush_cmtime);
>> >   Hum, is there a reason for update_cmtime_deferred() operation? I can
>> > hardly imagine anyone will want to do anything else than what
>> > inode_update_time_writable() does so why bother? You mention tmpfs & co.
>> > don't fit into your scheme well with which I agree so let's just keep
>> > file_update_time() in their page_mkwrite() operation. But I don't see a
>> > real need for avoiding the deferred cmtime logic...
>>
>> I think there might be odd corner cases.  For example, mmap a tmpfs
>> file, write it, and unmap it.  Then, an hour later, maybe the system
>   If you unmap it then that will handle the update. But if you won't unmap,
> you'd get spurious updates of timestamps which would be strange.
>
>> will be under memory pressure and page out the file.  This could
>> trigger a surprising time update.  (I'm not sure this can actually
>> happen on tmpfs, but maybe it would on some other filesystem.)
>>
>> Does this actually matter?  A flag to turn the feature on or off would
>> do the trick, but I don't think there's precedent for sticking a flag
>> in a_ops.
>   Flag in a_ops is ugly. But you can have a flag in 'struct
> filesystem_type' which would be reasonable.

OK, will do.

>
>> >> +void mapping_flush_cmtime_nowb(struct address_space *mapping)
>> >> +{
>> >> +     /*
>> >> +      * We get called from munmap and msync.  Both calls can race
>> >> +      * with fs freezing.  If the fs is frozen after
>> >> +      * mapping_test_clear_cmtime but before the time update, then
>> >> +      * sync_filesystem will miss the cmtime update (because we
>> >> +      * just cleared it) and we don't be able to write (because the
>> >> +      * fs is frozen).  On the other hand, we can't just return if
>> >> +      * we're in the SB_FREEZE_PAGEFAULT state because our caller
>> >> +      * expects the timestamp to be synchronously updated.  So we
>> >> +      * get write access without blocking, at the SB_FREEZE_FS
>> >> +      * level.  If the fs is already fully frozen, then we already
>> >> +      * know we have nothing to do.
>> >> +      */
>> >> +
>> >> +     if (!mapping_test_cmtime(mapping))
>> >> +             return;  /* Optimization: nothing to do. */
>> >> +
>> >> +     if (__sb_start_write(mapping->host->i_sb, SB_FREEZE_FS, false)) {
>> >> +             mapping_flush_cmtime(mapping);
>> >> +             __sb_end_write(mapping->host->i_sb, SB_FREEZE_FS);
>> >> +     }
>> >> +}
>> >   This is wrong because SB_FREEZE_FS level is targetted for filesystem
>> > internal use. Also it is racy. mapping_flush_cmtime() ends up calling
>> > mark_inode_dirty() and filesystems such as ext4 or xfs will start a
>> > transaction to store inode in the journal. This gets freeze protection at
>> > SB_FREEZE_FS level again. If freeze_super() sets s_writers.frozen to
>> > SB_FREEZE_FS before this second protection, things will deadlock.
>>
>> Whoops -- I assumed that it was safe to recursively take freeze
>> protection at the same level.
>>
>> I'm worried about the following race:
>>
>> Thread 1 (in munmap):
>> Check AS_CMTIME set
>> sb_start_pagefault
>>
>> Thread 2 (freezing the fs):
>> frozen = SB_FREEZE_PAGEFAULT;
>> sync_filesystem()
>>
>> Thread 1 is now stuck.  It doesn't need to be, because sync_filesystem
>> will flush out the cmtime write.  But there doesn't seem to be a clean
>> mechanism to wait for the freeze to finish.
>   OK, I see. Frankly, I'd rather live with msync() and munmap() blocking
> while filesystem is frozen than trying to outsmart the freezing logic...
> If someone comes up with a usecase where it causes trouble, we can always
> improve the logic with some clever tricks.

I'll at least check that it's a shared writable mapping before doing
the flush to avoid blocking on other types of munmap.

--Andy

<Prev in Thread] Current Thread [Next in Thread>