Christoph Hellwig wrote:
> Linux has sync_file_range which currently is a perfect way to lose your
> synced' data, but with two more flags and calls to ->fsync we could
> turn it into range-fsync/fdatasync.
Apart from the way it loses your data, the man page for
sync_file_range never manages to explain quite why you should use the
existing flags in various combinations. It's only obvious if you've
worked on a kernel yourself.
Having asked this before, it appears one of the reasons for
sync_file_range working as it does is to give the application more
control over writeback order and to some extent, reduce the amount of
But it's really difficult to manage the amount of blocking with it.
You need to know the request queue size among other things, and even
if you do it's dynamic. Writeback order would be as easy with
fdatasync_range, and if you want to reduce blocking, a good
implementation of aio_fsync would be more useful. Or, you have to use
application writeback threads anyway, so fdatasync_range again.
The one thing sync_file_range can do is let you submit multiple ranges
which the elevators can sort for the hardware. You can't do that with
sequential calls to fdatasync_range, and it's not clear that aio_fsync
is implemented well enough (but it's a fairly good API for it).
Nick Piggin's idea to let fdatasync_range take multiple ranges might
help with that, but it's not clear how much.
> I'm not sure if that's a good
> idea or if we should just add a sys_fdatasync_rage systems call.
fdatasync_range has the advantage of being comprehensible. People
will use it because it makes sense.
sync_file_range could be hijacked with new flags to implement
fdatasync_range. If that's done, I'd rename the system call, but keep
it compatible with sync_file_range's flags, which would never be set
when userspace uses the new functionality.
> I don't quite see the point of a range-fsync, but it could be easily
> implemented as a flag.
A flags argument would be good anyway: to indicate if we want an
ordinary fdatasync, or something which flushes the relevant bit of
volatile hardware caches too. With that as a capability, it is useful
to offer fsync, because that'd be the only way to get a volatile
hardware cache flush (or maybe the only way not to?).
For that reason, it should be permitted to give an infinitely large range.
I don't see the point of range-fsync either, but I'm not sure if I see
any harm in it. If permitted, range-fsync with a zero-byte range
would flush just the inode state and none of the data. If that's
technically available, maybe O_ISYNC and "#define O_SYNC
(O_DATASYNC|O_ISYNC)" isn't such as daft idea.
I'd call it fsync_range for consistency with aio_fsync (POSIX), which
takes flags O_DSYNC or O_SYNC to indicate the type of sync. But I'd
use new flag names, to keep the space clear for other flags. Just
sketching some ideas:
/* One of FSYNC_RANGE_SYNC or FSYNC_RANGE_DATASYNC must be set. */
#define FSYNC_RANGE_SYNC (1 << 0) /* Like fsync, O_SYNC. */
#define FSYNC_RANGE_DATASYNC (1 << 1) /* Like fdatasync, O_DSYNC. */
#define FSYNC_RANGE_NO_HWCACHE (1 << 2) /* Not hardware caches. */