
Re: pmlogrewrite questions from the developer meeting

To: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Subject: Re: pmlogrewrite questions from the developer meeting
From: fche@xxxxxxxxxx (Frank Ch. Eigler)
Date: Wed, 16 Apr 2014 20:11:42 -0400
Cc: Nathan Scott <nathans@xxxxxxxxxx>, pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <534C66BB.6050206@xxxxxxxxxxxxxxxx> (Ken McDonell's message of "Tue, 15 Apr 2014 08:52:43 +1000")
References: <01c501cf56de$54b6d370$fe247a50$@internode.on.net> <379318588.4720106.1397436550627.JavaMail.zimbra@xxxxxxxxxx> <534C66BB.6050206@xxxxxxxxxxxxxxxx>
User-agent: Gnus/5.1008 (Gnus v5.10.8) Emacs/21.4 (gnu/linux)
Hi, Ken -

kenj wrote:

> [...]  So before exit(), you're suggesting fsync() on each of the
> data files, and I think I need fsync() on the container directory as
> well.

One should do it not just before program exit, but earlier, just
before the unlink of the input files.  That way, we have some
guarantees that the output file will be on disk before the input files
get nuked.  (If we were to fsync only at the exit(), then the unlinks
could be flushed to disk quickly by the kernel, the new files might
not be flushed fully yet, and a crash at this point would lose all
copies of the data.)
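
The ordering described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the actual pmlogrewrite code, and it assumes a POSIX system where a directory can be opened read-only and fsync()d. The helper names fsync_path and replace_archive are hypothetical, not PCP functions.

```python
import os


def fsync_path(path):
    """Ask the kernel to flush a file or directory to stable storage."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def replace_archive(input_paths, output_paths):
    """Make the output files durable, and only then unlink the inputs.

    Hypothetical helper: both arguments are lists of file names.
    """
    parent_dirs = set()
    for path in output_paths:
        fsync_path(path)  # flush the new file's contents
        parent_dirs.add(os.path.dirname(os.path.abspath(path)))
    for directory in parent_dirs:
        fsync_path(directory)  # flush the directory entries of the new files
    # Only now is it safe to remove the old copies of the data.
    for path in input_paths:
        os.unlink(path)
```

With this ordering, a crash before the unlinks leaves both copies on disk, and a crash after them leaves at least the flushed output files.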


> But we should be consistent.  If this is the "right" way to do it
> then surely all applications that can write PCP archives should do
> the same thing.

(That was what I proposed with the fche/fsync branch.)


> I am not against doing this, although if one was concerned at this
> level then I suspect an option to enforce O_SYNC might be better to
> guarantee on disk for all writes, not just flushing everything at
> exit, but we should choose one policy for writing PCP archives and
> implement it consistently throughout the PCP ecosystem.

There is a spectrum, but yes, aiming the whole toolset consistently at
one point on that spectrum sounds appropriate.

1) a mode where archive data gets fsync()d (or O_SYNC'd) at every
incremental write (from pmlogger), so that all times are durable
and any time is a good time for crashing.  This of course means
accepting potentially more i/o latency.

2a) a mode where archive data gets fsync()d at close (from all bulk
archive creation tools), so that process exits correspond to durable
states.

2b) a mode where archive data gets fsync()d from only a few archive
creation tools deemed special.

not-3) a mode where archive data gets fsync()d asynchronously (on
receiving a SIGfoo or a pmlc flush), except that this doesn't provide
any synchronization (completion information) back, so it doesn't
identify a durability point; a client might as well system("/bin/sync").

3) a mode where we don't care about crash-robustness (status quo)

Perhaps this deserves to be a pcp.conf level option.
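
To make the spectrum concrete, modes 1, 2a and 3 could be expressed as a single policy switch, roughly as sketched below. This is Python for illustration only and assumes a POSIX system that provides O_SYNC. The Durability enum, its values and the ArchiveWriter class are hypothetical names, not existing PCP or pcp.conf settings.

```python
import os
from enum import Enum


class Durability(Enum):
    """Hypothetical policy names for the modes listed above."""
    EVERY_WRITE = "every-write"  # mode 1: each write is durable (O_SYNC)
    AT_CLOSE = "at-close"        # mode 2a: fsync() once, at close
    NONE = "none"                # mode 3: status quo, no explicit flush


class ArchiveWriter:
    """Append-only writer whose flushing behaviour follows the policy."""

    def __init__(self, path, policy):
        self.policy = policy
        flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
        if policy is Durability.EVERY_WRITE:
            flags |= os.O_SYNC  # every write() returns only once on disk
        self.fd = os.open(path, flags, 0o644)

    def write(self, record):
        view = memoryview(record)
        while view:  # os.write() may write fewer bytes than requested
            written = os.write(self.fd, view)
            view = view[written:]

    def close(self):
        if self.policy is Durability.AT_CLOSE:
            os.fsync(self.fd)  # process exit then corresponds to a durable state
        os.close(self.fd)
```

A tool would pick the policy once at startup and pass it to every writer it creates, which is what would keep the behaviour consistent across the toolset.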


- FChE
