Hi -
> [...] But the missing test is log verification post-failure, when we
> *know* logs should be ondisk. [...]
We already know that the current code base, even with the fsync
patches added, is not sufficient: there are few I/O error checks, and
generally none at exit. The fsync patches only claim to handle one
particular problem.
It's interesting to contemplate the sudden-power-off sort of testing,
but it's premature.
> Ideally, we'd also be using rename (tricky with 3 archive files
> but it might be possible) and issuing fsync on the directory
> housing the final archive names - only at that point would we be
> as safe as we can be.
Those may well be worthwhile further improvements. (Note that
fsync'ing directories is uncommon amongst other tools, despite what
the fsync(2) man page says.)
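(For the record, the pattern I'd have in mind is roughly the
following; the function name, temp-file handling and error paths are
just illustrative, not anything currently in libpcp.)

    /* Sketch of a write-temp / fsync / rename / fsync-dir sequence.
     * Names and error handling are illustrative, not from libpcp. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    safe_replace(const char *dir, const char *tmppath, const char *finalpath)
    {
        int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        /* ... write the archive data to fd ... */
        if (fsync(fd) < 0 || close(fd) < 0)
            return -1;
        if (rename(tmppath, finalpath) < 0)   /* atomic within one filesystem */
            return -1;
        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0)
            return -1;
        int sts = fsync(dfd);                 /* persist the new directory entry */
        close(dfd);
        return sts;
    }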
> > This one's tricky to measure well. fsync does not increase I/O
> > amount, only compresses it along the time axis.
>
> *nod* but it turns out that latter statement is not always true;
> consider 2 cases that thwart this:
>
> - if the archive we're fsync'ing (eg on close) was used for some
> intermediate computation, before a final write pass, then we can
> double, triple, etc (depending on #passes) the runtime of the
> tool by fsyncing at some inappropriate (too early) point [...]
The new code only fsyncs at archive file close time. That in turn
only affects files that were written to (because fsync of a read-only
file is a no-op). AFAIK, none of our tools append to existing
archives, so I'm not sure what kind of multi-write-pass processing you
are referring to.
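(To be concrete, the close path amounts to roughly this; a simplified
sketch, not the literal patch.)

    /* Simplified sketch of fsync-at-close; not the literal libpcp code. */
    #include <stdio.h>
    #include <unistd.h>

    int
    flush_and_close(FILE *f)
    {
        if (fflush(f) != 0)             /* stdio buffers -> kernel */
            return -1;
        if (fsync(fileno(f)) != 0)      /* kernel buffers -> stable storage */
            return -1;
        return fclose(f);
    }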
> Consider further - those files generated on an intermediate pass may
> then be unlinked before the following pass, and may never actually
> need to be written out. If we fsync'ed them... well, thats a case
> where we would increase the I/O amount via fsync!
Yes, in this scenario, if one knows that the intermediate files are
disposable, and one can't be bothered to place them into a tmpfs-style
RAM-backed volume, the fsyncs can cause more I/O than there otherwise
would have been. (Even then, fsync does not invent I/O: the kernel
could well have written the data out before the unlink anyway, which
becomes likelier as the volume of data grows.)
I hope that our other thread about future streamable archive files
will make this scenario moot, by using pipes instead of files for
such intermediate values.
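(Something along these lines, say, where the producer command name is
entirely made up.)

    /* Hypothetical sketch: stream one stage's output straight into the
     * next through a pipe, so disposable intermediate data never
     * touches the disk. */
    #include <stdio.h>

    int
    consume_stream(void)
    {
        FILE *in = popen("some-archive-producer", "r");  /* made-up command */
        if (in == NULL)
            return -1;
        int c;
        while ((c = getc(in)) != EOF) {
            /* ... consume the intermediate archive stream here ... */
        }
        return pclose(in);
    }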
> - most modern filesystems perform delayed allocation. If we
> make an assumption about when an appropriate time to flush is,
> we force the filesystems to allocate space early, and this can
> result in less optimal allocation patterns (causing seeks) [...]
I don't see how this applies. We only fsync at file close, after
which there will be no further allocation.
> [...] Usually copious amounts of unrelated dirty metadata will be
> caught up in it too, as the journal must be flushed for everything
> that changed up to the point of completing the final data write in
> the file. [...]
This applies to some extent, especially on ext3. On other file
systems, it's not as "copious".
> [...] When the application doing the fsync's is an ACID compliant
> database for which the system is dedicated, this is fine. But when
> the program issuing the fsyncs is just a performance tool [...]
I don't see how we can have it both ways. One can't on the one hand
complain "system crash with no data flush - tail of the file corruption,
etc., The last one is the most common ...", and on the other argue that
a pinpoint fix for that problem is not worth some cost after all.
> > Yup. We're a long way from the sort of robustness guarantees we might
> > like to have. This was just "low-hanging fruit" as they say, and
> > should mostly solve one particular problem you had encountered.
>
> The patches take the "insert fsync at salient-looking points
> in libpcp, esp log close" approach,
(Again, "salient-looking points" == "log file close".)
> [...] I think the tools are the right place to make these changes,
> and possibly even only via new command line options, for use by the
> daily log rotation regimes.
Knobs to override it are justifiable (like for that non-tmpfs
intermediate file case), but I suggest defaulting to greater
data-safety rather than less. This is the same default used for
modern text editors, git, and of course databases of all kinds:
they all fsync on close by default.
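(If an opt-out knob were added, I'd picture something like the
following; the variable name is invented purely for illustration.)

    /* Illustrative only: a hypothetical opt-out, with fsync ON by default. */
    #include <stdlib.h>
    #include <string.h>

    static int
    want_fsync(void)
    {
        const char *s = getenv("PCP_ARCHIVE_NOSYNC");   /* invented name */
        return (s == NULL || strcmp(s, "1") != 0);      /* default: safe path */
    }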
As to "for daily log rotation", are you suggesting that pmlogger's own
fresh/original output (which has the lowest write volume/rate, thus
the lowest cost for fsync's) should not be considered at least as
precious as the log-merging postprocessors'?
> This [fork/fsync] technique can be used to hide the synchronous
> writing latency (which is enormous on a clunky old laptop with an
> 800 megabyte PCP archive!) that fsync introduces, if all we need to
> do is initiate the I/O "earlier".
Yes, 800MB of I/O on a laptop, fsync or not, is a problem.
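(For what it's worth, the fork/fsync idea could look roughly like the
sketch below; error handling and child reaping are elided.)

    /* Sketch: pay the fsync cost in a forked child so the parent
     * isn't blocked on the write-out. */
    #include <sys/types.h>
    #include <unistd.h>

    void
    async_fsync(int fd)
    {
        pid_t pid = fork();
        if (pid == 0) {          /* child: the shared fd is valid here too */
            fsync(fd);
            _exit(0);
        }
        /* parent continues immediately; reap the child later (waitpid)
         * or ignore SIGCHLD so it doesn't linger as a zombie */
    }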
- FChE