
Re: Logged data integrity improvement ideas

To: Nathan Scott <nathans@xxxxxxxxxx>
Subject: Re: Logged data integrity improvement ideas
From: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Date: Fri, 14 Feb 2014 10:46:57 -0500
Cc: pcp developers <pcp@xxxxxxxxxxx>
User-agent: Mutt/1.4.2.2i
Hi -


> [...] But the missing test is log verification post-failure, when we
> *know* logs should be ondisk. [...]

We already know that the current code base, even with the fsync patches
added, is not sufficient: there are few I/O error checks, and generally
none at exit.  The fsync changes only claim to handle one particular
problem.

It's interesting to contemplate the sudden-power-off sort of testing,
but it's premature.  


> Ideally, we'd also be using rename (tricky with 3 archive files
> but it might be possible) and issuing fsync on the directory
> housing the final archive names - only at that point would we be
> as safe as we can be.

Those may well be worthwhile further improvements.  (Note that
fsync'ing directories is uncommon amongst other tools, despite what
the fsync(2) man page says.)


> > This one's tricky to measure well.  fsync does not increase I/O
> > amount, only compresses it along the time axis.
> 
> *nod* but it turns out that latter statement is not always true;
> consider 2 cases that thwart this:
> 
> - if the archive we're fsync'ing (eg on close) was used for some
> intermediate computation, before a final write pass, then we can
> double, triple, etc (depending on #passes) the runtime of the
> tool by fsyncing at some inappropriate (too early) point [...]

The new code only fsyncs at archive file close time.  That in turn only
affects files that were written to (because fsync of a read-only file is
a no-op).  AFAIK, none of our tools append to existing archives, so I'm
not sure what kind of multi-write-pass processing you are referring to.


> Consider further - those files generated on an intermediate pass may
> then be unlinked before the following pass, and may never actually
> need to be written out.  If we fsync'ed them...  well, that's a case
> where we would increase the I/O amount via fsync!

Yes, in this scenario, if one knows that the intermediate files are
disposable, and one can't be bothered to place them into a tmpfs
ram-disky volume, the fsyncs can cause more I/O than there might have
been.  (Even then, fsync does not invent I/O: the kernel could well
have written out the data before the unlink, which is likelier with
greater volume of data.)

I hope that our other thread about future streamable archive files
would make this scenario moot, by using pipes instead of files for
such intermediate values.


> - most modern filesystems perform delayed allocation.  If we
> make an assumption about when an appropriate time to flush is,
> we force the filesystems to allocate space early, and this can
> result in less optimal allocation patterns (causing seeks) [...]

I don't see how this applies.  We only fsync at file close, after
which there will be no further allocation.


> [...] Usually copious amounts of unrelated dirty metadata will be
> caught up in it too, as the journal must be flushed for everything
> that changed up to the point of completing the final data write in
> the file.  [...]

This applies to some extent, especially on ext3.  On other file
systems, it's not as "copious".


> [...]  When the application doing the fsync's is an ACID compliant
> database for which the system is dedicated, this is fine.  But when
> the program issuing the fsyncs is just a performance tool  [...]

I don't see how we can have it both ways.  One can't, on the one hand,
complain "system crash with no data flush - tail of the file corruption,
etc., The last one is the most common ...", and then argue that a
pinpoint fix for that very problem is not worth some cost after all.


> > Yup.  We're a long way from the sort of robustness guarantees we might
> > like to have.  This was just "low-hanging fruit" as they say, and
> > should mostly solve one particular problem you had encountered.
> 
> The patches take the "insert fsync at salient-looking points
> in libpcp, esp log close" approach, 

(Again, "salient-looking points" == "log file close".)

> [...] I think the tools are the right place to make these changes,
> and possibly even only via new command line options, for use by the
> daily log rotation regimes.

Knobs to override it are justifiable (like for that non-tmpfs
intermediate file case), but I suggest defaulting to greater
data-safety rather than less.  This is the same default used for
modern text editors, git, and of course databases of all kinds:
they all fsync on close by default.

As to "for daily log rotation", are you suggesting that pmlogger's own
fresh/original output (which has the lowest write volume/rate, thus
the lowest cost for fsync's) should not be considered at least as
precious as the log-merging postprocessors'?


> This [fork/fsync] technique can be used to hide the synchronous
> writing latency (which is enormous on a clunky old laptop with an
> 800 megabyte PCP archive!) that fsync introduces, if all we need to
> do is initiate the I/O "earlier".

Yes, 800MB of I/O on a laptop, fsync or not, is a problem.


- FChE
