Hi,
Sorry it's taken a while to get back to this one, but I wanted to
(attempt to) do it justice in terms of discussing I/O and some
filesystem internals.
----- Original Message -----
> > ["logging durability: add fsync(2) when closing log archives"]
> >
> > Can you describe the testing done for this change or planned?
>
> The pcpqa suite, plus strace examination that the fsync(2)'s are going
> out at the right time. There are still some known problems.
OK - excellent. We have coverage on any regression (e.g., if we
passed a dodgy fd & started to error out, something like that),
and we know the call is being made.
What we don't know is if the data is safe from a user's POV. We
must of course assume sanity in everything below the fsync(2)
call: glibc, the kernel and the hardware (sometimes a big
assumption, it turns out, heh), so no need to test that. But the
missing test is log verification post-failure, when we *know*
the logs should be on disk.
When I've done this in the past (not PCP related), I've used a
specialised crash test harness (as in, specially written for the
task) - performing data generation (and metadata, also important
to us here) using known patterns; fsync; crash *immediately* (cut
power); then verify on reboot. In our case, it's "do a daily log
rotation (high level operation), then immediate restart, verify".
Repeat, a lot - it's all as racy as can be - to build confidence,
so the harness needs to help with that aspect.
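To make that concrete, here's a rough sketch of the generate/verify
halves of such a harness (purely illustrative - the file name and
pattern are made up, and the power cut itself has to come from
outside the program):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define PATTERN "0123456789abcdef"   /* arbitrary known pattern */
    #define NBLOCKS 1024

    int main(int argc, char **argv)
    {
        char block[sizeof(PATTERN)];
        int fd, i;
        int verify = (argc > 1 && strcmp(argv[1], "verify") == 0);

        fd = open("crashtest.dat",
                  verify ? O_RDONLY : (O_CREAT|O_TRUNC|O_WRONLY), 0644);
        if (fd < 0) { perror("open"); exit(1); }

        for (i = 0; i < NBLOCKS; i++) {
            if (verify) {
                /* on reboot: check every block still holds the pattern */
                if (read(fd, block, sizeof(block)) != sizeof(block) ||
                    memcmp(block, PATTERN, sizeof(block)) != 0) {
                    fprintf(stderr, "corruption at block %d\n", i);
                    exit(1);
                }
            }
            else if (write(fd, PATTERN, sizeof(PATTERN)) != sizeof(PATTERN)) {
                perror("write"); exit(1);
            }
        }
        if (!verify && fsync(fd) < 0) { perror("fsync"); exit(1); }
        close(fd);
        printf(verify ? "pattern intact\n"
                      : "pattern written and fsync'ed - cut power now\n");
        return 0;
    }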
This seems silly (or might seem like we're testing the kernel)
for our case at first glance, but we need to be particularly wary
of the cases where we do things like writing to temporary files,
then rename 'em over the old files (-i option to pmlogrewrite,
for example), or where we generally replace one archive file with
another - because then it's not enough to just fsync the file,
and the user will see corruption.
For other reasons (below), the only places I'd encourage the use
of fsync for us are during daily log rotation, and explicitly
driven from individual tools (so, not initiated from libpcp on
log close, though likely the syncing will need to be done via
an internal double-underscore libpcp API).
Ideally, we'd also be using rename (tricky with 3 archive files
but it might be possible) and issuing fsync on the directory
housing the final archive names - only at that point would we be
as safe as we can be.
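For what it's worth, the shape of that pattern is roughly the
following (illustrative names only, not proposed PCP code) -
write the temp file, fsync it, rename over the target, then
fsync the directory:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* write the new content to a temp file, make it durable, then
     * atomically replace the old file and make the rename durable */
    int
    replace_file(const char *dir, const char *tmppath,
                 const char *finalpath, const char *data, size_t len)
    {
        int fd, dirfd;

        if ((fd = open(tmppath, O_CREAT|O_TRUNC|O_WRONLY, 0644)) < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        close(fd);

        if (rename(tmppath, finalpath) < 0)
            return -1;

        /* the rename only becomes durable once the directory
         * containing the final name has itself been fsync'ed */
        if ((dirfd = open(dir, O_RDONLY)) < 0)
            return -1;
        if (fsync(dirfd) < 0) {
            close(dirfd);
            return -1;
        }
        close(dirfd);
        return 0;
    }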
> ...
> This one's tricky to measure well. fsync does not increase I/O
> amount, only compresses it along the time axis.
*nod* but it turns out the latter statement is not always true;
consider two cases that thwart it:
- if the archive we're fsync'ing (e.g. on close) was used for some
intermediate computation, before a final write pass, then we can
double, triple, etc (depending on the number of passes) the
runtime of the tool by fsyncing at some inappropriate (too early)
point - we're not doing double passes in any of *our* tools,
AFAIK (Ken?), but the PMAPI may have been used by anyone in any
way, or might yet be used in the future in a multi-pass kind of
way.
Only the tools (as in, the PMAPI client tools & not necessarily
part of our shipped toolkit) know when the right time to fsync
is (e.g. when all passes over the data are finished), if at all!
Consider further - those files generated on an intermediate pass
may then be unlinked before the following pass, and may never
actually need to be written out. If we fsync'ed them... well,
that's a case where we would increase the I/O amount via fsync!
- most modern filesystems perform delayed allocation. If we
make an assumption about when an appropriate time to flush is,
we force the filesystem to allocate space early, and this can
result in less optimal allocation patterns (causing seeks) and
increase the amount of metadata for those files (so, the same
amount of data but a greater block/extent count) - another case
where we would end up increasing the amount of I/O.
But wait, there's more (and a free set of steak knives!). In
Valerie's article she hints at this - when fsync is issued,
due to the many complications of journalling and other weird
and wonderful filesystem internals, it's not simply the data
and the affected inode that will be getting flushed. Usually
copious amounts of unrelated dirty metadata will be caught up
in it too, as the journal must be flushed for everything that
changed up to the point of completing the final data write in
the file. So, writing the metadata for that file may require
new space to be allocated, which touches free space trees,
which may cause btree splits, all of which is yet more dirty
metadata - and all of those preceding operations will be
logged on fsync.
In modern XFS for example, even the transactions that would
go into the journal are delayed (not just delayed allocation
for data, IOW, but metadata operations as well), to allow it
to detect and drop any that are not required, and reduce the
amount of log I/O that is needed.
So, the writeout has to happen at *some* point - that can
only be delayed so long. Any fsync activity that we start
to engage in, however, is going to begin to circumvent these
filesystem optimisations, and cause impact on the system in
ways unrelated to our little toolkit. When the application
doing the fsyncs is an ACID-compliant database to which
the system is dedicated, this is fine. But when the program
issuing the fsyncs is just a performance tool trying to make
no impact on the system, we'll need to start taking care (it
may be the very filesystem that we are attempting to analyse
with our tools!).
Obviously, as stated in earlier mail, the more data we end
up writing and flushing (big logs vs little logs), the more
noticeable the impact.
>
> > There are also other subtleties discussed on the fsync(2) man page;
> > it's well worth a careful read too - the directory paragraph is
> > relevant to our needs.
>
> Yup. We're a long way from the sort of robustness guarantees we might
> like to have. This was just "low-hanging fruit" as they say, and
> should mostly solve one particular problem you had encountered.
The patches take the "insert fsync at salient-looking points
in libpcp, especially log close" approach, and I'm not a fan
of that for the above reasons - I think the tools are the
right place to make these changes, and possibly even only via
new command line options, for use by the daily log rotation
regimes.
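To sketch what I mean - and this is purely hypothetical, the
names below are invented and don't exist in libpcp today - the
split might look something like:

    #include <unistd.h>

    /*
     * Hypothetical internal helper (name invented for illustration):
     * fsync the descriptors behind one archive - the data, metadata
     * and index files.
     */
    int
    __pmLogSyncArchive(int datafd, int metafd, int indexfd)
    {
        if (fsync(datafd) < 0 || fsync(metafd) < 0 || fsync(indexfd) < 0)
            return -1;
        return 0;
    }

    /*
     * A tool would call it only after all passes over the archive
     * are complete, and only when asked to (e.g. via some new
     * command line option) - roughly:
     *
     *     if (sync_requested)
     *         sts = __pmLogSyncArchive(datafd, metafd, indexfd);
     */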
Another random note about the initial fsync-on-close approach:
I have seen some take the approach of fork(2)ing, then issuing
the fsync from the child (I don't think we should do this).
This technique can be used to hide the synchronous writing
latency (which is enormous on a clunky old laptop with an 800
megabyte PCP archive!) that fsync introduces, if all we need
to do is initiate the I/O "earlier". But, we should not do
that; I mention it only as an aside.
Hah, although having said that - we may want to add in a new
pmlc/pmlogger fsync command to complement the flush command;
that technique may well suit there (we do not want to delay
the pmlogger pmFetch'ing and write'ing, if at all possible).
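For reference, the shape of that fork/fsync trick is roughly
this (a sketch only, not actual pmlogger code; the child
reaping it needs is omitted):

    #include <sys/types.h>
    #include <unistd.h>

    /* initiate the fsync without making the caller wait for it */
    void
    fsync_in_background(int fd)
    {
        pid_t pid = fork();

        if (pid == 0) {
            /* child: shares the open file description, pays the
             * fsync latency, then exits */
            fsync(fd);
            _exit(0);
        }
        /* parent: returns immediately; a SIGCHLD handler or a
         * later waitpid() is needed to reap the child (omitted) */
    }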
cheers.
--
Nathan