----- Original Message -----
>
> > [...] - there is no redundancy or protection from failure - if
> > there is any kind of corruption, *all* data is lost (potentially
> > some weeks) from the wun beeeg log.
>
> What kinds of file corruption can one expect to deal with on modern
> systems, that would affect these files so badly? Do we want to be in
> the storage-safety department just on their account?
Oh, all manner of corruption - and I'm using a wide umbrella there for
"corruption", covering accidentally overwriting the start of a file
(which, bizarrely, happens more often than I'd expect); fat-fingered
sysadmins - accidental file removal; system crash with no data
flush - corruption of the tail of the file; etc, etc. The last one is
the most common, I think. All of these things I have seen happen, and
a lot more often than we'd expect / like.
> [...] I know of no
> modern file processing tool that deliberately slices up its own data,
> just to protect it from unspecified hypothetical breakage.
For example, all modern filesystems do this sort of thing - putting
multiple copies of critical data structures all over the place, and
going to great lengths to keep them at arm's length from each other
for recovery purposes ("block groups" in extN, "allocation groups"
in XFS - same idea). All old filesystems too, probably, and databases
as well - it's just sensible. The PCP archive format doesn't spread
critical structures throughout one file; instead we get our
protection by splitting data over multiple independent logs (the
.meta metadata, the .index temporal index, and the .0, .1, ... data
volumes), each with their own headers and so on - finding a good
balance between data isolation and usability.
re "unspecified" - fair enough, I was less verbose than I should've
been there - but we are absolutely not talking "hypothetical" here.
I've seen this on a number of occasions, pmloglabel(1) exists due
to one such incident (and its been used in anger on more than one
occassion!), pmlogcheck(1) exists for similar reasons.
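To make the label-checking idea concrete, here's a minimal sketch of
opening an archive and reading its label via the documented PMAPI
calls - an illustration of the kind of check pmloglabel(1) automates,
not the actual tool (build with -lpcp):

    /*
     * Sketch only: open a PCP archive context and dump its label.
     * pmloglabel(1) does this (and repairs problems) properly.
     */
    #include <stdio.h>
    #include <pcp/pmapi.h>

    int
    main(int argc, char **argv)
    {
        pmLogLabel  label;
        int         sts;

        if (argc != 2) {
            fprintf(stderr, "Usage: %s archive\n", argv[0]);
            return 1;
        }
        if ((sts = pmNewContext(PM_CONTEXT_ARCHIVE, argv[1])) < 0) {
            fprintf(stderr, "pmNewContext: %s\n", pmErrStr(sts));
            return 1;
        }
        if ((sts = pmGetArchiveLabel(&label)) < 0) {
            /* a corrupted header typically fails right here */
            fprintf(stderr, "pmGetArchiveLabel: %s\n", pmErrStr(sts));
            return 1;
        }
        printf("magic=0x%x host=%s start=%ld\n", label.ll_magic,
               label.ll_hostname, (long)label.ll_start.tv_sec);
        return 0;
    }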
> Plus, we may be surprised to what extent the record/framing structure
> of the files is robust enough to tolerate even outright overwrite/loss
> type damage to the interior. If this were a substantial concern, I'm
There are no surprises here - it's well known what happens; we've been
at this game for twenty years now :) [ Heh, came across pmlogger.c
copyright annotations in recent work there - (C) SGI 1995! ]
> sure we could improve the status quo (e.g. by searching for undamaged
> record frames heuristically instead of giving up at the first problem).
There is a huge difference between losing one day's worth of data vs
two weeks+. The "+" is because the current pmmgr scheme loses more
and more data as the collection period is increased - a single
corruption event can take out the whole log - whereas that's not the
case for pmlogger_daily, where the loss is capped at about a day.
That said, we could do more, you're right - perhaps the temporal
index could also be used to find valid starting records, and maybe
we could add CRCs (a rough sketch of that kind of resync scan is
below). Lots of things we could do - but I'm talking about the
default that we ship right now, and to my mind the current default
configuration for pmmgr increases the risk of data loss.
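On the heuristic-resync idea: v2 log records are framed with their
length at both head and tail (network byte order), so a reader that
has lost sync could, in principle, hunt forward for the next offset
where the two agree. A hypothetical sketch - not code we ship, and
the MAXREC plausibility bound is invented for illustration:

    /*
     * Sketch only: resync to the next plausible record in a data
     * volume by hunting for matching head/tail length words.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    #define MINREC  (2 * sizeof(uint32_t))  /* the two length words */
    #define MAXREC  (1024 * 1024)           /* invented upper bound */

    /* offset of next candidate record at/after 'start', else -1 */
    long
    resync(FILE *f, long start, long size)
    {
        uint32_t    head, tail;
        long        off;

        for (off = start; off + (long)MINREC <= size; off++) {
            if (fseek(f, off, SEEK_SET) != 0 ||
                fread(&head, sizeof(head), 1, f) != 1)
                break;
            head = ntohl(head);
            if (head < MINREC || head > MAXREC ||
                off + (long)head > size)
                continue;   /* implausible length, keep scanning */
            if (fseek(f, off + head - sizeof(tail), SEEK_SET) != 0 ||
                fread(&tail, sizeof(tail), 1, f) != 1)
                break;
            if (ntohl(tail) == head)
                return off; /* head == tail: likely a real record */
        }
        return -1;
    }

Of course a matching pair can occur by chance, so a real scanner
would want to cross-check against the temporal index or a CRC -
which is exactly why adding CRCs appeals.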
Something else that just occurred to me: the pmmgr model of rewriting
*every single archive* involved, *every single day*, further
increases the loss risk. Compare it to the other model, where the
daily logs remain *unchanged* forever after that first day - so
there's zero potential there for the partial-write style of
corruption after a crash.
> > I worry that we've taken the wrong default approach in pmmgr, and
> > wonder if we should default to the daily split once more? [...]
>
> As per the TODO, I plan to add a few more log-management options, and
> provide a generous number of knobs to parametrize them. Many
> alternatives make sense. The old pmlogger_daily model is one of the
> ones it should be able to emulate.
Absolutely, and it should be the default behaviour too. Wun beeg log
should be opt-in IMO.
cheers.
--
Nathan