
Re: pmmgr pmlogger default behaviour

To: Nathan Scott <nathans@xxxxxxxxxx>
Subject: Re: pmmgr pmlogger default behaviour
From: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Date: Thu, 30 Jan 2014 13:11:34 -0500
Cc: pcp developers <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <1583484726.15908616.1391037005341.JavaMail.root@xxxxxxxxxx>
References: <2108905700.15892281.1391033827018.JavaMail.root@xxxxxxxxxx> <1583484726.15908616.1391037005341.JavaMail.root@xxxxxxxxxx>
User-agent: Mutt/1.4.2.2i
Hi -


> AIUI, the default behaviour is to have the active pmlogger writing
> to "todays" archive (named archive-YYYYMMDD.nnnnnn), and a second
> archive for *all* preceding data (merged-archive-YYYYMMDD.nnnnnn).
> Data is lopped off the beginning of the merged archive and new data
> merged onto the end each day.  So this latter is the large log, and
> it shifts temporally each day in terms of start/end.

To clarify, this is not a compiled-in default, but one produced by the
presence of the "/etc/pcp/pmmgr/pmlogmerge" file.


> [...]  - there is no redundancy or protection from failure - if
> there is any kind of corruption, *all* data is lost (potentially
> some weeks) from the wun beeeg log.

What kinds of file corruption can one expect to deal with on modern
systems, that would affect these files so badly?  Do we want to be in
the storage-safety department just on their account?  I know of no
modern file processing tool that deliberately slices up its own data,
just to protect it from unspecified hypothetical breakage.

Plus, we may be surprised at the extent to which the record/framing
structure of the files is robust enough to tolerate even outright
overwrite/loss type damage to the interior.  If this were a substantial
concern, I'm sure we could improve the status quo (e.g. by searching for
undamaged record frames heuristically instead of giving up at the first
problem).
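
To make the "search for undamaged record frames" idea concrete, here is a
minimal sketch.  It assumes a hypothetical framing in which every record is
wrapped in a 4-byte big-endian length word at both ends, each giving the
total record size; the names pack_frame and scan_frames are invented for
illustration and are not part of libpcp.  On a mismatch the scanner steps
forward one byte and tries again instead of giving up.

```python
import struct

MIN_FRAME = 8  # leading length word + trailing length word, empty payload


def pack_frame(payload: bytes) -> bytes:
    """Wrap payload in the assumed framing: length, payload, same length."""
    word = struct.pack(">I", len(payload) + MIN_FRAME)
    return word + payload + word


def scan_frames(buf: bytes):
    """Yield (offset, payload) for every frame whose two length words agree.

    When the words disagree or the length is implausible, advance by one
    byte and retry, so damage in one record does not hide later records.
    """
    pos = 0
    end = len(buf)
    while pos + MIN_FRAME <= end:
        (total,) = struct.unpack_from(">I", buf, pos)
        if MIN_FRAME <= total <= end - pos:
            tail = pos + total - 4
            (trailer,) = struct.unpack_from(">I", buf, tail)
            if trailer == total:
                yield pos, buf[pos + 4:tail]
                pos += total
                continue
        pos += 1  # resynchronise: slide forward and look for the next frame
```

A real implementation would want stronger validation than two matching
length words (timestamps increasing, known record types), since short
payloads can match by accident.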


> - as there is no log rewrite support [...]

It's in the TODO.  It's not a big deal, and it is independent of the
daily vs. merged archives issue (since even the cron-style daily
archives are potentially merged from smaller sub-day slices).


> [...] where monitored applications change over time and metric
> metadata changes (so the log merging process *will* fail in
> practice, at critical times - usually right after a
> monitored-software upgrade, for example).  This unexpected pattern
> has caused data loss in the past (and hence logrewrite now exists).
> [...]

Note that if pmmgr encounters errors while merging log archives, it
specifically preserves the former inputs.  So the worst case *today*
is the accumulation of unmergeable/broken archives, not data loss.
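
The policy described here (keep the inputs unless the merge succeeds) can be
sketched roughly as follows.  This is not pmmgr's actual code: the function
name merge_then_prune and the injected merge callable are hypothetical, and
each archive is treated as a single opaque file, whereas a real PCP archive
is several files.

```python
import os
from typing import Callable, Sequence


def merge_then_prune(inputs: Sequence[str], output: str,
                     merge: Callable[[Sequence[str], str], None]) -> bool:
    """Merge inputs into output; delete the inputs only after success.

    'merge' is any callable that raises on failure (for example, a thin
    wrapper around an external merge tool).  Returns True if the inputs
    were merged and removed, False if they were left in place.
    """
    try:
        merge(inputs, output)
    except Exception:
        # Merge failed: touch nothing, so the worst case is a pile of
        # unmerged archives rather than lost data.
        return False
    for path in inputs:
        os.remove(path)
    return True
```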


> [...] Since the start and end points of the archive change each day,
> one cannot simply rsync/scp the production data to a central spot
> (e.g. for all their many production systems, from different data
> centres - as Aconex do, for example) because that would either
> overwrite or duplicate (or both) existing data.

Given that pmmgr/pmlogmerge file names change as per the timestamp,
they can't actually overwrite each other.  (That naming policy also
happens to block incremental updates via rsync, but that's the same
with the cron-based scheme.)


> Nor can a support person (with e.g. no production access) simply
> request a days log for analysis from the operations team - it'd
> involve a convoluted process of pmlogextracts, and so on (because of
> the "rubbery" start/end times).

Actually, for this purpose a large uncut archive could be better suited
than ones created at cron intervals.  One can ask for arbitrary
boundaries with -S/-T without having to splice things by hand (or feed
pmlogextract a wildcard full of potentially relevant archives).  Plus,
with scox's work generalizing the time specifications to be more
englishy, a la getdate(), this sort of query (-S "last tuesday 5pm"
-T "2 hours ago") should become even easier, again without splicing.
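
As a rough illustration of "one archive plus -S/-T", the helper below only
builds a pmlogextract command line for a single time window; it does not
run anything.  The helper name and the archive paths are hypothetical, and
whether a given time string is accepted depends on the PCP version in use
(the englishy forms quoted above rely on the getdate()-style work).

```python
import shlex
from typing import List


def extract_window_argv(archive: str, output: str,
                        start: str, end: str) -> List[str]:
    """Build argv for: pmlogextract -S start -T end archive output.

    The time strings are passed through untouched; validating them is
    left to pmlogextract itself.
    """
    return ["pmlogextract", "-S", start, "-T", end, archive, output]


# Hypothetical paths, purely for illustration.
argv = extract_window_argv("merged-archive", "one-day-slice",
                           "last tuesday 5pm", "2 hours ago")
print(shlex.join(argv))  # shell-quoted form, suitable for copy/paste
```

With per-day archives the same request would need every archive that
overlaps the window listed as an input, which is the splicing step the
single large archive avoids.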


> - keeping a central repo of all performance data (so, reaped from
> multiple pmmgr sites, eg from different data centres) is made much
> more difficult because of the naming convention, and again because
> of the rubbery start/end times of the long-term logs.

(This sounds like a duplicate of the above.)


> I worry that we've taken the wrong default approach in pmmgr, and
> wonder if we should default to the daily split once more?  [...]

As per the TODO, I plan to add a few more log-management options, and
provide a generous number of knobs to parametrize them.  Many
alternatives make sense.  The old pmlogger_daily model is one of the
ones it should be able to emulate.


- FChE
