Hi Frank,
I wanted to open up for wider discussion pmmgr's default behaviour
when managing the system pmlogger archives, as touched on here:
http://oss.sgi.com/bugzilla/show_bug.cgi?id=1044
"pmmgr normally generates relatively large merged-archive-* files."
AIUI, the default behaviour is to have the active pmlogger writing
to "today's" archive (named archive-YYYYMMDD.nnnnnn), and a second
archive for *all* preceding data (merged-archive-YYYYMMDD.nnnnnn).
Data is lopped off the beginning of the merged archive and new data
merged onto the end each day. So the latter is the large log, and
its start and end times shift each day.
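
To make the shifting start/end concrete, here is a minimal sketch of
the window the merged archive would cover on consecutive days. The
function name and the 14-day retention figure are assumptions for
illustration only; the real retention period depends on how pmmgr is
configured.

```python
from datetime import date, timedelta


def merged_window(today, retention_days):
    """Return the (start, end) dates a rolling merged archive would cover.

    Hypothetical model: the merged archive holds everything before
    today, trimmed to retention_days days in total.
    """
    end = today - timedelta(days=1)                 # yesterday is the newest merged day
    start = today - timedelta(days=retention_days)  # oldest day still retained
    return start, end


if __name__ == "__main__":
    # Both ends of the window move forward by one day, every day.
    for day in (date(2014, 1, 15), date(2014, 1, 16)):
        start, end = merged_window(day, retention_days=14)
        print(day, "->", start, "to", end)
```

Neither end of the window is stable from one day to the next, which
is what makes the merged archive awkward to copy or refer to by name.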
There are advantages & disadvantages to this. The advantage is we
get a single archive (well, except for today's data), which means the
tools can operate directly on more than a single day's worth of data.
And that is great!!! But...
The disadvantages are worrisome, however, and I think these are the
reason the pmlogger_daily(1) script-fu goes for the one archive per
day approach. Here are a few:
- there is no redundancy or protection from failure - if there is
any kind of corruption, *all* data in the one big log is lost
(potentially some weeks' worth).
- as pmmgr has no log rewrite support, log merging is bound to
fail in real production environments, where monitored applications
change over time and metric metadata changes with them (so the log
merging process *will* fail in practice, at critical times - usually
right after a monitored-software upgrade, for example). This failure
mode has caused data loss in the past (and hence pmlogrewrite(1) now
exists). This is fixable, but it's not there today, and who knows
if/when it might appear - defaults should reflect the current state
of the code.
- it's "unwieldy" in the field to have one massive archive, for a
larger organisation (with different groups involved with operations
and analysis, for example). Since the start and end points of the
archive change each day, one cannot simply rsync/scp the production
data to a central spot (e.g. for all their many production systems,
from different data centres - as Aconex do, for example) because
that would either overwrite or duplicate (or both) existing data.
Nor can a support person (with e.g. no production access) simply
request a day's log for analysis from the operations team - it'd
involve a convoluted process of pmlogextract runs, and so on
(because of the "rubbery" start/end times).
- keeping a central repo of all performance data (so, reaped from
multiple pmmgr sites, e.g. from different data centres) is made much
more difficult because of the naming convention, and again because
of the rubbery start/end times of the long-term logs.
I worry that we've taken the wrong default approach in pmmgr, and
wonder if we should default to the daily split once more? And if
so, I again say we should look into the YYYY/MM/DD split as a new
(additional, default) convention over YYYYMMDD archive naming.
Now is a good time - we're just setting out on our pmmgr adventure
so there's little/no installed user base to inconvenience.
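
As a rough illustration of the YYYY/MM/DD idea, the sketch below maps
an existing archive base name onto a dated directory path. The
function name, the regular expression, and the exact directory layout
are my assumptions here, not anything pmmgr or pmlogger_daily
implements today.

```python
import re
from pathlib import PurePosixPath

# Matches base names of the form archive-YYYYMMDD.nnnnnn, with an
# optional "merged-" prefix, as described earlier in this mail.
_ARCHIVE_RE = re.compile(r"^(?:merged-)?archive-(\d{4})(\d{2})(\d{2})\.\d{6}$")


def dated_path(name):
    """Return a hypothetical YYYY/MM/DD/<name> path for an archive base name."""
    m = _ARCHIVE_RE.match(name)
    if m is None:
        raise ValueError("not an archive-YYYYMMDD.nnnnnn name: %r" % name)
    year, month, day = m.groups()
    return PurePosixPath(year, month, day, name)


if __name__ == "__main__":
    print(dated_path("archive-20140115.000010"))
```

With a layout along these lines, each day's data has a fixed location
that never changes once written, so copying it to a central host would
neither overwrite nor duplicate earlier days.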
Longer term, if we manage to pull off some form of unified context
with transparent multi-archive support, the single advantage that
humungo-logs have goes away and we're left with only disadvantages
AFAICT...?
Thoughts? Feels like we should ponder deeply here, as we're making
long-ranging decisions - it may well be that pmmgr already supports
the above, alternate behaviour I'm after, but the default behaviour
is super-important as 99.99% of users will go that route. It feels
to me that it's sub-optimal at the moment, and it should be changed.
cheers.
--
Nathan