G'day Frank.
> -----Original Message-----
> From: pcp-bounces@xxxxxxxxxxx [mailto:pcp-bounces@xxxxxxxxxxx] On
> Behalf Of Frank Ch. Eigler
> Sent: Saturday, 20 February 2016 7:04 AM
> To: bugzilla-daemon@xxxxxxxxxxx
> Cc: pcp@xxxxxxxxxxx
> Subject: Re: [pcp] [Bug 1136] pmlogreduce type conversion conflicts
> with naive multi-archive same-type assertion
>
> Hi -
>
> kenj wrote, re PR1136:
>
> > (a) yes, the nirvana solution for pmlogreduce would be to convert
> > counter,bytes -> instant,bytes/sec
>
> Could you explain why? Random PCP tools would be just as able to do
> counter rate-conversion based on temporally-subsampled counter values
> as original counter values. Is there a complication other than the
> somewhat larger possibility of counter overflow between larger time
> samples? Could pmlogreduce handle those situations by emitting
> synthetic 0x0, 0xFFFFFF pairs of counter values for each complete
> overflow cycle? Then it could preserve the previous type/units.
There is a more fundamental (and compelling IMHO) reason ...
We regularly rotate logs (daily by default). Then if we try to combine
these into, say, a weekly archive we end up with 6 <mark> records. Now if
you try to pmlogreduce this to, say, hourly summaries (one sample every
hour), you end up with only 23 samples per day, thanks to the <mark>
records. Later on, if you try to further reduce these to 4-hourly samples,
1/4 of the samples are missing ... and later still, if you try to reduce
these to daily samples, all of the data is missing!
We could do better than this, especially since the original archives are (in
general) more than 99% complete.
Consider the case of a counter metric where the original data was sampled
at 5 minute intervals and the log rotation occurs just after midnight.
Then we have values for 11 of the 12 intervals from midnight to 1am, and
could assert that the "average" over the hour to 1am was the _time_
averaged value over the 11 observations (this is the area under the rate
curve where the values are known, divided by the length of time for which
the values are known). The more frequent the sampling, the more accurate
this approximation becomes ... we'd need some concept of a threshold on
the percentage of the interval covered by observations before accepting
the _time_ average as semantically OK. If all the values are known, then
the _time_ average above is arithmetically identical to the average
gradient we return today after interpolation and rate conversion.
This method also works for counter resets due to a PMDA or pmcd restart.
And for bonus points, it can be adjusted to handle counter wraps in a sane
manner.
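
To make the arithmetic concrete, here is a rough sketch of the sort of
time-averaged rate calculation I have in mind for one reduction interval.
This is Python with made-up sample data (not the PCP APIs), and the 32-bit
wrap heuristic and the 75% coverage threshold are illustrative assumptions,
not anything pmlogreduce does today:

# Sketch only: time-averaged rate for one reduction interval.
COUNTER_MAX = 1 << 32          # assume a 32-bit counter
COVERAGE_THRESHOLD = 0.75      # accept the average only if >= 75% of the
                               # interval is covered by observations

def reduced_rate(samples, interval_start, interval_end):
    """samples is a list of (time-in-seconds, counter-value) pairs.
    Return the time-averaged rate over [interval_start, interval_end],
    or None if too little of the interval is covered by observations."""
    covered = 0.0              # seconds spanned by usable sub-intervals
    total_delta = 0.0          # counter increase over those sub-intervals
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta < 0:
            # Counter went backwards: either a wrap or a pmcd/PMDA
            # restart.  Guess "wrap" if the corrected increment looks
            # plausible, otherwise treat it as a reset and drop this
            # sub-interval -- exactly the ambiguity discussed above.
            if v1 + (COUNTER_MAX - v0) < COUNTER_MAX // 2:
                delta = v1 + (COUNTER_MAX - v0)
            else:
                continue
        covered += t1 - t0
        total_delta += delta
    span = interval_end - interval_start
    if span <= 0 or covered / span < COVERAGE_THRESHOLD:
        return None            # not enough of the interval observed
    return total_delta / covered   # average rate, e.g. bytes/sec

# 5 minute samples for the hour to 1am, with the first observation lost
# to the <mark> record after log rotation: 11 of 12 intervals known.
samples = [(i * 300, i * 1500) for i in range(1, 13)]
print(reduced_rate(samples, 0, 3600))   # 5.0, from 11/12 of the hour
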
But it does require changing the metric semantics in the reduced archive
from counter to rate.
Because you cannot reliably distinguish a counter wrap from a counter
reset, I don't think there is any way to synthesize counter values across
an archive boundary.
The _same_ algorithm works for instantaneous (and discrete) metrics,
although they do not require a change in the metric semantics.
>
> > (c) widening was added to pmlogreduce because the longer the time
> > interval that the archive spans the higher the probability of a
> > counter wrap, and in the absence of (a) the simplest way to deal
> with
> > this is to expand 32-bit counters to 64-bit counters
>
> Maybe a less simple way is worth considering, due to this fallout. It
> already makes the time-reduced archives difficult to glue together
> with others; further explicitly performed rate-conversion would make
> it worse.
I am not sure we can avoid changed semantics for pmlogreduced archives,
which may mean we need to change the archive management procedures to
maintain two overlapping sets of archives ... the first at the original
time precision, and a second set of pmlogreduced archives, e.g.
yesterday ...... a week ago ...................... a month ago .............
<------------------->  original archives
<------------------------------------------------------------->  pmlogreduced archives
And then the user has to choose whether they are doing analysis over the
short term (where higher sampling rates are more useful) or over the longer
term (where coarser sampling intervals are probably required, both from
analysis and data volume considerations).
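
As a strawman for what that management might look like, here is a small
sketch (Python, with invented paths, archive names, retention policy and
reduction interval) of a nightly step that builds the second, pmlogreduced
set from each rotated daily archive, assuming the usual
"pmlogreduce -t interval input output" style of invocation:

import datetime
import pathlib
import subprocess

# All of these paths, names and numbers are invented for illustration.
FULL_DIR = pathlib.Path("/var/log/pcp/pmlogger/somehost")            # original archives
REDUCED_DIR = pathlib.Path("/var/log/pcp/pmlogger/somehost/reduced")  # long-term set
REDUCED_INTERVAL = "1hour"     # pmlogreduce -t argument for the reduced set

def reduce_yesterdays_archive():
    # assume the daily archives are named by date, e.g. 20160219
    yesterday = (datetime.date.today() - datetime.timedelta(days=1)).strftime("%Y%m%d")
    src = FULL_DIR / yesterday          # input archive basename
    dst = REDUCED_DIR / yesterday       # output archive basename
    REDUCED_DIR.mkdir(parents=True, exist_ok=True)
    subprocess.run(["pmlogreduce", "-t", REDUCED_INTERVAL, str(src), str(dst)],
                   check=True)

if __name__ == "__main__":
    reduce_yesterdays_archive()
    # culling of full-resolution archives older than a week or two would go
    # here (or stay with the existing pmlogger_daily machinery), while the
    # reduced set is kept for months
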
We're still very much in "design by argument" mode here (that's good, not
bad, by the way). So let's keep the discussion going in the hope that our
ideas and requirements and expectations converge, rather than diverge.