
Re: pmlogreduce - use by date has expired

To: kenj@xxxxxxxxxxxxxxxx
Subject: Re: pmlogreduce - use by date has expired
From: Nathan Scott <nscott@xxxxxxxxxx>
Date: Mon, 15 Sep 2008 15:24:07 +1000
Cc: pcp@xxxxxxxxxxx
In-reply-to: <1221111368.25428.11.camel@bozo>
References: <1221111368.25428.11.camel@bozo>
Sender: pcp-bounce@xxxxxxxxxxx
Hi Ken,

On Thu, 2008-09-11 at 15:36 +1000, Ken McDonell wrote:
> ...
> Some Things that WILL be Supported
> The existing pmlogreduce attempts some of the list below, but most of
> these features are either not implemented, or implemented incorrectly
> in the current code. 
>       * The temporal reduction is achieved by the -t delta command
>         line option.  The output archive will contain observations at
>         most once per delta for each metric-instance pair in the input
>         archive. 

I have been wondering whether the ability to have sets of
values recorded at different frequencies in the new log
would be useful (iow, a different -t for different sets of
metrics) ... like pmlogger allows.  Have you given that
option any thought?  It complicates things a fair bit, I
guess, and I'm leaning toward "probably not worth it", but
just thought I'd mention it.

>       * The size of the output archive may be limited with the -s
>         command line option.

How does that combine with -t? (when the size limit is hit,
it just ends the archive & warns user?)

>       * Multi-volume output archives will be supported through the -v
>         command line option and internal volume switching logic to
>         ensure the 32-bit offset limit of the temporal index is not
>         exceeded. 

Should that be automatic and only if needed?  (no -v)

>       * Counters will be rate converted (so mapped to INSTANTANEOUS
>         metrics, have their semantics changed when the TIME DIMENSION
>         is reduced by one, e.g. MBYTE -> MBYTE / SEC, and their TYPE
>         will be converted to DOUBLE).

This could potentially make larger output files than input files.
Would an option for FLOAT instead of DOUBLE be useful to prevent
that phenomenon?
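
To make the size concern concrete, here is a minimal sketch of the rate
conversion described above. It is not pmlogreduce code: rate_convert and
the sample data are hypothetical, and counter wraps and MARK records are
ignored. The struct sizes assume a 32-bit unsigned counter in the input.

```python
import struct

def rate_convert(samples):
    """Turn (time_sec, counter) pairs into (time_sec, rate) pairs.

    One rate is produced per consecutive pair of samples and reported at
    the later timestamp. Hypothetical helper for illustration only.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip samples whose timestamps do not advance
        rates.append((t1, (v1 - v0) / dt))
    return rates

# e.g. a byte counter sampled every 10 seconds
samples = [(0, 100), (10, 300), (20, 300)]
rates = rate_convert(samples)

# Per-value storage: 32-bit counter vs. FLOAT vs. DOUBLE
size_u32 = struct.calcsize("<I")
size_float = struct.calcsize("<f")
size_double = struct.calcsize("<d")
```

With a 32-bit input counter, each DOUBLE value takes twice the space of
the original, while FLOAT would match it at the cost of precision.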

> Some Open Questions
> The following issues warrant some discussion before I make unilateral
> decisions. 
>      1. Output Window Clipping.  In several useful deployments of
>         pmlogreduce one may wish to further restrict the temporal
>         domain by selecting some re-occurring periods to be included,
>         and some to be excluded.  Examples might be between the hours
>         08:00 and 20:00 each day, and/or each day excluding Saturday
>         and Sunday.  There are several problems here: 
>              1. suitable command line syntax to specify this sort of
>                 clipping 
>              2. what would the output archive contain - no pmResult,
>                 or pmResult and no metrics (which is formally a MARK
>                 record) for each delta in  the "clipped" region

I'd go for the former just to save space, in absence of a
compelling reason either way.

In my local use case, I'd imagine we'd do this clipping
via logextract, when the first-level daily archives get
munged into logs for some other interval (weekly/monthly/...),
which will also reduce the set of metrics stored longer
term, etc.  We'd then run logreduce on the result, so we'd
have no need for this AFAICS.  But perhaps other use cases
would call for it.
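
For reference, the kind of recurring window Ken describes (08:00 to 20:00,
excluding Saturday and Sunday) could be tested along these lines. This is
only a sketch: in_window and its parameters are made up for illustration
and say nothing about what the command line syntax should look like.

```python
from datetime import datetime

def in_window(ts, start_hour=8, end_hour=20, skip_weekends=True):
    """Return True if timestamp ts falls inside the recurring window.

    Hypothetical helper: the window is [start_hour, end_hour) each day,
    optionally excluding Saturday and Sunday.
    """
    if skip_weekends and ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return start_hour <= ts.hour < end_hour

monday_afternoon = in_window(datetime(2008, 9, 15, 15, 24))
monday_night = in_window(datetime(2008, 9, 15, 21, 0))
saturday_noon = in_window(datetime(2008, 9, 13, 12, 0))
```

Records whose timestamps fail the test would simply be left out of the
output, which corresponds to the "no pmResult" option.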

>      2. Should DISCRETE metrics appear in the output only if there is
>         a value observed in the corresponding interval in the input
>         archive?  The alternative is to have all metrics repeated in
>         every pmResult in the output archive.  

That doesn't seem a good alternative - I'd go with the first
option, or use the last previous value seen (may be outside
the window) for discrete metrics.

>      3. For DISCRETE metrics, and all but the last value before a MARK
>         record or the end of the input archive for INSTANTANEOUS
>         metrics, consecutive identical values can be omitted without
>         changing the data semantics - is this worth it? 

I think so.  If these are string valued (like topology metrics,
or some such thing) these could waste plenty of space.
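
A rough sketch of that suppression, assuming observations arrive as
(time, value) pairs in time order for a single metric-instance pair.
drop_repeats and keep_last are hypothetical names; keep_last models the
"all but the last value" rule for INSTANTANEOUS metrics, with the end of
the list standing in for a MARK record or the end of the archive.

```python
def drop_repeats(observations, keep_last=False):
    """Omit consecutive observations whose value repeats the previous one.

    Hypothetical helper. With keep_last=True the final observation is
    always retained, even if it repeats the value before it.
    """
    kept = []
    for t, value in observations:
        if not kept or value != kept[-1][1]:
            kept.append((t, value))
    if keep_last and observations and kept[-1] != observations[-1]:
        kept.append(observations[-1])
    return kept

obs = [(0, "a"), (10, "a"), (20, "b"), (30, "b")]
discrete = drop_repeats(obs)
instantaneous = drop_repeats(obs, keep_last=True)
```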

>      4. What to do with COUNTER metrics that have a TIME dimension
>         other than 0 or 1?  I don't know that we have any such
>         metrics, and I'm not sure what the real semantics of data like
>         this might be, but it seems pretty obvious that "rate
>         conversion" is not going to make the semantics any more
>         obvious!

Yeah, just leave as-is I guess.

>      5. For INSTANTANEOUS and DISCRETE metrics with non-numeric
>         values, we have to decide what to do if multiple observations
>         appear in the input archive within a single output archive
>         time interval.  Take the last observed value seems to be the
>         least worst thing to do.

Yep, agreed.
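
Taking the last observed value per output interval could look something
like the following sketch. It assumes (time, value) pairs in time order
and output intervals aligned at multiples of delta; last_per_interval is
a hypothetical name, not anything in the existing code.

```python
def last_per_interval(observations, delta):
    """Keep only the last observation in each delta-sized interval.

    Hypothetical helper: intervals are [k*delta, (k+1)*delta) and the
    input is assumed to be sorted by time.
    """
    buckets = {}
    for t, value in observations:
        # later observations in the same interval overwrite earlier ones
        buckets[int(t // delta)] = (t, value)
    return [buckets[k] for k in sorted(buckets)]

obs = [(1, "up"), (4, "down"), (12, "up"), (19, "degraded"), (25, "up")]
reduced = last_per_interval(obs, delta=10)
```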

cheers.

--
Nathan

