
Re: [pcp] Floating point problem

To: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Subject: Re: [pcp] Floating point problem
From: Brendan Gregg <bgregg@xxxxxxxxxxx>
Date: Wed, 30 Jul 2014 15:00:11 -0700
Cc: Martin Spier <mspier@xxxxxxxxxxx>, pcp@xxxxxxxxxxx, Amer Ather <aather@xxxxxxxxxxx>, Coburn Watson <cwatson@xxxxxxxxxxx>



On Mon, Jul 28, 2014 at 3:27 PM, Ken McDonell <kenj@xxxxxxxxxxxxxxxx> wrote:
On 29/07/14 05:47, Martin Spier wrote:
Here it is:

kernel.pct.cpu.user = 100 * kernel.all.cpu.user / hinv.ncpu
kernel.pct.cpu.sys  = 100 * kernel.all.cpu.sys / hinv.ncpu

Same definition Amer posted before. Think it came from:

http://www.performancecopilot.org/pcp.git/man/html/howto.cpuperf.html


As I suspected ...

Note in that web page, the table is headed "PCP equivalent (assuming rate conversion)" ... we don't have any rate conversion in play here.

The expressions above will produce exactly the floating point precision problem Martin has observed.

The options are ...

1. Revisit the design specs for pmwebd and see if it makes sense for this daemon to be performing per-client rate conversion (so taking on some of the role of a PMAPI client, like pmie, pmval, pmchart, pmdumptext, ... and detecting the counter semantics of metrics and rate converting them). In this case the formulae and derived metrics above would work.

2. Push the rate conversion arithmetic out to the pmwebd clients ... this involves keeping the last observed value and the last timestamp, then computing delta(value) / delta(timestamp), and you could do the *100 and /hinv.ncpu at the same time. I am guessing this is not an attractive option.

3. Extend the derived metrics support. We already have delta() which can be applied to counter metrics and returns the difference in value between one pmFetch and the next. This is closer to the semantics Martin needs, but does not include the divide by delta(timestamp) part. I could add rate() as a new intrinsic function for derived metrics that does the rate conversion.

With option 3. the derived metric definitions would be something like ...

kernel.pct.cpu.user = 100 * rate(kernel.all.cpu.user) / hinv.ncpu

Note that rate(kernel.all.cpu.user) would be a double precision number but restricted to the (small) interval [0, hinv.ncpu].
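
To make that interval concrete, the arithmetic can be written out as below. This is only a sketch of what the proposed rate() would compute, and it assumes that the counter x = kernel.all.cpu.user accumulates CPU time and that the result is scaled to CPU-seconds per second; neither the units nor the scaling is spelled out in this thread.

```latex
\[
\mathrm{rate}(x) = \frac{x(t_2) - x(t_1)}{t_2 - t_1},
\qquad 0 \le \mathrm{rate}(x) \le \mathtt{hinv.ncpu}
\]
\[
\mathtt{kernel.pct.cpu.user} = 100 \cdot \frac{\mathrm{rate}(x)}{\mathtt{hinv.ncpu}}
\]
% Hypothetical numbers: 4 CPUs, 6 CPU-seconds of user time accumulated over 2 s.
\[
\mathrm{rate}(x) = \frac{6\ \mathrm{s}}{2\ \mathrm{s}} = 3.0,
\qquad 100 \cdot \frac{3.0}{4} = 75\%
\]
```

Under those assumptions the result stays within [0, 100] regardless of how large the underlying cumulative counter has grown.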

Before jumping into 3., I'd like to hear feedback on options 1. and 2.

I actually like 2. It's simple.

I'd like to not just fetch per-second metrics, but possibly other intervals at the same time, including per-hour and per-day. And possibly from multiple clients. And possibly ad-hoc queries. With 2, I simply stash away whatever cumulative values and timestamp pairs, and use them later when needed.
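
A minimal sketch of that client-side approach, in Python, might look like the following. Everything here is illustrative: CounterHistory, cpu_percent and the units_per_second parameter are hypothetical names, the counter is assumed to be in milliseconds of CPU time, and counter wrap or reset is not handled.

```python
from bisect import bisect_right
from collections import deque


class CounterHistory:
    """Stash (timestamp, cumulative value) pairs for one counter metric."""

    def __init__(self, max_samples=100_000):
        # Samples are assumed to arrive in increasing timestamp order.
        self.samples = deque(maxlen=max_samples)

    def record(self, timestamp, value):
        self.samples.append((timestamp, value))

    def rate(self, interval):
        """Average rate over roughly the last `interval` seconds.

        Uses the newest stored sample that is at least `interval` old,
        falling back to the oldest sample when history is shorter.
        Returns None when fewer than two distinct samples are usable.
        """
        if len(self.samples) < 2:
            return None
        t_new, v_new = self.samples[-1]
        times = [t for t, _ in self.samples]
        idx = bisect_right(times, t_new - interval) - 1
        t_old, v_old = self.samples[max(idx, 0)]
        if t_new <= t_old:
            return None
        return (v_new - v_old) / (t_new - t_old)


def cpu_percent(history, interval, ncpu, units_per_second=1000.0):
    """100 * rate / ncpu, with the counter assumed to be in milliseconds."""
    r = history.rate(interval)
    if r is None:
        return None
    return 100.0 * r / units_per_second / ncpu


if __name__ == "__main__":
    user = CounterHistory()
    # Made-up samples: (seconds, cumulative CPU milliseconds) on a 4-CPU host.
    for ts, value in [(0.0, 0), (1.0, 2000), (2.0, 6000)]:
        user.record(ts, value)
    print(cpu_percent(user, interval=1, ncpu=4))     # last second
    print(cpu_percent(user, interval=3600, ncpu=4))  # as much of the hour as is stored
```

Because only raw cumulative values and timestamps are kept, the same history can answer per-second, per-hour, or ad-hoc interval queries, and the subtraction happens before any scaling, so the large cumulative value never reaches the percentage arithmetic.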

Brendan

--
Brendan Gregg, Senior Performance Architect, Netflix