To: Amer Ather <aather@xxxxxxxxxxx>
Subject: Re: [pcp] pcp grafana and graphite - How to convert pcp metric values into percent
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Tue, 1 Jul 2014 21:46:34 -0400 (EDT)
Cc: "Frank Ch. Eigler" <fche@xxxxxxxxxx>, Martin Spier <mspier@xxxxxxxxxxx>, pcp@xxxxxxxxxxx
Hi Amer,

----- Original Message -----
> We need some additional help with derived metrics
> 
> As suggested, I created some derived CPU metrics:
> 
> kernel.pct.cpu.user = 100 * kernel.all.cpu.user / (hinv.ncpu * 1000)
> kernel.pct.cpu.sys = 100 * kernel.all.cpu.sys / (hinv.ncpu * 1000)
> kernel.pct.cpu.idle = 100 * kernel.all.cpu.idle / (hinv.ncpu * 1000)
> kernel.pct.cpu.nice = 100 * kernel.all.cpu.nice / (hinv.ncpu * 1000)
> kernel.pct.cpu.intr = 100 * kernel.all.cpu.intr / (hinv.ncpu * 1000)
> kernel.pct.cpu.wait.total = 100 * kernel.all.cpu.wait.total / (hinv.ncpu *
> 1000)
> ...
> 
> grafana/graphite reports these values from archives correctly in percent.
> However, pmapi fetch returns wrong values:
> 
> http://ec2-204-236-180-248.us-west-1.compute.amazonaws.com:7002/pmapi/1121044401/_fetch?names=kernel.pct.cpu.sys
> , kernel.pct.cpu.user
> 
> 
> # PCP_DERIVED_CONFIG=pctavg pminfo -f kernel.pct.cpu.user
> 
> kernel.pct.cpu.user
> value 12359528.75
> 
> What is the difference between grafana/graphite fetching from archive logs
> and fetching via pmapi or pminfo? How to fix it?

The answer lies in understanding the units and semantics of these metrics,
which you can dig into using the -d (--desc) option to pminfo.  This dumps
the pmDesc structure - see pmLookupDesc(3) for further info.  In your case
the structure is built on the fly by libpcp using each line in your config
file...

$ PCP_DERIVED_CONFIG=/home/nathans/pctavg pminfo -fd kernel.pct.cpu.user

kernel.pct.cpu.user
    Data Type: double  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: counter  Units: millisec
    value 21477.375

So, these are counter metrics, and they are exported in milliseconds.  In
order to achieve the utilization metric you're after, the counter needs
to be converted to a rate (change-in-value over change-in-time) and the
units converted to a utilization (initially normalized, then multiplied
by 100 to produce a percent).
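
As a rough illustration of that arithmetic (a Python sketch only, not PCP
code; the function and parameter names are made up for this example),
two successive samples of a millisecond CPU-time counter can be turned
into a percentage like this:

```python
def cpu_percent(prev_ms, curr_ms, interval_sec, ncpu):
    """Turn two samples of a millisecond CPU-time counter into a percentage.

    All names here are hypothetical; this only mirrors the steps described
    in the text (rate conversion, unit normalization, scaling to percent).
    """
    # Counter -> rate: change-in-value over change-in-time (millisec per sec).
    rate_ms_per_sec = (curr_ms - prev_ms) / interval_sec
    # Millisec of CPU time per second of wall time -> normalized utilization,
    # summed over all CPUs (so it ranges from 0 to ncpu).
    utilization = rate_ms_per_sec / 1000.0
    # Average across CPUs and express as a percent.
    return 100.0 * utilization / ncpu


# 2000 ms of CPU time accumulated over 1 second on a 4-CPU machine.
print(cpu_percent(0, 2000, 1.0, 4))  # prints 50.0
```

Note that the division by 1000 here is the units conversion itself, so a
client that already converts millisec to a utilization must not apply it
a second time.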

It's not clear exactly what the web client is doing here, but these derived
metrics should not need the final "... * 1000)" bit.  I think that's making
some incorrect assumptions; it should just be using "hinv.ncpu".  So, using
pmval instead, with this config...

$ cat pctavg2
kernel.pct.cpu.user = 100 * kernel.all.cpu.user / hinv.ncpu
kernel.pct.cpu.sys  = 100 * kernel.all.cpu.sys / hinv.ncpu
kernel.pct.cpu.idle = 100 * kernel.all.cpu.idle / hinv.ncpu
kernel.pct.cpu.nice = 100 * kernel.all.cpu.nice / hinv.ncpu
kernel.pct.cpu.intr = 100 * kernel.all.cpu.intr / hinv.ncpu
kernel.pct.cpu.wait.total = 100 * kernel.all.cpu.wait.total / hinv.ncpu

... gives a detailed account of the transformations that pmval makes here;
e.g. for my (very idle) desktop...

$ PCP_DERIVED_CONFIG=/home/nathans/pctavg2 pmval -f3 -s3 kernel.pct.cpu.idle

metric:    kernel.pct.cpu.idle
host:      smash
semantics: cumulative counter (converting to rate)
units:     millisec (converting to time utilization)
samples:   3
interval:  1.00 sec
               99.071
               98.684
               99.188

... those values are correct, and show the appropriate transformation of
semantics and units that you're after from a client in this situation.
Hopefully Frank can point out what the graph* client is doing differently
for us here - I think the fix will need to be over in that code.
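
To make the "converting to rate" and "converting to time utilization"
lines in the pmval header concrete, here is a minimal Python sketch of
the kind of decision a client could make from the metric descriptor.
It is an assumption-laden illustration, not how pmval or libpcp is
implemented: the dict-based descriptor, the UNITS_PER_SECOND table and
the display_value function are all hypothetical.

```python
# Hypothetical table: how many of each time unit make up one second.
UNITS_PER_SECOND = {"nanosec": 1e9, "microsec": 1e6, "millisec": 1e3, "sec": 1.0}


def display_value(desc, prev, curr, interval_sec):
    """Pick a displayable value from two samples, based on the descriptor.

    desc is a hypothetical dict with 'semantics' and 'units' keys, standing
    in for the information that pminfo -d prints.
    """
    if desc["semantics"] != "counter":
        # Non-counter values are shown as sampled.
        return curr
    # Counter: convert to a rate (units per second).
    rate = (curr - prev) / interval_sec
    per_second = UNITS_PER_SECOND.get(desc["units"])
    if per_second is not None:
        # Time units per second collapse to a dimensionless utilization.
        rate = rate / per_second
    return rate


# The derived metric already carries the "100 * ... / hinv.ncpu" factor,
# so the samples below stand for that scaled millisecond counter.
desc = {"semantics": "counter", "units": "millisec"}
print(display_value(desc, 0.0, 99000.0, 1.0))
```

A client that skips either step (showing the raw counter, or a rate still
in millisec per second) will report values that are far too large.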

cheers.

--
Nathan
