Hi Amer,
----- Original Message -----
> We need some additional help with derived metrics
>
> As suggested, I created some derived CPU metrics:
>
> kernel.pct.cpu.user = 100 * kernel.all.cpu.user / (hinv.ncpu * 1000)
> kernel.pct.cpu.sys = 100 * kernel.all.cpu.sys / (hinv.ncpu * 1000)
> kernel.pct.cpu.idle = 100 * kernel.all.cpu.idle / (hinv.ncpu * 1000)
> kernel.pct.cpu.nice = 100 * kernel.all.cpu.nice / (hinv.ncpu * 1000)
> kernel.pct.cpu.intr = 100 * kernel.all.cpu.intr / (hinv.ncpu * 1000)
> kernel.pct.cpu.wait.total = 100 * kernel.all.cpu.wait.total / (hinv.ncpu * 1000)
> ...
>
> grafana/graphite reports these values from archives correctly in percent.
> However, pmapi fetch returns wrong values:
>
> http://ec2-204-236-180-248.us-west-1.compute.amazonaws.com:7002/pmapi/1121044401/_fetch?names=kernel.pct.cpu.sys,kernel.pct.cpu.user
>
>
> # PCP_DERIVED_CONFIG=pctavg pminfo -f kernel.pct.cpu.user
>
> kernel.pct.cpu.user
> value 12359528.75
>
> What is the difference between grafana/graphite fetching from archive logs
> and fetching via pmapi or pminfo? How to fix it?
The answer lies in understanding the units and semantics of these metrics,
which you can dig into using the -d (--desc) option to pminfo; that dumps
the pmDesc structure - see pmLookupDesc(3) for further info. In your case
the structure is built on the fly by libpcp from each line in your config
file...
$ PCP_DERIVED_CONFIG=/home/nathans/pctavg pminfo -fd kernel.pct.cpu.user
kernel.pct.cpu.user
    Data Type: double  InDom: PM_INDOM_NULL 0xffffffff
    Semantics: counter  Units: millisec
    value 21477.375
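If you ever want to make the same check from code rather than pminfo,
pmLookupDesc(3) hands back that pmDesc structure directly; here's a rough,
untested sketch (assuming a reasonably recent pmapi, a local pmcd context,
and one example metric name - adjust to taste)...

#include <stdio.h>
#include <pcp/pmapi.h>

int
main(void)
{
    /* illustrative only - inspect the semantics and units of one metric */
    const char *names[] = { "kernel.all.cpu.user" };
    pmID pmids[1];
    pmDesc desc;
    int sts;

    if ((sts = pmNewContext(PM_CONTEXT_HOST, "local:")) < 0 ||
        (sts = pmLookupName(1, names, pmids)) < 0 ||
        (sts = pmLookupDesc(pmids[0], &desc)) < 0) {
        fprintf(stderr, "pmapi error: %s\n", pmErrStr(sts));
        return 1;
    }
    /* PM_SEM_COUNTER with millisec units means the client must rate-convert */
    printf("semantics: %s  units: %s\n",
           pmSemStr(desc.sem), pmUnitsStr(&desc.units));
    return 0;
}
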
So, these are counter metrics, and they are exported in milliseconds. In
order to achieve the utilization metric you're after, the counter needs
to be converted to a rate (change-in-value over change-in-time) and the
units converted to a utilization (initially normalized, then multiplied
by 100 to produce a percent).
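To make the conversion a client has to do concrete, with made-up numbers: on
a 4-CPU box that is nearly idle, kernel.all.cpu.idle might advance by roughly
3970 (milliseconds of idle time, summed across all CPUs) over a 1 second
sample interval; the rate is then 3970 millisec/sec, normalizing gives
3970 / (4 * 1000) = 0.9925, and multiplying by 100 gives ~99% idle.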
It's not clear exactly what the web client is doing here, but these derived
metrics should not need the final "... * 1000)" bit - I think that's making
some incorrect assumptions; they should just be dividing by "hinv.ncpu". So,
using pmval instead with this config...
$ cat pctavg2
kernel.pct.cpu.user = 100 * kernel.all.cpu.user / hinv.ncpu
kernel.pct.cpu.sys = 100 * kernel.all.cpu.sys / hinv.ncpu
kernel.pct.cpu.idle = 100 * kernel.all.cpu.idle / hinv.ncpu
kernel.pct.cpu.nice = 100 * kernel.all.cpu.nice / hinv.ncpu
kernel.pct.cpu.intr = 100 * kernel.all.cpu.intr / hinv.ncpu
kernel.pct.cpu.wait.total = 100 * kernel.all.cpu.wait.total / hinv.ncpu
... gives a detailed account of the transformations that pmval makes here;
e.g. for my (very idle) desktop...
$ PCP_DERIVED_CONFIG=/home/nathans/pctavg2 pmval -f3 -s3 kernel.pct.cpu.idle
metric:    kernel.pct.cpu.idle
host:      smash
semantics: cumulative counter (converting to rate)
units:     millisec (converting to time utilization)
samples:   3
interval:  1.00 sec
99.071
98.684
99.188
... those values are correct, and show the transformation of semantics and
units you're after being performed on the client side in this situation.
Hopefully Frank can point out what the graph* client is doing differently
for us here - I think the fix will need to be over in that code.
cheers.
--
Nathan