Re: [PCP - Bug?] metric rate higher than 1 !

To: "Guillier, Nicolas" <nicolas.guillier@xxxxxxxxxx>
Subject: Re: [PCP - Bug?] metric rate higher than 1 !
From: Ken McDonell <kenmcd@xxxxxxxxxxxxxxxxx>
Date: Fri, 2 Jul 2004 10:03:48 +1000
Cc: pcp@xxxxxxxxxxx
In-reply-to: <5E3610150FD4454697F8203F72E14D93081592@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Sender: pcp-bounce@xxxxxxxxxxx
On Wed, 30 Jun 2004, Guillier, Nicolas wrote:

> Hello,
> I use PCP-2.2.2-132 to remotely monitor a linux system.
> I sometimes face a strange problem: between two acquisitions, the
> consumed cpu time is higher than the real time ! Once turned into a
> percentage, the resulting value can reach up to 250% of cpu load !
> This case occurs for kernel.cpu.* metrics and with disk.all.avactive
> metric as well (both from linux pmda).
>
> I need to understand the root causes of such a behaviour.

First, cpu time and disk active time are both really _counters_ in units
of time in the kernel, so reporting a value for a metric v requires
observations at times t1 and t2, and then the rate (actually time/time,
so a utilization) is reported as

        v(t2) - v(t1)
        -------------
           t2 - t1
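
In code, the conversion done on the monitoring side amounts to something
like this (a minimal Python sketch of the arithmetic only, not the actual
libpcp code; the function name is mine):

        # rate-convert a counter from two (timestamp, value) observations;
        # when the counter is itself in units of time (cpu time, disk
        # active time) the result is a utilization, normally expected
        # to be <= 1
        def rate(t1, v1, t2, v2):
            return (v2 - v1) / (t2 - t1)

        print(rate(1, 1, 4, 4))     # 1.0, i.e. 100% busy, see metric x below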

The sort of perturbation you report occurs when the collector system
(pmcd + pmdas) is heavily loaded.

The collection architecture assigns one timestamp per fetch, and if the
collection system is heavily loaded then there is some (non-trivial in
the extreme case) time window between when the first value in the fetch
is retrieved from the kernel and when the last is retrieved from the kernel.

Let me try to explain with an example using two counter metrics, x and y,
with correct values as shown below

    Time        x         y
       0        0         0
       1        1        10
       2        2        20
       3        3        30
       4        4        40
       5        5        50
       6        6        60
       7        7        70
       8        8        80

Now on a lightly loaded system, if we consider 3 samples at t=1, t=4 and
t=7, then the fetches would return the following (the value in [] is the
fetch timestamp)

        Time
         1      pcp client sends fetch request
                pmcd retrieves x=1 and y=10
                pcp client receives { [1] x=1 y=10 }

         4      pcp client sends fetch
                pmcd retrieves x=4 and y=40
                pcp client receives { [4] x=4 y=40 }

         7      pcp client sends fetch
                pmcd retrieves x=7 and y=70
                pcp client receives { [7] x=7 y=70 }

And the reported rates would be correct, namely

         1      no values available
         4      x=(4-1)/3=1 y=(40-10)/3=10
         7      x=(7-4)/3=1 y=(70-40)/3=10
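
The same computation, as a Python sketch using just the x values and the
per-fetch timestamps (names again mine):

        # (timestamp assigned to the fetch, value of x), lightly loaded system
        samples = [(1, 1), (4, 4), (7, 7)]

        for (t1, v1), (t2, v2) in zip(samples, samples[1:]):
            print(t2, (v2 - v1) / (t2 - t1))
        # prints "4 1.0" and "7 1.0" ... a steady 100% utilization, as expected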

Now on a heavily loaded system this could happen ...

        Time
         1      pcp client sends fetch request
                pmcd retrieves x=1 and y=10
                pcp client receives { [1] x=1 y=10 }

         4      pcp client sends fetch
                pmcd retrieves x=4
         5      pmcd retrieves y=50                     <-- delay
                pcp client receives { [5] x=4 y=50 }    <-- wrong for x

         7      pcp client sends fetch
                pmcd retrieves x=7 and y=70
                pcp client receives { [7] x=7 y=70 }

And the reported rates would be ...

         1      no values available
         5      x=(4-1)/4=0.75 y=(50-10)/4=10
         7      x=(7-4)/2=1.50 y=(70-50)/2=10

So, the delayed fetch at time 4 (which does not return values until time
5) produces

        x is too _small_ at t=5
        x is too _big_ at t=7

You're noticing the second case.
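
Feeding the same little Python loop the timestamps and values from the
heavily loaded case reproduces exactly this:

        # x=4 was retrieved at t=4 but the fetch was stamped t=5
        samples = [(1, 1), (5, 4), (7, 7)]

        for (t1, v1), (t2, v2) in zip(samples, samples[1:]):
            print(t2, (v2 - v1) / (t2 - t1))
        # prints "5 0.75" (too small) and "7 1.5" (too big, a 150% cpu load)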

Note that because these are counters, the effects are self-cancelling
and diminish over longer sampling intervals.  There is nothing inherently
wrong here.
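
For example, rate-converting x across the whole window from the first fetch
to the last gives the right answer despite the mis-stamped sample in the
middle, and the two perturbed rates, weighted by their sampling windows,
average back out to the true utilization:

        print((7 - 1) / (7 - 1))            # 1.0 over t=1 .. t=7
        print((0.75 * 4 + 1.5 * 2) / 6)     # 1.0, the errors cancel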

> Is it due to pmcd, pmda on monitored machine ? or pmcd, pmlogger on
> remote monitor ?

The effects are all on the collection (monitored) system.

> Can I conclude that the consumption reached a peak, or could it just
> be a pmda failure when updating a metric, trying to read a /proc/ file ?

You cannot really conclude either ... it is just the way things work.
Now 250% _is_ extreme, but if the system is totally CPU bound then there
is no reason to believe pmcd will not also be impacted.

> Where can I find information about this ?

Hopefully this mail will explain it.

I'll add this to the pcp faq on the oss.sgi.com web site.

