
Re: [pcp] Dynamic PMDA cluster - proposal

To: kenj@xxxxxxxxxxxxxxxx
Subject: Re: [pcp] Dynamic PMDA cluster - proposal
From: Corneliu Boac <cboac@xxxxxxx>
Date: Mon, 07 Jun 2010 14:50:09 -0500
Cc: pcp@xxxxxxxxxxx
In-reply-to: <1275866862.3803.120.camel@xxxxxxxxxxxxxxxx>
References: <4C07DC70.2030700@xxxxxxx> <1275866862.3803.120.camel@xxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.9) Gecko/20100317 SUSE/3.0.4-1.1.1 Thunderbird/3.0.4
On 06/06/2010 06:27 PM, Ken McDonell wrote:
> Corneliu,
>
> Let me start by saying the current cluster pmda was written after I left
> sgi so my comments are from a position of relative ignorance ...
>
> Comments are in situ below.
>   


Hi Ken,

Thank you for providing feedback on my proposal.
I will answer your questions below.


> On Thu, 2010-06-03 at 11:46 -0500, Corneliu Boac wrote:
>   
>> Hello PCP community,
>>
>> Customers have requested that we make the pmda cluster more dynamic:
>>
>> 1) Right now we use a static configuration file and the compute nodes
>> always push all metrics that are in that file.
>>
>> 2) The compute nodes push the metrics to the cluster head node every two
>>  seconds even if no client is requesting information.
>>
>> After looking at the code and some brainstorming here at SGI, I wrote
>> the following proposal, which I would like to discuss with you. I value
>> your feedback a lot, so I would ask you to analyze it and send me
>> your thoughts.
>>
>> Thank you,
>> Cornel
>>
>> -----------------------------------------------------
>>
>> Overview
>>
>> The Cluster PMDA has two components: pmdacluster, which runs on the
>> cluster head node (the PMDA), and pmclusterd, which runs on the compute
>> nodes (provides the data). The Cluster PMDA configuration file contains
>> the list of all metrics that pmclusterd is required to retrieve every
>> two seconds.
>>
>> PROPOSAL
>>
>> * The PMDA will load the list of supported metrics including the update
>> interval in ms for each one of the metrics. 2 seconds can be used as a
>> default value when a specific refresh interval is not defined. Some
>> metrics will not change their values (e.g. memory size and other metrics
>> that are related to hardware). Others may change very often (e.g.
>> network traffic counters).
>>     
> This seems sensible and is consistent with the way other tools like pmie
> and pmlogger handle metrics with different underlying rates of change.
>
> Is the intention that the configuration file could be updated
> dynamically once the PMDA is running (like pmlogger), or the
> configuration remains unchanged for the life of the PMDA (like pmie)?
>   


Yes, this is my intention: to update dynamically the set of metrics that
the Cluster PMDA pulls from the blades, based on what the clients are
requesting.


>   
>> * In general the protocol between pmdacluster and pmclusterd will remain
>> the same with the exception that metrics will be pushed only when
>> requested based on their refresh interval (not every 2 seconds as right
>> now).
>>
>> * The first time a pm_fetch() call is made (from pmcd), pmdacluster will
>> retrieve the values from pmclusterd and cache them. The time stamp of
>> the request will be saved in the cache together with the values for each
>> metric.
>>     
> How is the latency of this initial fetch handled? pmdacluster presumably
> cannot afford to wait for all the pmclusterds to push data?  Does this
> use the PM_ERR_PMDANOTREADY/PM_ERR_PMDAREADY handshake protocol between
> the pmda and pmcd?
>
>   


Yes. pmdacluster will report that the data is not ready
(PM_ERR_PMDANOTREADY) at the initial fetch.
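A minimal sketch in C of that first-fetch behaviour, not the actual pmdacluster code. It assumes a per-metric cache entry with a "have value" flag; CACHE_NOT_READY is a hypothetical local stand-in for PM_ERR_PMDANOTREADY, and all type and function names are illustrative only.

```c
#include <stdbool.h>

/* Hypothetical stand-ins: the real PMDA would return PM_ERR_PMDANOTREADY
 * and later tell pmcd PM_ERR_PMDAREADY once data has arrived. */
enum { CACHE_OK = 0, CACHE_NOT_READY = -1 };

typedef struct {
    bool   have_value;   /* set once the first push from pmclusterd arrives */
    bool   requested;    /* set when a client has asked for this metric */
    double value;
} cache_entry;

/* Fetch path: never blocks waiting for the pmclusterds. */
int cache_fetch(cache_entry *e, double *out)
{
    e->requested = true;          /* lets pmdacluster (re)configure pushes */
    if (!e->have_value)
        return CACHE_NOT_READY;   /* initial fetch: nothing cached yet */
    *out = e->value;
    return CACHE_OK;
}

/* Called when a pmclusterd push for this metric arrives. */
void cache_store(cache_entry *e, double v)
{
    e->value = v;
    e->have_value = true;
}
```

The point is that pmcd gets an immediate answer on the first request, and the value is served from the cache once the pmclusterds start pushing.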


> How do the pmclusterds get the configuration file information?
>   


The pmclusterds send a CLUSTER_PDU_CONFIG when they connect to
pmdacluster. My intention is for pmdacluster to close their connections
when no more data is needed from the pmclusterds. The pmclusterds
will reconnect, but pmdacluster will not respond to CLUSTER_PDU_CONFIG
until a new set of metrics is required from them. This way
I do not have to change the protocol between the pmclusterds and
pmdacluster (unless I see an issue with the pmclusterds blocking on the
socket while trying to retrieve the new list of metrics).



>   
>> * When the next pm_fetch() comes, the pmdacluster will check the cache
>> and provide the data from there if the data has not expired (saved time
>> stamp + refresh interval < current time stamp). If the data has expired
>> for one or more metrics, it will retrieve new data for those metrics and
>> save the new time stamp.
>>     
> OK I'm missing something here ... I thought the model was that the
> pmclusterds _pushed_ new data to pmdacluster ... if this is the case,
> how can pmdacluster retrieve (i.e. _pull_) new data?
>   


When new data is requested, pmdacluster will respond to the
CLUSTER_PDU_CONFIG sent by the pmclusterds with the new set of metrics.
The pmclusterds will then start pushing the new list of metrics.



>   
>> * The pmdacluster will start periodically requesting new data for the
>> metrics for which the time between two pm_fetch() calls is less than
>> or equal to twice the refresh interval, so that it always has data
>> available when the next pm_fetch() call is made.
>>     
> Again, I don't understand the push vs pull semantics here.
>   


I meant to say that pmdacluster will unblock the pmclusterds, which will
then resume pushing data.


> But more critically, is pmdacluster maintaining state for every metric
> (or metric-instance pair?) to measure the incoming request intervals?
> And if so, is this state per pmcd client or state across all pmcd
> clients (I think it has to be the latter, because by the time
> pmdacluster gets the request from pmcd the identity of the pmcd client
> is not visible).
>   


pmdacluster will maintain state for each metric-instance pair across all
pmcd clients (not per pmcd client).
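To illustrate, a sketch of such a table, assuming the key is just (metric id, instance) because pmdacluster never learns which pmcd client made a request. The struct and function names are hypothetical, and a linear-search array stands in for whatever lookup structure the real code would use.

```c
#include <stddef.h>

/* Hypothetical key: no client identity, only metric and instance. */
typedef struct {
    unsigned int pmid;
    int          inst;
} mi_key;

typedef struct {
    mi_key       key;
    double       last_request;  /* latest request from any client */
    unsigned int requests;      /* requests from all clients combined */
} mi_state;

#define MAX_ENTRIES 128

typedef struct {
    mi_state entries[MAX_ENTRIES];
    size_t   used;
} mi_table;

/* Find the entry for (pmid, inst), creating it if needed; NULL if full. */
mi_state *mi_lookup(mi_table *t, unsigned int pmid, int inst)
{
    for (size_t i = 0; i < t->used; i++)
        if (t->entries[i].key.pmid == pmid && t->entries[i].key.inst == inst)
            return &t->entries[i];
    if (t->used == MAX_ENTRIES)
        return NULL;
    mi_state *s = &t->entries[t->used++];
    s->key.pmid = pmid;
    s->key.inst = inst;
    s->last_request = 0.0;
    s->requests = 0;
    return s;
}

/* Record a fetch; requests from different clients hit the same entry. */
void mi_note_request(mi_table *t, unsigned int pmid, int inst, double now)
{
    mi_state *s = mi_lookup(t, pmid, inst);
    if (s != NULL) {
        s->last_request = now;
        s->requests++;
    }
}
```

With two clients asking for the same metric-instance, both requests update one entry, which is exactly the "across all clients" behaviour described above.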


> If 2 clients of pmcd are requesting the cluster.foo metric with an
> interval of 1 sec, and the refresh interval for cluster.foo is 0.5 sec
> then requests (from different clients) could well arrive less than 0.25
> sec apart which would seem to trigger the "periodic request" you mention
> above, but for no apparent advantage.
>
> Perhaps if you could complete this table I'd understand better ...
>
> A and B are 2 clients of pmcd requesting cluster.foo at 1 sec intervals.
> cluster.foo is the sum of the metric X from each node in the cluster.
> I'm using a 2 node cluster to keep it simple.
>
> time  A       B       pmdacluster     X @ node 0      X @ node 1
>  0
>  0.1  ?               ?               1               10
>  0.3          ?       ?               3               30
>  1                    ?               10              100
>  1.1  ?               ?               11              110
>  1.3          ?       ?               13              130
>  2                    ?               20              200
>   


I am not sure I understand how to fill in your table. In the current
implementation the pmclusterds push all metrics every two seconds even if
no clients are requesting any metric. If requests are made every second,
the same answer is provided twice.

We want to change that so that the
pmclusterds push only the data that is requested, at the interval that we
want, and only for as long as at least one client is requesting it.
When clients stop requesting, we will stop the pushing after two
values have been collected without being used.

This way the data push will no longer be completely asynchronous to
pm_fetch().

Let's say there are two clients (A and B) requesting the same metric
instance every second. Let's say that we consider this metric instance
stays fresh for 1.5 seconds and we configure pmdacluster accordingly.

Time     A     B     pmdacluster                pmclusterds
0 sec    -     -     -                          -

0.100    Req   -     Configures pmclusterds     Gets the configuration
                     to push the metric         from pmdacluster and
                     every 1.5 sec              starts pushing new values
                                                every 1.5 sec

0.250    -     -     pmdacluster caches the     Data available is pushed
                     current value              to pmdacluster

0.300    -     Req   Already configured to      -
                     retrieve the metric.
                     Data is still fresh
                     since .05 sec < 1.5 sec.
                     Provides the metric.
                     Stores that the request
                     was received prior to
                     the data being expired.

1.100    Req   -     Already configured to      -
                     retrieve the metric.
                     Data is still fresh
                     since .85 sec < 1.5 sec.
                     Provides the metric.
                     Stores that the request
                     was received prior to
                     data expiration.

1.300    -     Req   Already configured to      -
                     retrieve the metric.
                     Data is still fresh
                     since 1.05 sec < 1.5 sec.
                     Provides the metric.
                     Stores that the request
                     was received prior to
                     data expiration.

1.750    -     -     pmdacluster caches the     Data available is pushed
                     current value              to pmdacluster
                     At least one request has
                     been received since
                     the previous update so
                     it will let pmclusterds
                     continue to push new
                     values for the metric.

3.250    -     -     pmdacluster caches the     Data available is pushed
                     current value              to pmdacluster
                     No requests have been
                     made since the last update.
                     It will remember that
                     but it will still let
                     pmclusterds continue to
                     push new values for the
                     metric.

4.750    -     -     pmdacluster caches the     Data available is pushed
                     current value              to pmdacluster
                     No requests have been
                     made in the last two
                     updates. It will stop
                     the pmclusterds from
                     pushing this data.
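The rules behind this timeline can be sketched per metric as below. This is a minimal illustration, not the pmdacluster implementation: the struct and function names are hypothetical, times are plain seconds, and "stop" only clears a flag where the real code would close the pmclusterd connections.

```c
#include <stdbool.h>

#define STOP_AFTER_UNUSED 2   /* pushes allowed to go unread before stopping */

typedef struct {
    double refresh;     /* seconds a pushed value stays fresh */
    bool   pushing;     /* pmclusterds currently configured to push */
    bool   have_value;  /* a value has been cached */
    double pushed_at;   /* time stamp of the cached value */
    bool   requested;   /* a fetch arrived since the last push */
    int    unused;      /* consecutive pushes nobody fetched */
} metric;

/* A client fetch: returns true if the cached value is still fresh. */
bool on_fetch(metric *m, double now)
{
    m->requested = true;
    if (!m->pushing) {        /* first request, or first after a stop */
        m->pushing = true;    /* would answer CLUSTER_PDU_CONFIG here */
        m->unused = 0;
    }
    return m->have_value && now - m->pushed_at < m->refresh;
}

/* A push from the pmclusterds arrives and is cached. */
void on_push(metric *m, double now)
{
    m->unused = m->requested ? 0 : m->unused + 1;
    m->requested = false;
    m->have_value = true;
    m->pushed_at = now;
    if (m->unused >= STOP_AFTER_UNUSED)
        m->pushing = false;   /* sketch only: real code closes connections */
}
```

Feeding it the events from the table (fetches at 0.1, 0.3, 1.1 and 1.3 sec, pushes at 0.25, 1.75, 3.25 and 4.75 sec, with a 1.5 sec refresh) reports the value as fresh for the three later fetches, keeps pushing enabled after 1.75 and 3.25, and turns it off at 4.75.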


>               
>   
>> * If no pm_fetch() comes for a metric after three refresh intervals
>> expire, the metric will not be requested anymore.
>>     
> So "not requested" suggests some feedback from the pmda to the daemon
> which may help explain my confusion over push/pull.
>   


If no client is requesting the data, pmdacluster closes the connections
to the pmclusterds, which will reconnect and block waiting for a new set
of metrics.


>   
>> The following are some assumptions, ideas, questions/answers that I
>> copied from some emails that I exchanged with some of my colleagues at
>> SGI.
>>
>> 1) The 2 second default will be tunable.
>>
>> 2) Pushes for different metrics will be grouped.
>>     
> Based on refresh interval?
>   


Yes, when possible (when their refresh intervals match).
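One way to form those groups, sketched under the assumption that each configured metric carries a refresh interval in ms: sort by interval and mark where each run of equal intervals starts. The type and function names are hypothetical, and the intervals in the example are illustrative only.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    unsigned int id;          /* metric identifier */
    unsigned int refresh_ms;  /* configured refresh interval */
} metric_cfg;

static int by_refresh(const void *a, const void *b)
{
    const metric_cfg *x = a, *y = b;
    if (x->refresh_ms != y->refresh_ms)
        return x->refresh_ms < y->refresh_ms ? -1 : 1;
    return x->id < y->id ? -1 : (x->id > y->id);
}

/* Sort in place so metrics sharing a refresh interval are adjacent, then
 * record where each group starts. 'start' needs room for n + 1 entries.
 * Returns the number of groups; group g is m[start[g]] .. m[start[g+1]-1]. */
size_t group_by_refresh(metric_cfg *m, size_t n, size_t *start)
{
    size_t g = 0;
    qsort(m, n, sizeof *m, by_refresh);
    for (size_t i = 0; i < n; i++)
        if (i == 0 || m[i].refresh_ms != m[i - 1].refresh_ms)
            start[g++] = i;
    start[g] = n;
    return g;
}
```

For example, metrics with intervals of 2000, 500, 2000 and 60000 ms end up in three groups, and each group could then be pushed together on a single timer.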


>   
>> 3) We will define the refresh interval for each metric meaning that we
>> will specify how long a value can be returned as it is (how long we
>> think the value stays fresh). If we get the second request after the
>> value has spoiled we will start asking for values so we have a fresh
>> value when the third request comes. We will stop collecting data if more
>> than twice the time between first request and second request has passed
>> and a third request has not come yet. We can make this configurable.
>>     
> I don't see how the "start asking for values" is really going to
> help ... if the push for metric X is defined to be every 2 seconds,
> then it will be every 2 seconds and the number of requests that
> might arrive from clients between refreshes is almost immaterial.
>
> There is the pathological case of the client's sample interval and the
> daemon's refresh interval being the same and the client request arriving
> right on the boundary of the value being marked stale ... but short of
> refreshing more frequently (which this proposal does not seem to
> suggest), this boundary condition is unavoidable in the worst (and
> unlikely) case.
>
> The refresh interval and pushing applies to all the instances of a
> particular metric?
>   


Yes. In the worst case we will not be able to group anything that the
pmclusterds push. We will still group metrics when we set the current
configuration (CLUSTER_PDU_CONFIG), but will not necessarily group the
metric values that the pmclusterds push to pmdacluster.



>   
>> 4) I will monitor all requests (possibly from multiple clients) and
>> refresh the data to accommodate the slowest pulling client.
>>     
> Sorry, I don't understand this at all ... "you" (being the pmda) cannot
> see the pmcd clients (as per my comment above), so who/what are these
> clients?
>
> In what sense is a client "slowest"?
>   


If there are multiple clients requesting the same data at intervals
longer than the refresh time, we will adjust the algorithm so that it
does not stop the pmclusterds from pushing the data, so we can satisfy
the slowest client. For example, if we stop the pmclusterds from pushing
data when no request has arrived for the last two values, and then we
receive a new request right after that, we could adjust the algorithm to
accommodate slower rates and let more pushed values go unused before
stopping the pmclusterds.
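A possible shape for that adjustment, as a sketch only: the proposal does not fix a rule, so doubling the threshold, the "recent stop" window and the MAX_UNUSED cap are all assumptions made here for illustration, as are the names.

```c
#include <stdbool.h>

#define MIN_UNUSED 2   /* initial number of unread pushes tolerated */
#define MAX_UNUSED 8   /* hypothetical cap so a metric cannot push forever */

typedef struct {
    int    allowed_unused;  /* current stop threshold for this metric */
    bool   stopped;         /* pushes currently stopped */
    double stopped_at;      /* when pushing was last stopped */
    double refresh;         /* seconds between pushes */
} stop_policy;

void policy_init(stop_policy *p, double refresh)
{
    p->allowed_unused = MIN_UNUSED;
    p->stopped = false;
    p->stopped_at = 0.0;
    p->refresh = refresh;
}

/* Called when the unread-push threshold is reached. */
void policy_stop(stop_policy *p, double now)
{
    p->stopped = true;
    p->stopped_at = now;
}

/* Called on a fetch. If pushing was stopped only recently (within one
 * threshold's worth of refresh intervals), assume a slower client is
 * polling and tolerate twice as many unread pushes next time. */
void policy_on_fetch(stop_policy *p, double now)
{
    if (p->stopped && now - p->stopped_at <= p->allowed_unused * p->refresh) {
        p->allowed_unused *= 2;
        if (p->allowed_unused > MAX_UNUSED)
            p->allowed_unused = MAX_UNUSED;
    }
    p->stopped = false;
}
```

With a 1.5 sec refresh, a request arriving 0.25 sec after a stop raises the threshold from 2 to 4 unread pushes, while a request arriving long after a stop leaves it unchanged.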


>   
>> 5) To not disrupt PMCD, pmdacluster will return "no value available"
>> error the first time we are queried if it takes too much time to get
>> the metrics from pmclusterd (e.g. longer than 5 sec).
>>     
> Really this should be PM_ERR_PMDANOTREADY/PM_ERR_PMDAREADY.
>   


Agree.


>   
>> 6) The API, and some tools, allow querying individual instances. We
>> still need to get instance domain info from each client when they
>> connect.  This volume of initial setup traffic could be an argument for
>> limiting intentional client disconnects, which is currently used as a
>> method to change the list of metrics that should be pushed.
>>     
> Which clients are these?
>   


[root@quiero-admin ~]# pmdumptext -h r1lead -s 1 -r 'cluster.kernel.percpu.cpu.user["r1i0n0-0 cpu0"]'
Thu May 6 09:11:45 8010.000


> Hope this helps, rather than confuses, the issue.
>
>   


I hope this is clearer now.

Thank you,
Cornel.
