
Re: [pcp] pmclusterd versus other solutions

To: Jeff Hanson <jhanson@xxxxxxx>
Subject: Re: [pcp] pmclusterd versus other solutions
From: Mark Goodwin <mgoodwin@xxxxxxxxxx>
Date: Tue, 13 Sep 2016 09:57:19 +1000
Cc: PCP <pcp@xxxxxxxxxxx>
Hi Jeff, the cluster PMDA log 'cluster.log' on the head node is full
of errors, e.g.:

[Mon Sep 12 08:27:34] pmdacluster(14849) Error: pmdaFetch: PMID 65.1.13 not handled by fetch callback

which means cluster_fetchCallBack() isn't finding a matching PMID and
instance for the requested metric/instance in the cached pmResult for
each node. Are all of your cluster nodes the same architecture and
endianness as the head node? The code assumes they are, since we're
sending binary pmResult structures from each cluster node and caching
them on the head node. Here's a code snippet:

    /*
     * Now find the pmid and instance in the cached result.  The domain and
     * cluster for each PMID in the result will be for the sub-PMDA that
     * returned it, so translate the pmDesc.pmID to match before comparing.
     */
    idsp->domain = subdom_dom_map[idsp->subdomain];
    idsp->subdomain = 0;
    sts = PM_ERR_PMID;
    for (i=0, r = tc->result; i < r->numpmid; i++) {
        if (pmid_domain( r->vset[i]->pmid) != pmid_domain( pmda_pmid) ||
            pmid_cluster(r->vset[i]->pmid) != pmid_cluster(pmda_pmid) ||
            pmid_item(   r->vset[i]->pmid) != pmid_item(   pmda_pmid) )
            continue;
        /* found the pmid, now look for the instance */
        sts = PM_ERR_INST;
        for (j=0; j < r->vset[i]->numval; j++) {
            v = &r->vset[i]->vlist[j];
            if (indom_int->serial == CLUSTER_INDOM ||
                v->inst == instp->node_inst) {
                /* found the instance, extract the value */
                if (r->vset[i]->valfmt == PM_VAL_INSITU)
                    memcpy(&atom->l, &v->value.lval, sizeof(atom->l));
                else
                    pmExtractValue(r->vset[i]->valfmt, v,
                                   v->value.pval->vtype, atom,
                                   v->value.pval->vtype);
                return 1;
            }
        }
    }
    return sts;
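
To make the endianness assumption concrete, here's a standalone
sketch (an illustration only, not pmclusterd source, and the pmID
value is made up): if a cluster node ships its raw in-memory bytes
and the head node has the opposite byte order, the comparison above
can never match and the fetch falls through to PM_ERR_PMID:

    /*
     * Standalone sketch (not pmclusterd source): why caching raw binary
     * pmResult bytes assumes every node has the head node's endianness.
     * 0x0820040d stands in for some packed domain/cluster/item pmID.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>      /* htonl, ntohl */

    int main(void)
    {
        uint32_t pmid = 0x0820040dU;
        unsigned char wire[4];
        uint32_t received;

        /* "send": a cluster node copies its in-memory bytes to the wire */
        memcpy(wire, &pmid, sizeof(pmid));

        /* "receive": simulate a head node of the opposite byte order */
        memcpy(&received, wire, sizeof(received));
        received = ((received & 0x000000ffU) << 24) |
                   ((received & 0x0000ff00U) <<  8) |
                   ((received & 0x00ff0000U) >>  8) |
                   ((received & 0xff000000U) >> 24);

        printf("node sent 0x%08x, head node sees 0x%08x -> %s\n",
               pmid, received,
               pmid == received ? "match" : "no match (PM_ERR_PMID)");

        /* conventional fix: agree on a canonical byte order on the wire */
        printf("htonl/ntohl round trip: 0x%08x\n", ntohl(htonl(pmid)));
        return 0;
    }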

Also, the other log I asked for is /var/log/pcp/pmclusterd.log on one
or more of the cluster nodes; that log won't be present on the head
node. Please attach it.
Regards
-- Mark

On Mon, Sep 12, 2016 at 11:40 PM, Jeff Hanson <jhanson@xxxxxxx> wrote:
> On 09/12/2016 12:31 AM, Mark Goodwin wrote:
>>>
>>> But the real problem is that although pmclusterd exposes some 100
>>> metrics or so, only 20 of them can actually be fetched.
>>
>>
>
> 88 default metrics, 22 are fetched.  The IB ones seem to be a
> different issue from the rest.
>
>> Jeff, do you ever see "cluster_node_rw: spinning" in either
>> /var/log/pcp/pmcd/cluster.log or /var/log/pcp/pmclusterd.log ?
>
>
> No.
>
>> Can you send me these logs after reproducing the issue where only some
>> (20 out of 100) metrics can be fetched but the others report the
>> instance domain issue?
>>
>
> Attached.
>
>> Thanks
>> -- Mark
>>
>> On Thu, Sep 1, 2016 at 3:59 PM, Mark Goodwin <mgoodwin@xxxxxxxxxx> wrote:
>>>
>>> Hi Jeff, I don't think we ever open-sourced pmclusterd since it was
>>> (at the time) SGI ICE specific,
>>> so it's unlikely anyone outside SGI will know much about it.
>>>
>>> This is the daemon that aggregates indoms for per-cluster-node CPU
>>> data on the head node, so
>>> the client tools just monitor the head node, right? If that's the tool
>>> framework you're referring to,
>>> I always thought it was a bit of an abomination of the indom concept
>>> (even though I wrote it!),
>>> but designed it that way to be more scalable than monitoring every
>>> cluster node individually.
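>>>
>>> To make that model concrete, here's a rough PMAPI client sketch (the
>>> metric name and host name are illustrative assumptions, not
>>> necessarily a real cluster.* metric): fetched from the head node,
>>> each cluster node shows up as one instance in the aggregated indom:
>>>
>>>     #include <stdio.h>
>>>     #include <pcp/pmapi.h>
>>>
>>>     int main(void)
>>>     {
>>>         char *metric = "cluster.kernel.all.cpu.user"; /* illustrative */
>>>         pmID pmid;
>>>         pmDesc desc;
>>>         pmResult *rp;
>>>         int sts, j;
>>>
>>>         /* monitor the head node only; it proxies the whole cluster */
>>>         if ((sts = pmNewContext(PM_CONTEXT_HOST, "headnode")) < 0 ||
>>>             (sts = pmLookupName(1, &metric, &pmid)) < 0 ||
>>>             (sts = pmLookupDesc(pmid, &desc)) < 0 ||
>>>             (sts = pmFetch(1, &pmid, &rp)) < 0) {
>>>             fprintf(stderr, "error: %s\n", pmErrStr(sts));
>>>             return 1;
>>>         }
>>>         for (j = 0; j < rp->vset[0]->numval; j++) {
>>>             pmValue *vp = &rp->vset[0]->vlist[j];
>>>             pmAtomValue atom;
>>>             char *node;
>>>
>>>             /* instance id -> cluster node name via the indom */
>>>             if (pmNameInDom(desc.indom, vp->inst, &node) < 0)
>>>                 node = "?";
>>>             pmExtractValue(rp->vset[0]->valfmt, vp, desc.type,
>>>                            &atom, PM_TYPE_DOUBLE);
>>>             printf("%s: %f\n", node, atom.d);
>>>         }
>>>         pmFreeResult(rp);
>>>         return 0;
>>>     }
>>>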
>>> What issues are you running into?
>>>
>>> Regards
>>> -- Mark
>>>
>>>
>>> On Thu, Sep 1, 2016 at 2:43 AM, Jeff Hanson <jhanson@xxxxxxx> wrote:
>>>>
>>>> As we (SGI) explore what to do about the scaling issues with pmclusterd
>>>> as it is currently written, I am exploring other options.  For cluster
>>>> configurations, are people generally running pmcd locally on the cluster
>>>> nodes and logging to the node?  Running pmcd locally on the cluster node
>>>> with another system as the logger?  Other thoughts?
>>>>
>>>> Thanks.
>>>> --
>>>> -----------------------------------------------------------------------
>>>> Jeff Hanson - jhanson@xxxxxxx - Senior Technical Support Engineer
>>>>
>>>> You can choose a ready guide in some celestial voice.
>>>> If you choose not to decide, you still have made a choice.
>>>> You can choose from phantom fears and kindness that can kill;
>>>> I will choose a path that's clear
>>>> I will choose freewill. - Peart
>>>>
>
>
>
> --
> -----------------------------------------------------------------------
> Jeff Hanson - jhanson@xxxxxxx - Senior Technical Support Engineer
>
> You can choose a ready guide in some celestial voice.
> If you choose not to decide, you still have made a choice.
> You can choose from phantom fears and kindness that can kill;
> I will choose a path that's clear
> I will choose freewill. - Peart
