[Bug 1036] pmcd should not permanently give up on tardy pmdas

To: pcp@xxxxxxxxxxx
Subject: [Bug 1036] pmcd should not permanently give up on tardy pmdas
From: bugzilla-daemon@xxxxxxxxxxx
Date: Thu, 21 Nov 2013 00:13:34 +0000
Auto-submitted: auto-generated
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <bug-1036-835@xxxxxxxxxxxxxxxx/bugzilla/>
References: <bug-1036-835@xxxxxxxxxxxxxxxx/bugzilla/>

Comment # 3 on bug 1036 from
> ... momentary system overload (whether deliberately induced by an attacker,
> or accidental) can affect all PMDAs.  So can a VM or power suspend/resume.
> So can systemwide clock jumps.

These all affect pmcd just as much as the PMDAs, so they do not seem to me to
be good examples to support your argument ... e.g. how often would a power
suspend/resume affect a PMDA and not pmcd itself!?!  That's just silly; no
amount of new pmcd code is going to help in these cases.

For all the realistic cases I can think of, or have observed, they are/were
domain-specific issues where the PMDA was in the best/only position to dodge
'em.

> So can some administrator's misguided debugging activity.

Surely not on a production system?  :)  For more responsible sysadmins, for
whom lack of service from pmcd and its merry agent-men is a problem, dbpmda(1)
would be a more suitable tool, and is readily available on their system if they
really must debug it on the production system itself.


There seems to be an underlying suggestion that this is readily fixable and
pmcd is taking this action lightly - "if only pmcd would just wait, surely
everything will be just peachy?"

That's simply not the case.  Life goes on while the tardy PMDA is off in the
weeds.  pmcd simply doesn't know *what is going on* with that PMDA, and it
could be minutes/hours before the PMDA resurfaces.  And who knows what state it
- and the horse/communication channel it rode in on - is in at that point
if/when it does resurface.

New clients may have arrived and sent in their authentication information,
which pmcd has duly forwarded on to all the (other) agents ... except that the
tardy agents, off doing who-knows-what for a while, have missed out.  Does pmcd
need to resend those PDUs?  (Yes, I think it does have to, and in order with
the other PDUs, which becomes a hard problem to solve inexpensively.)
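
To make that bookkeeping concrete, here is a rough Python sketch of the per-agent
backlog pmcd would have to carry just to replay missed PDUs in order.  It is an
illustration only: AgentChannel and its fields are made-up names, not pmcd
internals, and a PDU is just an opaque value here.

```python
from collections import deque


class AgentChannel:
    """Hypothetical stand-in for pmcd's view of one PMDA (not PCP code)."""

    def __init__(self, name):
        self.name = name
        self.responsive = True
        self.backlog = deque()  # PDUs the agent missed, oldest first
        self.delivered = []     # PDUs the agent has actually been sent

    def send(self, pdu):
        # While the agent is tardy, every PDU must be parked, not dropped,
        # otherwise the agent's view of client state diverges from pmcd's.
        if not self.responsive:
            self.backlog.append(pdu)
        else:
            self.delivered.append(pdu)

    def resume(self):
        # Replay the backlog strictly in arrival order before anything new.
        self.responsive = True
        while self.backlog:
            self.delivered.append(self.backlog.popleft())
```

Note the backlog has no upper bound in this sketch: it grows with every client
that connects while the agent is away, and that is only one PDU type for one
agent.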

Or what if the PMDA was in the middle of a PDU exchange with pmcd when it
stopped talking - does pmcd need to queue everything up that happens while the
PMDA is off in lala land, indefinitely?  (including partial PDU reads and
writes!!!  argh!)
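
The partial-read half of that looks something like the Python sketch below.
The framing is an assumption made purely for illustration (a 4-byte big-endian
length followed by the payload); it is not PCP's actual wire format, and
PartialReader is a hypothetical name.

```python
import struct


class PartialReader:
    """Reassembles length-prefixed messages from arbitrary byte chunks."""

    def __init__(self):
        self.buf = bytearray()

    def feed(self, data):
        """Add received bytes; return the list of messages now complete."""
        self.buf += data
        complete = []
        while len(self.buf) >= 4:
            (length,) = struct.unpack_from("!I", self.buf, 0)
            if len(self.buf) < 4 + length:
                break  # message is incomplete: keep the bytes and wait
            complete.append(bytes(self.buf[4:4 + length]))
            del self.buf[:4 + length]
        return complete

    def pending(self):
        """Bytes held for a message whose remainder has not arrived."""
        return len(self.buf)
```

If the agent stops talking mid-message, pending() stays non-zero for as long as
it is gone, and pmcd has no way to tell "slow" from "never coming back" without
some kind of timeout.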

Or clients have asked (sequentially) for names, PMIDs, descriptors, instances,
then started fetching - and one PMDA went away briefly somewhere in the middle
there, then comes back and we wonder why the client isn't getting values?  No
evidence trail either - later, the PMDA is running fine when we try to debug
the problem.


In theory it'd be nice to have; but in practice, I can't see how it can
possibly work in a reliable fashion, nor how it would fix the perceived problem
it sets out to fix.  And yes, I suspect developers would take this as an easy "out" to
not set good response time requirements for their PMDAs (alternatively, it'll
make responding quickly something they can come back to, someday - i.e. never).

"Perfect" is the enemy of "good" in this case, or so it seems to me anyway.

