pcp
[Top] [All Lists]

pcp updates: PM_ERR_TIMEOUT protocol changes

To: pcp@xxxxxxxxxxx
Subject: pcp updates: PM_ERR_TIMEOUT protocol changes
From: kenj@xxxxxxxxxxx (Ken McDonell)
Date: Mon, 1 Aug 2016 20:12:57 +1000 (AEST)
Delivered-to: pcp@xxxxxxxxxxx
I know it is late in the release cycle for change like this, but
subject to peer review I think the benefit-risk trade-off is in favour
of pulling this into the next release.

It avoids the cascaded problems (not the root cause) of the bug from
Netflix, namely http://oss.sgi.com/bugzilla/show_bug.cgi?id=1158
where the wrong PDU types for pmcd requests are observed.

Given the severity of the issue, it is hard to imagine these changes
making the situation worse ... and they demonstrably make it better
in a range of QA tests built to simulate the BZ 1158 failure signature.

There are still some __getPDU() uses to be analyzed, but these all
relate to context creation and should a similar problem occur there
and be undetected the changes here will catch it almost immediately
the real client-pmcd traffic starts.

At the end is an (unrelated) fix for the recalcitrant qa/381.

Changes committed to git://git.pcp.io/kenj/pcp master

Ken McDonell (6):
      qa/381: at last, problem identified
      src/libpcp/src/pdu.c: add fault injection check for __pmGetPDU()
      src/libpcp: make the pmcd-client protocol safer in the presence of 
timeouts
      qa/865 and 866 (new): exercise the new PM_ERR_TIMOUT protocols
      qa/866: tweak filtering
      qa/169 & 200: minor tweaks after PM_ERR_TIMEOUT protocol changes

 qa/169                     |   11 
 qa/169.out                 |    1 
 qa/200.out                 |    5 
 qa/381                     |    2 
 qa/381.out                 |   22 -
 qa/865                     |   75 ++++++
 qa/865.out                 |  161 ++++++++++++++
 qa/866                     |  133 ++++++++++++
 qa/866.out                 |  339 ++++++++++++++++++++++++++++++
 qa/group                   |    2 
 qa/src/GNUlocaldefs        |    2 
 qa/src/multictx.c          |  496 +++++++++++++++++++++++++++++++++++++++++++++
 qa/src/store.c             |   48 ++++
 src/include/pcp/fault.h    |    3 
 src/libpcp/src/context.c   |   21 +
 src/libpcp/src/desc.c      |    9 
 src/libpcp/src/fetch.c     |    3 
 src/libpcp/src/help.c      |   10 
 src/libpcp/src/instance.c  |   28 +-
 src/libpcp/src/internal.h  |    2 
 src/libpcp/src/pdu.c       |    3 
 src/libpcp/src/pmns.c      |   26 +-
 src/libpcp/src/store.c     |    9 
 src/libpcp_fault/pmfault.3 |   23 +-
 24 files changed, 1391 insertions(+), 43 deletions(-)

Details ...

commit 12cf46982cf0124ef3a312ad77608c73ec2e7486
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Mon Aug 1 19:57:18 2016 +1000

    qa/169 & 200: minor tweaks after PM_ERR_TIMEOUT protocol changes

commit d175228933515ad30d83b6eb46d2d880d0499534
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Mon Aug 1 16:15:59 2016 +1000

    qa/866: tweak filtering

commit dc9ee41f7683b5414bf9347b6c1c185b3f31fb1b
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Mon Aug 1 15:57:58 2016 +1000

    qa/865 and 866 (new): exercise the new PM_ERR_TIMOUT protocols
    
    865 uses the (Perl version of the) slow PMDA.
    
    866 uses new fault injection services in libpcp_fault.

commit 22758584ba8e3597355b73a98f3313086d57ebfb
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Mon Aug 1 15:49:11 2016 +1000

    src/libpcp: make the pmcd-client protocol safer in the presence of timeouts
    
    If a timeout occurs on a PDU read (in __pmGetPDU) the libpcp callers
    are now responsible for closing the channel that connects pmcd and all
    contexts associated with that pmcd in the current client app.
    
    Because this channel maybe multiplexed between contexts, a timeout
    introduces the possibility that incorrect PDUs are seen by other
    contexts and this breaks the synchronous communication between pmcd
    and a PMAPI context, or worse, delivers incorrect data to other PMAPI
    contexts.
    
    This change prevents that from happening, but the other contexts
    will need to call pmReconnectContext() before they can resume sending
    and receiving PDUs to and from the pmcd that triggered the original
    timeout condition.

commit 5f76a6400baafbb5e64046a6af1d68b65a6d4c29
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Mon Aug 1 15:47:07 2016 +1000

    src/libpcp/src/pdu.c: add fault injection check for __pmGetPDU()
    
    If the class PM_FAULT_TIMEOUT has been armed, return immediately
    with PM_ERR_TIMEOUT.
    
    Note this generates no code for the regular libpcp, it is only
    enabled in libpcp_fault.

commit 6eea1779a5e45ccfe47864e267cdd582e5a60c62
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Wed Jul 27 08:56:56 2016 +1000

    qa/381: at last, problem identified
    
    This test has been failing intermittently for some time and I've
    finally tracked it down ... the logic assumed that the primary logger
    was handling pmlc requests on port 4330.  This is true most of the
    time (test passes) but sometimes (especially with pmmgr in the mix)
    may not be so (test fails).
!v

<Prev in Thread] Current Thread [Next in Thread>