I know it is late in the release cycle for change like this, but
subject to peer review I think the benefit-risk trade-off is in favour
of pulling this into the next release.
It avoids the cascaded problems (not the root cause) of the bug from
Netflix, namely http://oss.sgi.com/bugzilla/show_bug.cgi?id=1158
where the wrong PDU types for pmcd requests are observed.
Given the severity of the issue, it is hard to imagine these changes
making the situation worse ... and they demonstrably make it better
in a range of QA tests built to simulate the BZ 1158 failure signature.
There are still some __getPDU() uses to be analyzed, but these all
relate to context creation and should a similar problem occur there
and be undetected the changes here will catch it almost immediately
the real client-pmcd traffic starts.
At the end is an (unrelated) fix for the recalcitrant qa/381.
Changes committed to git://git.pcp.io/kenj/pcp master
Ken McDonell (6):
qa/381: at last, problem identified
src/libpcp/src/pdu.c: add fault injection check for __pmGetPDU()
src/libpcp: make the pmcd-client protocol safer in the presence of
timeouts
qa/865 and 866 (new): exercise the new PM_ERR_TIMOUT protocols
qa/866: tweak filtering
qa/169 & 200: minor tweaks after PM_ERR_TIMEOUT protocol changes
qa/169 | 11
qa/169.out | 1
qa/200.out | 5
qa/381 | 2
qa/381.out | 22 -
qa/865 | 75 ++++++
qa/865.out | 161 ++++++++++++++
qa/866 | 133 ++++++++++++
qa/866.out | 339 ++++++++++++++++++++++++++++++
qa/group | 2
qa/src/GNUlocaldefs | 2
qa/src/multictx.c | 496 +++++++++++++++++++++++++++++++++++++++++++++
qa/src/store.c | 48 ++++
src/include/pcp/fault.h | 3
src/libpcp/src/context.c | 21 +
src/libpcp/src/desc.c | 9
src/libpcp/src/fetch.c | 3
src/libpcp/src/help.c | 10
src/libpcp/src/instance.c | 28 +-
src/libpcp/src/internal.h | 2
src/libpcp/src/pdu.c | 3
src/libpcp/src/pmns.c | 26 +-
src/libpcp/src/store.c | 9
src/libpcp_fault/pmfault.3 | 23 +-
24 files changed, 1391 insertions(+), 43 deletions(-)
Details ...
commit 12cf46982cf0124ef3a312ad77608c73ec2e7486
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Mon Aug 1 19:57:18 2016 +1000
qa/169 & 200: minor tweaks after PM_ERR_TIMEOUT protocol changes
commit d175228933515ad30d83b6eb46d2d880d0499534
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Mon Aug 1 16:15:59 2016 +1000
qa/866: tweak filtering
commit dc9ee41f7683b5414bf9347b6c1c185b3f31fb1b
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Mon Aug 1 15:57:58 2016 +1000
qa/865 and 866 (new): exercise the new PM_ERR_TIMOUT protocols
865 uses the (Perl version of the) slow PMDA.
866 uses new fault injection services in libpcp_fault.
commit 22758584ba8e3597355b73a98f3313086d57ebfb
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Mon Aug 1 15:49:11 2016 +1000
src/libpcp: make the pmcd-client protocol safer in the presence of timeouts
If a timeout occurs on a PDU read (in __pmGetPDU) the libpcp callers
are now responsible for closing the channel that connects pmcd and all
contexts associated with that pmcd in the current client app.
Because this channel maybe multiplexed between contexts, a timeout
introduces the possibility that incorrect PDUs are seen by other
contexts and this breaks the synchronous communication between pmcd
and a PMAPI context, or worse, delivers incorrect data to other PMAPI
contexts.
This change prevents that from happening, but the other contexts
will need to call pmReconnectContext() before they can resume sending
and receiving PDUs to and from the pmcd that triggered the original
timeout condition.
commit 5f76a6400baafbb5e64046a6af1d68b65a6d4c29
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Mon Aug 1 15:47:07 2016 +1000
src/libpcp/src/pdu.c: add fault injection check for __pmGetPDU()
If the class PM_FAULT_TIMEOUT has been armed, return immediately
with PM_ERR_TIMEOUT.
Note this generates no code for the regular libpcp, it is only
enabled in libpcp_fault.
commit 6eea1779a5e45ccfe47864e267cdd582e5a60c62
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Wed Jul 27 08:56:56 2016 +1000
qa/381: at last, problem identified
This test has been failing intermittently for some time and I've
finally tracked it down ... the logic assumed that the primary logger
was handling pmlc requests on port 4330. This is true most of the
time (test passes) but sometimes (especially with pmmgr in the mix)
may not be so (test fails).
!v
|