On 10/03/13 07:21, Ken McDonell wrote:
I had suspected, without any proof that PCP QA was running much slower.
I started to look at 169 failing with a error return of "Timeout waiting
for a response from PMCD" rather than "IPC protocol failure" which I
thought was a minor issue, but is in fact a regression ... when pmcd
times out on the pmda ipc, it used to (and should) send an ipc error
response to the client waiting on the pmda response.
This no longer happens ... the pmda timeout happens, but the client is
left hanging until its own timeout on the pmcd ipc goes off ... this is
wrong.
OK, there _is_ a regression here and test 169 failing is a libpcp/pmcd
code failure, not a QA failure.
Here is what I know ... I have no plausible hypothesis so I'm including
all my evidence in the hope that this helps someone else get a handle on
the root cause.
It is not a non-deterministic issue. 169 fails 20 out of 20 attempts.
BUT if you change 169 so that the pmcd tracing is not buffered, i.e.
pmstore pmcd.control.tracenobuf 1
after pmcd is reconfigured, then the test passes 20 out of 20 attempts.
From pmcd's log the critical failing section is ...
->14:56:06 Wrong PDU type: expecting DESC PDU, fd=6, got err=-12353:
Timeout waiting for a response from PMCD
-> Drop PMDA: domain=160, infd=5, outfd=6
end of DoDesc: sts=-12366
back from DoDesc: sts=-12366 connected=1
[Mon Mar 11 14:56:12] pmcd(7415) Error: HandleClientInput: error sending
Error PDU to client[1] IPC protocol failure
And the passing section is ...
->19:28:40 Wrong PDU type: expecting DESC PDU, fd=6, got err=-12353:
Timeout waiting for a response from PMCD
after pmcd_trace: sts=-12366
->19:28:40 Drop PMDA: domain=160, infd=5, outfd=6
[Mon Mar 11 19:28:40] pmcd(9269) Info: CleanupAgent ...
Cleanup "dummyproc" agent (dom 160): protocol failure for fd=6, exit(0)
<empty>
end of DoDesc: sts=-12366
back from DoDesc: sts=-12366 connected=1
->19:28:41 Xmit: ERROR PDU, fd=1027, err=-12366: IPC protocol failure
->19:28:41 End client: fd=1027
In the fail case pmcd waits 6 seconds after the bad PMDA response before
sending the response PDU to the client who's already given up because
pmcd took too long.
In the passing case, pmcd sends the IPC Error code in the ERROR PDU
immediately after the bad PMDA response as expected.
I have no clue why making the pmcd trace logs unbuffered leads to the
correct behaviour ... some I/O, signal, select, ... snarfoo as result of
recent changes in the whole socket/ipc layer?
I've attached both pmcd.logs ... ignore the DoDesc diags, I added these
in desperation.
But much more seriously, in the process of investigating this, I turned
on all diags for pmcd and arrggghhh .... millions of line of output of
the form
OK, this one is a diagnostic only issue when PM_DBG_APPL0 is set, so it
needs some rework but is not a release blocker.
pmcd.log.169.fail
Description: Text document
pmcd.log.169.pass
Description: Text document
|