pcp
[Top] [All Lists]

Re: [pcp] URGENT potentially serious regression in 3.7.0

To: pcp@xxxxxxxxxxx
Subject: Re: [pcp] URGENT potentially serious regression in 3.7.0
From: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Mon, 11 Mar 2013 19:49:23 +1100
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <513B99E4.7030007@xxxxxxxxxxxxxxxx>
References: <513B99E4.7030007@xxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130221 Thunderbird/17.0.3
On 10/03/13 07:21, Ken McDonell wrote:
I had suspected, without any proof that PCP QA was running much slower.

I started to look at 169 failing with a error return of "Timeout waiting
for a response from PMCD" rather than "IPC protocol failure" which I
thought was a minor issue, but is in fact a regression ... when pmcd
times out on the pmda ipc, it used to (and should) send an ipc error
response to the client waiting on the pmda response.

This no longer happens ... the pmda timeout happens, but the client is
left hanging until its own timeout on the pmcd ipc goes off ... this is
wrong.

OK, there _is_ a regression here and test 169 failing is a libpcp/pmcd code failure, not a QA failure.

Here is what I know ... I have no plausible hypothesis so I'm including all my evidence in the hope that this helps someone else get a handle on the root cause.

It is not a non-deterministic issue.  169 fails 20 out of 20 attempts.

BUT if you change 169 so that the pmcd tracing is not buffered, i.e.

pmstore pmcd.control.tracenobuf 1

after pmcd is reconfigured, then the test passes 20 out of 20 attempts.

From pmcd's log the critical failing section is ...

->14:56:06 Wrong PDU type: expecting DESC PDU, fd=6, got err=-12353: Timeout waiting for a response from PMCD
->         Drop PMDA: domain=160, infd=5, outfd=6

end of DoDesc: sts=-12366
back from DoDesc: sts=-12366 connected=1
[Mon Mar 11 14:56:12] pmcd(7415) Error: HandleClientInput: error sending Error PDU to client[1] IPC protocol failure

And the passing section is ...

->19:28:40 Wrong PDU type: expecting DESC PDU, fd=6, got err=-12353: Timeout waiting for a response from PMCD
after pmcd_trace: sts=-12366
->19:28:40 Drop PMDA: domain=160, infd=5, outfd=6
[Mon Mar 11 19:28:40] pmcd(9269) Info: CleanupAgent ...
Cleanup "dummyproc" agent (dom 160): protocol failure for fd=6, exit(0)
<empty>
end of DoDesc: sts=-12366
back from DoDesc: sts=-12366 connected=1
->19:28:41 Xmit: ERROR PDU, fd=1027, err=-12366: IPC protocol failure
->19:28:41 End client: fd=1027


In the fail case pmcd waits 6 seconds after the bad PMDA response before sending the response PDU to the client who's already given up because pmcd took too long.

In the passing case, pmcd sends the IPC Error code in the ERROR PDU immediately after the bad PMDA response as expected.

I have no clue why making the pmcd trace logs unbuffered leads to the correct behaviour ... some I/O, signal, select, ... snarfoo as result of recent changes in the whole socket/ipc layer?

I've attached both pmcd.logs ... ignore the DoDesc diags, I added these in desperation.

But much more seriously, in the process of investigating this, I turned
on all diags for pmcd and arrggghhh .... millions of line of output of
the form

OK, this one is a diagnostic only issue when PM_DBG_APPL0 is set, so it needs some rework but is not a release blocker.

Attachment: pmcd.log.169.fail
Description: Text document

Attachment: pmcd.log.169.pass
Description: Text document

<Prev in Thread] Current Thread [Next in Thread>