G'day Frank.
I've had another look at this test.
1. I think the non-determinism Frank observes can be handled in a different
way that does not make the test run for longer ... in other places the test
is correct if we get the expected outcome for N or N-1 or N+1 iterations.
I'll explore this and work with Frank (off the list, as no one else probably
cares and Frank has the environment in which the test was failing).
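To make 1. concrete, the tolerance check I have in mind is roughly the sketch below. The helper name within_one and its arguments are made up for illustration; this is not code from the test itself.

```shell
# Hypothetical helper (not from qa/518): succeed when the observed
# iteration count is N-1, N or N+1 for an expected count of N.
# Usage: within_one EXPECTED OBSERVED
within_one()
{
    _diff=$(( $2 - $1 ))
    [ "$_diff" -ge -1 ] && [ "$_diff" -le 1 ]
}
```

A test would then call something like "within_one 10 $count" and only report a failure when the count is further out than that.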
2. I am not sure why the kills are there at all (thanks for pointing out
the extra one that I was not expecting) ... seems to me we can do a better
job of filtering the output to ignore the other pmie instances (there is one
block of output from pcp -P per pmie instance). I'll take a look at this.
3. Frank's observations about pmie and signals are a bit concerning, although
I see evidence of "try TERM and if that does not work try KILL and repeat
until successful or timeout" in the pmie init script, so perhaps this is a
long-standing problem that has been masked by hackery. pmie does have a
TERM signal handler and a delayed exit but only after the nanosleep() ... so
if we are blocked somewhere else, or don't abandon expression evaluation
completely when an I/O call returns with EINTR, then we could be off in the weeds
long enough for some script to believe pmie has not died.
Any insight into 3. would be helpful.
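For anyone following along, the "try TERM, then KILL, repeat until successful or timeout" pattern mentioned in 3. looks roughly like the sketch below. This is not the code from the pmie init script; the name stop_pid, the one second wait and the default of 5 tries are all assumptions for illustration.

```shell
# Hypothetical helper (not the pmie init script): ask a process to exit
# with TERM, escalate to KILL if it is still there, and give up after a
# bounded number of tries.
# Usage: stop_pid PID [MAX_TRIES]
stop_pid()
{
    _pid=$1
    _tries=${2:-5}
    _sig=TERM
    while [ "$_tries" -gt 0 ]
    do
        # kill -0 sends no signal, it only tests whether we can signal
        # the process; failure is taken to mean the process has gone
        kill -0 "$_pid" 2>/dev/null || return 0
        kill -s "$_sig" "$_pid" 2>/dev/null
        sleep 1                 # give the process time to exit
        _sig=KILL               # every later attempt uses KILL
        _tries=$((_tries - 1))
    done
    # out of tries: report failure if the process is still there
    if kill -0 "$_pid" 2>/dev/null
    then
        return 1
    fi
    return 0
}
```

The point is that a loop like this hides a pmie that is slow to act on TERM, because the KILL on the next pass cleans up regardless.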
> -----Original Message-----
> From: Frank Ch. Eigler [mailto:fche@xxxxxxxxxx]
> Sent: Saturday, 1 November 2014 12:04 PM
> To: Ken McDonell
> Cc: 'pcp developers'
> Subject: Re: [pcp] qa/518 tweaks on pcpfans.git fche/dev
>
> ...
> I'll try to trace it with something like systemtap. (Even with pmmgr I
> encountered cases where a single SIGTERM sent to pmie was blocked/ignored,
> so sudo is probably not a necessary component of the
> problem.)