On Tue, 23 Oct 2007, Nathan Scott wrote:
> On Mon, 2007-10-22 at 12:58 +1000, Michael Newton wrote:
> > qa/296 fails with daemon PMDAs hanging round after a reconfigure (ie
> > kill -HUP to pmcd). It appears that they need to be wait()ed upon.
> > Prior to the addition of the condition || ap->ipcType == AGENT_DSO
> > to HarvestAgents in agent.c this condition was masked by other cleanup.
> > Previously, CleanupAgent did not waitpid() for AT_CONFIG. In addition,
> > there is no reason to sleep every time round the waitpid() loop, and in
> > many cases even once may be unnecessary
>
> Looks mostly OK, except I don't follow the first part of the last
> sentence - why is it not necessary to wait4/waitpid until we're sure
> the child has exited? It would seem that signal state changes in the
> agent could cause the parent to prematurely exit CleanupAgent, here...
>
> > + for ( ; ; ) {
> >
> > #if defined(HAVE_WAIT3)
> > - done = wait3(&wait_status, WNOHANG, NULL);
> > + done = wait3(&wait_status, WNOHANG, NULL);
> > #elif defined(HAVE_WAITPID)
> > - done = waitpid((pid_t)-1, &wait_status, WNOHANG);
> > + done = waitpid((pid_t)-1, &wait_status, WNOHANG);
> > #else
> > - break;
> > + done = 0;
> > #endif
> > - if (done <= 0)
> > - break;
> > - else if (done == pid)
> > - exit_status = wait_status;
> > + if (done == pid) {
> > + exit_status = wait_status;
> > + break;
> > + }
> > + if (done > 0) {
> > + continue;
> > }
> > + if (slept) {
> > + break;
> > + }
> > + /* give PMDA a chance to notice the close() and exit */
> > + sleep(1);
> > + slept = 1;
> > }
>
> WAIT(2) says:
> waitpid(): on success, returns the process ID of the child whose state
> has changed; on error, -1 is returned; if WNOHANG was specified and no
> child(ren) specified by pid has yet changed state, then 0 is returned.
>
> We specify WNOHANG, so it'd seem we could get through to the "if (slept)
> break;" logic, which would potentially break out of the loop too early,
> wouldn't it?
>
> If my logic isn't flawed, looks like it can be resolved by removing the
> slept variable and associated logic?
you are right that the code does not keep waiting until it sees an exit
for the child its looking for at that time.. and it never has. We may pick
up exits for other children.. and since we may have actually closed
multiple connections before coming in here, those children may get looked
for in a later pass thru this code, when we've already seen their exit in
this one... so we better not wait indefinitely. Basically we keep calling
waitpid until it errors, but if you call it immediately the client may not
have had a chance to exit yet, so some kind of sleep is required... now i
dont think its necessary to keep sleeping for a whole second after every
waitpid... if youve slept for a second, that oughta be enough, i think..
but maybe it would be better to nanosleep each time, only just for a
couple ticks?
one could i suppose instead go close, wait, close, wait etc.. but how long
a sleep do we need for each one? perhaps the better alternative is to have
code that waits on an array of pids.. i could perhaps raise a PV in that
direction..its just that the code here is a twisty maze of tunnels, all
alike, with the potential for an unintended indirect recursion if one is
not careful, and i have other priorities, so i prefer for now to just
code conservatively, and only fix what im reasonably sure is broke. If you
do go thru without getting notified of a given child, the result is one
zombie.. and this only happens in the restart (SIGHUP) case, which isnt
even accessible from /etc/init.d/pcp
Dr.Michael("Kimba")Newton kimbrr@xxxxxxx
|