pcp
[Top] [All Lists]

Re: [patch] waitpid on dying agents at reconfig (qa/296) (fwd)

To: Michael Newton <kimbrr@xxxxxxx>
Subject: Re: [patch] waitpid on dying agents at reconfig (qa/296) (fwd)
From: Nathan Scott <nscott@xxxxxxxxxx>
Date: Tue, 23 Oct 2007 15:02:56 +1000
Cc: pcp@xxxxxxxxxxx
In-reply-to: <Pine.SGI.4.58.0710221255510.75139806@xxxxxxxxxxxxxxxxxxxxxxx>
Organization: Aconex
References: <Pine.SGI.4.58.0710221255510.75139806@xxxxxxxxxxxxxxxxxxxxxxx>
Reply-to: nscott@xxxxxxxxxx
Sender: pcp-bounce@xxxxxxxxxxx
On Mon, 2007-10-22 at 12:58 +1000, Michael Newton wrote:
> here it is again

( thanks - not sure how I managed to miss this one - apologies! )

> ---------- Forwarded message ----------
> Date: Tue, 9 Oct 2007 15:19:43 +1000
> From: Michael Newton <kimbrr@xxxxxxx>
> To: pcp@xxxxxxxxxxx
> Cc: nscott@xxxxxxxxxx
> Subject: [patch] waitpid on dying agents at reconfig (qa/296)
> 
> Nathan, can you review please?
> 
> qa/296 fails with daemon PMDAs hanging round after a reconfigure (ie
> kill -HUP to pmcd). It appears that they need to be wait()ed upon.
> Prior to the addition of the condition  || ap->ipcType == AGENT_DSO
> to HarvestAgents in agent.c this condition was masked by other cleanup.
> Previously, CleanupAgent did not waitpid() for AT_CONFIG. In addition,
> there is no reason to sleep every time round the waitpid() loop, and in
> many cases even once may be unnecessary

Looks mostly OK, except I don't follow the first part of the last
sentence - why is it not necessary to wait4/waitpid until we're sure
the child has exited?  It would seem that signal state changes in the
agent could cause the parent to prematurely exit CleanupAgent, here...

> +         for ( ; ; ) {
> 
>  #if defined(HAVE_WAIT3)
> -                 done = wait3(&wait_status, WNOHANG, NULL);
> +             done = wait3(&wait_status, WNOHANG, NULL);
>  #elif defined(HAVE_WAITPID)
> -                 done = waitpid((pid_t)-1, &wait_status, WNOHANG);
> +             done = waitpid((pid_t)-1, &wait_status, WNOHANG);
>  #else
> -                 break;
> +             done = 0;
>  #endif
> -                 if (done <= 0)
> -                     break;
> -                 else if (done == pid)
> -                     exit_status = wait_status;
> +             if (done == pid) {
> +                 exit_status = wait_status;
> +                 break;
> +             }
> +             if (done > 0) {
> +                 continue;
>               }
> +             if (slept) {
> +                 break;
> +             }
> +             /* give PMDA a chance to notice the close() and exit */
> +             sleep(1);
> +             slept = 1;
>           }

WAIT(2) says:
waitpid(): on success, returns the process ID of the child whose  state
has  changed; on error, -1 is returned; if WNOHANG was specified and no
child(ren) specified by pid has yet changed state, then 0 is  returned.

We specify WNOHANG, so it'd seem we could get through to the "if (slept)
break;" logic, which would potentially break out of the loop too early,
wouldn't it?

If my logic isn't flawed, looks like it can be resolved by removing the
slept variable and associated logic?

cheers.

-- 
Nathan


<Prev in Thread] Current Thread [Next in Thread>