In AcceptNewClient, when the accept() fails, I suspect there is nothing
useful to report in client[i].addr.sin_addr (worse, it is uninitialized at
that point, so it will report something, just not a useful something).
The "mproxy-client 1" message is a bit of a concern. This suggests
we're seeing the initial message getting broken up ... pmproxy should
receive "pmproxy-client 1" from the client, so the initial "p" is
missing, possibly gobbled in the previous connection attempt that failed
and/or we've lost the synchronization altogether since this is happening
on the pmcd port line? Looking closer, pmproxy was expecting "hostname
port" and got "pmproxy-client 1".
Hypothesis - after accept() and reading the "pmproxy-client 1" line,
pmproxy sends the ack back and starts reading the "hostname port" line ...
if this read returns '\n' or '\r' at the first attempt, then:
    bp == buf
    buf[] still contains "pmproxy-client 1"
    *bp = '\0'
    we strdup an empty string for pmcd_hostname
    bp++ (so bp -> "mproxy-client 1")
    and strtoul chokes, not surprisingly
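To make the hypothesis concrete, here is a minimal sketch of that sequence.
It is NOT the pmproxy source: parse_host_port() and its arguments are
hypothetical stand-ins, and I am assuming the parser splits the line at the
first separator, strdups the left part and hands the rest to strtoul.

```c
#define _POSIX_C_SOURCE 200809L   /* for strdup() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical parser for a "hostname port" line (not the pmproxy code).
 * Splits buf in place at the first blank or line terminator.
 * Returns 0 on success, -1 if no numeric port follows the separator.
 * On return *rest points at the text that was handed to strtoul.
 */
static int
parse_host_port(char *buf, char **hostname, unsigned long *port, char **rest)
{
    char *bp = buf;
    char *endp;

    assert(buf != NULL);

    /* advance to the first separator or line terminator */
    while (*bp != '\0' && *bp != ' ' && *bp != '\n' && *bp != '\r')
        bp++;

    if (*bp == '\0') {
        /* no separator at all, so there is no port to parse */
        *hostname = strdup(buf);
        *rest = bp;
        return -1;
    }

    *bp = '\0';                 /* terminate the hostname part */
    *hostname = strdup(buf);    /* empty string when bp == buf */
    bp++;                       /* step over the separator */
    *rest = bp;

    *port = strtoul(bp, &endp, 10);
    if (endp == bp)
        return -1;              /* strtoul consumed no digits */
    return 0;
}
```

If buf still holds "pmproxy-client 1" from the version line and the next
read delivers a lone '\n' into buf[0], calling this on buf yields an empty
hostname and a "port" of "mproxy-client 1", which matches the log line.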
The patch looks fine and does fix A problem ... I'm just not sure it is
YOUR problem.
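
For reference, the problem the patch does fix (the uninitialised pmcd_fd
described in the quoted message below) can be sketched roughly as follows.
The struct layout and the function names grow_clients() and delete_client()
are hypothetical, not the ones in src/pmproxy/client.c; only the pmcd_fd
field name comes from the original report.

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical client table entry; only pmcd_fd is named in the report. */
typedef struct {
    int fd;         /* socket to the client, -1 when unused */
    int pmcd_fd;    /* socket to pmcd, -1 until we connect */
} client_t;

/*
 * Grow the client array from old_n to new_n entries and mark every new
 * entry as having no open descriptors. Without the loop, the new entries
 * contain whatever realloc() left there.
 */
static client_t *
grow_clients(client_t *clients, int old_n, int new_n)
{
    client_t *tmp = realloc(clients, (size_t)new_n * sizeof(*tmp));
    int i;

    if (tmp == NULL)
        return NULL;
    for (i = old_n; i < new_n; i++) {
        tmp[i].fd = -1;
        tmp[i].pmcd_fd = -1;
    }
    return tmp;
}

/*
 * Tear down one entry. Returns the number of descriptors closed.
 * The >= 0 tests are only safe if the fields were initialised to -1;
 * a junk value here could close an unrelated fd such as the listen socket.
 */
static int
delete_client(client_t *cp)
{
    int closed = 0;

    if (cp->fd >= 0) {
        close(cp->fd);
        cp->fd = -1;
        closed++;
    }
    if (cp->pmcd_fd >= 0) {
        close(cp->pmcd_fd);
        cp->pmcd_fd = -1;
        closed++;
    }
    return closed;
}
```

With the -1 initialisation in place, deleting an entry that never got as
far as connecting to pmcd closes nothing.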
On Tue, 2009-03-31 at 21:24 +1100, Nathan Scott wrote:
> ----- "Nathan Scott" <nscott@xxxxxxxxxx> wrote:
>
> > Hi all,
> >
> > We're observing an occasional problem when running pmproxy.
> >
> > Every now and again, pmproxy stops listening on its port. This can be
> > seen with "netstat -tulnp" - the process is still running, but it isn't
> > accepting new connections. The log file tends to have one or two client
> > connection attempts listed, partially-setup but failed, and then it just
> > stops talking to the world.
>
> After a few hours of sitting on a firewall, poking and prodding
> with gdb, I think I understand now why it stops listening. This
> patch seems to fix this problem for me, but since it's intermittent
> it's difficult to tell for sure.
>
> I believe the root cause is in src/pmproxy/client.c. NewClient
> allocates chunks of new client entries in the client array, and
> most fields are not initialised there, but rather in the caller
> AcceptNewClient. One field that is not initialised is pmcd_fd,
> that is only set later, once we try to establish the socket to
> pmcd.
>
> DeleteClient is called at many points in the initial connection
> setup attempt. It tests to see if pmcd_fd is non-negative, and
> if so closes it - for some of the early dropped connections, we
> will have random junk in this field, and it's possible we could
> call close on the fd we listen(2) and accept(2) on ... I think
> that would get us in the situation where pmproxy still runs but
> doesn't take any new connections until restarted.
>
> > AcceptNewClient: bad pmcd port "mproxy-client 1" recv from client at [IPADDR]
> > AcceptNewClient: bad version string () recv from client at [IPADDR]
> > AcceptNewClient: bad pmcd port "mproxy-client 1" recv from client at [IPADDR]
>
> I'm not completely sure about this class of error though; they do
> not seem likely to me to be caused by the same thing. Could it
> just be (near-)immediately-dropped connections?
>
> I've attached a patch which:
> - initialises pmcd_fd appropriately to prevent accidental fd close
> - uses send/recv, so this code can work on Windows
> - adds timestamped logging for all pmproxy connection failures, to
> replace the non-timestamped logging there now.
> - improves error reporting for a dodgy host/port handshake line
>
> Lemme know if this sounds correct (code review wouldn't go astray
> here), and if anyone has any plausible theories for the log lines
> above I'd love to hear them.
>
> cheers.