----- "Nathan Scott" <nscott@xxxxxxxxxx> wrote:
> Hi all,
>
> We're observing an occasional problem when running pmproxy.
>
> Every now and again, pmproxy stops listening on its port. This can be
> seen with "netstat -tulnp" - the process is still running, but it isn't
> accepting new connections. The log file tends to have one or two client
> connection attempts listed, partially-setup but failed, and then it just
> stops talking to the world.

After a few hours of sitting on a firewall, poking and prodding
with gdb, I think I understand now why it stops listening. This
patch seems to fix the problem for me, but since it's intermittent
it's difficult to tell for sure.

I believe the root cause is in src/pmproxy/client.c. NewClient
allocates chunks of new client entries in the client array, and
most fields are not initialised there, but rather in the caller
AcceptNewClient. One field that is not initialised is pmcd_fd,
which is only set later, once we try to establish the socket to
pmcd.

DeleteClient is called at many points in the initial connection
setup attempt. It tests to see if pmcd_fd is non-negative, and
if so closes it. For some of the early dropped connections, we
will have random junk in this field, and it's possible we could
call close on the fd we listen(2) and accept(2) on ... I think
that would get us into the situation where pmproxy still runs but
doesn't take any new connections until restarted.
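
To make the failure mode above concrete, here is a minimal sketch of the
fix as I understand it. This is not the actual pmproxy source: the struct
layout and the names grow_clients and delete_client are hypothetical
stand-ins for the client array, NewClient and DeleteClient. The idea is
that every freshly allocated slot gets -1 in its descriptor fields, so an
early teardown can never close a descriptor the slot does not own.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical, simplified client entry; the real structure in
 * src/pmproxy/client.c carries more fields than this. */
typedef struct {
    int fd;        /* socket to the client, -1 if not open */
    int pmcd_fd;   /* socket to pmcd, -1 until connected */
} client_t;

client_t *clients;   /* growable array of client slots */
int nclients;        /* number of slots currently allocated */

/* Grow the array by `chunk` slots and mark every new slot as owning no
 * descriptors. Returns the index of the first new slot, or -1 if the
 * allocation fails (the existing array is left untouched). */
int grow_clients(int chunk)
{
    client_t *tmp;
    int first, i;

    assert(chunk > 0);
    tmp = realloc(clients, (size_t)(nclients + chunk) * sizeof(*tmp));
    if (tmp == NULL)
        return -1;
    clients = tmp;
    for (i = nclients; i < nclients + chunk; i++) {
        clients[i].fd = -1;
        clients[i].pmcd_fd = -1;   /* without this, realloc leaves junk here */
    }
    first = nclients;
    nclients += chunk;
    return first;
}

/* Tear down a slot, closing only descriptors that were really opened. */
void delete_client(client_t *cp)
{
    if (cp->fd >= 0) {
        close(cp->fd);
        cp->fd = -1;
    }
    if (cp->pmcd_fd >= 0) {
        close(cp->pmcd_fd);
        cp->pmcd_fd = -1;
    }
}
```

With the -1 initialisation in place, calling delete_client on a slot that
never reached the pmcd connection step is a no-op for pmcd_fd, so the
listening socket cannot be closed by accident.
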
> AcceptNewClient: bad pmcd port "mproxy-client 1" recv from client at [IPADDR]
> AcceptNewClient: bad version string () recv from client at [IPADDR]
> AcceptNewClient: bad pmcd port "mproxy-client 1" recv from client at [IPADDR]

I'm not completely sure about this class of error, though; these
do not seem likely to me to be caused by the same thing. Could it
just be (near-)immediately-dropped connections?

I've attached a patch which:
- initialises pmcd_fd appropriately to prevent accidental fd close
- uses send/recv, so this code can work on Windows
- adds timestamped logging for all pmproxy connection failures, to
replace the non-timestamped logging there now.
- improves error reporting for a dodgy host/port handshake line
Lemme know if this sounds correct (code review wouldn't go astray
here), and if anyone has any plausible theories for the log lines
above I'd love to hear them.
cheers.
--
Nathan
pmproxy.patch
Description: Binary data