
Re: [pcp] pmproxy intermittent failure

To: Nathan Scott <nscott@xxxxxxxxxx>
Subject: Re: [pcp] pmproxy intermittent failure
From: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Wed, 01 Apr 2009 07:11:23 +1100
Cc: pcp@xxxxxxxxxxx
In-reply-to: <2010589101.2027921238495067825.JavaMail.root@xxxxxxxxxxxxxxxxxx>
References: <2010589101.2027921238495067825.JavaMail.root@xxxxxxxxxxxxxxxxxx>
Reply-to: kenj@xxxxxxxxxxxxxxxx
In AcceptNewClient, when the accept() fails, I suspect there is nothing
useful to report in client[i].addr.sin_addr (worse, it is uninitialized at
that point, so it will report something, just not a useful something).
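
A minimal sketch of what reporting on a failed accept() could look like,
zeroing the address first and logging errno instead of the peer address.
The function name accept_or_report() and its message buffer are
hypothetical and are not taken from the pmproxy source:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper, not the real AcceptNewClient.
 * Clears *addr before calling accept() so a failure never leaves
 * uninitialized bytes behind, and describes the failure via errno
 * rather than via the (meaningless) peer address.
 */
int accept_or_report(int listen_fd, struct sockaddr_in *addr,
                     char *msg, size_t msglen)
{
    socklen_t len = sizeof(*addr);
    int fd;

    memset(addr, 0, sizeof(*addr));
    fd = accept(listen_fd, (struct sockaddr *)addr, &len);
    if (fd < 0)
        snprintf(msg, msglen, "accept failed: %s", strerror(errno));
    return fd;
}
```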

The "mproxy-client 1" message is a bit of a concern.  This suggests
we're seeing the initial message getting broken up ... pmproxy should
receive "pmproxy-client 1" from the client, so the initial "p" is
missing, possibly gobbled in the previous connection attempt that failed,
and/or we've lost synchronization altogether, since this is happening
on the pmcd port line.  Looking closer, pmproxy was expecting "hostname
port" and got "pmproxy-client 1".

Hypothesis - after accept() and reading the "pmproxy-client 1" line,
pmproxy sends ack back and starts reading the "hostname port" line .. if
this read returns '\n' or '\r' at the first attempt ...

bp == buf
buf[] contains "pmproxy-client 1"
*bp = '\0'
we strdup an empty string for pmcd_hostname
bp++ (so -> "mproxy-client 1")
and strtoul chokes, not surprisingly
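
The sequence above can be replayed in isolation.  This is only a sketch
of the hypothesis, not the pmproxy code: simulate_second_read() and
struct parsed are hypothetical names, and the buffer is assumed to still
hold the previous "pmproxy-client 1" line when the next read returns a
line terminator as its first byte:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical result record for the simulated "hostname port" parse. */
struct parsed {
    char host[64];          /* hostname copied from the front of buf */
    const char *port_text;  /* text handed to strtoul */
    unsigned long port;
    int port_ok;            /* 1 if port_text was a clean decimal number */
};

/*
 * Replays the hypothesis: buf still holds the previous line and the
 * first byte read for the "hostname port" line is '\n' or '\r'.
 */
struct parsed simulate_second_read(char *buf)
{
    struct parsed r;
    char *bp = buf;     /* bp == buf: nothing consumed yet */
    char *endp;

    *bp = '\0';         /* terminator seen at once, so buf[0] is cleared */
    snprintf(r.host, sizeof(r.host), "%s", buf);    /* empty hostname */
    bp++;               /* step past the terminator into stale bytes */
    r.port_text = bp;
    r.port = strtoul(bp, &endp, 10);
    r.port_ok = (endp != bp && *endp == '\0');
    return r;
}
```

Called with a buffer holding "pmproxy-client 1", this yields an empty
host and hands "mproxy-client 1" to strtoul, which consumes no digits,
consistent with the "bad pmcd port" log line.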

The patch looks fine and does fix A problem ... I'm just not sure it is
YOUR problem.
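
For reference, the pmcd_fd part of the fix described in the quoted
message below can be sketched as follows.  This is an illustration under
assumptions, not the attached patch: client_t, grow_clients() and
should_close_pmcd() are hypothetical names, and the real structure in
src/pmproxy/client.c has more fields:

```c
#include <stdlib.h>

/* Hypothetical, cut-down client entry. */
typedef struct {
    int fd;         /* socket to the client, -1 if unused */
    int pmcd_fd;    /* socket to pmcd, -1 until it is opened */
} client_t;

/*
 * Grows the client array by `chunk` entries and marks every new entry
 * as having no open descriptors, so a later cleanup cannot mistake
 * leftover heap contents for a valid fd.
 * Returns the (possibly moved) array, or NULL on allocation failure.
 */
client_t *grow_clients(client_t *clients, int *nclients, int chunk)
{
    int i;
    client_t *p = realloc(clients, (size_t)(*nclients + chunk) * sizeof(*p));

    if (p == NULL)
        return NULL;
    for (i = *nclients; i < *nclients + chunk; i++) {
        p[i].fd = -1;
        p[i].pmcd_fd = -1;
    }
    *nclients += chunk;
    return p;
}

/* The cleanup-side test: only a non-negative pmcd_fd gets closed. */
int should_close_pmcd(const client_t *c)
{
    return c->pmcd_fd >= 0;
}
```

With the -1 sentinel in place, a connection dropped before the pmcd
socket is opened never reaches close() on an unrelated descriptor such
as the listening socket.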


On Tue, 2009-03-31 at 21:24 +1100, Nathan Scott wrote:
> ----- "Nathan Scott" <nscott@xxxxxxxxxx> wrote:
> 
> > Hi all,
> > 
> > We're observing an occasional problem when running pmproxy.
> > 
> > Every now and again, pmproxy stops listening on its port.  This can be
> > seen with "netstat -tulnp" - the process is still running, but it isn't
> > accepting new connections.  The log file tends to have one or two client
> > connection attempts listed, partially-setup but failed, and then it just
> > stops talking to the world.
> 
> After a few hours of sitting on a firewall, poking and prodding
> with gdb, I think I understand now why it stops listening.  This
> patch seems to fix this problem for me, but since it's intermittent
> it's difficult to tell for sure.
> 
> I believe the root cause is in src/pmproxy/client.c.  NewClient
> allocates chunks of new client entries in the client array, and
> most fields are not initialised there, but rather in the caller
> AcceptNewClient.  One field that is not initialised is pmcd_fd,
> that is only set later, once we try to establish the socket to
> pmcd.
> 
> DeleteClient is called at many points in the initial connection
> setup attempt.  It tests to see if pmcd_fd is non-negative, and
> if so closes it - for some of the early dropped connections, we
> will have random junk in this field, and it's possible we could
> call close on the fd we listen(2) and accept(2) on ... I think
> that would get us in the situation where pmproxy still runs but
> doesn't take any new connections until restarted.
> 
> > AcceptNewClient: bad pmcd port "mproxy-client 1" recv from client at [IPADDR]
> > AcceptNewClient: bad version string () recv from client at [IPADDR]
> > AcceptNewClient: bad pmcd port "mproxy-client 1" recv from client at [IPADDR]
> 
> I'm not completely sure about this class of error though, they do
> not seem likely to me to be caused by the same thing.  Could it
> just be (near-)immediately-dropped connections?
> 
> I've attached a patch which:
> - initialises pmcd_fd appropriately to prevent accidental fd close
> - uses send/recv, so this code can work on Windows
> - adds timestamped logging for all pmproxy connection failures, to
>   replace the non-timestamped logging there now
> - improves error reporting for a dodgy host/port handshake line
> 
> Lemme know if this sounds correct (code review wouldn't go astray
> here), and if anyone has any plausible theories for the log lines
> above I'd love to hear them.
> 
> cheers.
