On 05/03/2016 02:26 AM, Nathan Scott wrote:
Hi Dave,
I came across this quirky libpcp networking behaviour today as
I was looking further into Rares' recently reported issue ...
$ time /usr/libexec/pcp/bin/pmcd_wait -h oss.sgi.com -t 2
real 0m6.349s
user 0m0.001s
sys 0m0.005s
The -t 2 there sets PMCD_CONNECT_TIMEOUT. So, what I think we
see here (timeout taking 3x longer than expected) is that the
getaddrinfo loop in __pmAuxConnectPMCDPort causes the timeout
to be (re-)applied for each address returned. strace shows we
definitely see 3 connect() attempts in the above example.
Yes, that's definitely what's happening.
Not sure what the correct behaviour should be here - thoughts?
Seems like it's probably not doing what users would expect atm.
The delay is applied during the call to __pmSelectWrite(). One thing we
could try would be to open a socket for each address, use the select to
wait on all of them at once, and choose the one that's selected. If the
timeout expires, then we can assume that they all timed out and we will
have applied the timeout once for all of the addresses. I can't think of
another way to apply one timeout while trying all of the addresses.
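
As a rough illustration of the idea (not libpcp code), here is a minimal
sketch in Python rather than C. The function name connect_any is
hypothetical, and the sketch assumes POSIX select() semantics, where a
non-blocking connect that finishes (or fails) marks the socket writable and
SO_ERROR tells the two cases apart. It starts a connect on every address
returned by getaddrinfo, waits on all of them against one shared deadline,
and keeps the first one that succeeds:

```python
import errno
import select
import socket
import time


def connect_any(host, port, timeout):
    """Start a connect to every address of host:port, return the first winner.

    A single deadline covers all attempts. Returns a connected socket, or
    None if no address connected before the deadline.
    """
    deadline = time.monotonic() + timeout
    pending = []
    winner = None

    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
        except OSError:
            continue  # address family not usable on this host
        sock.setblocking(False)
        err = sock.connect_ex(addr)
        if err == 0:
            winner = sock  # connected immediately
            break
        if err in (errno.EINPROGRESS, errno.EWOULDBLOCK):
            pending.append(sock)  # connect is in flight
        else:
            sock.close()  # failed outright, e.g. refused or unreachable

    # Wait on all in-flight connects at once, never past the shared deadline.
    while winner is None and pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        _, writable, _ = select.select([], pending, [], remaining)
        if not writable:
            break  # the one timeout expired for every address
        for sock in writable:
            pending.remove(sock)
            so_error = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            if winner is None and so_error == 0:
                winner = sock
            else:
                sock.close()  # failed, or a later duplicate success

    # Abandon whatever is still in flight.
    for sock in pending:
        sock.close()
    if winner is not None:
        winner.setblocking(True)
    return winner
```

The loop around select() is there because a socket can become writable due
to a failed connect, in which case the remaining addresses still deserve
whatever is left of the timeout.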
The downside is that PMCD will see several connections, some (most?) of
which will succeed and then be abandoned. I assume that these will get
logged in a similar way to the pmprobe connections that fche opened a
bug about.
Dave