
Performance of Service Discovery via Probing

To: PCP Mailing List <pcp@xxxxxxxxxxx>
Subject: Performance of Service Discovery via Probing
From: Dave Brolley <brolley@xxxxxxxxxx>
Date: Wed, 02 Jul 2014 12:04:54 -0400
Delivered-to: pcp@xxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0

fche and I had discussed the (poor) performance of service discovery via active probing, and I've spent some time looking into it. We discussed a target rate of 50 connection attempts per second per thread. At the time, he was observing a total time of about 4 minutes to scan the Toronto office's two /24 inet networks using 10 threads.

I implemented a timeout=N.N option for the probe mechanism, where N.N is a floating point number representing the maximum number of seconds to wait for a connection attempt. For example:

    pmfind -m probe=10.15.16.0/24,timeout=0.1

The default is 0.02 seconds, which corresponds to the target rate of 50 connection attempts per second per thread mentioned above.
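
To make the relationship between the timeout and the target rate concrete, here is a small Python sketch. It is not pmfind's parser: the helper name parse_probe_spec and its splitting rules are assumptions made purely for illustration, and only the option string and the 0.02 second default come from this message.

```python
import ipaddress

DEFAULT_TIMEOUT = 0.02  # seconds; the default described in this message


def parse_probe_spec(spec):
    """Split 'probe=<subnet>[,timeout=N.N]' into (network, timeout).

    Hypothetical helper for illustration; not part of PCP.
    """
    fields = spec.split(",")
    key, _, subnet = fields[0].partition("=")
    if key != "probe":
        raise ValueError("expected a spec starting with 'probe='")
    timeout = DEFAULT_TIMEOUT
    for field in fields[1:]:
        name, _, value = field.partition("=")
        if name == "timeout":
            timeout = float(value)
    return ipaddress.ip_network(subnet), timeout


network, timeout = parse_probe_spec("probe=10.15.16.0/24,timeout=0.1")
hosts = sum(1 for _ in network.hosts())  # usable host addresses in the subnet
max_rate = 1.0 / timeout                 # timed-out attempts per second per thread
```

The rate computed here only describes attempts that run to the full timeout. Attempts that succeed or are refused quickly finish sooner, so a real scan can proceed faster than 1/timeout attempts per second per thread.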

While doing so, I discovered the following bugs, which were degrading the existing performance.

The first was in the secure (NSPR) implementation of __pmConnect(), which was unconditionally applying the timeout of __pmConnectTimeout() (default 3 seconds) to PR_Connect(). The problem is that __pmConnect() is often called via __pmConnectTo(), which sets the FNDELAY flag for the socket and expects __pmConnect() to return immediately. These callers then enforce the timeout they want using __pmSelectWrite(). The active probing code is among these callers, so it was unable to reduce the timeout for a failed connection to anything less than __pmConnectTimeout(). This problem has been solved by using PR_INTERVAL_NO_WAIT with PR_Connect() when the FNDELAY flag has been set.
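
The pattern these callers rely on (start a non-blocking connect, then let the caller enforce its own timeout while waiting for the socket to become writable) can be sketched with plain Python sockets. This is an analogue only, not libpcp or NSPR code: the function name probe is hypothetical, and the sketch assumes POSIX socket behaviour, where a failed connect is reported through writability plus SO_ERROR.

```python
import errno
import select
import socket


def probe(host, port, timeout):
    """Return True if a TCP connect to (host, port) completes within timeout seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setblocking(False)             # analogous to setting FNDELAY
        rc = sock.connect_ex((host, port))  # returns at once instead of waiting
        if rc == 0:
            return True                     # connected immediately
        if rc not in (errno.EINPROGRESS, errno.EWOULDBLOCK):
            return False                    # immediate failure, e.g. refused
        # The caller, not connect(), enforces the timeout it wants.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return False                    # no answer within the timeout
        # Writable means the attempt finished; SO_ERROR says how it ended.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0
    finally:
        sock.close()
```

If the connect call itself also waited for a fixed timeout, as the secure path did before this fix, the caller's shorter timeout in the select step could never take effect.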

The second bug did not affect failed connection timeouts and was masked by a third bug. The wait time before retrying after a failed attempt to create a new socket was erroneously set to 100 seconds instead of the intended 0.1 seconds (microseconds vs nanoseconds). However, this bug was masked by another, in which the probing code was overwriting this timeout with the value of __pmConnectTimeout() (default 3 seconds).
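
The factor of 1000 between the intended and the actual retry delay follows directly from the unit mix-up. The arithmetic below is a sketch that assumes the delay was written as a nanosecond count but interpreted as microseconds; the variable names are illustrative and do not come from the PCP source.

```python
NSEC_PER_SEC = 1_000_000_000
USEC_PER_SEC = 1_000_000

# Assumption: the intended 0.1 second delay written as a nanosecond count.
raw_value = NSEC_PER_SEC // 10              # 100,000,000

correct_sec = raw_value / NSEC_PER_SEC      # value read as nanoseconds
mistaken_sec = raw_value / USEC_PER_SEC     # same value read as microseconds
factor = NSEC_PER_SEC // USEC_PER_SEC       # how far off the mistaken reading is
```

Under that assumption the same number that means 0.1 seconds in nanoseconds means 100 seconds in microseconds, which matches the figures given above.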

As a result of these bugs, each failed connection attempt could take up to 6 seconds by default (the 3-second default timeout applied twice on secure sockets). With the fixes and the new default, I can now scan each of the Toronto office's /24 inet networks in:

- approximately 0.04 seconds with the default settings (255 threads): 0.04 seconds/connection/thread
- approximately 0.50 seconds with 10 threads: 0.019 seconds/connection/thread
- approximately 4.15 seconds with 1 thread: 0.016 seconds/connection/thread

One can see that the latter two meet the performance goal of 0.02 seconds per connection attempt per thread. We also see that, while increasing the number of threads increases the time taken per connection attempt in each thread, the overall time can decrease dramatically.
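
The per-connection figures can be reproduced from the total scan times with a little arithmetic. The sketch below assumes 254 probed host addresses per /24 network, spread evenly across the threads; both the assumption and the function names are mine, not taken from the probing code. The second function gives an idealised estimate in which every attempt runs to its full timeout and thread overhead is ignored.

```python
import math

HOSTS = 254  # assumption: usable host addresses in one /24 network


def per_connection_seconds(total_seconds, threads, hosts=HOSTS):
    """Average time per attempt within one thread, hosts spread evenly."""
    return total_seconds / max(hosts / threads, 1.0)


def all_timeouts_scan_seconds(timeout, threads, hosts=HOSTS):
    """Idealised scan time if every attempt runs to its full timeout."""
    return math.ceil(hosts / threads) * timeout


# threads -> total seconds, taken from the measurements reported above
measured = {255: 0.04, 10: 0.50, 1: 4.15}
per_connection = {
    threads: per_connection_seconds(total, threads)
    for threads, total in measured.items()
}

after_fix = all_timeouts_scan_seconds(0.02, threads=10)   # new default timeout
before_fix = all_timeouts_scan_seconds(6.0, threads=10)   # up to 6 s per failure
```

Under these assumptions the per-connection values come out close to those reported, and the 10-thread estimate with the new default is about half a second per /24 network. With failures taking up to 6 seconds each, the same estimate is 156 seconds per network, which is of the same order as the roughly 4 minutes originally observed for two networks.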

------------------------------------------------

commit 4870755fe7d28f323f525195cadc35424590db39
Author: Dave Brolley <brolley@xxxxxxxxxx>
Date:   Wed Jul 2 10:39:17 2014 -0400

    Implement 'timeout' option for 'probe' service discovery mechanism.

    timeout=N.N, where N.N is a floating point number specifying the
    maximum number of seconds to wait for each connection attempt
    to fail. Current default is 0.02 seconds which allows for
    50 connection attempts per second per thread.

    Other timeout related bugs found and fixed in the process:
    - Retry timeout was erroneously set to 1000 seconds and not 0.1
      seconds as was intended.
    - The retry timeout was being reset to __pmConnectTimeout() on
      the first connection attempt.

commit 75e40a1a033df79ca36b4034670bede4ba39ebdf
Author: Dave Brolley <brolley@xxxxxxxxxx>
Date:   Wed Jul 2 10:32:21 2014 -0400

    Respect FNDELAY in __pmConnect() for secure connections.

    The secure (NSPR) implementation of __pmConnect() was
    unconditionally applying __pmConnectTimeout() to PR_Connect(),
    even when FNDELAY was set for the file descriptor. For FNDELAY
    sockets, this timeout is applied again during the subsequent
    call to __pmSelectWrite(). As a result, failed connections on
    secure sockets were taking twice as long as for non-secure
    sockets.

    Correct this by using PR_INTERVAL_NO_WAIT when FNDELAY is set.
