
pcp updates: multithreaded libpcp pmNewContext, pmmgr

To: pcp developers <pcp@xxxxxxxxxxx>
Subject: pcp updates: multithreaded libpcp pmNewContext, pmmgr
From: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Date: Sat, 9 Apr 2016 19:34:53 -0400
Delivered-to: pcp@xxxxxxxxxxx
User-agent: Mutt/1.4.2.2i
Hi -

Related to RHBZ1325363, I am presenting for your review a series of
patches that multithread pmNewContext and its client pmmgr.  I would
appreciate a careful look, as the libpcp changes are central.  It looks
pretty good here, though, and should not change behaviour observable to
single-threaded applications: the libpcp lock is simply taken at a
finer grain.



commit 495e97816d9e80712eb6bdcf4d197e9dde5ecb94
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sat Apr 9 19:20:28 2016 -0400

    pmmgr: make foreground mode less magic
    
    Just as for pmwebd back in commit 9c82cf68a, don't mandate -U `whoami`
    if one simply wants to run pmmgr under one's own unprivileged userid.
    Only attempt __pmSetProcessIdentity() if we're root to start with.
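
The root-only rule above might be sketched roughly as follows.  This
is not the pmmgr code: maybe_switch_identity and the set_identity_fn
callback are hypothetical names, with the callback standing in for the
real identity switch so the sketch stays self-contained, and comparing
the effective uid against 0 is an assumption about how "root" is
detected.

```cpp
#include <sys/types.h>
#include <unistd.h>
#include <string>

// Hypothetical callback type standing in for the real identity switch.
typedef int (*set_identity_fn)(const char *username);

// Attempt the identity switch only when started as root (assumed here
// to mean effective uid 0).  Returns true if the switch was attempted,
// false if the process simply keeps running as the invoking user.
bool maybe_switch_identity(uid_t euid, const std::string &username,
                           set_identity_fn set_identity)
{
    if (euid != 0)
        return false;               // unprivileged already: nothing to drop
    set_identity(username.c_str()); // privileged: switch to the target user
    return true;
}
```

A caller would pass geteuid() as the first argument, so an unprivileged
foreground run needs no -U option at all.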

commit c4ea84304c0d60d4a75100a0f23120b813f2995b
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sat Apr 9 18:49:40 2016 -0400

    pmmgr: parallelize potential target pmcd analysis & daemon shutdown
    
    It was reported that if pmmgr was given a target pmcd list containing
    numerous hosts that are at times unreachable, then a delay of up to
    $PMCD_CONNECT_TIMEOUT (10s!) may be absorbed - per unreachable host -
    during the hostid-calculation phase.
    
    So now we parallelize a few more things, to let pmmgr scale out
    to a much larger number of target daemons:
    - pcp contexts are opened in parallel to the potential pmcd list
      already gathered from target-host and target-discovery
    - container subtargets are searched in parallel for surviving
      live pmcds
    - eventually, pmmgr daemons are shut down in parallel, in separate
      threads that issue the SIGTERM / SIGKILL
    
    qa/666 updated.  Other scale testing with hundreds of
    always-unreachable hosts (e.g., the RFC5737 TEST-NET 192.0.2.0/24
    range) indicates proper parallelization and tolerance of timeouts.
    
    Amongst some tasty coding treats:
    - a "locker" class to embody automatic {}-block-lifespan mutex
      holding, instead of explicit pthread_mutex_[un]lock ops
    - an "obatched" ostream-like class to let output-streaming <<
      operations accumulate in a stringstream, so concurrent cerr
      output is not interleaved
    - a "parallel_do" function that launches N threads against a shared
      (usually embedded-lock-carrying) work-queue structure
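
For readers who have not seen these idioms, here is a minimal sketch of
the three helpers listed above.  Only the names locker, obatched and
parallel_do come from the commit text; the class internals, the
parallel_do signature, and the work_queue / drain names are assumptions
made for illustration, and the actual pmmgr code may well differ.

```cpp
#include <pthread.h>
#include <iostream>
#include <queue>
#include <sstream>
#include <string>
#include <vector>

// Holds a pthread mutex for the lifetime of the enclosing {} block.
class locker {
public:
    explicit locker(pthread_mutex_t *m) : m_(m) { pthread_mutex_lock(m_); }
    ~locker() { pthread_mutex_unlock(m_); }
    locker(const locker &) = delete;
    locker &operator=(const locker &) = delete;
private:
    pthread_mutex_t *m_;
};

// Accumulates << output in a stringstream, then writes it to the target
// stream in one locked step when the temporary object is destroyed.
class obatched {
public:
    explicit obatched(std::ostream &o) : out_(o) {}
    ~obatched() { locker hold(&lock_); out_ << buf_.str(); out_.flush(); }
    template <typename T>
    obatched &operator<<(const T &v) { buf_ << v; return *this; }
    // Accept manipulators such as std::endl as well.
    obatched &operator<<(std::ostream &(*manip)(std::ostream &))
    { buf_ << manip; return *this; }
private:
    std::ostream &out_;
    std::ostringstream buf_;
    static pthread_mutex_t lock_;   // one lock shared by all batches
};
pthread_mutex_t obatched::lock_ = PTHREAD_MUTEX_INITIALIZER;

// Hypothetical work queue that carries its own lock.
struct work_queue {
    pthread_mutex_t lock;
    std::queue<std::string> items;  // pending work
    std::vector<std::string> done;  // finished work
    work_queue() { pthread_mutex_init(&lock, NULL); }
    ~work_queue() { pthread_mutex_destroy(&lock); }
};

// Launch up to n threads running fn(arg), then wait for all of them.
void parallel_do(int n, void *(*fn)(void *), void *arg)
{
    std::vector<pthread_t> threads;
    for (int i = 0; i < n; i++) {
        pthread_t t;
        if (pthread_create(&t, NULL, fn, arg) == 0)
            threads.push_back(t);
    }
    for (size_t i = 0; i < threads.size(); i++)
        pthread_join(threads[i], NULL);
}

// Example worker: pop items until the shared queue is empty.
void *drain(void *arg)
{
    work_queue *q = static_cast<work_queue *>(arg);
    for (;;) {
        std::string item;
        {
            locker hold(&q->lock);  // released at the closing brace
            if (q->items.empty())
                break;
            item = q->items.front();
            q->items.pop();
        }
        // Slow per-item work would go here, outside the queue lock.
        obatched(std::cerr) << "checked " << item << '\n';
        locker hold(&q->lock);
        q->done.push_back(item);
    }
    return NULL;
}
```

With a queue filled with host names, parallel_do(8, drain, &q) would
process them on eight threads, and each "checked ..." line reaches cerr
whole rather than interleaved with its neighbours.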

commit f0231ffa1d02e019dd11f8b28d65e4abc9d7a664
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Fri Apr 8 20:07:57 2016 -0400

    RHBZ1325363: multithreaded pmNewContext
    
    While parallelizing pmmgr, it was discovered that the core
    pmNewContext function is a bottleneck when trying to connect to a
    large number of servers.  Prior to this patch, it held the big libpcp
    lock throughout the entire context-creation process, which can last
    10+ seconds (e.g., if a remote pmcd host is unreachable).  That locks
    out many other pmapi operations, and serializes connections to
    multiple hosts.
    
    Detailed analysis of pmNewContext and its callees showed that it is
    possible to relax holding the big libpcp lock to much shorter time
    periods, and specifically to exclude indefinite-length operations like
    the socket connection to a remote pmcd, and even the analysis of
    archives.  This is partly done by introducing a special
    PM_CONTEXT_INIT c_type placeholder object into the context[] array
    during initialization, and tweaking timing & locking sequences.
    
    The result is that pmNewContext calls can almost completely overlap
    each other safely.  A new test case (4751, a descendant of 475)
    stress-tests by opening hundreds of various types of contexts at the
    same time, including repeated, unreachable, and
    theoretically-shareable ones.  The new code precludes sharing of
    connection/archive-control data between contexts that are being
    opened concurrently to the same destination, but non-concurrent
    sharing behaviour is unmodified.
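
The placeholder idea described above can be illustrated with a small
self-contained sketch.  This is not the libpcp code: the table,
slot_state values, slow_connect and new_context are all hypothetical,
with SLOT_INIT merely playing the role that PM_CONTEXT_INIT is
described as playing in context[].  The point is that the big lock
covers only the short bookkeeping steps, while the slow connection
attempt runs unlocked.

```cpp
#include <pthread.h>
#include <unistd.h>
#include <vector>

// Hypothetical slot states; SLOT_INIT marks a reserved, unfinished slot.
enum slot_state { SLOT_FREE, SLOT_INIT, SLOT_READY };

struct slot {
    slot_state state;
    int conn;                       // stand-in for a connection handle
};

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static std::vector<slot> table;     // guarded by big_lock

// Stand-in for a slow, possibly failing connection attempt.
static int slow_connect(int target)
{
    usleep(20000);                  // pretend the network is slow
    return target >= 0 ? target + 100 : -1;
}

// Returns the slot index on success, or -1 if the connection failed.
int new_context(int target)
{
    size_t idx;

    // Phase 1 (locked, short): reserve a slot with a placeholder so
    // that no other thread picks the same one.
    pthread_mutex_lock(&big_lock);
    for (idx = 0; idx < table.size(); idx++)
        if (table[idx].state == SLOT_FREE)
            break;
    if (idx == table.size())
        table.push_back(slot());
    table[idx].state = SLOT_INIT;
    table[idx].conn = -1;
    pthread_mutex_unlock(&big_lock);

    // Phase 2 (unlocked, slow): other threads may run concurrently.
    // Only the index is kept, since the table may be reallocated.
    int conn = slow_connect(target);

    // Phase 3 (locked, short): publish the result or release the slot.
    pthread_mutex_lock(&big_lock);
    table[idx].conn = conn;
    table[idx].state = (conn >= 0) ? SLOT_READY : SLOT_FREE;
    pthread_mutex_unlock(&big_lock);

    return conn >= 0 ? (int)idx : -1;
}
```

With this shape, N concurrent callers spend roughly one connection
delay in total rather than N of them, since none holds the lock while
waiting.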

commit c63958ac86b7a92e0a257ea6e65799446ab1d833
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Fri Apr 8 18:12:27 2016 -0400

    libpcp fetchgroups: match docs & implementation
    
    The pmFetchGroup() function return value was misdocumented (>0 ok).
    The pmFetchGroupSetMode() function was removed from the exported /
    documented API, so it can safely be removed from the implementation.
