Hi -
Related to RHBZ1325363, presenting for your review a series of
patches for multithreading pmNewContext and its client pmmgr. I would
appreciate review, as the libpcp changes are central. It looks pretty
good here, though, and should not modify single-threaded apps'
observable behaviour: the libpcp lock is simply taken at a finer grain.
commit 495e97816d9e80712eb6bdcf4d197e9dde5ecb94
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Sat Apr 9 19:20:28 2016 -0400
pmmgr: make foreground mode less magic
Just as for pmwebd back in commit 9c82cf68a, don't mandate -U `whoami`
if one simply wants to run pmmgr under one's own unprivileged userid.
Only attempt __pmSetProcessIdentity() if we're root to start with.
commit c4ea84304c0d60d4a75100a0f23120b813f2995b
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Sat Apr 9 18:49:40 2016 -0400
pmmgr: parallelize potential target pmcd analysis & daemon shutdown
It was reported that if pmmgr was given a target pmcd list containing
numerous hosts that are at times unreachable, then a delay of up to
$PMCD_CONNECT_TIMEOUT (10s!) may be absorbed - per unreachable host -
during the hostid-calculation phase.
So now we parallelize a couple more things, to let pmmgr scale out
to a much larger number of target daemons:
- pcp contexts are opened in parallel across the potential pmcd list
already gathered from target-host and target-discovery
- container subtargets are searched in parallel for surviving
live pmcds
- eventually, pmmgr daemons are shut down in parallel, in separate
threads that issue the SIGTERM / SIGKILL
qa/666 updated. Other scale testing with hundreds of
always-unreachable hosts (e.g., the RFC5737 TEST-NET 192.0.2.0/24
range) indicates proper parallelization and tolerance of timeouts.
Amongst some tasty coding treats:
- a "locker" class to embody automatic {}-block-lifespan mutex
holding, instead of explicit pthread_mutex_[un]lock ops
- an "obatched" ostream-like class to let output-streaming <<
operations accumulate in a stringstream, so concurrent cerr
output is not interleaved
- a "parallel_do" function that launches N threads against a shared
(usually embedded-lock-carrying) work-queue structure
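For the curious, those three helpers might look roughly like this. This is
a minimal sketch only: the names (locker, obatched, parallel_do, work_queue,
sum_worker) follow the commit message or are made up for illustration, and
std::thread / std::mutex stand in for the explicit pthread calls the real
pmmgr code deals with.

```cpp
// Illustrative sketch, not the actual pmmgr code.
#include <cassert>
#include <iostream>
#include <mutex>
#include <queue>
#include <sstream>
#include <thread>
#include <vector>

// "locker": holds a mutex for the lifespan of a {} block (RAII),
// replacing paired pthread_mutex_[un]lock calls.
class locker {
    std::mutex& m;
public:
    explicit locker(std::mutex& m_) : m(m_) { m.lock(); }
    ~locker() { m.unlock(); }
};

// "obatched": << operations accumulate in a stringstream; the whole
// batch is flushed under a lock at destruction, so concurrent cerr/cout
// output is not interleaved.
class obatched : public std::ostringstream {
    std::ostream& o;
    static std::mutex& flush_lock() { static std::mutex m; return m; }
public:
    explicit obatched(std::ostream& o_) : o(o_) {}
    ~obatched() { locker l(flush_lock()); o << str() << std::flush; }
};

// A shared work-queue structure carrying its own (embedded) lock.
struct work_queue {
    std::mutex lock;
    std::queue<int> items;
    long result = 0;  // demo accumulator, protected by the same lock
};

// "parallel_do": launch N threads against the shared work queue and
// wait for them all to finish.
void parallel_do(unsigned n, void (*fn)(work_queue*), work_queue* q)
{
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; i++)
        threads.emplace_back(fn, q);
    for (auto& t : threads)
        t.join();
}

// Demo worker: drain items from the queue, one at a time, under the lock.
void sum_worker(work_queue* q)
{
    for (;;) {
        locker l(q->lock);
        if (q->items.empty())
            return;
        q->result += q->items.front();
        q->items.pop();
    }
}
```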
commit f0231ffa1d02e019dd11f8b28d65e4abc9d7a664
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Fri Apr 8 20:07:57 2016 -0400
RHBZ1325363: multithreaded pmNewContext
While parallelizing pmmgr, it was discovered that the core
pmNewContext function is a bottleneck when trying to connect to a
large number of servers. Prior to this patch, it held the big libpcp
lock throughout the entire context-creation process, which can last
10+ seconds (e.g., if a remote pmcd host is unreachable). That locks
out many other pmapi operations, and serializes connections to
multiple hosts.
Detailed analysis of pmNewContext and its callees showed that it is
possible to relax holding the big libpcp lock to much shorter time
periods, and specifically to exclude indefinite-length operations like
the socket connection to a remote pmcd, and even the analysis of
archives. This is partly done by introducing a special
PM_CONTEXT_INIT c_type placeholder object into the context[] array
during initialization, and tweaking timing & locking sequences.
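The placeholder idea can be sketched as follows. This is illustrative C++
standing in for the actual C code in libpcp; the identifiers (new_context,
big_lock, CTX_INIT, etc.) are placeholders, not the real PMAPI internals.
The point is that the big lock is held only for the two short bookkeeping
steps, while the indefinite-length connect runs unlocked.

```cpp
// Illustrative sketch of the placeholder-slot pattern, not libpcp itself.
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

enum ctx_type { CTX_INIT, CTX_HOST };   // CTX_INIT plays PM_CONTEXT_INIT
struct context { ctx_type c_type; bool connected; };

std::mutex big_lock;                    // stands in for the big libpcp lock
std::vector<context*> contexts;        // stands in for the context[] array

int new_context()
{
    context* c = new context{CTX_INIT, false};
    int slot;
    {
        // Short hold #1: reserve a slot with a placeholder c_type, so
        // other threads see the array in a consistent state.
        std::lock_guard<std::mutex> l(big_lock);
        contexts.push_back(c);
        slot = (int)contexts.size() - 1;
    }
    // Indefinite-length work (e.g. the socket connect to a remote pmcd,
    // or archive analysis) runs with NO lock held, so concurrent
    // new_context() calls can overlap.
    c->connected = true;                // pretend the connect succeeded
    {
        // Short hold #2: publish the now-usable context.
        std::lock_guard<std::mutex> l(big_lock);
        c->c_type = CTX_HOST;
    }
    return slot;
}
```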
The result is that pmNewContext calls can almost completely overlap
each other safely. A new test case (4751, a descendant of 475)
stress-tests by opening hundreds of various types of contexts at the
same time, including repeated, unreachable, and
theoretically-shareable ones. The new code precludes sharing of
connections/archive-control data to the same destinations, but
non-concurrent sharing behaviour is unmodified.
commit c63958ac86b7a92e0a257ea6e65799446ab1d833
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Fri Apr 8 18:12:27 2016 -0400
libpcp fetchgroups: match docs & implementation
The pmFetchGroup() function return value was misdocumented (>0 ok).
The pmFetchGroupSetMode() function was removed from the exported /
documented API, so it can safely be removed from the implementation.