Hi -
Fresh on pcpfans.git fche/multithread, for your review:
commit a9764809b468d02f6e00763ced6b42f9abd75380
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Mon May 2 13:20:00 2016 -0400
qa/4751 reactivate
After the recent libpcp fixes, this test seems repeatable and
a good stressor for libpcp multithreading.
commit 2bf81a70ec5dff787fb4077c448b305fffd1c4c0
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Mon May 2 11:34:52 2016 -0400
pmwebd speedup: libmicrohttpd TURBO mode
An implementation artifact in libmicrohttpd prior to svn commit r37105
meant that concurrent requests into pmwebd are batched, in the sense
that the response to any one is not sent until the responses to all
are finished. This means more perceived waiting for e.g. pmwebd grafana
dashboards with multiple charts, because the empty screen lasts
longer.
The MHD_USE_EPOLL_TURBO flag for MHD_start_daemon activates
performance tweaks, including an improvement in the above behavior.
It's harmless in older libmicrohttpd, and is transparent to qa.
commit 81935689d077b8d30f666dec46907c38c23af336
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Mon May 2 11:22:25 2016 -0400
pmmgr pcpqa/666: robustify, run unprivileged
The 666 test case is sometimes reported flaky. Some experiments
suggest one factor is the sloth of pmlogconf, especially on virtual
machines. It can take some 90 seconds (!) for a simple kvm guest, for
reasons not yet understood. This can lead the 666 script's
pmlogconf-awaiting logic to time out, since history waits for no man -
longer than 60 seconds. This timeout is bumped up to 300 seconds.
Synchronization via pmcd.* metrics is also a bit flaky, so we switch
to running pmmgr and its subordinate daemons unprivileged, and monitor
the output files [-s $FILE] directly. Not using $sudo all over also
simplifies the valgrind supervision logic.
commit 836fd5ea1b3939f9f60d55f9b30a4e6efc8c5698
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Mon Apr 25 09:44:45 2016 -0400
libpcp multithreading: un-nest tz_lock
libpcp's historical use of a recursive libpcp lock has allowed patterns
of carefree intercalling of lock-taking functions. With normal
non-recursive locks, that's instant deadlock. Remove nested locking
in purely unnecessary cases.
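
To illustrate the un-nesting pattern, here is a minimal sketch in plain pthreads rather than the libpcp PM_LOCK macros. All names in it (tz_lock, tz_get, tz_get_unlocked, tz_describe) are hypothetical and are not copied from libpcp/src/tz.c. It assumes an error-checking mutex, so that a nested relock reports EDEADLK instead of hanging the way a normal mutex would.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t tz_lock;     /* hypothetical small, non-recursive lock */
static char tz_value[64] = "UTC";   /* state guarded by tz_lock */

/* Error-checking type: a relock by the owner returns EDEADLK, no hang. */
static void tz_lock_init(void)
{
    pthread_mutexattr_t attr;
    int sts = pthread_mutexattr_init(&attr);
    assert(sts == 0);
    sts = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(sts == 0);
    sts = pthread_mutex_init(&tz_lock, &attr);
    assert(sts == 0);
    pthread_mutexattr_destroy(&attr);
}

/* Inner worker: the caller must already hold tz_lock. */
static void tz_get_unlocked(char *buf, size_t len)
{
    snprintf(buf, len, "%s", tz_value);
}

/* Lock-taking entry point for outside callers. */
static int tz_get(char *buf, size_t len)
{
    int sts = pthread_mutex_lock(&tz_lock);
    if (sts != 0)
        return sts;     /* EDEADLK if this thread already holds tz_lock */
    tz_get_unlocked(buf, len);
    pthread_mutex_unlock(&tz_lock);
    return 0;
}

/* Nested style: fine with a recursive lock, broken with this one. */
static int tz_describe_nested(char *buf, size_t len)
{
    char tz[64] = "";
    int sts;

    pthread_mutex_lock(&tz_lock);
    sts = tz_get(tz, sizeof(tz));   /* tries to take tz_lock again */
    pthread_mutex_unlock(&tz_lock);
    if (sts == 0)
        snprintf(buf, len, "timezone=%s", tz);
    return sts;
}

/* Un-nested style: take the lock once, call only the unlocked worker. */
static int tz_describe(char *buf, size_t len)
{
    char tz[64];

    pthread_mutex_lock(&tz_lock);
    tz_get_unlocked(tz, sizeof(tz));
    pthread_mutex_unlock(&tz_lock);
    snprintf(buf, len, "timezone=%s", tz);
    return 0;
}
```

After tz_lock_init(), tz_describe_nested() fails with EDEADLK while tz_describe() succeeds. Splitting each function into a lock-taking wrapper and an unlocked worker is only one way to un-nest; where the outer lock was never needed, simply dropping it is simpler.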
commit 0a5caba663cbbf7420b189e20387bf36f39c30e7
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Sun Apr 24 19:23:26 2016 -0400
qa/4751 multithread: create new PCP_DEBUG subtest
Running the big final test with PCP_DEBUG=-1 can slow it down
enough that it occasionally fails. Add an intermediate-length test
that runs quicker but still covers a swath of context types.
Some higher values of PCP_DEBUG invoke taking locks in a nested,
order-violating fashion. This patch brings local lock goodness to
libpcp/src/tz.c, moves dumping outside locking in pdubuf.c, and
extends qa/4751 to test two sets of PCP_DEBUG runs. DBG_TRACE_PDU
is particularly vulnerable because it does (locky) PMNS ops.
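
The "dump outside the lock" idea can be sketched roughly as follows. This is not the pdubuf.c code: buf_lock, buf_add and buf_dump are hypothetical names, and the assumption is that the state to be traced is small enough to copy. Only the copy happens under the lock; the formatting and output, which in a real library might call back into other lock-taking code, run after the lock is released.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define MAX_BUFS 8

static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER; /* hypothetical */
static int buf_sizes[MAX_BUFS];     /* guarded by buf_lock */
static int buf_count;               /* guarded by buf_lock */

/* Record a buffer of the given size; 0 on success, -1 if the table is full. */
static int buf_add(int size)
{
    int sts = -1;

    pthread_mutex_lock(&buf_lock);
    if (buf_count < MAX_BUFS) {
        buf_sizes[buf_count++] = size;
        sts = 0;
    }
    pthread_mutex_unlock(&buf_lock);
    return sts;
}

/* Debug dump; returns the number of entries written to f. */
static int buf_dump(FILE *f)
{
    int snapshot[MAX_BUFS];
    int n, i;

    assert(f != NULL);

    /* Step 1: copy the shared state while holding the lock. */
    pthread_mutex_lock(&buf_lock);
    n = buf_count;
    for (i = 0; i < n; i++)
        snapshot[i] = buf_sizes[i];
    pthread_mutex_unlock(&buf_lock);

    /* Step 2: format and print with no lock held. */
    fprintf(f, "buf_dump: %d buffer(s)\n", n);
    for (i = 0; i < n; i++)
        fprintf(f, "  [%d] size=%d\n", i, snapshot[i]);
    return n;
}
```

The price is that the dump may be slightly stale by the time it is printed, which is usually acceptable for tracing.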
commit 169b018477648e0b25bd7ccfa7b1f47f03b93e9f
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Sun Apr 24 18:35:31 2016 -0400
multithreaded testing: ipc debugging messages
Similar to commit c7e9299f6a03, the ipc.c tracing operations also need
to be moved outside the new non-recursive locks. qa/4751 runs the
last test with PCP_DEBUG=-1 to try to stress this aspect.
commit 4da610ef287e6841046eb0822766f9bd3c658198
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Sun Apr 24 15:17:00 2016 -0400
PR1055: handle some multithreaded deadlocks & race conditions
While running the qa/4751 test case at full scale, deadlocks reliably
occur. (In fact, the 4751.out file was initially checked in truncated
due to an alarm() catching the deadlocked run, producing no output.)
The same type of deadlock is also easily demonstrated on stock
previous-version libpcp, so it exculpates the recent pmNewContext
multithreading changes.
The valgrind "helgrind" tool is good at identifying problems of this
nature, and should be routinely used for verifying code that deals
with PM_*LOCK.
The gist of one problem is inconsistent lock ordering. The libpcp
lock is sometimes taken nested within a context c_lock, and sometimes
vice versa. Two threads can then easily lock each other out. helgrind
showed multiple different scenarios where the libpcp lock was taken
unnecessarily by lower level code - where a smaller lock was
sufficient. This patchset adds a handful of small, non-recursive
locks for these.
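
A minimal sketch of consistent lock ordering, for illustration only. The names (lib_lock, struct context, ctx_fetch, ctx_report, run_workers) are hypothetical, and the order chosen here (c_lock first, then lib_lock) is arbitrary; what matters is that every path that needs both locks takes them in the same order.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER; /* coarse lock */
static int lib_total;               /* guarded by lib_lock */

struct context {
    pthread_mutex_t c_lock;         /* per-context lock */
    int fetches;                    /* guarded by c_lock */
};

/* Rule for this sketch: ctx->c_lock first, then lib_lock. */
static void ctx_fetch(struct context *ctx)
{
    pthread_mutex_lock(&ctx->c_lock);
    ctx->fetches++;
    pthread_mutex_lock(&lib_lock);
    lib_total++;
    pthread_mutex_unlock(&lib_lock);
    pthread_mutex_unlock(&ctx->c_lock);
}

/*
 * A second path needing both locks follows the same order.  Had it taken
 * lib_lock first and c_lock second, it could deadlock against ctx_fetch().
 */
static int ctx_report(struct context *ctx)
{
    int others;

    pthread_mutex_lock(&ctx->c_lock);
    pthread_mutex_lock(&lib_lock);
    others = lib_total - ctx->fetches;
    pthread_mutex_unlock(&lib_lock);
    pthread_mutex_unlock(&ctx->c_lock);
    return others;
}

struct job { struct context *ctx; int iters; };

static void *worker(void *arg)
{
    struct job *job = arg;

    for (int i = 0; i < job->iters; i++) {
        ctx_fetch(job->ctx);
        (void)ctx_report(job->ctx);
    }
    return NULL;
}

/* Run nthreads workers on one shared context; returns its fetch count. */
static int run_workers(int nthreads, int iters)
{
    pthread_t tids[16];
    struct context ctx;
    struct job job = { &ctx, iters };
    int i, sts, fetches;

    assert(nthreads > 0 && nthreads <= 16);
    ctx.fetches = 0;
    sts = pthread_mutex_init(&ctx.c_lock, NULL);
    assert(sts == 0);
    for (i = 0; i < nthreads; i++) {
        sts = pthread_create(&tids[i], NULL, worker, &job);
        assert(sts == 0);
    }
    for (i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    fetches = ctx.fetches;          /* all workers joined, no lock needed */
    pthread_mutex_destroy(&ctx.c_lock);
    return fetches;
}
```

A call such as run_workers(4, 1000) completes with every increment accounted for. Running a variant with the order inverted in one function under helgrind should produce a lock-order report even on runs that happen not to deadlock.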
This patch also includes a fix to a nastier race condition in
__pmHandleToPtr(), whereby a context-destruction could race against
context-structure lookup. Some work remains in the multi-archive code
and elsewhere to avoid two mildly racy functions (__pmPtrToHandle and
the new __pmHandleToPtr_unlocked).
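
For readers unfamiliar with this kind of race, here is one common way to close a lookup-versus-destroy window, sketched under loud assumptions: this is not the actual __pmHandleToPtr() change, and table_lock, handle_to_ptr, ctx_release and ctx_destroy are hypothetical names. The lookup takes the per-context lock before it drops the table lock, so a destroyer cannot free the structure between "found it" and "locked it".

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_CTX 4

struct context {
    pthread_mutex_t c_lock;
    int id;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct context *table[MAX_CTX];  /* guarded by table_lock */

/* Allocate a context; returns its handle, or -1 on failure. */
static int ctx_create(void)
{
    struct context *c = malloc(sizeof(*c));
    int h;

    if (c == NULL)
        return -1;
    pthread_mutex_init(&c->c_lock, NULL);
    pthread_mutex_lock(&table_lock);
    for (h = 0; h < MAX_CTX; h++)
        if (table[h] == NULL)
            break;
    if (h < MAX_CTX) {
        c->id = h;
        table[h] = c;
    }
    pthread_mutex_unlock(&table_lock);
    if (h == MAX_CTX) {             /* table full */
        pthread_mutex_destroy(&c->c_lock);
        free(c);
        return -1;
    }
    return h;
}

/* Returns the context with its c_lock held, or NULL if h is not live. */
static struct context *handle_to_ptr(int h)
{
    struct context *c;

    if (h < 0 || h >= MAX_CTX)
        return NULL;
    pthread_mutex_lock(&table_lock);
    c = table[h];
    if (c != NULL)
        pthread_mutex_lock(&c->c_lock); /* taken before table_lock is dropped */
    pthread_mutex_unlock(&table_lock);
    return c;
}

static void ctx_release(struct context *c)
{
    pthread_mutex_unlock(&c->c_lock);
}

/* Returns 0 on success, -1 if h is not live. */
static int ctx_destroy(int h)
{
    struct context *c;

    if (h < 0 || h >= MAX_CTX)
        return -1;
    pthread_mutex_lock(&table_lock);
    c = table[h];
    table[h] = NULL;                /* new lookups can no longer find it */
    pthread_mutex_unlock(&table_lock);
    if (c == NULL)
        return -1;
    pthread_mutex_lock(&c->c_lock); /* wait out any current user */
    pthread_mutex_unlock(&c->c_lock);
    pthread_mutex_destroy(&c->c_lock);
    free(c);
    return 0;
}
```

The trade-off in this sketch is that a lookup can hold table_lock while it waits for a busy context, and that code holding a c_lock must never go on to take table_lock.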
qa/4751 and all other preexisting thread-group test cases look good
now, no more deadlocks or lock-ordering-error reports there at least.
(There are likely more hiding in the code: the libpcp lock is way
overused.)
commit 2a7e146b5400736801a8daaff8bf0f3213d962dd
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Sun Apr 24 14:55:25 2016 -0400
multithreading qa/4751
Tweak the qa/4751 test case so that different unreachable-host type
error codes are mapped to a uniform one. Generate an actual proper
output for the last test (the one with some 156 contexts/threads).
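
The error-code mapping could look roughly like the sketch below. The function name and the particular set of codes are assumptions for illustration; which codes the qa/4751 test really folds together, and which one it picks as the uniform value, are not stated here. Error codes are assumed to be negated errno values.

```c
#include <assert.h>
#include <errno.h>

/*
 * Fold several "could not reach the host" failures into a single code so
 * that the expected test output does not depend on how a given network
 * happens to fail.  Anything else passes through unchanged.
 */
static int normalize_connect_error(int sts)
{
    switch (sts) {
    case -ECONNREFUSED:
    case -EHOSTUNREACH:
    case -ENETUNREACH:
    case -ETIMEDOUT:
        return -ECONNREFUSED;       /* arbitrary choice of uniform code */
    default:
        return sts;
    }
}

/* Small sanity check of the mapping, callable from a test harness. */
static void normalize_selfcheck(void)
{
    assert(normalize_connect_error(-ENETUNREACH) == -ECONNREFUSED);
    assert(normalize_connect_error(0) == 0);
}
```

The same effect can also be had by filtering the program's output text in the qa script; doing it in the test program keeps the mapping next to the code that produces the errors.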
_______________________________________________
pcp mailing list
pcp@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/pcp