Hi, I've cherry-picked most of the commits from pcpfans/fche/pmfuntools.
Changes committed to git://git.pcp.io/lberk/pcp.git master
Frank Ch. Eigler (11):
pmmgr pcpqa/666: robustify, run unprivileged
pmwebd speedup: libmicrohttpd TURBO mode
pmmgr target-threads: tolerate OSs that return <0 for
sysconf(_SC_NPROCESSORS_ONLN)
pmmgr scaling: don't cry on a SIGPIPE
crash-resilience for systemd pmmgr/pmwebd
pmmgr: tune logging batching
pmmgr qa/666: tweak timings
pmmgr qa/666: further relax test predicates
pmmgr: corrupt archive handling, pmlogextract batching
pmwebd graphite png graphics: interrupt tweak
unresponsive-pmda pmie message: identify host
man/man1/pmmgr.1 | 14 +++
qa/666 | 78 ++++++++----------
qa/666.out | 4
qa/669 | 159 +++++++++++++++++++++++++++++++++++++
qa/669.out | 17 ++++
qa/group | 1
src/pmieconf/primary/pmda_status | 2
src/pmmgr/GNUmakefile | 3
src/pmmgr/pmmgr.cxx | 165 +++++++++++++++++++++++++--------------
src/pmmgr/pmmgr.h | 2
src/pmmgr/pmmgr.service.in | 9 +-
src/pmwebapi/GNUmakefile | 3
src/pmwebapi/main.cxx | 11 ++
src/pmwebapi/pmgraphite.cxx | 14 +--
src/pmwebapi/pmwebd.service.in | 11 +-
15 files changed, 366 insertions(+), 127 deletions(-)
Details ...
commit 1a4a8dd156254ad53bc0d637a982ad37a1951301
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Sun May 8 11:29:11 2016 -0400
unresponsive-pmda pmie message: identify host
For remotely monitored hosts that have suffered PMDA failure, the pmie
message should identify the host. Adding @%h to the message, as per
many other pmieconf examples. (No QA impact, as this message does not
appear in QA at all.)
commit 796986c6a71e9b027ce16a7e868333d6d78efa3a
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Wed Jun 1 10:14:34 2016 -0400
pmwebd graphite png graphics: interrupt tweak
When a ^C is received within the gfx/cairo rendering pipeline, abort
before cairo memory is allocated. That way there's less carping in
valgrind leak-check reports about irrelevant shutdown-time memory.
commit 634433a316d7e7a73ae00f3dc58a934181450c06
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Sun May 15 18:01:49 2016 -0400
pmmgr: corrupt archive handling, pmlogextract batching
Sometimes we encounter corrupt archives during aging/merging. Instead
of retrying them over and over, we now rename those "corrupt-YYYY..."
and give them a separate aging parameter pmlogcheck-corrupt-gc so they
get cleaned up eventually.
Sometimes we encounter the situation wherein short-lived pmcd tasks
cause a great number of distinct pmmgr archives to pile up for some
given day (pmlogmerge period). This can trigger the old pmlogextract
PR1054 limitation, wherein if it is given too many archives on its
command line, it runs out of file descriptors and dies. We can't use
pmlogger_merge (because it doesn't take all pmlogextract parameters
we'd like), so instead we impose a batching limit of 64 (archives per
pmlogextract run) ourselves. Over a few restart cycles it can catch
up - or at least make progress.
qa/669 added to cover both new bits of function. man/pmmgr.1 updated.
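
The batching idea described above can be sketched roughly as follows. This is only an illustration, not the code from pmmgr.cxx: the helper name batch_archives and its signature are made up for this example, and only the limit of 64 comes from the commit message.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not taken from pmmgr.cxx): split a list of archive
// names into consecutive groups of at most batch_size entries, so that each
// group could be handed to one separate pmlogextract run.
std::vector<std::vector<std::string>>
batch_archives(const std::vector<std::string>& archives,
               std::size_t batch_size = 64)
{
    std::vector<std::vector<std::string>> batches;
    if (batch_size == 0)
        return batches;   // a zero batch size would never make progress
    for (std::size_t i = 0; i < archives.size(); i += batch_size) {
        // The last group may be shorter than batch_size.
        std::size_t end = std::min(archives.size(), i + batch_size);
        batches.emplace_back(archives.begin() + i, archives.begin() + end);
    }
    return batches;
}
```

With 130 archives this gives three groups of 64, 64 and 2 names, which keeps the number of files any single merge has to open bounded.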
commit e1d2911784e3a4caa7f37f12362e4b6f9dd2f5dc
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Tue May 10 09:07:06 2016 -0400
pmmgr qa/666: further relax test predicates
With pmmgr running asynchronously, and its child processes taking a
hard-to-predict amount of time, it continues to be a struggle to have
a brief test case that measures their progress. Relax another test
predicate to make it accept bare existence rather than minimum-count
of a type of file.
commit 7ce54dd7a0b6a29236ec797752da4b0f4943ca98
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Mon May 9 17:00:17 2016 -0400
pmmgr qa/666: tweak timings
Adjusting test times & windows so that the qa/666 test case runs
more predictably in both valgrind and non-valgrind scenarios.
commit 3f5186e2e8fcee40fd3c8db9b517621a52263214
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Sun May 8 16:30:57 2016 -0400
pmmgr: tune logging batching
When pmmgr runs pmlogcheck on an archive, this can produce voluminous
warning traffic (e.g. for SGI PR1142) that's not helpful for a pmmgr
admin. We now redirect that output also to /dev/null. Since there is
now less output, tweak the obatched(stream) code to issue an explicit
ostream::flush(), so that whether the stream is default-buffered or
not, the log file will be current.
commit 438ce2390165bdafddb86898551b13028f6c6107
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Sun May 8 10:50:06 2016 -0400
crash-resilience for systemd pmmgr/pmwebd
Switch to using Type=forking Restart=always for these services.
They now get auto-restarted by systemd if they crash or are kill-9'd.
The same treatment is probably appropriate for pmcd.
commit 96f4255f8204b2ca661779aba0d9603859ac0be5
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Sun May 8 09:05:06 2016 -0400
pmmgr scaling: don't cry on a SIGPIPE
It has been reported that on some heavily loaded systems, pmmgr
can intermittently die with a "too many interrupts" message. Analysis
with systemtap indicates that these events come from SIGPIPEs being
sent by the kernel from within a
__pmSend
__pmXmitPDU
__pmSendNameList
pmLookupName
....
__dmopencontext
pmNewContext
call chain. Presumably, a remote pmcd died mid-conversation, and
pdu.c's SIGPIPE ignoring logic didn't help enough.
pmmgr should not look for SIGPIPE anyway as a termination signal - we
don't produce output on stdout like a pipeable UNIX tool. We now
SIG_IGN it.
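
A minimal sketch of the effect of ignoring SIGPIPE, assuming a POSIX system. The two function names are invented for this example and are not pmmgr's. With the signal ignored, a write to a peer that has gone away fails with EPIPE, which the caller can handle as an ordinary error, instead of interrupting the process.

```cpp
#include <cerrno>
#include <csignal>
#include <unistd.h>

// Ignore SIGPIPE process-wide, so that writing to a closed pipe or socket
// returns -1 with errno == EPIPE instead of delivering a signal.
void ignore_sigpipe()
{
    std::signal(SIGPIPE, SIG_IGN);
}

// Demonstration helper (hypothetical): write to a pipe whose read end has
// already been closed. Returns the errno seen by write(), or 0 on success.
int write_to_closed_pipe()
{
    int fds[2];
    if (pipe(fds) != 0)
        return errno;
    close(fds[0]);                       // the "peer" goes away
    ssize_t rc = write(fds[1], "x", 1);  // raises SIGPIPE unless it is ignored
    int err = (rc < 0) ? errno : 0;      // save errno before close() can change it
    close(fds[1]);
    return err;
}
```

Calling ignore_sigpipe() once at startup and then write_to_closed_pipe() should yield EPIPE; without the first call, the default SIGPIPE action would terminate the process at the write().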
commit 4ff3f1fb7c6a75378a480f4dbf5a9e75e3e3589a
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Sun May 8 08:10:57 2016 -0400
pmmgr target-threads: tolerate OSs that return <0 for
sysconf(_SC_NPROCESSORS_ONLN)
It's theoretically possible for the online-cpu-count to come back
negative. Map that to zero instead of propagating to a negative
number of target threads.
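
The clamping could look something like the sketch below. The helper names are hypothetical, and the real pmmgr code may derive its thread count differently; the only point shown is mapping a negative sysconf() result to zero.

```cpp
#include <unistd.h>

// Hypothetical helper: map a negative (error) CPU count to zero and pass
// any other value through unchanged.
long clamp_cpu_count(long reported)
{
    return reported < 0 ? 0 : reported;
}

// sysconf() returns -1 when the value cannot be determined, so the raw
// result is clamped before it is used to size anything.
long online_cpus()
{
    return clamp_cpu_count(sysconf(_SC_NPROCESSORS_ONLN));
}
```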
commit ec6a78e35bd359c90fc764fd1943ddd36135c161
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Mon May 2 11:34:52 2016 -0400
pmwebd speedup: libmicrohttpd TURBO mode
An implementation artifact in libmicrohttpd prior to svn commit r37105
meant that concurrent requests into pmwebd are batched in the sense
that the response to one is not sent until the response to all are
finished. This means more perceived waiting for e.g. pmwebd grafana
dashboards with multiple charts, because the empty screen lasts
longer.
The MHD_USE_EPOLL_TURBO flag for MHD_start_daemon activates
performance tweaks, including an improvement in the above behavior.
It's harmless in older libmicrohttpd, and is transparent to qa.
commit 693af7d56ce71e8eb9385de6a937776e7c094478
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date: Mon May 2 11:22:25 2016 -0400
pmmgr pcpqa/666: robustify, run unprivileged
The 666 test case is sometimes reported flaky. Some experiments
suggest one factor is the sloth of pmlogconf, especially on virtual
machines. It can take some 90 seconds (!) for a simple kvm guest, for
reasons not yet understood. This can lead the 666 script's
pmlogconf-awaiting logic to time out, since history waits for no man -
longer than 60 seconds. This timeout is bumped up to 300 seconds.
Synchronization via pmcd.* metrics is also a bit flaky, so we switch
to running pmmgr and its subordinate daemons unprivileged, and monitor
the output files [-s $FILE] directly. Not using $sudo all over also
simplifies the valgrind supervision logic.