
pcp updates: fche pmmgr/pmweb/qa

To: pcp@xxxxxxxxxxx
Subject: pcp updates: fche pmmgr/pmweb/qa
From: Lukas Berk <lberk@xxxxxxxxxx>
Date: Thu, 16 Jun 2016 09:04:32 -0400
Delivered-to: pcp@xxxxxxxxxxx
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)

Hi, I've cherry-picked most of the commits from pcpfans/fche/pmfuntools.

Changes committed to git://git.pcp.io/lberk/pcp.git master

Frank Ch. Eigler (11):
      pmmgr pcpqa/666: robustify, run unprivileged
      pmwebd speedup: libmicrohttpd TURBO mode
      pmmgr target-threads: tolerate OSs that return <0 for sysconf(_SC_NPROCESSORS_ONLN)
      pmmgr scaling: don't cry on a SIGPIPE
      crash-resilience for systemd pmmgr/pmwebd
      pmmgr: tune logging batching
      pmmgr qa/666: tweak timings
      pmmgr qa/666: further relax test predicates
      pmmgr: corrupt archive handling, pmlogextract batching
      pmwebd graphite png graphics: interrupt tweak
      unresponsive-pmda pmie message: identify host

 man/man1/pmmgr.1                 |   14 +++
 qa/666                           |   78 ++++++++----------
 qa/666.out                       |    4 
 qa/669                           |  159 +++++++++++++++++++++++++++++++++++++
 qa/669.out                       |   17 ++++
 qa/group                         |    1 
 src/pmieconf/primary/pmda_status |    2 
 src/pmmgr/GNUmakefile            |    3 
 src/pmmgr/pmmgr.cxx              |  165 +++++++++++++++++++++++++--------------
 src/pmmgr/pmmgr.h                |    2 
 src/pmmgr/pmmgr.service.in       |    9 +-
 src/pmwebapi/GNUmakefile         |    3 
 src/pmwebapi/main.cxx            |   11 ++
 src/pmwebapi/pmgraphite.cxx      |   14 +--
 src/pmwebapi/pmwebd.service.in   |   11 +-
 15 files changed, 366 insertions(+), 127 deletions(-)

Details ...

commit 1a4a8dd156254ad53bc0d637a982ad37a1951301
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun May 8 11:29:11 2016 -0400

    unresponsive-pmda pmie message: identify host
    
    For remotely monitored hosts that have suffered PMDA failure, the pmie
    message should identify the host.  Adding @%h to the message, as per
    many other pmieconf examples.  (No QA impact, as this message does not
    appear in QA at all.)

commit 796986c6a71e9b027ce16a7e868333d6d78efa3a
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Wed Jun 1 10:14:34 2016 -0400

    pmwebd graphite png graphics: interrupt tweak
    
    When a ^C is received within the gfx/cairo rendering pipeline, abort
    before cairo memory is allocated.  That way there's less irrelevant
    shutdown-time carping in valgrind leak-check reports.

commit 634433a316d7e7a73ae00f3dc58a934181450c06
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun May 15 18:01:49 2016 -0400

    pmmgr: corrupt archive handling, pmlogextract batching
    
    Sometimes we encounter corrupt archives during aging/merging.  Instead
    of retrying them over and over, we now rename those "corrupt-YYYY..."
    and give them a separate aging parameter pmlogcheck-corrupt-gc so they
    get cleaned up eventually.
    
    Sometimes we encounter the situation wherein short-lived pmcd tasks
    cause a great number of distinct pmmgr archives to pile up for some
    given day (pmlogmerge period).  This can trigger the old pmlogextract
    PR1054 limitation, wherein if it is given too many archives on its
    command line, it runs out of file descriptors and dies.  We can't use
    pmlogger_merge (because it doesn't take all pmlogextract parameters
    we'd like), so instead we impose a batching limit of 64 (archives per
    pmlogextract run) ourselves.  Over a few restart cycles it can catch
    up - or at least make progress.
    
    qa/669 added to cover both new bits of function.  man/pmmgr.1 updated.
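
The batching limit described above can be sketched roughly as follows. This is a minimal illustration, not the code from pmmgr.cxx: the helper name split_into_batches and its signature are hypothetical, and only the limit of 64 comes from the commit message. Whether pmmgr handles one batch or several per cycle is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not taken from pmmgr.cxx): split a list of archive
// names into consecutive groups of at most batch_size entries, so that each
// group could be handed to one pmlogextract run.
std::vector<std::vector<std::string>>
split_into_batches(const std::vector<std::string>& archives,
                   std::size_t batch_size)
{
    assert(batch_size > 0);  // a zero batch size would never make progress
    std::vector<std::vector<std::string>> batches;
    for (std::size_t i = 0; i < archives.size(); i += batch_size) {
        // The last group may be shorter than batch_size.
        std::size_t end = std::min(archives.size(), i + batch_size);
        batches.emplace_back(archives.begin() + i, archives.begin() + end);
    }
    return batches;
}
```

With a limit of 64, a day with 130 archives would be split into groups of 64, 64 and 2, each small enough to stay clear of the file descriptor limit.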

commit e1d2911784e3a4caa7f37f12362e4b6f9dd2f5dc
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Tue May 10 09:07:06 2016 -0400

    pmmgr qa/666: further relax test predicates
    
    With pmmgr running asynchronously, and its child processes taking a
    hard-to-predict amount of time, it continues to be a struggle to have
    a brief test case that measures their progress.  Relax another test
    predicate to make it accept bare existence rather than minimum-count
    of a type of file.

commit 7ce54dd7a0b6a29236ec797752da4b0f4943ca98
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Mon May 9 17:00:17 2016 -0400

    pmmgr qa/666: tweak timings
    
    Adjusting test times & windows so that the qa/666 test case runs
    more predictably in both valgrind and non-valgrind scenarios.

commit 3f5186e2e8fcee40fd3c8db9b517621a52263214
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun May 8 16:30:57 2016 -0400

    pmmgr: tune logging batching
    
    When pmmgr runs pmlogcheck on an archive, this can produce voluminous
    warning traffic (e.g. for SGI PR1142) that's not helpful for a pmmgr
    admin.  We now redirect that output also to /dev/null.  Since there is
    now less output, tweak the obatched(stream) code to issue an explicit
    ostream::flush(), so that whether the stream is default-buffered or
    not, the log file will be current.

commit 438ce2390165bdafddb86898551b13028f6c6107
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun May 8 10:50:06 2016 -0400

    crash-resilience for systemd pmmgr/pmwebd
    
    Switch to using Type=forking Restart=always for these services.
    They now get auto-restarted by systemd if they crash or are kill-9'd.
    The same treatment is probably appropriate for pmcd.

commit 96f4255f8204b2ca661779aba0d9603859ac0be5
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun May 8 09:05:06 2016 -0400

    pmmgr scaling: don't cry on a SIGPIPE
    
    It has been reported that on some heavily loaded systems, pmmgr
    can intermittently die with a "too many interrupts" message.  Analysis
    with systemtap indicates that these events come from SIGPIPEs being
    sent by the kernel from within a
     __pmSend
     __pmXmitPDU
     __pmSendNameList
     pmLookupName
     ....
     __dmopencontext
     pmNewContext
    call chain.  Presumably, a remote pmcd died mid-conversation, and
    pdu.c's SIGPIPE ignoring logic didn't help enough.
    
    pmmgr should not look for SIGPIPE anyway as a termination signal - we
    don't produce output on stdout like a pipeable UNIX tool.  We now
    SIG_IGN it.
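
As a rough illustration of the SIG_IGN approach (a sketch only, not the pmmgr source), ignoring SIGPIPE makes a write to a closed pipe or socket fail with EPIPE rather than deliver a signal. The function names ignore_sigpipe and write_to_closed_pipe are hypothetical and exist only for this demonstration.

```cpp
#include <cerrno>
#include <csignal>
#include <unistd.h>

// Ignore SIGPIPE process-wide.  After this, writing to a pipe or socket
// whose peer has gone away returns -1 with errno == EPIPE instead of
// raising a signal.
bool ignore_sigpipe()
{
    return std::signal(SIGPIPE, SIG_IGN) != SIG_ERR;
}

// Demonstration helper (hypothetical): write one byte into a pipe whose
// read end has already been closed, and return the errno from write(),
// 0 if the write unexpectedly succeeded, or -1 if pipe() itself failed.
int write_to_closed_pipe()
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);                      // no reader left
    char c = 'x';
    ssize_t n = write(fds[1], &c, 1);   // would raise SIGPIPE by default
    int err = (n < 0) ? errno : 0;
    close(fds[1]);
    return err;
}
```

Calling ignore_sigpipe() once at startup is enough; callers then need to check write/send return values for EPIPE themselves.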

commit 4ff3f1fb7c6a75378a480f4dbf5a9e75e3e3589a
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun May 8 08:10:57 2016 -0400

    pmmgr target-threads: tolerate OSs that return <0 for sysconf(_SC_NPROCESSORS_ONLN)
    
    It's theoretically possible for the online-cpu-count to come back
    negative.  Map that to zero instead of propagating to a negative
    number of target threads.
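
The mapping of a negative CPU count to zero might look like the sketch below. The helper names clamp_cpu_count and online_cpus_or_zero are hypothetical, and how pmmgr turns the result into a target-threads value is not shown in this message.

```cpp
#include <unistd.h>

// Hypothetical helper: map a negative (error) CPU count to zero and pass
// non-negative counts through unchanged.
long clamp_cpu_count(long reported)
{
    return reported < 0 ? 0 : reported;
}

// Query the number of online CPUs; sysconf() returns -1 when the value is
// unavailable, which is clamped to zero here.
long online_cpus_or_zero()
{
    return clamp_cpu_count(sysconf(_SC_NPROCESSORS_ONLN));
}
```

Separating the clamp from the sysconf call keeps the negative case testable on systems that never actually return an error.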

commit ec6a78e35bd359c90fc764fd1943ddd36135c161
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Mon May 2 11:34:52 2016 -0400

    pmwebd speedup: libmicrohttpd TURBO mode
    
    An implementation artifact in libmicrohttpd prior to svn commit r37105
    meant that concurrent requests into pmwebd are batched in the sense
    that the response to one is not sent until the responses to all are
    finished.  This means more perceived waiting for e.g. pmwebd grafana
    dashboards with multiple charts, because the empty screen lasts
    longer.
    
    The MHD_USE_EPOLL_TURBO flag for MHD_start_daemon activates
    performance tweaks, including an improvement in the above behavior.
    It's harmless in older libmicrohttpd, and is transparent to qa.

commit 693af7d56ce71e8eb9385de6a937776e7c094478
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Mon May 2 11:22:25 2016 -0400

    pmmgr pcpqa/666: robustify, run unprivileged
    
    The 666 test case is sometimes reported flaky.  Some experiments
    suggest one factor is the sloth of pmlogconf, especially on virtual
    machines.  It can take some 90 seconds (!) for a simple kvm guest, for
    reasons not yet understood.  This can lead the 666 script's
    pmlogconf-awaiting logic to time out, since history waits for no man -
    longer than 60 seconds.  This timeout is bumped up to 300 seconds.
    
    Synchronization via pmcd.* metrics is also a bit flaky, so we switch
    to running pmmgr and its subordinate daemons unprivileged, and monitor
    the output files [-s $FILE] directly.  Not using $sudo all over also
    simplifies the valgrind supervision logic.
