To: pcp@xxxxxxxxxxx
Subject: pcp updates
From: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date: Tue, 22 May 2012 10:18:27 +1000
First round of changes ... the lock-related ones are probably serious
enough to be considered as potential 3.6.4 release fodder ... but there
will be a second round to come for pmdumplog and pmlogextract (at least)
for some really ugly problems (seen on OpenIndiana) associated with
arithmetic errors caused by using a double to represent a timestamp ...
this has to be excised.
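
The specific failures are not described in this message. As general background: for a present-day epoch time (around 1.3e9 seconds) adjacent double values are roughly 2.4e-7 seconds apart, so not every microsecond value survives a round trip through a double. A minimal sketch of the alternative, keeping the arithmetic in integer seconds and microseconds, is below. The helper names `tv_sub` and `tv_cmp` are hypothetical and are not taken from pmdumplog or pmlogextract.

```c
#include <sys/time.h>

/*
 * Hypothetical helper: return a - b with the result normalised so that
 * 0 <= tv_usec < 1000000.  Everything stays in integer arithmetic.
 */
static struct timeval tv_sub(struct timeval a, struct timeval b)
{
    struct timeval r;

    r.tv_sec = a.tv_sec - b.tv_sec;
    r.tv_usec = a.tv_usec - b.tv_usec;
    if (r.tv_usec < 0) {
        /* borrow one second */
        r.tv_sec--;
        r.tv_usec += 1000000;
    }
    return r;
}

/* Hypothetical helper: return -1, 0 or 1 as a is before, equal to or after b. */
static int tv_cmp(struct timeval a, struct timeval b)
{
    if (a.tv_sec != b.tv_sec)
        return a.tv_sec < b.tv_sec ? -1 : 1;
    if (a.tv_usec != b.tv_usec)
        return a.tv_usec < b.tv_usec ? -1 : 1;
    return 0;
}
```

Both inputs are assumed to be normalised already, which is what gettimeofday() and the archive record headers would normally provide.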

Changes committed to git://oss.sgi.com/kenj/pcp.git dev

 build/sun/pcp.xml                |   58 +++++++++++++++
 src/dbpmda/src/util.c            |   13 +++
 src/include/pcp/impl.h           |   12 +++
 src/libpcp/src/check-statics     |    1 
 src/libpcp/src/context.c         |   79 +++++++++++++++++++--
 src/libpcp/src/derive.c          |   46 +++++++++++-
 src/libpcp/src/interp.c          |    1 
 src/libpcp/src/lock.c            |  142 +++++++++++++++++++++++++++++++++++----
 src/libpcp/src/util.c            |   19 ++++-
 src/libpcp_fault/README          |   25 ++++++
 src/libpcp_fault/src/GNUmakefile |    9 +-
 src/pmdas/linux/cgroups.c        |   31 +++++++-
 src/pmdas/linux/pmda.c           |    5 -
 src/pmdas/trace/stub.c           |    2 
 src/pmdbg/pmdbg.c                |    2 
 src/pmdumplog/pmdumplog.c        |    8 ++
 src/pmie/src/show.c              |    2 
 src/pmlogextract/pmlogextract.c  |    7 +
 src/pmlogger/fetch.c             |    9 +-
 src/pmlogger/pmlogger.c          |   38 ++++++++--
 20 files changed, 460 insertions(+), 49 deletions(-)

commit 3ea8062065e212dcdc3a1dcd993b6b208554f7f6
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Sun May 20 08:49:38 2012 +1000

    libpcp_fault.so - make symlink to libpcp.so
    
    Only in workarea, not installed.  Needed for QA (see 512 for bizarre
    explanation).

commit 12679bfc7e18aeca8c34be37e5b4a17206cf222f
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Sat May 19 22:24:36 2012 +1000

    Add checks for all pthread_*() calls
    
    Check return values for all pthread_*() routines ... mostly
    related to mutexes.
    
    If the call fails, report and exit(4) ... it is better to drop
    dead than blunder on believing we're thread-safe when one of the
    pthread_*() routines fails.
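
A minimal sketch of the check-and-exit idea described in this commit, not the actual libpcp code. The names `check_pthread`, `demo_mutex` and `bump_counter` are hypothetical. The one firm detail is that pthread_*() routines return the error number directly rather than setting errno.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: report a failed pthread_*() call and exit(4).
 * sts is the return value of the call, which is 0 on success or an
 * error number on failure.
 */
static void check_pthread(int sts, const char *what)
{
    if (sts != 0) {
        fprintf(stderr, "%s failed: %s\n", what, strerror(sts));
        exit(4);
    }
}

static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static int demo_counter;

/* Example caller: every lock and unlock is checked. */
static int bump_counter(void)
{
    int value;

    check_pthread(pthread_mutex_lock(&demo_mutex), "pthread_mutex_lock");
    value = ++demo_counter;
    check_pthread(pthread_mutex_unlock(&demo_mutex), "pthread_mutex_unlock");
    return value;
}
```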

commit 29dbf1a2fc7bd89014732a47c491508d304c0303
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Sat May 19 21:35:04 2012 +1000

    pmlogextract - bad lock recursion
    
    Similar to the pmlogger case, this is a 3.6 regression as fallout
    from making libpcp thread-safe on most platforms.  I checked
    all of the similar cases in libpcp, but missed this one when an
    application (pmlogextract) is using an internal libpcp API that
    has locking side-effects.
    
    We were calling __pmHandleToPtr() (which returns with the context
    locked) and not being careful about releasing the context lock
    (c_lock) ... in the worst case, this was happening once per input
    log record (data or metadata).
    
    The fix is to release the context lock as soon as we've finished
    poking around with the context.

commit 205af1e345a5cac21d58b7acdaf315f2040404ea
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Sat May 19 21:32:33 2012 +1000

    dbpmda - add comment to explain why context remains locked throughout

commit 326ab916b583af358ff6a034793b5a07334a4fe9
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Sat May 19 18:11:03 2012 +1000

    pmdumplog - add comment to explain why context remains locked throughout

commit 09278bb2b198e47ef333ac4a456069d8c1208aa5
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Sat May 19 17:56:45 2012 +1000

    Enable lock debug tracing for libpcp_fault.
    
    This variant of libpcp for QA includes both the fault injection
    infrastructure _and_ the new lock debug tracing features.

commit a084372a4f7de89838b1d536dbd39f6d49408a31
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Sat May 19 17:42:17 2012 +1000

    Add lock trace debugging for libpcp.
    
    This is _not_ enabled by default.  Need to #define PM_MULTI_THREAD_DEBUG
    and rebuild libpcp.
    
    When this is done, all PM_LOCK() and PM_UNLOCK() calls are
    intercepted and optional diagnostics produced to report who is
    locking/unlocking what and report unusual lock counts (except
    for recursive locking we expect the lock count to be 0 before
    PM_LOCK() and 1 before PM_UNLOCK()) ... reporting is enabled by
    the [new] -Dlock debug option in concert with the -Dappl? options
    to selectively enable diagnostics for particular lock classes -
    appl0 for the libpcp global lock, appl1 for the per-context locks
    and appl2 for everything else (including the ipc channel lock).
    
    PCP QA is the only likely consumer of these features, unless we
    have problems that _really_ look like they are lock-related.
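
A rough sketch of how lock calls could be intercepted with macros and checked against the expected counts (0 before a lock, 1 before an unlock). This is an illustration only: `DEMO_LOCK_DEBUG`, `DEMO_LOCK()`, `DEMO_UNLOCK()`, `demo_lock_t` and the other demo_* names are hypothetical stand-ins, not the real PM_LOCK() machinery, and the sketch ignores the pthread return values and the -D option handling.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical build-time switch, defined here so tracing is compiled in. */
#define DEMO_LOCK_DEBUG 1

typedef struct {
    pthread_mutex_t mutex;
    int             count;      /* times this lock is currently held */
    const char     *name;       /* label used in the diagnostics */
} demo_lock_t;

static int demo_trace_on = 1;   /* stands in for a runtime debug option */
static int demo_unusual;        /* number of unexpected lock counts seen */

/* Report who is locking/unlocking what, and flag unusual lock counts. */
static void demo_report(const char *op, const demo_lock_t *l, int expect,
                        const char *file, int line)
{
    int odd = (l->count != expect);

    if (odd)
        demo_unusual++;
    if (demo_trace_on)
        fprintf(stderr, "%s:%d %s(%s) count=%d%s\n", file, line, op,
                l->name, l->count, odd ? " [unusual]" : "");
}

static void demo_lock(demo_lock_t *l, const char *file, int line)
{
    demo_report("lock", l, 0, file, line);      /* expect 0 before locking */
    pthread_mutex_lock(&l->mutex);
    l->count++;
}

static void demo_unlock(demo_lock_t *l, const char *file, int line)
{
    demo_report("unlock", l, 1, file, line);    /* expect 1 before unlocking */
    l->count--;
    pthread_mutex_unlock(&l->mutex);
}

#ifdef DEMO_LOCK_DEBUG
#define DEMO_LOCK(l)    demo_lock(&(l), __FILE__, __LINE__)
#define DEMO_UNLOCK(l)  demo_unlock(&(l), __FILE__, __LINE__)
#else
#define DEMO_LOCK(l)    pthread_mutex_lock(&(l).mutex)
#define DEMO_UNLOCK(l)  pthread_mutex_unlock(&(l).mutex)
#endif

static demo_lock_t demo_ctx_lock = { PTHREAD_MUTEX_INITIALIZER, 0, "c_lock" };
```

With the switch undefined the macros collapse to the plain mutex calls, so callers do not change.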

commit 9c6c2d5dc430020fc2956aa36d6e14132b29d91a
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Sat May 19 17:21:04 2012 +1000

    pmlogger - bad lock recursion
    
    This is a 3.6 regression as fallout from making libpcp thread-safe on
    most platforms.  I checked all of the similar cases in libpcp, but missed
    this one when an application (pmlogger) is using an internal libpcp API
    that has locking side-effects.
    
    We were calling __pmHandleToPtr() (which returns with the context locked)
    and not being careful about releasing the context lock (c_lock) ... in the
    worst case, this was in pmlogger's special fetch routine, so deeper
    locking recursion every time we fetch metrics to be logged.
    
    For many platforms, recursive locking seems to be allowed without obvious
    limit (probably 2^32 or 2^64 depending on how the mutex is implemented),
    but on Solaris (more specifically OpenIndiana) there appears to be a limit
    of 500 or so ... which we can hit easily in QA.
    
    The fix is to release the context lock as soon as we've finished poking
    around with the context, and if necessary re-acquire the context pointer
    and context lock later, then release it again ... repeat until bored.
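
The acquire, use, release, re-acquire pattern described above can be sketched as follows. This is a stand-alone mock, not the pmlogger change itself: `ctx_t`, `handle_to_ptr()`, `release_ctx()` and `do_fetch()` are hypothetical names, and a plain mutex plus a depth counter stands in for the real context lock so that any missed release shows up immediately.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a context; only the locking matters here. */
typedef struct {
    pthread_mutex_t c_lock;
    int             depth;      /* times c_lock is currently held */
    int             nfetch;     /* some state protected by c_lock */
} ctx_t;

static ctx_t the_ctx = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };

/* Mimics the side-effect in the text: returns with the context locked. */
static ctx_t *handle_to_ptr(int handle)
{
    if (handle != 0)
        return NULL;            /* only handle 0 exists in this mock */
    if (pthread_mutex_lock(&the_ctx.c_lock) != 0)
        return NULL;
    the_ctx.depth++;
    return &the_ctx;
}

/* Release the lock taken by handle_to_ptr(). */
static void release_ctx(ctx_t *ctx)
{
    ctx->depth--;
    pthread_mutex_unlock(&ctx->c_lock);
}

/*
 * One fetch cycle: take the lock, touch the context, release at once,
 * then re-acquire later when the context is needed again.
 * Returns the lock depth seen on the second acquire, or -1 on error.
 */
static int do_fetch(int handle)
{
    ctx_t *ctx;
    int depth_seen;

    if ((ctx = handle_to_ptr(handle)) == NULL)
        return -1;
    ctx->nfetch++;
    release_ctx(ctx);           /* finished poking around: let go */

    /* ... work that does not need the context goes here ... */

    if ((ctx = handle_to_ptr(handle)) == NULL)
        return -1;
    depth_seen = ctx->depth;
    release_ctx(ctx);
    return depth_seen;
}
```

If every path releases what it acquires, the depth seen inside do_fetch() stays at 1 no matter how many fetches are performed.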

commit 324673d77526f113e61af53233c5317e9efdeb1e
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Fri May 18 06:51:48 2012 +1000

    Solaris init scripts rework.
    
    The pcp.xml manifest was missed in the pcp -> pmcd and pmlogger
    changes for the init scripts in PCP 3.6.

commit 71fbef22283e13c008ce4acc524dde4a203a1c83
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Fri May 18 06:51:16 2012 +1000

    interp.c - fix small mem leak on unlikely error path

commit 686b96525ab3d26a6389ad7da821d4306a98c0ad
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Thu May 17 17:22:18 2012 +1000

    trace demo stub.c - sys_nerr is not all that portable!

commit cd45147c3f60648b1898a785debf74360a93e6dd
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Tue May 15 08:32:37 2012 +1000

    pmlogger/pmlc/libpcp - fix segv in diag dark corner
    
    pmlc and pmlogger use pmResults in perverse ways to encode
    logging state ... this was causing pmlogger to dump core when
    it was running with full diagnostics enabled and pmlc enquired
    about the logging state for event record metrics!
    
    Found by accident when investigating a QA 139 failure on one host;
    a new QA 510 now checks explicitly for this unusual combination.

commit c51adc15849173f65cfba9f1abca6ea7f0d39a2a
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Tue May 15 08:31:38 2012 +1000

    pmie - NaN fixup
    
    Small error in the last round of NaN changes ... conditional code when
    fpclassify() is not available was wrong.
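
The commit does not show the corrected code. One common shape for this kind of conditional, offered only as an assumption, is to fall back on the IEEE 754 property that NaN is the only value that compares unequal to itself. `HAVE_FPCLASSIFY` is a hypothetical configure-time macro and `demo_isnan` is a hypothetical name.

```c
#include <math.h>

/* Return non-zero if x is a NaN. */
static int demo_isnan(double x)
{
#ifdef HAVE_FPCLASSIFY          /* hypothetical configure-time macro */
    return fpclassify(x) == FP_NAN;
#else
    /* fallback: NaN is the only value for which x != x is true */
    return x != x;
#endif
}
```

The fallback depends on the compiler honouring IEEE comparison semantics, so it is not reliable when built with options such as -ffast-math.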

commit 2fc49a724119505d80121fe004b4455f7a63937f
Author: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Date:   Mon May 14 06:03:43 2012 +1000

    linux pmda - fix unaligned memory access for cgroup.subsys.hierarchy
    
    Code was incorrectly using opaque pointer to private data with
    the pmdaCache*() routines causing unaligned memory accesses on
    ia64 platforms.
    
    Although this problem is fixed, this code still seems a bit
    dodgy and would appear to do the wrong thing if /proc/cgroups
    ever listed the same subsystem name in more than one hierarchy
    (not sure if that is possible).


