
Re: [pcp] qa/023 hanging?

To: Nathan Scott <nathans@xxxxxxxxxx>
Subject: Re: [pcp] qa/023 hanging?
From: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Date: Mon, 25 Apr 2016 21:45:10 -0400
Cc: Mark Goodwin <mgoodwin@xxxxxxxxxx>, pcp <pcp@xxxxxxxxxxx>
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <280513017.43174641.1461633146452.JavaMail.zimbra@xxxxxxxxxx>
References: <571DA64A.1080802@xxxxxxxxxx> <280513017.43174641.1461633146452.JavaMail.zimbra@xxxxxxxxxx>
User-agent: Mutt/1.4.2.2i
Hi -

> [...failures in...] 4751 (locking in pmNewContext) however ... see attached.

Work on this is ongoing in the pcpfans.git fche/multithread branch.
It turns out the new 4751 (src/multithread10) test case is good at
tickling prior latent locking bugs in libpcp, so results vary.  The
status quo in the master branch is probably OK in the sense that the
latent bugs in question appear less likely to be triggered by the only
real multithreaded pcp clients (pmmgr & pmwebd) than by the test case.
We could ship the code and treat 4751 as flaky.

This commit was the first step down the rabbit hole of actually fixing
those latent bugs.  Things are looking OK in the branch now, though I am
tracking down a mysterious 15x slowdown in pmlogconf
(/usr/libexec/pcp/bin/pmlogconf-setup) on a new test VM, relative to
much older code, leading to timeouts in 024's _wait_for_pmlogger().


commit 7a5b2f9963e050f9aaa374d06a7f5c8d600bc0fa
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun Apr 24 15:17:00 2016 -0400

    PR1055: handle some multithreaded deadlocks & race conditions
    
    While running the qa/4751 test case at full scale, deadlocks reliably
    occur.  (In fact, the 4751.out file was initially checked in truncated
    due to an alarm() catching the deadlocked run, producing no output.)
    The same type of deadlock is also easily demonstrated on stock
    previous-version libpcp, which exculpates the recent pmNewContext
    multithreading changes.
    
    The valgrind "helgrind" tool is good at identifying problems of this
    nature, and should be routinely used for verifying code that deals
    with PM_*LOCK.
    
    The gist of one problem is inconsistent lock ordering.  The libpcp
    lock is sometimes taken nested within a context c_lock; and sometimes
    vice versa.  Two threads can easily deadlock each other.  helgrind
    showed multiple different scenarios where the libpcp lock was taken
    unnecessarily by lower level code - where a smaller lock was
    sufficient.  This patchset adds a handful of small, non-recursive
    locks for these.
    
    This patch also includes a fix to a nastier race condition in
    __pmHandleToPtr(), whereby a context-destruction could race against
    context-structure lookup.  Some work remains in the multi-archive code
    and elsewhere to avoid two mildly racy functions (__pmPtrToHandle and
    the new __pmHandleToPtr_unlocked).
    
    qa/4751 and all other preexisting thread-group test cases look good
    now; no more deadlock or lock-ordering-error reports there, at least.
    (There are likely more hiding in the code: the libpcp lock is way
    overused.)
