
Re: [pcp] pcp updates: more multithreaded fixes and then some

To: "Frank Ch. Eigler" <fche@xxxxxxxxxx>, pcp developers <pcp@xxxxxxxxxxx>
Subject: Re: [pcp] pcp updates: more multithreaded fixes and then some
From: Dave Brolley <brolley@xxxxxxxxxx>
Date: Tue, 10 May 2016 15:20:19 -0400
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <20160508205432.GA7399@xxxxxxxxxx>
References: <20160508205432.GA7399@xxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0
Hi Frank,

I've had a more detailed look at this now. I have a couple of questions, and test 449 is still behaving erratically for me.

In derive.c, you release the lock around several calls to PMAPI functions. Can you confirm that none of the static state protected by the lock needs to be preserved across those calls? i.e. that this is not an opportunity for another thread to take control and change that static state destructively.
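A minimal sketch of the pattern in question, with one way to detect interference. None of this is the actual derive.c code: registered_mutex, registered_count, registered_gen and reentrant_lookup() are hypothetical stand-ins, and the generation counter is just one possible technique for noticing that another thread changed the state while the lock was dropped.

```c
#include <pthread.h>

/* Hypothetical stand-ins for lock-protected static state; not libpcp code. */
static pthread_mutex_t registered_mutex = PTHREAD_MUTEX_INITIALIZER;
static int registered_count;           /* guarded by registered_mutex */
static unsigned long registered_gen;   /* bumped on every change to the state */

/* Placeholder for a reentrant PMAPI call such as pmLookupName(). */
static int reentrant_lookup(int index)
{
    return index * 2;
}

/* Register one more item; may be called from any thread. */
void add_registered(void)
{
    pthread_mutex_lock(&registered_mutex);
    registered_count++;
    registered_gen++;
    pthread_mutex_unlock(&registered_mutex);
}

/*
 * Look up the last registered item without holding the lock across the
 * reentrant call.  Returns the lookup result, -1 if nothing is registered,
 * or -2 if the state changed while the lock was released (caller retries).
 */
int lookup_last_registered(void)
{
    unsigned long gen;
    int index, result;

    pthread_mutex_lock(&registered_mutex);
    if (registered_count == 0) {
        pthread_mutex_unlock(&registered_mutex);
        return -1;
    }
    gen = registered_gen;              /* snapshot what we depend on */
    index = registered_count - 1;
    pthread_mutex_unlock(&registered_mutex);

    result = reentrant_lookup(index);  /* lock is NOT held here */

    pthread_mutex_lock(&registered_mutex);
    if (gen != registered_gen)         /* another thread got in between */
        result = -2;
    pthread_mutex_unlock(&registered_mutex);
    return result;
}
```

If nothing like the generation check is needed in derive.c, that would be because no value read before the unlock is relied upon after the relock, which is what I'd like confirmed.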

In logutil.c (__pmLogLoadLabel): according to their man pages, dirname(3) and basename(3) are not thread-safe, since they may return pointers to reusable static memory. I think that's why there was originally a lock around these calls.
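As a sketch of one way to keep these calls safe (this is not the logutil.c code; split_path() and path_lock are hypothetical names): call dirname(3) and basename(3) on a private copy of the path while holding a lock, and copy the results into caller-owned buffers before releasing it. The lock only helps if every caller of dirname/basename in the process goes through it.

```c
#include <libgen.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical lock serialising all dirname()/basename() use. */
static pthread_mutex_t path_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Copy the directory and base name of path into the caller's buffers.
 * Returns 0 on success, -1 if path or a result does not fit.
 */
int split_path(const char *path, char *dir, size_t dirlen,
               char *base, size_t baselen)
{
    char tmp[4096];   /* private copy: both functions may modify their argument */
    int sts = 0;

    if (strlen(path) >= sizeof(tmp))
        return -1;

    pthread_mutex_lock(&path_lock);
    strcpy(tmp, path);
    /* copy out of the (possibly static) result before anyone can reuse it */
    if ((size_t)snprintf(dir, dirlen, "%s", dirname(tmp)) >= dirlen)
        sts = -1;
    strcpy(tmp, path);
    if ((size_t)snprintf(base, baselen, "%s", basename(tmp)) >= baselen)
        sts = -1;
    pthread_mutex_unlock(&path_lock);

    return sts;
}
```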

qa/449 is still behaving erratically for me. I'm sporadically getting an output mismatch:

449 - output mismatch (see 449.out.bad)
93a95
> traverse: found 1052 metrics, sts PMNS not accessible

Dave


On 05/08/2016 04:54 PM, Frank Ch. Eigler wrote:
Hi -

A mixture of core libpcp multithreading fixes and independent
scaling/robustness patches for other stuff are on the pcpfans.git
fche/multithread branch [freshly rebased]:


commit 17a67d2fcc9e39fb94ce536e3664dc1ce450d873 (HEAD -> fche/multithread)
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun May 8 16:30:57 2016 -0400

     pmmgr: tune logging batching

     When pmmgr runs pmlogcheck on an archive, this can produce voluminous
     warning traffic (e.g. for SGI PR1142) that's not helpful for a pmmgr
     admin.  We now redirect that output also to /dev/null.  Since there is
     now less output, tweak the obatched(stream) code to issue an explicit
     ostream::flush(), so that whether the stream is default-buffered or
     not, the log file will be current.

commit f8af410a6aa6a5185c54e959fa900f7147a8824a
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun May 8 16:16:30 2016 -0400

     libpcp multithreading: context.c, derive.c, pmns.c lock order corrections

     More instances of inconsistent lock orderings are corrected.

     context.c: pmDupContext() removes unnecessary nesting entirely.
     derive.c: trades the possibility of data races for the elimination of
               deadlocks, by briefly releasing the registered.mutex around
               reentrant PMAPI calls like pmLookup*
     pmns.c: Introduces pmns_lock.
             Removes recursive locking from __pmFixPMNSHashTab() and
             TraversePMNS.

     The results are that all the thread-group test cases run reliably
     here, with no remaining helgrind lock-ordering warnings in any of the
     449-invoked multithread* tests, nor 4751.

commit 2a3815f65cf173070c840ce5798611eb7054ceb8
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun May 8 11:29:11 2016 -0400

     unresponsive-pmda pmie message: identify host

     For remotely monitored hosts that have suffered PMDA failure, the pmie
     message should identify the host.  Adding @%h to the message, as per
     many other pmieconf examples.  (No QA impact, as this message does not
     appear in QA at all.)

commit 547da9b379d6cbccd6233134005fb30fc8a90456
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun May 8 10:50:06 2016 -0400

     crash-resilience for systemd pmmgr/pmwebd

     Switch to using Unit=forking Restart=always for these services.
     They now get auto-restarted by systemd if they crash or are kill-9'd.
     The same treatment is probably appropriate for pmcd.

commit 399bbaec4d8dd2b89892f383da2095599f59ec52
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun May 8 09:05:06 2016 -0400

     pmmgr scaling: don't cry on a SIGPIPE

     It has been reported that on some heavily loaded systems, pmmgr
     can intermittently die with a "too many interrupts" message.  Analysis
     with systemtap indicates that these events come from SIGPIPE's being
     sent by the kernel from within a
      __pmSend
      __pmXmitPDU
      __pmSendNameList
      pmLookupName
      ....
      __dmopencontext
      pmNewContext
     call chain.  Presumably, a remote pmcd died mid-conversation, and
     pdu.c's SIGPIPE ignoring logic didn't help enough.

     pmmgr should not look for SIGPIPE anyway as a termination signal - we
     don't produce output on stdout like a pipeable UNIX tool.  We now
     SIG_IGN it.

commit 00a20c48964b2cbb74696ef77ad09d24b60ec3e2
Author: Frank Ch. Eigler <fche@xxxxxxxxxx>
Date:   Sun May 8 08:10:57 2016 -0400

     pmmgr target-threads: tolerate OSs that return <0 for sysconf(_SC_NPROCESSORS_ONLN)

     It's theoretically possible for the online-cpu-count to come back
     negative.  Map that to zero instead of propagating to a negative
     number of target threads.
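
A minimal sketch of that clamping, under assumptions: clamp_cpu_count() and online_cpu_count() are hypothetical names, not pmmgr functions, and what a count of zero then means for thread creation is not shown here. _SC_NPROCESSORS_ONLN is a widely available extension rather than part of POSIX.

```c
#include <unistd.h>

/* Map a negative (error) CPU count to zero; pass other values through. */
long clamp_cpu_count(long raw)
{
    return raw < 0 ? 0 : raw;
}

/* Number of online CPUs, never negative. */
long online_cpu_count(void)
{
    return clamp_cpu_count(sysconf(_SC_NPROCESSORS_ONLN));
}
```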


Older commits f96eecd etc. were already reported back on May 5 under
different commit hashes.


- FChE
