
Re: libpcp multithreading - next steps

To: Dave Brolley <brolley@xxxxxxxxxx>
Subject: Re: libpcp multithreading - next steps
From: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Date: Mon, 25 Jul 2016 16:32:57 -0400
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <57965C89.40401@xxxxxxxxxx>
References: <20160603155039.GB26460@xxxxxxxxxx> <578D1AE1.6060307@xxxxxxxxxx> <y0my44xksjb.fsf@xxxxxxxx> <57965C89.40401@xxxxxxxxxx>
User-agent: Mutt/1.4.2.2i
Hi, Dave -


> Yeah, that's what I was referring to. For some reason, I thought that it 
> had never been merged.

Yes, it was merged into 3.11.2 (commit efc0173ad and a few followups).
How time flies.


> >Can you explain what you mean?  What would you like to see on this new
> >branch?

> It's my understanding that the subsequent changes came in a series of
> smaller changes. If so, rather than trying to pull them in one at a
> time from the existing branch(es), it might be easier to
> re-introduce them one at a time.

I'm not sure how "pulling in" differs from "reintroducing".  Either
way, it seems we need to go through them one by one, whether or not
they come from a branch where they already exist.


> addressing a demonstrable problem
> failing (possibly new) qa, or performance issue
> reviewable fix
> qa now passing or performance improved

This is a fine way to go for normal, small-scale bugs.  What we have is
more of a systemic design issue.  The multithreading problems tend to
share a common signature:

(1)
* pick a multithreading test case
* run it under valgrind --tool=helgrind
* observe lock nesting errors
* fix by tweaking lock ordering/selection
and
(2)
* pick a multithreading test case
* run it with a large workload, enough times to ...
* ... observe deadlock or crash
* from various thread backtraces, deduce nesting or other error
* fix by tweaking lock ordering/selection

... then repeat.
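
For concreteness, here is a minimal, self-contained sketch of the ABBA
pattern these workflows keep flagging (the lock names are invented, not
libpcp's): two code paths take the same pair of mutexes in opposite
orders, which helgrind reports as a lock-order violation and which can
deadlock under load.

    /* abba.c: hypothetical sketch, not libpcp code.  Build with
     *   gcc -pthread abba.c && valgrind --tool=helgrind ./a.out
     * and helgrind emits a "lock order ... violated" report much like
     * the ones further below.
     */
    #include <pthread.h>

    static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

    static void *path_one(void *arg)    /* takes A then B */
    {
        (void)arg;
        pthread_mutex_lock(&lock_a);
        pthread_mutex_lock(&lock_b);
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return NULL;
    }

    static void *path_two(void *arg)    /* takes B then A: potential deadlock */
    {
        (void)arg;
        pthread_mutex_lock(&lock_b);
        pthread_mutex_lock(&lock_a);
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, path_one, NULL);
        pthread_create(&t2, NULL, path_two, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }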


Take, for example, the 4751 test case, as reported three months ago at
[1]: as committed, it was failing in that the last run timed out and
produced no output.

[1]  http://oss.sgi.com/archives/pcp/2016-04/msg00202.html

See the current git head qa/4751{,.out} and the 3.11.3 released binaries
on Fedora 24.  On my machine, the 4751 test now happens to run
successfully most of the time.  But:

    valgrind --tool=helgrind ./src/multithread10 [....]

(where [....] is the 157 archives/ip-addresses for the last part of the
test case) produces numerous errors:

==12842== Thread #3: lock order "0x90839D0 before 0x52DA1E0" violated
==12842== 
==12842== Observed (incorrect) order is: acquisition of lock at 0x52DA1E0
==12842==    at 0x4C2FE9D: mutex_lock_WRK (hg_intercepts.c:901)
==12842==    by 0x4C33D01: pthread_mutex_lock (hg_intercepts.c:917)
==12842==    by 0x50ABE82: __pmLock (lock.c:278)
==12842==    by 0x506D87B: pmDestroyContext (context.c:1494)
==12842==    by 0x5072C01: pmDestroyFetchGroup (fetchgroup.c:1653)
==12842==    by 0x109059: thread_fn (multithread10.c:65)
==12842==    by 0x4C32A24: mythread_wrapper (hg_intercepts.c:389)
==12842==    by 0x4E455C9: start_thread (pthread_create.c:333)
==12842==    by 0x53E2EAC: clone (clone.S:109)
==12842== 
==12842==  followed by a later acquisition of lock at 0x90839D0
==12842==    at 0x4C2FE9D: mutex_lock_WRK (hg_intercepts.c:901)
==12842==    by 0x4C33D01: pthread_mutex_lock (hg_intercepts.c:917)
==12842==    by 0x50ABE82: __pmLock (lock.c:278)
==12842==    by 0x506D8C0: pmDestroyContext (context.c:1507)
==12842==    by 0x5072C01: pmDestroyFetchGroup (fetchgroup.c:1653)
==12842==    by 0x109059: thread_fn (multithread10.c:65)
==12842==    by 0x4C32A24: mythread_wrapper (hg_intercepts.c:389)
==12842==    by 0x4E455C9: start_thread (pthread_create.c:333)
==12842==    by 0x53E2EAC: clone (clone.S:109)
==12842== 
[...]

Every such lock-order report is a potential deadlock site, several of
which have actually been observed to occur.  Every one represents a
design flaw.
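
The usual remedy, sketched here rather than as a concrete patch (the
lock names are invented; this is not the actual libpcp fix), is to pick
one global ordering for the locks involved and funnel every multi-lock
acquisition through a helper that enforces it, so no code path can take
them in reverse:

    /* Hypothetical sketch of a lock-ordering discipline; not libpcp code. */
    #include <pthread.h>

    static pthread_mutex_t outer_lock = PTHREAD_MUTEX_INITIALIZER;  /* rank 1 */
    static pthread_mutex_t inner_lock = PTHREAD_MUTEX_INITIALIZER;  /* rank 2 */

    /* Every caller that needs both locks goes through here, so the
     * rank-1 lock is always acquired before the rank-2 lock. */
    static void lock_both(void)
    {
        pthread_mutex_lock(&outer_lock);
        pthread_mutex_lock(&inner_lock);
    }

    static void unlock_both(void)
    {
        /* Release in the reverse order of acquisition. */
        pthread_mutex_unlock(&inner_lock);
        pthread_mutex_unlock(&outer_lock);
    }

    int main(void)
    {
        lock_both();
        /* ... critical section touching both protected structures ... */
        unlock_both();
        return 0;
    }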


So, would this be our game plan?

> addressing a demonstrable problem
    -> helgrind error or otherwise triggered deadlock/crash
> failing (possibly new) qa, or performance issue
    -> qa code already exists; just invoke it repeatedly, heavily, or under monitoring
> reviewable fix
    -> as per the fche/multithread branch (just rebased)
> qa now passing or performance improved
    -> one fewer helgrind error


- FChE
