Hi, Dave -
> Yeah, that's what I was referring to. For some reason, I thought that it
> had never been merged.
Yes, it was merged into 3.11.2 (commit efc0173ad and a few followups).
How time flies.
> >Can you explain what you mean? What would you like to see on this new
> >branch?
> It's my understanding that the subsequent changes came in a series of
> smaller changes. If so, rather than trying to pull them in one at a
> time from the existing branch(es), it might be easier to
> re-introduce them one at a time.
Not sure how "pulling in" differs from "reintroducing". Either way, it
seems we need to go through them one by one, whether or not they are
taken from a branch where they already exist.
> addressing a demonstrable problem
> failing (possibly new) qa, or performance issue
> reviewable fix
> qa now passing or performance improved
This is a fine way to go for the normal sort of small bug. What we
have is more of a systemic design issue. The multithreading problems
tend to share a common signature (a minimal illustration follows the
two recipes below):
(1)
* pick a multithreading test case
* run it under valgrind --tool=helgrind
* observe lock nesting errors
* fix by tweaking lock ordering/selection
and
(2)
* pick a multithreading test case
* run it with a large workload, enough times to ...
* ... observe deadlock or crash
* from various thread backtraces, deduce nesting or other error
* fix by tweaking lock ordering/selection
... then repeat.
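
To make that signature concrete, here is a tiny standalone sketch of
the pattern behind helgrind's lock-order reports. The file and mutex
names are invented for illustration; this is not PCP code.

/* lockorder.c: two locks nested in opposite orders on two paths.
 * Build:  gcc -pthread -o lockorder lockorder.c
 * Run:    valgrind --tool=helgrind ./lockorder
 */
#include <pthread.h>

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t context_lock  = PTHREAD_MUTEX_INITIALIZER;

static void *thread_a(void *arg)
{
    (void)arg;
    /* This path takes registry_lock, then context_lock ... */
    pthread_mutex_lock(&registry_lock);
    pthread_mutex_lock(&context_lock);
    pthread_mutex_unlock(&context_lock);
    pthread_mutex_unlock(&registry_lock);
    return NULL;
}

static void *thread_b(void *arg)
{
    (void)arg;
    /* ... while this one nests the same two locks the other way
     * around.  Under an unlucky interleaving each thread holds one
     * lock and waits forever for the other. */
    pthread_mutex_lock(&context_lock);
    pthread_mutex_lock(&registry_lock);
    pthread_mutex_unlock(&registry_lock);
    pthread_mutex_unlock(&context_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

Helgrind reports the inconsistent order even on runs that happen to
complete, which is why recipe (1) does not depend on catching an
actual hang.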
Take, for example, the 4751 test case, as reported back three months
ago at [1]: as committed, it was failing in that the last run timed
out and produced no output.
[1] http://oss.sgi.com/archives/pcp/2016-04/msg00202.html
See the current git head qa/4751{,.out} and the 3.11.3 released
binaries on Fedora 24. On my machine, the 4751 test now happens to
run successfully most of the time. But:
valgrind --tool=helgrind ./src/multithread10 [....]
(where [....] is the 157 archives/ip-addresses for the last part of the
test case) produces numerous errors:
==12842== Thread #3: lock order "0x90839D0 before 0x52DA1E0" violated
==12842==
==12842== Observed (incorrect) order is: acquisition of lock at 0x52DA1E0
==12842== at 0x4C2FE9D: mutex_lock_WRK (hg_intercepts.c:901)
==12842== by 0x4C33D01: pthread_mutex_lock (hg_intercepts.c:917)
==12842== by 0x50ABE82: __pmLock (lock.c:278)
==12842== by 0x506D87B: pmDestroyContext (context.c:1494)
==12842== by 0x5072C01: pmDestroyFetchGroup (fetchgroup.c:1653)
==12842== by 0x109059: thread_fn (multithread10.c:65)
==12842== by 0x4C32A24: mythread_wrapper (hg_intercepts.c:389)
==12842== by 0x4E455C9: start_thread (pthread_create.c:333)
==12842== by 0x53E2EAC: clone (clone.S:109)
==12842==
==12842== followed by a later acquisition of lock at 0x90839D0
==12842== at 0x4C2FE9D: mutex_lock_WRK (hg_intercepts.c:901)
==12842== by 0x4C33D01: pthread_mutex_lock (hg_intercepts.c:917)
==12842== by 0x50ABE82: __pmLock (lock.c:278)
==12842== by 0x506D8C0: pmDestroyContext (context.c:1507)
==12842== by 0x5072C01: pmDestroyFetchGroup (fetchgroup.c:1653)
==12842== by 0x109059: thread_fn (multithread10.c:65)
==12842== by 0x4C32A24: mythread_wrapper (hg_intercepts.c:389)
==12842== by 0x4E455C9: start_thread (pthread_create.c:333)
==12842== by 0x53E2EAC: clone (clone.S:109)
==12842==
[...]
Every such lock order report is a potential deadlock site: two
threads that each take one of the locks and then wait for the other
can block forever. Several such deadlocks have actually been observed
to occur. Every one represents a design flaw.
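
For what it's worth, the usual shape of such a fix is to impose one
global nesting order and make every path follow it. A minimal sketch
of that idea, assuming nothing about PCP internals (the helper below
is invented, not what the fche/multithread branch does):

#include <pthread.h>
#include <stdint.h>

/* Acquire two mutexes in a single global order (here, by address),
 * so every caller nests them the same way and this pair can never
 * produce a lock-order report or the deadlock it predicts. */
static void lock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
{
    if ((uintptr_t)m1 > (uintptr_t)m2) {
        pthread_mutex_t *tmp = m1;
        m1 = m2;
        m2 = tmp;
    }
    pthread_mutex_lock(m1);
    if (m2 != m1)
        pthread_mutex_lock(m2);
}

static void unlock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
{
    pthread_mutex_unlock(m1);
    if (m2 != m1)
        pthread_mutex_unlock(m2);
}

The same effect can come from a documented lock hierarchy; either
way, the point is that the ordering decision is made once rather than
at each call site.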
So, would this be our game plan?
> addressing a demonstrable problem
-> helgrind error or otherwise triggered deadlock/crash
> failing (possibly new) qa, or performance issue
-> qa code already exists, just invoke repeatedly or heavily or under
monitoring
> reviewable fix
-> as per fche/multithread branch (just rebased)
> qa now passing or performance improved
-> one fewer helgrind error
- FChE