Hi -
I need some advice about how to proceed with my recent work fixing
various multithreading problems within libpcp.
We have been presenting libpcp as multithread-safe, but it just isn't.
A variety of problems (lock ordering -> deadlocks, race conditions)
exist in the source code, some of them acknowledged in black & white
in the code, and some reported as bugs for years. Some of them stay
hidden because the pcpqa tests pass, but that is a false comfort:
those tests barely scratch the surface. pmmgr / pmwebd / qa-4751 can
be much more thorough.
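
To make the lock-ordering hazard concrete, here is a minimal generic
sketch in C. It is not libpcp code: demo_ctx_t and the demo_* helpers
are hypothetical names invented for illustration. It assumes two
objects that each carry their own mutex, and shows one common way to
avoid the classic A-then-B vs. B-then-A deadlock, by always acquiring
the pair in a fixed order (here, lowest address first).

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-context structure; not a libpcp type. */
typedef struct {
    pthread_mutex_t lock;
    int value;
} demo_ctx_t;

/*
 * Acquire both locks in one fixed global order (lowest address first).
 * Two threads calling with swapped arguments then take the locks in
 * the same sequence, so neither can hold one while waiting on the other.
 */
static void demo_lock_pair(demo_ctx_t *a, demo_ctx_t *b)
{
    if (a == b) {                       /* same object: lock only once */
        pthread_mutex_lock(&a->lock);
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {  /* normalise the order */
        demo_ctx_t *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(&a->lock);
    pthread_mutex_lock(&b->lock);
}

/* Release the pair; unlock order does not matter for deadlock avoidance. */
static void demo_unlock_pair(demo_ctx_t *a, demo_ctx_t *b)
{
    pthread_mutex_unlock(&a->lock);
    if (a != b)
        pthread_mutex_unlock(&b->lock);
}

/* Move `amount` from src to dst while holding both locks. */
static void demo_transfer(demo_ctx_t *src, demo_ctx_t *dst, int amount)
{
    assert(src != NULL && dst != NULL);
    demo_lock_pair(src, dst);
    src->value -= amount;
    dst->value += amount;
    demo_unlock_pair(src, dst);
}
```

Without the ordering step, one thread running demo_transfer(&x, &y, ..)
and another running demo_transfer(&y, &x, ..) could each grab its first
lock and then wait forever on the second.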
I've dived into the code somewhat deeply in the last few weeks, and
started some cleanup - see the pcpfans.git fche/multithread branch.
Despite a good showing on pcpqa, the code is being held up. I
appreciate the review comments from the few folks who looked over it,
but those conversations have ground to a halt. Admittedly, the work
is incomplete, but we need to agree on a path in order to justify
expending further effort.
It seems to me that our options are:
0) status quo as of v3.11.2; tolerate hangs etc.
1) roll back even the v3.11.2 context.c changes to v3.11.1; tolerate
hangs and show-stopper pmNewContext performance
2) merge fche/multithread and stop there, handling future bugs as/when
they appear
3) merge or rework libpcp parts of fche/multithread, and continue work
piecemeal; agree now on docs/testing/merging criteria in order to
liberate us from the constraints of preserving idiosyncrasies of the
current code base (e.g., move toward much less sharing of data
between contexts; simpler locking model; conceivable deprecation of
some functionality in multithreaded apps)
4) declare that libpcp is not multithread safe; rearchitect our
various programs without multithreading
Option 3 makes most sense to me: in time, we can have both
thread-safety & high performance. Are y'all ready to discuss further?
- FChE