kenj wrote:
> [...]
> (a) what's the problem this is trying to address? [...]
Generally: deadlocks & race conditions.
> (b) what's the evidence for this being a real problem? ... here
> "real" means a realistic probability of badness happening in normal
> use cases [...]
At the risk of a "no true Scotsman" argument, normal use cases seem to be
entirely single-threaded in the classic pcp code base. The only
multithreaded pmapi clients are those in qa/*, pmmgr, and pmwebd. (I
believe pmchart runs pmapi single-threaded.) And it is scaled-out
pmmgr and pmwebd installations that highlight the preexisting
problems, with 100% probability.
Thus my June 3 question about the general philosophy of
multithreading. If failures of scaled-out multithreaded programs are
acceptable, then probably we should no longer represent libpcp as
multithread-safe, and should rework pmmgr etc. to single-threadedness.
(I called this "option 4" in the posting.)
> (c) some verifiable before-and-after evidence to show the problem has
> been addressed without side-effects.
This would probably require institutionalizing qa/4751-style
heavier-duty multithreaded testing, plus valgrind/helgrind
monitoring. Even that can only go so far: the tooling can find bugs
far more easily than it can guarantee their absence. A decrease in
the number of tool-identified errors is a tangible improvement,
though.
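For example, something along these lines (the test binary name is
illustrative):

    valgrind --tool=helgrind ./qa-multithread-test

A clean helgrind run over one thread schedule is evidence, not proof,
of course.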
> But before we start, I think we need a discussion and consensus on
> how/if the create/delete context race should be addressed ...
Agreed!
> this relates to the current lock ordering and this snippet from
> __pmHandleToPtr():
>
>     PM_UNLOCK(__pmLock_libpcp);
>     PM_LOCK(sts->c_lock);
Yes, this was one of the original sins of the multithreading work, so
to speak. The race condition is clear: two threads racing
pmNewContext against pmDestroyContext can cause invalid __pmContext
pointers to propagate.
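For illustration, here is a minimal sketch of the window (the names
are simplified stand-ins, not the actual libpcp declarations):

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t c_lock;
        /* ... per-context state ... */
    } context_t;

    static pthread_mutex_t libpcp_lock = PTHREAD_MUTEX_INITIALIZER;
    static context_t **contexts;             /* guarded by libpcp_lock */

    context_t *
    handle_to_ptr(int handle)
    {
        pthread_mutex_lock(&libpcp_lock);
        context_t *ctxp = contexts[handle];
        pthread_mutex_unlock(&libpcp_lock);  /* the window opens here */
        /* another thread may now pmDestroyContext(handle) and free
           *ctxp before we manage to take its lock ... */
        pthread_mutex_lock(&ctxp->c_lock);   /* potential use-after-free */
        return ctxp;
    }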
> I think this is a "how" not "if" question, but the best solution is
> not obvious (at least to me).
I guess it depends on what one means by "best".
The most correct option, from a local concurrency/safety point of
view, is to flip those two operations, as commit b2208f06b4e does,
then work through all the lock-ordering problems this exposes, one by
one. This is how pcpfans.git fche/multithread was built up.
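In sketch form, reusing the stand-in declarations from above, the
flip closes the window by taking the per-context lock while still
holding the global one:

    context_t *
    handle_to_ptr_flipped(int handle)
    {
        pthread_mutex_lock(&libpcp_lock);
        context_t *ctxp = contexts[handle];
        pthread_mutex_lock(&ctxp->c_lock);   /* take c_lock first ... */
        pthread_mutex_unlock(&libpcp_lock);  /* ... then drop the big lock */
        return ctxp;
    }

The cost is a nested acquisition: any code path that takes the big
lock while already holding some c_lock now risks deadlock, which is
exactly the set of ordering problems that has to be worked through.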
Another option, the "smallest change" by that measure, was one I
mentioned in private email back on May 23: switch to a different
mutex (not the libpcp Big Lock) to protect only the contexts[]
array. The pcpfans.git fche/multithread2 branch was an experiment in
this direction, but it did not produce instantly satisfactory results
either, so its relative smallness is not yet an apples-to-apples
comparison.
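A sketch of that shape, again with hypothetical names:

    static pthread_mutex_t contexts_lock = PTHREAD_MUTEX_INITIALIZER;

    context_t *
    handle_to_ptr_narrow(int handle)
    {
        pthread_mutex_lock(&contexts_lock);  /* guards contexts[] only */
        context_t *ctxp = contexts[handle];
        pthread_mutex_lock(&ctxp->c_lock);
        pthread_mutex_unlock(&contexts_lock);
        return ctxp;
    }

Because the new mutex guards nothing but the array, its nesting
against the rest of libpcp's locks is much narrower than the big
lock's.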
- FChE