
To: Ken McDonell <kenj@xxxxxxxxxxxxxxxx>
Subject: Possible multithreading problems, Was Re: pmcd gives up on slow starting Perl PMDA
From: fche@xxxxxxxxxx (Frank Ch. Eigler)
Date: Mon, 24 Mar 2014 13:07:36 -0400
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <532F5BAA.1070507@xxxxxxxxxxxxxxxx> (Ken McDonell's message of "Mon, 24 Mar 2014 09:09:46 +1100")
References: <532C975F.4020808@xxxxxxxxxxx> <001701cf4594$8a193d60$9e4bb820$@internode.on.net> <y0msiq9396s.fsf@xxxxxxxx> <532F5BAA.1070507@xxxxxxxxxxxxxxxx>
User-agent: Gnus/5.1008 (Gnus v5.10.8) Emacs/21.4 (gnu/linux)
kenj wrote:

> [...]
>> ... (Please note that the robustness of
>> multithreading support in libpcp* may not be as good as its sole user
>> (pcpqa) may lead one to believe; I recall seeing some problems in an
>> earlier review.  These bugs might crawl out from under the rocks if we
>> kick them over.)
>
> Can you point me to that review?

Sorry, I only noted some in passing (on IRC).  Just going for a random
drive-by in libpcp/src ...

- access.c, getClientIds(), a mismatched PM_UNLOCK in the loop around
  line 1295.  (By the way, the 'myhostid' variable is probably a
  symptom of more inappropriate FQDN assumptions that will need to be
  fixed.)

- context.c, pmNewContext(): the contexts* and old_*context* variables are
  used unprotected in the FAILED: path.

- context.c, pmDupContext(): quite possibly unsafe if another thread is
  manipulating the oldcon at the same time; by the way, is that
  condition (the same pmUseContext by different threads) detected /
  forbidden / permitted?  The __pmHandleToPtr() ctxp->c_lock design
  sounds fine, but it is not used by all users of the context structure.

- context.c, __pmHandleToPtr(): race condition between the libpcp unlock
  and taking the context lock (what if another thread deletes the
  context during this window?).  (Flipping the lock order could cause
  deadlocks, OTOH; __pmLogFetchInterp already nests the libpcp lock
  within the ctxp lock.)

- interp.c, cache_read(): race condition between PM_UNLOCK and use of
  the lfup pointer; another thread may have nuked that cache[] slot in
  the meantime.  This general pattern recurs in several places in the
  code (look for an UNLOCK followed by a return FOO->BAR, e.g. logmeta.c).

- util.c: uses the libpcp lock for all kinds of printing, possibly
  nesting within context locks -> possible deadlock.

- loop.c: no locking at all



... anyway, this was just about 30 mins' random browsing, not a
systematic audit.  Even with that, a couple of non-trivial problems
showed up.

As difficult as it is to prove the absence of bugs in general, adding
concurrency makes it many times worse.  We should accept that if
functionality is only exercised by small synthetic pcpqa test cases, we
should not trust it too much (see also pmDupContext).

After a more thorough audit & bug-fixing pass, we could seriously
multithread something like pmchart or pmie or pmwebd to get some
confidence.  But even then, latent concurrency bugs won't make
themselves obvious.


- FChE
