An inspection of the libpcp/src/AF.c code indicates that it relies on
unsafe mechanisms that can cause heisencrashes. The gist of the
problem is that from within a SIGALRM signal handler, it is not safe
to invoke general libc/application functions. A poorly timed callback
can corrupt e.g. libc malloc/stdio or libpcp internals, or result in
hangs. (These have been observed in the wild, just not necessarily
in the context of pcp.)
Some general references on async-signal safety, which applies to the
whole transitive callchain of signal handlers:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html
https://www.gnu.org/software/libc/manual/html_node/POSIX-Safety-Concepts.html
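
As a rough illustration of what those references permit, here is a minimal sketch of a SIGALRM handler that stays within the async-signal-safe subset: it only sets a volatile sig_atomic_t flag and calls write(2), where the current code uses printf/pmprintf. The names alarm_pending, onalarm_safe and install_safe_handler are hypothetical and do not exist in AF.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical flag: the only program state the handler touches. */
static volatile sig_atomic_t alarm_pending = 0;

/* Handler restricted to async-signal-safe operations. */
static void onalarm_safe(int sig)
{
    static const char msg[] = "alarm\n";
    int saved_errno = errno;    /* write() may modify errno */
    ssize_t n;

    (void)sig;
    alarm_pending = 1;
    /* write(2) is on the POSIX async-signal-safe list; printf(3) is not. */
    n = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)n;
    errno = saved_errno;
}

/* Install the handler; returns 0 on success, -1 on failure. */
static int install_safe_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = onalarm_safe;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGALRM, &sa, NULL);
}
```

Anything beyond this (free, stdio, list surgery) would have to happen later, outside signal context, once the flag is noticed.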
The specific problems include:
AF.c:onalarm()
- calling free (heap operations)
- calling stdio printf/putc (e.g. in pmDebug case)
- pmprintf() and pmflush()
AF.c:enqueue() (called both from onalarm and __pmAFregister):
- doing list manipulation with limited (AFhold) reentrancy controls,
which could be broken if e.g. __pmAFregister is interrupted by
some other signal wherein __pmAF* functions are called.
The lack of documented constraints on __pmAFregister callback
functions in pmaf.3 leads clients to do risky things:
pmlogger.c:run_done_callback
- more stdio
pmlogger.c:vol_switch_callback
- more stdio, including fopen/fclose
- more racy AF manipulation
callback.c:log_callback
- heap operations
- many general libpcp operations
perl/PMDA/PMDA.xs:timer_callback
- calls into the general perl interpreter
It may be possible to trigger some of these problems by registering
numerous, rapidly firing __pmAFregister callbacks whose bodies
perform unsafe operations (heap, stdio). I got a toy
program to show some malloc corruption, but some of the race windows
are short enough that auditing rather than simple tests may be
necessary. (Lengthening some of the race windows by inserting
usleep() here and there might help.)
The longevity of this code testifies that these races & corruption are
infrequent, so fixing the problems is not urgent. One possible
thorough approach for an eventual fix would be to move away from
timers/signal handlers, and manage events/timing at (say) the exit of
PMAPI functions, or with a more formal application-main-loop
mechanism. Some of the race windows could be narrowed by calling
__pmAFblock more aggressively, and/or by adding nesting-depth
counting to AF.c:block.
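
To make the main-loop alternative concrete, here is a minimal sketch in which the signal handler only records that the timer fired, and the registered callbacks run later from ordinary (non-signal) context, where heap and stdio use is safe. It is deliberately simplified: every callback runs on every tick, with none of the per-event scheduling that AF.c does. All names (event_register, event_dispatch, event_install, count_cb, MAX_EVENTS) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

#define MAX_EVENTS 8            /* hypothetical fixed-size table */

typedef void (*event_fn)(void *);

static struct { event_fn fn; void *arg; } events[MAX_EVENTS];
static size_t n_events = 0;
static volatile sig_atomic_t timer_fired = 0;

/* Signal handler: records the tick and nothing else. */
static void on_timer(int sig)
{
    (void)sig;
    timer_fired = 1;
}

/* Install on_timer for SIGALRM; returns 0 on success, -1 on failure. */
static int event_install(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_timer;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGALRM, &sa, NULL);
}

/* Register a callback; returns 0 on success, -1 if the table is full. */
static int event_register(event_fn fn, void *arg)
{
    if (n_events >= MAX_EVENTS)
        return -1;
    events[n_events].fn = fn;
    events[n_events].arg = arg;
    n_events++;
    return 0;
}

/*
 * Call from the application main loop (or on exit from a library call).
 * Runs the callbacks in normal context; returns how many were run.
 */
static int event_dispatch(void)
{
    size_t i;

    if (!timer_fired)
        return 0;
    timer_fired = 0;    /* cleared first: a tick during callbacks is kept */
    for (i = 0; i < n_events; i++)
        events[i].fn(events[i].arg);
    return (int)n_events;
}

/* Example callback: increments the int that arg points to. */
static void count_cb(void *arg)
{
    ++*(int *)arg;
}
```

The trade-off is latency: a callback runs only when the application next reaches event_dispatch(), so a long-running computation delays it, but the callback can then use malloc, stdio and libpcp freely.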