Hi Marko,
----- Original Message -----
> Hi,
>
> On 2016-03-03 20:42, Marko Myllynen wrote:
> >
> > PCP has very complete coverage for system and supporting applications /
> > infrastructure metrics (like containers, 389 Directory Server, KVM,
> > Oracle, PostgreSQL, etc.) but there are lots of places where Java
> > performance metrics would be essential to have in the mix as well.
+1 ... it's high time we tackled this area, so thanks for kick-starting
a new effort in this direction, Marko. Apologies up front that my reply
has turned into an essay too, like your earlier mail, and taken a while.
It's an inherently difficult topic I think, which is probably why no one
has solved it well yet!
> > https://myllynen.fedorapeople.org/pcp-jmx/
>
Here are my thoughts so far. I agree JMX is worthwhile as the starting
point for extracting stats from a running Java process, but from that
point we start diverging I think, particularly around how to go about
accessing those JMX values and how to present the PCP metrics. Similar
goals though, so I'm sure we can find common ground here.
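
For anyone who hasn't poked at JMX directly, here is a minimal sketch of
what "extracting stats from a running Java process" looks like from inside
that process. It uses only the standard platform MBean server; the class
name JmxPeek and the helper read() are made up for illustration and are not
part of pmdajmx or Parfait.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Hypothetical helper class, for illustration only.
final class JmxPeek {
    // Read one attribute from the in-process platform MBean server.
    static Object read(String mbean, String attribute) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            return server.getAttribute(new ObjectName(mbean), attribute);
        } catch (JMException e) {
            throw new IllegalStateException("cannot read " + mbean + " " + attribute, e);
        }
    }

    public static void main(String[] args) {
        // ThreadCount is a simple numeric attribute of the standard Threading MXBean.
        int threads = (Integer) read("java.lang:type=Threading", "ThreadCount");
        // HeapMemoryUsage is a composite value; "used" is one of its keys.
        CompositeData heap = (CompositeData) read("java.lang:type=Memory", "HeapMemoryUsage");
        long used = (Long) heap.get("used");
        System.out.println("threads=" + threads + " heap.used=" + used);
    }
}
```

Note that JMX hands back a type and a value but nothing like PCP units or
semantics, which is where the metadata discussion further down comes from.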
The approach pmdajmx has taken has some drawbacks to my eye (that eye
being jaded^Wfiltered based on experience doing several years of Java
analysis in a previous life) ...
- Running a separate Java process (in addition to a separate perl PMDA)
is a relatively complex architecture.

pmdajmx.perl <-> PCPJMXConnector.java <-...-> multiple-java-apps

It causes a need for some fancy footwork on the part of pmdajmx, to
dodge the intermittent high latencies in socket-based communication
between the java processes, PCPJMXConnector.java, and the perl PMDA.
(To put this in context, this is all more complex than any other PMDA
we have in PCP today, except perhaps for pmdajson.)
- Stop-the-world GC activity at unfortunate times anywhere to the right
of ".perl", above, is a major latency problem that has to be handled.
i.e. all processes in this design are using threads to try to hide
that latency, or rather the potential for latency once in a while.
- Threads are inherently more complex than not using threads. :)
Threads in both pmdajmx and its java helper are ... a lot of threads.
- That separate multiplexing java process has a fairly large footprint
in terms of memory utilisation (in Java 8, approx 80-100MB is steady
state RSS). We can improve that via non-default command line options
and/or properties file settings, etc.
However that introduces java implementation and version dependencies
in pmdajmx/PCP, and it's always likely to consume more memory than the
rest of a PCP collector no matter how much it's tweaked, unfortunately.
That is bad, and reflects poorly on PCP (i.e. gives something for the
nay-sayers to point to and say - "see PCP eats all your memory").
(the above are all external architecture kinds of things - next, inside
the PMDA...)
- The model used, mapping one metric to one JMX value, is not ideal --
if modelled more "ideally", many of these values would be in just one
metric (since all the metadata is the same) using the PCP instances
to represent set-values. But, the instance domain is "taken up" by
the target java processes. This burned pmdajstat many years ago when
it took the same model. It also falls down a bit when different JVM
versions have different semantics for similar metrics. :(
- In the original pmdajmx code there was basically no PCP metadata at
all (since JMX provides only pmDesc.type for us). Since then, I see
you took on that issue by starting to add PCP metadata for individual
JMX values in %semantics, %encoding, perl hashes (needs %helptext?)
The plan here being we'd have Java programmers adding JMX values into
their Java code, then updating the Perl code in PCP (or more likely,
someone else who knows about PCP doing it on their behalf).
A solution where Java programmers update Java code in a real Java
project would be more likely to succeed I think (but dunno for sure).
- That external tools.jar dependency is unfortunate for some users; all
external dependencies cause pain for users and pain for PCP developers
getting requests for help when bits aren't installed. (minor issue,
we like talking to users really - but not everyone will follow up).
- pmdajmx makes all the add_metrics() calls in its mainline - there is
no capability to add metrics on the fly. This is not pmdajmx's fault
as there are assumptions being made in the perl PMDA API that prevent
this (the perl wrapper predates dynamic metrics!). When David Smith
wrote pmdajson, this also wasn't foreseen, and a ton of extra effort
had to be made to extend the python PMDA wrapper. (major issue)
- Also similar to early pmdajson, nothing is done to provide stability
of PMIDs, so logging jmx.* metrics is going to explode (pmlogrewrite).
- Current PCPJMXConnector.java code is >1000 lines of code already, and
it's only two weeks old. ;)
To explore what I think will help tackle these problem areas while still
meeting all our Java needs, I started hacking on Parfait a bit. This is
now showing signs of life, so I'm keen to share the early results. It
uses a -javaagent .jar approach, where the JMX (and other) values are
accessed directly by the agent (which is runtime-loaded into each
application, via command line option / properties, a la NewRelic & most
other Java instrumentation tools of PCP's ilk).
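
To make the agent idea concrete, here is a rough sketch of the shape of any
Java agent, using the standard java.lang.instrument entry point. It is not
the parfait-agent code; SketchAgent, sample() and the jar name in the
comments are invented for illustration. The point is only that the code
runs inside the monitored JVM, so it can read JMX values locally with no
socket hop and no change to the application source.

```java
import java.lang.instrument.Instrumentation;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Hypothetical agent class, for illustration only (not Parfait's agent).
final class SketchAgent {
    // The JVM calls premain() before the application's own main() when started as
    //   java -javaagent:sketch-agent.jar -jar app.jar
    // (jar name is an assumption). The agent jar's manifest must carry a
    // "Premain-Class: SketchAgent" entry so the JVM can find this method.
    public static void premain(String agentArgs, Instrumentation inst) {
        // A real agent would arrange to keep exporting values; this sketch
        // just takes one sample to show that in-process access works.
        System.out.println(sample());
    }

    // Read a value straight from the platform MXBean inside this JVM.
    static String sample() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return "heap.used=" + memory.getHeapMemoryUsage().getUsed();
    }
}
```

Since the agent is attached purely via the command line, the application
itself needs no code changes.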
There's a lot to be done still, but here are the design goals it's meeting:
- no Java source code modifications
- isolation from individual Java processes going off with the pixies -
as they so often do. Any process in stop-the-world GC, or swapping,
or otherwise misbehaving, should not be able to affect any other
process, including the pmda/pmcd. And no use of threads to try to
hide latency, nor failing to satisfy requests for current values.
- use the existing proven code in both PCP and Parfait, rather than
starting from scratch with a Java-specific addition to PCP source.
- leverage the existing Parfait JMX extraction code instead of writing
it again from scratch and putting it in PCP :P
- eventually allow PCP maintainers to focus on the core PCP components,
and Java gurus to focus on the Java components in a real Java project
(i.e. maven, not autoconf/make)
- allow PCP maintainers to improve *one* core PCP component (pmdammv),
which benefits multiple languages (i.e. not just for Java).
- allow arbitrary modelling of metric names, instances, and allow for
correct PCP metric metadata. All via configuration files, no code.
- allow for more than just JMX as the source of our metric values
- allow for better PCP metric modelling - using instances for set-values
and not forcing a transformation based on JMX names.
So, it aims to tackle all of the pmdajmx areas-for-improvement I listed
above, without adding any new code in PCP. And, it turns out, with very
little new code in Parfait too (~90 lines of Java code so far).
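
As an illustration of the "instances for set-values" goal in the list
above, the JVM's garbage collector MXBeans are a natural example: every
collector reports the same kind of counter, so one metric with one instance
per collector fits better than one metric per JMX name. This is only a
sketch; the metric name jvm.gc.count and the class GcInstances are
assumptions made up for the example, not names used by Parfait or PCP.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class, for illustration only.
final class GcInstances {
    // One metric, many instances: collector name -> collection count.
    // getCollectionCount() returns -1 when a collector does not report a count.
    static Map<String, Long> collectionCounts() {
        Map<String, Long> instances = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            instances.put(gc.getName(), gc.getCollectionCount());
        }
        return instances;
    }

    public static void main(String[] args) {
        // Print in a "metric[instance] value" style; collector names vary by JVM and GC.
        collectionCounts().forEach((name, count) ->
            System.out.println("jvm.gc.count[\"" + name + "\"] " + count));
    }
}
```

The set of instance names depends on the JVM version and the collector in
use, which is the sort of variation a config-driven mapping has to absorb.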
As I'd hoped, it turned out Parfait does 99% of what we need and quite
efficiently. pmcd timeouts simply cannot happen, and the applications
continue to export correct values under stop-the-world GC conditions -
parfait-agent has a lot of potential I think.
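
The reason a stalled application cannot hold up the collector comes down
to the memory-mapped file handoff that pmdammv relies on: the application
stores values into a mapped file, and the reader just looks at the file.
The sketch below shows that general pattern with plain java.nio. It is NOT
the MMV file format and not Parfait code; MappedCounter, its methods and
the single 8-byte layout are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical class, for illustration only (toy layout: one long at offset 0).
final class MappedCounter {
    // Application side: map the file once, then publish with buf.putLong(0, value).
    // The mapping stays valid after the channel is closed.
    static MappedByteBuffer map(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, Long.BYTES);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Collector side: read whatever value was last stored, without
    // contacting (or waiting on) the process that wrote it.
    static long readLatest(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, Long.BYTES).getLong(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the writer is paused in GC, the reader still gets the last stored value
straight away; there is no request/response round trip to time out.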
> [...] it's now possible to get hundreds
> or even thousands of attributes/metrics from these and other apps to
> PCP. However, one should be careful to adjust querying interval and
> especially filtering to match the requirements, fetching thousands of
> metrics from several apps every 10 s is not going to end up well.
>
> The code should be ready for testing and review now, I'm not planning to
> do further changes before some concrete feedback here.
>
Have a browse through my Parfait tree - I'll send a pointer to it out
shortly. Keep in mind it's early days, there's plenty more to be done,
and I am on the lookout for more helpers with Java experience. :) I'll
also be away after tomorrow so hopefully it will be finished by the time
I get back online.
A big shout out to TallPaul Smith for all his Parfait help, without which
parfait-agent would not have gotten this far!
cheers.
--
Nathan