Attendees:
Dave Brolley
Owen Butler
Stan Cox
Chandana De Silva
Ryan Doyle
Frank Eigler
Paul Evans
Mark Goodwin
Brad Hubbard
Ken McDonell
Nathan Scott
Topics covered:
1. Schedules [Nathan]
pcp-3.7.2 end of this week (minor bugfix release)
pcp-gui-1.5.7 end of this week (minor bugfix release)
pcp-3.8.0 in around four weeks - feature release, pulling in a
number of big ticket items: pmwebapi, python packaging rejig,
new python sub-modules (pmsubsys, pmda, mmv), a new pmatop tool,
initial sasl2-based per-user access control, and a kitchen sink
or two.
2. Trees [Nathan]
Discussed whether the current "mini forest of trees" model we're
using to manage source and other revision controlled docs is as
effective as it could be for us. We split out pieces like the
cluster and infiniband pmdas a while ago, but they are now rotting
with no updates and little visibility. Cornel has recently been
looking into updates for pmdacluster; others on the call hadn't even
heard of it but were quite interested. Ken points out difficulties in
testing these pieces, particularly infiniband, with no access to
hardware. Nathan listed other issues with out-of-tree code, like
duplicated/no packaging, duplicated build system, updates to the
build system (like --prefix configure work) not done on those dup
build systems, etc, so the negatives seem to be outweighing the
positives now. In general, people seemed positive to attempting
a merge-back of these pieces at some point.
Talked a bit about how the merge-back of pcpqa was affecting Ken,
the biggest user - we've worked around most/all of the issues that
were causing grief initially; Ken now prefers the in-tree QA model,
Nathan uses both, and others still use just the packages.
Moved onto the pcp / pcp-gui split, which is less clear cut due
to major toolchain differences. However, it wasn't dismissed out
of hand - if this merge is ever attempted, configure magic must
be in-place to allow --without-gui builds, as many QA machines do
not have graphics heads. Some value was identified in this merge:
a combined QA suite could bring improvements (libqmc has an extensive
QA suite, even with custom agents), and sharing build and packaging
systems with core pcp would be helpful.
With the pcp-books tree arriving, it is highly desirable not to have
the same kind of bit-rot happen there, so keeping the books alongside
the code they document was suggested. Initially, merging into pcp-gui
seems a start, as both books and existing pcp-doc tutorials, and
the gui code itself, have image files that could be shared.
Side discussion about the pcpweb tree [Owen/Ryan] - rumblings of
improvements to the pcp web site, modernisation of the look, etc -
would be wonderful if someone takes this on. Two thumbs up.
Another side discussion about housing the trees - oss.sgi.com is the
primary site, while sourceware.org houses many trees already too and,
in the unlikely event of an emergency, could take over the role of
hosting the PCP site. In general, the sources are considered well
backed up by
virtue of so many git trees around the world. Frank pointed out we
could not move the bugzilla service quickly. There were general
wonderings about the relative lack of bugzilla use, and whether we
could just use Red Hat bugzilla hosting if needed. This led into a
discussion about the visibility
of current Fedora bugs - Nathan mentioned the Debian bugs and all
package related mail now goes to the pcp list (used to be private),
and doing the same for Fedora/EPEL updates was generally approved
of - Nathan to investigate making it happen.
3. Amazon Web Services [Chandana]
PCP model of remote loggers which explicitly know about all of the
hosts they need to log is mismatched to the needs of monitoring in
the AWS space. Here, hosts can be spun up and down in relatively
short time spans, and the remote pmlogger "pull" model is not what
is wanted. A "push" model - where the host starts up and begins
broadcasting data (including its hostname) to something listening for
such traffic - is offered by collectd and is better suited.
Discussion about how to tackle similar functionality ensued - the
use of a local pmlogger on each dynamic AWS host agreed as a good
first step and then two approaches discussed. Ken and Chandana
pondered changes to pmlogger to allow arbitrary pluggable backends
via a new API for plugging in anything (e.g. a streaming-to-remote-
AWS-host-listener plugin). This seemed very flexible, and could also be
used to stream PCP data directly into a relational DB as well.
Frank pointed out this results in loss of the ability to run all
of the PCP analytic tools on historic data, however, and suggested
an alternate scheme where we do a better job of allowing the PMAPI
to access PCP logs as they are being written. This would be more
compelling in that it offers potential improvements to existing PCP
tools like pmchart too, which currently do relatively poorly in the
way they manage live/archive transition. Ryan points out the pain
of starting pmchart and having to wait for live sampled data - he'd
prefer the tool to be able to seamlessly fetch history for metrics
selected for live plotting. Would require logging everything all
the time ... not really feasible in terms of overhead, finding a
sampling interval to suit everyone, and so on. But worth keeping
in mind ways to optimise/improve usability for this common request.
4. QA [Ken]
Who's running it, and what failures are they seeing? Visibility is
proving a problem - we don't know whether failures that would/should
affect everyone are being seen by others. Mark/Frank/Nathan talked a
bit about Red Hat
QA on the released bits, but the testing before releases is more
where Ken/Nathan are keen to see improved visibility. Some concern
QA coverage may be decreasing too in recent times.
Continuous integration via Jenkins was suggested as a possible way
to keep tests running for all commits and provide visibility - Owen
and Ryan report it's working well at Aconex, and the somewhat tricky
distributed nature of PCP QA should be manageable. Frank mentioned
also the systemtap approach, and the web frontend used there that
could be used in PCP too with small amounts of tweaking.
5. Diagnostics [Ken]
Concerns raised that recent code changes are not adding sufficient
in-built diagnostics for others to be able to triage problems with
the changed code. Example of initial protocol exchange flags that
Nathan changed recently came up - lack of detail there made secure
sockets problems even more difficult to track down. Suggested we
keep this in mind for all new code, and consider measures to make
newer contributors aware too - perhaps some written guidelines,
pre-commit checklists, and actively looking for this in reviews.
6. pmdasummary [Ken]
Ken asked if others knew of any sites using the summary PMDA, as
he's recently fixed brokenness in the pmie secret-agent mode it
uses which had rendered parts of pmdasummary's functionality useless
(bogus values). The group had no real insight into how many people
might be affected though.
7. pmchart configs [Owen]
Discussion around how to distribute canned pmchart views that the
Aconex folk are using for Elasticsearch and other monitoring. If
they are sent through this week, we'll include them in the pending
pcp-gui update for all to share.
8. Percentiles [Owen]
Discussion around best ways to expose percentile calculations, in
the particular context of the Aconex application (response times),
but also with wider applicability to arbitrary metrics (networking
metrics in particular of interest to Chandana too). Each possible
point of implementing this (application, agent, libpcp, clients)
was covered, with the pros and cons of each tossed about. Ken believes the
"right" place to put this, to solve generically, is on the client
side and an extension to the derived metrics syntax would be quite
doable for a generic solution (extending the rate conversion model
which already needs to keep two samples worth of data).
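For context on that suggestion, PCP's derived metrics (see pmRegisterDerived(3)) already keep two samples' worth of data for delta()-style expressions; a client-side percentile feature might hypothetically extend that syntax to a window of samples. The metric names below are illustrative, and the percentile() operator is invented - it does not exist:

```
# existing derived metric syntax: change between two consecutive
# samples, supported today via delta()
my.disk.read.change = delta(disk.all.read)

# hypothetical extension discussed: a percentile over a window of
# recent samples (invented syntax, not implemented)
my.app.response.p95 = percentile(aconex.response_time, 95)
```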
9. Command line options [Mark]
The back-compatibility policy for command line options was raised
yet again, with the -h (hostname) option coming up once more and
the possibility of switching away from -h discussed. Generally,
back-compat seems to (continue to) be preferred, so no change for
the foreseeable future. Ryan raised the option of special-casing
-h parsing so that the common case of a single -h (help) argument,
which is well known and expected elsewhere, could be handled more
cleanly. Also,
the use of long options discussed a bit ... generally this seems to
be considered worth doing. Fair bit of code to update though.
10. Qt markup language [Mark + Brad]
Mark and Brad gave an overview of QML and discussed potential use of
this in pcp-gui tools. There was no push back, but Ken discussed
the state of pmview, which is well along the path to completion -
scene layout in particular, where QML might have fitted well, is
already functional. Nathan
generally bemoaned the pmchart config file syntax and parsing code,
wishing for something better - perhaps a scripting API someday.
11. Simplify default setup [Mark]
Topic of making a default install simpler to set up for new users was
broached. Came up for Mark in wider discussions with other Red Hat
customer support people, and pointing to the other tools like sar,
collectl, etc which are basically "chkconfig on && service start" to
get useful results. With PCP we are not providing generally useful
default setups today, which is just plain silly.
Ken points out that of everything discussed today, this one's by far
the easiest and should be a no-brainer - if someone could propose a
more useful set of metrics for /etc/pcp/pmlogger/config.default, it
could happen immediately. I think the action item was with Mark(?)
to peruse his archive set, and come up with something. Nathan also
suggested using the Aconex production configuration as well for some
insight - feel free to ping us if anyone begins looking into this.
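As one data point for that proposal, a richer config.default might look something like the pmlogger(1) config sketch below - the metric selection and interval are purely illustrative, not an agreed set:

```
# sketch only: candidate /etc/pcp/pmlogger/config.default additions
# (metric list and interval illustrative, not an agreed default)
log mandatory on every 60 seconds {
    kernel.all.load
    kernel.all.cpu
    mem.util
    disk.all
    network.interface.total.bytes
}
```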
Similarly, pmie should also be easy, with a good default setup, and
consensus was that it should default to reporting problems detected
into syslog.
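A syslog-reporting default could be a pmie(1) rule along these lines - the metric choice and threshold are illustrative only, not a proposed default:

```
// sketch only: candidate default rule for pmie reporting to syslog
// (metrics and threshold illustrative, not an agreed default)
some_inst ( 100 * filesys.used / filesys.capacity ) > 95
    -> syslog "PCP pmie: filesystem nearly full:" " %i";
```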
That's about it from my recollection and hand-scrawled notes - if I
have overlooked or misrepresented anything, please send reply-to-all
mail with further detail, thanks!
cheers.
--
Nathan