PCP Developers Meeting Minutes

To: PCP <pcp@xxxxxxxxxxx>
Subject: PCP Developers Meeting Minutes
From: Nathan Scott <nathans@xxxxxxxxxx>
Date: Thu, 9 Oct 2014 01:29:38 -0400 (EDT)
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <1337819626.64674757.1412831161274.JavaMail.zimbra@xxxxxxxxxx>
Reply-to: Nathan Scott <nathans@xxxxxxxxxx>
Thread-index: Y1Xbq2oyn59Xh+yICSEZkFqs54Rs0A==
Thread-topic: PCP Developers Meeting Minutes
Hi all,

Here's a collation of notes that Mark and I made from today's call.
I've stuck with the first-person style that Mark used - if I have
missed or misrepresented anything, please send an update; thanks!


- Welcome
   - Any general news?

Nathan: python3 packaging is available from latest dev now, please
     kick the tires and let me know how it goes.

   - Website revamp request/options [nathans, michele]

Nathan (paraphrasing Michele): would be good to update the website,
    with a more modern look
Nathan: any takers or anyone know UI folks inside RH, please contact me
    (website is all committed in a git tree, easily updated - so mostly
    needs folks with more/betterer UI skills than myself).

   - Easy hacks reference, à la LibreOffice [nathans, michele]
     https://wiki.documentfoundation.org/Development/Easy_Hacks

Nathan: overview of what "easy hacks" is on other projects.
Frank: suggest simply pointing people toward bugzilla may be best.
Ken: can be good to have an environment without the possibility for
     feeling foolish in public, for not knowing stuff about things.
Mark: newbie syndrome, 1:1 mentoring isn't very productive
Mark: need to identify some projects and get these guys involved
     (e.g. some RH GSS folks)
Nathan: anyone have potential projects?  small projects - possibly things
     like PCP variants of other simple tools, like vmstat, mpstat, etc.
Nathan: if anyone else has suggestions and is prepared to act as a mentor,
      please send me details and I'll start up a page.


- Quality assurance
   - Status [kenj, nathans]

Ken: recent daylight savings change causing some problems
     2% failure rate at the moment (1% is typical)
     approx 10 failures on all machines out of 700 or so tests
     fixes can take 4 days to make their way thru (run/scheduling delays)
     time conversion issues on some platforms (getdate stuff)
Nathan: ACK, also seeing that intermittently, ditch unstable times around
     the epoch in testing?  (not useful)
Ken: agree.  Bit more noise than couple weeks back, at the moment
Nathan/Ken: permission checking stuff - needs some tweaking - warning from
     test 000.
Nathan/Ken: we're testing predominantly upgrades (not new installs)
Nathan: RH QE always do a fresh install - packages tend to be closer to our
     product releases than what we'd be doing upstream of course.
Nathan: they run sanity group as smoke test (expect 100% pass), and then run
     full suite in local mode, but no remote tests.
Ken: good to know
Ken: remote hosts are non-deterministic (like it this way, wants it passing
     as is!)
Nathan: starting to look into scripts for QA farm setup, ran into hard-coded
     hosts
Ken: qahosts.master, need to think about how to make this go away somehow
      (infrastructure)
Frank: wcohen testing large host counts, not wanting to expose host details
Ken/Nathan: no valuable host details being exposed (local setups only, for
     convenience), easy to externalise this config if others want to start
     using it and are concerned.
Ken/Nathan: for remote testing it's good for talking to non-homogeneous and
     down-revision-pcp remote hosts
Ken: grundy.sgi.com needs to be expunged (and all old sgi.com references)
Nathan: most unused hosts are gone, after recent cleanup - just grundy left.


   - PCP culture and Zen of QA [nathans]

Nathan: increasingly, experienced people are not writing QA tests - phrase
     "hand-testing" is becoming frequent.
Nathan: Testing by hand isn't really good enough - we need automated regression
     tests - usually just shell scripts doing that same testing anyway!
Nathan: pmdapapi case, where contrary advice given to not write tests just prior
     to release and focus on coding instead; caused me pain, had to drop things.
Frank: there was more to it than that, code not in state fit for release anyway.
Nathan: looked fine to me, test I ended up writing worked (and exposed issues,
     but they were readily resolved and test passed prior to release).
Frank: disagree.
Nathan: another example - untested pmwebd back-compat (http://goo.gl/kdMJBI)
Nathan: these are not isolated incidents.
Nathan: am I being too harsh here?
Ken: Not being harsh enough!  Also unhappy with those who are perfectly capable
     of producing good test cases just reporting bugs and moving on, not really
     making an appropriate level of effort to help others with fixing.
Nathan: PCP testsuite has been invaluable over the past decade, especially for
     Dave/Ken/myself doing deep surgery in core areas (IPv6, NSS, auth, etc)
     which potentially affects so many areas of the code base.
Nathan: generally encourage people to always write tests, please.
Frank: sometimes hand-testing is the only way, e.g. pmcd and pmda under valgrind
Ken: dbpmda?  Generally - "hand testing" when a regression test could be written
     isn't good enough for experienced PCP developers.
Ken: QA needs constant care and feeding, QA tests an intimate part of the code
Ken: histogram of people making qa/ contributions to illustrate frustration: Ken
     and Nathan [many hundreds each] ... Dave [~100] then Frank [~30], then it's
     onesies-and-twosies all the way down.
Nathan: running QA is a good way to learn and identify issues - makes for a
     valuable contribution to the project and your own learning as to what makes
     a good test, how others have tackled different aspects of testing, etc.
Ken: would be happier if there was more QA activity in general
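
The contribution histogram Ken describes can be approximated from git
history.  A minimal sketch, assuming a local clone of the source tree and
that one commit touching qa/ counts as one contribution (the minutes do not
say how the figures were produced; the function names here are illustrative
only):

```python
import subprocess
from collections import Counter


def count_authors(log_output):
    """Tally one contribution per non-empty line of author names."""
    names = (line.strip() for line in log_output.splitlines())
    return Counter(name for name in names if name)


def qa_author_histogram(repo_dir, path="qa/"):
    """Run `git log` restricted to `path` and count commits per author."""
    result = subprocess.run(
        ["git", "log", "--format=%an", "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return count_authors(result.stdout)


def format_histogram(counts, width=40):
    """Render counts as text bars scaled to the largest contributor."""
    if not counts:
        return ""
    top = max(counts.values())
    lines = []
    for name, n in counts.most_common():
        bar = "#" * max(1, n * width // top)
        lines.append(f"{name:<12} {n:>5} {bar}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumes the current directory is a git checkout containing qa/.
    print(format_histogram(qa_author_histogram(".")))
```

Counting by author name is a simplification: the same person committing
under two spellings of their name would show up as two rows.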


   - Non-intrusive QA testing for pcp [nathans, fche, kenj]

Nathan: non-invasive testing - e.g. running along-side a production pmcd, using
     freshly built (not installed) PCP bits.
Ken: what improvement will this make?  Testing was designed to run on dedicated
     systems, and assumption has been baked in for many many years now
Frank: making it easy to run on build systems, no dedicated systems needed
Nathan: full QA run takes two hours, impractical to add to builds, enough value?
Frank: making it easier to set up and run could increase QA activity
Ken: there are small parts of QA that could be tested in an isolated environment
     e.g. perhaps libpcp related stuff (kenj+fche to chat more)
Nathan: core pcp testing uses dedicated infrastructure by design, prefer a
     layered products approach to introduce new testing styles (Java, Web)
     as a pragmatic approach.
Mark: discussion around when to write a new test vs modify existing test
Ken: can use the "reserved" tag in group file or just add to the test and let it
     fail until the fix is checked-in
Nathan: latter is just fine, helps around release time to know issue is unfixed;
     points out this has not happened so far (i.e. that qa changes only are sent
     in with no accompanying fix) - but welcomes it.
Ken: yes please!
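
The "add the test and let it fail until the fix is checked in" idea rests on
comparing a command's output against a recorded expected output.  A minimal
sketch of that comparison, assuming a plain stdout-versus-file check; this
is not the PCP QA harness itself, and the file name 001.out below is
hypothetical:

```python
import difflib
import subprocess
import sys
import tempfile
from pathlib import Path


def run_regression(command, expected_file):
    """Run `command` and diff its stdout against the recorded output.

    Returns a list of unified-diff lines; an empty list means the test passed.
    """
    actual = subprocess.run(command, capture_output=True, text=True).stdout
    expected = Path(expected_file).read_text()
    return list(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile=str(expected_file), tofile="actual", lineterm="",
    ))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        expected = Path(tmp) / "001.out"      # hypothetical expected output
        expected.write_text("hello\n")
        diff = run_regression([sys.executable, "-c", "print('hello')"], expected)
        print("pass" if not diff else "\n".join(diff))
```

A test written before its fix simply produces a non-empty diff on every run
until the fix lands, which is what makes the unfixed issue visible around
release time.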


- Source trees
   - Code encumbrance concerns, a history [nathans]

Nathan: overview of personal experiences in the XFS encumbrance review
     process (circa 2000) - noted surprise at level of code copying that
     had happened, unexpectedly (legally of course).
     - it's built in to programmers to not reinvent, and they'll use what
     is available to them (or what they've written before).  which is OK
     as long as cross-contamination cannot happen - but it can and does
     in some situations, from this experience.
Nathan: discussed the later litigation that unexpectedly followed a few
     years later against the Linux kernel - companies with developers and
     large end users of the kernel (Red Hat & other distros' customers were
     targeted), noted effects on Linux kernel processes (sign off) that
     were allowing legally questionable contributions.
Nathan: concerns about volumes of code being requested pushed into core PCP
     that we didn't author and know nothing about, esp with the potential
     legal risks associated with that.
Ken: there are boundaries we should not cross with included code for the
     core PCP product

   - Issues with separate trees, pcp web status [fche, nathans]

Nathan: gives interpretation of state of web tree and how it got there, and
     what I believe are the many good reasons for separating core from web
     functionality.
Frank: re encumbrance, it's all totally irrelevant to the issue at hand.
Nathan: believe it's a genuine concern, was initial case for refactoring
     before many of the other good reasons came to light.
Frank: disagree, it's all nonsense, and we could've fixed this one problem
     in other ways
Frank: gives interpretation of state of web tree and how it got there, and
     questions code that has been in PCP for years now moving.
Nathan: "years"?  nothing here earlier than this year.
Nathan: generally disagree with your disagree.  Complex issues, many factors.
Frank: disagree.
Mark: talked with RH middleware guys, webasset bundling is the way javascript
     infrastructure is shipped
Mark: cannot be solved with install/build deps (maybe one day, but not today)
Nathan: agreed - AIUI that's why Frank went that way - doesn't need to be in
     core PCP sources though - release separately.
Frank: graphite and jsquery are mature enough that encumbrance is not an issue
Frank: inconceivable that PCP will get blasted for licensing issues, especially
     with apache or MIT license
Nathan: not simply the license that's the issue - developers in some 3rd party
     stuff also work on proprietary code (see extjs).  Pollution can happen.
     And it is not a valid legal argument to say "others do it, so it's fine".
Frank: that is a risk for that company, not the end-user.
Nathan: it is also a risk for the end users (and developers), that was one of
     the points of the earlier discussion.
Frank: the licensing and bundling issues are separate issues to the code/tree
     separation of pmwebd.
Frank: handling webassets could have been solved without as much "trauma"
Nathan: "trauma" .. is an over-reaction, why is this approach traumatic?
     seems a helpful approach to solving a complex array of issues.
Frank: disagree that it is not a trauma.  The question of why not to do this
     is a separate question from why to do it.
Nathan: much more comfortable with separate tree.  Why the concern?  It's working
     fine, and others take this approach too - solves many problems here.
Frank: what are the benefits of the split?
Ken: This is not helping the PCP project.  Suggest we take this topic off the
     agenda, moderate a resolution in an offline forum.  Resolution will not be
     achieved in a public forum.
Nathan: sure, either way - thanks for the offer.
Frank: prefer to have the community decide something now.
Nathan: points toward the lack of interest on the list too, someone had to
     make a decision and so moved forward with what is the most mutually
     agreeable path.
Frank: disagree.
Nathan/Frank/Ken: agree to take this offline with Ken moderating on how to move
     the web tree forward.


   - discussion around the maintenance role [fche]

Frank: discussed spreading maintenance aspects amongst more people in general
Ken: are there enough people involved in the project to warrant spreading the
     maintenance role?
Frank: yes you are quite right - enough bodies is not a requirement - some of
     these jobs need more than one person involved. constitutional questions -
     policy decisions, consensus type manner driving gatekeeper role
Ken: agree .. start with some uncontentious things, guide behaviour and policy
     decisions
Nathan: have been trying to spread the load a bit, e.g. asking for people to do
     code reviews.  no requirement for me to do all code reviews, e.g. Marco's
     stuff recently - anyone could have helped review & help test.  no one did.
Ken: if you're willing to contribute code then you should be willing to review
     other people's code
Nathan: discussed the libcontainer project source tree, which explicitly splits
     "maintainer" and "contributor" roles
     (https://github.com/docker/libcontainer)
Nathan: we effectively have two "maintainers" for PCP at the moment (myself &
     Ken).  Keen to add more people - if they are experienced, reviewing,
     testing, fixing bugs, doing the mundane maintenance stuff.  Other folks
     are "contributors" - also doing highly valuable work, but the roles and
     responsibilities differ a great deal.


  - WIP / future work
    - local-context pmlogger [mgoodwin]

Mark: many enterprise sites won't install anything - breaks their qualification
     so aim for minimal logger service, installed and enabled by default
     some sites will install, if their perf issues are bad enough (had a few
     recently)
Mark: minimize daemons/services required for a minimal logger deployment.
Mark: no pmcd daemon - pmlogger service only, in local context mode
Mark: do not listen on any internet domain sockets (unix domain only or nothing
     for pmlc)
Mark: local context mode only supports DSO PMDAs
Mark: only really need linux_pmda, proc and pmcd PMDAs, others would be good too
     though
     - pmcd PMDA uses extern symbols from pmcd itself, needs fixing
     - some other PMDAs don't have DSO versions - alternate uid, need args, etc
     - PMDA_INTERFACE version that supports args to foo_init() or foo_args()
Nathan: to tackle the elevated privilege aspects, we could replace the pmsocks
     field in control file, rev version of control file - other notes above.
     (and deprecate pmsocks - it's ancient and bit-rotting)
Ken: this is doable
Mark: will make time to work on it some more

    - multi-archive libpcp support [brolley]

Dave: pushing along .... as per email list discussions .. surprised this was
     not there to start with!  Early days.

    - generic json pmda, python pmda module work [fche]

Frank: progressing well, will use split metadata/values json files.
Frank: current work is extending the python PMDA wrapper with more of the C
     API for adding metrics in a more dynamic fashion.
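
One way to picture the split metadata/values layout Frank mentions: a
rarely-changing file describes each metric, and a frequently-rewritten file
carries only the current values.  A minimal sketch, assuming that reading of
the split; every field name and the document layout below are hypothetical,
not the format the JSON PMDA uses:

```python
import json

# Hypothetical metadata document: describes each metric once.
METADATA_JSON = """
{"metrics": [
  {"name": "requests", "type": "integer", "units": "count"},
  {"name": "latency",  "type": "double",  "units": "millisec"}
]}
"""

# Hypothetical values document: rewritten often by the instrumented program.
VALUES_JSON = '{"requests": 42, "latency": 3.5}'


def load_metrics(metadata_text, values_text):
    """Join metric descriptions with their current values by metric name."""
    metadata = json.loads(metadata_text)
    values = json.loads(values_text)
    merged = {}
    for desc in metadata["metrics"]:
        name = desc["name"]
        merged[name] = {
            "type": desc["type"],
            "units": desc["units"],
            # A metric with no current value is reported as None.
            "value": values.get(name),
        }
    return merged


if __name__ == "__main__":
    for name, info in load_metrics(METADATA_JSON, VALUES_JSON).items():
        print(name, info["value"], info["units"])
```

Keeping the descriptions separate means only the small values document
needs to be re-read on each sample.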

    - smaller packages for containers [nathans]

Nathan: minimal dependencies (pcp-libs only), need a package with pmcd and
     Linux and core PMDAs, reduced footprint, reduce duplication of daemon
     PMDAs and DSO PMDAs. Split the base package.
Ken: would pmlogger be in there?
Nathan: Jeremy wanting a minimal install in a container, so thinking not
     pmlogger.  Discussion about being able to monitor containers from
     outside vs inside.
Frank: people inexperienced with container use cases as yet, proc filesystem
    information leakage an issue - probably only want to monitor from base.
Ken: in the VM case, need a lot more.
Nathan: unclear if we'll need this small pmcd-focussed package then, maybe
     defer?
Frank: think it's still generally useful.
Ken: worth looking into what it'd take to get PMDA installation in there too
     (Install scripts - pminfo/pmprobe, etc - maybe not a huge addition?)

    - simplifying PMAPI access (new APIs, caching) [fche, mgoodwin, kenj?]

Frank: investigating, on-going work. caching possibilities, but not the focus at
     this stage.
Ken/Mark: what version of the "metrics class" is this?  (jokingly, from SGI days
     where this was tackled many times, never quite right for everyone it seems)

    - JVM instrumentation planning [nathans]

Nathan: wanting to instrument middleware more easily, something that can be
     switched on easily without custom code (but supporting that too).  Plan
     to iterate on current Parfait and Metrics, basically.
Mark: will connect nathans with RH middleware guys working in MW support, and
     also dealing with kernel issues (when JVM goes berserk).
Nathan: thanks, agreed - sweet spot for PCP if we can progress JVM side.

    - ... any other hacking in progress?
  - anyone need help with anything?

Paul: working on ensuring coverage for RH supported filesystems (GFS2, ext[3-4],
     XFS, NFS, CEPH/block storage, Samba, CIFS instrumentation)
Nathan: Great to hear!  Mark & I hacking in device-mapper instrumentation area
     recently - if that comes under this banner we can certainly help.
Nathan: Good coverage on XFS, extN instrumentation (mainly JBD2, not much
     else) - CIFS kernel code available but we're not exporting it yet - so
     definitely work needed there.
Ken: discussion around working with Tridge years ago to add instrumentation into
     Samba, for PCP (with PMDA), which is available if Samba is built with it
     switched on.

  - any other topics?

All done for today - thanks all!


cheers.

--
Nathan
