

To: pcp developers <pcp@xxxxxxxxxxx>
Subject: notes from buffalo pcp meetathon
From: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Date: Wed, 29 Oct 2014 16:23:42 -0400
Delivered-to: pcp@xxxxxxxxxxx
User-agent: Mutt/1.4.2.2i
Hi -

The following is a set of notes from meetings held at the SUNY @ Buffalo
CCR team's lovely facilities last Thursday.  Nathan and I were hosted by
Martins Innus, Joe White, Thomas Yearke, and a few other locals.  Thanks
for having us!  Let's get to work on those todos.


- ccr.buffalo.edu background info
  - cluster - scientific & computational research
  - 900-node, 10000-cpu hetero cluster
  - 1-3 PB of storage
  - 5000ish jobs per day
  - serves university & associates & general public too
  - managed by the slurm resource manager; Open XDMoD data warehouse
  - ccr includes hpc support specialists, parallelization experts
  - our contacts mostly software dev for the management tools
  - contrast with very large clusters - here they have lots of small jobs
    - STAMPEDE from texas uses non-pcp collector 'taccstats'
  - cluster os: centos6, puppet managed


- management system
  - uses fancy web front-end and stats data warehouse
  - python scripts ingest data from pcp and/or taccstats (<= sar),
    aggregating across jobs/nodes/projects
  - see also http://xdmod.sourceforge.net/
  - includes running background health-assessment jobs "application kernels"
    that provide canonical histories for trends/anomalies


- pcp history
  - nathans shared the overall timeline
  - pcp went through boom & bust cycles at sgi & aconex (a victim of
    its own success)
  - buffalo.edu a user since 1999, sgi-based source code, heavy on 3d viewage
  - rh plans to rebase frequently in RHEL
    - even faster personal rebases/builds possible via fedora COPR


- broad problem area for pcp to help with
  - need to track resource consumption on a per-job per-node basis
  - this corresponds to short-lived cgroups
  - the existing pcp cgroups pmda seems to reuse pmids during a run -- bug!
  - has other problems too (lack of quoting of cgroup names in the pmns
    causes most cgroups to go missing)
  - full cgroup names from job manager are usefully unique though


- pcp todo item 1: rework cgroups pmda
  - possibly wrapping the cgroup name into a nested indom with
    prefixing/quoting, then using pmdaCache to persist indom codes and
    avoid collisions (see sketch below)
  - quoting cgroup names and making them part of dynamic pmns also
    a possibility
  - need holistic mesh of data about containers both from the host side
    (e.g., cgroup instantaneous resource usage & allocation), and from
    within (e.g., introspective network/filesystem states)
  - buffalo folks don't have recent code progress on this
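  - a rough C sketch of the pmdaCache idea (the helper name and the
    save-on-every-update policy are illustrative, not a design;
    pmdaCacheStore/pmdaCacheOp are the real libpcp_pmda calls):

        #include <pcp/pmapi.h>
        #include <pcp/impl.h>
        #include <pcp/pmda.h>

        /* Hand back a stable instance code for a cgroup name:
         * pmdaCacheStore maps external names to persistent codes,
         * and PMDA_CACHE_SAVE writes the map to disk, so a code is
         * never reused for a different cgroup across restarts. */
        static int
        cgroup_instance(pmInDom indom, const char *cgroup_name)
        {
            int inst = pmdaCacheStore(indom, PMDA_CACHE_ADD,
                                      cgroup_name, NULL);
            if (inst < 0)
                return inst;
            pmdaCacheOp(indom, PMDA_CACHE_SAVE);
            return inst;
        }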


- pcp todo item 2: pmlogger logging proc.* - authentication
  - pmlogger needs to authenticate in order to get at systemwide
    proc.* stats
  - maybe as root; maybe enough to use AF_UNIX / local: / uid-pcp
  - if full root access is needed, then the current pmcd authentication
    machinery needs to be automated & simplified, so out-of-box it works
    (maybe equivalently to local pam?)
  - similarly, configuration of pmlogger to pass that authentication info
    needs to be smooth; possibly requiring tls if authenticating via plaintext


- pcp todo item 3: pmlogger logging proc.* - data quantity
  - it is desirable to log some textual proc.* bits for analysis, like
    a hypothetical (?) proc.psinfo.environ metric, but this is very large
  - the hotproc pmda work is in a way a kludge to work around having too
    much input data
  - if the proc indom changes frequently (as it does), the .meta files
    get large; http://oss.sgi.com/bugzilla/show_bug.cgi?id=1046
  - repeated data values bloat the log file
  - (one sign: archives are highly gzip-compressible, with >>90% of the
    content redundant)
  - pmlogger should get a mode to skip same-as-last-value entries
    (see sketch below)
    - (though this would not reduce query load on the target pmcd)
  - semantics may permit this sort of change without an archive file
    format revision
  - or use libz to compress on the fly & decompress on the fly too
    - ideally retaining compatible .index file-offset format
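  - the skip-same test might look like this (illustrative only; real
    code would live inside pmlogger's volume-writing path):

        #include <string.h>
        #include <pcp/pmapi.h>

        /* 1 if two pmValueSets for one metric carry identical
         * instances and values, so the second could be skipped */
        static int
        same_values(pmValueSet *a, pmValueSet *b)
        {
            int i;
            if (a->numval != b->numval || a->valfmt != b->valfmt)
                return 0;
            for (i = 0; i < a->numval; i++) {
                if (a->vlist[i].inst != b->vlist[i].inst)
                    return 0;
                if (a->valfmt == PM_VAL_INSITU) {
                    if (a->vlist[i].value.lval != b->vlist[i].value.lval)
                        return 0;
                } else if (a->vlist[i].value.pval->vlen !=
                               b->vlist[i].value.pval->vlen ||
                           memcmp(a->vlist[i].value.pval,
                                  b->vlist[i].value.pval,
                                  a->vlist[i].value.pval->vlen) != 0) {
                    return 0;
                }
            }
            return 1;
        }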


- pcp todo item 4: need to log some metrics on demand, oneshot type
  - to match exact moment of job start/stop
  - something like % pmlc log oneshot _metriclist  perhaps?
  - or deploy job-lifetime-matched temporary pmloggers (config sketch below)
    - almost as in % pmlogger --lifetime PID
    - related to earlier idea re. pmval -c CMD in papi context
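  - nb: pmlogger's config language already has a "once" frequency,
    which a job prologue could use for a transient logger (file names
    and metric choices below are made up):

        # jobstart.config -- capture one sample at job launch
        log mandatory on once {
            hinv.ncpu
            kernel.all.uptime
        }

    invoked as something like
    % pmlogger -c jobstart.config -T 10sec /scratch/job1234-start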

  
- pcp todo item 5: more analysis capability
  - if pcp had more native analytics (well in excess of pmdiff), the
    data warehouse aggregations could become secondary and enable
    drilldown to raw but analyzed data
  - example types of calculations: cross-correlations between metrics
    (sketch below)
  - anomaly detection (big area)
    - fche's hobbyhorse: connecting this to pmlogger to govern archiving
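  - for concreteness, the simplest of those: Pearson correlation over
    two equal-length series of sampled/rate-converted metric values
    (plain C, no pcp api involved):

        #include <math.h>

        /* correlation coefficient of metric series x[] and y[] */
        static double
        pearson(const double *x, const double *y, int n)
        {
            double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
            int i;
            for (i = 0; i < n; i++) {
                sx += x[i]; sy += y[i];
                sxx += x[i]*x[i]; syy += y[i]*y[i]; sxy += x[i]*y[i];
            }
            return (sxy - sx*sy/n) /
                   sqrt((sxx - sx*sx/n) * (syy - sy*sy/n));
        }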


- pcp todo item 6: 3d view
  - used because anomaly detection is still manual, via the mk. 1 eyeball
  - old sgi pmview still attractive, especially if considering topology
  - an openscenegraph-based partial reimplementation exists
    - so does a unity 3d/gl-based one, with browser blob etc.
  - maybe have pmwebd / webgl reimplementation


- pcp todo item 7: qa testsuite deployability
  - individual developers' quick testing is already well-established
  - need ability to reproduce something like kenj's elaborate qa farm
  - something like a public qa server tied to git


- pcp todo item 8: docs on pmie best practices
  - would like more documentation/samples on local / centralized /
    archive-targeted ("has this happened before?") usage; example below


- pcp todo item 9: centralized logging with dynamic instances
  - need better support for monitoring instances that come & go
    asynchronously
  - need more awareness of pcp discovery / scanning / pmmgr
  - probably also need an instance-invoked mechanism to notify
    central logger of arrival of new node to log (apart from avahi?),
    or cloud/grid-infrastructure apis for instance enumeration
   

- pcp todo item 10: archive access api across network
  - to let analytics / guis run remotely from archive files
  - already in plans of one flavour of the "grand unified context"
  - pmwebd graphite interface one possible near-term hack for this


- pcp todo item 11: more data importers/exporters
  - taccstats, ganglia importers interesting (LOGIMPORT sketch below)
  - these could allow pcp archives to serve as canonical long-term storage
  - kenj's tutorial deserves wider awareness
  - export to numpy for analytics there
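  - nb: libpcp_import (LOGIMPORT(3), the machinery behind sar2pcp) is
    the natural hook for such importers; a minimal sketch, with the
    metric name, value and timestamp invented:

        #include <pcp/pmapi.h>
        #include <pcp/import.h>

        int
        main(void)
        {
            /* create archive, declare a counter metric, log a sample */
            pmiStart("taccstats-archive", 0);
            pmiSetHostname("node001");
            pmiAddMetric("tacc.cpu.user", PM_ID_NULL, PM_TYPE_U64,
                         PM_INDOM_NULL, PM_SEM_COUNTER,
                         pmiUnits(0, 1, 0, 0, PM_TIME_MSEC, 0));
            pmiPutValue("tacc.cpu.user", NULL, "123456");
            pmiWrite(1414600000, 0);   /* timestamp: secs, usecs */
            pmiEnd();
            return 0;
        }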


- pcp todo item 12: better tolerance of broken archive files
  - pmlogextract reportedly gives up too easily if truncated archives
    are encountered
  - recent pmlogger shouldn't normally generate these, but old files exist
    - maybe fix them with a one-time pm-archive-lint?
  - or just let the archive-consuming tools put up with errors without aborting


- pcp todo item 13: proc pmda timeouts
  - well known problem, hits ccr (and other larger servers) good and hard
  - http://oss.sgi.com/bugzilla/show_bug.cgi?id=1036
  - kenj had drafted work for pmda timeout-detecting thread for fetches
    - still need this!
  - pmie-based auto-pmcd-restarting possible too (rule sketch below),
    though one must run with root/sudo
    - and restarting doesn't guarantee persistence of pmids etc.
  - an elevated-privilege split-pmcd is a possible solution too (it
    could restart pmda processes)
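  - that workaround might look like this (pmcd.agent.status is a real
    metric; the restart command and polling rate are assumptions, and
    restarting all of pmcd is heavier than strictly needed):

        // run under root
        delta = 30 sec;
        some_inst ( pmcd.agent.status != 0 )
            -> shell "service pmcd restart";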


- pcp todo item 14: pmfetch timestamps
  - especially for slightly tardy pmdas, the pmResult timestamp is
    ambiguous: is it from the beginning, end, or middle of the fetch?
  - having too-short inter-fetch timestamp intervals can lead to
    crazy-big rate-converted values
  - maybe just confirm/fix pmcd to return beginning-of-operation timestamps
  - apps can also confirm that the timestamp is reasonable, by comparing
    it to their own clocks around pmFetch() (sketch below)
  - pmlogger could emit diagnostics for violations
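  - the client-side check could be as simple as (illustrative):

        #include <stdio.h>
        #include <sys/time.h>
        #include <pcp/pmapi.h>

        /* bracket pmFetch with local clock readings; complain if the
         * pmResult timestamp falls outside [before, after] */
        static int
        checked_fetch(int numpmid, pmID *pmids, pmResult **rp)
        {
            struct timeval before, after;
            int sts;
            gettimeofday(&before, NULL);
            sts = pmFetch(numpmid, pmids, rp);
            gettimeofday(&after, NULL);
            if (sts >= 0 &&
                (timercmp(&(*rp)->timestamp, &before, <) ||
                 timercmp(&(*rp)->timestamp, &after, >)))
                fprintf(stderr, "suspicious pmResult timestamp\n");
            return sts;
        }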


- pcp todo item 15: per-peer stats
  - sometimes need to be able to measure inter-node traffic stats
  - not just on network interface basis, but per-peer 
    e.g., network.traffic.ipv4.perpeer.in.packets ["addr"]
  - also came up in comparisons to ntop
  - this could generate large indoms
  - potentially generate from userspace (a la tcpdump)
  - or perhaps systemtap (via json / mmv interfaces)


- pcp todo item 16: syscall stats
  - sometimes need to be able to measure per-process per-syscall stats
    e.g., proc.syscall.FOO ["pid"]   or proc.syscall ["pid:FOO"]
  - naturally matrix-valued
  - probably gather via systemtap or similar (sketch below)
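  - an untested systemtap sketch of the gathering side (a pmda / mmv
    bridge would consume this rather than printf):

        global counts
        probe syscall.* { counts[pid(), name] <<< 1 }
        probe timer.s(60) {
            foreach ([p, n] in counts)
                printf("%d %s %d\n", p, n, @count(counts[p, n]))
            delete counts
        }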


- pcp todo item 17: gpfs pmda
  - need it


- pcp todo item 18: simplified pmapi
  - fche already prototyping in C (see contrast below)
  - python pmcc also useful for python clients
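  - for contrast, a single scalar fetch via today's C pmapi takes
    roughly this much ceremony (error checking omitted), which is what
    a simplified layer would collapse into one or two calls:

        #include <stdio.h>
        #include <pcp/pmapi.h>

        int
        main(void)
        {
            char *name = "hinv.ncpu";
            pmID pmid;
            pmDesc desc;
            pmResult *result;
            pmAtomValue atom;

            pmNewContext(PM_CONTEXT_HOST, "local:");
            pmLookupName(1, &name, &pmid);
            pmLookupDesc(pmid, &desc);
            pmFetch(1, &pmid, &result);
            pmExtractValue(result->vset[0]->valfmt,
                           &result->vset[0]->vlist[0],
                           desc.type, &atom, PM_TYPE_U32);
            printf("ncpu = %u\n", atom.ul);
            pmFreeResult(result);
            return 0;
        }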


- pcp todo item 19: todo lists
  - need to collect ideas of small things for students to do
  - need to remember to formally report bugs
  - fche proposes bugzilla as hammer for both nails
    - hm, maybe the above list should be there too
