Hi -
The following is a set of notes from meetings held at the SUNY @ Buffalo
CCR team's lovely facilities last Thursday. Nathan and I were hosted by
Martins Innus, Joe White, Thomas Yearke, and a few other locals. Thanks
for having us! Let's get to work on those todos.
- ccr.buffalo.edu background info
- cluster - scientific & computational research
- 900-node, 10000-cpu hetero cluster
- 1-3 PB of storage
- 5000ish jobs per day
- serves university & associates & general public too
- managed by slurm resource manager; open xdmod data warehouse
- ccr includes hpc support specialists, parallelization experts
- our contacts mostly software dev for the management tools
- contrast with very large clusters - here they have lots of small jobs
- STAMPEDE from texas uses non-pcp collector 'taccstats'
- cluster os: centos6, puppet managed
- management system
- uses fancy web front-end and stats data warehouse
- python scripts ingest data from pcp and/or taccstats (<= sar),
  aggregate across jobs/nodes/projects
- see also http://xdmod.sourceforge.net/
- includes running background health-assessment jobs "application kernels"
that provide canonical histories for trends/anomalies
- pcp history
- nathan shared the overall timeline
- pcp went through boom & bust cycles at sgi & aconex (a victim of
its own success)
- buffalo.edu a user since 1999, sgi-based source code, heavy on 3d viewage
- rh plans to rebase frequently in RHEL
- even faster personal rebases/builds possible via fedora COPR
- broad problem area for pcp to help with
- need to track resource consumption on a per-job per-node basis
- this corresponds to short-lived cgroups
- existing pcp pmda seems to reuse pmids during run -- bug!
- has other problems too (lack of quoting of cgroup names in the
  pmns causes most cgroups to go missing)
- full cgroup names from job manager are usefully unique though
- pcp todo item 1: rework cgroups pmda
- possibly wrapping cgroup name into nested indom with prefixing/quoting,
then using pmdaCache to persist indom codes to avoid collisions
- quoting cgroup names and making them part of dynamic pmns also
a possibility
- need holistic mesh of data about containers both from the host side
(e.g., cgroup instantaneous resource usage & allocation), and from
within (e.g., introspective network/filesystem states)
- buffalo folks don't have recent code progress on this
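For the quoting idea in todo item 1, a reversible ascii-safe encoding could look like the sketch below (illustration only: the real pmda would do this in C, and the 'cg' prefix / hex-escape scheme is invented here, as are the function names):

```python
import re

def pmns_quote(name):
    # letters and digits pass through; every other byte (including '_',
    # '/', '.', '-') becomes '_' plus two hex digits, so any '_' in the
    # output always starts an escape and decoding is unambiguous
    # (ascii cgroup names assumed, which they are in practice)
    quoted = ''.join(c if ('a' <= c <= 'z' or 'A' <= c <= 'Z'
                           or '0' <= c <= '9')
                     else '_%02x' % ord(c)
                     for c in name)
    # a fixed 'cg' prefix keeps the pmns component from starting with a digit
    return 'cg' + quoted

def pmns_unquote(component):
    # inverse: drop the prefix, turn each _XX escape back into its byte
    return re.sub(r'_([0-9a-f]{2})',
                  lambda m: chr(int(m.group(1), 16)),
                  component[2:])
```

Since the quoted names stay unique, a pmdaCache keyed by them could then keep instance codes stable across pmda restarts, avoiding the pmid/indom reuse bug above.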
- pcp todo item 2: pmlogger logging proc.* - authentication
- pmlogger needs to authenticate in order to get at systemwide
proc.* stats
- maybe as root; maybe enough to use AF_UNIX / local: / uid-pcp
- if full root access is needed, then the current pmcd authentication
machinery needs to be automated & simplified, so out-of-box it works
(maybe equivalently to local pam?)
- similarly, configuration of pmlogger to pass that authentication info
needs to be smooth; possibly requiring tls if authenticating via plaintext
- pcp todo item 3: pmlogger logging proc.* - data quantity
- it is desirable to log some textual proc.* bits for analysis, like
a hypothetical (?) proc.psinfo.environ metric, but this is very large
- the hotproc pmda work is in a way a kludge to work around having too
much input data
- if the proc indom changes frequently (as it does), the .meta files
get large; http://oss.sgi.com/bugzilla/show_bug.cgi?id=1046
- repeated data values bloat the log file
- (a sign of this is how gzip-compressible archives are: >>90% reduction
  is common)
- pmlogger should get a mode to skip same-as-last-value entries
- (though this would not reduce query load on target pmcd)
- semantics may permit this sort of change without archive file format
revision
- or use libz to compress on the fly & decompress on the fly too
- ideally retaining compatible .index file-offset format
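The skip-same-as-last-value idea boils down to a per-(metric, instance) filter; a rough python sketch (the sample-tuple shape here is invented for illustration, not a pcp api):

```python
def dedup_samples(samples):
    """Yield only samples whose value differs from the previous value
    seen for the same (metric, instance) key.  Readers reconstruct the
    full series by carrying the last value forward."""
    last = {}
    for stamp, key, value in samples:
        if last.get(key) != value:
            last[key] = value
            yield stamp, key, value
```

As noted, this shrinks archives but does not reduce query load on the target pmcd, since pmlogger still has to fetch every interval to notice a change.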
- pcp todo item 4: need to log some metrics on demand, oneshot type
- to match exact moment of job start/stop
- something like % pmlc log oneshot _metriclist perhaps?
- or deploy job-lifetime-matched temporary pmloggers
- almost as in % pmlogger --lifetime PID
- related to earlier idea re. pmval -c CMD in papi context
- pcp todo item 5: more analysis capability
- if pcp had more native analytics (well in excess of pmdiff), the
data warehouse aggregations could become secondary and enable
drilldown to raw but analyzed data
- example types of calculations: cross-correlations between metrics
- anomaly detection (big area)
- fche's hobbyhorse: connecting this to pmlogger to govern archiving
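For flavour, both example calculations fit in a few lines of pure python (stdlib statistics only; a real analytics layer would presumably vectorize and be far more sophisticated about anomaly detection):

```python
from statistics import mean, pstdev

def metric_correlation(a, b):
    # pearson correlation between two equally-sampled metric series
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (pstdev(a) * pstdev(b))

def anomalies(series, threshold=3.0):
    # naive z-score anomaly detection: indexes of samples more than
    # `threshold` standard deviations from the series mean
    m, s = mean(series), pstdev(series)
    return [i for i, x in enumerate(series) if abs(x - m) > threshold * s]
```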
- pcp todo item 6: 3d view
- used because anomaly detection still manual with mk. 1 eyeball
- old sgi pmview still attractive, especially if considering topology
- an openscenegraph-based partial reimplementation exists
- so does a unity 3d/gl-based one, with browser blob etc.
- maybe have pmwebd / webgl reimplementation
- pcp todo item 7: qa testsuite deployability
- quick testing by individual developers is already well-established
- need ability to reproduce something like kenj's elaborate qa farm
- something like a public qa server tied to git
- pcp todo item 8: docs on pmie best practices
- would like more documentation/samples on local / centralized /
archive-targeted ("has this happened before?") usage
- pcp todo item 9: centralized logging with dynamic instances
- need better support for monitoring instances that come & go
asynchronously
- need more awareness of pcp discovery / scanning / pmmgr
- probably also need an instance-invoked mechanism to notify
central logger of arrival of new node to log (apart from avahi?),
or cloud/grid-infrastructure apis for instance enumeration
- pcp todo item 10: archive access api across network
- to let analytics / guis run remotely from archive files
- already in plans of one flavour of the "grand unified context"
- pmwebd graphite interface one possible near-term hack for this
- pcp todo item 11: more data importers/exporters
- taccstats, ganglia importers interesting
- these could allow pcp archives to be canonical long-term storage
- kenj's tutorial deserves wider awareness
- export to numpy for analytics there
- pcp todo item 12: better tolerance of broken archive files
- pmlogextract reportedly gives up too easily when it encounters
  truncated archives
- recent pmlogger shouldn't normally generate these, but old files exist
- maybe fix them with a one-time pm-archive-lint?
- or just let the archive consuming tools put up with errors without aborting
- pcp todo item 13: proc pmda timeouts
- well known problem, hits ccr (and other larger servers) good and hard
- http://oss.sgi.com/bugzilla/show_bug.cgi?id=1036
- kenj had drafted work for pmda timeout-detecting thread for fetches
- still need this!
- pmie-based auto-pmcd-restarting possible too, though one must run with
root/sudo
- and restarting doesn't guarantee persistency of pmids etc.
- elevated-privileged split-pmcd a possible solution too (it could
  restart pmda processes)
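The shape of the timeout-detecting-thread idea, sketched in python (the real work would live in the C pmda/pmcd code; the names here are invented):

```python
import concurrent.futures as cf
import time

class FetchTimeout(Exception):
    pass

_pool = cf.ThreadPoolExecutor(max_workers=4)

def fetch_with_timeout(fetch_fn, timeout):
    """Run a possibly-hanging per-pmda fetch callback in a worker
    thread; the caller gets an answer or a FetchTimeout within
    `timeout` seconds, so one stuck pmda (e.g. proc on a busy box)
    cannot wedge the whole fetch path."""
    future = _pool.submit(fetch_fn)
    try:
        return future.result(timeout=timeout)
    except cf.TimeoutError:
        raise FetchTimeout('fetch exceeded %gs' % timeout)
```

On timeout the supervisor could then mark the pmda as failed and restart it, rather than restarting all of pmcd.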
- pcp todo item 14: pmfetch timestamps
- especially for slightly tardy pmdas, the pmResult timestamp is
  ambiguous: is it taken at the beginning, end, or middle of the fetch?
- having too-short inter-fetch timestamp intervals can lead to
crazy-big rate-converted values
- maybe just confirm/fix pmcd to return beginning-of-operation timestamps
- apps can also confirm that timestamp is reasonable, by comparing to
their own clocks around pmFetch()
- pmlogger could emit diagnostics for violations
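The client-side check is cheap; a sketch, where fetch_fn stands in for pmFetch and is assumed (for illustration) to return a (timestamp, values) pair:

```python
import time

def checked_fetch(fetch_fn, slack=0.0):
    """Bracket a fetch with local clock reads and report whether the
    result's timestamp falls inside the [before, after] window; `slack`
    allows for clock skew against a remote pmcd."""
    before = time.time()
    stamp, values = fetch_fn()
    after = time.time()
    ok = (before - slack) <= stamp <= (after + slack)
    return values, ok
```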
- pcp todo item 15: per-peer stats
- sometimes need to be able to measure inter-node traffic stats
- not just on network interface basis, but per-peer
e.g., network.traffic.ipv4.perpeer.in.packets ["addr"]
- also came up in comparisons to ntop
- this could generate large indoms
- potentially generate from userspace (a la tcpdump)
- or perhaps systemtap (via json / mmv interfaces)
- pcp todo item 16: syscall stats
- sometimes need to be able to measure per-process per-syscall stats
e.g., proc.syscall.FOO ["pid"] or proc.syscall ["pid:FOO"]
- naturally matrix-valued
- probably gather via systemtap or similar
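The two metric shapings mentioned above differ only in how the (pid, syscall) matrix is flattened into instance domains; a hypothetical sketch of both:

```python
def shape_by_syscall(counts):
    # proc.syscall.FOO style: one metric per syscall, pid as instance
    out = {}
    for (pid, name), n in counts.items():
        out.setdefault(name, {})[pid] = n
    return out

def shape_flat(counts):
    # proc.syscall style: one metric, compound 'pid:FOO' instance names
    return {'%d:%s' % (pid, name): n for (pid, name), n in counts.items()}
```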
- pcp todo item 17: gpfs pmda
- need it
- pcp todo item 18: simplified pmapi
- fche already prototyping in C
- python pmcc also useful for python clients
- pcp todo item 19: todo lists
- need to collect ideas of small things for students to do
- need to remember to formally report bugs
- fche proposes bugzilla as hammer for both nails
- hm, maybe the above list should be there too