https://bugzilla.redhat.com/show_bug.cgi?id=1296750
Bug ID: 1296750
Summary: incorrect interpolation across <mark> record in a merged archive
Product: Fedora
Version: rawhide
Component: pcp
Severity: high
Priority: high
Assignee: nathans@xxxxxxxxxx
Reporter: mgoodwin@xxxxxxxxxx
QA Contact: extras-qa@xxxxxxxxxxxxxxxxx
CC: brolley@xxxxxxxxxx, fche@xxxxxxxxxx, lberk@xxxxxxxxxx,
mgoodwin@xxxxxxxxxx, nathans@xxxxxxxxxx,
pcp@xxxxxxxxxxx, scox@xxxxxxxxxx
Created attachment 1112686
--> https://bugzilla.redhat.com/attachment.cgi?id=1112686&action=edit
repro script - needs the sample PMDA enabled
Description of problem: libpcp appears to interpolate counter values across
mark records in merged archives. If a counter is reset to zero between the two
archives (e.g. following a reboot), most tools replaying the merged archive
will report a negative rate.
pmval and pmdumptext check whether the rate-converted value is negative and
report '?', but it seems to me the library should return PM_ERR_VALUE instead,
since the interpolated result is unlikely to be correct.
Version-Release number of selected component (if applicable): pcp-3.11
How reproducible: easily, see attached repro script
Steps to Reproduce:
1. create an archive containing a counter metric
2. reset the counter metric to zero (e.g. sample.byte_ctr)
3. create another archive containing the same counter metric
4. merge the two archives
5. use a client to report the interpolated, rate-converted value across the
time of the resulting mark record
Actual results: negative rate-converted value
Expected results: either every tool should check for a counter going backwards
(this is different to a counter wrap), or the library should return
PM_ERR_VALUE since the interpolated counter value is likely to be bogus.
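To illustrate the second option, here is a minimal Python sketch of an interpolation routine that refuses to span a mark record. This is not libpcp code: `interpolate`, `MARK` and `ValueUnavailable` are hypothetical names, with the exception standing in for a PM_ERR_VALUE result. The sample values are taken from the merged archive shown further down, and timestamps are assumed to be strictly increasing.

```python
class ValueUnavailable(Exception):
    """Hypothetical stand-in for a PM_ERR_VALUE result from the library."""


MARK = object()  # hypothetical sentinel for a <mark> record

# (seconds after 13:00:00, counter value) from the merged archive in this report
log = [
    (21.749, 15616),
    (22.749, 16398),
    (22.750, MARK),
    (23.873, 0),
    (24.873, 534),
]


def interpolate(records, t):
    """Linearly interpolate at time t, refusing to span a mark record.

    Assumes record timestamps are strictly increasing.
    """
    for (t0, v0), (t1, v1) in zip(records, records[1:]):
        if t0 <= t <= t1:
            if v0 is MARK or v1 is MARK:
                raise ValueUnavailable(f"mark record between {t0} and {t1}")
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueUnavailable(f"no samples bracket t={t}")


print(interpolate(log, 22.249))  # inside the first archive: a value is returned
try:
    interpolate(log, 23.0)  # falls between the mark and the next sample
except ValueUnavailable as err:
    print("no value:", err)
```

With a check like this in the library, a rate computed from a sample on either side of the mark would come back as "no value" rather than as a negative number, and individual tools would not need their own test.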
Additional info: a common support scenario is to merge all archives on a
customer system, then replay with a large sampling interval, e.g. pmiostat -t
4h on a merged archive spanning a week or more to see when the problems are
occurring. On most such merged archives, I'm seeing many negative values.
Example output from the repro.sh script
$ ./repro.sh
first archive
Log Label (Log Format Version 2)
Performance metrics from host kilcunda
commencing Fri Jan 8 13:00:18.729 2016
ending Fri Jan 8 13:00:22.749 2016
second archive
Log Label (Log Format Version 2)
Performance metrics from host kilcunda
commencing Fri Jan 8 13:00:23.853 2016
ending Fri Jan 8 13:00:52.873 2016
merged archive
Log Label (Log Format Version 2)
Performance metrics from host kilcunda
commencing Fri Jan 8 13:00:18.729 2016
ending Fri Jan 8 13:00:52.873 2016
raw (uninterpolated) values in merged archive :
metric: sample.byte_ctr
archive: third
host: kilcunda
start: Fri Jan 8 13:00:18 2016
end: Fri Jan 8 13:00:52 2016
semantics: cumulative counter
units: byte
samples: 15
13:00:18.749 14067
13:00:19.749 14603
13:00:20.749 15163
13:00:21.749 15616
13:00:22.749 16398
13:00:22.750 Archive logging suspended
13:00:23.873 0
13:00:24.873 534
13:00:25.873 1085
13:00:26.873 1567
13:00:27.873 1909
13:00:28.873 2089
13:00:29.873 2290
13:00:30.873 2879
13:00:31.873 3291
mark record is here:
13:00:22.750 <mark>
RAW values across mark record :
s.byte_ctr
byte
13:00:20 15152
13:00:30 2794
13:00:40 7755
13:00:50 12635
RATE converted values across mark record :
s.byte_ctr
byte/s
13:00:20 N/A
13:00:30 -1235.800
13:00:40 496.100
13:00:50 488.000
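
The negative value at 13:00:30 follows directly from the RAW values above: each rate is the difference between two consecutive reported values divided by the 10 second interval. A minimal Python sketch of that arithmetic (the helper name `rates` is hypothetical, not a PCP API):

```python
# RAW values reported across the mark record: (seconds after 13:00:00, bytes)
raw = [(20, 15152), (30, 2794), (40, 7755), (50, 12635)]


def rates(samples):
    """Return (time, rate) pairs; the first sample has no predecessor."""
    out = [(samples[0][0], None)]
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        out.append((t1, (v1 - v0) / (t1 - t0)))
    return out


for t, r in rates(raw):
    print(f"13:00:{t:02d}", "N/A" if r is None else f"{r:.3f}")
```

The 13:00:30 rate is (2794 - 15152) / 10 = -1235.8 byte/s, because the two values it is computed from lie on opposite sides of the counter reset.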