To: pcp@xxxxxxxxxxx
Subject: [Bug 1296750] New: incorrect interpolation across <mark> record in a merged archive
From: bugzilla@xxxxxxxxxx
Date: Fri, 08 Jan 2016 02:11:42 +0000
Auto-submitted: auto-generated
Delivered-to: pcp@xxxxxxxxxxx
https://bugzilla.redhat.com/show_bug.cgi?id=1296750

            Bug ID: 1296750
           Summary: incorrect interpolation across <mark> record in a
                    merged archive
           Product: Fedora
           Version: rawhide
         Component: pcp
          Severity: high
          Priority: high
          Assignee: nathans@xxxxxxxxxx
          Reporter: mgoodwin@xxxxxxxxxx
        QA Contact: extras-qa@xxxxxxxxxxxxxxxxx
                CC: brolley@xxxxxxxxxx, fche@xxxxxxxxxx, lberk@xxxxxxxxxx,
                    mgoodwin@xxxxxxxxxx, nathans@xxxxxxxxxx,
                    pcp@xxxxxxxxxxx, scox@xxxxxxxxxx



Created attachment 1112686
  --> https://bugzilla.redhat.com/attachment.cgi?id=1112686&action=edit
repro script - needs the sample PMDA enabled

Description of problem: libpcp appears to interpolate counter values across
<mark> records in merged archives. If a counter is reset to zero between the
two archives (e.g. following a reboot), most tools replaying the merged
archive report a negative rate.

pmval and pmdumptext check whether the rate-converted value is negative and
report '?', but it seems to me the library should return PM_ERR_VALUE, since
the interpolated result is unlikely to be correct.

Version-Release number of selected component (if applicable): pcp-3.11

How reproducible: easily, see attached repro script

Steps to Reproduce:
1. create an archive containing a counter metric
2. reset the counter metric to zero (e.g. sample.byte_ctr)
3. create another archive containing the same counter metric
4. merge the two archives
5. use a client to report the interpolated, rate-converted value across the
time of the resulting mark record

Actual results: negative rate-converted value

Expected results: either every tool should check for a counter going backwards
(this is different to a counter wrap), or the library should return
PM_ERR_VALUE since the interpolated counter value is likely to be bogus.
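
The two options above can be illustrated with a minimal sketch. This is not libpcp code: the function name rate_between and its marks argument are hypothetical, and times are expressed as seconds past 13:00:00 purely for illustration. It returns no value when a mark record falls inside the sampling interval (roughly what returning PM_ERR_VALUE from the library would achieve), and also when the counter goes backwards (roughly the check that pmval and pmdumptext apply themselves).

```python
from typing import Optional, Sequence


def rate_between(t0: float, v0: float, t1: float, v1: float,
                 marks: Sequence[float]) -> Optional[float]:
    """Return the counter rate over (t0, t1], or None if it cannot be trusted.

    Hypothetical helper for illustration only; not part of any PCP API.
    """
    if t1 <= t0:
        return None
    # A mark record inside the interval means logging was discontinuous,
    # so the two values may come from different counter lifetimes.
    if any(t0 < m <= t1 for m in marks):
        return None
    # Client-side fallback: a counter that went backwards yields no rate.
    if v1 < v0:
        return None
    return (v1 - v0) / (t1 - t0)


mark_times = [22.750]  # the <mark> at 13:00:22.750
print(rate_between(20.0, 15152, 30.0, 2794, mark_times))  # interval spans the mark
print(rate_between(30.0, 2794, 40.0, 7755, mark_times))   # interval after the mark
```

With the values from the example output below, the first call yields no value instead of a negative rate, and the second yields an ordinary positive rate.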

Additional info: a common support scenario is to merge all archives on a
customer system, then replay with a large sampling interval, e.g. pmiostat -t
4h on a merged archive spanning a week or more to see when the problems are
occurring. On most such merged archives, I'm seeing many negative values.

Example output from the repro.sh script

$ ./repro.sh 
first archive
Log Label (Log Format Version 2)
Performance metrics from host kilcunda
  commencing Fri Jan  8 13:00:18.729 2016
  ending     Fri Jan  8 13:00:22.749 2016

second archive
Log Label (Log Format Version 2)
Performance metrics from host kilcunda
  commencing Fri Jan  8 13:00:23.853 2016
  ending     Fri Jan  8 13:00:52.873 2016

merged archive
Log Label (Log Format Version 2)
Performance metrics from host kilcunda
  commencing Fri Jan  8 13:00:18.729 2016
  ending     Fri Jan  8 13:00:52.873 2016

raw (uninterpolated) values in merged archive :

metric:    sample.byte_ctr
archive:   third
host:      kilcunda
start:     Fri Jan  8 13:00:18 2016
end:       Fri Jan  8 13:00:52 2016
semantics: cumulative counter
units:     byte
samples:   15
13:00:18.749      14067
13:00:19.749      14603
13:00:20.749      15163
13:00:21.749      15616
13:00:22.749      16398
13:00:22.750  Archive logging suspended
13:00:23.873          0
13:00:24.873        534
13:00:25.873       1085
13:00:26.873       1567
13:00:27.873       1909
13:00:28.873       2089
13:00:29.873       2290
13:00:30.873       2879
13:00:31.873       3291

mark record is here:
13:00:22.750  <mark>

RAW values across mark record :
          s.byte_ctr
                byte
13:00:20       15152
13:00:30        2794
13:00:40        7755
13:00:50       12635

RATE converted values across mark record :
          s.byte_ctr
              byte/s
13:00:20         N/A
13:00:30   -1235.800
13:00:40     496.100
13:00:50     488.000
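
The reported numbers can be checked by hand; the following sketch does the arithmetic. It assumes plain linear interpolation between the two raw samples either side of each reporting time, and it assumes the reporting times carry the archive's .729 start offset (13:00:20.729, 13:00:30.729, ...) even though the output shows whole seconds. Both are inferences from the output above, not statements about libpcp internals. Times are seconds past 13:00:00.

```python
def lerp(t, t0, v0, t1, v1):
    """Linearly interpolate a value at time t between (t0, v0) and (t1, v1)."""
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Interpolated raw values, each computed within a single source archive.
before_mark = lerp(20.729, 19.749, 14603, 20.749, 15163)
after_mark = lerp(30.729, 29.873, 2290, 30.873, 2879)
print(round(before_mark), round(after_mark))

# Rate conversion over the 10 second reporting interval, using the
# interpolated values shown in the "RAW values across mark record" output.
reported = [15152, 2794, 7755, 12635]
rates = [(b - a) / 10.0 for a, b in zip(reported, reported[1:])]
print(rates)
```

Under those assumptions the interpolated values round to 15152 and 2794, matching the output, and the first rate is (2794 - 15152) / 10 = -1235.8 byte/s. Each interpolated value is plausible on its own; the negative rate comes from differencing two values that sit on opposite sides of the mark record.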
