Will, welcome to the list ...
On 20/06/14 06:38, William Cohen wrote:
> ... The thought is that it would probably
> be sufficient to provide metrics for latency for packet send and
> receive of each network device.
Sounds like a good idea and a natural extension to the coverage of the
existing network interface metrics.
> ... I have some questions on implementing
> the performance metric names. Thinking maybe something like the
> following names:
>
>     network.interface.in.latency    instance "devname"
>     network.interface.out.latency   instance "devname"
>
> The value would be the average latency on the device. This would be
> similar to the kernel.all.cpu.* metrics in that the latencies would
> be averaged over some window of time. Would it be better to provide
> raw monotonic sums of the delays and the number of network packets
> in the PMDA and have PCP derive the average latency from the raw
> metric values? ...
As a general rule we're very strongly in favour of PMDAs exporting
running counters when they are available. This reduces state and
complexity in the PMDA, avoids hard-wired decisions about the "average
time period" in the PMDA and dodges the whole (statistically) messy
issue of rolling time averages and delays in average calculation and
reporting for the PMDA ... all of these issues dilute the semantic value
of the exported data and are better handled by the clients.
For simple counters all PCP clients are able to deal with their own
averaging to match their needs, with consistency for multiple clients
that may be sampling the same metrics at different times and rates.
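
To make the counter idea concrete, the PMDA side might declare the raw
metrics something like this (just a sketch: the PMID numbers, the
NETDEV_INDOM instance domain and the choice of microseconds are all
placeholders, not settled names):

    #include <pcp/pmapi.h>
    #include <pcp/pmda.h>

    /* hypothetical instance domain, one instance per network device */
    #define NETDEV_INDOM 0

    static pmdaMetric metrictab[] = {
        /* network.interface.in.latency: running sum of receive
         * latencies, a monotonic counter in microseconds */
        { NULL, { PMDA_PMID(0, 0), PM_TYPE_U64, NETDEV_INDOM,
            PM_SEM_COUNTER, PMDA_PMUNITS(0, 1, 0, 0, PM_TIME_USEC, 0) } },
        /* network.interface.out.latency: ditto for transmit */
        { NULL, { PMDA_PMID(0, 1), PM_TYPE_U64, NETDEV_INDOM,
            PM_SEM_COUNTER, PMDA_PMUNITS(0, 1, 0, 0, PM_TIME_USEC, 0) } },
    };

The important parts are PM_SEM_COUNTER and the units; everything else
follows the usual PMDA boilerplate.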
Now the data you're talking about probably requires the more complicated
delta(v1) / delta(v2) calculation for a user-facing client (but not
pmlogger), where v1 is the sum of latencies and v2 is the packet count,
but this can be automagically produced by "derived metrics" that a
client can choose to use (see an example in the second para of the
pmRegisterDerived(3) man page), or done by pmie(1).
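
For example, a client could register the derived metric before creating
its context, something like this (again only a sketch:
network.interface.in.avg_latency is a made-up name, and it pairs the
proposed latency sum with the existing network.interface.in.packets
counter):

    #include <stdio.h>
    #include <pcp/pmapi.h>

    int
    main(void)
    {
        /* must be registered before pmNewContext(3) is called */
        char *err = pmRegisterDerived("network.interface.in.avg_latency",
            "delta(network.interface.in.latency) /"
            " delta(network.interface.in.packets)");

        if (err != NULL) {
            /* on error, the return value points into the expression
             * at the spot where parsing failed */
            fprintf(stderr, "pmRegisterDerived: error near \"%s\"\n", err);
            return 1;
        }
        /* ... pmNewContext(), pmLookupName(), pmFetch() as usual ... */
        return 0;
    }

The same expression could instead live in a derived metric config file
named via $PCP_DERIVED_CONFIG, so existing clients pick it up without
any code change.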
> ... Any thoughts or comments about this proposed network latency PMDA?
For the other network interface metrics, we only incur an overhead when
the data is requested (when the metrics are not being requested there is
zero additional cost because the instrumentation is already being done
in the kernel). This might not be the case here, where I expect the
instrumentation would need to be enabled, even if the metrics are not
being requested.
Depending on the overhead of this instrumentation, you may wish to
consider an additional control variable, set via pmstore(1), that lets a
user enable or disable the collection, with the latency metrics
returning PM_ERR_VALUE or PM_ERR_APPVERSION when they are requested but
collection is disabled. For examples of PMDA behaviour being modified
by storing into metrics, see the help text (as in pminfo -T) for
xfs.control.reset or sample.dodgey.control.
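
A fetch callback honouring such a control could look roughly like this
(entirely illustrative: the netlat_* names, item numbers and the
control metric itself are invented):

    #include <pcp/pmapi.h>
    #include <pcp/pmda.h>

    #define ITEM_CONTROL     0   /* settable via pmstore(1) */
    #define ITEM_IN_LATENCY  1
    #define ITEM_OUT_LATENCY 2

    /* updated by the PMDA's store method when the control metric
     * is written; instrumentation is off by default */
    static int collecting;

    static int
    netlat_fetchCallBack(pmdaMetric *mdesc, unsigned int inst,
                         pmAtomValue *atom)
    {
        /* 'inst' would select the network device; ignored here */
        switch (pmID_item(mdesc->m_desc.pmid)) {
        case ITEM_CONTROL:
            atom->ul = collecting;
            return PMDA_FETCH_STATIC;
        case ITEM_IN_LATENCY:
        case ITEM_OUT_LATENCY:
            if (!collecting)
                return PM_ERR_VALUE;  /* collection disabled */
            atom->ull = 0;            /* running latency sum goes here */
            return PMDA_FETCH_STATIC;
        }
        return PM_ERR_PMID;
    }

The matching store method would flip 'collecting' (and enable or
disable the kernel instrumentation) when the control metric is written.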
Cheers, Ken.