I have been looking at a PMDA that provides information about how long
it takes for packets to make their way from userspace to the network
device and from the network device to userspace. People might be
interested to know whether their network traffic is seeing too much
latency in the kernel. The kernel tracepoints and the perf
netdev-times script [+] allow the user to determine how long it takes
for network packets to make their way through the networking stack.
However, the netdev-times script isn't appropriate for production
systems: it provides too much detail (information on every packet),
incurs far too much overhead, and cannot process significant network
traffic in real time.
The same tracepoints are available to systemtap, and a systemtap
script could provide more appropriate summary-style information to
PCP as a PMDA with much lower overhead. The thought is that it would
probably be sufficient to provide packet send and receive latency
metrics for each network device. I have some questions on
implementing the performance metric names. I am thinking of something
like the following:
network.interface.in.latency instance "devname"
network.interface.out.latency instance "devname"
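For what it's worth, those names would be declared in the PMDA's PMNS
file. A minimal sketch, assuming a hypothetical NETLATENCY domain
symbol and showing only the leaf declarations (the real domain number
would come from the PMDA's install script):

    network.interface.in {
        latency    NETLATENCY:0:0
    }
    network.interface.out {
        latency    NETLATENCY:1:0
    }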
The value would be the average latency on the device. This would be
similar to the kernel.all.cpu.* metrics in that the latencies would
be averaged over some window of time. Would it be better to provide
raw monotonic sums of the delays and counts of network packets in the
PMDA, and have PCP derive the average latency from the raw metric
values? That would allow PCP to use arbitrary window times. For some
time t and delta between measurements, one could compute:

(latency_sum[t] - latency_sum[t-delta]) / (packets[t] - packets[t-delta])
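If the PMDA exports raw counters, the averaging could be left to
PCP's derived metrics. A small sketch, assuming hypothetical counter
metrics named latency_total and packets_total; since rate() divides
each counter delta by the same sampling interval, the intervals
cancel and the ratio matches the expression above:

    network.interface.out.avg_latency =
        rate(network.interface.out.latency_total) /
        rate(network.interface.out.packets_total)

An expression like this could be registered with pmRegisterDerived(3)
or loaded from a derived metrics config file.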
The systemtap script could use systemtap PROCFS probes [*] to make
that information readable whenever PCP asks for it. The output could
echo the /proc/net/dev format (and there might be more latency fields
in there to give a finer-grained picture of where a packet spends its
time); sample output, with a systemtap sketch following it:
Inter-     |       Receive        |       Transmit
 face      |   packets    latency |   packets    latency
    wlp3s0:          0          0           0          0
        lo: 1738527854    1373704  1738527854    1373704
virbr0-nic:          0          0           0          0
    virbr0:          0          0           0          0
       em1:   87683319     105450    11920860      62401
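As a rough illustration of the shape of the systemtap side, here is a
minimal, untested sketch. It times transmit latency between the
net_dev_queue and net_dev_xmit tracepoints and exposes monotonic sums
through a procfs probe; the tracepoint context variables, the units,
and the "net_latency" file name are assumptions that would need
checking against the kernel and systemtap versions in use (the
receive side and any extra latency fields are omitted):

    #! /usr/bin/env stap
    # Sketch: per-device transmit latency, queue -> xmit completion.
    global queued        # skb address -> timestamp at net_dev_queue
    global latency_sum   # device name -> accumulated delay (us, assumed)
    global packets       # device name -> completed packet count

    probe kernel.trace("net_dev_queue") {
        queued[$skb] = gettimeofday_us()
    }

    probe kernel.trace("net_dev_xmit") {
        t = queued[$skb]
        if (t) {
            dev = kernel_string($dev->name)
            latency_sum[dev] += gettimeofday_us() - t
            packets[dev]++
            delete queued[$skb]
        }
    }

    # Read via /proc/systemtap/<module>/net_latency: one line per
    # device with the raw packet count and latency sum.
    probe procfs("net_latency").read {
        foreach (dev in packets) {
            $value .= sprintf("%s: %d %d\n", dev, packets[dev],
                              latency_sum[dev])
        }
    }

The PMDA would then read that file on each fetch and export the two
counters per device.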
Any thoughts or comments about this proposed network latency PMDA?
-Will
[+]
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/perf/scripts/python/netdev-times.py
[*]
https://sourceware.org/systemtap/langref/Probe_points.html#SECTION00057000000000000000