On 06/25/2014 12:07 PM, Frank Ch. Eigler wrote:
>
> Hi, Will -
>
> wcohen wrote:
>
>
>> [...] The receive is a bit more complicated because of the async
>> nature of interrupts for incoming packets and the syscall reading
>> the data, but I think that I understand enough of that to code
>> something similar for receives. [...]
>
> Looking forward to that part. It sounds significantly trickier.
Hi Frank,
I have been digging through the receive tracepoints and writing a script for
the receive side. Things are definitely more complicated there. I have
something written, but it is not working yet.
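
Roughly the kind of pairing the receive side needs, sketched with the
tracepoints the netdev-times perf script uses (netif_receive_skb and
skb_copy_datagram_iovec); my actual script may end up differing, and things
like GRO coalescing and cloned skbs make the real pairing messier:

global rx_t%
global rx_lat

# packet handed up from the driver (interrupt/NAPI context)
probe kernel.trace("netif_receive_skb") {
    rx_t[$skb] = gettimeofday_us()
}

# data copied out to the reading process's buffer
probe kernel.trace("skb_copy_datagram_iovec") {
    if ([$skb] in rx_t) {
        rx_lat <<< gettimeofday_us() - rx_t[$skb]
        delete rx_t[$skb]
    }
}

probe end {
    if (@count(rx_lat))
        printf("rx: n=%d avg=%dus\n", @count(rx_lat), @avg(rx_lat))
}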
>
>
>> [...]
>> $ cat /proc/systemtap/stap_df43d0122ca9ec5271896401487121a6_9305/net_latency
>> #dev: tx_queue_n tx_queue_avg tx_queue_sum tx_xmit_n tx_xmit_avg tx_xmit_sum
>> tx_free_n tx_free_avg tx_free_sum
>> em1: 89 3277 291671 89 3284 292340 83 3536 293549
>> lo: 9236 7 69453 9257 1948 18033760 140 96020 13442850
>
> OK, that confirms the suspicion that a sampled-metric type of pmda
> approach suits this better than timestamped-line-of-trace-data one.
Note that the numbers above were with really light traffic. It is quite
possible that the packet rate would be hundreds of thousands per second. The
netdev-times perf script records a huge amount of trace data, with multiple
entries for each packet, and does post processing on it. Because the perf
trace events might be unordered between processors, it also needs to sort them
by timestamp. All in all, post processing the trace is going to have too much
overhead to be useful on production systems. For some infrequent events the
trace-and-post-process approach would probably be okay, but for high-frequency
events like network operations it is just too much.
>
>
>> [...]
>> probe kernel.trace("sys_enter") { sys_t[tid()] = gettimeofday_us() }
>>
>> probe kernel.trace("sys_exit") { delete sys_t[tid()] }
>
> Note that not all syscalls return; you'll want sys_t to be %-marked or
> something to avoid overflow over time.
Ah, good point.
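Something like this, then (a minimal sketch; the % suffix makes systemtap
evict old entries rather than erroring out when the array fills up, so
threads whose syscalls never return can't grow sys_t without bound):

# wrap-around array: stale entries are dropped instead of overflowing
global sys_t%

probe kernel.trace("sys_enter") { sys_t[tid()] = gettimeofday_us() }
probe kernel.trace("sys_exit") { delete sys_t[tid()] }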
>
> You might consider also tracking the skb-free-to-sys_exit latency, as
> time required to return to the application after the outgoing packet
> has made it down to the network device.
>
> Do we have any visibility into network drivers as to when they finally
> send the packets out on the wire?
The syscall can return before the packet is sent (or freed), so it is not
clear what skb-free-to-sys_exit would show. The skb free can also be a bit
lazy and may not happen immediately after the packet is sent. The tx_queue
times give an idea how long it takes a packet to go through the network layer
before it is passed to the networking hardware. The tx_xmit_avg time shows how
long it takes to get to the networking hardware, and tx_free_avg is measured
after the hardware has sent the data; the actual hardware send happens
somewhere between the tx_xmit time and the tx_free time. My understanding is
that much of the networking hardware uses DMA, so the kernel just hands it a
list of buffer pointers and doesn't know exactly when a particular packet has
gone out on the wire. The kernel only knows the data has been sent once the
skb can be freed.
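
To make the stages concrete, here is a stripped-down sketch of how the
queue-to-xmit and xmit-to-free latencies get paired up per skb (consume_skb
is assumed here as the free-side tracepoint; the real script has more
bookkeeping, e.g. per-device aggregation):

global queue_t%
global xmit_t%
global tx_queue_lat, tx_free_lat

# packet handed from the network layer to the queueing discipline
probe kernel.trace("net_dev_queue") {
    queue_t[$skb] = gettimeofday_us()
}

# packet handed to the device driver's transmit routine
probe kernel.trace("net_dev_xmit") {
    now = gettimeofday_us()
    if ([$skb] in queue_t) {
        tx_queue_lat <<< now - queue_t[$skb]
        delete queue_t[$skb]
        xmit_t[$skb] = now
    }
}

# skb released, meaning the hardware is done with the buffer
probe kernel.trace("consume_skb") {
    if ([$skb] in xmit_t) {
        tx_free_lat <<< gettimeofday_us() - xmit_t[$skb]
        delete xmit_t[$skb]
    }
}

probe end {
    if (@count(tx_queue_lat))
        printf("tx_queue: n=%d avg=%dus\n",
               @count(tx_queue_lat), @avg(tx_queue_lat))
    if (@count(tx_free_lat))
        printf("tx_free:  n=%d avg=%dus\n",
               @count(tx_free_lat), @avg(tx_free_lat))
}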
>
> How does the script handle socket writes that don't result in an
> outgoing transmission right away due e.g. to !TCP_NODELAY or TCP_CORK?
I will have to check on this. The net_dev_queue and net_dev_xmit tracepoints
are low level, and I think the !TCP_NODELAY and TCP_CORK delays would show up
as increased tx_queue times.
-Will