
Re: PCP Network Latency PMDA

To: "Frank Ch. Eigler" <fche@xxxxxxxxxx>
Subject: Re: PCP Network Latency PMDA
From: William Cohen <wcohen@xxxxxxxxxx>
Date: Wed, 25 Jun 2014 14:54:18 -0400
Cc: pcp@xxxxxxxxxxx
Delivered-to: pcp@xxxxxxxxxxx
In-reply-to: <y0mlhsl3rd9.fsf@xxxxxxxx>
References: <53A34A47.3060008@xxxxxxxxxx> <53A9E126.2040000@xxxxxxxxxx> <y0mlhsl3rd9.fsf@xxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
On 06/25/2014 12:07 PM, Frank Ch. Eigler wrote:
> 
> Hi, Will -
> 
> wcohen wrote:
> 
> 
>> [...] The receive is a bit more complicated because of the async
>> nature of interrupts for incoming packets and the syscall reading
>> the data, but I think that I understand enough of that to code
>> something similar for receives. [...]
> 
> Looking forward to that part.  It sounds significantly trickier.

Hi Frank,

I have been digging through the tracepoints for receive and writing a script for
the receive side. Things are definitely more complicated there. I have something
written, but it is not working yet.

> 
> 
>> [...]
>> $ cat /proc/systemtap/stap_df43d0122ca9ec5271896401487121a6_9305/net_latency 
>> #dev: tx_queue_n tx_queue_avg tx_queue_sum tx_xmit_n tx_xmit_avg tx_xmit_sum tx_free_n tx_free_avg tx_free_sum
>> em1: 89 3277 291671 89 3284 292340 83 3536 293549
>> lo: 9236 7 69453 9257 1948 18033760 140 96020 13442850
> 
> OK, that confirms the suspicion that a sampled-metric type of pmda
> approach suits this better than timestamped-line-of-trace-data one.

Note that the numbers above were taken with really light traffic.  It is quite
possible that the number of packets would be hundreds of thousands per second.  The
netdev-times perf script records a huge amount of trace data, with multiple entries
for each packet, and does post processing on it.  Because the perf trace events
might be unordered between processors, it also needs to sort on the timestamp.  All
in all, it seems post processing of the trace is going to have too much overhead to
be useful on production systems.  Maybe for some infrequent events the trace and
post processing would be okay, but for high-frequency events like network
operations it seems like just too much.

> 
> 
>> [...]
>> probe kernel.trace("sys_enter") { sys_t[tid()] = gettimeofday_us() }
>>
>> probe kernel.trace("sys_exit") { delete sys_t[tid()] }
> 
> Note that not all syscalls return; you'll want sys_t to be %-marked or
> something to avoid overflow over time.

Ah, good point.
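
Something like this should keep the map bounded (just a sketch; the 20000 limit
is an arbitrary guess):

# wrap-around array: old entries get recycled when the map fills up,
# so syscalls that never return cannot grow it without bound
global sys_t%[20000]

probe kernel.trace("sys_enter") { sys_t[tid()] = gettimeofday_us() }
probe kernel.trace("sys_exit") { delete sys_t[tid()] }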
> 
> You might consider also tracking the skb-free-to-sys_exit latency, as
> time required to return to the application after the outgoing packet
> has made it down to the network device.
> 
> Do we have any visibility into network drivers as to when they finally
> send the packets out on the wire?

The syscall can return before the packet is sent (or freed), so it is not clear
what skb-free-to-sys_exit would show.  The skb free can also be a bit lazy and may
not happen immediately after the packet is sent.  The tx_queue_t numbers give an
idea of how long it takes a packet to go through the network layer before it is
passed to the networking hardware.  The tx_xmit_avg time shows how long it takes to
get to the networking hardware, and the tx_free_avg time is measured after the
hardware has sent the data; the actual hardware send happens somewhere between the
tx_xmit time and the tx_free time.  My understanding is that some of the networking
hardware uses DMA, so the kernel is just handing the networking hardware pointers
on a list and doesn't know exactly when a particular packet has been sent.  The
kernel knows the data has been sent once the skb can be freed.
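
For the transmit side, the idea is roughly along these lines (a simplified sketch
rather than the exact script; the map sizes and variable names are illustrative,
and the syscall entry time is used as the common starting point for all three
stages):

global sys_t%[20000]        # per-thread syscall entry time (wrap-around)
global skb_t0%[20000]       # per-skb start time
global skb_dev%[20000]      # per-skb device name
global tx_queue, tx_xmit, tx_free   # per-device latency aggregates (us)

probe kernel.trace("sys_enter") { sys_t[tid()] = gettimeofday_us() }
probe kernel.trace("sys_exit") { delete sys_t[tid()] }

probe kernel.trace("net_dev_queue") {
    # packet handed to the device queueing layer
    if (tid() in sys_t) {
        t0 = sys_t[tid()]
        dev = kernel_string($skb->dev->name)
        skb_t0[$skb] = t0
        skb_dev[$skb] = dev
        tx_queue[dev] <<< gettimeofday_us() - t0
    }
}

probe kernel.trace("net_dev_xmit") {
    # packet handed down to the driver/hardware
    if ($skb in skb_t0)
        tx_xmit[skb_dev[$skb]] <<< gettimeofday_us() - skb_t0[$skb]
}

probe kernel.trace("consume_skb") {
    # skb freed: the hardware is done with the buffer
    if ($skb in skb_t0) {
        tx_free[skb_dev[$skb]] <<< gettimeofday_us() - skb_t0[$skb]
        delete skb_t0[$skb]
        delete skb_dev[$skb]
    }
}

The counts, averages, and sums in the /proc output then just come from @count,
@avg, and @sum over these aggregates in a procfs read probe.  Dropped packets go
through kfree_skb rather than consume_skb, so their entries would need separate
cleanup; the wrap-around maps at least keep that from growing without bound.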

> 
> How does the script handle socket writes that don't result in an
> outgoing transmission right away due e.g. to !TCP_NODELAY or TCP_CORK?

I will have to check on this.  The net_dev_queue and net_dev_xmit tracepoints are
low level, and I think the delays from !TCP_NODELAY and TCP_CORK would show up as
increased tx_queue times.

-Will
