Recently, I was trying to understand kernprof output while
doing a scalability study. My intention was to run a benchmark
on a UP system (actually a 1P SMP kernel) and capture the
kernprof output, then run the same benchmark after enabling 7
additional processors. I 'thought' I could then do an
apples-to-apples comparison of the output to look for routines
whose execution time increased. However, that is not what I
got. It 'looks' like the 8P numbers are 8X the UP numbers???

Does anyone know if kernprof can be used for such a comparison?

While looking at the kernprof code, I noticed that the global
variable 'total_mcount' (in drivers/char/profile.c) is
incremented each time mcount is called: OUCH!!! Since mcount
is called upon entry to every function, it is hard to imagine
a more contested cache line. total_mcount appears to be used
only for informational purposes, so I commented out the one
line that increments it and did some 'before and after' runs.
The output is below. Updating this global variable looks to
account for about 20% of the total time of running this
benchmark with profiling enabled.

I was going to write a patch that creates per-CPU instances of
total_mcount, each on its own cache line (see the sketch
below). However, I couldn't find any other code in this file
that appears to be concerned with cache line boundaries.
Perhaps there is room for improvement in this area?
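
Roughly what I had in mind is the following. This is only a
sketch, not a tested patch, and the identifiers I lean on
(NR_CPUS, SMP_CACHE_BYTES, __cacheline_aligned, smp_num_cpus,
cpu_logical_map) are my assumptions about the surrounding
kernel, not code taken from kernprof itself:

	/* One counter per CPU, each padded out to its own cache
	 * line so the mcount fast path never bounces a shared
	 * line between processors. */
	static struct {
		unsigned long count;
		char pad[SMP_CACHE_BYTES - sizeof(unsigned long)];
	} mcount_count[NR_CPUS] __cacheline_aligned;

	/* In mcount(), instead of total_mcount++: */
	mcount_count[smp_processor_id()].count++;

	/* Wherever total_mcount is reported, sum over the CPUs: */
	unsigned long total = 0;
	int i;

	for (i = 0; i < smp_num_cpus; i++)
		total += mcount_count[cpu_logical_map(i)].count;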

I don't mean to be too down on the kernprof implementation.
I think it is a great tool, and one that I use frequently.
Perhaps someday I'll get some time to help out with its
development and support.

Thanks,
--
Mike

Original kernprof
-----------------
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 35.51   1062.13  1062.13                             cg_record_arc
 12.82   1445.64   383.51 10733690    35.73    35.73  sys_sched_yield
 12.66   1824.38   378.74 13239210    28.61    28.94  schedule
 11.47   2167.45   343.07  2874286   119.36   119.83  send_sig_info
  8.82   2431.30   263.85                             USER
  5.56   2597.67   166.37  2640054    63.02    63.02  _wake_up
  1.34   2637.88    40.21    17832  2254.93  2254.93  poll_idle
  1.12   2671.50    33.62  2874280    11.70    15.20  do_signal
  0.87   2697.61    26.11                             stext_lock
  0.63   2716.55    18.94                             mcount
  0.57   2733.56    17.01  6577379     2.59     3.29  netif_rx
  0.54   2749.66    16.10  4005451     4.02    17.10  tcp_sendmsg
  0.51   2765.06    15.40  8433015     1.83     1.83  mod_timer
  0.49   2779.74    14.68  6577234     2.23    31.01  tcp_v4_rcv
  0.35   2790.16    10.42                             system_call
  0.28   2798.62     8.46  6576949     1.29     7.86  ip_queue_xmit
  0.28   2806.99     8.37  6577234     1.27     5.58  dev_queue_xmit
  0.26   2814.81     7.82  6576949     1.19     9.42  tcp_transmit_skb

kernprof with total_mcount commented out
----------------------------------------
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 20.04    476.96   476.96 10134665    47.06    47.06  sys_sched_yield
 18.60    919.57   442.61 12921479    34.25    34.64  schedule
 16.31   1307.74   388.17  2857557   135.84   136.32  send_sig_info
 13.32   1624.77   317.03                             cg_record_arc
 10.86   1883.18   258.41                             USER
  7.59   2063.70   180.52  2605133    69.29    69.29  _wake_up
  1.24   2093.16    29.46    19011  1549.63  1549.63  poll_idle
  0.77   2111.50    18.34                             mcount
  0.68   2127.75    16.25  2857551     5.69     9.08  do_signal
  0.64   2142.90    15.15  4005451     3.78    14.51  tcp_sendmsg
  0.59   2156.92    14.02  8310456     1.69     1.69  mod_timer
  0.58   2170.75    13.83  6387602     2.17    33.20  tcp_v4_rcv
  0.41   2180.50     9.75                             system_call
  0.34   2188.49     7.99  6387193     1.25     5.85  ip_queue_xmit
  0.30   2195.72     7.23  6387602     1.13     3.69  dev_queue_xmit
  0.30   2202.84     7.12 10227829     0.70     1.34  _kfree_skb
  0.30   2209.89     7.05  6387193     1.10     7.37  tcp_transmit_skb
  0.29   2216.89     7.00  6386259     1.10    10.01  tcp_rcv_established