Rick Jones wrote:
Robert Olsson wrote:
Eric Dumazet writes:
> I have no profiling info for this exact patch, I'm sorry David.
> On a dual opteron machine, this thing from ip_route_input() is very
> expensive :
>
>	RT_CACHE_STAT_INC(in_hlist_search);
>
> ip_route_input() uses a total of 3.4563 % of one cpu, but this
> 'increment' takes 1.20 % !!!
Very weird if the stats counter is taking a third of ip_route_input.
> Sometimes I wonder if oprofile can be trusted :(
> Maybe we should increment a counter on the stack and do a final
>
>	if (counter != 0)
>		RT_CACHE_STAT_ADD(in_hlist_search, counter);
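(A minimal stand-alone sketch of that idea, with hypothetical names;
RT_CACHE_STAT_ADD is the proposed macro, not existing code, and the real
per-CPU statistic is more involved. The point is that the hot loop only
bumps a stack variable and the shared counter is written at most once
per lookup:)

/* user-space illustration of "accumulate on the stack, flush once" */
#include <stdio.h>

static unsigned long in_hlist_search;	/* stands in for the per-CPU stat */

static int chain_lookup(const int *chain, int len, int key)
{
	unsigned int searched = 0;	/* lives on the stack / in a register */
	int i, found = -1;

	for (i = 0; i < len; i++) {
		if (chain[i] == key) {
			found = i;
			break;
		}
		searched++;		/* cheap: no shared cache line touched */
	}
	if (searched)			/* one shared write instead of N */
		in_hlist_search += searched;
	return found;
}

int main(void)
{
	int chain[] = { 3, 7, 42, 9 };
	int idx = chain_lookup(chain, 4, 42);

	printf("hit at index %d, searched %lu entries first\n",
	       idx, in_hlist_search);
	return 0;
}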
My experience from playing with prefetching of eth_type_trans is
relevant here: one must look at total performance, not just at where
the prefetching is done. In that case I was able to get eth_type_trans
down in the profile list, but other functions increased, so overall
performance was the same or lower. This needs to be sorted out...
How many of the architectures have PMUs that can give us cache miss
statistics? Itanium does, and can go so far as to tell us which
addresses and instructions are involved - do the others?
That sort of data would seem to be desirable in this sort of situation.
rick jones
oprofile on AMD64 can gather lots of data, DATA_CACHE_MISSES for example...
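(For reference, with the classic opcontrol interface the sampling setup
looks roughly like this; the sample count is arbitrary and the exact
flags are from memory:)

opcontrol --setup --vmlinux=/usr/src/linux/vmlinux \
          --event=DATA_CACHE_MISSES:100000
opcontrol --start
# ... run the routing workload ...
opcontrol --stop
opreport -l /usr/src/linux/vmlinux | head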
But I think I know what happens...
nm -v /usr/src/linux/vmlinux | grep -5 rt_cache_stat
ffffffff804c6a80 b rover.5
ffffffff804c6a88 b last_gc.2
ffffffff804c6a90 b rover.3
ffffffff804c6a94 b equilibrium.4
ffffffff804c6a98 b ip_fallback_id.7
ffffffff804c6aa0 B rt_cache_stat
ffffffff804c6aa8 b ip_rt_max_size
ffffffff804c6aac b ip_rt_debug
ffffffff804c6ab0 b rt_deadline
So rt_cache_stat (which is a read-only pointer) sits in the middle of a
hot cache line (some parts of it are written over and over), which
probably ping-pongs between CPUs.
Time to provide a patch to carefully split the static data in
net/ipv4/route.c into two parts: mostly read-only data, and the rest... :)
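(A stand-alone illustration of why such a split helps, with made-up
field names: a field that is only read should not share a cache line
with one that other CPUs write constantly.)

#include <stdio.h>
#include <stdalign.h>

#define CACHE_LINE 64

/* Bad: the read-mostly pointer shares a line with a hot counter, so
 * every increment on another CPU invalidates the line readers need. */
struct bad_layout {
	const void *read_mostly_ptr;
	unsigned long hot_counter;
};

/* Better: push the frequently written field onto its own cache line;
 * in the kernel, grouping data into a "mostly read-only" part (or using
 * ____cacheline_aligned) achieves the same separation. */
struct good_layout {
	const void *read_mostly_ptr;
	alignas(CACHE_LINE) unsigned long hot_counter;
};

int main(void)
{
	printf("bad layout : %zu bytes\n", sizeof(struct bad_layout));
	printf("good layout: %zu bytes\n", sizeof(struct good_layout));
	return 0;
}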
Eric