Quoting Andi Kleen <ak@xxxxxxx>:
> > c0106e59 42693 1.89176 restore_all
> > c01dfe68 42787 1.89592 sys_socketcall
> > c01df39c 54185 2.40097 sys_bind
> > c01de698 62740 2.78005 sockfd_lookup
> > c01372c8 97886 4.3374 fput
> > c022c110 125306 5.55239 __generic_copy_to_user
> > c01373b0 181922 8.06109 fget
> > c020958c 199054 8.82022 tcp_v4_get_port
> > c0106e10 199934 8.85921 system_call
> > c022c158 214014 9.48311 __generic_copy_from_user
> > c0216ecc 257768 11.4219 inet_bind
> The profile looks bogus. The NIC driver is nowhere in sight.
> Normally its memory-mapped IO for interrupts and device registers
> should show up. I would double check it (e.g. with a normal profile).
Separately compiled acenic..
I'm surprised by this profile a bit too - on the client side,
since the requests are small, and the client is receiving
all those files, I would have thought that __generic_copy_to_user
would have been way higher than *from_user.
inet_bind() and tcp_v4_get_port() are up there because
we have to grab the socket lock, the tcp_portalloc_lock,
then the head chain lock and traverse the hash table
which now has many hundreds of entries. Also, because
of the varied length of the connections, the clients'
ports are not freed in the same order they were allocated,
hence the fragmentation of the port space.
There is also some cacheline thrashing here, which hurts
NUMA machines more than other systems.
If you just wanted to speed things up, you could get the
clients to specify ports instead of letting the kernel
cycle through for a free port..:)
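Something like the following is what I have in mind for the client
side. It is only a rough sketch: struct port_range, next_port() and
socket_on_port() are names made up for illustration, and giving each
client process its own non-overlapping range is an assumption about
how the load generator would be set up. An explicit port still goes
through bind() and the bind hash, but it should skip the search for a
free port.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical per-client port range [lo, hi]; ranges of different
 * client processes are assumed not to overlap. */
struct port_range {
    uint16_t lo;    /* first port in the range (inclusive) */
    uint16_t hi;    /* last port in the range (inclusive) */
    uint16_t next;  /* next port to hand out */
};

/* Hand out ports round-robin within the range, wrapping at the end. */
uint16_t next_port(struct port_range *r)
{
    uint16_t p;

    assert(r->lo <= r->hi && r->next >= r->lo && r->next <= r->hi);
    p = r->next;
    r->next = (p == r->hi) ? r->lo : (uint16_t)(p + 1);
    return p;
}

/* Create a TCP socket bound to an explicit local port instead of
 * asking the kernel to pick one (port 0).
 * Returns the fd, or -1 on error (e.g. EADDRINUSE). */
int socket_on_port(uint16_t port)
{
    struct sockaddr_in local;
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    /* Allow rebinding a port whose old connection is in TIME_WAIT. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        close(fd);
        return -1;
    }
    return fd;  /* caller goes on to connect() as usual */
}
```

The client would call socket_on_port(next_port(&range)) before each
connect(), and move on to the next port in its range if the bind
fails with EADDRINUSE.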
> In case it is not bogus:
> Most of these are either atomic_inc/dec of reference counters or
> some form of lock. The system_call could be the int 0x80 (using the
> SYSENTER patches would help), which also does atomic operations
> implicitly. restore_all is IRET, which could likely also be sped up by
> using SYSEXIT.
> If NAPI hurts here then it is surely not because of eating CPU time.