/me said:
> I'm experimenting with different csum_ routines in userspace now.
Short conclusion:
1. It is possible to speed up csum routines for AMD processors by 30%.
2. It is possible to speed up csum_copy routines for both AMD and Intel
three times or more. Roy, do you like that? ;)
Tests: they checksum 4MB block and csum_copy 2MB into second 2MB.
POISON=0/1 controls whether to perform correctness tests or not.
That slows down test very noticeably. What does glibc use for
memset/memcmp? for() loop?!!
With POISON=1 ntqpf2_copy bugs out, see its source. I left it in
to save repeating my work by others. BTW, i do NOT understand why
it does not work. ;) Anyone with cluebat?
IMHO the only way to make it optimal for all CPUs is to make these
functions race at kernel init and pick the best one.
tests on Celeron 1200 (100 MHz, x12 core)
=========================================
Csum benchmark program
buffer size: 4 Mb
Each test tried 16 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum  took 717 max, 704 min cycles per kb.
sum=0x44000077
kernel_csum  took 4760 max, 704 min cycles per kb.
sum=0x44000077
kernel_csum  took 722 max, 704 min cycles per kb.
sum=0x44000077
kernelpii_csum  took 539 max, 528 min cycles per kb.
sum=0x44000077
kernelpiipf_csum  took 573 max, 529 min cycles per kb.
sum=0x44000077
pfm_csum  took 1411 max, 1306 min cycles per kb.
sum=0x44000077
pfm2_csum  took 875 max, 762 min cycles per kb.
sum=0x44000077
copy tests:
kernel_copy  took 5738 max, 3423 min cycles per kb.
sum=0x99aaaacc
kernel_copy  took 3517 max, 3431 min cycles per kb.
sum=0x99aaaacc
kernel_copy  took 4385 max, 3432 min cycles per kb.
sum=0x99aaaacc
kernelpii_copy  took 2912 max, 2752 min cycles per kb.
sum=0x99aaaacc
ntqpf_copy  took 2010 max, 1700 min cycles per kb.
sum=0x99aaaacc
ntqpfm_copy  took 1749 max, 1701 min cycles per kb.
sum=0x99aaaacc
ntq_copy  took 2218 max, 2141 min cycles per kb.
sum=0x99aaaacc
BAD copy! < ntqpf2_copy is buggy :) see its source
'copy tests' above are with POISON=1
These are with POISON=0:
kernel_copy  took 2009 max, 1935 min cycles per kb.
sum=0x44000077
kernel_copy  took 2240 max, 1959 min cycles per kb.
sum=0x44000077
kernel_copy  took 2197 max, 1936 min cycles per kb.
sum=0x44000077
kernelpii_copy  took 2121 max, 1939 min cycles per kb.
sum=0x44000077
ntqpf_copy  took 667 max, 548 min cycles per kb.
sum=0x44000077
ntqpfm_copy  took 651 max, 546 min cycles per kb.
sum=0x44000077
ntq_copy  took 660 max, 545 min cycles per kb.
sum=0x44000077
ntqpf2_copy  took 644 max, 548 min cycles per kb.
sum=0x44000077
Done
Tests on Duron 650 (100 MHz, x6,5 core)
=======================================
Csum benchmark program
buffer size: 4 Mb
Each test tried 16 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum  took 1090 max, 1051 min cycles per kb.
sum=0x44000077
kernel_csum  took 1080 max, 1052 min cycles per kb.
sum=0x44000077
kernel_csum  took 1178 max, 1058 min cycles per kb.
sum=0x44000077
kernelpii_csum  took 1614 max, 1052 min cycles per kb.
sum=0x44000077
kernelpiipf_csum  took 976 max, 962 min cycles per kb.
sum=0x44000077
pfm_csum  took 755 max, 746 min cycles per kb.
sum=0x44000077
pfm2_csum  took 749 max, 745 min cycles per kb.
sum=0x44000077
copy tests:
kernel_copy  took 1251 max, 1072 min cycles per kb.
sum=0x99aaaacc
kernel_copy  took 1363 max, 1072 min cycles per kb.
sum=0x99aaaacc
kernel_copy  took 1352 max, 1072 min cycles per kb.
sum=0x99aaaacc
kernelpii_copy  took 1132 max, 1014 min cycles per kb.
sum=0x99aaaacc
ntqpf_copy  took 514 max, 480 min cycles per kb.
sum=0x99aaaacc
ntqpfm_copy  took 495 max, 482 min cycles per kb.
sum=0x99aaaacc
ntq_copy  took 1153 max, 948 min cycles per kb.
sum=0x99aaaacc
BAD copy! < ntqpf2_copy is buggy :) see its source
'copy tests' above are with POISON=1
These are with POISON=0:
kernel_copy  took 1145 max, 871 min cycles per kb.
sum=0x44000077
kernel_copy  took 879 max, 871 min cycles per kb.
sum=0x44000077
kernel_copy  took 876 max, 871 min cycles per kb.
sum=0x44000077
kernelpii_copy  took 1019 max, 845 min cycles per kb.
sum=0x44000077
ntqpf_copy  took 2972 max, 229 min cycles per kb.
sum=0x44000077
ntqpfm_copy  took 248 max, 245 min cycles per kb.
sum=0x44000077
ntq_copy  took 460 max, 452 min cycles per kb.
sum=0x44000077
ntqpf2_copy  took 390 max, 340 min cycles per kb.
sum=0x44000077
Done

vda
timing_csum_copy.tar.bz2
Description: BZip2 compressed data
