>>>>> "Denis" == Denis Vlasenko <vda@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> writes:
Denis> [please drop libc from CC:]
Denis> On 25 October 2002 05:48, Momchil Velikov wrote:
>>> Short conclusion:
>>> 1. It is possible to speed up csum routines for AMD processors
>>> by 30%.
>>> 2. It is possible to speed up csum_copy routines for both AMD
>>> and Intel three times or more.
>> Additional data point:
>>
>> Short summary:
>> 1. Checksum: kernelpii_csum is ~19% faster
>> 2. Copy: kernelpii_copy is ~6% faster
>>
>> Dual Pentium III, 1266 MHz, 512K cache, 2G SDRAM (133 MHz, ECC)
>>
>> The only changes I made were to decrease the buffer size to 1K (as I
>> think this is more representative of a network packet size, correct
>> me if I'm wrong) and to increase the runs to 1024. Max values are
>> indeed worthless.
Denis> Well, that makes it run entirely in L1 cache. This is unrealistic
Denis> for actual use. movntq is 3x faster when you hit RAM instead of L1.
Oops ...
Denis> You need to be more clever than that: generate pseudorandom
Denis> offsets in a large buffer and run on ~1K pieces of that buffer.
Here it is:
Csum benchmark program
buffer size: 1 K
Each test tried 1024 times, max and min CPU cycles are reported.
Please disregard max values. They are due to system interference only.
csum tests:
kernel_csum  took 8678 max, 808 min cycles per kb.
sum=0x400270e8
kernel_csum  took 941 max, 808 min cycles per kb.
sum=0x400270e8
kernel_csum  took 11604 max, 808 min cycles per kb.
sum=0x400270e8
kernelpii_csum  took 28839 max, 664 min cycles per kb.
sum=0x400270e8
kernelpiipf_csum  took 9163 max, 665 min cycles per kb.
sum=0x400270e8
pfm_csum  took 2788 max, 1470 min cycles per kb.
sum=0x400270e8
pfm2_csum  took 1179 max, 915 min cycles per kb.
sum=0x400270e8
copy tests:
kernel_copy  took 688 max, 263 min cycles per kb.
sum=0x400270e8
kernel_copy  took 456 max, 263 min cycles per kb.
sum=0x400270e8
kernel_copy  took 11241 max, 263 min cycles per kb.
sum=0x400270e8
kernelpii_copy  took 7635 max, 246 min cycles per kb.
sum=0x400270e8
ntqpf_copy  took 5349 max, 536 min cycles per kb.
sum=0x400270e8
ntqpfm_copy  took 769 max, 425 min cycles per kb.
sum=0x400270e8
ntq_copy  took 672 max, 469 min cycles per kb.
sum=0x400270e8
ntqpf2_copy  took 8000 max, 579 min cycles per kb.
sum=0x400270e8
Done
This ran on a 512K buffer (my cache size), choosing a 1K piece each
time. Making the buffer larger (2M, 4M) does not make any
difference.
And the modified 0main.c is attached.
~velco
