Hello
The problem has been resolved by Rafal Malek from tellique, here his
reasoning:
Each network device supports sending and receiving of data packets by
default. But, in case of one-way satellite communication (used by your
TelliCast service) the device is not able to send data - it only
receives. So, if the DVB device (configured as a network interface) wants
to send data (e.g. broadcasts for its own subnet) the data cannot be
delivered and the packets' information is cached, the DVB driver does
_not_ free the buffer of the packets which should be sent.
Both the drivers from linuxtv.org and the open source driver from the
pent@value card have this error. Simply inserting a dev_kfree_skb(skb)
in the dvb_net_tx() (linuxtv.org) / PentaVal_start_xmit() (pent@value)
function solved this problem. Rafal Malek verified this for the
linuxtv.org driver and I did it for the pent@value driver. Here the
relevant values for a system with a pent@value card and nearly 23 days
uptime:
ip_dst_cache 270 270 256 18 18 1 : 252 126
skbuff_head_cache 795 795 256 53 53 1 : 252 126
size-2048 452 482 2048 238 241 1 : 60 30
Thank you to all the people who helped to find this error!
Unfortunatly there still is another leak, the dentry_cache value does not
go down. But this problem does not belong to this list.
Holger
On Fri, 14 Nov 2003, Holger Kiehl wrote:
> Sorry for cross posting this message, but it has been pointed out that
> linux-kernel is the wrong list for this question.
>
> Hello
>
> We have about 25 systems that receive data via a pci DVB card from satellite.
> The data is received through multiple muticast streams by some closed
> source software. On all systems we notice that the free memory decreases
> until in most cases the system are no longer reachable via network. They
> then constantly print out: dst cache overflow. But I also have noticed that
> some systems lock up hard, I assume this is because we just increase
> the ip_dst_cache in /proc/sys/net/ipv4/route/max_size to some very
> large value.
>
> I also know that the German Telekom and Eumetsat have the same problems
> and always have to reboot their systems. I also have reports from Austria
> and expect many more systems in Europe are effected.
>
> To get more information I have setup 3 systems with different kernels and
> hardware and noticed that over the time ip_dst_cache and skbuff_head_cache
> in /proc/slabinfo always increase. They never go down. Also one or more of
> the of the size-x values always increase depending on the kernel and DVB
> card being used. Here some more slabinfo details and hardware being used:
>
> System1 : PIII 450MHz, 256MB ram, Kernel 2.4.23-pre9, pent@value DVB card
> System2 : PII 350MHz, 384MB ram, Kernel 2.4.21, pent@value DVB card
> System3 : P4 2.4GHz with HT enabled, 1 GB ram (high mem enabled),
> Kernel 2.4.23-rc1 and libata patch, Nova-S DVB card
>
> Now the slabinfo data every 24 hours:
>
> System1:
>
> ip_dst_cache 647 672 160 27 28 1
> ip_dst_cache 7444 7464 160 311 311 1
> ip_dst_cache 14339 14352 160 598 598 1
> ip_dst_cache 21106 21120 160 880 880 1
> ip_dst_cache 28101 28104 160 1171 1171 1
>
> skbuff_head_cache 796 1008 160 41 42 1
> skbuff_head_cache 7588 7824 160 326 326 1
> skbuff_head_cache 14482 14688 160 612 612 1
> skbuff_head_cache 21258 21480 160 895 895 1
> skbuff_head_cache 28255 28416 160 1184 1184 1
>
> size-2048 685 968 2048 343 484 1
> size-2048 7483 7676 2048 3742 3838 1
> size-2048 14376 14398 2048 7188 7199 1
> size-2048 21146 21216 2048 10573 10608 1
> size-2048 28142 28292 2048 14071 14146 1
>
> System2:
>
> ip_dst_cache 9 48 160 1 2 1
> ip_dst_cache 7437 7464 160 311 311 1
> ip_dst_cache 15161 15168 160 632 632 1
> ip_dst_cache 18831 18840 160 785 785 1
>
> skbuff_head_cache 14 24 160 1 1 1
> skbuff_head_cache 11482 12168 160 500 507 1
> skbuff_head_cache 23312 23904 160 996 996 1
> skbuff_head_cache 28900 29640 160 1235 1235 1
>
> size-128 611 660 128 21 22 1
> size-128 11987 12210 128 402 407 1
> size-128 23800 23970 128 798 799 1
> size-128 29445 29670 128 983 989 1
>
>
> Slabinfo for every 12 hours and CONFIG_DEBUG_SLAB set:
>
> System3:
>
> ip_dst_cache 576 576 160 24 24 1 : 576 576
> 24 0 0 : 252 126 : 1946 48 1426 0
> ip_dst_cache 17760 17760 160 740 740 1 : 17760 17760
> 740 0 0 : 252 126 : 46553 1480 29557 0
> ip_dst_cache 35376 35376 160 1474 1474 1 : 35376 36403
> 1474 0 0 : 252 126 : 94140 3014 60309 0
> ip_dst_cache 51624 51624 160 2151 2151 1 : 51624 53444
> 2151 0 0 : 252 126 : 138864 4431 89547 0
>
> skbuff_head_cache 1311 1311 168 57 57 1 : 1311 79557
> 57 0 0 : 252 126 : 82108 735 81114 621
> skbuff_head_cache 18492 18492 168 804 804 1 : 18492 3300792
> 804 0 0 : 252 126 : 3320868 27658 3303434 26050
> skbuff_head_cache 36133 36133 168 1571 1571 1 : 36133 6652585
> 1583 12 0 : 252 126 : 6684139 55715 6649977 52420
> skbuff_head_cache 52371 52371 168 2277 2277 1 : 52371 9913620
> 2294 17 0 : 252 126 : 9957116 82923 9907545 78097
>
> size-8192 540 540 8192 540 540 2 : 540 3196
> 540 0 0 : 0 0 : 0 0 0 0
> size-8192 17736 17738 8192 17736 17738 2 : 17738 23194
> 17738 0 0 : 0 0 : 0 0 0 0
> size-8192 35367 35367 8192 35367 35367 2 : 35367 43715
> 35374 7 0 : 0 0 : 0 0 0 0
> size-8192 51596 51598 8192 51596 51598 2 : 51598 62824
> 51611 13 0 : 0 0 : 0 0 0 0
>
> size-2048 452 512 2048 240 256 1 : 512 75002
> 256 0 0 : 60 30 : 140293 2995 140145 2485
> size-2048 454 514 2048 238 257 1 : 514 3029044
> 257 0 0 : 60 30 : 5130850 101465 5130703 100953
> size-2048 456 486 2048 241 243 1 : 530 6113873
> 593 350 0 : 60 30 : 10457205 204975 10457530 203655
> size-2048 454 484 2048 239 242 1 : 542 9104228
> 1042 800 0 : 60 30 : 15398297 305608 15399447 303014
>
> size-128 2016 2268 136 78 81 1 : 2268 9125
> 81 0 0 : 252 126 : 23644 195 22128 56
> size-128 19096 19096 136 682 682 1 : 19096 26457
> 682 0 0 : 252 126 : 131136 1401 113018 58
> size-128 36708 36708 136 1311 1311 1 : 36708 59707
> 1317 6 0 : 252 126 : 255889 2833 220918 144
> size-128 52920 52920 136 1890 1890 1 : 52920 81855
> 1911 21 0 : 252 126 : 370264 4135 319786 153
>
> size-64 7844 7844 72 148 148 1 : 7844 7931
> 148 0 0 : 252 126 : 15660 253 9102 0
> size-64 18497 18497 72 349 349 1 : 18497 18584
> 349 0 0 : 252 126 : 110763 655 93784 0
> size-64 24963 24963 72 471 471 1 : 24963 32458
> 471 0 0 : 252 126 : 209402 1008 186275 0
> size-64 34503 34503 72 651 651 1 : 34503 48900
> 651 0 0 : 252 126 : 305026 1613 272674 0
>
> There is much more data available, the full slabinfo was taken every
> hour for each system. Additionally with the help of Jörn Engel I managed
> to setup System1 with gcov kernel patch and have all data available on
> an hourly basis until the system has reached "dst cache overflow". I have
> tried very hard to evaluate this data myself, but find that the linux
> network code is way beyond my c programming knowledge.
>
> Another thing noticed is that as the memory usage increases the systems
> become slower, when you log in on them and work there.
>
> Has anyone any suggestion of what else I can do to narrow down the problem?
>
> What I am also not sure if it is correct to assume the bug in the ipv4
> multicast implementation, or can it still be a driver problem? But I assume
> two completely different drivers make this very unlikely.
>
> Please, can someone help me to find the bug. I am willing to do any tests
> or provide more information.
>
> Thanks,
> Holger
>
> PS: Please cc me, since I am not on the list.
>
>
>
--
|