From owner-netdev@oss.sgi.com Wed Jan 3 05:50:53 2001
Received: by oss.sgi.com id ; Wed, 3 Jan 2001 05:50:44 -0800
Received: from mail.zmailer.org ([194.252.70.162]:46857 "EHLO zmailer.org") by oss.sgi.com with ESMTP id ; Wed, 3 Jan 2001 05:50:29 -0800
Received: (from localhost user: 'mea' uid#500 fake: STDIN (mea@zmailer.org)) by mail.zmailer.org id ; Wed, 3 Jan 2001 15:50:10 +0200
Date: Wed, 3 Jan 2001 15:50:10 +0200
From: Matti Aarnio
To: netdev@oss.sgi.com
Subject: Network statistics counters missing 'broadcasts'
Message-ID: <20010103155010.D12545@mea-ext.zmailer.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 3240
Lines: 74

A thing to think about. This is IMO 2.5 time stuff.

A generic network layer statistics thing is that there apparently is no counter for broadcasts, and thus usually multicasts and broadcasts do get folded into the same counter.

Partly this is interface problem, as we can't very easily change the /proc/net/dev file format which has device counters on one long line for each interface. (When a change was done long while ago, it caused lots of "ifconfig is broken" email, as some of us may recall.)

Perhaps we should create a *new* interface in parallel of that "dev", which gives each interface in separate tagged format, something like:

    interface: %16s \t rxpackets: %12ld \t rxbytes: %12ld etc.

allowing us to add new tags when deemed necessary.

One important detail would be that the interface must produce same size data globs for each interface so that when there is more than PAGESIZE-1k of text output, code can quickly get to index with new interesting stuff without need to do O(N^2) size iteration (like /proc/ksyms does now). (Adding new tags will need adjusted size, nothing else.)
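
A minimal sketch of the fixed-size record idea above, in C. The record length REC_LEN, the struct if_counters, and both function names are hypothetical and chosen only for illustration; this is not an existing kernel interface, and the tag layout is simplified from the format string in the mail. Every record is padded to the same length, so the record for interface number N starts at byte N * REC_LEN and a reader can seek to it directly.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record length, picked for illustration only. */
#define REC_LEN 96

/* Hypothetical subset of the per-device counters. */
struct if_counters {
    char name[17];
    unsigned long rx_packets;
    unsigned long rx_bytes;
};

/*
 * Format one interface as exactly REC_LEN bytes: tagged fields,
 * space padding, and a trailing newline. 'out' must have room for
 * REC_LEN + 1 bytes. Returns 0 on success, or -1 if the tagged text
 * does not fit in one record.
 */
int format_record(char *out, const struct if_counters *c)
{
    int n = snprintf(out, REC_LEN + 1,
                     "interface: %-16.16s\trxpackets: %12lu\trxbytes: %12lu",
                     c->name, c->rx_packets, c->rx_bytes);

    if (n < 0 || n > REC_LEN - 1)
        return -1;

    /* Pad with spaces so every record has the same size. */
    memset(out + n, ' ', (size_t)(REC_LEN - 1 - n));
    out[REC_LEN - 1] = '\n';
    out[REC_LEN] = '\0';
    return 0;
}

/* Byte offset of record number idx: constant time, no rescanning. */
long record_offset(long idx)
{
    return idx * (long)REC_LEN;
}
```

Adding a new tag in this scheme would only mean extending the format string and enlarging REC_LEN, which matches the remark that new tags need an adjusted size and nothing else.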
Another approach could be /proc/sys/net/***/IFNAME/stats style files ( or /proc/sys/net/stats/IFNAME files ) per interface. For counters I would suggest renaming 'collisions' to be 'tx_collisions', and also renaming 'multicast' as 'rx_multicast', then adding counters: - rx_broadcast - tx_multicast - tx_broadcast With VLAN work I have missed specifically separation of multicast and broadcast counters, but also the lack of transmit direction counters for the same. The set of statistics counters as of 2.4.0-prerelease: unsigned long rx_packets; /* total packets received */ unsigned long tx_packets; /* total packets transmitted */ unsigned long rx_bytes; /* total bytes received */ unsigned long tx_bytes; /* total bytes transmitted */ unsigned long rx_errors; /* bad packets received */ unsigned long tx_errors; /* packet transmit problems */ unsigned long rx_dropped; /* no space in linux buffers */ unsigned long tx_dropped; /* no space available in linux */ unsigned long multicast; /* multicast packets received */ unsigned long collisions; /* detailed rx_errors: */ unsigned long rx_length_errors; unsigned long rx_over_errors; /* receiver ring buff overflow */ unsigned long rx_crc_errors; /* recved pkt with crc error */ unsigned long rx_frame_errors; /* recv'd frame alignment error */ unsigned long rx_fifo_errors; /* recv'r fifo overrun */ unsigned long rx_missed_errors; /* receiver missed packet */ /* detailed tx_errors */ unsigned long tx_aborted_errors; unsigned long tx_carrier_errors; unsigned long tx_fifo_errors; unsigned long tx_heartbeat_errors; unsigned long tx_window_errors; /* for cslip etc */ unsigned long rx_compressed; unsigned long tx_compressed; From owner-netdev@oss.sgi.com Wed Jan 3 06:04:23 2001 Received: by oss.sgi.com id ; Wed, 3 Jan 2001 06:04:13 -0800 Received: from robur.slu.se ([130.238.98.12]:25361 "EHLO robur.slu.se") by oss.sgi.com with ESMTP id ; Wed, 3 Jan 2001 06:04:05 -0800 Received: (from robert@localhost) by robur.slu.se (8.8.7/8.8.7) id 
PAA13043; Wed, 3 Jan 2001 15:04:02 +0100 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14931.12626.704075.376912@robur.slu.se> Date: Wed, 3 Jan 2001 15:04:02 +0100 (CET) To: Matti Aarnio Cc: netdev@oss.sgi.com Subject: Network statistics counters missing 'broadcasts' In-Reply-To: <20010103155010.D12545@mea-ext.zmailer.org> References: <20010103155010.D12545@mea-ext.zmailer.org> X-Mailer: VM 6.75 under Emacs 19.34.1 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1737 Lines: 65 Matti Aarnio writes: > > A generic network layer statistics thing is that there apparently is no > counter for broadcasts, and thus usually multicasts and broadcasts do get > folded into the same counter. Hello! I may mention a little hack for average statistics and clearable stats in case anyone has need for it. It does not break ordinary device stats. ftp from: robur.slu.se:/pub/Linux/net-development/ends/ends.c Cheers. --ro / * In essential ENDS gives average statistics and clearable interface counters for all open network devices. The statistics is held in ends_device struct. ENDS creates and maintains the /proc/net/stat directory. /proc/net/stats/ends_control # Control various ends settings /proc/net/stats/ # For all open devices. Example: cat /proc/net/stats/eth0 [Output lines are truncated] Uptime Exceeds: 118c9 11 min 58 sec Last input: 0 NOW Last output: 0 NOW Counters age: fe34 10 min 50 sec Scan Interval: 20 sec Inter-| Receive | Transmit face |bytes packets errs drop fifo frame compressed multicast|bytes eth0: 168565 1777 0 0 0 0 0 0 123231 eth0: 1156 10 0 0 0 0 0 0 1079 First stats-line holds counters since last cleared. Second statsline is average statistics in unit/seconds. 
echo clear > /proc/net/stats/eth0 # Clears eth0 counters (line 1 above)
echo interval 30 > /proc/net/stats/eth0 # Output format is the same as in /proc/net/dev plus extra lines with meta information.

TODO: More flexible EWMA calculations.
*/

From owner-netdev@oss.sgi.com Wed Jan 3 06:15:44 2001
Received: by oss.sgi.com id ; Wed, 3 Jan 2001 06:15:34 -0800
Received: from cerberus.nemoto.ecei.tohoku.ac.jp ([130.34.199.67]:42757 "EHLO cerberus.nemoto.ecei.tohoku.ac.jp") by oss.sgi.com with ESMTP id ; Wed, 3 Jan 2001 06:15:11 -0800
Received: from localhost (yoshfuji@localhost [127.0.0.1]) by cerberus.nemoto.ecei.tohoku.ac.jp (8.9.3+3.2W/8.9.3/Debian 8.9.3-21) with ESMTP id XAA22130; Wed, 3 Jan 2001 23:14:21 +0900
To: matti.aarnio@zmailer.org
Cc: netdev@oss.sgi.com
Subject: Re: Network statistics counters missing 'broadcasts'
In-Reply-To: <20010103155010.D12545@mea-ext.zmailer.org>
References: <20010103155010.D12545@mea-ext.zmailer.org>
X-Mailer: Mew version 1.94 on Emacs 20.7 / Mule 4.1 (AOI)
X-URL: http://www.ecei.tohoku.ac.jp/%7Eyoshfuji/
X-Fingerprint: F7 31 65 99 5E B2 BB A7 15 15 13 23 18 06 A9 6F 57 00 6B 25
X-Pgp5-Key-Url: http://cerberus.nemoto.ecei.tohoku.ac.jp/%7Eyoshfuji/yoshfuji@ecei.tohoku.ac.jp.asc
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <20010103231421X.yoshfuji@ecei.tohoku.ac.jp>
Date: Wed, 03 Jan 2001 23:14:21 +0900
From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=
X-Dispatcher: imput version 990905(IM130)
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 865
Lines: 17

In article <20010103155010.D12545@mea-ext.zmailer.org> (at Wed, 3 Jan 2001 15:50:10 +0200), Matti Aarnio says:

> One important detail would be that the interface must produce same size
> data globs for each interface so that when there is more than PAGESIZE-1k
> of text output, code can quickly get to index with new interesting stuff
> without need to
do O(N^2) size iteration (like /proc/ksyms does now). > (Adding new tags will need adjusted size, nothing else.) > > Another approach could be /proc/sys/net/***/IFNAME/stats style files > ( or /proc/sys/net/stats/IFNAME files ) per interface. Just FYI: USAGI linux24 has such an interface /proc/net/dev_snmp/ for ipv6 per-device statistics. -- Hideaki YOSHIFUJI @ USAGI Project PGP5i FP: F731 6599 5EB2 BBA7 1515 1323 1806 A96F 5700 6B25 From owner-netdev@oss.sgi.com Wed Jan 3 06:27:14 2001 Received: by oss.sgi.com id ; Wed, 3 Jan 2001 06:27:04 -0800 Received: from a203-167-249-89.reverse.clear.net.nz ([203.167.249.89]:14093 "HELO metastasis.f00f.org") by oss.sgi.com with SMTP id ; Wed, 3 Jan 2001 06:26:38 -0800 Received: by metastasis.f00f.org (Postfix, from userid 1000) id 657389DAB; Thu, 4 Jan 2001 03:26:35 +1300 (NZDT) Date: Thu, 4 Jan 2001 03:26:35 +1300 From: Chris Wedgwood To: Matti Aarnio Cc: netdev@oss.sgi.com Subject: Re: Network statistics counters missing 'broadcasts' Message-ID: <20010104032635.A28111@metastasis.f00f.org> References: <20010103155010.D12545@mea-ext.zmailer.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20010103155010.D12545@mea-ext.zmailer.org>; from matti.aarnio@zmailer.org on Wed, Jan 03, 2001 at 03:50:10PM +0200 X-No-Archive: Yes Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1229 Lines: 34 On Wed, Jan 03, 2001 at 03:50:10PM +0200, Matti Aarnio wrote: A thing to think about. This is IMO 2.5 time stuff. A generic network layer statistics thing is that there apparently is no counter for broadcasts, and thus usually multicasts and broadcasts do get folded into the same counter. Partly this is interface problem, as we can't very easily change the /proc/net/dev file format which has device counters on one long line for each interface. 
(When a change was done long while ago, it caused lots of "ifconfig is broken" email, as some of us may recall.)

IMO we can and should. A lot of stuff in /proc is ugly and out-of-date for the current implementation of the kernel. 2.5.x is exactly the place to gratuitously break things in favor of something better.

I guess early in 2.5.x would be a good time for this; I would suggest people think _now_ of what could be improved or replaced in /proc so we get as much done as early as possible in 2.5.x.

FWIW: other OSs do this -- only it's easier for them because when you upgrade their kernel, you generally upgrade all the support infrastructure. Linux differs here and it is sometimes to its detriment.

--cw

From owner-netdev@oss.sgi.com Fri Jan 5 05:35:00 2001
Received: by oss.sgi.com id ; Fri, 5 Jan 2001 05:34:50 -0800
Received: from yamato.ccrle.nec.de ([195.37.70.1]:55559 "EHLO yamato.ccrle.nec.de") by oss.sgi.com with ESMTP id ; Fri, 5 Jan 2001 05:34:35 -0800
Received: from wallace.heidelberg.ccrle.nec.de (root@Wallace.heidelberg.ccrle.nec.de [192.168.102.1]) by yamato.ccrle.nec.de (8.10.1/8.10.1) with ESMTP id f05DZ1B01967 for ; Fri, 5 Jan 2001 14:35:01 +0100 (CET)
Received: from ccrle.nec.de (judiciary.heidelberg.ccrle.nec.de [192.168.102.83]) by wallace.heidelberg.ccrle.nec.de (8.9.3/8.9.3/SuSE Linux 8.9.3-0.1) with ESMTP id OAA01067 for ; Fri, 5 Jan 2001 14:34:32 +0100
Message-ID: <3A55CF54.8C890679@ccrle.nec.de>
Date: Fri, 05 Jan 2001 14:42:44 +0100
From: Karl Jonas
Organization: NEC CCRLE
X-Mailer: Mozilla 4.76 [en] (WinNT; U)
X-Accept-Language: en
MIME-Version: 1.0
To: netdev@oss.sgi.com
Subject: looking for info about struct tcp_opt
Content-Type: multipart/mixed; boundary="------------10BE7A57E9CA273E1D2F3244"
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 8380
Lines: 123

This is a multi-part message in MIME format.
--------------10BE7A57E9CA273E1D2F3244
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Dear all,

I am looking for some information about the members of the 'struct tcp_opt' in the linux kernel. Please help me with answers or references to the right place/person/doc.

Thanks in advance,
Karl Jonas

WHY DOES SND_WND DECREASE, RATHER THAN INCREASE?
According to rfc761, snd.wnd is the send window, which includes both the number of unacknowledged bytes and the number of bytes that may be transmitted before any ack arrives. From this, i expect snd_wnd to increase exponentially (slow start) and then to vary around some high value, maybe decreasing in the case of transmission errors. What i see is that the snd_wnd starts with a high value and decreases continuously, sometimes shows a sharp increase, sometimes stays around zero for a while. (my observations come from a transmission on a local machine, with tcpdump showing that it is error-free). Interestingly, the value of 'packets_out' alternates between 0 and 1, which is what i would expect. A trace is provided in the attached file which was made by a tcp sender transmitting 1kB packets as fast as possible to a receiver on the local machine.

WHAT IS THE UNIT OF SRTT?
According to its declaration in sock.h, it is the 'smoothed rtt << 3'. Does this mean that the value '8', which i observed, corresponds to a round trip time of about 1 ms?

WHAT IS THE UNIT OF SND_CWND?
In my trace, snd_cwnd initialises with 3 and increases by 1 for each packet sent (acked?), and slower after, corresponding perfectly to slow start / congestion avoidance. But what is the real size of the congestion window (maybe snd_cwnd * mss?)?

WHAT IS THE RTO UNIT?
I see a rto of 20. This seems to be quite short if it was 20 ms. Does linux count in 10's of ms here?
-- Karl Jonas, NEC Research Labs Adenauerplatz 6, 69115 Heidelberg EMail: karl.jonas@ieee.org http://www.ccrle.nec.de/heidelberg/index.html Tel.: +49.(0)6221.13 70 819 Fax: +49.(0)6221.905 11 55 --------------10BE7A57E9CA273E1D2F3244 Content-Type: text/plain; charset=us-ascii; name="output.txt" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="output.txt" pnum rcv_nxt snd_nxt snd_una rcv_tstamp lrcvtime srtt snd_wnd max_window snd_cwnd rto packets_out rcv_wnd 1 923103371 930938866 930938866 559095 0 8 30072 31072 3 20 0 31072 2 923103371 930939866 930939866 559095 0 8 29072 31072 4 20 0 31072 3 923103371 930940866 930940866 559095 0 8 28072 31072 5 20 0 31072 4 923103371 930941866 930941866 559096 0 8 27072 31072 6 20 0 31072 5 923103371 930942866 930942866 559096 0 8 26072 31072 7 20 0 31072 6 923103371 930943866 930943866 559096 0 8 25072 31072 8 20 0 31072 7 923103371 930944866 930944866 559096 0 8 24072 31072 9 20 0 31072 8 923103371 930945866 930945866 559096 0 8 23072 31072 10 20 0 31072 9 923103371 930946866 930946866 559096 0 8 22072 31072 11 20 0 31072 10 923103371 930947866 930947866 559096 0 8 21072 31072 12 20 0 31072 11 923103371 930948866 930948866 559096 0 8 20072 31072 13 20 0 31072 12 923103371 930949866 930949866 559097 0 8 19072 31072 14 20 0 31072 13 923103371 930950866 930950866 559097 0 8 18072 31072 15 20 0 31072 14 923103371 930951866 930951866 559097 0 8 17072 31072 16 20 0 31072 15 923103371 930952866 930952866 559097 0 8 16072 31072 17 20 0 31072 16 923103371 930953866 930953866 559097 0 8 15072 31072 18 20 0 31072 17 923103371 930954866 930954866 559097 0 8 14072 31072 19 20 0 31072 18 923103371 930955866 930955866 559098 0 8 13072 31072 20 20 0 31072 19 923103371 930956866 930956866 559098 0 8 12072 31072 21 20 0 31072 20 923103371 930957866 930957866 559098 0 8 11072 31072 22 20 0 31072 21 923103371 930958866 930958866 559098 0 8 10072 31072 23 20 0 31072 22 923103371 930959866 930959866 559098 0 8 9072 31072 24 20 
0 31072 23 923103371 930960866 930960866 559098 0 8 8072 31072 25 20 0 31072 24 923103371 930961866 930961866 559098 0 8 7072 31072 26 20 0 31072 25 923103371 930962866 930962866 559099 0 8 6072 31072 27 20 0 31072 26 923103371 930963866 930963866 559099 0 8 5072 31072 28 20 0 31072 27 923103371 930964866 930964866 559099 0 8 4072 31072 29 20 0 31072 28 923103371 930965866 930965866 559099 0 8 3072 31072 30 20 0 31072 29 923103371 930966866 930966866 559099 0 8 2072 31072 31 20 0 31072 30 923103371 930967866 930967866 559099 0 8 1072 31072 32 20 0 31072 31 923103371 930968866 930968866 559099 0 8 72 31072 33 20 0 31072 32 923103371 930968866 930968866 559099 0 8 72 31072 33 20 0 31072 33 923103371 930968866 930968866 559099 0 8 72 31072 33 20 0 31072 34 923103371 930968866 930968866 559099 0 8 72 31072 33 20 0 31072 35 923103371 930968866 930968866 559099 0 8 72 31072 33 20 0 31072 36 923103371 930968866 930968866 559099 0 8 72 31072 33 20 0 31072 37 923103371 930968866 930968866 559099 0 8 72 31072 33 20 0 31072 38 923103371 930968866 930968866 559099 0 8 72 31072 33 20 0 31072 39 923103371 930968866 930968866 559099 0 8 72 31072 33 20 0 31072 40 923103371 930968866 930968866 559099 0 8 72 31072 33 20 0 31072 41 923103371 930968866 930968866 559099 0 8 72 31072 33 20 0 31072 42 923103371 930968866 930968866 559099 0 8 72 31072 33 20 0 31072 43 923103371 930972738 930968866 559100 0 8 3872 31072 33 20 1 31072 44 923103371 930972738 930968866 559100 0 8 3872 31072 33 20 1 31072 45 923103371 930972738 930968866 559100 0 8 3872 31072 33 20 1 31072 46 923103371 930972738 930968866 559100 0 8 3872 31072 33 20 1 31072 47 923103371 930976610 930972738 559100 0 8 3872 31072 34 20 1 31072 48 923103371 930976610 930976610 559101 0 8 0 31072 35 20 0 31072 49 923103371 930976610 930976610 559101 0 8 0 31072 35 20 0 31072 50 923103371 930976610 930976610 559101 0 8 0 31072 35 20 0 31072 --------------10BE7A57E9CA273E1D2F3244-- From owner-netdev@oss.sgi.com Fri Jan 5 06:09:50 
2001
Received: by oss.sgi.com id ; Fri, 5 Jan 2001 06:09:41 -0800
Received: from colin.muc.de ([193.149.48.1]:43536 "HELO colin.muc.de") by oss.sgi.com with SMTP id ; Fri, 5 Jan 2001 06:09:14 -0800
Received: by colin.muc.de id <140550-3>; Fri, 5 Jan 2001 15:08:53 +0100
Message-ID: <20010105150849.45985@colin.muc.de>
From: Andi Kleen
To: Karl Jonas
Cc: netdev@oss.sgi.com
Subject: Re: looking for info about struct tcp_opt
References: <3A55CF54.8C890679@ccrle.nec.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.88e
In-Reply-To: <3A55CF54.8C890679@ccrle.nec.de>; from Karl Jonas on Fri, Jan 05, 2001 at 02:35:54PM +0100
Date: Fri, 5 Jan 2001 15:08:49 +0100
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 1022
Lines: 36

On Fri, Jan 05, 2001 at 02:35:54PM +0100, Karl Jonas wrote:
> WHAT IS THE UNIT OF SRTT?
> According to its declaration in sock.h, it is the 'smoothed rtt << 3' .
> Does this mean, that the value '8', which i observed, corresponds to a
> round trip time of about 1 ms ?

In the linux kernel time is usually counted in jiffies (=timer interrupts), on i386 that's 10ms, e.g. on alpha it is ~1ms. Unit is jiffies<<3.

>
> WHAT IS THE UNIT OF SND_CWND?
> In my trace, snd_cwnd initialises with 3 and increases by 1 for each
> packet
> sent (acked?), and slower after, corresponding perfectly to slow start /
        acked
> congestion avoidance. But what is the real size of the congestion window
> (maybe snd_cwnd * mss ?) ?

It's packets. 2.4 switched to counting it in bytes, because the packet counting has some problems. factor is *average of your packet size.

>
> WHAT IS THE RTO UNIT?
> I see a rto of 20. This seems to be quite short if it was 20 ms. Does
> linux
> count in 10's of ms here?

Yes, in jiffies.
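
Given the units described in the reply above (srtt stored as jiffies << 3, rto in plain jiffies), the values from the trace can be converted as in the sketch below. It assumes the 10 ms i386 tick mentioned above, i.e. 100 ticks per second. TICKS_PER_SEC and the helper names are hypothetical and are not kernel APIs.

```c
/* Assumption: 100 timer ticks per second, i.e. the 10 ms i386 jiffy. */
#define TICKS_PER_SEC 100UL

/* Convert a plain jiffies count (for example rto) to milliseconds. */
unsigned long ticks_to_ms(unsigned long ticks)
{
    return ticks * 1000UL / TICKS_PER_SEC;
}

/*
 * srtt is kept scaled by 8 (jiffies << 3), so divide by 8 as well.
 * The division is done last to keep sub-jiffy precision.
 */
unsigned long srtt_to_ms(unsigned long srtt_scaled)
{
    return srtt_scaled * 1000UL / (8UL * TICKS_PER_SEC);
}
```

Under that assumption the observed srtt of 8 corresponds to about 10 ms rather than 1 ms, and the observed rto of 20 to about 200 ms.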
-Andi

From owner-netdev@oss.sgi.com Fri Jan 5 16:25:26 2001
Received: by oss.sgi.com id ; Fri, 5 Jan 2001 16:25:06 -0800
Received: from foobar.napster.com ([64.124.41.10]:57616 "EHLO foobar.napster.com") by oss.sgi.com with ESMTP id ; Fri, 5 Jan 2001 16:24:49 -0800
Received: from wagner.napster.com (mail.napster.com [63.108.185.112]) by foobar.napster.com (8.9.3/8.9.3) with ESMTP id QAA09876 for ; Fri, 5 Jan 2001 16:24:44 -0800
Received: from napster.com (gw.napster.com [63.108.185.120]) by wagner.napster.com (8.9.3/8.9.3) with SMTP id QAA16577 for ; Fri, 5 Jan 2001 16:24:43 -0800
Message-ID: <3A5665CB.30AB2158@napster.com>
Date: Fri, 05 Jan 2001 16:24:43 -0800
From: Jordan Mendelson
Organization: Napster, Inc.
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i686)
X-Accept-Language: en
MIME-Version: 1.0
To: netdev@oss.sgi.com
Subject: Mediocre TCP Performance 2.4.0 <-> Win98SE PPP
Content-Type: multipart/mixed; boundary="------------867E021D4EAF7445854A92E7"
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 7569
Lines: 114

This is a multi-part message in MIME format.
--------------867E021D4EAF7445854A92E7
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

I reported this a couple of months back with 2.4.0-test10 and, hoping that it might have been fixed, I tested 2.4.0 proper with no luck.

We are seeing a huge performance problem between 2.4.0 SMP and Windows users dialed up using compressed PPP. The Linux box is a standard valinux dual p3 running 2.4.0 without TCP ECN and without mmapped network IO connected to a Cisco 6509 which is then connected to the Internet. The Windows machine is running 98SE 4.10.2222 dialed up to an ISP over its built-in modem.

Basically, the Windows machine attempts to connect to the Linux box and exchanges a few packets for login. The Linux box then dumps roughly 2.5K of data to the Windows machine via ~30 individual send() calls.
The Windows machine however fails to get the data or it ends up being corrupted and retransmits are triggered. With the exact same hardware and software running, but using 2.2.16 instead of 2.4.0 on the Linux box, everything appears to be ok. I've attached two tcpdumps... one from a connection to 2.4.0 and one to 2.2.16. These are actually from the original mail I posted with 2.4.0-test10. Unfortunatly, until this gets fixed we can't roll out 2.4.0 on our production servers as a good number of users complain their connections to the Napster service are extremely slow while it's running even though the performance appeared to be significantly better the last time we tested it. Jordan --------------867E021D4EAF7445854A92E7 Content-Type: text/plain; charset=us-ascii; name="240.log" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="240.log" 22:00:39.625351 209.179.245.186.1092 > 64.124.41.179.8888: S 4155530:4155530(0) win 8192 (DF) 22:00:39.625437 64.124.41.179.8888 > 209.179.245.186.1092: S 1301092473:1301092473(0) ack 4155531 win 5840 (DF) 22:00:39.887133 209.179.245.186.1092 > 64.124.41.179.8888: . ack 1 win 8576 (DF) 22:00:39.887969 209.179.245.186.1092 > 64.124.41.179.8888: . ack 1 win 65280 (DF) 22:00:39.888951 209.179.245.186.1092 > 64.124.41.179.8888: P 1:44(43) ack 1 win 65280 (DF) 22:00:39.888964 64.124.41.179.8888 > 209.179.245.186.1092: . 
ack 44 win 5840 (DF) 22:00:39.991515 64.124.41.179.8888 > 209.179.245.186.1092: P 1:21(20) ack 44 win 5840 (DF) 22:00:39.991660 64.124.41.179.8888 > 209.179.245.186.1092: P 21:557(536) ack 44 win 5840 (DF) 22:00:42.991490 64.124.41.179.8888 > 209.179.245.186.1092: P 1:21(20) ack 44 win 5840 (DF) 22:00:43.180946 209.179.245.186.1092 > 64.124.41.179.8888: P 44:56(12) ack 21 win 65260 (DF) 22:00:43.180997 64.124.41.179.8888 > 209.179.245.186.1092: P 21:557(536) ack 44 win 5840 (DF) 22:00:43.181025 64.124.41.179.8888 > 209.179.245.186.1092: P 557:1093(536) ack 56 win 5840 (DF) 22:00:45.685143 209.179.245.186.1092 > 64.124.41.179.8888: P 44:456(412) ack 21 win 65260 (DF) 22:00:45.685204 64.124.41.179.8888 > 209.179.245.186.1092: . ack 456 win 6432 (DF) 22:00:49.171046 64.124.41.179.8888 > 209.179.245.186.1092: P 21:557(536) ack 456 win 6432 (DF) 22:00:49.470193 209.179.245.186.1092 > 64.124.41.179.8888: . ack 557 win 65280 (DF) 22:00:49.470233 64.124.41.179.8888 > 209.179.245.186.1092: P 557:1093(536) ack 456 win 6432 (DF) 22:00:49.470248 64.124.41.179.8888 > 209.179.245.186.1092: P 1093:1629(536) ack 456 win 6432 (DF) 22:01:01.461056 64.124.41.179.8888 > 209.179.245.186.1092: P 557:1093(536) ack 456 win 6432 (DF) 22:01:01.755362 209.179.245.186.1092 > 64.124.41.179.8888: . ack 1093 win 65280 (DF) 22:01:01.755428 64.124.41.179.8888 > 209.179.245.186.1092: P 1093:1629(536) ack 456 win 6432 (DF) 22:01:01.755451 64.124.41.179.8888 > 209.179.245.186.1092: P 1629:1825(196) ack 456 win 6432 (DF) 22:01:25.751048 64.124.41.179.8888 > 209.179.245.186.1092: P 1093:1629(536) ack 456 win 6432 (DF) 22:01:26.171932 209.179.245.186.1092 > 64.124.41.179.8888: . ack 1629 win 65280 (DF) 22:01:26.171979 64.124.41.179.8888 > 209.179.245.186.1092: P 1629:1825(196) ack 456 win 6432 (DF) 22:02:14.171052 64.124.41.179.8888 > 209.179.245.186.1092: P 1629:1825(196) ack 456 win 6432 (DF) 22:02:14.499920 209.179.245.186.1092 > 64.124.41.179.8888: . 
ack 1825 win 65084 (DF) 22:02:14.499944 64.124.41.179.8888 > 209.179.245.186.1092: P 1825:1847(22) ack 456 win 6432 (DF) 22:02:16.168708 209.179.245.186.1092 > 64.124.41.179.8888: F 456:456(0) ack 1825 win 65084 (DF) 22:02:16.181061 64.124.41.179.8888 > 209.179.245.186.1092: . ack 457 win 6432 (DF) 22:02:16.281724 64.124.41.179.8888 > 209.179.245.186.1092: F 1847:1847(0) ack 457 win 6432 (DF) 22:02:16.477943 209.179.245.186.1092 > 64.124.41.179.8888: . ack 1825 win 65084 (DF) 22:03:50.491063 64.124.41.179.8888 > 209.179.245.186.1092: P 1825:1847(22) ack 457 win 6432 (DF) 22:03:50.680141 209.179.245.186.1092 > 64.124.41.179.8888: R 4155987:4155987(0) win 0 (DF) --------------867E021D4EAF7445854A92E7 Content-Type: text/plain; charset=us-ascii; name="2216.log" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="2216.log" 22:00:01.684927 209.179.245.186.1091 > 64.124.41.136.8888: S 4033171:4033171(0) win 8192 (DF) 22:00:01.685021 64.124.41.136.8888 > 209.179.245.186.1091: S 1261602556:1261602556(0) ack 4033172 win 32696 (DF) 22:00:01.916120 209.179.245.186.1091 > 64.124.41.136.8888: . ack 1 win 8576 (DF) 22:00:01.916191 209.179.245.186.1091 > 64.124.41.136.8888: . ack 1 win 65280 (DF) 22:00:01.916981 209.179.245.186.1091 > 64.124.41.136.8888: P 1:44(43) ack 1 win 65280 (DF) 22:00:01.917032 64.124.41.136.8888 > 209.179.245.186.1091: . ack 44 win 32696 (DF) 22:00:02.121143 64.124.41.136.8888 > 209.179.245.186.1091: P 1:21(20) ack 44 win 32696 (DF) 22:00:02.121279 64.124.41.136.8888 > 209.179.245.186.1091: P 21:349(328) ack 44 win 32696 (DF) 22:00:02.327779 209.179.245.186.1091 > 64.124.41.136.8888: . 
ack 349 win 64932 (DF) 22:00:02.327813 64.124.41.136.8888 > 209.179.245.186.1091: P 349:885(536) ack 44 win 32696 (DF) 22:00:02.327825 64.124.41.136.8888 > 209.179.245.186.1091: P 885:1408(523) ack 44 win 32696 (DF) 22:00:02.328909 209.179.245.186.1091 > 64.124.41.136.8888: P 44:56(12) ack 349 win 64932 (DF) 22:00:02.340110 64.124.41.136.8888 > 209.179.245.186.1091: . ack 56 win 32696 (DF) 22:00:02.605282 209.179.245.186.1091 > 64.124.41.136.8888: P 56:456(400) ack 885 win 65280 (DF) 22:00:02.608462 209.179.245.186.1091 > 64.124.41.136.8888: . ack 1408 win 64757 (DF) 22:00:02.608533 64.124.41.136.8888 > 209.179.245.186.1091: P 1408:1420(12) ack 456 win 32296 (DF) 22:00:02.766833 64.124.41.136.8888 > 209.179.245.186.1091: P 1420:1689(269) ack 456 win 32696 (DF) 22:00:02.889731 209.179.245.186.1091 > 64.124.41.136.8888: . ack 1420 win 64745 (DF) 22:00:03.091796 209.179.245.186.1091 > 64.124.41.136.8888: . ack 1689 win 65280 (DF) 22:00:03.091829 64.124.41.136.8888 > 209.179.245.186.1091: P 1689:1822(133) ack 456 win 32696 (DF) 22:00:03.388700 209.179.245.186.1091 > 64.124.41.136.8888: . ack 1822 win 65147 (DF) 22:00:04.442114 209.179.245.186.1091 > 64.124.41.136.8888: F 456:456(0) ack 1822 win 65147 (DF) 22:00:04.442178 64.124.41.136.8888 > 209.179.245.186.1091: . ack 457 win 32696 (DF) 22:00:04.502433 64.124.41.136.8888 > 209.179.245.186.1091: F 1822:1822(0) ack 457 win 32696 (DF) 22:00:04.689026 209.179.245.186.1091 > 64.124.41.136.8888: . ack 1823 win 65147 (DF) --------------867E021D4EAF7445854A92E7-- From owner-netdev@oss.sgi.com Fri Jan 5 16:32:56 2001 Received: by oss.sgi.com id ; Fri, 5 Jan 2001 16:32:36 -0800 Received: from pizda.ninka.net ([216.101.162.242]:17289 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Fri, 5 Jan 2001 16:32:18 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id QAA07335; Fri, 5 Jan 2001 16:15:04 -0800 Date: Fri, 5 Jan 2001 16:15:04 -0800 Message-Id: <200101060015.QAA07335@pizda.ninka.net> From: "David S. 
Miller" To: jordy@napster.com CC: netdev@oss.sgi.com In-reply-to: <3A5665CB.30AB2158@napster.com> (message from Jordan Mendelson on Fri, 05 Jan 2001 16:24:43 -0800) Subject: Re: Mediocre TCP Performance 2.4.0 <-> Win98SE PPP References: <3A5665CB.30AB2158@napster.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 577 Lines: 17 Date: Fri, 05 Jan 2001 16:24:43 -0800 From: Jordan Mendelson I reported this a couple of months back with 2.4.0-test10 and hoping that it might have been fixed I tested 2.4.0 proper with no luck. Jordan, if I recall correctly, last time we were analyzing dumps about this problem from you Alexey and myself both had concluded that Linux was doing things perfectly fine and therefore there is nothing to "fix" on the Linux side. If I noted a different conclusion at that time, please let me know. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Fri Jan 5 16:49:56 2001 Received: by oss.sgi.com id ; Fri, 5 Jan 2001 16:49:46 -0800 Received: from foobar.napster.com ([64.124.41.10]:45317 "EHLO foobar.napster.com") by oss.sgi.com with ESMTP id ; Fri, 5 Jan 2001 16:49:32 -0800 Received: from wagner.napster.com (mail.napster.com [63.108.185.112]) by foobar.napster.com (8.9.3/8.9.3) with ESMTP id QAA10710; Fri, 5 Jan 2001 16:49:27 -0800 Received: from napster.com (gw.napster.com [63.108.185.120]) by wagner.napster.com (8.9.3/8.9.3) with SMTP id QAA17642; Fri, 5 Jan 2001 16:49:26 -0800 Message-ID: <3A566B96.BCAC86E@napster.com> Date: Fri, 05 Jan 2001 16:49:26 -0800 From: Jordan Mendelson Organization: Napster, Inc. X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i686) X-Accept-Language: en MIME-Version: 1.0 To: "David S. 
Miller"
CC: netdev@oss.sgi.com
Subject: Re: Mediocre TCP Performance 2.4.0 <-> Win98SE PPP
References: <3A5665CB.30AB2158@napster.com> <200101060015.QAA07335@pizda.ninka.net>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 1491
Lines: 37

"David S. Miller" wrote:
>
> Date: Fri, 05 Jan 2001 16:24:43 -0800
> From: Jordan Mendelson
>
> I reported this a couple of months back with 2.4.0-test10 and
> hoping that it might have been fixed I tested 2.4.0 proper with no
> luck.
>
> Jordan, if I recall correctly, last time we were analyzing dumps about
> this problem from you Alexey and myself both had concluded that Linux
> was doing things perfectly fine and therefore there is nothing to
> "fix" on the Linux side.
>
> If I noted a different conclusion at that time, please let me know.

I think the general conclusion was that the Linux side looked like it was doing the right thing and it was the Windows side or the upstream ISP that was causing the problem. During this latest test, a different ISP was used just to make sure (Earthlink) and came up with the same problem.

Unfortunately, the number of users who reported having this problem is too great for me to ignore and I'd imagine anyone running a Web server would have the exact same problems with Win98 clients dialed up over PPP w/ compression.

My best guess is that the problem lies with the size of the packets themselves as the 2.4.0 trace shows the Windows side choking on the 536 byte packets repeatedly and small writes going through while the 2.2.16 trace shows really only smaller writes. It was mentioned before that 2.2.x packetizes write calls, so this might make sense. Of course, it's just a guess.
Jordan From owner-netdev@oss.sgi.com Fri Jan 5 19:09:07 2001 Received: by oss.sgi.com id ; Fri, 5 Jan 2001 19:08:57 -0800 Received: from sith.mimuw.edu.pl ([193.0.97.1]:10501 "HELO sith.mimuw.edu.pl") by oss.sgi.com with SMTP id ; Fri, 5 Jan 2001 19:08:29 -0800 Received: (qmail 23046 invoked by uid 601); 6 Jan 2001 03:11:03 -0000 Date: Sat, 6 Jan 2001 04:11:03 +0100 From: Jan Rekorajski To: Jordan Mendelson Cc: netdev@oss.sgi.com Subject: Re: Mediocre TCP Performance 2.4.0 <-> Win98SE PPP Message-ID: <20010106041103.B22665@sith.mimuw.edu.pl> Mail-Followup-To: Jan Rekorajski , Jordan Mendelson , netdev@oss.sgi.com References: <3A5665CB.30AB2158@napster.com> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-2 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.2.5i In-Reply-To: <3A5665CB.30AB2158@napster.com>; from jordy@napster.com on Fri, Jan 05, 2001 at 04:24:43PM -0800 X-Operating-System: Linux 2.4.0-prerelease i686 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1223 Lines: 27 On Fri, 05 Jan 2001, Jordan Mendelson wrote: > > I reported this a couple of months back with 2.4.0-test10 and hoping > that it might have been fixed I tested 2.4.0 proper with no luck. > > We are seeing a huge performance problem between 2.4.0 SMP and Windows > users dialed up using compressed PPP. The Linux box is a standard > valinux dual p3 running 2.4.0 without TCP ECN and without mmapped > network IO connected to a Cisco 6509 which is then connected to the > Internet. The Windows machine is a running 98SE 4.10.2222 dialed up to > an ISP over it's built in modem. Try disabling VJ TCP/IP header compression on Windows box. I think I have the same problem, and when connecting to 2.4 server I see these in logs (this is linux client): Jan 3 01:14:17 home kernel: PPP: VJ decompression error Jan 3 01:14:37 home kernel: PPP: VJ decompression error Why this happens? 
This occurs when I try to connect to a 2.4 server _only_ :( Same hardware on both sides but 2.2 on the server and no problems. Jan -- Jan Rêkorajski | ALL SUSPECTS ARE GUILTY. PERIOD! bagginsmimuw.edu.pl | OTHERWISE THEY WOULDN'T BE SUSPECTS, WOULD THEY? BOFH, MANIAC | -- TROOPS by Kevin Rubio From owner-netdev@oss.sgi.com Fri Jan 5 19:35:46 2001 Received: by oss.sgi.com id ; Fri, 5 Jan 2001 19:35:37 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:6564 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Fri, 5 Jan 2001 19:35:24 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id WAA19172; Fri, 5 Jan 2001 22:34:37 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Fri, 5 Jan 2001 22:34:37 -0500 (EST) From: jamal To: Jordan Mendelson cc: Subject: Re: Mediocre TCP Performance 2.4.0 <-> Win98SE PPP In-Reply-To: <3A5665CB.30AB2158@napster.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 920 Lines: 28 I am not on lk so i missed your initial post; i apologize if i am repeating things you might have tried out. It does not make sense that it works in 2.2.16 but not 2.4.0. How reproducible is this? The only visible difference between the two appears to be the mss. It would be a shocker if that caused problems. They both seem to be happily SACKing. From the traces it appears 64.124.41.136 is Linux, no? And your traces must be from the Linux side, right? Why are those packets being dropped only when using 2.4.0? Some suggestions: Force the MTU on linux to be 536+40. If that doesn't fix it turn off SACK. If that doesn't fix it i would start suspecting the NIC. What NIC do you have on Linux? Might be alignment screw ups. If that doesn't fix it, i would start suspecting windows.
Capture traces on both sides (windows and Linux side) for both kernels preferably using the -w option and post the URL. cheers, jamal From owner-netdev@oss.sgi.com Sat Jan 6 12:30:50 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 12:30:41 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:63492 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 12:30:16 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id OAA20655; Sat, 6 Jan 2001 14:33:27 -0700 Message-ID: <3A578F27.D2A9DF52@candelatech.com> Date: Sat, 06 Jan 2001 14:33:27 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: linux-kernel , "netdev@oss.sgi.com" Subject: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 16351 Lines: 445 I'm hoping that I can get a few comments on this code. It was added to (significantly) speed up things like 'ifconfig -a' when running with 4000 or so VLAN devices. It should also help other instances with lots of (virtual) devices, like FrameRelay, ATM, and possibly virtual IP interfaces. It probably won't help 'normal' users much, and in its final form, should probably be a selectable option in the config process. Anyway, let me know what you think! Patch to net/core/dev.c of 2.4.0 fame: *** ../../../linux/net/core/dev.c Mon Dec 11 14:29:35 2000 --- dev.c Sat Jan 6 14:14:10 2001 *************** *** 1,3 **** ! /* * NET3 Protocol independent device support routines. * --- 1,3 ---- ! /* -*- linux-c -*- * NET3 Protocol independent device support routines. * *************** *** 132,138 **** * * Why 16. Because with 16 the only overlap we get on a hash of the !
* low nibble of the protocol value is RARP/SNAP/X.25. * * 0800 IP * 0001 802.3 * 0002 AX.25 --- 133,147 ---- * * Why 16. Because with 16 the only overlap we get on a hash of the ! * low nibble of the protocol value is RARP/SNAP/X.25. ! * ! * NOTE: That is no longer true with the addition of VLAN tags. Not ! * sure which should go first, but I bet it won't make much ! * difference if we are running VLANs. The good news is that ! * this protocol won't be in the list unless compiled in, so ! * the average user (w/out VLANs) will not be adversly affected. ! * --BLG * * 0800 IP + * 8100 802.1Q VLAN * 0001 802.3 * 0002 AX.25 *************** *** 179,182 **** --- 188,435 ---- + #define BENS_FAST_DEV_LOOKUP + #ifdef BENS_FAST_DEV_LOOKUP + /* Fash Device Lookup code. Should give much better than + * linear speed when looking for devices by idx or name. + * --Ben (greearb@candelatech.com) + */ + #define FDL_HASH_LEN 256 + + /* #define FDL_DEBUG */ + + struct dev_hash_node { + struct net_device* dev; + struct dev_hash_node* next; + }; + + struct dev_hash_node* fdl_name_base[FDL_HASH_LEN];/* hashed by name */ + struct dev_hash_node* fdl_idx_base[FDL_HASH_LEN]; /* hashed by index */ + int fdl_initialized_yet = 0; + + /* TODO: Make these inline methods */ + /* Nice cheesy little hash method to be used on device-names (eth0, ppp0, etc) */ + int fdl_calc_name_idx(const char* dev_name) { + int tmp = 0; + int i; + #ifdef FDL_DEBUG + printk(KERN_ERR "fdl_calc_name_idx, name: %s\n", dev_name); + #endif + for (i = 0; dev_name[i]; i++) { + tmp += (int)(dev_name[i]); + } + if (i > 3) { + tmp += (dev_name[i-2] * 10); /* might add a little spread to the hash */ + tmp += (dev_name[i-3] * 100); /* might add a little spread to the hash */ + } + #ifdef FDL_DEBUG + printk(KERN_ERR "fdl_calc_name_idx, rslt: %i\n", (int)(tmp % FDL_HASH_LEN)); + #endif + return (tmp % FDL_HASH_LEN); + } + + int fdl_calc_index_idx(const int ifindex) { + return (ifindex % FDL_HASH_LEN); + } + + + /* Better have a 
lock on the dev_base before calling this... */ + int __fdl_ensure_init(void) { + #ifdef FDL_DEBUG + printk(KERN_ERR "__fdl_ensure_init, enter\n"); + #endif + if (! fdl_initialized_yet) { + /* only do this once.. */ + int i; + int idx = 0; /* into the hash table */ + struct net_device* dev = dev_base; + struct dev_hash_node* dhn; + + #ifdef FDL_DEBUG + printk(KERN_ERR "__fdl_ensure_init, doing real work..."); + #endif + + fdl_initialized_yet = 1; /* it has been attempted at least... */ + + for (i = 0; iname, idx); + #endif + /* first, take care of the hash-by-name */ + idx = fdl_calc_name_idx(dev->name); + dhn = kmalloc(sizeof(struct dev_hash_node), GFP_ATOMIC); + if (dhn) { + dhn->dev = dev; + dhn->next = fdl_name_base[idx]; + fdl_name_base[idx] = dhn; + } + else { + /* Nasty..couldn't get memory... */ + return -ENOMEM; + } + + /* now, do the hash-by-idx */ + idx = fdl_calc_index_idx(dev->ifindex); + dhn = kmalloc(sizeof(struct dev_hash_node), GFP_ATOMIC); + if (dhn) { + dhn->dev = dev; + dhn->next = fdl_idx_base[idx]; + fdl_idx_base[idx] = dhn; + } + else { + /* Nasty..couldn't get memory... */ + return -ENOMEM; + } + + dev = dev->next; + } + fdl_initialized_yet = 2; /* initialization actually worked */ + } + #ifdef FDL_DEBUG + printk(KERN_ERR "__fdl_ensure_init, end, fdl_initialized_yet: %i\n", fdl_initialized_yet); + #endif + if (fdl_initialized_yet == 2) { + return 0; + } + else { + return -1; + } + }/* fdl_ensure_init */ + + + /* called from register_netdevice, assumes dev is locked, and that no one + * will be calling __find_dev_by_name before this exits.. etc. 
+ */ + int __fdl_register_netdevice(struct net_device* dev) { + if (__fdl_ensure_init() == 0) { + /* first, take care of the hash-by-name */ + int idx = fdl_calc_name_idx(dev->name); + struct dev_hash_node* dhn = kmalloc(sizeof(struct dev_hash_node), GFP_ATOMIC); + + #ifdef FDL_DEBUG + printk(KERN_ERR "__fdl_register_netdevice, dev: %p dev: %s, idx: %i", dev, dev->name, idx); + #endif + + if (dhn) { + dhn->dev = dev; + dhn->next = fdl_name_base[idx]; + fdl_name_base[idx] = dhn; + } + else { + /* Nasty..couldn't get memory... */ + /* Don't try to use these hash tables any more... */ + fdl_initialized_yet = 1; /* tried, but failed */ + return -ENOMEM; + } + + /* now, do the hash-by-idx */ + idx = fdl_calc_index_idx(dev->ifindex); + dhn = kmalloc(sizeof(struct dev_hash_node), GFP_ATOMIC); + + #ifdef FDL_DEBUG + printk(KERN_ERR "__fdl_register_netdevice, ifindex: %i, idx: %i", dev->ifindex, idx); + #endif + + if (dhn) { + dhn->dev = dev; + dhn->next = fdl_idx_base[idx]; + fdl_idx_base[idx] = dhn; + } + else { + /* Nasty..couldn't get memory... */ + /* Don't try to use these hash tables any more... */ + fdl_initialized_yet = 1; /* tried, but failed */ + return -ENOMEM; + } + } + return 0; + } /* fdl_register_netdevice */ + + + /* called from register_netdevice, assumes dev is locked, and that no one + * will be calling __find_dev_by_name, etc. Returns 0 if found & removed one, + * returns -1 otherwise. + */ + int __fdl_unregister_netdevice(struct net_device* dev) { + int retval = -1; + if (fdl_initialized_yet == 2) { /* If we've been initialized correctly... */ + /* first, take care of the hash-by-name */ + int idx = fdl_calc_name_idx(dev->name); + struct dev_hash_node* prev = fdl_name_base[idx]; + struct dev_hash_node* cur = NULL; + + #ifdef FDL_DEBUG + printk(KERN_ERR "__fdl_unregister_netdevice, dev: %p dev: %s, idx: %i", dev, dev->name, idx); + #endif + + if (prev) { + if (strcmp(dev->name, prev->dev->name) == 0) { + /* it's the first one... 
*/ + fdl_name_base[idx] = prev->next; + kfree(prev); + retval = 0; + } + else { + cur = prev->next; + while (cur) { + if (strcmp(dev->name, cur->dev->name) == 0) { + prev->next = cur->next; + kfree(cur); + retval = 0; + break; + } + else { + prev = cur; + cur = cur->next; + } + } + } + } + + /* Now, the hash-by-index */ + idx = fdl_calc_index_idx(dev->ifindex); + prev = fdl_idx_base[idx]; + cur = NULL; + if (prev) { + if (dev->ifindex == prev->dev->ifindex) { + /* it's the first one... */ + fdl_idx_base[idx] = prev->next; + kfree(prev); + retval = 0; + } + else { + cur = prev->next; + while (cur) { + if (dev->ifindex == cur->dev->ifindex) { + prev->next = cur->next; + kfree(cur); + retval = 0; + break; + } + else { + prev = cur; + cur = cur->next; + } + } + } + } + }/* if we ensured init OK */ + return retval; + } /* fdl_unregister_netdevice */ + + + + #endif /* BENS_FAST_DEV_LOOKUP */ + + + /****************************************************************************************** *************** *** 397,401 **** { struct net_device *dev; ! for (dev = dev_base; dev != NULL; dev = dev->next) { if (strcmp(dev->name, name) == 0) --- 650,673 ---- { struct net_device *dev; ! ! #ifdef BENS_FAST_DEV_LOOKUP ! int idx = fdl_calc_name_idx(name); ! struct dev_hash_node* dhn; ! if (fdl_initialized_yet == 2) { ! #ifdef FDL_DEBUG ! printk(KERN_ERR "__dev_get_by_name, name: %s idx: %i\n", name, idx); ! #endif ! dhn = fdl_name_base[idx]; ! while (dhn) { ! if (strcmp(dhn->dev->name, name) == 0) { ! /* printk(KERN_ERR "__dev_get_by_name, found it: %p\n", dhn->dev); */ ! return dhn->dev; ! } ! dhn = dhn->next; ! } ! /* printk(KERN_ERR "__dev_get_by_name, didn't find it for name: %s\n", name); */ ! return NULL; ! } ! 
#endif for (dev = dev_base; dev != NULL; dev = dev->next) { if (strcmp(dev->name, name) == 0) *************** *** 473,476 **** --- 745,762 ---- struct net_device *dev; + #ifdef BENS_FAST_DEV_LOOKUP + int idx = fdl_calc_index_idx(ifindex); + struct dev_hash_node* dhn; + if (fdl_initialized_yet == 2) { /* have we gone through initialization before... */ + dhn = fdl_idx_base[idx]; + while (dhn) { + if (dhn->dev->ifindex == ifindex) + return dhn->dev; + dhn = dhn->next; + } + return NULL; + } + #endif + for (dev = dev_base; dev != NULL; dev = dev->next) { if (dev->ifindex == ifindex) *************** *** 550,555 **** /* * If you need over 100 please also fix the algorithm... */ ! for (i = 0; i < 100; i++) { sprintf(buf,name,i); if (__dev_get_by_name(buf) == NULL) { --- 836,843 ---- /* * If you need over 100 please also fix the algorithm... + * Increased it to deal with VLAN interfaces. It is unlikely + * that this many will ever be added, but it can't hurt! -BLG */ ! for (i = 0; i < 8192; i++) { sprintf(buf,name,i); if (__dev_get_by_name(buf) == NULL) { *************** *** 558,562 **** } } ! return -ENFILE; /* Over 100 of the things .. bail out! */ } --- 846,850 ---- } } ! return -ENFILE; /* Over 8192 of the things .. bail out! */ } *************** *** 2068,2073 **** --- 2356,2369 ---- if (__dev_get_by_name(ifr->ifr_newname)) return -EEXIST; + #ifdef BENS_FAST_DEV_LOOKUP + write_lock_bh(&dev_base_lock); /* gotta lock it to remove stuff */ + __fdl_unregister_netdevice(dev); /* remove it from the hash.. 
*/ + #endif memcpy(dev->name, ifr->ifr_newname, IFNAMSIZ); dev->name[IFNAMSIZ-1] = 0; + #ifdef BENS_FAST_DEV_LOOKUP + __fdl_register_netdevice(dev); + write_unlock_bh(&dev_base_lock); /* gotta lock it to add stuff too */ + #endif notifier_call_chain(&netdev_chain, NETDEV_CHANGENAME, dev); return 0; *************** *** 2343,2346 **** --- 2639,2648 ---- dev->next = NULL; write_lock_bh(&dev_base_lock); + #ifdef BENS_FAST_DEV_LOOKUP + /* Must do this before dp is set to dev, or it could be added twice, once + * on initialization based on dev_base, and once again after that... + */ + __fdl_register_netdevice(dev); + #endif *dp = dev; dev_hold(dev); *************** *** 2397,2400 **** --- 2699,2708 ---- dev_init_scheduler(dev); write_lock_bh(&dev_base_lock); + #ifdef BENS_FAST_DEV_LOOKUP + /* Must do this before dp is set to dev, or it could be added twice, once + * on initialization based on dev_base, and once again after that... + */ + __fdl_register_netdevice(dev); + #endif *dp = dev; dev_hold(dev); *************** *** 2469,2473 **** write_lock_bh(&dev_base_lock); *dp = d->next; ! write_unlock_bh(&dev_base_lock); break; } --- 2777,2784 ---- write_lock_bh(&dev_base_lock); *dp = d->next; ! #ifdef BENS_FAST_DEV_LOOKUP ! __fdl_unregister_netdevice(dev); ! #endif ! write_unlock_bh(&dev_base_lock); break; } -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sat Jan 6 15:35:23 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 15:35:13 -0800 Received: from pizda.ninka.net ([216.101.162.242]:38033 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 15:34:51 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id PAA13411; Sat, 6 Jan 2001 15:17:27 -0800 Date: Sat, 6 Jan 2001 15:17:27 -0800 Message-Id: <200101062317.PAA13411@pizda.ninka.net> From: "David S. 
Miller" To: greearb@candelatech.com CC: linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-reply-to: <3A578F27.D2A9DF52@candelatech.com> (message from Ben Greear on Sat, 06 Jan 2001 14:33:27 -0700) Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) References: <3A578F27.D2A9DF52@candelatech.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 79 Lines: 6 Unified diffs only please... Thanks. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Sat Jan 6 16:00:23 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 16:00:13 -0800 Received: from pizda.ninka.net ([216.101.162.242]:56977 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 16:00:05 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id PAA13551; Sat, 6 Jan 2001 15:42:15 -0800 Date: Sat, 6 Jan 2001 15:42:15 -0800 Message-Id: <200101062342.PAA13551@pizda.ninka.net> From: "David S. Miller" To: baggins@sith.mimuw.edu.pl CC: jordy@napster.com, netdev@oss.sgi.com In-reply-to: <20010106041103.B22665@sith.mimuw.edu.pl> (message from Jan Rekorajski on Sat, 6 Jan 2001 04:11:03 +0100) Subject: Re: Mediocre TCP Performance 2.4.0 <-> Win98SE PPP References: <3A5665CB.30AB2158@napster.com> <20010106041103.B22665@sith.mimuw.edu.pl> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 307 Lines: 11 Date: Sat, 6 Jan 2001 04:11:03 +0100 From: Jan Rekorajski Try disabling VJ TCP/IP header compression on Windows box. Well, in fact, if TCP timestamps are enabled, none of the TCP headers will be compressable by the VJ stuff. Later, David S. 
Miller davem@redhat.com From owner-netdev@oss.sgi.com Sat Jan 6 16:40:03 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 16:39:43 -0800 Received: from sith.mimuw.edu.pl ([193.0.97.1]:20233 "HELO sith.mimuw.edu.pl") by oss.sgi.com with SMTP id ; Sat, 6 Jan 2001 16:39:16 -0800 Received: (qmail 23914 invoked by uid 601); 7 Jan 2001 00:41:50 -0000 Date: Sun, 7 Jan 2001 01:41:50 +0100 From: Jan Rekorajski To: "David S. Miller" Cc: jordy@napster.com, netdev@oss.sgi.com Subject: Re: Mediocre TCP Performance 2.4.0 <-> Win98SE PPP Message-ID: <20010107014150.A23851@sith.mimuw.edu.pl> Mail-Followup-To: Jan Rekorajski , "David S. Miller" , jordy@napster.com, netdev@oss.sgi.com References: <3A5665CB.30AB2158@napster.com> <20010106041103.B22665@sith.mimuw.edu.pl> <200101062342.PAA13551@pizda.ninka.net> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-2 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.2.5i In-Reply-To: <200101062342.PAA13551@pizda.ninka.net>; from davem@redhat.com on Sat, Jan 06, 2001 at 03:42:15PM -0800 X-Operating-System: Linux 2.4.0-prerelease i686 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 701 Lines: 19 On Sat, 06 Jan 2001, David S. Miller wrote: > Date: Sat, 6 Jan 2001 04:11:03 +0100 > From: Jan Rekorajski > > Try disabling VJ TCP/IP header compression on Windows box. > > Well, in fact, if TCP timestamps are enabled, none of the TCP headers > will be compressable by the VJ stuff. And `echo 0 >/proc/sys/net/ipv4/tcp_timestamps` should cure this? If yes then something is wrong because it didn't :( And I did turn it off on both sides. Jan -- Jan Rêkorajski | ALL SUSPECTS ARE GUILTY. PERIOD! bagginsmimuw.edu.pl | OTHERWISE THEY WOULDN'T BE SUSPECTS, WOULD THEY? 
BOFH, MANIAC | -- TROOPS by Kevin Rubio From owner-netdev@oss.sgi.com Sat Jan 6 16:58:33 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 16:58:13 -0800 Received: from sith.mimuw.edu.pl ([193.0.97.1]:25353 "HELO sith.mimuw.edu.pl") by oss.sgi.com with SMTP id ; Sat, 6 Jan 2001 16:58:04 -0800 Received: (qmail 24189 invoked by uid 601); 7 Jan 2001 01:00:38 -0000 Date: Sun, 7 Jan 2001 02:00:38 +0100 From: Jan Rekorajski To: "David S. Miller" , jordy@napster.com, netdev@oss.sgi.com Subject: Re: Mediocre TCP Performance 2.4.0 <-> Win98SE PPP Message-ID: <20010107020038.A24159@sith.mimuw.edu.pl> Mail-Followup-To: Jan Rekorajski , "David S. Miller" , jordy@napster.com, netdev@oss.sgi.com References: <3A5665CB.30AB2158@napster.com> <20010106041103.B22665@sith.mimuw.edu.pl> <200101062342.PAA13551@pizda.ninka.net> <20010107014150.A23851@sith.mimuw.edu.pl> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-2 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.2.5i In-Reply-To: <20010107014150.A23851@sith.mimuw.edu.pl>; from baggins@sith.mimuw.edu.pl on Sun, Jan 07, 2001 at 01:41:50AM +0100 X-Operating-System: Linux 2.4.0-prerelease i686 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 937 Lines: 25 On Sun, 07 Jan 2001, Jan Rekorajski wrote: > On Sat, 06 Jan 2001, David S. Miller wrote: > > > Date: Sat, 6 Jan 2001 04:11:03 +0100 > > From: Jan Rekorajski > > > > Try disabling VJ TCP/IP header compression on Windows box. > > > > Well, in fact, if TCP timestamps are enabled, none of the TCP headers > > will be compressable by the VJ stuff. > > And `echo 0 >/proc/sys/net/ipv4/tcp_timestamps` should cure this? > If yes then something is wrong because it didn't :( > And I did turn it off on both sides. After turning off timestamps it got even worse: with timestamps enabled I can at least ssh to that box, but with timestamps disabled I could not do even that.
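The point about timestamps and VJ can be sketched in a few lines. As far as RFC 1144 describes it, the compressor falls back to sending a full header whenever the TCP option bytes differ from those of the previous packet on the same connection, and with timestamps on those bytes change on nearly every segment. The helper below is hypothetical and only illustrates that comparison; it is not the kernel's slhc code:

```c
#include <string.h>

/* Hypothetical helper, not taken from the kernel: returns 1 when the TCP
 * option bytes of the current segment are identical to those of the
 * previous one (so a VJ-style compressor could still compress the
 * header), and 0 when they differ in length or content. */
int vj_options_unchanged(const unsigned char *prev, size_t prev_len,
                         const unsigned char *cur, size_t cur_len)
{
    if (prev_len != cur_len)
        return 0;
    return memcmp(prev, cur, cur_len) == 0;
}
```

Two segments that carry a timestamp option with different timestamp values compare as changed here, which is the case where the header would go out uncompressed.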
Jan -- Jan Rêkorajski | ALL SUSPECTS ARE GUILTY. PERIOD! bagginsmimuw.edu.pl | OTHERWISE THEY WOULDN'T BE SUSPECTS, WOULD THEY? BOFH, MANIAC | -- TROOPS by Kevin Rubio From owner-netdev@oss.sgi.com Sat Jan 6 17:03:33 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 17:03:13 -0800 Received: from pizda.ninka.net ([216.101.162.242]:23698 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 17:02:58 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id QAA13879; Sat, 6 Jan 2001 16:44:19 -0800 Date: Sat, 6 Jan 2001 16:44:19 -0800 Message-Id: <200101070044.QAA13879@pizda.ninka.net> From: "David S. Miller" To: baggins@sith.mimuw.edu.pl CC: jordy@napster.com, netdev@oss.sgi.com In-reply-to: <20010107014150.A23851@sith.mimuw.edu.pl> (message from Jan Rekorajski on Sun, 7 Jan 2001 01:41:50 +0100) Subject: Re: Mediocre TCP Performance 2.4.0 <-> Win98SE PPP References: <3A5665CB.30AB2158@napster.com> <20010106041103.B22665@sith.mimuw.edu.pl> <200101062342.PAA13551@pizda.ninka.net> <20010107014150.A23851@sith.mimuw.edu.pl> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 440 Lines: 14 Date: Sun, 7 Jan 2001 01:41:50 +0100 From: Jan Rekorajski And `echo 0 >/proc/sys/net/ipv4/tcp_timestamps` should cure this? If yes then something is wrong because it didn't :( And I did turn it off on both sides. No, I just stated that since VJ compression can't compress any of the headers, header compression cannot be the cause of the problems. Later, David S.
Miller davem@redhat.com From owner-netdev@oss.sgi.com Sat Jan 6 19:04:23 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 19:04:04 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:19975 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 19:03:42 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id VAA23138; Sat, 6 Jan 2001 21:06:54 -0700 Message-ID: <3A57EB5E.966C91DA@candelatech.com> Date: Sat, 06 Jan 2001 21:06:54 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) References: <3A578F27.D2A9DF52@candelatech.com> <200101062317.PAA13411@pizda.ninka.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 15164 Lines: 415 "David S. Miller" wrote: > > Unified diffs only please... Thanks. Hrm, here's one with a -u option, this what you're looking for? --- ../../../linux/net/core/dev.c Mon Dec 11 14:29:35 2000 +++ dev.c Sat Jan 6 14:14:10 2001 @@ -1,4 +1,4 @@ -/* +/* -*- linux-c -*- * NET3 Protocol independent device support routines. * * This program is free software; you can redistribute it and/or @@ -131,9 +132,17 @@ * and the routines to invoke. * * Why 16. Because with 16 the only overlap we get on a hash of the - * low nibble of the protocol value is RARP/SNAP/X.25. + * low nibble of the protocol value is RARP/SNAP/X.25. + * + * NOTE: That is no longer true with the addition of VLAN tags. Not + * sure which should go first, but I bet it won't make much + * difference if we are running VLANs. 
The good news is that + * this protocol won't be in the list unless compiled in, so + * the average user (w/out VLANs) will not be adversly affected. + * --BLG * * 0800 IP + * 8100 802.1Q VLAN * 0001 802.3 * 0002 AX.25 * 0004 802.2 @@ -178,6 +187,250 @@ #endif +#define BENS_FAST_DEV_LOOKUP +#ifdef BENS_FAST_DEV_LOOKUP +/* Fash Device Lookup code. Should give much better than + * linear speed when looking for devices by idx or name. + * --Ben (greearb@candelatech.com) + */ +#define FDL_HASH_LEN 256 + +/* #define FDL_DEBUG */ + +struct dev_hash_node { + struct net_device* dev; + struct dev_hash_node* next; +}; + +struct dev_hash_node* fdl_name_base[FDL_HASH_LEN];/* hashed by name */ +struct dev_hash_node* fdl_idx_base[FDL_HASH_LEN]; /* hashed by index */ +int fdl_initialized_yet = 0; + +/* TODO: Make these inline methods */ +/* Nice cheesy little hash method to be used on device-names (eth0, ppp0, etc) */ +int fdl_calc_name_idx(const char* dev_name) { + int tmp = 0; + int i; +#ifdef FDL_DEBUG + printk(KERN_ERR "fdl_calc_name_idx, name: %s\n", dev_name); +#endif + for (i = 0; dev_name[i]; i++) { + tmp += (int)(dev_name[i]); + } + if (i > 3) { + tmp += (dev_name[i-2] * 10); /* might add a little spread to the hash */ + tmp += (dev_name[i-3] * 100); /* might add a little spread to the hash */ + } +#ifdef FDL_DEBUG + printk(KERN_ERR "fdl_calc_name_idx, rslt: %i\n", (int)(tmp % FDL_HASH_LEN)); +#endif + return (tmp % FDL_HASH_LEN); +} + +int fdl_calc_index_idx(const int ifindex) { + return (ifindex % FDL_HASH_LEN); +} + + +/* Better have a lock on the dev_base before calling this... */ +int __fdl_ensure_init(void) { +#ifdef FDL_DEBUG + printk(KERN_ERR "__fdl_ensure_init, enter\n"); +#endif + if (! fdl_initialized_yet) { + /* only do this once.. 
*/ + int i; + int idx = 0; /* into the hash table */ + struct net_device* dev = dev_base; + struct dev_hash_node* dhn; + +#ifdef FDL_DEBUG + printk(KERN_ERR "__fdl_ensure_init, doing real work..."); +#endif + + fdl_initialized_yet = 1; /* it has been attempted at least... */ + + for (i = 0; iname, idx); +#endif + /* first, take care of the hash-by-name */ + idx = fdl_calc_name_idx(dev->name); + dhn = kmalloc(sizeof(struct dev_hash_node), GFP_ATOMIC); + if (dhn) { + dhn->dev = dev; + dhn->next = fdl_name_base[idx]; + fdl_name_base[idx] = dhn; + } + else { + /* Nasty..couldn't get memory... */ + return -ENOMEM; + } + + /* now, do the hash-by-idx */ + idx = fdl_calc_index_idx(dev->ifindex); + dhn = kmalloc(sizeof(struct dev_hash_node), GFP_ATOMIC); + if (dhn) { + dhn->dev = dev; + dhn->next = fdl_idx_base[idx]; + fdl_idx_base[idx] = dhn; + } + else { + /* Nasty..couldn't get memory... */ + return -ENOMEM; + } + + dev = dev->next; + } + fdl_initialized_yet = 2; /* initialization actually worked */ + } +#ifdef FDL_DEBUG + printk(KERN_ERR "__fdl_ensure_init, end, fdl_initialized_yet: %i\n", fdl_initialized_yet); +#endif + if (fdl_initialized_yet == 2) { + return 0; + } + else { + return -1; + } +}/* fdl_ensure_init */ + + +/* called from register_netdevice, assumes dev is locked, and that no one + * will be calling __find_dev_by_name before this exits.. etc. + */ +int __fdl_register_netdevice(struct net_device* dev) { + if (__fdl_ensure_init() == 0) { + /* first, take care of the hash-by-name */ + int idx = fdl_calc_name_idx(dev->name); + struct dev_hash_node* dhn = kmalloc(sizeof(struct dev_hash_node), GFP_ATOMIC); + +#ifdef FDL_DEBUG + printk(KERN_ERR "__fdl_register_netdevice, dev: %p dev: %s, idx: %i", dev, dev->name, idx); +#endif + + if (dhn) { + dhn->dev = dev; + dhn->next = fdl_name_base[idx]; + fdl_name_base[idx] = dhn; + } + else { + /* Nasty..couldn't get memory... */ + /* Don't try to use these hash tables any more... 
*/ + fdl_initialized_yet = 1; /* tried, but failed */ + return -ENOMEM; + } + + /* now, do the hash-by-idx */ + idx = fdl_calc_index_idx(dev->ifindex); + dhn = kmalloc(sizeof(struct dev_hash_node), GFP_ATOMIC); + +#ifdef FDL_DEBUG + printk(KERN_ERR "__fdl_register_netdevice, ifindex: %i, idx: %i", dev->ifindex, idx); +#endif + + if (dhn) { + dhn->dev = dev; + dhn->next = fdl_idx_base[idx]; + fdl_idx_base[idx] = dhn; + } + else { + /* Nasty..couldn't get memory... */ + /* Don't try to use these hash tables any more... */ + fdl_initialized_yet = 1; /* tried, but failed */ + return -ENOMEM; + } + } + return 0; +} /* fdl_register_netdevice */ + + +/* called from register_netdevice, assumes dev is locked, and that no one + * will be calling __find_dev_by_name, etc. Returns 0 if found & removed one, + * returns -1 otherwise. + */ +int __fdl_unregister_netdevice(struct net_device* dev) { + int retval = -1; + if (fdl_initialized_yet == 2) { /* If we've been initialized correctly... */ + /* first, take care of the hash-by-name */ + int idx = fdl_calc_name_idx(dev->name); + struct dev_hash_node* prev = fdl_name_base[idx]; + struct dev_hash_node* cur = NULL; + +#ifdef FDL_DEBUG + printk(KERN_ERR "__fdl_unregister_netdevice, dev: %p dev: %s, idx: %i", dev, dev->name, idx); +#endif + + if (prev) { + if (strcmp(dev->name, prev->dev->name) == 0) { + /* it's the first one... */ + fdl_name_base[idx] = prev->next; + kfree(prev); + retval = 0; + } + else { + cur = prev->next; + while (cur) { + if (strcmp(dev->name, cur->dev->name) == 0) { + prev->next = cur->next; + kfree(cur); + retval = 0; + break; + } + else { + prev = cur; + cur = cur->next; + } + } + } + } + + /* Now, the hash-by-index */ + idx = fdl_calc_index_idx(dev->ifindex); + prev = fdl_idx_base[idx]; + cur = NULL; + if (prev) { + if (dev->ifindex == prev->dev->ifindex) { + /* it's the first one... 
*/ + fdl_idx_base[idx] = prev->next; + kfree(prev); + retval = 0; + } + else { + cur = prev->next; + while (cur) { + if (dev->ifindex == cur->dev->ifindex) { + prev->next = cur->next; + kfree(cur); + retval = 0; + break; + } + else { + prev = cur; + cur = cur->next; + } + } + } + } + }/* if we ensured init OK */ + return retval; +} /* fdl_unregister_netdevice */ + + + +#endif /* BENS_FAST_DEV_LOOKUP */ + + + /****************************************************************************************** Protocol management and registration routines @@ -396,7 +649,26 @@ struct net_device *__dev_get_by_name(const char *name) { struct net_device *dev; - + +#ifdef BENS_FAST_DEV_LOOKUP + int idx = fdl_calc_name_idx(name); + struct dev_hash_node* dhn; + if (fdl_initialized_yet == 2) { +#ifdef FDL_DEBUG + printk(KERN_ERR "__dev_get_by_name, name: %s idx: %i\n", name, idx); +#endif + dhn = fdl_name_base[idx]; + while (dhn) { + if (strcmp(dhn->dev->name, name) == 0) { + /* printk(KERN_ERR "__dev_get_by_name, found it: %p\n", dhn->dev); */ + return dhn->dev; + } + dhn = dhn->next; + } + /* printk(KERN_ERR "__dev_get_by_name, didn't find it for name: %s\n", name); */ + return NULL; + } +#endif for (dev = dev_base; dev != NULL; dev = dev->next) { if (strcmp(dev->name, name) == 0) return dev; @@ -472,6 +744,20 @@ { struct net_device *dev; +#ifdef BENS_FAST_DEV_LOOKUP + int idx = fdl_calc_index_idx(ifindex); + struct dev_hash_node* dhn; + if (fdl_initialized_yet == 2) { /* have we gone through initialization before... */ + dhn = fdl_idx_base[idx]; + while (dhn) { + if (dhn->dev->ifindex == ifindex) + return dhn->dev; + dhn = dhn->next; + } + return NULL; + } +#endif + for (dev = dev_base; dev != NULL; dev = dev->next) { if (dev->ifindex == ifindex) return dev; @@ -549,15 +835,17 @@ /* * If you need over 100 please also fix the algorithm... + * Increased it to deal with VLAN interfaces. It is unlikely + * that this many will ever be added, but it can't hurt! 
-BLG */ - for (i = 0; i < 100; i++) { + for (i = 0; i < 8192; i++) { sprintf(buf,name,i); if (__dev_get_by_name(buf) == NULL) { strcpy(dev->name, buf); return i; } } - return -ENFILE; /* Over 100 of the things .. bail out! */ + return -ENFILE; /* Over 8192 of the things .. bail out! */ } /** @@ -2067,8 +2355,16 @@ return -EBUSY; if (__dev_get_by_name(ifr->ifr_newname)) return -EEXIST; +#ifdef BENS_FAST_DEV_LOOKUP + write_lock_bh(&dev_base_lock); /* gotta lock it to remove stuff */ + __fdl_unregister_netdevice(dev); /* remove it from the hash.. */ +#endif memcpy(dev->name, ifr->ifr_newname, IFNAMSIZ); dev->name[IFNAMSIZ-1] = 0; +#ifdef BENS_FAST_DEV_LOOKUP + __fdl_register_netdevice(dev); + write_unlock_bh(&dev_base_lock); /* gotta lock it to add stuff too */ +#endif notifier_call_chain(&netdev_chain, NETDEV_CHANGENAME, dev); return 0; @@ -2342,6 +2638,12 @@ } dev->next = NULL; write_lock_bh(&dev_base_lock); +#ifdef BENS_FAST_DEV_LOOKUP + /* Must do this before dp is set to dev, or it could be added twice, once + * on initialization based on dev_base, and once again after that... + */ + __fdl_register_netdevice(dev); +#endif *dp = dev; dev_hold(dev); write_unlock_bh(&dev_base_lock); @@ -2396,6 +2698,12 @@ dev->next = NULL; dev_init_scheduler(dev); write_lock_bh(&dev_base_lock); +#ifdef BENS_FAST_DEV_LOOKUP + /* Must do this before dp is set to dev, or it could be added twice, once + * on initialization based on dev_base, and once again after that... 
+ */ + __fdl_register_netdevice(dev); +#endif *dp = dev; dev_hold(dev); dev->deadbeaf = 0; @@ -2468,7 +2776,10 @@ if (d == dev) { write_lock_bh(&dev_base_lock); *dp = d->next; - write_unlock_bh(&dev_base_lock); +#ifdef BENS_FAST_DEV_LOOKUP + __fdl_unregister_netdevice(dev); +#endif + write_unlock_bh(&dev_base_lock); break; } } -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sat Jan 6 19:29:44 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 19:29:33 -0800 Received: from a203-167-249-89.reverse.clear.net.nz ([203.167.249.89]:61700 "HELO metastasis.f00f.org") by oss.sgi.com with SMTP id ; Sat, 6 Jan 2001 19:29:08 -0800 Received: by metastasis.f00f.org (Postfix, from userid 1000) id 5552EA31D; Sun, 7 Jan 2001 16:29:05 +1300 (NZDT) Date: Sun, 7 Jan 2001 16:29:05 +1300 From: Chris Wedgwood To: Ben Greear Cc: linux-kernel , "netdev@oss.sgi.com" Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) Message-ID: <20010107162905.B1804@metastasis.f00f.org> References: <3A578F27.D2A9DF52@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <3A578F27.D2A9DF52@candelatech.com>; from greearb@candelatech.com on Sat, Jan 06, 2001 at 02:33:27PM -0700 X-No-Archive: Yes Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1061 Lines: 27 On Sat, Jan 06, 2001 at 02:33:27PM -0700, Ben Greear wrote: I'm hoping that I can get a few comments on this code. It was added to (significantly) speed up things like 'ifconfig -a' when running with 4000 or so VLAN devices. It should also help other instances with lots of (virtual) devices, like FrameRelay, ATM, and possibly virtual IP interfaces. 
It probably won't help 'normal' users much, and in its final form, should probably be a selectable option in the config process.

Virtual IP interfaces in the form of ifname: (e.g. eth:1) IMO should be deprecated and removed completely in 2.5.x. It's an ugly external wart that should be removed.

That said, if this was done -- how would things like routing daemons and bind cope? Actually, when I think about it they can't cope with situations like this now:

    tapu:~# ip addr show lo
    1: lo: mtu 3904 qdisc noqueue
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
        inet 10.0.0.1/32 scope global lo

--cw

From owner-netdev@oss.sgi.com Sat Jan 6 19:30:23 2001
Received: by oss.sgi.com id ; Sat, 6 Jan 2001 19:30:13 -0800
Received: from Cantor.suse.de ([194.112.123.193]:9224 "HELO Cantor.suse.de") by oss.sgi.com with SMTP id ; Sat, 6 Jan 2001 19:30:05 -0800
Received: from Hermes.suse.de (Hermes.suse.de [194.112.123.136]) by Cantor.suse.de (Postfix) with ESMTP id 6B62D1E39A; Sun, 7 Jan 2001 04:30:00 +0100 (MET)
Received: from gruyere.muc.suse.de (unknown [10.23.1.2]) by Hermes.suse.de (Postfix) with ESMTP id D8E743E44F; Sun, 7 Jan 2001 04:29:59 +0100 (MET)
Received: by gruyere.muc.suse.de (Postfix, from userid 14446) id 2A0F12F300; Sun, 7 Jan 2001 04:29:59 +0100 (MET)
Date: Sun, 7 Jan 2001 04:29:59 +0100
From: Andi Kleen
To: Ben Greear
Cc: linux-kernel , "netdev@oss.sgi.com"
Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!)
Message-ID: <20010107042959.A14330@gruyere.muc.suse.de>
References: <3A578F27.D2A9DF52@candelatech.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <3A578F27.D2A9DF52@candelatech.com>; from greearb@candelatech.com on Sat, Jan 06, 2001 at 02:33:27PM -0700
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 799
Lines: 17

On Sat, Jan 06, 2001 at 02:33:27PM -0700, Ben Greear wrote:
> I'm hoping that I can get a few comments on this code. It was added
> to (significantly) speed up things like 'ifconfig -a' when running with
> 4000 or so VLAN devices. It should also help other instances with lots
> of (virtual) devices, like FrameRelay, ATM, and possibly virtual IP
> interfaces. It probably won't help 'normal' users much, and in its final
> form, should probably be a selectable option in the config process.
>
> Anyway, let me know what you think!

Does it make any significant difference with the ifconfig from newest nettools? I
removed a quadratic algorithm from ifconfig's device parsing, and with that I was
able to display a few thousand alias devices on an unpatched kernel in reasonable time.

-Andi

From owner-netdev@oss.sgi.com Sat Jan 6 20:01:44 2001
Received: by oss.sgi.com id ; Sat, 6 Jan 2001 20:01:34 -0800
Received: from shell.cyberus.ca ([209.195.95.7]:28068 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 20:01:12 -0800
Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id XAA20700; Sat, 6 Jan 2001 23:00:10 -0500 (EST)
X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs
Date: Sat, 6 Jan 2001 23:00:10 -0500 (EST)
From: jamal
To: Andi Kleen
cc: Ben Greear , linux-kernel , "netdev@oss.sgi.com"
Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!)
In-Reply-To: <20010107042959.A14330@gruyere.muc.suse.de>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 694
Lines: 19

On Sun, 7 Jan 2001, Andi Kleen wrote:

> Does it make any significant difference with the ifconfig from newest nettools? I
> removed a quadratic algorithm from ifconfig's device parsing, and with that I was
> able to display a few thousand alias devices on an unpatched kernel in reasonable time.

I think someone should just flush ifconfig down some toilet.
A wrapper around "ip" that gives the same look and feel as ifconfig,
so that some stupid program that depends on the ifconfig look and feel
keeps working, would be a good start.

Not to stray from the subject, Ben's effort is still needed. I think real
numbers are useful instead of claims like it "displayed faster"

cheers,
jamal

From owner-netdev@oss.sgi.com Sat Jan 6 20:07:14 2001
Received: by oss.sgi.com id ; Sat, 6 Jan 2001 20:06:53 -0800
Received: from Cantor.suse.de ([194.112.123.193]:27913 "HELO Cantor.suse.de") by oss.sgi.com with SMTP id ; Sat, 6 Jan 2001 20:06:52 -0800
Received: from Hermes.suse.de (Hermes.suse.de [194.112.123.136]) by Cantor.suse.de (Postfix) with ESMTP id BC6371E096; Sun, 7 Jan 2001 05:06:50 +0100 (MET)
Received: from gruyere.muc.suse.de (unknown [10.23.1.2]) by Hermes.suse.de (Postfix) with ESMTP id 23A7D3E451; Sun, 7 Jan 2001 05:06:50 +0100 (MET)
Received: by gruyere.muc.suse.de (Postfix, from userid 14446) id AB32E2F300; Sun, 7 Jan 2001 05:06:49 +0100 (MET)
Date: Sun, 7 Jan 2001 05:06:49 +0100
From: Andi Kleen
To: jamal
Cc: Andi Kleen , Ben Greear , linux-kernel , "netdev@oss.sgi.com"
Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!)
Message-ID: <20010107050649.A14637@gruyere.muc.suse.de> References: <20010107042959.A14330@gruyere.muc.suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from hadi@cyberus.ca on Sat, Jan 06, 2001 at 11:00:10PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 749 Lines: 18 On Sat, Jan 06, 2001 at 11:00:10PM -0500, jamal wrote: > Not to stray from the subject, Ben's effort is still needed. I think real > numbers are useful instead of claims like it "displayed faster" The problem with old ifconfig was really visible, old ifconfig needed several minutes to setup. It was slow enough that "visual benchmarking" worked very well. With the fixed ifconfig it scrolls without too annoying delays. The ifconfig could be tuned even more by adding a hash table, but it didn't look necessary (and frankly nobody wants to invest too much work into it, given that ip is far superior) Note that the alias case is different from thousands of devices -- alias works via SIOCGIFCONF while devices work via /proc/net/dev. -Andi From owner-netdev@oss.sgi.com Sat Jan 6 21:13:06 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 21:12:46 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:12552 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 21:12:13 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id XAA23916; Sat, 6 Jan 2001 23:15:24 -0700 Message-ID: <3A58097C.5BB52B5@candelatech.com> Date: Sat, 06 Jan 2001 23:15:24 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: Chris Wedgwood CC: linux-kernel , "netdev@oss.sgi.com" Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) 
References: <3A578F27.D2A9DF52@candelatech.com> <20010107162905.B1804@metastasis.f00f.org> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1407 Lines: 33 Chris Wedgwood wrote: > > On Sat, Jan 06, 2001 at 02:33:27PM -0700, Ben Greear wrote: > > I'm hoping that I can get a few comments on this code. It was > added to (significantly) speed up things like 'ifconfig -a' when > running with 4000 or so VLAN devices. It should also help other > instances with lots of (virtual) devices, like FrameRelay, ATM, > and possibly virtual IP interfaces. It probably won't help > 'normal' users much, and in it's final form, should probably be a > selectable option in the config process. > > Virtual IP interfaces in the form of ifname: (e.g. eth:1) IMO > should be deprecated and removed completely in 2.5.x. It's an ugly > external wart that should be removed. I don't know enough to have any serious opinion about virtual IP interfaces, but I am very certain that VLANs should appear as an interface to external code/user-land, just as eth0 does. This hashing helps with VLANs, and since I aim to get VLANs into the kernel proper sooner or later, having this piece in there makes my patch a little smaller :) However, if folks look at the patch and hate it, then it can be left out, or changed appropriately. 
Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sat Jan 6 21:20:06 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 21:19:46 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:16648 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 21:19:29 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id XAA23979; Sat, 6 Jan 2001 23:22:41 -0700 Message-ID: <3A580B31.7998C783@candelatech.com> Date: Sat, 06 Jan 2001 23:22:41 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: Andi Kleen CC: linux-kernel , "netdev@oss.sgi.com" Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) References: <3A578F27.D2A9DF52@candelatech.com> <20010107042959.A14330@gruyere.muc.suse.de> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1989 Lines: 45 Andi Kleen wrote: > > On Sat, Jan 06, 2001 at 02:33:27PM -0700, Ben Greear wrote: > > I'm hoping that I can get a few comments on this code. It was added > > to (significantly) speed up things like 'ifconfig -a' when running with > > 4000 or so VLAN devices. It should also help other instances with lots > > of (virtual) devices, like FrameRelay, ATM, and possibly virtual IP > > interfaces. It probably won't help 'normal' users much, and in it's final > > form, should probably be a selectable option in the config process. > > > > Anyway, let me know what you think! > > Does it make any significant different with the ifconfig from newest nettools? 
> I removed a quadratic algorithm from ifconfig's device parsing, and with that I was
> able to display a few thousand alias devices on an unpatched kernel in reasonable time.
>
> -Andi

At the time I was doing this, I downloaded the latest nettools version.
The hashing made a very noticeable difference on 4000 interfaces, but
I haven't run any real solid benchmarks at other levels. Can
you tell me some distinguishing mark (version?) on ifconfig that I
can look for?

I'm willing to run such benchmarks, but what would make a good benchmark,
other than ifconfig -a?

And a question for the socket gurus:

Suppose I bind a raw socket to device vlan4001 (ie I have 4k in the list
before that one!!). Currently, that means a linear search on all devices,
right? In that extreme example, I would expect the hash to be very
useful.

Does binding to IP addresses have the same issue?

Also, though hashing by name is not horribly exact, hashing on the device
index should be nearly perfect, so finding device 666 might take a search
through only 5 or so devices (find the hash-bucket, walk down the list in
that bucket).
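
[Editorial sketch, not part of the original message.] The bucket-then-chain lookup described above can be illustrated in user-space C. This is only a sketch under stated assumptions, not the code from the patch: the `toy_*` names, the 1024-bucket table size and the plain modulo hash are all hypothetical choices made for the example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed table size for illustration; not taken from the patch. */
#define TOY_IDX_BUCKETS 1024

/* Hypothetical stand-in for struct net_device. */
struct toy_dev {
    int ifindex;
    const char *name;
};

/* One chain entry per registered device. */
struct toy_node {
    struct toy_dev *dev;
    struct toy_node *next;
};

static struct toy_node *toy_idx_base[TOY_IDX_BUCKETS];

/* Map an interface index to a bucket (simple modulo, an assumption). */
static unsigned int toy_idx_bucket(int ifindex)
{
    return (unsigned int)ifindex % TOY_IDX_BUCKETS;
}

/* Push the device onto the head of its bucket's chain.
 * Returns 0 on success, -1 if allocation fails. */
static int toy_register(struct toy_dev *dev)
{
    unsigned int b = toy_idx_bucket(dev->ifindex);
    struct toy_node *n = malloc(sizeof(*n));

    if (n == NULL)
        return -1;
    n->dev = dev;
    n->next = toy_idx_base[b];
    toy_idx_base[b] = n;
    return 0;
}

/* Find the bucket, then walk only that bucket's chain.
 * If steps is non-NULL it receives the number of nodes visited. */
static struct toy_dev *toy_get_by_index(int ifindex, int *steps)
{
    struct toy_node *n;
    int visited = 0;

    for (n = toy_idx_base[toy_idx_bucket(ifindex)]; n != NULL; n = n->next) {
        visited++;
        if (n->dev->ifindex == ifindex)
            break;
    }
    if (steps != NULL)
        *steps = visited;
    return n ? n->dev : NULL;
}
```

With 4000 devices numbered sequentially and the assumed 1024 buckets, no chain in this sketch holds more than four entries, so a lookup visits at most four nodes instead of walking the whole device list.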
-- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sat Jan 6 21:22:06 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 21:21:46 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:19208 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 21:21:35 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id XAA23996; Sat, 6 Jan 2001 23:24:41 -0700 Message-ID: <3A580BA9.ADAA2B97@candelatech.com> Date: Sat, 06 Jan 2001 23:24:41 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Andi Kleen , linux-kernel , "netdev@oss.sgi.com" Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumissionpolicy!) References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 533 Lines: 17 jamal wrote: > > Not to stray from the subject, Ben's effort is still needed. I think real > numbers are useful instead of claims like it "displayed faster" A single #define near the top of the patch will turn it on/off, so benchmarking should be fairly easy. Please suggest benchmarks you consider valid. 
Thanks, Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sat Jan 6 21:28:16 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 21:28:07 -0800 Received: from Cantor.suse.de ([194.112.123.193]:58122 "HELO Cantor.suse.de") by oss.sgi.com with SMTP id ; Sat, 6 Jan 2001 21:27:52 -0800 Received: from Hermes.suse.de (Hermes.suse.de [194.112.123.136]) by Cantor.suse.de (Postfix) with ESMTP id 67CB31E0AF; Sun, 7 Jan 2001 06:27:50 +0100 (MET) Received: from gruyere.muc.suse.de (gruyere.muc.suse.de [10.23.1.2]) by Hermes.suse.de (Postfix) with ESMTP id 23C383E44F; Sun, 7 Jan 2001 06:27:50 +0100 (MET) Received: by gruyere.muc.suse.de (Postfix, from userid 14446) id A58512F300; Sun, 7 Jan 2001 06:27:44 +0100 (MET) Date: Sun, 7 Jan 2001 06:27:44 +0100 From: Andi Kleen To: Ben Greear Cc: Andi Kleen , linux-kernel , "netdev@oss.sgi.com" Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) Message-ID: <20010107062744.A15198@gruyere.muc.suse.de> References: <3A578F27.D2A9DF52@candelatech.com> <20010107042959.A14330@gruyere.muc.suse.de> <3A580B31.7998C783@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <3A580B31.7998C783@candelatech.com>; from greearb@candelatech.com on Sat, Jan 06, 2001 at 11:22:41PM -0700 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1515 Lines: 42 On Sat, Jan 06, 2001 at 11:22:41PM -0700, Ben Greear wrote: > At the time I was doing this, I downloaded the latest nettools version. > The hashing made a very noticable difference on 4000 interfaces, but > I haven't run any real solid benchmarkings at other levels. Can > you tell me some distinguishing mark (version?) on ifconfig that I > can look for? 
Just get the latest release. > > I'm willing to run such benchmarks, but what would make a good benchmark, > other than ifconfig -a? ifconfig -a is fine IMHO. Everything else I know is just a single pass through the lists (which even at 4000 is not very significant) > Suppose I bind a raw socket to device vlan4001 (ie I have 4k in the list > before that one!!). Currently, that means a linear search on all devices, > right? In that extreme example, I would expect the hash to be very > useful. Nope, it doesn't. Raw binding works via IP addresses, and the IP address resolution works via the routing table, which is extensively hashed. Packet socket binding or SO_BINDTODEVICE will search the list, but it is unlikely that these paths are worth optimizing for. > Binding to IP addresses have the same issue?? No, uses the fib. > Also, though hashing by name is not horribly exact, hashing on the device > index should be nearly perfect, so finding device 666 might take a search > through only 5 or so devices (find the hash-bucket, walk down the list in > that bucket). Just why? It is not in any critical path I can see at all. 
-Andi From owner-netdev@oss.sgi.com Sat Jan 6 21:30:05 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 21:29:46 -0800 Received: from Cantor.suse.de ([194.112.123.193]:6923 "HELO Cantor.suse.de") by oss.sgi.com with SMTP id ; Sat, 6 Jan 2001 21:29:30 -0800 Received: from Hermes.suse.de (Hermes.suse.de [194.112.123.136]) by Cantor.suse.de (Postfix) with ESMTP id 021931E0B9; Sun, 7 Jan 2001 06:29:29 +0100 (MET) Received: from gruyere.muc.suse.de (gruyere.muc.suse.de [10.23.1.2]) by Hermes.suse.de (Postfix) with ESMTP id 443163E44F; Sun, 7 Jan 2001 06:29:28 +0100 (MET) Received: by gruyere.muc.suse.de (Postfix, from userid 14446) id F33712F300; Sun, 7 Jan 2001 06:29:22 +0100 (MET) Date: Sun, 7 Jan 2001 06:29:22 +0100 From: Andi Kleen To: Ben Greear Cc: jamal , Andi Kleen , linux-kernel , "netdev@oss.sgi.com" Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumissionpolicy!) Message-ID: <20010107062922.B15198@gruyere.muc.suse.de> References: <3A580BA9.ADAA2B97@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <3A580BA9.ADAA2B97@candelatech.com>; from greearb@candelatech.com on Sat, Jan 06, 2001 at 11:24:41PM -0700 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 521 Lines: 16 On Sat, Jan 06, 2001 at 11:24:41PM -0700, Ben Greear wrote: > jamal wrote: > > > > > Not to stray from the subject, Ben's effort is still needed. I think real > > numbers are useful instead of claims like it "displayed faster" > > A single #define near the top of the patch will turn it on/off, so > benchmarking should be fairly easy. Please suggest benchmarks you > consider valid. The only issue I know was the long delay in ifconfig. If that's fixed I see nothing else that would be worth benchmarking. 
-Andi From owner-netdev@oss.sgi.com Sat Jan 6 21:54:36 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 21:54:26 -0800 Received: from pizda.ninka.net ([216.101.162.242]:2196 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 21:54:08 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id VAA24632; Sat, 6 Jan 2001 21:36:37 -0800 Date: Sat, 6 Jan 2001 21:36:37 -0800 Message-Id: <200101070536.VAA24632@pizda.ninka.net> From: "David S. Miller" To: greearb@candelatech.com CC: linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-reply-to: <3A57EB5E.966C91DA@candelatech.com> (message from Ben Greear on Sat, 06 Jan 2001 21:06:54 -0700) Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) References: <3A578F27.D2A9DF52@candelatech.com> <200101062317.PAA13411@pizda.ninka.net> <3A57EB5E.966C91DA@candelatech.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 293 Lines: 14 Date: Sat, 06 Jan 2001 21:06:54 -0700 From: Ben Greear "David S. Miller" wrote: > > Unified diffs only please... Thanks. Hrm, here's one with a -u option, this what you're looking for? Yes, thanks a lot. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Sat Jan 6 21:58:37 2001 Received: by oss.sgi.com id ; Sat, 6 Jan 2001 21:58:18 -0800 Received: from pizda.ninka.net ([216.101.162.242]:5780 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 21:58:13 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id VAA24665; Sat, 6 Jan 2001 21:40:36 -0800 Date: Sat, 6 Jan 2001 21:40:36 -0800 Message-Id: <200101070540.VAA24665@pizda.ninka.net> From: "David S. 
Miller" To: greearb@candelatech.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-reply-to: <20010107162905.B1804@metastasis.f00f.org> (message from Chris Wedgwood on Sun, 7 Jan 2001 16:29:05 +1300) Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) References: <3A578F27.D2A9DF52@candelatech.com> <20010107162905.B1804@metastasis.f00f.org> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1361 Lines: 30 On Sat, Jan 06, 2001 at 02:33:27PM -0700, Ben Greear wrote: I'm hoping that I can get a few comments on this code. It was added to (significantly) speed up things like 'ifconfig -a' when running with 4000 or so VLAN devices. It should also help other instances with lots of (virtual) devices, like FrameRelay, ATM, and possibly virtual IP interfaces. It probably won't help 'normal' users much, and in it's final form, should probably be a selectable option in the config process. Ben, if ifconfig uses /proc/net/dev to list devices, how can your changes speed up ifconfig? Andi mentioned in another email how he has fixed the quadratic behavior in ifconfig, you should check if it fixes your problem. Jamal has suggested dumping ifconfig and making a dummy "ifconfig" which just wrappers around "ip". I like this idea the most. Really, what I'm concerned about is what calls dev_get_by_{name,index} so often and in such critical places that optimizing it makes any sense? I don't mind optimizing stuff like this where needed, in fact I'm the most guilty of this, check out the complex TCP hash tables we have :-) But if it's only a problem because of poorly implemented user applications, let's fix the apps instead of adding the complexity to the kernel. Later, David S. 
Miller davem@redhat.com

From owner-netdev@oss.sgi.com Sat Jan 6 22:01:27 2001
Received: by oss.sgi.com id ; Sat, 6 Jan 2001 22:01:08 -0800
Received: from pizda.ninka.net ([216.101.162.242]:8340 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 22:00:48 -0800
Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id VAA24689; Sat, 6 Jan 2001 21:43:16 -0800
Date: Sat, 6 Jan 2001 21:43:16 -0800
Message-Id: <200101070543.VAA24689@pizda.ninka.net>
From: "David S. Miller"
To: hadi@cyberus.ca
CC: ak@suse.de, greearb@candelatech.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com
In-reply-to: (message from jamal on Sat, 6 Jan 2001 23:00:10 -0500 (EST))
Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!)
References:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 1038
Lines: 26

   Date: Sat, 6 Jan 2001 23:00:10 -0500 (EST)
   From: jamal

   I think someone should just flush ifconfig down some toilet.
   A wrapper around "ip" that gives the same look and feel as ifconfig,
   so that some stupid program that depends on the ifconfig look and feel
   keeps working, would be a good start.

I could not agree more. This reminds me to do something I could not
justify before, making netlink be enabled in the kernel and
non-configurable. I could almost, but not quite, justify it right now
just because "ip" is becoming standard and needs it.

   Not to stray from the subject, Ben's effort is still needed. I think real
   numbers are useful instead of claims like it "displayed faster"

See my previous email, if it's just slow because of some poorly coded
version of ifconfig, it does not justify the patch. If only a
forcefully created "benchmark" can show some performance problem, that
is not an acceptable reason to champion this patch. Ok?

Later,
David S.
Miller davem@redhat.com

From owner-netdev@oss.sgi.com Sat Jan 6 23:08:27 2001
Received: by oss.sgi.com id ; Sat, 6 Jan 2001 23:08:18 -0800
Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:25609 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sat, 6 Jan 2001 23:07:58 -0800
Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id BAA31481; Sun, 7 Jan 2001 01:11:11 -0700
Message-ID: <3A58249F.86DD52BC@candelatech.com>
Date: Sun, 07 Jan 2001 01:11:11 -0700
From: Ben Greear
Organization: Candela Technologies
X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586)
X-Accept-Language: en
MIME-Version: 1.0
To: Andi Kleen
CC: linux-kernel , "netdev@oss.sgi.com" , "David S. Miller"
Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) (Benchmarks)
References: <3A578F27.D2A9DF52@candelatech.com> <20010107042959.A14330@gruyere.muc.suse.de> <3A580B31.7998C783@candelatech.com> <20010107062744.A15198@gruyere.muc.suse.de>
Content-Type: multipart/mixed; boundary="------------9BC101C2C1F69CA8BF3B57E3"
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 7482
Lines: 223

This is a multi-part message in MIME format.
--------------9BC101C2C1F69CA8BF3B57E3
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Andi Kleen wrote:

> > I'm willing to run such benchmarks, but what would make a good benchmark,
> > other than ifconfig -a?
>
> ifconfig -a is fine IMHO. Everything else I know is just a single pass through
> the lists (which even at 4000 is not very significant)

Hardware: Celeron 500, mostly idle (ie not too scientific!!)

ifconfig:
[root@candle bin]# /sbin/ifconfig --version
net-tools 1.57
ifconfig 1.40 (2000-05-21)

(vlan_test.pl is attached)

My conclusion: The patch definitely helps in this instance, but this instance
may not be realistic.
*********************************************************************** This is with the hash patch enabled: (2.4.prerelease + VLAN patch) (run 2) *********************************************************************** [root@candle bin]# time vlan_test.pl Adding VLAN interfaces 1 through 4000 Done adding 4000 VLAN interfaces in 76 seconds. Doing ifconfig -a 10.47user 6.33system 0:16.80elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (116major+421minor)pagefaults 0swaps Removing VLAN interfaces 1 through 4000 Done deleting 4000 VLAN interfaces in 58 seconds. Going to add and remove 2 interfaces 1000 times. Done adding/removing 2 VLAN interfaces 1000 times in 46 seconds. 74.12user 115.26system 3:22.81elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (2821121major+677780minor)pagefaults 0swaps *********************************************************************** This is with the patch disabled: (2.4.0 + VLAN patch) (run 1) *********************************************************************** [root@candle /root]# time vlan_test.pl Adding VLAN interfaces 1 through 4000 Done adding 4000 VLAN interfaces in 132 seconds. Doing ifconfig -a 10.72user 96.31system 1:55.31elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (117major+423minor)pagefaults 0swaps Removing VLAN interfaces 1 through 4000 Done deleting 4000 VLAN interfaces in 65 seconds. Going to add and remove 2 interfaces 1000 times. Done adding/removing 2 VLAN interfaces 1000 times in 47 seconds. 74.20user 257.83system 6:04.46elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (2821122major+677782minor)pagefaults 0swaps *********************************************************************** This is with the patch enabled: (2.4.0 + VLAN patch) (run 1) *********************************************************************** [root@candle /root]# time vlan_test.pl Adding VLAN interfaces 1 through 4000 Done adding 4000 VLAN interfaces in 83 seconds. 
Doing ifconfig -a
10.61user 10.22system 0:23.43elapsed 88%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (117major+423minor)pagefaults 0swaps
Removing VLAN interfaces 1 through 4000
Done deleting 4000 VLAN interfaces in 64 seconds.
Going to add and remove 2 interfaces 1000 times.
Done adding/removing 2 VLAN interfaces 1000 times in 47 seconds.
73.69user 120.69system 3:44.10elapsed 86%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2821122major+677782minor)pagefaults 0swaps

***********************************************************************
This is with the patch enabled: (2.4.0 + VLAN patch) (run 2)
***********************************************************************
[root@candle /root]# time vlan_test.pl
Adding VLAN interfaces 1 through 4000
Done adding 4000 VLAN interfaces in 80 seconds.
Doing ifconfig -a
10.62user 6.31system 0:18.63elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (117major+423minor)pagefaults 0swaps
Removing VLAN interfaces 1 through 4000
Done deleting 4000 VLAN interfaces in 61 seconds.
Going to add and remove 2 interfaces 1000 times.
Done adding/removing 2 VLAN interfaces 1000 times in 47 seconds.
74.05user 114.93system 3:33.00elapsed 88%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2821122major+677782minor)pagefaults 0swaps

>
> > Suppose I bind a raw socket to device vlan4001 (ie I have 4k in the list
> > before that one!!). Currently, that means a linear search on all devices,
> > right? In that extreme example, I would expect the hash to be very
> > useful.
>
> Nope, it doesn't. Raw binding works via IP addresses, and the IP address resolution
> works via the routing table, which is extensively hashed.

Ahh, I meant raw, like raw Ethernet. True, not used very often, but I happen to
have been using it lately :)

The original idea for the hashing came after I was severely chastised by a few
for even considering allowing 4000 VLAN Interfaces into the kernel.
The complaint was that somehow having lots of devices was going to hurt
performance. The eventual conclusion (by me at least) was that there
were no linear lookups in any critical path I could think of. However,
things like binding to interfaces, and searching for them (ifconfig -i vlan3999),
were doing (at least) linear searching, so hashing it might make the user
happier.

>
> Packet socket binding or SO_BINDTODEVICE will search the list, but it is unlikely
> that these paths are worth optimizing for.

The patch has been written, so even if it helps just a little more than it
hurts, it might be worth including. Of course, it may actually hurt more
than help.

I'd be very interested in lucid arguments as to why adding the patch would
actually be worse than not adding it, not just why I'm lame for considering
it *grin* :)

-- 
Ben Greear (greearb@candelatech.com) http://www.candelatech.com
Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL)
http://scry.wanfear.com http://scry.wanfear.com/~greear
--------------9BC101C2C1F69CA8BF3B57E3
Content-Type: application/x-perl;
 name="vlan_test.pl"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="vlan_test.pl"

#!/usr/bin/perl

# For now, this just tests the addition and removal of $num_if (4000) VLAN
# interfaces on eth0, then adds and removes two interfaces $num_if / 4 times.

my $num_if = 4000;

# Name the interfaces vlan1, vlan2, ... (no zero padding).
`/usr/local/bin/vconfig set_name_type VLAN_PLUS_VID_NO_PAD`;

# $c and $d are the third and fourth octets of each interface's address.
my $d = 5;
my $c = 5;
my $i;

print "Adding VLAN interfaces 1 through $num_if\n";

my $p = time();
for ($i = 1; $i<=$num_if; $i++) {
  `/usr/local/bin/vconfig add eth0 $i`;
  `ifconfig vlan$i 192.168.$c.$d`;
  `ifconfig vlan$i up`;

  # Move on to the next /24 once the last octet passes 250.
  $d++;
  if ($d > 250) {
    $d = 5;
    $c++;
  }
}
my $n = time();
my $diff = $n - $p;

print "Done adding $num_if VLAN interfaces in $diff seconds.\n";

sleep 2;

# Time a full listing while all interfaces exist.
print "Doing ifconfig -a\n";
`time ifconfig -a > /tmp/vlan_test_ifconfig_a.txt`;

sleep 2;

print "Removing VLAN interfaces 1 through $num_if\n";
$d = 5;
$c = 5;
$p = time();
for ($i = 1; $i<=$num_if; $i++) {
  `/usr/local/bin/vconfig rem vlan$i`;

  $d++;
  if ($d > 250) {
    $d = 5;
    $c++;
  }
}
$n = time();
$diff = $n - $p;
print "Done deleting $num_if VLAN interfaces in $diff seconds.\n";

sleep 2;

# Second test: repeatedly create and destroy the same two interfaces.
my $tmp = $num_if / 4;
print "\nGoing to add and remove 2 interfaces $tmp times.\n";
$p = time();

for ($i = 1; $i<=$tmp; $i++) {
  `/usr/local/bin/vconfig add eth0 1`;
  `ifconfig vlan1 192.168.200.200`;
  `ifconfig vlan1 up`;

  `/usr/local/bin/vconfig add eth0 2`;
  `ifconfig vlan2 192.168.202.202`;
  `ifconfig vlan2 up`;

  `/usr/local/bin/vconfig rem vlan2`;
  `/usr/local/bin/vconfig rem vlan1`;
}
$n = time();
$diff = $n - $p;
print "Done adding/removing 2 VLAN interfaces $tmp times in $diff seconds.\n";

--------------9BC101C2C1F69CA8BF3B57E3--

From owner-netdev@oss.sgi.com Sat Jan 6 23:15:57 2001
Received: by oss.sgi.com id ; Sat, 6 Jan 2001 23:15:37 -0800
Received: from Cantor.suse.de ([194.112.123.193]:30477 "HELO Cantor.suse.de")
    by oss.sgi.com with SMTP id ; Sat, 6 Jan 2001 23:15:36 -0800
Received: from Hermes.suse.de (Hermes.suse.de [194.112.123.136])
    by Cantor.suse.de (Postfix) with ESMTP id BBEA31E0C3;
    Sun, 7 Jan 2001 08:15:34 +0100 (MET)
Received: from gruyere.muc.suse.de (unknown [10.23.1.2])
    by Hermes.suse.de (Postfix) with ESMTP id 2A1483E44F;
    Sun, 7 Jan 2001 08:15:34 +0100 (MET)
Received: by gruyere.muc.suse.de (Postfix, from userid 14446)
    id 3B70F2F300; Sun, 7 Jan 2001 08:15:33 +0100 (MET)
Date: Sun, 7 Jan 2001 08:15:33 +0100
From: Andi Kleen
To: Ben Greear
Cc: Andi Kleen , linux-kernel , "netdev@oss.sgi.com" , "David S. Miller"
Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!)
(Benchmarks) Message-ID: <20010107081533.A16077@gruyere.muc.suse.de> References: <3A578F27.D2A9DF52@candelatech.com> <20010107042959.A14330@gruyere.muc.suse.de> <3A580B31.7998C783@candelatech.com> <20010107062744.A15198@gruyere.muc.suse.de> <3A58249F.86DD52BC@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <3A58249F.86DD52BC@candelatech.com>; from greearb@candelatech.com on Sun, Jan 07, 2001 at 01:11:11AM -0700 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 786 Lines: 21 On Sun, Jan 07, 2001 at 01:11:11AM -0700, Ben Greear wrote: > > Packet socket binding or SO_BINDTODEVICE will search the list, but it is unlikely > > that these paths are worth optimizing for. > > The patch has been written, so even if it helps just a little more than it > hurts, it might be worth including. Of course, it may actually hurt more > than help. > > I'd be very interested in lucid arguments as to why adding the patch would > actually be worse than not adding it, not just why I'm lame for considering > it *grin* :) It's like David said: it's not a very good idea to tune things that are not commonly used. If some user really realistically has some workload where the linear search is a bottleneck it can be considered. Currently it doesn't look like it. -Andi From owner-netdev@oss.sgi.com Sun Jan 7 00:02:48 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 00:02:38 -0800 Received: from mail.zmailer.org ([194.252.70.162]:17166 "EHLO zmailer.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 00:02:21 -0800 Received: (from localhost user: 'mea' uid#500 fake: STDIN (mea@zmailer.org)) by mail.zmailer.org id ; Sun, 7 Jan 2001 10:01:58 +0200 Date: Sun, 7 Jan 2001 10:01:57 +0200 From: Matti Aarnio To: "David S. 
Miller" Cc: baggins@sith.mimuw.edu.pl, jordy@napster.com, netdev@oss.sgi.com Subject: Re: Mediocre TCP Performance 2.4.0 <-> Win98SE PPP Message-ID: <20010107100157.C16945@mea-ext.zmailer.org> References: <3A5665CB.30AB2158@napster.com> <20010106041103.B22665@sith.mimuw.edu.pl> <200101062342.PAA13551@pizda.ninka.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200101062342.PAA13551@pizda.ninka.net>; from davem@redhat.com on Sat, Jan 06, 2001 at 03:42:15PM -0800 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 548 Lines: 18 On Sat, Jan 06, 2001 at 03:42:15PM -0800, David S. Miller wrote: > Date: Sat, 6 Jan 2001 04:11:03 +0100 > From: Jan Rekorajski > > Try disabling VJ TCP/IP header compression on Windows box. > > Well, in fact, if TCP timestamps are enabled, none of the TCP headers > will be compressable by the VJ stuff. "Shouldn't", you mean ? Only heaven (pick your choice) knows what various access server manufacturers and M$ do in that case... > Later, > David S. Miller > davem@redhat.com /Matti Aarnio From owner-netdev@oss.sgi.com Sun Jan 7 00:21:48 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 00:21:38 -0800 Received: from pizda.ninka.net ([216.101.162.242]:50048 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 00:21:12 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id AAA01182; Sun, 7 Jan 2001 00:04:01 -0800 Date: Sun, 7 Jan 2001 00:04:01 -0800 Message-Id: <200101070804.AAA01182@pizda.ninka.net> From: "David S. 
Miller" To: matti.aarnio@zmailer.org CC: baggins@sith.mimuw.edu.pl, jordy@napster.com, netdev@oss.sgi.com In-reply-to: <20010107100157.C16945@mea-ext.zmailer.org> (message from Matti Aarnio on Sun, 7 Jan 2001 10:01:57 +0200) Subject: Re: Mediocre TCP Performance 2.4.0 <-> Win98SE PPP References: <3A5665CB.30AB2158@napster.com> <20010106041103.B22665@sith.mimuw.edu.pl> <200101062342.PAA13551@pizda.ninka.net> <20010107100157.C16945@mea-ext.zmailer.org> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 411 Lines: 13 Date: Sun, 7 Jan 2001 10:01:57 +0200 From: Matti Aarnio "Shouldn't", you mean ? Only heaven (pick your choice) knows what various access server manufacturers and M$ do in that case... Nope, it just won't happen. The VJ compression algorithm states that when TCP options are present, no compression takes place. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Sun Jan 7 02:23:19 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 02:23:09 -0800 Received: from james.kalifornia.com ([208.179.0.2]:23146 "EHLO james.kalifornia.com") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 02:22:54 -0800 Received: from Huntington-Beach.Blue-Labs.org (mail@Huntington-Beach.Blue-Labs.org [208.179.0.198]) by james.kalifornia.com (8.11.0/8.11.0) with ESMTP id f07AMde13475; Sun, 7 Jan 2001 02:22:39 -0800 Received: (from david@localhost) by Huntington-Beach.Blue-Labs.org (8.9.3/8.9.0) id CAA04222; Sun, 7 Jan 2001 02:22:33 -0800 Date: Sun, 7 Jan 2001 02:22:31 -0800 (PST) From: David Ford X-Sender: david@Huntington-Beach.Blue-Labs.org To: Chris Wedgwood cc: Ben Greear , linux-kernel , "netdev@oss.sgi.com" Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) 
In-Reply-To: <20010107162905.B1804@metastasis.f00f.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 706 Lines: 18 On Sun, 7 Jan 2001, Chris Wedgwood wrote: > Virtual IP interfaces in the form of ifname: (e.g. eth:1) IMO > should be deprecated and removed completely in 2.5.x. It's an ugly > external wart that should be removed. > > That said, if this was done -- how would things like routing daemons > and bind cope? Actually, when I think about it they can't cope with > situating like this now: > > tapu:~# ip addr show lo > 1: lo: mtu 3904 qdisc noqueue > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > inet 127.0.0.1/8 scope host lo > inet 10.0.0.1/32 scope global lo BIND copes just fine, how would it not? I haven't heard any problems with routing daemons either. From owner-netdev@oss.sgi.com Sun Jan 7 04:13:58 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 04:13:39 -0800 Received: from a203-167-249-89.reverse.clear.net.nz ([203.167.249.89]:22021 "HELO metastasis.f00f.org") by oss.sgi.com with SMTP id ; Sun, 7 Jan 2001 04:13:11 -0800 Received: by metastasis.f00f.org (Postfix, from userid 1000) id DB8D6A336; Mon, 8 Jan 2001 01:13:08 +1300 (NZDT) Date: Mon, 8 Jan 2001 01:13:08 +1300 From: Chris Wedgwood To: David Ford Cc: Ben Greear , linux-kernel , "netdev@oss.sgi.com" Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) 
Message-ID: <20010108011308.A2575@metastasis.f00f.org> References: <20010107162905.B1804@metastasis.f00f.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from david@linux.com on Sun, Jan 07, 2001 at 02:22:31AM -0800 X-No-Archive: Yes Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 933 Lines: 28 On Sun, Jan 07, 2001 at 02:22:31AM -0800, David Ford wrote: BIND copes just fine, how would it not? I haven't heard any problems with routing daemons either. Bind knows about multiple virtual interfaces; but we can also have multiple addresses on a single interface and have no virtual interfaces at all. I doubt bind knows about this nor handles it. OK, I'm a liar -- bind does handle this. Cool. Jan 8 01:09:12 tapu named[599]: listening on [127.0.0.1].53 (lo) Jan 8 01:09:12 tapu named[599]: listening on [10.0.0.1].53 (lo) Jan 8 01:09:12 tapu named[599]: listening on [x.x.x.x].53 (x0) Jan 8 01:09:12 tapu named[599]: Forwarding source address is [0.0.0.0].1032 This is good news, because it means there is a precedent for multiple addresses on a single interface so we can kill the : syntax in favor of the above which is cleaner of more accurately represents what is happening. --cw From owner-netdev@oss.sgi.com Sun Jan 7 04:18:39 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 04:18:29 -0800 Received: from pizda.ninka.net ([216.101.162.242]:27778 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 04:18:16 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id EAA01790; Sun, 7 Jan 2001 04:01:04 -0800 Date: Sun, 7 Jan 2001 04:01:04 -0800 Message-Id: <200101071201.EAA01790@pizda.ninka.net> From: "David S. 
Miller" To: cw@f00f.org CC: david@linux.com, greearb@candelatech.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-reply-to: <20010108011308.A2575@metastasis.f00f.org> (message from Chris Wedgwood on Mon, 8 Jan 2001 01:13:08 +1300) Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) References: <20010107162905.B1804@metastasis.f00f.org> <20010108011308.A2575@metastasis.f00f.org> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 541 Lines: 18 Date: Mon, 8 Jan 2001 01:13:08 +1300 From: Chris Wedgwood OK, I'm a liar -- bind does handle this. Cool. Standard BSD allows it, what do you expect :-) This is good news, because it means there is a precedent for multiple addresses on a single interface so we can kill the : syntax in favor of the above which is cleaner of more accurately represents what is happening. If this is really true, 2.5.x is an appropriate time to make this, no sooner. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Sun Jan 7 04:20:28 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 04:20:08 -0800 Received: from james.kalifornia.com ([208.179.0.2]:44395 "EHLO james.kalifornia.com") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 04:20:03 -0800 Received: from Huntington-Beach.Blue-Labs.org (mail@Huntington-Beach.Blue-Labs.org [208.179.0.198]) by james.kalifornia.com (8.11.0/8.11.0) with ESMTP id f07CK0e12864; Sun, 7 Jan 2001 04:20:00 -0800 Received: (from david@localhost) by Huntington-Beach.Blue-Labs.org (8.9.3/8.9.0) id EAA04915; Sun, 7 Jan 2001 04:19:59 -0800 Date: Sun, 7 Jan 2001 04:19:59 -0800 (PST) From: David Ford X-Sender: david@Huntington-Beach.Blue-Labs.org To: Chris Wedgwood cc: Ben Greear , linux-kernel , "netdev@oss.sgi.com" Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) 
In-Reply-To: <20010108011308.A2575@metastasis.f00f.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1508 Lines: 39 On Mon, 8 Jan 2001, Chris Wedgwood wrote: > Bind knows about multiple virtual interfaces; but we can also have > multiple addresses on a single interface and have no virtual > interfaces at all. > > I doubt bind knows about this nor handles it. > > > > OK, I'm a liar -- bind does handle this. Cool. > > Jan 8 01:09:12 tapu named[599]: listening on [127.0.0.1].53 (lo) > Jan 8 01:09:12 tapu named[599]: listening on [10.0.0.1].53 (lo) > Jan 8 01:09:12 tapu named[599]: listening on [x.x.x.x].53 (x0) > Jan 8 01:09:12 tapu named[599]: Forwarding source address is [0.0.0.0].1032 > > This is good news, because it means there is a precedent for multiple > addresses on a single interface so we can kill the : > syntax in favor of the above which is cleaner of more accurately > represents what is happening. I've been using the new form for a long long time now and I assure you, BIND hasn't had any problems with it for a long long time. :) BIND as most all programs, should not care what the interface is or how it is laid out. It binds to an address and port and shouldn't care otherwise. Would I really put you in a quandry if I told you I had multiple different media interfaces all with the same IP and BIND happily answered on all of them? ;) -d -- ---NOTICE--- fwd: fwd: fwd: type emails will be deleted automatically. "There is a natural aristocracy among men. 
The grounds of this are virtue and talents", Thomas Jefferson [1742-1826], 3rd US President

From owner-netdev@oss.sgi.com Sun Jan 7 05:42:19 2001
Received: by oss.sgi.com id ; Sun, 7 Jan 2001 05:42:09 -0800
Received: from router-100M.swansea.linux.org.uk ([194.168.151.17]:37902 "EHLO the-village.bc.nu")
    by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 05:41:49 -0800
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
    id 14FG60-0002eP-00; Sun, 7 Jan 2001 13:42:52 +0000
Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission
To: greearb@candelatech.com (Ben Greear)
Date: Sun, 7 Jan 2001 13:42:50 +0000 (GMT)
Cc: davem@redhat.com (David S. Miller), linux-kernel@vger.kernel.org, netdev@oss.sgi.com
In-Reply-To: <3A57EB5E.966C91DA@candelatech.com> from "Ben Greear" at Jan 06, 2001 09:06:54 PM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id:
From: Alan Cox
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 760
Lines: 21

> + * NOTE: That is no longer true with the addition of VLAN tags. Not
> + *       sure which should go first, but I bet it won't make much
> + *       difference if we are running VLANs. The good news is that

It makes a lot of difference that the vlan goes 2nd. Most sane people won't
have vlans active on a high load interface.

>                 strcpy(dev->name, buf);
>                 return i;
>             }
>         }
> -   return -ENFILE; /* Over 100 of the things .. bail out! */
> +   return -ENFILE; /* Over 8192 of the things .. bail out! */

So fix the algorithm. You want the list sorted at this point, or to generate
a bitmap of free/used entries and scan the list then scan the map.

Question: How do devices with hardware vlan support fit into your model ?
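
The bitmap approach suggested above can be sketched in user-space C. This is
only an illustration under stated assumptions, not code from the posted patch
or from the kernel: the function name lowest_free_unit(), the array-of-names
interface and the MAX_UNITS constant (borrowed from the 8192 in the quoted
hunk) are all hypothetical. The point is that one pass over the device list
plus one pass over the map replaces a name lookup per candidate number.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_UNITS     8192   /* upper limit borrowed from the quoted patch */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS     ((MAX_UNITS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/*
 * Hypothetical helper: return the lowest N such that "<prefix>N" is not
 * present in names[], or -1 if all MAX_UNITS numbers are in use.
 */
int lowest_free_unit(const char *prefix, const char *const names[], size_t count)
{
    unsigned long map[MAP_WORDS];
    size_t plen, i;

    assert(prefix != NULL && names != NULL);
    plen = strlen(prefix);
    memset(map, 0, sizeof(map));

    /* Pass 1: walk the device list once and mark every unit in use. */
    for (i = 0; i < count; i++) {
        const char *p;
        unsigned long unit = 0;

        if (strncmp(names[i], prefix, plen) != 0)
            continue;
        p = names[i] + plen;
        if (*p < '0' || *p > '9')
            continue;                   /* prefix not followed by a number */
        if (*p == '0' && p[1] != '\0')
            continue;                   /* "eth01" is not what "%d" prints */
        while (*p >= '0' && *p <= '9' && unit < MAX_UNITS) {
            unit = unit * 10 + (unsigned long)(*p - '0');
            p++;
        }
        if (*p != '\0' || unit >= MAX_UNITS)
            continue;                   /* trailing text or out of range */
        map[unit / BITS_PER_WORD] |= 1UL << (unit % BITS_PER_WORD);
    }

    /* Pass 2: scan the map for the first clear bit. */
    for (i = 0; i < MAX_UNITS; i++)
        if (!(map[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD))))
            return (int)i;

    return -1;                          /* every unit is taken */
}
```

With N devices this costs O(N + MAX_UNITS) per allocation, where formatting
each candidate name and searching the list for it costs O(N) per candidate.
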
Alan

From owner-netdev@oss.sgi.com Sun Jan 7 05:47:29 2001
Received: by oss.sgi.com id ; Sun, 7 Jan 2001 05:47:19 -0800
Received: from router-100M.swansea.linux.org.uk ([194.168.151.17]:41742 "EHLO the-village.bc.nu")
    by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 05:47:12 -0800
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
    id 14FGAx-0002es-00; Sun, 7 Jan 2001 13:47:59 +0000
Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission
To: davem@redhat.com (David S. Miller)
Date: Sun, 7 Jan 2001 13:47:57 +0000 (GMT)
Cc: hadi@cyberus.ca, ak@suse.de, greearb@candelatech.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com
In-Reply-To: <200101070543.VAA24689@pizda.ninka.net> from "David S. Miller" at Jan 06, 2001 09:43:16 PM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id:
From: Alan Cox
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 717
Lines: 17

>    thing so that some stupid program that depends on ifconfig look and feel
>    would be a good start.
>
> I could not agree more. This reminds me to do something I could not
> justify before, making netlink be enabled in the kernel and
> non-configurable.

Why? It's bad enough that the networking layer doesn't let you configure out
stuff like SACK and the big routing hashes. Please don't make it even worse
for the embedded world. 99.9% of Linux boxes probably have fewer than 5 routing
table entries.

> I could almost, but not quite, justify it right now just because "ip"
> is becomming standard and needs it.

ip is also not the smallest and simplest of binaries. You can fit an ifconfig
for IP in about 24K.

From owner-netdev@oss.sgi.com Sun Jan 7 05:49:19 2001
Received: by oss.sgi.com id ; Sun, 7 Jan 2001 05:48:59 -0800
Received: from router-100M.swansea.linux.org.uk ([194.168.151.17]:43534 "EHLO the-village.bc.nu")
    by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 05:48:52 -0800
Received: from alan by the-village.bc.nu with local (Exim 2.12 #1)
    id 14FGD4-0002f7-00; Sun, 7 Jan 2001 13:50:10 +0000
Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission
To: greearb@candelatech.com (Ben Greear)
Date: Sun, 7 Jan 2001 13:50:08 +0000 (GMT)
Cc: ak@suse.de (Andi Kleen), linux-kernel@vger.kernel.org (linux-kernel), netdev@oss.sgi.com (netdev@oss.sgi.com)
In-Reply-To: <3A580B31.7998C783@candelatech.com> from "Ben Greear" at Jan 06, 2001 11:22:41 PM
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id:
From: Alan Cox
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 395
Lines: 9

> Suppose I bind a raw socket to device vlan4001 (ie I have 4k in the list
> before that one!!). Currently, that means a linear search on all devices,
> right? In that extreme example, I would expect the hash to be very
> useful.

At this point you have to ask 'why is vlan4001 an interface'. Would it not
be cleaner to add the vlan id to the entries in the list of addresses per
interface ?

From owner-netdev@oss.sgi.com Sun Jan 7 07:33:40 2001
Received: by oss.sgi.com id ; Sun, 7 Jan 2001 07:33:29 -0800
Received: from mail.zmailer.org ([194.252.70.162]:36622 "EHLO zmailer.org")
    by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 07:33:17 -0800
Received: (from localhost user: 'mea' uid#500 fake: STDIN (mea@zmailer.org))
    by mail.zmailer.org id ; Sun, 7 Jan 2001 17:33:06 +0200
Date: Sun, 7 Jan 2001 17:33:06 +0200
From: Matti Aarnio
To: Alan Cox
Cc: Ben Greear , "David S.
    Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com
Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission
Message-ID: <20010107173306.C25076@mea-ext.zmailer.org>
References: <3A57EB5E.966C91DA@candelatech.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: ; from alan@lxorguk.ukuu.org.uk on Sun, Jan 07, 2001 at 01:42:50PM +0000
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 1169
Lines: 29

On Sun, Jan 07, 2001 at 01:42:50PM +0000, Alan Cox wrote:
> > + * NOTE: That is no longer true with the addition of VLAN tags. Not
> > + *       sure which should go first, but I bet it won't make much
> > + *       difference if we are running VLANs. The good news is that
>
> It makes a lot of difference that the vlan goes 2nd. Most sane people won't
> have vlans active on a high load interface.

The VLAN tag header appears as a Layer-3 protocol, which causes a loop
into the VLAN reception code and back to the generic Layer-3 receiver.

I just tried pulling data from another machine on a normal port,
through a VLAN trunking port, to the receiving machine, and got
Fast Ethernet at wire speed. (As near as ncftp's 11.11 MB/sec is
wire speed.)

> So fix the algorithm. You want the list sorted at this point, or to generate
> a bitmap of free/used entries and scan the list then scan the map
>
> Question: How do devices with hardware vlan support fit into your model ?

Hardware VLAN support on several network cards just recognizes
VLAN tags and then does the right thing ( = skip 4 bytes ) when
computing the TCP/UDP checksum.
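
What "skip 4 bytes" amounts to can be sketched in plain C. This is an
illustration only, not code from any driver or from the VLAN patches: the
names find_l3(), struct l3_info and the ETHERTYPE_VLAN macro are made up for
the example. It relies on the 802.1Q layout, in which a tagged frame carries
the value 0x8100 in the usual EtherType position, followed by a 2-byte tag
control field (low 12 bits are the VLAN ID) and then the real EtherType.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN    14       /* dst MAC (6) + src MAC (6) + EtherType (2) */
#define VLAN_TAG_LEN    4       /* TPID (2) + tag control information (2) */
#define ETHERTYPE_VLAN 0x8100   /* 802.1Q tag protocol identifier */

struct l3_info {
    size_t   offset;   /* byte offset of the layer-3 header in the frame */
    uint16_t proto;    /* EtherType of the encapsulated protocol */
    int      vid;      /* VLAN ID (0..4095), or -1 for an untagged frame */
};

/* Read a big-endian 16-bit value. */
uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: returns 0 on success, -1 if the frame is too short. */
int find_l3(const uint8_t *frame, size_t len, struct l3_info *out)
{
    assert(frame != NULL && out != NULL);

    if (len < ETH_HDR_LEN)
        return -1;

    out->offset = ETH_HDR_LEN;
    out->proto  = get_be16(frame + 12);
    out->vid    = -1;

    if (out->proto == ETHERTYPE_VLAN) {
        if (len < ETH_HDR_LEN + VLAN_TAG_LEN)
            return -1;
        out->vid     = get_be16(frame + 14) & 0x0fff;  /* low 12 bits of TCI */
        out->proto   = get_be16(frame + 16);           /* real EtherType */
        out->offset += VLAN_TAG_LEN;                   /* the 4-byte skip */
    }
    return 0;
}
```

A checksum engine that knows about tags would start its TCP/UDP work from
out->offset rather than from the fixed 14-byte Ethernet header length.
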
> Alan

/Matti Aarnio

From owner-netdev@oss.sgi.com Sun Jan 7 07:58:00 2001
Received: by oss.sgi.com id ; Sun, 7 Jan 2001 07:57:50 -0800
Received: from shell.cyberus.ca ([209.195.95.7]:33956 "EHLO shell.cyberus.ca")
    by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 07:57:36 -0800
Received: from localhost (hadi@localhost)
    by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id KAA21162;
    Sun, 7 Jan 2001 10:56:23 -0500 (EST)
X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs
Date: Sun, 7 Jan 2001 10:56:23 -0500 (EST)
From: jamal
To: "David S. Miller"
cc: , , ,
Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!)
In-Reply-To: <200101070543.VAA24689@pizda.ninka.net>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 2309
Lines: 59

On Sat, 6 Jan 2001, David S. Miller wrote:

> I could not agree more. This reminds me to do something I could not
> justify before, making netlink be enabled in the kernel and
> non-configurable.

I always use netlink and friends for something or other: route protocols,
traffic control configuration, device configuration, firewalling. I don't
see a problem turning it on. Maybe default it to on, and then totally
get rid of the config option.

> I could almost, but not quite, justify it right now just because "ip"
> is becomming standard and needs it.

"ip" has been standard to some people for a few moons now ;->
It is better documented than ifconfig ;-> and does a lot more.

>
>    Not to stray from the subject, Ben's effort is still needed. I think real
>    numbers are useful instead of claims like it "displayed faster"
>
> See my previous email, if it's just slow because of some poorly coded
> version of ifconfig, it does not justify the patch. If only a
> forcefully created "benchmark" can show some performance problem, that
> is not an acceptable reason to champion this patch. Ok?
>

True.
All the kernel paths affected are really config/control paths as well as
slow paths (e.g. some device binding to sockets, the MSG_DONTROUTE slow path,
traffic control setup, route config and arp config are some that seem useful).

The argument for the patch (and it might be weak): if you have 4000
devices, it makes sense for two reasons:
- Some of these paths might hold locks which might stall the fast path.
- Or, if you have a busy system, they will contribute to the load.

I used to be against VLANs being devices; I am withdrawing that comment. It's
a lot easier to look at them as devices if you want to run IP on them. And
in this case, the possibility of over a thousand devices makes sense.

So instead of depending on what ifconfig does, maybe a better test for Ben is
to measure the kernel-level improvement in the lookup, for example from
2..6000 devices.
Tests with the user space tools will also help. An example to add to Andi's
flavor: "date; time ifconfig -a; date" for each number of devices.
Repeat for ip as well ;->
Plot the kernel and the two user tools' results after several iterations
(number of devices on the x-axis and some time/CPU measure on the y-axis).

cheers,
jamal

From owner-netdev@oss.sgi.com Sun Jan 7 08:13:40 2001
Received: by oss.sgi.com id ; Sun, 7 Jan 2001 08:13:30 -0800
Received: from shell.cyberus.ca ([209.195.95.7]:35492 "EHLO shell.cyberus.ca")
    by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 08:13:19 -0800
Received: from localhost (hadi@localhost)
    by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id LAA21173;
    Sun, 7 Jan 2001 11:12:23 -0500 (EST)
X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs
Date: Sun, 7 Jan 2001 11:12:23 -0500 (EST)
From: jamal
To: Alan Cox
cc: "David S.
Miller" , , , , Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1053 Lines: 31 On Sun, 7 Jan 2001, Alan Cox wrote: > Why. Its bad enough that the networking layer doesnt let you configure out > stuff like SACK and the big routing hashes. Please don't make it even worse > for the embedded world. 99.9% of Linux boxes probably have less than 5 routing > table entries Ok. Good point. But remember that parsing /proc for an embedded system is also not the most healthy thing. > > > I could almost, but not quite, justify it right now just because "ip" > > is becomming standard and needs it. > > ip is also not the smallest and simplest of binaries. You can fit an ifconfig > for ip in about 24K > ip is also a replacement of many nettools: ifconfig, route, arp config, tunneling setup etc. You can also do many funky things with it using small scripts (ip dup-address detection is documented). Seems like Alexey already has a wrapper "ifcfg" which is a ifconfig replacement. It does not format display the same way as ifconfig; however, that would only be necessary if some app is dependent on ifconfig output. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 7 08:33:00 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 08:32:40 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:55303 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 08:32:11 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f07GUWh06727; Sun, 7 Jan 2001 18:30:32 +0200 From: Gleb Natapov Date: Sun, 7 Jan 2001 18:30:32 +0200 To: jamal Cc: "David S. Miller" , ak@suse.de, greearb@candelatech.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) 
Message-ID: <20010107183032.E28257@nbase.co.il> References: <200101070543.VAA24689@pizda.ninka.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from hadi@cyberus.ca on Sun, Jan 07, 2001 at 10:56:23AM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 433 Lines: 14 On Sun, Jan 07, 2001 at 10:56:23AM -0500, jamal wrote: [snip] > > I used to be against VLANS being devices, i am withdrawing that comment; it's > a lot easier to look on them as devices if you want to run IP on them. And > in this case, it makes sense the possibilirt of over a thousand devices > is good. > Glad to hear :) So perhaps this is a good time to move one of VLAN implementations into the official kernel? -- Gleb. From owner-netdev@oss.sgi.com Sun Jan 7 08:37:49 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 08:37:29 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:37028 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 08:37:12 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id LAA21207; Sun, 7 Jan 2001 11:36:23 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 7 Jan 2001 11:36:23 -0500 (EST) From: jamal To: Gleb Natapov cc: "David S. Miller" , , , , Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) In-Reply-To: <20010107183032.E28257@nbase.co.il> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 493 Lines: 21 On Sun, 7 Jan 2001, Gleb Natapov wrote: > > I used to be against VLANS being devices, i am withdrawing that comment; it's > > a lot easier to look on them as devices if you want to run IP on them. 
And > > in this case, it makes sense the possibilirt of over a thousand devices > > is good. > > > > Glad to hear :) So perhaps this is a good time to move one of VLAN implementations > into the official kernel? > Absolutely. I think we need a VLAN implementation in there. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 7 08:45:49 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 08:45:29 -0800 Received: from router-100M.swansea.linux.org.uk ([194.168.151.17]:33039 "EHLO the-village.bc.nu") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 08:45:13 -0800 Received: from alan by the-village.bc.nu with local (Exim 2.12 #1) id 14FIxT-0002ue-00; Sun, 7 Jan 2001 16:46:15 +0000 Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission To: matti.aarnio@zmailer.org (Matti Aarnio) Date: Sun, 7 Jan 2001 16:46:14 +0000 (GMT) Cc: alan@lxorguk.ukuu.org.uk (Alan Cox), greearb@candelatech.com (Ben Greear), davem@redhat.com (David S. Miller), linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: <20010107173306.C25076@mea-ext.zmailer.org> from "Matti Aarnio" at Jan 07, 2001 05:33:06 PM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: From: Alan Cox Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 324 Lines: 7 > I just tried to pull data from another machine, which > is on normal port thru VLAN trunking port to receiving > machine, and got fast-ether at wire speed. (As near as > ncftp's 11.11 MB/sec is wirespeed..) 
But talking between two vlans on the same physical lan you will go in and back out via the switch and you wont From owner-netdev@oss.sgi.com Sun Jan 7 08:50:50 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 08:50:29 -0800 Received: from router-100M.swansea.linux.org.uk ([194.168.151.17]:37391 "EHLO the-village.bc.nu") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 08:50:17 -0800 Received: from alan by the-village.bc.nu with local (Exim 2.12 #1) id 14FJ2H-0002vI-00; Sun, 7 Jan 2001 16:51:13 +0000 Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission To: hadi@cyberus.ca (jamal) Date: Sun, 7 Jan 2001 16:51:11 +0000 (GMT) Cc: alan@lxorguk.ukuu.org.uk (Alan Cox), davem@redhat.com (David S. Miller), ak@suse.de, greearb@candelatech.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: from "jamal" at Jan 07, 2001 11:12:23 AM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: From: Alan Cox Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 190 Lines: 6 > Ok. Good point. > But remember that parsing /proc for an embedded system is also not the > most healthy thing. I dont compile in /proc either. SIOCGIFCONF is enough for an embedded box. From owner-netdev@oss.sgi.com Sun Jan 7 08:57:29 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 08:57:20 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:38052 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 08:57:08 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) 
with ESMTP id LAA21226; Sun, 7 Jan 2001 11:56:26 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 7 Jan 2001 11:56:26 -0500 (EST) From: jamal To: Chris Wedgwood cc: Ben Greear , linux-kernel , "netdev@oss.sgi.com" Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) In-Reply-To: <20010107162905.B1804@metastasis.f00f.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 719 Lines: 22 On Sun, 7 Jan 2001, Chris Wedgwood wrote: > That said, if this was done -- how would things like routing daemons > and bind cope? I dont know of any routing daemons that are taking advantage of the alias interfaces today. This being said, i think that the fact that a lot of protocols that need IP-ization are coming up eg VLANs; you should see a good use for this. Out of curiosity for the VLAN people, how do you work with something like Zebra? One could have the route daemon take charge of management of these devices, a master device like "eth0" and a attached device like "vlan0". They both share the same ifindex but different have labels. Basically, i dont think there would be a problem. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 7 09:32:50 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 09:32:39 -0800 Received: from mail.zmailer.org ([194.252.70.162]:49422 "EHLO zmailer.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 09:32:16 -0800 Received: (from localhost user: 'mea' uid#500 fake: STDIN (mea@zmailer.org)) by mail.zmailer.org id ; Sun, 7 Jan 2001 19:32:13 +0200 Date: Sun, 7 Jan 2001 19:32:13 +0200 From: Matti Aarnio To: Alan Cox Cc: Matti Aarnio , Ben Greear , "David S. 
Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission Message-ID: <20010107193213.D25076@mea-ext.zmailer.org> References: <20010107173306.C25076@mea-ext.zmailer.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from alan@lxorguk.ukuu.org.uk on Sun, Jan 07, 2001 at 04:46:14PM +0000 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 575 Lines: 16 On Sun, Jan 07, 2001 at 04:46:14PM +0000, Alan Cox wrote: > But talking between two vlans on the same physical lan you will go in and back > out via the switch and you wont So? If your box is routing in between VLANs, you are using it the wrong way, IMO. On the other hand, I could very well put clients in some building into a set where I have a switch with an FE connection to the router, and lots of 10BaseT ports to clients. Hard-limiting bandwidth to said 10 Mbit. I use VLAN trunked systems mainly for network administration, for DHCP servers. /Matti Aarnio From owner-netdev@oss.sgi.com Sun Jan 7 09:39:39 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 09:39:20 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:522 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 09:39:13 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f07Hbvh07074; Sun, 7 Jan 2001 19:37:57 +0200 From: Gleb Natapov Date: Sun, 7 Jan 2001 19:37:57 +0200 To: jamal Cc: Chris Wedgwood , Ben Greear , linux-kernel , "netdev@oss.sgi.com" Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!)
Message-ID: <20010107193757.F28257@nbase.co.il> References: <20010107162905.B1804@metastasis.f00f.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from hadi@cyberus.ca on Sun, Jan 07, 2001 at 11:56:26AM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1103 Lines: 29 On Sun, Jan 07, 2001 at 11:56:26AM -0500, jamal wrote: > > > On Sun, 7 Jan 2001, Chris Wedgwood wrote: > > > That said, if this was done -- how would things like routing daemons > > and bind cope? > > I dont know of any routing daemons that are taking advantage of the > alias interfaces today. This being said, i think that the fact that a > lot of protocols that need IP-ization are coming up eg VLANs; you should > see a good use for this. Out of curiosity for the VLAN people, how do you > work with something like Zebra? Without any problems. Zebra sees different VLAN interfaces as different networks and happily route between them. > One could have the route daemon take charge of management of these > devices, a master device like "eth0" and a attached device like "vlan0". > They both share the same ifindex but different have labels. > Basically, i dont think there would be a problem. > Theoretically it seems to be possible but it's much harder to do in Zebra than in kernel. And "eth0" shouldn't share ifindex with "vlan0" I don't think SNMP will be happy about that. -- Gleb. 
From owner-netdev@oss.sgi.com Sun Jan 7 09:59:50 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 09:59:41 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:32013 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 09:59:27 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id MAA08375; Sun, 7 Jan 2001 12:02:28 -0700 Message-ID: <3A58BD44.1381D182@candelatech.com> Date: Sun, 07 Jan 2001 12:02:28 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: Alan Cox CC: "David S. Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1694 Lines: 43 Alan Cox wrote: > > > + * NOTE: That is no longer true with the addition of VLAN tags. Not > > + * sure which should go first, but I bet it won't make much > > + * difference if we are running VLANs. The good news is that > > It makes a lot of difference tha the vlan goes 2nd. Most sane people wont > have vlans active on a high load interface. Um, what about people running their box as just a VLAN router/firewall? That seems to be one of the principle uses so far. Actually, in that case both VLAN and IP traffic would come through, so it would be a tie if VLAN came first, but non-vlan traffic would suffer worse. So, how can I make sure that it is second in the list? > > > strcpy(dev->name, buf); > > return i; > > } > > } > > - return -ENFILE; /* Over 100 of the things .. bail out! */ > > + return -ENFILE; /* Over 8192 of the things .. bail out! */ > > So fix the algorithm. 
You want the list sorted at this point, or to generate > a bitmap of free/used entries and scan the list then scan the map Actually, VLAN code no longer uses this method to generate its name, it uses its own mechanism (which, by the way, the hashed name lookup makes much faster.) So, this part of the patch can be removed. > > Question: How do devices with hardware vlan support fit into your model ? I don't know of any, and I'm not sure how they would be supported. > > Alan -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sun Jan 7 10:03:50 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:03:40 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:39588 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:03:24 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id NAA21304; Sun, 7 Jan 2001 13:02:24 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 7 Jan 2001 13:02:24 -0500 (EST) From: jamal To: Gleb Natapov cc: Chris Wedgwood , Ben Greear , linux-kernel , "netdev@oss.sgi.com" Subject: routable interfaces WAS( Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) In-Reply-To: <20010107193757.F28257@nbase.co.il> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1317 Lines: 33 On Sun, 7 Jan 2001, Gleb Natapov wrote: > > One could have the route daemon take charge of management of these > > devices, a master device like "eth0" and a attached device like "vlan0". > > They both share the same ifindex but different have labels. > > Basically, i dont think there would be a problem.
> > > > Theoretically it seems to be possible but it's much harder to do in Zebra than > in kernel. And "eth0" shouldn't share ifindex with "vlan0" I don't think SNMP > will be happy about that. A very good reason why you would want them to have separate ifindices. Essentially, vlans have to be separate interfaces today. Other "virtual" interfaces such as aliased devices are not going to work with route daemons today since they dont meet this requirement. Not to rain on Ben's parade but: My thought was to have the vlan be attached on the interface ifa list and just give it a different label since it is a "virtual interface" on top of the "physical interface". Now that you mention the SNMP requirement, maybe an idea of major:minor ifindex makes sense. Say make the ifindex a u32 with major 16 bit and minor 16 bit. This way we can have upto 2^16 physical interfaces and upto 2^16 virtual interfaces on the physical interface. The search will be broken into two 16 bits. Thoughts? cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 7 10:06:10 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:05:50 -0800 Received: from router-100M.swansea.linux.org.uk ([194.168.151.17]:62991 "EHLO the-village.bc.nu") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:05:39 -0800 Received: from alan by the-village.bc.nu with local (Exim 2.12 #1) id 14FKDI-00033e-00; Sun, 7 Jan 2001 18:06:40 +0000 Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission To: greearb@candelatech.com (Ben Greear) Date: Sun, 7 Jan 2001 18:06:37 +0000 (GMT) Cc: alan@lxorguk.ukuu.org.uk (Alan Cox), davem@redhat.com (David S. 
Miller), linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: <3A58BD44.1381D182@candelatech.com> from "Ben Greear" at Jan 07, 2001 12:02:28 PM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: From: Alan Cox Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1049 Lines: 25 > Um, what about people running their box as just a VLAN router/firewall? > That seems to be one of the principle uses so far. Actually, in that case > both VLAN and IP traffic would come through, so it would be a tie if VLAN > came first, but non-vlan traffic would suffer worse. Why would someone filter between vlans when any node on each vlan can happily ignore the vlan partitioning > So, how can I make sure that it is second in the list? Register vlan in the top level protocol hash then have that yank the header and feed the packets through the hash again. > Actually, VLAN code no longer uses this method to generate it's name, > it uses it's own mechanism (which, by the way, the hashed name lookup > makes much faster.) So, this part of the patch can be removed. Ok > > Question: How do devices with hardware vlan support fit into your model ? > I don't know of any, and I'm not sure how they would be supported. 
Several cards have vlan ability, but Matti reports they just lose the header not filter on it if I understood him From owner-netdev@oss.sgi.com Sun Jan 7 10:07:20 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:07:10 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:37389 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:07:01 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id MAA08445; Sun, 7 Jan 2001 12:09:56 -0700 Message-ID: <3A58BF04.D7304A0A@candelatech.com> Date: Sun, 07 Jan 2001 12:09:56 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: Alan Cox CC: Andi Kleen , linux-kernel , "netdev@oss.sgi.com" Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1082 Lines: 25 Alan Cox wrote: > > > Suppose I bind a raw socket to device vlan4001 (ie I have 4k in the list > > before that one!!). Currently, that means a linear search on all devices, > > right? In that extreme example, I would expect the hash to be very > > useful. > > At this point you have to ask 'why is vlan4001 an interface'. Would it not > be cleaner to add the vlan id to the entries in the list of addresses per > interface ? Among other things, some VLAN switches won't work unless you can change the MAC address on your VLANs to be different from the rest of the VLAN MACs on that physical interface. For OSPF you also need to have multicast work on them, and other things that look very much like a real interface. Also, by making the VLANs a net_device, the rest of the kernel and user-space code (ip, ifconfig, for example), works as expected, with no changes. 
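For readers following the thread, the idea behind the hashed name lookup can be sketched in user-space C. This is not the code from Ben's patch: the struct, the bucket count and every function name below are made up for illustration, and the djb2 string hash is simply a convenient stand-in for whatever hash the patch uses.

```c
#include <stddef.h>
#include <string.h>

#define NAME_HASH_BUCKETS 256   /* assumed table size, must be a power of two */
#define DEMO_IFNAMSIZ     16    /* assumed maximum name length incl. NUL */

/* Hypothetical stand-in for a network device entry. */
struct demo_dev {
    char name[DEMO_IFNAMSIZ];
    struct demo_dev *hash_next;   /* next entry in the same bucket */
};

static struct demo_dev *name_hash[NAME_HASH_BUCKETS];

/* djb2 string hash, folded into the bucket range. */
static unsigned int hash_name(const char *name)
{
    unsigned int h = 5381;

    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h & (NAME_HASH_BUCKETS - 1);
}

/* Insert a device at the head of its bucket chain. */
static void demo_dev_add(struct demo_dev *dev)
{
    unsigned int b = hash_name(dev->name);

    dev->hash_next = name_hash[b];
    name_hash[b] = dev;
}

/* Walk only one bucket instead of the whole device list. */
static struct demo_dev *demo_dev_get_by_name(const char *name)
{
    struct demo_dev *dev;

    for (dev = name_hash[hash_name(name)]; dev; dev = dev->hash_next)
        if (strcmp(dev->name, name) == 0)
            return dev;
    return NULL;
}
```

With a few thousand VLAN devices registered, a lookup of "vlan4001" touches only the entries that share its bucket, which is the effect the raw-socket example in this message is after.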
-- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sun Jan 7 10:19:10 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:19:00 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:43533 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:18:35 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id MAA08526; Sun, 7 Jan 2001 12:21:45 -0700 Message-ID: <3A58C1C9.1E4B6265@candelatech.com> Date: Sun, 07 Jan 2001 12:21:45 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Gleb Natapov , Chris Wedgwood , linux-kernel , "netdev@oss.sgi.com" Subject: Re: routable interfaces WAS( Re: [PATCH] hashed device lookup (DoesNOT meet Linus' sumission policy!) References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1372 Lines: 35 jamal wrote: > A very good reason why you would want them to have separate ifindices. > Essentially, vlans have to be separate interfaces today. Other "virtual" > interfaces such as aliased devices are not going to work with route > daemons today since they dont meet this requirement. > > Not to rain on Ben's parade but: > My thought was to have the vlan be attached on the interface ifa list and > just give it a different label since it is a "virtual interface" on top > of the "physical interface". Now that you mention the SNMP requirement, > maybe an idea of major:minor ifindex makes sense. Say make the ifindex > a u32 with major 16 bit and minor 16 bit. 
This way we can have upto 2^16 > physical interfaces and upto 2^16 virtual interfaces on the physical > interface. The search will be broken into two 16 bits. What problem does this fix? If you are mucking with the ifindex, you may be affecting many places in the rest of the kernel, as well as user-space programs which use ifindex to bind to raw devices. On the other hand, the hash patch touches only one file, and should not have any external impacts. > > Thoughts? > > cheers, > jamal -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sun Jan 7 10:22:20 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:22:10 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:42404 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:22:04 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id NAA21335; Sun, 7 Jan 2001 13:21:11 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 7 Jan 2001 13:21:11 -0500 (EST) From: jamal To: Ben Greear cc: Alan Cox , "David S. Miller" , , Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission In-Reply-To: <3A58BD44.1381D182@candelatech.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 385 Lines: 16 On Sun, 7 Jan 2001, Ben Greear wrote: > > Question: How do devices with hardware vlan support fit into your model ? > > I don't know of any, and I'm not sure how they would be supported. > erm, this is a MUST. You MUST factor the hardware VLANs and be totaly 802.1q compliant. Also of interest is 802.1P and D. We must have full compliance, not some toy emulation. 
cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 7 10:28:10 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:27:50 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:48909 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:27:43 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id MAA08602; Sun, 7 Jan 2001 12:30:48 -0700 Message-ID: <3A58C3E8.FF5FF68E@candelatech.com> Date: Sun, 07 Jan 2001 12:30:48 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: Alan Cox CC: "David S. Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1679 Lines: 37 Alan Cox wrote: > > > Um, what about people running their box as just a VLAN router/firewall? > > That seems to be one of the principle uses so far. Actually, in that case > > both VLAN and IP traffic would come through, so it would be a tie if VLAN > > came first, but non-vlan traffic would suffer worse. > > Why would someone filter between vlans when any node on each vlan can happily > ignore the vlan partitioning Suppose you have a 100bt link upstream, and want to re-sell that as 10 10Mb links to all the customers in one building. With VLANs, you can haul all the data over one wire to a Linux box with 11 interfaces: 1 running VLAN (100bt), and 10 others running 10bt ethernet. Now, your uses are segregated, and you only have 1 100bt wire running to the basement, instead of 10. Alternately, if you have a VLAN ethernet switch, your linux box just feeds 100bt into it, and acts as a router with 10 (vlan) interfaces. 
In either of these cases, assuming the etherswitch and/or Linux box is secure, the customers will not be able to be on other peoples VLAN. This enables all kinds of routing/billing possibilities... > > So, how can I make sure that it is second in the list? > > Register vlan in the top level protocol hash then have that yank the header > and feed the packets through the hash again. Thats what it already does, if I understand correctly. Of course, if VLAN is loaded as a module, then it will be in the hash before IP, right? -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sun Jan 7 10:30:40 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:30:20 -0800 Received: from router-100M.swansea.linux.org.uk ([194.168.151.17]:6160 "EHLO the-village.bc.nu") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:30:07 -0800 Received: from alan by the-village.bc.nu with local (Exim 2.12 #1) id 14FKat-00036O-00; Sun, 7 Jan 2001 18:31:03 +0000 Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission To: greearb@candelatech.com (Ben Greear) Date: Sun, 7 Jan 2001 18:30:59 +0000 (GMT) Cc: alan@lxorguk.ukuu.org.uk (Alan Cox), davem@redhat.com (David S. Miller), linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: <3A58C3E8.FF5FF68E@candelatech.com> from "Ben Greear" at Jan 07, 2001 12:30:48 PM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: From: Alan Cox Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 339 Lines: 6 > Thats what it already does, if I understand correctly. Of course, if VLAN > is loaded as a module, then it will be in the hash before IP, right? Thats fine. I think it'll be a different hash bucket anyway. 
The point of having vlan first is that if its not registered or the interface isnt doing vlan there is basically a zero overhead From owner-netdev@oss.sgi.com Sun Jan 7 10:31:30 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:31:10 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:43684 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:31:02 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id NAA21340; Sun, 7 Jan 2001 13:29:51 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 7 Jan 2001 13:29:51 -0500 (EST) From: jamal To: Ben Greear cc: Gleb Natapov , Chris Wedgwood , linux-kernel , "netdev@oss.sgi.com" Subject: Re: routable interfaces WAS( Re: [PATCH] hashed device lookup (DoesNOT meet Linus' sumission policy!) In-Reply-To: <3A58C1C9.1E4B6265@candelatech.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1119 Lines: 29 On Sun, 7 Jan 2001, Ben Greear wrote: > > My thought was to have the vlan be attached on the interface ifa list and > > just give it a different label since it is a "virtual interface" on top > > of the "physical interface". Now that you mention the SNMP requirement, > > maybe an idea of major:minor ifindex makes sense. Say make the ifindex > > a u32 with major 16 bit and minor 16 bit. This way we can have upto 2^16 > > physical interfaces and upto 2^16 virtual interfaces on the physical > > interface. The search will be broken into two 16 bits. > > What problem does this fix? > > If you are mucking with the ifindex, you may be affecting many places > in the rest of the kernel, as well as user-space programs which use > ifindex to bind to raw devices. > I am talking about 2.5 possibilities now that 2.4 is out. 
I think "parasitic/virtual" interfaces is not a issue specific to VLANs. VLANs happen to use devices today to solve the problem. As pointed by that example no routing daemons are doing aliased interfaces (which are also virtual interfaces). We need some more general solution. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 7 10:35:31 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:35:10 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:54541 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:35:07 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id MAA08674; Sun, 7 Jan 2001 12:37:30 -0700 Message-ID: <3A58C57A.131BFEF2@candelatech.com> Date: Sun, 07 Jan 2001 12:37:30 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Alan Cox , "David S. Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1403 Lines: 40 jamal wrote: > > On Sun, 7 Jan 2001, Ben Greear wrote: > > > > Question: How do devices with hardware vlan support fit into your model ? > > > > I don't know of any, and I'm not sure how they would be supported. > > > > erm, this is a MUST. You MUST factor the hardware VLANs and be totaly > 802.1q compliant. Also of interest is 802.1P and D. We must have full > compliance, not some toy emulation. I have seen neither hardware nor spec sheets on how these NICs are doing VLAN 'support'. So, I don't know what the best way to support them is. If it requires driver changes, then the ethernet driver folks will need to be involved. 
There is also a difference between supporting hardware VLAN solutions and being 100% compliant: If I can send/receive packets that are 100% compliant from an RTL 8139 NIC, then as far as the world (i.e. the switch) knows, I am 100% compliant. If the specific VLAN hardware features are not supported in some exotic NIC, then that should just mean slightly less performance, or, worst case, not supporting that particular NIC. My vlan code supports setting of Priority bits already (that's the .1P, right?) What is the .1D stuff about? > > cheers, > jamal -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sun Jan 7 10:52:00 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:51:50 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:64013 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:51:25 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id MAA08808; Sun, 7 Jan 2001 12:54:35 -0700 Message-ID: <3A58C97B.A3C5676B@candelatech.com> Date: Sun, 07 Jan 2001 12:54:35 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: "David S. Miller" , ak@suse.de, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumissionpolicy!) References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1019 Lines: 27 jamal wrote: > So instead of depending what ifconfig does, maybe a better test for Ben is > to measure the kernel level improvement in the lookup for example from > 2..6000 devices.
In the benchmark I gave, the performance increase was in the kernel, not user space, and it was more than 10 times faster, at least with 4k VLANs. Adding VLANs was about twice as fast, and deleting them was faster, though not as much. Tests with the user space tools will also help. example > to add to Andi's flavor: > "date; time ifconfig -a; date" for each number of devices. > repeat for ip as well ;-> I can show a range w/out much trouble. I think I'll also tweak the hash code to just do linear lookups if the number of interfaces is below some number, (probably 20, or whatever the numbers show is good...) Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sun Jan 7 10:53:00 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:52:40 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:35340 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:52:33 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f07IpDx07550; Sun, 7 Jan 2001 20:51:13 +0200 From: Gleb Natapov Date: Sun, 7 Jan 2001 20:51:13 +0200 To: jamal Cc: Ben Greear , Chris Wedgwood , linux-kernel , "netdev@oss.sgi.com" Subject: Re: routable interfaces WAS( Re: [PATCH] hashed device lookup (DoesNOT meet Linus' sumission policy!) 
Message-ID: <20010107205113.H28257@nbase.co.il> References: <3A58C1C9.1E4B6265@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from hadi@cyberus.ca on Sun, Jan 07, 2001 at 01:29:51PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1408 Lines: 35 On Sun, Jan 07, 2001 at 01:29:51PM -0500, jamal wrote: > > > On Sun, 7 Jan 2001, Ben Greear wrote: > > > > My thought was to have the vlan be attached on the interface ifa list and > > > just give it a different label since it is a "virtual interface" on top > > > of the "physical interface". Now that you mention the SNMP requirement, > > > maybe an idea of major:minor ifindex makes sense. Say make the ifindex > > > a u32 with major 16 bit and minor 16 bit. This way we can have upto 2^16 > > > physical interfaces and upto 2^16 virtual interfaces on the physical > > > interface. The search will be broken into two 16 bits. > > > > What problem does this fix? > > > > If you are mucking with the ifindex, you may be affecting many places > > in the rest of the kernel, as well as user-space programs which use > > ifindex to bind to raw devices. > > > > I am talking about 2.5 possibilities now that 2.4 is out. I think > "parasitic/virtual" interfaces is not a issue specific to VLANs. > VLANs happen to use devices today to solve the problem. > As pointed by that example no routing daemons are doing aliased > interfaces (which are also virtual interfaces). > We need some more general solution. > And what about bonding device? What major number should they use? Ifindexes not reusable so in your scheme we should have separate minor counter for each major interface, what for? -- Gleb. 
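The major:minor ifindex idea quoted above amounts to simple bit packing. A minimal sketch in user-space C, assuming the major number occupies the upper 16 bits of the u32 and the minor the lower 16; the helper names are hypothetical and nothing like them exists in the kernel.

```c
#include <stdint.h>

#define IFINDEX_MINOR_BITS 16        /* assumed split: 16 bits major, 16 bits minor */
#define IFINDEX_MINOR_MASK 0xFFFFu

/* Combine a physical (major) and virtual (minor) number into one 32-bit index. */
static inline uint32_t ifindex_pack(uint16_t major, uint16_t minor)
{
    return ((uint32_t)major << IFINDEX_MINOR_BITS) | minor;
}

/* Physical interface part of the index. */
static inline uint16_t ifindex_major(uint32_t ifindex)
{
    return (uint16_t)(ifindex >> IFINDEX_MINOR_BITS);
}

/* Virtual interface part of the index; 0 would mean the physical device itself. */
static inline uint16_t ifindex_minor(uint32_t ifindex)
{
    return (uint16_t)(ifindex & IFINDEX_MINOR_MASK);
}
```

Under this layout a physical device could be (2, 0) and a VLAN stacked on it (2, 5). The sketch does not answer the objection raised in this message: every major number would need its own minor counter, and devices such as bonding that sit on several physical interfaces have no obvious major.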
From owner-netdev@oss.sgi.com Sun Jan 7 10:54:00 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:53:41 -0800 Received: from mail.zmailer.org ([194.252.70.162]:51214 "EHLO zmailer.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:53:34 -0800 Received: (from localhost user: 'mea' uid#500 fake: STDIN (mea@zmailer.org)) by mail.zmailer.org id ; Sun, 7 Jan 2001 20:53:15 +0200 Date: Sun, 7 Jan 2001 20:53:15 +0200 From: Matti Aarnio To: Alan Cox Cc: Ben Greear , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission Message-ID: <20010107205315.F25076@mea-ext.zmailer.org> References: <3A58BD44.1381D182@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from alan@lxorguk.ukuu.org.uk on Sun, Jan 07, 2001 at 06:06:37PM +0000 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2491 Lines: 62 On Sun, Jan 07, 2001 at 06:06:37PM +0000, Alan Cox wrote: > > Um, what about people running their box as just a VLAN router/firewall? > > That seems to be one of the principle uses so far. Actually, in that case > > both VLAN and IP traffic would come through, so it would be a tie if VLAN > > came first, but non-vlan traffic would suffer worse. > > Why would someone filter between vlans when any node on each vlan can happily > ignore the vlan partitioning VLANs are Layer 2, that is SWITCHING. They have no real meaning unless you have a switching fabric, in which they present ways to hard-partition ports into different switching domains without having physically separate cabling. Normal hosts are connected on non-trunking ports, and only some rare systems are connected to 802.1Q trunks so they can access multiple VLANs inside the fabric. No ordinary hosts are able to choose which VLANs they are on.
Trunking ports have ways to control which VLANs are allowed to go thru them (at least on the Cisco hardware I am familiar with). > > So, how can I make sure that it is second in the list? > > Register vlan in the top level protocol hash then have that yank the header > and feed the packets through the hash again. That is what the two existing VLAN codes for Linux do now. A better(?) way could be to have a device-specific reception vector in addition to the xmit vector. That way we could stack "Layer-2 protocols", like 802.1Q and (to an extent) even 802 bridging. See ftp://zmailer.org/linux/netif_rx.patch After all, if you have a way to plumb reception to an optional bridging layer, you probably would not need netif_rx() contained bridging code. > > > Question: How do devices with hardware vlan support fit into your model ? > > I don't know of any, and I'm not sure how they would be supported. > > Several cards have vlan ability, but Matti reports they just lose the header > not filter on it if I understood him No, you didn't understand. Nothing is lost, it relates to hardware assisted received IP frame TCP/UDP checksumming by the network cards. Some cards support that, some support it even in the presence of an 802.1Q TAG header. I don't yet see any cards which have hardware assist for IPv6 checksumming, VLAN tags or not. Reception must first handle tearing off the VLAN header when receiving the frame, then return back to netif_rx() to see what was inside - SNAP frame, IPv4 frame, whatever. /Matti Aarnio From owner-netdev@oss.sgi.com Sun Jan 7 10:55:10 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 10:55:00 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:44964 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 10:54:49 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.)
with ESMTP id NAA21377; Sun, 7 Jan 2001 13:53:48 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 7 Jan 2001 13:53:48 -0500 (EST) From: jamal To: Ben Greear cc: Alan Cox , "David S. Miller" , , Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission In-Reply-To: <3A58C57A.131BFEF2@candelatech.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1803 Lines: 56 On Sun, 7 Jan 2001, Ben Greear wrote: > jamal wrote: > > > > erm, this is a MUST. You MUST factor the hardware VLANs and be totaly > > 802.1q compliant. Also of interest is 802.1P and D. We must have full > > compliance, not some toy emulation. > > I have seen neither hardware nor spec sheets on how these NICs are doing > VLAN 'support'. So, I don't know what the best way to support them is. > Most of the GIGe interfaces do provide VLAN insertion/removal. You pass/receive it as part of the DMA descriptor. > If it requires driver changes, then the ethernet driver folks will need > to be involved. > I think the design MUST consider not just a poor man's VLAN way of doing things. You and the other VLAN folks (Gleb and co) will have to iron that out. Basically, all i am saying is that if there is going to be a linux solution at some point, the requirement for these devices is a MUST. Please involve me in discussions you guys end up having. > There is also a difference between supporting hardware VLAN solutions > and being 100% compliant: If I can send/receive packets that are > 100% compliant from an RTL 8139 NIC, then as far as the world (ie Switch) knows, > I am 100% compliant. > ok. > If the specific VLAN hardware features are not supported in some exotic > NIC, then that should just mean slightly less performance, or worst cast, > not supporting that particular NIC. 
The included design must be flexible enough to allow for this. As much as i hate it, some vendors will continue releasing binaries only for their code. > > My vlan code supports setting of Priority bits already (thats' the .1P, right?) > right. There's a lot of work to be done in that area. > What is the .1D stuff about? > spanning tree. Seems the bridging code already does this. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 7 11:01:11 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 11:01:01 -0800 Received: from mail.zmailer.org ([194.252.70.162]:52238 "EHLO zmailer.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 11:00:48 -0800 Received: (from localhost user: 'mea' uid#500 fake: STDIN (mea@zmailer.org)) by mail.zmailer.org id ; Sun, 7 Jan 2001 21:00:36 +0200 Date: Sun, 7 Jan 2001 21:00:36 +0200 From: Matti Aarnio To: jamal Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission Message-ID: <20010107210036.G25076@mea-ext.zmailer.org> References: <3A58BD44.1381D182@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from hadi@cyberus.ca on Sun, Jan 07, 2001 at 01:21:11PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 732 Lines: 20 On Sun, Jan 07, 2001 at 01:21:11PM -0500, jamal wrote: > On Sun, 7 Jan 2001, Ben Greear wrote: > > > Question: How do devices with hardware vlan support fit into your model ? > > I don't know of any, and I'm not sure how they would be supported. > > erm, this is a MUST. You MUST factor the hardware VLANs and be totaly > 802.1q compliant. Also of interest is 802.1P and D. We must have full > compliance, not some toy emulation. Read what I wrote about the issue to Alan. Ben's code has no problems with receiving VLANs with network cards which have "hardware support" for VLANs. But this is far away from device name hashes. 
(Which aren't in data reception or sending fastpaths, anyway.) > cheers, > jamal /Matti Aarnio From owner-netdev@oss.sgi.com Sun Jan 7 11:07:21 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 11:07:01 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:46500 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 11:06:42 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id OAA21385; Sun, 7 Jan 2001 14:05:43 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 7 Jan 2001 14:05:43 -0500 (EST) From: jamal To: Gleb Natapov cc: Ben Greear , Chris Wedgwood , linux-kernel , "netdev@oss.sgi.com" Subject: Re: routable interfaces WAS( Re: [PATCH] hashed device lookup (DoesNOT meet Linus' sumission policy!) In-Reply-To: <20010107205113.H28257@nbase.co.il> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1131 Lines: 39 On Sun, 7 Jan 2001, Gleb Natapov wrote: > And what about bonding device? What major number should they use? Would that include several ifindeces? use standards. 802.3ad(?). Didnt Intel release some code on this or are they still playing the big bad corporation? Normaly standards will take care of things like MIBs etc. > Ifindexes not reusable so in your scheme we should have separate minor > counter for each major interface, what for? Still each "interface" has its own ifindex, so counters will be per-interface. Here's what i mean: netdevice (major part of ifindex, proto unaware link layer) | | --> protocol level (IPV4, V6 etc) | ..........| | |--> interface 2^16 | -->interface 1 The interface could be looked as something on top of a device (struct ifa) and is distinguishable on the device by its minor number. EG ifindex 1002 is eth0:2. 
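A rough sketch in C of the major/minor ifindex split described above. The decimal span of 1000 is purely an assumption, chosen so that the example "ifindex 1002 is eth0:2" works out with eth0 as major 1; the "2^16" in the diagram suggests a 16-bit minor field may be what is really intended. The names IFINDEX_MINOR_SPAN, ifindex_make(), ifindex_major() and ifindex_minor() are hypothetical and do not exist in the kernel.

```c
#include <assert.h>

/* Hypothetical: number of minor slots each device (major) owns.
 * 1000 is an assumption that matches the "1002 is eth0:2" example. */
#define IFINDEX_MINOR_SPAN 1000

/* Compose an ifindex from a device (major) and an interface (minor). */
int ifindex_make(int major, int minor)
{
    assert(major > 0);
    assert(minor >= 0 && minor < IFINDEX_MINOR_SPAN);
    return major * IFINDEX_MINOR_SPAN + minor;
}

/* Recover the device part: which net_device the interface sits on. */
int ifindex_major(int ifindex)
{
    return ifindex / IFINDEX_MINOR_SPAN;
}

/* Recover the interface part: which alias/virtual interface on that device. */
int ifindex_minor(int ifindex)
{
    return ifindex % IFINDEX_MINOR_SPAN;
}
```

With this encoding, everything that shares a major number belongs to the same device, so counters or state can be kept per interface while still being found through the device.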
I could write a whole lengthy RFC if it is of interest and we could use that as a starting point for discussion. Note, i dont think this would affect the core code other than the setup part. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 7 11:11:51 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 11:11:41 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:47268 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 11:11:38 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id OAA21389; Sun, 7 Jan 2001 14:10:52 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 7 Jan 2001 14:10:52 -0500 (EST) From: jamal To: Matti Aarnio cc: , Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission In-Reply-To: <20010107210036.G25076@mea-ext.zmailer.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 344 Lines: 15 On Sun, 7 Jan 2001, Matti Aarnio wrote: > Read what I wrote about the issue to Alan. > Ben's code has no problems with receiving VLANs with network > cards which have "hardware support" for VLANs. > OK. I suppose an skb->vlan_tag is passed to the driver and it will know what to do with it (pass it on a descriptor etc). 
cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 7 11:20:41 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 11:20:32 -0800 Received: from storm.ca ([209.87.239.69]:13499 "EHLO mail.storm.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 11:20:25 -0800 Received: from storm.ca (ppp-209-87-255-4.ottawa.storm.ca [209.87.255.4]) by mail.storm.ca (8.9.3+Sun/8.9.3) with ESMTP id OAA15441; Sun, 7 Jan 2001 14:20:17 -0500 (EST) Message-ID: <3A58C137.63907CDC@storm.ca> Date: Sun, 07 Jan 2001 14:19:19 -0500 From: Sandy Harris X-Mailer: Mozilla 4.76 [en] (Win98; U) X-Accept-Language: en,fr MIME-Version: 1.0 To: jamal CC: linux-kernel , "netdev@oss.sgi.com" Subject: Re: routable interfaces WAS( Re: [PATCH] hashed device lookup(DoesNOT meet Linus' sumission policy!) References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 928 Lines: 23 jamal wrote: > > What problem does this fix? > > > > If you are mucking with the ifindex, you may be affecting many places > > in the rest of the kernel, as well as user-space programs which use > > ifindex to bind to raw devices. > > I am talking about 2.5 possibilities now that 2.4 is out. I think > "parasitic/virtual" interfaces is not a issue specific to VLANs. > VLANs happen to use devices today to solve the problem. > As pointed by that example no routing daemons are doing aliased > interfaces (which are also virtual interfaces). > We need some more general solution. > Something like this also becomes an issue when you want routing daemons to interact sensibly with IPSEC tunnels. A paper on these issues is at: http://www.quintillion.com/fdis/moat/ipsec+routing/ It is not (AFAIK) clear that the FreeS/WAN team will adopt the solutions suggested there, but it is very clear we need to deal with those issues. 
From owner-netdev@oss.sgi.com Sun Jan 7 11:24:42 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 11:24:32 -0800 Received: from mail.zmailer.org ([194.252.70.162]:53006 "EHLO zmailer.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 11:24:24 -0800 Received: (from localhost user: 'mea' uid#500 fake: STDIN (mea@zmailer.org)) by mail.zmailer.org id ; Sun, 7 Jan 2001 21:24:14 +0200 Date: Sun, 7 Jan 2001 21:24:14 +0200 From: Matti Aarnio To: jamal Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission Message-ID: <20010107212414.B25659@mea-ext.zmailer.org> References: <20010107210036.G25076@mea-ext.zmailer.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from hadi@cyberus.ca on Sun, Jan 07, 2001 at 02:10:52PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1206 Lines: 32 On Sun, Jan 07, 2001 at 02:10:52PM -0500, jamal wrote: > On Sun, 7 Jan 2001, Matti Aarnio wrote: > > Read what I wrote about the issue to Alan. > > Ben's code has no problems with receiving VLANs with network > > cards which have "hardware support" for VLANs. > > OK. I suppose an skb->vlan_tag is passed to the driver and it will know > what to do with it (pass it on a descriptor etc). Sure, nice. WHY SHOULD THERE BE MORE LAYER-2 STUFF ADDED TO SKB OBJECTS ? One of important abstraction issues is to isolate device specific new things (like what VLAN/PVC/SVC is used at your favourtite 802.1Q/ATM/X.25/FrameRelay connection). The less we leak that kind of things to SKB, the better, IMO. They are net_device issues, after all. Tell me (if you can), why packet sender calls hardware-header generation for packet, if the card can insert it for you ? Consider the structure of Ethernet MAC header, where is source address ? Where is the destination address ? 
If you write the destination, why should you not write the source there too ? No doubt some cards can fill in the source address while doing frame transmit, but is it worth the hassle ? > cheers, > jamal /Matti Aarnio From owner-netdev@oss.sgi.com Sun Jan 7 11:39:21 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 11:39:12 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:21006 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 11:39:00 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id NAA09158; Sun, 7 Jan 2001 13:42:05 -0700 Message-ID: <3A58D49D.C4152BD5@candelatech.com> Date: Sun, 07 Jan 2001 13:42:05 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: Sandy Harris CC: jamal , linux-kernel , "netdev@oss.sgi.com" Subject: Re: routable interfaces WAS( Re: [PATCH] hashed device lookup(DoesNOT meet Linus' sumission policy!) References: <3A58C137.63907CDC@storm.ca> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1940 Lines: 48 Sandy Harris wrote: > > jamal wrote: > > > > What problem does this fix? > > > > > > If you are mucking with the ifindex, you may be affecting many places > > > in the rest of the kernel, as well as user-space programs which use > > > ifindex to bind to raw devices. > > > > I am talking about 2.5 possibilities now that 2.4 is out. I think > > "parasitic/virtual" interfaces are not an issue specific to VLANs. > > VLANs happen to use devices today to solve the problem. > > As pointed out by that example, no routing daemons are doing aliased > > interfaces (which are also virtual interfaces). > > We need some more general solution.
> > > Something like this also becomes an issue when you want routing > daemons to interact sensibly with IPSEC tunnels. A paper on these > issues is at: > > http://www.quintillion.com/fdis/moat/ipsec+routing/ > > It is not (AFAIK) clear that the FreeS/WAN team will adopt the solutions > suggested there, but it is very clear we need to deal with those issues. Hrm, what if they just made each IP-SEC interface a net_device? If they are a routable entity, with it's own IP address, it starts to look a lot like an interface/net_device. This has seeming worked well for VLANs: Maybe net_device is already general enough?? So, what would be the down-side of having VLANs and other virtual interfaces be net_devices? The only thing I ever thought of was the linear lookups, which is why I wrote the hash code. The beauty of working with existing user-space tools should not be over-looked! It may be easier to fix other problems with many interface/net_devices than cram a whole other virtual net_device structure (with many duplicate functionalities found in the current net_device). Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sun Jan 7 15:45:11 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 15:45:01 -0800 Received: from foobar.napster.com ([64.124.41.10]:1292 "EHLO foobar.napster.com") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 15:44:41 -0800 Received: from wagner.napster.com (mail.napster.com [63.108.185.112]) by foobar.napster.com (8.9.3/8.9.3) with ESMTP id PAA04327; Sun, 7 Jan 2001 15:44:36 -0800 Received: from napster.com (c991585-a.stcla1.sfba.home.com [65.5.99.217]) by wagner.napster.com (8.9.3/8.9.3) with ESMTP id PAA31945; Sun, 7 Jan 2001 15:44:35 -0800 Message-ID: <3A58FF61.D26D6708@napster.com> Date: Sun, 07 Jan 2001 15:44:33 -0800 From: Jordan Mendelson Organization: Napster, Inc. 
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i686) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: netdev@oss.sgi.com Subject: Re: Mediocre TCP Performance 2.4.0 <-> Win98SE PPP References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1207 Lines: 40 jamal wrote: > > I am not on lk so i missed your initial post; i apologize if > i am repeating things you might have tried out. > It does not make sense that it works in 2.2.16 but not 2.4.0. > > How reproducible is this? 100% without much difficulty. I should try it with a service other than our own napster backend, but I can't see that making too much difference. > Force the MTU on linux to be 536+40. Haven't tried that, will try on next pass with 576 and 400 just for kicks. > If that doesnt fix it turn off SACK. If I'm not mistaken, I believe I did try it with various TCP options turned off, including SACK, but I'll try again and document it this time. > If that doesnt fix it i would start suspecting the NIC. What NIC > do you have on Linux? Might be alignment screw ups. The tests were done on valinux boxes which all have eepro100's. > If that doesnt fix it, i would start suspecting windows. > Capture traces on both sides (windows and Linux side) for both kernels > preferably using -w option and post the URL. Will do. I won't be able to get to any of this until next week because I don't have a dialup PPP account anywhere (have to borrow another employees account). Jordan From owner-netdev@oss.sgi.com Sun Jan 7 16:23:01 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 16:22:52 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:48548 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 16:22:32 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) 
with ESMTP id TAA21748; Sun, 7 Jan 2001 19:21:20 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 7 Jan 2001 19:21:20 -0500 (EST) From: jamal To: Matti Aarnio cc: , Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission In-Reply-To: <20010107212414.B25659@mea-ext.zmailer.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1701 Lines: 42 On Sun, 7 Jan 2001, Matti Aarnio wrote: > On Sun, Jan 07, 2001 at 02:10:52PM -0500, jamal wrote: > > OK. I suppose an skb->vlan_tag is passed to the driver and it will know > > what to do with it (pass it on a descriptor etc). > > Sure, nice. WHY SHOULD THERE BE MORE LAYER-2 STUFF ADDED TO > SKB OBJECTS ? > > One of the important abstraction issues is to isolate device specific > new things (like what VLAN/PVC/SVC is used at your favourite > 802.1Q/ATM/X.25/FrameRelay connection). > > The less we leak that kind of things to SKB, the better, IMO. > They are net_device issues, after all. You are right, the IP-ization information should be "device" specific. What "device" means is the other discussion [I think we need new naming conventions and abstractions]. > Tell me (if you can), why packet sender calls hardware-header > generation for packet, if the card can insert it for you ? > Consider the structure of Ethernet MAC header, where is source > address ? Where is the destination address ? If you write the > destination, why should you not write the source there too ? > It doesn't cost that much to do in s/ware if it is contiguous and you have the information. Some form of neighbor discovery protocol gathers all that info for you, so it is pretty cheap to insert the dst/src/ethertype which you already have. Linux already has the 14-byte link layer header cached today, based on ARP for example. Works very nicely on the transmit path.
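To make the layout concrete, here is a minimal user-space sketch in C of the 14-byte link layer header mentioned above and of where a 4-byte 802.1Q tag would sit in it. The function name vlan_insert_tag(), the *_SKETCH constants and the buffer handling are assumptions for illustration, not kernel code; only the frame layout (destination MAC, source MAC, then TPID 0x8100 and the tag control field ahead of the original ethertype) is taken from 802.1Q.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN_SKETCH   6      /* bytes per MAC address */
#define ETH_HLEN_SKETCH   14     /* dst MAC + src MAC + ethertype */
#define VLAN_HLEN_SKETCH  4      /* TPID (2 bytes) + TCI (2 bytes) */
#define VLAN_TPID_SKETCH  0x8100 /* 802.1Q tag protocol identifier */

/*
 * Hypothetical helper: open a 4-byte gap right after the two MAC
 * addresses and write the 802.1Q tag there. The caller must supply a
 * buffer with room for len + 4 bytes. Returns the new frame length.
 */
size_t vlan_insert_tag(uint8_t *frame, size_t len, uint16_t vid, uint8_t prio)
{
    const size_t off = 2 * ETH_ALEN_SKETCH; /* tag goes after the src MAC */
    uint16_t tci;

    assert(len >= ETH_HLEN_SKETCH);
    assert(vid < 4096 && prio < 8);

    /* Shift the ethertype and payload back by four bytes. */
    memmove(frame + off + VLAN_HLEN_SKETCH, frame + off, len - off);

    /* 3 bits of priority (the .1P part), CFI left at 0, 12 bits of VLAN id. */
    tci = (uint16_t)((prio << 13) | vid);
    frame[off + 0] = (uint8_t)(VLAN_TPID_SKETCH >> 8);
    frame[off + 1] = (uint8_t)(VLAN_TPID_SKETCH & 0xff);
    frame[off + 2] = (uint8_t)(tci >> 8);
    frame[off + 3] = (uint8_t)(tci & 0xff);

    return len + VLAN_HLEN_SKETCH;
}
```

The memmove() is the extra work software pays when the tag is added after the header has already been built; a card that takes the tag through its DMA descriptor would presumably avoid that step.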
Now consider trying to insert the VLAN header after all this work (it goes between the src MAC and the ethertype). If the hardware knows what to do with the tag, you dont need the hardware header rebuilder function. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 7 16:31:41 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 16:31:32 -0800 Received: from lox.sandelman.ottawa.on.ca ([209.151.24.2]:63475 "EHLO lox.sandelman.ottawa.on.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 16:31:17 -0800 Received: from nox.sandelman.ottawa.on.ca (nox.sandelman.ottawa.on.ca [209.151.24.6]) by lox.sandelman.ottawa.on.ca (8.8.7/8.8.8) with ESMTP id TAA03677 for ; Sun, 7 Jan 2001 19:31:16 -0500 (EST) Received: from sandelman.ottawa.on.ca ([63.70.211.130]) by nox.sandelman.ottawa.on.ca (8.11.0/8.11.0) with ESMTP id f080s0814712 (using TLSv1/SSLv3 with cipher EDH-RSA-DES-CBC3-SHA (168 bits) verified OK) for ; Sun, 7 Jan 2001 16:55:24 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by sandelman.ottawa.on.ca (8.11.0/8.11.0) with ESMTP id f080QE120377 for ; Sun, 7 Jan 2001 19:26:15 -0500 (EST) Message-Id: <200101080026.f080QE120377@sandelman.ottawa.on.ca> To: "netdev@oss.sgi.com" Subject: Re: routable interfaces WAS( Re: [PATCH] hashed device lookup(DoesNOT meet Linus' sumission policy!) In-reply-to: Your message of "Sun, 07 Jan 2001 13:42:05 MST." <3A58D49D.C4152BD5@candelatech.com> Mime-Version: 1.0 (generated by tm-edit 7.108) Content-Type: text/plain; charset=US-ASCII Date: Sun, 07 Jan 2001 19:26:14 -0500 From: Michael Richardson Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 812 Lines: 14 The nicest thing about routable interfaces (vs what FreeSWAN and many other IPsec's use now) is that it makes the choice of outgoing IP address (the one inside the tunnel) behave like all other multihoming. I think the same criteria applies to VLAN interfaces as well. 
My hunch is that having a dozen VLAN/IPsec interfaces on a box may be rather reasonable. Having 4000 of them is a pretty rare situation that can be dealt with via expansion of the hash table at compile time. ] Train travel features AC outlets with no take-off restrictions|gigabit is no[ ] Michael Richardson, Solidum Systems Oh where, oh where has|problem with[ ] mcr@solidum.com www.solidum.com the little fishy gone?|PAX.port 1100[ ] panic("Just another NetBSD/notebook using, kernel hacking, security guy"); [ From owner-netdev@oss.sgi.com Sun Jan 7 16:38:51 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 16:38:42 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:49572 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 16:38:31 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id TAA21777; Sun, 7 Jan 2001 19:37:30 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 7 Jan 2001 19:37:30 -0500 (EST) From: jamal To: Ben Greear cc: Sandy Harris , linux-kernel , "netdev@oss.sgi.com" Subject: Re: routable interfaces WAS( Re: [PATCH] hashed device lookup(DoesNOT meet Linus' sumission policy!) In-Reply-To: <3A58D49D.C4152BD5@candelatech.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1889 Lines: 49 On Sun, 7 Jan 2001, Ben Greear wrote: > Hrm, what if they just made each IP-SEC interface a net_device? If they > are a routable entity, with its own IP address, it starts to look a lot > like an interface/net_device. As in my response to Matti, i think a netdevice is a generalized link layer structure and should remain that way. To add a new naming convention: a "link" or maybe an "interface" is what the protocol-aware part should be.
Define a routable "interface" to be one that (from an abstraction perspective) sits on top of a netdevice and has an ifindex, name, and IP address (v4 or v6). I think the goals of the author of that IPSEC article are served with this scheme. I need to read that article, i just skimmed through it. > > This has seemingly worked well for VLANs: Maybe net_device is already > general enough?? I think it is not proper to generalize netdevices for IP. I am not thinking of dead protocols like IPX, more of other newer encapsulations such as MPLS etc. > > So, what would be the down-side of having VLANs and other virtual interfaces > be net_devices? The only thing I ever thought of was the linear lookups, > which is why I wrote the hash code. The beauty of working with existing > user-space tools should not be over-looked! > IP configuration tools you mean. Fine, they should be used to configure "interfaces" in the way i defined them above. > It may be easier to fix other problems with many interface/net_devices > than cram a whole other virtual net_device structure (with many duplicate > functionalities found in the current net_device). > It makes sense from an abstraction and management perspective to have all virtual interfaces which run on top of a physical interface managed in conjunction with the device. Device goes down, you destroy them or send them to a shutdown state (instead of messaging) etc. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 7 17:00:11 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 16:57:22 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:50852 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 16:57:13 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.)
with ESMTP id TAA21808; Sun, 7 Jan 2001 19:56:20 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 7 Jan 2001 19:56:20 -0500 (EST) From: jamal To: Michael Richardson cc: "netdev@oss.sgi.com" Subject: Re: routable interfaces WAS( Re: [PATCH] hashed device lookup(DoesNOT meet Linus' sumission policy!) In-Reply-To: <200101080026.f080QE120377@sandelman.ottawa.on.ca> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 930 Lines: 26 On Sun, 7 Jan 2001, Michael Richardson wrote: > > The nicest thing about routable interfaces (vs what FreeSWAN and many other > IPsec's use now) is that it makes the choice of outgoing IP address (the one > inside the tunnel) behave like all other multihoming. > I think the same criteria applies to VLAN interfaces as well. > As well as for MPLS (on an LER), L2TP, IP/AAL5, all other tunneling protocols etc.: anything that needs to use IP and be "forwardable". I think the net_device by design is a routable interface because of IP influences. Multi-homing means just adding another IP alias; unfortunately i wouldn't call an alias "routable" (if you agree with my definition of a routable interface). Making IP aliases net_devices would solve the problem; that's what VLANs do today. I would also think there are people who run 4000 aliases. urgh, maybe i am not making sense and should write up something. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 7 17:45:02 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 17:42:12 -0800 Received: from pizda.ninka.net ([216.101.162.242]:33160 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 17:41:51 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id RAA08134; Sun, 7 Jan 2001 17:24:24 -0800 Date: Sun, 7 Jan 2001 17:24:24 -0800 Message-Id: <200101080124.RAA08134@pizda.ninka.net> From: "David S.
Miller" To: linux-kernel@vger.kernel.org CC: netdev@oss.sgi.com Subject: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1409 Lines: 35 I've put a patch up for testing on the kernel.org mirrors: /pub/linux/kernel/people/davem/zerocopy-2.4.0-1.diff.gz It provides a framework for zerocopy transmits and delayed receive fragment coalescing. TUX-1.01 uses this framework. Zerocopy transmit requires some driver support, things run as they did before for drivers which do not have the support added. Currently sg+csum driver support has been added to Acenic, 3c59x, sunhme, and loopback drivers. We had eepro100 support coded at one point, but it was removed because we didn't know how to identify the cards which support hw csum assist vs. ones which could not. I would like people to test this hard and report bugs they may discover. _PLEASE_ try to see if 2.4.0 without this patch produces the same problem, and if so report it is a 2.4.0 bug _not_ as a bug in the zerocopy patch. Thank you. In particular, I am interested in hearing about any new breakage caused by the zerocopy patches when using netfilter. When reporting bugs, please note what networking cards you are using as whether the card actually is using hw csum assist and sg support is an important data point. Finally, regardless of networking card, there should be a measurable performance boost for NFS clients with this patch due to the delayed fragment coalescing. KNFSD does not take full advantage of this facility yet. Later, David S. 
Miller davem@redhat.com From owner-netdev@oss.sgi.com Sun Jan 7 18:22:52 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 18:20:03 -0800 Received: from james.kalifornia.com ([208.179.0.2]:50295 "EHLO james.kalifornia.com") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 18:19:36 -0800 Received: from linux.com (david@pool-63.50.108.84.rlgh.grid.net [63.50.108.84]) by james.kalifornia.com (8.11.0/8.11.0) with ESMTP id f082JS722935; Sun, 7 Jan 2001 18:19:29 -0800 Message-ID: <3A5923AC.AA6FDA10@linux.com> Date: Sun, 07 Jan 2001 18:19:24 -0800 From: David Ford Organization: Blue Labs X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.4.0-ac2 i686) X-Accept-Language: en MIME-Version: 1.0 CC: Ben Greear , "David S. Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission References: Content-Type: multipart/mixed; boundary="------------1051BCC36D66E43A4FDD7826" To: unlisted-recipients:; (no To-header on input) Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1159 Lines: 43 This is a multi-part message in MIME format. --------------1051BCC36D66E43A4FDD7826 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Alan Cox wrote: > > Um, what about people running their box as just a VLAN router/firewall? > > That seems to be one of the principle uses so far. Actually, in that case > > both VLAN and IP traffic would come through, so it would be a tie if VLAN > > came first, but non-vlan traffic would suffer worse. > > Why would someone filter between vlans when any node on each vlan can happily > ignore the vlan partitioning ports 137-139 blather. 
-d --------------1051BCC36D66E43A4FDD7826 Content-Type: text/x-vcard; charset=us-ascii; name="david.vcf" Content-Transfer-Encoding: 7bit Content-Description: Card for David Ford Content-Disposition: attachment; filename="david.vcf" begin:vcard n:Ford;David x-mozilla-html:TRUE url:www.blue-labs.org adr:;;;;;; version:2.1 email;internet:david@blue-labs.org title:Blue Labs Developer note;quoted-printable:GPG key: http://www.blue-labs.org/david@nifty.key=0D=0A x-mozilla-cpt:;9952 fn:David Ford end:vcard --------------1051BCC36D66E43A4FDD7826-- From owner-netdev@oss.sgi.com Sun Jan 7 20:26:13 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 20:23:13 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:38161 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 20:22:52 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id WAA12787; Sun, 7 Jan 2001 22:25:41 -0700 Message-ID: <3A594F55.298EBCF8@candelatech.com> Date: Sun, 07 Jan 2001 22:25:41 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Sandy Harris , linux-kernel , "netdev@oss.sgi.com" Subject: Re: routable interfaces WAS( Re: [PATCH] hashed device lookup(DoesNOTmeet Linus' sumission policy!) References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 3060 Lines: 77 jamal wrote: > > On Sun, 7 Jan 2001, Ben Greear wrote: > > > Hrm, what if they just made each IP-SEC interface a net_device? If they > > are a routable entity, with it's own IP address, it starts to look a lot > > like an interface/net_device. > > As in my response to Matti, i thing a netdevice is a generalized link > layer structure and should remain that way. 
Yes, but VLANs are a link-layer structure too, and things like tunnels are really link-layer too, as far as protocols using them are concerned. With tunneling and virtual interfaces, you could conceivably do something like: OC3 - ATM - Ethernet - VLAN - IP - IP-Sec - IP as well as plain old: Ethernet - IP Which of these are netdevices? (I argue that at least the Ethernet-over-ATM, VLAN, and IP-Sec entities could profit from being a net_device at its core.) You argue that we should split the net_device into physical and virtual portions. Perhaps you could give an idea of the data members that would belong in the new structures? I argue that you lose the minute you need one in both structures :) > > This has seemingly worked well for VLANs: Maybe net_device is already > > general enough?? > > I think it is not proper to generalize netdevices for IP. I am not > thinking of dead protocols like IPX, more of other newer encapsulations > such as MPLS etc. MPLS can run over FrameRelay, Ethernet, and ATM, at the moment (right?). What if you want to run MPLS over an IP-Sec link? If you want it to magically work, IP-Sec could be a net_device with its own particular member methods and private data that let it do the right thing. > > So, what would be the down-side of having VLANs and other virtual interfaces > > be net_devices? The only thing I ever thought of was the linear lookups, > > which is why I wrote the hash code. The beauty of working with existing > > user-space tools should not be over-looked! > > > > IP configuration tools you mean. Fine, they should be used to configure > "interfaces" in the way i defined them above. Think also of creating sockets with SOCK_RAW and other lower-level (but user-space) access to the net_device's methods. > It makes sense from an abstraction and management perspective to have all > virtual interfaces which run on top of a physical interface to be > managed in conjunction with the device.
What if you had an inverse-MUX type of device that spanned two different physical interfaces? Then, one can go down, but the virtual interface is still up. So, there is not a one-to-one correspondence. At a higher level, what if your interface is some tunnel running over IP? IP in turn can be routed out any physical interface (and may dynamically change due to routing protocols). > Device goes down, you destroy them > or send them to a shutdown state (instead of messaging) etc. > > cheers, > jamal -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sun Jan 7 21:35:23 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 21:32:34 -0800 Received: from Cantor.suse.de ([194.112.123.193]:35086 "HELO Cantor.suse.de") by oss.sgi.com with SMTP id ; Sun, 7 Jan 2001 21:32:18 -0800 Received: from Hermes.suse.de (Hermes.suse.de [194.112.123.136]) by Cantor.suse.de (Postfix) with ESMTP id 8259A1E083; Mon, 8 Jan 2001 06:32:16 +0100 (MET) Received: from gruyere.muc.suse.de (unknown [10.23.1.2]) by Hermes.suse.de (Postfix) with ESMTP id D1A743E44F; Mon, 8 Jan 2001 06:32:15 +0100 (MET) Received: by gruyere.muc.suse.de (Postfix, from userid 14446) id 7F2AC2F300; Mon, 8 Jan 2001 06:32:14 +0100 (MET) Date: Mon, 8 Jan 2001 06:32:14 +0100 From: Andi Kleen To: "David S. Miller" Cc: cw@f00f.org, david@linux.com, greearb@candelatech.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!)
Message-ID: <20010108063214.A29026@gruyere.muc.suse.de> References: <20010107162905.B1804@metastasis.f00f.org> <20010108011308.A2575@metastasis.f00f.org> <200101071201.EAA01790@pizda.ninka.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <200101071201.EAA01790@pizda.ninka.net>; from davem@redhat.com on Sun, Jan 07, 2001 at 04:01:04AM -0800 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 915 Lines: 27 On Sun, Jan 07, 2001 at 04:01:04AM -0800, David S. Miller wrote: > Date: Mon, 8 Jan 2001 01:13:08 +1300 > From: Chris Wedgwood > > OK, I'm a liar -- bind does handle this. Cool. > > Standard BSD allows it, what do you expect :-) > > This is good news, because it means there is a precedent for multiple > addresses on a single interface so we can kill the : > syntax in favor of the above which is cleaner and more accurately > represents what is happening. > > If this is really true, 2.5.x is an appropriate time to make > this, no sooner. I think it would be better to keep it. The ifa based alias interface emulation adds minor overhead (currently it's only a few lines of code, assuming we need named if addresses for other reasons too, which we do) and removing it would break a lot of configuration scripts etc., for no really good gain. -Andi From owner-netdev@oss.sgi.com Sun Jan 7 22:15:04 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 22:12:13 -0800 Received: from a203-167-249-89.reverse.clear.net.nz ([203.167.249.89]:11782 "HELO metastasis.f00f.org") by oss.sgi.com with SMTP id ; Sun, 7 Jan 2001 22:12:11 -0800 Received: by metastasis.f00f.org (Postfix, from userid 1000) id 2988EA31D; Mon, 8 Jan 2001 19:12:09 +1300 (NZDT) Date: Mon, 8 Jan 2001 19:12:09 +1300 From: Chris Wedgwood To: Andi Kleen Cc: "David S.
Miller" , david@linux.com, greearb@candelatech.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) Message-ID: <20010108191209.B4682@metastasis.f00f.org> References: <20010107162905.B1804@metastasis.f00f.org> <20010108011308.A2575@metastasis.f00f.org> <200101071201.EAA01790@pizda.ninka.net> <20010108063214.A29026@gruyere.muc.suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20010108063214.A29026@gruyere.muc.suse.de>; from ak@suse.de on Mon, Jan 08, 2001 at 06:32:14AM +0100 X-No-Archive: Yes Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 943 Lines: 23 On Mon, Jan 08, 2001 at 06:32:14AM +0100, Andi Kleen wrote: I think it would be better to keep it. The ifa based alias interface emulation adds minor overhead (currently it's only a few lines of code, assuming we need named if addresses for other reasons too, which we do) and removing it would break a lot of configuration scripts etc., for no really good gain. It's ugly and deceptive -- eth0:0 is _not_ a separate device to eth0, so why pretend it is? Yes, FIXING this wart will break stuff, that is part of the reason we have development cycles. Applications that break need fixing anyhow, as DaveM says BSD supports multiple addresses per interface anyhow, so perhaps not many applications will break at all -- I've not really checked. 2.5.x seems like an excellent time to FIX this. I guess the final decision is that of DaveM and Alexey.
--cw (These are mine opinions alone, but they should be everyones) From owner-netdev@oss.sgi.com Sun Jan 7 22:16:13 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 22:13:13 -0800 Received: from mail.linux.com ([198.186.203.59]:28420 "EHLO mail.i.linux.com") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 22:13:08 -0800 Received: from linux.com (mail@shiftq.linux.com [10.1.1.10]) by mail.i.linux.com (8.11.0.Beta3/8.11.0.Beta3) with ESMTP id f086D7H16130; Sun, 7 Jan 2001 22:13:07 -0800 Received: from localhost (david@localhost) by linux.com (8.9.3/8.9.3) with ESMTP id WAA07616; Sun, 7 Jan 2001 22:13:07 -0800 Date: Sun, 7 Jan 2001 22:13:06 -0800 (PST) From: Blu3Viper To: Andi Kleen cc: "David S. Miller" , cw@f00f.org, greearb@candelatech.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) In-Reply-To: <20010108063214.A29026@gruyere.muc.suse.de> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 579 Lines: 15 On Mon, 8 Jan 2001, Andi Kleen wrote: > > If this is really true, 2.5.x is an appropriate time to make > > this, no sooner. > > I think it would be better to keep it. The ifa based alias interface > emulation adds minor overhead (currently it's only a few lines of code, > assuming we need named if addresses for other reasons too, which we do) > and removing it it would break a lot of configuration scripts etc., for > no really good gain. How about turning it out with a legacy/deprecated CONFIG_ option so we can prepare people. For now it can default to enabled. 
-d From owner-netdev@oss.sgi.com Sun Jan 7 22:29:43 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 22:26:54 -0800 Received: from Cantor.suse.de ([194.112.123.193]:18704 "HELO Cantor.suse.de") by oss.sgi.com with SMTP id ; Sun, 7 Jan 2001 22:26:36 -0800 Received: from Hermes.suse.de (Hermes.suse.de [194.112.123.136]) by Cantor.suse.de (Postfix) with ESMTP id 49F511E083; Mon, 8 Jan 2001 07:26:35 +0100 (MET) Received: from gruyere.muc.suse.de (unknown [10.23.1.2]) by Hermes.suse.de (Postfix) with ESMTP id 045B63E44F; Mon, 8 Jan 2001 07:26:35 +0100 (MET) Received: by gruyere.muc.suse.de (Postfix, from userid 14446) id 85F1D2F300; Mon, 8 Jan 2001 07:26:34 +0100 (MET) Date: Mon, 8 Jan 2001 07:26:34 +0100 From: Andi Kleen To: Chris Wedgwood Cc: Andi Kleen , "David S. Miller" , david@linux.com, greearb@candelatech.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) Message-ID: <20010108072634.A29753@gruyere.muc.suse.de> References: <20010107162905.B1804@metastasis.f00f.org> <20010108011308.A2575@metastasis.f00f.org> <200101071201.EAA01790@pizda.ninka.net> <20010108063214.A29026@gruyere.muc.suse.de> <20010108191209.B4682@metastasis.f00f.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20010108191209.B4682@metastasis.f00f.org>; from cw@f00f.org on Mon, Jan 08, 2001 at 07:12:09PM +1300 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1100 Lines: 23 On Mon, Jan 08, 2001 at 07:12:09PM +1300, Chris Wedgwood wrote: > On Mon, Jan 08, 2001 at 06:32:14AM +0100, Andi Kleen wrote: > > I think it would be better to keep it. 
The ifa based alias > interface emulation adds minor overhead (currently it's only a > few lines of code, assuming we need named if addresses for other > reasons too, which we do) and removing it would break a lot of > configuration scripts etc., for no really good gain. > > It's ugly and deceptive -- eth0:0 is _not_ a separate device to eth0, > so why pretend it is? Who says that it names a device? It names interfaces. There are good reasons to have names for ifas, and I see no really good convincing reasons not to put these names into the interface name space. (in addition it'll save a lot of people a lot of grief) When you're proposing a change that breaks thousands of configurations, you need a really good reason for it, and so far I cannot see one. It would be different if the older way needed lots of hard to maintain fragile code in the kernel, but that's really not the case. -Andi From owner-netdev@oss.sgi.com Sun Jan 7 23:01:03 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 22:58:13 -0800 Received: from james.kalifornia.com ([208.179.0.2]:7548 "EHLO james.kalifornia.com") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 22:57:56 -0800 Received: from Huntington-Beach.Blue-Labs.org (mail@Huntington-Beach.Blue-Labs.org [208.179.0.198]) by james.kalifornia.com (8.11.0/8.11.0) with ESMTP id f086vle22984; Sun, 7 Jan 2001 22:57:47 -0800 Received: (from david@localhost) by Huntington-Beach.Blue-Labs.org (8.9.3/8.9.0) id WAA13529; Sun, 7 Jan 2001 22:57:45 -0800 Date: Sun, 7 Jan 2001 22:57:45 -0800 (PST) From: David Ford X-Sender: david@Huntington-Beach.Blue-Labs.org To: Andi Kleen cc: Chris Wedgwood , "David S. Miller" , greearb@candelatech.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!)
In-Reply-To: <20010108072634.A29753@gruyere.muc.suse.de> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2264 Lines: 42 On Mon, 8 Jan 2001, Andi Kleen wrote: > Who says that it names a device? It names interfaces. > There are good reasons to have names for ifas, and I see no really good > convincing reasons not to put these names into the interface name space. > (in addition it'll save a lot of people a lot of grief) > When you're proposing a change that breaks thousands of configurations, you > need a really good reason for it, and so far I cannot see one. It would > be different if the older way needed lots of hard to maintain fragile code in > the kernel, but that's really not the case. If people are upgrading to things like 2.6, then one must expect some changes. The eth0:0 style has already been whittled down, it has nothing now but the IP and mask info. It's something between two styles. It encourages a non-scalable use. I.e. eth0:2342, or eth0:http. It came up because listing w/ ifconfig -a in its current form wasn't satisfactorily fast. Distributions should be encouraged to use ip rather than ifconfig/route. It works better and does more, the output is more informative, more concise, and less confusing. It doesn't take that much more disk space than ifconfig and route do, ifconfig and route take 74K, ip takes 89K. I don't think 15k of disk space is sufficient concern, given that inodes are probably page size. That comes out to three pages difference. Even on a floppy that's not much. I didn't even compile optimized for size either. Due to that, 'eth0:n' becomes a byproduct without much merit. People who insist on eth0:n are probably people who also will insist on 1.2.13 for their router simply because they don't want or need to change it.
The new form with the new tool is easier, especially if you have any Cisco background. You can't beat 'ip a a 10.1.1.1/24 brd + dev eth0' for the netmask and figuring out the broadcast properly without error. It's shorter and less prone to error and more easily scriptable because you don't need a changing label. It's also more easily parsed by scripts. -d -- ---NOTICE--- fwd: fwd: fwd: type emails will be deleted automatically. "There is a natural aristocracy among men. The grounds of this are virtue and talents", Thomas Jefferson [1743-1826], 3rd US President From owner-netdev@oss.sgi.com Sun Jan 7 23:11:53 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 23:09:14 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:55058 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 23:09:00 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id BAA14441; Mon, 8 Jan 2001 01:12:21 -0700 Message-ID: <3A597665.4B68C39@candelatech.com> Date: Mon, 08 Jan 2001 01:12:21 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: "netdev@oss.sgi.com" , linux-kernel Subject: Re: [PATCH] hashed device lookup (New Benchmarks) References: <3A578F27.D2A9DF52@candelatech.com> <20010107042959.A14330@gruyere.muc.suse.de> <3A580B31.7998C783@candelatech.com> <20010107062744.A15198@gruyere.muc.suse.de> <3A58249F.86DD52BC@candelatech.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1518 Lines: 40 See a pretty graph showing performance of ifconfig and ip both with and without my device-hashed-lookup patch: http://grok.yi.org/~greear/hashed_dev.png (If you can't get to it, let me know and I'll email it to you...some cable modem networks have I firewalled.)
I ran ifconfig -a and ip addr show every 50 interfaces, as I added 4000 interfaces, and used the 'time -p' program to find the system and user times. Summary: ifconfig scales badly, ip is better. Both ip and ifconfig work better with the hash patch, at least when the number of interfaces grows past 1000. If anyone wants the raw numbers, I can provide them and the script that generated them. NOTE: I stopped the non-hashed test after 3000 interfaces because it was just going too slow (ifconfig was killing me!) So, is this good enough reason to add the hashed patch? If not, I feel sure I can write a program that binds to a specific interface 10k times, and my assumption is that the hash will help significantly if there are lots of interfaces. However, I'd rather not go to the hassle if the ifconfig/ip numbers are sufficient. If no amount of benchmarking will change key players' minds, then go ahead and tell me now so that I can go back to hacking code and just include this patch with my VLAN patch. Thanks, Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sun Jan 7 23:21:03 2001 Received: by oss.sgi.com id ; Sun, 7 Jan 2001 23:18:14 -0800 Received: from pizda.ninka.net ([216.101.162.242]:6786 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sun, 7 Jan 2001 23:18:03 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id XAA10037; Sun, 7 Jan 2001 23:00:45 -0800 Date: Sun, 7 Jan 2001 23:00:45 -0800 Message-Id: <200101080700.XAA10037@pizda.ninka.net> From: "David S.
Miller" To: greearb@candelatech.com CC: netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-reply-to: <3A597665.4B68C39@candelatech.com> (message from Ben Greear on Mon, 08 Jan 2001 01:12:21 -0700) Subject: Re: [PATCH] hashed device lookup (New Benchmarks) References: <3A578F27.D2A9DF52@candelatech.com> <20010107042959.A14330@gruyere.muc.suse.de> <3A580B31.7998C783@candelatech.com> <20010107062744.A15198@gruyere.muc.suse.de> <3A58249F.86DD52BC@candelatech.com> <3A597665.4B68C39@candelatech.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 633 Lines: 18 Date: Mon, 08 Jan 2001 01:12:21 -0700 From: Ben Greear http://grok.yi.org/~greear/hashed_dev.png (If you can't get to it, let me know and I'll email it to you...some cable modem networks have I firewalled.) It just seems that this shows that the implementation of ifconfig can be improved, since "ip" can do the same thing several orders of magnitude better (ie. non-quadratic system time complexity). This is the argument I started with when this thread began, so my position hasn't changed, it has in fact been well supported by your tests :-) Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Mon Jan 8 02:42:35 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 02:39:46 -0800 Received: from ns.caldera.de ([212.34.180.1]:6661 "EHLO ns.caldera.de") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 02:39:37 -0800 Received: (from hch@localhost) by ns.caldera.de (8.9.3/8.9.3) id LAA30397; Mon, 8 Jan 2001 11:39:15 +0100 Date: Mon, 8 Jan 2001 11:39:15 +0100 Message-Id: <200101081039.LAA30397@ns.caldera.de> From: Christoph Hellwig To: davem@redhat.com ("David S. 
Miller") Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 X-Newsgroups: caldera.lists.linux.kernel In-Reply-To: <200101080124.RAA08134@pizda.ninka.net> User-Agent: tin/1.4.1-19991201 ("Polish") (UNIX) (Linux/2.2.14 (i686)) Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 885 Lines: 28 In article <200101080124.RAA08134@pizda.ninka.net> you wrote: > I've put a patch up for testing on the kernel.org mirrors: > > /pub/linux/kernel/people/davem/zerocopy-2.4.0-1.diff.gz > > It provides a framework for zerocopy transmits and delayed > receive fragment coalescing. TUX-1.01 uses this framework. Hi Dave, don't you think the writepage file operation is rather hackish? I'd much prefer Ben La Haise's rw_kiovec [1] operation, it is more generic (supports read and write) and should be easily usable for zerocopy networking with plain old write (using map_user_kio). Besides that the FS crew thinks it should go in soon because of aio anyway... Christoph [1] for those that don't know yet, the prototype is: rw_kiovec(struct file * filp, int rw, int nr, struct kiobuf ** kiovec, int flags, size_t size, loff_t pos); -- Whip me. Beat me. Make me maintain AIX. From owner-netdev@oss.sgi.com Mon Jan 8 02:55:25 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 02:52:35 -0800 Received: from pizda.ninka.net ([216.101.162.242]:18564 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 02:52:18 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id CAA17681; Mon, 8 Jan 2001 02:34:52 -0800 Date: Mon, 8 Jan 2001 02:34:52 -0800 Message-Id: <200101081034.CAA17681@pizda.ninka.net> From: "David S. 
Miller" To: hch@caldera.de CC: netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-reply-to: <200101081039.LAA30397@ns.caldera.de> (message from Christoph Hellwig on Mon, 8 Jan 2001 11:39:15 +0100) Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <200101081039.LAA30397@ns.caldera.de> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 289 Lines: 11 Date: Mon, 8 Jan 2001 11:39:15 +0100 From: Christoph Hellwig don't you think the writepage file operation is rather hackish? Not at all, it's simply direct sendfile support. It does not try to be any fancier than that. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Mon Jan 8 05:07:06 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 05:06:57 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:59300 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 05:06:37 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id IAA22541; Mon, 8 Jan 2001 08:05:49 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Mon, 8 Jan 2001 08:05:49 -0500 (EST) From: jamal To: Ben Greear cc: Sandy Harris , linux-kernel , "netdev@oss.sgi.com" Subject: Re: routable interfaces WAS( Re: [PATCH] hashed device lookup(DoesNOTmeet Linus' sumission policy!) In-Reply-To: <3A594F55.298EBCF8@candelatech.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 4537 Lines: 106 On Sun, 7 Jan 2001, Ben Greear wrote: > jamal wrote: > > As in my response to Matti, i thing a netdevice is a generalized link > > layer structure and should remain that way. > > Yes, but VLANs are a link-layer structure too, and things like tunnels > are really link-layer too, as far as protocols using them are concerned. 
> > With tunneling and virtual interfaces, you could conceivably do something > like: > > OC3 - ATM - Ethernet - VLAN - IP - IP-Sec - IP > as well as plain old: > Ethernet - IP > > Which of these are netdevices? > I think you are mixing up the packet munging/management part of the entity and the data part. There is a very close relation, however the munger is _not_ a netdevice. The entry l2 protocol demuxing selects the munger. > (I argue that at least the Ethernet-over-ATM, VLAN, and IP-Sec entities could > profit from being a net_device at its core.) > > You argue that we should split the net_device into physical and virtual > portions. I am not asking for this. It is pseudo-already-there. What i am asking for is consistency. A virtual IP address is "virtual" and resides on top of a link (eg ethernet) as does a VLAN. Yet a VLAN gets its own life as a netdevice. Why? So that it can be "routable". > Perhaps you could give an idea of the data members that would belong in the new > structures? I argue that you lose the minute you need one in both structures :) > I don't have to add new data structures. The ifa structure on a netdevice is sufficient. It has a name/label, it has an IP address, it lacks an ifindex. Counters are the other lacking item (thinking of what Gleb posted). A vlan device could be a simple IP-alias. I am not all bent to insist that it has to be an ifa. What matters is consistency. I want an alias to be used by a route daemon. The easiest brute force way is to make it a netdevice. You might hack this from the route daemon itself as well but as Gleb was pointing out SNMP won't be happy. > > > This has seemingly worked well for VLANs: Maybe net_device is already > > > general enough?? > > > > I think it is not proper to generalize netdevices for IP. I am not > > thinking of dead protocols like IPX, more of other newer encapsulations > > such as MPLS etc. > > MPLS can run over FrameRelay, Ethernet, and ATM, at the moment (right?).
> > What if you want to run MPLS over an IP-Sec link? If you want it to > magically work, IP-Sec could be a net_device with its own particular > member methods and private data that let it do the right thing. > Again, this is an issue of netdevice data vs its associated methods/mungers. > > What if you had an inverse-MUX type of device that spanned two different > physical interfaces? Then, one can go down, but the virtual interface > is still up. So, there is not a one-to-one correspondence. At a higher > level, what if your interface is some tunnel running over IP? IP in turn > can be routed out any physical interface (and may dynamically change due > to routing protocols). It's that thin line between a netdevice and the packet munging. Today the first association is via a l2 demux point -- that split is very clean, IMO. An IP packet comes in, you invoke the IP handling code. A type ETH_P_802_1Q comes in, you invoke the vlan handling code. At some point along that path you invoke the ip handling code which ends up invoking the IPSEC handler etc. IP/GRE or IPIP tunnels get handled the same way. NOTE that this has absolutely nothing to do with a netdevice. Your VLAN code manages the state mapping of what vlanid maps to what skb->dev. I would say this is where the one-to-one mapping is maintained. You don't have to assume a device to get the mapping. In fact each of those stages maintains some state to work. IP maintains a route table. It should probably be refined further to be based on a classification of any sort needed instead of the incremental classification only. The posting by Matti pointed to some page that was suggesting a netif_rx per input device. I also think this is that linked association that a netdevice is a net-munger as well as a routable interface. Maybe the netdevice is the best abstraction for a routable interface because it is already stamped in people's heads and also because IP rules.
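The l2 demux step described above can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions, not the kernel's receive path: l2_demux, struct l2_result and the verdict enum are hypothetical names, the ethertype constants are defined locally with the standard values (0x0800 for IP, 0x8100 for 802.1Q), and buf is assumed to point just past the ethertype field of the Ethernet header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP    0x0800  /* IPv4 ethertype */
#define ETH_P_8021Q 0x8100  /* 802.1Q VLAN tag ethertype */

enum l2_verdict { L2_DROP, L2_TO_IP };

/* Hypothetical result record: which handler gets the frame, and what
 * state the VLAN stage extracted on the way there. */
struct l2_result {
    enum l2_verdict verdict;
    int vlan_id;         /* -1 when the frame carried no VLAN tag */
    size_t payload_off;  /* offset of the L3 payload within buf */
};

/* Demux on the ethertype: a VLAN tag is stripped first, then the
 * encapsulated type is dispatched the same way as an untagged frame. */
struct l2_result l2_demux(uint16_t ethertype, const uint8_t *buf, size_t len)
{
    struct l2_result r = { L2_DROP, -1, 0 };

    assert(buf != NULL);

    if (ethertype == ETH_P_8021Q) {
        /* 4-byte tag body: 16-bit TCI, then the encapsulated ethertype */
        if (len < 4)
            return r;
        r.vlan_id = ((buf[0] << 8) | buf[1]) & 0x0fff; /* low 12 bits = VID */
        ethertype = (uint16_t)((buf[2] << 8) | buf[3]);
        r.payload_off = 4;
    }
    if (ethertype == ETH_P_IP)
        r.verdict = L2_TO_IP;
    return r;
}
```

Note that nothing here needs a device: the VLAN stage only produces a vlan id, and mapping that id to a device (or to anything else) is separate state kept by whoever owns that stage.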
Once you've stripped all headers and done what you need to do just pass it to a dev->rx() and it will take it from there. I still see confusion. BTW,The problem you described will face you on a VLAN spanning two physical ports ;-> Do tell me how you solve it. cheers, jamal From owner-netdev@oss.sgi.com Mon Jan 8 05:09:26 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 05:09:16 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:61348 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 05:09:05 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id IAA22557; Mon, 8 Jan 2001 08:08:12 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Mon, 8 Jan 2001 08:08:12 -0500 (EST) From: jamal To: David Ford cc: Andi Kleen , Chris Wedgwood , "David S. Miller" , , , Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 539 Lines: 17 On Sun, 7 Jan 2001, David Ford wrote: > Distributions should be encouraged to use ip rather than ifconfig/route. It > works better and does more, the output is more informative, more concise, > and less confusing. It doesn't take that much more disk space than ifconfig > and route does, ifconfig and route take 74K, ip takes 89K. I don't think 15k > of disk space is sufficient concern, given that inodes are probably page > size. Actually if you count arp which is also part of ip; ip becomes smaller by about 15K. 
cheers, jamal From owner-netdev@oss.sgi.com Mon Jan 8 07:23:18 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 07:22:58 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:22022 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 07:22:49 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id JAA17717; Mon, 8 Jan 2001 09:26:07 -0700 Message-ID: <3A59EA1F.AEAD08A6@candelatech.com> Date: Mon, 08 Jan 2001 09:26:07 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH] hashed device lookup (New Benchmarks) References: <3A578F27.D2A9DF52@candelatech.com> <20010107042959.A14330@gruyere.muc.suse.de> <3A580B31.7998C783@candelatech.com> <20010107062744.A15198@gruyere.muc.suse.de> <3A58249F.86DD52BC@candelatech.com> <3A597665.4B68C39@candelatech.com> <200101080700.XAA10037@pizda.ninka.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1226 Lines: 29 "David S. Miller" wrote: > > Date: Mon, 08 Jan 2001 01:12:21 -0700 > From: Ben Greear > > http://grok.yi.org/~greear/hashed_dev.png > (If you can't get to it, let me know and I'll email it to you...some > cable modem networks have I firewalled.) > > It just seems that this shows that the implementation of ifconfig can > be improved, since "ip" can do the same thing several orders of > magnitude better (ie. non-quadratic system time complexity). > > This is the argument I started with when this thread began, so my > position hasn't changed, it has in fact been well supported by your > tests :-) I don't argue that ifconfig shouldn't be fixed, but the hash speeds up ip by about 2X too. Is that not useful enough? 
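For readers who have not seen the patch, the idea of a hashed lookup by name can be sketched as below. This is a minimal user-space illustration, not the code from the patch: struct dev_entry, dev_hash_add, dev_hash_find, the table size and the hash function are all assumptions made for the example. The point is only that a lookup walks one short bucket chain rather than the whole device list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEV_HASH_SIZE 256  /* power of two, chosen arbitrarily here */

/* Hypothetical stand-in for a device record. */
struct dev_entry {
    char name[16];
    int ifindex;
    struct dev_entry *next;  /* chain within one hash bucket */
};

static struct dev_entry *dev_hash[DEV_HASH_SIZE];

/* Simple multiplicative string hash, masked down to a bucket index. */
static unsigned int dev_name_hash(const char *name)
{
    unsigned int h = 0;

    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h & (DEV_HASH_SIZE - 1);
}

/* Insert at the head of the bucket chain. */
void dev_hash_add(struct dev_entry *dev)
{
    unsigned int b;

    assert(dev != NULL);
    b = dev_name_hash(dev->name);
    dev->next = dev_hash[b];
    dev_hash[b] = dev;
}

/* Walk only the one bucket the name hashes to. */
struct dev_entry *dev_hash_find(const char *name)
{
    struct dev_entry *d;

    for (d = dev_hash[dev_name_hash(name)]; d; d = d->next)
        if (strcmp(d->name, name) == 0)
            return d;
    return NULL;
}
```

With a few thousand names spread over the buckets, each chain stays short, whereas a single linear list grows with every interface added.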
ip seems to be implemented pretty efficiently, so if the hash helps it significantly then maybe it can help other efficient programs too. Notice that it is the system (ie kernel) time that stays remarkably flat with the hash + ip graph. Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jan 8 10:06:17 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 10:05:58 -0800 Received: from brutus.conectiva.com.br ([200.250.58.146]:26101 "EHLO brutus.conectiva.com.br") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 10:05:45 -0800 Received: from localhost (riel@localhost) by brutus.conectiva.com.br (8.11.2/8.11.2) with ESMTP id f08I5Nl28191; Mon, 8 Jan 2001 16:05:23 -0200 X-Authentication-Warning: duckman.distro.conectiva: riel owned process doing -bs Date: Mon, 8 Jan 2001 16:05:23 -0200 (BRDT) From: Rik van Riel X-Sender: riel@duckman.distro.conectiva To: "David S. Miller" cc: hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <200101081034.CAA17681@pizda.ninka.net> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 973 Lines: 31 On Mon, 8 Jan 2001, David S. Miller wrote: > From: Christoph Hellwig > > don't you think the writepage file operation is rather hackish? > > Not at all, it's simply direct sendfile support. It does > not try to be any fancier than that. I really think the zerocopy network stuff should be ported to kiobuf proper. The usefulness of the patch you posted is rather .. umm .. limited. Having proper kiobuf support would make it possible to, for example, do zerocopy network->disk data transfers and lots of other things.
Furthermore, by using kiobuf for the network zerocopy stuff there's a good chance the networking code will be integrated. Otherwise we just might end up with a zero-copy-for-everything- except-networking Linux 2.5 kernel ;) regards, Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ From owner-netdev@oss.sgi.com Mon Jan 8 13:25:39 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 13:25:28 -0800 Received: from pizda.ninka.net ([216.101.162.242]:1163 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 13:25:05 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id NAA21378; Mon, 8 Jan 2001 13:07:02 -0800 Date: Mon, 8 Jan 2001 13:07:02 -0800 Message-Id: <200101082107.NAA21378@pizda.ninka.net> From: "David S. Miller" To: riel@conectiva.com.br CC: hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-reply-to: (message from Rik van Riel on Mon, 8 Jan 2001 16:05:23 -0200 (BRDT)) Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 331 Lines: 14 Date: Mon, 8 Jan 2001 16:05:23 -0200 (BRDT) From: Rik van Riel I really think the zerocopy network stuff should be ported to kiobuf proper. That is how it could be done in 2.5.x, sure. But this patch is intended for 2.4.x so "minimum impact" applies. Later, David S. 
Miller davem@redhat.com From owner-netdev@oss.sgi.com Mon Jan 8 13:57:28 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 13:57:19 -0800 Received: from smtp1.cern.ch ([137.138.128.38]:49168 "EHLO smtp1.cern.ch") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 13:57:03 -0800 Received: from lxplus015.cern.ch (IDENT:root@lxplus015.cern.ch [137.138.161.112]) by smtp1.cern.ch (8.9.3/8.9.3) with ESMTP id WAA01046; Mon, 8 Jan 2001 22:56:49 +0100 (MET) Received: (from jes@localhost) by lxplus015.cern.ch (8.9.3/8.9.3) id WAA28218; Mon, 8 Jan 2001 22:56:48 +0100 To: "David S. Miller" Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <200101080124.RAA08134@pizda.ninka.net> From: Jes Sorensen Date: 08 Jan 2001 22:56:48 +0100 In-Reply-To: "David S. Miller"'s message of "Sun, 7 Jan 2001 17:24:24 -0800" Message-ID: User-Agent: Gnus/5.070096 (Pterodactyl Gnus v0.96) Emacs/20.4 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1496 Lines: 31 >>>>> "David" == David S Miller writes: David> I've put a patch up for testing on the kernel.org mirrors: David> /pub/linux/kernel/people/davem/zerocopy-2.4.0-1.diff.gz David> It provides a framework for zerocopy transmits and delayed David> receive fragment coalescing. TUX-1.01 uses this framework. David> Zerocopy transmit requires some driver support, things run as David> they did before for drivers which do not have the support David> added. Currently sg+csum driver support has been added to David> Acenic, 3c59x, sunhme, and loopback drivers. We had eepro100 David> support coded at one point, but it was removed because we David> didn't know how to identify the cards which support hw csum David> assist vs. ones which could not. 
I haven't had time to test this patch, but looking over the changes to the acenic driver I have to say that I am quite displeased with the way the changes were done. I can't comment on how authors of the other drivers which were changed feel about it. However I find it highly annoying that someone goes off and makes major cosmetic structural changes to someone elses code without even consulting the author who happens to maintain the code. It doesn't help that the patch reverts changes that should not have been reverted. I don't think it's too much to ask that one actually tries to communicate with an author of a piece of code before making such major changes and submitting them opting for inclusion in the kernel. Jes From owner-netdev@oss.sgi.com Mon Jan 8 14:05:59 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 14:05:49 -0800 Received: from pizda.ninka.net ([216.101.162.242]:3212 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 14:05:39 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id NAA21738; Mon, 8 Jan 2001 13:48:08 -0800 Date: Mon, 8 Jan 2001 13:48:08 -0800 Message-Id: <200101082148.NAA21738@pizda.ninka.net> From: "David S. Miller" To: jes@linuxcare.com CC: linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-reply-to: (message from Jes Sorensen on 08 Jan 2001 22:56:48 +0100) Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <200101080124.RAA08134@pizda.ninka.net> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1309 Lines: 32 From: Jes Sorensen Date: 08 Jan 2001 22:56:48 +0100 I don't think it's too much to ask that one actually tries to communicate with an author of a piece of code before making such major changes and submitting them opting for inclusion in the kernel. Jes, I have not submitted this for inclusion into the kernel. This is the "everyone, including driver authors, take a look" part of the development process. 
We _had_ to change some drivers to show how to support this new SKB api for transmit sg+csum support. If you can think of a way for us to effectively do this work without changing at least a few drivers as examples (and proof of concept), please let us know. In the process we hit real bugs in your driver, and tried to deal with them as best we could so that we could continue testing and debugging our own code. As a side note, as much as you may hate some of Alexey's changes to your driver, several things he does fixes long standing real bugs in the Acenic driver that you've been papering over with workarounds for quite some time. I would even go so far as to say that in many regards Alexey understands the Acenic much better than you, and you would be wise to work with Alexey and not against him. Thanks. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Mon Jan 8 14:33:19 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 14:33:09 -0800 Received: from smtp1.cern.ch ([137.138.128.38]:60426 "EHLO smtp1.cern.ch") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 14:33:03 -0800 Received: from lxplus015.cern.ch (IDENT:root@lxplus015.cern.ch [137.138.161.112]) by smtp1.cern.ch (8.9.3/8.9.3) with ESMTP id XAA31339; Mon, 8 Jan 2001 23:32:49 +0100 (MET) Received: (from jes@localhost) by lxplus015.cern.ch (8.9.3/8.9.3) id XAA09810; Mon, 8 Jan 2001 23:32:48 +0100 To: "David S. Miller" Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <200101080124.RAA08134@pizda.ninka.net> <200101082148.NAA21738@pizda.ninka.net> From: Jes Sorensen Date: 08 Jan 2001 23:32:48 +0100 In-Reply-To: "David S. 
Miller"'s message of "Mon, 8 Jan 2001 13:48:08 -0800" Message-ID: User-Agent: Gnus/5.070096 (Pterodactyl Gnus v0.96) Emacs/20.4 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2235 Lines: 45 >>>>> "David" == David S Miller writes: David> We _had_ to change some drivers to show how to support this new David> SKB api for transmit sg+csum support. If you can think of a David> way for us to effectively do this work without changing at David> least a few drivers as examples (and proof of concept), please David> let us know. Dave, I am not complaining about drivers having to be changed for this to work; I am fully aware of this need. My complaints are about how this is being done, i.e. some people try to maintain drivers and have certain ideas about how they structure their code etc. If you had sent me a short email saying this is what we plan to do and this is what we think should be done to your code, what's your opinion, I would have volunteered to help write the code and get the stuff integrated much earlier as well as given you my input on how I would like to see the changes implemented. Instead we now have a fairly large patch which will take me a long time to merge into the driver version that I maintain. David> In the process we hit real bugs in your driver, and tried to David> deal with them as best we could so that we could continue David> testing and debugging our own code. I would have appreciated a simple email saying "we found bug X in your driver" with either a patch attached or a short note of your observations. David> As a side note, as much as you may hate some of Alexey's David> changes to your driver, several things he does fixes long David> standing real bugs in the Acenic driver that you've been David> papering over with workarounds for quite some time.
I would David> even go so far as to say that in many regards Alexey David> understands the Acenic much better than you, and you would be David> wise to work with Alexey and not against him. Thanks. I don't question Alexey's skills and I have no intentions of working against him. All I am asking is that someone lets me know if they make major changes to my code so I can keep track of whats happening. It is really hard to maintain code if you work on major changes while someone else branches off in a different direction without you knowing. It's simply a waste of everybody's time. Thanks Jes From owner-netdev@oss.sgi.com Mon Jan 8 14:44:59 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 14:44:49 -0800 Received: from ns.snowman.net ([63.80.4.34]:21256 "EHLO ns.snowman.net") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 14:44:39 -0800 Received: (from sfrost@localhost) by ns.snowman.net (8.9.3/8.9.3/Debian 8.9.3-21) id RAA06311; Mon, 8 Jan 2001 17:43:56 -0500 Date: Mon, 8 Jan 2001 17:43:56 -0500 From: Stephen Frost To: Jes Sorensen Cc: "David S. Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010108174355.P26953@ns> Mail-Followup-To: Jes Sorensen , "David S. 
Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com References: <200101080124.RAA08134@pizda.ninka.net> <200101082148.NAA21738@pizda.ninka.net> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="33yLIq9/uqwyGAKN" Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from jes@linuxcare.com on Mon, Jan 08, 2001 at 11:32:48PM +0100 X-Editor: Vim http://www.vim.org/ X-Info: http://www.snowman.net X-Operating-System: Linux/2.2.16 (i686) X-Uptime: 5:39pm up 144 days, 21:26, 7 users, load average: 2.00, 2.00, 2.00 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1658 Lines: 44 --33yLIq9/uqwyGAKN Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable * Jes Sorensen (jes@linuxcare.com) wrote: > >>>>> "David" == David S Miller writes: > > I don't question Alexey's skills and I have no intentions of working > against him. All I am asking is that someone lets me know if they make > major changes to my code so I can keep track of whats happening. It is > really hard to maintain code if you work on major changes while > someone else branches off in a different direction without you > knowing. It's simply a waste of everybody's time. Perhaps you missed it, but I believe Dave's intent is for this to only be a proof-of-concept idea at this time. These changes are not currently up for inclusion into the mainstream kernel. I can not think that Dave would ever just step around a maintainer and submit a patch to Linus for large changes. If many people test these and things work out well for them then I'm sure Dave will go back to the maintainers with the code and the api and work with them to get it into the mainstream kernel. Soliciting ideas and suggestions on how to improve the api and the code paths in the drivers to handle this new method most effectively.
Stephen --33yLIq9/uqwyGAKN Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.4 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQE6WkKrrzgMPqB3kigRAmokAJ9u4syg08ujQlPVBXuoetDVjJnS6ACeMDj6 B1oHeCXNmDhAVQQmoP+TeGc= =4/AR -----END PGP SIGNATURE----- --33yLIq9/uqwyGAKN-- From owner-netdev@oss.sgi.com Mon Jan 8 14:54:09 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 14:53:59 -0800 Received: from pizda.ninka.net ([216.101.162.242]:39052 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 14:53:50 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id OAA22381; Mon, 8 Jan 2001 14:36:26 -0800 Date: Mon, 8 Jan 2001 14:36:26 -0800 Message-Id: <200101082236.OAA22381@pizda.ninka.net> From: "David S. Miller" To: jes@linuxcare.com CC: linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-reply-to: (message from Jes Sorensen on 08 Jan 2001 23:32:48 +0100) Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <200101080124.RAA08134@pizda.ninka.net> <200101082148.NAA21738@pizda.ninka.net> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 652 Lines: 19 From: Jes Sorensen Date: 08 Jan 2001 23:32:48 +0100 All I am asking is that someone lets me know if they make major changes to my code so I can keep track of whats happening. We have not made any major changes to your code, in lieu of this not being code which is actually being submitted yet. If it bothers you that publicly someone has published changes to your driver which you disagree with, oh well... :-) This "please check things out" phase is precisely what you are asking of us, it is how we are saying "here is what we need to do with your driver, please comment". Later, David S. 
Miller davem@redhat.com From owner-netdev@oss.sgi.com Mon Jan 8 14:55:08 2001 Received: by oss.sgi.com id ; Mon, 8 Jan 2001 14:54:59 -0800 Received: from pizda.ninka.net ([216.101.162.242]:40076 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Mon, 8 Jan 2001 14:54:52 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id OAA22387; Mon, 8 Jan 2001 14:37:23 -0800 Date: Mon, 8 Jan 2001 14:37:23 -0800 Message-Id: <200101082237.OAA22387@pizda.ninka.net> From: "David S. Miller" To: sfrost@snowman.net CC: jes@linuxcare.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-reply-to: <20010108174355.P26953@ns> (message from Stephen Frost on Mon, 8 Jan 2001 17:43:56 -0500) Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <200101080124.RAA08134@pizda.ninka.net> <200101082148.NAA21738@pizda.ninka.net> <20010108174355.P26953@ns> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 306 Lines: 11 Date: Mon, 8 Jan 2001 17:43:56 -0500 From: Stephen Frost Perhaps you missed it, but I believe Dave's intent is for this to only be a proof-of-concept idea at this time. Thank you Stephen, this is the point Jes continues to miss. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 9 02:24:34 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 02:24:16 -0800 Received: from chiara.elte.hu ([157.181.150.200]:65035 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 02:24:04 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 7C5D5186D; Tue, 9 Jan 2001 11:24:01 +0100 (CET) Date: Tue, 9 Jan 2001 11:23:41 +0100 (CET) From: Ingo Molnar Reply-To: To: Rik van Riel Cc: "David S. 
Miller" , , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2061 Lines: 45 On Mon, 8 Jan 2001, Rik van Riel wrote: > I really think the zerocopy network stuff should be ported to kiobuf > proper. yep, we talked to Stephen Tweedie about this already, but it involves some changes in kiovec support and we didnt want to touch too much code for 2.4. In any case, the zerocopy code is 'kiovec in spirit' (uses vectors of struct page *, offset, size entities), so transition to a finalized kiovec framework (or whatever other mechanizm) is trivial. Right now kiovecs are *way* too bloated for the purposes of skb fragments. > The usefulness of the patch you posted is rather .. umm .. limited. > [...] i violently disagree :-) The upcoming TUX release is based on David's and Alexey's cleaned-up zerocopy framework. [thus TUX and zerocopy are separated.] David's patch adds a *very* scalable implementation of zerocopy sendfile() and zerocopy sendmsg(), the panacea of fileserver (webserver) scalability - it can be used by Apache, Samba and other fileservers. The new zerocopy networking code DMA-s straight out of the pagecache, natively supports hardware-checksumming and highmem (64-bit DMA on 32-bit systems) zerocopy as well and multi-fragment DMA - no limitations. We can saturate a gigabit link with TCP traffic, at about 20% CPU usage on a 500 MHz x86 UP system. David and Alexey's patch is cool - check it out! > Having proper kiobuf support would make it possible to, for example, > do zerocopy network->disk data transfers and lots of other things. i used to think that this is useful, but these days it isnt. 
It's a waste of PCI bandwidth resources, and it's much cheaper to keep a cache in RAM instead of doing direct disk=>network DMA *all the time* some resource is requested. > Furthermore, by using kiobuf for the network zerocopy stuff there's a > good chance the networking code will be integrated. David and Alexey are TCP/IP networking code maintainers. So if you see a 'test this' networking framework patch from them on l-k, it has quite high chances of being integrated into the networking code :-) Ingo From owner-netdev@oss.sgi.com Tue Jan 9 02:33:04 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 02:32:54 -0800 Received: from ns.caldera.de ([212.34.180.1]:35847 "EHLO ns.caldera.de") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 02:32:44 -0800 Received: (from hch@localhost) by ns.caldera.de (8.9.3/8.9.3) id LAA29280; Tue, 9 Jan 2001 11:31:45 +0100 Date: Tue, 9 Jan 2001 11:31:45 +0100 From: Christoph Hellwig To: Ingo Molnar Cc: Rik van Riel , "David S. Miller" , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109113145.A28758@caldera.de> Mail-Followup-To: Ingo Molnar , Rik van Riel , "David S. Miller" , netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: ; from mingo@elte.hu on Tue, Jan 09, 2001 at 11:23:41AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1683 Lines: 40 On Tue, Jan 09, 2001 at 11:23:41AM +0100, Ingo Molnar wrote: > > On Mon, 8 Jan 2001, Rik van Riel wrote: > > > I really think the zerocopy network stuff should be ported to kiobuf > > proper. > > yep, we talked to Stephen Tweedie about this already, but it involves some > changes in kiovec support and we didnt want to touch too much code for > 2.4. 
In any case, the zerocopy code is 'kiovec in spirit' (uses vectors of > struct page *, offset, size entities), Yep. That is why I was so worried about the writepages file op. It's rather hackish (only write, looks useful only for networking) instead of the proposed rw_kiovec fop. > > > The usefulness of the patch you posted is rather .. umm .. limited. > > [...] > > i violently disagree :-) The upcoming TUX release is based on David's and > Alexey's cleaned-up zerocopy framework. [thus TUX and zerocopy are > separated.] David's patch adds a *very* scalable implementation of > zerocopy sendfile() and zerocopy sendmsg(), the panacea of fileserver > (webserver) scalability - it can be used by Apache, Samba and other > fileservers. The new zerocopy networking code DMA-s straight out of the > pagecache, natively supports hardware-checksumming and highmem (64-bit DMA > on 32-bit systems) zerocopy as well and multi-fragment DMA - no > limitations. We can saturate a gigabit link with TCP traffic, at about 20% > CPU usage on a 500 MHz x86 UP system. David and Alexey's patch is cool - > check it out! Yuck. A new file_op just to get a few benchmarks right ... I hope the writepages stuff will not be merged in Linus' tree (but I do wish for the code behind it!) Christoph -- Whip me. Beat me. Make me maintain AIX. From owner-netdev@oss.sgi.com Tue Jan 9 02:48:54 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 02:48:44 -0800 Received: from pizda.ninka.net ([216.101.162.242]:36224 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 02:48:30 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id CAA01242; Tue, 9 Jan 2001 02:31:13 -0800 Date: Tue, 9 Jan 2001 02:31:13 -0800 Message-Id: <200101091031.CAA01242@pizda.ninka.net> From: "David S.
Miller" To: hch@caldera.de CC: mingo@elte.hu, riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-reply-to: <20010109113145.A28758@caldera.de> (message from Christoph Hellwig on Tue, 9 Jan 2001 11:31:45 +0100) Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <20010109113145.A28758@caldera.de> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1100 Lines: 26 Date: Tue, 9 Jan 2001 11:31:45 +0100 From: Christoph Hellwig Yuck. A new file_opo just to get a few benchmarks right ... I hope the writepages stuff will not be merged in Linus tree (but I wish the code behind it!) It's a "I know how to send a page somewhere via this filedescriptor all by myself" operation. I don't see why people need to take painkillers over this for 2.4.x. I think f_op->write is stupid, such a special case file operation just to get a few benchmarks right. This is the kind of argument I am hearing. Orthogonal to f_op->write being for specifying a low-level implementation of sys_write, f_op->writepage is for specifying a low-level implementation of sys_sendfile. Can you grok that? Linus has already seen this. Originally he had a gripe because in an older revision of the code used to allow multiple pages to be passed in an array to the writepage(s) operation. He didn't like that, so I made it take only one page as he requested. He had no other major objections to the infrastructure. Later, David S. 
Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 9 03:06:54 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 03:06:34 -0800 Received: from chiara.elte.hu ([157.181.150.200]:2572 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 03:06:22 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 75530186D; Tue, 9 Jan 2001 12:06:20 +0100 (CET) Date: Tue, 9 Jan 2001 12:05:59 +0100 (CET) From: Ingo Molnar Reply-To: To: Christoph Hellwig Cc: Rik van Riel , "David S. Miller" , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <20010109113145.A28758@caldera.de> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 3215 Lines: 66 On Tue, 9 Jan 2001, Christoph Hellwig wrote: > > 2.4. In any case, the zerocopy code is 'kiovec in spirit' (uses > > vectors of struct page *, offset, size entities), > Yep. That is why I was so worried aboit the writepages file op. i believe you misunderstand. kiovecs (in their current form) are simply too bloated for networking purposes. Due to its nature and nonpersistency, networking is very lightweight and memory-footprint-sensitive code (as opposed to eg. block IO code), right now an 'struct skb_shared_info' [which is roughly equivalent to a kiovec] is 12+4*6 == 36 bytes, which includes support for 6 distinct fragments (each fragment can be on any page, any offset, any size). A *single* kiobuf (which is roughly equivalent to an skb fragment) is 52+16*4 == 116 bytes. 6 of these would be 696 bytes, for a single TCP packet (!!!). This is simply not something to be used for lightweight zero-copy networking. so it's easy to say 'use kiovecs', but right now it's simply not practical. kiobufs are a loaded concept, and i'm not sure whether it's desirable at all to mix networking zero-copy concepts with block-IO/filesystem zero-copy concepts. 
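The byte counts in the comparison above can be reproduced with a small C sketch. The constants are copied from the message as quoted; the constant and helper names are illustrative assumptions, and nothing here is derived from kernel headers.

```c
#include <stddef.h>

/*
 * Figures exactly as quoted in the message (32-bit x86, early 2.4).
 * They are taken on trust; the names below are hypothetical.
 */
enum {
    SHINFO_FIXED_BYTES = 12,     /* the "12" term for skb_shared_info */
    SHINFO_FRAG_BYTES  = 4,      /* per-fragment cost as quoted */
    SHINFO_NR_FRAGS    = 6,      /* fragments supported per packet */
    KIOBUF_FIXED_BYTES = 52,     /* the "52" term for one kiobuf */
    KIOBUF_ARRAY_BYTES = 16 * 4  /* the "16*4" term, kept as written */
};

/* Size of one skb_shared_info covering all six fragments. */
static size_t shinfo_bytes(void)
{
    return SHINFO_FIXED_BYTES + (size_t)SHINFO_FRAG_BYTES * SHINFO_NR_FRAGS;
}

/* Size of a single kiobuf. */
static size_t kiobuf_bytes(void)
{
    return KIOBUF_FIXED_BYTES + KIOBUF_ARRAY_BYTES;
}

/* One kiobuf per fragment, as in the comparison made in the message. */
static size_t kiobufs_for_frags(size_t nr_frags)
{
    return nr_frags * kiobuf_bytes();
}
```

With these figures, shinfo_bytes() gives 36, kiobuf_bytes() gives 116, and kiobufs_for_frags(6) gives 696, so the kiobuf representation of one six-fragment packet is roughly 19 times larger.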
Just to make it even more clear: although i do believe it to be desirable from an architectural point of view, i'm not sure at all whether it's possible, based on the experience we gathered while implementing TCP-zerocopy. we talked (and are talking) to Stephen about this problem, but it's clearly a 2.5 kernel issue. Merging to a finalized zero-copy framework will be easy. (The overwhelming percentage of zero-copy code is in the networking code itself and is insensitive to any kiovec issues.) > It's rather hackish (only write, looks usefull only for networking) > instead of the proposed rw_kiovec fop. i'm not sure what you are trying to say. You mean we should remove sendfile() as well? It's only write, looks useful mostly for networking. A substantial percentage of kernel code is useful only for networking :-) > > zerocopy sendfile() and zerocopy sendmsg(), the panacea of fileserver > > (webserver) scalability - it can be used by Apache, Samba and other > > fileservers. The new zerocopy networking code DMA-s straight out of the > > pagecache, natively supports hardware-checksumming and highmem (64-bit > > DMA on 32-bit systems) zerocopy as well and multi-fragment DMA - no > > limitations. We can saturate a gigabit link with TCP traffic, at about > > 20% CPU usage on a 500 MHz x86 UP system. David and Alexey's patch is > > cool - check it out! > Yuck. A new file_opo just to get a few benchmarks right ... no. As David said, it's direct sendfile() support. It's completely isolated, it's 20 lines of code, it does not impact filesystems, it only shows up in sendfile(). So i truly dont understand your point. This interface has gone through several iterations and was actually further simplified. Ingo ps1. "first they say it's impossible, then they ridicule you, then they oppose you, finally they say it's self-evident".
Looks like, after many many years, zero-copy networking for Linux is now finally in phase III. :-) ps2. i'm joking :-) From owner-netdev@oss.sgi.com Tue Jan 9 03:29:44 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 03:29:25 -0800 Received: from ns.caldera.de ([212.34.180.1]:23560 "EHLO ns.caldera.de") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 03:29:03 -0800 Received: (from hch@localhost) by ns.caldera.de (8.9.3/8.9.3) id MAA03705; Tue, 9 Jan 2001 12:28:10 +0100 Date: Tue, 9 Jan 2001 12:28:10 +0100 From: Christoph Hellwig To: "David S. Miller" Cc: mingo@elte.hu, riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109122810.A3115@caldera.de> Mail-Followup-To: "David S. Miller" , mingo@elte.hu, riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20010109113145.A28758@caldera.de> <200101091031.CAA01242@pizda.ninka.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: <200101091031.CAA01242@pizda.ninka.net>; from davem@redhat.com on Tue, Jan 09, 2001 at 02:31:13AM -0800 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1721 Lines: 41 On Tue, Jan 09, 2001 at 02:31:13AM -0800, David S. Miller wrote: > Date: Tue, 9 Jan 2001 11:31:45 +0100 > From: Christoph Hellwig > > Yuck. A new file_opo just to get a few benchmarks right ... I > hope the writepages stuff will not be merged in Linus tree (but I > wish the code behind it!) > > It's a "I know how to send a page somewhere via this filedescriptor > all by myself" operation. I don't see why people need to take > painkillers over this for 2.4.x. I think f_op->write is stupid, such > a special case file operation just to get a few benchmarks right. > This is the kind of argument I am hearing. 
> > Orthogonal to f_op->write being for specifying a low-level > implementation of sys_write, f_op->writepage is for specifying a > low-level implementation of sys_sendfile. Can you grok that? Sure. But sendfile is not one of the fundamental UNIX operations... If there was no alternative to this I would probably have not said anything, but with the rw_kiovec file op just before the door I don't see any reason to add this _very_ specific file operation. An alloc_kiovec before and an free_kiovec after the actual call and the memory overhaed of a kiobuf won't hurt so much that it stands against a clean interface, IMHO. > > Linus has already seen this. Originally he had a gripe because in an > older revision of the code used to allow multiple pages to be passed > in an array to the writepage(s) operation. He didn't like that, so I > made it take only one page as he requested. He had no other major > objections to the infrastructure. You get that multiple page call with kiobufs for free... Christoph -- Whip me. Beat me. Make me maintain AIX. From owner-netdev@oss.sgi.com Tue Jan 9 03:33:14 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 03:33:05 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:15091 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 03:33:01 -0800 Received: from fred.muc.de (noidentity@ns1078.munich.netsurf.de [195.180.235.78]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id MAA05282; Tue, 9 Jan 2001 12:32:55 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 2D574E3BB7; Mon, 8 Jan 2001 17:50:37 +0100 (CET) Date: Mon, 8 Jan 2001 17:50:36 +0100 From: Andi Kleen To: Ben Greear Cc: "David S. 
Miller" , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH] hashed device lookup (New Benchmarks) Message-ID: <20010108175036.A22154@fred.local> References: <3A578F27.D2A9DF52@candelatech.com> <20010107042959.A14330@gruyere.muc.suse.de> <3A580B31.7998C783@candelatech.com> <20010107062744.A15198@gruyere.muc.suse.de> <3A58249F.86DD52BC@candelatech.com> <3A597665.4B68C39@candelatech.com> <200101080700.XAA10037@pizda.ninka.net> <3A59EA1F.AEAD08A6@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <3A59EA1F.AEAD08A6@candelatech.com>; from greearb@candelatech.com on Mon, Jan 08, 2001 at 04:23:41PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1419 Lines: 28 On Mon, Jan 08, 2001 at 04:23:41PM +0100, Ben Greear wrote: > I don't argue that ifconfig shouldn't be fixed, but the hash speeds up It's already fixed since months. There was one stupid algorithm, which I was to blame for when I changed ifconfig to use a device list two years ago. At that time I didn't think that anybody would be ever crazy enough to set up 4000 interfaces and just chosed the simplest list management. I fixed it when you first complained a few months ago and now the list insertion works that the list does not need to be walked fully in the usual case. It could be optimized more in user space, but it's probably not worth it. > ip by about 2X too. Is that not useful enough? ip seems to be implemented > pretty efficient, so if the hash helps it significantly then maybe it > can help other efficient programs too. Notice that it is the system > (ie kernel) time that stays remarkably flat with the hash + ip graph. Just does your benchmark represent anything that real users do frequently ? 
If you really want to optimize I'm sure there are lots of areas in the kernel where your efforts are better spent ;) [just run with the kernel profiler on for a few days on your box and look at all the real hot spots] BTW, if you just want to optimize ip link ls speed it would probably be enough to keep a one-behind cache that just caches the next member after the last search. -Andi From owner-netdev@oss.sgi.com Tue Jan 9 04:00:44 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 04:00:25 -0800 Received: from pizda.ninka.net ([216.101.162.242]:28289 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 04:00:06 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id DAA01640; Tue, 9 Jan 2001 03:42:47 -0800 Date: Tue, 9 Jan 2001 03:42:47 -0800 Message-Id: <200101091142.DAA01640@pizda.ninka.net> From: "David S. Miller" To: hch@caldera.de CC: mingo@elte.hu, riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-reply-to: <20010109122810.A3115@caldera.de> (message from Christoph Hellwig on Tue, 9 Jan 2001 12:28:10 +0100) Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <20010109113145.A28758@caldera.de> <200101091031.CAA01242@pizda.ninka.net> <20010109122810.A3115@caldera.de> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1478 Lines: 37 Date: Tue, 9 Jan 2001 12:28:10 +0100 From: Christoph Hellwig Sure. But sendfile is not one of the fundamental UNIX operations... It's a fundamental Linux interface and VFS-->networking interface. An alloc_kiovec before and an free_kiovec after the actual call and the memory overhaed of a kiobuf won't hurt so much that it stands against a clean interface, IMHO. This whole exercise is pointless unless it performs well. The overhead _DOES_ matter, we've tested and profiled all of this with full specweb99 runs, zerocopy ftp server loads, etc.
Removing one word of information from anything involved in these code paths makes enormous differences. Have you run such tests with your suggested kiobuf scheme? Know what I really hate? People who are talking, "almost done", and "designing" the "real solution" to a problem and have no code to show for it. Ie. a total working implementation. Often they have not one line of code to show. Then the folks who actually get off their lazy asses and make something real, which works, and in fact exceeded most of our personal performance expectations, are the ones who are getting told that what they did was crap. What was the first thing out of people's mouths? Not "nice work", but "I think writepage is ugly and an eyesore, I hope nobody seriously considers this code for inclusion." Keep designing... like Linus says, "show me the code". Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 9 04:05:44 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 04:05:25 -0800 Received: from chiara.elte.hu ([157.181.150.200]:5132 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 04:05:15 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 05F91186D; Tue, 9 Jan 2001 13:05:09 +0100 (CET) Date: Tue, 9 Jan 2001 13:04:49 +0100 (CET) From: Ingo Molnar Reply-To: To: Christoph Hellwig Cc: "David S. Miller" , , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <20010109122810.A3115@caldera.de> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2010 Lines: 43 On Tue, 9 Jan 2001, Christoph Hellwig wrote: > Sure. But sendfile is not one of the fundamental UNIX operations... Neither were eg. kernel-based semaphores. So what? Unix wasnt perfect and isnt perfect - but it was a (very) good starting point. 
If you are arguing against the existence or importance of sendfile() you should re-think, sendfile() is a unique (and important) interface because it enables moving information between files (streams) without involving any interim user-space memory buffer. No original Unix API did this AFAIK, so we obviously had to add it. It's an important Linux API category. > If there was no alternative to this I would probably have not said > anything, but with the rw_kiovec file op just before the door I don't > see any reason to add this _very_ specific file operation. I do think that the kiovec code has to be rewritten substantially before it can be used for networking zero-copy, so right now we do the least damage if we do not increase the coverage of kiovec code. > An alloc_kiovec before and a free_kiovec after the actual call and > the memory overhead of a kiobuf won't hurt so much that it stands > against a clean interface, IMHO. please study the networking portions of the zerocopy patch and you'll see why this is not desirable. An alloc_kiovec()/free_kiovec() is exactly the thing we cannot afford in a sendfile() operation. sendfile() is lightweight, the setup times of kiovecs are not. basically the current kiovec design does not deal with the realities of high-speed, featherweight networking. DO NOT talk in hypotheticals. The code is there, do it, measure it. You might not care about performance, we do. another, more theoretical issue is that i think the kernel should not be littered with multi-page interfaces, we should keep the one "struct page * at a time" interfaces. Eg. check out how the new zerocopy code generates perfect MTU sized frames via the ->writepage() interface. No interim container objects are necessary.
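For readers following the argument, the user-space side of the interface being defended can be sketched roughly as below. This is only an illustration under stated assumptions, not code from the zerocopy patch: the helper names send_file_range() and sendfile_demo() are hypothetical, a Unix-domain socketpair stands in for a TCP connection, and a reasonably recent Linux is assumed so that sendfile(2) accepts that socket as its output. The point it shows is that the file data reaches the socket without the program ever holding it in a buffer of its own.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: push `count` bytes of file_fd, starting at offset 0,
 * into sock_fd using sendfile(2). Returns the number of bytes sent, or -1. */
static ssize_t send_file_range(int sock_fd, int file_fd, size_t count)
{
    off_t offset = 0;          /* sendfile() advances this for us */
    size_t left = count;

    assert(sock_fd >= 0 && file_fd >= 0);
    while (left > 0) {
        ssize_t n = sendfile(sock_fd, file_fd, &offset, left);
        if (n < 0)
            return -1;         /* real code would inspect errno (EINTR, EAGAIN) */
        if (n == 0)
            break;             /* hit end of file early */
        left -= (size_t)n;
    }
    return (ssize_t)(count - left);
}

/* Hypothetical demo: write a short message to a temporary file, send it
 * through one end of a socketpair, and read it back from the other end.
 * Returns the number of bytes received, or -1 on any failure. */
static int sendfile_demo(void)
{
    static const char msg[] = "zero-copy hello";
    const size_t len = sizeof(msg) - 1;
    char path[] = "/tmp/sendfile-demo-XXXXXX";
    char buf[64];
    int sv[2];
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);              /* the file disappears once fd is closed */

    if (write(fd, msg, len) == (ssize_t)len &&
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == 0) {
        ssize_t sent = send_file_range(sv[0], fd, len);
        ssize_t got = (sent == (ssize_t)len)
                          ? read(sv[1], buf, sizeof(buf))
                          : -1;
        if (got == (ssize_t)len && memcmp(buf, msg, len) == 0)
            result = (int)got;
        close(sv[0]);
        close(sv[1]);
    }
    close(fd);
    return result;
}
```

A caller would simply invoke sendfile_demo() and check for a non-negative result. Whether the kernel can avoid copying on the way down depends on the driver and on the patch under discussion; the sketch only shows the calling convention.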
Ingo From owner-netdev@oss.sgi.com Tue Jan 9 04:13:14 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 04:12:54 -0800 Received: from chiara.elte.hu ([157.181.150.200]:5644 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 04:12:50 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 8FFE1186D; Tue, 9 Jan 2001 13:12:48 +0100 (CET) Date: Tue, 9 Jan 2001 13:12:28 +0100 (CET) From: Ingo Molnar Reply-To: To: "David S. Miller" Cc: , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <200101082236.OAA22381@pizda.ninka.net> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1139 Lines: 28 On Mon, 8 Jan 2001, David S. Miller wrote: > All I am asking is that someone lets me know if they make major > changes to my code so I can keep track of whats happening. > > We have not made any major changes to your code, in lieu of this > not being code which is actually being submitted yet. > > If it bothers you that publicly someone has published changes to your > driver which you disagree with, oh well... :-) i did tell Jes about our zerocopy work, months ago (and IIRC we even exchanged emails about technical issues briefly). The changes were first published in the TUX 1.0 source code last August, and subsequent cleanups (more than 10 iterations) were published on Alexey's public FTP site: ftp://ftp.inr.ac.ru/ip-routing/ i think this whole issue got miscommunicated because Jes moved to Canada exactly when we wrote the fragmented-API changes. I do believe Jes will like most of our changes though, and i can surely tell that the elegant and clean code of the Acenic driver made these changes so much easier. Jes's Acenic driver was the first Linux networking driver in history to support zero-copy TCP.
Ingo From owner-netdev@oss.sgi.com Tue Jan 9 05:28:45 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 05:28:36 -0800 Received: from mail.linux.com ([198.186.203.59]:8972 "EHLO mail.i.linux.com") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 05:28:16 -0800 Received: from linux.com (mail@shiftq.linux.com [10.1.1.10]) by mail.i.linux.com (8.11.0.Beta3/8.11.0.Beta3) with ESMTP id f09DSAH12628; Tue, 9 Jan 2001 05:28:10 -0800 Received: from localhost (david@localhost) by linux.com (8.9.3/8.9.3) with ESMTP id FAA22533; Tue, 9 Jan 2001 05:28:09 -0800 Date: Tue, 9 Jan 2001 05:28:08 -0800 (PST) From: Blu3Viper To: jamal cc: Andi Kleen , Chris Wedgwood , "David S. Miller" , greearb@candelatech.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission policy!) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 138 Lines: 9 > Actually if you count arp which is also part of ip; ip becomes smaller > by about 15K. ...i always forget some small detail. thx -d From owner-netdev@oss.sgi.com Tue Jan 9 05:53:36 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 05:53:16 -0800 Received: from mons.uio.no ([129.240.130.14]:48802 "EHLO mons.uio.no") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 05:52:53 -0800 Received: from charged.uio.no ([129.240.86.49]) by mons.uio.no with esmtp (Exim 2.12 #7) id 14FzCb-00040c-00; Tue, 9 Jan 2001 14:52:41 +0100 Received: from trondmy by charged.uio.no with local (Exim 2.12 #1) id 14FzCa-0000th-00; Tue, 9 Jan 2001 14:52:40 +0100 To: "David S. Miller" Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <200101080124.RAA08134@pizda.ninka.net> From: Trond Myklebust Date: 09 Jan 2001 14:52:40 +0100 In-Reply-To: "David S. 
Miller"'s message of "Sun, 7 Jan 2001 17:24:24 -0800" Message-ID: X-Mailer: Gnus v5.6.45/XEmacs 21.1 - "Channel Islands" Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1175 Lines: 31 >>>>> " " == David S Miller writes: > I've put a patch up for testing on the kernel.org mirrors: > /pub/linux/kernel/people/davem/zerocopy-2.4.0-1.diff.gz ..... > Finally, regardless of networking card, there should be a > measurable performance boost for NFS clients with this patch > due to the delayed fragment coalescing. KNFSD does not take > full advantage of this facility yet. Hi David, I don't really want to be chiming in with another 'make it a kiobuf', but given that you already have written 'do_tcp_sendpages()' why did you make sock->ops->sendpage() take the single page as an argument rather than just have it take the 'struct page **'? I would have thought one of the main interests of doing something like this would be to allow us to speed up large writes to the socket for ncpfs/knfsd/nfs/smbfs/... After all, in both the case of the client WRITE requests and the server READ responses, we end up with a set of several pages that just need to be pushed down the network without further ado. Unless I misunderstood the code, it seems that do_tcp_sendpages() fits the bill nicely... Cheers, Trond From owner-netdev@oss.sgi.com Tue Jan 9 06:00:25 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 06:00:15 -0800 Received: from pizda.ninka.net ([216.101.162.242]:38530 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 06:00:06 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id FAA02414; Tue, 9 Jan 2001 05:42:48 -0800 Date: Tue, 9 Jan 2001 05:42:48 -0800 Message-Id: <200101091342.FAA02414@pizda.ninka.net> From: "David S. 
Miller" To: trond.myklebust@fys.uio.no CC: linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-reply-to: (message from Trond Myklebust on 09 Jan 2001 14:52:40 +0100) Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <200101080124.RAA08134@pizda.ninka.net> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1151 Lines: 30 From: Trond Myklebust Date: 09 Jan 2001 14:52:40 +0100 I don't really want to be chiming in with another 'make it a kiobuf', but given that you already have written 'do_tcp_sendpages()' why did you make sock->ops->sendpage() take the single page as an argument rather than just have it take the 'struct page **'? It was like that to begin with. But to do it cleanly you have to pass in not a vector of "pages" but a vector of "page+offset+len" triplets. Linus hated it, and I understood why, so I reverted the API to be single page based. I would have thought one of the main interests of doing something like this would be to allow us to speed up large writes to the socket for ncpfs/knfsd/nfs/smbfs/... This is what TCP_CORK/MSG_MORE et al. are all for, things get coalesced perfectly. Sending in a vector of pages seems nice, but none of the page cache infrastructure works like this, all of the core routines work on a page at a time. It actually simplifies a lot. The writepage interface optimizes large file writes to a socket just fine. Later, David S. 
Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 9 06:20:55 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 06:20:45 -0800 Received: from dukat.scot.redhat.com ([195.89.149.246]:55046 "EHLO dukat.scot.redhat.com") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 06:20:25 -0800 Received: from spock.scot.redhat.com (IDENT:root@spock.scot.redhat.com [10.3.1.2]) by dukat.scot.redhat.com (8.9.3/8.9.3) with ESMTP id OAA21330; Tue, 9 Jan 2001 14:18:12 GMT Received: (from sct@localhost) by spock.scot.redhat.com (8.11.0/8.11.0) id f09EI6a04336; Tue, 9 Jan 2001 14:18:06 GMT Date: Tue, 9 Jan 2001 14:18:06 +0000 From: "Stephen C. Tweedie" To: Ingo Molnar Cc: Rik van Riel , "David S. Miller" , hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Stephen Tweedie Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109141806.F4284@redhat.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from mingo@elte.hu on Tue, Jan 09, 2001 at 11:23:41AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 933 Lines: 22 Hi, On Tue, Jan 09, 2001 at 11:23:41AM +0100, Ingo Molnar wrote: > > > Having proper kiobuf support would make it possible to, for example, > > do zerocopy network->disk data transfers and lots of other things. > > i used to think that this is useful, but these days it isnt. It's a waste > of PCI bandwidth resources, and it's much cheaper to keep a cache in RAM > instead of doing direct disk=>network DMA *all the time* some resource is > requested. No. I'm certain you're right when talking about things like web serving, but it just doesn't apply when you look at some other applications, such as streaming out video data or performing fileserving in a high-performance compute cluster where you are serving bulk data. 
The multimedia and HPC worlds typically operate on datasets which are far too large to cache, so you want to keep them in memory as little as possible when you ship them over the wire. Cheers, Stephen From owner-netdev@oss.sgi.com Tue Jan 9 06:28:15 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 06:28:06 -0800 Received: from dukat.scot.redhat.com ([195.89.149.246]:57094 "EHLO dukat.scot.redhat.com") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 06:27:54 -0800 Received: from spock.scot.redhat.com (IDENT:root@spock.scot.redhat.com [10.3.1.2]) by dukat.scot.redhat.com (8.9.3/8.9.3) with ESMTP id OAA21336; Tue, 9 Jan 2001 14:25:42 GMT Received: (from sct@localhost) by spock.scot.redhat.com (8.11.0/8.11.0) id f09EPga04344; Tue, 9 Jan 2001 14:25:42 GMT Date: Tue, 9 Jan 2001 14:25:42 +0000 From: "Stephen C. Tweedie" To: Ingo Molnar Cc: Christoph Hellwig , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Stephen Tweedie Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109142542.G4284@redhat.com> References: <20010109122810.A3115@caldera.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from mingo@elte.hu on Tue, Jan 09, 2001 at 01:04:49PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1245 Lines: 29 Hi, On Tue, Jan 09, 2001 at 01:04:49PM +0100, Ingo Molnar wrote: > > On Tue, 9 Jan 2001, Christoph Hellwig wrote: > > please study the networking portions of the zerocopy patch and you'll see > why this is not desirable. An alloc_kiovec()/free_kiovec() is exactly the > thing we cannot afford in a sendfile() operation. sendfile() is > lightweight, the setup times of kiovecs are not. > Right. 
However, kiobufs can be kept around for as long as you want and can be reused easily, and even if allocating and freeing them is more work than you want, populating an existing kiobuf is _very_ cheap. > another, more theoretical issue is that i think the kernel should not be > littered with multi-page interfaces, we should keep the one "struct page * > at a time" interfaces. Bad bad bad. We already have SCSI devices optimised for bandwidth which don't approach decent performance until you are passing them 1MB IOs, and even in networking the 1.5K packet limit kills us in some cases and we need an interface capable of generating jumbograms. Perhaps tcp can merge internal 4K requests, but if you're doing udp jumbograms (or STP or VIA), you do need an interface which can give the networking stack more than one page at once. --Stephen From owner-netdev@oss.sgi.com Tue Jan 9 06:32:46 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 06:32:36 -0800 Received: from router-100M.swansea.linux.org.uk ([194.168.151.17]:58891 "EHLO the-village.bc.nu") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 06:32:33 -0800 Received: from alan by the-village.bc.nu with local (Exim 2.12 #1) id 14Fzpr-0006ij-00; Tue, 9 Jan 2001 14:33:15 +0000 Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 To: sct@redhat.com (Stephen C. Tweedie) Date: Tue, 9 Jan 2001 14:33:13 +0000 (GMT) Cc: mingo@elte.hu (Ingo Molnar), hch@caldera.de (Christoph Hellwig), davem@redhat.com (David S. Miller), riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org, sct@redhat.com (Stephen Tweedie) In-Reply-To: <20010109142542.G4284@redhat.com> from "Stephen C. Tweedie" at Jan 09, 2001 02:25:42 PM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: From: Alan Cox Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 321 Lines: 7 > Bad bad bad. 
We already have SCSI devices optimised for bandwidth > which don't approach decent performance until you are passing them 1MB > IOs, and even in networking the 1.5K packet limit kills us in some Even low end cheap raid cards like the AMI megaraid dearly want 128K writes. Its quite a difference on them From owner-netdev@oss.sgi.com Tue Jan 9 06:41:46 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 06:41:36 -0800 Received: from chiara.elte.hu ([157.181.150.200]:11276 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 06:41:29 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 28F8E186D; Tue, 9 Jan 2001 15:41:23 +0100 (CET) Date: Tue, 9 Jan 2001 15:40:56 +0100 (CET) From: Ingo Molnar Reply-To: To: "Stephen C. Tweedie" Cc: Rik van Riel , "David S. Miller" , , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <20010109141806.F4284@redhat.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1314 Lines: 30 On Tue, 9 Jan 2001, Stephen C. Tweedie wrote: > > i used to think that this is useful, but these days it isnt. It's a waste > > of PCI bandwidth resources, and it's much cheaper to keep a cache in RAM > > instead of doing direct disk=>network DMA *all the time* some resource is > > requested. > > No. I'm certain you're right when talking about things like web > serving, [...] yep, i was concentrating on fileserving load. > but it just doesn't apply when you look at some other applications, > such as streaming out video data or performing fileserving in a > high-performance compute cluster where you are serving bulk data. > The multimedia and HPC worlds typically operate on datasets which are > far too large to cache, so you want to keep them in memory as little > as possible when you ship them over the wire. 
i'd love to first see these kinds of applications (under Linux) before designing for them. Eg. if an IO operation (eg. streaming video webcast) does a DMA from a camera card to an outgoing networking card, would it be possible to access the packet data in case of a TCP retransmit? Basically these applications are limited enough in scope to justify even temporary 'hacks' that enable them - and once we *see* things in action, we could design for them. Not the other way around. Ingo From owner-netdev@oss.sgi.com Tue Jan 9 06:50:56 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 06:50:46 -0800 Received: from router-100M.swansea.linux.org.uk ([194.168.151.17]:780 "EHLO the-village.bc.nu") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 06:50:38 -0800 Received: from alan by the-village.bc.nu with local (Exim 2.12 #1) id 14G07X-0006kT-00; Tue, 9 Jan 2001 14:51:31 +0000 Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 To: mingo@elte.hu Date: Tue, 9 Jan 2001 14:51:28 +0000 (GMT) Cc: sct@redhat.com (Stephen C. Tweedie), riel@conectiva.com.br (Rik van Riel), davem@redhat.com (David S. Miller), hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-Reply-To: from "Ingo Molnar" at Jan 09, 2001 03:40:56 PM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: From: Alan Cox Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 416 Lines: 11 > designing for them. Eg. if an IO operation (eg. streaming video webcast) > does a DMA from a camera card to an outgoing networking card, would it be Most mpeg2 hardware isnt set up for that kind of use. And webcast protocols like h.263 tend to be software implemented. Capturing raw video for pre-processing is similar. 
Right now thats best done with mmap() on the ring buffer and O_DIRECT I/O it seems Alan From owner-netdev@oss.sgi.com Tue Jan 9 07:01:26 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 07:01:18 -0800 Received: from chiara.elte.hu ([157.181.150.200]:13324 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 07:00:56 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 2A7B1186D; Tue, 9 Jan 2001 16:00:54 +0100 (CET) Date: Tue, 9 Jan 2001 16:00:34 +0100 (CET) From: Ingo Molnar Reply-To: To: "Stephen C. Tweedie" Cc: Christoph Hellwig , "David S. Miller" , , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <20010109142542.G4284@redhat.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2518 Lines: 57 On Tue, 9 Jan 2001, Stephen C. Tweedie wrote: > > please study the networking portions of the zerocopy patch and you'll see > > why this is not desirable. An alloc_kiovec()/free_kiovec() is exactly the > > thing we cannot afford in a sendfile() operation. sendfile() is > > lightweight, the setup times of kiovecs are not. > > > Right. However, kiobufs can be kept around for as long as you want > and can be reused easily, and even if allocating and freeing them is > more work than you want, populating an existing kiobuf is _very_ > cheap. we do have SLAB [which essentially caches structures, on a per-CPU basis] which i did take into account, but still, initializing a 600+ byte kiovec is probably more work than the rest of sending a packet! I mean i'd love to eliminate the 200+ bytes skb initialization as well, it shows up. > > another, more theoretical issue is that i think the kernel should not be > > littered with multi-page interfaces, we should keep the one "struct page * > > at a time" interfaces. > > Bad bad bad. 
We already have SCSI devices optimised for bandwidth > which don't approach decent performance until you are passing them 1MB > IOs, [...] The fact that we're using single-page interfaces doesnt preclude us from having nicely clustered requests, this is what IO-plugging is about! > and even in networking the 1.5K packet limit kills us in some cases > and we need an interface capable of generating jumbograms. which cases? > Perhaps tcp can merge internal 4K requests, [...] yes, because depending on the application to send properly sized requests is a futile act IMO. So we do have intelligent buffering and clustering in basically every kernel subsystem - and we'll continue to have it because we have no choice - most of Linux's user-visible IO APIs have byte granularity (which is good btw.). Adding a multi-page interface will IMO mostly just complicate the design and the implementation. Do you have empirical (or theoretical) proof which shows that single-page interfaces cannot perform well? > but if you're doing udp jumbograms (or STP or VIA), you do need an > interface which can give the networking stack more than one page at > once. nothing prevents the introduction of specialized interfaces - if they feel like they can get enough traction. I was talking about the normal Linux IO APIs, read()/write()/sendfile(), which are byte granularity and invoke an almost mandatory buffering/clustering mechanism in every kernel subsystem they deal with. Ingo From owner-netdev@oss.sgi.com Tue Jan 9 07:20:26 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 07:20:16 -0800 Received: from dukat.scot.redhat.com ([195.89.149.246]:62982 "EHLO dukat.scot.redhat.com") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 07:19:53 -0800 Received: (from sct@localhost) by dukat.scot.redhat.com (8.9.3/8.9.3) id PAA21428; Tue, 9 Jan 2001 15:17:25 GMT Date: Tue, 9 Jan 2001 15:17:25 +0000 From: "Stephen C. Tweedie" To: Ingo Molnar Cc: "Stephen C. Tweedie" , Rik van Riel , "David S.
Miller" , hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109151725.D9321@redhat.com> References: <20010109141806.F4284@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: ; from mingo@elte.hu on Tue, Jan 09, 2001 at 03:40:56PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1415 Lines: 33 Hi, On Tue, Jan 09, 2001 at 03:40:56PM +0100, Ingo Molnar wrote: > > i'd love to first see these kinds of applications (under Linux) before > designing for them. Things like Beowulf have been around for a while now, and SGI have been doing that sort of multimedia stuff for ages. I don't think that there's any doubt that there's a demand for this. > Eg. if an IO operation (eg. streaming video webcast) > does a DMA from a camera card to an outgoing networking card, would it be > possible to access the packet data in case of a TCP retransmit? I'm not thinking about pci-to-pci as much as pci-to-memory-to-pci with no memory-to-memory copies. That's no different to writepage: doing a zero-copy writepage on a page cache page still gives you the problem of maintaining retransmit semantics if a user mmaps the file or writes to it after your initial transmit. And if you want other examples, we have applications such as Oracle who want to do raw disk IO in chunks of at least 128K. Going through a page-by-page interface for large IOs is almost as bad as the existing buffer_head-by-buffer_head interface, and we have already demonstrated that to be a bottleneck in the block device layer. Jes has also got hard numbers for the performance advantages of jumbograms on some of the networks he's been using, and you ain't going to get udp jumbograms through a page-by-page API, ever. 
Cheers, Stephen From owner-netdev@oss.sgi.com Tue Jan 9 07:26:46 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 07:26:36 -0800 Received: from ns.snowman.net ([63.80.4.34]:29965 "EHLO ns.snowman.net") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 07:26:27 -0800 Received: (from sfrost@localhost) by ns.snowman.net (8.9.3/8.9.3/Debian 8.9.3-21) id KAA17619; Tue, 9 Jan 2001 10:25:25 -0500 Date: Tue, 9 Jan 2001 10:25:25 -0500 From: Stephen Frost To: Ingo Molnar Cc: "Stephen C. Tweedie" , Rik van Riel , "David S. Miller" , hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109102525.Q26953@ns> Mail-Followup-To: Ingo Molnar , "Stephen C. Tweedie" , Rik van Riel , "David S. Miller" , hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20010109141806.F4284@redhat.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="loVroY/sZgZp7srB" Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from mingo@elte.hu on Tue, Jan 09, 2001 at 03:40:56PM +0100 X-Editor: Vim http://www.vim.org/ X-Info: http://www.snowman.net X-Operating-System: Linux/2.2.16 (i686) X-Uptime: 10:19am up 145 days, 14:06, 6 users, load average: 2.00, 2.00, 2.00 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2600 Lines: 61 --loVroY/sZgZp7srB Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable * Ingo Molnar (mingo@elte.hu) wrote: >=20 > On Tue, 9 Jan 2001, Stephen C. Tweedie wrote: >=20 > > but it just doesn't apply when you look at some other applications, > > such as streaming out video data or performing fileserving in a > > high-performance compute cluster where you are serving bulk data. 
> > The multimedia and HPC worlds typically operate on datasets which are > > far too large to cache, so you want to keep them in memory as little > > as possible when you ship them over the wire. > > i'd love to first see these kinds of applications (under Linux) before > designing for them. Eg. if an IO operation (eg. streaming video webcast) > does a DMA from a camera card to an outgoing networking card, would it be > possible to access the packet data in case of a TCP retransmit? Basically > these applications are limited enough in scope to justify even temporary > 'hacks' that enable them - and once we *see* things in action, we could > design for them. Not the other way around. Well, I know I for one use a system that you might have heard of called 'MOSIX'. It's a (kinda large) kernel patch with some user-space tools but allows for migration of processes between machines without modifying any code. There are some limitations (threaded applications and shared memory and whatnot) but it works very well for the rendering work we use it for. We use radiance which in general has pretty little inter-process communication and what it has is done through the filesystem. Now, the interesting bit here is that the processes can grow to be pretty large (200M+, up as high as 500M, higher if we let it ;) ) and what happens with MOSIX is that entire processes get sent over the wire to other machines for work. MOSIX will also attempt to rebalance the load on all of the machines in the cluster and whatnot so it can often be moving processes back and forth. So, anyhow, this is just an fyi if you weren't aware of it that I believe more than a few people are using MOSIX these days for similar applications and that it's available at http://www.mosix.org if you're curious.
Stephen --loVroY/sZgZp7srB Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.4 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQE6Wy1lrzgMPqB3kigRAhHzAKCBurTvc8elEFpftqVZDuCVLq7OdgCggFdL AvU9yTd3BetBFnqTUnKxH1U= =J92I -----END PGP SIGNATURE----- --loVroY/sZgZp7srB-- From owner-netdev@oss.sgi.com Tue Jan 9 07:28:16 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 07:28:06 -0800 Received: from mons.uio.no ([129.240.130.14]:53928 "EHLO mons.uio.no") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 07:27:59 -0800 Received: from charged.uio.no ([129.240.86.49]) by mons.uio.no with esmtp (Exim 2.12 #7) id 14G0gg-0006B5-00; Tue, 9 Jan 2001 16:27:50 +0100 Received: from trondmy by charged.uio.no with local (Exim 2.12 #1) id 14G0gf-0001Eu-00; Tue, 9 Jan 2001 16:27:49 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14939.11765.649805.239618@charged.uio.no> Date: Tue, 9 Jan 2001 16:27:49 +0100 (CET) To: "David S. Miller" Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <200101091342.FAA02414@pizda.ninka.net> References: <200101080124.RAA08134@pizda.ninka.net> <200101091342.FAA02414@pizda.ninka.net> X-Mailer: VM 6.72 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid Reply-To: trond.myklebust@fys.uio.no From: Trond Myklebust Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 891 Lines: 23 >>>>> David S Miller writes: > I would have thought one of the main interests of doing > something like this would be to allow us to speed up large > writes to the socket for ncpfs/knfsd/nfs/smbfs/... > This is what TCP_CORK/MSG_MORE et al. are all for, things get > coalesced perfectly. 
Sending in a vector of pages seems nice, > but none of the page cache infrastructure works like this, all > of the core routines work on a page at a time. It actually > simplifies a lot. > The writepage interface optimizes large file writes to a socket > just fine. OK, but can you eventually generalize it to non-stream protocols (i.e. UDP)? After all, it doesn't make sense to differentiate between zero-copy on stream and non-stream sockets, and Linux NFS, at least, remains heavily UDP-oriented... Cheers, Trond From owner-netdev@oss.sgi.com Tue Jan 9 07:29:05 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 07:28:55 -0800 Received: from ottawa.linuxcare.com ([216.208.98.2]:46332 "EHLO localhost.localdomain") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 07:28:48 -0800 Received: (from cneufeld@localhost) by localhost.localdomain (8.10.0/8.10.0) id f09FSXO24925 for netdev@oss.sgi.com; Tue, 9 Jan 2001 10:28:33 -0500 Date: Tue, 9 Jan 2001 10:28:33 -0500 From: Christopher Neufeld Message-Id: <200101091528.f09FSXO24925@localhost.localdomain> X-Mailer: Mail User's Shell (7.2.5 10/14/92) To: netdev@oss.sgi.com Subject: Apparent bug in cls_u32.c Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1055 Lines: 33 I've run across what looks like a bounds-checking typo in net/sched/cls_u32.c in both the 2.2 and 2.4 kernel lines. The function gen_new_htid() is not dead code, and is, in its entirety: static u32 gen_new_htid(struct tc_u_common *tp_c) { int i = 0x800; do { if (++tp_c->hgenerator == 0x7FF) tp_c->hgenerator = 1; } while (i>0 && u32_lookup_ht(tp_c, (tp_c->hgenerator|0x800)<<20)); return i > 0 ? (tp_c->hgenerator|0x800)<<20 : 0; } There's nothing in that code which can modify the value of the local variable 'i' after initialization. As such, the FALSE branch of the ternary operator can never be called. I suspect that the author's intent was to make the first part of the while() condition: while (--i>0 && .... 
) Could somebody please check this, and make any necessary corrections? -- Christopher Neufeld neufeld@linuxcare.com Home page: http://caliban.physics.utoronto.ca/neufeld/Intro.html "Don't edit reality for the sake of simplicity" From owner-netdev@oss.sgi.com Tue Jan 9 07:29:45 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 07:29:35 -0800 Received: from ns.tdt.de ([195.243.126.82]:18266 "EHLO ns.tdt.de") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 07:29:27 -0800 Received: from tdt.de (bjansen2 [10.1.4.164]) by ns.tdt.de (8.8.8/8.8.8) with ESMTP id QAA07680; Tue, 9 Jan 2001 16:34:46 -0100 Message-ID: <3A5A0264.457F967D@tdt.de> Date: Mon, 08 Jan 2001 19:09:40 +0100 From: Bernhard Jansen Organization: T.D.T Gmbh X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com, alan@redhat.com Subject: The networking code in Kernel 2.4 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 895 Lines: 22 Hi, I have a little problem with the new networking code in kernel 2.4. In the 2.2 kernels there was the dev->tbusy variable to make sure that the dev->hard_start_xmit function is not called again before the previous packet has been completely sent out. My problem is that I want to write a driver for a Siemens (or rather, now, Infineon) chip called ESCC2. It is for sync serial communication and has a 32-byte-deep FIFO, which is the actual problem. Because a packet from the upper layer is normally bigger than 32 bytes, I write the first 32 bytes out to the chip and then have to wait for an interrupt telling me that the FIFO is ready again. The core of my problem is: how can I make sure that the dev->hard_start_xmit function is not called again by the kernel until the whole buffer has been sent out? Thanks in advance, Bernhard Jansen P.S. If you need the whole source code or more information, let me know.
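A note on the question above: in the 2.4 softnet driver model the job of dev->tbusy is done by stopping and waking the transmit queue with netif_stop_queue() and netif_wake_queue(). What follows is only a userspace sketch of that flow for a 32-byte FIFO, not ESCC2 driver code. struct sim_dev and all sim_* functions are hypothetical stand-ins invented for illustration; the comments mark where a real driver would presumably make the kernel calls.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_DEPTH 32   /* FIFO size quoted in the mail above */

/* Hypothetical stand-in for the driver's private state. */
struct sim_dev {
    int queue_stopped;          /* models the stopped state of the tx queue */
    const unsigned char *buf;   /* packet currently being transmitted */
    size_t len;                 /* total length of that packet */
    size_t sent;                /* bytes already pushed into the FIFO */
    size_t fifo_writes;         /* number of FIFO refills, for inspection */
};

/* Push the next chunk (at most FIFO_DEPTH bytes) into the pretend FIFO. */
static void sim_fill_fifo(struct sim_dev *dev)
{
    size_t chunk;

    assert(dev->sent <= dev->len);
    chunk = dev->len - dev->sent;
    if (chunk > FIFO_DEPTH)
        chunk = FIFO_DEPTH;
    dev->sent += chunk;
    dev->fifo_writes++;
}

/* Models hard_start_xmit(): returns 0 if the packet was accepted,
 * 1 if the device is still busy with the previous packet. */
static int sim_start_xmit(struct sim_dev *dev,
                          const unsigned char *buf, size_t len)
{
    if (dev->queue_stopped)
        return 1;
    dev->queue_stopped = 1;     /* real driver: netif_stop_queue(dev) */
    dev->buf = buf;
    dev->len = len;
    dev->sent = 0;
    sim_fill_fifo(dev);         /* first 32 bytes go out immediately */
    return 0;
}

/* Models the "FIFO ready" interrupt handler. */
static void sim_fifo_irq(struct sim_dev *dev)
{
    if (dev->sent < dev->len) {
        sim_fill_fifo(dev);     /* more data pending: refill and wait again */
        return;
    }
    dev->queue_stopped = 0;     /* real driver: netif_wake_queue(dev) */
}
```

With a 100-byte packet, sim_start_xmit() pushes the first chunk, three calls to sim_fifo_irq() push the remaining three, and a fourth call reopens the queue. Any sim_start_xmit() attempted in between is refused, which is the behaviour the mail asks for.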
From owner-netdev@oss.sgi.com Tue Jan 9 07:35:46 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 07:35:36 -0800 Received: from dukat.scot.redhat.com ([195.89.149.246]:1543 "EHLO dukat.scot.redhat.com") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 07:35:34 -0800 Received: (from sct@localhost) by dukat.scot.redhat.com (8.9.3/8.9.3) id PAA21454; Tue, 9 Jan 2001 15:27:02 GMT Date: Tue, 9 Jan 2001 15:27:02 +0000 From: "Stephen C. Tweedie" To: Ingo Molnar Cc: "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109152702.E9321@redhat.com> References: <20010109142542.G4284@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: ; from mingo@elte.hu on Tue, Jan 09, 2001 at 04:00:34PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2483 Lines: 63 Hi, On Tue, Jan 09, 2001 at 04:00:34PM +0100, Ingo Molnar wrote: > > On Tue, 9 Jan 2001, Stephen C. Tweedie wrote: > > we do have SLAB [which essentially caches structures, on a per-CPU basis] > which i did take into account, but still, initializing a 600+ byte kiovec > is probably more work than the rest of sending a packet! I mean i'd love > to eliminate the 200+ bytes skb initialization as well, it shows up. Reusing a kiobuf for a request involves setting up the length, offset and maybe errno fields, and writing the struct page *'s into the maplist[]. Nothing more. > > Bad bad bad. We already have SCSI devices optimised for bandwidth > > which don't approach decent performance until you are passing them 1MB > > IOs, [...] > > The fact that we're using single-page interfaces doesnt preclude us from > having nicely clustered requests, this is what IO-plugging is about! 
We've already got measurements showing how insane this is. Raw IO requests, plus internal pagebuf contiguous requests from XFS, have to get broken down into page-sized chunks by the current ll_rw_block() API, only to get reassembled by the make_request code. It's *enormous* overhead, and the kiobuf-based disk IO code demonstrates this clearly. We have already shown that the IO-plugging API sucks, I'm afraid. > > and even in networking the 1.5K packet limit kills us in some cases > > and we need an interface capable of generating jumbograms. > > which cases? Gig Ethernet, HIPPI... It's not so bad with an intelligent controller, admittedly. > > but if you're doing udp jumbograms (or STP or VIA), you do need an > > interface which can give the networking stack more than one page at > > once. > > nothing prevents the introduction of specialized interfaces - if they feel > like they can get enough traction. So you mean we'll introduce two separate APIs for general zero-copy, just to get around the problems in the single-page-based on? > I was talking about the normal Linux IO > APIs, read()/write()/sendfile(), which are byte granularity and invoke an > almost mandatory buffering/clustering mechanizm in every kernel subsystem > they deal with. Only tcp and ll_rw_block. ll_rw_block has already been fixed in the SGI patches, and gets _much_ better performance as a result. udp doesn't do any such clustering. That leaves tcp. The presence of terrible performance in the old ll_rw_block code is NOT a good excuse for perpetuating that model. 
Cheers, Stephen From owner-netdev@oss.sgi.com Tue Jan 9 07:38:06 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 07:37:56 -0800 Received: from chiara.elte.hu ([157.181.150.200]:14860 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 07:37:48 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 6C0C1186D; Tue, 9 Jan 2001 16:37:46 +0100 (CET) Date: Tue, 9 Jan 2001 16:37:26 +0100 (CET) From: Ingo Molnar Reply-To: To: "Stephen C. Tweedie" Cc: Rik van Riel , "David S. Miller" , , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <20010109151725.D9321@redhat.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 653 Lines: 16 On Tue, 9 Jan 2001, Stephen C. Tweedie wrote: > Jes has also got hard numbers for the performance advantages of > jumbograms on some of the networks he's been using, and you ain't > going to get udp jumbograms through a page-by-page API, ever. i know the performance advantages of jumbograms (typically when it's over a local network), it's undisputed. Still i dont see why it should be impossible to do effective UDP via a single-page interface. Eg. buffering of outgoing pages could be supported, and MSG_MORE in sendmsg() used to indicate end of stream. This is why ->writepage() has a 'more' flag (and tcp_sendpage() has a flag as well). Ingo From owner-netdev@oss.sgi.com Tue Jan 9 07:41:26 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 07:41:17 -0800 Received: from chiara.elte.hu ([157.181.150.200]:15884 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 07:41:08 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 473C5186D; Tue, 9 Jan 2001 16:41:06 +0100 (CET) Date: Tue, 9 Jan 2001 16:40:46 +0100 (CET) From: Ingo Molnar Reply-To: To: Stephen Frost Cc: "Stephen C. Tweedie" , Rik van Riel , "David S. 
Miller" , , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <20010109102525.Q26953@ns> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 801 Lines: 20 On Tue, 9 Jan 2001, Stephen Frost wrote: > Now, the interesting bit here is that the processes can grow to be > pretty large (200M+, up as high as 500M, higher if we let it ;) ) and what > happens with MOSIX is that entire processes get sent over the wire to > other machines for work. MOSIX will also attempt to rebalance the load on > all of the machines in the cluster and whatnot so it can often be moving > processes back and forth. then you'll love the zerocopy patch :-) Just use sendfile() or specify MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card DMA-and-checksumming on cards that support it. the discussion with Stephen is about various device-to-device schemes. (which Mosix i dont think wants to use. Mosix wants to use memory to device zero-copy, right?) Ingo From owner-netdev@oss.sgi.com Tue Jan 9 07:41:26 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 07:41:17 -0800 Received: from kanga.kvack.org ([216.129.200.3]:29706 "EHLO kanga.kvack.org") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 07:41:07 -0800 Received: (from localhost user: 'blah', uid#63042) by kanga.kvack.org with SMTP id ; Tue, 9 Jan 2001 10:38:30 -0500 Date: Tue, 9 Jan 2001 10:38:30 -0500 (EST) From: "Benjamin C.R. LaHaise" To: Ingo Molnar cc: "Stephen C. Tweedie" , Christoph Hellwig , "David S. 
Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1724 Lines: 38 On Tue, 9 Jan 2001, Ingo Molnar wrote: > > On Tue, 9 Jan 2001, Stephen C. Tweedie wrote: > > > > please study the networking portions of the zerocopy patch and you'll see > > > why this is not desirable. An alloc_kiovec()/free_kiovec() is exactly the > > > thing we cannot afford in a sendfile() operation. sendfile() is > > > lightweight, the setup times of kiovecs are not. > > > > > Right. However, kiobufs can be kept around for as long as you want > > and can be reused easily, and even if allocating and freeing them is > > more work than you want, populating an existing kiobuf is _very_ > > cheap. > > we do have SLAB [which essentially caches structures, on a per-CPU basis] > which i did take into account, but still, initializing a 600+ byte kiovec > is probably more work than the rest of sending a packet! I mean i'd love > to eliminate the 200+ bytes skb initialization as well, it shows up. Do the math again: for transmitting a single page in a kiobuf only 64 bytes needs to be initialized. If map_array is moved to the end of the structure, that's all contiguous data and is a single cacheline. What you're completely ignoring is that sendpages is lacking a huge amount of functionality that is *needed*. I can't implement clean async io on top of sendpages -- it'll require keeping 1 task around per outstanding io, which is exactly the bottleneck we're trying to work around. > The fact that we're using single-page interfaces doesnt preclude us from > having nicely clustered requests, this is what IO-plugging is about! 
It does waste a significant amount of CPU cycles trying to reassemble io requests and is not deterministic. Unplugging the io queue is a real pain with async io. -ben From owner-netdev@oss.sgi.com Tue Jan 9 07:49:06 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 07:48:56 -0800 Received: from ns.snowman.net ([63.80.4.34]:42253 "EHLO ns.snowman.net") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 07:48:50 -0800 Received: (from sfrost@localhost) by ns.snowman.net (8.9.3/8.9.3/Debian 8.9.3-21) id KAA18128; Tue, 9 Jan 2001 10:48:00 -0500 Date: Tue, 9 Jan 2001 10:48:00 -0500 From: Stephen Frost To: Ingo Molnar Cc: "Stephen C. Tweedie" , Rik van Riel , "David S. Miller" , hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109104800.R26953@ns> Mail-Followup-To: Ingo Molnar , "Stephen C. Tweedie" , Rik van Riel , "David S. Miller" , hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20010109102525.Q26953@ns> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="znxST63EDyM/OAm8" Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from mingo@elte.hu on Tue, Jan 09, 2001 at 04:40:46PM +0100 X-Editor: Vim http://www.vim.org/ X-Info: http://www.snowman.net X-Operating-System: Linux/2.2.16 (i686) X-Uptime: 10:42am up 145 days, 14:28, 6 users, load average: 2.00, 2.00, 2.00 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1794 Lines: 53 --znxST63EDyM/OAm8 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable * Ingo Molnar (mingo@elte.hu) wrote: > > On Tue, 9 Jan 2001, Stephen Frost wrote: > > > Now, the interesting bit here is that the processes can grow to be > > pretty large (200M+, up as high as 500M, higher if we let it ;) ) and what > > happens
with MOSIX is that entire processes get sent over the wire to > > other machines for work. MOSIX will also attempt to rebalance the load on > > all of the machines in the cluster and whatnot so it can often be moving > > processes back and forth. > > then you'll love the zerocopy patch :-) Just use sendfile() or specify > MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card > DMA-and-checksumming on cards that support it. Excellent, this patch certainly sounds interesting which is why I've been following this discussion. Once the MOSIX patch for 2.4 comes out I think I'm going to tinker with this and see if I can get MOSIX to use these methods. > the discussion with Stephen is about various device-to-device schemes. > (which Mosix i dont think wants to use. Mosix wants to use memory to > device zero-copy, right?) Yes, very much so actually now that I think about it. A lot of memory->device and device->memory work going on. I was mainly replying to the idea of clustering since that's what MOSIX is all about. Stephen --znxST63EDyM/OAm8 Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.4 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQE6WzKwrzgMPqB3kigRAlkxAJ95LQoPFn9t0rxpT4cHlGNyt3ToCQCdG58i yvQlMGYSS7HhAkBeSHG+tgY= =gIcs -----END PGP SIGNATURE----- --znxST63EDyM/OAm8-- From owner-netdev@oss.sgi.com Tue Jan 9 08:17:25 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 08:17:16 -0800 Received: from chiara.elte.hu ([157.181.150.200]:18444 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 08:17:03 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id A36C7186D; Tue, 9 Jan 2001 17:17:00 +0100 (CET) Date: Tue, 9 Jan 2001 17:16:40 +0100 (CET) From: Ingo Molnar Reply-To: To: "Stephen C. Tweedie" Cc: Christoph Hellwig , "David S.
Miller" , , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <20010109152702.E9321@redhat.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 4557 Lines: 96 On Tue, 9 Jan 2001, Stephen C. Tweedie wrote: > > we do have SLAB [which essentially caches structures, on a per-CPU basis] > > which i did take into account, but still, initializing a 600+ byte kiovec > > is probably more work than the rest of sending a packet! I mean i'd love > > to eliminate the 200+ bytes skb initialization as well, it shows up. > > Reusing a kiobuf for a request involves setting up the length, offset > and maybe errno fields, and writing the struct page *'s into the > maplist[]. Nothing more. i'm talking about kiovecs not kiobufs (because those are equivalent to a fragmented packet - every packet fragment can be anywhere). Initializing a kiovec involves touching a dozen cachelines. Keeping structures compressed is very important. i dont know. I dont think it's necesserily bad for a subsystem to have its own 'native structure' how it manages data. > We've already got measurements showing how insane this is. Raw IO > requests, plus internal pagebuf contiguous requests from XFS, have to > get broken down into page-sized chunks by the current ll_rw_block() > API, only to get reassembled by the make_request code. It's > *enormous* overhead, and the kiobuf-based disk IO code demonstrates > this clearly. i do believe that you are wrong here. We did have a multi-page API between sendfile and the TCP layer initially, and it made *absolutely no performance difference*. But it was more complex, and harder to fix. And we had to keep intelligent buffering/clustering/merging in any case, because some native Linux interfaces such as write() and read() have byte granularity. 
so unless there is some fundamental difference between the two approaches, i dont buy this argument. I dont necessarily say that your measurements are wrong, i'm saying that the performance analysis is wrong. > We have already shown that the IO-plugging API sucks, I'm afraid. it might not be important to others, but we do hold one particular SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full fileset of ~9 GB. It generates insane block-IO load, and we do beat other OSs that have multipage support, including SGI. (and no, it's not due to kernel-space acceleration alone this time - it's mostly due to very good block-IO performance.) We use Jens Axboe's IO-batching fixes that dramatically improve the block scheduler's performance under high load. > > > and even in networking the 1.5K packet limit kills us in some cases > > > and we need an interface capable of generating jumbograms. > > > > which cases? > > Gig Ethernet, [...] we handle gigabit ethernet with 1.5K zero-copy packets just fine. One thing people forget is IRQ throttling: when switching from 1500 byte packets to 9000 byte packets, the number of interrupts drops by a factor of 6. Now if the tunings of a driver are not changed accordingly, 1500 byte MTU can show dramatically lower performance than 9000 byte MTU. But if tuned properly, i see little difference between 1500 byte and 9000 byte MTU. (when using a good protocol such as TCP.) > > nothing prevents the introduction of specialized interfaces - if they feel > > like they can get enough traction. > > So you mean we'll introduce two separate APIs for general zero-copy, > just to get around the problems in the single-page-based on? no. But i think that none of the mainstream protocols or APIs mandate a multi-page interface - i do think that the performance problems mentioned were mis-analyzed. I'd call the multi-page API thing an urban legend.
Nobody in their right mind can claim that a series of function calls shows any difference in *block IO* performance, compared to a multi-page API (which has an additional vector-setup cost). Only functional differences can explain any measured performance difference - and for those merging/clustering bugs, multipage support is only a workaround. > > I was talking about the normal Linux IO > > APIs, read()/write()/sendfile(), which are byte granularity and invoke an > > almost mandatory buffering/clustering mechanizm in every kernel subsystem > > they deal with. > > Only tcp and ll_rw_block. ll_rw_block has already been fixed in the > SGI patches, and gets _much_ better performance as a result. [...] as mentioned above, i think this is not due to going multipage. > The presence of terrible performance in the old ll_rw_block code is > NOT a good excuse for perpetuating that model. i'd like to measure this performance problem (because i'd like to double-check it) - what measurement method was used? 
Ingo From owner-netdev@oss.sgi.com Tue Jan 9 08:31:05 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 08:30:55 -0800 Received: from mail.sun.ac.za ([146.232.128.1]:45065 "EHLO mail.sun.ac.za") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 08:30:34 -0800 Received: from prime.sun.ac.za ([146.232.164.2]) by mail.sun.ac.za with esmtp (Exim 2.10 #1) id 14G1fD-00031o-00; Tue, 9 Jan 2001 18:30:23 +0200 Received: from grobh (helo=localhost) by prime.sun.ac.za with local-esmtp (Exim 3.20 #1) id 14G1fD-0005nY-00; Tue, 09 Jan 2001 18:30:23 +0200 Date: Tue, 9 Jan 2001 18:30:23 +0200 (SAST) From: Hans Grobler To: Bernhard Jansen cc: , Subject: Re: The networking code in Kernel 2.4 In-Reply-To: <3A5A0264.457F967D@tdt.de> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 563 Lines: 15 On Mon, 8 Jan 2001, Bernhard Jansen wrote: > I have a little problem with the new networking code in kernel 2.4 . In > the 2.2 kernels was the dev->tbusy variable to make sure that > the dev->hard_start_xmit funktion is not called next before the old > packet is completly sent out. This may not be a complete answer to your question but you may want to search a LKML archive for subject "[ANNOUNCE] SOFTNETing". Should be around Feb of last year. 
You may also want to peek at (a Siemens SCC driver) http://www.afthd.tu-darmstadt.de/~dg1kjd/pciscc4 -- Hans From owner-netdev@oss.sgi.com Tue Jan 9 08:37:25 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 08:37:16 -0800 Received: from router-100M.swansea.linux.org.uk ([194.168.151.17]:45324 "EHLO the-village.bc.nu") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 08:37:13 -0800 Received: from alan by the-village.bc.nu with local (Exim 2.12 #1) id 14G1mB-0006vF-00; Tue, 9 Jan 2001 16:37:35 +0000 Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 To: mingo@elte.hu Date: Tue, 9 Jan 2001 16:37:32 +0000 (GMT) Cc: sct@redhat.com (Stephen C. Tweedie), hch@caldera.de (Christoph Hellwig), davem@redhat.com (David S. Miller), riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-Reply-To: from "Ingo Molnar" at Jan 09, 2001 05:16:40 PM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: From: Alan Cox Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 309 Lines: 8 > > We have already shown that the IO-plugging API sucks, I'm afraid. > > it might not be important to others, but we do hold one particular > SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full And its real world value is exactly the same as the mindcraft NT values. Don't forget that. From owner-netdev@oss.sgi.com Tue Jan 9 08:40:45 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 08:40:35 -0800 Received: from chiara.elte.hu ([157.181.150.200]:20236 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 08:40:28 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id DC93F186D; Tue, 9 Jan 2001 17:40:25 +0100 (CET) Date: Tue, 9 Jan 2001 17:40:06 +0100 (CET) From: Ingo Molnar Reply-To: To: "Benjamin C.R. LaHaise" Cc: "Stephen C. Tweedie" , Christoph Hellwig , "David S. 
Miller" , , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1713 Lines: 36 On Tue, 9 Jan 2001, Benjamin C.R. LaHaise wrote: > Do the math again: for transmitting a single page in a kiobuf only 64 > bytes needs to be initialized. If map_array is moved to the end of > the structure, that's all contiguous data and is a single cacheline. but you are comparing apples to oranges: an iobuf to a fragment-array. A fragment-array is equivalent to an array of iobufs. In typical (eg. HTTP) output we have mixed sendfile() and sendmsg() based output, so we have an array of (page, offset, size) memory-areas, not a (initial_offset, page[]) array like kiobufs. The closest thing would be an array of kiobufs (where each kiobuf would use a single page only). this is why i ment that *right now* kiobufs are not suited for networking, at least the way we do it. Maybe if kiobufs had the same kind of internal structure as sk_frag (ie. array of (page,offset,size) triples, not array of pages), that would work out better. > What you're completely ignoring is that sendpages is lacking a huge > amount of functionality that is *needed*. I can't implement clean > async io on top of sendpages -- it'll require keeping 1 task around > per outstanding io, which is exactly the bottleneck we're trying to > work around. Please take a look at next release of TUX. Probably the last missing piece was that i added O_NONBLOCK to generic_file_read() && sendfile(), so not fully cached requests can be offloaded to IO threads. Otherwise the current lowlevel filesystem infrastructure is not suited for implementing "process-less async IO "- and kiovecs wont be able to help that either. Unless we implement async, IRQ-driven bmap(), we'll always need some sort of process context to set up IO. 
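The layout difference described earlier in this message, an array of (page, offset, size) triples versus an (initial_offset, page[]) descriptor, can be sketched roughly as below. This is a userspace illustration only and does not reproduce the kernel's kiobuf or skb fragment definitions. struct sim_frag, struct sim_iobuf, SIM_PAGE_SIZE and sim_iobufs_needed() are hypothetical names, and pages are represented by plain integers. The function counts how many (initial_offset, page[]) style descriptors a fragment list would need, assuming one such descriptor can only cover fragments that join up exactly at page boundaries.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_PAGE_SIZE 4096u     /* assumed page size for the sketch */

/* Fragment-array style: every entry carries its own offset and size. */
struct sim_frag {
    int page;                   /* stand-in for a struct page pointer */
    unsigned offset;            /* start of the data within the page */
    unsigned size;              /* number of bytes in this fragment */
};

/* (initial_offset, page[]) style: one offset for the whole run of pages. */
struct sim_iobuf {
    unsigned offset;            /* offset into the first page only */
    unsigned length;            /* total bytes across all pages */
    size_t nr_pages;
    const int *pages;           /* stand-in for the page pointer array */
};

/* Count the sim_iobuf descriptors needed to describe a fragment list.
 * A fragment can share a descriptor with its predecessor only if the
 * predecessor runs to the end of its page and this one starts at 0. */
static size_t sim_iobufs_needed(const struct sim_frag *f, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        int continues;

        assert(f[i].offset + f[i].size <= SIM_PAGE_SIZE);
        continues = i > 0
            && f[i].offset == 0
            && f[i - 1].offset + f[i - 1].size == SIM_PAGE_SIZE;
        if (!continues)
            count++;
    }
    return count;
}
```

Under these assumptions, a small header fragment followed by file pages (the mixed sendmsg() and sendfile() case mentioned above) needs more than one descriptor, while a plain run of file pages fits in one.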
Ingo From owner-netdev@oss.sgi.com Tue Jan 9 08:49:26 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 08:49:06 -0800 Received: from chiara.elte.hu ([157.181.150.200]:22028 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 08:48:54 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 9D002186D; Tue, 9 Jan 2001 17:48:51 +0100 (CET) Date: Tue, 9 Jan 2001 17:48:32 +0100 (CET) From: Ingo Molnar Reply-To: To: Alan Cox Cc: "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 720 Lines: 19 On Tue, 9 Jan 2001, Alan Cox wrote: > > > We have already shown that the IO-plugging API sucks, I'm afraid. > > > > it might not be important to others, but we do hold one particular > > SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full > > And its real world value is exactly the same as the mindcraft NT > values. Don't forget that. ( what you have not quoted is the part that says that the fileset is 9GB. This is one of the busiest and most complex block-IO Linux systems i've ever seen, this is why i quoted it - the talk was about block-IO performance, and Stephen said that our block IO sucks. It used to suck, but in 2.4, with the right patch from Jens, it doesnt suck anymore. 
) Ingo From owner-netdev@oss.sgi.com Tue Jan 9 09:29:15 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 09:28:55 -0800 Received: from router-100M.swansea.linux.org.uk ([194.168.151.17]:9229 "EHLO the-village.bc.nu") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 09:28:38 -0800 Received: from alan by the-village.bc.nu with local (Exim 2.12 #1) id 14G2aT-00071v-00; Tue, 9 Jan 2001 17:29:33 +0000 Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 To: mingo@elte.hu Date: Tue, 9 Jan 2001 17:29:29 +0000 (GMT) Cc: alan@lxorguk.ukuu.org.uk (Alan Cox), sct@redhat.com (Stephen C. Tweedie), hch@caldera.de (Christoph Hellwig), davem@redhat.com (David S. Miller), riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-Reply-To: from "Ingo Molnar" at Jan 09, 2001 05:48:32 PM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: From: Alan Cox Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 495 Lines: 11 > ever seen, this is why i quoted it - the talk was about block-IO > performance, and Stephen said that our block IO sucks. It used to suck, > but in 2.4, with the right patch from Jens, it doesnt suck anymore. ) Thats fine. Get me 128K-512K chunks nicely streaming into my raid controller and I'll be a happy man I don't have a problem with the claim that its not the per page stuff and plugging that breaks ll_rw_blk. 
If there is evidence contradicting the SGI stuff it's very interesting From owner-netdev@oss.sgi.com Tue Jan 9 09:33:45 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 09:33:26 -0800 Received: from kanga.kvack.org ([216.129.200.3]:55306 "EHLO kanga.kvack.org") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 09:33:17 -0800 Received: (from localhost user: 'blah', uid#63042) by kanga.kvack.org with SMTP id ; Tue, 9 Jan 2001 12:30:39 -0500 Date: Tue, 9 Jan 2001 12:30:39 -0500 (EST) From: "Benjamin C.R. LaHaise" To: Ingo Molnar cc: "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1350 Lines: 29 On Tue, 9 Jan 2001, Ingo Molnar wrote: > this is why i ment that *right now* kiobufs are not suited for networking, > at least the way we do it. Maybe if kiobufs had the same kind of internal > structure as sk_frag (ie. array of (page,offset,size) triples, not array > of pages), that would work out better. That I can agree with, and it would make my life easier since I really only care about the completion of an entire io, not the individual fragments of it. > Please take a look at next release of TUX. Probably the last missing piece > was that i added O_NONBLOCK to generic_file_read() && sendfile(), so not > fully cached requests can be offloaded to IO threads. > > Otherwise the current lowlevel filesystem infrastructure is not suited for > implementing "process-less async IO "- and kiovecs wont be able to help > that either. Unless we implement async, IRQ-driven bmap(), we'll always > need some sort of process context to set up IO. 
I've already got fully async read and write working via a helper thread for doing the bmaps when the page is not uptodate in the page cache. The primatives for async locking of pages and waiting on events such that converting ext2 to performing full async bmap should be trivial. Note that O_NONBLOCK is not good enough because you can't implement an asynchronous O_SYNC write with it. -ben From owner-netdev@oss.sgi.com Tue Jan 9 09:39:25 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 09:39:16 -0800 Received: from ns.virtualhost.dk ([195.184.98.160]:30220 "EHLO virtualhost.dk") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 09:38:58 -0800 Received: from burns.home.kernel.dk ([192.168.0.2]) by virtualhost.dk with esmtp (Exim 3.16 #1) id 14G2ir-0006P6-00; Tue, 09 Jan 2001 18:38:13 +0100 Received: from axboe by burns.home.kernel.dk with local (Exim 3.13 #1 (Debian)) id 14G2im-0003Q7-00; Tue, 09 Jan 2001 18:38:08 +0100 Date: Tue, 9 Jan 2001 18:38:08 +0100 From: Jens Axboe To: Alan Cox Cc: mingo@elte.hu, "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109183808.A12128@suse.de> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from alan@lxorguk.ukuu.org.uk on Tue, Jan 09, 2001 at 05:29:29PM +0000 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 483 Lines: 13 On Tue, Jan 09 2001, Alan Cox wrote: > > ever seen, this is why i quoted it - the talk was about block-IO > > performance, and Stephen said that our block IO sucks. It used to suck, > > but in 2.4, with the right patch from Jens, it doesnt suck anymore. ) > > Thats fine. 
Get me 128K-512K chunks nicely streaming into my raid controller > and I'll be a happy man No problem, apply blk-13B and you'll get 512K chunks for SCSI and RAID. -- * Jens Axboe * SuSE Labs From owner-netdev@oss.sgi.com Tue Jan 9 09:54:56 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 09:54:46 -0800 Received: from ns.caldera.de ([212.34.180.1]:21516 "EHLO ns.caldera.de") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 09:54:25 -0800 Received: (from hch@localhost) by ns.caldera.de (8.9.3/8.9.3) id SAA17828; Tue, 9 Jan 2001 18:53:10 +0100 Date: Tue, 9 Jan 2001 18:53:10 +0100 From: Christoph Hellwig To: "Benjamin C.R. LaHaise" Cc: Ingo Molnar , "Stephen C. Tweedie" , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109185310.C15990@caldera.de> Mail-Followup-To: "Benjamin C.R. LaHaise" , Ingo Molnar , "Stephen C. Tweedie" , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: ; from blah@kvack.org on Tue, Jan 09, 2001 at 10:38:30AM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 722 Lines: 15 On Tue, Jan 09, 2001 at 10:38:30AM -0500, Benjamin C.R. LaHaise wrote: > What you're completely ignoring is that sendpages is lacking a huge amount > of functionality that is *needed*. I can't implement clean async io on > top of sendpages -- it'll require keeping 1 task around per outstanding > io, which is exactly the bottleneck we're trying to work around. Yepp. That's why I proposed to use rw_kiovec. Currently Alexy seems to have his own hack for socket-only async IO with some COW semantics for the userlevel buffers, but I would much prefer a generic version... Christoph P.S. Any chance to find a new version of your aio-patch somewhere?
-- Of course it doesn't work. We've performed a software upgrade. From owner-netdev@oss.sgi.com Tue Jan 9 10:15:46 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 10:15:36 -0800 Received: from dukat.scot.redhat.com ([195.89.149.246]:13063 "EHLO dukat.scot.redhat.com") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 10:15:23 -0800 Received: (from sct@localhost) by dukat.scot.redhat.com (8.9.3/8.9.3) id SAA21719; Tue, 9 Jan 2001 18:12:24 GMT Date: Tue, 9 Jan 2001 18:12:24 +0000 From: "Stephen C. Tweedie" To: "Benjamin C.R. LaHaise" Cc: Ingo Molnar , "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109181224.M9321@redhat.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: ; from blah@kvack.org on Tue, Jan 09, 2001 at 12:30:39PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 919 Lines: 24 Hi, On Tue, Jan 09, 2001 at 12:30:39PM -0500, Benjamin C.R. LaHaise wrote: > On Tue, 9 Jan 2001, Ingo Molnar wrote: > > > this is why i ment that *right now* kiobufs are not suited for networking, > > at least the way we do it. Maybe if kiobufs had the same kind of internal > > structure as sk_frag (ie. array of (page,offset,size) triples, not array > > of pages), that would work out better. > > That I can agree with, and it would make my life easier since I really > only care about the completion of an entire io, not the individual > fragments of it. Right, but this is why the kiobuf IO functions are supposed to accept kiovecs (ie. counted vectors of kiobuf *s, just like ll_rw_block receives buffer_heads). The kiobuf is supposed to be a unit of memory, not of IO. 
You can map several different kiobufs from different sources and send them all together to brw_kiovec() as a single IO. Cheers, Stephen From owner-netdev@oss.sgi.com Tue Jan 9 10:16:46 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 10:16:26 -0800 Received: from dukat.scot.redhat.com ([195.89.149.246]:14087 "EHLO dukat.scot.redhat.com") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 10:16:21 -0800 Received: (from sct@localhost) by dukat.scot.redhat.com (8.9.3/8.9.3) id SAA21714; Tue, 9 Jan 2001 18:10:43 GMT Date: Tue, 9 Jan 2001 18:10:43 +0000 From: "Stephen C. Tweedie" To: Ingo Molnar Cc: "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109181043.L9321@redhat.com> References: <20010109152702.E9321@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: ; from mingo@elte.hu on Tue, Jan 09, 2001 at 05:16:40PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 4259 Lines: 90 Hi, On Tue, Jan 09, 2001 at 05:16:40PM +0100, Ingo Molnar wrote: > On Tue, 9 Jan 2001, Stephen C. Tweedie wrote: > > i'm talking about kiovecs not kiobufs (because those are equivalent to a > fragmented packet - every packet fragment can be anywhere). Initializing a > kiovec involves touching a dozen cachelines. Keeping structures compressed > is very important. > > i dont know. I dont think it's necesserily bad for a subsystem to have its > own 'native structure' how it manages data. For the transmit case, unless the sender needs seriously fragmented data, the kiovec is just a kiobuf*. > i do believe that you are wrong here. We did have a multi-page API between > sendfile and the TCP layer initially, and it made *absolutely no > performance difference*. 
That may be fine for tcp, but tcp explicitly maintains the state of the caller and can stream things sequentially to a specific file descriptor. The block device layer, on the other hand, has to accept requests _in any order_ and still reorder them to the optimal elevator order. The merging in ll_rw_block is _far_ more expensive than adding a request to the end of a list. It's not helped by the fact that each such request has a buffer_head and a struct request associated with it, so deconstructing the large IO into buffer_heads results in huge amounts of data being allocated and deleted. We could streamline this greatly if the block device layer kept per-caller context in the way that tcp does, but the block device API just doesn't work that way. > > We have already shown that the IO-plugging API sucks, I'm afraid. > > it might not be important to others, but we do hold one particular > SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full > fileset of ~9 GB. It generates insane block-IO load, and we do beat other > OSs that have multipage support, including SGI. (and no, it's not due to > kernel-space acceleration alone this time - it's mostly due to very good > block-IO performance.) We use Jens Axobe's IO-batching fixes that > dramatically improve the block scheduler's performance under high load. Perhaps, but we have proven and significant reductions in CPU utilisation from eliminating the per-buffer_head API to the block layer. Next time M$ gets close to our specweb records, maybe this is the next place to look for those extra few % points! > > Gig Ethernet, [...] > > we handle gigabit ethernet with 1.5K zero-copy packets just fine. One > thing people forget is IRQ throttling: when switching from 1500 byte > packets to 9000 byte packets then the amount of interrupts drops by a > factor of 6. Now if the tunings of a driver are not changed accordingly, > 1500 byte MTU can show dramatically lower performance than 9000 byte MTU. 
> But if tuned properly, i see little difference between 1500 byte and 9000 > byte MTU. (when using a good protocol such as TCP.) Maybe you see good throughput numbers, but I still bet the CPU utilisation could be bettered significantly with jumbograms. That's one of the problems with benchmarks: our CPU may be fast enough that we can keep the IO subsystems streaming, and the benchmark will not show up any OS bottlenecks, but we may still be consuming far too much CPU time internally. That's certainly the case with the block IO measurements made on XFS: sure, ext2 can keep a fast disk loaded to pretty much 100%, but at the cost of far more system CPU time than XFS+pagebuf+kiobuf-IO takes on the same disk. > > The presence of terrible performance in the old ll_rw_block code is > > NOT a good excuse for perpetuating that model. > > i'd like to measure this performance problem (because i'd like to > double-check it) - what measurement method was used? "time" will show it. A 13MB/sec raw IO dd using 64K blocks uses something between 5% and 15% of CPU time on the various systems I've tested on (up to 30% on an old 486 with a 1540, but that's hardly representative. :) The kernel profile clearly shows the buffer management as the biggest cost, with the SCSI code walking those buffer heads a close second. On my main scsi server test box, I get raw 32K reads taking about 7% system time on the cpu, with make_request and __get_request_wait being the biggest hogs. --Stephen From owner-netdev@oss.sgi.com Tue Jan 9 10:28:26 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 10:28:16 -0800 Received: from ns.caldera.de ([212.34.180.1]:40716 "EHLO ns.caldera.de") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 10:28:05 -0800 Received: (from hch@localhost) by ns.caldera.de (8.9.3/8.9.3) id TAA21794; Tue, 9 Jan 2001 19:27:07 +0100 Date: Tue, 9 Jan 2001 19:27:07 +0100 From: Christoph Hellwig To: Ingo Molnar Cc: Rik van Riel , "David S. 
Miller" , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109192707.A20536@caldera.de> Mail-Followup-To: Ingo Molnar , Rik van Riel , "David S. Miller" , netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20010109113145.A28758@caldera.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: ; from mingo@elte.hu on Tue, Jan 09, 2001 at 12:05:59PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 3006 Lines: 67 On Tue, Jan 09, 2001 at 12:05:59PM +0100, Ingo Molnar wrote: > > On Tue, 9 Jan 2001, Christoph Hellwig wrote: > > > > 2.4. In any case, the zerocopy code is 'kiovec in spirit' (uses > > > vectors of struct page *, offset, size entities), > > > Yep. That is why I was so worried about the writepages file op. > > i believe you misunderstand. kiovecs (in their current form) are simply > too bloated for networking purposes. Stop. I NEVER said you should use them internally. My concern is to use a file operation with a kiobuf ** as main argument instead of page *. With a little more bloat it allows you to do the same you do now. But it also offers a real advantage: you don't have to call into the network stack for every single page, and this fits easily in Ben's AIO stuff, so your stuff is very well integrated into the (future) asynch IO framework. (The latter was my main concern). You pay 116 bytes and a few cycles for a _lot_ more abstraction and integration. Exactly such a design principle (design vs speed) is the reason why UNIX survived so long. > Due to its nature and nonpersistency, > networking is very lightweight and memory-footprint-sensitive code (as > opposed to eg.
block IO code), right now an 'struct skb_shared_info' > [which is roughly equivalent to a kiovec] is 12+4*6 == 36 bytes, which > includes support for 6 distinct fragments (each fragment can be on any > page, any offset, any size). A *single* kiobuf (which is roughly > equivalent to an skb fragment) is 52+16*4 == 116 bytes. 6 of these would > be 696 bytes, for a single TCP packet (!!!). This is simply not something > to be used for lightweight zero-copy networking. This doesn't matter, because rw_kiovec can easily take only one kiobuf, and you don't really need the different fragments there. > so it's easy to say 'use kiovecs', but right now it's simply not > practical. kiobufs are a loaded concept, and i'm not sure whether it's > desirable at all to mix networking zero-copy concepts with > block-IO/filesystem zero-copy concepts. I didn't want to suggest that - I'm too clueless concerning networking to even consider an internal design for network zero-copy IO. I'm just talking about the VFS interface to the rest of the kernel. > we talked (and are talking) to Stephen about this problem, but it's a > clearly 2.5 kernel issue. Merging to a finalized zero-copy framework will > be easy. (The overwhelming percentage of zero-copy code is in the > networking code itself and is insensitive to any kiovec issues.) Agreed. > > It's rather hackish (only write, looks useful only for networking) > > instead of the proposed rw_kiovec fop. > > i'm not sure what you are trying to say. You mean we should remove > sendfile() as well? It's only write, looks useful mostly for networking. A > substantial percentage of kernel code is useful only for networking :-) No. But it looks like a recvmsg syscall wouldn't be too bad either ... Christoph -- Whip me. Beat me. Make me maintain AIX.
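The size figures traded in the message above (36 bytes for a 'struct skb_shared_info', 116 bytes for one kiobuf, 696 bytes for six kiobufs) can be re-derived with a few lines of C. This is only a sketch of the arithmetic as quoted: the function names are hypothetical, the byte counts are copied from the thread rather than measured from any kernel tree, and reading each product as "slot size times slot count" is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * All byte counts below are the figures quoted in the thread; they are
 * not taken from kernel headers. Function names are hypothetical.
 */

/* Size of a descriptor: a fixed part plus n_slots slots of slot_bytes each. */
size_t descriptor_bytes(size_t fixed_bytes, size_t slot_bytes, size_t n_slots)
{
    return fixed_bytes + slot_bytes * n_slots;
}

/* Quoted as 12 + 4*6 == 36 bytes for struct skb_shared_info. */
size_t skb_shared_info_bytes(void)
{
    return descriptor_bytes(12, 4, 6);
}

/* Quoted as 52 + 16*4 == 116 bytes for a single kiobuf. */
size_t kiobuf_bytes(void)
{
    return descriptor_bytes(52, 4, 16);
}

/* One kiobuf per packet fragment, the "6 of these" case in the thread. */
size_t kiobufs_for_fragments_bytes(size_t n_fragments)
{
    assert(n_fragments > 0);
    return n_fragments * kiobuf_bytes();
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("skb_shared_info:    %zu bytes\n", skb_shared_info_bytes());
    printf("one kiobuf:         %zu bytes\n", kiobuf_bytes());
    printf("six kiobufs:        %zu bytes\n", kiobufs_for_fragments_bytes(6));
    return 0;
}
#endif
```

Compiling with -DDEMO_MAIN prints the three totals. The numbers show why the two sides talk past each other: one kiobuf per fragment costs 696 bytes per packet, while a single kiobuf handed to rw_kiovec costs 116 bytes per call.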
From owner-netdev@oss.sgi.com Tue Jan 9 10:36:16 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 10:35:57 -0800 Received: from chiara.elte.hu ([157.181.150.200]:25612 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 10:35:49 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 9D515186F; Tue, 9 Jan 2001 19:35:46 +0100 (CET) Date: Tue, 9 Jan 2001 19:35:28 +0100 (CET) From: Ingo Molnar Reply-To: To: "Benjamin C.R. LaHaise" Cc: "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 898 Lines: 25 On Tue, 9 Jan 2001, Benjamin C.R. LaHaise wrote: > I've already got fully async read and write working via a helper thread ^^^^^^^^^^^^^^^^^^^ > for doing the bmaps when the page is not uptodate in the page cache. ^^^^^^^^^^^^^^^^^^^ thats what TUX 2.0 does. (it does async reads at the moment.) > The primatives for async locking of pages and waiting on events such > that converting ext2 to performing full async bmap should be trivial. well - if you think it's trivial (ie. no process context, no helper thread will be needed), more power to you. How are you going to assure that the issuing process does not block during the bmap()? [without extensive lowlevel-FS changes that is.] > Note that O_NONBLOCK is not good enough because you can't implement an > asynchronous O_SYNC write with it. (i'm using it for reads only.) 
Ingo From owner-netdev@oss.sgi.com Tue Jan 9 10:39:16 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 10:39:06 -0800 Received: from chiara.elte.hu ([157.181.150.200]:27148 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 10:38:49 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 1DE99186D; Tue, 9 Jan 2001 19:38:47 +0100 (CET) Date: Tue, 9 Jan 2001 19:38:28 +0100 (CET) From: Ingo Molnar Reply-To: To: Jens Axboe Cc: Alan Cox , "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <20010109183808.A12128@suse.de> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 759 Lines: 20 On Tue, 9 Jan 2001, Jens Axboe wrote: > > > ever seen, this is why i quoted it - the talk was about block-IO > > > performance, and Stephen said that our block IO sucks. It used to suck, > > > but in 2.4, with the right patch from Jens, it doesnt suck anymore. ) > > > > Thats fine. Get me 128K-512K chunks nicely streaming into my raid controller > > and I'll be a happy man > > No problem, apply blk-13B and you'll get 512K chunks for SCSI and RAID. i cannot agree more - Jens' patch did wonders to IO performance here. It fixes a long-standing bug in the Linux block-IO-scheduler that caused very suboptimal requests being issued to lowlevel drivers once the request queue gets full. I think this patch is a clear candidate for 2.4.x inclusion. 
Ingo From owner-netdev@oss.sgi.com Tue Jan 9 11:20:38 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 11:20:29 -0800 Received: from chiara.elte.hu ([157.181.150.200]:30220 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 11:20:05 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 02C03186D; Tue, 9 Jan 2001 20:20:03 +0100 (CET) Date: Tue, 9 Jan 2001 20:19:44 +0100 (CET) From: Ingo Molnar Reply-To: To: Christoph Hellwig Cc: Rik van Riel , "David S. Miller" , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <20010109192707.A20536@caldera.de> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 366 Lines: 12 On Tue, 9 Jan 2001, Christoph Hellwig wrote: > I didn't want to suggest that - I'm to clueless concerning networking > to even consider an internal design for network zero-copy IO. I'm just > talking about the VFS interface to the rest of the kernel. (well, i think you just cannot be clueless about one and then demand various things about the other...) 
Ingo From owner-netdev@oss.sgi.com Tue Jan 9 11:20:48 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 11:20:38 -0800 Received: from raven.toyota.com ([63.87.74.200]:64004 "EHLO raven.toyota.com") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 11:20:23 -0800 Received: from uranium.tms.toyota.com (uranium.tms.toyota.com [10.48.31.6]) by raven.toyota.com (8.11.0/8.11.0) with ESMTP id f09JKFr17049; Tue, 9 Jan 2001 11:20:15 -0800 Received: from toyota.com (IDENT:jjs@localhost.localdomain [127.0.0.1]) by uranium.tms.toyota.com (8.11.0/8.11.0) with ESMTP id f09JKEW11585; Tue, 9 Jan 2001 11:20:14 -0800 Message-ID: <3A5B646E.6FFF1A4E@toyota.com> Date: Tue, 09 Jan 2001 11:20:14 -0800 From: J Sloan X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0-ll i686) X-Accept-Language: en MIME-Version: 1.0 To: Alan Cox CC: mingo@elte.hu, "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 299 Lines: 13 Alan Cox wrote: > > > it might not be important to others, but we do hold one particular > > SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full > > And its real world value is exactly the same as the mindcraft NT values. Don't > forget that. In other words, devastating. 
jjs From owner-netdev@oss.sgi.com Tue Jan 9 11:55:57 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 11:55:38 -0800 Received: from penguin.e-mind.com ([195.223.140.120]:32784 "EHLO penguin.e-mind.com") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 11:55:23 -0800 Received: from black.random ([195.223.140.107]) by penguin.e-mind.com (8.9.1a/8.9.1/Debian/GNU) with ESMTP id WAA15282; Tue, 9 Jan 2001 22:01:46 +0100 Received: from athlon.random ([192.168.1.7]) by black.random (8.10.2/8.10.2/SuSE Linux 8.10.0-0.3) with ESMTP id f09JsVA09014; Tue, 9 Jan 2001 20:54:31 +0100 Received: (from andrea@localhost) by athlon.random (8.10.2/8.10.2/SuSE Linux 8.10.0-0.3) id f09JsK601473; Tue, 9 Jan 2001 20:54:20 +0100 Date: Tue, 9 Jan 2001 20:54:20 +0100 From: Andrea Arcangeli To: Ingo Molnar Cc: Jens Axboe , Alan Cox , "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109205420.H29904@athlon.random> References: <20010109183808.A12128@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from mingo@elte.hu on Tue, Jan 09, 2001 at 07:38:28PM +0100 X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2011 Lines: 41 On Tue, Jan 09, 2001 at 07:38:28PM +0100, Ingo Molnar wrote: > > On Tue, 9 Jan 2001, Jens Axboe wrote: > > > > > ever seen, this is why i quoted it - the talk was about block-IO > > > > performance, and Stephen said that our block IO sucks. It used to suck, > > > > but in 2.4, with the right patch from Jens, it doesnt suck anymore. ) > > > > > > Thats fine. 
Get me 128K-512K chunks nicely streaming into my raid controller > > > and I'll be a happy man > > > > No problem, apply blk-13B and you'll get 512K chunks for SCSI and RAID. > > i cannot agree more - Jens' patch did wonders to IO performance here. It BTW, I noticed what is left in blk-13B seems to be my work (Jens's fixes for merging when the I/O queue is full have already been integrated in test1x). The 512K SCSI command, wake_up_nr, elevator fixes and cleanups and removal of the bogus 64 max_segment limit in scsi.c that matters only with the IOMMU to allow devices with sg_tablesize <64 to do SG with 64 segments were all thought up and implemented by me. My last public patch with most of the blk-13B stuff in it was here: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/2.4.0-test7/blkdev-3 I submitted a later revision of the above blkdev-3 to Jens and he kept nicely maintaining it in sync with 2.4.x-latest. My blkdev tree is even more advanced but I didn't have time to update with 2.4.0 and merge it with Jens yet (I already described to Jens what "more advanced" means though, in practice it means something like a x2 speedup in tiotest seek write numbers, streaming I/O doesn't change on highmem boxes but it doesn't hurt lowmem boxes anymore). Current blk-13B isn't ok for integration yet because it hurts with lowmem (try with mem=32m with your scsi array that gets 512K*512 requests in flight :) and it's not able to exploit the elevator as well as my tree even on highmemory machines. So I'd wait until I merge the last bits with Jens (I raised the QUEUE_NR_REQUESTS to 3000) before inclusion. Confirm Jens?
Andrea From owner-netdev@oss.sgi.com Tue Jan 9 12:11:07 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 12:10:58 -0800 Received: from chiara.elte.hu ([157.181.150.200]:34060 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 12:10:44 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 4850E186D; Tue, 9 Jan 2001 21:10:42 +0100 (CET) Date: Tue, 9 Jan 2001 21:10:24 +0100 (CET) From: Ingo Molnar Reply-To: To: Andrea Arcangeli Cc: Jens Axboe , Alan Cox , "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <20010109205420.H29904@athlon.random> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 548 Lines: 18 On Tue, 9 Jan 2001, Andrea Arcangeli wrote: > BTW, I noticed what is left in blk-13B seems to be my work (Jens's > fixes for merging when the I/O queue is full are just been integrated > in test1x). [...] it was Jens' [i think those were implemented by Jens entirely] batch-freeing changes that made the most difference. (we did profile it step by step.) > ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/2.4.0-test7/blkdev-3 great! i'm happy that the block IO layer and IO scheduler now has a real home :-) nice work. 
Ingo From owner-netdev@oss.sgi.com Tue Jan 9 12:13:08 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 12:12:57 -0800 Received: from ns.virtualhost.dk ([195.184.98.160]:28173 "EHLO virtualhost.dk") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 12:12:43 -0800 Received: from burns.home.kernel.dk ([192.168.0.2]) by virtualhost.dk with esmtp (Exim 3.16 #1) id 14G57q-0007JJ-00; Tue, 09 Jan 2001 21:12:10 +0100 Received: from axboe by burns.home.kernel.dk with local (Exim 3.13 #1 (Debian)) id 14G57k-0003xu-00; Tue, 09 Jan 2001 21:12:04 +0100 Date: Tue, 9 Jan 2001 21:12:04 +0100 From: Jens Axboe To: Andrea Arcangeli Cc: Ingo Molnar , Alan Cox , "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010109211204.E12128@suse.de> References: <20010109183808.A12128@suse.de> <20010109205420.H29904@athlon.random> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010109205420.H29904@athlon.random>; from andrea@suse.de on Tue, Jan 09, 2001 at 08:54:20PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 3131 Lines: 65 On Tue, Jan 09 2001, Andrea Arcangeli wrote: > > > > Thats fine. Get me 128K-512K chunks nicely streaming into my raid controller > > > > and I'll be a happy man > > > > > > No problem, apply blk-13B and you'll get 512K chunks for SCSI and RAID. > > > > i cannot agree more - Jens' patch did wonders to IO performance here. It > > BTW, I noticed what is left in blk-13B seems to be my work (Jens's fixes for > merging when the I/O queue is full are just been integrated in test1x). 
The > 512K SCSI command, wake_up_nr, elevator fixes and cleanups and removal of the > bogus 64 max_segment limit in scsi.c that matters only with the IOMMU to allow > devices with sg_tablesize <64 to do SG with 64 segments were all thought and > implemented by me. My last public patch with most of the blk-13B stuff in it > was here: > > ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/2.4.0-test7/blkdev-3 > > I sumbitted a later revision of the above blkdev-3 to Jens and he kept nicely > maintaining it in sync with 2.4.x-latest. There are several parts that have been merged beyond recognition at this point :-). The wake_up_nr was actually partially redone by Ingo, I suspect he can fill in the gaps there. Then there are the general cleanups and cruft removal done by you (elevator->nr_segments stuff). The bogus 64 max segments from SCSI was there before merge too, I think I've actually had that in my tree for ages! The request free batching and pending queues were done by me, and Ingo helped tweak it during the spec runs to find a sweet spot of how much to batch etc. The elevator received lots of massaging beyond blkdev-3. For one, there are now only one complete queue scan for merge and insert of request where we before did one for each of them. The merger also does correct accounting and aging. In addition there are a bunch other small fixes in there, I'm too lazy to list them all now :) > My blkdev tree is even more advanced but I didn't had time to update with 2.4.0 > and marge it with Jens yet (I just described to Jens what "more advanced" > means though, in practice it means something like a x2 speedup in tiotest seek I haven't heard anything beyond the raised QUEUE_NR_REQUEST, so I'd like to see what you have pending so we can merge :-). The tiotest seek increase was mainly due to the elevator having 3000 requests to juggle and thus being able to eliminate a lot of seeks right? 
> write numbers, streaming I/O doesn't change on highmem boxes but it doesn't > hurt lowmem boxes anymore). Current blk-13B isn't ok for integration yet > because it hurts with lowmem (try with mem=32m with your scsi array that gets > 512K*512 requests in flight :) and it's not able to exploit the elevator as I don't see any lowmem problems -- if under pressure, the queue should be fired and thus it won't get as long as if you have lots of memory free.` > well as my tree even on highmemory machines. So I'd wait until I merge the last > bits with Jens (I raised the QUEUE_NR_REQUESTS to 3000) before inclusion. ?? What do you mean exploit the elevator? -- * Jens Axboe * SuSE Labs From owner-netdev@oss.sgi.com Tue Jan 9 12:17:57 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 12:17:37 -0800 Received: from borg.denalics.net ([209.112.170.15]:35078 "HELO borg.denalics.net") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 12:17:33 -0800 Received: from borg.denalics.net (borg.denalics.net [209.112.170.15]) by borg.denalics.net (Postfix) with ESMTP id AA05948825; Tue, 09 Jan 2001 11:25:14 -0900 (AKST) Date: Tue, 9 Jan 2001 11:25:14 -0900 (AKST) From: "Christopher E. Brown" To: Alan Cox Cc: Ben Greear , "David S. Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1658 Lines: 39 On Sun, 7 Jan 2001, Alan Cox wrote: > > Um, what about people running their box as just a VLAN router/firewall? > > That seems to be one of the principle uses so far. Actually, in that case > > both VLAN and IP traffic would come through, so it would be a tie if VLAN > > came first, but non-vlan traffic would suffer worse. 
> > Why would someone filter between vlans when any node on each vlan can happily > ignore the vlan partitioning Think VLANing switch clusters. Say 4 switches connected by GigE on 4 floors or in 4 separate building. Now, across these switches 20 VLANS are running, with the switches enforcing VLAN partitioning. The client PCs know nothing about it, as each one resides within a single VLAN. Now we have our Linux box with 2 x 100Mbit FD links to the switch cluster running 10 VLANS per interface, and an external DS1/SDSL/whatever connection. We now have 20 separate zones with different security controls per zone, with per switchport control over who resided in what group. Or even forget the routing and just plugging a Linux box to a companies 200VLAN setup to provide DHCP/whatever. I must say, I *hate* VLANs for this use, it is a horrible thing to do that wastes massive amounts of bandwidth on simulating a local broadcast domain across a much larger area, but oh well. As long as we have stupid managers and brain dead sales persons not much will change. Are there better things to do than VLAN? YES! Will we get stuck with needing VLANs in the real world? YES! --- The roaches seem to have survived, but they are not routing packets correctly. --About the Internet and nuclear war. From owner-netdev@oss.sgi.com Tue Jan 9 13:31:48 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 13:31:28 -0800 Received: from pizda.ninka.net ([216.101.162.242]:39558 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 13:31:21 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id NAA05515; Tue, 9 Jan 2001 13:13:52 -0800 Date: Tue, 9 Jan 2001 13:13:52 -0800 Message-Id: <200101092113.NAA05515@pizda.ninka.net> From: "David S. 
Miller" To: sct@redhat.com CC: mingo@elte.hu, hch@caldera.de, riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org, sct@redhat.com In-reply-to: <20010109142542.G4284@redhat.com> (sct@redhat.com) Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <20010109122810.A3115@caldera.de> <20010109142542.G4284@redhat.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 461 Lines: 13 Date: Tue, 9 Jan 2001 14:25:42 +0000 From: "Stephen C. Tweedie" Perhaps tcp can merge internal 4K requests, but if you're doing udp jumbograms (or STP or VIA), you do need an interface which can give the networking stack more than one page at once. All network protocols can use the current interface and get the result you are after, see MSG_MORE. TCP isn't "special" in this regard. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 9 13:35:58 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 13:35:48 -0800 Received: from pizda.ninka.net ([216.101.162.242]:43398 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 13:35:35 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id NAA05606; Tue, 9 Jan 2001 13:18:06 -0800 Date: Tue, 9 Jan 2001 13:18:06 -0800 Message-Id: <200101092118.NAA05606@pizda.ninka.net> From: "David S. Miller" To: sct@redhat.com CC: mingo@elte.hu, sct@redhat.com, riel@conectiva.com.br, hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-reply-to: <20010109151725.D9321@redhat.com> (sct@redhat.com) Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <20010109141806.F4284@redhat.com> <20010109151725.D9321@redhat.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 432 Lines: 13 Date: Tue, 9 Jan 2001 15:17:25 +0000 From: "Stephen C. 
Tweedie" Jes has also got hard numbers for the performance advantages of jumbograms on some of the networks he's been using, and you ain't going to get udp jumbograms through a page-by-page API, ever. Again, see MSG_MORE in the patches. It is possible and our UDP implementation could do it easily. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 9 13:37:38 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 13:37:18 -0800 Received: from pizda.ninka.net ([216.101.162.242]:46214 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 13:37:05 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id NAA05633; Tue, 9 Jan 2001 13:19:28 -0800 Date: Tue, 9 Jan 2001 13:19:28 -0800 Message-Id: <200101092119.NAA05633@pizda.ninka.net> From: "David S. Miller" To: trond.myklebust@fys.uio.no CC: linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-reply-to: <14939.11765.649805.239618@charged.uio.no> (message from Trond Myklebust on Tue, 9 Jan 2001 16:27:49 +0100 (CET)) Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 References: <200101080124.RAA08134@pizda.ninka.net> <200101091342.FAA02414@pizda.ninka.net> <14939.11765.649805.239618@charged.uio.no> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 311 Lines: 12 Date: Tue, 9 Jan 2001 16:27:49 +0100 (CET) From: Trond Myklebust OK, but can you eventually generalize it to non-stream protocols (i.e. UDP)? Sure, this is what MSG_MORE is meant to accommodate. UDP could support it just fine. Later, David S.
Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 9 15:21:08 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 15:20:48 -0800 Received: from penguin.e-mind.com ([195.223.140.120]:3624 "EHLO penguin.e-mind.com") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 15:20:28 -0800 Received: from black.random ([195.223.140.107]) by penguin.e-mind.com (8.9.1a/8.9.1/Debian/GNU) with ESMTP id BAA19014; Wed, 10 Jan 2001 01:27:43 +0100 Received: from athlon.random ([192.168.1.7]) by black.random (8.10.2/8.10.2/SuSE Linux 8.10.0-0.3) with ESMTP id f09NKUA09027; Wed, 10 Jan 2001 00:20:30 +0100 Received: (from andrea@localhost) by athlon.random (8.10.2/8.10.2/SuSE Linux 8.10.0-0.3) id f09NKJv04584; Wed, 10 Jan 2001 00:20:19 +0100 Date: Wed, 10 Jan 2001 00:20:19 +0100 From: Andrea Arcangeli To: Jens Axboe Cc: Ingo Molnar , Alan Cox , "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010110002019.I29904@athlon.random> References: <20010109183808.A12128@suse.de> <20010109205420.H29904@athlon.random> <20010109211204.E12128@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010109211204.E12128@suse.de>; from axboe@suse.de on Tue, Jan 09, 2001 at 09:12:04PM +0100 X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2312 Lines: 43 On Tue, Jan 09, 2001 at 09:12:04PM +0100, Jens Axboe wrote: > I haven't heard anything beyond the raised QUEUE_NR_REQUEST, so I'd like to > see what you have pending so we can merge :-). The tiotest seek increase was > mainly due to the elevator having 3000 requests to juggle and thus being able > to eliminate a lot of seeks right? 
Raising QUEUE_NR_REQUEST is possible because of the rework of other parts of ll_rw_block meant to fix the lowmem boxes.

> > write numbers, streaming I/O doesn't change on highmem boxes but it doesn't
> > hurt lowmem boxes anymore). Current blk-13B isn't ok for integration yet
> > because it hurts with lowmem (try with mem=32m with your scsi array that gets
> > 512K*512 requests in flight :) and it's not able to exploit the elevator as
>
> I don't see any lowmem problems -- if under pressure, the queue should be
> fired and thus it won't get as long as if you have lots of memory free.

A write(2) shouldn't cause the allocator to wait for I/O completion. It's the write that should block when it's only polluting the cache, or you'll hurt the innocent rest of the system that isn't writing.

At least with my original implementation of the 512K large scsi command support that you merged, before a write could block you first had to generate at least 128Mbyte of memory _locked_, all queued in the I/O request list waiting for the driver to process the requests (only locked, without considering the dirty part of memory).

Since you raised from 256 requests per queue to 512 with your patch, you may have to generate 256Mbyte of locked memory before a write can block.

This is great on the 8G boxes that run specweb, but this isn't that great on a 32Mbyte box connected incidentally to a decent SCSI adapter.

I say "may" because I didn't check closely whether you introduced any kind of logic to avoid this. It seems not, though, because such logic needs to touch at least blkdev_release_request, and that's what I developed in my tree; then I could raise the number of I/O requests in the queue up to 10000 if I wanted without any problem, since the max I/O in flight was controlled properly (this allowed me to optimize away not 256, or in your case 512, seeks but 10000 seeks). This is what I meant by exploiting the elevator.
No panic, there's no buffer overflow there ;) Andrea From owner-netdev@oss.sgi.com Tue Jan 9 15:35:37 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 15:35:27 -0800 Received: from ns.virtualhost.dk ([195.184.98.160]:59150 "EHLO virtualhost.dk") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 15:35:06 -0800 Received: from burns.home.kernel.dk ([192.168.0.2]) by virtualhost.dk with esmtp (Exim 3.16 #1) id 14G8Hq-0008Uq-00; Wed, 10 Jan 2001 00:34:42 +0100 Received: from axboe by burns.home.kernel.dk with local (Exim 3.13 #1 (Debian)) id 14G8Hj-0004sg-00; Wed, 10 Jan 2001 00:34:35 +0100 Date: Wed, 10 Jan 2001 00:34:35 +0100 From: Jens Axboe To: Andrea Arcangeli Cc: Ingo Molnar , Alan Cox , "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 Message-ID: <20010110003435.K12128@suse.de> References: <20010109183808.A12128@suse.de> <20010109205420.H29904@athlon.random> <20010109211204.E12128@suse.de> <20010110002019.I29904@athlon.random> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010110002019.I29904@athlon.random>; from andrea@suse.de on Wed, Jan 10, 2001 at 12:20:19AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 3189 Lines: 62 On Wed, Jan 10 2001, Andrea Arcangeli wrote: > On Tue, Jan 09, 2001 at 09:12:04PM +0100, Jens Axboe wrote: > > I haven't heard anything beyond the raised QUEUE_NR_REQUEST, so I'd like to > > see what you have pending so we can merge :-). The tiotest seek increase was > > mainly due to the elevator having 3000 requests to juggle and thus being able > > to eliminate a lot of seeks right? > > Raising QUEUE_NR_REQUEST is possible because of the rework of other parts of > ll_rw_block meant to fix the lowmem boxes. Ah I see. 
It would be nice to base the QUEUE_NR_REQUEST on something other than a static number. For example, 3000 per queue translates into 281Kb of request slots per queue. On a typical system with a floppy, hard drive, and CD-ROM it's getting close to 1Mb of RAM used for this alone. On a 32Mb box this is unacceptable.

I previously had blk_init_queue_nr(q, nr_free_slots) to, e.g., not use that many free slots on, say, floppy, which doesn't really make much sense anyway.

> > I don't see any lowmem problems -- if under pressure, the queue should be
> > fired and thus it won't get as long as if you have lots of memory free.
>
> A write(2) shouldn't cause the allocator to wait I/O completion. It's the write
> that should block when it's only polluting the cache or you'll hurt the
> innocent rest of the system that isn't writing.
>
> At least with my original implementation of the 512K large scsi command
> support that you merged, before a write could block you first had to generate
> at least 128Mbyte of memory _locked_ all queued in the I/O request list waiting
> the driver to process the requests (only locked, without considering
> the dirty part of memory).
>
> Since you raised from 256 requests per queue to 512 with your patch you
> may have to generate 256Mbyte of locked memory before a write can block.
>
> This is great on the 8G boxes that runs specweb but this isn't that great on a
> 32Mbyte box connected incidentally to a decent SCSI adapter.

Yes, I see your point. However, memory shortage will fire the queue in due time; it won't make the WRITE block, though. In this case it would be bdflush blocking on the WRITEs, which seems to be exactly what we don't want?

> I say "may" because I didn't checked closely if you introduced any kind of
> logic to avoid this.
It seems not though because such a logic needs to touch at > least blkdev_release_request and that's what I developed in my tree and then I > could raise the number of I/O request in the queue up to 10000 if I wanted > without any problem, the max-I/O in flight was controlled properly. (this > allowed me to optimize away not 256 or in your case 512 seeks but 10000 seeks) > This is what I meant with exploiting the elevator. No panic, there's no buffer > overflow there ;) So you imposed a MB limit on how much I/O would be outstanding in blkdev_release_request? Wouldn't it make more sense to move this to at get_request time, since with the blkdev_release_request approach you won't catch lots of outstanding lock buffers before you start releasing one of them, at which point it would be too late (it might recover, but still). -- * Jens Axboe * SuSE Labs From owner-netdev@oss.sgi.com Tue Jan 9 15:53:27 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 15:53:07 -0800 Received: from penguin.e-mind.com ([195.223.140.120]:50733 "EHLO penguin.e-mind.com") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 15:52:46 -0800 Received: from black.random ([195.223.140.107]) by penguin.e-mind.com (8.9.1a/8.9.1/Debian/GNU) with ESMTP id CAA19756; Wed, 10 Jan 2001 02:00:05 +0100 Received: from athlon.random ([192.168.1.7]) by black.random (8.10.2/8.10.2/SuSE Linux 8.10.0-0.3) with ESMTP id f09NqrA09049; Wed, 10 Jan 2001 00:52:53 +0100 Received: (from andrea@localhost) by athlon.random (8.10.2/8.10.2/SuSE Linux 8.10.0-0.3) id f09Nqft05772; Wed, 10 Jan 2001 00:52:41 +0100 Date: Wed, 10 Jan 2001 00:52:41 +0100 From: Andrea Arcangeli To: Jens Axboe Cc: Ingo Molnar , Alan Cox , "Stephen C. Tweedie" , Christoph Hellwig , "David S. 
Miller" , riel@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1
Message-ID: <20010110005241.K29904@athlon.random>
References: <20010109183808.A12128@suse.de> <20010109205420.H29904@athlon.random> <20010109211204.E12128@suse.de> <20010110002019.I29904@athlon.random> <20010110003435.K12128@suse.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20010110003435.K12128@suse.de>; from axboe@suse.de on Wed, Jan 10, 2001 at 12:34:35AM +0100
X-GnuPG-Key-URL: http://e-mind.com/~andrea/aa.gnupg.asc
X-PGP-Key-URL: http://e-mind.com/~andrea/aa.asc
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 2085
Lines: 41

On Wed, Jan 10, 2001 at 12:34:35AM +0100, Jens Axboe wrote:
> Ah I see. It would be nice to base the QUEUE_NR_REQUEST on something else
> than a static number. For example, 3000 per queue translates into 281Kb
> of request slots per queue. On a typical system with a floppy, hard drive,
> and CD-ROM it's getting close to 1Mb of RAM used for this alone. On a
> 32Mb box this is unacceptable.

Yes, of course. In fact 3000 was just the number I chose when doing the benchmarks on a 128M box. Things need to be autotuning and that's not yet implemented. I mentioned 3000 to show how far such a number can grow. Right now if you use 3000 you will need to lock 1.5G of RAM (more than the normal zone!) before you can block with the 512K scsi commands. This was just to show that the rest of the blkdev layer was obviously restructured. On an 8G box 10000 requests would probably be a good number.

> Yes I see your point. However memory shortage will fire the queue in due
> time, it won't make the WRITE block however. In this case it would be

That's the performance problem I'm talking about on the lowmem boxes.
In fact this problem will happen in 2.4.x too, just less pronounced than with the 512K scsi commands and with your increase of the number of requests from 256 to 512.

> bdflush blocking on the WRITE's, which seem exactly what we don't want?

In 2.4.0 Linus fixed wakeup_bdflush not to wait for bdflush anymore, as I suggested; now it's the task context that submits the requests directly to the I/O queue, so it's the task that must block, not bdflush. And the task will block correctly _if_ we unplug at the sane time in ll_rw_block.

> So you imposed a MB limit on how much I/O would be outstanding in
> blkdev_release_request? Wouldn't it make more sense to move this to at

No, absolutely not in blkdev_release_request. The changes there are because you need to somehow do some accounting at I/O completion.

> get_request time, since with the blkdev_release_request approach you won't

Yes, only ll_rw_block unplugs, not blkdev_release_request, obviously, since the latter runs from irqs.

Andrea

From owner-netdev@oss.sgi.com Tue Jan 9 17:13:48 2001
Received: by oss.sgi.com id ; Tue, 9 Jan 2001 17:13:28 -0800
Received: from pizda.ninka.net ([216.101.162.242]:52873 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 17:13:23 -0800
Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id QAA07674; Tue, 9 Jan 2001 16:55:36 -0800
Date: Tue, 9 Jan 2001 16:55:36 -0800
Message-Id: <200101100055.QAA07674@pizda.ninka.net>
From: "David S. Miller"
To: linux-kernel@vger.kernel.org
CC: netdev@oss.sgi.com
Subject: Updated zerocopy patch up on kernel.org
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 763
Lines: 22

Nothing interesting or new, just merges up with the latest 2.4.1-pre1 patch from Linus.
ftp.kernel.org:/pub/linux/kernel/people/davem/zerocopy-2.4.1p1-1.diff.gz I haven't had any reports from anyone, which must mean that it is working perfectly fine and adds no new bugs, testers are thus in nirvana and thus have nothing to report. :-) As much as I would like to believe this, I know that there must be some (however minor) bug in there. So please make the effort to report bugs if you do spot them. Again, a reminder to test bugs also against vanilla 2.4.1pre1 and report the bug against that if the bug appears there too. This way, I know what bugs are specific to the zerocopy stuff and which are not. Thanks. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 9 17:24:07 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 17:23:48 -0800 Received: from panic.ohr.gatech.edu ([130.207.47.194]:61964 "EHLO havoc.gtf.org") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 17:23:37 -0800 Received: from mandrakesoft.com (adsl-20-73-169.asm.bellsouth.net [66.20.73.169]) by havoc.gtf.org (8.9.3/8.9.3) with ESMTP id UAA19865; Tue, 9 Jan 2001 20:23:16 -0500 Message-ID: <3A5BB985.8A249BE1@mandrakesoft.com> Date: Tue, 09 Jan 2001 20:23:17 -0500 From: Jeff Garzik Organization: MandrakeSoft X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.1-pre1 i686) X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: Updated zerocopy patch up on kernel.org References: <200101100055.QAA07674@pizda.ninka.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 771 Lines: 22 "David S. Miller" wrote: > > Nothing interesting or new, just merges up with the latest 2.4.1-pre1 > patch from Linus. 
> > ftp.kernel.org:/pub/linux/kernel/people/davem/zerocopy-2.4.1p1-1.diff.gz > > I haven't had any reports from anyone, which must mean that it is > working perfectly fine and adds no new bugs, testers are thus in > nirvana and thus have nothing to report. :-) Is there any value to supporting fragments in a driver which doesn't do hardware checksumming? IIRC Alexey had a patch to do such for Tulip, but I don't see it in the above patchset. Jeff -- Jeff Garzik | "You see, in this world there's two kinds of Building 1024 | people, my friend: Those with loaded guns MandrakeSoft | and those who dig. You dig." --Blondie From owner-netdev@oss.sgi.com Tue Jan 9 17:38:08 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 17:37:58 -0800 Received: from pizda.ninka.net ([216.101.162.242]:4490 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 17:37:30 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id RAA07805; Tue, 9 Jan 2001 17:20:08 -0800 Date: Tue, 9 Jan 2001 17:20:08 -0800 Message-Id: <200101100120.RAA07805@pizda.ninka.net> From: "David S. Miller" To: jgarzik@mandrakesoft.com CC: linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-reply-to: <3A5BB985.8A249BE1@mandrakesoft.com> (message from Jeff Garzik on Tue, 09 Jan 2001 20:23:17 -0500) Subject: Re: Updated zerocopy patch up on kernel.org References: <200101100055.QAA07674@pizda.ninka.net> <3A5BB985.8A249BE1@mandrakesoft.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 955 Lines: 24 Date: Tue, 09 Jan 2001 20:23:17 -0500 From: Jeff Garzik Is there any value to supporting fragments in a driver which doesn't do hardware checksumming? IIRC Alexey had a patch to do such for Tulip, but I don't see it in the above patchset. I'm actually considering making the SG w/o hwcsum situation illegal. Basically, with SG only, you can run into some problems. 
We don't prevent anyone from making modifications to the paged data; we just grab references to the pages. This works perfectly fine when the card does the checksumming: the card DMAs a snapshot of the data into its internal buffers, checksums that local snapshot of the data, and the checksum is fine.

If, on the other hand, we're doing this in software, the data can change after we have computed the checksum but before the card DMAs it, so we can send out packets with bad checksums. The next retransmit could fix it up, but this is a horrible scheme, quality-of-implementation wise.

Later,
David S. Miller
davem@redhat.com

From owner-netdev@oss.sgi.com Tue Jan 9 17:44:48 2001
Received: by oss.sgi.com id ; Tue, 9 Jan 2001 17:44:38 -0800
Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:34579 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 17:44:28 -0800
Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id TAA16606; Tue, 9 Jan 2001 19:47:45 -0700
Message-ID: <3A5BCD51.E5888A2C@candelatech.com>
Date: Tue, 09 Jan 2001 19:47:45 -0700
From: Ben Greear
Organization: Candela Technologies
X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586)
X-Accept-Language: en
MIME-Version: 1.0
To: "Christopher E. Brown"
CC: Alan Cox , "David S. Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com
Subject: Re: [PATCH] hashed device lookup (Does NOT meet Linus' sumission
References:
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 1742
Lines: 43

"Christopher E. Brown" wrote:
>
> Think VLANing switch clusters. Say 4 switches connected by
> GigE on 4 floors or in 4 separate building. Now, across these
> switches 20 VLANS are running, with the switches enforcing VLAN
> partitioning. The client PCs know nothing about it, as each one
> resides within a single VLAN.

That would seem to cut down broadcast packets, and generally be a good thing!
> > Now we have our Linux box with 2 x 100Mbit FD links to the > switch cluster running 10 VLANS per interface, and an external > DS1/SDSL/whatever connection. We now have 20 separate zones with > different security controls per zone, with per switchport control over > who resided in what group. Or even forget the routing and just > plugging a Linux box to a companies 200VLAN setup to provide > DHCP/whatever. > > I must say, I *hate* VLANs for this use, it is a horrible > thing to do that wastes massive amounts of bandwidth on simulating a > local broadcast domain across a much larger area, but oh well. As > long as we have stupid managers and brain dead sales persons not much > will change. Are there better things to do than VLAN? YES! Will we > get stuck with needing VLANs in the real world? YES! Umm, how does using VLANs lead to wasting massive amount of bandwidth? (You seem to be saying that by partitioning the network we make each partition bigger??) What are the better solutions? And what does your dislike for sales and management have to do with the topic at hand? -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Tue Jan 9 18:57:08 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 18:56:59 -0800 Received: from twinlark.arctic.org ([204.107.140.52]:55819 "HELO twinlark.arctic.org") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 18:56:49 -0800 Received: (qmail 22876 invoked by uid 500); 10 Jan 2001 02:56:34 -0000 Received: from localhost (sendmail-bs@127.0.0.1) by localhost with SMTP; 10 Jan 2001 02:56:34 -0000 Date: Tue, 9 Jan 2001 18:56:33 -0800 (PST) From: dean gaudet To: Ingo Molnar cc: Rik van Riel , "David S. 
Miller" , , , Subject: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) In-Reply-To: Message-ID: X-comment: visit http://arctic.org/~dean/legal for information regarding copyright and disclaimer. MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1986 Lines: 48 On Tue, 9 Jan 2001, Ingo Molnar wrote: > On Mon, 8 Jan 2001, Rik van Riel wrote: > > > Having proper kiobuf support would make it possible to, for example, > > do zerocopy network->disk data transfers and lots of other things. > > i used to think that this is useful, but these days it isnt. this seems to be in the general theme of "network receive is boring". which i mostly agree with... except recently i've been thinking about an application where it may not be so boring, but i haven't researched all the details yet. the application is storage over IP -- SAN using IP (i.e. gigabit ethernet) technologies instead of fiberchannel technologies. several companies are doing it or planning to do it (for example EMC, 3ware). i'm taking a wild guess that SCSI over FC is arranged conveniently to allow a scatter request to read packets off the FC NIC such that the headers go one way and the data lands neatly into the page cache (i.e. fixed length headers). i've never investigated the actual protocols though so maybe the solution used was to just push a lot of the detail down into the controllers. a quick look at the iSCSI specification , and the FCIP spec show that both use TCP/IP. TCP/IP has variable length headers (or am i on crack?), which totally complicates the receive path. the iSCSI requirements document seems to imply they're happy with pushing this extra processing down to a special storage NIC. 
that kind of sucks -- one of the benefits of storage over IP would be the ability to redundantly connect a box to storage and IP with only two NICs (instead of 4 -- 2 IP and 2 FC). is NFS receive single copy today? anyone tried doing packet demultiplexing by grabbing headers on one pass and scattering the data on a second pass? i'm hoping i'm missing something. anyone else looked around at this stuff yet? -dean From owner-netdev@oss.sgi.com Tue Jan 9 19:05:38 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 19:05:28 -0800 Received: from router-100M.swansea.linux.org.uk ([194.168.151.17]:3601 "EHLO the-village.bc.nu") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 19:05:18 -0800 Received: from alan by the-village.bc.nu with local (Exim 2.12 #1) id 14GBa7-00081t-00; Wed, 10 Jan 2001 03:05:47 +0000 Subject: Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, To: dean-list-linux-kernel@arctic.org (dean gaudet) Date: Wed, 10 Jan 2001 03:05:45 +0000 (GMT) Cc: mingo@elte.hu (Ingo Molnar), riel@conectiva.com.br (Rik van Riel), davem@redhat.com (David S. Miller), hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-Reply-To: from "dean gaudet" at Jan 09, 2001 06:56:33 PM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: From: Alan Cox Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 756 Lines: 18 > fixed length headers). i've never investigated the actual protocols > though so maybe the solution used was to just push a lot of the detail > down into the controllers. The stuff I have access to (MPT fusion) pushes the FC handling down onto the board. Basically you talk scsi and IP to it (See drivers/message/fusion in -ac) > > show that both use TCP/IP. TCP/IP has variable length headers (or am i on > crack?), which totally complicates the receive path. TCP has variable length headers. 
It also prevents you re-ordering commands in the stream which would be beneficial. I've not checked if the draft uses multiple TCP streams but then you have scaling questions. Alan From owner-netdev@oss.sgi.com Tue Jan 9 19:16:08 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 19:15:59 -0800 Received: from pizda.ninka.net ([216.101.162.242]:2955 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 19:15:46 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id SAA16505; Tue, 9 Jan 2001 18:58:07 -0800 Date: Tue, 9 Jan 2001 18:58:07 -0800 Message-Id: <200101100258.SAA16505@pizda.ninka.net> From: "David S. Miller" To: dean-list-linux-kernel@arctic.org CC: mingo@elte.hu, riel@conectiva.com.br, hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-reply-to: (message from dean gaudet on Tue, 9 Jan 2001 18:56:33 -0800 (PST)) Subject: Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) References: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 276 Lines: 12 Date: Tue, 9 Jan 2001 18:56:33 -0800 (PST) From: dean gaudet is NFS receive single copy today? With the zerocopy patches, NFS client receive is "single cpu copy" if that's what you mean. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 9 19:19:29 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 19:19:09 -0800 Received: from twinlark.arctic.org ([204.107.140.52]:57868 "HELO twinlark.arctic.org") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 19:18:53 -0800 Received: (qmail 25488 invoked by uid 500); 10 Jan 2001 03:18:53 -0000 Received: from localhost (sendmail-bs@127.0.0.1) by localhost with SMTP; 10 Jan 2001 03:18:53 -0000 Date: Tue, 9 Jan 2001 19:18:53 -0800 (PST) From: dean gaudet To: "David S. 
Miller" cc: , , , , Subject: Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) In-Reply-To: <200101100258.SAA16505@pizda.ninka.net> Message-ID: X-comment: visit http://arctic.org/~dean/legal for information regarding copyright and disclaimer. MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 468 Lines: 18 On Tue, 9 Jan 2001, David S. Miller wrote: > Date: Tue, 9 Jan 2001 18:56:33 -0800 (PST) > From: dean gaudet > > is NFS receive single copy today? > > With the zerocopy patches, NFS client receive is "single cpu copy" if > that's what you mean. yeah sorry, i meant: - NIC DMAs packet to memory - CPU reads headers from memory, figures out it's NFS - CPU copies data bytes from packet image in memory to pagecache -dean From owner-netdev@oss.sgi.com Tue Jan 9 19:27:59 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 19:27:39 -0800 Received: from pizda.ninka.net ([216.101.162.242]:21643 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 9 Jan 2001 19:27:32 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id TAA16576; Tue, 9 Jan 2001 19:09:59 -0800 Date: Tue, 9 Jan 2001 19:09:59 -0800 Message-Id: <200101100309.TAA16576@pizda.ninka.net> From: "David S. 
Miller" To: dean-list-linux-kernel@arctic.org CC: mingo@elte.hu, riel@conectiva.com.br, hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-reply-to: (message from dean gaudet on Tue, 9 Jan 2001 19:18:53 -0800 (PST)) Subject: Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1) References: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 391 Lines: 13 Date: Tue, 9 Jan 2001 19:18:53 -0800 (PST) From: dean gaudet - NIC DMAs packet to memory - CPU reads headers from memory, figures out it's NFS - CPU copies data bytes from packet image in memory to pagecache Yes, this is precisely what happens in the NFS client with the zerocopy patches applied. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 9 20:18:49 2001 Received: by oss.sgi.com id ; Tue, 9 Jan 2001 20:18:39 -0800 Received: from web121.mail.yahoo.com ([205.180.60.129]:7945 "HELO web121.yahoomail.com") by oss.sgi.com with SMTP id ; Tue, 9 Jan 2001 20:18:16 -0800 Received: (qmail 23160 invoked by uid 60001); 10 Jan 2001 04:18:15 -0000 Message-ID: <20010110041815.23159.qmail@web121.yahoomail.com> Received: from [156.153.255.134] by web121.yahoomail.com; Tue, 09 Jan 2001 20:18:15 PST Date: Tue, 9 Jan 2001 20:18:15 -0800 (PST) From: Cacophonix Subject: Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch, To: Alan Cox , dean gaudet Cc: Ingo Molnar , Rik van Riel , "David S. Miller" , hch@caldera.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 995 Lines: 31 I haven't tracked the IP storage group too closely, but was at the San Diego IETF where there were some interesting debates about this issue. 
There is a write-up at http://ips.pdl.cs.cmu.edu/mail/msg02598.html Now I'm not sure if I agree with some of the assumptions. And I share your concern about using multiple tcp streams. Thoughts? cheers, karthik --- Alan Cox wrote: > > > > show that both use TCP/IP. TCP/IP has variable length headers (or am i on > > crack?), which totally complicates the receive path. > > TCP has variable length headers. It also prevents you re-ordering commands > in the stream which would be beneficial. I've not checked if the draft uses > multiple TCP streams but then you have scaling questions. > > Alan > __________________________________________________ Do You Yahoo!? Yahoo! Photos - Share your holiday photos online! http://photos.yahoo.com/ From owner-netdev@oss.sgi.com Wed Jan 10 01:22:32 2001 Received: by oss.sgi.com id ; Wed, 10 Jan 2001 01:22:22 -0800 Received: from pat.uio.no ([129.240.130.16]:61118 "EHLO pat.uio.no") by oss.sgi.com with ESMTP id ; Wed, 10 Jan 2001 01:22:04 -0800 Received: from charged.uio.no ([129.240.86.49]) by pat.uio.no with esmtp (Exim 2.12 #7) id 14GHS8-0001Ay-00; Wed, 10 Jan 2001 10:21:56 +0100 Received: from trondmy by charged.uio.no with local (Exim 2.12 #1) id 14GHS4-0003xm-00; Wed, 10 Jan 2001 10:21:52 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14940.10672.413028.265595@charged.uio.no> Date: Wed, 10 Jan 2001 10:21:52 +0100 (CET) To: "David S. 
Miller" Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <200101092119.NAA05633@pizda.ninka.net> References: <200101080124.RAA08134@pizda.ninka.net> <200101091342.FAA02414@pizda.ninka.net> <14939.11765.649805.239618@charged.uio.no> <200101092119.NAA05633@pizda.ninka.net> X-Mailer: VM 6.72 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid Reply-To: trond.myklebust@fys.uio.no From: Trond Myklebust Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 758 Lines: 22 >>>>> " " == David S Miller writes: > Date: Tue, 9 Jan 2001 16:27:49 +0100 (CET) From: Trond > Myklebust > OK, but can you eventually generalize it to non-stream > protocols (i.e. UDP)? > Sure, this is what MSG_MORE is meant to accomodate. UDP could > support it just fine. Great! I've been waiting for something like this. In particular the knfsd TCP server code can get very buffer-intensive without it since you need to pre-allocate 1 set of buffers per TCP connection (else you get DOS due to buffer saturation when doing wait+retry for blocked sockets). If it all gets in to the kernel, I'll do the work of adapting the NFS + sunrpc stuff. Cheers, Trond From owner-netdev@oss.sgi.com Wed Jan 10 02:03:21 2001 Received: by oss.sgi.com id ; Wed, 10 Jan 2001 02:03:01 -0800 Received: from pizda.ninka.net ([216.101.162.242]:16270 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Wed, 10 Jan 2001 02:02:44 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id BAA18640; Wed, 10 Jan 2001 01:45:19 -0800 Date: Wed, 10 Jan 2001 01:45:19 -0800 Message-Id: <200101100945.BAA18640@pizda.ninka.net> From: "David S. 
Miller" To: neufeld@linuxcare.com CC: netdev@oss.sgi.com In-reply-to: <200101091528.f09FSXO24925@localhost.localdomain> (message from Christopher Neufeld on Tue, 9 Jan 2001 10:28:33 -0500) Subject: Re: Apparent bug in cls_u32.c References: <200101091528.f09FSXO24925@localhost.localdomain> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 438 Lines: 17 Date: Tue, 9 Jan 2001 10:28:33 -0500 From: Christopher Neufeld I suspect that the author's intent was to make the first part of the while() condition: while (--i>0 && .... ) Sure looks that way, and it is what I have done in my trees. I've sent the 2.2.x version of the fix off to Alan and will send the 2.4.x version to Linus eventually. Thanks. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Wed Jan 10 08:12:04 2001 Received: by oss.sgi.com id ; Wed, 10 Jan 2001 08:11:54 -0800 Received: from sirppi.helsinki.fi ([128.214.205.27]:32529 "EHLO sirppi.helsinki.fi") by oss.sgi.com with ESMTP id ; Wed, 10 Jan 2001 08:11:33 -0800 Received: from localhost (amlaukka@localhost) by sirppi.helsinki.fi (8.10.1/8.10.1) with ESMTP id f0AGBRV06669; Wed, 10 Jan 2001 18:11:28 +0200 (EET) X-Authentication-Warning: sirppi.helsinki.fi: amlaukka owned process doing -bs Date: Wed, 10 Jan 2001 18:11:27 +0200 (EET) From: Aki M Laukkanen To: cc: Subject: [PATCH] reassembly in IPv6 Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1754 Lines: 57 Hello, I just checked the 2.4.0 tree and this patch never made it. Sorry, it's against an older kernel. The bug is that icmpv6_param_prob() already frees the skb. I reported this back in November. 
Excerpt from RFC 2460: If the length of a fragment, as derived from the fragment packet's Payload Length field, is not a multiple of 8 octets and the M flag of that fragment is 1, then that fragment must be discarded and an ICMP Parameter Problem, Code 0, message should be sent to the source of the fragment, pointing to the Payload Length field of the fragment packet. The latter part of the patch concerns sending param_prob even when offset != 0. Alexey wrote in the comment otherwise though. Regarding the IPv6 evaluations, I've been unable to work on them - being full-time employed and working on my thesis. Maybe I can resume when things get less hectic. --- linux-2.4.0-test10/net/ipv6/reassembly.c Sat Jul 15 00:02:20 2000 +++ linux-2.4.0-test10.ipv6/net/ipv6/reassembly.c Sun Nov 5 01:31:43 2000 @@ -365,7 +365,7 @@ if ((unsigned int)end >= 65536) { icmpv6_param_prob(skb,ICMPV6_HDR_FIELD, (u8*)&fhdr->frag_off); - goto err; + return; } /* Is this the final fragment? */ @@ -383,16 +383,9 @@ * Required by the RFC. */ if (end & 0x7) { - printk(KERN_DEBUG "fragment not rounded to 8bytes\n"); - - /* - It is not in specs, but I see no reasons - to send an error in this case. --ANK - */ - if (offset == 0) - icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, - &skb->nh.ipv6h->payload_len); - goto err; + icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, + &skb->nh.ipv6h->payload_len); + return; } if (end > fq->len) { /* Some bits beyond end -> corruption. */ Aki From owner-netdev@oss.sgi.com Wed Jan 10 08:22:53 2001 Received: by oss.sgi.com id ; Wed, 10 Jan 2001 08:22:44 -0800 Received: from pizda.ninka.net ([216.101.162.242]:18563 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Wed, 10 Jan 2001 08:22:31 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id IAA02890; Wed, 10 Jan 2001 08:03:11 -0800 Date: Wed, 10 Jan 2001 08:03:11 -0800 Message-Id: <200101101603.IAA02890@pizda.ninka.net> From: "David S. 
Miller" To: amlaukka@cc.helsinki.fi CC: netdev@oss.sgi.com In-reply-to: (message from Aki M Laukkanen on Wed, 10 Jan 2001 18:11:27 +0200 (EET)) Subject: Re: [PATCH] reassembly in IPv6 References: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 626 Lines: 19 Date: Wed, 10 Jan 2001 18:11:27 +0200 (EET) From: Aki M Laukkanen I just checked the 2.4.0 tree and this patch never made it. Sorry, it's against an older kernel. The bug is that icmpv6_param_prob() already frees the skb. I reported this back in November. This is already fixed in my tree, it's just currently embedded in our zerocopy patches, it'll get merged. The latter part of the patch concerns sending param_prob even when offset != 0. Alexey wrote in the comment otherwise though. I've added this change to my tree, thanks. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Wed Jan 10 10:28:14 2001 Received: by oss.sgi.com id ; Wed, 10 Jan 2001 10:28:05 -0800 Received: from pizda.ninka.net ([216.101.162.242]:18564 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Wed, 10 Jan 2001 10:27:37 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id KAA05049; Wed, 10 Jan 2001 10:27:07 -0800 Date: Wed, 10 Jan 2001 10:27:07 -0800 Message-Id: <200101101827.KAA05049@pizda.ninka.net> From: "David S. Miller" To: nromer@lsil.com CC: netdev@oss.sgi.com, noah.romer@lsil.com In-reply-to: <3A40E268.F844EC2@lsil.com> (message from Noah Romer on Wed, 20 Dec 2000 11:46:32 -0500) Subject: Re: [PATCH] accept ARP's with HW Type of 1 from IEEE802 devices References: <3A40E268.F844EC2@lsil.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 418 Lines: 14 Date: Wed, 20 Dec 2000 11:46:32 -0500 From: Noah Romer Patch against linux/net/ipv4/arp.c to accept ARP's with HW Type of 1 from IEEE802 devices (specifically, Fibre Channel devices). 
This has been tested with linux-2.4.0-test9 and linux-2.4.0-test12. This patch didn't get lost, I've just applied it to my tree. Sorry for taking so long. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Wed Jan 10 12:28:34 2001 Received: by oss.sgi.com id ; Wed, 10 Jan 2001 12:28:24 -0800 Received: from lsi.lsil.com ([147.145.40.2]:41393 "EHLO lsi.lsil.com") by oss.sgi.com with ESMTP id ; Wed, 10 Jan 2001 12:28:01 -0800 Received: from mhbs.lsil.com (mhbs [147.145.31.100]) by lsi.lsil.com (8.9.3+Sun/8.9.1) with ESMTP id MAA15813 for ; Wed, 10 Jan 2001 12:28:00 -0800 (PST) Received: from inca.co.lsil.com by mhbs.lsil.com with ESMTP for netdev@oss.sgi.com; Wed, 10 Jan 2001 11:22:15 -0800 Received: from exw-kansas.ks.lsil.com (exw-kansas.ks.lsil.com [153.79.8.7]) by inca.co.lsil.com (8.9.3/8.9.3) with ESMTP id MAA07603; Wed, 10 Jan 2001 12:22:13 -0700 (MST) Received: from lsil.com (nromernt.ks.lsil.com [153.79.8.107]) by exw-kansas.ks.lsil.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id C2GGFB24; Wed, 10 Jan 2001 13:19:44 -0600 Message-Id: <3A5C65F0.7661093D@lsil.com> Date: Wed, 10 Jan 2001 08:38:56 -0500 From: Noah Romer Organization: LSI Logic X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.2.12 i686) X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com CC: "Romer, Noah" Subject: [Fwd: [PATCH] accept ARP's with HW Type of 1 from IEEE802 devices] Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1715 Lines: 44 Well it appears that my message never made it to the list the first time (not in the list archive for December, and no response), so I'm trying again. The patch applies cleanly to the 2.4.0 kernel (2 line fuzz reported by patch). So, as long as nothing major has changed in the arp handling while I was on vacation, it should still be Ok. 
-------- Original Message -------- Subject: [PATCH] accept ARP's with HW Type of 1 from IEEE802 devices Date: Wed, 20 Dec 2000 11:46:32 -0500 From: Noah Romer Organization: LSI Logic To: netdev@oss.sgi.com CC: noah.romer@lsil.com Patch against linux/net/ipv4/arp.c to accept ARP's with HW Type of 1 from IEEE802 devices (specifically, Fibre Channel devices). This has been tested with linux-2.4.0-test9 and linux-2.4.0-test12. --- diff -u arp.c~ arp.c --- arp.c~ Tue Oct 3 09:24:41 2000 +++ arp.c Wed Dec 20 16:05:36 2000 @@ -647,6 +647,20 @@ goto out; break; #endif +#ifdef CONFIG_NET_FC + case ARPHRD_IEEE802: + /* + * According to RFC 2625, Fibre Channel devices (which are IEEE + * 802 devices) should accept ARP hardware types of 6 (IEEE 802) + * and 1 (Ethernet). + */ + if (arp->ar_hrd != __constant_htons(ARPHRD_ETHER) && + arp->ar_hrd != __constant_htons(ARPHRD_IEEE802)) + goto out; + if (arp->ar_pro != __constant_htons(ETH_P_IP)) + goto out; + break; +#endif #if defined(CONFIG_AX25) || defined(CONFIG_AX25_MODULE) case ARPHRD_AX25: if (arp->ar_pro != __constant_htons(AX25_P_IP)) From owner-netdev@oss.sgi.com Wed Jan 10 12:56:54 2001 Received: by oss.sgi.com id ; Wed, 10 Jan 2001 12:56:34 -0800 Received: from pizda.ninka.net ([216.101.162.242]:26245 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Wed, 10 Jan 2001 12:56:21 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id MAA06642; Wed, 10 Jan 2001 12:55:58 -0800 Date: Wed, 10 Jan 2001 12:55:58 -0800 Message-Id: <200101102055.MAA06642@pizda.ninka.net> From: "David S. 
Miller" To: nromer@lsil.com CC: netdev@oss.sgi.com, noah.romer@lsil.com In-reply-to: <3A5C65F0.7661093D@lsil.com> (message from Noah Romer on Wed, 10 Jan 2001 08:38:56 -0500) Subject: Re: [Fwd: [PATCH] accept ARP's with HW Type of 1 from IEEE802 devices] References: <3A5C65F0.7661093D@lsil.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 204 Lines: 8 Noah, I posted a response to your original posting stating that I applied your patch. Your original posting did make it to the list, which is how I got it myself. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Wed Jan 10 16:54:16 2001 Received: by oss.sgi.com id ; Wed, 10 Jan 2001 16:54:07 -0800 Received: from cerberus.nemoto.ecei.tohoku.ac.jp ([130.34.199.67]:32005 "EHLO cerberus.nemoto.ecei.tohoku.ac.jp") by oss.sgi.com with ESMTP id ; Wed, 10 Jan 2001 16:53:44 -0800 Received: from localhost (yoshfuji@localhost [127.0.0.1]) by cerberus.nemoto.ecei.tohoku.ac.jp (8.9.3+3.2W/8.9.3/Debian 8.9.3-21) with ESMTP id JAA05878; Thu, 11 Jan 2001 09:53:16 +0900 To: netdev@oss.sgi.com CC: usagi-core@linux-ipv6.org Subject: [PATCH] fragment, MLD6, and router-alert option X-Mailer: Mew version 1.94 on Emacs 20.7 / Mule 4.1 (AOI) X-URL: http://www.ecei.tohoku.ac.jp/%7Eyoshfuji/ X-Fingerprint: F7 31 65 99 5E B2 BB A7 15 15 13 23 18 06 A9 6F 57 00 6B 25 X-Pgp5-Key-Url: http://cerberus.nemoto.ecei.tohoku.ac.jp/%7Eyoshfuji/yoshfuji@ecei.tohoku.ac.jp.asc Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: <20010111095316B.yoshfuji@ecei.tohoku.ac.jp> Date: Thu, 11 Jan 2001 09:53:16 +0900 From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= X-Dispatcher: imput version 990905(IM130) Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 4328 Lines: 163 Hi, this is feedback from USAGI Project. 1. 
Ensure to send parameter problem for fragment. 2. Fix router-alert option format. 3. Do not overrun while parsing tlv options. [MEMO] bFIX_2_4_0-20010111 -> tFIX_2_4_0-20010111_20010111 Index: net/ipv6/icmp.c =================================================================== RCS file: /cvsroot/usagi/kernel/linux24/net/ipv6/icmp.c,v retrieving revision 1.1.1.2 retrieving revision 1.1.1.2.2.2 diff -u -r1.1.1.2 -r1.1.1.2.2.2 --- net/ipv6/icmp.c 2001/01/05 05:39:31 1.1.1.2 +++ net/ipv6/icmp.c 2001/01/11 00:17:49 1.1.1.2.2.2 @@ -23,6 +23,8 @@ * Andi Kleen : exception handling * Andi Kleen add rate limits. never reply to a icmp. * add more length checks and other fixes. + * yoshfuji : ensure to sent parameter problem for + * fragments. */ #define __NO_VERSION__ @@ -191,9 +193,14 @@ * for sender. * * --ANK (980726) + * + * Note: We MUST send Parameter Problem even if it is not for the first + * fragment. See RFC2460. + * + * --yoshfuji (2001/01/05) */ -static int is_ineligible(struct ipv6hdr *hdr, int len) +static int is_ineligible(int type, struct ipv6hdr *hdr, int len) { u8 *ptr; __u8 nexthdr = hdr->nexthdr; @@ -208,7 +215,7 @@ struct icmp6hdr *ihdr = (struct icmp6hdr *)ptr; return (ptr - (u8*)hdr) > len || !(ihdr->icmp6_type & 0x80); } - return nexthdr == NEXTHDR_FRAGMENT; + return nexthdr == NEXTHDR_FRAGMENT && type != ICMPV6_PARAMPROB; } int sysctl_icmpv6_time = 1*HZ; @@ -341,7 +348,7 @@ /* * Never answer to a ICMP packet. 
*/ - if (is_ineligible(hdr, (u8*)skb->tail - (u8*)hdr)) { + if (is_ineligible(type, hdr, (u8*)skb->tail - (u8*)hdr)) { if (net_ratelimit()) printk(KERN_DEBUG "icmpv6_send: no reply to icmp error/fragment\n"); return; Index: net/ipv6/mcast.c =================================================================== RCS file: /cvsroot/usagi/kernel/linux24/net/ipv6/mcast.c,v retrieving revision 1.1.1.5 retrieving revision 1.1.1.5.2.2 diff -u -r1.1.1.5 -r1.1.1.5.2.2 --- net/ipv6/mcast.c 2000/10/05 13:40:35 1.1.1.5 +++ net/ipv6/mcast.c 2001/01/11 00:17:49 1.1.1.5.2.2 @@ -15,6 +15,11 @@ * 2 of the License, or (at your option) any later version. */ +/* Changes: + * + * yoshfuji : fix format of router-alert option + */ + #define __NO_VERSION__ #include #include @@ -491,7 +496,7 @@ struct in6_addr all_routers; int err, len, payload_len, full_len; u8 ra[8] = { IPPROTO_ICMPV6, 0, - IPV6_TLV_ROUTERALERT, 0, 0, 0, + IPV6_TLV_ROUTERALERT, 2, 0, 0, IPV6_TLV_PADN, 0 }; snd_addr = addr; Index: net/ipv6/exthdrs.c =================================================================== RCS file: /cvsroot/usagi/kernel/linux24/net/ipv6/exthdrs.c,v retrieving revision 1.1.1.1 retrieving revision 1.1.1.1.10.1 diff -u -r1.1.1.1 -r1.1.1.1.10.1 --- net/ipv6/exthdrs.c 2000/08/25 02:29:26 1.1.1.1 +++ net/ipv6/exthdrs.c 2001/01/11 00:33:05 1.1.1.1.10.1 @@ -15,6 +15,11 @@ * 2 of the License, or (at your option) any later version. */ +/* Changes: + * yoshfuji : ensure not to overrun while parsing + * tlv options. 
+ */ + #include #include #include @@ -104,6 +109,7 @@ struct tlvtype_proc *curr; u8 *ptr = skb->h.raw; int len = ((ptr[1]+1)<<3) - 2; + int optlen; ptr += 2; @@ -113,19 +119,31 @@ } while (len > 0) { - int optlen = ptr[1]+2; - switch (ptr[0]) { case IPV6_TLV_PAD0: optlen = 1; break; case IPV6_TLV_PADN: + if (len < 2) + goto bad; + optlen = ptr[1]+2; + if (len < optlen) + goto bad; break; - default: /* Other TLV code so scan list */ + default: + /* Other TLV code so scan list */ + if (len < 2) + goto bad; + optlen = ptr[1]+2; + if (len < optlen) + goto bad; for (curr=procs; curr->type >= 0; curr++) { if (curr->type == ptr[0]) { + /* type specific length/alignment + checks will be perfomed in the + func(). */ if (curr->func(skb, ptr) == 0) return 0; break; @@ -142,6 +160,7 @@ } if (len == 0) return 1; +bad: kfree_skb(skb); return 0; } -- Hideaki YOSHIFUJI @ USAGI Project PGP5i FP: F731 6599 5EB2 BBA7 1515 1323 1806 A96F 5700 6B25 From owner-netdev@oss.sgi.com Wed Jan 10 17:06:36 2001 Received: by oss.sgi.com id ; Wed, 10 Jan 2001 17:06:16 -0800 Received: from cerberus.nemoto.ecei.tohoku.ac.jp ([130.34.199.67]:33541 "EHLO cerberus.nemoto.ecei.tohoku.ac.jp") by oss.sgi.com with ESMTP id ; Wed, 10 Jan 2001 17:06:01 -0800 Received: from localhost (yoshfuji@localhost [127.0.0.1]) by cerberus.nemoto.ecei.tohoku.ac.jp (8.9.3+3.2W/8.9.3/Debian 8.9.3-21) with ESMTP id KAA05911; Thu, 11 Jan 2001 10:05:49 +0900 To: netdev@oss.sgi.com Cc: usagi-core@linux-ipv6.org Subject: Re: [PATCH] fragment, MLD6 and options (is Re: [PATCH] fragment, MLD6, and router-alert option) In-Reply-To: <20010111095316B.yoshfuji@ecei.tohoku.ac.jp> References: <20010111095316B.yoshfuji@ecei.tohoku.ac.jp> X-Mailer: Mew version 1.94 on Emacs 20.7 / Mule 4.1 (AOI) X-URL: http://www.ecei.tohoku.ac.jp/%7Eyoshfuji/ X-Fingerprint: F7 31 65 99 5E B2 BB A7 15 15 13 23 18 06 A9 6F 57 00 6B 25 X-Pgp5-Key-Url: http://cerberus.nemoto.ecei.tohoku.ac.jp/%7Eyoshfuji/yoshfuji@ecei.tohoku.ac.jp.asc Mime-Version: 1.0 
Content-Type: Text/Plain; charset=iso-2022-jp Content-Transfer-Encoding: 7bit Message-Id: <20010111100548N.yoshfuji@linux-ipv6.org> Date: Thu, 11 Jan 2001 10:05:48 +0900 From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= X-Dispatcher: imput version 990905(IM130) Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 589 Lines: 15 In article <20010111095316B.yoshfuji@ecei.tohoku.ac.jp> (at Thu, 11 Jan 2001 09:53:16 +0900), YOSHIFUJI Hideaki / 吉藤英明 says: > Hi, this is feedback from USAGI Project. > > 1. Ensure to send parameter problem for fragment. > 2. Fix router-alert option format. > 3. Do not overrun while parsing tlv options. > > [MEMO] bFIX_2_4_0-20010111 -> tFIX_2_4_0-20010111_20010111 Sorry, the Subject: and From: headers were not appropriate... -- Hideaki YOSHIFUJI @ USAGI Project PGP5i FP: F731 6599 5EB2 BBA7 1515 1323 1806 A96F 5700 6B25 From owner-netdev@oss.sgi.com Thu Jan 11 02:39:19 2001 Received: by oss.sgi.com id ; Thu, 11 Jan 2001 02:39:10 -0800 Received: from chiara.elte.hu ([157.181.150.200]:54541 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Thu, 11 Jan 2001 02:38:57 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 74C6B186D; Thu, 11 Jan 2001 11:38:50 +0100 (CET) Date: Thu, 11 Jan 2001 11:38:27 +0100 (CET) From: Ingo Molnar Reply-To: To: "David S. Miller" Cc: , Subject: Re: Updated zerocopy patch up on kernel.org In-Reply-To: <200101100055.QAA07674@pizda.ninka.net> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 435 Lines: 16 On Tue, 9 Jan 2001, David S. Miller wrote: > Nothing interesting or new, just merges up with the latest 2.4.1-pre1 > patch from Linus. 
> > ftp.kernel.org:/pub/linux/kernel/people/davem/zerocopy-2.4.1p1-1.diff.gz > > I haven't had any reports from anyone, which must mean that it is > working perfectly fine and adds no new bugs, testers are thus in > nirvana and thus have nothing to report. :-) (works like a charm here.) Ingo From owner-netdev@oss.sgi.com Thu Jan 11 02:42:19 2001 Received: by oss.sgi.com id ; Thu, 11 Jan 2001 02:41:59 -0800 Received: from chiara.elte.hu ([157.181.150.200]:54797 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Thu, 11 Jan 2001 02:41:53 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 7C86D186D; Thu, 11 Jan 2001 11:41:51 +0100 (CET) Date: Thu, 11 Jan 2001 11:41:30 +0100 (CET) From: Ingo Molnar Reply-To: To: "David S. Miller" Cc: , , Subject: Re: Updated zerocopy patch up on kernel.org In-Reply-To: <200101100120.RAA07805@pizda.ninka.net> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 591 Lines: 16 On Tue, 9 Jan 2001, David S. Miller wrote: > Is there any value to supporting fragments in a driver which > doesn't do hardware checksumming? IIRC Alexey had a patch to do > such for Tulip, but I don't see it in the above patchset. > > I'm actually considering making the SG w/o hwcsum situation illegal. i believe it might still make some limited sense for normal sendmsg() and higher MTUs (or 8k NFS) - we could copy & checksum stuff into the ->tcp_page if SG is possible and thus the SG capability improves the VM. (because we can allocate at PAGE_SIZE granularity.) 
Ingo From owner-netdev@oss.sgi.com Thu Jan 11 04:33:09 2001 Received: by oss.sgi.com id ; Thu, 11 Jan 2001 04:32:49 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:40324 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Thu, 11 Jan 2001 04:32:27 -0800 Received: from fred.muc.de (noidentity@ns1067.munich.netsurf.de [195.180.235.67]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id NAA25352; Thu, 11 Jan 2001 13:32:24 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 608F9E3BB8; Thu, 11 Jan 2001 13:37:03 +0100 (CET) Date: Thu, 11 Jan 2001 13:37:03 +0100 From: Andi Kleen To: Bernhard Jansen Cc: netdev@oss.sgi.com, alan@redhat.com Subject: Re: The networking code in Kernel 2.4 Message-ID: <20010111133703.A8869@fred.local> References: <3A5A0264.457F967D@tdt.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <3A5A0264.457F967D@tdt.de>; from bjansen@tdt.de on Tue, Jan 09, 2001 at 04:30:02PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 972 Lines: 24 On Tue, Jan 09, 2001 at 04:30:02PM +0100, Bernhard Jansen wrote: > Hi, > > I have a little problem with the new networking code in kernel 2.4. In > the 2.2 kernels there was the dev->tbusy variable to make sure that > the dev->hard_start_xmit function is not called again before the old > packet is completely sent out. My problem is that I want to write a driver > > for a Siemens (or rather, now, an Infineon) chip called ESCC2. It is for sync > serial communication and has a 32-byte-deep FIFO, which is the actual problem. > Because I get a packet from the upper layer which is normally > bigger than 32 bytes, I write the first 32 bytes out to the chip and then > I have to wait for an interrupt saying that the FIFO is ready again. 
The crux > of my problem is: how can I make sure that the dev->hard_start_xmit > function is not called again by the kernel until the whole buffer is > sent out? Read http://www.firstfloor.org/~andi/softnet -Andi -- This is like TV. I don't like TV. From owner-netdev@oss.sgi.com Thu Jan 11 04:33:59 2001 Received: by oss.sgi.com id ; Thu, 11 Jan 2001 04:33:39 -0800 Received: from pizda.ninka.net ([216.101.162.242]:6272 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Thu, 11 Jan 2001 04:33:33 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id UAA02297; Wed, 10 Jan 2001 20:33:16 -0800 Date: Wed, 10 Jan 2001 20:33:16 -0800 Message-Id: <200101110433.UAA02297@pizda.ninka.net> From: "David S. Miller" To: mingo@elte.hu CC: jgarzik@mandrakesoft.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-reply-to: (message from Ingo Molnar on Thu, 11 Jan 2001 11:41:30 +0100 (CET)) Subject: Re: Updated zerocopy patch up on kernel.org References: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 839 Lines: 24 Date: Thu, 11 Jan 2001 11:41:30 +0100 (CET) From: Ingo Molnar On Tue, 9 Jan 2001, David S. Miller wrote: > I'm actually considering making the SG w/o hwcsum situation illegal. i believe it might still make some limited sense for normal sendmsg() and higher MTUs (or 8k NFS) - we could copy & checksum stuff into the ->tcp_page if SG is possible and thus the SG capability improves the VM. (because we can allocate at PAGE_SIZE granularity.) Basically what you're advocating for is to take advantage of SG-only devices when we have full control of the page contents. Sure this would work. But honestly the real gain from SG-only devices would be (as you know) the memory usage savings when sending a single static file object to several thousand clients. Later, David S. 
Miller davem@redhat.com From owner-netdev@oss.sgi.com Thu Jan 11 04:41:00 2001 Received: by oss.sgi.com id ; Thu, 11 Jan 2001 04:40:50 -0800 Received: from mail.sun.ac.za ([146.232.128.1]:22542 "EHLO mail.sun.ac.za") by oss.sgi.com with ESMTP id ; Thu, 11 Jan 2001 04:40:43 -0800 Received: from prime.sun.ac.za ([146.232.164.2]) by mail.sun.ac.za with esmtp (Exim 2.10 #1) id 14Gh1w-0002gV-00; Thu, 11 Jan 2001 14:40:36 +0200 Received: from grobh (helo=localhost) by prime.sun.ac.za with local-esmtp (Exim 3.20 #1) id 14Gh1v-0004Ib-00; Thu, 11 Jan 2001 14:40:35 +0200 Date: Thu, 11 Jan 2001 14:40:35 +0200 (SAST) From: Hans Grobler To: Andi Kleen cc: Subject: Re: The networking code in Kernel 2.4 In-Reply-To: <20010111133703.A8869@fred.local> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 368 Lines: 12 Hi Andi, On Thu, 11 Jan 2001, Andi Kleen wrote: > Read http://www.firstfloor.org/~andi/softnet Maybe this should be converted into a Documentation/DocBook/* or just plain Documentation/networking/softnet. Given the requests so far, I think this is going to be asked a lot. Shall I have a go at this or would you like to? Would the original authors object? -- Hans From owner-netdev@oss.sgi.com Thu Jan 11 04:41:29 2001 Received: by oss.sgi.com id ; Thu, 11 Jan 2001 04:41:09 -0800 Received: from isis.its.uow.edu.au ([130.130.68.21]:36571 "EHLO isis.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Thu, 11 Jan 2001 04:40:56 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by isis.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id XAA11632; Thu, 11 Jan 2001 23:40:26 +1100 (EST) Message-ID: <3A5DAB5E.6F4B05B7@uow.edu.au> Date: Thu, 11 Jan 2001 23:47:26 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: mingo@elte.hu CC: "David S. 
Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: Updated zerocopy patch up on kernel.org References: <200101100055.QAA07674@pizda.ninka.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 737 Lines: 27 Ingo Molnar wrote: > > On Tue, 9 Jan 2001, David S. Miller wrote: > > > Nothing interesting or new, just merges up with the latest 2.4.1-pre1 > > patch from Linus. > > > > ftp.kernel.org:/pub/linux/kernel/people/davem/zerocopy-2.4.1p1-1.diff.gz > > > > I haven't had any reports from anyone, which must mean that it is > > working perfectly fine and adds no new bugs, testers are thus in > > nirvana and thus have nothing to report. :-) > > (works like a charm here.) For the record... I've been running it since release on 2.4.0-UP/x86. The NIC is a 3c905B so we're doing scatter/gather and hw checksumming. It does a lot of NFS client work against a Netapp server. rsize=wsize=8192. I'm not using sendfile(). IOW: me too. - From owner-netdev@oss.sgi.com Thu Jan 11 04:51:29 2001 Received: by oss.sgi.com id ; Thu, 11 Jan 2001 04:51:20 -0800 Received: from pizda.ninka.net ([216.101.162.242]:9600 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Thu, 11 Jan 2001 04:51:05 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id UAA02405; Wed, 10 Jan 2001 20:50:50 -0800 Date: Wed, 10 Jan 2001 20:50:50 -0800 Message-Id: <200101110450.UAA02405@pizda.ninka.net> From: "David S. Miller" To: linux-kernel@vger.kernel.org CC: netdev@oss.sgi.com Subject: Updated zerocopy patches on kernel.org Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 762 Lines: 25 Now against 2.4.1-pre2: ftp.kernel.org:/pub/linux/kernel/people/davem/zerocopy-2.4.1p2-1.diff.gz Changes since the previous installment: 1) Correct netfilter URLs, from Paul Russell. 
2) Increase MAX_SKB_FRAGS to 6, from me. 3) Set loopback MTU more appropriately now that we use page based SKBs, from me. 4) Backout bogus ip_decrease_ttl "boolean in C can be not 1" change. 5) Accept ARP hardware types of 6 and 1 for Fibre Channel, from LSI Logic. 6) If ipv6 fragment is not a multiple of 8 _always_ send parameter problem message. From Aki M. Laukkanen 7) Make u32 packet classifier algorithm halt if all handles are taken already. From me. 8) Fix SMP protection of xprt->snd_task value, from me. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Thu Jan 11 14:42:54 2001 Received: by oss.sgi.com id ; Thu, 11 Jan 2001 14:42:44 -0800 Received: from nero.doit.wisc.edu ([128.104.17.130]:18 "EHLO nero.doit.wisc.edu") by oss.sgi.com with ESMTP id ; Thu, 11 Jan 2001 14:42:36 -0800 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.8.7/8.8.7) id QAA29623 for netdev@oss.sgi.com; Thu, 11 Jan 2001 16:42:31 -0600 Date: Thu, 11 Jan 2001 16:42:31 -0600 From: "James R. Leu" To: netdev@oss.sgi.com Subject: TCP: sendmsg/recvmsg/ioctl(SIOCINQ/SIOCOUTQ) Message-ID: <20010111164231.C29379@doit.wisc.edu> Reply-To: jleu@mindspring.com References: <921FA59842C3D111BB2400A0C9498D0B9EB241@exchange03.rl.ac.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <921FA59842C3D111BB2400A0C9498D0B9EB241@exchange03.rl.ac.uk>; from R.Dravid@rl.ac.uk on Thu, Jan 11, 2001 at 01:29:01PM -0000 Organization: none Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 521 Lines: 20 It seems that TCP's recvmsg/sendmsg and ioctl for SIOCINQ/SIOCOUTQ do not check to make sure the socket is connected. I was unable to find a place that does check to make sure the socket is connected before doing the above socket ops. Is it adequate to do: if (sk->state != TCPF_ESTABLISHED) return(-ENOTCONN); to determine if a TCP socket is connected? 
Where should this check (or another more appropriate check) be made? Thanks, Jim PS I'm working with 2.4.0 -- James R. Leu From owner-netdev@oss.sgi.com Thu Jan 11 15:11:04 2001 Received: by oss.sgi.com id ; Thu, 11 Jan 2001 15:10:45 -0800 Received: from pizda.ninka.net ([216.101.162.242]:15749 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Thu, 11 Jan 2001 15:10:26 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id PAA27981; Thu, 11 Jan 2001 15:10:01 -0800 From: "David S. Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14942.15689.642046.345119@pizda.ninka.net> Date: Thu, 11 Jan 2001 15:10:01 -0800 (PST) To: jleu@mindspring.com Cc: netdev@oss.sgi.com Subject: Re: TCP: sendmsg/recvmsg/ioctl(SIOCINQ/SIOCOUTQ) In-Reply-To: <20010111164231.C29379@doit.wisc.edu> References: <921FA59842C3D111BB2400A0C9498D0B9EB241@exchange03.rl.ac.uk> <20010111164231.C29379@doit.wisc.edu> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 941 Lines: 24 James R. Leu writes: > It seems that TCPs recvmsg/sendmsg and ioctl for SIOCINQ/SIOCOUTQ do not > check to make sure the socket is connected. And what is the problem with that? If the socket is in the process of connecting (SYN_SENT or SYN_RECV) then SIOCINQ/SIOCOUTQ will report zero. If closed or closing, it will report zero since by definition all data is sent. The only invalid case is LISTEN, and we flag this with -EINVAL. For sendmsg/recvmsg you _want_ the kernel to wait for the connection attempt to complete if we are in SYN_SENT or SYN_RECV. If this is not what is happening (the socket is some other kind of "closed") you will get an EPIPE (and perhaps a SIGPIPE signal) back. 
This is all done via wait_for_tcp_connect in net/ipv4/tcp.c In short I see no problems in the current implementation, and I doubt any other BSD socket interface implementation acts much differently. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Thu Jan 11 15:48:03 2001 Received: by oss.sgi.com id ; Thu, 11 Jan 2001 15:47:43 -0800 Received: from nero.doit.wisc.edu ([128.104.17.130]:1298 "EHLO nero.doit.wisc.edu") by oss.sgi.com with ESMTP id ; Thu, 11 Jan 2001 15:47:28 -0800 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.8.7/8.8.7) id RAA29716; Thu, 11 Jan 2001 17:47:26 -0600 Date: Thu, 11 Jan 2001 17:47:26 -0600 From: "James R. Leu" To: "David S. Miller" Cc: netdev@oss.sgi.com Subject: Re: TCP: sendmsg/recvmsg/ioctl(SIOCINQ/SIOCOUTQ) Message-ID: <20010111174726.A29707@doit.wisc.edu> Reply-To: jleu@mindspring.com References: <921FA59842C3D111BB2400A0C9498D0B9EB241@exchange03.rl.ac.uk> <20010111164231.C29379@doit.wisc.edu> <14942.15689.642046.345119@pizda.ninka.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <14942.15689.642046.345119@pizda.ninka.net>; from davem@redhat.com on Thu, Jan 11, 2001 at 03:10:01PM -0800 Organization: none Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1257 Lines: 33 Hello David, On Thu, Jan 11, 2001 at 03:10:01PM -0800, David S. Miller wrote: > > James R. Leu writes: > > It seems that TCPs recvmsg/sendmsg and ioctl for SIOCINQ/SIOCOUTQ do not > > check to make sure the socket is connected. > > And what is the problem with that? > > If the socket is in the process of connecting (SYN_SENT or SYN_RECV) > then SIOCINQ/SIOCOUTQ will report zero. If closed or closing, it will > report zero since by definition all data is sent. The only invalid > case is LISTEN, and we flag this with -EINVAL. This makes sense. 
I interpreted the man page for tcp(4) to say that FIONREAD and TIOCOUTQ would leave errno equal to EPIPE if the socket was closed. > For sendmsg/recvmsg you _want_ the kernel to wait for the connection > attempt to complete if we are in SYN_SENT or SYN_RECV. If this is not > what is happening (the socket is some other kind of "closed") you will > get an EPIPE (and perhaps a SIGPIPE signal) back. This is all done > via wait_for_tcp_connect in net/ipv4/tcp.c In the code I see how sendmsg/recvmsg end up returning -EPIPE if the local side has been shut down (sk->shutdown & SEND_SHUTDOWN). What should I expect if I close down the remote side of the socket? Thanks, Jim -- James R. Leu From owner-netdev@oss.sgi.com Thu Jan 11 19:20:13 2001 Received: by oss.sgi.com id ; Thu, 11 Jan 2001 19:20:03 -0800 Received: from mailer.psc.edu ([128.182.58.100]:36625 "EHLO mailer.psc.edu") by oss.sgi.com with ESMTP id ; Thu, 11 Jan 2001 19:19:52 -0800 Received: from dexter.psc.edu (dexter.psc.edu [128.182.61.232]) by mailer.psc.edu (8.9.3/8.9.3/psc) with ESMTP id WAA16235; Thu, 11 Jan 2001 22:19:50 -0500 (EST) Received: from localhost (localhost [[UNIX: localhost]]) by dexter.psc.edu (8.11.0/8.11.0) with ESMTP id f0C3JoD25300; Thu, 11 Jan 2001 22:19:50 -0500 (EST) Date: Thu, 11 Jan 2001 22:19:50 -0500 (EST) From: John Heffner To: "James R. Leu" cc: netdev@oss.sgi.com Subject: Re: TCP: sendmsg/recvmsg/ioctl(SIOCINQ/SIOCOUTQ) In-Reply-To: <20010111174726.A29707@doit.wisc.edu> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 684 Lines: 17 On Thu, 11 Jan 2001, James R. Leu wrote: > > For sendmsg/recvmsg you _want_ the kernel to wait for the connection > > attempt to complete if we are in SYN_SENT or SYN_RECV. If this is not > > what is happening (the socket is some other kind of "closed") you will > > get an EPIPE (and perhaps a SIGPIPE signal) back.
This is all done > > via wait_for_tcp_connect in net/ipv4/tcp.c > > In the code I see how sendmsg/recvmsg end up returning -EPIPE if the local > side has been shutdown (sk->shutdown & SEND_SHUTDOWN). What should I expect > if I close down the remote side of the socket? If the other end has closed, you will get a -EPIPE from wait_for_tcp_connect(). -John From owner-netdev@oss.sgi.com Fri Jan 12 06:11:06 2001 Received: by oss.sgi.com id ; Fri, 12 Jan 2001 06:10:47 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:24050 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Fri, 12 Jan 2001 06:10:24 -0800 Received: from fred.muc.de (noidentity@ns1037.munich.netsurf.de [195.180.235.37]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id PAA07100; Fri, 12 Jan 2001 15:10:20 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id D4BD9E3913; Fri, 12 Jan 2001 15:09:21 +0100 (CET) Date: Fri, 12 Jan 2001 15:09:21 +0100 From: Andi Kleen To: Hans Grobler Cc: Andi Kleen , netdev@oss.sgi.com Subject: Re: The networking code in Kernel 2.4 Message-ID: <20010112150921.A6375@fred.local> References: <20010111133703.A8869@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: ; from grobh@sun.ac.za on Thu, Jan 11, 2001 at 01:41:34PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 549 Lines: 18 On Thu, Jan 11, 2001 at 01:41:34PM +0100, Hans Grobler wrote: > Hi Andi, > > On Thu, 11 Jan 2001, Andi Kleen wrote: > > Read http://www.firstfloor.org/~andi/softnet > > Maybe this should be converted into a Documentation/DocBook/* or just > plain Documentation/networking/softnet. Given the requests so far, I > think this is going to be asked a lot. Shall I have a go at this or would > you like to? Would the original authors object? Just go ahead. You'll need to ask the original authors though. -Andi -- This is like TV. I don't like TV. 
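The SIOCINQ/SIOCOUTQ behaviour David describes in the TCP thread above can be probed from user space. The following is a minimal sketch, assuming a Linux system that provides <linux/sockios.h>; the helper name tcp_queue_depths is hypothetical and does not come from the thread. It returns 0 and fills in both counts on success, or a negative errno value on failure, so a socket in LISTEN would be expected to produce -EINVAL, as David says.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/sockios.h>   /* SIOCINQ, SIOCOUTQ (Linux-specific header) */

/*
 * Hypothetical helper, for illustration only.
 * Asks the kernel how many bytes are queued on a socket:
 *   *inq  - unread bytes in the receive queue (SIOCINQ)
 *   *outq - unsent bytes in the send queue    (SIOCOUTQ)
 * Returns 0 on success, or -errno if either ioctl fails.
 */
static int tcp_queue_depths(int fd, int *inq, int *outq)
{
    if (ioctl(fd, SIOCINQ, inq) < 0)
        return -errno;
    if (ioctl(fd, SIOCOUTQ, outq) < 0)
        return -errno;
    return 0;
}
```

A caller would check the return value before trusting either count; per the message above, a socket that is still connecting should simply report zero for both.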
From owner-netdev@oss.sgi.com Fri Jan 12 06:11:17 2001 Received: by oss.sgi.com id ; Fri, 12 Jan 2001 06:10:56 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:25586 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Fri, 12 Jan 2001 06:10:24 -0800 Received: from fred.muc.de (noidentity@ns1037.munich.netsurf.de [195.180.235.37]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id PAA07101; Fri, 12 Jan 2001 15:10:20 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id D4547E3BB8; Fri, 12 Jan 2001 15:12:48 +0100 (CET) Date: Fri, 12 Jan 2001 15:12:48 +0100 From: Andi Kleen To: "James R. Leu" Cc: "David S. Miller" , netdev@oss.sgi.com Subject: Re: TCP: sendmsg/recvmsg/ioctl(SIOCINQ/SIOCOUTQ) Message-ID: <20010112151248.A6399@fred.local> References: <921FA59842C3D111BB2400A0C9498D0B9EB241@exchange03.rl.ac.uk> <20010111164231.C29379@doit.wisc.edu> <14942.15689.642046.345119@pizda.ninka.net> <20010111174726.A29707@doit.wisc.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010111174726.A29707@doit.wisc.edu>; from jleu@mindspring.com on Fri, Jan 12, 2001 at 12:48:36AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1023 Lines: 30 On Fri, Jan 12, 2001 at 12:48:36AM +0100, James R. Leu wrote: > Hello David, > > On Thu, Jan 11, 2001 at 03:10:01PM -0800, David S. Miller wrote: > > > > James R. Leu writes: > > > It seems that TCPs recvmsg/sendmsg and ioctl for SIOCINQ/SIOCOUTQ do not > > > check to make sure the socket is connected. > > > > And what is the problem with that? > > > > If the socket is in the process of connecting (SYN_SENT or SYN_RECV) > > then SIOCINQ/SIOCOUTQ will report zero. If closed or closing, it will > > report zero since by definition all data is sent. The only invalid > > case is LISTEN, and we flag this with -EINVAL. > > This makes sense. 
I interpreted the The man page for tcp(4) to say that > FIONREAD and TIOCOUTQ would leave errno equal to EPIPE if the socket was > closed. .B FIONREAD, TIOCINQ Returns the amount of queued unread data in the receive buffer. Argument is a pointer to an integer. I honestly don't see how it can be interpreted to state that. -Andi -- This is like TV. I don't like TV. From owner-netdev@oss.sgi.com Fri Jan 12 22:49:22 2001 Received: by oss.sgi.com id ; Fri, 12 Jan 2001 22:49:13 -0800 Received: from linuxcare.com.au ([203.29.91.49]:46860 "EHLO front.linuxcare.com.au") by oss.sgi.com with ESMTP id ; Fri, 12 Jan 2001 22:48:56 -0800 Received: (from anton@localhost) by front.linuxcare.com.au (8.9.3/8.9.3/Debian 8.9.3-21) id RAA10097; Sat, 13 Jan 2001 17:48:38 +1100 From: Anton Blanchard Date: Sat, 13 Jan 2001 17:46:31 +1100 To: Ingo Molnar Cc: "David S. Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: Updated zerocopy patch up on kernel.org Message-ID: <20010113174630.B17761@linuxcare.com> References: <200101100055.QAA07674@pizda.ninka.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.12i In-Reply-To: ; from mingo@elte.hu on Thu, Jan 11, 2001 at 11:38:27AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 461 Lines: 15 > > Nothing interesting or new, just merges up with the latest 2.4.1-pre1 > > patch from Linus. > > > > ftp.kernel.org:/pub/linux/kernel/people/davem/zerocopy-2.4.1p1-1.diff.gz > > > > I haven't had any reports from anyone, which must mean that it is > > working perfectly fine and adds no new bugs, testers are thus in > > nirvana and thus have nothing to report. :-) > > (works like a charm here.) 
Likewise here running a sendfile hacked samba :) Anton From owner-netdev@oss.sgi.com Sun Jan 14 10:31:08 2001 Received: by oss.sgi.com id ; Sun, 14 Jan 2001 10:30:48 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:60328 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 14 Jan 2001 10:30:25 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id NAA12517; Sun, 14 Jan 2001 13:29:23 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 14 Jan 2001 13:29:22 -0500 (EST) From: jamal To: , Subject: Is sendfile all that sexy? Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1885 Lines: 62 I thought i'd run some tests on the new zerocopy patches (this is using a hacked ttcp which knows how to do sendfile and does MSG_TRUNC for true zero-copy receive, if you know what i mean ;-> ). 2 back to back SMP 2*PII-450Mhz hooked up via 1M acenics (gigE). MTU 9K. Before getting excited i had the courage to give plain 2.4.0-pre3 a whirl and somethings bothered me. test1: ------ regular ttcp, no ZC and no sendfile. send as much as you can in 15secs; actually 8192 byte chunks, 2048 of them at a time. Repeat until 15 secs is complete. Repeat the test 5 times to narrow experimental deviation. Throughput: ~99MB/sec (for those obsessed with Mbps ~810Mbps) CPU abuse: server side 87% client side 22% (the CPU measurement could do with some work and proper measure for SMP). test2: ------ sendfile server. created a file which is 8192*2048 bytes. Again the same 15 second exercise as test1 (and the 5-set thing): - throughput: 86MB/sec - CPU: server 100%, client 17% So i figured, no problem i'll re-run it with a file 10 times larger. 
**I was disappointed to see no improvement.** Looking at the system calls being made: with the non-sendfile version, approximately 182K write-to-socket system calls were made, each writing 8192 bytes; each call lasted on average 0.08ms. With sendfile test2: 78 calls were made, each sending the file size 8192*2048 bytes; each lasted about 199 msecs. TWO observations: - Given Linux's non-pre-emptability of the kernel i get the feeling that sendfile could starve other user space programs. Imagine trying to send a 1Gig file on 10Mbps pipe in one shot. - It doesn't matter if you break down the file into chunks for self-pre-emption; sendfile is still a pig. I have a feeling i am missing some very serious shit. So enlighten me. Has anyone done similar tests? Anyways, the struggle continues next with zc patches. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 14 10:51:40 2001 Received: by oss.sgi.com id ; Sun, 14 Jan 2001 10:51:20 -0800 Received: from chiara.elte.hu ([157.181.150.200]:3344 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Sun, 14 Jan 2001 10:50:50 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 85CC7186D; Sun, 14 Jan 2001 19:50:47 +0100 (CET) Date: Sun, 14 Jan 2001 19:50:12 +0100 (CET) From: Ingo Molnar Reply-To: To: jamal Cc: , Subject: Re: Is sendfile all that sexy? In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 787 Lines: 23 On Sun, 14 Jan 2001, jamal wrote: > regular ttcp, no ZC and no sendfile. [...] > Throughput: ~99MB/sec (for those obsessed with Mbps ~810Mbps) > CPU abuse: server side 87% client side 22% [...] > sendfile server. > - throughput: 86MB/sec > - CPU: server 100%, client 17% i believe what you are seeing here is the overhead of the pagecache. When using sendmsg() only, you do not read() the file every time, right? Is ttcp using multiple threads?
In that case if the sendfile() is using the *same* file all the time, creating SMP locking overhead. if this is the case, what result do you get if you use a separate, isolated file per process? (And i bet that with DaveM's pagecache scalability patch the situation would also get much better - the global pagecache_lock hurts.) Ingo From owner-netdev@oss.sgi.com Sun Jan 14 11:04:10 2001 Received: by oss.sgi.com id ; Sun, 14 Jan 2001 11:03:49 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:64168 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 14 Jan 2001 11:03:30 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id OAA12577; Sun, 14 Jan 2001 14:02:40 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 14 Jan 2001 14:02:40 -0500 (EST) From: jamal To: Ingo Molnar cc: , Subject: Re: Is sendfile all that sexy? In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1106 Lines: 33 On Sun, 14 Jan 2001, Ingo Molnar wrote: > > i believe what you are seeing here is the overhead of the pagecache. When > using sendmsg() only, you do not read() the file every time, right? Is In that case just a user space buffer is sent i.e no file association. > ttcp using multiple threads? Only a single thread, single flow setup. Very primitive but simple. > In that case if the sendfile() is using the > *same* file all the time, creating SMP locking overhead. > > if this is the case, what result do you get if you use a separate, > isolated file per process? (And i bet that with DaveM's pagecache > scalability patch the situation would also get much better - the global > pagecache_lock hurts.) > Already doing the single file, single process. 
However, i do run by time which means i could read the file from the beginning (offset 0) to the end then re-do it for as many times as 15secs would allow. Does this affect it? I tried one 1.5 GB file, it was oopsing and given my setup right now i can't trace it. So i am using about 170M which is read about 8 times in the 15 secs. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 14 11:10:59 2001 Received: by oss.sgi.com id ; Sun, 14 Jan 2001 11:10:50 -0800 Received: from chiara.elte.hu ([157.181.150.200]:9232 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Sun, 14 Jan 2001 11:10:40 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 0550C186D; Sun, 14 Jan 2001 20:10:38 +0100 (CET) Date: Sun, 14 Jan 2001 20:09:54 +0100 (CET) From: Ingo Molnar Reply-To: To: jamal Cc: , Subject: Re: Is sendfile all that sexy? In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 726 Lines: 22 On Sun, 14 Jan 2001, jamal wrote: > Already doing the single file, single process. [...] in this case there could still be valid performance differences, as copying from user-space is cheaper than copying from the pagecache. To rule out SMP interactions, you could try a UP-IOAPIC kernel on that box. (I'm also curious what kind of numbers you'll get with the zerocopy patch.) > However, i do run by time which means i could read the file from the > beginning (offset 0) to the end then re-do it for as many times as > 15secs would allow. Does this affect it? [...] no, in the case of a single thread this should have minimum impact. But i'd suggest to increase the /proc/sys/net/tcp*mem* values (to 1MB or more).
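Alongside the system-wide /proc tunables mentioned here, a test program can also ask for larger buffers on a single socket with setsockopt(). This is only a sketch under stated assumptions: the helper name request_socket_buffers is hypothetical, it assumes Linux socket semantics, and the kernel may adjust or cap what is requested (on Linux the caps are the /proc/sys/net/core/wmem_max and rmem_max values), so the code reads the effective sizes back rather than assuming the request was honoured.

```c
#include <sys/socket.h>

/*
 * Hypothetical helper, for illustration only.
 * Requests `bytes` for both the send and receive buffer of `fd`,
 * then reads back what the kernel actually granted.
 * Returns 0 on success, -1 if any call fails (errno is left set).
 */
static int request_socket_buffers(int fd, int bytes, int *sndbuf, int *rcvbuf)
{
    socklen_t len;

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;

    /* The granted size can differ from the request, so query it. */
    len = sizeof(*sndbuf);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, sndbuf, &len) < 0)
        return -1;
    len = sizeof(*rcvbuf);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, rcvbuf, &len) < 0)
        return -1;
    return 0;
}
```

Printing the two returned values next to the requested size is a quick way to see whether a system-wide limit is holding a benchmark's buffers below what was intended.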
Ingo From owner-netdev@oss.sgi.com Sun Jan 14 11:19:30 2001 Received: by oss.sgi.com id ; Sun, 14 Jan 2001 11:19:20 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:65192 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 14 Jan 2001 11:19:12 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id OAA12584; Sun, 14 Jan 2001 14:18:29 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 14 Jan 2001 14:18:29 -0500 (EST) From: jamal To: Ingo Molnar cc: , Subject: Re: Is sendfile all that sexy? In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 791 Lines: 29 On Sun, 14 Jan 2001, Ingo Molnar wrote: > > in this case there could still be valid performance differences, as > copying from user-space is cheaper than copying from the pagecache. To > rule out SMP interactions, you could try a UP-IOAPIC kernel on that box. > Let me complete this with the ZC patches first, then i'll do that. There are a few retransmits; maybe receiver IRQ affinity might help some. > (I'm also curious what kind of numbers you'll get with the zerocopy > patch.) Working on it. > no, in the case of a single thread this should have minimum impact. But > i'd suggest to increase the /proc/sys/net/tcp*mem* values (to 1MB or > more). The upper thresholds to 1000000? I should have mentioned that i currently have /proc/sys/net/core/*mem* set to 262144.
cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 14 11:40:10 2001 Received: by oss.sgi.com id ; Sun, 14 Jan 2001 11:40:00 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:61453 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Sun, 14 Jan 2001 11:39:38 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA25279; Sun, 14 Jan 2001 22:39:31 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101141939.WAA25279@ms2.inr.ac.ru> Subject: Re: Is sendfile all that sexy? To: hadi@cyberus.CA (jamal) Date: Sun, 14 Jan 2001 22:39:31 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: from "jamal" at Jan 14, 1 10:15:01 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1073 Lines: 34 Hello! > it? I tried one 1.5 GB file, it was oopsing Jamal, you say this as something normal. 8) Seems, this is the most interesting statement of your report. You could tell where did it oops at least. > So i figured, no problem i'll re-run it with a file 10 times larger. > **I was dissapointed to see no improvement.** You should see much worse behaviour in this case. > cant trace it. So i am using about 170M which is read about 8 times in > the 15 secs You forgot to say how much of memory you machine has. 8) Page cache works as soon as pages are not pushed out of cache. In order to compare to write() from vm you must make the following things: 1. Use write() buffer not fitting to L2 cache. Otherwise you measure bandwidth of L2 cache, and in the case of ttcp it is even bandwidth of L1 cache. It will beat any zero copy, no doubts. 2. Take moderately large file, not pushed out from page cache for sendfile(). > - Given Linux's non-pre-emptability of the kernel i get the feeling that It is scheduled each sndbuf in the _worst_ case. 
Alexey From owner-netdev@oss.sgi.com Sun Jan 14 12:08:21 2001 Received: by oss.sgi.com id ; Sun, 14 Jan 2001 12:08:11 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:1193 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 14 Jan 2001 12:07:55 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id PAA12650; Sun, 14 Jan 2001 15:07:08 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 14 Jan 2001 15:07:08 -0500 (EST) From: jamal To: cc: Subject: Re: Is sendfile all that sexy? In-Reply-To: <200101141939.WAA25279@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2059 Lines: 81 On Sun, 14 Jan 2001 kuznet@ms2.inr.ac.ru wrote: > Hello! > > > it? I tried one 1.5 GB file, it was oopsing > > Jamal, you say this as something normal. 8) > > Seems, this is the most interesting statement of your report. > You could tell where did it oops at least. Ok. I'll try tracing this ;-> > > > So i figured, no problem i'll re-run it with a file 10 times larger. > > **I was dissapointed to see no improvement.** > > You should see much worse behaviour in this case. > I didnt see a difference > > > cant trace it. So i am using about 170M which is read about 8 times in > > the 15 secs > > You forgot to say how much of memory you machine has. 8) 256M. > Page cache works as soon as pages are not pushed out of cache. > In order to compare to write() from vm you must make the following things: > > 1. Use write() buffer not fitting to L2 cache. Otherwise you measure > bandwidth of L2 cache, and in the case of ttcp it is even bandwidth > of L1 cache. It will beat any zero copy, no doubts. I think the L2 cache on these machine is 512KB. So something slightly larger (600KB)? > 2. 
Take moderately large file, not pushed out from page cache > for sendfile(). > How Large? 170M currently. > > - Given Linux's non-pre-emptability of the kernel i get the feeling that > > It is scheduled each sndbuf in the _worst_ case. > OK. ----- Some more data, with zc patches on 2.4.0-pre3: * again the 8192 byte writes tput: 66.2MB/sec (compare to 99MB/sec with no patch) CPU-sender: 60% (compare to 87% earlier) CPU-receive: 11% (compare to 22% earlier) Sendfile: 170MB file tput: 68MB/sec (compare to 86MB/sec) CPU-sender: 8% (compare to 100% earlier) CPU-receiver 8%(compare to 17% earlier) So i would say that CPU utilization has improved incredibly, but throughput has gone down. Which means you could (probably) have a lot more flows running concurently with the zc patches. I would say also that sendfile is very much usable now. --- To Ingo, upping the thresholds to up 10 times what they are does not help. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 14 12:18:01 2001 Received: by oss.sgi.com id ; Sun, 14 Jan 2001 12:17:51 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:11022 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Sun, 14 Jan 2001 12:17:35 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA25556; Sun, 14 Jan 2001 23:17:22 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101142017.XAA25556@ms2.inr.ac.ru> Subject: Re: Updated zerocopy patch up on kernel.org To: mingo@elte.HU Date: Sun, 14 Jan 2001 23:17:22 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: from "Ingo Molnar" at Jan 11, 1 03:15:06 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1041 Lines: 25 Hello! > i believe it might still make some limited sense for normal sendmsg() > and higher MTUs (or 8k NFS) - we could copy & checksum stuff into the > ->tcp_page if SG is possible and thus the SG capability improves the VM. 
> (because we can allocate at PAGE_SIZE granularity.) There is no problem in doing this for udp and af_unix sockets. But I suspect it is not very useful. NFS allocations do not stress VM too much and high order allocations are still cheaper than allocations of vector of pages. Seems, this requires some caching to allocate vectors of pages. For tcp it is also not difficult, though requires combined segmentizer and pager. Easy to do, but not useful because interfaces with large mtu not doing checksummming in hardware... do they exist if to forget about buggy SK? Real advantage is that this scheme allows to get rid of too early fragmentation, but any efforts in this direction are worthful only after netfilter and all the popular devices (f.e. including ppp) start to understand non-linear skbs. Alexey From owner-netdev@oss.sgi.com Sun Jan 14 12:30:01 2001 Received: by oss.sgi.com id ; Sun, 14 Jan 2001 12:29:51 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:13070 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Sun, 14 Jan 2001 12:29:31 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA25591; Sun, 14 Jan 2001 23:29:24 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101142029.XAA25591@ms2.inr.ac.ru> Subject: Re: Is sendfile all that sexy? To: hadi@cyberus.ca (jamal) Date: Sun, 14 Jan 2001 23:29:24 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: from "jamal" at Jan 14, 1 03:07:08 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 808 Lines: 35 Hello! > 256M. 8Mb file should be enough. At least, I do see real difference between the first and subsequent runs with 32MB file and 512M machine. You do not and this is funny. > I think the L2 cache on these machine is 512KB. So something slightly > larger (600KB)? >2*L2 cache. Otherwise it is not invalidated. > How Large? 170M currently. Seems, your first experiment was 16M. 
It was reasonable. I have no idea why you did not see differences between runs, it is simply impossible. > tput: 68MB/sec (compare to 86MB/sec) Sorry? But what did you measure earlier? 8) > So i would say that CPU utilization has improved incredibly, but > throughput has gone down. I have no idea, throughput grows here on similar hardware. And by the way, numbers are pretty similar, but inverted. 8) Alexey From owner-netdev@oss.sgi.com Sun Jan 14 15:02:43 2001 Received: by oss.sgi.com id ; Sun, 14 Jan 2001 15:02:23 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:39429 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 14 Jan 2001 15:02:03 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id RAA13304; Sun, 14 Jan 2001 17:06:00 -0700 Message-ID: <3A623EE8.644A572D@candelatech.com> Date: Sun, 14 Jan 2001 17:06:00 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: "netdev@oss.sgi.com" , linux-kernel Subject: Question on 2.2.18 and setting a device to PROMISC. Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 3292 Lines: 84

This code works in 2.4.0: (The important part is the dev_set_promiscuity() method.)

    int vlan_dev_set_mac_address(struct net_device *dev, void* addr_struct_p) {
       int i;
       struct sockaddr *addr = (struct sockaddr*)(addr_struct_p);

       if (netif_running(dev)) {
          return -EBUSY;
       }

       memcpy(dev->dev_addr, addr->sa_data, dev->addr_len);

       printk("%s: Setting MAC address to ", dev->name);
       for (i = 0; i < 6; i++) {
          printk(" %2.2x", dev->dev_addr[i]);
       }
       printk(".\n");

       if (memcmp(dev->vlan_dev->real_dev->dev_addr, dev->dev_addr, dev->addr_len) != 0) {
          if (dev->vlan_dev->real_dev->flags & IFF_PROMISC) {
             /* Already promiscious...leave it alone. */
             printk("VLAN (%s): Good, underlying device (%s) is already promiscious.\n",
                    dev->name, dev->vlan_dev->real_dev->name);
          }
          else {
             printk("VLAN (%s): Setting underlying device (%s) to promiscious mode.\n",
                    dev->name, dev->vlan_dev->real_dev->name);
             dev_set_promiscuity(dev->vlan_dev->real_dev, 1);
          }
       }
       else {
          printk("VLAN (%s): Underlying device (%s) has same MAC, not checking promiscious mode.\n",
                 dev->name, dev->vlan_dev->real_dev->name);
       }
       return 0;
    }

But this code in the 2.2.18 kernel does not work. Specifically, the dev_set_promiscuity method fails to actually make the interface promiscuous. Anyone know why?

    int vlan_dev_set_mac_address(struct device *dev, void* addr_struct_p) {
       int i;
       struct sockaddr *addr = (struct sockaddr*)(addr_struct_p);

       if (dev->start) {
          return -EBUSY;
       }

       memcpy(dev->dev_addr, addr->sa_data, dev->addr_len);

       printk("%s: Setting MAC address to ", dev->name);
       for (i = 0; i < 6; i++) {
          printk(" %2.2x", dev->dev_addr[i]);
       }
       printk(".\n");

       if (memcmp(dev->vlan_dev->real_dev->dev_addr, dev->dev_addr, dev->addr_len) != 0) {
          if (dev->vlan_dev->real_dev->flags & IFF_PROMISC) {
             /* Already promiscious...leave it alone. */
             printk("VLAN (%s): Good, underlying device (%s) is already promiscious.\n",
                    dev->name, dev->vlan_dev->real_dev->name);
          }
          else {
             printk("VLAN (%s): Setting underlying device (%s) to promiscious mode.\n",
                    dev->name, dev->vlan_dev->real_dev->name);
             dev_set_promiscuity(dev->vlan_dev->real_dev, 1);
          }
       }
       else {
          printk("VLAN (%s): Underlying device (%s) has same MAC, not checking promiscious mode.\n",
                 dev->name, dev->vlan_dev->real_dev->name);
       }
       return 0;
    }

-- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sun Jan 14 19:48:56 2001 Received: by oss.sgi.com id ; Sun, 14 Jan 2001 19:48:46 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:15785 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sun, 14 Jan 2001 19:48:29 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id WAA13609; Sun, 14 Jan 2001 22:47:47 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 14 Jan 2001 22:47:47 -0500 (EST) From: jamal To: cc: Subject: Re: Is sendfile all that sexy? In-Reply-To: <200101142029.XAA25591@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1321 Lines: 57 On Sun, 14 Jan 2001 kuznet@ms2.inr.ac.ru wrote: > Hello! > > > 256M. > > 8Mb file should be enough. > > At least, I do see real difference between the first and subsequent > runs with 32MB file and 512M machine. You do not and this is funny. Maybe it is because i keep sending the file over and over again for the duration of the 15 secs. > > > > I think the L2 cache on these machine is 512KB. So something slightly > > larger (600KB)? > > >2*L2 cache. Otherwise it is not invalidated.
> I have 512K L2 and 16K L1 and i made the buffer size 2M. No difference in terms of throughput; but CPU utilization went down to about 50% > > > How Large? 170M currently. > > Seems, your first experiment was 16M. It was reasonable. > I have no idea why you did not see differences between runs, > it is simply impossible. > I'll send you my hacked ttcp in another mail. Maybe i am screwing something ... > > > tput: 68MB/sec (compare to 86MB/sec) > > Sorry? But what did you measure earlier? 8) > sorry, earlier means in the case where no ZC patch > > So i would say that CPU utilization has improved incredibly, but > > throughput has gone down. > > I have no idea, throughput grows here on similar hardware. > And by the way, numbers are pretty similar, but inverted. 8) Ok, let me mail you the code. cheers, jamal From owner-netdev@oss.sgi.com Sun Jan 14 20:01:56 2001 Received: by oss.sgi.com id ; Sun, 14 Jan 2001 20:01:37 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:38663 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 14 Jan 2001 20:01:18 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id WAA17344; Sun, 14 Jan 2001 22:05:15 -0700 Message-ID: <3A62850B.B369C87C@candelatech.com> Date: Sun, 14 Jan 2001 22:05:15 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: VLAN Mailing List , "netdev@oss.sgi.com" Subject: 802.1Q VLAN kernel patch 1.0.0 released. Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1031 Lines: 26 Well, I've bumped the version number to 1.0.0 in anticipation of sending the patch off to linux-kernel for discussion and possible inclusion in kernel some day. 
http://scry.wanfear.com/~greear/vlan.html The fixes are fairly minor, but I did finally get MAC-settability to work like it should. When you set the MAC address, it now makes the underlying device into PROMISC mode, so that the VLAN can continue to receive packets. Hashed devices are disabled by default now, because some ppl have trouble with the 'lo' device (and I can't reproduce the problem, or figure out what the real problem is...) See net/core/dev.c to turn it back on and send me bug reports/patches!! Please try this out as soon as you can, and send me feedback..I plan on submitting the patch to LK next weekend if all goes well. Thanks, Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jan 15 10:18:44 2001 Received: by oss.sgi.com id ; Mon, 15 Jan 2001 10:18:34 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:63503 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Mon, 15 Jan 2001 10:18:21 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA12009; Mon, 15 Jan 2001 21:17:40 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101151817.VAA12009@ms2.inr.ac.ru> Subject: Re: Is sendfile all that sexy? To: hadi@cyberus.ca (jamal) Date: Mon, 15 Jan 2001 21:17:40 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: from "jamal" at Jan 14, 1 10:47:47 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 533 Lines: 17 Hello! > I'll send you my hacked ttcp in another mail. Maybe i am screwing > something ... I do not think that the problem is in ttcp. The number, which you showed are so crazy, that this cannot be explained by breakage there. > sorry, earlier means in the case where no ZC patch These numbers contradict to numbers, which you sent in previous report. 
Please, redo your test recording the conditions accurately. 8) Well, and check your kernel configuration. Probably you did something silly, like enabling netfilter. Alexey From owner-netdev@oss.sgi.com Mon Jan 15 11:29:25 2001 Received: by oss.sgi.com id ; Mon, 15 Jan 2001 11:29:05 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:57104 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Mon, 15 Jan 2001 11:28:56 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA12705; Mon, 15 Jan 2001 22:28:48 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101151928.WAA12705@ms2.inr.ac.ru> Subject: Re: routable interfaces To: hadi@cyberus.CA (jamal) Date: Mon, 15 Jan 2001 22:28:48 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: from "jamal" at Jan 8, 1 04:15:02 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1154 Lines: 26 Hello! > A vlan device could be a simple IP-alias. Sorry, Jamal you messed all the things completely. ifaddr is not a virtual device. "Alias" was virtual device, but this was bogus ill-defined device which created nothing but problems for routing, that's why we have no "aliases", but use plain ifaddrs. VLAN is true "virtual" device, using nice well-defined encapsulation, and from viewpoint of routing/addressing such device is not distinguishable of "physical" one. The question: "physical" or "virtual" is just meaningless from the viewpoint of routing and from any other viewpoint but packet scheduling and policing. What's about VLANs, they can be handled as separate virtual devices provided you have _couple_ of them. If the number is higher, they must be clustered as single nbma interface via framing (i.e. neighbour) level or via tags in routing tables. The same thing is with MPLS. That's why I strongly dislike the idea to create zillions of net_devices and consider that approach to VLANs as stupid one.
And this is reason why hashing device list (being great in principle) is not considered to be some really required feature. Alexey From owner-netdev@oss.sgi.com Mon Jan 15 12:12:05 2001 Received: by oss.sgi.com id ; Mon, 15 Jan 2001 12:11:57 -0800 Received: from ns.snowman.net ([63.80.4.34]:7692 "EHLO ns.snowman.net") by oss.sgi.com with ESMTP id ; Mon, 15 Jan 2001 12:11:40 -0800 Received: (from sfrost@localhost) by ns.snowman.net (8.9.3/8.9.3/Debian 8.9.3-21) id PAA29465; Mon, 15 Jan 2001 15:11:16 -0500 Date: Mon, 15 Jan 2001 15:11:16 -0500 From: Stephen Frost To: "David S. Miller" Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: Updated zerocopy patches on kernel.org Message-ID: <20010115151116.M26953@ns> Mail-Followup-To: "David S. Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com References: <200101110450.UAA02405@pizda.ninka.net> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="qNFB6cgt1ZLbnF/I" Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <200101110450.UAA02405@pizda.ninka.net>; from davem@redhat.com on Wed, Jan 10, 2001 at 08:50:50PM -0800 X-Editor: Vim http://www.vim.org/ X-Info: http://www.snowman.net X-Operating-System: Linux/2.2.16 (i686) X-Uptime: 3:04pm up 151 days, 18:50, 8 users, load average: 2.00, 2.00, 2.00 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 21799 Lines: 843 --qNFB6cgt1ZLbnF/I Content-Type: multipart/mixed; boundary="uM6pvXgO/RqlDXIz" Content-Disposition: inline --uM6pvXgO/RqlDXIz Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable * David S. 
Miller (davem@redhat.com) wrote: >=20 > Now against 2.4.1-pre2: >=20 > ftp.kernel.org:/pub/linux/kernel/people/davem/zerocopy-2.4.1p2-1.diff.gz Tried it with 2.4.1-pre3, didn't have any problem applying it, but when I rebooted the system it pretty much had no interest in talking TCP to anything. 2.4.1-pre3 alone has no problem. Should I give 2.4.1p2 a try with the zerocopy patch? I think that's my next step, but if it isn't likely to change anything.. The problem tended to be that a connection could reach the=20 'ESTABLISHED' point (in netstat), but then very little data would pass over the connection. ie; 'telnet somehost 25' would give me the SMTP HELO statement, but nothing I typed seemed to make it anywhere. Inbound and outbound ssh connections would reach 'ESTABLISHED', but then wouldn't make it to the point of prompting me for a password. Nothing apparent in any log files or anything. Info attached, more availible upon request. Stephen --uM6pvXgO/RqlDXIz Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="dmesg.out" Linux version 2.4.1-pre2 (root@gw2-snowman) (gcc version 2.95.3 20010111 (prerelease)) #1 SMP Mon Jan 15 12:59:07 EST 2001 BIOS-provided physical RAM map: BIOS-e820: 000000000009fc00 @ 0000000000000000 (usable) BIOS-e820: 0000000000000400 @ 000000000009fc00 (reserved) BIOS-e820: 0000000000010000 @ 00000000000f0000 (reserved) BIOS-e820: 000000000feec000 @ 0000000000100000 (usable) BIOS-e820: 0000000000003000 @ 000000000ffec000 (ACPI data) BIOS-e820: 0000000000010000 @ 000000000ffef000 (reserved) BIOS-e820: 0000000000001000 @ 000000000ffff000 (ACPI NVS) BIOS-e820: 0000000000010000 @ 00000000ffff0000 (reserved) Scan SMP from c0000000 for 1024 bytes. Scan SMP from c009fc00 for 1024 bytes. Scan SMP from c00f0000 for 65536 bytes. Scan SMP from c009fc00 for 4096 bytes. On node 0 totalpages: 65516 zone(0): 4096 pages. zone(1): 61420 pages. zone(2): 0 pages. 
mapped APIC to ffffe000 (01444000) Kernel command line: auto BOOT_IMAGE=Linux ro root=301 Initializing CPU#0 Detected 655.851 MHz processor. Console: colour VGA+ 80x25 Calibrating delay loop... 1307.44 BogoMIPS Memory: 255292k/262064k available (1062k kernel code, 6384k reserved, 429k data, 200k init, 0k highmem) Dentry-cache hash table entries: 32768 (order: 6, 262144 bytes) Buffer-cache hash table entries: 16384 (order: 4, 65536 bytes) Page-cache hash table entries: 65536 (order: 6, 262144 bytes) Inode-cache hash table entries: 16384 (order: 5, 131072 bytes) CPU: Before vendor init, caps: 0183f9ff c1c7f9ff 00000000, vendor = 2 CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 64K (64 bytes/line) CPU: After vendor init, caps: 0183f9ff c1c7f9ff 00000000 00000000 CPU: After generic, caps: 0183f9ff c1c7f9ff 00000000 00000000 CPU: Common caps: 0183f9ff c1c7f9ff 00000000 00000000 Enabling fast FPU save and restore... done. Checking 'hlt' instruction... OK. POSIX conformance testing by UNIFIX CPU: Before vendor init, caps: 0183f9ff c1c7f9ff 00000000, vendor = 2 CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 64K (64 bytes/line) CPU: After vendor init, caps: 0183f9ff c1c7f9ff 00000000 00000000 CPU: After generic, caps: 0183f9ff c1c7f9ff 00000000 00000000 CPU: Common caps: 0183f9ff c1c7f9ff 00000000 00000000 CPU0: AMD Athlon(tm) Processor stepping 00 per-CPU timeslice cutoff: 182.95 usecs. SMP motherboard not detected. Using dummy APIC emulation. Setting commenced=1, go go go PCI: PCI BIOS revision 2.10 entry at 0xf1010, last bus=1 PCI: Using configuration type 1 PCI: Probing PCI hardware Unknown bridge resource 0: assuming transparent Unknown bridge resource 1: assuming transparent Unknown bridge resource 2: assuming transparent PCI: Using IRQ router VIA [1106/0686] at 00:04.0 isapnp: Scanning for Pnp cards... 
isapnp: No Plug & Play device found Linux NET4.0 for Linux 2.4 Based upon Swansea University Computer Society NET3.039 Initializing RT netlink socket DMI 2.3 present. 49 structures occupying 1371 bytes. DMI table at 0x000F27D0. BIOS Vendor: Award Software, Inc. BIOS Version: ASUS A7V ACPI BIOS Revision 1003 BIOS Release: 07/21/2000 System Vendor: System Manufacturer. Product Name: System Name. Version System Version. Serial Number SYS-1234567890. Board Vendor: ASUSTeK Computer INC.. Board Name: . Board Version: REV 1.xx. Asset Tag: Asset-1234567890. Starting kswapd v1.8 pty: 256 Unix98 ptys configured RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize loop: enabling 8 loop devices Uniform Multi-Platform E-IDE driver Revision: 6.31 ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx VP_IDE: IDE controller on PCI bus 00 dev 21 VP_IDE: chipset revision 16 VP_IDE: not 100% native mode: will probe irqs later ide0: BM-DMA at 0xd800-0xd807, BIOS settings: hda:DMA, hdb:pio ide1: BM-DMA at 0xd808-0xd80f, BIOS settings: hdc:pio, hdd:pio PDC20265: IDE controller on PCI bus 00 dev 88 PCI: Found IRQ 10 for device 00:11.0 PDC20265: chipset revision 2 PDC20265: not 100% native mode: will probe irqs later ide2: BM-DMA at 0x8800-0x8807, BIOS settings: hde:pio, hdf:pio ide3: BM-DMA at 0x8808-0x880f, BIOS settings: hdg:pio, hdh:pio hda: FUJITSU M1638TAU, ATA DISK drive ide0 at 0x1f0-0x1f7,0x3f6 on irq 14 hda: 5023680 sectors (2572 MB) w/128KiB Cache, CHS=622/128/63, DMA Partition check: hda: hda1 hda2 hda3 hda4 < hda5 hda6 > Floppy drive(s): fd0 is 1.44M FDC 0 is a post-1991 82077 Serial driver version 5.02 (2000-08-09) with MANY_PORTS SHARE_IRQ SERIAL_PCI ISAPNP enabled ttyS00 at 0x03f8 (irq = 4) is a 16550A ttyS01 at 0x02f8 (irq = 3) is a 16550A Real Time Clock Driver v1.10d PCI: Found IRQ 5 for device 00:0a.0 3c59x.c:LK1.1.12 06 Jan 2000 Donald Becker and others. 
http://www.scyld.com/network/vortex.html $Revision: 1.102.2.46 $ See Documentation/networking/vortex.txt eth0: 3Com PCI 3c905B Cyclone 100baseTx at 0xa400, 00:50:04:63:b2:e1, IRQ 5 product code 'TN' rev 00.9 date 02-06-99 8K byte-wide RAM 5:3 Rx:Tx split, 10baseT interface. Enabling bus-master transmits and whole-frame receives. PPP generic driver version 2.4.1 mice: PS/2 mouse device common for all mice NET4: Linux TCP/IP 1.0 for NET4.0 IP Protocols: ICMP, UDP, TCP, IGMP IP: routing cache hash table of 2048 buckets, 16Kbytes TCP: Hash tables configured (established 16384 bind 16384) ip_conntrack (2047 buckets, 16376 max) ip_tables: (c)2000 Netfilter core team NET4: Unix domain sockets 1.0/SMP for Linux NET4.0. VFS: Mounted root (ext2 filesystem) readonly. Freeing unused kernel memory: 200k freed Adding Swap: 249976k swap-space (priority -1) eth0: using default media 10baseT --uM6pvXgO/RqlDXIz Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename=config-gw2 Content-Transfer-Encoding: quoted-printable # # Automatically generated make config: don't edit # CONFIG_X86=3Dy CONFIG_ISA=3Dy # CONFIG_SBUS is not set CONFIG_UID16=3Dy # # Code maturity level options # CONFIG_EXPERIMENTAL=3Dy # # Loadable module support # CONFIG_MODULES=3Dy CONFIG_MODVERSIONS=3Dy CONFIG_KMOD=3Dy # # Processor type and features # # CONFIG_M386 is not set # CONFIG_M486 is not set # CONFIG_M586 is not set CONFIG_M586TSC=3Dy # CONFIG_M586MMX is not set # CONFIG_M686 is not set # CONFIG_MPENTIUMIII is not set # CONFIG_MPENTIUM4 is not set # CONFIG_MK6 is not set # CONFIG_MK7 is not set # CONFIG_MCRUSOE is not set # CONFIG_MWINCHIPC6 is not set # CONFIG_MWINCHIP2 is not set # CONFIG_MWINCHIP3D is not set CONFIG_X86_WP_WORKS_OK=3Dy CONFIG_X86_INVLPG=3Dy CONFIG_X86_CMPXCHG=3Dy CONFIG_X86_BSWAP=3Dy CONFIG_X86_POPAD_OK=3Dy CONFIG_X86_L1_CACHE_SHIFT=3D5 CONFIG_X86_USE_STRING_486=3Dy CONFIG_X86_ALIGNMENT_16=3Dy CONFIG_X86_TSC=3Dy # CONFIG_TOSHIBA is not set # 
CONFIG_MICROCODE is not set # CONFIG_X86_MSR is not set # CONFIG_X86_CPUID is not set CONFIG_NOHIGHMEM=3Dy # CONFIG_HIGHMEM4G is not set # CONFIG_HIGHMEM64G is not set # CONFIG_MATH_EMULATION is not set # CONFIG_MTRR is not set CONFIG_SMP=3Dy CONFIG_HAVE_DEC_LOCK=3Dy # # General setup # CONFIG_NET=3Dy # CONFIG_VISWS is not set CONFIG_X86_IO_APIC=3Dy CONFIG_X86_LOCAL_APIC=3Dy CONFIG_PCI=3Dy # CONFIG_PCI_GOBIOS is not set # CONFIG_PCI_GODIRECT is not set CONFIG_PCI_GOANY=3Dy CONFIG_PCI_BIOS=3Dy CONFIG_PCI_DIRECT=3Dy CONFIG_PCI_NAMES=3Dy # CONFIG_EISA is not set # CONFIG_MCA is not set CONFIG_HOTPLUG=3Dy # # PCMCIA/CardBus support # # CONFIG_PCMCIA is not set CONFIG_SYSVIPC=3Dy CONFIG_BSD_PROCESS_ACCT=3Dy CONFIG_SYSCTL=3Dy CONFIG_KCORE_ELF=3Dy # CONFIG_KCORE_AOUT is not set CONFIG_BINFMT_AOUT=3Dy CONFIG_BINFMT_ELF=3Dy CONFIG_BINFMT_MISC=3Dy # CONFIG_PM is not set # CONFIG_ACPI is not set # CONFIG_APM is not set # # Memory Technology Devices (MTD) # # CONFIG_MTD is not set # # Parallel port support # CONFIG_PARPORT=3Dm CONFIG_PARPORT_PC=3Dm CONFIG_PARPORT_PC_FIFO=3Dy # CONFIG_PARPORT_PC_SUPERIO is not set # CONFIG_PARPORT_AMIGA is not set # CONFIG_PARPORT_MFC3 is not set # CONFIG_PARPORT_ATARI is not set # CONFIG_PARPORT_SUNBPP is not set # CONFIG_PARPORT_OTHER is not set # CONFIG_PARPORT_1284 is not set # # Plug and Play configuration # CONFIG_PNP=3Dy CONFIG_ISAPNP=3Dy # # Block devices # CONFIG_BLK_DEV_FD=3Dy # CONFIG_BLK_DEV_XD is not set # CONFIG_PARIDE is not set # CONFIG_BLK_CPQ_DA is not set # CONFIG_BLK_CPQ_CISS_DA is not set # CONFIG_BLK_DEV_DAC960 is not set CONFIG_BLK_DEV_LOOP=3Dy CONFIG_BLK_DEV_NBD=3Dy CONFIG_BLK_DEV_RAM=3Dy CONFIG_BLK_DEV_RAM_SIZE=3D4096 # CONFIG_BLK_DEV_INITRD is not set # # Multi-device support (RAID and LVM) # # CONFIG_MD is not set # CONFIG_BLK_DEV_MD is not set # CONFIG_MD_LINEAR is not set # CONFIG_MD_RAID0 is not set # CONFIG_MD_RAID1 is not set # CONFIG_MD_RAID5 is not set # CONFIG_BLK_DEV_LVM is not set # CONFIG_LVM_PROC_FS is not 
set # # Networking options # CONFIG_PACKET=3Dy # CONFIG_PACKET_MMAP is not set CONFIG_NETLINK=3Dy CONFIG_RTNETLINK=3Dy # CONFIG_NETLINK_DEV is not set CONFIG_NETFILTER=3Dy # CONFIG_NETFILTER_DEBUG is not set # CONFIG_FILTER is not set CONFIG_UNIX=3Dy CONFIG_INET=3Dy CONFIG_IP_MULTICAST=3Dy CONFIG_IP_ADVANCED_ROUTER=3Dy CONFIG_RTNETLINK=3Dy CONFIG_NETLINK=3Dy CONFIG_IP_MULTIPLE_TABLES=3Dy CONFIG_IP_ROUTE_FWMARK=3Dy CONFIG_IP_ROUTE_NAT=3Dy CONFIG_IP_ROUTE_MULTIPATH=3Dy CONFIG_IP_ROUTE_TOS=3Dy CONFIG_IP_ROUTE_VERBOSE=3Dy CONFIG_IP_ROUTE_LARGE_TABLES=3Dy # CONFIG_IP_PNP is not set # CONFIG_NET_IPIP is not set # CONFIG_NET_IPGRE is not set # CONFIG_IP_MROUTE is not set # CONFIG_ARPD is not set # CONFIG_INET_ECN is not set # CONFIG_SYN_COOKIES is not set # # IP: Netfilter Configuration # CONFIG_IP_NF_CONNTRACK=3Dy CONFIG_IP_NF_FTP=3Dy # CONFIG_IP_NF_QUEUE is not set CONFIG_IP_NF_IPTABLES=3Dy CONFIG_IP_NF_MATCH_LIMIT=3Dy CONFIG_IP_NF_MATCH_MAC=3Dy CONFIG_IP_NF_MATCH_MARK=3Dy CONFIG_IP_NF_MATCH_MULTIPORT=3Dy CONFIG_IP_NF_MATCH_TOS=3Dy CONFIG_IP_NF_MATCH_STATE=3Dy # CONFIG_IP_NF_MATCH_UNCLEAN is not set # CONFIG_IP_NF_MATCH_OWNER is not set CONFIG_IP_NF_FILTER=3Dy CONFIG_IP_NF_TARGET_REJECT=3Dy # CONFIG_IP_NF_TARGET_MIRROR is not set CONFIG_IP_NF_NAT=3Dy CONFIG_IP_NF_NAT_NEEDED=3Dy CONFIG_IP_NF_TARGET_MASQUERADE=3Dy CONFIG_IP_NF_TARGET_REDIRECT=3Dy CONFIG_IP_NF_MANGLE=3Dy CONFIG_IP_NF_TARGET_TOS=3Dy CONFIG_IP_NF_TARGET_MARK=3Dy CONFIG_IP_NF_TARGET_LOG=3Dy # CONFIG_IPV6 is not set CONFIG_KHTTPD=3Dm # CONFIG_ATM is not set # # =20 # # CONFIG_IPX is not set # CONFIG_ATALK is not set # CONFIG_DECNET is not set # CONFIG_BRIDGE is not set # CONFIG_X25 is not set # CONFIG_LAPB is not set # CONFIG_LLC is not set # CONFIG_NET_DIVERT is not set # CONFIG_ECONET is not set # CONFIG_WAN_ROUTER is not set # CONFIG_NET_FASTROUTE is not set # CONFIG_NET_HW_FLOWCONTROL is not set # # QoS and/or fair queueing # # CONFIG_NET_SCHED is not set # # Telephony Support # # CONFIG_PHONE is not set # 
CONFIG_PHONE_IXJ is not set # # ATA/IDE/MFM/RLL support # CONFIG_IDE=3Dy # # IDE, ATA and ATAPI Block devices # CONFIG_BLK_DEV_IDE=3Dy # # Please see Documentation/ide.txt for help/info on IDE drives # # CONFIG_BLK_DEV_HD_IDE is not set # CONFIG_BLK_DEV_HD is not set CONFIG_BLK_DEV_IDEDISK=3Dy # CONFIG_IDEDISK_MULTI_MODE is not set # CONFIG_BLK_DEV_IDEDISK_VENDOR is not set # CONFIG_BLK_DEV_IDEDISK_FUJITSU is not set # CONFIG_BLK_DEV_IDEDISK_IBM is not set # CONFIG_BLK_DEV_IDEDISK_MAXTOR is not set # CONFIG_BLK_DEV_IDEDISK_QUANTUM is not set # CONFIG_BLK_DEV_IDEDISK_SEAGATE is not set # CONFIG_BLK_DEV_IDEDISK_WD is not set # CONFIG_BLK_DEV_COMMERIAL is not set # CONFIG_BLK_DEV_TIVO is not set # CONFIG_BLK_DEV_IDECS is not set CONFIG_BLK_DEV_IDECD=3Dy CONFIG_BLK_DEV_IDETAPE=3Dm CONFIG_BLK_DEV_IDEFLOPPY=3Dm # CONFIG_BLK_DEV_IDESCSI is not set # # IDE chipset support/bugfixes # # CONFIG_BLK_DEV_CMD640 is not set # CONFIG_BLK_DEV_CMD640_ENHANCED is not set # CONFIG_BLK_DEV_ISAPNP is not set # CONFIG_BLK_DEV_RZ1000 is not set CONFIG_BLK_DEV_IDEPCI=3Dy CONFIG_IDEPCI_SHARE_IRQ=3Dy CONFIG_BLK_DEV_IDEDMA_PCI=3Dy # CONFIG_BLK_DEV_OFFBOARD is not set CONFIG_IDEDMA_PCI_AUTO=3Dy CONFIG_BLK_DEV_IDEDMA=3Dy # CONFIG_IDEDMA_PCI_WIP is not set # CONFIG_IDEDMA_NEW_DRIVE_LISTINGS is not set # CONFIG_BLK_DEV_AEC62XX is not set # CONFIG_AEC62XX_TUNING is not set # CONFIG_BLK_DEV_ALI15X3 is not set # CONFIG_WDC_ALI15X3 is not set # CONFIG_BLK_DEV_AMD7409 is not set # CONFIG_AMD7409_OVERRIDE is not set # CONFIG_BLK_DEV_CMD64X is not set # CONFIG_BLK_DEV_CY82C693 is not set # CONFIG_BLK_DEV_CS5530 is not set # CONFIG_BLK_DEV_HPT34X is not set # CONFIG_HPT34X_AUTODMA is not set # CONFIG_BLK_DEV_HPT366 is not set # CONFIG_BLK_DEV_PIIX is not set # CONFIG_PIIX_TUNING is not set # CONFIG_BLK_DEV_NS87415 is not set # CONFIG_BLK_DEV_OPTI621 is not set # CONFIG_BLK_DEV_PDC202XX is not set # CONFIG_PDC202XX_BURST is not set # CONFIG_BLK_DEV_OSB4 is not set # CONFIG_BLK_DEV_SIS5513 is not set # 
CONFIG_BLK_DEV_SLC90E66 is not set # CONFIG_BLK_DEV_TRM290 is not set # CONFIG_BLK_DEV_VIA82CXXX is not set # CONFIG_IDE_CHIPSETS is not set CONFIG_IDEDMA_AUTO=3Dy # CONFIG_IDEDMA_IVB is not set # CONFIG_DMA_NONPCI is not set # CONFIG_BLK_DEV_IDE_MODES is not set # # SCSI support # # CONFIG_SCSI is not set # # IEEE 1394 (FireWire) support # # CONFIG_IEEE1394 is not set # # I2O device support # # CONFIG_I2O is not set # CONFIG_I2O_PCI is not set # CONFIG_I2O_BLOCK is not set # CONFIG_I2O_LAN is not set # CONFIG_I2O_SCSI is not set # CONFIG_I2O_PROC is not set # # Network device support # CONFIG_NETDEVICES=3Dy # # ARCnet devices # # CONFIG_ARCNET is not set CONFIG_DUMMY=3Dm # CONFIG_BONDING is not set # CONFIG_EQUALIZER is not set # CONFIG_TUN is not set # CONFIG_ETHERTAP is not set # CONFIG_NET_SB1000 is not set # # Ethernet (10 or 100Mbit) # CONFIG_NET_ETHERNET=3Dy CONFIG_NET_VENDOR_3COM=3Dy # CONFIG_EL1 is not set # CONFIG_EL2 is not set # CONFIG_ELPLUS is not set # CONFIG_EL16 is not set # CONFIG_EL3 is not set # CONFIG_3C515 is not set # CONFIG_ELMC is not set # CONFIG_ELMC_II is not set CONFIG_VORTEX=3Dy # CONFIG_LANCE is not set # CONFIG_NET_VENDOR_SMC is not set # CONFIG_NET_VENDOR_RACAL is not set # CONFIG_AT1700 is not set # CONFIG_DEPCA is not set # CONFIG_HP100 is not set # CONFIG_NET_ISA is not set CONFIG_NET_PCI=3Dy # CONFIG_PCNET32 is not set CONFIG_ADAPTEC_STARFIRE=3Dy # CONFIG_AC3200 is not set # CONFIG_APRICOT is not set # CONFIG_CS89x0 is not set # CONFIG_TULIP is not set # CONFIG_DE4X5 is not set # CONFIG_DGRS is not set # CONFIG_DM9102 is not set # CONFIG_EEPRO100 is not set # CONFIG_EEPRO100_PM is not set # CONFIG_LNE390 is not set # CONFIG_NATSEMI is not set # CONFIG_NE2K_PCI is not set # CONFIG_NE3210 is not set # CONFIG_ES3210 is not set # CONFIG_8139TOO is not set # CONFIG_RTL8129 is not set # CONFIG_SIS900 is not set # CONFIG_EPIC100 is not set # CONFIG_SUNDANCE is not set # CONFIG_TLAN is not set # CONFIG_VIA_RHINE is not set # 
CONFIG_WINBOND_840 is not set # CONFIG_HAPPYMEAL is not set # CONFIG_NET_POCKET is not set # # Ethernet (1000 Mbit) # # CONFIG_ACENIC is not set # CONFIG_HAMACHI is not set # CONFIG_YELLOWFIN is not set # CONFIG_SK98LIN is not set # CONFIG_FDDI is not set # CONFIG_HIPPI is not set # CONFIG_PLIP is not set CONFIG_PPP=3Dy CONFIG_PPP_MULTILINK=3Dy CONFIG_PPP_ASYNC=3Dm CONFIG_PPP_SYNC_TTY=3Dm CONFIG_PPP_DEFLATE=3Dm CONFIG_PPP_BSDCOMP=3Dm CONFIG_PPPOE=3Dm # CONFIG_SLIP is not set # # Wireless LAN (non-hamradio) # # CONFIG_NET_RADIO is not set # # Token Ring devices # # CONFIG_TR is not set # CONFIG_NET_FC is not set # CONFIG_RCPCI is not set # CONFIG_SHAPER is not set # # Wan interfaces # # CONFIG_WAN is not set # # Amateur Radio support # # CONFIG_HAMRADIO is not set # # IrDA (infrared) support # # CONFIG_IRDA is not set # # ISDN subsystem # # CONFIG_ISDN is not set # # Old CD-ROM drivers (not SCSI, not IDE) # # CONFIG_CD_NO_IDESCSI is not set # # Input core support # CONFIG_INPUT=3Dy CONFIG_INPUT_KEYBDEV=3Dy CONFIG_INPUT_MOUSEDEV=3Dy CONFIG_INPUT_MOUSEDEV_SCREEN_X=3D1024 CONFIG_INPUT_MOUSEDEV_SCREEN_Y=3D768 # CONFIG_INPUT_JOYDEV is not set # CONFIG_INPUT_EVDEV is not set # # Character devices # CONFIG_VT=3Dy CONFIG_VT_CONSOLE=3Dy CONFIG_SERIAL=3Dy CONFIG_SERIAL_CONSOLE=3Dy # CONFIG_SERIAL_EXTENDED is not set # CONFIG_SERIAL_NONSTANDARD is not set CONFIG_UNIX98_PTYS=3Dy CONFIG_UNIX98_PTY_COUNT=3D256 CONFIG_PRINTER=3Dm CONFIG_LP_CONSOLE=3Dy # CONFIG_PPDEV is not set # # I2C support # # CONFIG_I2C is not set # # Mice # # CONFIG_BUSMOUSE is not set CONFIG_MOUSE=3Dy CONFIG_PSMOUSE=3Dy # CONFIG_82C710_MOUSE is not set # CONFIG_PC110_PAD is not set # # Joysticks # # CONFIG_JOYSTICK is not set # # Input core support is needed for joysticks # # CONFIG_QIC02_TAPE is not set # # Watchdog Cards # # CONFIG_WATCHDOG is not set # CONFIG_INTEL_RNG is not set # CONFIG_NVRAM is not set CONFIG_RTC=3Dy # CONFIG_DTLK is not set # CONFIG_R3964 is not set # CONFIG_APPLICOM is not set # # 
Ftape, the floppy tape device driver # # CONFIG_FTAPE is not set # CONFIG_AGP is not set # CONFIG_DRM is not set # # Multimedia devices # # CONFIG_VIDEO_DEV is not set # # File systems # # CONFIG_QUOTA is not set # CONFIG_AUTOFS_FS is not set CONFIG_AUTOFS4_FS=3Dy # CONFIG_ADFS_FS is not set # CONFIG_ADFS_FS_RW is not set # CONFIG_AFFS_FS is not set # CONFIG_HFS_FS is not set # CONFIG_BFS_FS is not set CONFIG_FAT_FS=3Dm CONFIG_MSDOS_FS=3Dm CONFIG_UMSDOS_FS=3Dm CONFIG_VFAT_FS=3Dm # CONFIG_EFS_FS is not set # CONFIG_JFFS_FS is not set # CONFIG_CRAMFS is not set # CONFIG_RAMFS is not set CONFIG_ISO9660_FS=3Dy # CONFIG_JOLIET is not set # CONFIG_MINIX_FS is not set # CONFIG_NTFS_FS is not set # CONFIG_NTFS_RW is not set # CONFIG_HPFS_FS is not set CONFIG_PROC_FS=3Dy # CONFIG_DEVFS_FS is not set # CONFIG_DEVFS_MOUNT is not set # CONFIG_DEVFS_DEBUG is not set CONFIG_DEVPTS_FS=3Dy # CONFIG_QNX4FS_FS is not set # CONFIG_QNX4FS_RW is not set # CONFIG_ROMFS_FS is not set CONFIG_EXT2_FS=3Dy # CONFIG_SYSV_FS is not set # CONFIG_SYSV_FS_WRITE is not set # CONFIG_UDF_FS is not set # CONFIG_UDF_RW is not set # CONFIG_UFS_FS is not set # CONFIG_UFS_FS_WRITE is not set # # Network File Systems # # CONFIG_CODA_FS is not set CONFIG_NFS_FS=3Dy # CONFIG_NFS_V3 is not set # CONFIG_ROOT_NFS is not set # CONFIG_NFSD is not set # CONFIG_NFSD_V3 is not set CONFIG_SUNRPC=3Dy CONFIG_LOCKD=3Dy # CONFIG_SMB_FS is not set # CONFIG_NCP_FS is not set # CONFIG_NCPFS_PACKET_SIGNING is not set # CONFIG_NCPFS_IOCTL_LOCKING is not set # CONFIG_NCPFS_STRONG is not set # CONFIG_NCPFS_NFS_NS is not set # CONFIG_NCPFS_OS2_NS is not set # CONFIG_NCPFS_SMALLDOS is not set # CONFIG_NCPFS_NLS is not set # CONFIG_NCPFS_EXTRAS is not set # # Partition Types # # CONFIG_PARTITION_ADVANCED is not set CONFIG_MSDOS_PARTITION=3Dy # CONFIG_SMB_NLS is not set CONFIG_NLS=3Dy # # Native Language Support # CONFIG_NLS_DEFAULT=3D"iso8859-1" CONFIG_NLS_CODEPAGE_437=3Dy # CONFIG_NLS_CODEPAGE_737 is not set # 
CONFIG_NLS_CODEPAGE_775 is not set # CONFIG_NLS_CODEPAGE_850 is not set # CONFIG_NLS_CODEPAGE_852 is not set # CONFIG_NLS_CODEPAGE_855 is not set # CONFIG_NLS_CODEPAGE_857 is not set # CONFIG_NLS_CODEPAGE_860 is not set # CONFIG_NLS_CODEPAGE_861 is not set # CONFIG_NLS_CODEPAGE_862 is not set # CONFIG_NLS_CODEPAGE_863 is not set # CONFIG_NLS_CODEPAGE_864 is not set # CONFIG_NLS_CODEPAGE_865 is not set # CONFIG_NLS_CODEPAGE_866 is not set # CONFIG_NLS_CODEPAGE_869 is not set # CONFIG_NLS_CODEPAGE_874 is not set # CONFIG_NLS_CODEPAGE_932 is not set # CONFIG_NLS_CODEPAGE_936 is not set # CONFIG_NLS_CODEPAGE_949 is not set # CONFIG_NLS_CODEPAGE_950 is not set CONFIG_NLS_ISO8859_1=3Dy # CONFIG_NLS_ISO8859_2 is not set # CONFIG_NLS_ISO8859_3 is not set # CONFIG_NLS_ISO8859_4 is not set # CONFIG_NLS_ISO8859_5 is not set # CONFIG_NLS_ISO8859_6 is not set # CONFIG_NLS_ISO8859_7 is not set # CONFIG_NLS_ISO8859_8 is not set # CONFIG_NLS_ISO8859_9 is not set # CONFIG_NLS_ISO8859_14 is not set # CONFIG_NLS_ISO8859_15 is not set # CONFIG_NLS_KOI8_R is not set # CONFIG_NLS_UTF8 is not set # # Console drivers # CONFIG_VGA_CONSOLE=3Dy CONFIG_VIDEO_SELECT=3Dy # CONFIG_MDA_CONSOLE is not set # # Frame-buffer support # # CONFIG_FB is not set # # Sound # # CONFIG_SOUND is not set # # USB support # # CONFIG_USB is not set # # Kernel hacking # CONFIG_MAGIC_SYSRQ=3Dy --uM6pvXgO/RqlDXIz-- --qNFB6cgt1ZLbnF/I Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.4 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQE6Y1lkrzgMPqB3kigRAh8RAJ9lkryakpdrKjJhp/DzwKrjUlK+0ACfagmP tyQnrwsNOicufTcJ3cQAGxY= =bo4p -----END PGP SIGNATURE----- --qNFB6cgt1ZLbnF/I-- From owner-netdev@oss.sgi.com Mon Jan 15 12:20:35 2001 Received: by oss.sgi.com id ; Mon, 15 Jan 2001 12:20:26 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:41623 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Mon, 15 Jan 2001 12:20:04 
-0800 Received: from fred.muc.de (noidentity@ns1080.munich.netsurf.de [195.180.235.80]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id VAA01925; Mon, 15 Jan 2001 21:19:43 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id F0C24E3911; Mon, 15 Jan 2001 21:25:56 +0100 (CET) Date: Mon, 15 Jan 2001 21:25:56 +0100 From: Andi Kleen To: kuznet@ms2.inr.ac.ru Cc: jamal , netdev@oss.sgi.com Subject: Re: routable interfaces Message-ID: <20010115212556.A28379@fred.local> References: <200101151928.WAA12705@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <200101151928.WAA12705@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Mon, Jan 15, 2001 at 08:30:12PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 851 Lines: 20 On Mon, Jan 15, 2001 at 08:30:12PM +0100, A.N.Kuznetsov wrote: > What's about VLANs, they can be handled as separate virtual devices > provided you have _couple_ of them. It the number is higher, they must > be clustered as single nbma interface via framing (i.e neighbour) level > or via tags in routing tables. The same thing is with MPLs. > That's why I strongly dislike the idea to create zillions of net_devices > and consider that approach to VLANs as stupid one. And this is reason > why hashing device list (being great in principle) is not considered > to be some really required feature. Is there any evidence that people really want to use hundreds of VLANs on a single box in practice? If not (which looks likely) just using net_devices is fine I guess and keep it as simple as possible. -Andi -- This is like TV. I don't like TV. 
From owner-netdev@oss.sgi.com Mon Jan 15 12:26:05 2001 Received: by oss.sgi.com id ; Mon, 15 Jan 2001 12:25:55 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:32273 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Mon, 15 Jan 2001 12:25:41 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA13219; Mon, 15 Jan 2001 23:25:21 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101152025.XAA13219@ms2.inr.ac.ru> Subject: Re: routable interfaces To: ak@muc.de (Andi Kleen) Date: Mon, 15 Jan 2001 23:25:21 +0300 (MSK) Cc: hadi@cyberus.CA, netdev@oss.sgi.com In-Reply-To: <20010115212556.A28379@fred.local> from "Andi Kleen" at Jan 15, 1 09:25:56 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 249 Lines: 9 Hello! > Is there any evidence that people really want to use hundreds of VLANs on a > single box in practice? People _have_ to use hundreds of them simply because existing VLAN software does not allow doing this in a more clever fashion.
Alexey From owner-netdev@oss.sgi.com Mon Jan 15 14:12:48 2001 Received: by oss.sgi.com id ; Mon, 15 Jan 2001 14:12:37 -0800 Received: from relay1.smtp.psi.ca ([154.11.136.226]:58553 "EHLO relay1.smtp.psi.ca") by oss.sgi.com with ESMTP id ; Mon, 15 Jan 2001 14:12:15 -0800 Received: from [205.250.170.202] (helo=netvipswitch.netcorp.qc.ca) by relay1.smtp.psi.ca with esmtp (Exim 3.13 #3) id 14IHrA-0001wG-00; Mon, 15 Jan 2001 17:12:04 -0500 From: Francois Desloges To: Andi Kleen , kuznet@ms2.inr.ac.ru Subject: Re: routable interfaces Date: Mon, 15 Jan 2001 16:43:35 -0500 Content-Type: text/plain Cc: jamal , netdev@oss.sgi.com References: <200101151928.WAA12705@ms2.inr.ac.ru> <20010115212556.A28379@fred.local> In-Reply-To: <20010115212556.A28379@fred.local> MIME-Version: 1.0 Message-Id: <01011516573300.01106@dual.vip.ca> Content-Transfer-Encoding: 8bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1543 Lines: 39 On Mon, 15 Jan 2001, Andi Kleen wrote: > On Mon, Jan 15, 2001 at 08:30:12PM +0100, A.N.Kuznetsov wrote: > > What's about VLANs, they can be handled as separate virtual devices > > provided you have _couple_ of them. It the number is higher, they must > > be clustered as single nbma interface via framing (i.e neighbour) level > > or via tags in routing tables. The same thing is with MPLs. > > That's why I strongly dislike the idea to create zillions of net_devices > > and consider that approach to VLANs as stupid one. And this is reason > > why hashing device list (being great in principle) is not considered > > to be some really required feature. > > Is there any evidence that people really want to use hundreds of VLANs on a > single box in practice? > We plan to use hundreds if not thousands, on our 2,5 Tbps, linux controlled, MAN switch-router. 
We may especially use VIDs as Client (or service) identifiers, for billing purposes, once they have been aggregated on a single link past the very first (typically only L2 switched) hop. Such machines will be used in Metropoles to connect a _lot_ of people. And I know that there are other hardware startups that plan to use Linux on their high speed platforms as well (we're not alone :-) > If not (which looks likely) just using net_devices is fine I guess and > keep it as simple as possible. > I can't wait to find the time to figure that one out.. > -Andi > -- > This is like TV. I don't like TV. Hey! I don't like TV either :-) -- François Desloges fd@vipswitch.com From owner-netdev@oss.sgi.com Mon Jan 15 17:37:08 2001 Received: by oss.sgi.com id ; Mon, 15 Jan 2001 17:36:58 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:54798 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 15 Jan 2001 17:36:35 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id TAA07750; Mon, 15 Jan 2001 19:39:49 -0700 Message-ID: <3A63B475.950D824D@candelatech.com> Date: Mon, 15 Jan 2001 19:39:49 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: kuznet@ms2.inr.ac.ru CC: jamal , netdev@oss.sgi.com Subject: Re: routable interfaces References: <200101151928.WAA12705@ms2.inr.ac.ru> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1672 Lines: 43 kuznet@ms2.inr.ac.ru wrote: > > Hello! > > > A vlan device could be a simple IP-alias. > > Sorry, Jamal you messed all the things completely. > > ifaddr is not a virtual device.
"Alias" was virtual device, but > this was bogus ill-defined device which created nothing but problems for routing, > that's why we have no "aliases", but use plain ifaddrs. > > VLAN is true "virtual" device, using nice well-defined encapsulation, > and from viewpoint of routing/addressing such device is not distinguishable > of "physical" one. The question: "physical" or "virtual" is just meaningless > from the viewpoint of routing and from any other viewpoint but packet > scheduling and policing. I agree with that. > > What's about VLANs, they can be handled as separate virtual devices > provided you have _couple_ of them. It the number is higher, they must > be clustered as single nbma interface via framing (i.e neighbour) level > or via tags in routing tables. Why must they be clustered (and what is an NBMA interface)? Is it just because of manageability of the system, or are there performance reasons too? If there are performance reasons, please elaborate. > The same thing is with MPLs. > That's why I strongly dislike the idea to create zillions of net_devices > and consider that approach to VLANs as stupid one. And this is reason > why hashing device list (being great in principle) is not considered > to be some really required feature.
> > Alexey -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jan 15 17:39:28 2001 Received: by oss.sgi.com id ; Mon, 15 Jan 2001 17:39:09 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:57102 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 15 Jan 2001 17:39:03 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id TAA07807; Mon, 15 Jan 2001 19:42:27 -0700 Message-ID: <3A63B513.896DDFEA@candelatech.com> Date: Mon, 15 Jan 2001 19:42:27 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: kuznet@ms2.inr.ac.ru CC: Andi Kleen , hadi@cyberus.CA, netdev@oss.sgi.com Subject: Re: routable interfaces References: <200101152025.XAA13219@ms2.inr.ac.ru> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 779 Lines: 24 kuznet@ms2.inr.ac.ru wrote: > > Hello! > > > Is there any evidence that people really want to use hundreds of VLANs on a > > single box in practice? > > People _have_ to use hundreds of them simply because existing > VLAN software does not allow to do this in more clever fashion. > > Alexey If you want 500 routable, virtual LAN interfaces, then it would seem to me that you really want 500 VLAN interfaces as we have implemented them. Can you give an example of a need for 500 virtual LANs, and an alternative implementation that pleases you more? 
Thanks, Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jan 15 17:54:58 2001 Received: by oss.sgi.com id ; Mon, 15 Jan 2001 17:54:49 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:65294 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 15 Jan 2001 17:54:25 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id TAA08097; Mon, 15 Jan 2001 19:57:48 -0700 Message-ID: <3A63B8AC.CC035795@candelatech.com> Date: Mon, 15 Jan 2001 19:57:48 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: Francois Desloges CC: Andi Kleen , kuznet@ms2.inr.ac.ru, jamal , netdev@oss.sgi.com Subject: Re: routable interfaces References: <200101151928.WAA12705@ms2.inr.ac.ru> <20010115212556.A28379@fred.local> <01011516573300.01106@dual.vip.ca> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2146 Lines: 46 Francois Desloges wrote: > > On Mon, 15 Jan 2001, Andi Kleen wrote: > > On Mon, Jan 15, 2001 at 08:30:12PM +0100, A.N.Kuznetsov wrote: > > > What's about VLANs, they can be handled as separate virtual devices > > > provided you have _couple_ of them. It the number is higher, they must > > > be clustered as single nbma interface via framing (i.e neighbour) level > > > or via tags in routing tables. The same thing is with MPLs. > > > That's why I strongly dislike the idea to create zillions of net_devices > > > and consider that approach to VLANs as stupid one. And this is reason > > > why hashing device list (being great in principle) is not considered > > > to be some really required feature. 
> > > > Is there any evidence that people really want to use hundreds of VLANs on a > > single box in practice? > > > We plan to use hundreds if not thousands, on our 2,5 Tbps, linux > controlled, MAN switch-router. > > We may especially use VIDs as Client (or servce) identifier, for billing > purposes, once they have been aggegated on a single links past the really first > (typically only L2 switched) hop. Such machines will be used in Metropoles to > connect a _lot_ of people. I think my VLAN patch will scale to that without any problem, though I have not run serious traffic over that many interfaces. I do think you'll want the device-hashing patch for that application, even if most users of VLAN will not need it. The one linear lookup currently in the VLAN code (that I know of), is based on the number of physical ethernet interfaces that are running ethernet. That could be fixed with some smarter code on my part, but since the number has been low, sofar, I have not optimized it yet. > And I know that there is other hardware startup that plan to use Linux > on their high speed platform as well (we're not alone :-) Makes you wonder what a dual CPU Ghz machine with a bunch of Gb NICs in it could do :) Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Tue Jan 16 02:30:11 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 02:30:00 -0800 Received: from [194.213.32.137] ([194.213.32.137]:1540 "EHLO bug.ucw.cz") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 02:29:37 -0800 Received: (from pavel@localhost) by bug.ucw.cz (8.8.8/8.8.5) id AAA03347; Tue, 16 Jan 2001 00:16:34 +0100 Message-ID: <20010116001633.A3343@bug.ucw.cz> Date: Tue, 16 Jan 2001 00:16:33 +0100 From: Pavel Machek To: jamal , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: Is sendfile all that sexy? 
References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93i In-Reply-To: ; from jamal on Sun, Jan 14, 2001 at 01:29:22PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 470 Lines: 13 Hi! > TWO observations: > - Given Linux's non-pre-emptability of the kernel i get the feeling that > sendfile could starve other user space programs. Imagine trying to send a > 1Gig file on 10Mbps pipe in one shot. Hehe, try sigkilling process doing that transfer. Last time I tried it it did not work. Pavel -- I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care." Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org From owner-netdev@oss.sgi.com Tue Jan 16 04:56:33 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 04:56:24 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:17066 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 04:56:15 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id HAA17445; Tue, 16 Jan 2001 07:55:11 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 16 Jan 2001 07:55:11 -0500 (EST) From: jamal To: cc: Subject: Re: Is sendfile all that sexy? In-Reply-To: <200101151817.VAA12009@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1707 Lines: 52 Sorry to disappoint you Alexey, the numbers havent changed ;-< It would be helpful if someone else with two machines, SMP and two gigE cards can try this out; i'll send my setup to anybody -- email me. Setup: ===== ** ttcp is the traffic source/sink. 
- the receiver is MSG_TRUNCing
** Sender: SMP-PII-450Mhz, ASUS m/board; 3com version of acenic - 1M version
** receiver: same hardware; acenic alteon card - 1M version

Results are:
============

- SF means sendfile on sender with the usual goodies (TCP_CORK etc)
- NSF means the usual write to a socket from user space
- ZC means Zero copy patches included in kernel

 Kernel     | tput   | sender-CPU | receiver-CPU |
-------------------------------------------------
 2.4.0-pre3 | 99MB/s | 87%        | 23%          |
 NSF        |        |            |              |
-------------------------------------------------
 2.4.0-pre3 | 86MB/s | 100%       | 17%          |
 SF         |        |            |              |
-------------------------------------------------
 2.4.0-pre3 | 66.2   | 60%        | 11%          |
 +ZC        | MB/s   |            |              |
-------------------------------------------------
 2.4.0-pre3 | 68     | 8%         | 8%           |
 +ZC SF     | MB/s   |            |              |
-------------------------------------------------

Observations?
=============

CPU down, throughput down with ZC. I don't understand why the ZC managed to bring down CPU also on the regular writes to socket. Perhaps something else in that general patch. For something like a web server which opens gazillions of connections, this is fantastic news; for an ftp server, a single flow might not be able to fill the pipe.

cheers,
jamal

From owner-netdev@oss.sgi.com Tue Jan 16 05:24:24 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 05:24:14 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:19114 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 05:23:54 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.)
with ESMTP id IAA17538; Tue, 16 Jan 2001 08:23:09 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 16 Jan 2001 08:23:09 -0500 (EST) From: jamal To: cc: Subject: Re: routable interfaces In-Reply-To: <200101151928.WAA12705@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1914 Lines: 45 On Mon, 15 Jan 2001 kuznet@ms2.inr.ac.ru wrote: > Hello! > > > A vlan device could be a simple IP-alias. > > Sorry, Jamal you messed all the things completely. > > ifaddr is not a virtual device. "Alias" was virtual device, but > this was bogus ill-defined device which created nothing but problems for routing, > that's why we have no "aliases", but use plain ifaddrs. > > VLAN is true "virtual" device, using nice well-defined encapsulation, > and from viewpoint of routing/addressing such device is not distinguishable > of "physical" one. The question: "physical" or "virtual" is just meaningless > from the viewpoint of routing and from any other viewpoint but packet > scheduling and policing. > I mixed the terminology ifaddr != alias. sorry. Infact just looking at zebra 0.90 it seems they know how to handle multiple ifas per ifindex. So no problem there. From a SNMP perspective, (as Gleb pointed out) it is an issue. secondary addresses for example do not have counters. You have a counters per ifindex only. I think this could be one of the main reasons people end up intuitevely choosing a device as is today. That's why i changed my opinion that for now, until we come with something better, a a netdevice is a good brute force abstraction. > What's about VLANs, they can be handled as separate virtual devices > provided you have _couple_ of them. It the number is higher, they must > be clustered as single nbma interface via framing (i.e neighbour) level > or via tags in routing tables. 
The same thing is with MPLs. I think i would prefer the neighbour framing. Most of these protocols already have some form of ndisc. This would also work nicely with dynamic type of "virtual" interfaces eg L2TP etc. Caching headers etc would also be advantageous. Maybe the additional route table tag is useful. It's sort of tricky if you want to generalize for all sorts if tunnels etc; cheers, jamal From owner-netdev@oss.sgi.com Tue Jan 16 05:34:54 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 05:34:44 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:20394 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 05:34:24 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id IAA17549; Tue, 16 Jan 2001 08:33:11 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 16 Jan 2001 08:33:10 -0500 (EST) From: jamal To: Francois Desloges cc: Andi Kleen , , Subject: Re: routable interfaces In-Reply-To: <01011516573300.01106@dual.vip.ca> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 802 Lines: 26 On Mon, 15 Jan 2001, Francois Desloges wrote: > We plan to use hundreds if not thousands, on our 2,5 Tbps, linux > controlled, MAN switch-router. > > We may especially use VIDs as Client (or servce) identifier, for billing > purposes, once they have been aggegated on a single links past the really first > (typically only L2 switched) hop. Such machines will be used in Metropoles to > connect a _lot_ of people. > I think this pretty much sums up the requirements of a lot of the high speed metro type routers (using VLANS) emerging that are using Linux on the control plane. 
> And I know that there is other hardware startup that plan to use Linux > on their high speed platform as well (we're not alone :-) > There are more than a few around. World domination, Real Soon Now. cheers, jamal From owner-netdev@oss.sgi.com Tue Jan 16 05:48:34 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 05:48:24 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:21674 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 05:48:12 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id IAA17575; Tue, 16 Jan 2001 08:47:22 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 16 Jan 2001 08:47:22 -0500 (EST) From: jamal To: Pavel Machek cc: , Subject: Re: Is sendfile all that sexy? In-Reply-To: <20010116001633.A3343@bug.ucw.cz> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 496 Lines: 18 On Tue, 16 Jan 2001, Pavel Machek wrote: > > TWO observations: > > - Given Linux's non-pre-emptability of the kernel i get the feeling that > > sendfile could starve other user space programs. Imagine trying to send a > > 1Gig file on 10Mbps pipe in one shot. > > Hehe, try sigkilling process doing that transfer. Last time I tried it > it did not work. From Alexey's response: it does get descheduled possibly every sndbuf send. So you should be able to sneak that sigkill.
cheers, jamal From owner-netdev@oss.sgi.com Tue Jan 16 05:49:03 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 05:48:54 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:30988 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 05:48:41 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0GDlDL07308; Tue, 16 Jan 2001 15:47:13 +0200 From: Gleb Natapov Date: Tue, 16 Jan 2001 15:47:13 +0200 To: jamal Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: routable interfaces Message-ID: <20010116154713.A5122@nbase.co.il> References: <200101151928.WAA12705@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from hadi@cyberus.ca on Tue, Jan 16, 2001 at 08:23:09AM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2555 Lines: 54 On Tue, Jan 16, 2001 at 08:23:09AM -0500, jamal wrote: > > > On Mon, 15 Jan 2001 kuznet@ms2.inr.ac.ru wrote: > > > Hello! > > > > > A vlan device could be a simple IP-alias. > > > > Sorry, Jamal you messed all the things completely. > > > > ifaddr is not a virtual device. "Alias" was virtual device, but > > this was bogus ill-defined device which created nothing but problems for routing, > > that's why we have no "aliases", but use plain ifaddrs. > > > > VLAN is true "virtual" device, using nice well-defined encapsulation, > > and from viewpoint of routing/addressing such device is not distinguishable > > of "physical" one. The question: "physical" or "virtual" is just meaningless > > from the viewpoint of routing and from any other viewpoint but packet > > scheduling and policing. > > > > I mixed the terminology ifaddr != alias. sorry. Infact just looking at > zebra 0.90 it seems they know how to handle multiple ifas per ifindex. Yes, zebra can handle multiple ifas, but ospf cannot. 
ospf can only advertise secondary ips as stab networks and cannot form adjacency through them, so no transit traffic. AFAIK CISCO does the same thing. The only way I see to handle secondary ips and be able to route transit traffic through them is to create virtual interface for each ip alias in zebra and feed them to ospf as real interfaces. Possible, but very complicated, and then what about the ifindexes of these interfaces? > So no problem there. From a SNMP perspective, (as Gleb pointed out) it is > an issue. secondary addresses for example do not have counters. You have a > counters per ifindex only. I think this could be one of the main reasons > people end up intuitevely choosing a device as is today. That's why i > changed my opinion that for now, until we come with something better, a > a netdevice is a good brute force abstraction. > > > What's about VLANs, they can be handled as separate virtual devices > > provided you have _couple_ of them. It the number is higher, they must > > be clustered as single nbma interface via framing (i.e neighbour) level > > or via tags in routing tables. The same thing is with MPLs. > > I think i would prefer the neighbour framing. Most of these > protocols already have some form of ndisc. This would also work nicely > with dynamic type of "virtual" interfaces eg L2TP etc. Caching headers > etc would also be advantageous. Maybe the additional route table tag is > useful. It's sort of tricky if you want to generalize for all sorts if > tunnels etc; > > cheers, > jamal -- Gleb. 
From owner-netdev@oss.sgi.com Tue Jan 16 06:42:44 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 06:42:35 -0800 Received: from atrey.karlin.mff.cuni.cz ([195.113.31.123]:28937 "EHLO atrey.karlin.mff.cuni.cz") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 06:42:26 -0800 Received: (from pavel@localhost) by atrey.karlin.mff.cuni.cz (8.8.8/8.8.8) id PAA08411; Tue, 16 Jan 2001 15:41:16 +0100 Date: Tue, 16 Jan 2001 15:41:16 +0100 From: Pavel Machek To: jamal Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: Is sendfile all that sexy? Message-ID: <20010116154116.A8213@atrey.karlin.mff.cuni.cz> References: <20010116001633.A3343@bug.ucw.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: ; from hadi@cyberus.ca on Tue, Jan 16, 2001 at 08:47:22AM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 865 Lines: 22 Hi! > > > TWO observations: > > > - Given Linux's non-pre-emptability of the kernel i get the feeling that > > > sendfile could starve other user space programs. Imagine trying to send a > > > 1Gig file on 10Mbps pipe in one shot. > > > > Hehe, try sigkilling process doing that transfer. Last time I tried it > > it did not work. > > >From Alexey's response: it does get descheduled possibly every sndbuf > send. So you should be able to sneak that sigkill. Did you actually tried it? Last time I did the test, SIGKILL did not make it in. sendfile did not actually check for signals... (And you could do something like send 100MB from cache into dev null. I do not see where sigkill could sneak in in this case.) Pavel -- The best software in life is free (not shareware)! Pavel GCM d? 
s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+ From owner-netdev@oss.sgi.com Tue Jan 16 09:37:35 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 09:37:15 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:33796 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 09:37:07 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0GHZrb10050; Tue, 16 Jan 2001 19:35:53 +0200 From: Gleb Natapov Date: Tue, 16 Jan 2001 19:35:53 +0200 To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Subject: netlink drops messages. Message-ID: <20010116193553.B5122@nbase.co.il> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="h31gzZEtNLTqOjlF" Content-Disposition: inline Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 4180 Lines: 116 --h31gzZEtNLTqOjlF Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Hello, Recently I noticed that when I simultaneously do 'up' to many network interfaces (many is ~15) netlink drops part of the messages about interface state change and thus my userspace tools don't know that some interfaces are in up state now. The error that I get from netlink socket is "No buffer space available". After looking at the code I saw that the only way I can get such error from netlink is if sk->rmem_alloc is bigger than sk->rcvbuf. I can enlarge sk->rcvbuf, but for each interface I receive six messages and each of these messages is smaller than 200 bytes. The default size of sk->rcvbuf is 65535 bytes, so why can't messages about 15 interfaces fit in the default buffer size? So I looked at rtmsg_ifinfo function. We allocate skb of size NLMSG_GOODSIZE there (NLMSG_GOODSIZE appears to be one page size), fill only ~200 bytes and broadcast the message to all netlink sockets that should receive it.
When we actually deliver skb to the socket we add skb->truesize (4096 bytes) to sk->rmem_alloc and not the size of the actual message (200 bytes). So the number of messages that can be in sk->receive_queue simultaneously is about 16 only! Now, I understand that we have to use skb->truesize for accounting and not skb->len, but wasting 4000 bytes for each NEWLINK message is too much IMO. I see two solutions to the problem: First is to define NLMSG_GOODSIZE to something more reasonable (small) and second to aggregate many small messages to one big multipart message and store it in one skb. Only when the skb is full we will add another skb to the receive_queue. I've implemented second approach for netlink_broadcast_deliver just to explain what I mean (see attached patch against 2.4.0). The same thing can be done in netlink_unicast too. Is there another way to avoid such waste of space in netlink socket's rcvbuf? Comments are welcome!

--
			Gleb.

--h31gzZEtNLTqOjlF
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=patch

--- 2.4.0-zc/net/netlink/af_netlink.c	Mon Jan 15 11:02:14 2001
+++ 2.4.0-vlan/net/netlink/af_netlink.c	Tue Jan 16 17:22:21 2001
@@ -456,6 +456,10 @@
 
 static __inline__ int netlink_broadcast_deliver(struct sock *sk, struct sk_buff *skb)
 {
+struct sk_buff *skb2;
+unsigned long flags;
+unsigned int len = skb->len;
+
 #ifdef NL_EMULATE_DEV
 	if (sk->protinfo.af_netlink->handler) {
 		skb_orphan(skb);
@@ -463,15 +467,48 @@
 		return 0;
 	} else
 #endif
-	if (atomic_read(&sk->rmem_alloc) <= sk->rcvbuf &&
-	    !test_bit(0, &sk->protinfo.af_netlink->state)) {
-		skb_orphan(skb);
+	if (atomic_read(&sk->rmem_alloc) > sk->rcvbuf ||
+	    test_bit(0, &sk->protinfo.af_netlink->state))
+		return -1;
+
+	skb_orphan(skb);
+	spin_lock_irqsave(&(sk->receive_queue.lock), flags);
+	skb2 = skb_peek_tail (&sk->receive_queue);
+
+	if (skb2 == NULL) {
+notailroom:
 		skb_set_owner_r(skb, sk);
-		skb_queue_tail(&sk->receive_queue, skb);
-		sk->data_ready(sk, skb->len);
-		return 0;
+		__skb_queue_tail(&sk->receive_queue, skb);
+		spin_unlock_irqrestore(&(sk->receive_queue.lock), flags);
+	} else {
+		struct nlmsghdr *h = (struct nlmsghdr *)skb->data,
+			*h2 = (struct nlmsghdr *)skb2->data, *nlh;
+		unsigned int tailroomreq = len;
+
+		if ( !(h2->nlmsg_flags & NLM_F_MULTI) )
+			tailroomreq += NLMSG_ALIGN(NLMSG_LENGTH(sizeof(int)));
+
+		if (skb_tailroom (skb2) < tailroomreq)
+			goto notailroom;
+
+		if ( h2->nlmsg_flags & NLM_F_MULTI )
+			skb_trim (skb2, skb2->len -
+				NLMSG_ALIGN(NLMSG_LENGTH(sizeof(int))));
+
+		h2->nlmsg_flags |= NLM_F_MULTI;
+		h->nlmsg_flags |= NLM_F_MULTI;
+
+		memcpy(skb_put(skb2, skb->len), skb->data, skb->len);
+		nlh = __nlmsg_put(skb2, NETLINK_CB(skb).pid,
+				h2->nlmsg_seq, NLMSG_DONE, sizeof (int));
+		nlh->nlmsg_flags |= NLM_F_MULTI;
+		memcpy(NLMSG_DATA(nlh), &(skb2->len), sizeof(int));
+
+		spin_unlock_irqrestore(&(sk->receive_queue.lock), flags);
+		kfree_skb (skb);
 	}
-	return -1;
+	sk->data_ready(sk, len);
+	return 0;
 }
 
 void netlink_broadcast(struct sock *ssk, struct sk_buff *skb, u32 pid,

--h31gzZEtNLTqOjlF--

From owner-netdev@oss.sgi.com Tue Jan 16 09:56:06 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 09:55:47 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:21009 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 16 Jan 2001 09:55:15 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id UAA31048; Tue, 16 Jan 2001 20:54:53 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101161754.UAA31048@ms2.inr.ac.ru> Subject: Re: netlink drops messages. To: gleb@nbase.co.il (Gleb Natapov) Date: Tue, 16 Jan 2001 20:54:53 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: <20010116193553.B5122@nbase.co.il> from "Gleb Natapov" at Jan 16, 1 07:35:53 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 707 Lines: 19 Hello!
> Recently I noticed that when I simultaneously do 'up' to many network interfaces > (many is ~15) netlink drops part of the messages about interface state change and thus > my userspace tools don't know that some interfaces are in up state now. The error that > I get from netlink socket is "No buffer space available". Which means that applications must invalidate stored state and to resynchronize doing dumps of all the necessary information. > sk->receive_queue simultaneously is about 16 only! 16 or 116, this is not very essential. Of course, page size is sort of overkill, but I do not want to estimate required room forward. Application must be able to resync in any case. Alexey From owner-netdev@oss.sgi.com Tue Jan 16 10:07:36 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 10:07:26 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:39430 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 10:07:09 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0GI60b10297; Tue, 16 Jan 2001 20:06:00 +0200 From: Gleb Natapov Date: Tue, 16 Jan 2001 20:06:00 +0200 To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010116200600.C5122@nbase.co.il> References: <20010116193553.B5122@nbase.co.il> <200101161754.UAA31048@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200101161754.UAA31048@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Tue, Jan 16, 2001 at 08:54:53PM +0300 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1097 Lines: 30 On Tue, Jan 16, 2001 at 08:54:53PM +0300, kuznet@ms2.inr.ac.ru wrote: > Hello! 
> > > Recently I noticed that when I simultaneously do 'up' to many network interfaces > > (many is ~15) netlink drops part of the messages about interface state change and thus > > my userspace tools don't know that some interfaces are in up state now. The error that > > I get from netlink socket is "No buffer space available". > > Which means that applications must invalidate stored state > and to resynchronize doing dumps of all the necessary information. > It means I will resynchronize almost every time :(. > > > sk->receive_queue simultaneously is about 16 only! > > 16 or 116, this is not very essential. This may be essential. When the sk->receive_queue is bigger, my application will have a chance to read part of the messages and free space in socket for more. > > Of course, page size is sort of overkill, but I do not want to estimate > required room forward. Application must be able to resync in any case. Resync should be an exception and not the rule IMO. > > Alexey -- Gleb. From owner-netdev@oss.sgi.com Tue Jan 16 10:13:37 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 10:13:17 -0800 Received: from nero.doit.wisc.edu ([128.104.17.130]:4627 "EHLO nero.doit.wisc.edu") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 10:13:03 -0800 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.8.7/8.8.7) id MAA01365; Tue, 16 Jan 2001 12:12:52 -0600 Date: Tue, 16 Jan 2001 12:12:52 -0600 From: "James R. Leu" To: kuznet@ms2.inr.ac.ru Cc: Gleb Natapov , netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010116121252.B1299@doit.wisc.edu> Reply-To: jleu@mindspring.com References: <20010116193553.B5122@nbase.co.il> <200101161754.UAA31048@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <200101161754.UAA31048@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Tue, Jan 16, 2001 at 08:54:53PM +0300 Organization: none Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1141 Lines: 33 Hello, On Tue, Jan 16, 2001 at 08:54:53PM +0300, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > Recently I noticed that when I simultaneously do 'up' to many network interfaces > > (many is ~15) netlink drops part of the messages about interface state change and thus > > my userspace tools don't know that some interfaces are in up state now. The error that > > I get from netlink socket is "No buffer space available". > > Which means that applications must invalidate stored state > and to resynchronize doing dumps of all the necessary information. Just want to make sure I didn't mis-understand you: Are you saying that an application should not rely on netlink to deliver accurate and complete information about interface or routing changes? AND that re-reading the entire route table or interface table is the correct solution to this problem? Jim > > sk->receive_queue simultaneously is about 16 only! > > 16 or 116, this is not very essential. > > Of course, page size is sort of overkill, but I do not want to estimate > required room forward. Application must be able to resync in any case. > > Alexey -- James R. 
Leu From owner-netdev@oss.sgi.com Tue Jan 16 10:28:56 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 10:28:48 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:44049 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 16 Jan 2001 10:28:44 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA31502; Tue, 16 Jan 2001 21:28:34 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101161828.VAA31502@ms2.inr.ac.ru> Subject: Re: netlink drops messages. To: gleb@nbase.co.il (Gleb Natapov) Date: Tue, 16 Jan 2001 21:28:34 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: <20010116200600.C5122@nbase.co.il> from "Gleb Natapov" at Jan 16, 1 08:06:00 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 192 Lines: 9 Hello! > Resync should be an exception and not the rule IMO. If in your system simultaneous UP of 100 interfaces is not an exception. 8) SO_RCVBUF exists to adapt to the situation. Alexey From owner-netdev@oss.sgi.com Tue Jan 16 10:33:06 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 10:32:57 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:47889 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 16 Jan 2001 10:32:46 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA31543; Tue, 16 Jan 2001 21:32:24 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101161832.VAA31543@ms2.inr.ac.ru> Subject: Re: netlink drops messages. To: jleu@mindspring.com Date: Tue, 16 Jan 2001 21:32:24 +0300 (MSK) Cc: gleb@nbase.co.il, netdev@oss.sgi.com In-Reply-To: <20010116121252.B1299@doit.wisc.edu> from "James R. Leu" at Jan 16, 1 12:12:52 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 448 Lines: 13 Hello! 
> Are you saying that an application should not rely on netlink to deliver > accurate and complete information about interface or routing changes? > AND that re-reading the entire route table or interface table is the correct > solution to this problem? "Reliable" delivery without retransmissions is possible provided you have infinite amount of memory. 8)8)8) You never have and ENOBUFS provides necessary retransmission logic. Alexey From owner-netdev@oss.sgi.com Tue Jan 16 10:44:06 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 10:43:57 -0800 Received: from nero.doit.wisc.edu ([128.104.17.130]:7699 "EHLO nero.doit.wisc.edu") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 10:43:35 -0800 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.8.7/8.8.7) id MAA01439; Tue, 16 Jan 2001 12:43:19 -0600 Date: Tue, 16 Jan 2001 12:43:19 -0600 From: "James R. Leu" To: kuznet@ms2.inr.ac.ru Cc: jleu@mindspring.com, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010116124319.D1299@doit.wisc.edu> Reply-To: jleu@mindspring.com References: <20010116121252.B1299@doit.wisc.edu> <200101161832.VAA31543@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <200101161832.VAA31543@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Tue, Jan 16, 2001 at 09:32:24PM +0300 Organization: none Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 931 Lines: 27 On Tue, Jan 16, 2001 at 09:32:24PM +0300, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > Are you saying that an application should not rely on netlink to deliver > > accurate and complete information about interface or routing changes? > > AND that re-reading the entire route table or interface table is the correct > > solution to this problem? > > "Reliable" delivery without retransmissions is possible provided > you have infinite amount of memory. 
8)8)8) > > You never have and ENOBUFS provides necessary retransmission logic. You didn't say "retransmission" of missed netlink information, you said "resynchronize doing dumps". These are completely different things. If netlink can't provide reliable update information then it needs to be changed. How is it that we can provide reliable transport of information across the internet via TCP, yet a netlink socket within the same box cannot? Jim > Alexey -- James R. Leu From owner-netdev@oss.sgi.com Tue Jan 16 10:49:17 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 10:49:06 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:59153 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 16 Jan 2001 10:49:03 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA31729; Tue, 16 Jan 2001 21:48:48 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101161848.VAA31729@ms2.inr.ac.ru> Subject: Re: netlink drops messages. To: jleu@mindspring.com Date: Tue, 16 Jan 2001 21:48:48 +0300 (MSK) Cc: gleb@nbase.co.il, netdev@oss.sgi.com In-Reply-To: <20010116124319.D1299@doit.wisc.edu> from "James R. Leu" at Jan 16, 1 12:43:19 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 396 Lines: 17 Hello! > You didn't say "retransmission" of missed netlink information, you said > "resynchronize doing dumps". These are completely different things. Think a bit. > If netlink can't provide reliable update information then it needs to be > changed. 8)8)8) If you will invent a way to reliably store N bytes having storage of only 1 byte, you will make a revolution in technique.
8) Alexey From owner-netdev@oss.sgi.com Tue Jan 16 11:32:28 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 11:32:18 -0800 Received: from nero.doit.wisc.edu ([128.104.17.130]:10259 "EHLO nero.doit.wisc.edu") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 11:32:11 -0800 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.8.7/8.8.7) id NAA01468; Tue, 16 Jan 2001 13:14:56 -0600 Date: Tue, 16 Jan 2001 13:14:56 -0600 From: "James R. Leu" To: kuznet@ms2.inr.ac.ru Cc: jleu@mindspring.com, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010116131456.E1299@doit.wisc.edu> Reply-To: jleu@mindspring.com References: <20010116124319.D1299@doit.wisc.edu> <200101161848.VAA31729@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <200101161848.VAA31729@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Tue, Jan 16, 2001 at 09:48:48PM +0300 Organization: none Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 691 Lines: 26 On Tue, Jan 16, 2001 at 09:48:48PM +0300, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > You didn't say "retransmission" of missed netlink information, you said > > "resynchronize doing dumps". These are completly differnt things. > > Think a bit. About what? How silly it is to think that netlink isn't reliable? > > If netlink can't provide reliable update information then it needs to be > > changed. > > 8)8)8) > > If you will invent a way to reliably store N bytes having > storage of only 1 byte, you will make a revolution in technique. 8) I'm not asking for the impossible. Sequence numbers and/or client to server ACKs would solve the problem. Jim > Alexey -- James R. 
Leu From owner-netdev@oss.sgi.com Tue Jan 16 11:52:58 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 11:52:48 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:787 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 16 Jan 2001 11:52:43 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA32292; Tue, 16 Jan 2001 22:52:30 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101161952.WAA32292@ms2.inr.ac.ru> Subject: Re: netlink drops messages. To: jleu@mindspring.com Date: Tue, 16 Jan 2001 22:52:30 +0300 (MSK) Cc: gleb@nbase.co.il, netdev@oss.sgi.com In-Reply-To: <20010116131456.E1299@doit.wisc.edu> from "James R. Leu" at Jan 16, 1 01:14:56 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 478 Lines: 15 Hello! > > If you will invent a way to reliably store N bytes having > > storage of only 1 byte, you will make a revolution in technique. 8) > > I'm not asking for the impossible. Sequence numbers and/or client > to server ACKs would solve the problem. When you have finite memory, you have no room to remember information about not acked data. 8) rtnetlink implements the only known method of reliable delivery. If you invented another one, please, explain it. 
8) Alexey From owner-netdev@oss.sgi.com Tue Jan 16 11:58:38 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 11:58:29 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:25018 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 11:58:24 -0800 Received: from fred.muc.de (noidentity@ns1218.munich.netsurf.de [195.180.235.218]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id UAA02509; Tue, 16 Jan 2001 20:58:10 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 40308E3BB8; Tue, 16 Jan 2001 21:03:45 +0100 (CET) Date: Tue, 16 Jan 2001 21:03:45 +0100 From: Andi Kleen To: Gleb Natapov Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010116210345.A8319@fred.local> References: <20010116193553.B5122@nbase.co.il> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010116193553.B5122@nbase.co.il>; from gleb@nbase.co.il on Tue, Jan 16, 2001 at 06:38:28PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 876 Lines: 19 On Tue, Jan 16, 2001 at 06:38:28PM +0100, Gleb Natapov wrote: > Hello, > > Recently I noticed that when I simultaneously do 'up' to many network interfaces > (many is ~15) netlink drops part of the messages about interface state change and thus > my userspace tools don't know that some interfaces are in up state now. The error that > I get from netlink socket is "No buffer space available". > > After looking at the code I saw that the only way I can get such error from netlink > is if sk->rmem_alloc is bigger than sk->rcvbuf. I can enlarge sk->rcvbuf, but for each > interface I receive six messages and each of these messages is smaller than 200 bytes. > The default size of sk->rcvbuf is 65535 bytes, so why messages about 15 interfaces can't > fit in default buffer size? Because the sk_buff header size is accounted too.
sk_buffs are not lightweight. -Andi From owner-netdev@oss.sgi.com Tue Jan 16 12:04:38 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 12:04:28 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:6675 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 16 Jan 2001 12:04:18 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA32438; Tue, 16 Jan 2001 23:04:00 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101162004.XAA32438@ms2.inr.ac.ru> Subject: Re: Is sendfile all that sexy? To: hadi@cyberus.ca (jamal) Date: Tue, 16 Jan 2001 23:04:00 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: from "jamal" at Jan 16, 1 07:55:11 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 313 Lines: 13 Hello! > Sorry to disappoint you Alexey, the numbers havent changed ;-< Try to gather profile of zc case. I have nothing more to propose. > might not be able to fill the pipe. This does not happen here yet. But, to be honest, your nozc numbers of 100MB/sec is sort of too high for my test case. 8)8) Alexey From owner-netdev@oss.sgi.com Tue Jan 16 12:24:09 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 12:23:59 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:18451 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 16 Jan 2001 12:23:48 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA32644; Tue, 16 Jan 2001 23:23:40 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101162023.XAA32644@ms2.inr.ac.ru> Subject: Re: routable interfaces To: hadi@cyberus.ca (jamal) Date: Tue, 16 Jan 2001 23:23:40 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: from "jamal" at Jan 16, 1 08:23:09 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 482 Lines: 19 Hello! > an issue. 
secondary addresses for example do not have counters. You have a > counters per ifindex only. Statistics has a sense only per link. Aliases (even not saying about "addresses") do not receive and do not send anything, sorry. > useful. It's sort of tricky if you want to generalize for all sorts if > tunnels etc; When you have set of hundred of tunnels, any trick allowing to put an order there loses name "trick" and acquires name "interface". 8)8) Alexey From owner-netdev@oss.sgi.com Tue Jan 16 19:07:30 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 19:07:10 -0800 Received: from lsb-catv-1-p021.vtxnet.ch ([212.147.5.21]:9230 "EHLO almesberger.net") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 19:06:59 -0800 Received: (from almesber@localhost) by almesberger.net (8.9.3/8.9.3) id EAA08497; Wed, 17 Jan 2001 04:05:42 +0100 Date: Wed, 17 Jan 2001 04:05:42 +0100 From: Werner Almesberger To: "James R. Leu" Cc: kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010117040542.Y18286@almesberger.net> References: <20010116124319.D1299@doit.wisc.edu> <200101161848.VAA31729@ms2.inr.ac.ru> <20010116131456.E1299@doit.wisc.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010116131456.E1299@doit.wisc.edu>; from jleu@mindspring.com on Tue, Jan 16, 2001 at 01:14:56PM -0600 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1285 Lines: 27 James R. Leu wrote: > I'm not asking for the impossible. Sequence numbers and/or client > to server ACKs would solve the problem. So what do you do when the client doesn't ACK and you run out of buffer space ? Block all activities that may trigger netlink messages ? Obviously, in this case (interface up/down transitions), netlink doesn't scale well. A state-based interface would be better, e.g. 
netlink could generate a bit vector indicating the states (or the transitions, if it matters whether any have occurred), and update the vector until it has been read by the client. The question is of course whether we really need an optimized, scalable solution for this. However, in general, I get the impression that netlink is vastly over-engineered for most uses. Perhaps the situation could be improved if distributions would start to include libnetlink (so you can expect it to be available), and somebody would write a man page. Actually, isn't netlink from BSD ? If they also have a libnetlink, maybe there's some documentation too. - Werner -- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH Werner.Almesberger@epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Tue Jan 16 19:17:59 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 19:17:50 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:6571 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 19:17:48 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id WAA20494; Tue, 16 Jan 2001 22:17:02 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 16 Jan 2001 22:17:02 -0500 (EST) From: jamal To: cc: Gleb Natapov , Subject: Re: netlink drops messages. In-Reply-To: <200101161754.UAA31048@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 271 Lines: 13 On Tue, 16 Jan 2001 kuznet@ms2.inr.ac.ru wrote: > Of course, page size is sort of overkill, but I do not want to estimate > required room forward. NLM_F_MULTI solution does seem to address that, no? (In addition to the application resync, of course). 
cheers, jamal From owner-netdev@oss.sgi.com Tue Jan 16 19:23:30 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 19:23:11 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:7851 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 19:22:58 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id WAA20531; Tue, 16 Jan 2001 22:22:10 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 16 Jan 2001 22:22:10 -0500 (EST) From: jamal To: Werner Almesberger cc: "James R. Leu" , , , Subject: Re: netlink drops messages. In-Reply-To: <20010117040542.Y18286@almesberger.net> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 681 Lines: 25 On Wed, 17 Jan 2001, Werner Almesberger wrote: > James R. Leu wrote: > > I'm not asking for the impossible. Sequence numbers and/or client > > to server ACKs would solve the problem. > > So what do you do when the client doesn't ACK and you run out of buffer > space ? Block all activities that may trigger netlink messages ? He should still be able to do reliable communication by doing the state maintenance within the user space app. The mechanism is already in place. [nlmsghdr.nlmsg_seq. as well as flags NLM_F_ACK, NLM_F_ECHO ] > > Actually, isn't netlink from BSD ? Don't think so ;->. BSD has routing sockets which are a very small subset of netlink. cheers, jamal From owner-netdev@oss.sgi.com Tue Jan 16 19:26:40 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 19:26:30 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:8619 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 19:26:22 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.)
with ESMTP id WAA20537; Tue, 16 Jan 2001 22:25:37 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 16 Jan 2001 22:25:37 -0500 (EST) From: jamal To: Gleb Natapov cc: , Subject: Re: routable interfaces In-Reply-To: <20010116154713.A5122@nbase.co.il> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1020 Lines: 25 On Tue, 16 Jan 2001, Gleb Natapov wrote: > Yes, zebra can handle multiple ifas, but ospf cannot. ospf can only advertise > secondary ips as stab networks and cannot form adjacency through them, so no transit > traffic. AFAIK CISCO does the same thing. The only way I see to handle secondary > ips and be able to route transit traffic through them is to create virtual interface > for each ip alias in zebra and feed them to ospf as real interfaces. You mean the ifa_label can be used as a interface name that zebra/ospf will understand? Ok, That should work + you might have to force sourcing of ip addresses given certain routes. I am not sure if zebra knows how to do that i.e can you tell zebra to set the route table such packets coming out of that link going towards certain destination will have the aliased IP as the source? It's easy to set on the command line, for example. > Possible, but > very complicated, and then what about the ifindexes of these interfaces? Ifindices are an issue. cheers, jamal From owner-netdev@oss.sgi.com Tue Jan 16 19:44:00 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 19:43:51 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:9131 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 19:43:32 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) 
with ESMTP id WAA20544; Tue, 16 Jan 2001 22:42:49 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 16 Jan 2001 22:42:49 -0500 (EST) From: jamal To: cc: Subject: Re: routable interfaces In-Reply-To: <200101162023.XAA32644@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1157 Lines: 39 On Tue, 16 Jan 2001 kuznet@ms2.inr.ac.ru wrote: > Hello! > > > an issue. secondary addresses for example do not have counters. You have a > > counters per ifindex only. > > Statistics has a sense only per link. > > Aliases (even not saying about "addresses") do not receive and do not send > anything, sorry. > But packets can "appear;->" to be coming from an aliased IP. Also packets could be addresses to a virtual address. Could it be defined as a link from that perspective? And it seems to make sense to keep count of such packets. Maybe it is logical to have a physical ifindex + counters as well as virtual ones (i,e ifindex+counter). I dont think current SNMP standards support that; but then i am no expert there. > > > useful. It's sort of tricky if you want to generalize for all sorts if > > tunnels etc; > > When you have set of hundred of tunnels, any trick allowing > to put an order there loses name "trick" and acquires name > "interface". 8)8) > Do you mean, netdevice? ;-> I have seen interface being abused to mean a lot of things (eg the packet munging, such as routing, is refered by some people as "interface"). 
cheers, jamal From owner-netdev@oss.sgi.com Tue Jan 16 21:17:49 2001 Received: by oss.sgi.com id ; Tue, 16 Jan 2001 21:17:30 -0800 Received: from [129.94.172.186] ([129.94.172.186]:65521 "EHLO localhost.localdomain") by oss.sgi.com with ESMTP id ; Tue, 16 Jan 2001 21:17:14 -0800 Received: from localhost (riel@localhost) by localhost.localdomain (8.11.2/8.11.2) with ESMTP id f0H5GZN27007; Wed, 17 Jan 2001 03:16:36 -0200 X-Authentication-Warning: localhost.localdomain: riel owned process doing -bs Date: Wed, 17 Jan 2001 16:16:35 +1100 (EST) From: Rik van Riel X-X-Sender: To: Andrea Arcangeli cc: Ingo Molnar , Jens Axboe , Alan Cox , "Stephen C. Tweedie" , Christoph Hellwig , "David S. Miller" , , Subject: Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1 In-Reply-To: <20010109205420.H29904@athlon.random> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 358 Lines: 14 On Tue, 9 Jan 2001, Andrea Arcangeli wrote: > BTW, I noticed what is left in blk-13B seems to be my work Yeah yeah, we'll buy you beer at the next conference... ;) Rik -- Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com.br/ From owner-netdev@oss.sgi.com Wed Jan 17 00:07:30 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 00:07:21 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:6674 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 00:07:05 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0H857515183; Wed, 17 Jan 2001 10:05:07 +0200 From: Gleb Natapov Date: Wed, 17 Jan 2001 10:05:07 +0200 To: Andi Kleen Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117100506.D5122@nbase.co.il> References: <20010116193553.B5122@nbase.co.il> <20010116210345.A8319@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010116210345.A8319@fred.local>; from ak@muc.de on Tue, Jan 16, 2001 at 09:03:45PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1129 Lines: 25 On Tue, Jan 16, 2001 at 09:03:45PM +0100, Andi Kleen wrote: > On Tue, Jan 16, 2001 at 06:38:28PM +0100, Gleb Natapov wrote: > > Hello, > > > > Recently I noticed that when I simultaneously do 'up' to many network interfaces > > (many is ~15) netlink drops part of the messages about interface state change and thus > > my userspace tools don't know that some interfaces are in up state now. The error that > > I get from netlink socket is "No buffer space available". > > > > After looking at the code I saw that the only way I can get such error from netlink > > is if sk->rmem_alloc is bigger than sk->rcvbuf. I can enlarge sk->rcvbuf, but for each > > interface I receive six messages and each of these messages is smaller than 200 bytes. > > The default size of sk->rcvbuf is 65535 bytes, so why messages about 15 interfaces can't > > fit in default buffer size? > > Because the sk_buff header size is accounted too. > sk_buffs are not lightweight. > Here is how NLMSG_GOODSIZE is defined: #define NLMSG_GOODSIZE (PAGE_SIZE - ((sizeof(struct sk_buff)+0xF)&~0xF)) So sk_buff header is not an issue here. -- Gleb.
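The numbers quoted in this exchange can be checked with a little arithmetic. What follows is a minimal sketch, not a measurement: it assumes a 4096-byte PAGE_SIZE, and it assumes that each queued notification is charged roughly one page against sk->rcvbuf (an NLMSG_GOODSIZE buffer plus the sk_buff header), which is a reading of the thread rather than something stated in it. The function names are illustrative only.

```python
# Back-of-the-envelope check of the receive-queue numbers in this thread.
# PAGE_SIZE and the "one page charged per message" model are assumptions.

PAGE_SIZE = 4096          # assumed; not stated in the thread
RCVBUF = 65535            # default sk->rcvbuf quoted in the thread
MSGS_PER_INTERFACE = 6    # six messages per interface, as reported
PAYLOAD_BYTES = 200       # upper bound per message, as reported


def max_queued(rcvbuf: int, charge_per_msg: int) -> int:
    """Number of messages that fit if each is charged charge_per_msg bytes."""
    return rcvbuf // charge_per_msg


def messages_needed(interfaces: int, per_interface: int = MSGS_PER_INTERFACE) -> int:
    """Notifications generated when `interfaces` links change state at once."""
    return interfaces * per_interface


by_payload = max_queued(RCVBUF, PAYLOAD_BYTES)   # if only the payload counted
by_page = max_queued(RCVBUF, PAGE_SIZE)          # if a whole page is charged
needed = messages_needed(15)

print(f"fit if charged by payload: {by_payload}")
print(f"fit if charged by page:    {by_page}")
print(f"needed for 15 interfaces:  {needed}")
```

Under these assumptions only about 15 messages fit in the default buffer, which is in line with the "about 16" receive-queue figure quoted earlier in the thread, while bringing up 15 interfaces produces 90 notifications.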
From owner-netdev@oss.sgi.com Wed Jan 17 00:12:10 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 00:12:00 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:16658 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 00:11:49 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0H89v015208; Wed, 17 Jan 2001 10:09:57 +0200 From: Gleb Natapov Date: Wed, 17 Jan 2001 10:09:57 +0200 To: jamal Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010117100957.E5122@nbase.co.il> References: <200101161754.UAA31048@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from hadi@cyberus.ca on Tue, Jan 16, 2001 at 10:17:02PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 505 Lines: 16 On Tue, Jan 16, 2001 at 10:17:02PM -0500, jamal wrote: > > On Tue, 16 Jan 2001 kuznet@ms2.inr.ac.ru wrote: > > > Of course, page size is sort of overkill, but I do not want to estimate > > required room forward. > > NLM_F_MULTI solution does seem to address that, no? (In addition to the > application resync, of course). > Exactly my point. We can't solve the problem, but we can reduce sk_buff overkill. Every program that wants to be correct still needs to implement a resync mechanism. -- Gleb. From owner-netdev@oss.sgi.com Wed Jan 17 00:19:10 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 00:19:01 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:35602 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 00:18:50 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0H8HKj15276; Wed, 17 Jan 2001 10:17:20 +0200 From: Gleb Natapov Date: Wed, 17 Jan 2001 10:17:20 +0200 To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Subject: Re: netlink drops messages.
Message-ID: <20010117101720.F5122@nbase.co.il> References: <20010116200600.C5122@nbase.co.il> <200101161828.VAA31502@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200101161828.VAA31502@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Tue, Jan 16, 2001 at 09:28:34PM +0300 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 588 Lines: 16 On Tue, Jan 16, 2001 at 09:28:34PM +0300, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > Resync should be an exception and not the rule IMO. > > If in your system simultaneous UP of 100 interfaces is not an exception. 8) Forget about interfaces :). Suppose I have BGP router with 100.000 prefixes on one interface. When interface goes UP my router daemon adds all this prefixes to the kernel. This will generate burst of netlink messages right? > > SO_RCVBUF exists to adapt to the situation. Ok, if I'll increase rcvbuf to be 10000 bytes large I'll get two more messages :) -- Gleb. 
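The recovery strategy argued for in this thread (treat an overflow as the signal to throw away cached state and re-read everything with a dump) can be illustrated with a toy model. The sketch below does not touch netlink at all: `BoundedFeed`, `Overflow`, and `Mirror` are hypothetical names invented for the illustration, the fixed-length queue stands in for the socket receive buffer, and the `Overflow` exception stands in for ENOBUFS on a real socket.

```python
from collections import deque


class Overflow(Exception):
    """Stands in for ENOBUFS: at least one event was dropped."""


class BoundedFeed:
    """Toy event source with a fixed-size queue and a dumpable full state."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.overflowed = False
        self.state = {}             # authoritative state, e.g. ifname -> "up"

    def publish(self, key, value):
        self.state[key] = value
        if len(self.queue) >= self.capacity:
            self.overflowed = True  # drop the event, but remember that we did
        else:
            self.queue.append((key, value))

    def receive(self):
        if self.overflowed:
            self.overflowed = False
            raise Overflow()
        return self.queue.popleft() if self.queue else None

    def dump(self):
        self.queue.clear()          # queued events are subsumed by the dump
        return dict(self.state)


class Mirror:
    """Consumer that keeps a copy of the state and resyncs on overflow."""

    def __init__(self, feed):
        self.feed = feed
        self.view = feed.dump()
        self.resyncs = 0

    def drain(self):
        while True:
            try:
                event = self.feed.receive()
            except Overflow:
                self.view = self.feed.dump()  # invalidate, re-read everything
                self.resyncs += 1
                continue
            if event is None:
                return
            key, value = event
            self.view[key] = value
```

With a capacity of 16 and a burst of 100 published events, one call to `Mirror.drain()` ends with `view` equal to the feed's state after a single resync. Enlarging the capacity (the analogue of raising SO_RCVBUF) makes resyncs rarer but does not remove the need for the overflow path.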
From owner-netdev@oss.sgi.com Wed Jan 17 01:03:50 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 01:03:41 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:34308 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 01:03:24 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0H91hl15517; Wed, 17 Jan 2001 11:01:43 +0200 From: Gleb Natapov Date: Wed, 17 Jan 2001 11:01:43 +0200 To: kuznet@ms2.inr.ac.ru Cc: jamal , netdev@oss.sgi.com Subject: Re: routable interfaces Message-ID: <20010117110143.G5122@nbase.co.il> References: <200101162023.XAA32644@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200101162023.XAA32644@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Tue, Jan 16, 2001 at 11:23:40PM +0300 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 796 Lines: 25 On Tue, Jan 16, 2001 at 11:23:40PM +0300, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > an issue. secondary addresses for example do not have counters. You have a > > counters per ifindex only. > > Statistics has a sense only per link. In case of error counters yes, but it make sense to me to have separate counters for rx\tx bytes per IP address. After all I want to know how much traffic we routed from one subnet to another. > > Aliases (even not saying about "addresses") do not receive and do not send > anything, sorry. > > > > useful. It's sort of tricky if you want to generalize for all sorts if > > tunnels etc; > > When you have set of hundred of tunnels, any trick allowing > to put an order there loses name "trick" and acquires name > "interface". 8)8) > > Alexey -- Gleb. 
From owner-netdev@oss.sgi.com Wed Jan 17 01:48:31 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 01:48:21 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:62213 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 01:48:05 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0H9kpF15763; Wed, 17 Jan 2001 11:46:51 +0200 From: Gleb Natapov Date: Wed, 17 Jan 2001 11:46:51 +0200 To: jamal Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: routable interfaces Message-ID: <20010117114651.H5122@nbase.co.il> References: <20010116154713.A5122@nbase.co.il> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from hadi@cyberus.ca on Tue, Jan 16, 2001 at 10:25:37PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1756 Lines: 31 On Tue, Jan 16, 2001 at 10:25:37PM -0500, jamal wrote: > > > On Tue, 16 Jan 2001, Gleb Natapov wrote: > > > Yes, zebra can handle multiple ifas, but ospf cannot. ospf can only advertise > > secondary ips as stab networks and cannot form adjacency through them, so no transit > > traffic. AFAIK CISCO does the same thing. The only way I see to handle secondary > > ips and be able to route transit traffic through them is to create virtual interface > > for each ip alias in zebra and feed them to ospf as real interfaces. > > You mean the ifa_label can be used as a interface name that zebra/ospf > will understand? Ok, That should work + you might have to force sourcing of > ip addresses given certain routes. I am not sure if zebra knows how to do > that i.e can you tell zebra to set the route table such packets coming out > of that link going towards certain destination will have the aliased IP as > the source? It's easy to set on the command line, for example. > I am not talking about sending or receiving packets here. OSPF has a strong notion of interface. 
Interface has mtu, ip address and many OSPF specific parameters, it has state that changes according to interface state machine, it has list of neighbours, etc, etc. All this things are needed in order to be able to build adjacency through the interface. Currently zebra has one to one mapping between "kernel interfaces" and "zebra interfaces". If I want to run OSPF (I don't know about other protocols) on secondary ips zebra should be able to have "zebra interface" for each ip (and not for each interface), or, in other words one to many mapping. This is only theory, I don't know if it's even possible to implement such thing (you never know until you'll try ;)). -- Gleb. From owner-netdev@oss.sgi.com Wed Jan 17 03:10:42 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 03:10:22 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:35001 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 03:10:05 -0800 Received: from fred.muc.de (noidentity@ns1247.munich.netsurf.de [195.180.235.247]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id MAA24500; Wed, 17 Jan 2001 12:09:52 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id C1978E3BB8; Wed, 17 Jan 2001 12:06:52 +0100 (CET) Date: Wed, 17 Jan 2001 12:06:52 +0100 From: Andi Kleen To: Gleb Natapov Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117120652.A1830@fred.local> References: <20010116200600.C5122@nbase.co.il> <200101161828.VAA31502@ms2.inr.ac.ru> <20010117101720.F5122@nbase.co.il> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010117101720.F5122@nbase.co.il>; from gleb@nbase.co.il on Wed, Jan 17, 2001 at 09:19:45AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 625 Lines: 17 On Wed, Jan 17, 2001 at 09:19:45AM +0100, Gleb Natapov wrote: > On Tue, Jan 16, 2001 at 09:28:34PM +0300, kuznet@ms2.inr.ac.ru wrote: > > Hello! > > > > > Resync should be an exception and not the rule IMO. > > > > If in your system simultaneous UP of 100 interfaces is not an exception. 8) > Forget about interfaces :). Suppose I have BGP router with 100.000 prefixes on > one interface. When interface goes UP my router daemon adds all this prefixes > to the kernel. This will generate burst of netlink messages right? Netlink sendmsg does flow control based on the buffer. -Andi -- This is like TV. I don't like TV. From owner-netdev@oss.sgi.com Wed Jan 17 03:10:42 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 03:10:33 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:46521 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 03:10:16 -0800 Received: from fred.muc.de (noidentity@ns1247.munich.netsurf.de [195.180.235.247]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id MAA24494; Wed, 17 Jan 2001 12:09:51 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id E6AFBE3BB9; Wed, 17 Jan 2001 12:15:32 +0100 (CET) Date: Wed, 17 Jan 2001 12:15:32 +0100 From: Andi Kleen To: Werner Almesberger Cc: "James R. Leu" , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117121532.B1830@fred.local> References: <20010116124319.D1299@doit.wisc.edu> <200101161848.VAA31729@ms2.inr.ac.ru> <20010116131456.E1299@doit.wisc.edu> <20010117040542.Y18286@almesberger.net> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="ikeVEW9yuYc//A+q" X-Mailer: Mutt 1.0.1i In-Reply-To: <20010117040542.Y18286@almesberger.net>; from Werner.Almesberger@epfl.ch on Wed, Jan 17, 2001 at 04:08:14AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 6488 Lines: 246 --ikeVEW9yuYc//A+q Content-Type: text/plain; charset=us-ascii On Wed, Jan 17, 2001 at 04:08:14AM +0100, Werner Almesberger wrote: > James R. Leu wrote: > > I'm not asking for the impossible. Sequence numbers and/or client > > to server ACKs would solve the problem. > > So what do you do when the client doesn't ACK and you run out of buffer > space ? Block all activities that may trigger netlink messages ? > > Obviously, in this case (interface up/down transitions), netlink doesn't > scale well. A state-based interface would be better, e.g. netlink could > generate a bit vector indicating the states (or the transitions, if it > matters whether any have occurred), and update the vector until it has > been read by the client. The question is of course whether we really > need an optimized, scalable solution for this. A simple way is to delete ip addresses when you down an interface and use regular SIOCGIFCONF. > > However, in general, I get the impression that netlink is vastly > over-engineered for most uses. Perhaps the situation could be improved > if distributions would start to include libnetlink (so you can expect > it to be available), and somebody would write a man page. Actually, > isn't netlink from BSD ? If they also have a libnetlink, maybe there's > some documentation too. BSD has routing sockets, but they are very different from linux 2.2 netlink. 
I did both ;) libnetlink is included in SuSE 7.0 and it contains a manpage. I even sent it to Alexey, but somehow it doesn't seem to have appeared in standard iproute2 yet (or I missed it) I attached the manpage in case someone wants it. -Andi -- This is like TV. I don't like TV. --ikeVEW9yuYc//A+q Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="libnetlink.3" .TH libnetlink 3 .SH NAME libnetlink \- Access netlink service .SH SYNOPSIS .nf #include .br #include .br #include .br #include .sp int rtnl_open(struct rtnl_handle *rth, unsigned subscriptions) .sp int rtnl_wilddump_request(struct rtnl_handle *rth, int family, int type) .sp int rtnl_send(struct rtnl_handle *rth, char *buf, int len) .sp int rtnl_dump_request(struct rtnl_handle *rth, int type, void *req, int len) .sp int rtnl_dump_filter(struct rtnl_handle *rth, int (*filter)(struct sockaddr_nl *, struct nlmsghdr *n, void *), void *arg1, int (*junk)(struct sockaddr_nl *,struct nlmsghdr *n, void *), void *arg2) .sp int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n, pid_t peer, unsigned groups, struct nlmsghdr *answer, .br int (*junk)(struct sockaddr_nl *,struct nlmsghdr *n, void *), .br void *jarg) .sp int rtnl_listen(struct rtnl_handle *rtnl, int (*handler)(struct sockaddr_nl *,struct nlmsghdr *n, void *), void *jarg) .sp int rtnl_from_file(FILE *rtnl, int (*handler)(struct sockaddr_nl *,struct nlmsghdr *n, void *), void *jarg) .sp int addattr32(struct nlmsghdr *n, int maxlen, int type, __u32 data) .sp int addattr_l(struct nlmsghdr *n, int maxlen, int type, void *data, int alen) .sp int rta_addattr32(struct rtattr *rta, int maxlen, int type, __u32 data) .sp int rta_addattr_l(struct rtattr *rta, int maxlen, int type, void *data, int alen) .SH DESCRIPTION libnetlink provides a higher level interface to .BR rtnetlink(7). The read functions return 0 on success and a negative errno on failure. The send functions return the amount of data sent, or -1 on error.
.TP rtnl_open Open a rtnetlink socket and save state into the .B rth handle. This handle is passed to all subsequent calls. .B subscriptions is a bitmap of the rtnetlink multicast groups the socket will be member of. .TP rtnl_wilddump_request Request a full dump of the .B type database for .B family addresses. .B type is a rtnetlink message type. .\" XXX .TP rtnl_dump_request Request a full dump of the .B type data buffer into .B buf with maximum length of .B len. .B type is a rtnetlink message type. .TP rtnl_dump_filter Receive netlink data after a request and filter it. The .B filter callback checks if the received message is wanted. It gets the source address of the message, the message itself and .B arg1 as arguments. 0 as return means filter passed, negative is returned by .I rtnl_dump_filter as an error. NULL for .I filter means no filter. .B junk is used to filter messages not destined to the local socket. Only one message bundle is received. Unless there is no message pending, this function does not block. .TP rtnl_listen Receive netlink data after a request and pass it to .I handler. .B handler is a callback that gets the message source address, the message itself, and the .B jarg cookie as arguments. It will get called for all received messages. Only one message bundle is received. Unless there is no message pending, this function does not block. .TP rtnl_from_file Works like .I rtnl_listen, but reads a netlink message bundle from the file .B file and passes the messages to handler for parsing. The file contains raw data as received from a rtnetlink socket. .PP The following functions are useful to construct custom rtnetlink messages. For simple database dumping with filtering it is better to use the higher level functions above. See .BR rtnetlink(3) and .BR netlink(3) on how to generate a rtnetlink message. The following utility functions require a contiguous buffer that already contains a netlink message header and a rtnetlink request.
.TP rtnl_send Send the rtnetlink message in .B buf of length .B len to handle .B rth. .TP addattr32 Add a __u32 attribute of type .B type and with value .B data to netlink message .B n, which is part of a buffer of length .B maxlen. .TP addattr_l Add a variable length attribute of type .B type and with value .B data and .B alen length to netlink message .B n, which is part of a buffer of length .B maxlen. .B data is copied. .TP rta_addattr32 Initialize the rtnetlink attribute .B rta with a __u32 data value. .TP rta_addattr_l Initialize the rtnetlink attribute .B rta with a variable length data value. .SH BUGS The functions sometimes use fprintf and exit when a fatal error occurs. This library should be named librtnetlink. .SH AUTHORS netlink/rtnetlink was designed and written by Alexey Kuznetsov. Andi Kleen wrote the man page. .SH SEE ALSO .BR netlink(7), .BR rtnetlink(7) .br /usr/include/linux/rtnetlink.h --ikeVEW9yuYc//A+q-- From owner-netdev@oss.sgi.com Wed Jan 17 03:10:52 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 03:10:42 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:3002 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 03:10:37 -0800 Received: from fred.muc.de (noidentity@ns1247.munich.netsurf.de [195.180.235.247]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id MAA24499; Wed, 17 Jan 2001 12:09:52 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 80138E3911; Wed, 17 Jan 2001 12:03:31 +0100 (CET) Date: Wed, 17 Jan 2001 12:03:31 +0100 From: Andi Kleen To: Gleb Natapov Cc: Andi Kleen , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117120331.A1711@fred.local> References: <20010116193553.B5122@nbase.co.il> <20010116210345.A8319@fred.local> <20010117100506.D5122@nbase.co.il> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010117100506.D5122@nbase.co.il>; from gleb@nbase.co.il on Wed, Jan 17, 2001 at 09:06:34AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1743 Lines: 40 On Wed, Jan 17, 2001 at 09:06:34AM +0100, Gleb Natapov wrote: > On Tue, Jan 16, 2001 at 09:03:45PM +0100, Andi Kleen wrote: > > On Tue, Jan 16, 2001 at 06:38:28PM +0100, Gleb Natapov wrote: > > > Hello, > > > > > > Recently I noticed that when I simultaneously do 'up' to many network interfaces > > > (many is ~15) netlink drops part of the messages about interface state change and thus > > > my userspace tools don't know that some interfaces are in up state now. The error that > > > I get from netlink socket is "No buffer space available". > > > > > > After looking at the code I saw that the only way I can get such error from netlink > > > is if sk->rmem_allock is bigger than sk->rcvbuf. I can enlarge sk->rcvbuf, but for each > > > interface I receive six messages and each of this messages is smaller then 200 bytes. > > > the default size of sk->rcvbuf is 65535 bytes, so why messages about 15 interfaces can't > > > fit in default buffer size? > > > > Because the sk_buff header size is accounted too. > > sk_buffs are not lightweight. > > > > Here is how NLMSG_GOODSIZE is defined: > #define NLMSG_GOODSIZE (PAGE_SIZE - ((sizeof(struct sk_buff)+0xF)&~0xF)) > So sk_buff header is not an issue here. I think you're misunderstanding what NLMSG_GOODSIZE is used for. It is just used to avoid multi page allocations for netlink. The header is still accounted. 
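The accounting argument can be put into numbers with a rough sketch in C. Everything in it is an assumption made for illustration: the helper names are made up and are not kernel API, hdr_size merely stands in for sizeof(struct sk_buff), and the model simply divides the receive buffer by a fixed per-message charge. The figures come from the thread: a 65535-byte rcvbuf, messages under 200 bytes, and 15 interfaces times 6 messages, i.e. 90 messages.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of 16, like the (x + 0xF) & ~0xF idiom. */
size_t round_up16(size_t n)
{
    return (n + 0xF) & ~(size_t)0xF;
}

/* NLMSG_GOODSIZE-style value: what is left of one page after a rounded
 * header. hdr_size is a stand-in for sizeof(struct sk_buff); the real
 * size depends on kernel version and configuration. */
size_t goodsize(size_t page_size, size_t hdr_size)
{
    return page_size - round_up16(hdr_size);
}

/* Simplified model: number of messages that fit when each one is charged
 * `charge` bytes against a receive buffer of `rcvbuf` bytes. */
size_t messages_that_fit(size_t rcvbuf, size_t charge)
{
    return rcvbuf / charge;
}
```

Under this model, charging only the 200-byte payload would let 327 messages fit, far more than the 90 in question, while any per-message charge above roughly 728 bytes (65535 / 90) would not leave room for all 90. If a whole 4096-byte page were charged per message, only 15 would fit.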
It's also a bit buggy, because when the sk_buff header has an already rounded size it could forget to include the reference count in the data area, giving a 2page allocation. It should probably be (PAGE_SIZE - ((sizeof(struct sk_buff)+sizeof(unsigned long)+0xF)&~0xF)) [+ even different for 2.5 zero copy pskbs] -Andi -- This is like TV. I don't like TV. From owner-netdev@oss.sgi.com Wed Jan 17 03:26:11 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 03:26:01 -0800 Received: from lsb-catv-1-p021.vtxnet.ch ([212.147.5.21]:40975 "EHLO almesberger.net") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 03:25:43 -0800 Received: (from almesber@localhost) by almesberger.net (8.9.3/8.9.3) id MAA09370; Wed, 17 Jan 2001 12:24:13 +0100 Date: Wed, 17 Jan 2001 12:24:13 +0100 From: Werner Almesberger To: jamal Cc: "James R. Leu" , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010117122413.B18286@almesberger.net> References: <20010117040542.Y18286@almesberger.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from hadi@cyberus.ca on Tue, Jan 16, 2001 at 10:22:10PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1607 Lines: 35 jamal wrote: > He should still be able to do reliable communication by doing the > state maintanance within the user space app. > The mechanism is already in place. > [nlmsghdr.nlmsg_seq. as well as flags NLM_F_ACK, NLM_F_ECHO ] Well, having flow control alone doesn't help. You still have the problem of what to do when you generate more events that the receiver(s) are handling. In TCP, you simply block the sender, or return EAGAIN, pushing the responsibility one level up. I don't like the idea of ifconfig hanging because some netlink monitoring demon just decided to take a break. Neither do I like the idea of ifconfig coming back with an error in this case. 
So in this sense, netlink is certainly doing the right thing. The question here is of course if the entire communication model is appropriate. An alternative approach could be to just record that a notification is needed, wake the reader(s), and generate the actual message in response to the recvmsg call. This way, one could still pretend to user space that we have a simple message-based notification system, while underneath, we happily compress notifications. Note that this wouldn't work well in all cases. E.g. if you have lots of routes coming and going (and not coming back), you can't compress notifications. Of course, netlink may be the least of your worries in such a situation ... - Werner -- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH Werner.Almesberger@epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Wed Jan 17 03:37:01 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 03:36:52 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:21257 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 03:36:27 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0HBZ4P16224; Wed, 17 Jan 2001 13:35:04 +0200 From: Gleb Natapov Date: Wed, 17 Jan 2001 13:35:04 +0200 To: Andi Kleen Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117133504.A16180@nbase.co.il> References: <20010116193553.B5122@nbase.co.il> <20010116210345.A8319@fred.local> <20010117100506.D5122@nbase.co.il> <20010117120331.A1711@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010117120331.A1711@fred.local>; from ak@muc.de on Wed, Jan 17, 2001 at 12:03:31PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2205 Lines: 48 On Wed, Jan 17, 2001 at 12:03:31PM +0100, Andi Kleen wrote: > On Wed, Jan 17, 2001 at 09:06:34AM +0100, Gleb Natapov wrote: > > On Tue, Jan 16, 2001 at 09:03:45PM +0100, Andi Kleen wrote: > > > On Tue, Jan 16, 2001 at 06:38:28PM +0100, Gleb Natapov wrote: > > > > Hello, > > > > > > > > Recently I noticed that when I simultaneously do 'up' to many network interfaces > > > > (many is ~15) netlink drops part of the messages about interface state change and thus > > > > my userspace tools don't know that some interfaces are in up state now. The error that > > > > I get from netlink socket is "No buffer space available". > > > > > > > > After looking at the code I saw that the only way I can get such error from netlink > > > > is if sk->rmem_allock is bigger than sk->rcvbuf. I can enlarge sk->rcvbuf, but for each > > > > interface I receive six messages and each of this messages is smaller then 200 bytes. > > > > the default size of sk->rcvbuf is 65535 bytes, so why messages about 15 interfaces can't > > > > fit in default buffer size? > > > > > > Because the sk_buff header size is accounted too. > > > sk_buffs are not lightweight. > > > > > > > Here is how NLMSG_GOODSIZE is defined: > > #define NLMSG_GOODSIZE (PAGE_SIZE - ((sizeof(struct sk_buff)+0xF)&~0xF)) > > So sk_buff header is not an issue here. > > I think you're misunderstanding what NLMSG_GOODSIZE is used for. It is > just used to avoid multi page allocations for netlink. 
> > The header is still accounted. I understand that, but what I said in my original post is that we allocate whole page only for one short message. You say that we allocate whole page for one short message plus sk_buff header, but this doesn't make my point less valid: in order to deliver 200 bytes to socket we steal 4096 bytes from rcvbuf. > > It's also a bit buggy, because when the sk_buff header has an already rounded > size it could forget to include the reference count in the data area, giving > a 2page allocation. It should probably be > (PAGE_SIZE - ((sizeof(struct sk_buff)+sizeof(unsigned long)+0xF)&~0xF)) > > [+ even different for 2.5 zero copy pskbs] > > -Andi > > -- > This is like TV. I don't like TV. -- Gleb. From owner-netdev@oss.sgi.com Wed Jan 17 03:41:01 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 03:40:52 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:31241 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 03:40:48 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0HBdW916262; Wed, 17 Jan 2001 13:39:32 +0200 From: Gleb Natapov Date: Wed, 17 Jan 2001 13:39:32 +0200 To: Andi Kleen Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117133932.B16180@nbase.co.il> References: <20010116200600.C5122@nbase.co.il> <200101161828.VAA31502@ms2.inr.ac.ru> <20010117101720.F5122@nbase.co.il> <20010117120652.A1830@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010117120652.A1830@fred.local>; from ak@muc.de on Wed, Jan 17, 2001 at 12:06:52PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 907 Lines: 21 On Wed, Jan 17, 2001 at 12:06:52PM +0100, Andi Kleen wrote: > On Wed, Jan 17, 2001 at 09:19:45AM +0100, Gleb Natapov wrote: > > On Tue, Jan 16, 2001 at 09:28:34PM +0300, kuznet@ms2.inr.ac.ru wrote: > > > Hello! > > > > > > > Resync should be an exception and not the rule IMO. > > > > > > If in your system simultaneous UP of 100 interfaces is not an exception. 8) > > Forget about interfaces :). Suppose I have BGP router with 100.000 prefixes on > > one interface. When interface goes UP my router daemon adds all this prefixes > > to the kernel. This will generate burst of netlink messages right? > > Netlink sendmsg does flow control based on the buffer. > But if there is another process listening to netlink and it wants to know about routing table changes. Will kernel stop the process that adds routes to the routing table until reading process will empty the socket? I hope not. -- Gleb. 
From owner-netdev@oss.sgi.com Wed Jan 17 04:22:51 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 04:22:41 -0800 Received: from wirespeed.solidum.com ([207.35.224.226]:25505 "EHLO solidum.com") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 04:22:23 -0800 Received: from research.solidum.com (phobos.solidum.com [192.168.1.13]) by solidum.com (8.8.7/8.8.7) with ESMTP id HAA17891 for ; Wed, 17 Jan 2001 07:22:21 -0500 Received: from phobos.solidum.com (phobos.solidum.com [127.0.0.1]) by research.solidum.com (8.11.0/8.11.0) with ESMTP id f0HCLNZ23462 for ; Wed, 17 Jan 2001 07:21:23 -0500 (EST) Message-Id: <200101171221.f0HCLNZ23462@research.solidum.com> To: netdev@oss.sgi.com Subject: Re: netlink drops messages. In-Reply-To: Your message of "Wed, 17 Jan 2001 04:05:42 +0100." <20010117040542.Y18286@almesberger.net> Mime-Version: 1.0 (generated by tm-edit 1.5) Content-Type: text/plain; charset=US-ASCII Date: Wed, 17 Jan 2001 07:21:23 -0500 From: Michael Richardson Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 901 Lines: 18 >>>>> "Werner" == Werner Almesberger writes: Werner> you can expect it to be available), and somebody would write a Werner> man page. Actually, isn't netlink from BSD ? If they also have a Werner> libnetlink, maybe there's some documentation too. BSD has a routing socket, and the IPsec PFKEY socket in NRL was patterned after that. There is no equivalent to netlink in any BSD that I know. All configuration is done via ioctl's. Alan Cox' comments at various conferences about how open source is excessively modular at times probably applies to netlink. 
:!mcr!: | Solidum Systems Corporation, http://www.solidum.com Michael Richardson |For a better connected world,where data flows faster Personal: http://www.sandelman.ottawa.on.ca/People/Michael_Richardson/Bio.html mailto:mcr@sandelman.ottawa.on.ca mailto:mcr@solidum.com From owner-netdev@oss.sgi.com Wed Jan 17 04:26:11 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 04:26:01 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:18347 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 04:26:00 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id HAA20944; Wed, 17 Jan 2001 07:25:06 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Wed, 17 Jan 2001 07:25:06 -0500 (EST) From: jamal To: Gleb Natapov cc: , Subject: Re: routable interfaces In-Reply-To: <20010117114651.H5122@nbase.co.il> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1298 Lines: 29 On Wed, 17 Jan 2001, Gleb Natapov wrote: > I am not talking about sending or receiving packets here. OSPF has a strong notion > of interface. Interface has mtu, ip address and many OSPF specific parameters, > it has state that changes according to interface state machine, it has list of neighbours, > etc, etc. All this things are needed in order to be able to build adjacency through the > interface. > urgh. OK. I forgot about ospf "interfaces". I think the term interface is highly overloaded. > Currently zebra has one to one mapping between "kernel interfaces" and "zebra interfaces". > If I want to run OSPF (I don't know about other protocols) on secondary ips zebra should > be able to have "zebra interface" for each ip (and not for each interface), or, in other words > one to many mapping. 
This is only theory, I don't know if it's even possible to implement such > thing (you never know until you'll try ;)). It sounds reasonable especially from the perspective that you are forced to maintain distinct neighbor lists per "interface". I think the problem is solved if Zebra knows how to do NBMA OSPF. On the same link: at least the physical attributes can be shared (MTU etc) between all the "virtual links" or maybe not even that if you use path/per-route MTUs. cheers, jamal From owner-netdev@oss.sgi.com Wed Jan 17 05:16:01 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 05:15:52 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:65447 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 05:15:42 -0800 Received: from fred.muc.de (noidentity@ns1029.munich.netsurf.de [195.180.235.29]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id OAA11414; Wed, 17 Jan 2001 14:15:27 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 13C4CE3911; Wed, 17 Jan 2001 14:19:01 +0100 (CET) Date: Wed, 17 Jan 2001 14:19:01 +0100 From: Andi Kleen To: Gleb Natapov Cc: Andi Kleen , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117141900.A3308@fred.local> References: <20010116200600.C5122@nbase.co.il> <200101161828.VAA31502@ms2.inr.ac.ru> <20010117101720.F5122@nbase.co.il> <20010117120652.A1830@fred.local> <20010117133932.B16180@nbase.co.il> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010117133932.B16180@nbase.co.il>; from gleb@nbase.co.il on Wed, Jan 17, 2001 at 12:40:39PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1246 Lines: 28 On Wed, Jan 17, 2001 at 12:40:39PM +0100, Gleb Natapov wrote: > On Wed, Jan 17, 2001 at 12:06:52PM +0100, Andi Kleen wrote: > > On Wed, Jan 17, 2001 at 09:19:45AM +0100, Gleb Natapov wrote: > > > On Tue, Jan 16, 2001 at 09:28:34PM +0300, kuznet@ms2.inr.ac.ru wrote: > > > > Hello! > > > > > > > > > Resync should be an exception and not the rule IMO. > > > > > > > > If in your system simultaneous UP of 100 interfaces is not an exception. 8) > > > Forget about interfaces :). Suppose I have BGP router with 100.000 prefixes on > > > one interface. When interface goes UP my router daemon adds all this prefixes > > > to the kernel. This will generate burst of netlink messages right? > > > > Netlink sendmsg does flow control based on the buffer. > > > > But if there is another process listening to netlink and it wants to know about routing > table changes. Will kernel stop the process that adds routes to the routing table until > reading process will empty the socket? I hope not. It does, unless you made the socket non blocking, in which case you would get an EAGAIN and could wait using poll(2) for new write space. [that's no different from how normal non-blocking networking works] -Andi -- This is like TV. I don't like TV. 
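The non-blocking pattern mentioned above (try to send, get EAGAIN, wait with poll(2) for write space, retry) can be sketched generically in C. This is only an illustration of the ordinary socket idiom, not a statement about how the kernel's netlink path behaves, which is the question under discussion. The function name send_when_writable and the ETIMEDOUT convention are assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Try to send one message without blocking inside send(). If the socket
 * buffer is full (EAGAIN/EWOULDBLOCK), wait with poll() until the socket
 * is writable, then retry. Returns the number of bytes sent, or -1 with
 * errno set (ETIMEDOUT if poll() timed out). */
ssize_t send_when_writable(int fd, const void *buf, size_t len, int timeout_ms)
{
    for (;;) {
        ssize_t n = send(fd, buf, len, MSG_DONTWAIT);
        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;                    /* interrupted, just retry */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                   /* real error, give up */

        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        int r = poll(&pfd, 1, timeout_ms);
        if (r == 0) {
            errno = ETIMEDOUT;           /* no write space in time */
            return -1;
        }
        if (r < 0 && errno != EINTR)
            return -1;
    }
}
```

A caller that must not stall, such as a routing daemon, could pass a short timeout and queue the message itself when the call fails with ETIMEDOUT.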
From owner-netdev@oss.sgi.com Wed Jan 17 05:32:21 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 05:32:11 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:21163 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 05:32:00 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id IAA21048; Wed, 17 Jan 2001 08:30:55 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Wed, 17 Jan 2001 08:30:55 -0500 (EST) From: jamal To: Werner Almesberger cc: "James R. Leu" , , , Subject: Re: netlink drops messages. In-Reply-To: <20010117122413.B18286@almesberger.net> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2326 Lines: 53 On Wed, 17 Jan 2001, Werner Almesberger wrote: > jamal wrote: > > He should still be able to do reliable communication by doing the > > state maintanance within the user space app. > > The mechanism is already in place. > > [nlmsghdr.nlmsg_seq. as well as flags NLM_F_ACK, NLM_F_ECHO ] > > Well, having flow control alone doesn't help. You still have the > problem of what to do when you generate more events that the > receiver(s) are handling. In TCP, you simply block the sender, > or return EAGAIN, pushing the responsibility one level up. > I don't know if you have a serious enough problem to mandate congestion control. Reordering/gaps will happen if you lose a message. ACKs, seq numbers and timers should suffice plenty, no? The information about a link state for example doesn't disappear because a netlink message didn't make it to user space for whatever reason (eg the small rcvbuffer overflow). You can always pull the details off the kernel when some associated timer expires. 
The only problem i see is that if you were interested in a state > The question here is of course if the entire communication model > is appropriate. An alternative approach could be to just record > that a notification is needed, wake the reader(s), and generate > the actual message in response to the recvmsg call. This way, > one could still pretend to user space that we have a simple > message-based notification system, while underneath, we happily > compress notifications. > > Note that this wouldn't work well in all cases. E.g. if you have > lots of routes coming and going (and not coming back), you can't > compress notifications. Of course, netlink may be the least of > your worries in such a situation ... > This is the classical problem Alexey was pointing to. Unsolvable, unless you have infinite memory in place. The idea of compressed notifications (update "queues") in the kernel is beginning to look sane to me. But even for that you need to draw a line somewhere. And it is only useful if you are really interested in the sequence of states some entity has gone through. I suspect mostly you will be interested in knowing the state _current_ as opposed to "what happened? how did you get here? " kind of thing. I think the whole retransmit and reliability stuff needs to stay in user-space. cheers, jamal From owner-netdev@oss.sgi.com Wed Jan 17 05:50:41 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 05:50:21 -0800 Received: from nero.doit.wisc.edu ([128.104.17.130]:4 "EHLO nero.doit.wisc.edu") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 05:50:11 -0800 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.8.7/8.8.7) id HAA02509; Wed, 17 Jan 2001 07:49:53 -0600 Date: Wed, 17 Jan 2001 07:49:52 -0600 From: "James R. Leu" To: Andi Kleen Cc: Werner Almesberger , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117074952.C2459@doit.wisc.edu> Reply-To: jleu@mindspring.com References: <20010116124319.D1299@doit.wisc.edu> <200101161848.VAA31729@ms2.inr.ac.ru> <20010116131456.E1299@doit.wisc.edu> <20010117040542.Y18286@almesberger.net> <20010117121532.B1830@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010117121532.B1830@fred.local>; from ak@muc.de on Wed, Jan 17, 2001 at 12:15:32PM +0100 Organization: none Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1945 Lines: 49 Hello, On Wed, Jan 17, 2001 at 12:15:32PM +0100, Andi Kleen wrote: > On Wed, Jan 17, 2001 at 04:08:14AM +0100, Werner Almesberger wrote: > > James R. Leu wrote: > > > I'm not asking for the impossible. Sequence numbers and/or client > > > to server ACKs would solve the problem. > > > > So what do you do when the client doesn't ACK and you run out of buffer > > space ? Block all activities that may trigger netlink messages ? > > > > Obviously, in this case (interface up/down transitions), netlink doesn't > > scale well. A state-based interface would be better, e.g. netlink could > > generate a bit vector indicating the states (or the transitions, if it > > matters whether any have occurred), and update the vector until it has > > been read by the client. The question is of course whether we really > > need an optimized, scalable solution for this. > > A simple way is to delete ip addresses when you down an interface and use > regular SIOCGIFCONF. That is basically a dump of the entire interface table! If we are talking about 16K interfaces that is an awful lot of work just because an interface went down or up. > > However, in general, I get the impression that netlink is vastly > > over-engineered for most uses. 
Perhaps the situation could be improved > > if distributions would start to include libnetlink (so you can expect > > it to be available), and somebody would write a man page. Actually, > > isn't netlink from BSD ? If they also have a libnetlink, maybe there's > > some documentation too. > > BSD has routing sockets, but they are very different from linux 2.2 netlink. > > I did both ;) libnetlink is included in SuSE 7.0 and it contains a manpage. > I even sent it to Alexey, but somehow it doesn't seem to have appeared in > standard iproute2 yet (or I missed it) > > I attached the manpage in case someone wants it. > > > -Andi > > > -- > This is like TV. I don't like TV. -- James R. Leu From owner-netdev@oss.sgi.com Wed Jan 17 05:52:41 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 05:52:21 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:271 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 05:52:04 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0HDoZv16925; Wed, 17 Jan 2001 15:50:35 +0200 From: Gleb Natapov Date: Wed, 17 Jan 2001 15:50:35 +0200 To: Andi Kleen Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117155035.C16180@nbase.co.il> References: <20010116200600.C5122@nbase.co.il> <200101161828.VAA31502@ms2.inr.ac.ru> <20010117101720.F5122@nbase.co.il> <20010117120652.A1830@fred.local> <20010117133932.B16180@nbase.co.il> <20010117141900.A3308@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010117141900.A3308@fred.local>; from ak@muc.de on Wed, Jan 17, 2001 at 02:19:01PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1676 Lines: 33 On Wed, Jan 17, 2001 at 02:19:01PM +0100, Andi Kleen wrote: > On Wed, Jan 17, 2001 at 12:40:39PM +0100, Gleb Natapov wrote: > > On Wed, Jan 17, 2001 at 12:06:52PM +0100, Andi Kleen wrote: > > > On Wed, Jan 17, 2001 at 09:19:45AM +0100, Gleb Natapov wrote: > > > > On Tue, Jan 16, 2001 at 09:28:34PM +0300, kuznet@ms2.inr.ac.ru wrote: > > > > > Hello! > > > > > > > > > > > Resync should be an exception and not the rule IMO. > > > > > > > > > > If in your system simultaneous UP of 100 interfaces is not an exception. 8) > > > > Forget about interfaces :). Suppose I have BGP router with 100.000 prefixes on > > > > one interface. When interface goes UP my router daemon adds all this prefixes > > > > to the kernel. This will generate burst of netlink messages right? > > > > > > Netlink sendmsg does flow control based on the buffer. > > > > > > > But if there is another process listening to netlink and it wants to know about routing > > table changes. Will kernel stop the process that adds routes to the routing table until > > reading process will empty the socket? I hope not. > > It does, unless you made the socket non blocking, in which case you would > get an EAGAIN and could wait using poll(2) for new write space. 
> [that's no different from how normal non-blocking networking works] > You are trying to say that if I'll connect to my router via 9600 serial line and run 'ip monitor' there the routing daemon will not be able to feed routes to the kernel quicker than ip monitor will be able to read them and send output via slow serial line?! Somehow 9600 serial line become a bottleneck! Are you sure about that, or I misunderstood you? -- Gleb. From owner-netdev@oss.sgi.com Wed Jan 17 06:05:31 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 06:05:12 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:26127 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 06:04:52 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0HE38Z17016; Wed, 17 Jan 2001 16:03:08 +0200 From: Gleb Natapov Date: Wed, 17 Jan 2001 16:03:08 +0200 To: jamal Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: routable interfaces Message-ID: <20010117160308.D16180@nbase.co.il> References: <20010117114651.H5122@nbase.co.il> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from hadi@cyberus.ca on Wed, Jan 17, 2001 at 07:25:06AM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1045 Lines: 19 On Wed, Jan 17, 2001 at 07:25:06AM -0500, jamal wrote: > > Currently zebra has one to one mapping between "kernel interfaces" and "zebra interfaces". > > If I want to run OSPF (I don't know about other protocols) on secondary ips zebra should > > be able to have "zebra interface" for each ip (and not for each interface), or, in other words > > one to many mapping. This is only theory, I don't know if it's even possible to implement such > > thing (you never know until you'll try ;)). > > It sounds reasonable especially from the perspective that you are > forced to maintain distinct neighbor lists per "interface". 
I think the > problem is solved if Zebra knows how to do NBMA OSPF. On the same link: at > least the physical attributes can be shared (MTU etc) between all the "virtual > links" or maybe not even that if you use path/per-route MTUs. > OSPF knows about NBMA, NBMA has one network address and no broadcast capability; we face a different problem here: we have many network addresses assigned to one network. -- Gleb. From owner-netdev@oss.sgi.com Wed Jan 17 06:24:42 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 06:24:32 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:24747 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 06:24:16 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id JAA21258; Wed, 17 Jan 2001 09:23:29 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Wed, 17 Jan 2001 09:23:29 -0500 (EST) From: jamal To: Gleb Natapov cc: , Subject: Re: routable interfaces In-Reply-To: <20010117160308.D16180@nbase.co.il> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 601 Lines: 19 On Wed, 17 Jan 2001, Gleb Natapov wrote: > > OSPF knows about NBMA, NBMA has one network address and no broadcast capability; we face a different > problem here: we have many network addresses assigned to one network. > It could also be a socket connection over a physical interface capable of broadcast. What I meant was that they are a similar problem from a management perspective, in the sense that if you solve the socket-"interface" abstraction problem then you can use the same technique for multiple secondary addresses and present them to OSPF as different "interfaces".
cheers, jamal From owner-netdev@oss.sgi.com Wed Jan 17 08:40:13 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 08:40:04 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:18890 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 08:39:56 -0800 Received: from fred.muc.de (noidentity@ns1249.munich.netsurf.de [195.180.235.249]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id RAA11582; Wed, 17 Jan 2001 17:39:37 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 91B87E3911; Wed, 17 Jan 2001 17:13:10 +0100 (CET) Date: Wed, 17 Jan 2001 17:13:10 +0100 From: Andi Kleen To: "James R. Leu" Cc: Andi Kleen , Werner Almesberger , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010117171310.A5589@fred.local> References: <20010116124319.D1299@doit.wisc.edu> <200101161848.VAA31729@ms2.inr.ac.ru> <20010116131456.E1299@doit.wisc.edu> <20010117040542.Y18286@almesberger.net> <20010117121532.B1830@fred.local> <20010117074952.C2459@doit.wisc.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010117074952.C2459@doit.wisc.edu>; from jleu@mindspring.com on Wed, Jan 17, 2001 at 02:51:19PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1546 Lines: 36 On Wed, Jan 17, 2001 at 02:51:19PM +0100, James R. Leu wrote: > Hello, > > On Wed, Jan 17, 2001 at 12:15:32PM +0100, Andi Kleen wrote: > > On Wed, Jan 17, 2001 at 04:08:14AM +0100, Werner Almesberger wrote: > > > James R. Leu wrote: > > > > I'm not asking for the impossible. Sequence numbers and/or client > > > > to server ACKs would solve the problem. > > > > > > So what do you do when the client doesn't ACK and you run out of buffer > > > space ? Block all activities that may trigger netlink messages ? 
> > > > > > Obviously, in this case (interface up/down transitions), netlink doesn't > > > scale well. A state-based interface would be better, e.g. netlink could > > > generate a bit vector indicating the states (or the transitions, if it > > > matters whether any have occurred), and update the vector until it has > > > been read by the client. The question is of course whether we really > > > need an optimized, scalable solution for this. > > > > A simple way is to delete ip addresses when you down an interface and use > > regular SIOCGIFCONF. > > That is basically a dump of the entire inteface table! If we are talking > about 16K interfaces that is an awful lot of work just because an interface > when down or up. The thread was: when only a few interfaces go up/down then netlink messages work fine. Then someone complained that the netlink buffer overflows when too many interfaces go up/down. In this case you can do a whole resynchronization regularly (e.g. every minute) and do less work overall. -Andi From owner-netdev@oss.sgi.com Wed Jan 17 08:40:13 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 08:40:04 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:714 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 08:39:51 -0800 Received: from fred.muc.de (noidentity@ns1249.munich.netsurf.de [195.180.235.249]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id RAA11581; Wed, 17 Jan 2001 17:39:37 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 975B0E3BB8; Wed, 17 Jan 2001 17:14:38 +0100 (CET) Date: Wed, 17 Jan 2001 17:14:38 +0100 From: Andi Kleen To: Gleb Natapov Cc: Andi Kleen , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117171438.B5589@fred.local> References: <20010116200600.C5122@nbase.co.il> <200101161828.VAA31502@ms2.inr.ac.ru> <20010117101720.F5122@nbase.co.il> <20010117120652.A1830@fred.local> <20010117133932.B16180@nbase.co.il> <20010117141900.A3308@fred.local> <20010117155035.C16180@nbase.co.il> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010117155035.C16180@nbase.co.il>; from gleb@nbase.co.il on Wed, Jan 17, 2001 at 02:51:48PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 678 Lines: 16 On Wed, Jan 17, 2001 at 02:51:48PM +0100, Gleb Natapov wrote: > > You are trying to say that if I'll connect to my router via 9600 serial line and run 'ip monitor' > there the routing daemon will not be able to feed routes to the kernel quicker than ip monitor will > be able to read them and send output via slow serial line?! Somehow 9600 serial line become a > bottleneck! Are you sure about that, or I misunderstood you? You misunderstood me. The kernel side doesn't do any flow control, it just drops messages when the buffers fill up. The user side does flow control by blocking or EAGAIN, e.g. you submitting new routes. -Andi -- This is like TV. I don't like TV. From owner-netdev@oss.sgi.com Wed Jan 17 09:01:13 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 09:01:02 -0800 Received: from nero.doit.wisc.edu ([128.104.17.130]:13060 "EHLO nero.doit.wisc.edu") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 09:00:47 -0800 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.8.7/8.8.7) id LAA02736; Wed, 17 Jan 2001 11:00:35 -0600 Date: Wed, 17 Jan 2001 11:00:35 -0600 From: "James R. Leu" To: Andi Kleen Cc: Werner Almesberger , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117110035.G2459@doit.wisc.edu> Reply-To: jleu@mindspring.com References: <20010116124319.D1299@doit.wisc.edu> <200101161848.VAA31729@ms2.inr.ac.ru> <20010116131456.E1299@doit.wisc.edu> <20010117040542.Y18286@almesberger.net> <20010117121532.B1830@fred.local> <20010117074952.C2459@doit.wisc.edu> <20010117171310.A5589@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010117171310.A5589@fred.local>; from ak@muc.de on Wed, Jan 17, 2001 at 05:13:10PM +0100 Organization: none Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2475 Lines: 58 Hello, Comments at the bottom: On Wed, Jan 17, 2001 at 05:13:10PM +0100, Andi Kleen wrote: > On Wed, Jan 17, 2001 at 02:51:19PM +0100, James R. Leu wrote: > > Hello, > > > > On Wed, Jan 17, 2001 at 12:15:32PM +0100, Andi Kleen wrote: > > > On Wed, Jan 17, 2001 at 04:08:14AM +0100, Werner Almesberger wrote: > > > > James R. Leu wrote: > > > > > I'm not asking for the impossible. Sequence numbers and/or client > > > > > to server ACKs would solve the problem. > > > > > > > > So what do you do when the client doesn't ACK and you run out of buffer > > > > space ? Block all activities that may trigger netlink messages ? > > > > > > > > Obviously, in this case (interface up/down transitions), netlink doesn't > > > > scale well. A state-based interface would be better, e.g. netlink could > > > > generate a bit vector indicating the states (or the transitions, if it > > > > matters whether any have occurred), and update the vector until it has > > > > been read by the client. The question is of course whether we really > > > > need an optimized, scalable solution for this. > > > > > > A simple way is to delete ip addresses when you down an interface and use > > > regular SIOCGIFCONF. > > > > That is basically a dump of the entire inteface table! 
If we are talking > > about 16K interfaces that is an awful lot of work just because an interface > > when down or up. > > > The thread was: when only a few interfaces go up/down then netlink messages > work fine. > Then someone complained that the netlink buffer overflows when too many interfaces > go up/down. In this case you can do a whole resynchronization regularly (e.g. every > minute) and do less work overall. Sorry for taking the previous comment out of context. As for your last comment "resynchronization regularly": I disagree with it as well. :-) The reason a notification system like netlink is created is to prevent the clients from polling the kernel and doing egregious dumps of information. Simply pushing it off to the clients by making them poll for this information is a hack. A client should only have to dump the interface or routing table when it first connects; from then on its view of the interface and routing table should be kept consistent via incremental and timely updates. Period. If netlink cannot provide incremental, reliable and timely updates about the status of interfaces and routes then we should change it so it can. > -Andi Jim -- James R. Leu From owner-netdev@oss.sgi.com Wed Jan 17 09:19:42 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 09:19:32 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:62214 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 09:19:25 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0HHIC517764; Wed, 17 Jan 2001 19:18:12 +0200 From: Gleb Natapov Date: Wed, 17 Jan 2001 19:18:12 +0200 To: Andi Kleen Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages.
Message-ID: <20010117191811.E16180@nbase.co.il> References: <20010116200600.C5122@nbase.co.il> <200101161828.VAA31502@ms2.inr.ac.ru> <20010117101720.F5122@nbase.co.il> <20010117120652.A1830@fred.local> <20010117133932.B16180@nbase.co.il> <20010117141900.A3308@fred.local> <20010117155035.C16180@nbase.co.il> <20010117171438.B5589@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010117171438.B5589@fred.local>; from ak@muc.de on Wed, Jan 17, 2001 at 05:14:38PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1171 Lines: 22 On Wed, Jan 17, 2001 at 05:14:38PM +0100, Andi Kleen wrote: > On Wed, Jan 17, 2001 at 02:51:48PM +0100, Gleb Natapov wrote: > > > > You are trying to say that if I'll connect to my router via 9600 serial line and run 'ip monitor' > > there the routing daemon will not be able to feed routes to the kernel quicker than ip monitor will > > be able to read them and send output via slow serial line?! Somehow 9600 serial line become a > > bottleneck! Are you sure about that, or I misunderstood you? > > You misunderstood me. The kernel side doesn't do any flow control, it just drops messages when the > buffers fill up. The user side does flow control by blocking or EAGAIN, e.g. you submitting new > routes. > What user side? There are two sides: W feeds routes to the kernel, R reads netlink updates from the kernel. Who does flow control and how? W knows nothing about R; it simply adds routes to the kernel at the speed the kernel can process them. R tries its best to read messages from the kernel, but if it doesn't do this fast enough it will get ENOBUFS. The only one who can synchronize between W and R is the kernel, but I hope the kernel doesn't do this. -- Gleb.
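A minimal sketch of the reader side (R) discussed in this message, assuming R is an ordinary user-space process calling recv() on its netlink socket. The names reader_action and next_action are hypothetical and not part of any netlink API. The only behaviour taken from the thread is that the kernel drops messages when R's buffer fills, that R then sees ENOBUFS, and that R has to fall back to a full dump at that point.

```c
#include <errno.h>
#include <sys/types.h>

/* What reader R should do after one recv() on its netlink socket.
 * Hypothetical helper, for illustration only. */
enum reader_action {
    APPLY_UPDATE,   /* got a message: apply the incremental change */
    FULL_RESYNC,    /* kernel dropped messages: re-dump the tables */
    RETRY_LATER,    /* nothing to read right now, or call interrupted */
    GIVE_UP         /* socket closed or unexpected error */
};

/* n is the return value of recv(); err is errno saved right after it. */
enum reader_action next_action(ssize_t n, int err)
{
    if (n > 0)
        return APPLY_UPDATE;
    if (n == 0)
        return GIVE_UP;          /* treated here as a closed socket */
    if (err == ENOBUFS)
        return FULL_RESYNC;      /* R was too slow; its view is stale */
    if (err == EAGAIN || err == EWOULDBLOCK || err == EINTR)
        return RETRY_LATER;
    return GIVE_UP;
}
```

Nothing in this sketch slows down W. The cost of a slow R is paid by R alone, as a full dump after ENOBUFS.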
From owner-netdev@oss.sgi.com Wed Jan 17 09:45:02 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 09:44:54 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:47760 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 09:44:49 -0800 Received: from fred.muc.de (noidentity@ns1023.munich.netsurf.de [195.180.235.23]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id SAA22214; Wed, 17 Jan 2001 18:44:36 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 02D46E3BB8; Wed, 17 Jan 2001 18:50:57 +0100 (CET) Date: Wed, 17 Jan 2001 18:50:57 +0100 From: Andi Kleen To: Gleb Natapov Cc: Andi Kleen , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010117185057.B7146@fred.local> References: <20010116200600.C5122@nbase.co.il> <200101161828.VAA31502@ms2.inr.ac.ru> <20010117101720.F5122@nbase.co.il> <20010117120652.A1830@fred.local> <20010117133932.B16180@nbase.co.il> <20010117141900.A3308@fred.local> <20010117155035.C16180@nbase.co.il> <20010117171438.B5589@fred.local> <20010117191811.E16180@nbase.co.il> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010117191811.E16180@nbase.co.il>; from gleb@nbase.co.il on Wed, Jan 17, 2001 at 06:22:41PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1333 Lines: 33 On Wed, Jan 17, 2001 at 06:22:41PM +0100, Gleb Natapov wrote: > On Wed, Jan 17, 2001 at 05:14:38PM +0100, Andi Kleen wrote: > > On Wed, Jan 17, 2001 at 02:51:48PM +0100, Gleb Natapov wrote: > > > > > > You are trying to say that if I'll connect to my router via 9600 serial line and run 'ip monitor' > > > there the routing daemon will not be able to feed routes to the kernel quicker than ip monitor will > > > be able to read them and send output via slow serial line?! Somehow 9600 serial line become a > > > bottleneck! 
Are you sure about that, or I misunderstood you? > > > > You misunderstood me. The kernel side doesn't do any flow control, it just drops messages when the > > buffers fill up. The user side does flow control by blocking or EAGAIN, e.g. you submitting new > > routes. > > > What user side? There are to sides: W - feeds routes to the kernel, R - read netlink updates from the kernel. (W) is flow controlled. (R) is in a sense flow controlled too because it can only get information the kernel queues, but the kernel cheats a bit by dropping stuff when the buffer fills. > > Who does flow control and how? > > W knows nothing about R it simple adds routes to the kernel with the speed kernel can process them. Which is the flow control -- it gets blocked when the kernel cannot keep up. -Andi From owner-netdev@oss.sgi.com Wed Jan 17 09:45:13 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 09:45:02 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:49552 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 09:44:50 -0800 Received: from fred.muc.de (noidentity@ns1023.munich.netsurf.de [195.180.235.23]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id SAA22213; Wed, 17 Jan 2001 18:44:36 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id BEABAE3911; Wed, 17 Jan 2001 18:48:52 +0100 (CET) Date: Wed, 17 Jan 2001 18:48:52 +0100 From: Andi Kleen To: "James R. Leu" Cc: Andi Kleen , Werner Almesberger , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117184852.A7146@fred.local> References: <20010116124319.D1299@doit.wisc.edu> <200101161848.VAA31729@ms2.inr.ac.ru> <20010116131456.E1299@doit.wisc.edu> <20010117040542.Y18286@almesberger.net> <20010117121532.B1830@fred.local> <20010117074952.C2459@doit.wisc.edu> <20010117171310.A5589@fred.local> <20010117110035.G2459@doit.wisc.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010117110035.G2459@doit.wisc.edu>; from jleu@mindspring.com on Wed, Jan 17, 2001 at 06:00:42PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 3444 Lines: 74 On Wed, Jan 17, 2001 at 06:00:42PM +0100, James R. Leu wrote: > Hello, > > Comments at the bottom: > > On Wed, Jan 17, 2001 at 05:13:10PM +0100, Andi Kleen wrote: > > On Wed, Jan 17, 2001 at 02:51:19PM +0100, James R. Leu wrote: > > > Hello, > > > > > > On Wed, Jan 17, 2001 at 12:15:32PM +0100, Andi Kleen wrote: > > > > On Wed, Jan 17, 2001 at 04:08:14AM +0100, Werner Almesberger wrote: > > > > > James R. Leu wrote: > > > > > > I'm not asking for the impossible. Sequence numbers and/or client > > > > > > to server ACKs would solve the problem. > > > > > > > > > > So what do you do when the client doesn't ACK and you run out of buffer > > > > > space ? Block all activities that may trigger netlink messages ? > > > > > > > > > > Obviously, in this case (interface up/down transitions), netlink doesn't > > > > > scale well. A state-based interface would be better, e.g. netlink could > > > > > generate a bit vector indicating the states (or the transitions, if it > > > > > matters whether any have occurred), and update the vector until it has > > > > > been read by the client. The question is of course whether we really > > > > > need an optimized, scalable solution for this. > > > > > > > > A simple way is to delete ip addresses when you down an interface and use > > > > regular SIOCGIFCONF. 
> > > > > > That is basically a dump of the entire inteface table! If we are talking > > > about 16K interfaces that is an awful lot of work just because an interface > > > when down or up. > > > > > > The thread was: when only a few interfaces go up/down then netlink messages > > work fine. > > Then someone complained that the netlink buffer overflows when too many interfaces > > go up/down. In this case you can do a whole resynchronization regularly (e.g. every > > minute) and do less work overall. > > Sorry for taking the previous comment out of context. > > As far as your last comment "resynchronization regularly": I disagree > with it as well. :-) > > The reason a notification system like netlink is created is to prevent the > clients from polling the kernel and doing aggregious dumps of information. > > Simply pushing it off to the clients by making them poll for this information > is a hack. A client should only have to dump the inteface or routing table > when it first connects, from then on it's view of the interface and routing > table should be keep consistent via incremental and timly updates. Period. > > If netlink can not provide incremental, reliable and timly updates about the > status of interfaces and routes then we should change it so it can. I don't think polling is a hack. It's just that a single strategy for synchronizing information is not the best in all cases. When you have lots of data and only minor bits change then it is best to only transmit the differences. When most of the data changes it is better to just dump the whole data regularly. Netlink supports both strategies, in addition you can use SIOCGIFCONF which is a bit cheaper for the dump case because it doesn't queue. Netlink even tells you when you should switch strategies. 
What you should implement in your application depends on what you need to handle; usually you will implement both strategies, because netlink requires you to support resynchronization anyway, since it is not reliable. SIOCGIFCONF is just another optional optimization for the dump case. None of this is a hack. -Andi From owner-netdev@oss.sgi.com Wed Jan 17 10:54:45 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 10:54:35 -0800 Received: from nero.doit.wisc.edu ([128.104.17.130]:18692 "EHLO nero.doit.wisc.edu") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 10:54:15 -0800 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.8.7/8.8.7) id MAA02870; Wed, 17 Jan 2001 12:54:05 -0600 Date: Wed, 17 Jan 2001 12:54:05 -0600 From: "James R. Leu" To: Andi Kleen Cc: Werner Almesberger , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010117125405.J2459@doit.wisc.edu> Reply-To: jleu@mindspring.com References: <20010116124319.D1299@doit.wisc.edu> <200101161848.VAA31729@ms2.inr.ac.ru> <20010116131456.E1299@doit.wisc.edu> <20010117040542.Y18286@almesberger.net> <20010117121532.B1830@fred.local> <20010117074952.C2459@doit.wisc.edu> <20010117171310.A5589@fred.local> <20010117110035.G2459@doit.wisc.edu> <20010117184852.A7146@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010117184852.A7146@fred.local>; from ak@muc.de on Wed, Jan 17, 2001 at 06:48:52PM +0100 Organization: none Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 4324 Lines: 93 Hello, More comments at the bottom: On Wed, Jan 17, 2001 at 06:48:52PM +0100, Andi Kleen wrote: > On Wed, Jan 17, 2001 at 06:00:42PM +0100, James R. Leu wrote: > > Hello, > > > > Comments at the bottom: > > > > On Wed, Jan 17, 2001 at 05:13:10PM +0100, Andi Kleen wrote: > > > On Wed, Jan 17, 2001 at 02:51:19PM +0100, James R.
Leu wrote: > > > > Hello, > > > > > > > > On Wed, Jan 17, 2001 at 12:15:32PM +0100, Andi Kleen wrote: > > > > > On Wed, Jan 17, 2001 at 04:08:14AM +0100, Werner Almesberger wrote: > > > > > > James R. Leu wrote: > > > > > > > I'm not asking for the impossible. Sequence numbers and/or client > > > > > > > to server ACKs would solve the problem. > > > > > > > > > > > > So what do you do when the client doesn't ACK and you run out of buffer > > > > > > space ? Block all activities that may trigger netlink messages ? > > > > > > > > > > > > Obviously, in this case (interface up/down transitions), netlink doesn't > > > > > > scale well. A state-based interface would be better, e.g. netlink could > > > > > > generate a bit vector indicating the states (or the transitions, if it > > > > > > matters whether any have occurred), and update the vector until it has > > > > > > been read by the client. The question is of course whether we really > > > > > > need an optimized, scalable solution for this. > > > > > > > > > > A simple way is to delete ip addresses when you down an interface and use > > > > > regular SIOCGIFCONF. > > > > > > > > That is basically a dump of the entire interface table! If we are talking > > > > about 16K interfaces that is an awful lot of work just because an interface > > > > when down or up. > > > > > > > > > The thread was: when only a few interfaces go up/down then netlink messages > > > work fine. > > > Then someone complained that the netlink buffer overflows when too many interfaces > > > go up/down. In this case you can do a whole resynchronization regularly (e.g. every > > > minute) and do less work overall. > > > > Sorry for taking the previous comment out of context. > > > > As far as your last comment "resynchronization regularly": I disagree > > with it as well. :-) > > > > The reason a notification system like netlink is created is to prevent the > > clients from polling the kernel and doing aggregious dumps of information. 
> > > > Simply pushing it off to the clients by making them poll for this information > > is a hack. A client should only have to dump the interface or routing table > > when it first connects, from then on it's view of the interface and routing > > table should be keep consistent via incremental and timely updates. Period. > > > > If netlink can not provide incremental, reliable and timely updates about the > > status of interfaces and routes then we should change it so it can. > > I don't think polling is a hack. It's just that a single strategy for > synchronizing information is not the best in all cases. When you have lots > of data and only minor bits change then it is best to only transmit the > differences. When most of the data changes it is better to just dump the > whole data regularly. > > Netlink supports both strategies, in addition you can use SIOCGIFCONF which > is a bit cheaper for the dump case because it doesn't queue. > Netlink even tells you when you should switch strategies. > > What you should implement in your application depends on what you need to > handle, usually you will implement both strategies because netlink requires > you to support resynchronization anyways because it's not reliable. > SIOCGIFCONF is just another optional optimization for the dump case. > > None of this is a hack. If an IP routing stack running on Linux ever hopes to achieve sub 50ms end-to-end re-routing with 80K routes and 16K interface, netlink will have to guarantee reliable, incremental and timely updates. If netlink cannot do it, routing software developers will either look to another operating system that can provide this, or will completely by-pass the Linux IP stack and relegate Linux to be an embedded trampoline for them to run there own IP stack in user land. I still assert that netlink's requirement for a client to re-read the entire interface or routing table to maintain synchronization is a hack. > -Andi Jim -- James R. 
Leu From owner-netdev@oss.sgi.com Wed Jan 17 12:21:26 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 12:21:16 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:61447 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Wed, 17 Jan 2001 12:21:06 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA21094; Wed, 17 Jan 2001 23:20:49 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101172020.XAA21094@ms2.inr.ac.ru> Subject: Re: routable interfaces To: gleb@nbase.co.il (Gleb Natapov) Date: Wed, 17 Jan 2001 23:20:49 +0300 (MSK) Cc: hadi@cyberus.ca, netdev@oss.sgi.com In-Reply-To: <20010116154713.A5122@nbase.co.il> from "Gleb Natapov" at Jan 16, 1 03:47:13 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 390 Lines: 16 Hello! > but ospf cannot. Gated does this superbly. If zebra does not, it is buggy. Yes, ospf is a somewhat broken protocol in this respect; it is very inconvenient on shared media. Unfortunately, this aspect was repaired only in OSPFv6. > very complicated, and then what about the ifindexes of these interfaces? Addresses do not need any indices. They are "indices" themselves. Alexey From owner-netdev@oss.sgi.com Wed Jan 17 12:29:36 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 12:29:26 -0800 Received: from lsb-catv-1-p021.vtxnet.ch ([212.147.5.21]:63505 "EHLO almesberger.net") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 12:29:12 -0800 Received: (from almesber@localhost) by almesberger.net (8.9.3/8.9.3) id VAA11277; Wed, 17 Jan 2001 21:28:25 +0100 Date: Wed, 17 Jan 2001 21:28:24 +0100 From: Werner Almesberger To: "James R. Leu" Cc: netdev@oss.sgi.com Subject: Re: netlink drops messages.
Message-ID: <20010117212824.O18286@almesberger.net> References: <20010116124319.D1299@doit.wisc.edu> <200101161848.VAA31729@ms2.inr.ac.ru> <20010116131456.E1299@doit.wisc.edu> <20010117040542.Y18286@almesberger.net> <20010117121532.B1830@fred.local> <20010117074952.C2459@doit.wisc.edu> <20010117171310.A5589@fred.local> <20010117110035.G2459@doit.wisc.edu> <20010117184852.A7146@fred.local> <20010117125405.J2459@doit.wisc.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010117125405.J2459@doit.wisc.edu>; from jleu@mindspring.com on Wed, Jan 17, 2001 at 12:54:05PM -0600 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 450 Lines: 13 James R. Leu wrote: > and relegate Linux to be an embedded trampoline for them to run there own > IP stack in user land. Whee, "KA9Q outperforms Linux kernel TCP/IP". Mindcraft special at nine on MSNBC ;-)) - Werner -- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH Werner.Almesberger@epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Wed Jan 17 12:30:15 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 12:30:05 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:64007 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Wed, 17 Jan 2001 12:30:04 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA21143; Wed, 17 Jan 2001 23:29:57 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101172029.XAA21143@ms2.inr.ac.ru> Subject: Re: routable interfaces To: hadi@cyberus.ca (jamal) Date: Wed, 17 Jan 2001 23:29:57 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: from "jamal" at Jan 16, 1 10:42:49 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 324 Lines: 12 Hello! 
> But packets can "appear;->" to be coming from an aliased IP. They cannot, just because the procedure of mapping is ill defined. Even in the simplest cases it depends on concrete protocol. If kernel maps by mistake packet to other virtual interface, application using indices has no chances to recover. Alexey From owner-netdev@oss.sgi.com Wed Jan 17 12:37:06 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 12:36:56 -0800 Received: from lsb-catv-1-p021.vtxnet.ch ([212.147.5.21]:274 "EHLO almesberger.net") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 12:36:55 -0800 Received: (from almesber@localhost) by almesberger.net (8.9.3/8.9.3) id VAA11313; Wed, 17 Jan 2001 21:36:34 +0100 Date: Wed, 17 Jan 2001 21:36:34 +0100 From: Werner Almesberger To: Andi Kleen Cc: netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010117213634.P18286@almesberger.net> References: <20010116124319.D1299@doit.wisc.edu> <200101161848.VAA31729@ms2.inr.ac.ru> <20010116131456.E1299@doit.wisc.edu> <20010117040542.Y18286@almesberger.net> <20010117121532.B1830@fred.local> <20010117074952.C2459@doit.wisc.edu> <20010117171310.A5589@fred.local> <20010117110035.G2459@doit.wisc.edu> <20010117184852.A7146@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010117184852.A7146@fred.local>; from ak@muc.de on Wed, Jan 17, 2001 at 06:48:52PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 905 Lines: 21 Andi Kleen wrote: > What you should implement in your application depends on what you need to > handle, I think what James means is that there's nothing between being able to handle a small number of changes efficiently and the possibly very inefficient full dump, even though in some cases (e.g. 
if you have a large but fairly static collection of items which only change state, but don't get created or destroyed all the time), a selective yet scalable notification mechanism would be possible (as outlined in previous messages in this thread). But I'm not sure whether the scenario in question is sufficiently close to real life to justify any optimization. - Werner -- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH Werner.Almesberger@epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Wed Jan 17 12:59:25 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 12:59:15 -0800 Received: from nero.doit.wisc.edu ([128.104.17.130]:23556 "EHLO nero.doit.wisc.edu") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 12:55:41 -0800 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.8.7/8.8.7) id OAA02976; Wed, 17 Jan 2001 14:55:37 -0600 Date: Wed, 17 Jan 2001 14:55:37 -0600 From: "James R. Leu" To: Werner Almesberger Cc: netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010117145537.K2459@doit.wisc.edu> Reply-To: jleu@mindspring.com References: <200101161848.VAA31729@ms2.inr.ac.ru> <20010116131456.E1299@doit.wisc.edu> <20010117040542.Y18286@almesberger.net> <20010117121532.B1830@fred.local> <20010117074952.C2459@doit.wisc.edu> <20010117171310.A5589@fred.local> <20010117110035.G2459@doit.wisc.edu> <20010117184852.A7146@fred.local> <20010117125405.J2459@doit.wisc.edu> <20010117212824.O18286@almesberger.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010117212824.O18286@almesberger.net>; from Werner.Almesberger@epfl.ch on Wed, Jan 17, 2001 at 09:28:24PM +0100 Organization: none Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 918 Lines: 27 On Wed, Jan 17, 2001 at 09:28:24PM +0100, Werner Almesberger wrote: > James R. Leu wrote: > > and relegate Linux to be an embedded trampoline for them to run there own > > IP stack in user land. > > Whee, "KA9Q outperforms Linux kernel TCP/IP". Mindcraft special at nine > on MSNBC ;-)) I guess I should clarify that a bit ;-) Remember that routing/switching vendors have a hardware fast path. The signalling or routing protocol packets are the only traffic that go to the route/switch processor. Protocol developers look for features of the IP stack not neccessarily speed. If the IP stack doesn't provide enough features why use it? Jim > - Werner > > -- > _________________________________________________________________________ > / Werner Almesberger, ICA, EPFL, CH Werner.Almesberger@epfl.ch / > /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ -- James R. 
Leu From owner-netdev@oss.sgi.com Wed Jan 17 13:42:05 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 13:41:55 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:17836 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 13:41:32 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id QAA23394; Wed, 17 Jan 2001 16:40:41 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Wed, 17 Jan 2001 16:40:40 -0500 (EST) From: jamal To: "James R. Leu" cc: Werner Almesberger , Subject: Re: netlink drops messages. In-Reply-To: <20010117145537.K2459@doit.wisc.edu> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1986 Lines: 44 On Wed, 17 Jan 2001, James R. Leu wrote: > Remember that routing/switching vendors have a hardware fast path. The > signalling or routing protocol packets are the only traffic that go to the > route/switch processor. Protocol developers look for features of the IP > stack not neccessarily speed. If the IP stack doesn't provide enough features > why use it? > So what kind of features would you be looking for? I take it this is still related to netlink and the unreliability? >If an IP routing stack running on Linux ever hopes to achieve sub 50ms >end-to-end re-routing with 80K routes and 16K interface, netlink will >have to guarantee reliable, incremental and timely updates. If netlink >cannot do it, routing software developers will either look to another >operating system that can provide this, or will completely by-pass the >Linux IP stack and relegate Linux to be an embedded trampoline for them >to run there own IP stack in user land. This sounds a bit strong. Netlink happens to document that messages can be lost. 
Route sockets in BSD which are used by quite a few router vendors out there albeit for years now, I suppose do not make this obvious. If you have a fifo socket receive queue which fills up at some point, wouldn't you expect that BSD sockets also would drop those messages? Unless you have infinite memory, how is this different from Linux netlink as far as "unreliability" is concerned? > I still assert that netlink's requirement for a client to re-read the > entire interface or routing table to maintain synchronization is a hack. I think it's a design tradeoff. Under a really heavy load, if you can detect this, say by seeing the first ENOBUFS, polling is a _better_ solution. Latency is a non-issue since you are gonna be losing a lot of those messages. Under low load, latency becomes important, so async notification makes sense. Your mileage may vary, but if you make your app capable of this, it becomes more robust. cheers, jamal From owner-netdev@oss.sgi.com Wed Jan 17 16:34:46 2001 Received: by oss.sgi.com id ; Wed, 17 Jan 2001 16:34:36 -0800 Received: from SLASH.REM.CMU.EDU ([128.2.87.44]:43270 "EHLO SLASH.REM.CMU.EDU") by oss.sgi.com with ESMTP id ; Wed, 17 Jan 2001 16:34:15 -0800 Received: from localhost (mukesh@localhost) by SLASH.REM.CMU.EDU (8.9.3/8.9.3) with ESMTP id TAA06028 for ; Wed, 17 Jan 2001 19:43:11 -0500 X-Authentication-Warning: SLASH.REM.CMU.EDU: mukesh owned process doing -bs Date: Wed, 17 Jan 2001 19:43:11 -0500 (EST) From: mukesh agrawal X-Sender: mukesh@SLASH.REM.CMU.EDU To: netdev@oss.sgi.com Subject: internal drops with tcp, kernel 2.2.16 Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 799 Lines: 23 From looking over the code, running some experiments, and from the mailing list archive, it looks like packets simply get dropped inside the kernel when the queues overflow. This seems to interact badly with TCP.
If the initial retransmit time out is 3 seconds, and one of the first few packets gets dropped (before the rtt estimate is updated), then the connection is stalled for 3 seconds. (TCP packet gets dropped silently, then kernel waits an RTT for a response before retransmitting.) Two questions: 1. Can this really happen, or have I overlooked something? (Our experiments suggest that it can happen.) 2. If this can happen, is it worth changing? In particular, we might not want to wait the entire RTT before retransmitting the packet if it was dropped inside the kernel. Thanks. From owner-netdev@oss.sgi.com Thu Jan 18 01:20:47 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 01:20:28 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:14341 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Thu, 18 Jan 2001 01:20:11 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0I9IhJ05935; Thu, 18 Jan 2001 11:18:43 +0200 From: Gleb Natapov Date: Thu, 18 Jan 2001 11:18:43 +0200 To: Andi Kleen Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010118111843.A21503@nbase.co.il> References: <20010116200600.C5122@nbase.co.il> <200101161828.VAA31502@ms2.inr.ac.ru> <20010117101720.F5122@nbase.co.il> <20010117120652.A1830@fred.local> <20010117133932.B16180@nbase.co.il> <20010117141900.A3308@fred.local> <20010117155035.C16180@nbase.co.il> <20010117171438.B5589@fred.local> <20010117191811.E16180@nbase.co.il> <20010117185057.B7146@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010117185057.B7146@fred.local>; from ak@muc.de on Wed, Jan 17, 2001 at 06:50:57PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1819 Lines: 41 On Wed, Jan 17, 2001 at 06:50:57PM +0100, Andi Kleen wrote: > On Wed, Jan 17, 2001 at 06:22:41PM +0100, Gleb Natapov wrote: > > On Wed, Jan 17, 2001 at 05:14:38PM +0100, Andi Kleen wrote: > > > On Wed, Jan 17, 2001 at 02:51:48PM +0100, Gleb Natapov wrote: > > > > > > > > You are trying to say that if I'll connect to my router via 9600 serial line and run 'ip monitor' > > > > there the routing daemon will not be able to feed routes to the kernel quicker than ip monitor will > > > > be able to read them and send output via slow serial line?! Somehow 9600 serial line become a > > > > bottleneck! Are you sure about that, or I misunderstood you? > > > > > > You misunderstood me. The kernel side doesn't do any flow control, it just drops messages when the > > > buffers fill up. The user side does flow control by blocking or EAGAIN, e.g. you submitting new > > > routes. > > > > > What user side? There are to sides: W - feeds routes to the kernel, R - read netlink updates from the kernel. > > > (W) is flow controlled. > > (R) is in a sense flow controlled too because it can only get information > the kernel queues, but the kernel cheats a bit by dropping stuff when the > buffer fills. Exactly. And currently buffer fills very quickly. 
Alexey says that there is no difference between 16 and 116 messages but I disagree; if queue will be bigger, R will have a chance to empty it before W will run next time and adds more routes to the kernel. Less resyncs needed. If we can considerably enlarge queue size for free why not to do it? > > > > > > Who does flow control and how? > > > > W knows nothing about R it simple adds routes to the kernel with the speed kernel can process them. > > Which is the flow control -- it gets blocked when the kernel cannot keep up. > > > > -Andi -- Gleb. From owner-netdev@oss.sgi.com Thu Jan 18 04:34:58 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 04:34:38 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:33964 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Thu, 18 Jan 2001 04:34:18 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id HAA24789; Thu, 18 Jan 2001 07:33:31 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Thu, 18 Jan 2001 07:33:31 -0500 (EST) From: jamal To: mukesh agrawal cc: Subject: Re: internal drops with tcp, kernel 2.2.16 In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1369 Lines: 40 On Wed, 17 Jan 2001, mukesh agrawal wrote: > > From looking over the code, running some experiments, and from the mailing > list archive, it looks like packets simply get dropped inside the kernel > when the queues overflow. > > This seems to interact badly with TCP. If the initial retransmit time out > is 3 seconds, and one of the first few packets gets dropped (before the > rtt estimate is updated), then the connection is stalled for 3 seconds. > (TCP packet gets dropped silently, then kernel waits an RTT for a response > before retransmitting.) > > Two questions: > > 1. 
Can this really happen, or have I overlooked something? (Our > experiments suggest that it can happen.) > In OOM conditions, the drop will happen much earlier in the packet processing probably at the driver level, not the way you described it i.e at the socket queueing time. One thing that will NEVER happen: remote end receives an ACK for a packet that is then dropped in the kernel. If it gets dropped before being processed then it's a c'est la vie. I dont know how any other OS can solve this without (as Alexey nicely put it) infinite memory. TCP is, by design, resilient to this. cheers, jamal > 2. If this can happen, is it worth changing? In particular, we might not > want to wait the entire RTT before retransmitting the packet if it was > dropped inside the kernel. > From owner-netdev@oss.sgi.com Thu Jan 18 04:36:48 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 04:36:38 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:1774 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Thu, 18 Jan 2001 04:36:25 -0800 Received: from fred.muc.de (noidentity@ns1021.munich.netsurf.de [195.180.235.21]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id NAA12118; Thu, 18 Jan 2001 13:36:18 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 9988DE3BB8; Thu, 18 Jan 2001 12:51:32 +0100 (CET) Date: Thu, 18 Jan 2001 12:51:32 +0100 From: Andi Kleen To: mukesh agrawal Cc: netdev@oss.sgi.com Subject: Re: internal drops with tcp, kernel 2.2.16 Message-ID: <20010118125132.A3272@fred.local> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: ; from mukesh@cs.cmu.edu on Thu, Jan 18, 2001 at 01:35:33AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 918 Lines: 22 On Thu, Jan 18, 2001 at 01:35:33AM +0100, mukesh agrawal wrote: > 2. If this can happen, is it worth changing? 
In particular, we might not > want to wait the entire RTT before retransmitting the packet if it was > dropped inside the kernel. 2.4 has most of the infrastructure needed for that: TCP gets an error code now when the device queue overflows. It currently doesn't send more aggressively when this happens, but just doesn't advance snd_nxt, making established behave a bit better. I guess you could experiment with more aggressive reactions, especially for initial SYNs (no handling of this case there currently) The common problem of the dynamic IP changing is already mostly but not completely covered by the ip_dynaddr socket address rewrite hack. The main problem left in there is in the user space resolver library btw with its too short timeout. -Andi -- This is like TV. I don't like TV. From owner-netdev@oss.sgi.com Thu Jan 18 04:36:58 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 04:36:48 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:41710 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Thu, 18 Jan 2001 04:36:41 -0800 Received: from fred.muc.de (noidentity@ns1021.munich.netsurf.de [195.180.235.21]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id NAA12119; Thu, 18 Jan 2001 13:36:18 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id E1712E3BB9; Thu, 18 Jan 2001 12:59:36 +0100 (CET) Date: Thu, 18 Jan 2001 12:59:36 +0100 From: Andi Kleen To: jamal Cc: Werner Almesberger , "James R. Leu" , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages.
Message-ID: <20010118125936.B3272@fred.local> References: <20010117122413.B18286@almesberger.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: ; from hadi@cyberus.ca on Wed, Jan 17, 2001 at 02:33:24PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2153 Lines: 47 On Wed, Jan 17, 2001 at 02:33:24PM +0100, jamal wrote: > > The question here is of course if the entire communication model > > is appropriate. An alternative approach could be to just record > > that a notification is needed, wake the reader(s), and generate > > the actual message in response to the recvmsg call. This way, > > one could still pretend to user space that we have a simple > > message-based notification system, while underneath, we happily > > compress notifications. > > > > Note that this wouldn't work well in all cases. E.g. if you have > > lots of routes coming and going (and not coming back), you can't > > compress notifications. Of course, netlink may be the least of > > your worries in such a situation ... > > > > This is the classical problem Alexey was pointing to. Unsolvable, unless > you have infinite memory in place. Or you hack the kernel to directly update a shared memory with the needed information with appropiate locking. Then there is no buffer to overflow, at worst you could get some problems with the locks in overload cases. I guess the router vendors who want to use Linux as a base could certainly do that, the changes are not very complicated. For a general linux interface it would be serious overkill though. > The idea of compressed notifications (update "queues") in the kernel is > beggining to look sane to me. But even for that you need to draw a line > somewhere. And it is only useful if you are really interested in the > sequence of states some entity has gone through. 
I suspect mostly you > will be interested in knowing the state _current_ as opposed to "what > happened? how did you get here? " kind of thing. I suspect it first needs someone to show that it can get a real problem in a real world case and cannot get solved there by reserving a few MB for the queue. ("Premature optimization is the root of all evil...") I suspect if someone is really seriously expecting to handle hundreds of interface up/down per seconds they would opt for shared memory. For routes you do not even need kernel support, because you can do that privately with the routing daemon. -Andi From owner-netdev@oss.sgi.com Thu Jan 18 04:39:18 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 04:39:08 -0800 Received: from a203-167-249-89.reverse.clear.net.nz ([203.167.249.89]:64267 "HELO metastasis.f00f.org") by oss.sgi.com with SMTP id ; Thu, 18 Jan 2001 04:39:07 -0800 Received: by metastasis.f00f.org (Postfix, from userid 1000) id BE0C3A435; Fri, 19 Jan 2001 01:39:04 +1300 (NZDT) Date: Fri, 19 Jan 2001 01:39:04 +1300 From: Chris Wedgwood To: Gleb Natapov Cc: Andi Kleen , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. 
Message-ID: <20010119013904.B29530@metastasis.f00f.org> References: <200101161828.VAA31502@ms2.inr.ac.ru> <20010117101720.F5122@nbase.co.il> <20010117120652.A1830@fred.local> <20010117133932.B16180@nbase.co.il> <20010117141900.A3308@fred.local> <20010117155035.C16180@nbase.co.il> <20010117171438.B5589@fred.local> <20010117191811.E16180@nbase.co.il> <20010117185057.B7146@fred.local> <20010118111843.A21503@nbase.co.il> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20010118111843.A21503@nbase.co.il>; from gleb@nbase.co.il on Thu, Jan 18, 2001 at 11:18:43AM +0200 X-No-Archive: Yes Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 896 Lines: 22 On Thu, Jan 18, 2001 at 11:18:43AM +0200, Gleb Natapov wrote: Exactly. And currently buffer fills very quickly. Alexey says that there is no difference between 16 and 116 messages but I disagree; if queue will be bigger, R will have a chance to empty it before W will run next time and adds more routes to the kernel. Less resyncs needed. If we can considerably enlarge queue size for free why not to do it? What about something like the mmap'd AF_PACKET code, basically each application case register a user-land buffer for these sockets and also potentially a signal for overflow, the messages get written to this buffer and in the case of overflow a signal is sent and writing stops; the application can then manually resync and start reading again... Routing daemons can register larger buffers to prevent or reduce the number of times it might overflow. 
--cw From owner-netdev@oss.sgi.com Thu Jan 18 04:42:08 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 04:41:58 -0800 Received: from a203-167-249-89.reverse.clear.net.nz ([203.167.249.89]:268 "HELO metastasis.f00f.org") by oss.sgi.com with SMTP id ; Thu, 18 Jan 2001 04:41:50 -0800 Received: by metastasis.f00f.org (Postfix, from userid 1000) id BCF09A435; Fri, 19 Jan 2001 01:41:48 +1300 (NZDT) Date: Fri, 19 Jan 2001 01:41:48 +1300 From: Chris Wedgwood To: Andi Kleen Cc: jamal , Werner Almesberger , "James R. Leu" , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010119014148.C29530@metastasis.f00f.org> References: <20010117122413.B18286@almesberger.net> <20010118125936.B3272@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20010118125936.B3272@fred.local>; from ak@muc.de on Thu, Jan 18, 2001 at 12:59:36PM +0100 X-No-Archive: Yes Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 497 Lines: 13 On Thu, Jan 18, 2001 at 12:59:36PM +0100, Andi Kleen wrote: Or you hack the kernel to directly update a shared memory with the needed information with appropiate locking. Then there is no buffer to overflow, at worst you could get some problems with the locks in overload cases. freaky timing -- I just suggested something along those lines myself; I'm not sure how we would implement locking and overflow though, I suggested signals but that is perhaps a little gross. 
--cw From owner-netdev@oss.sgi.com Thu Jan 18 04:44:08 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 04:43:58 -0800 Received: from a203-167-249-89.reverse.clear.net.nz ([203.167.249.89]:1804 "HELO metastasis.f00f.org") by oss.sgi.com with SMTP id ; Thu, 18 Jan 2001 04:43:49 -0800 Received: by metastasis.f00f.org (Postfix, from userid 1000) id A81A0A435; Fri, 19 Jan 2001 01:43:47 +1300 (NZDT) Date: Fri, 19 Jan 2001 01:43:47 +1300 From: Chris Wedgwood To: Andi Kleen Cc: mukesh agrawal , netdev@oss.sgi.com Subject: Re: internal drops with tcp, kernel 2.2.16 Message-ID: <20010119014347.D29530@metastasis.f00f.org> References: <20010118125132.A3272@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20010118125132.A3272@fred.local>; from ak@muc.de on Thu, Jan 18, 2001 at 12:51:32PM +0100 X-No-Archive: Yes Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 497 Lines: 14 On Thu, Jan 18, 2001 at 12:51:32PM +0100, Andi Kleen wrote: The common problem of the dynamic IP changing is already mostly but not completely covered by the ip_dynaddr socket address rewrite hack. The main problem left in there is in the user space resolver library btw with its too short timeout. Is there any reason why the ip_dynaddr stuff can't be compiled out for those like myself who find it an esthetically unpleasant kernel hack to solve a user-land problem? --cw From owner-netdev@oss.sgi.com Thu Jan 18 04:47:38 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 04:47:28 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:35500 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Thu, 18 Jan 2001 04:47:18 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.)
with ESMTP id HAA24805; Thu, 18 Jan 2001 07:46:14 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Thu, 18 Jan 2001 07:46:14 -0500 (EST) From: jamal To: Andi Kleen cc: Werner Almesberger , "James R. Leu" , , , Subject: Re: netlink drops messages. In-Reply-To: <20010118125936.B3272@fred.local> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 606 Lines: 18 On Thu, 18 Jan 2001, Andi Kleen wrote: > I suspect if someone is really seriously expecting to handle hundreds of > interface up/down per seconds they would opt for shared memory. For routes > you do not even need kernel support, because you can do that privately with > the routing daemon. Sounds nice. I think hundreds of interface up/down per seconds is extreme end unless you have "dynamic" type of devices like L2TP which come and go (lets not go into the interface discussion again ;-<). Having said that, the router has to be robust to hundreds of interface up/down per seconds. cheers, jamal From owner-netdev@oss.sgi.com Thu Jan 18 05:19:28 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 05:19:18 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:37548 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Thu, 18 Jan 2001 05:18:56 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) 
with ESMTP id IAA24911; Thu, 18 Jan 2001 08:18:10 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Thu, 18 Jan 2001 08:18:10 -0500 (EST) From: jamal To: Andi Kleen cc: mukesh agrawal , Subject: Re: internal drops with tcp, kernel 2.2.16 In-Reply-To: <20010118125132.A3272@fred.local> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1334 Lines: 35 On Thu, 18 Jan 2001, Andi Kleen wrote: > On Thu, Jan 18, 2001 at 01:35:33AM +0100, mukesh agrawal wrote: > > 2. If this can happen, is it worth changing? In particular, we might not > > want to wait the entire RTT before retransmitting the packet if it was > > dropped inside the kernel. > > 2.4 has most of the infrastructure needed for that: TCP gets an error code now > when the device queue overflows. It currently doesn't send more aggressively > when this happens, but just doesn't advance snd_nxt, making established behave > a bit better. > > I guess you could experiment with more aggressive reactions, especially > for initial SYNs (no handling of this case there currently) > Sorry, I totally took his question out of context in my response by talking about the receive path (it's that netlink thing!);-< Your suggestion sounds good and the code change is trivial (tcp_transmit_skb() already returns NET_XMIT_CN in these situations). I'd be interested in seeing the backoff algorithm though. The packets might be getting dropped because of a lot of factors: some other socket hogging resources, interrupt livelock, slow devices, bandwidth control etc. I wonder if you have to adapt different things in a slightly different manner. I also wonder whether not advancing snd_nxt is a good enough solution.
cheers, jamal From owner-netdev@oss.sgi.com Thu Jan 18 10:44:19 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 10:44:00 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:38052 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Thu, 18 Jan 2001 10:43:50 -0800 Received: from fred.muc.de (noidentity@ns1120.munich.netsurf.de [195.180.235.120]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id TAA04202; Thu, 18 Jan 2001 19:43:21 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 651EDE3913; Thu, 18 Jan 2001 15:24:52 +0100 (CET) Date: Thu, 18 Jan 2001 15:24:52 +0100 From: Andi Kleen To: jamal Cc: Andi Kleen , Werner Almesberger , "James R. Leu" , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010118152452.A5289@fred.local> References: <20010118125936.B3272@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: ; from hadi@cyberus.ca on Thu, Jan 18, 2001 at 01:47:04PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 790 Lines: 22 On Thu, Jan 18, 2001 at 01:47:04PM +0100, jamal wrote: > > > On Thu, 18 Jan 2001, Andi Kleen wrote: > > > I suspect if someone is really seriously expecting to handle hundreds of > > interface up/down per seconds they would opt for shared memory. For routes > > you do not even need kernel support, because you can do that privately with > > the routing daemon. > > Sounds nice. I think hundreds of interface up/down per seconds is extreme > end unless you have "dynamic" type of devices like L2TP which come > and go (lets not go into the interface discussion again ;-<). > Having said that, the router has to be robust to hundreds of interface > up/down per seconds. It is when you go to "dump netlink state every 60s" mode on overload. -Andi -- This is like TV. I don't like TV. 
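[Editorial sketch, not part of the original thread.] The fallback that Andi alludes to here, and that jamal described earlier (asynchronous notifications under low load, polling with full dumps once the first ENOBUFS is seen), can be written down as a small mode switch. The names sync_mode, next_sync_mode and dump_clean below are hypothetical and chosen only for illustration; the one detail taken from the thread is that an overflowed netlink receive queue is reported to the reader as ENOBUFS.

```c
#include <errno.h>

/* Hypothetical names throughout; this is not an existing netlink API. */
enum sync_mode {
    SYNC_ASYNC,   /* trust incremental notifications */
    SYNC_POLL     /* stop trusting them, re-dump state periodically */
};

/*
 * recv_errno: errno from the last recv() on the notification socket,
 *             or 0 if the call succeeded.
 * dump_clean: nonzero if a full dump has just completed and no
 *             overflow was reported while it ran.
 */
enum sync_mode next_sync_mode(enum sync_mode cur, int recv_errno,
                              int dump_clean)
{
    if (recv_errno == ENOBUFS)
        return SYNC_POLL;      /* messages were lost: resync by polling */
    if (cur == SYNC_POLL && dump_clean)
        return SYNC_ASYNC;     /* resynchronised and quiet again */
    return cur;
}
```

A reader loop would call next_sync_mode() after every recv() and after every completed dump. How long to stay in polling mode, and how often to dump, is exactly the tuning question the following messages argue about.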
From owner-netdev@oss.sgi.com Thu Jan 18 11:23:19 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 11:23:10 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:37906 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Thu, 18 Jan 2001 11:22:56 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA08028; Thu, 18 Jan 2001 22:22:38 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101181922.WAA08028@ms2.inr.ac.ru> Subject: Re: internal drops with tcp, kernel 2.2.16 To: ak@muc.DE (Andi Kleen) Date: Thu, 18 Jan 2001 22:22:38 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: <20010118125132.A3272@fred.local> from "Andi Kleen" at Jan 18, 1 03:45:01 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 689 Lines: 19 Hello! > On Thu, Jan 18, 2001 at 01:35:33AM +0100, mukesh agrawal wrote: > > 2. If this can happen, is it worth changing? In particular, we might not > > want to wait the entire RTT before retransmitting the packet if it was > > dropped inside the kernel. > > 2.4 has most of the infrastructure needed for that: TCP gets an error code now > when the device queue overflows. It currently doesn't send more aggressively Really it does. rto backoff is not applied in this case and tcp enters congestion avoidance rather than to slow start. In fact local drop is equivalent to ECN, with all its advantages. What's about SYNs, it is lacune. It should be repaired and this is easy.
Alexey From owner-netdev@oss.sgi.com Thu Jan 18 12:55:39 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 12:55:30 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:2764 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Thu, 18 Jan 2001 12:55:21 -0800 Received: from fred.muc.de (noidentity@ns1201.munich.netsurf.de [195.180.235.201]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id VAA21509; Thu, 18 Jan 2001 21:55:04 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 62CF2E3913; Thu, 18 Jan 2001 21:59:39 +0100 (CET) Date: Thu, 18 Jan 2001 21:59:39 +0100 From: Andi Kleen To: kuznet@ms2.inr.ac.ru Cc: Andi Kleen , netdev@oss.sgi.com Subject: Re: internal drops with tcp, kernel 2.2.16 Message-ID: <20010118215939.A9723@fred.local> References: <20010118125132.A3272@fred.local> <200101181922.WAA08028@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <200101181922.WAA08028@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Thu, Jan 18, 2001 at 08:24:04PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 986 Lines: 24 On Thu, Jan 18, 2001 at 08:24:04PM +0100, A.N.Kuznetsov wrote: > > 2.4 has most of the infrastructure needed for that: TCP gets an error code now > > when the device queue overflows. It currently doesn't send more aggressively > > Really it does. rto backoff is not applied in this case and tcp enters > congestion avoidance rather than to slow start. > > In fact local drop is equivalent to ECN, with all its advantages. Ok for retransmits, but as far as I can see not for first transmits. In fact the code is racy, packets_out is not incremented and the retransmit handler bails out when it is zero. > > > What's about SYNs, it is lacune. It should be repaired and this is easy.
It will usually not help because it doesn't change the resolver in the glibc (most programs resolve names first and depend on these packets). What I usually do is to use a local dns proxy with more aggressive dns retransmit. Currently it cannot detect queue drops from user space though. -Andi From owner-netdev@oss.sgi.com Thu Jan 18 13:35:01 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 13:34:41 -0800 Received: from kanga.kvack.org ([216.129.200.3]:59919 "EHLO kanga.kvack.org") by oss.sgi.com with ESMTP id ; Thu, 18 Jan 2001 13:34:37 -0800 Received: (from localhost user: 'blah', uid#63042) by kanga.kvack.org with SMTP id ; Thu, 18 Jan 2001 16:31:41 -0500 Date: Thu, 18 Jan 2001 16:31:41 -0500 (EST) From: "Benjamin C.R. LaHaise" To: Andi Kleen cc: jamal , Werner Almesberger , "James R. Leu" , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. In-Reply-To: <20010118152452.A5289@fred.local> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 505 Lines: 14 On Thu, 18 Jan 2001, Andi Kleen wrote: > On Thu, Jan 18, 2001 at 01:47:04PM +0100, jamal wrote: > > Sounds nice. I think hundreds of interface up/down per seconds is extreme > > end unless you have "dynamic" type of devices like L2TP which come > > and go (lets not go into the interface discussion again ;-<). > > Having said that, the router has to be robust to hundreds of interface > > up/down per seconds. > > It is when you go to "dump netlink state every 60s" mode on overload. 
Arguing about From owner-netdev@oss.sgi.com Thu Jan 18 13:48:41 2001 Received: by oss.sgi.com id ; Thu, 18 Jan 2001 13:48:31 -0800 Received: from kanga.kvack.org ([216.129.200.3]:62479 "EHLO kanga.kvack.org") by oss.sgi.com with ESMTP id ; Thu, 18 Jan 2001 13:48:17 -0800 Received: (from localhost user: 'blah', uid#63042) by kanga.kvack.org with SMTP id ; Thu, 18 Jan 2001 16:45:32 -0500 Date: Thu, 18 Jan 2001 16:45:32 -0500 (EST) From: "Benjamin C.R. LaHaise" To: Andi Kleen cc: jamal , Werner Almesberger , "James R. Leu" , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. In-Reply-To: <20010118152452.A5289@fred.local> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 963 Lines: 21 On Thu, 18 Jan 2001, Andi Kleen wrote: > It is when you go to "dump netlink state every 60s" mode on overload. 60 esconds is way too long, but arguing about the length of polls isn't a useful way to solve the problem. A good example for the characteristics needed comes from the typical ISP setting. If an L2TP server goes down for whatever reason and eventually comes back up, the system ends up with several thousand users, perhaps tens of thousands, all reconnecting within the same 3 second interval. If the connection takes more than a few seconds to come up, then the user will disconnect and attempt to reconnect -- presenting a further overload on the system (this is a real world example a local ISP has encountered). So far the shared memory idea sounds like it holds the most promise. I could see having a series of device ids, states and generation #'s would make the whole thing very efficient for both small and large state changes. 
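A minimal sketch of how such a table of device ids, states and generation numbers might look, written as single-threaded user-space C. All names here (dev_table, dev_slot, dev_table_set, dev_table_poll, MAX_DEVS) are hypothetical and not taken from any kernel patch. A real shared-memory version would also need memory barriers or a seqlock-style retry on the reader side, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEVS 8   /* hypothetical table size, for illustration only */

struct dev_slot {
    uint32_t dev_id;      /* device identifier */
    uint32_t state;       /* e.g. 0 = down, 1 = up */
    uint64_t generation;  /* table generation at this slot's last change */
};

struct dev_table {
    uint64_t generation;  /* bumped once per state change */
    struct dev_slot slot[MAX_DEVS];
};

/* Writer side: record a state change and stamp it with a new generation. */
static void dev_table_set(struct dev_table *t, size_t idx,
                          uint32_t dev_id, uint32_t state)
{
    assert(t != NULL && idx < MAX_DEVS);
    t->generation++;
    t->slot[idx].dev_id = dev_id;
    t->slot[idx].state = state;
    t->slot[idx].generation = t->generation;
}

/*
 * Reader side: store the indices of all slots changed since *last_seen
 * into changed[], advance *last_seen, and return how many were found.
 */
static size_t dev_table_poll(const struct dev_table *t, uint64_t *last_seen,
                             size_t changed[MAX_DEVS])
{
    assert(t != NULL && last_seen != NULL && changed != NULL);
    size_t n = 0;
    for (size_t i = 0; i < MAX_DEVS; i++) {
        if (t->slot[i].generation > *last_seen)
            changed[n++] = i;
    }
    *last_seen = t->generation;
    return n;
}
```

The property this is meant to illustrate: a device that flaps many times between two polls costs the reader one slot visit and yields only its latest state, instead of one queued message per change.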
-ben From owner-netdev@oss.sgi.com Fri Jan 19 04:37:29 2001 Received: by oss.sgi.com id ; Fri, 19 Jan 2001 04:37:20 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:17588 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Fri, 19 Jan 2001 04:36:53 -0800 Received: from fred.muc.de (noidentity@ns1185.munich.netsurf.de [195.180.235.185]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id NAA15812; Fri, 19 Jan 2001 13:36:24 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 33F6DE3913; Fri, 19 Jan 2001 00:14:54 +0100 (CET) Date: Fri, 19 Jan 2001 00:14:53 +0100 From: Andi Kleen To: "Benjamin C.R. LaHaise" Cc: Andi Kleen , jamal , Werner Almesberger , "James R. Leu" , kuznet@ms2.inr.ac.ru, gleb@nbase.co.il, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010119001453.A11693@fred.local> References: <20010118152452.A5289@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: ; from blah@kvack.org on Thu, Jan 18, 2001 at 10:48:18PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1464 Lines: 34 On Thu, Jan 18, 2001 at 10:48:18PM +0100, Benjamin C.R. LaHaise wrote: > On Thu, 18 Jan 2001, Andi Kleen wrote: > > > It is when you go to "dump netlink state every 60s" mode on overload. > > 60 esconds is way too long, but arguing about the length of polls isn't a > useful way to solve the problem. A good example for the characteristics > needed comes from the typical ISP setting. If an L2TP server goes down > for whatever reason and eventually comes back up, the system ends up with > several thousand users, perhaps tens of thousands, all reconnecting within > the same 3 second interval. 
If the connection takes more than a few > seconds to come up, then the user will disconnect and attempt to > reconnect -- presenting a further overload on the system (this is a real > world example a local ISP has encountered). > > So far the shared memory idea sounds like it holds the most promise. I > could see having a series of device ids, states and generation #'s would > make the whole thing very efficient for both small and large state > changes. Somehow this thread looks like Don Quichote chasing wind mills for me @) Has there been any real world problem shown that cannot be solved by increasing the netlink queue to a few MB? [the default of 64K arguably is a bit on the low side] Shared memory IMHO is an option for a local hack for some routing appliance, it shouldn't be in the main kernel. -Andi -- This is like TV. I don't like TV. From owner-netdev@oss.sgi.com Fri Jan 19 09:39:01 2001 Received: by oss.sgi.com id ; Fri, 19 Jan 2001 09:38:42 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:44302 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Fri, 19 Jan 2001 09:38:32 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id UAA24004; Fri, 19 Jan 2001 20:38:20 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101191738.UAA24004@ms2.inr.ac.ru> Subject: Re: internal drops with tcp, kernel 2.2.16 To: ak@muc.DE (Andi Kleen) Date: Fri, 19 Jan 2001 20:38:20 +0300 (MSK) Cc: ak@muc.DE, netdev@oss.sgi.com In-Reply-To: <20010118215939.A9723@fred.local> from "Andi Kleen" at Jan 18, 1 09:59:39 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 402 Lines: 14 Hello! > Ok for retransmits, but as far as I can see not for first transmits. > > In fact the code is racy, packets_out is not incremented and the retransmit > handler bails out when it is zero. Sorry, seems you misunderstood something capitally. 
At least, I cannot find an interpretation to these statements. Please, explain this in more details. Probably you see some bug, which I do not. Alexey From owner-netdev@oss.sgi.com Fri Jan 19 19:28:34 2001 Received: by oss.sgi.com id ; Fri, 19 Jan 2001 19:28:14 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:51460 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Fri, 19 Jan 2001 19:27:43 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.8.7) with ESMTP id VAA02593; Fri, 19 Jan 2001 21:33:40 -0700 Message-ID: <3A691524.C534BEDE@candelatech.com> Date: Fri, 19 Jan 2001 21:33:40 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: linux-kernel , "netdev@oss.sgi.com" , VLAN Mailing List , "gleb@tochna.technion.ac.il" Subject: [PATCH] 802.1Q VLAN for Linux Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2498 Lines: 57 Announcing 802.1Q VLAN version 1.0.0 for Linux. 802.1Q VLAN is an industry standard that allows you to run multiple Virtual LANs over a single ethernet wire/interface. It also supports priority settings. To user-space code, this VLAN patch makes VLAN devices look like Ethernet devices. The patch is large, so go here to find it: (get version 1.0.0) http://scry.wanfear.com/~greear/vlan.html Or, if you just want the raw patch & user tools with no explanation: http://scry.wanfear.com/~greear/vlan/vlan.1.0.0.tar.gz If you really want the whole thing as a diff emailed to you, let me know! This code has been in development for about 2 years now, and I feel it is stable enough to warrant inclusion into the kernel as an EXPERIMENTAL feature. 
I understand Linus and others may not wish to add it to 2.4.0 proper, but even getting in something like the -ac series, or any broader audience, should help wring the final bugs out and ensure it will be stable. Your comments and testing will help make the code stronger! My 802.1Q VLAN patch features: * Implements 802.1Q VLAN spec. * Can support up to 4094 VLANs per ethernet interface. * Scales well in critical paths: O(1), as far as I can tell. * Optional hashed device lookup tables in the kernel, for better scalability in non-critical paths. * Supports MULTICAST * Can change MAC address of VLAN. * Multiple naming conventions supported, and adjustable at runtime. * Optional header-reordering, to make the VLAN interface look JUST LIKE an Ethernet interface. This fixes some problems with DHCPd and anything else that uses a SOCK_PACKET socket. Default setting is off, which works for every other protocol I know about, and is slightly faster. * Patches for both the 2.2.X series and the 2.4.0 series. * /proc file system support (/proc/net/vlan) * 802.1Q Priority mapping and support. I have not tested compiling it in as a module, though preliminary module support has been added to the 2.4.0 patch. It should work just fine when compiled in (not a module.) For comparison, there is also another VLAN project at http://vlan.sourceforge.net, but I think mine is better, or at least has a more colorful web-page! :) Comments are welcome. 
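The limit of 4094 VLANs in the feature list follows from the 802.1Q tag layout: the 16-bit tag control field carries a 3-bit priority, one flag bit, and a 12-bit VLAN ID, and IDs 0 and 4095 are reserved. Below is a minimal user-space sketch of reading that tag from a raw Ethernet frame. It is not taken from the patch itself; the names parse_vlan_tag, vlan_info and vlan_vid_is_usable are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ADDR_LEN    6        /* bytes in one MAC address */
#define TPID_8021Q      0x8100u  /* EtherType value that marks an 802.1Q tag */
#define VLAN_VID_MASK   0x0FFFu  /* low 12 bits of the tag control field */
#define VLAN_PRIO_SHIFT 13       /* priority sits in the top 3 bits */

struct vlan_info {
    uint16_t vid;        /* VLAN ID, 0..4095 */
    uint8_t  priority;   /* 802.1Q priority, 0..7 */
    uint16_t ethertype;  /* protocol carried inside the tagged frame */
};

/* Read a big-endian (network order) 16-bit value. */
static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Return 1 and fill *out if the frame carries an 802.1Q tag, else 0. */
static int parse_vlan_tag(const uint8_t *frame, size_t len,
                          struct vlan_info *out)
{
    assert(frame != NULL && out != NULL);
    /* dst MAC + src MAC + TPID + tag control + inner EtherType = 18 bytes */
    if (len < 2 * ETH_ADDR_LEN + 6)
        return 0;
    const uint8_t *p = frame + 2 * ETH_ADDR_LEN;
    if (get_be16(p) != TPID_8021Q)
        return 0;
    uint16_t tci = get_be16(p + 2);
    out->vid = tci & VLAN_VID_MASK;
    out->priority = (uint8_t)(tci >> VLAN_PRIO_SHIFT);
    out->ethertype = get_be16(p + 4);
    return 1;
}

/* IDs 0 and 4095 are reserved, which leaves 1..4094 for real VLANs. */
static int vlan_vid_is_usable(uint16_t vid)
{
    return vid >= 1 && vid <= 4094;
}
```

The sketch ignores the single flag bit between the priority and the VLAN ID, and it handles only one level of tagging.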
Thanks, Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sun Jan 21 00:17:34 2001 Received: by oss.sgi.com id ; Sun, 21 Jan 2001 00:17:24 -0800 Received: from gleb.nbase.co.il ([194.90.136.56]:42258 "EHLO gleb.nbase.co.il") by oss.sgi.com with ESMTP id ; Sun, 21 Jan 2001 00:17:06 -0800 Received: (from gleb@localhost) by gleb.nbase.co.il (8.11.1/8.11.1/Debian 8.11.0-6) id f0L8FV412674; Sun, 21 Jan 2001 10:15:31 +0200 From: Gleb Natapov Date: Sun, 21 Jan 2001 10:15:31 +0200 To: Andi Kleen Cc: "Benjamin C.R. LaHaise" , jamal , Werner Almesberger , "James R. Leu" , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010121101530.B19680@nbase.co.il> References: <20010118152452.A5289@fred.local> <20010119001453.A11693@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20010119001453.A11693@fred.local>; from ak@muc.de on Fri, Jan 19, 2001 at 12:14:53AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1909 Lines: 49 On Fri, Jan 19, 2001 at 12:14:53AM +0100, Andi Kleen wrote: > On Thu, Jan 18, 2001 at 10:48:18PM +0100, Benjamin C.R. LaHaise wrote: > > On Thu, 18 Jan 2001, Andi Kleen wrote: > > > > > It is when you go to "dump netlink state every 60s" mode on overload. > > > > 60 esconds is way too long, but arguing about the length of polls isn't a > > useful way to solve the problem. A good example for the characteristics > > needed comes from the typical ISP setting. If an L2TP server goes down > > for whatever reason and eventually comes back up, the system ends up with > > several thousand users, perhaps tens of thousands, all reconnecting within > > the same 3 second interval. 
If the connection takes more than a few > > seconds to come up, then the user will disconnect and attempt to > > reconnect -- presenting a further overload on the system (this is a real > > world example a local ISP has encountered). > > > > So far the shared memory idea sounds like it holds the most promise. I > > could see having a series of device ids, states and generation #'s would > > make the whole thing very efficient for both small and large state > > changes. > > Somehow this thread looks like Don Quichote chasing wind mills for me @) > > Has there been any real world problem shown that cannot be solved by > increasing the netlink queue to a few MB? [the default of 64K arguably > is a bit on the low side] I really like this argument. Lets use it this way: Why we need swaping in the kernel? Has there been any real world problem shown that cannot be solved by adding few more MB of memory into the box? :) The point is we can enlarge netlink queue by factor of 20 without increasing buffer size. > > Shared memory IMHO is an option for a local hack for some routing appliance, > it shouldn't be in the main kernel. > Agreed. > > -Andi > > -- > This is like TV. I don't like TV. -- Gleb. From owner-netdev@oss.sgi.com Sun Jan 21 03:58:15 2001 Received: by oss.sgi.com id ; Sun, 21 Jan 2001 03:57:56 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:14263 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Sun, 21 Jan 2001 03:57:28 -0800 Received: from fred.muc.de (noidentity@ns1190.munich.netsurf.de [195.180.235.190]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id MAA19463; Sun, 21 Jan 2001 12:56:58 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 74484E3BB7; Sun, 21 Jan 2001 11:28:07 +0100 (CET) Date: Sun, 21 Jan 2001 11:28:07 +0100 From: Andi Kleen To: Gleb Natapov Cc: Andi Kleen , "Benjamin C.R. LaHaise" , jamal , Werner Almesberger , "James R. 
Leu" , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: netlink drops messages. Message-ID: <20010121112807.A1082@fred.local> References: <20010118152452.A5289@fred.local> <20010119001453.A11693@fred.local> <20010121101530.B19680@nbase.co.il> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <20010121101530.B19680@nbase.co.il>; from gleb@nbase.co.il on Sun, Jan 21, 2001 at 09:18:25AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 413 Lines: 14 On Sun, Jan 21, 2001 at 09:18:25AM +0100, Gleb Natapov wrote: > Why we need swaping in the kernel? Has there been any real world problem > shown that cannot be solved by adding few more MB of memory into the box? > :) Times have changed. These days Linux uses without much consideration several MB of your main memory for hash tables because it's easier to enlarge them than to tune the functions. -Andi From owner-netdev@oss.sgi.com Tue Jan 23 05:54:00 2001 Received: by oss.sgi.com id ; Tue, 23 Jan 2001 05:51:01 -0800 Received: from mout03.kundenserver.de ([195.20.224.218]:7973 "EHLO mout03.kundenserver.de") by oss.sgi.com with ESMTP id ; Tue, 23 Jan 2001 05:50:40 -0800 Received: from [195.20.224.148] (helo=mxintern.kundenserver.de) by mout03.kundenserver.de with esmtp (Exim 2.12 #2) id 14L3q4-0005wZ-00; Tue, 23 Jan 2001 14:50:24 +0100 Received: from [172.17.0.91] (helo=bluesky.schlund.de) by mxintern.kundenserver.de with esmtp (Exim 2.12 #3) id 14L3q5-0002yj-00; Tue, 23 Jan 2001 14:50:26 +0100 Received: from toke by bluesky.schlund.de with local (Exim 3.12 #1 (Debian)) id 14L3pg-00009F-00; Tue, 23 Jan 2001 14:50:00 +0100 X-Sieve: cmu-sieve 1.3 Received: from mx00.kundenserver.de ([195.20.224.130] helo=mx06) by imap.kundenserver.de with esmtp (Exim 3.14 #4) id 14L3gK-0008VX-00 for toke@imap.schlund.de; Tue, 23 Jan 2001 14:40:20 +0100 Received: from [167.216.222.41] (helo=lists.samba.org) by mx06 with esmtp 
(Exim 2.12 #3) id 14L3cK-0002lp-00 for netfilter@fnordic.de; Tue, 23 Jan 2001 14:36:12 +0100 Received: from lists.samba.org (localhost [127.0.0.1]) by lists.samba.org (Postfix) with ESMTP id DBE8080B1; Tue, 23 Jan 2001 05:36:06 -0800 (PST) Received: from mout02.kundenserver.de (mout02.kundenserver.de [195.20.224.133]) by lists.samba.org (Postfix) with ESMTP id 09D237B6C for ; Tue, 23 Jan 2001 05:35:37 -0800 (PST) Received: from [195.20.224.148] (helo=mxintern.kundenserver.de) by mout02.kundenserver.de with esmtp (Exim 2.12 #2) id 14L3bU-0007VH-00 for netfilter@lists.samba.org; Tue, 23 Jan 2001 14:35:20 +0100 Received: from [172.17.0.91] (helo=bluesky.schlund.de) by mxintern.kundenserver.de with esmtp (Exim 2.12 #3) id 14L3bV-0002OA-00 for netfilter@lists.samba.org; Tue, 23 Jan 2001 14:35:21 +0100 Received: from toke by bluesky.schlund.de with local (Exim 3.12 #1 (Debian)) id 14L3aq-00008s-00 for ; Tue, 23 Jan 2001 14:34:40 +0100 From: Thomas Kerpe To: netfilter@lists.samba.org Subject: Masquerading Portrange in 2.4.0 Message-ID: <20010123143440.A535@schlund.de> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="SLDf9lqlvOQaIe6s" Content-Disposition: inline User-Agent: Mutt/1.2.5i X-BeenThere: netfilter@lists.samba.org X-Mailman-Version: 2.0beta6 List-Help: List-Post: List-Subscribe: , List-Id: Netfilter discussions List-Unsubscribe: , List-Archive: http://lists.samba.org/pipermail/netfilter/ Date: Tue, 23 Jan 2001 14:34:40 +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1026 Lines: 39 --SLDf9lqlvOQaIe6s Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Hello! I am wondering about the hard-coded Masquerading Port-Range in net/ipv4/netfilter/ip_fw_compat_masq.c (kernel 2.4.0) In the 2.2.x Kernel hierarchy the Masquerading Ports could be changed in an include file.
It doesn't look so pretty... range = ((struct ip_nat_multi_range) { 1, {{IP_NAT_RANGE_MAP_IPS|IP_NAT_RANGE_PROTO_SPECIFIED, newsrc, newsrc, { htons(61000) }, { htons(65095) } } } }); Greets... Thomas Kerpe --SLDf9lqlvOQaIe6s Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.4 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQE6bYhw9jrn3C9VwiMRAjCsAJ0TLIn1WxiEzRhfcGK6Q4Edh6MefwCfUvAz dcLmwU4XzdVZDS5JVPvkuoc= =CdOp -----END PGP SIGNATURE----- --SLDf9lqlvOQaIe6s-- From owner-netdev@oss.sgi.com Tue Jan 23 17:51:22 2001 Received: by oss.sgi.com id ; Tue, 23 Jan 2001 17:51:03 -0800 Received: from foobar.napster.com ([64.124.41.10]:24079 "EHLO foobar.napster.com") by oss.sgi.com with ESMTP id ; Tue, 23 Jan 2001 17:50:40 -0800 Received: from wagner.napster.com (mail.napster.com [63.108.185.112]) by foobar.napster.com (8.9.3/8.9.3) with ESMTP id RAA08211 for ; Tue, 23 Jan 2001 17:50:34 -0800 Received: from napster.com (gw.napster.com [63.108.185.120]) by wagner.napster.com (8.9.3/8.9.3) with SMTP id RAA02171 for ; Tue, 23 Jan 2001 17:50:34 -0800 Message-ID: <3A6E34EA.4AF27AA8@napster.com> Date: Tue, 23 Jan 2001 17:50:34 -0800 From: Jordan Mendelson Organization: Napster, Inc. X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i686) X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com Subject: TCP Performance 2.4.0 <-> Win98 Followup Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 674 Lines: 20 Just a followup on the poor TCP performance we were seeing with 2.4.0 compared to 2.2.x. Apparently, the performance problem with Win98 <-> 2.4.0 only happens with specific ISPs. The MTU and SACK settings didn't appear to have any effect. Right now the only confirmed ISP we've been able to reproduce it with is Earthlink's dialup in the bay area.
Hopefully we'll get some more data on the equipment used upstream to see if it is a terminal server, router or firewall issue. If anyone has any suggestions on how to track down this problem, I'd appreciate it. Right now the TCP performance problem is the only thing keeping us from switching to 2.4.x across the board. Jordan From owner-netdev@oss.sgi.com Thu Jan 25 03:35:43 2001 Received: by oss.sgi.com id ; Thu, 25 Jan 2001 03:35:33 -0800 Received: from pop3.galileo.co.il ([199.203.130.130]:21924 "EHLO galileo5.galileo.co.il") by oss.sgi.com with ESMTP id ; Thu, 25 Jan 2001 03:35:11 -0800 Received: from galileo.co.il (rabeeh@linux2.galileo.co.il [10.2.40.2]) by galileo.co.il (8.8.5/8.8.5) with ESMTP id NAA24549 for ; Thu, 25 Jan 2001 13:35:27 +0200 (GMT-2) Message-ID: <3A700F0C.7050004@galileo.co.il> Date: Thu, 25 Jan 2001 13:33:32 +0200 From: Rabeeh Khoury Organization: Galileo Technology User-Agent: Mozilla/5.0 (X11; U; Linux 2.2.14-5.0 i686; en-US; m18) Gecko/20001107 Netscape6/6.0 X-Accept-Language: en MIME-Version: 1.0 To: netdev Subject: TCP Syn Ack function Content-Type: text/html; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 664 Lines: 17 Hi All,

I am issuing a telnet command from my embedded system to another machine. The kernel sends a SYN, then receives a SYN-ACK, and then does nothing.
After a while, telnet reports that it is unable to connect to the server.

Which function in the kernel source tree (in the TCP stack) is called when a SYN-ACK packet is received,
and which function calls it?

P.S. I can see from tcpdump that the packet really is received by the kernel, which means that the device driver is working fine!

I'm using kernel 2.4-test11 (PowerPC version)

Regards,
Rabeeh
From owner-netdev@oss.sgi.com Thu Jan 25 03:58:03 2001 Received: by oss.sgi.com id ; Thu, 25 Jan 2001 03:57:54 -0800 Received: from mail.zmailer.org ([194.252.70.162]:22025 "EHLO zmailer.org") by oss.sgi.com with ESMTP id ; Thu, 25 Jan 2001 03:57:40 -0800 Received: (from localhost user: 'mea' uid#500 fake: STDIN (mea@zmailer.org)) by mail.zmailer.org id ; Thu, 25 Jan 2001 13:57:32 +0200 Date: Thu, 25 Jan 2001 13:57:32 +0200 From: Matti Aarnio To: Rabeeh Khoury Cc: netdev Subject: Re: TCP Syn Ack function Message-ID: <20010125135732.A25659@mea-ext.zmailer.org> References: <3A700F0C.7050004@galileo.co.il> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3A700F0C.7050004@galileo.co.il>; from rabeeh@galileo.co.il on Thu, Jan 25, 2001 at 01:33:32PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 302 Lines: 9 On Thu, Jan 25, 2001 at 01:33:32PM +0200, Rabeeh Khoury wrote: > > [-- text/html is unsupported (use 'v' to view this part) --] > Do repeat your question, but at FIRST turn off the HTML sending in email! Even if you use web-browser for email, it does not give you license to expect others to do so. 
From owner-netdev@oss.sgi.com Thu Jan 25 04:18:03 2001 Received: by oss.sgi.com id ; Thu, 25 Jan 2001 04:17:44 -0800 Received: from f00f.stub.clear.net.nz ([203.167.224.51]:31493 "HELO metastasis.f00f.org") by oss.sgi.com with SMTP id ; Thu, 25 Jan 2001 04:17:32 -0800 Received: by metastasis.f00f.org (Postfix, from userid 1000) id 88EB4A460; Fri, 26 Jan 2001 01:17:29 +1300 (NZDT) Date: Fri, 26 Jan 2001 01:17:29 +1300 From: Chris Wedgwood To: Jordan Mendelson Cc: netdev@oss.sgi.com Subject: Re: TCP Performance 2.4.0 <-> Win98 Followup Message-ID: <20010126011729.A9441@metastasis.f00f.org> References: <3A6E34EA.4AF27AA8@napster.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <3A6E34EA.4AF27AA8@napster.com>; from jordy@napster.com on Tue, Jan 23, 2001 at 05:50:34PM -0800 X-No-Archive: Yes Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 307 Lines: 11 On Tue, Jan 23, 2001 at 05:50:34PM -0800, Jordan Mendelson wrote: Apparently, the performance problem with Win98 <-> 2.4.0 only happens with specific ISPs. The MTU and SACK settings didn't appear to have any effect. Can you easily get packet dumps from both good and bad situations? 
--cw From owner-netdev@oss.sgi.com Fri Jan 26 01:40:23 2001 Received: by oss.sgi.com id ; Fri, 26 Jan 2001 01:40:13 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:42934 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 01:39:56 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id UAA26073; Fri, 26 Jan 2001 20:39:38 +1100 (EST) Message-ID: <3A714788.82C064BD@uow.edu.au> Date: Fri, 26 Jan 2001 20:46:48 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: Alexey Kuznetsov CC: "netdev@oss.sgi.com" Subject: zerocopy changes in 3c59x.c Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 387 Lines: 18 Hi, Alexey. /* Tx timeout interval (millisecs) */ -static int watchdog = 400; +static int watchdog = 5000; Five second transmit timeout. Why is this? And why do we do this: vp->stats.rx_compressed++; each time we send a frame with hardware checksum? This is the transmit path, not the receive path. It seems we're abusing some ppp-related stats here. What's up? From owner-netdev@oss.sgi.com Fri Jan 26 03:35:33 2001 Received: by oss.sgi.com id ; Fri, 26 Jan 2001 03:35:23 -0800 Received: from pizda.ninka.net ([216.101.162.242]:41616 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 03:34:59 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id DAA10269; Fri, 26 Jan 2001 03:33:49 -0800 From: "David S. 
Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14961.24733.869800.77633@pizda.ninka.net> Date: Fri, 26 Jan 2001 03:33:49 -0800 (PST) To: Andrew Morton Cc: Alexey Kuznetsov , "netdev@oss.sgi.com" Subject: Re: zerocopy changes in 3c59x.c In-Reply-To: <3A714788.82C064BD@uow.edu.au> References: <3A714788.82C064BD@uow.edu.au> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 473 Lines: 17 Andrew Morton writes: > And why do we do this: > > vp->stats.rx_compressed++; > > each time we send a frame with hardware checksum? This is the > transmit path, not the receive path. It seems we're abusing > some ppp-related stats here. What's up? Every zerocopy driver does this, rx_compressed counts HW csummed transmit packets, and tx_compressed counts transmit packets containing more than one buffer. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Fri Jan 26 03:42:34 2001 Received: by oss.sgi.com id ; Fri, 26 Jan 2001 03:42:15 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:35003 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 03:42:06 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id WAA06683; Fri, 26 Jan 2001 22:41:42 +1100 (EST) Message-ID: <3A716447.CF6E8BB0@uow.edu.au> Date: Fri, 26 Jan 2001 22:49:27 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: "David S. 
Miller" CC: Alexey Kuznetsov , "netdev@oss.sgi.com" Subject: Re: zerocopy changes in 3c59x.c References: <3A714788.82C064BD@uow.edu.au>, <3A714788.82C064BD@uow.edu.au> <14961.24733.869800.77633@pizda.ninka.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 667 Lines: 20 "David S. Miller" wrote: > > Andrew Morton writes: > > And why do we do this: > > > > vp->stats.rx_compressed++; > > > > each time we send a frame with hardware checksum? This is the > > transmit path, not the receive path. It seems we're abusing > > some ppp-related stats here. What's up? > > Every zerocopy driver does this, rx_compressed counts HW csummed > transmit packets, and tx_compressed counts transmit packets > containing more than one buffer. Thanks. I can see that :) But why do this, rather than create new accounting fields? Let me guess: short-term thing, intended to be removed, didn't want to hack the userspace tools? From owner-netdev@oss.sgi.com Fri Jan 26 05:06:24 2001 Received: by oss.sgi.com id ; Fri, 26 Jan 2001 05:06:04 -0800 Received: from pizda.ninka.net ([216.101.162.242]:30353 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 05:05:50 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id FAA10816; Fri, 26 Jan 2001 05:04:37 -0800 From: "David S. 
Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14961.30181.671982.174763@pizda.ninka.net> Date: Fri, 26 Jan 2001 05:04:37 -0800 (PST) To: Andrew Morton Cc: Alexey Kuznetsov , "netdev@oss.sgi.com" Subject: Re: zerocopy changes in 3c59x.c In-Reply-To: <3A716447.CF6E8BB0@uow.edu.au> References: <3A714788.82C064BD@uow.edu.au> <14961.24733.869800.77633@pizda.ninka.net> <3A716447.CF6E8BB0@uow.edu.au> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 801 Lines: 24 Andrew Morton writes: > But why do this, rather than create new accounting fields? Let > me guess: short-term thing, intended to be removed, didn't want > to hack the userspace tools? Why add new fields when they are unnecessary? What is hurt by having these fields increment. We can teach programs like ifconfig et al. that for ethernet type interfaces, these fields have this different meaning. Currently, ifconfig will not even display this statistic for ethernet type netdevices, one has to go probing by hand to spot them from user space. If people are going to be really pedantic about this, then I'll just kill this usage in the final patch I send to Linus. But I personally think it is not such a big deal and it will assist with debugging. Later, David S. 
Miller davem@redhat.com From owner-netdev@oss.sgi.com Fri Jan 26 05:20:54 2001 Received: by oss.sgi.com id ; Fri, 26 Jan 2001 05:20:34 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:59788 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 05:20:11 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id AAA15017; Sat, 27 Jan 2001 00:19:46 +1100 (EST) Message-ID: <3A717B44.82C66152@uow.edu.au> Date: Sat, 27 Jan 2001 00:27:32 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: Alexey Kuznetsov , "netdev@oss.sgi.com" Subject: Re: zerocopy changes in 3c59x.c References: <3A716447.CF6E8BB0@uow.edu.au>, <3A714788.82C064BD@uow.edu.au> <14961.24733.869800.77633@pizda.ninka.net> <3A716447.CF6E8BB0@uow.edu.au> <14961.30181.671982.174763@pizda.ninka.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1354 Lines: 32 "David S. Miller" wrote: > > Andrew Morton writes: > > But why do this, rather than create new accounting fields? Let > > me guess: short-term thing, intended to be removed, didn't want > > to hack the userspace tools? > > Why add new fields when they are unnecessary? What is hurt > by having these fields increment. > Hey, I was just asking! Dunno about others, but for me the zc thing has basically come from nowhere - I'm still coming up to speed about the design decisions which were made, how it is implemented, etc. 
/proc/net/dev says:

Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  eth0:   14274     104    0    0    0     0         54         0     5414      69    0    0    0     0       0         54

So this:                                             ^^
tells us that 54 packets have been sent with h/w checksums and this:
                                                                                                                        ^^
tells us that they were all multi-fragment.

I assume this is because the IP header is in a different frag?

Is there ever a situation in which these numbers will differ?

From owner-netdev@oss.sgi.com Fri Jan 26 05:27:54 2001
Received: by oss.sgi.com id ; Fri, 26 Jan 2001 05:27:34 -0800
Received: from pizda.ninka.net ([216.101.162.242]:45457 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 05:27:31 -0800
Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id FAA10964; Fri, 26 Jan 2001 05:26:19 -0800
From: "David S. Miller"
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <14961.31482.977020.162767@pizda.ninka.net>
Date: Fri, 26 Jan 2001 05:26:18 -0800 (PST)
To: Andrew Morton
Cc: Alexey Kuznetsov , "netdev@oss.sgi.com"
Subject: Re: zerocopy changes in 3c59x.c
In-Reply-To: <3A717B44.82C66152@uow.edu.au>
References: <3A716447.CF6E8BB0@uow.edu.au> <3A714788.82C064BD@uow.edu.au> <14961.24733.869800.77633@pizda.ninka.net> <14961.30181.671982.174763@pizda.ninka.net> <3A717B44.82C66152@uow.edu.au>
X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 598
Lines: 21

Andrew Morton writes:
 > tells us that 54 packets have been sent with h/w checksums and this:
 >
 > tells us that they were all multi-fragment.
 >
 > I assume this is because the IP header is in a different frag?

Right, the headers are all in the skb->data buffer and the
application data sits in the SKB frags.

 > Is there ever a situation in which these numbers will differ?

If we allowed SG without hw csumming, for example :-)

It would let us identify such devices by just asking the
user for a statistics dump while gathering debug information.

Later,
David S. Miller
davem@redhat.com

From owner-netdev@oss.sgi.com Fri Jan 26 05:45:24 2001
Received: by oss.sgi.com id ; Fri, 26 Jan 2001 05:45:14 -0800
Received: from horus.its.uow.edu.au ([130.130.68.25]:59385 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 05:44:52 -0800
Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id AAA16882; Sat, 27 Jan 2001 00:44:30 +1100 (EST)
Message-ID: <3A718110.5C540AE@uow.edu.au>
Date: Sat, 27 Jan 2001 00:52:16 +1100
From: Andrew Morton
X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586)
X-Accept-Language: en
MIME-Version: 1.0
To: "David S. Miller"
CC: Alexey Kuznetsov , "netdev@oss.sgi.com"
Subject: Re: zerocopy changes in 3c59x.c
References: <3A717B44.82C66152@uow.edu.au>, <3A716447.CF6E8BB0@uow.edu.au> <3A714788.82C064BD@uow.edu.au> <14961.24733.869800.77633@pizda.ninka.net> <14961.30181.671982.174763@pizda.ninka.net> <3A717B44.82C66152@uow.edu.au> <14961.31482.977020.162767@pizda.ninka.net>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 1423
Lines: 37

"David S. Miller" wrote:
>
> Andrew Morton writes:
>  > tells us that 54 packets have been sent with h/w checksums and this:
>  >
>  > tells us that they were all multi-fragment.
>  >
>  > I assume this is because the IP header is in a different frag?
>
> Right, the headers are all in the skb->data buffer and the
> application data sits in the SKB frags.
>
>  > Is there ever a situation in which these numbers will differ?
>
> If we allowed SG without hw csumming, for example :-)
>
> It would let us identify such devices by just asking the
> user for a statistics dump while gathering debug information.

I see.

And just to clarify: it is currently the case that we
do support scatter/gather on devices which don't have
hardware checksums on transmit.

When tx checksums are disabled, with a 3c905C, SG is still
used, so /proc/net/dev says:

Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    2520      36    0    0    0     0          0         0     2520      36    0    0    0     0       0          0
  eth0:  112364     549    0    0    0     0          0         0    69961     413    0    0    0     0       0        239

That is, 239 packets were sent using multiple fragments
and none of them used h/w checksums.

I think it's sunk in now :)

From owner-netdev@oss.sgi.com Fri Jan 26 05:50:24 2001
Received: by oss.sgi.com id ; Fri, 26 Jan 2001 05:50:14 -0800
Received: from pizda.ninka.net ([216.101.162.242]:57233 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 05:49:55 -0800
Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id FAA11049; Fri, 26 Jan 2001 05:48:45 -0800
From: "David S.
Miller"
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <14961.32829.353428.295904@pizda.ninka.net>
Date: Fri, 26 Jan 2001 05:48:45 -0800 (PST)
To: Andrew Morton
Cc: Alexey Kuznetsov , "netdev@oss.sgi.com"
Subject: Re: zerocopy changes in 3c59x.c
In-Reply-To: <3A718110.5C540AE@uow.edu.au>
References: <3A717B44.82C66152@uow.edu.au> <3A716447.CF6E8BB0@uow.edu.au> <3A714788.82C064BD@uow.edu.au> <14961.24733.869800.77633@pizda.ninka.net> <14961.30181.671982.174763@pizda.ninka.net> <14961.31482.977020.162767@pizda.ninka.net> <3A718110.5C540AE@uow.edu.au>
X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 818
Lines: 21

Andrew Morton writes:
 > And just to clarify: it is currently the case that we
 > do support scatter/gather on devices which don't have
 > hardware checksums on transmit.

Yes, but I won't allow this for ipv4/ipv6 in the final zerocopy patch
I send to Linus. The reason (did you attend my talk at UNSW last
week??? :-))) is that if SG-only is allowed it is possible for the
paged data to get modified between when we calculate the checksum in
software and the device transmits the packet. We don't get exclusive
access to the pages, we just grab a reference to them. Even though a
TCP retransmit would fixup the checksum this is horrible from a
quality of implementation standpoint.

However, as I mentioned, DecNET and friends will be able to make use
of SG-only devices.

Later,
David S.
Miller
davem@redhat.com

From owner-netdev@oss.sgi.com Fri Jan 26 06:11:44 2001
Received: by oss.sgi.com id ; Fri, 26 Jan 2001 06:11:26 -0800
Received: from horus.its.uow.edu.au ([130.130.68.25]:64510 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 06:11:20 -0800
Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id BAA10694; Sat, 27 Jan 2001 01:10:36 +1100 (EST)
Message-ID: <3A71872E.5DBB4C7B@uow.edu.au>
Date: Sat, 27 Jan 2001 01:18:22 +1100
From: Andrew Morton
X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586)
X-Accept-Language: en
MIME-Version: 1.0
To: "David S. Miller" , Alexey Kuznetsov
CC: "netdev@oss.sgi.com"
Subject: [prepatch] 3c59x.c
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 27725
Lines: 766

OK, here's the rolled-up patch against 2.4.1-pre10 vanilla.

It compiles and works just fine with and without the zc patch.

The idea is that if this patch is applied to the Linus tree,
the 3c59x part of the zerocopy diff can be removed. Once the
zerocopy patch is merged to Linus' tree, it'll "just work".

Non-ZC notes:

- Added `medialock' stuff for NICs which have multiple
  physical interfaces. This is because the driver was
  wandering off and trying 10base2 and AUI interfaces if
  10baseT is unplugged. It was getting stuck there and
  a driver reload was needed to recover.

- Misc fixlets from Arnaldo Carvalho de Melo

- Added and used EEPROM_NORESET to make PM resumes a little more
  successful with 3c556B's.

- Fixed a memory leak: missing pci_free_consistent()

ZC notes

- Added the HAS_HWCKSM device flag. If this flag is present,
  we use hardware tx checksums.

- All devices which are marked IS_CYCLONE or IS_TORNADO have
  also been marked HAS_HWCKSM.
This means that all these devices will use hw checksums, as was the case in the zerocopy patch. - Added the `hw_checksums' module parameter. This means that if hw checksums cause problems, the user can disable it with `hw_checksums=0'. It also means that if the user thinks their NIC _does_ support hw checksums, and we haven't enabled this, they can use `hw_checksums=1' to override the device table setting. I've tested it with: 3c590: Doesn't do SG or checksums 3c905: Does SG, no checksums 3c905B, 3c905C, 3CCFE575CT These do both. Please let me know if this all looks sane and I'll punt it over to Linus. Or we could roll them together if you're confident that zerocopy is going in. I'd like to get the unrelated fixes into 2.4.1.... Thanks. --- linux-2.4.1-pre10/drivers/net/3c59x.c Tue Jan 23 19:28:15 2001 +++ linux-akpm/drivers/net/3c59x.c Sat Jan 27 00:53:26 2001 @@ -118,7 +118,7 @@ LK1.1.11 13 Nov 2000 andrewm - Dump MOD_INC/DEC_USE_COUNT, use SET_MODULE_OWNER - LK1.1.12 1 Jan 2001 andrewm + LK1.1.12 1 Jan 2001 andrewm (2.4.0-pre1) - Call pci_enable_device before we request our IRQ (Tobias Ringstrom) - Add 3c590 PCI latency timer hack to vortex_probe1 (from 0.99Ra) - Added extended wait_for_completion for the 3c905CX. @@ -126,6 +126,16 @@ - Add HAS_NWAY to 3cSOHO100-TX (Brett Frankenberger) - Don't free skbs we don't own on oom path in vortex_open(). + LK1.1.13 27 Jan 2001 + - Added explicit `medialock' flag so we can truly + lock the media type down with `options'. + - "check ioremap return and some tidbits" (Arnaldo Carvalho de Melo ) + - Added and used EEPROM_NORESET for 3c556B PM resumes. + - Fixed leakage of vp->rx_ring. + - Break out separate HAS_HWCKSM device capability flag. + - Kill vp->tx_full (ANK) + - Merge zerocopy fragment handling (ANK?) + - See http://www.uow.edu.au/~andrewm/linux/#3c59x-2.3 for more details. - Also see Documentation/networking/vortex.txt */ @@ -154,7 +164,7 @@ /* Maximum events (Rx packets, etc.) to handle at each interrupt. 
*/ static int max_interrupt_work = 32; /* Tx timeout interval (millisecs) */ -static int watchdog = 400; +static int watchdog = 5000; /* Allow aggregation of Tx interrupts. Saves CPU load at the cost * of possible Tx stalls if the system is blocking interrupts @@ -174,10 +184,6 @@ static int vortex_debug = 1; #endif -/* Some values here only for performance evaluation and path-coverage - debugging. */ -static int rx_nocopy = 0, rx_copy = 0, queued_packet = 0, rx_csumhits; - #ifndef __OPTIMIZE__ #error You must compile this file with the correct options! #error See the last lines of the source file. @@ -211,13 +217,14 @@ #include static char version[] __devinitdata = -"3c59x.c:LK1.1.12 06 Jan 2000 Donald Becker and others. http://www.scyld.com/network/vortex.html " "$Revision: 1.102.2.46 $\n"; +"3c59x.c:LK1.1.13 27 Jan 2001 Donald Becker and others. http://www.scyld.com/network/vortex.html\n"; MODULE_AUTHOR("Donald Becker "); MODULE_DESCRIPTION("3Com 3c59x/3c90x/3c575 series Vortex/Boomerang/Cyclone driver"); MODULE_PARM(debug, "i"); MODULE_PARM(options, "1-" __MODULE_STRING(8) "i"); MODULE_PARM(full_duplex, "1-" __MODULE_STRING(8) "i"); +MODULE_PARM(hw_checksums, "1-" __MODULE_STRING(8) "i"); MODULE_PARM(flow_ctrl, "1-" __MODULE_STRING(8) "i"); MODULE_PARM(rx_copybreak, "i"); MODULE_PARM(max_interrupt_work, "i"); @@ -332,7 +339,7 @@ EEPROM_8BIT=0x10, /* AKPM: Uses 0x230 as the base bitmaps for EEPROM reads */ HAS_PWR_CTRL=0x20, HAS_MII=0x40, HAS_NWAY=0x80, HAS_CB_FNS=0x100, INVERT_MII_PWR=0x200, INVERT_LED_PWR=0x400, MAX_COLLISION_RESET=0x800, - EEPROM_OFFSET=0x1000 }; + EEPROM_OFFSET=0x1000, EEPROM_NORESET=0x2000, HAS_HWCKSM=0x4000 }; enum vortex_chips { CH_3C590 = 0, @@ -405,58 +412,65 @@ {"3c900 Boomerang 10Mbps Combo", PCI_USES_IO|PCI_USES_MASTER, IS_BOOMERANG, 64, }, {"3c900 Cyclone 10Mbps TPO", /* AKPM: from Don's 0.99M */ - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY|HAS_HWCKSM, 128, }, {"3c900 
Cyclone 10Mbps Combo", - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_HWCKSM, 128, }, {"3c900 Cyclone 10Mbps TPC", /* AKPM: from Don's 0.99M */ - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_HWCKSM, 128, }, {"3c900B-FL Cyclone 10base-FL", - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_HWCKSM, 128, }, {"3c905 Boomerang 100baseTx", PCI_USES_IO|PCI_USES_MASTER, IS_BOOMERANG|HAS_MII, 64, }, {"3c905 Boomerang 100baseT4", PCI_USES_IO|PCI_USES_MASTER, IS_BOOMERANG|HAS_MII, 64, }, {"3c905B Cyclone 100baseTx", - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY|HAS_HWCKSM, 128, }, {"3c905B Cyclone 10/100/BNC", - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY|HAS_HWCKSM, 128, }, {"3c905B-FX Cyclone 100baseFx", - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_HWCKSM, 128, }, {"3c905C Tornado", - PCI_USES_IO|PCI_USES_MASTER, IS_TORNADO|HAS_NWAY, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_TORNADO|HAS_NWAY|HAS_HWCKSM, 128, }, {"3c980 Cyclone", - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_HWCKSM, 128, }, {"3c980 10/100 Base-TX NIC(Python-T)", - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_HWCKSM, 128, }, {"3cSOHO100-TX Hurricane", - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY|HAS_HWCKSM, 128, }, {"3c555 Laptop Hurricane", - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|EEPROM_8BIT, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|EEPROM_8BIT|HAS_HWCKSM, 128, }, {"3c556 Laptop Tornado", - PCI_USES_IO|PCI_USES_MASTER, IS_TORNADO|HAS_NWAY|EEPROM_8BIT|HAS_CB_FNS|INVERT_MII_PWR, 128, }, + 
PCI_USES_IO|PCI_USES_MASTER, IS_TORNADO|HAS_NWAY|EEPROM_8BIT|HAS_CB_FNS|INVERT_MII_PWR| + HAS_HWCKSM, 128, }, {"3c556B Laptop Hurricane", - PCI_USES_IO|PCI_USES_MASTER, IS_TORNADO|HAS_NWAY|EEPROM_OFFSET|HAS_CB_FNS|INVERT_MII_PWR, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_TORNADO|HAS_NWAY|EEPROM_OFFSET|HAS_CB_FNS|INVERT_MII_PWR| + EEPROM_NORESET|HAS_HWCKSM, 128, }, {"3c575 [Megahertz] 10/100 LAN CardBus", - PCI_USES_IO|PCI_USES_MASTER, IS_BOOMERANG|HAS_MII|EEPROM_8BIT, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_BOOMERANG|HAS_MII|EEPROM_8BIT, 128, }, {"3c575 Boomerang CardBus", PCI_USES_IO|PCI_USES_MASTER, IS_BOOMERANG|HAS_MII|EEPROM_8BIT, 128, }, {"3CCFE575BT Cyclone CardBus", - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY|HAS_CB_FNS|EEPROM_8BIT|INVERT_LED_PWR, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY|HAS_CB_FNS|EEPROM_8BIT| + INVERT_LED_PWR|HAS_HWCKSM, 128, }, {"3CCFE575CT Tornado CardBus", - PCI_USES_IO|PCI_USES_MASTER, IS_TORNADO|HAS_NWAY|HAS_CB_FNS|EEPROM_8BIT|INVERT_MII_PWR|MAX_COLLISION_RESET, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_TORNADO|HAS_NWAY|HAS_CB_FNS|EEPROM_8BIT|INVERT_MII_PWR| + MAX_COLLISION_RESET|HAS_HWCKSM, 128, }, {"3CCFE656 Cyclone CardBus", - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY|HAS_CB_FNS|EEPROM_8BIT|INVERT_MII_PWR|INVERT_LED_PWR, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY|HAS_CB_FNS|EEPROM_8BIT|INVERT_MII_PWR| + INVERT_LED_PWR|HAS_HWCKSM, 128, }, {"3CCFEM656B Cyclone+Winmodem CardBus", - PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY|HAS_CB_FNS|EEPROM_8BIT|INVERT_MII_PWR|INVERT_LED_PWR, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_CYCLONE|HAS_NWAY|HAS_CB_FNS|EEPROM_8BIT|INVERT_MII_PWR| + INVERT_LED_PWR|HAS_HWCKSM, 128, }, {"3CXFEM656C Tornado+Winmodem CardBus", /* From pcmcia-cs-3.1.5 */ - PCI_USES_IO|PCI_USES_MASTER, IS_TORNADO|HAS_NWAY|HAS_CB_FNS|EEPROM_8BIT|INVERT_MII_PWR|MAX_COLLISION_RESET, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_TORNADO|HAS_NWAY|HAS_CB_FNS|EEPROM_8BIT|INVERT_MII_PWR| 
+ MAX_COLLISION_RESET|HAS_HWCKSM, 128, }, {"3c450 HomePNA Tornado", /* AKPM: from Don's 0.99Q */ - PCI_USES_IO|PCI_USES_MASTER, IS_TORNADO|HAS_NWAY, 128, }, + PCI_USES_IO|PCI_USES_MASTER, IS_TORNADO|HAS_NWAY|HAS_HWCKSM, 128, }, {0,}, /* 0 terminated list. */ }; @@ -631,11 +645,24 @@ IPChksumValid=1<<29, TCPChksumValid=1<<30, UDPChksumValid=1<<31, }; +#ifdef MAX_SKB_FRAGS +#define DO_ZEROCOPY 1 +#else +#define DO_ZEROCOPY 0 +#endif + struct boom_tx_desc { u32 next; /* Last entry points to 0. */ s32 status; /* bits 0:12 length, others see below. */ - u32 addr; - s32 length; +#if DO_ZEROCOPY + struct { + u32 addr; + s32 length; + } frag[1+MAX_SKB_FRAGS]; +#else + u32 addr; + s32 length; +#endif }; /* Values for the Tx status entry. */ @@ -668,6 +695,10 @@ struct pci_dev *pdev; char *cb_fn_base; /* CardBus function status addr space. */ + /* Some values here only for performance evaluation and path-coverage */ + int rx_nocopy, rx_copy, queued_packet, rx_csumhits; + int card_idx; + /* The remainder are related to chip state, mostly media selection. */ struct timer_list timer; /* Media selection timer. 
*/ struct timer_list rx_oom_timer; /* Rx skb allocation retry timer */ @@ -682,6 +713,7 @@ tx_full:1, has_nway:1, open:1, + medialock:1, must_free_region:1; /* Flag: if zero, Cardbus owns the I/O region */ int drv_flags; u16 status_enable; @@ -755,6 +787,7 @@ #define MAX_UNITS 8 static int options[MAX_UNITS] = { -1, -1, -1, -1, -1, -1, -1, -1,}; static int full_duplex[MAX_UNITS] = {-1, -1, -1, -1, -1, -1, -1, -1}; +static int hw_checksums[MAX_UNITS] = {-1, -1, -1, -1, -1, -1, -1, -1}; static int flow_ctrl[MAX_UNITS] = {-1, -1, -1, -1, -1, -1, -1, -1}; /* #define dev_alloc_skb dev_alloc_skb_debug */ @@ -889,9 +922,9 @@ } dev = init_etherdev(NULL, sizeof(*vp)); + retval = -ENOMEM; if (!dev) { printk (KERN_ERR PFX "unable to allocate etherdev, aborting\n"); - retval = -ENOMEM; goto out; } SET_MODULE_OWNER(dev); @@ -909,6 +942,7 @@ vp->drv_flags = vci->drv_flags; vp->has_nway = (vci->drv_flags & HAS_NWAY) ? 1 : 0; vp->io_size = vci->io_size; + vp->card_idx = card_idx; /* module list only for EISA devices */ if (pdev == NULL) { @@ -953,10 +987,9 @@ vp->rx_ring = pci_alloc_consistent(pdev, sizeof(struct boom_rx_desc) * RX_RING_SIZE + sizeof(struct boom_tx_desc) * TX_RING_SIZE, &vp->rx_ring_dma); - if (vp->rx_ring == 0) { - retval = -ENOMEM; + retval = -ENOMEM; + if (vp->rx_ring == 0) goto free_region; - } vp->tx_ring = (struct boom_tx_desc *)(vp->rx_ring + RX_RING_SIZE); vp->tx_ring_dma = vp->rx_ring_dma + sizeof(struct boom_rx_desc) * RX_RING_SIZE; @@ -982,6 +1015,8 @@ vp->media_override = 7; if (option >= 0) { vp->media_override = ((option & 7) == 2) ? 0 : option & 15; + if (vp->media_override != 7) + vp->medialock = 1; vp->full_duplex = (option & 0x200) ? 1 : 0; vp->bus_master = (option & 16) ? 
1 : 0; } @@ -1049,7 +1084,7 @@ EL3WINDOW(4); step = (inb(ioaddr + Wn4_NetDiag) & 0x1e) >> 1; - printk(KERN_INFO " product code '%c%c' rev %02x.%d date %02d-" + printk(KERN_INFO " product code %02x%02x rev %02x.%d date %02d-" "%02d-%02d\n", eeprom[6]&0xff, eeprom[6]>>8, eeprom[0x14], step, (eeprom[4]>>5) & 15, eeprom[4] & 31, eeprom[4]>>9); @@ -1059,8 +1094,12 @@ unsigned short n; fn_st_addr = pci_resource_start (pdev, 2); - if (fn_st_addr) + if (fn_st_addr) { vp->cb_fn_base = ioremap(fn_st_addr, 128); + retval = -ENOMEM; + if (!vp->cb_fn_base) + goto free_ring; + } printk(KERN_INFO "%s: CardBus functions mapped %8.8lx->%p\n", dev->name, fn_st_addr, vp->cb_fn_base); EL3WINDOW(2); @@ -1167,21 +1206,43 @@ /* The 3c59x-specific entries in the device structure. */ dev->open = vortex_open; - dev->hard_start_xmit = vp->full_bus_master_tx ? - boomerang_start_xmit : vortex_start_xmit; + if (vp->full_bus_master_tx) { + dev->hard_start_xmit = boomerang_start_xmit; +#ifndef CONFIG_HIGHMEM + /* Actually, it still should work with iommu. */ + dev->features |= NETIF_F_SG; +#endif + if (((hw_checksums[card_idx] == -1) && (vp->drv_flags & HAS_HWCKSM)) || + (hw_checksums[card_idx] == 1)) + dev->features |= NETIF_F_IP_CSUM; + } else { + dev->hard_start_xmit = vortex_start_xmit; + } + + if (vortex_debug > 0) { + printk(KERN_INFO "%s: scatter/gather %sabled. h/w checksums %sabled\n", + dev->name, + (dev->features & NETIF_F_SG) ? "en":"dis", + (dev->features & NETIF_F_IP_CSUM) ? 
"en":"dis"); + } + dev->stop = vortex_close; dev->get_stats = vortex_get_stats; dev->do_ioctl = vortex_ioctl; dev->set_multicast_list = set_rx_mode; dev->tx_timeout = vortex_tx_timeout; dev->watchdog_timeo = (watchdog * HZ) / 1000; -// publish_netdev(dev); return 0; +free_ring: + pci_free_consistent(pdev, + sizeof(struct boom_rx_desc) * RX_RING_SIZE + + sizeof(struct boom_tx_desc) * TX_RING_SIZE, + vp->rx_ring, + vp->rx_ring_dma); free_region: if (vp->must_free_region) release_region(ioaddr, vci->io_size); -// withdraw_netdev(dev); unregister_netdev(dev); kfree (dev); printk(KERN_ERR PFX "vortex_probe1 fails. Returns %d\n", retval); @@ -1202,7 +1263,8 @@ /* OK, that didn't work. Do it the slow way. One second */ for (i = 0; i < 100000; i++) { if (!(inw(dev->base_addr + EL3_STATUS) & CmdInProgress)) { - printk(KERN_INFO "%s: command 0x%04x took %d usecs! Please tell andrewm@uow.edu.au\n", + if (vortex_debug > 1) + printk(KERN_INFO "%s: command 0x%04x took %d usecs\n", dev->name, cmd, i * 10); return; } @@ -1478,6 +1540,8 @@ printk(KERN_DEBUG "dev->watchdog_timeo=%d\n", dev->watchdog_timeo); } + if (vp->medialock) + goto leave_media_alone; disable_irq(dev->irq); old_window = inw(ioaddr + EL3_CMD) >> 13; EL3WINDOW(4); @@ -1567,6 +1631,7 @@ EL3WINDOW(old_window); enable_irq(dev->irq); +leave_media_alone: if (vortex_debug > 2) printk(KERN_DEBUG "%s: Media selection timer finished, %s.\n", dev->name, media_tbl[dev->if_port].name); @@ -1619,18 +1684,12 @@ vp->stats.tx_errors++; if (vp->full_bus_master_tx) { - if (vortex_debug > 0) - printk(KERN_DEBUG "%s: Resetting the Tx ring pointer.\n", - dev->name); + printk(KERN_DEBUG "%s: Resetting the Tx ring pointer.\n", dev->name); if (vp->cur_tx - vp->dirty_tx > 0 && inl(ioaddr + DownListPtr) == 0) outl(vp->tx_ring_dma + (vp->dirty_tx % TX_RING_SIZE) * sizeof(struct boom_tx_desc), ioaddr + DownListPtr); - if (vp->tx_full && (vp->cur_tx - vp->dirty_tx <= TX_RING_SIZE - 1)) { - vp->tx_full = 0; + if (vp->cur_tx - vp->dirty_tx < 
TX_RING_SIZE) netif_wake_queue (dev); - } - if (vp->tx_full) - netif_stop_queue (dev); if (vp->drv_flags & IS_BOOMERANG) outb(PKT_BUF_SZ>>8, ioaddr + TxFreeThreshold); outw(DownUnstall, ioaddr + EL3_CMD); @@ -1820,17 +1879,57 @@ dev->name, vp->cur_tx); } - if (vp->tx_full) { + if (vp->cur_tx - vp->dirty_tx >= TX_RING_SIZE) { if (vortex_debug > 0) - printk(KERN_WARNING "%s: Tx Ring full, refusing to send buffer.\n", + printk(KERN_WARNING "%s: BUG! Tx Ring full, refusing to send buffer.\n", dev->name); + netif_stop_queue(dev); return 1; } + vp->tx_skbuff[entry] = skb; + vp->tx_ring[entry].next = 0; +#if DO_ZEROCOPY + if (skb->ip_summed != CHECKSUM_HW) + vp->tx_ring[entry].status = cpu_to_le32(skb->len | TxIntrUploaded); + else { + vp->tx_ring[entry].status = cpu_to_le32(skb->len | TxIntrUploaded | AddTCPChksum); + vp->stats.rx_compressed++; + } + + if (!skb_shinfo(skb)->nr_frags) { + vp->tx_ring[entry].frag[0].addr = cpu_to_le32(pci_map_single(vp->pdev, skb->data, + skb->len, PCI_DMA_TODEVICE)); + vp->tx_ring[entry].frag[0].length = cpu_to_le32(skb->len | LAST_FRAG); + } else { + int i; + + vp->stats.tx_compressed++; + + vp->tx_ring[entry].frag[0].addr = cpu_to_le32(pci_map_single(vp->pdev, skb->data, + skb->len-skb->data_len, PCI_DMA_TODEVICE)); + vp->tx_ring[entry].frag[0].length = cpu_to_le32(skb->len-skb->data_len); + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + + vp->tx_ring[entry].frag[i+1].addr = + cpu_to_le32(pci_map_single(vp->pdev, + (void*)page_address(frag->page) + frag->page_offset, + frag->size, PCI_DMA_TODEVICE)); + + if (i == skb_shinfo(skb)->nr_frags-1) + vp->tx_ring[entry].frag[i+1].length = cpu_to_le32(frag->size|LAST_FRAG); + else + vp->tx_ring[entry].frag[i+1].length = cpu_to_le32(frag->size); + } + } +#else vp->tx_ring[entry].addr = cpu_to_le32(pci_map_single(vp->pdev, skb->data, skb->len, PCI_DMA_TODEVICE)); vp->tx_ring[entry].length = cpu_to_le32(skb->len | LAST_FRAG); 
vp->tx_ring[entry].status = cpu_to_le32(skb->len | TxIntrUploaded); +#endif spin_lock_irqsave(&vp->lock, flags); /* Wait for the stall to complete. */ @@ -1838,18 +1937,19 @@ prev_entry->next = cpu_to_le32(vp->tx_ring_dma + entry * sizeof(struct boom_tx_desc)); if (inl(ioaddr + DownListPtr) == 0) { outl(vp->tx_ring_dma + entry * sizeof(struct boom_tx_desc), ioaddr + DownListPtr); - queued_packet++; + vp->queued_packet++; } vp->cur_tx++; if (vp->cur_tx - vp->dirty_tx > TX_RING_SIZE - 1) { - vp->tx_full = 1; netif_stop_queue (dev); } else { /* Clear previous interrupt enable. */ #if defined(tx_interrupt_mitigation) + /* Dubious. If in boomeang_interrupt "faster" cyclone ifdef + * were selected, this would corrupt DN_COMPLETE. No? + */ prev_entry->status &= cpu_to_le32(~TxIntrUploaded); #endif - /* netif_start_queue (dev); */ /* AKPM: redundant? */ } outw(DownUnstall, ioaddr + EL3_CMD); spin_unlock_irqrestore(&vp->lock, flags); @@ -2032,9 +2132,17 @@ if (vp->tx_skbuff[entry]) { struct sk_buff *skb = vp->tx_skbuff[entry]; - +#if DO_ZEROCOPY + int i; + for (i=0; i<=skb_shinfo(skb)->nr_frags; i++) + pci_unmap_single(vp->pdev, + le32_to_cpu(vp->tx_ring[entry].frag[i].addr), + le32_to_cpu(vp->tx_ring[entry].frag[i].length)&0xFFF, + PCI_DMA_TODEVICE); +#else pci_unmap_single(vp->pdev, le32_to_cpu(vp->tx_ring[entry].addr), skb->len, PCI_DMA_TODEVICE); +#endif dev_kfree_skb_irq(skb); vp->tx_skbuff[entry] = 0; } else { @@ -2044,10 +2152,9 @@ dirty_tx++; } vp->dirty_tx = dirty_tx; - if (vp->tx_full && (vp->cur_tx - dirty_tx <= TX_RING_SIZE - 1)) { + if (vp->cur_tx - dirty_tx <= TX_RING_SIZE - 1) { if (vortex_debug > 6) printk(KERN_DEBUG "boomerang_interrupt: clearing tx_full\n"); - vp->tx_full = 0; netif_wake_queue (dev); } } @@ -2199,14 +2306,14 @@ memcpy(skb_put(skb, pkt_len), vp->rx_skbuff[entry]->tail, pkt_len); - rx_copy++; + vp->rx_copy++; } else { /* Pass up the skbuff already on the Rx ring. 
*/ skb = vp->rx_skbuff[entry]; vp->rx_skbuff[entry] = NULL; skb_put(skb, pkt_len); pci_unmap_single(vp->pdev, dma, PKT_BUF_SZ, PCI_DMA_FROMDEVICE); - rx_nocopy++; + vp->rx_nocopy++; } skb->protocol = eth_type_trans(skb, dev); { /* Use hardware checksum info. */ @@ -2215,7 +2322,7 @@ (csum_bits == (IPChksumValid | TCPChksumValid) || csum_bits == (IPChksumValid | UDPChksumValid))) { skb->ip_summed = CHECKSUM_UNNECESSARY; - rx_csumhits++; + vp->rx_csumhits++; } } netif_rx(skb); @@ -2320,9 +2427,18 @@ dev->name, inw(ioaddr + EL3_STATUS), inb(ioaddr + TxStatus)); printk(KERN_DEBUG "%s: vortex close stats: rx_nocopy %d rx_copy %d" " tx_queued %d Rx pre-checksummed %d.\n", - dev->name, rx_nocopy, rx_copy, queued_packet, rx_csumhits); + dev->name, vp->rx_nocopy, vp->rx_copy, vp->queued_packet, vp->rx_csumhits); } +#if DO_ZEROCOPY + if ( vp->rx_csumhits && + ((vp->drv_flags & HAS_HWCKSM) == 0) && + (hw_checksums[vp->card_idx] == -1)) { + printk(KERN_WARNING "%s supports hardware checksums, and we're not using them!\n", dev->name); + printk(KERN_WARNING "Please see http://www.uow.edu.au/~andrewm/zerocopy.html\n"); + } +#endif + free_irq(dev->irq, dev); if (vp->full_bus_master_rx) { /* Free Boomerang bus master Rx buffers. */ @@ -2335,14 +2451,24 @@ } } if (vp->full_bus_master_tx) { /* Free Boomerang bus master Tx buffers. */ - for (i = 0; i < TX_RING_SIZE; i++) + for (i = 0; i < TX_RING_SIZE; i++) { if (vp->tx_skbuff[i]) { struct sk_buff *skb = vp->tx_skbuff[i]; +#if DO_ZEROCOPY + int k; + for (k=0; k<=skb_shinfo(skb)->nr_frags; k++) + pci_unmap_single(vp->pdev, + le32_to_cpu(vp->tx_ring[i].frag[k].addr), + le32_to_cpu(vp->tx_ring[i].frag[k].length)&0xFFF, + PCI_DMA_TODEVICE); +#else pci_unmap_single(vp->pdev, le32_to_cpu(vp->tx_ring[i].addr), skb->len, PCI_DMA_TODEVICE); +#endif dev_kfree_skb(skb); vp->tx_skbuff[i] = 0; } + } } vp->open = 0; @@ -2360,7 +2486,6 @@ int i; int stalled = inl(ioaddr + PktStatus) & 0x04; /* Possible racy. 
But it's only debug stuff */ - wait_for_completion(dev, DownStall); printk(KERN_ERR " Flags; bus-master %d, full %d; dirty %d(%d) " "current %d(%d).\n", vp->full_bus_master_tx, vp->tx_full, @@ -2369,10 +2494,15 @@ printk(KERN_ERR " Transmit list %8.8x vs. %p.\n", inl(ioaddr + DownListPtr), &vp->tx_ring[vp->dirty_tx % TX_RING_SIZE]); + wait_for_completion(dev, DownStall); for (i = 0; i < TX_RING_SIZE; i++) { printk(KERN_ERR " %d: @%p length %8.8x status %8.8x\n", i, &vp->tx_ring[i], +#if DO_ZEROCOPY + le32_to_cpu(vp->tx_ring[i].frag[0].length), +#else le32_to_cpu(vp->tx_ring[i].length), +#endif le32_to_cpu(vp->tx_ring[i].status)); } if (!stalled) @@ -2634,7 +2764,13 @@ * here */ unregister_netdev(dev); - outw(TotalReset, dev->base_addr + EL3_CMD); + /* Should really use wait_for_completion() here */ + outw((vp->drv_flags & EEPROM_NORESET) ? (TotalReset|0x10) : TotalReset, dev->base_addr + EL3_CMD); + pci_free_consistent(pdev, + sizeof(struct boom_rx_desc) * RX_RING_SIZE + + sizeof(struct boom_tx_desc) * TX_RING_SIZE, + vp->rx_ring, + vp->rx_ring_dma); if (vp->must_free_region) release_region(dev->base_addr, vp->io_size); kfree(dev); @@ -2707,7 +2843,6 @@ module_init(vortex_init); module_exit(vortex_cleanup); - /* --- linux-2.4.1-pre10/Documentation/networking/vortex.txt Sun Oct 15 01:27:35 2000 +++ linux-akpm/Documentation/networking/vortex.txt Sat Jan 27 00:11:11 2001 @@ -116,7 +116,11 @@ full_duplex=N1,N2,N3... Similar to bit 9 of 'options'. Forces the corresponding card into - full-duplex mode. + full-duplex mode. Please use this in preference to the `options' + parameter. + + In fact, please don't use this at all! You're better off getting + autonegotiation working properly. flow_ctrl=N1,N2,N3... @@ -156,6 +160,27 @@ is exceeded the interrupt service routine gives up and generates a warning message "eth0: Too much work in interrupt". +hw_checksums=N1,N2,N3,... + + Recent 3com NICs are able to generate IPv4, TCP and UDP checksums + in hardware. 
Linux has used the Rx checksumming for a long time. + The "zero copy" patch which is planned for kernel 2.4.1 allows you to + make use of the transmit checksumming as well. + + The driver is set up so that, when the zerocopy patch is applied, + all Tornado and Cyclone devices will use Tx checksums. + + This module parameter has been provided so you can override this + decision. If you think that Tx checksums are causing a problem, you + may disable the feature with `hw_checksums=0'. + + If you think your NIC should be performing Tx checksumming and the + driver isn't enabling it, you can force the use of hardware Tx + checksumming with `hw_checksums=1'. + + The driver drops a message in the logfiles to indicate whether or + not it is using hardware scatter/gather and hardware Tx checksums. + compaq_ioaddr=N compaq_irq=N compaq_device_id=N @@ -168,7 +193,29 @@ decides that the transmitter has become stuck and needs to be reset. This is mainly for debugging purposes, although it may be advantageous to increase this value on LANs which have very high collision rates. - The default value is 400 (0.4 seconds). + The default value is 5000 (5.0 seconds). + + +Media selection +--------------- + +A number of the older NICs such as the 3c590 and 3c900 series have +10base2 and AUI interfaces. + +Prior to January, 2001 this driver would autoeselect the 10base2 or AUI +port if it didn't detect activity on the 10baseT port. It would then +get stuck on the 10base2 port and a driver reload was necessary to +switch back to 10baseT. This behaviour could not be prevented with a +module option override. + +Later (current) versions of the driver _do_ support locking of the +media type. So if you load the driver module with + + modprobe 3c59x options=0 + +it will permanently select the 10baseT port. Automatic selection of +other media types does not occur. 
+ Additional resources -------------------- @@ -193,7 +240,6 @@ NIC's Media Independent Interface subsystem: http://www.scyld.com/diag/#mii-diag - http://cesdis.gsfc.nasa.gov/linux/diag/#mii-diag 3Com's documentation for many NICs, including the ones supported by this driver is available at @@ -222,6 +268,21 @@ tree parameter for the port the machine is plugged into to 'portfast' mode. Otherwise, the negotiation fails. This has been an issue we've noticed for a while but haven't had the time to track down. + + Cisco switches (Jeff Busch ) + + My "standard config" for ports to which PC's/servers connect directly: + + interface FastEthernet0/N + description machinename + load-interval 30 + spanning-tree portfast + + If autonegotiation is a problem, you may need to specify "speed + 100" and "duplex full" as well (or "speed 10" and "duplex half"). + + WARNING: DO NOT hook up hubs/switches/bridges to these + specially-configured ports! The switch will become very confused. Reporting and diagnosing problems From owner-netdev@oss.sgi.com Fri Jan 26 06:43:15 2001 Received: by oss.sgi.com id ; Fri, 26 Jan 2001 06:42:55 -0800 Received: from pizda.ninka.net ([216.101.162.242]:20114 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 06:42:29 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id GAA11257; Fri, 26 Jan 2001 06:41:19 -0800 From: "David S. 
Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14961.35983.58143.191497@pizda.ninka.net> Date: Fri, 26 Jan 2001 06:41:19 -0800 (PST) To: Andrew Morton Cc: Alexey Kuznetsov , "netdev@oss.sgi.com" Subject: Re: [prepatch] 3c59x.c In-Reply-To: <3A71872E.5DBB4C7B@uow.edu.au> References: <3A71872E.5DBB4C7B@uow.edu.au> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 642 Lines: 22 Andrew Morton writes: > Please let me know if this all looks sane and I'll punt > it over to Linus. Or we could roll them together if > you're confident that zerocopy is going in. I'd like > to get the unrelated fixes into 2.4.1.... Looks fine. Keep the patch together, it'll make my diffs smaller too :-) Oh yes, and fix this please: > + The "zero copy" patch which is planned for kernel 2.4.1 allows you to > + make use of the transmit checksumming as well. The zerocopy patch is currently scheduled for 2.4.2 at the earliest :-))) 2.4.1 is still bug fixing only, no feature adds. Later, David S. 
Miller davem@redhat.com

From owner-netdev@oss.sgi.com Fri Jan 26 09:46:55 2001
Received: by oss.sgi.com id ; Fri, 26 Jan 2001 09:46:36 -0800
Received: from minus.inr.ac.ru ([193.233.7.97]:57354 "HELO ms2.inr.ac.ru")
    by oss.sgi.com with SMTP id ; Fri, 26 Jan 2001 09:46:31 -0800
Received: (from kuznet@localhost)
    by ms2.inr.ac.ru (8.6.13/ANK) id UAA27262;
    Fri, 26 Jan 2001 20:46:03 +0300
From: kuznet@ms2.inr.ac.ru
Message-Id: <200101261746.UAA27262@ms2.inr.ac.ru>
Subject: Re: zerocopy changes in 3c59x.c
To: andrewm@uow.edu.au (Andrew Morton)
Date: Fri, 26 Jan 2001 20:46:03 +0300 (MSK)
Cc: netdev@oss.sgi.com
In-Reply-To: <3A714788.82C064BD@uow.edu.au> from "Andrew Morton" at Jan 26, 1 08:46:48 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 884
Lines: 30

Hello!

> /* Tx timeout interval (millisecs) */
> -static int watchdog = 400;
> +static int watchdog = 5000;
>
> Five second transmit timeout. Why is this?

I would ask: why 400msec? 8)

All the drivers use much larger values. 5 seconds is good one.
Actually, I changed this by _your_ advice? Do you remember? 8)

BTW it really helped to avoid watchdog, when card had some spurious stalls
due to excessive collisions or those link problems.

> vp->stats.rx_compressed++;
>
> each time we send a frame with hardware checksum? This is the
> transmit path, not the receive path. It seems we're abusing
> some ppp-related stats here. What's up?

Is 3com ppp driver? 8) No? Then you can reuse this value
for any other purposes.

All this debugging statistics should be deleted. I removed it from acenic,
for example. It was interesting only at the first stage.
Alexey From owner-netdev@oss.sgi.com Fri Jan 26 09:51:35 2001 Received: by oss.sgi.com id ; Fri, 26 Jan 2001 09:51:25 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:59914 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Fri, 26 Jan 2001 09:51:12 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id UAA27324; Fri, 26 Jan 2001 20:50:56 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101261750.UAA27324@ms2.inr.ac.ru> Subject: Re: zerocopy changes in 3c59x.c To: andrewm@uow.edu.au (Andrew Morton) Date: Fri, 26 Jan 2001 20:50:56 +0300 (MSK) Cc: davem@redhat.com, netdev@oss.sgi.com In-Reply-To: <3A717B44.82C66152@uow.edu.au> from "Andrew Morton" at Jan 27, 1 00:27:32 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 100 Lines: 7 Hello! > Is there ever a situation in which these numbers will differ? Yes. In theory. 8) Alexey From owner-netdev@oss.sgi.com Fri Jan 26 10:05:26 2001 Received: by oss.sgi.com id ; Fri, 26 Jan 2001 10:05:07 -0800 Received: from mail.iwr.uni-heidelberg.de ([129.206.104.30]:15275 "EHLO mail.iwr.uni-heidelberg.de") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 10:04:48 -0800 Received: from kenzo.iwr.uni-heidelberg.de (IDENT:root@kenzo.iwr.uni-heidelberg.de [129.206.120.29]) by mail.iwr.uni-heidelberg.de (8.11.1/8.11.1) with ESMTP id f0QI4j511029; Fri, 26 Jan 2001 19:04:46 +0100 (MET) Received: from localhost (bogdan@localhost) by kenzo.iwr.uni-heidelberg.de (8.9.3/8.9.3) with ESMTP id TAA25258; Fri, 26 Jan 2001 19:04:45 +0100 Date: Fri, 26 Jan 2001 19:04:45 +0100 (CET) From: Bogdan Costescu To: cc: Andrew Morton , Subject: Re: zerocopy changes in 3c59x.c In-Reply-To: <200101261746.UAA27262@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1037 Lines: 28 On 
Fri, 26 Jan 2001 kuznet@ms2.inr.ac.ru wrote: > > Five second transmit timeout. Why is this? > I would ask: why 400msec? 8) That was the original value in Don's older drivers. I think that it was computed based on packet transmission time, inter-packet gap and so on. But there is a difference: Don didn't have Tx mitigation enabled by default, so a packet going through would trigger an interrupt which would clear tbusy and tx_full. With Tx mitigation, for a filled-up ring and collisions, the interrupt might be delayed past Tx timeout value. > BTW it really helped to avoid watchdog, when card had some spurious stalls > due to excessive collisions or those link problems. Andrew has enabled Tx mitigation by default to reduce interrupt rate under load. See above. Sincerely, Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De From owner-netdev@oss.sgi.com Fri Jan 26 13:38:28 2001 Received: by oss.sgi.com id ; Fri, 26 Jan 2001 13:38:07 -0800 Received: from foobar.napster.com ([64.124.41.10]:30727 "EHLO foobar.napster.com") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 13:38:03 -0800 Received: from wagner.napster.com (mail.napster.com [63.108.185.112]) by foobar.napster.com (8.9.3/8.9.3) with ESMTP id NAA09137; Fri, 26 Jan 2001 13:37:57 -0800 Received: from napster.com (gw.napster.com [63.108.185.120]) by wagner.napster.com (8.9.3/8.9.3) with ESMTP id NAA12106; Fri, 26 Jan 2001 13:37:57 -0800 Message-ID: <3A71EE16.936AC0F3@napster.com> Date: Fri, 26 Jan 2001 13:37:26 -0800 From: Jordan Mendelson Organization: Napster, Inc. 
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i686) X-Accept-Language: en MIME-Version: 1.0 To: Chris Wedgwood CC: netdev@oss.sgi.com Subject: Re: TCP Performance 2.4.0 <-> Win98 Followup References: <3A6E34EA.4AF27AA8@napster.com> <20010126011729.A9441@metastasis.f00f.org> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1729 Lines: 42 Chris Wedgwood wrote: > > On Tue, Jan 23, 2001 at 05:50:34PM -0800, Jordan Mendelson wrote: > > Apparently, the performance problem with Win98 <-> 2.4.0 only > happens with specific ISPs. The MTU and SACK settings didn't > appear to have any effect. > > Can you easily get packet dumps from both good and bad situations? I've created 2 sets of tcpdump logs. One using 2.2.16 which represents the "good" situation. The other from 2.4.0 representing the "bad" situation. The client in the logs is a Windows 98SE machine dialed up with a local bay area Earthlink number. The 2.2.16 and 2.4.0 logs were both created during the same session. The logs show the Napster sign on process. When you first connect to the Napster service, the client sends a login string. After the server authenticates the user, it sends an acknowledgement plus the message of the day. The message of the day is rather large, but there exists no throttling of it on the backend. It is sent as a set of send() calls, one for each line with non-blocking sockets. The client is rather inefficient in it's read handling; two recv() calls per command are done. With 2.2.16 and the Windows client, the MOTD is sent in under a second and displayed in the client. With 2.4.0, the MOTD takes several times longer to be fully transmitted to the client. There are multiple retransmits, apparently corrupted packets, etc. 
I did a traceroute from the windows client to the Napster servers to get the IP of the terminal server the Windows client was dialed into, but unfortunately it didn't respond to the Windows tracert, so I can't run nmap on it to check the OS. I've placed the logs at http://jordy.napster.com/linux/ to save bandwidth. Jordan From owner-netdev@oss.sgi.com Fri Jan 26 21:38:49 2001 Received: by oss.sgi.com id ; Fri, 26 Jan 2001 21:38:40 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:46326 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 21:38:24 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id QAA24293; Sat, 27 Jan 2001 16:37:57 +1100 (EST) Message-ID: <3A726087.764CC02E@uow.edu.au> Date: Sat, 27 Jan 2001 16:45:43 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: lkml , "netdev@oss.sgi.com" Subject: sendfile+zerocopy: fairly sexy (nothing to do with ECN) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2265 Lines: 58 (Please keep netdev copied, else Jamal will grump at you, and you don't want that). I've whacked together some tools to measure TCP throughput with both sendfile and read/write. I've tested with and without the zerocopy patch. The CPU load figures are very accurate: the tool uses a `subtractive' algorithm which measures how much CPU is left over by the networking code, rather than trying to measure how much CPU load the networking stuff actually takes, if you see what I mean. This accounts accurately for CPU load, interrupts, softirq processing, memory bandwidth utilisation and cache pollution. The client is a 650 MHz PIII. The NIC is a 3CCFE575CT Cardbus 3com. It supports Scatter/Gather and hardware checksums. 
The NIC's interrupt is shared with the Cardbus controller, so this
will impact throughput slightly.

The kernels which were tested were 2.4.1-pre10 with and without the
zerocopy patch. We only look at client load (the TCP sender).

The link throughput was 11.5 mbytes/sec at all times (saturated 100baseT).

2.4.1-pre10-vanilla, using sendfile():          29.6% CPU
2.4.1-pre10-vanilla, using read()/write():      34.5% CPU

2.4.1-pre10+zerocopy, using sendfile():         18.2% CPU
2.4.1-pre10+zerocopy, using read()/write():     38.1% CPU

2.4.1-pre10+zerocopy, using sendfile():         22.9% CPU  * hardware tx checksums disabled
2.4.1-pre10+zerocopy, using read()/write():     39.2% CPU  * hardware tx checksums disabled

What can we conclude?

- sendfile is 10% cheaper than read()-then-write() on 2.4.1-pre10.

- sendfile() with the zerocopy patch is 40% cheaper than
  sendfile() without the zerocopy patch.

- hardware Tx checksums don't make much difference. hmm...

Bear in mind that the 3c59x driver uses a one-interrupt-per-packet
algorithm. Mitigation reduces this to 0.3 ints/packet.
So we're absorbing 4,500 interrupts/sec while processing
12,000 packets/sec. gigE NICs do much better mitigation than
this and the relative benefits of zerocopy will be much higher
for these. Hopefully Jamal can do some testing.

BTW: I could not reproduce Jamal's oops when sending large
files (2 gigs with sendfile()).

The test tool is, of course, documented [ :-)/2 ].
It's at

    http://www.uow.edu.au/~andrewm/linux/#zc

-

From owner-netdev@oss.sgi.com Fri Jan 26 22:20:39 2001
Received: by oss.sgi.com id ; Fri, 26 Jan 2001 22:20:29 -0800
Received: from vitelus.com ([64.81.36.147]:12044 "EHLO vitelus.com")
    by oss.sgi.com with ESMTP id ; Fri, 26 Jan 2001 22:20:12 -0800
Received: from aaronl by vitelus.com with local (Exim 3.20 #1 (Debian))
    id 14MOiR-00037v-00; Fri, 26 Jan 2001 22:20:03 -0800
Date: Fri, 26 Jan 2001 22:20:03 -0800
From: Aaron Lehmann
To: Andrew Morton
Cc: lkml , "netdev@oss.sgi.com"
Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN)
Message-ID: <20010126222003.A11994@vitelus.com>
References: <3A726087.764CC02E@uow.edu.au>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.3.12i
In-Reply-To: <3A726087.764CC02E@uow.edu.au>; from andrewm@uow.edu.au on Sat, Jan 27, 2001 at 04:45:43PM +1100
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 640
Lines: 17

On Sat, Jan 27, 2001 at 04:45:43PM +1100, Andrew Morton wrote:
> 2.4.1-pre10-vanilla, using read()/write():      34.5% CPU
> 2.4.1-pre10+zerocopy, using read()/write():     38.1% CPU

Am I right to be bothered by this?

The majority of Unix network traffic is handled with read()/write().
Why would zerocopy slow that down?

If zerocopy is simply unoptimized, that's fine for now. But if the
problem is inherent in the implementation or design, that might be a
problem. Any patch which incurs a significant slowdown on traditional
networking should be controversial.

Aaron Lehmann

please ignore me if I don't know what I'm talking about.
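
For scale, the two read()/write() figures quoted above can be turned into a relative change. A minimal sketch in C, assuming the two percentages are directly comparable because the link was saturated in both runs; the function name is hypothetical and is not taken from any tool in this thread:

```c
#include <assert.h>

/* Relative change of `after` with respect to `before`, in percent.
 * A positive result means `after` costs more CPU than `before`.
 * (Hypothetical helper, for illustration only.) */
double relative_change_pct(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

relative_change_pct(34.5, 38.1) comes out at roughly +10.4, i.e. about a tenth more CPU for read()/write() with the patch applied on this hardware.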
From owner-netdev@oss.sgi.com Sat Jan 27 00:11:59 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 00:11:50 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:21424 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 00:11:37 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id TAA07472; Sat, 27 Jan 2001 19:11:15 +1100 (EST) Message-ID: <3A728475.34CF841@uow.edu.au> Date: Sat, 27 Jan 2001 19:19:01 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: Aaron Lehmann CC: lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) References: <3A726087.764CC02E@uow.edu.au>, <3A726087.764CC02E@uow.edu.au>; from andrewm@uow.edu.au on Sat, Jan 27, 2001 at 04:45:43PM +1100 <20010126222003.A11994@vitelus.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2378 Lines: 63 Aaron Lehmann wrote: > > On Sat, Jan 27, 2001 at 04:45:43PM +1100, Andrew Morton wrote: > > 2.4.1-pre10-vanilla, using read()/write(): 34.5% CPU > > 2.4.1-pre10+zercopy, using read()/write(): 38.1% CPU > > Am I right to be bothered by this? > > The majority of Unix network traffic is handled with read()/write(). > Why would zerocopy slow that down? > > If zerocopy is simply unoptimized, that's fine for now. But if the > problem is inherent in the implementation or design, that might be a > problem. Any patch which incurs a signifigant slowdown on traditional > networking should be contraversial. Good point. The figures I quoted for the no-hw-checksum case were still using scatter/gather. That can be turned off as well and it makes it a tiny bit quicker. 
So the table is now:

2.4.1-pre10-vanilla, using sendfile():          29.6% CPU
2.4.1-pre10-vanilla, using read()/write():      34.5% CPU

2.4.1-pre10+zerocopy, using sendfile():         18.2% CPU
2.4.1-pre10+zerocopy, using read()/write():     38.1% CPU

2.4.1-pre10+zerocopy, using sendfile():         22.9% CPU  * hardware tx checksums disabled
2.4.1-pre10+zerocopy, using read()/write():     39.2% CPU  * hardware tx checksums disabled

2.4.1-pre10+zerocopy, using sendfile():         22.4% CPU  * hardware tx checksums and SG disabled
2.4.1-pre10+zerocopy, using read()/write():     38.5% CPU  * hardware tx checksums and SG disabled

But that's not relevant.

I just retested everything. Yes, the zerocopy patch does
appear to decrease the efficiency of TCP on
non-SG+checksumming hardware by 5% - 10%. Others need to
test...

With an RTL8139/8139too. CPU is 500MHz PII Celeron, uniprocessor:

2.4.1-pre10-vanilla, using sendfile():          43.8% CPU
2.4.1-pre10-vanilla, using read()/write():      54.1% CPU

2.4.1-pre10+zerocopy, using sendfile():         43.1% CPU
2.4.1-pre10+zerocopy, using read()/write():     55.5% CPU

Note that the 8139 only gets 10.8 Mbytes/sec here. It randomly
jumps up to 11.5 occasionally, but spends most of its time at 10.8.
Hard to know what to make of this. Of course, if you're using
an 8139 you don't care about performance anyway :)

Contradictory results. rtl8139 doesn't do Rx checksums,
and I think has an extra copy in the driver, so caching
effects may be obscuring things here.

I can test with eepro100 in a couple of days.
- From owner-netdev@oss.sgi.com Sat Jan 27 02:05:50 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 02:05:40 -0800 Received: from vp175097.reshsg.uci.edu ([128.195.175.97]:21770 "EHLO moisil.dev.hydraweb.com") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 02:05:27 -0800 Received: (from ionut@localhost) by moisil.dev.hydraweb.com (8.11.0/8.9.3) id f0RA5DX04357; Sat, 27 Jan 2001 02:05:13 -0800 Date: Sat, 27 Jan 2001 02:05:13 -0800 Message-Id: <200101271005.f0RA5DX04357@moisil.dev.hydraweb.com> From: Ion Badulescu To: Andrew Morton Cc: lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: <3A726087.764CC02E@uow.edu.au> User-Agent: tin/1.4.4-20000803 ("Vet for the Insane") (UNIX) (Linux/2.2.18 (i586)) Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 3456 Lines: 94 On Sat, 27 Jan 2001 16:45:43 +1100, Andrew Morton wrote: > The client is a 650 MHz PIII. The NIC is a 3CCFE575CT Cardbus 3com. > It supports Scatter/Gather and hardware checksums. The NIC's interrupt > is shared with the Cardbus controller, so this will impact throughput > slightly. > > The kernels which were tested were 2.4.1-pre10 with and without the > zerocopy patch. We only look at client load (the TCP sender). > > The link throughput was 11.5 mbytes/sec at all times (saturated 100baseT) > > 2.4.1-pre10-vanilla, using sendfile(): 29.6% CPU > 2.4.1-pre10-vanilla, using read()/write(): 34.5% CPU > > 2.4.1-pre10+zercopy, using sendfile(): 18.2% CPU > 2.4.1-pre10+zercopy, using read()/write(): 38.1% CPU > > 2.4.1-pre10+zercopy, using sendfile(): 22.9% CPU * hardware tx checksums disabled > 2.4.1-pre10+zercopy, using read()/write(): 39.2% CPU * hardware tx checksums disabled 750MHz PIII, Adaptec Starfire NIC, driver modified to use hardware sg+csum (both Tx/Rx), and Intel i82559 (eepro100), no hardware csum support, vanilla driver. 
The box has 512MB of RAM, and I'm using a 100MB file, so it's entirely cached.

starfire:
2.4.1-pre10+zerocopy, using sendfile():          9.6% CPU
2.4.1-pre10+zerocopy, using read()/write():     18.3%-29.6% CPU  * why so much variance?

2.4.1-pre10+zerocopy, using sendfile():         17.4% CPU        * hardware csum disabled
2.4.1-pre10+zerocopy, using read()/write():     16.5%-26.8% CPU  * idem, again why so much variance?

2.4.1-pre10-vanilla, using sendfile():          16.5% CPU
2.4.1-pre10-vanilla, using read()/write():      14.5%-24.5% CPU  * high variance again

eepro100:
2.4.1-pre10+zerocopy, using sendfile():         16.0% CPU
2.4.1-pre10+zerocopy, using read()/write():     15.0%-24.5% CPU  * why so much variance?

2.4.1-pre10-vanilla, using sendfile():          16.7% CPU
2.4.1-pre10-vanilla, using read()/write():      14.5%-24.6% CPU  * high variance again

The read+write case is really weird. I'm getting results like this:

CPU load: 27.9491
CPU load: 25.4763
CPU load: 15.8544
CPU load: 25.455
CPU load: 25.2072
CPU load: 15.8677
CPU load: 25.4896
CPU load: 25.2791
CPU load: 15.8837

i.e. 2 slow, 1 fast, 2 slow, 1 fast, and so on and so forth.

> What can we conclude?
>
> - sendfile is 10% cheaper than read()-then-write() on 2.4.1-pre10.

Hard to tell, with such inconclusive results...

> - sendfile() with the zerocopy patch is 40% cheaper than
>   sendfile() without the zerocopy patch.

Indeed. Close to 50% in fact.

> - hardware Tx checksums don't make much difference. hmm...

Actually it makes all the difference in the world for the starfire.
Interesting...

> Bear in mind that the 3c59x driver uses a one-interrupt-per-packet
> algorithm. Mitigation reduces this to 0.3 ints/packet.
> So we're absorbing 4,500 interrupts/sec while processing
> 12,000 packets/sec. gigE NICs do much better mitigation than
> this and the relative benefits of zerocopy will be much higher
> for these. Hopefully Jamal can do some testing.

Hmm.. the starfire also has quite advanced interrupt mitigation, but
I have not played with it. Maybe tomorrow.
So these results are with one-interrupt-per-packet. P.S. The starfire still doesn't like tinygrams (skb's with 1-byte fragments). Fortunately your test program doesn't seem to generate them. :-) Ion -- It is better to keep your mouth shut and be thought a fool, than to open it and remove all doubt. From owner-netdev@oss.sgi.com Sat Jan 27 02:10:20 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 02:10:01 -0800 Received: from vp175097.reshsg.uci.edu ([128.195.175.97]:22794 "EHLO moisil.dev.hydraweb.com") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 02:09:51 -0800 Received: (from ionut@localhost) by moisil.dev.hydraweb.com (8.11.0/8.9.3) id f0RA9fb04363; Sat, 27 Jan 2001 02:09:41 -0800 Date: Sat, 27 Jan 2001 02:09:41 -0800 Message-Id: <200101271009.f0RA9fb04363@moisil.dev.hydraweb.com> From: Ion Badulescu To: Andrew Morton Cc: lkml , "netdev@oss.sgi.com" , Aaron Lehmann Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: <3A728475.34CF841@uow.edu.au> User-Agent: tin/1.4.4-20000803 ("Vet for the Insane") (UNIX) (Linux/2.2.18 (i586)) Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 653 Lines: 19 On Sat, 27 Jan 2001 19:19:01 +1100, Andrew Morton wrote: > The figures I quoted for the no-hw-checksum case were still > using scatter/gather. That can be turned off as well and > it makes it a tiny bit quicker. Hmm. Are you sure the differences are not just noise? Unless you modified the zerocopy patch yourself, it won't use SG without checksums... In fact it would be interesting to revert that policy and see how much SG alone helps. Probably not much, since the CPU checksumming is close to onecopy. Ion -- It is better to keep your mouth shut and be thought a fool, than to open it and remove all doubt. 
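
The interrupt mitigation figures quoted earlier in this exchange can be cross-checked with one division. A minimal sketch in C; the function name is hypothetical, and the rates are the rounded numbers from the thread, not fresh measurements:

```c
#include <assert.h>

/* Interrupts taken per packet processed, given both rates per second.
 * (Hypothetical helper, for illustration only.) */
double ints_per_packet(double ints_per_sec, double packets_per_sec)
{
    assert(packets_per_sec > 0.0);
    return ints_per_sec / packets_per_sec;
}
```

With 4,500 interrupts/sec and 12,000 packets/sec this gives 0.375 interrupts per packet, a little above the quoted 0.3; both figures were presumably rounded.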
From owner-netdev@oss.sgi.com Sat Jan 27 02:33:21 2001
Received: by oss.sgi.com id ; Sat, 27 Jan 2001 02:33:12 -0800
Received: from horus.its.uow.edu.au ([130.130.68.25]:63710 "EHLO horus.its.uow.edu.au")
    by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 02:32:46 -0800
Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12])
    by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id VAA18818;
    Sat, 27 Jan 2001 21:32:06 +1100 (EST)
Message-ID: <3A72A578.28D27AED@uow.edu.au>
Date: Sat, 27 Jan 2001 21:39:52 +1100
From: Andrew Morton
X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586)
X-Accept-Language: en
MIME-Version: 1.0
To: Ion Badulescu
CC: lkml , "netdev@oss.sgi.com"
Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN)
References: <3A726087.764CC02E@uow.edu.au> <200101271005.f0RA5DX04357@moisil.dev.hydraweb.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 1048
Lines: 41

Ion Badulescu wrote:
>
> 2.4.1-pre10+zerocopy, using read()/write():     18.3%-29.6% CPU  * why so much variance?

The variance is presumably because of the naive read/write
implementation. It sucks in 16 megs and writes it out again.
With a 100 megabyte file you'll get aliasing effects between
the sampling interval and the client's activity.

You will get more repeatable results using smaller files. I'm just
sending /usr/local/bin/* ten times, with

    ./zcc -s otherhost -c /usr/local/bin/* -n10 -N2 -S

Maybe that 16 meg buffer should be shorter... Yes, making it
smaller smooths things out.

Heh, look at this. It's a simple read-some, send-some loop.
Plot CPU utilisation against the transfer size:

Size      %CPU

256       31
512       25
1024      22
2048      18
4096      17
8192      16
16384     18
32768     19
65536     21
128k      22
256k      22.5

8192 bytes is best.

I've added the `-b' option to zcc to set the transfer size. Same
URL.
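
The read-some, send-some loop described above might look roughly like the following. This is a sketch under assumptions, not the zcc source: the function name copy_loop and its bufsize parameter (standing in for the `-b' transfer size) are made up for illustration, and plain read()/write() on already-open file descriptors is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Copy everything from in_fd to out_fd, bufsize bytes at a time.
 * Returns the number of bytes copied, or -1 on error.
 * (Hypothetical helper, for illustration only.) */
ssize_t copy_loop(int in_fd, int out_fd, size_t bufsize)
{
    char *buf;
    ssize_t total = 0;

    assert(bufsize > 0);
    buf = malloc(bufsize);
    if (buf == NULL)
        return -1;

    for (;;) {
        /* read some ... */
        ssize_t n = read(in_fd, buf, bufsize);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            total = -1;
            break;
        }
        if (n == 0)             /* end of input */
            break;

        /* ... send some; write() may accept fewer bytes than asked */
        ssize_t done = 0;
        while (done < n) {
            ssize_t w = write(out_fd, buf + done, (size_t)(n - done));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                total = -1;
                goto out;
            }
            done += w;
        }
        total += n;
    }
out:
    free(buf);
    return total;
}
```

Calling it with bufsize set to 8192 corresponds to the best row in the table above; the sweet spot on other machines may well differ.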
-

From owner-netdev@oss.sgi.com Sat Jan 27 02:38:31 2001
Received: by oss.sgi.com id ; Sat, 27 Jan 2001 02:38:22 -0800
Received: from horus.its.uow.edu.au ([130.130.68.25]:38890 "EHLO horus.its.uow.edu.au")
    by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 02:38:02 -0800
Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12])
    by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id VAA19217;
    Sat, 27 Jan 2001 21:37:43 +1100 (EST)
Message-ID: <3A72A6C9.7D085A39@uow.edu.au>
Date: Sat, 27 Jan 2001 21:45:29 +1100
From: Andrew Morton
X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586)
X-Accept-Language: en
MIME-Version: 1.0
To: Ion Badulescu
CC: lkml , "netdev@oss.sgi.com" , Aaron Lehmann
Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN)
References: <3A728475.34CF841@uow.edu.au> <200101271009.f0RA9fb04363@moisil.dev.hydraweb.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing
Content-Length: 716
Lines: 22

Ion Badulescu wrote:
>
> On Sat, 27 Jan 2001 19:19:01 +1100, Andrew Morton wrote:
>
> > The figures I quoted for the no-hw-checksum case were still
> > using scatter/gather. That can be turned off as well and
> > it makes it a tiny bit quicker.
>
> Hmm. Are you sure the differences are not just noise?

I don't think so. It's all pretty repeatable.

> Unless you
> modified the zerocopy patch yourself, it won't use SG without
> checksums...

I believe it in fact does use SG when hardware tx checksums
are unavailable, but this capability will be removed RSN because
userspace can scribble on the pagecache after the checksum
has been calculated, and before the frame has hit the wire.
- From owner-netdev@oss.sgi.com Sat Jan 27 04:45:11 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 04:45:01 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:35509 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 04:44:39 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id HAA24094; Sat, 27 Jan 2001 07:43:31 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sat, 27 Jan 2001 07:43:31 -0500 (EST) From: jamal To: Andrew Morton cc: lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: <3A726087.764CC02E@uow.edu.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2282 Lines: 71 On Sat, 27 Jan 2001, Andrew Morton wrote: > (Please keep netdev copied, else Jamal will grump at you, and > you don't want that). > Thanks, Andrew ;-> Isnt netdev where networking stuff should be discussed? I think i give up and will join lk, RSN ;-> > The kernels which were tested were 2.4.1-pre10 with and without the > zerocopy patch. We only look at client load (the TCP sender). > > The link throughput was 11.5 mbytes/sec at all times (saturated 100baseT) > > 2.4.1-pre10-vanilla, using sendfile(): 29.6% CPU > 2.4.1-pre10-vanilla, using read()/write(): 34.5% CPU > > 2.4.1-pre10+zercopy, using sendfile(): 18.2% CPU > 2.4.1-pre10+zercopy, using read()/write(): 38.1% CPU > > 2.4.1-pre10+zercopy, using sendfile(): 22.9% CPU * hardware tx checksums disabled > 2.4.1-pre10+zercopy, using read()/write(): 39.2% CPU * hardware tx checksums disabled > > > What can we conclude? > > - sendfile is 10% cheaper than read()-then-write() on 2.4.1-pre10. > > - sendfile() with the zerocopy patch is 40% cheaper than > sendfile() without the zerocopy patch. 
> It is also useful to have both client and server stats. BTW, since the laptop (with the 3C card) is the client, the SG shouldnt kick in at all. > - hardware Tx checksums don't make much difference. hmm... > > Bear in mind that the 3c59x driver uses a one-interrupt-per-packet > algorithm. Mitigation reduces this to 0.3 ints/packet. > So we're absorbing 4,500 interrupts/sec while processing > 12,000 packets/sec. gigE NICs do much better mitigation than > this and the relative benefits of zerocopy will be much higher > for these. Hopefully Jamal can do some testing. > I dont have my babies right now, but as soon as i can get access to them > BTW: I could not reproduce Jamal's oops when sending large > files (2 gigs with sendfile()). Alexey was concerned about this. Good. But maybe it will still happen with my setupo. We'll see. > > The test tool is, of course, documented [ :-)/2 ]. It's at > > http://www.uow.edu.au/~andrewm/linux/#zc > I'll give this a shot later. Can you try with the sendfiled-ttcp? http://www.cyberus.ca/~hadi/ttcp-sf.tar.gz Anyways, you are NIC-challenged ;-> Get GigE. 100Mbps doesnt give much information. cheers, jamal From owner-netdev@oss.sgi.com Sat Jan 27 04:51:01 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 04:50:51 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:36533 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 04:50:43 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) 
with ESMTP id HAA24100; Sat, 27 Jan 2001 07:49:35 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sat, 27 Jan 2001 07:49:35 -0500 (EST) From: jamal To: Ion Badulescu cc: Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: <200101271005.f0RA5DX04357@moisil.dev.hydraweb.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 654 Lines: 29 On Sat, 27 Jan 2001, Ion Badulescu wrote: > > 750MHz PIII, Adaptec Starfire NIC, driver modified to use hardware sg+csum > (both Tx/Rx), and Intel i82559 (eepro100), no hardware csum support, > vanilla driver. > > The box has 512MB of RAM, and I'm using a 100MB file, so it's entirely cached. > > starfire: > 2.4.1-pre10+zerocopy, using sendfile(): 9.6% CPU > 2.4.1-pre10+zerocopy, using read()/write(): 18.3%-29.6% CPU * why so much variance? > What are your throughput numbers? Could you also, please, test using: http://www.cyberus.ca/~hadi/ttcp-sf.tar.gz post both sender and receiver data. Repeat each test about 5 times. 
cheers, jamal From owner-netdev@oss.sgi.com Sat Jan 27 05:22:02 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 05:21:52 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:61618 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 05:21:38 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id AAA01586; Sun, 28 Jan 2001 00:21:16 +1100 (EST) Message-ID: <3A72CD1E.32BB523F@uow.edu.au> Date: Sun, 28 Jan 2001 00:29:02 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) References: <3A726087.764CC02E@uow.edu.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1529 Lines: 40 jamal wrote: > > .. > It is also useful to have both client and server stats. > BTW, since the laptop (with the 3C card) is the client, the SG > shouldnt kick in at all. The `client' here is doing the sendfiling, so yes, the gathering occurs on the client. > ... > > The test tool is, of course, documented [ :-)/2 ]. It's at > > > > http://www.uow.edu.au/~andrewm/linux/#zc > > > > I'll give this a shot later. Can you try with the sendfiled-ttcp? > http://www.cyberus.ca/~hadi/ttcp-sf.tar.gz hmm.. I didn't bother with TCP_CORK because the files being sent are "much" larger than a frame. Guess I should. The problem with things like ttcp is the measurement of CPU load. If your network is so fast that your machine can't keep up then fine, raw throughput is a good measure. But if the link is saturated then normal process accounting doesn't cut it. For example, at 100 mbps, `top' says ttcp is chewing 4% CPU. But guess what? 
A low-priority process running on the same machine is in fact slowed down by 30%. top lies. Most of the cost of the networking layer is being accounted to swapper, and lost. And who accounts for cache eviction, bus utilisation, etc. We're better off measuring what's left behind, rather than measuring what is consumed. You can in fact do this with ttcp: run it with a super-high priority and run a little task in the background (dummyload.c in the above tarball does this). See how much the dummy task is slowed down wrt an unloaded system. It gets tricky on SMP though. - From owner-netdev@oss.sgi.com Sat Jan 27 06:16:51 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 06:16:42 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:38837 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 06:16:21 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id JAA24193; Sat, 27 Jan 2001 09:15:28 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sat, 27 Jan 2001 09:15:28 -0500 (EST) From: jamal To: Andrew Morton cc: lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: <3A72CD1E.32BB523F@uow.edu.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2607 Lines: 66 On Sun, 28 Jan 2001, Andrew Morton wrote: > jamal wrote: > > > > .. > > It is also useful to have both client and server stats. > > BTW, since the laptop (with the 3C card) is the client, the SG > > shouldnt kick in at all. > > The `client' here is doing the sendfiling, so yes, the > gathering occurs on the client. > OK, semantics. Maybe we should stick to sender and receiver. (server normally will translate to "serve" files) > > I'll give this a shot later. Can you try with the sendfiled-ttcp? 
> > http://www.cyberus.ca/~hadi/ttcp-sf.tar.gz > > hmm.. I didn't bother with TCP_CORK because the files being > sent are "much" larger than a frame. Guess I should. It doesn't make much sense to use sendfile without TCP_CORK. > The problem with things like ttcp is the measurement of CPU load. > If your network is so fast that your machine can't keep up then > fine, raw throughput is a good measure. But if the link is saturated > then normal process accounting doesn't cut it. ttcp's CPU measure is not the best. Part of my plan was to change that. It uses times(). So the measurement is not good. It is in fact not very reflective on SMP. The way to do it there is to break it down by CPU. Throughput: 100Mbps is really nothing. Linux never had a problem with 4-500Mbps file serving. So throughput is an important number. So is end-to-end latency, but in the file-serving case latency might not be a big deal, so ignore it. > For example, at 100 mbps, `top' says ttcp is chewing 4% CPU. But guess > what? A low-priority process running on the same machine is in fact > slowed down by 30%. top lies. Most of the cost of the networking layer > is being accounted to swapper, and lost. And who accounts for cache > eviction, bus utilisation, etc. We're better off measuring what's > left behind, rather than measuring what is consumed. > > You can in fact do this with ttcp: run it with a super-high priority > and run a little task in the background (dummyload.c in the above > tarball does this). See how much the dummy task is slowed down > wrt an unloaded system. It gets tricky on SMP though. > The best way to do CPU measurement is via /proc. The way top does it. You measure it from within your nettest program. This does measure what is "left behind" since your proggie is in user space. Actually, it shouldn't matter whether you do it from your test program or from dummyload.c. With dummyload you might have to sigkill the program every time a test terminates. 
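A minimal sketch of the /proc approach, broken down by CPU. It assumes the per-CPU lines in /proc/stat look like "cpuN user nice system idle" with values in jiffies; any fields after the first four are ignored, so time accounted elsewhere is not counted as busy. The struct and function names are made up for illustration and are not taken from ttcp or dummyload.c.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* One snapshot of a "cpuN user nice system idle" line (values in jiffies). */
struct cpu_sample {
    unsigned long user, nice, system, idle;
};

/* Parse one /proc/stat line. Returns the CPU index, or -1 if this is not
 * a per-CPU line (the aggregate "cpu ..." line is deliberately skipped). */
int parse_cpu_line(const char *line, struct cpu_sample *out)
{
    int cpu;

    if (strncmp(line, "cpu", 3) != 0 || !isdigit((unsigned char)line[3]))
        return -1;
    if (sscanf(line + 3, "%d %lu %lu %lu %lu", &cpu,
               &out->user, &out->nice, &out->system, &out->idle) != 5)
        return -1;
    return cpu;
}

/* Snapshot CPUs 0..max-1. Returns the highest CPU index seen plus one,
 * or -1 if /proc/stat cannot be opened. */
int read_cpu_samples(struct cpu_sample *samples, int max)
{
    char line[256];
    struct cpu_sample s;
    int n = 0, cpu;
    FILE *f = fopen("/proc/stat", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        cpu = parse_cpu_line(line, &s);
        if (cpu >= 0 && cpu < max) {
            samples[cpu] = s;
            if (cpu + 1 > n)
                n = cpu + 1;
        }
    }
    fclose(f);
    return n;
}

/* Percentage of non-idle jiffies between two snapshots of the same CPU. */
double busy_percent(const struct cpu_sample *before,
                    const struct cpu_sample *after)
{
    unsigned long busy = (after->user - before->user)
                       + (after->nice - before->nice)
                       + (after->system - before->system);
    unsigned long total = busy + (after->idle - before->idle);

    return total ? 100.0 * busy / total : 0.0;
}
```

Take one snapshot with read_cpu_samples() before the transfer and one after it, then call busy_percent() for each CPU index. Because idle time is counted system-wide, this sees load that is not charged to the test process itself.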
You also should break down utilization by CPU. cheers, jamal PS:- can you try it out with the ttcp testcode i posted? From owner-netdev@oss.sgi.com Sat Jan 27 10:55:02 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 10:54:43 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:23304 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Sat, 27 Jan 2001 10:54:24 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA02845; Sat, 27 Jan 2001 21:54:06 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101271854.VAA02845@ms2.inr.ac.ru> Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) To: andrewm@uow.EDU.AU (Andrew Morton) Date: Sat, 27 Jan 2001 21:54:06 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: <3A726087.764CC02E@uow.edu.au> from "Andrew Morton" at Jan 27, 1 08:45:00 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 551 Lines: 19 Hello! > 2.4.1-pre10+zercopy, using read()/write(): 38.1% CPU write() on zc card is worse than normal write() by definition. It generates split buffers. Split buffers are more expensive and we have to pay for this. You have paid too much for slow card though. 8) Do you measure load correctly? > 2.4.1-pre10+zercopy, using read()/write(): 39.2% CPU * hardware tx checksums disabled This is illegal combination of parameters. You force two memory accesses, doing this. The fact that it does not add to load is dubious. 
8)8) Alexey From owner-netdev@oss.sgi.com Sat Jan 27 12:11:33 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 12:11:22 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:40712 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Sat, 27 Jan 2001 12:11:02 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA03306; Sat, 27 Jan 2001 23:10:54 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101272010.XAA03306@ms2.inr.ac.ru> Subject: Re: TCP Performance 2.4.0 <-> Win98 Followup To: jordy@napster.COM (Jordan Mendelson) Date: Sat, 27 Jan 2001 23:10:54 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: <3A71EE16.936AC0F3@napster.com> from "Jordan Mendelson" at Jan 27, 1 00:45:02 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 326 Lines: 12 Hello! > I've placed the logs at http://jordy.napster.com/linux/ to save > bandwidth. Sorry. 23:08:06.840184 < 63.108.185.66 > 193.233.7.97: icmp: host 63.108.185.113 unreachable - admin prohibited filter Offending pkt: [|tcp] (DF) (ttl 45, id 0) (ttl 243, id 49161) This server hides behind some broken firewall. Alexey From owner-netdev@oss.sgi.com Sat Jan 27 12:23:22 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 12:23:02 -0800 Received: from foobar.napster.com ([64.124.41.10]:44816 "EHLO foobar.napster.com") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 12:22:53 -0800 Received: from wagner.napster.com (mail.napster.com [63.108.185.112]) by foobar.napster.com (8.9.3/8.9.3) with ESMTP id MAA18222; Sat, 27 Jan 2001 12:22:47 -0800 Received: from napster.com (c991585-a.stcla1.sfba.home.com [65.5.99.217]) by wagner.napster.com (8.9.3/8.9.3) with ESMTP id MAA22182; Sat, 27 Jan 2001 12:22:47 -0800 Message-ID: <3A732E16.A50A077B@napster.com> Date: Sat, 27 Jan 2001 12:22:46 -0800 From: Jordan Mendelson Organization: Napster, Inc. 
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.0 i686) X-Accept-Language: en MIME-Version: 1.0 To: kuznet@ms2.inr.ac.ru CC: netdev@oss.sgi.com Subject: TCP Performance 2.4.0 <-> Win98 Logs References: <200101272010.XAA03306@ms2.inr.ac.ru> Content-Type: multipart/mixed; boundary="------------8BF15922FAC371894C574ED7" Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 5835 Lines: 101 This is a multi-part message in MIME format. --------------8BF15922FAC371894C574ED7 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit kuznet@ms2.inr.ac.ru wrote: > > Hello! > > > I've placed the logs at http://jordy.napster.com/linux/ to save > > bandwidth. > > Sorry. > > 23:08:06.840184 < 63.108.185.66 > 193.233.7.97: icmp: host 63.108.185.113 unreachable - admin prohibited filter Offending pkt: [|tcp] (DF) (ttl 45, id 0) (ttl 243, id 49161) > > This server hides behind some broken firewall. Ack, they just replaced a router and screwed up the access lists. I've attached it to this email instead. 
Jordan --------------8BF15922FAC371894C574ED7 Content-Type: application/x-gtar; name="tcpdump-logs.tgz" Content-Transfer-Encoding: base64 Content-Disposition: inline; filename="tcpdump-logs.tgz" H4sIAKgtczoAA+1ae3QU1Rm/uySzISEQUosBUS5UESQku8nuhoRAQkh4RoyA0qgVJ7uT7CSz M+vsbJY9qOQUWw6CR6v4RI+P+qo9/qMejbbHd33QojGitWhL5ajY1iqlPo7y6vfdfSTZ3Qwb IKj0fudsso+5v/vd7/6+e+e78zM8AW/IHyhRtLaSsjKHmwyD2R12u9vpJHawCrd7wH8wd7m9 jNgrysqd5W6Xo6Icri+rAE+ofTicSbZQ0BB1Skm7pnsjJtcd6fcfqDm3P3qPlWQlPvfYCLHA /4aL9KrfvWgjc+E9vqzwyolfZCHWqyb3dueQBphceVwt6RRmfPjmY1/W93S/O9FGp3QT1/5s vDBgpeTNndA6y1pksWRZEXXmS8mo0Lv1sdN7l0XBo6gzPqgltcJd4xERkad02+g8d8cHgCwE Ctd1v/TNQNTFu/JJNbSuNvN1WgfzdfqnSb4KgLy7qWByV+NF0MCSk5u3ATFDuzPBVBjmtEEw DxP1L9hkJ7HtQ8wzvskAc2aAYZ65Jy1m0WFy36vY4NYRDoZZC5iYuG6ziE4rrcaIbn05KaK7 ATevqWBd92cTCVkAaNuvHUHaoVW7mYeXBJmHk7uTPMyLe7jxAZhEaM/SxpCChoP2e2unUzrL Suy0rmHlPFpZ4pwC3zhnucqpHT3ofmtEBuMpW4XjufW3acZTg+P59yfR8bjWZ5Ml0GqJGVp1 +ZmIdsu2tGhF67r/00xIARlBRFVTa1UxEDQkvcSj+bGH66CHDZA0+BqsB0uk/Czs4ebNST18 Fe9h8WuEjCXvkApXpZ06HHZ3ZaWLOl2zXEuJ31onUdGvqW3U8Em0VdaDBjU0auhip6RQVQpT cEeXDU2PUI9P1A3JS1sitB5+pueKBjQKB2mdqHqprDYCHKL4Qx4fFVVD9sgBERu0aiHd8NGg EfLKGhWVlpC/mDZc2LC8uX5ecwldmeg5KKttilRMpyym9bJ3MeAtNoqn0FZJNEK6FIx2q6lU kUQv7dQ8ohKk2HeLiC6qEm0LybDoFrPBSOCBLsFvcIHGnGsRPR2hQKxhMQ0oIWhOgwHJI4tK 9GsakPRWTfeLqgec0jU/7ZQ1RVZl8K5Oi3jpSlkNKlKkZCFMz5bN2WQ+TMN8M0K71jBCT9yV ROganKKmokOfCtB8JOkhHq1FFx125yyE/jtALwNMfA0289ba6mU485vvGDjz/oUA34gzv2Uf ITjweKxpWFPPNmiLBLPFYhSEiYW4hiBUCl0gteghESa6rKKE1oUMGtFC1CNCuGH0ksp44ZNK AS86T1TVwkiGoCTqHh98FZs3utiYgrO0LEZmaoUmlCLblkITRfK2gQNB2qSF8ddVQBXE7pCk AIXpCcvAFXROETGnkYQwTS0aOBRDLKZAKGRHawh5QYNaqxEW4Y0uATOCUpQUcA31IPOAAoYB 7kIfsCVHgIByp1ST6NcH7mOQvMAWj6FEotMe6wq5q3rh/wqfGFbpAlFVcaCID13pKvVBDOJx 8klKgAZ9YkBaEEuFmINaK/XKbUBNBZIjKHtq6Aq5TcXBAtfYWOPdLYPBKpKBbw3NK0ZKKCBN 8xlGoKq0VE38WNJvoZgei289/F2lhRQv80eRO9iI2zWYavBNWuORDXQdWgRENUKRZY4/2cgb wKY3zAj8fIQR+LQVSQRuRKY1FR3cOnU/IblA4FZdlhS/pgWlPPwkt8qKCKneIuKPEVH3hkWl 
Q9LxU0ASPb5WRQr6suCTWOZ05uC3q326HPTJeIWIOejV2mRlNHxqCcEMhjwdshoQ9cAo/EZs iYhqm6Q4K1w2vN7wypWVlfnwNhYbp1/0OBw4zM7xORks+3PuwmS61jMwmbTVMNSPYNmf9M9L CUG0X9ydQ5bDgrzcbFGun3s6om16LS0apOb8qwgJ+yQgRwSJGtb0DmC+olAfrHEzkduiPlPH MOGkyTBnHqOGLm4Fshcn2AJJ5JN1uGBGP5aEw+H+9CiNTXhpu9YSLPEZfmV6SZQwdrb8AwOl KIN9ooGLImYQZJIYzaD58C1dGJK9EqyCUtBl1pEiw5pbijir2xItYl0SaDiS9PatcrnwoY8x +eTN/oyhToz0oj/kZXCbtJbRc3zybdJHGG28TWq/MnrrZduHmPuzx2SAeQXDLEqDGVjbVHDw wzhm4WjEPEMYQ7oBotuMXw/XPo2M2Jh0mxRYG2fE/QFMo95+mTIKgtKXKbRSgM+YKxSypbdf tvT2y5YCuGZgttCKUXDFwHzp7cuX3oH5sgjGc9Yn2RnE6EoWo1PTxeggxKilL0bLAbNqRxZZ ChBLzWI051d7MEaVQlKMDsZj1NQeu5Vx28vxVqayzI23MpXl2MW6K3IzcPtO5vbX/0p1+/JC cPvcuNvChYTbMZjRv/53Dk/FbV7/O1xOtyte/4MLdqz/3eUOXv+fCBus/neu1KumPpifQf1f kY+Z+rP/xjJ1XPbHkKmTXtuPF2L9nzW7r1JH1KyHk1GjS0zjxCh4FJUQXGL+SBAxusRkf7zr 88cXAPKBQOG4ntAGRM1+HFFdgHrD1KwMVpXZzNfm/Um+HgDkhVj/37I13gwx26dlglnNMH86 COZh8vSE/pi3zcgAc+Ychrnqq7SYUF3vHRmr/xnms4CZ7sZpQESnxSKakxTRhWTSNktTwbie exdEPXzr+pEZ1P81zMMLXhno4TZL3MMde46u/kcP9oIHxzQeFxtPYXQ8h5/MS1v/D0CrjqJt +yItWtG4nvvXp6//sYdQd17aKrB/D9baaA+v7k3q4eJ4D3dA5gmwaZZRh4s6eNF/9EX/yVdL zwKWrdpgzSAvG1hennfc8xI9aAcP6qDDOrNMcsYy6cBAnj8RRJ53QV6+3Yx3btm56BV+h8i7 Hh3OHMUeqneNzuAQaBGLXuNnSdFzYZ7iIdD5VycdAiF0G0Dz9OfpP7zpHz4GlrHkq0OWXQFX 8oO543Ewd/KdCs0Flq38uUDeB468b7ZMvn4pWybdFwy2TDbfl7RMfm/PHXHM18CYj3Zb694E 436vq2A87clNbGsuzDZEps8d487wXlPReHrbWL4z8J0h/c5QAyyT1EzOAVtZzpYklanb3sPd AcvUJ8fEmyHmOjX9g62MdxvG3MipfLfhu0363QZZdg2wzIrPZywmK+8XsZW3Ot3KiywrfIeQ 4/dE47s/218KkaldwvOP599w5h+y7NtDmTyKWc92jrFpdo7uTbhzPP51vBlilh3O5TmdktMe iMz1n/PIpEbGC5Hp+VsmB+2PMB7u+TaVh0+taio45OzjoQKYTjdGMvoaNNplsWgfGhjtp1bF o10ZjN57l7N77zsAd/KE4cH9fZMtgxiQ0RiDLeliIEIMzumLwUOAuV3JPnpfxYSvcn9fcy7Q q+YuGR5cv5rJeuRnMTg7XQw6IAa0LwaIeTtgHrWvHQlfV/f3lZw8FpbV71r/7XJDtwn9t9PF 9N/l5fz574mwHS+mf/77zJrLqyZPyiHnwHt8HWZW37BiJaSMQCw5RMgiluhnTJhY6+53J2LL u8tyEs946Yrl8wmJtUwoxruEyRkpxhFt1+ejEmhRnChmn1LcLfwmI6U4oo3bNyrxdDHFN6YQ 7xImmSrEEcVtiqIwlDNMNeGI8tAp+WQ2IMxOh8JU4F3ChCOrwJ8FqGfnWRMOpQSJib/dwv2m 
4m9E2Tu/7yFTikNM9N0lnHrcRd/Ys9RrM/Efxd5u4V5TsfdzgOLf2Cf2TkFhIm+3cM+QRd6I fOC6PhVxSmSYerhLKMxcPYyQ014QErrxZGejenG3cDfXi5/IE0Kcli9fFBJy25SZZjLbLmHs D1Nmi8Nb25GbOFpIZl1Uq+4WbuZadX5IMuRDEmTXppV5Jss4irfdwi2m4m1EeWaO2e66lqXg GFMpMaI8es+ohAQ8ZX1l0m+3cBOXfh/l0cHzEGGoyEzm6Qo2T6NN5dmIsqhmREKYncIZJsh2 Czd+zwXZOA7XTwSTaFzJopFvKsQ+8Tag/vtu9L/OCvwtrv8tdzD9r5PXfyfEBqv/CoKXV70C mTN4FYe63y5hRUa6X0T75Q15JlUcHsC4hHcy0vsi2tab8kxybTbz7TxTnS+iPGOKUs1Qlpkq exHl4JN5JlXcHIZy7pG1vGMBSu6ymGyfsSCZSngRZckNFpMqroY5tPS4SwSx5wemZ6NUgr1S /J+6/2b0f2dwcMkuYmz5h40sgvaL0mFU5TGMbZeZiQBTKzjEPaCNJI1w09uY7sZ37liG+8Il SbjN/fV5XIPBNRipxcOPgFs76nNNMq6BZVzDcc+4U6DnOZuyE+qllGxxRleMtzMW4yKi9TKb yRlKDHHIIlxEfsJrMzlDWcSiVJe5+PbHAJk3mpgktcCc3b7PTHTLk5ondWpSI7d23mgx4daa YuTWxS+bSW35qQKXXqTW4OOBWz4oqONC25SlkAlsu4Tik0BgOwHG2jzfOuQtanBhLUPsMjvD jCK+xQW1fN0f8rqP7BKlfJOqsJXl5pmmQtrTsLZ80Gz30NnuIabbPRLSPb578N0jdfdAbtm/ tESPadKcbJN9f74duXXw10kr6sb+ErKTR0B2NsTj9jfIkPcDLpPl2XbkbEN2bd89wmQ/WM/2 A5upPBZR9nYKJjkrqchRrcdM9nny5GwNxKOwNC8hh02Jxxexu8L/GxksRuTOt82eej7CePZX U/nrQkD52mZ2zh49EelJOmcfKEyNyxzDgPb65lEJ+WQKWllsljKUua4BPONCsyeGKG/tEq4x lbdeDSiX2SxDH+MA4Wncpw+RiUWWoY9xEBkr4uUVmq0XfjbG003lq4hy/jfZQx/jAMHqSShX 5caNGzdu3Lhx48aNGzdu3Lhx48aNGzdu3Lhx48aNGzdu3Lilsf8BV00UswB4AAA= --------------8BF15922FAC371894C574ED7-- From owner-netdev@oss.sgi.com Sat Jan 27 14:20:13 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 14:20:03 -0800 Received: from cust-795-150.customer.jump.net ([207.8.95.150]:62226 "EHLO dev.audiogalaxy.com") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 14:19:50 -0800 Received: from localhost (michael@localhost) by dev.audiogalaxy.com (8.9.3/8.9.3) with ESMTP id QAA10258 for ; Sat, 27 Jan 2001 16:19:46 -0600 Date: Sat, 27 Jan 2001 16:19:46 -0600 (CST) From: Michael Merhej To: Subject: 2.4.x kernel w/ /proc/sys/net/ipv4/conf/xx/hidden flag Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing 
Content-Length: 374 Lines: 14 Hello, We would like to use kernel 2.4.x for our webservers, but require ARP messages to be suppressed on IPs bound to the local loopback device. We are using the Linux Virtual Server w/ direct route routing. Is there a different flag in kernel 2.4.x that has the same functionality that the hidden flag does in kernel 2.2.x? Thanks for your time! Sincerely, --Michael From owner-netdev@oss.sgi.com Sat Jan 27 14:56:33 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 14:56:14 -0800 Received: from m361-mp1-cvx1b.col.ntl.com ([213.104.73.105]:17924 "EHLO [213.104.73.105]") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 14:55:58 -0800 Received: by boreas (chainmail); Sat, 27 Jan 2001 22:54:54 GMT To: Cc: , , , Subject: [PATCH] dynamic IP support for 2.4.0 (SIOCKILLADDR) From: "John Fremlin" Date: 27 Jan 2001 22:54:51 +0000 Message-ID: MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 12432 Lines: 38 --=-=-= When the IP address of an interface changes, TCP connections with the old source address are useless. Applications are not notified of this and time out ordinarily, just as if nothing had happened. This behaviour isn't very helpful when you have a dynamic IP and know you're probably not going to get the old one back. In that case, you want processes to get errors when they try to use one of the dead connections, so they can handle the disconnect more cleanly. Otherwise fetchmail, etc. can just hang waiting for ages. Andi Kleen implemented this functionality with a per-interface flag in 2.2. See ftp.suse.com:/pub/people/ak/v2.2/iff-dynamic*. The following patch against 2.4.0 does it a different way. It introduces a new ioctl, called SIOCKILLADDR. When this ioctl is called, it makes all IPv4 sockets with the specified source address return -ENETRESET when they are used. Is this the right error number? 
I wasn't quite sure where the ioctl should go to be in keeping with convention - I bunged it in devinet_ioctl. I patched userspace ppp-2.4.0 to use this functionality. It would be better if SIOCKILLADDR were not used until we are sure that the new IP is in fact different from the old one, but pppd in demand mode would not notice that there were extant connections and so would not bring up the link - so the problem would not be alleviated. Therefore SIOCKILLADDR is used on disconnect. The functionality is activated with the killoldaddr option. I would be happy to document it in the manpage if it were accepted. Further the build process is cleaned up slightly, as in the patch I sent on or around 8 October 2000. --=-=-= Content-Type: text/x-patch Content-Disposition: attachment; filename=linux-2.4.0-dynip-3.patch diff -u --exclude *~ --recursive linux-2.4.0-orig/include/linux/sockios.h linux-hacked-dynip/include/linux/sockios.h --- linux-2.4.0-orig/include/linux/sockios.h Sat Dec 30 00:20:32 2000 +++ linux-hacked-dynip/include/linux/sockios.h Sat Jan 27 17:04:34 2001 @@ -65,6 +65,7 @@ #define SIOCDIFADDR 0x8936 /* delete PA address */ #define SIOCSIFHWBROADCAST 0x8937 /* set hardware broadcast addr */ #define SIOCGIFCOUNT 0x8938 /* get number of devices */ +#define SIOCKILLADDR 0x8939 /* kill all connections with this local address */ #define SIOCGIFBR 0x8940 /* Bridging support */ #define SIOCSIFBR 0x8941 /* Set bridging options */ diff -u --exclude *~ --recursive linux-2.4.0-orig/include/net/tcp.h linux-hacked-dynip/include/net/tcp.h --- linux-2.4.0-orig/include/net/tcp.h Fri Jan 5 21:41:37 2001 +++ linux-hacked-dynip/include/net/tcp.h Sat Jan 27 18:02:21 2001 @@ -787,9 +787,8 @@ extern int tcp_disconnect(struct sock *sk, int flags); extern void tcp_unhash(struct sock *sk); - extern int tcp_v4_hash_connecting(struct sock *sk); - +extern void tcp_v4_zap_saddr(u32 saddr); /* From syncookies.c */ extern struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb, 
diff -u --exclude *~ --recursive linux-2.4.0-orig/net/ipv4/af_inet.c linux-hacked-dynip/net/ipv4/af_inet.c --- linux-2.4.0-orig/net/ipv4/af_inet.c Tue Jan 2 09:26:19 2001 +++ linux-hacked-dynip/net/ipv4/af_inet.c Sat Jan 27 18:27:38 2001 @@ -854,6 +854,7 @@ case SIOCSIFPFLAGS: case SIOCGIFPFLAGS: case SIOCSIFFLAGS: + case SIOCKILLADDR: return(devinet_ioctl(cmd,(void *) arg)); case SIOCGIFBR: case SIOCSIFBR: diff -u --exclude *~ --recursive linux-2.4.0-orig/net/ipv4/devinet.c linux-hacked-dynip/net/ipv4/devinet.c --- linux-2.4.0-orig/net/ipv4/devinet.c Sat Dec 30 00:22:05 2000 +++ linux-hacked-dynip/net/ipv4/devinet.c Sat Jan 27 21:09:48 2001 @@ -510,6 +510,7 @@ case SIOCSIFBRDADDR: /* Set the broadcast address */ case SIOCSIFDSTADDR: /* Set the destination address */ case SIOCSIFNETMASK: /* Set the netmask for the interface */ + case SIOCKILLADDR: /* Kill all connections with this local address */ if (!capable(CAP_NET_ADMIN)) return -EACCES; if (sin->sin_family != AF_INET) @@ -536,7 +537,10 @@ break; } - if (ifa == NULL && cmd != SIOCSIFADDR && cmd != SIOCSIFFLAGS) { + if (ifa == NULL + && cmd != SIOCSIFADDR + && cmd != SIOCSIFFLAGS + && cmd != SIOCKILLADDR) { ret = -EADDRNOTAVAIL; goto done; } @@ -646,6 +650,9 @@ ifa->ifa_prefixlen = inet_mask_len(ifa->ifa_mask); inet_insert_ifa(ifa); } + break; + case SIOCKILLADDR: /* Kill all connections with this local address */ + tcp_v4_zap_saddr(sin->sin_addr.s_addr); break; } done: diff -u --exclude *~ --recursive linux-2.4.0-orig/net/ipv4/tcp_ipv4.c linux-hacked-dynip/net/ipv4/tcp_ipv4.c --- linux-2.4.0-orig/net/ipv4/tcp_ipv4.c Fri Jan 5 21:17:42 2001 +++ linux-hacked-dynip/net/ipv4/tcp_ipv4.c Sat Jan 27 18:07:25 2001 @@ -390,6 +390,38 @@ wake_up(&tcp_lhash_wait); } +/* Terminate all active connections with a local address equal to + * SADDR. If sysctl_ip_dynaddr is set, connections in the SYN_SENT + * state are not closed, because their source address will presumably + * be rewritten. 
+ */ +void tcp_v4_zap_saddr(u32 saddr) +{ + int i; + rwlock_t *lock; + struct sock *sk; + + for (i = 0; i < (tcp_ehash_size<<1); i++) { + lock = &tcp_ehash[i].lock; + + read_lock(lock); + + for(sk = tcp_ehash[i].chain; sk; sk = sk->next) + if(sk->rcv_saddr == saddr) + { + if(sysctl_ip_dynaddr && sk->state == TCP_SYN_SENT) + continue; + + sk->err = ENETRESET; + sk->error_report(sk); + + tcp_done(sk); + } + + read_unlock(lock); + } +} + /* Don't inline this cruft. Here are some nice properties to * exploit here. The BSD API does not allow a listening TCP * to specify the remote port nor the remote address for the --=-=-= Content-Type: text/x-patch Content-Disposition: attachment; filename=ppp-2.4.0-killaddr.patch diff -u --recursive ppp-2.4.0-orig/chat/Makefile.linux ppp-2.4.0-hacked/chat/Makefile.linux --- ppp-2.4.0-orig/chat/Makefile.linux Fri Aug 13 02:54:32 1999 +++ ppp-2.4.0-hacked/chat/Makefile.linux Sat Jan 27 18:34:47 2001 @@ -6,14 +6,14 @@ CDEF4= -DFNDELAY=O_NDELAY # Old name value CDEFS= $(CDEF1) $(CDEF2) $(CDEF3) $(CDEF4) -CFLAGS= -O2 -g -pipe $(CDEFS) +CFLAGS= $(COPTS) $(CDEFS) INSTALL= install all: chat chat: chat.o - $(CC) -o chat chat.o + $(CC) $(LDFLAGS) -o chat chat.o chat.o: chat.c $(CC) -c $(CFLAGS) -o chat.o chat.c diff -u --recursive ppp-2.4.0-orig/pppd/options.c ppp-2.4.0-hacked/pppd/options.c --- ppp-2.4.0-orig/pppd/options.c Tue Aug 1 02:38:30 2000 +++ ppp-2.4.0-hacked/pppd/options.c Sat Jan 27 18:51:30 2001 @@ -77,6 +77,9 @@ char user[MAXNAMELEN]; /* Username for PAP */ char passwd[MAXSECRETLEN]; /* Password for PAP */ bool persist = 0; /* Reopen link after it goes down */ +bool killoldaddr = 0; /* If our IP is reassigned on + reconnect, kill active TCP + connections using the old IP. 
*/ char our_name[MAXNAMELEN]; /* Our name for authentication purposes */ bool demand = 0; /* do dial-on-demand */ char *ipparam = NULL; /* Extra parameter for ip up/down scripts */ @@ -194,6 +197,10 @@ "Turn off persist option" }, { "demand", o_bool, &demand, "Dial on demand", OPT_INITONLY | 1, &persist }, + { "killoldaddr", o_bool, &killoldaddr, + "Kill connections from an old source address", 1}, + { "nokilloldaddr", o_bool,&killoldaddr, + "Don't kill connections from an old source address" }, { "--version", o_special_noarg, (void *)showversion, "Show version number" }, { "--help", o_special_noarg, (void *)showhelp, diff -u --recursive ppp-2.4.0-orig/pppd/pppd.h ppp-2.4.0-hacked/pppd/pppd.h --- ppp-2.4.0-orig/pppd/pppd.h Thu Jul 6 12:17:03 2000 +++ ppp-2.4.0-hacked/pppd/pppd.h Sat Jan 27 20:13:17 2001 @@ -235,6 +235,9 @@ extern char remote_name[MAXNAMELEN]; /* Peer's name for authentication */ extern bool explicit_remote;/* remote_name specified with remotename opt */ extern bool demand; /* Do dial-on-demand */ +extern bool killoldaddr; /* If our IP is reassigned on + reconnect, kill active TCP + connections using the old IP. 
*/ extern char *ipparam; /* Extra parameter for ip up/down scripts */ extern bool cryptpap; /* Others' PAP passwords are encrypted */ extern int idle_time_limit;/* Shut down link if idle for this long */ diff -u --recursive ppp-2.4.0-orig/pppd/sys-linux.c ppp-2.4.0-hacked/pppd/sys-linux.c --- ppp-2.4.0-orig/pppd/sys-linux.c Wed Jul 26 05:17:12 2000 +++ ppp-2.4.0-hacked/pppd/sys-linux.c Sat Jan 27 21:55:03 2001 @@ -115,6 +115,10 @@ #endif /* INET6 */ +#ifndef SIOCKILLADDR +#define SIOCKILLADDR 0x8939 +#endif + /* We can get an EIO error on an ioctl if the modem has hung up */ #define ok_error(num) ((num)==EIO) @@ -152,6 +156,7 @@ static u_int32_t proxy_arp_addr; /* Addr for proxy arp entry added */ static char proxy_arp_dev[16]; /* Device for proxy arp entry */ static u_int32_t our_old_addr; /* for detecting address changes */ +static u_int32_t our_current_addr; static int dynaddr_set; /* 1 if ip_dynaddr set */ static int looped; /* 1 if using loop */ static int link_mtu; /* mtu for the link (not bundle) */ @@ -491,6 +496,27 @@ return -1; } +static void do_killaddr(u_int32_t oldaddr) +{ + struct ifreq ifr; + + memset(&ifr,0,sizeof ifr); + + SET_SA_FAMILY (ifr.ifr_addr, AF_INET); + SET_SA_FAMILY (ifr.ifr_dstaddr, AF_INET); + SET_SA_FAMILY (ifr.ifr_netmask, AF_INET); + + SIN_ADDR(ifr.ifr_addr) = oldaddr; + + strlcpy(ifr.ifr_name, ifname, sizeof (ifr.ifr_name)); + + if(ioctl(sock_fd,SIOCKILLADDR,&ifr) < 0) { + if (!ok_error (errno)) + error("ioctl(SIOCKILLADDR): %m(%d)", errno); + return; + } +} + /******************************************************************** * * disestablish_ppp - Restore the serial port to normal operation. 
@@ -534,6 +560,9 @@ if (!multilink) remove_fd(ppp_dev_fd); } + + if(killoldaddr) + do_killaddr(our_current_addr); } /* @@ -2177,10 +2206,10 @@ { struct ifreq ifr; struct rtentry rt; - + memset (&ifr, '\0', sizeof (ifr)); memset (&rt, '\0', sizeof (rt)); - + SET_SA_FAMILY (ifr.ifr_addr, AF_INET); SET_SA_FAMILY (ifr.ifr_dstaddr, AF_INET); SET_SA_FAMILY (ifr.ifr_netmask, AF_INET); @@ -2247,21 +2276,29 @@ } } - /* set ip_dynaddr in demand mode if address changes */ - if (demand && tune_kernel && !dynaddr_set - && our_old_addr && our_old_addr != our_adr) { + if(persist && our_old_addr && our_old_addr != our_adr) { + /* + if(killoldaddr) + do_killaddr(our_old_addr); + */ + + /* set ip_dynaddr in persist mode if address changes */ + if (tune_kernel && !dynaddr_set) { /* set ip_dynaddr if possible */ char *path; int fd; path = path_to_procfs("/sys/net/ipv4/ip_dynaddr"); if (path != 0 && (fd = open(path, O_WRONLY)) >= 0) { - if (write(fd, "1", 1) != 1) - error("Couldn't enable dynamic IP addressing: %m"); - close(fd); + if (write(fd, "1", 1) != 1) + error("Couldn't enable dynamic IP addressing: %m"); + close(fd); } dynaddr_set = 1; /* only 1 attempt */ + } } + + our_current_addr = our_adr; our_old_addr = 0; return 1; @@ -2317,7 +2354,8 @@ } our_old_addr = our_adr; - + our_current_addr = 0; + return 1; } diff -u --recursive ppp-2.4.0-orig/pppdump/Makefile.linux ppp-2.4.0-hacked/pppdump/Makefile.linux --- ppp-2.4.0-orig/pppdump/Makefile.linux Mon Jul 26 12:09:29 1999 +++ ppp-2.4.0-hacked/pppdump/Makefile.linux Sat Jan 27 18:34:47 2001 @@ -1,4 +1,4 @@ -CFLAGS= -O -I../include/net +CFLAGS= $(COPTS) -I../include/net OBJS = pppdump.o bsd-comp.o deflate.o zlib.o INSTALL= install @@ -6,7 +6,7 @@ all: pppdump pppdump: $(OBJS) - $(CC) -o pppdump $(OBJS) + $(CC) $(LDFLAGS) -o pppdump $(OBJS) clean: rm -f pppdump $(OBJS) *~ diff -u --recursive ppp-2.4.0-orig/pppstats/Makefile.linux ppp-2.4.0-hacked/pppstats/Makefile.linux --- ppp-2.4.0-orig/pppstats/Makefile.linux Wed Mar 25 02:21:19 
1998 +++ ppp-2.4.0-hacked/pppstats/Makefile.linux Sat Jan 27 18:34:48 2001 @@ -22,7 +22,7 @@ $(INSTALL) -c -m 444 pppstats.8 $(MANDIR)/man8/pppstats.8 pppstats: $(PPPSTATSRCS) - $(CC) $(CFLAGS) -o pppstats pppstats.c $(LIBS) + $(CC) $(CFLAGS) $(LDFLAGS) -o pppstats pppstats.c $(LIBS) clean: rm -f pppstats *~ #* core --=-=-= -- http://www.penguinpowered.com/~vii --=-=-=-- From owner-netdev@oss.sgi.com Sat Jan 27 16:07:03 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 16:06:44 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:48821 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 16:06:25 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id TAA24967; Sat, 27 Jan 2001 19:05:39 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sat, 27 Jan 2001 19:05:38 -0500 (EST) From: jamal To: cc: Subject: ECN: Clearing the air Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 4144 Lines: 87 On Fri, 26 Jan 2001 15:29:51 +0000, James Sutherland wrote: > Except you can't retry without ECN, because DaveM wants to do a > Microsoft and force ECN on everyone, whether they like it or not. I think there is some serious misinformation going on here. Hopefully, this will straighten things out: - ECN is not a standard that DaveM came up with, or some cabal within the Linux community pulled out of a hat. It was the Internet Engineering Task Force that endorsed it. If you want to blame anybody, blame the IETF. Specifically you should also blame Sally Floyd and KK Ramakrishnan who proposed it after years of research. In case those names dont ring a bell look, them up in the internet whos-who almanac. Dont ask me where you'll find one. 
In case the IETF doesn't ring a bell either to some people, it is the same standards body that made the internet happen. It is the same standards body that also ensures that although the internet is anarchical in nature, there are some simple governing rules that should be defined to keep it alive. They are called protocols. The IETF has a very simple motto "we believe in running code ...". [Although that's not necessarily true these days, but let's not tread there].

People, Linux is no longer a baby. We are leaders as far as the internet is concerned. We are there first. We set trends and others follow. We have "running code" to flush out all the heretics out there. We have the best TCP/IP people in the world today coding for you and me. Blaspheming with "DaveM wants to pull a MS" doesn't help. We need to encourage these kinds of activities because we are making the internet a better place. Yes, Al Gore might have funded some good causes on the internet, but today _we_ make them happen.

- ECN is a good thing. It has been proven for years to be a good thing. Standards normally go through an experimental phase before becoming a proposed standard. If you don't want it, turn it off.

- ECN is going to become a proposed standard perhaps by this coming IETF at Minneapolis.

- A lot of OS vendors and good router vendors will be deploying ECN soon. There is nothing wrong with Linux being first. We code in the open, others prefer press releases.

- ECN does not break things. It's brain damaged firewalls, Intrusion detection systems, and load balancers that should be shot. One intrusion detection "expert" was quoted suggesting the blocking of ECN bits should be blocked because "nmap uses them" to probe systems. Any commercial non-open-source entity designing and abusing reserved fields should at least have the courtesy of providing a config option to stop that abuse. If it was open source we would have fixed their sins.
- Any design which blatantly ASSumes that "reserved" means no one should use something simply amazes me. The collegiate dictionary definition of "reserved" is: --------------------- Main Entry: reserve Pronunciation: ri-'z&rv Function: transitive verb Inflected Form(s): reserved; reserving Etymology: Middle English, from Middle French reserver, from Latin reservare, literally, to keep back, from re- + servare to keep -- more at CONSERVE Date: 14th century 1 a : to hold in reserve : keep back b : to set aside (part of the consecrated elements) at the Eucharist for future use c : to retain or hold over to a future time or place : DEFER d : to make legal reservation of 2 : to set or have set aside or apart synonym see KEEP - reservable /-'z&r-v&-b&l/ adjective ----------------------- Now where is the ambiguity in that? And where really is the ambiguity in the meaning of a TCP RST? Maybe an analogy in a very ambiguos protocol called "English Language" would help. The word "no" in response to the packet "davem please add an extra meaning to RST". Where is the ambiguity in that? phew! just my 2 .ca cents cheers, jamal From owner-netdev@oss.sgi.com Sat Jan 27 16:25:03 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 16:24:53 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:49845 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 16:24:40 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id TAA24978; Sat, 27 Jan 2001 19:23:51 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sat, 27 Jan 2001 19:23:51 -0500 (EST) From: jamal To: cc: Subject: Re: ECN: Clearing the air In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 527 Lines: 20 On Sat, 27 Jan 2001, jamal wrote: > > - ECN does not break things. 
It's brain damaged firewalls, Intrusion > detection systems, and load balancers that should be shot. > One intrusion detection "expert" was quoted suggesting the blocking of ECN > bits should be blocked because "nmap uses them" to probe systems. > should proof read before posting. The intrusion detection expert suggested blocking ECN. Article at: http://www.securityfocus.com/frames/?focus=ids&content=/focus/ids/articles /portscan.html cheers, jamal From owner-netdev@oss.sgi.com Sat Jan 27 18:08:54 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 18:08:34 -0800 Received: from f00f.stub.clear.net.nz ([203.167.224.51]:3588 "HELO metastasis.f00f.org") by oss.sgi.com with SMTP id ; Sat, 27 Jan 2001 18:08:16 -0800 Received: by metastasis.f00f.org (Postfix, from userid 1000) id C4E3AA486; Sun, 28 Jan 2001 15:08:13 +1300 (NZDT) Date: Sun, 28 Jan 2001 15:08:13 +1300 From: Chris Wedgwood To: jamal Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: ECN: Clearing the air Message-ID: <20010128150813.A1595@metastasis.f00f.org> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from hadi@cyberus.ca on Sat, Jan 27, 2001 at 07:23:51PM -0500 X-No-Archive: Yes Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 283 Lines: 12 On Sat, Jan 27, 2001 at 07:23:51PM -0500, jamal wrote: suggested blocking ECN. Article at: http://www.securityfocus.com/frames/?focus=ids&content=/focus/ids/articles/portscan.html the site is now ATM -- can someone briefly explain the logic in blocking it? 
--cw From owner-netdev@oss.sgi.com Sat Jan 27 18:17:13 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 18:16:53 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:51637 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 18:16:43 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id VAA25093; Sat, 27 Jan 2001 21:15:48 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sat, 27 Jan 2001 21:15:48 -0500 (EST) From: jamal To: Chris Wedgwood cc: , Subject: Re: ECN: Clearing the air In-Reply-To: <20010128150813.A1595@metastasis.f00f.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 735 Lines: 27 On Sun, 28 Jan 2001, Chris Wedgwood wrote: > On Sat, Jan 27, 2001 at 07:23:51PM -0500, jamal wrote: > > suggested blocking ECN. Article at: > > http://www.securityfocus.com/frames/?focus=ids&content=/focus/ids/articles/portscan.html > > the site is now ATM -- can someone briefly explain the logic in > blocking it? > It is Queso they quoted not nmap, sorry -- same thing. The idea is to "detect" port scanners. Queso sets the two TCP reserved bits in the SYN (now allocated for ECN). Some OSes reflect that back in the SYN-ACK (Linux < 2.0.2? for example was such a culprit). Queso could then use that information to narow down the OS detection. 
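A minimal sketch of why a stack that merely reflects the two formerly reserved bits can be told apart from a real ECN peer. It assumes the flag layout and the "ECN-setup" definitions as later published in RFC 3168 (ECE = 0x40, CWR = 0x80 in the TCP flags byte); the function names and the "reflecting" responder are hypothetical illustrations, not code from Queso or from any kernel.

```python
# TCP flag bits in the flags byte of the TCP header (layout per RFC 3168).
FIN, SYN, RST, PSH, ACK, URG, ECE, CWR = (1 << i for i in range(8))

_MASK = SYN | ACK | ECE | CWR


def is_ecn_setup_syn(flags):
    """An ECN-setup SYN is a SYN (no ACK) with both ECE and CWR set."""
    return (flags & _MASK) == (SYN | ECE | CWR)


def is_ecn_setup_syn_ack(flags):
    """An ECN-setup SYN-ACK has ECE set and CWR clear."""
    return (flags & _MASK) == (SYN | ACK | ECE)


def reflecting_syn_ack(syn_flags):
    """Hypothetical buggy responder: echoes the two bits back unchanged."""
    return SYN | ACK | (syn_flags & (ECE | CWR))


probe = SYN | ECE | CWR                 # what Queso (or an ECN sender) transmits
reply = reflecting_syn_ack(probe)       # both bits come back set
print(is_ecn_setup_syn(probe), is_ecn_setup_syn_ack(reply))
```

Under these assumptions the reflected reply carries both bits, so it does not qualify as an ECN-setup SYN-ACK, while it still gives a scanner one more fingerprinting hint.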
I suppose the idea is to detect Queso and react to it ;-> cheers, jamal From owner-netdev@oss.sgi.com Sat Jan 27 21:27:54 2001 Received: by oss.sgi.com id ; Sat, 27 Jan 2001 21:27:45 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:16122 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Sat, 27 Jan 2001 21:27:18 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id QAA03743; Sun, 28 Jan 2001 16:27:05 +1100 (EST) Message-ID: <3A73AF7B.6610B559@uow.edu.au> Date: Sun, 28 Jan 2001 16:34:51 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: kuznet@ms2.inr.ac.ru CC: netdev@oss.sgi.com, lkml Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) References: <3A726087.764CC02E@uow.edu.au> from "Andrew Morton" at Jan 27, 1 08:45:00 am <200101271854.VAA02845@ms2.inr.ac.ru> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 3675 Lines: 104 kuznet@ms2.inr.ac.ru wrote: > > Hello! > > > 2.4.1-pre10+zercopy, using read()/write(): 38.1% CPU > > write() on zc card is worse than normal write() by definition. > It generates split buffers. yes. The figures below show this. Disabling SG+checksums speeds up write() and send(). > Split buffers are more expensive and we have to pay for this. > You have paid too much for slow card though. 8) > > Do you measure load correctly? Yes. Quite confident about this. Here's the algorithm: 1: Run a cycle-soaker on each CPU on an otherwise unloaded system. See how much "work" they all do per second. 2: Run the cycle-soakers again, but with network traffic happening. See how much their "work" is reduced. Deduce networking CPU load from this difference. 
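The two steps above boil down to one subtraction. A rough sketch, assuming each cycle-soaker reports a "work per second" figure per CPU; the function name and the sample numbers are made up for illustration and are not taken from the actual tool.

```python
def network_cpu_load(idle_work_per_sec, loaded_work_per_sec):
    """Estimate the CPU fraction consumed by networking.

    idle_work_per_sec:   per-CPU soaker work rates on an unloaded system
    loaded_work_per_sec: per-CPU soaker work rates while traffic is flowing
    """
    baseline = sum(idle_work_per_sec)
    if baseline <= 0:
        raise ValueError("baseline work rate must be positive")
    # Whatever work the soakers lost is attributed to the network load.
    return 1.0 - sum(loaded_work_per_sec) / baseline


# Hypothetical dual-CPU run: 2000 units/s idle, 1808 units/s under load.
load = network_cpu_load([1000, 1000], [900, 908])
print(f"{load:.1%}")
```

The estimate is only as good as the assumption that the soakers never delay the network code, which is the point of the remark about scheduling that follows.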
The networking code all runs SCHED_FIFO or in interrupt context, so the cycle-soakers have no effect upon the network code's access to the CPU.

The "cycle-soakers" just sit there spinning and dirtying 10,000 cachelines per second.

> > 2.4.1-pre10+zerocopy, using read()/write():   39.2% CPU    * hardware tx checksums disabled
>
> This is illegal combination of parameters. You force two memory accesses,
> doing this. The fact that it does not add to load is dubious. 8)8)

mm.. Perhaps with read()/write() the data is already in cache?

Anyway, I've tweaked up the tool again so it can do send() or write() (then I looked at the implementation and wondered why I'd bothered). It also does TCP_CORK now.

I ran another set of tests. The zerocopy patch improves sendfile() hugely but slows down send()/write() significantly, with a 3c905C:

http://www.uow.edu.au/~andrewm/linux/#zc

The kernels which were tested were 2.4.1-pre10 with and without the zerocopy patch. We only look at client load (the TCP sender).

In all tests the link throughput was 11.5 mbytes/sec at all times (saturated 100baseT) unless otherwise noted.

The client (the thing which sends data) is a dual 500MHz PII with a 3c905C.

For the write() and send() tests, the chunk size was 64 kbytes.

The workload was 63 files with an average length of 350 kbytes.

                                                CPU
2.4.1-pre10+zerocopy, using sendfile():          9.6%
2.4.1-pre10+zerocopy, using send():             24.1%
2.4.1-pre10+zerocopy, using write():            24.2%

2.4.1-pre10+zerocopy, using sendfile():         16.2%   * checksums and SG disabled
2.4.1-pre10+zerocopy, using send():             21.5%   * checksums and SG disabled
2.4.1-pre10+zerocopy, using write():            21.5%   * checksums and SG disabled

2.4.1-pre10-vanilla, using sendfile():          17.1%
2.4.1-pre10-vanilla, using send():              21.1%
2.4.1-pre10-vanilla, using write():             21.1%

Bearing in mind that a large amount of the load is in the device driver, the zerocopy patch makes a large improvement in sendfile efficiency.
But read() and send() performance is decreased by 10% - more than this if you factor out the constant device driver overhead. TCP_CORK makes no difference. The files being sent are much larger than a single frame. Conclusions: For a NIC which cannot do scatter/gather/checksums, the zerocopy patch makes no change in throughput in all case. For a NIC which can do scatter/gather/checksums, sendfile() efficiency is improved by 40% and send() efficiency is decreased by 10%. The increase and decrease caused by the zerocopy patch will in fact be significantly larger than these two figures, because the measurements here include a constant base load caused by the device driver. - From owner-netdev@oss.sgi.com Sun Jan 28 07:07:08 2001 Received: by oss.sgi.com id ; Sun, 28 Jan 2001 07:06:49 -0800 Received: from lsb-catv-1-p021.vtxnet.ch ([212.147.5.21]:14 "EHLO almesberger.net") by oss.sgi.com with ESMTP id ; Sun, 28 Jan 2001 07:06:24 -0800 Received: (from almesber@localhost) by almesberger.net (8.9.3/8.9.3) id QAA07651; Sun, 28 Jan 2001 16:06:06 +0100 Date: Sun, 28 Jan 2001 16:06:05 +0100 From: Werner Almesberger To: netdev@oss.sgi.com, linux-diffserv@lrc.di.epfl.ch Subject: New toy: tc simulator Message-ID: <20010128160605.N18286@almesberger.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1536 Lines: 32 I've written a little package that allows experimenting with Linux traffic control entirely in user space. This is useful for configuration testing, traffic control element development, debugging, and simulations. Right now, the simulator is very basic, but it'll be easy to add new capabilities. It works as follows: the kernel traffic control code is compiled in user space and linked against iproute2/tc and some glue code. This way, the original tc code can be used for configuration. 
The glue code provides the netlink communication from tc to the "kernel" code. Afterwards, a simple event-driven simulation can be run on that configuration. Note that the classifiers "fw" and "route" don't work, because neither firewalling nor routing are included in the simulator. Also, the ATM qdisc isn't supported. The current version is in: ftp://icaftp.epfl.ch/pub/linux/tcng/tcsim-0d.tar.gz It requires a 2.4.0 kernel source tree and iproute2 991023 or 001007 (probably works also with all the iproute2 versions in between, I just haven't tried them). Also, I haven't done much testing yet, so there are probably several nasty bugs in addition to the long list of known bugs. I won't announce each little update, so please check for new versions every once in a while if you're using tcsim. - Werner -- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH Werner.Almesberger@epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Sun Jan 28 07:58:39 2001 Received: by oss.sgi.com id ; Sun, 28 Jan 2001 07:58:20 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:46999 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Sun, 28 Jan 2001 07:57:52 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id CAA16953; Mon, 29 Jan 2001 02:57:29 +1100 (EST) Message-ID: <3A74433C.239D1317@uow.edu.au> Date: Mon, 29 Jan 2001 03:05:16 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) References: <3A72CD1E.32BB523F@uow.edu.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing 
Content-Length: 2964 Lines: 94

jamal wrote:
>
> PS:- can you try it out with the ttcp testcode i posted?

Yup. See below. The numbers are almost the same as with `zcs' and `zcc'.

The CPU utilisation code which was in `zcc' has been broken out into a standalone tool, so the new `cyclesoak' app is a general-purpose system load measurement tool. It's fascinating to play with, if you're into that sort of thing.

`cyclesoak' was used to measure ttcp-sf and NFS/UDP client and server throughput.

The times()-based instrumentation inside ttcp-sf doesn't (can't) give correct numbers. 2-4% CPU at 100 mbps? We wish :)

The zerocopy patch doesn't seem to affect NFS efficiency at all. Confused.

Excerpt from the rapidly swelling README:

NFS/UDP client results
======================

Reading a 100 meg file across 100baseT. The file is fully cached on the server. The client is the above machine. You need to unmount the server between runs to avoid client-side caching.

The server is mounted with various rsize and wsize options.

  Kernel           rsize   wsize   mbyte/sec   CPU

  2.4.1-pre10+zc    1024    1024      2.4      10.3%
  2.4.1-pre10+zc    2048    2048      3.7      11.4%
  2.4.1-pre10+zc    4096    4096     10.1      29.0%
  2.4.1-pre10+zc    8199    8192     11.9      28.2%
  2.4.1-pre10+zc   16384   16384     11.9      28.2%

  2.4.1-pre10       1024    1024      2.4       9.7%
  2.4.1-pre10       2048    2048      3.7      11.8%
  2.4.1-pre10       4096    4096     10.7      33.6%
  2.4.1-pre10       8199    8192     11.9      29.5%
  2.4.1-pre10      16384   16384     11.9      29.2%

Small diff at 8192.

NFS/UDP server results
======================

Reading a 100 meg file across 100baseT. The file is fully cached on the server. The server is the above machine.

  Kernel           rsize   wsize   mbyte/sec   CPU

  2.4.1-pre10+zc    1024    1024      2.6      19.1%
  2.4.1-pre10+zc    2048    2048      3.9      18.8%
  2.4.1-pre10+zc    4096    4096     10.0      34.5%
  2.4.1-pre10+zc    8199    8192     11.8      28.9%
  2.4.1-pre10+zc   16384   16384     11.8      29.0%

  2.4.1-pre10       1024    1024      2.6      18.5%
  2.4.1-pre10       2048    2048      3.9      18.6%
  2.4.1-pre10       4096    4096     10.9      33.8%
  2.4.1-pre10       8199    8192     11.8      29.0%
  2.4.1-pre10      16384   16384     11.8      29.0%

No diff.
ttcp-sf Results =============== Jamal Hadi Salim has taught ttcp to use sendfile. See http://www.cyberus.ca/~hadi/ttcp-sf.tar.gz Using the same machine as above, and the following commands: Sender: ./ttcp-sf -t -c -l 32768 -v receiver_host Receiver: ./ttcp-sf -c -r -l 32768 -v sender_host CPU 2.4.1-pre10-zerocopy, sending with ttcp-sf: 10.5% 2.4.1-pre10-zerocopy, receiving with ttcp-sf: 16.1% 2.4.1-pre10-vanilla, sending with ttcp-sf: 18.5% 2.4.1-pre10-vanilla, receiving with ttcp-sf: 16.0% - From owner-netdev@oss.sgi.com Sun Jan 28 18:45:54 2001 Received: by oss.sgi.com id ; Sun, 28 Jan 2001 18:45:35 -0800 Received: from saturn.cs.uml.edu ([129.63.8.2]:23817 "EHLO saturn.cs.uml.edu") by oss.sgi.com with ESMTP id ; Sun, 28 Jan 2001 18:45:08 -0800 Received: (from acahalan@localhost) by saturn.cs.uml.edu (8.11.0/8.10.0) id f0T2j2Y438757; Sun, 28 Jan 2001 21:45:02 -0500 (EST) From: "Albert D. Cahalan" Message-Id: <200101290245.f0T2j2Y438757@saturn.cs.uml.edu> Subject: Re: [PATCH] dynamic IP support for 2.4.0 (SIOCKILLADDR) To: vii@altern.org (John Fremlin) Date: Sun, 28 Jan 2001 21:45:02 -0500 (EST) Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com, paulus@linuxcare.com, linux-ppp@vger.kernel.org, linux-net@vger.kernel.org In-Reply-To: from "John Fremlin" at Jan 27, 2001 10:54:51 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 721 Lines: 14 John Fremlin writes: > When the IP address of an interface changes, TCP connections with the > old source address are useless. Applications are not notified of this > and time out ordinarily, just as if nothing had happened. This is > behaviour isn't very helpful when you have a dynamic IP and know > you're probably not going to get the old one back. In that case, you ... > I patched userspace ppp-2.4.0 to use this functionality. 
It would be > better if SIOCKILLADDR were not used until we are sure that the new IP > is in fact different from the old one, but pppd in demand mode would I get the same IP about 2/3 of the time, so it is pretty important to avoid killing connections until after the new IP is known. From owner-netdev@oss.sgi.com Mon Jan 29 05:00:57 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 05:00:38 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:27802 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 05:00:08 -0800 Received: from fred.muc.de (noidentity@ns1066.munich.netsurf.de [195.180.235.66]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id NAA22853; Mon, 29 Jan 2001 13:59:55 +0100 (MET) Received: by fred.muc.de (Postfix, from userid 500) id 034D1E3BB8; Mon, 29 Jan 2001 13:59:05 +0100 (CET) Date: Mon, 29 Jan 2001 13:59:05 +0100 From: Andi Kleen To: "Albert D. Cahalan" Cc: John Fremlin , linux-kernel@vger.kernel.org, netdev@oss.sgi.com, paulus@linuxcare.com, linux-ppp@vger.kernel.org, linux-net@vger.kernel.org Subject: Re: [PATCH] dynamic IP support for 2.4.0 (SIOCKILLADDR) Message-ID: <20010129135905.B1591@fred.local> References: <200101290245.f0T2j2Y438757@saturn.cs.uml.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <200101290245.f0T2j2Y438757@saturn.cs.uml.edu>; from acahalan@cs.uml.edu on Mon, Jan 29, 2001 at 03:46:42AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1620 Lines: 35 On Mon, Jan 29, 2001 at 03:46:42AM +0100, Albert D. Cahalan wrote: > John Fremlin writes: > > > When the IP address of an interface changes, TCP connections with the > > old source address are useless. Applications are not notified of this > > and time out ordinarily, just as if nothing had happened. 
This is > > behaviour isn't very helpful when you have a dynamic IP and know > > you're probably not going to get the old one back. In that case, you > ... > > I patched userspace ppp-2.4.0 to use this functionality. It would be > > better if SIOCKILLADDR were not used until we are sure that the new IP > > is in fact different from the old one, but pppd in demand mode would > > I get the same IP about 2/3 of the time, so it is pretty important > to avoid killing connections until after the new IP is known. I prefer it when the IP is killed as soon as possible so that I can see when the connection is lost (ssh sessions get killed etc.) Another reason for killing as soon as possible is the last-ack problem. When the other end goes away suddenly TCP often gets into last-ack state. This means it'll retransmit a FIN until it times out or the other end answers. Each such retransmitted FIN triggers a new dialin, which can get quite costly when you don't have flat rate (like still most of Europe). With your approach (waiting until the new IP is known) it would cost at least another dialin in this case. When you have flatrate your way may be better of course, so a final user space solution could switch it via a pppd flag. 
[I agree that the user space way is better than my kernel hacks]

-Andi

From owner-netdev@oss.sgi.com Mon Jan 29 05:56:58 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 05:56:48 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:6895 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 05:56:31 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id AAA11195 for ; Tue, 30 Jan 2001 00:56:22 +1100 (EST) Message-ID: <3A75785A.42B9E7CE@uow.edu.au> Date: Tue, 30 Jan 2001 01:04:10 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com Subject: More measurements Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2390 Lines: 54

Let's compare 3coms and eepro100s. And the effects of MMIO versus PIO, and other stuff.

                                               3c905C   3c905C   3c905C   3c905C   3c905C   eepro100 eepro100 eepro100
                                               CPU      affine   ints     rx pps   tx pps   MMIO     I/O ops  ints

2.4.1-pre10+zerocopy, sendfile():              9.6%              4395     4106     8146     15.3%
2.4.1-pre10+zerocopy, send():                  24.1%             4449     4163     8196     20.2%
2.4.1-pre10+zerocopy, receiving:               18.7%             12332    8156     4189     17.6%

2.4.1-pre10+zerocopy, sendfile(), no xsum/SG:  16.2%                                        (15.3%)
2.4.1-pre10+zerocopy, send(), no xsum/SG:      21.5%                                        (20.2%)

2.4.1-pre10-vanilla, using sendfile():         17.1%    17.9%    5729     5296     8214     16.1%    16.8%
2.4.1-pre10-vanilla, using send():             21.1%             4629     4152     8191     20.3%    20.6%    6310
2.4.1-pre10-vanilla, receiving:                18.3%             12333    8156     4188     17.1%    18.2%    12335

Lots of interesting things here.

- eepro100 generates more interrupts doing TCP Tx, but not TCP Rx. I assume it doesn't do Tx mitigation?

- Changing eepro100 to use IO operations instead of MMIO slows down this dual 500MHz machine by less than one percent at 100 mbps. At 12,000 interrupts per second.
Why all the fuss about MMIO? - Bonding the 905's interrupt to CPU0 slows things down slightly. (This is contrary to other measurements I've previously taken. Don't pay any attention to this). - Without the zc patch, there is a significant increase (25%) in the number of Rx packets (acks, persumably) when data is sent using sendfile() as opposed to when the same data is sent with send(). Workload: 62 files, average size 350k. sendfile() tries to send the entire file in one hit send() breaks it up into 64kbyte chunks. When the zerocopy patch is applied, the Rx packet rate during sendfile() is the same as the rate during send(). Why is this? If this *alone* were fixed in 2.4.1, I'd expect the performance gain to be ~10% of system capacity on this NIC. - I see a consistent 12-13% slowdown on send() with the zerocopy patch. Can this be fixed? From owner-netdev@oss.sgi.com Mon Jan 29 06:42:17 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 06:42:08 -0800 Received: from [194.213.32.137] ([194.213.32.137]:4356 "EHLO bug.ucw.cz") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 06:41:50 -0800 Received: (from pavel@localhost) by bug.ucw.cz (8.8.8/8.8.5) id WAA01368; Sun, 28 Jan 2001 22:55:30 +0100 Message-ID: <20010128225530.A1300@bug.ucw.cz> Date: Sun, 28 Jan 2001 22:55:30 +0100 From: Pavel Machek To: jamal , Chris Wedgwood Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: ECN: Clearing the air References: <20010128150813.A1595@metastasis.f00f.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93i In-Reply-To: ; from jamal on Sat, Jan 27, 2001 at 09:15:48PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 782 Lines: 21 Hi! > > suggested blocking ECN. Article at: > > > > http://www.securityfocus.com/frames/?focus=ids&content=/focus/ids/articles/portscan.html > > > > the site is now ATM -- can someone briefly explain the logic in > > blocking it? 
> > It is Queso they quoted not nmap, sorry -- same thing. > The idea is to "detect" port scanners. > Queso sets the two TCP reserved bits in the SYN (now allocated for ECN). > Some OSes reflect that back in the SYN-ACK (Linux < 2.0.2? for example > was such a culprit). Does not that mean that Linux 2.0.10 mistakenly announces it is ECN capable when offered ECN connection? Pavel -- I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care." Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org From owner-netdev@oss.sgi.com Mon Jan 29 08:35:39 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 08:35:20 -0800 Received: from cache.sh.cvut.cz ([147.32.127.204]:50953 "EHLO cache.sh.cvut.cz") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 08:34:55 -0800 Received: from nightmare.sh.cvut.cz (root@nightmare.sh.cvut.cz [147.32.127.206]) by cache.sh.cvut.cz (8.9.3/Silicon Hill/Antispam) with ESMTP id RAA23278; Mon, 29 Jan 2001 17:34:38 +0100 Received: from localhost (bobek@localhost) by nightmare.sh.cvut.cz (8.9.3/8.9.3/Debian 8.9.3-21) with ESMTP id RAA23762; Mon, 29 Jan 2001 17:34:37 +0100 Date: Mon, 29 Jan 2001 17:34:37 +0100 (CET) From: Antonin Kral To: Jonathan Earle cc: "'jamal'" , Andrew Morton , lkml , netdev@oss.sgi.com Subject: RE: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: <28560036253BD41191A10000F8BCBD116BDCE5@zcard00g.ca.nortel.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 758 Lines: 20 > > Throughput: 100Mbps is really nothing. Linux never had a problem with > > 4-500Mbps file serving. So throughput is an important number. so is > > end to end latency, but in file serving case, latency might > > not be a big deal so ignore it. 
> > If I try to route more than 40mbps (40% line utilization) through a 100mbps > port (tulip) on a 2.4.0-test kernel running on a pIII 500 (or higher) > system, not only does the performance drop to nearly 0, the system gets all > sluggish and unusable. This is with or without Jamal's FF patches. > > How are you managing to get such high throughput? > I have used 2.2.13 to 2.2.18 and 2.4.0, for first approach, with no patches and with no probles I managed bandwidth about 200 and 300 Mbps Antonin From owner-netdev@oss.sgi.com Mon Jan 29 08:37:19 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 08:37:09 -0800 Received: from h56s242a129n47.user.nortelnetworks.com ([47.129.242.56]:58829 "EHLO zcars04e.ca.nortel.com") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 08:36:59 -0800 Received: from zcard015.ca.nortel.com (actually zcard015) by zcars04e.ca.nortel.com; Mon, 29 Jan 2001 11:16:57 -0500 Received: by zcard015.ca.nortel.com with Internet Mail Service (5.5.2652.35) id ; Mon, 29 Jan 2001 11:17:00 -0500 Message-ID: <28560036253BD41191A10000F8BCBD116BDCE5@zcard00g.ca.nortel.com> From: "Jonathan Earle" To: 'jamal' , Andrew Morton Cc: lkml , netdev@oss.sgi.com Subject: RE: sendfile+zerocopy: fairly sexy (nothing to do with ECN) Date: Mon, 29 Jan 2001 11:16:56 -0500 X-Mailer: Internet Mail Service (5.5.2652.35) X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 586 Lines: 13 > Throughput: 100Mbps is really nothing. Linux never had a problem with > 4-500Mbps file serving. So throughput is an important number. so is > end to end latency, but in file serving case, latency might > not be a big deal so ignore it. If I try to route more than 40mbps (40% line utilization) through a 100mbps port (tulip) on a 2.4.0-test kernel running on a pIII 500 (or higher) system, not only does the performance drop to nearly 0, the system gets all sluggish and unusable. This is with or without Jamal's FF patches. 
How are you managing to get such high throughput? Jon From owner-netdev@oss.sgi.com Mon Jan 29 08:51:59 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 08:51:49 -0800 Received: from mail.mediaways.net ([193.189.224.113]:21465 "HELO mail.mediaways.net") by oss.sgi.com with SMTP id ; Mon, 29 Jan 2001 08:51:46 -0800 Received: (qmail 23542 invoked from network); 29 Jan 2001 17:51:43 +0100 Received: from nrbg-3e363847.pool.mediaways.net (HELO frodo.uni-erlangen.de) (62.54.56.71) by mail.mediaways.net with SMTP; 29 Jan 2001 17:51:43 +0100 Received: (from wh@localhost) by frodo.uni-erlangen.de (8.9.3/8.8.8) id RAA04828; Mon, 29 Jan 2001 17:21:55 +0100 Date: Mon, 29 Jan 2001 17:21:55 +0100 From: Walter Hofmann To: Pavel Machek Cc: jamal , Chris Wedgwood , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: ECN: Clearing the air Message-ID: <20010129172155.A4712@frodo.uni-erlangen.de> References: <20010128150813.A1595@metastasis.f00f.org> <20010128225530.A1300@bug.ucw.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: <20010128225530.A1300@bug.ucw.cz>; from pavel@suse.cz on Sun, Jan 28, 2001 at 10:55:30PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 189 Lines: 8 On Sun, 28 Jan 2001, Pavel Machek wrote: > Does not that mean that Linux 2.0.10 mistakenly announces it is ECN > capable when offered ECN connection? No, the RFC deals with this. Walter From owner-netdev@oss.sgi.com Mon Jan 29 10:34:50 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 10:34:30 -0800 Received: from pcep-jamie.cern.ch ([137.138.38.126]:16651 "EHLO pcep-jamie.cern.ch") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 10:34:20 -0800 Received: (from jamie@localhost) by pcep-jamie.cern.ch (8.11.0/8.11.0) id f0TIVaf11058; Mon, 29 Jan 2001 19:31:36 +0100 Date: Mon, 29 Jan 2001 19:31:36 +0100 From: Jamie Lokier To: Andi Kleen Cc: "Albert D. 
Cahalan" , John Fremlin , linux-kernel@vger.kernel.org, netdev@oss.sgi.com, paulus@linuxcare.com, linux-ppp@vger.kernel.org, linux-net@vger.kernel.org Subject: Re: [PATCH] dynamic IP support for 2.4.0 (SIOCKILLADDR) Message-ID: <20010129193136.A11035@pcep-jamie.cern.ch> References: <200101290245.f0T2j2Y438757@saturn.cs.uml.edu> <20010129135905.B1591@fred.local> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20010129135905.B1591@fred.local>; from ak@muc.de on Mon, Jan 29, 2001 at 01:59:05PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 955 Lines: 21 Andi Kleen wrote: > > I get the same IP about 2/3 of the time, so it is pretty important > > to avoid killing connections until after the new IP is known. > > I prefer it when the IP is killed as soon as possible so that I can see > when the connection is lost (ssh sessions get killed etc.) I like it when I get the same IP back and can continue an ssh session. My line drops regularly in mid session. Unfortunately getting the same IP is rare now, so I've been toying with running a PPP tunnel through a fixed host out on the net. The tunnel would be dropped and recreated with each new connection. My local link IP would change, but the tunnel IP would not so connections to other places, ssh etc. would all be from the tunnel IP. The important thing is that the tunnel is destroyed and recreated (it has to be, it is over different underlying link addresses). I do not want that to destroy the connections from the tunnelled address. 
-- Jamie From owner-netdev@oss.sgi.com Mon Jan 29 10:51:20 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 10:51:10 -0800 Received: from palrel1.hp.com ([156.153.255.242]:25093 "HELO palrel1.hp.com") by oss.sgi.com with SMTP id ; Mon, 29 Jan 2001 10:50:44 -0800 Received: from tardy.cup.hp.com (tardy.cup.hp.com [15.8.80.176]) by palrel1.hp.com (Postfix) with ESMTP id D4A3C18A4; Mon, 29 Jan 2001 10:50:42 -0800 (PST) Received: from cup.hp.com (localhost [127.0.0.1]) by tardy.cup.hp.com (8.9.3 (PHNE_18546)/8.9.3 SMKit7.02) with ESMTP id KAA15767; Mon, 29 Jan 2001 10:50:42 -0800 (PST) Message-ID: <3A75BB81.3423F1B2@cup.hp.com> Date: Mon, 29 Jan 2001 10:50:41 -0800 From: Rick Jones Organization: the Unofficial HP X-Mailer: Mozilla 4.75 [en] (X11; U; HP-UX B.11.00 9000/785) X-Accept-Language: en MIME-Version: 1.0 To: jamal Cc: Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1024 Lines: 24 > I'll give this a shot later. Can you try with the sendfiled-ttcp? > http://www.cyberus.ca/~hadi/ttcp-sf.tar.gz I guess I need to "leverage" some bits for netperf :) WRT getting data with links that cannot saturate a system, having something akin to the netperf service demand measure can help. Nothing terribly fancy - simply a conversion of the CPU utilization and throughput to a microseconds of CPU to transfer a KB of data. As for CKO and avoiding copies and such, if past experience is any guide (ftp://ftp.cup.hp.com/dist/networking/briefs/copyavoid.ps) you get a very nice synergistic effect once the last "access" of data is removed. CKO gets you say 10%, avoiding the copy gets you say 10%, but doing both at the same time gets you 30%. 
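The service demand conversion described above can be sketched in a few lines. This is an illustration under stated assumptions, not netperf's own code: the function name is hypothetical, the CPU figure is taken to be a fraction of the whole machine (all CPUs together), and a KB is whatever unit the throughput is expressed in.

```python
def service_demand_us_per_kb(cpu_fraction, n_cpus, throughput_kb_per_sec):
    """Microseconds of CPU time spent per KB transferred.

    cpu_fraction:          utilisation of the whole machine, 0.0 .. 1.0
    n_cpus:                number of CPUs that utilisation is spread over
    throughput_kb_per_sec: measured transfer rate in KB per second
    """
    if throughput_kb_per_sec <= 0:
        raise ValueError("throughput must be positive")
    cpu_us_per_sec = cpu_fraction * n_cpus * 1_000_000
    return cpu_us_per_sec / throughput_kb_per_sec


# Half of a dual-CPU box busy while moving 1000 KB/s.
print(service_demand_us_per_kb(0.5, 2, 1000))
```

Applied to the sendfile() figure posted earlier in the thread (9.6% on a dual CPU machine at 11.5 mbytes/sec, taking 1 mbyte as 1024 KB and the CPU figure as covering both CPUs), this would come out at roughly 16 microseconds per KB. Because the result is normalised by throughput, it stays comparable even when the link, not the CPU, is the bottleneck.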
rick jones http://www.netperf.org/ -- ftp://ftp.cup.hp.com/dist/networking/misc/rachel/ these opinions are mine, all mine; HP might not want them anyway... :) feel free to email, OR post, but please do NOT do BOTH... my email address is raj in the cup.hp.com domain... From owner-netdev@oss.sgi.com Mon Jan 29 13:09:00 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 13:08:50 -0800 Received: from news.suse.de ([213.95.15.193]:63752 "HELO Cantor.suse.de") by oss.sgi.com with SMTP id ; Mon, 29 Jan 2001 13:08:44 -0800 Received: from Hermes.suse.de (Hermes.suse.de [213.95.15.136]) by Cantor.suse.de (Postfix) with ESMTP id 430BC1E209; Mon, 29 Jan 2001 22:08:42 +0100 (MET) Date: Mon, 29 Jan 2001 22:08:40 +0100 From: Andi Kleen To: Jamie Lokier Cc: Andi Kleen , "Albert D. Cahalan" , John Fremlin , linux-kernel@vger.kernel.org, netdev@oss.sgi.com, paulus@linuxcare.com, linux-ppp@vger.kernel.org, linux-net@vger.kernel.org Subject: Re: [PATCH] dynamic IP support for 2.4.0 (SIOCKILLADDR) Message-ID: <20010129220840.A7206@gruyere.muc.suse.de> References: <200101290245.f0T2j2Y438757@saturn.cs.uml.edu> <20010129135905.B1591@fred.local> <20010129193136.A11035@pcep-jamie.cern.ch> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20010129193136.A11035@pcep-jamie.cern.ch>; from ln@tantalophile.demon.co.uk on Mon, Jan 29, 2001 at 07:31:36PM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 385 Lines: 10 On Mon, Jan 29, 2001 at 07:31:36PM +0100, Jamie Lokier wrote: > The important thing is that the tunnel is destroyed and recreated (it > has to be, it is over different underlying link addresses). I do not > want that to destroy the connections from the tunnelled address. Just do not set IFF_DYNAMIC on the tunnel interface then, that is why it is a flag and not hardcoded. 
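A minimal userspace sketch of clearing that flag, assuming the standard SIOCGIFFLAGS/SIOCSIFFLAGS ioctls and an interface name supplied by the caller. The helper names are hypothetical, setting flags needs the appropriate privileges, and how the kernel patch under discussion consults IFF_DYNAMIC is not shown here.

```c
#define _DEFAULT_SOURCE          /* make glibc expose IFF_DYNAMIC */
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical pure helper: return the flag word with IFF_DYNAMIC cleared. */
static short without_dynamic(short flags)
{
    /* Work in unsigned arithmetic so the top bit is handled cleanly. */
    return (short)((unsigned short)flags & (unsigned short)~IFF_DYNAMIC);
}

/*
 * Hypothetical helper: clear IFF_DYNAMIC on the named interface.
 * Returns 0 on success, -1 on any failure (no such interface,
 * insufficient privileges, socket creation failed, ...).
 */
static int clear_iff_dynamic(const char *ifname)
{
    struct ifreq ifr;
    int ret = -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof ifr);
    /* The memset above guarantees the name stays NUL-terminated. */
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

    /* Read the current flags, drop IFF_DYNAMIC, write them back. */
    if (ioctl(fd, SIOCGIFFLAGS, &ifr) == 0) {
        ifr.ifr_flags = without_dynamic(ifr.ifr_flags);
        if (ioctl(fd, SIOCSIFFLAGS, &ifr) == 0)
            ret = 0;
    }

    close(fd);
    return ret;
}
```

Something like clear_iff_dynamic("ppp1") would then be run once the tunnel interface exists; the interface name there is only an example.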
-Andi From owner-netdev@oss.sgi.com Mon Jan 29 13:59:10 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 13:59:00 -0800 Received: from pizda.ninka.net ([216.101.162.242]:14722 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 13:58:50 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id NAA03087; Mon, 29 Jan 2001 13:57:08 -0800 From: "David S. Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14965.59187.922597.322214@pizda.ninka.net> Date: Mon, 29 Jan 2001 13:57:07 -0800 (PST) To: kuznet@ms2.inr.ac.ru Cc: andrewm@uow.edu.au (Andrew Morton), netdev@oss.sgi.com Subject: Re: zerocopy changes in 3c59x.c In-Reply-To: <200101261746.UAA27262@ms2.inr.ac.ru> References: <3A714788.82C064BD@uow.edu.au> <200101261746.UAA27262@ms2.inr.ac.ru> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 311 Lines: 11 kuznet@ms2.inr.ac.ru writes: > All these debugging statistics should be deleted. I removed them from acenic, > for example. They were interesting only at the first stage. I've killed it from my tree. Note that Acenic still had one rx_compressed reference in it :-) Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Mon Jan 29 16:48:41 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 16:48:31 -0800 Received: from m67-mp1-cvx1b.col.ntl.com ([213.104.72.67]:28933 "EHLO [213.104.72.67]") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 16:48:23 -0800 Received: by boreas (chainmail); Tue, 30 Jan 2001 00:47:05 GMT To: "Jamie Lokier" Cc: "Andi Kleen" , "Albert D.
Cahalan" , , , , , Subject: Re: [PATCH] dynamic IP support for 2.4.0 (SIOCKILLADDR) References: <200101290245.f0T2j2Y438757@saturn.cs.uml.edu> <20010129135905.B1591@fred.local> <20010129193136.A11035@pcep-jamie.cern.ch> From: "John Fremlin" Date: 30 Jan 2001 00:47:04 +0000 In-Reply-To: Jamie Lokier's message of "Mon, 29 Jan 2001 19:31:36 +0100" Message-ID: User-Agent: Gnus/5.0807 (Gnus v5.8.7) XEmacs/21.1 (GTK) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 418 Lines: 15 Jamie Lokier writes: [...] > The important thing is that the tunnel is destroyed and recreated > (it has to be, it is over different underlying link addresses). I > do not want that to destroy the connections from the tunnelled > address. No connections at all will be destroyed by my patch unless you enable the new killoldaddr pppd option. -- http://www.penguinpowered.com/~vii From owner-netdev@oss.sgi.com Mon Jan 29 16:49:21 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 16:49:11 -0800 Received: from m67-mp1-cvx1b.col.ntl.com ([213.104.72.67]:32517 "EHLO [213.104.72.67]") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 16:48:56 -0800 Received: by boreas (chainmail); Tue, 30 Jan 2001 00:43:33 GMT To: "Albert D. Cahalan" Cc: , , , , Subject: Re: [PATCH] dynamic IP support for 2.4.0 (SIOCKILLADDR) References: <200101290245.f0T2j2Y438757@saturn.cs.uml.edu> From: "John Fremlin" Date: 30 Jan 2001 00:43:32 +0000 In-Reply-To: "Albert D. Cahalan"'s message of "Sun, 28 Jan 2001 21:45:02 -0500 (EST)" Message-ID: User-Agent: Gnus/5.0807 (Gnus v5.8.7) XEmacs/21.1 (GTK) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 8010 Lines: 35 --=-=-= "Albert D. Cahalan" writes: [...] > > I patched userspace ppp-2.4.0 to use this functionality. 
It would be > > better if SIOCKILLADDR were not used until we are sure that the new IP > > is in fact different from the old one, but pppd in demand mode would > > I get the same IP about 2/3 of the time, so it is pretty important > to avoid killing connections until after the new IP is known. I'll try to explain again. If you have an existing (e.g. ssh) connection to a host across the interface, and the interface comes down, then pppd _will not bring it up again_ until you try to start a new connection, as far as I have experienced. Therefore you will get the old behaviour and my patch will do nothing. I decided it was better to inform ssh that the link was dead. Like I said, the solution to this is to make pppd cleverer about bringing the link up when there are existing connections. Alternatively, you could have some dubious script parsing netstat, checking whether there are connections over the interface, and pinging hosts at intervals to bring the link up again ;-) Here is a patch for pppd-2.4.0 orig that will give you the behaviour you want, provided you can solve the problem in the first paragraph. It is almost exactly the same as my last patch. It compiles and everything. Note that there are no changes required to the kernel side patch to enable this functionality.
--=-=-= Content-Type: text/x-patch Content-Disposition: attachment; filename=ppp-2.4.0-killaddr-smarter.patch diff -u --recursive ppp-2.4.0-orig/chat/Makefile.linux ppp-2.4.0-hacked/chat/Makefile.linux --- ppp-2.4.0-orig/chat/Makefile.linux Fri Aug 13 02:54:32 1999 +++ ppp-2.4.0-hacked/chat/Makefile.linux Sat Jan 27 18:34:47 2001 @@ -6,14 +6,14 @@ CDEF4= -DFNDELAY=O_NDELAY # Old name value CDEFS= $(CDEF1) $(CDEF2) $(CDEF3) $(CDEF4) -CFLAGS= -O2 -g -pipe $(CDEFS) +CFLAGS= $(COPTS) $(CDEFS) INSTALL= install all: chat chat: chat.o - $(CC) -o chat chat.o + $(CC) $(LDFLAGS) -o chat chat.o chat.o: chat.c $(CC) -c $(CFLAGS) -o chat.o chat.c diff -u --recursive ppp-2.4.0-orig/pppd/options.c ppp-2.4.0-hacked/pppd/options.c --- ppp-2.4.0-orig/pppd/options.c Tue Aug 1 02:38:30 2000 +++ ppp-2.4.0-hacked/pppd/options.c Sat Jan 27 18:51:30 2001 @@ -77,6 +77,9 @@ char user[MAXNAMELEN]; /* Username for PAP */ char passwd[MAXSECRETLEN]; /* Password for PAP */ bool persist = 0; /* Reopen link after it goes down */ +bool killoldaddr = 0; /* If our IP is reassigned on + reconnect, kill active TCP + connections using the old IP. 
*/ char our_name[MAXNAMELEN]; /* Our name for authentication purposes */ bool demand = 0; /* do dial-on-demand */ char *ipparam = NULL; /* Extra parameter for ip up/down scripts */ @@ -194,6 +197,10 @@ "Turn off persist option" }, { "demand", o_bool, &demand, "Dial on demand", OPT_INITONLY | 1, &persist }, + { "killoldaddr", o_bool, &killoldaddr, + "Kill connections from an old source address", 1}, + { "nokilloldaddr", o_bool,&killoldaddr, + "Don't kill connections from an old source address" }, { "--version", o_special_noarg, (void *)showversion, "Show version number" }, { "--help", o_special_noarg, (void *)showhelp, diff -u --recursive ppp-2.4.0-orig/pppd/pppd.h ppp-2.4.0-hacked/pppd/pppd.h --- ppp-2.4.0-orig/pppd/pppd.h Thu Jul 6 12:17:03 2000 +++ ppp-2.4.0-hacked/pppd/pppd.h Sat Jan 27 20:13:17 2001 @@ -235,6 +235,9 @@ extern char remote_name[MAXNAMELEN]; /* Peer's name for authentication */ extern bool explicit_remote;/* remote_name specified with remotename opt */ extern bool demand; /* Do dial-on-demand */ +extern bool killoldaddr; /* If our IP is reassigned on + reconnect, kill active TCP + connections using the old IP. 
*/ extern char *ipparam; /* Extra parameter for ip up/down scripts */ extern bool cryptpap; /* Others' PAP passwords are encrypted */ extern int idle_time_limit;/* Shut down link if idle for this long */ diff -u --recursive ppp-2.4.0-orig/pppd/sys-linux.c ppp-2.4.0-hacked/pppd/sys-linux.c --- ppp-2.4.0-orig/pppd/sys-linux.c Wed Jul 26 05:17:12 2000 +++ ppp-2.4.0-hacked/pppd/sys-linux.c Sat Jan 27 21:55:03 2001 @@ -115,6 +115,10 @@ #endif /* INET6 */ +#ifndef SIOCKILLADDR +#define SIOCKILLADDR 0x8939 +#endif + /* We can get an EIO error on an ioctl if the modem has hung up */ #define ok_error(num) ((num)==EIO) @@ -152,6 +156,7 @@ static u_int32_t proxy_arp_addr; /* Addr for proxy arp entry added */ static char proxy_arp_dev[16]; /* Device for proxy arp entry */ static u_int32_t our_old_addr; /* for detecting address changes */ +static u_int32_t our_current_addr; static int dynaddr_set; /* 1 if ip_dynaddr set */ static int looped; /* 1 if using loop */ static int link_mtu; /* mtu for the link (not bundle) */ @@ -491,6 +496,27 @@ return -1; } +static void do_killaddr(u_int32_t oldaddr) +{ + struct ifreq ifr; + + memset(&ifr,0,sizeof ifr); + + SET_SA_FAMILY (ifr.ifr_addr, AF_INET); + SET_SA_FAMILY (ifr.ifr_dstaddr, AF_INET); + SET_SA_FAMILY (ifr.ifr_netmask, AF_INET); + + SIN_ADDR(ifr.ifr_addr) = oldaddr; + + strlcpy(ifr.ifr_name, ifname, sizeof (ifr.ifr_name)); + + if(ioctl(sock_fd,SIOCKILLADDR,&ifr) < 0) { + if (!ok_error (errno)) + error("ioctl(SIOCKILLADDR): %m(%d)", errno); + return; + } +} + /******************************************************************** * * disestablish_ppp - Restore the serial port to normal operation. 
@@ -2177,10 +2206,10 @@ { struct ifreq ifr; struct rtentry rt; - + memset (&ifr, '\0', sizeof (ifr)); memset (&rt, '\0', sizeof (rt)); - + SET_SA_FAMILY (ifr.ifr_addr, AF_INET); SET_SA_FAMILY (ifr.ifr_dstaddr, AF_INET); SET_SA_FAMILY (ifr.ifr_netmask, AF_INET); @@ -2247,21 +2276,29 @@ } } - /* set ip_dynaddr in demand mode if address changes */ - if (demand && tune_kernel && !dynaddr_set - && our_old_addr && our_old_addr != our_adr) { + if(persist && our_old_addr && our_old_addr != our_adr) { + + if(killoldaddr) + do_killaddr(our_old_addr); + + + /* set ip_dynaddr in persist mode if address changes */ + if (tune_kernel && !dynaddr_set) { /* set ip_dynaddr if possible */ char *path; int fd; path = path_to_procfs("/sys/net/ipv4/ip_dynaddr"); if (path != 0 && (fd = open(path, O_WRONLY)) >= 0) { - if (write(fd, "1", 1) != 1) - error("Couldn't enable dynamic IP addressing: %m"); - close(fd); + if (write(fd, "1", 1) != 1) + error("Couldn't enable dynamic IP addressing: %m"); + close(fd); } dynaddr_set = 1; /* only 1 attempt */ + } } + + our_current_addr = our_adr; our_old_addr = 0; return 1; @@ -2317,7 +2354,8 @@ } our_old_addr = our_adr; - + our_current_addr = 0; + return 1; } diff -u --recursive ppp-2.4.0-orig/pppdump/Makefile.linux ppp-2.4.0-hacked/pppdump/Makefile.linux --- ppp-2.4.0-orig/pppdump/Makefile.linux Mon Jul 26 12:09:29 1999 +++ ppp-2.4.0-hacked/pppdump/Makefile.linux Sat Jan 27 18:34:47 2001 @@ -1,4 +1,4 @@ -CFLAGS= -O -I../include/net +CFLAGS= $(COPTS) -I../include/net OBJS = pppdump.o bsd-comp.o deflate.o zlib.o INSTALL= install @@ -6,7 +6,7 @@ all: pppdump pppdump: $(OBJS) - $(CC) -o pppdump $(OBJS) + $(CC) $(LDFLAGS) -o pppdump $(OBJS) clean: rm -f pppdump $(OBJS) *~ diff -u --recursive ppp-2.4.0-orig/pppstats/Makefile.linux ppp-2.4.0-hacked/pppstats/Makefile.linux --- ppp-2.4.0-orig/pppstats/Makefile.linux Wed Mar 25 02:21:19 1998 +++ ppp-2.4.0-hacked/pppstats/Makefile.linux Sat Jan 27 18:34:48 2001 @@ -22,7 +22,7 @@ $(INSTALL) -c -m 444 pppstats.8 
$(MANDIR)/man8/pppstats.8 pppstats: $(PPPSTATSRCS) - $(CC) $(CFLAGS) -o pppstats pppstats.c $(LIBS) + $(CC) $(CFLAGS) $(LDFLAGS) -o pppstats pppstats.c $(LIBS) clean: rm -f pppstats *~ #* core --=-=-= -- http://www.penguinpowered.com/~vii --=-=-=-- From owner-netdev@oss.sgi.com Mon Jan 29 17:06:52 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 17:06:42 -0800 Received: from cs.columbia.edu ([128.59.16.20]:39058 "EHLO cs.columbia.edu") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 17:06:34 -0800 Received: from age.cs.columbia.edu (IDENT:root@age.cs.columbia.edu [128.59.22.100]) by cs.columbia.edu (8.9.3/8.9.3) with ESMTP id UAA12295; Mon, 29 Jan 2001 20:06:32 -0500 (EST) Received: from localhost (ionut@localhost) by age.cs.columbia.edu (8.9.3/8.9.3) with ESMTP id UAA02548; Mon, 29 Jan 2001 20:06:32 -0500 Date: Mon, 29 Jan 2001 17:06:31 -0800 (PST) From: Ion Badulescu To: jamal cc: Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1562 Lines: 60 On Sat, 27 Jan 2001, jamal wrote: > > starfire: > > 2.4.1-pre10+zerocopy, using sendfile(): 9.6% CPU > > 2.4.1-pre10+zerocopy, using read()/write(): 18.3%-29.6% CPU * why so much variance? > > > > What are your throughput numbers? 11.5kBps, quite consistently. BTW, Andrew's new tool (with 8k reads/writes) has shown the load in the read/write case to be essentially the lower margin of the intervals I got in the first mail. > Could you also, please, test using: > > http://www.cyberus.ca/~hadi/ttcp-sf.tar.gz > > post both sender and receiver data. Repeat each test about > 5 times. I've tried it, but I'm not really sure what I can report. ttcp's measurements are clearly misleading, so I used Andrew's cyclesoak instead. 
The numbers are (with 2.4.1-pre10+zerocopy): [starfire, hw csum & sg enabled] sending with sendfile: 10.0-10.2% sending with send/write: 13.5-13.7% receiving: 20.0-20.2% [starfire, hw csum & sg disabled] sending with sendfile: 18.1-18.3% sending with send/write: 13.9-14.1% receiving: 24.3-24.5% [eepro100, i82559, no hw fancies] sending with sendfile: 16.2-16.4% sending with send/write: 12.0-12.2% receiving: 21.5-21.7% Same tests, this time with 2.4.1-pre10 vanilla: [starfire] sending with sendfile: 18.1-18.3% sending with send/write: 12.5-12.7% receiving: 23.0-23.1% [eepro100, i82559] sending with sendfile: 16.7-16.9% sending with send/write: 12.0-12.2% receiving: 20.8-20.9% Ion -- It is better to keep your mouth shut and be thought a fool, than to open it and remove all doubt. From owner-netdev@oss.sgi.com Mon Jan 29 18:13:02 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 18:12:51 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:55990 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 18:12:31 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id VAA00216; Mon, 29 Jan 2001 21:11:01 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Mon, 29 Jan 2001 21:11:01 -0500 (EST) From: jamal To: Pavel Machek cc: Chris Wedgwood , , Subject: Re: ECN: Clearing the air In-Reply-To: <20010128225530.A1300@bug.ucw.cz> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 491 Lines: 17 On Sun, 28 Jan 2001, Pavel Machek wrote: > > Does not that mean that Linux 2.0.10 mistakenly announces it is ECN > capable when offered ECN connection? In fact it does. But as someone mentioned, ECN is resilient to this. i.e this will be trapped and no ECN connection will happen. 
For historical purposes, let it be noted that it was the Linux ECN implementation in the 2.0 days that made this change in the RFC possible. "we believe in running code ..." -- yes, indeed. cheers, jamal From owner-netdev@oss.sgi.com Mon Jan 29 18:50:02 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 18:49:52 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:61366 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 18:49:34 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id VAA00353; Mon, 29 Jan 2001 21:48:40 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Mon, 29 Jan 2001 21:48:40 -0500 (EST) From: jamal To: Ion Badulescu cc: Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1074 Lines: 30 On Mon, 29 Jan 2001, Ion Badulescu wrote: > 11.5kBps, quite consistently. This gige card is really sick. Are you sure? Please double check. > > I've tried it, but I'm not really sure what I can report. ttcp's > measurements are clearly misleading, so I used Andrew's cyclesoak instead. The ttcp CPU (times()) measurements are misleading, in particular when doing sendfile. All they say is how much time ttcp spent in kernel space vs user space. So all CPU measurements I have posted in the past should be considered bogus. It is interesting to note, however, that the trends reported by ttcp's CPU measurements and by Andrew (and yourself) are similar;-> But the point is: CPU is not the only measure that is of interest. Throughput is definitely one of those that is of extreme importance. 100Mbps is not exciting. You seem to have gigE. I think your 11KB looks suspiciously wrong.
Can you double check please? cheers, jamal PS:- another important parameter is latency, but that might not be as important in file serving (maybe in short file transfers ala http). From owner-netdev@oss.sgi.com Mon Jan 29 19:27:22 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 19:27:12 -0800 Received: from cs.columbia.edu ([128.59.16.20]:62923 "EHLO cs.columbia.edu") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 19:26:59 -0800 Received: from age.cs.columbia.edu (IDENT:root@age.cs.columbia.edu [128.59.22.100]) by cs.columbia.edu (8.9.3/8.9.3) with ESMTP id WAA20398; Mon, 29 Jan 2001 22:26:57 -0500 (EST) Received: from localhost (ionut@localhost) by age.cs.columbia.edu (8.9.3/8.9.3) with ESMTP id WAA02765; Mon, 29 Jan 2001 22:26:55 -0500 Date: Mon, 29 Jan 2001 19:26:55 -0800 (PST) From: Ion Badulescu To: jamal cc: Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 400 Lines: 15 On Mon, 29 Jan 2001, jamal wrote: > > 11.5kBps, quite consistently. > > This gige card is really sick. Are you sure? Please double check. Umm.. the starfire chipset is 100Mbit only. So 11.5MBps (sorry, that was a typo, it's mega not kilo) is really all I'd expect out of it. Ion -- It is better to keep your mouth shut and be thought a fool, than to open it and remove all doubt. From owner-netdev@oss.sgi.com Mon Jan 29 22:02:33 2001 Received: by oss.sgi.com id ; Mon, 29 Jan 2001 22:02:23 -0800 Received: from pizda.ninka.net ([216.101.162.242]:8576 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Mon, 29 Jan 2001 22:01:54 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id WAA00930; Mon, 29 Jan 2001 22:00:47 -0800 From: "David S.
Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14966.22671.446439.838872@pizda.ninka.net> Date: Mon, 29 Jan 2001 22:00:47 -0800 (PST) To: Andrew Morton Cc: lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: <3A728475.34CF841@uow.edu.au> References: <3A726087.764CC02E@uow.edu.au> <20010126222003.A11994@vitelus.com> <3A728475.34CF841@uow.edu.au> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1340 Lines: 34 The "more expensive" write/send in zerocopy is a known cost of paged SKBs. This cost may be decreased a bit with some fine tuning, but not eliminated entirely. What do we get for this cost? Basically, the big win is not that the card checksums the packet. We could get that for free while copying the data from userspace into the kernel pages during the sendmsg(), using the combined "copy+checksum" hand-coded assembly routines we already have. It is in fact the better use of memory. Firstly, we use page allocations, only single ones. With linear buffers SLAB could use multiple pages which strain the memory subsystem quite a bit at times. Secondly, we fill pages with socket data precisely whereas SLAB can only get as tight packing as any general purpose memory allocator can. This, I feel, outweighs the slight performance decrease. And I would wager a bet that the better usage of memory will result in better all around performance. The problem with microscopic tests is that you do not see the world around the thing being focused on. I feel Andrew/Jamal's test are very valuable, but lets keep things in perspective when doing cost analysis. Finally, please do some tests on loopback. It is usually a great way to get "pure software overhead" measurements of our TCP stack. Later, David S. 
Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 30 01:35:04 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 01:34:55 -0800 Received: from pizda.ninka.net ([216.101.162.242]:37505 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 01:34:39 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id BAA05576; Tue, 30 Jan 2001 01:33:34 -0800 From: "David S. Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14966.35438.429963.405587@pizda.ninka.net> Date: Tue, 30 Jan 2001 01:33:34 -0800 (PST) To: linux-kernel@vger.kernel.org CC: netdev@oss.sgi.com Subject: [UPDATE] Fresh zerocopy patch on kernel.org X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 490 Lines: 20 At the usual place: ftp://ftp.kernel.org/pub/linux/kernel/people/davem/zerocopy-2.4.1-1.diff.gz (As usual, please allow some minutes for the mirrors to get it) Changes since last installment: 1) Merge to 2.4.1 final. (me) 2) Accept TCP flags (ACK, URG, RST, etc.) for out of window packets if truncating the data to the window would make that packet valid. (Alexey) 3) Add SO_ACCEPTCONN, Unix standard wants it. (me) Have fun testing... Later, David S. 
Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 30 02:29:47 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 02:29:37 -0800 Received: from [61.132.182.1] ([61.132.182.1]:23876 "EHLO mx1.ustc.edu.cn") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 02:29:17 -0800 Received: from webmail.ustc.edu.cn (postfix@webmail.ustc.edu.cn [202.38.64.16]) by mx1.ustc.edu.cn (8.8.7/8.8.6) with ESMTP id TAA02739 for ; Tue, 30 Jan 2001 19:15:08 -0800 Received: by webmail.ustc.edu.cn (Postfix, from userid 99) id 10471BC48; Tue, 30 Jan 2001 18:39:10 +0800 (CST) To: netdev@oss.sgi.com Subject: From: ynguo@mail.ustc.edu.cn MIME-Version: 1.0 Content-Type: multipart/mixed; boundary = "b4b8d6277e7d4dad8b53a47bd9b69774f" Message-Id: <20010130103910.10471BC48@webmail.ustc.edu.cn> Date: Tue, 30 Jan 2001 18:39:10 +0800 (CST) Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 228 Lines: 13 This is a MIME encoded message. --b4b8d6277e7d4dad8b53a47bd9b69774f Content-Type: text/plain Content-Transfer-Encoding: base64 Content-Disposition: inline IAogICAgdW5zdWJzY3JpYmUgKg== --b4b8d6277e7d4dad8b53a47bd9b69774f-- From owner-netdev@oss.sgi.com Tue Jan 30 03:16:27 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 03:16:17 -0800 Received: from f00f.stub.clear.net.nz ([203.167.224.51]:58885 "HELO metastasis.f00f.org") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 03:16:07 -0800 Received: by metastasis.f00f.org (Postfix, from userid 1000) id 5F9FAA495; Wed, 31 Jan 2001 00:16:05 +1300 (NZDT) Date: Wed, 31 Jan 2001 00:16:05 +1300 From: Chris Wedgwood To: "David S. 
Miller" Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [UPDATE] Fresh zerocopy patch on kernel.org Message-ID: <20010131001605.B6620@metastasis.f00f.org> References: <14966.35438.429963.405587@pizda.ninka.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <14966.35438.429963.405587@pizda.ninka.net>; from davem@redhat.com on Tue, Jan 30, 2001 at 01:33:34AM -0800 X-No-Archive: Yes Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 350 Lines: 12 On Tue, Jan 30, 2001 at 01:33:34AM -0800, David S. Miller wrote: 2) Accept TCP flags (ACK, URG, RST, etc.) for out of window packets if truncating the data to the window would make that packet valid. (Alexey) 3) Add SO_ACCEPTCONN, Unix standard wants it. (me) these have been feed back for 2.4.x Linus anyhow right? --cw From owner-netdev@oss.sgi.com Tue Jan 30 03:49:26 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 03:49:17 -0800 Received: from colin.muc.de ([193.149.48.1]:34056 "HELO colin.muc.de") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 03:48:58 -0800 Received: by colin.muc.de id <140563-3>; Tue, 30 Jan 2001 12:48:43 +0100 Message-ID: <20010130124839.24151@colin.muc.de> From: Andi Kleen To: Andrew Morton Cc: netdev@oss.sgi.com Subject: Re: More measurements References: <3A75785A.42B9E7CE@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.88e In-Reply-To: <3A75785A.42B9E7CE@uow.edu.au>; from Andrew Morton on Tue, Jan 30, 2001 at 10:08:02AM +0100 Date: Tue, 30 Jan 2001 12:48:40 +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1624 Lines: 52 On Tue, Jan 30, 2001 at 10:08:02AM +0100, Andrew Morton wrote: > Lots of interesting things here. > > - eepro100 generates more interrupts doing TCP Tx, but not > TCP Rx. I assume it doesn't do Tx mitigation? 
The Intel driver (e100.c) uploads special firmware and does it for RX and TX. eepro100 doesn't. Perhaps you could measure that driver too? It unfortunately doesn't support zc. > > - Changing eepro100 to use IO operations instead of MMIO slows > down this dual 500MHz machine by less than one percent at > 100 mbps. At 12,000 interrupts per second. Why all the fuss > about MMIO? iirc Ingo at some point found at some monster machine that the IO operations in the eepro100 interrupt handler dominated some Tux profile. > > - Bonding the 905's interrupt to CPU0 slows things down slightly. > (This is contrary to other measurements I've previously taken. > Don't pay any attention to this). ;) > > - Without the zc patch, there is a significant increase (25%) in > the number of Rx packets (acks, persumably) when data is sent > using sendfile() as opposed to when the same data is sent > with send(). RX on the sender? > > Workload: 62 files, average size 350k. > sendfile() tries to send the entire file in one hit > send() breaks it up into 64kbyte chunks. > > When the zerocopy patch is applied, the Rx packet rate during > sendfile() is the same as the rate during send(). > > Why is this? Does the send() variant use TCP_CORK ? > - I see a consistent 12-13% slowdown on send() with the zerocopy > patch. Can this be fixed? Ugh. -Andi From owner-netdev@oss.sgi.com Tue Jan 30 04:25:37 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 04:25:27 -0800 Received: from pizda.ninka.net ([216.101.162.242]:63361 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 04:25:06 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id EAA05979; Tue, 30 Jan 2001 04:23:53 -0800 From: "David S. 
Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14966.45657.61304.403990@pizda.ninka.net> Date: Tue, 30 Jan 2001 04:23:53 -0800 (PST) To: Chris Wedgwood Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [UPDATE] Fresh zerocopy patch on kernel.org In-Reply-To: <20010131001605.B6620@metastasis.f00f.org> References: <14966.35438.429963.405587@pizda.ninka.net> <20010131001605.B6620@metastasis.f00f.org> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 488 Lines: 17 Chris Wedgwood writes: > On Tue, Jan 30, 2001 at 01:33:34AM -0800, David S. Miller wrote: > > 2) Accept TCP flags (ACK, URG, RST, etc.) for out of window packets > if truncating the data to the window would make that packet valid. > (Alexey) > > 3) Add SO_ACCEPTCONN, Unix standard wants it. (me) > > these have been feed back for 2.4.x Linus anyhow right? Yes, but I couldn't get them to him in time for 2.4.1 Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 30 04:37:27 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 04:37:16 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:3037 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 04:37:02 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id XAA02747; Tue, 30 Jan 2001 23:36:40 +1100 (EST) Message-ID: <3A76B72D.2DD3E640@uow.edu.au> Date: Tue, 30 Jan 2001 23:44:29 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: "David S. 
Miller" CC: lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) References: <3A728475.34CF841@uow.edu.au>, <3A726087.764CC02E@uow.edu.au> <20010126222003.A11994@vitelus.com> <3A728475.34CF841@uow.edu.au> <14966.22671.446439.838872@pizda.ninka.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2116 Lines: 54 "David S. Miller" wrote: > > The "more expensive" write/send in zerocopy is a known cost of paged > SKBs. This cost may be decreased a bit with some fine tuning, but not > eliminated entirely. Can you say what causes the difference? I had a brief poke around - generic_copy_from_user() dominates in both cases of course, but nothing really stood out when comparing the zerocopy kernel's profile with non-zc. Varying the value of MAXPGS (all the way down to 1) and also the amount of data which is sent with send() does change the throughput, but not the ratio wrt non-zc. > What do we get for this cost? > > Basically, the big win is not that the card checksums the packet. > We could get that for free while copying the data from userspace > into the kernel pages during the sendmsg(), using the combined > "copy+checksum" hand-coded assembly routines we already have. > > It is in fact the better use of memory. Firstly, we use page > allocations, only single ones. With linear buffers SLAB could > use multiple pages which strain the memory subsystem quite a bit at > times. Secondly, we fill pages with socket data precisely whereas > SLAB can only get as tight packing as any general purpose memory > allocator can. > > This, I feel, outweighs the slight performance decrease. And I would > wager a bet that the better usage of memory will result in better > all around performance. ie: inappropriate test coverage. Not surprising. What additional scenarios need to be tested? 
Zillions of connections? If anyone really needs that 10% they can use the `hw_checksums=0' module parm, but SG+xsum is enabled by default - we need the testing. > The problem with microscopic tests is that you do not see the world > around the thing being focused on. I feel Andrew/Jamal's test are > very valuable, but lets keep things in perspective when doing cost > analysis. > > Finally, please do some tests on loopback. It is usually a great > way to get "pure software overhead" measurements of our TCP stack. Will do. BTW: can you suggest why I'm not observing any change in NFS client efficiency? - From owner-netdev@oss.sgi.com Tue Jan 30 04:54:17 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 04:54:07 -0800 Received: from pizda.ninka.net ([216.101.162.242]:5762 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 04:53:47 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id EAA06054; Tue, 30 Jan 2001 04:52:41 -0800 From: "David S. Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14966.47384.971741.939842@pizda.ninka.net> Date: Tue, 30 Jan 2001 04:52:40 -0800 (PST) To: Andrew Morton Cc: lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: <3A76B72D.2DD3E640@uow.edu.au> References: <3A728475.34CF841@uow.edu.au> <3A726087.764CC02E@uow.edu.au> <20010126222003.A11994@vitelus.com> <14966.22671.446439.838872@pizda.ninka.net> <3A76B72D.2DD3E640@uow.edu.au> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 552 Lines: 18 Andrew Morton writes: > BTW: can you suggest why I'm not observing any change in NFS client > efficiency? As in "filecopy speed" or "cpu usage while copying a file"? 
The current fragmentation code eliminates a full SKB allocation and data copy on the NFS file data receive path in the client, CPU has to be saved compared to pre-zerocopy or something is very wrong. File copy speed, well you should be link speed limited as even without the zerocopy patches you ought to have enough cpu to keep it busy. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 30 05:40:07 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 05:39:58 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:37575 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 05:39:33 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id AAA29439; Wed, 31 Jan 2001 00:39:21 +1100 (EST) Message-ID: <3A76C5DE.993BB140@uow.edu.au> Date: Wed, 31 Jan 2001 00:47:10 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: Andi Kleen CC: netdev@oss.sgi.com Subject: Re: More measurements References: <3A75785A.42B9E7CE@uow.edu.au>, <3A75785A.42B9E7CE@uow.edu.au>; from Andrew Morton on Tue, Jan 30, 2001 at 10:08:02AM +0100 <20010130124839.24151@colin.muc.de> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 3508 Lines: 99 Andi Kleen wrote: > > On Tue, Jan 30, 2001 at 10:08:02AM +0100, Andrew Morton wrote: > > Lots of interesting things here. > > > > - eepro100 generates more interrupts doing TCP Tx, but not > > TCP Rx. I assume it doesn't do Tx mitigation? > > The Intel driver (e100.c) uploads special firmware and does it for RX and TX. > eepro100 doesn't. Perhaps you could measure that driver too? Sure. Anyone have a URL for intel's driver? 
> > > > - Changing eepro100 to use IO operations instead of MMIO slows > > down this dual 500MHz machine by less than one percent at > > 100 mbps. At 12,000 interrupts per second. Why all the fuss > > about MMIO? > > iirc Ingo at some point found at some monster machine that the IO operations > in the eepro100 interrupt handler dominated some Tux profile. mm... My cycle soaker is not aggressive enough in its generation of memory traffic. I think I need to tune it so when all CPUs are 'soaking', memory traffic _just_ reaches 100%. This doesn't make much difference at all at the low utilisation levels of a 100bT NIC. However I doubt if that's relevant to the PIO vs MMIO issue. IIRC, an I/O op only flushes the write buffers, and blocks the CPU on writes. > ... > > > > - Without the zc patch, there is a significant increase (25%) in > > the number of Rx packets (acks, presumably) when data is sent > > using sendfile() as opposed to when the same data is sent > > with send(). > > RX on the sender? Amazing but true. If the sender is sending with sendfile(), he gets 25% more packets sent back at him than if he's sending the same data using send(). With the zerocopy patch this discrepancy disappears - the machine doing the sendfile() receives fewer packets with zc. ** Yup, I just retested everything. With vanilla 2.4.1, when data is sent with sendfile() we get 25% more packets coming back than when sending with 64 kbyte send()s. And 25% more bytes. With 2.4.1+zc, the number of packets coming back to the sendfile()r is the same as the number of packets coming back to the send()er, for both an SG+xsum NIC and a non-SG+xsum one. Average packet size on the receive path is 13.1 bytes. So by intent or by accident, something got fixed! > > > > Workload: 62 files, average size 350k. > > sendfile() tries to send the entire file in one hit > > send() breaks it up into 64kbyte chunks.
> > > > When the zerocopy patch is applied, the Rx packet rate during > > sendfile() is the same as the rate during send(). > > > > Why is this? > > Does the send() variant use TCP_CORK ? `zcc' supports that, and it makes no difference. You see, all the send()s are really big, and there's always data queued in the transmit direction so everything gets coalesced into full-sized frames. Even when sending 64 byte chunks with send() and no TCP_CORK, the frames on the wire are 500-1000 bytes long because of this. > > - I see a consistent 12-13% slowdown on send() with the zerocopy > > patch. Can this be fixed? > > Ugh. Well, although the ratio is a bit ugh, it's only 2% of the capacity of a dual 500 box at 100 mbps. It's hard to think of an app which has high enough bandwidth-over-compute requirements for this to matter. GigE may be a totally different story. With GigE the ratio between memory use and interrupt load is vastly different, and the 12-13% ratio could be a lot larger. But without knowing the cause, I'm guessing. I'll swap my postal address for a couple of Acenics :) - From owner-netdev@oss.sgi.com Tue Jan 30 06:51:38 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 06:51:28 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:16849 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 06:51:14 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id BAA07508; Wed, 31 Jan 2001 01:50:55 +1100 (EST) Message-ID: <3A76D6A4.2385185E@uow.edu.au> Date: Wed, 31 Jan 2001 01:58:44 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: "David S. 
Miller" CC: lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) References: <3A76B72D.2DD3E640@uow.edu.au>, <3A728475.34CF841@uow.edu.au> <3A726087.764CC02E@uow.edu.au> <20010126222003.A11994@vitelus.com> <14966.22671.446439.838872@pizda.ninka.net> <3A76B72D.2DD3E640@uow.edu.au> <14966.47384.971741.939842@pizda.ninka.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1490 Lines: 40 "David S. Miller" wrote: > > Andrew Morton writes: > > BTW: can you suggest why I'm not observing any change in NFS client > > efficiency? > > As in "filecopy speed" or "cpu usage while copying a file"? > > The current fragmentation code eliminates a full SKB allocation and > data copy on the NFS file data receive path in the client, CPU has to > be saved compared to pre-zerocopy or something is very wrong. > > File copy speed, well you should be link speed limited as even without > the zerocopy patches you ought to have enough cpu to keep it busy. > Mount the server rsize=wsize=8192. `cp' a 102,400,000 byte file from the NFS server to /dev/null. The file is fully cached on the server. unmount and remount the server between runs to eliminate client caching. The copy takes 8.654 seconds. That's 11.8 megabytes/sec. Client is 2.4.1-vanilla: 29.8% CPU Client is 2.4.1-zc: 28.2% CPU Client is 2.4.1-zc, non-SG+xsum NIC: 27.7% CPU So I was mistaken - there is an improvement. (A 2% CPU change is easily measurable with this setup). It may be a little better than this - cyclesoak I think will underestimate the benefit of saving on memory traffic. It only generates 10,000 cacheline writebacks per second per CPU. But winding it up to 80,000 doesn't affect the above figures much at all. The box has 130 mbyte/sec memory write bandwidth, so saving a copy should save 10% of this. (Wanders away, scratching head...) 
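The throughput figure above can be rechecked with a few lines of C. This is only a sketch of the arithmetic: the function names are made up for illustration, a megabyte is assumed to be 1,000,000 bytes, and the link is assumed to be a nominal 100 Mbit/s with no allowance for framing overhead.

```c
#include <stdio.h>

/* Throughput in decimal megabytes per second (assumes 1 MB = 1,000,000 bytes). */
double mbytes_per_sec(double bytes, double seconds)
{
    return bytes / seconds / 1e6;
}

/* Fraction of a link's nominal bit rate used by a given payload rate.
 * Framing and protocol overhead are ignored (an assumption). */
double link_utilisation(double mbytes_sec, double link_mbits)
{
    return (mbytes_sec * 8.0) / link_mbits;
}

/* Hypothetical helper: print the figures for the NFS copy described above. */
void report_nfs_copy(void)
{
    double rate = mbytes_per_sec(102400000.0, 8.654);

    printf("%.2f MB/s, %.1f%% of a 100 Mbit/s link\n",
           rate, 100.0 * link_utilisation(rate, 100.0));
}
```

Calling report_nfs_copy() should reproduce the 11.8 megabytes/sec quoted above and show that the copy runs close to the nominal link rate, which is consistent with the test being link-limited rather than CPU-limited.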
- From owner-netdev@oss.sgi.com Tue Jan 30 09:49:38 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 09:49:29 -0800 Received: from f00f.stub.clear.net.nz ([203.167.224.51]:518 "HELO metastasis.f00f.org") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 09:49:13 -0800 Received: by metastasis.f00f.org (Postfix, from userid 1000) id 1A383A4AA; Wed, 31 Jan 2001 06:49:11 +1300 (NZDT) Date: Wed, 31 Jan 2001 06:49:11 +1300 From: Chris Wedgwood To: Andrew Morton Cc: "David S. Miller" , lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) Message-ID: <20010131064911.B7244@metastasis.f00f.org> References: <3A76B72D.2DD3E640@uow.edu.au>, <3A728475.34CF841@uow.edu.au> <3A726087.764CC02E@uow.edu.au> <20010126222003.A11994@vitelus.com> <14966.22671.446439.838872@pizda.ninka.net> <3A76B72D.2DD3E640@uow.edu.au> <14966.47384.971741.939842@pizda.ninka.net> <3A76D6A4.2385185E@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <3A76D6A4.2385185E@uow.edu.au>; from andrewm@uow.edu.au on Wed, Jan 31, 2001 at 01:58:44AM +1100 X-No-Archive: Yes Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 514 Lines: 15 On Wed, Jan 31, 2001 at 01:58:44AM +1100, Andrew Morton wrote: Mount the server rsize=wsize=8192. `cp' a 102,400,000 byte file from the NFS server to /dev/null. The file is fully cached on the server. unmount and remount the server between runs to eliminate client caching. The copy takes 8.654 seconds. That's 11.8 megabytes/sec. What server are you using here? Using NetApp filers I don't see anything like this, probably only 8.5MB/s at most and this number is fairly noisy. 
--cw From owner-netdev@oss.sgi.com Tue Jan 30 11:01:59 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 11:01:50 -0800 Received: from cpe-24-221-152-185.az.sprintbbd.net ([24.221.152.185]:33267 "EHLO opus.bloom.county") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 11:01:31 -0800 Received: (from tmrini@localhost) by opus.bloom.county (8.11.2/8.11.2/Debian 8.11.2-1) id f0UIxdh18932; Tue, 30 Jan 2001 11:59:39 -0700 Date: Tue, 30 Jan 2001 11:59:39 -0700 From: Tom Rini To: "David S. Miller" Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [UPDATE] Fresh zerocopy patch on kernel.org Message-ID: <20010130115939.H17512@opus.bloom.county> References: <14966.35438.429963.405587@pizda.ninka.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.12i In-Reply-To: <14966.35438.429963.405587@pizda.ninka.net>; from davem@redhat.com on Tue, Jan 30, 2001 at 01:33:34AM -0800 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 424 Lines: 12 On Tue, Jan 30, 2001 at 01:33:34AM -0800, David S. Miller wrote: > Have fun testing... I've got a question. I think one of the threads mentioned that this could help NFS performance, in general. I assume having the NFS traffic go out via a card w/ the right HW could help as well, right? Should the 3com 3c905b work, and how can I check it's "working"? Thanks!
-- Tom Rini (TR1265) http://gate.crashing.org/~trini/ From owner-netdev@oss.sgi.com Tue Jan 30 11:15:09 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 11:14:59 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:41988 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 11:14:39 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA12591; Tue, 30 Jan 2001 22:14:24 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101301914.WAA12591@ms2.inr.ac.ru> Subject: Re: More measurements To: andrewm@uow.EDU.AU (Andrew Morton) Date: Tue, 30 Jan 2001 22:14:24 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: <3A76C5DE.993BB140@uow.edu.au> from "Andrew Morton" at Jan 30, 1 04:45:00 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 545 Lines: 18 Hello! > ** Yup, I just retested everything. With vanilla 2.4.1, > when data is sent with sendfile() we get 25% more packets > coming back than when sending with 64 kbyte send()s. And > 25% more bytes. This was due to pushes made by 4K writes used by old sendfile(). Pretty silly feature, but it seems to be required. 25% of packets are better than full collapse in some cases. Sigh... In any case, I did not understand one thing: do you really see some case, when you do not saturate eepro100? This is sort of difficult to make. 
Alexey From owner-netdev@oss.sgi.com Tue Jan 30 11:29:40 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 11:29:30 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:56836 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 11:29:14 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA12732; Tue, 30 Jan 2001 22:29:06 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101301929.WAA12732@ms2.inr.ac.ru> Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) To: andrewm@uow.EDU.AU (Andrew Morton) Date: Tue, 30 Jan 2001 22:29:06 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: <3A76B72D.2DD3E640@uow.edu.au> from "Andrew Morton" at Jan 30, 1 03:45:00 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 370 Lines: 15 Hello! > If anyone really needs that 10% they can use the `hw_checksums=0' > module parm, but SG+xsum is enabled by default - we need the testing. Before all: where did you lose these 2.5% of cpu? You can find leak with profiling. > BTW: can you suggest why I'm not observing any change in NFS client > efficiency? No. You must see strong decrease of load. 
Alexey From owner-netdev@oss.sgi.com Tue Jan 30 11:55:39 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 11:55:30 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:6405 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 11:55:28 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA12890; Tue, 30 Jan 2001 22:55:15 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101301955.WAA12890@ms2.inr.ac.ru> Subject: Re: More measurements To: andrewm@uow.EDU.AU (Andrew Morton) Date: Tue, 30 Jan 2001 22:55:15 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: <3A75785A.42B9E7CE@uow.edu.au> from "Andrew Morton" at Jan 29, 1 05:15:07 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 936 Lines: 26 Hello! > - Changing eepro100 to use IO operations instead of MMIO slows > down this dual 500MHz machine by less than one percent at > 100 mbps. At 12,000 interrupts per second. Why all the fuss > about MMIO? Interesting. 0.9% is ridiculously small, but 2.5% is some big number. 8)8) Actually, I am not sure that mmio improves something in your case. Even these 0.9% can be random or systematic error. It works if it caches access to io (not a fact in your case) and if bus load is high. You have no bus load, so that you do not see any bus congestion effects. > - Bonding the 905's interrupt to CPU0 slows things down slightly. I am afraid, you do mistake here. Bounding IRQ without bounding sending/receiving threads results either in improvement or in disaster depending on cpu, where thread is scheduled. Generally, setting irq affinity is valid only if host is router or if services and sockets are bound too. 
Alexey From owner-netdev@oss.sgi.com Tue Jan 30 12:15:59 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 12:15:50 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:15877 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 12:15:31 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA13024; Tue, 30 Jan 2001 23:15:20 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101302015.XAA13024@ms2.inr.ac.ru> Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) To: andrewm@uow.EDU.AU (Andrew Morton) Date: Tue, 30 Jan 2001 23:15:20 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: <3A76D6A4.2385185E@uow.edu.au> from "Andrew Morton" at Jan 30, 1 06:15:01 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 414 Lines: 15 Hello! > The box has 130 mbyte/sec memory write bandwidth, so saving > a copy should save 10% of this. (Wanders away, scratching > head...) Did you hope to get negative load? It is unlikely. 8) You had nic->skb mem->page mem->user mem and saved only one copy, moreover that copy which happens back-to-back through cache. BTW no need to scratch head, profiler exists to help to answer such questions. 
Alexey From owner-netdev@oss.sgi.com Tue Jan 30 12:52:40 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 12:52:30 -0800 Received: from smtp1.cern.ch ([137.138.128.38]:33042 "EHLO smtp1.cern.ch") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 12:52:22 -0800 Received: from lxplus015.cern.ch (IDENT:root@lxplus015.cern.ch [137.138.161.112]) by smtp1.cern.ch (8.9.3/8.9.3) with ESMTP id VAA06654; Tue, 30 Jan 2001 21:52:15 +0100 (MET) Received: (from jes@localhost) by lxplus015.cern.ch (8.9.3/8.9.3) id VAA32151; Tue, 30 Jan 2001 21:52:14 +0100 To: Andrew Morton Cc: netdev@oss.sgi.com Subject: Re: More measurements References: <3A75785A.42B9E7CE@uow.edu.au> From: Jes Sorensen Date: 30 Jan 2001 21:52:14 +0100 In-Reply-To: Andrew Morton's message of "Tue, 30 Jan 2001 01:04:10 +1100" Message-ID: User-Agent: Gnus/5.070096 (Pterodactyl Gnus v0.96) Emacs/20.4 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 704 Lines: 17 >>>>> "Andrew" == Andrew Morton writes: Andrew> - eepro100 generates more interrupts doing TCP Tx, but not TCP Andrew> Rx. I assume it doesn't do Tx mitigation? Andrew> - Changing eepro100 to use IO operations instead of MMIO slows Andrew> down this dual 500MHz machine by less than one percent at 100 Andrew> mbps. At 12,000 interrupts per second. Why all the fuss about Andrew> MMIO? Ingo or Don Becker (sorry, I don't remember if it was Ingo or Don) did some tests showing that writes to IO ports were about 10 times slower and reads about 5 times slower. There is also the issue of stalling the bus. This may not all show up in actual transmission speeds etc.
Jes From owner-netdev@oss.sgi.com Tue Jan 30 13:24:00 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 13:23:50 -0800 Received: from lsb-catv-1-p021.vtxnet.ch ([212.147.5.21]:12305 "EHLO almesberger.net") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 13:23:36 -0800 Received: (from almesber@localhost) by almesberger.net (8.9.3/8.9.3) id WAA12468; Tue, 30 Jan 2001 22:23:07 +0100 Date: Tue, 30 Jan 2001 22:23:07 +0100 From: Werner Almesberger To: netdev@oss.sgi.com, linux-diffserv@lrc.di.epfl.ch Subject: [RFC] tcng: Traffic Control, next generation language Message-ID: <20010130222307.A18286@almesberger.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1180 Lines: 29 Linux traffic control currently isn't exactly renowned for its user-friendly configuration ;-) I'd like to solve this problem by introducing a new configuration language. Once this is in place, more advanced concepts, such as direct generation of highly optimized C code for selected traffic control scenarios, will also become feasible. To get things rolling, I, with help from Milena Mondini, have defined a language that should be able to express most current traffic control configurations, and I've started implementing a translator to the "old" tc syntax: ftp://icaftp.epfl.ch/pub/linux/tcng/tcng-0s.tar.gz Note that the translator is still next to useless, because it only supports a very small number of traffic control elements and parameters. Also, some parameters will need special handling in the language (e.g. rsvp's "ipproto"). I'm making this release mainly to gather feedback on the configuration language. Comments?
- Werner -- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH Werner.Almesberger@epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Tue Jan 30 14:20:00 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 14:19:50 -0800 Received: from pizda.ninka.net ([216.101.162.242]:2182 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 14:19:31 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id OAA08053; Tue, 30 Jan 2001 14:17:57 -0800 From: "David S. Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14967.15765.553667.802101@pizda.ninka.net> Date: Tue, 30 Jan 2001 14:17:57 -0800 (PST) To: Chris Wedgwood Cc: Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: <20010131064911.B7244@metastasis.f00f.org> References: <3A76B72D.2DD3E640@uow.edu.au> <3A728475.34CF841@uow.edu.au> <3A726087.764CC02E@uow.edu.au> <20010126222003.A11994@vitelus.com> <14966.22671.446439.838872@pizda.ninka.net> <14966.47384.971741.939842@pizda.ninka.net> <3A76D6A4.2385185E@uow.edu.au> <20010131064911.B7244@metastasis.f00f.org> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 395 Lines: 13 Chris Wedgwood writes: > What server are you using here? Using NetApp filers I don't see > anything like this, probably only 8.5MB/s at most and this number is > fairly noisy. 8.5MB/sec sounds like half-duplex 100baseT. Positive you are running at full duplex all the way to the netapp, and if so how many switches sit between you and this netapp? Later, David S. 
Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 30 14:30:30 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 14:30:21 -0800 Received: from pizda.ninka.net ([216.101.162.242]:11398 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 14:30:06 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id OAA08155; Tue, 30 Jan 2001 14:28:27 -0800 From: "David S. Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14967.16395.42967.978677@pizda.ninka.net> Date: Tue, 30 Jan 2001 14:28:27 -0800 (PST) To: Andrew Morton Cc: lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: <3A76D6A4.2385185E@uow.edu.au> References: <3A76B72D.2DD3E640@uow.edu.au> <3A728475.34CF841@uow.edu.au> <3A726087.764CC02E@uow.edu.au> <20010126222003.A11994@vitelus.com> <14966.22671.446439.838872@pizda.ninka.net> <14966.47384.971741.939842@pizda.ninka.net> <3A76D6A4.2385185E@uow.edu.au> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 374 Lines: 13 Andrew Morton writes: > The box has 130 mbyte/sec memory write bandwidth, so saving > a copy should save 10% of this. (Wanders away, scratching > head...) Are you sure your measurement program will account properly for all system cycles spent in softnet processing? This is where the bulk of the cpu cycle savings will occur. Later, David S.
Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 30 15:27:41 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 15:27:30 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:62192 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 15:27:18 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id KAA05933; Wed, 31 Jan 2001 10:26:54 +1100 (EST) Message-ID: <3A774F92.B0F20608@uow.edu.au> Date: Wed, 31 Jan 2001 10:34:42 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) References: <3A76D6A4.2385185E@uow.edu.au>, <3A76B72D.2DD3E640@uow.edu.au> <3A728475.34CF841@uow.edu.au> <3A726087.764CC02E@uow.edu.au> <20010126222003.A11994@vitelus.com> <14966.22671.446439.838872@pizda.ninka.net> <14966.47384.971741.939842@pizda.ninka.net> <3A76D6A4.2385185E@uow.edu.au> <14967.16395.42967.978677@pizda.ninka.net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1424 Lines: 46 "David S. Miller" wrote: > > Andrew Morton writes: > > The box has 130 mbyte/sec memory write bandwidth, so saving > > a copy should save 10% of this. (Wanders away, scratching > > head...) > > Are you sure your measurement program will account properly > for all system cycles spent in softnet processing? This is > where the bulk of the cpu cycle savings will occur. > It tries to.
It runs n_cpus instances of this:

static void busyloop(int instance)
{
        int idx;

        for ( ; ; ) {
                for (idx = 0; idx < busyloop_size; idx++) {
                        int thumb;

                        busyloop_buf[idx]++;    /* Dirty a cacheline */
                        for (thumb = 0; thumb < 200; thumb++)
                                ;               /* twiddle */
                        busyloop_progress[instance * CACHE_LINE_SIZE]++;
                }
        }
}

At minimum priority. And it measures how much these threads are slowed down, wrt an unloaded system. So interrupt work is definitely accounted for. It needs work. It should walk the buffer in cacheline-sized strides, should have tunable read-versus-write ratios, should be scheduled with `idle' priority, should be bondable to CPUs and should create PCI traffic. That means an in-kernel implementation. But tweaking this thing thus far has made only very small differences in output. - From owner-netdev@oss.sgi.com Tue Jan 30 16:20:50 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 16:20:41 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:17034 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 16:20:26 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id LAA18355; Wed, 31 Jan 2001 11:20:13 +1100 (EST) Message-ID: <3A775C12.DEB372A5@uow.edu.au> Date: Wed, 31 Jan 2001 11:28:02 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: kuznet@ms2.inr.ac.ru CC: netdev@oss.sgi.com Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) References: <3A76D6A4.2385185E@uow.edu.au> from "Andrew Morton" at Jan 30, 1 06:15:01 pm <200101302015.XAA13024@ms2.inr.ac.ru> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1791 Lines: 52 kuznet@ms2.inr.ac.ru wrote: > > Hello!
> > > The box has 130 mbyte/sec memory write bandwidth, so saving > > a copy should save 10% of this. (Wanders away, scratching > > head...) > > Did you hope to get negative load? It is unlikely. 8) > > You had nic->skb mem->page mem->user mem and saved only one copy, > moreover that copy which happens back-to-back through cache. Well, it's an interesting problem. How do we define "system load"? It's a combination of CPU cycles, memory bandwidth and I/O bandwidth (yes?). Given that, how do we measure it? My approach is to generate a mix of CPU load and memory traffic, and see how much this is slowed down by networking. Run the dummy load "in the background" so it doesn't affect the thing being tested. (Even this is questionable, because it's not real-world. Perhaps the dummy load should run with equal priority). That's why I'm scratching my head. Interesting problem. Perhaps it would make more sense not to look for a single percentage figure, but to measure the percentage of CPU cycles and the percentage of memory bandwidth separately. Saving a single copy of a 100 mbps stream should save 11 mbytes/sec of memory write bandwidth. I assume the saving in read bandwidth is insignificant because it's already in CPU cache, or will be soon. I tried changing the dummy loop so it reads 1,000,000 cachelines/sec/CPU and dirties 250,000 lines/sec/CPU. Dual CPU. It made a negligible difference to all measurements. > > BTW no need to scratch head, profiler exists to help to answer > such questions. I didn't try profiling NFS reads. Profiling sendfile() and send() activity didn't show up anything very interesting, but we're looking for a pretty small delta. May I ask: have you tried to do any quantitative performance testing?
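The measurement approach described above (see how much a background dummy load is slowed down) can be written down as a small calculation. This is a minimal sketch, not the actual cycle soaker: the function names are hypothetical, and it assumes the soaker's progress counter advances at a steady rate on an idle machine. The second helper restates the memory-bandwidth estimate using the 11 mbytes/sec and 130 mbyte/sec figures from this thread.

```c
/* Estimated fraction of the machine consumed by the workload under test,
 * inferred from how far the background soaker's progress rate dropped.
 * idle_rate:   soaker progress per second on an otherwise idle system
 * loaded_rate: soaker progress per second while the test is running
 * Both names are illustrative assumptions, not part of the original tool. */
double soak_load(double idle_rate, double loaded_rate)
{
    double load;

    if (idle_rate <= 0.0)
        return 0.0;             /* no baseline, nothing to report */

    load = 1.0 - loaded_rate / idle_rate;
    if (load < 0.0)
        load = 0.0;             /* measurement noise can push this negative */
    if (load > 1.0)
        load = 1.0;
    return load;
}

/* Share of the box's memory write bandwidth that one copy of a data
 * stream would occupy, e.g. 11 mbytes/sec out of 130 mbytes/sec. */
double copy_bandwidth_share(double stream_mbytes_sec, double mem_write_mbytes_sec)
{
    if (mem_write_mbytes_sec <= 0.0)
        return 0.0;
    return stream_mbytes_sec / mem_write_mbytes_sec;
}
```

With the numbers quoted in the thread, copy_bandwidth_share(11.0, 130.0) comes out a little under one tenth, in line with the "should save 10%" estimate. Whether that saving shows up in a CPU-centric figure such as soak_load() is exactly the open question being discussed.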
From owner-netdev@oss.sgi.com Tue Jan 30 16:23:30 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 16:23:21 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:61585 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 16:23:11 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id LAA18999; Wed, 31 Jan 2001 11:23:00 +1100 (EST) Message-ID: <3A775CB8.AE0C6A2F@uow.edu.au> Date: Wed, 31 Jan 2001 11:30:48 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: kuznet@ms2.inr.ac.ru CC: netdev@oss.sgi.com Subject: Re: More measurements References: <3A76C5DE.993BB140@uow.edu.au> from "Andrew Morton" at Jan 30, 1 04:45:00 pm <200101301914.WAA12591@ms2.inr.ac.ru> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 779 Lines: 23 kuznet@ms2.inr.ac.ru wrote: > > Hello! > > > ** Yup, I just retested everything. With vanilla 2.4.1, > > when data is sent with sendfile() we get 25% more packets > > coming back than when sending with 64 kbyte send()s. And > > 25% more bytes. > > This was due to pushes made by 4K writes used by old sendfile(). > > Pretty silly feature, but it seems to be required. 25% of packets > are better than full collapse in some cases. Sigh... Ah. Thanks. It's good to have an explanation :) > In any case, I did not understand one thing: > do you really see some case, when you do not saturate eepro100? > This is sort of difficult to make. In my testing, in all cases, at all times, with all NICs, the link is 100% saturated. I don't need another dimension to deal with! 
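To get a feel for the explanation quoted above, one can count how many write-sized chunks a typical file from the workload breaks into. This is only a rough sketch of the arithmetic. It assumes, beyond what the thread states, that each chunk ends with one push and that "350k" means 350 * 1024 bytes; the function name is made up for illustration.

```c
#include <stddef.h>

/* Number of chunks needed to send file_bytes in pieces of chunk_bytes,
 * rounding up for a final partial chunk. */
size_t chunks_needed(size_t file_bytes, size_t chunk_bytes)
{
    if (chunk_bytes == 0)
        return 0;               /* avoid dividing by zero */
    return file_bytes / chunk_bytes + (file_bytes % chunk_bytes != 0);
}
```

For an average 350k file, chunks_needed(350 * 1024, 4096) gives 88 chunks for the 4K writes of the old sendfile(), against 6 for 64 kbyte send()s. Under the one-push-per-chunk assumption that is many more pushes per file, which fits the direction of the observed increase in returning packets, though it does not by itself predict the 25% figure.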
From owner-netdev@oss.sgi.com Tue Jan 30 16:30:12 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 16:30:02 -0800 Received: from horus.its.uow.edu.au ([130.130.68.25]:28324 "EHLO horus.its.uow.edu.au") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 16:29:54 -0800 Received: from uow.edu.au (wumpus.its.uow.edu.au [130.130.68.12]) by horus.its.uow.edu.au (8.9.3/8.9.3) with ESMTP id LAA20462; Wed, 31 Jan 2001 11:29:39 +1100 (EST) Message-ID: <3A775E48.53BEA38A@uow.edu.au> Date: Wed, 31 Jan 2001 11:37:28 +1100 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.4.0-test8 i586) X-Accept-Language: en MIME-Version: 1.0 To: Jes Sorensen CC: netdev@oss.sgi.com Subject: Re: More measurements References: <3A75785A.42B9E7CE@uow.edu.au>, Andrew Morton's message of "Tue, 30 Jan 2001 01:04:10 +1100" Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1029 Lines: 26 Jes Sorensen wrote: > > >>>>> "Andrew" == Andrew Morton writes: > > Andrew> - eepro100 generates more interrupts doing TCP Tx, but not TCP > Andrew> Rx. I assume it doesn't do Tx mitigation? > > Andrew> - Changing eepro100 to use IO operations instead of MMIO slows > Andrew> down this dual 500MHz machine by less than one percent at 100 > Andrew> mbps. At 12,000 interrupts per second. Why all the fuss about > Andrew> MMIO? > > Ingo or Don Becker (sorry don't remember if it was Ingo or Don) did > some tests showing that the write speed to IO ports was about 10 times > the slower and read about 5 times slower. There is also the issues of > stalling the bus. This may not all show up in actual transmission > speeds etc. Interesting. Question: when the CPU reads from a PCI location, are the CPU <-> PCI bridge buses blocked for the duration of the read, or does the PCI bridge reconnect? 
I'm guessing that the major benefit of MMIO is posted writes: getting the CPU off the memory bus quickly. From owner-netdev@oss.sgi.com Tue Jan 30 16:31:52 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 16:31:42 -0800 Received: from f00f.stub.clear.net.nz ([203.167.224.51]:4870 "HELO metastasis.f00f.org") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 16:31:31 -0800 Received: by metastasis.f00f.org (Postfix, from userid 1000) id 69562A495; Wed, 31 Jan 2001 13:31:23 +1300 (NZDT) Date: Wed, 31 Jan 2001 13:31:23 +1300 From: Chris Wedgwood To: "David S. Miller" Cc: Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) Message-ID: <20010131133123.A7875@metastasis.f00f.org> References: <3A76B72D.2DD3E640@uow.edu.au> <3A728475.34CF841@uow.edu.au> <3A726087.764CC02E@uow.edu.au> <20010126222003.A11994@vitelus.com> <14966.22671.446439.838872@pizda.ninka.net> <14966.47384.971741.939842@pizda.ninka.net> <3A76D6A4.2385185E@uow.edu.au> <20010131064911.B7244@metastasis.f00f.org> <14967.15765.553667.802101@pizda.ninka.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <14967.15765.553667.802101@pizda.ninka.net>; from davem@redhat.com on Tue, Jan 30, 2001 at 02:17:57PM -0800 X-No-Archive: Yes Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 709 Lines: 22 On Tue, Jan 30, 2001 at 02:17:57PM -0800, David S. Miller wrote: 8.5MB/sec sounds like half-duplex 100baseT. No; I'm 100% its FD; HD gives 40k/sec TCP because of collisions and such like. Positive you are running at full duplex all the way to the netapp, and if so how many switches sit between you and this netapp? It's FD all the way (we hardwire everything to 100-FD and never trust auto-negotiate); I see no errors or such like anywhere. There are ... ... 3 switches between four switches in between, mostly linked via GE. 
I'm not sure if latency might be an issue here; if it was critical I can imagine 10 km of glass might be a problem, but it's not _that_ far...

--cw

From owner-netdev@oss.sgi.com Tue Jan 30 16:44:22 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 16:44:12 -0800 Received: from pizda.ninka.net ([216.101.162.242]:6537 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 16:43:55 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id QAA09073; Tue, 30 Jan 2001 16:42:39 -0800 From: "David S. Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14967.24447.473094.687361@pizda.ninka.net> Date: Tue, 30 Jan 2001 16:42:39 -0800 (PST) To: Andrew Morton Cc: Jes Sorensen , netdev@oss.sgi.com Subject: Re: More measurements In-Reply-To: <3A775E48.53BEA38A@uow.edu.au> References: <3A75785A.42B9E7CE@uow.edu.au> <3A775E48.53BEA38A@uow.edu.au> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 535 Lines: 20

Andrew Morton writes:
 > I'm guessing that the major benefit of MMIO is posted
 > writes: getting the CPU off the memory bus quickly.

My vague recollection is:

1) Any old style IO flushes fifos in both directions in the
   path on the PCI bus to the device before the transaction
   is even started.

2) As mentioned, writes cannot be posted and thus the cpu
   must stall on completion.

A reading of the 2.1 PCI specs will show all of this and
show if my brain is working today or not :-)

Later, David S.
Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 30 16:47:52 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 16:47:42 -0800 Received: from [63.93.198.67] ([63.93.198.67]:17574 "EHLO mercury.mayannetworks.com") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 16:47:41 -0800 Received: from fs-phx.mayannetworks.com (fs-phx.mayannetworks.com [10.4.1.3]) by mercury.mayannetworks.com (8.9.3/8.9.3) with ESMTP id QAA05974; Tue, 30 Jan 2001 16:47:39 -0800 (PST) Received: from mayannetworks.com (bgreear@[10.4.1.247]) by fs-phx.mayannetworks.com (8.8.8/8.8.8) with ESMTP id RAA10537; Tue, 30 Jan 2001 17:47:39 -0700 (MST) Message-ID: <3A776000.E0CF3D37@mayannetworks.com> Date: Tue, 30 Jan 2001 17:44:48 -0700 From: Ben Greear X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.2.16-22 i686) X-Accept-Language: en MIME-Version: 1.0 To: linux-atm , netdev Subject: packet (ppp) over Sonet in Linux Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 408 Lines: 13 Is there anyone considering, or already supporting, ppp-over-sonet or some other packet-over-sonet protocol? Could existing ATM NICs/drivers be adapted to support such protocols? Thanks, Ben -- Ben Greear (bgreear@mayannetworks.com) http://www.mayannetworks.com NAM Team, Phoenix http://www-internal/~bgreear Phone: 602-325-2043 Personal Cell Phone: 602-502-6887 From owner-netdev@oss.sgi.com Tue Jan 30 16:51:42 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 16:51:32 -0800 Received: from pizda.ninka.net ([216.101.162.242]:11145 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 16:51:26 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id QAA09086; Tue, 30 Jan 2001 16:45:04 -0800 From: "David S. 
Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14967.24592.679296.55231@pizda.ninka.net> Date: Tue, 30 Jan 2001 16:45:04 -0800 (PST) To: Chris Wedgwood Cc: Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: <20010131133123.A7875@metastasis.f00f.org> References: <3A76B72D.2DD3E640@uow.edu.au> <3A728475.34CF841@uow.edu.au> <3A726087.764CC02E@uow.edu.au> <20010126222003.A11994@vitelus.com> <14966.22671.446439.838872@pizda.ninka.net> <14966.47384.971741.939842@pizda.ninka.net> <3A76D6A4.2385185E@uow.edu.au> <20010131064911.B7244@metastasis.f00f.org> <14967.15765.553667.802101@pizda.ninka.net> <20010131133123.A7875@metastasis.f00f.org> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 534 Lines: 15 Chris Wedgwood writes: > There are ... ... 3 switches between four switches in > between, mostly linked via GE. I'm not sure if latency might be an > issue here, is it was critical I can imagine 10 km of glass might be > a problem but it's not _that_ far... Other than this, I don't know what to postulate. Really, most reports and my own experimentation (directly connected Linux knfsd to 2.4.x nfs client) supports the fact that our client can saturate 100baseT rather fully. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Jan 30 16:55:31 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 16:55:22 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:43447 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 16:55:16 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) 
with ESMTP id TAA03719; Tue, 30 Jan 2001 19:53:59 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 30 Jan 2001 19:53:59 -0500 (EST) From: jamal To: Ion Badulescu cc: Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1679 Lines: 53

On Mon, 29 Jan 2001, Ion Badulescu wrote:

> On Mon, 29 Jan 2001, jamal wrote:
>
> > > 11.5kBps, quite consistently.
> >
> > This gige card is really sick. Are you sure? Please double check.
>
> Umm.. the starfire chipset is 100Mbit only. So 11.5MBps (sorry, that was a
> typo, it's mega not kilo) is really all I'd expect out of it.
>

not good.

So far all the tests have been around CPU. The general trend seems
to be:
- sendfile + ZC good for CPU
- write() + ZC not good for CPU
(i might have forgotten something from Andrew's results).
This happens (even with my bogus cpu measure) to be similar. That seems to
be explainable.

** I reported that there was also an oddity in throughput values,
unfortunately since no one (other than me) seems to have access
to a gige cards in the ZC list, nobody can confirm or disprove
what i posted. Here again as a reminder:

Kernel     | tput   | sender-CPU | receiver-CPU |
-------------------------------------------------
2.4.0-pre3 | 99MB/s | 87%        | 23%          |
NSF        |        |            |              |
-------------------------------------------------
2.4.0-pre3 | 86MB/s | 100%       | 17%          |
SF         |        |            |              |
-------------------------------------------------
2.4.0-pre3 | 66.2   | 60%        | 11%          |
+ZC        | MB/s   |            |              |
-------------------------------------------------
2.4.0-pre3 | 68     | 8%         | 8%           |
+ZC SF     | MB/s   |            |              |
-------------------------------------------------

Just ignore the CPU readings, focus on throughput. And could someone please
post results?
cheers,
jamal

From owner-netdev@oss.sgi.com Tue Jan 30 17:00:32 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 17:00:22 -0800 Received: from chiara.elte.hu ([157.181.150.200]:48645 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 17:00:12 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id ABBF9186D; Wed, 31 Jan 2001 02:00:09 +0100 (CET) Date: Wed, 31 Jan 2001 01:59:38 +0100 (CET) From: Ingo Molnar Reply-To: To: jamal Cc: Ion Badulescu , Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 803 Lines: 27

On Tue, 30 Jan 2001, jamal wrote:

> Kernel     | tput   | sender-CPU | receiver-CPU |
> -------------------------------------------------
> 2.4.0-pre3 | 99MB/s | 87%        | 23%          |
> NSF        |        |            |              |
> -------------------------------------------------
> 2.4.0-pre3 | 68     | 8%         | 8%           |
> +ZC SF     | MB/s   |            |              |
> -------------------------------------------------

isn't the CPU utilization difference amazing? :-)

a couple of questions:

- is this UDP or TCP based? (UDP i guess)

- what wsize/rsize are you using? What do these requests look like on the
  network, ie. are they sufficiently MTU-sized?

- what happens if you run multiple instances of the testcode, does it
  saturate bandwidth (or CPU)?

	Ingo

From owner-netdev@oss.sgi.com Tue Jan 30 17:06:02 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 17:05:52 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:45239 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 17:05:45 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.)
with ESMTP id UAA03734; Tue, 30 Jan 2001 20:04:51 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 30 Jan 2001 20:04:51 -0500 (EST) From: jamal To: Ingo Molnar cc: Ion Badulescu , Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1298 Lines: 46

On Wed, 31 Jan 2001, Ingo Molnar wrote:

>
> On Tue, 30 Jan 2001, jamal wrote:
>
> > Kernel     | tput   | sender-CPU | receiver-CPU |
> > -------------------------------------------------
> > 2.4.0-pre3 | 99MB/s | 87%        | 23%          |
> > NSF        |        |            |              |
> > -------------------------------------------------
> > 2.4.0-pre3 | 68     | 8%         | 8%           |
> > +ZC SF     | MB/s   |            |              |
> > -------------------------------------------------
>
> isnt the CPU utilization difference amazing? :-)
>

With a caveat, sadly ;->
ttcp uses the times() system call (a diff of two times() calls, one at the
beginning and another at the end). So the cpu measurements are not
representative.

> a couple of questions:
>
> - is this UDP or TCP based? (UDP i guess)
>

TCP

> - what wsize/rsize are you using? How do these requests look like on the
>   network, ie. are they suffieciently MTU-sized?

yes. writes vary from 8K->64K but there is not much difference over a long
period of time.

>
> - what happens if you run multiple instances of the testcode, does it
>   saturate bandwidth (or CPU)?

This is something of great interest. I haven't tried it. I should.
I suspect this would be where the value of the ZC changes will become
evident.
cheers, jamal From owner-netdev@oss.sgi.com Tue Jan 30 17:11:32 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 17:11:23 -0800 Received: from palrel1.hp.com ([156.153.255.242]:60171 "HELO palrel1.hp.com") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 17:11:14 -0800 Received: from tardy.cup.hp.com (tardy.cup.hp.com [15.8.80.176]) by palrel1.hp.com (Postfix) with ESMTP id C47482740; Tue, 30 Jan 2001 17:10:52 -0800 (PST) Received: from cup.hp.com (localhost [127.0.0.1]) by tardy.cup.hp.com (8.9.3 (PHNE_18546)/8.9.3 SMKit7.02) with ESMTP id RAA14622; Tue, 30 Jan 2001 17:10:52 -0800 (PST) Message-ID: <3A77661C.5D7FD4C@cup.hp.com> Date: Tue, 30 Jan 2001 17:10:52 -0800 From: Rick Jones Organization: the Unofficial HP X-Mailer: Mozilla 4.75 [en] (X11; U; HP-UX B.11.00 9000/785) X-Accept-Language: en MIME-Version: 1.0 To: jamal Cc: Ion Badulescu , Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing to dowith ECN) References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1653 Lines: 41 > ** I reported that there was also an oddity in throughput values, > unfortunately since no one (other than me) seems to have access > to a gige cards in the ZC list, nobody can confirm or disprove > what i posted. 
Here again as a reminder:
>
> Kernel     | tput   | sender-CPU | receiver-CPU |
> -------------------------------------------------
> 2.4.0-pre3 | 99MB/s | 87%        | 23%          |
> NSF        |        |            |              |
> -------------------------------------------------
> 2.4.0-pre3 | 86MB/s | 100%       | 17%          |
> SF         |        |            |              |
> -------------------------------------------------
> 2.4.0-pre3 | 66.2   | 60%        | 11%          |
> +ZC        | MB/s   |            |              |
> -------------------------------------------------
> 2.4.0-pre3 | 68     | 8%         | 8%           |
> +ZC SF     | MB/s   |            |              |
> -------------------------------------------------
>
> Just ignore the CPU readings, focus on throughput. And could someone please
> post results?

In the spirit of the socratic method :)

Is your gige card based on Alteon?

How does ZC/SG change the nature of the packets presented to the NIC?

How well does the NIC do with that changed nature?

rick jones

sometimes, performance tuning is like squeezing a balloon. one part gets
smaller, but then you start to see the rest of the balloon...

--
ftp://ftp.cup.hp.com/dist/networking/misc/rachel/
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to email, OR post, but please do NOT do BOTH...
my email address is raj in the cup.hp.com domain...

From owner-netdev@oss.sgi.com Tue Jan 30 17:15:32 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 17:15:22 -0800 Received: from chiara.elte.hu ([157.181.150.200]:50437 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 17:15:08 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 7DA8E186E; Wed, 31 Jan 2001 02:15:06 +0100 (CET) Date: Wed, 31 Jan 2001 02:14:35 +0100 (CET) From: Ingo Molnar Reply-To: To: jamal Cc: Ion Badulescu , Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: Still not sexy!
(Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 220 Lines: 15 On Tue, 30 Jan 2001, jamal wrote: > > - is this UDP or TCP based? (UDP i guess) > > > TCP well then i'd suggest to do: echo 100000 100000 100000 > /proc/sys/net/ipv4/tcp_wmem does this make any difference? Ingo From owner-netdev@oss.sgi.com Tue Jan 30 17:41:03 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 17:40:53 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:49079 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 17:40:36 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id UAA03817; Tue, 30 Jan 2001 20:39:42 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 30 Jan 2001 20:39:42 -0500 (EST) From: jamal To: Ingo Molnar cc: Ion Badulescu , Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 489 Lines: 26 On Wed, 31 Jan 2001, Ingo Molnar wrote: > > On Tue, 30 Jan 2001, jamal wrote: > > > > - is this UDP or TCP based? (UDP i guess) > > > > > TCP > > well then i'd suggest to do: > > echo 100000 100000 100000 > /proc/sys/net/ipv4/tcp_wmem > > does this make any difference? According to my notes, i dont see this. however, 262144 into /proc/sys/net/core/*mem_max/default. I have access to my h/ware this weekend. Hopefully i should get something better than ttcp to use. 
cheers,
jamal

From owner-netdev@oss.sgi.com Tue Jan 30 17:51:03 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 17:50:53 -0800 Received: from shell.cyberus.ca ([209.195.95.7]:51639 "EHLO shell.cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 17:50:52 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id UAA03837; Tue, 30 Jan 2001 20:45:59 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 30 Jan 2001 20:45:59 -0500 (EST) From: jamal To: Rick Jones cc: Ion Badulescu , Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing to dowith ECN) In-Reply-To: <3A77661C.5D7FD4C@cup.hp.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1778 Lines: 53

On Tue, 30 Jan 2001, Rick Jones wrote:

> > ** I reported that there was also an oddity in throughput values,
> > unfortunately since no one (other than me) seems to have access
> > to a gige cards in the ZC list, nobody can confirm or disprove
> > what i posted. Here again as a reminder:
> >
> > Kernel     | tput   | sender-CPU | receiver-CPU |
> > -------------------------------------------------
> > 2.4.0-pre3 | 99MB/s | 87%        | 23%          |
> > NSF        |        |            |              |
> > -------------------------------------------------
> > 2.4.0-pre3 | 86MB/s | 100%       | 17%          |
> > SF         |        |            |              |
> > -------------------------------------------------
> > 2.4.0-pre3 | 66.2   | 60%        | 11%          |
> > +ZC        | MB/s   |            |              |
> > -------------------------------------------------
> > 2.4.0-pre3 | 68     | 8%         | 8%           |
> > +ZC SF     | MB/s   |            |              |
> > -------------------------------------------------
> >
> > Just ignore the CPU readings, focus on throughput. And could someone please
> > post results?
>
> In the spirit of the socratic method :)

;->

>
> Is your gige card based on Alteon?
Yes, sir, it is. To be precise: ** Sender: SMP-PII-450Mhz, ASUS m/board; 3com version of acenic - 1M version ** receiver: same hardware; acenic alteon card - 1M version > How does ZC/SG change the nature of the packets presented to the NIC? what do you mean? I am _sure_ you know how SG/ZC work. So i am suspecting more than socratic view on life here. Could be influence from Aristotle;-> > How well does the NIC do with that changed nature? > Hard question to answer ;-> I havent done any analysis at that level cheers, jamal From owner-netdev@oss.sgi.com Tue Jan 30 18:25:34 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 18:25:25 -0800 Received: from palrel3.hp.com ([156.153.255.226]:43020 "HELO palrel3.hp.com") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 18:25:04 -0800 Received: from tardy.cup.hp.com (tardy.cup.hp.com [15.8.80.176]) by palrel3.hp.com (Postfix) with ESMTP id 6570F110A; Tue, 30 Jan 2001 18:25:02 -0800 (PST) Received: from cup.hp.com (localhost [127.0.0.1]) by tardy.cup.hp.com (8.9.3 (PHNE_18546)/8.9.3 SMKit7.02) with ESMTP id SAA16714; Tue, 30 Jan 2001 18:25:01 -0800 (PST) Message-ID: <3A77777D.E1A998FC@cup.hp.com> Date: Tue, 30 Jan 2001 18:25:01 -0800 From: Rick Jones Organization: the Unofficial HP X-Mailer: Mozilla 4.75 [en] (X11; U; HP-UX B.11.00 9000/785) X-Accept-Language: en MIME-Version: 1.0 To: jamal Cc: Ion Badulescu , Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing todowith ECN) References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1615 Lines: 36 > > How does ZC/SG change the nature of the packets presented to the NIC? > > what do you mean? I am _sure_ you know how SG/ZC work. So i am suspecting > more than socratic view on life here. 
Could be influence from Aristotle;-> Well, I don't know the specifics of Linux, but I gather from what I've read on the list thusfar, that prior to implementing SG support, Linux NIC drivers would copy packets into single contiguous buffers that were then sent to the NIC yes? If so, the implication is with SG going, that copy no longer takes place, and so a chain of buffers is given to the NIC. Also, if one is fully ZC :) pesky things like protocol headers can naturally end-up in separate buffers. So, now you have to ask how well any given NIC follows chains of buffers. At what number of buffers is the overhead in the NIC of following the chains enough to keep it from achieving link-rate? One way to try and deduce that would be to meld some of the SG and preSG behaviours and copy packets into varying numbers of buffers per packet and measure the resulting impact on throughput through the NIC. rick jones As time marches on, the orders of magnitude of the constants may change, but basic concepts still remain, and the "lessons" learned in the past by one generation tend to get relearned in the next :) for example - there is no such a thing as a free lunch... :) -- ftp://ftp.cup.hp.com/dist/networking/misc/rachel/ these opinions are mine, all mine; HP might not want them anyway... :) feel free to email, OR post, but please do NOT do BOTH... my email address is raj in the cup.hp.com domain... From owner-netdev@oss.sgi.com Tue Jan 30 18:52:55 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 18:52:44 -0800 Received: from nero.doit.wisc.edu ([128.104.17.130]:43782 "EHLO nero.doit.wisc.edu") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 18:52:25 -0800 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.8.7/8.8.7) id UAA14296; Tue, 30 Jan 2001 20:52:19 -0600 Date: Tue, 30 Jan 2001 20:52:19 -0600 From: "James R. 
Leu" To: Ben Greear Cc: linux-atm , netdev Subject: Re: packet (ppp) over Sonet in Linux Message-ID: <20010130205219.A14291@doit.wisc.edu> Reply-To: jleu@mindspring.com References: <3A776000.E0CF3D37@mayannetworks.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <3A776000.E0CF3D37@mayannetworks.com>; from bgreear@mayannetworks.com on Tue, Jan 30, 2001 at 05:44:48PM -0700 Organization: none Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 631 Lines: 21 Look at Lucents Optistar products, OC48/12 cards with drivers for Linux. http://www.lucent-optical.com/oan/products/ On Tue, Jan 30, 2001 at 05:44:48PM -0700, Ben Greear wrote: > Is there anyone considering, or already supporting, ppp-over-sonet > or some other packet-over-sonet protocol? > > Could existing ATM NICs/drivers be adapted to support such > protocols? > > Thanks, > Ben > > -- > Ben Greear (bgreear@mayannetworks.com) http://www.mayannetworks.com > NAM Team, Phoenix http://www-internal/~bgreear > Phone: 602-325-2043 Personal Cell Phone: 602-502-6887 -- James R. 
Leu

From owner-netdev@oss.sgi.com Tue Jan 30 19:23:44 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 19:23:35 -0800 Received: from saw.sw.com.sg ([203.120.9.98]:55947 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Tue, 30 Jan 2001 19:23:25 -0800 Received: (qmail 24202 invoked by uid 577); 31 Jan 2001 03:23:17 -0000 Message-ID: <20010131112317.B24045@saw.sw.com.sg> Date: Wed, 31 Jan 2001 11:23:17 +0800 From: Andrey Savochkin To: Andrew Morton , Andi Kleen Cc: netdev@oss.sgi.com Subject: Re: More measurements References: <3A75785A.42B9E7CE@uow.edu.au>, <3A75785A.42B9E7CE@uow.edu.au>; <20010130124839.24151@colin.muc.de> <3A76C5DE.993BB140@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: <3A76C5DE.993BB140@uow.edu.au>; from "Andrew Morton" on Wed, Jan 31, 2001 at 12:47:10AM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1351 Lines: 38

Hi,

On Wed, Jan 31, 2001 at 12:47:10AM +1100, Andrew Morton wrote:
> Andi Kleen wrote:
> >
> > On Tue, Jan 30, 2001 at 10:08:02AM +0100, Andrew Morton wrote:
> > > Lots of interesting things here.
> > >
> > > - eepro100 generates more interrupts doing TCP Tx, but not
> > >   TCP Rx. I assume it doesn't do Tx mitigation?
> >
> > The Intel driver (e100.c) uploads special firmware and does it for RX and TX.
> > eepro100 doesn't. Perhaps you could measure that driver too?

That's about RX mitigation, where special hardware assistance is required.
For TX interrupts, the current driver arbitrarily asks for a TX interrupt
every fourth packet (to do garbage collection).

> Sure. Anyone have a URL for intel's driver?

ftp://download.intel.com/support/network/adapter/pro100/
e100-*.tgz

>
> > >
> > > - Changing eepro100 to use IO operations instead of MMIO slows
> > >   down this dual 500MHz machine by less than one percent at
> > >   100 mbps. At 12,000 interrupts per second. Why all the fuss
> > >   about MMIO?
> > > > iirc Ingo at some point found at some monster machine that the IO operations > > in the eepro100 interrupt handler dominated some Tux profile. The same was reported by kumon@flab.fujitsu.co.jp I think that serialization effects of IO instructions lead to them staying at the top of profile. Best regards Andrey From owner-netdev@oss.sgi.com Tue Jan 30 22:02:06 2001 Received: by oss.sgi.com id ; Tue, 30 Jan 2001 22:01:56 -0800 Received: from Overkill.EnterZone.Net ([66.35.65.2]:29296 "EHLO Overkill.EnterZone.Net") by oss.sgi.com with ESMTP id ; Tue, 30 Jan 2001 22:01:41 -0800 Received: from localhost (atm@localhost) by Overkill.EnterZone.Net (8.11.0/8.11.0) with ESMTP id f0V613c28535; Wed, 31 Jan 2001 01:01:03 -0500 Date: Wed, 31 Jan 2001 01:01:03 -0500 (EST) From: John Fraizer To: "James R. Leu" cc: Ben Greear , linux-atm , netdev Subject: Re: packet (ppp) over Sonet in Linux In-Reply-To: <20010130205219.A14291@doit.wisc.edu> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 781 Lines: 29 Anyone know what the streetprice for those two are though? On Tue, 30 Jan 2001, James R. Leu wrote: > Look at Lucents Optistar products, OC48/12 cards with drivers for Linux. > > http://www.lucent-optical.com/oan/products/ > > On Tue, Jan 30, 2001 at 05:44:48PM -0700, Ben Greear wrote: > > Is there anyone considering, or already supporting, ppp-over-sonet > > or some other packet-over-sonet protocol? > > > > Could existing ATM NICs/drivers be adapted to support such > > protocols? > > > > Thanks, > > Ben > > > > -- > > Ben Greear (bgreear@mayannetworks.com) http://www.mayannetworks.com > > NAM Team, Phoenix http://www-internal/~bgreear > > Phone: 602-325-2043 Personal Cell Phone: 602-502-6887 > > -- > James R. 
Leu > From owner-netdev@oss.sgi.com Wed Jan 31 03:22:39 2001 Received: by oss.sgi.com id ; Wed, 31 Jan 2001 03:22:29 -0800 Received: from oxmail4.ox.ac.uk ([163.1.2.33]:10447 "EHLO oxmail.ox.ac.uk") by oss.sgi.com with ESMTP id ; Wed, 31 Jan 2001 03:22:06 -0800 Received: from sable.ox.ac.uk ([163.1.2.4]) by oxmail.ox.ac.uk with esmtp (Exim 3.12 #1) id 14NvKc-0002z0-00; Wed, 31 Jan 2001 11:21:46 +0000 Received: from mbeattie by sable.ox.ac.uk with local (Exim 3.13 #1) id 14NvKb-0000iD-00; Wed, 31 Jan 2001 11:21:45 +0000 Date: Wed, 31 Jan 2001 11:21:45 +0000 From: Malcolm Beattie To: Ingo Molnar Cc: jamal , Ion Badulescu , Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) Message-ID: <20010131112145.A13345@sable.ox.ac.uk> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: ; from mingo@elte.hu on Wed, Jan 31, 2001 at 02:14:35AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2459 Lines: 58 Ingo Molnar writes: > > On Tue, 30 Jan 2001, jamal wrote: > > > > - is this UDP or TCP based? (UDP i guess) > > > > > TCP > > well then i'd suggest to do: > > echo 100000 100000 100000 > /proc/sys/net/ipv4/tcp_wmem > > does this make any difference? For the last week I've been benchmarking Linux network and I/O on a couple of machines with 3c985 gigabit cards and some other stuff (see below). One of the things I tried yesterday was a beta test version of a secure ftpd written by Chris Evans which happens to use sendfile() making it a convenient extra benchmark. I'd already put net.core.{r,w}mem_max up to 262144 for the sake of gensink and other benchmarks which raise SO_{SND,RCV}BUF. I hadn't however, tried raising tcp_wmem as per your suggestion above. Currently the systems are linked back to back with fibre with jumbo frames (MTU 9000) on and running pure kernel 2.4.1. 
I transferred a 300 MByte file repeatedly from the server to the client with an ftp "get" client-side. The file will have been completely in page cache on the server (both machines have 512MB RAM) and was written to /dev/null on the client side. (Yes, I checked the client was doing ordinary read/write and not throwing it away). Without the raised tcp_wmem setting I was getting 81 MByte/s. With tcp_wmem set as above I got 86 MByte/s. Nice increase. Any other setting I can tweak apart from {r,w}mem_max and tcp_{w,r}mem? The CPU on the client (350 MHz PII) is the bottleneck: gensink4 maxes out at 69 Mbyte/s pulling TCP from the server and 94 Mbyte/s pushing. (The other system, 733 MHz PIII pushes >100MByte/s UDP with ttcp but the client drops most of it). I'll be following up Dave Miller's "please benchmark zerocopy" request when I've got some more numbers written down since I've only just put the zerocopy patch in and haven't rebooted yet. If anyone wants any other specific benchmarks done (I/O or network) I may get some time to do them: the PIII system has an 8-port Escalade card with 8 x 46GB disks (117 MByte/s block writes as measured by Bonnie on a RAID1/0 mixed RAIDset) and there are also four dual-port eepro fast ethernet cards, a Cisco 8-port 3508G gigabit switch and a 24-port 3524 fast ethernet switch (gigastack linked to the 3508G). I'm benchmarking and looking into the possibility of a DIY NAS or SAN-type thing. 
--Malcolm -- Malcolm Beattie Unix Systems Programmer Oxford University Computing Services From owner-netdev@oss.sgi.com Wed Jan 31 03:25:59 2001 Received: by oss.sgi.com id ; Wed, 31 Jan 2001 03:25:50 -0800 Received: from chiara.elte.hu ([157.181.150.200]:4102 "HELO chiara.elte.hu") by oss.sgi.com with SMTP id ; Wed, 31 Jan 2001 03:25:37 -0800 Received: by chiara.elte.hu (Postfix, from userid 17806) id 5F585186D; Wed, 31 Jan 2001 12:25:34 +0100 (CET) Date: Wed, 31 Jan 2001 12:24:53 +0100 (CET) From: Ingo Molnar Reply-To: To: Malcolm Beattie Cc: jamal , Ion Badulescu , Andrew Morton , lkml , "netdev@oss.sgi.com" Subject: Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) In-Reply-To: <20010131112145.A13345@sable.ox.ac.uk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 835 Lines: 19 On Wed, 31 Jan 2001, Malcolm Beattie wrote: > Without the raised tcp_wmem setting I was getting 81 MByte/s. With > tcp_wmem set as above I got 86 MByte/s. Nice increase. Any other > setting I can tweak apart from {r,w}mem_max and tcp_{w,r}mem? The CPU > on the client (350 MHz PII) is the bottleneck: gensink4 maxes out at > 69 Mbyte/s pulling TCP from the server and 94 Mbyte/s pushing. (The > other system, 733 MHz PIII pushes >100MByte/s UDP with ttcp but the > client drops most of it). you can speed up the client significantly by using the MSG_TRUNC option ('truncate message'). It will zap incoming data without copying it into user-space. (you can use this for the 'bulk transfer' part - the initial protocol handling code needs to see the actual data.) This way you should be able to saturate the server even more. 
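A minimal sketch of the MSG_TRUNC idea, assuming a Linux TCP socket and Python's standard socket module. The names drain_discarding and demo are invented for this example, and the loopback sender merely stands in for the real server. With MSG_TRUNC set, recv() on a TCP socket reports how many bytes were consumed without copying them into the caller's buffer.

```python
import socket
import threading


def drain_discarding(conn, chunk=65536):
    """Consume a TCP stream until EOF, asking the kernel to drop the payload.

    Hypothetical helper: `scratch` is only there to satisfy the API, since
    MSG_TRUNC on Linux TCP tells the kernel not to copy data into it.
    """
    scratch = bytearray(chunk)
    total = 0
    while True:
        n = conn.recv_into(scratch, chunk, socket.MSG_TRUNC)
        if n == 0:          # peer closed the connection
            return total
        total += n


def demo(payload_size=1 << 20):
    """Send `payload_size` bytes over loopback and count what the sink consumed."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    def serve():
        conn, _ = listener.accept()
        with conn:
            conn.sendall(b"x" * payload_size)

    sender = threading.Thread(target=serve)
    sender.start()
    with socket.create_connection(listener.getsockname()) as client:
        received = drain_discarding(client)
    sender.join()
    listener.close()
    return received


if __name__ == "__main__":
    print("bytes consumed:", demo())
```

As the message says, this only suits the bulk-transfer phase; any protocol negotiation before it still needs ordinary reads that see the data.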
Ingo From owner-netdev@oss.sgi.com Wed Jan 31 07:27:41 2001 Received: by oss.sgi.com id ; Wed, 31 Jan 2001 07:27:31 -0800 Received: from oxmail2.ox.ac.uk ([163.1.2.1]:37066 "EHLO oxmail.ox.ac.uk") by oss.sgi.com with ESMTP id ; Wed, 31 Jan 2001 07:27:15 -0800 Received: from sable.ox.ac.uk ([163.1.2.4]) by oxmail.ox.ac.uk with esmtp (Exim 3.12 #1) id 14Nz9q-00042y-00; Wed, 31 Jan 2001 15:26:54 +0000 Received: from mbeattie by sable.ox.ac.uk with local (Exim 3.13 #1) id 14Nz9q-00006b-00; Wed, 31 Jan 2001 15:26:54 +0000 Date: Wed, 31 Jan 2001 15:26:54 +0000 From: Malcolm Beattie To: "David S. Miller" Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [UPDATE] Fresh zerocopy patch on kernel.org Message-ID: <20010131152653.C13345@sable.ox.ac.uk> References: <14966.35438.429963.405587@pizda.ninka.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <14966.35438.429963.405587@pizda.ninka.net>; from davem@redhat.com on Tue, Jan 30, 2001 at 01:33:34AM -0800 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1680 Lines: 47 David S. Miller writes: > > At the usual place: > > ftp://ftp.kernel.org/pub/linux/kernel/people/davem/zerocopy-2.4.1-1.diff.gz Hmm, disappointing results here; maybe I've missed something. Setup is a Pentium II 350MHz (tusk) connected to a Pentium III 733MHz (heffalump) (both 512MB RAM) with SX fibre, each with a 3Com 3C985 NIC. Kernels compared are 2.4.1 and 2.4.1+zc (the 2.4.1-1 diff above) using acenic driver with MTU set to 9000. Sysctls set are

  # Raise socket buffer limits
  net.core.rmem_max = 262144
  net.core.wmem_max = 262144
  # Increase TCP write memory
  net.ipv4.tcp_wmem = 100000 100000 100000

on both sides. Comparison tests done were

  gensink4: 10485760 (10MB) buffer size, 262144 (256K) socket buffer
  ftp: server does sendfile() from a 300MB file in page cache, client does read from socket/write to /dev/null in 4K chunks.
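For readers who want to reproduce the shape of the ftp test without the ftpd, here is a rough sketch, assuming Linux and Python's standard library. It is not the secure ftpd or gensink4; the function names, the loopback address and the 1 MB default size are all made up for illustration. The server side hands a file to socket.sendfile(), and the client reads in 4K chunks and writes them to /dev/null.

```python
import os
import socket
import tempfile
import threading


def serve_file_once(listener, path):
    """Accept one connection and push the whole file with sendfile()."""
    conn, _ = listener.accept()
    with conn, open(path, "rb") as f:
        conn.sendfile(f)  # uses os.sendfile() where the platform provides it


def fetch_to_devnull(address, chunk=4096):
    """Read the stream in `chunk`-byte pieces and write them to /dev/null."""
    total = 0
    with socket.create_connection(address) as sock, open(os.devnull, "wb") as sink:
        while True:
            data = sock.recv(chunk)
            if not data:
                return total
            sink.write(data)
            total += len(data)


def transfer(size=1 << 20):
    """Hypothetical driver: create a file of `size` bytes and move it over loopback."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * size)
        path = tmp.name
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server = threading.Thread(target=serve_file_once, args=(listener, path))
    server.start()
    try:
        return fetch_to_devnull(listener.getsockname())
    finally:
        server.join()
        listener.close()
        os.unlink(path)


if __name__ == "__main__":
    print("bytes transferred:", transfer())
```

Timing transfer() over a real link, with a much larger file, would give a figure comparable in kind (not in value) to the KByte/s numbers reported for the ftp row.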
                        2.4.1                          2.4.1+zc
                        KByte/s  tusk%CPU  heff%CPU    KByte/s  tusk%CPU  heff%CPU
gensink4
  tusk->heffalump       94000    58-100    93          54000    98-102    11-45
  heffalump->tusk       72000    86-100    46-59       70000    71-93     53-71

                        2.4.1    2.4.1+zc
                        KByte/s  KByte/s
ftp
  heffalump->tusk       86000    62000

I was impressed with the raw 2.4.1 figures and hoped to be even more impressed with the 2.4.1+zc numbers. Is there something I'm missing or can change or do to help to improve matters or track down potential problems? --Malcolm -- Malcolm Beattie Unix Systems Programmer Oxford University Computing Services From owner-netdev@oss.sgi.com Wed Jan 31 11:15:03 2001 Received: by oss.sgi.com id ; Wed, 31 Jan 2001 11:14:53 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:37899 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Wed, 31 Jan 2001 11:14:39 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA01230; Wed, 31 Jan 2001 22:13:47 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200101311913.WAA01230@ms2.inr.ac.ru> Subject: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN) To: andrewm@uow.edu.au (Andrew Morton) Date: Wed, 31 Jan 2001 22:13:47 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: <3A775C12.DEB372A5@uow.edu.au> from "Andrew Morton" at Jan 31, 1 11:28:02 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1552 Lines: 41 Hello! > Well, it's an interesting problem. How do we > define "system load"? But you defined it absolutely correctly! System load is non-sense. Idle time is some meaningful quantity. Ticks remaining to a process doing some work, not related to work under measurement. I use profiler to measure it. Your approach is more precise and more aggressive. Probably, it is even too aggressive to get good numbers, but it is surely closer to real life. > May I ask: have you tried to do any quantitative performance > testing?
My approach in before-zerocopy time was much simpler. Just saturate obscure Ys, but one, and look what remaining Y is. Particularly, load must be 100%, no idle cycles, no cycles spent on useless work. After this you measure throughput. It is some stable number. This approach failed with zerocopy, 100% load is not so easy to achieve. Luckily, we still have valid throughput number with gige, which is not saturated. Actually, I suspect tcp (driver? scheduler?) still has some miraculous bug, look e.g. at note of 001022 in README.zerocopy-sendfile. Probably, your measurement will show 100% load in this case, because all idle time can happen to be at entry-exit from idle task. As for nfs, I not only did not make any measurements, I even compiled it only once to check that no silly misprints are made there. (after Dave found that the first patch happened to swap arguments of one function. 8)) Pure saving one copy and not-0 order allocation does not require any measurements to prove its importance. Alexey From owner-netdev@oss.sgi.com Wed Jan 31 17:53:40 2001 Received: by oss.sgi.com id ; Wed, 31 Jan 2001 17:53:21 -0800 Received: from pizda.ninka.net ([216.101.162.242]:21141 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Wed, 31 Jan 2001 17:53:14 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id RAA17044; Wed, 31 Jan 2001 17:51:50 -0800 From: "David S.
Miller" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14968.49462.674977.825098@pizda.ninka.net> Date: Wed, 31 Jan 2001 17:51:50 -0800 (PST) To: Malcolm Beattie Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [UPDATE] Fresh zerocopy patch on kernel.org In-Reply-To: <20010131152653.C13345@sable.ox.ac.uk> References: <14966.35438.429963.405587@pizda.ninka.net> <20010131152653.C13345@sable.ox.ac.uk> X-Mailer: VM 6.75 under 21.1 (patch 13) "Crater Lake" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 514 Lines: 19 Malcolm Beattie writes: > David S. Miller writes: > > > > At the usual place: > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/davem/zerocopy-2.4.1-1.diff.gz > > Hmm, disappointing results here; maybe I've missed something. As discussed elsewhere there is a 10% to 15% performance hit for normal write()s done with the new code. If you do your testing using sendfile() as the data source, your results ought to be wildly different and more encouraging. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Wed Jan 31 22:01:22 2001 Received: by oss.sgi.com id ; Wed, 31 Jan 2001 22:01:13 -0800 Received: from deliverator.sgi.com ([204.94.214.10]:21515 "EHLO deliverator.sgi.com") by oss.sgi.com with ESMTP id ; Wed, 31 Jan 2001 22:00:56 -0800 Received: from dhcp-163-154-5-240.engr.sgi.com (dhcp-163-154-5-240.engr.sgi.com [163.154.5.240]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via ESMTP id VAA23554 for ; Wed, 31 Jan 2001 21:59:56 -0800 (PST) mail_from (ralf@oss.sgi.com) Received: (ralf@lappi.waldorf-gmbh.de) by bacchus.dhis.org id ; Wed, 31 Jan 2001 21:56:13 -0800 Date: Wed, 31 Jan 2001 21:56:03 -0800 From: Ralf Baechle To: Jamie Lokier Cc: Andi Kleen , "Albert D.
Cahalan" , John Fremlin , linux-kernel@vger.kernel.org, netdev@oss.sgi.com, paulus@linuxcare.com, linux-ppp@vger.kernel.org, linux-net@vger.kernel.org Subject: Re: [PATCH] dynamic IP support for 2.4.0 (SIOCKILLADDR) Message-ID: <20010131215602.K874@bacchus.dhis.org> References: <200101290245.f0T2j2Y438757@saturn.cs.uml.edu> <20010129135905.B1591@fred.local> <20010129193136.A11035@pcep-jamie.cern.ch> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20010129193136.A11035@pcep-jamie.cern.ch>; from ln@tantalophile.demon.co.uk on Mon, Jan 29, 2001 at 07:31:36PM +0100 X-Accept-Language: de,en,fr Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 425 Lines: 12 On Mon, Jan 29, 2001 at 07:31:36PM +0100, Jamie Lokier wrote: > Unfortunately getting the same IP is rare now, so I've been toying with Pretty much dependent on the type of equipment and the configuration used at the ISP's servers. I use two ISPs when I'm back in Germany, of which one always and the other never gives me the same IP when I reconnect within a short time. (Guess which one I prefer ...) Ralf