From owner-netdev@oss.sgi.com Thu Jun 1 03:19:30 2000
Date: Thu, 1 Jun 2000 14:17:09 +0300 (EET DST)
From: Aki M Laukkanen
To: kuznet@ms2.inr.ac.ru
cc: netdev@oss.sgi.com, iwtcp@cs.helsinki.fi
Subject: Re: recent TCP changes adversive on slow links
In-Reply-To: <200005311813.WAA23306@ms2.inr.ac.ru>

On Wed, 31 May 2000 kuznet@ms2.inr.ac.ru wrote:
> It is difficult to believe that the problem is here, even if this
> change kills the effect. Actually, I am even not sure, that this

It seems in retrospect that this report was a wee bit hasty, sorry. Pre3
didn't exhibit this behaviour and the only relevant change seemed to be
this. I was a bit too quick and didn't think it through thoroughly.

go to sleep check:
        return atomic_read(&sk->wmem_alloc) < sk->sndbuf;

wakeup condition:
        if (sock_wspace(sk) >= tcp_min_write_space(sk) &&

These heuristics were masked because of the over-scheduling, but I don't
see how this could account for what was seen in the tcpdumps. It seems, as
you remark, as if the transmit queue length was huge. But we used ppp,
which has a tx_queue_len of three.

So it seems something further down is affecting the behaviour indirectly.
That seems to be our own fault (will do tests). The link used was
really an emulated link (sorry, software not publicly available and
it's still being beta-tested).
The basic principle is to catch the PPP
stream with a pseudo tty and from there on we can mess with packets (in
principle like dummynet). We are aware of the gotchas this implies and the
emulator implements flow control but apparently in those tests it wasn't
turned on.

The only thing I cannot put my finger on is why this change made the
dumps look like they did.

From owner-netdev@oss.sgi.com Thu Jun 1 07:47:16 2000
Message-ID: <39368559.AE57D9C0@anr.mcnc.org>
Date: Thu, 01 Jun 2000 11:46:33 -0400
From: Phoemphun Oothongsap
To: netdev@oss.sgi.com
Subject: Re: sendto problem

Hi

- I tried the same experiment with SunOS. Instead of sending a UDP packet
  to a Linux machine, I sent a UDP packet to a Sun machine. But the result
  is different. There is no connection refused error. Every packet goes
  through. Why?

- If the system receives an ICMP unreachable message, will the kernel
  actually deliver it to the UDP port I am listening on?

Regards,
Phoemphun

Chandru Sargor wrote:
>
> Phoemphun,
>
> Is there someone listening to the UDP port you are sending
> to at the destination? Otherwise, the remote system will send
> an ICMP error back to your machine and the next time u
> try to send a packet, u will get connection refused.... probably
> it resets the error status each time - so alternate packets
> can go thru.
>
> it would be helpful if you provided more info on what u are
> trying to do.
>
> hope this helps...
>
> chandru
> -------------------------
>
> Phoemphun Oothongsap wrote:
>
> > Hi
> >
> > I am using Red Hat 6.2. I have a problem with using "sendto" command.
> > I try to send a udp packet to different three machines. The packets are
> > sent back to back. The first packet can get through but the second
> > packet gets the Connection refused, the third packet is ok. I get the
> > alternated result between connection refused and successful packet.
> >
> > This is a code snippet :
> >
> >       :
> >       :
> >    while(1) {
> >
> >       for(i=1;i<=3;i++) {
> >          ret = sendto(sock,(char *)&packet,
> >                       length,0,
> >                       (struct sockaddr*)&neighborAddr[i],
> >                       addrlen);
> >       }
> >       :
> >       :
> >    }
> >       :
> >       :
> >
> > Do anybody have an idea how to fix this problem?
> >
> > Regards,
> >
> > Phoemphun Oothongsap
> > ____________________________________________________________________________
> > ANR Linux NetDev Exploder.
> > ____________________________________________________________________________

From owner-netdev@oss.sgi.com Thu Jun 1 08:43:03 2000
Message-Id: <3936924D.D0F319FE@lsil.com>
Date: Thu, 01 Jun 2000 11:41:49 -0500
From: Noah Romer
Reply-To: noah.romer@lsil.com
To: linux-net@vger.rutgers.edu, netdev@oss.sgi.com
CC: "Romer, Noah"
Subject: net_rx_action, ptype_all and dropped packets

Here's the setup, I'm writing a LAN driver for a Fibre Channel host adapter
and am running into a slight problem w/ received packets. I can transmit
just fine, at least ARP's, I can't get past that due to the fact that all
incoming packets get dropped by net_rx_action (in net/core/dev.c). As far
as I can tell, the necessary ptype_all values aren't getting set, so
net_rx_action just drops them (I have confirmed that it's making the
kfree_skb call found at line 1194 of dev.c).

My question is: where, when and how should the setup for FC devices be done
in the ptype_all array? I've got fibre channel support turned on, and all
outgoing packets appear to meet the spec in RFC 2625. I've tried turning on
the experimental LLC support (which is only listed for X.25) since it's the
only way to compile net/802/p8022.c, and that looked related (what w/ the
FC stuff I've found claiming ptype of ETH_P_8022), but that didn't do any
good.

Thanks,
Noah Romer

From owner-netdev@oss.sgi.com Thu Jun 1 09:48:53 2000
Message-Id: <200006011748.NAA03922@solidum.com>
To: netdev@oss.sgi.com
Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???
In-Reply-To: Your message of "Wed, 31 May 2000 20:26:28 EDT."
Date: Thu, 01 Jun 2000 13:48:18 -0400
From: Michael Richardson

>>>>> "jamal" == jamal writes:

jamal> Your architecture of maintaining a device per VLAN does not scale;
jamal> (as you might have heard from your numerous attempts to change device
jamal> lookups).

jamal> What is the specific reason that you insist on mapping a VLAN to a device?

It is a nice abstraction. It has known interfaces (ifconfig, netstat, route).

jamal> Have you thought of using a VLAN lookup table instead?

I agree... But, what do the interfaces look like for this?

jamal> cheers,
jamal> jamal

jamal> I am only asking because i think that sooner than later we need to have
jamal> 802.1p/q in the kernel and your current scheme is problematic.
jamal> BTW, it seems there is another 802.1p/q project at sourceforge;

From owner-netdev@oss.sgi.com Thu Jun 1 10:42:34 2000
From: kuznet@ms2.inr.ac.ru
Message-Id: <200006011829.WAA04811@ms2.inr.ac.ru>
Subject: Re: recent TCP changes adversive on slow links
To: amlaukka@cc.helsinki.fi (Aki M Laukkanen)
Date: Thu, 1 Jun 2000 22:29:25 +0400 (MSK DST)
Cc: netdev@oss.sgi.com, iwtcp@cs.helsinki.fi

Hello!

> you remark, as if the transmit queue length was huge. But we used ppp
> which has tx_queue_len of three.

This is impossible. In this case you would never see bursts of more than
4 packets. The 4th one is dropped, which suspends transmission.

> So it seems something further down is affecting the behaviour indirectly.
> That seems to be our own fault (will do tests). The link used was
> really an emulated link (sorry, software not publicly available and
> it's still being beta-tested). The basic principle is to catch the PPP
> stream with a pseudo tty and from there on we can mess with packets (in
> principle like dummynet). We are aware of the gotchas this implies and the
> emulator implements flow control but apparently in those tests it wasn't
> turned on.

It is possible then. Apparently, you broke flow control on your device.
On devices without flow control (loopback is the best example)
bursts are infinite, of course, and controlled only by TCP window
and congestion avoidance.

Alexey

From owner-netdev@oss.sgi.com Thu Jun 1 17:18:13 2000
Date: Thu, 1 Jun 2000 21:06:14 +0100 (BST)
From: Riley Williams
To: Linux Maintainers
Subject: Email address test

Hi there.

This is a test message to confirm that all addresses in the Linux
kernel maintainers list are still current. It would be appreciated if
you could reply to this message confirming receipt hereof.

Best wishes from Riley.

From owner-netdev@oss.sgi.com Thu Jun 1 21:26:03 2000
Message-ID: <39373FE7.AA70C85B@candelatech.com>
Date: Thu, 01 Jun 2000 22:02:31 -0700
From: Ben Greear
To: "netdev@oss.sgi.com"
Subject: Ethernet lockup (Linksys, Tulip)

This happened after I pulled the ethernet link (and re-inserted). The
other end of the cable went into a no-name 10/100 Etherswitch.

The card is a LinkSys (part# LNE100Tx), here's part of the kernel boot.

Jun 1 18:07:36 candle kernel: tulip.c:v0.91g 7/16/99 becker@cesdis.gsfc.nasa.gov
Jun 1 18:07:36 candle kernel: eth0: Lite-On 82c168 PNIC rev 32 at 0xe800, 00:A0:CC:61:1E:C5, IRQ 9.
Jun 1 18:07:36 candle kernel: eth0: MII transceiver #1 config 3000 status 7829 advertising 01e1.
Jun 1 18:07:36 candle kernel: eth1: Lite-On PNIC-II rev 37 at 0xe400, 00:A0:CC:E5:4A:48, IRQ 5.
Jun 1 18:07:36 candle kernel: rtl8139.c:v1.07 5/6/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/rtl8139.html
Jun 1 18:07:36 candle kernel: eth2: SMC1211TX EZCard 10/100 (RealTek RTL8139) at 0xe000, IRQ 11, 00:e0:29:70:44:9e.
....
Jun 1 18:28:10 candle kernel: eth1: Transmit timed out, status ec90a910, CSR12 45e1d0cc, resetting...
Jun 1 18:28:15 candle kernel: eth1: Transmit timed out, status ec92a910, CSR12 45e1d0cc, resetting...
Jun 1 18:28:50 candle last message repeated 7 times
Jun 1 18:29:55 candle last message repeated 13 times

This seems to have 'fixed' it:
ip link set eth1 down
ip link set eth1 up

This is not reproducible, at least not easily (I just tried pulling the link several
times, and it's working fine...)

Anyone got any ideas?

Thanks,
Ben

--
Ben Greear (greearb@candelatech.com) http://www.candelatech.com
Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL)
http://scry.wanfear.com http://scry.wanfear.com/~greear

From owner-netdev@oss.sgi.com Fri Jun 2 04:46:35 2000
To: netdev@oss.sgi.com
From: "Anders K. Pedersen"
Newsgroups: akp.lists.linux.netdev
Subject: Re: Ethernet lockup (Linksys, Tulip)
Date: Fri, 02 Jun 2000 13:43:55 +0200
Message-ID: <39379DFB.78686F0B@akp.dk>
References: <39373FE7.AA70C85B@candelatech.com>

Ben Greear wrote:
> This happened after I pulled the ethernet link (and re-inserted). The
> other end of the cable went into a no-name 10/100 Etherswitch.
...
> Jun 1 18:28:10 candle kernel: eth1: Transmit timed out, status ec90a910, CSR12 45e1d0cc, resetting...
> Jun 1 18:28:15 candle kernel: eth1: Transmit timed out, status ec92a910, CSR12 45e1d0cc, resetting...
> Jun 1 18:28:50 candle last message repeated 7 times
> Jun 1 18:29:55 candle last message repeated 13 times
>
> This seems to have 'fixed' it:
> ip link set eth1 down
> ip link set eth1 up
>
> This is not reproducible, at least not easily (I just tried pulling the link several
> times, and it's working fine...)

This looks like a problem I had with a 3c590 card recently. When
tulip_tx_timeout() is called from tulip_start_xmit(), dev->tbusy is
never cleared, and whenever tulip_start_xmit() is called after this, it
will call tulip_tx_timeout() as dev->tbusy remains set. You could try
adding the line "clear_bit(0, (void*)&dev->tbusy);" to the bottom of
tulip_tx_timeout().

Regards,
Anders K. Pedersen

From owner-netdev@oss.sgi.com Fri Jun 2 04:51:56 2000
Date: Fri, 2 Jun 2000 13:52:15 +0200
From: "Andi Kleen"
To: noah.romer@lsil.com
Cc: linux-net@vger.rutgers.edu, netdev@oss.sgi.com
Subject: Re: net_rx_action, ptype_all and dropped packets
Message-ID: <20000602135215.B22139@gruyere.muc.suse.de>
References: <3936924D.D0F319FE@lsil.com>
In-Reply-To: <3936924D.D0F319FE@lsil.com>; from nromer@lsil.com on Thu, Jun 01, 2000 at 11:41:49AM -0500

On Thu, Jun 01, 2000 at 11:41:49AM -0500, Noah Romer wrote:
> Here's the setup, I'm writing a LAN driver for a Fibre Channel host adapter
> and am running into a slight problem w/ received packets. I can transmit
> just fine, at least ARP's, I can't get past that due to the fact that all
> incoming packets get dropped by net_rx_action (in net/core/dev.c). As far
> as I can tell, the necessary ptype_all values aren't getting set, so
> net_rx_action just drops them (I have confirmed that it's making the
> kfree_skb call found at line 1194 of dev.c).

You need to setup skb->protocol and skb->pkt_type correctly in your network
driver before you pass up to netif_rx.
How you do that is up to you, most ethernet drivers use the
standard eth_type_trans() function for that.

>
> My question is: where, when and how should the setup for FC devices be done
> in the ptype_all array? I've got fibre channel support turned on, and all
> outgoing packets appear to meet the spec in RFC 2625. I've tried turning on
> the experimental LLC support (which is only listed for X.25) since it's the
> only way to compile net/802/p8022.c, and that looked related (what w/ the
> FC stuff I've found claiming ptype of ETH_P_8022), but that didn't do any
> good.

The LLC layer in the current tree is broken and won't do much good. If you
really need working 802.2 (as opposed to plain 802.3) you could coordinate
with the Linux/SNA project, they apparently have fixed it

-Andi

P.S.: Don't cross post to multiple lists like that. It's rude.

From owner-netdev@oss.sgi.com Fri Jun 2 05:08:05 2000
Date: Fri, 2 Jun 2000 08:06:04 -0400 (EDT)
From: jamal
To: Ben Greear
cc: netdev@oss.sgi.com, buytenh@gnu.org, gleb@tochna.technion.ac.il
Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???
In-Reply-To: <3935E898.83674015@candelatech.com>

On Wed, 31 May 2000, Ben Greear wrote:

> jamal wrote:
> >
> > Ben,
> >
> > Your architecture of maintaining a device per VLAN does not scale;
> > (as you might have heard from your numerous attempts to change device
> > lookups).
>
> I have no intention of changing device lookups. I did write my own method
> of naming them, but that's quite trivial.
>
> > What is the specific reason that you insist on mapping a VLAN to a device?
> > Have you thought of using a VLAN lookup table instead?
>
> I do use a vlan lookup table, did you think I do a linear search through
> the vlans for incoming pkts or something??? As far as I recall, every
> lookup involved in any critical path is constant time. That is one of
> the reasons I believe my implementation will scale better than the
> other project... (could be wrong...)

I just scanned through the other code and this is _exactly_ what i had in
mind. ccing the authors in case they are not on the list.
Their scanning algorithm certainly needs improvement

> A VLAN should look just like an ethernet port. I **WANT** it to look
> exactly like a device, so why not actually make it one?? I want to
> use tcpdump on it,

Modify tcpdump to do Vlan recognition (if it doesnt already).

> I want to route, firewall, bridge and do everything
> else that a device would do.

VLANs are primarily for switching. Fair enough that you want to route
them. But you can certainly still do this with policy routing for example.
The only place i see routing taking effect is in the boundary between
VLANs and non-VLAN domains.
Out of ignorance: Is there anyone (vendor) who does routing of VLANs?
firewalling: I believe the bridging folks are trying to move
netfilter/ipchains down there; i dont know what the progress status is.

> How can that be done if the VLAN is not
> a device, and if it could be done, why would it be any more efficient?
>

Again from the above, you really dont need a device.
Really, take a look at James Leu's MPLS which is a similar (introduces the
extra shim headers etc) but a more complex issue. He doesnt introduce any
new devices.

Use a table or whatever structure (radix tree etc); allow search by
VLANid, priority etc;
With these you are free to put whatever search schemes, naming conventions
etc that you want instead of subjecting the rest of the kernel to
something that you need for your feature.

In any case when there are two solutions, they need to get unified before
making it into the kernel proper.

On Thu, 1 Jun 2000, Michael Richardson wrote:

>
> >>>>> "jamal" == jamal writes:
> jamal> What is the specific reason that you insist on mapping a VLAN
>
> It is a nice abstraction. It has known interfaces (ifconfig, netstat,route).
>

hope my comments above answer this.

On Wed, 31 May 2000, Mitchell Blank Jr wrote:

> Is it just impossible to make this scale in 2.5? There are other things
> which could require large numbers of network devices (like large-scale
> PPPoA/PPPoE termination), it would be nice to support them.
>

1) Fix pppd. Paulus is planning on a total re-write (i think you
were copied on that email)
2) allow for multiple channels per device. With the new pppox
architecture, you can then have a single socket selecting/polling for
multiple circuits.

cheers,
jamal

From owner-netdev@oss.sgi.com Fri Jun 2 06:13:16 2000
Message-ID: <3937B2B3.3A939010@nbase.co.il>
Date: Fri, 02 Jun 2000 13:12:19 +0000
From: Gleb Natapov
To: jamal
CC: Ben Greear, netdev@oss.sgi.com, buytenh@gnu.org, gleb@tochna.technion.ac.il
Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???

jamal wrote:
> [snip]
> >
> > I do use a vlan lookup table, did you think I do a linear search through
> > the vlans for incoming pkts or something??? As far as I recall, every
> > lookup involved in any critical path is constant time. That is one of
> > the reasons I believe my implementation will scale better than the
> > other project... (could be wrong...)
>
> I just scanned through the other code and this is _exactly_ what i had in
> mind. ccing the authors in case they are not on the list.
> Their scanning algorithm certainly needs improvement

We use a hash table currently. Why do you think it is inefficient? How many
vlans are you expecting to have on each device? You can tune performance by
changing the hash table size if you wish. Anyway, it is very easy to change
the ADT from a hash table to something else. What do you suggest?

>
> > A VLAN should look just like an ethernet port. I **WANT** it to look
> > exactly like a device, so why not actually make it one?? I want to
> > use tcpdump on it,
>
> Modify tcpdump to do Vlan recognition (if it doesnt already).

Also don't forget to modify dhcpd and, I suppose, many other programs, and
to tell everyone out there to write code that can handle an extra header in
the ethernet frame from now on ;)

>
> > I want to route, firewall, bridge and do everything
> > else that a device would do.
>
> VLANs are primarily for switching. Fair enough that you want to route

Vlans are for switching but vlan tags are not IMO.

> them. But you can certainly still do this with policy routing for example.
> The only place i see routing taking effect is in the boundary between
> VLANs and non-VLAN domains.
> Out of ignorance: Is there anyone (vendor) who does routing of VLANs?

I think that everyone who uses current vlan implementations does just that.
Am I right?

> firewalling: I believe the bridging folks are trying to move
> netfilter/ipchains down there; i dont know what the progress status is.
>
> > How can that be done if the VLAN is not
> > a device, and if it could be done, why would it be any more efficient?
> >
>
> Again from the above, you really dont need a device.
> Really, take a look at James Leu's MPLS which is a similar (introduces the
> extra shim headers etc) but a more complex issue. He doesnt introduce any
> new devices.

I want to look at this. Can you provide some URLs?

>
> Use a table or whatever structure (radix tree etc); allow search by
> VLANid, priority etc;
> With these you are free to put whatever search schemes, naming conventions
> etc that you want instead of subjecting the rest of the kernel to
> something that you need for your feature.
>
> In any case when there are two solutions, they need to get unified before
> making it into the kernel proper.
>

Since I didn't participate in this discussion until now I don't know what
you are talking about :).

> [snip]

--
Gleb.
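
The hash-table lookup keyed by VLAN ID that is discussed in this message can
be sketched roughly as below. This is only a minimal user-space illustration
and is not taken from either 802.1q patch: the names vlan_entry, vlan_hash,
vlan_add and vlan_find, and the table size VLAN_HASH_SIZE, are all
hypothetical.

```c
#include <stdlib.h>

/* Hypothetical table size; a power of two so the hash is a cheap mask. */
#define VLAN_HASH_SIZE 32

/* One entry per configured VLAN ID (the 12-bit value in the 802.1Q tag). */
struct vlan_entry {
    unsigned short vid;
    const char *name;          /* whatever the VLAN maps to (assumed) */
    struct vlan_entry *next;   /* chain for IDs that share a bucket */
};

static struct vlan_entry *vlan_table[VLAN_HASH_SIZE];

static unsigned int vlan_hash(unsigned short vid)
{
    return vid & (VLAN_HASH_SIZE - 1);
}

/* Insert at the head of the bucket chain; returns 0 on success. */
static int vlan_add(unsigned short vid, const char *name)
{
    struct vlan_entry *e = malloc(sizeof(*e));

    if (!e)
        return -1;
    e->vid = vid;
    e->name = name;
    e->next = vlan_table[vlan_hash(vid)];
    vlan_table[vlan_hash(vid)] = e;
    return 0;
}

/* Walk a single bucket: cost follows chain length, not total VLAN count. */
static struct vlan_entry *vlan_find(unsigned short vid)
{
    struct vlan_entry *e;

    for (e = vlan_table[vlan_hash(vid)]; e; e = e->next)
        if (e->vid == vid)
            return e;
    return NULL;
}
```

With this layout, VLAN IDs 5 and 37 land in the same bucket (37 & 31 == 5),
so vlan_find() walks a two-entry chain for either of them. Making
VLAN_HASH_SIZE larger shortens the chains, which is the tuning knob
mentioned above.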

From owner-netdev@oss.sgi.com Fri Jun 2 06:19:06 2000
Date: Fri, 2 Jun 2000 15:19:15 +0200 (MEST)
From: Lennert Buytenhek
To: jamal
cc: Ben Greear, netdev@oss.sgi.com, Gleb Natapov
Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???

On Fri, 2 Jun 2000, jamal wrote:

> > I do use a vlan lookup table, did you think I do a linear search through
> > the vlans for incoming pkts or something??? As far as I recall, every
> > lookup involved in any critical path is constant time. That is one of
> > the reasons I believe my implementation will scale better than the
> > other project... (could be wrong...)
>
> I just scanned through the other code and this is _exactly_ what i had
> in mind.

What do you mean? (Sorry, I didn't follow the rest of the thread, and I
can't seem to find netdev archives. The oss.sgi.com page doesn't help a
lot either (Where can I find the xfs cvs details for example?)).

My idea is that VLANs enable you to run multiple logical ethernets over
one physical ethernet. A host that is on multiple VLANs has different IP
addresses for each VLAN it is on. That is why it made sense to me to make
VLANs interfaces. Just like the bridge is an interface.

> > A VLAN should look just like an ethernet port. I **WANT** it to look
> > exactly like a device, so why not actually make it one?? I want to
> > use tcpdump on it,
>
> Modify tcpdump to do Vlan recognition (if it doesnt already).

Yeah, and while we're at it, modify every program that needs to look at
ethernet headers (dhcpd is a nice example). You say we shouldn't create
VLANs as separate interfaces?

> > I want to route, firewall, bridge and do everything else that a
> > device would do.
>
> VLANs are primarily for switching.

I can think of two uses for 802.1q in linux machines:
1. switching
2. multi-homed host with fewer NICs than subnet memberships

1 can be done easily. 2 is the case I was interested in. That's why I
didn't optimise the lookup (yet) and made each VLAN a separate interface.

I think most people using VLANs on linux will want to use option 2. In
this case, having a separate device for each VLAN makes sense to me.

> Fair enough that you want to route them. But you can certainly still
> do this with policy routing for example. The only place i see routing
> taking effect is in the boundary between VLANs and non-VLAN domains.

Well, routing is maybe not the best example. But the fact is that VLANs
look and feel a lot like separate interfaces. That's probably why both
802.1q patches treat them as such.

> Out of ignorance: Is there anyone (vendor) who does routing of VLANs?

I think it would make perfect sense. Connect a bunch of machines from
different VLANs to a VLAN switch, and connect the switch to a VLANning
router via a trunk line.

> firewalling: I believe the bridging folks are trying to move
> netfilter/ipchains down there; i dont know what the progress status
> is.

Eeeeh... those bridging folks would be me. I'm working on it, as time
permits.

> Again from the above, you really dont need a device. Really, take a
> look at James Leu's MPLS which is a similar (introduces the extra shim
> headers etc) but a more complex issue. He doesnt introduce any new
> devices.

I don't really see yet how we can do clean support without fake devices.
Will his 'solution' let us attach IP addresses to VLAN interfaces for example? Do you have a URL? > Use a table or whatever structure (radix tree etc); allow search by > VLANid, priority etc; With these you are free to put whatever search > schemes, naming conventions etc that you want instead of subjecting > the rest of the kernel to something that you need for your feature. > > In any case when there are two solutions, they need to get unified > before making it into the kernel proper. This makes sense, but the fact that both these solutions use fake devices probably indicates the three of us don't get your point... :) greetings, Lennert From owner-netdev@oss.sgi.com Fri Jun 2 07:16:06 2000 Received: by oss.sgi.com id ; Fri, 2 Jun 2000 07:15:56 -0700 Received: from iwr1.iwr.uni-heidelberg.de ([129.206.104.40]:59791 "EHLO iwr1.iwr.uni-heidelberg.de") by oss.sgi.com with ESMTP id ; Fri, 2 Jun 2000 07:15:54 -0700 Received: from kenzo.iwr.uni-heidelberg.de (IDENT:bogdan@kenzo.iwr.uni-heidelberg.de [129.206.120.29]) by iwr1.iwr.uni-heidelberg.de (8.9.3/8.9.3) with ESMTP id QAA21513; Fri, 2 Jun 2000 16:16:34 +0200 (MET DST) Received: from localhost (bogdan@localhost) by kenzo.iwr.uni-heidelberg.de (8.8.7/8.8.7) with ESMTP id QAA07157; Fri, 2 Jun 2000 16:16:34 +0200 Date: Fri, 2 Jun 2000 16:16:34 +0200 (CEST) From: Bogdan Costescu To: "Anders K. Pedersen" cc: netdev@oss.sgi.com Subject: Re: Ethernet lockup (Linksys, Tulip) In-Reply-To: <39379DFB.78686F0B@akp.dk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Fri, 2 Jun 2000, Anders K. Pedersen wrote: > This looks like a problem I had with a 3c590 card recently. When > tulip_tx_timeout() is called from tulip_start_xmit(), dev->tbusy is > never cleared, and whenever tulip_start_xmit() is called after this, it > will call tulip_tx_timeout() as dev->tbusy remains set. 
You could try
> adding the line "clear_bit(0, (void*)&dev->tbusy);" to the bottom of
> tulip_tx_timeout().

By doing this you break the whole logic of setting and clearing dev->tbusy. It should be set whenever start_xmit cannot be called to handle another packet, which is your case: the Tx ring is still full. When you reconnect the cable, the card could start sending packets, thus generating interrupts which will clear dev->tbusy. (I'm talking now about the 3c59x driver, but I think that all of Donald Becker's drivers use the same logic.)

I said above: the card _could_ start sending. There are some situations in which the card cannot start sending immediately after the link is established: collisions (you connect to a heavily loaded hub), failed or nonexistent media speed (auto-)negotiation, and maybe some others that don't come to my mind right now. The collisions are also handled by tx_timeout, which will be called again (with dev->tbusy _not_ cleared). If, when you re-establish the link, the speed (auto-negotiated or not) used by the switch differs from the previous one, you have to wait until the timer routine is called (once every 60 seconds in most cases); it will (hopefully) rewrite the media settings to the card, which will then be able to talk to the switch. If this does not happen, the card will not be able to send any more packets.

So, after you re-establish a link, the transmission might not start immediately, and there may be some other tx_timeout calls until the transmission is really resumed.

In the 3c590 case, can you give some more details (exact type of the board, conditions, logs), preferably on the linux-vortex-bug list?
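The tbusy handshake argued for in this message can be illustrated with a toy state machine in C. This is only a sketch under stated assumptions: it is not code from the tulip or 3c59x drivers, the names toy_nic, toy_start_xmit, toy_tx_timeout and toy_tx_done_irq are hypothetical, and the timeout fires on every busy call instead of after a real time threshold.

```c
#include <assert.h>
#include <stdbool.h>

#define TX_RING_SIZE 4          /* hypothetical ring size for the model */

/* Toy model of a NIC transmit path; not a real driver structure. */
struct toy_nic {
    bool tbusy;                 /* set while the Tx ring cannot take a packet */
    int  queued;                /* packets currently sitting in the Tx ring */
    int  timeouts;              /* how often the timeout handler has run */
};

/* Timeout handler: records the event but leaves tbusy set, because the
 * ring is still full and nothing has been transmitted yet. */
static void toy_tx_timeout(struct toy_nic *nic)
{
    nic->timeouts++;
}

/* Returns 0 if the packet was queued, 1 if the caller must retry later. */
static int toy_start_xmit(struct toy_nic *nic)
{
    if (nic->tbusy) {
        /* Simplification: a real driver checks elapsed time first. */
        toy_tx_timeout(nic);
        return 1;
    }
    assert(nic->queued < TX_RING_SIZE);
    nic->queued++;
    if (nic->queued == TX_RING_SIZE)
        nic->tbusy = true;      /* ring full: refuse further packets */
    return 0;
}

/* Tx-done interrupt: one ring slot is free again, so tbusy is cleared. */
static void toy_tx_done_irq(struct toy_nic *nic)
{
    if (nic->queued > 0) {
        nic->queued--;
        nic->tbusy = false;
    }
}
```

In this model only the Tx-done interrupt clears tbusy. If the card never raises that interrupt again (for example after a reset), the transmit path stays blocked, which is the situation the proposed clear_bit() line is meant to work around.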
Best regards, Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De From owner-netdev@oss.sgi.com Fri Jun 2 07:43:46 2000 Received: by oss.sgi.com id ; Fri, 2 Jun 2000 07:43:26 -0700 Received: from nevald.k-net.dtu.dk ([130.225.71.226]:61114 "HELO nevald.k-net.dk") by oss.sgi.com with SMTP id ; Fri, 2 Jun 2000 07:42:59 -0700 Received: from akp-3.bergsoe.k-net.dk (akp-3.bergsoe.dtu.dk [192.38.219.27]) by nevald.k-net.dk (Postfix) with SMTP id 14A8D3C48A for ; Fri, 2 Jun 2000 16:43:30 +0200 (CEST) Received: (qmail 27606 invoked by uid 9); 2 Jun 2000 14:43:30 -0000 To: netdev@oss.sgi.com Path: not-for-mail From: "Anders K. Pedersen" Newsgroups: akp.lists.linux.netdev Subject: Re: Ethernet lockup (Linksys, Tulip) Date: Fri, 02 Jun 2000 16:39:37 +0200 Organization: AKP Consult I/S Lines: 43 Message-ID: <3937C729.AF689F9B@akp.dk> References: <39379DFB.78686F0B@akp.dk> NNTP-Posting-Host: akp-1.bergsoe.dtu.dk Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Trace: akp-3.bergsoe.k-net.dk 959957010 27389 192.38.218.231 (2 Jun 2000 14:43:30 GMT) X-Complaints-To: newsmaster@akp.dk NNTP-Posting-Date: 2 Jun 2000 14:43:30 GMT To: Bogdan.Costescu@IWR.Uni-Heidelberg.De X-Mailer: Mozilla 4.73 [en] (Win98; U) X-Accept-Language: da,en Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Bogdan Costescu wrote: > On Fri, 2 Jun 2000, Anders K. Pedersen wrote: > > This looks like a problem I had with a 3c590 card recently. When > > tulip_tx_timeout() is called from tulip_start_xmit(), dev->tbusy is > > never cleared, and whenever tulip_start_xmit() is called after this, it > > will call tulip_tx_timeout() as dev->tbusy remains set. 
You could try > > adding the line "clear_bit(0, (void*)&dev->tbusy);" to the bottom of > > tulip_tx_timeout(). > > By doing this you break the whole logic of setting and clearing > dev->tbusy. It should be set whenever start_xmit cannot be called to > handle another packet which is your case - the Tx ring is still full. > When you reconnect the cable, the card could start sending packets, thus > generating interrupts which will clear dev->tbusy. (I'm talking now for > the 3c59x driver, but I think that all of Donald Becker's drivers use the > same logic). One of the last things done in the tx_timeout() is to reset the card - I guess that this might cause the card not to generate the interrupt, that will clear dev->tbusy. > So, after you re-establish a link, the transmission might not start > immediately and there may be some other tx_timeout calls until the > transmission is really resumed. Well, in our case, the driver never resumed sending any packets after the first timeout - at least not until it was restarted with ifconfig. It was discussed a few weeks ago on linux-net, and Donald Becker said, that it should not happen (the timeouts as I understand it), but the fix was to clear dev->tbusy. Andrew Morton implemented it in his version of the driver, and it is now part of the 3c59x.c driver in the newest kernels (both 2.2.x and 2.3.x). > In the 3c590 case, can you give some more details (exact type of the > board, conditions, logs), preferably on the linux-vortex-bug list ? As I mentioned above, it was discussed on the linux-net list a few weeks ago. You can find my initial mail at http://kernelnotes.org/lnxlists/linux-net/ln_0005_04/msg00033.html along with the followups. Regards, Anders K. 
Pedersen From owner-netdev@oss.sgi.com Fri Jun 2 08:04:46 2000 Received: by oss.sgi.com id ; Fri, 2 Jun 2000 08:04:36 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:64525 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Fri, 2 Jun 2000 08:04:12 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id IAA32352; Fri, 2 Jun 2000 08:40:59 -0700 Message-ID: <3937D58B.708AE7F8@candelatech.com> Date: Fri, 02 Jun 2000 08:40:59 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: netdev@oss.sgi.com, buytenh@gnu.org, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > > I just scanned through the other code and this is _exactly_ what i had in > mind. ccing the authors incase they are not in the list. > Their scanning algorithm certainly needs improvement I'm confused: You like their implementation better, though it needs improvement? or it is wrong just as you expected?? If it is the former, could you enlighten the discussion with what you like better about theirs? If unification ever happens, it would be best if we took the best from both... > > A VLAN should look just like an ethernet port. I **WANT** it to look > > exactly like a device, so why not actually make it one?? I want to > > use tcpdump on it, > > Modify tcpdump to do Vlan recognition (if it doesnt already). Not so hard, but consider how much more friendly tcpdump -i vlan0005 is than tcpdump -ben_hack_foo 5 eth0 The same holds true for every other program that interacts with (logical) devices. > > > I want to route, firewall, bridge and do everything > > else that a device would do. 
>
> VLANs are primarily for switching. Fair enough that you want to route
> them. But you can certainly still do this with policy routing for example.
> The only place i see routing taking effect is in the boundary between
> VLANs and non-VLAN domains.

In applications that people are using my code for, I've heard of NO ONE switching, and everyone is using the interfaces for traffic management (i.e. routing, firewalling, etc.). The truth is that dedicated hardware (i.e. a 10/100 etherswitch) switches much better. However, routing and other cool stuff is handled better (or at least much cheaper) by Linux.

> Again from the above, you really dont need a device.

I think this is just a matter of taste at this point. I disagree with your assumption. I wrote VLAN because it could solve problems that I saw in managing DSLAM (DSL) installations, and that definitely needs the VLANs to be able to firewall and route. I see no reason to limit what people can do with VLANs, and I have little desire to patch 50 different tools that work with interfaces to have special handling for VLANs.

> Use a table or whatever structure (radix tree etc); allow search by
> VLANid, priority etc;

I have an array of 4096 entries, one per VLAN ID. It wastes a little memory, but gives constant-time lookup. Where the same VLAN ID can be used on more than one physical device, there is one array per physical port.

> With these you are free to put whatever search schemes, naming conventions
> etc that you want instead of subjecting the rest of the kernel to
> something that you need for your feature.

What have I subjected the kernel to? The only parts of my code that touch the existing kernel are some code in eth.c to detect VLANs, and some other structure changes to allow for bigger ethernet headers...

> 1) Fix pppd. Paulus is planning on a total re-write (i think you were
> copied on that email) 2) allow for multiple channels per device.
> With the new pppox architecture, you can then have a single socket > selecting/polling for multiple circuits. So, what about routing between circuits? Firewalling? People don't write generalized code (ie making them all devices) for no good reason.... Thanks, Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Fri Jun 2 08:36:06 2000 Received: by oss.sgi.com id ; Fri, 2 Jun 2000 08:35:57 -0700 Received: from wirespeed.solidum.com ([216.13.130.242]:35761 "EHLO solidum.com") by oss.sgi.com with ESMTP id ; Fri, 2 Jun 2000 08:35:30 -0700 Received: from phobos.solidum.com (mcr@phobos.solidum.com [192.168.1.13]) by solidum.com (8.8.7/8.8.7) with ESMTP id LAA31809 for ; Fri, 2 Jun 2000 11:36:11 -0400 Message-Id: <200006021536.LAA31809@solidum.com> To: netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: Your message of "Fri, 02 Jun 2000 08:06:04 EDT." Mime-Version: 1.0 (generated by tm-edit 7.108) Content-Type: text/plain; charset=US-ASCII Date: Fri, 02 Jun 2000 11:36:11 -0400 From: Michael Richardson Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> "jamal" == jamal writes: jamal> On Thu, 1 Jun 2000, Michael Richardson wrote: >> >> >>>>> "jamal" == jamal writes: jamal> What is the specific reason that you insist on mapping a VLAN >> >> It is a nice abtraction. It has known interfaces (ifconfig, netstat,route). >> jamal> hope my comments above answer this. Nope... you talk a lot about internal data structures. That's nice, but so what? I want to know what the userland interface is. 
:!mcr!: | Solidum Systems Corporation, http://www.solidum.com Michael Richardson |For a better connected world,where data flows faster Personal: http://www.sandelman.ottawa.on.ca/People/Michael_Richardson/Bio.html mailto:mcr@sandelman.ottawa.on.ca mailto:mcr@solidum.com From owner-netdev@oss.sgi.com Fri Jun 2 09:53:57 2000 Received: by oss.sgi.com id ; Fri, 2 Jun 2000 09:53:37 -0700 Received: from nat-su-33.valinux.com ([198.186.202.33]:12314 "EHLO mail.valinux.com") by oss.sgi.com with ESMTP id ; Fri, 2 Jun 2000 09:53:33 -0700 Received: from tungsten.su.valinux.com ([10.1.0.74]) by mail.valinux.com with esmtp (Exim 2.12 #6) id 12xuhH-0003XL-00; Fri, 2 Jun 2000 09:53:23 -0700 Received: (from rob@localhost) by tungsten.su.valinux.com (8.8.7/8.8.7) id JAA04851; Fri, 2 Jun 2000 09:53:23 -0700 From: Rob Walker MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Date: Fri, 2 Jun 2000 09:53:22 -0700 (PDT) To: buytenh@gnu.org Cc: hadi@cyberus.ca, greearb@candelatech.com, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: References: X-Mailer: VM 6.43 under 20.4 "Emerald" XEmacs Lucid Message-ID: <14647.58672.343104.432906@tungsten.su.valinux.com> Reply-To: rob@valinux.com X-PGP-Fingerprint: CD 3B D7 01 69 A1 DD E3 3E 69 11 23 6D 8E A2 6A Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> On Fri, 2 Jun 2000 15:19:15 +0200 (MEST), Lennert Buytenhek >>>>> said: >> Fair enough that you want to route them. But you can certainly >> still do this with policy routing for example. The only place i see >> routing taking effect is in the boundary between VLANs and non-VLAN >> domains. and between VLAN-IS and VLAN-ENG, and between VLAN-ACCT and VLAN-INTERNET, and between VLAN-* and VLAN-GAMES. Lennert> Well, routing is maybe not the best example. 
But the fact is Lennert> that VLANs look and feel a lot like separate Lennert> interfaces. That's probably why both 802.1q patches treat Lennert> them as such. >> Out of ignorance: Is there anyone (vendor) who does routing of >> VLANs? Lennert> I think it would make perfect sense. Connect a bunch of Lennert> machines from different VLANs to a VLAN switch, and connect Lennert> the switch to a VLANning router via a trunk line. routing of packets between VLANs? Of course. Router#sh ver Cisco Internetwork Operating System Software IOS (tm) 7200 Software (C7200-JS56I-M), Version 12.0(7)T, RELEASE SOFTWARE (fc2) Router#sh run interface FastEthernet2/0 no ip address no ip directed-broadcast no ip mroute-cache full-duplex ! interface FastEthernet2/0.1 description desc1 encapsulation dot1Q 1 ip address a.b.c.d 255.255.255.192 secondary ip address e.f.g.h 255.255.255.128 no ip directed-broadcast no ip mroute-cache ! interface FastEthernet2/0.2 description desc2 encapsulation dot1Q 4095 ip address e.f.g.i 255.255.255.240 secondary ip address a.b.j.k 255.255.255.128 secondary ip address l.m.o.p 255.255.255.240 no ip directed-broadcast no ip mroute-cache >> Again from the above, you really dont need a device. Really, take a >> look at James Leu's MPLS which is a similar (introduces the extra >> shim headers etc) but a more complex issue. He doesnt introduce any >> new devices. Lennert> I don't really see yet how we can do clean support without Lennert> fake devices. Will his 'solution' let us attach IP addresses Lennert> to VLAN interfaces for example? Lennert> Do you have a URL? Why should VLANs not be fake devices? How are they different from aliased interfaces? 
rob From owner-netdev@oss.sgi.com Fri Jun 2 10:10:06 2000 Received: by oss.sgi.com id ; Fri, 2 Jun 2000 10:09:57 -0700 Received: from apollo.nbase.co.il ([194.90.137.2]:27913 "EHLO apollo.nbase.co.il") by oss.sgi.com with ESMTP id ; Fri, 2 Jun 2000 10:09:37 -0700 Received: from nbase.co.il ([194.90.136.56]) by apollo.nbase.co.il (Post.Office MTA v3.1.2 release (PO205-101c) ID# 0-44418U200L2S100) with ESMTP id AAA445; Fri, 2 Jun 2000 20:09:22 +0200 Message-ID: <3937EA35.423CB3C8@nbase.co.il> Date: Fri, 02 Jun 2000 17:09:09 +0000 From: Gleb Natapov Organization: NBase-Xyplex X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14vlan i686) X-Accept-Language: en MIME-Version: 1.0 To: rob@valinux.com CC: buytenh@gnu.org, hadi@cyberus.ca, greearb@candelatech.com, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <14647.58672.343104.432906@tungsten.su.valinux.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Rob Walker wrote: > > Why should VLANs not be fake devices? How are they different from > aliased interfaces? Not so good example IMHO. Aliased interfaces are deprecated ;) > > rob -- Gleb. 
From owner-netdev@oss.sgi.com Fri Jun 2 11:53:17 2000 Received: by oss.sgi.com id ; Fri, 2 Jun 2000 11:53:07 -0700 Received: from nat-su-33.valinux.com ([198.186.202.33]:39197 "EHLO mail.valinux.com") by oss.sgi.com with ESMTP id ; Fri, 2 Jun 2000 11:52:45 -0700 Received: from tungsten.su.valinux.com ([10.1.0.74]) by mail.valinux.com with esmtp (Exim 2.12 #6) id 12xwZK-0002Nc-00; Fri, 2 Jun 2000 11:53:18 -0700 Received: (from rob@localhost) by tungsten.su.valinux.com (8.8.7/8.8.7) id LAA07341; Fri, 2 Jun 2000 11:53:18 -0700 From: Rob Walker MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Date: Fri, 2 Jun 2000 11:53:17 -0700 (PDT) To: gleb@nbase.co.il Cc: rob@valinux.com, buytenh@gnu.org, hadi@cyberus.ca, greearb@candelatech.com, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <3937EA35.423CB3C8@nbase.co.il> References: <14647.58672.343104.432906@tungsten.su.valinux.com> <3937EA35.423CB3C8@nbase.co.il> X-Mailer: VM 6.43 under 20.4 "Emerald" XEmacs Lucid Message-ID: <14648.652.386962.113311@tungsten.su.valinux.com> Reply-To: rob@valinux.com X-PGP-Fingerprint: CD 3B D7 01 69 A1 DD E3 3E 69 11 23 6D 8E A2 6A Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> On Fri, 02 Jun 2000 17:09:09 +0000, Gleb Natapov >>>>> said: Gleb> Rob Walker wrote: >> Gleb> >> Why should VLANs not be fake devices? How are they different from >> aliased interfaces? Gleb> Not so good example IMHO. Aliased interfaces are deprecated ;) well, that shows what I know. 
:-)

rob

From owner-netdev@oss.sgi.com Fri Jun 2 12:40:47 2000
Received: by oss.sgi.com id ; Fri, 2 Jun 2000 12:40:37 -0700
Received: from mail.hsnp.com ([205.161.174.10]:32570 "HELO netc.netc.com") by oss.sgi.com with SMTP id ; Fri, 2 Jun 2000 12:40:25 -0700
Received: (qmail 23732 invoked by uid 510); 2 Jun 2000 14:41:07 -0500 (CDT)
Date: Fri, 2 Jun 2000 14:41:07 -0500 (CDT)
From: Jeff Barrow
To: Gleb Natapov
cc: netdev@oss.sgi.com
Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???
In-Reply-To: <3937EA35.423CB3C8@nbase.co.il>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Aliased interfaces may be deprecated, but we still require them for our nameserver (without the aliased interfaces, bind doesn't bind to all the IP addresses).

True, a newer version of bind might work properly; I haven't tried that. It was more important that bind WORK quickly, and using ip aliasing fixed the problem QUICKLY.

On Fri, 2 Jun 2000, Gleb Natapov wrote:

> Rob Walker wrote:
>
> >
> > Why should VLANs not be fake devices? How are they different from
> > aliased interfaces?
>
> Not so good example IMHO. Aliased interfaces are deprecated ;)
>
> >
> > rob
>
> --
> Gleb.
> From owner-netdev@oss.sgi.com Fri Jun 2 14:19:21 2000 Received: by oss.sgi.com id ; Fri, 2 Jun 2000 14:19:11 -0700 Received: from sirppi.helsinki.fi ([128.214.205.27]:49928 "EHLO sirppi.helsinki.fi") by oss.sgi.com with ESMTP id ; Fri, 2 Jun 2000 14:18:57 -0700 Received: from localhost (amlaukka@localhost) by sirppi.helsinki.fi (8.10.1/8.10.1) with ESMTP id e52MIGD02093; Sat, 3 Jun 2000 01:18:16 +0300 (EET DST) X-Authentication-Warning: sirppi.helsinki.fi: amlaukka owned process doing -bs Date: Sat, 3 Jun 2000 01:18:15 +0300 (EET DST) From: Aki M Laukkanen To: kuznet@ms2.inr.ac.ru cc: Linux kernel mailing list , netdev@oss.sgi.com Subject: Re: Slow TCP connection between linux and wince In-Reply-To: <200006021901.XAA02397@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Fri, 2 Jun 2000 kuznet@ms2.inr.ac.ru wrote: > I repeat, this change cannot affect behaviour visible on wire. > It wakes up socket write rather than transmit. Yes, a brain fart (maybe this saying has been worn out?-). My only excuse was that I had to catch a movie and was in a hurry. Just a note to everyone, haven't seen a more corny ending in a while than in `Mission to Mars' - von Daniken should be proud. Regarding the correspondance on netdev - yes, I'm quite aware (now but once was lost) that transmit is ack-driven but let's think about the dump and the bursts. We get incoming acks in a steady flow - why then (when approaching window limit) sender stops sending packets to the wire until always a certain point? Burst was able to be so steep only because our emulator had no flow control. It ate all the packets and delayed each of them so that receiver saw a steady stream of 9600 bps - which is why the incoming acks were a steady stream too. What's so different about tcp_snd_test() and __tcp_data_snd_check() that this happens? 
Of course, it might be all our fault so you need only to ignore this for now. I'll report if we find something concrete that's attributable to the TCP/IP stack. There was another error in my mail too, 2*130 is indeed over 200 but that was irrelevant. However the conclusion seemed correct, delayed acks timing out. Could you explain me why the window of 3100 seems to allow 2 packets in flight but sender always waits for the next ack? I must have missed something but in our tests too, the sender never quite reaches the window limit (but this was of course with a bigger window size). From owner-netdev@oss.sgi.com Fri Jun 2 17:00:53 2000 Received: by oss.sgi.com id ; Fri, 2 Jun 2000 17:00:33 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:50437 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Fri, 2 Jun 2000 17:00:19 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id SAA01200; Fri, 2 Jun 2000 18:36:30 -0700 Message-ID: <3938611E.D074F254@candelatech.com> Date: Fri, 02 Jun 2000 18:36:30 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: rob@valinux.com CC: buytenh@gnu.org, hadi@cyberus.ca, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <14647.58672.343104.432906@tungsten.su.valinux.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Rob Walker wrote: > Why should VLANs not be fake devices? How are they different from > aliased interfaces? I think most of us are in violent agreement that they should be devices, Alexey and Jamal seem to be the main dissenters, and at least IMO, they have not offered a reason good enough to make me consider trying to make VLANs anything other than devices... 
On this account, my VLAN implementation and Lennert and Gleb's are almost identical. Other than some internal issues, I believe the only user-visible difference between my implementation and theirs is that they re-write the packet header on the way up the stack so that it looks **exactly** like an ethernet pkt, whereas I just leave the header alone and pull 4 extra bytes off of the SKB before giving it to the higher levels.

I like their idea, but it means they have to move the header around for each pkt. In mine, the packet is not modified, *BUT* programs such as dhcpd, which expect to be reading the raw ethernet pkt, have to be modified.

I think I'll implement a compile-time switch to have their copy behavior, but will also retain the (slightly?) faster non-copy method, which seems to work with most programs that obey the stack layering...

I would welcome more discussion on the relative strengths and weaknesses of the two VLAN implementations, especially as the VLANs scale from 1, to 10, to 100 and to 1000+.
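To make the difference between the two receive paths concrete, here is a minimal userspace sketch in C. It is not taken from either patch: a plain byte buffer stands in for the SKB, and the names frame_view, vlan_rx_rewrite and vlan_rx_pull are hypothetical. The frame layout (6-byte destination, 6-byte source, TPID 0x8100, 2-byte TCI, then the real ethertype) is the standard 802.1Q one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN       6        /* one ethernet address */
#define ETH_HDR_LEN   14       /* dst + src + ethertype */
#define VLAN_TAG_LEN  4        /* TPID + TCI */
#define TPID_8021Q    0x8100   /* tag protocol identifier for 802.1Q */

/* What the rest of the stack gets to see after VLAN receive processing. */
struct frame_view {
    uint8_t *mac;       /* start of the link-layer header as a raw reader sees it */
    uint8_t *payload;   /* where the network layer starts reading */
    size_t   len;       /* bytes from mac to the end of the frame */
    uint16_t vid;       /* 12-bit VLAN ID taken from the TCI */
};

static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static int is_tagged(const uint8_t *buf, size_t len)
{
    return len >= ETH_HDR_LEN + VLAN_TAG_LEN &&
           get_be16(buf + 2 * MAC_LEN) == TPID_8021Q;
}

/* Header-rewrite approach: slide both MAC addresses forward over the tag,
 * so the frame that starts at buf + 4 looks like an untagged ethernet frame.
 * Costs one 12-byte move per packet. */
static int vlan_rx_rewrite(uint8_t *buf, size_t len, struct frame_view *out)
{
    assert(out != NULL);
    if (!is_tagged(buf, len))
        return -1;
    out->vid = get_be16(buf + 2 * MAC_LEN + 2) & 0x0fff;
    memmove(buf + VLAN_TAG_LEN, buf, 2 * MAC_LEN);
    out->mac = buf + VLAN_TAG_LEN;
    out->payload = out->mac + ETH_HDR_LEN;
    out->len = len - VLAN_TAG_LEN;
    return 0;
}

/* Pull approach: leave the bytes alone and just skip 4 extra bytes when
 * handing the packet up. A raw reader still sees the 0x8100 tag. */
static int vlan_rx_pull(uint8_t *buf, size_t len, struct frame_view *out)
{
    assert(out != NULL);
    if (!is_tagged(buf, len))
        return -1;
    out->vid = get_be16(buf + 2 * MAC_LEN + 2) & 0x0fff;
    out->mac = buf;
    out->payload = buf + ETH_HDR_LEN + VLAN_TAG_LEN;
    out->len = len;
    return 0;
}
```

In both variants the network layer ends up at the same payload. The difference only shows to something that reads the link-layer header itself: after vlan_rx_rewrite the two bytes at mac + 12 are the real ethertype, while after vlan_rx_pull they are still 0x8100, which is why a raw-frame reader such as dhcpd would need to know about the tag.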
Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Fri Jun 2 17:41:22 2000 Received: by oss.sgi.com id ; Fri, 2 Jun 2000 17:41:02 -0700 Received: from nat-su-33.valinux.com ([198.186.202.33]:19212 "EHLO mail.valinux.com") by oss.sgi.com with ESMTP id ; Fri, 2 Jun 2000 17:40:42 -0700 Received: from tungsten.su.valinux.com ([10.1.0.74]) by mail.valinux.com with esmtp (Exim 2.12 #6) id 12y2vN-0007nB-00; Fri, 2 Jun 2000 18:40:29 -0700 Received: (from rob@localhost) by tungsten.su.valinux.com (8.8.7/8.8.7) id SAA16989; Fri, 2 Jun 2000 18:40:29 -0700 From: Rob Walker MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Date: Fri, 2 Jun 2000 18:40:29 -0700 (PDT) To: greearb@candelatech.com Cc: rob@valinux.com, buytenh@gnu.org, hadi@cyberus.ca, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <3938611E.D074F254@candelatech.com> References: <14647.58672.343104.432906@tungsten.su.valinux.com> <3938611E.D074F254@candelatech.com> X-Mailer: VM 6.43 under 20.4 "Emerald" XEmacs Lucid Message-ID: <14648.24946.413472.365255@tungsten.su.valinux.com> Reply-To: rob@valinux.com X-PGP-Fingerprint: CD 3B D7 01 69 A1 DD E3 3E 69 11 23 6D 8E A2 6A Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> On Fri, 02 Jun 2000 18:36:30 -0700, Ben Greear >>>>> said: Ben> Rob Walker wrote: >> Why should VLANs not be fake devices? How are they different from >> aliased interfaces? Ben> I think most of us are in violent agreement that they should be Ben> devices, Alexey and Jamal seem to be the main dissenters, and at Ben> least IMO, they have not offered a reason good enough to make me Ben> consider trying to make VLANs anything other than devices... 
Ben> On this account, my vlan implementation, and Lenert and Gleb's Ben> are almost identical. Other than some internal issues, I believe Ben> the only user-visible difference between my imp and theirs is Ben> that they re-write the packet header on the way up the stack so Ben> that it looks **exactly** like an ethernet pkt, where as I just Ben> leave the header alone and pull 4 extra bytes off of the SKB Ben> before giving it to the higher levels. Remember how hard *BSD keeps ragging that their stack is faster due to "zero copy"? I can't evaluate whether that statement is true, or if the speed advantage has worn off with time, but I do think that the faster the implementation, the better. Ben> I like their idea, but it means they have to move the header Ben> around for each pkt. In mine, the packet is not modified, *BUT*, Ben> programs such as dhcpd which expect to be reading the raw Ben> ethernet pkt have to be modified. Maybe a run-time switch could be added to dhcpd, or you could extend it to automatically read both types of frames as detected. Is this even possible? rob From owner-netdev@oss.sgi.com Fri Jun 2 18:57:42 2000 Received: by oss.sgi.com id ; Fri, 2 Jun 2000 18:57:33 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:25863 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Fri, 2 Jun 2000 18:57:15 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id UAA01440; Fri, 2 Jun 2000 20:32:22 -0700 Message-ID: <39387C46.C055E05A@candelatech.com> Date: Fri, 02 Jun 2000 20:32:22 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: rob@valinux.com CC: buytenh@gnu.org, hadi@cyberus.ca, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
References: <14647.58672.343104.432906@tungsten.su.valinux.com> <3938611E.D074F254@candelatech.com> <14648.24946.413472.365255@tungsten.su.valinux.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Rob Walker wrote: > Ben> On this account, my vlan implementation, and Lenert and Gleb's > Ben> are almost identical. Other than some internal issues, I believe > Ben> the only user-visible difference between my imp and theirs is > Ben> that they re-write the packet header on the way up the stack so > Ben> that it looks **exactly** like an ethernet pkt, where as I just > Ben> leave the header alone and pull 4 extra bytes off of the SKB > Ben> before giving it to the higher levels. > > Remember how hard *BSD keeps ragging that their stack is faster due to > "zero copy"? I can't evaluate whether that statement is true, or if > the speed advantage has worn off with time, but I do think that the > faster the implementation, the better. Yeah, but remember also that Apache, not the fastest, but the most flexible, rules the web. A compile-time switch can offer the best of both worlds..the only question is which one to 'default' to. If default to the zero-copy, then dhcpd and others must be fixed (there is a patch on my web-site that is sorta-kinda fixes dhcpd.) Of course, maybe these programs should be fixed anyway... > Maybe a run-time switch could be added to dhcpd, or you could extend > it to automatically read both types of frames as detected. Is this > even possible? I basically compared the name (vlan* matched) and used that to determine the behavior for the interface. I'm sure a more elegant solution could be contrived... 
>
> rob

--
Ben Greear (greearb@candelatech.com) http://www.candelatech.com
Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL)
http://scry.wanfear.com http://scry.wanfear.com/~greear

From owner-netdev@oss.sgi.com Sat Jun 3 02:04:54 2000
Received: by oss.sgi.com id ; Sat, 3 Jun 2000 02:04:34 -0700
Received: from tochna.technion.ac.il ([132.68.48.10]:6415 "EHLO tochna.technion.ac.il") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 02:04:16 -0700
Received: from localhost (gleb@localhost) by tochna.technion.ac.il (8.8.7/8.8.5) with SMTP id NAA09826; Sat, 3 Jun 2000 13:03:41 +0300 (IDT)
Date: Sat, 3 Jun 2000 13:03:40 +0300 (IDT)
From: Gleb Natapov
To: Rob Walker
cc: greearb@candelatech.com, buytenh@gnu.org, hadi@cyberus.ca, netdev@oss.sgi.com
Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???
In-Reply-To: <14648.24946.413472.365255@tungsten.su.valinux.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

On Fri, 2 Jun 2000, Rob Walker wrote:

>
> Ben> On this account, my vlan implementation, and Lenert and Gleb's
> Ben> are almost identical. Other than some internal issues, I believe
> Ben> the only user-visible difference between my imp and theirs is
> Ben> that they re-write the packet header on the way up the stack so
> Ben> that it looks **exactly** like an ethernet pkt, where as I just
> Ben> leave the header alone and pull 4 extra bytes off of the SKB
> Ben> before giving it to the higher levels.
>
> Remember how hard *BSD keeps ragging that their stack is faster due to
> "zero copy"? I can't evaluate whether that statement is true, or if
> the speed advantage has worn off with time, but I do think that the
> faster the implementation, the better.

You mean you want to sacrifice "correctness" of the implementation for speed? I was also against moving headers around, until I encountered the dhcp problem.
Anyway, it is useless to argue about performance loss until someone actually does benchmarks and provides us with real numbers. I don't see any performance loss with ping -f. > > Ben> I like their idea, but it means they have to move the header > Ben> around for each pkt. In mine, the packet is not modified, *BUT*, > Ben> programs such as dhcpd which expect to be reading the raw > Ben> ethernet pkt have to be modified. > > Maybe a run-time switch could be added to dhcpd, or you could extend > it to automatically read both types of frames as detected. Is this > even possible? > I think that the perfect solution will be to remove tags from frames _only_ if some process actually reads ethernet headers from this vlan device. Any ideas how this can be implemented? > rob -- Gleb. From owner-netdev@oss.sgi.com Sat Jun 3 05:33:15 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 05:33:05 -0700 Received: from zada.math.leidenuniv.nl ([132.229.231.3]:11019 "EHLO zada.math.leidenuniv.nl") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 05:32:47 -0700 Received: from mara.math.leidenuniv.nl (IDENT:buytenh@mara.math.leidenuniv.nl [132.229.232.80]) by zada.math.leidenuniv.nl (8.9.3/8.9.3) with ESMTP id PAA21999; Sat, 3 Jun 2000 15:32:24 +0200 Date: Sat, 3 Jun 2000 15:32:24 +0200 (MEST) From: Lennert Buytenhek X-Sender: buytenh@mara.math.leidenuniv.nl To: Gleb Natapov cc: Rob Walker , greearb@candelatech.com, hadi@cyberus.ca, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, 3 Jun 2000, Gleb Natapov wrote: > I think that the perfect solution will be to remove tags from frames > _only_ if some process actually reads ethernet headers from this vlan > device. Any ideas how this can be implemented? 
We could detect if there are any packet sockets (something similar to the fastroute obstacle detection)? greetings, Lennert From owner-netdev@oss.sgi.com Sat Jun 3 07:39:56 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 07:39:46 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:63975 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 07:39:28 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id LAA29621; Sat, 3 Jun 2000 11:37:17 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id LAA16000; Sat, 3 Jun 2000 11:36:58 -0400 (EDT) Date: Sat, 3 Jun 2000 11:36:58 -0400 (EDT) From: jamal To: Ben Greear cc: rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <3938611E.D074F254@candelatech.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Sorry cant keep up with all the overwhelming response ;-> I think i'll pick on Ben again since his post seems to capture mostly everything. On Fri, 2 Jun 2000, Ben Greear wrote: > Rob Walker wrote: > > > Why should VLANs not be fake devices? How are they different from > > aliased interfaces? > > I think most of us are in violent agreement that they should be devices, > Alexey and Jamal seem to be the main dissenters, and at least IMO, > they have not offered a reason good enough to make me consider trying > to make VLANs anything other than devices... > Devices map to physical devices i.e. ports in your lingo. How many of those do you see in your average Linux machine? In fact i have never seen a single switch blade with more than 48 ports but even that is beside the point. The point really is the design abstraction. VLANs are virtual circuits which are abstracted on the device.
So are PPPOE sessions, ATM *VCs, MPLS LSPs, FR etc and the list goes on. They all 'suffer' from the same problem as you, i dont understand why you are such a special case. I will argue that you _can not_ write a generic search algorithm for all these protocols. Unfortunately if you enforce one then the device search algorithm will have to be the same across the board. It goes without any argument that we have a very good worst case estimate today, given the practical limits. You try adding all those thousands of VLANs as devices and i can _guarantee you_ that you are not optimizing for the common case. I like Lennert and Gleb's because they dont use devices for the abstraction rather they attach themselves to the device. You dont. You could use the aliasing interface if you wanted to add extra IP addresses (one per VLAN). I dont understand the DHCP problem. What has an app got to do with what happens at layer 2? Tcpdump is pretty straightforward; you add another protocol, the rule is you extend tcpdump. I was also kind of surprised to see that VLANs were being preached as a replacement for aliasing. I have heard that the authors of 802.1q have already apologized to the world for the mess ;-> Please dont spread the 802.1q gospel in case some frail minds start believing you. The only reason i am for it is because of what i would call "peer pressure" (win2k has it, as do other so called "enterprise ready" OSes). 
cheers, jamal From owner-netdev@oss.sgi.com Sat Jun 3 07:49:06 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 07:48:56 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:27036 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 07:48:38 -0700 Received: from fred.muc.de (none@ns1097.munich.netsurf.de [195.180.235.97]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id RAA17636; Sat, 3 Jun 2000 17:43:16 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 12yG5V-0003rd-00; Sat, 3 Jun 2000 17:43:49 +0200 Date: Sat, 3 Jun 2000 17:43:49 +0200 From: Andi Kleen To: Lennert Buytenhek Cc: Gleb Natapov , Rob Walker , greearb@candelatech.com, hadi@cyberus.ca, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? Message-ID: <20000603174349.A14730@fred.muc.de> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: ; from Lennert Buytenhek on Sat, Jun 03, 2000 at 03:34:27PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, Jun 03, 2000 at 03:34:27PM +0200, Lennert Buytenhek wrote: > > > On Sat, 3 Jun 2000, Gleb Natapov wrote: > > > I think that the perfect solution will be to remove tags from frames > > _only_ if some process actually reads ethernet headers from this vlan > > device. Any ideas how this can be implemented? > > We could detect if there are any packet sockets (something similar to the > fastroute obstacle detection)? Just use dev->hard_header_parse (or a VLAN pseudo device equivalent) to do it on demand. -Andi -- This is like TV. I don't like TV. 
From owner-netdev@oss.sgi.com Sat Jun 3 08:19:16 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 08:18:56 -0700 Received: from dhcp41.toaster.net ([199.108.84.41]:40196 "EHLO schmee.sfgoth.com") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 08:18:40 -0700 Received: (from mitch@localhost) by schmee.sfgoth.com (8.9.3/8.9.3) id JAA64415; Sat, 3 Jun 2000 09:18:19 -0700 (PDT) Date: Sat, 3 Jun 2000 09:18:18 -0700 From: Mitchell Blank Jr To: jamal Cc: Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? Message-ID: <20000603091818.B48132@sfgoth.com> References: <3938611E.D074F254@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: ; from hadi@cyberus.ca on Sat, Jun 03, 2000 at 11:36:58AM -0400 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > Devices map to physical devices i.e. ports in your lingo. How many of those > do you see in your average Linux machine? The problem is that if you only think about the "common" network types (ethernet, PPP, etc) this line gets blurred, since there's a one-to-one correspondence between: * physical devices * network devices (i.e. things that you can bind IP addresses to, netfilter based on, tcpdump of) Any sane implementation of VLANs needs to be a network device in the second sense. > VLANs are virtual circuits which are abstracted on the device. So are > PPPOE sessions, ATM *VCs, MPLS LSPs, FR etc and the list goes on. > They all 'suffer' from the same problem as you, i dont understand why you > are such a special case. Neither do I. PPPoE shows up as a device (ppp0). ATM LANE shows up as a device (lec0). ATM CLIP shows up as a device (atm0). 
Currently, atm *cards* don't show up as a network device (since they can't be bound directly to L3 protocols), but that may need to change for the benefit of SNMP (which wants it to be an enumerable device for statistics purposes). This is how Cisco deals with ATM - there is one device for the physical interface and one for each LANE/CLIP running on it. > You try adding all those thousands of > VLANs as devices and i can _guarantee you_ that you are not optimizing for > the common case. ...or thousands of routes, or gigabytes of RAM, etc... do you suggest we drop kernel support for those too? It's all a matter of finding an algorithm that scales. > I like Lennert and Gleb's because they dont use devices for the > abstraction rather they attach themselves to the device. You dont. Again, how do I write an IP filter that discerns traffic on two different VLANs - for instance imagine we had one shared ethernet switch that included * (VLAN 1) Router to outside world * (VLAN 1, VLAN 2) Linux box acting as masquerading firewall * (VLAN 2) Internal machines How do I do this under your system? Keep in mind that for security's sake we need to make sure that traffic on VLAN 1 doesn't spoof an internal IP address. > I have heard that the authors of 802.1q > have already apologized to the world for the mess ;-> Do you have a reference, or is this just flamebait? I'm sure 802.1q has some interesting applications. 
Just because it doesn't solve every problem doesn't mean that it's not worth supporting well (much like ATM :-) -Mitch From owner-netdev@oss.sgi.com Sat Jun 3 08:50:15 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 08:50:06 -0700 Received: from nat-su-33.valinux.com ([198.186.202.33]:17436 "EHLO mail.valinux.com") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 08:49:40 -0700 Received: from tungsten.su.valinux.com ([10.1.0.74]) by mail.valinux.com with esmtp (Exim 2.12 #6) id 12yH73-0002O6-00; Sat, 3 Jun 2000 09:49:29 -0700 Received: (from rob@localhost) by tungsten.su.valinux.com (8.8.7/8.8.7) id JAA04417; Sat, 3 Jun 2000 09:49:29 -0700 From: Rob Walker MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Date: Sat, 3 Jun 2000 09:49:29 -0700 (PDT) To: gleb@tochna.technion.ac.il Cc: rob@valinux.com, greearb@candelatech.com, buytenh@gnu.org, hadi@cyberus.ca, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: References: <14648.24946.413472.365255@tungsten.su.valinux.com> X-Mailer: VM 6.43 under 20.4 "Emerald" XEmacs Lucid Message-ID: <14649.13419.718862.497771@tungsten.su.valinux.com> Reply-To: rob@valinux.com X-PGP-Fingerprint: CD 3B D7 01 69 A1 DD E3 3E 69 11 23 6D 8E A2 6A Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> On Sat, 3 Jun 2000 13:03:40 +0300 (IDT), Gleb Natapov >>>>> said: Gleb> On Fri, 2 Jun 2000, Rob Walker wrote: Ben> On this account, my vlan implementation, and Lenert and Gleb's Ben> are almost identical. Other than some internal issues, I believe Ben> the only user-visible difference between my imp and theirs is Ben> that they re-write the packet header on the way upthe stack so Ben> that it looks **exactly** like an ethernet pkt, where as I just Ben> leave the header alone and pull 4 extra bytes off of the SKB Ben> before giving it to the higher levels. 
>> Remember how hard *BSD keeps ragging that their stack is faster due >> to "zero copy"? I can't evaluate whether that statement is true, >> or if the speed advantage has worn off with time, but I do think >> that the faster the implementation, the better. Gleb> You mean you want to sacrifice "correctness" of implementation Gleb> to speed? I also was against moving headers around, until I Gleb> encountered dhcp problem. I am not sure about that one. Must think more about it. Gleb> Anyway, it is useless to argue about performance loss until Gleb> someone actually does benchmarks and provides us with real Gleb> numbers. I don't see any performance loss with ping -f. Yes, benchmarking numbers should be taken on by someone somewhere. I would love to see linux up at the PPP bake-offs in San Ramon, and at any other gathering of the heavyweights to do benchmarks. I think that this will not happen until a company has a vested interest in a commercial solution of this type. something-xylink or xylink-something is the only company I know of doing linux routers/switches in a chassis-based solution. I can't even find their site right now, so those names are probably a ways off. >> Maybe a run-time switch could be added to dhcpd, or you could >> extend it to automatically read both types of frames as detected. >> Is this even possible? Gleb> I think that the perfect solution will be to remove tags from Gleb> frames _only_ if some process actually reads ethernet headers Gleb> from this vlan device. Any ideas how this can be implemented? As far as Gleb> I also was against moving headers around, until I encountered Gleb> dhcp problem. maybe this is a time where dhcpd needs to be updated to follow the new interface standard? Interfaces change, and the applications which use those interfaces need to be updated. 
another thought along the same lines is that if I have set up dhcpd to answer only off of certain subnets, but those subnets are on different vlans, doesnt' dhcpd need to know which vlan it came in on? (therefore it needs to be able to read the vlan tag, and get the entire packet with the tag intact) rob -- I want the TCP/IP equivalent of a Rat Thing. -- James W. Abendschan From owner-netdev@oss.sgi.com Sat Jun 3 08:50:25 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 08:50:15 -0700 Received: from styx.uwaterloo.ca ([129.97.40.10]:16909 "EHLO styx.uwaterloo.ca") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 08:50:09 -0700 Received: (from mostrows@localhost) by styx.uwaterloo.ca (8.9.3/8.9.3) id MAA17831; Sat, 3 Jun 2000 12:45:48 -0400 From: Michal Ostrowski MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14649.13884.93499.610988@styx.uwaterloo.ca> Date: Sat, 3 Jun 2000 12:45:48 -0400 (EDT) To: jamal , Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: References: <3938611E.D074F254@candelatech.com> X-Mailer: VM 6.72 under 21.1 (patch 4) "Arches" XEmacs Lucid Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing My $0.02 (CDN)... I think that one of the fundamental issues involved here is that the networking code relies on network devices and sockets as the basic packet accepting and packet generating objects. What we are seeing here is that we may want to have some other kind of basic object to work with (that is, if we don't want VLAN's to be represented by network devices). I would like to suggest that people take a look at the "netgraph" architecture for BSD. I think that the kind of issues that people have raised in this discussion can be dealt with in a much simpler manner in such an architecture. 
For those of you not familiar with netgraph, here's a link to an introduction, though a quick search will reveal many more resources: http://www.daemonnews.org/200003/netgraph.html Michal Ostrowski mostrows@styx.uwaterloo.ca From owner-netdev@oss.sgi.com Sat Jun 3 09:44:35 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 09:44:26 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:4 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Sat, 3 Jun 2000 09:44:06 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA10055; Sat, 3 Jun 2000 21:31:15 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006031731.VAA10055@ms2.inr.ac.ru> Subject: Re: Slow TCP connection between linux and wince To: amlaukka@cc.helsinki.fi (Aki M Laukkanen) Date: Sat, 3 Jun 2000 21:31:15 +0400 (MSK DST) Cc: linux-kernel@vger.rutgers.edu, netdev@oss.sgi.com In-Reply-To: from "Aki M Laukkanen" at Jun 3, 0 01:18:15 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 2681 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! [ BTW you have problems with your mail at helsinki.fi ] > There was another error in my mail too, 2*130 is indeed over 200 but that > was irrelevant. However the conclusion seemed correct, delayed acks timing > out. Could you explain me why the window of 3100 seems to allow 2 packets > in flight but sender always waits for the next ack? No, tcpdump shows that sender always has full window. Receiver always ACKs _previous_ packet. RTT=0.9sec, window is 2 packets or ~3K, bandwidth is ~3K/sec. Tcpdump is perfectly smooth and correct except for the rate. 8) Probably, receiver delays each ACK by 500msec. I have no idea why it does this, because it is apparently illegal behaviour. Look: 1 02:59:02.875097 10.0.0.3.www > 192.168.55.100.1029: . 233601:235061(1460) ack 196 win 16060 (DF) 2 02:59:03.235031 192.168.55.100.1029 > 10.0.0.3.www: . 
ack 233601 win 3100 (DF) 3 02:59:03.235094 10.0.0.3.www > 192.168.55.100.1029: . 235061:236521(1460) ack 196 win 16060 (DF) 4 02:59:03.775064 192.168.55.100.1029 > 10.0.0.3.www: . ack 235061 win 3100 (DF) 5 02:59:03.775137 10.0.0.3.www > 192.168.55.100.1029: . 236521:237981(1460) ack 196 win 16060 (DF) 6 02:59:04.145061 192.168.55.100.1029 > 10.0.0.3.www: . ack 236521 win 3100 (DF) ACK #4 acks packet #1, sent 0.9sec before this (hence, rtt=0.9sec). We have two packets in transmit, so that packet #1 reaches receiver not later than (2*1500)/bandwidth after transmit. If bandwidth is 115Kbaud, it is not more ~0.3sec. Hence, packet #3 reaches receiver _before_ receiver sent ACK for previous packet!! Conclusion is: 1. Either link rate is not 115Kbaud, but three times less. 2. Link is busy with another traffic (it is very unlikely, timing is very smooth, I see no jitter). 3. Or receiver has totally broken TCP, which sends delayed ACKs ~500msec after packet arrives, even if another packets arrived during this period. I.e. delays ACK only to delay. 8) Probably, this stack even looked as working, if window were >2packets. Namely, if it were > bandwidth*0.9sec=~10K nobody even would notice this misbehaviour. > then (when approaching window limit) sender stops sending packets to the > wire until always a certain point? You can investigate this. Probably, sender had no data to send for these 5 seconds. Probably, your sender selected illegal sndbuf, which is much less than receiver window. > something but in our tests too, the sender never quite reaches the window > limit (but this was of course with a bigger window size). Exactly. sndbuf is selected so that amount of data ready to send at sender exceeds receiver window, otherwise sender will not able to utilize network. 
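
The figures in the analysis above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: the link is taken to run at 115200 baud with 10 bits on the wire per byte (8N1 async framing), neither of which is stated in the mail, and the function names are made up for illustration. The RTT, window and packet size are the ones quoted from the tcpdump.

```python
# Worked arithmetic for the numbers quoted in the mail above.
# ASSUMPTIONS (not from the original mail): 115200 baud, 10 wire bits per byte.
LINK_BAUD = 115_200          # assumed exact value behind "115Kbaud"
WIRE_BITS_PER_BYTE = 10      # assumed 8N1 async serial framing
RTT_SEC = 0.9                # from the trace: ACK #4 acks packet #1 after 0.9 s
WINDOW_BYTES = 3100          # advertised receiver window in the trace
PACKET_BYTES = 1500          # approximate on-wire size of one full segment


def link_bytes_per_sec(baud=LINK_BAUD, bits_per_byte=WIRE_BITS_PER_BYTE):
    """Raw link capacity in bytes per second."""
    return baud / bits_per_byte


def serialization_delay(nbytes, rate):
    """Seconds needed to push nbytes onto a link of `rate` bytes/sec."""
    return nbytes / rate


def window_limited_throughput(window=WINDOW_BYTES, rtt=RTT_SEC):
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window / rtt


def bandwidth_delay_product(rate, rtt=RTT_SEC):
    """Window needed to keep the link busy for a whole RTT."""
    return rate * rtt


if __name__ == "__main__":
    rate = link_bytes_per_sec()
    print(f"link rate:              {rate:.0f} bytes/s")
    print(f"2 packets on the wire:  {serialization_delay(2 * PACKET_BYTES, rate):.2f} s")
    print(f"window/RTT throughput:  {window_limited_throughput():.0f} bytes/s")
    print(f"window to fill 0.9 s:   {bandwidth_delay_product(rate):.0f} bytes")
```

Under these assumptions two packets drain in roughly a quarter of a second, the 3100-byte window caps the transfer at about 3.4K/sec, and a window of about 10K would be needed to hide a 0.9 sec round trip, which matches the "~0.3sec", "~3K/sec" and "~10K" estimates in the text.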
Alexey From owner-netdev@oss.sgi.com Sat Jun 3 10:38:07 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 10:37:47 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:24849 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 10:37:21 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id MAA03324; Sat, 3 Jun 2000 12:11:12 -0700 Message-ID: <39395850.F19A5B98@candelatech.com> Date: Sat, 03 Jun 2000 12:11:12 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: Gleb Natapov CC: Rob Walker , buytenh@gnu.org, hadi@cyberus.ca, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Gleb Natapov wrote: > > On Fri, 2 Jun 2000, Rob Walker wrote: > > > Remember how hard *BSD keeps ragging that their stack isfaster due to > > "zero copy"? I can't evaluate whether that statement is true, or if > > the speed advantage has worn off with time, but I do think that the > > faster the implementation, the better. > > You mean you want to sacrifice "correctness" of implementation to speed? > I also was against moving headers around, until I encountered dhcp > problem. In my opinion, processes should not break layering (ie read the MAC header) unless they can deal with whatever they find there, correctly. So, whether or not it would be sacrificing correctness, is questionable. Also, could there be some reason for giving the priority bits and VLAN ID to the upper stack (not taking it out of the skb)? Seems reasonable to me, would be good for all kinds of traffic management... 
> Anyway, it is useless to argue about performance loss until someone actually > does benchmarks and provides us with real numbers. I don't see any > performance loss with ping -f. There is definitely a theoretical loss of performance in copying the header, but it's only moving 14 bytes or so, so it really isn't that bad, and it's probably only noise, statistically. The more interesting question is which is actually more right! > > > > > Ben> I like their idea, but it means they have to move the header > > Ben> around for each pkt. In mine, the packet is not modified, *BUT*, > > Ben> programs such as dhcpd which expect to be reading the raw > > Ben> ethernet pkt have to be modified. > > > > Maybe a run-time switch could be added to dhcpd, or you could extend > > it to automatically read both types of frames as detected. Is this > > even possible? > > > > I think that the perfect solution will be to remove tags from frames > _only_ if some process actually reads ethernet headers from this vlan > device. Any ideas how this can be implemented? I can't imagine how the VLAN layer could know what process the frames were destined for... I think it's definitely something the user-space processes should deal with. (Note: Patching DHCPd to fix it wasn't too hard.) 
-- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sat Jun 3 10:42:06 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 10:41:46 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:28433 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 10:41:34 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id MAA03342; Sat, 3 Jun 2000 12:16:55 -0700 Message-ID: <393959A7.404AB54C@candelatech.com> Date: Sat, 03 Jun 2000 12:16:55 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: Lennert Buytenhek CC: Gleb Natapov , Rob Walker , hadi@cyberus.ca, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Lennert Buytenhek wrote: > > On Sat, 3 Jun 2000, Gleb Natapov wrote: > > > I think that the perfect solution will be to remove tags from frames > > _only_ if some process actually reads ethernet headers from this vlan > > device. Any ideas how this can be implemented? > > We could detect if there are any packet sockets (something similar to the > fastroute obstacle detection)? DHCPd is broken because it uses the BSD (like?) packet filters, at least in its default compilation. I don't think we should break, or overly complicate the kernel to work around one user-space implementation... (I know of no other program that doesn't work with non-copied VLAN skbs, though they may exist.) 
Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sat Jun 3 10:44:37 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 10:44:16 -0700 Received: from smtp03.mweb.co.za ([196.2.134.189]:1033 "EHLO smtp03.iafrica.com") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 10:44:08 -0700 Received: from [196.7.200.119] (helo=cheese.iafrica.com ident=root) by smtp03.iafrica.com with esmtp (Exim 1.92 #1) for netdev@oss.sgi.com id 12yItm-0008EO-00; Sat, 3 Jun 2000 20:43:55 +0200 Received: from localhost (bmerry@localhost) by cheese.iafrica.com (8.9.3/8.9.3) with ESMTP id RAA03948 for ; Sat, 3 Jun 2000 17:32:47 +0200 Date: Sat, 3 Jun 2000 17:32:47 +0200 (SAST) From: Bruce Merry To: netdev@oss.sgi.com Subject: Possible bug in Space.c in 2.4.0-test1 Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello I'm writing because I think I've found a fault in drivers/net/Space.c in the Linux kernel version 2.4.0-test1. I say "think" because a) Changing it made my network card start working, but it could be that I hacked it instead of doing things The Right Way b) It could be that it has to be done the present way for some other configuration (e.g. modules vs. compiled in), and that the real fault is somewhere else (e.g. I think the FIXME at the top of Space.c refers to this) Anyway, enough waffling. In Space.c there is a chunk of code (starting at line 542) that statically creates a linked list of ethernet devices. The device name for all of them is "eth%d". I haven't traced what happens with modules, but with a monolithic kernel these names are compared to parameters passed on the kernel command line (e.g. ether=12,0x240,eth0 for my ISA NE2000 card) to pass the correct cards the correct info. 
However it doesn't make sense to me that you should specify ether=...eth%d for any ethernet card (if nothing else, you have no way to distinguish between them). I changed the "eth%d"'s in Space.c to eth0, eth1, ... and recompiled, and after that it picked up the network card and nothing else has died, so I'm guessing that that was the right thing to do. B4N Bruce /--------------------------------------------------------------------\ | Bruce Merry (Entropy) | bmerry at iafrica dot com | | Proud user of Linux! | http://www.cs.uct.ac.za/~bmerry | | Monday is an awful way to spend 1/7 of your life. | \--------------------------------------------------------------------/ From owner-netdev@oss.sgi.com Sat Jun 3 10:49:07 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 10:48:47 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:38161 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 10:48:41 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id MAA03388; Sat, 3 Jun 2000 12:24:23 -0700 Message-ID: <39395B67.C86EFA21@candelatech.com> Date: Sat, 03 Jun 2000 12:24:23 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: rob@valinux.com CC: gleb@tochna.technion.ac.il, buytenh@gnu.org, hadi@cyberus.ca, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <14648.24946.413472.365255@tungsten.su.valinux.com> <14649.13419.718862.497771@tungsten.su.valinux.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Rob Walker wrote: > > Yes, benchmarking numbers should be taken on by someone somewhere. I > would love to see linux up at the PPP bake-offs in San Ramon, and at > any other gathering of the heavyweights to do benchmarks. 
I think > that this will not happen until a company has a vested interest in a > commercial solution of this type. something-xylink or > xylink-something is the only company I know of doing linux > routers/switches in a chassis-based solution. I can't even find their > site right now, so those names are probably a ways off. Xyplex, I think. > another thought along the same lines is that if I have set up dhcpd to > answer only off of certain subnets, but those subnets are on different > vlans, doesnt' dhcpd need to know which vlan it came in on? > (therefore it needs to be able to read the vlan tag, and get the > entire packet with the tag intact) dhcpd knows which **device** the packet came in on, so, since VLANs are devices, then it can use the same logic that it uses for regular ethernet... Thus the beauty of VLAN==device! Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sat Jun 3 11:27:57 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 11:27:36 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:530 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 11:27:04 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id NAA03465; Sat, 3 Jun 2000 13:03:11 -0700 Message-ID: <3939647F.E08A09AB@candelatech.com> Date: Sat, 03 Jun 2000 13:03:11 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > In fact i have never seen a single switch blade with more than 48 ports > but even that is beside the point. The point really is the design > abstraction. I had a cisco with two FrameRelay 'ports' on it. I added 200 PVC 'devices' to the cisco setup. Last time I'll mention it, so remember it! > I will argue that you _can not_ write a generic search algorithm for all > these protocols. Unfortunately if you enforce one then the device search > algorithm will have to be the same across the board. I see no need to even have a generic search algorithm, each protocol implementation (ATM, FR, VLAN) can do whatever makes the most sense for it. > It goes without any argument that we have a very good worst case estimate > today, given the practical limits. You try adding all those thousands of > VLANs as devices and i can _guarantee you_ that you are not optimizing for > the common case. Ok, the question is where is the lookup 'hit' you are talking about. Where is this searching that is slowing everything down? Don't just say there is a hit, show me the specific code or logic where this hit takes place. For incoming pkts, the packet is detected in eth.c, as it comes off of the hardware. I can immediately hash to find the VLAN device. Constant time, worst case, O(n), where n is the number of physical ethernet ports, and this is only when configured to allow 4096 VLANs PER Ethernet device, which is fairly non-standard. After that, the packet goes on up the stack, as if it were from any other device. So, I see no performance hit going up the stack, regardless of how many devices you have... Now, going down the stack, I know less about. However, it's something like: send on a socket, which then looks at the routing to determine the interface (net_device), then send it out that device. 
Where is the lookup/search problem here? Surely the IP stack is not so dumb that it must do a linear search on the net_device linked list?? That would blow your performance to hell even if you only have 5-10 interfaces. So, please either provide some concrete examples where having more devices hurt performance, or quit arguing about it. > I like Lennert and Gleb's because they dont use devices for the > abstraction rather they attach themselves to the device. You dont. I'm not sure you are correct here, Gleb, Lennert? Either way, my VLANs are **logically** attached to devices, even if they don't have pointers linked off of the ethernet device. > You could use the aliasing interface if you wanted to add extra IP > addresses (one per VLAN). Maybe I want to add 20 IP aliases to a VLAN? > I dont understand the DHCP problem. What has an app got to do with > what happens at layer 2? I do, DHCP uses packet filters and uses hard-coded offsets into the raw packet. The 4 extra bytes throw it off by 4, and so it never thinks it gets a packet on the right port. See the patch on my web site if you want to learn more. 
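
The offset problem described above can be sketched in a few lines. This is not the dhcpd patch mentioned in the mail, which is not reproduced here; it is only an illustration, with made-up names, of why a reader that assumes a fixed 14-byte Ethernet header lands 4 bytes short when an 802.1Q tag is left in the frame. The 0x8100 tag type and the 12-bit VLAN ID field come from the 802.1Q frame format.

```python
import struct

ETH_HLEN = 14          # dst MAC (6) + src MAC (6) + ethertype (2)
VLAN_HLEN = 4          # 802.1Q tag: TCI (2) + encapsulated ethertype (2)
ETH_P_8021Q = 0x8100   # ethertype value that marks a VLAN-tagged frame


def l3_offset(frame: bytes):
    """Return (ethertype, vlan_id_or_None, offset_of_layer3_payload).

    Hypothetical helper for illustration only: a reader that hard-codes
    offset 14 gets the wrong bytes whenever the tag is present.
    """
    if len(frame) < ETH_HLEN:
        raise ValueError("frame shorter than an Ethernet header")
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_8021Q:
        # Plain Ethernet: payload starts right after the 14-byte header.
        return ethertype, None, ETH_HLEN
    if len(frame) < ETH_HLEN + VLAN_HLEN:
        raise ValueError("truncated 802.1Q tag")
    # Tagged frame: TCI holds 3 priority bits, 1 flag bit, 12-bit VLAN ID.
    tci, inner_type = struct.unpack_from("!HH", frame, 14)
    return inner_type, tci & 0x0FFF, ETH_HLEN + VLAN_HLEN


if __name__ == "__main__":
    macs = bytes(12)  # zeroed dst/src addresses, enough for the example
    untagged = macs + b"\x08\x00" + b"IP..."
    tagged = macs + b"\x81\x00" + b"\xa0\x2a" + b"\x08\x00" + b"IP..."
    print(l3_offset(untagged))
    print(l3_offset(tagged))
```

A filter written against fixed offsets would have to apply the same 4-byte shift to every field it inspects, which is the kind of change a user-space fix needs when the kernel hands up the frame with the tag still in place.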
Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Sat Jun 3 11:43:57 2000 Received: by oss.sgi.com id ; Sat, 3 Jun 2000 11:43:37 -0700 Received: from zada.math.leidenuniv.nl ([132.229.231.3]:51211 "EHLO zada.math.leidenuniv.nl") by oss.sgi.com with ESMTP id ; Sat, 3 Jun 2000 11:43:15 -0700 Received: from mara.math.leidenuniv.nl (IDENT:buytenh@mara.math.leidenuniv.nl [132.229.232.80]) by zada.math.leidenuniv.nl (8.9.3/8.9.3) with ESMTP id VAA25486; Sat, 3 Jun 2000 21:42:57 +0200 Date: Sat, 3 Jun 2000 21:42:57 +0200 (MEST) From: Lennert Buytenhek X-Sender: buytenh@mara.math.leidenuniv.nl To: Ben Greear cc: jamal , rob@valinux.com, netdev@oss.sgi.com, Gleb Natapov Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <3939647F.E08A09AB@candelatech.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, 3 Jun 2000, Ben Greear wrote: > > I like Lennert and Gleb's because they dont use devices for the > > abstraction rather they attach themselves to the device. You dont. > > I'm not sure you are correct here, Gleb, Lennert? Either way, my > VLANs are **logically** attached to devices, even if they don't have > pointers linked off of the ethernet device. I'm not too familiar with your code, so I couldn't say. Our code keeps a list of vlan slaves in the master ethernet device struct. All in all I think our approaches do not differ much. The idea is the same, at least. > > You could use the aliasing interface if you wanted to add extra IP > > addresses (one per VLAN). > > Maybe I want to add 20 IP aliases to a VLAN? I think not using fake slave devices is fundamentally wrong. Of course, I could be wrong. 
greetings, Lennert From owner-netdev@oss.sgi.com Mon Jun 5 07:51:49 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:29 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:20888 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:48:45 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id HAA24266; Mon, 5 Jun 2000 07:34:58 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id HAA18174; Mon, 5 Jun 2000 07:34:57 -0400 (EDT) Date: Mon, 5 Jun 2000 07:34:57 -0400 (EDT) From: jamal To: Ben Greear cc: Mitchell Blank Jr , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <393B414B.ACF84B1B@candelatech.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sun, 4 Jun 2000, Ben Greear wrote: > Mitchell Blank Jr wrote: > > > (selecting source IPs for outgoing packets is taken care by the > > routing code, right?!?) > > I hope someone has a good answer here...I certainly have no idea! > And yet you made the loudest noise;-> Ok, ok i take that back. Indeed the routing slow path is affected and i would think this is very critical. The point is i see no need for us to make any change at this point just so that we can accomodate VLANS; which in my opinion is brokenness at its best. Andrey brought the best points so far. Just because something is broken doesnt mean we need to continue doing it. Packet mungling does not equate to device. It does not equate to socket. It is just packet mungling! We have netfilter and packet type for that. A VLAN as far as i have seen is a packet mungler. 
Regarding Zebra: i believe they use the name "interface" but really mean a "circuit"; you could have many circuits within the same interface of which some could be just simple unicast sockets in an NBMA mode. An interface is brought up/down etc but does not need to be abstracted as a device. cheers, jamal From owner-netdev@oss.sgi.com Mon Jun 5 07:51:49 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:30 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:25241 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:50:20 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id KAA11187; Sun, 4 Jun 2000 10:07:14 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id KAA16988; Sun, 4 Jun 2000 10:07:13 -0400 (EDT) Date: Sun, 4 Jun 2000 10:07:13 -0400 (EDT) From: jamal To: Michal Ostrowski cc: Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <14649.13884.93499.610988@styx.uwaterloo.ca> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, 3 Jun 2000, Michal Ostrowski wrote: > > I would like to suggest that people take a look at the "netgraph" > architecture for BSD. I think that the kind of issues that people > have raised in this discussion can be dealt with in a much simpler > manner in such an architecture. > You already know my opinions on this ;-> I think BSD had bigger problems than us (packet type and netfilter should help as is and something like pppox overlaying is a very good start). But in general, i agree that sometime in the future we need to do the rethinking. 
It helps to do a lot of thinking and prototyping rather than going with some academic exercise (of which there are about 101 such architectures in existence; netgraph happens to emulate some) cheers, jamal From owner-netdev@oss.sgi.com Mon Jun 5 07:51:50 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:33 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:52497 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:50:53 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id IAA23517; Mon, 5 Jun 2000 08:27:11 -0700 Message-ID: <393BC6CF.2D8AA4C0@candelatech.com> Date: Mon, 05 Jun 2000 08:27:11 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Mitchell Blank Jr , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > On Sun, 4 Jun 2000, Ben Greear wrote: > > > Mitchell Blank Jr wrote: > > > > > (selecting source IPs for outgoing packets is taken care by the > > > routing code, right?!?) > > > > I hope someone has a good answer here...I certainly have no idea! > > > > And yet you made the loudest noise;-> Ok, ok i take that back. > Indeed the routing slow path is affected and i would think this is very > critical. In the original post about the linear search, there was a question as to whether or not it was really called that often. If it is, then I think it should be fixed to become constant-time anyway, if at all possible. Walking five devices (lo, dummy, eth0, eth1, ppp0 (for example)) is a performance hit in itself, without counting any new exotic devices we're dreaming up. 
I was hoping someone could tell us how often that code was called, and why. I haven't had time to look at the kernel yet, maybe this evening.... Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jun 5 07:51:50 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:34 -0700 Received: from apollo.nbase.co.il ([194.90.137.2]:3852 "EHLO apollo.nbase.co.il") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:52:32 -0700 Received: from nbase.co.il ([194.90.136.56]) by apollo.nbase.co.il (Post.Office MTA v3.1.2 release (PO205-101c) ID# 0-44418U200L2S100) with ESMTP id AAA450; Mon, 5 Jun 2000 10:33:16 +0200 Message-ID: <393B56DD.34A83A7D@nbase.co.il> Date: Mon, 05 Jun 2000 07:29:33 +0000 From: Gleb Natapov Organization: NBase-Xyplex X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14vlan i686) X-Accept-Language: en MIME-Version: 1.0 To: Andrey Savochkin CC: Mitchell Blank Jr , Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il, jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <3938611E.D074F254@candelatech.com> <20000603091818.B48132@sfgoth.com> <20000605102627.A8473@saw.sw.com.sg> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andrey Savochkin wrote: > > Hello, > > I want to ad my $0.02. > > On Sat, Jun 03, 2000 at 09:18:18AM -0700, Mitchell Blank Jr wrote: > > > Devices map to physical devices i.e ports in your lingo. How many of those > > > do you see in your average Linux machine? > > > > The problem is that if you only think about the "common" network types > > (ethernet, PPP, etc) this line gets blurred, since there's a one-to-one > > corresponance between: > > * physical devices > > * network devices (i.e. 
things that you can bind IP addresses to, > > netfilter based on, tcpdump of) > > > > Any sane implementation of VLANs needs to be a network device in the > > second sense. > > Network devices in the second sense is only an abstraction. > Linux kernel do not bind IP addresses for devices. IP address assignment to > any device is just an entry in routing table "local". The kernel keeps > information about the correspondence about IP address and device only for > backward compatibility to help ifconfig and other obsolete network management > software to work. I'm very thankful to Alexey for removing finally the > long-standing mistake of correspondence between IP addresses and devices from > the kernel. > > Netfilters isn't a big problem, too. A specific VLAN-id matching netfilter > module is a clean and powerful solution. > > I think that the current VLAN implementation slightly abuses the notion of > device. And it doesn't relate to the number of devices and the efficiency of > search algorithms. The current VLAN implementation is a pure packet-mangling > code. It misses one of the most important properties of network devices - > flow control. Any code that doesn't provide flow control isn't a device, but a > code just manipulating of packet contents. > > The current kernel infrastructure for packet mangling may still need some > adjustments, but it at least exists. I'm encouraging to consider VLAN > implementation as just a netfilter module. > > Best regards > Andrey V. > Savochkin As Mitchell said, will I be able to run OSPF between VLANs? I actually run zebra ospfd on vlans. Zebra has a strong notion of device. It relies on device up, device done, change ip and other messages from netlink. Will I be able to do that if vlan will be implemented not as a device. It looks like vlan will be useless without device interface, at least for me. -- Gleb. 
From owner-netdev@oss.sgi.com Mon Jun 5 07:51:50 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:34 -0700 Received: from apollo.nbase.co.il ([194.90.137.2]:3852 "EHLO apollo.nbase.co.il") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:52:29 -0700 Received: from nbase.co.il ([194.90.136.56]) by apollo.nbase.co.il (Post.Office MTA v3.1.2 release (PO205-101c) ID# 0-44418U200L2S100) with ESMTP id AAA444; Sun, 4 Jun 2000 18:22:44 +0200 Message-ID: <393A7428.5E31D6EC@nbase.co.il> Date: Sun, 04 Jun 2000 15:22:16 +0000 From: Gleb Natapov Organization: NBase-Xyplex X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14vlan i686) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > On Sat, 3 Jun 2000, Ben Greear wrote: > > > jamal wrote: > > > > > Infact i have never seen a single switch blade with more than 48 ports > > > but even that is beside the point. The point really is the desiugn > > > abstraction. > > > > I had a cisco with two FrameRelay 'ports' on it. I added 200 PVC > > 'devices' to the cisco setup. Last time I'll mention it, so remember it! > > We are talking about two different things. 'ports' are _physical_. So you > only had two ports; and unless you really understand CISCOs internal > structuring, no point in making references to their 200 'devices' > (that could be just the user interface showing stuff that > people like to see). In linux there is no one to one mapping between physical ports and network devices. tunneling (ip in ip encapsulation) and bridging are two examples. I am sure there are more. > > > > > > I will argue that you _can not_ write a generic search algorithm for all > > > these protocols. 
Unfortunately if you enforce one then the device search > > > algorithm will have to be the same across the board. > > > > I see no need to even have a generic search algorithm, each protocol implementation > > (ATM, FR, VLAN) can do whatever makes the most sense for it. > > > > I dont follow. > > > > It goes without any arguement that we have a very good worst case estimate > > > today, given the practical limits. You try adding all those thousands of > > > VLANs as devices and i can _guarantee you_ that you are not optimizing for > > > the common case. > > > > Ok, the question is where is the lookup 'hit' you are talking about. > > Where is this searching that is slowing everything down? Don't just > > say there is a hit, show me the specific code or logic where this hit takes place. > > > > You register_netdevice() each VLAN device (because you have a device for > each vlan). > > > For incomming pkts, the packet is detected in eth.c, as it comes off > > of the hardware. I can immediately hash to find the VLAN device. > > Constant time, worst case, O(n), where n is the number of physical ethernet > > ports, and this is only when configured to allow 4096 VLANs PER Ethernet device, > > which is fairly non-standard. > > > > I am not gonna bitch about how many devices you have but in most cases > 1024 per device is already overkill (including cross port VLANs). > So in the worst case for 4 ethernet ports you have grown dev_base to over > 15000 structures (in the worst case). Now go look at the associated code > and tell me you dont see the repurcasions. If you really want you can create 15000 tunneling devices. The question is why somebody will ever want to do such thing. > If you know of a smart way to optimize that please post it. You might get > me to support you. If you are talking about searching the device according to the device name then we can store the devices in hash instead of linked list for instance. 
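As a rough illustration of the hash idea, here is a userspace sketch of looking devices up by name through a chained hash table instead of walking one long list. Everything here (struct demo_dev, demo_register, demo_get_by_name, the bucket count, the djb2-style hash) is hypothetical and is not how the kernel's dev_base is laid out; it only shows that a name lookup need not touch every registered device.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_NAME_BUCKETS 256   /* hypothetical table size, must be a power of two */
#define DEMO_IFNAMSIZ     16

struct demo_dev {
    char name[DEMO_IFNAMSIZ];
    int ifindex;
    struct demo_dev *hash_next;     /* next device in the same bucket */
};

static struct demo_dev *demo_name_hash[DEMO_NAME_BUCKETS];

/* Simple multiplicative string hash (djb2 style), folded to a bucket index. */
static unsigned int demo_hash_name(const char *name)
{
    unsigned int h = 5381;

    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h & (DEMO_NAME_BUCKETS - 1);
}

/* Add a device at the head of its bucket chain. */
static void demo_register(struct demo_dev *dev)
{
    unsigned int b = demo_hash_name(dev->name);

    dev->hash_next = demo_name_hash[b];
    demo_name_hash[b] = dev;
}

/* Walk only the one bucket the name hashes to. */
static struct demo_dev *demo_get_by_name(const char *name)
{
    struct demo_dev *dev;

    for (dev = demo_name_hash[demo_hash_name(name)]; dev; dev = dev->hash_next)
        if (strcmp(dev->name, name) == 0)
            return dev;
    return NULL;
}
```

With a reasonable hash the chain length stays near (number of devices / number of buckets), so even 15000 devices cost a few dozen string compares per lookup rather than thousands.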
> Most of the manipulating code doesnt run in the critical data path, but > you are adding unnecessary noise, and besides my point is that _you dont_ > need to have a device per VLAN; i might convert if you optimize it for > everyone else. > > > I do, DHCP uses packet filters and uses hard-coded offsets into the > > raw packet. The 4 extra bytes throw it off by 4, and so it never things > > it gets a packet on the right port. See the patch on my web site if you > > want to learn more. > > Ok so they use BPF. Either out of coolness or madness. Would packet socket > have sufficed here for Linux ? > > cheers, > jamal -- Gleb. From owner-netdev@oss.sgi.com Mon Jun 5 07:51:59 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:35 -0700 Received: from apollo.nbase.co.il ([194.90.137.2]:3852 "EHLO apollo.nbase.co.il") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:52:36 -0700 Received: from nbase.co.il ([194.90.136.56]) by apollo.nbase.co.il (Post.Office MTA v3.1.2 release (PO205-101c) ID# 0-44418U200L2S100) with ESMTP id AAA478; Mon, 5 Jun 2000 12:30:36 +0200 Message-ID: <393B7319.8C92BD74@nbase.co.il> Date: Mon, 05 Jun 2000 09:30:01 +0000 From: Gleb Natapov Organization: NBase-Xyplex X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14vlan i686) X-Accept-Language: en MIME-Version: 1.0 To: Andrey Savochkin CC: Mitchell Blank Jr , Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il, jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
References: <3938611E.D074F254@candelatech.com> <20000603091818.B48132@sfgoth.com> <20000605102627.A8473@saw.sw.com.sg> <393B56DD.34A83A7D@nbase.co.il> <20000605154657.D10091@saw.sw.com.sg> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andrey Savochkin wrote: > > Hello Gleb, > > On Mon, Jun 05, 2000 at 07:29:33AM +0000, Gleb Natapov wrote: > > As Mitchell said, will I be able to run OSPF between VLANs? I actually > > run zebra ospfd on vlans. Zebra has a strong notion of device. It relies > > on device up, device done, change ip and other messages from netlink. > > I do not know how Zebra works, but the design described by you looks very > broken at the first glance. If you run routing managements software on your > system you should perform all kernel state changes only through this > software. Thus, the software do not need any kernel feedback about device/ip > state except the confirmations of its own commands. You can do all configuration from zebra and, in fact, that is what you are expected to do. Zebra has cisco-like CLI: 'shutdown' to shutdown interface 'no shutdown' enable interface again, and so on. But if somebody will ever do 'ifconfig eth0 down' outside of the zebra, the zebra will notice the change and will continue to work properly. I don't know why this is a bad thing. Besides, zebra gets current configuration of the kernel via netlink at startup time. > > > Will I be able to do that if vlan will be implemented not as a device. > > It looks like vlan will be useless without device interface, at least > > for me. > > If Zebra works as you've told me, try to include their developers into > the discussion, too. Zebra and, I suppose, other routing software will need to be changed in order to support vlans if vlan will be implemented not as regular interface IMO. 
I'll ask zebra developer (Kunihiro Ishiguro) what does he think about supporting vlans in zebra. > > Best regards > Andrey V. > Savochkin -- Gleb. From owner-netdev@oss.sgi.com Mon Jun 5 07:51:59 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:43 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:25618 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 07:39:10 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id WAA21196; Sun, 4 Jun 2000 22:57:31 -0700 Message-ID: <393B414B.ACF84B1B@candelatech.com> Date: Sun, 04 Jun 2000 22:57:31 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: Mitchell Blank Jr CC: jamal , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <20000603091818.B48132@sfgoth.com> <20000604214856.B77216@sfgoth.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Mitchell Blank Jr wrote: > The one tricky part is dev_alloc_name() function... it's current > algorithm is to search for "prefix0" through "prefix99" and if > they're all in use bail out. Not only does this mean that you > won't naturally end up with "ppp100", the search is N^2 (meaning > the time to set up N of them is N^3). The best algorithm compromise > isn't clear - I'd say that for each prefix in use ("ppp", "eth") keep > an ordered linked list of them, that way you can quickly linearly > scan looking for the lowest hole. An added optomization would be > to hold a "next_to_try_after" pointer into the list. (after adding > an interface, set to self. 
When deleting an interface, set to > our previous one if it's before next_to_try_after) which greatly > reduces the length of the search for many common access patterns. > This sounds complicated, but would actually be pretty easy to implement > using linux/list.h - just keep the next_to_try_after as the first > element and just rotate the list around it as it changes. VLAN, by default, allocates its names in constant time (eth-dev-name + VLAN-ID). I would imagine FrameRelay and ATM could do something similar. ppp and others should be able to easily get it down to a linear search of the list by parsing the names in one pass, then allocating the first hole, or appending onto the end. I believe that will not touch any of the data structures; it only needs a smarter, or just more specific, dev_alloc_name. The ppp component could keep track of things itself too, and just never call dev_alloc_name, perhaps with a list as you mentioned... > The trickier part is where the IP stack searches through the list > itself (ipv4/devinet.c:inet_select_addr(), > ipv6/addrconf.c/ipv6_get_saddr())... I assume these are just used > for corner cases, right? I assume that things on the fast path > (selecting source IPs for outgoing packets is taken care by the > routing code, right?!?) I hope someone has a good answer here...I certainly have no idea! > Anyway, assuming that those code paths in ipv4 and ipv6 aren't > hotspots, some basic datastructure work in net/core/dev.c would > remove the algorithmic limitations on hosting thousands of devices. Thanks for the information! 
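The one-pass idea for names like "ppp0", "ppp1", ... could look roughly like the sketch below. It is a userspace illustration only: demo_lowest_free_unit and its array-of-strings input are invented here, and the real dev_alloc_name works on the kernel's device list. The point is that one scan over N existing names plus one scan over a small bitmap finds the first hole, with no repeated lookups per candidate name.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return the lowest unit number u such that "<prefix><u>" does not
 * appear in names[0..count-1], or -1 if memory allocation fails.
 */
static int demo_lowest_free_unit(const char *prefix,
                                 const char *const *names, size_t count)
{
    size_t plen = strlen(prefix);
    /*
     * With count names in use the lowest hole is at most count,
     * so count + 1 flags are enough and the last one is never set.
     */
    unsigned char *used = calloc(count + 1, 1);
    size_t i;
    int unit;

    if (!used)
        return -1;

    /* Pass 1: mark every unit number already taken for this prefix. */
    for (i = 0; i < count; i++) {
        const char *digits;
        char *end;
        unsigned long n;

        if (strncmp(names[i], prefix, plen) != 0)
            continue;
        digits = names[i] + plen;
        if (*digits < '0' || *digits > '9')
            continue;               /* no number after the prefix */
        if (digits[0] == '0' && digits[1] != '\0')
            continue;               /* "ppp01" is not the same name as "ppp1" */
        n = strtoul(digits, &end, 10);
        if (*end != '\0')
            continue;               /* trailing junk: a different naming scheme */
        if (n < count)
            used[n] = 1;
    }

    /* Pass 2: the first clear flag is the first hole (or the end). */
    for (i = 0; used[i]; i++)
        ;
    unit = (int)i;
    free(used);
    return unit;
}
```

For the name set {lo, eth0, ppp0, ppp1, ppp3, eth1} this picks unit 2 for "ppp" (the hole) and unit 2 for "eth" (appending onto the end).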
Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jun 5 07:51:59 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:36 -0700 Received: from apollo.nbase.co.il ([194.90.137.2]:3852 "EHLO apollo.nbase.co.il") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:52:39 -0700 Received: from nbase.co.il ([194.90.136.56]) by apollo.nbase.co.il (Post.Office MTA v3.1.2 release (PO205-101c) ID# 0-44418U200L2S100) with ESMTP id AAA129; Mon, 5 Jun 2000 15:26:00 +0200 Message-ID: <393B9C33.A2AC848C@nbase.co.il> Date: Mon, 05 Jun 2000 12:25:23 +0000 From: Gleb Natapov Organization: NBase-Xyplex X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14vlan i686) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Ben Greear , Mitchell Blank Jr , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > On Sun, 4 Jun 2000, Ben Greear wrote: > > > Mitchell Blank Jr wrote: > > > > > (selecting source IPs for outgoing packets is taken care by the > > > routing code, right?!?) > > > > I hope someone has a good answer here...I certainly have no idea! > > > > And yet you made the loudest noise;-> Ok, ok i take that back. > Indeed the routing slow path is affected and i would think this is very > critical. > The point is i see no need for us to make any change at this point just so > that we can accomodate VLANS; which in my opinion is brokenness at its > best. It seams that you suppose that if you don't need vlans, nobody needs vlans. You will be surprised, but some people find them useful. :) > > Andrey brought the best points so far. 
Just because something is broken > doesnt mean we need to continue doing it. > Packet mungling does not equate to device. It does not equate to socket. > It is just packet mungling! We have netfilter and packet type for that. > A VLAN as far as i have seen is a packet mungler. > > Regarding Zebra: i believe they use the name "interface" but really > mean a "circuit"; you could have many circuits within the same > interface of which some could be just simple unicast sockets in an > NBMA mode. An interface is brought up/down etc but does not need to be > abstracted as a device. I've looked at the Zebra code. The 'shutdown' command communicates directly with the kernel, using the zebra interface name as the kernel interface name. So it seems that in order to use an interface in zebra, one should have the same interface in the kernel. This can be changed, of course. > > cheers, > jamal -- Gleb. From owner-netdev@oss.sgi.com Mon Jun 5 07:51:59 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:37 -0700 Received: from dhcp41.toaster.net ([199.108.84.41]:11027 "EHLO schmee.sfgoth.com") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:56:00 -0700 Received: (from mitch@localhost) by schmee.sfgoth.com (8.9.3/8.9.3) id GAA84674; Mon, 5 Jun 2000 06:03:21 -0700 (PDT) Date: Mon, 5 Jun 2000 06:03:21 -0700 From: Mitchell Blank Jr To: Ben Greear Cc: jamal , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
Message-ID: <20000605060321.E77216@sfgoth.com> References: <20000603091818.B48132@sfgoth.com> <20000604214856.B77216@sfgoth.com> <393B414B.ACF84B1B@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: <393B414B.ACF84B1B@candelatech.com>; from greearb@candelatech.com on Sun, Jun 04, 2000 at 10:57:31PM -0700 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > The ppp component itself could keep track of things itself too, and > just never call the dev_alloc_name, perhaps with a list as you mentioned... Good point. The only place this falls down is if two unrelated network devices are using the same naming scheme (like they both want to use "abc0", "abc1", etc). It will still work, of course, but will have lots of collisions so it will tend towards N^2 performance just like now. As long as only one of the contenders wants to make a lot of devices, it's just fine. So you're right, this is the way to go. So if we can get the inet_select_addr() thing sorted then the entire impact of this change is to add two btrees to core/dev.c -- one for looking up net_devices by name, another for looking them up by ifindex. Since these lookups are abstracted through the __dev_get_by_{name,index}() calls we could even make this CONFIG_SUPPORT_LOTSA_NET_DEVICES without too much gross code. 
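For the by-ifindex half of that idea, here is a small sketch of a logarithmic lookup. It deliberately swaps the proposed btree for a sorted array searched with the C library's bsearch(), which gives the same O(log N) lookup but would need the array re-sorted or shifted on every insert; struct demo_ifentry and demo_get_by_index are hypothetical names, not the kernel's __dev_get_by_index().

```c
#include <stdlib.h>

/* Hypothetical stand-in for the part of a device the lookup needs. */
struct demo_ifentry {
    int ifindex;
    const char *name;
};

/* bsearch() comparator: key is an int ifindex, elem is a table entry. */
static int demo_cmp_ifindex(const void *key, const void *elem)
{
    int k = *(const int *)key;
    int e = ((const struct demo_ifentry *)elem)->ifindex;

    return (k > e) - (k < e);
}

/*
 * Look up an entry by ifindex in a table sorted by ascending ifindex.
 * Returns NULL when no entry has that index.
 */
static const struct demo_ifentry *
demo_get_by_index(const struct demo_ifentry *tbl, size_t n, int ifindex)
{
    return bsearch(&ifindex, tbl, n, sizeof(*tbl), demo_cmp_ifindex);
}
```

With a few thousand entries this is about a dozen comparisons per lookup, which is the property the tree is meant to provide; a real tree would also keep insertion and removal cheap.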
-Mitch From owner-netdev@oss.sgi.com Mon Jun 5 07:52:00 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:36 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:53121 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Mon, 5 Jun 2000 06:53:48 -0700 Received: (qmail 10226 invoked by uid 577); 5 Jun 2000 07:46:57 -0000 Message-ID: <20000605154657.D10091@saw.sw.com.sg> Date: Mon, 5 Jun 2000 15:46:57 +0800 From: Andrey Savochkin To: Gleb Natapov Cc: Mitchell Blank Jr , Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il, jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <3938611E.D074F254@candelatech.com> <20000603091818.B48132@sfgoth.com> <20000605102627.A8473@saw.sw.com.sg> <393B56DD.34A83A7D@nbase.co.il> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: <393B56DD.34A83A7D@nbase.co.il>; from "Gleb Natapov" on Mon, Jun 05, 2000 at 07:29:33AM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello Gleb, On Mon, Jun 05, 2000 at 07:29:33AM +0000, Gleb Natapov wrote: > As Mitchell said, will I be able to run OSPF between VLANs? I actually > run zebra ospfd on vlans. Zebra has a strong notion of device. It relies > on device up, device done, change ip and other messages from netlink. I do not know how Zebra works, but the design described by you looks very broken at the first glance. If you run routing managements software on your system you should perform all kernel state changes only through this software. Thus, the software do not need any kernel feedback about device/ip state except the confirmations of its own commands. > Will I be able to do that if vlan will be implemented not as a device. > It looks like vlan will be useless without device interface, at least > for me. If Zebra works as you've told me, try to include their developers into the discussion, too. Best regards Andrey V. 
Savochkin From owner-netdev@oss.sgi.com Mon Jun 5 07:52:09 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:36 -0700 Received: from dhcp41.toaster.net ([199.108.84.41]:11027 "EHLO schmee.sfgoth.com") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:56:00 -0700 Received: (from mitch@localhost) by schmee.sfgoth.com (8.9.3/8.9.3) id FAA84622; Mon, 5 Jun 2000 05:53:44 -0700 (PDT) Date: Mon, 5 Jun 2000 05:53:44 -0700 From: Mitchell Blank Jr To: jamal Cc: Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? Message-ID: <20000605055344.D77216@sfgoth.com> References: <393B414B.ACF84B1B@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: ; from hadi@cyberus.ca on Mon, Jun 05, 2000 at 07:34:57AM -0400 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > Indeed the routing slow path is affected and i would think this is very > critical. Could someone enumerate the EXACT conditions under which this occurs (IP using inet_select_addr to linearly go through each device looking for ok link-scope routes rather than using the normal routing table)? Remember that inet_select_addr() will only do the search of all devices if it didn't find a match in the specified device's ifa. I suspect this is a rare occurrence, but I could be wrong I guess. 
-Mitch From owner-netdev@oss.sgi.com Mon Jun 5 07:52:09 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:37 -0700 Received: from dhcp41.toaster.net ([199.108.84.41]:11027 "EHLO schmee.sfgoth.com") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:56:01 -0700 Received: (from mitch@localhost) by schmee.sfgoth.com (8.9.3/8.9.3) id FAA84536; Mon, 5 Jun 2000 05:35:33 -0700 (PDT) Date: Mon, 5 Jun 2000 05:35:33 -0700 From: Mitchell Blank Jr To: Andrey Savochkin Cc: Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il, jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? Message-ID: <20000605053533.C77216@sfgoth.com> References: <3938611E.D074F254@candelatech.com> <20000603091818.B48132@sfgoth.com> <20000605102627.A8473@saw.sw.com.sg> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: <20000605102627.A8473@saw.sw.com.sg>; from saw@saw.sw.com.sg on Mon, Jun 05, 2000 at 10:26:27AM +0800 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andrey Savochkin wrote: > I want to ad my $0.02. Certainly appreciated. > Network devices in the second sense is only an abstraction. Well, everything in networking is an abstraction.. we're just trying to pick the right one. :-) > Linux kernel do not bind IP addresses for devices. OK, I being a bit sloppy with the terminology when stating my case. What I was trying to demonstrate is why atm cards aren't net_devices's currently. If we consider that "net_devices" are things like "eth0", "ppp0", or "lo" then an atm card isn't at the same level. L3 protocols don't run directly on top of an ATM card... 
you can certainly run protocols capable of doing L3 on top of an ATM card (say, CLIP, for example), but there isn't a strict 1-to-1 or even N-to-1 correspondence between these protocols and cards (you could have 2 CLIP networks running over a single card; a CLIP network with two PVCs on different ATM cards; you could have any combination of the above). So it's pretty clear that the net_device has to be the CLIP (or LANE, or whatever) network, not the raw ATM card. Now, using this form of analysis, at what level is an ethernet VLAN? Well, what can a VLAN do? Well, it can implement the SIOC[GS]* ioctls (i.e. it can be taken up or down), it can keep a net_dev_stats, it can have entries in the ARP table (or AARP for that matter), it can participate in a bridge group (net_bridge_port->dev), it can report separate statistics via SNMP (and thus should have its own dev->ifindex), it can want different settings in /proc/sys/net/ipv4/{neigh,conf}/DEV/, can have separate IPv6 networks with different autoconfiguration (see ipv6/addrconf.c), it could have IPX networks of different ipx_dlink_type's, we could want to bind an AF_PACKET socket to it (tcpdump, etc), we could want to make filtering or policy route decisions based on incoming VLAN, etc. In short, a VLAN can do just about anything a "real" ethernet interface can do - the only exceptions are that it probably shouldn't be split into another level of VLANs (well, it could, but what other switch or OS would support such a thing? :-), and it needs to coordinate with the master device to handle multicast/promiscuous mode. So I think that any solution that suggests VLAN devices should be completely different from physical network devices (in the eyes of userland and all network protocols) is SERIOUSLY suspicious. The other thing is how would it work? Assuming all the VLANs were part of one net_device, what happens when a packet gets output to that device? 
Well, we have to determine the VLAN ID somehow, so we'll need each device to implement an ARP-like table to map the destination hwaddr to a vlan id, right? Remember this has to happen for every packet - when using normal net_devices, the destination gets cached with each connected socket, so this adds a per-packet lookup where there was none before.

I think a reasonable goal for VLAN support would be that a machine with one ethernet card on two VLANs should be able to do pretty much the same as a machine with two separate ethernet cards, with similar configuration commands. That is, after all, the promise of VLANs.

> Netfilters isn't a big problem, too. A specific VLAN-id matching netfilter
> module is a clean and powerful solution.

Yes (and we definitely need per-interface in/out netfilter hooks like a "real" router in order to handle lots of ports), but that's just one of the many differences listed above.

> It misses one of the most important properties of network devices -
> flow control. Any code that doesn't provide flow control isn't a device, but a
> code just manipulating of packet contents.

That's nice in theory, but it doesn't work in practice. The idea of what makes a net_device includes a lot of things (do a grep for net_device in net/*/*.[ch] once). MANY protocols rely on each lan having its own net_device. Flow control might not be perfect in VLANs, but that's really the least of its problems.

The linux model of handling flow control can only really handle simple devices that can be modeled as a FIFO queue. Take for example the current mess we have in the ATM stack - suppose you have an atm net_device (can be CLIP or LANE, doesn't matter) with two PVCs to host1 and host2. These go across different networks, and thus have different QoS available - host1 is across a frac-T1 link while host2 is right on our local OC-12 switch. Suppose host1 gets enough traffic that the PVC becomes full - what do we do?

1. We could netif_stop_queue, but that would shut off all connectivity to host2, even though we could easily get packets to it
2. We could just drop packets to host1, but then we're not providing the necessary backpressure to make net schedulers work

Now before someone says "that's why ATM cards should be net_devices and their protocols shouldn't be" keep in mind that ATM devices provide packet scheduling to maintain the QoS on each VC, so using the 1-1 correspondence between flow-controlled FIFO queues and net_devices we might need *thousands* of net_devices to really model a single ATM card.

So I agree that flow control is an important issue, but really we need to separate it from net_device anyway. Ideally it could be abstracted to a level where the ATM code could implement it on each VC itself rather than putting it before the net_device, so whether a packet needs to be queued would be decided based on which VC we want to send the data down. This would also have the small advantage of liberating devices where flow control doesn't make sense anyway (lo, dummy)

-Mitch

From owner-netdev@oss.sgi.com Mon Jun 5 07:52:10 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:38 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:59665 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:56:50 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id IAA23546; Mon, 5 Jun 2000 08:33:16 -0700 Message-ID: <393BC83B.67AD86BF@candelatech.com> Date: Mon, 05 Jun 2000 08:33:15 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: Mitchell Blank Jr CC: jamal , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???
References: <20000603091818.B48132@sfgoth.com> <20000604214856.B77216@sfgoth.com> <393B414B.ACF84B1B@candelatech.com> <20000605060321.E77216@sfgoth.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

Mitchell Blank Jr wrote:
>
> > The ppp component itself could keep track of things itself too, and
> > just never call the dev_alloc_name, perhaps with a list as you mentioned...
>
> Good point. The only place this falls down is if two unrelated network
> devices are using the same naming scheme (like they both want to use
> "abc0", "abc1", etc). It will still work, of course, but will have
> lots of collisions so it will tend towards N^2 performance just like
> now. As long as only one of the contenders wants to make a lot
> of devices, it's just fine. So you're right, this is the way to go.

With a little thought and community effort, we should be able to not have any such collisions as well... until someone implements the ability to (re)name devices from user-space :)

> So if we can get the inet_select_addr() thing sorted then the entire
> impact of this change is to add two btrees to core/dev.c -- one
> for looking up net_devices by name, another for looking them up by
> ifindex. Since these lookups are abstracted through the
> __dev_get_by_{name,index}() calls we could even make this
> CONFIG_SUPPORT_LOTSA_NET_DEVICES without too much gross code.

Seems a hashtable would be nice for the ifindex.... Also, if it takes out a linear search in a critical path (with seemingly minimal overhead), then I don't even think it should be a configurable option, just *IN* there!
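
For illustration only, the ifindex hashtable idea could look roughly like the userspace sketch below. It is not kernel code: struct toy_dev, the toy_dev_* helpers and the bucket count are hypothetical stand-ins for struct net_device and the __dev_get_by_index() path, and locking is left out entirely.

```c
#include <stddef.h>

#define IFINDEX_HASH_BITS 8
#define IFINDEX_HASH_SIZE (1u << IFINDEX_HASH_BITS)

/* Hypothetical stand-in for struct net_device; only the fields used here. */
struct toy_dev {
    int ifindex;
    const char *name;
    struct toy_dev *hash_next;   /* next device in the same bucket */
};

static struct toy_dev *ifindex_hash[IFINDEX_HASH_SIZE];

/* ifindex values are handed out sequentially, so masking spreads them evenly. */
unsigned int ifindex_bucket(int ifindex)
{
    return (unsigned int)ifindex & (IFINDEX_HASH_SIZE - 1);
}

/* Link a device at the head of its bucket chain. */
void toy_dev_hash_add(struct toy_dev *dev)
{
    unsigned int b = ifindex_bucket(dev->ifindex);

    dev->hash_next = ifindex_hash[b];
    ifindex_hash[b] = dev;
}

/* Unlink a device from its bucket chain, if present. */
void toy_dev_hash_del(struct toy_dev *dev)
{
    struct toy_dev **pp = &ifindex_hash[ifindex_bucket(dev->ifindex)];

    while (*pp) {
        if (*pp == dev) {
            *pp = dev->hash_next;
            dev->hash_next = NULL;
            return;
        }
        pp = &(*pp)->hash_next;
    }
}

/* Walk one short chain instead of the whole device list. */
struct toy_dev *toy_dev_get_by_index(int ifindex)
{
    struct toy_dev *dev;

    for (dev = ifindex_hash[ifindex_bucket(ifindex)]; dev; dev = dev->hash_next)
        if (dev->ifindex == ifindex)
            return dev;
    return NULL;
}
```

With 256 buckets and a few thousand devices each chain stays around a dozen entries long, which is the kind of fixed small cost argued for above; the bucket count is a guess, not a measured value.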
> > -Mitch -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jun 5 07:52:11 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:38 -0700 Received: from dhcp41.toaster.net ([199.108.84.41]:11027 "EHLO schmee.sfgoth.com") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:56:02 -0700 Received: (from mitch@localhost) by schmee.sfgoth.com (8.9.3/8.9.3) id VAA80915; Sun, 4 Jun 2000 21:48:57 -0700 (PDT) Date: Sun, 4 Jun 2000 21:48:57 -0700 From: Mitchell Blank Jr To: jamal Cc: Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? Message-ID: <20000604214856.B77216@sfgoth.com> References: <20000603091818.B48132@sfgoth.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: ; from hadi@cyberus.ca on Sun, Jun 04, 2000 at 09:53:12AM -0400 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > We dont wanna start changing the whole network stack just so that > we can fit in VLANS, do we? I don't think we need to. There may be some minor changes needed in the 2.5 timeframe in order to efficiently support a lot of devices. More on that later. > based on VLANS; thats why building a table is so useful. Put your > ACL on the "VLAN table"; someone makes a policy call that gets inserted > into the VLAN table of the appropriate device (which in itself might be a > simple pointer to a general filter database eg iptables). > I dont have a problem with extending things like dst cache to have a > pointer to some VLAN entry OK, so we can hack on filters. For extra credit, imagine the same scenario I described, but now the linux router is running gated to thake OSPF from the two VLANs... 
how is that going to work if userland sees only one network device?

As to the difficulty of working with many interfaces - I did a quick survey of all the code in net/ that looks things up in the dev_base list. Here's the executive summary - if anyone wants more details I can type up more stuff from my notes.

The main user is of course core/dev.c - it provides (currently linear) search functions to retrieve a struct net_device by name or index. (There's also a search by hwaddr, but that is only used if you run "/sbin/arp -D hostname hw_addr pub" apparently... not a big deal.) These searches could be expedited by using a tree or hash, which would solve the bulk of the problem.

The one tricky part is the dev_alloc_name() function... its current algorithm is to search for "prefix0" through "prefix99" and if they're all in use bail out. Not only does this mean that you won't naturally end up with "ppp100", the search is N^2 (meaning the time to set up N of them is N^3). The best algorithm compromise isn't clear - I'd say that for each prefix in use ("ppp", "eth") keep an ordered linked list of them, that way you can quickly linearly scan looking for the lowest hole. An added optimization would be to hold a "next_to_try_after" pointer into the list (after adding an interface, set to self; when deleting an interface, set to our previous one if it's before next_to_try_after), which greatly reduces the length of the search for many common access patterns. This sounds complicated, but would actually be pretty easy to implement using linux/list.h - just keep the next_to_try_after as the first element and just rotate the list around it as it changes.

The trickier part is where the IP stack searches through the list itself (ipv4/devinet.c:inet_select_addr(), ipv6/addrconf.c:ipv6_get_saddr())... I assume these are just used for corner cases, right? I assume that things on the fast path (selecting source IPs for outgoing packets) are taken care of by the routing code, right?!?
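
A minimal userspace sketch of the ordered-list idea for dev_alloc_name() described above, assuming one sorted list of in-use unit numbers per prefix. struct unit_node, alloc_lowest_unit() and free_unit() are hypothetical names; the sketch uses a plain singly linked list rather than linux/list.h and leaves out the "next_to_try_after" hint.

```c
#include <stdlib.h>

/* Hypothetical node: one entry per unit number in use for a given prefix. */
struct unit_node {
    int unit;
    struct unit_node *next;   /* list is kept sorted by ascending unit */
};

/*
 * Find the lowest unused unit number, record it in the sorted list and
 * return it. Returns -1 if memory allocation fails.
 */
int alloc_lowest_unit(struct unit_node **head)
{
    struct unit_node **pp = head;
    struct unit_node *n;
    int want = 0;

    /* While there is no gap, the k-th node holds unit k. */
    while (*pp && (*pp)->unit == want) {
        pp = &(*pp)->next;
        want++;
    }

    n = malloc(sizeof(*n));
    if (!n)
        return -1;
    n->unit = want;
    n->next = *pp;            /* splice in at the hole (or at the tail) */
    *pp = n;
    return want;
}

/* Release a unit number so that a later allocation can reuse it. */
void free_unit(struct unit_node **head, int unit)
{
    struct unit_node **pp = head;

    while (*pp && (*pp)->unit < unit)
        pp = &(*pp)->next;
    if (*pp && (*pp)->unit == unit) {
        struct unit_node *dead = *pp;

        *pp = dead->next;
        free(dead);
    }
}
```

Each allocation is a single linear pass over the units of one prefix, with no name lookup per candidate, and nothing stops the numbering at 99. The hint pointer would shorten the scan further for the common "always append" pattern.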
Other protocols do things based on searching the list of all the devices (decnet, netrom, rose)... if that is a big problem they could easily maintain a list of devices that they're operating on. The other places that look at all the devices are things like /proc and netlink code where it explicitly needs to operate on all devices.

Anyway, assuming that those code paths in ipv4 and ipv6 aren't hotspots, some basic datastructure work in net/core/dev.c would remove the algorithmic limitations on hosting thousands of devices.

-Mitch

From owner-netdev@oss.sgi.com Mon Jun 5 07:52:12 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:39 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:47271 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 07:18:10 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id JAA08479; Sun, 4 Jun 2000 09:53:19 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id JAA16951; Sun, 4 Jun 2000 09:53:12 -0400 (EDT) Date: Sun, 4 Jun 2000 09:53:12 -0400 (EDT) From: jamal To: Mitchell Blank Jr cc: Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <20000603091818.B48132@sfgoth.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

On Sat, 3 Jun 2000, Mitchell Blank Jr wrote:

> Currently, atm *cards* don't show up as a network device (since they
> can't be bound directly to L3 protocols), but that may need to change

didnt know this.

> > You try adding all those thousands of
> > VLANs as devices and i can _guarantee you_ that you are not optimizing for
> > the common case.
>
> ...or thousands of routes, or gigabytes of RAM, etc... do you suggest we
> drop kernel support for those too?
It's all a matter of finding an > algorithm that scales. We dont wanna start changing the whole network stack just so that we can fit in VLANS, do we? Routes and gigabytes of RAM ? not comparing oranges to oranges here .. > > Again, how do I write an IP filter that discerns traffic on > two different VLANSs - for instance imagine we had one shared > ethernet switch that included > > * (VLAN 1) Router to outside world > * (VLAN 1, VLAN 2) Linux box acting as masquerading firewall > * (VLAN 2) Internal machines > > How do I do this under your system? Keep in mind that for security's > sake we need to make sure that traffic on VLAN1 doesn't spoof an > internal IP address. > based on VLANS; thats why building a table is so useful. Put your ACL on the "VLAN table"; someone makes a policy call that gets inserted into the VLAN table of the appropriate device (which in itself might be a simple pointer to a general filter database eg iptables). I dont have a problem with extending things like dst cache to have a pointer to some VLAN entry > > I have heard that the authors of 802.1q > > have already apologized to the world for the mess ;-> > > Do you have a reference, or is this just flamebait? > I heard about this in more than one place. I'll try and get you a reference. cheers, jamal From owner-netdev@oss.sgi.com Mon Jun 5 07:52:19 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:41 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:18605 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 07:29:56 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) 
with ESMTP id KAA17057; Sun, 4 Jun 2000 10:36:40 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id KAA17111; Sun, 4 Jun 2000 10:36:38 -0400 (EDT) Date: Sun, 4 Jun 2000 10:36:38 -0400 (EDT) From: jamal To: Ben Greear cc: rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <3939647F.E08A09AB@candelatech.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, 3 Jun 2000, Ben Greear wrote: > jamal wrote: > > > Infact i have never seen a single switch blade with more than 48 ports > > but even that is beside the point. The point really is the desiugn > > abstraction. > > I had a cisco with two FrameRelay 'ports' on it. I added 200 PVC > 'devices' to the cisco setup. Last time I'll mention it, so remember it! We are talking about two different things. 'ports' are _physical_. So you only had two ports; and unless you really understand CISCOs internal structuring, no point in making references to their 200 'devices' (that could be just the user interface showing stuff that people like to see). > > > I will argue that you _can not_ write a generic search algorithm for all > > these protocols. Unfortunately if you enforce one then the device search > > algorithm will have to be the same across the board. > > I see no need to even have a generic search algorithm, each protocol implementation > (ATM, FR, VLAN) can do whatever makes the most sense for it. > I dont follow. > > It goes without any arguement that we have a very good worst case estimate > > today, given the practical limits. You try adding all those thousands of > > VLANs as devices and i can _guarantee you_ that you are not optimizing for > > the common case. > > Ok, the question is where is the lookup 'hit' you are talking about. 
> Where is this searching that is slowing everything down? Don't just > say there is a hit, show me the specific code or logic where this hit takes place. > You register_netdevice() each VLAN device (because you have a device for each vlan). > For incomming pkts, the packet is detected in eth.c, as it comes off > of the hardware. I can immediately hash to find the VLAN device. > Constant time, worst case, O(n), where n is the number of physical ethernet > ports, and this is only when configured to allow 4096 VLANs PER Ethernet device, > which is fairly non-standard. > I am not gonna bitch about how many devices you have but in most cases 1024 per device is already overkill (including cross port VLANs). So in the worst case for 4 ethernet ports you have grown dev_base to over 15000 structures (in the worst case). Now go look at the associated code and tell me you dont see the repurcasions. If you know of a smart way to optimize that please post it. You might get me to support you. Most of the manipulating code doesnt run in the critical data path, but you are adding unnecessary noise, and besides my point is that _you dont_ need to have a device per VLAN; i might convert if you optimize it for everyone else. > I do, DHCP uses packet filters and uses hard-coded offsets into the > raw packet. The 4 extra bytes throw it off by 4, and so it never things > it gets a packet on the right port. See the patch on my web site if you > want to learn more. Ok so they use BPF. Either out of coolness or madness. Would packet socket have sufficed here for Linux ? 
cheers,
jamal

From owner-netdev@oss.sgi.com Mon Jun 5 07:52:19 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:41 -0700 Received: from atrey.karlin.mff.cuni.cz ([195.113.31.123]:63506 "EHLO atrey.karlin.mff.cuni.cz") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 07:29:39 -0700 Received: from bug.ucw.cz (root@slip15.ms.mff.cuni.cz [195.113.20.215]) by atrey.karlin.mff.cuni.cz (8.8.8/8.8.8) with ESMTP id NAA22255; Mon, 5 Jun 2000 13:48:07 +0200 Received: (from pavel@localhost) by bug.ucw.cz (8.8.8/8.8.5) id BAA01325; Mon, 5 Jun 2000 01:40:25 +0200 Message-ID: <20000605014025.A1183@bug.ucw.cz> Date: Mon, 5 Jun 2000 01:40:25 +0200 From: Pavel Machek To: kuznet@ms2.inr.ac.ru, Aki M Laukkanen Cc: linux-kernel@vger.rutgers.edu, netdev@oss.sgi.com Subject: Re: Slow TCP connection between linux and wince References: <200006031731.VAA10055@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93i In-Reply-To: <200006031731.VAA10055@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Sat, Jun 03, 2000 at 09:31:15PM +0400 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

Hi!

> > There was another error in my mail too, 2*130 is indeed over 200 but that
> > was irrelevant. However the conclusion seemed correct, delayed acks timing
> > out. Could you explain me why the window of 3100 seems to allow 2 packets
> > in flight but sender always waits for the next ack?
>
> No, tcpdump shows that sender always has full window.
> Receiver always ACKs _previous_ packet.
>
> RTT=0.9sec, window is 2 packets or ~3K, bandwidth is ~3K/sec.
> Tcpdump is perfectly smooth and correct except for the rate. 8)
>
> Probably, receiver delays each ACK by 500msec. I have no idea
> why it does this, because it is apparently illegal behaviour.
> Look:
>
> 1 02:59:02.875097 10.0.0.3.www > 192.168.55.100.1029: . 233601:235061(1460) ack 196 win 16060 (DF)
> 2 02:59:03.235031 192.168.55.100.1029 > 10.0.0.3.www: . ack 233601 win 3100 (DF)
> 3 02:59:03.235094 10.0.0.3.www > 192.168.55.100.1029: . 235061:236521(1460) ack 196 win 16060 (DF)
> 4 02:59:03.775064 192.168.55.100.1029 > 10.0.0.3.www: . ack 235061 win 3100 (DF)
> 5 02:59:03.775137 10.0.0.3.www > 192.168.55.100.1029: . 236521:237981(1460) ack 196 win 16060 (DF)
> 6 02:59:04.145061 192.168.55.100.1029 > 10.0.0.3.www: . ack 236521 win 3100 (DF)
>
> ACK #4 acks packet #1, sent 0.9sec before this (hence, rtt=0.9sec).
> We have two packets in transmit, so that packet #1 reaches receiver not later
> than (2*1500)/bandwidth after transmit. If bandwidth is 115Kbaud,
> it is not more ~0.3sec. Hence, packet #3 reaches receiver _before_
> receiver sent ACK for previous packet!!

For what it is worth, ping while under load (downloading the linux kernel, as always) looks like this:

root@bug:~# ping 192.168.55.100
PING 192.168.55.100 (192.168.55.100): 56 data bytes
64 bytes from 192.168.55.100: icmp_seq=0 ttl=32 time=519.0 ms
64 bytes from 192.168.55.100: icmp_seq=1 ttl=32 time=640.1 ms
64 bytes from 192.168.55.100: icmp_seq=2 ttl=32 time=760.0 ms
64 bytes from 192.168.55.100: icmp_seq=3 ttl=32 time=840.0 ms
64 bytes from 192.168.55.100: icmp_seq=4 ttl=32 time=989.2 ms
64 bytes from 192.168.55.100: icmp_seq=5 ttl=32 time=641.6 ms
64 bytes from 192.168.55.100: icmp_seq=6 ttl=32 time=820.0 ms
64 bytes from 192.168.55.100: icmp_seq=7 ttl=32 time=1000.2 ms
64 bytes from 192.168.55.100: icmp_seq=8 ttl=32 time=689.8 ms
64 bytes from 192.168.55.100: icmp_seq=9 ttl=32 time=859.8 ms
64 bytes from 192.168.55.100: icmp_seq=10 ttl=32 time=830.0 ms
64 bytes from 192.168.55.100: icmp_seq=12 ttl=32 time=679.9 ms
64 bytes from 192.168.55.100: icmp_seq=13 ttl=32 time=239.8 ms
64 bytes from 192.168.55.100: icmp_seq=14 ttl=32 time=530.0 ms
64 bytes from 192.168.55.100: icmp_seq=15 ttl=32 time=620.0 ms
64 bytes from 192.168.55.100: icmp_seq=16 ttl=32 time=700.0 ms

and ping when the link is idle looks like this:

root@bug:~# ping 192.168.55.100
PING 192.168.55.100 (192.168.55.100): 56 data bytes
64 bytes from 192.168.55.100: icmp_seq=0 ttl=32 time=117.2 ms
64 bytes from 192.168.55.100: icmp_seq=1 ttl=32 time=110.1 ms
64 bytes from 192.168.55.100: icmp_seq=2 ttl=32 time=110.0 ms
64 bytes from 192.168.55.100: icmp_seq=3 ttl=32 time=110.1 ms
64 bytes from 192.168.55.100: icmp_seq=4 ttl=32 time=110.1 ms
64 bytes from 192.168.55.100: icmp_seq=5 ttl=32 time=110.1 ms
64 bytes from 192.168.55.100: icmp_seq=6 ttl=32 time=110.0 ms
64 bytes from 192.168.55.100: icmp_seq=7 ttl=32 time=109.8 ms
64 bytes from 192.168.55.100: icmp_seq=8 ttl=32 time=110.1 ms
64 bytes from 192.168.55.100: icmp_seq=9 ttl=32 time=110.0 ms
64 bytes from 192.168.55.100: icmp_seq=10 ttl=32 time=110.1 ms
64 bytes from 192.168.55.100: icmp_seq=11 ttl=32 time=110.0 ms

> Conclusion is:
> 3. Or receiver has totally broken TCP, which sends delayed ACKs
> ~500msec after packet arrives, even if another packets

This is the windows ce stack; quite likely to be broken. I doubt it ever announces a window bigger than 3100 (is there a way to make it make its window bigger?).

> arrived during this period. I.e. delays ACK only to delay. 8)
> Probably, this stack even looked as working, if window were >2packets.
> Namely, if it were > bandwidth*0.9sec=~10K nobody
> even would notice this misbehaviour.

This is well possible. Their TCP stack is probably designed to work at 19200.

Is there a way for us to misbehave? I.e. what would happen if I forced the linux tcp stack to think their window is bigger than it really is? [Is there an easy way to break my tcp stack this way?]

Pavel
--
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care." Panos Katsaloulis describing me w.r.t.
patents me at discuss@linmodems.org From owner-netdev@oss.sgi.com Mon Jun 5 07:52:20 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:40 -0700 Received: from Huntington-Beach.Blue-Labs.org ([208.179.0.198]:17532 "EHLO Huntington-Beach.Blue-Labs.org") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 07:26:49 -0700 Received: from kalifornia.com (david@Huntington-Beach.Blue-Labs.org [208.179.0.198]) by Huntington-Beach.Blue-Labs.org (8.9.3/8.9.0) with ESMTP id SAA27485; Sun, 4 Jun 2000 18:12:18 -0700 Message-ID: <393AFE72.6991AD3E@kalifornia.com> Date: Sun, 04 Jun 2000 18:12:18 -0700 From: David Ford Reply-To: Blu3Viper Organization: Talon Technology, Intl. X-Mailer: Mozilla 4.73 [en] (X11; U; Linux 2.4.0-test1 i686) X-Accept-Language: en MIME-Version: 1.0 To: Rusty Russell CC: alan@lxorguk.ukuu.org.uk, netdev@oss.sgi.com Subject: [PATCH] 1 serial, 3 netfilter -- Re: [2 bugs] iptable_nat module makes a black hole for GRE packets, iptables only plays with ICMP/UDP/TCP References: <20000604175726.B06318154@halfway> Content-Type: multipart/mixed; boundary="------------FF42BFB5D8BADF7499EA772F" Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing This is a multi-part message in MIME format. --------------FF42BFB5D8BADF7499EA772F Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Ok, I massaged your patch around Rusty, and now everything netfilter/GRE looks like it's doing ok. Somehow my wireless stuff broke but I'll look at that separately. Attached is a conglomerate patch for: o SNAT happens o iptables is friendly with GRE packet routing and target LOG o OOPS in serial.c pnp guessing (temp fix) For reference, this is how things look: (T1 service) --------[ eth0 ]---[ greN ] | (inet) | (cable modem) -------[ eth0 ]---[ greN ] | +-------[ eth1 ]-------(hub) | +-------[ eth2 ]-------(wireless pcmcia) Everything from the lower set is tunneled to the upper set. 
Now that everyone is playing nicely (except wireless :P), generic traffic such as http can be NAT'd instead of consuming tunnel bandwidth :) Thankyou muchly for the help in this Rusty, you've no idea what a relaxing day this is going to be. -d Rusty Russell wrote: > In message <393A7A41.6ED4C7CB@kalifornia.com> you write: > > Alan first. Alan, please remove "GRE is broken" from the todo list. > > Replace it with "iptable_nat module breaks GRE" and add "iptables > > generally doesn't handle protocols outside the top 3". > > Belay that. These are the same `feature': entunnelled GRE packets > don't go through LOCAL_OUT. I didn't do that originally because I > wasn't sure what the best policy for tunnels. -- "The difference between 'involvement' and 'commitment' is like an eggs-and-ham breakfast: the chicken was 'involved' - the pig was 'committed'." --------------FF42BFB5D8BADF7499EA772F Content-Type: text/plain; charset=us-ascii; name="iptables-gre-serial.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="iptables-gre-serial.patch" diff -ruN linux/drivers/char/serial.c linux-fix/drivers/char/serial.c --- linux/drivers/char/serial.c Sun Jun 4 17:10:55 2000 +++ linux-fix/drivers/char/serial.c Sun Jun 4 17:31:52 2000 @@ -119,6 +119,7 @@ #undef SERIAL_DEBUG_RS_WAIT_UNTIL_SENT #undef SERIAL_DEBUG_PCI #undef SERIAL_DEBUG_AUTOCONF +#define SERIAL_DEBUG_PNP /* Sanity checks */ @@ -3345,7 +3346,7 @@ } /* - * This routine detect the IRQ of a serial port by clearing OUT2 when + * This routine detects the IRQ of a serial port by clearing OUT2 when * no UART interrupt are requested (IER = 0) (*GPL*). This seems to work at * each time, as long as no other device permanently request the IRQ. * If no IRQ is detected, or multiple IRQ appear, this function returns 0. @@ -3457,7 +3458,7 @@ * 16550, and why not? Startech doesn't seem to even acknowledge its * existence.) * - * What evil have men's minds wrought... + * What evil have mens' minds wrought... 
*/ static void autoconfig_startech_uarts(struct async_struct *info, struct serial_state *state, @@ -4900,9 +4901,14 @@ struct isapnp_resources *res = (struct isapnp_resources *)dev->sysdata; struct isapnp_resources *resa; - if (!(check_name(dev->name) || check_name(dev->bus->name)) && - !(check_compatible_id(dev))) - return 1; + printk("Checking resource inside serial_pnp_guess_board, %08lx (%s)\n", res, dev->bus->name); + + if (!(check_name(dev->name) || check_name(dev->bus->name)) && + !(check_compatible_id(dev))) + return 1; + + if (!res) + return 0; if (res->next) return 1; @@ -4959,8 +4965,9 @@ board.device == ISAPNP_DEVICE(0x1021))) board.flags |= SPCI_FL_NO_SHIRQ; } else { - if (serial_pnp_guess_board(dev, &board)) + if (serial_pnp_guess_board(dev, &board)) { continue; + } } if (board.flags & SPCI_FL_NO_SHIRQ) diff -ruN linux/net/ipv4/ip_gre.c linux-fix/net/ipv4/ip_gre.c --- linux/net/ipv4/ip_gre.c Mon May 22 09:50:55 2000 +++ linux-fix/net/ipv4/ip_gre.c Sun Jun 4 15:16:37 2000 @@ -35,6 +35,7 @@ #include #include #include +#include #ifdef CONFIG_IPV6 #include @@ -529,6 +530,52 @@ #endif } +#ifdef CONFIG_NETFILTER +/* To preserve the cute illusion that a locally-generated packet can + be mangled before routing, we actually reroute if a hook altered + the packet. -RR */ +static int route_me_harder(struct sk_buff *skb) +{ + struct iphdr *iph = skb->nh.iph; + struct rtable *rt; + + if (ip_route_output(&rt, iph->daddr, iph->saddr, + RT_TOS(iph->tos) | RTO_CONN, + skb->sk ? skb->sk->bound_dev_if : 0)) { + printk("route_me_harder: No more route.\n"); + return -EINVAL; + } + + /* Drop old route. */ + dst_release(skb->dst); + + skb->dst = &rt->u.dst; + return 0; +} +#endif + +/* Do route recalc if netfilter changes skb. 
*/ +static inline int +send_maybe_reroute(struct sk_buff *skb) +{ + struct ip_tunnel *tunnel = (struct ip_tunnel*)skb->dev->priv; + struct net_device_stats *stats = &tunnel->stat; + +#ifdef CONFIG_NETFILTER + if (skb->nfcache & NFC_ALTERED) { + if (route_me_harder(skb) != 0) { + kfree_skb(skb); + return -EINVAL; + } + } +#endif + stats->tx_bytes += skb->len; + stats->tx_packets++; + ip_send(skb); + tunnel->recursion--; + return 0; +} + int ipgre_rcv(struct sk_buff *skb, unsigned short len) { struct iphdr *iph = skb->nh.iph; @@ -827,12 +874,8 @@ skb->nfct = NULL; #endif - stats->tx_bytes += skb->len; - stats->tx_packets++; - ip_send(skb); - tunnel->recursion--; - return 0; - + return NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, skb, NULL, rt->u.dst.dev, + send_maybe_reroute); tx_error_icmp: dst_link_failure(skb); diff -ruN linux/net/ipv4/netfilter/ip_nat_core.c linux-fix/net/ipv4/netfilter/ip_nat_core.c --- linux/net/ipv4/netfilter/ip_nat_core.c Fri Mar 17 10:56:20 2000 +++ linux-fix/net/ipv4/netfilter/ip_nat_core.c Sun Jun 4 14:25:26 2000 @@ -269,7 +269,7 @@ unsigned int score; struct ip_conntrack_tuple tuple; } best = { NULL, 0xFFFFFFFF }; - u_int32_t *var_ipp, *other_ipp, saved_ip; + u_int32_t *var_ipp, *other_ipp, saved_ip, orig_dstip; if (HOOK2MANIP(hooknum) == IP_NAT_MANIP_SRC) { var_ipp = &tuple->src.ip; @@ -280,6 +280,9 @@ saved_ip = tuple->src.ip; other_ipp = &tuple->src.ip; } + /* Don't do do_extra_mangle unless neccessary (overrides + explicit socket bindings, for example) */ + orig_dstip = tuple->dst.ip; IP_NF_ASSERT(mr->rangesize >= 1); for (i = 0; i < mr->rangesize; i++) { @@ -306,7 +309,8 @@ *other_ipp = saved_ip; if (hooknum == NF_IP_LOCAL_OUT - && !do_extra_mangle(*var_ipp, other_ipp)) { + && *var_ipp != orig_dstip + && !do_extra_mangle(*var_ipp, other_ipp)) { DEBUGP("Range %u %u.%u.%u.%u rt failed!\n", i, IP_PARTS(*var_ipp)); /* Can't route? 
This whole range part is diff -ruN linux/scripts/lxdialog/Makefile linux-fix/scripts/lxdialog/Makefile --- linux/scripts/lxdialog/Makefile Sun Jun 4 17:10:56 2000 +++ linux-fix/scripts/lxdialog/Makefile Sun Jun 4 00:04:02 2000 @@ -10,7 +10,7 @@ ifeq (/usr/include/ncurses.h, $(wildcard /usr/include/ncurses.h)) HOSTCFLAGS += -DCURSES_LOC="" else - HOSTCFLAGS += -DCURSES_LOC="" + HOSTCFLAGS += -I/usr/local/include -DCURSES_LOC="" endif endif endif --------------FF42BFB5D8BADF7499EA772F-- From owner-netdev@oss.sgi.com Mon Jun 5 07:52:20 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:42 -0700 Received: from ppp66.arobas.net ([205.205.36.136]:62725 "HELO dialin156.ottawa.globalserve.net") by oss.sgi.com with SMTP id ; Mon, 5 Jun 2000 07:30:38 -0700 Received: (qmail 699 invoked by uid 1000); 4 Jun 2000 22:06:51 -0000 Date: Sun, 4 Jun 2000 18:06:51 -0400 From: Jerome Etienne To: netdev@oss.sgi.com Subject: [PATCH] IFA_F_NO_NDISC (for vrrp) Message-ID: <20000604180651.A678@long-haul.net> Reply-To: jetienne@arobas.net Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="82I3+IH0IqGh5yIs" User-Agent: Mutt/1.0i Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing --82I3+IH0IqGh5yIs Content-Type: text/plain; charset=us-ascii Hi, This patch adds a bit in in_ifaddr.ifa_flag (IFA_F_NO_NDISC). If it is set, the host doesn't answer to the neighbour's discovery request for this address. Currently it is honored only by ipv4/arp. This feature is usefull to support VRRP(rfc2338) which requires to answer the ARP request for a 'virtual ip' with the proper 'virtual MAC'(rfc2338 section 8.2). So to potentially answer a particular MAC for a particular IP, and not with the primary MAC. As far as i know, it is currently impossible because linux assumes to have a single MAC per physical interface. My plan is to prevent the kernel from answering for the virtual ip addresses and to answer from userspace. 
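
As a rough illustration of that plan (a sketch, not part of the patch below): RFC 2338 defines the virtual router MAC address as 00-00-5E-00-01-{VRID}, and a userspace responder would put that address, rather than the interface's own, into its ARP replies. The helper names here are hypothetical, and the IFA_F_NO_NDISC value simply mirrors the one proposed in this patch; it is not an established kernel constant.

```c
#include <stdint.h>

/* Flag value as proposed in the patch in this message (an assumption). */
#define IFA_F_NO_NDISC 0x02

/*
 * Fill in the RFC 2338 virtual router MAC address, 00-00-5E-00-01-{VRID},
 * for a virtual router ID in the range 1..255.
 */
void vrrp_virtual_mac(uint8_t vrid, uint8_t mac[6])
{
    mac[0] = 0x00;
    mac[1] = 0x00;
    mac[2] = 0x5e;
    mac[3] = 0x00;
    mac[4] = 0x01;
    mac[5] = vrid;
}

/*
 * Mirrors the check added to arp.c by the patch: the kernel stays silent
 * for a local address carrying the flag, leaving the reply to userspace.
 */
int kernel_should_answer_arp(unsigned int ifa_flags)
{
    return !(ifa_flags & IFA_F_NO_NDISC);
}
```

With the flag set on a virtual IP, the daemon owning that VRID would be the only party replying, so two virtual routers on the same physical interface can each advertise their own MAC.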
It is the less intrusive solution i found to support several 'virtual routers' per physical interface. My vrrpd implementation runs entirely in userspace but without this feature it can't support several virtual routers per physical interface. I am not sure that read_lock(&in_dev->lock) in inet_ifa_bylocal() is the good way to lock but it seems the most reasonable to me. please correct me if needed. The patch is in 2 parts: o the kernel 2.4.0-test1 modifications o the iproute2-2.2.4-now-ss000305 modifications to set/report the no_ndisc bit from userspace. --82I3+IH0IqGh5yIs Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="patch_no_ndisc.kernel" diff -ur --exclude=*.[ao] --exclude=.* linux/net/ipv4/devinet.c linux-2.4.0-test1/net/ipv4/devinet.c --- linux/net/ipv4/devinet.c Sun Jan 9 00:36:20 2000 +++ linux-2.4.0-test1/net/ipv4/devinet.c Sun Jun 4 11:01:47 2000 @@ -357,6 +357,19 @@ return NULL; } +struct in_ifaddr *inet_ifa_bylocal(struct in_device *in_dev, u32 local ) +{ + read_lock(&in_dev->lock); + for_ifa(in_dev) { + if (ifa->ifa_local == local) { + read_unlock(&in_dev->lock); + return ifa; + } + } endfor_ifa(in_dev); + read_unlock(&in_dev->lock); + return NULL; +} + #ifdef CONFIG_RTNETLINK int diff -ur --exclude=*.[ao] --exclude=.* linux/net/ipv4/arp.c linux-2.4.0-test1/net/ipv4/arp.c --- linux/net/ipv4/arp.c Wed Apr 26 15:13:17 2000 +++ linux-2.4.0-test1/net/ipv4/arp.c Sun Jun 4 10:39:28 2000 @@ -65,6 +65,7 @@ * clean up the APFDDI & gen. FDDI bits. * Alexey Kuznetsov: new arp state machine; * now it is in net/core/neighbour.c. 
+ * Jerome Etienne : support IFA_F_NO_NDISC */ /* RFC1122 Status: @@ -719,6 +720,10 @@ addr_type = rt->rt_type; if (addr_type == RTN_LOCAL) { + struct in_ifaddr *ifa = inet_ifa_bylocal( in_dev, tip ); + /* if the target address is IFA_F_NONDISC, dont reply */ + if( ifa && (ifa->ifa_flags & IFA_F_NO_NDISC) ) + goto out; n = neigh_event_ns(&arp_tbl, sha, &sip, dev); if (n) { arp_send(ARPOP_REPLY,ETH_P_ARP,sip,dev,tip,sha,dev->dev_addr,sha); diff -ur --exclude=*.[ao] --exclude=.* linux/include/linux/inetdevice.h linux-2.4.0-test1/include/linux/inetdevice.h --- linux/include/linux/inetdevice.h Mon Aug 23 13:01:02 1999 +++ linux-2.4.0-test1/include/linux/inetdevice.h Sat Jun 3 21:38:43 2000 @@ -80,6 +80,7 @@ extern struct in_device *inetdev_by_index(int); extern u32 inet_select_addr(const struct net_device *dev, u32 dst, int scope); extern struct in_ifaddr *inet_ifa_byprefix(struct in_device *in_dev, u32 prefix, u32 mask); +extern struct in_ifaddr *inet_ifa_bylocal(struct in_device *in_dev, u32 local); extern void inet_forward_change(void); extern __inline__ int inet_ifa_match(u32 addr, struct in_ifaddr *ifa) diff -ur --exclude=*.[ao] --exclude=.* linux/include/linux/rtnetlink.h linux-2.4.0-test1/include/linux/rtnetlink.h --- linux/include/linux/rtnetlink.h Wed Feb 9 23:08:09 2000 +++ linux-2.4.0-test1/include/linux/rtnetlink.h Sun Jun 4 10:47:55 2000 @@ -312,6 +312,8 @@ /* ifa_flags */ #define IFA_F_SECONDARY 0x01 +#define IFA_F_NO_NDISC 0x02 /* no ndisc's answer for this address */ +/* jme- why this hole between secondary(0x01) and deprecated(0x20) ? 
*/ #define IFA_F_DEPRECATED 0x20 #define IFA_F_TENTATIVE 0x40 --82I3+IH0IqGh5yIs Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="patch_no_ndisc.iproute2" diff -rNu iproute2-2.2.4-now-ss000305/ip/ipaddress.c iproute2-2.2.4-now-ss000305_no_ndisc/ip/ipaddress.c --- iproute2-2.2.4-now-ss000305/ip/ipaddress.c Sun Mar 5 14:33:32 2000 +++ iproute2-2.2.4-now-ss000305_no_ndisc/ip/ipaddress.c Sat Jun 3 23:10:03 2000 @@ -10,6 +10,7 @@ * * Changes: * Laszlo Valko 990223: address label must be zero terminated + * Jerome Etienne 000603: support of no_ndisc */ #include @@ -65,11 +66,11 @@ fprintf(stderr, " [ to PREFIX ] [ FLAG-LIST ] [ label PATTERN ]\n"); fprintf(stderr, "IFADDR := PREFIX | ADDR peer PREFIX\n"); fprintf(stderr, " [ broadcast ADDR ] [ anycast ADDR ]\n"); - fprintf(stderr, " [ label STRING ] [ scope SCOPE-ID ]\n"); + fprintf(stderr, " [ label STRING ] [ scope SCOPE-ID ] [ no_ndisc ]\n"); fprintf(stderr, "SCOPE-ID := [ host | link | global | NUMBER ]\n"); fprintf(stderr, "FLAG-LIST := [ FLAG-LIST ] FLAG\n"); fprintf(stderr, "FLAG := [ permanent | dynamic | secondary | primary |\n"); - fprintf(stderr, " tentative | deprecated ]\n"); + fprintf(stderr, " tentative | deprecated | no_ndisc ]\n"); exit(-1); } @@ -401,6 +402,10 @@ ifa->ifa_flags &= ~IFA_F_DEPRECATED; fprintf(fp, "deprecated "); } + if (ifa->ifa_flags&IFA_F_NO_NDISC) { + ifa->ifa_flags &= ~IFA_F_NO_NDISC; + fprintf(fp, "no_ndisc "); + } if (!(ifa->ifa_flags&IFA_F_PERMANENT)) { fprintf(fp, "dynamic "); } else @@ -540,6 +545,9 @@ } else if (strcmp(*argv, "deprecated") == 0) { filter.flags |= IFA_F_DEPRECATED; filter.flagmask |= IFA_F_DEPRECATED; + } else if (strcmp(*argv, "no_ndisc") == 0) { + filter.flags |= IFA_F_NO_NDISC; + filter.flagmask |= IFA_F_NO_NDISC; } else if (strcmp(*argv, "label") == 0) { NEXT_ARG(); filter.label = *argv; @@ -737,8 +745,10 @@ req.ifa.ifa_family = preferred_family; while (argc > 0) { - if (strcmp(*argv, "peer") == 0 || - strcmp(*argv, 
"remote") == 0) { + if ( strcmp(*argv, "no_ndisc") == 0 ){ + req.ifa.ifa_flags |= IFA_F_NO_NDISC; + } else if (strcmp(*argv, "peer") == 0 || + strcmp(*argv, "remote") == 0) { NEXT_ARG(); if (peer_len) --82I3+IH0IqGh5yIs-- From owner-netdev@oss.sgi.com Mon Jun 5 07:52:20 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:44 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:25618 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 07:39:10 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id KAA13946; Sun, 4 Jun 2000 10:55:26 -0700 Message-ID: <393A980D.181029E6@candelatech.com> Date: Sun, 04 Jun 2000 10:55:25 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > On Sat, 3 Jun 2000, Ben Greear wrote: > > Ok, the question is where is the lookup 'hit' you are talking about. > > Where is this searching that is slowing everything down? Don't just > > say there is a hit, show me the specific code or logic where this hit takes place. > > > > You register_netdevice() each VLAN device (because you have a device for > each vlan). True, for __non-critical__ path, like setup and destroy, there may be a linear hit, based on number of devices. That does not bother me, because this will happen once or twice every couple of days or so (or when you boot...) on average I think. Am I missing something that happens more often here? > So in the worst case for 4 ethernet ports you have grown dev_base to over > 15000 structures (in the worst case). 
Now go look at the associated code > and tell me you don't see the repercussions. I give the user the flexibility to have as many VLANs as she desires. I have not seen any critical-path performance hit relating to more devices. The non-critical path hits that do exist should be linear, which isn't really too bad, and they should be extremely rare, statistically speaking. I do not expect anyone to actually configure 16k VLAN devices, but if they have a need, and the hardware/memory to run it with the performance they can live with, then I'll not bother them about it. Once it's set up (which might take a while), then it should run fairly fast, I think. > Most of the manipulating code doesn't run in the critical data path, but > you are adding unnecessary noise, and besides my point is that _you don't_ > need to have a device per VLAN; i might convert if you optimize it for > everyone else. Optimize what for everyone else? And I disagree with your assumption about not needing a device per VLAN anyway, as have others, with excellent arguments. Does any of the manipulating code run in the critical path? I really would be interested if there is anything in the critical path that is unduly burdened by lots of VLAN (or other) interfaces... > Ok so they use BPF. Either out of coolness or madness. Would packet socket > have sufficed here for Linux ? I think it would have; someone even mentioned that DHCPd would probably work if just compiled to not use BPF (there are other options, like raw sockets, I think). Haven't tried that though... 
Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jun 5 07:52:20 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:42 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:25618 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 07:39:10 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id LAA13981; Sun, 4 Jun 2000 11:14:59 -0700 Message-ID: <393A9CA2.AA3627F6@candelatech.com> Date: Sun, 04 Jun 2000 11:14:58 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Mitchell Blank Jr , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > On Sat, 3 Jun 2000, Mitchell Blank Jr wrote: > > ...or thousands of routes, or gigabytes of RAM, etc... do you suggest we > > drop kernel support for those too? It's all a matter of finding an > > algorithm that scales. > > We dont wanna start changing the whole network stack just so that > we can fit in VLANS, do we? Excellent point. I do *NOT* want to change the whole network stack, and I don't want to change all of the user-space programs (tcpdump, ip, ifconfig, arp, route...) either... Note that I have changed none of these, with the exception of dhcpd (and Gleb & Co's solution of tweaking the ethernet header fixes the dhcpd problem anyway), and everything just seems to work! 
Perhaps we could put this whole performance thing to rest if someone could add 500 or so VLANs to a box and collect some aggregate throughput numbers or something? A gigabit NIC would probably be best because that should take the link speed out of the picture. Someone send me two and I'll do the benchmark :) If there are hits, then 500 devices should make it apparent... Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jun 5 07:52:20 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:51:44 -0700 Received: from linuxcare.com.au ([203.29.91.49]:63242 "EHLO front.linuxcare.com.au") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 06:47:47 -0700 Received: from halfway (penicillin.linuxcare.com.au [10.61.2.27]) by front.linuxcare.com.au (8.9.3/8.9.3/Debian 8.9.3-21) with ESMTP id DAA02826; Mon, 5 Jun 2000 03:57:31 +1000 X-Authentication-Warning: front.linuxcare.com.au: Host penicillin.linuxcare.com.au [10.61.2.27] claimed to be halfway Received: from linuxcare.com.au (localhost [127.0.0.1]) by halfway (Postfix) with ESMTP id B06318154; Mon, 5 Jun 2000 03:27:26 +0930 (CST) From: Rusty Russell To: david@kalifornia.com Cc: alan@lxorguk.ukuu.org.uk, netdev@oss.sgi.com Subject: Re: [2 bugs] iptable_nat module makes a black hole for GRE packets, iptables only plays with ICMP/UDP/TCP In-reply-to: Your message of "Mon, 05 Jun 2000 01:50:02 +1000." <393A7A41.6ED4C7CB@kalifornia.com> Date: Mon, 05 Jun 2000 03:57:26 +1000 Message-Id: <20000604175726.B06318154@halfway> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing In message <393A7A41.6ED4C7CB@kalifornia.com> you write: > Alan first. Alan, please remove "GRE is broken" from the todo list. > Replace it with "iptable_nat module breaks GRE" and add "iptables > generally doesn't handle protocols outside the top 3". 
Belay that. These are the same `feature': entunnelled GRE packets don't go through LOCAL_OUT. I didn't do that originally because I wasn't sure what the best policy for tunnels. It's since become clear (with usage), that tunnels should pass entunnelled packets through the NF_IP_LOCAL_OUT hook. David, please try this patch (may need some mangling, it's off the top of my head). If it helps, I'll add it to the pile. --- working-2.4.0-test1/net/ipv4/ip_gre.c.~1~ Thu May 25 12:41:52 2000 +++ working-2.4.0-test1/net/ipv4/ip_gre.c Mon Jun 5 03:51:47 2000 @@ -529,6 +529,49 @@ #endif } +#ifdef CONFIG_NETFILTER +/* To preserve the cute illusion that a locally-generated packet can + be mangled before routing, we actually reroute if a hook altered + the packet. -RR */ +static int route_me_harder(struct sk_buff *skb) +{ + struct iphdr *iph = skb->nh.iph; + struct rtable *rt; + + if (ip_route_output(&rt, iph->daddr, iph->saddr, + RT_TOS(iph->tos) | RTO_CONN, + skb->sk ? skb->sk->bound_dev_if : 0)) { + printk("route_me_harder: No more route.\n"); + return -EINVAL; + } + + /* Drop old route. */ + dst_release(skb->dst); + + skb->dst = &rt->u.dst; + return 0; +} +#endif + +/* Do route recalc if netfilter changes skb. */ +static inline int +send_maybe_reroute(struct sk_buff *skb) +{ +#ifdef CONFIG_NETFILTER + if (skb->nfcache & NFC_ALTERED) { + if (route_me_harder(skb) != 0) { + kfree_skb(skb); + return -EINVAL; + } + } +#endif + stats->tx_bytes += skb->len; + stats->tx_packets++; + ip_send(skb); + tunnel->recursion--; + return 0; +} + int ipgre_rcv(struct sk_buff *skb, unsigned short len) { struct iphdr *iph = skb->nh.iph; @@ -827,12 +870,8 @@ skb->nfct = NULL; #endif - stats->tx_bytes += skb->len; - stats->tx_packets++; - ip_send(skb); - tunnel->recursion--; - return 0; - + return NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, skb, NULL, rt->u.dst.dev, + send_maybe_reroute); tx_error_icmp: dst_link_failure(skb); -- Hacking time. 
From owner-netdev@oss.sgi.com Mon Jun 5 07:53:39 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 07:53:29 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:59521 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Mon, 5 Jun 2000 07:53:18 -0700 Received: (qmail 8558 invoked by uid 577); 5 Jun 2000 02:26:27 -0000 Message-ID: <20000605102627.A8473@saw.sw.com.sg> Date: Mon, 5 Jun 2000 10:26:27 +0800 From: Andrey Savochkin To: Mitchell Blank Jr , Ben Greear Cc: rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il, jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <3938611E.D074F254@candelatech.com> <20000603091818.B48132@sfgoth.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: <20000603091818.B48132@sfgoth.com>; from "Mitchell Blank Jr" on Sat, Jun 03, 2000 at 09:18:18AM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, I want to add my $0.02. On Sat, Jun 03, 2000 at 09:18:18AM -0700, Mitchell Blank Jr wrote: > > Devices map to physical devices i.e ports in your lingo. How many of those > > do you see in your average Linux machine? > > The problem is that if you only think about the "common" network types > (ethernet, PPP, etc) this line gets blurred, since there's a one-to-one > correspondence between: > * physical devices > * network devices (i.e. things that you can bind IP addresses to, > netfilter based on, tcpdump of) > > Any sane implementation of VLANs needs to be a network device in the > second sense. Network devices in the second sense are only an abstraction. The Linux kernel does not bind IP addresses to devices. IP address assignment to any device is just an entry in the routing table "local". The kernel keeps information about the correspondence between IP address and device only for backward compatibility, to help ifconfig and other obsolete network management software to work. 
I'm very thankful to Alexey for finally removing the long-standing mistake of correspondence between IP addresses and devices from the kernel. Netfilter isn't a big problem, either. A specific VLAN-id matching netfilter module is a clean and powerful solution. I think that the current VLAN implementation slightly abuses the notion of device. And this is unrelated to the number of devices and the efficiency of search algorithms. The current VLAN implementation is pure packet-mangling code. It lacks one of the most important properties of network devices - flow control. Any code that doesn't provide flow control isn't a device, but just code that manipulates packet contents. The current kernel infrastructure for packet mangling may still need some adjustments, but it at least exists. I encourage considering the VLAN implementation as just a netfilter module. Best regards Andrey V. Savochkin From owner-netdev@oss.sgi.com Mon Jun 5 08:03:29 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 08:03:19 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:55998 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 08:03:09 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id IAA12316; Mon, 5 Jun 2000 08:35:09 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id IAA18223; Mon, 5 Jun 2000 08:35:09 -0400 (EDT) Date: Mon, 5 Jun 2000 08:35:09 -0400 (EDT) From: jamal To: Gleb Natapov cc: Ben Greear , Mitchell Blank Jr , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
In-Reply-To: <393B9C33.A2AC848C@nbase.co.il> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, 5 Jun 2000, Gleb Natapov wrote: > jamal wrote: > It seems that you suppose that if you don't need vlans, nobody needs > vlans. > You will be surprised, but some people find them useful. :) No no. I am for VLANS, just not using devices. It's peer pressure, I don't think we can live without VLANS. It does not matter whether they are useful or not. > I've looked at Zebra code. 'shutdown' command directly communicates with > the kernel using zebra interface name as kernel interface name. So it > seems that in order to use interface in zebra one should have the same > interface in kernel. This can be changed of course. > Well, this is the interface they have between Linux and their code. How do they do NBMA via sockets for example? Netlink is not helpful in that case. Why don't you forward this to them? 
cheers, jamal From owner-netdev@oss.sgi.com Mon Jun 5 08:10:59 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 08:10:50 -0700 Received: from sirppi.helsinki.fi ([128.214.205.27]:37393 "EHLO sirppi.helsinki.fi") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 08:10:42 -0700 Received: from localhost (amlaukka@localhost) by sirppi.helsinki.fi (8.10.1/8.10.1) with ESMTP id e55G9q803958; Mon, 5 Jun 2000 19:09:53 +0300 (EET DST) X-Authentication-Warning: sirppi.helsinki.fi: amlaukka owned process doing -bs Date: Mon, 5 Jun 2000 19:09:51 +0300 (EET DST) From: Aki M Laukkanen To: kuznet@ms2.inr.ac.ru cc: Linux kernel mailing list , netdev@oss.sgi.com, iwtcp@cs.helsinki.fi Subject: Re: Slow TCP connection between linux and wince In-Reply-To: <200006031731.VAA10055@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, 3 Jun 2000 kuznet@ms2.inr.ac.ru wrote: > Probably, receiver delays each ACK by 500msec. I have no idea > why it does this, because it is apparently illegal behaviour. > Look: Yes, it looks that way and I got the acks mixed up. > for these 5 seconds. Probably, your sender selected illegal sndbuf, > which is much less than receiver window. The application in question was ttcp. And your guess is quite close to truth. I should've known what was the cause since we've been hit with the exact same thing already (albeit on the receive side). tcp_snd_test: tail: 1, packets_in_flight: 95, snd_cwnd: 95 end_seq: -685618378, snd_una: -685641578, snd_wnd: 32488 This is debugging output at the time the delay begins and looks perfectly reasonable. tcp_snd_test: tail: 1, packets_in_flight: 75, snd_cwnd: 116 end_seq: -685618190, snd_una: -685636454, snd_wnd: 32488 And after the delay - in the between no calls were made to tcp_snd_test. 
So the real culprit is this test: if (sock_wspace(sk) >= tcp_min_write_space(sk) && (sock = sk->socket) != NULL) { The key here is that mss was 256 - sock_wspace() is: amt = sk->sndbuf - atomic_read(&sk->wmem_alloc); wmem_alloc contains the sum of skb->truesize fields. If we do not take into account the overhead from struct (sk_buff) and aligning this test would be circa: 65536-32767 >= 32767 when the sender is made to sleep so the two numbers are almost equal. However with MTU of 296 (as given to pppd) it is: tcp_new_space: wspace: 10435, write_space: 27550 tcp_new_space: wspace: 11015, write_space: 27260 tcp_new_space: wspace: 11595, write_space: 26970 tcp_new_space: wspace: 12175, write_space: 26680 ... acks flow in ... tcp_new_space: wspace: 19715, write_space: 22910 tcp_new_space: wspace: 20295, write_space: 22620 tcp_new_space: wspace: 20875, write_space: 22330 tcp_new_space: wspace: 21455, write_space: 22040 Hmm, 11015-10435=580 - it'd make sense if there were two skbs allocated for each segment? Oh, I see skb_clone() in tcp_send_skb, right? The disparity between this test and the available send window is the cause of the bursts. Also explained is why the over-scheduling masked this behaviour. Following patch changes wmem_alloc to only include the actual data and it seems to work. This is a hackish approach at best though. diff -urN --exclude=*~ linux-2.4.0-test1-ac6.bak/net/ipv4/tcp.c linux-2.4.0-test1-ac6/net/ipv4/tcp.c --- linux-2.4.0-test1-ac6.bak/net/ipv4/tcp.c Mon Apr 24 23:59:57 2000 +++ linux-2.4.0-test1-ac6/net/ipv4/tcp.c Mon Jun 5 18:48:59 2000 @@ -960,6 +960,7 @@ skb = alloc_skb(tmp, GFP_KERNEL); if (skb == NULL) goto do_oom; + skb->truesize = copy; skb_set_owner_w(skb, sk); } else { /* If we didn't get any memory, we need to sleep. */ Our second problem with this disparity is on the receive side. The scenario is essentially the same but with an unreliable link (read wireless) which drops packets. 
In case of packet drop receiver keeps building an out-of-order queue which grows to the limit of the receive buffer quite quickly. However sender keeps sending more because of the difference between advertised window and the actual allocated space. This triggers tcp_input.c:prune_queue() which purges the whole out-of-order queue to free up space, thus killing the TCP performance quite effectively. The fix in our internal use is similar to the rmem_alloc case. I do think both of these situations are quite valid. I am not so sure about the correct fix though. From owner-netdev@oss.sgi.com Mon Jun 5 08:15:40 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 08:15:29 -0700 Received: from postino4.prima.com.ar ([200.42.0.162]:13574 "EHLO postino4.prima.com.ar") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 08:15:18 -0700 Received: from prima28.admin ([172.16.1.8]) by postino4.prima.com.ar (8.9.3/8.9.1) with ESMTP id NAA04267; Mon, 5 Jun 2000 13:14:37 -0300 (ART) Received: from mail pickup service by prima28.admin with Microsoft SMTPSVC; Mon, 5 Jun 2000 13:01:44 -0300 Received: from postino1.prima.com.ar ([200.42.0.132]) by prima28.admin with Microsoft SMTPSVC(5.5.1877.417.41); Mon, 5 Jun 2000 09:11:31 -0300 Received: from lml.valinux.com (lml.valinux.com [198.186.203.19]) by postino1.prima.com.ar (8.9.1a/8.9.1) with ESMTP id JAA20673 for ; Mon, 5 Jun 2000 09:11:52 -0300 (ART) Received: from vger.rutgers.edu (vger.rutgers.edu [128.6.190.2]) by lml.valinux.com (Postfix) with ESMTP id F18955CA4D; Mon, 5 Jun 2000 05:06:08 -0700 (PDT) Received: by vger.rutgers.edu via listexpand id ; Mon, 5 Jun 2000 07:52:09 -0400 Received: by vger.rutgers.edu id ; Mon, 5 Jun 2000 07:48:07 -0400 Received: from atrey.karlin.mff.cuni.cz ([195.113.31.123]:3889 "EHLO atrey.karlin.mff.cuni.cz") by vger.rutgers.edu with ESMTP id ; Mon, 5 Jun 2000 07:42:53 -0400 Received: from bug.ucw.cz (root@slip15.ms.mff.cuni.cz [195.113.20.215]) by atrey.karlin.mff.cuni.cz (8.8.8/8.8.8) with ESMTP id 
NAA22255; Mon, 5 Jun 2000 13:48:07 +0200 Received: (from pavel@localhost) by bug.ucw.cz (8.8.8/8.8.5) id BAA01325; Mon, 5 Jun 2000 01:40:25 +0200 Message-ID: <20000605014025.A1183@bug.ucw.cz> Date: Mon, 5 Jun 2000 01:40:25 +0200 From: Pavel Machek To: kuznet@ms2.inr.ac.ru, Aki M Laukkanen Cc: linux-kernel@vger.rutgers.edu, netdev@oss.sgi.com Subject: Re: Slow TCP connection between linux and wince References: <200006031731.VAA10055@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93i In-Reply-To: <200006031731.VAA10055@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Sat, Jun 03, 2000 at 09:31:15PM +0400 X-Loop: majordomo@vger.rutgers.edu Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hi! > > There was another error in my mail too, 2*130 is indeed over 200 but that > > was irrelevant. However the conclusion seemed correct, delayed acks timing > > out. Could you explain me why the window of 3100 seems to allow 2 packets > > in flight but sender always waits for the next ack? > > No, tcpdump shows that sender always has full window. > Receiver always ACKs _previous_ packet. > > RTT=0.9sec, window is 2 packets or ~3K, bandwidth is ~3K/sec. > Tcpdump is perfectly smooth and correct except for the rate. 8) > > Probably, receiver delays each ACK by 500msec. I have no idea > why it does this, because it is apparently illegal behaviour. > Look: > > 1 02:59:02.875097 10.0.0.3.www > 192.168.55.100.1029: . 233601:235061(1460) ack 196 win 16060 (DF) > 2 02:59:03.235031 192.168.55.100.1029 > 10.0.0.3.www: . ack 233601 win 3100 (DF) > 3 02:59:03.235094 10.0.0.3.www > 192.168.55.100.1029: . 235061:236521(1460) ack 196 win 16060 (DF) > 4 02:59:03.775064 192.168.55.100.1029 > 10.0.0.3.www: . ack 235061 win 3100 (DF) > 5 02:59:03.775137 10.0.0.3.www > 192.168.55.100.1029: . 236521:237981(1460) ack 196 win 16060 (DF) > 6 02:59:04.145061 192.168.55.100.1029 > 10.0.0.3.www: . 
ack 236521 win 3100 (DF) > > ACK #4 acks packet #1, sent 0.9sec before this (hence, rtt=0.9sec). > We have two packets in transmit, so that packet #1 reaches receiver not later > than (2*1500)/bandwidth after transmit. If bandwidth is 115Kbaud, > it is not more ~0.3sec. Hence, packet #3 reaches receiver _before_ > receiver sent ACK for previous packet!! For what it is worth, ping while under load looks (downloading linux kernel, as always) like this: root@bug:~# ping 192.168.55.100 PING 192.168.55.100 (192.168.55.100): 56 data bytes 64 bytes from 192.168.55.100: icmp_seq=0 ttl=32 time=519.0 ms 64 bytes from 192.168.55.100: icmp_seq=1 ttl=32 time=640.1 ms 64 bytes from 192.168.55.100: icmp_seq=2 ttl=32 time=760.0 ms 64 bytes from 192.168.55.100: icmp_seq=3 ttl=32 time=840.0 ms 64 bytes from 192.168.55.100: icmp_seq=4 ttl=32 time=989.2 ms 64 bytes from 192.168.55.100: icmp_seq=5 ttl=32 time=641.6 ms 64 bytes from 192.168.55.100: icmp_seq=6 ttl=32 time=820.0 ms 64 bytes from 192.168.55.100: icmp_seq=7 ttl=32 time=1000.2 ms 64 bytes from 192.168.55.100: icmp_seq=8 ttl=32 time=689.8 ms 64 bytes from 192.168.55.100: icmp_seq=9 ttl=32 time=859.8 ms 64 bytes from 192.168.55.100: icmp_seq=10 ttl=32 time=830.0 ms 64 bytes from 192.168.55.100: icmp_seq=12 ttl=32 time=679.9 ms 64 bytes from 192.168.55.100: icmp_seq=13 ttl=32 time=239.8 ms 64 bytes from 192.168.55.100: icmp_seq=14 ttl=32 time=530.0 ms 64 bytes from 192.168.55.100: icmp_seq=15 ttl=32 time=620.0 ms 64 bytes from 192.168.55.100: icmp_seq=16 ttl=32 time=700.0 ms and ping when links is idle looks like this: root@bug:~# ping 192.168.55.100 PING 192.168.55.100 (192.168.55.100): 56 data bytes 64 bytes from 192.168.55.100: icmp_seq=0 ttl=32 time=117.2 ms 64 bytes from 192.168.55.100: icmp_seq=1 ttl=32 time=110.1 ms 64 bytes from 192.168.55.100: icmp_seq=2 ttl=32 time=110.0 ms 64 bytes from 192.168.55.100: icmp_seq=3 ttl=32 time=110.1 ms 64 bytes from 192.168.55.100: icmp_seq=4 ttl=32 time=110.1 ms 64 bytes from 
192.168.55.100: icmp_seq=5 ttl=32 time=110.1 ms 64 bytes from 192.168.55.100: icmp_seq=6 ttl=32 time=110.0 ms 64 bytes from 192.168.55.100: icmp_seq=7 ttl=32 time=109.8 ms 64 bytes from 192.168.55.100: icmp_seq=8 ttl=32 time=110.1 ms 64 bytes from 192.168.55.100: icmp_seq=9 ttl=32 time=110.0 ms 64 bytes from 192.168.55.100: icmp_seq=10 ttl=32 time=110.1 ms 64 bytes from 192.168.55.100: icmp_seq=11 ttl=32 time=110.0 ms > Conclusion is: > 3. Or receiver has totally broken TCP, which sends delayed ACKs > ~500msec after packet arrives, even if another packets This is windows ce stack; quite likely to be broken. I doubt it ever announces window bigger than 3100 (is there way to make it make its window bigger?). > arrived during this period. I.e. delays ACK only to delay. 8) > Probably, this stack even looked as working, if window were >2packets. > Namely, if it were > bandwidth*0.9sec=~10K nobody > even would notice this misbehaviour. This is well possible. Their TCP stack is probably designed to work at 19200. Is there way for us to missbehave? Ie. what would happen if I forced linux tcp stack to think their window is bigger than it really is? [Is there easy way to break my tcp stack this way?] Pavel -- I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care." Panos Katsaloulis describing me w.r.t. 
patents me at discuss@linmodems.org - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/ From owner-netdev@oss.sgi.com Mon Jun 5 08:24:21 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 08:24:11 -0700 Received: from nets5.rz.RWTH-Aachen.DE ([137.226.144.13]:53124 "EHLO nets5.rz.rwth-aachen.de") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 08:24:07 -0700 Received: from fire.malware.de (s4m206.dialup.RWTH-Aachen.DE [137.226.8.206]) by nets5.rz.rwth-aachen.de (8.10.1/8.10.1/5) with ESMTP id e55GO4b01156 for ; Mon, 5 Jun 2000 18:24:04 +0200 (MET DST) To: netdev@oss.sgi.com Path: not-for-mail From: Malware Newsgroups: malware.lists.linux.netdev Subject: Re: Possible bug in Space.c in 2.4.0-test1 Date: Mon, 05 Jun 2000 18:26:14 +0200 Organization: private site, Aachen, Germany Lines: 54 Message-ID: <393BD4A6.ADFCF1FD@post.rwth-aachen.de> References: NNTP-Posting-Host: malware.malware.de Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Mailer: Mozilla 4.61 [en] (X11; U; Linux 2.4.0-test1-ac7-vm3 i586) X-Accept-Language: de, en Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hi Bruce, you wrote: > Anyway, enough waffling. In Space.c there is a chunk of code (starting at > line 542) that statically creates a linked list of ethernet devices. The > device name for all of them is "eth%d". I haven't traced what happens with > modules, but with a monolithic kernel these names are compared to parameters > passed on the kernel command line (e.g. ether=12,0x240,eth0 for my ISA > NE2000 card) to pass the correct cards the correct info. However it doesn't Where do you get the idea that it is compared to the entries of the list one by one? Did you see the following comment in ./net/core/dev.c and the implementation of the function? 
/** * netdev_boot_setup_check - check boot time settings * @dev: the netdevice * * Check boot time settings for the device. If device's name is a * mask (eg. eth%d) and settings are found then this will allocate * name for the device. The found settings are set for the device * to be used later in the device probing. Returns 0 if no settings * found, 1 if they are. */ > make sense to me that you should specify ether=...eth%d for any ethernet > card (if nothing else, you have no way to distinguish between them). I You should not give the parameter as ether=...eth%d but as ether=...eth0, as usual. The name will then be copied into the device structure taken from the static list. > changed the "eth%d"'s in Space.c to eth0, eth1, ... and recompiled, and > after that it picked up the network card and nothing else has died, so I'm > guessing that that was the right thing to do. If you make this change it should not work whenever the card really does need a parameter, since the following lines in netdev_boot_setup_check will throw it out, thinking the interface is already created: if (__dev_get_by_name(s[i].name)) { if (!mask) return 0; continue; } So probably you gave wrong parameters or they did not get parsed correctly. With your change, those wrong parameters were not passed and the defaults were applied. Michael From owner-netdev@oss.sgi.com Mon Jun 5 10:01:21 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 10:01:12 -0700 Received: from dhcp41.toaster.net ([199.108.84.41]:54026 "EHLO schmee.sfgoth.com") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 10:00:54 -0700 Received: (from mitch@localhost) by schmee.sfgoth.com (8.9.3/8.9.3) id LAA87895; Mon, 5 Jun 2000 11:00:34 -0700 (PDT) Date: Mon, 5 Jun 2000 11:00:34 -0700 From: Mitchell Blank Jr To: Ben Greear Cc: jamal , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
Message-ID: <20000605110034.G77216@sfgoth.com> References: <20000603091818.B48132@sfgoth.com> <20000604214856.B77216@sfgoth.com> <393B414B.ACF84B1B@candelatech.com> <20000605060321.E77216@sfgoth.com> <393BC83B.67AD86BF@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: <393BC83B.67AD86BF@candelatech.com>; from greearb@candelatech.com on Mon, Jun 05, 2000 at 08:33:15AM -0700 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Ben Greear wrote: > Seems a hashtable would be nice for the ifindex.... The problem with using a hashtable: how big should it be? After all you want it small enough that there isn't much memory waste if you have 3 devices, yet we can efficiently do a lookup on 10000 devices. That's why I'm thinking a B+ tree or something would be more appropriate. People on lkml occasionally talk about making a general tree implementation similar to .. does anyone know if there's code somewhere? > Also, if it takes out a linear search in a critical path (with seemingly > minimal overhead), then I don't even think it should be a configurable > option, just *IN* there! I personally agree, but since some people seem strongly against the idea it's somewhat a matter of politics. -Mitch From owner-netdev@oss.sgi.com Mon Jun 5 10:07:02 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 10:06:52 -0700 Received: from kanga.kvack.org ([209.82.47.3]:57103 "EHLO kanga.kvack.org") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 10:06:34 -0700 Received: (from localhost user: 'blah', uid#63042) by kanga.kvack.org with SMTP id ; Mon, 5 Jun 2000 14:27:37 -0400 Date: Mon, 5 Jun 2000 14:27:36 -0400 (EDT) From: "Benjamin C.R. LaHaise" To: Mitchell Blank Jr cc: Ben Greear , jamal , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???
In-Reply-To: <20000605110034.G77216@sfgoth.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, 5 Jun 2000, Mitchell Blank Jr wrote: > Ben Greear wrote: > > Seems a hashtable would be nice for the ifindex.... > > The problemn with using a hashtable: how big should it be? After all > you want it small enough that there isn't much memory waste if you > have 3 devices, yet we can efficiently do a lookup on 10000 devices. > That's why I'm thinking a B+ tree or something would be more > appropriate. Before going that far, why not just take advantage of the fact that network devices have a structure to their name: class? Since the numbers are typically contiguous starting at zero, just have an array pointing to the device structs hanging off of the class name. That way memory can be saved on device names too. -ben From owner-netdev@oss.sgi.com Mon Jun 5 18:40:12 2000 Received: by oss.sgi.com id ; Mon, 5 Jun 2000 18:40:03 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:22280 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 5 Jun 2000 18:39:41 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id SAA24825; Mon, 5 Jun 2000 18:48:57 -0700 Message-ID: <393C5888.9FE329D2@candelatech.com> Date: Mon, 05 Jun 2000 18:48:56 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: "Benjamin C.R. LaHaise" CC: Mitchell Blank Jr , jamal , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing "Benjamin C.R. 
LaHaise" wrote: > > On Mon, 5 Jun 2000, Mitchell Blank Jr wrote: > > > Ben Greear wrote: > > > Seems a hashtable would be nice for the ifindex.... > > > > The problem with using a hashtable: how big should it be? After all > > you want it small enough that there isn't much memory waste if you > > have 3 devices, yet we can efficiently do a lookup on 10000 devices. > > That's why I'm thinking a B+ tree or something would be more > > appropriate. > > Before going that far, why not just take advantage of the fact that > network devices have a structure to their name: class? Since the > numbers are typically contiguous starting at zero, just have an array > pointing to the device structs hanging off of the class name. That way > memory can be saved on device names too. > > -ben VLAN devices are not numbered contiguously, for one. An array is also worse than a hashtable at allowing growth. We could have a dynamically re-sized array or hashtable though, based on the if_index field.... Doesn't help finding a device by name though...
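
[Illustration] The ifindex-keyed hashtable discussed above could be sketched roughly as follows. This is a userspace toy, not code from any kernel tree: struct toy_dev, struct toy_dev_table, IFINDEX_HASH_BITS and the helper names are all made up for the example, and the bucket count is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a network device; NOT the kernel's struct. */
struct toy_dev {
    int ifindex;
    const char *name;
    struct toy_dev *hash_next;   /* next device in the same bucket */
};

/* Assumed size: 16 buckets, which costs little when only 3 devices exist. */
#define IFINDEX_HASH_BITS 4
#define IFINDEX_HASH_SIZE (1u << IFINDEX_HASH_BITS)

struct toy_dev_table {
    struct toy_dev *bucket[IFINDEX_HASH_SIZE];
};

/* Map an ifindex to a bucket by masking its low bits. */
static unsigned int ifindex_hash(int ifindex)
{
    return (unsigned int)ifindex & (IFINDEX_HASH_SIZE - 1);
}

/* Insert a device at the head of its bucket chain. */
static void toy_dev_add(struct toy_dev_table *t, struct toy_dev *dev)
{
    unsigned int h = ifindex_hash(dev->ifindex);

    assert(dev->hash_next == NULL);  /* device must not already be chained */
    dev->hash_next = t->bucket[h];
    t->bucket[h] = dev;
}

/* Walk one bucket chain instead of the whole device list. */
static struct toy_dev *toy_dev_get_by_index(const struct toy_dev_table *t,
                                            int ifindex)
{
    struct toy_dev *dev;

    for (dev = t->bucket[ifindex_hash(ifindex)]; dev; dev = dev->hash_next)
        if (dev->ifindex == ifindex)
            return dev;
    return NULL;
}
```

With a fixed 16 buckets, 10000 devices would leave about 625 entries per chain, which is the sizing concern raised in the thread; a resizable table would have to rehash as devices are added. As noted, none of this helps lookup by name, which would need its own index.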
Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Tue Jun 6 03:40:03 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 03:39:53 -0700 Received: from sirppi.helsinki.fi ([128.214.205.27]:13578 "EHLO sirppi.helsinki.fi") by oss.sgi.com with ESMTP id ; Tue, 6 Jun 2000 03:39:40 -0700 Received: from localhost (amlaukka@localhost) by sirppi.helsinki.fi (8.10.1/8.10.1) with ESMTP id e56AbjC17372; Tue, 6 Jun 2000 13:37:45 +0300 (EET DST) X-Authentication-Warning: sirppi.helsinki.fi: amlaukka owned process doing -bs Date: Tue, 6 Jun 2000 13:37:43 +0300 (EET DST) From: Aki M Laukkanen To: Pavel Machek cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: Slow TCP connection between linux and wince In-Reply-To: <20000605014025.A1183@bug.ucw.cz> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, 5 Jun 2000, Pavel Machek wrote: > 64 bytes from 192.168.55.100: icmp_seq=1 ttl=32 time=110.1 ms > 64 bytes from 192.168.55.100: icmp_seq=2 ttl=32 time=110.0 ms Assuming 84 bytes altogether, this is quite close to 19200 bps. I get about 90-100 ms for RTT. > This is well possible. Their TCP stack is probably designed to work at > 19200. Is there way for us to missbehave? Ie. what would happen if I > forced linux tcp stack to think their window is bigger than it really > is? [Is there easy way to break my tcp stack this way?] 
See, http://arstechnica.com/reviews/2q00/networking/networking-1.html From owner-netdev@oss.sgi.com Tue Jun 6 04:51:54 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 04:51:44 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:11449 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Tue, 6 Jun 2000 04:51:20 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id HAA22643; Tue, 6 Jun 2000 07:49:12 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id HAA20277; Tue, 6 Jun 2000 07:49:02 -0400 (EDT) Date: Tue, 6 Jun 2000 07:49:01 -0400 (EDT) From: jamal To: Mitchell Blank Jr cc: Andrey Savochkin , Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <20000605053533.C77216@sfgoth.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, 5 Jun 2000, Mitchell Blank Jr wrote: > Andrey Savochkin wrote: > > I want to ad my $0.02. > > > Linux kernel do not bind IP addresses for devices. > [ A lot of arguement on how anything which can do IP is a device, deleted] Yes people, what is net device? ;-> It routes, it ARPs, it cooks ... it must be a net device. > > I think a reasonable goal for VLAN support would be that a machine > with one ethernet card on two VLANs should be able to do pretty much > the same as a machine with two seperate ethernet cards, with similar > configuration commands. That is, after all, the promise of VLANs. > And you can do this just fine with VLANS abstracted on top of devices. > Flow control might not be perfect in VLANs, but that's really the > least of its problems. The linux model of handling flow control > only can really handle simple devices that can be modeled as a FIFO > queue. 
What "linux model of handling flow control"? What would the other model be? FIFOs are the basic building blocks. > Take for example the current mess we have in the ATM > stack - suppose you have an atm net_device (can be CLIP or LANE, > doesn't matter) with two PVCs to host1 and host2. These go > across different networks, and thus have different QoS available - > host1 is across a frac-T1 link which host2 is right on our local > OC-12 switch. Suppose host1 gets enough traffic that the PVC > becomes full - what do we do? > 1. We could netif_stop_queue, but that would shut off all > connectivity to host2, even though we could easily get > packets to it > 2. We could just drop packets to host1, but then we're > not providing the neccesary backpressure to make net > schedulers work > hrm.. Unless i am totaly misunderstanding you; You should leave this to policy management. Repeat after me: "I shall provide the mechanism, sir". Very simple rule of building powerful abstractions. cheers, jamal From owner-netdev@oss.sgi.com Tue Jun 6 05:44:24 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 05:44:14 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:18931 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Tue, 6 Jun 2000 05:44:00 -0700 Received: from fred.muc.de (none@ns1032.munich.netsurf.de [195.180.235.32]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id OAA21669; Tue, 6 Jun 2000 14:43:20 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 12zFdM-0000Eo-00; Tue, 6 Jun 2000 11:26:52 +0200 Date: Tue, 6 Jun 2000 11:26:52 +0200 From: Andi Kleen To: Andrey Savochkin Cc: Gleb Natapov , Mitchell Blank Jr , Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il, jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
Message-ID: <20000606112652.A912@fred.muc.de> References: <3938611E.D074F254@candelatech.com> <20000603091818.B48132@sfgoth.com> <20000605102627.A8473@saw.sw.com.sg> <393B56DD.34A83A7D@nbase.co.il> <20000605154657.D10091@saw.sw.com.sg> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <20000605154657.D10091@saw.sw.com.sg>; from Andrey Savochkin on Mon, Jun 05, 2000 at 05:57:16PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, Jun 05, 2000 at 05:57:16PM +0200, Andrey Savochkin wrote: > Hello Gleb, > > On Mon, Jun 05, 2000 at 07:29:33AM +0000, Gleb Natapov wrote: > > As Mitchell said, will I be able to run OSPF between VLANs? I actually > > run zebra ospfd on vlans. Zebra has a strong notion of device. It relies > > on device up, device done, change ip and other messages from netlink. > > I do not know how Zebra works, but the design described by you looks very > broken at the first glance. If you run routing managements software on your > system you should perform all kernel state changes only through this > software. Thus, the software do not need any kernel feedback about device/ip > state except the confirmations of its own commands. I don't think it is broken. It seems to me that one of the design goals of netlink was to make it possible for multiple routing daemons (including the ``admin daemon'') to coexist nicely. Moving all policy into a big monolithic program would look broken for me. -Andi -- This is like TV. I don't like TV. 
From owner-netdev@oss.sgi.com Tue Jun 6 06:07:44 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 06:07:34 -0700 Received: from smtprch1.nortelnetworks.com ([192.135.215.14]:61869 "EHLO smtprch1.nortel.com") by oss.sgi.com with ESMTP id ; Tue, 6 Jun 2000 06:07:08 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch1.nortel.com; Tue, 6 Jun 2000 08:04:40 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id LW2CDV4W; Tue, 6 Jun 2000 08:04:25 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id MMFY1LAL; Tue, 6 Jun 2000 23:04:46 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.45]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id XAA31077; Tue, 6 Jun 2000 23:04:34 +1000 Message-ID: <393CF7A8.6DDAE578@uow.edu.au> Date: Tue, 06 Jun 2000 23:07:52 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: Alexey Kuznetsov CC: "netdev@oss.sgi.com" Subject: timers in net/ipv6 Content-Type: multipart/mixed; boundary="------------B45AACD470AD0719FA3B63CC" Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing This is a multi-part message in MIME format. --------------B45AACD470AD0719FA3B63CC Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Hi, Alexey. Attached is my shot at tightening the timer handling in net/ipv6. Andi has provided some comments (Thanks!). The fib6_run_gc() alterations are not pleasant. There is some code which I believe is safe, but I wasn't 100% sure, so I left some commentary in there. Hope this helps... 
--------------B45AACD470AD0719FA3B63CC Content-Type: text/plain; charset=us-ascii; name="net-ipv6.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="net-ipv6.patch" --- linux-2.4.0-test1-ac8/net/ipv6/addrconf.c Wed May 3 18:48:03 2000 +++ linux-akpm/net/ipv6/addrconf.c Mon Jun 5 23:11:11 2000 @@ -180,7 +180,7 @@ static void addrconf_del_timer(struct inet6_ifaddr *ifp) { - if (del_timer(&ifp->timer)) + if (del_timer_async(&ifp->timer)) __in6_ifa_put(ifp); } @@ -195,7 +195,7 @@ enum addrconf_timer_t what, unsigned long when) { - if (!del_timer(&ifp->timer)) + if (!del_timer_async(&ifp->timer)) in6_ifa_hold(ifp); switch (what) { @@ -316,7 +316,11 @@ in6_dev_put(ifp->idev); - if (del_timer(&ifp->timer)) +/* + * REVIEWME: I assume the del_timer_async() here is safe + * due to the below check (timer shouldn't be scheduled). + */ + if (del_timer_async(&ifp->timer)) printk("Timer is still running, when freeing ifa=%p\n", ifp); if (!ifp->dead) { @@ -1660,6 +1664,7 @@ } mod_timer(&addr_chk_timer, jiffies + ADDR_CHECK_FREQUENCY); + timer_exit(&addr_chk_timer); } #ifdef CONFIG_RTNETLINK @@ -2081,7 +2086,7 @@ } write_unlock_bh(&addrconf_hash_lock); - del_timer(&addr_chk_timer); + del_timer_sync(&addr_chk_timer); rtnl_unlock(); --- linux-2.4.0-test1-ac8/net/ipv6/ip6_fib.c Wed May 3 18:48:04 2000 +++ linux-akpm/net/ipv6/ip6_fib.c Tue Jun 6 21:51:06 2000 @@ -87,7 +87,8 @@ static __u32 rt_sernum = 0; -static struct timer_list ip6_fib_timer = { function: fib6_run_gc }; +static void fib6_run_gc_t(unsigned long dummy); +static struct timer_list ip6_fib_timer = { function: fib6_run_gc_t }; static struct fib6_walker_t fib6_walker_list = { &fib6_walker_list, &fib6_walker_list, @@ -1169,7 +1170,7 @@ static spinlock_t fib6_gc_lock = SPIN_LOCK_UNLOCKED; -void fib6_run_gc(unsigned long dummy) +void fib6_run_gc(unsigned long dummy, int from_timer) { if (dummy != ~0UL) { spin_lock_bh(&fib6_gc_lock); @@ -1193,10 +1194,19 @@ if (gc_args.more) mod_timer(&ip6_fib_timer, 
jiffies + ip6_rt_gc_interval); else { - del_timer(&ip6_fib_timer); + del_timer_async(&ip6_fib_timer); /* Redundant? */ ip6_fib_timer.expires = 0; } spin_unlock_bh(&fib6_gc_lock); + if (from_timer) { + timer_exit(&ip6_fib_timer); + } +} + +/* Timer handler */ +static void fib6_run_gc_t(unsigned long dummy) +{ + fib6_run_gc(dummy, 1); } void __init fib6_init(void) @@ -1211,7 +1221,7 @@ #ifdef MODULE void fib6_gc_cleanup(void) { - del_timer(&ip6_fib_timer); + del_timer_sync(&ip6_fib_timer); } #endif --- linux-2.4.0-test1-ac8/net/ipv6/ip6_flowlabel.c Fri Oct 29 07:34:44 1999 +++ linux-akpm/net/ipv6/ip6_flowlabel.c Mon Jun 5 23:12:27 2000 @@ -103,7 +103,14 @@ fl->opt = NULL; kfree(opt); } - if (!del_timer(&ip6_fl_gc_timer) || +/* + * REVIEWME: we may have a race here. + * ip6_fl_gc() could be running now, and after fl_release() returns, + * this ip6_flowlabel is kfree()ed. Is there a possibility that + * the still-running ip6_fl_gc() could end up accessing kfree'ed + * memory? If so, I think a del_timer_sync() is safe here + */ + if (!del_timer_async(&ip6_fl_gc_timer) || (long)(ip6_fl_gc_timer.expires - ttd) > 0) ip6_fl_gc_timer.expires = ttd; add_timer(&ip6_fl_gc_timer); @@ -146,6 +153,7 @@ add_timer(&ip6_fl_gc_timer); } write_unlock(&ip6_fl_lock); + timer_exit(&ip6_fl_gc_timer); } static int fl_intern(struct ip6_flowlabel *fl, __u32 label) @@ -615,7 +623,7 @@ void ip6_flowlabel_cleanup() { - del_timer(&ip6_fl_gc_timer); + del_timer_sync(&ip6_fl_gc_timer); #ifdef CONFIG_PROC_FS remove_proc_entry("net/ip6_flowlabel", 0); #endif --- linux-2.4.0-test1-ac8/net/ipv6/mcast.c Wed Feb 9 13:35:27 2000 +++ linux-akpm/net/ipv6/mcast.c Tue Jun 6 21:54:20 2000 @@ -363,7 +363,7 @@ if (ipv6_addr_type(&ma->mca_addr)&(IPV6_ADDR_LINKLOCAL|IPV6_ADDR_LOOPBACK)) return; - if (del_timer(&ma->mca_timer)) + if (del_timer_async(&ma->mca_timer)) /* REVIEWME: see below */ delay = ma->mca_timer.expires - jiffies; if (delay >= resptime) { @@ -404,6 +404,10 @@ if (idev == NULL) return 0; +/* + * 
REVIEWME: igmp6_timer_handler() could be running right now on another CPU. + * Is this safe? If not, we need del_timer_sync() in igmp6_group_queried() + */ read_lock(&idev->lock); if (ipv6_addr_any(addrp)) { for (ma = idev->mc_list; ma; ma=ma->next) @@ -450,11 +454,15 @@ * Cancel the timer for this group */ +/* REVIEWME: igmp6_timer_handler() could be running now. Is this safe? Could we + * do the in6_dev_put() right in the middle of igmp6_send()? If so, we need del_timer_sync(). + * Andi says "Should not matter. it is a independent idev." + */ read_lock(&idev->lock); for (ma = idev->mc_list; ma; ma=ma->next) { if (ipv6_addr_cmp(&ma->mca_addr, addrp) == 0) { if (ma->mca_flags & MAF_TIMER_RUNNING) { - del_timer(&ma->mca_timer); + del_timer_async(&ma->mca_timer); ma->mca_flags &= ~MAF_TIMER_RUNNING; } @@ -552,7 +560,7 @@ igmp6_send(&ma->mca_addr, ma->idev->dev, ICMPV6_MGM_REPORT); delay = net_random() % IGMP6_UNSOLICITED_IVAL; - if (del_timer(&ma->mca_timer)) + if (del_timer_async(&ma->mca_timer)) delay = ma->mca_timer.expires - jiffies; ma->mca_timer.expires = jiffies + delay; @@ -573,8 +581,14 @@ if (ma->mca_flags & MAF_LAST_REPORTER) igmp6_send(&ma->mca_addr, ma->idev->dev, ICMPV6_MGM_REDUCTION); +/* + * REVIEWME: igmp6_timer_handler() could be running now. As soon as we return + * from igmp6_leave_group(), 'ma' is kfree'ed. Could igmp6_send() end up + * touching kfree'ed memory? + * Andi thinks it should be del_timer_sync here. + */ if (ma->mca_flags & MAF_TIMER_RUNNING) - del_timer(&ma->mca_timer); + del_timer_async(&ma->mca_timer); } void igmp6_timer_handler(unsigned long data) --- linux-2.4.0-test1-ac8/net/ipv6/reassembly.c Wed May 3 18:48:04 2000 +++ linux-akpm/net/ipv6/reassembly.c Tue Jun 6 22:03:06 2000 @@ -197,7 +197,13 @@ { struct ipv6_frag *fp, *back; - del_timer(&fq->timer); +/* REVIEWME: frag_expire() could be running on another CPU now, (if this + * function is called from reasm_frag()). frag_expire() will be spinning + * on ip6_frag_lock. 
Once this function returns to reasm_frag() and + * reasm_frag releases the lock, frag_expire() will run and will + * again call fq_free. Probably safe, but needs an expert eye :) + */ + del_timer_async(&fq->timer); for (fp = fq->fragments; fp; ) { frag_kfree_skb(fp->skb); @@ -495,7 +501,10 @@ frag_kfree_s(back, sizeof(*back)); } - del_timer(&fq->timer); +/* REVIEWME: ip6_frag_lock is held now, so del_timer_sync() will deadlock. + * I think this code is safe anyway - the list is consistent when the lock is released. + */ + del_timer_async(&fq->timer); fq->prev->next = fq->next; fq->next->prev = fq->prev; fq->prev = fq->next = NULL; --- linux-2.4.0-test1-ac8/net/ipv6/route.c Sun Jan 23 06:54:58 2000 +++ linux-akpm/net/ipv6/route.c Tue Jun 6 21:47:14 2000 @@ -589,7 +589,7 @@ goto out; expire++; - fib6_run_gc(expire); + fib6_run_gc(expire, 0); last_gc = now; if (atomic_read(&ip6_dst_ops.entries) < ip6_dst_ops.gc_thresh) expire = ip6_rt_gc_timeout>>1; @@ -1882,7 +1882,7 @@ proc_dointvec(ctl, write, filp, buffer, lenp); if (flush_delay < 0) flush_delay = 0; - fib6_run_gc((unsigned long)flush_delay); + fib6_run_gc((unsigned long)flush_delay, 0); return 0; } else return -EINVAL; --- linux-2.4.0-test1-ac8/include/net/ip6_fib.h Tue Aug 24 03:01:02 1999 +++ linux-akpm/include/net/ip6_fib.h Tue Jun 6 21:48:24 2000 @@ -174,7 +174,7 @@ extern void inet6_rt_notify(int event, struct rt6_info *rt); -extern void fib6_run_gc(unsigned long dummy); +extern void fib6_run_gc(unsigned long dummy, int from_timer); extern void fib6_gc_cleanup(void); --------------B45AACD470AD0719FA3B63CC-- From owner-netdev@oss.sgi.com Tue Jun 6 09:57:55 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 09:57:45 -0700 Received: from nets5.rz.RWTH-Aachen.DE ([137.226.144.13]:32438 "EHLO nets5.rz.rwth-aachen.de") by oss.sgi.com with ESMTP id ; Tue, 6 Jun 2000 09:57:21 -0700 Received: from fire.malware.de (s4m026.dialup.RWTH-Aachen.DE [137.226.8.26]) by nets5.rz.rwth-aachen.de (8.10.1/8.10.1/5) with ESMTP id 
e56GvIb16372; Tue, 6 Jun 2000 18:57:19 +0200 (MET DST) To: linux-kernel@vger.rutgers.edu Path: not-for-mail From: Malware Newsgroups: malware.lists.linux.kernel,malware.lists.linux.netdev Subject: Re: ip_dynaddr broken in 2.4.0-test1 ? Date: Tue, 06 Jun 2000 18:56:57 +0200 Organization: private site, Aachen, Germany Lines: 82 Message-ID: <393D2D59.4E6E966D@post.rwth-aachen.de> References: <393AB6CC.40C83BA3@nautze.de> <393C2711.29D642A1@nautze.de> NNTP-Posting-Host: malware.malware.de Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Mailer: Mozilla 4.61 [en] (X11; U; Linux 2.4.0-test1-ac7-vm3 i586) X-Accept-Language: de, en Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hi Christian, you wrote on linux-kernel: > if (rt == NULL) { > int err; > > u32 daddr = sk->daddr; > > if(sk->protinfo.af_inet.opt && > sk->protinfo.af_inet.opt->srr) > daddr = sk->protinfo.af_inet.opt->faddr; > > err = ip_route_output(&rt, daddr, sk->saddr, > RT_TOS(sk->protinfo.af_inet.tos) | > RTO_CON > N | sk->localroute, > sk->bound_dev_if); > >> printk(KERN_INFO "error: %d\n", err); > if (err) { > sk->err_soft=-err; > sk->error_report(sk); > return -1; > } [...] > Jun 5 23:58:37 beastieboys kernel: rt_new: 00000000, old: 00000000 > Jun 5 23:58:37 beastieboys kernel: want rewrite 1 > Jun 5 23:58:37 beastieboys kernel: error: -22 Looks like it is something bad to call ip_route_output with the original source address of the socket (before it is rewritten) since the function ip_route_output_slow sitting behind does contain following lines: if (saddr) { if (MULTICAST(saddr) || BADCLASS(saddr) || ZERONET(saddr)) return -EINVAL; /* It is equivalent to inet_addr_type(saddr) == RTN_LOCAL */ dev_out = ip_dev_find(saddr); if (dev_out == NULL) return -EINVAL; ... So since a source address is specified it will try finding an interface this one is associated with which might fail. 
There should be a userspace workaround as follows: On system startup and disconnect (ip-down script) configure the PPP interface to have the address of a permanent interface such as the ethernet or dummy interface. So packets generated while the line is down will get the address of this interface assigned and a route will be generated by ip_route_output - hopefully I did not miss a hook to netfilter there since it could block the packet. Since the next step in tcpv4_rebuild_header is the rewrite already it should work. If it should be fixed within the kernel I would apply: --- tcp_ipv4.c.orig Wed May 3 10:48:03 2000 +++ tcp_ipv4.c Tue Jun 6 18:52:24 2000 @@ -1744,7 +1744,7 @@ if(sk->protinfo.af_inet.opt && sk->protinfo.af_inet.opt->srr) daddr = sk->protinfo.af_inet.opt->faddr; - err = ip_route_output(&rt, daddr, sk->saddr, + err = ip_route_output(&rt, daddr, (want_rewrite ? 0 : sk->saddr), RT_TOS(sk->protinfo.af_inet.tos) | RTO_CONN | sk->localroute, sk->bound_dev_if); if (err) { But as I am not familiar with most details I do not swear on that being a solution and not causing other problems. Malware PS: Copies of this message are going to the netdev list and Christian Nautze.
From owner-netdev@oss.sgi.com Tue Jun 6 10:50:15 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 10:50:05 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:25610 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 6 Jun 2000 10:49:48 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA07354; Tue, 6 Jun 2000 21:36:29 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006061736.VAA07354@ms2.inr.ac.ru> Subject: Re: Slow TCP connection between linux and wince To: pavel@suse.cz (Pavel Machek) Date: Tue, 6 Jun 2000 21:36:29 +0400 (MSK DST) Cc: amlaukka@cc.helsinki.fi, linux-kernel@vger.rutgers.edu, netdev@oss.sgi.com In-Reply-To: <20000605014025.A1183@bug.ucw.cz> from "Pavel Machek" at Jun 5, 0 01:40:25 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 671 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > This is windows ce stack; quite likely to be broken. I doubt it ever > announces window bigger than 3100 (is there way to make it make its > window bigger?). I think, it is possible. Only I have no idea, how to make this. 8) > Is there way for us to missbehave? I see no problems in tcpdump, it looks perfect. > Ie. what would happen if I > forced linux tcp stack to think their window is bigger than it really > is? [Is there easy way to break my tcp stack this way?] No, it is impossible. Sender cannot do anything with receiver's window. Seems, the best variant is to try to upgrade it yet. Or find that place, where window is increased. 
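
[Illustration] Why the sender cannot work around a small advertised window can be shown with a little arithmetic. The sketch below is not taken from the Linux TCP code; the function names are invented, and the figures plugged in (a 3100-byte window and the roughly 110 ms ping time reported earlier in the thread) are only assumed to be representative of the link.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes the sender may still transmit: bounded by the smaller of its own
 * congestion window and the window the receiver advertised. */
static uint32_t usable_window(uint32_t cwnd, uint32_t rwnd, uint32_t in_flight)
{
    uint32_t wnd = cwnd < rwnd ? cwnd : rwnd;

    return in_flight >= wnd ? 0 : wnd - in_flight;
}

/* Upper bound, in bytes per second, for a flow limited by the receiver's
 * window: at most one window per round trip. */
static uint32_t window_limited_rate(uint32_t rwnd_bytes, uint32_t rtt_ms)
{
    assert(rtt_ms > 0);
    return (uint32_t)(((uint64_t)rwnd_bytes * 1000u) / rtt_ms);
}
```

Under those assumptions, window_limited_rate(3100, 110) gives a ceiling of about 28 KB/s no matter how large the sender's cwnd grows, so only a larger window announced by the receiver would raise it.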
Alexey From owner-netdev@oss.sgi.com Tue Jun 6 11:20:56 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 11:20:45 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:37386 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 6 Jun 2000 11:20:31 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA07591; Tue, 6 Jun 2000 22:07:48 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006061807.WAA07591@ms2.inr.ac.ru> Subject: Re: mail problems (was Re: slow ...) To: amlaukka@cc.helsinki.fi (Aki M Laukkanen) Date: Tue, 6 Jun 2000 22:07:48 +0400 (MSK DST) Cc: netdev@oss.sgi.com In-Reply-To: from "Aki M Laukkanen" at Jun 5, 0 09:18:44 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 746 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > What sort of mail problems? I read through the local newsgroups but > couldn't find a mention of mail problems. I was away for most of the > weekend so can't tell. If you CC:ed me in the replies to netdev and l-k, > indeed it seems I've only gotten a copy through the mailing list. How do you suppose I will reply to this mail, if your mail does not work? 8)8) I apologize, I have to route reply through netdev. All the attempts to connect to your MXs finish with "Connection timed out". No response for SYN. Ping works. traceroute does not, showing blackhole after router 128.214.231.10. Apparently, you are firewalled by some totally misconfigured firewall. It is not a good idea to firewall mail servers in this way. 
Alexey From owner-netdev@oss.sgi.com Tue Jun 6 12:20:35 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 12:11:05 -0700 Received: from zada.math.leidenuniv.nl ([132.229.231.3]:27145 "EHLO zada.math.leidenuniv.nl") by oss.sgi.com with ESMTP id ; Tue, 6 Jun 2000 12:10:43 -0700 Received: from mara.math.leidenuniv.nl (IDENT:buytenh@mara.math.leidenuniv.nl [132.229.232.80]) by zada.math.leidenuniv.nl (8.9.3/8.9.3) with ESMTP id VAA02021; Tue, 6 Jun 2000 21:09:51 +0200 Date: Tue, 6 Jun 2000 21:09:51 +0200 (MEST) From: Lennert Buytenhek X-Sender: buytenh@mara.math.leidenuniv.nl To: Andrey Savochkin cc: Mitchell Blank Jr , Ben Greear , rob@valinux.com, netdev@oss.sgi.com, Gleb Natapov , jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <20000605102627.A8473@saw.sw.com.sg> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, 5 Jun 2000, Andrey Savochkin wrote: > I think that the current VLAN implementation slightly abuses the > notion of device. And it doesn't relate to the number of devices and > the efficiency of search algorithms. The current VLAN implementation > is a pure packet-mangling code. It misses one of the most important > properties of network devices - flow control. Any code that doesn't > provide flow control isn't a device, but a code just manipulating of > packet contents. From this I may conclude that current (2.3) bridging is broken too as it works as a device?
greetings, Lennert From owner-netdev@oss.sgi.com Tue Jun 6 12:20:35 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 12:20:24 -0700 Received: from zada.math.leidenuniv.nl ([132.229.231.3]:26889 "EHLO zada.math.leidenuniv.nl") by oss.sgi.com with ESMTP id ; Tue, 6 Jun 2000 12:10:42 -0700 Received: from mara.math.leidenuniv.nl (IDENT:buytenh@mara.math.leidenuniv.nl [132.229.232.80]) by zada.math.leidenuniv.nl (8.9.3/8.9.3) with ESMTP id VAA02011; Tue, 6 Jun 2000 21:09:34 +0200 Date: Tue, 6 Jun 2000 21:09:34 +0200 (MEST) From: Lennert Buytenhek X-Sender: buytenh@mara.math.leidenuniv.nl To: Andrey Savochkin cc: Mitchell Blank Jr , Ben Greear , rob@valinux.com, netdev@oss.sgi.com, Gleb Natapov , jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <20000605102627.A8473@saw.sw.com.sg> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, 5 Jun 2000, Andrey Savochkin wrote: > The current kernel infrastructure for packet mangling may still need > some adjustments, but it at least exists. I'm encouraging to consider > VLAN implementation as just a netfilter module. "All the world is an IP net"? How should I run IPX over my VLANs then? 
greetings, Lennert From owner-netdev@oss.sgi.com Tue Jun 6 12:38:06 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 12:37:56 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:57354 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 6 Jun 2000 12:37:32 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA07893; Tue, 6 Jun 2000 23:24:47 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006061924.XAA07893@ms2.inr.ac.ru> Subject: Re: Slow TCP connection between linux and wince To: amlaukka@cc.helsinki.fi (Aki M Laukkanen) Date: Tue, 6 Jun 2000 23:24:47 +0400 (MSK DST) Cc: linux-kernel@vger.rutgers.edu, netdev@oss.sgi.com, iwtcp@cs.helsinki.fi In-Reply-To: from "Aki M Laukkanen" at Jun 5, 0 07:09:51 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 1807 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > The disparity between this test and the available send window is the > cause of the bursts. The analysis is correct, I think. > this behaviour. Following patch changes wmem_alloc to only include > the actual data and it seems to work. This is a hackish approach at > best though. It is not hackish, it is rather buggish. 8) You cannot mangle truesize, eben with good intention. Try better to tune tcp_min_write_space(). I want to think it is tunable. Also, you could select larger sndbuf for such funny links. Combination of ridiculously low MSS with utterly high cwnd is highly non-standard situation. > Our second problem with this disparity is on the receive side. The scenario > is essentially the same but with an unreliable link (read wireless) which > drops packets. In case of packet drop receiver keeps building an > out-of-order queue which grows to the limit of the receive buffer > quite quickly. However sender keeps sending more because of the difference > between advertised window and the actual allocated space. 
This triggers > tcp_input.c:prune_queue() which purges the whole out-of-order queue to > free up space, thus killing the TCP performance quite effectively. TCP performance is killed not by pruning, but rather by packet drop. 8) Yes, pruning should be a bit less aggressive. I will repair this. > The fix in our internal use is similar to the rmem_alloc case. I do think > both of these situations are quite valid. I am not so sure about the correct > fix though. Common fix is not to allow cwnd to grow to such huge values on lossy links. In any case, try net-xxyyzz.dif.gz from ftp://ftp.inr.ac.ru/ip-routing/. It will not be better, I think, but at least you will discover when it is worse. 8) Finer pruning will appear tomorrow, I hope. Alexey From owner-netdev@oss.sgi.com Tue Jun 6 14:54:16 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 14:54:07 -0700 Received: from sleeper.apana.org.au ([129.78.226.233]:21778 "HELO sleeper.apana.org.au") by oss.sgi.com with SMTP id ; Tue, 6 Jun 2000 14:53:57 -0700 Received: by sleeper.apana.org.au (Postfix, from userid 500) id 64EFC7537; Wed, 7 Jun 2000 06:58:01 +1000 (EST) Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? To: netdev@oss.sgi.com Date: Wed, 7 Jun 2000 06:58:00 +1000 (EST) In-Reply-To: from "Lennert Buytenhek" at Jun 06, 2000 09:09:34 PM X-Mailer: ELM [version 2.5 PL0pre8] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: <20000606205801.64EFC7537@sleeper.apana.org.au> From: matthew@sleeper.apana.org.au (Matthew Geier) Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > > > > On Mon, 5 Jun 2000, Andrey Savochkin wrote: > > > The current kernel infrastructure for packet mangling may still need > > some adjustments, but it at least exists. I'm encouraging to consider > > VLAN implementation as just a netfilter module. > > "All the world is an IP net"? How should I run IPX over my VLANs then? 
>

I have a more than passing interest in the VLAN stuff, as I have a network that uses them. My campus uses VLANs to separate the different users, with my faculty using 6 VLANs to split the various units up. (Currently the trunks are CISCO ISL, but the system is being upgraded to gigabit trunks that are .1q.)

It is possible that at some stage I would like to run a server with a foot in at least 4 VLANs, so that people accessing that server would not have a router hop. (And since AppleTalk (and IPX for other departments) is at least as important as TCP/IP, no layer 3 switch vendor is game, and I can't put that many AppleTalk stations in the one VLAN without an AppleTalk broadcast meltdown....)

Any VLAN implementation that doesn't allow me to fire up Samba and NetAtalk and have the 2 programs just discover the interfaces and do the right SMB broadcasting and AppleTalk stuff on each isn't actually any use.

I certainly wouldn't be trying to replicate the routing functions of the CISCO RSM cards in the 2 central switching centres on my campus. I just want applications to run...
From owner-netdev@oss.sgi.com Tue Jun 6 15:58:55 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 15:58:36 -0700 Received: from mailhost.uni-koblenz.de ([141.26.64.1]:49887 "EHLO mailhost.uni-koblenz.de") by oss.sgi.com with ESMTP id ; Tue, 6 Jun 2000 15:58:25 -0700 Received: from cacc-30.uni-koblenz.de (cacc-30.uni-koblenz.de [141.26.131.30]) by mailhost.uni-koblenz.de (8.9.3/8.9.3) with ESMTP id AAA14892 for ; Wed, 7 Jun 2000 00:58:22 +0200 (MET DST) Received: (ralf@lappi) by lappi.waldorf-gmbh.de id ; Tue, 6 Jun 2000 20:25:57 +0200 Date: Tue, 6 Jun 2000 20:25:57 +0200 From: Ralf Baechle To: netdev@oss.sgi.com Subject: RTNL_ASSERT failure Message-ID: <20000606202557.A31680@uni-koblenz.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i X-Accept-Language: de,en,fr Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

Attempting to boot a diskless machine with two Ethernet cards using BOOTP results in this:

> Sending BOOTP requests.....
OK
> IP-Config: Got BOOTP answer from 10.98.169.23, my address is 10.98.169.25
> RTNL: assertion failed at devinet.c(785):inetdev_event
> RTNL: assertion failed at devinet.c(785):inetdev_event

Patch below,

  Ralf

Index: net/ipv4/ipconfig.c
===================================================================
RCS file: /usr/src/cvs/linux/net/ipv4/ipconfig.c,v
retrieving revision 1.26
diff -u -r1.26 ipconfig.c
--- ipconfig.c	2000/05/13 15:51:59	1.26
+++ ipconfig.c	2000/06/07 16:26:39
@@ -167,6 +167,7 @@
 	struct ic_device *d, *next;
 	struct net_device *dev;
 
+	rtnl_shlock();
 	next = ic_first_dev;
 	while ((d = next)) {
 		next = d->next;
@@ -177,6 +178,7 @@
 		}
 		kfree_s(d, sizeof(struct ic_device));
 	}
+	rtnl_shunlock();
 }
 
 /*

From owner-netdev@oss.sgi.com Tue Jun 6 20:00:47 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 20:00:37 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:14208 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Tue, 6 Jun 2000 20:00:27 -0700 Received: (qmail 4319 invoked by uid 577); 7 Jun 2000 03:00:24 -0000 Message-ID: <20000607110024.A4304@saw.sw.com.sg> Date: Wed, 7 Jun 2000 11:00:24 +0800 From: Andrey Savochkin To: Andi Kleen Cc: Gleb Natapov , Mitchell Blank Jr , Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il, jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???
References: <3938611E.D074F254@candelatech.com> <20000603091818.B48132@sfgoth.com> <20000605102627.A8473@saw.sw.com.sg> <393B56DD.34A83A7D@nbase.co.il> <20000605154657.D10091@saw.sw.com.sg> <20000606112652.A912@fred.muc.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: <20000606112652.A912@fred.muc.de>; from "Andi Kleen" on Tue, Jun 06, 2000 at 11:26:52AM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hi Andi, On Tue, Jun 06, 2000 at 11:26:52AM +0200, Andi Kleen wrote: > On Mon, Jun 05, 2000 at 05:57:16PM +0200, Andrey Savochkin wrote: > > I do not know how Zebra works, but the design described by you looks very > > broken at the first glance. If you run routing managements software on your > > system you should perform all kernel state changes only through this > > software. Thus, the software do not need any kernel feedback about device/ip > > state except the confirmations of its own commands. > > I don't think it is broken. It seems to me that one of the design goals > of netlink was to make it possible for multiple routing daemons (including > the ``admin daemon'') to coexist nicely. Moving all policy into a big > monolithic program would look broken for me. OK, point taken. But netfilter module may generate messages for routing daemons as well. Best regards Andrey From owner-netdev@oss.sgi.com Tue Jun 6 20:32:08 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 20:31:58 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:18304 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Tue, 6 Jun 2000 20:31:38 -0700 Received: (qmail 4396 invoked by uid 577); 7 Jun 2000 03:31:35 -0000 Message-ID: <20000607113135.B4304@saw.sw.com.sg> Date: Wed, 7 Jun 2000 11:31:35 +0800 From: Andrey Savochkin To: Lennert Buytenhek Cc: Mitchell Blank Jr , Ben Greear , rob@valinux.com, netdev@oss.sgi.com, Gleb Natapov , jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
References: <20000605102627.A8473@saw.sw.com.sg> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: ; from "Lennert Buytenhek" on Tue, Jun 06, 2000 at 09:09:51PM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

On Tue, Jun 06, 2000 at 09:09:51PM +0200, Lennert Buytenhek wrote:
> On Mon, 5 Jun 2000, Andrey Savochkin wrote:
> > I think that the current VLAN implementation slightly abuses the
> > notion of device. And it doesn't relate to the number of devices and
> > the efficiency of search algorithms. The current VLAN implementation
> > is a pure packet-mangling code. It misses one of the most important
> > properties of network devices - flow control. Any code that doesn't
> > provide flow control isn't a device, but a code just manipulating of
> > packet contents.
>
> From this I may conclude that current (2.3) bridging is broken too as it
> works as a device?

Packet switching inside a bridge group doesn't need the bridge device.

In my opinion, a bridge virtual interface (in Cisco terms) may be considered a network device. It provides (or should provide) very non-trivial hard_header(), hard_header_cache() and similar functions determining what link to use for output of the packet. A bridge doesn't touch packets; instead, it provides a virtual link which has its own notion of peer addressing (selection of a device from the bridge group plus link-level address) and transmission. When it comes to flow control, this notion can't be easily applied to bridges.

VLANs are the opposite of bridges. They use very simple hard header building methods (essentially, eth_header), and their flow control is directly based on the flow control of the underlying device. Whether you just requeue packets or set __LINK_STATE_XOFF in accordance with the underlying device, it doesn't matter. It's the underlying device which does flow control for you.

Best regards Andrey V.
Savochkin From owner-netdev@oss.sgi.com Tue Jun 6 20:54:38 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 20:54:18 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:18560 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Tue, 6 Jun 2000 20:53:51 -0700 Received: (qmail 4503 invoked by uid 577); 7 Jun 2000 03:53:49 -0000 Message-ID: <20000607115349.A4259@saw.sw.com.sg> Date: Wed, 7 Jun 2000 11:53:49 +0800 From: Andrey Savochkin To: jetienne@arobas.net, netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) References: <20000604180651.A678@long-haul.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: <20000604180651.A678@long-haul.net>; from "Jerome Etienne" on Sun, Jun 04, 2000 at 06:06:51PM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, On Sun, Jun 04, 2000 at 06:06:51PM -0400, Jerome Etienne wrote: > This feature is usefull to support VRRP(rfc2338) which requires > to answer the ARP request for a 'virtual ip' with the proper > 'virtual MAC'(rfc2338 section 8.2). So to potentially answer > a particular MAC for a particular IP, and not with the primary > MAC. As far as i know, it is currently impossible because linux > assumes to have a single MAC per physical interface. My plan is > to prevent the kernel from answering for the virtual ip addresses > and to answer from userspace. Keeping the policy decision in user-space is a wise solution. But you may just set NOARP flag for the device and do all the stuff in user-land merging your 'virtual MAC' logic with any ARP daemon. Best regards Andrey V. 
Savochkin

From owner-netdev@oss.sgi.com Tue Jun 6 21:39:58 2000 Received: by oss.sgi.com id ; Tue, 6 Jun 2000 21:39:48 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:26892 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Tue, 6 Jun 2000 21:39:25 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id WAA29544; Tue, 6 Jun 2000 22:15:46 -0700 Message-ID: <393DDA82.C38EB9E4@candelatech.com> Date: Tue, 06 Jun 2000 22:15:46 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: Support for many devices in the kernel (Was 802.1q) References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

Aside from philosophical arguments, it seems to me that technically, many devices could be easily supported by the kernel. As someone pointed out (sorry, deleted the mail), other than the dev_get_by[index/name], there is one other place, in the IP stack, that ***may*** be called often that walks the device list.

1) So, if we optimize the dev_get_by_name with some sort of hash or tree, getting the performance to at least O(log(N)), where N is the number of devices, then that should satisfy dev_get_by_name().

2) Optimize the dev_get_by_index the same way (If I do it, I'll probably try to write some sort of self-adjusting hash-table-like thing.) A tree should work just as well. For extra flexibility, if a hashtable is used, then there can be compile-time hints given to size the hash, or we could allow setting it through either IOCTLs or the procfs.

3) Investigate the linear walk in IP land.
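[A minimal sketch of the name-hashed lookup suggested in points 1) and 2), assuming a fixed-size chained hash table whose size is a compile-time hint. The names dev_entry, dev_table, dev_hash_name, dev_add and dev_lookup_by_name are hypothetical and are not kernel interfaces; in the kernel the entries would be struct net_device and the table would need locking, which is left out here.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEV_HASH_BITS 8                      /* compile-time size hint (assumption) */
#define DEV_HASH_SIZE (1u << DEV_HASH_BITS)
#define DEV_NAME_MAX  16

/* Hypothetical stand-in for the parts of a net_device the lookup needs. */
struct dev_entry {
    char name[DEV_NAME_MAX];
    int ifindex;
    struct dev_entry *next;                  /* chain of entries sharing a bucket */
};

struct dev_table {
    struct dev_entry *bucket[DEV_HASH_SIZE];
};

/* Simple multiplicative string hash, reduced to a bucket index. */
unsigned int dev_hash_name(const char *name)
{
    unsigned int h = 5381;

    while (*name)
        h = h * 33u + (unsigned char)*name++;
    return h & (DEV_HASH_SIZE - 1);
}

/* Insert at the head of the bucket chain. */
void dev_add(struct dev_table *t, struct dev_entry *e)
{
    unsigned int h;

    assert(t != NULL && e != NULL);
    h = dev_hash_name(e->name);
    e->next = t->bucket[h];
    t->bucket[h] = e;
}

/* Walk only the one bucket the name hashes to, not the whole device list. */
struct dev_entry *dev_lookup_by_name(const struct dev_table *t, const char *name)
{
    struct dev_entry *e;

    for (e = t->bucket[dev_hash_name(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}
```

[With enough buckets the expected chain length stays short regardless of the number of devices; a lookup by ifindex could use the same shape with the index as the key.]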
One thing I didn't bring up before: If the IP code needs to walk all 'routable things', then even if VLANs are not devices, this code will probably need to walk them. Either way, we need to investigate this, and optimize it if it proves to be called often, and is possible to optimize. TODO: Find what this method was so we can track it down. If these three changes were made, does anyone see any reason to exclude them from the kernel? Does anyone know of any other performance bottlenecks? Any volunteers? I will do it eventually if no one else does it, but it may be a week or three before I get the time... Enjoy, Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Wed Jun 7 03:07:29 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 03:07:19 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:20727 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 03:06:55 -0700 Received: from fred.muc.de (none@ns1040.munich.netsurf.de [195.180.235.40]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id MAA10239; Wed, 7 Jun 2000 12:06:34 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 12zSLY-000185-00; Wed, 7 Jun 2000 01:01:20 +0200 Date: Wed, 7 Jun 2000 01:01:20 +0200 From: Andi Kleen To: Lennert Buytenhek Cc: Andrey Savochkin , Mitchell Blank Jr , Ben Greear , rob@valinux.com, netdev@oss.sgi.com, Gleb Natapov , jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
Message-ID: <20000607010120.A4334@fred.muc.de> References: <20000605102627.A8473@saw.sw.com.sg> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: ; from Lennert Buytenhek on Tue, Jun 06, 2000 at 09:21:39PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Tue, Jun 06, 2000 at 09:21:39PM +0200, Lennert Buytenhek wrote: > > > On Mon, 5 Jun 2000, Andrey Savochkin wrote: > > > The current kernel infrastructure for packet mangling may still need > > some adjustments, but it at least exists. I'm encouraging to consider > > VLAN implementation as just a netfilter module. > > "All the world is an IP net"? How should I run IPX over my VLANs then? Netfilter is not an IP only thing. It is a generic framework for packet mangling. Although currently only IPv4 and IPv6 netfilter implementations exist it would be no big problem to add ``raw ethernet'' netfilter hooks. -Andi -- This is like TV. I don't like TV. 
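[For context on what a "raw ethernet" hook would have to look at on a VLAN trunk, here is a minimal sketch of recognising an 802.1Q tag in a frame. The TPID value 0x8100 and the 3/1/12-bit layout of the tag control field come from IEEE 802.1Q; the struct and function names are hypothetical and do not correspond to any netfilter interface.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_LEN     6
#define TPID_8021Q  0x8100u

/* Hypothetical result structure for the sketch. */
struct vlan_info {
    uint16_t vid;        /* 12-bit VLAN identifier */
    uint8_t  priority;   /* 3-bit user priority */
    uint16_t ethertype;  /* type of the encapsulated protocol */
};

/*
 * Returns 1 and fills *out if the frame carries an 802.1Q tag,
 * 0 if it is untagged or too short to hold one.
 */
int parse_vlan_tag(const uint8_t *frame, size_t len, struct vlan_info *out)
{
    const uint8_t *p;
    uint16_t tpid, tci;

    assert(frame != NULL && out != NULL);
    if (len < 2 * MAC_LEN + 6)          /* dst + src + TPID + TCI + type */
        return 0;

    p = frame + 2 * MAC_LEN;            /* skip destination and source MAC */
    tpid = (uint16_t)((p[0] << 8) | p[1]);
    if (tpid != TPID_8021Q)
        return 0;

    tci = (uint16_t)((p[2] << 8) | p[3]);
    out->priority  = (uint8_t)(tci >> 13);
    out->vid       = (uint16_t)(tci & 0x0fffu);
    out->ethertype = (uint16_t)((p[4] << 8) | p[5]);
    return 1;
}
```

[The encapsulated type is returned as-is, so the same path would serve IP, IPX or AppleTalk frames alike.]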
From owner-netdev@oss.sgi.com Wed Jun 7 04:06:50 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 04:06:39 -0700 Received: from sirppi.helsinki.fi ([128.214.205.27]:52998 "EHLO sirppi.helsinki.fi") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 04:06:14 -0700 Received: from localhost (amlaukka@localhost) by sirppi.helsinki.fi (8.10.1/8.10.1) with ESMTP id e57B50602783; Wed, 7 Jun 2000 14:05:04 +0300 (EET DST) X-Authentication-Warning: sirppi.helsinki.fi: amlaukka owned process doing -bs Date: Wed, 7 Jun 2000 14:05:00 +0300 (EET DST) From: Aki M Laukkanen To: kuznet@ms2.inr.ac.ru cc: netdev@oss.sgi.com, iwtcp@cs.helsinki.fi Subject: Re: Slow TCP connection between linux and wince In-Reply-To: <200006061924.XAA07893@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Tue, 6 Jun 2000 kuznet@ms2.inr.ac.ru wrote: > It is not hackish, it is rather buggish. 8) > You cannot mangle truesize, eben with good intention. Yes, that was just to test the theory. > Try better to tune tcp_min_write_space(). I want to think it is tunable. To me it seems this would be trying to fix the symptom rather than the cause. There seems to be a certain assumption in place here that the default socket buffer of 65536 is good for all connections regardless of MTU. This is only true if we don't do meta-data accounting in {wmem|rmem}_alloc. > Also, you could select larger sndbuf for such funny links. Funny links, well I wouldn't say so. I don't think many people recognize the problem and as such will not select larger socket buffer sizes. Perhaps an option to auto-tune the socket buffer size is needed so that the condition, send buffer should always be large enough to contain two windows worth of segments, would hold true. > Combination of ridiculously low MSS with utterly high cwnd > is highly non-standard situation. 
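[The accounting argument in the reply above (socket buffer limits charged per buffer rather than per payload byte) can be put into rough numbers. This is a sketch only: it assumes every queued segment is charged mss + overhead bytes against sndbuf, where overhead is a hypothetical per-segment cost standing in for the metadata and unused buffer space; the function names are made up for illustration and the exact kernel accounting differs in detail.]

```c
#include <assert.h>

/*
 * Approximate payload bytes that fit in a send buffer of 'sndbuf' bytes
 * when each segment of 'mss' payload bytes is charged 'mss + overhead'.
 */
unsigned long payload_capacity(unsigned long sndbuf, unsigned long mss,
                               unsigned long overhead)
{
    assert(mss > 0);
    return (sndbuf / (mss + overhead)) * mss;
}

/*
 * Send buffer size needed to hold two windows' worth of segments,
 * the condition proposed in the message, under the same charging rule.
 */
unsigned long min_sndbuf_for_window(unsigned long window, unsigned long mss,
                                    unsigned long overhead)
{
    unsigned long segs;

    assert(mss > 0);
    segs = (window + mss - 1) / mss;    /* segments per window, rounded up */
    return 2 * segs * (mss + overhead);
}
```

[With a 65536-byte buffer and no overhead, 1460-byte segments carry almost the whole buffer as payload; with 256-byte segments and an assumed 256 bytes of overhead each, only half of it is payload, which is the disparity under discussion.]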
I don't think this is a highly non-standard situation. Even if we ignore slow wireless links (e.g. GSM data), to some extent every modem link exhibits this behaviour. Have you read draft-ietf-pilc-slow-03.txt or rfc2757.txt? They are a good read for a little perspective on the problem space.

> > Our second problem with this disparity is on the receive side. The scenario
> > is essentially the same but with an unreliable link (read wireless) which
> > drops packets. In case of packet drop receiver keeps building an
> > out-of-order queue which grows to the limit of the receive buffer
> > quite quickly. However sender keeps sending more because of the difference
> > between advertised window and the actual allocated space. This triggers
> > tcp_input.c:prune_queue() which purges the whole out-of-order queue to
> > free up space, thus killing the TCP performance quite effectively.
>
> TCP performance is killed not by pruning, but rather by packet drop. 8)

Everything is relative. Retransmitting a single packet versus having to retransmit the whole window amounts to 112 seconds versus 160 seconds when transmitting 100KB in this particular test case. But this is beside the point. I think I can argue that the receiver should never advertise a window bigger than it is prepared to receive. This does not just affect lossy links. A single packet drop due to congestion is quite a valid scenario, since routers deploy active congestion management schemes. You can calculate a threshold point for the MTU at which the window calculation starts to break.

/*
 * How much of the receive buffer do we advertize
 * (the rest is reserved for headers and driver packet overhead)
 * Use a power of 2.
 */
#define TCP_WINDOW_ADVERTISE_DIVISOR 2

For this divisor the threshold is something like (1536+130)/2 = 833 bytes for typical ethernet drivers. In the case of PPP it is a bit different.

> Yes, pruning should be a bit less aggressive. I will repair this.

Does this mean the changes described here?
Doing it the other way (not killing the whole ofo-queue) will still cause further packet loss. /* THIS IS _VERY_ GOOD PLACE to play window clamp. * if free_space becomes suspiciously low * verify ratio rmem_alloc/(rcv_nxt - copied_seq), * and if we predict that when free_space will be lower mss, * rmem_alloc will run out of rcvbuf*2, shrink window_clamp. * It will eliminate most of prune events! Very simple, * it is the next thing to do. --ANK > In any case, try net-xxyyzz.dif.gz from ftp://ftp.inr.ac.ru/ip-routing/. > It will not be better, I think, but at least you will discover when Will do. From owner-netdev@oss.sgi.com Wed Jun 7 04:21:09 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 04:20:58 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:18602 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 04:20:49 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id HAA22674; Wed, 7 Jun 2000 07:18:45 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id HAA23282; Wed, 7 Jun 2000 07:18:45 -0400 (EDT) Date: Wed, 7 Jun 2000 07:18:45 -0400 (EDT) From: jamal To: Andi Kleen cc: Lennert Buytenhek , Andrey Savochkin , Mitchell Blank Jr , Ben Greear , rob@valinux.com, netdev@oss.sgi.com, Gleb Natapov Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <20000607010120.A4334@fred.muc.de> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Wed, 7 Jun 2000, Andi Kleen wrote: > On Tue, Jun 06, 2000 at 09:21:39PM +0200, Lennert Buytenhek wrote: > > Netfilter is not an IP only thing. It is a generic framework for > packet mangling. Although currently only IPv4 > and IPv6 netfilter implementations exist it would be no big problem > to add ``raw ethernet'' netfilter hooks. 
> Andi, Isnt packet type sufficient for this today? or are you talking about hooks based in addition to things like src/dst MACs? In regards to netlink and devices and the route daemons: I agree that netlink would be the best for daemons to use for anything that is routable. [compare to some daemons (MERIT?) which used to/maybe still are polling /proc ;->]. But does it have to be a _device_ to use netlink? i think not. cheers, jamal From owner-netdev@oss.sgi.com Wed Jun 7 05:00:48 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 05:00:38 -0700 Received: from dhcp41.toaster.net ([199.108.84.41]:31751 "EHLO schmee.sfgoth.com") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 05:00:20 -0700 Received: (from mitch@localhost) by schmee.sfgoth.com (8.9.3/8.9.3) id EAA34906; Wed, 7 Jun 2000 04:59:36 -0700 (PDT) Date: Wed, 7 Jun 2000 04:59:36 -0700 From: Mitchell Blank Jr To: jamal Cc: Andrey Savochkin , Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? Message-ID: <20000607045935.G7740@sfgoth.com> References: <20000605053533.C77216@sfgoth.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: ; from hadi@cyberus.ca on Tue, Jun 06, 2000 at 07:49:01AM -0400 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > [ A lot of arguement on how anything which can do IP is a device, deleted] > Yes people, what is net device? ;-> It routes, it ARPs, it cooks ... it > must be a net device. Yes, that's pretty much the point. > > I think a reasonable goal for VLAN support would be that a machine > > with one ethernet card on two VLANs should be able to do pretty much > > the same as a machine with two seperate ethernet cards, with similar > > configuration commands. That is, after all, the promise of VLANs. > > And you can do this just fine with VLANS abstracted on top of devices. 
OK, explain how. I've enumerated a dozen or more places either in protocol support or in the user-kernel interface where the configuration or functionality depends on a named device per individual vlan. You keep ignoring them when I point them out, but they are still there. For instance, if both the VLANs are called "eth0", how do I set individual settings in /proc/sys/net/ipv4/conf/ for the two different VLANs? Or any of the other ones I pointed out - take your pick. > > Flow control might not be perfect in VLANs, but that's really the > > least of its problems. The linux model of handling flow control > > only can really handle simple devices that can be modeled as a FIFO > > queue. > > What "linux model of handling flow control"? > What would the other model be? FIFOs are the basic building blocks. Yes, FIFOs are the basic building blocks. The current model is "a net_device must be modeled as a single FIFO before the xmit". This isn't right. Some devices don't have any FIFO associated with them. Either they don't need one because they are trivial (dummy, lo) or they just feed something else that _does_ have a FIFO (ethernet bridge device, anything that tunnels L2-in-L2 like PPPoE, or these proposed VLAN devices). It doesn't really _hurt_ much to have these modeled with an unused FIFO in front, other than some TC stuff can't work right since there's no backpressure. The other case is worse - things that have multiple FIFOs. As I pointed out, the obvious one is any ATM protocol that uses multiple VCs. Since each VC is an individual FIFO you can't tell if you're supposed to queue a packet until you've already looked at the header and determined which VC to route out. Doing a netif_stop_queue() causes all the VCs to stop being fed. Really we need a new structure "net_fifo?" in order to divorce the queuing from the concept of a net_device. 
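[To make the multiple-FIFO case concrete, here is a small sketch of queue state kept per FIFO rather than per device, so that one full VC does not stop the others. "net_fifo" is only the name floated in the message; its fields, the multi_fifo_dev wrapper and the functions below are assumptions made for illustration, not an existing kernel structure.]

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FIFOS 4

/* Hypothetical per-FIFO queue state, e.g. one per ATM VC. */
struct net_fifo {
    unsigned int len;     /* packets currently queued */
    unsigned int limit;   /* capacity of this FIFO */
    int stopped;          /* back-pressure flag for this FIFO only */
};

struct multi_fifo_dev {
    struct net_fifo fifo[MAX_FIFOS];
};

void mfd_init(struct multi_fifo_dev *dev, unsigned int limit)
{
    size_t i;

    assert(dev != NULL && limit > 0);
    for (i = 0; i < MAX_FIFOS; i++) {
        dev->fifo[i].len = 0;
        dev->fifo[i].limit = limit;
        dev->fifo[i].stopped = 0;
    }
}

/* Queue one packet on FIFO 'idx'. Returns 1 if accepted, 0 if that FIFO is stopped. */
int fifo_xmit(struct multi_fifo_dev *dev, size_t idx)
{
    struct net_fifo *f;

    assert(dev != NULL && idx < MAX_FIFOS);
    f = &dev->fifo[idx];
    if (f->stopped)
        return 0;
    f->len++;
    if (f->len >= f->limit)
        f->stopped = 1;   /* stop only this FIFO, not the whole device */
    return 1;
}

/* One packet on FIFO 'idx' has left the hardware; wake that FIFO if it has room. */
void fifo_tx_done(struct multi_fifo_dev *dev, size_t idx)
{
    struct net_fifo *f;

    assert(dev != NULL && idx < MAX_FIFOS);
    f = &dev->fifo[idx];
    if (f->len > 0)
        f->len--;
    if (f->len < f->limit)
        f->stopped = 0;
}
```

[Contrast this with a single per-device stop flag, where filling one VC would refuse packets destined for every other VC as well.]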
Anything that wants to look like a net_device to userland and the protocol stacks should be one, whether or not you feel it is just a packet mangler.

-Mitch

From owner-netdev@oss.sgi.com Wed Jun 7 05:08:38 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 05:08:29 -0700 Received: from dhcp41.toaster.net ([199.108.84.41]:36103 "EHLO schmee.sfgoth.com") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 05:08:23 -0700 Received: (from mitch@localhost) by schmee.sfgoth.com (8.9.3/8.9.3) id FAA34984; Wed, 7 Jun 2000 05:08:14 -0700 (PDT) Date: Wed, 7 Jun 2000 05:08:13 -0700 From: Mitchell Blank Jr To: Ben Greear Cc: jamal , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: Support for many devices in the kernel (Was 802.1q) Message-ID: <20000607050813.H7740@sfgoth.com> References: <393DDA82.C38EB9E4@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: <393DDA82.C38EB9E4@candelatech.com>; from greearb@candelatech.com on Tue, Jun 06, 2000 at 10:15:46PM -0700 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

Ben Greear wrote:
> As someone pointed
> out (sorry, deleted the mail), other than the dev_get_by[index/name], there
> is one other place, in the IP stack, that ***may*** be called often that
> walks the device list.

Well, there are others, but they're in weird protocols (decnet, rose, netrom). I doubt we'll ever need to support large numbers of devices running those protocols. If we want them to perform well even with lots of devices running other protocols, we can have them keep their own list of the net_devices they are using. IPX does this, for instance.

> 3) Investigate the linear walk in IP land.

Someone should try putting a counter there to report how many times that code actually gets hit in normal use.
> One thing I didn't bring up
> before: If the IP code needs to walk all 'routable things', then even
> if VLANs are not devices, this code will probably need to walk them.

Exactly right. It specifically searches each device and for each one linearly searches all the "ifa" structures on the device. So if all the VLANs were using the same net_device there would still be as many ifa's total. In fact, the situation would be far, far worse since there are a lot of other places in the ipv4 code that the ifa list on a particular device gets searched.

-Mitch

From owner-netdev@oss.sgi.com Wed Jun 7 05:08:48 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 05:08:38 -0700 Received: from princess.super.nu ([216.169.96.45]:12301 "EHLO princess.super.nu") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 05:08:27 -0700 Received: (from blind-fire@localhost) by princess.super.nu (8.8.8/8.8.8) id IAA01767 for netdev@oss.sgi.com; Wed, 7 Jun 2000 08:06:35 -0400 Date: Wed, 7 Jun 2000 08:06:35 -0400 From: "blind-fire.net" Message-Id: <200006071206.IAA01767@princess.super.nu> To: netdev@oss.sgi.com Subject: ipchains fw, portforwarding and masquerading Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

Hi,

I got your email from Alan. I have this problem with/question about our ipchains setup:

We have a linux box with 3 NICs: 1 for the internet connection, 1 for the intranet and 1 for our production machines. The linux box forwards some internet IPs to the machines in the production segment. The problem is that the users in the intranet cannot reach the production machines in the production segment.

In what order are the ipchains firewall, the masquerading and the portforwarding executed or evaluated? The data has to go through the firewall and then be caught by the portforwarder, but how is this achieved?
The firewall itself works OK and we can work on the production machines, but we cannot surf to them using the IPs that are used on the internet side of the firewall.

Please let me know where I could find info on how to fix this problem.

Thanks in advance!

Udo

From owner-netdev@oss.sgi.com Wed Jun 7 05:13:18 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 05:12:58 -0700 Received: from dhcp41.toaster.net ([199.108.84.41]:39943 "EHLO schmee.sfgoth.com") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 05:12:54 -0700 Received: (from mitch@localhost) by schmee.sfgoth.com (8.9.3/8.9.3) id FAA35034; Wed, 7 Jun 2000 05:12:38 -0700 (PDT) Date: Wed, 7 Jun 2000 05:12:38 -0700 From: Mitchell Blank Jr To: Andi Kleen Cc: Lennert Buytenhek , Andrey Savochkin , Ben Greear , rob@valinux.com, netdev@oss.sgi.com, Gleb Natapov , jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? Message-ID: <20000607051238.I7740@sfgoth.com> References: <20000605102627.A8473@saw.sw.com.sg> <20000607010120.A4334@fred.muc.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: <20000607010120.A4334@fred.muc.de>; from ak@muc.de on Wed, Jun 07, 2000 at 01:01:20AM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

Andi Kleen wrote:
> > "All the world is an IP net"? How should I run IPX over my VLANs then?
>
> Netfilter is not an IP only thing. It is a generic framework for
> packet mangling. Although currently only IPv4
> and IPv6 netfilter implementations exist it would be no big problem
> to add ``raw ethernet'' netfilter hooks.

Netfilter isn't the problem. IPX and Appletalk AARP both assume that each net_device corresponds to one network, and they keep their per-network state there. So if you had, for instance, two different IPX networks on two different VLANs but they were both called eth0, then linux could not support that. The same goes for just about any of the non-IP protocols.
-Mitch From owner-netdev@oss.sgi.com Wed Jun 7 05:28:18 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 05:27:59 -0700 Received: from gw.chygwyn.com ([62.172.158.50]:26632 "EHLO gw.chygwyn.com") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 05:27:36 -0700 Received: (from steve@localhost) by gw.chygwyn.com (8.9.3/8.9.3) id NAA22876; Wed, 7 Jun 2000 13:15:50 +0100 From: Steve Whitehouse Message-Id: <200006071215.NAA22876@gw.chygwyn.com> Subject: Re: Support for many devices in the kernel (Was 802.1q) To: mitch@sfgoth.com (Mitchell Blank Jr) Date: Wed, 7 Jun 2000 13:15:50 +0100 (BST) Cc: greearb@candelatech.com (Ben Greear), hadi@cyberus.ca (jamal), saw@saw.sw.com.sg (Andrey Savochkin), rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il In-Reply-To: <20000607050813.H7740@sfgoth.com> from "Mitchell Blank Jr" at Jun 07, 2000 05:08:13 AM Organization: ChyGywn Limited X-RegisteredOffice: 7, New Yatt Road, Witney, Oxfordshire. OX8 6NU England X-RegisteredNumber: 03887683 Reply-To: Steve Whitehouse X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > > Ben Greear wrote: > > As someone pointed > > out (sorry, deleted the mail), other than the dev_get_by[index/name], there > > is one other place, in the IP stack, that ***may*** be called often that > > walks the device list. > > Well there are others, but they're in weird protocols (decnet, rose, netrom). > I doubt we'll ever need to support large numbers of devices running > those protocols. If we want them to perform well even with lots of > devices running other protocols we can have them keep their own list > of the net_devices they are using. IPX does this, for instance. > As you suggest the places that DECnet uses the device list are few. 
If you want me to put together a patch which make DECnet use its own list of devices let me know... I think the only place it will make any real difference is in the bind() call where we walk the DECnet devices to check one of them has the local address that we are bind()ing to. > > 3) Investigate the linear walk in IP land. > > Someone should try putting a counter there to report who many times that > code actually gets hit on normal use. > > > One thing I didn't bring up > > before: If the IP code needs to walk all 'routable things', then even > > if VLANs are not devices, this code will probably need to walk them. > > Exactly right. It specifically searches each device and for each > one linearly searches all the "ifa" structures on the device. > So if all the VLANs were using the same net_device there would > still be as many ifa's total. In fact, the situation would be far, > far worse since there are a lot of other places in the ipv4 code > that the ifa list on a particular device gets searched. > > -Mitch > I think the same applies to DECnet. The routing in DECnet is more or less copied from IPv4, so what works for IP will probably work with DECnet too, Steve. From owner-netdev@oss.sgi.com Wed Jun 7 06:12:18 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 06:12:08 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:31104 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Wed, 7 Jun 2000 06:11:52 -0700 Received: (qmail 8155 invoked by uid 577); 7 Jun 2000 13:11:45 -0000 Message-ID: <20000607211145.A8073@saw.sw.com.sg> Date: Wed, 7 Jun 2000 21:11:45 +0800 From: Andrey Savochkin To: Mitchell Blank Jr Cc: Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il, jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
References: <3938611E.D074F254@candelatech.com> <20000603091818.B48132@sfgoth.com> <20000605102627.A8473@saw.sw.com.sg> <20000605053533.C77216@sfgoth.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: <20000605053533.C77216@sfgoth.com>; from "Mitchell Blank Jr" on Mon, Jun 05, 2000 at 05:35:33AM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, On Mon, Jun 05, 2000 at 05:35:33AM -0700, Mitchell Blank Jr wrote: [snip] > Now, using this form of analysis, at what level is an ethernet VLAN? > Well, what can a VLAN do? Well, it can implement the SIOC[GS]* ioctls > (i.e. it can be taken up or down), it can keep a net_dev_stats, > it can have entries in the ARP table (or AARP for that matter), > it can participate in a bridge group (net_bridge_port->dev), > it can report seperate statistics via SNMP (and thus should have > its own dev->ifindex), it can want different settings in > /proc/sys/net/ipv4/{neigh,conf}/DEV/, can have seperate IPv6 > networks with different autoconfiguration (see ipv6/addrconf.c), > it could have IPX networks of different ipx_dlink_type's, we > could want to bind an AF_PACKET socket to it (tcpdump, etc), > we could want to make filtering or policy route decisions based > on incoming VLAN, etc. > > In short, a VLAN can do just about anything a "real" ethernet interface > can do - the only exceptions are that it probably shouldn't be split > into another level of VLAN's (well, it could, but what other switch or > OS would support such a thing? :-), and it needs to coordinate with the [snip] On Wed, Jun 07, 2000 at 04:59:36AM -0700, Mitchell Blank Jr wrote: [snip] > OK, explain how. I've enumerated a dozen or more places either in > protocol support or in the user-kernel interface where the > configuration or functionality depends on a named device per individual > vlan. You keep ignoring them when I point them out, but they are > still there. 
For instance, if both the VLANs are called "eth0", how > do I set individual settings in /proc/sys/net/ipv4/conf/ for the two > different VLANs? Or any of the other ones I pointed out - take your > pick. The majority of the enumerated points is just a matter of saving time for adjusting user-level tools. The fact that something is able provide SNMP statistic doesn't make it a network device, it's ridiculous! Bridging between VLANs is not a problem, it's the question of who calls whom. The logic determining VLAN id may call bridge procedures instead of vice versa. The only point that really needs to be addressed by kernel is /proc/sys/net/ tuning. Per-device IPv4 parameters (like redirect and forwarding policy) fit into netfilter implementation well, but neighbour discovery parameters is a more difficult point. I'm not against VLANs or even against their implementation as netdevices. But we should agree that it's an unnatural way to do things (from kernel perspective, certainly), and we decide to do it this way only because we follow the way of least resistance. Regards Andrey V. Savochkin From owner-netdev@oss.sgi.com Wed Jun 7 07:19:08 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 07:18:58 -0700 Received: from apollo.nbase.co.il ([194.90.137.2]:16390 "EHLO apollo.nbase.co.il") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 07:18:35 -0700 Received: from nbase.co.il ([194.90.136.56]) by apollo.nbase.co.il (Post.Office MTA v3.1.2 release (PO205-101c) ID# 0-44418U200L2S100) with ESMTP id AAA122; Wed, 7 Jun 2000 17:18:13 +0200 Message-ID: <393E596F.1BD16F5D@nbase.co.il> Date: Wed, 07 Jun 2000 14:17:19 +0000 From: Gleb Natapov Organization: NBase-Xyplex X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14vlan i686) X-Accept-Language: en MIME-Version: 1.0 To: Mitchell Blank Jr CC: jamal , Andrey Savochkin , Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
References: <20000605053533.C77216@sfgoth.com> <20000607045935.G7740@sfgoth.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Mitchell Blank Jr wrote: > > Really we need a new structure "net_fifo?" in order to divorce > the queuing from the concept of a net_device. Anything that > wants to look to userland and the protocol stacks as a net_device > should, wheteher or not you feel they are just a packet mangler > or not. > Excellent idea. net_device should represent physical device and provide functionality of network device (hard_start_xmit, set_mac_address, set_multicast_list, ...) but not functionality of second layer (hard_header, hard_header_parse, ...). The functionality of ethernet should be in ethernet net_fifo, functionality of vlan in vlan net_fifo, bridge in bridge net_fifo. net_fifo is what user sees as network interface. Third layer (ip, ipx, ...) communicates only with net_fifo (and not with net_device) and each net_fifo may communicate with one or more net_devices (bridging or bonding). Are there problems with such architecture ? > -Mitch -- Gleb. 
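
The split proposed above could be sketched roughly as follows. This is only an illustration in user-space C: no net_fifo structure exists in the kernel being discussed, and every name with a toy_ prefix, the field layout, and the "send to every lower device" policy are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_LOWER 4
#define TOY_FRAME_MAX 128

struct toy_frame { unsigned char data[TOY_FRAME_MAX]; size_t len; };

/* Physical side: only knows how to put a finished frame on the wire. */
struct toy_net_device {
    int (*hard_start_xmit)(struct toy_net_device *dev,
                           const struct toy_frame *f);
    unsigned long tx_frames;
};

/* Layer-2 side: what userland and layer 3 would see as "the interface". */
struct toy_net_fifo {
    /* Builds the link-layer header; returns its length, 0 on failure. */
    size_t (*hard_header)(struct toy_frame *f, const void *payload,
                          size_t len);
    struct toy_net_device *lower[TOY_MAX_LOWER];
    size_t n_lower;
};

/* A "driver" that just counts frames instead of touching hardware. */
static int toy_wire_xmit(struct toy_net_device *dev,
                         const struct toy_frame *f)
{
    (void)f;
    dev->tx_frames++;
    return 0;
}

/* Ethernet-like header builder: reserves 14 zeroed header bytes. */
static size_t toy_eth_header(struct toy_frame *f, const void *payload,
                             size_t len)
{
    const size_t hlen = 14;
    if (len + hlen > TOY_FRAME_MAX)
        return 0;
    memset(f->data, 0, hlen);
    memcpy(f->data + hlen, payload, len);
    f->len = hlen + len;
    return hlen;
}

/* Layer 3 talks only to the fifo; the fifo talks to the devices. */
static int toy_fifo_xmit(struct toy_net_fifo *fifo, const void *payload,
                         size_t len)
{
    struct toy_frame f;
    size_t i;

    assert(fifo->n_lower <= TOY_MAX_LOWER);
    if (fifo->hard_header(&f, payload, len) == 0)
        return -1;
    for (i = 0; i < fifo->n_lower; i++)
        if (fifo->lower[i]->hard_start_xmit(fifo->lower[i], &f) != 0)
            return -1;
    return 0;
}
```

A fifo with two lower devices behaves like a trivial flooding bridge here; a bonding fifo would pick one lower device instead of looping over all of them.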
From owner-netdev@oss.sgi.com Wed Jun 7 07:53:59 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 07:53:49 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:27411 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 07:53:31 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id IAA32439; Wed, 7 Jun 2000 08:29:45 -0700 Message-ID: <393E6A69.2A51D6B2@candelatech.com> Date: Wed, 07 Jun 2000 08:29:45 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: Gleb Natapov CC: Mitchell Blank Jr , jamal , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <20000605053533.C77216@sfgoth.com> <20000607045935.G7740@sfgoth.com> <393E596F.1BD16F5D@nbase.co.il> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Gleb Natapov wrote: > > Mitchell Blank Jr wrote: > > > > Really we need a new structure "net_fifo?" in order to divorce > > the queuing from the concept of a net_device. Anything that > > wants to look to userland and the protocol stacks as a net_device > > should, wheteher or not you feel they are just a packet mangler > > or not. > > > > Excellent idea. net_device should represent physical device and provide > functionality of network device (hard_start_xmit, set_mac_address, > set_multicast_list, ...) but not functionality of second layer > (hard_header, hard_header_parse, ...). The functionality of ethernet > should be in ethernet net_fifo, functionality of vlan in vlan net_fifo, > bridge in bridge net_fifo. net_fifo is what user sees as network > interface. Third layer (ip, ipx, ...) 
communicates only with net_fifo > (and not with net_device) and each net_fifo may communicate with one or > more net_devices (bridging or bonding). > > Are there problems with such architecture ? Not a bad idea, if there is enough that can be split out, but I see very little that is not common between the current net_device and some net_fifo that can replace it. For example, you will need some sort of hard_start_xmit in the net_device to send the packet to the lower level. There is no particular reason why someone could not implement a queue in a VLAN, for example, that would buffer, and even provide pushback to higher levels, feeding the lower device (which may be ethernet, or even be ppp/FR/ATM (think bridged)). Just because it's not real hardware doesn't mean it doesn't logically hard_start_xmit. Is multicast / mac_address used in anything besides ethernet? If we're trying to abstract all physical net_devices, then we should make sure it really fits, otherwise just stick with what already works. Also, if it becomes clear that the large majority of the net_device functionality would need to go into the net_fifo, I would argue that it may be cleaner to instead create a new lower level, maybe hard_net_device, or something like that so that we don't have to change a well understood interface (net_device)....
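
The point that a VLAN can "logically hard_start_xmit" can be shown with a small user-space sketch. This is not the 802.1q patch itself: struct toy_dev, struct toy_skb and both xmit functions are hypothetical. The only fact taken from outside the thread is the standard 802.1Q tag layout (TPID 0x8100 followed by a 16-bit field whose low 12 bits are the VLAN id), inserted after the two 6-byte MAC addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_FRAME_MAX 1600

struct toy_skb { unsigned char data[TOY_FRAME_MAX]; size_t len; };

/* One device type for both real and stacked interfaces. */
struct toy_dev {
    const char *name;
    int (*hard_start_xmit)(struct toy_skb *skb, struct toy_dev *dev);
    struct toy_dev *lower;      /* NULL for a "real" device */
    uint16_t vlan_id;           /* only used by the VLAN device */
    unsigned long tx_packets;
};

/* "Hardware" transmit: just count the frame. */
static int toy_eth_xmit(struct toy_skb *skb, struct toy_dev *dev)
{
    (void)skb;
    dev->tx_packets++;
    return 0;
}

/* VLAN transmit: insert the 4-byte tag, then hand the frame down. */
static int toy_vlan_xmit(struct toy_skb *skb, struct toy_dev *dev)
{
    assert(dev->lower != NULL);
    if (skb->len < 12 || skb->len + 4 > TOY_FRAME_MAX)
        return -1;
    /* Open a 4-byte gap after destination + source MAC. */
    memmove(skb->data + 16, skb->data + 12, skb->len - 12);
    skb->data[12] = 0x81;                               /* TPID high */
    skb->data[13] = 0x00;                               /* TPID low  */
    skb->data[14] = (unsigned char)((dev->vlan_id >> 8) & 0x0f); /* prio 0 */
    skb->data[15] = (unsigned char)(dev->vlan_id & 0xff);
    skb->len += 4;
    dev->tx_packets++;
    return dev->lower->hard_start_xmit(skb, dev->lower);
}
```

Because both devices share one interface, the caller does not need to know whether it is transmitting on a real device or on a stacked one; each keeps its own counters, and the lower device could just as well be something other than ethernet.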
-- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Wed Jun 7 08:02:29 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 08:02:09 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:1540 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 08:01:59 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id IAA32474; Wed, 7 Jun 2000 08:38:36 -0700 Message-ID: <393E6C7C.994B41F1@candelatech.com> Date: Wed, 07 Jun 2000 08:38:36 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: Andrey Savochkin CC: Mitchell Blank Jr , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il, jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <3938611E.D074F254@candelatech.com> <20000603091818.B48132@sfgoth.com> <20000605102627.A8473@saw.sw.com.sg> <20000605053533.C77216@sfgoth.com> <20000607211145.A8073@saw.sw.com.sg> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andrey Savochkin wrote: > > > The majority of the enumerated points is just a matter of saving time for > adjusting user-level tools. The fact that something is able provide SNMP > statistic doesn't make it a network device, it's ridiculous! Saving time and keeping from having to adjust (the unwashed multitude of) user-level programs are both very important goals. If you don't support things like SNMP the easy way, then every piece of code that must walk all devices, now has to **also** walk every VLAN, and every other new device-like-thing that is put into it's own little abstraction. 
This will quickly bloat the kernel with stuff that may give us no extra performance. > Bridging between VLANs is not a problem, it's the question of who calls whom. > The logic determining VLAN id may call bridge procedures instead of vice > versa. I started to write my own bridging for VLAN, but it really made no sense, since I want to also be able to bridge a VLAN across a FR-PVC, ethernet, and/or an ATM-PVC. For that to work in any sane way, the entities must be of the same type/behavior (ie net_device). > I'm not against VLANs or even against their implementation as netdevices. > But we should agree that it's an unnatural way to do things (from kernel > perspective, certainly), and we decide to do it this way only because we > follow the way of least resistance. I don't agree it's unnatural, and I'm not alone. Lets go back to the original arguments against many devices. (Performance, if I remember correctly.) I believe we can satisfy the performance, and it will mean the least amount of new code in the kernel, (and just as importantly, in user-space.) > > Regards > Andrey V. > Savochkin -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Wed Jun 7 08:03:39 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 08:03:29 -0700 Received: from nero.doit.wisc.edu ([128.104.17.130]:17928 "EHLO nero.doit.wisc.edu") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 08:03:16 -0700 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.8.7/8.8.7) id KAA09444; Wed, 7 Jun 2000 10:02:53 -0500 Message-ID: <20000607100252.B9353@doit.wisc.edu> Date: Wed, 7 Jun 2000 10:02:52 -0500 From: "James R. Leu" To: Ben Greear , Gleb Natapov Cc: Mitchell Blank Jr , jamal , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
Reply-To: jleu@mindspring.com References: <20000605053533.C77216@sfgoth.com> <20000607045935.G7740@sfgoth.com> <393E596F.1BD16F5D@nbase.co.il> <393E6A69.2A51D6B2@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2 In-Reply-To: <393E6A69.2A51D6B2@candelatech.com>; from Ben Greear on Wed, Jun 07, 2000 at 08:29:45AM -0700 Organization: none Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing I think ATM and FR would benifit from the net_fifo idea. On Wed, Jun 07, 2000 at 08:29:45AM -0700, Ben Greear wrote: > Gleb Natapov wrote: > > > > Mitchell Blank Jr wrote: > > > > > > Really we need a new structure "net_fifo?" in order to divorce > > > the queuing from the concept of a net_device. Anything that > > > wants to look to userland and the protocol stacks as a net_device > > > should, wheteher or not you feel they are just a packet mangler > > > or not. > > > > > > > Excellent idea. net_device should represent physical device and provide > > functionality of network device (hard_start_xmit, set_mac_address, > > set_multicast_list, ...) but not functionality of second layer > > (hard_header, hard_header_parse, ...). The functionality of ethernet > > should be in ethernet net_fifo, functionality of vlan in vlan net_fifo, > > bridge in bridge net_fifo. net_fifo is what user sees as network > > interface. Third layer (ip, ipx, ...) communicates only with net_fifo > > (and not with net_device) and each net_fifo may communicate with one or > > more net_devices (bridging or bonding). > > > > Are there problems with such architecture ? > > Not a bad idea, if there is enough that can be split out, but I see very little > that is not common between the current net_device and some net_fifo that can > replace it. For example, you will need some sort of hard_start_xmit in > the net_device to send it to the lower level. 
There is no particular reason > why someone could not implement a queue in a VLAN, for example, that would > buffer, and even provide pushback to higher levels, feeding the lower device > (wich may be ethernet, or even be ppp/FR/ATM (think bridged)). Just because it's not real > hardware doesn't mean it doesn't logically hard_start_xmit. > > Is multi-cast / mac_address used in anything besides ethernet? If we're trying > to abstract all physical net_devices, then we should make sure it really fits, > otherwise just stick with what already works. > > Also, if it becomes clear that the large majority of the net_device functionality > would need to go into the net_fifo, I would argue that it may be cleaner to > instead create a new lower level, maybe hard_net_device, or something like that > so that we don't have to change a well understood interface (net_device).... > > -- > Ben Greear (greearb@candelatech.com) http://www.candelatech.com > Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) > http://scry.wanfear.com http://scry.wanfear.com/~greear -- James R. Leu From owner-netdev@oss.sgi.com Wed Jun 7 09:21:50 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 09:21:39 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:60941 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Wed, 7 Jun 2000 09:21:14 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id UAA21333; Wed, 7 Jun 2000 20:08:27 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006071608.UAA21333@ms2.inr.ac.ru> Subject: Re: Slow TCP connection between linux and wince To: amlaukka@cc.helsinki.fi (Aki M Laukkanen) Date: Wed, 7 Jun 2000 20:08:27 +0400 (MSK DST) Cc: netdev@oss.sgi.com, iwtcp@cs.helsinki.fi In-Reply-To: from "Aki M Laukkanen" at Jun 7, 0 02:05:00 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 1790 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! 
> To me it seems this would be trying to fix the symptom rather than the > cause. There seems to be a certain assumption in place here that the default > socket buffer of 65536 is good for all connections regardless of MTU. This > is only true if we don't do meta-data accounting in {wmem|rmem}_alloc. It is really OK. 65535 is a very big sndbuf when you talk to sane OSes (i.e. windows 8)). But we cannot talk to Linuxes 8), because they used to advertise unrealistic windows. Also, the fact that truesize exceeds mss by not more than a factor of two is really crucial. When it is not true, linux tcp used to fail miserably. This phenomenon is visible only on high rtt networks with small losses, though. > I don't think this a highly non-standard situation. Even if we ignore > slow wireless links (e.g. GSM data), to some extent every modem-link > exhibits this behaviour. Have you read draft-ietf-pilc-slow-03.txt or > rfc2757.txt? Good read for a little perspective to the problem space. You missed the point. The network should have a large _packet power_, i.e. (rtt*bandwidth)/mss, to hit this problem. This situation never occurred in real life earlier. I have no idea how you reached a cwnd of 192. 8) > the point. I think I can argue that the receiver should never advertise a > window bigger than it is prepared to receive. Of course. The question is how to do this. I proposed one solution. > Does this mean the changes described here? Yes. It has been made in net-xxyyzz, but the heuristics used there seem not to work very well in your situation. It seems it will not open enough window. You may check this using net-000601. > Doing it the other way > (not killing the whole ofo-queue) will still cause further packet loss. I do not want to prune at all.
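
For reference, the "packet power" quantity (rtt*bandwidth)/mss and the effect of truesize accounting on a socket buffer can be worked through with two small helpers. This is only a sketch: the function names, the units (milliseconds, bits per second, bytes) and any numbers plugged in are assumptions for illustration, not values taken from the traces discussed here.

```c
#include <assert.h>

/* Packets needed to fill the pipe: (rtt * bandwidth) / mss.
 * rtt in milliseconds, bandwidth in bits per second, mss in bytes. */
static unsigned long long toy_packet_power(unsigned long long rtt_ms,
                                           unsigned long long bandwidth_bps,
                                           unsigned long long mss_bytes)
{
    assert(mss_bytes > 0);
    /* bits -> bytes is /8, milliseconds -> seconds is /1000 */
    return (rtt_ms * bandwidth_bps) / (8ULL * 1000ULL * mss_bytes);
}

/* Segments that fit into a buffer when each one is charged its full
 * truesize (payload plus metadata) rather than just its payload. */
static unsigned long long toy_segments_in_buffer(unsigned long long bufsize,
                                                 unsigned long long truesize)
{
    assert(truesize > 0);
    return bufsize / truesize;
}
```

As a worked example with made-up numbers: a 9600 bit/s link with a 1000 ms rtt and a 100 byte mss gives 9600 * 1000 / (8 * 1000 * 100) = 12 packets in flight, and a 65535 byte buffer charged 200 bytes per segment holds 327 segments. The smaller the mss relative to truesize, the fewer payload bytes a given buffer can actually carry.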
Alexey From owner-netdev@oss.sgi.com Wed Jun 7 10:59:19 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 10:59:10 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:50190 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Wed, 7 Jun 2000 10:59:01 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA21947; Wed, 7 Jun 2000 21:58:42 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006071758.VAA21947@ms2.inr.ac.ru> Subject: Re: timers in net/ipv6 To: andrewm@uow.edu.au (Andrew Morton) Date: Wed, 7 Jun 2000 21:58:42 +0400 (MSK DST) Cc: netdev@oss.sgi.com In-Reply-To: <393CF7A8.6DDAE578@uow.edu.au> from "Andrew Morton" at Jun 6, 0 11:07:52 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 2385 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > - if (del_timer(&ifp->timer)) > +/* > + * REVIEWME: I assume the del_timer_async() here is safe > + * due to the below check (timer shouldn't be scheduled). > + */ > + if (del_timer_async(&ifp->timer)) > printk("Timer is still running, when freeing ifa=%p\n", ifp); This is debugging check. We have already discussed this. > + > +/* Timer handler */ > +static void fib6_run_gc_t(unsigned long dummy) > +{ > + fib6_run_gc(dummy, 1); > } > > void __init fib6_init(void) > @@ -1211,7 +1221,7 @@ > #ifdef MODULE > void fib6_gc_cleanup(void) > { > - del_timer(&ip6_fib_timer); > + del_timer_sync(&ip6_fib_timer); OK. Modular IPv6 is so buggy, that it is useless to fix these issues separately of big picture, though. > +/* > + * REVIEWME: we may have a race here. > + * ip6_fl_gc() could be running now, and after fl_release() returns, > + * this ip6_flowlabel is kfree()ed. Is there a possibility that > + * the still-running ip6_fl_gc() could end up accessing kfree'ed > + * memory? If so, I think a del_timer_sync() is safe here Yes, the same as in fib6 gc. > +/* > + * REVIEWME: igmp6_timer_handler() could be running right now on another CPU. 
> + * Is this safe? If not, we need del_timer_sync() in igmp6_group_queried() > + */ del_timer_sync() does not work in this case, certainly. It is not evident, because spinlocks are forgotten too. 8) Note, that igmp6 case is more complicated than igmp, it makes more work from softirqs. I hoped to move addrconf to process context until 2.4.0, and, certainly, forgot. 8) OK, we have to make it working on softirq yet. > +/* REVIEWME: igmp6_timer_handler() could be running now. Is this safe? Could we > + * do the in6_dev_put() right in the middle of igmp6_send()? If so, we need del_timer_sync(). > + * Andi says "Should not matter. it is a independent idev." > + */ We grab refcnt, when we use idev. > +/* REVIEWME: frag_expire() could be running on another CPU now, (if this > + * function is called from reasm_frag()). frag_expire() will be spinning > + * on ip6_frag_lock. Once this function returns to reasm_frag() and > + * reasm_frag releases the lock, frag_expire() will run and will > + * again call fq_free. Probably safe, but needs an expert eye :) > + */ > + del_timer_async(&fq->timer); We have already discussed this. It is not safe, certainly, and needs refcounting. 
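
What "needs refcounting" means for a timer-protected object can be sketched as below. This is a single-threaded user-space illustration, not the ip6 reassembly code: struct toy_fq and all toy_fq_* functions are hypothetical, C11 atomics stand in for the kernel's atomic_t, and freed_flag exists only so that the release can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct toy_fq {
    atomic_int refcnt;
    int timer_pending;
    int *freed_flag;    /* set to 1 when the object is released */
};

/* Create a queue; the initial reference belongs to the hash list. */
static struct toy_fq *toy_fq_create(int *freed_flag)
{
    struct toy_fq *fq = malloc(sizeof(*fq));
    if (fq == NULL)
        return NULL;
    atomic_init(&fq->refcnt, 1);
    fq->timer_pending = 0;
    fq->freed_flag = freed_flag;
    return fq;
}

static void toy_fq_hold(struct toy_fq *fq)
{
    atomic_fetch_add(&fq->refcnt, 1);
}

/* Drop one reference; whoever drops the last one frees the object. */
static void toy_fq_put(struct toy_fq *fq)
{
    int old = atomic_fetch_sub(&fq->refcnt, 1);
    assert(old > 0);
    if (old == 1) {
        *fq->freed_flag = 1;
        free(fq);
    }
}

/* Arming the timer takes a reference on behalf of the timer. */
static void toy_fq_arm_timer(struct toy_fq *fq)
{
    toy_fq_hold(fq);
    fq->timer_pending = 1;
}

/* The handler may run late; its own reference keeps fq valid. */
static void toy_fq_timer_fired(struct toy_fq *fq)
{
    fq->timer_pending = 0;
    /* ... expiry work would go here ... */
    toy_fq_put(fq);
}
```

The design choice is that the path that unlinks the queue only drops the list's reference; if a handler is already running on another CPU, it still holds its own reference, and the memory goes away only after the handler's put.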
Alexey From owner-netdev@oss.sgi.com Wed Jun 7 14:17:51 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 14:17:32 -0700 Received: from ppp79.arobas.net ([205.205.36.149]:32772 "HELO dialin156.ottawa.globalserve.net") by oss.sgi.com with SMTP id ; Wed, 7 Jun 2000 14:17:21 -0700 Received: (qmail 823 invoked by uid 1000); 7 Jun 2000 21:13:22 -0000 Date: Wed, 7 Jun 2000 17:13:22 -0400 From: Jerome Etienne To: Andrey Savochkin Cc: netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) Message-ID: <20000607171322.A801@long-haul.net> Reply-To: jetienne@arobas.net References: <20000604180651.A678@long-haul.net> <20000607115349.A4259@saw.sw.com.sg> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii User-Agent: Mutt/1.0i In-Reply-To: <20000607115349.A4259@saw.sw.com.sg>; from saw@saw.sw.com.sg on Wed, Jun 07, 2000 at 11:53:49AM +0800 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Wed, Jun 07, 2000 at 11:53:49AM +0800, Andrey Savochkin wrote: > Keeping the policy decision in user-space is a wise solution. > But you may just set NOARP flag for the device and do all the stuff in > user-land merging your 'virtual MAC' logic with any ARP daemon. I am not sure I understand your suggestion. I made some tests, and if IFF_NOARP is set I don't receive any messages on a listening NETLINK_ARPD socket. I need 2 things: - the kernel to keep an ARP cache (no ARP cache in the kernel implies an exchange with userspace for each ip/other packet, so not a reasonable solution) - the kernel must not reply with the native MAC when it receives an ARP request for a virtual ip. If there is a solution which doesn't require modifying the kernel, I don't see it. If your suggestion fits the needs, can you please elaborate to help me understand. The other VRRP implementations just run everything in the kernel to solve this problem. My patch, 3 lines in net/ipv4/arp.c, seemed a good solution when I wrote it.
If anybody see a better one, please tell me... From owner-netdev@oss.sgi.com Wed Jun 7 16:48:53 2000 Received: by oss.sgi.com id ; Wed, 7 Jun 2000 16:48:33 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:8576 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Wed, 7 Jun 2000 16:48:03 -0700 Received: from fred.muc.de (none@ns1047.munich.netsurf.de [195.180.235.47]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id BAA00476; Thu, 8 Jun 2000 01:47:48 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 12zpYk-0002jR-00; Thu, 8 Jun 2000 01:48:30 +0200 Date: Thu, 8 Jun 2000 01:48:30 +0200 From: Andi Kleen To: kuznet@ms2.inr.ac.ru Cc: Andrew Morton , netdev@oss.sgi.com Subject: Re: timers in net/ipv6 Message-ID: <20000608014830.A10453@fred.muc.de> References: <393CF7A8.6DDAE578@uow.edu.au> <200006071758.VAA21947@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <200006071758.VAA21947@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Wed, Jun 07, 2000 at 08:00:12PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Wed, Jun 07, 2000 at 08:00:12PM +0200, kuznet@ms2.inr.ac.ru wrote: > > > +/* REVIEWME: frag_expire() could be running on another CPU now, (if this > > + * function is called from reasm_frag()). frag_expire() will be spinning > > + * on ip6_frag_lock. Once this function returns to reasm_frag() and > > + * reasm_frag releases the lock, frag_expire() will run and will > > + * again call fq_free. Probably safe, but needs an expert eye :) > > + */ > > + del_timer_async(&fq->timer); > > We have already discussed this. It is not safe, certainly, > and needs refcounting. What I think is a bigger problem is the unsafe use of the spinlock: nothing prevents a timer on the same CPU from bumping into the spinlock, causing a deadlock. 
Hmm, probably the locks need to be split (list lock, frag queue lock protected with del_timer_async + refcount) or maybe even irq save spinlocks (costly) -Andi -- This is like TV. I don't like TV. From owner-netdev@oss.sgi.com Thu Jun 8 00:04:45 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 00:04:24 -0700 Received: from async176-6.nas.onetel.net.uk ([212.67.102.176]:31492 "EHLO Consulate.UFP.CX") by oss.sgi.com with ESMTP id ; Thu, 8 Jun 2000 00:04:06 -0700 Received: from localhost (rhw@localhost) by Consulate.UFP.CX (8.9.3/8.9.3) with ESMTP id FAA13748 for ; Thu, 8 Jun 2000 05:25:50 GMT X-Authentication-Warning: Consulate.UFP.CX: rhw owned process doing -bs Date: Thu, 8 Jun 2000 06:25:50 +0100 (BST) From: Riley Williams X-Sender: rhw@Consulate.UFP.CX To: netdev@oss.sgi.com Subject: RE: Email address test In-Reply-To: <018601bfcc2d$873a4200$2b63630a@marty.asgard.aus.tm> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hi there. Can I apologise for the enclosed post going to this mailing list. Unfortunately, the list was incorrectly identified as being a private individual in the Linux Kernel, so when I started checking that all private individuals therein were still at the addresses listed, I inadvertantly included this list. This has now been corrected. To those of you who responded to this posting as a result of being subscribed to this list rather than being listed as a maintainer of the Linux kernel, I offer my sincere apologies for disturbing you. > This went to the netdev mailing list... You might get > several replies.. Unfortunately, I did !!! >> -----Original Message----- >> From: owner-netdev@oss.sgi.com [mailto:owner-netdev@oss.sgi.com]On >> Behalf Of Riley Williams >> Sent: Friday, 2 June 2000 6:06 >> To: Linux Maintainers >> Subject: Email address test >> >> Hi there. 
>> >> This is a test message to confirm that all addresses in the >> Linux kernel maintainers list are still current. It would be >> appreciated if you could reply to this message confirming >> receipt hereof. >> >> Best wishes from Riley. Best wishes from Riley. * Copyright (C) 2000, Memory Alpha Systems. * All rights and wrongs reserved. +----------------------------------------------------------------------+ | There is something frustrating about the quality and speed of Linux | | development, ie., the quality is too high and the speed is too high, | | in other words, I can implement this XXXX feature, but I bet someone | | else has already done so and is just about to release their patch. | +----------------------------------------------------------------------+ * http://www.memalpha.cx/Linux/Kernel/ From owner-netdev@oss.sgi.com Thu Jun 8 00:20:34 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 00:20:24 -0700 Received: from linuxcare.com.au ([203.29.91.49]:54033 "EHLO front.linuxcare.com.au") by oss.sgi.com with ESMTP id ; Thu, 8 Jun 2000 00:20:20 -0700 Received: from halfway (penicillin.linuxcare.com.au [10.61.2.27]) by front.linuxcare.com.au (8.9.3/8.9.3/Debian 8.9.3-21) with ESMTP id RAA05566; Thu, 8 Jun 2000 17:20:09 +1000 X-Authentication-Warning: front.linuxcare.com.au: Host penicillin.linuxcare.com.au [10.61.2.27] claimed to be halfway Received: from linuxcare.com.au (localhost [127.0.0.1]) by halfway (Postfix) with ESMTP id A085A8154; Thu, 8 Jun 2000 16:50:08 +0930 (CST) From: Rusty Russell To: david@kalifornia.com Cc: netdev@oss.sgi.com Subject: Re: can't initialize iptables table `filter' In-reply-to: Your message of "Thu, 08 Jun 2000 12:14:57 +1000." <393F00FE.A3A5859F@kalifornia.com> Date: Thu, 08 Jun 2000 17:20:08 +1000 Message-Id: <20000608072008.A085A8154@halfway> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing In message <393F00FE.A3A5859F@kalifornia.com> you write: > Hmm, odd. 
Well, I'm stuck on ac8 with iptables and mine and rusty's > patches, ac10 breaks GRE tunneling again.

Didn't submit GRE patch: not my area, so I want to pass it through the gurus in netdev.

This patch changes GRE to send packets through the NF_IP_LOCAL_OUT hook. This is the most sane semantics for tunnels (someone want to change the others, like ipip?)

diff -urN -X /tmp/filej9FJZx --minimal linux-2.4.0-test1-official/net/ipv4/ip_gre.c working-2.4.0-test1/net/ipv4/ip_gre.c
--- linux-2.4.0-test1-official/net/ipv4/ip_gre.c Tue May 23 02:50:55 2000
+++ working-2.4.0-test1/net/ipv4/ip_gre.c Mon Jun 5 18:24:29 2000
@@ -27,6 +27,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -529,6 +530,46 @@
 #endif
 }

+#ifdef CONFIG_NETFILTER
+/* To preserve the cute illusion that a locally-generated packet can
+   be mangled before routing, we actually reroute if a hook altered
+   the packet. -RR */
+static int route_me_harder(struct sk_buff *skb)
+{
+	struct iphdr *iph = skb->nh.iph;
+	struct rtable *rt;
+
+	if (ip_route_output(&rt, iph->daddr, iph->saddr,
+			    RT_TOS(iph->tos) | RTO_CONN,
+			    skb->sk ? skb->sk->bound_dev_if : 0)) {
+		printk("route_me_harder: No more route.\n");
+		return -EINVAL;
+	}
+
+	/* Drop old route. */
+	dst_release(skb->dst);
+
+	skb->dst = &rt->u.dst;
+	return 0;
+}
+#endif
+
+/* Do route recalc if netfilter changes skb. */
+static inline int
+send_maybe_reroute(struct sk_buff *skb)
+{
+#ifdef CONFIG_NETFILTER
+	if (skb->nfcache & NFC_ALTERED) {
+		if (route_me_harder(skb) != 0) {
+			kfree_skb(skb);
+			return -EINVAL;
+		}
+	}
+#endif
+	ip_send(skb);
+	return 0;
+}
+
 int ipgre_rcv(struct sk_buff *skb, unsigned short len)
 {
 	struct iphdr *iph = skb->nh.iph;
@@ -829,7 +870,8 @@
 	stats->tx_bytes += skb->len;
 	stats->tx_packets++;
-	ip_send(skb);
+	NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
+		send_maybe_reroute);
 	tunnel->recursion--;
 	return 0;

Rusty.
--
Hacking time.
From owner-netdev@oss.sgi.com Thu Jun 8 00:25:55 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 00:25:45 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:47488 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Thu, 8 Jun 2000 00:25:34 -0700 Received: (qmail 11527 invoked by uid 577); 8 Jun 2000 06:25:28 -0000 Message-ID: <20000608142528.A11492@saw.sw.com.sg> Date: Thu, 8 Jun 2000 14:25:28 +0800 From: Andrey Savochkin To: jetienne@arobas.net Cc: netdev@oss.sgi.com, Julian Anastasov Subject: Re: IFA_F_NO_NDISC (for vrrp) References: <20000604180651.A678@long-haul.net> <20000607115349.A4259@saw.sw.com.sg> <20000607171322.A801@long-haul.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: <20000607171322.A801@long-haul.net>; from "Jerome Etienne" on Wed, Jun 07, 2000 at 05:13:22PM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, On Wed, Jun 07, 2000 at 05:13:22PM -0400, Jerome Etienne wrote: > On Wed, Jun 07, 2000 at 11:53:49AM +0800, Andrey Savochkin wrote: > > Keeping the policy decision in user-space is a wise solution. > > But you may just set NOARP flag for the device and do all the stuff in > > user-land merging your 'virtual MAC' logic with any ARP daemon. > > I am not sure i understand your suggestion. I made some tests, and if > IFF_NOARP is set i dont receive any messages on a listening NETLINK_ARPD > socket. I need 2 things: Oh, I see. My quick thoughts appear to be wrong. > - the kernel to keep a arp cache (no arp cache in the kernel implies > a exchange with the userspace at each ip/other packet, so not a reasonable > solution) > - the kernel must not reply the native MAC when it receives a ARP request > for a virtual ip. > > If there is a solution which doesnt require to modify the kernel, i dont > see it. If your suggestion fits the needs, can you please elaborate > to help me to understand. 
> > The other VRRP implementations just run everything in the kernel to solve > this problem. My patch is 3 lines in net/ipv4/arp.c seemed a good solution > when i wrote it. If anybody see a better one, please tell me... Julian Anastasov also wanted some solution to block ARP replies for his cluster project. Julian, I don't remember exactly your situation. May the proposed patch solve some problems for you? Best regards Andrey V. Savochkin From owner-netdev@oss.sgi.com Thu Jun 8 08:39:24 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 08:39:04 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:57861 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Thu, 8 Jun 2000 08:38:54 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id IAA04207; Thu, 8 Jun 2000 08:21:34 -0700 Message-ID: <393FB9FE.1CCF12E5@candelatech.com> Date: Thu, 08 Jun 2000 08:21:34 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > I dont think if i would call both VLANS "eth0"; VLANS would be just > circuits on top of eth0. > I would think something like VLAN would require its own config > tool. ifconfig is probably not the best tool to configure VLANS > and definetly no close to a swiss army knife; pointing to ifconfig or > route as reason to set a vlan as a device is a bad excuse. > 'nuf said. > > Here's what i would have done: At this point, I don't think anyone is going to be changing their opinions on the way things should be done. 
Please feel free to take either of the VLAN patches already written and write your own using your ideas. When you have working code to show, then we may be in a better position to evaluate your ideas. In the mean time, I will (sometime relatively soon), have some patches to allow for fast lookups of devices, if no one beats me to it. If they never go into the kernel proper because of the blasphemy of too many devices, so be it, but I'm hoping that with benchmarks to prove performance does not degrade, and a patch that touches little of the rest of the kernel, and virtually nothing in user space, it will be seriously considered for the 2.5 timeframe, if not sooner. Enjoy, Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Thu Jun 8 08:43:54 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 08:43:44 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:2290 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Thu, 8 Jun 2000 08:43:43 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id IAA04014; Thu, 8 Jun 2000 08:57:59 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id IAA25625; Thu, 8 Jun 2000 08:57:52 -0400 (EDT) Date: Thu, 8 Jun 2000 08:57:52 -0400 (EDT) From: jamal To: Mitchell Blank Jr cc: Andrey Savochkin , Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
In-Reply-To: <20000607045935.G7740@sfgoth.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Wed, 7 Jun 2000, Mitchell Blank Jr wrote: > jamal wrote: > > > > And you can do this just fine with VLANS abstracted on top of devices. > > OK, explain how. I've enumerated a dozen or more places either in > protocol support or in the user-kernel interface where the > configuration or functionality depends on a named device per individual > vlan. You keep ignoring them when I point them out, but they are > still there. For instance, if both the VLANs are called "eth0", how > do I set individual settings in /proc/sys/net/ipv4/conf/ for the two > different VLANs? Or any of the other ones I pointed out - take your > pick. I dont think if i would call both VLANS "eth0"; VLANS would be just circuits on top of eth0. I would think something like VLAN would require its own config tool. ifconfig is probably not the best tool to configure VLANS and definetly no close to a swiss army knife; pointing to ifconfig or route as reason to set a vlan as a device is a bad excuse. 'nuf said. Here's what i would have done: "vlantool attach to eth0 VLAN10 IP a.b.c.d priority x ... blablah" The tool would: - use netlink or ioctl to add the IP as another alias to eth0; - my own private table inside the kernel would take the rest vlan info; searchable by vlanid primarily etc put all your funky search algorithms there. I see the ACL stored there as well. You might need to maintain some daemon in user space ("vland") * Incoming packets get looked up on the table for demuxing, ACL get applied etc; they might be dropped right here for example -switching happens based on the VLAN table details; or things may be sent up to IP; * outgoing packets: -not a problem if you are switching; just replace VLAN headers;off you go ACL rules apply. 
-coming down the stack; extend dst output to be specialized for your specific routes entered for your VLANs. This is a little tricky but well done by James Leu's MPLS code. - add notification on VLAN adds/deletes, device deletes maps to "all vlans have been deleted" and Zebra or whatever your favorite Route daemon is happy. How the route daemon abstracts things is its own problem;it should not force how the kernel should implement things. The moral of this: - ARP and routing both work .. - Your packet mungling (header replacements/adds) work just fine Did i miss anything? I think Jim has done a great job in design in MPLS with a much more complex problem. Lookup his code. > > What "linux model of handling flow control"? > > What would the other model be? FIFOs are the basic building blocks. > > Yes, FIFOs are the basic building blocks. The current model is "a > net_device must be modeled as a single FIFO before the xmit". I am either still confused or you are. The simplest FIFO scheme on transmit is a 3 band one. > This isn't right. > > Some devices don't have any FIFO associated with them. Either they > don't need one because they are trivial (dummy, lo) or they just feed > something else that _does_ have a FIFO (ethernet bridge device, > anything that tunnels L2-in-L2 like PPPoE, or these proposed VLAN > devices). It doesn't really _hurt_ much to have these modeled with > an unused FIFO in front, other than some TC stuff can't work right > since there's no backpressure. > You have some form of backpressure that no-one is using at the moment; Look at the NET_XMIT_*; The packet requeing schemes could also be considered as backpressure if you choose to use them. > The other case is worse - things that have multiple FIFOs. As > I pointed out, the obvious one is any ATM protocol that uses > multiple VCs. 
Since each VC is an individual FIFO you can't > tell if you're supposed to queue a packet until you've already > looked at the header and determined which VC to route out. > Doing a netif_stop_queue() causes all the VCs to stop being > fed. And you dont like the concept of classifiers? You can set a classifier whose match action is to feed things into a blackhole/noop while other VCs happily send away. > > Really we need a new structure "net_fifo?" in order to divorce > the queuing from the concept of a net_device. Anything that > wants to look to userland and the protocol stacks as a net_device > should, wheteher or not you feel they are just a packet mangler > or not. I think we already have "net_fifos" if you choose to use them; no need in re-inventing the wheel (rule number 100). cheers, jamal From owner-netdev@oss.sgi.com Thu Jun 8 09:04:54 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 09:04:44 -0700 Received: from ferret.cs.fiu.edu ([131.94.125.231]:54290 "EHLO ferret.cs.fiu.edu") by oss.sgi.com with ESMTP id ; Thu, 8 Jun 2000 09:04:28 -0700 Received: from cs.fiu.edu (ferret.cs.fiu.edu [131.94.125.231]) by ferret.cs.fiu.edu (8.9.2/FIU-CS-1.2) with ESMTP id KAA19828 for ; Thu, 8 Jun 2000 10:38:01 -0400 (EDT) Message-Id: <200006081438.KAA19828@ferret.cs.fiu.edu> To: netdev@oss.sgi.com Subject: ipv6 implementation docs Date: Thu, 08 Jun 2000 10:38:01 -0400 From: "Eric S. Johnson" Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Is there anything besides the source code which documents how the linux ipv6 implementation works? An example of a question I have: how do you force autoconfiguration of the interfaces, again. When you first bring up an interface with ifconfig eth0 inet6 it does stateless autoconfig. But if you manually clear things and/or change the prefixes that the router radvd advertises, I would like to force it to do autoconfig again, I haven't seen how to do this yet. 
E From owner-netdev@oss.sgi.com Thu Jun 8 09:14:04 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 09:13:54 -0700 Received: from dpx20.tu-varna.acad.bg ([194.141.24.4]:3337 "EHLO dpx20.tu-varna.acad.bg") by oss.sgi.com with ESMTP id ; Thu, 8 Jun 2000 09:13:47 -0700 Received: from linux.tu-varna.acad.bg (root@linux.tu-varna.acad.bg [194.141.24.6]) by dpx20.tu-varna.acad.bg (8.9.3/8.9.3) with ESMTP id OAA122510; Thu, 8 Jun 2000 14:43:50 +0300 Received: from linux.tu-varna.acad.bg (uli@linux.tu-varna.acad.bg [194.141.24.6]) by linux.tu-varna.acad.bg (8.8.5/myconf) with ESMTP id OAA05115; Thu, 8 Jun 2000 14:47:31 +0300 Date: Thu, 8 Jun 2000 14:47:31 +0300 (EEST) From: Julian Anastasov To: Andrey Savochkin cc: jetienne@arobas.net, netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) In-Reply-To: <20000608142528.A11492@saw.sw.com.sg> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, On Thu, 8 Jun 2000, Andrey Savochkin wrote: > Hello, > > On Wed, Jun 07, 2000 at 05:13:22PM -0400, Jerome Etienne wrote: > > On Wed, Jun 07, 2000 at 11:53:49AM +0800, Andrey Savochkin wrote: > > > Keeping the policy decision in user-space is a wise solution. > > > But you may just set NOARP flag for the device and do all the stuff in > > > user-land merging your 'virtual MAC' logic with any ARP daemon. > > > > I am not sure i understand your suggestion. I made some tests, and if > > IFF_NOARP is set i dont receive any messages on a listening NETLINK_ARPD > > socket. I need 2 things: > > Oh, I see. My quick thoughts appear to be wrong. > > > - the kernel to keep a arp cache (no arp cache in the kernel implies > > a exchange with the userspace at each ip/other packet, so not a reasonable > > solution) > > - the kernel must not reply the native MAC when it receives a ARP request > > for a virtual ip. 
> > > > If there is a solution which doesnt require to modify the kernel, i dont > > see it. If your suggestion fits the needs, can you please elaborate > > to help me to understand. > > > > The other VRRP implementations just run everything in the kernel to solve > > this problem. My patch is 3 lines in net/ipv4/arp.c seemed a good solution > > when i wrote it. If anybody see a better one, please tell me... > > Julian Anastasov also wanted some solution to block ARP replies for his > cluster project. Julian, I don't remember exactly your situation. May the > proposed patch solve some problems for you? Yes, but only "some" :) Jerome Etienne, can you look at the Linux Virtual Server (LVS) project: http://www.linuxvirtualserver.org http://www.linuxvirtualserver.org/arp.html Look for "Direct Route" forwarding method. The LVS uses shared addresses as in rfc2338. But rfc2338 has other requirements. They look very complex for me but I didn't looked very deeply. In LVS the "Backup" server can talk IP while in rfc2338 this is not allowed. In LVS, for example, we can run the service on all hosts and the LVS software (the Primary Router) on one of them. Why should we (1) keep an unused Backup server(s) and (2) why with VIP configured? Of course, it depends on your needs. In LVS the concept is different: Only the "Master" advertises Virtual IP but the "Backup" servers can work as normal (Real) servers. Your are free to stop the service when the Backup server takes the Primary role if you don't want to overload it. The Virtual MAC address is not needed, we reply with the MAC address of the current Primary router. You can send gratuitous ARP replies with a user space tool. - The "Backup" router(s) and the real servers can talk with VIP but they don't send ARP replies for VIP. Normally they don't receive traffic destined for VIP because they don't send ARP replies for VIP. By this way we block the direct access from the local clients to the real servers. 
They must forward their traffic through the virtual router. - The "Backup" and the real servers don't announce VIP as source of their ARP probes or they will not receive the expected reply (it will be received in the virtual router). There is a "hidden" device flag in 2.2.14+ which solves the above problems. We configure all VIPs on a dummy device and set this flag (we hide all VIPs on this device) in the "Backup" and in the real servers. There are other good side effects from this flag. Supporting only IFA_F_NO_NDISC is not enough even for rfc2338. I don't see reason the Backup server to hold VIP configured. Only the Primary server must have VIP configured. I see many restrictions in the rfc2338's solution to support shared addresses. May be I don't understand well rfc2338, I don't know how you are using it. Are you trying to implement rfc2338 or just to build a working setup? I think, the only required kernel support can be: 1. not reply for these (hidden) VIPs - this can be solved with policy routing or other kind of filtering without the "hidden device" flag 2. not use them as source of the ARP probes - currently this can be solved if the VIP is not defined in the local table but in another table. But I'm not sure if the planned Andrey's fixes will allow this. Currently, arp_solicit selects only IP addresses from the local table which allows this requirement to work. After Andrey's patch they will be selected from any table and this requirement will not work. These requirements are completely covered from the "hidden device" feature. Can this feature solve your problem? I have a patch for 2.3.41+ which I can send you if you don't want to play with complex policy routing rules. The discussion for this feature in 2.3 stuck at the point where we don't know how to define VIPs (hidden IP addresses) correctly, whether we can do it with policy routing rules or there is another solution. 
You can read our discussion from the linux-kernel archives: http://kernelnotes.org/lnxlists/linux-kernel/lk_0005_01/ Date: 04-MAY-2000 Subject: arp, kernel 2.2.15 and 2.3.99-pre6 Jerome, can the hidden device flag solve your problem? Oh, I now see that may be you need this VIP to be on an ARP device (in the Backup server)? Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Thu Jun 8 09:52:15 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 09:52:05 -0700 Received: from apollo.nbase.co.il ([194.90.137.2]:13835 "EHLO apollo.nbase.co.il") by oss.sgi.com with ESMTP id ; Thu, 8 Jun 2000 09:51:42 -0700 Received: from nbase.co.il ([194.90.136.56]) by apollo.nbase.co.il (Post.Office MTA v3.1.2 release (PO205-101c) ID# 0-44418U200L2S100) with ESMTP id AAA444; Thu, 8 Jun 2000 18:01:33 +0200 Message-ID: <393FB54B.DE8AD98B@nbase.co.il> Date: Thu, 08 Jun 2000 15:01:31 +0000 From: Gleb Natapov Organization: NBase-Xyplex X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14vlan i686) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Mitchell Blank Jr , Andrey Savochkin , Ben Greear , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, gleb@tochna.technion.ac.il Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > On Wed, 7 Jun 2000, Mitchell Blank Jr wrote: > > > jamal wrote: > > > > > > And you can do this just fine with VLANS abstracted on top of devices. > > > > OK, explain how. I've enumerated a dozen or more places either in > > protocol support or in the user-kernel interface where the > > configuration or functionality depends on a named device per individual > > vlan. You keep ignoring them when I point them out, but they are > > still there. 
For instance, if both the VLANs are called "eth0", how > > do I set individual settings in /proc/sys/net/ipv4/conf/ for the two > > different VLANs? Or any of the other ones I pointed out - take your > > pick. > > I dont think if i would call both VLANS "eth0"; VLANS would be just > circuits on top of eth0. > I would think something like VLAN would require its own config > tool. ifconfig is probably not the best tool to configure VLANS > and definetly no close to a swiss army knife; pointing to ifconfig or > route as reason to set a vlan as a device is a bad excuse. > 'nuf said. > > Here's what i would have done: > > "vlantool attach to eth0 VLAN10 IP a.b.c.d priority x ... blablah" > > The tool would: > - use netlink or ioctl to add the IP as another alias to eth0; > - my own private table inside the kernel would take the rest vlan info; > searchable by vlanid primarily etc put all your funky search algorithms > there. I see the ACL stored there as well. > You might need to maintain some daemon in user space ("vland") > > * Incoming packets get looked up on the table for demuxing, ACL get > applied etc; they might be dropped right here for example > -switching happens based on the VLAN table details; or things may be > sent up to IP; > * outgoing packets: > -not a problem if you are switching; just replace VLAN headers;off you go > ACL rules apply. > -coming down the stack; extend dst output to be specialized for > your specific routes entered for your VLANs. This is a little tricky but > well done by James Leu's MPLS code. > - add notification on VLAN adds/deletes, device deletes maps to "all vlans > have been deleted" and Zebra or whatever your favorite Route daemon is > happy. How the route daemon abstracts things is its own problem;it should > not force how the kernel should implement things. > > The moral of this: > - ARP and routing both work .. > - Your packet mungling (header replacements/adds) work just fine > > Did i miss anything? > Yes. 
If you are going to implement VLANs as you've just described you will
have to change each L3 protocol implementation to support VLANs.

--
Gleb.

From owner-netdev@oss.sgi.com Thu Jun 8 09:59:54 2000
Received: by oss.sgi.com id ; Thu, 8 Jun 2000 09:59:35 -0700
Received: from minus.inr.ac.ru ([193.233.7.97]:55565 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Thu, 8 Jun 2000 09:59:21 -0700
Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id UAA09215; Thu, 8 Jun 2000 20:59:04 +0400
From: kuznet@ms2.inr.ac.ru
Message-Id: <200006081659.UAA09215@ms2.inr.ac.ru>
Subject: Re: timers in net/ipv6
To: ak@muc.de (Andi Kleen)
Date: Thu, 8 Jun 2000 20:59:03 +0400 (MSK DST)
Cc: andrewm@uow.edu.au, netdev@oss.sgi.com
In-Reply-To: <20000608014830.A10453@fred.muc.de> from "Andi Kleen" at Jun 8, 0 01:48:30 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Length: 709
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hello!

> > We have already discussed this. It is not safe, certainly,
> > and needs refcounting.
>
> What I think is a bigger problem is the unsafe use of the spinlock:
> nothing prevents a timer on the same CPU from bumping into the spinlock,
> causing a deadlock.

No deadlock, it causes a reference to freed memory rather
than a deadlock.

> Hmm, probably the locks need to be split (list lock,
> frag queue lock protected with del_timer_async + refcount) or maybe even

Nothing but a refcount on fq. It is a very simple and convenient solution.

Andi, I have already asked you: may I change IPv4/v6 defragmenters?
I remember, you had some large patches for it.

> irq save spinlocks (costly)

Why?!
Alexey From owner-netdev@oss.sgi.com Thu Jun 8 10:07:56 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 10:07:35 -0700 Received: from ppp31.arobas.net ([205.205.36.101]:39431 "HELO dialin156.ottawa.globalserve.net") by oss.sgi.com with SMTP id ; Thu, 8 Jun 2000 10:07:25 -0700 Received: (qmail 1425 invoked by uid 1000); 8 Jun 2000 17:03:23 -0000 Date: Thu, 8 Jun 2000 13:03:23 -0400 From: Jerome Etienne To: Julian Anastasov Cc: Andrey Savochkin , jetienne@arobas.net, netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) Message-ID: <20000608130323.A1049@long-haul.net> Reply-To: jetienne@arobas.net References: <20000608142528.A11492@saw.sw.com.sg> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii User-Agent: Mutt/1.0i In-Reply-To: ; from uli@linux.tu-varna.acad.bg on Thu, Jun 08, 2000 at 02:47:31PM +0300 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Thu, Jun 08, 2000 at 02:47:31PM +0300, Julian Anastasov wrote: > In LVS the "Backup" server can talk IP while in > rfc2338 this is not allowed. In LVS, for example, we can run > the service on all hosts and the LVS software (the Primary > Router) on one of them. Why should we (1) keep an unused > Backup server(s) > > Of course, it depends on your needs. Indeed, vrrp and lvs are similar but still distinct. LVS modifies the two peers of the connection (dispatcher and server). VRRP is designed to work without the client being aware of it. The primary goal is to run several routers/servers, possibly running different services (e.g. 2 routers both route but dont advertize the same routes). If a router crash, another one with take over its IPs (still keeping its own ones), so the client will goes on to send the packets to the same IP/MAC without being aware of the transition. > and (2) why with VIP configured? In vrrp, the backup must not have the virtual IPs 'online'. Only the master can receive/send packets with the virtual ips. 
> May be I don't understand well rfc2338, I don't > know how you are using it. Are you trying to implement > rfc2338 or just to build a working setup? rfc2338 as it is specified. my current implementation[1] works as specified for 1 virtual group per physical interface but uses a non-standard trick (i.e. no more handle the virtual MAC) to support several groups per interface. [1] http://w3.arobas.net/~jetienne/vrrpd.tgz > I think, the only required kernel support can be: > > 1. not reply for these (hidden) VIPs > - this can be solved with policy routing or other > kind of filtering without the "hidden device" flag Yes, i read the linux kernel thread and the arp_filter solution of andi kleen but honestly i dont understand it. why and how to use the route table (so a destination thing) to know if you have to 'arp-reply' for a locally configured address (so a local thing) ? > 2. not use them as source of the ARP probes > > - currently this can be solved if the VIP is not > defined in the local table but in another table. But > I'm not sure if the planned Andrey's fixes will > allow this. Currently, arp_solicit selects only IP > addresses from the local table which allows this > requirement to work. After Andrey's patch they will > be selected from any table and this requirement will > not work. which patch are you speaking about ? > Jerome, can the hidden device flag solve your > problem? Oh, I now see that may be you need this VIP to be > on an ARP device (in the Backup server)? I think so, i coded IFA_F_NO_NDISC because i wasnt aware of 'hidden'. It isnt in the kernel source i looked in i.e. 2.4.0-test1. As our needs are similar, we should use the same mechanism. IFA_F_NO_NDISC is tunable per address so seems to be more flexible than 'hidden' which is tunable per interface. Thinking about it, i have another project in which i need to completly handle ARP reply and request in user space (still keeping the cache in the kernel). 
Why not provide a mechanism allowing to handle ARP request/reply from userspace ? (af_packet to sniff the reply from the network, a flag to prevent the kernel from replying and a kind of CONFIG_ARPD to send ARP request to userspace via netlink) It would be flexilbe and would satisfy the needs of my other project and vrrp. i think it would be enough for lvs too, correct ? If i writes that, would it be more acceptable for the maintainers ? From owner-netdev@oss.sgi.com Thu Jun 8 10:45:06 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 10:44:56 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:30113 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Thu, 8 Jun 2000 10:44:36 -0700 Received: from fred.muc.de (none@ns1177.munich.netsurf.de [195.180.235.177]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id TAA25680; Thu, 8 Jun 2000 19:44:07 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 1306Lw-0004rn-00; Thu, 8 Jun 2000 19:44:24 +0200 Date: Thu, 8 Jun 2000 19:44:23 +0200 From: Andi Kleen To: kuznet@ms2.inr.ac.ru Cc: Andi Kleen , andrewm@uow.edu.au, netdev@oss.sgi.com Subject: Re: timers in net/ipv6 Message-ID: <20000608194423.A18631@fred.muc.de> References: <20000608014830.A10453@fred.muc.de> <200006081659.UAA09215@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <200006081659.UAA09215@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Thu, Jun 08, 2000 at 06:59:12PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Thu, Jun 08, 2000 at 06:59:12PM +0200, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > > We have already discussed this. It is not safe, certainly, > > > and needs refcounting. > > > > What I think is a bigger problem is the unsafe use of the spinlock: > > nothing prevents a timer on the same CPU from bumping into the spinlock, > > causing a deadlock. 
>
> No deadlock, it causes a reference to freed memory rather
> than a deadlock.

We no longer have bh protection against timers, so a timer may run
while net_rx_action executes, even on the same CPU. ipv6_reassembly has
ip6_frag_lock, timer tries to acquire it on the same CPU -> deadlock.

>
> > Hmm, probably the locks need to be split (list lock,
> > frag queue lock protected with del_timer_async + refcount) or maybe even
>
> Nothing but a refcount on fq. It is a very simple and convenient solution.
>
> Andi, I have already asked you: may I change IPv4/v6 defragmenters?
> I remember, you had some large patches for it.

Please fix it for v2.4. I'll put my changes in for 2.5.

-Andi
--
This is like TV. I don't like TV.

From owner-netdev@oss.sgi.com Thu Jun 8 11:05:46 2000
Received: by oss.sgi.com id ; Thu, 8 Jun 2000 11:05:26 -0700
Received: from minus.inr.ac.ru ([193.233.7.97]:14862 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Thu, 8 Jun 2000 11:05:21 -0700
Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA09761; Thu, 8 Jun 2000 22:05:03 +0400
From: kuznet@ms2.inr.ac.ru
Message-Id: <200006081805.WAA09761@ms2.inr.ac.ru>
Subject: Re: timers in net/ipv6
To: ak@muc.de (Andi Kleen)
Date: Thu, 8 Jun 2000 22:05:03 +0400 (MSK DST)
Cc: ak@muc.de, andrewm@uow.edu.au, netdev@oss.sgi.com
In-Reply-To: <20000608194423.A18631@fred.muc.de> from "Andi Kleen" at Jun 8, 0 07:44:23 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Length: 317
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hello!

> We no longer have bh protection against timers, so a timer may run

softirqs are not nested. Apparently, you mean the scheme
which we discussed a year ago. It was dropped because of utter
complexity, only one softirq per cpu is held now.

> Please fix it for v2.4. I'll put my changes in for 2.5.

OK.
Alexey

From owner-netdev@oss.sgi.com Thu Jun 8 12:38:46 2000
Received: by oss.sgi.com id ; Thu, 8 Jun 2000 12:38:37 -0700
Received: from laurin.munich.netsurf.de ([194.64.166.1]:23782 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Thu, 8 Jun 2000 12:38:16 -0700
Received: from fred.muc.de (none@ns1043.munich.netsurf.de [195.180.235.43]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id VAA08569; Thu, 8 Jun 2000 21:37:58 +0200 (MET DST)
Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 130884-0005rb-00; Thu, 8 Jun 2000 21:38:12 +0200
Date: Thu, 8 Jun 2000 21:38:12 +0200
From: Andi Kleen
To: kuznet@ms2.inr.ac.ru
Cc: Andi Kleen , andrewm@uow.edu.au, netdev@oss.sgi.com
Subject: Re: timers in net/ipv6
Message-ID: <20000608213812.A22534@fred.muc.de>
References: <20000608194423.A18631@fred.muc.de> <200006081805.WAA09761@ms2.inr.ac.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.95.4us
In-Reply-To: <200006081805.WAA09761@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Thu, Jun 08, 2000 at 08:05:08PM +0200
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

On Thu, Jun 08, 2000 at 08:05:08PM +0200, kuznet@ms2.inr.ac.ru wrote:
> Hello!
>
> > We no longer have bh protection against timers, so a timer may run
>
> softirqs are not nested. Apparently, you mean the scheme
> which we discussed a year ago. It was dropped because of utter
> complexity, only one softirq per cpu is held now.

Ok, just hope then that there is never an exception during softirq
execution (like a spurious interrupt or an NMI handled by a debugger) --
ret_from_exception calls the softirq.
-Andi From owner-netdev@oss.sgi.com Thu Jun 8 12:42:07 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 12:41:47 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:57102 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Thu, 8 Jun 2000 12:41:33 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA13342; Thu, 8 Jun 2000 23:28:53 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006081928.XAA13342@ms2.inr.ac.ru> Subject: Re: Slow TCP connection between linux and wince To: amlaukka@cc.helsinki.fi (Aki M Laukkanen) Date: Thu, 8 Jun 2000 23:28:53 +0400 (MSK DST) Cc: netdev@oss.sgi.com, iwtcp@cs.helsinki.fi In-Reply-To: from "Aki M Laukkanen" at Jun 7, 0 02:05:00 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 238 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > > In any case, try net-xxyyzz.dif.gz from ftp://ftp.inr.ac.ru/ip-routing/. > > It will not be better, I think, but at least you will discover when > > Will do. I hope net-000608 contains final solution to the problem. Alexey From owner-netdev@oss.sgi.com Thu Jun 8 13:23:26 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 13:23:16 -0700 Received: from zada.math.leidenuniv.nl ([132.229.231.3]:40708 "EHLO zada.math.leidenuniv.nl") by oss.sgi.com with ESMTP id ; Thu, 8 Jun 2000 13:22:50 -0700 Received: from mara.math.leidenuniv.nl (IDENT:buytenh@mara.math.leidenuniv.nl [132.229.232.80]) by zada.math.leidenuniv.nl (8.9.3/8.9.3) with ESMTP id WAA29662; Thu, 8 Jun 2000 22:20:56 +0200 Date: Thu, 8 Jun 2000 22:20:56 +0200 (MEST) From: Lennert Buytenhek X-Sender: buytenh@mara.math.leidenuniv.nl To: Andi Kleen cc: Andrey Savochkin , Mitchell Blank Jr , Ben Greear , rob@valinux.com, netdev@oss.sgi.com, Gleb Natapov , jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
In-Reply-To: <20000607010120.A4334@fred.muc.de> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Wed, 7 Jun 2000, Andi Kleen wrote: > > > The current kernel infrastructure for packet mangling may still need > > > some adjustments, but it at least exists. I'm encouraging to consider > > > VLAN implementation as just a netfilter module. > > > > "All the world is an IP net"? How should I run IPX over my VLANs then? > > Netfilter is not an IP only thing. It is a generic framework for > packet mangling. Although currently only IPv4 and IPv6 netfilter > implementations exist it would be no big problem to add ``raw > ethernet'' netfilter hooks. Raw ethernet netfilter hooks, as are IPX netfilter hooks by the way, are currently a nice blue cloud in the sky. As we're getting into the architectural purity business anyway, does it make a whole lot of sense to netfilter on two different protocol levels? 
greetings, Lennert From owner-netdev@oss.sgi.com Thu Jun 8 20:30:31 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 20:30:21 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:2177 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Thu, 8 Jun 2000 20:29:55 -0700 Received: (qmail 16664 invoked by uid 577); 9 Jun 2000 03:29:47 -0000 Message-ID: <20000609112947.B16628@saw.sw.com.sg> Date: Fri, 9 Jun 2000 11:29:47 +0800 From: Andrey Savochkin To: jetienne@arobas.net Cc: netdev@oss.sgi.com, Julian Anastasov Subject: Re: IFA_F_NO_NDISC (for vrrp) References: <20000608142528.A11492@saw.sw.com.sg> <20000608130323.A1049@long-haul.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: <20000608130323.A1049@long-haul.net>; from "Jerome Etienne" on Thu, Jun 08, 2000 at 01:03:23PM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, On Thu, Jun 08, 2000 at 01:03:23PM -0400, Jerome Etienne wrote: > Why not provide a mechanism allowing to handle ARP request/reply from > userspace ? (af_packet to sniff the reply from the network, a flag > to prevent the kernel from replying and a kind of CONFIG_ARPD to > send ARP request to userspace via netlink) I think it is a very good solution. This area just needs certain cleanup to allow all the decisions to be made in user-space. I'm not very familiar with "neighbour" code and I can't point to the right places. > It would be flexilbe and would satisfy the needs of my other project and > vrrp. i think it would be enough for lvs too, correct ? Best regards Andrey V. 
Savochkin From owner-netdev@oss.sgi.com Thu Jun 8 22:58:32 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 22:58:22 -0700 Received: from dpx20.tu-varna.acad.bg ([194.141.24.4]:24077 "EHLO dpx20.tu-varna.acad.bg") by oss.sgi.com with ESMTP id ; Thu, 8 Jun 2000 22:58:04 -0700 Received: from linux.tu-varna.acad.bg (root@linux.tu-varna.acad.bg [194.141.24.6]) by dpx20.tu-varna.acad.bg (8.9.3/8.9.3) with ESMTP id IAA15628; Fri, 9 Jun 2000 08:54:14 +0300 Received: from linux.tu-varna.acad.bg (uli@linux.tu-varna.acad.bg [194.141.24.6]) by linux.tu-varna.acad.bg (8.8.5/myconf) with ESMTP id IAA03946; Fri, 9 Jun 2000 08:57:57 +0300 Date: Fri, 9 Jun 2000 08:57:57 +0300 (EEST) From: Julian Anastasov To: Jerome Etienne cc: Andrey Savochkin , netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) In-Reply-To: <20000608130323.A1049@long-haul.net> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, On Thu, 8 Jun 2000, Jerome Etienne wrote: > On Thu, Jun 08, 2000 at 02:47:31PM +0300, Julian Anastasov wrote: > > In LVS the "Backup" server can talk IP while in > > rfc2338 this is not allowed. In LVS, for example, we can run > > the service on all hosts and the LVS software (the Primary > > Router) on one of them. Why should we (1) keep an unused > > Backup server(s) > > > > Of course, it depends on your needs. > > Indeed, vrrp and lvs are similar but still distinct. > LVS modifies the two peers of the connection (dispatcher and server). > VRRP is designed to work without the client being aware of it. > The primary goal is to run several routers/servers, possibly running > different services (e.g. 2 routers both route but dont advertize > the same routes). If a router crash, another one with take over its > IPs (still keeping its own ones), so the client will goes on to send > the packets to the same IP/MAC without being aware of the transition. Very good. 
But I don't know why you think this is a difference. This is possible with LVS too. The client uses the same VIP as dest addr. The difference is the Virtual MAC. The transition is not so easy as you said. In LVS the dispatcher keeps state for each connection to the real hosts. When switching from one router to another after crash you usually lose this connection table. So, all current connections are broken. Sending a gratuitous ARP broadcast is always faster than the establishing this table in the new router. The new router don't know which connection to which real host was forwarded and to continue the connections. The result: the connections are broken. So, keeping the local clients happy with the same VMAC is not a solution. You have to restore the connection table in the new router. Currently, this is not done in LVS. I'm not sure if it will be done. > > > and (2) why with VIP configured? > > In vrrp, the backup must not have the virtual IPs 'online'. Only the > master can receive/send packets with the virtual ips. OK, VIP is configured in the Master, not in the Backup. Or it is configured but flagged with IFA_F_NO_NDISC? Is it configured in the real hosts? Is NAT running in the default routers? How looks the packet from the real host through the default router? Now I don't know in which host you want to stop replying for VIP. > > > May be I don't understand well rfc2338, I don't > > know how you are using it. Are you trying to implement > > rfc2338 or just to build a working setup? > > rfc2338 as it is specified. my current implementation[1] works as specified > for 1 virtual group per physical interface but uses a non-standard trick > (i.e. no more handle the virtual MAC) to support several groups per > interface. > [1] http://w3.arobas.net/~jetienne/vrrpd.tgz I read it. But I don't know if you tested only the Master-Backup operations (which is the scope of this rfc). 
Because I'm not sure the setup described in rfc2338 (which is very
limited) will work without NAT-ing packets in the Linux dispatcher.
If they are NAT-ed I don't know why you have VIP in the internal
hosts. Where are VIPs, except in the Master, configured, even
specified with this flag?

>
> > I think, the only required kernel support can be:
> >
> > 1. not reply for these (hidden) VIPs
> > - this can be solved with policy routing or other
> > kind of filtering without the "hidden device" flag
>
> Yes, i read the linux kernel thread and the arp_filter solution
> of andi kleen but honestly i dont understand it. why and how to
> use the route table (so a destination thing) to know if you have
> to 'arp-reply' for a locally configured address (so a local thing) ?

arp_filter is not in our game. Here are some simple rules to hide
VIP in the Backup/Real servers. I'm still not sure if the traffic is
NAT-ed in your routers or just forwarded. Because if the packets
from the real hosts are just forwarded and saddr=VIP these packets
will not pass the Linux router due to source address validation
checks. I have a patch to solve this problem but I don't know how
your implementation works. If you reach a point where you need it
just let me know. I can propose it to the netdev. It allows packets
with saddr=local_ip to pass, controlled by the rp_filter flag. The
danger is in the fact that rp_filter defaults to 0 and a patched
kernel will allow spoofing if the flag is not set for the
non-trusted interfaces.

The setup is for non-Master hosts:

    # Block access from the LAN to the real server's VIP. By
    # this way we ignore the router's ARP probes. The drawback:
    # we ignore the client's probes too. We have to do this
    # because the client on the LAN can receive replies from all
    # real servers
    ip rule add prio 99 from 192.168.0/24 table 99
    ip route add table 99 blackhole 192.168.0.100

    # Now accept locally any other traffic, i.e. not from
    # 192.168.0/24
    ip rule add prio 100 table 100
    ip route add table 100 local 192.168.0.100 dev lo

Being in table 100 the VIP (192.168.0.100) is not selected as source
of any ARP probes. In other words you don't configure VIP using
ifconfig or by adding IP addresses. And with the blackhole setting
we block the traffic from the 192.168.0 logical network to the VIP
(we solve the ARP problem). We can talk only with clients reached
through the outgoing router(s), but not with local clients. Only the
hidden flag allows the clients to be on the same LAN from the
192.168.0 logical network.

>
> > 2. not use them as source of the ARP probes
> >
> > - currently this can be solved if the VIP is not
> > defined in the local table but in another table. But
> > I'm not sure if the planned Andrey's fixes will
> > allow this. Currently, arp_solicit selects only IP
> > addresses from the local table which allows this
> > requirement to work. After Andrey's patch they will
> > be selected from any table and this requirement will
> > not work.
>
> which patch are you speaking about ?

I looked at it one month ago:

ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/v2.3/route.generic

I see it has not changed for a long time and I don't know its
status. Andrey?

>
> > Jerome, can the hidden device flag solve your
> > problem? Oh, I now see that may be you need this VIP to be
> > on an ARP device (in the Backup server)?
>
> I think so, i coded IFA_F_NO_NDISC because i wasnt aware of 'hidden'.
> It isnt in the kernel source i looked in i.e. 2.4.0-test1.
> As our needs are similar, we should use the same mechanism.

I'm still not sure how the packets look exiting from the real hosts
in your implementation. You have to explain this because I can't
imagine how your setup is working. Without the above two
requirements your patch is not working for LVS. If VIP is configured
as address and so put in the "local" table it can be used as source
in the ARP probes.
I see this in your "MUST NOT"s for the Backup host. But this is requirement for all backup/real servers in LVS. So, your patch must be tuned if the net folks like the idea of this address flag. You have to patch inet_select_addr() too. The problem is arp_solicit, i.e. which local addresses are allowed in the ARP probes. This is decided from the routing. Your patch and defining VIP as interface address will allow using VIP in the ARP probes. But this is in our "MUST NOT". So, the big question is how you will put the IFA_F_NO_NDISC flag in the routing code (fib_lookup). > IFA_F_NO_NDISC is tunable per address so seems to be more flexible > than 'hidden' which is tunable per interface. This is not a problem. Just put your VIP in lo or dummy device and be happy. There is no difference at all. I don't know why you/rfc have VIP configured if you want the VIP to be so hidden (even not to talk IP). Just don't configure VIP. May be I don't understand something, you have to explain it. > Thinking about it, i have another project in which i need to completly > handle ARP reply and request in user space (still keeping the cache > in the kernel). Playing in user space is preferred. But we need to accept packets destined for VIP in the real hosts (where we hide these addresses). So, we need the mechanism to select IP addresses as source for the ARP probes to be determined in the kernel. It is not possible to control this from user space. The next step after the "hidden" flag is to remove the VIP but this is suitable only for your Backups. > > Why not provide a mechanism allowing to handle ARP request/reply from > userspace ? (af_packet to sniff the reply from the network, a flag > to prevent the kernel from replying and a kind of CONFIG_ARPD to > send ARP request to userspace via netlink) > It would be flexilbe and would satisfy the needs of my other project and > vrrp. i think it would be enough for lvs too, correct ? It is not enough. Read requirement 2. 
I'm sure you will run into the same problems as the problems in LVS.
If you explain what IP addresses are included in your communications
I will tell you where the problem is. Playing in the happy world
only with Master and Backups is not enough. Did you add some real
hosts and clients in your tests? I assume rfc2338 is only to support
failover for routers which use NAT. Is that true? Can this rfc work
for plain routing (without mangling packets) in Linux?

>
> If i writes that, would it be more acceptable for the maintainers ?

I'm not sure. I'm not against your patch but I think it will include
the same requirements as the patch for the "hidden" flag. And please
explain how the packets from one client request and the answer
traverse all hosts in the cluster. I think, the common variants are
two:

1> Routers using NAT

- the setup is same as the pictures in rfc2338

- you don't need to configure VIP in the real hosts, you can use
  private IP addresses

- you don't need to configure VIP in the backups (this is common to
  both setups)

- you can use the dispatcher as def gw

2> Router not using NAT (forwarding without packet mangling)

- VIP is configured as hidden IP in the real servers

- your patch must avoid these ARP requests too: "who has
  MASTER_ROUTER tell VIP" - we call for our def gw. I think, the
  static routes are obsoleted from rfc2338 :) Just play with ARP to
  solve this problem in your patch.

- you don't need to configure VIP in the backups

- you can use any random router for the outgoing traffic

- the picture from rfc2338 can't work with the current restrictions
  in the Linux router (the source address validation), you can't use
  the dispatcher as def gw without patching it because the packets
  reach the dispatcher with saddr=VIP. Even the policy routing can't
  help here.

The result: you need to flag only the addresses in the real servers
(if not using NAT).
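
The effect of the two `ip rule` / `ip route` pairs given earlier in this message can be pictured with a small toy model. This is only a sketch in Python, not kernel code: it assumes rules are tried in priority order, that a rule whose table has no matching route falls through to the next rule, and it leaves out the kernel's default rules (local, main, default). The names `RULES`, `TABLES` and `lookup` are made up for illustration.

```python
import ipaddress

LAN = ipaddress.ip_network("192.168.0.0/24")
ANY = ipaddress.ip_network("0.0.0.0/0")
VIP_NET = ipaddress.ip_network("192.168.0.100/32")

# table id -> list of (destination prefix, route type)
TABLES = {
    99: [(VIP_NET, "blackhole")],   # ip route add table 99 blackhole 192.168.0.100
    100: [(VIP_NET, "local")],      # ip route add table 100 local 192.168.0.100 dev lo
}

# (priority, source selector, table id)
RULES = [
    (99, LAN, 99),    # ip rule add prio 99 from 192.168.0/24 table 99
    (100, ANY, 100),  # ip rule add prio 100 table 100
]

def lookup(src, dst):
    """Return the route type chosen for a packet, or "no route"."""
    src = ipaddress.ip_address(src)
    dst = ipaddress.ip_address(dst)
    for _prio, selector, table_id in sorted(RULES, key=lambda r: r[0]):
        if src not in selector:
            continue  # rule does not apply to this source
        matches = [(net, kind) for net, kind in TABLES[table_id] if dst in net]
        if matches:
            # longest prefix wins inside one table
            return max(matches, key=lambda m: m[0].prefixlen)[1]
        # no route in this table: fall through to the next rule
    return "no route"

print(lookup("192.168.0.7", "192.168.0.100"))  # client on the LAN
print(lookup("10.1.2.3", "192.168.0.100"))     # client behind a router
```

In this model a sender on the 192.168.0 network hits the blackhole entry, while any other sender is delivered locally, which matches the drawback described in the message: clients on the same LAN cannot reach the VIP.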
Waiting for your comments :) Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Thu Jun 8 23:14:42 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 23:14:32 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:12417 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Thu, 8 Jun 2000 23:14:05 -0700 Received: (qmail 17370 invoked by uid 577); 9 Jun 2000 06:14:01 -0000 Message-ID: <20000609141401.A17352@saw.sw.com.sg> Date: Fri, 9 Jun 2000 14:14:01 +0800 From: Andrey Savochkin To: Julian Anastasov , Jerome Etienne Cc: netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) References: <20000608130323.A1049@long-haul.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: ; from "Julian Anastasov" on Fri, Jun 09, 2000 at 08:57:57AM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Fri, Jun 09, 2000 at 08:57:57AM +0300, Julian Anastasov wrote: > > which patch are you speaking about ? > > I looked it one month ago: > > ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/v2.3/route.generic > > I see it is not changed from long time ago and I > don't know its status. Andrey? I haven't added anything radically new. Just ported to to the recent kernels. I'm going to merge it into the mainstream kernel, but not right now. There is a discussion between me and Alexey Kuznetsov about some points of the patch. It hasn't been finished yet because of a lack of time on both sides. On the other hand, 2.4.0test isn't the right moment to introduce such patches :-) Best regards Andrey V. 
Savochkin From owner-netdev@oss.sgi.com Thu Jun 8 23:15:32 2000 Received: by oss.sgi.com id ; Thu, 8 Jun 2000 23:15:22 -0700 Received: from on-air-0.in-addr.de ([212.8.197.250]:45061 "HELO hermes.marowsky-bree.de") by oss.sgi.com with SMTP id ; Thu, 8 Jun 2000 23:15:17 -0700 Received: by hermes.marowsky-bree.de (Postfix, from userid 500) id 9E3244D03A; Fri, 9 Jun 2000 08:15:07 +0200 (CEST) Date: Fri, 9 Jun 2000 08:15:07 +0200 From: Lars Marowsky-Bree To: Julian Anastasov Cc: Jerome Etienne , Andrey Savochkin , netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) Message-ID: <20000609081507.C604@marowsky-bree.de> References: <20000608130323.A1049@long-haul.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: ; from "Julian Anastasov" on 2000-06-09T08:57:57 X-Ctuhulu: HASTUR Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On 2000-06-09T08:57:57, Julian Anastasov said: > The transition is not so easy as you said. In LVS the > dispatcher keeps state for each connection to the real > hosts. When switching from one router to another after crash > you usually lose this connection table. So, all current > connections are broken. 
rusty promised a "shared-state for Netfilter conntrack" vs beer exchange at the Ottawa Linux Symposium, so I assume this problem can be solved ;-) From owner-netdev@oss.sgi.com Fri Jun 9 00:57:02 2000 Received: by oss.sgi.com id ; Fri, 9 Jun 2000 00:56:53 -0700 Received: from dpx20.tu-varna.acad.bg ([194.141.24.4]:41739 "EHLO dpx20.tu-varna.acad.bg") by oss.sgi.com with ESMTP id ; Fri, 9 Jun 2000 00:56:37 -0700 Received: from linux.tu-varna.acad.bg (root@linux.tu-varna.acad.bg [194.141.24.6]) by dpx20.tu-varna.acad.bg (8.9.3/8.9.3) with ESMTP id KAA02362; Fri, 9 Jun 2000 10:52:49 +0300 Received: from linux.tu-varna.acad.bg (uli@linux.tu-varna.acad.bg [194.141.24.6]) by linux.tu-varna.acad.bg (8.8.5/myconf) with ESMTP id KAA07494; Fri, 9 Jun 2000 10:56:34 +0300 Date: Fri, 9 Jun 2000 10:56:34 +0300 (EEST) From: Julian Anastasov To: Andrey Savochkin cc: Jerome Etienne , netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) In-Reply-To: <20000609141401.A17352@saw.sw.com.sg> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, On Fri, 9 Jun 2000, Andrey Savochkin wrote: > On Fri, Jun 09, 2000 at 08:57:57AM +0300, Julian Anastasov wrote: > > > which patch are you speaking about ? > > > > I looked it one month ago: > > > > ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/v2.3/route.generic > > > > I see it is not changed from long time ago and I > > don't know its status. Andrey? > > I haven't added anything radically new. Just ported to to the recent > kernels. I'm going to merge it into the mainstream kernel, but not right > now. There is a discussion between me and Alexey Kuznetsov about some points > of the patch. It hasn't been finished yet because of a lack of time on both > sides. On the other hand, 2.4.0test isn't the right moment to introduce such > patches :-) I want to ask you about arp_solicit. Is inet_addr_type() going to die? 
I see it is still used. My question is: can arp_solicit continue to
use inet_addr_type() as in the current kernel and not call
fib_local_source()? You still can call fib_select_addr. Is there a
good reason to change it?

This way we limit the local addresses that we can announce in our
ARP probes. I'm not sure if this will break something but it will
allow addresses not included in the "local" table not to be
announced. This will help to hide these addresses. But I agree, may
be this is a hack. We will accept traffic locally for more IP
addresses but send probes only with the preferred addresses from the
same logical network defined in the "local" table.

For me, there is this difference: why should we allow all IP
addresses which we treat as local in the incoming traffic to be
treated as local in the outgoing traffic? If we want to use them as
source in the ARP probes we can include them in the "local" table.
Does that sound reasonable? The user can select which addresses to
hide by not including them in the "local" table. This is the only
difference we can make between the ARP and the other traffic by
using only FIB calls. We still can talk with this IP, we only change
ARP behaviour. If we don't want to talk IP with these hidden
addresses we can completely remove them from all tables.
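
The split proposed here, accepting traffic for every locally routed address but announcing only those in the "local" table, can be sketched as follows. This is a toy model in Python of the behaviour described in this message, not the kernel's arp_solicit(); the function names and the `preferred` argument (standing in for whatever fib_select_addr() would return) are assumptions made for illustration.

```python
def arp_probe_source(pkt_saddr, local_table, preferred):
    """Choose the source IP to put in an ARP request.

    pkt_saddr   -- source address of the packet that triggered the
                   probe, or None if there is no such packet
    local_table -- addresses present in the "local" routing table
    preferred   -- address the stack would otherwise pick for the link
    """
    if pkt_saddr is not None and pkt_saddr in local_table:
        return pkt_saddr   # visible address: announce it
    return preferred       # hidden or unknown: fall back


def accepts_locally(daddr, local_table, other_local):
    """Incoming traffic is accepted for local routes in any table."""
    return daddr in local_table or daddr in other_local


local_table = {"192.168.0.2"}
hidden = {"192.168.0.100"}   # local route kept in another table

# A packet sent from the hidden VIP does not leak it into the probe.
print(arp_probe_source("192.168.0.100", local_table, "192.168.0.2"))
# Traffic addressed to the hidden VIP is still accepted.
print(accepts_locally("192.168.0.100", local_table, hidden))
```

The point of the design is that hiding an address becomes a matter of which table holds its local route, with no extra flag needed in this model.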
Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Fri Jun 9 01:12:23 2000 Received: by oss.sgi.com id ; Fri, 9 Jun 2000 01:12:13 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:24193 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Fri, 9 Jun 2000 01:12:04 -0700 Received: (qmail 19042 invoked by uid 577); 9 Jun 2000 08:11:59 -0000 Message-ID: <20000609161159.A19020@saw.sw.com.sg> Date: Fri, 9 Jun 2000 16:11:59 +0800 From: Andrey Savochkin To: Julian Anastasov Cc: Jerome Etienne , netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) References: <20000609141401.A17352@saw.sw.com.sg> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: ; from "Julian Anastasov" on Fri, Jun 09, 2000 at 10:56:34AM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, On Fri, Jun 09, 2000 at 10:56:34AM +0300, Julian Anastasov wrote: > I want to ask you about arp_solicit. Is > inet_addr_type() going to die? I see it is still used. My > question is: can arp_solicit continue to use > inet_addr_type() as in the current kernel and not to call > fib_local_source()? You still can call fib_select_addr. Is > there a good reason to change it? > > By this way we limit the local addresses that we can > announce in our ARP probes. I'm not sure if this will break > something but that will allow addresses not included in the > "local" table not to be announced. This will help to hide > these addresses. But I agree, may be this is a hack. We will I suppose, the kernel looked only in local table only for optimization purposes under assumption that nobody wanted to insert local addresses in other tables. But it is very necessary for source routing and proper ordering of routing lookups. A most simple example: I get network a.b.c.0/28 for some hosting. I assigned a.b.c.d for gateway and use other addresses as local for virtual servers. 
I write:

    ip route add a.b.c.d dev eth0 table 10
    ip route add local a.b.c.0/28 dev eth0 table 10

That's fine. But if I'm forced to reorder entries to put all local
addresses in local table, I can't add a route for such a.b.c.0/28
prefix.

Adding local routes into non-local table you just exploit some
kernel optimization.

We've discussed arp_solicit with you. As far as I remember, we
agreed that always using fib_select_addr as a source address for the
arp requests, independently if we may use skb's source, is good for
you. It's ok for me, too.

Best regards
Andrey V. Savochkin

From owner-netdev@oss.sgi.com Fri Jun 9 01:45:13 2000
Received: by oss.sgi.com id ; Fri, 9 Jun 2000 01:45:04 -0700
Received: from laurin.munich.netsurf.de ([194.64.166.1]:26349 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Fri, 9 Jun 2000 01:45:01 -0700
Received: from fred.muc.de (none@ns1113.munich.netsurf.de [195.180.235.113]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id KAA14394; Fri, 9 Jun 2000 10:44:16 +0200 (MET DST)
Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 130KH6-0000s8-00; Fri, 9 Jun 2000 10:36:20 +0200
Date: Fri, 9 Jun 2000 10:36:20 +0200
From: Andi Kleen
To: Lennert Buytenhek
Cc: Andi Kleen , Andrey Savochkin , Mitchell Blank Jr , Ben Greear , rob@valinux.com, netdev@oss.sgi.com, Gleb Natapov , jamal
Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???
Message-ID: <20000609103620.A3028@fred.muc.de>
References: <20000607010120.A4334@fred.muc.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.95.4us
In-Reply-To: ; from Lennert Buytenhek on Thu, Jun 08, 2000 at 10:24:16PM +0200
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

On Thu, Jun 08, 2000 at 10:24:16PM +0200, Lennert Buytenhek wrote:
>
>
> On Wed, 7 Jun 2000, Andi Kleen wrote:
>
> > > > The current kernel infrastructure for packet mangling may still need
> > > > some adjustments, but it at least exists.
I'm encouraging to consider > > > > VLAN implementation as just a netfilter module. > > > > > > "All the world is an IP net"? How should I run IPX over my VLANs then? > > > > Netfilter is not an IP only thing. It is a generic framework for > > packet mangling. Although currently only IPv4 and IPv6 netfilter > > implementations exist it would be no big problem to add ``raw > > ethernet'' netfilter hooks. > > Raw ethernet netfilter hooks, as are IPX netfilter hooks by the way, are > currently a nice blue cloud in the sky. They are not hard to add. Just so far nobody needed them. You could either hook them into the device's hard_start_xmit function or as an netfilter qdisc for the devices with queue (both requires no additional instructions in the fast path when not used) > As we're getting into the architectural purity business anyway, does it > make a whole lot of sense to netfilter on two different protocol levels? I think it does. netfilter is not a single entinity, it is a hook infrastructure. When multi layer filtering is required it can be integrated. That's generally. I haven't thought hard enough about the VLAN problem to decide if it really makes sense to implement it as netfilter module. The net_fifo idea seems to have some merits too (struct net_device probably does too many things at once currently and some hash tables/trees for device management probably couldn't hurt) -Andi -- This is like TV. I don't like TV. 
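
The remark in the message above, that netfilter is a hook infrastructure and not a single filter, can be pictured with a toy registry keyed by protocol layer and hook point. This is a sketch in Python under made-up names (`HookRegistry`, the "ether" and "ipv4" families, the "xmit" and "forward" points); it is not the kernel's netfilter API and says nothing about how a VLAN module would really be wired in.

```python
ACCEPT, DROP = "accept", "drop"


class HookRegistry:
    """Callbacks grouped by (protocol family, hook point)."""

    def __init__(self):
        self._hooks = {}

    def register(self, family, point, priority, fn):
        chain = self._hooks.setdefault((family, point), [])
        chain.append((priority, fn))
        chain.sort(key=lambda h: h[0])   # lower priority value runs first

    def run(self, family, point, packet):
        # With no hooks registered the loop body never runs, so an
        # unused layer costs only the lookup.
        for _prio, fn in self._hooks.get((family, point), ()):
            if fn(packet) == DROP:
                return DROP
        return ACCEPT


reg = HookRegistry()
# Hypothetical link-layer hook: only frames tagged with VLAN 5 pass.
reg.register("ether", "xmit", 0,
             lambda p: ACCEPT if p.get("vlan") == 5 else DROP)
# Hypothetical IPv4 hook living in the same registry.
reg.register("ipv4", "forward", 0,
             lambda p: DROP if p.get("dport") == 23 else ACCEPT)

print(reg.run("ether", "xmit", {"vlan": 7}))
print(reg.run("ipv4", "forward", {"dport": 80}))
```

Filtering at two protocol levels is then just two chains in the same registry, which is the sense in which multi-layer filtering "can be integrated".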
From owner-netdev@oss.sgi.com Fri Jun 9 04:14:23 2000 Received: by oss.sgi.com id ; Fri, 9 Jun 2000 04:14:13 -0700 Received: from tcm.hut.fi ([130.233.44.1]:51977 "EHLO tcm-gw.tcm.hut.fi") by oss.sgi.com with ESMTP id ; Fri, 9 Jun 2000 04:13:49 -0700 Received: (from smap@localhost) by tcm-gw.tcm.hut.fi (8.8.7/8.8.7) id OAA02075 for ; Fri, 9 Jun 2000 14:13:42 +0300 Received: from caffeine.tcm.hut.fi(130.233.45.27) by tcm-gw.tcm.hut.fi via smap (V2.0) id xma002072; Fri, 9 Jun 00 14:13:28 +0300 Received: from morphine.tcm.hut.fi (morphine.tcm.hut.fi [130.233.45.7]) by caffeine.tcm.hut.fi (8.9.2/8.9.2) with ESMTP id OAA03264 for ; Fri, 9 Jun 2000 14:13:32 +0300 (EET DST) Received: from localhost (lpetande@localhost) by morphine.tcm.hut.fi (8.9.2/8.7.1) with ESMTP id OAA07731 for ; Fri, 9 Jun 2000 14:13:21 +0300 (EET DST) X-Authentication-Warning: morphine.tcm.hut.fi: lpetande owned process doing -bs Date: Fri, 9 Jun 2000 14:13:21 +0300 (EET DST) From: Lars Henrik Petander To: netdev@oss.sgi.com Subject: MIPL Mobile IPv6 for Linux Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! We have implemented Mobile IPv6 as a kernel module for the Linux kernel version 2.4.0-test1. Since the code includes implementation of the mobility related destination options it is probably of some interest to this list. The release 0.50 of MIPL is an alpha prototype, but it implements most of the functionality of the IETF draft version 12 of mobility support for IPv6, including route optimization. However authentication of the binding updates and acknowledgements is still missing. Once Free/SWAN completes porting their IPSec implementation to IPv6, we will add support for authentication. The release consists of a kernel module, some patches to the IPv6 module and configuration and installation tools. 
The release, source code and more information are available from: http://vesper.tky.hut.fi/mip Henrik Petander, MIPL project From owner-netdev@oss.sgi.com Fri Jun 9 05:04:23 2000 Received: by oss.sgi.com id ; Fri, 9 Jun 2000 05:04:13 -0700 Received: from dpx20.tu-varna.acad.bg ([194.141.24.4]:36111 "EHLO dpx20.tu-varna.acad.bg") by oss.sgi.com with ESMTP id ; Fri, 9 Jun 2000 05:03:55 -0700 Received: from linux.tu-varna.acad.bg (root@linux.tu-varna.acad.bg [194.141.24.6]) by dpx20.tu-varna.acad.bg (8.9.3/8.9.3) with ESMTP id PAA36806; Fri, 9 Jun 2000 15:00:00 +0300 Received: from linux.tu-varna.acad.bg (uli@linux.tu-varna.acad.bg [194.141.24.6]) by linux.tu-varna.acad.bg (8.8.5/myconf) with ESMTP id PAA14618; Fri, 9 Jun 2000 15:03:44 +0300 Date: Fri, 9 Jun 2000 15:03:44 +0300 (EEST) From: Julian Anastasov To: Andrey Savochkin cc: Jerome Etienne , netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) In-Reply-To: <20000609161159.A19020@saw.sw.com.sg> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, On Fri, 9 Jun 2000, Andrey Savochkin wrote: > Hello, > > On Fri, Jun 09, 2000 at 10:56:34AM +0300, Julian Anastasov wrote: > > I want to ask you about arp_solicit. Is > > inet_addr_type() going to die? I see it is still used. My > > question is: can arp_solicit continue to use > > inet_addr_type() as in the current kernel and not to call > > fib_local_source()? You still can call fib_select_addr. Is > > there a good reason to change it? > > > > By this way we limit the local addresses that we can > > announce in our ARP probes. I'm not sure if this will break > > something but that will allow addresses not included in the > > "local" table not to be announced. This will help to hide > > these addresses. But I agree, may be this is a hack. 
We will
>
> I suppose, the kernel looked only in local table only for optimization
> purposes under assumption that nobody wanted to insert local addresses in
> other tables. But it is very necessary for source routing and proper
> ordering of routing lookups.
>
> A most simple example: I get network a.b.c.0/28 for some hosting. I assigned
> a.b.c.d for gateway and use other addresses as local for virtual servers. I
> write:
> ip route add a.b.c.d dev eth0 table 10
> ip route add local a.b.c.0/28 dev eth0 table 10
> That's fine. But if I'm forced to reorder entries to put all local addresses
> in local table, I can't add a route for such a.b.c.0/28 prefix.

Yep, may be there are cases where local routes can be specified not
in the "local" table. But if we don't change the call in
arp_solicit() from inet_addr_type() to fib_local_source(), we will
use IP from another logical network, i.e. returned from
fib_select_addr(). I don't think this is a big problem. Our gateway
can always be configured with another IP address because we usually
don't define only a.b.c.0/28 in our host. Of course, our request
will look like: "who-has 192.168.0.1 tell 192.168.0.2", where
192.168.0.1 and a.b.c.d are configured in the router and we send our
probe after the request "client -> a.b.c.e". But because a.b.c.e is
not defined in the "local" table it is not found by
inet_addr_type(). The result: we send a valid ARP probe (with saddr
and daddr from the same logical network) using 192.168.0.2 as source
of the request. Is this a big problem? To use another IP as gateway?
From the same device? I assume the only requests from the router
will be "who-has a.b.c.e tell a.b.c.d". But we will reply for
a.b.c.e because it is local (in any table): "a.b.c.e is-at OUR_MAC",
we use ip_route_input() in this case which returns RTN_LOCAL. Is
that working? :) The drawback is that we will probe only with
addresses from the "local" table.
If all hosts share this logical network it is not a problem to support many additional logical networks as specified in your example. So, only by defining additional logical network (the main network 192.168.0.0/24), which is usual because we add our new a.b.c.0/28 network as an additional network we solve this problem. The result: we can add additional networks with any prefix but to use our main logical network to resolve our ARP requests. > > Adding local routes into non-local table you just explore some kernel > optimization. > > We've discussed arp_solicit with you. > As far as I remember, we agreed that always using fib_select_addr as a source > address for the arp requests, independently if we may use skb's source, > is good for you. It's ok for me, too. Yes, using fib_select_addr() is not a problem. Oops, sorry, my question was for fib_local_source(), not for fib_select_addr() but you understood me. We want to keep using inet_addr_type() instead of fib_local_source() in arp_solicit(). Only there. And it is not a problem the hosts to be involved in same logical network to allow them to define any number of additional logical networks as local but not in the "local" table. I think, this can't be a problem for any user which can add routes in other tables, i.e. not only in the "local" table :) So, can we just in this case to use inet_addr_type? I now see that inet_addr_type() is changed too, to use fib_lookup. I forgot about this. Yep, with this change we can't do this (to limit the local addresses in arp_solicit). I have to wait for the final version of your patch :) For now, without the hidden flag, the only problem is that we can't allow clients on the LAN to access the hidden addresses in the real servers. We have to live with it. Or to allow it with patch :) The last words: IFA_F_NO_NDISC with arp_solicit calling original inet_addr_type is acceptable for LVS. This will allow local clients to talk with the cluster. But is it acceptable? 
If this is not true, LVS don't need IFA_F_NO_NDISC (this version of the patch). Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Fri Jun 9 07:10:03 2000 Received: by oss.sgi.com id ; Fri, 9 Jun 2000 07:09:53 -0700 Received: from mail.arobas.net ([205.205.36.6]:59657 "EHLO mail.arobas.net") by oss.sgi.com with ESMTP id ; Fri, 9 Jun 2000 07:09:32 -0700 Received: from dialin156.arobas.net (qmailr@ppp41.arobas.net [205.205.36.111]) by mail.arobas.net (8.9.3/8.9.3) with SMTP id KAA68584 for ; Fri, 9 Jun 2000 10:05:54 -0400 (EDT) Received: (qmail 985 invoked by uid 1000); 9 Jun 2000 13:53:19 -0000 Date: Fri, 9 Jun 2000 09:53:19 -0400 From: Jerome Etienne To: Julian Anastasov Cc: Jerome Etienne , Andrey Savochkin , netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) Message-ID: <20000609095319.A921@long-haul.net> Reply-To: jetienne@arobas.net References: <20000608130323.A1049@long-haul.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii User-Agent: Mutt/1.0i In-Reply-To: ; from uli@linux.tu-varna.acad.bg on Fri, Jun 09, 2000 at 08:57:57AM +0300 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Fri, Jun 09, 2000 at 08:57:57AM +0300, Julian Anastasov wrote: > The transition is not so easy as you said. In LVS the > dispatcher keeps state for each connection to the real > hosts. When switching from one router to another after crash > you usually lose this connection table. So, all current > connections are broken. no, because this connection tracking doesnt exist in vrrp and is specific to LVS. vrrp and lvs dont have the same aim. > > In vrrp, the backup must not have the virtual IPs 'online'. Only the > > master can receive/send packets with the virtual ips. > > OK, VIP is configured in the Master, not in the > Backup. Or it is configured but flagged with > IFA_F_NO_NDISC? no, it is configured only in the master and always with IFA_F_NO_NDISC > Is it configured in the real hosts? 
i dunno what you mean by real hosts. vrrp doesnt have such thing. It is configured only in the servers and the clients arent aware of vrrp at all. > Is NAT running in the default routers? NAT is completly unrelated to VRRP. running it or not wont change vrrp action. > How looks the packet from > the real host through the default router? Now I don't know > in which host you want to stop replying for VIP. in fact the virtual ip owner have to answer to ARP request for the virtual IPs. Currently the kernel can't reply with the virtual mac for the virtual IPs because this imply differents MACs if the several groups are configured on the same interface. so i prevent the kernel from replying and let a userspace deamon do it. my first email explains it better. > I read it. But I don't know if you tested only the > Master-Backup operations (which is the scope of this rfc). > Because I'm not sure the setup described in rfc2338 (which > is very limitted) will work without NAT-ing packets in the > Linux dispatcher. If they are NAT-ed I don't know why you > have VIP in the internal hosts. as i said lvs and vrrp are similar but distinct. vrrp has no dispatcher, and no notion of internal host. > Where are VIPs, except in the Master, configured, even specified > with this flag? the vip are configured as a local interface address only in the master. in the clients, vips are used only as destination (any nodes which wish to send packet to this ip) but the clients arent aware that this ip is virtual. so the client has no special configuration, their interface doesnt accept packet destined for the vip (so they dont use IFA_F_NO_NDISC) > > I think so, i coded IFA_F_NO_NDISC because i wasnt aware of 'hidden'. > > It isnt in the kernel source i looked in i.e. 2.4.0-test1. > > As our needs are similar, we should use the same mechanism. > > I'm still not sure how the packets look exiting > from the real hosts in your implementation. 
You have to > explain this because I can't imagine how your setup is > working. Without the above two requirements you patch is not > working for LVS. If VIP is configured as address and so put > in the "local" table it can be used as source in the ARP > probes. I see this in your "MUST NOT"s for the Backup host. > But this is requirement for all backup/real servers in LVS. > So, your patch must be tuned if the net folks like the idea > of this address flag. You have to patch inet_select_addr() > too. The problem is arp_solicit, i.e. which local addresses > are allowed in the ARP probes. This is decided from the > routing. Your patch and defining VIP as interface address > will allow using VIP in the ARP probes. But this is in our > "MUST NOT". So, the big question is how you will put the > IFA_F_NO_NDISC flag in the routing code (fib_lookup). i fail to see why i should. To know if i answers to a ARP reply for a locally configured address doesnt seems related to the routing code to me. it is related to the configuratoin of the local addresses so typically IFA_F_*. > > IFA_F_NO_NDISC is tunable per address so seems to be more flexible > > than 'hidden' which is tunable per interface. > > This is not a problem. Just put your VIP in lo or > dummy device and be happy. There is no difference at all. I > don't know why you/rfc have VIP configured if you want the > VIP to be so hidden (even not to talk IP). the rfc doesnt want the virtual IP to be unused at all, they are used for xmit/recv as any ip. It want that a ARP request for a vip is replyed by a ARP reply containing the virtual mac. i use IFA_F_NO_NDISC only as a trick to prevent the kernel from replying and thus reply from the userspace. > Just don't > configure VIP. May be I don't understand something, you > have to explain it. maybe you should reread my first email on netdev and possibly reread the rfc without thinking it is a way to do LVS. vrrp and lvs have different aims. 
> > Thinking about it, i have another project in which i need to completly
> > handle ARP reply and request in user space (still keeping the cache
> > in the kernel).
>
>     Playing in user space is preferred. But we need to
> accept packets destined for VIP in the real hosts (where we
> hide these addresses).

whether a packet for the IP (virtual or not) is accepted is unrelated to ARP.

> So, we need the mechanism to select
> IP addresses as source for the ARP probes to be determined
> in the kernel. It is not possible to control this from user
> space.

i don't see why. please elaborate on that.

> > If i writes that, would it be more acceptable for the maintainers ?
>
>     I'm not sure. I'm not against your patch but I think
> it will include the same requirements as the patch for the
> "hidden" flag.

yes, but with the same requirements IFA_F_NO_NDISC is more flexible.
nevertheless that isn't the point. i am trying to provide a mechanism
flexible enough to satisfy our current needs and potential future
ones. handling ARP packets in user space makes it easy to implement
any custom behaviour.

> And please explain how the packets from one client
> request and the answer traverse all hosts in the cluster.

there is no cluster, and packets don't traverse any nodes in vrrp.
it is purely a LAN matter. lvs and vrrp are different.
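The point about several VRRP groups on one interface needing different MACs can be illustrated with a small sketch. RFC 2338 defines the virtual router MAC as 00-00-5E-00-01-{VRID}; everything else below (the function names, the `mastered` mapping, the example addresses) is hypothetical and only models what a userspace daemon might answer, not the actual patch or daemon discussed in this thread.

```python
from typing import Dict, Optional, Set


def vrrp_virtual_mac(vrid: int) -> str:
    """Return the RFC 2338 virtual router MAC for a VRID in 1..255."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be in the range 1..255")
    return "00:00:5e:00:01:%02x" % vrid


def arp_reply_mac(target_ip: str, mastered: Dict[int, Set[str]]) -> Optional[str]:
    """Pick the MAC to put in an ARP reply for target_ip.

    `mastered` is a hypothetical map from VRID to the virtual IPs this
    node currently owns as master. Returns None when the address is not
    one of ours, i.e. no reply should be sent from here.
    """
    for vrid, vips in mastered.items():
        if target_ip in vips:
            return vrrp_virtual_mac(vrid)
    return None


if __name__ == "__main__":
    # Two groups on the same interface: each VIP maps to its own MAC.
    groups = {1: {"192.168.0.1"}, 2: {"192.168.0.254"}}
    for ip in ("192.168.0.1", "192.168.0.254", "192.168.0.7"):
        print(ip, "->", arp_reply_mac(ip, groups))
```

Because the MAC depends on the VRID rather than on the interface, a single hardware address cannot serve both groups, which is the stated reason for suppressing the kernel's own reply.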
From owner-netdev@oss.sgi.com Fri Jun 9 07:28:32 2000 Received: by oss.sgi.com id ; Fri, 9 Jun 2000 07:28:22 -0700 Received: from cpu2747.adsl.bellglobal.com ([207.236.55.216]:26359 "EHLO grendel.conscoop.ottawa.on.ca") by oss.sgi.com with ESMTP id ; Fri, 9 Jun 2000 07:28:11 -0700 Received: (from rgb@localhost) by grendel.conscoop.ottawa.on.ca (8.9.0/8.9.0) id KAA07428; Fri, 9 Jun 2000 10:29:54 -0400 Date: Fri, 9 Jun 2000 10:29:53 -0400 From: Richard Guy Briggs To: Lars Henrik Petander Cc: netdev@oss.sgi.com Subject: Re: MIPL Mobile IPv6 for Linux Message-ID: <20000609102953.N31710@grendel.conscoop.ottawa.on.ca> References: Mime-Version: 1.0 Content-Type: multipart/signed; boundary=B9BE8dkJ1pIKavwa; micalg=pgp-md5; protocol="application/pgp-signature" X-Mailer: Mutt 0.95.7i In-Reply-To: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing --B9BE8dkJ1pIKavwa Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable On Fri, Jun 09, 2000 at 02:13:21PM +0300, Lars Henrik Petander wrote: > Hello! >=20 > We have implemented Mobile IPv6 as a kernel module for the Linux kernel > version 2.4.0-test1. Since the code includes implementation of the > mobility related destination options it is probably of some interest to > this list. >=20 > The release 0.50 of MIPL is an alpha prototype, but it implements most of > the functionality of the IETF draft version 12 of mobility support for > IPv6, including route optimization. However authentication of the binding > updates and acknowledgements is still missing. Once Free/SWAN completes > porting their IPSec implementation to IPv6, we will add support for > authentication. Have you looked at FreeS/WAN yet? I just wondered how well it fits in at the moment with your structure, given that it still only does IPv4. With recent (1.3 and 1.4) we now have support for road warriors using RSA keys. We have someone in Germany (who will be at OLS, yay! 
www.ottawalinuxsymposium.com) who is very eager to port FreeS/WAN to IPv6 and has already started by porting most of the keying daemon (pluto) and library code to IPv6. STill to tackle is KLIPS, the kernel code, for which I am responsible. There will be at least 6 FreeS/WAN people at OLS, so if you are there, we can discuss some of this in person. > The release consists of a kernel module, some patches to the IPv6 > module and configuration and installation tools. The release, > source code and more information are available from: >=20 > http://vesper.tky.hut.fi/mip=20 Thanks for your announcement! > Henrik Petander, > MIPL project slainte mhath, RGB --=20 Richard Guy Briggs -- PGP key available Auto-Free Ottawa! Canada Prevent Internet Wiretapping! -- FreeS/WAN: Thanks for voting Green! -- Marillion: --B9BE8dkJ1pIKavwa Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: 2.6.3i iQCVAwUBOUD/X9+sBuIhFagtAQGoGgP8CpzLbtNROHBpUNEcqLKgabNP+lUy7Xvk wIfc1W0VIkbsGHrYoOGvxrOT5Hcy+v204aoBDRd/tYJ8apb8aCJmf4rnSQeqfa8Q frzBYb9zDwHepGfALIeyFJIVQvpXXq2v4VOVHurPkcz5tuUK41zqd12CaYyTv9Rm cJ1NS+iB1Ss= =U8Jt -----END PGP SIGNATURE----- --B9BE8dkJ1pIKavwa-- From owner-netdev@oss.sgi.com Fri Jun 9 08:23:02 2000 Received: by oss.sgi.com id ; Fri, 9 Jun 2000 08:22:52 -0700 Received: from sirppi.helsinki.fi ([128.214.205.27]:9235 "EHLO sirppi.helsinki.fi") by oss.sgi.com with ESMTP id ; Fri, 9 Jun 2000 08:22:40 -0700 Received: from localhost (amlaukka@localhost) by sirppi.helsinki.fi (8.10.1/8.10.1) with ESMTP id e59FM3q02059; Fri, 9 Jun 2000 18:22:03 +0300 (EET DST) X-Authentication-Warning: sirppi.helsinki.fi: amlaukka owned process doing -bs Date: Fri, 9 Jun 2000 18:22:03 +0300 (EET DST) From: Aki M Laukkanen To: kuznet@ms2.inr.ac.ru cc: netdev@oss.sgi.com, iwtcp@cs.helsinki.fi Subject: Re: Slow TCP connection between linux and wince In-Reply-To: <200006071608.UAA21333@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII 
Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Wed, 7 Jun 2000 kuznet@ms2.inr.ac.ru wrote: > Also, the fact that truesize exceeds mss not more than twice > is really crucial. When it is not true, linux tcp used to fail > miserably. This phenomenon is visible only on high rtt networks > with small losses, though. used - As in past tense? Remember that truesize is not the whole story. The cloned skbs show up in wmem_alloc too which is why we got bitten by the burstiness. I see the heuristics are on the conservative side though. > > I don't think this a highly non-standard situation. Even if we ignore > > slow wireless links (e.g. GSM data), to some extent every modem-link > > exhibits this behaviour. Have you read draft-ietf-pilc-slow-03.txt or > > rfc2757.txt? Good read for a little perspective to the problem space. > > You missed the point. Network should have large _packet_ power_ > i.e. (rtt*bandwidth)/mss to hit this problem. This situation never occured > in real life earlier. I have no idea, how you reached cwnd of 192. 8) Valid examples are wireless and satellite links. Congestion window can grow freely because the delay was constant in this test. > Of course. The question is how to make this. I proposed one solution. Ok, tested with 0608. Hmm. what can I say, the window never grows past 8kB with mss of 256. Both the ofo-queue pruning and the burstiness is masked by this behaviour (obviously). The latter of course only if the receiver is a linux with 0608 too. Tested with mss of 536 and 1024 too - results in max window of ~16kB and ~24kB respectively. I can't say I'm satisfied though. This penalises connections with smaller MTUs. Think MTU of 576 which I think is pretty common on the Internet at whole. With larger RTTs you can not use the whole bandwidth which is available because the window is just too small. 
Those tests were done with PPP which only allocates MRU bytes per skb but your average ethernet driver has to allocate 1500+ bytes per skb regardless of what the actual packet size is. You can enlarge the socket buffer size to get a bigger window but how many people will? Also many applications try to enforce a certain advertised window by setting the socket buffer size by themselves. This no longer results has the effect they want. From owner-netdev@oss.sgi.com Fri Jun 9 09:30:33 2000 Received: by oss.sgi.com id ; Fri, 9 Jun 2000 09:30:23 -0700 Received: from dpx20.tu-varna.acad.bg ([194.141.24.4]:8718 "EHLO dpx20.tu-varna.acad.bg") by oss.sgi.com with ESMTP id ; Fri, 9 Jun 2000 09:30:03 -0700 Received: from linux.tu-varna.acad.bg (root@linux.tu-varna.acad.bg [194.141.24.6]) by dpx20.tu-varna.acad.bg (8.9.3/8.9.3) with ESMTP id TAA62678; Fri, 9 Jun 2000 19:26:05 +0300 Received: from linux.tu-varna.acad.bg (uli@linux.tu-varna.acad.bg [194.141.24.6]) by linux.tu-varna.acad.bg (8.8.5/myconf) with ESMTP id TAA22430; Fri, 9 Jun 2000 19:29:48 +0300 Date: Fri, 9 Jun 2000 19:29:48 +0300 (EEST) From: Julian Anastasov To: Jerome Etienne cc: Andrey Savochkin , netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) In-Reply-To: <20000609095319.A921@long-haul.net> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, On Fri, 9 Jun 2000, Jerome Etienne wrote: > > How looks the packet from > > the real host through the default router? Now I don't know > > in which host you want to stop replying for VIP. > > in fact the virtual ip owner have to answer to ARP request for the virtual > IPs. Currently the kernel can't reply with the virtual mac for the > virtual IPs because this imply differents MACs if the several groups > are configured on the same interface. so i prevent the kernel from > replying and let a userspace deamon do it. 
my first email explains > it better. Oh, well. I reread the rfc. I read mostly the MUSTs first time and didn't understand the difference in your tests. > > > I read it. But I don't know if you tested only the > > Master-Backup operations (which is the scope of this rfc). > > Because I'm not sure the setup described in rfc2338 (which > > is very limitted) will work without NAT-ing packets in the > > Linux dispatcher. If they are NAT-ed I don't know why you > > have VIP in the internal hosts. > > as i said lvs and vrrp are similar but distinct. vrrp > has no dispatcher, and no notion of internal host. > > > Where are VIPs, except in the Master, configured, even specified > > with this flag? > > the vip are configured as a local interface address only in the master. > > in the clients, vips are used only as destination (any nodes which wish to > send packet to this ip) but the clients arent aware that this ip is virtual. > so the client has no special configuration, their interface doesnt accept > packet destined for the vip (so they dont use IFA_F_NO_NDISC) Now I understand. > the rfc doesnt want the virtual IP to be unused at all, they are used > for xmit/recv as any ip. It want that a ARP request for a vip is > replyed by a ARP reply containing the virtual mac. i use IFA_F_NO_NDISC > only as a trick to prevent the kernel from replying and thus reply from > the userspace. OK > > > Just don't > > configure VIP. May be I don't understand something, you > > have to explain it. > > maybe you should reread my first email on netdev and possibly > reread the rfc without thinking it is a way to do LVS. vrrp Already done. > and lvs have different aims. > > > > Thinking about it, i have another project in which i need to completly > > > handle ARP reply and request in user space (still keeping the cache > > > in the kernel). > > > > Playing in user space is preferred. But we need to > > accept packets destined for VIP in the real hosts (where we > > hide these addresses). 
>
> to accept or not packet for the IP(virtual or not) is unrelated to ARP.
>
> > So, we need the mechanism to select
> > IP addresses as source for the ARP probes to be determined
> > in the kernel. It is not possible to control this from user
> > space.
>
> i dont see why. please elaborate on that

    Look in arp_solicit (2.2). I'm talking about selecting
the source address in the ARP requests not for ARP replies.
But now I understand your problem.

>
> > > If i writes that, would it be more acceptable for the maintainers ?
> >
> >     I'm not sure. I'm not against your patch but I think
> > it will include the same requirements as the patch for the
> > "hidden" flag.
>
> yes but with the same requirement, IFA_F_NO_NDISC is more flexible.
> nevertheless it isnt the point. I try to provide the mechanism
> flexible enought to satisfy our current needs and the potential
> future one. handle ARP packets in user space allow to implement
> any custom behaviour easily.

    Good luck! IFA_F_NO_NDISC is not a complete solution
for LVS. The "hidden" flag has other semantic and the goal
is to mark part of the local addresses as shared (hidden).
This is different from IFA_F_NO_NDISC. We need to control
which addresses can be announced in the ARP probes too.

Regards

--
Julian Anastasov

From owner-netdev@oss.sgi.com Fri Jun 9 09:40:02 2000
Received: by oss.sgi.com id ; Fri, 9 Jun 2000 09:39:52 -0700
Received: from mail.arobas.net ([205.205.36.6]:48400 "EHLO mail.arobas.net")
    by oss.sgi.com with ESMTP id ; Fri, 9 Jun 2000 09:39:29 -0700
Received: from dialin156.arobas.net (qmailr@ppp41.arobas.net [205.205.36.111])
    by mail.arobas.net (8.9.3/8.9.3) with SMTP id MAA79059 for ;
    Fri, 9 Jun 2000 12:35:48 -0400 (EDT)
Received: (qmail 2047 invoked by uid 1000); 9 Jun 2000 16:35:18 -0000
Date: Fri, 9 Jun 2000 12:35:18 -0400
From: Jerome Etienne
To: Julian Anastasov
Cc: Jerome Etienne, Andrey Savochkin, netdev@oss.sgi.com
Subject: Re: IFA_F_NO_NDISC (for vrrp)
Message-ID: <20000609123518.A2012@long-haul.net>
Reply-To: jetienne@arobas.net
References: <20000609095319.A921@long-haul.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Mutt/1.0i
In-Reply-To: ; from uli@linux.tu-varna.acad.bg on Fri, Jun 09, 2000 at 07:29:48PM +0300
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

On Fri, Jun 09, 2000 at 07:29:48PM +0300, Julian Anastasov wrote:
>     Good luck! IFA_F_NO_NDISC is not a complete solution
> for LVS. The "hidden" flag has other semantic and the goal
> is to mark part of the local addresses as shared (hidden).
> This is different from IFA_F_NO_NDISC. We need to control
> which addresses can be announced in the ARP probes too.

how do you need to control them ?

From owner-netdev@oss.sgi.com Fri Jun 9 10:28:42 2000
Received: by oss.sgi.com id ; Fri, 9 Jun 2000 10:28:23 -0700
Received: from minus.inr.ac.ru ([193.233.7.97]:5391 "HELO ms2.inr.ac.ru")
    by oss.sgi.com with SMTP id ; Fri, 9 Jun 2000 10:28:16 -0700
Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK)
    id VAA23823; Fri, 9 Jun 2000 21:27:55 +0400
From: kuznet@ms2.inr.ac.ru
Message-Id: <200006091727.VAA23823@ms2.inr.ac.ru>
Subject: Re: timers in net/ipv6
To: ak@muc.de (Andi Kleen)
Date: Fri, 9 Jun 2000 21:27:55 +0400 (MSK DST)
Cc: ak@muc.de, andrewm@uow.edu.au, netdev@oss.sgi.com
In-Reply-To: <20000608213812.A22534@fred.muc.de> from "Andi Kleen" at Jun 8, 0 09:38:12 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Length: 237
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hello!

> Ok, just hope then that there is never an exception during softirq
> execution (like spurious interrupt or NMI handled by an debugger) --
> ret_from_exception calls the softirq.

It will be blocked by local_bh_count().

Alexey

From owner-netdev@oss.sgi.com Fri Jun 9 10:38:12 2000
Received: by oss.sgi.com id ; Fri, 9 Jun 2000 10:37:53 -0700
Received: from minus.inr.ac.ru ([193.233.7.97]:9999 "HELO ms2.inr.ac.ru")
    by oss.sgi.com with SMTP id ; Fri, 9 Jun 2000 10:37:51 -0700
Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK)
    id VAA23767; Fri, 9 Jun 2000 21:25:10 +0400
From: kuznet@ms2.inr.ac.ru
Message-Id: <200006091725.VAA23767@ms2.inr.ac.ru>
Subject: Re: Slow TCP connection between linux and wince
To: amlaukka@cc.helsinki.fi (Aki M Laukkanen)
Date: Fri, 9 Jun 2000 21:25:10 +0400 (MSK DST)
Cc: netdev@oss.sgi.com, iwtcp@cs.helsinki.fi
In-Reply-To: from "Aki M Laukkanen" at Jun 9, 0 06:22:03 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Length: 3336
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hello!

> Ok, tested with 0608. Hmm.
> what can I say, the window never grows past 8kB
> with mss of 256. Both the ofo-queue pruning and the burstiness is masked
> by this behaviour (obviously). The latter of course only if the receiver
> is a linux with 0608 too.
>
> Tested with mss of 536 and 1024 too - results in max window of ~16kB and
> ~24kB respectively. I can't say I'm satisfied though. This penalises
> connections with smaller MTUs. Think MTU of 576 which I think is pretty
> common on the Internet at whole. With larger RTTs you can not use the
> whole bandwidth which is available because the window is just too
> small. Those tests were done with PPP which only allocates MRU bytes per
> skb but your average ethernet driver has to allocate 1500+ bytes per skb
> regardless of what the actual packet size is.

TCP calculates the _maximal_ window possible with the current mss and device.
If the window converged to 8K, it cannot be larger for this connection.
If it were >8K, it would prune. It is a law of nature, rather than
something determined by our choice.

If you want a larger window (you do not want this; in your case
8K is enough), you have to increase rcvbuf. But:

> You can enlarge the socket buffer size to get a bigger window but how many
> people will?

It does not matter. People should not increase rcvbuf per socket.
Essentially, this number is determined by the amount of available
RAM and by the number of active connections, rather than by network conditions.
Even the current value of 64K is too large for a common server configuration.

See? The user is not even permitted to increase rcvbuf significantly;
it is limited administratively.

Certainly, one day we will have to do smarter "fair" memory management,
which will allow memory consumption to be correlated with network conditions.
For now it is impossible; existing algorithms (e.g. the work done at PSC)
are too coarse to be useful for a production OS, which Linux is.

> used - As in past tense?

In past and in present continuous.
8)

> Remember that truesize is not the whole story. The cloned skbs show up in
> wmem_alloc too which is why we got bitten by the burstiness. I see the
> heuristics are on the conservative side though.

Cloned skbs are not counted, because their number is limited by tx_queue_len.
For slow links it must be a small number, around 4. I have no idea why
it is larger in your case. If you have a tx_queue_len of 3, the overhead is <= 3.

> Valid examples are wireless and satellite links. Congestion window can
> grow freely because the delay was constant in this test.

A "thin" link is a link with small power and a small window.
A "thick" link is a link with large power and a large window.
"Long thin link" is semantic nonsense. A link is either thick, or it is not long. 8)8)

I have never heard of links with large power and an mtu of 256.
If wireless ones are of this kind, they are useless for IP applications.
Large power links must have a large mtu (>= 1500, at least), no question,
and you will have a 32K default window then.

RATIONALE: a small mtu is selected only for small-bandwidth links with
underdeveloped link layer protocols, to decrease latency dominated
by packet sizes. If that is the case, power is 1 and the window is small.
As soon as latency is not dominated by packet size, a small mss
is not required. It is plain logic.
8)

Alexey

From owner-netdev@oss.sgi.com Fri Jun 9 11:34:33 2000
Received: by oss.sgi.com id ; Fri, 9 Jun 2000 11:34:13 -0700
Received: from dpx20.tu-varna.acad.bg ([194.141.24.4]:19724 "EHLO dpx20.tu-varna.acad.bg")
    by oss.sgi.com with ESMTP id ; Fri, 9 Jun 2000 11:34:06 -0700
Received: from linux.tu-varna.acad.bg (root@linux.tu-varna.acad.bg [194.141.24.6])
    by dpx20.tu-varna.acad.bg (8.9.3/8.9.3) with ESMTP id VAA62586;
    Fri, 9 Jun 2000 21:30:13 +0300
Received: from linux.tu-varna.acad.bg (uli@linux.tu-varna.acad.bg [194.141.24.6])
    by linux.tu-varna.acad.bg (8.8.5/myconf) with ESMTP id VAA25875;
    Fri, 9 Jun 2000 21:33:58 +0300
Date: Fri, 9 Jun 2000 21:33:58 +0300 (EEST)
From: Julian Anastasov
To: Jerome Etienne
cc: Andrey Savochkin, netdev@oss.sgi.com
Subject: Re: IFA_F_NO_NDISC (for vrrp)
In-Reply-To: <20000609123518.A2012@long-haul.net>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

    Hello,

On Fri, 9 Jun 2000, Jerome Etienne wrote:

> On Fri, Jun 09, 2000 at 07:29:48PM +0300, Julian Anastasov wrote:
> >     Good luck! IFA_F_NO_NDISC is not a complete solution
> > for LVS. The "hidden" flag has other semantic and the goal
> > is to mark part of the local addresses as shared (hidden).
> > This is different from IFA_F_NO_NDISC. We need to control
> > which addresses can be announced in the ARP probes too.
>
> how do you need to control them ?
>

    One of the ways is by using the hidden flag. You can
see how it is handled in arp_solicit in 2.2:

    if (skb &&
        (dev2 = ip_dev_find(skb->nh.iph->saddr)) != NULL &&
        (in_dev2 = dev2->ip_ptr) != NULL &&
        !IN_DEV_HIDDEN(in_dev2))
            saddr = skb->nh.iph->saddr;
    else
            saddr = inet_select_addr(dev, target, RT_SCOPE_LINK);

But it is not looking good in 2.3.
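
As a rough illustration of the 2.2 check quoted above, the following toy model walks through the same decision: use the packet's own source address for the ARP request unless the device that owns it is marked hidden, otherwise fall back to an address on the outgoing device. The `Device` class, `find_device()` and the "first address on the device" fallback are simplifications invented for this sketch; the real `inet_select_addr()` also considers the target and the scope.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Device:
    """Toy stand-in for a net_device together with its in_device flags."""
    name: str
    addresses: List[str] = field(default_factory=list)
    hidden: bool = False  # models IN_DEV_HIDDEN(in_dev)


def find_device(devices: List[Device], addr: str) -> Optional[Device]:
    """Toy stand-in for ip_dev_find(): the device that owns addr, if any."""
    for dev in devices:
        if addr in dev.addresses:
            return dev
    return None


def select_arp_source(devices: List[Device], out_dev: Device,
                      skb_saddr: Optional[str]) -> Optional[str]:
    """Choose the source IP announced in an ARP request sent on out_dev.

    skb_saddr is the source address of the packet that triggered the
    request, or None when there is no such packet.
    """
    owner = find_device(devices, skb_saddr) if skb_saddr is not None else None
    if owner is not None and not owner.hidden:
        return skb_saddr
    # Simplified fallback: the real kernel calls
    # inet_select_addr(dev, target, RT_SCOPE_LINK) here.
    return out_dev.addresses[0] if out_dev.addresses else None


if __name__ == "__main__":
    lo = Device("lo", ["10.0.0.100"], hidden=True)   # shared VIP, hidden
    eth0 = Device("eth0", ["192.168.0.2"])
    # A reply sourced from the VIP must not leak the VIP into ARP.
    print(select_arp_source([lo, eth0], eth0, "10.0.0.100"))
    print(select_arp_source([lo, eth0], eth0, "192.168.0.2"))
```

With the VIP on a hidden device the request is announced from the eth0 address, which is the behaviour LVS real servers need; clearing the flag lets the VIP appear as the ARP source again.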

Regards

--
Julian Anastasov

From owner-netdev@oss.sgi.com Fri Jun 9 16:06:46 2000
Received: by oss.sgi.com id ; Fri, 9 Jun 2000 16:06:36 -0700
Received: from mail.barak.net.il ([206.49.94.213]:35524 "EHLO mail.barak.net.il")
    by oss.sgi.com with ESMTP id ; Fri, 9 Jun 2000 16:06:12 -0700
Received: from jacoblap ([212.150.34.79]) by mail.barak.net.il (8.9.3/8.9.1)
    with SMTP id CAA01782 for ; Sat, 10 Jun 2000 02:06:03 +0300 (IDT)
Message-ID: <00be01bfd270$20ee4700$180aa8c0@pentacom.com>
From: "Jacob Avraham"
To:
Subject: 2.4 kernel networking and SMP
Date: Sat, 10 Jun 2000 02:09:20 +0200
MIME-Version: 1.0
Content-Type: text/plain; charset="windows-1255"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.00.2314.1300
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Greetings,

I'm designing a new router product and would like to use Linux as the
s/w platform. The system is going to have several CPUs (probably 5),
and the majority of the time the system is going to be in the kernel
networking layers, doing packet forwarding, filtering and so on.

My impression of Linux is that up to 2.2, the SMP kernel has coarse-grain
preemption points, so basically most of the time one CPU will do the hard
work in the networking layers, while the others are locked out.
I read somewhere that 2.4 has improved in this area.

Can you tell me what level of parallelism do I get from the networking
layers (forwarding, filtering, NAT), or point me to some reading material
about it? (beside the SMP HOWTO and source code of course..)?

Thanks,
Jacob Avraham

From owner-netdev@oss.sgi.com Sat Jun 10 06:01:20 2000
Received: by oss.sgi.com id ; Sat, 10 Jun 2000 06:01:10 -0700
Received: from mail.cyberus.ca ([209.195.95.1]:58620 "EHLO cyberus.ca")
    by oss.sgi.com with ESMTP id ; Sat, 10 Jun 2000 06:00:50 -0700
Received: from shell.cyberus.ca (shell [209.195.95.7])
    by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id IAA19933;
    Sat, 10 Jun 2000 08:58:33 -0400 (EDT)
Received: from localhost (hadi@localhost)
    by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id IAA00205;
    Sat, 10 Jun 2000 08:58:26 -0400 (EDT)
Date: Sat, 10 Jun 2000 08:58:26 -0400 (EDT)
From: jamal
To: Ben Greear
cc: Mitchell Blank Jr, Andrey Savochkin, rob@valinux.com, buytenh@gnu.org,
    netdev@oss.sgi.com, gleb@tochna.technion.ac.il
Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???
In-Reply-To: <393FB9FE.1CCF12E5@candelatech.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

On Thu, 8 Jun 2000, Ben Greear wrote:

>
> At this point, I don't think anyone is going to be changing their opinions
> on the way things should be done. Please feel free to take either of the
> VLAN patches already written and write your own using your ideas. When
> you have working code to show, then we may be in a better position to
> evaluate your ideas.

Indeed i'll take you up on this, RSN. That's one of the beauties of open
source.

On Thu, 8 Jun 2000, Gleb Natapov wrote:

jamal> Did i miss anything?
jamal>
Gleb> Yes. If you are going to implement VLANs as you've just described
Gleb> you will have to change each L3 protocol implementation
Gleb> to support VLANs.

Gleb, I am afraid i didn't understand you.
You mean broken programs like DHCP?
cheers, jamal From owner-netdev@oss.sgi.com Sat Jun 10 08:03:11 2000 Received: by oss.sgi.com id ; Sat, 10 Jun 2000 08:03:01 -0700 Received: from sirppi.helsinki.fi ([128.214.205.27]:49682 "EHLO sirppi.helsinki.fi") by oss.sgi.com with ESMTP id ; Sat, 10 Jun 2000 08:02:39 -0700 Received: from localhost (amlaukka@localhost) by sirppi.helsinki.fi (8.10.1/8.10.1) with ESMTP id e5AF1ZA25012; Sat, 10 Jun 2000 18:01:36 +0300 (EET DST) X-Authentication-Warning: sirppi.helsinki.fi: amlaukka owned process doing -bs Date: Sat, 10 Jun 2000 18:01:35 +0300 (EET DST) From: Aki M Laukkanen To: kuznet@ms2.inr.ac.ru cc: netdev@oss.sgi.com, iwtcp@cs.helsinki.fi Subject: Re: Slow TCP connection between linux and wince In-Reply-To: <200006091725.VAA23767@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Fri, 9 Jun 2000 kuznet@ms2.inr.ac.ru wrote: > Hello! > > connections with smaller MTUs. Think MTU of 576 which I think is pretty > > common on the Internet at whole. With larger RTTs you can not use the > TCP calculates _maximal_ window possible with current mss and device. > If the window converged to 8K, it cannot be larger for this connection. > If it were >8K, it would prune. It is law of nature, rather than > something deterined by our choice. Yes, I understood although the truesize/len ratio might suggest that a bit larger window was possible. Maybe I forgot my own argument. See below. > If you want larger window (you does not want this, in your case > 8K is enough), you have to increase rcvbuf. But: For that particular test case 8kB is enough I agree. I was arguing a case with a smaller MTU because some link on the route or the other host might not support a larger MTU than say the default 576. Nevertheless there's plenty of bandwidth available which can not be utilized because of smaller window. 
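The window argument above comes down to simple arithmetic: a TCP sender can
have at most one window of data in flight per round trip, so throughput is
bounded by window / RTT no matter how much bandwidth the path has. A minimal
Python sketch of that bound follows. The function names, the 500 ms RTT and
the 1 Mbit/s link rate are illustrative assumptions, not figures from this
thread; only the 8 kB window comes from the discussion.

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on throughput (bits/s) with one window in flight per RTT."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bandwidth_bps * rtt_seconds / 8


if __name__ == "__main__":
    window = 8 * 1024   # the 8 kB window discussed in the thread
    rtt = 0.5           # assumed round-trip time in seconds (hypothetical)
    link = 1_000_000    # assumed link rate in bits/s (hypothetical)
    print("window-limited rate (bit/s):", max_throughput_bps(window, rtt))
    print("window needed to fill link (bytes):", window_needed_bytes(link, rtt))
```

Under these assumed numbers the 8 kB window caps the connection at roughly
131 kbit/s, while filling the assumed link would need about 62500 bytes in
flight.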
> > You can enlarge the socket buffer size to get a bigger window but how many
> > people will?
> It does not matter. People should not increase rcvbuf per socket.
> Essentially, this number is determined by amount of available
> RAM and by number of active connections, rather than by network conditions.
> Even current value of 64K is too large for common server configuration.
>
> Certainly, one day we will have to do more smart "fair" memory management,
> which will allow to correlate memory consumption to network conditions.
> For now it is impossible, existing algorithms (f.e. the work made in PSC)
> are too coarce to be useful for production OS, which Linux is.

Thanks for the PSC reference. Looks interesting. I agree for server
configurations but this is really not a problem for your average
workstation. It does not have a large number of simultaneous connections
and has plenty of spare memory.

Current FreeBSD is happy (from a glance at uipc_socket2.c, so don't kill
me if I'm wrong) to waste memory, which I'd like as an option for Linux
too. Their sockbuf structure has sb_cc and sb_mbcnt values. The latter
seems equivalent to {rmem|wmem}_alloc while the former only includes actual
data bytes. I could see why keeping account of actual data bytes would be
beneficial too. Although both are used in the sbspace() macro, it seems
sb_mbmax is pretty high. Incidentally they seem to have some of the
infrastructure (soreserve) to guarantee enough memory for mbufs in place
but unused.

> Cloned skbs are not counted, because their number is limited by tx_queue_len.
> For slow links it must be small number, sort of 4.

Hmm, this escaped me, so forget about cloned skbs.

> > Valid examples are wireless and satellite links. Congestion window can
> > grow freely because the delay was constant in this test.
> "thin" link is link with small power and small window.
> "thick" link is link with large power and large window.
This seems to be a matter of having different opinions of thin/{thick|fat}. Writers of said RFC take it to mean the bandwidth only - thin long together producing the bandwidth delay product. > Large power links must have large mtu (>= 1500, at least), no questions, > and you will have 32K default window then. I'll back out a bit and say that the MTU requirements of current GSM data are indeed imposed by the small bandwidth. However with current and forthcoming higher bandwidth wireless links there are entirely valid reasons for choosing a smaller MTU. The higher bit error rate for one, you don't need to retransmit as much. Link layer retransmit schemes might seem attractive at glance but are not in practice. From owner-netdev@oss.sgi.com Sat Jun 10 08:04:31 2000 Received: by oss.sgi.com id ; Sat, 10 Jun 2000 08:04:11 -0700 Received: from [132.68.48.10] ([132.68.48.10]:6152 "EHLO tochna.technion.ac.il") by oss.sgi.com with ESMTP id ; Sat, 10 Jun 2000 08:04:09 -0700 Received: from localhost (gleb@localhost) by tochna.technion.ac.il (8.8.7/8.8.5) with SMTP id RAA19229; Sat, 10 Jun 2000 17:54:38 +0300 (IDT) Date: Sat, 10 Jun 2000 17:54:38 +0300 (IDT) From: Gleb Natapov To: jamal cc: Ben Greear , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > On Thu, 8 Jun 2000, Gleb Natapov wrote: > > jamal> Did i miss anything? > jamal> > Gleb> Yes. If you are going to implement VLANs as you've just described > Gleb> you will have to change each L3 protocol implementation > Gleb> to support VLANs. > > Gleb, I am afraid i didnt understand you. You mean broken programs like > DHCP? > No, I mean IPV6, IPX, DECnet, appletalk etc. > cheers, > jamal -- Gleb. 
From owner-netdev@oss.sgi.com Sat Jun 10 10:22:21 2000 Received: by oss.sgi.com id ; Sat, 10 Jun 2000 10:22:01 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:40197 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Sat, 10 Jun 2000 10:21:33 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA32035; Sat, 10 Jun 2000 21:08:48 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006101708.VAA32035@ms2.inr.ac.ru> Subject: Re: Slow TCP connection between linux and wince To: amlaukka@cc.helsinki.fi (Aki M Laukkanen) Date: Sat, 10 Jun 2000 21:08:48 +0400 (MSK DST) Cc: netdev@oss.sgi.com, iwtcp@cs.helsinki.fi In-Reply-To: from "Aki M Laukkanen" at Jun 10, 0 06:01:35 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 3155 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > Yes, I understood although the truesize/len ratio might suggest that a bit > larger window was possible. What value did you expect? Could you send me some tcpdumps showing window evolution? It is interesting. > there's plenty of bandwidth available which can not be utilized because > of smaller window. I tried to prove that it is never the case in real life. When power is large, mtu cannot be small. These things are mutually exclusive, except for some pathological configurations. > Thanks for the PSC reference. Looks interesting. I agree for server > configurations but this is really not a problem for your average > workstation. It does not have a large number of simultanous connections > and has plenty of spare memory. Default configuration, builtin to kernel, is server configuration. You (or redhat) may relax memory bounds for single user workstation. > Current FreeBSD is happy to (by a glance in uipc_socket2.c so don't kill > me if I'm wrong) to waste memory which I'd like as an option for Linux Alas, I do not know anything about real current state of production bsd clones. 
I know that a year ago the panic "out of mbuf clusters" was one of the most
frequent reasons for crashes of freebsd servers. I am not sure that it has
changed. Remember also the simple old attack of feeding freebsd single-byte
packets at the right edge of the window. 8)

> bytes would be beneficiary too. Although both are used in sbspace() macro
> it seems sb_mbmax is pretty high. Incidentally they seem to have some of
> the infra-structure (soreserve) to guarantee enough memory for mbufs in
> place but unused.

Oh, no. This scheme is no different from the linux one in practice
(ignoring some issues, which were severe bugs both in bsd and in linux).
Only the units are different. We really have some problems not present in
bsd, but they were mainly due to forced linear memory allocation rather
than to resource management. A real scheme can be based only on fair
correlated resource allocation; such heuristics are absent in bsd clones
as well.

> I'll back out a bit and say that the MTU requirements of current GSM data
> are indeed imposed by the small bandwidth. However with current and
> forthcoming higher bandwidth wireless links there are entirely valid
> reasons for choosing a smaller MTU. The higher bit error rate for one, you
> don't need to retransmit as much. Link layer retransmit schemes might seem
> attractive at glance but are not in practice.

With a high loss rate you will have a cwnd of 1. It is still TCP, which is
_not_ able to retransmit cleverly because its retransmission algorithms are
_strictly_ (MUSTed!) oriented to WAN congestion rather than LAN loss.

It is pretty evident that lossy links must do error correction and
fragmentation at the link layer. Or alternatively, TCP must be changed
drastically: introducing loss notifications, disabling congestion avoidance
etc. etc. etc. Nobody wants this. It seems the "thin long" people
understand this; at least their output wrt TCP looks very pessimistic and
contains mainly negative results.
BTW linux-2.3 implements almost all their hints, which I estimated as
positive ones. 8)

Alexey

From owner-netdev@oss.sgi.com Sat Jun 10 10:44:21 2000
Received: by oss.sgi.com id ; Sat, 10 Jun 2000 10:44:11 -0700
Received: from minus.inr.ac.ru ([193.233.7.97]:47365 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Sat, 10 Jun 2000 10:43:51 -0700
Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA32298; Sat, 10 Jun 2000 21:43:41 +0400
From: kuznet@ms2.inr.ac.ru
Message-Id: <200006101743.VAA32298@ms2.inr.ac.ru>
Subject: Re: 2.4 kernel networking and SMP
To: jacoba@penta-com.COM (Jacob Avraham)
Date: Sat, 10 Jun 2000 21:43:41 +0400 (MSK DST)
Cc: netdev@oss.sgi.com
In-Reply-To: <00be01bfd270$20ee4700$180aa8c0@pentacom.com> from "Jacob Avraham" at Jun 10, 0 03:13:07 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Length: 370
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hello!

> Can you tell me what level of parallelism do I get from the networking
> layers (forwarding, filtering, NAT),

Infinite one. 8)

Seriously, it depends on network configuration. F.e. if the router has only
two interfaces, more than two cpus will do no useful work, but will simply
break the network by reordering packets, if you did not bind irqs to
selected cpus.
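On kernels that expose /proc/irq/<N>/smp_affinity, binding an interrupt to
a CPU means writing a hexadecimal CPU bitmask to that file. Whether the
kernel under discussion in this thread offers that file is an assumption
here. The Python sketch below only computes the masks and paths; it does not
touch /proc. The helper names and the IRQ numbers 10 and 11 are hypothetical,
and it assumes a machine small enough that one hex word covers all CPUs.

```python
def cpu_mask(cpus):
    """Return the hex bitmask string selecting the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu          # bit N set means CPU N may take the IRQ
    return format(mask, "x")


def affinity_path(irq):
    """Path of the per-IRQ affinity file (assumed to exist on the target)."""
    return f"/proc/irq/{irq}/smp_affinity"


def plan_bindings(irq_to_cpu):
    """Map each affinity file to the mask that pins its IRQ to one CPU."""
    return {affinity_path(irq): cpu_mask([cpu])
            for irq, cpu in irq_to_cpu.items()}


if __name__ == "__main__":
    # Hypothetical two-interface router: one NIC interrupt per CPU.
    for path, mask in plan_bindings({10: 0, 11: 1}).items():
        print(f"echo {mask} > {path}")
```

Pinning each interface's interrupt to a single CPU keeps packets from one
interface on one processor, which is the condition the reordering remark
above depends on.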
Alexey From owner-netdev@oss.sgi.com Sat Jun 10 12:46:13 2000 Received: by oss.sgi.com id ; Sat, 10 Jun 2000 12:46:03 -0700 Received: from bells.cs.ucl.ac.uk ([128.16.5.31]:41995 "HELO bells.cs.ucl.ac.uk") by oss.sgi.com with SMTP id ; Sat, 10 Jun 2000 12:45:46 -0700 Received: from avalon.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with local SMTP id ; Sat, 10 Jun 2000 20:45:44 +0100 Message-ID: <39429B2A.BAB4EDCB@cs.ucl.ac.uk> Date: Sat, 10 Jun 2000 20:46:50 +0100 From: Manuel Oliveira Reply-To: m.oliveira@cs.ucl.ac.uk Organization: University College London X-Mailer: Mozilla 4.7 [en] (WinNT; I) X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com Subject: Multicast routing Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hi, My apologies for my ignorance, but what is the equivalent daemon in Linux to the mrouted of BSD? Thanks for any answers. Cheers Manuel From owner-netdev@oss.sgi.com Sat Jun 10 13:05:33 2000 Received: by oss.sgi.com id ; Sat, 10 Jun 2000 13:05:23 -0700 Received: from zada.math.leidenuniv.nl ([132.229.231.3]:11787 "EHLO zada.math.leidenuniv.nl") by oss.sgi.com with ESMTP id ; Sat, 10 Jun 2000 13:04:58 -0700 Received: from mara.math.leidenuniv.nl (IDENT:buytenh@mara.math.leidenuniv.nl [132.229.232.80]) by zada.math.leidenuniv.nl (8.9.3/8.9.3) with ESMTP id WAA13132; Sat, 10 Jun 2000 22:04:20 +0200 Date: Sat, 10 Jun 2000 22:04:20 +0200 (MEST) From: Lennert Buytenhek X-Sender: buytenh@mara.math.leidenuniv.nl To: Andi Kleen cc: Andrey Savochkin , Mitchell Blank Jr , Ben Greear , rob@valinux.com, netdev@oss.sgi.com, Gleb Natapov , jamal Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
In-Reply-To: <20000609103620.A3028@fred.muc.de> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Fri, 9 Jun 2000, Andi Kleen wrote: > > > Netfilter is not an IP only thing. It is a generic framework for > > > packet mangling. Although currently only IPv4 and IPv6 netfilter > > > implementations exist it would be no big problem to add ``raw > > > ethernet'' netfilter hooks. > > > > Raw ethernet netfilter hooks, as are IPX netfilter hooks by the way, are > > currently a nice blue cloud in the sky. > > They are not hard to add. Just so far nobody needed them. You could > either hook them into the device's hard_start_xmit function or as an > netfilter qdisc for the devices with queue (both requires no > additional instructions in the fast path when not used) I know, but my point was that our approach currently works. Saying things like 'Well, of course it could be done in this specific way' while required functionality does not even exist is nothing more than a handwave. > > As we're getting into the architectural purity business anyway, does it > > make a whole lot of sense to netfilter on two different protocol levels? > > I think it does. netfilter is not a single entinity, it is a hook > infrastructure. When multi layer filtering is required it can be > integrated. So one packet can end up being filtered multiple times, if I get this correctly. But where do we draw the line, then? For example, the IP portion of netfilter currently has the capability to filter on source MAC address as well as the capability to filter on things like TCP ports. That's a layering violation in both directions, and although it does not hurt, it's not particularly pretty. What if we want to let all traffic coming from IPX network 0xdeadbeef use VLAN 2? 
Should we either add 'IPX protocol extension module'-type of uglyness to the raw ethernet netfilter hooks or write a complete IPX netfilter layer just to be able to accomplish that? > I haven't thought hard enough about the VLAN problem to decide if it > really makes sense to implement it as netfilter module. The net_fifo > idea seems to have some merits too (struct net_device probably does > too many things at once currently and some hash tables/trees for > device management probably couldn't hurt) Agreed, struct net_device is a fscking mess. greetings, Lennert From owner-netdev@oss.sgi.com Sat Jun 10 20:17:03 2000 Received: by oss.sgi.com id ; Sat, 10 Jun 2000 20:16:54 -0700 Received: from smtp1.cern.ch ([137.138.128.38]:7437 "EHLO smtp1.cern.ch") by oss.sgi.com with ESMTP id ; Sat, 10 Jun 2000 20:16:40 -0700 Received: from lxplus007.cern.ch (IDENT:root@lxplus007.cern.ch [137.138.161.120]) by smtp1.cern.ch (8.9.3/8.9.3) with ESMTP id FAA03139; Sun, 11 Jun 2000 05:15:42 +0200 (MET DST) Received: (from jes@localhost) by lxplus007.cern.ch (8.9.3/8.9.3) id FAA11429; Sun, 11 Jun 2000 05:15:34 +0200 To: Gleb Natapov Cc: jamal , Ben Greear , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: From: Jes Sorensen Date: 11 Jun 2000 05:15:34 +0200 In-Reply-To: Gleb Natapov's message of "Sat, 10 Jun 2000 17:54:38 +0300 (IDT)" Message-ID: Lines: 18 User-Agent: Gnus/5.070096 (Pterodactyl Gnus v0.96) Emacs/20.4 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> "Gleb" == Gleb Natapov writes: >> On Thu, 8 Jun 2000, Gleb Natapov wrote: >> jamal> Did i miss anything? jamal> Gleb> Yes. If you are going to implement VLANs as you've just Gleb> described you will have to change each L3 protocol Gleb> implementation to support VLANs. 
>> Gleb, I am afraid i didnt understand you. You mean broken programs >> like DHCP? Gleb> No, I mean IPV6, IPX, DECnet, appletalk etc. You mean all the ones that are practically unused (and the used ones which should never have been used in the first place: IPX + AppleTalk). Jes From owner-netdev@oss.sgi.com Sat Jun 10 23:37:46 2000 Received: by oss.sgi.com id ; Sat, 10 Jun 2000 23:37:37 -0700 Received: from apollo.nbase.co.il ([194.90.137.2]:61196 "EHLO apollo.nbase.co.il") by oss.sgi.com with ESMTP id ; Sat, 10 Jun 2000 23:37:10 -0700 Received: from nbase.co.il ([194.90.136.56]) by apollo.nbase.co.il (Post.Office MTA v3.1.2 release (PO205-101c) ID# 0-44418U200L2S100) with ESMTP id AAA431; Sun, 11 Jun 2000 09:35:55 +0200 Message-ID: <39433342.4A9E80CC@nbase.co.il> Date: Sun, 11 Jun 2000 06:35:46 +0000 From: Gleb Natapov Organization: NBase-Xyplex X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14vlan i686) X-Accept-Language: en MIME-Version: 1.0 To: Jes Sorensen CC: Gleb Natapov , jamal , Ben Greear , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Jes Sorensen wrote: > > >>>>> "Gleb" == Gleb Natapov writes: > > >> On Thu, 8 Jun 2000, Gleb Natapov wrote: > >> > jamal> Did i miss anything? > jamal> > Gleb> Yes. If you are going to implement VLANs as you've just > Gleb> described you will have to change each L3 protocol > Gleb> implementation to support VLANs. > >> Gleb, I am afraid i didnt understand you. You mean broken programs > >> like DHCP? > > Gleb> No, I mean IPV6, IPX, DECnet, appletalk etc. > > You mean all the ones that are practically unused (and the used ones > which should never have been used in the first place: IPX + AppleTalk). > > Jes IPV6 and IPX unused?!! 
Perhaps we should remove all these protocols from source tree :) I don't know how close you follow this thread but here is what Matthew Geier wrote: "It is possible at some stage I would like to run a server with a foot in at least 4 VLANs so that people accessing that server would not have a router hop. (And since AppleTalk (and IPX for other departments) is at least as important as TCP/IP no layer 3 switch vendor is game, and I can't put that many AppleTalk stations in the one VLAN with out an AppleTalk broadcast meltdown....) Any VLAN implementation that doesn't allow me to fire up Samba and NetAtalk have have the 2 programs just discover the interfaces and do the right SMB broadcasting, and AppleTalk stuff on each, isn't actually any use. I certainly wouldn't be trying to replicate the routing functions of the CISCO RSM cards in the 2 central switching centres on my Campus. I just want applications to run..." -- Gleb. From owner-netdev@oss.sgi.com Sun Jun 11 05:43:08 2000 Received: by oss.sgi.com id ; Sun, 11 Jun 2000 05:42:38 -0700 Received: from apollo.nbase.co.il ([194.90.137.2]:11278 "EHLO apollo.nbase.co.il") by oss.sgi.com with ESMTP id ; Sun, 11 Jun 2000 05:42:01 -0700 Received: from nbase.co.il ([194.90.136.56]) by apollo.nbase.co.il (Post.Office MTA v3.1.2 release (PO205-101c) ID# 0-44418U200L2S100) with ESMTP id AAA269; Sun, 11 Jun 2000 15:40:27 +0200 Message-ID: <394388B0.54C15B24@nbase.co.il> Date: Sun, 11 Jun 2000 12:40:16 +0000 From: Gleb Natapov Organization: NBase-Xyplex X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14vlan i686) X-Accept-Language: en MIME-Version: 1.0 To: kuznet@ms2.inr.ac.ru CC: netdev@oss.sgi.com Subject: Re: 2.4 kernel networking and SMP References: <200006101743.VAA32298@ms2.inr.ac.ru> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing kuznet@ms2.inr.ac.ru wrote: > > Hello! 
> > > Can you tell me what level of parallelism do I get from the networking > > layers (forwarding, filtering, NAT), > > Infinite one. 8) > > Seriously, it depends on network configuration. F.e. if router has only > two interfaces, more than two cpus will do no useful work, Two interfaces or physical devices? If I have many network interfaces that use the same physical device to send packets do additional cpu will make sense? > but simly will break network reordering packets, if you > did not bind irqs to selected cpus. > > Alexey -- Gleb. From owner-netdev@oss.sgi.com Mon Jun 12 09:31:58 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:31:47 -0700 Received: from smtp1.cern.ch ([137.138.128.38]:59400 "EHLO smtp1.cern.ch") by oss.sgi.com with ESMTP id ; Sun, 11 Jun 2000 08:34:35 -0700 Received: from lxplus007.cern.ch (IDENT:root@lxplus007.cern.ch [137.138.161.120]) by smtp1.cern.ch (8.9.3/8.9.3) with ESMTP id RAA02099; Sun, 11 Jun 2000 17:31:45 +0200 (MET DST) Received: (from jes@localhost) by lxplus007.cern.ch (8.9.3/8.9.3) id RAA20692; Sun, 11 Jun 2000 17:31:40 +0200 To: Gleb Natapov Cc: Gleb Natapov , jamal , Ben Greear , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <39433342.4A9E80CC@nbase.co.il> From: Jes Sorensen Date: 11 Jun 2000 17:31:40 +0200 In-Reply-To: Gleb Natapov's message of "Sun, 11 Jun 2000 06:35:46 +0000" Message-ID: Lines: 14 User-Agent: Gnus/5.070096 (Pterodactyl Gnus v0.96) Emacs/20.4 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> "Gleb" == Gleb Natapov writes: Gleb> Jes Sorensen wrote: >> You mean all the ones that are practically unused (and the used >> ones which should never have been used in the first place: IPX + >> AppleTalk). Gleb> IPV6 and IPX unused?!! 
Perhaps we should remove all these Gleb> protocols from source tree :) IPv6 is practically unused, IPX and AppleTalk are in use a lot but thats very *unfortunate* since they are completely broken. Jes From owner-netdev@oss.sgi.com Mon Jun 12 09:31:59 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:31:47 -0700 Received: from smtp1.cern.ch ([137.138.128.38]:51982 "EHLO smtp1.cern.ch") by oss.sgi.com with ESMTP id ; Sun, 11 Jun 2000 09:10:00 -0700 Received: from lxplus007.cern.ch (IDENT:root@lxplus007.cern.ch [137.138.161.120]) by smtp1.cern.ch (8.9.3/8.9.3) with ESMTP id SAA10219; Sun, 11 Jun 2000 18:07:10 +0200 (MET DST) Received: (from jes@localhost) by lxplus007.cern.ch (8.9.3/8.9.3) id SAA22259; Sun, 11 Jun 2000 18:07:09 +0200 To: Ben Greear Cc: Gleb Natapov , Gleb Natapov , jamal , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <39433342.4A9E80CC@nbase.co.il> <3943C086.930DF571@candelatech.com> From: Jes Sorensen Date: 11 Jun 2000 18:07:09 +0200 In-Reply-To: Ben Greear's message of "Sun, 11 Jun 2000 09:38:30 -0700" Message-ID: Lines: 26 User-Agent: Gnus/5.070096 (Pterodactyl Gnus v0.96) Emacs/20.4 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> "Ben" == Ben Greear writes: Ben> So just because you don't see a use for it means everyone else Ben> should be denied the use of it??? Gleb's argument is valid Ben> whether or not IPX exists, because other, so-far-unthought-of, Ben> protocols may be created, and they would have the same problem Ben> that IPX would now. Try to take a look at how IPX behaves on the wire before commenting - the people who designed it need serious larting. Ben> TCP/IP is not the end of the line in network protocols! At the Ben> very least IPv6 will be extremely important in the near future. 
Some people still believe that, I used to think that but I don't see the push for it anymore. Ben> Broad support for almost every protocol known is one of the very Ben> best features of Linux. Doing anything to make that less true Ben> would make Linux less useful to the rest of us. Broad support for as much as possible is good, but limiting support for the mainstream in order to improve support for something broken is wrong. Jes From owner-netdev@oss.sgi.com Mon Jun 12 09:31:59 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:31:47 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:43269 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 11 Jun 2000 09:02:34 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id JAA20493; Sun, 11 Jun 2000 09:38:31 -0700 Message-ID: <3943C086.930DF571@candelatech.com> Date: Sun, 11 Jun 2000 09:38:30 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: Jes Sorensen CC: Gleb Natapov , Gleb Natapov , jamal , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <39433342.4A9E80CC@nbase.co.il> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Jes Sorensen wrote: > > >>>>> "Gleb" == Gleb Natapov writes: > > Gleb> Jes Sorensen wrote: > >> You mean all the ones that are practically unused (and the used > >> ones which should never have been used in the first place: IPX + > >> AppleTalk). > > Gleb> IPV6 and IPX unused?!! Perhaps we should remove all these > Gleb> protocols from source tree :) > > IPv6 is practically unused, IPX and AppleTalk are in use a lot > but thats very *unfortunate* since they are completely broken. 
> > Jes So just because you don't see a use for it means everyone else should be denied the use of it??? Gleb's argument is valid whether or not IPX exists, because other, so-far-unthought-of, protocols may be created, and they would have the same problem that IPX would now. TCP/IP is not the end of the line in network protocols! At the very least IPv6 will be extremely important in the near future. Broad support for almost every protocol known is one of the very best features of Linux. Doing anything to make that less true would make Linux less useful to the rest of us. Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jun 12 09:32:06 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:31:47 -0700 Received: from apollo.nbase.co.il ([194.90.137.2]:34320 "EHLO apollo.nbase.co.il") by oss.sgi.com with ESMTP id ; Sun, 11 Jun 2000 09:16:49 -0700 Received: from nbase.co.il ([194.90.136.56]) by apollo.nbase.co.il (Post.Office MTA v3.1.2 release (PO205-101c) ID# 0-44418U200L2S100) with ESMTP id AAA444; Sun, 11 Jun 2000 19:15:14 +0200 Message-ID: <3943BB06.E57D6574@nbase.co.il> Date: Sun, 11 Jun 2000 16:15:02 +0000 From: Gleb Natapov Organization: NBase-Xyplex X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14vlan i686) X-Accept-Language: en MIME-Version: 1.0 To: Jes Sorensen CC: Ben Greear , Gleb Natapov , jamal , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
References: <39433342.4A9E80CC@nbase.co.il> <3943C086.930DF571@candelatech.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Jes Sorensen wrote: > > >>>>> "Ben" == Ben Greear writes: > > Ben> So just because you don't see a use for it means everyone else > Ben> should be denied the use of it??? Gleb's argument is valid > Ben> whether or not IPX exists, because other, so-far-unthought-of, > Ben> protocols may be created, and they would have the same problem > Ben> that IPX would now. > > Try to take a look at how IPX behaves on the wire before commenting - > the people who designed it need serious larting. > May be somebody wants to put IPX on separate VLAN because IPX behaves this way and you don't want to give him such possibility ? :) Seriously, I don't see how your last argument is relevant to the discussion. -- Gleb. From owner-netdev@oss.sgi.com Mon Jun 12 09:32:06 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:31:48 -0700 Received: from wirespeed.solidum.com ([216.13.130.242]:16817 "EHLO solidum.com") by oss.sgi.com with ESMTP id ; Sun, 11 Jun 2000 09:17:22 -0700 Received: from phobos.solidum.com (mcr@phobos.solidum.com [192.168.1.13]) by solidum.com (8.8.7/8.8.7) with ESMTP id MAA24319 for ; Sun, 11 Jun 2000 12:16:50 -0400 Message-Id: <200006111616.MAA24319@solidum.com> To: netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: Your message of "11 Jun 2000 05:15:34 +0200." 
Mime-Version: 1.0 (generated by tm-edit 7.108) Content-Type: text/plain; charset=US-ASCII Date: Sun, 11 Jun 2000 12:16:50 -0400 From: Michael Richardson Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> "Jes" == Jes Sorensen writes: Jes> You mean all the ones that are practically unused (and the used ones Jes> which should never have been used in the first place: IPX + Jes> AppleTalk). No, he means all the protocols which Sun, SGI and HP ignored (or overcharged), forcing houses with Macintoshes, VMS, etc. to buy NT servers, eventually buying NT desktops as well. He is talking about the protocols which Open Source operating systems have long supported, and which are a major route by which open source operating systems penetrate into places with otherwise closed buying strategies. Novell file sharing protocols are light years ahead of SMB. Of course, they all run over IP now. :!mcr!: | Solidum Systems Corporation, http://www.solidum.com Michael Richardson |For a better connected world,where data flows faster Personal: http://www.sandelman.ottawa.on.ca/People/Michael_Richardson/Bio.html mailto:mcr@sandelman.ottawa.on.ca mailto:mcr@solidum.com From owner-netdev@oss.sgi.com Mon Jun 12 09:32:06 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:31:48 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:17926 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Sun, 11 Jun 2000 09:42:56 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id KAA21554; Sun, 11 Jun 2000 10:13:53 -0700 Message-ID: <3943C8D1.DDF14EA5@candelatech.com> Date: Sun, 11 Jun 2000 10:13:53 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: Jes Sorensen CC: Gleb Natapov , Gleb Natapov , jamal , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, 
    netdev@oss.sgi.com
Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???
References: <39433342.4A9E80CC@nbase.co.il> <3943C086.930DF571@candelatech.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Jes Sorensen wrote:
>
> >>>>> "Ben" == Ben Greear writes:
>
> Ben> So just because you don't see a use for it means everyone else
> Ben> should be denied the use of it??? Gleb's argument is valid
> Ben> whether or not IPX exists, because other, so-far-unthought-of,
> Ben> protocols may be created, and they would have the same problem
> Ben> that IPX would now.
>
> Try to take a look at how IPX behaves on the wire before commenting -
> the people who designed it need serious larting.

That is irrelevant to the people who need to use it to fit in with an
existing network architecture. I also hear that many Windows games
like to use IPX because it's faster on a local network, but I don't
know if that is true or not.

> Ben> Broad support for almost every protocol known is one of the very
> Ben> best features of Linux. Doing anything to make that less true
> Ben> would make Linux less useful to the rest of us.
>
> Broad support for as much as possible is good, but limiting support
> for the mainstream in order to improve support for something broken is
> wrong.

True, but no one is trying to do that. Find a way that a netfilter
implementation is inherently more efficient than device-per-VLAN and tell
us about it. Or implement it; I think Jamal is going to start working on
that...

Remember that the linear search of the device list will be replaced by a
hash shortly, and the truth is, if you need to search the entire device
list, say for TCP/IP routing information, then you'll have to search all
of the VLAN device-let structures too because they hold routing information.
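The difference between scanning the device list and hashing it can be
sketched as follows. This is a toy Python model, not kernel code: the Device
class, the device names and both lookup functions are hypothetical, and a
Python dict stands in for whatever hash table the kernel would use.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    ifindex: int


def find_linear(devices, name):
    """Walk the whole list until the name matches: O(n) per lookup."""
    for dev in devices:
        if dev.name == name:
            return dev
    return None


def build_index(devices):
    """Build a name -> device hash once, when devices are registered."""
    return {dev.name: dev for dev in devices}


def find_hashed(index, name):
    """Single hash probe: cost does not grow with the number of devices."""
    return index.get(name)


if __name__ == "__main__":
    # Hypothetical box with 1000 VLAN devices, one device per VLAN.
    devices = [Device(f"vlan{i}", i + 1) for i in range(1000)]
    index = build_index(devices)
    print(find_linear(devices, "vlan999"))
    print(find_hashed(index, "vlan999"))
```

Hashing only helps lookups by key. Code that has to visit every entry, as
in the routing example above, still does work proportional to the number of
entries, whichever structure holds the per-VLAN state.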
> > Jes -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jun 12 09:32:06 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:31:48 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:64011 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Sun, 11 Jun 2000 10:02:02 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA15303; Sun, 11 Jun 2000 21:01:15 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006111701.VAA15303@ms2.inr.ac.ru> Subject: Re: 2.4 kernel networking and SMP To: gleb@nbase.co.il (Gleb Natapov) Date: Sun, 11 Jun 2000 21:01:15 +0400 (MSK DST) Cc: netdev@oss.sgi.com In-Reply-To: <394388B0.54C15B24@nbase.co.il> from "Gleb Natapov" at Jun 11, 0 12:40:16 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 463 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > Two interfaces or physical devices? If I have many network interfaces > that use the same physical device What is "physical device"? 8) Assuming, you talk about tunnels: > to send packets do additional cpu will make sense? If encapsulation is expensive, sort of IPsec transform, smp will help a lot certainly. If it is not and you have thousand of tunnels, all of them will have single contention point at underlying interface certainly. 
Alexey From owner-netdev@oss.sgi.com Mon Jun 12 09:32:07 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:31:48 -0700 Received: from smtp1.cern.ch ([137.138.128.38]:36106 "EHLO smtp1.cern.ch") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 09:19:24 -0700 Received: from lxplus007.cern.ch (IDENT:root@lxplus007.cern.ch [137.138.161.120]) by smtp1.cern.ch (8.9.3/8.9.3) with ESMTP id XAA26939; Sun, 11 Jun 2000 23:21:05 +0200 (MET DST) Received: (from jes@localhost) by lxplus007.cern.ch (8.9.3/8.9.3) id XAA03074; Sun, 11 Jun 2000 23:21:03 +0200 To: Ben Greear Cc: Gleb Natapov , Gleb Natapov , jamal , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <39433342.4A9E80CC@nbase.co.il> <3943C086.930DF571@candelatech.com> <3943C8D1.DDF14EA5@candelatech.com> From: Jes Sorensen Date: 11 Jun 2000 23:21:03 +0200 In-Reply-To: Ben Greear's message of "Sun, 11 Jun 2000 10:13:53 -0700" Message-ID: Lines: 37 User-Agent: Gnus/5.070096 (Pterodactyl Gnus v0.96) Emacs/20.4 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> "Ben" == Ben Greear writes: Ben> Jes Sorensen wrote: >> Try to take a look at how IPX behaves on the wire before >> commenting - the people who designed it need serious larting. Ben> That is irrelavent to the people who need to use it to fit in Ben> with an existing network architecture. I also hear that many Ben> Windows games like to use IPX because it's faster on a local Ben> network, but I don't know if that is true or not. I would pretty much claim that to be rubbish since it is quite easy to fill a 100baseT segment with TCP. >> Broad support for as much as possible is good, but limiting >> support for the mainstream in order to improve support for >> something broken is wrong. Ben> True, but no one is trying to do that. 
Find a way that a Ben> netfilter implementation is inherently more efficient than Ben> device-per-VLAN and tell us about it. Or implement it, I think Ben> Jamal is gonna start working on that... I'll be happy to wait for Jamal to do the work, I still consider VLAN's to be the stupidest invention in the network world since the OSI protocol stack. Ben> Remember that linear searching of the device list will be hashed Ben> shortly, and the truth is, if you need to search the entire Ben> device list, say for TCP/IP routing information, then you'll have Ben> to search all of the VLAN device-let structures too because they Ben> hold routing information. Having a large device list may end up causing other problems in other parts of the stack. There is no reason for the proper devices to suffer because you want to glue in vlans on top of it. Jes From owner-netdev@oss.sgi.com Mon Jun 12 09:32:07 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:31:49 -0700 Received: from mail0.u-aizu.ac.jp ([163.143.1.43]:31971 "EHLO mail0.u-aizu.ac.jp") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 09:28:17 -0700 Received: from ccultra2.u-aizu.ac.jp (ccultra2 [163.143.99.104]) by mail0.u-aizu.ac.jp (8.9.3+3.1W/3.7Winternet-gw) with ESMTP id LAA21885 for ; Mon, 12 Jun 2000 11:49:01 +0900 (JST) Received: from ccultra2.u-aizu.ac.jp ([163.143.99.156]) by ccultra2.u-aizu.ac.jp (8.9.3+Sun/8.8.8) with ESMTP id LAA07911 for ; Mon, 12 Jun 2000 11:45:11 +0900 (JST) Message-ID: <39444F72.D0EF4331@ccultra2.u-aizu.ac.jp> Date: Mon, 12 Jun 2000 11:48:18 +0900 From: Behcet Sarikaya X-Mailer: Mozilla 4.72 [en] (Win98; I) X-Accept-Language: en,pdf MIME-Version: 1.0 To: netdev@oss.sgi.com Subject: Re: Linux 2.4 IPv6 Problem References: Content-Type: text/plain; charset=iso-2022-jp Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, I installed Linux 2.4-test1 kernel and had a problem with ifconfig sit0 mtu 1280 command. 
It refuses the mtu to be set and sets it to the default value of 1480. Any help would be appreciated. --behcet From owner-netdev@oss.sgi.com Mon Jun 12 09:32:07 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:31:49 -0700 Received: from dpx20.tu-varna.acad.bg ([194.141.24.4]:37898 "EHLO dpx20.tu-varna.acad.bg") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 09:28:39 -0700 Received: from linux.tu-varna.acad.bg (root@linux.tu-varna.acad.bg [194.141.24.6]) by dpx20.tu-varna.acad.bg (8.9.3/8.9.3) with ESMTP id KAA64368; Mon, 12 Jun 2000 10:19:56 +0300 Received: from linux.tu-varna.acad.bg (uli@linux.tu-varna.acad.bg [194.141.24.6]) by linux.tu-varna.acad.bg (8.8.5/myconf) with ESMTP id KAA31068; Mon, 12 Jun 2000 10:23:48 +0300 Date: Mon, 12 Jun 2000 10:23:48 +0300 (EEST) From: Julian Anastasov To: Andrey Savochkin cc: Jerome Etienne , netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) In-Reply-To: <20000612135938.A586@saw.sw.com.sg> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello Andrey, On Mon, 12 Jun 2000, Andrey Savochkin wrote: > > > > I now see that inet_addr_type() is changed too, to > > use fib_lookup. I forgot about this. Yep, with this change > > we can't do this (to limit the local addresses in > > arp_solicit). I have to wait for the final version of your > > patch :) > > Using inet_addr_type() in arp_solicit(), especially the current one looking > into local table only, is wrong from conceptual point of view. > It makes decisions that can't be explained in human terms, only as some > manipulations in C language. Agree. > It's better not to call any checking function that to call wrong one. > I'll call fib_select_addr() unconditionally. Without calling fib_local_source() the situation is different. 
This leads arp_solicit to use a limited number of local addresses in some situations, and may allow it not to include hidden addresses in the ARP request.

To summarize the problem without the hidden flag: if we have a client on the LAN and our host with the hidden address, we need:

1. the route from CLIENT to the hidden address to go into the blackhole for ARP
2. the same route from CLIENT to the hidden address to return RTN_LOCAL for non-ARP
3. arp_solicit not to include hidden addresses in the ARP request

(1) and (2) can't coexist in the FIB world. Maybe if we play with fwmark? So, we need a way to distinguish them by using other methods. I'm not sure if we can add something like ip add route "shared" VIP table 100. RTN_SHARED would be handled as RTN_LOCAL, and the only difference would be for ARP. This leads to big changes in the routing (RTN_SHARED as a subset of RTN_LOCAL) and I don't want to make such big changes. In many places we will need:

	if (type == RTN_LOCAL || type == RTN_SHARED)

Maybe if we change the RTN_ values from an enum to flags, for example:

	if (type & RTN_LOCAL)

In some places we will check:

	if (type & RTN_SHARED)

We will have:

	#define RTN_SHARED (RTN_HIDDEN | RTN_LOCAL)

But this is a big change. I will continue to think about this problem. For now, I'm sure I can't do it with the current policy routing rules. Any ideas are welcome.
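The enum-to-flags idea can be sketched as below. This is only an illustration under stated assumptions: the `DEMO_RTN_*` names and bit values are hypothetical and do not match the kernel's real RTN_* constants. One detail worth noting: since the "shared" value includes the "local" bit, a plain `type & SHARED` test is also true for ordinary local routes, so the sketch compares against the full mask.

```c
#include <assert.h>

/* Hypothetical flag-style route types (NOT the kernel's RTN_* values). */
#define DEMO_RTN_LOCAL   0x01u
#define DEMO_RTN_HIDDEN  0x02u
#define DEMO_RTN_SHARED  (DEMO_RTN_HIDDEN | DEMO_RTN_LOCAL)

/* True for plain local routes and for shared ones (both carry the bit). */
static int demo_is_local(unsigned type)
{
    return (type & DEMO_RTN_LOCAL) != 0;
}

/* True only when every bit of the shared mask is set. A bare
 * (type & DEMO_RTN_SHARED) would also match plain local routes. */
static int demo_is_shared(unsigned type)
{
    return (type & DEMO_RTN_SHARED) == DEMO_RTN_SHARED;
}

/* ARP-side decision in this sketch: answer for local addresses,
 * but not for the shared (hidden) ones. */
static int demo_arp_may_use(unsigned type)
{
    assert((type & ~DEMO_RTN_SHARED) == 0);   /* only known bits */
    return demo_is_local(type) && !demo_is_shared(type);
}
```

Non-ARP paths would call only `demo_is_local()`, so a shared address keeps behaving like a local one there, while the ARP path uses `demo_arp_may_use()` to leave it out.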
Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Mon Jun 12 09:32:07 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:31:48 -0700 Received: from smtp1.cern.ch ([137.138.128.38]:4875 "EHLO smtp1.cern.ch") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 09:21:03 -0700 Received: from lxplus002.cern.ch (IDENT:root@lxplus002.cern.ch [137.138.161.125]) by smtp1.cern.ch (8.9.3/8.9.3) with ESMTP id QAA23133; Mon, 12 Jun 2000 16:31:07 +0200 (MET DST) Received: (from jes@localhost) by lxplus002.cern.ch (8.9.3/8.9.3) id QAA28370; Mon, 12 Jun 2000 16:31:07 +0200 To: Gleb Natapov Cc: jamal , Gleb Natapov , Ben Greear , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: <3944DD68.3AB7DDEB@nbase.co.il> From: Jes Sorensen Date: 12 Jun 2000 16:31:07 +0200 In-Reply-To: Gleb Natapov's message of "Mon, 12 Jun 2000 12:54:00 +0000" Message-ID: Lines: 14 User-Agent: Gnus/5.070096 (Pterodactyl Gnus v0.96) Emacs/20.4 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> "Gleb" == Gleb Natapov writes: Gleb> Answer me this question: are you going to change something in Gleb> net/ipv4/ directory in your VLAN implementation? Gleb> If answer is yes, you implementation is broken IMO. If answer Gleb> is no, I really want to see your code. There is *nothing* wrong with having to change code in say the ipv4 directory if that provides a better solution. Trying to avoid code changes is not a goal in itself - the target for Linux is not to avoid changing code but to provide the best solution. 
Jes From owner-netdev@oss.sgi.com Mon Jun 12 09:35:26 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:35:06 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:64207 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 09:34:47 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id JAA02857; Mon, 12 Jun 2000 09:39:21 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id JAA02111; Mon, 12 Jun 2000 09:39:17 -0400 (EDT) Date: Mon, 12 Jun 2000 09:39:17 -0400 (EDT) From: jamal To: Gleb Natapov cc: Gleb Natapov , Ben Greear , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, Jes Sorensen Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <3944DD68.3AB7DDEB@nbase.co.il> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, 12 Jun 2000, Gleb Natapov wrote: > jamal wrote: > > > > On Sat, 10 Jun 2000, Gleb Natapov wrote: > > > > > > > > jamal> > Gleb, I am afraid i didnt understand you. You mean broken > > jamal> > programs like DHCP? > > > > > > > > > Gleb > No, I mean IPV6, IPX, DECnet, appletalk etc. > > > > This is a very strong statement that you make above. You have > > to provide justification. The protocol type is still available on the > > header. how is any of the above protocols broken? > > I _never_ said they are broken! What I've said is that you will need to > add explicit support of VLANs in each one of these protocols (if I > understand correct you suggestion). > And this is what i was refering to as you saying "broken" ;-> small miscomunication .. If you can uniquely distinguish each protocol from looking at the header, how can the implementation be 'broken'? Thats was my question to you. 
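For reference, the point about the protocol type still being visible in the header can be sketched like this. An 802.1Q tag is four bytes inserted after the source MAC: a 0x8100 type field, then a 16-bit TCI (3 bits priority, 1 bit CFI/DEI, 12 bits VLAN ID), followed by the original EtherType. The `demo_` names and the struct layout below are made up for illustration and are not from any of the VLAN patches under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_ALEN    6
#define DEMO_TPID_8021Q  0x8100u

/* Hypothetical result of looking at the link-layer header. */
struct demo_l2_info {
    int      tagged;       /* 1 if an 802.1Q tag was present */
    uint8_t  priority;     /* 3-bit priority from the TCI */
    uint16_t vlan_id;      /* 12-bit VLAN ID from the TCI */
    uint16_t ethertype;    /* encapsulated protocol, e.g. 0x0800 or 0x8137 */
    size_t   payload_off;  /* offset of the layer 3 payload */
};

/* Read a big-endian 16-bit value. */
static uint16_t demo_rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Returns 0 on success, -1 if the frame is too short. */
static int demo_parse_l2(const uint8_t *frame, size_t len,
                         struct demo_l2_info *out)
{
    size_t off = 2 * DEMO_ETH_ALEN;   /* skip destination and source MAC */
    uint16_t type;

    assert(frame != NULL && out != NULL);
    if (len < off + 2)
        return -1;

    type = demo_rd16(frame + off);
    out->tagged = 0;
    out->priority = 0;
    out->vlan_id = 0;

    if (type == DEMO_TPID_8021Q) {
        uint16_t tci;
        if (len < off + 6)            /* TPID + TCI + inner EtherType */
            return -1;
        tci = demo_rd16(frame + off + 2);
        out->tagged = 1;
        out->priority = (uint8_t)(tci >> 13);
        out->vlan_id = (uint16_t)(tci & 0x0FFFu);
        off += 4;                     /* step over the tag */
        type = demo_rd16(frame + off);
    }

    out->ethertype = type;
    out->payload_off = off + 2;
    return 0;
}
```

Whether this demultiplexing belongs in a per-VLAN pseudo-device or in the protocol paths is exactly what the thread is arguing about; the sketch only shows that the encapsulated type is recoverable either way.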
> Answer me this question: are you going to change something in net/ipv4/ > directory in your VLAN implementation? The main requirement seems to be to be able to 'route' -- thats what about 90% of the arguements here seem to be saying. IP rules. Who cares about IPX? Just encapsulate it (IPX) in the ether-packet and let it be switched/routed somewhere. some policy maps it to some VLAN. Is this not sufficient? if you are asking whether i'll put files in net/ipv4/ or ipv6; yes indeed, i might but not necessarily so. I see little or zero changes to the current common case of routing IPV4/6 packets. Why would i wanna touch IPX code? > > If answer is yes, you implementation is broken IMO. > If answer is no, I really want to see your code. I still dont understand why you make that claim; you'll have to wait for the code then. Hint: We already have a very powerful modular routing architecture; use the route flags and dst_in/out properly and you have yourself specific routing code which does not touch the common path *at all* > > > On 11 Jun 2000, Jes Sorensen wrote: > > > > Jes> Broad support for as much as possible is good, but limiting support > > Jes> for the mainstream in order to improve support for something broken > > Jes> is wrong. > > > > What's limited in current implementation? We have equally good support > for all L3 protocols. Our code even doesn't depend on L3 protocol you > uses. This is the way it should be IMO. > > > Jes just hit it on the head above. Infact i am begining to believe that > > even if you could look up the device in one lookup, always, the > > architecture being used is _wrong_. > > > > Why? It seams to me that *you* don't provide any justification to your > claims. > Your only complain about existent implementation was theoretical > performance hit. What's wrong with the architecture if we have no > performance hit? 
> > "Interface" is a nice abstraction with well defined API and good support > in user space (ifconfig, route, ipchains, etc). You want to create new > abstraction for VLANs, new set of user tools new API just because you > can. To have unique API for every little thing looks like Windows to me. > And this is based on the simple fact by trying to make all those tools a swiss army knife is wrong. Trying to use them as _the_ excuse is wrong. I dislike extending the kernel to 'help' a few things at the expense of the common path. I would rather extend the user space tools or even re-write them. I dont claim to be a purist, but certainly i think that a VLAN should be abstracted on a device and is therefore not a device. Perhaps the debate we are having here can be used to re-think what the netdevice structure should look in 2.5; as it is right now, i will totaly and _very strongly_ disagree with you cheers, jamal From owner-netdev@oss.sgi.com Mon Jun 12 09:35:27 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:35:17 -0700 Received: from apollo.nbase.co.il ([194.90.137.2]:40710 "EHLO apollo.nbase.co.il") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 09:35:03 -0700 Received: from nbase.co.il ([194.90.136.56]) by apollo.nbase.co.il (Post.Office MTA v3.1.2 release (PO205-101c) ID# 0-44418U200L2S100) with ESMTP id AAA466; Mon, 12 Jun 2000 18:21:56 +0200 Message-ID: <3945001E.C632A7E0@nbase.co.il> Date: Mon, 12 Jun 2000 15:22:06 +0000 From: Gleb Natapov Organization: NBase-Xyplex X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14vlan i686) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Gleb Natapov , Ben Greear , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, Jes Sorensen Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > On Mon, 12 Jun 2000, Gleb Natapov wrote: > > > jamal wrote: > > > > > > On Sat, 10 Jun 2000, Gleb Natapov wrote: > > > > > > > > > > > jamal> > Gleb, I am afraid i didnt understand you. You mean broken > > > jamal> > programs like DHCP? > > > > > > > > > > > > Gleb > No, I mean IPV6, IPX, DECnet, appletalk etc. > > > > > > This is a very strong statement that you make above. You have > > > to provide justification. The protocol type is still available on the > > > header. how is any of the above protocols broken? > > > > I _never_ said they are broken! What I've said is that you will need to > > add explicit support of VLANs in each one of these protocols (if I > > understand correct you suggestion). > > > > And this is what i was refering to as you saying "broken" ;-> > small miscomunication .. > If you can uniquely distinguish each protocol from looking at the header, > how can the implementation be 'broken'? Thats was my question to you. By looking into the header you break the layering. Anyway what do you expect to find there? IP, IPX or some protocol that I just wrote? How are you going to support my protocol? Or you suppose that I should write my own code in order to support my new protocol over VLAN ? > > > Answer me this question: are you going to change something in net/ipv4/ > > directory in your VLAN implementation? > > The main requirement seems to be to be able to 'route' -- thats what about > 90% of the arguements here seem to be saying. IP rules. Who cares about > IPX? Just encapsulate it (IPX) in the ether-packet and let it be > switched/routed somewhere. some policy maps it to some VLAN. Is this not > sufficient? > > if you are asking whether i'll put files in net/ipv4/ or ipv6; yes > indeed, i might but not necessarily so. 
I see little or zero changes to > the current common case of routing IPV4/6 packets. Why would i wanna touch > IPX code? Because I want to run IPX in separate VLAN! How will I be able to do that with your implementation? > > > > > If answer is yes, you implementation is broken IMO. > > If answer is no, I really want to see your code. > > I still dont understand why you make that claim; you'll have to wait for > the code then. > Because you change IP stack implementation to suit your needs. This was you initial argument against existing solution. > Hint: > We already have a very powerful modular routing architecture; use the > route flags and dst_in/out properly and you have yourself specific routing > code which does not touch the common path *at all* Current implementation doesn't touch the common path at all (unless you insist that many devices is a performance hit, and I don't see any reason to have many vlans on the same network anyway). > > > > > > On 11 Jun 2000, Jes Sorensen wrote: > > > > > > Jes> Broad support for as much as possible is good, but limiting support > > > Jes> for the mainstream in order to improve support for something broken > > > Jes> is wrong. > > > > > > > What's limited in current implementation? We have equally good support > > for all L3 protocols. Our code even doesn't depend on L3 protocol you > > uses. This is the way it should be IMO. > > > > > Jes just hit it on the head above. Infact i am begining to believe that > > > even if you could look up the device in one lookup, always, the > > > architecture being used is _wrong_. > > > > > > > Why? It seams to me that *you* don't provide any justification to your > > claims. > > Your only complain about existent implementation was theoretical > > performance hit. What's wrong with the architecture if we have no > > performance hit? > > > > "Interface" is a nice abstraction with well defined API and good support > > in user space (ifconfig, route, ipchains, etc). 
You want to create new > > abstraction for VLANs, new set of user tools new API just because you > > can. To have unique API for every little thing looks like Windows to me. > > > > And this is based on the simple fact by trying to make all those tools a > swiss army knife is wrong. Trying to use them as _the_ excuse is wrong. I don't use them as the excuse, I point you on additional advantages. > I dislike extending the kernel to 'help' a few things at the expense of > the common path. I would rather extend the user space tools or even > re-write them. I dont claim to be a purist, but certainly i think that a > VLAN should be abstracted on a device and is therefore not a device. I agree with you here. VLAN should not be a device VLAN should be an interface. I don't see what makes you think that what 'ifconfig -a' shows are devices they are _interfaces_. Interface can be attached to device (eth0), can be not (dummy0, lo), one interface can be attached to many devices (bridging), many interfaces can be attached to single device (tunneling, vlans). > > Perhaps the debate we are having here can be used to re-think what the > netdevice structure should look in 2.5; as it is right now, i will totaly > and _very strongly_ disagree with you I see ;). > > cheers, > jamal -- Gleb. 
From owner-netdev@oss.sgi.com Mon Jun 12 09:35:46 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:35:27 -0700 Received: from apollo.nbase.co.il ([194.90.137.2]:40710 "EHLO apollo.nbase.co.il") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 09:35:00 -0700 Received: from nbase.co.il ([194.90.136.56]) by apollo.nbase.co.il (Post.Office MTA v3.1.2 release (PO205-101c) ID# 0-44418U200L2S100) with ESMTP id AAA452; Mon, 12 Jun 2000 15:53:51 +0200 Message-ID: <3944DD68.3AB7DDEB@nbase.co.il> Date: Mon, 12 Jun 2000 12:54:00 +0000 From: Gleb Natapov Organization: NBase-Xyplex X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14vlan i686) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Gleb Natapov , Ben Greear , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, Jes Sorensen Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > On Sat, 10 Jun 2000, Gleb Natapov wrote: > > > > > jamal> > Gleb, I am afraid i didnt understand you. You mean broken > jamal> > programs like DHCP? > > > > > > Gleb > No, I mean IPV6, IPX, DECnet, appletalk etc. > > This is a very strong statement that you make above. You have > to provide justification. The protocol type is still available on the > header. how is any of the above protocols broken? I _never_ said they are broken! What I've said is that you will need to add explicit support of VLANs in each one of these protocols (if I understand correct you suggestion). Answer me this question: are you going to change something in net/ipv4/ directory in your VLAN implementation? If answer is yes, you implementation is broken IMO. If answer is no, I really want to see your code. 
> > On 11 Jun 2000, Jes Sorensen wrote: > > Jes> Broad support for as much as possible is good, but limiting support > Jes> for the mainstream in order to improve support for something broken > Jes> is wrong. > What's limited in current implementation? We have equally good support for all L3 protocols. Our code even doesn't depend on L3 protocol you uses. This is the way it should be IMO. > Jes just hit it on the head above. Infact i am begining to believe that > even if you could look up the device in one lookup, always, the > architecture being used is _wrong_. > Why? It seams to me that *you* don't provide any justification to your claims. Your only complain about existent implementation was theoretical performance hit. What's wrong with the architecture if we have no performance hit? "Interface" is a nice abstraction with well defined API and good support in user space (ifconfig, route, ipchains, etc). You want to create new abstraction for VLANs, new set of user tools new API just because you can. To have unique API for every little thing looks like Windows to me. > cheers, > jamal -- Gleb. From owner-netdev@oss.sgi.com Mon Jun 12 09:35:46 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:35:37 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:64207 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 09:35:07 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id HAA17054; Mon, 12 Jun 2000 07:48:27 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id HAA01791; Mon, 12 Jun 2000 07:48:23 -0400 (EDT) Date: Mon, 12 Jun 2000 07:48:22 -0400 (EDT) From: jamal To: Gleb Natapov cc: Ben Greear , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, Jes Sorensen Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, 10 Jun 2000, Gleb Natapov wrote: > > jamal> > Gleb, I am afraid i didnt understand you. You mean broken jamal> > programs like DHCP? > > > Gleb > No, I mean IPV6, IPX, DECnet, appletalk etc. This is a very strong statement that you make above. You have to provide justification. The protocol type is still available on the header. how is any of the above protocols broken? On 11 Jun 2000, Jes Sorensen wrote: Jes> Broad support for as much as possible is good, but limiting support Jes> for the mainstream in order to improve support for something broken Jes> is wrong. Jes just hit it on the head above. Infact i am begining to believe that even if you could look up the device in one lookup, always, the architecture being used is _wrong_. cheers, jamal From owner-netdev@oss.sgi.com Mon Jun 12 09:39:26 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:39:16 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:19211 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 09:39:04 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id SAA23788; Sun, 11 Jun 2000 18:00:55 -0700 Message-ID: <39443646.7519C828@candelatech.com> Date: Sun, 11 Jun 2000 18:00:54 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: "netdev@oss.sgi.com" , linux-net , Donald Becker Subject: Tulip (21140) locked up. Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing I'm running with two ZNYX 4-port NICs back-to-back with cross-over cables. I was running at about 28Mbps, full duplex. 
I then started up some more traffic, attempting to push it to about 35Mbps. Looks like I locked the NICs on both sides, such that nothing is being sent. The interesting thing is that the link lights blink on/off about every second, and there is a blip on the 'activity' LED. This corresponds with the log messages.

Manually bouncing the links with ip link set ethX down, ip link set ethX up seems to 'fix' the problem.

Here's a trace from the logs of one machine...the other looks very similar:

Jun 13 20:17:11 card1 kernel: tulip.c:v0.91g-ppc 7/16/99 becker@cesdis.gsfc.nasa.gov
Jun 13 20:17:11 card1 kernel: eth0: Digital DS21143 Tulip rev 65 at 0xc000, 00:C0:95:E2:4C:0C, IRQ 12.
Jun 13 20:17:11 card1 kernel: eth0: EEPROM default media type Autosense.
Jun 13 20:17:11 card1 kernel: eth0: Index #0 - Media 10baseT (#0) described by a 21142 Serial PHY (2) block.
Jun 13 20:17:11 card1 kernel: eth0: Index #1 - Media 10baseT-FD (#4) described by a 21142 Serial PHY (2) block.
Jun 13 20:17:11 card1 kernel: eth0: Index #2 - Media 100baseTx (#3) described by a 21143 SYM PHY (4) block.
Jun 13 20:17:11 card1 kernel: eth0: Index #3 - Media 100baseTx-FD (#5) described by a 21143 SYM PHY (4) block.
Jun 13 20:17:11 card1 kernel: eth1: Digital DS21143 Tulip rev 65 at 0xc400, 00:C0:95:E2:4C:0D, IRQ 10.
Jun 13 20:17:11 card1 kernel: eth1: EEPROM default media type Autosense.
Jun 13 20:17:11 card1 kernel: eth1: Index #0 - Media 10baseT (#0) described by a 21142 Serial PHY (2) block.
Jun 13 20:17:11 card1 kernel: eth1: Index #1 - Media 10baseT-FD (#4) described by a 21142 Serial PHY (2) block.
Jun 13 20:17:11 card1 kernel: eth1: Index #2 - Media 100baseTx (#3) described by a 21143 SYM PHY (4) block.
Jun 13 20:17:11 card1 kernel: eth1: Index #3 - Media 100baseTx-FD (#5) described by a 21143 SYM PHY (4) block.
Jun 13 20:17:11 card1 kernel: eth2: Digital DS21143 Tulip rev 65 at 0xc800, 00:C0:95:E2:4C:0E, IRQ 11.
Jun 13 20:17:11 card1 kernel: eth2: EEPROM default media type Autosense.
Jun 13 20:17:11 card1 kernel: eth2: Index #0 - Media 10baseT (#0) described by a 21142 Serial PHY (2) block.
Jun 13 20:17:11 card1 kernel: eth2: Index #1 - Media 10baseT-FD (#4) described by a 21142 Serial PHY (2) block.
Jun 13 20:17:11 card1 kernel: eth2: Index #2 - Media 100baseTx (#3) described by a 21143 SYM PHY (4) block.
Jun 13 20:17:11 card1 kernel: eth2: Index #3 - Media 100baseTx-FD (#5) described by a 21143 SYM PHY (4) block.
Jun 13 20:17:11 card1 kernel: eth3: Digital DS21143 Tulip rev 65 at 0xcc00, 00:C0:95:E2:4C:0F, IRQ 9.
Jun 13 20:17:11 card1 kernel: eth3: EEPROM default media type Autosense.
Jun 13 20:17:11 card1 kernel: eth3: Index #0 - Media 10baseT (#0) described by a 21142 Serial PHY (2) block.
Jun 13 20:17:11 card1 kernel: eth3: Index #1 - Media 10baseT-FD (#4) described by a 21142 Serial PHY (2) block.
Jun 13 20:17:11 card1 kernel: eth3: Index #2 - Media 100baseTx (#3) described by a 21143 SYM PHY (4) block.
Jun 13 20:17:11 card1 kernel: eth3: Index #3 - Media 100baseTx-FD (#5) described by a 21143 SYM PHY (4) block.
Jun 13 20:17:11 card1 kernel: rtl8139.c:v1.07 5/6/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/rtl8139.html
Jun 13 20:17:11 card1 kernel: eth4: RealTek RTL8139 Fast Ethernet at 0xd000, IRQ 11, 00:48:54:66:68:68.
Jun 13 20:17:11 card1 kernel: Sending BOOTP requests.... OK
.......
Jun 13 22:10:04 card1 kernel: eth1: transmit timed out, switching to 10baseT-FD media.
Jun 13 22:10:10 card1 kernel: eth1: Tx hung, 1261852 vs. 1261837.
Jun 13 22:10:10 card1 kernel: eth1: 21140 transmit timed out, status f0360000, SIA 000010c6 ffff0001 fffbffff 8ff5c000, resetting...
Jun 13 22:10:10 card1 kernel: eth1: transmit timed out, switching to 100baseTx-FD media.
Jun 13 22:16:00 card1 kernel: eth2: Tx hung, 1436469 vs. 1436454.
Jun 13 22:16:00 card1 kernel: eth2: 21140 transmit timed out, status f0360000, SIA 000010c6 ffff0001 fffbffff 8ff5c000, resetting...
Jun 13 22:16:00 card1 kernel: eth2: transmit timed out, switching to 10baseT-FD media. Jun 13 22:16:06 card1 kernel: eth2: Tx hung, 1436469 vs. 1436454. I'm going to go dig up that patch that someone said fixed this kind of thing, and others said should never be used.... Can anyone tell me why it locks up in the first place? I'm trying to make some testing equipment, and it won't really do to have it lock up!! Let me know if I can offer more information that might help someone fix this for good. --Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jun 12 09:39:46 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:39:27 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:19211 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 09:39:03 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id JAA28295; Mon, 12 Jun 2000 09:14:25 -0700 Message-ID: <39450C61.C8445A35@candelatech.com> Date: Mon, 12 Jun 2000 09:14:25 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Gleb Natapov , Gleb Natapov , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, Jes Sorensen Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > On Mon, 12 Jun 2000, Gleb Natapov wrote: > > > jamal wrote: > > > > > > On Sat, 10 Jun 2000, Gleb Natapov wrote: > > > > > > Gleb > No, I mean IPV6, IPX, DECnet, appletalk etc. > > > > > > This is a very strong statement that you make above. 
You have > > > to provide justification. The protocol type is still available on the > > > header. how is any of the above protocols broken? > > > > I _never_ said they are broken! What I've said is that you will need to > > add explicit support of VLANs in each one of these protocols (if I > > understand correct you suggestion). > > > > And this is what i was refering to as you saying "broken" ;-> > small miscomunication .. > If you can uniquely distinguish each protocol from looking at the header, > how can the implementation be 'broken'? Thats was my question to you. So now the plan is to basically break all protocol layering in the kernel and hack little pieces of VLAN support into every layer three protocol? To summarize, the current VLAN scheme requires these changes: 1. Rewrite the device lookup methods. 2. Add the soft VLAN device code (done, twice). 3. Small user-level config program (vconfig, for example). Done. NOTE: The above solution has not been, IMHO, shown to have any serious performance hits in the critical path. The Jamal/Jes solution requires: 1. Change IPv4 2. Change IPv6 3. Change ip and/or ifconfig 4. Change route 5. Hell, change every other user-level program that needs to use VLANs like they are devices. 6. Change every other layer 3 protocol that will be supported by VLAN. 7. Test a million different ways. 8. Fix DHCP and others that bind to specific devices, and may now want to bind to VLANs. 9. And for what gain? > > Answer me this question: are you going to change something in net/ipv4/ > > directory in your VLAN implementation? > > The main requirement seems to be to be able to 'route' -- thats what about > 90% of the arguements here seem to be saying. IP rules. Who cares about > IPX? Just encapsulate it (IPX) in the ether-packet and let it be > switched/routed somewhere. some policy maps it to some VLAN. Is this not > sufficient? It's the other 10% that always kills you. I don't even understand your argument for IPX. 
Isn't linux supposed to be able to route/switch IPX? You'd be taking that away and asking that it be done somewhere else?

Oh, and when we get bridging working, will your implementation allow us to add VLANs to PPP (bridging ethernet) devices? That is something we may want out of whatever VLAN implementation survives...

> if you are asking whether i'll put files in net/ipv4/ or ipv6; yes
> indeed, i might but not necessarily so. I see little or zero changes to
> the current common case of routing IPV4/6 packets. Why would i wanna touch
> IPX code?

Because you just broke it w/regard to VLANs, eh?

> And this is based on the simple fact by trying to make all those tools a
> swiss army knife is wrong. Trying to use them as _the_ excuse is wrong.
> I dislike extending the kernel to 'help' a few things at the expense of
> the common path. I would rather extend the user space tools or even
> re-write them. I dont claim to be a purist, but certainly i think that a
> VLAN should be abstracted on a device and is therefore not a device.

So rather than stick a small shim (another device on a device) into the stack layers, you are going to stick pieces of it all over the place? How does that make architectural sense? What is wrong with generic code?

Note that NO changes had to be made to the swiss-army knife programs, they just worked, magically!

I don't think we have any additional costs to the common path, even w/out hashed lookups.
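[Editor's sketch] The "small shim (another device on a device)" argument above can be illustrated roughly as follows: strip the 802.1Q tag, look up a virtual interface keyed by (real device, VLAN ID), and hand the inner EtherType on unchanged, so the layer 3 protocols never see the tag. This is an illustration only, not code from either VLAN patch discussed in the thread; vlan_table, untag and the "eth0.5" style interface names are hypothetical.

```python
import struct

ETH_P_8021Q = 0x8100  # TPID that marks an 802.1Q-tagged Ethernet frame

# Hypothetical table: (real device, VLAN ID) -> virtual interface name.
vlan_table = {("eth0", 5): "eth0.5", ("eth0", 7): "eth0.7"}


def untag(dev, frame):
    """Return (interface, ethertype, payload) for a raw Ethernet frame.

    Untagged frames stay on the real device. Tagged frames are moved to the
    matching virtual interface, or dropped (None) if that VLAN is not set up.
    """
    if len(frame) < 14:
        return None  # too short to hold an Ethernet header
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETH_P_8021Q:
        return dev, ethertype, frame[14:]
    if len(frame) < 18:
        return None  # tag announced but truncated
    tci, inner_type = struct.unpack("!HH", frame[14:18])
    vid = tci & 0x0FFF  # low 12 bits of the TCI carry the VLAN ID
    vdev = vlan_table.get((dev, vid))
    if vdev is None:
        return None
    return vdev, inner_type, frame[18:]
```

In this sketch an IPX frame (EtherType 0x8137) tagged with VLAN 5 on eth0 comes out as ("eth0.5", 0x8137, payload), so an unmodified IPX layer would simply see a packet arriving on another Ethernet-like interface.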
Ben

--
Ben Greear (greearb@candelatech.com) http://www.candelatech.com
Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL)
http://scry.wanfear.com http://scry.wanfear.com/~greear

From owner-netdev@oss.sgi.com Mon Jun 12 09:51:17 2000
Received: by oss.sgi.com id ; Mon, 12 Jun 2000 09:51:06 -0700
Received: from [192.115.143.146] ([192.115.143.146]:17674 "EHLO www.penta-com.com") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 09:50:30 -0700
Received: by WWW with Internet Mail Service (5.5.2448.0) id ; Mon, 12 Jun 2000 18:00:33 +0200
Message-ID:
From: jacob avraham
To: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com
Subject: RE: 2.4 kernel networking and SMP
Date: Mon, 12 Jun 2000 18:00:30 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2448.0)
Content-Type: text/plain; charset="windows-1255"
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

> > Can you tell me what level of parallelism do I get from the networking
> > layers (forwarding, filtering, NAT),
>
> Infinite one. 8)
>
> Seriously, it depends on network configuration. F.e. if router has only
> two interfaces, more than two cpus will do no useful work,
> but simply will break network reordering packets, if you
> did not bind irqs to selected cpus.
>
> Alexey
>

I'm looking at 4 CPUs and 24 i/f, so I'd like to bind 6 i/fs per CPU. In an SMP scenario, I can imagine that the net i/f ISRs are working in parallel on each CPU, where the end result is (de)queuing packets off the IP layer queue. What I'm not clear on is when the IP layer kicks in (I assume not in an ISR), is all the code that handles the packet processing (NAT, filters, forwarding) running on one kernel thread on one CPU, or is this job parallelized as well?
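[Editor's sketch] The binding arithmetic in the message above (4 CPUs, 24 interfaces, 6 per CPU) can be sketched as below, assuming a kernel that exposes per-IRQ affinity as a hexadecimal CPU bitmask in /proc/irq/<n>/smp_affinity and assuming each interface has its own IRQ. The IRQ numbers 16..39 are made up for illustration; real ones would come from /proc/interrupts. The script only prints the commands and writes nothing.

```python
# Sketch: spread NIC IRQs over CPUs in equal contiguous blocks.
# Assumption: one IRQ per interface; the IRQ numbers used below are made up.


def cpu_mask(cpu):
    """Hex bitmask selecting a single CPU (CPU 0 -> '1', CPU 2 -> '4')."""
    return format(1 << cpu, "x")


def assign_irqs(irqs, n_cpus):
    """Map each IRQ to one CPU, giving every CPU the same number of IRQs."""
    if len(irqs) % n_cpus != 0:
        raise ValueError("IRQ count must divide evenly among CPUs")
    per_cpu = len(irqs) // n_cpus
    return {irq: i // per_cpu for i, irq in enumerate(irqs)}


def affinity_commands(assignment):
    """Shell commands that would pin each IRQ to its assigned CPU."""
    return ["echo %s > /proc/irq/%d/smp_affinity" % (cpu_mask(cpu), irq)
            for irq, cpu in sorted(assignment.items())]


if __name__ == "__main__":
    hypothetical_irqs = list(range(16, 40))  # 24 placeholder IRQ numbers
    for line in affinity_commands(assign_irqs(hypothetical_irqs, 4)):
        print(line)
```

Contiguous blocks are used only to keep the example readable; any one-to-one mapping that leaves each interface's IRQ on a single CPU serves the stated goal of avoiding packet reordering.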
From owner-netdev@oss.sgi.com Mon Jun 12 10:07:47 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 10:07:17 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:56454 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Mon, 12 Jun 2000 10:06:53 -0700 Received: (qmail 617 invoked by uid 577); 12 Jun 2000 05:59:39 -0000 Message-ID: <20000612135938.A586@saw.sw.com.sg> Date: Mon, 12 Jun 2000 13:59:38 +0800 From: Andrey Savochkin To: Julian Anastasov Cc: Jerome Etienne , netdev@oss.sgi.com Subject: Re: IFA_F_NO_NDISC (for vrrp) References: <20000609161159.A19020@saw.sw.com.sg> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: ; from "Julian Anastasov" on Fri, Jun 09, 2000 at 03:03:44PM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello Julian, On Fri, Jun 09, 2000 at 03:03:44PM +0300, Julian Anastasov wrote: > Yes, using fib_select_addr() is not a problem. Oops, > sorry, my question was for fib_local_source(), not for > fib_select_addr() but you understood me. We want to keep > using inet_addr_type() instead of fib_local_source() in > arp_solicit(). Only there. And it is not a problem the hosts > to be involved in same logical network to allow them to > define any number of additional logical networks as local > but not in the "local" table. I think, this can't be a > problem for any user which can add routes in other tables, > i.e. not only in the "local" table :) > > So, can we just in this case to use inet_addr_type? > > I now see that inet_addr_type() is changed too, to > use fib_lookup. I forgot about this. Yep, with this change > we can't do this (to limit the local addresses in > arp_solicit). I have to wait for the final version of your > patch :) Using inet_addr_type() in arp_solicit(), especially the current one looking into local table only, is wrong from conceptual point of view. 
It makes decisions that can't be explained in human terms, only as some manipulations in C language. It's better not to call any checking function than to call the wrong one. I'll call fib_select_addr() unconditionally.

Best regards
Andrey V. Savochkin

From owner-netdev@oss.sgi.com Mon Jun 12 10:24:18 2000
Received: by oss.sgi.com id ; Mon, 12 Jun 2000 11:22:01 -0700
Received: from aarnima1-pc1.tmt.tele.fi ([194.252.70.4]:65197 "EHLO aarnima1-pc1.tmt.tele.fi") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 10:30:49 -0700
Received: (mea@aarnima1-pc1.tmt.tele.fi) by mea.tmt.tele.fi id ; Mon, 12 Jun 2000 20:29:48 +0300
Date: Mon, 12 Jun 2000 20:29:48 +0300
From: Matti Aarnio
To: Jes Sorensen
Cc: netdev@oss.sgi.com
Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???
Message-ID: <20000612202948.V20265@mea.tmt.tele.fi>
References: <39433342.4A9E80CC@nbase.co.il> <3943C086.930DF571@candelatech.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
In-Reply-To: ; from jes@linuxcare.com on Sun, Jun 11, 2000 at 06:07:09PM +0200
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

[Cc set truncated..]

On Sun, Jun 11, 2000 at 06:07:09PM +0200, Jes Sorensen wrote:
> >>>>> "Ben" == Ben Greear writes:
> Ben> So just because you don't see a use for it means everyone else
> Ben> should be denied the use of it??? Gleb's argument is valid
> Ben> whether or not IPX exists, because other, so-far-unthought-of,
> Ben> protocols may be created, and they would have the same problem
> Ben> that IPX would now.
>
> Try to take a look at how IPX behaves on the wire before commenting -
> the people who designed it need serious larting.

IPX is/was the protocol used in XNS networking, which had its origins in early Ethernets -- I mean BEFORE the Ethernet was standardized to be 10 Mbit/sec. As such it worked reasonably well with 2-10 stations in the network.
Growing beyond that has proven to be non-trivial.. > Ben> TCP/IP is not the end of the line in network protocols! At the > Ben> very least IPv6 will be extremely important in the near future. > > Some people still believe that, I used to think that but I don't see > the push for it anymore. One could say same of IP(v4) multicast. Only now the userspace has gotten *some* support for it, and all manner of multimedia things are still burning network bandwidth for kwazillion point-to-point unicast copies of datastreams. Of course networks don't very widely support multicast, as nobody has called for it because there are no applications which would use it, and application writers don't use MC because networks don't support it ubiquitously... One of the major things which IPv6 has over IPv4 is its seriously expanded address space. However... Emergence of NAT/PAT for IPv4 has somewhat reduced the network address space explosion with the unpleasant feature of it not being completely transparent for all applications. New "Realm Specific IP" (RSIP) mechanism requires that hosts interact with external RSIP gateway (which NAT doesn't need), so in my thinking it is similar magnitude thing to IPv4 for the kernel stack. It isn't quite the same task for e.g. user- space, which gets most of RSIP handling "free" from kernel. Deployment requires a new kernel, of course. > Ben> Broad support for almost every protocol known is one of the very > Ben> best features of Linux. Doing anything to make that less true > Ben> would make Linux less useful to the rest of us. > > Broad support for as much as possible is good, but limiting support > for the mainstream in order to improve support for something broken is > wrong. Hmm.. I think I have already scrubbed the thread start so I am not sure what thing you are talking about. 
But in broad terms, when Linux has two competing implementations whose performance differences affect commonly used protocols, the performance winner gets picked -- usually.

> Jes

/Matti Aarnio

From owner-netdev@oss.sgi.com Mon Jun 12 10:24:20 2000
Received: by oss.sgi.com id ; Mon, 12 Jun 2000 11:22:00 -0700
Received: from minus.inr.ac.ru ([193.233.7.97]:14596 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Mon, 12 Jun 2000 10:13:34 -0700
Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id TAA24257; Mon, 12 Jun 2000 19:27:31 +0400
From: kuznet@ms2.inr.ac.ru
Message-Id: <200006121527.TAA24257@ms2.inr.ac.ru>
Subject: Re: 2.4 kernel networking and SMP
To: jacoba@penta-com.com (jacob avraham)
Date: Mon, 12 Jun 2000 19:27:31 +0400 (MSK DST)
Cc: netdev@oss.sgi.com
In-Reply-To: from "jacob avraham" at Jun 12, 0 06:00:30 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Length: 277
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hello!

> running on one kernel thread on one CPU, or is this job parallelized as well?

All the work is done by the CPU where the packet first appeared, i.e. where the corresponding irq was delivered. Routing between different cpus is suppressed as strongly as possible.
Alexey From owner-netdev@oss.sgi.com Mon Jun 12 10:24:27 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 10:24:19 -0700 Received: from user148.s137.samsung.co.kr ([203.241.137.148]:49420 "EHLO swc.sec.samsung.co.kr") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 11:11:28 -0700 Received: by SWC with Internet Mail Service (5.0.1460.8) id ; Mon, 12 Jun 2000 09:21:00 +0900 Message-ID: <003928C86BDDD211BC8000A0C98A6129BBBEFB@SWC> From: =?euc-kr?B?udq/7LDm?= To: "'m.oliveira@cs.ucl.ac.uk'" , netdev@oss.sgi.com Subject: RE: Multicast routing Date: Mon, 12 Jun 2000 09:20:57 +0900 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.0.1460.8) Content-Type: text/plain Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing -----Original Message----- From: Manuel Oliveira [mailto:m.oliveira@cs.ucl.ac.uk] Sent: Sunday, June 11, 2000 4:47 AM To: netdev@oss.sgi.com Subject: Multicast routing Hi, My apologies for my ignorance, but what is the equivalent daemon in Linux to the mrouted of BSD? Thanks for any answers. Cheers Manuel From owner-netdev@oss.sgi.com Mon Jun 12 10:49:09 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 10:48:59 -0700 Received: from dhcp41.toaster.net ([199.108.84.41]:29447 "EHLO schmee.sfgoth.com") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 10:48:47 -0700 Received: (from mitch@localhost) by schmee.sfgoth.com (8.9.3/8.9.3) id LAA42407; Mon, 12 Jun 2000 11:46:10 -0700 (PDT) Date: Mon, 12 Jun 2000 11:46:10 -0700 From: Mitchell Blank Jr To: jamal Cc: Gleb Natapov , Gleb Natapov , Ben Greear , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, Jes Sorensen Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
Message-ID: <20000612114610.A40834@sfgoth.com>
References: <3944DD68.3AB7DDEB@nbase.co.il>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
In-Reply-To: ; from hadi@cyberus.ca on Mon, Jun 12, 2000 at 09:39:17AM -0400
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

jamal wrote:
> IP rules. Who cares about
> IPX? Just encapsulate it (IPX) in the ether-packet and let it be
> switched/routed somewhere. some policy maps it to some VLAN. Is this not
> sufficient?

It's already been explained several times in this thread why this is not so - namely, that one of the major uses of vlans is to isolate large networks running broadcast-happy protocols in such a way that they can still talk to their servers directly. Suppose you had a big appletalk installation - you could split it in 10 pieces by VLAN and then have your linux+netatalk servers bring in all the VLANs on 802.1q.

Only your IP-centric solution doesn't handle this very well.

> We already have a very powerful modular routing architecture; use the
> route flags and dst_in/out properly and you have yourself specific routing
> code which does not touch the common path *at all*

Do you have a reference for this code - I wonder if we could make ATM-CLIP work better.

Still, I don't see it as relevant to the VLAN discussion. The biggest problem with your idea is that it changes the API that protocols use to pass packets out a VLAN as opposed to a LAN (i.e. net_device), you're changing the abstraction used for configuring lots of protocols (i.e. net_device), and you're changing the abstraction presented to the user space tools (i.e. a named net_device)

jamal - answer this one question: what plans do you have for PPPoE? After all, it's just another protocol for expressing VLANs on an ethernet. It also has the exact same (supposedly fatal) flow control issues that VLANs have.
Therefore, I assume your arguments are the same against them being a net_device, right? This one's particularly hairy because you'll need to keep the ppp_generic layer the same (to interface with pppd and speak the various PPP protocols... this will become more even more important if we do some of the things proposed for 2.5 like ppp-layer-bridging) I'd like to see your proposal for handling this (and I mean a real proposal, not just "use netfilter") Technically, while you're fixing places that have broken flow control, you can note that the same is true of any net_device that internally uses dev_queue_xmit. Specifically, eql.c, shaper.c, bonding.c are all busted in the same way you describe. These actually could be done with netfilter, though, so I'm more interested in the PPPoE implementation. > I dislike extending the kernel to 'help' a few things at the expense of > the common path. I don't see where you can claim to hold the high ground on "common path". -Mitch From owner-netdev@oss.sgi.com Mon Jun 12 14:05:59 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 14:05:40 -0700 Received: from chaos.ao.net ([205.244.242.21]:4368 "EHLO chaos.ao.net") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 14:05:36 -0700 Received: from chaos.ao.net (harik@localhost [127.0.0.1]) by chaos.ao.net (8.9.3/8.9.3/Debian 8.9.3-21) with ESMTP id SAA12180; Mon, 12 Jun 2000 18:05:23 -0400 Message-Id: <200006122205.SAA12180@chaos.ao.net> To: "'m.oliveira@cs.ucl.ac.uk'" Cc: netdev@oss.sgi.com Subject: Re: Multicast routing Date: Mon, 12 Jun 2000 18:05:22 -0400 From: Dan Merillat Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Amazingly enough, the equivalent daemon is mrouted. What I'd like to know is where's there a pim-1 implementation, since the Real World deployments are pim-1, not 2. I've found archives of people who were working on pim1, even success reports, but no code. 
--Dan > -----Original Message----- > From: Manuel Oliveira [mailto:m.oliveira@cs.ucl.ac.uk] > Sent: Sunday, June 11, 2000 4:47 AM > To: netdev@oss.sgi.com > Subject: Multicast routing > > > Hi, > > My apologies for my ignorance, but what is the equivalent daemon in > Linux to the mrouted of BSD? > > Thanks for any answers. > > Cheers > > Manuel > From owner-netdev@oss.sgi.com Mon Jun 12 16:09:40 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 16:09:20 -0700 Received: from seattle.3com.com ([129.213.128.97]:6788 "EHLO seattle.3com.com") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 16:08:50 -0700 Received: from new-york.3com.com (new-york.3com.com [129.213.157.12]) by seattle.3com.com (8.8.8/8.8.8) with ESMTP id RAA12433; Mon, 12 Jun 2000 17:08:37 -0700 (PDT) From: Jim_March@3com.com Received: from hqoutbound.ops.3com.com (hqoutbound.OPS.3Com.COM [139.87.48.104]) by new-york.3com.com (8.8.8/8.8.8) with SMTP id RAA24170; Mon, 12 Jun 2000 17:08:26 -0700 (PDT) Received: by hqoutbound.ops.3com.com(Lotus SMTP MTA v4.6.7 (934.1 12-30-1999)) id 882568FD.0000A6C5 ; Mon, 12 Jun 2000 17:07:06 -0700 X-Lotus-FromDomain: 3COM To: Matti Aarnio cc: netdev@oss.sgi.com Message-ID: <882568FD.0000A525.00@hqoutbound.ops.3com.com> Date: Mon, 12 Jun 2000 17:08:30 -0700 Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? Mime-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Matti Aarnio on 06/12/2000 10:29:48 AM Sent by: Matti Aarnio To: Jes Sorensen cc: netdev@oss.sgi.com (Jim March/HQ/3Com) Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? New "Realm Specific IP" (RSIP) mechanism requires that hosts interact with external RSIP gateway (which NAT doesn't need), so in my thinking it is similar magnitude thing to IPv4 for the kernel stack. It isn't quite the same task for e.g. user- space, which gets most of RSIP handling "free" from kernel. 
Deployment requires a new kernel, of course. RSIP is coming. I have been working on an RSIP Host solution for a while now, and should begin testing by next week. The solution, so far, is primarily in User Space, with a small Kernel Module acting as a conduit between four "shim-points" within the IP stack and the User Space daemon. More later when testing progresses... /Matti Aarnio From owner-netdev@oss.sgi.com Mon Jun 12 17:46:31 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 17:46:20 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:60336 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 17:45:53 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id VAA28179; Mon, 12 Jun 2000 21:45:50 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id VAA03729; Mon, 12 Jun 2000 21:45:45 -0400 (EDT) Date: Mon, 12 Jun 2000 21:45:45 -0400 (EDT) From: jamal To: Ben Greear cc: Gleb Natapov , Gleb Natapov , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, Jes Sorensen Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? In-Reply-To: <39450C61.C8445A35@candelatech.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Is something wrong with netdev? Seems like it is offlined and polls every 24 hours or so. On Mon, 12 Jun 2000, Gleb Natapov wrote: > > By looking into the header you break the layering. Anyway what do you > expect to find there? IP, IPX or some protocol that I just wrote? How > are you going to support my protocol? Or you suppose that I should write > my own code in order to support my new protocol over VLAN ? ;-> i think you are totaly misunderstanding me. If the ethertype is IPX it goes to IPX. If the ethertype is your protocol it goes to your protocol. 
There is _no_ protocol violation. So how does this require me to change IPX? The only reason you need to extend (not change) IP is because 'routing' (which appears to be a major requirement) is a function of IP. The hooks are already there! > > Because I want to run IPX in separate VLAN! How will I be able to do > that with your implementation? > Yes, of course! > interface. I don't see what makes you think that what 'ifconfig -a' > shows are devices they are _interfaces_. Interface can be attached to > device (eth0), can be not (dummy0, lo), one interface can be attached to > many devices (bridging), many interfaces can be attached to single > device (tunneling, vlans). Ok, so do you use ifconfig to attach a vlan to a device? And why not? Sure go ahead and extend ifconfig so now it knows about attaching and deleting VLANs on a device. After all, it is another interface ;-> A VLAN is a way of logically separating traffic, across the same device (or equipment) Device being a physical entity; i would prefer to call a VLAN a circuit And sure i dont mind re-writting ifconfig so it knows about such things. But i dont know if it would still be called ifconfig. On Mon, 12 Jun 2000, Ben Greear wrote: > NOTE: The above solution has not been, IMHO, shown to have any > serious performance hits in the critical path. > > > > The Jamal/Jes solution requires: English (as is any natural language) could be very ambigious. I suppose Jes or I could have used mathematical equations to express ourselves. I really fail to see how you reached the conclusion above after all that head banging. > It's the other 10% that always kills you. I don't even understand your > argument for IPX. Isn't linux supposed to be able to route/switch IPX? > You'd be taking that away and asking that it be done somewhere else? > I am not ruling out IPX or Gleb's protocol. If IPX was so important why are we "routing", which is an IP term? 
I understand there are legacy protocols such as appletalk etc which have to be supported. In their native form they work just fine today over Linux. If you want to wrap them in a VLAN, how is my proposal stopping you? > Oh, and when we get bridging working, will your implementation allow us > to add VLANs to PPP (bridging ethernet) devices? That is something we > may want out of whatever VLAN implementation survives... Why dont you take up the challenge and write up an RFC? Its not about what "survives"; its about what a good solution is. "Survival" with a technical twist is a corporate political term. Normally, it is driven by how colorful your chartware is. We are above that. So go ahead and write something up so we dont resort to the "he said, she said". cheers, jamal From owner-netdev@oss.sgi.com Mon Jun 12 18:21:31 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 18:21:11 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:35601 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 18:21:01 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id TAA29808; Mon, 12 Jun 2000 19:56:48 -0700 Message-ID: <3945A2F0.A93D09BA@candelatech.com> Date: Mon, 12 Jun 2000 19:56:48 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Gleb Natapov , Gleb Natapov , Mitchell Blank Jr , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, Jes Sorensen Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > interface. I don't see what makes you think that what 'ifconfig -a' > > shows are devices they are _interfaces_. 
Interface can be attached to > > device (eth0), can be not (dummy0, lo), one interface can be attached to > > many devices (bridging), many interfaces can be attached to single > > device (tunneling, vlans). > > Ok, so do you use ifconfig to attach a vlan to a device? And why not? > Sure go ahead and extend ifconfig so now it knows about attaching and > deleting VLANs on a device. After all, it is another interface ;-> ifconfig is a big hairy beast, so I wrote a separate program to create/destroy VLANs, and do specific VLAN things. However, once they are created, then I am able to use ifconfig to set their IP address, UP/DOWN, and everything else you might want to do to an ethernet 'device'. Both programs do what they do best, and no existing user level applications need to be patched. > A VLAN is a way of logically separating traffic, across the same device > (or equipment) > Device being a physical entity; i would prefer to call a VLAN a circuit > And sure i dont mind re-writting ifconfig so it knows about such things. > But i dont know if it would still be called ifconfig. So how about we call it a circuit, and then use a struct net_device to hold the circuit information? Since it walks, looks, and quacks like an ethernet device, we could just put it in the same kernel management structures that hold the other ethernet devices... :) > I am not ruling out IPX or Gleb's protocol. If IPX was so important > why are we "routing", which is an IP term? I understand there are legacy > protocols such as appletalk etc which have to be supported. > In their native form they work just fine today over Linux. > If you want to wrap them in a VLAN, how is my proposal stopping you? Most protocols that run over an ethernet interface, may want to run over a VLAN interface. So, if we can make VLAN look *just* like an ethernet device to them, the beauty is that we don't actually have to know anything about them, they just magically work. 
To me, that is much more appealing than tweaking every protocol to teach it how to bind to a VLAN. > > > Oh, and when we get bridging working, will your implementation allow us > > to add VLANs to PPP (bridging ethernet) devices? That is something we > > may want out of whatever VLAN implementation survives... > > Why dont you take up the challenge and write up an RFC? Its not about > what "survives"; its about what a good solution is. "Survival" with I figure someone already knows how to bridge stuff over a PPP link. Maybe RFC 1490 or something? It's been done for ages with FrameRelay. To bridge ethernet shouldn't be too hard, and if VLANs are net_devices, it shouldn't be any harder to bridge them too. As for the comments on survival. It's obvious that what is obvious and good to both of us is not the same thing. So I think survival is a valid term! -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Jun 12 18:40:32 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 18:40:12 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:44485 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 18:39:50 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id WAA18196; Mon, 12 Jun 2000 22:39:48 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id WAA03768; Mon, 12 Jun 2000 22:39:49 -0400 (EDT) Date: Mon, 12 Jun 2000 22:39:49 -0400 (EDT) From: jamal To: Mitchell Blank Jr cc: Gleb Natapov , Gleb Natapov , Ben Greear , Andrey Savochkin , rob@valinux.com, buytenh@gnu.org, netdev@oss.sgi.com, Jes Sorensen Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
In-Reply-To: <20000612114610.A40834@sfgoth.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, 12 Jun 2000, Mitchell Blank Jr wrote: > It's already been explained several times in this thread why this is > not so - namely, that one of the major uses of vlans is to isolate > large networks running broadcast-happy protocols in such a way that > they can still talk to their servers directly. Suppose you had a > big appletalk installation - you could split it in 10 pieces by VLAN > and then have your linux+netatalk servers bring in all the VLANs > on 802.1q. > And i never 'broke' this requirement. > Only your IP-centric solution doesn't handle this very well. > But "IP centric" is the main requirement! again, i really fail to see how i break the traffic isolation requirement. > > We already have a very powerful modular routing architecture; use the > > route flags and dst_in/out properly and you have yourself specific routing > > code which does not touch the common path *at all* > > Do you have a reference for this code - I wonder if we could make ATM-CLIP > work better. > Look at James Leu's MPLS code for starters. > Still, I don't see it as relevant to the VLAN discussion. IP centrisim. Its all about IP, she said. > The biggest problem with your idea is that it changes the API that > protocols use to pass packets out a VLAN opposed to a LAN > (i.e. net_device), you're changing the abstration used for lots of > configuring protocls (i.e. net_device), > and you're changing the abstration presented to the user space tools > (i.e. a named net_device) > If it is a packet mungler it is a packet mungler, it is a packet mungler. Dont try to call it anything else, it wont change the facts. I seen VLAN as such. 
The major reason any of the arguements i have seen presented here end up using the user space tools is because they want to associate things with IP; they need to attach an IP address etc. So if that equates to something being a net_device then you are right. But you have argued against that citing ATM. As you mentioned in your earlier email, we might need to abstract things out better and tame net_device a little. we can probably resolve this then. I am just concerned about now. > jamal - answer this one question: what plans do you have for PPPoE? > After all, it's just another protocol for expressing VLANs on an > ethernet. It also has the exact same (supposedly fatal) flow > control issues that VLANs have. Therefore, I assume your arguments > are the same against them being a net_device, right? This one's > particularly hairy because you'll need to keep the ppp_generic > layer the same (to interface with pppd and speak the various PPP > protocols... this will become more even more important if we do > some of the things proposed for 2.5 like ppp-layer-bridging) > I'd like to see your proposal for handling this (and I mean a > real proposal, not just "use netfilter") > I thinking shunting (or bridging) between protocol blocks cant be solved by netfilter; but netfilter can probably be extended. With the curent pppox (Look at Michal's code) the main "device" problem you refer to is caused by pppd's lameness. You should easily be able to send several pppoe sockets/circuits across the same ppp "device"; just ioctl attach them. We can work towards a better solution. Michal and I already talked about this. He points to the BSD solution; i think we can do better. > Technically, while you're fixing places that have broken flow control, > you can note that the same is true of any net_device that internally > uses dev_queue_xmit. Specifically, eql.c, shaper.c, bonding.c are > all busted in the same way you describe. 
These actually could > be done with netfilter, though, so I'm more interested in the PPPoE > implementation. > Let's talk about it at some point. Although code speaks louder, sometimes it is safer to solve a problem like this by having some design. It's definitely a 2.5 thing. I think we need to have a "real" proposal as you put it. cheers, jamal From owner-netdev@oss.sgi.com Mon Jun 12 22:54:13 2000 Received: by oss.sgi.com id ; Mon, 12 Jun 2000 22:54:03 -0700 Received: from quechua.inka.de ([212.227.14.2]:24616 "EHLO mail.inka.de") by oss.sgi.com with ESMTP id ; Mon, 12 Jun 2000 22:53:38 -0700 Received: from dungeon.inka.de by mail.inka.de with uucp (rmailwrap 0.4) id 131kZr-0001Is-00; Tue, 13 Jun 2000 08:53:35 +0200 Received: by dungeon.inka.de (Postfix, from userid 1000) id 3366EB7841; Tue, 13 Jun 2000 08:53:30 +0200 (CEST) Date: Tue, 13 Jun 2000 08:53:30 +0200 From: Andreas Jellinghaus To: Dan Merillat Cc: "'m.oliveira@cs.ucl.ac.uk'" , netdev@oss.sgi.com Subject: Re: Multicast routing Message-ID: <20000613085329.B5865@dungeon.inka.de> References: <200006122205.SAA12180@chaos.ao.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii User-Agent: Mutt/1.0.1i In-Reply-To: <200006122205.SAA12180@chaos.ao.net>; from harik@chaos.ao.net on Mon, Jun 12, 2000 at 06:05:22PM -0400 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing try ftp.funet.fi /pub/mirrors/ftp.inr.ac.ru/ip-routing/pim ftp> ls 200 PORT command successful. 150 Opening ASCII mode data connection for .
-r--r--r-- 1 mirror mirror 2707 May 30 1999 README lrwxrwxrwx 1 mirror mirror 24 Jul 28 1999 pimsm-v1-current.tar.gz -> pimsm-v1-ss990530.tar.gz -r--r--r-- 1 mirror mirror 68769 Mar 27 1999 pimsm-v1-ss990327.tar.gz -r--r--r-- 1 mirror mirror 68830 Mar 30 1999 pimsm-v1-ss990330.tar.gz -r--r--r-- 1 mirror mirror 68920 May 30 1999 pimsm-v1-ss990530.tar.gz andreas From owner-netdev@oss.sgi.com Tue Jun 13 00:22:03 2000 Received: by oss.sgi.com id ; Tue, 13 Jun 2000 00:21:54 -0700 Received: from lrcsun15.epfl.ch ([128.178.156.77]:30935 "EHLO lrcsun15.epfl.ch") by oss.sgi.com with ESMTP id ; Tue, 13 Jun 2000 00:21:34 -0700 Received: (from almesber@localhost) by lrcsun15.epfl.ch (8.8.X/EPFL-8.1a) id KAA09870; Tue, 13 Jun 2000 10:21:27 +0200 (MET DST) From: Werner Almesberger Message-Id: <200006130821.KAA09870@lrcsun15.epfl.ch> Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? To: greearb@candelatech.com (Ben Greear) Date: Tue, 13 Jun 2000 10:21:27 +0200 (MET DST) Cc: netdev@oss.sgi.com In-Reply-To: <3945A2F0.A93D09BA@candelatech.com> from "Ben Greear" at Jun 12, 2000 07:56:48 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Ben Greear wrote: > So how about we call it a circuit, and then use a struct net_device to hold > the circuit information? Since it walks, looks, and quacks like an ethernet > device, we could just put it in the same kernel management structures that > hold the other ethernet devices... :) Hmm, I'm afraid it might turn out to be more like a neighbour than a net_device ... I think the underlying problem of this discussion is that the current net_device is heavily overloaded. Let's see, there are: - a set of local L3 addresses - a local L2 address (with VLAN, that would be multiple ?) 
- a set of potentially reachable L3 addresses - a set of actually reachable (connected) L3 addresses, along with the corresponding L2 addresses - a set of methods to translate L3 to L2, possibly manipulating the packet content (ARP) - a queuing discipline - maybe a physical device - an application-visible device (e.g. selected by address in bind(2)) - a few references from the routing table - initial information for more specific structures, e.g. the mtu for destinations - per-net_device state (e.g. flags) - statistics - typically one or more communication channels (e.g. Ethernet link or ATM VCs) - a name, used for management - an index, e.g. used for multicast socket options etc. No surprise it's nearly impossible to agree on a reasonable way to add some major new functionality :-) - Werner -- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH werner.almesberger@ica.epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Tue Jun 13 00:54:24 2000 Received: by oss.sgi.com id ; Tue, 13 Jun 2000 00:54:04 -0700 Received: from robur.slu.se ([130.238.98.12]:26376 "EHLO robur.slu.se") by oss.sgi.com with ESMTP id ; Tue, 13 Jun 2000 00:53:47 -0700 Received: (from robert@localhost) by robur.slu.se (8.8.7/8.8.7) id KAA08441; Tue, 13 Jun 2000 10:53:14 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <14661.63096.939180.46830@robur.slu.se> Date: Tue, 13 Jun 2000 10:53:12 +0200 (CEST) To: Dan Merillat Cc: "'m.oliveira@cs.ucl.ac.uk'" , aj@dungeon.inka.de, netdev@oss.sgi.com Subject: Re: Multicast routing In-Reply-To: <200006122205.SAA12180@chaos.ao.net> References: <200006122205.SAA12180@chaos.ao.net> X-Mailer: VM 6.75 under Emacs 19.34.1 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello!
If you are looking for sparse-mode you might also want to look at USC: For anonymous FTP: ftp://catarina.usc.edu/pub/pim/pimd/pimd-current.tar.gz For HTTP: http://netweb.usc.edu/pim/pimd/pimd-current.tar.gz In addition, a colleague and I have written some code for IGMPV2/PIMSM-V2 within the Zebra framework, but things like asserts and autorp are still left to do. But multicast is still a toy. :-) --ro Andreas Jellinghaus writes: > try ftp.funet.fi > /pub/mirrors/ftp.inr.ac.ru/ip-routing/pim > ftp> ls > 200 PORT command successful. > 150 Opening ASCII mode data connection for . > -r--r--r-- 1 mirror mirror 2707 May 30 1999 README > lrwxrwxrwx 1 mirror mirror 24 Jul 28 1999 pimsm-v1-current.tar.gz -> pimsm-v1-ss990530.tar.gz > -r--r--r-- 1 mirror mirror 68769 Mar 27 1999 pimsm-v1-ss990327.tar.gz > -r--r--r-- 1 mirror mirror 68830 Mar 30 1999 pimsm-v1-ss990330.tar.gz > -r--r--r-- 1 mirror mirror 68920 May 30 1999 pimsm-v1-ss990530.tar.gz > > andreas From owner-netdev@oss.sgi.com Tue Jun 13 06:34:04 2000 Received: by oss.sgi.com id ; Tue, 13 Jun 2000 06:33:54 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:51722 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Tue, 13 Jun 2000 06:33:46 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id IAA32548; Tue, 13 Jun 2000 08:10:57 -0700 Message-ID: <39464F01.2206643E@candelatech.com> Date: Tue, 13 Jun 2000 08:10:57 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: Werner Almesberger CC: netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ???
References: <200006130821.KAA09870@lrcsun15.epfl.ch> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Werner Almesberger wrote: > > Ben Greear wrote: > > So how about we call it a circuit, and then use a struct net_device to hold > > the circuit information? Since it walks, looks, and quacks like an ethernet > > device, we could just put it in the same kernel management structures that > > hold the other ethernet devices... :) > > Hmm, I'm afraid it might turn out to be more like a neighbour than a > net_device ... > > I think the underlying problem of this discussion is that the current > net_device is heavily overloaded. Let's see, there are: > > - a set of local L3 addresses > - a local L2 address (with VLAN, that would be multiple ?) > - a set of potentially reachable L3 addresses > - a set of actually reachable (connected) L3 addresses, along with > the corresponding L2 addresses > - a set of methods to translate L3 to L2, possibly manipulating the > packet content (ARP) > - a queuing discipline > - maybe a physical device > - an application-visible device (e.g. selected by address in bind(2)) > - a few references from the routing table > - initial information for more specific structures, e.g. the mtu for > destinations > - per-net_device state (e.g. flags) > - statistics > - typically one or more communication channels (e.g. Ethernet link or > ATM VCs) > - a name, used for management > - an index, e.g. used for multicast socket options > etc.
> > No surprise it's nearly impossible to agree on a reasonable way to add > some major new functionality :-) Yep, there is a comment in the code, probably from 1.x talking about how it should be cleaned up in the 'next' release...seems that comment is still valid :) It would be a perfect place for OO and inheritance, if we were into that, but since that isn't an option, seems to me the logical choice is to add a poiner to the end of the struct and put your specialized stuff there. (That's what I did for VLAN at least.) That has the unsavory result of adding cruft to the net_device struct, but with some clever use of DOCUMENTED flags in the net device, we could probably do run-time casting of a void* or two to hold information for various, mutually exclusive data structures. For example, a net_device should never need both VLAN info and ATM_PVC info, so one void*, with a flag to tell what type it is, should be sufficient. Basically, just hand-coded inheritance... Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Tue Jun 13 06:46:24 2000 Received: by oss.sgi.com id ; Tue, 13 Jun 2000 06:46:14 -0700 Received: from lrcsun15.epfl.ch ([128.178.156.77]:44269 "EHLO lrcsun15.epfl.ch") by oss.sgi.com with ESMTP id ; Tue, 13 Jun 2000 06:45:55 -0700 Received: (from almesber@localhost) by lrcsun15.epfl.ch (8.8.X/EPFL-8.1a) id QAA18667; Tue, 13 Jun 2000 16:45:52 +0200 (MET DST) From: Werner Almesberger Message-Id: <200006131445.QAA18667@lrcsun15.epfl.ch> Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
To: greearb@candelatech.com (Ben Greear) Date: Tue, 13 Jun 2000 16:45:52 +0200 (MET DST) Cc: netdev@oss.sgi.com In-Reply-To: <39464F01.2206643E@candelatech.com> from "Ben Greear" at Jun 13, 2000 08:10:57 AM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Ben Greear wrote: > Yep, there is a comment in the code, probably from 1.x talking about how it > should be cleaned up in the 'next' release...seems that comment is still > valid :) Oh, a _lot_ got cleaned up in 2.3 ... ;-) > It would be a perfect place for OO and inheritance, The "all in one" type of object you're describing would be rather ugly, IMHO. Better to remove those elements that don't have a 1:1 relationship and put them elsewhere. E.g. neighbours are already elsewhere. Local addresses (L3 and maybe also L2) should probably also go elsewhere. Then qdiscs (think multilink PPP, ATM, or FR SVCs), etc. Probably the best approach to clean up the design would be to split the net_device structure into as many substructures as possible, create new name spaces, where applicable, and then to combine those things that always or usually occur together. 
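As a rough illustration of the "one void*, with a flag to tell what type it is" idea from earlier in this thread, here is a minimal C sketch. Everything in it is hypothetical: struct demo_device, enum dev_ext_type, struct vlan_info, struct atm_pvc_info and the accessor names are made up for the example and are not taken from the kernel's struct net_device or from the VLAN patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag saying which kind of extension data hangs off the device. */
enum dev_ext_type { DEV_EXT_NONE = 0, DEV_EXT_VLAN, DEV_EXT_ATM_PVC };

/* Hypothetical, mutually exclusive extension records. */
struct vlan_info    { unsigned short vlan_id; };
struct atm_pvc_info { unsigned short vpi; unsigned short vci; };

/* Stand-in for a device structure: common fields plus one tagged pointer. */
struct demo_device {
    char name[16];
    enum dev_ext_type ext_type;   /* documented flag: what 'ext' points to */
    void *ext;                    /* NULL when ext_type is DEV_EXT_NONE    */
};

/* Checked "downcasts": return NULL unless the tag matches. */
struct vlan_info *dev_vlan_info(const struct demo_device *dev)
{
    assert(dev != NULL);
    return dev->ext_type == DEV_EXT_VLAN ? (struct vlan_info *)dev->ext : NULL;
}

struct atm_pvc_info *dev_atm_pvc_info(const struct demo_device *dev)
{
    assert(dev != NULL);
    return dev->ext_type == DEV_EXT_ATM_PVC ? (struct atm_pvc_info *)dev->ext
                                            : NULL;
}
```

The accessors are the only place that casts the pointer, so a caller that asks for the wrong kind of data gets NULL instead of a misinterpreted structure. This only sketches the tagged-pointer variant; it does not address the suggestion to split the structure into separate substructures.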
- Werner -- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH werner.almesberger@ica.epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Tue Jun 13 13:59:42 2000 Received: by oss.sgi.com id ; Tue, 13 Jun 2000 13:59:22 -0700 Received: from lrcsun15.epfl.ch ([128.178.156.77]:20371 "EHLO lrcsun15.epfl.ch") by oss.sgi.com with ESMTP id ; Tue, 13 Jun 2000 13:58:53 -0700 Received: (from almesber@localhost) by lrcsun15.epfl.ch (8.8.X/EPFL-8.1a) id WAA27846 for netdev@oss.sgi.com; Tue, 13 Jun 2000 22:58:56 +0200 (MET DST) From: Werner Almesberger Message-Id: <200006132058.WAA27846@lrcsun15.epfl.ch> Subject: netfilter NAT vs. pump To: netdev@oss.sgi.com Date: Tue, 13 Jun 2000 22:58:55 +0200 (MET DST) X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing I've come across a rather nasty problem: if I configure my 2.4.0test1* kernel without "Full NAT", DHCP works as expected. If I add NAT support, DHCP fails miserably. Here are the gory details: Setup: - I have a box running 2.4.0test1-ac17 (also happens with older kernels, at least including test1-ac8) with RedHat 6.2 - this machine has two Ethernet interfaces, eth0 and eth1. - eth0 points to an internal network and has static IP address 10.0.0.1 - eth1 points to my cable modem and gets its IP address via DHCP - I have netfilter's "Full NAT" (CONFIG_IP_NF_NAT) enabled The problem: - pump (RedHat's DHCP client) fails to configure eth1 - I see DHCP requests on eth0, i.e. 
on the wrong interface I've chased this through the kernel, and the reasons for this are as follows: - whenever NAT is enabled, even if it's doing absolutely nothing, it sets NFC_ALTERED in skb->nfcache (net/ipv4/netfilter/ip_nat_standalone.c:ip_nat_fn) - this eventually leads to the packet being re-routed (linux/net/ipv4/ip_output.c:route_me_harder) - BEFORE the re-routing, the interface is eth1, as expected, destination IP is 255.255.255.255 (okay), and source IP is 10.0.0.1 (not nice, but probably doesn't really matter to DHCP) - AFTER re-routing, the interface becomes eth0, because the source IP was used to determine the interface. Very bad. - pump doesn't explicitly set the interface (it binds to { 0.0.0.0,68}), but eth1 gets picked for some lucky reason anyway - in the call to ip_route_output in net/ipv4/udp.c:udp_sendmsg, the source is 0.0.0.0, but afterwards, the 10.0.0.1 from rt->rt_src becomes the new source Fixing this is a little tricky, because it seems that several parties are at fault: 1) I'm not sure why ip_route_output returns such a strange interface/address combination (didn't look up the details since it happens to err on the convenient side) 2) as shown, route_me_harder doesn't necessarily perform the same operation as the original routing. This is a problem, because the system behaves differently with and without NAT support, even if nothing else changes. I see three possible approaches to correct this: - make it happen less often by setting NFC_ALTERED only when something has changed (probably a good idea in any case) - copy all the route selection subtleties (including multicast, etc.) 
into route_me_harder - declare any application that fails because of this as broken ;-) (probably a reasonable approach, although it breaks backward-compatibility) 3) pump really ought to be a little more explicit about that interface (BTW, dhcpcd apparently has the same problem, but I didn't examine that one any further) I've attached a patch to pump that seems to avoid the problem. At least it has obtained my IP address and kept it for the last half hour. Does anybody know who's taking care of pump ? I've only found the SRPM, and it is remarkably devoid of any hint to its origin :-( - Werner
---------------------------------- cut here -----------------------------------
--- pump-0.7.8-orig/dhcp.c Tue Feb 15 21:59:11 2000
+++ pump-0.7.8/dhcp.c Tue Jun 13 22:16:03 2000
@@ -940,6 +940,13 @@
     return s;
 }
 
+
+static int bind_to_device(int s,const char *itf)
+{
+    return setsockopt(s,SOL_SOCKET,SO_BINDTODEVICE,itf,strlen(itf));
+}
+
+
 int pumpDhcpRelease(struct pumpNetIntf * intf) {
     struct bootpRequest breq, bresp;
     unsigned char messageType;
@@ -957,6 +964,8 @@
 
     if ((s = createSocket()) < 0) return 1;
 
+    if (bind_to_device(s,intf->device) < 0) return 1;
+
     if ((chptr = prepareRequest(&breq, s, intf->device, pumpUptime()))) {
         close(s);
         while (1) {
@@ -1020,6 +1029,8 @@
 
     s = createSocket();
 
+    if (bind_to_device(s,intf->device) < 0) return 1;
+
     if ((chptr = prepareRequest(&breq, s, intf->device, pumpUptime()))) {
         close(s);
         while (1); /* problem */
@@ -1147,6 +1158,12 @@
         pumpDisableInterface(intf->device);
         close(s);
         return perrorstr("bind");
+    }
+
+    if (bind_to_device(s,intf->device) < 0) {
+        pumpDisableInterface(intf->device);
+        close(s);
+        return perrorstr("SO_BINDTODEVICE");
     }
 
     serverAddr.sin_family = AF_INET;
-- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH werner.almesberger@ica.epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Tue Jun 13 14:34:32
2000 Received: by oss.sgi.com id ; Tue, 13 Jun 2000 14:34:13 -0700 Received: from sirppi.helsinki.fi ([128.214.205.27]:35346 "EHLO sirppi.helsinki.fi") by oss.sgi.com with ESMTP id ; Tue, 13 Jun 2000 14:33:57 -0700 Received: from localhost (amlaukka@localhost) by sirppi.helsinki.fi (8.10.1/8.10.1) with ESMTP id e5DLXql15096; Wed, 14 Jun 2000 00:33:52 +0300 (EET DST) X-Authentication-Warning: sirppi.helsinki.fi: amlaukka owned process doing -bs Date: Wed, 14 Jun 2000 00:33:50 +0300 (EET DST) From: Aki M Laukkanen To: Werner Almesberger cc: netdev@oss.sgi.com Subject: Re: netfilter NAT vs. pump In-Reply-To: <200006132058.WAA27846@lrcsun15.epfl.ch> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Tue, 13 Jun 2000, Werner Almesberger wrote: > - pump doesn't explicitly set the interface (it binds to { 0.0.0.0,68}), > but eth1 gets picked for some lucky reason anyway ack. I have exactly the same problem. This far I've worked around it by bringing eth1 always up first. > 1) I'm not sure why ip_route_output returns such a strange > interface/address combination (didn't look up the details since it > happens to err on the convenient side) Didn't check either but think it's a list where you simply take the tail (or head). > - make it happen less often by setting NFC_ALTERED only when > something has changed (probably a good idea in any case) There seems to be a FIXME waiting for removal in ip_nat_fn(). Depends on a couple of things I didn't check but testing for info->num_manips != 0 might work. Will test tomorrow. > Does anybody know who's taking care of pump ? I've only found the SRPM, > and it is remarkably devoid of any hint to its origin :-( How about man pump? :) It's developed and maintained at RedHat internally. -- D. 
From owner-netdev@oss.sgi.com Tue Jun 13 20:30:15 2000 Received: by oss.sgi.com id ; Tue, 13 Jun 2000 20:30:06 -0700 Received: from wirespeed.solidum.com ([216.13.130.242]:61639 "EHLO solidum.com") by oss.sgi.com with ESMTP id ; Tue, 13 Jun 2000 20:29:47 -0700 Received: from phobos.solidum.com (mcr@phobos.solidum.com [192.168.1.13]) by solidum.com (8.8.7/8.8.7) with ESMTP id XAA26598 for ; Tue, 13 Jun 2000 23:29:38 -0400 Message-Id: <200006140329.XAA26598@solidum.com> To: netdev@oss.sgi.com Subject: Re: netfilter NAT vs. pump In-Reply-To: Your message of "Tue, 13 Jun 2000 22:58:55 +0200." <200006132058.WAA27846@lrcsun15.epfl.ch> Mime-Version: 1.0 (generated by tm-edit 7.108) Content-Type: text/plain; charset=US-ASCII Date: Tue, 13 Jun 2000 23:29:37 -0400 From: Michael Richardson Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> "Werner" == Werner Almesberger writes: Werner> re-routing, the interface is eth1, as expected, destination IP is Werner> 255.255.255.255 (okay), and source IP is 10.0.0.1 (not nice, but Werner> probably doesn't really matter to DHCP) - AFTER re-routing, the Yes, it does. When doing a request, you must use address 0.0.0.0 according to the spec. Werner> interface becomes eth0, because the source IP was used to Werner> determine the interface. Very bad. - pump doesn't explicitly set Werner> the interface (it binds to { 0.0.0.0,68}), but eth1 gets picked So, we found the culprit. Why not use ISC dhclient? Frankly, I can't see any reason why anyone would want anything else... Your other comments about problems are beyond my knowledge to comment on. 
:!mcr!: | Solidum Systems Corporation, http://www.solidum.com Michael Richardson |For a better connected world,where data flows faster Personal: http://www.sandelman.ottawa.on.ca/People/Michael_Richardson/Bio.html mailto:mcr@sandelman.ottawa.on.ca mailto:mcr@solidum.com From owner-netdev@oss.sgi.com Wed Jun 14 00:58:47 2000 Received: by oss.sgi.com id ; Wed, 14 Jun 2000 00:58:37 -0700 Received: from lrcsun15.epfl.ch ([128.178.156.77]:4799 "EHLO lrcsun15.epfl.ch") by oss.sgi.com with ESMTP id ; Wed, 14 Jun 2000 00:58:09 -0700 Received: (from almesber@localhost) by lrcsun15.epfl.ch (8.8.X/EPFL-8.1a) id JAA12654; Wed, 14 Jun 2000 09:58:00 +0200 (MET DST) From: Werner Almesberger Message-Id: <200006140758.JAA12654@lrcsun15.epfl.ch> Subject: Re: netfilter NAT vs. pump To: mcr@solidum.com (Michael Richardson) Date: Wed, 14 Jun 2000 09:58:00 +0200 (MET DST) Cc: netdev@oss.sgi.com In-Reply-To: <200006140329.XAA26598@solidum.com> from "Michael Richardson" at Jun 13, 2000 11:29:37 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Michael Richardson wrote: > Yes, it does. When doing a request, you must use address 0.0.0.0 according > to the spec. Okay, then it's probably impossible to get correct behaviour with the current design of pump if there's already another configured interface. > So, we found the culprit. Why not use ISC dhclient? Frankly, I can't > see any reason why anyone would want anything else... I've just tested the Hariguchi/Viznyuk dhcpcd a bit more systematically, and contrary to what I thought first, it's working correctly (it uses PF_PACKET sockets). Apologies to the authors. However, there's still some problem if I try to use the -s option. BTW, my patch also has the problem that it makes pump fail with kernels that don't have support for SO_BINDTODEVICE enabled. 
While one could just ignore the return code, this is ugly. Also, it may be possible to have a kernel with the NAT problem but no SO_BINDTODEVICE, so there's again no way to get it to work. So I guess I agree - time to dump pump. - Werner -- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH werner.almesberger@ica.epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Wed Jun 14 08:53:38 2000 Received: by oss.sgi.com id ; Wed, 14 Jun 2000 08:53:28 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:28423 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Wed, 14 Jun 2000 08:53:10 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id TAA20025; Wed, 14 Jun 2000 19:51:55 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006141551.TAA20025@ms2.inr.ac.ru> Subject: Re: netfilter NAT vs. pump To: almesber@lrc.epfl.CH (Werner Almesberger) Date: Wed, 14 Jun 2000 19:51:55 +0400 (MSK DST) Cc: netdev@oss.sgi.com In-Reply-To: <200006132058.WAA27846@lrcsun15.epfl.ch> from "Werner Almesberger" at Jun 14, 0 01:13:34 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 746 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > - BEFORE the re-routing, the interface is eth1, as expected, destination > IP is 255.255.255.255 (okay), and source IP is 10.0.0.1 (not nice, but > probably doesn't really matter to DHCP) Of course, packets with zero address cannot be sent out. If udp source is zero, it means only that kernel should select some valid address itself. > - pump doesn't explicitly set the interface (it binds to { 0.0.0.0,68}), > but eth1 gets picked for some lucky reason anyway Apparently, the interface is selected to eth1 because of another bug. I think, pump crapped routing tables as well. It is rather "unlucky reason". 
8) Probably, this client looks like working on single interface machine but only for (un)lucky reason. 8) Alexey From owner-netdev@oss.sgi.com Wed Jun 14 08:55:58 2000 Received: by oss.sgi.com id ; Wed, 14 Jun 2000 08:55:48 -0700 Received: from wirespeed.solidum.com ([216.13.130.242]:37603 "EHLO solidum.com") by oss.sgi.com with ESMTP id ; Wed, 14 Jun 2000 08:55:36 -0700 Received: from phobos.solidum.com (mcr@phobos.solidum.com [192.168.1.13]) by solidum.com (8.8.7/8.8.7) with ESMTP id LAA00172 for ; Wed, 14 Jun 2000 11:55:34 -0400 Message-Id: <200006141555.LAA00172@solidum.com> To: netdev@oss.sgi.com Subject: Re: netfilter NAT vs. pump In-Reply-To: Your message of "Wed, 14 Jun 2000 09:58:00 +0200." <200006140758.JAA12654@lrcsun15.epfl.ch> Mime-Version: 1.0 (generated by tm-edit 7.108) Content-Type: text/plain; charset=US-ASCII Date: Wed, 14 Jun 2000 11:55:34 -0400 From: Michael Richardson Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> "Werner" == Werner Almesberger writes: Werner> Michael Richardson wrote: >> Yes, it does. When doing a request, you must use address 0.0.0.0 >> according to the spec. Werner> Okay, then it's probably impossible to get correct behaviour with Werner> the current design of pump if there's already another configured Werner> interface. There is a reason that ISC dhclient uses raw sockets to send the requests, and BPF to get the replies :-) Werner> So I guess I agree - time to dump pump. Pump dump lump! 
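To make the 0.0.0.0 source address point concrete: a client that builds its own IP header (for instance for a raw or PF_PACKET socket) can leave the source field zero and address the packet to 255.255.255.255, independent of what the routing code would pick. The C sketch below only fills in and checksums a 20-byte IPv4 header; it opens no socket and sends nothing. The function names are hypothetical, and the TTL of 64 is an arbitrary choice for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Internet checksum (one's complement sum of 16-bit big-endian words). */
uint16_t inet_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((buf[i] << 8) | buf[i + 1]);
    if (len & 1)                       /* odd trailing byte, padded with zero */
        sum += (uint32_t)(buf[len - 1] << 8);
    while (sum >> 16)                  /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Fill a minimal IPv4 header for a broadcast from an unconfigured client. */
void build_dhcp_ip_header(uint8_t hdr[20], uint16_t payload_len)
{
    uint16_t total, csum;

    assert(payload_len <= 0xffff - 20);
    total = (uint16_t)(20 + payload_len);

    memset(hdr, 0, 20);
    hdr[0] = 0x45;                     /* version 4, header length 5 words */
    hdr[2] = (uint8_t)(total >> 8);    /* total length, network byte order */
    hdr[3] = (uint8_t)(total & 0xff);
    hdr[8] = 64;                       /* TTL (arbitrary for this sketch)  */
    hdr[9] = 17;                       /* protocol: UDP                    */
    /* hdr[12..15]: source address stays 0.0.0.0 */
    memset(hdr + 16, 0xff, 4);         /* destination 255.255.255.255      */

    csum = inet_checksum(hdr, 20);     /* checksum field is still zero here */
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xff);
}
```

Running inet_checksum() over the finished header yields 0, which is the usual way a receiver validates it. The UDP header (ports 68 to 67) and the BOOTP payload would follow these 20 bytes and are left out here.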
:!mcr!: | Solidum Systems Corporation, http://www.solidum.com Michael Richardson |For a better connected world,where data flows faster Personal: http://www.sandelman.ottawa.on.ca/People/Michael_Richardson/Bio.html mailto:mcr@sandelman.ottawa.on.ca mailto:mcr@solidum.com From owner-netdev@oss.sgi.com Wed Jun 14 16:53:09 2000 Received: by oss.sgi.com id ; Wed, 14 Jun 2000 16:52:50 -0700 Received: from ertpg14e1.nortelnetworks.com ([47.234.0.35]:48366 "EHLO ertpg14e1.nortelnetworks.com") by oss.sgi.com with ESMTP id ; Wed, 14 Jun 2000 16:52:29 -0700 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by ertpg14e1.nortelnetworks.com; Wed, 14 Jun 2000 19:45:17 -0400 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id MS0LA7GT; Thu, 15 Jun 2000 07:45:12 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id MNGMW8BJ; Thu, 15 Jun 2000 09:45:11 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id JAA17946; Thu, 15 Jun 2000 09:45:02 +1000 Message-ID: <394818FD.70258D1B@uow.edu.au> Date: Wed, 14 Jun 2000 23:45:01 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Werner Almesberger CC: netdev@oss.sgi.com, Harm Verhagen Subject: Re: netfilter NAT vs. pump References: <200006132058.WAA27846@lrcsun15.epfl.ch> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Werner Almesberger wrote: > > I've come across a rather nasty problem: if I configure my 2.4.0test1* > kernel without "Full NAT", DHCP works as expected. 
If I add NAT support, > DHCP fails miserably. Here are the gory details: > Werner, Harm Verhagen is running dhcpd and pump on a single gateway machine (cable modem, I think). Since upgrading from 2.2 he has been unable to obtain a DHCP lease from the modem (eth0) when the internal interface (eth1) is up. I forwarded your patch along and it fixed the problem. He's using pump-0.7.9-1. Harm can provide more details if required. From owner-netdev@oss.sgi.com Wed Jun 14 18:42:31 2000 Received: by oss.sgi.com id ; Wed, 14 Jun 2000 18:42:11 -0700 Received: from smtprch2.nortelnetworks.com ([192.135.215.15]:15090 "EHLO smtprch2.nortel.com") by oss.sgi.com with ESMTP id ; Wed, 14 Jun 2000 18:41:47 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch2.nortel.com; Wed, 14 Jun 2000 20:38:16 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id MPD1W92W; Wed, 14 Jun 2000 20:41:18 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id MNGMW827; Thu, 15 Jun 2000 11:41:16 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id LAA18819; Thu, 15 Jun 2000 11:40:54 +1000 Message-ID: <39483426.B07E4BA@uow.edu.au> Date: Thu, 15 Jun 2000 01:40:54 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton Reply-To: netdev X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Bogdan Costescu CC: vortex , eepro100 , tulip , netdev Subject: Re: [eepro100] Re: True on TRANSMIT ERROR TIMEOUT References: <39461BB3.299DD807@uow.edu.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing 
[ I really wish I hadn't done that reply-to-all on Tuesday. Can we please henceforth keep this on netdev? ] Bogdan Costescu wrote: > > On Tue, 13 Jun 2000, Andrew Morton wrote: > > > And there are many, many times when a wedged driver can be resurrected > > by a down/up or a rmmod/insmod. This means that the driver _could_ have > > automtically recovered in tx_timeout, but it simply did not do so. > > I beg to disagree. The above mentioned operations are doing much more than > handling TX timeouts: register/unregister IRQ, get/release memory, set up > (from scratch) the Tx and Rx buffers, media selection... The tx_timeout > routine should only recover from a Tx timeout! What you propose is > something like calling xxx_open() from tx_timeout... if I understand it > right. Well, I didn't say "let's put in lots of bugs" :) My point is very simple: - Drivers and/or NICs are hanging - The hangs are fixed by down+up or rmmod+insmod Hence, the hangs _could_ be unhung by appropriate action in the tx timeout! This would be a great step forward. A sub-second hiccup and a few dropped packets versus a complete system outage. It's not pretty, it shouldn't be necessary, but geeze, it's better than what we have now. Without access to the ASIC designers and everything else, we may never resolve these problems. And can we be sure that the closed-source vendor-developed drivers don't _currently_ reset the crap out of the NIC when it goes unresponsive? > > Can anyone suggest a reason why we _shouldn't_ simply reset the NIC to > > the utmost possible extent in tx_timeout? Restart media selection, > > reinitialise ring buffers, etc, etc? > > Simply because you need to know the exact state of the card in order to > save the relevant parts of it, reset and then load the card with > the previous values (of course, you don't need to re-load the parts > which created the problem). mm. This is a matter of keeping appropriate state in the dev private space. 
True, the effects of 'mii-diag -F -A' will be lost, but they'll also be lost across an up+down or a reboot, no? > Think of the 3c59x driver: you might want to > save the Tx/Rx threshold related values, poll interval values and so on. > This is only efficient if you keep most of them to the default (as > power-on) values. > For media selection, xxx_timer should take care of this, there should be > no need for tx_timeout to handle media changes. When the transmission is > stopped because of a media change, the xxx_timer routine should take care > of the media state, then tx_timeout routine should reset the transmitter > and everything should be working again. Also adding media related logic to > tx_timeout raises the problem of protecting the access to the media > related registers WRT xxx_timer... These failures appear to correlate with high traffic which, I suggest, points the finger at driver/hardware bugs rather than media selection probs. > I don't think that flushing the queue (as somebody else from this thread > suggested) is a good ideea as you loose a lot of packets (usually 16). A typical tx_timeout routine will at present cause up to 16 packets to be retransmitted (duplicated). Packet loss is not a problem (particularly when compared with loss of an interface). > And > usually the buffers themselves have no relation with the > transmission/reception logic - you might need to restart the transmitter > and/or the DMA engine and maybe write again the head buffer to it - > that's all. > > Now, coming back to the initial message about "correlated" reports of Tx > timeouts using some boards/drivers, I don't think this is true. Most of > the problems are in fact related to media selection (when autonegotiation > fails and the boards cannot transmit properly) and real timeouts (e.g. > caused by collisions) that are board specific. 
In this case, there is no > general rule that has to be applied - the hardware has to be driven "The > Right Way" (which might not be even documented). Perhaps an appropriate detection algorithm is to do a radical reset & reinitialisation if two tx timeouts occur with no intervening tx interrupts. Have you played with the tx timeout interval? Even with 14 packet tx interrupt mitigation, it's very, very hard to force a tx timeout on a 10baseT LAN with the timeout set to just 20 milliseconds. We're using 400! I dunno. I just get the impression from the mailing lists that the network drivers are a weak spot for Linux, and we're letting the whole game down. Help me out here... Let me quote "Brian" from last week: ``The web servers (~10Mb/s) last about 3 days on . With all the Linux zealots talking about taking on Sun and the "enterprise," I'd drop dead laughing if they weren't my servers.'' Ouch. From owner-netdev@oss.sgi.com Wed Jun 14 18:47:21 2000 Received: by oss.sgi.com id ; Wed, 14 Jun 2000 18:47:11 -0700 Received: from emerald.alltel.com ([198.133.100.6]:21261 "EHLO emerald.alltel.com") by oss.sgi.com with ESMTP id ; Wed, 14 Jun 2000 18:46:54 -0700 Received: from sbridge.lit.alltel.com ([10.33.2.48]) by emerald.alltel.com with MailMarshal (3,1,1,0) id ; Wed, 14 Jun 2000 20:46:46 -500 Date: Wed, 14 Jun 2000 20:46 -0500 From: "POSTMASTER" To: "netdev@oss.sgi.com" Subject: Unsent Message Returned to Sender Message-ID: <20000614204645971-31fa7539@alltel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Notice to Sender ================ This message was received by this installation but could not be delivered to its intended cc:Mail recipient(s).
Original subject: Re: [eepro100] Re: True on TRANSMIT ERROR TIMEOUT Intended recipient(s) who DID NOT receive this message: donavan.nelson..@alltel.com The following cc:Mail error(s) were recorded: *** Message recipient is unknown ***
From owner-netdev@oss.sgi.com Wed Jun 14 20:43:22 2000 Received: by oss.sgi.com id ; Wed, 14 Jun 2000 20:43:12 -0700 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:18437 "EHLO grok.yi.org") by oss.sgi.com with ESMTP id ; Wed, 14 Jun 2000 20:42:55 -0700 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.yi.org (8.9.3/8.9.3) with ESMTP id VAA07605 for ; Wed, 14 Jun 2000 21:20:19 -0700 Message-ID: <39485983.5144B6FF@candelatech.com> Date: Wed, 14 Jun 2000 21:20:19 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.14-5.0 i586) X-Accept-Language: en MIME-Version: 1.0 To: netdev Subject: Re: [eepro100] Re: True on TRANSMIT ERROR TIMEOUT (Tulip too) References: <39461BB3.299DD807@uow.edu.au> <39483426.B07E4BA@uow.edu.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andrew Morton wrote: > Well, I didn't say "let's put in lots of bugs" :) > > My point is very simple: > > - Drivers and/or NICs are hanging > - The hangs are fixed by down+up or rmmod+insmod > > Hence, the hangs _could_ be unhung by appropriate action > in the tx timeout! > > This would be a great step forward. A sub-second hiccup and > a few dropped packets versus a complete system outage. > > It's not pretty, it shouldn't be necessary, but geeze, it's > better than what we have now. Without access to the ASIC > designers and everything else, we may never resolve these > problems. And can we be sure that the closed-source > vendor-developed drivers don't _currently_ reset the > crap out of the NIC when it goes unresponsive? I'll be happy to test anything you can think of to fix my problem with ZNYX four-ports / Tulip hangs.
Right now I'm thinking about basically tailing the /var/messages and using ifconfig when I see the errors...it should be obvious that there are much better ways to do this! I believe it's Tulip driver related too, because I can also hang another tulip based card I have (the chip is different, if tulip compatible)... > These failures appear to correlate with high traffic which, > I suggest, points the finger at driver/hardware bugs rather > than media selection probs. I think that's true..I'm running the cards back to back across a cross-over cable. I can't imagine why they would be having probe problems only after I crank up the volume of traffic. > > > I don't think that flushing the queue (as somebody else from this thread > > suggested) is a good idea as you lose a lot of packets (usually 16). I can easily lose millions with the current state of affairs...I'd love to only lose 16, especially if they are recorded (tx_fifo or something). > Have you played with the tx timeout interval? Even with > 14 packet tx interrupt mitigation, it's very, very hard > to force a tx timeout on a 10baseT LAN with the timeout > set to just 20 milliseconds. We're using 400! I always thought they were rock solid, until I actually started beating on them unmercifully! The press has generally been kind to Linux as a server machine...it'd suck to blow that now!
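For a rough sense of scale behind the timeout question quoted above: assuming "14 packet tx interrupt mitigation" means that up to 14 frames may be queued before a tx-done interrupt fires, the wire time for such a batch of maximum-size frames on 10 Mb/s Ethernet works out to roughly 17.2 ms. That is consistent with a 20 ms timeout being only barely reachable, and with 400 ms being generous. A minimal sketch of the arithmetic follows; the function name and constants are illustrative and not taken from any driver.

```c
#include <stdint.h>

/* Standard Ethernet framing overhead on the wire, in bytes. */
#define ETH_PREAMBLE_SFD 8   /* preamble plus start-of-frame delimiter */
#define ETH_IFG          12  /* minimum inter-frame gap */

/*
 * Hypothetical helper: nanoseconds of wire time needed to send `frames`
 * back-to-back frames of `frame_bytes` each (header and FCS included)
 * on a link running at `link_mbps` megabits per second.
 */
uint64_t batch_wire_time_ns(unsigned frames, unsigned frame_bytes,
                            unsigned link_mbps)
{
    uint64_t bits_per_frame =
        (uint64_t)(frame_bytes + ETH_PREAMBLE_SFD + ETH_IFG) * 8;

    /* At 1 Mb/s one bit takes 1000 ns, so scale by 1000 / link_mbps. */
    return (uint64_t)frames * bits_per_frame * 1000 / link_mbps;
}
```

With these assumptions, batch_wire_time_ns(14, 1518, 10) comes to 17225600 ns, a little over 17 ms.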
Ben -- Ben Greear (greearb@candelatech.com) http://www.candelatech.com Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Wed Jun 14 21:24:33 2000 Received: by oss.sgi.com id ; Wed, 14 Jun 2000 21:24:22 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:9856 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Wed, 14 Jun 2000 21:24:06 -0700 Received: (qmail 3880 invoked by uid 577); 15 Jun 2000 04:24:01 -0000 Message-ID: <20000615122401.B3638@saw.sw.com.sg> Date: Thu, 15 Jun 2000 12:24:01 +0800 From: Andrey Savochkin To: Andrew Morton Cc: vortex , eepro100 , tulip , netdev , Bogdan Costescu Subject: Re: True on TRANSMIT ERROR TIMEOUT References: <39461BB3.299DD807@uow.edu.au> <39483426.B07E4BA@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: <39483426.B07E4BA@uow.edu.au>; from "Andrew Morton" on Thu, Jun 15, 2000 at 01:40:54AM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, On Thu, Jun 15, 2000 at 01:40:54AM +0000, Andrew Morton wrote: > Well, I didn't say "let's put in lots of bugs" :) > > My point is very simple: > > - Drivers and/or NICs are hanging > - The hangs are fixed by down+up or rmmod+insmod > > Hence, the hangs _could_ be unhung by appropriate action > in the tx timeout! > > This would be a great step forward. A sub-second hiccup and > a few dropped packets versus a complete system outage. The main problem is not the action in timeout routine. The problem is that these routines should be extensively debugged by the authors/maintainers of all drivers. TX timeout routine catches cases that shouldn't happen in real life. It's redundant code, and it's a pity that it's called so often. It depends on the hardware what actions should be taken in these "impossible" cases.
Speaking about eepro100, I initially thought that the restart of the transmitter unit is meaningful and sufficient. When I started to debug the code, artificially trying to cause TX timeouts, I found that it's not true. The hardware works in a way that receiver problems lead to a TX unit stall after a short time. I personally consider it a hardware bug, but I have to cope with it. Currently, TX timeout routine does full reset, just like for interface down and up. If the current timeout handler fails in its mission, I'm fixing it. I just need users' patience and help. I have only one head and two hands to clatter on the keyboard, and I can't fix in a flash. Andrew, you should consider what is appropriate for your driver, based on users' reports. Other drivers are only hints on what may be done. Best regards Andrey From owner-netdev@oss.sgi.com Wed Jun 14 22:02:12 2000 Received: by oss.sgi.com id ; Wed, 14 Jun 2000 22:02:03 -0700 Received: from lrcsun15.epfl.ch ([128.178.156.77]:9205 "EHLO lrcsun15.epfl.ch") by oss.sgi.com with ESMTP id ; Wed, 14 Jun 2000 22:01:36 -0700 Received: (from almesber@localhost) by lrcsun15.epfl.ch (8.8.X/EPFL-8.1a) id HAA23372; Thu, 15 Jun 2000 07:01:25 +0200 (MET DST) From: Werner Almesberger Message-Id: <200006150501.HAA23372@lrcsun15.epfl.ch> Subject: Re: netfilter NAT vs. pump To: andrewm@uow.edu.au (Andrew Morton) Date: Thu, 15 Jun 2000 07:01:25 +0200 (MET DST) Cc: netdev@oss.sgi.com, h.verhagen@chello.nl In-Reply-To: <394818FD.70258D1B@uow.edu.au> from "Andrew Morton" at Jun 14, 2000 11:45:01 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andrew Morton wrote: > I forwarded your patch along and it fixed the problem. He's using > pump-0.7.9-1. Good !
:-) Actually, the further discussion in netdev convinced me that pump is broken by design and probably not worth fixing, given the availability of equivalent correct implementations. My /sbin/pump is now:

---------------------------------- cut here -----------------------------------
#!/bin/sh
opt=-R
do=exec
itf=
host=
remove=false
while [ ! -z "$*" ]; do
    case "$1" in
        -i) shift
            itf=$1;;
        -h) shift
            host="-h $1";;
        -r) remove=true;;
        *)  echo "Unrecognized option $1" 1>&2
            exit 1;;
    esac
    shift
done
if $remove; then
    # $do /sbin/dhcpcd $opt -k $itf
    killall -TERM dhcpcd
else
    $do /sbin/dhcpcd $opt $host $itf
fi
---------------------------------- cut here -----------------------------------

Differences to the "real" pump:
- doesn't know about config files
- only supports the options used by the RH 6.2 ifup/ifdown scripts
- change the opt= line if you want it to update /etc/resolv.conf
- change the remove action (if $remove; then ...) if you want it to release the IP address on interface shutdown (I want to keep mine)

Other people have suggested to use dhclient, which seems to do the right thing too.
- Werner -- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH werner.almesberger@ica.epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Thu Jun 15 00:05:03 2000 Received: by oss.sgi.com id ; Thu, 15 Jun 2000 00:04:43 -0700 Received: from ertpg14e1.nortelnetworks.com ([47.234.0.35]:11728 "EHLO ertpg14e1.nortelnetworks.com") by oss.sgi.com with ESMTP id ; Thu, 15 Jun 2000 00:04:23 -0700 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by ertpg14e1.nortelnetworks.com; Thu, 15 Jun 2000 03:02:12 -0400 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id MS0LA93L; Thu, 15 Jun 2000 15:02:08 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id MNGMW801; Thu, 15 Jun 2000 17:02:07 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id RAA20973; Thu, 15 Jun 2000 17:01:48 +1000 Message-ID: <39487F5B.D26F4AFF@uow.edu.au> Date: Thu, 15 Jun 2000 07:01:47 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Andrey Savochkin CC: netdev Subject: Re: True on TRANSMIT ERROR TIMEOUT References: <39461BB3.299DD807@uow.edu.au> <39483426.B07E4BA@uow.edu.au>, <39483426.B07E4BA@uow.edu.au>; from "Andrew Morton" on Thu, Jun 15, 2000 at 01:40:54AM <20000615122401.B3638@saw.sw.com.sg> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andrey Savochkin wrote: > > TX timeout routine catches cases that shouldn't happen 
in real life. That's what I said :) And some time ago Donald has pointed at a history of people putting in bandaids and claiming to have "fixed the problem" when they really have not. We would be very bad to do that. But there is real value in just keeping the damn things working. Perhaps 'stable' kernels should have the bandaids, and development kernels shouldn't. From owner-netdev@oss.sgi.com Thu Jun 15 07:58:29 2000 Received: by oss.sgi.com id ; Thu, 15 Jun 2000 07:58:19 -0700 Received: from gb.bnet.pl ([212.160.188.33]:4591 "HELO nic.nigdzie") by oss.sgi.com with SMTP id ; Thu, 15 Jun 2000 05:52:07 -0700 Received: (qmail 21984 invoked by uid 500); 15 Jun 2000 12:51:13 -0000 Date: Thu, 15 Jun 2000 14:51:13 +0200 From: Jacek Konieczny To: netdev@oss.sgi.com Subject: IPv6 address autoconfiguration fix Message-ID: <20000615145113.A21969@nic.nigdzie> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.1.11i Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing This patch fixes a problem with the all-nodes (ff02::1) multicast address disappearing after setting an interface down and up. When the all-nodes address is not present, IPv6 address autoconfiguration doesn't work. Greets, Jacek Konieczny PS. I have already sent it to Alexey Kuznetsov, but a friend of mine asked me to send it here too. Is it possible for the fix to be included in the 2.2.17 kernel?
diff -durN linux-2.2.16.orig/net/ipv6/addrconf.c linux/net/ipv6/addrconf.c
--- linux-2.2.16.orig/net/ipv6/addrconf.c	Tue Jan 4 19:12:27 2000
+++ linux/net/ipv6/addrconf.c	Fri Jun 9 20:37:48 2000
@@ -255,8 +255,6 @@
 		idev = ipv6_add_dev(dev);
 		if (idev == NULL)
 			return NULL;
-		if (dev->flags&IFF_UP)
-			ipv6_mc_up(idev);
 	}
 	return idev;
 }
@@ -1045,6 +1043,8 @@
 		return;
 	}
 
+	ipv6_mc_up(idev);
+
 	addrconf_lock();
 	ifp = ipv6_add_addr(idev, &addr, IFA_HOST);
 
@@ -1084,6 +1084,8 @@
 	if (idev == NULL)
 		return;
 
+	ipv6_mc_up(idev);
+
 #ifdef CONFIG_IPV6_EUI64
 	memset(&addr, 0, sizeof(struct in6_addr));
 
@@ -1120,6 +1122,8 @@
 		printk(KERN_DEBUG "init sit: add_dev failed\n");
 		return;
 	}
+
+	ipv6_mc_up(idev);
 
 	sit_add_v4_addrs(idev);
 
From owner-netdev@oss.sgi.com Thu Jun 15 07:58:29 2000 Received: by oss.sgi.com id ; Thu, 15 Jun 2000 07:58:20 -0700 Received: from panic.ohr.gatech.edu ([130.207.47.194]:41486 "EHLO havoc.gtf.org") by oss.sgi.com with ESMTP id ; Thu, 15 Jun 2000 06:12:12 -0700 Received: from mandrakesoft.com (adsl-77-228-135.atl.bellsouth.net [216.77.228.135]) by havoc.gtf.org (8.9.3/8.9.3) with ESMTP id JAA01247 for ; Thu, 15 Jun 2000 09:11:30 -0400 Message-ID: <3948D602.CA56CF00@mandrakesoft.com> Date: Thu, 15 Jun 2000 09:11:30 -0400 From: Jeff Garzik Organization: MandrakeSoft X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.4.0-test1-ac18 i686) X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com Subject: Tulip and IP aliasing References: <200006151255.QAA29302@ms2.inr.ac.ru> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing I got a report saying that Tulip was slow when IP aliasing for a large number of hosts was in effect. This report was from a guy running a Web server which binds to a ton of IP aliases in order to do its virtual hosts. Talking with jamal on IRC he mentioned that IP aliases matching occurs through a linear search.
Can someone point me to this code? I grepped for CONFIG_IP_ALIAS in 2.4.x but that seemed only to encompass very small bits of code, and not the actual IP matching code I am searching for. Regards, Jeff -- Jeff Garzik | Building 1024 | Free beer tomorrow. MandrakeSoft, Inc. | From owner-netdev@oss.sgi.com Thu Jun 15 07:58:29 2000 Received: by oss.sgi.com id ; Thu, 15 Jun 2000 07:58:20 -0700 Received: from smtp.arrakis.es ([212.59.199.83]:22156 "EHLO ssmtp01.melange.isp") by oss.sgi.com with ESMTP id ; Thu, 15 Jun 2000 06:04:58 -0700 Received: from ssmtp01.melange.isp ([127.0.0.1]) by ssmtp01.melange.isp (Netscape Messaging Server 4.15) with SMTP id FW75IT04.U1U for ; Thu, 15 Jun 2000 15:01:41 +0200 Received: from oss.sgi.com ([216.32.174.190]) by ssmtp02.melange.isp (Netscape Messaging Server 4.15) with ESMTP id FW3KWY05.O14 for ; Tue, 13 Jun 2000 16:43:46 +0200 Received: by oss.sgi.com id ; Tue, 13 Jun 2000 06:46:14 -0700 Received: from lrcsun15.epfl.ch ([128.178.156.77]:44269 "EHLO lrcsun15.epfl.ch") by oss.sgi.com with ESMTP id ; Tue, 13 Jun 2000 06:45:55 -0700 Received: (from almesber@localhost) by lrcsun15.epfl.ch (8.8.X/EPFL-8.1a) id QAA18667; Tue, 13 Jun 2000 16:45:52 +0200 (MET DST) From: Werner Almesberger Message-Id: <200006131445.QAA18667@lrcsun15.epfl.ch> Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? To: greearb@candelatech.com (Ben Greear) Date: Tue, 13 Jun 2000 16:45:52 +0200 (MET DST) Cc: netdev@oss.sgi.com In-Reply-To: <39464F01.2206643E@candelatech.com> from "Ben Greear" at Jun 13, 2000 08:10:57 AM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Ben Greear wrote: > Yep, there is a comment in the code, probably from 1.x talking about how it > should be cleaned up in the 'next' release...seems that comment is still > valid :) Oh, a _lot_ got cleaned up in 2.3 ... 
;-) > It would be a perfect place for OO and inheritance, The "all in one" type of object you're describing would be rather ugly, IMHO. Better to remove those elements that don't have a 1:1 relationship and put them elsewhere. E.g. neighbours are already elsewhere. Local addresses (L3 and maybe also L2) should probably also go elsewhere. Then qdiscs (think multilink PPP, ATM, or FR SVCs), etc. Probably the best approach to clean up the design would be to split the net_device structure into as many substructures as possible, create new name spaces, where applicable, and then to combine those things that always or usually occur together. - Werner -- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH werner.almesberger@ica.epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Thu Jun 15 07:58:29 2000 Received: by oss.sgi.com id ; Thu, 15 Jun 2000 07:58:19 -0700 Received: from mail.cyberus.ca ([209.195.95.1]:19957 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Thu, 15 Jun 2000 04:24:47 -0700 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id HAA00551; Thu, 15 Jun 2000 07:24:16 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id HAA10050; Thu, 15 Jun 2000 07:24:15 -0400 (EDT) Date: Thu, 15 Jun 2000 07:24:15 -0400 (EDT) From: jamal To: Werner Almesberger cc: Ben Greear , netdev@oss.sgi.com Subject: Re: 802.1q Was (Re: Plans for 2.5 / 2.6 ??? 
In-Reply-To: <200006131445.QAA18667@lrcsun15.epfl.ch> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Tue, 13 Jun 2000, Werner Almesberger wrote: > Ben Greear wrote: > > Yep, there is a comment in the code, probably from 1.x talking about how it > > should be cleaned up in the 'next' release...seems that comment is still > > valid :) > > Oh, a _lot_ got cleaned up in 2.3 ... ;-) > > > It would be a perfect place for OO and inheritance, > > The "all in one" type of object you're describing would be rather ugly, > IMHO. Better to remove those elements that don't have a 1:1 relationship > and put them elsewhere. E.g. neighbours are already elsewhere. Local > addresses (L3 and maybe also L2) should probably also go elsewhere. > Then qdiscs (think multilink PPP, ATM, or FR SVCs), etc. > > Probably the best approach to clean up the design would be to split the > net_device structure into as many substructures as possible, create new > name spaces, where applicable, and then to combine those things that > always or usually occur together. So you provide ability to do name spaces for the substructures/subsystems as well? This way maybe you can have generic and extensible user space tool(s) modelled after the device and its 'subsystems' The neighbor, for example, is a device 'packet mungler' -- or that is what i claim; it maintains its own table(s), search algorithms as well as methods. So the VLAN or MPLS etc 'subsystem' could be along the same line. There should be a generic way for the subsystems to interact and communicate, e.g. a vlan updating the ARP table or looking up a route.
cheers, jamal From owner-netdev@oss.sgi.com Thu Jun 15 07:58:39 2000 Received: by oss.sgi.com id ; Thu, 15 Jun 2000 07:58:20 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:640 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Thu, 15 Jun 2000 03:18:07 -0700 Received: (qmail 397 invoked by uid 577); 15 Jun 2000 10:17:30 -0000 Message-ID: <20000615181730.B350@saw.sw.com.sg> Date: Thu, 15 Jun 2000 18:17:30 +0800 From: Andrey Savochkin To: Andrew Morton Cc: netdev Subject: Re: True on TRANSMIT ERROR TIMEOUT References: <39461BB3.299DD807@uow.edu.au> <39483426.B07E4BA@uow.edu.au>, <39483426.B07E4BA@uow.edu.au>; <20000615122401.B3638@saw.sw.com.sg> <39487F5B.D26F4AFF@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: <39487F5B.D26F4AFF@uow.edu.au>; from "Andrew Morton" on Thu, Jun 15, 2000 at 07:01:47AM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Thu, Jun 15, 2000 at 07:01:47AM +0000, Andrew Morton wrote: > And some time ago Donald has pointed at a history of people putting in > bandaids and claiming to have "fixed the problem" when they really have > not. Do not take it too seriously... :-) > We would be very bad to do that. > > But there is real value in just keeping the damn things working. > > Perhaps 'stable' kernels should have the bandaids, and development > kernels shouldn't. If timeout routine is written, it should 1) work 2) be included in the driver independently of the kernel status. Developers need user's reports, but it doesn't mean that the driver shouldn't do its best to work even in case of problems. You'll get reports anyway. I just mentioned the "bandaid" status of timeout routine to explain that this topic is rather controversial. 
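The detection rule Andrew floated earlier in the thread (a radical reset and reinitialisation if two tx timeouts occur with no intervening tx interrupt) could be modelled along these lines. This is a userspace sketch under stated assumptions, not code from any driver: struct tx_recovery, tx_recovery_note_irq() and tx_recovery_on_timeout() are hypothetical names, and the lighter first-stage action (restart only the transmitter) is an assumption borrowed from Bogdan's description of what tx_timeout should do.

```c
/* Hypothetical per-device recovery state; not taken from any real driver. */
struct tx_recovery {
    unsigned timeouts_since_tx_irq; /* tx timeouts seen since the last tx interrupt */
};

/* What the timeout path should do next. */
enum tx_action {
    TX_RESTART_TRANSMITTER, /* light recovery: kick the transmitter only */
    TX_FULL_RESET           /* heavy recovery: equivalent of down + up */
};

/* Call from the tx interrupt path: the hardware made progress. */
void tx_recovery_note_irq(struct tx_recovery *r)
{
    r->timeouts_since_tx_irq = 0;
}

/* Call from the tx timeout path; returns the action to take. */
enum tx_action tx_recovery_on_timeout(struct tx_recovery *r)
{
    r->timeouts_since_tx_irq++;
    if (r->timeouts_since_tx_irq >= 2) {
        /* Second timeout in a row with no tx interrupt: escalate. */
        r->timeouts_since_tx_irq = 0;
        return TX_FULL_RESET;
    }
    return TX_RESTART_TRANSMITTER;
}
```

The point of keeping the counter separate from the reset code is that the policy can be exercised without hardware; what a "full reset" actually involves remains specific to each board.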
Best regards Andrey From owner-netdev@oss.sgi.com Thu Jun 15 14:32:13 2000 Received: by oss.sgi.com id ; Thu, 15 Jun 2000 14:32:03 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:45970 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Thu, 15 Jun 2000 14:31:34 -0700 Received: from fred.muc.de (none@ns1067.munich.netsurf.de [195.180.235.67]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id XAA12341; Thu, 15 Jun 2000 23:31:30 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 132hFJ-0000g4-00; Thu, 15 Jun 2000 23:32:17 +0200 Date: Thu, 15 Jun 2000 23:32:17 +0200 From: Andi Kleen To: Jeff Garzik Cc: netdev@oss.sgi.com Subject: Re: Tulip and IP aliasing Message-ID: <20000615233217.A2576@fred.muc.de> References: <200006151255.QAA29302@ms2.inr.ac.ru> <3948D602.CA56CF00@mandrakesoft.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <3948D602.CA56CF00@mandrakesoft.com>; from Jeff Garzik on Thu, Jun 15, 2000 at 05:00:17PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Thu, Jun 15, 2000 at 05:00:17PM +0200, Jeff Garzik wrote: > I got a report saying that Tulip was slow when IP aliasing for a large > number of hosts was in effect. This report was from a guy running a Web > server which binds to a ton of IP aliases in order to do its virtual > hosts. > > Talking with jamal on IRC he mentioned that IP aliases matching occurs > through a linear search. > > Can someone point me to this code? I grepped for CONFIG_IP_ALIAS in > 2.4.x but that seemed only to encompass very small bits of code, and not > the actual IP matching code I am searching for. Linear search only occurs in obscure cases. For all normal paths local IPs are looked up via special local routes in the routing cache and the FIB hash. 
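To illustrate why that distinction matters with a large number of aliases, here is a toy comparison of a linear scan against a hashed lookup of local IPv4 addresses. It is only a sketch: the names (linear_has_addr, addr_table, table_add, table_has_addr) and the bucket count are made up for illustration, and this is not the kernel's routing cache or FIB code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_BUCKETS 256 /* arbitrary illustrative size, must be a power of two */

/* Linear scan: cost grows with the number of configured aliases. */
bool linear_has_addr(const uint32_t *addrs, size_t n, uint32_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addrs[i] == addr)
            return true;
    return false;
}

/* Chained hash table: only one bucket is walked per lookup. */
struct addr_node {
    uint32_t addr;
    struct addr_node *next;
};

struct addr_table {
    struct addr_node *bucket[ADDR_BUCKETS];
};

/* Fold the address bytes together and mask down to a bucket index. */
static unsigned addr_hash(uint32_t addr)
{
    addr ^= addr >> 16;
    addr ^= addr >> 8;
    return addr & (ADDR_BUCKETS - 1);
}

/* Insert a caller-owned node at the head of its bucket. */
void table_add(struct addr_table *t, struct addr_node *n)
{
    unsigned h = addr_hash(n->addr);

    n->next = t->bucket[h];
    t->bucket[h] = n;
}

bool table_has_addr(const struct addr_table *t, uint32_t addr)
{
    const struct addr_node *n;

    for (n = t->bucket[addr_hash(addr)]; n != NULL; n = n->next)
        if (n->addr == addr)
            return true;
    return false;
}
```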
About the only thing that could do something like a linear alias search during normal operation (not configuration) is inet_select_addr(), but it should not be called frequently. -Andi From owner-netdev@oss.sgi.com Thu Jun 15 15:20:14 2000 Received: by oss.sgi.com id ; Thu, 15 Jun 2000 15:19:54 -0700 Received: from nob.rap.ucar.edu ([128.117.193.5]:35599 "EHLO nob.rap.ucar.edu") by oss.sgi.com with ESMTP id ; Thu, 15 Jun 2000 15:19:39 -0700 Received: (from tres@localhost) by nob.rap.ucar.edu (8.9.3/8.9.3/Debian 8.9.3-21) id QAA31159 for netdev@oss.sgi.com; Thu, 15 Jun 2000 16:19:33 -0600 Date: Thu, 15 Jun 2000 16:19:33 -0600 From: Tres Hofmeister To: netdev Subject: Re: [vortex, eepro100] True on TRANSMIT ERROR TIMEOUT Message-ID: <20000615161933.I30376@nob.rap.ucar.edu> References: <39461BB3.299DD807@uow.edu.au> <39483426.B07E4BA@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii User-Agent: Mutt/1.0.1i In-Reply-To: <39483426.B07E4BA@uow.edu.au>; from andrewm@uow.edu.au on Thu, Jun 15, 2000 at 01:40:54AM +0000 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On 2000.06.15, Andrew Morton wrote: : : Let me quote "Brian" from last week: : : ``The web servers (~10Mb/s) last about 3 days on . With all : the Linux zealots talking about taking on Sun and the "enterprise," : I'd drop dead laughing if they weren't my servers.'' : : Ouch. Indeed. I'm in the same boat as "Brian." I've tried everything I can think of to keep our web server up. In the end, it's looking like the answer is using watchdog(8) to monitor the Ethernet interface, with a repair script which reinitializes it when it fails. I've currently got a NetGear Tulip card in the machine, but I've had problems with the vortex and eepro100 drivers as well. I *never* thought I'd be fighting off our NT pushers because our Linux web server seemed unreliable... :( Oh, well. [I'm on the vortex list, not the netdev list, by the way...] 
-- Tres Hofmeister http://www.rap.ucar.edu/~tres/ Research Applications Program, National Center for Atmospheric Research From owner-netdev@oss.sgi.com Thu Jun 15 15:41:05 2000 Received: by oss.sgi.com id ; Thu, 15 Jun 2000 15:40:45 -0700 Received: from panic.ohr.gatech.edu ([130.207.47.194]:54280 "EHLO havoc.gtf.org") by oss.sgi.com with ESMTP id ; Thu, 15 Jun 2000 15:40:34 -0700 Received: from mandrakesoft.com (adsl-77-228-135.atl.bellsouth.net [216.77.228.135]) by havoc.gtf.org (8.9.3/8.9.3) with ESMTP id SAA09737; Thu, 15 Jun 2000 18:40:22 -0400 Message-ID: <39495B56.B1103A5C@mandrakesoft.com> Date: Thu, 15 Jun 2000 18:40:22 -0400 From: Jeff Garzik Organization: MandrakeSoft X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.4.0-test1-ac18 i686) X-Accept-Language: en MIME-Version: 1.0 To: Tres Hofmeister CC: netdev Subject: Re: [vortex, eepro100] True on TRANSMIT ERROR TIMEOUT References: <39461BB3.299DD807@uow.edu.au> <39483426.B07E4BA@uow.edu.au> <20000615161933.I30376@nob.rap.ucar.edu> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Tres Hofmeister wrote: > Indeed. I'm in the same boat as "Brian." I've tried everything > I can think of to keep our web server up. In the end, it's looking like > the answer is using watchdog(8) to monitor the Ethernet interface, with > a repair script which reinitializes it when it fails. I've currently > got a NetGear Tulip card in the machine, but I've had problems with > the vortex and eepro100 drivers as well. Problems with the Tulip driver notwithstanding, the NetGear cards are awful :/ They use an ancient chip design that should have never seen the light of day in the first place. You're far better off finding a better Tulip clone or using a card w/ a real Tulip chip on it. Jeff -- Jeff Garzik | Building 1024 | Free beer tomorrow. MandrakeSoft, Inc.
| From owner-netdev@oss.sgi.com Thu Jun 15 16:23:45 2000 Received: by oss.sgi.com id ; Thu, 15 Jun 2000 16:23:35 -0700 Received: from gw.simegen.com ([203.2.135.4]:36107 "EHLO keon.simegen.com") by oss.sgi.com with ESMTP id ; Thu, 15 Jun 2000 16:23:17 -0700 Received: from anaconda.simegen.com [203.28.9.32] (mail) by keon.simegen.com with esmtp (Exim 3.12 #1 (Debian)) id 132iyP-0005ce-00; Fri, 16 Jun 2000 09:22:58 +1000 Received: from localhost ([127.0.0.1] helo=zeor.simegen.com) by anaconda.simegen.com with esmtp (Exim 3.12 #1 (Debian)) id 132iyN-0001L5-00; Fri, 16 Jun 2000 09:22:55 +1000 Message-ID: <3949654F.958B1718@zeor.simegen.com> Date: Fri, 16 Jun 2000 09:22:55 +1000 From: dancer@zeor.simegen.com X-Mailer: Mozilla 4.72 [en] (X11; I; Linux 2.2.14 i586) X-Accept-Language: en MIME-Version: 1.0 To: Tres Hofmeister CC: netdev Subject: Re: [vortex, eepro100] True on TRANSMIT ERROR TIMEOUT References: <39461BB3.299DD807@uow.edu.au> <39483426.B07E4BA@uow.edu.au> <20000615161933.I30376@nob.rap.ucar.edu> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Tres Hofmeister wrote: > On 2000.06.15, Andrew Morton wrote: > : > : Let me quote "Brian" from last week: > : > : ``The web servers (~10Mb/s) last about 3 days on . With all > : the Linux zealots talking about taking on Sun and the "enterprise," > : I'd drop dead laughing if they weren't my servers.'' > : > : Ouch. > > Indeed. I'm in the same boat as "Brian." I've tried everything > I can think of to keep our web server up. In the end, it's looking like > the answer is using watchdog(8) to monitor the Ethernet interface, with > a repair script which reinitializes it when it fails. I've currently > got a NetGear Tulip card in the machine, but I've had problems with > the vortex and eepro100 drivers as well. 
> > I *never* thought I'd be fighting off our NT pushers because > our Linux web server seemed unreliable... :( Oh, well. Here, also, though we're running a complete range of services for a million or so people. Does that count as 'enterprise'?. We got over all-too-frequent kernel panics by replacing our vortex cards with eepro100's. It's hard to justify another hardware switch to our partners, who have been pushing Sun or NT for the last several years. Of course, our third choice for cards would have been tulips. :/ D From owner-netdev@oss.sgi.com Thu Jun 15 19:56:26 2000 Received: by oss.sgi.com id ; Thu, 15 Jun 2000 19:56:16 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:7040 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Thu, 15 Jun 2000 19:56:05 -0700 Received: (qmail 3284 invoked by uid 577); 16 Jun 2000 02:55:56 -0000 Message-ID: <20000616105556.A3241@saw.sw.com.sg> Date: Fri, 16 Jun 2000 10:55:56 +0800 From: Andrey Savochkin To: dancer@zeor.simegen.com, Tres Hofmeister Cc: netdev Subject: Re: [vortex, eepro100] True on TRANSMIT ERROR TIMEOUT References: <39461BB3.299DD807@uow.edu.au> <39483426.B07E4BA@uow.edu.au> <20000615161933.I30376@nob.rap.ucar.edu> <3949654F.958B1718@zeor.simegen.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: <3949654F.958B1718@zeor.simegen.com>; from "dancer@zeor.simegen.com" on Fri, Jun 16, 2000 at 09:22:55AM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, On Fri, Jun 16, 2000 at 09:22:55AM +1000, dancer@zeor.simegen.com wrote: > Tres Hofmeister wrote: [snip] > > I *never* thought I'd be fighting off our NT pushers because > > our Linux web server seemed unreliable... :( Oh, well. > > Here, also, though we're running a complete range of services for a million > or so people. Does that count as 'enterprise'?. We got over > all-too-frequent kernel panics by replacing our vortex cards with > eepro100's. 
It's hard to justify another hardware switch to our partners, > who have been pushing Sun or NT for the last several years. Of course, our > third choice for cards would have been tulips. :/ You definitely will have problems defending the choice of Linux instead of Sun and NT. That's because you do not take actions. Have you ever considered to report your problems, help developers to debug your case? Regards Andrey V. Savochkin From owner-netdev@oss.sgi.com Fri Jun 16 04:42:39 2000 Received: by oss.sgi.com id ; Fri, 16 Jun 2000 04:42:30 -0700 Received: from tcm.hut.fi ([130.233.44.1]:55823 "EHLO tcm-gw.tcm.hut.fi") by oss.sgi.com with ESMTP id ; Fri, 16 Jun 2000 04:42:16 -0700 Received: (from smap@localhost) by tcm-gw.tcm.hut.fi (8.8.7/8.8.7) id OAA12494 for ; Fri, 16 Jun 2000 14:42:14 +0300 X-Authentication-Warning: tcm-gw.tcm.hut.fi: smap set sender to using -f Received: from caffeine.tcm.hut.fi(130.233.45.27) by tcm-gw.tcm.hut.fi via smap (V2.0) id xma012492; Fri, 16 Jun 00 14:41:57 +0300 Received: from morphine.tcm.hut.fi (morphine.tcm.hut.fi [130.233.45.7]) by caffeine.tcm.hut.fi (8.9.2/8.9.2) with ESMTP id OAA03778 for ; Fri, 16 Jun 2000 14:41:58 +0300 (EET DST) Received: (from ajtuomin@localhost) by morphine.tcm.hut.fi (8.9.2/8.7.1) id OAA00043 for netdev@oss.sgi.com; Fri, 16 Jun 2000 14:41:53 +0300 (EET DST) From: Antti Tuominen Message-Id: <200006161141.OAA00043@morphine.tcm.hut.fi> Subject: linux/in6.h and net/ndisc.h To: netdev@oss.sgi.com Date: Fri, 16 Jun 2000 14:41:53 +0300 (EET DST) X-Mailer: ELM [version 2.4ME+ PL49 (25)] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! I'm a member of the MIPL Mobile IPv6 development team. As part of our implementation we modified linux/in6.h to support MIPv6 destination options as described in "Mobility Support in IPv6" draft. 
Also we are going to add two new ndisc options to net/ndisc.h (also defined in MIPv6). Could these changes be integrated into the main kernel? What is the preferred naming for these? == diff -urN v2.4.0-test1/include/linux/in6.h linux/include/linux/in6.h --- v2.4.0-test1/include/linux/in6.h Thu May 25 05:52:42 2000 +++ linux/include/linux/in6.h Thu Jun 8 16:30:59 2000 @@ -142,6 +142,14 @@ #define IPV6_TLV_JUMBO 194 /* + * Mobile IPv6 TLV options. + */ +#define MIPV6_TLV_BINDACK 7 +#define MIPV6_TLV_BINDRQ 8 +#define MIPV6_TLV_BINDUPDATE 198 +#define MIPV6_TLV_HOMEADDR 201 + +/* * IPV6 socket options */ diff -urN v2.4.0-test1/include/net/ndisc.h linux/include/net/ndisc.h --- v2.4.0-test1/include/net/ndisc.h Sun May 28 16:07:33 2000 +++ linux/include/net/ndisc.h Tue Jun 13 16:57:18 2000 @@ -21,6 +21,10 @@ #define ND_OPT_REDIRECT_HDR 4 #define ND_OPT_MTU 5 +/* Mobile IPv6 specific ndisc options */ +#define ND_OPT_RTR_ADV_INTERVAL 7 +#define ND_OPT_HOME_AGENT_INFO 8 + #define MAX_RTR_SOLICITATION_DELAY HZ #define ND_REACHABLE_TIME (30*HZ) == Also I was wondering why IPV6_TLV_ROUTERALERT is defined as 20, contradicting RFC 2711, where it's defined as 5. Is there some logic behind this which just eludes me or is it just plain wrong? In in6.h: 141:#define IPV6_TLV_ROUTERALERT 20 In RFC 2711 IPv6 Router Alert Option, 2.1 Syntax: The first three bits of the first byte are zero and the value 5 in the remaining five bits is the Hop-by-Hop Option Type number. And one other thing, are the fixes submitted by Sami Kivisaari going to be in the kernel anytime soon? Antti -- Antti J. Tuominen, JMT 3A 131, 02150 Espoo, Finland.
Research assistant, TSE Institute at Helsinki University of Technology work: ajtuomin@tcm.hut.fi; home: tuominen@iki.fi From owner-netdev@oss.sgi.com Fri Jun 16 09:12:11 2000 Received: by oss.sgi.com id ; Fri, 16 Jun 2000 09:12:00 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:16391 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Fri, 16 Jun 2000 09:11:42 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id UAA11842; Fri, 16 Jun 2000 20:11:29 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006161611.UAA11842@ms2.inr.ac.ru> Subject: Re: Tulip and IP aliasing To: ak@muc.DE (Andi Kleen) Date: Fri, 16 Jun 2000 20:11:29 +0400 (MSK DST) Cc: netdev@oss.sgi.com In-Reply-To: <20000615233217.A2576@fred.muc.de> from "Andi Kleen" at Jun 16, 0 02:13:08 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 181 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > Linear search only occurs in obscure cases. Even when inet_select_addr() is called (a rare case), the linear search with aliases reduces to a check of the first element. Alexey From owner-netdev@oss.sgi.com Fri Jun 16 23:22:03 2000 Received: by oss.sgi.com id ; Fri, 16 Jun 2000 23:21:54 -0700 Received: from dialup162.canberra.net.au ([203.33.188.34]:58119 "HELO halfway") by oss.sgi.com with SMTP id ; Fri, 16 Jun 2000 23:21:36 -0700 Received: from linuxcare.com.au (localhost [127.0.0.1]) by halfway (Postfix) with ESMTP id 4527B8192; Fri, 16 Jun 2000 18:00:14 +0930 (CST) From: Rusty Russell To: Werner Almesberger , netdev@oss.sgi.com Subject: Re: netfilter NAT vs. pump In-reply-to: Your message of "Tue, 13 Jun 2000 22:58:55 +0200."
<200006132058.WAA27846@lrcsun15.epfl.ch> Date: Fri, 16 Jun 2000 18:30:14 +1000 Message-Id: <20000616083014.4527B8192@halfway> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing In message <200006132058.WAA27846@lrcsun15.epfl.ch> you write: > - make it happen less often by setting NFC_ALTERED only when > something has changed (probably a good idea in any case) Yep. And NAT only makes sense for `simple' stuff anyway; NATting local src=0.0.0.0 packets is just plain weird. This patch is trivial and clean. Can you test it with unpatched pump? Also sets NFC_ALTERED in compat layer (which doesn't hook in at LOCAL_OUT anyway, so is just me being a pedant). Cheers, Rusty. --- working-2.4.0-test1/net/ipv4/netfilter/ip_nat_standalone.c.~1~ Tue Jun 6 00:11:01 2000 +++ working-2.4.0-test1/net/ipv4/netfilter/ip_nat_standalone.c Fri Jun 16 18:12:21 2000 @@ -60,8 +60,7 @@ IP_NF_ASSERT(!((*pskb)->nh.iph->frag_off & __constant_htons(IP_MF|IP_OFFSET))); - /* FIXME: One day, fill in properly.
--RR */ - (*pskb)->nfcache |= NFC_UNKNOWN | NFC_ALTERED; + (*pskb)->nfcache |= NFC_UNKNOWN; /* If we had a hardware checksum before, it's now invalid */ if ((*pskb)->pkt_type != PACKET_LOOPBACK) --- working-2.4.0-test1/net/ipv4/netfilter/ip_nat_core.c.~1~ Tue Jun 6 00:11:01 2000 +++ working-2.4.0-test1/net/ipv4/netfilter/ip_nat_core.c Fri Jun 16 18:16:44 2000 @@ -663,8 +663,10 @@ static void manip_pkt(u_int16_t proto, struct iphdr *iph, size_t len, const struct ip_conntrack_manip *manip, - enum ip_nat_manip_type maniptype) + enum ip_nat_manip_type maniptype, + __u32 *nfcache) { + *nfcache |= NFC_ALTERED; find_nat_proto(proto)->manip_pkt(iph, len, manip, maniptype); if (maniptype == IP_NAT_MANIP_SRC) { @@ -718,7 +720,8 @@ (*pskb)->nh.iph, (*pskb)->len, &info->manips[i].manip, - info->manips[i].maniptype); + info->manips[i].maniptype, + &(*pskb)->nfcache); } } helper = info->helper; @@ -782,7 +785,8 @@ manip_pkt(inner->protocol, inner, skb->len - ((void *)inner - (void *)iph), &info->manips[i].manip, - !info->manips[i].maniptype); + !info->manips[i].maniptype, + &skb->nfcache); /* Outer packet needs to have IP header NATed like it's a reply. */ } else if (info->manips[i].direction == dir @@ -795,7 +799,8 @@ IP_PARTS(info->manips[i].manip.ip)); manip_pkt(0, iph, skb->len, &info->manips[i].manip, - info->manips[i].maniptype); + info->manips[i].maniptype, + &skb->nfcache); } } READ_UNLOCK(&ip_nat_lock); --- working-2.4.0-test1/net/ipv4/netfilter/ip_fw_compat.c.~1~ Fri May 12 13:22:38 2000 +++ working-2.4.0-test1/net/ipv4/netfilter/ip_fw_compat.c Fri Jun 16 18:25:21 2000 @@ -83,7 +83,8 @@ int ret = FW_BLOCK; u_int16_t redirpt; - (*pskb)->nfcache |= NFC_UNKNOWN; + /* Assume worse case: any hook could change packet */ + (*pskb)->nfcache |= NFC_UNKNOWN | NFC_ALTERED; (*pskb)->ip_summed = CHECKSUM_NONE; switch (hooknum) { -- Hacking time. 
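[Archive note] The idea in the patch above, setting NFC_ALTERED only at the point where a packet is actually rewritten, can be shown with a small user-space sketch. This is only an illustration under stated assumptions: the `sketch_` names, the toy packet struct, and the flag values are hypothetical and are not the kernel's definitions or Rusty's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; NOT the kernel's definitions. */
#define SKETCH_NFC_UNKNOWN 0x1u
#define SKETCH_NFC_ALTERED 0x2u

/* Toy stand-in for a packet: one address field plus the cache flags. */
struct sketch_pkt {
    uint32_t saddr;
    uint32_t nfcache;
};

/*
 * Rewrite the source address and record the alteration right where the
 * change happens, mirroring the extra nfcache pointer argument in the patch.
 */
static void sketch_manip_pkt(struct sketch_pkt *pkt, uint32_t new_saddr,
                             uint32_t *nfcache)
{
    assert(pkt != NULL && nfcache != NULL);
    *nfcache |= SKETCH_NFC_ALTERED;
    pkt->saddr = new_saddr;
}

/*
 * Hook entry: always marks the packet UNKNOWN, but only calls the
 * manipulation routine (and so only sets ALTERED) when a mapping applies.
 */
static void sketch_nat_hook(struct sketch_pkt *pkt, int have_mapping,
                            uint32_t new_saddr)
{
    pkt->nfcache |= SKETCH_NFC_UNKNOWN;
    if (have_mapping)
        sketch_manip_pkt(pkt, new_saddr, &pkt->nfcache);
}
```

With this shape, a packet that passes through the hook without a mapping keeps its ALTERED bit clear, which is the behaviour the thread asks for with unpatched pump.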
From owner-netdev@oss.sgi.com Sun Jun 18 06:33:35 2000 Received: by oss.sgi.com id ; Sun, 18 Jun 2000 06:33:05 -0700 Received: from [203.126.247.144] ([203.126.247.144]:38290 "EHLO zsngs001") by oss.sgi.com with ESMTP id ; Sun, 18 Jun 2000 06:32:33 -0700 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by zsngs001; Sun, 18 Jun 2000 21:31:11 +0800 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NBKHA0X5; Sun, 18 Jun 2000 21:31:14 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF7857; Sun, 18 Jun 2000 23:31:15 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.78]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id XAA28289; Sun, 18 Jun 2000 23:31:07 +1000 Message-ID: <394CD00E.A02FEBBE@uow.edu.au> Date: Sun, 18 Jun 2000 23:35:10 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: "netdev@oss.sgi.com" Subject: [patch] TCP throughput after 2.2.17-pre1 Content-Type: multipart/mixed; boundary="------------7DF873E452058BAB6315BDA2" X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing This is a multi-part message in MIME format. --------------7DF873E452058BAB6315BDA2 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit With 2.2.17-pre1 on a 400MHz uniprocessor it is possible to saturate a 100baseT with netperf (just a TCP transmitter) with 40% CPU left over. With -pre3 and -pre4 it maxes out at 38 Mbits/sec. A five times reduction. 
The problem is caused by this patch: Index: tcp.c =================================================================== RCS file: /opt/cvs/lk2.2/net/ipv4/tcp.c,v retrieving revision 1.3 retrieving revision 1.3.2.1 diff -u -r1.3 -r1.3.2.1 --- tcp.c 2000/06/11 13:26:18 1.3 +++ tcp.c 2000/06/17 04:19:01 1.3.2.1 @@ -698,13 +698,16 @@ /* * Wait for more memory for a socket + * + * If we got here an allocation has failed on us. We cannot + * spin here or we may block the very code freeing memory + * for us. */ static void wait_for_tcp_memory(struct sock * sk) { release_sock(sk); if (!tcp_memory_free(sk)) { struct wait_queue wait = { current, NULL }; - sk->socket->flags &= ~SO_NOSPACE; add_wait_queue(sk->sleep, &wait); for (;;) { @@ -721,6 +724,12 @@ } current->state = TASK_RUNNING; remove_wait_queue(sk->sleep, &wait); + } + else + { + /* Yield time to the memory freeing paths */ + current->state = TASK_INTERRUPTIBLE; + schedule_timeout(1); } lock_sock(sk); } This code is assuming that the network layer has run out of memory. Not true. Instead it goes to sleep if the socket can still take more data! The comment about "Yield time to the memory freeing paths" is bogus. I've looked through Alan's changelogs and I follow all the mailing lists, but I cannot tell where this one came from. A fix against 2.2.17-pre4 (with some optimisations and rationalisations) is attached. 
--------------7DF873E452058BAB6315BDA2 Content-Type: text/plain; charset=us-ascii; name="tcp.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="tcp.patch" --- linux-2.2.17pre4/include/net/sock.h Tue Aug 10 05:05:13 1999 +++ linux-akpm/include/net/sock.h Sun Jun 18 23:16:09 2000 @@ -922,6 +922,16 @@ return sock_wspace(sk) >= MIN_WRITE_SPACE; } +/* + * Return true if the socket's snd buffer is not fully utilised + */ + +extern __inline__ int tcp_memory_free(struct sock *sk) +{ + return atomic_read(&sk->wmem_alloc) < sk->sndbuf; +} + + /* * Declarations from timer.c */ --- linux-2.2.17pre4/net/ipv4/tcp.c Sun Jun 18 21:04:07 2000 +++ linux-akpm/net/ipv4/tcp.c Sun Jun 18 23:29:21 2000 @@ -201,7 +201,8 @@ * tcp_do_sendmsg to avoid burstiness. * Eric Schenk : Fix fast close down bug with * shutdown() followed by close(). - * Andi Kleen : Make poll agree with SIGIO + * Andi Kleen : Make poll agree with SIGIO + * Andrew Morton : Repair handling of 'wmem_alloc exceeded' state * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License @@ -691,23 +692,17 @@ return 0; } -static inline int tcp_memory_free(struct sock *sk) -{ - return atomic_read(&sk->wmem_alloc) < sk->sndbuf; -} - /* * Wait for more memory for a socket * - * If we got here an allocation has failed on us. We cannot - * spin here or we may block the very code freeing memory - * for us. + * If we got here the socket has used all its available send buffer space. + * We deschedule until more data can be sent. 
*/ static void wait_for_tcp_memory(struct sock * sk) { - release_sock(sk); if (!tcp_memory_free(sk)) { struct wait_queue wait = { current, NULL }; + release_sock(sk); sk->socket->flags &= ~SO_NOSPACE; add_wait_queue(sk->sleep, &wait); for (;;) { @@ -724,14 +719,8 @@ } current->state = TASK_RUNNING; remove_wait_queue(sk->sleep, &wait); + lock_sock(sk); } - else - { - /* Yield time to the memory freeing paths */ - current->state = TASK_INTERRUPTIBLE; - schedule_timeout(1); - } - lock_sock(sk); } /* --- linux-2.2.17pre4/net/core/sock.c Tue May 11 02:55:25 1999 +++ linux-akpm/net/core/sock.c Sun Jun 18 23:24:18 2000 @@ -558,7 +558,7 @@ */ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force, int priority) { - if (force || atomic_read(&sk->wmem_alloc) < sk->sndbuf) { + if (force || tcp_memory_free(sk)) { struct sk_buff * skb = alloc_skb(size, priority); if (skb) { atomic_add(skb->truesize, &sk->wmem_alloc); @@ -575,7 +575,7 @@ */ struct sk_buff *sock_rmalloc(struct sock *sk, unsigned long size, int force, int priority) { - if (force || atomic_read(&sk->rmem_alloc) < sk->rcvbuf) { + if (force || tcp_memory_free(sk)) { struct sk_buff *skb = alloc_skb(size, priority); if (skb) { atomic_add(skb->truesize, &sk->rmem_alloc); @@ -658,7 +658,7 @@ if (signal_pending(current)) break; current->state = TASK_INTERRUPTIBLE; - if (atomic_read(&sk->wmem_alloc) < sk->sndbuf) + if (tcp_memory_free(sk)) break; if (sk->shutdown & SEND_SHUTDOWN) break; --------------7DF873E452058BAB6315BDA2-- From owner-netdev@oss.sgi.com Sun Jun 18 07:40:45 2000 Received: by oss.sgi.com id ; Sun, 18 Jun 2000 07:40:26 -0700 Received: from pizda.ninka.net ([216.101.162.242]:56978 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sun, 18 Jun 2000 07:39:50 -0700 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id HAA04840; Sun, 18 Jun 2000 07:26:13 -0700 Date: Sun, 18 Jun 2000 07:26:13 -0700 Message-Id: <200006181426.HAA04840@pizda.ninka.net> 
X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f From: "David S. Miller" To: andrewm@uow.edu.au CC: netdev@oss.sgi.com, alan@redhat.com, kuznet@ms2.inr.ac.ru In-reply-to: <394CD00E.A02FEBBE@uow.edu.au> (message from Andrew Morton on Sun, 18 Jun 2000 23:35:10 +1000) Subject: Re: [patch] TCP throughput after 2.2.17-pre1 References: <394CD00E.A02FEBBE@uow.edu.au> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Date: Sun, 18 Jun 2000 23:35:10 +1000 From: Andrew Morton The problem is caused by this patch: The change apparently came from Alexey to cure some deadlock issues. I just noticed the change today, and I also am skeptical about its correctness. If I understand correctly, it's the simple issue of the lack of distinction between a socket buffer allocation failing because of going over the socket limits or a real kmalloc failure of some sort. Someone needs to sort this out correctly and I'd like to ask Alan to back this out until a better fix is found. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Sun Jun 18 07:48:15 2000 Received: by oss.sgi.com id ; Sun, 18 Jun 2000 07:48:05 -0700 Received: from devserv.devel.redhat.com ([207.175.42.156]:20228 "EHLO devserv.devel.redhat.com") by oss.sgi.com with ESMTP id ; Sun, 18 Jun 2000 07:47:51 -0700 Received: (from alan@localhost) by devserv.devel.redhat.com (8.9.3/8.9.3) id KAA07398; Sun, 18 Jun 2000 10:47:01 -0400 From: Alan Cox Message-Id: <200006181447.KAA07398@devserv.devel.redhat.com> Subject: Re: [patch] TCP throughput after 2.2.17-pre1 To: davem@redhat.com (David S. Miller) Date: Sun, 18 Jun 2000 10:47:01 -0400 (EDT) Cc: andrewm@uow.edu.au, netdev@oss.sgi.com, alan@redhat.com, kuznet@ms2.inr.ac.ru In-Reply-To: <200006181426.HAA04840@pizda.ninka.net> from "David S.
Miller" at Jun 18, 2000 07:26:13 AM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > The change apparently came from Alexey to cure some > deadlock issues. It's from me. It's critical to fix the problem > Someone needs to sort this out correctly and I'd like to ask > Alan to back this out until a better fix is found. It's fixed a huge range of hangs especially on SMP boxes. It's staying until the proper fix is done. That probably needs a sock_wmalloc_err() and also the new code path waking on the socket kfreeing a buffer. Right at the moment it's a huge win having it in 2.2.17pre because I can actually look at the few remaining 'it hung' reports and work on those. For 2.2.17 we have to fix this properly - neither the hack fix nor ignoring it is an option. Alan From owner-netdev@oss.sgi.com Sun Jun 18 09:11:26 2000 Received: by oss.sgi.com id ; Sun, 18 Jun 2000 09:11:06 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:27407 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Sun, 18 Jun 2000 09:10:44 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id UAA32560; Sun, 18 Jun 2000 20:09:53 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006181609.UAA32560@ms2.inr.ac.ru> Subject: Re: [patch] TCP throughput after 2.2.17-pre1 To: davem@redhat.com (David S. Miller) Date: Sun, 18 Jun 2000 20:09:52 +0400 (MSK DST) Cc: andrewm@uow.edu.au, netdev@oss.sgi.com, alan@redhat.com In-Reply-To: <200006181426.HAA04840@pizda.ninka.net> from "David S. Miller" at Jun 18, 0 07:26:13 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 197 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > I just noticed the change today, and I also am skeptical > about its correctness. The fix, which is in 2.3 now, should be OK.
It seems easy to copy it to 2.2. I can do it. Alexey From owner-netdev@oss.sgi.com Sun Jun 18 09:17:35 2000 Received: by oss.sgi.com id ; Sun, 18 Jun 2000 09:17:15 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:29649 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Sun, 18 Jun 2000 09:16:57 -0700 Received: from fred.muc.de (none@ns1219.munich.netsurf.de [195.180.235.219]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id SAA08421; Sun, 18 Jun 2000 18:16:16 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 133hkT-0001nf-00; Sun, 18 Jun 2000 18:16:37 +0200 Date: Sun, 18 Jun 2000 18:16:37 +0200 From: Andi Kleen To: Andrew Morton Cc: "David S. Miller" , "netdev@oss.sgi.com" , alan@lxorguk.ukuu.org.uk Subject: Re: [patch] TCP throughput after 2.2.17-pre1 Message-ID: <20000618181637.A6916@fred.muc.de> References: <394CD00E.A02FEBBE@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <394CD00E.A02FEBBE@uow.edu.au>; from Andrew Morton on Sun, Jun 18, 2000 at 03:41:47PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sun, Jun 18, 2000 at 03:41:47PM +0200, Andrew Morton wrote: > With 2.2.17-pre1 on a 400MHz uniprocessor it is possible to saturate a > 100baseT with netperf (just a TCP transmitter) with 40% CPU left over. > > With -pre3 and -pre4 it maxes out at 38 Mbits/sec. A five times > reduction. Your patch is unfortunately not the right solution. It goes back to the old state and doesn't fix the original bug (tcp spinning in case of real oom). Also you cannot just use tcp_* in generic code and the release_sock move is useless because sk->sndbuf management occurs outside the socket lock anyways. Here is a better patch. It tries to properly distinguish the two socket oom cases (out of socket buffer and out of system memory) and sleeps in the latter case to give the system some time to recover.
It was integrated with the normal sleep loop, because a sndbuf wakeup is a strong cue that some memory was freed again. I also added a net_statistics to make the problem more transparent. Based on discussions with Alan. Patch is relative to plain 2.2.16. -Andi --- include/net/sock.h.sockalloc Fri Jun 16 13:33:37 2000 +++ include/net/sock.h Sun Jun 18 17:17:46 2000 @@ -717,6 +717,10 @@ extern struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force, int priority); + +extern struct sk_buff *sock_wmalloc_err(struct sock *sk, + unsigned long size, int force, + int priority, int *err); extern struct sk_buff *sock_rmalloc(struct sock *sk, unsigned long size, int force, int priority); --- include/net/snmp.h.sockalloc Fri Jun 16 13:33:37 2000 +++ include/net/snmp.h Sun Jun 18 17:50:29 2000 @@ -178,6 +178,7 @@ unsigned long OfoPruned; unsigned long OutOfWindowIcmps; unsigned long LockDroppedIcmps; + unsigned long SockMallocOOM; }; #endif --- net/core/sock.c.sockalloc Fri Jun 16 13:33:44 2000 +++ net/core/sock.c Sun Jun 18 18:05:01 2000 @@ -566,6 +565,31 @@ skb->sk = sk; return skb; } + net_statistics.SockMallocOOM++; + } + return NULL; +} + +/* + * Allocate memory from the sockets send buffer, telling caller about real OOM. + * err is only set for oom, not for socket buffer overflow. 
+ */ +struct sk_buff *sock_wmalloc_err(struct sock *sk, unsigned long size, int force, int priority, int *err) +{ + *err = 0; + /* Note: overcommitment possible */ + if (force || atomic_read(&sk->wmem_alloc) < sk->sndbuf) { + struct sk_buff * skb; + *err = -ENOMEM; + skb = alloc_skb(size, priority); + if (skb) { + *err = 0; + atomic_add(skb->truesize, &sk->wmem_alloc); + skb->destructor = sock_wfree; + skb->sk = sk; + return skb; + } + net_statistics.SockMallocOOM++; } return NULL; } @@ -583,6 +607,7 @@ skb->sk = sk; return skb; } + net_statistics.SockMallocOOM++; } return NULL; } @@ -602,6 +627,7 @@ if (mem) return mem; atomic_sub(size, &sk->omem_alloc); + net_statistics.SockMallocOOM++; } return NULL; } --- net/ipv4/tcp.c.sockalloc Fri Jun 16 13:33:45 2000 +++ net/ipv4/tcp.c Sun Jun 18 18:02:38 2000 @@ -697,9 +697,11 @@ } /* - * Wait for more memory for a socket + * Wait for more memory for a socket. + * Special case is err == -ENOMEM, in this case just sleep a bit waiting + * for the system to free up some memory. */ -static void wait_for_tcp_memory(struct sock * sk) +static void wait_for_tcp_memory(struct sock * sk, int err) { release_sock(sk); if (!tcp_memory_free(sk)) { @@ -711,13 +713,18 @@ if (signal_pending(current)) break; current->state = TASK_INTERRUPTIBLE; - if (tcp_memory_free(sk)) + if (tcp_memory_free(sk) && !err) break; if (sk->shutdown & SEND_SHUTDOWN) break; if (sk->err) break; - schedule(); + if (!err) + schedule(); + else { + schedule_timeout(1); + break; + } } current->state = TASK_RUNNING; remove_wait_queue(sk->sleep, &wait); @@ -915,7 +922,7 @@ tmp += copy; queue_it = 0; } - skb = sock_wmalloc(sk, tmp, 0, GFP_KERNEL); + skb = sock_wmalloc_err(sk, tmp, 0, GFP_KERNEL, &err); /* If we didn't get any memory, we need to sleep. */ if (skb == NULL) { @@ -928,8 +935,10 @@ err = -ERESTARTSYS; goto do_interrupted; } - tcp_push_pending_frames(sk, tp); - wait_for_tcp_memory(sk); + /* In OOM that would fail anyways so do not bother. 
*/ + if (!err) + tcp_push_pending_frames(sk, tp); + wait_for_tcp_memory(sk, err); /* If SACK's were formed or PMTU events happened, * we must find out about it. --- net/ipv4/proc.c.sockalloc Fri Jun 16 13:33:45 2000 +++ net/ipv4/proc.c Sun Jun 18 17:50:17 2000 @@ -359,8 +359,8 @@ len = sprintf(buffer, "TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed" " EmbryonicRsts PruneCalled RcvPruned OfoPruned" - " OutOfWindowIcmps LockDroppedIcmps\n" - "TcpExt: %lu %lu %lu %lu %lu %lu %lu %lu %lu\n", + " OutOfWindowIcmps LockDroppedIcmps SockMallocOOM\n" + "TcpExt: %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu\n", net_statistics.SyncookiesSent, net_statistics.SyncookiesRecv, net_statistics.SyncookiesFailed, @@ -369,7 +369,8 @@ net_statistics.RcvPruned, net_statistics.OfoPruned, net_statistics.OutOfWindowIcmps, - net_statistics.LockDroppedIcmps); + net_statistics.LockDroppedIcmps, + net_statistics.SockMallocOOM); if (offset >= len) { From owner-netdev@oss.sgi.com Sun Jun 18 19:43:59 2000 Received: by oss.sgi.com id ; Sun, 18 Jun 2000 19:43:49 -0700 Received: from m201-4-p47.warwick.net ([208.242.201.202]:2820 "EHLO circuit.moureaux.com") by oss.sgi.com with ESMTP id ; Sun, 18 Jun 2000 19:43:36 -0700 Received: from circuit.moureaux.com (IDENT:statux@circuit.moureaux.com [192.168.0.1]) by circuit.moureaux.com (8.9.3/8.9.3) with SMTP id WAA00949; Sun, 18 Jun 2000 22:43:53 -0400 Date: Sun, 18 Jun 2000 22:43:53 -0400 (EDT) From: Statux X-Sender: statux@circuit.moureaux.com To: Andrew Morton cc: "David S. Miller" , "netdev@oss.sgi.com" Subject: Re: [patch] TCP throughput after 2.2.17-pre1 In-Reply-To: <394CD00E.A02FEBBE@uow.edu.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > + current->state = TASK_INTERRUPTIBLE; isn't it.. interruptable (or is my english weakening)? 
-Statux From owner-netdev@oss.sgi.com Mon Jun 19 01:33:21 2000 Received: by oss.sgi.com id ; Mon, 19 Jun 2000 01:33:11 -0700 Received: from lrcsun15.epfl.ch ([128.178.156.77]:23972 "EHLO lrcsun15.epfl.ch") by oss.sgi.com with ESMTP id ; Mon, 19 Jun 2000 01:32:50 -0700 Received: (from almesber@localhost) by lrcsun15.epfl.ch (8.8.X/EPFL-8.1a) id KAA12922; Mon, 19 Jun 2000 10:32:11 +0200 (MET DST) From: Werner Almesberger Message-Id: <200006190832.KAA12922@lrcsun15.epfl.ch> Subject: Re: netfilter NAT vs. pump To: rusty@linuxcare.com.au (Rusty Russell) Date: Mon, 19 Jun 2000 10:32:10 +0200 (MET DST) Cc: netdev@oss.sgi.com In-Reply-To: <20000616083014.4527B8192@halfway> from "Rusty Russell" at Jun 16, 2000 06:30:14 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Rusty Russell wrote: > This patch is trivial and clean. Can you test it with unpatched pump? Seems to work in my setup. 
Thanks, - Werner -- _________________________________________________________________________ / Werner Almesberger, ICA, EPFL, CH werner.almesberger@ica.epfl.ch / /_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/ From owner-netdev@oss.sgi.com Mon Jun 19 03:55:51 2000 Received: by oss.sgi.com id ; Mon, 19 Jun 2000 03:55:31 -0700 Received: from smtprch2.nortelnetworks.com ([192.135.215.15]:28384 "EHLO smtprch2.nortel.com") by oss.sgi.com with ESMTP id ; Mon, 19 Jun 2000 03:55:11 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch2.nortel.com; Mon, 19 Jun 2000 05:51:30 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id MPD1Y7R6; Mon, 19 Jun 2000 05:54:35 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8A9B; Mon, 19 Jun 2000 20:54:34 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.90]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id UAA05423; Mon, 19 Jun 2000 20:54:05 +1000 Message-ID: <394DFCD5.41FCD467@uow.edu.au> Date: Mon, 19 Jun 2000 20:58:29 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: Alan Cox CC: "David S. Miller" , netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru, Andi Kleen Subject: Re: [patch] TCP throughput after 2.2.17-pre1 References: <200006181426.HAA04840@pizda.ninka.net> from "David S. Miller" at Jun 18, 2000 07:26:13 AM <200006181447.KAA07398@devserv.devel.redhat.com> Content-Type: multipart/mixed; boundary="------------9CBB049786F372AA80D19CAF" X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing This is a multi-part message in MIME format. 
--------------9CBB049786F372AA80D19CAF Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Andi's patch works for me. I've attached the 2.2.17-pre4 version here. Alan Cox wrote: > > and also > the new code path waking on the socket kfreeing a buffer. He didn't appear to do that bit. It just polls. The sleep_on() would be good to have; wait_for_tcp_memory() will be called quite often for the non-oom case. --------------9CBB049786F372AA80D19CAF Content-Type: text/plain; charset=us-ascii; name="tcp.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="tcp.patch" --- linux-2.2.17pre4/include/net/sock.h Tue Aug 10 05:05:13 1999 +++ linux-akpm/include/net/sock.h Mon Jun 19 19:03:07 2000 @@ -717,6 +717,10 @@ extern struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force, int priority); + +extern struct sk_buff *sock_wmalloc_err(struct sock *sk, + unsigned long size, int force, + int priority, int *err); extern struct sk_buff *sock_rmalloc(struct sock *sk, unsigned long size, int force, int priority); --- linux-2.2.17pre4/include/net/snmp.h Mon Oct 5 03:19:39 1998 +++ linux-akpm/include/net/snmp.h Mon Jun 19 19:03:07 2000 @@ -178,6 +178,7 @@ unsigned long OfoPruned; unsigned long OutOfWindowIcmps; unsigned long LockDroppedIcmps; + unsigned long SockMallocOOM; }; #endif --- linux-2.2.17pre4/net/core/sock.c Tue May 11 02:55:25 1999 +++ linux-akpm/net/core/sock.c Mon Jun 19 19:03:07 2000 @@ -566,6 +566,31 @@ skb->sk = sk; return skb; } + net_statistics.SockMallocOOM++; + } + return NULL; +} + +/* + * Allocate memory from the sockets send buffer, telling caller about real OOM. + * err is only set for oom, not for socket buffer overflow. 
+ */ +struct sk_buff *sock_wmalloc_err(struct sock *sk, unsigned long size, int force, int priority, int *err) +{ + *err = 0; + /* Note: overcommitment possible */ + if (force || atomic_read(&sk->wmem_alloc) < sk->sndbuf) { + struct sk_buff * skb; + *err = -ENOMEM; + skb = alloc_skb(size, priority); + if (skb) { + *err = 0; + atomic_add(skb->truesize, &sk->wmem_alloc); + skb->destructor = sock_wfree; + skb->sk = sk; + return skb; + } + net_statistics.SockMallocOOM++; } return NULL; } @@ -583,6 +608,7 @@ skb->sk = sk; return skb; } + net_statistics.SockMallocOOM++; } return NULL; } @@ -602,6 +628,7 @@ if (mem) return mem; atomic_sub(size, &sk->omem_alloc); + net_statistics.SockMallocOOM++; } return NULL; } --- linux-2.2.17pre4/net/ipv4/tcp.c Sun Jun 18 21:04:07 2000 +++ linux-akpm/net/ipv4/tcp.c Mon Jun 19 20:40:33 2000 @@ -697,40 +697,38 @@ } /* - * Wait for more memory for a socket - * - * If we got here an allocation has failed on us. We cannot - * spin here or we may block the very code freeing memory - * for us. + * Wait for more memory for a socket. + * Special case is err == -ENOMEM, in this case just sleep a bit waiting + * for the system to free up some memory. 
*/ -static void wait_for_tcp_memory(struct sock * sk) +static void wait_for_tcp_memory(struct sock * sk, int err) { release_sock(sk); if (!tcp_memory_free(sk)) { struct wait_queue wait = { current, NULL }; + sk->socket->flags &= ~SO_NOSPACE; add_wait_queue(sk->sleep, &wait); for (;;) { if (signal_pending(current)) break; current->state = TASK_INTERRUPTIBLE; - if (tcp_memory_free(sk)) + if (tcp_memory_free(sk) && !err) break; if (sk->shutdown & SEND_SHUTDOWN) break; if (sk->err) break; - schedule(); + if (!err) + schedule(); + else { + schedule_timeout(1); + break; + } } current->state = TASK_RUNNING; remove_wait_queue(sk->sleep, &wait); } - else - { - /* Yield time to the memory freeing paths */ - current->state = TASK_INTERRUPTIBLE; - schedule_timeout(1); - } lock_sock(sk); } @@ -924,7 +922,7 @@ tmp += copy; queue_it = 0; } - skb = sock_wmalloc(sk, tmp, 0, GFP_KERNEL); + skb = sock_wmalloc_err(sk, tmp, 0, GFP_KERNEL, &err); /* If we didn't get any memory, we need to sleep. */ if (skb == NULL) { @@ -937,8 +935,10 @@ err = -ERESTARTSYS; goto do_interrupted; } - tcp_push_pending_frames(sk, tp); - wait_for_tcp_memory(sk); + /* In OOM that would fail anyways so do not bother. */ + if (!err) + tcp_push_pending_frames(sk, tp); + wait_for_tcp_memory(sk, err); /* If SACK's were formed or PMTU events happened, * we must find out about it. 
--- linux-2.2.17pre4/net/ipv4/proc.c Fri Jun 16 23:48:00 2000 +++ linux-akpm/net/ipv4/proc.c Mon Jun 19 19:03:07 2000 @@ -359,8 +359,8 @@ len = sprintf(buffer, "TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed" " EmbryonicRsts PruneCalled RcvPruned OfoPruned" - " OutOfWindowIcmps LockDroppedIcmps\n" - "TcpExt: %lu %lu %lu %lu %lu %lu %lu %lu %lu\n", + " OutOfWindowIcmps LockDroppedIcmps SockMallocOOM\n" + "TcpExt: %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu\n", net_statistics.SyncookiesSent, net_statistics.SyncookiesRecv, net_statistics.SyncookiesFailed, @@ -369,7 +369,8 @@ net_statistics.RcvPruned, net_statistics.OfoPruned, net_statistics.OutOfWindowIcmps, - net_statistics.LockDroppedIcmps); + net_statistics.LockDroppedIcmps, + net_statistics.SockMallocOOM); if (offset >= len) { --------------9CBB049786F372AA80D19CAF-- From owner-netdev@oss.sgi.com Mon Jun 19 04:34:00 2000 Received: by oss.sgi.com id ; Mon, 19 Jun 2000 04:33:51 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:12236 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Mon, 19 Jun 2000 04:33:40 -0700 Received: from fred.muc.de (none@ns1108.munich.netsurf.de [195.180.235.108]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id NAA07735; Mon, 19 Jun 2000 13:33:26 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 133znx-0000IH-00; Mon, 19 Jun 2000 13:33:25 +0200 Date: Mon, 19 Jun 2000 13:33:25 +0200 From: Andi Kleen To: Andrew Morton Cc: Alan Cox , "David S. 
Miller" , netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru, Andi Kleen Subject: Re: [patch] TCP throughput after 2.2.17-pre1 Message-ID: <20000619133325.A1126@fred.muc.de> References: <394DFCD5.41FCD467@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <394DFCD5.41FCD467@uow.edu.au>; from Andrew Morton on Mon, Jun 19, 2000 at 12:54:55PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, Jun 19, 2000 at 12:54:55PM +0200, Andrew Morton wrote: > Andi's patch works for me. I've attached the 2.2.17-pre4 version here. > > Alan Cox wrote: > > > > and also > > the new code path waking on the socket kfreeing a buffer. > > He didn't appear to do that bit. It just polls. It does actually. The schedule_timeout(1) will be woken earlier with the kfree_skb wakeup. -Andi From owner-netdev@oss.sgi.com Mon Jun 19 06:53:13 2000 Received: by oss.sgi.com id ; Mon, 19 Jun 2000 06:53:03 -0700 Received: from smtprch1.nortelnetworks.com ([192.135.215.14]:15246 "EHLO smtprch1.nortel.com") by oss.sgi.com with ESMTP id ; Mon, 19 Jun 2000 06:52:46 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch1.nortel.com; Mon, 19 Jun 2000 08:49:14 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id MPD1ZBLK; Mon, 19 Jun 2000 08:50:10 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8BBP; Mon, 19 Jun 2000 23:50:12 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.111]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id XAA06429 for ; Mon, 19 Jun 2000 23:50:06 +1000 Message-ID: <394E2616.C25F8376@uow.edu.au> Date: Mon, 19 Jun 2000 23:54:30 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: 
Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586)
X-Accept-Language: en
MIME-Version: 1.0
To: "netdev@oss.sgi.com"
Subject: modular net drivers
Content-Type: multipart/mixed; boundary="------------41373E28E27635F9DC73A463"
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

This is a multi-part message in MIME format.
--------------41373E28E27635F9DC73A463
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

As you may have noticed, Al Viro is running around the kernel getting
rid of MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT. His long-term plan
is to remove MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT completely.

I'm looking into the changes required for the net drivers.

I have attached here a patch which implements this. It works with
3c59x.c, tested as a module and statically linked. Also compiles and
works with CONFIG_MODULES turned off.

The implications of this are:

- All 2.4-only netdrivers can have all their MOD_DEC and MOD_INC calls
  removed. All that twisty logic to keep track of the counts can be
  tossed. A single

	SET_NETDEVICE_OWNER(dev);

  will be needed when the netdevice fields are being filled in.

- Drivers which support 2.2 will need to retain their MOD_INC/MOD_DEC
  macros and they will need to be given a SET_NETDEVICE_OWNER(). They
  will need to do this:

	#if LINUX_VERSION_CODE < KERNEL_VERSION(2,4,0)
	#define SET_NETDEVICE_OWNER(d)
	#else
	#define MOD_INC_USE_COUNT
	#define MOD_DEC_USE_COUNT
	#endif

A couple of sticky drivers which Al has identified are com20020_cs.c
and hysdn_net.c. These both call dev->stop() in bizarre ways and
should be repaired anyway - they're either trying to bypass
notification and dev_clear_fastroute() or they're trying hard to leak
module refcounts and crash the machine. They can continue to use the
legacy MOD_DEC and MOD_INC calls, but only until these are tossed out
altogether.
Al suggested that the SET_NETDEVICE_OWNER(dev) functionality could be embedded within ether_setup() but I think it's better open-coded in this manner - a few drivers (eg, shaper.c) don't use ether_setup. If/when this patch goes in it will NOT require that all netdevice drivers be edited at the same time. We can migrate them gradually. Comments? Gnashing of teeth? --------------41373E28E27635F9DC73A463 Content-Type: text/plain; charset=us-ascii; name="netdevice-modules.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="netdevice-modules.patch" --- linux-2.4.0-test1-ac21/include/linux/netdevice.h Mon Jun 19 16:47:53 2000 +++ linux-akpm/include/linux/netdevice.h Mon Jun 19 22:10:09 2000 @@ -136,6 +136,11 @@ struct neigh_parms; struct sk_buff; +/* Centralised module refcounting for netdevices */ +struct module; +#define SET_NETDEVICE_OWNER(dev) \ + do { dev->owner = THIS_MODULE; } while (0) + struct netif_rx_stats { unsigned total; @@ -372,6 +377,9 @@ unsigned char *haddr); int (*neigh_setup)(struct net_device *dev, struct neigh_parms *); int (*accept_fastpath)(struct net_device *, struct dst_entry*); + + /* open/release and usage marking */ + struct module *owner; /* bridge stuff */ struct net_bridge_port *br_port; --- linux-2.4.0-test1-ac21/net/core/dev.c Mon Jun 19 16:47:54 2000 +++ linux-akpm/net/core/dev.c Mon Jun 19 21:28:14 2000 @@ -89,6 +89,7 @@ #include #include #include +#include #if defined(CONFIG_NET_RADIO) || defined(CONFIG_NET_PCMCIA_RADIO) #include /* Note : will define WIRELESS_EXT */ #endif /* CONFIG_NET_RADIO || CONFIG_NET_PCMCIA_RADIO */ @@ -664,8 +665,13 @@ * Call device private open method */ - if (dev->open) + if (dev->open) { + if (dev->owner) + __MOD_INC_USE_COUNT(dev->owner); ret = dev->open(dev); + if (ret != 0 && dev->owner) + __MOD_DEC_USE_COUNT(dev->owner); + } /* * If it went open OK then: @@ -780,6 +786,13 @@ * Tell people we are down */ notifier_call_chain(&netdev_chain, NETDEV_DOWN, dev); + + /* + * Drop the 
module refcount + */ + if (dev->owner) { + __MOD_DEC_USE_COUNT(dev->owner); + } return(0); } --------------41373E28E27635F9DC73A463-- From owner-netdev@oss.sgi.com Mon Jun 19 08:21:13 2000 Received: by oss.sgi.com id ; Mon, 19 Jun 2000 08:21:03 -0700 Received: from ppp0.ocs.com.au ([203.34.97.3]:15632 "HELO mail.ocs.com.au") by oss.sgi.com with SMTP id ; Mon, 19 Jun 2000 08:20:49 -0700 Received: (qmail 31641 invoked by uid 502); 19 Jun 2000 15:20:44 -0000 Received: (qmail 31628 invoked from network); 19 Jun 2000 15:20:42 -0000 Received: from ocs3.ocs-net (192.168.255.3) by mail.ocs.com.au with SMTP; 19 Jun 2000 15:20:42 -0000 X-Mailer: exmh version 2.1.1 10/15/1999 From: Keith Owens To: Andrew Morton cc: "netdev@oss.sgi.com" Subject: Re: modular net drivers In-reply-to: Your message of "Mon, 19 Jun 2000 23:54:30 +1000." <394E2616.C25F8376@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Tue, 20 Jun 2000 01:20:41 +1000 Message-ID: <4516.961428041@ocs3.ocs-net> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, 19 Jun 2000 23:54:30 +1000, Andrew Morton wrote: >--- linux-2.4.0-test1-ac21/net/core/dev.c Mon Jun 19 16:47:54 2000 >+++ linux-akpm/net/core/dev.c Mon Jun 19 21:28:14 2000 >@@ -664,8 +665,13 @@ > * Call device private open method > */ > >- if (dev->open) >+ if (dev->open) { >+ if (dev->owner) >+ __MOD_INC_USE_COUNT(dev->owner); > ret = dev->open(dev); >+ if (ret != 0 && dev->owner) >+ __MOD_DEC_USE_COUNT(dev->owner); >+ } Racy. The module referred to by dev->owner might be in the middle of being unloaded. You need try_inc_mod_count() with a check on its result instead of __MOD_INC_USE_COUNT. 
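
The check-the-result pattern suggested above can be sketched in user space as follows. This is only a minimal model, not kernel code: every name prefixed with model_ is hypothetical, the "deleted" flag stands in for whatever state marks a module as being unloaded, and the locking a real try-increment would need around the check and the increment is left out because the model is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are not the kernel's types. */
struct model_module {
    int use_count;   /* number of live references */
    int deleted;     /* nonzero once unload has begun */
};

struct model_netdev {
    struct model_module *owner;             /* NULL for a built-in driver */
    int (*open)(struct model_netdev *dev);  /* may be NULL */
};

/* Take a reference unless the module is already going away.
 * Returns 1 on success, 0 on failure.  A real implementation would
 * perform the check and the increment under a lock. */
static int model_try_inc(struct model_module *mod)
{
    if (mod == NULL)
        return 1;
    if (mod->deleted)
        return 0;
    mod->use_count++;
    return 1;
}

/* Drop a reference previously taken with model_try_inc(). */
static void model_dec(struct model_module *mod)
{
    if (mod == NULL)
        return;
    assert(mod->use_count > 0);
    mod->use_count--;
}

/* Returns 0 on success, a negative value on failure. */
static int model_dev_open(struct model_netdev *dev)
{
    int ret = 0;

    /* Refuse to enter a module that is being unloaded. */
    if (!model_try_inc(dev->owner))
        return -1;

    /* The reference is held whether or not an open method exists,
     * so a later close always has something to drop. */
    if (dev->open)
        ret = dev->open(dev);

    /* A failed open must not leave a reference behind. */
    if (ret != 0)
        model_dec(dev->owner);
    return ret;
}

/* Pairs with a successful model_dev_open(). */
static void model_dev_close(struct model_netdev *dev)
{
    model_dec(dev->owner);
}

/* Two trivial open methods used to exercise the model. */
static int model_open_ok(struct model_netdev *dev)   { (void)dev; return 0; }
static int model_open_fail(struct model_netdev *dev) { (void)dev; return -5; }
```

In this model the count is taken by the caller before any code belonging to the owner runs, and the result of the try-increment decides whether the owner is entered at all.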
From owner-netdev@oss.sgi.com Mon Jun 19 08:21:33 2000 Received: by oss.sgi.com id ; Mon, 19 Jun 2000 08:21:14 -0700 Received: from zikova.cvut.cz ([147.32.235.100]:5383 "EHLO zikova.cvut.cz") by oss.sgi.com with ESMTP id ; Mon, 19 Jun 2000 08:20:50 -0700 Received: from vcnet.vc.cvut.cz (vcnet.vc.cvut.cz [147.32.240.61]) by zikova.cvut.cz (8.9.0.Beta5/8.9.0.Beta5) with ESMTP id RAA16970; Mon, 19 Jun 2000 17:20:15 +0200 Received: from VCNET/SpoolDir by vcnet.vc.cvut.cz (Mercury 1.21); 19 Jun 100 17:20:15 MET-1MEST Received: from SpoolDir by VCNET (Mercury 1.30); 19 Jun 100 17:20:09 MET-1MEST From: "Petr Vandrovec" Organization: CC CTU Prague To: andrewm@uow.edu.au Date: Mon, 19 Jun 2000 17:20:04 MET-1 MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7BIT Subject: Re: modular net drivers CC: netdev@oss.sgi.com X-mailer: Pegasus Mail v3.40 Message-ID: <600FE22630F@vcnet.vc.cvut.cz> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On 19 Jun 00 at 23:54, Andrew Morton wrote: > @@ -664,8 +665,13 @@ > * Call device private open method > */ > > - if (dev->open) > + if (dev->open) { > + if (dev->owner) > + __MOD_INC_USE_COUNT(dev->owner); > ret = dev->open(dev); > + if (ret != 0 && dev->owner) > + __MOD_DEC_USE_COUNT(dev->owner); > + } You should change it to if (dev->owner) __MOD_INC_USE_COUNT(dev->owner); if (dev->open) ret = dev->open(dev); if (ret != 0 && dev->owner) __MOD_DEC_USE_COUNT(dev->owner); ... as 'ret' is preinitialized to 0, so NULL ->open is allowed - your code will decrement usage count below zero in release code for devices with NULL open... (there are no such devices just now as currently you must do at least MOD_INC_USE_COUNT in open, but in future... Or change it to if (!dev->open) BUG(); ...) Best regards, Petr Vandrovec vandrove@vc.cvut.cz P.S.: It would be really nice if each 2.4.0-acXX could differ in LINUX_VERSION_CODE... 
It is really hard to maintain a module (vmmon/vmnet) for 22 different
kernels which all present as 2.4.0 :-( And we have even more 2.3.99
flavors...

From owner-netdev@oss.sgi.com Mon Jun 19 09:03:54 2000
Received: by oss.sgi.com id ; Mon, 19 Jun 2000 09:03:34 -0700
Received: from smtprch1.nortelnetworks.com ([192.135.215.14]:63441 "EHLO smtprch1.nortel.com") by oss.sgi.com with ESMTP id ; Mon, 19 Jun 2000 09:03:11 -0700
Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch1.nortel.com; Mon, 19 Jun 2000 11:01:42 -0500
Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id MPD1ZH4K; Mon, 19 Jun 2000 11:02:37 -0500
Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8BDJ; Tue, 20 Jun 2000 02:02:39 +1000
Received: from uow.edu.au (IDENT:akpm@[47.181.194.111]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id CAA10603; Tue, 20 Jun 2000 02:02:22 +1000
Message-ID: <394E4517.9028C53B@uow.edu.au>
Date: Tue, 20 Jun 2000 02:06:47 +1000
X-Sybari-Space: 00000000 00000000 00000000
From: Andrew Morton
X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586)
X-Accept-Language: en
MIME-Version: 1.0
To: Keith Owens
CC: "netdev@oss.sgi.com"
Subject: Re: modular net drivers
References: Your message of "Mon, 19 Jun 2000 23:54:30 +1000." <394E2616.C25F8376@uow.edu.au> <4516.961428041@ocs3.ocs-net>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Keith Owens wrote:
>
> Racy. The module referred to by dev->owner might be in the middle of
> being unloaded. You need try_inc_mod_count() with a check on its
> result instead of __MOD_INC_USE_COUNT.

eww. That would mean the existing code is racy.
I assumed it took care of this elsewhere. Maybe not. try_inc_mod_count looks good. Petr Vandrovec wrote: > > > You should change it to > if (dev->owner) > __MOD_INC_USE_COUNT(dev->owner); > if (dev->open) > ret = dev->open(dev); > if (ret != 0 && dev->owner) > __MOD_DEC_USE_COUNT(dev->owner); > ... From owner-netdev@oss.sgi.com Mon Jun 19 17:18:37 2000 Received: by oss.sgi.com id ; Mon, 19 Jun 2000 17:18:17 -0700 Received: from host.dsl.speakeasy.net ([216.254.93.178]:55546 "EHLO vaio.greennet") by oss.sgi.com with ESMTP id ; Mon, 19 Jun 2000 17:17:54 -0700 Received: from localhost (becker@localhost) by vaio.greennet (8.9.3/8.8.7) with ESMTP id UAA26990; Mon, 19 Jun 2000 20:19:07 -0400 Date: Mon, 19 Jun 2000 20:19:06 -0400 (EDT) From: Donald Becker X-Sender: becker@vaio.greennet To: Andrew Morton cc: "netdev@oss.sgi.com" Subject: Re: modular net drivers In-Reply-To: <394E2616.C25F8376@uow.edu.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, 19 Jun 2000, Andrew Morton wrote: > As you may have noticed, Al Viro is running around the kernel getting > rid of MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT. His long-term plan > is to remove MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT completely. > I'm looking into the changes required for the net drivers. Hmmm, this is curious. Should not the "feature freeze" phase come well after the "interface freeze"? This is pretty obviously an interface change. > - All 2.4-only netdrivers can have all their MOD_DEC and MOD_INC calls > removed. All that twisty logic to keep track of the counts can be > tossed. A single > > SET_NETDEVICE_OWNER(dev); "Twisty logic"? Most of the drivers have straight-forward use counts. How is this new method any simpler? If anything, it seems to be more complex without any obvious benefit. Donald Becker becker@scyld.com Scyld Computing Corporation http://www.scyld.com 410 Severn Ave. 
Suite 210 Annapolis MD 21403 From owner-netdev@oss.sgi.com Mon Jun 19 17:33:07 2000 Received: by oss.sgi.com id ; Mon, 19 Jun 2000 17:32:58 -0700 Received: from deliverator.sgi.com ([204.94.214.10]:57617 "EHLO deliverator.sgi.com") by oss.sgi.com with ESMTP id ; Mon, 19 Jun 2000 17:32:37 -0700 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via SMTP id RAA07743 for ; Mon, 19 Jun 2000 17:27:39 -0700 (PDT) mail_from (kaos@kao2.melbourne.sgi.com) Received: from kao2.melbourne.sgi.com (kao2.melbourne.sgi.com [134.14.55.180]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id KAA13452; Tue, 20 Jun 2000 10:29:59 +1000 X-Mailer: exmh version 2.1.1 10/15/1999 From: Keith Owens To: Donald Becker cc: Andrew Morton , "netdev@oss.sgi.com" Subject: Re: modular net drivers In-reply-to: Your message of "Mon, 19 Jun 2000 20:19:06 -0400." Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Tue, 20 Jun 2000 10:29:59 +1000 Message-ID: <3292.961460999@kao2.melbourne.sgi.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Mon, 19 Jun 2000 20:19:06 -0400 (EDT), Donald Becker wrote: >On Mon, 19 Jun 2000, Andrew Morton wrote: > >> As you may have noticed, Al Viro is running around the kernel getting >> rid of MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT. His long-term plan >> is to remove MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT completely. >> I'm looking into the changes required for the net drivers. > >Hmmm, this is curious. >Should not the "feature freeze" phase come well after the "interface freeze"? >This is pretty obviously an interface change. It is also an important bug fix. The module code has suffered from unload races ever since the kernel locking became fine grained, users can crash the kernel. >> - All 2.4-only netdrivers can have all their MOD_DEC and MOD_INC calls >> removed. 
All that twisty logic to keep track of the counts can be >> tossed. A single >> >> SET_NETDEVICE_OWNER(dev); > >"Twisty logic"? Most of the drivers have straight-forward use counts. >How is this new method any simpler? If anything, it seems to be more >complex without any obvious benefit. There are inherent unload races when code that lives inside a module tries to adjust the use count on that module. To the extent that the code pages can be deleted underneath the code that is executing! Module use counts need to be set before entering the module, not after the module code has started executing. From owner-netdev@oss.sgi.com Mon Jun 19 17:59:37 2000 Received: by oss.sgi.com id ; Mon, 19 Jun 2000 17:59:17 -0700 Received: from smtprch2.nortelnetworks.com ([192.135.215.15]:59108 "EHLO smtprch2.nortel.com") by oss.sgi.com with ESMTP id ; Mon, 19 Jun 2000 17:58:50 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch2.nortel.com; Mon, 19 Jun 2000 19:55:25 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NH9NYPPX; Mon, 19 Jun 2000 19:58:29 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8BRB; Tue, 20 Jun 2000 10:58:31 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id KAA14759; Tue, 20 Jun 2000 10:58:17 +1000 Message-ID: <394EC1A8.46941D36@uow.edu.au> Date: Tue, 20 Jun 2000 00:58:16 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Donald Becker CC: "netdev@oss.sgi.com" Subject: Re: modular net drivers References: <394E2616.C25F8376@uow.edu.au> Content-Type: text/plain; charset=us-ascii 
Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Donald Becker wrote: > Hello, Donald. > On Mon, 19 Jun 2000, Andrew Morton wrote: > > > As you may have noticed, Al Viro is running around the kernel getting > > rid of MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT. His long-term plan > > is to remove MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT completely. > > I'm looking into the changes required for the net drivers. > > Hmmm, this is curious. > Should not the "feature freeze" phase come well after the "interface freeze"? > This is pretty obviously an interface change. Yes, I find it amusing that when something stable is imminent, people start madly running around cleaning up stuff that should have been done months/years ago. Oh well. Better than not doing it. Linus' attitude is that interface changes should go into 2.4 even if they're large/risky, because he seeks to minimise the differences between 2.4 and 2.5, so problem fixes can be fed between the two. This is unconventional and useful. > > - All 2.4-only netdrivers can have all their MOD_DEC and MOD_INC calls > > removed. All that twisty logic to keep track of the counts can be > > tossed. A single > > > > SET_NETDEVICE_OWNER(dev); > > "Twisty logic"? Most of the drivers have straight-forward use counts. They used to, but it was risky. The problem with a simple MOD_INC_USE_COUNT at the end of open() is that the preceding kmalloc()'s can sleep while the module refcount is zero, exposing an opportunity for the module to be unloaded while running. So the "fix" was to move the MOD_INC_USE_COUNT to the top of open() and to do complementary MOD_DEC calls wherever the open() method takes an error path. It's pretty stinky and still doesn't stop the races. > How is this new method any simpler? If anything, it seems to be more > complex without any obvious benefit. 
Well, even if it wasn't for the race-avoidance issue, it's always nice to suck out duplicated code and put it in one place. From owner-netdev@oss.sgi.com Tue Jun 20 03:42:05 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 03:41:55 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:31371 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 03:41:32 -0700 Received: from fred.muc.de (none@ns1165.munich.netsurf.de [195.180.235.165]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id MAA05862; Tue, 20 Jun 2000 12:41:03 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 134LTW-0000Rq-00; Tue, 20 Jun 2000 12:41:46 +0200 Date: Tue, 20 Jun 2000 12:41:46 +0200 From: Andi Kleen To: Keith Owens Cc: Donald Becker , Andrew Morton , "netdev@oss.sgi.com" Subject: Re: modular net drivers Message-ID: <20000620124146.A1375@fred.muc.de> References: <3292.961460999@kao2.melbourne.sgi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <3292.961460999@kao2.melbourne.sgi.com>; from Keith Owens on Tue, Jun 20, 2000 at 02:33:47AM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Tue, Jun 20, 2000 at 02:33:47AM +0200, Keith Owens wrote: > >> - All 2.4-only netdrivers can have all their MOD_DEC and MOD_INC calls > >> removed. All that twisty logic to keep track of the counts can be > >> tossed. A single > >> > >> SET_NETDEVICE_OWNER(dev); > > > >"Twisty logic"? Most of the drivers have straight-forward use counts. > >How is this new method any simpler? If anything, it seems to be more > >complex without any obvious benefit. > > There are inherent unload races when code that lives inside a module > tries to adjust the use count on that module. To the extent that the > code pages can be deleted underneath the code that is executing! 
> Module use counts need to be set before entering the module, not after > the module code has started executing. At least for open/close() that is not true -- rtnl_lock() protects against that. For that there are the same rules as in 2.2 (INC before first sleep, DEC after last sleep) -Andi -- This is like TV. I don't like TV. From owner-netdev@oss.sgi.com Tue Jun 20 03:45:54 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 03:45:45 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:18064 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 03:45:31 -0700 Received: from fred.muc.de (none@ns1165.munich.netsurf.de [195.180.235.165]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id MAA06371; Tue, 20 Jun 2000 12:45:15 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 134LXp-0000Tg-00; Tue, 20 Jun 2000 12:46:13 +0200 Date: Tue, 20 Jun 2000 12:46:13 +0200 From: Andi Kleen To: Andi Kleen Cc: Keith Owens , Donald Becker , Andrew Morton , "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 Message-ID: <20000620124613.A1833@fred.muc.de> References: <3292.961460999@kao2.melbourne.sgi.com> <20000620124146.A1375@fred.muc.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <20000620124146.A1375@fred.muc.de>; from Andi Kleen on Tue, Jun 20, 2000 at 12:41:46PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Tue, Jun 20, 2000 at 12:41:46PM +0200, Andi Kleen wrote: > On Tue, Jun 20, 2000 at 02:33:47AM +0200, Keith Owens wrote: > > >> - All 2.4-only netdrivers can have all their MOD_DEC and MOD_INC calls > > >> removed. All that twisty logic to keep track of the counts can be > > >> tossed. A single > > >> > > >> SET_NETDEVICE_OWNER(dev); > > > > > >"Twisty logic"? Most of the drivers have straight-forward use counts. > > >How is this new method any simpler? 
If anything, it seems to be more
> > >complex without any obvious benefit.
> >
> > There are inherent unload races when code that lives inside a module
> > tries to adjust the use count on that module. To the extent that the
> > code pages can be deleted underneath the code that is executing!
> > Module use counts need to be set before entering the module, not after
> > the module code has started executing.
>
> At least for open/close() that is not true -- rtnl_lock() protects against
> that. For that there are the same rules as in 2.2 (INC before first sleep,
> DEC after last sleep)

Ok, there is still a small race with the actual module unload. I think
the cleanest solution is to let open/close run in the big kernel lock.
They are not performance critical anyways.

Comments ?

Looks much preferable to changing all the network drivers at this stage
at least.

-Andi

--- linux/net/core/dev.c-devlock Tue Jun 20 00:58:33 2000
+++ linux/net/core/dev.c Tue Jun 20 12:38:25 2000
@@ -89,6 +89,7 @@
 #include
 #include
 #include
+#include
 #if defined(CONFIG_NET_RADIO) || defined(CONFIG_NET_PCMCIA_RADIO)
 #include /* Note : will define WIRELESS_EXT */
 #endif /* CONFIG_NET_RADIO || CONFIG_NET_PCMCIA_RADIO */
@@ -2091,10 +2092,12 @@
 		case SIOCSIFNAME:
 			if (!capable(CAP_NET_ADMIN))
 				return -EPERM;
+			lock_kernel();
 			dev_load(ifr.ifr_name);
 			rtnl_lock();
 			ret = dev_ifsioc(&ifr, cmd);
 			rtnl_unlock();
+			unlock_kernel();
 			return ret;

 		case SIOCGIFMEM:

From owner-netdev@oss.sgi.com Tue Jun 20 05:03:45 2000
Received: by oss.sgi.com id ; Tue, 20 Jun 2000 05:03:35 -0700
Received: from [203.126.247.144] ([203.126.247.144]:8429 "EHLO zsngs001") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 05:03:06 -0700
Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by zsngs001; Tue, 20 Jun 2000 20:02:42 +0800
Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NBKHBTN6; Tue, 20 Jun
2000 20:02:45 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8CPQ; Tue, 20 Jun 2000 22:02:48 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.130]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id WAA18430; Tue, 20 Jun 2000 22:02:29 +1000 Message-ID: <394F5E62.DE02F835@uow.edu.au> Date: Tue, 20 Jun 2000 22:06:58 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: Andi Kleen CC: Keith Owens , Donald Becker , "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 References: <3292.961460999@kao2.melbourne.sgi.com> <20000620124146.A1375@fred.muc.de>, <20000620124146.A1375@fred.muc.de>; from Andi Kleen on Tue, Jun 20, 2000 at 12:41:46PM +0200 <20000620124613.A1833@fred.muc.de> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andi Kleen wrote: > > > Ok, there is still a small race with the actual module unload. I think > the cleanest solution is to let open/close run in the big kernel lock. > They are not performance critical anyways. > > Comments ? - devinet_ioctl() calls dev_change_flags() direct, thus neatly bypassing your lock_kernel() :( - sys_ioctl() and sys_delete_module() both already claim the big lock, so where's the race anyway? I feel I'm missing something.. 
From owner-netdev@oss.sgi.com Tue Jun 20 09:01:36 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 09:01:26 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:60822 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 09:01:05 -0700 Received: from fred.muc.de (none@ns1165.munich.netsurf.de [195.180.235.165]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id SAA21246; Tue, 20 Jun 2000 18:00:23 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 134Q77-000143-00; Tue, 20 Jun 2000 17:38:57 +0200 Date: Tue, 20 Jun 2000 17:38:57 +0200 From: Andi Kleen To: Andrew Morton Cc: Andi Kleen , Keith Owens , Donald Becker , "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 Message-ID: <20000620173857.A4089@fred.muc.de> References: <394F5E62.DE02F835@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <394F5E62.DE02F835@uow.edu.au>; from Andrew Morton on Tue, Jun 20, 2000 at 02:04:35PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Tue, Jun 20, 2000 at 02:04:35PM +0200, Andrew Morton wrote: > > - sys_ioctl() and sys_delete_module() both already claim > the big lock, so where's the race anyway? I feel I'm missing > something.. Ugh, I missed that. Ok, with that there is no race. Even better :-) I guess there are some time critical ioctls that should be run outside kernel lock though. It is far too late to audit them all though. -Andi -- This is like TV. I don't like TV. 
From owner-netdev@oss.sgi.com Tue Jun 20 09:48:57 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 09:48:47 -0700 Received: from dialup185.canberra.net.au ([203.33.188.57]:1284 "HELO halfway") by oss.sgi.com with SMTP id ; Tue, 20 Jun 2000 09:48:43 -0700 Received: from linuxcare.com.au (localhost [127.0.0.1]) by halfway (Postfix) with ESMTP id D030D817F; Tue, 20 Jun 2000 17:01:56 +1000 (EST) From: Rusty Russell To: Keith Owens Cc: "netdev@oss.sgi.com" Subject: Re: modular net drivers In-reply-to: Your message of "Tue, 20 Jun 2000 10:29:59 +1000." <3292.961460999@kao2.melbourne.sgi.com> Date: Tue, 20 Jun 2000 17:01:56 +1000 Message-Id: <20000620070156.D030D817F@halfway> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing In message <3292.961460999@kao2.melbourne.sgi.com> you write: > It is also an important bug fix. The module code has suffered from > unload races ever since the kernel locking became fine grained, users > can crash the kernel. Races which can be largely solved at the moment by having the module page removal code sync all bh's and softirqs after calling cleanup(). Hell, we could even poll all CPUs and check they're not executing in the about-to-be-freed pages. Speed is completely unimportant here. Let's be clear: embedding a struct module *owner in every registerable structure is the path to bloated insanity. Other avenues should be explored. ``A 2.4 Kernel In Our Time!'' Rusty. -- Hacking time. 
From owner-netdev@oss.sgi.com Tue Jun 20 12:52:19 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 12:52:09 -0700 Received: from ha2.rdc1.tx.home.com ([24.4.0.67]:59883 "EHLO lh2.rdc1.tx.home.com") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 12:51:50 -0700 Received: from mypad.com ([24.1.13.170]) by lh2.rdc1.tx.home.com (InterMail vM.4.01.02.00 201-229-116) with ESMTP id <20000620195150.TUHM11517.lh2.rdc1.tx.home.com@mypad.com> for ; Tue, 20 Jun 2000 12:51:50 -0700 Message-ID: <394FCC27.9785D5FA@mypad.com> Date: Tue, 20 Jun 2000 13:55:19 -0600 From: gg&ht forever Reply-To: lighthouse@mypad.com X-Mailer: Mozilla 4.7 [en] (Win98; I) X-Accept-Language: en MIME-Version: 1.0 To: net dev Subject: routing table/cache question Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello, This is a Linux 2.2 question, although I suspect that it applies to 2.4 as well. Please feel free to direct me to RTFM or another list. I've notice in 2.2.x that ICMP redirects don't get put in the routing table but do get put into the routing cache, that is the one that you can look at with route -Cn. There's no flag for redirects there but the host entry appears. Is this understanding correct? Also can someone give me some insights into the use of the ibl flags there? I can see that b means broadcast (also apparent from the rt_cache.h file) and l means local (as in local destination?), and i seems only to apply to the lo interface. Is that correct? Is there a write-up on this somewhere? What really bothers me about the route cache are that there seem to be some odd entries. For example, suppose my host is 172.16.1.1 and I ping some external Internet host that I'll just call w.x.y.z (so nobody can say I'm picking on them). Also suppose that I have my default router on 172.16.1.1 set up so that when I ping w.x.y.z I get a redirect. OK. 
Ethereal reports one redirect and then all subsequent communications go to the correct router. Also, route -Cn reports a host entry for the redirect to w.x.y.z. But, route -Cn also reports (I left out ref & use): Source Destination Gateway Flags Device ------ ----------- ------- ----- ------ w.x.y.z 172.16.1.1 172.16.1.1 l lo where the flag entry is a lower case L (hard to differentiate from a one [1]). What is this entry for??? Thanks for your time! Regards, Scott From owner-netdev@oss.sgi.com Tue Jun 20 14:50:40 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 14:50:19 -0700 Received: from ppp0.ocs.com.au ([203.34.97.3]:34576 "HELO mail.ocs.com.au") by oss.sgi.com with SMTP id ; Tue, 20 Jun 2000 14:49:50 -0700 Received: (qmail 15774 invoked by uid 502); 20 Jun 2000 21:49:45 -0000 Received: (qmail 15750 invoked from network); 20 Jun 2000 21:49:40 -0000 Received: from ocs3.ocs-net (192.168.255.3) by mail.ocs.com.au with SMTP; 20 Jun 2000 21:49:40 -0000 X-Mailer: exmh version 2.1.1 10/15/1999 From: Keith Owens To: Rusty Russell cc: "netdev@oss.sgi.com" , Donald Becker , Andrew Morton Subject: Re: modular net drivers In-reply-to: Your message of "Tue, 20 Jun 2000 17:01:56 +1000." <20000620070156.D030D817F@halfway> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Wed, 21 Jun 2000 07:49:40 +1000 Message-ID: <4450.961537780@ocs3.ocs-net> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Tue, 20 Jun 2000 17:01:56 +1000, Rusty Russell wrote: >Keith Owens wrote: >> It is also an important bug fix. The module code has suffered from >> unload races ever since the kernel locking became fine grained, users >> can crash the kernel. > >Races which can be largely solved at the moment by having the module >page removal code sync all bh's and softirqs after calling cleanup(). >Hell, we could even poll all CPUs and check they're not executing in >the about-to-be-freed pages. Speed is completely unimportant here. 
This race is not obvious but IMHO it exists. The original theory was:

  Kernel load and unload code runs under the big kernel lock.
  open() and similar code runs under the big kernel lock.
  If the code does MOD_INC_USE_COUNT before sleeping then we are safe.

But consider this race, even on UP.

  Module has been used, nothing is currently using it, use_count == 0.
  rmmod runs, either manual or autoclean.
  The module is marked as being deleted.
  module_cleanup() is entered, does I/O, sleeps, loses big kernel lock.
  open() is entered, calls the module code, does MOD_INC_USE_COUNT.
  open() code in module sleeps, losing big kernel lock.
  module_cleanup() resumes, gets big kernel lock, unloads the module.
  open() code in module resumes - the code pages have gone.

Checking in module_cleanup() to see if the use count has changed is not a solution. module_cleanup() may already have destroyed structures that the open() code expects to use, either immediately or later.

Polling bh and softirq is not the answer either. While the code sleeps it is in schedule() - the only thing that says a process is in a module is some return address 4+ layers up the stack! Linus suggested only allowing module unload when all processors were idle but that has the same problem - schedule is idle.

AFAICT the only safe mechanism is one that checks the module state *before* entering the module. Once you enter the module and sleep all bets are off. And that means exporting the module information to the open() layer, which is what Al Viro has been doing.
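
The "check the module state before entering the module" idea can be sketched in user-space C. This is a toy model only, not kernel code and not Al Viro's patch: every name below (toy_module, toy_try_get, toy_begin_unload, toy_core_open, toy_driver_open) is hypothetical, and a C11 atomic stands in for whatever lock would serialise unload against entry.

```c
/* Toy user-space model; all identifiers are hypothetical, not kernel APIs. */
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_DELETING (-1)

/* state >= 0: current use count; state == TOY_DELETING: unload has started */
struct toy_module {
    atomic_int state;
};

static void toy_module_init(struct toy_module *m)
{
    atomic_init(&m->state, 0);
}

/* Called by the core layer BEFORE jumping into module code.
 * Takes a reference unless the module is already being deleted. */
static bool toy_try_get(struct toy_module *m)
{
    int old = atomic_load(&m->state);

    while (old != TOY_DELETING) {
        /* On failure, 'old' is reloaded and the loop re-checks it. */
        if (atomic_compare_exchange_weak(&m->state, &old, old + 1))
            return true;
    }
    return false;
}

static void toy_put(struct toy_module *m)
{
    atomic_fetch_sub(&m->state, 1);
}

/* Start of unload: succeeds only while nobody holds a reference.
 * After this, toy_try_get() fails, so cleanup may sleep safely. */
static bool toy_begin_unload(struct toy_module *m)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&m->state, &expected, TOY_DELETING);
}

/* Stand-in for a driver's open routine (it could sleep in real life). */
static int toy_driver_open(void)
{
    return 0;
}

/* Core-layer open: the module's code only runs with a reference held.
 * On success the reference is kept until the matching close. */
static int toy_core_open(struct toy_module *m, int (*mod_open)(void))
{
    int err;

    if (!toy_try_get(m))
        return -1;          /* module is going away: never enter it */
    err = mod_open();
    if (err)
        toy_put(m);         /* open failed, drop the reference again */
    return err;
}
```

In this model the sequence from the message above cannot happen: once toy_begin_unload() has succeeded, a later toy_core_open() is refused before any module code runs, and while an open holds a reference the unload cannot begin.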
From owner-netdev@oss.sgi.com Tue Jun 20 14:53:19 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 14:53:10 -0700 Received: from ppp0.ocs.com.au ([203.34.97.3]:36368 "HELO mail.ocs.com.au") by oss.sgi.com with SMTP id ; Tue, 20 Jun 2000 14:52:54 -0700 Received: (qmail 15828 invoked by uid 502); 20 Jun 2000 21:52:51 -0000 Received: (qmail 15821 invoked from network); 20 Jun 2000 21:52:49 -0000 Received: from ocs3.ocs-net (192.168.255.3) by mail.ocs.com.au with SMTP; 20 Jun 2000 21:52:49 -0000 X-Mailer: exmh version 2.1.1 10/15/1999 From: Keith Owens To: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 In-reply-to: Your message of "Tue, 20 Jun 2000 17:38:57 +0200." <20000620173857.A4089@fred.muc.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Wed, 21 Jun 2000 07:52:48 +1000 Message-ID: <4490.961537968@ocs3.ocs-net> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Tue, 20 Jun 2000 17:38:57 +0200, Andi Kleen wrote: >On Tue, Jun 20, 2000 at 02:04:35PM +0200, Andrew Morton wrote: >> - sys_ioctl() and sys_delete_module() both already claim >> the big lock, so where's the race anyway? I feel I'm missing >> something.. >I guess there are some time critical ioctls that should be run outside >kernel lock though. It is far too late to audit them all though. ioctls are not a problem, as long as they use a file descriptor, i.e. no global ioctls. Getting a file descriptor requires open() or its equivalent which set the module use_count. The race is in open, I don't know of any races after use_count is set and open() has complete and left the module. 
From owner-netdev@oss.sgi.com Tue Jun 20 15:59:00 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 15:58:40 -0700 Received: from panic.ohr.gatech.edu ([130.207.47.194]:12304 "EHLO havoc.gtf.org") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 15:58:11 -0700 Received: from mandrakesoft.com (adsl-77-228-135.atl.bellsouth.net [216.77.228.135]) by havoc.gtf.org (8.9.3/8.9.3) with ESMTP id SAA13384; Tue, 20 Jun 2000 18:57:59 -0400 Message-ID: <394FF6F8.21D572F9@mandrakesoft.com> Date: Tue, 20 Jun 2000 18:58:00 -0400 From: Jeff Garzik Organization: MandrakeSoft X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.17pre4 i686) X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com CC: "David S. Miller" Subject: no-net build broken... Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing building without CONFIG_NET fails to link... net/network.o: In function `qdisc_restart': net/network.o(.text+0x45b3): undefined reference to `netdev_nit' net/network.o(.text+0x45bd): undefined reference to `dev_queue_xmit_nit' net/network.o(.text+0x461d): undefined reference to `softnet_data' net/network.o(.text+0x4626): undefined reference to `softnet_data' net/network.o: In function `dev_watchdog': net/network.o(.text+0x46d2): undefined reference to `netdev_finish_unregister' net/network.o: In function `noop_requeue': net/network.o(.text+0x47ce): undefined reference to `net_ratelimit' make: *** [vmlinux] Error 1 -- Jeff Garzik | Building 1024 | Free beer tomorrow. MandrakeSoft, Inc. 
| From owner-netdev@oss.sgi.com Tue Jun 20 17:43:23 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 17:43:13 -0700 Received: from smtprch1.nortelnetworks.com ([192.135.215.14]:10627 "EHLO smtprch1.nortel.com") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 17:43:02 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch1.nortel.com; Tue, 20 Jun 2000 19:41:42 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NKANVL97; Tue, 20 Jun 2000 19:42:46 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8DC5; Wed, 21 Jun 2000 10:42:47 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id KAA26949; Wed, 21 Jun 2000 10:42:41 +1000 Message-ID: <39500F81.46F5A826@uow.edu.au> Date: Wed, 21 Jun 2000 00:42:41 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Andi Kleen CC: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 References: <394F5E62.DE02F835@uow.edu.au>, <394F5E62.DE02F835@uow.edu.au>; from Andrew Morton on Tue, Jun 20, 2000 at 02:04:35PM +0200 <20000620173857.A4089@fred.muc.de> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andi Kleen wrote: > > On Tue, Jun 20, 2000 at 02:04:35PM +0200, Andrew Morton wrote: > > > > - sys_ioctl() and sys_delete_module() both already claim > > the big lock, so where's the race anyway? I feel I'm missing > > something.. > > Ugh, I missed that. Ok, with that there is no race. Even better :-) I think I lied. Look at this: int sock_ioctl(...) 
{ struct socket *sock; int err; unlock_kernel(); sock = socki_lookup(inode); err = sock->ops->ioctl(sock, cmd, arg); lock_kernel(); return err; } From owner-netdev@oss.sgi.com Tue Jun 20 18:20:23 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 18:20:13 -0700 Received: from [203.126.247.144] ([203.126.247.144]:916 "EHLO zsngs001") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 18:19:49 -0700 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by zsngs001; Wed, 21 Jun 2000 09:18:44 +0800 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NBKHBV75; Wed, 21 Jun 2000 09:18:48 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8DF1; Wed, 21 Jun 2000 11:18:51 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id LAA27176; Wed, 21 Jun 2000 11:18:43 +1000 Message-ID: <395017F3.516165AD@uow.edu.au> Date: Wed, 21 Jun 2000 01:18:43 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Keith Owens CC: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 References: Your message of "Tue, 20 Jun 2000 17:38:57 +0200." <20000620173857.A4089@fred.muc.de> <4490.961537968@ocs3.ocs-net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Keith Owens wrote: > > On Tue, 20 Jun 2000 17:38:57 +0200, > Andi Kleen wrote: > >On Tue, Jun 20, 2000 at 02:04:35PM +0200, Andrew Morton wrote: > >> - sys_ioctl() and sys_delete_module() both already claim > >> the big lock, so where's the race anyway? 
I feel I'm missing > >> something.. > >I guess there are some time critical ioctls that should be run outside > >kernel lock though. It is far too late to audit them all though. > > ioctls are not a problem, as long as they use a file descriptor, i.e. > no global ioctls. Getting a file descriptor requires open() or its > equivalent which set the module use_count. The race is in open, I > don't know of any races after use_count is set and open() has complete > and left the module. I don't think you're right here, Keith. ioctls on the netdevice don't require a descriptor which is associated with dev->open(). For example, Donald's mii-diag application and ifconfig both call device-specific functions without ever having called dev->open(): modprobe driver mii-diag -v eth0 ifconfig -a In this example, both dev->ioctl() and dev->get_stats() are called while the module refcount is zero. So they're as risky as open(); these code paths need to be audited for races wrt kmalloc->schedule() opportunities. How about a totally different approach: In the module_exit() we locate all netdevices associated with this module and overwrite all their function pointers with the addresses of non-modular stub functions which return ENODEV. Then we don't have to worry about device methods being called after unload. The PCI code does it for us; not sure about non-PCI device management though. 
in xxx_probe1(): dev->owner = THIS_MODULE; /* Sorry, Rusty */ in xx_remove1(): zap_netdevice(dev); zap_netdevice(dev) { dev->open = err_open; dev->start_xmit = err_start_xmit; etc | From owner-netdev@oss.sgi.com Tue Jun 20 18:50:03 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 18:49:53 -0700 Received: from pneumatic-tube.sgi.com ([204.94.214.22]:10817 "EHLO pneumatic-tube.sgi.com") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 18:49:37 -0700 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by pneumatic-tube.sgi.com (980327.SGI.8.8.8-aspam/980310.SGI-aspam) via SMTP id SAA04067 for ; Tue, 20 Jun 2000 18:54:45 -0700 (PDT) mail_from (kaos@kao2.melbourne.sgi.com) Received: from kao2.melbourne.sgi.com (kao2.melbourne.sgi.com [134.14.55.180]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id LAA01954; Wed, 21 Jun 2000 11:48:15 +1000 X-Mailer: exmh version 2.1.1 10/15/1999 From: Keith Owens To: Andrew Morton cc: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 In-reply-to: Your message of "Wed, 21 Jun 2000 01:18:43 GMT." <395017F3.516165AD@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Wed, 21 Jun 2000 11:48:15 +1000 Message-ID: <6062.961552095@kao2.melbourne.sgi.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Wed, 21 Jun 2000 01:18:43 +0000, Andrew Morton wrote: >Keith Owens wrote: >> ioctls are not a problem, as long as they use a file descriptor, i.e. >> no global ioctls. Getting a file descriptor requires open() or its >> equivalent which set the module use_count. The race is in open, I >> don't know of any races after use_count is set and open() has complete >> and left the module. > >I don't think you're right here, Keith. Note I said "ioctls are not a problem, as long as they use a file descriptor". Getting the file descriptor sets the module reference count. 
But anything that accesses module code without a reference count is a bomb waiting to explode.

>ioctls on the netdevice don't require a descriptor which is associated
>with dev->open(). For example, Donald's mii-diag application and
>ifconfig both call device-specific functions without ever having called
>dev->open():
>
>modprobe driver
>mii-diag -v eth0
>ifconfig -a
>
>In this example, both dev->ioctl() and dev->get_stats() are called while
>the module refcount is zero. So they're as risky as open(); these code
>paths need to be audited for races wrt kmalloc->schedule()
>opportunities.

Absolutely. No module reference count == no safety. mii-diag gets a file descriptor but it is for a general socket and is not tied to the interface module in any way.

>How about a totally different approach:
>
>In the module_exit() we locate all netdevices associated with this
>module and overwrite all their function pointers with the addresses of
>non-modular stub functions which return ENODEV. Then we don't have to
>worry about device methods being called after unload.

Change every module that registers anything to make sure that they replace the register data with stubs on exit? And make sure that all of them do so before they sleep anywhere in module cleanup? It would work but is it the best solution?

The existing method of avoiding module races is beginning to look like a dead dog. Look at the constraints we have to run under :-

* All code that can ever call any module functions must either have a
  reference count on that module or must run under the same lock as the
  module unload (big kernel lock).
* Every module must be checked to see that it never sleeps before doing
  MOD_INC_USE_COUNT.
* Every module must be checked to see that it never sleeps after doing
  MOD_DEC_USE_COUNT.
* Every module that registers anything must change the registered
  functions in module cleanup and must do so before sleeping (new).

That is an awful lot of opportunities to make mistakes.
And forcing lots of code to run under the big kernel lock does not scale well. From owner-netdev@oss.sgi.com Tue Jun 20 19:39:12 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 19:39:03 -0700 Received: from [203.126.247.144] ([203.126.247.144]:11680 "EHLO zsngs001") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 19:38:42 -0700 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by zsngs001; Wed, 21 Jun 2000 10:37:58 +0800 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NBKHBWW7; Wed, 21 Jun 2000 10:38:01 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8DJD; Wed, 21 Jun 2000 12:38:05 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id MAA27669; Wed, 21 Jun 2000 12:37:59 +1000 Message-ID: <39502A86.6C8C6FC2@uow.edu.au> Date: Wed, 21 Jun 2000 02:37:58 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Keith Owens CC: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 References: Your message of "Wed, 21 Jun 2000 01:18:43 GMT." <395017F3.516165AD@uow.edu.au> <6062.961552095@kao2.melbourne.sgi.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Keith Owens wrote: > > Change every module that registers anything to make sure that they > replace the register data with stubs on exit? And make sure that all > of them do so before they sleep anywhere in module cleanup? It would > work but is it the best solution? Not when you put it that way. 
> The existing method of avoiding module races is beginning to look like
> a dead dog. Look at the constraints we have to run under :-
>
> * All code that can ever call any module functions must either have a
>   reference count on that module or must run under the same lock as the
>   module unload (big kernel lock).

Yes.

> * Every module must be checked to see that it never sleeps before doing
>   MOD_INC_USE_COUNT.
> * Every module must be checked to see that it never sleeps after doing
>   MOD_DEC_USE_COUNT.

These two can be avoided by hoisting the inc/dec up into the netdevice layer. But we need to wrap dev->get_stats(), dev->ioctl(), etc with inc/dec as well..

> * Every module that registers anything must change the registered
>   functions in module cleanup and must do so before sleeping (new).
>
> That is an awful lot of opportunities to make mistakes. And forcing
> lots of code to run under the big kernel lock does not scale well.

I am now remembering Alexey's disparaging comments about "self-modifying code". It sucks and I'm still seeking a single, centralised fix.

How about plan J (warning: inelegance approaching):

Module unload is a very rare occurrence, so let's penalise that and that alone.

We grab the ENTIRE machine within sys_delete_module.

Like, grab the big kernel lock, then wait until ALL other CPUs are spinning on the kernel lock, and then allow sys_delete_module to proceed.
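
The "hoist the inc/dec up into the netdevice layer" suggestion above might look roughly like the following user-space sketch. It is illustrative only: toy_owner, toy_netdev, toy_dev_get_stats and toy_driver_get_stats are hypothetical names, no concurrency is modelled, and in a real kernel the increment would still have to be made safely with respect to unload (for example under the same lock, or with a check that the module is not already being deleted).

```c
/* Toy user-space model; all identifiers are hypothetical, not kernel APIs. */
#include <stddef.h>

struct toy_owner {
    int use_count;              /* stands in for the module use count */
};

struct toy_stats {
    unsigned long rx_packets;
};

struct toy_netdev {
    struct toy_owner *owner;    /* which "module" implements this device */
    struct toy_stats *(*get_stats)(struct toy_netdev *dev);
    struct toy_stats stats;
    int seen_use_count;         /* records what the driver observed */
};

/* Driver method: note that it never touches the use count itself. */
static struct toy_stats *toy_driver_get_stats(struct toy_netdev *dev)
{
    dev->seen_use_count = dev->owner->use_count;
    return &dev->stats;
}

/* Core-layer wrapper: the inc/dec lives here, once, for every driver. */
static struct toy_stats *toy_dev_get_stats(struct toy_netdev *dev)
{
    struct toy_stats *s;

    dev->owner->use_count++;    /* taken before entering driver code */
    s = dev->get_stats(dev);
    dev->owner->use_count--;    /* dropped after leaving driver code */
    return s;
}
```

The design point is that drivers no longer need auditing for "INC before first sleep, DEC after last sleep"; only the wrapper has to get it right. The same wrapping would be needed for every other device method that can be reached with a zero count.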
From owner-netdev@oss.sgi.com Tue Jun 20 20:11:43 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 20:11:23 -0700 Received: from host.dsl.speakeasy.net ([216.254.93.178]:19703 "EHLO vaio.greennet") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 20:11:17 -0700 Received: from localhost (becker@localhost) by vaio.greennet (8.9.3/8.8.7) with ESMTP id XAA00738; Tue, 20 Jun 2000 23:12:36 -0400 Date: Tue, 20 Jun 2000 23:12:36 -0400 (EDT) From: Donald Becker X-Sender: becker@vaio.greennet To: Andrew Morton cc: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 In-Reply-To: <395017F3.516165AD@uow.edu.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Wed, 21 Jun 2000, Andrew Morton wrote: > In this example, both dev->ioctl() and dev->get_stats() are called while > the module refcount is zero. So they're as risky as open(); these code > paths need to be audited for races wrt kmalloc->schedule() > opportunities. Hmmm, now this is an important point. I just "know" that ioctl() and get_stats() should be "simple" functions that shouldn't do call any kernel function, except for perhaps printk(). This is not documented in the skeleton driver, or elsewhere. I'm guessing that the documentation should advise locking the module if any operation is done in get_stats() or private_ioctl() that might result in a reschedule. Donald Becker becker@scyld.com Scyld Computing Corporation http://www.scyld.com 410 Severn Ave. 
Suite 210 Annapolis MD 21403

From owner-netdev@oss.sgi.com Tue Jun 20 20:47:02 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 20:46:53 -0700 Received: from deliverator.sgi.com ([204.94.214.10]:9052 "EHLO deliverator.sgi.com") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 20:46:36 -0700 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via SMTP id UAA28653 for ; Tue, 20 Jun 2000 20:41:37 -0700 (PDT) mail_from (kaos@kao2.melbourne.sgi.com) Received: from kao2.melbourne.sgi.com (kao2.melbourne.sgi.com [134.14.55.180]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA02579; Wed, 21 Jun 2000 13:44:00 +1000 X-Mailer: exmh version 2.1.1 10/15/1999 From: Keith Owens To: Andrew Morton cc: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 In-reply-to: Your message of "Wed, 21 Jun 2000 02:37:58 GMT." <39502A86.6C8C6FC2@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Wed, 21 Jun 2000 13:44:00 +1000 Message-ID: <12333.961559040@kao2.melbourne.sgi.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

On Wed, 21 Jun 2000 02:37:58 +0000, Andrew Morton wrote:
>Module unload is a very rare occurrence, so let's penalise that and that
>alone.
>
>We grab the ENTIRE machine within sys_delete_module.
>
>Like, grab the big kernel lock, then wait until ALL other CPUs are
>spinning on the kernel lock, and then allow sys_delete_module to
>proceed.

CPU 0                                   CPU 1

rmmod
use_count == 0, OK to remove
Grab big kernel lock.
Wait for other cpus to spin.
                                        Enter open().
                                        spin.
sys_delete_module runs.
module_exit() sleeps, drops lock.
                                        open() continues, enters module.
                                        MOD_INC_USE_COUNT
                                        sleep in module open code, drop lock.
module_exit() wakes and continues.
Code is removed.
                                        module open code wakes and continues.
                                        Oops.

Anything sleeping loses the lock.
Any sleep in module open code primes the race, if the module_exit code also sleeps the race is triggered. From owner-netdev@oss.sgi.com Tue Jun 20 20:48:12 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 20:48:02 -0700 Received: from smtprch2.nortelnetworks.com ([192.135.215.15]:8601 "EHLO smtprch2.nortel.com") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 20:47:41 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch2.nortel.com; Tue, 20 Jun 2000 22:44:22 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NKANVNJ3; Tue, 20 Jun 2000 22:47:27 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8DMY; Wed, 21 Jun 2000 13:47:27 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id NAA28063; Wed, 21 Jun 2000 13:47:29 +1000 Message-ID: <39503AD1.99A54D57@uow.edu.au> Date: Wed, 21 Jun 2000 03:47:29 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Donald Becker CC: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 References: <395017F3.516165AD@uow.edu.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Donald Becker wrote: > > On Wed, 21 Jun 2000, Andrew Morton wrote: > > > In this example, both dev->ioctl() and dev->get_stats() are called while > > the module refcount is zero. So they're as risky as open(); these code > > paths need to be audited for races wrt kmalloc->schedule() > > opportunities. > > Hmmm, now this is an important point. 
> > I just "know" that ioctl() and get_stats() should be "simple" functions that > shouldn't do call any kernel function, except for perhaps printk(). This is > not documented in the skeleton driver, or elsewhere. > > I'm guessing that the documentation should advise locking the module if any > operation is done in get_stats() or private_ioctl() that might result in a > reschedule. Yes, these functions tend to be atomic in-and-out. But there are opportunities for module unload prior to them even being called. The call path appears to be: sys_ioctl() lock_kernel() -> fd.file_operations.ioctl() (sock_ioctl()) unlock_kernel() -> sock.proto_ops.ioctl() (inet_ioctl() for af_inet) -> dev_ioctl() dev_load() -> dev_ifsioc() __dev_get_by_name() (Fails if unloaded) netif_device_present() -> dev->ioctl() So there are some small SMP-only opportunities for the module to be unloaded prior to netdevice.do_ioctl() even being called. This is getting sticky, isn't it? [ Yipes. 3c527.c does a sleep_on() when the module refcount is zero ] From owner-netdev@oss.sgi.com Tue Jun 20 20:55:32 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 20:55:13 -0700 Received: from deliverator.sgi.com ([204.94.214.10]:4191 "EHLO deliverator.sgi.com") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 20:55:09 -0700 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via SMTP id UAA29319 for ; Tue, 20 Jun 2000 20:50:11 -0700 (PDT) mail_from (kaos@kao2.melbourne.sgi.com) Received: from kao2.melbourne.sgi.com (kao2.melbourne.sgi.com [134.14.55.180]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id NAA02652; Wed, 21 Jun 2000 13:53:49 +1000 X-Mailer: exmh version 2.1.1 10/15/1999 From: Keith Owens To: Donald Becker cc: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 In-reply-to: Your message of "Tue, 20 Jun 2000 23:12:36 -0400." 
Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Wed, 21 Jun 2000 13:53:49 +1000 Message-ID: <12400.961559629@kao2.melbourne.sgi.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Tue, 20 Jun 2000 23:12:36 -0400 (EDT), Donald Becker wrote: >On Wed, 21 Jun 2000, Andrew Morton wrote: > >> In this example, both dev->ioctl() and dev->get_stats() are called while >> the module refcount is zero. So they're as risky as open(); these code >> paths need to be audited for races wrt kmalloc->schedule() >> opportunities. > >I just "know" that ioctl() and get_stats() should be "simple" functions that >shouldn't do call any kernel function, except for perhaps printk(). This is >not documented in the skeleton driver, or elsewhere. > >I'm guessing that the documentation should advise locking the module if any >operation is done in get_stats() or private_ioctl() that might result in a >reschedule. Worse than that. If the module use count is zero then it can be removed at any time from another cpu. The only things preventing this are :- * Big kernel lock for open() and similar functions. * MOD_INC_USE_COUNT to bump the reference count. If the ioctl() uses a file descriptor that was obtained by first calling the module's open() routine then the use count was bumped so the ioctl() is safe (once the open race is fixed). But SIOCDEVPRIVATE ioctl() calls use file descriptors that do not come from the module and therefore the reference count "lock" is bypassed. 
From owner-netdev@oss.sgi.com Tue Jun 20 21:25:52 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 21:25:43 -0700 Received: from smtprch1.nortelnetworks.com ([192.135.215.14]:7590 "EHLO smtprch1.nortel.com") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 21:25:14 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch1.nortel.com; Tue, 20 Jun 2000 23:23:49 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NKANVN35; Tue, 20 Jun 2000 23:24:05 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8D35; Wed, 21 Jun 2000 14:24:06 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id OAA28251; Wed, 21 Jun 2000 14:24:01 +1000 Message-ID: <39504361.81F03943@uow.edu.au> Date: Wed, 21 Jun 2000 04:24:01 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Keith Owens CC: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 References: Your message of "Wed, 21 Jun 2000 02:37:58 GMT." <39502A86.6C8C6FC2@uow.edu.au> <12333.961559040@kao2.melbourne.sgi.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Keith Owens wrote: > > Anything sleeping loses the lock. Any sleep in module open code primes > the race, if the module_exit code also sleeps the race is triggered. You're a hard man, Mr Owens. So sys_delete_module() isn't allowed to sleep. It's hard to make this rule future-safe. Do you think that the concept of grabbing the entire machine during module unload is an acceptable one? 
I think it is, because the act of actually unloading kernel text is so
unique and traumatic.

If so then it shouldn't be too hard to find a way.  You have shown why
we can't use the big lock, but we could create a new one for this
purpose.

The challenge is to find a place to put it.  A code path which is
regularly traversed in toplevel context and has an upper bound on the
revisit period.  Such as schedule() (we'd get shot..).

sys_delete_module()
{
    ...
    spin_lock(&module_deletion_lock);
    blocked_cpus = 1 << smp_processor_id();
    while (blocked_cpus != ((1 << smp_num_cpus) - 1))
        ;
    {
        I think the only code which needs to go in
        here is the call to vfree(module).
    }
    spin_unlock(&module_deletion_lock);
    ...
}

schedule()
{
    ...
    if (spin_is_locked(&module_deletion_lock))
        wait_while_unloading();
}

wait_while_unloading()
{
    /* set_bit() takes the bit number first, then the address */
    set_bit(smp_processor_id(), &blocked_cpus);
    while (spin_is_locked(&module_deletion_lock))
        ;
}

From owner-netdev@oss.sgi.com Tue Jun 20 21:46:52 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 21:46:43 -0700 Received: from smtprch2.nortelnetworks.com ([192.135.215.15]:52126 "EHLO smtprch2.nortel.com") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 21:46:23 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch2.nortel.com; Tue, 20 Jun 2000 23:42:12 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NKANVNSK; Tue, 20 Jun 2000 23:45:18 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8DP6; Wed, 21 Jun 2000 14:45:14 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id OAA28376 for ; Wed, 21 Jun 2000 14:45:19 +1000 Message-ID: <3950485F.AD6F58C6@uow.edu.au> Date: Wed, 21 Jun 2000 04:45:19 +0000 From: Andrew Morton X-Mailer: Mozilla
4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 References: Your message of "Wed, 21 Jun 2000 02:37:58 GMT." <39502A86.6C8C6FC2@uow.edu.au> <12333.961559040@kao2.melbourne.sgi.com> <39504361.81F03943@uow.edu.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Plan M: sys_delete_module() doesn't do the vfree(). It schedules it for 5 seconds in the future. Or provides a mechanism for userland to do this. From owner-netdev@oss.sgi.com Tue Jun 20 22:15:23 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 22:15:13 -0700 Received: from deliverator.sgi.com ([204.94.214.10]:39802 "EHLO deliverator.sgi.com") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 22:14:53 -0700 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via SMTP id WAA05667 for ; Tue, 20 Jun 2000 22:09:55 -0700 (PDT) mail_from (kaos@kao2.melbourne.sgi.com) Received: from kao2.melbourne.sgi.com (kao2.melbourne.sgi.com [134.14.55.180]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA03066; Wed, 21 Jun 2000 15:12:15 +1000 X-Mailer: exmh version 2.1.1 10/15/1999 From: Keith Owens To: Andrew Morton cc: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 In-reply-to: Your message of "Wed, 21 Jun 2000 04:24:01 GMT." <39504361.81F03943@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Wed, 21 Jun 2000 15:12:15 +1000 Message-ID: <13370.961564335@kao2.melbourne.sgi.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Wed, 21 Jun 2000 04:24:01 +0000, Andrew Morton wrote: >Keith Owens wrote: >> Anything sleeping loses the lock. 
Any sleep in module open code primes
>> the race, if the module_exit code also sleeps the race is triggered.
>
>You're a hard man, Mr Owens.

I try ...

>So sys_delete_module() isn't allowed to sleep.  It's hard to make this
>rule future-safe.

Impossible because sys_delete_module() calls module_exit() which is
allowed to do anything.

>Do you think that the concept of grabbing the entire machine during
>module unload is an acceptable one?  I think it is, because the act of
>actually unloading kernel text is so unique and traumatic.

It is the best solution, if it can be done.  But I have not found any
method of doing this.

>sys_delete_module()
>{
>    ...
>    spin_lock(&module_deletion_lock);
>    blocked_cpus = 1 << smp_processor_id();
>    while (blocked_cpus != ((1 << smp_num_cpus) - 1))
>        ;
>    {
>        I think the only code which needs to go in
>        here is the call to vfree(module).

sys_delete_module() -> free_module() -> mod->cleanup() -> module_exit()
which is entered with module_deletion_lock held.  You just constrained
all module cleanup code to never sleep - no chance.

For sys_delete_module() to "grab" the entire machine it has to exclude
all processors from entering the module being unloaded (not too
difficult), to verify that no processor is currently executing the code
pages (a bit harder) and that no suspended process or timer queue will
ever pop its stack and return into those code pages (the really hard
bit).
From owner-netdev@oss.sgi.com Tue Jun 20 22:18:02 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 22:17:53 -0700 Received: from deliverator.sgi.com ([204.94.214.10]:63867 "EHLO deliverator.sgi.com") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 22:17:44 -0700 Received: from larry.melbourne.sgi.com (larry.melbourne.sgi.com [134.14.52.130]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via SMTP id WAA05979 for ; Tue, 20 Jun 2000 22:12:46 -0700 (PDT) mail_from (kaos@kao2.melbourne.sgi.com) Received: from kao2.melbourne.sgi.com (kao2.melbourne.sgi.com [134.14.55.180]) by larry.melbourne.sgi.com (950413.SGI.8.6.12/950213.SGI.AUTOCF) via ESMTP id PAA03088; Wed, 21 Jun 2000 15:16:24 +1000 X-Mailer: exmh version 2.1.1 10/15/1999 From: Keith Owens To: Andrew Morton cc: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 In-reply-to: Your message of "Wed, 21 Jun 2000 04:45:19 GMT." <3950485F.AD6F58C6@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Wed, 21 Jun 2000 15:16:24 +1000 Message-ID: <13465.961564584@kao2.melbourne.sgi.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Wed, 21 Jun 2000 04:45:19 +0000, Andrew Morton wrote: >Plan M: > >sys_delete_module() doesn't do the vfree(). It schedules it for 5 >seconds in the future. Or provides a mechanism for userland to do this. Does not fix the problem where module open() code is running at the same time as module_exit() code which tears down the kernel structures that open needs. Freeing the code pages while they are in use is only one of the possible failure modes for this race. 
From owner-netdev@oss.sgi.com Tue Jun 20 22:43:23 2000 Received: by oss.sgi.com id ; Tue, 20 Jun 2000 22:43:13 -0700 Received: from [203.126.247.144] ([203.126.247.144]:22713 "EHLO zsngs001") by oss.sgi.com with ESMTP id ; Tue, 20 Jun 2000 22:42:57 -0700 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by zsngs001; Wed, 21 Jun 2000 13:41:59 +0800 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NBKHBY2V; Wed, 21 Jun 2000 13:42:03 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8DSZ; Wed, 21 Jun 2000 15:42:06 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id PAA28679; Wed, 21 Jun 2000 15:41:58 +1000 Message-ID: <395055A6.25FC7058@uow.edu.au> Date: Wed, 21 Jun 2000 05:41:58 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Keith Owens CC: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 References: Your message of "Wed, 21 Jun 2000 04:24:01 GMT." <39504361.81F03943@uow.edu.au> <13370.961564335@kao2.melbourne.sgi.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Keith Owens wrote: > > ... > >sys_delete_module() > >{ > > ... > > spin_lock(&module_deletion_lock); > > blocked_cpus = 1 << smp_processor_id(); > > while (blocked_cpus != ((1 << smp_num_cpus) - 1)) > > ; > > { > > I think the only code whcih needs to go in > > here is the call to vfree(module). 
>
> sys_delete_module() -> free_module() -> mod->cleanup() -> module_exit()
> which is entered with module_deletion_lock held.  You just constrained
> all module cleanup code to never sleep - no chance.

I was proposing that the global CPU-grab only need surround the
vfree().  It's more of a way of getting all CPUs into a known state
than a lock.  So sys_delete_module would be more like:

sys_delete_module()
{
    free_module_no_vfree();
    spin_lock(&module_deletion_lock);

    /* Something like this... */
    for_each_task(tsk)
    {
        if (tsk->state == TASK_RUNNING)
            tsk->need_resched = 1;
    }

    blocked_cpus = 1 << smp_processor_id();
    while (blocked_cpus != ((1 << smp_num_cpus) - 1))
        ;
    vfree(module);
    spin_unlock(&module_deletion_lock);
}

> For sys_delete_module() to "grab" the entire machine it has to exclude
> all processors from entering the module being unloaded (not too
> difficult),

OK, they're all known to be spinning on module_deletion_lock.

> to verify that no processor is currently executing the code
> pages (a bit harder)

OK, they're all running in wait_while_unloading()

> and that no suspended process or timer queue will
> ever pop its stack and return into those code pages (the really hard
> bit).

I think any code which got into a situation like this when the module
refcount is zero would be rather broken.
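
[Illustrative sketch] The termination test of the busy-wait loop above, `blocked_cpus != ((1 << smp_num_cpus) - 1)`, can be checked in isolation. This is a single-threaded user-space sketch under stated assumptions: `cpu_mask_t`, `all_cpus_mask()`, `park_cpu()` and `safe_to_free()` are hypothetical names, and CPU ids are assumed to be contiguous from 0, as the original expression also assumes. It says nothing about the real concurrency or memory-ordering problems.

```c
#include <assert.h>
#include <stdbool.h>

/* Bitmask with one bit per CPU; bit n set means CPU n is parked. */
typedef unsigned long cpu_mask_t;

/* Mask with the low ncpus bits set, i.e. (1 << ncpus) - 1. */
static cpu_mask_t all_cpus_mask(unsigned int ncpus)
{
    assert(ncpus > 0 && ncpus < 8 * sizeof(cpu_mask_t));
    return (1UL << ncpus) - 1;
}

/* Record that one CPU has reached the wait loop. */
static cpu_mask_t park_cpu(cpu_mask_t blocked, unsigned int cpu)
{
    return blocked | (1UL << cpu);
}

/* The unloader may free the module text only once every CPU is parked. */
static bool safe_to_free(cpu_mask_t blocked, unsigned int ncpus)
{
    return blocked == all_cpus_mask(ncpus);
}
```

With four CPUs the unloader keeps spinning until all four bits are set; a single CPU that never passes through the wait loop keeps `safe_to_free()` false forever, which is why the code path chosen for the check needs a bounded revisit period.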
From owner-netdev@oss.sgi.com Wed Jun 21 03:37:54 2000 Received: by oss.sgi.com id ; Wed, 21 Jun 2000 03:37:34 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:53736 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Wed, 21 Jun 2000 03:37:13 -0700 Received: from fred.muc.de (none@ns1231.munich.netsurf.de [195.180.235.231]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id MAA04957; Wed, 21 Jun 2000 12:36:49 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 134hdL-0000Fl-00; Wed, 21 Jun 2000 12:21:23 +0200 Date: Wed, 21 Jun 2000 12:21:23 +0200 From: Andi Kleen To: Andrew Morton Cc: Andi Kleen , "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 Message-ID: <20000621122123.A971@fred.muc.de> References: <39500F81.46F5A826@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <39500F81.46F5A826@uow.edu.au>; from Andrew Morton on Wed, Jun 21, 2000 at 02:44:03AM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Wed, Jun 21, 2000 at 02:44:03AM +0200, Andrew Morton wrote: > Andi Kleen wrote: > > > > On Tue, Jun 20, 2000 at 02:04:35PM +0200, Andrew Morton wrote: > > > > > > - sys_ioctl() and sys_delete_module() both already claim > > > the big lock, so where's the race anyway? I feel I'm missing > > > something.. > > > > Ugh, I missed that. Ok, with that there is no race. Even better :-) > > I think I lied. Look at this: Ok. Just let's hope that no non-ioctl function grabs first a different lock (like the rtnl_lock) and then does a lock_kernel(). That would give a nice deadlock race. -Andi -- This is like TV. I don't like TV. 
From owner-netdev@oss.sgi.com Wed Jun 21 04:02:55 2000 Received: by oss.sgi.com id ; Wed, 21 Jun 2000 04:02:46 -0700 Received: from [203.126.247.144] ([203.126.247.144]:51422 "EHLO zsngs001") by oss.sgi.com with ESMTP id ; Wed, 21 Jun 2000 04:02:28 -0700 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by zsngs001; Wed, 21 Jun 2000 19:02:02 +0800 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NBKHB5LZ; Wed, 21 Jun 2000 19:02:05 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8D8W; Wed, 21 Jun 2000 21:02:09 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.109]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id VAA30260; Wed, 21 Jun 2000 21:02:11 +1000 Message-ID: <3950A1C1.72CDA79B@uow.edu.au> Date: Wed, 21 Jun 2000 21:06:41 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: Keith Owens CC: "netdev@oss.sgi.com" Subject: Re: modular net drivers, take 2 References: Your message of "Wed, 21 Jun 2000 04:45:19 GMT." <3950485F.AD6F58C6@uow.edu.au> <13465.961564584@kao2.melbourne.sgi.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Keith Owens wrote: > > On Wed, 21 Jun 2000 04:45:19 +0000, > Andrew Morton wrote: > >Plan M: > > > >sys_delete_module() doesn't do the vfree(). It schedules it for 5 > >seconds in the future. Or provides a mechanism for userland to do this. > > Does not fix the problem where module open() code is running at the > same time as module_exit() code which tears down the kernel structures > that open needs. 
Freeing the code pages while they are in use is only
> one of the possible failure modes for this race.

Aaaah.  I can see it all clearly now.

The fundamental misdesign here is that register_netdevice() and
unregister_netdev() are called from within the module constructor and
destructor functions.

You see, right now a module "owns" a number of netdevices.  This is
wrong.  It should be that a module "owns" a driver (pci_driver) and
that driver "owns" a number of netdevices:

module_init()
    registers the availability of the driver.  refcount remains zero

driver *open_module("eepro100")
    Locates the driver.  Increments module refcount.  No netdevices
    open yet.  Should not be called from within module_init!  That's
    the current problem.

device *open_driver(driver *)
    Creates an interface instance.  Increments an 'opencount' in the
    driver.  Critically: register_netdevice is called only now.

close_device(device *)
    Unregister_netdev()

close_driver(driver *)
    Fails if any devices are open.  Decrements module refcount to
    zero.  Now, no drivers are registered, so nobody can access
    "eth0", or device->open or anything else.

module_exit()
    Unregisters the driver availability.  Safe.  We just need to
    prevent module_exit() from racing against open_driver().  Just a
    spinlock.

Too late for all of this.
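
[Illustrative sketch] The ownership chain proposed above (module owns a driver, driver owns netdevices) can be walked through with plain counters. This is a user-space sketch only: `struct driver_model` and every `model_*` function are hypothetical and merely mirror the step names in the message; no real registration or locking is performed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a module owns one driver, a driver owns devices. */
struct driver_model {
    int module_refcount;  /* held while the driver is open */
    int opencount;        /* number of live device instances */
    int registered;       /* set by module_init, cleared by module_exit */
};

/* module_init(): announce the driver; the refcount stays at zero. */
static void model_module_init(struct driver_model *d)
{
    d->module_refcount = 0;
    d->opencount = 0;
    d->registered = 1;
}

/* open_module(): locate the driver and pin the module. */
static struct driver_model *model_open_module(struct driver_model *d)
{
    if (!d->registered)
        return NULL;
    d->module_refcount++;
    return d;
}

/* open_driver(): create an interface; register_netdevice would go here. */
static int model_open_driver(struct driver_model *d)
{
    if (d->module_refcount == 0)
        return -1;
    d->opencount++;
    return 0;
}

/* close_device(): unregister_netdev would go here. */
static void model_close_device(struct driver_model *d)
{
    assert(d->opencount > 0);
    d->opencount--;
}

/* close_driver(): refuse while devices exist, else unpin the module. */
static int model_close_driver(struct driver_model *d)
{
    if (d->opencount != 0 || d->module_refcount == 0)
        return -1;
    d->module_refcount--;
    return 0;
}

/* module_exit(): allowed only once nothing pins the module. */
static int model_module_exit(struct driver_model *d)
{
    if (d->module_refcount != 0)
        return -1;
    d->registered = 0;
    return 0;
}
```

The point of the ordering is that no device entry point is reachable unless the module refcount is already non-zero, so module_exit() can only race with open_module(), which a single lock could cover.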
From owner-netdev@oss.sgi.com Wed Jun 21 09:55:26 2000 Received: by oss.sgi.com id ; Wed, 21 Jun 2000 09:55:06 -0700 Received: from dialup341.canberra.net.au ([203.33.188.213]:62982 "HELO halfway") by oss.sgi.com with SMTP id ; Wed, 21 Jun 2000 09:54:56 -0700 Received: from linuxcare.com.au (localhost [127.0.0.1]) by halfway (Postfix) with ESMTP id 83F988193 for ; Wed, 21 Jun 2000 17:59:59 +1000 (EST) From: Rusty Russell To: netdev@oss.sgi.com Subject: RFC: Reporting dropped packets Date: Wed, 21 Jun 2000 17:59:59 +1000 Message-Id: <20000621075959.83F988193@halfway> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

There are a number of places in the network code where we drop packets
which people might be interested in knowing about (eg. my nat code,
CONFIG_IP_ROUTE_VERBOSE).

Be very nice if these were more flexible, and centralized.  Options
are:

1) Leave it alone.

2) Add nf_dropping(int pf, unsigned int hook,
                   const struct sk_buff *skb,
                   const struct net_device *indev,
                   const struct net_device *outdev,
                   const char *reason);
   nf_register_drop() and nf_unregister_drop() for recipients of
   dropped packets.

3) Add a bogus NF_IP_DROPPING hook to IPv4 netfilter; make the
   skb->nfmark field hold an enum indicating why the packet was
   dropped.

The third is most trivial, and is what I'm leaning towards at this
stage.  Anyone feel strongly that the current stuff is nicer?  Will
prepare patch if no one objects...

Rusty.
--
Hacking time.
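
[Illustrative sketch] Option 2 above amounts to a small listener registry. The following user-space sketch shows one way such a registry could behave. Everything in it is an assumption made for illustration: the `model_*` names, the fixed-size table and the reduced callback signature (hook and reason only, without the skb and device arguments from the proposal) are not taken from any posted patch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DROP_LISTENERS 8

/* Hypothetical, simplified listener: gets only the hook and a reason. */
typedef void (*drop_fn)(unsigned int hook, const char *reason);

static drop_fn drop_listeners[MAX_DROP_LISTENERS];

/* Add a listener; returns 0 on success, -1 if the table is full. */
static int model_register_drop(drop_fn fn)
{
    assert(fn != NULL);
    for (size_t i = 0; i < MAX_DROP_LISTENERS; i++) {
        if (drop_listeners[i] == NULL) {
            drop_listeners[i] = fn;
            return 0;
        }
    }
    return -1;
}

/* Remove a listener; returns 0 if it was registered, -1 otherwise. */
static int model_unregister_drop(drop_fn fn)
{
    for (size_t i = 0; i < MAX_DROP_LISTENERS; i++) {
        if (drop_listeners[i] == fn) {
            drop_listeners[i] = NULL;
            return 0;
        }
    }
    return -1;
}

/* Report one dropped packet; returns how many listeners were told. */
static int model_dropping(unsigned int hook, const char *reason)
{
    int told = 0;
    for (size_t i = 0; i < MAX_DROP_LISTENERS; i++) {
        if (drop_listeners[i] != NULL) {
            drop_listeners[i](hook, reason);
            told++;
        }
    }
    return told;
}

/* Sample listener that just counts the drops it is told about. */
static int drops_seen;

static void counting_listener(unsigned int hook, const char *reason)
{
    (void)hook;
    (void)reason;
    drops_seen++;
}
```

A listener registers once, is called for each reported drop, and stops being called after it unregisters. Option 3 avoids this extra registry by reusing the existing netfilter hook machinery.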
From owner-netdev@oss.sgi.com Wed Jun 21 09:55:26 2000 Received: by oss.sgi.com id ; Wed, 21 Jun 2000 09:55:16 -0700 Received: from dialup341.canberra.net.au ([203.33.188.213]:62726 "HELO halfway") by oss.sgi.com with SMTP id ; Wed, 21 Jun 2000 09:54:57 -0700 Received: from linuxcare.com.au (localhost [127.0.0.1]) by halfway (Postfix) with ESMTP id 4B30A8154; Wed, 21 Jun 2000 16:56:44 +1000 (EST) From: Rusty Russell To: Keith Owens Cc: "netdev@oss.sgi.com" Subject: Re: modular net drivers In-reply-to: Your message of "Wed, 21 Jun 2000 07:49:40 +1000." <4450.961537780@ocs3.ocs-net> Date: Wed, 21 Jun 2000 16:56:44 +1000 Message-Id: <20000621065644.4B30A8154@halfway> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing In message <4450.961537780@ocs3.ocs-net> you write: > On Tue, 20 Jun 2000 17:01:56 +1000, > Rusty Russell wrote: > >Races which can be largely solved at the moment by having the module > >page removal code sync all bh's and softirqs after calling cleanup(). > >Hell, we could even poll all CPUs and check they're not executing in > >the about-to-be-freed pages. Speed is completely unimportant here. > > This race is not obvious but IMHO it exists. The original theory was [ Entry into other points of module during module_cleanup() example snipped ] Well, that race is obvious, IMHO. module_cleanup should unregister everything first, before doing other cleaning up (which might sleep). I consider this fundamental, anyway; there's no other sane mechanism (unless you really do want to turn us into a microkernel). > Checking in module_cleanup() to see if the use count has changed is not > a solution. module_cleanup() may already have destroyed structures > that the open() code expects to use, either immediately or later. Of course... but two-stage cleanup (as suggested by Alexey long ago) will solve this, too. 
If you have non-trivial requirements, you need to do some form of reference counting anyway; cleanup() then just assures that the reference count will never *rise*. When it hits zero, you call `cleanup_finish(THIS_MODULE);'. This would certainly solve *my* problems. I can provide code if this is still not clear how this keeps the penalty for being a module in the module, and does not pollute the rest of the kernel. Please consider, Rusty. -- Hacking time. From owner-netdev@oss.sgi.com Wed Jun 21 20:33:30 2000 Received: by oss.sgi.com id ; Wed, 21 Jun 2000 20:33:21 -0700 Received: from smtprch1.nortelnetworks.com ([192.135.215.14]:28118 "EHLO smtprch1.nortel.com") by oss.sgi.com with ESMTP id ; Wed, 21 Jun 2000 20:32:59 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch1.nortel.com; Wed, 21 Jun 2000 22:31:48 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NKANW22W; Wed, 21 Jun 2000 22:32:55 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF816D; Thu, 22 Jun 2000 13:32:51 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id NAA07017; Thu, 22 Jun 2000 13:31:39 +1000 Message-ID: <3951889B.B4BFEA36@uow.edu.au> Date: Thu, 22 Jun 2000 03:31:39 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Rusty Russell CC: Keith Owens , "netdev@oss.sgi.com" Subject: Re: modular net drivers References: Your message of "Wed, 21 Jun 2000 07:49:40 +1000." 
<4450.961537780@ocs3.ocs-net> <20000621065644.4B30A8154@halfway> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Rusty Russell wrote: > > module_cleanup should unregister everything first, before doing other > cleaning up (which might sleep). Yup. module_cleanup() calls unregister_netdev(). It would be better to do the unregister_netdev(), then to wait for everyone to stop using the device (but how?) and to then reap the module. > ... > > I can provide code if this is still not clear how this keeps the > penalty for being a module in the module, and does not pollute the > rest of the kernel. Please. pseudo-code would suffice for me. Guys, I don't think we're going to solve this one any time soon. Unless Rusty has a trick up the sleeve I'll put together the patch which hoists the INC/DEC up into dev.c. At least that fixes some open/close races. Plus (here he goes again) timer deletion races are much more important... 
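
[Illustrative sketch] The two-stage cleanup quoted above (cleanup() guarantees the reference count can only fall, and the last reference dropped triggers `cleanup_finish(THIS_MODULE)`) might look roughly like this in a user-space model. `struct twostage_model` and the `model_*` functions are hypothetical; `model_cleanup_finish()` merely stands in for the proposed cleanup_finish(), whose real form was not posted in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state for a module using two-stage cleanup. */
struct twostage_model {
    int  refs;        /* outstanding users of the module's objects */
    bool closing;     /* stage one done: no new references allowed */
    bool finished;    /* stage two done: safe to free the module */
};

/* Stage two; stands in for the proposed cleanup_finish(THIS_MODULE). */
static void model_cleanup_finish(struct twostage_model *m)
{
    m->finished = true;
}

/* Take a reference; refused once cleanup has started. */
static bool model_ref_get(struct twostage_model *m)
{
    if (m->closing)
        return false;
    m->refs++;
    return true;
}

/* Drop a reference; the last one out finishes a pending cleanup. */
static void model_ref_put(struct twostage_model *m)
{
    assert(m->refs > 0);
    if (--m->refs == 0 && m->closing)
        model_cleanup_finish(m);
}

/* Stage one: unregister everything so the count can only fall. */
static void model_cleanup(struct twostage_model *m)
{
    m->closing = true;
    if (m->refs == 0)
        model_cleanup_finish(m);
}
```

In this model the module's memory is released only after both conditions hold: stage one has run and the count has reached zero. The bookkeeping lives entirely inside the module, which is the property claimed for the scheme.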
From owner-netdev@oss.sgi.com Wed Jun 21 20:41:40 2000 Received: by oss.sgi.com id ; Wed, 21 Jun 2000 20:41:30 -0700 Received: from smtprch1.nortelnetworks.com ([192.135.215.14]:45015 "EHLO smtprch1.nortel.com") by oss.sgi.com with ESMTP id ; Wed, 21 Jun 2000 20:41:24 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch1.nortel.com; Wed, 21 Jun 2000 22:36:34 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NKANW2JP; Wed, 21 Jun 2000 22:37:42 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF816R; Thu, 22 Jun 2000 13:37:40 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id NAA07059; Thu, 22 Jun 2000 13:37:48 +1000 Message-ID: <39518A0C.59CFBFDC@uow.edu.au> Date: Thu, 22 Jun 2000 03:37:48 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Rusty Russell CC: netdev@oss.sgi.com Subject: Re: RFC: Reporting dropped packets References: <20000621075959.83F988193@halfway> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Rusty Russell wrote: > > There are a number of places in the network code where we drop packets > which people might be interested in knowing about (eg. my nat code, > CONFIG_IP_ROUTE_VERBOSE). > > Be very nice if these were more flexible, and centralized. Does this affect drivers? Incoming packes dropped due to OOM, outgoing due to error recovery. 
From owner-netdev@oss.sgi.com Wed Jun 21 23:43:11 2000 Received: by oss.sgi.com id ; Wed, 21 Jun 2000 23:43:01 -0700 Received: from dialup340.canberra.net.au ([203.33.188.212]:56585 "HELO halfway") by oss.sgi.com with SMTP id ; Wed, 21 Jun 2000 23:42:45 -0700 Received: from linuxcare.com.au (localhost [127.0.0.1]) by halfway (Postfix) with ESMTP id 60DF68154 for ; Thu, 22 Jun 2000 16:42:39 +1000 (EST) From: Rusty Russell To: netdev@oss.sgi.com Subject: Re: RFC: Reporting dropped packets In-reply-to: Your message of "Thu, 22 Jun 2000 03:37:48 GMT." <39518A0C.59CFBFDC@uow.edu.au> Date: Thu, 22 Jun 2000 16:42:39 +1000 Message-Id: <20000622064239.60DF68154@halfway> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing In message <39518A0C.59CFBFDC@uow.edu.au> Andrew Morton writes: > Rusty Russell wrote: > > > > There are a number of places in the network code where we drop packets > > which people might be interested in knowing about (eg. my nat code, > > CONFIG_IP_ROUTE_VERBOSE). > > Does this affect drivers? Incoming packes dropped due to OOM, > outgoing due to error recovery. No, only systematic (`interesting') *IP* packet drops, not random ones. Trying to cover all packet drops was my original mistake, and is fairly useless (turns out that `drop' is a vague definition: did we `drop' a repeated TCP segment?). I'm suggesting a trivial change here, not some giant packet beancounter patch... Rusty. -- Hacking time. 
From owner-netdev@oss.sgi.com Thu Jun 22 05:45:45 2000 Received: by oss.sgi.com id ; Thu, 22 Jun 2000 05:45:34 -0700 Received: from panic.ohr.gatech.edu ([130.207.47.194]:55302 "EHLO havoc.gtf.org") by oss.sgi.com with ESMTP id ; Thu, 22 Jun 2000 05:45:16 -0700 Received: from mandrakesoft.com (adsl-77-228-135.atl.bellsouth.net [216.77.228.135]) by havoc.gtf.org (8.9.3/8.9.3) with ESMTP id IAA06797; Thu, 22 Jun 2000 08:44:28 -0400 Message-ID: <39520A2D.498A3E39@mandrakesoft.com> Date: Thu, 22 Jun 2000 08:44:29 -0400 From: Jeff Garzik Organization: MandrakeSoft X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.17pre4 i686) X-Accept-Language: en MIME-Version: 1.0 To: Andrew Morton CC: Rusty Russell , Keith Owens , "netdev@oss.sgi.com" Subject: Re: modular net drivers References: Your message of "Wed, 21 Jun 2000 07:49:40 +1000." <4450.961537780@ocs3.ocs-net> <20000621065644.4B30A8154@halfway> <3951889B.B4BFEA36@uow.edu.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andrew Morton wrote: > > Rusty Russell wrote: > > > > module_cleanup should unregister everything first, before doing other > > cleaning up (which might sleep). > > Yup. module_cleanup() calls unregister_netdev(). > > It would be better to do the unregister_netdev(), then to wait for everyone to stop using the device (but how?) and to then reap the module. Two things to remember: one, module_cleanup is called only when module use count (and hence user count) drops to zero, and two unregister_netdev closes the net device, so any users that slipped in and opened the net device during module_cleanup during (is that even possible?) are booted when the net device is closed. This also takes care of any wacky hardware cleanup that needs to be done in dev->stop(), so module_cleanup can simply concern itself with unregistered and freeing stuff and then exiting. 
Jeff -- Jeff Garzik | Building 1024 | Free beer tomorrow. MandrakeSoft, Inc. | From owner-netdev@oss.sgi.com Thu Jun 22 06:04:34 2000 Received: by oss.sgi.com id ; Thu, 22 Jun 2000 06:04:25 -0700 Received: from ppp0.ocs.com.au ([203.34.97.3]:4359 "HELO mail.ocs.com.au") by oss.sgi.com with SMTP id ; Thu, 22 Jun 2000 06:04:06 -0700 Received: (qmail 1832 invoked by uid 502); 22 Jun 2000 13:04:00 -0000 Received: (qmail 1814 invoked from network); 22 Jun 2000 13:03:58 -0000 Received: from ocs3.ocs-net (192.168.255.3) by mail.ocs.com.au with SMTP; 22 Jun 2000 13:03:58 -0000 X-Mailer: exmh version 2.1.1 10/15/1999 From: Keith Owens To: Jeff Garzik cc: Rusty Russell , "netdev@oss.sgi.com" Subject: Re: modular net drivers In-reply-to: Your message of "Thu, 22 Jun 2000 08:44:29 -0400." <39520A2D.498A3E39@mandrakesoft.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Thu, 22 Jun 2000 23:03:57 +1000 Message-ID: <2460.961679037@ocs3.ocs-net> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Thu, 22 Jun 2000 08:44:29 -0400, Jeff Garzik wrote: >Two things to remember: one, module_cleanup is called only when module >use count (and hence user count) drops to zero, and two >unregister_netdev closes the net device, so any users that slipped in >and opened the net device during module_cleanup during (is that even >possible?) are booted when the net device is closed. This also takes >care of any wacky hardware cleanup that needs to be done in dev->stop(), >so module_cleanup can simply concern itself with unregistered and >freeing stuff and then exiting. Not that simple. Module cleanup is entered with the big kernel lock which is supposed to prevent open() or its equivalents from entering the module until cleanup has completed. But module cleanup can sleep, losing the big kernel lock. 
If cleanup sleeps before unregister then open() can
enter the module during cleanup, and the module structures can be in
any old state.

Other calls to module functions do not run under the big kernel lock.
Instead they are meant to call open() first which bumps the reference
count.  It is then safe to enter module code without a lock, assuming
the open race in the previous paragraph has been fixed.

Alas we are finding code that calls module functions and has no
reference count on the module and does not run under the big kernel
lock.  In other words, it is a wide open race condition.  You can even
have this :-

CPU 0                   CPU 1

lock_kernel();          No lock or reference, incorrect code
module_exit();          but it exists right now.
                        if (fops->ioctl)
unregister();
                          fops->ioctl();  <=== now zero!

The existing locking mechanism is just too fragile, it relies on every
module writer and every piece of kernel code that calls a module being
coded exactly right.  Plus it all relies on the big kernel lock - can
you say non-scalable?

Yes, the existing mechanism can be made to work, but only at huge
expense.  Every module has to be checked to see if it conforms to the
lock design.  Every kernel call via a function pointer has to be
checked to see if it conforms to the same design.  This has to be done
forever, for all new code and methods.  And the whole design will not
scale.

What Al Viro is doing is a lot cleaner in the long run.  And it will
scale a lot better because it uses per fops spin locks.
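
[Illustrative sketch] The CPU 0 / CPU 1 interleaving above can be replayed deterministically in user space. This is a single-threaded sketch, not a reproduction of the kernel race: `struct fops_model`, `sample_ioctl()` and the `model_*` functions are hypothetical, and the -1 return marks the point where real code would jump through a pointer that has just been zeroed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a file_operations table owned by a module. */
struct fops_model {
    int (*ioctl)(void);
};

/* A trivial entry point that the "module" provides. */
static int sample_ioctl(void)
{
    return 0;
}

/* CPU 0's step: the module unregisters and clears its entry point. */
static void model_unregister(struct fops_model *f)
{
    f->ioctl = NULL;
}

/* CPU 1's first step: test the pointer, holding no lock or reference. */
static int model_check(const struct fops_model *f)
{
    return f->ioctl != NULL;
}

/* CPU 1's second step: call through the pointer.
 * Returns -1 where the kernel would call a now-zero pointer. */
static int model_call(const struct fops_model *f)
{
    if (f->ioctl == NULL)
        return -1;
    return f->ioctl();
}

/* Replay the interleaving from the diagram: check, unregister, call. */
static int model_racy_sequence(struct fops_model *f)
{
    assert(f != NULL);
    if (!model_check(f))
        return 0;            /* nothing to call */
    model_unregister(f);     /* CPU 0 runs between check and call */
    return model_call(f);    /* CPU 1 resumes: pointer is now zero */
}
```

The check and the call are separate steps with nothing tying them together, so a passing check proves nothing by the time the call happens. Holding a reference or a lock across both steps is what closes the window.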
From owner-netdev@oss.sgi.com Thu Jun 22 06:59:14 2000 Received: by oss.sgi.com id ; Thu, 22 Jun 2000 06:59:05 -0700 Received: from smtprch2.nortelnetworks.com ([192.135.215.15]:20625 "EHLO smtprch2.nortel.com") by oss.sgi.com with ESMTP id ; Thu, 22 Jun 2000 06:58:46 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch2.nortel.com; Thu, 22 Jun 2000 08:53:44 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NKANW3RM; Thu, 22 Jun 2000 08:56:51 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8FRJ; Thu, 22 Jun 2000 23:56:49 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.147]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id XAA10039 for ; Thu, 22 Jun 2000 23:56:46 +1000 Message-ID: <39521C2C.B11140A2@uow.edu.au> Date: Fri, 23 Jun 2000 00:01:16 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: "netdev@oss.sgi.com" Subject: Re: modular net drivers References: Your message of "Thu, 22 Jun 2000 08:44:29 -0400." <39520A2D.498A3E39@mandrakesoft.com> <2460.961679037@ocs3.ocs-net> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Keith Owens wrote: > > Other calls to module functions do not run under the big kernel lock. > Instead they are meant to call open() first which bumps the reference > count. The problem with the net drivers is that they are "opened" within the module constructor, "closed" with the module destructor and their "open" and "close" methods don't (can't) manipulate the refcounts. You see, netdevice.open() is a misnomer. 
It simply switches an existing, already-open netdevice from 'down' to 'up'. It should be called 'start' (cf netdevice.stop). It's register_netdevice() which registers a device with the kernel and makes all its entry points available for use. register_netdevice is called from within module_init(). Seen in this light, incrementing the refcount within netdevice.open() is kinda arbitrary. All the other netdevice methods should also increment the refcount (yes, racy). Sigh. rm /sbin/rmmod. There. Fixed. From owner-netdev@oss.sgi.com Thu Jun 22 09:48:37 2000 Received: by oss.sgi.com id ; Thu, 22 Jun 2000 09:48:27 -0700 Received: from gst.gst.com ([208.219.159.150]:54539 "EHLO gst.gst.com") by oss.sgi.com with ESMTP id ; Thu, 22 Jun 2000 09:48:11 -0700 Received: from x0.org ([208.219.159.214]) by gst.gst.com (8.8.8/8.8.8) with ESMTP id MAA14808 for ; Thu, 22 Jun 2000 12:59:11 -0400 (EDT) Message-ID: <3952434A.94F4F9A5@x0.org> Date: Thu, 22 Jun 2000 12:48:10 -0400 From: Andy X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com Subject: TCP, UDP,... Kernel modules Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! For the project I am working, it would be cool if TCP would be a kernel module (separate from IP). Is that possible? What other mess would this drag with? Did anybody work in this direction yet? 
Andy From owner-netdev@oss.sgi.com Thu Jun 22 10:49:37 2000 Received: by oss.sgi.com id ; Thu, 22 Jun 2000 10:49:28 -0700 Received: from dialup283.canberra.net.au ([203.33.188.155]:8202 "HELO halfway") by oss.sgi.com with SMTP id ; Thu, 22 Jun 2000 10:49:08 -0700 Received: from linuxcare.com.au (localhost [127.0.0.1]) by halfway (Postfix) with ESMTP id 304CE8154; Fri, 23 Jun 2000 03:48:58 +1000 (EST) From: Rusty Russell To: Andrew Morton Cc: Keith Owens , "netdev@oss.sgi.com" Subject: Re: modular net drivers In-reply-to: Your message of "Thu, 22 Jun 2000 03:31:39 GMT." <3951889B.B4BFEA36@uow.edu.au> Date: Fri, 23 Jun 2000 03:48:58 +1000 Message-Id: <20000622174858.304CE8154@halfway> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing In message <3951889B.B4BFEA36@uow.edu.au> you write: > > I can provide code if this is still not clear how this keeps the > > penalty for being a module in the module, and does not pollute the > > rest of the kernel. > > Please. > > pseudo-code would suffice for me. OK. Here is how it would work: 1) struct module gets an `int cleanup_cpu'. 2) Modules supply a `deactivate()' method, which guarantees that (after synchronization) module count will NEVER increase. 3) Kernel thread does actual page removal (and could do execution of `cleanup()': whatever's easier). 4) #define __MOD_INC_USE_COUNT(mod) atomic_inc(&mod->uc.usecount) #define __MOD_DEC_USE_COUNT(mod) \ do { if (atomic_dec_and_test(&mod->uc.usecount)) \ mod->cleanup_cpu = smp_processor_id(); \ } while(0) 4) sys_init_module() initializes mod->uc.usecount to 1, and mod->cleanup_cpu to -1. 5) sys_delete_module() (probably check if mod->uc.usecount == 1): A) Calls mod->deactivate(), B) Syncs kernel (grab kernel lock, sync timers and other bh()s, maybe grab entire machine if required). 
	C) __MOD_DEC_USE_COUNT(mod)

6) Cleanup thread looks for modules with mod->cleanup_cpu set to
   current->processor; when it finds one, it can call mod->cleanup()
   then vfree() the memory.

The only limitation with this scheme is that if we lose the deactivate
race, the module could be `unloading' for an indefinite period.

Enhancements:

We could split the module's init() function into init() and
`activate()', where init() doesn't do anything which allows external
code access to the module (ie. the modcount will not increase).  Then
we can simply call `mod->activate()' and return -EBUSY if we lose the
race.  It's probably not worth the hassle.

Make __MOD_INC_USE_COUNT assert that mod->cleanup_cpu == -1 (you
shouldn't be increasing the module count if you're deactivated).

This method relies on the cleanup thread bouncing around the CPUs so it
eventually runs on the CPU the module was removed on.  Something more
deterministic would please the pedants.

FAQ

1) Won't this be complicated to implement?

   Not for my code, and I imagine not for lots of other code.

2) Why cleanup in a thread?

   We may be in the module when we do the final MOD_DEC_USE_COUNT, so
   waiting until another thread is running on that CPU guarantees that
   the function has exited.

3) Won't that make kernel coding complex?

   Not as much as adding struct module * to every registerable object.
   No one coding the kernel today should have problems understanding
   the issues here.

Clear now?
Rusty.
--
Hacking time.
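The numbered steps in this proposal can be traced with a small single-threaded model. This is only a sketch, assuming names that do not appear in the message (`struct module_model`, `model_get`, `model_put`, `model_delete`, `model_cleanup_poll`); the CPU is passed in as a plain integer instead of coming from smp_processor_id(), and the kernel synchronisation of step 5B is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct module_model {
    atomic_int usecount;  /* starts at 1, as sys_init_module() would set it */
    int cleanup_cpu;      /* -1 until the count drops to zero */
    bool active;          /* cleared by deactivate(): no new references */
    bool freed;           /* set once the cleanup thread has run */
};

static void model_init(struct module_model *m)
{
    atomic_init(&m->usecount, 1);
    m->cleanup_cpu = -1;
    m->active = true;
    m->freed = false;
}

/* __MOD_INC_USE_COUNT, refused once the module has been deactivated. */
static bool model_get(struct module_model *m)
{
    if (!m->active)
        return false;
    assert(m->cleanup_cpu == -1);   /* the assertion suggested under Enhancements */
    atomic_fetch_add(&m->usecount, 1);
    return true;
}

/* __MOD_DEC_USE_COUNT: the CPU doing the final decrement is recorded. */
static void model_put(struct module_model *m, int cpu)
{
    if (atomic_fetch_sub(&m->usecount, 1) == 1)
        m->cleanup_cpu = cpu;
}

/* sys_delete_module(): deactivate, then drop the initial reference. */
static void model_delete(struct module_model *m, int cpu)
{
    m->active = false;      /* step A: deactivate() */
    /* step B (syncing the kernel) is omitted in this model */
    model_put(m, cpu);      /* step C */
}

/* Cleanup thread body: act only when running on the recorded CPU. */
static bool model_cleanup_poll(struct module_model *m, int cpu)
{
    if (m->freed || m->cleanup_cpu != cpu)
        return false;
    m->freed = true;        /* stands in for mod->cleanup() plus vfree() */
    return true;
}
```

If a user still holds a reference when the module is deleted, nothing is freed until that user's final put, and then only by a poll running on the CPU where that put happened. That is the "unloading for an indefinite period" limitation mentioned in the message.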
From owner-netdev@oss.sgi.com Thu Jun 22 18:05:04 2000 Received: by oss.sgi.com id ; Thu, 22 Jun 2000 18:04:54 -0700 Received: from web901.mail.yahoo.com ([128.11.23.76]:6663 "HELO web901.mail.yahoo.com") by oss.sgi.com with SMTP id ; Thu, 22 Jun 2000 18:04:36 -0700 Received: (qmail 14367 invoked by uid 60001); 23 Jun 2000 01:04:37 -0000 Message-ID: <20000623010437.14366.qmail@web901.mail.yahoo.com> Received: from [210.77.224.121] by web901.mail.yahoo.com; Thu, 22 Jun 2000 18:04:37 PDT Date: Thu, 22 Jun 2000 18:04:37 -0700 (PDT) From: =?gb2312?q?Hu=20Gang?= Subject: no subject To: netdev@oss.sgi.com MIME-Version: 1.0 Content-Type: text/plain; charset=gb2312 Content-Transfer-Encoding: 8bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing no subject ===== hi,This is heavenbird e-mail. __________________________________________________ Do You Yahoo!? Get Yahoo! Mail - Free email you can access from anywhere! http://mail.yahoo.com/ From owner-netdev@oss.sgi.com Fri Jun 23 00:06:54 2000 Received: by oss.sgi.com id ; Fri, 23 Jun 2000 00:06:35 -0700 Received: from p-biset.issy.cnet.fr ([139.100.0.33]:9991 "HELO p-biset.issy.cnet.fr") by oss.sgi.com with SMTP id ; Fri, 23 Jun 2000 00:06:14 -0700 Received: from l-mhs1.lannion.cnet.fr ([161.104.1.59]) by p-biset.issy.cnet.fr with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2448.0) id N3KG93AV; Fri, 23 Jun 2000 09:06:12 +0200 Received: from rd.francetelecom.fr (lsun215.lannion.cnet.fr [161.104.4.52]) by l-mhs1.lannion.cnet.fr with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2448.0) id NPDNFB9X; Fri, 23 Jun 2000 09:05:15 +0200 Message-ID: <39530C14.BF2E6555@rd.francetelecom.fr> Date: Fri, 23 Jun 2000 09:04:52 +0200 From: Alexis JANIAK Organization: FTR&D X-Mailer: Mozilla 4.72 [en] (X11; I; SunOS 5.6 sun4m) X-Accept-Language: fr, en MIME-Version: 1.0 To: netdev@oss.sgi.com Subject: documentation on netlink sockets Content-Type: text/plain; 
charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hi, I'm a french student in computer science in training periode (this is my last year of studies), and i'm working on mpls over atm under GNU/Linux. And i need to read the routing database of the kernel. I see in zebra's code that they use netlink socket to do this. But i couldn't find some doc about the use of netlink (which seem to be specific to Linux) although i looked in the Linux Documentation Project, in the "unix network programming" and in the "TCP/IP illustrated Vol 2", but they didn't talk about netlink socket. I try to read the sources, but without some doc, it's very hard for me understand how it's working. I wrote to alan cox who gave me your e-mail. Could help me ? Do you know some books, web sites, news group or anything else that might help me ? That would be a great help for me if you could give me some indications. Thanks Alexis JANIAK 3rd year student in computer science at ENSSAT (Ecole Nationale Superieure de Science Appliquee et de Technologie) Lannion - France From owner-netdev@oss.sgi.com Fri Jun 23 01:01:04 2000 Received: by oss.sgi.com id ; Fri, 23 Jun 2000 01:00:55 -0700 Received: from c18758247.telekabel.chello.nl ([212.187.58.247]:46596 "HELO zuma.kawa.dhs.org") by oss.sgi.com with SMTP id ; Fri, 23 Jun 2000 01:00:40 -0700 Received: by zuma.kawa.dhs.org (Postfix, from userid 1000) id 604551A7D4; Fri, 23 Jun 2000 10:00:26 +0200 (CEST) Date: Fri, 23 Jun 2000 10:00:24 +0200 From: marcel@nl.linux.org (Marcel Harkema) To: Alexis JANIAK Cc: netdev@oss.sgi.com Subject: Re: documentation on netlink sockets Message-ID: <20000623100022.A7741@zuma> References: <39530C14.BF2E6555@rd.francetelecom.fr> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii User-Agent: Mutt/0.95.5i qahwah In-Reply-To: <39530C14.BF2E6555@rd.francetelecom.fr>; from Alexis JANIAK on Fri, Jun 23, 2000 at 09:04:52AM +0200 
X-GnuPG-Fingerprint: 7070 A8EC 4AD9 E62D 802A D44F 865A 558B 68DA 0F3E X-GnuPG-Key-ID: 1024D/68DA0F3E Organization: Euro Linux, http://www.linux.eu.org/ Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Fri, Jun 23, 2000 at 09:04 +0200, Alexis JANIAK wrote: > code that they use netlink socket to do this. But i couldn't find some doc > about the use of netlink (which seem to be specific to Linux) although i Try the netlink sockets howto available from http://qos.ittc.ukans.edu/. Cheers, Marcel From owner-netdev@oss.sgi.com Fri Jun 23 09:48:39 2000 Received: by oss.sgi.com id ; Fri, 23 Jun 2000 09:48:29 -0700 Received: from dialup326.canberra.net.au ([203.33.188.198]:1035 "HELO halfway") by oss.sgi.com with SMTP id ; Fri, 23 Jun 2000 09:48:12 -0700 Received: from linuxcare.com.au (localhost [127.0.0.1]) by halfway (Postfix) with ESMTP id AA5BB8154; Sat, 24 Jun 2000 02:48:05 +1000 (EST) From: Rusty Russell To: Keith Owens , Philipp Rumpf , Andrew Morton , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers Date: Sat, 24 Jun 2000 02:48:05 +1000 Message-Id: <20000623164805.AA5BB8154@halfway> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing In message <20000622174858.304CE8154@halfway> I wrote: > OK. Here is how it would work: Alternate solution to avoid module problems: Phil Rumpf and I came up with basically identical answers. It assumes that MOD_INC_USE_COUNT is always called in user context, and involves no changes to module code. 1) static volatile int freeze[NR_CPUS]; 2) sys_delete_module fires off a RT kernel thread for every other CPU: ... 
/* Fire off #CPUs - 1 threads */ retry: for (i = 0; i < smp_num_cpus(); i++) if (!freeze[i] && i != smp_processor_id()) goto retry; if (atomic_read(&mod->uc.use) == 0) mod->cleanup(); else err = -EBUSY; memset(freeze, 0, sizeof(freeze)); 3) Threads do: freeze[smp_processor_id()] = 1; while (freeze[smp_processor_id()]); exit That effectively freezes the other cpus, and cleanup() can sleep et al. No more races. Problems? Rusty. -- Hacking time. ------- End of Forwarded Message From owner-netdev@oss.sgi.com Fri Jun 23 11:49:41 2000 Received: by oss.sgi.com id ; Fri, 23 Jun 2000 11:49:31 -0700 Received: from smtprch2.nortelnetworks.com ([192.135.215.15]:31963 "EHLO smtprch2.nortel.com") by oss.sgi.com with ESMTP id ; Fri, 23 Jun 2000 11:49:16 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch2.nortel.com; Fri, 23 Jun 2000 07:47:47 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NKANXKFD; Fri, 23 Jun 2000 07:50:54 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8NGM; Fri, 23 Jun 2000 22:50:53 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.167]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id WAA21154 for ; Fri, 23 Jun 2000 22:48:41 +1000 Message-ID: <39535DA8.A5D014D4@uow.edu.au> Date: Fri, 23 Jun 2000 22:52:56 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: "netdev@oss.sgi.com" Subject: Driver bug reporting Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andrey and I have been bemoaning the quality of the netdriver bug reports, 
and the number of emails which go back and forth before an adequate
amount of info is gathered.

So we're putting together a 'reporting net driver problems' doc.  The
idea is that each driver will say "See
Documentation/networking/reporting-driver-problems.txt" in its signon
message.

This is what we have so far.  Go for it.


Network driver problem HOWTO
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have a problem which you believe may be attributed to a network
driver, please report it to the driver's maintainer.

The development of high-quality driver software is impossible without
user feedback.  In many cases the authors are not even able to test all
code paths on their own hardware, and they certainly can't test their
driver against all third party switches and hubs.  So, your report is
really WELCOME!

To be useful, the report should contain:

1) Kernel version.

2) An exact copy of the banner message which the driver generates on
   initialization.  All information in the banner is necessary for
   developers, otherwise it wouldn't be printed.

3) Description of your problem.  Explain what has happened, what didn't
   work as you expected, how and with which tools you checked it.  Did
   a previous version of the kernel or driver work correctly?  Describe
   the circumstances under which the problem occurs.  Check your logs
   and include in the report any driver messages.

   An example of a good description of a problem is "All network
   communication seems to stop when I set full-duplex mode on the
   switch.  A ping from the computer to any other address shows 100%
   packet loss and produces ``Destination Host Unreachable'' messages.
   There are no kernel logs for the time of the experiment."

   An example of a bad description is "Help me! It hangs!".

The email address of the driver maintainer is usually present in
comments inside the driver source and in the MAINTAINERS file in the
top level kernel source directory.
If you are prepared to investigate the problem further you may also
refer to the next section of this document for more hints on how to do
this and how to collect more information about your hardware/system
configuration to better describe your problem.


Reporting and diagnosing problems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Maintainers find that accurate and complete problem reports are
invaluable in resolving driver problems.  We are frequently not able to
reproduce problems and must rely on your patience and efforts to get to
the bottom of the problem.

If you believe you have a driver problem here are some of the steps you
should take:

- Is it really a driver problem?

  Eliminate some variables: try different cards, different computers,
  different cables, different ports on the switch/hub, different
  versions of the kernel or of the driver, etc.

- OK, it's a driver problem.

  You need to generate a report.  Typically this is an email to the
  maintainer and/or linux-net@vger.rutgers.edu.  The maintainer's email
  address will be in the driver source or in the MAINTAINERS file.

- The contents of your report will vary a lot depending upon the
  problem.  If it's a kernel crash then you should refer to the
  REPORTING-BUGS file.

  But for most problems it is useful to provide the following:

  o Kernel version, driver version

  o A copy of the banner message which the driver generates when it is
    initialised.  For example:

    eth0: 3Com PCI 3c905C Tornado at 0xa400,  00:50:da:6a:88:f0, IRQ 19
    8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
    MII transceiver found at address 24, status 782d.
    Enabling bus-master transmits and whole-frame receives.
  o If it is a PCI device, the relevant output from 'lspci -vx', eg:

    00:09.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74)
            Subsystem: 3Com Corporation: Unknown device 9200
            Flags: bus master, medium devsel, latency 32, IRQ 19
            I/O ports at a400 [size=128]
            Memory at db000000 (32-bit, non-prefetchable) [size=128]
            Expansion ROM at [disabled] [size=128K]
            Capabilities: [dc] Power Management version 2
    00: b7 10 00 92 07 00 10 02 74 00 00 02 08 20 00 00
    10: 01 a4 00 00 00 00 00 db 00 00 00 00 00 00 00 00
    20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 00 10
    30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a

  o A description of the environment: 10baseT? 100baseT?
    full/half duplex? switched or hubbed?

  o Any additional module parameters which you may be providing to the
    driver.

  o Any kernel logs which are produced.  The more the merrier.  If this
    is a large file and you are sending your report to a mailing list,
    mention that you have the logfile, but don't send it.  If you're
    reporting direct to the maintainer then just send it.

    To ensure that all kernel logs are available, add the following
    line to /etc/syslog.conf:

        kern.* /var/log/messages

    Then restart syslogd with:

        /etc/rc.d/init.d/syslog restart

    (The above may vary, depending upon which Linux distribution you
    use).

  o If your problem is reproducible then that's great.  Try the
    following:

    1) Increase the debug level.  Usually this is done via:

       a) modprobe driver.o debug=7
       b) In /etc/conf.modules (or modules.conf):
          options driver_name debug=7

    2) Recreate the problem with the higher debug level, and send all
       logs to the maintainer.

    3) Download your card's diagnostic tool from Donald Becker's
       website http://www.scyld.com/diag.  Download mii-diag.c as well.
       Build these.

       a) Run 'vortex-diag -aaee' and 'mii-diag -v' when the card is
          working correctly.  Save the output.

       b) Run the above commands when the card is malfunctioning.  Send
          both sets of output.

Finally, please be patient and be prepared to do some work.
You may end up working on this problem for a week or more as the maintainer asks more questions, asks for more tests, asks for patches to be applied, etc. At the end of it all, the problem may even remain unresolved. From owner-netdev@oss.sgi.com Fri Jun 23 15:29:21 2000 Received: by oss.sgi.com id ; Fri, 23 Jun 2000 15:29:11 -0700 Received: from mail.inconnect.com ([209.140.64.7]:28610 "HELO mail.inconnect.com") by oss.sgi.com with SMTP id ; Fri, 23 Jun 2000 15:28:55 -0700 Received: (qmail 10170 invoked from network); 23 Jun 2000 22:28:55 -0000 Received: from ultra1.inconnect.com (209.140.64.2) by mail with SMTP; 23 Jun 2000 22:28:55 -0000 Date: Fri, 23 Jun 2000 16:28:55 -0600 (MDT) From: Keyshaun X-Sender: kruger@ultra1.inconnect.com To: Andy cc: netdev@oss.sgi.com Subject: Re: TCP, UDP,... Kernel modules In-Reply-To: <3952434A.94F4F9A5@x0.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing How would it be helpful to make TCP a separate module? I always understood TCP as an integral part of functioning IP support. It seems that it would be easier to just make the IP modular seeing as it is rather small. My IPv6 module is 133k and I don't see how IPv4 could be much if at all bigger (basic support). Shaun thekernel@subdimension.com On Thu, 22 Jun 2000, Andy wrote: > Hello! > > For the project I am working, it would be cool if TCP would be a kernel > module (separate from IP). > Is that possible? > What other mess would this drag with? > Did anybody work in this direction yet?
> > Andy > From owner-netdev@oss.sgi.com Fri Jun 23 20:10:03 2000 Received: by oss.sgi.com id ; Fri, 23 Jun 2000 20:09:44 -0700 Received: from smtprch2.nortelnetworks.com ([192.135.215.15]:54994 "EHLO smtprch2.nortel.com") by oss.sgi.com with ESMTP id ; Fri, 23 Jun 2000 20:09:20 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch2.nortel.com; Fri, 23 Jun 2000 22:02:58 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NQCKK3T5; Fri, 23 Jun 2000 22:06:04 -0500 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8NNL; Sat, 24 Jun 2000 13:06:04 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.183]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id NAA28772; Sat, 24 Jun 2000 13:03:57 +1000 Message-ID: <3954262D.60BDEF41@uow.edu.au> Date: Sat, 24 Jun 2000 13:08:29 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: Rusty Russell CC: Keith Owens , Philipp Rumpf , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers References: <20000623164805.AA5BB8154@halfway> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Rusty Russell wrote: > > In message <20000622174858.304CE8154@halfway> I wrote: > > OK. Here is how it would work: > > Alternate solution to avoid module problems: Phil Rumpf and I came up > with basically identical answers. It assumes that MOD_INC_USE_COUNT > is always called in user context, and involves no changes to module > code. > > 1) static volatile int freeze[NR_CPUS]; Yup. 

I think this can be generalised and pushed out to userland more.  A new
system call:

	int sys_stop_cpu(int yep)

sys_stop_cpu(1) Causes the current CPU to enter a busy loop, with local
interrupts disabled.  The return value is the number of CPUs which are
_not_ captured by sys_stop_cpu.  If the current CPU is the last CPU,
sys_stop_cpu() will return 1 immediately.


sys_stop_cpu(0) will unfreeze _all_ CPUs (I think this is a little
racy...)



So the idea is that a privileged app can loop doing
clone()/sys_stop_cpu(1) until all CPUs have stopped.  Then the
privileged app can unload modules (or do anything else which requires
total serialisation).

The weakness in this (and in your proposal, Rusty) is the case where
some module code does a schedule() when the module reference count is
zero.  I'm not aware of any which can do this, but all it would take is
a kmalloc() within a netdriver's
set_multicast_list/do_ioctl/get_stats/etc method.

Two more things:

You had:

	if (atomic_read(&mod->uc.use) == 0)
		mod->cleanup();

This could be changed to

	if (atomic_read(&mod->uc.use) == 0) {
		atomic_inc(&mod->uc.use);
		mod->cleanup();
		atomic_dec(&mod->uc.use);
	}

to avoid bizarre reentrancy happenings if the module destructor somehow
calls schedule().

Finally, the net drivers seem to be the biggest problem at this time,
and I think all their methods are called via ioctl on a socket.  For
2.4 why the hell don't we just take the unlock_kernel() out of
sock_ioctl()?
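The effect of bumping the use count around the destructor can be seen in a small user-space model. It is a sketch only: `struct unload_model`, `guarded_cleanup` and `sleepy_cleanup` are hypothetical names, and the "sleep" is simulated by the destructor itself attempting a second unload, which is roughly what another task could do if the real destructor called schedule(). The check and the increment are not atomic with respect to each other here; the proposal relies on the other CPUs being frozen at that point.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

struct unload_model {
    atomic_int use;         /* module use count */
    int cleanups_run;       /* how many times the destructor ran */
    int nested_result;      /* result of the unload attempted during cleanup */
    void (*cleanup)(struct unload_model *);
};

/* The guarded form: hold a reference for the duration of the destructor. */
static int guarded_cleanup(struct unload_model *m)
{
    if (atomic_load(&m->use) != 0)
        return -EBUSY;
    atomic_fetch_add(&m->use, 1);
    m->cleanup(m);
    atomic_fetch_sub(&m->use, 1);
    return 0;
}

/* Stand-in destructor that "sleeps": a second unload runs in the gap. */
static void sleepy_cleanup(struct unload_model *m)
{
    m->cleanups_run++;
    m->nested_result = guarded_cleanup(m);
}
```

With the guard in place the nested attempt sees a non-zero count and gets -EBUSY, so the destructor runs exactly once. Without the increment and decrement the nested call would recurse into the destructor again.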
From owner-netdev@oss.sgi.com Sat Jun 24 04:29:57 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 04:29:37 -0700 Received: from laurin.munich.netsurf.de ([194.64.166.1]:6553 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 04:29:21 -0700 Received: from fred.muc.de (none@ns1246.munich.netsurf.de [195.180.235.246]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id NAA07942; Sat, 24 Jun 2000 13:29:16 +0200 (MET DST) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 135nxv-0000DW-00; Sat, 24 Jun 2000 13:19:11 +0200 Date: Sat, 24 Jun 2000 13:19:11 +0200 From: Andi Kleen To: Alexis JANIAK Cc: netdev@oss.sgi.com Subject: Re: documentation on netlink sockets Message-ID: <20000624131911.A832@fred.muc.de> References: <39530C14.BF2E6555@rd.francetelecom.fr> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <39530C14.BF2E6555@rd.francetelecom.fr>; from Alexis JANIAK on Fri, Jun 23, 2000 at 09:07:48AM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Fri, Jun 23, 2000 at 09:07:48AM +0200, Alexis JANIAK wrote: > Hi, > > I'm a french student in computer science in training periode (this is my > last year of studies), and i'm working on mpls over atm under GNU/Linux. > And i need to read the routing database of the kernel. I see in zebra's > code that they use netlink socket to do this. But i couldn't find some > doc about the use of netlink (which seem to be specific to Linux) > although i > looked in the Linux Documentation Project, in the "unix network > programming" and in the "TCP/IP illustrated Vol 2", but they didn't talk > about > netlink socket. I try to read the sources, but without some doc, it's > very hard for me > understand how it's working. I wrote to alan cox who gave me your > e-mail. Could You forgot the obvious man netlink man rtnetlink -A. 
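For readers following up on those man pages, here is a minimal sketch of dumping the IPv4 routing table over an rtnetlink socket, loosely following the pattern shown in netlink(7). The helper names `build_route_dump` and `count_ipv4_routes` and the 8192-byte receive buffer are choices made for this example, not anything from zebra or from the thread, and the code only counts RTM_NEWROUTE replies without decoding their attributes.

```c
#include <assert.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Request layout: netlink header followed directly by the route header. */
struct route_dump_req {
    struct nlmsghdr nlh;
    struct rtmsg rtm;
};

/* Fill in an RTM_GETROUTE dump request for the IPv4 tables. */
void build_route_dump(struct route_dump_req *req, unsigned int seq)
{
    assert(req != NULL);
    memset(req, 0, sizeof(*req));
    req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
    req->nlh.nlmsg_type = RTM_GETROUTE;
    req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req->nlh.nlmsg_seq = seq;
    req->rtm.rtm_family = AF_INET;
}

/* Returns the number of RTM_NEWROUTE messages received, or -1 on error. */
int count_ipv4_routes(void)
{
    struct route_dump_req req;
    struct nlmsghdr buf[8192 / sizeof(struct nlmsghdr)];
    int fd, count = 0, done = 0;

    fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (fd < 0)
        return -1;

    build_route_dump(&req, 1);
    if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0) {
        close(fd);
        return -1;
    }

    /* A dump arrives as one or more datagrams, terminated by NLMSG_DONE. */
    while (!done) {
        ssize_t len = recv(fd, buf, sizeof(buf), 0);
        struct nlmsghdr *nh;

        if (len <= 0) {
            close(fd);
            return -1;
        }
        for (nh = buf; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
            if (nh->nlmsg_type == NLMSG_DONE) {
                done = 1;
                break;
            }
            if (nh->nlmsg_type == NLMSG_ERROR) {
                close(fd);
                return -1;
            }
            if (nh->nlmsg_type == RTM_NEWROUTE)
                count++;
        }
    }
    close(fd);
    return count;
}
```

Each RTM_NEWROUTE payload is a struct rtmsg followed by rtattr entries (destination, gateway, output interface and so on); walking those with the RTA_* macros described in rtnetlink(7) would be the next step for actually reading the routes.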
From owner-netdev@oss.sgi.com Sat Jun 24 08:02:37 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 08:02:18 -0700 Received: from c997662-c.frmt1.sfba.home.com ([24.1.70.208]:16136 "EHLO fruits.uzix.org") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 08:01:59 -0700 Received: from prumpf by fruits.uzix.org with local (Exim 3.12 #1) id 135rQh-0006X4-00; Sat, 24 Jun 2000 08:01:07 -0700 Date: Sat, 24 Jun 2000 08:01:06 -0700 From: Philipp Rumpf To: Andrew Morton Cc: Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers Message-ID: <20000624080106.A25102@fruits.uzix.org> References: <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0pre3us In-Reply-To: <3954262D.60BDEF41@uow.edu.au> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, Jun 24, 2000 at 01:08:29PM +1000, Andrew Morton wrote: > Rusty Russell wrote: > > > > In message <20000622174858.304CE8154@halfway> I wrote: > > > OK. Here is how it would work: > > > > Alternate solution to avoid module problems: Phil Rumpf and I came up > > with basically identical answers. It assumes that MOD_INC_USE_COUNT > > is always called in user context, and involves no changes to module > > code. > > > > 1) static volatile int freeze[NR_CPUS]; > > Yup. > > I think this can be generalised and pushed out to userland more. A new > system call: > > int sys_stop_cpu(int yep) > > sys_stop_cpu(1) Causes the current CPU to enter a busy loop, with local > interrupts disabled. The return value is the number of CPUs which are Do we need to disable local interrupts ? interrupt handlers are safe because disable_irq is synchronous, not sure about softirqs. > _not_ captured by sys_stop_cpu. If the current CPU is the last CPU, > sys_stop_cpu() will return 1 immediately. > > > sys_stop_cpu(0) will unfreeze _all_ CPUs (I think this is a little > racy...) 
>
>
>
> So the idea is that a privileged app can loop doing
> clone()/sys_stop_cpu(1) until all CPUs have stopped.  Then the
> privileged app can unload modules (or do anything else which requires
> total serialisation).

What's the point of exporting int sys_stop_cpu(int yep) ?  I can't
really see it being useful for anything but module load/unload, and I'm
sure in 2.5 we will see people needing other APIs -
disable_cpu(int cpu)-like.

> The weakness in this (and in your proposal, Rusty) is the case where
> some module code does a schedule() when the module reference count is
> zero.  I'm not aware of any which can do this, but all it would take is

You're still not allowed to be stupid, no.  What makes this slightly
less problematic is schedule() with a zero reference count is a bug,
even on UP.

Note that schedule() in the cleanup_module function isn't a bug, as
long as you can guarantee the module count won't be incremented again.

> Two more things:
>
> You had:
>
>         if (atomic_read(&mod->uc.use) == 0)
>                 mod->cleanup();
>
> This could be changed to
>
>         if (atomic_read(&mod->uc.use) == 0) {
>                 atomic_inc(&mod->uc.use);
>                 mod->cleanup();
>                 atomic_dec(&mod->uc.use);
>         }
>
> to avoid bizarre reentrancy happenings if the module destructor somehow
> calls schedule().

I think we should mark a module that's being unloaded anyway.
Incrementing the use count would be nice because it would mean
schedule() with a zero use count is always a bug.
Philipp Rumpf From owner-netdev@oss.sgi.com Sat Jun 24 08:30:08 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 08:29:58 -0700 Received: from [203.126.247.144] ([203.126.247.144]:42629 "EHLO zsngs001") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 08:29:32 -0700 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by zsngs001; Sat, 24 Jun 2000 23:28:32 +0800 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NQ12PTT9; Sat, 24 Jun 2000 23:28:35 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8NQ6; Sun, 25 Jun 2000 01:28:37 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.190]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id BAA31811; Sun, 25 Jun 2000 01:26:12 +1000 Message-ID: <3954D42A.938A724B@uow.edu.au> Date: Sun, 25 Jun 2000 01:30:50 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: Philipp Rumpf CC: Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers References: <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au>, <3954262D.60BDEF41@uow.edu.au> <20000624080106.A25102@fruits.uzix.org> Content-Type: multipart/mixed; boundary="------------CBB127BDC2B6A2C0991803E8" X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing This is a multi-part message in MIME format. --------------CBB127BDC2B6A2C0991803E8 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit OK. I've flung together a prototype. It doesn't immediately crash.... 
--------------CBB127BDC2B6A2C0991803E8 Content-Type: text/plain; charset=us-ascii; name="freeze.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="freeze.patch" --- linux-official/kernel/module.c Fri Jun 9 23:00:28 2000 +++ linux-akpm/kernel/module.c Sun Jun 25 01:15:57 2000 @@ -365,17 +365,14 @@ return res; } -asmlinkage long -sys_delete_module(const char *name_user) +static long +do_sys_delete_module(const char *name_user) { struct module *mod, *next; char *name; long error; int something_changed; - if (!capable(CAP_SYS_MODULE)) - return -EPERM; - lock_kernel(); if (name_user) { if ((error = get_mod_name(name_user, &name)) < 0) @@ -442,6 +439,75 @@ unlock_kernel(); return error; } + +#ifdef CONFIG_SMP +static volatile int ice_block; +static spinlock_t freeze_lock = SPIN_LOCK_UNLOCKED; + +static int +antarctica(void *dummy) +{ + printk("start antarctica on %d\n", smp_processor_id()); + while (ice_block) + ; + printk("stop antarctica on %d\n", smp_processor_id()); + return 0; +} + +static int +freeze_other_cpus(void) +{ + int cpu, retval; + + if (!spin_trylock(&freeze_lock)) + return -EAGAIN; + + printk("start freeze_other_cpus()\n"); + ice_block = 1; + for (cpu = 0; cpu < smp_num_cpus - 1; cpu++) { + retval = kernel_thread(antarctica, (void *)0, 0); + if (retval < 0) + goto out_melt; + } + printk("continue freeze_other_cpus()\n"); + return 0; +out_melt: + ice_block = 0; + spin_unlock(&freeze_lock); + return retval; +} + +static void +melt_other_cpus(void) +{ + printk("melt_other_cpus() starts\n"); + ice_block = 0; + spin_unlock(&freeze_lock); + printk("melt_other_cpus() stops\n"); +} + +#else /* CONFIG_SMP */ + +#define freeze_other_cpus() 0 +#define melt_other_cpus() do { } while (0) +#endif + +asmlinkage long +sys_delete_module(const char *name_user) +{ + long ret; + + if (!capable(CAP_SYS_MODULE)) + return -EPERM; + + if ((ret = freeze_other_cpus())) + return ret; + + ret = do_sys_delete_module(name_user); + melt_other_cpus(); + 
return ret; +} + /* Query various bits about modules. */ --------------CBB127BDC2B6A2C0991803E8-- From owner-netdev@oss.sgi.com Sat Jun 24 08:38:28 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 08:38:08 -0700 Received: from puffin.external.hp.com ([192.25.206.4]:25610 "EHLO puffin.external.hp.com") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 08:37:51 -0700 Received: (from prumpf@localhost) by puffin.external.hp.com (8.9.3/8.9.3) id JAA31745; Sat, 24 Jun 2000 09:35:48 -0600 Message-ID: <20000624093548.A31621@puffin.external.hp.com> Date: Sat, 24 Jun 2000 09:35:48 -0600 From: Philipp Rumpf To: Andrew Morton , Philipp Rumpf Cc: Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers References: <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au>, <3954262D.60BDEF41@uow.edu.au> <20000624080106.A25102@fruits.uzix.org> <3954D42A.938A724B@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2 In-Reply-To: <3954D42A.938A724B@uow.edu.au>; from Andrew Morton on Sun, Jun 25, 2000 at 01:30:50AM +1000 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sun, Jun 25, 2000 at 01:30:50AM +1000, Andrew Morton wrote: > +static volatile int ice_block; > +static spinlock_t freeze_lock = SPIN_LOCK_UNLOCKED; > + > +static int > +antarctica(void *dummy) > +{ > + printk("start antarctica on %d\n", smp_processor_id()); > + while (ice_block) > + ; > + printk("stop antarctica on %d\n", smp_processor_id()); > + return 0; > +} > + > +static int > +freeze_other_cpus(void) > +{ > + int cpu, retval; > + > + if (!spin_trylock(&freeze_lock)) > + return -EAGAIN; > + > + printk("start freeze_other_cpus()\n"); > + ice_block = 1; > + for (cpu = 0; cpu < smp_num_cpus - 1; cpu++) { > + retval = kernel_thread(antarctica, (void *)0, 0); > + if (retval < 0) > + goto out_melt; > + } > + printk("continue freeze_other_cpus()\n"); > + return 0; > +out_melt: > + 
ice_block = 0; > + spin_unlock(&freeze_lock); > + return retval; > +} You don't guarantee all the other kernel_threads have started executing. Basically you need either an atomic_inc or a per-CPU flag somewhere. Philipp Rumpf From owner-netdev@oss.sgi.com Sat Jun 24 08:49:28 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 08:49:18 -0700 Received: from vindaloo.ras.ucalgary.ca ([136.159.55.21]:46262 "EHLO vindaloo.ras.ucalgary.ca") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 08:48:57 -0700 Received: (from rgooch@localhost) by vindaloo.ras.ucalgary.ca (8.10.0/8.10.0) id e5OFmiC09138; Sat, 24 Jun 2000 09:48:44 -0600 Date: Sat, 24 Jun 2000 09:48:44 -0600 Message-Id: <200006241548.e5OFmiC09138@vindaloo.ras.ucalgary.ca> From: Richard Gooch To: Philipp Rumpf Cc: Andrew Morton , Philipp Rumpf , Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers In-Reply-To: <20000624093548.A31621@puffin.external.hp.com> References: <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au> <20000624080106.A25102@fruits.uzix.org> <3954D42A.938A724B@uow.edu.au> <20000624093548.A31621@puffin.external.hp.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Philipp Rumpf writes: > On Sun, Jun 25, 2000 at 01:30:50AM +1000, Andrew Morton wrote: > > +static volatile int ice_block; > > +static spinlock_t freeze_lock = SPIN_LOCK_UNLOCKED; > > + > > +static int > > +antarctica(void *dummy) > > +{ > > + printk("start antarctica on %d\n", smp_processor_id()); > > + while (ice_block) > > + ; > > + printk("stop antarctica on %d\n", smp_processor_id()); > > + return 0; > > +} > > + > > +static int > > +freeze_other_cpus(void) > > +{ > > + int cpu, retval; > > + > > + if (!spin_trylock(&freeze_lock)) > > + return -EAGAIN; > > + > > + printk("start freeze_other_cpus()\n"); > > + ice_block = 1; > > + for (cpu = 0; cpu < smp_num_cpus - 1; cpu++) { > > + retval = kernel_thread(antarctica, (void *)0, 0); > 
> + if (retval < 0) > > + goto out_melt; > > + } > > + printk("continue freeze_other_cpus()\n"); > > + return 0; > > +out_melt: > > + ice_block = 0; > > + spin_unlock(&freeze_lock); > > + return retval; > > +} > > You don't guarantee all the other kernel_threads have started executing. > > Basically you need either an atomic_inc or a per-CPU flag somewhere. Or why not use smp_call_function()? Just move it from the i386 tree to the generic tree. Regards, Richard.... Permanent: rgooch@atnf.csiro.au Current: rgooch@ras.ucalgary.ca From owner-netdev@oss.sgi.com Sat Jun 24 08:55:28 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 08:55:09 -0700 Received: from c997662-c.frmt1.sfba.home.com ([24.1.70.208]:31496 "EHLO fruits.uzix.org") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 08:54:57 -0700 Received: from prumpf by fruits.uzix.org with local (Exim 3.12 #1) id 135sG7-0006Y2-00; Sat, 24 Jun 2000 08:54:15 -0700 Date: Sat, 24 Jun 2000 08:54:14 -0700 From: Philipp Rumpf To: Richard Gooch Cc: Andrew Morton , Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers Message-ID: <20000624085414.C25102@fruits.uzix.org> References: <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au> <20000624080106.A25102@fruits.uzix.org> <3954D42A.938A724B@uow.edu.au> <20000624093548.A31621@puffin.external.hp.com> <200006241548.e5OFmiC09138@vindaloo.ras.ucalgary.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0pre3us In-Reply-To: <200006241548.e5OFmiC09138@vindaloo.ras.ucalgary.ca> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > Or why not use smp_call_function()? Because the IPI that smp_call_function() uses might hit between the MOD_DEC_USE_COUNT that sets the module use count to zero and the return from module code. That would mean the interrupted CPU would return to nonexistent code afterwards and in all likelihood cause an Oops. 
> Just move it from the i386 tree to the generic tree. It's been generic for some time. Philipp Rumpf From owner-netdev@oss.sgi.com Sat Jun 24 09:03:29 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 09:03:19 -0700 Received: from vindaloo.ras.ucalgary.ca ([136.159.55.21]:50870 "EHLO vindaloo.ras.ucalgary.ca") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 09:03:11 -0700 Received: (from rgooch@localhost) by vindaloo.ras.ucalgary.ca (8.10.0/8.10.0) id e5OG38E09451; Sat, 24 Jun 2000 10:03:08 -0600 Date: Sat, 24 Jun 2000 10:03:08 -0600 Message-Id: <200006241603.e5OG38E09451@vindaloo.ras.ucalgary.ca> From: Richard Gooch To: Philipp Rumpf Cc: Andrew Morton , Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers In-Reply-To: <20000624085414.C25102@fruits.uzix.org> References: <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au> <20000624080106.A25102@fruits.uzix.org> <3954D42A.938A724B@uow.edu.au> <20000624093548.A31621@puffin.external.hp.com> <200006241548.e5OFmiC09138@vindaloo.ras.ucalgary.ca> <20000624085414.C25102@fruits.uzix.org> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Philipp Rumpf writes: > > Or why not use smp_call_function()? > > because the IPI smp_call_function uses might hit between the > MOD_DEC_USE_COUNT that sets the module use count to zero and the > return from module code. That would mean the interrupted CPU would > return to nonexistent code afterwards and in all likelihood cause an > Oops. But you could use it to launch the kernel threads, right? > > Just move it from the i386 tree to the generic tree. > > It's been generic for some time. I only see it in the i386 and alpha arch trees. It may not be i386-specific anymore, but it isn't generic :-) Regards, Richard.... 
Permanent: rgooch@atnf.csiro.au Current: rgooch@ras.ucalgary.ca From owner-netdev@oss.sgi.com Sat Jun 24 09:12:29 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 09:12:19 -0700 Received: from c997662-c.frmt1.sfba.home.com ([24.1.70.208]:37384 "EHLO fruits.uzix.org") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 09:11:57 -0700 Received: from prumpf by fruits.uzix.org with local (Exim 3.12 #1) id 135sW9-0006YJ-00; Sat, 24 Jun 2000 09:10:49 -0700 Date: Sat, 24 Jun 2000 09:10:48 -0700 From: Philipp Rumpf To: Richard Gooch Cc: Andrew Morton , Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers Message-ID: <20000624091048.D25102@fruits.uzix.org> References: <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au> <20000624080106.A25102@fruits.uzix.org> <3954D42A.938A724B@uow.edu.au> <20000624093548.A31621@puffin.external.hp.com> <200006241548.e5OFmiC09138@vindaloo.ras.ucalgary.ca> <20000624085414.C25102@fruits.uzix.org> <200006241603.e5OG38E09451@vindaloo.ras.ucalgary.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0pre3us In-Reply-To: <200006241603.e5OG38E09451@vindaloo.ras.ucalgary.ca> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, Jun 24, 2000 at 10:03:08AM -0600, Richard Gooch wrote: > But you could use it to launch the kernel threads, right? I don't see how. > > > > Just move it from the i386 tree to the generic tree. > > > > It's been generic for some time. > > I only see it in the i386 and alpha arch trees. It may not be i386 > specific anymore, but it isn't generic :-) ia64 and mips64 seem to have it as well. That leaves sparc which I guess just hasn't been merged. 
Philipp Rumpf From owner-netdev@oss.sgi.com Sat Jun 24 09:12:48 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 09:12:29 -0700 Received: from vindaloo.ras.ucalgary.ca ([136.159.55.21]:53430 "EHLO vindaloo.ras.ucalgary.ca") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 09:12:16 -0700 Received: (from rgooch@localhost) by vindaloo.ras.ucalgary.ca (8.10.0/8.10.0) id e5OGBh809634; Sat, 24 Jun 2000 10:11:43 -0600 Date: Sat, 24 Jun 2000 10:11:43 -0600 Message-Id: <200006241611.e5OGBh809634@vindaloo.ras.ucalgary.ca> From: Richard Gooch To: Philipp Rumpf Cc: Andrew Morton , Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers In-Reply-To: <20000624091048.D25102@fruits.uzix.org> References: <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au> <20000624080106.A25102@fruits.uzix.org> <3954D42A.938A724B@uow.edu.au> <20000624093548.A31621@puffin.external.hp.com> <200006241548.e5OFmiC09138@vindaloo.ras.ucalgary.ca> <20000624085414.C25102@fruits.uzix.org> <200006241603.e5OG38E09451@vindaloo.ras.ucalgary.ca> <20000624091048.D25102@fruits.uzix.org> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Philipp Rumpf writes: > On Sat, Jun 24, 2000 at 10:03:08AM -0600, Richard Gooch wrote: > > But you could use it to launch the kernel threads, right? > > I don't see how. > > > > > > > Just move it from the i386 tree to the generic tree. > > > > > > It's been generic for some time. > > > > I only see it in the i386 and alpha arch trees. It may not be i386 > > specific anymore, but it isn't generic :-) > > ia64 and mips64 seem to have it as well. That leaves sparc which I > guess just hasn't been merged. Where is it for ia64 and mips64? I didn't see it in Linus' tree (hence it doesn't exist). Regards, Richard.... 
Permanent: rgooch@atnf.csiro.au Current: rgooch@ras.ucalgary.ca From owner-netdev@oss.sgi.com Sat Jun 24 19:33:03 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 19:32:43 -0700 Received: from [203.126.247.144] ([203.126.247.144]:47505 "EHLO zsngs001") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 19:32:28 -0700 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by zsngs001; Sun, 25 Jun 2000 10:26:10 +0800 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NQ12PT9L; Sun, 25 Jun 2000 10:26:14 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8NSD; Sun, 25 Jun 2000 12:26:16 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.200]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id MAA10553; Sun, 25 Jun 2000 12:24:25 +1000 Message-ID: <39556E6E.DE0DAA90@uow.edu.au> Date: Sun, 25 Jun 2000 12:29:02 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: Richard Gooch CC: Philipp Rumpf , Philipp Rumpf , Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers References: <20000624093548.A31621@puffin.external.hp.com>, <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au> <20000624080106.A25102@fruits.uzix.org> <3954D42A.938A724B@uow.edu.au> <20000624093548.A31621@puffin.external.hp.com> <200006241548.e5OFmiC09138@vindaloo.ras.ucalgary.ca> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Richard Gooch wrote: > > Or why not use smp_call_function()? Just move it from the i386 tree to > the generic tree. 
The comment above smp_call_function() says " The function to run. This must be fast and non-blocking." I think the cross-CPU function is basically an ISR; starting a kernel thread from there would be interesting. My patch was quite bogus, of course :( We can have as many threads as we like spinning and the scheduler will cheerfully timeslice between them. Right now, I'm not sure how to implement this puppy. I don't see how to ensure that the sub-thread is running on another CPU. I know that, in practice, the subthread _does_ run on another CPU because I relied on that in my del_timer_sync test suite. But this depends upon scheduler vagaries which may not remain true. Perhaps we just create (smp_num_cpus * 10) threads and any time we see a non-captured CPU we do a local_irq_disable() and spin. How delightful. Thinking caps, gentlemen. From owner-netdev@oss.sgi.com Sat Jun 24 19:46:53 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 19:46:33 -0700 Received: from c997662-c.frmt1.sfba.home.com ([24.1.70.208]:6667 "EHLO fruits.uzix.org") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 19:46:11 -0700 Received: from prumpf by fruits.uzix.org with local (Exim 3.12 #1) id 1362Pz-0006eV-00; Sat, 24 Jun 2000 19:45:07 -0700 Date: Sat, 24 Jun 2000 19:45:06 -0700 From: Philipp Rumpf To: Andrew Morton Cc: Richard Gooch , Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers Message-ID: <20000624194506.B25473@fruits.uzix.org> References: <20000624093548.A31621@puffin.external.hp.com>, <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au> <20000624080106.A25102@fruits.uzix.org> <3954D42A.938A724B@uow.edu.au> <20000624093548.A31621@puffin.external.hp.com> <200006241548.e5OFmiC09138@vindaloo.ras.ucalgary.ca> <39556E6E.DE0DAA90@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0pre3us In-Reply-To: <39556E6E.DE0DAA90@uow.edu.au> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: 
X-Orcpt: rfc822;netdev-outgoing On Sun, Jun 25, 2000 at 12:29:02PM +1000, Andrew Morton wrote: > My patch was quite bogus, of course :( We can have as many threads as > we like spinning and the scheduler will cheerfully timeslice between > them. Oh ? I wasn't aware kernel_threads could be rescheduled, unlike normal kernel code, and in fact I still don't see how they would. From owner-netdev@oss.sgi.com Sat Jun 24 19:48:33 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 19:48:23 -0700 Received: from vindaloo.ras.ucalgary.ca ([136.159.55.21]:50615 "EHLO vindaloo.ras.ucalgary.ca") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 19:48:08 -0700 Received: (from rgooch@localhost) by vindaloo.ras.ucalgary.ca (8.10.0/8.10.0) id e5P2lX217879; Sat, 24 Jun 2000 20:47:33 -0600 Date: Sat, 24 Jun 2000 20:47:33 -0600 Message-Id: <200006250247.e5P2lX217879@vindaloo.ras.ucalgary.ca> From: Richard Gooch To: Philipp Rumpf Cc: Andrew Morton , Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers In-Reply-To: <20000624194506.B25473@fruits.uzix.org> References: <20000624093548.A31621@puffin.external.hp.com> <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au> <20000624080106.A25102@fruits.uzix.org> <3954D42A.938A724B@uow.edu.au> <200006241548.e5OFmiC09138@vindaloo.ras.ucalgary.ca> <39556E6E.DE0DAA90@uow.edu.au> <20000624194506.B25473@fruits.uzix.org> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Philipp Rumpf writes: > On Sun, Jun 25, 2000 at 12:29:02PM +1000, Andrew Morton wrote: > > My patch was quite bogus, of course :( We can have as many threads as > > we like spinning and the scheduler will cheerfully timeslice between > > them. > > Oh ? I wasn't aware kernel_threads could be rescheduled, unlike > normal kernel code, and in fact I still don't see how they would. They can't. 
Once you're running in the kernel, you can't be pre-empted by another thread, unless you sleep or otherwise call schedule(). You don't even need to have RT priority to pin a cpu like this. Regards, Richard.... Permanent: rgooch@atnf.csiro.au Current: rgooch@ras.ucalgary.ca From owner-netdev@oss.sgi.com Sat Jun 24 19:56:23 2000 Received: by oss.sgi.com id ; Sat, 24 Jun 2000 19:56:13 -0700 Received: from c997662-c.frmt1.sfba.home.com ([24.1.70.208]:10763 "EHLO fruits.uzix.org") by oss.sgi.com with ESMTP id ; Sat, 24 Jun 2000 19:55:44 -0700 Received: from prumpf by fruits.uzix.org with local (Exim 3.12 #1) id 1362ZE-0006eh-00; Sat, 24 Jun 2000 19:54:40 -0700 Date: Sat, 24 Jun 2000 19:54:39 -0700 From: Philipp Rumpf To: Richard Gooch Cc: Andrew Morton , Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers Message-ID: <20000624195439.C25473@fruits.uzix.org> References: <20000624093548.A31621@puffin.external.hp.com> <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au> <20000624080106.A25102@fruits.uzix.org> <3954D42A.938A724B@uow.edu.au> <200006241548.e5OFmiC09138@vindaloo.ras.ucalgary.ca> <39556E6E.DE0DAA90@uow.edu.au> <20000624194506.B25473@fruits.uzix.org> <200006250247.e5P2lX217879@vindaloo.ras.ucalgary.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0pre3us In-Reply-To: <200006250247.e5P2lX217879@vindaloo.ras.ucalgary.ca> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, Jun 24, 2000 at 08:47:33PM -0600, Richard Gooch wrote: > Philipp Rumpf writes: > > Oh ? I wasn't aware kernel_threads could be rescheduled, unlike > > normal kernel code, and in fact I still don't see how they would. > > They can't. Once you're running in the kernel, you can't be pre-empted > by another thread, unless you sleep or otherwise call schedule(). Nod, that's what it looked like to me. Andrew, take your thinking cap off again. 
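One point raised earlier in this thread was that freeze_other_cpus() never waits for the spawned threads to start running, and that an atomic counter would close that gap. A minimal user-space sketch of that idea follows. It assumes POSIX threads and C11 atomics in place of kernel_thread() and the kernel's atomic_t, and the names `toy_freezer`, `toy_antarctica()`, `toy_freeze()` and `toy_melt()` are hypothetical. It does not pin threads to CPUs, so it shows only the "have they all started" handshake and not the question of whether each spinner sits on a distinct processor.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define TOY_MAX_SPINNERS 8

/* Hypothetical bookkeeping for one freeze/melt cycle. */
struct toy_freezer {
    atomic_int ice_block;              /* spinners loop while this is 1 */
    atomic_int arrived;                /* spinners that have started */
    int nspinners;                     /* threads actually created */
    pthread_t tid[TOY_MAX_SPINNERS];
};

/* Spinner body: announce arrival, then spin until released. */
static void *toy_antarctica(void *arg)
{
    struct toy_freezer *f = arg;

    atomic_fetch_add(&f->arrived, 1);
    while (atomic_load(&f->ice_block))
        ;                              /* busy-wait, as in the prototype */
    return NULL;
}

/* Release every spinner and wait for each one to exit. */
static void toy_melt(struct toy_freezer *f)
{
    atomic_store(&f->ice_block, 0);
    for (int i = 0; i < f->nspinners; i++)
        pthread_join(f->tid[i], NULL);
    f->nspinners = 0;
}

/*
 * Start nspinners spinners and return only after all of them have
 * incremented the arrival counter. Returns 0 on success or a negative
 * errno-style value if a thread could not be created.
 */
static int toy_freeze(struct toy_freezer *f, int nspinners)
{
    assert(nspinners >= 0 && nspinners <= TOY_MAX_SPINNERS);

    atomic_store(&f->arrived, 0);
    atomic_store(&f->ice_block, 1);
    f->nspinners = 0;

    for (int i = 0; i < nspinners; i++) {
        int err = pthread_create(&f->tid[i], NULL, toy_antarctica, f);
        if (err != 0) {
            toy_melt(f);               /* undo the partial freeze */
            return -err;
        }
        f->nspinners++;
    }

    /* The missing step in the prototype: wait for every spinner. */
    while (atomic_load(&f->arrived) < nspinners)
        sched_yield();
    return 0;
}
```

A caller would bracket the serialised work with toy_freeze() and toy_melt(), in the same way that sys_delete_module() brackets do_sys_delete_module() with freeze_other_cpus() and melt_other_cpus() in the prototype patch.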
From owner-netdev@oss.sgi.com Sun Jun 25 04:19:17 2000 Received: by oss.sgi.com id ; Sun, 25 Jun 2000 04:18:58 -0700 Received: from [203.126.247.144] ([203.126.247.144]:48287 "EHLO zsngs001") by oss.sgi.com with ESMTP id ; Sun, 25 Jun 2000 04:18:41 -0700 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by zsngs001; Sun, 25 Jun 2000 19:17:39 +0800 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NQ12P4LX; Sun, 25 Jun 2000 19:17:43 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8N4R; Sun, 25 Jun 2000 21:17:45 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.209]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id VAA12816; Sun, 25 Jun 2000 21:16:13 +1000 Message-ID: <3955EB11.519DF76B@uow.edu.au> Date: Sun, 25 Jun 2000 21:20:49 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: Alan Cox CC: "netdev@oss.sgi.com" , Andreas Tobler Subject: [patch] 3c59x.c for 2.2.17 Content-Type: multipart/mixed; boundary="------------A028613A1132A2E773E9E93A" X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing This is a multi-part message in MIME format. --------------A028613A1132A2E773E9E93A Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Hi, Alan. - Andrey's tester reported that the driver was still oopsing every three days on vortex-specific code for a 3c905. So I've split the ISR into vortex_interrupt() and boomerang_interrupt() as per the 2.3/2.4 driver. - The driver would crash the machine if all dev_alloc_skb()'s failed during open(). So I simply fail the open() if we can't preallocate all the skb's. 
- The problem reported by Mark Hemment where the rx path would die if 32 successive dev_alloc_skb()'s failed has been semi-kludgily fixed by detecting this situation in the tx interrupt and deliberately calling the rx ISR. This means that in cruel OOM situations we're relying on Tx interrupts to initiate polling for available memory. This can take some time if TCP has backed off a long way, but it recovers eventually. - Put an explicit "are we interrupting" test at the start of the ISRs to improve efficiency during PCI interrupt sharing. Also avoids testing bits which we have no business testing when no interrupt is pending. Patch against 2.2.17-pre5 attached. It hasn't been tested on a 3c590 (vortex series) since I put the OOM stuff in. I didn't actually address possible OOM problems with the vortex series, so the vortex code paths should not be affected. I'd appreciate it if you could test it on your 590. (Andreas: you too, please). --------------A028613A1132A2E773E9E93A Content-Type: text/plain; charset=us-ascii; name="3c59x.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="3c59x.patch" --- linux-2.2.17pre5/drivers/net/3c59x.c Fri Jun 23 01:15:43 2000 +++ linux-akpm/drivers/net/3c59x.c Sun Jun 25 02:06:07 2000 @@ -42,15 +42,20 @@ - In vortex_error, do_tx_reset and vortex_tx_timeout(Vortex): clear tbusy and force a BH rerun to better recover from errors. - 12Jun00 <2.2.16> andrewm + 24Jun00 <2.2.16> andrewm - Better handling of shared interrupts - Reset the transmitter in vortex_error() on both maxcollisions and Tx reclaim error + - Split the ISR into vortex_interrupt and boomerang_interrupt. This is + to fix once-and-for-all the dubious testing of vortex status bits on + boomerang/hurricane/cyclone/tornado NICs. + - Fixed crash under OOM during vortex_open() (Mark Hemment) + - Fix Rx cessation problem during OOM (help from Mark Hemment) - See http://www.uow.edu.au/~andrewm/linux/#3c59x-2.2 for more details. 
*/ static char *version = -"3c59x.c:v0.99H 12Jun00 Donald Becker and others http://www.scyld.com/network/vortex.html\n"; +"3c59x.c:v0.99H 24Jun00 Donald Becker and others http://www.scyld.com/network/vortex.html\n"; /* "Knobs" that adjust features and parameters. */ /* Set the copy breakpoint for the copy-only-tiny-frames scheme. @@ -563,6 +568,7 @@ static int vortex_rx(struct device *dev); static int boomerang_rx(struct device *dev); static void vortex_interrupt(int irq, void *dev_id, struct pt_regs *regs); +static void boomerang_interrupt(int irq, void *dev_id, struct pt_regs *regs); static int vortex_close(struct device *dev); static void update_stats(long ioaddr, struct device *dev); static struct net_device_stats *vortex_get_stats(struct device *dev); @@ -570,6 +576,8 @@ static int vortex_ioctl(struct device *dev, struct ifreq *rq, int cmd); +/* #define dev_alloc_skb dev_alloc_skb_debug */ + /* This driver uses 'options' to pass the media type, full-duplex flag, etc. */ /* Option count limit only -- unlimited interfaces are supported. */ #define MAX_UNITS 8 @@ -1033,12 +1041,13 @@ request_region(ioaddr, pci_tbl[chip_idx].io_size, dev->name); /* The 3c59x-specific entries in the device structure. */ - dev->open = &vortex_open; - dev->hard_start_xmit = &vortex_start_xmit; - dev->stop = &vortex_close; - dev->get_stats = &vortex_get_stats; - dev->do_ioctl = &vortex_ioctl; - dev->set_multicast_list = &set_rx_mode; + dev->open = vortex_open; + dev->hard_start_xmit = vp->full_bus_master_tx ? + boomerang_start_xmit : vortex_start_xmit; + dev->stop = vortex_close; + dev->get_stats = vortex_get_stats; + dev->do_ioctl = vortex_ioctl; + dev->set_multicast_list = set_rx_mode; return dev; } @@ -1050,7 +1059,9 @@ long ioaddr = dev->base_addr; struct vortex_private *vp = (struct vortex_private *)dev->priv; unsigned int config; - int i; + int i, retval; + + MOD_INC_USE_COUNT; /* Before initializing select the active media port. 
*/ EL3WINDOW(3); @@ -1072,12 +1083,6 @@ } else dev->if_port = vp->default_media; - init_timer(&vp->timer); - vp->timer.expires = RUN_AT(media_tbl[dev->if_port].wait); - vp->timer.data = (unsigned long)dev; - vp->timer.function = &vortex_timer; /* timer handler */ - add_timer(&vp->timer); - if (vortex_debug > 1) printk(KERN_DEBUG "%s: Initial media type %s.\n", dev->name, media_tbl[dev->if_port].name); @@ -1127,8 +1132,11 @@ outw(SetStatusEnb | 0x00, ioaddr + EL3_CMD); /* Use the now-standard shared IRQ implementation. */ - if (request_irq(dev->irq, &vortex_interrupt, SA_SHIRQ, dev->name, dev)) { - return -EAGAIN; + if ((retval = request_irq(dev->irq, vp->full_bus_master_tx ? + &boomerang_interrupt : &vortex_interrupt, + SA_SHIRQ, dev->name, dev))) { + printk(KERN_ERR "%s: Cannot allocate IRQ #%d\n", dev->name, dev->irq); + goto out; } if (vortex_debug > 1) { @@ -1193,12 +1201,20 @@ vp->rx_ring[i].addr = virt_to_bus(skb->data); #endif } + if (i != RX_RING_SIZE) { + int j; + for (j = 0; j < RX_RING_SIZE; j++) { + if (vp->rx_skbuff[j]) + DEV_FREE_SKB(vp->rx_skbuff[j]); + } + retval = -ENOMEM; + goto out_free_irq; + } /* Wrap the ring. */ vp->rx_ring[i-1].next = cpu_to_le32(virt_to_bus(&vp->rx_ring[0])); outl(virt_to_bus(&vp->rx_ring[0]), ioaddr + UpListPtr); } if (vp->full_bus_master_tx) { /* Boomerang bus master Tx. */ - dev->hard_start_xmit = &boomerang_start_xmit; vp->cur_tx = vp->dirty_tx = 0; outb(PKT_BUF_SZ>>8, ioaddr + TxFreeThreshold); /* Room for a packet. */ /* Clear the Tx ring. */ @@ -1232,9 +1248,18 @@ if (vp->cb_fn_base) /* The PCMCIA people are idiots. 
*/ writel(0x8000, vp->cb_fn_base + 4); - MOD_INC_USE_COUNT; + init_timer(&vp->timer); + vp->timer.expires = RUN_AT(media_tbl[dev->if_port].wait); + vp->timer.data = (unsigned long)dev; + vp->timer.function = &vortex_timer; /* timer handler */ + add_timer(&vp->timer); return 0; + +out_free_irq: + free_irq(dev->irq, dev); +out: + return retval; } static void vortex_timer(unsigned long data) @@ -1372,7 +1397,10 @@ unsigned long flags; __save_flags(flags); __cli(); - vortex_interrupt(dev->irq, dev, 0); + if (vp->full_bus_master_tx) + boomerang_interrupt(dev->irq, dev, 0); + else + vortex_interrupt(dev->irq, dev, 0); __restore_flags(flags); } } @@ -1606,10 +1634,10 @@ int i; if (vortex_debug > 3) - printk(KERN_DEBUG "%s: Trying to send a packet, Tx index %d.\n", + printk(KERN_DEBUG "%s: Trying to send a boomerang packet, Tx index %d.\n", dev->name, vp->cur_tx); if (vp->tx_full) { - if (vortex_debug >0) + if (vortex_debug > 0) printk(KERN_WARNING "%s: Tx Ring full, refusing to send buffer.\n", dev->name); return 1; @@ -1623,7 +1651,7 @@ spin_lock_irqsave(&vp->lock, flags); outw(DownStall, ioaddr + EL3_CMD); /* Wait for the stall to complete. */ - for (i = 4000; i >= 0 ; i--) + for (i = 4000; i >= 0; i--) if ( (inw(ioaddr + EL3_STATUS) & CmdInProgress) == 0) break; prev_entry->next = cpu_to_le32(virt_to_bus(&vp->tx_ring[entry])); @@ -1649,6 +1677,12 @@ /* The interrupt handler does all of the Rx thread work and cleans up after the Tx thread. */ + +/* + * This is the ISR for the vortex series chips. 
+ * full_bus_master_tx == 0 && full_bus_master_rx == 0 + */ + static void vortex_interrupt(int irq, void *dev_id, struct pt_regs *regs) { struct device *dev = dev_id; @@ -1660,8 +1694,11 @@ ioaddr = dev->base_addr; spin_lock(&vp->lock); status = inw(ioaddr + EL3_STATUS); - if ((status & IntLatch) == 0) + if ((status & IntLatch) == 0) { + if (vortex_debug > 5) + printk(KERN_DEBUG "%s: no vortex interrupt pending\n", dev->name); goto no_int; /* Happens during shared interrupts */ + } if (status & IntReq) { status |= vp->deferred; @@ -1669,19 +1706,16 @@ } if (vortex_debug > 4) - printk(KERN_DEBUG "%s: interrupt, status %4.4x, latency %d ticks.\n", + printk(KERN_DEBUG "%s: vortex_interrupt, status %4.4x, latency %d ticks.\n", dev->name, status, inb(ioaddr + Timer)); do { if (vortex_debug > 5) printk(KERN_DEBUG "%s: In interrupt loop, status %4.4x.\n", dev->name, status); + if (status & RxComplete) vortex_rx(dev); - if (status & UpComplete) { - outw(AckIntr | UpComplete, ioaddr + EL3_CMD); - boomerang_rx(dev); - } if (status & TxAvailable) { if (vortex_debug > 5) @@ -1692,6 +1726,93 @@ mark_bh(NET_BH); } + if (status & DMADone) { + if (inw(ioaddr + Wn7_MasterStatus) & 0x1000) { + outw(0x1000, ioaddr + Wn7_MasterStatus); /* Ack the event. */ + DEV_FREE_SKB(vp->tx_skb); /* Release the transfered buffer */ + if (inw(ioaddr + TxFree) > 1536) { + clear_bit(0, (void*)&dev->tbusy); + mark_bh(NET_BH); + } else /* Interrupt when FIFO has room for max-sized packet. */ + outw(SetTxThreshold + (1536>>2), ioaddr + EL3_CMD); + } + } + + /* Check for all uncommon interrupts at once. */ + if (status & (HostError | RxEarly | StatsFull | TxComplete | IntReq)) { + if (status == 0xffff) + break; + vortex_error(dev, status); + } + + if (--work_done < 0) { + printk(KERN_WARNING "%s: Too much work in interrupt, status " + "%4.4x.\n", dev->name, status); + /* Disable all pending interrupts. 
*/ + do { + vp->deferred |= status; + outw(SetStatusEnb | (~vp->deferred & vp->status_enable), + ioaddr + EL3_CMD); + outw(AckIntr | (vp->deferred & 0x7ff), ioaddr + EL3_CMD); + } while ((status = inw(ioaddr + EL3_CMD)) & IntLatch); + /* The timer will reenable interrupts. */ + mod_timer(&vp->timer, RUN_AT(1)); + break; + } + + /* Acknowledge the IRQ. */ + outw(AckIntr | IntReq | IntLatch, ioaddr + EL3_CMD); + } while ((status = inw(ioaddr + EL3_STATUS)) & (IntLatch | RxComplete)); + + if (vortex_debug > 4) + printk(KERN_DEBUG "%s: exiting interrupt, status %4.4x.\n", + dev->name, status); + +no_int: + spin_unlock(&vp->lock); +} + +/* + * This is the ISR for the boomerang/cyclone/hurricane/tornado series chips. + * full_bus_master_tx == 1 && full_bus_master_rx == 1 + */ + +static void boomerang_interrupt(int irq, void *dev_id, struct pt_regs *regs) +{ + struct device *dev = dev_id; + struct vortex_private *vp = (struct vortex_private *)dev->priv; + long ioaddr; + int status; + int work_done = max_interrupt_work; + + ioaddr = dev->base_addr; + spin_lock(&vp->lock); + status = inw(ioaddr + EL3_STATUS); + if ((status & IntLatch) == 0) { + if (vortex_debug > 5) + printk(KERN_DEBUG "%s: no boomerang interrupt pending\n", dev->name); + goto no_int; /* Happens during shared interrupts */ + } + + if (status & IntReq) { + status |= vp->deferred; + vp->deferred = 0; + } + + if (vortex_debug > 4) + printk(KERN_DEBUG "%s: interrupt, status %04x, latency %d, cur_rx %d, dirty_rx %d\n", + dev->name, status, inb(ioaddr + Timer), vp->cur_rx, vp->dirty_rx); + + do { + if (vortex_debug > 5) + printk(KERN_DEBUG "%s: In interrupt loop, status %4.4x.\n", + dev->name, status); + + if (status & UpComplete) { + outw(AckIntr | UpComplete, ioaddr + EL3_CMD); + boomerang_rx(dev); + } + if (status & DownComplete) { unsigned int dirty_tx = vp->dirty_tx; @@ -1710,22 +1831,12 @@ vp->dirty_tx = dirty_tx; outw(AckIntr | DownComplete, ioaddr + EL3_CMD); if (vp->tx_full && (vp->cur_tx - dirty_tx <= 
TX_RING_SIZE - 1)) { - vp->tx_full= 0; + vp->tx_full = 0; clear_bit(0, (void*)&dev->tbusy); mark_bh(NET_BH); } } - if (status & DMADone) { - if (inw(ioaddr + Wn7_MasterStatus) & 0x1000) { - outw(0x1000, ioaddr + Wn7_MasterStatus); /* Ack the event. */ - DEV_FREE_SKB(vp->tx_skb); /* Release the transfered buffer */ - if (inw(ioaddr + TxFree) > 1536) { - clear_bit(0, (void*)&dev->tbusy); - mark_bh(NET_BH); - } else /* Interrupt when FIFO has room for max-sized packet. */ - outw(SetTxThreshold + (1536>>2), ioaddr + EL3_CMD); - } - } + /* Check for all uncommon interrupts at once. */ if (status & (HostError | RxEarly | StatsFull | TxComplete | IntReq)) { if (status == 0xffff) @@ -1734,25 +1845,18 @@ } if (--work_done < 0) { - if ((status & (0x7fe - (UpComplete | DownComplete))) == 0) { - /* Just ack these and return. */ - outw(AckIntr | UpComplete | DownComplete, ioaddr + EL3_CMD); - } else { - printk(KERN_WARNING "%s: Too much work in interrupt, status " - "%4.4x.\n", dev->name, status); - /* Disable all pending interrupts. */ - do { - vp->deferred |= status; - outw(SetStatusEnb | (~vp->deferred & vp->status_enable), - ioaddr + EL3_CMD); - outw(AckIntr | (vp->deferred & 0x7ff), ioaddr + EL3_CMD); - } while ((status = inw(ioaddr + EL3_CMD)) & IntLatch); - /* The timer will reenable interrupts. */ - del_timer(&vp->timer); - vp->timer.expires = RUN_AT(1); - add_timer(&vp->timer); - break; - } + printk(KERN_WARNING "%s: Too much work in interrupt, status " + "%4.4x.\n", dev->name, status); + /* Disable all pending interrupts. */ + do { + vp->deferred |= status; + outw(SetStatusEnb | (~vp->deferred & vp->status_enable), + ioaddr + EL3_CMD); + outw(AckIntr | (vp->deferred & 0x7ff), ioaddr + EL3_CMD); + } while ((status = inw(ioaddr + EL3_CMD)) & IntLatch); + /* The timer will reenable interrupts. */ + mod_timer(&vp->timer, RUN_AT(1)); + break; } /* Acknowledge the IRQ. */ @@ -1760,7 +1864,15 @@ if (vp->cb_fn_base) /* The PCMCIA people are idiots. 
*/ writel(0x8000, vp->cb_fn_base + 4); - } while ((status = inw(ioaddr + EL3_STATUS)) & (IntLatch | RxComplete)); + } while ((status = inw(ioaddr + EL3_STATUS)) & IntLatch); + + /* + * If we have totally run out to rx skb's due to persistent OOM, + * we can use the Tx interrupt to retry the allocation. Dirty + * but expedient + */ + if ((vp->cur_rx - vp->dirty_rx) == RX_RING_SIZE) + boomerang_rx(dev); if (vortex_debug > 4) printk(KERN_DEBUG "%s: exiting interrupt, status %4.4x.\n", --------------A028613A1132A2E773E9E93A-- From owner-netdev@oss.sgi.com Sun Jun 25 05:14:29 2000 Received: by oss.sgi.com id ; Sun, 25 Jun 2000 05:14:20 -0700 Received: from lightning.swansea.uk.linux.org ([194.168.151.1]:31572 "EHLO the-village.bc.nu") by oss.sgi.com with ESMTP id ; Sun, 25 Jun 2000 05:13:54 -0700 Received: from alan by the-village.bc.nu with local (Exim 2.12 #1) id 136BEN-0002B9-00; Sun, 25 Jun 2000 13:09:43 +0100 Subject: Re: [patch] 3c59x.c for 2.2.17 To: andrewm@uow.edu.au (Andrew Morton) Date: Sun, 25 Jun 2000 13:09:41 +0100 (BST) Cc: alan@lxorguk.ukuu.org.uk (Alan Cox), netdev@oss.sgi.com (netdev@oss.sgi.com), toa@pop.agri.ch (Andreas Tobler) In-Reply-To: <3955EB11.519DF76B@uow.edu.au> from "Andrew Morton" at Jun 25, 2000 09:20:49 PM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: From: Alan Cox Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > - The driver would crash the machine if all dev_alloc_skb()'s failed > during open(). So I simply fail the open() if we can't preallocate all > the skb's. Ok > - The problem reported by Mark Hemment where the rx path would die if 32 > successive dev_alloc_skb()'s failed has been semi-kludgily fixed by > detecting this situation in the tx interrupt and deliberately calling > the rx ISR. 
This means that in cruel OOM situations we're relying on Tx > interrupts to initiate polling for available memory. This can take some > time if TCP has backed off a long way, but it recovers eventually. This is a mess in a lot of Don Becker skeleton base drivers. Its much cleaner if the last ring buffer slot is about to be used and the new alloc fails to simply recycle the buffer and throw the received packet away. Several drivers simply require the ring remains full and they are a lot cleaner for it. Will apply From owner-netdev@oss.sgi.com Sun Jun 25 05:40:39 2000 Received: by oss.sgi.com id ; Sun, 25 Jun 2000 05:40:19 -0700 Received: from [203.126.247.144] ([203.126.247.144]:56481 "EHLO zsngs001") by oss.sgi.com with ESMTP id ; Sun, 25 Jun 2000 05:39:51 -0700 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by zsngs001; Sun, 25 Jun 2000 20:38:54 +0800 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NQ12P4N9; Sun, 25 Jun 2000 20:38:58 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8NV2; Sun, 25 Jun 2000 22:39:01 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.194.209]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id WAA13278; Sun, 25 Jun 2000 22:37:40 +1000 Message-ID: <3955FE28.F9F16C8F@uow.edu.au> Date: Sun, 25 Jun 2000 22:42:16 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: Alan Cox CC: "netdev@oss.sgi.com" , Andreas Tobler Subject: Re: [patch] 3c59x.c for 2.2.17 References: <3955EB11.519DF76B@uow.edu.au> from "Andrew Morton" at Jun 25, 2000 09:20:49 PM Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: 
owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Alan Cox wrote: > > Its much cleaner > if the last ring buffer slot is about to be used and the new alloc fails to > simply recycle the buffer and throw the received packet away. Agreed. Mark Hemment's fix basically did this. He reserved an 'emergency' skbuff and never gave it away. I was not attracted to it because it handles a _very_ rare case by adding complexity to a core code path upon which I wish to severaly beat later this year. > Several drivers simply require the ring remains full and they are a lot cleaner > for it. I don't know about the other drivers, but I think the way Donald's 3c59x handles the replenishment of the skbuffs is quite delightful. The fact that I could just call the rx ISR at a random point in time and have everything sort itself out is testament to this. From owner-netdev@oss.sgi.com Sun Jun 25 07:04:08 2000 Received: by oss.sgi.com id ; Sun, 25 Jun 2000 07:04:00 -0700 Received: from hornisse.agrinet.ch ([212.28.128.30]:53009 "EHLO pop.agri.ch") by oss.sgi.com with ESMTP id ; Sun, 25 Jun 2000 07:03:22 -0700 Received: from pop.agri.ch (212.28.159.74) by pop.agri.ch (5.0.047) id 395204360000F064; Sun, 25 Jun 2000 16:01:15 +0200 Message-ID: <395612D3.44590147@pop.agri.ch> Date: Sun, 25 Jun 2000 16:10:30 +0200 From: Andreas Tobler Reply-To: toa@pop.agri.ch Organization: zero X-Mailer: Mozilla 4.61 (Macintosh; I; PPC) X-Accept-Language: en MIME-Version: 1.0 To: Andrew Morton CC: Alan Cox , "netdev@oss.sgi.com" Subject: Re: [patch] 3c59x.c for 2.2.17 References: <3955EB11.519DF76B@uow.edu.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andrew Morton wrote: > Patch against 2.2.17-pre5 attached. It hasn't been tested on a 3c590 > (vortex series) since I put the OOM stuff in. 
I didn't actually address > possible OOM problems with the vortex series, so the vortex code paths > should not be affected. I'd appreciate it if you could test it on your > 590. (Andreas: you too, please). Applied against pre6, on PowerMac 7200, ok. Stress testing follows. Andreas From owner-netdev@oss.sgi.com Sun Jun 25 07:09:28 2000 Received: by oss.sgi.com id ; Sun, 25 Jun 2000 07:09:08 -0700 Received: from iwr1.iwr.uni-heidelberg.de ([129.206.104.40]:3234 "EHLO iwr1.iwr.uni-heidelberg.de") by oss.sgi.com with ESMTP id ; Sun, 25 Jun 2000 07:08:50 -0700 Received: from kenzo.iwr.uni-heidelberg.de (IDENT:bogdan@kenzo.iwr.uni-heidelberg.de [129.206.120.29]) by iwr1.iwr.uni-heidelberg.de (8.9.3/8.9.3) with ESMTP id QAA24444; Sun, 25 Jun 2000 16:08:19 +0200 (MET DST) Received: from localhost (bogdan@localhost) by kenzo.iwr.uni-heidelberg.de (8.8.7/8.8.7) with ESMTP id QAA25617; Sun, 25 Jun 2000 16:08:18 +0200 Date: Sun, 25 Jun 2000 16:08:18 +0200 (CEST) From: Bogdan Costescu To: Alan Cox cc: Andrew Morton , "netdev@oss.sgi.com" , Andreas Tobler Subject: Re: [patch] 3c59x.c for 2.2.17 In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sun, 25 Jun 2000, Alan Cox wrote: > This is a mess in a lot of Don Becker skeleton base drivers. Its much cleaner > if the last ring buffer slot is about to be used and the new alloc fails to > simply recycle the buffer and throw the received packet away. > > Several drivers simply require the ring remains full and they are a lot cleaner > for it. If I understand it right, you want that the driver never passes one of its preallocated Rx buffers to the upper levels, so that the Rx ring is always of size RX_RING_SIZE. 
On receiving, it tries to allocate a new buffer; if that succeeds it copies the data to it and sends it up, and if it fails it simply gives up, so that the received data (already in the Rx ring buffer) is discarded. Don Becker's skeleton has an rx_copybreak scheme which allows this situation to happen now with a small modification. If you set rx_copybreak > 1512, it will always allocate a new buffer and memcpy the data to it; however, in the failing case, it goes through the other branch of the 'if', as the 2 conditions (pkt_len < rx_copybreak and dev_alloc_skb != NULL) are AND-ed; splitting them into two 'if's solves this problem. The reason for this AND is that if the new alloc fails, the Rx ring buffer (with data) is sent up and the re-filling of the Rx ring will (hopefully) get a new buffer. So, as I see it, your solution denies the delivery of received data in the few ring buffers in favor of keeping the already allocated memory. In both cases, in an OOM situation, the flow of data to upper levels will stop; so what is the advantage?
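The difference between the AND-ed test and the split test described above can be shown with a small userspace sketch. It is only an illustration under stated assumptions: the enum and both function names are hypothetical, alloc_ok stands in for "dev_alloc_skb() returned non-NULL", and nothing here is copied from 3c59x.c.

```c
#include <assert.h>

/* Hypothetical outcomes for one received frame (illustration only). */
enum rx_action {
    RX_COPY_TO_NEW,    /* frame copied to a fresh buffer, ring buffer stays put */
    RX_PASS_RING_BUF,  /* ring buffer itself is sent up, slot needs a refill */
    RX_DROP_RECYCLE    /* frame discarded, ring buffer reused as it is */
};

/* Shape of the existing test: both conditions AND-ed in a single 'if'.
 * alloc_ok is an assumed stand-in for a successful dev_alloc_skb(). */
static enum rx_action rx_decide_anded(int pkt_len, int rx_copybreak, int alloc_ok)
{
    assert(pkt_len >= 0);
    if (pkt_len < rx_copybreak && alloc_ok)
        return RX_COPY_TO_NEW;
    /* Also reached when the allocation failed: the ring buffer leaves the ring. */
    return RX_PASS_RING_BUF;
}

/* Shape after splitting into two 'if's: a failed allocation in the
 * copy path drops the frame and keeps the ring buffer in its slot. */
static enum rx_action rx_decide_split(int pkt_len, int rx_copybreak, int alloc_ok)
{
    assert(pkt_len >= 0);
    if (pkt_len < rx_copybreak) {
        if (alloc_ok)
            return RX_COPY_TO_NEW;
        return RX_DROP_RECYCLE;
    }
    return RX_PASS_RING_BUF;
}
```

With rx_copybreak set above the maximum frame size, every frame takes the first branch of the split version, so the ring never gives a buffer away; the AND-ed version still hands the ring buffer up whenever the allocation fails.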
Sincerely, Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De From owner-netdev@oss.sgi.com Sun Jun 25 08:29:49 2000 Received: by oss.sgi.com id ; Sun, 25 Jun 2000 08:29:30 -0700 Received: from lightning.swansea.uk.linux.org ([194.168.151.1]:16987 "EHLO the-village.bc.nu") by oss.sgi.com with ESMTP id ; Sun, 25 Jun 2000 08:29:08 -0700 Received: from alan by the-village.bc.nu with local (Exim 2.12 #1) id 136EHC-0002Xn-00; Sun, 25 Jun 2000 16:24:50 +0100 Subject: Re: [patch] 3c59x.c for 2.2.17 To: Bogdan.Costescu@IWR.Uni-Heidelberg.De (Bogdan Costescu) Date: Sun, 25 Jun 2000 16:24:48 +0100 (BST) Cc: alan@lxorguk.ukuu.org.uk (Alan Cox), andrewm@uow.edu.au (Andrew Morton), netdev@oss.sgi.com (netdev@oss.sgi.com), toa@pop.agri.ch (Andreas Tobler) In-Reply-To: from "Bogdan Costescu" at Jun 25, 2000 04:08:18 PM X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: From: Alan Cox Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > If I understand it right, you want that the driver never passes one of its > preallocated Rx buffers to the upper levels, so that the Rx ring is always No. 
I want to do if(; Sun, 25 Jun 2000 09:44:48 -0700 Received: from adsl-151-196-242-4.bellatlantic.net ([151.196.242.4]:35054 "EHLO vaio.greennet") by oss.sgi.com with ESMTP id ; Sun, 25 Jun 2000 09:44:19 -0700 Received: from localhost (becker@localhost) by vaio.greennet (8.9.3/8.8.7) with ESMTP id MAA18809; Sun, 25 Jun 2000 12:45:50 -0400 Date: Sun, 25 Jun 2000 12:45:50 -0400 (EDT) From: Donald Becker X-Sender: becker@vaio.greennet To: Alan Cox cc: Andrew Morton , "netdev@oss.sgi.com" Subject: Re: [patch] 3c59x.c for 2.2.17 In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sun, 25 Jun 2000, Alan Cox wrote: > > - The driver would crash the machine if all dev_alloc_skb()'s failed > > during open(). So I simply fail the open() if we can't preallocate all > > the skb's. > > Ok The previous code would allow operation even if not all buffers could be allocated. This is a good thing. I suppose the correct operation is to refuse to start if no buffers at all can be allocated. > > - The problem reported by Mark Hemment where the rx path would die if 32 > > successive dev_alloc_skb()'s failed has been semi-kludgily fixed by > > detecting this situation in the tx interrupt and deliberately calling > > the rx ISR. This means that in cruel OOM situations we're relying on Tx > > interrupts to initiate polling for available memory. This can take some > > time if TCP has backed off a long way, but it recovers eventually. > > This is a mess in a lot of Don Becker skeleton base drivers. Its much cleaner > if the last ring buffer slot is about to be used and the new alloc fails to > simply recycle the buffer and throw the received packet away. > > Several drivers simply require the ring remains full and they are a lot cleaner > for it. All drivers used to be this way. 
I had to change the structure on many drivers when there were several kernel versions where the kernel would fail skbuff allocation even when there was plenty of free memory. IIRC, there were other kernel versions where using swap space didn't occur quickly enough, and there would be temporary shortages of memory that would last for several tens of milliseconds. Donald Becker becker@scyld.com Scyld Computing Corporation http://www.scyld.com 410 Severn Ave. Suite 210 Annapolis MD 21403 From owner-netdev@oss.sgi.com Sun Jun 25 09:49:58 2000 Received: by oss.sgi.com id ; Sun, 25 Jun 2000 09:49:48 -0700 Received: from iwr1.iwr.uni-heidelberg.de ([129.206.104.40]:31651 "EHLO iwr1.iwr.uni-heidelberg.de") by oss.sgi.com with ESMTP id ; Sun, 25 Jun 2000 09:49:23 -0700 Received: from kenzo.iwr.uni-heidelberg.de (IDENT:bogdan@kenzo.iwr.uni-heidelberg.de [129.206.120.29]) by iwr1.iwr.uni-heidelberg.de (8.9.3/8.9.3) with ESMTP id SAA25119; Sun, 25 Jun 2000 18:48:52 +0200 (MET DST) Received: from localhost (bogdan@localhost) by kenzo.iwr.uni-heidelberg.de (8.8.7/8.8.7) with ESMTP id SAA25940; Sun, 25 Jun 2000 18:48:51 +0200 Date: Sun, 25 Jun 2000 18:48:51 +0200 (CEST) From: Bogdan Costescu To: Alan Cox cc: Andrew Morton , "netdev@oss.sgi.com" , Andreas Tobler Subject: Re: [patch] 3c59x.c for 2.2.17 In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sun, 25 Jun 2000, Alan Cox wrote: > No. I want to do With all due respect, I don't understand! > if( { > skb=alloc_skb(RING_SIZE); What has RING_SIZE to do with the allocation of a buffer? Should it be "PKT_BUF_SZ" instead (="size of each temporary Rx buffer") ? > if(skb==NULL) > recycle original What does "recycle" mean: is it sent up or kept? In case it's sent up, how do we get another buffer? In case it's kept, what do we send? 
> } > else > { > alloc skb > if(skb!=NULL) > copy > } Now the "else" part is just the opposite of Don's copybreak scheme. Why do you want to copy when pkt_size > rx_copybreak? Isn't memcpy too expensive in this case? Sincerely, Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De From owner-netdev@oss.sgi.com Sun Jun 25 10:34:58 2000 Received: by oss.sgi.com id ; Sun, 25 Jun 2000 10:34:39 -0700 Received: from u-202.karlsruhe.ipdial.viaginterkom.de ([62.180.10.202]:12292 "EHLO u-202.karlsruhe.ipdial.viaginterkom.de") by oss.sgi.com with ESMTP id ; Sun, 25 Jun 2000 10:34:05 -0700 Received: (ralf@lappi) by lappi.waldorf-gmbh.de id ; Sun, 25 Jun 2000 19:33:14 +0200 Date: Sun, 25 Jun 2000 19:33:14 +0200 From: Ralf Baechle To: Richard Gooch Cc: Philipp Rumpf , Andrew Morton , Rusty Russell , Keith Owens , Alan Cox , "netdev@oss.sgi.com" Subject: Re: modular net drivers Message-ID: <20000625193313.C1572@bacchus.dhis.org> References: <20000623164805.AA5BB8154@halfway> <3954262D.60BDEF41@uow.edu.au> <20000624080106.A25102@fruits.uzix.org> <3954D42A.938A724B@uow.edu.au> <20000624093548.A31621@puffin.external.hp.com> <200006241548.e5OFmiC09138@vindaloo.ras.ucalgary.ca> <20000624085414.C25102@fruits.uzix.org> <200006241603.e5OG38E09451@vindaloo.ras.ucalgary.ca> <20000624091048.D25102@fruits.uzix.org> <200006241611.e5OGBh809634@vindaloo.ras.ucalgary.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <200006241611.e5OGBh809634@vindaloo.ras.ucalgary.ca>; from rgooch@ras.ucalgary.ca on Sat, Jun 24, 2000 at 10:11:43AM -0600 X-Accept-Language: de,en,fr Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, Jun 24, 2000 at 10:11:43AM -0600, Richard Gooch wrote: > > ia64 and mips64 seem to have it as 
well. That leaves sparc which I > > guess just hasn't been merged. > > Where is it for ia64 and mips64? I didn't see it in Linus' tree (hence > it doesn't exist). If it's not in Linus tree then in any case it's in Linus mailfolder as part of my latest batch of patches sent to him. Ralf From owner-netdev@oss.sgi.com Sun Jun 25 16:40:51 2000 Received: by oss.sgi.com id ; Sun, 25 Jun 2000 16:40:41 -0700 Received: from [203.126.247.144] ([203.126.247.144]:56501 "EHLO zsngs001") by oss.sgi.com with ESMTP id ; Sun, 25 Jun 2000 16:40:06 -0700 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by zsngs001; Mon, 26 Jun 2000 07:38:45 +0800 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NQ12PV25; Mon, 26 Jun 2000 07:38:49 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id NCLF8N7C; Mon, 26 Jun 2000 09:38:52 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id JAA19767; Mon, 26 Jun 2000 09:37:24 +1000 Message-ID: <395697B3.5C83F98A@uow.edu.au> Date: Sun, 25 Jun 2000 23:37:23 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.4.0-test1-ac10 i686) X-Accept-Language: en MIME-Version: 1.0 To: Donald Becker CC: Alan Cox , "netdev@oss.sgi.com" Subject: Re: [patch] 3c59x.c for 2.2.17 References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 879 Lines: 26 Donald Becker wrote: > > On Sun, 25 Jun 2000, Alan Cox wrote: > > > > - The driver would crash the machine if all dev_alloc_skb()'s failed > > > during open(). 
So I simply fail the open() if we can't preallocate all > > > the skb's. > > > > Ok > > The previous code would allow operation even if not all buffers could be > allocated. This is a good thing. I agree. I think. This is fairly easy to do: partially fill the rx ring and leave the other slots with a NULL skb, and set cur_rx and dirty_rx appropriately. Then hope that boomerang_rx will eventually fill the ring up. However, the risk of losing some of the first 10-20 packets after an ifup is unattractive - ARP, DHCP, NIS, etc. > I suppose the correct operation is to refuse to start if no buffers at all > can be allocated. Yes. The open() code _was_ referencing the -1'th ring entry in this situation.. From owner-netdev@oss.sgi.com Sun Jun 25 22:00:32 2000 Received: by oss.sgi.com id ; Sun, 25 Jun 2000 22:00:12 -0700 Received: from saw.sw.com.sg ([203.120.9.98]:37505 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Sun, 25 Jun 2000 21:59:45 -0700 Received: (qmail 9594 invoked by uid 577); 26 Jun 2000 04:59:09 -0000 Message-ID: <20000626125909.A9548@saw.sw.com.sg> Date: Mon, 26 Jun 2000 12:59:09 +0800 From: Andrey Savochkin To: Alan Cox , Andrew Morton Cc: "netdev@oss.sgi.com" , Andreas Tobler Subject: Re: 3c59x.c for 2.2.17 References: <3955EB11.519DF76B@uow.edu.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.93.2i In-Reply-To: ; from "Alan Cox" on Sun, Jun 25, 2000 at 01:09:41PM Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 330 Lines: 10 Hello, On Sun, Jun 25, 2000 at 01:09:41PM +0100, Alan Cox wrote: > This is a mess in a lot of Don Becker skeleton base drivers. Its much cleaner > if the last ring buffer slot is about to be used and the new alloc fails to > simply recycle the buffer and throw the received packet away. I do it this way in eepro100. 
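The recycle-on-failure approach quoted above can be sketched in userspace as follows. This is a minimal illustration, not driver code: struct rx_ring, ring_init(), rx_one() and failing_alloc() are hypothetical names, malloc() stands in for dev_alloc_skb(), and none of it is taken from eepro100.c or 3c59x.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RX_RING_SIZE 4      /* kept small for the example */
#define PKT_BUF_SZ   1536   /* assumed size of each ring buffer */

typedef void *(*alloc_fn)(size_t);

struct rx_ring {
    void *slot[RX_RING_SIZE];  /* invariant: never NULL after ring_init() */
    unsigned int cur_rx;
    unsigned long delivered;
    unsigned long dropped;
};

/* Allocator that always fails, to simulate persistent OOM. */
static void *failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}

/* Fill every slot or fail as a whole (compare: failing the open()). */
static int ring_init(struct rx_ring *r, alloc_fn alloc)
{
    int i;

    r->cur_rx = 0;
    r->delivered = 0;
    r->dropped = 0;
    for (i = 0; i < RX_RING_SIZE; i++) {
        r->slot[i] = alloc(PKT_BUF_SZ);
        if (r->slot[i] == NULL) {
            while (--i >= 0) {      /* undo the partial fill */
                free(r->slot[i]);
                r->slot[i] = NULL;
            }
            return -1;
        }
    }
    return 0;
}

/* Handle one received frame. Returns the buffer handed upward (the
 * caller then owns it), or NULL if the frame was thrown away. */
static void *rx_one(struct rx_ring *r, alloc_fn alloc)
{
    unsigned int entry = r->cur_rx % RX_RING_SIZE;
    void *fresh = alloc(PKT_BUF_SZ);
    void *up = NULL;

    assert(r->slot[entry] != NULL);
    if (fresh != NULL) {
        up = r->slot[entry];        /* pass the filled buffer up */
        r->slot[entry] = fresh;     /* replacement keeps the ring full */
        r->delivered++;
    } else {
        r->dropped++;               /* recycle: slot keeps its old buffer */
    }
    r->cur_rx++;
    return up;
}
```

The point of the design is that the ring stays full no matter how long allocations keep failing, so reception resumes on the next frame after memory comes back, without needing a Tx interrupt or a timer to refill empty slots.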
Andrey From owner-netdev@oss.sgi.com Mon Jun 26 01:53:44 2000 Received: by oss.sgi.com id ; Mon, 26 Jun 2000 01:53:25 -0700 Received: from mail.tvfiles.com ([194.185.239.100]:60696 "EHLO mail.tvfiles.com") by oss.sgi.com with ESMTP id ; Mon, 26 Jun 2000 01:52:55 -0700 Received: from tvfiles.com (gilberto.dev.tvfiles.com [192.168.1.10]) by mail.tvfiles.com (8.8.7/8.8.7) with ESMTP id LAA06683 for ; Mon, 26 Jun 2000 11:50:53 +0200 Message-ID: <39571B3B.317426A1@tvfiles.com> Date: Mon, 26 Jun 2000 10:58:36 +0200 From: Antonio Di Noto X-Mailer: Mozilla 4.72 [en] (WinNT; I) X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com Subject: Multicast changes on 2.2.16? Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 577 Lines: 18 Dear sirs, I'm working on some multicast applications for a university exam. There seems to be a change in the way the Linux kernel deals with multicast on multihomed hosts. All the software I tested (both mine and third parties') does not work if the host is multihomed. The packets seem to go nowhere and are invisible to the command "tcpdump ip multicast" too. This does not happen with kernel 2.2.5, for instance. My question is: is there a change in the kernel multicast code after version 2.2.5? If so, how can I get more information?
Thanks in advance, Antonio From owner-netdev@oss.sgi.com Mon Jun 26 06:07:26 2000 Received: by oss.sgi.com id ; Mon, 26 Jun 2000 06:07:17 -0700 Received: from gst.gst.com ([208.219.159.150]:7689 "EHLO gst.gst.com") by oss.sgi.com with ESMTP id ; Mon, 26 Jun 2000 06:06:52 -0700 Received: from x0.org ([208.219.159.214]) by gst.gst.com (8.8.8/8.8.8) with ESMTP id JAA10666 for ; Mon, 26 Jun 2000 09:17:23 -0400 (EDT) Message-ID: <3957554C.D279FB5D@x0.org> Date: Mon, 26 Jun 2000 09:06:20 -0400 From: Andy X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: "netdev@oss.sgi.com" Subject: [Fwd: Re: TCP, UDP,... Kernel modules] Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 812 Lines: 20 > How would it be helpful to make TCP a seperate module? I always > understood TCP as an integral part of functioning IP support. It seems > that it would be easier to just make the IP modular seeing as it is rather > small. My IPv6 module is 133k and I don't see how IPv4 could be much if > at all bigger (basic support). > > For the project I am working, it would be cool if TCP would be a kernel > > module (separate from IP). > > Is that possible? > > What other mess would this drag with? > > Did anybody work in this direction yet? Well, what if you do not need TCP/IP in boot time? Or maybe you need IP, but you do not need TCP. Think about the project of putting linux kernel into a bios. You want to eliminate as much as possible. I guess that's why they made IDE as a kernel module too. 
Andy From owner-netdev@oss.sgi.com Mon Jun 26 06:41:46 2000 Received: by oss.sgi.com id ; Mon, 26 Jun 2000 06:41:36 -0700 Received: from infoteka.nsk.ru ([212.20.32.40]:48652 "HELO infoteka.nsk.ru") by oss.sgi.com with SMTP id ; Mon, 26 Jun 2000 06:40:58 -0700 Received: (qmail 14882 invoked by uid 7770); 26 Jun 2000 13:40:11 -0000 Received: from ppp86.infoteka.nsk.ru (HELO dyp) (dyp@212.20.33.86) by infoteka.nsk.ru with SMTP; 26 Jun 2000 13:40:11 -0000 From: Denis Perchine To: David S. Miller , Andi Kleen , Alexey Kuznetsov , Networking Team Subject: Fwd: Problem with recv syscall on socket when other side closed connection Date: Mon, 26 Jun 2000 20:40:25 +0700 X-Mailer: KMail [version 1.0.29.1] Content-Type: text/plain MIME-Version: 1.0 Message-Id: <0006262043000Q.00485@dyp> Content-Transfer-Encoding: 8bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1318 Lines: 44 Hello all, I did not get any answer on this in linux-kernel. And I send it to maintainers now. The problem occurs in PostgreSQL 7.0.2. There's no such problem in FreeBSD & IRIX. They both gives you 0 after all received data is read. Any comments? ---------- Forwarded Message ---------- Subject: Problem with recv syscall on socket when other side closed connection Date: Sat, 24 Jun 2000 05:58:39 +0700 From: Denis Perchine Hello all, There's quite strange behavior of the linux kernel when other side closed connection and we try to read from socket. Firstly I get -1 and EPIPE in errno. Hmmm... I could not find anywhere in standards or manpages that recv can return EPIPE. OK... The if I try to continue read I will get the rest of the data which arrived between last read and connection close... Very strange logic... Any comments on this. 
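The behaviour described above (an error return first, then the remaining data, then end of stream) can be handled on the application side with a loop like the one sketched here. It is a hedged illustration only: drain_socket() is a hypothetical helper, a blocking stream socket is assumed, and tolerating exactly one error before giving up is an assumption made for the example, not something the kernel or any standard prescribes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Read until end of stream (recv() returns 0) or until buf is full.
 * A single error is remembered in *pending_err and reading continues,
 * so data queued before the close is not lost; a second error aborts.
 * Returns the number of bytes read, or -1 on a repeated error. */
static ssize_t drain_socket(int fd, char *buf, size_t cap, int *pending_err)
{
    size_t total = 0;
    int errors_seen = 0;

    assert(buf != NULL && pending_err != NULL);
    *pending_err = 0;
    while (total < cap) {
        ssize_t n = recv(fd, buf + total, cap - total, 0);

        if (n > 0) {
            total += (size_t)n;
            continue;
        }
        if (n == 0)
            break;                  /* orderly end of stream */
        if (errno == EINTR)
            continue;               /* interrupted, just retry */
        if (errors_seen++ > 0)
            return -1;              /* error repeated: give up */
        *pending_err = errno;       /* e.g. EPIPE or ECONNRESET */
    }
    return (ssize_t)total;
}
```

A caller would inspect *pending_err after the call to learn whether an error was reported along the way, while still receiving whatever data arrived before the connection went away.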
-- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ---------------------------------- ------------------------------------------------------- -- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ---------------------------------- From owner-netdev@oss.sgi.com Mon Jun 26 08:16:38 2000 Received: by oss.sgi.com id ; Mon, 26 Jun 2000 08:16:27 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:41735 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Mon, 26 Jun 2000 08:16:04 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id TAA22206; Mon, 26 Jun 2000 19:15:08 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006261515.TAA22206@ms2.inr.ac.ru> Subject: Re: Fwd: Problem with recv syscall on socket when other side closed connection To: dyp@perchine.com (Denis Perchine) Date: Mon, 26 Jun 2000 19:15:08 +0400 (MSK DST) Cc: davem@redhat.com, ak@muc.de, netdev@oss.sgi.com In-Reply-To: <0006262043000Q.00485@dyp> from "Denis Perchine" at Jun 26, 0 08:40:25 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 687 Lines: 19 Hello! > There's quite strange behavior of the linux kernel when other side closed connection > and we try to read from socket. > Firstly I get -1 and EPIPE in errno. Hmmm... I could not find anywhere in standards or manpages > that recv can return EPIPE. OK... Apparently, your last write() failed and you get asynchronous error notification. What does freebsd return? ECONNRESET? Solaris returns EPIPE as well. And EPIPE really looks as sematically correct code. read() is OK, previous write() failed. > The if I try to continue read I will get the rest of the data which arrived between last read and > connection close... 
Of course. Do you propose to lose this data? 8) Alexey From owner-netdev@oss.sgi.com Mon Jun 26 15:23:20 2000 Received: by oss.sgi.com id ; Mon, 26 Jun 2000 15:23:01 -0700 Received: from pix142166199162.nbtel.net ([142.166.199.162]:62206 "EHLO pzero.sandelman.ottawa.on.ca") by oss.sgi.com with ESMTP id ; Mon, 26 Jun 2000 15:22:36 -0700 Received: from morden.sandelman.ottawa.on.ca (localhost [127.0.0.1]) by pzero.sandelman.ottawa.on.ca (8.8.8/8.8.8) with ESMTP id MAA03950 for ; Mon, 26 Jun 2000 12:14:43 -0600 (MDT) Message-Id: <200006261814.MAA03950@pzero.sandelman.ottawa.on.ca> To: "netdev@oss.sgi.com" Subject: Re: modular net drivers In-reply-to: Your message of "Wed, 21 Jun 2000 07:49:40 +1000." <4450.961537780@ocs3.ocs-net> Mime-Version: 1.0 (generated by tm-edit 7.108) Content-Type: text/plain; charset=US-ASCII Date: Mon, 26 Jun 2000 12:14:03 -0600 From: Michael Richardson Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 2452 Lines: 50 >>>>> "Keith" == Keith Owens writes: Keith> On Tue, 20 Jun 2000 17:01:56 +1000, Keith> Rusty Russell wrote: >> Keith Owens wrote: >>> It is also an important bug fix. The module code has suffered from >>> unload races ever since the kernel locking became fine grained, users >>> can crash the kernel. >> >> Races which can be largely solved at the moment by having the module >> page removal code sync all bh's and softirqs after calling cleanup(). >> Hell, we could even poll all CPUs and check they're not executing in >> the about-to-be-freed pages. Speed is completely unimportant here. Keith> This race is not obvious but IMHO it exists. The original theory was Keith> Kernel load and unload code runs under the big kernel lock. Keith> open() and similar code runs under the big kernel lock. Keith> If the code does MOD_INC_USE_COUNT before sleeping then we are safe. Keith> But consider this race, even on UP. 
Keith> Module has been used, nothing is currently using it, use_count == 0. Keith> rmmod runs, either manual or autoclean. Keith> The module is marked as being deleted. Keith> module_cleanup() is entered, does I/O, sleeps, loses big kernel The module_cleanup() is broken in that case. It should get all resources (i.e. locks) that it needs before doing *anything* and should release all resources as soon as it fails to get any others. To do anything else is to cause a possible deadlock. This is textbook multiprocessing. Keith> AFAICT the only safe mechanism is one that checks the module state Keith> *before* entering the module. Once you enter the module and sleep all Keith> bets are off. And that means exporting the module information to the Keith> open() layer, which is what Al Viro has been doing. Is the "module is marked as being deleted" the info that is passed to open()? "Module is deleted" is an atomic operation. It either occurs because use_count==0, or it fails, and all further calls to the module don't find it. ] Out and about in Ottawa. hmmm... beer. 
| firewalls [ ] Michael Richardson, Sandelman Software Works, Ottawa, ON |net architect[ ] mcr@sandelman.ottawa.on.ca http://www.sandelman.ottawa.on.ca/ |device driver[ ] panic("Just another NetBSD/notebook using, kernel hacking, security guy"); [ From owner-netdev@oss.sgi.com Mon Jun 26 18:28:11 2000 Received: by oss.sgi.com id ; Mon, 26 Jun 2000 18:28:01 -0700 Received: from infoteka.nsk.ru ([212.20.32.40]:47370 "HELO infoteka.nsk.ru") by oss.sgi.com with SMTP id ; Mon, 26 Jun 2000 18:27:38 -0700 Received: (qmail 28928 invoked by uid 7770); 27 Jun 2000 01:27:05 -0000 Received: from ppp86.infoteka.nsk.ru (HELO dyp) (dyp@212.20.33.86) by infoteka.nsk.ru with SMTP; 27 Jun 2000 01:27:05 -0000 From: Denis Perchine To: kuznet@ms2.inr.ac.ru, dyp@perchine.com (Denis Perchine) Subject: Re: Fwd: Problem with recv syscall on socket when other side closed connection Date: Tue, 27 Jun 2000 08:19:15 +0700 X-Mailer: KMail [version 1.0.29.1] Content-Type: text/plain Cc: davem@redhat.com, ak@muc.de, netdev@oss.sgi.com References: <200006261515.TAA22206@ms2.inr.ac.ru> In-Reply-To: <200006261515.TAA22206@ms2.inr.ac.ru> MIME-Version: 1.0 Message-Id: <0006270823180R.00485@dyp> Content-Transfer-Encoding: 8bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1300 Lines: 34 > > There's quite strange behavior of the linux kernel when other side closed connection > > and we try to read from socket. > > Firstly I get -1 and EPIPE in errno. Hmmm... I could not find anywhere in standards or manpages > > that recv can return EPIPE. OK... > > Apparently, your last write() failed and you get asynchronous error > notification. What does freebsd return? ECONNRESET? Sorry... But seems that you did not understand the problem. I talk about recv... Not write... write SHOULD give EPIPE on connection reset... But not recv/read. > Solaris returns EPIPE as well. And EPIPE really looks as sematically > correct code. 
read() is OK, previous write() failed. > > > The if I try to continue read I will get the rest of the data which arrived between last read and > > connection close... > > Of course. Do you propose to lose this data? 8) Usual way of handling connection reset when you do only read is to give all data available and then return 0, indicating EOF. Or some OSes (HPUX if I'm not mistaken) gives you all data available and then ECONNRESET. But not other way around... -- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ---------------------------------- From owner-netdev@oss.sgi.com Mon Jun 26 18:41:01 2000 Received: by oss.sgi.com id ; Mon, 26 Jun 2000 18:40:51 -0700 Received: from pizda.ninka.net ([216.101.162.242]:31108 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Mon, 26 Jun 2000 18:40:31 -0700 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id SAA03319; Mon, 26 Jun 2000 18:31:08 -0700 Date: Mon, 26 Jun 2000 18:31:08 -0700 Message-Id: <200006270131.SAA03319@pizda.ninka.net> X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f From: "David S. Miller" To: dyp@perchine.com CC: kuznet@ms2.inr.ac.ru, dyp@perchine.com, ak@muc.de, netdev@oss.sgi.com In-reply-to: <0006270823180R.00485@dyp> (message from Denis Perchine on Tue, 27 Jun 2000 08:19:15 +0700) Subject: Re: Fwd: Problem with recv syscall on socket when other side closed connection References: <200006261515.TAA22206@ms2.inr.ac.ru> <0006270823180R.00485@dyp> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 906 Lines: 21 From: Denis Perchine Date: Tue, 27 Jun 2000 08:19:15 +0700 Usual way of handling connection reset when you do only read is to give all data available and then return 0, indicating EOF. Or some OSes (HPUX if I'm not mistaken) gives you all data available and then ECONNRESET. 
But not other way around... Connection reset effectively means that TCP's reliable data transport has been completely compromised, and therefore any attempt to move data across that socket should immediately indicate an error. What I am also trying to say is that when one says "all the data available", the rest of the connection has in effect dropped all meaning to that phrase, you no longer know what "all the data" is and between two different reset occurrences "all the data" will be different so why would you want to read any of it at all? Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Mon Jun 26 18:49:51 2000 Received: by oss.sgi.com id ; Mon, 26 Jun 2000 18:49:41 -0700 Received: from infoteka.nsk.ru ([212.20.32.40]:63242 "HELO infoteka.nsk.ru") by oss.sgi.com with SMTP id ; Mon, 26 Jun 2000 18:49:31 -0700 Received: (qmail 29306 invoked by uid 7770); 27 Jun 2000 01:49:27 -0000 Received: from ppp86.infoteka.nsk.ru (HELO dyp) (dyp@212.20.33.86) by infoteka.nsk.ru with SMTP; 27 Jun 2000 01:49:27 -0000 From: Denis Perchine To: "David S. Miller" Subject: Re: Fwd: Problem with recv syscall on socket when other side closed connection Date: Tue, 27 Jun 2000 08:45:46 +0700 X-Mailer: KMail [version 1.0.29.1] Content-Type: text/plain Cc: kuznet@ms2.inr.ac.ru, ak@muc.de, netdev@oss.sgi.com References: <200006261515.TAA22206@ms2.inr.ac.ru> <0006270823180R.00485@dyp> <200006270131.SAA03319@pizda.ninka.net> In-Reply-To: <200006270131.SAA03319@pizda.ninka.net> MIME-Version: 1.0 Message-Id: <0006270852390S.00485@dyp> Content-Transfer-Encoding: 8bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Content-Length: 1792 Lines: 38 > Usual way of handling connection reset when you do only read is to > give all data available and then return 0, indicating EOF. Or some > OSes (HPUX if I'm not mistaken) gives you all data available and > then ECONNRESET. But not other way around...
> > Connection reset effectively means that TCP's reliable data transport > has been completely compromised, and therefore any attempt to move > data across that socket should immediately indicate an error. > > What I am also trying to say is that when one says "all the data > available", the rest of the connection has in effect dropped all > meaning to that phrase, you no longer know what "all the data" is > and between two different reset occurrences "all the data" will be > different so why would you want to read any of it at all? I need to read all data BEFORE reset occurs. The problem is that I get EPIPE before I get the data received before the reset. Just a live example. I have a connection to a PostgreSQL database. And for some reason that connection was reset by the database server. This can happen due to any number of errors. Before closing the connection, the server wrote the reason to it. The problem is that in the case of Linux I will not get this data unless I provide some special handling for this (just ignore EPIPE from recv). This is not the case on other OSes. I also can not understand whether such behavior is compatible with the POSIX specs or not. As far as I can remember recv can not return EPIPE in any case. At least there is no mention of this in the man pages, info, and other docs I have.
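A minimal sketch of the client-side handling described above, assuming a connected stream socket. The helper name drain_after_error and its single-retry policy are hypothetical, not taken from any message in this thread. It remembers an EPIPE or ECONNRESET reported by recv() and reads once more to collect data queued before the error. Whether such data is still readable after the error is the Linux behaviour reported here, not something the sketch can guarantee on other systems.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: read from fd into buf (at most cap bytes) until EOF.
 * An EPIPE or ECONNRESET from recv() is stored in *pending_err and the read
 * is retried once, so that data queued before the error can be collected.
 * Returns the number of bytes stored, or -1 if an error occurred before any
 * data was read.
 */
ssize_t drain_after_error(int fd, char *buf, size_t cap, int *pending_err)
{
    size_t used = 0;
    int retries_left = 1;   /* assumption: one retry after the async error */

    assert(buf != NULL && pending_err != NULL);
    *pending_err = 0;

    while (used < cap) {
        ssize_t n = recv(fd, buf + used, cap - used, 0);
        if (n > 0) {
            used += (size_t)n;
        } else if (n == 0) {
            break;                          /* orderly EOF from the peer */
        } else if (errno == EINTR) {
            continue;                       /* interrupted, just retry */
        } else if ((errno == EPIPE || errno == ECONNRESET) && retries_left > 0) {
            *pending_err = errno;           /* remember it, then try to drain */
            retries_left--;
        } else {
            return used > 0 ? (ssize_t)used : -1;
        }
    }
    return (ssize_t)used;
}
```

The caller can inspect *pending_err afterwards and decide, per protocol, whether the collected bytes are still worth parsing.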
-- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ---------------------------------- From owner-netdev@oss.sgi.com Mon Jun 26 21:28:12 2000 Received: by oss.sgi.com id ; Mon, 26 Jun 2000 21:27:52 -0700 Received: from pizda.ninka.net ([216.101.162.242]:32900 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Mon, 26 Jun 2000 21:27:47 -0700 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id VAA03406; Mon, 26 Jun 2000 21:18:14 -0700 Date: Mon, 26 Jun 2000 21:18:14 -0700 Message-Id: <200006270418.VAA03406@pizda.ninka.net> X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f From: "David S. Miller" To: dyp@perchine.com CC: kuznet@ms2.inr.ac.ru, ak@muc.de, netdev@oss.sgi.com In-reply-to: <0006270852390S.00485@dyp> (message from Denis Perchine on Tue, 27 Jun 2000 08:45:46 +0700) Subject: Re: Fwd: Problem with recv syscall on socket when other side closed connection References: <200006261515.TAA22206@ms2.inr.ac.ru> <0006270823180R.00485@dyp> <200006270131.SAA03319@pizda.ninka.net> <0006270852390S.00485@dyp> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing From: Denis Perchine Date: Tue, 27 Jun 2000 08:45:46 +0700 I need to read all data BEFORE reset occurs. The problem is that I get EPIPE before I get the data received before the reset. And I am trying to tell you that you are not guaranteed to get that data, ever. Here is one example, if the network between the sql server and your client reorders the packets such that the reset arrives before the "reason" data, TCP will not even take the data packets and will drop them on the floor. The sql server should gracefully do a normal close of the socket if it wishes the client to receive the data with any amount of certainty. Later, David S.
Miller davem@redhat.com From owner-netdev@oss.sgi.com Mon Jun 26 21:30:52 2000 Received: by oss.sgi.com id ; Mon, 26 Jun 2000 21:30:42 -0700 Received: from infoteka.nsk.ru ([212.20.32.40]:28687 "HELO infoteka.nsk.ru") by oss.sgi.com with SMTP id ; Mon, 26 Jun 2000 21:30:35 -0700 Received: (qmail 4284 invoked by uid 7770); 27 Jun 2000 04:30:10 -0000 Received: from ppp86.infoteka.nsk.ru (HELO dyp) (dyp@212.20.33.86) by infoteka.nsk.ru with SMTP; 27 Jun 2000 04:30:10 -0000 From: Denis Perchine To: "David S. Miller" Subject: Re: Fwd: Problem with recv syscall on socket when other side closed connection Date: Tue, 27 Jun 2000 11:32:11 +0700 X-Mailer: KMail [version 1.0.29.1] Content-Type: text/plain Cc: kuznet@ms2.inr.ac.ru, ak@muc.de, netdev@oss.sgi.com References: <200006261515.TAA22206@ms2.inr.ac.ru> <0006270852390S.00485@dyp> <200006270418.VAA03406@pizda.ninka.net> In-Reply-To: <200006270418.VAA03406@pizda.ninka.net> MIME-Version: 1.0 Message-Id: <00062711333501.00507@dyp> Content-Transfer-Encoding: 8bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > I need to read all data BEFORE reset occurs. The problem is that I > get EPIPE before I get the data received before the reset. > > And I am trying to tell you that you are not guaranteed to > get that data, ever. > > Here is one example, if the network between the sql server and > your client reorders the packets such that the reset arrives before > the "reason" data, TCP will not even take the data packets and > will drop them on the floor. > > The sql server should gracefully do a normal close of the socket > if it wishes the client to receive the data with any amount of > certainty. Ough... Sorry... This was a misunderstanding. It does. And I get EPIPE.. For sure it does. Connection is closed using close syscall.
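The "graceful normal close" advised above can be sketched for the sending side. This is only a sketch, assuming a connected stream socket; the helper name send_reason_and_close is hypothetical. It sends the final message, half-closes with shutdown(SHUT_WR) so the peer sees the data followed by EOF, and drains unread input before close(). The draining step rests on an assumption that nobody in this thread has confirmed for the PostgreSQL case: that calling close() while unread data sits in the receive queue can make the stack send a reset instead of a normal FIN.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: send msg, half-close, drain the peer's remaining
 * input, then close. Returns 0 on success, -1 on any failure.
 * Note: the drain loop blocks until the peer closes its side; a real
 * server would add a timeout.
 */
int send_reason_and_close(int fd, const char *msg, size_t len)
{
    size_t sent = 0;
    char scratch[512];
    int rc = 0;

    assert(msg != NULL || len == 0);

    while (sent < len) {
        /* MSG_NOSIGNAL: report EPIPE through errno instead of SIGPIPE */
        ssize_t n = send(fd, msg + sent, len - sent, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            rc = -1;
            break;
        }
        sent += (size_t)n;
    }

    /* Half-close: the peer reads our data and then sees EOF. */
    if (rc == 0 && shutdown(fd, SHUT_WR) < 0)
        rc = -1;

    /* Drain until the peer closes, so nothing is left unread at close(). */
    if (rc == 0) {
        for (;;) {
            ssize_t n = recv(fd, scratch, sizeof scratch, 0);
            if (n > 0)
                continue;
            if (n < 0 && errno == EINTR)
                continue;
            if (n < 0)
                rc = -1;
            break;
        }
    }

    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

With this ordering the receiver would normally get the reason text followed by a return value of 0 from recv(), which is the sequence Denis describes as the usual one.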
-- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ---------------------------------- From owner-netdev@oss.sgi.com Tue Jun 27 05:23:07 2000 Received: by oss.sgi.com id ; Tue, 27 Jun 2000 05:22:47 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:32017 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 27 Jun 2000 05:22:18 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id QAA02901; Tue, 27 Jun 2000 16:21:55 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006271221.QAA02901@ms2.inr.ac.ru> Subject: Re: Fwd: Problem with recv syscall on socket when other side closed connection To: dyp@perchine.com (Denis Perchine) Date: Tue, 27 Jun 2000 16:21:55 +0400 (MSK DST) Cc: dyp@perchine.com, davem@redhat.com, ak@muc.de, netdev@oss.sgi.com In-Reply-To: <0006270823180R.00485@dyp> from "Denis Perchine" at Jun 27, 0 08:19:15 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 1513 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > Sorry... But seems that you did not understand the problem. > I talk about recv... Not write... write SHOULD give EPIPE on connection reset... > But not recv/read. I did understand. This error was for write(), but it became known _after_ you exited write(). So that it is delivered to read(). It is usual problem of all full-duplex pipes. We could translate this EPIPE to ECONNRESET, when it is delivered to read(), but it does not change its sense. Solaris does not translate. > Usual way of handling connection reset when you do only read is to give > all data available and then return 0, indicating EOF. Sorry? Think a bit. You wrote to dead socket, right? It is the hardest error. If the transport were local, you would get SIGPIPE and died painful death. 
If an OS ignores such events, it is simply impossible to use, you will get silently truncated data all the time. > Or some OSes (HPUX if I'm not mistaken) gives you all data available and then > ECONNRESET. But not other way around... This approach has its merits, and it is acceptable in principle. But Linux approach is evidently better, because errors are expedited. Each protocol, where out of band events are inlined to data is inclined to deadlocks. In Linux scheme you know forward that stream is aborted. Depending on protocol you may choose to abort protocol or to continue to operate, parsing already received messages. > But not other way around... You have just seen a new way around. The correct one. 8) Alexey From owner-netdev@oss.sgi.com Tue Jun 27 07:58:58 2000 Received: by oss.sgi.com id ; Tue, 27 Jun 2000 07:58:47 -0700 Received: from gst.gst.com ([208.219.159.150]:56082 "EHLO gst.gst.com") by oss.sgi.com with ESMTP id ; Tue, 27 Jun 2000 07:48:55 -0700 Received: from x0.org ([208.219.159.214]) by gst.gst.com (8.8.8/8.8.8) with ESMTP id KAA29280 for ; Tue, 27 Jun 2000 10:59:57 -0400 (EDT) Message-ID: <3958BED5.43FE6C27@x0.org> Date: Tue, 27 Jun 2000 10:48:53 -0400 From: Andy X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: "netdev@oss.sgi.com" Subject: Network protocols as modules Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! After a long brainstorming meeting, we compiled a list of some good reasons for network protocols to be implemented as kernel modules: Cleaner code Better maintainability Better extensibility Modularity Smaller initial kernel Higher efficiency Here are some scenarios: PDA, linux powered watch, satellite,... does not need TCP/IP running all the time. It could load module when needed.
This would save working space and power consumption by adding minimal latency to the initial network request. Linux powered diskless workstation has kernel in some small memory, for example bios and it runs bootp/tftp. They both run without TCP, why would TCP eat the memory. The kernel can get TCP module from the network. Problems of putting TCP on some other layer or sticking another layer between TCP and IP. If TCP would be independent from IP, then this would be trivial. IPV6 support is complicated because TCP has its tentacles inside IP implementation. Please comment. Andy From owner-netdev@oss.sgi.com Tue Jun 27 09:03:49 2000 Received: by oss.sgi.com id ; Tue, 27 Jun 2000 08:49:38 -0700 Received: from mail.inconnect.com ([209.140.64.7]:15320 "HELO mail.inconnect.com") by oss.sgi.com with SMTP id ; Tue, 27 Jun 2000 08:49:15 -0700 Received: (qmail 25324 invoked from network); 27 Jun 2000 15:49:19 -0000 Received: from ultra1.inconnect.com (209.140.64.2) by mail with SMTP; 27 Jun 2000 15:49:19 -0000 Date: Tue, 27 Jun 2000 09:49:19 -0600 (MDT) From: Keyshaun X-Sender: kruger@ultra1.inconnect.com To: Andy cc: "netdev@oss.sgi.com" Subject: Re: Network protocols as modules In-Reply-To: <3958BED5.43FE6C27@x0.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing First I am rather curious as to why it would be so important to cut such a small part of the kernel and make it a module. I may just be a bit slow not knowing how much TCP and UDP take up. The things I make modules are things like IPv6, and network drivers that I may need on the next upgrade, and possibly my PPP if I really wanted to. > Cleaner code > Better maintainability > Better extensibility > Modularity > Smaller initial kernel > Higher efficiency How much smaller will the kernel be? > > Here are some scenarios: > > PDA, linux powered watch, satellite,...
does not need TCP/IP running all > the time. It could load module when needed. This would save working > space and power consumption by adding minimal latency to the initial > network request. I can understand working space, but in the case of small devices wouldn't it be possible to get the text segment mapped directly to the memory that the kernel is burned into (this is 2MB on some PDAs) and thus make it a moot point to make anything a module? At that time the module would just be taking up user memory being stored and more in execution. > Linux powered diskless workstation has kernel in some small memory, for > example bios and it runs bootp/tftp. They both run without TCP, why > would TCP eat the memory. The kernel can get TCP module from the > network. I see a similar situation with the diskless workstation. Some network cards already have the ability to get an OS and boot it from a network source. If the OS is coming from over the network why not bring a kernel with all the basic support you need (full netdrivers no modules). > Problems of putting TCP on some other layer or sticking another layer > between TCP and > IP. If TCP would be independent from IP, then this would be trivial. According to the OSI model they are already in separate layers. TCP is just the feature of IP that allows streaming. I'm sure other protocols have their own methods. > IPV6 support is complicated because TCP has its tentacles inside IP > implementation. I would rather run full IPv6 anyway and let the 138k module be in the kernel and take up space and cut other things. The only trouble is that this isn't an option until IPv6 support is complete. Until then, If I am designing something I will put memory into it so that it will not need to be modular.
Shaun Kruger From owner-netdev@oss.sgi.com Tue Jun 27 09:20:19 2000 Received: by oss.sgi.com id ; Tue, 27 Jun 2000 09:20:09 -0700 Received: from infoteka.nsk.ru ([212.20.32.40]:36869 "HELO infoteka.nsk.ru") by oss.sgi.com with SMTP id ; Tue, 27 Jun 2000 09:10:04 -0700 Received: (qmail 28121 invoked by uid 7770); 27 Jun 2000 16:10:10 -0000 Received: from ppp86.infoteka.nsk.ru (HELO dyp) (dyp@212.20.33.86) by infoteka.nsk.ru with SMTP; 27 Jun 2000 16:10:10 -0000 From: Denis Perchine To: kuznet@ms2.inr.ac.ru Subject: Re: Fwd: Problem with recv syscall on socket when other side closed connection Date: Tue, 27 Jun 2000 23:07:47 +0700 X-Mailer: KMail [version 1.0.29.1] Content-Type: text/plain Cc: davem@redhat.com, ak@muc.de, netdev@oss.sgi.com References: <200006271221.QAA02901@ms2.inr.ac.ru> In-Reply-To: <200006271221.QAA02901@ms2.inr.ac.ru> MIME-Version: 1.0 Message-Id: <00062723125606.00490@dyp> Content-Transfer-Encoding: 8bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > > Sorry... But seems that you did not understand the problem. > > I talk about recv... Not write... write SHOULD give EPIPE on connection reset... > > But not recv/read. > > I did understand. This error was for write(), but it became known > _after_ you exited write(). So that it is delivered to read(). > It is usual problem of all full-duplex pipes. > > We could translate this EPIPE to ECONNRESET, when it is delivered > to read(), but it does not change its sense. No.... Don't do it... At least I can write workaround with comment: /* Just to make this buggy Linux happy :-(((( */ > Solaris does not translate. > > > > Usual way of handling connection reset when you do only read is to give > > all data available and then return 0, indicating EOF. > > Sorry? Think a bit. > > You wrote to dead socket, right? It is the hardest error. > If the transport were local, you would get SIGPIPE and died painful death. 
> If an OS ignores such events, it is simply impossible to use, > you will get silently truncated data all the time. > > > Or some OSes (HPUX if I'm not mistaken) gives you all data available and then > > ECONNRESET. But not other way around... > > This approach has its merits, and it is acceptable in principle. > > But Linux approach is evidently better, because errors are expedited. > Each protocol, where out of band events are inlined to data > is inclined to deadlocks. > > In Linux scheme you know forward that stream is aborted. > Depending on protocol you may choose to abort protocol > or to continue to operate, parsing already received messages. > > > But not other way around... > > You have just seen a new way around. The correct one. 8) Now I start to understand why BSD people hate Linux... Due to such statements. Maybe even this is the correct one, but it breaks programs that works. And it is different from other OSes which makes the task of writing portable programs very hard. Only because Linux kernel people think they are gods. 
-- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ---------------------------------- From owner-netdev@oss.sgi.com Tue Jun 27 09:30:50 2000 Received: by oss.sgi.com id ; Tue, 27 Jun 2000 09:30:40 -0700 Received: from minus.inr.ac.ru ([193.233.7.97]:263 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Tue, 27 Jun 2000 09:30:31 -0700 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id UAA13199; Tue, 27 Jun 2000 20:30:04 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200006271630.UAA13199@ms2.inr.ac.ru> Subject: Re: Fwd: Problem with recv syscall on socket when other side closed connection To: dyp@perchine.com (Denis Perchine) Date: Tue, 27 Jun 2000 20:30:03 +0400 (MSK DST) Cc: davem@redhat.com, ak@muc.de, netdev@oss.sgi.com In-Reply-To: <00062723125606.00490@dyp> from "Denis Perchine" at Jun 27, 0 11:07:47 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 484 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > No.... Don't do it... At least I can write workaround with comment: > /* > Just to make this buggy Linux happy :-(((( > */ OK. Let me to reproduce. > > You wrote to dead socket, right? It is the hardest error. > Now I start to understand why BSD people hate Linux... Guy, before all your program is buggy. You could tell "thank you" to people who have lost time explaining you this. And you could thank also OS, which allowed to expose this _fatal_ bug. 
Alexey From owner-netdev@oss.sgi.com Tue Jun 27 23:44:48 2000 Received: by oss.sgi.com id ; Tue, 27 Jun 2000 23:44:37 -0700 Received: from mail.informatik.uni-ulm.de ([134.60.68.63]:11310 "EHLO mail.informatik.uni-ulm.de") by oss.sgi.com with ESMTP id ; Tue, 27 Jun 2000 23:44:18 -0700 Received: from [134.60.8.166] (helo=ferret.extern.uni-ulm.de ident=user92474) by mail.informatik.uni-ulm.de with esmtp (Exim 3.00 #1) id 137BaK-0004h2-00; Wed, 28 Jun 2000 08:44:32 +0200 Received: from blackbird.extern.uni-ulm.de ([172.16.1.10] ident=root) by ferret.extern.uni-ulm.de with esmtp (Exim 3.02 #2) id 137BJ7-00003Y-00; Wed, 28 Jun 2000 08:26:45 +0200 Received: (from stefan@localhost) by blackbird.extern.uni-ulm.de (8.9.3/8.9.3/SuSE Linux 8.9.3-0.1) id IAA00969; Wed, 28 Jun 2000 08:22:07 +0200 Date: Wed, 28 Jun 2000 08:22:07 +0200 From: Stefan Schlott To: Andy Cc: netdev@oss.sgi.com Subject: Re: Network protocols as modules Message-ID: <20000628082207.A587@blackbird.extern.uni-ulm.de> References: <3958BED5.43FE6C27@x0.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: <3958BED5.43FE6C27@x0.org>; from andy@x0.org on Tue, Jun 27, 2000 at 10:48:53AM -0400 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello Andy, I think separating the basic protocols (icmp/igmp, tcp, udp) from the ip module is not a good idea. > Cleaner code > Better maintainability True. That would certainly be an advantage. > Better extensibility Even with tcp, ... included in the ip module, you can write new network protocols as separate modules. > Modularity > Smaller initial kernel Well... who would want ip without tcp or udp? :-) icmp will be mandatory anyway, because otherwise you would break the ip functionality (think of "no route to host", "packet too big", ... error messages) > Higher efficiency I dare to doubt that. 
> Problems of putting TCP on some other layer or sticking another layer > between TCP and > IP. If TCP would be independent from IP, then this would be trivial. > IPV6 support is complicated because TCP has its tentacles inside IP > implementation. And imho there are some perfectly sound reasons to do so. Sure, the code is (to put it mildly) a bit cryptic, but kernel code is not an "objet d'art". The only thing that counts is performance. I am absolutely sure the kernel programmers could have done better than using ugly "goto" statements, inline assembler code, ... The only reason they did so is to improve performance. If you want a clean code, you will have to look at some "academic" OS (minix might be an example, though I have never seen its network code). If you want to write a "real world OS", you'll soon start counting the number of times a network packet is copied in memory; and that's what the current networking code tries to minimize. Please don't get me wrong - there are certainly situations where I wished the code would be a bit more modular, too (you should have heard me cursing when I saw how fragmentation is done in the ipv6 module ;-). I am no kernel hacker, but I think I figured out why the code is like it currently is. Stefan. -- *--- please cut here... -------------------------------------- thanks! ---* |-> E-Mail: stefan.schlott@student.uni-ulm.de DH-PGP-Key: 0x2F36F4FE <-| | Mary had a crypto key, she kept it in escrow, | | and everything that Mary said, the Feds were sure to know.
| | -- Sam Simpson, July 9, 1998 | *-------------------------------------------------------------------------* From owner-netdev@oss.sgi.com Wed Jun 28 05:28:52 2000 Received: by oss.sgi.com id ; Wed, 28 Jun 2000 05:28:12 -0700 Received: from gst.gst.com ([208.219.159.150]:17161 "EHLO gst.gst.com") by oss.sgi.com with ESMTP id ; Wed, 28 Jun 2000 05:27:46 -0700 Received: from x0.org ([208.219.159.214]) by gst.gst.com (8.8.8/8.8.8) with ESMTP id IAA12892 for ; Wed, 28 Jun 2000 08:38:59 -0400 (EDT) Message-ID: <3959EF4A.5E857970@x0.org> Date: Wed, 28 Jun 2000 08:27:54 -0400 From: Andy X-Mailer: Mozilla 4.73 [en] (X11; I; Linux 2.2.16 i586) X-Accept-Language: en MIME-Version: 1.0 To: "netdev@oss.sgi.com" Subject: Re: Network protocols as modules References: <3958BED5.43FE6C27@x0.org> <20000628082207.A587@blackbird.extern.uni-ulm.de> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > I think separating the basic protocols (icmp/igmp, tcp, udp) from the > ip module is not a good idea. Why not? You can still use the same memory area. I am not saying completely separate, just let us make TCP as a kernel module or maybe TCP and UDP. ICMP/IGMP would have to stay as a part of IP. > Even with tcp, ... included in the ip module, you can write new network > protocols as separate modules. Great. What if you want to replace existing protocol? Think about adding some security layer between IP and TCP or to replace TCP by something else. Who would need such a thing? Well, entertain me. > Well... who would want ip without tcp or udp? :-) > icmp will be mandatory anyway, because otherwise you would break the > ip functionality (think of "no route to host", "packet too big", ... > error messages) Bootp does not need TCP.
> Please don't get me wrong - there are certainly situations where I wished > the code would be a bit more modular, too (you should have heard me cursing > when I saw how fragmentation is done in the ipv6 module ;-). I am no kernel > hacker, but I think I figured out why the code is like it currently is. All I am saying is let us make a clear distinction of what is IP (and make kernel module out of it), what is TCP (maybe make separate module) and the rest. Andy
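The separation argued for in this thread comes down to the IP layer dispatching on the protocol number without knowing which transport sits behind it. The following is a user-space sketch of that idea only, not the kernel's actual interface: every name in it (proto_register, proto_unregister, ip_deliver, demo_tcp_input) is hypothetical, and it deliberately ignores the locking and unload races discussed earlier in the "modular net drivers" thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PROTO 256   /* the IP protocol field is 8 bits wide */

/* A transport's receive hook: returns bytes consumed or a negative error. */
typedef int (*proto_handler)(const uint8_t *payload, size_t len);

/* One slot per protocol number; NULL means "no transport loaded". */
static proto_handler handlers[MAX_PROTO];

/* Called when a transport module is loaded. Fails if the slot is taken. */
int proto_register(uint8_t proto, proto_handler h)
{
    if (h == NULL || handlers[proto] != NULL)
        return -1;
    handlers[proto] = h;
    return 0;
}

/* Called when a transport module is unloaded. */
int proto_unregister(uint8_t proto)
{
    if (handlers[proto] == NULL)
        return -1;
    handlers[proto] = NULL;
    return 0;
}

/* The IP layer hands the payload to whichever transport is registered. */
int ip_deliver(uint8_t proto, const uint8_t *payload, size_t len)
{
    proto_handler h = handlers[proto];
    if (h == NULL)
        return -1;  /* a real stack would answer with ICMP protocol unreachable */
    return h(payload, len);
}

/* Stand-in for a TCP input routine: it only reports the payload length. */
int demo_tcp_input(const uint8_t *payload, size_t len)
{
    (void)payload;
    return (int)len;
}
```

In this sketch a diskless client could run with only a UDP handler registered for bootp/tftp and register protocol 6 once a TCP module has been fetched.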