From owner-netdev@oss.sgi.com Sun Dec 2 19:00:18 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB330IN14535 for netdev-outgoing; Sun, 2 Dec 2001 19:00:18 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB3306o14530 for ; Sun, 2 Dec 2001 19:00:06 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id AAFCCC60F5; Mon, 3 Dec 2001 03:00:02 +0100 (CET) Date: Mon, 3 Dec 2001 03:00:02 +0100 From: bert hubert To: lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, kuznet@ms2.inr.ac.ru, hadi@cyberus.ca, netdev@oss.sgi.com Subject: CBQ and all other qdiscs now REALLY completely documented (almost!) Message-ID: <20011203030002.A20601@outpost.ds9a.nl> Mail-Followup-To: bert hubert , lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, kuznet@ms2.inr.ac.ru, hadi@cyberus.ca, netdev@oss.sgi.com References: <20011201013341.A23830@outpost.ds9a.nl> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20011201013341.A23830@outpost.ds9a.nl>; from ahu@ds9a.nl on Sat, Dec 01, 2001 at 01:33:41AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 3744 Lines: 105 On Sat, Dec 01, 2001 at 01:33:41AM +0100, bert hubert wrote: > One thing - does *anybody* understand how hash tables work in tc filter, and > what they do? Furthermore, I could use some help with the tc filter police > things. Thanks to Andreas Steinmetz and David Sauer, tc hash tables are now documented as well, thanks! See: http://ds9a.nl/2.4Routing/HOWTO//cvs/2.4routing/output/2.4routing-12.html And then 'Hashing filters for very fast massive filtering'. I also finished documenting all parameters for TBF, CBQ, SFQ, PRIO, bfifo, pfifo and pfifo_fast. All queues in the Linux kernel are now described in the Linux Advanced Routing & Shaping HOWTO, which can be found on http://ds9a.nl/2.4Routing I want to send this off to the LDP and Freshmeat somewhere next week, I *would really* like people who are knowledgeable about this subject (this means you, ANK & Jamal 8) ) to read through this. This HOWTO is rapidly becoming the perceived authoritative source for traffic control in linux (google on 'Linux Routing' finds it), it might as well be right! So if you have any time at all, check the parts you know about. I expect mistakes. The parts of the table of contents that document stuff in the kernel not documented elsewhere: 9. Queueing Disciplines for Bandwidth Management 9.1 Queues and Queueing Disciplines explained 9.2 Simple, classless Queueing Disciplines 9.2.1 pfifo_fast 9.2.1.1 Parameters & usage 9.2.2 Token Bucket Filter 9.2.2.1 Parameters & usage 9.2.2.2 Sample configuration 9.2.3 Stochastic Fairness Queueing 9.2.3.1 Parameters & usage 9.2.3.2 Sample configuration 9.3 Advice for when to use which queue 9.4 Classful Queueing Disciplines 9.4.1 Flow within classful qdiscs & classes 9.4.2 The qdisc family: roots, handles, siblings and parents 9.4.2.1 How filters are used to classify traffic 9.4.2.2 How packets are dequeued to the hardware 9.4.3 The PRIO qdisc 9.4.3.1 PRIO parameters & usage 9.4.3.2 Sample configuration 9.4.4 The famous CBQ qdisc 9.4.4.1 CBQ shaping in detail 9.4.4.2 CBQ classful behaviour 9.4.4.3 CBQ parameters that determine link sharing & borrowing 9.4.4.4 Sample configuration 9.4.4.5 Other CBQ parameters: split & defmap 9.4.5 Hierarchical Token Bucket 9.4.5.1 Sample configuration 9.5 Classifying packets with filters 9.5.1 Some simple filtering examples 9.5.2 All the filtering commands you will normally need (...) 12. Advanced filters for (re-)classifying packets 12.1 The "u32" classifier 12.1.1 U32 selector 12.1.2 General selectors 12.1.3 Specific selectors 12.2 The "route" classifier 12.3 Policing filters 12.4 Hashing filters for very fast massive filtering (...) 14. Advanced & less common queueing disciplines 14.1 bfifo/pfifo 14.1.1 Parameters & usage 14.2 Clark-Shenker-Zhang algorithm (CSZ) 14.3 DSMARK 14.3.1 Introduction 14.3.2 What is Dsmark related to? 14.3.3 Differentiated Services guidelines 14.3.4 Working with Dsmark 14.3.5 How SCH_DSMARK works. 14.3.6 TC_INDEX Filter 14.4 Ingress policer qdisc 14.5 Random Early Drop (RED) 14.6 VC/ATM emulation 14.7 Weighted Round Robin (WRR) The only thing left to document are Policing filters. Regards, bert hubert -- http://www.PowerDNS.com Versatile DNS Software & Services Trilab The Technology People Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet From owner-netdev@oss.sgi.com Sun Dec 2 19:13:52 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB33DqO14885 for netdev-outgoing; Sun, 2 Dec 2001 19:13:52 -0800 Received: from zeus.anet-chi.com (root@zeus.anet-chi.com [207.7.4.6]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB33Dno14882 for ; Sun, 2 Dec 2001 19:13:50 -0800 Received: from ipv16 (as1-98.chi.il.dial.anet.com [198.92.156.98]) by zeus.anet-chi.com (8.9.3/spamfix) with SMTP id UAA10871; Sun, 2 Dec 2001 20:13:01 -0600 (CST) Message-ID: <007701c17ba1$f177dcc0$a300a8c0@ipv16> From: "Jim Fleming" To: "bert hubert" , , , , , References: <20011201013341.A23830@outpost.ds9a.nl> <20011203030002.A20601@outpost.ds9a.nl> Subject: Re: [LARTC] CBQ and all other qdiscs now REALLY completely documented (almost!) Date: Sun, 2 Dec 2001 20:26:07 -0600 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.50.4133.2400 X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 241 Lines: 14 ----- Original Message ----- From: "bert hubert" > > The only thing left to document are Policing filters. > This may help... http://www.dot-biz.com/IPv4/Tutorial/ Jim Fleming http://www.IPv8.info IPv16....One Better !! From owner-netdev@oss.sgi.com Mon Dec 3 11:05:15 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB3J5Fv08654 for netdev-outgoing; Mon, 3 Dec 2001 11:05:15 -0800 Received: from smtpmail2.iol.cz (smtp.iol.cz [194.228.2.44]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB3J5Bo08648 for ; Mon, 3 Dec 2001 11:05:12 -0800 Received: from devix ([194.228.141.237]) by smtpmail2.iol.cz (InterMail vK.4.03.05.00 201-232-132 license d644f6ed01e70e5935170669e145ddd5) with ESMTP id <20011203092849.JQBG20904.smtpmail2@devix> for ; Mon, 3 Dec 2001 10:28:49 +0100 Received: from devik (helo=localhost) by devix with local-esmtp (Exim 3.16 #8) id 16ApOK-0000EX-00 for netdev@oss.sgi.com; Mon, 03 Dec 2001 10:28:00 +0100 Date: Mon, 3 Dec 2001 10:28:00 +0100 (CET) From: devik X-X-Sender: To: Subject: HTB qdisc improvements question Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 837 Lines: 29 Hello, I've written HTB queuing discipline for Linux. I'd like to ask net core developers for advise what should I fix/improve/change in order to be able to merge it someday. Patches and detailed manual are at http://luxik.cdi.cz/~devik/qos/htb/. Qdisc is in use since August and users found two bugs till now (both fixed). It uses existing tc architecture without exception and implements AFAIK all tc methods needed for qdisc and class operations. Implementation uses new algorithm (not yet published) but it conforms to Floyd's theory about link sharing, namely "formal sharing" approach. I don't expect future changes to the feature set only bugfixes and performance optimizations. So that I'd again like to ask net gurus here to review the code and tell me what could I do better. Thanks in advance, Martin Devera aka devik From owner-netdev@oss.sgi.com Mon Dec 3 22:31:38 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB46Vcm13814 for netdev-outgoing; Mon, 3 Dec 2001 22:31:38 -0800 Received: from noxmail.sandelman.ottawa.on.ca (cyphermail.sandelman.ottawa.on.ca [192.139.46.78]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB46VUo13800 for ; Mon, 3 Dec 2001 22:31:30 -0800 Received: from marajade.sandelman.ottawa.on.ca ([2002:c08b:2e21:2:204:76ff:fe2d:8c]) by noxmail.sandelman.ottawa.on.ca (8.11.6/8.11.6) with ESMTP id fB45V1K09325 (using TLSv1/SSLv3 with cipher EDH-RSA-DES-CBC3-SHA (168 bits) verified OK); Tue, 4 Dec 2001 00:31:04 -0500 (EST) Received: from marajade.sandelman.ottawa.on.ca (localhost [[UNIX: localhost]]) by marajade.sandelman.ottawa.on.ca (8.11.6/8.11.0) with ESMTP id fB45Q4D12164; Tue, 4 Dec 2001 00:26:08 -0500 (EST) Message-Id: <200112040526.fB45Q4D12164@marajade.sandelman.ottawa.on.ca> To: netdev@oss.sgi.com, design@lists.freeswan.org Subject: Re: what pointers does pskb_may_pull() nuke? In-reply-to: Your message of "Wed, 28 Nov 2001 00:32:45 GMT." Mime-Version: 1.0 (generated by tm-edit 7.108) Content-Type: text/plain; charset=US-ASCII Date: Tue, 04 Dec 2001 00:26:03 -0500 From: Michael Richardson Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1969 Lines: 61 -----BEGIN PGP SIGNED MESSAGE----- >>>>> "Julian" == Julian Anastasov writes: Julian> after 2.4.4 >> Is IPH_is_SKB_PULLED set by some other part of the system? Julian> I see it in your own code, not in kernel: lib/freeswan.h (1.91) Oops, my blush. I only grep'ed the code in net/ipsec :-) One thing is that we had mixed reports about the need to reassemble fragments. We realized that this is because netfilter will reassemble fragments for us. (My User-Mode-Linux tests had netfilter off) I thought that 2.2 had a specific option for turning this on. Maybe it was 2.0! Anyway, it is not clear to us what netfilter option is the minimum required to cause fragments to be reassembled prior to transport layer. Is just having netfilter enabled enough to do that? I'm asking so that we can properly diagnose the situation. We will be adding: #ifdef IP_FRAGMENT_ASSEMBLE /* In Linux 2.4.4, we may have to reassemble fragments. They are not assembled automatically to save TCP from having to copy twice. */ if (skb_is_nonlinear(skb)) { if (skb_linearize(skb, GFP_ATOMIC) != 0) { goto rcvleave; } } ipp = (struct iphdr *)skb->nh.iph; iphlen = ipp->ihl << 2; #endif we will do this after we have checked for COW on the SKB. ] ON HUMILITY: to err is human. To moo, bovine. | firewalls [ ] Michael Richardson, Sandelman Software Works, Ottawa, ON |net architect[ ] mcr@sandelman.ottawa.on.ca http://www.sandelman.ottawa.on.ca/ |device driver[ ] panic("Just another NetBSD/notebook using, kernel hacking, security guy"); [ -----BEGIN PGP SIGNATURE----- Version: 2.6.3ia Charset: latin1 Comment: Finger me for keys iQCVAwUBPAxeaYqHRg3pndX9AQGNOQQAw9bGwP0KVdN8y3bhJkJsU1xcAFXoyjc2 DqYtZDtY4CMP3R5URlH4hG2wWFB2Z+IIV/mdyVkmmLVzP+e8GBw/ZnVa9T9pFmHa vstQtnRVOdNTpgOswgBZfzI3l4VMScA0WAVYREZEsxU7NZVhLzbIUZH5O28EEzpC HfrUgs2ihhc= =BNOW -----END PGP SIGNATURE----- From owner-netdev@oss.sgi.com Tue Dec 4 03:18:16 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB4BIGE22461 for netdev-outgoing; Tue, 4 Dec 2001 03:18:16 -0800 Received: from l.himel.bg (IDENT:root@unamed.infotel.bg [212.39.68.18] (may be forged)) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB4BI5o22455 for ; Tue, 4 Dec 2001 03:18:06 -0800 Received: from linux.himel.bg (IDENT:ja@linux.himel.bg [127.0.0.1]) by l.himel.bg (8.9.3/8.9.3) with ESMTP id MAA06284; Tue, 4 Dec 2001 12:18:24 +0200 Date: Tue, 4 Dec 2001 12:18:24 +0200 (EET) From: Julian Anastasov X-X-Sender: To: Michael Richardson cc: , Subject: Re: what pointers does pskb_may_pull() nuke? In-Reply-To: <200112040526.fB45Q4D12164@marajade.sandelman.ottawa.on.ca> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 2625 Lines: 74 Hello, On Tue, 4 Dec 2001, Michael Richardson wrote: > One thing is that we had mixed reports about the need to reassemble > fragments. We realized that this is because netfilter will reassemble > fragments for us. > (My User-Mode-Linux tests had netfilter off) Yes, this is true. But this reassembling will disappear soon and will be moved down to each its user. I.e. you have to do it yourself. Using the 2 lines is simpler but slower if you receive only fragments. You can use these 2 lines of code before you design how to read the data from each skb in frag_list (may be checksum + read?!?). > I thought that 2.2 had a specific option for turning this on. Maybe it > was 2.0! Yes, in 2.2 ip_defrag performed the reassembling. Not in 2.4. More specifically, the linearization of the data. > Anyway, it is not clear to us what netfilter option is the minimum required > to cause fragments to be reassembled prior to transport layer. Is just having > netfilter enabled enough to do that? I'm asking so that we can properly > diagnose the situation. It is not the Netfilter who calls ip_defrag, it was called from the local delivery code (of course, the conntracking can do it at prerouting). With the current kernels you are ok, as long as, the patch from Rusty is not applied: http://marc.theaimsgroup.com/?l=linux-netdev&m=100432944806885&w=2 see the change in net/core/netfilter.c. There are 5 lines that will disappear soon. So, be warned. This is the reason we to talk for using these 2 lines. TCP is just ready for these changes (no surprises here). > We will be adding: > > #ifdef IP_FRAGMENT_ASSEMBLE > /* In Linux 2.4.4, we may have to reassemble fragments. They are > not assembled automatically to save TCP from having to copy > twice. They are reassembled in sense of proper ordering but the data is spread on many skbs in ->frag_list. I.e. ip_defrag avoids data copy. If you need to checksum the data then you have to additionally linearize it. > */ > if (skb_is_nonlinear(skb)) { > if (skb_linearize(skb, GFP_ATOMIC) != 0) { > goto rcvleave; > } > } > ipp = (struct iphdr *)skb->nh.iph; > iphlen = ipp->ihl << 2; > #endif > > we will do this after we have checked for COW on the SKB. Instead of cow may be you have to look for another function (but it will be 2.4 specific). skb_cow copies data, although valid for 2.2, you must be sure that you need to copy the skb data (if you need write access) in 2.4. The other variant is to use function that expands the header if only that is the goal. Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Tue Dec 4 13:20:03 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB4LK3N22740 for netdev-outgoing; Tue, 4 Dec 2001 13:20:03 -0800 Received: from ki.yok.utu.fi (ki.yok.utu.fi [130.232.129.100]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB4LJvo22735 for ; Tue, 4 Dec 2001 13:19:57 -0800 Received: by ki.yok.utu.fi (Postfix, from userid 1000) id 170986B665; Tue, 4 Dec 2001 22:19:48 +0200 (EET) From: Tommi Virtanen To: netdev@oss.sgi.com Subject: [PATCH] 2.4.16 won't compile without TCP/IP Date: 04 Dec 2001 22:19:48 +0200 Message-ID: <87itbmd9gb.fsf@ki.yok.utu.fi> User-Agent: Gnus/5.0808 (Gnus v5.8.8) XEmacs/21.4 (Artificial Intelligence) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1422 Lines: 52 Disabling CONFIG_INET makes 2.4.16 fail at compile due to include/net/tcp_ecn.h using macros INET_ECN_xmit and INET_ECN_xmit. They try to access the af_inet member of a union #define INET_ECN_xmit(sk) do { (sk)->protinfo.af_inet.tos |= 2; } while (0) #define INET_ECN_dontxmit(sk) do { (sk)->protinfo.af_inet.tos &= ~3; } while (0) but without CONFIG_INET, the union has no af_inet field. Here's a patch that disables the whole header if CONFIG_INET is not set. You may prefer some other approach; please just fix the symptoms :) --- linux-2.4.16.orig/include/net/tcp_ecn.h Sat Nov 3 03:43:26 2001 +++ linux-2.4.16/include/net/tcp_ecn.h Tue Dec 4 21:54:34 2001 @@ -1,6 +1,8 @@ #ifndef _NET_TCP_ECN_H_ #define _NET_TCP_ECN_H_ 1 +#ifdef CONFIG_INET + #include #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH)) @@ -128,5 +130,7 @@ if (sysctl_tcp_ecn && th->ece && th->cwr) req->ecn_ok = 1; } + +#endif #endif And just to be sure, here's a minimal config that triggers the failure. Everything else is unset. CONFIG_X86=y CONFIG_ISA=y CONFIG_UID16=y CONFIG_M386=y CONFIG_X86_L1_CACHE_SHIFT=4 CONFIG_RWSEM_GENERIC_SPINLOCK=y CONFIG_NOHIGHMEM=y CONFIG_NET=y CONFIG_BINFMT_ELF=y CONFIG_MSDOS_PARTITION=y -- tv@{{hq.yok.utu,havoc,gaeshido}.fi,{debian,wanderer}.org,stonesoft.com} double a,b=4,c;main(){for(;++a<2e6;c-=(b=-b)/a++);printf("%f\n",c);} From owner-netdev@oss.sgi.com Thu Dec 6 09:30:42 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB6HUg021538 for netdev-outgoing; Thu, 6 Dec 2001 09:30:42 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB6HUco21535 for ; Thu, 6 Dec 2001 09:30:39 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id 64DD1C6560; Thu, 6 Dec 2001 17:30:35 +0100 (CET) Date: Thu, 6 Dec 2001 17:30:35 +0100 From: bert hubert To: netdev@oss.sgi.com Cc: hadi@cyberus.ca Subject: GRED documentation Message-ID: <20011206173035.A7634@outpost.ds9a.nl> Mail-Followup-To: bert hubert , netdev@oss.sgi.com, hadi@cyberus.ca Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 737 Lines: 22 People, I have now documented pretty much every qdisc under the sun, but I could do with some help on GRED. If anybody has a description on how it works, please let me know. In the meantime, I would again ask specifically Alexey and Jamal to spend some time on http://ds9a.nl/lartc - I tried to document your wonderful kernel work but I may have been wrong. Interest in queueing discplines is really taking off, I desperately want my documentation to be *right*. Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services Trilab The Technology People Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet From owner-netdev@oss.sgi.com Thu Dec 6 13:43:45 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB6LhjG31825 for netdev-outgoing; Thu, 6 Dec 2001 13:43:45 -0800 Received: from fencepost.gnu.org (we-refuse-to-spy-on-our-users@fencepost.gnu.org [199.232.76.164]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB6Lhbo31822 for ; Thu, 6 Dec 2001 13:43:38 -0800 Received: from buytenh by fencepost.gnu.org with local (Exim 3.22 #1 (Debian)) id 16C5Mk-0001X3-00; Thu, 06 Dec 2001 15:43:34 -0500 Date: Thu, 6 Dec 2001 15:43:34 -0500 From: Lennert Buytenhek To: netfilter-devel@lists.gnumonks.org, netdev@oss.sgi.com Cc: bridge@math.leidenuniv.nl Subject: [RFC] bridge-netfilter patch 0.0.4pre1 available Message-ID: <20011206154334.B3632@gnu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 3098 Lines: 65 (please CC on replies, I'm not on netfilter-devel or netdev) Hi, Version 0.0.4pre1 of the bridge-netfilter patch is available. There's a bunch of loose ends to clean up, but in its currently form it's mostly done. It consists of an extra file in net/bridge/ plus a number of miscellaneous modifications to the rest of the tree, some of which controversial. I will discuss the 'intrusive' modifications briefly. 1. Add a threshhold hook macro, NF_HOOK_THRESH, which only calls hooks that have elem->priority >= specified_threshold. nf_hook_slow is extended with an argument for passing this threshold. This is necessary for stealing packets from a hook chain and reinjecting them later on, which the bridge-netfilter stuff needs. At LK2001 there was a suggestion to use QUEUE for this, but I can't see how this can be used cleanly, as only one queue handler per protocol family can be registered, and this would conflict with the existing use of QUEUE. 2. Add members ->physindev and ->physoutdev to struct sk_buff. This is necessary for 'interface transparency'; the ability to filter on enslaved devices in iptables rules transparently. For example, if eth0 is enslaved to br0, and a packet comes in from eth0, destined for the local machine, iptables -A INPUT -i eth0 -j DROP would drop the packet if you have interface transparency. It's easy to see that in this case, you need to keep at least one extra variable with the sk_buff to make the mentioned rule work. In the case of a locally originated packet, you also need at least one extra member. In the case of an IP-forwarded packet with both source and destination interfaces being bridge interfaces (sounds somewhat artificial, but there actually are such setups), you need two. 3. Copy hardware header upon refragmentation. Without bridge-nf, in the case of a locally originated or forwarded packet, refragmentation is done in IP/POST_ROUTING (ip_refrag), while neighbour resolution is done after IP/POST_ROUTING. bridge-nf needs to move the refragmentation to BR/POST_ROUTING, which is after neighbour resolution, so we effectively switch the order of the two. Refragmentation doesn't copy hardware headers, so this causes us to send out packets with corrupted hardware headers. (1) is slightly messy because it adds yet another argument to a function that already takes 6 arguments, but otherwise it's not really a big deal. (2) touches on something that has been sparking holy wars for ages. If you want interface transparency, you need these members. No further comment. (3) is ugly, with the alternatives being even more ugly. The bridge-nf patch has lived outside the tree for a while, and some educated guessing, based on the prevailing attitude I have experienced at LK2001, says that (2) (and perhaps (3) as well) will prevent it from being accepted into the tree at all. It's likely to end up in netfilter patch-o-matic. Currently it is available at: http://bridge.sourceforge.net/devel/bridge-nf/ cheers, Lennert From owner-netdev@oss.sgi.com Thu Dec 6 14:21:38 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB6MLcM00360 for netdev-outgoing; Thu, 6 Dec 2001 14:21:38 -0800 Received: from grok.yi.org (IDENT:tgUqUkUT+kq1Aq6he5ysSCNNpjnjk2az@cx97923-a.phnx3.az.home.com [24.1.197.194]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB6MLWo00355 for ; Thu, 6 Dec 2001 14:21:32 -0800 Received: from candelatech.com (IDENT:DEvWW3bFk/znbSx8z87d1m391QzC73Wb@localhost.localdomain [127.0.0.1]) by grok.yi.org (8.11.6/8.11.2) with ESMTP id fB6LLP105047; Thu, 6 Dec 2001 14:21:25 -0700 Message-ID: <3C0FE155.9010303@candelatech.com> Date: Thu, 06 Dec 2001 14:21:25 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20011019 Netscape6/6.2 X-Accept-Language: en-us MIME-Version: 1.0 To: Lennert Buytenhek CC: netfilter-devel@lists.gnumonks.org, netdev@oss.sgi.com, bridge@math.leidenuniv.nl Subject: Re: [RFC] bridge-netfilter patch 0.0.4pre1 available References: <20011206154334.B3632@gnu.org> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 2241 Lines: 52 Lennert Buytenhek wrote: > (please CC on replies, I'm not on netfilter-devel or netdev) > > > Hi, > > Version 0.0.4pre1 of the bridge-netfilter patch is available. There's a > bunch of loose ends to clean up, but in its currently form it's mostly done. > It consists of an extra file in net/bridge/ plus a number of miscellaneous > modifications to the rest of the tree, some of which controversial. I will > discuss the 'intrusive' modifications briefly. > > 1. Add a threshhold hook macro, NF_HOOK_THRESH, which only calls hooks > that have elem->priority >= specified_threshold. nf_hook_slow is extended > with an argument for passing this threshold. This is necessary for > stealing packets from a hook chain and reinjecting them later on, which > the bridge-netfilter stuff needs. At LK2001 there was a suggestion to > use QUEUE for this, but I can't see how this can be used cleanly, as only > one queue handler per protocol family can be registered, and this would > conflict with the existing use of QUEUE. > > 2. Add members ->physindev and ->physoutdev to struct sk_buff. This is > necessary for 'interface transparency'; the ability to filter on enslaved > devices in iptables rules transparently. For example, if eth0 is enslaved > to br0, and a packet comes in from eth0, destined for the local machine, > > iptables -A INPUT -i eth0 -j DROP > > would drop the packet if you have interface transparency. It's easy to > see that in this case, you need to keep at least one extra variable with > the sk_buff to make the mentioned rule work. In the case of a locally > originated packet, you also need at least one extra member. In the case > of an IP-forwarded packet with both source and destination interfaces > being bridge interfaces (sounds somewhat artificial, but there actually > are such setups), you need two. Does this scheme still work if you go: eth0 -> vlan5 -> br0 (Does vlan5 or eth0 count as the physindev?) Ben -- Ben Greear President of Candela Technologies Inc http://www.candelatech.com ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Thu Dec 6 14:52:20 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB6MqKe01424 for netdev-outgoing; Thu, 6 Dec 2001 14:52:20 -0800 Received: from fencepost.gnu.org (we-refuse-to-spy-on-our-users@fencepost.gnu.org [199.232.76.164]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB6MqGo01419 for ; Thu, 6 Dec 2001 14:52:16 -0800 Received: from buytenh by fencepost.gnu.org with local (Exim 3.22 #1 (Debian)) id 16C6R8-00057F-00; Thu, 06 Dec 2001 16:52:10 -0500 Date: Thu, 6 Dec 2001 16:52:10 -0500 From: Lennert Buytenhek To: Ben Greear Cc: netfilter-devel@lists.samba.org, netdev@oss.sgi.com, bridge@math.leidenuniv.nl Subject: Re: [Bridge] Re: [RFC] bridge-netfilter patch 0.0.4pre1 available Message-ID: <20011206165209.A18646@gnu.org> References: <20011206154334.B3632@gnu.org> <3C0FE155.9010303@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <3C0FE155.9010303@candelatech.com>; from greearb@candelatech.com on Thu, Dec 06, 2001 at 02:21:25PM -0700 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1465 Lines: 31 On Thu, Dec 06, 2001 at 02:21:25PM -0700, Ben Greear wrote: > > 2. Add members ->physindev and ->physoutdev to struct sk_buff. This is > > necessary for 'interface transparency'; the ability to filter on enslaved > > devices in iptables rules transparently. For example, if eth0 is enslaved > > to br0, and a packet comes in from eth0, destined for the local machine, > > > > iptables -A INPUT -i eth0 -j DROP > > > > would drop the packet if you have interface transparency. It's easy to > > see that in this case, you need to keep at least one extra variable with > > the sk_buff to make the mentioned rule work. In the case of a locally > > originated packet, you also need at least one extra member. In the case > > of an IP-forwarded packet with both source and destination interfaces > > being bridge interfaces (sounds somewhat artificial, but there actually > > are such setups), you need two. > > Does this scheme still work if you go: eth0 -> vlan5 -> br0 > (Does vlan5 or eth0 count as the physindev?) I'm not familiar with how your vlan stuff works.. is 'vlan5' a kind of bridge device in itself? Or is it just tagged VLAN 5 over eth0? Currently, the bridge-nf patch uses as physindev skb->dev from-when-the- packet-was-passed-to-the-bridge-code in net_rx_action. So, it all depends on which device you enslaved to br0. In the above scenario, it would look like 'vlan5' is the one. cheers, Lennert From owner-netdev@oss.sgi.com Thu Dec 6 15:08:25 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB6N8P701880 for netdev-outgoing; Thu, 6 Dec 2001 15:08:25 -0800 Received: from grok.yi.org (IDENT:+HS3xo7AiOtWgj1QU8r1m7XR9fC01WvP@cx97923-a.phnx3.az.home.com [24.1.197.194]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB6N8Jo01876 for ; Thu, 6 Dec 2001 15:08:19 -0800 Received: from candelatech.com (IDENT:/qx5zBvIl4LIAn0aIU7h+2sYmfxdVULM@localhost.localdomain [127.0.0.1]) by grok.yi.org (8.11.6/8.11.2) with ESMTP id fB6M7C105285; Thu, 6 Dec 2001 15:07:42 -0700 Message-ID: <3C0FEC0F.8090008@candelatech.com> Date: Thu, 06 Dec 2001 15:07:11 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20011019 Netscape6/6.2 X-Accept-Language: en-us MIME-Version: 1.0 To: Lennert Buytenhek CC: netfilter-devel@lists.samba.org, netdev@oss.sgi.com, bridge@math.leidenuniv.nl Subject: Re: [Bridge] Re: [RFC] bridge-netfilter patch 0.0.4pre1 available References: <20011206154334.B3632@gnu.org> <3C0FE155.9010303@candelatech.com> <20011206165209.A18646@gnu.org> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 2143 Lines: 63 Lennert Buytenhek wrote: > On Thu, Dec 06, 2001 at 02:21:25PM -0700, Ben Greear wrote: > > >>>2. Add members ->physindev and ->physoutdev to struct sk_buff. This is >>> necessary for 'interface transparency'; the ability to filter on enslaved >>> devices in iptables rules transparently. For example, if eth0 is enslaved >>> to br0, and a packet comes in from eth0, destined for the local machine, >>> >>> iptables -A INPUT -i eth0 -j DROP >>> >>> would drop the packet if you have interface transparency. It's easy to >>> see that in this case, you need to keep at least one extra variable with >>> the sk_buff to make the mentioned rule work. In the case of a locally >>> originated packet, you also need at least one extra member. In the case >>> of an IP-forwarded packet with both source and destination interfaces >>> being bridge interfaces (sounds somewhat artificial, but there actually >>> are such setups), you need two. >>> >>Does this scheme still work if you go: eth0 -> vlan5 -> br0 >>(Does vlan5 or eth0 count as the physindev?) >> > > I'm not familiar with how your vlan stuff works.. is 'vlan5' a kind of > bridge device in itself? Or is it just tagged VLAN 5 over eth0? It definately isn't a bridge...much more like the latter. For that reason, it's good that your bridging stuff works :) > > Currently, the bridge-nf patch uses as physindev skb->dev from-when-the- > packet-was-passed-to-the-bridge-code in net_rx_action. So, it all depends on > which device you enslaved to br0. In the above scenario, it would look > like 'vlan5' is the one. Good. I think that is probably the right way for things to work because you can not have more than one ethernet device feed a particular vlan device...so for firewalling reasons there should be little reason to distinguish the physical (eth0) device while using VLANs... Ben > > > cheers, > Lennert > > -- Ben Greear President of Candela Technologies Inc http://www.candelatech.com ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Fri Dec 7 15:35:39 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB7NZdU27494 for netdev-outgoing; Fri, 7 Dec 2001 15:35:39 -0800 Received: from noxmail.sandelman.ottawa.on.ca (cyphermail.sandelman.ottawa.on.ca [192.139.46.78]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB7NZMo27478 for ; Fri, 7 Dec 2001 15:35:23 -0800 Received: from marajade.sandelman.ottawa.on.ca ([204.120.52.27]) by noxmail.sandelman.ottawa.on.ca (8.11.6/8.11.6) with ESMTP id fB7IT0a17472 (using TLSv1/SSLv3 with cipher EDH-RSA-DES-CBC3-SHA (168 bits) verified OK); Fri, 7 Dec 2001 13:41:17 -0500 (EST) Received: from marajade.sandelman.ottawa.on.ca (localhost [[UNIX: localhost]]) by marajade.sandelman.ottawa.on.ca (8.11.6/8.11.0) with ESMTP id fB5L9Ol00806; Wed, 5 Dec 2001 16:09:27 -0500 (EST) Message-Id: <200112052109.fB5L9Ol00806@marajade.sandelman.ottawa.on.ca> To: Julian Anastasov cc: netdev@oss.sgi.com, design@lists.freeswan.org Subject: Re: what pointers does pskb_may_pull() nuke? In-reply-to: Your message of "Tue, 04 Dec 2001 12:18:24 +0200." Mime-Version: 1.0 (generated by tm-edit 7.108) Content-Type: text/plain; charset=US-ASCII Date: Wed, 05 Dec 2001 16:09:24 -0500 From: Michael Richardson Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 5758 Lines: 144 -----BEGIN PGP SIGNED MESSAGE----- >>>>> "Julian" == Julian Anastasov writes: Julian> Hello, Thank you kindly for your continuing help. Julian> On Tue, 4 Dec 2001, Michael Richardson wrote: >> One thing is that we had mixed reports about the need to reassemble >> fragments. We realized that this is because netfilter will reassemble >> fragments for us. >> (My User-Mode-Linux tests had netfilter off) Julian> Yes, this is true. But this reassembling will disappear soon "soon" means 2.5/2.6, I hope. Not 2.4 series. Julian> and will be moved down to each its user. I.e. you have to do it Julian> yourself. Using the 2 lines is simpler but slower if you receive only Julian> fragments. You can use these 2 lines of code before you design how Julian> to read the data from each skb in frag_list (may be checksum + read?!?). When various systems screw up PMTU, (outside of our control, we have re-enabled generation of ICMP messages after turning it off a couple of releases ago) we get lots of 1500 byte packets that we need to prepend 48 bytes to (IP + ESP) header. So, we wind up fragmenting the outgoing ESP. (We do not copy DF to the outer header at this time, since we can not cope with the unauthenticated ICMPs that would result). So, we get ESP fragments to reassemble. Once nice thing about seeing the fragments is that we might decide to implement a proposal for doing authenticated PMTU for tunnels by observing the size of the largest fragment. >> Anyway, it is not clear to us what netfilter option is the minimum required >> to cause fragments to be reassembled prior to transport layer. Is just having >> netfilter enabled enough to do that? I'm asking so that we can properly >> diagnose the situation. Julian> It is not the Netfilter who calls ip_defrag, it was called Julian> from the local delivery code (of course, the conntracking can do it Julian> at prerouting). With the current kernels you are ok, as long as, the I think you are missing what I am saying. Yes, it was called in the local delivery code, but is no longer. Our experience is that some aspect of compiling netfilter into a kernel (not necessarily with conntrack on) causes ip_defrag() to get called before local delivery. Julian> will disappear soon. So, be warned. This is the reason we to talk Julian> for using these 2 lines. TCP is just ready for these changes (no Julian> surprises here). It would have been better if the transport protocol registration code had taken a flag (that TCP would set) that said "I can handle fragments". What has happened is that you have changed the interface between network and transport code unilaterally causing a subtle incompatibility. This causes us to add code to compensate. This is BAD. I would be happy to write a patch that implemented the above. >> We will be adding: >> >> #ifdef IP_FRAGMENT_ASSEMBLE >> /* In Linux 2.4.4, we may have to reassemble fragments. They are >> not assembled automatically to save TCP from having to copy >> twice. Julian> They are reassembled in sense of proper ordering but the Julian> data is spread on many skbs in ->frag_list. I.e. ip_defrag avoids Julian> data copy. If you need to checksum the data then you have to Julian> additionally linearize it. We need to decrypt it. We need it linear. Yes, defragment is the wrong term if you like. Linearize is better. >> we will do this after we have checked for COW on the SKB. Julian> Instead of cow may be you have to look for another Julian> function (but it will be 2.4 specific). skb_cow copies data, Julian> although valid for 2.2, you must be sure that you need to copy Julian> the skb data (if you need write access) in 2.4. The other variant Julian> is to use function that expands the header if only that is the goal. We need to write (we decrypt in place). We do: #ifdef NET_21 /* if skb was cloned (most likely due to a packet sniffer such as tcpdump being momentarily attached to the interface), make a copy of our own to modify */ if(skb_cloned(skb)) { /* include any mac header while copying.. */ if(skb_headroom(skb) < hard_header_len) { printk(KERN_WARNING "klips_error:ipsec_rcv: " "tried to skb_push hhlen=%d, %d available. This should never happen, please report.\n", hard_header_len, skb_headroom(skb)); goto rcvleave; } skb_push(skb, hard_header_len); if #ifdef SKB_COW_NEW (skb_cow(skb, skb_headroom(skb)) != 0) #else /* SKB_COW_NEW */ ((skb = skb_cow(skb, skb_headroom(skb))) == NULL) #endif /* SKB_COW_NEW */ { goto rcvleave; } if(skb->len < hard_header_len) { printk(KERN_WARNING "klips_error:ipsec_rcv: " "tried to skb_pull hhlen=%d, %d available. This should never happen, please report.\n", hard_header_len, skb->len); goto rcvleave; } skb_pull(skb, hard_header_len); } #endif /* NET_21 */ ] ON HUMILITY: to err is human. To moo, bovine. | firewalls [ ] Michael Richardson, Sandelman Software Works, Ottawa, ON |net architect[ ] mcr@sandelman.ottawa.on.ca http://www.sandelman.ottawa.on.ca/ |device driver[ ] panic("Just another NetBSD/notebook using, kernel hacking, security guy"); [ -----BEGIN PGP SIGNATURE----- Version: 2.6.3ia Charset: latin1 Comment: Finger me for keys iQCVAwUBPA6NAYqHRg3pndX9AQEMkAP8D5HTKx83nXq6MEJQILpevuEo0El7neEZ hipYoDS1KYz75lhiooBTzcVjAv/bNUTmC8BNiBUBCAyWrAGObPXL/eICsgltAPTT DjWXanPldzxKUEiXS+syXEPz619svfB4XRPWtVafUPgEiXPdlfqdeZ91C0GeLqA4 mMvbbPVa3LY= =j8YZ -----END PGP SIGNATURE----- From owner-netdev@oss.sgi.com Fri Dec 7 16:19:02 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB80J2M28767 for netdev-outgoing; Fri, 7 Dec 2001 16:19:02 -0800 Received: from u.domain.uli (ja.mac.ssi.bg [212.95.166.194]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB80Iqo28764 for ; Fri, 7 Dec 2001 16:18:52 -0800 Received: from localhost (IDENT:ja@localhost [127.0.0.1]) by u.domain.uli (8.11.0/8.11.0) with ESMTP id fB81Jrw01173; Sat, 8 Dec 2001 01:19:53 GMT Date: Sat, 8 Dec 2001 01:19:53 +0000 (GMT) From: Julian Anastasov X-X-Sender: To: Michael Richardson cc: , Subject: Re: what pointers does pskb_may_pull() nuke? In-Reply-To: <200112052109.fB5L9Ol00806@marajade.sandelman.ottawa.on.ca> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 3554 Lines: 91 Hello, On Wed, 5 Dec 2001, Michael Richardson wrote: > Julian> Yes, this is true. But this reassembling will disappear soon > > "soon" means 2.5/2.6, I hope. Not 2.4 series. I don't know, there are other people that take the decision. It seems Netfilter is the last thing that stops this change. > When various systems screw up PMTU, (outside of our control, we have > re-enabled generation of ICMP messages after turning it off a couple of > releases ago) we get lots of 1500 byte packets that we need to prepend 48 > bytes to (IP + ESP) header. So, we wind up fragmenting the outgoing ESP. > > (We do not copy DF to the outer header at this time, since we can not cope > with the unauthenticated ICMPs that would result). > > So, we get ESP fragments to reassemble. > > Once nice thing about seeing the fragments is that we might decide to > implement a proposal for doing authenticated PMTU for tunnels by observing > the size of the largest fragment. Yes, it is interesting but ... not in my area > I think you are missing what I am saying. > Yes, it was called in the local delivery code, but is no longer. > Our experience is that some aspect of compiling netfilter into a kernel > (not necessarily with conntrack on) causes ip_defrag() to get called before > local delivery. I don't see where is the problem. There is always someone that calls ip_defrag before the protocols, the last one is ip_input.c:ip_local_deliver() if it is still not called. So, remains the issue with the linearization and the checksums. > Julian> will disappear soon. So, be warned. This is the reason we to talk > Julian> for using these 2 lines. TCP is just ready for these changes (no > Julian> surprises here). > > It would have been better if the transport protocol registration code had > taken a flag (that TCP would set) that said "I can handle fragments". Hm, there is such flag for IP but for the IP protocols I don't see such flag, can't comment here. It seems all protocols in the kernel already know what to do. > What has happened is that you have changed the interface between network > and transport code unilaterally causing a subtle incompatibility. This causes > us to add code to compensate. This is BAD. Not me but I see that some changes are for good, it is difficult to follow them in some places but you can always return to the old handling with some calls. My question to all: when the checksums will be loosed? The same function in netfilter.c still touches the checksum fields. So, make sure you deal with checksums too (probably after the linearization). > I would be happy to write a patch that implemented the above. you can clear this issue with the maintainers ... > We need to decrypt it. We need it linear. > Yes, defragment is the wrong term if you like. Linearize is better. > > >> we will do this after we have checked for COW on the SKB. > > Julian> Instead of cow may be you have to look for another > Julian> function (but it will be 2.4 specific). skb_cow copies data, > Julian> although valid for 2.2, you must be sure that you need to copy > Julian> the skb data (if you need write access) in 2.4. The other variant > Julian> is to use function that expands the header if only that is the goal. > > We need to write (we decrypt in place). OK, you need write access to the first fragment. It seems the code is same in 2.2 and 2.4 for this but I'm not sure. > We do: > > #ifdef NET_21 ... Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Fri Dec 7 18:06:54 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB826s630545 for netdev-outgoing; Fri, 7 Dec 2001 18:06:54 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB826oo30542 for ; Fri, 7 Dec 2001 18:06:50 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id 5D287C6966; Sat, 8 Dec 2001 02:06:47 +0100 (CET) Date: Sat, 8 Dec 2001 02:06:47 +0100 From: bert hubert To: netdev@oss.sgi.com Cc: kuznet@ms2.inr.ac.ru Subject: First shot at a tc/qdisc manpage Message-ID: <20011208020647.A26124@outpost.ds9a.nl> Mail-Followup-To: bert hubert , netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 618 Lines: 20 Please critique, not sure if everything is correct. Note that many statements do not yet apply to HTB, but Devik is working on that. Once done, this is donated to Alexey for inclusion. http://ds9a.nl/lartc/tc.8 http://ds9a.nl/lartc/tc.txt (looks crappy in webbrowser) http://ds9a.nl/lartc/tc.pdf The referenced tc-*.8 manpages do not exist yet. Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services Trilab The Technology People Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet From owner-netdev@oss.sgi.com Sat Dec 8 12:24:36 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB8KOas17770 for netdev-outgoing; Sat, 8 Dec 2001 12:24:36 -0800 Received: from shell.cyberus.ca (shell.cyberus.ca [216.191.240.114]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB8KOKo17762 for ; Sat, 8 Dec 2001 12:24:20 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id OAA05032; Sat, 8 Dec 2001 14:20:20 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sat, 8 Dec 2001 14:20:20 -0500 (EST) From: jamal To: bert hubert cc: , , , Subject: Re: CBQ and all other qdiscs now REALLY completely documented (almost!) In-Reply-To: <20011203030002.A20601@outpost.ds9a.nl> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 6463 Lines: 167 On Mon, 3 Dec 2001, bert hubert wrote: > On Sat, Dec 01, 2001 at 01:33:41AM +0100, bert hubert wrote: > > > One thing - does *anybody* understand how hash tables work in tc filter, and > > what they do? Furthermore, I could use some help with the tc filter police > > things. > > Thanks to Andreas Steinmetz and David Sauer, tc hash tables are now > documented as well, thanks! > > See: > > http://ds9a.nl/2.4Routing/HOWTO//cvs/2.4routing/output/2.4routing-12.html > > And then 'Hashing filters for very fast massive filtering'. > > I also finished documenting all parameters for TBF, CBQ, SFQ, PRIO, > bfifo, pfifo and pfifo_fast. All queues in the Linux kernel are now > described in the Linux Advanced Routing & Shaping HOWTO, which can be found on > > http://ds9a.nl/2.4Routing > > I want to send this off to the LDP and Freshmeat somewhere next week, I > *would really* like people who are knowledgeable about this subject (this > means you, ANK & Jamal 8) ) to read through this. > > This HOWTO is rapidly becoming the perceived authoritative source for > traffic control in linux (google on 'Linux Routing' finds it), it might as > well be right! So if you have any time at all, check the parts you know > about. I expect mistakes. > > The parts of the table of contents that document stuff in the kernel not > documented elsewhere: "not documented elsewhere" comes out rude. Werner and I (and even Alexey when he was in the mood -- and i have seen some good documentation by other people as well) have spent numerous hours documenting, presenting and answering questions on mailing lists at times Sample docs that i was personally involved in: ftp://icaftp.epfl.ch/pub/linux/diffserv/misc/dsid-01.txt.gz You need to introduce the big picture to the user. and what is wrong with the definitions used in http://www.davin.ottawa.on.ca/ols/img10.htm that forced you to introduce your own? Actually, the big picture is: http://www.davin.ottawa.on.ca/ols/img9.htm Also http://www.linuxjournal.com/article.php?sid=3369 (was written in 98 but got published in 99) Now despite all the bitching above, i think your efforts are noble. [My complaints about your style is you often are trying to present facts by using opinions. For example despite a lot of effort in the past to explain ingress qdisc to you in the past and, pointing you to very good documentation from CISCO you still ended using your opinions on what you thought it should be;-> My scanning of the document shows opinions still posing as miscontrued facts. It is improving compared to what i saw last when we discussed ingress. Let me clarify one thing in this email; i'll read what you have later. Lets start by your description of TC_PRIO and TOS mappings etc: Your descriptions of these values is insufficient. Consider this a tutorial and reword it as you wish but please avoid opinions. Ok here's clarification, this applies to both prio, default fifo 3 band queueing and CBQ defaultmap classification; applies to both packets being forwarded as well as locally generated: First Step: =========== Define TOS: This is a 4 bit value used as defined in RFC 1349. 0 1 2 3 4 5 6 7 +-----+-----+-----+-----+-----+-----+-----+-----+ | | | | | PRECEDENCE | TOS | MBZ | | | | | +-----+-----+-----+-----+-----+-----+-----+-----+ Then define the values possible as: 1000 -- minimize delay 0100 -- maximize throughput 0010 -- maximize reliability 0001 -- minimize monetary cost 0000 -- normal service Look at RFC 1349 for typical values used by different applications Then of course note that RFC 1349 is obsoleted by RFC 2474 (yes, you can weep); Having said all that: Linux remaps packets incoming with different values to some internal value; the colum "mapped to" shows the internal mapping 8value(hex) TOS(dec) mapped to(dec) ---------------------------------- 0x0 0 0 1 7 2 0 3 0 4 2 5 2 6 2 7 2 0x10 8 6 9 6 10 6 11 6 12 2 13 2 14 2 15 2 Fill in the "8value(hex)" column gaps using the bitmap from RFC1349 for the 8 bits; These are the values ou would see with tcpdump -vvv I filled the two easiest ones i could compute in my head. Second step: Take the default priority map: 1, 2, 2, 2, 1, 2, 0, 0 , 1, 1, 1, 1, 1, 1, 1, 1 This applies for both default prio and the 3-band FIFO queue. Note the queue map fitted on the last column 8 but value TOS mapped to queue map --------------------------------------------- 0x0 0 0 1 1 7 2 2 0 2 3 0 2 4 2 1 5 2 2 6 2 0 7 2 0 0x10 8 6 1 9 6 1 10 6 1 11 6 1 12 2 1 13 2 1 14 2 1 15 2 1 Queue 0 gets processed first then queue 1 then queue 2. In the strict priority processing such as in prio or default 3 band sched, queue 0 is processed until no more packets are left, then queue1 etc. This could result in starvation. You could avoid starvation by inserting a TBF in a prio; limit the size of the fifo in a class or use CBQ configured as WRR. I hope the above explains why you have to recreate the priomap everytime you change the number of bands. You used the word "probably" which is wrong. The proper word is "MUST". What i think would be useful for you to do is describe some of the vlaues used by some applications (RFC 1349 cut-n-paste job would help). cheers, jamal From owner-netdev@oss.sgi.com Sat Dec 8 12:55:53 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB8KtrI18254 for netdev-outgoing; Sat, 8 Dec 2001 12:55:53 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB8Ktho18250 for ; Sat, 8 Dec 2001 12:55:44 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id 706F8C6A37; Sat, 8 Dec 2001 20:55:41 +0100 (CET) Date: Sat, 8 Dec 2001 20:55:41 +0100 From: bert hubert To: jamal Cc: lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: further CBQ/tc documentation ds9a.nl/lartc/manpages Message-ID: <20011208205541.A15565@outpost.ds9a.nl> Mail-Followup-To: bert hubert , jamal , lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com References: <20011203030002.A20601@outpost.ds9a.nl> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from hadi@cyberus.ca on Sat, Dec 08, 2001 at 02:20:20PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 3440 Lines: 88 On Sat, Dec 08, 2001 at 02:20:20PM -0500, jamal wrote: > > The parts of the table of contents that document stuff in the kernel not > > documented elsewhere: > > "not documented elsewhere" comes out rude. Werner and I (and even > Alexey when he was in the mood -- and i have seen some good documentation > by other people as well) have spent numerous hours documenting, presenting > and answering questions on mailing lists at times True. I should have worded that better but I lost sight of politeness due to my great enthusiasm at finally understanding everything. Some parts required literally *hours* of digging through sources and disembodied slides - presentations lose something without a speaker. > Sample docs that i was personally involved in: > ftp://icaftp.epfl.ch/pub/linux/diffserv/misc/dsid-01.txt.gz These days I understand this document, but I didn't used to. That might be because I'm thick, though. > You need to introduce the big picture to the user. > and what is wrong with the definitions used in > http://www.davin.ottawa.on.ca/ols/img10.htm that forced you to introduce > your own? I've since moved to this terminology. Please also see the manpages I'm writing at http://ds9a.nl/lartc/manpages > Actually, the big picture is: > http://www.davin.ottawa.on.ca/ols/img9.htm > Also > http://www.linuxjournal.com/article.php?sid=3369 > (was written in 98 but got published in 99) Google is surely to be praised - I had found all these links already. But to summarize: stuff is out there. > [My complaints about your style is you often are trying to present facts > by using opinions. For example despite a lot of effort in the past to > explain ingress qdisc to you in the past and, pointing you to very good > documentation from CISCO you still ended using your opinions on what you > thought it should be;-> I really didn't understand how everything worked back then, sadly. I do now, hopefully. > My scanning of the document shows opinions still posing as miscontrued > facts. It is improving compared to what i saw last when we discussed ingress. > Let me clarify one thing in this email; i'll read what you have later. Some stuff remains from that time, am working on removing it. My current efforts is writing the manpages and getting them 100% right and devoid of opinion. Once they are finished & reviewed, I'm 'backporting' the insight to the HOWTO, which will then lose a lot of content and instead refer to the manpages. > Lets start by your description of TC_PRIO and TOS mappings etc: > Your descriptions of these values is insufficient. Consider this a > tutorial and reword it as you wish but please avoid opinions. Will do, it makes sense now. > Look at RFC 1349 for typical values used by different applications > Then of course note that RFC 1349 is obsoleted by RFC 2474 (yes, you can > weep); That confused me greatly, yes. > What i think would be useful for you to do is describe some of the vlaues > used by some applications (RFC 1349 cut-n-paste job would help). Thanks. I'm working on making the HOWTO more factual and the manpages 100% factual. I'm always happy with critiques. Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services Trilab The Technology People Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet From owner-netdev@oss.sgi.com Sat Dec 8 13:46:54 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB8LksX19262 for netdev-outgoing; Sat, 8 Dec 2001 13:46:54 -0800 Received: from shell.cyberus.ca (shell.cyberus.ca [216.191.240.114]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB8Lkmo19259 for ; Sat, 8 Dec 2001 13:46:48 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id PAA05134; Sat, 8 Dec 2001 15:43:05 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sat, 8 Dec 2001 15:43:05 -0500 (EST) From: jamal To: bert hubert cc: , , , Subject: Re: further CBQ/tc documentation ds9a.nl/lartc/manpages In-Reply-To: <20011208205541.A15565@outpost.ds9a.nl> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1719 Lines: 46 For starters, i think you need a defintions sections. Look at: http://www.ietf.org/internet-drafts/draft-ietf-diffserv-model-06.txt (eg what is a shaper etc and how trhings are placed together). At least that will ensure that you dont sday things like "Prio cant shape". It is a good model but may be insufficient given Linux TCs capabilities. Email me when unsure. Some other things: - In your comment "Do not confuse this classless simple qdisc with the classful PRIO one!". This is misleading: the default 3 band FIFO queue is conceptually the same as the default prio qdisc (the priomaps are identical). 3 bands; same prioritization schemes. - You really need to fix ingress section: it works for both forwarding and packets coming in to local sockets. More importantly, It takes advantages of _all_ filter schemes available for TC as well as the policing functionality (which sadly seemed to have been replicated by someone in netfilter, wrongly if i may add ;->). - You keep saying "reodering" -- dont know what that means. Reordering is generally considered a Bad Thing(tm). - your description of the "peakrate" (same in TBF as well as policing) Well captured. It took ages to get this into peoples heads. This also applies to CBQ. - your description of "MTU" Not very good description: This is just what it literally says; maximum transmit unit; A packet larger than this will be dropped. Default is 2K. For ethernet, MTUs of 1500 bytes, this is fine; however, you should put a cautionary statement here in regards to people having MTUs smaller than 2K (example the lo device); they might find that all their packets greater than 2K being dropped. More later if dont get distracted. cheers, jamal From owner-netdev@oss.sgi.com Sat Dec 8 15:33:29 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB8NXTd20827 for netdev-outgoing; Sat, 8 Dec 2001 15:33:29 -0800 Received: from smtp-2.hut.fi (smtp-2.hut.fi [130.233.228.92]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB8NXLo20824 for ; Sat, 8 Dec 2001 15:33:22 -0800 Received: from vesper.tky.hut.fi (IDENT:qmailr@vesper.tky.hut.fi [130.233.19.17]) by smtp-2.hut.fi (8.9.3/8.9.3) with SMTP id AAA00229 for ; Sun, 9 Dec 2001 00:33:18 +0200 (EET) Received: (qmail 5341 invoked by uid 500); 8 Dec 2001 22:33:05 -0000 Date: Sun, 9 Dec 2001 00:33:05 +0200 From: Antti J Tuominen To: netdev@oss.sgi.com Subject: Update to include/net/ndisc.h Message-ID: <20011209003305.A5284@vesper.tky.hut.fi> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="k+w/mQv8wyuph6w0" X-Mailer: Mutt 0.95.6us Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1269 Lines: 48 --k+w/mQv8wyuph6w0 Content-Type: text/plain; charset=us-ascii Hello! The include/net/ndisc.h seems to be lagging behind from the IANA assigned numbers. Could the attached patch be applied. Regards, Antti -- Antti J. Tuominen, JMT 3A 133, FIN-02150 Espoo, Finland. Research assistant, TSE Institute at Helsinki University of Technology work: ajtuomin@tml.hut.fi; home: tuominen@iki.fi --k+w/mQv8wyuph6w0 Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename=ndisc-patch --- linux-2.4.16/include/net/ndisc.h Thu Nov 22 21:47:42 2001 +++ linux/include/net/ndisc.h Sun Dec 9 00:06:14 2001 @@ -11,6 +11,9 @@ #define NDISC_NEIGHBOUR_ADVERTISEMENT 136 #define NDISC_REDIRECT 137 +#define NDISC_IND_SOLICITATION 140 /* RFC3122 */ +#define NDISC_IND_ADVERTISEMENT 141 /* RFC3122 */ + /* * ndisc options */ @@ -20,6 +23,11 @@ #define ND_OPT_PREFIX_INFO 3 #define ND_OPT_REDIRECT_HDR 4 #define ND_OPT_MTU 5 +#define ND_OPT_NBMA_SHORTCUT_LIMIT 6 +#define ND_OPT_RTR_ADV_INTERVAL 7 /* Mobile IPv6 */ +#define ND_OPT_HOME_AGENT_INFO 8 /* Mobile IPv6 */ +#define ND_OPT_SOURCE_ADDR_LIST 9 /* RFC3122 */ +#define ND_OPT_TARGET_ADDR_LIST 10 /* RFC3122 */ #define MAX_RTR_SOLICITATION_DELAY HZ --k+w/mQv8wyuph6w0-- From owner-netdev@oss.sgi.com Sat Dec 8 15:54:44 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB8Nsis21224 for netdev-outgoing; Sat, 8 Dec 2001 15:54:44 -0800 Received: from smtp-2.hut.fi (smtp-2.hut.fi [130.233.228.92]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB8Nsco21219 for ; Sat, 8 Dec 2001 15:54:38 -0800 Received: from vesper.tky.hut.fi (IDENT:qmailr@vesper.tky.hut.fi [130.233.19.17]) by smtp-2.hut.fi (8.9.3/8.9.3) with SMTP id AAA00463 for ; Sun, 9 Dec 2001 00:54:36 +0200 (EET) Received: (qmail 5412 invoked by uid 500); 8 Dec 2001 22:54:23 -0000 Date: Sun, 9 Dec 2001 00:54:23 +0200 From: Antti J Tuominen To: netdev@oss.sgi.com Subject: Re: Update to include/net/ndisc.h Message-ID: <20011209005423.A5385@vesper.tky.hut.fi> References: <20011209003305.A5284@vesper.tky.hut.fi> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary=d6Gm4EdcadzBjdND X-Mailer: Mutt 0.95.6us In-Reply-To: <20011209003305.A5284@vesper.tky.hut.fi>; from Antti J Tuominen on Sun, Dec 09, 2001 at 12:33:05AM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1396 Lines: 48 --d6Gm4EdcadzBjdND Content-Type: text/plain; charset=us-ascii On Sun, Dec 09, 2001 at 12:33:05AM +0200, Antti J Tuominen wrote: > assigned numbers. Could the attached patch be applied. Not that one... sorry. Made a mistake, here's the correct patch. Inverse Neighbor Discovery Messages are, of course, 141 and 142. Silly me. Antti -- Antti J. Tuominen, JMT 3A 133, FIN-02150 Espoo, Finland. Research assistant, TSE Institute at Helsinki University of Technology work: ajtuomin@tml.hut.fi; home: tuominen@iki.fi --d6Gm4EdcadzBjdND Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename=ndisc-patch --- linux-2.4.16/include/net/ndisc.h Thu Nov 22 21:47:42 2001 +++ linux/include/net/ndisc.h Sun Dec 9 00:06:14 2001 @@ -11,6 +11,9 @@ #define NDISC_NEIGHBOUR_ADVERTISEMENT 136 #define NDISC_REDIRECT 137 +#define NDISC_IND_SOLICITATION 141 /* RFC3122 */ +#define NDISC_IND_ADVERTISEMENT 142 /* RFC3122 */ + /* * ndisc options */ @@ -20,6 +23,11 @@ #define ND_OPT_PREFIX_INFO 3 #define ND_OPT_REDIRECT_HDR 4 #define ND_OPT_MTU 5 +#define ND_OPT_NBMA_SHORTCUT_LIMIT 6 +#define ND_OPT_RTR_ADV_INTERVAL 7 /* Mobile IPv6 */ +#define ND_OPT_HOME_AGENT_INFO 8 /* Mobile IPv6 */ +#define ND_OPT_SOURCE_ADDR_LIST 9 /* RFC3122 */ +#define ND_OPT_TARGET_ADDR_LIST 10 /* RFC3122 */ #define MAX_RTR_SOLICITATION_DELAY HZ --d6Gm4EdcadzBjdND-- From owner-netdev@oss.sgi.com Sat Dec 8 16:24:00 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB90O0v21983 for netdev-outgoing; Sat, 8 Dec 2001 16:24:00 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB90Nko21978 for ; Sat, 8 Dec 2001 16:23:46 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id 395F9C6104; Sun, 9 Dec 2001 00:23:44 +0100 (CET) Date: Sun, 9 Dec 2001 00:23:44 +0100 From: bert hubert To: jamal Cc: lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: CBQ and all other qdiscs now REALLY completely documented (almost!) Message-ID: <20011209002344.C20125@outpost.ds9a.nl> Mail-Followup-To: bert hubert , jamal , lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com References: <20011203030002.A20601@outpost.ds9a.nl> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from hadi@cyberus.ca on Sat, Dec 08, 2001 at 02:20:20PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 5050 Lines: 136 On Sat, Dec 08, 2001 at 02:20:20PM -0500, jamal wrote: > Linux remaps packets incoming with different values to some internal > value; the colum "mapped to" shows the internal mapping > > 8value(hex) TOS(dec) mapped to(dec) > ---------------------------------- > 0x0 0 0 > 1 7 > 2 0 > 3 0 > 4 2 > 5 2 > 6 2 > 7 2 > 0x10 8 6 > 9 6 > 10 6 > 11 6 > 12 2 > 13 2 > 14 2 > 15 2 I find this tos2prio table in the kernel (2.5.x), which is somewhat different than your table: 0 TC_PRIO_BESTEFFORT, 0 1 TC_PRIO_(FILLER), 1 2 TC_PRIO_BESTEFFORT, 0 3 TC_PRIO_(BESTEFFORT), 0 4 TC_PRIO_BULK, 2 5 TC_PRIO_(BULK), 2 6 TC_PRIO_BULK, 2 7 TC_PRIO_(BULK), 2 8 TC_PRIO_INTERACTIVE, 6 9 TC_PRIO_(INTERACTIVE), 6 10 TC_PRIO_INTERACTIVE, 6 11 TC_PRIO_(INTERACTIVE), 6 12 TC_PRIO_INTERACTIVE_BULK, 4 13 TC_PRIO_(INTERACTIVE_BULK), 4 14 TC_PRIO_INTERACTIVE_BULK, 4 15 TC_PRIO_(INTERACTIVE_BULK) 4 > Fill in the "8value(hex)" column gaps using the bitmap from RFC1349 for > the 8 bits; These are the values ou would see with tcpdump -vvv > I filled the two easiest ones i could compute in my head. > > Second step: > > Take the default priority map: > 1, 2, 2, 2, 1, 2, 0, 0 , 1, 1, 1, 1, 1, 1, 1, 1 > This applies for both default prio and the 3-band FIFO queue. > Note the queue map fitted on the last column > > 8 but value TOS mapped to queue map > --------------------------------------------- > 0x0 0 0 1 > 1 7 2 > 2 0 2 > 3 0 2 > 4 2 1 > 5 2 2 > 6 2 0 > 7 2 0 > 0x10 8 6 1 > 9 6 1 > 10 6 1 > 11 6 1 > 12 2 1 > 13 2 1 > 14 2 1 > 15 2 1 I've changed this table to: TOS Bits Means Linux Priority Band ------------------------------------------------------------ 0x0 0 Normal Service 0 Best Effort 1 0x2 1 Minimize Monetary Cost 1 Filler 2 0x4 2 Maximize Reliability 0 Best Effort 1 0x6 3 mmc+mr 0 Best Effort 1 0x8 4 Maximize Throughput 2 Bulk 2 0xa 5 mmc+mt 2 Bulk 2 0xc 6 mr+mt 2 Bulk 2 0xe 7 mmc+mr+mt 2 Bulk 2 0x10 8 Minimize Delay 6 Interactive 0 0x12 9 mmc+md 6 Interactive 0 0x14 10 mr+md 6 Interactive 0 0x16 11 mmc+mr+md 6 Interactive 0 0x18 12 mt+md 4 Int. Bulk 1 0x1a 13 mmc+mt+md 4 Int. Bulk 1 0x1c 14 mr+mt+md 4 Int. Bulk 1 0x1e 15 mmc+mr+mt+md 4 Int. Bulk 1 http://ds9a.nl/lartc/HOWTO/cvs/2.4routing/output/2.4routing-9.html#ss9.2 Your table appears to imply that a Maximum Reliability, Mininum Delay packet, TOS bits=9, gets mapped to band 1, not 0, which would not make sense. Laying it out like this, which does appear how it works, does mean that you can specify priorities in the priomap which do not correspond to possible TOS values. Is it possible at all to set skb->priority from userspace without going through the tos2prio mapping? CBQ can use the skb->priority to classify: /* * Step 1. If skb->priority points to one of our classes, use it. */ if (TC_H_MAJ(prio^sch->handle) == 0 && (cl = cbq_class_lookup(q, prio)) != NULL) return cl; But to do this, you would need to be able to set skb->priority to a 32bit number: include/linux/pkt_sched.h:#define TC_H_MAJ_MASK (0xFFFF0000U) include/linux/pkt_sched.h:#define TC_H_MAJ(h) ((h)&TC_H_MAJ_MASK) I can't find where you would do this, any clues? Thanks again for taking the time to help me. Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services Trilab The Technology People Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet From owner-netdev@oss.sgi.com Sat Dec 8 18:18:04 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB92I4l24883 for netdev-outgoing; Sat, 8 Dec 2001 18:18:04 -0800 Received: from shell.cyberus.ca (shell.cyberus.ca [216.191.240.114]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB92Hno24880 for ; Sat, 8 Dec 2001 18:17:49 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id UAA05370; Sat, 8 Dec 2001 20:14:11 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sat, 8 Dec 2001 20:14:10 -0500 (EST) From: jamal To: bert hubert cc: , , , Subject: Re: CBQ and all other qdiscs now REALLY completely documented (almost!) In-Reply-To: <20011209002344.C20125@outpost.ds9a.nl> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 5517 Lines: 151 On Sun, 9 Dec 2001, bert hubert wrote: > On Sat, Dec 08, 2001 at 02:20:20PM -0500, jamal wrote: > > > Linux remaps packets incoming with different values to some internal > > value; the colum "mapped to" shows the internal mapping > > > > 8value(hex) TOS(dec) mapped to(dec) > > ---------------------------------- > > 0x0 0 0 > > 1 7 > > 2 0 > > 3 0 > > 4 2 > > 5 2 > > 6 2 > > 7 2 > > 0x10 8 6 > > 9 6 > > 10 6 > > 11 6 > > 12 2 > > 13 2 > > 14 2 > > 15 2 > > I find this tos2prio table in the kernel (2.5.x), which is somewhat > different than your table: > > 0 TC_PRIO_BESTEFFORT, 0 > 1 TC_PRIO_(FILLER), 1 > 2 TC_PRIO_BESTEFFORT, 0 > 3 TC_PRIO_(BESTEFFORT), 0 > 4 TC_PRIO_BULK, 2 > 5 TC_PRIO_(BULK), 2 > 6 TC_PRIO_BULK, 2 > 7 TC_PRIO_(BULK), 2 > 8 TC_PRIO_INTERACTIVE, 6 > 9 TC_PRIO_(INTERACTIVE), 6 > 10 TC_PRIO_INTERACTIVE, 6 > 11 TC_PRIO_(INTERACTIVE), 6 > 12 TC_PRIO_INTERACTIVE_BULK, 4 > 13 TC_PRIO_(INTERACTIVE_BULK), 4 > 14 TC_PRIO_INTERACTIVE_BULK, 4 > 15 TC_PRIO_(INTERACTIVE_BULK) 4 > > > > Fill in the "8value(hex)" column gaps using the bitmap from RFC1349 for > > the 8 bits; These are the values ou would see with tcpdump -vvv > > I filled the two easiest ones i could compute in my head. > > > > Second step: > > > > Take the default priority map: > > 1, 2, 2, 2, 1, 2, 0, 0 , 1, 1, 1, 1, 1, 1, 1, 1 > > This applies for both default prio and the 3-band FIFO queue. > > Note the queue map fitted on the last column > > > > 8 but value TOS mapped to queue map > > --------------------------------------------- > > 0x0 0 0 1 > > 1 7 2 > > 2 0 2 > > 3 0 2 > > 4 2 1 > > 5 2 2 > > 6 2 0 > > 7 2 0 > > 0x10 8 6 1 > > 9 6 1 > > 10 6 1 > > 11 6 1 > > 12 2 1 > > 13 2 1 > > 14 2 1 > > 15 2 1 > > I've changed this table to: > TOS Bits Means Linux Priority Band > ------------------------------------------------------------ > 0x0 0 Normal Service 0 Best Effort 1 > 0x2 1 Minimize Monetary Cost 1 Filler 2 > 0x4 2 Maximize Reliability 0 Best Effort 1 > 0x6 3 mmc+mr 0 Best Effort 1 > 0x8 4 Maximize Throughput 2 Bulk 2 > 0xa 5 mmc+mt 2 Bulk 2 > 0xc 6 mr+mt 2 Bulk 2 > 0xe 7 mmc+mr+mt 2 Bulk 2 > 0x10 8 Minimize Delay 6 Interactive 0 > 0x12 9 mmc+md 6 Interactive 0 > 0x14 10 mr+md 6 Interactive 0 > 0x16 11 mmc+mr+md 6 Interactive 0 > 0x18 12 mt+md 4 Int. Bulk 1 > 0x1a 13 mmc+mt+md 4 Int. Bulk 1 > 0x1c 14 mr+mt+md 4 Int. Bulk 1 > 0x1e 15 mmc+mr+mt+md 4 Int. Bulk 1 > Yes, sorry the last 4 are int_bulk (value 4) and not just bulk (2). good eye. You are still abusing the word TOS. Thats only 4 bits not 8; Use the terminology from RFC1349 at least. > http://ds9a.nl/lartc/HOWTO/cvs/2.4routing/output/2.4routing-9.html#ss9.2 > > Your table appears to imply that a Maximum Reliability, Mininum Delay > packet, TOS bits=9, gets mapped to band 1, not 0, which would not make > sense. > This is the priomap: 1, 2, 2, 2, 1, 2, 0, 0 , 1, 1, 1, 1, 1, 1, 1, 1 It says 1 is the right value > Laying it out like this, which does appear how it works, does mean that you > can specify priorities in the priomap which do not correspond to possible > TOS values. > You cant remap the 3 band scheduler trivially, but you should be able to replace it with a default prio qdisc get exactly the same behavior and use whatever map you want (eg your 0 to 1 substitution for TOS 1001) > Is it possible at all to set skb->priority from userspace without going > through the tos2prio mapping? > SO_PRIORITY socket option is doable; you have to be root. > CBQ can use the skb->priority to classify: so do prio and pfifo_fast (as i am sure you are aware) > /* > * Step 1. If skb->priority points to one of our classes, use it. > */ > if (TC_H_MAJ(prio^sch->handle) == 0 && > (cl = cbq_class_lookup(q, prio)) != NULL) > return cl; > > But to do this, you would need to be able to set skb->priority to a 32bit > number: > Cant think of a straight way to do this .... Alexey would know, cheers, jamal From owner-netdev@oss.sgi.com Sat Dec 8 18:30:46 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB92Uk225091 for netdev-outgoing; Sat, 8 Dec 2001 18:30:46 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB92Ubo25088 for ; Sat, 8 Dec 2001 18:30:37 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id 5FEFFC67D0; Sun, 9 Dec 2001 02:30:29 +0100 (CET) Date: Sun, 9 Dec 2001 02:30:29 +0100 From: bert hubert To: lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: CBQ and all other qdiscs now REALLY completely documented (almost!) Message-ID: <20011209023029.A25580@outpost.ds9a.nl> Mail-Followup-To: bert hubert , lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, netdev@oss.sgi.com References: <20011209002344.C20125@outpost.ds9a.nl> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from hadi@cyberus.ca on Sat, Dec 08, 2001 at 08:14:10PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 2584 Lines: 86 On Sat, Dec 08, 2001 at 08:14:10PM -0500, jamal wrote: > Yes, sorry the last 4 are int_bulk (value 4) and not just bulk (2). good > eye. You are still abusing the word TOS. Thats only 4 bits not 8; > Use the terminology from RFC1349 at least. Will do. > > http://ds9a.nl/lartc/HOWTO/cvs/2.4routing/output/2.4routing-9.html#ss9.2 > > > > Your table appears to imply that a Maximum Reliability, Mininum Delay > > packet, TOS bits=9, gets mapped to band 1, not 0, which would not make > > sense. > > This is the priomap: 1, 2, 2, 2, 1, 2, 0, 0 , 1, 1, 1, 1, 1, 1, 1, 1 > It says 1 is the right value AFAICT, the priomap maps skb->priority to band. So the translation is as follows: Type of Service octet, which is fed to: skb->priority = rt_tos2priority(iph->tos); To extract the four TOS bits, and to translate to prio: static inline char rt_tos2priority(u8 tos) { return ip_tos2prio[IPTOS_TOS(tos)>>1]; } ---- __u8 ip_tos2prio[16] = { TC_PRIO_BESTEFFORT, ECN_OR_COST(FILLER), TC_PRIO_BESTEFFORT, ECN_OR_COST(BESTEFFORT), TC_PRIO_BULK, ECN_OR_COST(BULK), TC_PRIO_BULK, ECN_OR_COST(BULK), TC_PRIO_INTERACTIVE, ECN_OR_COST(INTERACTIVE), TC_PRIO_INTERACTIVE, ECN_OR_COST(INTERACTIVE), TC_PRIO_INTERACTIVE_BULK, ECN_OR_COST(INTERACTIVE_BULK), TC_PRIO_INTERACTIVE_BULK, ECN_OR_COST(INTERACTIVE_BULK) }; --- #define TC_PRIO_BESTEFFORT 0 #define TC_PRIO_FILLER 1 #define TC_PRIO_BULK 2 #define TC_PRIO_INTERACTIVE_BULK 4 #define TC_PRIO_INTERACTIVE 6 #define TC_PRIO_CONTROL 7 #define TC_PRIO_MAX 15 net/sched/sched_generic.c: static const u8 prio2band[TC_PRIO_MAX+1] = { 1, 2, 2, 2, 1, 2, 0, 0 , 1, 1, 1, 1, 1, 1, 1, 1 }; list = ((struct sk_buff_head*)qdisc->data) + prio2band[skb->priority&TC_PRIO_MAX]; > > CBQ can use the skb->priority to classify: > > so do prio and pfifo_fast (as i am sure you are aware) Of course, but only CBQ (& HTB, by the way) can extract a classid directly from it, without a priomap. Devik is planning to learn HTB to extract a classid directly from the fwmark, to skip a layer of indirection. Regards, bert hubert -- http://www.PowerDNS.com Versatile DNS Software & Services Trilab The Technology People Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet From owner-netdev@oss.sgi.com Sat Dec 8 19:14:19 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB93EJw25949 for netdev-outgoing; Sat, 8 Dec 2001 19:14:19 -0800 Received: from shell.cyberus.ca (shell.cyberus.ca [216.191.240.114]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB93EGo25946 for ; Sat, 8 Dec 2001 19:14:16 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id VAA05488; Sat, 8 Dec 2001 21:10:41 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sat, 8 Dec 2001 21:10:41 -0500 (EST) From: jamal To: bert hubert cc: , , Subject: Re: CBQ and all other qdiscs now REALLY completely documented (almost!) In-Reply-To: <20011209023029.A25580@outpost.ds9a.nl> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 614 Lines: 26 On Sun, 9 Dec 2001, bert hubert wrote: > On Sat, Dec 08, 2001 at 08:14:10PM -0500, jamal wrote: > > AFAICT, the priomap maps skb->priority to band. So the translation is as > follows: > yes ;-> > > > > so do prio and pfifo_fast (as i am sure you are aware) > > Of course, but only CBQ (& HTB, by the way) can extract a classid directly > from it, without a priomap. Devik is planning to learn HTB to extract a > classid directly from the fwmark, to skip a layer of indirection. > I am not sure if this is such a nice hack. Whats wrong with with using the fwmark classifier to select classes? cheers, jamal From owner-netdev@oss.sgi.com Sun Dec 9 11:15:16 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB9JFG410271 for netdev-outgoing; Sun, 9 Dec 2001 11:15:16 -0800 Received: from ms2.inr.ac.ru (minus.inr.ac.ru [193.233.7.97]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB9JFEo10267 for ; Sun, 9 Dec 2001 11:15:14 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA00499; Sun, 9 Dec 2001 21:14:46 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200112091814.VAA00499@ms2.inr.ac.ru> Subject: Re: CBQ and all other qdiscs now REALLY completely documented To: hadi@cyberus.ca (jamal) Date: Sun, 9 Dec 2001 21:14:46 +0300 (MSK) Cc: ahu@ds9a.nl, lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: from "jamal" at Dec 8, 1 08:14:10 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 218 Lines: 11 Hello! > > But to do this, you would need to be able to set skb->priority to a 32bit > > number: > > > > Cant think of a straight way to do this .... Alexey would know, SO_PRIORITY. Or I did not follow you? Alexey From owner-netdev@oss.sgi.com Sun Dec 9 11:18:10 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB9JIAA10393 for netdev-outgoing; Sun, 9 Dec 2001 11:18:10 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB9JI7o10390 for ; Sun, 9 Dec 2001 11:18:07 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id 6D884C60F0; Sun, 9 Dec 2001 19:18:05 +0100 (CET) Date: Sun, 9 Dec 2001 19:18:05 +0100 From: bert hubert To: kuznet@ms2.inr.ac.ru Cc: jamal , lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: CBQ and all other qdiscs now REALLY completely documented Message-ID: <20011209191805.A18271@outpost.ds9a.nl> Mail-Followup-To: bert hubert , kuznet@ms2.inr.ac.ru, jamal , lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, netdev@oss.sgi.com References: <200112091814.VAA00499@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <200112091814.VAA00499@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Sun, Dec 09, 2001 at 09:14:46PM +0300 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 618 Lines: 19 On Sun, Dec 09, 2001 at 09:14:46PM +0300, kuznet@ms2.inr.ac.ru wrote: > > > But to do this, you would need to be able to set skb->priority to a 32bit > > > number: > > Cant think of a straight way to do this .... Alexey would know, > > SO_PRIORITY. Or I did not follow you? Ah yes, thanks, that sets sk->priority which later sets skb->priority. Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services Trilab The Technology People Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet From owner-netdev@oss.sgi.com Sun Dec 9 14:48:51 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB9MmpT27260 for netdev-outgoing; Sun, 9 Dec 2001 14:48:51 -0800 Received: from shell.cyberus.ca (shell.cyberus.ca [216.191.240.114]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB9Mmmo27257 for ; Sun, 9 Dec 2001 14:48:48 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id QAA06446; Sun, 9 Dec 2001 16:45:01 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 9 Dec 2001 16:45:01 -0500 (EST) From: jamal To: cc: , , , Subject: Re: CBQ and all other qdiscs now REALLY completely documented In-Reply-To: <200112091814.VAA00499@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 395 Lines: 21 On Sun, 9 Dec 2001 kuznet@ms2.inr.ac.ru wrote: > Hello! > > > > But to do this, you would need to be able to set skb->priority to a 32bit > > > number: > > > > > > > Cant think of a straight way to do this .... Alexey would know, > > SO_PRIORITY. Or I did not follow you? > So priority limits the size of skb->priority to be from 0..6; this wont work with that check in cbq. cheers, jamal From owner-netdev@oss.sgi.com Sun Dec 9 14:54:00 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB9Ms0e27412 for netdev-outgoing; Sun, 9 Dec 2001 14:54:00 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB9Mrvo27409 for ; Sun, 9 Dec 2001 14:53:57 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id 48B66C6C2F; Sun, 9 Dec 2001 22:53:50 +0100 (CET) Date: Sun, 9 Dec 2001 22:53:50 +0100 From: bert hubert To: jamal Cc: kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: CBQ and all other qdiscs now REALLY completely documented Message-ID: <20011209225350.A22512@outpost.ds9a.nl> Mail-Followup-To: bert hubert , jamal , kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com References: <200112091814.VAA00499@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from hadi@cyberus.ca on Sun, Dec 09, 2001 at 04:45:01PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 575 Lines: 19 On Sun, Dec 09, 2001 at 04:45:01PM -0500, jamal wrote: > > > Cant think of a straight way to do this .... Alexey would know, > > > > SO_PRIORITY. Or I did not follow you? > > So priority limits the size of skb->priority to be from 0..6; this wont > work with that check in cbq. No, only IP_TOS does so. Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services Trilab The Technology People Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet From owner-netdev@oss.sgi.com Sun Dec 9 15:10:44 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB9NAic27837 for netdev-outgoing; Sun, 9 Dec 2001 15:10:44 -0800 Received: from shell.cyberus.ca (shell.cyberus.ca [216.191.240.114]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB9NAfo27834 for ; Sun, 9 Dec 2001 15:10:42 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id RAA06474; Sun, 9 Dec 2001 17:07:03 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 9 Dec 2001 17:07:03 -0500 (EST) From: jamal To: bert hubert cc: , , Subject: Re: CBQ and all other qdiscs now REALLY completely documented In-Reply-To: <20011209225350.A22512@outpost.ds9a.nl> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 470 Lines: 21 On Sun, 9 Dec 2001, bert hubert wrote: > On Sun, Dec 09, 2001 at 04:45:01PM -0500, jamal wrote: > > > > Cant think of a straight way to do this .... Alexey would know, > > > > > > SO_PRIORITY. Or I did not follow you? > > > > So priority limits the size of skb->priority to be from 0..6; this wont > > work with that check in cbq. > > No, only IP_TOS does so. > probaly ip precedence. Have you tried this or you are following what the man pages say? cheers, jamal From owner-netdev@oss.sgi.com Sun Dec 9 15:13:45 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fB9NDj727952 for netdev-outgoing; Sun, 9 Dec 2001 15:13:45 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fB9NDgo27949 for ; Sun, 9 Dec 2001 15:13:42 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id 3D097C6477; Sun, 9 Dec 2001 23:13:40 +0100 (CET) Date: Sun, 9 Dec 2001 23:13:40 +0100 From: bert hubert To: jamal Cc: kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: CBQ and all other qdiscs now REALLY completely documented Message-ID: <20011209231340.A23420@outpost.ds9a.nl> Mail-Followup-To: bert hubert , jamal , kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com References: <20011209225350.A22512@outpost.ds9a.nl> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from hadi@cyberus.ca on Sun, Dec 09, 2001 at 05:07:03PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 671 Lines: 21 On Sun, Dec 09, 2001 at 05:07:03PM -0500, jamal wrote: > > > So priority limits the size of skb->priority to be from 0..6; this wont > > > work with that check in cbq. > > > > No, only IP_TOS does so. > > probaly ip precedence. Have you tried this or you are following what the > man pages say? I have been living in the source for quite a while now - see ip_setsockopt() in net/ipv4/ip_sockglue.c. Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services Trilab The Technology People Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet From owner-netdev@oss.sgi.com Sun Dec 9 17:42:19 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBA1gJu30887 for netdev-outgoing; Sun, 9 Dec 2001 17:42:19 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBA1fXo30874 for ; Sun, 9 Dec 2001 17:41:33 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id DAC9CC6C86; Mon, 10 Dec 2001 01:41:30 +0100 (CET) Date: Mon, 10 Dec 2001 01:41:30 +0100 From: bert hubert To: kuznet@ms2.inr.ac.ru Cc: jamal , lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: CBQ MANPAGE: I hear the theme of '2001, A Space Odyssey' Message-ID: <20011210014130.A27193@outpost.ds9a.nl> Mail-Followup-To: bert hubert , kuznet@ms2.inr.ac.ru, jamal , lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, netdev@oss.sgi.com References: <200112091814.VAA00499@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="yrj/dFKFPuw6o+aM" Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <200112091814.VAA00499@ms2.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Sun, Dec 09, 2001 at 09:14:46PM +0300 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 15604 Lines: 473 --yrj/dFKFPuw6o+aM Content-Type: text/plain; charset=us-ascii Content-Disposition: inline ... to the sound of 'Also sprach Zarathustra': After weeks of social deprivation and much digging through heaps of code, I bring you tc-cbq.8 The CBQ manpage. Nearly 2500 words, 8 printed pages, of nearly unintelligible gobledygook, explaining mostly how CBQ works. It is part of the Linux Advanced Routing & Traffic Control documentation project which contains a HOWTO, a mailinglist, an IRC channel and now manpages: http://ds9a.nl/lartc I want to thank Jamal for stubbornly straightening me out when I use messy language and explaining how things work. The errors are mine though. I *implore* ANK and others to read through this. I'm about exhausted and running out of time (need to get on with work), and have a hard time figuring out the exact details of the CBQ link sharing algorithm. I need help, so to speak. The manpage indicates where. Thanks for your attention. Please find tc-cbq.8 attached. Regards, bert hubert -- http://www.PowerDNS.com Versatile DNS Software & Services Trilab The Technology People Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet --yrj/dFKFPuw6o+aM Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="tc-cbq.8" .TH CBQ 8 "8 December 2001" "iproute2" "Linux" .SH NAME CBQ \- Class Based Queueing .SH SYNOPSIS .B tc qdisc ... dev dev .B ( parent classid .B | root) [ handle major: .B ] cbq avpkt bytes .B bandwidth rate .B [ cell bytes .B ] [ ewma log .B ] [ mpu bytes .B ] .B tc class ... dev dev .B parent major:[minor] .B [ classid major:minor .B ] cbq allot bytes .B [ bandwidth rate .B ] [ rate rate .B ] prio priority .B [ weight weight .B ] [ minburst packets .B ] [ maxburst packets .B ] [ ewma log .B ] [ cell bytes .B ] avpkt bytes .B [ mpu bytes .B ] [ bounded isolated ] [ split handle .B & defmap defmap .B ] [ estimator interval timeconstant .B ] .SH DESCRIPTION Class Based Queueing is a classful qdisc that implements a rich linksharing hierarchy of classes. It contains shaping elements as well as prioritizing capabilities. Shaping is performed using link idle time calculations based on the timing of dequeue events and underlying link bandwidth. .SH SHAPING ALGORITHM Shaping is done using link idle time calculations, and actions taken if these calculations deviate from set limits. When shape a 10mbit/s connection to 1mbit/s, the link will be idle 90% of the time. If it isn't, it needs to be throttled so that it IS idle 90% of the time. >From the kernel's perspecive, this is hard to measure, so CBQ instead derives the idle time from the number of microseconds that elapse between requests from the hardware layer for more data. Combined with the knowledge of packet sizes, this is used to approximate how full or empty the link is. This is rather circumspect and doesn't always arrive at proper results. For example, what is the actual link speed of an interface that is not really able to transmit the full 100mbit/s of data, perhaps because of a badly implemented driver? A PCMCIA network card will also never achieve 100mbit/s because of the way the bus is designed - again, how do we calculate the idle time? The physical link bandwidth may be ill defined in case of not-quite-real network devices like PPP over Ethernet or PPTP over TCP/IP. The effective bandwidth in that case is probably determined by the efficiency of pipes to userspace - which not defined. During operations, the effective idletime is measured using an exponential weighted moving average (EWMA), which considers recent packets to be exponentially more important than past ones. The unix loadaverage is calculated in the same way. The calculated idle time is substracted from the EWMA measured one, the resulting number is called 'avgidle'. A perfectly loaded link has an avgidle of zero: packets arrive exactly at the calculated interval. An overloaded link has a negative avgidle and if it gets too negative, CBQ throttles and is then 'overlimit'. Conversely, an idle link might amass a huge avgidle, which would then allow infinite bandwidths after a few hours of silence. To prevent this, avgidle is capped at .B maxidle. If overlimit, in theory, the CBQ could throttle itself for exactly the amount of time that was calculated to pass between packets, and then pass one packet, and throttle again. Due to timer resolution constraints, this may not be feasible, see the .B minburst parameter below. .SH CLASSIFICATION Within the one CBQ instance many classes may exist. Each of these classes contains another qdisc, by default .BR tc-pfifo (8). When enqueueing a packet, CBQ starts at the root and uses various methods to determine which class should receive the data. If a verdict is reached, this process is repeated for the recipient class which might have further means of classifying traffic to its children, if any. CBQ has the following methods available to classify a packet to any child classes. .TP (i) .B skb->priority class encoding. Can be set from userspace by an application with the .B IP_PRIO setsockopt. The .B skb->priority class encoding only applies if the skb->priority holds a major:minor handle of an existing class within this qdisc. .TP (ii) tc filters attached to the class. .TP (iii) The defmap of a class, as set with the .B split & defmap parameters. The defmap may contain instructions for each possible Linux packet priority. .P Each class also has a .B level. Leaf nodes, attached to the bottom of the class hierarchy, have a level of 0. .SH CLASSIFICATION ALGORITHM Classification is a loop, which terminates when a leaf class is found. At any point the loop may jump to the fallback algorithm. The loop consists of the following steps: .TP (i) If the packet is generated locally and has a valid classid encoded within its .B skb->priority, choose it and terminate. .TP (ii) Consult the tc filters, if any, attached to this child. If these return a class which is not a leaf class, restart loop from the class returned. If it is a leaf, choose it and terminate. .TP (iii) If the tc filters did not return a class, but did return a classid, try to find a class with that id within this qdisc. Check if the found class is of a lower .B level than the current class. If so, and the returned class is not a leaf node, restart the loop at the found class. If it is a leaf node, terminate. If we found an upward reference to a higher level, enter the fallback algorithm. .TP (iv) If the tc filters did not return a class, nor a valid reference to one, consider the minor number of the reference to be the priority. Retrieve a class from the defmap of this class for the priority. If this did not contain a class, consult the defmap of this class for the .B BEST_EFFORT class. If this is an upward reference, or no .B BEST_EFFORT class was defined, enter the fallback algorithm. If a valid class was found, and it is not a leaf node, restart the loop at this class. If it is a leaf, choose it and terminate. If neither the priority distilled from the classid, nor the .B BEST_EFFORT priority yielded a class, enter the fallback algorithm. .P The fallback algorithm resides outside of the loop and is as follows. .TP (i) Consult the defmap of the class at which the jump to fallback occured. If the defmap contains a class for the .B priority of the class (which is related to the TOS field), choose this class and terminate. .TP (ii) Consult the map for a class for the .B BEST_EFFORT priority. If found, choose it, and terminate. .TP (iii) Choose the class at which breakout to the fallback algorithm occured. Terminate. .P The packet is enqueued to the class which was chosen when either algorithm terminated. It is therefore possible for a packet to be enqueued *not* at a leaf node, but in the middle of the hierarchy. .SH LINK SHARING ALGORITHM When dequeueing for sending to the network device, CBQ decides which of its classes will be allowed to send. It does so with a Weighted Round Robin process in which each class with packets gets a chance to send in turn. The WRR process starts by asking the highest priority classes (lowest numerically - highest semantically) for packets, and will continue to do so until they have no more data to offer, in which case the process repeats for lower priorities. .B CERTAINTY ENDS HERE, ANK PLEASE HELP Each class is not allowed to send at length though - they can only dequeue a configurable amount of data during each round. If a class is about to go overlimit, and it is not .B bounded it will try to borrow avgidle from siblings that are not .B isolated. This process is repeated from the bottom upwards. If a class is unable to borrow enough avgidle to send a packet, it is throttled and not asked for a packet for enough time for the avgidle to increase above zero. .B I REALLY NEED HELP FIGURING THIS OUT. REST OF DOCUMENT IS PRETTY CERTAIN .B AGAIN. .SH QDISC The root qdisc of a CBQ class tree has the following parameters: .TP parent major:minor | root This mandatory parameter determines the place of the CBQ instance, either at the .B root of an interface or within an existing class. .TP handle major: Like all other qdiscs, the CBQ can be assigned a handle. Should consist only of a major number, followed by a colon. Optional. .TP avpkt bytes For calculations, the average packet size must be known. It is silently capped at a minimum of 2/3 of the interface MTU. Mandatory. .TP bandwidth rate To determine the idle time, CBQ must know the bandwidth of your underlying physical interface, or parent qdisc. This is a vital parameter, more about it later. Mandatory. .TP cell The cell size determines he granularity of packet transmission time calculations. Has a sensible default. .TP mpu A zero sized packet may still take time to transmit. This value is the lower cap for packet transmission time calculations - packets smaller than this value are still deemed to have this size. Defaults to zero. .TP ewma log When CBQ needs to measure the average idle time, it does so using an Exponentially Weighted Moving Average which smoothes out measurements into a moving average. The EWMA LOG determines how much smoothing occurs. Defaults to 5. Lower values imply greater sensitivity. Must be between 0 and 31. .P A CBQ qdisc does not shape out of its own accord. It only needs to know certain parameters about the underlying link. Actual shaping is done in classes. .SH CLASSES Classes have a host of parameters to configure their operation. .TP parent major:minor Place of this class within the hierarchy. If attached directly to a qdisc and not to another class, minor can be omitted. Mandatory. .TP classid major:minor Like qdiscs, classes can be named. The major number must be equal to the major number of the qdisc to which it belongs. Optional, but needed if this class is going to have children. .TP weight weight When dequeueing to the interface, classes are tried for traffic in a round-robin fashion. Classes with a higher configured qdisc will generally have more traffic to offer during each round, so it makes sense to allow it to dequeue more traffic. All weights under a class are normalized, so only the ratios matter. Defaults to the configured rate, unless the priority of this class is maximal, in which case it is set to 1. .TP allot bytes Allot specifies how many bytes a qdisc can dequeue during each round of the process. This parameter is weighted using the renormalized class weight described above. .TP priority priority In the round-robin process, classes with the lowest priority field are tried for packets first. Mandatory. .TP rate rate Maximum rate this class and all its children combined can send at. Mandatory. .TP bandwidth rate This is different from the bandwidth specified when creating a CBQ disc. Only used to determine maxidle and offtime, which are only calculated when specifying maxburst or minburst. Mandatory if specifying maxburst or minburst. .TP maxburst This number of packets is used to calculate maxidle so that when avgidle is at maxidle, this number of average packets can be burst before avgidle drops to 0. Set it higher to be more tolerant of bursts. You can't set maxidle directly, only via this parameter. .TP minburst As mentioned before, CBQ needs to throttle in case of overlimit. The ideal solution is to do so for exactly the calculated idle time, and pass 1 packet. However, Unix kernels generally have a hard time scheduling events shorter than 10ms, so it is better to throttle for a longer period, and then pass minburst packets in one go, and then sleep minburst times longer. The time to wait is called the offtime. Higher values of minburst lead to more accurate shaping in the long term, but to bigger bursts at millisecond timescales. .TP minidle If avgidle is below 0, we are overlimits and need to wait until avgidle will be big enough to send one packet. To prevent a sudden burst from shutting down the link for a prolonged period of time, avgidle is reset to minidle if it gets too low. Minidle is specified in negative microseconds, so 10 means that avgidle is capped at -10us. .TP bounded Signifies that this class will not borrow bandwidth from its siblings. .TP isolated Means that this class will not borrow bandwidth to its siblings .TP split major:minor & defmap bitmap[/bitmap] If consulting filters attached to a class did not give a verdict, CBQ can also classify based on the packet's priority. There are 16 priorities available, numbered from 0 to 15. The defmap specifies which priorities this class wants to receive, specified as a bitmap. The Least Significant Bit corresponds to priority zero. The .B split parameter tells CBQ at which class the decision must be made, which should be a (grand)parent of the class you are adding. As an example, 'tc class add ... classid 10:1 cbq .. split 10:0 defmap c0' configures class 10:0 to send packets with priorities 6 and 7 to 10:1. The complimentary configuration would then be: 'tc class add ... classid 10:2 cbq ... split 10:0 defmap 3f' Which would send all packets 0, 1, 2, 3, 4 and 5 to 10:1. .TP estimator interval timeconstant CBQ can measure how much bandwidth each class is using, which tc filters can use to classify packets with. In order to determine the bandwidth it uses a very simple estimator that measures once every .B interval microseconds how much traffic has passed. This again is a EWMA, for which the time constant can be specified, also in microseconds. The .B time constant corresponds to the sluggishness of the measurement or, conversely, to the sensitivity of the average to short bursts. Higher values mean less sensitivity. .SH SOURCES .TP o Sally Floyd and Van Jacobson, "Link-sharing and Resource Management Models for Packet Networks", IEEE/ACM Transactions on Networking, Vol.3, No.4, 1995 .TP o Sally Floyd, "Notes on CBQ and Guaranted Service", 1995 .TP o Sally Floyd, "Notes on Class-Based Queueing: Setting Parameters", 1996 .TP o Sally Floyd and Michael Speer, "Experimental Results for Class-Based Queueing", 1998, not published. .SH SEE ALSO .BR tc (8) .SH AUTHOR Alexey N. Kuznetsov, . This manpage maintained by bert hubert --yrj/dFKFPuw6o+aM-- From owner-netdev@oss.sgi.com Sun Dec 9 18:02:22 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBA22M031237 for netdev-outgoing; Sun, 9 Dec 2001 18:02:22 -0800 Received: from shell.cyberus.ca (shell.cyberus.ca [216.191.240.114]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBA22Jo31234 for ; Sun, 9 Dec 2001 18:02:19 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id TAA07016; Sun, 9 Dec 2001 19:58:40 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 9 Dec 2001 19:58:40 -0500 (EST) From: jamal To: bert hubert cc: , , Subject: Re: CBQ and all other qdiscs now REALLY completely documented In-Reply-To: <20011209231340.A23420@outpost.ds9a.nl> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 621 Lines: 24 On Sun, 9 Dec 2001, bert hubert wrote: > On Sun, Dec 09, 2001 at 05:07:03PM -0500, jamal wrote: > > > > So priority limits the size of skb->priority to be from 0..6; this wont > > > > work with that check in cbq. > > > > > > No, only IP_TOS does so. > > > > probaly ip precedence. Have you tried this or you are following what the > > man pages say? > > I have been living in the source for quite a while now - see ip_setsockopt() > in net/ipv4/ip_sockglue.c. > Thats the wrong place to look. Look instead at: net/core/sock.c I got it; non root is limited to 0..6; root can set the full 32 bit range. cheers, jamal From owner-netdev@oss.sgi.com Sun Dec 9 18:08:25 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBA28P131387 for netdev-outgoing; Sun, 9 Dec 2001 18:08:25 -0800 Received: from shell.cyberus.ca (shell.cyberus.ca [216.191.240.114]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBA28Ko31384 for ; Sun, 9 Dec 2001 18:08:20 -0800 Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.3/666/Cyberus Online Inc.) with ESMTP id UAA07024; Sun, 9 Dec 2001 20:04:42 -0500 (EST) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 9 Dec 2001 20:04:42 -0500 (EST) From: jamal To: bert hubert cc: , , , Subject: Re: CBQ MANPAGE: I hear the theme of '2001, A Space Odyssey' In-Reply-To: <20011210014130.A27193@outpost.ds9a.nl> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1719 Lines: 51 Sorry didnt read it; did the 30 sec scan .. If this is meant to be for users, why are you talking about skb->priority? Isnt it sufficient to just call it prioirity? Also, if you think that Alexeys imp. is based on Floyd only, you are highly mistaken; Going back to high latency response mode ... cheers, jamal On Mon, 10 Dec 2001, bert hubert wrote: > ... to the sound of 'Also sprach Zarathustra': > > After weeks of social deprivation and much digging through heaps of code, I > bring you > > tc-cbq.8 > > The CBQ manpage. Nearly 2500 words, 8 printed pages, of nearly > unintelligible gobledygook, explaining mostly how CBQ works. > > It is part of the Linux Advanced Routing & Traffic Control documentation > project which contains a HOWTO, a mailinglist, an IRC channel and now > manpages: > > http://ds9a.nl/lartc > > I want to thank Jamal for stubbornly straightening me out when I use messy > language and explaining how things work. The errors are mine though. > > I *implore* ANK and others to read through this. I'm about exhausted and > running out of time (need to get on with work), and have a hard time > figuring out the exact details of the CBQ link sharing algorithm. I need > help, so to speak. The manpage indicates where. > > Thanks for your attention. Please find tc-cbq.8 attached. > > Regards, > > bert hubert > > > -- > http://www.PowerDNS.com Versatile DNS Software & Services > Trilab The Technology People > Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - > 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet > From owner-netdev@oss.sgi.com Sun Dec 9 18:12:07 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBA2C7f31510 for netdev-outgoing; Sun, 9 Dec 2001 18:12:07 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBA2C4o31505 for ; Sun, 9 Dec 2001 18:12:04 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id 01B99C6C7C; Mon, 10 Dec 2001 02:12:00 +0100 (CET) Date: Mon, 10 Dec 2001 02:12:00 +0100 From: bert hubert To: jamal Cc: kuznet@ms2.inr.ac.ru, lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: CBQ MANPAGE: I hear the theme of '2001, A Space Odyssey' Message-ID: <20011210021200.A27995@outpost.ds9a.nl> Mail-Followup-To: bert hubert , jamal , kuznet@ms2.inr.ac.ru, lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, netdev@oss.sgi.com References: <20011210014130.A27193@outpost.ds9a.nl> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from hadi@cyberus.ca on Sun, Dec 09, 2001 at 08:04:42PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 968 Lines: 29 On Sun, Dec 09, 2001 at 08:04:42PM -0500, jamal wrote: > Sorry didnt read it; did the 30 sec scan .. > If this is meant to be for users, why are you talking about skb->priority? > Isnt it sufficient to just call it prioirity? It's not done yet and may need some readability tuning. Note however that skb->priority is a bit overloaded. It can contain a priority, but also a 32bit encoded classid. These are different things, so they deserve different mention. > Also, if you think that Alexeys imp. is based on Floyd only, you are > highly mistaken; I just copied the attribution from the kernel, am glad to rectify things. > Going back to high latency response mode ... Thanks for reviewing. Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services Trilab The Technology People Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet From owner-netdev@oss.sgi.com Mon Dec 10 10:04:42 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBAI4gL10533 for netdev-outgoing; Mon, 10 Dec 2001 10:04:42 -0800 Received: from ms2.inr.ac.ru (minus.inr.ac.ru [193.233.7.97]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBAI4do10530 for ; Mon, 10 Dec 2001 10:04:40 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id UAA16642; Mon, 10 Dec 2001 20:04:16 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200112101704.UAA16642@ms2.inr.ac.ru> Subject: Re: CBQ and all other qdiscs now REALLY completely documented To: hadi@cyberus.ca (jamal) Date: Mon, 10 Dec 2001 20:04:16 +0300 (MSK) Cc: ahu@ds9a.nl, lartc@mailman.ds9a.nl, linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: from "jamal" at Dec 9, 1 04:45:01 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 316 Lines: 10 Hello! > So priority limits the size of skb->priority to be from 0..6; this wont > work with that check in cbq. No, it does not. Values different of "low prio" defaults (0..6) are not allowed to user without privileges by evident reasons. User with correspoding capability may direct traffic to any class. Alexey From owner-netdev@oss.sgi.com Mon Dec 10 11:18:29 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBAJITX12347 for netdev-outgoing; Mon, 10 Dec 2001 11:18:29 -0800 Received: from noxmail.sandelman.ottawa.on.ca (nox.sandelman.ottawa.on.ca [192.139.46.6]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBAJINo12341 for ; Mon, 10 Dec 2001 11:18:23 -0800 Received: from marajade.sandelman.ottawa.on.ca (227-198-131-12.bellhead.com [12.131.198.227]) by noxmail.sandelman.ottawa.on.ca (8.11.6/8.11.6) with ESMTP id fBAII6501580 (using TLSv1/SSLv3 with cipher EDH-RSA-DES-CBC3-SHA (168 bits) verified OK); Mon, 10 Dec 2001 13:18:09 -0500 (EST) Received: from marajade.sandelman.ottawa.on.ca (localhost [[UNIX: localhost]]) by marajade.sandelman.ottawa.on.ca (8.11.6/8.11.0) with ESMTP id fBAI0eC01237; Mon, 10 Dec 2001 11:00:44 -0700 (MST) Message-Id: <200112101800.fBAI0eC01237@marajade.sandelman.ottawa.on.ca> To: netdev@oss.sgi.com, design@lists.freeswan.org Subject: Re: [Design] Re: what pointers does pskb_may_pull() nuke? In-reply-to: Your message of "Sat, 08 Dec 2001 01:19:53 GMT." Mime-Version: 1.0 (generated by tm-edit 7.108) Content-Type: text/plain; charset=US-ASCII Date: Mon, 10 Dec 2001 11:00:40 -0700 From: Michael Richardson Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1578 Lines: 36 -----BEGIN PGP SIGNED MESSAGE----- >>>>> "Julian" == Julian Anastasov writes: >> I think you are missing what I am saying. >> Yes, it was called in the local delivery code, but is no longer. >> Our experience is that some aspect of compiling netfilter into a kernel >> (not necessarily with conntrack on) causes ip_defrag() to get called before >> local delivery. Julian> I don't see where is the problem. There is always someone that Julian> calls ip_defrag before the protocols, the last one is Julian> ip_input.c:ip_local_deliver() if it is still not called. So, remains Julian> the issue with the linearization and the checksums. It used to be that someone always did linearization of skbuff. This is the problem - this changed unilaterally. We would, of course, be happy to integrate our code into the kernel, once the kernel source are moved to a free country. ] ON HUMILITY: to err is human. To moo, bovine. | firewalls [ ] Michael Richardson, Sandelman Software Works, Ottawa, ON |net architect[ ] mcr@sandelman.ottawa.on.ca http://www.sandelman.ottawa.on.ca/ |device driver[ ] panic("Just another NetBSD/notebook using, kernel hacking, security guy"); [ -----BEGIN PGP SIGNATURE----- Version: 2.6.3ia Charset: latin1 Comment: Finger me for keys iQCVAwUBPBT4RoqHRg3pndX9AQF5kgP7Bq8rIu0lp1l5zm63HYspGCuzizBy+Dof pUyRTsLvBoYqFpbxo5FNntFv+Ku5SsH/5kurDcWqtMZzxp+phIgupbJuP7CGUiQc FjlzWzCK+jItkeswNMwBtWX2EWRZVNqolVMmHoNdNEL/soL56FehXTshj7NvBulX np9vzYMmOFE= =OfoA -----END PGP SIGNATURE----- From owner-netdev@oss.sgi.com Mon Dec 10 11:39:15 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBAJdFR13007 for netdev-outgoing; Mon, 10 Dec 2001 11:39:15 -0800 Received: from enterprise.atl.lmco.com (mail.atl.external.lmco.com [192.35.37.50]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBAJdEo13004 for ; Mon, 10 Dec 2001 11:39:14 -0800 Received: from misty.atl.lmco.com (misty [166.17.242.243]) by enterprise.atl.lmco.com (Postfix) with ESMTP id 71DA3C1CED for ; Mon, 10 Dec 2001 13:38:41 -0500 (EST) Received: (from cwinters@localhost) by misty.atl.lmco.com (8.11.2/8.11.2) id fBAIce701550 for netdev@oss.sgi.com; Mon, 10 Dec 2001 13:38:40 -0500 Date: Mon, 10 Dec 2001 13:38:40 -0500 From: Chuck Winters To: netdev@oss.sgi.com Subject: kfree_skb Message-ID: <20011210133840.A1534@atl.lmco.com> Mail-Followup-To: netdev@oss.sgi.com Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 69 Lines: 3 Can bad things happen in 2.2 if you kfree_skb more than once? Chuck From owner-netdev@oss.sgi.com Mon Dec 10 12:07:04 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBAK74P14021 for netdev-outgoing; Mon, 10 Dec 2001 12:07:04 -0800 Received: from grok.yi.org (IDENT:FDAwVZiJbFN7+0WRQ6ktCXeNX2c0E66V@cx97923-a.phnx3.az.home.com [24.1.197.194]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBAK71o14016 for ; Mon, 10 Dec 2001 12:07:01 -0800 Received: from candelatech.com (IDENT:ptJH96mfxg4VL19Pei47F3agrBdMYmeR@localhost.localdomain [127.0.0.1]) by grok.yi.org (8.11.6/8.11.2) with ESMTP id fBAJ6L226748; Mon, 10 Dec 2001 12:06:31 -0700 Message-ID: <3C1507AC.70505@candelatech.com> Date: Mon, 10 Dec 2001 12:06:20 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20011019 Netscape6/6.2 X-Accept-Language: en-us MIME-Version: 1.0 To: linux-net , "netdev@oss.sgi.com" , Andi Kleen Subject: How to set the free-buffer-space for which select set a socket writable. Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 710 Lines: 21 I'm looking for a way to make select wait for a certain amount of free-space in the write queue for UDP (and hopefully, TCP) socket. For instance, I want to send a 32k UDP packet, and I don't want select to tell me the socket is writable when there is only 4k of space available (because my resulting sendto of the 32k pkt will fail in this case). Is there a way to do this today? If not, any interest in accepting a patch that would give this functionality if I write one up? Thanks, Ben -- Ben Greear President of Candela Technologies Inc http://www.candelatech.com ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear From owner-netdev@oss.sgi.com Mon Dec 10 12:42:49 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBAKgnV15082 for netdev-outgoing; Mon, 10 Dec 2001 12:42:49 -0800 Received: from protactinium.btinternet.com (protactinium.btinternet.com [194.73.73.176]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBAKgho15078 for ; Mon, 10 Dec 2001 12:42:43 -0800 Received: from host213-122-123-186.btinternet.com ([213.122.123.186] helo=stev.org) by protactinium.btinternet.com with esmtp (Exim 3.22 #8) id 16DWJe-00012k-00; Mon, 10 Dec 2001 19:42:19 +0000 Received: from god (god.stev.org [192.168.1.8]) by stev.org (8.11.6/8.11.6) with SMTP id fBAJdp122770; Mon, 10 Dec 2001 19:39:51 GMT Message-ID: <009b01c181b2$a1e696d0$0801a8c0@Stev.org> Reply-To: "James Stevenson" From: "James Stevenson" To: "Ben Greear" , "linux-net" , , "Andi Kleen" References: <3C1507AC.70505@candelatech.com> Subject: Re: How to set the free-buffer-space for which select set a socket writable. Date: Mon, 10 Dec 2001 19:41:15 -0000 X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.50.4522.1200 x-mimeole: Produced By Microsoft MimeOLE V5.50.4522.1200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1440 Lines: 46 Hi you can set the socket buffer size and also you can find out how much data is in the send / recv Q for the socket via an ioctl (cannot remember which 1) but i have seen / used it before James ----- Original Message ----- From: "Ben Greear" To: "linux-net" ; ; "Andi Kleen" Sent: Monday, December 10, 2001 7:06 PM Subject: How to set the free-buffer-space for which select set a socket writable. > I'm looking for a way to make select wait for a certain amount of > free-space in the write queue for UDP (and hopefully, TCP) socket. > > For instance, I want to send a 32k UDP packet, and I don't want > select to tell me the socket is writable when there is only > 4k of space available (because my resulting sendto of the 32k > pkt will fail in this case). > > Is there a way to do this today? If not, any interest in > accepting a patch that would give this functionality if > I write one up? > > Thanks, > Ben > > -- > Ben Greear > President of Candela Technologies Inc http://www.candelatech.com > ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear > > > - > To unsubscribe from this list: send the line "unsubscribe linux-net" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html From owner-netdev@oss.sgi.com Mon Dec 10 16:01:28 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBB01S622004 for netdev-outgoing; Mon, 10 Dec 2001 16:01:28 -0800 Received: from alph (ALyon-202-1-2-226.abo.wanadoo.fr [217.128.85.226]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBB01Ao21927 for ; Mon, 10 Dec 2001 16:01:10 -0800 Received: from alph (alph [127.0.0.1]) by alph (Postfix) with ESMTP id A6911166D4 for ; Tue, 11 Dec 2001 00:01:02 +0100 (CET) Subject: PACKET_MR_PROMISC doesn't set IFF_PROMISC From: Yoann Vandoorselaere To: netdev@oss.sgi.com Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="=-fIEduNu4oY/MVXmfsh6K" X-Mailer: Evolution/1.0 (Preview Release) Date: 11 Dec 2001 00:01:02 +0100 Message-Id: <1008025262.4584.14.camel@alph> Mime-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 2191 Lines: 71 --=-fIEduNu4oY/MVXmfsh6K Content-Type: multipart/mixed; boundary="=-1V6uzKNHgPdD6bpj7b+y" --=-1V6uzKNHgPdD6bpj7b+y Content-Type: text/plain Content-Transfer-Encoding: quoted-printable Hi, I've read a little about the issue at : http://groups.google.com/groups?hl=3Den&threadm=3Dlinux.kernel.Pine.LNX.4.3= 1.0101240002380.29105-100000%40netcore.fi&rnum=3D4&prev=3D/groups%3Fq%3Dgfl= ags%2Blinux%2Bnet%26hl%3Den Apparently, some people think that it is an application problem, and that the application should be fixed. However, having two way of putting the interface in promiscuous mode (and one which is not reported) look like a security bug to me. IDS host based sensor might be monitoring the machine in order to alert if the machine goes into promiscuous mode. This mean that anyone might volontarily use PACKET_MR_PROMISC in order to bypass the sensor...=20 The attached patch should fix the problem, but I don't believe it's the right way to fix it... Maybe the use of dev->gflags should be corrected ? or am I missing something ? Ps: please CC me as I'm not subscribed to the list --=20 Yoann Vandoorselaere http://www.prelude-ids.org --=-1V6uzKNHgPdD6bpj7b+y Content-Disposition: attachment; filename=promisc-set.patch Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable --- net/core/dev.c.orig Thu Dec 6 12:53:21 2001 +++ net/core/dev.c Thu Dec 6 12:54:22 2001 @@ -2082,7 +2082,7 @@ static int dev_ifsioc(struct ifreq *ifr, switch(cmd)=20 { case SIOCGIFFLAGS: /* Get interface flags */ - ifr->ifr_flags =3D (dev->flags&~(IFF_PROMISC|IFF_ALLMULTI|IFF_RUNNING)) + ifr->ifr_flags =3D (dev->flags&~(IFF_ALLMULTI|IFF_RUNNING)) |(dev->gflags&(IFF_PROMISC|IFF_ALLMULTI)); if (netif_running(dev) && netif_carrier_ok(dev)) ifr->ifr_flags |=3D IFF_RUNNING; --=-1V6uzKNHgPdD6bpj7b+y-- --=-fIEduNu4oY/MVXmfsh6K Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQA8FT6u4tfUv0C+vv8RAtYTAKCVbTnQm55N14/LOERn8IfNk15iCQCgwpmI 3Q+luxqGy1R0xGmYLJxzZHA= =uKaI -----END PGP SIGNATURE----- --=-fIEduNu4oY/MVXmfsh6K-- From owner-netdev@oss.sgi.com Mon Dec 10 16:09:02 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBB092f22293 for netdev-outgoing; Mon, 10 Dec 2001 16:09:02 -0800 Received: from netbank.com.br (IDENT:postfix@garrincha.netbank.com.br [200.203.199.88]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBB08xo22290 for ; Mon, 10 Dec 2001 16:09:00 -0800 Received: from brinquedo.distro.conectiva (2-070.ctame701-2.telepar.net.br [200.181.170.70]) by netbank.com.br (Postfix) with ESMTP id 5327646814; Mon, 10 Dec 2001 21:07:10 -0200 (BRDT) Received: by brinquedo.distro.conectiva (Postfix, from userid 501) id 05D8BC465; Mon, 10 Dec 2001 21:08:42 -0200 (BRST) Date: Mon, 10 Dec 2001 21:08:42 -0200 From: Arnaldo Carvalho de Melo To: Chuck Winters Cc: netdev@oss.sgi.com Subject: Re: kfree_skb Message-ID: <20011210210842.A896@conectiva.com.br> References: <20011210133840.A1534@atl.lmco.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20011210133840.A1534@atl.lmco.com> User-Agent: Mutt/1.3.23i X-Url: http://advogato.org/person/acme Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 342 Lines: 8 Em Mon, Dec 10, 2001 at 01:38:40PM -0500, Chuck Winters escreveu: > Can bad things happen in 2.2 if you kfree_skb more than once? not if it matches the number of skb_get on it (or other operations that bump the skb refcount, dunno if it exists in 2.2, I'm looking at 2.4, that bumps the refcount at least in skb_recv_datagram) 8) - Arnaldo From owner-netdev@oss.sgi.com Mon Dec 10 18:08:32 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBB28Wf27572 for netdev-outgoing; Mon, 10 Dec 2001 18:08:32 -0800 Received: from netbank.com.br (IDENT:postfix@garrincha.netbank.com.br [200.203.199.88]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBB28No27569 for ; Mon, 10 Dec 2001 18:08:24 -0800 Received: from brinquedo.distro.conectiva (2-070.ctame701-2.telepar.net.br [200.181.170.70]) by netbank.com.br (Postfix) with ESMTP id 82C0346828; Mon, 10 Dec 2001 23:06:37 -0200 (BRDT) Received: by brinquedo.distro.conectiva (Postfix, from userid 501) id 0AD7AC465; Mon, 10 Dec 2001 23:08:10 -0200 (BRST) Date: Mon, 10 Dec 2001 23:08:10 -0200 From: Arnaldo Carvalho de Melo To: "David S.Miller" , SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, Linus Torvalds , Marcelo Tosatti Cc: netdev@oss.sgi.com Subject: [RFC] cleaning up struct sock Message-ID: <20011210230810.C896@conectiva.com.br> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.23i X-Url: http://advogato.org/person/acme Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 2332 Lines: 64 Hi, This patch cleans up include/net/sock.h, starting to move protocol specific stuff out of this header, with it struct sock protinfo member turns into this: union { void *destruct_hook; struct unix_opt af_unix; #if defined(CONFIG_INET) || defined (CONFIG_INET_MODULE) struct inet_opt af_inet; #endif } protinfo; Where it previously had one entry per protocol, surrounded by #ifdef. The approach I used was to overload destruct_hook so that it serves the same purpose of struct inode u.generic_ip field. Please note that most of the protocols in protinfo (without this patch) already uses pointers to private structs allocated at proto_create time and destroyed after the sk->refcnt hits zero, with kfree or with a special function registered in sk->destruct, which most of the protocols touched by this patch were already using. The original idea, discussed privately with some of the net stack maintainers was to have a big fat union as struct inode now has, but after reading the discussion about struct inode where it was said that the filesystems would migrate to use u.generic_ip I changed my mind and decided to hack this first proposal to see what people think. I've been using this approach with the LLC and NetBEUI stacks that I'm working on, no problems, as expected 8) If this is accepted I'll do the same for sk->tp_pinfo and remove the SPX bits from sock.h, making sk->destruct look at sk->tp_pinfo.generic and freeing it if non null, just like it works now for sk->protinfo.destruct_hook. IMHO, if this is accepted we should make sk->protinfo and sk->tp_pinfo be just void * pointers 8) The patch is against 2.4.16 but I'll of course submit it for 2.5 if the patch looks ok for the net maintainers. Please take a look and lemme know what you think. Ah, while working on this I noticed that netrom oopses on rmmod, not sure if this is a bug introduced by this patch, will certainly check. David, if you don't like it, I'll happily switch to the big fat union idea, but I think that this is more clean and will avoid us having to patch sock.h every time a new net stack is added to the kernel. Patch available at: dir: http://www.kernel.org/pub/linux/kernel/people/acme/v2.4/2.4.16/ file: sock_cleanup.patch.bz2 Best Regards, - Arnaldo From owner-netdev@oss.sgi.com Mon Dec 10 18:55:46 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBB2tkx29408 for netdev-outgoing; Mon, 10 Dec 2001 18:55:46 -0800 Received: from motgate2.mot.com (motgate2.mot.com [136.182.1.10]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBB2tdo29387 for ; Mon, 10 Dec 2001 18:55:39 -0800 Received: [from pobox.mot.com (pobox.mot.com [129.188.137.100]) by motgate2.mot.com (motgate2 2.1) with ESMTP id SAA11182 for ; Mon, 10 Dec 2001 18:55:37 -0700 (MST)] Received: [from il02exb01.comm.mot.com (il02exb01.comm.mot.com [145.1.204.17]) by pobox.mot.com (MOT-pobox 2.0) with ESMTP id SAA05911 for ; Mon, 10 Dec 2001 18:55:36 -0700 (MST)] Received: by il02exb01.comm.mot.com with Internet Mail Service (5.5.2654.52) id ; Mon, 10 Dec 2001 19:55:36 -0600 Message-ID: From: Patel Hemang-QA4383 To: "'netdev@oss.sgi.com'" , "'linux-ipv6@inner.net'" Cc: Patel Hemang-QA4383 Subject: IPv6 over slip Date: Mon, 10 Dec 2001 19:55:35 -0600 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2654.52) Content-Type: text/plain; charset="iso-8859-1" Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1493 Lines: 21 All, We are trying to run the MIPL and IPv6 stack with the SLIP interface(sl0) in LINUX. The router is sending out Router Advertisements, but the IPv6 stack does not seem to pick up the router advertisements. When we do a tcpdump -i sl0 ip6 -R, it prints out the hex dump of the router advertisement, and that is in line with the standard. But somehow the router advertisement is not picked up by the IPv6 stack and thus the MIPL code, also Ethereal shows an "unknown ip packet" type. What changes do you have to make to the IPv6 and the MIPL code to get it to listen to the router advertisements and communicate over the sl0 interface? I am able to get the sl0 interface to work, and I am able to ping(IPv4) from it. But I don't see any BU or RS being sent out through the sl0 interface, although the kernel log indicates that those messages are being sent. Also what is the main ".c" file, where in I can put the debug statements to see that the IPv6 stack is receiving the router advertisements. We need to know where the packets get dropped between the slip interface and the IPv6 stack and why? Thanks, Hemang Patel 847-435-9872 ------------------------------------------------------------------------ Hemang Patel Adv. Tech. & Strategy-CGISS, Motorola. Ph: (847) 435-9872 Pg: 877-734-6069 Fax: (847) 576-9018 Email: hemang.patel@motorola.com Never say die ------------------------------------------------------------------------ From owner-netdev@oss.sgi.com Mon Dec 10 22:54:39 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBB6sdX05724 for netdev-outgoing; Mon, 10 Dec 2001 22:54:39 -0800 Received: from smtp011.mail.yahoo.com (smtp011.mail.yahoo.com [216.136.173.31]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBB6sbo05721 for ; Mon, 10 Dec 2001 22:54:37 -0800 Received: from ptil-10-145-ban.primus-india.net (HELO iwave014) (203.196.145.10) by smtp.mail.vip.sc5.yahoo.com with SMTP; 11 Dec 2001 05:54:34 -0000 Message-ID: <003801c18208$8f5d2500$8302a8c0@iwave014> From: "Abdul Khaliq" To: Subject: Date: Tue, 11 Dec 2001 11:26:12 +0530 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.50.4133.2400 X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 308 Lines: 12 Hi I am an Enginnering student at Banaglore University Bangalore, i am a beginner, where can i find implementation documentation on Tcp/Ipv6, Ipv6 Routing. Thanks & Regards _________________________________________________________ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com From owner-netdev@oss.sgi.com Tue Dec 11 00:18:59 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBB8IxY07109 for netdev-outgoing; Tue, 11 Dec 2001 00:18:59 -0800 Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBB8Iso07106 for ; Tue, 11 Dec 2001 00:18:54 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id XAA22217; Mon, 10 Dec 2001 23:18:28 -0800 Date: Mon, 10 Dec 2001 23:18:26 -0800 (PST) Message-Id: <20011210.231826.55509210.davem@redhat.com> To: acme@conectiva.com.br Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com Subject: Re: [RFC] cleaning up struct sock From: "David S. Miller" In-Reply-To: <20011210230810.C896@conectiva.com.br> References: <20011210230810.C896@conectiva.com.br> X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1645 Lines: 40 From: Arnaldo Carvalho de Melo Date: Mon, 10 Dec 2001 23:08:10 -0200 David, if you don't like it, I'll happily switch to the big fat union idea, but I think that this is more clean and will avoid us having to patch sock.h every time a new net stack is added to the kernel. I'm a little concerned about having to allocate two objects instead of one. These things aren't like inodes. Inodes are cached and lookup read-multiple objects, whereas sockets are throw-away and recycled objects. Inode allocation performance therefore isn't that critical, but socket allocation performance is. Then we go back to the old problem of protocols that can be used to embed IP and thus having to keep track of parallel state for multiple protocols. I think your changes did not compromise that for what we currently support though. I still need to think about this some more.... You know, actually, the protocols are the ones which call sk_alloc(). So we could just extend sk_alloc() to take a kmem_cache_t argument. TCP could thus make a kmem_cache_t which is sizeof(struct sock) + sizeof(struct tcp_opt) and then set the TP_INFO pointer to "(sk + 1)". Oh yes, another overhead is all the extra dereferencing. To fight that we could make a macro that knows the above layout: #define TCP_PINFO(SK) ((struct tcp_opt *)((SK) + 1)) So I guess we could do things your way without any of the potential performance problems. It is going to be a while before I can apply something like this, I would like to help Jens+Linus get the new block stuff in shape first. This would obviously be a 2.5.x change too. From owner-netdev@oss.sgi.com Tue Dec 11 01:12:26 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBB9CQw08841 for netdev-outgoing; Tue, 11 Dec 2001 01:12:26 -0800 Received: from titan.bieringer.de (mail.bieringer.de [195.226.187.51]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBB9CNo08838 for ; Tue, 11 Dec 2001 01:12:23 -0800 Received: (qmail 24442 invoked from network); 11 Dec 2001 08:12:15 -0000 Received: from pd9e4ede3.dip.t-dialin.net (HELO worker.muc.bieringer.de) (217.228.237.227) by mail.bieringer.de with SMTP; 11 Dec 2001 08:12:15 -0000 Date: Tue, 11 Dec 2001 09:14:33 +0100 From: Peter Bieringer To: Patel Hemang-QA4383 , "'netdev@oss.sgi.com'" , "'linux-ipv6@inner.net'" Subject: Re: IPv6 over slip Message-ID: <14110000.1008058473@localhost> In-Reply-To: References: X-Mailer: Mulberry/2.1.1 (Linux/x86) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Content-Disposition: inline Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 651 Lines: 20 --On Monday, December 10, 2001 07:55:35 PM -0600 Patel Hemang-QA4383 wrote: > We are trying to run the MIPL and IPv6 stack with the SLIP > interface(sl0) in LINUX. Are you sure that SLIP is able to transport IPv6 packets over the link? Is there any "layer2 transport protocol field" in SLIP in use? Perhaps not, and then SLIP has the same "never be IPv6 capable" feature like encapsulation "rawip" on ISDN links. At "rawip", no "layer 2 transport protocol field" is used, and both endpoints assume that all transported data is IPv4. If this is the case in SLIP, too, we should rename it to SLIPv4 ;-) Peter From owner-netdev@oss.sgi.com Tue Dec 11 04:16:13 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBCGDe16979 for netdev-outgoing; Tue, 11 Dec 2001 04:16:13 -0800 Received: from gw.chygwyn.com (IDENT:root@gw.chygwyn.com [62.172.158.50]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBCG5o16975 for ; Tue, 11 Dec 2001 04:16:05 -0800 Received: (from steve@localhost) by gw.chygwyn.com (8.9.3/8.9.3) id LAA29992; Tue, 11 Dec 2001 11:14:39 GMT From: Steven Whitehouse Message-Id: <200112111114.LAA29992@gw.chygwyn.com> Subject: Re: [RFC] cleaning up struct sock To: davem@redhat.com (David S. Miller) Date: Tue, 11 Dec 2001 11:14:39 +0000 (GMT) Cc: acme@conectiva.com.br, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com In-Reply-To: <20011210.231826.55509210.davem@redhat.com> from "David S. Miller" at Dec 10, 2001 11:18:26 PM Organization: ChyGywn Limited X-RegisteredOffice: 7, New Yatt Road, Witney, Oxfordshire. OX28 1NU England X-RegisteredNumber: 03887683 Reply-To: Steve Whitehouse X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 2709 Lines: 68 Hi, > > These things aren't like inodes. Inodes are cached and lookup > read-multiple objects, whereas sockets are throw-away and recycled > objects. Inode allocation performance therefore isn't that critical, > but socket allocation performance is. > We do have to allocate an inode as well though (at least in the normal case) and with the new (if I've understood the conclusions of the discussions on inode allocation) scheme this would give us a single allocation which would be sizeof(struct inode) + sizeof(struct socket) in a slab cache private to sockfs. [snip] > > You know, actually, the protocols are the ones which call sk_alloc(). > So we could just extend sk_alloc() to take a kmem_cache_t argument. > TCP could thus make a kmem_cache_t which is sizeof(struct sock) > + sizeof(struct tcp_opt) and then set the TP_INFO pointer to "(sk + > 1)". > > Oh yes, another overhead is all the extra dereferencing. To fight > that we could make a macro that knows the above layout: > > #define TCP_PINFO(SK) ((struct tcp_opt *)((SK) + 1)) > > So I guess we could do things your way without any of the potential > performance problems. > This sounds like a good plan, but let me just throw in some wild ideas based on my earlier comments about about the struct inode... I wonder if we could allocate a combined object along the lines of the following: a) struct inode b) struct socket c) struct sock (minus per protocol areas) d) struct my_protocol (whatever the socket actually is) e) anything else required all in one go. It seems the reasons for not doing that are: 1. Parts (a) and (b) are likely to have a different lifetime to (c) to (e) but with the broken-out inode union, I wonder if thats actually likely to give us larger overhead or not ? 2. Need a way to prevent inodes disappearing when the last close from user space occurs (while socket shutdown occurs). Perhaps we can simply increment the reference count in the release routine ? I need to look into that to be sure. 3. Need to have some way of setting different sizes of structure for different protocols, so this looks like either having sockfs broken into one fs per protocol or adding some method of choosing different allocators. I'm not too sure that either of those solutions is "good". I'm not sure that any of these is a big hurdle to overcome, but I have no real feel for whether that would give us an advantage in both memory usage efficiency and speed without actually doing it and comparing results. It does seem that if we are going down this road, then we should consider taking it to its ultimate end of one allocation per socket creation though, Steve. From owner-netdev@oss.sgi.com Tue Dec 11 04:48:28 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBCmSJ17839 for netdev-outgoing; Tue, 11 Dec 2001 04:48:28 -0800 Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBCmOo17836 for ; Tue, 11 Dec 2001 04:48:24 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id DAA01807; Tue, 11 Dec 2001 03:47:56 -0800 Date: Tue, 11 Dec 2001 03:47:55 -0800 (PST) Message-Id: <20011211.034755.74751257.davem@redhat.com> To: Steve@ChyGwyn.com, steve@gw.chygwyn.com Cc: acme@conectiva.com.br, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com Subject: Re: [RFC] cleaning up struct sock From: "David S. Miller" In-Reply-To: <200112111114.LAA29992@gw.chygwyn.com> References: <20011210.231826.55509210.davem@redhat.com> <200112111114.LAA29992@gw.chygwyn.com> X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1244 Lines: 33 From: Steven Whitehouse Date: Tue, 11 Dec 2001 11:14:39 +0000 (GMT) > These things aren't like inodes. Inodes are cached and lookup > read-multiple objects, whereas sockets are throw-away and recycled > objects. Inode allocation performance therefore isn't that critical, > but socket allocation performance is. > We do have to allocate an inode as well though (at least in the normal case) and with the new (if I've understood the conclusions of the discussions on inode allocation) scheme this would give us a single allocation which would be sizeof(struct inode) + sizeof(struct socket) in a slab cache private to sockfs. Indeed, you are right. But I don't know about the inode+socket idea. I wonder if we could allocate a combined object along the lines of the following: a) struct inode b) struct socket c) struct sock (minus per protocol areas) d) struct my_protocol (whatever the socket actually is) e) anything else required all in one go. The reference counting would be a pain in the neck, I think. If we can somehow "genericize" the "inode + extra stuff" into some abstraction, sure maybe. I don't know, it could work. From owner-netdev@oss.sgi.com Tue Dec 11 04:52:32 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBCqWv17999 for netdev-outgoing; Tue, 11 Dec 2001 04:52:32 -0800 Received: from netbank.com.br (IDENT:postfix@garrincha.netbank.com.br [200.203.199.88]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBCqNo17996 for ; Tue, 11 Dec 2001 04:52:23 -0800 Received: from brinquedo.distro.conectiva (2-070.ctame701-2.telepar.net.br [200.181.170.70]) by netbank.com.br (Postfix) with ESMTP id 4735146818; Tue, 11 Dec 2001 09:50:30 -0200 (BRDT) Received: by brinquedo.distro.conectiva (Postfix, from userid 501) id 95CF1C465; Tue, 11 Dec 2001 09:52:19 -0200 (BRST) Date: Tue, 11 Dec 2001 09:52:19 -0200 From: Arnaldo Carvalho de Melo To: "David S. Miller" Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com Subject: Re: [RFC] cleaning up struct sock Message-ID: <20011211095219.B1630@conectiva.com.br> References: <20011210230810.C896@conectiva.com.br> <20011210.231826.55509210.davem@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20011210.231826.55509210.davem@redhat.com> User-Agent: Mutt/1.3.23i X-Url: http://advogato.org/person/acme Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 3113 Lines: 79 Em Mon, Dec 10, 2001 at 11:18:26PM -0800, David S. Miller escreveu: > From: Arnaldo Carvalho de Melo > Date: Mon, 10 Dec 2001 23:08:10 -0200 > > David, if you don't like it, I'll happily switch to the big fat > union idea, but I think that this is more clean and will avoid us > having to patch sock.h every time a new net stack is added to the > kernel. > > I'm a little concerned about having to allocate two objects instead of > one. If that is the case then we could keep some of the performance critical protocols in the union and leave the other ones, that already were allocating two objects, using the sk->protinfo.generic (aka destruct_hook), but there are other possibilities, like you mention. Ok, the goal of not having anything specific to a protocol in sock.h would not be achieved, but things more cleaner than today. That was one of the reasons for me not to have left af_inet with the #ifdefs in protinfo in this patch (i.e. for performance critical, most of the time enabled anyway protocols, leave it as is in protinfo). But lets see first if we find a clean way to have sock.h clean of protocol specific stuff without harming performance. > These things aren't like inodes. Inodes are cached and lookup > read-multiple objects, whereas sockets are throw-away and recycled > objects. Inode allocation performance therefore isn't that critical, > but socket allocation performance is. > > Then we go back to the old problem of protocols that can be used to > embed IP and thus having to keep track of parallel state for multiple > protocols. I think your changes did not compromise that for what > we currently support though. > > I still need to think about this some more.... > > You know, actually, the protocols are the ones which call sk_alloc(). > So we could just extend sk_alloc() to take a kmem_cache_t argument. > TCP could thus make a kmem_cache_t which is sizeof(struct sock) > + sizeof(struct tcp_opt) and then set the TP_INFO pointer to "(sk + > 1)". yes, that is a nice idea, and if we have it like this: struct sock { . . . void tp_pinfo[0]; } #define TCP_PINFO(SK) ((struct tcp_opt *)(&(SK)->tp_pinfo) > Oh yes, another overhead is all the extra dereferencing. To fight > that we could make a macro that knows the above layout: > > #define TCP_PINFO(SK) ((struct tcp_opt *)((SK) + 1)) > > So I guess we could do things your way without any of the potential > performance problems. yup, the patch I presented was just something quick so that more people could talk think about it and to see if I should spend more time on it. > It is going to be a while before I can apply something like this, I > would like to help Jens+Linus get the new block stuff in shape first. > This would obviously be a 2.5.x change too. yes, of course, but I wanted to tell people that this was something I wanted to have done as soon as possible, so that net stack maintainers don't get too much patch clashes. Well I'll think more about it, about what Steven told and try to come out with something more polished. - Arnaldo From owner-netdev@oss.sgi.com Tue Dec 11 04:55:47 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBCtla18110 for netdev-outgoing; Tue, 11 Dec 2001 04:55:47 -0800 Received: from netbank.com.br (IDENT:postfix@garrincha.netbank.com.br [200.203.199.88]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBCtho18107 for ; Tue, 11 Dec 2001 04:55:43 -0800 Received: from brinquedo.distro.conectiva (2-070.ctame701-2.telepar.net.br [200.181.170.70]) by netbank.com.br (Postfix) with ESMTP id B9F7A46809; Tue, 11 Dec 2001 09:53:51 -0200 (BRDT) Received: by brinquedo.distro.conectiva (Postfix, from userid 501) id D8D39C465; Tue, 11 Dec 2001 09:55:43 -0200 (BRST) Date: Tue, 11 Dec 2001 09:55:43 -0200 From: Arnaldo Carvalho de Melo To: "David S. Miller" Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com Subject: Re: [RFC] cleaning up struct sock Message-ID: <20011211095543.C1630@conectiva.com.br> References: <20011210230810.C896@conectiva.com.br> <20011210.231826.55509210.davem@redhat.com> <20011211095219.B1630@conectiva.com.br> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20011211095219.B1630@conectiva.com.br> User-Agent: Mutt/1.3.23i X-Url: http://advogato.org/person/acme Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 915 Lines: 21 Em Tue, Dec 11, 2001 at 09:52:19AM -0200, Arnaldo Carvalho de Melo escreveu: > If that is the case then we could keep some of the performance critical > protocols in the union and leave the other ones, that already were > allocating two objects, using the sk->protinfo.generic (aka destruct_hook), > but there are other possibilities, like you mention. > > Ok, the goal of not having anything specific to a protocol in sock.h would > not be achieved, but things more cleaner than today. ^ would > That was one of the reasons for me not to have left af_inet with the above I mean: (without the "not") "That was one of the reasons for me to have left af_inet with the" > #ifdefs in protinfo in this patch (i.e. for performance critical, most of > the time enabled anyway protocols, leave it as is in protinfo). sorry... From owner-netdev@oss.sgi.com Tue Dec 11 08:00:22 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBG0M425377 for netdev-outgoing; Tue, 11 Dec 2001 08:00:22 -0800 Received: from megisto-sql1.megisto.com ([63.113.114.132]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBG0Ko25360 for ; Tue, 11 Dec 2001 08:00:20 -0800 Received: by megisto-sql1.megisto.com with Internet Mail Service (5.5.2650.21) id ; Tue, 11 Dec 2001 09:55:58 -0500 Message-ID: From: Mohammad Akram To: "'netdev@oss.sgi.com'" Subject: One IP address configured on two interfaces Date: Tue, 11 Dec 2001 09:55:56 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2650.21) Content-Type: text/plain; charset="iso-8859-1" Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 302 Lines: 14 Hi All, I have a scenario where I need to configure One IP Address on a Physical interface and a loopback interface(while both the i/fs' are up). Is this possible to do on a Linux box or what is required to support such a functionality. I will appreciate your response. Thanks in Advance, Akram. From owner-netdev@oss.sgi.com Tue Dec 11 08:43:27 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBGhRk29571 for netdev-outgoing; Tue, 11 Dec 2001 08:43:27 -0800 Received: from enterprise.atl.lmco.com (mail.atl.external.lmco.com [192.35.37.50]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBGhPo29555 for ; Tue, 11 Dec 2001 08:43:25 -0800 Received: from misty.atl.lmco.com (misty [166.17.242.243]) by enterprise.atl.lmco.com (Postfix) with ESMTP id CCAEAC1CF4 for ; Tue, 11 Dec 2001 10:43:17 -0500 (EST) Received: (from cwinters@localhost) by misty.atl.lmco.com (8.11.2/8.11.2) id fBBFhGo08220 for netdev@oss.sgi.com; Tue, 11 Dec 2001 10:43:16 -0500 Date: Tue, 11 Dec 2001 10:43:16 -0500 From: Chuck Winters To: netdev@oss.sgi.com Subject: Re: kfree_skb Message-ID: <20011211104316.A8195@atl.lmco.com> Mail-Followup-To: netdev@oss.sgi.com References: <20011210133840.A1534@atl.lmco.com> <20011210210842.A896@conectiva.com.br> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <20011210210842.A896@conectiva.com.br>; from acme@conectiva.com.br on Mon, Dec 10, 2001 at 09:08:42PM -0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 440 Lines: 11 On Mon, Dec 10, 2001 at 09:08:42PM -0200, Arnaldo Carvalho de Melo wrote: > Em Mon, Dec 10, 2001 at 01:38:40PM -0500, Chuck Winters escreveu: > > Can bad things happen in 2.2 if you kfree_skb more than once? > > not if it matches the number of skb_get on it (or other operations that > bump the skb refcount, dunno if it exists in 2.2, I'm looking at 2.4, that > bumps the refcount at least in skb_recv_datagram) 8) > > - Arnaldo Thanks From owner-netdev@oss.sgi.com Tue Dec 11 12:20:28 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBKKS517885 for netdev-outgoing; Tue, 11 Dec 2001 12:20:28 -0800 Received: from ms2.inr.ac.ru (minus.inr.ac.ru [193.233.7.97]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBKKIo17882 for ; Tue, 11 Dec 2001 12:20:20 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA03827; Tue, 11 Dec 2001 22:19:51 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200112111919.WAA03827@ms2.inr.ac.ru> Subject: Re: [RFC] cleaning up struct sock To: davem@redhat.COM (David S. Miller) Date: Tue, 11 Dec 2001 22:19:51 +0300 (MSK) Cc: netdev@oss.sgi.com, acme@conectiva.COM.BR, steve@gw.CHygwyn.COM In-Reply-To: <20011211.034755.74751257.davem@redhat.com> from "David S. Miller" at Dec 11, 1 05:49:15 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1020 Lines: 29 Hello! > a) struct inode > b) struct socket (A bit out of topic) The last is just absent now, by the way. The information inlined into private part of inode is wrong anyway and causes lots of troubles, starting from problems with async signaling and finishing with identd returning wrong answers about socket owners. If the plan of VFS is getting rid of private part, we just should kill struct socket. If no such plans exist, we could return to the question of ability to get/put inodes on softirqs and move some junk from sock to inode. > c) struct sock (minus per protocol areas) > d) struct my_protocol (whatever the socket actually is) ... which is exactly current situation. Listen, the situation is very clear really. Reality is TCP/IP. The only thing which could justify getting rid of padding each socket to tcp size would be a protocol, which has status different of marginal ans requires massive allocations like tcp. I do not see any signs of this on horizont. Do you see? Alexey From owner-netdev@oss.sgi.com Tue Dec 11 12:53:39 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBKrdN18580 for netdev-outgoing; Tue, 11 Dec 2001 12:53:39 -0800 Received: from u.domain.uli (ja.mac.ssi.bg [212.95.166.194]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBKrYo18577 for ; Tue, 11 Dec 2001 12:53:34 -0800 Received: from localhost (IDENT:ja@localhost [127.0.0.1]) by u.domain.uli (8.11.0/8.11.0) with ESMTP id fBBLsnK00981; Tue, 11 Dec 2001 21:54:49 GMT Date: Tue, 11 Dec 2001 21:54:49 +0000 (GMT) From: Julian Anastasov X-X-Sender: To: Mohammad Akram cc: "'netdev@oss.sgi.com'" Subject: Re: One IP address configured on two interfaces In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 545 Lines: 31 Hello, On Tue, 11 Dec 2001, Mohammad Akram wrote: > Hi All, > > I have a scenario where I need to configure One IP Address on a Physical > interface > and a loopback interface(while both the i/fs' are up). Such requirement is suspicious > Is this possible to do on a Linux box or what is required to support such a > functionality. It should work: ip addr add 192.168.0.1/24 brd + dev eth0 ip addr add 192.168.0.1/32 dev lo > I will appreciate your response. > > Thanks in Advance, > Akram. Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Tue Dec 11 13:52:17 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBLqHw21084 for netdev-outgoing; Tue, 11 Dec 2001 13:52:17 -0800 Received: from localhost.localdomain (adsl-64-109-170-29.dsl.chcgil.ameritech.net [64.109.170.29]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBLppo21071 for ; Tue, 11 Dec 2001 13:51:51 -0800 Received: (from rochberg@localhost) by localhost.localdomain (8.11.6/8.11.6) id fBBKpXG01812; Tue, 11 Dec 2001 15:51:33 -0500 Date: Tue, 11 Dec 2001 15:51:33 -0500 Message-Id: <200112112051.fBBKpXG01812@localhost.localdomain> To: netdev@oss.sgi.com From: rochberg+l@61Cnetworks.com Subject: [Patch] fwmark on locally-originated packets Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 8827 Lines: 263 This patch lets you set the fwmark for locally-originated packets on a per-socket basis. This means that with a little application tweaking (add an ioctl call) you can control packet routing on a per-socket basis. Select QoS on each connection, load-balance by hand, slice, dice! I've written a patch to do this. It does: 1. Add ioctls to set the fwmark for a socket 2. Make sure that the fwmark is passed to the routing functions 2a. Add new route function ip_route_output_sk which fetches necessary data out of sk (currently sk->bound_dev_if and sk->fwmark) and stuffs it into a route key 2b. Convert relevant calls to ip_route_output to use ip_route_output_sk 2c. Convert ip_route_connect to use ip_route_output_sk 3. Change ip_queue_xmit to copy skb->nfmark into sk->nfmark on outgoing packets 4. (unrelated bonus patch) Initialize key correctly in fib_frontend; the old "key.foo = bar...." lines didn't initialize fwmark. The new initializer zeros all unused fields Question 1: Do I want to hook in at ip_queue_xmit, or is there a better place? Question 2: Do I want to send this to some other mailing list (like linux-net@vger)? -david patches follow diff -X ~/dontdiff -Naur linux-2.4.16/include/linux/sockios.h linux-2.4.16-fwmark/include/linux/sockios.h --- linux-2.4.16/include/linux/sockios.h Tue Dec 11 14:47:10 2001 +++ linux-2.4.16-fwmark/include/linux/sockios.h Sat Dec 8 14:48:17 2001 @@ -105,6 +105,13 @@ #define SIOCGIFVLAN 0x8982 /* 802.1Q VLAN support */ #define SIOCSIFVLAN 0x8983 /* Set 802.1Q VLAN options */ +/* Set netfilter fwmark on packets for this connection */ +#define SIOCSFWMARK 0x8984 /* Set netfilter fwmark on packets from this cxn */ +#define SIOCGFWMARK 0x8985 + + + + /* bonding calls */ #define SIOCBONDENSLAVE 0x8990 /* enslave a device to the bond */ diff -X ~/dontdiff -Naur linux-2.4.16/include/net/route.h linux-2.4.16-fwmark/include/net/route.h --- linux-2.4.16/include/net/route.h Tue Dec 11 14:47:27 2001 +++ linux-2.4.16-fwmark/include/net/route.h Tue Dec 11 12:17:18 2001 @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -140,6 +141,17 @@ return ip_route_output_key(rp, &key); } +static inline int ip_route_output_sk(struct rtable **rp, + u32 daddr, u32 saddr, u32 tos, const struct sock *sk) +{ + struct rt_key key = { dst:daddr, src:saddr, oif:sk->bound_dev_if, tos:tos, +#if defined(CONFIG_NETFILTER) || defined(CONFIG_NETFILTER_MODULE) + fwmark:sk->nfmark, +#endif + }; + return ip_route_output_key(rp, &key); +} + static inline void ip_rt_put(struct rtable * rt) { @@ -156,17 +168,17 @@ return ip_tos2prio[IPTOS_TOS(tos)>>1]; } -static inline int ip_route_connect(struct rtable **rp, u32 dst, u32 src, u32 tos, int oif) +static inline int ip_route_connect(struct rtable **rp, u32 dst, u32 src, u32 tos, const struct sock *sk) { int err; - err = ip_route_output(rp, dst, src, tos, oif); + err = ip_route_output_sk(rp, dst, src, tos, sk); if (err || (dst && src)) return err; dst = (*rp)->rt_dst; src = (*rp)->rt_src; ip_rt_put(*rp); *rp = NULL; - return ip_route_output(rp, dst, src, tos, oif); + return ip_route_output_sk(rp, dst, src, tos, sk); } extern void rt_bind_peer(struct rtable *rt, int create); diff -X ~/dontdiff -Naur linux-2.4.16/include/net/sock.h linux-2.4.16-fwmark/include/net/sock.h --- linux-2.4.16/include/net/sock.h Tue Dec 11 14:47:28 2001 +++ linux-2.4.16-fwmark/include/net/sock.h Tue Dec 11 12:17:05 2001 @@ -602,6 +602,10 @@ long rcvtimeo; long sndtimeo; +#if defined(CONFIG_NETFILTER) || defined(CONFIG_NETFILTER_MODULE) + int nfmark; /* Set nfmark on outgoing packets if non-zero */ +#endif + #ifdef CONFIG_FILTER /* Socket Filtering Instructions */ struct sk_filter *filter; diff -X ~/dontdiff -Naur linux-2.4.16/net/ipv4/af_inet.c linux-2.4.16-fwmark/net/ipv4/af_inet.c --- linux-2.4.16/net/ipv4/af_inet.c Tue Dec 11 14:48:09 2001 +++ linux-2.4.16-fwmark/net/ipv4/af_inet.c Mon Dec 10 16:50:07 2001 @@ -931,6 +931,23 @@ #endif return -ENOPKG; +#if defined(CONFIG_NETFILTER) || defined(CONFIG_NETFILTER_MODULE) + case SIOCSFWMARK: + err = get_user(sk->nfmark,(int *) arg); + if (err) { + return err; + } + sk_dst_reset(sk); + break; + case SIOCGFWMARK: + err = put_user(sk->nfmark,(int *) arg); + if (err) { + return err; + } + break; +#endif + + default: if ((cmd >= SIOCDEVPRIVATE) && (cmd <= (SIOCDEVPRIVATE + 15))) diff -X ~/dontdiff -Naur linux-2.4.16/net/ipv4/fib_frontend.c linux-2.4.16-fwmark/net/ipv4/fib_frontend.c --- linux-2.4.16/net/ipv4/fib_frontend.c Tue Dec 11 14:48:10 2001 +++ linux-2.4.16-fwmark/net/ipv4/fib_frontend.c Sat Dec 8 14:47:25 2001 @@ -207,17 +207,10 @@ struct net_device *dev, u32 *spec_dst, u32 *itag) { struct in_device *in_dev; - struct rt_key key; + struct rt_key key = { dst:src, src:dst, tos:tos, oif:0,iif:oif,scope:RT_SCOPE_UNIVERSE}; struct fib_result res; int no_addr, rpf; int ret; - - key.dst = src; - key.src = dst; - key.tos = tos; - key.oif = 0; - key.iif = oif; - key.scope = RT_SCOPE_UNIVERSE; no_addr = rpf = 0; read_lock(&inetdev_lock); diff -X ~/dontdiff -Naur linux-2.4.16/net/ipv4/ip_output.c linux-2.4.16-fwmark/net/ipv4/ip_output.c --- linux-2.4.16/net/ipv4/ip_output.c Tue Dec 11 14:48:14 2001 +++ linux-2.4.16-fwmark/net/ipv4/ip_output.c Sat Dec 8 14:47:25 2001 @@ -345,6 +345,12 @@ struct rtable *rt; struct iphdr *iph; +#if defined(CONFIG_NETFILTER) || defined(CONFIG_NETFILTER_MODULE) + if (sk->nfmark) { + skb->nfmark=sk->nfmark; + } +#endif + /* Skip all of this if the packet is already routed, * f.e. by something like SCTP. */ @@ -366,9 +372,9 @@ * keep trying until route appears or the connection times itself * out. */ - if (ip_route_output(&rt, daddr, sk->saddr, + if (ip_route_output_sk(&rt, daddr, sk->saddr, RT_CONN_FLAGS(sk), - sk->bound_dev_if)) + sk)) goto no_route; __sk_dst_set(sk, &rt->u.dst); sk->route_caps = rt->u.dst.dev->features; @@ -964,6 +970,7 @@ daddr = replyopts.opt.faddr; } + /* XXX should this use sk->oif ? */ if (ip_route_output(&rt, daddr, rt->rt_spec_dst, RT_TOS(skb->nh.iph->tos), 0)) return; diff -X ~/dontdiff -Naur linux-2.4.16/net/ipv4/tcp_ipv4.c linux-2.4.16-fwmark/net/ipv4/tcp_ipv4.c --- linux-2.4.16/net/ipv4/tcp_ipv4.c Tue Dec 11 14:48:27 2001 +++ linux-2.4.16-fwmark/net/ipv4/tcp_ipv4.c Sat Dec 8 14:47:25 2001 @@ -667,7 +667,7 @@ } tmp = ip_route_connect(&rt, nexthop, sk->saddr, - RT_CONN_FLAGS(sk), sk->bound_dev_if); + RT_CONN_FLAGS(sk), sk); if (tmp < 0) return tmp; @@ -1150,11 +1150,11 @@ struct ip_options *opt; opt = req->af.v4_req.opt; - if(ip_route_output(&rt, ((opt && opt->srr) ? + if(ip_route_output_sk(&rt, ((opt && opt->srr) ? opt->faddr : req->af.v4_req.rmt_addr), req->af.v4_req.loc_addr, - RT_CONN_FLAGS(sk), sk->bound_dev_if)) { + RT_CONN_FLAGS(sk), sk)) { IP_INC_STATS_BH(IpOutNoRoutes); return NULL; } @@ -1733,7 +1733,7 @@ /* Query new route. */ err = ip_route_connect(&rt, daddr, 0, RT_TOS(sk->protinfo.af_inet.tos)|sk->localroute, - sk->bound_dev_if); + sk); if (err) return err; @@ -1781,8 +1781,8 @@ if(sk->protinfo.af_inet.opt && sk->protinfo.af_inet.opt->srr) daddr = sk->protinfo.af_inet.opt->faddr; - err = ip_route_output(&rt, daddr, sk->saddr, - RT_CONN_FLAGS(sk), sk->bound_dev_if); + err = ip_route_output_sk(&rt, daddr, sk->saddr, + RT_CONN_FLAGS(sk), sk); if (!err) { __sk_dst_set(sk, &rt->u.dst); sk->route_caps = rt->u.dst.dev->features; diff -X ~/dontdiff -Naur linux-2.4.16/net/ipv4/udp.c linux-2.4.16-fwmark/net/ipv4/udp.c --- linux-2.4.16/net/ipv4/udp.c Tue Dec 11 14:48:29 2001 +++ linux-2.4.16-fwmark/net/ipv4/udp.c Sat Dec 8 14:47:25 2001 @@ -724,7 +724,7 @@ sk_dst_reset(sk); err = ip_route_connect(&rt, usin->sin_addr.s_addr, sk->saddr, - RT_CONN_FLAGS(sk), sk->bound_dev_if); + RT_CONN_FLAGS(sk), sk); if (err) return err; if ((rt->rt_flags&RTCF_BROADCAST) && !sk->broadcast) { From owner-netdev@oss.sgi.com Tue Dec 11 14:04:30 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBM4Uo21708 for netdev-outgoing; Tue, 11 Dec 2001 14:04:30 -0800 Received: from docomolabs-usa.com (fridge.docomo-usa.com [216.98.102.228]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBM4Ro21705 for ; Tue, 11 Dec 2001 14:04:27 -0800 Received: from VAIOHE (dhcp5.docomo-usa.com [172.21.96.5]) by docomolabs-usa.com (8.11.3/8.11.3) with ESMTP id fBBL4KJ03642 for ; Tue, 11 Dec 2001 13:04:20 -0800 (PST) Reply-To: From: "Xiaoning He" To: Subject: IPv6 Question on Red Hat 7.1 Date: Tue, 11 Dec 2001 13:03:22 -0800 Organization: NTT-Docomo USA Labs Message-ID: <000301c18287$44d2dc30$056015ac@VAIOHE> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Priority: 1 (Highest) X-MSMail-Priority: High X-Mailer: Microsoft Outlook, Build 10.0.2627 Importance: High X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 804 Lines: 22 Dear All Sorry to bother you with some stupid, however urgent, questions. I was asked to modify the ND Router Solicitation and ND Router Advertisement message based on the IPv6 implantation on Red Hat 7.1 . I found the radvd which sends out the ND_RA messages. I also found that in ndisc.c, the ND_RS will be generated and send out. However, after I rewrite the codes, with radvdump, I can see the change I made to the ND_RA. What I have done is simple, whenever the radvd-deamon receives a ND_RS, it will log to a file. However, it seems the ND_RS messages are never sent out. My project requires me to send out the ND_RS whenever I want. So, my question is how the user is able to let the system send out a ND_RS whenever he wants? I am new to Linux, sorry for my ignorant. Thank you Richard From owner-netdev@oss.sgi.com Tue Dec 11 14:40:15 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBMeFL22768 for netdev-outgoing; Tue, 11 Dec 2001 14:40:15 -0800 Received: from yue.hongo.wide.ad.jp (root@yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBMeCo22765 for ; Tue, 11 Dec 2001 14:40:12 -0800 Received: from localhost (yoshfuji@localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.9.3+3.2W/8.9.3/Debian 8.9.3-21) with ESMTP id GAA00868; Wed, 12 Dec 2001 06:40:47 +0900 To: xiaoning@docomolabs-usa.com Cc: netdev@oss.sgi.com Subject: Re: IPv6 Question on Red Hat 7.1 In-Reply-To: <000301c18287$44d2dc30$056015ac@VAIOHE> References: <000301c18287$44d2dc30$056015ac@VAIOHE> X-Mailer: Mew version 1.94.2 on Emacs 20.7 / Mule 4.1 (AOI) X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: <20011212064047A.yoshfuji@wide.ad.jp> Date: Wed, 12 Dec 2001 06:40:47 +0900 From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= X-Dispatcher: imput version 991025(IM133) Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 665 Lines: 13 In article <000301c18287$44d2dc30$056015ac@VAIOHE> (at Tue, 11 Dec 2001 13:03:22 -0800), "Xiaoning He" says: > What I have done is simple, whenever the radvd-deamon receives a ND_RS, > it will log to a file. However, it seems the ND_RS messages are never > sent out. My project requires me to send out the ND_RS whenever I want. > So, my question is how the user is able to let the system send out a > ND_RS whenever he wants? I am new to Linux, sorry for my ignorant. We cannot do this at this moment with linux kernel. If you really want to send RS message by hand in urgent, you can generate ones using raw socket. --yoshfuji From owner-netdev@oss.sgi.com Tue Dec 11 15:09:09 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBN99H23795 for netdev-outgoing; Tue, 11 Dec 2001 15:09:09 -0800 Received: from docomolabs-usa.com (fridge.docomo-usa.com [216.98.102.228]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBN92o23792 for ; Tue, 11 Dec 2001 15:09:02 -0800 Received: from VAIOHE (dhcp5.docomo-usa.com [172.21.96.5]) by docomolabs-usa.com (8.11.3/8.11.3) with ESMTP id fBBM8uJ06556; Tue, 11 Dec 2001 14:08:56 -0800 (PST) Reply-To: From: "Xiaoning He" To: =?windows-1255?B?J1lPU0hJRlVKSSBIaWRlYWtpIC8gi2eToYlwlr4n?= Cc: Subject: RE: IPv6 Question on Red Hat 7.1 Date: Tue, 11 Dec 2001 14:07:58 -0800 Organization: NTT-Docomo USA Labs Message-ID: <000401c18290$4ac2dab0$056015ac@VAIOHE> MIME-Version: 1.0 Content-Type: text/plain; charset="windows-1255" X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook, Build 10.0.2627 Importance: Normal In-Reply-To: <20011212064047A.yoshfuji@wide.ad.jp> X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id fBBN93o23793 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1279 Lines: 39 Thank you for the information. Then, would you please let me know where I can download the IPv6 implementation for Red Hat 7.1? Now, it comes with the system install, is there any standalone package containing all the source code for IPv6 available for downloading? Thank you. Xiaoning He > -----Original Message----- > From: Hideaki YOSHIFUJI [mailto:yoshfuji@cerberus.hongo.wide.ad.jp] On > Behalf Of YOSHIFUJI Hideaki / ‹g“¡‰p–¾ > Sent: Tuesday, December 11, 2001 1:41 PM > To: xiaoning@docomolabs-usa.com > Cc: netdev@oss.sgi.com > Subject: Re: IPv6 Question on Red Hat 7.1 > > In article <000301c18287$44d2dc30$056015ac@VAIOHE> (at Tue, 11 Dec 2001 > 13:03:22 -0800), "Xiaoning He" says: > > > What I have done is simple, whenever the radvd-deamon receives a ND_RS, > > it will log to a file. However, it seems the ND_RS messages are never > > sent out. My project requires me to send out the ND_RS whenever I want. > > So, my question is how the user is able to let the system send out a > > ND_RS whenever he wants? I am new to Linux, sorry for my ignorant. > > We cannot do this at this moment with linux kernel. > If you really want to send RS message by hand in urgent, > you can generate ones using raw socket. > > --yoshfuji From owner-netdev@oss.sgi.com Tue Dec 11 15:57:16 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBBNvGr26040 for netdev-outgoing; Tue, 11 Dec 2001 15:57:16 -0800 Received: from yue.hongo.wide.ad.jp (root@yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBBNvDo26037 for ; Tue, 11 Dec 2001 15:57:14 -0800 Received: from localhost (yoshfuji@localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.9.3+3.2W/8.9.3/Debian 8.9.3-21) with ESMTP id HAA01064; Wed, 12 Dec 2001 07:57:49 +0900 To: xiaoning@docomolabs-usa.com Cc: netdev@oss.sgi.com Subject: RE: IPv6 Question on Red Hat 7.1 In-Reply-To: <000401c18290$4ac2dab0$056015ac@VAIOHE> References: <20011212064047A.yoshfuji@wide.ad.jp> <000401c18290$4ac2dab0$056015ac@VAIOHE> X-Mailer: Mew version 1.94.2 on Emacs 20.7 / Mule 4.1 (AOI) X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: <20011212075749V.yoshfuji@wide.ad.jp> Date: Wed, 12 Dec 2001 07:57:49 +0900 From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= X-Dispatcher: imput version 991025(IM133) Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 607 Lines: 14 In article <000401c18290$4ac2dab0$056015ac@VAIOHE> (at Tue, 11 Dec 2001 14:07:58 -0800), "Xiaoning He" says: > Then, would you please let me know where I can download the IPv6 > implementation for Red Hat 7.1? Now, it comes with the system install, : > > We cannot do this at this moment with linux kernel. > > If you really want to send RS message by hand in urgent, > > you can generate ones using raw socket. I meant linux kernel does not provide such a hook to send RS message by user's request. BTW you will be interested in . :-) --yoshfuji From owner-netdev@oss.sgi.com Wed Dec 12 16:22:21 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBD0MLa09686 for netdev-outgoing; Wed, 12 Dec 2001 16:22:21 -0800 Received: from lacrosse.corp.redhat.com (host154.207-175-42.redhat.com [207.175.42.154]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBD0LWo09679 for ; Wed, 12 Dec 2001 16:21:33 -0800 Received: from toomuch.toronto.redhat.com (toomuch.toronto.redhat.com [172.16.14.22]) by lacrosse.corp.redhat.com (8.11.6/8.9.3) with ESMTP id fBCNL8h17371; Wed, 12 Dec 2001 18:21:08 -0500 Received: (from bcrl@localhost) by toomuch.toronto.redhat.com (8.11.6/8.11.2) id fBCNL7n07552; Wed, 12 Dec 2001 18:21:07 -0500 Date: Wed, 12 Dec 2001 18:21:07 -0500 From: Benjamin LaHaise To: davem@redhat.com, torvalds@transmeta.com Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com, jgarzik@mandrakesoft.com, jes@trained-monkey.org, ionut@cs.columbia.edu, akpm@zip.com.au Subject: [PATCH] v2.5.1-pre10-02_kvec_net.diff Message-ID: <20011212182107.B28056@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 17855 Lines: 499 Hey folks, This patch follows on the 01_kvec patch to convert the skbuff fragment structure over to a kveclet. I got carried away tidying up the network drivers a bit in doing this, plus fixed a bug in ns83820.c. To make backwards compatibility for drivers a bit easier, I introduced #defines in skbuff.h for skb_frag_{offset,length,page} which should be added to 2.4 too. Dave, is this good for you? Comments? Hopefully all the driver authors I've hit are included on the cc, my apologies if not. This still works for me on the ns83820 and 3c59x drivers. Cheers, -ben -- Fish. diff -urN 01_kvec-v2.5.1-pre10/drivers/net/3c59x.c 02_kvec_net-v2.5.1-pre10/drivers/net/3c59x.c --- 01_kvec-v2.5.1-pre10/drivers/net/3c59x.c Fri Nov 9 16:41:42 2001 +++ 02_kvec_net-v2.5.1-pre10/drivers/net/3c59x.c Wed Dec 12 17:46:15 2001 @@ -1976,7 +1976,10 @@ /* Calculate the next Tx descriptor entry. */ int entry = vp->cur_tx % TX_RING_SIZE; struct boom_tx_desc *prev_entry = &vp->tx_ring[(vp->cur_tx-1) % TX_RING_SIZE]; + struct boom_tx_desc *desc = &vp->tx_ring[entry]; unsigned long flags; + int len, i; + void *buf; if (vortex_debug > 6) { printk(KERN_DEBUG "boomerang_start_xmit()\n"); @@ -1995,42 +1998,41 @@ vp->tx_skbuff[entry] = skb; - vp->tx_ring[entry].next = 0; + desc->next = 0; #if DO_ZEROCOPY if (skb->ip_summed != CHECKSUM_HW) - vp->tx_ring[entry].status = cpu_to_le32(skb->len | TxIntrUploaded); + desc->status = cpu_to_le32(skb->len | TxIntrUploaded); else - vp->tx_ring[entry].status = cpu_to_le32(skb->len | TxIntrUploaded | AddTCPChksum); + desc->status = cpu_to_le32(skb->len | TxIntrUploaded | AddTCPChksum); - if (!skb_shinfo(skb)->nr_frags) { - vp->tx_ring[entry].frag[0].addr = cpu_to_le32(pci_map_single(vp->pdev, skb->data, - skb->len, PCI_DMA_TODEVICE)); - vp->tx_ring[entry].frag[0].length = cpu_to_le32(skb->len | LAST_FRAG); - } else { - int i; - - vp->tx_ring[entry].frag[0].addr = cpu_to_le32(pci_map_single(vp->pdev, skb->data, - skb->len-skb->data_len, PCI_DMA_TODEVICE)); - vp->tx_ring[entry].frag[0].length = cpu_to_le32(skb->len-skb->data_len); - - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { - skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; - - vp->tx_ring[entry].frag[i+1].addr = - cpu_to_le32(pci_map_single(vp->pdev, - (void*)page_address(frag->page) + frag->page_offset, - frag->size, PCI_DMA_TODEVICE)); - - if (i == skb_shinfo(skb)->nr_frags-1) - vp->tx_ring[entry].frag[i+1].length = cpu_to_le32(frag->size|LAST_FRAG); - else - vp->tx_ring[entry].frag[i+1].length = cpu_to_le32(frag->size); - } + buf = skb->data; + len = skb->len; + if (skb_shinfo(skb)->nr_frags) + len -= skb->data_len; + + for (i=0; ; i++) { + skb_frag_t *frag; + u32 last = 0; + u32 addr = pci_map_single(vp->pdev, buf, len, PCI_DMA_TODEVICE); + + /* No more fragments? */ + if (i == skb_shinfo(skb)->nr_frags) + last = cpu_to_le32(LAST_FRAG); + + desc->frag[i].addr = cpu_to_le32(addr); + desc->frag[i].length = cpu_to_le32(len) | last; + + if (last) + break; + + frag = &skb_shinfo(skb)->frags[i]; + buf = page_address(skb_frag_page(frag)) + skb_frag_offset(frag); + len = skb_frag_length(frag); } #else - vp->tx_ring[entry].addr = cpu_to_le32(pci_map_single(vp->pdev, skb->data, skb->len, PCI_DMA_TODEVICE)); - vp->tx_ring[entry].length = cpu_to_le32(skb->len | LAST_FRAG); - vp->tx_ring[entry].status = cpu_to_le32(skb->len | TxIntrUploaded); + desc->addr = cpu_to_le32(pci_map_single(vp->pdev, skb->data, skb->len, PCI_DMA_TODEVICE)); + desc->length = cpu_to_le32(skb->len | LAST_FRAG); + desc->status = cpu_to_le32(skb->len | TxIntrUploaded); #endif spin_lock_irqsave(&vp->lock, flags); diff -urN 01_kvec-v2.5.1-pre10/drivers/net/8139cp.c 02_kvec_net-v2.5.1-pre10/drivers/net/8139cp.c --- 01_kvec-v2.5.1-pre10/drivers/net/8139cp.c Mon Nov 19 18:19:42 2001 +++ 02_kvec_net-v2.5.1-pre10/drivers/net/8139cp.c Wed Dec 12 17:58:27 2001 @@ -663,10 +663,10 @@ u32 len, mapping; u32 ctrl; - len = this_frag->size; + len = skb_frag_length(this_frag); mapping = pci_map_single(cp->pdev, - ((void *) page_address(this_frag->page) + - this_frag->page_offset), + (page_address(skb_frag_page(this_frag)) + + skb_frag_offset(this_frag)), len, PCI_DMA_TODEVICE); eor = (entry == (CP_TX_RING_SIZE - 1)) ? RingEnd : 0; #ifdef CP_TX_CHECKSUM diff -urN 01_kvec-v2.5.1-pre10/drivers/net/acenic.c 02_kvec_net-v2.5.1-pre10/drivers/net/acenic.c --- 01_kvec-v2.5.1-pre10/drivers/net/acenic.c Mon Nov 19 18:19:42 2001 +++ 02_kvec_net-v2.5.1-pre10/drivers/net/acenic.c Wed Dec 12 17:50:55 2001 @@ -2635,15 +2635,15 @@ skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; struct tx_ring_info *info; - len += frag->size; + len += skb_frag_length(frag); info = ap->skb->tx_skbuff + idx; desc = ap->tx_ring + idx; - mapping = pci_map_page(ap->pdev, frag->page, - frag->page_offset, frag->size, + mapping = pci_map_page(ap->pdev, skb_frag_page(frag), + skb_frag_offset(frag), skb_frag_length(frag), PCI_DMA_TODEVICE); - flagsize = (frag->size << 16); + flagsize = (skb_frag_length(frag) << 16); if (skb->ip_summed == CHECKSUM_HW) flagsize |= BD_FLG_TCP_UDP_SUM; idx = (idx + 1) % TX_RING_ENTRIES; @@ -2662,7 +2662,7 @@ info->skb = NULL; } info->mapping = mapping; - info->maplen = frag->size; + info->maplen = skb_frag_length(frag); ace_load_tx_bd(desc, mapping, flagsize); } } diff -urN 01_kvec-v2.5.1-pre10/drivers/net/ns83820.c 02_kvec_net-v2.5.1-pre10/drivers/net/ns83820.c --- 01_kvec-v2.5.1-pre10/drivers/net/ns83820.c Fri Nov 9 16:45:35 2001 +++ 02_kvec_net-v2.5.1-pre10/drivers/net/ns83820.c Wed Dec 12 17:28:29 2001 @@ -957,11 +957,13 @@ if (!nr_frags) break; - buf = pci_map_single_high(dev->pci_dev, frag->page, 0, - frag->size, PCI_DMA_TODEVICE); + buf = pci_map_single_high(dev->pci_dev, skb_frag_page(frag), + skb_frag_offset(frag), + skb_frag_length(frag), + PCI_DMA_TODEVICE); dprintk("frag: buf=%08Lx page=%08lx\n", (long long)buf, (long)(frag->page - mem_map)); - len = frag->size; + len = skb_frag_length(frag); frag++; nr_frags--; } diff -urN 01_kvec-v2.5.1-pre10/drivers/net/starfire.c 02_kvec_net-v2.5.1-pre10/drivers/net/starfire.c --- 01_kvec-v2.5.1-pre10/drivers/net/starfire.c Sun Sep 30 15:26:07 2001 +++ 02_kvec_net-v2.5.1-pre10/drivers/net/starfire.c Wed Dec 12 17:37:29 2001 @@ -1134,11 +1134,13 @@ if (skb_first_frag_len(skb) == 1) has_bad_length = 1; else { - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) - if (skb_shinfo(skb)->frags[i].size == 1) { + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + if (skb_frag_length(frag)) == 1) { has_bad_length = 1; break; } + } } if (has_bad_length) @@ -1188,13 +1190,17 @@ #ifdef ZEROCOPY for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { skb_frag_t *this_frag = &skb_shinfo(skb)->frags[i]; + void *addr = page_address(skb_frag_page(this_frag)) + + skb_frag_offset(this_frag); + int len = skb_frag_length(this_frag); /* we already have the proper value in entry */ np->tx_info[entry].frag_mapping[i] = - pci_map_single(np->pci_dev, page_address(this_frag->page) + this_frag->page_offset, this_frag->size, PCI_DMA_TODEVICE); + pci_map_single(np->pci_dev, addr, len, + PCI_DMA_TODEVICE); np->tx_ring[entry].frag[i].addr = cpu_to_le32(np->tx_info[entry].frag_mapping[i]); - np->tx_ring[entry].frag[i].len = cpu_to_le32(this_frag->size); + np->tx_ring[entry].frag[i].len = cpu_to_le32(len); if (debug > 5) { printk(KERN_DEBUG "%s: Tx #%d frag %d len %4.4x.\n", dev->name, np->cur_tx, i, diff -urN 01_kvec-v2.5.1-pre10/drivers/net/sungem.c 02_kvec_net-v2.5.1-pre10/drivers/net/sungem.c --- 01_kvec-v2.5.1-pre10/drivers/net/sungem.c Sun Oct 21 13:36:54 2001 +++ 02_kvec_net-v2.5.1-pre10/drivers/net/sungem.c Wed Dec 12 17:38:21 2001 @@ -714,10 +714,10 @@ dma_addr_t mapping; u64 this_ctrl; - len = this_frag->size; + len = skb_frag_length(this_frag); mapping = pci_map_page(gp->pdev, - this_frag->page, - this_frag->page_offset, + skb_frag_page(this_frag), + skb_frag_offset(this_frag), len, PCI_DMA_TODEVICE); this_ctrl = ctrl; if (frag == skb_shinfo(skb)->nr_frags - 1) diff -urN 01_kvec-v2.5.1-pre10/drivers/net/sunhme.c 02_kvec_net-v2.5.1-pre10/drivers/net/sunhme.c --- 01_kvec-v2.5.1-pre10/drivers/net/sunhme.c Fri Oct 12 18:35:53 2001 +++ 02_kvec_net-v2.5.1-pre10/drivers/net/sunhme.c Wed Dec 12 17:59:51 2001 @@ -2336,12 +2336,11 @@ for (frag = 0; frag < skb_shinfo(skb)->nr_frags; frag++) { skb_frag_t *this_frag = &skb_shinfo(skb)->frags[frag]; u32 len, mapping, this_txflags; + void *addr = page_address(skb_frag_page(this_frag)) + + skb_frag_offset(this_frag); - len = this_frag->size; - mapping = hme_dma_map(hp, - ((void *) page_address(this_frag->page) + - this_frag->page_offset), - len, DMA_TODEVICE); + len = skb_frag_length(this_frag); + mapping = hme_dma_map(hp, addr, len, DMA_TODEVICE); this_txflags = tx_flags; if (frag == skb_shinfo(skb)->nr_frags - 1) this_txflags |= TXFLAG_EOP; Binary files 01_kvec-v2.5.1-pre10/include/linux/.skbuff.h.swp and 02_kvec_net-v2.5.1-pre10/include/linux/.skbuff.h.swp differ diff -urN 01_kvec-v2.5.1-pre10/include/linux/skbuff.h 02_kvec_net-v2.5.1-pre10/include/linux/skbuff.h --- 01_kvec-v2.5.1-pre10/include/linux/skbuff.h Wed Dec 12 13:39:29 2001 +++ 02_kvec_net-v2.5.1-pre10/include/linux/skbuff.h Wed Dec 12 17:42:24 2001 @@ -107,14 +107,11 @@ #define MAX_SKB_FRAGS 6 -typedef struct skb_frag_struct skb_frag_t; +typedef struct kveclet skb_frag_t; -struct skb_frag_struct -{ - struct page *page; - __u16 page_offset; - __u16 size; -}; +#define skb_frag_page(f) ((f)->page) +#define skb_frag_offset(f) ((f)->offset) +#define skb_frag_length(f) ((f)->length) /* This data is invariant across clones and lives at * the end of the header data, ie. at skb->end. diff -urN 01_kvec-v2.5.1-pre10/net/core/datagram.c 02_kvec_net-v2.5.1-pre10/net/core/datagram.c --- 01_kvec-v2.5.1-pre10/net/core/datagram.c Thu Apr 12 15:11:39 2001 +++ 02_kvec_net-v2.5.1-pre10/net/core/datagram.c Wed Dec 12 17:52:04 2001 @@ -227,7 +227,7 @@ BUG_TRAP(start <= offset+len); - end = start + skb_shinfo(skb)->frags[i].size; + end = start + skb_shinfo(skb)->frags[i].length; if ((copy = end-offset) > 0) { int err; u8 *vaddr; @@ -237,7 +237,7 @@ if (copy > len) copy = len; vaddr = kmap(page); - err = memcpy_toiovec(to, vaddr + frag->page_offset + + err = memcpy_toiovec(to, vaddr + frag->offset + offset-start, copy); kunmap(page); if (err) @@ -303,7 +303,7 @@ BUG_TRAP(start <= offset+len); - end = start + skb_shinfo(skb)->frags[i].size; + end = start + skb_shinfo(skb)->frags[i].length; if ((copy = end-offset) > 0) { unsigned int csum2; int err = 0; @@ -314,8 +314,9 @@ if (copy > len) copy = len; vaddr = kmap(page); - csum2 = csum_and_copy_to_user(vaddr + frag->page_offset + - offset-start, to, copy, 0, &err); + csum2 = csum_and_copy_to_user(vaddr + frag->offset + + offset-start, to, copy, + 0, &err); kunmap(page); if (err) goto fault; diff -urN 01_kvec-v2.5.1-pre10/net/core/skbuff.c 02_kvec_net-v2.5.1-pre10/net/core/skbuff.c --- 01_kvec-v2.5.1-pre10/net/core/skbuff.c Tue Aug 7 11:30:50 2001 +++ 02_kvec_net-v2.5.1-pre10/net/core/skbuff.c Wed Dec 12 16:33:26 2001 @@ -744,7 +744,7 @@ int i; for (i=0; ifrags[i].size; + int end = offset + skb_shinfo(skb)->frags[i].length; if (end > len) { if (skb_cloned(skb)) { if (!realloc) @@ -756,7 +756,7 @@ put_page(skb_shinfo(skb)->frags[i].page); skb_shinfo(skb)->nr_frags--; } else { - skb_shinfo(skb)->frags[i].size = len-offset; + skb_shinfo(skb)->frags[i].length = len - offset; } } offset = end; @@ -833,9 +833,9 @@ /* Estimate size of pulled pages. */ eat = delta; for (i=0; inr_frags; i++) { - if (skb_shinfo(skb)->frags[i].size >= eat) + if (skb_shinfo(skb)->frags[i].length >= eat) goto pull_pages; - eat -= skb_shinfo(skb)->frags[i].size; + eat -= skb_shinfo(skb)->frags[i].length; } /* If we need update frag list, we are in troubles. @@ -900,14 +900,14 @@ eat = delta; k = 0; for (i=0; inr_frags; i++) { - if (skb_shinfo(skb)->frags[i].size <= eat) { + if (skb_shinfo(skb)->frags[i].length <= eat) { put_page(skb_shinfo(skb)->frags[i].page); - eat -= skb_shinfo(skb)->frags[i].size; + eat -= skb_shinfo(skb)->frags[i].length; } else { skb_shinfo(skb)->frags[k] = skb_shinfo(skb)->frags[i]; if (eat) { - skb_shinfo(skb)->frags[k].page_offset += eat; - skb_shinfo(skb)->frags[k].size -= eat; + skb_shinfo(skb)->frags[k].offset += eat; + skb_shinfo(skb)->frags[k].length -= eat; eat = 0; } k++; @@ -947,7 +947,7 @@ BUG_TRAP(start <= offset+len); - end = start + skb_shinfo(skb)->frags[i].size; + end = start + skb_shinfo(skb)->frags[i].length; if ((copy = end-offset) > 0) { u8 *vaddr; @@ -955,8 +955,8 @@ copy = len; vaddr = kmap_skb_frag(&skb_shinfo(skb)->frags[i]); - memcpy(to, vaddr+skb_shinfo(skb)->frags[i].page_offset+ - offset-start, copy); + memcpy(to, vaddr + skb_shinfo(skb)->frags[i].offset + + offset - start, copy); kunmap_skb_frag(vaddr); if ((len -= copy) == 0) @@ -1020,7 +1020,7 @@ BUG_TRAP(start <= offset+len); - end = start + skb_shinfo(skb)->frags[i].size; + end = start + skb_shinfo(skb)->frags[i].length; if ((copy = end-offset) > 0) { unsigned int csum2; u8 *vaddr; @@ -1029,8 +1029,8 @@ if (copy > len) copy = len; vaddr = kmap_skb_frag(frag); - csum2 = csum_partial(vaddr + frag->page_offset + - offset-start, copy, 0); + csum2 = csum_partial(vaddr + frag->offset + + offset - start, copy, 0); kunmap_skb_frag(vaddr); csum = csum_block_add(csum, csum2, pos); if (!(len -= copy)) @@ -1096,7 +1096,7 @@ BUG_TRAP(start <= offset+len); - end = start + skb_shinfo(skb)->frags[i].size; + end = start + skb_shinfo(skb)->frags[i].length; if ((copy = end-offset) > 0) { unsigned int csum2; u8 *vaddr; @@ -1105,7 +1105,7 @@ if (copy > len) copy = len; vaddr = kmap_skb_frag(frag); - csum2 = csum_partial_copy_nocheck(vaddr + frag->page_offset + + csum2 = csum_partial_copy_nocheck(vaddr + frag->offset + offset-start, to, copy, 0); kunmap_skb_frag(vaddr); csum = csum_block_add(csum, csum2, pos); diff -urN 01_kvec-v2.5.1-pre10/net/ipv4/ip_fragment.c 02_kvec_net-v2.5.1-pre10/net/ipv4/ip_fragment.c --- 01_kvec-v2.5.1-pre10/net/ipv4/ip_fragment.c Fri Sep 7 14:01:21 2001 +++ 02_kvec_net-v2.5.1-pre10/net/ipv4/ip_fragment.c Wed Dec 12 17:53:06 2001 @@ -542,7 +542,7 @@ skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list; skb_shinfo(head)->frag_list = NULL; for (i=0; inr_frags; i++) - plen += skb_shinfo(head)->frags[i].size; + plen += skb_shinfo(head)->frags[i].length; clone->len = clone->data_len = head->data_len - plen; head->data_len -= clone->len; head->len -= clone->len; diff -urN 01_kvec-v2.5.1-pre10/net/ipv4/tcp.c 02_kvec_net-v2.5.1-pre10/net/ipv4/tcp.c --- 01_kvec-v2.5.1-pre10/net/ipv4/tcp.c Tue Oct 30 18:08:12 2001 +++ 02_kvec_net-v2.5.1-pre10/net/ipv4/tcp.c Wed Dec 12 16:35:31 2001 @@ -752,7 +752,7 @@ if (i) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i-1]; return page == frag->page && - off == frag->page_offset+frag->size; + off == frag->offset + frag->length; } return 0; } @@ -762,8 +762,8 @@ { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; frag->page = page; - frag->page_offset = off; - frag->size = size; + frag->offset = off; + frag->length = size; skb_shinfo(skb)->nr_frags = i+1; } @@ -872,7 +872,7 @@ i = skb_shinfo(skb)->nr_frags; if (can_coalesce(skb, i, page, offset)) { - skb_shinfo(skb)->frags[i-1].size += copy; + skb_shinfo(skb)->frags[i-1].length += copy; } else if (i < MAX_SKB_FRAGS) { get_page(page); fill_page_desc(skb, i, page, offset, copy); @@ -1135,7 +1135,7 @@ /* Update the skb. */ if (merge) { - skb_shinfo(skb)->frags[i-1].size += copy; + skb_shinfo(skb)->frags[i-1].length += copy; } else { fill_page_desc(skb, i, page, off, copy); if (TCP_PAGE(sk)) { diff -urN 01_kvec-v2.5.1-pre10/net/ipv4/tcp_output.c 02_kvec_net-v2.5.1-pre10/net/ipv4/tcp_output.c --- 01_kvec-v2.5.1-pre10/net/ipv4/tcp_output.c Mon Nov 5 12:46:12 2001 +++ 02_kvec_net-v2.5.1-pre10/net/ipv4/tcp_output.c Wed Dec 12 17:54:59 2001 @@ -382,7 +382,7 @@ skb->data_len = len - pos; for (i=0; ifrags[i].size; + int size = skb_shinfo(skb)->frags[i].length; if (pos + size > len) { skb_shinfo(skb1)->frags[k] = skb_shinfo(skb)->frags[i]; @@ -396,9 +396,9 @@ * 2. Split is accurately. We make this. */ get_page(skb_shinfo(skb)->frags[i].page); - skb_shinfo(skb1)->frags[0].page_offset += (len-pos); - skb_shinfo(skb1)->frags[0].size -= (len-pos); - skb_shinfo(skb)->frags[i].size = len-pos; + skb_shinfo(skb1)->frags[0].offset += (len-pos); + skb_shinfo(skb1)->frags[0].length -= (len-pos); + skb_shinfo(skb)->frags[i].length = len-pos; skb_shinfo(skb)->nr_frags++; } k++; From owner-netdev@oss.sgi.com Fri Dec 14 16:03:33 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBF03XU12633 for netdev-outgoing; Fri, 14 Dec 2001 16:03:33 -0800 Received: from u.domain.uli (ja.mac.ssi.bg [212.95.166.194]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBF003o12496 for ; Fri, 14 Dec 2001 16:03:20 -0800 Received: from localhost (IDENT:ja@localhost [127.0.0.1]) by u.domain.uli (8.11.0/8.11.0) with ESMTP id fBF10ft12165; Sat, 15 Dec 2001 01:01:21 GMT Date: Sat, 15 Dec 2001 01:00:41 +0000 (GMT) From: Julian Anastasov X-X-Sender: To: Alexey Kuznetsov cc: , "David S. Miller" Subject: fib/netdev cleanup Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 636 Lines: 24 Hello, I see some problems with the netdev notifications for the FIB: 1. fib_inetaddr_event deletes all routes when the last address is deleted which is wrong, may be because there is the 2nd problem: 2. FIB never detects NETDEV_UNREGISTER because ip_ptr is already NULL 3. We better to clear nh->nh_dev on NETDEV_UNREGISTER, fatal for multipath routes in 2.4 (problem known from some time ago) I uploaded all these changes, both for 2.2 and 2.4 here: http://www.linuxvirtualserver.org/~julian/fibnetdev-2.4.16-1.diff http://www.linuxvirtualserver.org/~julian/fibnetdev-2.2.20-1.diff Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Sat Dec 15 09:00:10 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBFH0AD21316 for netdev-outgoing; Sat, 15 Dec 2001 09:00:10 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBFH00o21290 for ; Sat, 15 Dec 2001 09:00:01 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id BB0A4C60EA; Sat, 15 Dec 2001 16:59:56 +0100 (CET) Date: Sat, 15 Dec 2001 16:59:56 +0100 From: bert hubert To: netdev@oss.sgi.com Subject: [PATCH] make netfilter able to change/see skb->priority Message-ID: <20011215165956.A31862@outpost.ds9a.nl> Mail-Followup-To: bert hubert , netdev@oss.sgi.com Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 2126 Lines: 64 Rusty & others: Right now, netfilter can't see or touch the skb->priority of packets generated locally because it is only set in ip_queue_xmit2, after netfilter has been consulted. This patch moves the skb->priority=sk->priority line to just before calling netfilter. I think this patch is philosophically right because it allows netfilter to override userspace instructions, which is what we do for lots of other targets too. We feel that it is ok to drop or mangle locally generated packets in netfilter. I think we should do the same for skb->priority. The real reason I need this is because I've patched iptables to be skb->priority aware, allowing coolness like this: # iptables -t mangle -A OUTPUT --dport 22 -j PRIO --classid 1:1 Which allows you to classify packets *directly* from iptables to QoS classes. Right now, this has to be done like this: # iptables -t mangle -A OUTPUT --dport 22 -j MARK --set-mark 6 # tc filter add dev eth1 protocol ip parent 1:0 prio 1 handle 6 fw classid 1:1 Which requires not only more typing, but is also slower. Or, if using only tc filter, like this: # F="tc filter add dev eth0 parent 1:0 protocol ip prio 5" # $F handle 1: u32 divisor 1 # $F u32 ht 1: match tcp dst 22 0xFFFF match ip protocol 6 0xFF match ip firstfrag flowid 1:1 # $F u32 ht 800:: match u8 0 0 offset at 0 mask 0x0f00 shift 6 link 1: The patch: --- net/ipv4/ip_output.c.orig Sat Dec 15 16:45:47 2001 +++ net/ipv4/ip_output.c Sat Dec 15 16:06:28 2001 @@ -315,7 +315,6 @@ /* Add an IP checksum. */ ip_send_check(iph); - skb->priority = sk->priority; return skb->dst->output(skb); fragment: @@ -395,7 +394,7 @@ iph->ihl += opt->optlen >> 2; ip_options_build(skb, opt, sk->daddr, rt, 0); } - + skb->priority = skb->sk->priority; return NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, skb, NULL, rt->u.dst.dev, ip_queue_xmit2); -- http://www.PowerDNS.com Versatile DNS Software & Services Trilab The Technology People Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - 'SYN! .. SYN|ACK! .. ACK!' - the mating call of the internet From owner-netdev@oss.sgi.com Sat Dec 15 18:32:39 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBG2Wd531131 for netdev-outgoing; Sat, 15 Dec 2001 18:32:39 -0800 Received: from sgi.com (sgi.SGI.COM [192.48.153.1]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBG2WYo31126 for ; Sat, 15 Dec 2001 18:32:34 -0800 Received: from wagner.rustcorp.com.au (CPE-203-51-26-56.nsw.bigpond.net.au [203.51.26.56]) by sgi.com (980327.SGI.8.8.8-aspam/980304.SGI-aspam: SGI does not authorize the use of its proprietary systems or networks for unsolicited or bulk email from the Internet.) via ESMTP id RAA02738 for ; Sat, 15 Dec 2001 17:32:31 -0800 (PST) mail_from (rusty@rustcorp.com.au) Received: from wagner.rustcorp.com.au ([127.0.0.1] helo=rustcorp.com.au) by wagner.rustcorp.com.au with esmtp (Exim 3.32 #1 (Debian)) id 16FOkX-0001pY-00; Sun, 16 Dec 2001 11:01:49 +1100 From: Rusty Russell To: Hein Roehrig Cc: Neale Banks , netdev@oss.sgi.com Subject: Re: network interface names ethX and renaming interfaces In-reply-to: Your message of "Sat, 15 Dec 2001 18:10:05 BST." Date: Sun, 16 Dec 2001 11:01:28 +1100 Message-Id: Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 769 Lines: 18 In message you write: > in Linux 2.2.20 I have a problem renaming the network interface dummy0 > to eth0 and then starting a regular ethernet driver --- I would like > it to come up as eth1 but it comes up as eth0, messing up the dummy0 > interface. What an interesting corner case. I assume that there's a good (speed?) reason why init_etherdev doesn't use dev_alloc_name() or equiv. Of course, I never thought anyone would rename interfaces to the names of other interfaces, so I never honestly considered it. It was more for large ISPs to name interfaces after their clients (I had a pppd hack that did this), for filtering and easy identification purposes. Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. From owner-netdev@oss.sgi.com Mon Dec 17 22:36:25 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBI6aP219172 for netdev-outgoing; Mon, 17 Dec 2001 22:36:25 -0800 Received: from netbank.com.br (IDENT:postfix@garrincha.netbank.com.br [200.203.199.88]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBI6a7o19165 for ; Mon, 17 Dec 2001 22:36:08 -0800 Received: from brinquedo.distro.conectiva (1-218.ctame701-2.telepar.net.br [200.181.138.218]) by netbank.com.br (Postfix) with ESMTP id 3D24646823; Tue, 18 Dec 2001 03:35:51 -0200 (BRDT) Received: by brinquedo.distro.conectiva (Postfix, from userid 501) id BA1CDC44B; Tue, 18 Dec 2001 03:35:52 -0200 (BRST) Date: Tue, 18 Dec 2001 03:35:52 -0200 From: Arnaldo Carvalho de Melo To: "David S. Miller" Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, Linux Kernel Mailing List Subject: [PATCH][RFC 2] cleaning up struct sock Message-ID: <20011218033552.B910@conectiva.com.br> Mail-Followup-To: Arnaldo Carvalho de Melo , "David S. Miller" , SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, Linux Kernel Mailing List References: <20011210230810.C896@conectiva.com.br> <20011210.231826.55509210.davem@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20011210.231826.55509210.davem@redhat.com> User-Agent: Mutt/1.3.23i X-Url: http://advogato.org/person/acme Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 5102 Lines: 132 Em Mon, Dec 10, 2001 at 11:18:26PM -0800, David S. Miller escreveu: > From: Arnaldo Carvalho de Melo > Date: Mon, 10 Dec 2001 23:08:10 -0200 > > David, if you don't like it, I'll happily switch to the big fat > union idea, but I think that this is more clean and will avoid us > having to patch sock.h every time a new net stack is added to the > kernel. > > I'm a little concerned about having to allocate two objects instead of > one. ok, this new patch doesn't allocates two objects, just one like it is today, more below > These things aren't like inodes. Inodes are cached and lookup > read-multiple objects, whereas sockets are throw-away and recycled > objects. Inode allocation performance therefore isn't that critical, > but socket allocation performance is. > > Then we go back to the old problem of protocols that can be used to > embed IP and thus having to keep track of parallel state for multiple > protocols. I think your changes did not compromise that for what > we currently support though. > > I still need to think about this some more.... > > You know, actually, the protocols are the ones which call sk_alloc(). > So we could just extend sk_alloc() to take a kmem_cache_t argument. Well, in this patch I added two new fields to net_proto_family, sk_cachep and sk_size, so that the current protocols will pass this as zero without modification and the net_families[proto]->sk_cachep will get the current sk_cachep slabcache, but the ones that actually initializes the sk_cachep and sk_size members of net_proto_family will use its private slab cache that has objsize == sizeof(struct sock) + sizeof(proto_opt), with this I removed the sk->tp_pinfo altogether and use the macro TCP_PINFO like you suggested, also net_pinfo is gone and sk->protinfo is just a void pointer now. > TCP could thus make a kmem_cache_t which is sizeof(struct sock) > + sizeof(struct tcp_opt) and then set the TP_INFO pointer to "(sk + > 1)". > > Oh yes, another overhead is all the extra dereferencing. To fight > that we could make a macro that knows the above layout: > > #define TCP_PINFO(SK) ((struct tcp_opt *)((SK) + 1)) yes > So I guess we could do things your way without any of the potential > performance problems. > > It is going to be a while before I can apply something like this, I > would like to help Jens+Linus get the new block stuff in shape first. > This would obviously be a 2.5.x change too. Ok, this is just for comments, I'll do the modifications that people agree on here. The changes were rather minimal, i.e., tcp already was using this style: int tcp_foo_function(struct sock* sk) { struct tcp_opt *tp = &sk->tp_pinfo.af_tcp; and in the patch it just changes it to: struct tcp_opt *tp = TCP_PINFO(sk); in most places. I could make patches to make the IP_SK case be of similar style of the current tcp_opt usage (i.e., like the code excerpt above). This message was sent using the patch 8) All the protocols, khttpd, netfilter, etc, are already converted to this new style and the only thing that still has to be done is to remove things like daddr, saddr, rcv_saddr, dport, sport and other ipv4 specific members of struct sock. Ah, another thing is to try and make rtnetlink use sock_register prior to using sk_alloc, so that I can remove the check for in sk_alloc and net_proto_sk_size for net_families[family] being NULL. Please let me know if this is something acceptable for 2.5. The patch is still for 2.4.16, but it should apply cleanly to 2.5.1, I think. Of course I'll make sure it works with 2.5.1 if it is considered OK. I'll stop working on it for now till further comments are made by the net stack maintainers and concentrate on a new task: to do the same thing for struct inode, where, it seems, we won't even need the per fs slabcaches, just using a private void pointer 8) Here is an example of how the slabcaches are: [acme@rama2 acme]$ grep sock /proc/slabinfo unix_sock 5 10 396 1 1 1 : 17 735 2 1 0 inet_sock 17 20 800 4 4 1 : 25 207 8 4 0 sock 0 0 332 0 0 1 : 0 0 0 0 0 [acme@rama2 acme]$ And without the patch, using 2.4.16-pre1 [acme@brinquedo linux]$ grep sock /proc/slabinfo sock 76 84 1056 11 12 2 : 87 4730 19 7 0 With this we get memory savings and performance gains in addition to the much needed (IMHO) cleanup of include/net/sock.h 8) On big busy servers these savings can reach over one megabyte of kernel memory, please correct me if I'm wrong :) IPv6 sockets use about 980 bytes, so for a kernel with IPv6 compiled even as a module one can get savings even for the ipv4 sockets case. Patch available at: http://www.kernel.org/pub/linux/kernel/people/acme/v2.4/2.4.16/ sock.cleanup-5.patch.bz2 A not so complete changelog is at: http://www.kernel.org/pub/linux/kernel/people/acme/v2.4/2.4.16/ patch-2.4.16.log it can be of help in understanding this patch. Waiting for comments and testers results, TIA, - Arnaldo From owner-netdev@oss.sgi.com Mon Dec 17 23:53:13 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBI7rDG22732 for netdev-outgoing; Mon, 17 Dec 2001 23:53:13 -0800 Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBI7rAo22729 for ; Mon, 17 Dec 2001 23:53:10 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id WAA03002; Mon, 17 Dec 2001 22:51:35 -0800 Date: Mon, 17 Dec 2001 22:51:34 -0800 (PST) Message-Id: <20011217.225134.91313099.davem@redhat.com> To: acme@conectiva.com.br Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH][RFC 2] cleaning up struct sock From: "David S. Miller" In-Reply-To: <20011218033552.B910@conectiva.com.br> References: <20011210230810.C896@conectiva.com.br> <20011210.231826.55509210.davem@redhat.com> <20011218033552.B910@conectiva.com.br> X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 689 Lines: 21 From: Arnaldo Carvalho de Melo Date: Tue, 18 Dec 2001 03:35:52 -0200 the only thing that still has to be done is to remove things like daddr, saddr, rcv_saddr, dport, sport and other ipv4 specific members of struct sock Actually, I'd like to keep the first couple cache lines of struct sock the way it is :-( For hash lookups the identity + the hash next pointer fit perfectly in one cache line on nearly all platforms. Which brings me to... Please let me know if this is something acceptable for 2.5. What kind of before/after effect do you see in lat_tcp/lat_connect (from lmbench) runs? Franks a lot, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Dec 18 05:01:12 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBID1CA00954 for netdev-outgoing; Tue, 18 Dec 2001 05:01:12 -0800 Received: from netbank.com.br (IDENT:postfix@garrincha.netbank.com.br [200.203.199.88]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBID17o00942 for ; Tue, 18 Dec 2001 05:01:07 -0800 Received: from brinquedo.distro.conectiva (1-218.ctame701-2.telepar.net.br [200.181.138.218]) by netbank.com.br (Postfix) with ESMTP id A286A46811; Tue, 18 Dec 2001 10:00:49 -0200 (BRDT) Received: by brinquedo.distro.conectiva (Postfix, from userid 501) id 65860C44B; Tue, 18 Dec 2001 10:01:04 -0200 (BRST) Date: Tue, 18 Dec 2001 10:01:04 -0200 From: Arnaldo Carvalho de Melo To: "David S. Miller" Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH][RFC 2] cleaning up struct sock Message-ID: <20011218100104.A2000@conectiva.com.br> Mail-Followup-To: Arnaldo Carvalho de Melo , "David S. Miller" , SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20011210230810.C896@conectiva.com.br> <20011210.231826.55509210.davem@redhat.com> <20011218033552.B910@conectiva.com.br> <20011217.225134.91313099.davem@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20011217.225134.91313099.davem@redhat.com> User-Agent: Mutt/1.3.23i X-Url: http://advogato.org/person/acme Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1086 Lines: 29 Em Mon, Dec 17, 2001 at 10:51:34PM -0800, David S. Miller escreveu: > From: Arnaldo Carvalho de Melo > Date: Tue, 18 Dec 2001 03:35:52 -0200 > > the only thing that still has to be done is to remove things > like daddr, saddr, rcv_saddr, dport, sport and other ipv4 specific members > of struct sock > > Actually, I'd like to keep the first couple cache lines of struct > sock the way it is :-( For hash lookups the identity + the hash next > pointer fit perfectly in one cache line on nearly all platforms. fair > Which brings me to... > > Please let me know if this is something acceptable for 2.5. > > What kind of before/after effect do you see in lat_tcp/lat_connect > (from lmbench) runs? Will see today, I concentrated on the cleanup part trying not to harm performance by following the suggestions for the first patch (i.e., just one allocation, etc). I'll test it later today, at the lab, UP and SMP (4 and 8 way) and submit the results here. Apart from possible performance problems, does the patch looks OK? - Arnaldo From owner-netdev@oss.sgi.com Tue Dec 18 11:57:52 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBIJvqK04026 for netdev-outgoing; Tue, 18 Dec 2001 11:57:52 -0800 Received: from ms2.inr.ac.ru (minus.inr.ac.ru [193.233.7.97]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBIJvlo04021 for ; Tue, 18 Dec 2001 11:57:47 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA10648; Tue, 18 Dec 2001 21:57:19 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200112181857.VAA10648@ms2.inr.ac.ru> Subject: Re: fib/netdev cleanup To: ja@ssi.bg (Julian Anastasov) Date: Tue, 18 Dec 2001 21:57:19 +0300 (MSK) Cc: netdev@oss.sgi.com, davem@redhat.com In-Reply-To: from "Julian Anastasov" at Dec 15, 1 01:00:41 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 587 Lines: 23 Hello! > 1. fib_inetaddr_event deletes all routes when the last address is > deleted which is wrong, When the last address is deleted, IP is disabled on the device, and it is pretty strange to say about some routes after this they have no object to refer to. > 2. FIB never detects NETDEV_UNREGISTER because ip_ptr is already NULL See above. > 3. We better to clear nh->nh_dev on NETDEV_UNREGISTER, fatal for > multipath routes in 2.4 (problem known from some time ago) Oh, damn, I forgot about this our finding. Are you sure that the last chunk is enough to fix this? Alexey From owner-netdev@oss.sgi.com Tue Dec 18 12:36:13 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBIKaDj05254 for netdev-outgoing; Tue, 18 Dec 2001 12:36:13 -0800 Received: from l.himel.bg (IDENT:root@unamed.infotel.bg [212.39.68.18] (may be forged)) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBIKa6o05251 for ; Tue, 18 Dec 2001 12:36:06 -0800 Received: from linux.himel.bg (IDENT:ja@linux.himel.bg [127.0.0.1]) by l.himel.bg (8.9.3/8.9.3) with ESMTP id VAA21684; Tue, 18 Dec 2001 21:39:57 +0200 Date: Tue, 18 Dec 2001 21:39:57 +0200 (EET) From: Julian Anastasov X-X-Sender: To: cc: , Subject: Re: fib/netdev cleanup In-Reply-To: <200112181857.VAA10648@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1792 Lines: 54 Hello, On Tue, 18 Dec 2001 kuznet@ms2.inr.ac.ru wrote: > Hello! > > > 1. fib_inetaddr_event deletes all routes when the last address is > > deleted which is wrong, > > When the last address is deleted, IP is disabled on the device, > and it is pretty strange to say about some routes after this they > have no object to refer to. I can have routes through this device with preferred source IP from another device. They are removed from this forced deletion. Now I can't add route through one device without using addresses attached to it. Such example is host route through different device. It is enough to use fib_del_ifaddr for the last address (any address that is deleted), why to disable IP (and remove the other routes) if we assume that the address is not used anymore on any device as prefsrc? 2.2 leaks routes, the link routes remain after the last address is deleted, in 2.4 they are deleted with fib_del_ifaddr before the check for ifa_list==NULL but why this fib_disable_ip(...dev, 1) ... > > 2. FIB never detects NETDEV_UNREGISTER because ip_ptr is already NULL > > See above. unreachable code ... > > 3. We better to clear nh->nh_dev on NETDEV_UNREGISTER, fatal for > > multipath routes in 2.4 (problem known from some time ago) > > Oh, damn, I forgot about this our finding. > > Are you sure that the last chunk is enough to fix this? Yes, nh_dev will be valid until dev->ifindex is valid. The tests weeks ago show that it is enough. force=1 is the moment when the device disappears (unregistered) but not if the above problems remain. If not, I really don't know what will be the fix for nh_dev, it should be separate function called only for NETDEV_UNREGISTER which is not reachable for FIB ... deadlock :) > Alexey Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Tue Dec 18 13:02:01 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBIL21a06649 for netdev-outgoing; Tue, 18 Dec 2001 13:02:01 -0800 Received: from ms2.inr.ac.ru (minus.inr.ac.ru [193.233.7.97]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBIL1vo06624 for ; Tue, 18 Dec 2001 13:01:57 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA11093; Tue, 18 Dec 2001 23:01:39 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200112182001.XAA11093@ms2.inr.ac.ru> Subject: Re: fib/netdev cleanup To: ja@ssi.bg (Julian Anastasov) Date: Tue, 18 Dec 2001 23:01:39 +0300 (MSK) Cc: netdev@oss.sgi.com, davem@redhat.com In-Reply-To: from "Julian Anastasov" at Dec 18, 1 09:39:57 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 831 Lines: 26 Hello! > I can have routes through this device with preferred source IP > from another device. You cannot do this on device where IP is not enabled. The thing which you has proposed is just impossibility to disable once enabled IP. > Yes, nh_dev will be valid until dev->ifindex is valid. > The tests weeks ago show that it is enough. force=1 is the moment > when the device disappears (unregistered) but not if the above > problems remain. I recalled the discussion from mailboxes... Yes, the point is valid. Well, probably, fib_sync_down() should be called directly on unreg event. It is not a good idea to call fib_disable_ip twice, which would happen if you do check for nil dev->ip_ptr... Ugly, anyway. Probably, for 2.4 it is enough safe just to leave stray ifindex. It is bug, but not a hard bug at least. Alexey From owner-netdev@oss.sgi.com Tue Dec 18 13:12:20 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBILCKu15379 for netdev-outgoing; Tue, 18 Dec 2001 13:12:20 -0800 Received: from ms2.inr.ac.ru (minus.inr.ac.ru [193.233.7.97]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBILCGo15354 for ; Tue, 18 Dec 2001 13:12:16 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA11184; Tue, 18 Dec 2001 23:11:56 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200112182011.XAA11184@ms2.inr.ac.ru> Subject: Re: network interface names ethX and renaming interfaces To: rusty@rustcorp.COM.AU (Rusty Russell) Date: Tue, 18 Dec 2001 23:11:56 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: from "Rusty Russell" at Dec 16, 1 04:45:00 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 277 Lines: 10 Hello! > What an interesting corner case. I assume that there's a good > (speed?) reason why init_etherdev doesn't use dev_alloc_name() or > equiv. It was simply heavily broken. Well, if all was right in x.y, there were no reasons to work n kernel x.y+1. Right? :-) Alexey From owner-netdev@oss.sgi.com Tue Dec 18 13:29:03 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBILT3e16051 for netdev-outgoing; Tue, 18 Dec 2001 13:29:03 -0800 Received: from l.himel.bg (IDENT:root@unamed.infotel.bg [212.39.68.18] (may be forged)) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBILSso16041 for ; Tue, 18 Dec 2001 13:28:55 -0800 Received: from linux.himel.bg (IDENT:ja@linux.himel.bg [127.0.0.1]) by l.himel.bg (8.9.3/8.9.3) with ESMTP id WAA23026; Tue, 18 Dec 2001 22:33:13 +0200 Date: Tue, 18 Dec 2001 22:33:13 +0200 (EET) From: Julian Anastasov X-X-Sender: To: cc: , Subject: Re: fib/netdev cleanup In-Reply-To: <200112182001.XAA11093@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 2355 Lines: 69 Hello, On Tue, 18 Dec 2001 kuznet@ms2.inr.ac.ru wrote: > Hello! > > > I can have routes through this device with preferred source IP > > from another device. > > You cannot do this on device where IP is not enabled. > > The thing which you has proposed is just impossibility to disable once > enabled IP. Is it valid to: ifconfig eth0 192.168.0.1 up ifconfig eth1 0.0.0.0 up <--- enable IP echo 1 > /proc/sys/net/ipv4/conf/all/rp_filter echo 1 > /proc/sys/net/ipv4/conf/eth0/rp_filter echo 1 > /proc/sys/net/ipv4/conf/eth1/rp_filter echo 1 > /proc/sys/net/ipv4/conf/eth0/proxy_arp echo 1 > /proc/sys/net/ipv4/conf/eth1/proxy_arp ip route add 192.168.0.2 dev eth1 src 192.168.0.1 <--- ip addr add 10.0.0.1/24 brd + dev eth1 ip addr del 10.0.0.1/24 brd + dev eth1 ops, the last command deleted the route to 192.168.0.2 in 2.2 it leaks link route to 10.0.0.0/24 > > Yes, nh_dev will be valid until dev->ifindex is valid. > > The tests weeks ago show that it is enough. force=1 is the moment > > when the device disappears (unregistered) but not if the above > > problems remain. > > I recalled the discussion from mailboxes... Yes, the point is valid. > > Well, probably, fib_sync_down() should be called directly on unreg event. > It is not a good idea to call fib_disable_ip twice, which would happen > if you do check for nil dev->ip_ptr... Ugly, anyway. The question is whether we should disable IP when the last address is deleted. Under "disable IP" you mean to delete all routes through this device but on this device remain only the routes with prefsrc from another device. I don't see a good reason to do it, may be I'm missing something? Note that this does not contradict with the no_addr check in fib_validate_source. May be some routes without prefsrc will remain until the device is marked down or unregistered. But these routes don't have prefsrc. They are added from user and can remain even after the last address is deleted? And there are other routes with prefsrc from another device, like the above example. In any case the deleted address has nothing to do with these routes. > Probably, for 2.4 it is enough safe just to leave stray ifindex. > It is bug, but not a hard bug at least. Yes, 2.4 cries too much with multipath routes on unregistering one of the paths. > Alexey Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Tue Dec 18 13:37:03 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBILb3A16558 for netdev-outgoing; Tue, 18 Dec 2001 13:37:03 -0800 Received: from ms2.inr.ac.ru (minus.inr.ac.ru [193.233.7.97]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBILaxo16555 for ; Tue, 18 Dec 2001 13:36:59 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id XAA11343; Tue, 18 Dec 2001 23:36:42 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200112182036.XAA11343@ms2.inr.ac.ru> Subject: Re: fib/netdev cleanup To: ja@ssi.bg (Julian Anastasov) Date: Tue, 18 Dec 2001 23:36:42 +0300 (MSK) Cc: netdev@oss.sgi.com, davem@redhat.com In-Reply-To: from "Julian Anastasov" at Dec 18, 1 10:33:13 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 595 Lines: 29 Hello! > ifconfig eth1 0.0.0.0 up <--- enable IP It is the only official way to create addressless IP interfaces. > ops, the last command deleted the route to 192.168.0.2 And this is the only official way to destroy an IP interface. > in 2.2 it leaks link route to 10.0.0.0/24 Well, it is bug. Old kenele must have some bugs eventually. :-) > The question is whether we should disable IP when the > last address is deleted. MUST. > Under "disable IP" you mean Under disable I means disable. No IP on this interface. No input, no output, no references, full shutdown. Alexey From owner-netdev@oss.sgi.com Tue Dec 18 13:52:41 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBILqf917480 for netdev-outgoing; Tue, 18 Dec 2001 13:52:41 -0800 Received: from perninha.conectiva.com.br (perninha.conectiva.com.br [200.250.58.156]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBILqVo17468 for ; Tue, 18 Dec 2001 13:52:32 -0800 Received: from brinquedo.distro.conectiva (dhcp047.distro.conectiva [10.0.20.47]) by perninha.conectiva.com.br (Postfix) with ESMTP id 0CDD538C15; Tue, 18 Dec 2001 17:52:11 -0300 (EST) Received: by brinquedo.distro.conectiva (Postfix, from userid 501) id DE587C44B; Tue, 18 Dec 2001 18:52:00 -0200 (BRST) Date: Tue, 18 Dec 2001 18:52:00 -0200 From: Arnaldo Carvalho de Melo To: "David S. Miller" Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH][RFC 2] cleaning up struct sock Message-ID: <20011218185200.A1211@conectiva.com.br> Mail-Followup-To: Arnaldo Carvalho de Melo , "David S. Miller" , SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20011210230810.C896@conectiva.com.br> <20011210.231826.55509210.davem@redhat.com> <20011218033552.B910@conectiva.com.br> <20011217.225134.91313099.davem@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20011217.225134.91313099.davem@redhat.com> User-Agent: Mutt/1.3.23i X-Url: http://advogato.org/person/acme Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1508 Lines: 39 Em Mon, Dec 17, 2001 at 10:51:34PM -0800, David S. Miller escreveu: > Which brings me to... > > Please let me know if this is something acceptable for 2.5. > > What kind of before/after effect do you see in lat_tcp/lat_connect > (from lmbench) runs? Improvements on the lat_connect case? :) 2.4.16 TCP latency using 127.0.0.1: 119.3369 microseconds TCP latency using 127.0.0.1: 118.9847 microseconds TCP latency using 127.0.0.1: 118.5139 microseconds TCP latency using 127.0.0.1: 119.1301 microseconds TCP latency using 127.0.0.1: 118.6322 microseconds TCP/IP connection cost to 127.0.0.1: 429.6667 microseconds TCP/IP connection cost to 127.0.0.1: 430.7692 microseconds TCP/IP connection cost to 127.0.0.1: 431.4615 microseconds TCP/IP connection cost to 127.0.0.1: 430.3846 microseconds TCP/IP connection cost to 127.0.0.1: 435.4615 microseconds 2.4.16-acme5 TCP latency using 127.0.0.1: 119.2639 microseconds TCP latency using 127.0.0.1: 118.6068 microseconds TCP latency using 127.0.0.1: 119.0443 microseconds TCP latency using 127.0.0.1: 119.5683 microseconds TCP latency using 127.0.0.1: 118.9556 microseconds TCP/IP connection cost to 127.0.0.1: 408.3571 microseconds TCP/IP connection cost to 127.0.0.1: 409.6429 microseconds TCP/IP connection cost to 127.0.0.1: 410.6429 microseconds TCP/IP connection cost to 127.0.0.1: 409.2143 microseconds TCP/IP connection cost to 127.0.0.1: 414.8333 microseconds More results are coming, this time for local connections on a 8-way box - Arnaldo From owner-netdev@oss.sgi.com Tue Dec 18 13:58:27 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBILwRl17722 for netdev-outgoing; Tue, 18 Dec 2001 13:58:27 -0800 Received: from hera.cwi.nl (hera.cwi.nl [192.16.191.8]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBILwNo17715 for ; Tue, 18 Dec 2001 13:58:23 -0800 Received: from qaip3 (bovenra.ins.cwi.nl [192.16.196.174]) by hera.cwi.nl with ESMTP id VAA06329 for ; Tue, 18 Dec 2001 21:57:44 +0100 (MET) Received: from qaip3 ([127.0.0.1] helo=qaip3.ins.cwi.nl) by qaip3 with esmtp (Exim 3.33 #1 (Debian)) id 16GRHO-0000Es-00; Tue, 18 Dec 2001 21:56:02 +0100 X-Mailer: exmh version 2.3.1 01/18/2001 (debian 2.3.1-1) with nmh-1.0.4+dev To: Rusty Russell cc: Neale Banks , netdev@oss.sgi.com From: Hein Roehrig Subject: Re: network interface names ethX and renaming interfaces In-Reply-To: Message from Rusty Russell of "Sun, 16 Dec 2001 11:01:28 +1100." References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Tue, 18 Dec 2001 21:56:02 +0100 Message-Id: Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 974 Lines: 27 So is anybody already working on a fix? If not, I would try to produce one over Xmas; your input & advice would be welcome. -Hein PS: what is netdev@oss.sgi.com? > In message you write: > > in Linux 2.2.20 I have a problem renaming the network interface dummy0 > > to eth0 and then starting a regular ethernet driver --- I would like > > it to come up as eth1 but it comes up as eth0, messing up the dummy0 > > interface. > > What an interesting corner case. I assume that there's a good > (speed?) reason why init_etherdev doesn't use dev_alloc_name() or > equiv. > > Of course, I never thought anyone would rename interfaces to the names > of other interfaces, so I never honestly considered it. It was more > for large ISPs to name interfaces after their clients (I had a pppd > hack that did this), for filtering and easy identification purposes. > > Rusty. > -- > Anyone who quotes me in their sig is an idiot. -- Rusty Russell. From owner-netdev@oss.sgi.com Tue Dec 18 14:06:00 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBIM60118381 for netdev-outgoing; Tue, 18 Dec 2001 14:06:00 -0800 Received: from l.himel.bg (IDENT:root@unamed.infotel.bg [212.39.68.18] (may be forged)) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBIM5to18378 for ; Tue, 18 Dec 2001 14:05:55 -0800 Received: from linux.himel.bg (IDENT:ja@linux.himel.bg [127.0.0.1]) by l.himel.bg (8.9.3/8.9.3) with ESMTP id XAA23687; Tue, 18 Dec 2001 23:10:46 +0200 Date: Tue, 18 Dec 2001 23:10:46 +0200 (EET) From: Julian Anastasov X-X-Sender: To: cc: , Subject: Re: fib/netdev cleanup In-Reply-To: <200112182036.XAA11343@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 821 Lines: 35 Hello, On Tue, 18 Dec 2001 kuznet@ms2.inr.ac.ru wrote: > > ops, the last command deleted the route to 192.168.0.2 > > And this is the only official way to destroy an IP interface. This is very unofficial way to stop IP on interface :) May be official for non-Linux :) > > The question is whether we should disable IP when the > > last address is deleted. > > MUST. > > > > Under "disable IP" you mean > > Under disable I means disable. No IP on this interface. No input, no output, > no references, full shutdown. Hm, it does not sound good to me. May be because I can't imagine where it is used. IMO, the user should take care what he adds and then to delete it. Why to disable valid usage of routes for independent subnets. I'm not happy but anyway :) > Alexey Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Tue Dec 18 14:09:01 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBIM91j18554 for netdev-outgoing; Tue, 18 Dec 2001 14:09:01 -0800 Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBIM8wo18530 for ; Tue, 18 Dec 2001 14:08:58 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id NAA08696; Tue, 18 Dec 2001 13:08:10 -0800 Date: Tue, 18 Dec 2001 13:08:09 -0800 (PST) Message-Id: <20011218.130809.22018359.davem@redhat.com> To: acme@conectiva.com.br Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH][RFC 2] cleaning up struct sock From: "David S. Miller" In-Reply-To: <20011218185200.A1211@conectiva.com.br> References: <20011218033552.B910@conectiva.com.br> <20011217.225134.91313099.davem@redhat.com> <20011218185200.A1211@conectiva.com.br> X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 667 Lines: 18 From: Arnaldo Carvalho de Melo Date: Tue, 18 Dec 2001 18:52:00 -0200 Em Mon, Dec 17, 2001 at 10:51:34PM -0800, David S. Miller escreveu: > Which brings me to... > > Please let me know if this is something acceptable for 2.5. > > What kind of before/after effect do you see in lat_tcp/lat_connect > (from lmbench) runs? Improvements on the lat_connect case? :) Great. I have no fundamental problems with your changes. Now, when/if we move the hash-chain/identity members into the IPv4 struct (there are some issues with this wrt. ipv6 btw) I will be interested in seeing the same tests done :-) From owner-netdev@oss.sgi.com Tue Dec 18 16:33:04 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBJ0X4723860 for netdev-outgoing; Tue, 18 Dec 2001 16:33:04 -0800 Received: from marina.lowendale.com.au (neale@gw.lowendale.com.au [203.26.242.120]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBJ0Wro23854 for ; Tue, 18 Dec 2001 16:32:53 -0800 Received: from localhost (neale@localhost) by marina.lowendale.com.au (8.9.3/8.9.3/Debian/GNU) with ESMTP id KAA03251; Wed, 19 Dec 2001 10:49:21 +1100 Date: Wed, 19 Dec 2001 10:49:16 +1100 (EST) From: Neale Banks To: Hein Roehrig cc: Rusty Russell , netdev@oss.sgi.com Subject: Re: network interface names ethX and renaming interfaces In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 2729 Lines: 76 On Tue, 18 Dec 2001, Hein Roehrig wrote: > So is anybody already working on a fix? If not, I would try to > produce one over Xmas; your input & advice would be welcome. Not exactly "working on a fix", but I did take a closer look since my earlier email. I don't claim to be a seasoned hacker (i.e. probably missed a few things) - this is what I did pick up: (1) lack of bounds checking on the index of statically-dimensioned arrays like *ethdev_index[MAX_ETH_CARDS] in net_init.c (2) init_etherdev() can return a NULL pointer. I can (so far) find a few ethernet drivers which specifically contemplate this possibility (ne2k-pci.c 3c527.c 3c59x.c 8139too.c acenic.c arlan.c). Conversely, at least these appear not to contemplate the return of NULL: 3c507.c 3c509.c 3c515.c ac3200.c (others don't appear use the return value). Note that this was only a very quick grep of some of the drivers. (3) (OK, this one's a nitpick) assignment of a value to a variable in it's declaration (or is that definition?) then immediately assigning it another value in a "for" loop (yeah, compiler's optimisation should pick this up - but IMVHO that misses the point - it's very unpretty). Appended patch starts on (1) and cleans up one instance of (3) (as well as being incomplete, I have a nagging suspicion of an off-by-one error). Point (2) initially looked too scary for me but on looking closer at the drivers which appear to do the Right Thing maybe it's not so bad (but perhaps the ettiquette of getting such an aggregate patch accepted should scare me ;-). BTW, I suspect a different slant on your case would be to bring up say eth0, rename that to eth1 and try to bring up another ethernet (exact reaction depends on point (2) above, but could be significantly different to what you have experienced from renaming dummy to eth). > PS: what is netdev@oss.sgi.com? I don't know (but I suspect a polite wave to Keith may be in order). Regards, Neale. --- linux-2.2.20-orig/drivers/net/net_init.c Sun Nov 11 19:14:23 2001 +++ linux-2.2.20-ntb/drivers/net/net_init.c Wed Dec 19 09:54:23 2001 @@ -104,6 +104,10 @@ goto found; } } + if (i>=MAX_ETH_CARDS) { + printk("init_etherdev: FATAL - too many eth devs.\n"); + return NULL; + } alloc_size &= ~3; /* Round to dword boundary. */ @@ -127,6 +131,7 @@ ethdev_index[i] = dev; break; } + /* FIXME??? Need to check (i>=MAX_ETH_CARDS)??? */ } ether_setup(dev); /* Hmmm, should this be called here? */ @@ -468,8 +473,7 @@ static int etherdev_get_index(struct device *dev) { - int i=MAX_ETH_CARDS; - + int i; for (i = 0; i < MAX_ETH_CARDS; ++i) { if (ethdev_index[i] == NULL) { sprintf(dev->name, "eth%d", i); From owner-netdev@oss.sgi.com Tue Dec 18 17:05:57 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBJ15vL24777 for netdev-outgoing; Tue, 18 Dec 2001 17:05:57 -0800 Received: from www.dangermen.com (mawi057-249.dsl.tds.net [208.171.57.249]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBJ15ko24774 for ; Tue, 18 Dec 2001 17:05:47 -0800 Received: (from jleu@localhost) by www.dangermen.com (8.9.3/8.9.3) id SAA31482; Tue, 18 Dec 2001 18:05:25 -0600 X-Authentication-Warning: www.dangermen.com: jleu set sender to jleu@mindspring.com using -f Date: Tue, 18 Dec 2001 18:05:25 -0600 From: "James R. Leu" To: Julian Anastasov Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, davem@redhat.com Subject: Re: fib/netdev cleanup Message-ID: <20011218180525.D26734@dangermen.com> Reply-To: jleu@mindspring.com References: <200112182001.XAA11093@ms2.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: ; from ja@ssi.bg on Tue, Dec 18, 2001 at 10:33:13PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 3037 Lines: 95 Comments below ... On Tue, Dec 18, 2001 at 10:33:13PM +0200, Julian Anastasov wrote: > > Hello, > > On Tue, 18 Dec 2001 kuznet@ms2.inr.ac.ru wrote: > > > Hello! > > > > > I can have routes through this device with preferred source IP > > > from another device. > > > > You cannot do this on device where IP is not enabled. > > > > The thing which you has proposed is just impossibility to disable once > > enabled IP. > > Is it valid to: > > ifconfig eth0 192.168.0.1 up > ifconfig eth1 0.0.0.0 up <--- enable IP > echo 1 > /proc/sys/net/ipv4/conf/all/rp_filter > echo 1 > /proc/sys/net/ipv4/conf/eth0/rp_filter > echo 1 > /proc/sys/net/ipv4/conf/eth1/rp_filter > echo 1 > /proc/sys/net/ipv4/conf/eth0/proxy_arp > echo 1 > /proc/sys/net/ipv4/conf/eth1/proxy_arp > ip route add 192.168.0.2 dev eth1 src 192.168.0.1 <--- > > ip addr add 10.0.0.1/24 brd + dev eth1 > ip addr del 10.0.0.1/24 brd + dev eth1 > ops, the last command deleted the route to 192.168.0.2 I agree with Julian. I think there needs to be a way for an IP interface to remember that it has an "addressless" address attached to it. To me if I do: ifconfig eth1 0.0.0.0 up ip addr add 10.0.0.1/24 brd + dev eth1 Then I should have to do: ip addr del 10.0.0.1/24 brd + dev eth1 ip addr del 0.0.0.0 dev eth1 (or some equivalent) Before IP is completly removed from the interface. I'll go back to lurking now. Jim > in 2.2 it leaks link route to 10.0.0.0/24 > > > > Yes, nh_dev will be valid until dev->ifindex is valid. > > > The tests weeks ago show that it is enough. force=1 is the moment > > > when the device disappears (unregistered) but not if the above > > > problems remain. > > > > I recalled the discussion from mailboxes... Yes, the point is valid. > > > > Well, probably, fib_sync_down() should be called directly on unreg event. > > It is not a good idea to call fib_disable_ip twice, which would happen > > if you do check for nil dev->ip_ptr... Ugly, anyway. > > The question is whether we should disable IP when the > last address is deleted. Under "disable IP" you mean to delete > all routes through this device but on this device remain only > the routes with prefsrc from another device. I don't see a good > reason to do it, may be I'm missing something? Note that this > does not contradict with the no_addr check in fib_validate_source. > May be some routes without prefsrc will remain until the device > is marked down or unregistered. But these routes don't have > prefsrc. They are added from user and can remain even after the > last address is deleted? And there are other routes with prefsrc > from another device, like the above example. In any case the > deleted address has nothing to do with these routes. > > > Probably, for 2.4 it is enough safe just to leave stray ifindex. > > It is bug, but not a hard bug at least. > > Yes, 2.4 cries too much with multipath routes on > unregistering one of the paths. > > > Alexey > > Regards > > -- > Julian Anastasov -- James R. Leu jleu@mindspring.com From owner-netdev@oss.sgi.com Tue Dec 18 18:22:48 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBJ2Mmo26371 for netdev-outgoing; Tue, 18 Dec 2001 18:22:48 -0800 Received: from netbank.com.br (IDENT:postfix@garrincha.netbank.com.br [200.203.199.88]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBJ2Mho26368 for ; Tue, 18 Dec 2001 18:22:44 -0800 Received: from brinquedo.distro.conectiva (1-218.ctame701-2.telepar.net.br [200.181.138.218]) by netbank.com.br (Postfix) with ESMTP id 9CCC546821; Tue, 18 Dec 2001 23:22:15 -0200 (BRDT) Received: by brinquedo.distro.conectiva (Postfix, from userid 501) id 540FFC44B; Tue, 18 Dec 2001 23:22:23 -0200 (BRST) Date: Tue, 18 Dec 2001 23:22:22 -0200 From: Arnaldo Carvalho de Melo To: "David S. Miller" Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH][RFC 2] cleaning up struct sock Message-ID: <20011218232222.A1963@conectiva.com.br> Mail-Followup-To: Arnaldo Carvalho de Melo , "David S. Miller" , SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20011218033552.B910@conectiva.com.br> <20011217.225134.91313099.davem@redhat.com> <20011218185200.A1211@conectiva.com.br> <20011218.130809.22018359.davem@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20011218.130809.22018359.davem@redhat.com> User-Agent: Mutt/1.3.23i X-Url: http://advogato.org/person/acme Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1219 Lines: 27 Em Tue, Dec 18, 2001 at 01:08:09PM -0800, David S. Miller escreveu: > From: Arnaldo Carvalho de Melo > > What kind of before/after effect do you see in lat_tcp/lat_connect > > (from lmbench) runs? > > Improvements on the lat_connect case? :) > > Great. I have no fundamental problems with your changes. Ok, I assume that these changes will get integrated in 2.5 then, does anybody disagree? Holler now or forever hold your peace (tm) Linus. 8) > Now, when/if we move the hash-chain/identity members into > the IPv4 struct (there are some issues with this wrt. ipv6 btw) I will > be interested in seeing the same tests done :-) I tried it but had problems with incoming connections not being stablished, and you said that you preferred that it stayed there, on the first cacheline, so I preferred to submit the patch without that and after this gets integrated and more widely tested in 2.5 I'll revisit this issue and try again, trying to look into these issues with IPv6 you've mentioned, BTW, could you elaborate on that? :) Now to the next tasks: dedicate some more time to finish the LLC stack for 2.4 and move on to cleanup struct inode (fs.h) for 2.5. - Arnaldo From owner-netdev@oss.sgi.com Tue Dec 18 19:34:07 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBJ3Y7j28934 for netdev-outgoing; Tue, 18 Dec 2001 19:34:07 -0800 Received: from www.linux.org.uk (IDENT:exim@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBJ3Xqo28927 for ; Tue, 18 Dec 2001 19:33:52 -0800 Received: from adsl-77-230-81.asm.bellsouth.net ([216.77.230.81] helo=mandrakesoft.com) by www.linux.org.uk with esmtp (Exim 3.33 #5) id 16GWYH-0003kK-00; Wed, 19 Dec 2001 02:33:49 +0000 Message-ID: <3C1FFC8B.6A861BAF@mandrakesoft.com> Date: Tue, 18 Dec 2001 21:33:47 -0500 From: Jeff Garzik Organization: MandrakeSoft X-Mailer: Mozilla 4.79 [en] (X11; U; Linux 2.4.17-pre8 i686) X-Accept-Language: en MIME-Version: 1.0 To: Linux-Kernel list , netdev@oss.sgi.com Subject: Patchkit 2.4.17-rc2-jg1 Content-Type: multipart/mixed; boundary="------------C2EDFC9F1BE16EA69ABD6078" Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 3575 Lines: 146 This is a multi-part message in MIME format. --------------C2EDFC9F1BE16EA69ABD6078 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit hehe, seems like everyone has their own patchkit these days :) I probably won't be announcing each release, just the ones of interest or that need wider testing. All of these patches are likely bound for Marcelo, after testing. Most of these patches relate to net drivers. Available on your favorite kernel.org mirror soon, ftp://ftp.??.kernel.org/pub/linux/kernel/people/jgarzik/patchkits/2.4/2.4.17-rc2-jg1.tar.gz Patches and comments welcome. Attached is the Contents file for the jg1 tarball. More changes are coming soon (hi Jes!). Most if not all these patches are available from the 'linux_2_4' module of gkernel cvs. http://sf.net/projects/gkernel/ -- Jeff Garzik | Only so many songs can be sung Building 1024 | with two lips, two lungs, and one tongue. MandrakeSoft | - nomeansno --------------C2EDFC9F1BE16EA69ABD6078 Content-Type: text/plain; charset=us-ascii; name="contents.txt" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="contents.txt" Version: 2.4.17-rc2-jg1 Released: Dec 18, 2001 19_netif_msg.patch Add netif_msg_hw and netif_msg_wol to netif_msg API. Required by 20_natsemi. Reminder to self: since this is not in the mainline as I previously thought, netif_msg_hw should not be added since it's too general. 20_8139too_rx_reset.patch Experimental RX reset which should be much improved over the existing code, but needs wider testing. 20_alpha_jensen_noop.patch From lkml, author ??? , fix jensen build. 20_clgenfb.patch From ??? , fix a clgen chip 20_dl2k.patch dl2k bug fixes from vendor 20_ipv6_fix.patch Fix procfs function definition 20_sis900.patch sis900 bug fixes from vendor 40_8139cp_no_copybreak.patch Remove copybreak code, 8139C+ has no alignment restrictions (improves performance) 50_mii_interface.patch Add ethtool MII helper routines 60_8139cp_misc.patch Un-inline rx error accounting function (Jes's suggestion) Only allow ioctl while interface is open. 60_8139too_misc.patch Only allow ioctl while interface is open. 60_ethtool_basic.patch Implement basic ethtool support for 3c501, 3c503, 3c505, 3c507, 3c509, 3c515, 3c523, 3c527. Implement additional ethtool media support for starfire. 60_ethtool_pcmcia.patch Implement ethtool support for several pcmcia drivers, including media support where possible and easy. 60_natsemi.patch Implement more ethtool support for natsemi. (Tim Hockin) 72_8139cp_mii.patch Implement MII interface support for 8139cp, and related ethtool ioctls. 73_8139too_mii.patch Implement MII interface support for 8139too, and related ethtool ioctls. 75_ethtool_mii.patch Implement MII interface support and related ioctls for epic100, fealnx, via-rhine and winbond. 80_de2104x.patch Add new de2104x driver for 2104x tulip chips. Version 0.5.3. 80_jiffies.patch From lkml, author ???, jiffies cleanup for eexpress, ne2k-pci, and slip. Needs review before going to Marcelo and Linus. 90_mii_constants.patch Use constants from mii.h rather than magic numbers. 90_suser.patch Janitor cleanup, author Morten Helgesen: s/susers/capable/ for fealnx. 99_8139cp_changelog.patch 99_8139too_changelog.patch 99_credits.patch 99_maintainers.patch Various text updates 99_cs89x0_printk.patch Clean up / fix printk output (Andrew Morton) --------------C2EDFC9F1BE16EA69ABD6078-- From owner-netdev@oss.sgi.com Wed Dec 19 04:18:19 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBJCIJO08995 for netdev-outgoing; Wed, 19 Dec 2001 04:18:19 -0800 Received: from l.himel.bg (IDENT:root@unamed.infotel.bg [212.39.68.18] (may be forged)) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBJCI9o08992 for ; Wed, 19 Dec 2001 04:18:11 -0800 Received: from linux.himel.bg (IDENT:ja@linux.himel.bg [127.0.0.1]) by l.himel.bg (8.9.3/8.9.3) with ESMTP id NAA03154; Wed, 19 Dec 2001 13:22:08 +0200 Date: Wed, 19 Dec 2001 13:22:08 +0200 (EET) From: Julian Anastasov X-X-Sender: To: "James R. Leu" cc: , , Subject: Re: fib/netdev cleanup In-Reply-To: <20011218180525.D26734@dangermen.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 2044 Lines: 60 Hello, On Tue, 18 Dec 2001, James R. Leu wrote: > I think there needs to be a way for an IP interface to remember > that it has an "addressless" address attached to it. To me if I do: route without prefsrc from this device ... > ifconfig eth1 0.0.0.0 up > ip addr add 10.0.0.1/24 brd + dev eth1 > > Then I should have to do: > > ip addr del 10.0.0.1/24 brd + dev eth1 > ip addr del 0.0.0.0 dev eth1 Hm, good idea. I forgot about inet_del_ifa and its flag destroy. I see that inet_insert_ifa has check for 0.0.0.0. Can we add similar check in the deletion path, i.e. only del 0.0.0.0 to call inetdev_destroy. All other inet_del_ifa should not call inetdev_destroy. The del 0.0.0.0 event should be propagated somehow. ifconfig dev 0.0.0.0 will not work (only enable) but ip addr del 0.0.0.0 works in this scheme. It will delete all present addresses and routes. We will have symmetry when enabling and disabling IP for interface. What I don't know is who really relies on the fact that the last address stops IP. In fact, the no_addr check in fib_validate_source is enough to stop any incoming IP traffic from device without addresses and routes. So, may be we have to check the sending path, whether we break some expectations. Another thing: may be we can teach fib_sync_down to release ifa_local 0.0.0.0, i.e. all routes without prefsrc when the last address is deleted, instead of disabling IP but it is not very good. If the above way to stop IP for interface works then fib_inetaddr_event simply will call the evil 'fib_disable_ip(ifa->ifa_dev->dev, 1);' only when someone deletes 0.0.0.0 (we will have to check ifa->ifa_dev->dev->ip_ptr to know when IP is deleted or to do it for ifa_local 0.0.0.0). And of course, on NETDEV_UNREGISTER. So, may be we can find a way to officially enable and disable IP? If such idea sounds good I can go further and to try it. > (or some equivalent) > > Before IP is completly removed from the interface. > > I'll go back to lurking now. > > Jim Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Wed Dec 19 20:26:39 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBK4QdU17404 for netdev-outgoing; Wed, 19 Dec 2001 20:26:39 -0800 Received: from netbank.com.br (garrincha.netbank.com.br [200.203.199.88]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBK4QVX17400 for ; Wed, 19 Dec 2001 20:26:32 -0800 Received: from brinquedo.distro.conectiva (1-218.ctame701-2.telepar.net.br [200.181.138.218]) by netbank.com.br (Postfix) with ESMTP id 8B3114683B; Thu, 20 Dec 2001 01:25:12 -0200 (BRDT) Received: by brinquedo.distro.conectiva (Postfix, from userid 501) id 719A3C44B; Thu, 20 Dec 2001 01:23:39 -0200 (BRST) Date: Thu, 20 Dec 2001 01:23:39 -0200 From: Arnaldo Carvalho de Melo To: "David S. Miller" Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: [PATCH][RFC 3] cleaning up struct sock Message-ID: <20011220012339.A919@conectiva.com.br> Mail-Followup-To: Arnaldo Carvalho de Melo , "David S. Miller" , SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20011218033552.B910@conectiva.com.br> <20011217.225134.91313099.davem@redhat.com> <20011218185200.A1211@conectiva.com.br> <20011218.130809.22018359.davem@redhat.com> <20011218232222.A1963@conectiva.com.br> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20011218232222.A1963@conectiva.com.br> User-Agent: Mutt/1.3.23i X-Url: http://advogato.org/person/acme Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 950 Lines: 29 Ok, patch for 2.5.1, without the bogus cvs $id strings hunks, being used in this machine now. Available at: http://www.kernel.org/pub/linux/kernel/people/acme/v2.5/2.5.1/ sock.cleanup-2.5.1.patch.bz2 Ah, the lat_unix_connect results on a pentium 300 mmx notebook: 2.5.1 + this patch UNIX connection cost : 96.1749 microseconds UNIX connection cost : 96.3361 microseconds UNIX connection cost : 97.2310 microseconds UNIX connection cost : 101.9180 microseconds UNIX connection cost : 97.2461 microseconds 2.4.16 pristine UNIX connection cost : 112.7034 microseconds UNIX connection cost : 114.5494 microseconds UNIX connection cost : 114.0923 microseconds UNIX connection cost : 111.0959 microseconds UNIX connection cost : 120.8419 microseconds And about 100 KB of kernel memory saved for AF_UNIX sockets on a basic KDE session (i.e., the AF_UNIX struct sock now is about 400 bytes when it is about 1200 bytes on a pristine kernel). - Arnaldo From owner-netdev@oss.sgi.com Thu Dec 20 01:22:34 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBK9MYk23782 for netdev-outgoing; Thu, 20 Dec 2001 01:22:34 -0800 Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBK9MVX23777 for ; Thu, 20 Dec 2001 01:22:31 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id AAA05868; Thu, 20 Dec 2001 00:21:27 -0800 Date: Thu, 20 Dec 2001 00:21:26 -0800 (PST) Message-Id: <20011220.002126.119272610.davem@redhat.com> To: acme@conectiva.com.br Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH][RFC 3] cleaning up struct sock From: "David S. Miller" In-Reply-To: <20011220012339.A919@conectiva.com.br> References: <20011218.130809.22018359.davem@redhat.com> <20011218232222.A1963@conectiva.com.br> <20011220012339.A919@conectiva.com.br> X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 647 Lines: 23 From: Arnaldo Carvalho de Melo Date: Thu, 20 Dec 2001 01:23:39 -0200 Available at: http://www.kernel.org/pub/linux/kernel/people/acme/v2.5/2.5.1/ sock.cleanup-2.5.1.patch.bz2 Looking pretty good. I have one improvement. I'd rather you pass the "kmem_cache_t" directly into sk_alloc, use NULL for "I don't have any extra private area". And then, for the IP case lay it out like this: struct sock struct ip_opt struct {tcp,raw4,...}_opt And use different kmem_cache_t's for each protocol instead of the same one for tcp, raw4, etc. RAW/UDP sockets waste a lot of space with your current layout. From owner-netdev@oss.sgi.com Thu Dec 20 02:36:20 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBKAaK525706 for netdev-outgoing; Thu, 20 Dec 2001 02:36:20 -0800 Received: from coruscant.gnumonks.org (mail@coruscant.franken.de [193.174.159.226]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBKAaFX25703 for ; Thu, 20 Dec 2001 02:36:16 -0800 Received: from uucp by coruscant.gnumonks.org with local-bsmtp (Exim 3.33 #1) id 16GzcZ-0007do-00 for netdev@oss.sgi.com; Thu, 20 Dec 2001 10:36:11 +0100 Received: from laforge by sunbeam.gnumonks.org with local (Exim 3.33 #1) id 16GzZ1-0004Kz-00; Thu, 20 Dec 2001 10:32:31 +0100 Date: Thu, 20 Dec 2001 10:32:30 +0100 From: Harald Welte To: bert hubert Cc: netdev@oss.sgi.com, Netfilter Development Mailinglist Subject: Re: [PATCH] make netfilter able to change/see skb->priority Message-ID: <20011220103230.U11363@sunbeam.de.gnumonks.org> Mail-Followup-To: Harald Welte , bert hubert , netdev@oss.sgi.com, Netfilter Development Mailinglist References: <20011215165956.A31862@outpost.ds9a.nl> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.17i In-Reply-To: <20011215165956.A31862@outpost.ds9a.nl>; from ahu@ds9a.nl on Sat, Dec 15, 2001 at 04:59:56PM +0100 X-Operating-System: Linux sunbeam.de.gnumonks.org 2.4.14 X-Date: Today is Sweetmorn, the 59th day of The Aftermath in the YOLD 3167 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1226 Lines: 25 On Sat, Dec 15, 2001 at 04:59:56PM +0100, bert hubert wrote: > Rusty & others: > > Right now, netfilter can't see or touch the skb->priority of packets > generated locally because it is only set in ip_queue_xmit2, after netfilter > has been consulted. This patch moves the skb->priority=sk->priority line to > just before calling netfilter. > > I think this patch is philosophically right because it allows netfilter to > override userspace instructions, which is what we do for lots of other > targets too. We feel that it is ok to drop or mangle locally generated > packets in netfilter. I think we should do the same for skb->priority. I don't see any bad implications of your patch. What is the position of our core networking people (Dave, Andi, Alexey) to this proposal? I'd like to see this minimal change because it would extend the features of iptables - without hurting anybody else. -- Live long and prosper - Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/ ============================================================================ GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*) From owner-netdev@oss.sgi.com Thu Dec 20 03:27:36 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBKBRa927459 for netdev-outgoing; Thu, 20 Dec 2001 03:27:36 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBKBRWX27456 for ; Thu, 20 Dec 2001 03:27:32 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id 1FE8AC60FE; Thu, 20 Dec 2001 11:27:28 +0100 (CET) Date: Thu, 20 Dec 2001 11:27:28 +0100 From: bert hubert To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: [BUG/WANT TO FIX] Equal Cost Multipath Broken in 2.4.x Message-ID: <20011220112728.A11112@outpost.ds9a.nl> Mail-Followup-To: bert hubert , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1206 Lines: 37 # ip ro add dev eth0 default nexthop via 10.0.0.1 dev eth0 nexthop via 10.0.0.202 dev eth0 # ip ro ls 10.0.0.0/8 dev eth0 proto kernel scope link src 10.0.0.11 default nexthop via 10.0.0.1 dev eth0 weight 1 dead nexthop via 10.0.0.202 dev eth0 weight 1 10.0.0.1 however is far from dead, if we add yet another nexthop: # ip ro add dev eth0 default nexthop via 10.10.10.10 dev eth0 nexthop via 10.0.0.1 dev eth0 nexthop via 10.0.0.202 dev eth0 # ip ro ls 10.0.0.0/8 dev eth0 proto kernel scope link src 10.0.0.11 default nexthop via 10.10.10.10 dev eth0 weight 1 dead nexthop via 10.0.0.1 dev eth0 weight 1 nexthop via 10.0.0.202 dev eth0 weight 1 This first nexthop is *always* declared dead. Linux 2.4.x, iproute 20010824. If anybody can point me in the direction of this problem, it must be known as it has been there for a *long* time, it would be appreciated. I'll try to fix it. Thanks! Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services http://www.tk the dot in .tk Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - Linux Advanced Routing & Traffic Control: http://ds9a.nl/lartc From owner-netdev@oss.sgi.com Thu Dec 20 04:17:34 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBKCHYa28446 for netdev-outgoing; Thu, 20 Dec 2001 04:17:34 -0800 Received: from l.himel.bg (IDENT:root@unamed.infotel.bg [212.39.68.18] (may be forged)) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBKCHIX28443 for ; Thu, 20 Dec 2001 04:17:19 -0800 Received: from linux.himel.bg (IDENT:ja@linux.himel.bg [127.0.0.1]) by l.himel.bg (8.9.3/8.9.3) with ESMTP id NAA05607; Thu, 20 Dec 2001 13:21:30 +0200 Date: Thu, 20 Dec 2001 13:21:30 +0200 (EET) From: Julian Anastasov X-X-Sender: To: bert hubert cc: , , Subject: Re: [BUG/WANT TO FIX] Equal Cost Multipath Broken in 2.4.x In-Reply-To: <20011220112728.A11112@outpost.ds9a.nl> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1616 Lines: 59 Hello, On Thu, 20 Dec 2001, bert hubert wrote: > # ip ro add dev eth0 default nexthop via 10.0.0.1 dev eth0 nexthop via > 10.0.0.202 dev eth0 > # ip ro ls > 10.0.0.0/8 dev eth0 proto kernel scope link src 10.0.0.11 > default > nexthop via 10.0.0.1 dev eth0 weight 1 dead > nexthop via 10.0.0.202 dev eth0 weight 1 > > 10.0.0.1 however is far from dead, if we add yet another nexthop: > > # ip ro add dev eth0 default nexthop via 10.10.10.10 dev eth0 nexthop via > 10.0.0.1 dev eth0 nexthop via 10.0.0.202 dev eth0 > > # ip ro ls > 10.0.0.0/8 dev eth0 proto kernel scope link src 10.0.0.11 > default > nexthop via 10.10.10.10 dev eth0 weight 1 dead > nexthop via 10.0.0.1 dev eth0 weight 1 > nexthop via 10.0.0.202 dev eth0 weight 1 > > This first nexthop is *always* declared dead. Linux 2.4.x, iproute 20010824. > > If anybody can point me in the direction of this problem, it must be known > as it has been there for a *long* time, it would be appreciated. I'll try to Yes, I remember people to report for this problem for long time but I was not able to reproduce it. May be it could be fixed with the following change (only compiled): --- iproute2/ip/iproute.c.orig Mon Aug 6 03:31:52 2001 +++ iproute2/ip/iproute.c Thu Dec 20 13:14:06 2001 @@ -620,6 +620,8 @@ } rtnh->rtnh_len = sizeof(*rtnh); rtnh->rtnh_ifindex = 0; + rtnh->rtnh_flags = 0; + rtnh->rtnh_hops = 0; rta->rta_len += rtnh->rtnh_len; parse_one_nh(rta, rtnh, &argc, &argv); rtnh = RTNH_NEXT(rtnh); > fix it. > > Thanks! > > Regards, > > bert Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Thu Dec 20 04:29:35 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBKCTZw28662 for netdev-outgoing; Thu, 20 Dec 2001 04:29:35 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBKCTVX28659 for ; Thu, 20 Dec 2001 04:29:31 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id A0CFBC60E9; Thu, 20 Dec 2001 12:29:27 +0100 (CET) Date: Thu, 20 Dec 2001 12:29:27 +0100 From: bert hubert To: Julian Anastasov Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [BUG/WANT TO FIX] Equal Cost Multipath Broken in 2.4.x Message-ID: <20011220122927.A12949@outpost.ds9a.nl> Mail-Followup-To: bert hubert , Julian Anastasov , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20011220112728.A11112@outpost.ds9a.nl> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from ja@ssi.bg on Thu, Dec 20, 2001 at 01:21:30PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1009 Lines: 26 On Thu, Dec 20, 2001 at 01:21:30PM +0200, Julian Anastasov wrote: > > If anybody can point me in the direction of this problem, it must be known > > as it has been there for a *long* time, it would be appreciated. I'll try to > > Yes, I remember people to report for this problem for long > time but I was not able to reproduce it. May be it could be fixed > with the following change (only compiled): Your patch does not appear to relate to iproute-20010824. I think I've found the problem, however. I think there has been an API change between 2.2 and 2.4. 'ip' compiled under 2.2 will not properly configure ECMP on 2.4! If I recompile tc under 2.4, the problem disappears. Maybe ip should warn against this? Should be easy to do. Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services http://www.tk the dot in .tk Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - Linux Advanced Routing & Traffic Control: http://ds9a.nl/lartc From owner-netdev@oss.sgi.com Thu Dec 20 05:09:12 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBKD9Cw03395 for netdev-outgoing; Thu, 20 Dec 2001 05:09:12 -0800 Received: from l.himel.bg (IDENT:root@unamed.infotel.bg [212.39.68.18] (may be forged)) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBKD96X03389 for ; Thu, 20 Dec 2001 05:09:06 -0800 Received: from linux.himel.bg (IDENT:ja@linux.himel.bg [127.0.0.1]) by l.himel.bg (8.9.3/8.9.3) with ESMTP id OAA06818; Thu, 20 Dec 2001 14:13:16 +0200 Date: Thu, 20 Dec 2001 14:13:16 +0200 (EET) From: Julian Anastasov X-X-Sender: To: bert hubert cc: , , Subject: Re: [BUG/WANT TO FIX] Equal Cost Multipath Broken in 2.4.x In-Reply-To: <20011220122927.A12949@outpost.ds9a.nl> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 922 Lines: 36 Hello, On Thu, 20 Dec 2001, bert hubert wrote: > Your patch does not appear to relate to iproute-20010824. I think I've found Hm, it is against iproute2-2.4.7-now-ss010824.tar.gz. Is iproute-20010824 (what is that?) somehow different? > the problem, however. I think there has been an API change between 2.2 and > 2.4. 'ip' compiled under 2.2 will not properly configure ECMP on 2.4! May be the effect is different with different compiler ... and uninitialized stack data. See the entry in RELNOTES: [010803] * If "dev" is not specified in multipath route, ifindex remained uninitialized. Grr. Thanks to Kunihiro Ishiguro . It seems the same bug exists for rtnh_flags and rtnh_hops, at the same place. > If I recompile tc under 2.4, the problem disappears. This is new. IIRC, the other users don't have such success :) > Regards, > > bert Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Thu Dec 20 06:04:28 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBKE4So04795 for netdev-outgoing; Thu, 20 Dec 2001 06:04:28 -0800 Received: from perninha.conectiva.com.br (perninha.conectiva.com.br [200.250.58.156]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBKE4LX04792 for ; Thu, 20 Dec 2001 06:04:22 -0800 Received: from brinquedo.distro.conectiva (dhcp047.distro.conectiva [10.0.20.47]) by perninha.conectiva.com.br (Postfix) with ESMTP id D8A1038C8E; Thu, 20 Dec 2001 10:04:11 -0300 (EST) Received: by brinquedo.distro.conectiva (Postfix, from userid 501) id 53BFBC44B; Thu, 20 Dec 2001 10:38:00 -0200 (BRST) Date: Thu, 20 Dec 2001 10:37:59 -0200 From: Arnaldo Carvalho de Melo To: "David S. Miller" Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH][RFC 3] cleaning up struct sock Message-ID: <20011220103759.A1208@conectiva.com.br> Mail-Followup-To: Arnaldo Carvalho de Melo , "David S. Miller" , SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20011218.130809.22018359.davem@redhat.com> <20011218232222.A1963@conectiva.com.br> <20011220012339.A919@conectiva.com.br> <20011220.002126.119272610.davem@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20011220.002126.119272610.davem@redhat.com> User-Agent: Mutt/1.3.23i X-Url: http://advogato.org/person/acme Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1105 Lines: 35 Em Thu, Dec 20, 2001 at 12:21:26AM -0800, David S. Miller escreveu: > From: Arnaldo Carvalho de Melo > Date: Thu, 20 Dec 2001 01:23:39 -0200 > > Available at: > > http://www.kernel.org/pub/linux/kernel/people/acme/v2.5/2.5.1/ > sock.cleanup-2.5.1.patch.bz2 > > Looking pretty good. I have one improvement. > > I'd rather you pass the "kmem_cache_t" directly into sk_alloc, use > NULL for "I don't have any extra private area". humm I did that with sock_register to avoid changing all the sk_alloc users, but in the end all protocols were changed so... ok, I'll do that, at least it'll simplify the "rtnetlink socket allocated early in the boot process before sock_register(rtnetlink) was called". > And then, for the IP case lay it out like this: > > struct sock > struct ip_opt > struct {tcp,raw4,...}_opt > > And use different kmem_cache_t's for each protocol instead of > the same one for tcp, raw4, etc. > > RAW/UDP sockets waste a lot of space with your current layout. *grin* Ok, ok, lets save more bytes 8) I'll look into this. - Arnaldo From owner-netdev@oss.sgi.com Thu Dec 20 06:53:40 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBKErej06972 for netdev-outgoing; Thu, 20 Dec 2001 06:53:40 -0800 Received: from outpost.powerdns.com (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBKErZX06969 for ; Thu, 20 Dec 2001 06:53:35 -0800 Received: by outpost.powerdns.com (Postfix, from userid 1000) id 77002C60E9; Thu, 20 Dec 2001 14:53:32 +0100 (CET) Date: Thu, 20 Dec 2001 14:53:32 +0100 From: bert hubert To: Julian Anastasov Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: [BUG/FIXED !] Equal Cost Multipath Broken in 2.4.x Message-ID: <20011220145332.A15230@outpost.ds9a.nl> Mail-Followup-To: bert hubert , Julian Anastasov , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20011220122927.A12949@outpost.ds9a.nl> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from ja@ssi.bg on Thu, Dec 20, 2001 at 02:13:16PM +0200 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1695 Lines: 45 On Thu, Dec 20, 2001 at 02:13:16PM +0200, Julian Anastasov wrote: > > > Your patch does not appear to relate to iproute-20010824. I think I've found > > Hm, it is against iproute2-2.4.7-now-ss010824.tar.gz. Is > iproute-20010824 (what is that?) somehow different? 20010824 is the version that comes with Debian unstable which, aside from some additional manpages, is 100% identical to the regular one. However, I must have not been awake this morning. It does apply now, AND fixes the problem. Thanks! > > the problem, however. I think there has been an API change between 2.2 and > > 2.4. 'ip' compiled under 2.2 will not properly configure ECMP on 2.4! > > May be the effect is different with different compiler ... > and uninitialized stack data. See the entry in RELNOTES: > > [010803] > * If "dev" is not specified in multipath route, ifindex remained > uninitialized. Grr. Thanks to Kunihiro Ishiguro . I do specify dev on the commandline, however, you are right in that is the compiler that fixes the behaviour. Apparently, gcc-3.0 is lucky in this respect. > > If I recompile tc under 2.4, the problem disappears. > > This is new. IIRC, the other users don't have such success :) I happened to be compiling with gcc-3.0 at the time, while debian compile their packages with gcc-2.95. I'll mention this patch on the LARTC mailinglist too. Will you push this patch towards Alexey? Regards, bert -- http://www.PowerDNS.com Versatile DNS Software & Services http://www.tk the dot in .tk Netherlabs BV / Rent-a-Nerd.nl - Nerd Available - Linux Advanced Routing & Traffic Control: http://ds9a.nl/lartc From owner-netdev@oss.sgi.com Thu Dec 20 07:06:20 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBKF6Ka07350 for netdev-outgoing; Thu, 20 Dec 2001 07:06:20 -0800 Received: from l.himel.bg (IDENT:root@unamed.infotel.bg [212.39.68.18] (may be forged)) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBKF6FX07347 for ; Thu, 20 Dec 2001 07:06:15 -0800 Received: from linux.himel.bg (IDENT:ja@linux.himel.bg [127.0.0.1]) by l.himel.bg (8.9.3/8.9.3) with ESMTP id QAA09729; Thu, 20 Dec 2001 16:10:11 +0200 Date: Thu, 20 Dec 2001 16:10:11 +0200 (EET) From: Julian Anastasov X-X-Sender: To: bert hubert cc: , , Subject: Re: [BUG/FIXED !] Equal Cost Multipath Broken in 2.4.x In-Reply-To: <20011220145332.A15230@outpost.ds9a.nl> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1075 Lines: 41 Hello, On Thu, 20 Dec 2001, bert hubert wrote: > must have not been awake this morning. It does apply now, AND fixes the > problem. Thanks! Very good > > [010803] > > * If "dev" is not specified in multipath route, ifindex remained > > uninitialized. Grr. Thanks to Kunihiro Ishiguro . > > I do specify dev on the commandline, however, you are right in that is the > compiler that fixes the behaviour. Apparently, gcc-3.0 is lucky in this > respect. Yes, it seems dev is an old problem. IMO the right fix is to memset with 0 the struct before usage. This will avoid any further errors. > I happened to be compiling with gcc-3.0 at the time, while debian compile > their packages with gcc-2.95. I'll mention this patch on the LARTC > mailinglist too. Yes, it must depend somehow on the compiler, may be from previous calls of other functions, no time to investigate. > Will you push this patch towards Alexey? If he is not reading the whole thread, please, do it directly. > Regards, > > bert Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Fri Dec 21 06:54:52 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBLEsqb08271 for netdev-outgoing; Fri, 21 Dec 2001 06:54:52 -0800 Received: from netbank.com.br (IDENT:postfix@garrincha.netbank.com.br [200.203.199.88]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBLEsjX08266 for ; Fri, 21 Dec 2001 06:54:46 -0800 Received: from brinquedo.distro.conectiva (1-218.ctame701-2.telepar.net.br [200.181.138.218]) by netbank.com.br (Postfix) with ESMTP id CCB7D4680D; Fri, 21 Dec 2001 11:53:42 -0200 (BRDT) Received: by brinquedo.distro.conectiva (Postfix, from userid 501) id 76AAFC44B; Fri, 21 Dec 2001 11:54:38 -0200 (BRST) Date: Fri, 21 Dec 2001 11:54:38 -0200 From: Arnaldo Carvalho de Melo To: "David S. Miller" Cc: SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH][RFC 3] cleaning up struct sock Message-ID: <20011221115438.A5990@conectiva.com.br> Mail-Followup-To: Arnaldo Carvalho de Melo , "David S. Miller" , SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20011218.130809.22018359.davem@redhat.com> <20011218232222.A1963@conectiva.com.br> <20011220012339.A919@conectiva.com.br> <20011220.002126.119272610.davem@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20011220.002126.119272610.davem@redhat.com> User-Agent: Mutt/1.3.23i X-Url: http://advogato.org/person/acme Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1763 Lines: 47 Em Thu, Dec 20, 2001 at 12:21:26AM -0800, David S. Miller escreveu: > From: Arnaldo Carvalho de Melo > Date: Thu, 20 Dec 2001 01:23:39 -0200 > > Available at: > > http://www.kernel.org/pub/linux/kernel/people/acme/v2.5/2.5.1/ > sock.cleanup-2.5.1.patch.bz2 > > Looking pretty good. I have one improvement. > > I'd rather you pass the "kmem_cache_t" directly into sk_alloc, use > NULL for "I don't have any extra private area". > > And then, for the IP case lay it out like this: > > struct sock > struct ip_opt > struct {tcp,raw4,...}_opt > > And use different kmem_cache_t's for each protocol instead of > the same one for tcp, raw4, etc. > > RAW/UDP sockets waste a lot of space with your current layout. Indeed it wastes, but the current setup in the stock kernel wastes even more ;) Well, did what you suggested, adding a slab parameter to sk_alloc and I also overloaded zero_it but its current behaviour is maintained, i.e., 0 == don't zeroes the newly allocated sock, 1 == zeroes it, the overloading is: 1 == sizeof(struct sock), > 1 == objsize of the per protocol slabcache. For now I did only to UDPv4 sockets, will do the others this afternoon, this is the result so far: [rama2 kernel-acme]$ grep sock /proc/slabinfo unix_sock 7 20 400 1 2 1 : 17 572 2 0 0 udp_sock 6 10 372 1 1 1 : 7 31 1 0 0 tcp_sock 13 15 800 3 3 1 : 13 46 3 0 0 sock 0 0 336 0 0 1 : 0 0 0 0 0 Now UDP sockets use only 372 bytes while in the stock kernel it uses 1280 bytes when all the protocols are selected (as modules or statically linked, but more than 1 KB when just TCP/IP v4 is selected). More to come. Patch will be available later today. - Arnaldo From owner-netdev@oss.sgi.com Fri Dec 21 20:28:58 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBM4Sw430879 for netdev-outgoing; Fri, 21 Dec 2001 20:28:58 -0800 Received: from netbank.com.br (IDENT:postfix@garrincha.netbank.com.br [200.203.199.88]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBM4SfX30876 for ; Fri, 21 Dec 2001 20:28:42 -0800 Received: from brinquedo.distro.conectiva (2-111.ctame701-2.telepar.net.br [200.181.170.111]) by netbank.com.br (Postfix) with ESMTP id 0D9524683E; Sat, 22 Dec 2001 01:27:30 -0200 (BRDT) Received: by brinquedo.distro.conectiva (Postfix, from userid 501) id 79A1AC44B; Sat, 22 Dec 2001 01:28:24 -0200 (BRST) Date: Sat, 22 Dec 2001 01:28:24 -0200 From: Arnaldo Carvalho de Melo To: "David S. Miller" , SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: [PATCH][RFC 4] cleaning up struct sock Message-ID: <20011222012824.A8996@conectiva.com.br> Mail-Followup-To: Arnaldo Carvalho de Melo , "David S. Miller" , SteveW@ACM.org, jschlst@samba.org, ncorbic@sangoma.com, eis@baty.hanse.de, dag@brattli.net, torvalds@transmeta.com, marcelo@conectiva.com.br, netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20011218.130809.22018359.davem@redhat.com> <20011218232222.A1963@conectiva.com.br> <20011220012339.A919@conectiva.com.br> <20011220.002126.119272610.davem@redhat.com> <20011221115438.A5990@conectiva.com.br> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20011221115438.A5990@conectiva.com.br> User-Agent: Mutt/1.3.23i X-Url: http://advogato.org/person/acme Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 1127 Lines: 34 Em Fri, Dec 21, 2001 at 11:54:38AM -0200, Arnaldo Carvalho de Melo escreveu: > Em Thu, Dec 20, 2001 at 12:21:26AM -0800, David S. Miller escreveu: > > I'd rather you pass the "kmem_cache_t" directly into sk_alloc, use > > NULL for "I don't have any extra private area". > > > > And then, for the IP case lay it out like this: > > > > struct sock > > struct ip_opt > > struct {tcp,raw4,...}_opt > > > > And use different kmem_cache_t's for each protocol instead of > > the same one for tcp, raw4, etc. > > > > RAW/UDP sockets waste a lot of space with your current layout. Done, patch available at: http://www.kernel.org/pub/linux/kernel/people/acme/v2.5/2.5.2-pre1/ sock.cleanup-2.5.2-pre1.bz2 Current state of /proc/slabinfo: [acme@rama2 acme]$ grep sock /proc/slabinfo unix_sock 7 20 400 1 2 1 : 17 572 2 0 0 raw4_sock 0 10 376 0 1 1 : 1 3 1 0 0 udp_sock 6 10 372 1 1 1 : 7 31 1 0 0 tcp_sock 13 15 800 3 3 1 : 14 47 3 0 0 sock 0 0 336 0 0 1 : 0 0 0 0 0 TODO: do the same for IPv6, that now has only one slabcache for all its protocols. - Arnaldo From owner-netdev@oss.sgi.com Sat Dec 22 19:49:25 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBN3nPo14286 for netdev-outgoing; Sat, 22 Dec 2001 19:49:25 -0800 Received: from nghost.vosn.net ([157.238.46.43]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBN3nHX14283 for ; Sat, 22 Dec 2001 19:49:18 -0800 Received: from ppp-206-171-33-2.rdcy01.pacbell.net ([206.171.33.2] ident=reverend) by nghost.vosn.net with esmtp (Exim 3.33 #1) id 16Hyha-0000y4-00 for netdev@oss.sgi.com; Sat, 22 Dec 2001 21:49:27 -0500 Date: Sat, 22 Dec 2001 18:49:19 -0800 (PST) From: Reverend X-Sender: reverend@splinter.church.net To: netdev@oss.sgi.com Subject: [BUG] Compile errors with networking disabled (fwd) Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - nghost.vosn.net X-AntiAbuse: Original Domain - oss.sgi.com X-AntiAbuse: Originator/Caller UID/GID - [0 0] / [0 0] X-AntiAbuse: Sender Address Domain - unrooted.net Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 3068 Lines: 72 I mailed this over to Alan, and he directed me to this address for net/core maint issues. BTW, the README in the net directory still lists him as the maintainer. I haven't seen this issue mentioned anywhere else, which I find a bit odd and makes me suspect myself rather than an error in the kernel. But I've tried the compile with a few different toolchains and consistently hit the same issue. Rev ---------- Forwarded message ---------- Date: Sat, 22 Dec 2001 13:28:08 -0800 (PST) From: Reverend To: alan@lxorguk.ukuu.org.uk Subject: [BUG] Compile errors with networking disabled Hey Alan, You're listed as the maintainer for net/core in the 2.5 series still. If that isn't really the case please let me know - I'll try to figure out who is. I get a compile error if I turn off networking completely in the 2.5 series (I tried a few of the prepatch versions, and attached a patch against 2.5.1-pre11). I doubt that this patch is the correct way to address the issue, but it's been working for me. I get the following error when compiling: In file included from /usr/src/linux/include/net/tcp.h:1036, from sock.c:122: /usr/src/linux/include/net/tcp_ecn.h: In function `TCP_ECN_send': /usr/src/linux/include/net/tcp_ecn.h:54: union has no member named `af_inet' /usr/src/linux/include/net/tcp_ecn.h:61: union has no member named `af_inet' In file included from /usr/src/linux/include/net/tcp.h:1036, from skbuff.c:58: /usr/src/linux/include/net/tcp_ecn.h: In function `TCP_ECN_send': /usr/src/linux/include/net/tcp_ecn.h:54: union has no member named `af_inet' /usr/src/linux/include/net/tcp_ecn.h:61: union has no member named `af_inet' make[3]: *** [sock.o] Error 1 The errors are actually on INET_ECN_xmit and INET_ECN_dontxmit, which are macro functions in include/net/inet_ecn.h and reference the af_inet field of protinfo. af_inet isn't included in protinfo unless networking is configured on however. I just nulled the functions, but the correct answer might be to cut out the references to the macros all together unless networking is on. I leave it to hands more capable than my own to determine the correct sollution. Rev diff -urN linux/include/net/inet_ecn.h linux-edited/include/net/inet_ecn.h --- linux/include/net/inet_ecn.h Tue Oct 30 15:08:12 2001 +++ linux-edited/include/net/inet_ecn.h Sat Dec 22 13:17:31 2001 @@ -24,8 +24,17 @@ return outer; } +#if defined(CONFIG_INET) || defined (CONFIG_INET_MODULE) + #define INET_ECN_xmit(sk) do { (sk)->protinfo.af_inet.tos |= 2; } while (0) #define INET_ECN_dontxmit(sk) do { (sk)->protinfo.af_inet.tos &= ~3; } while (0) + +#else + +#define INET_ECN_xmit(sk) do { ; } while (0) +#define INET_ECN_dontxmit(sk) do { ; } while (0) + +#endif /* defined(CONFIG_INET) || defined (CONFIG_INET_MODULE) */ #define IP6_ECN_flow_init(label) do { \ (label) &= ~htonl(3<<20); \ From owner-netdev@oss.sgi.com Sat Dec 22 20:21:39 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBN4Ld314604 for netdev-outgoing; Sat, 22 Dec 2001 20:21:39 -0800 Received: from mail.ocs.com.au (mail.ocs.com.au [203.34.97.2]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBN4LZX14601 for ; Sat, 22 Dec 2001 20:21:36 -0800 Received: (qmail 9106 invoked from network); 23 Dec 2001 03:21:30 -0000 Received: from ocs3.intra.ocs.com.au (192.168.255.3) by mail.ocs.com.au with SMTP; 23 Dec 2001 03:21:30 -0000 Received: by ocs3.intra.ocs.com.au (Postfix, from userid 16331) id 61040300090; Sun, 23 Dec 2001 14:21:25 +1100 (EST) Received: from ocs3.intra.ocs.com.au (localhost [127.0.0.1]) by ocs3.intra.ocs.com.au (Postfix) with ESMTP id 0DDD097; Sun, 23 Dec 2001 14:21:24 +1100 (EST) X-Mailer: exmh version 2.2 06/23/2000 with nmh-1.0.4 From: Keith Owens To: Reverend Cc: netdev@oss.sgi.com Subject: Re: [BUG] Compile errors with networking disabled (fwd) In-reply-to: Your message of "Sat, 22 Dec 2001 18:49:19 -0800." Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Sun, 23 Dec 2001 14:21:19 +1100 Message-ID: <18977.1009077679@ocs3.intra.ocs.com.au> Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 514 Lines: 13 On Sat, 22 Dec 2001 18:49:19 -0800 (PST), Reverend wrote: >In file included from /usr/src/linux/include/net/tcp.h:1036, > from sock.c:122: >/usr/src/linux/include/net/tcp_ecn.h: In function `TCP_ECN_send': >/usr/src/linux/include/net/tcp_ecn.h:54: union has no member named `af_inet' >/usr/src/linux/include/net/tcp_ecn.h:61: union has no member named `af_inet' Fixed by DaveM in 2.4.17. Browse the 2.4.17 patch and look for -#include -#include From owner-netdev@oss.sgi.com Sun Dec 23 12:06:47 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBNK6lE27820 for netdev-outgoing; Sun, 23 Dec 2001 12:06:47 -0800 Received: from ms2.inr.ac.ru (minus.inr.ac.ru [193.233.7.97]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBNK6hX27813 for ; Sun, 23 Dec 2001 12:06:43 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA18811; Sun, 23 Dec 2001 22:06:16 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200112231906.WAA18811@ms2.inr.ac.ru> Subject: Re: [PATCH] make netfilter able to change/see skb->priority To: ahu@ds9a.NL (bert hubert) Date: Sun, 23 Dec 2001 22:06:16 +0300 (MSK) Cc: netdev@oss.sgi.com, laforge@gnumonks.ORG In-Reply-To: <20011215165956.A31862@outpost.ds9a.nl> from "bert hubert" at Dec 15, 1 07:15:01 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 943 Lines: 24 Hello! > I think this patch is philosophically right because it allows netfilter to > override userspace instructions, What's about NF_IP_POST_ROUTING? Well, _phylosophically_ it is wrong yet. skb->priority is exactly a hint from user or another available source and overriding this is simply loss of information. Real policy is made in queueing level: it decides to use this hint or to ignore it. But in practice this can be convenient despite of phylosophical flaws. :-) Anyway, I never understood well the principle of placement of netfilter hooks. F.e. NF_IP_FORWARD is apparently too late: all the modifications are made, redirects are sent, DF packets are dropped... As result this hook cannot be used for anything but silent dropping packets. The same happens with all the intermediate hooks, in fact only NF_IP_POST_ROUTING and NF_IP_PRE_ROUTING may be used for something smart unambiguosuly. Very strange, to be honest. Alexey From owner-netdev@oss.sgi.com Sun Dec 23 22:10:36 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBO6Aa117134 for netdev-outgoing; Sun, 23 Dec 2001 22:10:36 -0800 Received: from localhost.localdomain ([203.190.138.70]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBO6ALX17130 for ; Sun, 23 Dec 2001 22:10:26 -0800 Received: from vishalmalhan ([192.9.203.171]) by localhost.localdomain (8.11.0/8.11.0) with SMTP id fBO5AwK19623 for ; Mon, 24 Dec 2001 10:40:59 +0530 Message-ID: <02d701c18c38$e351f160$abcb09c0@npi.stpn.soft.net> From: "Vishal Malhan" To: References: <200112231906.WAA18811@ms2.inr.ac.ru> Subject: SIOCGIFADDR and IPv6 Date: Mon, 24 Dec 2001 10:37:28 +0530 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.00.2314.1300 X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300 Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 200 Lines: 8 The ioctl SIOCGIFADDR works fine for v4 but gives EINVAL for v6. Are the IOCTL's not v6 enabled ? If so then how to do the same for v6 ? I am using Redhat Linux 7.1 kernel 2.4.1 Rgds, Vishal Malhan From owner-netdev@oss.sgi.com Sun Dec 23 22:30:26 2001 Received: (from majordomo@localhost) by oss.sgi.com (8.11.2/8.11.3) id fBO6UQd17325 for netdev-outgoing; Sun, 23 Dec 2001 22:30:26 -0800 Received: from yue.hongo.wide.ad.jp (root@yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.11.2/8.11.3) with SMTP id fBO6UNX17322 for ; Sun, 23 Dec 2001 22:30:23 -0800 Received: from localhost (yoshfuji@localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.9.3+3.2W/8.9.3/Debian 8.9.3-21) with ESMTP id OAA06413; Mon, 24 Dec 2001 14:30:18 +0900 To: vmalhan@npi.stpn.soft.net Cc: netdev@oss.sgi.com, usagi-users@linux-ipv6.org Subject: Re: SIOCGIFADDR and IPv6 In-Reply-To: <02d701c18c38$e351f160$abcb09c0@npi.stpn.soft.net> References: <200112231906.WAA18811@ms2.inr.ac.ru> <02d701c18c38$e351f160$abcb09c0@npi.stpn.soft.net> X-Mailer: Mew version 1.94.2 on Emacs 20.7 / Mule 4.1 (AOI) X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: <20011224143018T.yoshfuji@linux-ipv6.org> Date: Mon, 24 Dec 2001 14:30:18 +0900 From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= X-Dispatcher: imput version 991025(IM133) Sender: owner-netdev@oss.sgi.com Precedence: bulk Content-Length: 642 Lines: 14 In article <02d701c18c38$e351f160$abcb09c0@npi.stpn.soft.net> (at Mon, 24 Dec 2001 10:37:28 +0530), "Vishal Malhan" says: > The ioctl SIOCGIFADDR works fine for v4 but gives EINVAL for v6. > Are the IOCTL's not v6 enabled ? > If so then how to do the same for v6 ? We do not support SIOCGIFADDR for IPv6 because in6_addr is too big.:-p Use rtnetlink(7) to get ipv6 addresses. You probably want to use getifaddrs(3) in USAGI package to get addresses (H/W addresses, ipv4 addresses and ipv6 addresses) on interfaces. The interface was originally from BSDI and also available on KAME(BSD)-based OS's. --yoshfuji