From cfriesen@nortelnetworks.com Sat Mar 1 22:03:44 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 01 Mar 2003 22:03:47 -0800 (PST) Received: from zcars0m9.nortelnetworks.com (zcars0m9.nortelnetworks.com [47.129.242.157]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h2263heA017610 for ; Sat, 1 Mar 2003 22:03:44 -0800 Received: from zcard309.ca.nortel.com (zcard309.ca.nortel.com [47.129.242.69]) by zcars0m9.nortelnetworks.com (Switch-2.2.0/Switch-2.2.0) with ESMTP id h22634q11232; Sun, 2 Mar 2003 01:03:04 -0500 (EST) Received: from zcard0k6.ca.nortel.com ([47.129.242.158]) by zcard309.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id GDF4RWX4; Sun, 2 Mar 2003 01:03:05 -0500 Received: from pcard0ks.ca.nortel.com ([47.129.117.131]) by zcard0k6.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id FSL7YWNZ; Sun, 2 Mar 2003 01:03:05 -0500 Received: from nortelnetworks.com (localhost.localdomain [127.0.0.1]) by pcard0ks.ca.nortel.com (Postfix) with ESMTP id 2E9602E12F; Sun, 2 Mar 2003 01:03:04 -0500 (EST) Message-ID: <3E619E97.8010508@nortelnetworks.com> Date: Sun, 02 Mar 2003 01:03:03 -0500 X-Sybari-Space: 00000000 00000000 00000000 From: Chris Friesen User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204 X-Accept-Language: en-us MIME-Version: 1.0 To: jamal Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Re: anyone ever done multicast AF_UNIX sockets? 
References: <3E5E7081.6020704@nortelnetworks.com> <20030228083009.Y53276@shell.cyberus.ca> <3E5F748E.2080605@nortelnetworks.com> <20030228212309.C57212@shell.cyberus.ca> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 1828 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: cfriesen@nortelnetworks.com Precedence: bulk X-list: netdev Content-Length: 3163 Lines: 72 jamal wrote: > Did you also measure throughput? No. lmbench doesn't appear to test UDP socket local throughput. > You are overlooking the flexibility that already exists in IP based > transports as an advantage; the fact that you can make them > distributed instead of localized with a simple addressing change > is a very powerful abstraction. True. On the other hand, the same could be said about unicast IP sockets vs unix sockets. Unix sockets exist for a reason, and I'm simply proposing to extend them. >>From >>userspace, multicast unix would be *simple* to use, as in totally >>transparent. > You could implement the abstraction in user space as a library today by > having some server that muxes to several registered clients. This is what we have now, though with a suboptimal solution (we inherited it from another group). The disadvantage with this is that it adds a send/schedule/receive iteration. If you have a small number of listeners this can have a large effect percentage-wise on your messaging cost. The kernel approach also cuts the number of syscalls required by a factor of two compared to the server-based approach. > So whats the addressing scheme for multicast unix? Would it be a > reserved path? Actually I was thinking it could be arbitrary, with a flag in the unix part of struct sock saying that it was actually a multicast address. 
The API would be something like the IP multicast one, where you get and bind a normal socket and then use setsockopt to attach yourself to one or more multicast addresses. A given address could be multicast or not, but they would reside in the same namespace and would collide as currently happens. The only way to create a multicast address would be the setsockopt call--if the address doesn't already exist a socket would be created by the kernel and bound to the desired address.

To see if it's feasible I've actually coded up a proof-of-concept that seems to do fairly well. I tested it with a process sending an 8-byte packet containing a timestamp to three listeners, which checked the time on receipt and printed out the difference.

For comparison I have two different userspace implementations, one with a server process (very simple for test purposes) and the other using an mmap'd file to store which process is listening to what messages.

The timings (in usec) for the delays to each of the listeners were as follows on my Duron 750:

                       listener 1   listener 2   listener 3
userspace server:         104          133          153
userspace no server:       72          111          138
kernelspace:               60           91          113

As you can see, the kernelspace code is the fastest, and since it's in the kernel it can be written to avoid being scheduled out while holding locks, which is hard to avoid with the no-server userspace option.

If this sounds at all interesting I would be glad to post a patch so you could shoot holes in it; otherwise I'll continue working on it privately.
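The test described above can be approximated in userspace with today's API. What follows is only a rough sketch of the baseline (one send() per listener over AF_UNIX datagram sockets), not the proof-of-concept kernel code, which has not been posted. The helper names (now_usec, make_listeners, send_timestamp_to_all, recv_delay_usec) are made up for illustration, and socketpair() is used in place of bound socket paths to keep the example short.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Current monotonic time in microseconds. */
static uint64_t now_usec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Create n AF_UNIX datagram socket pairs; tx[i] sends to rx[i].
 * Hypothetical helper: a real listener would bind its own path. */
static int make_listeners(int *tx, int *rx, int n)
{
    for (int i = 0; i < n; i++) {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
            return -1;
        tx[i] = sv[0];
        rx[i] = sv[1];
    }
    return 0;
}

/* Send the same 8-byte timestamp to every listener, one syscall each.
 * Returns the number of listeners the packet was queued for. */
static int send_timestamp_to_all(const int *tx, int n)
{
    assert(tx != NULL);
    uint64_t stamp = now_usec();
    int sent = 0;
    for (int i = 0; i < n; i++) {
        if (send(tx[i], &stamp, sizeof stamp, 0) == (ssize_t)sizeof stamp)
            sent++;
    }
    return sent;
}

/* Receive one timestamp and return the delivery delay in usec, or -1. */
static int64_t recv_delay_usec(int rx)
{
    uint64_t stamp;
    if (recv(rx, &stamp, sizeof stamp, 0) != (ssize_t)sizeof stamp)
        return -1;
    return (int64_t)(now_usec() - stamp);
}
```

Because the timestamp is taken once before the loop, later listeners see larger delays, which matches the rising pattern in the timings above. In a real measurement each listener would be a separate process blocked in recv().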
Chris

-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com

From hadi@cyberus.ca Sun Mar 2 06:12:17 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 02 Mar 2003 06:12:29 -0800 (PST)
Received: from mx03.cyberus.ca (mx03.cyberus.ca [216.191.240.24]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h22ECAeA028572 for ; Sun, 2 Mar 2003 06:12:16 -0800
Received: from shell.cyberus.ca ([216.191.240.114]) by mx03.cyberus.ca with esmtp (Exim 4.10) id 18pUCD-0000nI-00; Sun, 02 Mar 2003 09:12:05 -0500
Received: from shell.cyberus.ca (localhost.cyberus.ca [127.0.0.1]) by shell.cyberus.ca (8.12.6/8.12.6) with ESMTP id h22EBeYO063470; Sun, 2 Mar 2003 09:11:40 -0500 (EST) (envelope-from hadi@cyberus.ca)
Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.12.6/8.12.6/Submit) with ESMTP id h22EBdxn063467; Sun, 2 Mar 2003 09:11:40 -0500 (EST) (envelope-from hadi@cyberus.ca)
X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs
Date: Sun, 2 Mar 2003 09:11:39 -0500 (EST)
From: jamal
To: Chris Friesen
cc: linux-kernel@vger.kernel.org, "" , ""
Subject: Re: anyone ever done multicast AF_UNIX sockets?
In-Reply-To: <3E619E97.8010508@nortelnetworks.com>
Message-ID: <20030302081916.S61365@shell.cyberus.ca>
References: <3E5E7081.6020704@nortelnetworks.com> <20030228083009.Y53276@shell.cyberus.ca> <3E5F748E.2080605@nortelnetworks.com> <20030228212309.C57212@shell.cyberus.ca> <3E619E97.8010508@nortelnetworks.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 1829
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: hadi@cyberus.ca
Precedence: bulk
X-list: netdev
Content-Length: 3825
Lines: 94

On Sun, 2 Mar 2003, Chris Friesen wrote:

> jamal wrote:
> > Did you also measure throughput?
>
> No. lmbench doesn't appear to test UDP socket local throughput.

I think you need to collect all data if you are trying to show
improvements.

>
> > You are overlooking the flexibility that already exists in IP based
> > transports as an advantage; the fact that you can make them
> > distributed instead of localized with a simple addressing change
> > is a very powerful abstraction.
>
> True. On the other hand, the same could be said about unicast IP
> sockets vs unix sockets. Unix sockets exist for a reason, and I'm
> simply proposing to extend them.
>

You are treading into areas where unix sockets make less sense compared to
sockets. Good design rules (should actually read "lazy design rules"):
sometimes you gotta move to a round peg instead of trying to make the
square one round.

> > You could implement the abstraction in user space as a library today by
> > having some server that muxes to several registered clients.
>
> This is what we have now, though with a suboptimal solution (we
> inherited it from another group). The disadvantage with this is that it
> adds a send/schedule/receive iteration. If you have a small number of
> listeners this can have a large effect percentage-wise on your messaging
> cost. The kernel approach also cuts the number of syscalls required by
> a factor of two compared to the server-based approach.
>

OK, so it's only a problem when you have a few listeners, i.e. the user
space scheme scales just fine as you keep adding listeners.
In your tests, what was the break-even point?

> > So what's the addressing scheme for multicast unix? Would it be a
> > reserved path?
>
> Actually I was thinking it could be arbitrary, with a flag in the unix
> part of struct sock saying that it was actually a multicast address.
> The API would be something like the IP multicast one, where you get and
> bind a normal socket and then use setsockopt to attach yourself to one
> or more multicast addresses. A given address could be multicast or
> not, but they would reside in the same namespace and would collide as
> currently happens. The only way to create a multicast address would be
> the setsockopt call--if the address doesn't already exist a socket would
> be created by the kernel and bound to the desired address.
>

Addressing has to be backward compatible, i.e. not affecting any other
program.

> To see if it's feasible I've actually coded up a proof-of-concept that
> seems to do fairly well. I tested it with a process sending an 8-byte
> packet containing a timestamp to three listeners, who checked the time
> on receipt and printed out the difference.
>
> For comparison I have two different userspace implementations, one with
> a server process (very simple for test purposes) and the other using an
> mmap'd file to store which process is listening to what messages.
>
> The timings (in usec) for the delays to each of the listeners were as
> follows on my Duron 750:
>
> userspace server:     104  133  153
> userspace no server:   72  111  138
> kernelspace:           60   91  113
>
> As you can see, the kernelspace code is the fastest and since it's in the
> kernel it can be written to avoid being scheduled out while holding
> locks which is hard to avoid with the no-server userspace option.
>

Actually, the difference between the user space server and the kernel
doesn't appear that big. What you need to do is collect more data:
repeat with an incrementing number of listeners.

> If this sounds at all interesting I would be glad to post a patch so you
> could shoot holes in it, otherwise I'll continue working on it privately.
>

No rush, let's see your test data first, and then you gotta do a better
sales job on the cost/benefit/flexibility ratios.
cheers, jamal From zjp@iscas.ac.cn Sun Mar 2 23:45:03 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 02 Mar 2003 23:45:08 -0800 (PST) Received: from mail.iscas.ac.cn ([159.226.5.56]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h237j2eA007487 for ; Sun, 2 Mar 2003 23:45:03 -0800 Received: (qmail 24812 invoked by uid 104); 3 Mar 2003 07:44:34 -0000 Received: from zjp@iscas.ac.cn by mail.iscas.ac.cn by uid 0 with qmail-scanner-1.14 (hbedv: 6.15.0.1. hbedv: operating system: Linux (glibc). hbedv: product version: 2.0.4. hbedv: engine version: 6.15.0.1. hbedv: packlib version: 2.0.0.8 (supports 19 formats). hbedv: vdf version: 6.15.0.7 (66928 recognized forms). hbedv: . hbedv: product: AntiVir Workstation. hbedv: key file: hbedv.key. hbedv: registered user: irene, 123. hbedv: serial number: 1001020203. hbedv: key expires: 31 May 2003. hbedv: run mode: PRIVATE. hbedv: . hbedv: product: AntiVir MailGate. hbedv: key file: hbedv.key. hbedv: registered user: irene, 123. hbedv: serial number: 1001020203. hbedv: key expires: 31 May 2003. hbedv: run mode: PRIVATE. hbedv: . hbedv: product: AntiVir (command line scanner). hbedv: key file: hbedv.key. hbedv: registered user: irene, 123. hbedv: serial number: 1001020203. hbedv: key expires: 31 May 2003. hbedv: run mode: PRIVATE. Clear:. 
Processed in 0.260838 secs); 03 Mar 2003 07:44:34 -0000
Received: from unknown (HELO zhengjp) (zjp@159.226.5.59) by mail.iscas.ac.cn with SMTP; 3 Mar 2003 07:44:33 -0000
Message-ID: <003901c2e159$5d932ae0$6c05a8c0@zhengjp>
From: "Zheng Jianping"
To:
Subject: How to set IPv6 router alert option
Date: Mon, 3 Mar 2003 15:49:05 +0800
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1106
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106
X-archive-position: 1830
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: zjp@iscas.ac.cn
Precedence: bulk
X-list: netdev
Content-Length: 469
Lines: 19

Hi,

I want to send a packet (an MLD query message) with the IPv6 router alert
option through a socket. After creating an ICMPv6 socket, how do I send an
MLD query message packet on it?

Thanks,
Zheng Jianping
----------------------------------------------------------------------------
Multimedia Communication & Network Engineering Research Center
Institute of Software, Chinese Academy of Sciences
Email: zjp@iscas.ac.cn
Tel: 6255,5523

From davem@redhat.com Mon Mar 3 01:02:49 2003
Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 01:03:02 -0800 (PST)
Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h2392meA008716 for ; Mon, 3 Mar 2003 01:02:48 -0800
Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id AAA00650; Mon, 3 Mar 2003 00:44:58 -0800
Date: Mon, 03 Mar 2003 00:44:57 -0800 (PST)
Message-Id: <20030303.004457.24252283.davem@redhat.com>
To: bwa@us.ibm.com
Cc: lksctp-developers@lists.sourceforge.net, linux-net@vger.kernel.org, netdev@oss.sgi.com
Subject: Re: [PATCH] subset of RFC2553
From: "David S.
Miller" In-Reply-To: <1046109300.3503.12.camel@w-bwa1.beaverton.ibm.com> References: <1045847170.3104.7.camel@w-bwa1.beaverton.ibm.com> <20030221.232639.129509431.davem@redhat.com> <1046109300.3503.12.camel@w-bwa1.beaverton.ibm.com> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1831 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 425 Lines: 14 From: Bruce Allan Date: 24 Feb 2003 09:54:57 -0800 On Fri, 2003-02-21 at 23:26, David S. Miller wrote: > > Bruce, while applying this I noticed that in6addr_{any,loopback} > are not exported by modules. > > Please send me a small patch to add the exports if this will be > needed by SCTP and friends. Doh! Sorry, here (see below) it is against 2.5.59. Applied, thanks. From davem@redhat.com Mon Mar 3 01:12:17 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 01:12:23 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h239CGeA009155 for ; Mon, 3 Mar 2003 01:12:17 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id AAA00686; Mon, 3 Mar 2003 00:54:29 -0800 Date: Mon, 03 Mar 2003 00:54:28 -0800 (PST) Message-Id: <20030303.005428.96142819.davem@redhat.com> To: yoshfuji@linux-ipv6.org Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru, pekkas@netcore.fi, usagi@linux-ipv6.org Subject: Re: [PATCH] IPv6: Privacy Extensions for Stateless Address Autoconfiguration in IPv6 From: "David S. 
Miller"
In-Reply-To: <20030226.004155.71903869.yoshfuji@linux-ipv6.org>
References: <20021101.174832.44646503.yoshfuji@linux-ipv6.org> <20030223.223114.65976206.davem@redhat.com> <20030226.004155.71903869.yoshfuji@linux-ipv6.org>
X-FalunGong: Information control.
X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
X-archive-position: 1832
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davem@redhat.com
Precedence: bulk
X-list: netdev
Content-Length: 267
Lines: 7

   From: YOSHIFUJI Hideaki / 吉藤英明
   Date: Wed, 26 Feb 2003 00:41:55 +0900 (JST)

   Well, I've found a bug where temporary addresses were not
   re-generated properly. Here's the patch for linux-2.5.63.

Fix applied, thanks.

From davem@redhat.com Mon Mar 3 01:46:39 2003
Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 01:46:50 -0800 (PST)
Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h239kceA024540 for ; Mon, 3 Mar 2003 01:46:39 -0800
Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id BAA00846; Mon, 3 Mar 2003 01:28:26 -0800
Date: Mon, 03 Mar 2003 01:28:25 -0800 (PST)
Message-Id: <20030303.012825.81834528.davem@redhat.com>
To: latten@austin.ibm.com
Cc: kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com
Subject: Re: PATCH: IPSec not using padding when Null Encryption
From: "David S. Miller"
In-Reply-To: <200302272129.h1RLTJW28434@faith.austin.ibm.com>
References: <200302272129.h1RLTJW28434@faith.austin.ibm.com>
X-FalunGong: Information control.
X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1833 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 397 Lines: 10 From: latten@austin.ibm.com Date: Thu, 27 Feb 2003 15:29:19 -0600 Ok, anyway, this fix just pretty much makes sure that when Null Encryption or any algorithm with a blocksize less than 4 is used, that the ciphertext, any padding, and next-header and pad-length fields terminate on a 4-byte boundary. I have tested it. Please let me know if all is well. Applied, thanks. From davem@redhat.com Mon Mar 3 01:47:53 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 01:47:58 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h239lqeA028214 for ; Mon, 3 Mar 2003 01:47:53 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id BAA00857; Mon, 3 Mar 2003 01:30:13 -0800 Date: Mon, 03 Mar 2003 01:30:13 -0800 (PST) Message-Id: <20030303.013013.93812658.davem@redhat.com> To: yoshfuji@linux-ipv6.org Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru, usagi@linux-ipv6.org Subject: Re: [PATCH] Use C99 initializers in net/ipv6 From: "David S. Miller" In-Reply-To: <20030228.065944.08980219.yoshfuji@linux-ipv6.org> References: <20030228.065944.08980219.yoshfuji@linux-ipv6.org> X-FalunGong: Information control. 
X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
X-archive-position: 1834
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davem@redhat.com
Precedence: bulk
X-list: netdev
Content-Length: 316
Lines: 8

   From: YOSHIFUJI Hideaki / 吉藤英明
   Date: Fri, 28 Feb 2003 06:59:44 +0900 (JST)

   This converts net/ipv6/{addrconf,route,sit}.c files to use C99
   initializers. We don't touch net/ipv6/exthdrs.c for now because it
   will conflict with our patch for IPsec.

Applied, thanks.

From davem@redhat.com Mon Mar 3 01:52:41 2003
Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 01:52:46 -0800 (PST)
Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h239qfeA001941 for ; Mon, 3 Mar 2003 01:52:41 -0800
Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id BAA00880; Mon, 3 Mar 2003 01:34:52 -0800
Date: Mon, 03 Mar 2003 01:34:51 -0800 (PST)
Message-Id: <20030303.013451.20307573.davem@redhat.com>
To: jmorris@intercode.com.au
Cc: toml@us.ibm.com, netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru
Subject: Re: IPSec: setkey -DP freezes machine
From: "David S. Miller"
In-Reply-To:
References:
X-FalunGong: Information control.
X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1835 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 427 Lines: 11 From: James Morris Date: Sat, 1 Mar 2003 03:01:04 +1100 (EST) Alternatively, a family parameter could be added to the compile_policy() operation, but this duplicates data already present in our native xfrm_userpolicy_info format. I like this solution, it seems the cleanest. Could someone implement this fix and send me the patch? I'm very backlogged for the next day or so... From jmorris@intercode.com.au Mon Mar 3 04:14:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 04:14:30 -0800 (PST) Received: from blackbird.intercode.com.au (blackbird.intercode.com.au [203.32.101.10]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23CEBeA007035 for ; Mon, 3 Mar 2003 04:14:13 -0800 Received: from localhost (jmorris@localhost) by blackbird.intercode.com.au (8.9.3/8.9.3) with ESMTP id XAA10331; Mon, 3 Mar 2003 23:13:55 +1100 Date: Mon, 3 Mar 2003 23:13:55 +1100 (EST) From: James Morris To: "David S. Miller" cc: toml@us.ibm.com, , Subject: [PATCH] Re: IPSec: setkey -DP freezes machine In-Reply-To: <20030303.013451.20307573.davem@redhat.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 1836 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jmorris@intercode.com.au Precedence: bulk X-list: netdev Content-Length: 3622 Lines: 106 On Mon, 3 Mar 2003, David S. Miller wrote: > Alternatively, a family parameter could be added to the compile_policy() > operation, but this duplicates data already present in our native > xfrm_userpolicy_info format. 
>
> I like this solution, it seems the cleanest.
>

Ok, here's a patch which does this.

I've also added a check to verify_newpolicy_info() so that we don't run into
the same problem for policies provided via the netlink interface.

Tom, would you let me know if this works for you, as my racoon isn't
working yet.


- James
-- 
James Morris

diff -urN -X dontdiff linux-2.5.63.orig/include/net/xfrm.h linux-2.5.63.w1/include/net/xfrm.h
--- linux-2.5.63.orig/include/net/xfrm.h	Fri Feb 21 00:44:01 2003
+++ linux-2.5.63.w1/include/net/xfrm.h	Mon Mar 3 22:19:40 2003
@@ -223,7 +223,7 @@
 	char *id;
 	int (*notify)(struct xfrm_state *x, int event);
 	int (*acquire)(struct xfrm_state *x, struct xfrm_tmpl *, struct xfrm_policy *xp, int dir);
-	struct xfrm_policy *(*compile_policy)(int opt, u8 *data, int len, int *dir);
+	struct xfrm_policy *(*compile_policy)(u16 family, int opt, u8 *data, int len, int *dir);
 };
 
 extern int xfrm_register_km(struct xfrm_mgr *km);
diff -urN -X dontdiff linux-2.5.63.orig/net/ipv4/xfrm_state.c linux-2.5.63.w1/net/ipv4/xfrm_state.c
--- linux-2.5.63.orig/net/ipv4/xfrm_state.c	Fri Feb 21 00:44:01 2003
+++ linux-2.5.63.w1/net/ipv4/xfrm_state.c	Mon Mar 3 22:23:53 2003
@@ -680,7 +680,7 @@
 	err = -EINVAL;
 	read_lock(&xfrm_km_lock);
 	list_for_each_entry(km, &xfrm_km_list, list) {
-		pol = km->compile_policy(optname, data, optlen, &err);
+		pol = km->compile_policy(sk->family, optname, data, optlen, &err);
 		if (err >= 0)
 			break;
 	}
diff -urN -X dontdiff linux-2.5.63.orig/net/ipv4/xfrm_user.c linux-2.5.63.w1/net/ipv4/xfrm_user.c
--- linux-2.5.63.orig/net/ipv4/xfrm_user.c	Tue Feb 25 15:03:26 2003
+++ linux-2.5.63.w1/net/ipv4/xfrm_user.c	Mon Mar 3 22:56:34 2003
@@ -538,6 +538,21 @@
 		return -EINVAL;
 	};
 
+	switch (p->family) {
+	case AF_INET:
+		break;
+
+	case AF_INET6:
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+		break;
+#else
+		return -EAFNOSUPPORT;
+#endif
+
+	default:
+		return -EINVAL;
+	};
+
 	return verify_policy_dir(p->dir);
 }
 
@@ -1057,7 +1072,8 @@
 /* User gives us xfrm_user_policy_info followed by an array of 0
  * or more templates.
  */
-struct xfrm_policy *xfrm_compile_policy(int opt, u8 *data, int len, int *dir)
+struct xfrm_policy *xfrm_compile_policy(u16 family, int opt,
+					u8 *data, int len, int *dir)
 {
 	struct xfrm_userpolicy_info *p = (struct xfrm_userpolicy_info *)data;
 	struct xfrm_user_tmpl *ut = (struct xfrm_user_tmpl *) (p + 1);
diff -urN -X dontdiff linux-2.5.63.orig/net/key/af_key.c linux-2.5.63.w1/net/key/af_key.c
--- linux-2.5.63.orig/net/key/af_key.c	Tue Feb 25 15:03:26 2003
+++ linux-2.5.63.w1/net/key/af_key.c	Mon Mar 3 22:30:56 2003
@@ -2420,7 +2420,8 @@
 	return pfkey_broadcast(skb, GFP_ATOMIC, BROADCAST_REGISTERED, NULL);
 }
 
-static struct xfrm_policy *pfkey_compile_policy(int opt, u8 *data, int len, int *dir)
+static struct xfrm_policy *pfkey_compile_policy(u16 family, int opt,
+						u8 *data, int len, int *dir)
 {
 	struct xfrm_policy *xp;
 	struct sadb_x_policy *pol = (struct sadb_x_policy*)data;
@@ -2451,6 +2452,7 @@
 	xp->lft.hard_byte_limit = XFRM_INF;
 	xp->lft.soft_packet_limit = XFRM_INF;
 	xp->lft.hard_packet_limit = XFRM_INF;
+	xp->family = family;
 
 	xp->xfrm_nr = 0;
 	if (pol->sadb_x_policy_type == IPSEC_POLICY_IPSEC &&

From davem@redhat.com Mon Mar 3 04:37:35 2003
Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 04:37:38 -0800 (PST)
Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23CbYeA007481 for ; Mon, 3 Mar 2003 04:37:35 -0800
Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id EAA01567; Mon, 3 Mar 2003 04:19:50 -0800
Date: Mon, 03 Mar 2003 04:19:50 -0800 (PST)
Message-Id: <20030303.041950.14411363.davem@redhat.com>
To: jmorris@intercode.com.au
Cc: toml@us.ibm.com, netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru
Subject: Re: [PATCH] Re: IPSec: setkey -DP freezes machine
From: "David S.
Miller" In-Reply-To: References: <20030303.013451.20307573.davem@redhat.com> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1837 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 530 Lines: 16 From: James Morris Date: Mon, 3 Mar 2003 23:13:55 +1100 (EST) On Mon, 3 Mar 2003, David S. Miller wrote: > Alternatively, a family parameter could be added to the compile_policy() > operation, but this duplicates data already present in our native > xfrm_userpolicy_info format. > > I like this solution, it seems the cleanest. Ok, here's a patch which does this. Looks good, I'll apply this. If more problems are found, we can patch on top of this. From terje.eggestad@scali.com Mon Mar 3 04:51:28 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 04:51:31 -0800 (PST) Received: from elin.scali.no (elin.scali.no [62.70.89.10]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23CpQeA007923 for ; Mon, 3 Mar 2003 04:51:28 -0800 Received: from pc-16.office.scali.no (pc-16.office.scali.no [172.16.0.116]) by elin.scali.no (8.12.5/8.12.5) with ESMTP id h23CpH4l024526; Mon, 3 Mar 2003 13:51:17 +0100 Subject: Re: anyone ever done multicast AF_UNIX sockets? 
From: Terje Eggestad
To: Chris Friesen
Cc: linux-kernel , netdev@oss.sgi.com, linux-net@vger.kernel.org
In-Reply-To: <3E5E7081.6020704@nortelnetworks.com>
References: <3E5E7081.6020704@nortelnetworks.com>
Content-Type: text/plain
Organization: Scali AS
Message-Id: <1046695876.7731.78.camel@pc-16.office.scali.no>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.1
Date: 03 Mar 2003 13:51:17 +0100
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: by amavisd-milter (http://amavis.org/)
X-archive-position: 1838
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: terje.eggestad@scali.com
Precedence: bulk
X-list: netdev
Content-Length: 3616
Lines: 87

On a single box you would use a shared memory segment to do this. It has
the following advantages:
- no syscalls at all
- whenever the recipients need to use the info, they access the shm
  directly (you may need to use a semaphore to enforce consistency, or if
  you're really pressed on time, spin lock a shm location). There is no
  need for the recipients to copy the info to private data structs.
- there is no need for the recipients to waste cycles on processing an
  update
- you KNOW that all the recipients have "updated" at the same time.

That aside, your idea of being notified when the listener (peer) is not
there is pretty hopeless when it comes to multicasts. Why does it help you
to know that there are no recipients, as opposed to the wrong number of
recipients???? Or asked differently: if you don't have a notion of who the
recipients are/should be, why would you care if there are none??????

There are practically no real applications for this feature.

If you really want to get to know that a recipient disappeared, use a
stream socket to each recipient, and to keep the # of syscalls down, get
the aio patch, and do the send to all with a single lio_listio() call.
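As a rough illustration of the shared-memory suggestion above, here is a minimal sketch of one writer publishing into a segment guarded by a spin lock stored in the segment itself. Everything in it is an assumption made for illustration: the struct layout, the 64-byte payload limit, the function names, and the use of an anonymous MAP_SHARED mapping (inherited across fork()) rather than a named segment. The reader copies the data out only to keep the example easy to check; a recipient could equally read in place while holding the lock.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAYLOAD_MAX 64

/* Hypothetical layout of the shared segment. */
struct shared_update {
    atomic_flag lock;               /* spin lock living inside the segment */
    uint64_t    seq;                /* incremented on every publish */
    size_t      len;                /* valid bytes in data[] */
    char        data[PAYLOAD_MAX];
};

/* Map an anonymous shared segment; children created by fork() share it. */
static struct shared_update *segment_create(void)
{
    void *p = mmap(NULL, sizeof(struct shared_update),
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    struct shared_update *s = p;
    atomic_flag_clear(&s->lock);
    s->seq = 0;
    s->len = 0;
    return s;
}

static void segment_lock(struct shared_update *s)
{
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
        ;   /* spin until the holder releases the flag */
}

static void segment_unlock(struct shared_update *s)
{
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Writer: replace the payload. No syscall is made on this path. */
static int segment_publish(struct shared_update *s, const void *buf, size_t len)
{
    assert(s != NULL && buf != NULL);
    if (len > PAYLOAD_MAX)
        return -1;
    segment_lock(s);
    memcpy(s->data, buf, len);
    s->len = len;
    s->seq++;
    segment_unlock(s);
    return 0;
}

/* Reader: copy out the current payload and return its sequence number. */
static uint64_t segment_read(struct shared_update *s, void *out, size_t *len)
{
    segment_lock(s);
    memcpy(out, s->data, s->len);
    *len = s->len;
    uint64_t seq = s->seq;
    segment_unlock(s);
    return seq;
}
```

A recipient can compare the returned sequence number with the last one it saw to tell whether anything changed. Note that a spin lock in shared memory has the weakness raised earlier in the thread: a process scheduled out while holding it stalls everyone else.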
Also: Keep in mind that whether you do multicast or an explicit send to
all, the data you're sending is copied from your buffer to the dest
sockets' recv buffers anyway. If you're sending 1k you need somewhere
between 250 and 1000 cycles to do the copy, depending on alignment.
I've measured the syscall overhead for a write(len=0) to be about 800
cycles on a P3 or Athlon, and about 2000 on a P4.

If you really have enough possible recipients, you should use a shm
segment instead. If you have only a few (~10) the overhead is worst case
20000 cycles, or on a 2 GHz P4, 10 microsecs to do a syscall for each.
Who cares...

TJ

On Thu, 2003-02-27 at 21:09, Chris Friesen wrote:
> It is fairly common to want to distribute information between a single
> sender and multiple receivers on a single box.
>
> Multicast IP sockets are one possibility, but then you have additional
> overhead in the IP stack.
>
> Unix sockets are more efficient and give notification if the listener is
> not present, but the problem then becomes that you must do one syscall
> for each listener.
>
> So, here's my main point--has anyone ever considered the concept of
> multicast AF_UNIX sockets?
>
> The main features would be:
> --ability to associate/disassociate a socket with a multicast address
> --ability to associate/disassociate with all multicast addresses
> (possibly through some kind of raw socket thing, or maybe a simple
> wildcard multicast address)
> --on process death all sockets owned by that process are disassociated
> from any multicast addresses that they were associated with
> --on sending a packet to a multicast address and there are no sockets
> associated with it, return -1 with errno=ECONNREFUSED
>
> The association/disassociation could be done using the setsockopt()
> calls the same as with udp sockets, everything else would be the same
> from a userspace perspective.
>
> Any thoughts? How hard would this be to put in?
> > Chris -- _________________________________________________________________________ Terje Eggestad mailto:terje.eggestad@scali.no Scali Scalable Linux Systems http://www.scali.com Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE) P.O.Box 150, Oppsal +47 975 31 574 (MOBILE) N-0619 Oslo fax: +47 22 62 89 51 NORWAY _________________________________________________________________________ From davem@redhat.com Mon Mar 3 04:53:49 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 04:53:51 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23CrmeA008291 for ; Mon, 3 Mar 2003 04:53:49 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id EAA01621; Mon, 3 Mar 2003 04:36:00 -0800 Date: Mon, 03 Mar 2003 04:35:59 -0800 (PST) Message-Id: <20030303.043559.19477354.davem@redhat.com> To: terje.eggestad@scali.com Cc: cfriesen@nortelnetworks.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Re: anyone ever done multicast AF_UNIX sockets? From: "David S. Miller" In-Reply-To: <1046695876.7731.78.camel@pc-16.office.scali.no> References: <3E5E7081.6020704@nortelnetworks.com> <1046695876.7731.78.camel@pc-16.office.scali.no> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1839 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 212 Lines: 6 From: Terje Eggestad Date: 03 Mar 2003 13:51:17 +0100 On a single box you would use a shared memory segment to do this. 
Thank you for applying real brains to this problem :) From toml@us.ibm.com Mon Mar 3 07:39:03 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 07:39:07 -0800 (PST) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.101]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23Fd2eA010972 for ; Mon, 3 Mar 2003 07:39:03 -0800 Received: from northrelay02.pok.ibm.com (northrelay02.pok.ibm.com [9.56.224.150]) by e1.ny.us.ibm.com (8.12.7/8.12.2) with ESMTP id h23Fbeab087742; Mon, 3 Mar 2003 10:37:40 -0500 Received: from d01ml072.pok.ibm.com (d01ml072.pok.ibm.com [9.117.250.211]) by northrelay02.pok.ibm.com (8.12.3/NCO/VER6.5) with ESMTP id h23FbbjS019300; Mon, 3 Mar 2003 10:37:38 -0500 Subject: Re: [PATCH] Re: IPSec: setkey -DP freezes machine To: James Morris Cc: "David S. Miller" , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com X-Mailer: Lotus Notes Release 5.0.11 July 24, 2002 Message-ID: From: "Tom Lendacky" Date: Mon, 3 Mar 2003 09:37:37 -0600 X-MIMETrack: Serialize by Router on D01ML072/01/M/IBM(Release 5.0.11 +SPRs MIAS5EXFG4, MIAS5AUFPV and DHAG4Y6R7W, MATTEST |November 8th, 2002) at 03/03/2003 10:37:39 AM MIME-Version: 1.0 Content-type: text/plain; charset=us-ascii X-archive-position: 1840 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: toml@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 384 Lines: 17 > Ok, here's a patch which does this. > > I've also added check to verify_newpolicy_info() so that we don't run into > the same problem for policies provided via the netlink interface. > > Tom, would you let me know if this works for you, as my racoon isn't > working yet. The patch works for me, setkey -DP no longer freezes the machine and the proper output is displayed. 
Tom From davem@redhat.com Mon Mar 3 07:42:07 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 07:42:10 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23Fg6eA011343 for ; Mon, 3 Mar 2003 07:42:07 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id HAA02317; Mon, 3 Mar 2003 07:23:46 -0800 Date: Mon, 03 Mar 2003 07:23:45 -0800 (PST) Message-Id: <20030303.072345.99136696.davem@redhat.com> To: toml@us.ibm.com Cc: jmorris@intercode.com.au, kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: [PATCH] Re: IPSec: setkey -DP freezes machine From: "David S. Miller" In-Reply-To: References: X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1841 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 310 Lines: 10 From: "Tom Lendacky" Date: Mon, 3 Mar 2003 09:37:37 -0600 > Tom, would you let me know if this works for you, as my racoon isn't > working yet. The patch works for me, setkey -DP no longer freezes the machine and the proper output is displayed. Thank you for testing. 
From cfriesen@nortelnetworks.com Mon Mar 3 09:09:48 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 09:09:57 -0800 (PST) Received: from zcars04e.nortelnetworks.com (zcars04e.nortelnetworks.com [47.129.242.56]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23H9keA012527 for ; Mon, 3 Mar 2003 09:09:48 -0800 Received: from zcard307.ca.nortel.com (zcard307.ca.nortel.com [47.129.242.67]) by zcars04e.nortelnetworks.com (Switch-2.2.0/Switch-2.2.0) with ESMTP id h23H9cR08332; Mon, 3 Mar 2003 12:09:39 -0500 (EST) Received: from zcard0k6.ca.nortel.com ([47.129.242.158]) by zcard307.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id GDFA8BZR; Mon, 3 Mar 2003 12:09:39 -0500 Received: from pcard0ks.ca.nortel.com ([47.129.117.131]) by zcard0k6.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id FSL7YYNG; Mon, 3 Mar 2003 12:09:38 -0500 Received: from nortelnetworks.com (localhost.localdomain [127.0.0.1]) by pcard0ks.ca.nortel.com (Postfix) with ESMTP id 1DDAA2E12F; Mon, 3 Mar 2003 12:09:38 -0500 (EST) Message-ID: <3E638C51.2000904@nortelnetworks.com> Date: Mon, 03 Mar 2003 12:09:37 -0500 X-Sybari-Space: 00000000 00000000 00000000 From: Chris Friesen User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204 X-Accept-Language: en-us MIME-Version: 1.0 To: Terje Eggestad Cc: linux-kernel , netdev@oss.sgi.com, linux-net@vger.kernel.org, davem@redhat.com Subject: Re: anyone ever done multicast AF_UNIX sockets? 
References: <3E5E7081.6020704@nortelnetworks.com> <1046695876.7731.78.camel@pc-16.office.scali.no> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 1842 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: cfriesen@nortelnetworks.com Precedence: bulk X-list: netdev Content-Length: 2790 Lines: 58 Terje Eggestad wrote: > On a single box you would use a shared memory segment to do this. It has > the following advantages: > - no syscalls at all Unless you poll for messages on the receiving side, how do you trigger the receiver to look for a message? Shared memory doesn't have file descriptors. > - whenever the recipients need to use the info, they access the shm > directly (you may need to use a semaphore to enforce consistency, or if > you're really pressed on time, spin lock a shm location) There is no > need for the recipients to copy the info to private data structs. How do they know the information has changed? Suppose one process detects that the ethernet link has dropped. How does it alert other processes which need to do something? > Why does it help you to know that there are no recipients contra the > wrong number recipients ???? OR asked differently, if you don't have a > notion of who the recipients are/should be, why would you care if there > are none?????? > There are practically no real applications for this feature. It's true that if I have a nonzero number of listeners it doesn't tell me anything since I don't know if the right one is included. However, if I send a message and there were *no* listeners but I know that there should be at least one, then I can log the anomaly, raise an alarm, or take whatever action is appropriate. > Also: Keep in mind that either you do multicast, or explisit send to > all, the data you're sending are copied from you buffer to the dest > sockets recv buffers anyway. 
If you're sending 1k you need somewhere > between 250 to 1000 cycles to do the copy, depending on alignment. I've > measured the syscall overhead for a write(len=0) to be about 800 cycles > on a P3 or athlon, and about 2000 on P4. If you really have enough > possible recipients, you should use a shm segment instead. If you have > only a few (~10) the overhead is worst case 20000 cycles, or on a 2G P4, > 10 microsecs to do a syscall for each. Who cares... Granted, shared memory (or sysV message queues) are the fastest way to transfer data between processes. However, you still have to implement some way to alert the receiver that there is a message waiting for it. For large packet sizes it may be sufficient to send a small unix socket message to alert it that there is a message waiting, but for small messages the cost of the copying is small compared to the cost of the context switch, and the unix multicast cuts the number of context switches in half. Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com From davem@redhat.com Mon Mar 3 09:13:03 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 09:13:08 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23HD3eA012917 for ; Mon, 3 Mar 2003 09:13:03 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id IAA02602; Mon, 3 Mar 2003 08:55:04 -0800 Date: Mon, 03 Mar 2003 08:55:04 -0800 (PST) Message-Id: <20030303.085504.105424448.davem@redhat.com> To: cfriesen@nortelnetworks.com Cc: terje.eggestad@scali.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Re: anyone ever done multicast AF_UNIX sockets? From: "David S. 
Miller" In-Reply-To: <3E638C51.2000904@nortelnetworks.com> References: <3E5E7081.6020704@nortelnetworks.com> <1046695876.7731.78.camel@pc-16.office.scali.no> <3E638C51.2000904@nortelnetworks.com> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1843 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 254 Lines: 8 From: Chris Friesen Date: Mon, 03 Mar 2003 12:09:37 -0500 Unless you poll for messages on the receiving side, how do you trigger the receiver to look for a message? Send signals. Use a FUTEX, be creative... From cfriesen@nortelnetworks.com Mon Mar 3 10:03:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 10:03:32 -0800 (PST) Received: from zcars04e.nortelnetworks.com (zcars04e.nortelnetworks.com [47.129.242.56]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23I3PeA015592 for ; Mon, 3 Mar 2003 10:03:25 -0800 Received: from zcard307.ca.nortel.com (zcard307.ca.nortel.com [47.129.242.67]) by zcars04e.nortelnetworks.com (Switch-2.2.0/Switch-2.2.0) with ESMTP id h23I2jR09110; Mon, 3 Mar 2003 13:02:45 -0500 (EST) Received: from zcard0k6.ca.nortel.com ([47.129.242.158]) by zcard307.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id GDFA8HJT; Mon, 3 Mar 2003 13:02:45 -0500 Received: from pcard0ks.ca.nortel.com ([47.129.117.131]) by zcard0k6.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id FSL7YYTZ; Mon, 3 Mar 2003 13:02:45 -0500 Received: from nortelnetworks.com (localhost.localdomain [127.0.0.1]) by pcard0ks.ca.nortel.com (Postfix) with ESMTP id 978492E12F; Mon, 3 Mar 2003 13:02:44 -0500 (EST) Message-ID: <3E6398C4.2020605@nortelnetworks.com> Date: Mon, 03 Mar 2003 13:02:44 
-0500 X-Sybari-Space: 00000000 00000000 00000000 From: Chris Friesen User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204 X-Accept-Language: en-us MIME-Version: 1.0 To: jamal Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Re: anyone ever done multicast AF_UNIX sockets? References: <3E5E7081.6020704@nortelnetworks.com> <20030228083009.Y53276@shell.cyberus.ca> <3E5F748E.2080605@nortelnetworks.com> <20030228212309.C57212@shell.cyberus.ca> <3E619E97.8010508@nortelnetworks.com> <20030302081916.S61365@shell.cyberus.ca> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 1844 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: cfriesen@nortelnetworks.com Precedence: bulk X-list: netdev Content-Length: 4146 Lines: 103 jamal wrote: > On Sun, 2 Mar 2003, Chris Friesen wrote >>jamal wrote >>>Did you also measure throughput >>No. lmbench doesn't appear to test UDP socket local throughput > I think you need to collect all data if you are trying to show > improvements. I'll look at how they were measuring unix socket throughput and try implementing something similar for UDP. It's not clear to me how to really measure throughput in a multicast environment though since it depends very much on your application messaging patterns. > Ok, so its only a problem when you have a few listeners i.e user space > scheme scales just fine as you keep adding listeners. > In your tests what was the break-even point? See below for more detailed test results. > Addressing has to be backwared compatible i.e not affecting any other > program. Of course. The way I've designed it is that you get and bind() a socket as normal, and then use setsockopt() to register interest in a multicast address (same as IP multicast). If the address already exists but is not a multicast address, then you get an error. 
If a socket tries to bind() or connect() to an existing multicast address, you get an error. The different types of addresses exist in the same address space, but the only way to register interest in multicast addresses is through setsockopt().

>>The timings (in usec) for the delays to each of the listeners were as
>>follows on my duron 750:
>>
>>userspace server:     104 133 153
>>userspace no server:  72 111 138
>>kernelspace:          60 91 113

> Actually, the difference between user space server and kernel doesn't
> appear that big. What you need to do is collect more data.
> repeat with incrementing number of listeners.

What would you consider a "big" difference? Here the userspace server is 35% slower than the kernelspace version.

You wanted more data, so here are results comparing the no-server userspace method vs the kernel method. The server-based one would be slightly more expensive than the no-server version. The results below are the smallest and largest latencies (in usecs) for the message to reach the listeners in userspace. I've used three different sizes, the two extremes and a roughly average sized message in my particular domain.

44-byte message
# listeners   userspace     kernelspace
10            73,335        103,252
20            72,610        106,429
50            74,1482       205,1301
100           76,3000       362,3425
200 737,9917

236-byte message
# listeners   userspace     kernelspace
10            70,346        81,265
20            74,639        122,468
50            75,1557       230,1421
100           80,3107       408,3743

40036-byte message
# listeners   userspace     kernelspace
10            302,4181      322,1692
20            303,7491      347,3450
50            306,10451     483,8394
100           309,23107     697,17061
200           313,45528     997,39810

As one would expect, the initial latencies are somewhat higher for the kernel space solution since all the skb header duplication is done before anyone is woken up.

One thing that I did not expect was the increased max latency in the kernel space solution when the number of listeners grew large.
On reflection, however, I suspect that this is due to scheduler load since all of the listening processes have become runnable while in the userspace version they become runnable one at a time. It would be interesting to run this on 2.5 with the O(1) scheduler and see if it makes a difference. With larger message sizes, the cost of the additional copies in the userspace solution start to outweigh the overhead of the additional runnable processes and the kernel space solution stays faster in all runs tested. Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com From cfriesen@nortelnetworks.com Mon Mar 3 10:07:57 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 10:08:02 -0800 (PST) Received: from zcars04e.nortelnetworks.com (zcars04e.nortelnetworks.com [47.129.242.56]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23I7ueA016947 for ; Mon, 3 Mar 2003 10:07:56 -0800 Received: from zcard307.ca.nortel.com (zcard307.ca.nortel.com [47.129.242.67]) by zcars04e.nortelnetworks.com (Switch-2.2.0/Switch-2.2.0) with ESMTP id h23I7mR09130; Mon, 3 Mar 2003 13:07:48 -0500 (EST) Received: from zcard0k6.ca.nortel.com ([47.129.242.158]) by zcard307.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id GDFA8H5L; Mon, 3 Mar 2003 13:07:47 -0500 Received: from pcard0ks.ca.nortel.com ([47.129.117.131]) by zcard0k6.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id FSL7YY44; Mon, 3 Mar 2003 13:07:48 -0500 Received: from nortelnetworks.com (localhost.localdomain [127.0.0.1]) by pcard0ks.ca.nortel.com (Postfix) with ESMTP id 74D0A2E12F; Mon, 3 Mar 2003 13:07:45 -0500 (EST) Message-ID: <3E6399F1.10303@nortelnetworks.com> Date: Mon, 03 Mar 2003 13:07:45 -0500 X-Sybari-Space: 00000000 00000000 00000000 From: Chris Friesen User-Agent: Mozilla/5.0 (X11; U; Linux i686; 
en-US; rv:0.9.8) Gecko/20020204 X-Accept-Language: en-us MIME-Version: 1.0 To: "David S. Miller" Cc: terje.eggestad@scali.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Re: anyone ever done multicast AF_UNIX sockets? References: <3E5E7081.6020704@nortelnetworks.com> <1046695876.7731.78.camel@pc-16.office.scali.no> <3E638C51.2000904@nortelnetworks.com> <20030303.085504.105424448.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 1845 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: cfriesen@nortelnetworks.com Precedence: bulk X-list: netdev Content-Length: 793 Lines: 26 David S. Miller wrote: > From: Chris Friesen > Date: Mon, 03 Mar 2003 12:09:37 -0500 > > Unless you poll for messages on the receiving side, how do you trigger > the receiver to look for a message? > > Send signals. Use a FUTEX, be creative... Suppose I have a process that waits on UDP packets, the unified local IPC that we're discussing, other unix sockets, and stdin. It's awfully nice if the local IPC can be handled using the same select/poll mechanism as all the other messaging. 
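
A rough sketch of the point above, for readers of the archive: when every message source is a file descriptor, one poll() call covers them all. This is only an illustration under assumptions, not code from the thread. The names wait_readable, demo_wait_readable and MAX_FDS are made up here, a pipe stands in for stdin, and an AF_UNIX datagram socketpair stands in for the local IPC socket.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_FDS 16  /* arbitrary limit chosen for this sketch */

/* Wait up to timeout_ms for any of the nfds descriptors to become readable.
 * Returns the index of the first readable descriptor, or -1 on timeout
 * or error. */
int wait_readable(const int *fds, size_t nfds, int timeout_ms)
{
    struct pollfd pfds[MAX_FDS];
    size_t i;

    assert(nfds <= MAX_FDS);
    for (i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    if (poll(pfds, (nfds_t)nfds, timeout_ms) <= 0)
        return -1;
    for (i = 0; i < nfds; i++)
        if (pfds[i].revents & POLLIN)
            return (int)i;
    return -1;
}

/* Hypothetical demo: only the unix socket (index 1) has data pending,
 * so that is the index wait_readable() should report. */
int demo_wait_readable(void)
{
    int pipefd[2], sv[2], fds[2], idx = -1;

    if (pipe(pipefd) < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }
    fds[0] = pipefd[0];  /* nothing is written to the pipe */
    fds[1] = sv[0];      /* receives the datagram sent below */
    if (send(sv[1], "x", 1, 0) == 1)
        idx = wait_readable(fds, 2, 1000);
    close(pipefd[0]);
    close(pipefd[1]);
    close(sv[0]);
    close(sv[1]);
    return idx;
}
```

A shared memory segment has no descriptor to put in the fds array, which is why it needs a separate wake-up channel such as a signal or a futex.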
Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com From davem@redhat.com Mon Mar 3 10:14:33 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 10:14:38 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23IEWeA017421 for ; Mon, 3 Mar 2003 10:14:33 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id JAA03149; Mon, 3 Mar 2003 09:56:41 -0800 Date: Mon, 03 Mar 2003 09:56:41 -0800 (PST) Message-Id: <20030303.095641.87696857.davem@redhat.com> To: cfriesen@nortelnetworks.com Cc: terje.eggestad@scali.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Re: anyone ever done multicast AF_UNIX sockets? From: "David S. Miller" In-Reply-To: <3E6399F1.10303@nortelnetworks.com> References: <3E638C51.2000904@nortelnetworks.com> <20030303.085504.105424448.davem@redhat.com> <3E6399F1.10303@nortelnetworks.com> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1846 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 860 Lines: 18 From: Chris Friesen Date: Mon, 03 Mar 2003 13:07:45 -0500 Suppose I have a process that waits on UDP packets, the unified local IPC that we're discussing, other unix sockets, and stdin. It's awfully nice if the local IPC can be handled using the same select/poll mechanism as all the other messaging. So use UDP, you still haven't backed up your performance claims. 
Experiment, set the SO_NO_CHECK socket option to "1" and see if that makes a difference performance wise for local clients. But if performance is "so important", then you shouldn't really be shying away from the shared memory suggestion and nothing is going to top that (it eliminates all the copies, using flat out AF_UNIX over UDP only truly eliminates some header processing, nothing more, the copies are still there with AF_UNIX). From ak@suse.de Mon Mar 3 10:18:40 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 10:18:46 -0800 (PST) Received: from Cantor.suse.de (ns.suse.de [213.95.15.193]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23IIdeA017787 for ; Mon, 3 Mar 2003 10:18:40 -0800 Received: from Hermes.suse.de (Hermes.suse.de [213.95.15.136]) by Cantor.suse.de (Postfix) with ESMTP id CA51114EAB; Mon, 3 Mar 2003 19:18:07 +0100 (MET) To: Chris Friesen Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com, linux-net@vger.kernel.org, hadi@cyberus.ca Subject: Re: anyone ever done multicast AF_UNIX sockets? 
References: <3E5E7081.6020704@nortelnetworks.com.suse.lists.linux.kernel> <20030228083009.Y53276@shell.cyberus.ca.suse.lists.linux.kernel> <3E5F748E.2080605@nortelnetworks.com.suse.lists.linux.kernel> <20030228212309.C57212@shell.cyberus.ca.suse.lists.linux.kernel> <3E619E97.8010508@nortelnetworks.com.suse.lists.linux.kernel> <20030302081916.S61365@shell.cyberus.ca.suse.lists.linux.kernel> <3E6398C4.2020605@nortelnetworks.com.suse.lists.linux.kernel> From: Andi Kleen Date: 03 Mar 2003 19:18:07 +0100 In-Reply-To: Chris Friesen's message of "3 Mar 2003 19:07:27 +0100" Message-ID: X-Mailer: Gnus v5.7/Emacs 20.7 X-archive-position: 1847 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ak@suse.de Precedence: bulk X-list: netdev Content-Length: 633 Lines: 17 Chris Friesen writes: > I'll look at how they were measuring unix socket throughput and try > implementing something similar for UDP. It's not clear to me how to > really measure throughput in a multicast environment though since it > depends very much on your application messaging patterns. Unix sockets are often slower than TCP over loopback because they use much smaller socket sizes by default. This causes much more context switches. Just run a vmstat 1 in parallel and watch the context switch rates. You can fix it by increasing the send and receive buffers of the unix socket. 
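
A minimal sketch of the buffer change suggested above, assuming a Linux/POSIX system. The helper names set_socket_buffers and get_send_buffer are invented for illustration, and the size passed in is up to the caller. The kernel may clamp or otherwise adjust the requested value, so the sketch reads the effective size back rather than assuming it.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for larger send and receive buffers on socket fd.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_socket_buffers(int fd, int bytes)
{
    assert(bytes > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    return 0;
}

/* Read back the effective send buffer size; returns -1 on error. */
int get_send_buffer(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len) < 0)
        return -1;
    return val;
}
```

Calling set_socket_buffers(fd, 65536) on both ends of a unix socket before a benchmark run, and watching the context switch rate in vmstat 1 as described, would show whether the default buffer size is what limits throughput.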
-Andi From cfriesen@nortelnetworks.com Mon Mar 3 11:11:17 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 11:11:24 -0800 (PST) Received: from zcars04e.nortelnetworks.com (zcars04e.nortelnetworks.com [47.129.242.56]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23JBGeA018646 for ; Mon, 3 Mar 2003 11:11:17 -0800 Received: from zcard307.ca.nortel.com (zcard307.ca.nortel.com [47.129.242.67]) by zcars04e.nortelnetworks.com (Switch-2.2.0/Switch-2.2.0) with ESMTP id h23JB8R10116; Mon, 3 Mar 2003 14:11:08 -0500 (EST) Received: from zcard0k6.ca.nortel.com ([47.129.242.158]) by zcard307.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id GDFA8P5D; Mon, 3 Mar 2003 14:11:08 -0500 Received: from pcard0ks.ca.nortel.com ([47.129.117.131]) by zcard0k6.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id FSL7YY8H; Mon, 3 Mar 2003 14:11:08 -0500 Received: from nortelnetworks.com (localhost.localdomain [127.0.0.1]) by pcard0ks.ca.nortel.com (Postfix) with ESMTP id DA6CE2E12F; Mon, 3 Mar 2003 14:11:07 -0500 (EST) Message-ID: <3E63A8CB.2090307@nortelnetworks.com> Date: Mon, 03 Mar 2003 14:11:07 -0500 X-Sybari-Space: 00000000 00000000 00000000 From: Chris Friesen User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204 X-Accept-Language: en-us MIME-Version: 1.0 To: "David S. Miller" Cc: terje.eggestad@scali.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Re: anyone ever done multicast AF_UNIX sockets? 
References: <3E638C51.2000904@nortelnetworks.com> <20030303.085504.105424448.davem@redhat.com> <3E6399F1.10303@nortelnetworks.com> <20030303.095641.87696857.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 1848 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: cfriesen@nortelnetworks.com Precedence: bulk X-list: netdev Content-Length: 1914 Lines: 47 David S. Miller wrote: > From: Chris Friesen > Date: Mon, 03 Mar 2003 13:07:45 -0500 > > Suppose I have a process that waits on UDP packets, the unified local > IPC that we're discussing, other unix sockets, and stdin. It's awfully > nice if the local IPC can be handled using the same select/poll > mechanism as all the other messaging. > > So use UDP, you still haven't backed up your performance > claims. Experiment, set the SO_NO_CHECK socket option to > "1" and see if that makes a difference performance wise > for local clients. I did provide numbers for UDP latency, which is more critical for my own application since most messages fit within a single packet. I haven't done UDP bandwidth testing--I need to check how lmbench did it for the unix socket and do the same for UDP. Local TCP was far slower than unix sockets though. > But if performance is "so important", then you shouldn't really be > shying away from the shared memory suggestion and nothing is going to > top that (it eliminates all the copies, using flat out AF_UNIX over > UDP only truly eliminates some header processing, nothing more, the > copies are still there with AF_UNIX). Yes, I realize that the receiver still has to do a copy. With large messages this could be an issue. With small messages, I had assumed that the cost of a recv() wouldn't be that much worse than the cost of the sender doing a kill() to alert the receiver that a message is waiting. Maybe I was wrong. 
It might be interesting to try a combination of sysV msg queue and signals to see how it stacks up. Project for tonight. Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com From davem@redhat.com Mon Mar 3 11:14:38 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 11:14:42 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23JEceA019020 for ; Mon, 3 Mar 2003 11:14:38 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id KAA03434; Mon, 3 Mar 2003 10:56:46 -0800 Date: Mon, 03 Mar 2003 10:56:46 -0800 (PST) Message-Id: <20030303.105646.02089773.davem@redhat.com> To: cfriesen@nortelnetworks.com Cc: terje.eggestad@scali.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Re: anyone ever done multicast AF_UNIX sockets? From: "David S. Miller" In-Reply-To: <3E63A8CB.2090307@nortelnetworks.com> References: <3E6399F1.10303@nortelnetworks.com> <20030303.095641.87696857.davem@redhat.com> <3E63A8CB.2090307@nortelnetworks.com> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1849 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 486 Lines: 12 From: Chris Friesen Date: Mon, 03 Mar 2003 14:11:07 -0500 I haven't done UDP bandwidth testing--I need to check how lmbench did it for the unix socket and do the same for UDP. Local TCP was far slower than unix sockets though. 
That result is system specific and depends upon how the data and datastructures hit the cpu cachelines in the kernel. TCP bandwidth is slightly faster than AF_UNIX bandwidth on my sparc64 boxes for example.

From terje.eggestad@scali.com Mon Mar 3 11:35:48 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 11:35:55 -0800 (PST) Received: from localhost.localdomain (2etnv5.cm.chello.no [80.111.51.24]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23JZkeA020080 for ; Mon, 3 Mar 2003 11:35:47 -0800 Received: from localhost (localhost [127.0.0.1]) by localhost.localdomain (8.12.5/8.12.5) with ESMTP id h23JdK2c028159; Mon, 3 Mar 2003 20:39:20 +0100 Subject: Re: anyone ever done multicast AF_UNIX sockets? From: Terje Eggestad To: Chris Friesen Cc: linux-kernel , netdev@oss.sgi.com, linux-net@vger.kernel.org, davem@redhat.com In-Reply-To: <3E638C51.2000904@nortelnetworks.com> References: <3E5E7081.6020704@nortelnetworks.com> <1046695876.7731.78.camel@pc-16.office.scali.no> <3E638C51.2000904@nortelnetworks.com> Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.8 (1.0.8-10) Date: 03 Mar 2003 20:39:19 +0100 Message-Id: <1046720360.28127.209.camel@eggis1> Mime-Version: 1.0 X-archive-position: 1850 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: terje.eggestad@scali.com Precedence: bulk X-list: netdev Content-Length: 4779 Lines: 99

On Mon, 2003-03-03 at 18:09, Chris Friesen wrote:

Terje Eggestad wrote:
> On a single box you would use a shared memory segment to do this. It has
> the following advantages:
> - no syscalls at all

Unless you poll for messages on the receiving side, how do you trigger the receiver to look for a message? Shared memory doesn't have file descriptors.

OK, you want multicast to send the *same* info to all peers. One of only two sane reasons to do that is to update the peers with some info they need to do real work.
So when there is real work to be done, the info is available in the shm. The other reason is to tell the others to die. Then you either a) have a socket/pipe connected that you get an end-of-file event on, or b) have a timeout on the select() (in any real-life app you should anyway), so that when select/poll times out (returns 0) or is interrupted (returns -1 with errno=EINTR), you check some flags in shm.

If you *had* multicast, you wouldn't know *when* a peer processed it. What if the peer is suspended? You don't get an error on the send, and you apparently never get an answer; then what? The peer may also have gone haywire in a while(1);

I have an OSS project (http://midway.sourceforge.net/) where I have a gateway daemon that polls on a large set of sockets (TCP/IP clients) and passes the requests to IPC servers, and back. The way I'm doing that is to have two threads, one in a blocking wait on the select/poll, the other on msgrcv. Works quite well.

> - whenever the recipients need to use the info, they access the shm
> directly (you may need to use a semaphore to enforce consistency, or if
> you're really pressed on time, spin lock a shm location) There is no
> need for the recipients to copy the info to private data structs.

How do they know the information has changed? Suppose one process detects that the ethernet link has dropped. How does it alert other processes which need to do something?

Again, if you want someone to do something, they must ack the request before you can safely assume that they are going to do something.

> Why does it help you to know that there are no recipients contra the
> wrong number recipients ???? OR asked differently, if you don't have a
> notion of who the recipients are/should be, why would you care if there
> are none??????
> There are practically no real applications for this feature.

It's true that if I have a nonzero number of listeners it doesn't tell me anything since I don't know if the right one is included.
However, if I send a message and there were *no* listeners but I know that there should be at least one, then I can log the anomaly, raise an alarm, or take whatever action is appropriate. > Also: Keep in mind that either you do multicast, or explisit send to > all, the data you're sending are copied from you buffer to the dest > sockets recv buffers anyway. If you're sending 1k you need somewhere > between 250 to 1000 cycles to do the copy, depending on alignment. I've > measured the syscall overhead for a write(len=0) to be about 800 cycles > on a P3 or athlon, and about 2000 on P4. If you really have enough > possible recipients, you should use a shm segment instead. If you have > only a few (~10) the overhead is worst case 20000 cycles, or on a 2G P4, > 10 microsecs to do a syscall for each. Who cares... Granted, shared memory (or sysV message queues) are the fastest way to transfer data between processes. However, you still have to implement some way to alert the receiver that there is a message waiting for it. For large packet sizes it may be sufficient to send a small unix socket message to alert it that there is a message waiting, but for small messages the cost of the copying is small compared to the cost of the context switch, and the unix multicast cuts the number of context switches in half. 
Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com -- _________________________________________________________________________ Terje Eggestad mailto:terje.eggestad@scali.no Scali Scalable Linux Systems http://www.scali.com Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE) P.O.Box 150, Oppsal +47 975 31 574 (MOBILE) N-0619 Oslo fax: +47 22 62 89 51 NORWAY _________________________________________________________________________ From terje.eggestad@scali.com Mon Mar 3 11:38:38 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 11:38:45 -0800 (PST) Received: from localhost.localdomain (2etnv5.cm.chello.no [80.111.51.24]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id h23JcaeA020456 for ; Mon, 3 Mar 2003 11:38:37 -0800 Received: from localhost (localhost [127.0.0.1]) by localhost.localdomain (8.12.5/8.12.5) with ESMTP id h23JgC2c028173; Mon, 3 Mar 2003 20:42:12 +0100 Subject: Re: anyone ever done multicast AF_UNIX sockets? From: Terje Eggestad To: "David S. Miller" Cc: cfriesen@nortelnetworks.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, linux-net@vger.kernel.org In-Reply-To: <20030303.105646.02089773.davem@redhat.com> References: <3E6399F1.10303@nortelnetworks.com> <20030303.095641.87696857.davem@redhat.com> <3E63A8CB.2090307@nortelnetworks.com> <20030303.105646.02089773.davem@redhat.com> Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.8 (1.0.8-10) Date: 03 Mar 2003 20:42:12 +0100 Message-Id: <1046720532.28127.213.camel@eggis1> Mime-Version: 1.0 X-archive-position: 1851 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: terje.eggestad@scali.com Precedence: bulk X-list: netdev Content-Length: 1295 Lines: 32 On Mon, 2003-03-03 at 19:56, David S. 
Miller wrote:

From: Chris Friesen Date: Mon, 03 Mar 2003 14:11:07 -0500

I haven't done UDP bandwidth testing--I need to check how lmbench did it for the unix socket and do the same for UDP. Local TCP was far slower than unix sockets though.

That result is system specific and depends upon how the data and datastructures hit the cpu cachelines in the kernel. TCP bandwidth is slightly faster than AF_UNIX bandwidth on my sparc64 boxes for example.

I've seen that they are the same on Linux. I tried to do AF_UNIX instead of AF_INET internally to boost perf, but to no avail. Makes you suspect that the loopback device actually creates an AF_UNIX connection under the hood ;-)

-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY
_________________________________________________________________________

From cfriesen@nortelnetworks.com Mon Mar 3 17:31:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 17:31:30 -0800 (PST) Received: from zcars04f.nortelnetworks.com (zcars04f.nortelnetworks.com [47.129.242.57]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h241VNp28668 for ; Mon, 3 Mar 2003 17:31:23 -0800 Received: from zcard309.ca.nortel.com (zcard309.ca.nortel.com [47.129.242.69]) by zcars04f.nortelnetworks.com (Switch-2.2.0/Switch-2.2.0) with ESMTP id h23LWvU00760; Mon, 3 Mar 2003 16:32:57 -0500 (EST) Received: from zcard0k6.ca.nortel.com ([47.129.242.158]) by zcard309.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id GDF4SKVC; Mon, 3 Mar 2003 16:32:57 -0500 Received: from pcard0ks.ca.nortel.com ([47.129.117.131]) by zcard0k6.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id FSL7YZ3Z; Mon, 3 Mar 2003 16:32:57 -0500 Received: from
nortelnetworks.com (localhost.localdomain [127.0.0.1]) by pcard0ks.ca.nortel.com (Postfix) with ESMTP id 73CD52E12F; Mon, 3 Mar 2003 16:32:56 -0500 (EST) Message-ID: <3E63CA08.4040209@nortelnetworks.com> Date: Mon, 03 Mar 2003 16:32:56 -0500 X-Sybari-Space: 00000000 00000000 00000000 From: Chris Friesen User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204 X-Accept-Language: en-us MIME-Version: 1.0 To: Terje Eggestad Cc: "David S. Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Re: anyone ever done multicast AF_UNIX sockets? References: <3E6399F1.10303@nortelnetworks.com> <20030303.095641.87696857.davem@redhat.com> <3E63A8CB.2090307@nortelnetworks.com> <20030303.105646.02089773.davem@redhat.com> <1046720532.28127.213.camel@eggis1> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 1852 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: cfriesen@nortelnetworks.com Precedence: bulk X-list: netdev Content-Length: 1774 Lines: 43

Terje Eggestad wrote:
> On Mon, 2003-03-03 at 19:56, David S. Miller wrote:
>     TCP bandwidth is slightly faster than AF_UNIX bandwidth on my
>     sparc64 boxes for example.
>
> I've seen that they are the same on Linux. I tried to do AF_UNIX
> instead of AF_INET internally to boost perf, but to no avail. Makes you
> suspect that the loopback device actually creates an AF_UNIX connection
> under the hood ;-)

On my P4 1.8GHz, AF_INET vs AF_UNIX looks like this:

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host      OS      2p/0K Pipe  AF    UDP   RPC/  TCP   RPC/  TCP
                  ctxsw       UNIX        UDP         TCP   conn
--------- ------- ----- ----- ----- ----- ----- ----- ----- -----
pcard0ks. 2.4.18- 1.740 10.4  15.9  20.1  33.1  23.5  44.3  72.7
pcard0ks. 2.4.18- 1.560 10.6  16.0  23.4  38.1  36.1  44.6  77.4

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host      OS      Pipe AF   TCP  File   Mmap   Bcopy  Bcopy  Mem  Mem
                       UNIX      reread reread (libc) (hand) read write
--------- ------- ---- ---- ---- ------ ------ ------ ------ ---- -----
pcard0ks. 2.4.18- 650. 677. 151. 721.9  958.0  290.8  288.8  955. 418.4
pcard0ks. 2.4.18- 379. 701. 163. 714.8  949.5  289.5  288.5  956. 420.5

On this machine at least, UDP latency is 25% worse than AF_UNIX, and TCP bandwidth is about 22% that of AF_UNIX.

Chris

-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com

From cfriesen@nortelnetworks.com Mon Mar 3 17:35:07 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 17:35:11 -0800 (PST) Received: from zcars04f.nortelnetworks.com (zcars04f.nortelnetworks.com [47.129.242.57]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h241Z7j29263 for ; Mon, 3 Mar 2003 17:35:07 -0800 Received: from zcard309.ca.nortel.com (zcard309.ca.nortel.com [47.129.242.69]) by zcars04f.nortelnetworks.com (Switch-2.2.0/Switch-2.2.0) with ESMTP id h23MTEU04066; Mon, 3 Mar 2003 17:29:15 -0500 (EST) Received: from zcard0k6.ca.nortel.com ([47.129.242.158]) by zcard309.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id GDF4SL41; Mon, 3 Mar 2003 17:29:15 -0500 Received: from pcard0ks.ca.nortel.com ([47.129.117.131]) by zcard0k6.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id FSL7YZVN; Mon, 3 Mar 2003 17:29:15 -0500 Received: from nortelnetworks.com (localhost.localdomain [127.0.0.1]) by pcard0ks.ca.nortel.com (Postfix) with ESMTP id 3CEB72E12F; Mon, 3 Mar 2003 17:29:14 -0500 (EST) Message-ID: <3E63D73A.2000402@nortelnetworks.com> Date: Mon, 03 Mar 2003 17:29:14 -0500 X-Sybari-Space: 00000000 00000000
00000000 From: Chris Friesen User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204 X-Accept-Language: en-us MIME-Version: 1.0 To: Terje Eggestad Cc: linux-kernel , netdev@oss.sgi.com, linux-net@vger.kernel.org, davem@redhat.com Subject: Re: anyone ever done multicast AF_UNIX sockets? References: <3E5E7081.6020704@nortelnetworks.com> <1046695876.7731.78.camel@pc-16.office.scali.no> <3E638C51.2000904@nortelnetworks.com> <1046720360.28127.209.camel@eggis1> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 1853 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: cfriesen@nortelnetworks.com Precedence: bulk X-list: netdev Content-Length: 2414 Lines: 54 Terje Eggestad wrote: > On Mon, 2003-03-03 at 18:09, Chris Friesen wrote: > Terje Eggestad wrote: > > On a single box you would use a shared memory segment to do this. It has > > the following advantages: > > - no syscalls at all > > Unless you poll for messages on the receiving side, how do you trigger > the receiver to look for a message? Shared memory doesn't have file > descriptors. > > OK, you want multicast to send the *same* info to all peers. The only of > two sane reason to do that is to update the peers with some info they > need to do real work. So when there is reel work to be done, the info is > available in the shm. Okay, but how do they know there is work to be done? They're waiting in select() monitoring sockets, fds, being hit with signals, etc. How do you tell them to check their messages? You have to hit them over the head with a signal or something and tell them to check the shared memory messages. > If you *had* multicast, you don't know *when* a peer proccessed it. > What if the peer is suspended ??? you don't get an error on the send, > and you apparently never get an answer, then what? The peer may also > gone haywire on a while(1); Exactly. 
So if the message got delivered you have no way of knowing for sure that it was processed and you have application-level timers and stuff. But if the message wasn't delivered to anyone and you know it should have been, then you don't have to wait for the timer to expire to know that they didn't get it. > How do they know the information has changed? Suppose one process > detects that the ethernet link has dropped. How does it alert other > processes which need to do something? > > Again, if you want someone to do something, they must ack the request > before you can safely assume that they are going to do something. Certainly. My point was that if you're trying to handle all events in a single thread, you need some way to tell the message recipient that it needs to check the shared memory buffer. Otherwise you need multiple threads like you mentioned in your project description. Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com From terje.eggestad@scali.com Mon Mar 3 18:14:23 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 18:14:27 -0800 (PST) Received: from localhost.localdomain (2etnv5.cm.chello.no [80.111.51.24]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h242ELf31746 for ; Mon, 3 Mar 2003 18:14:21 -0800 Received: from localhost (localhost [127.0.0.1]) by localhost.localdomain (8.12.5/8.12.5) with ESMTP id h23NTO2c028472; Tue, 4 Mar 2003 00:29:24 +0100 Subject: Re: anyone ever done multicast AF_UNIX sockets? 
From: Terje Eggestad To: Chris Friesen Cc: linux-kernel , netdev@oss.sgi.com, linux-net@vger.kernel.org, davem@redhat.com In-Reply-To: <3E63D73A.2000402@nortelnetworks.com> References: <3E5E7081.6020704@nortelnetworks.com> <1046695876.7731.78.camel@pc-16.office.scali.no> <3E638C51.2000904@nortelnetworks.com> <1046720360.28127.209.camel@eggis1> <3E63D73A.2000402@nortelnetworks.com> Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.8 (1.0.8-10) Date: 04 Mar 2003 00:29:24 +0100 Message-Id: <1046734165.27924.263.camel@eggis1> Mime-Version: 1.0 X-archive-position: 1854 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: terje.eggestad@scali.com Precedence: bulk X-list: netdev Content-Length: 2751 Lines: 82

My point is that you can't send a request for real work with either shm or multicast. You don't know who or how many recipients there are. You just use it to update someone who does real work. Then they tend not to need it until they get a request for real work, which almost always arrives on a TCP connection or as a UDP (unicast) message.

How do you design a protocol that uses multicast to send a request to do work?

All the uses of multicast/broadcast I can think of right now are:
* Discovery, like in NIS.
* Announcements, like in OSPF.
* Updates, like in NTP broadcast.

DHCP is actually a nice example of the very, very bad things that happen if you lose control of how many servers are running.

On Mon, 2003-03-03 at 23:29, Chris Friesen wrote:
Terje Eggestad wrote:
> On Mon, 2003-03-03 at 18:09, Chris Friesen wrote:
> If you *had* multicast, you don't know *when* a peer processed it.
> What if the peer is suspended ??? you don't get an error on the send,
> and you apparently never get an answer, then what? The peer may also
> have gone haywire on a while(1);

Exactly.
So if the message got delivered you have no way of knowing for sure that it was processed and you have application-level timers and stuff. But if the message wasn't delivered to anyone and you know it should have been, then you don't have to wait for the timer to expire to know that they didn't get it.

Nice to know, but how does it help you? What if there is a subscriber out there that is hung? You need that timer *anyway*. Why the special case?

All I see you're trying to do is something like this (just the nonblocking version):

    do_unix_mcast(message)
    {
        alarm(timeout);
        rc = write(fd_unixmulticast, message, mlen);
        if (rc == -1 && errno == nosubscribers)  /* the proposed new error */
            goto they_are_all_dead;

        rc = select(fd_unixmulticast, ...);      /* wait for the replies */
        if (rc == -1 && errno == EINTR)          /* alarm fired first */
            goto they_are_all_dead;

        alarm(0);
        process_reply();
        return;

    they_are_all_dead:
        handle_all_dead_peers();
        return;
    }

Chris

-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com

-- 
_________________________________________________________________________

Terje Eggestad                  mailto:terje.eggestad@scali.no
Scali Scalable Linux Systems    http://www.scali.com

Olaf Helsets Vei 6              tel:    +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal                     +47 975 31 574  (MOBILE)
N-0619 Oslo                     fax:    +47 22 62 89 51
NORWAY
_________________________________________________________________________

From terje.eggestad@scali.com Mon Mar 3 18:14:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 18:14:27 -0800 (PST) Received: from localhost.localdomain (2etnv5.cm.chello.no [80.111.51.24]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h242ENf31754 for ; Mon, 3 Mar 2003 18:14:23 -0800 Received: from localhost (localhost [127.0.0.1]) by localhost.localdomain (8.12.5/8.12.5) with ESMTP id h23Nc22c028479; Tue, 4 Mar 2003 00:38:03 +0100 Subject: Re: anyone ever done multicast AF_UNIX sockets? From: Terje Eggestad To: Chris Friesen Cc: "David S.
Miller" , linux-kernel@vger.kernel.org, netdev@oss.sgi.com, linux-net@vger.kernel.org In-Reply-To: <3E63CA08.4040209@nortelnetworks.com> References: <3E6399F1.10303@nortelnetworks.com> <20030303.095641.87696857.davem@redhat.com> <3E63A8CB.2090307@nortelnetworks.com> <20030303.105646.02089773.davem@redhat.com> <1046720532.28127.213.camel@eggis1> <3E63CA08.4040209@nortelnetworks.com> Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.8 (1.0.8-10) Date: 04 Mar 2003 00:38:02 +0100 Message-Id: <1046734683.28127.275.camel@eggis1> Mime-Version: 1.0 X-archive-position: 1855 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: terje.eggestad@scali.com Precedence: bulk X-list: netdev Content-Length: 2975 Lines: 71

The latency I believe; a 25% increase doesn't matter all that much. (I routinely send messages sub-microsecond.) But that TCP BW is ridiculously low; make sure that you run with good-sized socket buffers, and that TCP windowing is enabled.

But then again, if you want to send much data fast between processes, a stream socket is a pretty bad idea anyway. Use a) shm, or b) mmap a file, write into it, send the filename to the other side, and mmap it there.

Don't underestimate the BW of a fedex'ed tape.

TJ

On Mon, 2003-03-03 at 22:32, Chris Friesen wrote:
Terje Eggestad wrote:
> On Mon, 2003-03-03 at 19:56, David S. Miller wrote:
>     TCP bandwidth is slightly faster than AF_UNIX bandwidth on my
>     sparc64 boxes for example.
>
> I've seen that they are the same on Linux. I tried to do AF_UNIX
> instead of AF_INET internally to boost perf, but to no avail. Makes you
> suspect that the loopback device actually creates an AF_UNIX connection
> under the hood ;-)

On my P4 1.8GHz, AF_INET vs AF_UNIX looks like this:

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host      OS      2p/0K Pipe  AF    UDP   RPC/  TCP   RPC/  TCP
                  ctxsw       UNIX        UDP         TCP   conn
--------- ------- ----- ----- ----- ----- ----- ----- ----- -----
pcard0ks. 2.4.18- 1.740 10.4  15.9  20.1  33.1  23.5  44.3  72.7
pcard0ks. 2.4.18- 1.560 10.6  16.0  23.4  38.1  36.1  44.6  77.4

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host      OS      Pipe AF   TCP  File   Mmap   Bcopy  Bcopy  Mem  Mem
                       UNIX      reread reread (libc) (hand) read write
--------- ------- ---- ---- ---- ------ ------ ------ ------ ---- -----
pcard0ks. 2.4.18- 650. 677. 151. 721.9  958.0  290.8  288.8  955. 418.4
pcard0ks. 2.4.18- 379. 701. 163. 714.8  949.5  289.5  288.5  956. 420.5

On this machine at least, UDP latency is 25% worse than AF_UNIX, and TCP bandwidth is about 22% that of AF_UNIX.
Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com -- _________________________________________________________________________ Terje Eggestad mailto:terje.eggestad@scali.no Scali Scalable Linux Systems http://www.scali.com Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE) P.O.Box 150, Oppsal +47 975 31 574 (MOBILE) N-0619 Oslo fax: +47 22 62 89 51 NORWAY _________________________________________________________________________ From hadi@cyberus.ca Mon Mar 3 18:38:44 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 03 Mar 2003 18:38:48 -0800 (PST) Received: from mx02.cyberus.ca (mx02.cyberus.ca [216.191.240.26]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h242cif03294 for ; Mon, 3 Mar 2003 18:38:44 -0800 Received: from shell.cyberus.ca ([216.191.240.114]) by mx02.cyberus.ca with esmtp (Exim 4.10) id 18q2KJ-000JPB-00; Mon, 03 Mar 2003 21:38:43 -0500 Received: from shell.cyberus.ca (localhost.cyberus.ca [127.0.0.1]) by shell.cyberus.ca (8.12.6/8.12.6) with ESMTP id h242cIqu068027; Mon, 3 Mar 2003 21:38:18 -0500 (EST) (envelope-from hadi@cyberus.ca) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.12.6/8.12.6/Submit) with ESMTP id h242cHmE068024; Mon, 3 Mar 2003 21:38:17 -0500 (EST) (envelope-from hadi@cyberus.ca) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Mon, 3 Mar 2003 21:38:17 -0500 (EST) From: jamal To: Terje Eggestad cc: Chris Friesen , linux-kernel , "" , "" , "" Subject: Re: anyone ever done multicast AF_UNIX sockets? 
In-Reply-To: <1046734165.27924.263.camel@eggis1> Message-ID: <20030303212628.M67734@shell.cyberus.ca> References: <3E5E7081.6020704@nortelnetworks.com> <1046695876.7731.78.camel@pc-16.office .scali.no> <3E638C51.2000904@nortelnetworks.com> <1046720360.28127.209.camel@eggis1> <3E63D73A.2000402@nortelnetworks.com> <1046734165.27924.263.camel@eggis1> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 1856 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Content-Length: 635 Lines: 27 Hi Terje, On Mon, 4 Mar 2003, Terje Eggestad wrote: > How do you design a protocol that uses multicast to send a request to do > work? > > All uses I can think of right now of multicast/broadcast is: > * Discovery, like in NIS. > * Announcements like in OSPF. > * update like in NTP broadcast > I know we are digressing away from main discussion ... The concept of reliable multicast is known to be useful. Look at(for some sample apps): http://www.ietf.org/html.charters/rmt-charter.html But we are talking about a distributed system in that context. Agreed, reliability and multicast do not always make sense. 
cheers,
jamal

From hshmulik@intel.com Tue Mar 4 09:11:47 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 04 Mar 2003 09:11:50 -0800 (PST) Received: from caduceus.fm.intel.com (fmr02.intel.com [192.55.52.25]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h24HBkf16696 for ; Tue, 4 Mar 2003 09:11:46 -0800 Received: from petasus.fm.intel.com (petasus.fm.intel.com [10.1.192.37]) by caduceus.fm.intel.com (8.11.6/8.11.6/d: outer.mc,v 1.51 2002/09/23 20:43:23 dmccart Exp $) with ESMTP id h24H5Qe12437 for ; Tue, 4 Mar 2003 17:05:26 GMT Received: from fmsmsxvs042.fm.intel.com (fmsmsxvs042.fm.intel.com [132.233.42.128]) by petasus.fm.intel.com (8.11.6/8.11.6/d: inner.mc,v 1.28 2003/01/13 19:44:39 dmccart Exp $) with SMTP id h24H6GT18665 for ; Tue, 4 Mar 2003 17:06:16 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxvs042.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003030409142528026 ; Tue, 04 Mar 2003 09:14:26 -0800 Date: Tue, 4 Mar 2003 19:11:42 +0200 (IST) From: Shmulik Hen X-X-Sender: hshmulik@jrslxjul4.npdj.intel.com To: bonding-devel@lists.sourceforge.net, , cc: jgarzik@pobox.com Subject: [PATCH][bonding] division by zero bug Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 1857 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hshmulik@intel.com Precedence: bulk X-list: netdev Content-Length: 1165 Lines: 34

The following patch fixes a division by zero bug in the bonding module that happens when transmitting through a bond with no slaves in the XOR bonding mode. The patch is against bonding-2.4.20-20030207 on SourceForge (http://sourceforge.net/projects/bonding/).
diff -urN linux-2.4.20-20030207/drivers/net/bonding.c linux-2.4.20-devel/drivers/net/bonding.c
--- linux-2.4.20-20030207/drivers/net/bonding.c 2003-03-02 14:01:46.000000000 +0200
+++ linux-2.4.20-devel/drivers/net/bonding.c 2003-03-02 14:35:04.000000000 +0200
@@ -2597,6 +2597,13 @@
                 return 0;
         }
 
+        if (bond->slave_cnt == 0) {
+                /* no slaves in the bond, frame not sent */
+                dev_kfree_skb(skb);
+                read_unlock_irqrestore(&bond->lock, flags);
+                return 0;
+        }
+
         slave_no = (data->h_dest[5]^slave->dev->dev_addr[5]) % bond->slave_cnt;
 
         while ( (slave_no > 0) && (slave != (slave_t *)bond) ) {

-- 
| Shmulik Hen                                    |
| Israel Design Center (Jerusalem)               |
| LAN Access Division                            |
| Intel Communications Group, Intel corp.        |
|                                                |
| Anti-Spam: shmulik dot hen at intel dot com    |

From ahu@outpost.ds9a.nl Wed Mar 5 03:28:59 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 03:29:01 -0800 (PST) Received: from outpost.ds9a.nl (outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h25BSvf12503 for ; Wed, 5 Mar 2003 03:28:58 -0800 Received: by outpost.ds9a.nl (Postfix, from userid 1000) id CEBE74501; Wed, 5 Mar 2003 12:28:52 +0100 (CET) Date: Wed, 5 Mar 2003 12:28:52 +0100 From: bert hubert To: Andreas Jellinghaus Cc: mit_warlord@users.sourceforge.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: ipsec-tools 0.1 + kernel 2.5.64 Message-ID: <20030305112852.GA22351@outpost.ds9a.nl> Mail-Followup-To: bert hubert , Andreas Jellinghaus , mit_warlord@users.sourceforge.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com References: <1046863752.441.7.camel@simulacron> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1046863752.441.7.camel@simulacron> User-Agent: Mutt/1.3.28i X-archive-position: 1858 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ahu@ds9a.nl Precedence: bulk X-list: netdev Content-Length: 728 Lines:
22 On Wed, Mar 05, 2003 at 12:29:12PM +0100, Andreas Jellinghaus wrote: > Hi, > > both manual keying and automatic keying with racoon (pre-shared secret) > are working fine. No need to patch or modify anything. > I tried only ipv4. By the way, regarding ipsec-tools 0.1, are you sure you want to fork the projects involved? By the way, you did not mention it here but ipsec-tools is available on http://sourceforge.net/projects/ipsec-tools , I also link them from http://lartc.org/howto/lartc.ipsec.html Regards, bert -- http://www.PowerDNS.com Open source, database driven DNS Software http://lartc.org Linux Advanced Routing & Traffic Control HOWTO http://netherlabs.nl Consulting From linux-netdev@gmane.org Wed Mar 5 04:31:28 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 04:31:32 -0800 (PST) Received: from main.gmane.org (main.gmane.org [80.91.224.249]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h25CVQf14743 for ; Wed, 5 Mar 2003 04:31:27 -0800 Received: from root by main.gmane.org with local (Exim 3.35 #1 (Debian)) id 18qY2T-0005EJ-00 for ; Wed, 05 Mar 2003 13:30:25 +0100 X-Injected-Via-Gmane: http://gmane.org/ To: netdev@oss.sgi.com Received: from news by main.gmane.org with local (Exim 3.35 #1 (Debian)) id 18qRvo-0001Rd-00 for ; Wed, 05 Mar 2003 06:59:08 +0100 From: "jpaul" Subject: SuSE 8.1 Wireless Network Date: Wed, 05 Mar 2003 07:03:35 +0100 Message-ID: Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-Complaints-To: usenet@main.gmane.org User-Agent: Pan/0.13.3 (That cat's something I can't explain) X-archive-position: 1859 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jpaulb@web.de Precedence: bulk X-list: netdev Content-Length: 236 Lines: 9 I would like to put together a wireless network. 
I was thinking about the Linksys WEFW11S DSL router and the WUSD11 USB wireless adapter for a laptop.

Has anyone any experience with these? Ease of setup, whether they work at all, etc.

Paul

From eric@lammerts.org Wed Mar 5 06:11:27 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 06:11:29 -0800 (PST) Received: from ezri.xs4all.nl (ezri.xs4all.nl [194.109.253.9]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h25EBPf16010 for ; Wed, 5 Mar 2003 06:11:26 -0800 Received: (qmail 16995 invoked by uid 502); 5 Mar 2003 14:11:23 -0000 Date: Wed, 5 Mar 2003 15:11:23 +0100 From: Eric Lammerts To: linux-net@vger.kernel.org, netdev@oss.sgi.com Cc: alan@lxorguk.ukuu.org.uk Subject: [PATCH] wrong ENETDOWN in af_packet? Message-ID: <20030305141123.GA16699@ally.lammerts.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4i X-archive-position: 1860 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: eric@lammerts.org Precedence: bulk X-list: netdev Content-Length: 2599 Lines: 120

Hi,

I have a program that goes like this (source code at end of mail):

    open PF_PACKET socket
    look up index of interface x
    bind to that interface
    bring interface x down (~IFF_UP)
    bring interface x up (IFF_UP|IFF_RUNNING)
    for(;;) {
        recvfrom()
    }

Problem: the first recvfrom() always results in ENETDOWN. The reason is that (in af_packet.c) packet_notifier(NETDEV_DOWN) sets sk->err to ENETDOWN, but packet_notifier(NETDEV_UP) doesn't clear it.

Is this behaviour deliberate?
If not, I suggest the following patch:

diff -u -r1.1.1.1 af_packet.c
--- linux-2.4.19/net/packet/af_packet.c 10 Jan 2003 16:20:09 -0000 1.1.1.1
+++ linux-2.4.19/net/packet/af_packet.c 5 Mar 2003 11:04:33 -0000
@@ -1407,6 +1407,7 @@
                         dev_add_pack(&po->prot_hook);
                         sock_hold(sk);
                         po->running = 1;
+                        sk->err = 0;
                 }
                 spin_unlock(&po->bind_lock);
 #ifdef CONFIG_PACKET_MULTICAST

Currently I work around the problem by doing a getsockopt(x, SOL_SOCKET, SO_ERROR,...) to clear the error variable.

Eric

/* The header names were lost in the archived copy; the ones listed
 * here are those the code below needs on Linux/glibc. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <net/ethernet.h>
#include <netpacket/packet.h>

/* Set the flags in 'set' and clear the flags in 'reset' on device_name. */
void modify_iface_flags(int sock, char *device_name, short set, short reset)
{
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, device_name, IFNAMSIZ - 1);
    if(ioctl(sock, SIOCGIFFLAGS, &ifr) < 0) {
        perror("SIOCGIFFLAGS");
        exit(1);
    }

    ifr.ifr_flags |= set;
    ifr.ifr_flags &= ~reset;

    if(ioctl(sock, SIOCSIFFLAGS, &ifr) < 0) {
        perror("SIOCSIFFLAGS");
        exit(1);
    }
}

/* Look up the interface index and bind the packet socket to it. */
void bind_to_iface(int sock, char *ifacename)
{
    struct ifreq ifr;
    struct sockaddr_ll sa;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifacename, IFNAMSIZ - 1);
    if(ioctl(sock, SIOCGIFINDEX, &ifr) < 0) {
        perror("ioctl SIOCGIFINDEX");
        exit(1);
    }

    /* zero the address so sll_protocol is 0 (keep the socket's protocol) */
    memset(&sa, 0, sizeof(sa));
    sa.sll_family = AF_PACKET;
    sa.sll_ifindex = ifr.ifr_ifindex;
    if(bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        perror("bind");
        exit(1);
    }
}

int main()
{
    int fd, sz;
    char iface[] = "eth0";
    unsigned char data[1518];
    struct sockaddr_ll sa;
    socklen_t salen;

    fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if(fd < 0) {
        perror("socket");
        exit(1);
    }

    bind_to_iface(fd, iface);

    // bring it down
    modify_iface_flags(fd, iface, 0, IFF_UP);

    // bring it up
    modify_iface_flags(fd, iface, IFF_UP | IFF_RUNNING, 0);

    // receive packet; this first call is the one that fails with ENETDOWN
    salen = sizeof(sa);
    sz = recvfrom(fd, data, sizeof(data), 0, (struct sockaddr *)&sa, &salen);
    if(sz == -1) {
        perror("recvfrom");
        exit(1);
    }

    return 0;
}

From agx@linux.it Wed Mar 5 06:24:38 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 06:24:41 -0800 (PST)
Received: from ax-agx.axnet.it (dns.axnet.it [217.59.82.2]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h25EObf16832 for ; Wed, 5 Mar 2003 06:24:38 -0800 Received: by ax-agx.axnet.it (Postfix, from userid 1000) id DF4354234D; Wed, 5 Mar 2003 15:27:47 +0100 (CET) Date: Wed, 5 Mar 2003 15:27:47 +0100 From: Antonio Gallo To: mitch@sfgoth.com Cc: netdev@oss.sgi.com Subject: [Bug] PPPoATM or ATM module problem with ADSL PCI Cards Message-ID: <20030305142747.GA17315@ax-agx.axnet.it> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline User-Agent: Mutt/1.4i X-Disclaimer: Please visit http://www.badpenguin.org/ X-Operating-System: Bad Penguin GNU/Linux 0.99.7 X-archive-position: 1861 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: agx@linux.it Precedence: bulk X-list: netdev Content-Length: 2907 Lines: 79 I'm not sure if is a real bug or something wrong but after 3 weeks of tests i'm thinking is something inside the kernel (pppoatm.o) or the ATM layer in general, and so some friend at @linux.it suggest me to contact directly the mantainers instead of writing to the kernel ML. This is the situation: LAN <=== ethernet ===> Zyxel Router <=== PPPoA ===> Provider This works! I remove the phone cable from the router and put it into our Linux Box Linux <=== PPPoA ===> Provider I've used 2 different kind of cards: 1. BeWan PCI ADSL ST (ATM driver + pppd with the atm plugin) 2. Pulsar ADSL (provide /dev/ttyG0 so i just use normal ppp ) Boths card can detect the line (Link Up), i can also see the "link" LED to become "on". Where i run the PPP to connect to the provider i can see the "Tx" LED working but the "Rx" LED not. Confirmation of this is done through "ifconfig" or "cat /proc/net/atm/UNICORN:0" that showme a positive number for Tx and Error number for Rx. I contacted the provider of the line and it told me the right parameter of the line, that i was already know. 
Provider: Elitel (via Telecom Italia) - Italy
Line: DTM Protocol VCMUX, RFC2364, PPPoA
VPI.VCI: 8.35
Bandwidth: 128 up / 640 down
User access: username+password (PPP CHAP/PAP)
Ip address: Dynamically assigned

I stayed on the phone with them for 1 hour. In fact, they were able to see that I am transmitting ATM cells, but those cells contain invalid data, so they drop them. Is this the reason why I never see any Rx packets?

So where is the problem? Is the kernel or the ppp plugin sending wrong information? If the problem were the card, it would be strange to see the same problem on both cards (different drivers, different architectures, different chipsets, etc.). So is my machine really sending wrong ATM cells?

Debugging of the "Bewan" driver showed this:

Mar 5 11:06:53 ax-dummy kernel: unicorn_atmdrv.c : unicorn_atm_open:
Mar 5 11:06:53 ax-dummy kernel: unicorn_atm: ESI=00:9f:c8:f1:f7:58
Mar 5 11:06:53 ax-dummy kernel: unicorn_atm: upstream_rate=639 Kbits/s,downstream_rate=6143 Kbits/s
Mar 5 11:06:53 ax-dummy kernel: unicorn_atmdrv.c : get_link_rate: link_rate=1507 cells/sec
Mar 5 11:06:53 ax-dummy kernel: unicorn_atmdrv.c : aal5_decode: skb to short,skb->len=48,pdu_length=27264
Mar 5 11:06:53 ax-dummy kernel: unicorn_atmdrv.c : rcv_poll: wrong VPI.VCI 15.16
Mar 5 11:06:53 ax-dummy kernel: unicorn_atmdrv.c : rcv_poll: wrong VPI.VCI 15.16
Mar 5 11:06:53 ax-dummy kernel: unicorn_atmdrv.c : aal5_decode: skb to short,skb->len=48,pdu_length=27264

I am also waiting for an answer from the support of both ADSL cards. I hope you can give me some indication of why I'm sending wrong cells and how to check where the real problem is.

Oops, I forgot to mention that I tested with both 2.4.20 and 2.4.18 kernels.

Thank you in advance,
Antonio Gallo
www.badpenguin.org

p.s.
I'm really lost :-(

From kazunori@miyazawa.org Wed Mar 5 06:30:15 2003
Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 06:30:23 -0800 (PST)
Received: from miyazawa.org (usen-43x235x12x234.ap-USEN.usen.ad.jp [43.235.12.234]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h25EUEf17528 for ; Wed, 5 Mar 2003 06:30:14 -0800
Received: from monza.miyazawa.org ([2001:200:0:ff18:220:e0ff:fe8a:e797]) (AUTH: LOGIN kazunori, ) by miyazawa.org with esmtp; Wed, 05 Mar 2003 23:12:12 +0900
Date: Wed, 5 Mar 2003 23:30:25 +0900
From: Kazunori Miyazawa
To: davem@redhat.com, kuznet@ms2.inr.ac.ru
Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com, usagi-core@linux-ipv6.org
Subject: [PATH] IPv6 IPsec support
Message-Id: <20030305233025.784feb00.kazunori@miyazawa.org>
X-Mailer: Sylpheed version 0.8.10 (GTK+ 1.2.10; i386-debian-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-archive-position: 1862
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: kazunori@miyazawa.org
Precedence: bulk
X-list: netdev
Content-Length: 120844
Lines: 4451

Hello,

I am submitting the patch to let the kernel support IPv6 IPsec again. It is now possible to compile ipv6 as a module. This patch includes a couple of clean-ups and some function renames.

Sorry, this patch is for linux-2.5.63.

Best Regards,

--Kazunori Miyazawa (Yokogawa Electric Corporation)

Patch-Name: IPsec
Patch-Id: IPSEC_2_5_63_ALL-20030304
Patch-Author: Kazunori Miyazawa
Credit: Kazunori Miyazawa, Mitsuru Kanda, YOSHIFUJI Hideaki, Kunihiro Ishiguro

This patch makes the kernel process IPv6 packets with IPsec.

- We've introduced a function pointer (xfrm_dst_lookup) for looking up the routing table of each address family, so that ipv6 can be compiled as a module.
- We moved some functions common among protocols, such as skb_icv_walk(), from net/ipv4/{ah.c,esp.c} to net/ipv4/xfrm_algo.c. This allows ah / esp and ah6 / esp6 to be compiled as modules.
- We renamed some IPv4 specific xfrm_XXX() functions to xfrm4_XXX().

diff -ruN -x CVS linux-2.5.63/include/linux/ipv6.h linux25/include/linux/ipv6.h
--- linux-2.5.63/include/linux/ipv6.h	2003-02-25 04:05:38.000000000 +0900
+++ linux25/include/linux/ipv6.h	2003-03-05 11:30:34.000000000 +0900
@@ -74,6 +74,21 @@
 #define rt0_type		rt_hdr.type;
 };
 
+struct ipv6_auth_hdr {
+	__u8  nexthdr;
+	__u8  hdrlen;           /* This one is measured in 32 bit units! */
+	__u16 reserved;
+	__u32 spi;
+	__u32 seq_no;           /* Sequence number */
+	__u8  auth_data[4];     /* Length variable but >=4. Mind the 64 bit alignment! */
+};
+
+struct ipv6_esp_hdr {
+	__u32 spi;
+	__u32 seq_no;           /* Sequence number */
+	__u8  enc_data[8];      /* Length variable but >=8. Mind the 64 bit alignment! */
+};
+
 /*
  *	IPv6 fixed header
  *
diff -ruN -x CVS linux-2.5.63/include/net/dst.h linux25/include/net/dst.h
--- linux-2.5.63/include/net/dst.h	2003-02-25 04:05:44.000000000 +0900
+++ linux25/include/net/dst.h	2003-03-05 17:49:50.000000000 +0900
@@ -247,7 +247,10 @@
 struct flowi;
 extern int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl,
 		       struct sock *sk, int flags);
+extern int xfrm6_lookup(struct dst_entry **dst_p, struct flowi *fl,
+			struct sock *sk, int flags);
 extern void xfrm_init(void);
+extern void xfrm6_init(void);
 
 #endif
 
diff -ruN -x CVS linux-2.5.63/include/net/ip6_route.h linux25/include/net/ip6_route.h
--- linux-2.5.63/include/net/ip6_route.h	2003-02-25 04:05:12.000000000 +0900
+++ linux25/include/net/ip6_route.h	2003-03-04 20:38:14.000000000 +0900
@@ -38,6 +38,7 @@
 extern int			ipv6_route_ioctl(unsigned int cmd, void *arg);
 
 extern int			ip6_route_add(struct in6_rtmsg *rtmsg);
+extern int			ip6_route_del(struct in6_rtmsg *rtmsg);
 extern int			ip6_del_rt(struct rt6_info *);
 
 extern int			ip6_rt_addr_add(struct in6_addr *addr,
@@ -57,6 +58,8 @@
 						  struct in6_addr *saddr,
 						  int oif, int flags);
 
+extern struct rt6_info *ndisc_get_dummy_rt(void);
+
 /*
  *	support functions for ND
  *
diff -ruN -x CVS linux-2.5.63/include/net/xfrm.h
linux25/include/net/xfrm.h --- linux-2.5.63/include/net/xfrm.h 2003-02-25 04:05:41.000000000 +0900 +++ linux25/include/net/xfrm.h 2003-03-05 17:49:51.000000000 +0900 @@ -12,6 +12,7 @@ #include #include +#include #define XFRM_ALIGN8(len) (((len) + 7) & ~7) @@ -282,6 +283,7 @@ struct xfrm_dst *next; struct dst_entry dst; struct rtable rt; + struct rt6_info rt6; } u; }; @@ -308,26 +310,42 @@ if (sp && atomic_dec_and_test(&sp->refcnt)) __secpath_destroy(sp); } - -extern int __xfrm_policy_check(struct sock *, int dir, struct sk_buff *skb); +extern int __xfrm_policy_check(struct sock *, int dir, struct sk_buff *skb, unsigned short family); static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb) { if (sk && sk->policy[XFRM_POLICY_IN]) - return __xfrm_policy_check(sk, dir, skb); + return __xfrm_policy_check(sk, dir, skb, AF_INET); return !xfrm_policy_list[dir] || (skb->dst->flags & DST_NOPOLICY) || - __xfrm_policy_check(sk, dir, skb); + __xfrm_policy_check(sk, dir, skb, AF_INET); } -extern int __xfrm_route_forward(struct sk_buff *skb); +static inline int xfrm6_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + if (sk && sk->policy[XFRM_POLICY_IN]) + return __xfrm_policy_check(sk, dir, skb, AF_INET6); + + return !xfrm_policy_list[dir] || + (skb->dst->flags & DST_NOPOLICY) || + __xfrm_policy_check(sk, dir, skb, AF_INET6); +} + +extern int __xfrm_route_forward(struct sk_buff *skb, unsigned short family); static inline int xfrm_route_forward(struct sk_buff *skb) { return !xfrm_policy_list[XFRM_POLICY_OUT] || (skb->dst->flags & DST_NOXFRM) || - __xfrm_route_forward(skb); + __xfrm_route_forward(skb, AF_INET); +} + +static inline int xfrm6_route_forward(struct sk_buff *skb) +{ + return !xfrm_policy_list[XFRM_POLICY_OUT] || + (skb->dst->flags & DST_NOXFRM) || + __xfrm_route_forward(skb, AF_INET6); } extern int __xfrm_sk_clone_policy(struct sock *sk); @@ -380,12 +398,16 @@ extern void xfrm_input_init(void); extern int xfrm_state_walk(u8 
proto, int (*func)(struct xfrm_state *, int, void*), void *); extern struct xfrm_state *xfrm_state_alloc(void); -extern struct xfrm_state *xfrm_state_find(u32 daddr, u32 saddr, struct flowi *fl, struct xfrm_tmpl *tmpl, - struct xfrm_policy *pol, int *err); +extern struct xfrm_state *xfrm4_state_find(u32 daddr, u32 saddr, struct flowi *fl, struct xfrm_tmpl *tmpl, + struct xfrm_policy *pol, int *err); +extern struct xfrm_state *xfrm6_state_find(struct in6_addr *daddr, struct in6_addr *saddr, + struct flowi *fl, struct xfrm_tmpl *tmpl, + struct xfrm_policy *pol, int *err); extern int xfrm_state_check_expire(struct xfrm_state *x); extern void xfrm_state_insert(struct xfrm_state *x); extern int xfrm_state_check_space(struct xfrm_state *x, struct sk_buff *skb); -extern struct xfrm_state *xfrm_state_lookup(u32 daddr, u32 spi, u8 proto); +extern struct xfrm_state *xfrm4_state_lookup(u32 daddr, u32 spi, u8 proto); +extern struct xfrm_state *xfrm6_state_lookup(struct in6_addr *daddr, u32 spi, u8 proto); extern struct xfrm_state *xfrm_find_acq_byseq(u32 seq); extern void xfrm_state_delete(struct xfrm_state *x); extern void xfrm_state_flush(u8 proto); @@ -393,17 +415,21 @@ extern void xfrm_replay_advance(struct xfrm_state *x, u32 seq); extern int xfrm_check_selectors(struct xfrm_state **x, int n, struct flowi *fl); extern int xfrm4_rcv(struct sk_buff *skb); +extern int xfrm6_rcv(struct sk_buff *skb); +extern int xfrm6_clear_mutable_options(struct sk_buff *skb, u16 *nh_offset, int dir); extern int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen); struct xfrm_policy *xfrm_policy_alloc(int gfp); extern int xfrm_policy_walk(int (*func)(struct xfrm_policy *, int, int, void*), void *); -struct xfrm_policy *xfrm_policy_lookup(int dir, struct flowi *fl); +struct xfrm_policy *xfrm_policy_lookup(int dir, struct flowi *fl, unsigned short family); int xfrm_policy_insert(int dir, struct xfrm_policy *policy, int excl); struct xfrm_policy *xfrm_policy_delete(int dir, 
struct xfrm_selector *sel); struct xfrm_policy *xfrm_policy_byid(int dir, u32 id, int delete); void xfrm_policy_flush(void); void xfrm_alloc_spi(struct xfrm_state *x, u32 minspi, u32 maxspi); struct xfrm_state * xfrm_find_acq(u8 mode, u16 reqid, u8 proto, u32 daddr, u32 saddr, int create); +struct xfrm_state * xfrm6_find_acq(u8 mode, u16 reqid, u8 proto, struct in6_addr *daddr, + struct in6_addr *saddr, int create); extern void xfrm_policy_flush(void); extern void xfrm_policy_kill(struct xfrm_policy *); extern int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol); @@ -425,23 +451,129 @@ extern struct xfrm_algo_desc *xfrm_aalg_get_byname(char *name); extern struct xfrm_algo_desc *xfrm_ealg_get_byname(char *name); +static __inline__ int addr_match(void *token1, void *token2, int prefixlen) +{ + __u32 *a1 = token1; + __u32 *a2 = token2; + int pdw; + int pbi; + + pdw = prefixlen >> 5; /* num of whole __u32 in prefix */ + pbi = prefixlen & 0x1f; /* num of bits in incomplete u32 in prefix */ + + if (pdw) + if (memcmp(a1, a2, pdw << 2)) + return 0; + + if (pbi) { + __u32 mask; + + mask = htonl((0xffffffff) << (32 - pbi)); + + if ((a1[pdw] ^ a2[pdw]) & mask) + return 0; + } + + return 1; +} + static inline int xfrm6_selector_match(struct xfrm_selector *sel, struct flowi *fl) { - return !memcmp(fl->fl6_dst, sel->daddr.a6, sizeof(struct in6_addr)) && - !((fl->uli_u.ports.dport^sel->dport)&sel->dport_mask) && - !((fl->uli_u.ports.sport^sel->sport)&sel->sport_mask) && - (fl->proto == sel->proto || !sel->proto) && - (fl->oif == sel->ifindex || !sel->ifindex) && - !memcmp(fl->fl6_src, sel->saddr.a6, sizeof(struct in6_addr)); + return addr_match(fl->fl6_dst, &sel->daddr, sel->prefixlen_d) && + addr_match(fl->fl6_src, &sel->saddr, sel->prefixlen_s) && + !((fl->uli_u.ports.dport^sel->dport)&sel->dport_mask) && + !((fl->uli_u.ports.sport^sel->sport)&sel->sport_mask) && + (fl->proto == sel->proto || !sel->proto) && + (fl->oif == sel->ifindex || !sel->ifindex); 
} extern int xfrm6_register_type(struct xfrm_type *type); extern int xfrm6_unregister_type(struct xfrm_type *type); extern struct xfrm_type *xfrm6_get_type(u8 proto); -extern struct xfrm_state *xfrm6_state_lookup(struct in6_addr *daddr, u32 spi, u8 proto); -struct xfrm_state * xfrm6_find_acq(u8 mode, u16 reqid, u8 proto, struct in6_addr *daddr, struct in6_addr *saddr, int create); -void xfrm6_alloc_spi(struct xfrm_state *x, u32 minspi, u32 maxspi); +struct ah_data +{ + u8 *key; + int key_len; + u8 *work_icv; + int icv_full_len; + int icv_trunc_len; + + void (*icv)(struct ah_data*, + struct sk_buff *skb, u8 *icv); + + struct crypto_tfm *tfm; +}; + +struct esp_data +{ + /* Confidentiality */ + struct { + u8 *key; /* Key */ + int key_len; /* Key length */ + u8 *ivec; /* ivec buffer */ + /* ivlen is offset from enc_data, where encrypted data start. + * It is logically different of crypto_tfm_alg_ivsize(tfm). + * We assume that it is either zero (no ivec), or + * >= crypto_tfm_alg_ivsize(tfm). */ + int ivlen; + int padlen; /* 0..255 */ + struct crypto_tfm *tfm; /* crypto handle */ + } conf; + + /* Integrity. 
It is active when icv_full_len != 0 */ + struct { + u8 *key; /* Key */ + int key_len; /* Length of the key */ + u8 *work_icv; + int icv_full_len; + int icv_trunc_len; + void (*icv)(struct esp_data*, + struct sk_buff *skb, + int offset, int len, u8 *icv); + struct crypto_tfm *tfm; + } auth; +}; + +typedef void (icv_update_fn_t)(struct crypto_tfm *, struct scatterlist *, unsigned int); +extern void skb_ah_walk(const struct sk_buff *skb, + struct crypto_tfm *tfm, icv_update_fn_t icv_update); +extern void skb_icv_walk(const struct sk_buff *skb, struct crypto_tfm *tfm, + int offset, int len, icv_update_fn_t icv_update); +extern int skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len); +extern int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer); +extern void *pskb_put(struct sk_buff *skb, struct sk_buff *tail, int len); + +static inline void +ah_hmac_digest(struct ah_data *ahp, struct sk_buff *skb, u8 *auth_data) +{ + struct crypto_tfm *tfm = ahp->tfm; + + memset(auth_data, 0, ahp->icv_trunc_len); + crypto_hmac_init(tfm, ahp->key, &ahp->key_len); + skb_ah_walk(skb, tfm, crypto_hmac_update); + crypto_hmac_final(tfm, ahp->key, &ahp->key_len, ahp->work_icv); + memcpy(auth_data, ahp->work_icv, ahp->icv_trunc_len); +} + +static inline void +esp_hmac_digest(struct esp_data *esp, struct sk_buff *skb, int offset, + int len, u8 *auth_data) +{ + struct crypto_tfm *tfm = esp->auth.tfm; + char *icv = esp->auth.work_icv; + + memset(auth_data, 0, esp->auth.icv_trunc_len); + crypto_hmac_init(tfm, esp->auth.key, &esp->auth.key_len); + skb_icv_walk(skb, tfm, offset, len, crypto_hmac_update); + crypto_hmac_final(tfm, esp->auth.key, &esp->auth.key_len, icv); + memcpy(auth_data, icv, esp->auth.icv_trunc_len); +} + + +typedef int (xfrm_dst_lookup_t)(struct xfrm_dst **dst, struct flowi *fl); +int xfrm_dst_lookup_register(xfrm_dst_lookup_t *dst_lookup, unsigned short family); +void xfrm_dst_lookup_unregister(unsigned short family); #endif /* 
_NET_XFRM_H */ diff -ruN -x CVS linux-2.5.63/net/ipv4/ah.c linux25/net/ipv4/ah.c --- linux-2.5.63/net/ipv4/ah.c 2003-02-25 04:05:42.000000000 +0900 +++ linux25/net/ipv4/ah.c 2003-03-05 17:49:52.000000000 +0900 @@ -7,25 +7,8 @@ #include #include -#define AH_HLEN_NOICV 12 - -typedef void (icv_update_fn_t)(struct crypto_tfm *, - struct scatterlist *, unsigned int); - -struct ah_data -{ - u8 *key; - int key_len; - u8 *work_icv; - int icv_full_len; - int icv_trunc_len; - - void (*icv)(struct ah_data*, - struct sk_buff *skb, u8 *icv); - - struct crypto_tfm *tfm; -}; +#define AH_HLEN_NOICV 12 /* Clear mutable options and find final destination to substitute * into IP header for icv calculation. Options are already checked @@ -71,92 +54,6 @@ return 0; } -static void skb_ah_walk(const struct sk_buff *skb, - struct crypto_tfm *tfm, icv_update_fn_t icv_update) -{ - int offset = 0; - int len = skb->len; - int start = skb->len - skb->data_len; - int i, copy = start - offset; - struct scatterlist sg; - - /* Checksum header. 
*/ - if (copy > 0) { - if (copy > len) - copy = len; - - sg.page = virt_to_page(skb->data + offset); - sg.offset = (unsigned long)(skb->data + offset) % PAGE_SIZE; - sg.length = copy; - - icv_update(tfm, &sg, 1); - - if ((len -= copy) == 0) - return; - offset += copy; - } - - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { - int end; - - BUG_TRAP(start <= offset + len); - - end = start + skb_shinfo(skb)->frags[i].size; - if ((copy = end - offset) > 0) { - skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; - - if (copy > len) - copy = len; - - sg.page = frag->page; - sg.offset = frag->page_offset + offset-start; - sg.length = copy; - - icv_update(tfm, &sg, 1); - - if (!(len -= copy)) - return; - offset += copy; - } - start = end; - } - - if (skb_shinfo(skb)->frag_list) { - struct sk_buff *list = skb_shinfo(skb)->frag_list; - - for (; list; list = list->next) { - int end; - - BUG_TRAP(start <= offset + len); - - end = start + list->len; - if ((copy = end - offset) > 0) { - if (copy > len) - copy = len; - skb_ah_walk(list, tfm, icv_update); - if ((len -= copy) == 0) - return; - offset += copy; - } - start = end; - } - } - if (len) - BUG(); -} - -static void -ah_hmac_digest(struct ah_data *ahp, struct sk_buff *skb, u8 *auth_data) -{ - struct crypto_tfm *tfm = ahp->tfm; - - memset(auth_data, 0, ahp->icv_trunc_len); - crypto_hmac_init(tfm, ahp->key, &ahp->key_len); - skb_ah_walk(skb, tfm, crypto_hmac_update); - crypto_hmac_final(tfm, ahp->key, &ahp->key_len, ahp->work_icv); - memcpy(auth_data, ahp->work_icv, ahp->icv_trunc_len); -} - static int ah_output(struct sk_buff *skb) { int err; @@ -330,7 +227,7 @@ skb->h.icmph->code != ICMP_FRAG_NEEDED) return; - x = xfrm_state_lookup(iph->daddr, ah->spi, IPPROTO_AH); + x = xfrm4_state_lookup(iph->daddr, ah->spi, IPPROTO_AH); if (!x) return; printk(KERN_DEBUG "pmtu discvovery on SA AH/%08x/%08x\n", diff -ruN -x CVS linux-2.5.63/net/ipv4/esp.c linux25/net/ipv4/esp.c --- linux-2.5.63/net/ipv4/esp.c 2003-02-25 04:05:34.000000000 +0900 
+++ linux25/net/ipv4/esp.c 2003-03-05 17:49:52.000000000 +0900 @@ -8,312 +8,8 @@ #include #include -#define MAX_SG_ONSTACK 4 - -typedef void (icv_update_fn_t)(struct crypto_tfm *, - struct scatterlist *, unsigned int); - -/* BUGS: - * - we assume replay seqno is always present. - */ - -struct esp_data -{ - /* Confidentiality */ - struct { - u8 *key; /* Key */ - int key_len; /* Key length */ - u8 *ivec; /* ivec buffer */ - /* ivlen is offset from enc_data, where encrypted data start. - * It is logically different of crypto_tfm_alg_ivsize(tfm). - * We assume that it is either zero (no ivec), or - * >= crypto_tfm_alg_ivsize(tfm). */ - int ivlen; - int padlen; /* 0..255 */ - struct crypto_tfm *tfm; /* crypto handle */ - } conf; - - /* Integrity. It is active when icv_full_len != 0 */ - struct { - u8 *key; /* Key */ - int key_len; /* Length of the key */ - u8 *work_icv; - int icv_full_len; - int icv_trunc_len; - void (*icv)(struct esp_data*, - struct sk_buff *skb, - int offset, int len, u8 *icv); - struct crypto_tfm *tfm; - } auth; -}; - -/* Move to common area: it is shared with AH. */ - -void skb_icv_walk(const struct sk_buff *skb, struct crypto_tfm *tfm, - int offset, int len, icv_update_fn_t icv_update) -{ - int start = skb->len - skb->data_len; - int i, copy = start - offset; - struct scatterlist sg; - - /* Checksum header. 
*/ - if (copy > 0) { - if (copy > len) - copy = len; - - sg.page = virt_to_page(skb->data + offset); - sg.offset = (unsigned long)(skb->data + offset) % PAGE_SIZE; - sg.length = copy; - - icv_update(tfm, &sg, 1); - - if ((len -= copy) == 0) - return; - offset += copy; - } - - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { - int end; - - BUG_TRAP(start <= offset + len); - - end = start + skb_shinfo(skb)->frags[i].size; - if ((copy = end - offset) > 0) { - skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; - - if (copy > len) - copy = len; - - sg.page = frag->page; - sg.offset = frag->page_offset + offset-start; - sg.length = copy; - - icv_update(tfm, &sg, 1); - - if (!(len -= copy)) - return; - offset += copy; - } - start = end; - } - - if (skb_shinfo(skb)->frag_list) { - struct sk_buff *list = skb_shinfo(skb)->frag_list; - - for (; list; list = list->next) { - int end; - - BUG_TRAP(start <= offset + len); - - end = start + list->len; - if ((copy = end - offset) > 0) { - if (copy > len) - copy = len; - skb_icv_walk(list, tfm, offset-start, copy, icv_update); - if ((len -= copy) == 0) - return; - offset += copy; - } - start = end; - } - } - if (len) - BUG(); -} - - -/* Looking generic it is not used in another places. 
*/ - -int -skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len) -{ - int start = skb->len - skb->data_len; - int i, copy = start - offset; - int elt = 0; - - if (copy > 0) { - if (copy > len) - copy = len; - sg[elt].page = virt_to_page(skb->data + offset); - sg[elt].offset = (unsigned long)(skb->data + offset) % PAGE_SIZE; - sg[elt].length = copy; - elt++; - if ((len -= copy) == 0) - return elt; - offset += copy; - } - - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { - int end; - - BUG_TRAP(start <= offset + len); - - end = start + skb_shinfo(skb)->frags[i].size; - if ((copy = end - offset) > 0) { - skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; - - if (copy > len) - copy = len; - sg[elt].page = frag->page; - sg[elt].offset = frag->page_offset+offset-start; - sg[elt].length = copy; - elt++; - if (!(len -= copy)) - return elt; - offset += copy; - } - start = end; - } - - if (skb_shinfo(skb)->frag_list) { - struct sk_buff *list = skb_shinfo(skb)->frag_list; - - for (; list; list = list->next) { - int end; - - BUG_TRAP(start <= offset + len); - - end = start + list->len; - if ((copy = end - offset) > 0) { - if (copy > len) - copy = len; - elt += skb_to_sgvec(list, sg+elt, offset - start, copy); - if ((len -= copy) == 0) - return elt; - offset += copy; - } - start = end; - } - } - if (len) - BUG(); - return elt; -} - -/* Common with AH after some work on arguments. */ - -static void -esp_hmac_digest(struct esp_data *esp, struct sk_buff *skb, int offset, - int len, u8 *auth_data) -{ - struct crypto_tfm *tfm = esp->auth.tfm; - char *icv = esp->auth.work_icv; - - memset(auth_data, 0, esp->auth.icv_trunc_len); - crypto_hmac_init(tfm, esp->auth.key, &esp->auth.key_len); - skb_icv_walk(skb, tfm, offset, len, crypto_hmac_update); - crypto_hmac_final(tfm, esp->auth.key, &esp->auth.key_len, icv); - memcpy(auth_data, icv, esp->auth.icv_trunc_len); -} - -/* Check that skb data bits are writable. 
If they are not, copy data - * to newly created private area. If "tailbits" is given, make sure that - * tailbits bytes beyond current end of skb are writable. - * - * Returns amount of elements of scatterlist to load for subsequent - * transformations and pointer to writable trailer skb. - */ - -int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer) -{ - int copyflag; - int elt; - struct sk_buff *skb1, **skb_p; - - /* If skb is cloned or its head is paged, reallocate - * head pulling out all the pages (pages are considered not writable - * at the moment even if they are anonymous). - */ - if ((skb_cloned(skb) || skb_shinfo(skb)->nr_frags) && - __pskb_pull_tail(skb, skb_pagelen(skb)-skb_headlen(skb)) == NULL) - return -ENOMEM; - - /* Easy case. Most of packets will go this way. */ - if (!skb_shinfo(skb)->frag_list) { - /* A little of trouble, not enough of space for trailer. - * This should not happen, when stack is tuned to generate - * good frames. OK, on miss we reallocate and reserve even more - * space, 128 bytes is fair. */ - - if (skb_tailroom(skb) < tailbits && - pskb_expand_head(skb, 0, tailbits-skb_tailroom(skb)+128, GFP_ATOMIC)) - return -ENOMEM; - - /* Voila! */ - *trailer = skb; - return 1; - } - - /* Misery. We are in troubles, going to mincer fragments... */ - elt = 1; - skb_p = &skb_shinfo(skb)->frag_list; - copyflag = 0; - - while ((skb1 = *skb_p) != NULL) { - int ntail = 0; - - /* The fragment is partially pulled by someone, - * this can happen on input. Copy it and everything - * after it. */ - - if (skb_shared(skb1)) - copyflag = 1; - - /* If the skb is the last, worry about trailer. 
*/ - - if (skb1->next == NULL && tailbits) { - if (skb_shinfo(skb1)->nr_frags || - skb_shinfo(skb1)->frag_list || - skb_tailroom(skb1) < tailbits) - ntail = tailbits + 128; - } - - if (copyflag || - skb_cloned(skb1) || - ntail || - skb_shinfo(skb1)->nr_frags || - skb_shinfo(skb1)->frag_list) { - struct sk_buff *skb2; - - /* Fuck, we are miserable poor guys... */ - if (ntail == 0) - skb2 = skb_copy(skb1, GFP_ATOMIC); - else - skb2 = skb_copy_expand(skb1, - skb_headroom(skb1), - ntail, - GFP_ATOMIC); - if (unlikely(skb2 == NULL)) - return -ENOMEM; - - if (skb1->sk) - skb_set_owner_w(skb, skb1->sk); - - /* Looking around. Are we still alive? - * OK, link new skb, drop old one */ - - skb2->next = skb1->next; - *skb_p = skb2; - kfree_skb(skb1); - skb1 = skb2; - } - elt++; - *trailer = skb1; - skb_p = &skb1->next; - } - - return elt; -} - -void *pskb_put(struct sk_buff *skb, struct sk_buff *tail, int len) -{ - if (tail != skb) { - skb->data_len += len; - skb->len += len; - } - return skb_put(tail, len); -} +#define MAX_SG_ONSTACK 4 int esp_output(struct sk_buff *skb) { @@ -575,7 +271,7 @@ skb->h.icmph->code != ICMP_FRAG_NEEDED) return; - x = xfrm_state_lookup(iph->daddr, esph->spi, IPPROTO_ESP); + x = xfrm4_state_lookup(iph->daddr, esph->spi, IPPROTO_ESP); if (!x) return; printk(KERN_DEBUG "pmtu discvovery on SA ESP/%08x/%08x\n", diff -ruN -x CVS linux-2.5.63/net/ipv4/route.c linux25/net/ipv4/route.c --- linux-2.5.63/net/ipv4/route.c 2003-02-25 04:06:01.000000000 +0900 +++ linux25/net/ipv4/route.c 2003-03-04 20:38:15.000000000 +0900 @@ -96,6 +96,7 @@ #include #include #include +#include #ifdef CONFIG_SYSCTL #include #endif @@ -2599,6 +2600,13 @@ #endif /* CONFIG_PROC_FS */ #endif /* CONFIG_NET_CLS_ROUTE */ +int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl) +{ + int err = 0; + err = __ip_route_output_key((struct rtable**)dst, fl); + return err; +} + int __init ip_rt_init(void) { int i, order, goal, rc = 0; @@ -2680,6 +2688,7 @@ ip_rt_gc_interval; 
add_timer(&rt_periodic_timer); + xfrm_dst_lookup_register(xfrm_dst_lookup, AF_INET); #ifdef CONFIG_PROC_FS if (rt_cache_proc_init()) goto out_enomem; diff -ruN -x CVS linux-2.5.63/net/ipv4/xfrm_algo.c linux25/net/ipv4/xfrm_algo.c --- linux-2.5.63/net/ipv4/xfrm_algo.c 2003-02-25 04:05:16.000000000 +0900 +++ linux25/net/ipv4/xfrm_algo.c 2003-03-04 20:38:16.000000000 +0900 @@ -8,9 +8,11 @@ * Software Foundation; either version 2 of the License, or (at your option) * any later version. */ +#include #include #include #include +#include /* * Algorithms supported by IPsec. These entries contain properties which @@ -348,3 +350,333 @@ n++; return n; } + +#if defined(CONFIG_INET_AH) || defined(CONFIG_INET_AH_MODULE) || defined(CONFIG_INET6_AH) || defined(CONFIG_INET6_AH_MODULE) +void skb_ah_walk(const struct sk_buff *skb, + struct crypto_tfm *tfm, icv_update_fn_t icv_update) +{ + int offset = 0; + int len = skb->len; + int start = skb->len - skb->data_len; + int i, copy = start - offset; + struct scatterlist sg; + + /* Checksum header. 
*/ + if (copy > 0) { + if (copy > len) + copy = len; + + sg.page = virt_to_page(skb->data + offset); + sg.offset = (unsigned long)(skb->data + offset) % PAGE_SIZE; + sg.length = copy; + + icv_update(tfm, &sg, 1); + + if ((len -= copy) == 0) + return; + offset += copy; + } + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + int end; + + BUG_TRAP(start <= offset + len); + + end = start + skb_shinfo(skb)->frags[i].size; + if ((copy = end - offset) > 0) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + + if (copy > len) + copy = len; + + sg.page = frag->page; + sg.offset = frag->page_offset + offset-start; + sg.length = copy; + + icv_update(tfm, &sg, 1); + + if (!(len -= copy)) + return; + offset += copy; + } + start = end; + } + + if (skb_shinfo(skb)->frag_list) { + struct sk_buff *list = skb_shinfo(skb)->frag_list; + + for (; list; list = list->next) { + int end; + + BUG_TRAP(start <= offset + len); + + end = start + list->len; + if ((copy = end - offset) > 0) { + if (copy > len) + copy = len; + skb_ah_walk(list, tfm, icv_update); + if ((len -= copy) == 0) + return; + offset += copy; + } + start = end; + } + } + if (len) + BUG(); +} +#endif + +#if defined(CONFIG_INET_ESP) || defined(CONFIG_INET_ESP_MODULE) || defined(CONFIG_INET6_ESP) || defined(CONFIG_INET6_ESP_MODULE) +/* Move to common area: it is shared with AH. */ + +void skb_icv_walk(const struct sk_buff *skb, struct crypto_tfm *tfm, + int offset, int len, icv_update_fn_t icv_update) +{ + int start = skb->len - skb->data_len; + int i, copy = start - offset; + struct scatterlist sg; + + /* Checksum header. 
*/ + if (copy > 0) { + if (copy > len) + copy = len; + + sg.page = virt_to_page(skb->data + offset); + sg.offset = (unsigned long)(skb->data + offset) % PAGE_SIZE; + sg.length = copy; + + icv_update(tfm, &sg, 1); + + if ((len -= copy) == 0) + return; + offset += copy; + } + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + int end; + + BUG_TRAP(start <= offset + len); + + end = start + skb_shinfo(skb)->frags[i].size; + if ((copy = end - offset) > 0) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + + if (copy > len) + copy = len; + + sg.page = frag->page; + sg.offset = frag->page_offset + offset-start; + sg.length = copy; + + icv_update(tfm, &sg, 1); + + if (!(len -= copy)) + return; + offset += copy; + } + start = end; + } + + if (skb_shinfo(skb)->frag_list) { + struct sk_buff *list = skb_shinfo(skb)->frag_list; + + for (; list; list = list->next) { + int end; + + BUG_TRAP(start <= offset + len); + + end = start + list->len; + if ((copy = end - offset) > 0) { + if (copy > len) + copy = len; + skb_icv_walk(list, tfm, offset-start, copy, icv_update); + if ((len -= copy) == 0) + return; + offset += copy; + } + start = end; + } + } + if (len) + BUG(); +} + + +/* Looking generic it is not used in another places. 
*/ + +int +skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len) +{ + int start = skb->len - skb->data_len; + int i, copy = start - offset; + int elt = 0; + + if (copy > 0) { + if (copy > len) + copy = len; + sg[elt].page = virt_to_page(skb->data + offset); + sg[elt].offset = (unsigned long)(skb->data + offset) % PAGE_SIZE; + sg[elt].length = copy; + elt++; + if ((len -= copy) == 0) + return elt; + offset += copy; + } + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + int end; + + BUG_TRAP(start <= offset + len); + + end = start + skb_shinfo(skb)->frags[i].size; + if ((copy = end - offset) > 0) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + + if (copy > len) + copy = len; + sg[elt].page = frag->page; + sg[elt].offset = frag->page_offset+offset-start; + sg[elt].length = copy; + elt++; + if (!(len -= copy)) + return elt; + offset += copy; + } + start = end; + } + + if (skb_shinfo(skb)->frag_list) { + struct sk_buff *list = skb_shinfo(skb)->frag_list; + + for (; list; list = list->next) { + int end; + + BUG_TRAP(start <= offset + len); + + end = start + list->len; + if ((copy = end - offset) > 0) { + if (copy > len) + copy = len; + elt += skb_to_sgvec(list, sg+elt, offset - start, copy); + if ((len -= copy) == 0) + return elt; + offset += copy; + } + start = end; + } + } + if (len) + BUG(); + return elt; +} + +/* Check that skb data bits are writable. If they are not, copy data + * to newly created private area. If "tailbits" is given, make sure that + * tailbits bytes beyond current end of skb are writable. + * + * Returns amount of elements of scatterlist to load for subsequent + * transformations and pointer to writable trailer skb. 
+ */ + +int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer) +{ + int copyflag; + int elt; + struct sk_buff *skb1, **skb_p; + + /* If skb is cloned or its head is paged, reallocate + * head pulling out all the pages (pages are considered not writable + * at the moment even if they are anonymous). + */ + if ((skb_cloned(skb) || skb_shinfo(skb)->nr_frags) && + __pskb_pull_tail(skb, skb_pagelen(skb)-skb_headlen(skb)) == NULL) + return -ENOMEM; + + /* Easy case. Most of packets will go this way. */ + if (!skb_shinfo(skb)->frag_list) { + /* A little of trouble, not enough of space for trailer. + * This should not happen, when stack is tuned to generate + * good frames. OK, on miss we reallocate and reserve even more + * space, 128 bytes is fair. */ + + if (skb_tailroom(skb) < tailbits && + pskb_expand_head(skb, 0, tailbits-skb_tailroom(skb)+128, GFP_ATOMIC)) + return -ENOMEM; + + /* Voila! */ + *trailer = skb; + return 1; + } + + /* Misery. We are in troubles, going to mincer fragments... */ + + elt = 1; + skb_p = &skb_shinfo(skb)->frag_list; + copyflag = 0; + + while ((skb1 = *skb_p) != NULL) { + int ntail = 0; + + /* The fragment is partially pulled by someone, + * this can happen on input. Copy it and everything + * after it. */ + + if (skb_shared(skb1)) + copyflag = 1; + + /* If the skb is the last, worry about trailer. */ + + if (skb1->next == NULL && tailbits) { + if (skb_shinfo(skb1)->nr_frags || + skb_shinfo(skb1)->frag_list || + skb_tailroom(skb1) < tailbits) + ntail = tailbits + 128; + } + + if (copyflag || + skb_cloned(skb1) || + ntail || + skb_shinfo(skb1)->nr_frags || + skb_shinfo(skb1)->frag_list) { + struct sk_buff *skb2; + + /* Fuck, we are miserable poor guys... */ + if (ntail == 0) + skb2 = skb_copy(skb1, GFP_ATOMIC); + else + skb2 = skb_copy_expand(skb1, + skb_headroom(skb1), + ntail, + GFP_ATOMIC); + if (unlikely(skb2 == NULL)) + return -ENOMEM; + + if (skb1->sk) + skb_set_owner_w(skb, skb1->sk); + + /* Looking around. 
Are we still alive? + * OK, link new skb, drop old one */ + + skb2->next = skb1->next; + *skb_p = skb2; + kfree_skb(skb1); + skb1 = skb2; + } + elt++; + *trailer = skb1; + skb_p = &skb1->next; + } + + return elt; +} + +void *pskb_put(struct sk_buff *skb, struct sk_buff *tail, int len) +{ + if (tail != skb) { + skb->data_len += len; + skb->len += len; + } + return skb_put(tail, len); +} +#endif diff -ruN -x CVS linux-2.5.63/net/ipv4/xfrm_input.c linux25/net/ipv4/xfrm_input.c --- linux-2.5.63/net/ipv4/xfrm_input.c 2003-02-25 04:05:05.000000000 +0900 +++ linux25/net/ipv4/xfrm_input.c 2003-03-05 17:49:52.000000000 +0900 @@ -1,4 +1,14 @@ +/* Changes + * + * Mitsuru KANDA @USAGI : IPv6 Support + * Kazunori MIYAZAWA @USAGI : + * YOSHIFUJI Hideaki @USAGI : + * Kunihiro Ishiguro : + * + */ + #include +#include #include static kmem_cache_t *secpath_cachep; @@ -64,7 +74,7 @@ if (xfrm_nr == XFRM_MAX_DEPTH) goto drop; - x = xfrm_state_lookup(iph->daddr, spi, iph->protocol); + x = xfrm4_state_lookup(iph->daddr, spi, iph->protocol); if (x == NULL) goto drop; @@ -157,3 +167,288 @@ if (!secpath_cachep) panic("IP: failed to allocate secpath_cache\n"); } + +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + +/* Fetch spi and seq from ipsec header */ + +static int xfrm6_parse_spi(struct sk_buff *skb, u8 nexthdr, u32 *spi, u32 *seq) +{ + int offset, offset_seq; + + switch (nexthdr) { + case IPPROTO_AH: + offset = offsetof(struct ip_auth_hdr, spi); + offset_seq = offsetof(struct ip_auth_hdr, seq_no); + break; + case IPPROTO_ESP: + offset = offsetof(struct ip_esp_hdr, spi); + offset_seq = offsetof(struct ip_esp_hdr, seq_no); + break; + case IPPROTO_COMP: + if (!pskb_may_pull(skb, 4)) + return -EINVAL; + *spi = *(u16*)(skb->h.raw + 2); + *seq = 0; + return 0; + default: + return 1; + } + + if (!pskb_may_pull(skb, 16)) + return -EINVAL; + + *spi = *(u32*)(skb->h.raw + offset); + *seq = *(u32*)(skb->h.raw + offset_seq); + return 0; +} + +static int zero_out_mutable_opts(struct
ipv6_opt_hdr *opthdr) +{ + u8 *opt = (u8 *)opthdr; + int len = ipv6_optlen(opthdr); + int off = 0; + int optlen = 0; + + off += 2; + len -= 2; + + while (len > 0) { + + switch (opt[off]) { + + case IPV6_TLV_PAD0: + optlen = 1; + break; + default: + if (len < 2) + goto bad; + optlen = opt[off+1]+2; + if (len < optlen) + goto bad; + if (opt[off] & 0x20) + memset(&opt[off+2], 0, opt[off+1]); + break; + } + + off += optlen; + len -= optlen; + } + if (len == 0) + return 1; + +bad: + return 0; +} + +int xfrm6_clear_mutable_options(struct sk_buff *skb, u16 *nh_offset, int dir) +{ + u16 offset = sizeof(struct ipv6hdr); + struct ipv6_opt_hdr *exthdr = (struct ipv6_opt_hdr*)(skb->nh.raw + offset); + unsigned int packet_len = skb->tail - skb->nh.raw; + u8 nexthdr = skb->nh.ipv6h->nexthdr; + u8 nextnexthdr = 0; + + *nh_offset = ((unsigned char *)&skb->nh.ipv6h->nexthdr) - skb->nh.raw; + + while (offset + 1 <= packet_len) { + + switch (nexthdr) { + + case NEXTHDR_HOP: + *nh_offset = offset; + offset += ipv6_optlen(exthdr); + if (!zero_out_mutable_opts(exthdr)) { + if (net_ratelimit()) + printk(KERN_WARNING "overrun hopopts\n"); + return 0; + } + nexthdr = exthdr->nexthdr; + exthdr = (struct ipv6_opt_hdr*)(skb->nh.raw + offset); + break; + + case NEXTHDR_ROUTING: + *nh_offset = offset; + offset += ipv6_optlen(exthdr); + ((struct ipv6_rt_hdr*)exthdr)->segments_left = 0; + nexthdr = exthdr->nexthdr; + exthdr = (struct ipv6_opt_hdr*)(skb->nh.raw + offset); + break; + + case NEXTHDR_DEST: + *nh_offset = offset; + offset += ipv6_optlen(exthdr); + if (!zero_out_mutable_opts(exthdr)) { + if (net_ratelimit()) + printk(KERN_WARNING "overrun destopt\n"); + return 0; + } + nexthdr = exthdr->nexthdr; + exthdr = (struct ipv6_opt_hdr*)(skb->nh.raw + offset); + break; + + case NEXTHDR_AUTH: + if (dir == XFRM_POLICY_OUT) { + memset(((struct ipv6_auth_hdr*)exthdr)->auth_data, 0, + (((struct ipv6_auth_hdr*)exthdr)->hdrlen - 1) << 2); + } + if (exthdr->nexthdr == NEXTHDR_DEST) { + offset += 
(((struct ipv6_auth_hdr*)exthdr)->hdrlen + 2) << 2; + exthdr = (struct ipv6_opt_hdr*)(skb->nh.raw + offset); + nextnexthdr = exthdr->nexthdr; + if (!zero_out_mutable_opts(exthdr)) { + if (net_ratelimit()) + printk(KERN_WARNING "overrun destopt\n"); + return 0; + } + } + return nexthdr; + default : + return nexthdr; + } + } + + return nexthdr; +} + +int xfrm6_rcv(struct sk_buff *skb) +{ + int err; + u32 spi, seq; + struct xfrm_state *xfrm_vec[XFRM_MAX_DEPTH]; + struct xfrm_state *x; + int xfrm_nr = 0; + int decaps = 0; + struct ipv6hdr *hdr = skb->nh.ipv6h; + unsigned char *tmp_hdr = NULL; + int hdr_len = 0; + u16 nh_offset = 0; + u8 nexthdr = 0; + + if (hdr->nexthdr == IPPROTO_AH || hdr->nexthdr == IPPROTO_ESP) { + nh_offset = ((unsigned char*)&skb->nh.ipv6h->nexthdr) - skb->nh.raw; + hdr_len = sizeof(struct ipv6hdr); + } else { + hdr_len = skb->h.raw - skb->nh.raw; + } + + tmp_hdr = kmalloc(hdr_len, GFP_ATOMIC); + if (!tmp_hdr) + goto drop; + memcpy(tmp_hdr, skb->nh.raw, hdr_len); + + nexthdr = xfrm6_clear_mutable_options(skb, &nh_offset, XFRM_POLICY_IN); + hdr->priority = 0; + hdr->flow_lbl[0] = 0; + hdr->flow_lbl[1] = 0; + hdr->flow_lbl[2] = 0; + hdr->hop_limit = 0; + + if ((err = xfrm6_parse_spi(skb, nexthdr, &spi, &seq)) != 0) + goto drop; + + do { + struct ipv6hdr *iph = skb->nh.ipv6h; + + if (xfrm_nr == XFRM_MAX_DEPTH) + goto drop; + + x = xfrm6_state_lookup(&iph->daddr, spi, nexthdr); + if (x == NULL) + goto drop; + spin_lock(&x->lock); + if (unlikely(x->km.state != XFRM_STATE_VALID)) + goto drop_unlock; + + if (x->props.replay_window && xfrm_replay_check(x, seq)) + goto drop_unlock; + + nexthdr = x->type->input(x, skb); + if (nexthdr <= 0) + goto drop_unlock; + + if (x->props.replay_window) + xfrm_replay_advance(x, seq); + + x->curlft.bytes += skb->len; + x->curlft.packets++; + + spin_unlock(&x->lock); + + xfrm_vec[xfrm_nr++] = x; + + iph = skb->nh.ipv6h; /* ??? 
*/ + + if (nexthdr == NEXTHDR_DEST) { + if (!pskb_may_pull(skb, (skb->h.raw-skb->data)+8) || + !pskb_may_pull(skb, (skb->h.raw-skb->data)+((skb->h.raw[1]+1)<<3))) { + err = -EINVAL; + goto drop; + } + nexthdr = skb->h.raw[0]; + nh_offset = skb->h.raw - skb->nh.raw; + skb_pull(skb, (skb->h.raw[1]+1)<<3); + skb->h.raw = skb->data; + } + + if (x->props.mode) { /* XXX */ + if (iph->nexthdr != IPPROTO_IPV6) + goto drop; + skb->nh.raw = skb->data; + iph = skb->nh.ipv6h; + decaps = 1; + break; + } + + if ((err = xfrm6_parse_spi(skb, nexthdr, &spi, &seq)) < 0) + goto drop; + } while (!err); + + memcpy(skb->nh.raw, tmp_hdr, hdr_len); + skb->nh.raw[nh_offset] = nexthdr; + skb->nh.ipv6h->payload_len = htons(hdr_len + skb->len - sizeof(struct ipv6hdr)); + + /* Allocate new secpath or COW existing one. */ + if (!skb->sp || atomic_read(&skb->sp->refcnt) != 1) { + struct sec_path *sp; + sp = kmem_cache_alloc(secpath_cachep, SLAB_ATOMIC); + if (!sp) + goto drop; + if (skb->sp) { + memcpy(sp, skb->sp, sizeof(struct sec_path)); + secpath_put(skb->sp); + } else + sp->len = 0; + atomic_set(&sp->refcnt, 1); + skb->sp = sp; + } + + if (xfrm_nr + skb->sp->len > XFRM_MAX_DEPTH) + goto drop; + + memcpy(skb->sp->xvec+skb->sp->len, xfrm_vec, xfrm_nr*sizeof(void*)); + skb->sp->len += xfrm_nr; + + if (decaps) { + if (!(skb->dev->flags&IFF_LOOPBACK)) { + dst_release(skb->dst); + skb->dst = NULL; + } + netif_rx(skb); + return 0; + } else { + return -nexthdr; + } + +drop_unlock: + spin_unlock(&x->lock); + xfrm_state_put(x); +drop: + if (tmp_hdr) kfree(tmp_hdr); + while (--xfrm_nr >= 0) + xfrm_state_put(xfrm_vec[xfrm_nr]); + kfree_skb(skb); + return 0; +} + +#endif /* CONFIG_IPV6 || CONFIG_IPV6_MODULE */ diff -ruN -x CVS linux-2.5.63/net/ipv4/xfrm_policy.c linux25/net/ipv4/xfrm_policy.c --- linux-2.5.63/net/ipv4/xfrm_policy.c 2003-02-25 04:05:32.000000000 +0900 +++ linux25/net/ipv4/xfrm_policy.c 2003-03-05 17:49:52.000000000 +0900 @@ -1,6 +1,16 @@ +/* Changes + * + * Mitsuru KANDA @USAGI : IPv6 
Support + * Kazunori MIYAZAWA @USAGI : + * Kunihiro Ishiguro : + * + */ + #include #include #include +#include +#include DECLARE_MUTEX(xfrm_cfg_sem); @@ -10,6 +20,11 @@ struct xfrm_policy *xfrm_policy_list[XFRM_POLICY_MAX*2]; extern struct dst_ops xfrm4_dst_ops; +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) +extern struct dst_ops xfrm6_dst_ops; +#endif + +static inline int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); /* Limited flow cache. Its function now is to accelerate search for * policy rules. @@ -48,6 +63,24 @@ return hash & (FLOWCACHE_HASH_SIZE-1); } +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) +static inline u32 flow_hash6(struct flowi *fl) +{ + u32 hash = fl->fl6_src->s6_addr32[2] ^ + fl->fl6_src->s6_addr32[3] ^ + fl->uli_u.ports.sport; + + hash = ((hash & 0xF0F0F0F0) >> 4) | ((hash & 0x0F0F0F0F) << 4); + + hash ^= fl->fl6_dst->s6_addr32[2] ^ + fl->fl6_dst->s6_addr32[3] ^ + fl->uli_u.ports.dport; + hash ^= (hash >> 10); + hash ^= (hash >> 20); + return hash & (FLOWCACHE_HASH_SIZE-1); +} +#endif + static int flow_lwm = 2*FLOWCACHE_HASH_SIZE; static int flow_hwm = 4*FLOWCACHE_HASH_SIZE; @@ -77,13 +110,27 @@ } } -struct xfrm_policy *flow_lookup(int dir, struct flowi *fl) +struct xfrm_policy *flow_lookup(int dir, struct flowi *fl, + unsigned short family) { - struct xfrm_policy *pol; + struct xfrm_policy *pol = NULL; struct flow_entry *fle; - u32 hash = flow_hash(fl); + u32 hash; int cpu; + switch (family) { + case AF_INET: + hash = flow_hash(fl); + break; +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + case AF_INET6: + hash = flow_hash6(fl); + break; +#endif + default: + return NULL; + } + local_bh_disable(); cpu = smp_processor_id(); @@ -101,7 +148,7 @@ } } - pol = xfrm_policy_lookup(dir, fl); + pol = xfrm_policy_lookup(dir, fl, family); if (fle) { /* Stale flow entry found. Update it. 
*/ @@ -199,6 +246,46 @@ return type; } +static xfrm_dst_lookup_t *__xfrm_dst_lookup[AF_MAX]; +rwlock_t xdl_lock = RW_LOCK_UNLOCKED; + +int xfrm_dst_lookup_register(xfrm_dst_lookup_t *dst_lookup, + unsigned short family) +{ + int err = 0; + + write_lock(&xdl_lock); + if (__xfrm_dst_lookup[family]) + err = -ENOBUFS; + else { + __xfrm_dst_lookup[family] = dst_lookup; + } + write_unlock(&xdl_lock); + + return err; +} + +void xfrm_dst_lookup_unregister(unsigned short family) +{ + write_lock(&xdl_lock); + if (__xfrm_dst_lookup[family]) + __xfrm_dst_lookup[family] = 0; + write_unlock(&xdl_lock); +} + +static inline int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, + unsigned short family) +{ + int err = 0; + read_lock(&xdl_lock); + if (__xfrm_dst_lookup[family]) + err = __xfrm_dst_lookup[family](dst, fl); + else + err = -EINVAL; + read_unlock(&xdl_lock); + return err; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) static struct xfrm_type *xfrm6_type_map[256]; static rwlock_t xfrm6_type_lock = RW_LOCK_UNLOCKED; @@ -506,15 +593,32 @@ /* Find policy to apply to this flow. 
*/ -struct xfrm_policy *xfrm_policy_lookup(int dir, struct flowi *fl) +struct xfrm_policy *xfrm_policy_lookup(int dir, struct flowi *fl, + unsigned short family) { struct xfrm_policy *pol; read_lock_bh(&xfrm_policy_lock); for (pol = xfrm_policy_list[dir]; pol; pol = pol->next) { struct xfrm_selector *sel = &pol->selector; + int match; + + if (pol->family != family) + continue; - if (xfrm4_selector_match(sel, fl)) { + switch (family) { + case AF_INET: + match = xfrm4_selector_match(sel, fl); + break; +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + case AF_INET6: + match = xfrm6_selector_match(sel, fl); + break; +#endif + default: + match = 0; + } + if (match) { atomic_inc(&pol->refcnt); break; } @@ -529,7 +633,21 @@ read_lock_bh(&xfrm_policy_lock); if ((pol = sk->policy[dir]) != NULL) { - if (xfrm4_selector_match(&pol->selector, fl)) + int match; + + switch (sk->family) { + case AF_INET: + match = xfrm4_selector_match(&pol->selector, fl); + break; +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + case AF_INET6: + match = xfrm6_selector_match(&pol->selector, fl); + break; +#endif + default: + match = 0; + } + if (match) atomic_inc(&pol->refcnt); else pol = NULL; @@ -630,8 +748,8 @@ /* Resolve list of templates for the flow, given policy. */ static int -xfrm_tmpl_resolve(struct xfrm_policy *policy, struct flowi *fl, - struct xfrm_state **xfrm) +xfrm4_tmpl_resolve(struct xfrm_policy *policy, struct flowi *fl, + struct xfrm_state **xfrm) { int nx; int i, error; @@ -649,7 +767,53 @@ local = tmpl->saddr.xfrm4_addr; } - x = xfrm_state_find(remote, local, fl, tmpl, policy, &error); + x = xfrm4_state_find(remote, local, fl, tmpl, policy, &error); + + if (x && x->km.state == XFRM_STATE_VALID) { + xfrm[nx++] = x; + daddr = remote; + saddr = local; + continue; + } + if (x) { + error = (x->km.state == XFRM_STATE_ERROR ? 
+ -EINVAL : -EAGAIN); + xfrm_state_put(x); + } + + if (!tmpl->optional) + goto fail; + } + return nx; + +fail: + for (nx--; nx>=0; nx--) + xfrm_state_put(xfrm[nx]); + return error; +} + +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) +static int +xfrm6_tmpl_resolve(struct xfrm_policy *policy, struct flowi *fl, + struct xfrm_state **xfrm) +{ + int nx; + int i, error; + struct in6_addr *daddr = fl->fl6_dst; + struct in6_addr *saddr = fl->fl6_src; + + for (nx=0, i = 0; i < policy->xfrm_nr; i++) { + struct xfrm_state *x=NULL; + struct in6_addr *remote = daddr; + struct in6_addr *local = saddr; + struct xfrm_tmpl *tmpl = &policy->xfrm_vec[i]; + + if (tmpl->mode) { + remote = (struct in6_addr*)&tmpl->id.daddr; + local = (struct in6_addr*)&tmpl->saddr; + } + + x = xfrm6_state_find(remote, local, fl, tmpl, policy, &error); if (x && x->km.state == XFRM_STATE_VALID) { xfrm[nx++] = x; @@ -673,6 +837,7 @@ xfrm_state_put(xfrm[nx]); return error; } +#endif /* Check that the bundle accepts the flow and its components are * still valid. @@ -694,6 +859,24 @@ return 0; } +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +static int xfrm6_bundle_ok(struct xfrm_dst *xdst, struct flowi *fl) +{ + do { + if (xdst->u.dst.ops != &xfrm6_dst_ops) + return 1; + + if (!xfrm6_selector_match(&xdst->u.dst.xfrm->sel, fl)) + return 0; + if (xdst->u.dst.xfrm->km.state != XFRM_STATE_VALID || + xdst->u.dst.path->obsolete > 0) + return 0; + xdst = (struct xfrm_dst*)xdst->u.dst.child; + } while (xdst); + return 0; +} +#endif + /* Allocate chain of dst_entry's, attach known xfrm's, calculate * all the metrics... Shortly, bundle a bundle. 
@@ -744,7 +927,7 @@ .saddr = local } } }; - err = __ip_route_output_key(&rt, &fl_tunnel); + err = xfrm_dst_lookup((struct xfrm_dst**)&rt, &fl_tunnel, AF_INET); if (err) goto error; } else { @@ -791,6 +974,97 @@ return err; } +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +static int +xfrm6_bundle_create(struct xfrm_policy *policy, struct xfrm_state **xfrm, int nx, + struct flowi *fl, struct dst_entry **dst_p) +{ + struct dst_entry *dst, *dst_prev; + struct rt6_info *rt0 = (struct rt6_info*)(*dst_p); + struct rt6_info *rt = rt0; + struct in6_addr *remote = fl->fl6_dst; + struct in6_addr *local = fl->fl6_src; + int i; + int err = 0; + int header_len = 0; + + dst = dst_prev = NULL; + + for (i = 0; i < nx; i++) { + struct dst_entry *dst1 = dst_alloc(&xfrm6_dst_ops); + + if (unlikely(dst1 == NULL)) { + err = -ENOBUFS; + goto error; + } + + dst1->xfrm = xfrm[i]; + if (!dst) + dst = dst1; + else { + dst_prev->child = dst1; + dst1->flags |= DST_NOHASH; + dst_clone(dst1); + } + dst_prev = dst1; + if (xfrm[i]->props.mode) { + remote = (struct in6_addr*)&xfrm[i]->id.daddr; + local = (struct in6_addr*)&xfrm[i]->props.saddr; + } + header_len += xfrm[i]->props.header_len; + } + + if (ipv6_addr_cmp(remote, fl->fl6_dst)) { + struct flowi fl_tunnel = { .nl_u = { .ip6_u = + { .daddr = remote, + .saddr = local } + } + }; + err = xfrm_dst_lookup((struct xfrm_dst**)&dst, &fl_tunnel, AF_INET6); + if (err) + goto error; + } else { + dst_clone(&rt->u.dst); + } + dst_prev->child = &rt->u.dst; + for (dst_prev = dst; dst_prev != &rt->u.dst; dst_prev = dst_prev->child) { + struct xfrm_dst *x = (struct xfrm_dst*)dst_prev; + x->u.rt.fl = *fl; + + dst_prev->dev = rt->u.dst.dev; + if (rt->u.dst.dev) + dev_hold(rt->u.dst.dev); + dst_prev->obsolete = -1; + dst_prev->flags |= DST_HOST; + dst_prev->lastuse = jiffies; + dst_prev->header_len = header_len; + memcpy(&dst_prev->metrics, &rt->u.dst.metrics, sizeof(dst_prev->metrics)); + dst_prev->path = &rt->u.dst; + + /* Copy neighbour for
reachability confirmation */ + dst_prev->neighbour = neigh_clone(rt->u.dst.neighbour); + dst_prev->input = rt->u.dst.input; + dst_prev->output = dst_prev->xfrm->type->output; + /* Sheit... I remember I did this right. Apparently, + * it was magically lost, so this code needs audit */ + x->u.rt6.rt6i_flags = rt0->rt6i_flags&(RTCF_BROADCAST|RTCF_MULTICAST|RTCF_LOCAL); + x->u.rt6.rt6i_metric = rt0->rt6i_metric; + x->u.rt6.rt6i_node = rt0->rt6i_node; + x->u.rt6.rt6i_hoplimit = rt0->rt6i_hoplimit; + x->u.rt6.rt6i_gateway = rt0->rt6i_gateway; + memcpy(&x->u.rt6.rt6i_gateway, &rt0->rt6i_gateway, sizeof(x->u.rt6.rt6i_gateway)); + header_len -= x->u.dst.xfrm->props.header_len; + } + *dst_p = dst; + return 0; + +error: + if (dst) + dst_free(dst); + return err; +} +#endif + /* Main function: finds/creates a bundle for given flow. * * At the moment we eat a raw IP route. Mostly to speed up lookups @@ -806,9 +1080,7 @@ int nx = 0; int err; u32 genid; - - fl->oif = rt->u.dst.dev->ifindex; - fl->fl4_src = rt->rt_src; + u16 family = (*dst_p)->ops->family; restart: genid = xfrm_policy_genid; @@ -821,11 +1093,12 @@ if ((rt->u.dst.flags & DST_NOXFRM) || !xfrm_policy_list[XFRM_POLICY_OUT]) return 0; - policy = flow_lookup(XFRM_POLICY_OUT, fl); - if (!policy) - return 0; + policy = flow_lookup(XFRM_POLICY_OUT, fl, family); } + if (!policy) + return 0; + policy->curlft.use_time = (unsigned long)xtime.tv_sec; switch (policy->action) { @@ -846,23 +1119,48 @@ * LATER: help from flow cache. It is optional, this * is required only for output policy. 
*/ - read_lock_bh(&policy->lock); - for (dst = policy->bundles; dst; dst = dst->next) { - struct xfrm_dst *xdst = (struct xfrm_dst*)dst; - if (xdst->u.rt.fl.fl4_dst == fl->fl4_dst && - xdst->u.rt.fl.fl4_src == fl->fl4_src && - xdst->u.rt.fl.oif == fl->oif && - xfrm_bundle_ok(xdst, fl)) { - dst_clone(dst); + if (family == AF_INET) { + fl->oif = rt->u.dst.dev->ifindex; + fl->fl4_src = rt->rt_src; + read_lock_bh(&policy->lock); + for (dst = policy->bundles; dst; dst = dst->next) { + struct xfrm_dst *xdst = (struct xfrm_dst*)dst; + if (xdst->u.rt.fl.fl4_dst == fl->fl4_dst && + xdst->u.rt.fl.fl4_src == fl->fl4_src && + xdst->u.rt.fl.oif == fl->oif && + xfrm_bundle_ok(xdst, fl)) { + dst_clone(dst); + break; + } + } + read_unlock_bh(&policy->lock); + if (dst) break; + nx = xfrm4_tmpl_resolve(policy, fl, xfrm); +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + } else if (family == AF_INET6) { + read_lock_bh(&policy->lock); + for (dst = policy->bundles; dst; dst = dst->next) { + struct xfrm_dst *xdst = (struct xfrm_dst*)dst; + if (!ipv6_addr_cmp(&xdst->u.rt6.rt6i_dst.addr, fl->fl6_dst) && + !ipv6_addr_cmp(&xdst->u.rt6.rt6i_src.addr, fl->fl6_src) && + xfrm6_bundle_ok(xdst, fl)) { + dst_clone(dst); + break; + } } + read_unlock_bh(&policy->lock); + if (dst) + break; + nx = xfrm6_tmpl_resolve(policy, fl, xfrm); +#endif + } else { + return -EINVAL; } - read_unlock_bh(&policy->lock); if (dst) break; - nx = xfrm_tmpl_resolve(policy, fl, xfrm); if (unlikely(nx<0)) { err = nx; if (err == -EAGAIN) { @@ -873,7 +1171,18 @@ __set_task_state(tsk, TASK_INTERRUPTIBLE); add_wait_queue(&km_waitq, &wait); - err = xfrm_tmpl_resolve(policy, fl, xfrm); + switch (family) { + case AF_INET: + err = xfrm4_tmpl_resolve(policy, fl, xfrm); + break; +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + case AF_INET6: + err = xfrm6_tmpl_resolve(policy, fl, xfrm); + break; +#endif + default: + err = -EINVAL; + } if (err == -EAGAIN) schedule(); __set_task_state(tsk, TASK_RUNNING); @@ 
-896,7 +1205,19 @@ } dst = &rt->u.dst; - err = xfrm_bundle_create(policy, xfrm, nx, fl, &dst); + switch (family) { + case AF_INET: + err = xfrm_bundle_create(policy, xfrm, nx, fl, &dst); + break; +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + case AF_INET6: + err = xfrm6_bundle_create(policy, xfrm, nx, fl, &dst); + break; +#endif + default: + err = -EINVAL; + } + if (unlikely(err)) { int i; for (i=0; inh.iph; u8 *xprth = skb->nh.raw + iph->ihl*4; @@ -1008,18 +1329,109 @@ fl->fl4_src = iph->saddr; } -int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) +static inline int +xfrm6_state_ok(struct xfrm_tmpl *tmpl, struct xfrm_state *x) +{ + return x->id.proto == tmpl->id.proto && + (x->id.spi == tmpl->id.spi || !tmpl->id.spi) && + x->props.mode == tmpl->mode && + (tmpl->aalgos & (1<<x->props.aalgo)) && + (!x->props.mode || !ipv6_addr_any((struct in6_addr*)&x->props.saddr) || + !ipv6_addr_cmp((struct in6_addr *)&tmpl->saddr, (struct in6_addr*)&x->props.saddr)); +} + +static inline int +xfrm6_policy_ok(struct xfrm_tmpl *tmpl, struct sec_path *sp, int idx) +{ + for (; idx < sp->len; idx++) { + if (xfrm6_state_ok(tmpl, sp->xvec[idx])) + return ++idx; + } + return -1; +} + +static inline void +_decode_session6(struct sk_buff *skb, struct flowi *fl) +{ + u16 offset = sizeof(struct ipv6hdr); + struct ipv6hdr *hdr = skb->nh.ipv6h; + struct ipv6_opt_hdr *exthdr = (struct ipv6_opt_hdr*)(skb->nh.raw + offset); + u8 nexthdr = skb->nh.ipv6h->nexthdr; + + fl->fl6_dst = &hdr->daddr; + fl->fl6_src = &hdr->saddr; + + while (pskb_may_pull(skb, skb->nh.raw + offset + 1 - skb->data)) { + switch (nexthdr) { + case NEXTHDR_ROUTING: + case NEXTHDR_HOP: + case NEXTHDR_DEST: + offset += ipv6_optlen(exthdr); + nexthdr = exthdr->nexthdr; + exthdr = (struct ipv6_opt_hdr*)(skb->nh.raw + offset); + break; + + case IPPROTO_UDP: + case IPPROTO_TCP: + case IPPROTO_SCTP: + if (pskb_may_pull(skb, skb->nh.raw + offset +
4 - skb->data)) { + u16 *ports = (u16 *)exthdr; + + fl->uli_u.ports.sport = ports[0]; + fl->uli_u.ports.dport = ports[1]; + } + return; + + /* XXX Why are there these headers? */ + case IPPROTO_AH: + case IPPROTO_ESP: + default: + fl->uli_u.spi = 0; + return; + }; + } +} +#endif + +int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, + unsigned short family) { struct xfrm_policy *pol; struct flowi fl; - _decode_session(skb, &fl); + switch (family) { + case AF_INET: + _decode_session4(skb, &fl); + break; +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + case AF_INET6: + _decode_session6(skb, &fl); + break; +#endif + default : + return 0; + } /* First, check used SA against their selectors. */ if (skb->sp) { int i; + for (i=skb->sp->len-1; i>=0; i--) { - if (!xfrm4_selector_match(&skb->sp->xvec[i]->sel, &fl)) + int match; + switch (family) { + case AF_INET: + match = xfrm4_selector_match(&skb->sp->xvec[i]->sel, &fl); + break; +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + case AF_INET6: + match = xfrm6_selector_match(&skb->sp->xvec[i]->sel, &fl); + break; +#endif + default: + match = 0; + } + if (!match) return 0; } } @@ -1029,7 +1441,7 @@ pol = xfrm_sk_policy_lookup(sk, dir, &fl); if (!pol) - pol = flow_lookup(dir, &fl); + pol = flow_lookup(dir, &fl, family); if (!pol) return 1; @@ -1050,7 +1462,18 @@ * are implied between each two transformations. 
*/ for (i = pol->xfrm_nr-1, k = 0; i >= 0; i--) { - k = xfrm_policy_ok(pol->xfrm_vec+i, sp, k); + switch (family) { + case AF_INET: + k = xfrm_policy_ok(pol->xfrm_vec+i, sp, k); + break; +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + case AF_INET6: + k = xfrm6_policy_ok(pol->xfrm_vec+i, sp, k); + break; +#endif + default: + k = -1; + } if (k < 0) goto reject; } @@ -1064,18 +1487,29 @@ return 0; } -int __xfrm_route_forward(struct sk_buff *skb) +int __xfrm_route_forward(struct sk_buff *skb, unsigned short family) { struct flowi fl; - _decode_session(skb, &fl); + switch (family) { + case AF_INET: + _decode_session4(skb, &fl); + break; +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + case AF_INET6: + _decode_session6(skb, &fl); + break; +#endif + default: + return 0; + } return xfrm_lookup(&skb->dst, &fl, NULL, 0) == 0; } /* Optimize later using cookies and generation ids. */ -static struct dst_entry *xfrm4_dst_check(struct dst_entry *dst, u32 cookie) +static struct dst_entry *xfrm_dst_check(struct dst_entry *dst, u32 cookie) { struct dst_entry *child = dst; @@ -1091,19 +1525,19 @@ return dst; } -static void xfrm4_dst_destroy(struct dst_entry *dst) +static void xfrm_dst_destroy(struct dst_entry *dst) { xfrm_state_put(dst->xfrm); dst->xfrm = NULL; } -static void xfrm4_link_failure(struct sk_buff *skb) +static void xfrm_link_failure(struct sk_buff *skb) { /* Impossible. Such dst must be popped before reaches point of failure. 
*/ return; } -static struct dst_entry *xfrm4_negative_advice(struct dst_entry *dst) +static struct dst_entry *xfrm_negative_advice(struct dst_entry *dst) { if (dst) { if (dst->obsolete) { @@ -1114,8 +1548,7 @@ return dst; } - -static int xfrm4_garbage_collect(void) +static void __xfrm_garbage_collect(void) { int i; struct xfrm_policy *pol; @@ -1145,10 +1578,22 @@ gc_list = dst->next; dst_free(dst); } +} +static inline int xfrm4_garbage_collect(void) +{ + __xfrm_garbage_collect(); return (atomic_read(&xfrm4_dst_ops.entries) > xfrm4_dst_ops.gc_thresh*2); } +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +static inline int xfrm6_garbage_collect(void) +{ + __xfrm_garbage_collect(); + return (atomic_read(&xfrm6_dst_ops.entries) > xfrm6_dst_ops.gc_thresh*2); +} +#endif + static int bundle_depends_on(struct dst_entry *dst, struct xfrm_state *x) { do { @@ -1192,7 +1637,7 @@ return 0; } - + static void xfrm4_update_pmtu(struct dst_entry *dst, u32 mtu) { struct dst_entry *path = dst->path; @@ -1203,6 +1648,18 @@ path->ops->update_pmtu(path, mtu); } +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) +static void xfrm6_update_pmtu(struct dst_entry *dst, u32 mtu) +{ + struct dst_entry *path = dst->path; + + if (mtu >= 1280 && mtu < dst_pmtu(dst)) + return; + + path->ops->update_pmtu(path, mtu); +} +#endif + /* Well... that's _TASK_. We need to scan through transformation * list and figure out what mss tcp should generate in order to * final datagram fit to mtu. Mama mia... :-) @@ -1212,7 +1669,7 @@ * * Consider this function as something like dark humour. 
:-) */ -static int xfrm4_get_mss(struct dst_entry *dst, u32 mtu) +static int xfrm_get_mss(struct dst_entry *dst, u32 mtu) { int res = mtu - dst->header_len; @@ -1247,16 +1704,32 @@ .family = AF_INET, .protocol = __constant_htons(ETH_P_IP), .gc = xfrm4_garbage_collect, - .check = xfrm4_dst_check, - .destroy = xfrm4_dst_destroy, - .negative_advice = xfrm4_negative_advice, - .link_failure = xfrm4_link_failure, + .check = xfrm_dst_check, + .destroy = xfrm_dst_destroy, + .negative_advice = xfrm_negative_advice, + .link_failure = xfrm_link_failure, .update_pmtu = xfrm4_update_pmtu, - .get_mss = xfrm4_get_mss, + .get_mss = xfrm_get_mss, .gc_thresh = 1024, .entry_size = sizeof(struct xfrm_dst), }; +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) +struct dst_ops xfrm6_dst_ops = { + .family = AF_INET6, + .protocol = __constant_htons(ETH_P_IPV6), + .gc = xfrm6_garbage_collect, + .check = xfrm_dst_check, + .destroy = xfrm_dst_destroy, + .negative_advice = xfrm_negative_advice, + .link_failure = xfrm_link_failure, + .update_pmtu = xfrm6_update_pmtu, + .get_mss = xfrm_get_mss, + .gc_thresh = 1024, + .entry_size = sizeof(struct xfrm_dst), +}; +#endif /* CONFIG_IPV6 || CONFIG_IPV6_MODULE */ + void __init xfrm_init(void) { xfrm4_dst_ops.kmem_cachep = kmem_cache_create("xfrm4_dst_cache", @@ -1267,8 +1740,12 @@ if (!xfrm4_dst_ops.kmem_cachep) panic("IP: failed to allocate xfrm4_dst_cache\n"); - flow_cache_init(); +#if defined (CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + xfrm6_dst_ops.kmem_cachep = xfrm4_dst_ops.kmem_cachep; +#endif + flow_cache_init(); xfrm_state_init(); xfrm_input_init(); } + diff -ruN -x CVS linux-2.5.63/net/ipv4/xfrm_state.c linux25/net/ipv4/xfrm_state.c --- linux-2.5.63/net/ipv4/xfrm_state.c 2003-02-25 04:05:38.000000000 +0900 +++ linux25/net/ipv4/xfrm_state.c 2003-03-05 20:17:33.000000000 +0900 @@ -1,3 +1,11 @@ +/* Changes + * + * Mitsuru KANDA @USAGI : IPv6 Support + * Kazunori MIYAZAWA @USAGI : + * Kunihiro Ishiguro : + * + */ + #include #include 
#include @@ -207,8 +215,8 @@ } struct xfrm_state * -xfrm_state_find(u32 daddr, u32 saddr, struct flowi *fl, struct xfrm_tmpl *tmpl, - struct xfrm_policy *pol, int *err) +xfrm4_state_find(u32 daddr, u32 saddr, struct flowi *fl, struct xfrm_tmpl *tmpl, + struct xfrm_policy *pol, int *err) { unsigned h = ntohl(daddr); struct xfrm_state *x; @@ -290,6 +298,7 @@ x->props.saddr.xfrm4_addr = saddr; x->props.mode = tmpl->mode; x->props.reqid = tmpl->reqid; + x->props.family = AF_INET; if (km_query(x, tmpl, pol) == 0) { x->km.state = XFRM_STATE_ACQ; @@ -318,14 +327,133 @@ return x; } +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +struct xfrm_state * +xfrm6_state_find(struct in6_addr *daddr, struct in6_addr *saddr, struct flowi *fl, struct xfrm_tmpl *tmpl, + struct xfrm_policy *pol, int *err) +{ + unsigned h = ntohl(daddr->s6_addr32[2]^daddr->s6_addr32[3]); + struct xfrm_state *x; + int acquire_in_progress = 0; + int error = 0; + struct xfrm_state *best = NULL; + + h = (h ^ (h>>16)) % XFRM_DST_HSIZE; + + spin_lock_bh(&xfrm_state_lock); + list_for_each_entry(x, xfrm_state_bydst+h, bydst) { + if (x->props.family == AF_INET6&& + !ipv6_addr_cmp(daddr, (struct in6_addr *)&x->id.daddr) && + x->props.reqid == tmpl->reqid && + (!ipv6_addr_cmp(saddr, (struct in6_addr *)&x->props.saddr)|| ipv6_addr_any(saddr)) && + tmpl->mode == x->props.mode && + tmpl->id.proto == x->id.proto) { + /* Resolution logic: + 1. There is a valid state with matching selector. + Done. + 2. Valid state with inappropriate selector. Skip. + + Entering area of "sysdeps". + + 3. If state is not valid, selector is temporary, + it selects only session which triggered + previous resolution. Key manager will do + something to install a state with proper + selector. 
+ */ + if (x->km.state == XFRM_STATE_VALID) { + if (!xfrm6_selector_match(&x->sel, fl)) + continue; + if (!best || + best->km.dying > x->km.dying || + (best->km.dying == x->km.dying && + best->curlft.add_time < x->curlft.add_time)) + best = x; + } else if (x->km.state == XFRM_STATE_ACQ) { + acquire_in_progress = 1; + } else if (x->km.state == XFRM_STATE_ERROR || + x->km.state == XFRM_STATE_EXPIRED) { + if (xfrm6_selector_match(&x->sel, fl)) + error = 1; + } + } + } + + if (best) { + atomic_inc(&best->refcnt); + spin_unlock_bh(&xfrm_state_lock); + return best; + } + x = NULL; + if (!error && !acquire_in_progress && + ((x = xfrm_state_alloc()) != NULL)) { + /* Initialize temporary selector matching only + * to current session. */ + memcpy(&x->sel.daddr, fl->fl6_dst, sizeof(struct in6_addr)); + memcpy(&x->sel.saddr, fl->fl6_src, sizeof(struct in6_addr)); + x->sel.dport = fl->uli_u.ports.dport; + x->sel.dport_mask = ~0; + x->sel.sport = fl->uli_u.ports.sport; + x->sel.sport_mask = ~0; + x->sel.prefixlen_d = 128; + x->sel.prefixlen_s = 128; + x->sel.proto = fl->proto; + x->sel.ifindex = fl->oif; + x->id = tmpl->id; + if (ipv6_addr_any((struct in6_addr*)&x->id.daddr)) + memcpy(&x->id.daddr, daddr, sizeof(x->sel.daddr)); + memcpy(&x->props.saddr, &tmpl->saddr, sizeof(x->props.saddr)); + if (ipv6_addr_any((struct in6_addr*)&x->props.saddr)) + memcpy(&x->props.saddr, &saddr, sizeof(x->sel.saddr)); + x->props.mode = tmpl->mode; + x->props.reqid = tmpl->reqid; + x->props.family = AF_INET6; + + if (km_query(x, tmpl, pol) == 0) { + x->km.state = XFRM_STATE_ACQ; + list_add_tail(&x->bydst, xfrm_state_bydst+h); + atomic_inc(&x->refcnt); + if (x->id.spi) { + struct in6_addr *addr = (struct in6_addr*)&x->id.daddr; + h = ntohl((addr->s6_addr32[2]^addr->s6_addr32[3])^x->id.spi^x->id.proto); + h = (h ^ (h>>10) ^ (h>>20)) % XFRM_DST_HSIZE; + list_add(&x->byspi, xfrm_state_byspi+h); + atomic_inc(&x->refcnt); + } + x->lft.hard_add_expires_seconds = ACQ_EXPIRES; + atomic_inc(&x->refcnt); + 
mod_timer(&x->timer, jiffies + ACQ_EXPIRES*HZ); + } else { + x->km.state = XFRM_STATE_DEAD; + xfrm_state_put(x); + x = NULL; + error = 1; + } + } + spin_unlock_bh(&xfrm_state_lock); + if (!x) + *err = acquire_in_progress ? -EAGAIN : + (error ? -ESRCH : -ENOMEM); + return x; +} +#endif /* CONFIG_IPV6 || CONFIG_IPV6_MODULE */ + void xfrm_state_insert(struct xfrm_state *x) { unsigned h = 0; - if (x->props.family == AF_INET) + switch (x->props.family) { + case AF_INET: h = ntohl(x->id.daddr.xfrm4_addr); - else if (x->props.family == AF_INET6) + break; +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) + case AF_INET6: h = ntohl(x->id.daddr.a6[2]^x->id.daddr.a6[3]); + break; +#endif + default: + return; + } h = (h ^ (h>>16)) % XFRM_DST_HSIZE; @@ -384,7 +512,7 @@ } struct xfrm_state * -xfrm_state_lookup(u32 daddr, u32 spi, u8 proto) +xfrm4_state_lookup(u32 daddr, u32 spi, u8 proto) { unsigned h = ntohl(daddr^spi^proto); struct xfrm_state *x; @@ -406,6 +534,31 @@ return NULL; } +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +struct xfrm_state * +xfrm6_state_lookup(struct in6_addr *daddr, u32 spi, u8 proto) +{ + unsigned h = ntohl(daddr->s6_addr32[2]^daddr->s6_addr32[3]^spi^proto); + struct xfrm_state *x; + + h = (h ^ (h>>10) ^ (h>>20)) % XFRM_DST_HSIZE; + + spin_lock_bh(&xfrm_state_lock); + list_for_each_entry(x, xfrm_state_byspi+h, byspi) { + if (x->props.family == AF_INET6 && + spi == x->id.spi && + !ipv6_addr_cmp(daddr, (struct in6_addr *)x->id.daddr.a6) && + proto == x->id.proto) { + atomic_inc(&x->refcnt); + spin_unlock_bh(&xfrm_state_lock); + return x; + } + } + spin_unlock_bh(&xfrm_state_lock); + return NULL; +} +#endif + struct xfrm_state * xfrm_find_acq(u8 mode, u16 reqid, u8 proto, u32 daddr, u32 saddr, int create) { @@ -445,7 +598,59 @@ x0->km.state = XFRM_STATE_ACQ; x0->id.daddr.xfrm4_addr = daddr; x0->id.proto = proto; + x0->props.mode = mode; + x0->props.reqid = reqid; x0->props.family = AF_INET; + x0->lft.hard_add_expires_seconds = ACQ_EXPIRES; +
atomic_inc(&x0->refcnt); + mod_timer(&x0->timer, jiffies + ACQ_EXPIRES*HZ); + atomic_inc(&x0->refcnt); + list_add_tail(&x0->bydst, xfrm_state_bydst+h); + wake_up(&km_waitq); + } + spin_unlock_bh(&xfrm_state_lock); + return x0; +} + +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +struct xfrm_state * +xfrm6_find_acq(u8 mode, u16 reqid, u8 proto, struct in6_addr *daddr, struct in6_addr *saddr, int create) +{ + struct xfrm_state *x, *x0; + unsigned h = ntohl(daddr->s6_addr32[2]^daddr->s6_addr32[3]); + + h = (h ^ (h>>16)) % XFRM_DST_HSIZE; + x0 = NULL; + + spin_lock_bh(&xfrm_state_lock); + list_for_each_entry(x, xfrm_state_bydst+h, bydst) { + if (x->props.family == AF_INET6 && + !ipv6_addr_cmp(daddr, (struct in6_addr *)x->id.daddr.a6) && + mode == x->props.mode && + proto == x->id.proto && + !ipv6_addr_cmp(saddr, (struct in6_addr *)x->props.saddr.a6) && + reqid == x->props.reqid && + x->km.state == XFRM_STATE_ACQ) { + if (!x0) + x0 = x; + if (x->id.spi) + continue; + x0 = x; + break; + } + } + if (x0) { + atomic_inc(&x0->refcnt); + } else if (create && (x0 = xfrm_state_alloc()) != NULL) { + memcpy(x0->sel.daddr.a6, daddr, sizeof(struct in6_addr)); + memcpy(x0->sel.saddr.a6, saddr, sizeof(struct in6_addr)); + x0->sel.prefixlen_d = 128; + x0->sel.prefixlen_s = 128; + memcpy(x0->props.saddr.a6, saddr, sizeof(struct in6_addr)); + x0->km.state = XFRM_STATE_ACQ; + memcpy(x0->id.daddr.a6, daddr, sizeof(struct in6_addr)); + x0->id.proto = proto; + x0->props.family = AF_INET6; x0->props.mode = mode; x0->props.reqid = reqid; x0->lft.hard_add_expires_seconds = ACQ_EXPIRES; @@ -458,6 +663,7 @@ spin_unlock_bh(&xfrm_state_lock); return x0; } +#endif /* Silly enough, but I'm lazy to build resolution list */ @@ -491,7 +697,18 @@ return; if (minspi == maxspi) { - x0 = xfrm_state_lookup(x->id.daddr.xfrm4_addr, minspi, x->id.proto); + switch(x->props.family) { + case AF_INET: + x0 = xfrm4_state_lookup(x->id.daddr.xfrm4_addr, minspi, x->id.proto); + break; +#if 
defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) + case AF_INET6: + x0 = xfrm6_state_lookup((struct in6_addr*)x->id.daddr.a6, minspi, x->id.proto); + break; +#endif + default: + x0 = NULL; + } if (x0) { xfrm_state_put(x0); return; @@ -503,7 +720,18 @@ maxspi = ntohl(maxspi); for (h=0; h<maxspi-minspi+1; h++) { spi = minspi + net_random()%(maxspi-minspi+1); - x0 = xfrm_state_lookup(x->id.daddr.xfrm4_addr, htonl(spi), x->id.proto); + switch(x->props.family) { + case AF_INET: + x0 = xfrm4_state_lookup(x->id.daddr.xfrm4_addr, htonl(spi), x->id.proto); + break; +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) + case AF_INET6: + x0 = xfrm6_state_lookup((struct in6_addr*)x->id.daddr.a6, htonl(spi), x->id.proto); + break; +#endif + default: + x0 = NULL; + } if (x0 == NULL) break; xfrm_state_put(x0); @@ -512,7 +740,18 @@ } if (x->id.spi) { spin_lock_bh(&xfrm_state_lock); - h = ntohl(x->id.daddr.xfrm4_addr^x->id.spi^x->id.proto); + switch(x->props.family) { + case AF_INET: + h = ntohl(x->id.daddr.xfrm4_addr^x->id.spi^x->id.proto); + break; +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) + case AF_INET6: + h = ntohl(x->id.daddr.a6[2]^x->id.daddr.a6[3]^x->id.spi^x->id.proto); + break; +#endif + default: + h = 0; /* XXX */ + } h = (h ^ (h>>10) ^ (h>>20)) % XFRM_DST_HSIZE; list_add(&x->byspi, xfrm_state_byspi+h); atomic_inc(&x->refcnt); @@ -605,14 +844,21 @@ int i; for (i=0; i<n; i++) { - if (x[i]->props.family == AF_INET && - !xfrm4_selector_match(&x[i]->sel, fl)) - return -EINVAL; + int match; + switch(x[i]->props.family) { + case AF_INET: + match = xfrm4_selector_match(&x[i]->sel, fl); + break; #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) - if (x[i]->props.family == AF_INET6 && - !xfrm6_selector_match(&x[i]->sel, fl)) - return -EINVAL; + case AF_INET6: + match = xfrm6_selector_match(&x[i]->sel, fl); + break; #endif + default: + match = 0; + } + if (!match) + return -EINVAL; } return 0; } @@ -722,118 +968,3 @@ } } -#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) -struct xfrm_state * -xfrm6_state_lookup(struct in6_addr *daddr, u32 spi, u8 proto) -{ - unsigned h =
ntohl(daddr->s6_addr32[2]^daddr->s6_addr32[3]^spi^proto); - struct xfrm_state *x; - - h = (h ^ (h>>10) ^ (h>>20)) % XFRM_DST_HSIZE; - - spin_lock_bh(&xfrm_state_lock); - list_for_each_entry(x, xfrm_state_byspi+h, byspi) { - if (x->props.family == AF_INET6 && - spi == x->id.spi && - !ipv6_addr_cmp(daddr, (struct in6_addr *)x->id.daddr.a6) && - proto == x->id.proto) { - atomic_inc(&x->refcnt); - spin_unlock_bh(&xfrm_state_lock); - return x; - } - } - spin_unlock_bh(&xfrm_state_lock); - return NULL; -} - -struct xfrm_state * -xfrm6_find_acq(u8 mode, u16 reqid, u8 proto, struct in6_addr *daddr, struct in6_addr *saddr, int create) -{ - struct xfrm_state *x, *x0; - unsigned h = ntohl(daddr->s6_addr32[2]^daddr->s6_addr32[3]); - - h = (h ^ (h>>16)) % XFRM_DST_HSIZE; - x0 = NULL; - - spin_lock_bh(&xfrm_state_lock); - list_for_each_entry(x, xfrm_state_bydst+h, bydst) { - if (x->props.family == AF_INET6 && - !memcmp(daddr, x->id.daddr.a6, sizeof(struct in6_addr)) && - mode == x->props.mode && - proto == x->id.proto && - !memcmp(saddr, x->props.saddr.a6, sizeof(struct in6_addr)) && - reqid == x->props.reqid && - x->km.state == XFRM_STATE_ACQ) { - if (!x0) - x0 = x; - if (x->id.spi) - continue; - x0 = x; - break; - } - } - if (x0) { - atomic_inc(&x0->refcnt); - } else if (create && (x0 = xfrm_state_alloc()) != NULL) { - memcpy(x0->sel.daddr.a6, daddr, sizeof(struct in6_addr)); - memcpy(x0->sel.saddr.a6, saddr, sizeof(struct in6_addr)); - x0->sel.prefixlen_d = 128; - x0->sel.prefixlen_s = 128; - memcpy(x0->props.saddr.a6, saddr, sizeof(struct in6_addr)); - x0->km.state = XFRM_STATE_ACQ; - memcpy(x0->id.daddr.a6, daddr, sizeof(struct in6_addr)); - x0->id.proto = proto; - x0->props.family = AF_INET6; - x0->props.mode = mode; - x0->props.reqid = reqid; - x0->lft.hard_add_expires_seconds = ACQ_EXPIRES; - atomic_inc(&x0->refcnt); - mod_timer(&x0->timer, jiffies + ACQ_EXPIRES*HZ); - atomic_inc(&x0->refcnt); - list_add_tail(&x0->bydst, xfrm_state_bydst+h); - wake_up(&km_waitq); - } - 
spin_unlock_bh(&xfrm_state_lock); - return x0; -} - -void -xfrm6_alloc_spi(struct xfrm_state *x, u32 minspi, u32 maxspi) -{ - u32 h; - struct xfrm_state *x0; - - if (x->id.spi) - return; - - if (minspi == maxspi) { - x0 = xfrm6_state_lookup((struct in6_addr*)x->id.daddr.a6, minspi, x->id.proto); - if (x0) { - xfrm_state_put(x0); - return; - } - x->id.spi = minspi; - } else { - u32 spi = 0; - minspi = ntohl(minspi); - maxspi = ntohl(maxspi); - for (h=0; h<maxspi-minspi+1; h++) { - spi = minspi + net_random()%(maxspi-minspi+1); - x0 = xfrm6_state_lookup((struct in6_addr*)x->id.daddr.a6, htonl(spi), x->id.proto); - if (x0 == NULL) - break; - xfrm_state_put(x0); - } - x->id.spi = htonl(spi); - } - if (x->id.spi) { - spin_lock_bh(&xfrm_state_lock); - h = ntohl(x->id.daddr.a6[2]^x->id.daddr.a6[3]^x->id.spi^x->id.proto); - h = (h ^ (h>>10) ^ (h>>20)) % XFRM_DST_HSIZE; - list_add(&x->byspi, xfrm_state_byspi+h); - atomic_inc(&x->refcnt); - spin_unlock_bh(&xfrm_state_lock); - wake_up(&km_waitq); - } -} -#endif /* CONFIG_IPV6 || CONFIG_IPV6_MODULE */ diff -ruN -x CVS linux-2.5.63/net/ipv4/xfrm_user.c linux25/net/ipv4/xfrm_user.c --- linux-2.5.63/net/ipv4/xfrm_user.c 2003-02-25 04:05:34.000000000 +0900 +++ linux25/net/ipv4/xfrm_user.c 2003-03-04 20:38:16.000000000 +0900 @@ -234,8 +234,8 @@ switch (x->props.family) { case AF_INET: - x1 = xfrm_state_lookup(x->props.saddr.xfrm4_addr, - x->id.spi, x->id.proto); + x1 = xfrm4_state_lookup(x->props.saddr.xfrm4_addr, + x->id.spi, x->id.proto); break; #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) case AF_INET6: @@ -265,7 +265,7 @@ switch (p->family) { case AF_INET: - x = xfrm_state_lookup(p->saddr.xfrm4_addr, p->spi, p->proto); + x = xfrm4_state_lookup(p->saddr.xfrm4_addr, p->spi, p->proto); break; #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) case AF_INET6: @@ -395,7 +395,7 @@ switch (p->family) { case AF_INET: - x = xfrm_state_lookup(p->saddr.xfrm4_addr, p->spi, p->proto); + x = xfrm4_state_lookup(p->saddr.xfrm4_addr, p->spi, p->proto); break; #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) case AF_INET6: diff -ruN -x CVS
linux-2.5.63/net/ipv6/Kconfig linux25/net/ipv6/Kconfig --- linux-2.5.63/net/ipv6/Kconfig 2003-02-25 04:05:32.000000000 +0900 +++ linux25/net/ipv6/Kconfig 2003-03-04 20:38:16.000000000 +0900 @@ -17,5 +17,18 @@ See for details. -source "net/ipv6/netfilter/Kconfig" +config INET6_AH + tristate "IPv6: AH transformation" + ---help--- + Support for IPsec AH. + + If unsure, say Y. + +config INET6_ESP + tristate "IPv6: ESP transformation" + ---help--- + Support for IPsec ESP. + If unsure, say Y. + +source "net/ipv6/netfilter/Kconfig" diff -ruN -x CVS linux-2.5.63/net/ipv6/Makefile linux25/net/ipv6/Makefile --- linux-2.5.63/net/ipv6/Makefile 2003-02-25 04:05:39.000000000 +0900 +++ linux25/net/ipv6/Makefile 2003-03-05 00:28:59.000000000 +0900 @@ -10,4 +10,6 @@ exthdrs.o sysctl_net_ipv6.o datagram.o proc.o \ ip6_flowlabel.o ipv6_syms.o +obj-$(CONFIG_INET6_AH) += ah6.o +obj-$(CONFIG_INET6_ESP) += esp6.o obj-$(CONFIG_NETFILTER) += netfilter/ diff -ruN -x CVS linux-2.5.63/net/ipv6/ah6.c linux25/net/ipv6/ah6.c --- linux-2.5.63/net/ipv6/ah6.c 1970-01-01 09:00:00.000000000 +0900 +++ linux25/net/ipv6/ah6.c 2003-03-05 11:32:51.000000000 +0900 @@ -0,0 +1,361 @@ +/* + * Copyright (C)2002 USAGI/WIDE Project + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Authors + * + * Mitsuru KANDA @USAGI : IPv6 Support + * Kazunori MIYAZAWA @USAGI : + * Kunihiro Ishiguro : + * + * This file is derived from net/ipv4/ah.c. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define AH_HLEN_NOICV 12 + +/* XXX no ipv6 ah specific */ +#define NIP6(addr) \ + ntohs((addr).s6_addr16[0]),\ + ntohs((addr).s6_addr16[1]),\ + ntohs((addr).s6_addr16[2]),\ + ntohs((addr).s6_addr16[3]),\ + ntohs((addr).s6_addr16[4]),\ + ntohs((addr).s6_addr16[5]),\ + ntohs((addr).s6_addr16[6]),\ + ntohs((addr).s6_addr16[7]) + +int ah6_output(struct sk_buff *skb) +{ + int err; + int hdr_len = sizeof(struct ipv6hdr); + struct dst_entry *dst = skb->dst; + struct xfrm_state *x = dst->xfrm; + struct ipv6hdr *iph = NULL; + struct ip_auth_hdr *ah; + struct ah_data *ahp; + u16 nh_offset = 0; + u8 nexthdr; +printk(KERN_DEBUG "%s\n", __FUNCTION__); + if (skb->ip_summed == CHECKSUM_HW && skb_checksum_help(skb) == NULL) + return -EINVAL; + + spin_lock_bh(&x->lock); + if ((err = xfrm_state_check_expire(x)) != 0) + goto error; + if ((err = xfrm_state_check_space(x, skb)) != 0) + goto error; + + if (x->props.mode) { + iph = skb->nh.ipv6h; + skb->nh.ipv6h = (struct ipv6hdr*)skb_push(skb, x->props.header_len); + skb->nh.ipv6h->version = 6; + skb->nh.ipv6h->payload_len = htons(skb->len - sizeof(struct ipv6hdr)); + skb->nh.ipv6h->nexthdr = IPPROTO_AH; + memcpy(&skb->nh.ipv6h->saddr, &x->props.saddr, sizeof(struct in6_addr)); + memcpy(&skb->nh.ipv6h->daddr, &x->id.daddr, sizeof(struct in6_addr)); + ah = (struct ip_auth_hdr*)(skb->nh.ipv6h+1); + ah->nexthdr = IPPROTO_IPV6; + } else { + hdr_len = skb->h.raw - skb->nh.raw; + iph = kmalloc(hdr_len, GFP_ATOMIC); + if (!iph) { + err = -ENOMEM; + goto error; + } 
+ memcpy(iph, skb->data, hdr_len); + skb->nh.ipv6h = (struct ipv6hdr*)skb_push(skb, x->props.header_len); + memcpy(skb->nh.ipv6h, iph, hdr_len); + nexthdr = xfrm6_clear_mutable_options(skb, &nh_offset, XFRM_POLICY_OUT); + if (nexthdr == 0) + goto error; + + skb->nh.raw[nh_offset] = IPPROTO_AH; + skb->nh.ipv6h->payload_len = htons(skb->len - sizeof(struct ipv6hdr)); + ah = (struct ip_auth_hdr*)(skb->nh.raw+hdr_len); + skb->h.raw = (unsigned char*) ah; + ah->nexthdr = nexthdr; + } + + skb->nh.ipv6h->priority = 0; + skb->nh.ipv6h->flow_lbl[0] = 0; + skb->nh.ipv6h->flow_lbl[1] = 0; + skb->nh.ipv6h->flow_lbl[2] = 0; + skb->nh.ipv6h->hop_limit = 0; + + ahp = x->data; + ah->hdrlen = (XFRM_ALIGN8(ahp->icv_trunc_len + + AH_HLEN_NOICV) >> 2) - 2; + + ah->reserved = 0; + ah->spi = x->id.spi; + ah->seq_no = htonl(++x->replay.oseq); + ahp->icv(ahp, skb, ah->auth_data); + + if (x->props.mode) { + skb->nh.ipv6h->hop_limit = iph->hop_limit; + skb->nh.ipv6h->priority = iph->priority; + skb->nh.ipv6h->flow_lbl[0] = iph->flow_lbl[0]; + skb->nh.ipv6h->flow_lbl[1] = iph->flow_lbl[1]; + skb->nh.ipv6h->flow_lbl[2] = iph->flow_lbl[2]; + } else { + memcpy(skb->nh.ipv6h, iph, hdr_len); + skb->nh.raw[nh_offset] = IPPROTO_AH; + skb->nh.ipv6h->payload_len = htons(skb->len - sizeof(struct ipv6hdr)); + kfree (iph); + } + + skb->nh.raw = skb->data; + + x->curlft.bytes += skb->len; + x->curlft.packets++; + spin_unlock_bh(&x->lock); + if ((skb->dst = dst_pop(dst)) == NULL) + goto error_nolock; + return NET_XMIT_BYPASS; +error: + spin_unlock_bh(&x->lock); +error_nolock: + kfree_skb(skb); + return err; +} + +int ah6_input(struct xfrm_state *x, struct sk_buff *skb) +{ + int ah_hlen; + struct ipv6hdr *iph; + struct ipv6_auth_hdr *ah; + struct ah_data *ahp; + unsigned char *tmp_hdr = NULL; + int hdr_len = skb->h.raw - skb->nh.raw; + u8 nexthdr = 0; + + if (!pskb_may_pull(skb, sizeof(struct ip_auth_hdr))) + goto out; + + ah = (struct ipv6_auth_hdr*)skb->data; + ahp = x->data; + ah_hlen = (ah->hdrlen + 2) 
<< 2; + + if (ah_hlen != XFRM_ALIGN8(ahp->icv_full_len + AH_HLEN_NOICV) && + ah_hlen != XFRM_ALIGN8(ahp->icv_trunc_len + AH_HLEN_NOICV)) + goto out; + + if (!pskb_may_pull(skb, ah_hlen)) + goto out; + + /* We are going to _remove_ AH header to keep sockets happy, + * so... Later this can change. */ + if (skb_cloned(skb) && + pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) + goto out; + + tmp_hdr = kmalloc(hdr_len, GFP_ATOMIC); + if (!tmp_hdr) + goto out; + memcpy(tmp_hdr, skb->nh.raw, hdr_len); + ah = (struct ipv6_auth_hdr*)skb->data; + iph = skb->nh.ipv6h; + + { + u8 auth_data[ahp->icv_trunc_len]; + + memcpy(auth_data, ah->auth_data, ahp->icv_trunc_len); + skb_push(skb, skb->data - skb->nh.raw); + ahp->icv(ahp, skb, ah->auth_data); + if (memcmp(ah->auth_data, auth_data, ahp->icv_trunc_len)) { + if (net_ratelimit()) + printk(KERN_WARNING "ipsec ah authentication error\n"); + x->stats.integrity_failed++; + goto free_out; + } + } + + nexthdr = ah->nexthdr; + skb->nh.raw = skb_pull(skb, (ah->hdrlen+2)<<2); + memcpy(skb->nh.raw, tmp_hdr, hdr_len); + skb->nh.ipv6h->payload_len = htons(skb->len - sizeof(struct ipv6hdr)); + skb_pull(skb, hdr_len); + skb->h.raw = skb->data; + + + kfree(tmp_hdr); + + return nexthdr; + +free_out: + kfree(tmp_hdr); +out: + return -EINVAL; +} + +void ah6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, + int type, int code, int offset, __u32 info) +{ + struct ipv6hdr *iph = (struct ipv6hdr*)skb->data; + struct ip_auth_hdr *ah = (struct ip_auth_hdr*)(skb->data+offset); + struct xfrm_state *x; + + if (type != ICMPV6_DEST_UNREACH && + type != ICMPV6_PKT_TOOBIG) + return; + + x = xfrm6_state_lookup(&iph->daddr, ah->spi, IPPROTO_AH); + if (!x) + return; + + printk(KERN_DEBUG "pmtu discovery on SA AH/%08x/" + "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + ntohl(ah->spi), NIP6(iph->daddr)); + + xfrm_state_put(x); +} + +static int ah6_init_state(struct xfrm_state *x, void *args) +{ + struct ah_data *ahp = NULL; + struct xfrm_algo_desc *aalg_desc; + + /*
null auth can use a zero length key */ + if (x->aalg->alg_key_len > 512) + goto error; + + ahp = kmalloc(sizeof(*ahp), GFP_KERNEL); + if (ahp == NULL) + return -ENOMEM; + + memset(ahp, 0, sizeof(*ahp)); + + ahp->key = x->aalg->alg_key; + ahp->key_len = (x->aalg->alg_key_len+7)/8; + ahp->tfm = crypto_alloc_tfm(x->aalg->alg_name, 0); + if (!ahp->tfm) + goto error; + ahp->icv = ah_hmac_digest; + + /* + * Lookup the algorithm description maintained by xfrm_algo, + * verify crypto transform properties, and store information + * we need for AH processing. This lookup cannot fail here + * after a successful crypto_alloc_tfm(). + */ + aalg_desc = xfrm_aalg_get_byname(x->aalg->alg_name); + BUG_ON(!aalg_desc); + + if (aalg_desc->uinfo.auth.icv_fullbits/8 != + crypto_tfm_alg_digestsize(ahp->tfm)) { + printk(KERN_INFO "AH: %s digestsize %u != %hu\n", + x->aalg->alg_name, crypto_tfm_alg_digestsize(ahp->tfm), + aalg_desc->uinfo.auth.icv_fullbits/8); + goto error; + } + + ahp->icv_full_len = aalg_desc->uinfo.auth.icv_fullbits/8; + ahp->icv_trunc_len = aalg_desc->uinfo.auth.icv_truncbits/8; + + ahp->work_icv = kmalloc(ahp->icv_full_len, GFP_KERNEL); + if (!ahp->work_icv) + goto error; + + x->props.header_len = XFRM_ALIGN8(ahp->icv_trunc_len + AH_HLEN_NOICV); + if (x->props.mode) + x->props.header_len += sizeof(struct ipv6hdr); + x->data = ahp; + + return 0; + +error: + if (ahp) { + if (ahp->work_icv) + kfree(ahp->work_icv); + if (ahp->tfm) + crypto_free_tfm(ahp->tfm); + kfree(ahp); + } + return -EINVAL; +} + +static void ah6_destroy(struct xfrm_state *x) +{ + struct ah_data *ahp = x->data; + + if (ahp->work_icv) { + kfree(ahp->work_icv); + ahp->work_icv = NULL; + } + if (ahp->tfm) { + crypto_free_tfm(ahp->tfm); + ahp->tfm = NULL; + } +} + +static struct xfrm_type ah6_type = +{ + .description = "AH6", + .proto = IPPROTO_AH, + .init_state = ah6_init_state, + .destructor = ah6_destroy, + .input = ah6_input, + .output = ah6_output +}; + +static struct inet6_protocol ah6_protocol = { + .handler =
xfrm6_rcv, + .err_handler = ah6_err, +}; + +int __init ah6_init(void) +{ + SET_MODULE_OWNER(&ah6_type); + + if (xfrm6_register_type(&ah6_type) < 0) { + printk(KERN_INFO "ipv6 ah init: can't add xfrm type\n"); + return -EAGAIN; + } + + if (inet6_add_protocol(&ah6_protocol, IPPROTO_AH) < 0) { + printk(KERN_INFO "ipv6 ah init: can't add protocol\n"); + xfrm6_unregister_type(&ah6_type); + return -EAGAIN; + } + + return 0; +} + +static void __exit ah6_fini(void) +{ + if (inet6_del_protocol(&ah6_protocol, IPPROTO_AH) < 0) + printk(KERN_INFO "ipv6 ah close: can't remove protocol\n"); + + if (xfrm6_unregister_type(&ah6_type) < 0) + printk(KERN_INFO "ipv6 ah close: can't remove xfrm type\n"); + +} + +module_init(ah6_init); +module_exit(ah6_fini); + +MODULE_LICENSE("GPL"); diff -ruN -x CVS linux-2.5.63/net/ipv6/esp6.c linux25/net/ipv6/esp6.c --- linux-2.5.63/net/ipv6/esp6.c 1970-01-01 09:00:00.000000000 +0900 +++ linux25/net/ipv6/esp6.c 2003-03-05 11:33:20.000000000 +0900 @@ -0,0 +1,526 @@ +/* + * Copyright (C)2002 USAGI/WIDE Project + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Authors + * + * Mitsuru KANDA @USAGI : IPv6 Support + * Kazunori MIYAZAWA @USAGI : + * Kunihiro Ishiguro : + * + * This file is derived from net/ipv4/esp.c + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define MAX_SG_ONSTACK 4 + +/* BUGS: + * - we assume replay seqno is always present. + */ + +/* Move to common area: it is shared with AH. */ +/* Common with AH after some work on arguments. */ + +/* XXX no ipv6 esp specific */ +#define NIP6(addr) \ + ntohs((addr).s6_addr16[0]),\ + ntohs((addr).s6_addr16[1]),\ + ntohs((addr).s6_addr16[2]),\ + ntohs((addr).s6_addr16[3]),\ + ntohs((addr).s6_addr16[4]),\ + ntohs((addr).s6_addr16[5]),\ + ntohs((addr).s6_addr16[6]),\ + ntohs((addr).s6_addr16[7]) + +static int get_offset(u8 *packet, u32 packet_len, u8 *nexthdr, struct ipv6_opt_hdr **prevhdr) +{ + u16 offset = sizeof(struct ipv6hdr); + struct ipv6_opt_hdr *exthdr = (struct ipv6_opt_hdr*)(packet + offset); + u8 nextnexthdr; + + *nexthdr = ((struct ipv6hdr*)packet)->nexthdr; + + while (offset + 1 < packet_len) { + + switch (*nexthdr) { + + case NEXTHDR_HOP: + case NEXTHDR_ROUTING: + offset += ipv6_optlen(exthdr); + *nexthdr = exthdr->nexthdr; + *prevhdr = exthdr; + exthdr = (struct ipv6_opt_hdr*)(packet + offset); + break; + + case NEXTHDR_DEST: + nextnexthdr = + ((struct ipv6_opt_hdr*)(packet + offset + ipv6_optlen(exthdr)))->nexthdr; + /* XXX We know the option is inner dest opt + with next next header check. 
*/ + if (nextnexthdr != NEXTHDR_HOP && + nextnexthdr != NEXTHDR_ROUTING && + nextnexthdr != NEXTHDR_DEST) { + return offset; + } + offset += ipv6_optlen(exthdr); + *nexthdr = exthdr->nexthdr; + *prevhdr = exthdr; + exthdr = (struct ipv6_opt_hdr*)(packet + offset); + break; + + default : + return offset; + } + } + + return offset; +} + +int esp6_output(struct sk_buff *skb) +{ + int err; + int hdr_len = 0; + struct dst_entry *dst = skb->dst; + struct xfrm_state *x = dst->xfrm; + struct ipv6hdr *iph = NULL, *top_iph; + struct ip_esp_hdr *esph; + struct crypto_tfm *tfm; + struct esp_data *esp; + struct sk_buff *trailer; + struct ipv6_opt_hdr *prevhdr = NULL; + int blksize; + int clen; + int alen; + int nfrags; + u8 nexthdr; +printk(KERN_DEBUG "%s\n", __FUNCTION__); + /* First, if the skb is not checksummed, complete checksum. */ + if (skb->ip_summed == CHECKSUM_HW && skb_checksum_help(skb) == NULL) + return -EINVAL; + + spin_lock_bh(&x->lock); + if ((err = xfrm_state_check_expire(x)) != 0) + goto error; + if ((err = xfrm_state_check_space(x, skb)) != 0) + goto error; + + err = -ENOMEM; + + /* Strip IP header in transport mode. Save it. */ + + if (!x->props.mode) { + hdr_len = get_offset(skb->nh.raw, skb->len, &nexthdr, &prevhdr); + iph = kmalloc(hdr_len, GFP_ATOMIC); + if (!iph) { + err = -ENOMEM; + goto error; + } + memcpy(iph, skb->nh.raw, hdr_len); + __skb_pull(skb, hdr_len); + } + + /* Now skb is pure payload to encrypt */ + + /* Round to block size */ + clen = skb->len; + + esp = x->data; + alen = esp->auth.icv_trunc_len; + tfm = esp->conf.tfm; + blksize = crypto_tfm_alg_blocksize(tfm); + clen = (clen + 2 + blksize-1)&~(blksize-1); + if (esp->conf.padlen) + clen = (clen + esp->conf.padlen-1)&~(esp->conf.padlen-1); + + if ((nfrags = skb_cow_data(skb, clen-skb->len+alen, &trailer)) < 0) { + if (!x->props.mode && iph) kfree(iph); + goto error; + } + + /* Fill padding... 
*/ + do { + int i; + for (i=0; i<clen-skb->len - 2; i++) + *(u8*)(trailer->tail + i) = i+1; + } while (0); + *(u8*)(trailer->tail + clen-skb->len - 2) = (clen - skb->len)-2; + pskb_put(skb, trailer, clen - skb->len); + + if (x->props.mode) { + iph = skb->nh.ipv6h; + top_iph = (struct ipv6hdr*)skb_push(skb, x->props.header_len); + esph = (struct ip_esp_hdr*)(top_iph+1); + *(u8*)(trailer->tail - 1) = IPPROTO_IPV6; + top_iph->version = 6; + top_iph->priority = iph->priority; + top_iph->flow_lbl[0] = iph->flow_lbl[0]; + top_iph->flow_lbl[1] = iph->flow_lbl[1]; + top_iph->flow_lbl[2] = iph->flow_lbl[2]; + top_iph->nexthdr = IPPROTO_ESP; + top_iph->payload_len = htons(skb->len + alen - sizeof(struct ipv6hdr)); + top_iph->hop_limit = iph->hop_limit; + memcpy(&top_iph->saddr, (struct in6_addr *)&x->props.saddr, sizeof(struct in6_addr)); + memcpy(&top_iph->daddr, (struct in6_addr *)&x->id.daddr, sizeof(struct in6_addr)); + } else { + /* XXX exthdr */ + esph = (struct ip_esp_hdr*)skb_push(skb, x->props.header_len); + skb->h.raw = (unsigned char*)esph; + top_iph = (struct ipv6hdr*)skb_push(skb, hdr_len); + memcpy(top_iph, iph, hdr_len); + kfree(iph); + top_iph->payload_len = htons(skb->len + alen - sizeof(struct ipv6hdr)); + if (prevhdr) { + prevhdr->nexthdr = IPPROTO_ESP; + } else { + top_iph->nexthdr = IPPROTO_ESP; + } + *(u8*)(trailer->tail - 1) = nexthdr; + } + + esph->spi = x->id.spi; + esph->seq_no = htonl(++x->replay.oseq); + + if (esp->conf.ivlen) + crypto_cipher_set_iv(tfm, esp->conf.ivec, crypto_tfm_alg_ivsize(tfm)); + + do { + struct scatterlist sgbuf[nfrags>MAX_SG_ONSTACK ?
0 : nfrags]; + struct scatterlist *sg = sgbuf; + + if (unlikely(nfrags > MAX_SG_ONSTACK)) { + sg = kmalloc(sizeof(struct scatterlist)*nfrags, GFP_ATOMIC); + if (!sg) + goto error; + } + skb_to_sgvec(skb, sg, esph->enc_data+esp->conf.ivlen-skb->data, clen); + crypto_cipher_encrypt(tfm, sg, sg, clen); + if (unlikely(sg != sgbuf)) + kfree(sg); + } while (0); + + if (esp->conf.ivlen) { + memcpy(esph->enc_data, esp->conf.ivec, crypto_tfm_alg_ivsize(tfm)); + crypto_cipher_get_iv(tfm, esp->conf.ivec, crypto_tfm_alg_ivsize(tfm)); + } + + if (esp->auth.icv_full_len) { + esp->auth.icv(esp, skb, (u8*)esph-skb->data, + 8+esp->conf.ivlen+clen, trailer->tail); + pskb_put(skb, trailer, alen); + } + + skb->nh.raw = skb->data; + + x->curlft.bytes += skb->len; + x->curlft.packets++; + spin_unlock_bh(&x->lock); + if ((skb->dst = dst_pop(dst)) == NULL) + goto error_nolock; + return NET_XMIT_BYPASS; + +error: + spin_unlock_bh(&x->lock); +error_nolock: + kfree_skb(skb); + return err; +} + +int esp6_input(struct xfrm_state *x, struct sk_buff *skb) +{ + struct ipv6hdr *iph; + struct ip_esp_hdr *esph; + struct esp_data *esp = x->data; + struct sk_buff *trailer; + int blksize = crypto_tfm_alg_blocksize(esp->conf.tfm); + int alen = esp->auth.icv_trunc_len; + int elen = skb->len - 8 - esp->conf.ivlen - alen; + + int hdr_len = skb->h.raw - skb->nh.raw; + int nfrags; + u8 ret_nexthdr = 0; + unsigned char *tmp_hdr = NULL; + + if (!pskb_may_pull(skb, sizeof(struct ip_esp_hdr))) + goto out; + + if (elen <= 0 || (elen & (blksize-1))) + goto out; + + tmp_hdr = kmalloc(hdr_len, GFP_ATOMIC); + if (!tmp_hdr) + goto out; + memcpy(tmp_hdr, skb->nh.raw, hdr_len); + + /* If integrity check is required, do this. 
*/ + if (esp->auth.icv_full_len) { + u8 sum[esp->auth.icv_full_len]; + u8 sum1[alen]; + + esp->auth.icv(esp, skb, 0, skb->len-alen, sum); + + if (skb_copy_bits(skb, skb->len-alen, sum1, alen)) + BUG(); + + if (unlikely(memcmp(sum, sum1, alen))) { + x->stats.integrity_failed++; + goto out; + } + } + + if ((nfrags = skb_cow_data(skb, 0, &trailer)) < 0) + goto out; + + skb->ip_summed = CHECKSUM_NONE; + + esph = (struct ip_esp_hdr*)skb->data; + iph = skb->nh.ipv6h; + + /* Get ivec. This can be wrong, check against another impls. */ + if (esp->conf.ivlen) + crypto_cipher_set_iv(esp->conf.tfm, esph->enc_data, crypto_tfm_alg_ivsize(esp->conf.tfm)); + + { + u8 nexthdr[2]; + struct scatterlist sgbuf[nfrags>MAX_SG_ONSTACK ? 0 : nfrags]; + struct scatterlist *sg = sgbuf; + u8 padlen; + + if (unlikely(nfrags > MAX_SG_ONSTACK)) { + sg = kmalloc(sizeof(struct scatterlist)*nfrags, GFP_ATOMIC); + if (!sg) + goto out; + } + skb_to_sgvec(skb, sg, 8+esp->conf.ivlen, elen); + crypto_cipher_decrypt(esp->conf.tfm, sg, sg, elen); + if (unlikely(sg != sgbuf)) + kfree(sg); + + if (skb_copy_bits(skb, skb->len-alen-2, nexthdr, 2)) + BUG(); + + padlen = nexthdr[0]; + if (padlen+2 >= elen) { + if (net_ratelimit()) { + printk(KERN_WARNING "ipsec esp packet is garbage padlen=%d, elen=%d\n", padlen+2, elen); + } + goto out; + } + /* ... check padding bits here. Silly. :-) */ + + ret_nexthdr = nexthdr[1]; + pskb_trim(skb, skb->len - alen - padlen - 2); + skb->h.raw = skb_pull(skb, 8 + esp->conf.ivlen); + skb->nh.raw += 8 + esp->conf.ivlen; + memcpy(skb->nh.raw, tmp_hdr, hdr_len); + } + kfree(tmp_hdr); + return ret_nexthdr; + +out: + return -EINVAL; +} + +static u32 esp6_get_max_size(struct xfrm_state *x, int mtu) +{ + struct esp_data *esp = x->data; + u32 blksize = crypto_tfm_alg_blocksize(esp->conf.tfm); + + if (x->props.mode) { + mtu = (mtu + 2 + blksize-1)&~(blksize-1); + } else { + /* The worst case. 
*/ + mtu += 2 + blksize; + } + if (esp->conf.padlen) + mtu = (mtu + esp->conf.padlen-1)&~(esp->conf.padlen-1); + + return mtu + x->props.header_len + esp->auth.icv_full_len; +} + +void esp6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, + int type, int code, int offset, __u32 info) +{ + struct ipv6hdr *iph = (struct ipv6hdr*)skb->data; + struct ip_esp_hdr *esph = (struct ip_esp_hdr*)(skb->data+offset); + struct xfrm_state *x; + + if (type != ICMPV6_DEST_UNREACH && + type != ICMPV6_PKT_TOOBIG) + return; + + x = xfrm6_state_lookup(&iph->daddr, esph->spi, IPPROTO_ESP); + if (!x) + return; + printk(KERN_DEBUG "pmtu discovery on SA ESP/%08x/" + "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + ntohl(esph->spi), NIP6(iph->daddr)); + xfrm_state_put(x); +} + +void esp6_destroy(struct xfrm_state *x) +{ + struct esp_data *esp = x->data; + + if (esp->conf.tfm) { + crypto_free_tfm(esp->conf.tfm); + esp->conf.tfm = NULL; + } + if (esp->conf.ivec) { + kfree(esp->conf.ivec); + esp->conf.ivec = NULL; + } + if (esp->auth.tfm) { + crypto_free_tfm(esp->auth.tfm); + esp->auth.tfm = NULL; + } + if (esp->auth.work_icv) { + kfree(esp->auth.work_icv); + esp->auth.work_icv = NULL; + } +} + +int esp6_init_state(struct xfrm_state *x, void *args) +{ + struct esp_data *esp = NULL; + + if (x->aalg) { + if (x->aalg->alg_key_len == 0 || x->aalg->alg_key_len > 512) + goto error; + } + if (x->ealg == NULL || x->ealg->alg_key_len == 0) + goto error; + + esp = kmalloc(sizeof(*esp), GFP_KERNEL); + if (esp == NULL) + return -ENOMEM; + + memset(esp, 0, sizeof(*esp)); + + if (x->aalg) { + struct xfrm_algo_desc *aalg_desc; + + esp->auth.key = x->aalg->alg_key; + esp->auth.key_len = (x->aalg->alg_key_len+7)/8; + esp->auth.tfm = crypto_alloc_tfm(x->aalg->alg_name, 0); + if (esp->auth.tfm == NULL) + goto error; + esp->auth.icv = esp_hmac_digest; + + aalg_desc = xfrm_aalg_get_byname(x->aalg->alg_name); + BUG_ON(!aalg_desc); + + if (aalg_desc->uinfo.auth.icv_fullbits/8 != +
crypto_tfm_alg_digestsize(esp->auth.tfm)) { + printk(KERN_INFO "ESP: %s digestsize %u != %hu\n", + x->aalg->alg_name, + crypto_tfm_alg_digestsize(esp->auth.tfm), + aalg_desc->uinfo.auth.icv_fullbits/8); + goto error; + } + + esp->auth.icv_full_len = aalg_desc->uinfo.auth.icv_fullbits/8; + esp->auth.icv_trunc_len = aalg_desc->uinfo.auth.icv_truncbits/8; + + esp->auth.work_icv = kmalloc(esp->auth.icv_full_len, GFP_KERNEL); + if (!esp->auth.work_icv) + goto error; + } + esp->conf.key = x->ealg->alg_key; + esp->conf.key_len = (x->ealg->alg_key_len+7)/8; + esp->conf.tfm = crypto_alloc_tfm(x->ealg->alg_name, CRYPTO_TFM_MODE_CBC); + if (esp->conf.tfm == NULL) + goto error; + esp->conf.ivlen = crypto_tfm_alg_ivsize(esp->conf.tfm); + esp->conf.padlen = 0; + if (esp->conf.ivlen) { + esp->conf.ivec = kmalloc(esp->conf.ivlen, GFP_KERNEL); + get_random_bytes(esp->conf.ivec, esp->conf.ivlen); + } + crypto_cipher_setkey(esp->conf.tfm, esp->conf.key, esp->conf.key_len); + x->props.header_len = 8 + esp->conf.ivlen; + if (x->props.mode) + x->props.header_len += 40; /* XXX ext hdr */ + x->data = esp; + return 0; + +error: + if (esp) { + if (esp->auth.tfm) + crypto_free_tfm(esp->auth.tfm); + if (esp->auth.work_icv) + kfree(esp->auth.work_icv); + if (esp->conf.tfm) + crypto_free_tfm(esp->conf.tfm); + kfree(esp); + } + return -EINVAL; +} + +static struct xfrm_type esp6_type = +{ + .description = "ESP6", + .proto = IPPROTO_ESP, + .init_state = esp6_init_state, + .destructor = esp6_destroy, + .get_max_size = esp6_get_max_size, + .input = esp6_input, + .output = esp6_output +}; + +static struct inet6_protocol esp6_protocol = { + .handler = xfrm6_rcv, + .err_handler = esp6_err, +}; + +int __init esp6_init(void) +{ + SET_MODULE_OWNER(&esp6_type); + if (xfrm6_register_type(&esp6_type) < 0) { + printk(KERN_INFO "ipv6 esp init: can't add xfrm type\n"); + return -EAGAIN; + } + if (inet6_add_protocol(&esp6_protocol, IPPROTO_ESP) < 0) { + printk(KERN_INFO "ipv6 esp init: can't add protocol\n"); + 
xfrm6_unregister_type(&esp6_type); + return -EAGAIN; + } + + return 0; +} + +static void __exit esp6_fini(void) +{ + if (inet6_del_protocol(&esp6_protocol, IPPROTO_ESP) < 0) + printk(KERN_INFO "ipv6 esp close: can't remove protocol\n"); + if (xfrm6_unregister_type(&esp6_type) < 0) + printk(KERN_INFO "ipv6 esp close: can't remove xfrm type\n"); +} + +module_init(esp6_init); +module_exit(esp6_fini); + +MODULE_LICENSE("GPL"); diff -ruN -x CVS linux-2.5.63/net/ipv6/ip6_input.c linux25/net/ipv6/ip6_input.c --- linux-2.5.63/net/ipv6/ip6_input.c 2003-02-25 04:05:39.000000000 +0900 +++ linux25/net/ipv6/ip6_input.c 2003-03-04 20:38:16.000000000 +0900 @@ -150,7 +150,8 @@ It would be stupid to detect for optional headers, which are missing with probability of 200% */ - if (nexthdr != IPPROTO_TCP && nexthdr != IPPROTO_UDP) { + if (nexthdr != IPPROTO_TCP && nexthdr != IPPROTO_UDP && + nexthdr != NEXTHDR_AUTH && nexthdr != NEXTHDR_ESP) { nhoff = ipv6_parse_exthdrs(&skb, nhoff); if (nhoff < 0) return 0; diff -ruN -x CVS linux-2.5.63/net/ipv6/ip6_output.c linux25/net/ipv6/ip6_output.c --- linux-2.5.63/net/ipv6/ip6_output.c 2003-02-25 04:05:06.000000000 +0900 +++ linux25/net/ipv6/ip6_output.c 2003-03-04 20:38:16.000000000 +0900 @@ -192,6 +192,11 @@ int seg_len = skb->len; int hlimit; u32 mtu; + int err = 0; + + if ((err = xfrm_lookup(&skb->dst, fl, sk, 0)) < 0) { + return err; + } if (opt) { int head_room; @@ -576,6 +581,13 @@ } pktlength = length; + if (dst) { + if ((err = xfrm_lookup(&dst, fl, sk, 0)) < 0) { + dst_release(dst); + return -ENETUNREACH; + } + } + if (hlimit < 0) { if (ipv6_addr_is_multicast(fl->fl6_dst)) hlimit = np->mcast_hops; @@ -630,10 +642,8 @@ err = 0; if (flags&MSG_PROBE) goto out; - - skb = sock_alloc_send_skb(sk, pktlength + 15 + - dev->hard_header_len, - flags & MSG_DONTWAIT, &err); + /* alloc skb with mtu as we do in the IPv4 stack for IPsec */ + skb = sock_alloc_send_skb(sk, mtu, flags & MSG_DONTWAIT, &err); if (skb == NULL) { 
IP6_INC_STATS(Ip6OutDiscards); @@ -663,6 +673,8 @@ err = getfrag(data, &hdr->saddr, ((char *) hdr) + (pktlength - length), 0, length); + if (!opt || !opt->dst1opt) + skb->h.raw = ((char *) hdr) + (pktlength - length); if (!err) { IP6_INC_STATS(Ip6OutRequests); diff -ruN -x CVS linux-2.5.63/net/ipv6/ndisc.c linux25/net/ipv6/ndisc.c --- linux-2.5.63/net/ipv6/ndisc.c 2003-02-25 04:05:34.000000000 +0900 +++ linux25/net/ipv6/ndisc.c 2003-03-05 11:30:41.000000000 +0900 @@ -71,6 +71,7 @@ #include #include +#include #include #include @@ -335,8 +336,6 @@ unsigned char ha[MAX_ADDR_LEN]; unsigned char *h_dest = NULL; - skb_reserve(skb, (dev->hard_header_len + 15) & ~15); - if (dev->hard_header) { if (ipv6_addr_type(daddr) & IPV6_ADDR_MULTICAST) { ndisc_mc_map(daddr, ha, dev, 1); @@ -373,10 +372,50 @@ * Send a Neighbour Advertisement */ +int ndisc_output(struct sk_buff *skb) +{ + if (skb) { + struct neighbour *neigh = (skb->dst ? skb->dst->neighbour : NULL); + if (ndisc_build_ll_hdr(skb, skb->dev, &skb->nh.ipv6h->daddr, neigh, skb->len) == 0) { + kfree_skb(skb); + return -EINVAL; + } + dev_queue_xmit(skb); + return 0; + } + return -EINVAL; +} + +static inline void ndisc_rt_init(struct rt6_info *rt, struct net_device *dev, + struct neighbour *neigh) +{ + rt->rt6i_dev = dev; + rt->rt6i_nexthop = neigh; + rt->rt6i_expires = 0; + rt->rt6i_flags = RTF_LOCAL; + rt->rt6i_metric = 0; + rt->rt6i_hoplimit = 255; + rt->u.dst.output = ndisc_output; +} + +static inline void ndisc_flow_init(struct flowi *fl, u8 type, + struct in6_addr *saddr, struct in6_addr *daddr) +{ + memset(fl, 0, sizeof(*fl)); + fl->fl6_src = saddr; + fl->fl6_dst = daddr; + fl->proto = IPPROTO_ICMPV6; + fl->uli_u.icmpt.type = type; + fl->uli_u.icmpt.code = 0; +} + static void ndisc_send_na(struct net_device *dev, struct neighbour *neigh, struct in6_addr *daddr, struct in6_addr *solicited_addr, - int router, int solicited, int override, int inc_opt) + int router, int solicited, int override, int inc_opt) { + struct 
flowi fl; + struct rt6_info *rt = NULL; + struct dst_entry* dst; struct sock *sk = ndisc_socket->sk; struct nd_msg *msg; int len; @@ -385,6 +424,22 @@ len = sizeof(struct icmp6hdr) + sizeof(struct in6_addr); + rt = ndisc_get_dummy_rt(); + if (!rt) + return; + + ndisc_flow_init(&fl, NDISC_NEIGHBOUR_ADVERTISEMENT, solicited_addr, daddr); + ndisc_rt_init(rt, dev, neigh); + + dst = (struct dst_entry*)rt; + dst_clone(dst); + + err = xfrm_lookup(&dst, &fl, NULL, 0); + if (err < 0) { + dst_release(dst); + return; + } + if (inc_opt) { if (dev->addr_len) len += NDISC_OPT_SPACE(dev->addr_len); @@ -400,14 +455,10 @@ return; } - if (ndisc_build_ll_hdr(skb, dev, daddr, neigh, len) == 0) { - kfree_skb(skb); - return; - } - + skb_reserve(skb, (dev->hard_header_len + 15) & ~15); ip6_nd_hdr(sk, skb, dev, solicited_addr, daddr, IPPROTO_ICMPV6, len); - msg = (struct nd_msg *) skb_put(skb, len); + skb->h.raw = (unsigned char*) msg = (struct nd_msg *) skb_put(skb, len); msg->icmph.icmp6_type = NDISC_NEIGHBOUR_ADVERTISEMENT; msg->icmph.icmp6_code = 0; @@ -430,7 +481,9 @@ csum_partial((__u8 *) msg, len, 0)); - dev_queue_xmit(skb); + dst_clone(dst); + skb->dst = dst; + dst_output(skb); ICMP6_INC_STATS(Icmp6OutNeighborAdvertisements); ICMP6_INC_STATS(Icmp6OutMsgs); @@ -440,6 +493,9 @@ struct in6_addr *solicit, struct in6_addr *daddr, struct in6_addr *saddr) { + struct flowi fl; + struct rt6_info *rt = NULL; + struct dst_entry* dst; struct sock *sk = ndisc_socket->sk; struct sk_buff *skb; struct nd_msg *msg; @@ -454,6 +510,22 @@ saddr = &addr_buf; } + rt = ndisc_get_dummy_rt(); + if (!rt) + return; + + ndisc_flow_init(&fl, NDISC_NEIGHBOUR_SOLICITATION, saddr, daddr); + ndisc_rt_init(rt, dev, neigh); + + dst = (struct dst_entry*)rt; + dst_clone(dst); + + err = xfrm_lookup(&dst, &fl, NULL, 0); + if (err < 0) { + dst_release(dst); + return; + } + len = sizeof(struct icmp6hdr) + sizeof(struct in6_addr); send_llinfo = dev->addr_len && ipv6_addr_type(saddr) != IPV6_ADDR_ANY; if (send_llinfo) @@ 
-466,14 +538,10 @@ return; } - if (ndisc_build_ll_hdr(skb, dev, daddr, neigh, len) == 0) { - kfree_skb(skb); - return; - } - + skb_reserve(skb, (dev->hard_header_len + 15) & ~15); ip6_nd_hdr(sk, skb, dev, saddr, daddr, IPPROTO_ICMPV6, len); - msg = (struct nd_msg *)skb_put(skb, len); + skb->h.raw = (unsigned char*) msg = (struct nd_msg *)skb_put(skb, len); msg->icmph.icmp6_type = NDISC_NEIGHBOUR_SOLICITATION; msg->icmph.icmp6_code = 0; msg->icmph.icmp6_cksum = 0; @@ -492,7 +560,9 @@ csum_partial((__u8 *) msg, len, 0)); /* send it! */ - dev_queue_xmit(skb); + dst_clone(dst); + skb->dst = dst; + dst_output(skb); ICMP6_INC_STATS(Icmp6OutNeighborSolicits); ICMP6_INC_STATS(Icmp6OutMsgs); @@ -501,6 +571,9 @@ void ndisc_send_rs(struct net_device *dev, struct in6_addr *saddr, struct in6_addr *daddr) { + struct flowi fl; + struct rt6_info *rt = NULL; + struct dst_entry* dst; struct sock *sk = ndisc_socket->sk; struct sk_buff *skb; struct icmp6hdr *hdr; @@ -508,6 +581,22 @@ int len; int err; + rt = ndisc_get_dummy_rt(); + if (!rt) + return; + + ndisc_flow_init(&fl, NDISC_ROUTER_SOLICITATION, saddr, daddr); + ndisc_rt_init(rt, dev, NULL); + + dst = (struct dst_entry*)rt; + dst_clone(dst); + + err = xfrm_lookup(&dst, &fl, NULL, 0); + if (err < 0) { + dst_release(dst); + return; + } + len = sizeof(struct icmp6hdr); if (dev->addr_len) len += NDISC_OPT_SPACE(dev->addr_len); @@ -519,14 +608,10 @@ return; } - if (ndisc_build_ll_hdr(skb, dev, daddr, NULL, len) == 0) { - kfree_skb(skb); - return; - } - + skb_reserve(skb, (dev->hard_header_len + 15) & ~15); ip6_nd_hdr(sk, skb, dev, saddr, daddr, IPPROTO_ICMPV6, len); - hdr = (struct icmp6hdr *) skb_put(skb, len); + skb->h.raw = (unsigned char*) hdr = (struct icmp6hdr *) skb_put(skb, len); hdr->icmp6_type = NDISC_ROUTER_SOLICITATION; hdr->icmp6_code = 0; hdr->icmp6_cksum = 0; @@ -543,7 +628,9 @@ csum_partial((__u8 *) hdr, len, 0)); /* send it! 
*/ - dev_queue_xmit(skb); + dst_clone(dst); + skb->dst = dst; + dst_output(skb); ICMP6_INC_STATS(Icmp6OutRouterSolicits); ICMP6_INC_STATS(Icmp6OutMsgs); @@ -1125,6 +1212,8 @@ struct in6_addr *addrp; struct net_device *dev; struct rt6_info *rt; + struct dst_entry *dst; + struct flowi fl; u8 *opt; int rd_len; int err; @@ -1136,6 +1225,22 @@ if (rt == NULL) return; + dst = (struct dst_entry*)rt; + + if (ipv6_get_lladdr(dev, &saddr_buf)) { + ND_PRINTK1("redirect: no link_local addr for dev\n"); + return; + } + + ndisc_flow_init(&fl, NDISC_REDIRECT, &saddr_buf, &skb->nh.ipv6h->saddr); + + dst_clone(dst); + err = xfrm_lookup(&dst, &fl, NULL, 0); + if (err) { + dst_release(dst); + return; + } + if (rt->rt6i_flags & RTF_GATEWAY) { ND_PRINTK1("ndisc_send_redirect: not a neighbour\n"); dst_release(&rt->u.dst); @@ -1164,11 +1269,6 @@ rd_len &= ~0x7; len += rd_len; - if (ipv6_get_lladdr(dev, &saddr_buf)) { - ND_PRINTK1("redirect: no link_local addr for dev\n"); - return; - } - buff = sock_alloc_send_skb(sk, MAX_HEADER + len + dev->hard_header_len + 15, 0, &err); if (buff == NULL) { @@ -1178,15 +1278,11 @@ hlen = 0; - if (ndisc_build_ll_hdr(buff, dev, &skb->nh.ipv6h->saddr, NULL, len) == 0) { - kfree_skb(buff); - return; - } - + skb_reserve(skb, (dev->hard_header_len + 15) & ~15); ip6_nd_hdr(sk, buff, dev, &saddr_buf, &skb->nh.ipv6h->saddr, IPPROTO_ICMPV6, len); - icmph = (struct icmp6hdr *) skb_put(buff, len); + skb->h.raw = (unsigned char*) icmph = (struct icmp6hdr *) skb_put(buff, len); memset(icmph, 0, sizeof(struct icmp6hdr)); icmph->icmp6_type = NDISC_REDIRECT; @@ -1224,7 +1320,8 @@ len, IPPROTO_ICMPV6, csum_partial((u8 *) icmph, len, 0)); - dev_queue_xmit(buff); + skb->dst = dst; + dst_output(skb); ICMP6_INC_STATS(Icmp6OutRedirects); ICMP6_INC_STATS(Icmp6OutMsgs); diff -ruN -x CVS linux-2.5.63/net/ipv6/raw.c linux25/net/ipv6/raw.c --- linux-2.5.63/net/ipv6/raw.c 2003-02-25 04:05:16.000000000 +0900 +++ linux25/net/ipv6/raw.c 2003-03-04 20:38:16.000000000 +0900 @@ -45,6 
+45,7 @@ #include #include +#include struct sock *raw_v6_htable[RAWV6_HTABLE_SIZE]; rwlock_t raw_v6_lock = RW_LOCK_UNLOCKED; @@ -304,6 +305,11 @@ struct inet_opt *inet = inet_sk(sk); struct raw6_opt *raw_opt = raw6_sk(sk); + if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb)) { + kfree_skb(skb); + return NET_RX_DROP; + } + if (!raw_opt->checksum) skb->ip_summed = CHECKSUM_UNNECESSARY; diff -ruN -x CVS linux-2.5.63/net/ipv6/route.c linux25/net/ipv6/route.c --- linux-2.5.63/net/ipv6/route.c 2003-02-25 04:05:39.000000000 +0900 +++ linux25/net/ipv6/route.c 2003-03-05 11:30:41.000000000 +0900 @@ -49,6 +49,8 @@ #include #include #include +#include +#include #include @@ -128,6 +130,12 @@ rwlock_t rt6_lock = RW_LOCK_UNLOCKED; +/* Dummy rt for ndisc */ +struct rt6_info *ndisc_get_dummy_rt() +{ + return dst_alloc(&ip6_dst_ops); +} + /* * Route lookup. Any rt6_lock is implied. */ @@ -1809,6 +1817,14 @@ #endif +int xfrm6_dst_lookup(struct xfrm_dst **dst, struct flowi *fl) +{ + int err = 0; + *dst = (struct xfrm_dst*)ip6_route_output(NULL, fl); + if (!*dst) + err = -ENETUNREACH; + return err; +} void __init ip6_route_init(void) { @@ -1817,6 +1833,7 @@ 0, SLAB_HWCACHE_ALIGN, NULL, NULL); fib6_init(); + xfrm_dst_lookup_register(xfrm6_dst_lookup, AF_INET6); #ifdef CONFIG_PROC_FS proc_net_create("ipv6_route", 0, rt6_proc_info); proc_net_create("rt6_stats", 0, rt6_proc_stats); @@ -1830,7 +1847,7 @@ proc_net_remove("ipv6_route"); proc_net_remove("rt6_stats"); #endif - + xfrm_dst_lookup_unregister(AF_INET6); rt6_ifdown(NULL); fib6_gc_cleanup(); } diff -ruN -x CVS linux-2.5.63/net/ipv6/tcp_ipv6.c linux25/net/ipv6/tcp_ipv6.c --- linux-2.5.63/net/ipv6/tcp_ipv6.c 2003-02-25 04:05:33.000000000 +0900 +++ linux25/net/ipv6/tcp_ipv6.c 2003-03-04 20:38:16.000000000 +0900 @@ -50,6 +50,7 @@ #include #include #include +#include #include @@ -677,6 +678,9 @@ fl.nl_u.ip6_u.daddr = rt0->addr; } + if (!fl.fl6_src) + fl.fl6_src = &np->saddr; + dst = ip6_route_output(sk, &fl); if ((err = dst->error) != 0) 
{ @@ -1637,6 +1641,9 @@ if (sk_filter(sk, skb, 0)) goto discard_and_relse; + if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb)) + goto discard_it; + skb->dev = NULL; bh_lock_sock(sk); @@ -1652,6 +1659,9 @@ return ret; no_tcp_socket: + if (!xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb)) + goto discard_and_relse; + if (skb->len < (th->doff<<2) || tcp_checksum_complete(skb)) { bad_packet: TCP_INC_STATS_BH(TcpInErrs); @@ -1671,8 +1681,11 @@ discard_and_relse: sock_put(sk); goto discard_it; - + do_time_wait: + if (!xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb)) + goto discard_and_relse; + if (skb->len < (th->doff<<2) || tcp_checksum_complete(skb)) { TCP_INC_STATS_BH(TcpInErrs); sock_put(sk); diff -ruN -x CVS linux-2.5.63/net/ipv6/udp.c linux25/net/ipv6/udp.c --- linux-2.5.63/net/ipv6/udp.c 2003-02-25 04:05:40.000000000 +0900 +++ linux25/net/ipv6/udp.c 2003-03-04 20:38:16.000000000 +0900 @@ -50,6 +50,7 @@ #include #include +#include DEFINE_SNMP_STAT(struct udp_mib, udp_stats_in6); @@ -541,6 +542,11 @@ static inline int udpv6_queue_rcv_skb(struct sock * sk, struct sk_buff *skb) { + if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb)) { + kfree_skb(skb); + return -1; + } + #if defined(CONFIG_FILTER) if (sk->filter && skb->ip_summed != CHECKSUM_UNNECESSARY) { if ((unsigned short)csum_fold(skb_checksum(skb, 0, skb->len, skb->csum))) { @@ -646,6 +652,9 @@ if (!pskb_may_pull(skb, sizeof(struct udphdr))) goto short_packet; + if (!xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb)) + goto discard; + saddr = &skb->nh.ipv6h->saddr; daddr = &skb->nh.ipv6h->daddr; uh = skb->h.uh; diff -ruN -x CVS linux-2.5.63/net/key/af_key.c linux25/net/key/af_key.c --- linux-2.5.63/net/key/af_key.c 2003-02-25 04:05:13.000000000 +0900 +++ linux25/net/key/af_key.c 2003-03-04 20:38:16.000000000 +0900 @@ -550,8 +550,8 @@ switch (((struct sockaddr *)(addr + 1))->sa_family) { case AF_INET: - x = xfrm_state_lookup(((struct sockaddr_in *)(addr + 1))->sin_addr.s_addr, - sa->sadb_sa_spi, proto); + x = 
xfrm4_state_lookup(((struct sockaddr_in *)(addr + 1))->sin_addr.s_addr, + sa->sadb_sa_spi, proto); break; #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) case AF_INET6: @@ -1097,18 +1097,7 @@ min_spi = htonl(0x100); max_spi = htonl(0x0fffffff); } - switch (x->props.family) { - case AF_INET: - xfrm_alloc_spi(x, min_spi, max_spi); - break; -#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) - case AF_INET6: - xfrm6_alloc_spi(x, min_spi, max_spi); - break; -#endif - default: - break; - } + xfrm_alloc_spi(x, min_spi, max_spi); if (x->id.spi) resp_skb = pfkey_xfrm_state2msg(x, 0, 3); } diff -ruN -x CVS linux-2.5.63/net/netsyms.c linux25/net/netsyms.c --- linux-2.5.63/net/netsyms.c 2003-02-25 04:05:16.000000000 +0900 +++ linux25/net/netsyms.c 2003-03-04 20:38:15.000000000 +0900 @@ -296,11 +296,11 @@ EXPORT_SYMBOL(__xfrm_route_forward); EXPORT_SYMBOL(xfrm_state_alloc); EXPORT_SYMBOL(__xfrm_state_destroy); -EXPORT_SYMBOL(xfrm_state_find); +EXPORT_SYMBOL(xfrm4_state_find); EXPORT_SYMBOL(xfrm_state_insert); EXPORT_SYMBOL(xfrm_state_check_expire); EXPORT_SYMBOL(xfrm_state_check_space); -EXPORT_SYMBOL(xfrm_state_lookup); +EXPORT_SYMBOL(xfrm4_state_lookup); EXPORT_SYMBOL(xfrm_replay_check); EXPORT_SYMBOL(xfrm_replay_advance); EXPORT_SYMBOL(xfrm_check_selectors); @@ -324,13 +324,17 @@ EXPORT_SYMBOL(xfrm_policy_flush); EXPORT_SYMBOL(xfrm_policy_byid); EXPORT_SYMBOL(xfrm_policy_list); +EXPORT_SYMBOL(xfrm_dst_lookup_register); +EXPORT_SYMBOL(xfrm_dst_lookup_unregister); #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +EXPORT_SYMBOL(xfrm6_state_find); +EXPORT_SYMBOL(xfrm6_rcv); EXPORT_SYMBOL(xfrm6_state_lookup); EXPORT_SYMBOL(xfrm6_find_acq); -EXPORT_SYMBOL(xfrm6_alloc_spi); EXPORT_SYMBOL(xfrm6_register_type); EXPORT_SYMBOL(xfrm6_unregister_type); EXPORT_SYMBOL(xfrm6_get_type); +EXPORT_SYMBOL(xfrm6_clear_mutable_options); #endif EXPORT_SYMBOL_GPL(xfrm_probe_algs); @@ -342,6 +346,15 @@ EXPORT_SYMBOL_GPL(xfrm_ealg_get_byid); 
EXPORT_SYMBOL_GPL(xfrm_aalg_get_byname); EXPORT_SYMBOL_GPL(xfrm_ealg_get_byname); +#if defined(CONFIG_INET_AH) || defined(CONFIG_INET_AH_MODULE) || defined(CONFIG_INET6_AH) || defined(CONFIG_INET6_AH_MODULE) +EXPORT_SYMBOL_GPL(skb_ah_walk); +#endif +#if defined(CONFIG_INET_ESP) || defined(CONFIG_INET_ESP_MODULE) || defined(CONFIG_INET6_ESP) || defined(CONFIG_INET6_ESP_MODULE) +EXPORT_SYMBOL_GPL(skb_cow_data); +EXPORT_SYMBOL_GPL(pskb_put); +EXPORT_SYMBOL_GPL(skb_icv_walk); +EXPORT_SYMBOL_GPL(skb_to_sgvec); +#endif #if defined (CONFIG_IPV6_MODULE) || defined (CONFIG_IP_SCTP_MODULE) /* inet functions common to v4 and v6 */ From warlord@MIT.EDU Wed Mar 5 06:52:49 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 06:52:51 -0800 (PST) Received: from fort-point-station.mit.edu (FORT-POINT-STATION.MIT.EDU [18.7.7.76]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h25Eqmf18604 for ; Wed, 5 Mar 2003 06:52:48 -0800 Received: from grand-central-station.mit.edu (GRAND-CENTRAL-STATION.MIT.EDU [18.7.21.82]) by fort-point-station.mit.edu (8.9.2/8.9.2) with ESMTP id JAA01806; Wed, 5 Mar 2003 09:52:46 -0500 (EST) Received: from manawatu-mail-centre.mit.edu (MANAWATU-MAIL-CENTRE.MIT.EDU [18.7.7.71]) by grand-central-station.mit.edu (8.9.2/8.9.2) with ESMTP id JAA08319; Wed, 5 Mar 2003 09:52:45 -0500 (EST) Received: from kikki.mit.edu (KIKKI.MIT.EDU [18.18.1.142]) ) by manawatu-mail-centre.mit.edu (8.12.4/8.12.4) with ESMTP id h25Eqi6g020217; Wed, 5 Mar 2003 09:52:44 -0500 (EST) Received: (from warlord@localhost) by kikki.mit.edu (8.9.3) id JAA26295; Wed, 5 Mar 2003 09:52:44 -0500 (EST) To: bert hubert Cc: Andreas Jellinghaus , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: ipsec-tools 0.1 + kernel 2.5.64 References: <1046863752.441.7.camel@simulacron> <20030305112852.GA22351@outpost.ds9a.nl> From: Derek Atkins Date: 05 Mar 2003 09:52:44 -0500 In-Reply-To: <20030305112852.GA22351@outpost.ds9a.nl> Message-ID: User-Agent: Gnus/5.0808 (Gnus v5.8.8) 
Emacs/20.7 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 1863 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: warlord@MIT.EDU Precedence: bulk X-list: netdev Content-Length: 1291 Lines: 37 bert hubert writes: > On Wed, Mar 05, 2003 at 12:29:12PM +0100, Andreas Jellinghaus wrote: > > Hi, > > > > both manual keying and automatic keying with racoon (pre-shared secret) > > are working fine. No need to patch or modify anything. > > I tried only ipv4. > > By the way, regarding ipsec-tools 0.1, are you sure you want to fork the > projects involved? I spoke to the KAME people and unfortunately, at least for now, there is no other choice but to fork. Perhaps down the road we can merge, but as of last week they don't want to host a linux package. They are willing to take some of our patches, but that doesn't help with a build system. > By the way, you did not mention it here but ipsec-tools is available on > http://sourceforge.net/projects/ipsec-tools , I also link them from > http://lartc.org/howto/lartc.ipsec.html I didn't? Perhaps I said ipsec-tool.sourceforge.net which has a link to sourceforge.net/projects/ipsec-tools and is much shorter to type. 
;) > Regards, > > bert -derek -- Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory Member, MIT Student Information Processing Board (SIPB) URL: http://web.mit.edu/warlord/ PP-ASEL-IA N1NWH warlord@MIT.EDU PGP key available From davem@redhat.com Wed Mar 5 07:40:30 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 07:40:34 -0800 (PST) Received: from pizda.ninka.net (pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h25FeTf19745 for ; Wed, 5 Mar 2003 07:40:30 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id HAA16030; Wed, 5 Mar 2003 07:21:50 -0800 Date: Wed, 05 Mar 2003 07:21:49 -0800 (PST) Message-Id: <20030305.072149.121185037.davem@redhat.com> To: kazunori@miyazawa.org Cc: kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, usagi-core@linux-ipv6.org Subject: Re: [PATH] IPv6 IPsec support From: "David S. Miller" In-Reply-To: <20030305233025.784feb00.kazunori@miyazawa.org> References: <20030305233025.784feb00.kazunori@miyazawa.org> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1864 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 588 Lines: 21 From: Kazunori Miyazawa Date: Wed, 5 Mar 2003 23:30:25 +0900 Hello Miyazawa-san, I submit the patch to let the kernel support ipv6 ipsec again. It is able to compile ipv6 as a module. This patch includes a couple of clean-ups and changes of function names. Excellent work. I have comments, but they are very minor and can wait. I will apply your patch after basic build testing. The next large task will be to abstract out more common pieces of code.
There is still quite a bit of code duplication between v4 and v6 xfrm methods, Thank you! From yoshfuji@linux-ipv6.org Wed Mar 5 07:48:23 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 07:48:26 -0800 (PST) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.11.6/8.11.6) with SMTP id h25FmNf20276 for ; Wed, 5 Mar 2003 07:48:23 -0800 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h25FmMUl009623; Thu, 6 Mar 2003 00:48:22 +0900 Date: Thu, 06 Mar 2003 00:48:20 +0900 (JST) Message-Id: <20030306.004820.41101302.yoshfuji@linux-ipv6.org> To: davem@redhat.com Cc: kazunori@miyazawa.org, kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, usagi@linux-ipv6.org Subject: Re: (usagi-core 12294) Re: [PATCH] IPv6 IPsec support From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: <20030305.072149.121185037.davem@redhat.com> References: <20030305233025.784feb00.kazunori@miyazawa.org> <20030305.072149.121185037.davem@redhat.com> Organization: USAGI Project X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1865 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev Content-Length: 482 Lines: 14 In article <20030305.072149.121185037.davem@redhat.com> (at Wed, 05 Mar 2003 07:21:49 -0800 (PST)), "David S. 
Miller" says: > I will apply your patch after basic build testing. Thank you. > The next large task will be to abstract out more common > pieces of code. There is still quite a bit of code duplication > between v4 and v6 xfrm methods, Yes, we will do that. That patch is first step for reducing duplicate codes between IPv4 and IPv6. --yoshfuji From mcmanus@datapower.ducksong.com Wed Mar 5 13:00:54 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 13:00:57 -0800 (PST) Received: from datapower.ducksong.com (ip67-93-141-186.z141-93-67.customer.algx.net [67.93.141.186]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h25L0n40027680 for ; Wed, 5 Mar 2003 13:00:53 -0800 Received: (from mcmanus@localhost) by datapower.ducksong.com (8.11.6/8.11.6) id h25L0ms10895 for netdev@oss.sgi.com; Wed, 5 Mar 2003 16:00:48 -0500 Date: Wed, 5 Mar 2003 16:00:48 -0500 From: "Patrick R. McManus" To: netdev@oss.sgi.com Subject: SIOCETHTOOL ioctl() and a corrupted cmd argument Message-ID: <20030305210047.GA10824@ducksong.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4i X-archive-position: 1866 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mcmanus@ducksong.com Precedence: bulk X-list: netdev Content-Length: 1303 Lines: 45 Hello, this is odd. My problem is with the cmd argument to a driver's ioctl() handler getting modified when the caller is non root. I have a 2.4.19era kernel and am running the e1000 driver, as a module, from the 2.4.20 kernel. (drivers previous to 4.4.12 tended to keep resetting themselves on me.) 
my userspace code make a call that looks like this struct ethtool_cmd ec; int fd; int rv = -1; memset (&ifr,0,sizeof(ifr)); strncpy (ifr.ifr_name, getName(),IFNAMSIZ); fd = socket (PF_INET,SOCK_DGRAM,0); ifr.ifr_data = (char *) &ec; ec.cmd = ETHTOOL_GSET; fprintf (stderr,"SIOCETHTOOL is %X\n",SIOCETHTOOL); if (ioctl(fd, SIOCETHTOOL, &ifr) >=0) stderr always prints: SIOCETHTOOL is 8946 when I run the userspace code as root the ioctl succeeds, when I run it as an unpriv'd user it fails. So I annotated the driver by adding to e1000_ioctl: printk(KERN_INFO "general ioctl cmd %X, magic %X\n",cmd,SIOCETHTOOL); as root I get the expected Mar 5 15:53:33 mcmanus kernel: general ioctl cmd 8946, magic 8946 as a regular user I get Mar 5 15:46:57 mcmanus kernel: general ioctl cmd 89F0, magic 8946 can someone help me with the chain to look at for why the cmd value might be getting modified? -Patrick From aj@dungeon.inka.de Wed Mar 5 13:24:38 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 13:24:41 -0800 (PST) Received: from mail.inka.de (mail@quechua.inka.de [193.197.184.2]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h25LN540031655 for ; Wed, 5 Mar 2003 13:24:38 -0800 Received: from dungeon.inka.de (uucp@[127.0.0.1]) by mail.inka.de with uucp (rmailwrap 0.5) id 18qgLv-0007Je-00; Wed, 05 Mar 2003 22:23:03 +0100 Received: from [192.168.0.10] (unknown [192.168.0.10]) by dungeon.inka.de (Postfix) with ESMTP id 154B620E4F; Wed, 5 Mar 2003 22:22:59 +0100 (CET) Subject: Re: ipsec-tools 0.1 + kernel 2.5.64 From: Andreas Jellinghaus Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: <20030305112852.GA22351@outpost.ds9a.nl> References: <1046863752.441.7.camel@simulacron> <20030305112852.GA22351@outpost.ds9a.nl> Content-Type: text/plain Organization: Message-Id: <1046899726.440.0.camel@simulacron> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.1 Date: 05 Mar 2003 22:28:46 +0100 Content-Transfer-Encoding: 7bit X-archive-position: 1867 X-ecartis-version: 
Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: aj@dungeon.inka.de Precedence: bulk X-list: netdev Content-Length: 134 Lines: 7 it's working fine with win 2k pro (ipsec, 3des, sha, pre shared key). I will try to write something useful for the howto. Andreas From mcmanus@datapower.ducksong.com Wed Mar 5 13:32:07 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 13:32:09 -0800 (PST) Received: from datapower.ducksong.com (ip67-93-141-189.z141-93-67.customer.algx.net [67.93.141.189]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h25LW540032151 for ; Wed, 5 Mar 2003 13:32:06 -0800 Received: (from mcmanus@localhost) by datapower.ducksong.com (8.11.6/8.11.6) id h25LW5g02187 for netdev@oss.sgi.com; Wed, 5 Mar 2003 16:32:05 -0500 Date: Wed, 5 Mar 2003 16:32:05 -0500 From: "Patrick R. McManus" To: netdev@oss.sgi.com Subject: Re: SIOCETHTOOL ioctl() and a corrupted cmd argument Message-ID: <20030305213205.GA1227@ducksong.com> References: <20030305210047.GA10824@ducksong.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030305210047.GA10824@ducksong.com> User-Agent: Mutt/1.4i X-archive-position: 1868 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mcmanus@ducksong.com Precedence: bulk X-list: netdev Content-Length: 869 Lines: 24 [Patrick R. McManus: Mar 05 16:00] > as a regular user I get > Mar 5 15:46:57 mcmanus kernel: general ioctl cmd 89F0, magic 8946 > turns out, as I had expected, my report is bogus.. this ioctl is a fallback after the siocethtool fails. the driver do_ioctl() never gets invoked at all when the ioctl() is invoked without being root. this would be because in net/core/dev.c dev_ioctl() they are filtered out: case SIOCETHTOOL: case SIOCGMIIPHY: case SIOCGMIIREG: if (!capable(CAP_NET_ADMIN)) return -EPERM; but SIOCETHTOOL shouldn't need perms, right? 
it has some functionality that needs it and some that doesn't, and the driver sorts it out.. there isn't a GIOCETHTOOL at all.. #define ETHTOOL_GSET 0x00000001 /* Get settings. */ #define ETHTOOL_SSET 0x00000002 /* Set settings, privileged. */ From garzik@gtf.org Wed Mar 5 13:42:03 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 13:42:05 -0800 (PST) Received: from havoc.gtf.org (havoc.daloft.com [64.213.145.173]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h25Lg1Ru032677 for ; Wed, 5 Mar 2003 13:42:03 -0800 Received: by havoc.gtf.org (Postfix, from userid 500) id 02824663B; Wed, 5 Mar 2003 16:41:55 -0500 (EST) Date: Wed, 5 Mar 2003 16:41:55 -0500 From: Jeff Garzik To: "Patrick R. McManus" Cc: netdev@oss.sgi.com Subject: Re: SIOCETHTOOL ioctl() and a corrupted cmd argument Message-ID: <20030305214155.GM13420@gtf.org> References: <20030305210047.GA10824@ducksong.com> <20030305213205.GA1227@ducksong.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030305213205.GA1227@ducksong.com> User-Agent: Mutt/1.3.28i X-archive-position: 1869 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Content-Length: 750 Lines: 21 On Wed, Mar 05, 2003 at 04:32:05PM -0500, Patrick R. McManus wrote: > but SIOCETHTOOL shouldn't need perms, right? it has some functionality > that needs it and some that doesn't, and the driver sorts it > out.. there isn't a GIOCETHTOOL at all.. > > #define ETHTOOL_GSET 0x00000001 /* Get settings. */ > #define ETHTOOL_SSET 0x00000002 /* Set settings, privileged. */ You are correct that the comment is misleading... all ethtool commands currently require CAP_NET_ADMIN. This is one of the costs of lumping things under one ioctl, rather than constantly using new ioctls.
It is certainly possible (and reasonable) that a future kernel peeks at the ioctl and then conditionally checks privs, but this is not currently the case. Jeff From bunk@fs.tum.de Wed Mar 5 14:55:32 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 14:55:38 -0800 (PST) Received: from hermes.fachschaften.tu-muenchen.de (hermes.fachschaften.tu-muenchen.de [129.187.202.12]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h25Msoq9006687 for ; Wed, 5 Mar 2003 14:55:31 -0800 Received: (qmail 1254 invoked from network); 5 Mar 2003 22:54:44 -0000 Received: from mimas.fachschaften.tu-muenchen.de (129.187.202.58) by hermes.fachschaften.tu-muenchen.de with QMQP; 5 Mar 2003 22:54:44 -0000 Date: Wed, 5 Mar 2003 23:54:41 +0100 From: Adrian Bunk To: davem@redhat.com, netdev@oss.sgi.com Cc: linux-kernel@vger.kernel.org Subject: Chaotic structure of the net headers? Message-ID: <20030305225441.GO20423@fs.tum.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4i X-archive-position: 1870 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bunk@fs.tum.de Precedence: bulk X-list: netdev Content-Length: 2072 Lines: 60 Hi, if all I'm describing is completely logical and I'm only too dumb to see the logic please forgive me. ;-) In 2.5.64 there are networking headers both under include/linux/ and include/net/. I don't understand whether there's a deeper logic why e.g. the netfilter headers are under include/linux/. There's some duplication, e.g. 
include/linux/in6.h contains <-- snip --> /* * IPV6 extension headers */ #define IPPROTO_HOPOPTS 0 /* IPv6 hop-by-hop options */ #define IPPROTO_ROUTING 43 /* IPv6 routing header */ #define IPPROTO_FRAGMENT 44 /* IPv6 fragmentation header */ #define IPPROTO_ICMPV6 58 /* ICMPv6 */ #define IPPROTO_NONE 59 /* IPv6 no next header */ #define IPPROTO_DSTOPTS 60 /* IPv6 destination options */ <-- snip --> and include/net/ipv6.h contains: <-- snip --> /* * NextHeader field of IPv6 header */ #define NEXTHDR_HOP 0 /* Hop-by-hop option header. */ #define NEXTHDR_TCP 6 /* TCP segment. */ #define NEXTHDR_UDP 17 /* UDP message. */ #define NEXTHDR_IPV6 41 /* IPv6 in IPv6 */ #define NEXTHDR_ROUTING 43 /* Routing header. */ #define NEXTHDR_FRAGMENT 44 /* Fragmentation/reassembly header. */ #define NEXTHDR_ESP 50 /* Encapsulating security payload. */ #define NEXTHDR_AUTH 51 /* Authentication header. */ #define NEXTHDR_ICMP 58 /* ICMP for IPv6. */ #define NEXTHDR_NONE 59 /* No next header */ #define NEXTHDR_DEST 60 /* Destination options header. */ <-- snip --> Two different #define's for the same thing doesn't sound like a good idea? cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. 
Buck - Dragon Seed From davem@redhat.com Wed Mar 5 14:58:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 14:58:58 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h25Mwtq9007077 for ; Wed, 5 Mar 2003 14:58:56 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id OAA17413; Wed, 5 Mar 2003 14:39:51 -0800 Date: Wed, 05 Mar 2003 14:39:51 -0800 (PST) Message-Id: <20030305.143951.118510613.davem@redhat.com> To: bunk@fs.tum.de Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: Chaotic structure of the net headers? From: "David S. Miller" In-Reply-To: <20030305225441.GO20423@fs.tum.de> References: <20030305225441.GO20423@fs.tum.de> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1871 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 222 Lines: 8 From: Adrian Bunk Date: Wed, 5 Mar 2003 23:54:41 +0100 Two different #define's for the same thing doesn't sound like a good idea? Required by the ipv6 advanced sockets API I do believe. 
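For reference, the overlap discussed in this thread can be checked mechanically. The following is only a minimal sketch, assuming a userspace build against glibc's <netinet/in.h> on Linux; the NEXTHDR_* values are copied by hand from the list quoted above (in the kernel they live in include/net/ipv6.h), and nothing here is taken from an actual patch.

```c
#include <netinet/in.h>

/*
 * Hand-copied from the NEXTHDR_* list quoted in this thread.
 * Assumption: these mirror include/net/ipv6.h as of 2.5.64.
 */
#define NEXTHDR_HOP        0   /* Hop-by-hop option header */
#define NEXTHDR_TCP        6   /* TCP segment */
#define NEXTHDR_UDP       17   /* UDP message */
#define NEXTHDR_IPV6      41   /* IPv6 in IPv6 */
#define NEXTHDR_ROUTING   43   /* Routing header */
#define NEXTHDR_FRAGMENT  44   /* Fragmentation/reassembly header */
#define NEXTHDR_ESP       50   /* Encapsulating security payload */
#define NEXTHDR_AUTH      51   /* Authentication header */
#define NEXTHDR_ICMP      58   /* ICMP for IPv6 */
#define NEXTHDR_NONE      59   /* No next header */
#define NEXTHDR_DEST      60   /* Destination options header */

/* Compile-time checks: every NEXTHDR_* value has an IPPROTO_* twin. */
_Static_assert(NEXTHDR_HOP      == IPPROTO_HOPOPTS,  "hop-by-hop");
_Static_assert(NEXTHDR_TCP      == IPPROTO_TCP,      "tcp");
_Static_assert(NEXTHDR_UDP      == IPPROTO_UDP,      "udp");
_Static_assert(NEXTHDR_IPV6     == IPPROTO_IPV6,     "ipv6-in-ipv6");
_Static_assert(NEXTHDR_ROUTING  == IPPROTO_ROUTING,  "routing");
_Static_assert(NEXTHDR_FRAGMENT == IPPROTO_FRAGMENT, "fragment");
_Static_assert(NEXTHDR_ESP      == IPPROTO_ESP,      "esp");
_Static_assert(NEXTHDR_AUTH     == IPPROTO_AH,       "ah");
_Static_assert(NEXTHDR_ICMP     == IPPROTO_ICMPV6,   "icmpv6");
_Static_assert(NEXTHDR_NONE     == IPPROTO_NONE,     "no next header");
_Static_assert(NEXTHDR_DEST     == IPPROTO_DSTOPTS,  "destination options");
```

If the file compiles, the two sets of names agree value for value, which is the duplication being asked about. Whether both sets are needed is a separate question that this check does not answer.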
From Rod.VanMeter@nokia.com Wed Mar 5 15:18:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 15:18:58 -0800 (PST) Received: from mailhost.iprg.nokia.com (mailhost.iprg.nokia.com [205.226.5.12]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h25NIsq9008581 for ; Wed, 5 Mar 2003 15:18:55 -0800 Received: from darkstar.iprg.nokia.com (darkstar.iprg.nokia.com [205.226.5.69]) by mailhost.iprg.nokia.com (8.9.3/8.9.3-GLGS) with ESMTP id PAA15361; Wed, 5 Mar 2003 15:18:42 -0800 (PST) Received: (from root@localhost) by darkstar.iprg.nokia.com (8.11.0/8.11.0-DARKSTAR) id h25NIf430270; Wed, 5 Mar 2003 15:18:41 -0800 X-mProtect: <200303052318> Nokia Silicon Valley Messaging Protection Received: from UNKNOWN (172.19.68.126, claiming to be "dadhcp-172019068126.americas.nokia.com") by darkstar.iprg.nokia.com smtpd7QgrwF; Wed, 05 Mar 2003 15:18:39 PST Subject: Re: Chaotic structure of the net headers? From: Rod Van Meter Reply-To: Rod.VanMeter@nokia.com To: ext Adrian Bunk Cc: davem@redhat.com, netdev@oss.sgi.com, linux-kernel@vger.kernel.org In-Reply-To: <20030305225441.GO20423@fs.tum.de> References: <20030305225441.GO20423@fs.tum.de> Content-Type: text/plain Organization: Nokia Networks Message-Id: <1046905834.17778.400.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 05 Mar 2003 15:10:35 -0800 Content-Transfer-Encoding: 7bit X-archive-position: 1873 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Rod.VanMeter@nokia.com Precedence: bulk X-list: netdev On Wed, 2003-03-05 at 14:54, ext Adrian Bunk wrote: > > There's some duplication, e.g. 
include/linux/in6.h contains > > /* > * IPV6 extension headers > */ > #define IPPROTO_HOPOPTS 0 /* IPv6 hop-by-hop options */ > #define IPPROTO_ROUTING 43 /* IPv6 routing header */ > #define IPPROTO_FRAGMENT 44 /* IPv6 fragmentation header */ > #define IPPROTO_ICMPV6 58 /* ICMPv6 */ > #define IPPROTO_NONE 59 /* IPv6 no next header */ > #define IPPROTO_DSTOPTS 60 /* IPv6 destination options */ According to RFC2292 (Advanced Sockets): 2.1.1. IPv6 Next Header Values IPv6 defines many new values for the Next Header field. The following constants are defined as a result of including . #define IPPROTO_HOPOPTS 0 /* IPv6 Hop-by-Hop options */ #define IPPROTO_IPV6 41 /* IPv6 header */ #define IPPROTO_ROUTING 43 /* IPv6 Routing header */ #define IPPROTO_FRAGMENT 44 /* IPv6 fragmentation header */ #define IPPROTO_ESP 50 /* encapsulating security payload */ #define IPPROTO_AH 51 /* authentication header */ #define IPPROTO_ICMPV6 58 /* ICMPv6 */ #define IPPROTO_NONE 59 /* IPv6 no next header */ #define IPPROTO_DSTOPTS 60 /* IPv6 Destination options */ Berkeley-derived IPv4 implementations also define IPPROTO_IP to be 0. This should not be a problem since IPPROTO_IP is used only with IPv4 sockets and IPPROTO_HOPOPTS only with IPv6 sockets. > > and include/net/ipv6.h contains: > > <-- snip --> > > /* > * NextHeader field of IPv6 header > */ > > #define NEXTHDR_HOP 0 /* Hop-by-hop option header. */ > #define NEXTHDR_TCP 6 /* TCP segment. */ > #define NEXTHDR_UDP 17 /* UDP message. */ > #define NEXTHDR_IPV6 41 /* IPv6 in IPv6 */ > #define NEXTHDR_ROUTING 43 /* Routing header. */ > #define NEXTHDR_FRAGMENT 44 /* Fragmentation/reassembly header. */ This form doesn't appear in RFC2292, nor in 2133 (Basic Socket...) My interpretation is that this latter form is defined for kernel use, while the former is for user-level manipulation of raw packet fields (the primary purpose of 2292). Does it make sense to have two forms, one kernel, one user? I haven't e.g. 
followed the desired include chain. If we wanted to merge the uses, the former form and include location would probably have to be used. I've been looking into this. There are a *few* things missing from the 2292 support. AFAICT, it's just a handful of functions/macros for manipulating option headers that need to be added. Does anybody actually USE this stuff (the advanced sockets API, I mean, not IPv6)? I'm planning to add those missing bits, just for kicks, but haven't done it yet. --Rod From davem@redhat.com Wed Mar 5 15:23:31 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 15:23:33 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h25NMpq9008942 for ; Wed, 5 Mar 2003 15:23:31 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id PAA17467; Wed, 5 Mar 2003 15:03:44 -0800 Date: Wed, 05 Mar 2003 15:03:44 -0800 (PST) Message-Id: <20030305.150344.50145701.davem@redhat.com> To: Rod.VanMeter@nokia.com Cc: bunk@fs.tum.de, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: Chaotic structure of the net headers? From: "David S. Miller" In-Reply-To: <1046905834.17778.400.camel@localhost.localdomain> References: <20030305225441.GO20423@fs.tum.de> <1046905834.17778.400.camel@localhost.localdomain> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1874 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev From: Rod Van Meter Date: 05 Mar 2003 15:10:35 -0800 Does it make sense to have two forms, one kernel, one user? I haven't e.g. followed the desired include chain. 
If we wanted to merge the uses, the former form and include location would probably have to be used. I've been looking into this. There are a *few* things missing from the 2292 support. AFAICT, it's just a handful of functions/macros for manipulating option headers that need to be added. Actually, forget all my comments; the GLIBC headers are where the advanced socket API requirements for headers should be applied. And since this is only used in the kernel, there is no need for the NEXTHDR_* if it truly just duplicates the IPPROTO_* defines. I'm willing to accept a cleanup patch of this nature, sure. From davem@redhat.com Wed Mar 5 15:44:12 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 15:44:15 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h25NiBq9009621 for ; Wed, 5 Mar 2003 15:44:12 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id PAA17543; Wed, 5 Mar 2003 15:25:31 -0800 Date: Wed, 05 Mar 2003 15:25:30 -0800 (PST) Message-Id: <20030305.152530.70806720.davem@redhat.com> To: kazunori@miyazawa.org Cc: kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, usagi-core@linux-ipv6.org Subject: Re: [PATH] IPv6 IPsec support From: "David S. Miller" In-Reply-To: <20030305233025.784feb00.kazunori@miyazawa.org> References: <20030305233025.784feb00.kazunori@miyazawa.org> X-FalunGong: Information control.
X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1875 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev From: Kazunori Miyazawa Date: Wed, 5 Mar 2003 23:30:25 +0900 Hello Miyazawa-san, I submit the patch to let the kernel support ipv6 ipsec again. It is able to comple ipv6 as module. As promised I applied the patch. I will push it to Linus later this evening, or tomorrow. In this initial checkin I made only 2 minor fixes, they are attached below: --- ./include/net/ip6_route.h.~1~ Wed Mar 5 15:32:41 2003 +++ ./include/net/ip6_route.h Wed Mar 5 15:40:42 2003 @@ -38,7 +38,6 @@ extern int ipv6_route_ioctl(unsigned int cmd, void *arg); extern int ip6_route_add(struct in6_rtmsg *rtmsg); -extern int ip6_route_del(struct in6_rtmsg *rtmsg); extern int ip6_del_rt(struct rt6_info *); extern int ip6_rt_addr_add(struct in6_addr *addr, --- ./net/ipv6/Kconfig.~1~ Wed Mar 5 15:32:41 2003 +++ ./net/ipv6/Kconfig Wed Mar 5 15:35:27 2003 @@ -19,6 +19,7 @@ config INET6_AH tristate "IPv6: AH transformation" + depends on IPV6 ---help--- Support for IPsec AH. @@ -26,6 +27,7 @@ config INET6_ESP tristate "IPv6: ESP transformation" + depends on IPV6 ---help--- Support for IPsec ESP. 
From davem@redhat.com Wed Mar 5 15:59:48 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 15:59:52 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h25Nxlq9010071 for ; Wed, 5 Mar 2003 15:59:48 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id PAA17576; Wed, 5 Mar 2003 15:41:01 -0800 Date: Wed, 05 Mar 2003 15:41:00 -0800 (PST) Message-Id: <20030305.154100.28816301.davem@redhat.com> To: yoshfuji@linux-ipv6.org Cc: kazunori@miyazawa.org, kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, usagi@linux-ipv6.org Subject: Re: (usagi-core 12294) Re: [PATCH] IPv6 IPsec support From: "David S. Miller" In-Reply-To: <20030306.004820.41101302.yoshfuji@linux-ipv6.org> References: <20030305233025.784feb00.kazunori@miyazawa.org> <20030305.072149.121185037.davem@redhat.com> <20030306.004820.41101302.yoshfuji@linux-ipv6.org> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=iso-2022-jp Content-Transfer-Encoding: 7bit X-archive-position: 1876 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev From: YOSHIFUJI Hideaki / 吉藤英明 Date: Thu, 06 Mar 2003 00:48:20 +0900 (JST) > The next large task will be to abstract out more common > pieces of code. There is still quite a bit of code duplication > between v4 and v6 xfrm methods, Yes, we will do that. That patch is the first step toward reducing duplicate code between IPv4 and IPv6. Great. I believe it should be possible, in the end, to make the XFRM engine 100% address-family (v4, v6 etc.) and protocol (ah, esp) independent.
If that goal is achieved, we may move generic parts from net/ipv4/xfrm_*.c to net/xfrm_*.c. Note that this coincides with the idea to eventually have an address-family independent flow cache. Most of the address-family specific areas are: 1) DST lookup (xfrm_dst_lookup_t) 2) selector key comparisons and state lookup (xfrm$(AF)_selector_match, xfrm$(AF)_state_find) 3) receive processing (xfrm$(AF)_rcv) #1 is made for ipv6 by Miyazawa-san's patch. This could logically be extended to handle issues #2 and #3 above. All protocol specific (ESP, AH) and address-family specific references should go away from places like include/net/xfrm.h. I think you understand all of this, and therefore I cannot wait for the next ipsec cleanup patch from you :) Finally, note that eventually we will need some reference counting scheme to allow xfrm address-family modules to be unloaded safely. Currently, ipv4 cannot be a module and ipv6 as a module is not able to unload :-) So the module unload problem does not exist right at this moment. So ignore this issue for now. From rgb@conscoop.ottawa.on.ca Wed Mar 5 16:06:34 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 16:06:37 -0800 (PST) Received: from conscoop.ottawa.on.ca (cpu2747.adsl.bellglobal.com [207.236.55.216]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2605qq9010510 for ; Wed, 5 Mar 2003 16:06:34 -0800 Received: (from rgb@localhost) by conscoop.ottawa.on.ca (8.12.0.Beta5/8.11.6) id h25NrKZG008496; Wed, 5 Mar 2003 18:53:20 -0500 Date: Wed, 5 Mar 2003 18:53:20 -0500 From: Richard Guy Briggs To: Rod Van Meter Cc: ext Adrian Bunk , davem@redhat.com, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: Chaotic structure of the net headers?
Message-ID: <20030305185320.I4305@grendel.conscoop.ottawa.on.ca> References: <20030305225441.GO20423@fs.tum.de> <1046905834.17778.400.camel@localhost.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <1046905834.17778.400.camel@localhost.localdomain>; from Rod.VanMeter@nokia.com on Wed, Mar 05, 2003 at 03:10:35PM -0800 X-archive-position: 1877 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: rgb@conscoop.ottawa.on.ca Precedence: bulk X-list: netdev On Wed, Mar 05, 2003 at 03:10:35PM -0800, Rod Van Meter wrote: > On Wed, 2003-03-05 at 14:54, ext Adrian Bunk wrote: > > There's some duplication, e.g. include/linux/in6.h contains > > > > /* > > * IPV6 extension headers > > */ > > #define IPPROTO_HOPOPTS 0 /* IPv6 hop-by-hop options */ > > #define IPPROTO_ROUTING 43 /* IPv6 routing header */ > > #define IPPROTO_FRAGMENT 44 /* IPv6 fragmentation header */ > > #define IPPROTO_ICMPV6 58 /* ICMPv6 */ > > #define IPPROTO_NONE 59 /* IPv6 no next header */ > > #define IPPROTO_DSTOPTS 60 /* IPv6 destination options */ > > > According to RFC2292 (Advanced Sockets): > > 2.1.1. IPv6 Next Header Values > > IPv6 defines many new values for the Next Header field. The > following constants are defined as a result of including > . > > #define IPPROTO_HOPOPTS 0 /* IPv6 Hop-by-Hop options */ > #define IPPROTO_IPV6 41 /* IPv6 header */ > #define IPPROTO_ROUTING 43 /* IPv6 Routing header */ > #define IPPROTO_FRAGMENT 44 /* IPv6 fragmentation header */ > #define IPPROTO_ESP 50 /* encapsulating security payload */ > #define IPPROTO_AH 51 /* authentication header */ > #define IPPROTO_ICMPV6 58 /* ICMPv6 */ > #define IPPROTO_NONE 59 /* IPv6 no next header */ > #define IPPROTO_DSTOPTS 60 /* IPv6 Destination options */ > > Berkeley-derived IPv4 implementations also define IPPROTO_IP to be 0. 
> This should not be a problem since IPPROTO_IP is used only with IPv4 > sockets and IPPROTO_HOPOPTS only with IPv6 sockets. The Linux FreeS/WAN IPsec implementation has been using IPPROTO_ESP, IPPROTO_AH, IPPROTO_INT (61, put aside by IANA for internal use), IPPROTO_COMP (108), IPPROTO_IPIP (4) for the last 5 years, based on common usage and examples such as IPPROTO_UDP, IPPROTO_TCP, IPPROTO_ICMP. > > and include/net/ipv6.h contains: > > > > <-- snip --> > > > > /* > > * NextHeader field of IPv6 header > > */ > > > > #define NEXTHDR_HOP 0 /* Hop-by-hop option header. */ > > #define NEXTHDR_TCP 6 /* TCP segment. */ > > #define NEXTHDR_UDP 17 /* UDP message. */ > > #define NEXTHDR_IPV6 41 /* IPv6 in IPv6 */ > > #define NEXTHDR_ROUTING 43 /* Routing header. */ > > #define NEXTHDR_FRAGMENT 44 /* Fragmentation/reassembly header. */ > > This form doesn't appear in RFC2292, nor in 2133 (Basic Socket...) > > My interpretation is that this latter form is defined for kernel use, > while the former is for user-level manipulation of raw packet fields > (the primary purpose of 2292). We use these in the kernel, but not in userspace. We define SA_ESP, etc... > Does it make sense to have two forms, one kernel, one user? I haven't > e.g. followed the desired include chain. If we wanted to merge the > uses, the former form and include location would probably have to be > used. We use the two forms since shared user/kernel headers are a nuisance... > I've been looking into this. There are a *few* things missing from the > 2292 support. AFAICT, it's just a handful of functions/macros for > manipulating option headers that need to be added. > > Does anybody actually USE this stuff (the advanced sockets API, I mean, > not IPv6)? I'm planning to add those missing bits, just for kicks, but > haven't done it yet. > > --Rod slainte mhath, RGB -- Richard Guy Briggs -- ~\ Auto-Free Ottawa! Canada -- \@ @ No Internet Wiretapping! -- _\\/\%___\\/\% Vote! 
-- _______GTVS6#790__(*)_______(*)(*)_______ From kazunori@miyazawa.org Wed Mar 5 16:32:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 16:32:03 -0800 (PST) Received: from miyazawa.org (usen-43x235x12x234.ap-USEN.usen.ad.jp [43.235.12.234]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h260Vwq9011219 for ; Wed, 5 Mar 2003 16:31:59 -0800 Received: from monza.miyazawa.org ([2001:200:0:ff18:220:e0ff:fe8a:e797]) (AUTH: LOGIN kazunori, ) by miyazawa.org with esmtp; Thu, 06 Mar 2003 09:14:01 +0900 Date: Thu, 6 Mar 2003 09:32:19 +0900 From: Kazunori Miyazawa To: "David S. Miller" Cc: kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, usagi-core@linux-ipv6.org Subject: Re: [PATH] IPv6 IPsec support Message-Id: <20030306093219.1a702868.kazunori@miyazawa.org> In-Reply-To: <20030305.152530.70806720.davem@redhat.com> References: <20030305233025.784feb00.kazunori@miyazawa.org> <20030305.152530.70806720.davem@redhat.com> X-Mailer: Sylpheed version 0.8.10 (GTK+ 1.2.10; i386-debian-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 1878 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kazunori@miyazawa.org Precedence: bulk X-list: netdev Hello David, On Wed, 05 Mar 2003 15:25:30 -0800 (PST) "David S. Miller" wrote: > From: Kazunori Miyazawa > Date: Wed, 5 Mar 2003 23:30:25 +0900 > > Hello Miyazawa-san, > > I submit the patch to let the kernel support ipv6 ipsec again. > It is able to comple ipv6 as module. > > As promised I applied the patch. I will push it to Linus later > this evening, or tomorrow. > > In this initial checkin I made only 2 minor fixes, they > are attached below: > Thank you very much. My patch is the first step. I think there are these TODOs around IPv6 IPsec as far as I remember. 
- Extension Header Processing on inbound: As a result of IPv6 IPsec support, Extension Header processing is divided into ipv6_parse_exthdrs and ipproto->handler. I think it is better to merge other Extension Header handling into ipproto->handler. - Fragmentation support on outbound: We should change ipv6_build_xmit to the ip_append_data style to support fragmentation with IPsec. - Removing duplicate code, cleaning up, and improving performance. - Considering the relation between IPv6 IPsec and Mobile IPv6. This is future stuff. Best regards, --Kazunori Miyazawa (Yokogawa Electric Corporation) From davem@redhat.com Wed Mar 5 21:02:48 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 05 Mar 2003 21:02:52 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2652lq9015424 for ; Wed, 5 Mar 2003 21:02:48 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id UAA18044; Wed, 5 Mar 2003 20:43:49 -0800 Date: Wed, 05 Mar 2003 20:43:48 -0800 (PST) Message-Id: <20030305.204348.130225511.davem@redhat.com> To: kazunori@miyazawa.org Cc: kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, usagi-core@linux-ipv6.org Subject: Re: [PATH] IPv6 IPsec support From: "David S. Miller" In-Reply-To: <20030306093219.1a702868.kazunori@miyazawa.org> References: <20030305233025.784feb00.kazunori@miyazawa.org> <20030305.152530.70806720.davem@redhat.com> <20030306093219.1a702868.kazunori@miyazawa.org> X-FalunGong: Information control.
X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1879 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev From: Kazunori Miyazawa Date: Thu, 6 Mar 2003 09:32:19 +0900 - Extension Header Processing on inbound: As a result of IPv6 IPsec support, Extension Header processing is divided into ipv6_parse_exthdrs and ipproto->handler. I think it is better to merge other Extension Header handling into ipproto->handler. Ok. - Fragmentation support on outbound: We should change ipv6_build_xmit to the ip_append_data style to support fragmentation with IPsec. Please work together with Alexey on this. There are known major problems on the ipv4 side, and they must be resolved before the ipv6 side can be done. For example, right now a non-TCP packet can do the following. If it is just slightly smaller than the MTU, and encapsulation in ESP/AH makes it larger than the MTU, we will not fragment it and a too-large frame will be sent to the device. In my last round of talks with Alexey I believe we were very close to a possible solution to this problem. The idea was to have a "local dont-fragment" flag, and at the very last stage of IP output we check this and either 1) clear DF and fragment or 2) drop packet and send ICMP message back. Alexey, what is the current state? - Removing duplicate code, cleaning up, and improving performance. - Considering the relation between IPv6 IPsec and Mobile IPv6. This is future stuff. Ok.
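The "local dont-fragment" idea described above can be sketched in plain C. This is only an illustration of the decision at the last stage of output, not kernel code: the struct, the enum, the function name and the may_frag_locally field are all hypothetical, and which way the flag points (set means "local fragmentation allowed") is an assumption made for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of the final IP output stage. */
enum out_action {
    OUT_SEND,       /* packet fits: transmit unchanged */
    OUT_FRAGMENT,   /* too big, local fragmentation allowed: clear DF, fragment */
    OUT_DROP_ICMP   /* too big, not allowed: drop and report back via ICMP */
};

/* Hypothetical stand-in for the few packet properties that matter here. */
struct pkt {
    size_t len;               /* length after ESP/AH encapsulation */
    bool   df;                /* DF bit in the IP header */
    bool   may_frag_locally;  /* assumed meaning of the "local dont-fragment" flag */
};

/* Decide what to do with a packet once its final size is known. */
enum out_action ip_output_decide(struct pkt *p, size_t mtu)
{
    if (p->len <= mtu)
        return OUT_SEND;

    if (p->may_frag_locally) {
        p->df = false;        /* option 1: clear DF and fragment */
        return OUT_FRAGMENT;
    }

    return OUT_DROP_ICMP;     /* option 2: drop and send ICMP back */
}
```

The point of checking this late is that the size comparison happens after encapsulation, so a packet that was just under the MTU before ESP/AH was added is still caught.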
From rreddy@c.psc.edu Thu Mar 6 08:01:46 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 08:01:49 -0800 (PST) Received: from c.psc.edu (c.psc.edu [128.182.73.106]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26G1jq9018848 for ; Thu, 6 Mar 2003 08:01:46 -0800 Received: by c.psc.edu for NETDEV@OSS.SGI.COM; Thu, 6 Mar 2003 11:01:45 -0500 Date: Thu, 6 Mar 2003 11:01:45 -0500 From: "Raghurama 'REDDY'" Reply-To: rreddy@psc.edu To: NETDEV@OSS.SGI.COM CC: RREDDY@vms.psc.edu Message-Id: <03030611014533.2221238e.7643064@psc.edu> Subject: Output on raw sockets ignores IP_DF when packet is bigger than pmtu X-archive-position: 1881 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: rreddy@psc.edu Precedence: bulk X-list: netdev Content-Length: 1506 Lines: 46 Hello! We are looking at the "traceroute" code, in particular the behavior of the "-M" option (path MTU discovery) on a 2.4.20 kernel. In attempting to get a "Fragmentation Required" message from an intermediate router, the code does the following: - Open a raw socket - Send a packet that is smaller than the interface MTU but bigger than the MTU of an intermediate router, with IP_DF set (the NIC has an MTU of 4400, and pmtu is 1500). What is observed is that when the route cache is flushed, it works as expected. We get a "Fragmentation Required" message from the intermediate router. But when the cache is *not* flushed, the packets are fragmented based on pmtu before being sent out on the net, in spite of the fact that IP_DF is set (based on the tcpdump observations). Is this the right behavior? But looking at 2.4.20 kernel code in "include/net/ip.h": Not sure whether "ip_send" is in the call tree or not; I am *not* intimately familiar with the code ...
:-(

------------------
static inline int ip_send(struct sk_buff *skb)
{
	if (skb->len > skb->dst->pmtu)
		return ip_fragment(skb, ip_finish_output);
	else
		return ip_finish_output(skb);
}
------------------

This seems to indicate that it would fragment the packet if the packet is bigger than the path MTU, irrespective of the IP_DF flag. Is there a way to get the host to not fragment when IP_DF is set, irrespective of what the pmtu is? Thanks! --rr From sri@us.ibm.com Thu Mar 6 10:47:07 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 10:47:13 -0800 (PST) Received: from e6.ny.us.ibm.com (e6.ny.us.ibm.com [32.97.182.106]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26Il0q9003014 for ; Thu, 6 Mar 2003 10:47:07 -0800 Received: from northrelay01.pok.ibm.com (northrelay01.pok.ibm.com [9.56.224.149]) by e6.ny.us.ibm.com (8.12.8/8.12.2) with ESMTP id h26IkrZu049886; Thu, 6 Mar 2003 13:46:53 -0500 Received: from dyn9-47-18-140.beaverton.ibm.com (dyn9-47-18-140.beaverton.ibm.com [9.47.18.140]) by northrelay01.pok.ibm.com (8.12.8/NCO/VER6.5) with ESMTP id h26IknPO062020; Thu, 6 Mar 2003 13:46:50 -0500 Date: Thu, 6 Mar 2003 10:30:33 -0800 (PST) From: Sridhar Samudrala X-X-Sender: sridhar@dyn9-47-18-140.beaverton.ibm.com To: "Raghurama 'REDDY'" cc: NETDEV@oss.sgi.com, Subject: Re: Output on raw sockets ignores IP_DF when packet is bigger than pmtu In-Reply-To: <03030611014533.2221238e.7643064@psc.edu> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 1882 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sri@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 345 Lines: 13 On Thu, 6 Mar 2003, Raghurama 'REDDY' wrote: > Is there a way to get the host to not fragment when IP_DF is set, > irrespective of what the pmtu is? You can disable pmtu discovery on a socket using the IP level socket option IP_MTU_DISCOVER.
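As a rough illustration of the per-socket option just mentioned, here is a minimal sketch for Linux. The helper name disable_pmtu_discovery is made up for this example; IP_MTU_DISCOVER and IP_PMTUDISC_DONT are the constants documented in ip(7). Whether this alone gives the behavior wanted for the raw-socket case above is not confirmed anywhere in this thread.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: turn off path MTU discovery on one socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 */
int disable_pmtu_discovery(int fd)
{
    int val = IP_PMTUDISC_DONT;   /* per ip(7): never do path MTU discovery */

    return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val));
}
```

A caller would invoke this once, right after creating the socket and before sending. ip(7) also lists IP_PMTUDISC_DO, which always sets DF; choosing between the two depends on which behavior is actually wanted.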
System wide pmtu discovery can be disabled by setting /proc/sys/net/ipv4/ip_no_pmtu_disc -Sridhar From davem@redhat.com Thu Mar 6 10:50:39 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 10:50:43 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26Inxq9003683 for ; Thu, 6 Mar 2003 10:50:39 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id KAA19604; Thu, 6 Mar 2003 10:31:43 -0800 Date: Thu, 06 Mar 2003 10:31:42 -0800 (PST) Message-Id: <20030306.103142.58817243.davem@redhat.com> To: eric@lammerts.org Cc: linux-net@vger.kernel.org, netdev@oss.sgi.com, alan@lxorguk.ukuu.org.uk Subject: Re: [PATCH] wrong ENETDOWN in af_packet? From: "David S. Miller" In-Reply-To: <20030305141123.GA16699@ally.lammerts.org> References: <20030305141123.GA16699@ally.lammerts.org> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1883 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 439 Lines: 10 From: Eric Lammerts Date: Wed, 5 Mar 2003 15:11:23 +0100 The reason is that (in af_packet.c) packet_notifier(NETDEV_DOWN) sets sk->err to ENETDOWN, but packet_notifier(NETDEV_UP) doesn't clear it. Is this behaviour deliberate? Yes the behavior is deliberate. You want to be aware of the event. 
Just because the opposite event has occurred afterwards doesn't mean the first event didn't happen :-) From holt@sgi.com Thu Mar 6 13:11:01 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 13:11:05 -0800 (PST) Received: from tolkor.sgi.com ([198.149.18.6]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26LALq9010348 for ; Thu, 6 Mar 2003 13:11:01 -0800 Received: from ledzep.americas.sgi.com (ledzep.americas.sgi.com [192.48.203.134]) by tolkor.sgi.com (8.12.2/8.12.2/linux-outbound_gateway-1.2) with ESMTP id h26LKokq007999 for ; Thu, 6 Mar 2003 15:20:50 -0600 Received: from thistle-e236.americas.sgi.com (thistle-e236.americas.sgi.com [128.162.236.204]) by ledzep.americas.sgi.com (SGI-8.9.3/americas-smart-nospam1.1) with ESMTP id PAA55938; Thu, 6 Mar 2003 15:10:14 -0600 (CST) Received: from mandrake.americas.sgi.com (mandrake.americas.sgi.com [128.162.232.96]) by thistle-e236.americas.sgi.com (8.12.8/SGI-server-1.8) with ESMTP id h26LAEvw6978040; Thu, 6 Mar 2003 15:10:15 -0600 (CST) Received: from mandrake.americas.sgi.com (localhost.localdomain [127.0.0.1]) by mandrake.americas.sgi.com (8.12.5/8.11.6/erikj-RedHat-7.2-Eagan) with ESMTP id h26LAEst031494; Thu, 6 Mar 2003 15:10:14 -0600 Received: from localhost (holt@localhost) by mandrake.americas.sgi.com (8.12.5/8.12.5/Submit) with ESMTP id h26LAEct031490; Thu, 6 Mar 2003 15:10:14 -0600 X-Authentication-Warning: mandrake.americas.sgi.com: holt owned process doing -bs Date: Thu, 6 Mar 2003 15:10:13 -0600 (CST) From: Robin Holt X-X-Sender: holt@mandrake.americas.sgi.com To: Linux Kernel Mailing List , Subject: Make ipconfig.c work as a loadable module. 
Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 1884 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: holt@sgi.com Precedence: bulk X-list: netdev Content-Length: 6072 Lines: 208 The patch at the end of this email makes ipconfig.c work as a loadable module under the 2.5. The diff was taken against the bitkeeper tree changeset 1.1075. Currently ipconfig.o must get statically linked into the kernel. I have a proprietary driver which the supplier will not provide a GPL version or info. In order to mount root over NFS, I need to get the vendors driver loaded via a ramdisk. A couple more items get moved from ipconfig.h to nfs_fs.h. Thanks, Robin Holt ------------------------- Patch --------------------------------------- ===== fs/Kconfig 1.18 vs edited ===== --- 1.18/fs/Kconfig Sun Feb 9 19:29:49 2003 +++ edited/fs/Kconfig Wed Mar 5 11:07:56 2003 @@ -1270,7 +1270,7 @@ config ROOT_NFS bool "Root file system on NFS" - depends on NFS_FS=y && IP_PNP + depends on NFS_FS=y && IP_PNP!=n help If you want your Linux box to mount its whole root file system (the one containing the directory /) from some other computer over the ===== fs/nfs/nfsroot.c 1.11 vs edited ===== --- 1.11/fs/nfs/nfsroot.c Thu Nov 7 11:29:59 2002 +++ edited/fs/nfs/nfsroot.c Wed Mar 5 11:07:56 2003 @@ -69,6 +69,7 @@ */ #include +#include #include #include #include @@ -106,6 +107,15 @@ static struct nfs_mount_data nfs_data __initdata = { 0, };/* NFS mount info */ static int nfs_port __initdata = 0; /* Port to connect to for NFS */ static int mount_port __initdata = 0; /* Mount daemon port number */ + + +u32 root_server_addr __initdata = INADDR_NONE; /* Address of NFS server */ +u8 root_server_path[NFS_ROOT_PATH_LEN] __initdata = { 0, }; /* Path to mount as root */ + +#ifdef CONFIG_IP_PNP_MODULE +EXPORT_SYMBOL(root_server_addr); +EXPORT_SYMBOL(root_server_path); +#endif 
/*************************************************************************** ===== include/linux/nfs_fs.h 1.43 vs edited ===== --- 1.43/include/linux/nfs_fs.h Sat Dec 21 00:29:02 2002 +++ edited/include/linux/nfs_fs.h Wed Mar 5 11:07:56 2003 @@ -417,7 +417,12 @@ /* NFS root */ +#ifdef CONFIG_ROOT_NFS +#define NFS_ROOT_PATH_LEN 256 +extern u8 root_server_path[NFS_ROOT_PATH_LEN]; /* Path to mount as root */ + extern void * nfs_root_data(void); +#endif #define nfs_wait_event(clnt, wq, condition) \ ({ \ ===== include/net/ipconfig.h 1.2 vs edited ===== --- 1.2/include/net/ipconfig.h Tue Feb 5 01:40:15 2002 +++ edited/include/net/ipconfig.h Wed Mar 5 11:07:56 2003 @@ -21,7 +21,6 @@ extern u32 ic_servaddr; /* Boot server IP address */ extern u32 root_server_addr; /* Address of NFS server */ -extern u8 root_server_path[]; /* Path to mount as root */ ===== net/ipv4/Kconfig 1.4 vs edited ===== --- 1.4/net/ipv4/Kconfig Wed Nov 13 06:52:02 2002 +++ edited/net/ipv4/Kconfig Wed Mar 5 11:07:56 2003 @@ -133,8 +133,8 @@ you may want to say Y here to speed up the routing process. 
config IP_PNP - bool "IP: kernel level autoconfiguration" - depends on INET + tristate "IP: kernel level autoconfiguration" + depends on INET!=n help This enables automatic configuration of IP addresses of devices and of the routing table during kernel boot, based on either information @@ -146,7 +146,7 @@ config IP_PNP_DHCP bool "IP: DHCP support" - depends on IP_PNP + depends on IP_PNP!=n ---help--- If you want your Linux box to mount its whole root file system (the one containing the directory /) from some other computer over the @@ -163,7 +163,7 @@ config IP_PNP_BOOTP bool "IP: BOOTP support" - depends on IP_PNP + depends on IP_PNP!=n ---help--- If you want your Linux box to mount its whole root file system (the one containing the directory /) from some other computer over the @@ -178,7 +178,7 @@ config IP_PNP_RARP bool "IP: RARP support" - depends on IP_PNP + depends on IP_PNP!=n help If you want your Linux box to mount its whole root file system (the one containing the directory /) from some other computer over the ===== net/ipv4/ipconfig.c 1.22 vs edited ===== --- 1.22/net/ipv4/ipconfig.c Tue Feb 18 12:38:27 2003 +++ edited/net/ipv4/ipconfig.c Wed Mar 5 11:07:56 2003 @@ -32,6 +32,7 @@ */ #include +#include #include #include #include @@ -52,6 +53,7 @@ #include #include #include +#include #include #include #include @@ -131,9 +133,6 @@ u32 ic_servaddr __initdata = INADDR_NONE; /* Boot server IP address */ -u32 root_server_addr __initdata = INADDR_NONE; /* Address of NFS server */ -u8 root_server_path[256] __initdata = { 0, }; /* Path to mount as root */ - /* Persistent data: */ int ic_proto_used; /* Protocol used, if any */ @@ -1136,6 +1135,7 @@ unsigned long jiff; #ifdef CONFIG_PROC_FS + /* >>> Need to remove this on unload!!! 
*/ proc_net_create("pnp", 0, pnp_get_info); #endif /* CONFIG_PROC_FS */ @@ -1263,8 +1263,6 @@ return 0; } -module_init(ip_auto_config); - /* * Decode any IP configuration options in the "ip=" or "nfsaddrs=" kernel @@ -1386,6 +1384,29 @@ return 1; } + +#ifdef CONFIG_IP_PNP_MODULE +char *ip = NULL; +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Martin Mares"); +MODULE_DESCRIPTION("IP Autoconfig module: \n" \ + "Uses BOOTP/DHCP/RARP to determine IP configuration before the root\n" + " filesystem is mounted. See nfsroot.txt in the kernel source."); +MODULE_PARM(ip, "s"); +MODULE_PARM_DESC(ip, "[[]:[]:[]:[]:[]:[]:]"); + + +int __init init_module(void) +{ + if (ip != NULL) { + ip_auto_config_setup(ip); + } + + return ip_auto_config(); +} +#else +module_init(ip_auto_config); + static int __init nfsaddrs_config_setup(char *addrs) { return ip_auto_config_setup(addrs); @@ -1393,3 +1414,4 @@ __setup("ip=", ip_auto_config_setup); __setup("nfsaddrs=", nfsaddrs_config_setup); +#endif From alan@lxorguk.ukuu.org.uk Thu Mar 6 13:28:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 13:28:59 -0800 (PST) Received: from irongate.swansea.linux.org.uk (pc2-cwma1-4-cust86.swan.cable.ntl.com [213.105.254.86]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26LSsq9012485 for ; Thu, 6 Mar 2003 13:28:56 -0800 Received: from irongate.swansea.linux.org.uk (localhost [127.0.0.1]) by irongate.swansea.linux.org.uk (8.12.7/8.12.7) with ESMTP id h26MYHYf018976; Thu, 6 Mar 2003 22:34:17 GMT Received: (from alan@localhost) by irongate.swansea.linux.org.uk (8.12.7/8.12.7/Submit) id h26MYG6T018974; Thu, 6 Mar 2003 22:34:16 GMT X-Authentication-Warning: irongate.swansea.linux.org.uk: alan set sender to alan@lxorguk.ukuu.org.uk using -f Subject: Re: Make ipconfig.c work as a loadable module. 
From: Alan Cox To: Robin Holt Cc: Linux Kernel Mailing List , netdev@oss.sgi.com In-Reply-To: References: Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: Message-Id: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.1 (1.2.1-4) Date: 06 Mar 2003 22:34:16 +0000 X-archive-position: 1885 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@lxorguk.ukuu.org.uk Precedence: bulk X-list: netdev Content-Length: 619 Lines: 14 On Thu, 2003-03-06 at 21:10, Robin Holt wrote: > The patch at the end of this email makes ipconfig.c work as a loadable > module under the 2.5. The diff was taken against the bitkeeper tree > changeset 1.1075. The right fix is to delete ipconfig.c, it has been the right fix for a long long time. There are initrd based bootp/dhcp setups that can also then mount a root NFS partition and they do *not* need any kernel helper. Indeed probably the biggest distro using nfs root (LTSP) doesn't use ipconfig even on 2.4. DaveM can you just remove the thing. See http://www.ltsp.org for initrds that don't need it in From cw@f00f.org Thu Mar 6 13:32:18 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 13:32:20 -0800 (PST) Received: from tapu.f00f.org (tapu.f00f.org [202.49.232.129]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26LWHq9013327 for ; Thu, 6 Mar 2003 13:32:18 -0800 Received: by tapu.f00f.org (Postfix, from userid 10000) id C73351830E0C; Thu, 6 Mar 2003 13:32:17 -0800 (PST) Date: Thu, 6 Mar 2003 13:32:17 -0800 From: Chris Wedgwood To: "David S. 
Miller" Cc: yoshfuji@linux-ipv6.org, kazunori@miyazawa.org, kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, usagi@linux-ipv6.org Subject: Re: (usagi-core 12294) Re: [PATCH] IPv6 IPsec support Message-ID: <20030306213217.GA6358@f00f.org> References: <20030305233025.784feb00.kazunori@miyazawa.org> <20030305.072149.121185037.davem@redhat.com> <20030306.004820.41101302.yoshfuji@linux-ipv6.org> <20030305.154100.28816301.davem@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030305.154100.28816301.davem@redhat.com> User-Agent: Mutt/1.3.28i X-No-Archive: Yes X-archive-position: 1886 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: cw@f00f.org Precedence: bulk X-list: netdev Content-Length: 406 Lines: 12 On Wed, Mar 05, 2003 at 03:41:00PM -0800, David S. Miller wrote: > Note that this coincides with the idea to eventually have an > address-family independant flow cache. Actually... at that point being able to monitor updates to the flow-cache would be useful for various statistical purposes and applications, especially if the flow cache was able to periodically export utilization counters... --cw From garzik@gtf.org Thu Mar 6 14:11:43 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 14:11:46 -0800 (PST) Received: from havoc.gtf.org (havoc.daloft.com [64.213.145.173]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26MBgq9016737 for ; Thu, 6 Mar 2003 14:11:43 -0800 Received: by havoc.gtf.org (Postfix, from userid 500) id E3A7D6659; Thu, 6 Mar 2003 17:11:36 -0500 (EST) Date: Thu, 6 Mar 2003 17:11:36 -0500 From: Jeff Garzik To: Alan Cox Cc: Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. 
Message-ID: <20030306221136.GB26732@gtf.org> References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> User-Agent: Mutt/1.3.28i X-archive-position: 1887 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Content-Length: 675 Lines: 19 On Thu, Mar 06, 2003 at 10:34:16PM +0000, Alan Cox wrote: > On Thu, 2003-03-06 at 21:10, Robin Holt wrote: > > The patch at the end of this email makes ipconfig.c work as a loadable > > module under the 2.5. The diff was taken against the bitkeeper tree > > changeset 1.1075. > > The right fix is to delete ipconfig.c, it has been the right fix for a long > long time. There are initrd based bootp/dhcp setups that can also then mount > a root NFS partition and they do *not* need any kernel helper. The klibc tarball on kernel.org also has ipconfig-type code, waiting for initramfs early userspace :) Many have wanted to delete ipconfig.c for a while now... 
Jeff From rmk@arm.linux.org.uk Thu Mar 6 14:25:58 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 14:26:02 -0800 (PST) Received: from caramon.arm.linux.org.uk (caramon.arm.linux.org.uk [212.18.232.186]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26MPuq9018868 for ; Thu, 6 Mar 2003 14:25:58 -0800 Received: from flint.arm.linux.org.uk ([3ffe:8260:2002:1:201:2ff:fe14:8fad]) by caramon.arm.linux.org.uk with asmtp (TLSv1:DES-CBC3-SHA:168) (Exim 4.12) id 18r3oB-0000Oy-00; Thu, 06 Mar 2003 22:25:47 +0000 Received: from rmk by flint.arm.linux.org.uk with local (Exim 4.12) id 18r3oA-0005rD-00; Thu, 06 Mar 2003 22:25:46 +0000 Date: Thu, 6 Mar 2003 22:25:46 +0000 From: Russell King To: Jeff Garzik Cc: Alan Cox , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. Message-ID: <20030306222546.K838@flint.arm.linux.org.uk> Mail-Followup-To: Jeff Garzik , Alan Cox , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <20030306221136.GB26732@gtf.org>; from jgarzik@pobox.com on Thu, Mar 06, 2003 at 05:11:36PM -0500 X-archive-position: 1888 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: rmk@arm.linux.org.uk Precedence: bulk X-list: netdev Content-Length: 1036 Lines: 23 On Thu, Mar 06, 2003 at 05:11:36PM -0500, Jeff Garzik wrote: > On Thu, Mar 06, 2003 at 10:34:16PM +0000, Alan Cox wrote: > > On Thu, 2003-03-06 at 21:10, Robin Holt wrote: > > > The patch at the end of this email makes ipconfig.c work as a loadable > > > module under the 2.5. The diff was taken against the bitkeeper tree > > > changeset 1.1075. 
> > > > The right fix is to delete ipconfig.c, it has been the right fix for a long > > long time. There are initrd based bootp/dhcp setups that can also then mount > > a root NFS partition and they do *not* need any kernel helper. > > The klibc tarball on kernel.org also has ipconfig-type code, waiting for > initramfs early userspace :) > > Many have wanted to delete ipconfig.c for a while now... Yep, can't the deletion wait a couple more weeks or so until klibc gets merged? It's not like ipconfig.c is broken currently, is it? -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html From garzik@gtf.org Thu Mar 6 14:32:22 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 14:32:26 -0800 (PST) Received: from havoc.gtf.org (havoc.daloft.com [64.213.145.173]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26MWMq9019833 for ; Thu, 6 Mar 2003 14:32:22 -0800 Received: by havoc.gtf.org (Postfix, from userid 500) id D3D836659; Thu, 6 Mar 2003 17:32:16 -0500 (EST) Date: Thu, 6 Mar 2003 17:32:16 -0500 From: Jeff Garzik To: Alan Cox , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. Message-ID: <20030306223216.GB28643@gtf.org> References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030306222546.K838@flint.arm.linux.org.uk> User-Agent: Mutt/1.3.28i X-archive-position: 1889 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Content-Length: 296 Lines: 12 On Thu, Mar 06, 2003 at 10:25:46PM +0000, Russell King wrote: > Yep, can't the deletion wait a couple more weeks or so until klibc gets > merged? 
It's not like ipconfig.c is broken currently, is it? The klibc merge date appears to be infinity at this point. Probably my fault, too. Jeff From alan@lxorguk.ukuu.org.uk Thu Mar 6 15:09:42 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 15:09:45 -0800 (PST) Received: from irongate.swansea.linux.org.uk (pc2-cwma1-4-cust86.swan.cable.ntl.com [213.105.254.86]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26N8sq9021445 for ; Thu, 6 Mar 2003 15:09:41 -0800 Received: from irongate.swansea.linux.org.uk (localhost [127.0.0.1]) by irongate.swansea.linux.org.uk (8.12.7/8.12.7) with ESMTP id h270E1Yf019240; Fri, 7 Mar 2003 00:14:02 GMT Received: (from alan@localhost) by irongate.swansea.linux.org.uk (8.12.7/8.12.7/Submit) id h270Dw3q019238; Fri, 7 Mar 2003 00:13:58 GMT X-Authentication-Warning: irongate.swansea.linux.org.uk: alan set sender to alan@lxorguk.ukuu.org.uk using -f Subject: Re: Make ipconfig.c work as a loadable module. From: Alan Cox To: Russell King Cc: Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com In-Reply-To: <20030306222546.K838@flint.arm.linux.org.uk> References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: Message-Id: <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.1 (1.2.1-4) Date: 07 Mar 2003 00:13:57 +0000 X-archive-position: 1890 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@lxorguk.ukuu.org.uk Precedence: bulk X-list: netdev Content-Length: 751 Lines: 18 On Thu, 2003-03-06 at 22:25, Russell King wrote: > > > The right fix is to delete ipconfig.c, it has been the right fix for a long > > > long time. 
There are initrd based bootp/dhcp setups that can also then mount > > > a root NFS partition and they do *not* need any kernel helper. > > > > The klibc tarball on kernel.org also has ipconfig-type code, waiting for > > initramfs early userspace :) > > > > Many have wanted to delete ipconfig.c for a while now... > > Yep, can't the deletion wait a couple more weeks or so until klibc gets > merged? It's not like ipconfig.c is broken currently, is it? Thats how it ended up in 2.4. Klibc doesnt really matter, the apps exist linked with dietlibc and stuff even without klibc. Time for it to die From rmk@arm.linux.org.uk Thu Mar 6 15:19:16 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 15:19:18 -0800 (PST) Received: from caramon.arm.linux.org.uk (caramon.arm.linux.org.uk [212.18.232.186]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26NJEq9022298 for ; Thu, 6 Mar 2003 15:19:15 -0800 Received: from flint.arm.linux.org.uk ([3ffe:8260:2002:1:201:2ff:fe14:8fad]) by caramon.arm.linux.org.uk with asmtp (TLSv1:DES-CBC3-SHA:168) (Exim 4.12) id 18r4dn-0000bx-00; Thu, 06 Mar 2003 23:19:07 +0000 Received: from rmk by flint.arm.linux.org.uk with local (Exim 4.12) id 18r4dl-0006LP-00; Thu, 06 Mar 2003 23:19:05 +0000 Date: Thu, 6 Mar 2003 23:19:05 +0000 From: Russell King To: Alan Cox Cc: Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. 
Message-ID: <20030306231905.M838@flint.arm.linux.org.uk> Mail-Followup-To: Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <1046996037.18158.142.camel@irongate.swansea.linux.org.uk>; from alan@lxorguk.ukuu.org.uk on Fri, Mar 07, 2003 at 12:13:57AM +0000 X-archive-position: 1891 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: rmk@arm.linux.org.uk Precedence: bulk X-list: netdev Content-Length: 1295 Lines: 32 On Fri, Mar 07, 2003 at 12:13:57AM +0000, Alan Cox wrote: > On Thu, 2003-03-06 at 22:25, Russell King wrote: > > > > The right fix is to delete ipconfig.c, it has been the right fix for a long > > > > long time. There are initrd based bootp/dhcp setups that can also then mount > > > > a root NFS partition and they do *not* need any kernel helper. > > > > > > The klibc tarball on kernel.org also has ipconfig-type code, waiting for > > > initramfs early userspace :) > > > > > > Many have wanted to delete ipconfig.c for a while now... > > > > Yep, can't the deletion wait a couple more weeks or so until klibc gets > > merged? It's not like ipconfig.c is broken currently, is it? > > Thats how it ended up in 2.4. Klibc doesnt really matter, the apps exist > linked with dietlibc and stuff even without klibc. > > Time for it to die "klibc doesnt really matter" I'd prefer not to have to have thousands of special programs around just to be able to boot my machines, especially when it was all in- kernel up until this point. klibc yes, dietlibc with random other garbage in some random filesystem which'd need maintaining - no thanks. 
-- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html From alan@lxorguk.ukuu.org.uk Thu Mar 6 15:24:30 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 15:24:32 -0800 (PST) Received: from irongate.swansea.linux.org.uk (pc2-cwma1-4-cust86.swan.cable.ntl.com [213.105.254.86]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26NOSq9022738 for ; Thu, 6 Mar 2003 15:24:29 -0800 Received: from irongate.swansea.linux.org.uk (localhost [127.0.0.1]) by irongate.swansea.linux.org.uk (8.12.7/8.12.7) with ESMTP id h270ToYf019275; Fri, 7 Mar 2003 00:29:51 GMT Received: (from alan@localhost) by irongate.swansea.linux.org.uk (8.12.7/8.12.7/Submit) id h270TmNx019273; Fri, 7 Mar 2003 00:29:48 GMT X-Authentication-Warning: irongate.swansea.linux.org.uk: alan set sender to alan@lxorguk.ukuu.org.uk using -f Subject: Re: Make ipconfig.c work as a loadable module. From: Alan Cox To: Russell King Cc: Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com In-Reply-To: <20030306231905.M838@flint.arm.linux.org.uk> References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> <20030306231905.M838@flint.arm.linux.org.uk> Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: Message-Id: <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.1 (1.2.1-4) Date: 07 Mar 2003 00:29:47 +0000 X-archive-position: 1892 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@lxorguk.ukuu.org.uk Precedence: bulk X-list: netdev Content-Length: 599 Lines: 14 On Thu, 2003-03-06 at 23:19, Russell King wrote: > "klibc doesnt really matter" > > I'd prefer not to have to have thousands of special programs around > just to be able to 
boot my machines, especially when it was all in- > kernel up until this point. > > klibc yes, dietlibc with random other garbage in some random filesystem > which'd need maintaining - no thanks. You can build the dhcp client with glibc static into your initrd. Its hardly magic or special programs or random garbage, and last time I counted it came to one program. Dunno what the other 999 utilities your dhcp needs are ? From davem@redhat.com Thu Mar 6 15:45:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 15:46:00 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h26Njtq9023628 for ; Thu, 6 Mar 2003 15:45:56 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id PAA20292; Thu, 6 Mar 2003 15:27:04 -0800 Date: Thu, 06 Mar 2003 15:27:03 -0800 (PST) Message-Id: <20030306.152703.21845381.davem@redhat.com> To: cw@f00f.org Cc: yoshfuji@linux-ipv6.org, kazunori@miyazawa.org, kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, usagi@linux-ipv6.org Subject: Re: (usagi-core 12294) Re: [PATCH] IPv6 IPsec support From: "David S. Miller" In-Reply-To: <20030306213217.GA6358@f00f.org> References: <20030306.004820.41101302.yoshfuji@linux-ipv6.org> <20030305.154100.28816301.davem@redhat.com> <20030306213217.GA6358@f00f.org> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1893 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 395 Lines: 11 From: Chris Wedgwood Date: Thu, 6 Mar 2003 13:32:17 -0800 Actually... 
at that point being able to monitor updates to the flow-cache would be useful for various statistical purposes and applications, especially if the flow cache was able to periodically export utilization counters... It will keep statistics, just like the route cache keeps them now. From rmk@arm.linux.org.uk Thu Mar 6 16:09:06 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 16:09:10 -0800 (PST) Received: from caramon.arm.linux.org.uk (caramon.arm.linux.org.uk [212.18.232.186]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2708Pq9024526 for ; Thu, 6 Mar 2003 16:09:06 -0800 Received: from flint.arm.linux.org.uk ([3ffe:8260:2002:1:201:2ff:fe14:8fad]) by caramon.arm.linux.org.uk with asmtp (TLSv1:DES-CBC3-SHA:168) (Exim 4.12) id 18r5PN-0000oC-00; Fri, 07 Mar 2003 00:08:17 +0000 Received: from rmk by flint.arm.linux.org.uk with local (Exim 4.12) id 18r5PM-0006qL-00; Fri, 07 Mar 2003 00:08:16 +0000 Date: Fri, 7 Mar 2003 00:08:16 +0000 From: Russell King To: Alan Cox Cc: Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. 
Message-ID: <20030307000816.P838@flint.arm.linux.org.uk> Mail-Followup-To: Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> <20030306231905.M838@flint.arm.linux.org.uk> <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <1046996987.17718.144.camel@irongate.swansea.linux.org.uk>; from alan@lxorguk.ukuu.org.uk on Fri, Mar 07, 2003 at 12:29:47AM +0000 X-archive-position: 1894 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: rmk@arm.linux.org.uk Precedence: bulk X-list: netdev Content-Length: 1284 Lines: 29 On Fri, Mar 07, 2003 at 12:29:47AM +0000, Alan Cox wrote: > On Thu, 2003-03-06 at 23:19, Russell King wrote: > > "klibc doesnt really matter" > > > > I'd prefer not to have to have thousands of special programs around > > just to be able to boot my machines, especially when it was all in- > > kernel up until this point. > > > > klibc yes, dietlibc with random other garbage in some random filesystem > > which'd need maintaining - no thanks. > > You can build the dhcp client with glibc static into your initrd. Its hardly > magic or special programs or random garbage, and last time I counted it came > to one program. Dunno what the other 999 utilities your dhcp needs are ? How about mount for nfs-root, a shell and a shell script to supply the correct parameters to mount so it doesn't go and try to mount the nfs-root with locking enabled - oh, and a few programs like sed and so forth to pull the mount parameters out of the dhcp client output, if there is such an output. ipconfig.c does more than just configure networking. 
It's a far smaller solution to NFS-root than any userspace implementation could ever hope to be. -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html From pakrat@www.linux.org.uk Thu Mar 6 17:29:09 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 17:29:14 -0800 (PST) Received: from www.linux.org.uk (IDENT:exim@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h271T7q9002297 for ; Thu, 6 Mar 2003 17:29:08 -0800 Received: from pakrat by www.linux.org.uk with local (Exim 3.33 #5) id 18r6fZ-0001IZ-00; Fri, 07 Mar 2003 01:29:05 +0000 Date: Fri, 7 Mar 2003 01:29:05 +0000 From: Chris Dukes To: Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. Message-ID: <20030307012905.G20725@parcelfarce.linux.theplanet.co.uk> References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> <20030306231905.M838@flint.arm.linux.org.uk> <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> <20030307000816.P838@flint.arm.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <20030307000816.P838@flint.arm.linux.org.uk>; from rmk@arm.linux.org.uk on Fri, Mar 07, 2003 at 12:08:16AM +0000 X-archive-position: 1895 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pakrat@www.uk.linux.org Precedence: bulk X-list: netdev Content-Length: 1248 Lines: 28 On Fri, Mar 07, 2003 at 12:08:16AM +0000, Russell King wrote: > > > > You can build the dhcp client with glibc static into your initrd. 
Its hardly > > magic or special programs or random garbage, and last time I counted it came > > to one program. Dunno what the other 999 utilities your dhcp needs are ? > > How about mount for nfs-root, a shell and a shell script to supply the > correct parameters to mount so it doesn't go and try to mount the > nfs-root with locking enabled - oh, and a few programs like sed and > so forth to pull the mount parameters out of the dhcp client output, > if there is such an output. If IBM can fit a kernel and a ramdisk containing all the utilities you describe and more in a file smaller than 5M for tftp, one would think that it could be done on Linux. > > ipconfig.c does more than just configure networking. It's a far smaller > solution to NFS-root than any userspace implementation could ever hope > to be. That's nice. Would you mind explaining to us where that would be a benefit? Aside from dead header space in ELF executables, I'm at a loss as to why a usermode implementation must be significantly larger than kernel code. -- Chris Dukes I tried being reasonable once--I didn't like it.
From cfriesen@nortelnetworks.com Thu Mar 6 21:48:31 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 21:48:37 -0800 (PST) Received: from zcars04f.nortelnetworks.com (zcars04f.nortelnetworks.com [47.129.242.57]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h275mUq9027567 for ; Thu, 6 Mar 2003 21:48:31 -0800 Received: from zcard309.ca.nortel.com (zcard309.ca.nortel.com [47.129.242.69]) by zcars04f.nortelnetworks.com (Switch-2.2.5/Switch-2.2.0) with ESMTP id h275mMj19044; Fri, 7 Mar 2003 00:48:22 -0500 (EST) Received: from zcard0k6.ca.nortel.com ([47.129.242.158]) by zcard309.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id GDF44DJG; Fri, 7 Mar 2003 00:48:23 -0500 Received: from pcard0ks.ca.nortel.com ([47.129.117.131]) by zcard0k6.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id FSL7ZCDR; Fri, 7 Mar 2003 00:48:23 -0500 Received: from nortelnetworks.com (localhost.localdomain [127.0.0.1]) by pcard0ks.ca.nortel.com (Postfix) with ESMTP id 27AFA2D957; Fri, 7 Mar 2003 00:48:22 -0500 (EST) Message-ID: <3E6832A5.2020502@nortelnetworks.com> Date: Fri, 07 Mar 2003 00:48:21 -0500 X-Sybari-Space: 00000000 00000000 00000000 From: Chris Friesen User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204 X-Accept-Language: en-us MIME-Version: 1.0 To: linux-kernel , linux-net@vger.kernel.org, netdev@oss.sgi.com Subject: unix socket latency regression from 2.4 to 2.5 (and multicast AF_UNIX benchmarks) Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 1896 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: cfriesen@nortelnetworks.com Precedence: bulk X-list: netdev Content-Length: 3220 Lines: 64 I've done another series of benchmarks with regards to multicast AF_UNIX on the 2.5.63 kernel. 
One of the biggest surprises to me was the performance regression in the normal userspace case and in the kernel case with small numbers of listeners.

The tests basically work by having a single sender send messages of three different sizes to varying numbers of listeners, which take a timestamp and then nanosleep() for a second to allow the other listeners to wake up as fast as possible. When the listeners wake up they figure out the latency and send it on to a third utility which dumps out the latencies. In the case of the userspace test, the sender sends to each listener in turn. In the case of the multicast test, the kernel handles the cloning of the packet to distribute it to the listeners.

Here are the new results combined with the old for comparison. The machine is a Duron 750, with K7 optimizations in the kernel. Both kernels were compiled with gcc 3.2.

44 bytes
              2.4.20      2.5.63      2.4.20        2.5.63
# listeners   userspace   userspace   kernelspace   kernelspace
10            73,335      96,493      103,252       100,286
20            72,610      99,885      106,429       134,517
50            74,1482     95,2075     205,1301      230,1273
100           76,3000     97,4173     362,3425      431,2654
200 107,8719 737,9917 831,5412

236 bytes
              2.4.20      2.5.63      2.4.20        2.5.63
# listeners   userspace   userspace   kernelspace   kernelspace
10            70,346      98,510      81,265        100,290
20            74,639      100,918     122,468       137,533
50            75,1557     103,2225    230,1421      238,1329
100           80,3107     105,4415    408,3743      461,2794
200 131,9117 889,5720

40036 bytes
              2.4.20      2.5.63      2.4.20        2.5.63
# listeners   userspace   userspace   kernelspace   kernelspace
10            302,4181    841,6218    322,1692      702,2231
20            303,7491    873,12606   347,3450      722,3829
50            306,10451   868,38031   483,8394      884,8583
100           309,23107   881,69403   697,17061     1137,16729
200           313,45528   898,132887  997,39810     1586,32722

It appears that sending/receiving is significantly more expensive in 2.5 than it was in 2.4, with the difference going up as the size of the message goes up. Is 2.5 using different copying code or something? Anyone have any ideas as to what is going on here?
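
For illustration, the userspace case described above (the sender sends to each listener in turn, and each listener works out its own latency) can be sketched roughly as follows. This is not the benchmark code that produced the numbers above; the struct layout and the names fanout_send() and recv_latency_us() are hypothetical, and the sketch assumes each listener is reachable through an already-connected AF_UNIX datagram socket (for example one end of a socketpair()).

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical message layout: the send timestamp plus some padding. */
struct msg {
	struct timespec sent;
	char pad[44];
};

/* Microseconds elapsed between two CLOCK_MONOTONIC timestamps. */
static int64_t elapsed_us(const struct timespec *from, const struct timespec *to)
{
	return (int64_t)(to->tv_sec - from->tv_sec) * 1000000
	     + (to->tv_nsec - from->tv_nsec) / 1000;
}

/*
 * Userspace fan-out: stamp the message once, then send one copy to each
 * listener socket in turn. Returns the number of successful sends.
 */
static int fanout_send(const int *fds, int n)
{
	struct msg m;
	int i, ok = 0;

	assert(fds != NULL && n >= 0);

	memset(&m, 0, sizeof(m));
	clock_gettime(CLOCK_MONOTONIC, &m.sent);

	for (i = 0; i < n; i++)
		if (send(fds[i], &m, sizeof(m), 0) == (ssize_t)sizeof(m))
			ok++;

	return ok;
}

/*
 * Listener side: receive one message and return the time since it was
 * stamped, in microseconds, or -1 on a short read or error.
 */
static int64_t recv_latency_us(int fd)
{
	struct msg m;
	struct timespec now;

	if (recv(fd, &m, sizeof(m), 0) != (ssize_t)sizeof(m))
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return elapsed_us(&m.sent, &now);
}
```

With this shape the sender pays one send() per listener, which is the per-listener cost the in-kernel multicast case avoids by cloning the packet itself.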
Also, even with the increased copying costs, the O(1) scheduler in 2.5 means that the kernelspace multicast solution is faster than the userspace solution in either kernel in all cases, even when waking up 200 listeners simultaneously. Any comments on the multicast concept that haven't been discussed already? Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com From malware@t-online.de Thu Mar 6 23:15:55 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 06 Mar 2003 23:16:03 -0800 (PST) Received: from mailout01.sul.t-online.com (mailout01.sul.t-online.com [194.25.134.80]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h277Frq9032746 for ; Thu, 6 Mar 2003 23:15:55 -0800 Received: from fwd01.sul.t-online.de by mailout01.sul.t-online.com with smtp id 18rC59-0004vI-0H; Fri, 07 Mar 2003 08:15:51 +0100 Received: from fire.malware.de (320008702754-0001@[217.0.134.77]) by fwd01.sul.t-online.com with esmtp id 18rC51-1y3EsiC; Fri, 7 Mar 2003 08:15:43 +0100 Message-Id: <200303070715.IAA27138@fire.malware.de> Date: Fri, 07 Mar 2003 08:15:20 +0100 From: malware@t-online.de (Michael Mueller) X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.4.20-pre8 i686) X-Accept-Language: en, de MIME-Version: 1.0 To: Alan Cox CC: Russell King , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. 
References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> <20030306231905.M838@flint.arm.linux.org.uk> <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Sender: 320008702754-0001@t-dialin.net X-archive-position: 1897 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: malware@t-online.de Precedence: bulk X-list: netdev Content-Length: 925 Lines: 23

Hi Alan,

you wrote:
> > I'd prefer not to have to have thousands of special programs around
> > just to be able to boot my machines, especially when it was all in-
> > kernel up until this point.
> >
> > klibc yes, dietlibc with random other garbage in some random filesystem
> > which'd need maintaining - no thanks.
>
> You can build the dhcp client with glibc static into your initrd. Its hardly
> magic or special programs or random garbage, and last time I counted it came
> to one program. Dunno what the other 999 utilities your dhcp needs are ?

Sorry, but I must join Russell here. I have at least one machine which has a bootloader able to load exactly one file only. There is currently no way to load an initrd. It would need to implement the whole (BOOTP+)TFTP stuff again, just to get the initrd. So I was quite happy Linux 2.4 still knows about mounting an NFS root filesystem without user-space help.
Michael

From vda@port.imtp.ilyichevsk.odessa.ua Fri Mar 7 01:24:59 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 07 Mar 2003 01:25:03 -0800 (PST) Received: from Port.imtp.ilyichevsk.odessa.ua (169.imtp.Ilyichevsk.Odessa.UA [195.66.192.169] (may be forged)) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h279O5q9010501 for ; Fri, 7 Mar 2003 01:24:56 -0800 Received: from there ([172.16.42.177]) by Port.imtp.ilyichevsk.odessa.ua (8.10.2/8.10.2) with SMTP id h279Cpu07949; Fri, 7 Mar 2003 11:13:04 +0200 Message-Id: <200303070913.h279Cpu07949@Port.imtp.ilyichevsk.odessa.ua> Content-Type: text/plain; charset="koi8-r" From: Denis Vlasenko Reply-To: vda@port.imtp.ilyichevsk.odessa.ua To: Alan Cox , Russell King Subject: Re: Make ipconfig.c work as a loadable module. Date: Fri, 7 Mar 2003 11:10:15 +0200 X-Mailer: KMail [version 1.3.2] Cc: Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com References: <20030306231905.M838@flint.arm.linux.org.uk> <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> In-Reply-To: <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-archive-position: 1898 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: vda@port.imtp.ilyichevsk.odessa.ua Precedence: bulk X-list: netdev Content-Length: 558 Lines: 16

On 7 March 2003 02:29, Alan Cox wrote:
> On Thu, 2003-03-06 at 23:19, Russell King wrote:
> > "klibc doesnt really matter"
> >
> > I'd prefer not to have to have thousands of special programs around
> > just to be able to boot my machines, especially when it was all in-
> > kernel up until this point.
> >
> > klibc yes, dietlibc with random other garbage in some random
> > filesystem which'd need maintaining - no thanks.
>
> You can build the dhcp client with glibc static into your initrd.

Anything built static against glibc tends to be 400K+.
-- vda From rmk@arm.linux.org.uk Fri Mar 7 01:43:25 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 07 Mar 2003 01:43:29 -0800 (PST) Received: from caramon.arm.linux.org.uk (caramon.arm.linux.org.uk [212.18.232.186]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h279giq9027483 for ; Fri, 7 Mar 2003 01:43:25 -0800 Received: from flint.arm.linux.org.uk ([3ffe:8260:2002:1:201:2ff:fe14:8fad]) by caramon.arm.linux.org.uk with asmtp (TLSv1:DES-CBC3-SHA:168) (Exim 4.12) id 18rENB-0002o8-00; Fri, 07 Mar 2003 09:42:37 +0000 Received: from rmk by flint.arm.linux.org.uk with local (Exim 4.12) id 18rEN9-0003NV-00; Fri, 07 Mar 2003 09:42:35 +0000 Date: Fri, 7 Mar 2003 09:42:35 +0000 From: Russell King To: Chris Dukes Cc: Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. Message-ID: <20030307094235.A11807@flint.arm.linux.org.uk> Mail-Followup-To: Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> <20030306231905.M838@flint.arm.linux.org.uk> <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> <20030307000816.P838@flint.arm.linux.org.uk> <20030307012905.G20725@parcelfarce.linux.theplanet.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <20030307012905.G20725@parcelfarce.linux.theplanet.co.uk>; from pakrat@www.uk.linux.org on Fri, Mar 07, 2003 at 01:29:05AM +0000 X-archive-position: 1899 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: rmk@arm.linux.org.uk Precedence: bulk X-list: netdev Content-Length: 2213 Lines: 58 On Fri, Mar 07, 2003 at 01:29:05AM +0000, Chris Dukes 
wrote:
> If IBM can fit a kernel and a ramdisk containing all the utilities you
> describe and more in smaller than 5M of file for tftp, one would think
> that it could be done on Linux.

Wow. 5MB eh? We currently do NFS-root in 690K.

> > ipconfig.c does more than just configure networking. It's a far smaller
> > solution to NFS-root than any userspace implementation could ever hope
> > to be.
>
> That's nice. Would you mind explaining to us where that would be a
> benefit? Aside from dead header space in elf executables, I'm at
> a loss as to how a usermode implementation must be significantly
> larger than kernel code.

If you're suggesting above that "5MB isn't significantly larger than the size Linux can do this" then I think I've just proven you wrong.

Let's see - building a ramdisk to mount a root filesystem out of existing binaries would require from my existing systems probably something like:

   text    data     bss     dec     hex filename
1093047   21224   15560 1129831  113d67 /lib/libc.so.6
 515890   22320   16640  554850   87762 /bin/sh
  58540    2436    9776   70752   11460 /lib/libresolv.so.2
  53685    1476    5488   60649    ece9 /bin/mount
  45511     672     432   46615    b617 /bin/sed
  42830     624      40   43494    a9e6 /sbin/pump
  10783     500     104   11387    2c7b /lib/libtermcap.so.2
   8765     444      28    9237    2415 /lib/libdl.so.2

pump isn't really suitable for the task, but I don't have dhcpcd around. dhcpcd is even larger than pump however.

That's getting on for 2MB vs:

   2620    2012       0    4632    1218 fs/nfs/nfsroot.o
   8016     380      80    8476    211c net/ipv4/ipconfig.o

about 13K.

Which version is overly bloated?
Which version is huge?
Which version is compact?

Even the klibc ipconfig version is significantly larger than the in-kernel version - and klibc and its binaries are written to be small.

Note: I *do* agree that ipconfig.c needs to die before 2.6 but I do not agree that today is the right day.
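
As a quick check of the two totals quoted above ("getting on for 2MB" and "about 13K"), the `dec` column of the size listings can simply be summed. The sketch below is illustrative only: it is written in Python, the variable names are made up, and the numbers are copied from the listings in this message. Note that `dec` is text + data + bss, so it slightly overstates what would actually have to be stored or transferred, since bss takes no space in the file.

```python
# "dec" column (text + data + bss, in bytes) copied from the listings above.
userspace_pieces = {
    "/lib/libc.so.6": 1129831,
    "/bin/sh": 554850,
    "/lib/libresolv.so.2": 70752,
    "/bin/mount": 60649,
    "/bin/sed": 46615,
    "/sbin/pump": 43494,
    "/lib/libtermcap.so.2": 11387,
    "/lib/libdl.so.2": 9237,
}
in_kernel_pieces = {
    "fs/nfs/nfsroot.o": 4632,
    "net/ipv4/ipconfig.o": 8476,
}

userspace_total = sum(userspace_pieces.values())
in_kernel_total = sum(in_kernel_pieces.values())
ratio = userspace_total / in_kernel_total

if __name__ == "__main__":
    print("userspace pieces: %d bytes" % userspace_total)
    print("in-kernel pieces: %d bytes" % in_kernel_total)
    print("ratio: about %.0fx" % ratio)
```

The sums come to 1926815 bytes for the userspace pieces against 13108 bytes for the two kernel objects, which matches the rough figures given in the text.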
-- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html From seong@etri.re.kr Fri Mar 7 01:44:42 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 07 Mar 2003 01:44:50 -0800 (PST) Received: from cms1.etri.re.kr (cms1.etri.re.kr [129.254.16.11]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h279i0q9027673 for ; Fri, 7 Mar 2003 01:44:42 -0800 Received: from seong (129.254.172.40 [129.254.172.40]) by cms1.etri.re.kr with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id GCGCDZR9; Fri, 7 Mar 2003 18:43:44 +0900 Message-ID: <005101c2e48e$4016a4f0$28acfe81@seong> From: "Seong Moon" To: Subject: rtnetlink and multicast routing cache ? Date: Fri, 7 Mar 2003 18:45:13 +0900 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.50.4920.2300 X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4920.2300 X-archive-position: 1900 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: seong@etri.re.kr Precedence: bulk X-list: netdev Content-Length: 211 Lines: 12 Hi there! I want to get and monitor multicast routing cache information from kernel through rtnetlink. Is it possible ? I'm using linux2.4.18. If it is possible, What can I do for this ? thanks in advance. 
From bogdan.costescu@iwr.uni-heidelberg.de Fri Mar 7 03:46:15 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 07 Mar 2003 03:46:23 -0800 (PST) Received: from mail.iwr.uni-heidelberg.de (mail.iwr.uni-heidelberg.de [129.206.104.30]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h27BkDq9017803 for ; Fri, 7 Mar 2003 03:46:14 -0800 Received: from kenzo.iwr.uni-heidelberg.de (kenzo.iwr.uni-heidelberg.de [129.206.120.29]) by mail.iwr.uni-heidelberg.de (8.11.2/8.11.1) with ESMTP id h27BkBP23646; Fri, 7 Mar 2003 12:46:11 +0100 (MET) Received: from localhost (bogdan@localhost) by kenzo.iwr.uni-heidelberg.de (8.11.6/8.11.6) with ESMTP id h27BkBq31730; Fri, 7 Mar 2003 12:46:11 +0100 Date: Fri, 7 Mar 2003 12:46:11 +0100 (CET) From: Bogdan Costescu To: Russell King cc: Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , Subject: Re: Make ipconfig.c work as a loadable module. In-Reply-To: <20030307094235.A11807@flint.arm.linux.org.uk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 1901 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bogdan.costescu@iwr.uni-heidelberg.de Precedence: bulk X-list: netdev Content-Length: 1706 Lines: 36 On Fri, 7 Mar 2003, Russell King wrote: > Which version is overly bloated? > Which version is huge? > Which version is compact? ... and the size is not important only because we want to make everything smaller, but because of how it's commonly used (at least in the clustering world from which I come): the mainboard BIOS or NIC PROC contains PXE/DHCP client; data is transferred through UDP, with very poor (if any) congestion control. 
Congestion control means here both extreme situations: if packets don't arrive to the client, it might not ask again, ask only a limited number of times or give up after some timeout; if the server has some faster NIC to be able to handle more such requests, it might also send too fast for a single client which might drop packets. In some cases, if such situation occurs, the client just blocks there printing an error message on the console, without trying to restart the whole process and the only way to make it do something is to press the Reset button or plug in a keyboard... When you have tens or hundreds of such nodes, it's not a pleasure ! Booting a bunch of such nodes would become problematic if they need to transfer more data (=initrd) to start the kernel and so network booting would become less reliable. Please note that I'm not saying "ipconfig has to stay" - just that any solution should not dramatically increase the size of data transferred before the jump to kernel code. -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De From alan@lxorguk.ukuu.org.uk Fri Mar 7 03:49:09 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 07 Mar 2003 03:49:12 -0800 (PST) Received: from irongate.swansea.linux.org.uk (pc2-cwma1-4-cust86.swan.cable.ntl.com [213.105.254.86]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h27Bn7q9018184 for ; Fri, 7 Mar 2003 03:49:08 -0800 Received: from irongate.swansea.linux.org.uk (localhost [127.0.0.1]) by irongate.swansea.linux.org.uk (8.12.7/8.12.7) with ESMTP id h27CseYf020905; Fri, 7 Mar 2003 12:54:41 GMT Received: (from alan@localhost) by irongate.swansea.linux.org.uk (8.12.7/8.12.7/Submit) id h27CsbnY020903; Fri, 7 Mar 2003 12:54:37 GMT X-Authentication-Warning: irongate.swansea.linux.org.uk: alan set sender to alan@lxorguk.ukuu.org.uk 
using -f Subject: Re: Make ipconfig.c work as a loadable module. From: Alan Cox To: Michael Mueller Cc: Russell King , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com In-Reply-To: <200303070715.IAA27138@fire.malware.de> References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> <20030306231905.M838@flint.arm.linux.org.uk> <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> <200303070715.IAA27138@fire.malware.de> Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: Message-Id: <1047041676.20793.12.camel@irongate.swansea.linux.org.uk> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.1 (1.2.1-4) Date: 07 Mar 2003 12:54:36 +0000 X-archive-position: 1902 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@lxorguk.ukuu.org.uk Precedence: bulk X-list: netdev Content-Length: 499 Lines: 11 On Fri, 2003-03-07 at 07:15, Michael Mueller wrote: > Hi Alan, > Sorry, but I must join Russel here. I have atleast one machine which has > a bootloader able to load exactly one file only. There is currently no > way to load an initrd. It would need to implement the whole (BOOTP+)TFTP > stuff again, just to get the initrd. So I was quite happy linux 2.4 > still knows about mounting a NFS root filesystem without user-space > help. Just glue the initrd to the kernel. 
This is not rocket science From pakrat@www.linux.org.uk Fri Mar 7 05:38:58 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 07 Mar 2003 05:39:04 -0800 (PST) Received: from www.linux.org.uk (IDENT:exim@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h27DcEq9023589 for ; Fri, 7 Mar 2003 05:38:58 -0800 Received: from pakrat by www.linux.org.uk with local (Exim 3.33 #5) id 18rI3B-00027C-00; Fri, 07 Mar 2003 13:38:13 +0000 Date: Fri, 7 Mar 2003 13:38:13 +0000 From: Chris Dukes To: Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. Message-ID: <20030307133812.A6676@parcelfarce.linux.theplanet.co.uk> References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> <20030306231905.M838@flint.arm.linux.org.uk> <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> <20030307000816.P838@flint.arm.linux.org.uk> <20030307012905.G20725@parcelfarce.linux.theplanet.co.uk> <20030307094235.A11807@flint.arm.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <20030307094235.A11807@flint.arm.linux.org.uk>; from rmk@arm.linux.org.uk on Fri, Mar 07, 2003 at 09:42:35AM +0000 X-archive-position: 1903 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pakrat@www.uk.linux.org Precedence: bulk X-list: netdev Content-Length: 2009 Lines: 49 On Fri, Mar 07, 2003 at 09:42:35AM +0000, Russell King wrote: > On Fri, Mar 07, 2003 at 01:29:05AM +0000, Chris Dukes wrote: > > That's nice. Would you mind explaining to us where that would be a > > benefit? 
Aside from dead header space in elf executables, I'm at
> > a loss as to how a usermode implementation must be significantly
> > larger than kernel code.
>
> If you're suggesting above that "5MB isn't significantly larger than
> the size Linux can do this" then I think I've just proven you wrong.

The 5MB example is AIX.

>
> Let's see - building a ramdisk to mount a root filesystem out of existing
> binaries would require from my existing systems probably something like:
>

I said userspace. I did not say existing binaries.
[Size comparison of the kitchen sink vs kernel code deleted because it's comparing apples and oranges].

>
> Which version is overly bloated?
> Which version is huge?
> Which version is compact?

You are asserting aesthetics instead of benefits. I asked about benefits. Specifically, what is the benefit of compact? I'm sure you have a very good technical or business benefit to compact, but those of us in the world of workstations and servers have zero clue what it may be.

Another individual has already indicated a very valid technical merit to having it all in one file. I have the same problem myself. AIX and *BSD have a working approach to that problem.

>
> Even the klibc ipconfig version is significantly larger than the in-kernel
> version - and klibc and its binaries are written to be small.

User space solution is not the same as a solution implemented with multiple user space apps.

>
> Note: I *do* agree that ipconfig.c needs to die before 2.6 but I do not
> agree that today is the right day.

Perhaps you could explain why today is not the day.
(ie, soon to be shipping product that requires it. desire to see a viable userspace solution working before it is removed).

--
Chris Dukes
I tried being reasonable once--I didn't like it.
From rmk@arm.linux.org.uk Fri Mar 7 06:30:10 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 07 Mar 2003 06:30:35 -0800 (PST) Received: from caramon.arm.linux.org.uk (caramon.arm.linux.org.uk [212.18.232.186]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h27ETTq9003855 for ; Fri, 7 Mar 2003 06:30:10 -0800 Received: from flint.arm.linux.org.uk ([3ffe:8260:2002:1:201:2ff:fe14:8fad]) by caramon.arm.linux.org.uk with asmtp (TLSv1:DES-CBC3-SHA:168) (Exim 4.12) id 18rIqf-0003qp-00; Fri, 07 Mar 2003 14:29:21 +0000 Received: from rmk by flint.arm.linux.org.uk with local (Exim 4.12) id 18rIqe-00061Q-00; Fri, 07 Mar 2003 14:29:20 +0000 Date: Fri, 7 Mar 2003 14:29:20 +0000 From: Russell King To: Chris Dukes Cc: Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. Message-ID: <20030307142920.F17492@flint.arm.linux.org.uk> Mail-Followup-To: Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> <20030306231905.M838@flint.arm.linux.org.uk> <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> <20030307000816.P838@flint.arm.linux.org.uk> <20030307012905.G20725@parcelfarce.linux.theplanet.co.uk> <20030307094235.A11807@flint.arm.linux.org.uk> <20030307133812.A6676@parcelfarce.linux.theplanet.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <20030307133812.A6676@parcelfarce.linux.theplanet.co.uk>; from pakrat@www.uk.linux.org on Fri, Mar 07, 2003 at 01:38:13PM +0000 X-archive-position: 1904 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: rmk@arm.linux.org.uk Precedence: bulk 
X-list: netdev Content-Length: 2102 Lines: 50

On Fri, Mar 07, 2003 at 01:38:13PM +0000, Chris Dukes wrote:
> You are asserting aesthetics instead of benefits. I asked about benefits.
> Specifically, what is the benefit of compact?

Think _embedded_. Think "cost of flash chips". Think "not everything has a floppy disk".

> I'm sure you have a very good technical or business benefit to compact,

I'm sorry, believe it or not, but I'm not swayed by "business benefits" here. Although I have my own business in the UK, we as a business are currently involved in hardware design which has nothing to do with the points I'm raising here.

> but those of us in the world of workstations and servers have zero clue
> what it may be.

Indeed and understandable.

> User space solution is not the same as a solution implemented with
> multiple user space apps.

I've been working on klibc to work towards providing such a solution. I know what it involves, and I know that this solution isn't there yet. Also, the fundamentals of klibc have not been accepted by Linus, so we don't even know if this is going to be a solution yet.

> > Note: I *do* agree that ipconfig.c needs to die before 2.6 but I do not
> > agree that today is the right day.
>
> Perhaps you could explain why today is not the day.
> (ie, soon to be shipping product that requires it. desire to see a viable
> userspace solution working before it is removed).

Just about every ARM kernel development downloads kernels via XMODEM and the ability to bring networking up and mount a NFS-root filesystem is by far the easiest way to develop on *any* embedded device with Ethernet.

I suppose you could say I have a _community_ interest here - an interest in ensuring that the ARM community has the resources to be able to continue using Linux.

So, while the big server people run around removing functionality they don't need, they make other parts of the community suffer. Is that really what Open Source is about? Suffering?
8) -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html From root@chaos.analogic.com Fri Mar 7 08:23:17 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 07 Mar 2003 08:23:21 -0800 (PST) Received: from chaos.analogic.com (chaos.analogic.com [204.178.40.224]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h27GMaq9019749 for ; Fri, 7 Mar 2003 08:23:17 -0800 Received: (from root@localhost) by chaos.analogic.com (8.11.0.Beta3(chaos.analogic.com)/8.12.0.A) id h27GNur15326; Fri, 7 Mar 2003 11:23:56 -0500 Date: Fri, 7 Mar 2003 11:23:56 -0500 (EST) From: "Richard B. Johnson" X-Sender: root@chaos Reply-To: root@chaos.analogic.com To: Russell King cc: Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. In-Reply-To: <20030307142920.F17492@flint.arm.linux.org.uk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 1905 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: root@chaos.analogic.com Precedence: bulk X-list: netdev Content-Length: 3701 Lines: 85 On Fri, 7 Mar 2003, Russell King wrote: > On Fri, Mar 07, 2003 at 01:38:13PM +0000, Chris Dukes wrote: > > You are asserting aesthetics instead of benefits. I asked about benefits. > > Specifically, what is the benefit of compact? > > Think _embedded_. Think "cost of flash chips". Think "not everything > has a floppy disk". > > > I'm sure you have a very good technical or business benefit to compact, > > I'm sorry, believe it or not, but I'm not swayed by "business benefits" > here. Although I have my own business in the UK, we as a business are > currently involved in hardware design which has nothing to do with the > points I'm raising here. > > > but those of us in the world of workstations and servers have zero clue > > what it may be. 
>
> Indeed and understandable.
>
> > User space solution is not the same as a solution implemented with
> > multiple user space apps.
>
> I've been working on klibc to work towards providing such a solution.
> I know what it involves, and I know that this solution isn't there yet.
> Also, the fundamentals of klibc have not been accepted by Linus, so we
> don't even know if this is going to be a solution yet.
>
> > > Note: I *do* agree that ipconfig.c needs to die before 2.6 but I do not
> > > agree that today is the right day.
> >
> > Perhaps you could explain why today is not the day.
> > (ie, soon to be shipping product that requires it. desire to see a viable
> > userspace solution working before it is removed).
>
> Just about every ARM kernel development downloads kernels via XMODEM
> and the ability to bring networking up and mount a NFS-root filesystem
> is by far the easiest way to develop on *any* embedded device with
> Ethernet.
>
> I suppose you could say I have a _community_ interest here - an interest
> in ensuring that the ARM community has the resources to be able to continue
> using Linux.
>
> So, while the big server people run around removing functionality they
> don't need, they make other parts of the community suffer. Is that
> really what Open Source is about? Suffering? 8)
>
> --
> Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux
> http://www.arm.linux.org.uk/personal/aboutme.html
> -

As the kernel changes there are some things that really need to remain. You need to be able to boot from a "floppy" disk. Yes, nowadays it's probably not a real floppy, but a BIOS module that emulates a floppy. A lot of people don't realize that this is how a CD/ROM is booted! The BIOS configures it to "look" like a floppy for the purpose of booting. A "bootable" CD/ROM has as its first partition, the image of a floppy disk.

Also, many embedded systems boot from a "RAM" disk that emulates a floppy disk for the purpose of booting.
In fact, there is a good argument to make virtually all embedded systems that use the same CPU as the development environment boot this way. You can design, code, and test the whole damn thing while the hardware engineers are still laying out components.

One such RAM disk on our equipment pages in "sectors" through a tiny (0x1000) window which disappears after booting, so no address space is given up to some NVRAM. Linux is unmodified, thinking it was booted from a 1.44 MB floppy.

If the kernel grows to where this can't be done anymore, then embedded systems will not use modern kernels. It's that simple. So, increased functionality really needs to be put into modules so that the basic kernel doesn't continue to increase in size.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.

From malware@t-online.de Fri Mar 7 13:34:46 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 07 Mar 2003 13:34:50 -0800 (PST) Received: from mailout02.sul.t-online.com (mailout02.sul.t-online.com [194.25.134.17]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h27LYiq9014265 for ; Fri, 7 Mar 2003 13:34:45 -0800 Received: from fwd09.sul.t-online.de by mailout02.sul.t-online.com with smtp id 18rPUJ-0006oF-02; Fri, 07 Mar 2003 22:34:43 +0100 Received: from fire.malware.de (320008702754-0001@[193.158.189.2]) by fwd09.sul.t-online.com with esmtp id 18rPU8-11QWQKC; Fri, 7 Mar 2003 22:34:32 +0100 Message-Id: <200303072132.WAA02244@fire.malware.de> Date: Fri, 07 Mar 2003 22:33:07 +0100 From: malware@t-online.de (Michael Mueller) X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.4.20-pre8 i686) X-Accept-Language: en, de MIME-Version: 1.0 To: Alan Cox CC: Russell King , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module.
References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> <20030306231905.M838@flint.arm.linux.org.uk> <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> <200303070715.IAA27138@fire.malware.de> <1047041676.20793.12.camel@irongate.swansea.linux.org.uk> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Sender: 320008702754-0001@t-dialin.net X-archive-position: 1906 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: malware@t-online.de Precedence: bulk X-list: netdev Content-Length: 901 Lines: 26

Hi Alan,

you wrote:
> > Sorry, but I must join Russell here. I have at least one machine which has
> > a bootloader able to load exactly one file only. There is currently no
> > way to load an initrd. It would need to implement the whole (BOOTP+)TFTP
> > stuff again, just to get the initrd. So I was quite happy linux 2.4
> > still knows about mounting a NFS root filesystem without user-space
> > help.
>
> Just glue the initrd to the kernel. This is not rocket science

Do you have a sort of glue fixing the ramdisk support on m68k to support physically non-contiguous memory too? Otherwise I have only 1 MiB for the whole initrd.

So hopefully the removal of ipconfig.c, if decided for, does not propagate back into the 2.4 series. It would add a heap of useless work to do, just to get it up again.
Michael -- Linux@TekXpress http://www-users.rwth-aachen.de/Michael.Mueller4/tekxp/tekxp.html From wli@holomorphy.com Fri Mar 7 13:48:20 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 07 Mar 2003 13:48:22 -0800 (PST) Received: from holomorphy (mail@holomorphy.com [66.224.33.161]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h27LmHq9015257 for ; Fri, 7 Mar 2003 13:48:20 -0800 Received: from wli by holomorphy with local (Exim 3.35 #1 (Debian)) id 18rPgz-0006BU-00; Fri, 07 Mar 2003 13:47:49 -0800 Date: Fri, 7 Mar 2003 13:47:49 -0800 From: William Lee Irwin III To: Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. Message-ID: <20030307214749.GA20188@holomorphy.com> Mail-Followup-To: William Lee Irwin III , Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> <20030306231905.M838@flint.arm.linux.org.uk> <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> <20030307000816.P838@flint.arm.linux.org.uk> <20030307012905.G20725@parcelfarce.linux.theplanet.co.uk> <20030307094235.A11807@flint.arm.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030307094235.A11807@flint.arm.linux.org.uk> User-Agent: Mutt/1.3.28i Organization: The Domain of Holomorphy X-archive-position: 1907 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: wli@holomorphy.com Precedence: bulk X-list: netdev Content-Length: 418 Lines: 12 On Fri, Mar 07, 2003 at 09:42:35AM +0000, Russell King wrote: > That's getting on for 2MB vs: > 2620 2012 0 4632 1218 fs/nfs/nfsroot.o > 8016 380 80 8476 211c 
net/ipv4/ipconfig.o > about 13K. There's a cap on the maximum size of things various bootloaders can load via tftp; 2MB is relatively certain to blow it. ISTR the limit being something near 1MB for 2 of my boxen. -- wli From cfriesen@nortelnetworks.com Fri Mar 7 14:01:48 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 07 Mar 2003 14:01:52 -0800 (PST) Received: from zcars04e.nortelnetworks.com (zcars04e.nortelnetworks.com [47.129.242.56]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h27M1kq9018166 for ; Fri, 7 Mar 2003 14:01:47 -0800 Received: from zcard307.ca.nortel.com (zcard307.ca.nortel.com [47.129.242.67]) by zcars04e.nortelnetworks.com (Switch-2.2.5/Switch-2.2.0) with ESMTP id h27M0G420428; Fri, 7 Mar 2003 17:00:16 -0500 (EST) Received: from zcard0k6.ca.nortel.com ([47.129.242.158]) by zcard307.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id GDFBJSWB; Fri, 7 Mar 2003 17:00:16 -0500 Received: from pcard0ks.ca.nortel.com ([47.129.117.131]) by zcard0k6.ca.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id FSL7Z1ND; Fri, 7 Mar 2003 17:00:17 -0500 Received: from nortelnetworks.com (localhost.localdomain [127.0.0.1]) by pcard0ks.ca.nortel.com (Postfix) with ESMTP id 881ED2D957; Fri, 7 Mar 2003 17:00:15 -0500 (EST) Message-ID: <3E69166F.9080604@nortelnetworks.com> Date: Fri, 07 Mar 2003 17:00:15 -0500 X-Sybari-Space: 00000000 00000000 00000000 From: Chris Friesen User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204 X-Accept-Language: en-us MIME-Version: 1.0 To: William Lee Irwin III Cc: Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. 
References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> <20030306231905.M838@flint.arm.linux.org.uk> <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> <20030307000816.P838@flint.arm.linux.org.uk> <20030307012905.G20725@parcelfarce.linux.theplanet.co.uk> <20030307094235.A11807@flint.arm.linux.org.uk> <20030307214749.GA20188@holomorphy.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 1908 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: cfriesen@nortelnetworks.com Precedence: bulk X-list: netdev Content-Length: 981 Lines: 28 William Lee Irwin III wrote: > On Fri, Mar 07, 2003 at 09:42:35AM +0000, Russell King wrote: > >>That's getting on for 2MB vs: >> 2620 2012 0 4632 1218 fs/nfs/nfsroot.o >> 8016 380 80 8476 211c net/ipv4/ipconfig.o >>about 13K. >> > > There's a cap on the maximum size of things various bootloaders can > load via tftp; 2MB is relatively certain to blow it. ISTR the limit > being something near 1MB for 2 of my boxen. Since this is totally machine/architecture specific (we're tftp'ing 10MB kernel/ramdisk images to embedded PPC machines here) it might be a good idea to ask around and find what the most restrictive requirements are. Is 1MB the worst-case or does it get even tighter? 
Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com From ebiederm@xmission.com Fri Mar 7 18:04:01 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 07 Mar 2003 18:04:06 -0800 (PST) Received: from frodo.biederman.org (ebiederm.dsl.xmission.com [166.70.28.69]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2823xq9029628 for ; Fri, 7 Mar 2003 18:04:01 -0800 Received: (from eric@localhost) by frodo.biederman.org (8.9.3/8.9.3) id TAA15566; Fri, 7 Mar 2003 19:03:24 -0700 To: Bogdan Costescu Cc: Russell King , Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , Subject: Re: Make ipconfig.c work as a loadable module. References: From: ebiederm@xmission.com (Eric W. Biederman) Date: 07 Mar 2003 19:03:24 -0700 In-Reply-To: Message-ID: User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 1909 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ebiederm@xmission.com Precedence: bulk X-list: netdev Content-Length: 2070 Lines: 43 Bogdan Costescu writes: > On Fri, 7 Mar 2003, Russell King wrote: > > > Which version is overly bloated? > > Which version is huge? > > Which version is compact? > > ... and the size is not important only because we want to make everything > smaller, but because of how it's commonly used (at least in the clustering > world from which I come): > > the mainboard BIOS or NIC PROC contains PXE/DHCP client; data is > transferred through UDP, with very poor (if any) congestion control. Only because the implementations suck. See etherboot. 
> Congestion control means here both extreme situations: if packets don't > arrive to the client, it might not ask again, ask only a limited number of > times or give up after some timeout; if the server has some faster NIC to > be able to handle more such requests, it might also send too fast for a > single client which might drop packets. In some cases, if such situation > occurs, the client just blocks there printing an error message on the > console, without trying to restart the whole process and the only way to > make it do something is to press the Reset button or plug in a keyboard... > When you have tens or hundreds of such nodes, it's not a pleasure ! But this is all before the kernel is loaded. Having booted a 1000 node cluster with TFTP and DHCP. From a single host with even being in the same town, I think I have some room to talk. > Booting a bunch of such nodes would become problematic if they need > to transfer more data (=initrd) to start the kernel and so network booting > would become less reliable. Please note that I'm not saying "ipconfig has > to stay" - just that any solution should not dramatically increase the > size of data transferred before the jump to kernel code. Right. But I would suggest fixing your NBP (what PXE load) which must be < 64K anyway if you have noticeable reliability problems. Not that I even suggest using PXE for production use anyway. But sometimes you are stuck with what you can do. 
Eric From bogdan.costescu@iwr.uni-heidelberg.de Sat Mar 8 02:45:46 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 08 Mar 2003 02:45:48 -0800 (PST) Received: from mail.iwr.uni-heidelberg.de (mail.iwr.uni-heidelberg.de [129.206.104.30]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h28Ajiq9028107 for ; Sat, 8 Mar 2003 02:45:45 -0800 Received: from kenzo.iwr.uni-heidelberg.de (kenzo.iwr.uni-heidelberg.de [129.206.120.29]) by mail.iwr.uni-heidelberg.de (8.11.2/8.11.1) with ESMTP id h28AjgX21889; Sat, 8 Mar 2003 11:45:42 +0100 (MET) Received: from localhost (bogdan@localhost) by kenzo.iwr.uni-heidelberg.de (8.11.6/8.11.6) with ESMTP id h28Ajfd12460; Sat, 8 Mar 2003 11:45:41 +0100 Date: Sat, 8 Mar 2003 11:45:40 +0100 (CET) From: Bogdan Costescu To: "Eric W. Biederman" cc: Russell King , Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , Subject: Re: Make ipconfig.c work as a loadable module. In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 1910 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bogdan.costescu@iwr.uni-heidelberg.de Precedence: bulk X-list: netdev Content-Length: 1192 Lines: 33 On 7 Mar 2003, Eric W. Biederman wrote: > Only because the implementations suck. See etherboot. Agreed, but as you rightly say at the end of your message... > But sometimes you are stuck with what you can do. ... and you can't go use etherboot or whatever, you have to deal with it. You can deal with it today because ipconfig is small, you might not be able to deal with it tomorrow if you'll have to transfer twice as much because of a big initrd. > But this is all before the kernel is loaded. But that's exactly my point. The ipconfig functionality is needed and what I ask for is that whatever means (if any) are chosen to replace it, they should keep the low size. > Having booted a 1000 node cluster with TFTP and DHCP. 
I do not doubt this, but I'm afraid that you (or we) might not be able to do it again tomorrow. And probably this is an ideal case where you have used the better solution as client (etherboot)... -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De From dirch221@yahoo.co.in Sat Mar 8 02:50:40 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 08 Mar 2003 02:50:42 -0800 (PST) Received: from web8205.mail.in.yahoo.com (web8205.mail.in.yahoo.com [203.199.70.126]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h28Anwq9028527 for ; Sat, 8 Mar 2003 02:50:39 -0800 Message-ID: <20030308104950.66215.qmail@web8205.mail.in.yahoo.com> Received: from [202.54.65.65] by web8205.mail.in.yahoo.com via HTTP; Sat, 08 Mar 2003 10:49:50 GMT Date: Sat, 8 Mar 2003 10:49:50 +0000 (GMT) From: =?iso-8859-1?q?barkkarn=20aravinda?= Subject: protocol development To: netdev@oss.sgi.com MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="0-1071572380-1047120590=:63048" Content-Transfer-Encoding: 8bit X-archive-position: 1911 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dirch221@yahoo.co.in Precedence: bulk X-list: netdev Content-Length: 1086 Lines: 22 --0-1071572380-1047120590=:63048 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit hi iam arvind from india.i want to make some changes to tcp protocol and implement in my computer.can i use libpcap to write my own protocol.give me some suggestion about how to work on this project.what are the books that can help me in this project. bye Catch all the cricket action. Download Yahoo! Score tracker --0-1071572380-1047120590=:63048 Content-Type: text/html; charset=iso-8859-1 Content-Transfer-Encoding: 8bit

Catch all the cricket action. Download Yahoo! Score tracker --0-1071572380-1047120590=:63048-- From ebiederm@xmission.com Sat Mar 8 08:08:22 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 08 Mar 2003 08:08:24 -0800 (PST) Received: from frodo.biederman.org (ebiederm.dsl.xmission.com [166.70.28.69]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h28G7eq9008014 for ; Sat, 8 Mar 2003 08:08:21 -0800 Received: (from eric@localhost) by frodo.biederman.org (8.9.3/8.9.3) id JAA17848; Sat, 8 Mar 2003 09:07:11 -0700 To: Bogdan Costescu Cc: Russell King , Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , Subject: Re: Make ipconfig.c work as a loadable module. References: From: ebiederm@xmission.com (Eric W. Biederman) Date: 08 Mar 2003 09:07:11 -0700 In-Reply-To: Message-ID: User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 1912 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ebiederm@xmission.com Precedence: bulk X-list: netdev Content-Length: 2280 Lines: 62 Bogdan Costescu writes: > On 7 Mar 2003, Eric W. Biederman wrote: > > > Only because the implementations suck. See etherboot. > > Agreed, but as you rightly say at the end of your message... > > > But sometimes you are stuck with what you can do. > > ... and you can't go use etherboot or whatever, you have to deal with it. At the very least I can use etherboot as a NBP in PXE terms. So I have a reasonable client after the first tftp transaction. > You can deal with it today because ipconfig is small, you might not be > able to deal with it tomorrow if you'll have to transfer twice as much > because of a big initrd. I routinely support an initrd with: glibc. /bin/bash dhclient mke2fs mkreiserfs parted sfdisk mount pivot_root etc. (All binaries were striped though). 
And I usually have to pass a ramdisk_size=XXX option to the kernel, or my decompressed initial ramdisk is too large.

I use it for setting up a local filesystem on a cluster node. And I was able to set up an entire 1000 node cluster in about 15-20 minutes. (Multicast cuts down on the bandwidth requirements, which is very nice.)

With a good bootloader it does not matter much how big your initrd is. I totally agree that small is good and important. At the same time ipconfig.c is wrong. It is great during development and on systems with a single NIC. But the hard coded policies can be bad for production systems. Not that hard coded policies are bad in general, just the kernel is the wrong place to put them.

> > But this is all before the kernel is loaded.
>
> But that's exactly my point. The ipconfig functionality is needed and what
> I ask for is that whatever means (if any) are chosen to replace it, they
> should keep the low size.

Similar functionality is definitely needed.

>
> > Having booted a 1000 node cluster with TFTP and DHCP.
>
> I do not doubt this, but I'm afraid that you (or we) might not be able to
> do it again tomorrow. And probably this is an ideal case where you have
> used the better solution as client (etherboot)...

True. But when things are important and there is GPL'd firmware available that actually works properly, it is worth putting it on the requirements list of things to do.
Eric From rmk@arm.linux.org.uk Sat Mar 8 08:19:49 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 08 Mar 2003 08:19:51 -0800 (PST) Received: from caramon.arm.linux.org.uk (caramon.arm.linux.org.uk [212.18.232.186]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h28GJkq9008418 for ; Sat, 8 Mar 2003 08:19:48 -0800 Received: from flint.arm.linux.org.uk ([3ffe:8260:2002:1:201:2ff:fe14:8fad]) by caramon.arm.linux.org.uk with asmtp (TLSv1:DES-CBC3-SHA:168) (Exim 4.12) id 18rh2w-0000hK-00; Sat, 08 Mar 2003 16:19:38 +0000 Received: from rmk by flint.arm.linux.org.uk with local (Exim 4.12) id 18rh2v-0002Ad-00; Sat, 08 Mar 2003 16:19:37 +0000 Date: Sat, 8 Mar 2003 16:19:36 +0000 From: Russell King To: "Eric W. Biederman" Cc: Bogdan Costescu , Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. Message-ID: <20030308161936.C1896@flint.arm.linux.org.uk> Mail-Followup-To: "Eric W. Biederman" , Bogdan Costescu , Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: ; from ebiederm@xmission.com on Sat, Mar 08, 2003 at 09:07:11AM -0700 X-archive-position: 1913 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: rmk@arm.linux.org.uk Precedence: bulk X-list: netdev Content-Length: 1979 Lines: 45 On Sat, Mar 08, 2003 at 09:07:11AM -0700, Eric W. Biederman wrote: > With a good bootloader it does not much how big your initrd is. I > totally agree that small is good and important. At the same time > ipconfig.c is wrong. It is great during development and on systems > with a single NIC. But the hard coded policies can be bad for > production systems. 
Not that hard coded policies are bad in general
> just the kernel is the wrong place to put them.

With multi-NIC systems, it is perfectly possible to use ipconfig.c with one specific interface.

/*
 * Decode any IP configuration options in the "ip=" or "nfsaddrs=" kernel
 * command line parameter. It consists of option fields separated by colons in
 * the following order:
 *
 * <client-ip>:<server-ip>:<gw-ip>:<netmask>:<host name>:<device>:<PROTO>
 *
 * Any of the fields can be empty which means to use a default value:
 *   <client-ip>   - address given by BOOTP or RARP
 *   <server-ip>   - address of host returning BOOTP or RARP packet
 *   <gw-ip>       - none, or the address returned by BOOTP
 *   <netmask>     - automatically determined from <client-ip>, or the
 *                   one returned by BOOTP
 *   <host name>   - <client-ip> in ASCII notation, or the name returned
 *                   by BOOTP
 *   <device>      - use all available devices
 *   <PROTO>:
 *      off|none         - don't do autoconfig at all (DEFAULT)
 *      on|any           - use any configured protocol
 *      dhcp|bootp|rarp  - use only the specified protocol
 *      both             - use both BOOTP and RARP (not DHCP)
 */

ip=:::::eth0:dhcp

(I haven't actually tried this though.)

However, how do you configure your ramdisk via the boot loader to use a specific NIC / mount a specific filesystem, etc?

-- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html
From ebiederm@xmission.com Sat Mar 8 08:48:51 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 08 Mar 2003 08:48:54 -0800 (PST) Received: from frodo.biederman.org (ebiederm.dsl.xmission.com [166.70.28.69]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h28Gmoq9008957 for ; Sat, 8 Mar 2003 08:48:51 -0800 Received: (from eric@localhost) by frodo.biederman.org (8.9.3/8.9.3) id JAA17979; Sat, 8 Mar 2003 09:48:16 -0700 To: Russell King Cc: Bogdan Costescu , Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. References: <20030308161936.C1896@flint.arm.linux.org.uk> From: ebiederm@xmission.com (Eric W.
Biederman) Date: 08 Mar 2003 09:48:16 -0700 In-Reply-To: <20030308161936.C1896@flint.arm.linux.org.uk> Message-ID: User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 1914 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ebiederm@xmission.com Precedence: bulk X-list: netdev Content-Length: 1449 Lines: 36 Russell King writes: > On Sat, Mar 08, 2003 at 09:07:11AM -0700, Eric W. Biederman wrote: > > With a good bootloader it does not much how big your initrd is. I > > totally agree that small is good and important. At the same time > > ipconfig.c is wrong. It is great during development and on systems > > with a single NIC. But the hard coded policies can be bad for > > production systems. Not that hard coded policies are bad in general > > just the kernel is the wrong place to put them. > > With multi-NIC systems, it is perfectly possible to use ipconfig.c with > one specific interface. Sorry. I expressed that wrong. It is not multi-NIC that ipconfig.c gets wrong. It is multiple DHCP servers. You just get multiple dhcp servers when you have multiple NICs. The policies in ipconfig.c are quite good, they just are not universally applicable. But as ipconfig.c is in the kernel it tends to get used where it is inappropriate. > ip=:::::eth0:dhcp > > (I haven't actually tried this though.) I had forgotten about that one, and I believe it helps in some cases. > However, how do you configure your ramdisk via the boot loader to use > a specific NIC / mount a specific filesystem, etc? I can change the contents of my ramdisk as easily as I can change the kernel command line. For the complex setups just placing a configuration file in the ramdisk is what seems to work the best in practice. 
Eric From rmk@arm.linux.org.uk Sat Mar 8 09:05:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 08 Mar 2003 09:05:48 -0800 (PST) Received: from caramon.arm.linux.org.uk (caramon.arm.linux.org.uk [212.18.232.186]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h28H5gq9009426 for ; Sat, 8 Mar 2003 09:05:44 -0800 Received: from flint.arm.linux.org.uk ([3ffe:8260:2002:1:201:2ff:fe14:8fad]) by caramon.arm.linux.org.uk with asmtp (TLSv1:DES-CBC3-SHA:168) (Exim 4.12) id 18rhlO-0000rO-00; Sat, 08 Mar 2003 17:05:34 +0000 Received: from rmk by flint.arm.linux.org.uk with local (Exim 4.12) id 18rhlN-0002bB-00; Sat, 08 Mar 2003 17:05:33 +0000 Date: Sat, 8 Mar 2003 17:05:32 +0000 From: Russell King To: "Eric W. Biederman" Cc: Bogdan Costescu , Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. Message-ID: <20030308170532.D1896@flint.arm.linux.org.uk> Mail-Followup-To: "Eric W. Biederman" , Bogdan Costescu , Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com References: <20030308161936.C1896@flint.arm.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: ; from ebiederm@xmission.com on Sat, Mar 08, 2003 at 09:48:16AM -0700 X-archive-position: 1915 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: rmk@arm.linux.org.uk Precedence: bulk X-list: netdev Content-Length: 1018 Lines: 21 On Sat, Mar 08, 2003 at 09:48:16AM -0700, Eric W. Biederman wrote: > I can change the contents of my ramdisk as easily as I can change > the kernel command line. For the complex setups just placing > a configuration file in the ramdisk is what seems to work the best > in practice. 
You'll forgive me if I don't think that "change the contents of ramdisk" is as easy as changing the kernel command line. Last time I checked, to change the contents of a ramdisk image, you needed to ungzip it, mount it, make some changes, unmount it, re-gzip it, and re-install the thing. Or, in the case of initramfs, you need to rebuild the kernel image. Compare this to changing the kernel command line from "root=/dev/hda1" to "root=/dev/nfs ip=dhcp" in the boot loader by hitting a few keys on the keyboard before the kernel loads, and I think you'll start to get my point here. -- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html From ebiederm@xmission.com Sat Mar 8 12:51:40 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 08 Mar 2003 12:51:43 -0800 (PST) Received: from frodo.biederman.org (ebiederm.dsl.xmission.com [166.70.28.69]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h28Koxq9011235 for ; Sat, 8 Mar 2003 12:51:40 -0800 Received: (from eric@localhost) by frodo.biederman.org (8.9.3/8.9.3) id LAA18149; Sat, 8 Mar 2003 11:01:06 -0700 To: Russell King Cc: Bogdan Costescu , Chris Dukes , Alan Cox , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. References: <20030308161936.C1896@flint.arm.linux.org.uk> <20030308170532.D1896@flint.arm.linux.org.uk> From: ebiederm@xmission.com (Eric W. Biederman) Date: 08 Mar 2003 11:01:05 -0700 In-Reply-To: <20030308170532.D1896@flint.arm.linux.org.uk> Message-ID: User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 1916 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ebiederm@xmission.com Precedence: bulk X-list: netdev Content-Length: 3108 Lines: 73 Russell King writes: > On Sat, Mar 08, 2003 at 09:48:16AM -0700, Eric W. 
Biederman wrote:
> > I can change the contents of my ramdisk as easily as I can change
> > the kernel command line. For the complex setups just placing
> > a configuration file in the ramdisk is what seems to work the best
> > in practice.
>
> You'll forgive me if I don't think that "change the contents of ramdisk"
> is as easy as changing the kernel command line.
>
> Last time I checked, to change the contents of a ramdisk image, you needed
> to ungzip it, mount it, make some changes, unmount it, re-gzip it, and
> re-install the thing. Or, in the case of initramfs, you need to rebuild
> the kernel image. Compare this to changing the kernel command line from
> "root=/dev/hda1" to "root=/dev/nfs ip=dhcp" in the boot loader by hitting
> a few keys on the keyboard before the kernel loads, and I think you'll
> start to get my point here.

On the systems I am talking about, I currently have a directory structured like:

dir/config
dir/bzImage
dir/ramdisk
dir/ramdisk/sbin/init
dir/ramdisk/etc/
.....

So I edit dir/ramdisk/etc/somefile.conf and run a script that rebuilds everything. Or I edit dir/config, which has my command line in it, and run the script again. Getting to this point took a bit of effort, but that is where I am at now.

With initramfs as designed it becomes easier, because it is easier to build a cpio archive. But mkcramfs has similar properties for building filesystems.

The whole business of building the initramfs into the kernel is something that probably needs to be worked on, so the initramfs can be attached to the kernel separately. When the bootable kernel image is ELF that is easy. With something like bzImage on x86 it can be a pain, as there isn't any room to extend things.

And all I asserted is that for ``me'' it is as simple to change the ramdisk contents as to change those of a file. For something like /bin/kinit, which contains the default kernel policies on how to mount root, it should certainly be command line driven.
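
As an illustration of the command-line-driven approach, a user-space helper could split the value of an "ip=" option into its colon-separated fields, keeping empty fields so that defaults can be applied later. This is only a rough sketch: parse_ip_opt(), struct ip_opt and the index names are hypothetical, and the field order (client address, server address, gateway, netmask, host name, device, protocol) is an assumption based on the ipconfig.c comment quoted earlier in this thread.

```c
#include <stddef.h>
#include <string.h>

#define IP_OPT_FIELDS    7   /* assumed number of fields in "ip=" */
#define IP_OPT_FIELD_LEN 64  /* arbitrary per-field limit for this sketch */

/* Hypothetical index names; the order is an assumption, see above. */
enum { IP_OPT_DEVICE = 5, IP_OPT_PROTO = 6 };

struct ip_opt {
	char field[IP_OPT_FIELDS][IP_OPT_FIELD_LEN];
};

/*
 * Split the text following "ip=" on ':' into at most IP_OPT_FIELDS
 * fields. Parsing stops at whitespace or end of string, so the rest
 * of the command line may follow. Empty fields are stored as empty
 * strings. Returns the number of fields seen, or -1 if one is too long.
 */
static int parse_ip_opt(const char *value, struct ip_opt *out)
{
	int n = 0;

	memset(out, 0, sizeof(*out));
	while (n < IP_OPT_FIELDS) {
		size_t len = strcspn(value, ": \t\n");

		if (len >= IP_OPT_FIELD_LEN)
			return -1;
		memcpy(out->field[n], value, len);
		out->field[n][len] = '\0';
		n++;
		if (value[len] != ':')
			break;          /* whitespace or end of option */
		value += len + 1;       /* skip the ':' separator */
	}
	return n;
}
```

Given the value from "ip=:::::eth0:dhcp", this would leave the first five fields empty and store "eth0" and "dhcp" in the last two. A lone "ip=dhcp" comes back as a single field, which the caller would have to treat as a shorthand itself.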
For complicated setups where I am partitioning the hard drives, making filesystems, and installing over the network. A configuration file has proven to be easier, and that is what I do. The fundamental issue is that after a certain point the command line just does not have room for all of the parameters needed. Possibly I answered the wrong question? As for hitting a few keys on the keyboard in the bootloader before the kernel loads well.... That is good on one machine, it gets to be a pain on 4. And at a 1000 I have much better things to do with my time. Which just shows my bias from working on with clusters. On a cluster the only time you want to treat a machine as an individual is when you are replacing bad hardware. I have played with parsing command line options. And messing with /proc/cmdline or being /sbin/init and just getting those options from the kernel is not difficult. For prototyping it may be a good idea to read /proc/cmdline so the kernel can eat the options before kinit does. Eric From acme@conectiva.com.br Sat Mar 8 20:45:41 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 08 Mar 2003 20:45:44 -0800 (PST) Received: from orion.netbank.com.br (orion.netbank.com.br [200.203.199.90]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h294jbq9015616 for ; Sat, 8 Mar 2003 20:45:39 -0800 Received: from [200.181.170.60] (helo=brinquendo.conectiva.com.br) by orion.netbank.com.br with asmtp (Exim 3.33 #1) id 18rsiB-0007pl-00; Sun, 09 Mar 2003 01:46:59 -0300 Received: by brinquendo.conectiva.com.br (Postfix, from userid 500) id 4F9B01966C; Sun, 9 Mar 2003 04:46:34 +0000 (UTC) Date: Sun, 9 Mar 2003 01:46:33 -0300 From: Arnaldo Carvalho de Melo To: Alan Cox Cc: Michael Mueller , Russell King , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com Subject: Re: Make ipconfig.c work as a loadable module. 
Message-ID: <20030309044633.GC9359@conectiva.com.br> Mail-Followup-To: Arnaldo Carvalho de Melo , Alan Cox , Michael Mueller , Russell King , Jeff Garzik , Robin Holt , Linux Kernel Mailing List , netdev@oss.sgi.com References: <1046990052.18158.121.camel@irongate.swansea.linux.org.uk> <20030306221136.GB26732@gtf.org> <20030306222546.K838@flint.arm.linux.org.uk> <1046996037.18158.142.camel@irongate.swansea.linux.org.uk> <20030306231905.M838@flint.arm.linux.org.uk> <1046996987.17718.144.camel@irongate.swansea.linux.org.uk> <200303070715.IAA27138@fire.malware.de> <1047041676.20793.12.camel@irongate.swansea.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1047041676.20793.12.camel@irongate.swansea.linux.org.uk> User-Agent: Mutt/1.4i X-Url: http://advogato.org/person/acme X-archive-position: 1917 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: acme@conectiva.com.br Precedence: bulk X-list: netdev Content-Length: 887 Lines: 21 Em Fri, Mar 07, 2003 at 12:54:36PM +0000, Alan Cox escreveu: > On Fri, 2003-03-07 at 07:15, Michael Mueller wrote: > > Hi Alan, > > Sorry, but I must join Russel here. I have atleast one machine which has > > a bootloader able to load exactly one file only. There is currently no > > way to load an initrd. It would need to implement the whole (BOOTP+)TFTP > > stuff again, just to get the initrd. So I was quite happy linux 2.4 > > still knows about mounting a NFS root filesystem without user-space > > help. > > Just glue the initrd to the kernel. This is not rocket science arch/sparc/boot/piggyback.c Simple utility to make a single-image install kernel with initial ramdisk for Sparc tftpbooting without need to set up nfs. Copyright (C) 1996 Jakub Jelinek (jj@sunsite.mff.cuni.cz) Pete Zaitcev endian fixes for cross-compiles, 2000. 
- Arnaldo From seong@etri.re.kr Sun Mar 9 15:54:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 09 Mar 2003 15:54:39 -0800 (PST) Received: from cms1.etri.re.kr (cms1.etri.re.kr [129.254.16.11]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h29NsXq9005371 for ; Sun, 9 Mar 2003 15:54:34 -0800 Received: from SEONG ([129.254.172.40]) by cms1.etri.re.kr with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id GRNP2T9Q; Mon, 10 Mar 2003 08:54:13 +0900 Message-ID: <001201c2e697$6ace7280$28acfe81@seong> From: "Seong Moon" To: Subject: multicast routing cache monitoring? Date: Mon, 10 Mar 2003 08:55:49 +0900 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.50.4920.2300 X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4920.2300 X-archive-position: 1918 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: seong@etri.re.kr Precedence: bulk X-list: netdev Content-Length: 210 Lines: 11 Hi there! I want to get and monitor multicast routing cache information from kernel through rtnetlink. Is it possible ? I'm using linux2.4.18. If it is possible, What can I do for this ? thanks in advance. 
From ulrik.debie@newtec.be Mon Mar 10 07:16:52 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 10 Mar 2003 07:16:59 -0800 (PST) Received: from mailhost.newtec.be ([62.58.98.250]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2AFGnq9026141 for ; Mon, 10 Mar 2003 07:16:51 -0800 Received: from Newtec_gw-MTA by mailhost.newtec.be with Novell_GroupWise; Mon, 10 Mar 2003 16:16:46 +0100 Message-Id: X-Mailer: Novell GroupWise Internet Agent 6.0.2 Date: Mon, 10 Mar 2003 16:16:13 +0100 From: "Ulrik De Bie" To: , , Subject: Fwd: tcp seq nr wrapping bug + patch Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=_D887874E.62036C27" X-archive-position: 1919 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ulrik.debie@newtec.be Precedence: bulk X-list: netdev Content-Length: 1554 Lines: 63 --=_D887874E.62036C27 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Hello, I resend this patch which fixes a stupid mistake in the tcp sequence number in the 2.2 kernel. Kind regards, Ulrik De Bie --=_D887874E.62036C27 Content-Type: message/rfc822 Date: Wed, 11 Sep 2002 17:36:27 +0200 From: "Ulrik De Bie" To: , Subject: tcp seq nr wrapping bug + patch Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable Content-Disposition: inline When the sequence number in a tcp session is about to wrap for packets leaving the system, a problem arises: When the system call writev is called, with a count of 5 for instance, and the second iov entry makes the sequence number wrap, then the other 3 will be sent in separate packets, because the comparison will be wrong. before() fixes this problem. Sorry that I'm sending from a windows machine at the moment, I don't have a linux mail machine available at the very moment.
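
The wrap problem described above can be sketched in a few lines of C. This is a minimal illustration, not the kernel code itself: seq_less_naive() and seq_before() are hypothetical names, and seq_before() is modeled on what the kernel's before() helper is understood to do, namely test the sign of the 32-bit difference instead of comparing the raw values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Plain unsigned comparison: wrong once the sequence space has wrapped. */
static bool seq_less_naive(uint32_t a, uint32_t b)
{
	return a < b;
}

/*
 * Wrap-safe comparison: a is "before" b when the difference, read as a
 * signed 32-bit value, is negative. This holds as long as the two
 * sequence numbers are less than 2^31 apart. The unsigned-to-signed
 * conversion relies on two's complement behaviour, which is what
 * common compilers provide.
 */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}
```

With a = 0xFFFFFFF0 and b = 0x00000010 (b lies 32 bytes past a, across the wrap), the naive test says a is not less than b, while seq_before() still reports a as coming first. That is the case the writev scenario runs into.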
Kind regards, Ulrik De Bie udb@newtec.be
--- linux-2.2.21/net/ipv4/tcp.c Wed Sep 11 11:03:10 2002
+++ linux/net/ipv4/tcp.c Wed Sep 11 17:27:53 2002
@@ -823,7 +823,7 @@
  */
 if (skb_tailroom(skb) > 0 &&
 (mss_now - copy) > 0 &&
- tp->snd_nxt < TCP_SKB_CB(skb)->end_seq) {
+ before(tp->snd_nxt , TCP_SKB_CB(skb)->end_seq)) {
 int last_byte_was_odd = (copy % 4);
 
 /*
--=_D887874E.62036C27--
From johnpol@2ka.mipt.ru Mon Mar 10 11:23:27 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 10 Mar 2003 11:23:30 -0800 (PST) Received: from ffke-campus-gw.mipt.ru (ffke-campus-gw.mipt.ru [194.85.82.65]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2AJMgq9007672 for ; Mon, 10 Mar 2003 11:23:25 -0800 Received: from zanzibar.2ka.mipt.ru (zanzibar.2ka.mipt.ru [194.85.82.77]) by ffke-campus-gw.mipt.ru (8.12.8/8.12.8) with SMTP id h2AJMYBC030715 for ; Mon, 10 Mar 2003 22:22:34 +0300 Date: Mon, 10 Mar 2003 22:22:05 +0300 From: Evgeniy Polyakov To: netdev@oss.sgi.com Subject: netconsole for kernel 2.5.64 Message-Id: <20030310222205.0664b476.johnpol@2ka.mipt.ru> Reply-To: johnpol@2ka.mipt.ru Organization: MIPT X-Mailer: Sylpheed version 0.8.9 (GTK+ 1.2.10; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="Multipart_Mon__10_Mar_2003_22:22:05_+0300_082e02e8" X-archive-position: 1920 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: johnpol@2ka.mipt.ru Precedence: bulk X-list: netdev Content-Length: 19468 Lines: 273 This is a multi-part message in MIME format. --Multipart_Mon__10_Mar_2003_22:22:05_+0300_082e02e8 Content-Type: text/plain; charset=KOI8-R Content-Transfer-Encoding: quoted-printable X-MIME-Autoconverted: from 8bit to quoted-printable by ffke-campus-gw.mipt.ru id h2AJMYBC030715 Hello, developers. If anyone is still interested in Ingo Molnar's netconsole patch against the latest 2.5 tree, here it is. It is now statically linked into the kernel, so that it can capture at least part of the early dmesg output.
It sends UDP datagrams only to the broadcast address (255.255.255.255/ff:ff:ff:ff:ff:ff), and only from the source address 10.0.0.2. Both the source and target ports are statically set to 6666. It uses only the first net_device that is not a loopback or dummy device and whose flags can be changed. In theory, anyone can still load it with insmod and pass parameters. In practice, the module part differs from the original almost only in the replacement of __cli() and friends by cli()... netconsole_client is the latest from http://redhat.com/~mingo/netconsole-patches/ Evgeniy Polyakov ( s0mbre ) --Multipart_Mon__10_Mar_2003_22:22:05_+0300_082e02e8 Content-Type: application/octet-stream; name="netconsole-2.5.64.diff" Content-Disposition: attachment; filename="netconsole-2.5.64.diff" Content-Transfer-Encoding: base64 ZGlmZiAtTnJ1IC4uLzEvbGludXgtMi41LjY0L2RyaXZlcnMvbmV0LzNjNTl4LmMgLi9kcml2ZXJz L25ldC8zYzU5eC5jCi0tLSAuLi8xL2xpbnV4LTIuNS42NC9kcml2ZXJzL25ldC8zYzU5eC5jCVdl ZCBNYXIgIDUgMDY6Mjg6NTMgMjAwMworKysgLi9kcml2ZXJzL25ldC8zYzU5eC5jCU1vbiBNYXIg MTAgMjE6NTk6MzMgMjAwMwpAQCAtODg1LDYgKzg4NSw5IEBACiBzdGF0aWMgaW50IHZvcnRleF9p b2N0bChzdHJ1Y3QgbmV0X2RldmljZSAqZGV2LCBzdHJ1Y3QgaWZyZXEgKnJxLCBpbnQgY21kKTsK IHN0YXRpYyB2b2lkIHZvcnRleF90eF90aW1lb3V0KHN0cnVjdCBuZXRfZGV2aWNlICpkZXYpOwog c3RhdGljIHZvaWQgYWNwaV9zZXRfV09MKHN0cnVjdCBuZXRfZGV2aWNlICpkZXYpOworI2lmZGVm IEhBVkVfUE9MTF9DT05UUk9MTEVSCitzdGF0aWMgdm9pZCBfX3BvbGxfY29udHJvbGxlcihzdHJ1 Y3QgbmV0X2RldmljZSAqZGV2KTsgCisjZW5kaWYKIAwKIC8qIFRoaXMgZHJpdmVyIHVzZXMgJ29w dGlvbnMnIHRvIHBhc3MgdGhlIG1lZGlhIHR5cGUsIGZ1bGwtZHVwbGV4IGZsYWcsIGV0Yy4gKi8K IC8qIE9wdGlvbiBjb3VudCBsaW1pdCBvbmx5IC0tIHVubGltaXRlZCBpbnRlcmZhY2VzIGFyZSBz dXBwb3J0ZWQuICovCkBAIC05MzYsNiArOTM5LDIwIEBACiAKICNlbmRpZiAvKiBDT05GSUdfUE0g Ki8KIAorI2lmZGVmIEhBVkVfUE9MTF9DT05UUk9MTEVSCitzdGF0aWMgdm9pZCBfX3BvbGxfY29u dHJvbGxlcihzdHJ1Y3QgbmV0X2RldmljZSAqZGV2KQoreworCXN0cnVjdCB2b3J0ZXhfcHJpdmF0 ZSAqdnAgPSBkZXYtPnByaXY7CisJCisJZGlzYWJsZV9pcnEoZGV2LT5pcnEpOworCWlmICh2cC0+ ZnVsbF9idXNfbWFzdGVyX3J4KQorCQlib29tZXJhbmdfaW50ZXJydXB0KGRldi0+aXJxLCBkZXYs
IE5VTEwpOworCWVsc2UKKwkJdm9ydGV4X2ludGVycnVwdChkZXYtPmlycSwgZGV2LCBOVUxMKTsK KwllbmFibGVfaXJxKGRldi0+aXJxKTsKK30KKyNlbmRpZgorCiAjaWZkZWYgQ09ORklHX0VJU0EK IHN0YXRpYyBzdHJ1Y3QgZWlzYV9kZXZpY2VfaWQgdm9ydGV4X2Vpc2FfaWRzW10gPSB7CiAJeyAi VENNNTkyMCIgfSwKQEAgLTE0MzgsNiArMTQ1NSw5IEBACiAJCQkJKGRldi0+ZmVhdHVyZXMgJiBO RVRJRl9GX0lQX0NTVU0pID8gImVuIjoiZGlzIik7CiAJfQogCisjaWZkZWYgSEFWRV9QT0xMX0NP TlRST0xMRVIKKwlkZXYtPnBvbGxfY29udHJvbGxlciA9IF9fcG9sbF9jb250cm9sbGVyOworI2Vu ZGlmCiAJZGV2LT5zdG9wID0gdm9ydGV4X2Nsb3NlOwogCWRldi0+Z2V0X3N0YXRzID0gdm9ydGV4 X2dldF9zdGF0czsKIAlkZXYtPmRvX2lvY3RsID0gdm9ydGV4X2lvY3RsOwpCaW5hcnkgZmlsZXMg Li4vMS9saW51eC0yLjUuNjQvZHJpdmVycy9uZXQvM2M1OXgubyBhbmQgLi9kcml2ZXJzL25ldC8z YzU5eC5vIGRpZmZlcgpkaWZmIC1OcnUgLi4vMS9saW51eC0yLjUuNjQvZHJpdmVycy9uZXQvS2Nv bmZpZyAuL2RyaXZlcnMvbmV0L0tjb25maWcKLS0tIC4uLzEvbGludXgtMi41LjY0L2RyaXZlcnMv bmV0L0tjb25maWcJV2VkIE1hciAgNSAwNjoyOTozNCAyMDAzCisrKyAuL2RyaXZlcnMvbmV0L0tj b25maWcJTW9uIE1hciAxMCAyMTo1NzozMCAyMDAzCkBAIC0zOSw2ICszOSwyOCBAQAogCXNvdXJj ZSAiZHJpdmVycy9uZXQvYXJjbmV0L0tjb25maWciCiBlbmRpZgogCitjb25maWcgTkVUQ09OU09M RQorCXRyaXN0YXRlICJOZXR3b3JrIGNvbnNvbGUgc3VwcG9ydCIKKwlkZXBlbmRzIG9uIE5FVERF VklDRVMKKwktLS1oZWxwLS0tCisJTmV0d29yayBjb25zb2xlIGlzIGEgZGVidWdnaW5nIHRvb2wg dGhhdCBpbXBsZW1lbnRzIAorCWtlcm5lbC1sZXZlbCBuZXR3b3JrIGxvZ2dpbmcgdmlhIFVEUCBw YWNrZXRzLgorCisJdGhlIHNwZWNpYWwgdGhpbmcgYWJvdXQgdGhpcyBhcHByb2FjaCBpcyB0aGUg YWJpbGl0eSB0byBzZW5kICdlbWVyZ2VuY3knCisJbmV0d29yayBwYWNrZXRzIGV2ZW4gZnJvbSBJ UlEgaGFuZGxlcnMuIFRoaXMgZW5hYmxlcyB0aGUgbmV0Y29uc29sZSB0bworCXNlbmQgZW5vdWdo IGluZm8gZXZlbiBpZiB3ZSBjcmFzaCBpbiBpbml0IG9yIGluIGFuIGludGVycnVwdCBoYW5kbGVy LgorCisJYW5vdGhlciBwcm9wZXJ0eSBvZiBuZXRjb25zb2xlIGlzIHRoYXQgaXQncyBhYmxlIHRv IHNoYXJlIHRoZSBuZXR3b3JraW5nCisJZGV2aWNlIHdpdGggb3RoZXIga2VybmVsIHN1YnN5c3Rl bXMsIGxpa2UgdGhlIFRDUC9JUCBzdGFjay4gU28gdGhlCisJbmV0d29ya2luZyBkZXZpY2UgaXMg bm90IGRlZGljYXRlZCBmb3IgbmV0Y29uc29sZSB1c2UsIGl0J3MgdHJhbnNwYXJlbnRseQorCXNo 
YXJlZC4KKworCW5ldGNvbnNvbGUgaXMgYWxzbyBkZXNpZ25lZCB0byBiZSByb2J1c3QsIGl0IGdv ZXMgc3RyYWlnaHQgdG8gdGhlIG5ldHdvcmsKKwlkcml2ZXIsIHNvIGl0IGRvZXMgbm90IGRlcGVu ZCBvbiB0aGUgbmV0d29ya2luZyBzdGFjayB0byBsb2cgbWVzc2FnZXMuIAorCQorCWh0dHA6Ly9t YXJjLnRoZWFpbXNncm91cC5jb20vP2w9bGludXgta2VybmVsJm09MTAwMTUzNTE1MTI2OTEwJnc9 MgorCWh0dHA6Ly9yZWRoYXQuY29tL35taW5nby9uZXRjb25zb2xlLXBhdGNoZXMvCisKIGNvbmZp ZyBEVU1NWQogCXRyaXN0YXRlICJEdW1teSBuZXQgZHJpdmVyIHN1cHBvcnQiCiAJZGVwZW5kcyBv biBORVRERVZJQ0VTCmRpZmYgLU5ydSAuLi8xL2xpbnV4LTIuNS42NC9kcml2ZXJzL25ldC9NYWtl ZmlsZSAuL2RyaXZlcnMvbmV0L01ha2VmaWxlCi0tLSAuLi8xL2xpbnV4LTIuNS42NC9kcml2ZXJz L25ldC9NYWtlZmlsZQlXZWQgTWFyICA1IDA2OjI5OjA0IDIwMDMKKysrIC4vZHJpdmVycy9uZXQv TWFrZWZpbGUJTW9uIE1hciAxMCAyMTo1MDoxMyAyMDAzCkBAIC0xODksNSArMTg5LDcgQEAKIG9i ai0kKENPTkZJR19IQU1SQURJTykgKz0gaGFtcmFkaW8vCiBvYmotJChDT05GSUdfSVJEQSkgKz0g aXJkYS8KIAorb2JqLSQoQ09ORklHX05FVENPTlNPTEUpICs9IG5ldGNvbnNvbGUubworCiAKIGlu Y2x1ZGUgJChUT1BESVIpL2RyaXZlcnMvdXNiL25ldC9NYWtlZmlsZS5taWkKZGlmZiAtTnJ1IC4u LzEvbGludXgtMi41LjY0L2RyaXZlcnMvbmV0L25ldGNvbnNvbGUuYyAuL2RyaXZlcnMvbmV0L25l dGNvbnNvbGUuYwotLS0gLi4vMS9saW51eC0yLjUuNjQvZHJpdmVycy9uZXQvbmV0Y29uc29sZS5j CVRodSBKYW4gIDEgMDM6MDA6MDAgMTk3MAorKysgLi9kcml2ZXJzL25ldC9uZXRjb25zb2xlLmMJ TW9uIE1hciAxMCAyMjowMDo1MSAyMDAzCkBAIC0wLDAgKzEsMzcwIEBACisvKgorICogIGxpbnV4 L2RyaXZlcnMvbmV0L25ldGNvbnNvbGUuYworICoKKyAqICBDb3B5cmlnaHQgKEMpIDIwMDEgIElu Z28gTW9sbmFyIDxtaW5nb0ByZWRoYXQuY29tPgorICoKKyAqICBUaGlzIGZpbGUgY29udGFpbnMg dGhlIGltcGxlbWVudGF0aW9uIG9mIGFuIElSUS1zYWZlLCBjcmFzaC1zYWZlCisgKiAga2VybmVs IGNvbnNvbGUgaW1wbGVtZW50YXRpb24gdGhhdCBvdXRwdXRzIGtlcm5lbCBtZXNzYWdlcyB0byB0 aGUKKyAqICBuZXR3b3JrLgorICoKKyAqIE1vZGlmaWNhdGlvbiBoaXN0b3J5OgorICoKKyAqIDIw MDEtMDktMTcgICAgc3RhcnRlZCBieSBJbmdvIE1vbG5hci4KKyAqIDIwMDMtMDMtMTAJICBwb3J0 ZWQgdG8gMi41IGFuZCBsaW5rZWQgaW50byB0aGUga2VybmVsCisgKgkJCWJ5IEV2Z2VuaXkgUG9s eWFrb3YgPGpvaG5wb2xAMmthLm1pcHQucnU+CisgKi8KKworLyoqKioqKioqKioqKioqKioqKioq 
KioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioKKyAqICAgICAgVGhp cyBwcm9ncmFtIGlzIGZyZWUgc29mdHdhcmU7IHlvdSBjYW4gcmVkaXN0cmlidXRlIGl0IGFuZC9v ciBtb2RpZnkKKyAqICAgICAgaXQgdW5kZXIgdGhlIHRlcm1zIG9mIHRoZSBHTlUgR2VuZXJhbCBQ dWJsaWMgTGljZW5zZSBhcyBwdWJsaXNoZWQgYnkKKyAqICAgICAgdGhlIEZyZWUgU29mdHdhcmUg Rm91bmRhdGlvbjsgZWl0aGVyIHZlcnNpb24gMiwgb3IgKGF0IHlvdXIgb3B0aW9uKQorICogICAg ICBhbnkgbGF0ZXIgdmVyc2lvbi4KKyAqCisgKiAgICAgIFRoaXMgcHJvZ3JhbSBpcyBkaXN0cmli dXRlZCBpbiB0aGUgaG9wZSB0aGF0IGl0IHdpbGwgYmUgdXNlZnVsLAorICogICAgICBidXQgV0lU SE9VVCBBTlkgV0FSUkFOVFk7IHdpdGhvdXQgZXZlbiB0aGUgaW1wbGllZCB3YXJyYW50eSBvZgor ICogICAgICBNRVJDSEFOVEFCSUxJVFkgb3IgRklUTkVTUyBGT1IgQSBQQVJUSUNVTEFSIFBVUlBP U0UuICBTZWUgdGhlCisgKiAgICAgIEdOVSBHZW5lcmFsIFB1YmxpYyBMaWNlbnNlIGZvciBtb3Jl IGRldGFpbHMuCisgKgorICogICAgICBZb3Ugc2hvdWxkIGhhdmUgcmVjZWl2ZWQgYSBjb3B5IG9m IHRoZSBHTlUgR2VuZXJhbCBQdWJsaWMgTGljZW5zZQorICogICAgICBhbG9uZyB3aXRoIHRoaXMg cHJvZ3JhbTsgaWYgbm90LCB3cml0ZSB0byB0aGUgRnJlZSBTb2Z0d2FyZQorICogICAgICBGb3Vu ZGF0aW9uLCBJbmMuLCA2NzUgTWFzcyBBdmUsIENhbWJyaWRnZSwgTUEgMDIxMzksIFVTQS4KKyAq CisgKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioqKioq KioqKioqKioqKi8KKworI2luY2x1ZGUgPG5ldC90Y3AuaD4KKyNpbmNsdWRlIDxuZXQvdWRwLmg+ CisjaW5jbHVkZSA8bGludXgvbW0uaD4KKyNpbmNsdWRlIDxsaW51eC90dHkuaD4KKyNpbmNsdWRl IDxsaW51eC9pbml0Lmg+CisjaW5jbHVkZSA8bGludXgvbW9kdWxlLmg+CisjaW5jbHVkZSA8YXNt L3VuYWxpZ25lZC5oPgorI2luY2x1ZGUgPGxpbnV4L2NvbnNvbGUuaD4KKyNpbmNsdWRlIDxsaW51 eC9pbmV0Lmg+CisjaW5jbHVkZSA8bGludXgvc21wX2xvY2suaD4KKyNpbmNsdWRlIDxsaW51eC9u ZXRkZXZpY2UuaD4KKyNpbmNsdWRlIDxsaW51eC90dHlfZHJpdmVyLmg+CisjaW5jbHVkZSA8bGlu dXgvZXRoZXJkZXZpY2UuaD4KKworI2RlZmluZSBERUZQT1JUIDY2NjYKKyNkZWZpbmUgREVGQURE UiBJTkFERFJfQlJPQURDQVNUCisKK3N0YXRpYyBzdHJ1Y3QgbmV0X2RldmljZSAqbmV0Y29uc29s ZV9kZXY7CitzdGF0aWMgdTE2IHNvdXJjZV9wb3J0ID0gREVGUE9SVCwgdGFyZ2V0X3BvcnQgPSBE RUZQT1JUOworc3RhdGljIHUzMiBzb3VyY2VfaXAsIHRhcmdldF9pcCA9IERFRkFERFI7CitzdGF0 
aWMgdW5zaWduZWQgY2hhciBkYWRkcls2XSA9IHsweGZmLCAweGZmLCAweGZmLCAweGZmLCAweGZm LCAweGZmfSA7CisKKyNkZWZpbmUgTkVUQ09OU09MRV9WRVJTSU9OIDB4MDEKKyNkZWZpbmUgSEVB REVSX0xFTiA1CisKKyNkZWZpbmUgTUFYX1VEUF9DSFVOSyAxNDYwCisjZGVmaW5lIE1BWF9QUklO VF9DSFVOSyAoTUFYX1VEUF9DSFVOSy1IRUFERVJfTEVOKQorCisvKgorICogV2UgbWFpbnRhaW4g YSBzbWFsbCBwb29sIG9mIGZ1bGx5LXNpemVkIHNrYnMsCisgKiB0byBtYWtlIHN1cmUgdGhlIG1l c3NhZ2UgZ2V0cyBvdXQgZXZlbiBpbgorICogZXh0cmVtZSBPT00gc2l0dWF0aW9ucy4KKyAqLwor I2RlZmluZSBNQVhfTkVUQ09OU09MRV9TS0JTIDMyCisKK3N0YXRpYyBzcGlubG9ja190IG5ldGNv bnNvbGVfbG9jayA9IFNQSU5fTE9DS19VTkxPQ0tFRDsKK3N0YXRpYyBpbnQgbnJfbmV0Y29uc29s ZV9za2JzOworc3RhdGljIHN0cnVjdCBza19idWZmICpuZXRjb25zb2xlX3NrYnM7CisKKyNkZWZp bmUgTUFYX1NLQl9TSVpFIFwKKwkJKE1BWF9VRFBfQ0hVTksgKyBzaXplb2Yoc3RydWN0IHVkcGhk cikgKyBcCisJCQkJc2l6ZW9mKHN0cnVjdCBpcGhkcikgKyBzaXplb2Yoc3RydWN0IGV0aGhkcikp CisKK3N0YXRpYyB2b2lkIF9fcmVmaWxsX25ldGNvbnNvbGVfc2ticyh2b2lkKQoreworCXN0cnVj dCBza19idWZmICpza2I7CisJdW5zaWduZWQgbG9uZyBmbGFnczsKKworCXNwaW5fbG9ja19pcnFz YXZlKCZuZXRjb25zb2xlX2xvY2ssIGZsYWdzKTsKKwl3aGlsZSAobnJfbmV0Y29uc29sZV9za2Jz IDwgTUFYX05FVENPTlNPTEVfU0tCUykgeworCQlza2IgPSBhbGxvY19za2IoTUFYX1NLQl9TSVpF LCBHRlBfQVRPTUlDKTsKKwkJaWYgKCFza2IpCisJCQlicmVhazsKKwkJaWYgKG5ldGNvbnNvbGVf c2ticykKKwkJCXNrYi0+bmV4dCA9IG5ldGNvbnNvbGVfc2ticzsKKwkJZWxzZQorCQkJc2tiLT5u ZXh0ID0gTlVMTDsKKwkJbmV0Y29uc29sZV9za2JzID0gc2tiOworCQlucl9uZXRjb25zb2xlX3Nr YnMrKzsKKwl9CisJc3Bpbl91bmxvY2tfaXJxcmVzdG9yZSgmbmV0Y29uc29sZV9sb2NrLCBmbGFn cyk7Cit9CisKK3N0YXRpYyBzdHJ1Y3Qgc2tfYnVmZiAqIGdldF9uZXRjb25zb2xlX3NrYih2b2lk KQoreworCXN0cnVjdCBza19idWZmICpza2I7CisKKwl1bnNpZ25lZCBsb25nIGZsYWdzOworCisJ c3Bpbl9sb2NrX2lycXNhdmUoJm5ldGNvbnNvbGVfbG9jaywgZmxhZ3MpOworCXNrYiA9IG5ldGNv bnNvbGVfc2ticzsKKwlpZiAoc2tiKQorCQluZXRjb25zb2xlX3NrYnMgPSBza2ItPm5leHQ7CisJ c2tiLT5uZXh0ID0gTlVMTDsKKwlucl9uZXRjb25zb2xlX3NrYnMtLTsKKwlzcGluX3VubG9ja19p cnFyZXN0b3JlKCZuZXRjb25zb2xlX2xvY2ssIGZsYWdzKTsKKworCXJldHVybiBza2I7Cit9CisK 
K3N0YXRpYyBzcGlubG9ja190IHNlcXVlbmNlX2xvY2sgPSBTUElOX0xPQ0tfVU5MT0NLRUQ7Citz dGF0aWMgdW5zaWduZWQgaW50IG9mZnNldDsKKworc3RhdGljIHZvaWQgc2VuZF9uZXRjb25zb2xl X3NrYihzdHJ1Y3QgbmV0X2RldmljZSAqZGV2LCBjb25zdCBjaGFyICptc2csIHVuc2lnbmVkIGlu dCBtc2dfbGVuKQoreworCWludCB0b3RhbF9sZW4sIGV0aF9sZW4sIGlwX2xlbiwgdWRwX2xlbjsK Kwl1bnNpZ25lZCBsb25nIGZsYWdzOworCXN0cnVjdCBza19idWZmICpza2I7CisJc3RydWN0IHVk cGhkciAqdWRwaDsKKwlzdHJ1Y3QgaXBoZHIgKmlwaDsKKwlzdHJ1Y3QgZXRoaGRyICpldGg7CisK Kwl1ZHBfbGVuID0gbXNnX2xlbiArIEhFQURFUl9MRU4gKyBzaXplb2YoKnVkcGgpOworCWlwX2xl biA9IGV0aF9sZW4gPSB1ZHBfbGVuICsgc2l6ZW9mKCppcGgpOworCXRvdGFsX2xlbiA9IGV0aF9s ZW4gKyBFVEhfSExFTjsKKworCWlmIChucl9uZXRjb25zb2xlX3NrYnMgPCBNQVhfTkVUQ09OU09M RV9TS0JTKQorCQlfX3JlZmlsbF9uZXRjb25zb2xlX3NrYnMoKTsKKworCXNrYiA9IGFsbG9jX3Nr Yih0b3RhbF9sZW4sIEdGUF9BVE9NSUMpOworCWlmICghc2tiKSB7CisJCXNrYiA9IGdldF9uZXRj b25zb2xlX3NrYigpOworCQlpZiAoIXNrYikKKwkJCS8qIHRvdWdoISAqLworCQkJcmV0dXJuOwor CX0KKworCWF0b21pY19zZXQoJnNrYi0+dXNlcnMsIDEpOworCXNrYl9yZXNlcnZlKHNrYiwgdG90 YWxfbGVuIC0gbXNnX2xlbiAtIEhFQURFUl9MRU4pOworCXNrYi0+ZGF0YVswXSA9IE5FVENPTlNP TEVfVkVSU0lPTjsKKworCXNwaW5fbG9ja19pcnFzYXZlKCZzZXF1ZW5jZV9sb2NrLCBmbGFncyk7 CisJcHV0X3VuYWxpZ25lZChodG9ubChvZmZzZXQpLCAodTMyICopIChza2ItPmRhdGEgKyAxKSk7 CisJb2Zmc2V0ICs9IG1zZ19sZW47CisJc3Bpbl91bmxvY2tfaXJxcmVzdG9yZSgmc2VxdWVuY2Vf bG9jaywgZmxhZ3MpOworCisJbWVtY3B5KHNrYi0+ZGF0YSArIEhFQURFUl9MRU4sIG1zZywgbXNn X2xlbik7CisJc2tiLT5sZW4gKz0gbXNnX2xlbiArIEhFQURFUl9MRU47CisKKwl1ZHBoID0gKHN0 cnVjdCB1ZHBoZHIgKikgc2tiX3B1c2goc2tiLCBzaXplb2YoKnVkcGgpKTsKKwl1ZHBoLT5zb3Vy Y2UgPSBzb3VyY2VfcG9ydDsKKwl1ZHBoLT5kZXN0ID0gdGFyZ2V0X3BvcnQ7CisJdWRwaC0+bGVu ID0gaHRvbnModWRwX2xlbik7CisJdWRwaC0+Y2hlY2sgPSAwOworCisJaXBoID0gKHN0cnVjdCBp cGhkciAqKXNrYl9wdXNoKHNrYiwgc2l6ZW9mKCppcGgpKTsKKworCWlwaC0+dmVyc2lvbiAgPSA0 OworCWlwaC0+aWhsICAgICAgPSA1OworCWlwaC0+dG9zICAgICAgPSAwOworICAgICAgICBpcGgt PnRvdF9sZW4gID0gaHRvbnMoaXBfbGVuKTsKKwlpcGgtPmlkICAgICAgID0gMDsKKwlpcGgtPmZy 
YWdfb2ZmID0gMDsKKwlpcGgtPnR0bCAgICAgID0gNjQ7CisgICAgICAgIGlwaC0+cHJvdG9jb2wg PSBJUFBST1RPX1VEUDsKKwlpcGgtPmNoZWNrICAgID0gMDsKKyAgICAgICAgaXBoLT5zYWRkciAg ICA9IHNvdXJjZV9pcDsKKyAgICAgICAgaXBoLT5kYWRkciAgICA9IHRhcmdldF9pcDsKKwlpcGgt PmNoZWNrICAgID0gaXBfZmFzdF9jc3VtKCh1bnNpZ25lZCBjaGFyICopaXBoLCBpcGgtPmlobCk7 CisKKwlldGggPSAoc3RydWN0IGV0aGhkciAqKSBza2JfcHVzaChza2IsIEVUSF9ITEVOKTsKKwor CWV0aC0+aF9wcm90byA9IGh0b25zKEVUSF9QX0lQKTsKKwltZW1jcHkoZXRoLT5oX3NvdXJjZSwg ZGV2LT5kZXZfYWRkciwgZGV2LT5hZGRyX2xlbik7CisJbWVtY3B5KGV0aC0+aF9kZXN0LCBkYWRk ciwgZGV2LT5hZGRyX2xlbik7CisKK3JlcGVhdDoKKwlzcGluX2xvY2soJmRldi0+eG1pdF9sb2Nr KTsKKwlkZXYtPnhtaXRfbG9ja19vd25lciA9IHNtcF9wcm9jZXNzb3JfaWQoKTsKKworCWlmIChu ZXRpZl9xdWV1ZV9zdG9wcGVkKGRldikpIHsKKwkJZGV2LT54bWl0X2xvY2tfb3duZXIgPSAtMTsK KwkJc3Bpbl91bmxvY2soJmRldi0+eG1pdF9sb2NrKTsKKworCQlkZXYtPnBvbGxfY29udHJvbGxl cihkZXYpOworCQlnb3RvIHJlcGVhdDsKKwl9CisKKwlkZXYtPmhhcmRfc3RhcnRfeG1pdChza2Is IGRldik7CisKKwlkZXYtPnhtaXRfbG9ja19vd25lciA9IC0xOworCXNwaW5fdW5sb2NrKCZkZXYt PnhtaXRfbG9jayk7Cit9CisKK3N0YXRpYyB2b2lkIHdyaXRlX25ldGNvbnNvbGVfbXNnKHN0cnVj dCBjb25zb2xlICpjb24sIGNvbnN0IGNoYXIgKm1zZywgdW5zaWduZWQgaW50IG1zZ19sZW4pCit7 CisJaW50IGxlbiwgbGVmdDsKKwlzdHJ1Y3QgbmV0X2RldmljZSAqZGV2OworCisJZGV2ID0gbmV0 Y29uc29sZV9kZXY7CisJaWYgKCFkZXYpCisJCXJldHVybjsKKworCWlmIChkZXYtPnBvbGxfY29u dHJvbGxlciAmJiBuZXRpZl9ydW5uaW5nKGRldikpIHsKKwkJdW5zaWduZWQgbG9uZyBmbGFnczsK KworCQlzYXZlX2ZsYWdzKGZsYWdzKTsKKwkJY2xpKCk7CisJCWxlZnQgPSBtc2dfbGVuOworcmVw ZWF0OgorCQlpZiAobGVmdCA+IE1BWF9QUklOVF9DSFVOSykKKwkJCWxlbiA9IE1BWF9QUklOVF9D SFVOSzsKKwkJZWxzZQorCQkJbGVuID0gbGVmdDsKKwkJc2VuZF9uZXRjb25zb2xlX3NrYihkZXYs IG1zZywgbGVuKTsKKwkJbXNnICs9IGxlbjsKKwkJbGVmdCAtPSBsZW47CisJCWlmIChsZWZ0KQor CQkJZ290byByZXBlYXQ7CisJCXJlc3RvcmVfZmxhZ3MoZmxhZ3MpOworCX0KK30KKworc3RhdGlj IGNoYXIgKmRldjsKK3N0YXRpYyBpbnQgdGFyZ2V0X2V0aF9ieXRlMCA9IDI1NTsKK3N0YXRpYyBp bnQgdGFyZ2V0X2V0aF9ieXRlMSA9IDI1NTsKK3N0YXRpYyBpbnQgdGFyZ2V0X2V0aF9ieXRlMiA9 
IDI1NTsKK3N0YXRpYyBpbnQgdGFyZ2V0X2V0aF9ieXRlMyA9IDI1NTsKK3N0YXRpYyBpbnQgdGFy Z2V0X2V0aF9ieXRlNCA9IDI1NTsKK3N0YXRpYyBpbnQgdGFyZ2V0X2V0aF9ieXRlNSA9IDI1NTsK KworI2lmZGVmIE1PRFVMRQorTU9EVUxFX1BBUk0odGFyZ2V0X2lwLCAiaSIpOworTU9EVUxFX1BB Uk0odGFyZ2V0X2V0aF9ieXRlMCwgImkiKTsKK01PRFVMRV9QQVJNKHRhcmdldF9ldGhfYnl0ZTEs ICJpIik7CitNT0RVTEVfUEFSTSh0YXJnZXRfZXRoX2J5dGUyLCAiaSIpOworTU9EVUxFX1BBUk0o dGFyZ2V0X2V0aF9ieXRlMywgImkiKTsKK01PRFVMRV9QQVJNKHRhcmdldF9ldGhfYnl0ZTQsICJp Iik7CitNT0RVTEVfUEFSTSh0YXJnZXRfZXRoX2J5dGU1LCAiaSIpOworTU9EVUxFX1BBUk0oc291 cmNlX3BvcnQsICJoIik7CitNT0RVTEVfUEFSTSh0YXJnZXRfcG9ydCwgImgiKTsKK01PRFVMRV9Q QVJNKGRldiwgInMiKTsKKyNlbmRpZgorCitzdGF0aWMgc3RydWN0IGNvbnNvbGUgbmV0Y29uc29s ZSA9CisJIHsgZmxhZ3M6IENPTl9FTkFCTEVELCB3cml0ZTogd3JpdGVfbmV0Y29uc29sZV9tc2cg fTsKKworI2lmIDAKK3N0YXRpYyBpbnQgbmNfcmVjdl9wYWNrZXQoc3RydWN0IHNrX2J1ZmYgKnNr Yiwgc3RydWN0IG5ldF9kZXZpY2UgKmRldiwgc3RydWN0IHBhY2tldF90eXBlICpwdCk7CitzdGF0 aWMgc3RydWN0IHBhY2tldF90eXBlIG5jX3BhY2tldF90eXBlIF9faW5pdGRhdGEgPSB7CisJLnR5 cGUgPSBfX2NvbnN0YW50X2h0b25zKEVUSF9QX0lQKSwKKwkuZnVuYyA9IG5jX3JlY3ZfcGFja2V0 LAorfTsKKworc3RhdGljIHZvaWQgbmNfc2xlZXAoaW50IHNlYykKK3sKKwl1bnNpZ25lZCBsb25n IGppZmY7CisJCisJamlmZiA9IGppZmZpZXMgKyA1KkhaOworCXdoaWxlICh0aW1lX2JlZm9yZShq aWZmaWVzLCBqaWZmKSkKKwkJOworCit9CitzdGF0aWMgaW50IG5jX3JlY3ZfcGFja2V0KHN0cnVj dCBza19idWZmICpza2IsIHN0cnVjdCBuZXRfZGV2aWNlICpkZXYsIHN0cnVjdCBwYWNrZXRfdHlw ZSAqcHQpCit7CisJcmV0dXJuIDA7Cit9CisjZW5kaWYKK3N0YXRpYyBzdHJ1Y3QgbmV0X2Rldmlj ZSAqbmNfaW5pdF9uZXRjb25zb2xlKCkKK3sKKwlzdHJ1Y3QgbmV0X2RldmljZSAqX2RldjsKKwlp bnQgY291bnQgPSAwOworCisJcnRubF9zaGxvY2soKTsKKwlmb3IgKF9kZXY9ZGV2X2Jhc2U7IF9k ZXY7IF9kZXYgPSBfZGV2LT5uZXh0KQorCXsKKwkJcHJpbnRrKEtFUk5fSU5GTyAiJXM6IHByb2Jp bmcgZGV2aWNlIDwlcz5cbiIsIF9fZnVuY19fLCBfZGV2LT5uYW1lKTsKKwkJCisJCWlmICghc3Ry bmNtcChfZGV2LT5uYW1lLCAiZHVtbXkiLCA1KSB8fCAoX2Rldi0+ZmxhZ3MmSUZGX0xPT1BCQUNL KSkKKwkJCWNvbnRpbnVlOworCQlpZiAoZGV2X2NoYW5nZV9mbGFncyhfZGV2LCBfZGV2LT5mbGFn 
cyB8IElGRl9VUCkgPCAwKSB7CisJCQlwcmludGsoS0VSTl9FUlIgIiVzOiBmYWlsZWQgdG8gb3Bl biAlc1xuIiwgX19mdW5jX18sIF9kZXYtPm5hbWUpOworCQkJY29udGludWU7CisJCX0KKwkJY291 bnQrKzsKKwkJYnJlYWs7CisJfQorCXJ0bmxfc2h1bmxvY2soKTsKKworCWlmIChjb3VudCA9PSAw KQorCQlyZXR1cm4gTlVMTDsKKworCWRldiA9IF9kZXYtPm5hbWU7CisJCisJLy9kZXZfYWRkX3Bh Y2soJm5jX3BhY2tldF90eXBlKTsKKwkKKwlyZXR1cm4gX2RldjsKK30KKworc3RhdGljIGludCBf X2luaXQgaW5pdF9uZXRjb25zb2xlKHZvaWQpCit7CisJc3RydWN0IG5ldF9kZXZpY2UgKm5kZXYg PSBOVUxMOworCisJbmRldiA9IG5jX2luaXRfbmV0Y29uc29sZSgpOworCWlmICghbmRldikgewor CQlwcmludGsoS0VSTl9FUlIgIm5ldGNvbnNvbGU6IG5ldHdvcmsgZGV2aWNlICVzIGRvZXMgbm90 IGV4aXN0LCBhYm9ydGluZy5cbiIsIGRldik7CisJCXJldHVybiAtMTsKKwl9CisKKwlwcmludGso S0VSTl9JTkZPICIlczogVXNpbmcgZGV2aWNlIDwlcz5cbiIsIF9fZnVuY19fLCBuZGV2LT5uYW1l KTsKKwkKKwlpZiAoIW5kZXYtPnBvbGxfY29udHJvbGxlcikgeworCQlwcmludGsoS0VSTl9FUlIg Im5ldGNvbnNvbGU6ICVzJ3MgbmV0d29yayBkcml2ZXIgZG9lcyBub3QgaW1wbGVtZW50IG5ldGxv Z2dpbmcgeWV0LCBhYm9ydGluZy5cbiIsIGRldik7CisJCXJldHVybiAtMTsKKwl9CisKKwlyZWdp c3Rlcl9jb25zb2xlKCZuZXRjb25zb2xlKTsKKworCXNvdXJjZV9pcCA9IG50b2hsKGluX2F0b24o IjEwLjAuMC4yIikpOworI2RlZmluZSBJUCh4KSAoKGNoYXIgKikmc291cmNlX2lwKVt4XQorCXBy aW50ayhLRVJOX0lORk8gIm5ldGNvbnNvbGU6IHVzaW5nIHNvdXJjZSBJUCAlaS4laS4laS4laVxu IiwKKwkJSVAoMyksIElQKDIpLCBJUCgxKSwgSVAoMCkpOworI3VuZGVmIElQCisJc291cmNlX2lw ID0gaHRvbmwoc291cmNlX2lwKTsKKyNkZWZpbmUgSVAoeCkgKChjaGFyICopJnRhcmdldF9pcClb eF0KKwlwcmludGsoS0VSTl9JTkZPICJuZXRjb25zb2xlOiB1c2luZyB0YXJnZXQgSVAgJWkuJWku JWkuJWlcbiIsCisJCUlQKDMpLCBJUCgyKSwgSVAoMSksIElQKDApKTsKKyN1bmRlZiBJUAorCXRh cmdldF9pcCA9IGh0b25sKHRhcmdldF9pcCk7CisJcHJpbnRrKEtFUk5fSU5GTyAibmV0Y29uc29s ZTogdXNpbmcgc291cmNlIFVEUCBwb3J0OiAlaVxuIiwgc291cmNlX3BvcnQpOworCXNvdXJjZV9w b3J0ID0gaHRvbnMoc291cmNlX3BvcnQpOworCXByaW50ayhLRVJOX0lORk8gIm5ldGNvbnNvbGU6 IHVzaW5nIHRhcmdldCBVRFAgcG9ydDogJWlcbiIsIHRhcmdldF9wb3J0KTsKKwl0YXJnZXRfcG9y dCA9IGh0b25zKHRhcmdldF9wb3J0KTsKKworCWRhZGRyWzBdID0gdGFyZ2V0X2V0aF9ieXRlMDsK 
KwlkYWRkclsxXSA9IHRhcmdldF9ldGhfYnl0ZTE7CisJZGFkZHJbMl0gPSB0YXJnZXRfZXRoX2J5 dGUyOworCWRhZGRyWzNdID0gdGFyZ2V0X2V0aF9ieXRlMzsKKwlkYWRkcls0XSA9IHRhcmdldF9l dGhfYnl0ZTQ7CisJZGFkZHJbNV0gPSB0YXJnZXRfZXRoX2J5dGU1OworCisJaWYgKChkYWRkclsw XSAmIGRhZGRyWzFdICYgZGFkZHJbMl0gJiBkYWRkclszXSAmIGRhZGRyWzRdICYgZGFkZHJbNV0p ID09IDI1NSkKKwkJcHJpbnRrKEtFUk5fSU5GTyAibmV0Y29uc29sZTogdXNpbmcgYnJvYWRjYXN0 IGV0aGVybmV0IGZyYW1lcyB0byBzZW5kIHBhY2tldHMuXG4iKTsKKwllbHNlCisJCXByaW50ayhL RVJOX0lORk8gIm5ldGNvbnNvbGU6IHVzaW5nIHRhcmdldCBldGhlcm5ldCBhZGRyZXNzICUwMng6 JTAyeDolMDJ4OiUwMng6JTAyeDolMDJ4LlxuIiwgZGFkZHJbMF0sIGRhZGRyWzFdLCBkYWRkclsy XSwgZGFkZHJbM10sIGRhZGRyWzRdLCBkYWRkcls1XSk7CisJCQorCW5ldGNvbnNvbGVfZGV2ID0g bmRldjsKKyNkZWZpbmUgU1RBUlRVUF9NU0cgIlsuLi5uZXR3b3JrIGNvbnNvbGUgc3RhcnR1cC4u Ll1cbiIKKwl3cml0ZV9uZXRjb25zb2xlX21zZyhOVUxMLCBTVEFSVFVQX01TRywgc3RybGVuKFNU QVJUVVBfTVNHKSk7CisKKwlwcmludGsoS0VSTl9JTkZPICJuZXRjb25zb2xlOiBuZXR3b3JrIGxv Z2dpbmcgc3RhcnRlZCB1cCBzdWNjZXNzZnVsbHkhXG4iKTsKKwlyZXR1cm4gMDsKK30KKworc3Rh dGljIHZvaWQgX19leGl0IGNsZWFudXBfbmV0Y29uc29sZSh2b2lkKQoreworCXByaW50ayhLRVJO X0lORk8gIm5ldGNvbnNvbGU6IG5ldHdvcmsgbG9nZ2luZyBzaHV0IGRvd24uXG4iKTsKKwl1bnJl Z2lzdGVyX2NvbnNvbGUoJm5ldGNvbnNvbGUpOworCisjZGVmaW5lIFNIVVRET1dOX01TRyAiWy4u Lm5ldHdvcmsgY29uc29sZSBzaHV0ZG93bi4uLl1cbiIKKwl3cml0ZV9uZXRjb25zb2xlX21zZyhO VUxMLCBTSFVURE9XTl9NU0csIHN0cmxlbihTSFVURE9XTl9NU0cpKTsKKwluZXRjb25zb2xlX2Rl diA9IE5VTEw7Cit9CisKK21vZHVsZV9pbml0KGluaXRfbmV0Y29uc29sZSk7Cittb2R1bGVfZXhp dChjbGVhbnVwX25ldGNvbnNvbGUpOworCitpbnQgZHVtbXkgPSBNQVhfU0tCX1NJWkU7Cg== --Multipart_Mon__10_Mar_2003_22:22:05_+0300_082e02e8-- From mochel@osdl.org Mon Mar 10 14:51:31 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 10 Mar 2003 14:51:34 -0800 (PST) Received: from mail.osdl.org (air-2.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2AMpVq9031498 for ; Mon, 10 Mar 2003 14:51:31 -0800 Received: from localhost (build.pdx.osdl.net [172.20.1.2]) by mail.osdl.org (8.11.6/8.11.6) with ESMTP id h2AMpRG15885; 
Mon, 10 Mar 2003 14:51:27 -0800 Date: Mon, 10 Mar 2003 16:27:01 -0600 (CST) From: Patrick Mochel X-X-Sender: To: Andreas Jellinghaus cc: , , Subject: Re: 2.5.64 oops in ppp / pppo2 / kobject In-Reply-To: <1047336461.10548.3.camel@simulacron> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 1921 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mochel@osdl.org Precedence: bulk X-list: netdev Content-Length: 1990 Lines: 79 On 10 Mar 2003, Andreas Jellinghaus wrote: > pppoe link failed, then ppp oopsed. > > Also, shutting down the system ends in a deadlock > (or so? nothing is happening, lots of processes in __down.) > > plain 2.5.64 plus ipv6/setkey patch (unrelated I think). Please try the latest BK snapshot from http://kernel.org/pub/linux/kernel/v2.5/snapshots/ plus this patch on top of it. This problem has been reported before, and this patch should fix it. Thanks, -pat
===== fs/sysfs/dir.c 1.4 vs edited =====
--- 1.4/fs/sysfs/dir.c Sat Mar 8 23:42:32 2003
+++ edited/fs/sysfs/dir.c Sun Mar 9 16:01:45 2003
@@ -98,7 +98,6 @@
  * Unlink and unhash.
  */
 spin_unlock(&dcache_lock);
- d_delete(d);
 simple_unlink(dentry->d_inode,d);
 dput(d);
 spin_lock(&dcache_lock);
@@ -108,16 +107,11 @@
 }
 spin_unlock(&dcache_lock);
 up(&dentry->d_inode->i_sem);
- d_invalidate(dentry);
- simple_rmdir(parent->d_inode,dentry);
 d_delete(dentry);
+ simple_rmdir(parent->d_inode,dentry);
 pr_debug(" o %s removing done (%d)\n",dentry->d_name.name,
 atomic_read(&dentry->d_count));
- /**
- * Drop reference from initial sysfs_get_dentry().
- */
- dput(dentry);
 /**
  * Drop reference from dget() on entrance.
===== fs/sysfs/inode.c 1.83 vs edited =====
--- 1.83/fs/sysfs/inode.c Mon Mar 3 17:11:29 2003
+++ edited/fs/sysfs/inode.c Sun Mar 9 14:25:45 2003
@@ -93,19 +93,14 @@
 /* make sure dentry is really there */
 if (victim->d_inode &&
 (victim->d_parent->d_inode == dir->d_inode)) {
- simple_unlink(dir->d_inode,victim);
- d_delete(victim);
-
 pr_debug("sysfs: Removing %s (%d)\n", victim->d_name.name,
 atomic_read(&victim->d_count));
- /*
- * Drop reference from initial sysfs_get_dentry().
- */
- dput(victim);
+
+ simple_unlink(dir->d_inode,victim);
+
 }
-
- /**
- * Drop the reference acquired from sysfs_get_dentry() above.
+ /*
+ * Drop reference from sysfs_get_dentry() above.
  */
 dput(victim);
 }
From anton@samba.org Mon Mar 10 19:59:31 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 10 Mar 2003 19:59:38 -0800 (PST) Received: from lists.samba.org (dp.samba.org [66.70.73.150]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2B3xKq9006081 for ; Mon, 10 Mar 2003 19:59:21 -0800 Received: by lists.samba.org (Postfix, from userid 504) id D98CB2C053; Tue, 11 Mar 2003 03:30:00 +0000 (GMT) Date: Tue, 11 Mar 2003 14:29:50 +1100 From: Anton Blanchard To: netdev@oss.sgi.com Subject: alignment of SKBs Message-ID: <20030311032950.GB1132@krispykreme> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.3i X-archive-position: 1922 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: anton@samba.org Precedence: bulk X-list: netdev Content-Length: 1280 Lines: 29 Hi, Linux likes to align TCP/IP headers for the benefit of the CPU. No problems there. However, especially with gigabit, it is important to try and minimise the number of PCI transactions. As an example, the e1000 driver currently starts most transmit packets 14 bytes from the start of a cacheline and all receive packets 18 bytes from the start of a cacheline.
On ppc64 this is going to be expensive on the PCI bus, especially for DMA writes as we work our way up to cacheline alignment. Unaligned loads and stores on the headers should be reasonably quick on recent ppc64 machines, so the tradeoff is definitely towards optimising for the PCI bus. Unfortunately we can't do anything about it because skb_reserve() is used everywhere. Perhaps if we had another macro (skb_align?) we could override it on a per-arch basis. While the receive side is easy to fix (modify the skb_reserve in the e1000 and dev_alloc_skb routines), the transmit side is more difficult. Luckily DMA reads tend to be less of an issue. From my reading of the code, on transmits we copy the data in before we put the TCP header together. I guess we could arrange things so that the common case would fall on a cacheline boundary and the uncommon case would overflow into the cacheline before. Anton From pekkas@netcore.fi Mon Mar 10 22:25:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 10 Mar 2003 22:25:12 -0800 (PST) Received: from netcore.fi (netcore.fi [193.94.160.1]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2B6OQq9008644 for ; Mon, 10 Mar 2003 22:25:08 -0800 Received: from localhost (pekkas@localhost) by netcore.fi (8.11.6/8.11.6) with ESMTP id h2B6OKY00515 for ; Tue, 11 Mar 2003 08:24:20 +0200 Date: Tue, 11 Mar 2003 08:24:19 +0200 (EET) From: Pekka Savola To: netdev@oss.sgi.com Subject: Is RFC1822 -type License on IPR good enough? Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 1923 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pekkas@netcore.fi Precedence: bulk X-list: netdev Content-Length: 988 Lines: 25 Hi, An issue has just come up as the IETF is standardizing Secure Neighbor Discovery for IPv6. Some organizations, including Microsoft and Ericsson, have IPR claims on a mechanism which would be very useful to the mechanism.
Folks are working with the organizations, hoping to get a licensing agreement like RFC1822. I'd like to solicit opinions whether this is considered "good enough" for possible implementation in Linux kernel, or whether it would lock us out. (Also remember the tradeoff: if this technique is unacceptable, there are no easy alternatives to solving the problem, only very difficult ones). I'm assuming the license like that would be appropriate as it has been used in other free systems in protocols like IKE. If you think there is a problem, please send a note ASAP. -- Pekka Savola "You each name yourselves king, yet the Netcore Oy kingdom bleeds." Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings From ahu@outpost.ds9a.nl Tue Mar 11 01:45:04 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 11 Mar 2003 01:45:14 -0800 (PST) Received: from outpost.ds9a.nl (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2B9iMq9016508 for ; Tue, 11 Mar 2003 01:45:04 -0800 Received: by outpost.ds9a.nl (Postfix, from userid 1000) id 615B64573; Tue, 11 Mar 2003 10:44:20 +0100 (CET) Date: Tue, 11 Mar 2003 10:44:20 +0100 From: bert hubert To: Alexey Kuznetsov , Martin Devera , Linux Kernel Mailinlist , David Jarvis , netdev@oss.sgi.com Subject: Re: kernel panic: bug in sch_sfq.c Message-ID: <20030311094420.GB19658@outpost.ds9a.nl> Mail-Followup-To: bert hubert , Alexey Kuznetsov , Martin Devera , Linux Kernel Mailinlist , David Jarvis , netdev@oss.sgi.com References: <20030311091409.GA4491@oasis.frogfoot.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030311091409.GA4491@oasis.frogfoot.net> User-Agent: Mutt/1.3.28i X-archive-position: 1924 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ahu@ds9a.nl Precedence: bulk X-list: netdev Content-Length: 6312 Lines: 138 On Tue, Mar 11, 2003 at 11:14:09AM 
+0200, Abraham van der Merwe wrote:
> Hi!
>
> I have a box that crashed today. Below is the decoded kernel panic. If you
> track down the bug PLEASE send me a patch.

Weird, Alexey's code is normally very very solid. Perhaps HTB is also
involved. Devik?

>
> ------------< snip <------< snip <------< snip <------------
> ksymoops 2.4.8 on i686 2.4.20-rc1. Options used
>      -v vmlinux-2.4.21-pre5 (specified)
>      -K (specified)
>      -L (specified)
>      -O (specified)
>      -m System.map-2.4.21-pre5 (specified)
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000004
> *pde = 00000000
> Oops: 0002
> CPU:    0
> EIP:    0010:[]    Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010202
> eax: 00000000   ebx: c7b9a9e8   ecx: 0000007f   edx: c7a8eef8
> esi: c7b9ab08   edi: 000007f0   ebp: c7a8e060   esp: c021deb8
> ds: 0018   es: 0018   ss: 0018
> Process swapper (pid: 0, stackpage=c021d000)
> Stack: c7b9a9e8 c7b9ab08 c7f7ee00 c7b9a860 c7b893c0 c7f7ee00 c7b9a860 00000000
>        c01a3507 c7b5c680 7fb9a9f0 c01a339e c7a8e000 ffffffff 00000018 00000006
>        c7b9a800 00000018 00000006 c7b9a800 c7b9a9e8 c7b9ab08 c7f7ee00 c01a371c
> Call Trace: [] [] [] [] [>c019949d>]
>    [] [] [] [] [] []
>    [] [] [] [] []
> Code: 89 50 04 89 02 8b 5c 24 24 c7 03 00 00 00 00 c7 43 04 00 00
>
>
> >>EIP; c01a5399 <=====
>
> >>esp; c021deb8
>
> Trace; c01a3507
> Trace; c01a339e
> Trace; c01a371c
> Trace; c019f7a3
> Trace; c0115a6a
> Trace; c01082bd
> Trace; c0105240
> Trace; c0105240
> Trace; c010a528
> Trace; c0105240
> Trace; c0105240
> Trace; c0105263
> Trace; c01052d2
> Trace; c0105000 <_stext+0/0>
> Trace; c0105027
>
> Code;  c01a5399
> 00000000 <_EIP>:
> Code;  c01a5399   <=====
>    0:   89 50 04                  mov    %edx,0x4(%eax)   <=====
> Code;  c01a539c
>    3:   89 02                     mov    %eax,(%edx)
> Code;  c01a539e
>    5:   8b 5c 24 24               mov    0x24(%esp,1),%ebx
> Code;  c01a53a2
>    9:   c7 03 00 00 00 00         movl   $0x0,(%ebx)
> Code;  c01a53a8
>    f:   c7 43 04 00 00 00 00      movl   $0x0,0x4(%ebx)
>
> <0>Kernel panic: Aiee, killing interrupt handler!
> ------------< snip <------< snip <------< snip <------------
>
> Below are the rules that were installed on the system:
>
> ------------< snip <------< snip <------< snip <------------
> /sbin/tc qdisc del dev eth0 root
> /sbin/tc qdisc del dev eth1 root
> /sbin/iptables -t mangle -F qos
> /sbin/iptables -t mangle -Z qos
> /sbin/tc qdisc add dev eth0 root handle 1: htb default 5 r2q 1
> /sbin/tc class add dev eth0 parent 1: classid 1:1 htb rate 96kbit
> /sbin/tc class add dev eth0 parent 1:1 classid 1:2 htb rate 96kbit ceil 96kbit
> /sbin/tc class add dev eth0 parent 1:2 classid 1:3 htb rate 48kbit ceil 96kbit prio 1
> /sbin/tc qdisc add dev eth0 handle 3: parent 1:3 sfq perturb 10 limit 31
> /sbin/tc class add dev eth0 parent 1:2 classid 1:4 htb rate 24kbit ceil 96kbit prio 1
> /sbin/tc qdisc add dev eth0 handle 4: parent 1:4 sfq perturb 10 limit 31
> /sbin/tc class add dev eth0 parent 1:2 classid 1:5 htb rate 16kbit ceil 96kbit prio 2
> /sbin/tc qdisc add dev eth0 handle 5: parent 1:5 sfq perturb 10 limit 31
> /sbin/iptables -t mangle -A qos -o eth0 -s 66.8.85.0/28 -j CLASSIFY --set-class 1:3
> /sbin/iptables -t mangle -A qos -o eth0 -s 66.8.85.80/28 -j CLASSIFY --set-class 1:4
> /sbin/iptables -t mangle -A qos -o eth0 -s 192.116.106.192/29 -j CLASSIFY --set-class 1:0
> /sbin/iptables -t mangle -A qos -o eth0 -s 66.8.28.48/29 -j CLASSIFY --set-class 1:0
> /sbin/tc qdisc add dev eth1 root handle 1: htb default 5 r2q 2
> /sbin/tc class add dev eth1 parent 1: classid 1:1 htb rate 512kbit
> /sbin/tc class add dev eth1 parent 1:1 classid 1:2 htb rate 256kbit ceil 512kbit
> /sbin/tc class add dev eth1 parent 1:2 classid 1:3 htb rate 128kbit ceil 512kbit prio 1
> /sbin/tc qdisc add dev eth1 handle 3: parent 1:3 sfq perturb 10 limit 169
> /sbin/tc class add dev eth1 parent 1:2 classid 1:4 htb rate 64kbit ceil 512kbit prio 1
> /sbin/tc qdisc add dev eth1 handle 4: parent 1:4 sfq perturb 10 limit 169
> /sbin/tc class add dev eth1 parent 1:2 classid 1:5 htb rate 32kbit ceil 512kbit prio 2
> /sbin/tc qdisc add dev eth1 handle 5: parent 1:5 sfq perturb 10 limit 169
> /sbin/iptables -t mangle -A qos -o eth1 -d 66.8.85.0/28 -j CLASSIFY --set-class 1:3
> /sbin/iptables -t mangle -A qos -o eth1 -d 66.8.85.80/28 -j CLASSIFY --set-class 1:4
> /sbin/iptables -t mangle -A qos -o eth1 -d 192.116.106.192/29 -j CLASSIFY --set-class 1:0
> /sbin/iptables -t mangle -A qos -o eth1 -d 66.8.28.48/29 -j CLASSIFY --set-class 1:0
> ------------< snip <------< snip <------< snip <------------
>
> I've made tons of info available on my home page for you to look at (proc
> files, vmlinux, System.map, original panic message, etc.
>
> http://oasis.frogfoot.net/sfq/
>
> --
>
> Regards
>  Abraham
>
> I saw what you did and I know who you are.
>
> ___________________________________________________
>  Abraham vd Merwe [ZR1BBQ] - Frogfoot Networks
>  P.O. Box 3472, Matieland, Stellenbosch, 7602
>  Cell: +27 82 565 4451 Http: http://www.frogfoot.net/
>  Email: abz@frogfoot.net
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
http://www.PowerDNS.com      Open source, database driven DNS Software
http://lartc.org             Linux Advanced Routing & Traffic Control HOWTO
http://netherlabs.nl         Consulting

From erik@hensema.net Tue Mar 11 03:18:14 2003
Received: with ECARTIS (v1.0.0; list netdev); Tue, 11 Mar 2003 03:18:16 -0800 (PST)
Received: from dexter.hensema.net (cc78409-a.hnglo1.ov.home.nl [212.120.97.185])
	by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2BBIBq9021010
	for ; Tue, 11 Mar 2003 03:18:13 -0800
Received: from bender.home.hensema.net (bender.home.hensema.net [192.168.1.252])
	by dexter.hensema.net (8.12.3/8.12.3) with ESMTP id h2BBI2p4017374;
	Tue, 11 Mar 2003 12:18:02 +0100
Received: from bender.home.hensema.net (localhost [127.0.0.1])
	by bender.home.hensema.net
	(8.12.3/8.12.3) with ESMTP id h2BBI2T3001911;
	Tue, 11 Mar 2003 12:18:02 +0100
Received: (from erik@localhost)
	by bender.home.hensema.net (8.12.3/8.12.3/Submit) id h2BBI14s001910;
	Tue, 11 Mar 2003 12:18:01 +0100
Date: Tue, 11 Mar 2003 12:18:01 +0100
From: Erik Hensema
To: netdev@oss.sgi.com
Cc: LARTC , Netfilter Development Mailinglist
Subject: [PATCH 2.4.21-pre4] Propagate netfilter MARK value when tunneling
Message-ID: <20030311111801.GA1853@hensema.net>
Reply-To: erik@hensema.net
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="k1lZvvs/B4yU6o8G"
Content-Disposition: inline
User-Agent: Mutt/1.3.27i
X-archive-position: 1925
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: erik@hensema.net
Precedence: bulk
X-list: netdev
Content-Length: 3484
Lines: 99

--k1lZvvs/B4yU6o8G
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

This patch enables the user to propagate netfilter MARK values from
tunneled packets to the tunnel packets.

The primary use for this is QoS: it enables you to MARK a packet before it
enters a tunnel and then later pick up the packet when it's about to leave
the physical interface.

jamal suggested also propagating other skb specifics like the tcindex and
priority. I haven't included these in the current patch for the very simple
reason that I don't understand what they mean ;-)

The patch is currently limited to GRE, IPIP and SIT.

The patch is attached to this mail, but can also be downloaded from
http://dexter.hensema.net/~erik/patches/netfilter-propagate-mark-2.4.21-pre4.diff

--
Erik Hensema (erik@hensema.net)

--k1lZvvs/B4yU6o8G
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="netfilter-propagate-mark-2.4.21-pre4.diff"

--- ../linux-2.4.21-pre4/net/Config.in	Sat Aug  3 02:39:46 2002
+++ net/Config.in	Tue Mar 11 12:08:29 2003
@@ -13,6 +13,7 @@
 bool 'Network packet filtering (replaces ipchains)' CONFIG_NETFILTER
 if [ "$CONFIG_NETFILTER" = "y" ]; then
    bool '  Network packet filtering debugging' CONFIG_NETFILTER_DEBUG
+   bool '  Propagate netfilter MARK value when tunneling' CONFIG_NETFILTER_PROPAGATE_MARK
 fi
 bool 'Socket Filtering'  CONFIG_FILTER
 tristate 'Unix domain sockets' CONFIG_UNIX
--- ../linux-2.4.21-pre4/net/ipv4/ipip.c	Fri Nov 29 00:53:15 2002
+++ net/ipv4/ipip.c	Tue Mar 11 11:58:50 2003
@@ -619,6 +619,9 @@
 		}
 		if (skb->sk)
 			skb_set_owner_w(new_skb, skb->sk);
+#ifdef CONFIG_NETFILTER_PROPAGATE_MARK
+		new_skb->nfmark = skb->nfmark;
+#endif
 		dev_kfree_skb(skb);
 		skb = new_skb;
 	}
--- ../linux-2.4.21-pre4/net/ipv4/ip_gre.c	Fri Nov 29 00:53:15 2002
+++ net/ipv4/ip_gre.c	Tue Mar 11 11:59:07 2003
@@ -822,6 +822,9 @@
 		}
 		if (skb->sk)
 			skb_set_owner_w(new_skb, skb->sk);
+#ifdef CONFIG_NETFILTER_PROPAGATE_MARK
+		new_skb->nfmark = skb->nfmark;
+#endif
 		dev_kfree_skb(skb);
 		skb = new_skb;
 	}
--- ../linux-2.4.21-pre4/net/ipv6/sit.c	Fri Nov 29 00:53:15 2002
+++ net/ipv6/sit.c	Tue Mar 11 11:59:20 2003
@@ -571,6 +571,9 @@
 		}
 		if (skb->sk)
 			skb_set_owner_w(new_skb, skb->sk);
+#ifdef CONFIG_NETFILTER_PROPAGATE_MARK
+		new_skb->nfmark = skb->nfmark;
+#endif
 		dev_kfree_skb(skb);
 		skb = new_skb;
 	}
--- ../linux-2.4.21-pre4/Documentation/Configure.help	Wed Feb 26 10:51:16 2003
+++ Documentation/Configure.help	Tue Mar 11 12:05:37 2003
@@ -2507,6 +2507,22 @@
   You can say Y here if you want to get additional messages useful in
   debugging the netfilter code.
 
+Propagate netfilter MARK value when tunneling
+CONFIG_NETFILTER_PROPAGATE_MARK
+  With this option enabled, netfilter MARK values are propagated from
+  tunneled packets to the tunnel packets. It enables you to trace
+  packets from before they enter the tunnel to the point where they
+  leave the physical interface.
+
+  One of the possible uses is marking packets for QoS before they
+  enter a tunnel. These mark values can then be picked up by filters
+  defined by the "tc" utility when they're about to leave the
+  physical interface.
+
+  This option currently works for GRE, IPIP and SIT tunnels.
+
+  If unsure, say N.
+
 Connection tracking (required for masq/NAT)
 CONFIG_IP_NF_CONNTRACK
   Connection tracking keeps a record of what packets have passed

--k1lZvvs/B4yU6o8G--

From devik@cdi.cz Tue Mar 11 04:06:08 2003
Received: with ECARTIS (v1.0.0; list netdev); Tue, 11 Mar 2003 04:06:12 -0800 (PST)
Received: from luxik.cdi.cz (inway106.cdi.cz [213.151.81.106])
	by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2BC55q9021871
	for ; Tue, 11 Mar 2003 04:05:58 -0800
Received: from a76-137.dialup.iol.cz ([194.228.137.76] helo=devix)
	by luxik.cdi.cz with asmtp (Exim 3.34 #3)
	id 18siSX-0001Az-00; Tue, 11 Mar 2003 13:02:22 +0100
Received: from devik (helo=localhost)
	by devix with local-esmtp (Exim 3.16 #8)
	id 18siME-0000J1-00; Tue, 11 Mar 2003 12:55:46 +0100
Date: Tue, 11 Mar 2003 12:55:46 +0100 (CET)
From: devik
X-X-Sender:
To: bert hubert
cc: Alexey Kuznetsov , Linux Kernel Mailinlist , David Jarvis ,
Subject: Re: kernel panic: bug in sch_sfq.c
In-Reply-To: <20030311094420.GB19658@outpost.ds9a.nl>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 1926
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: devik@cdi.cz
Precedence: bulk
X-list: netdev
Content-Length: 7581
Lines: 167

Hmm,
I looked at it. It seems that the skb linked list was corrupted
(it contained a NULL pointer).
It could be due to one of two problems: either someone (maybe HTB too)
has overwritten memory, or HTB called dequeue with a wrong argument.
The latter is unlikely because I call q->dequeue and sfq's
dequeue was really called, so the pointer is OK.
Let's examine how HTB could meddle with qdisc internals. If
HTB thought that a leaf was an inner node, inner.feed[0] would be a
pointer equal to leaf.q. I examined the code, but there is no
way to make this mistake.

In the last 3 days I got 3 bug reports. Each crashes in a different
place and all seem unrelated, though each is a NULL pointer
dereference.
I think some code somewhere is writing to random places
in memory :-\

A question for everyone who has seen such an Oops: have you used
dynamic tc class changes, like creating/deleting/changing/viewing
classes often at runtime? (I'm trying to find a common trigger.)

thanks, devik
HTB maintainer

On Tue, 11 Mar 2003, bert hubert wrote:

> On Tue, Mar 11, 2003 at 11:14:09AM +0200, Abraham van der Merwe wrote:
> > Hi!
> >
> > I have a box that crashed today. Below is the decoded kernel panic. If you
> > track down the bug PLEASE send me a patch.
>
> Weird, Alexey's code is normally very very solid. Perhaps HTB is also
> involved. Devik?

From jmorris@intercode.com.au Tue Mar 11 04:40:39 2003
Received: with ECARTIS (v1.0.0; list netdev); Tue, 11 Mar 2003 04:40:42 -0800 (PST)
Received: from blackbird.intercode.com.au (IDENT:root@blackbird.intercode.com.au [203.32.101.10])
	by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2BCeaq9023182
	for ; Tue, 11 Mar 2003 04:40:38 -0800
Received: from localhost (jmorris@localhost)
	by blackbird.intercode.com.au (8.11.6/8.9.3) with ESMTP id h2BCe5c31557;
	Tue, 11 Mar 2003 23:40:05 +1100
Date: Tue, 11 Mar 2003 23:40:05 +1100 (EST)
From: James Morris
To: Alan Cox
cc: Ulrik De Bie , ,
Subject: Re: Fwd: tcp seq nr wrapping bug + patch
In-Reply-To:
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 1927
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: jmorris@intercode.com.au
Precedence: bulk
X-list: netdev
Content-Length: 712
Lines: 29

On Mon, 10 Mar 2003, Ulrik De Bie wrote:

> I resend this patch which fixes a stupid mistake in the tcp sequence
> number in the 2.2 kernel.

This looks good, thanks.

Alan, please apply.

- James
--
James Morris

diff -urN -X dontdiff linux-2.2.24.orig/net/ipv4/tcp.c linux-2.2.24.w1/net/ipv4/tcp.c
--- linux-2.2.24.orig/net/ipv4/tcp.c	Wed Sep 25 00:06:26 2002
+++ linux-2.2.24.w1/net/ipv4/tcp.c	Tue Mar 11 23:26:00 2003
@@ -823,7 +823,7 @@
 				 */
 				if (skb_tailroom(skb) > 0 &&
 				    (mss_now - copy) > 0 &&
-				    tp->snd_nxt < TCP_SKB_CB(skb)->end_seq) {
+				    before(tp->snd_nxt, TCP_SKB_CB(skb)->end_seq)) {
 					int last_byte_was_odd = (copy % 4);
 
 					/*

From kuznet@ms2.inr.ac.ru Tue Mar 11 08:09:50 2003
Received: with ECARTIS (v1.0.0; list netdev); Tue, 11 Mar 2003 08:09:59 -0800 (PST)
Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165])
	by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2BG95q9004694
	for ; Tue, 11 Mar 2003 08:09:49 -0800
Received: (from kuznet@localhost)
	by sex.inr.ac.ru (8.6.13/ANK) id TAA13074;
	Tue, 11 Mar 2003 19:08:18 +0300
From: kuznet@ms2.inr.ac.ru
Message-Id: <200303111608.TAA13074@sex.inr.ac.ru>
Subject: Re: kernel panic: bug in sch_sfq.c
To: abz@frogfoot.net (Abraham van der Merwe)
Date: Tue, 11 Mar 2003 19:08:18 +0300 (MSK)
Cc: devik@cdi.cz, ahu@ds9a.nl, linux-kernel@vger.kernel.org,
	david@uninetwork.co.za, netdev@oss.sgi.com
In-Reply-To: <20030311155409.GB7641@oasis.frogfoot.net> from "Abraham van der Merwe" at Mar 11, 3 05:54:09 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
X-archive-position: 1928
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: kuznet@ms2.inr.ac.ru
Precedence: bulk
X-list: netdev
Content-Length: 192
Lines: 8

Hello!

> Also, if I compile the kernel with all debugging enabled (CONFIG_DEBUG_SLAB,
> etc) I can reliably trigger the BUG() on line 1263 in mm/slab.c

How does backtrace oops look?

Alexey

From abz@oasis.frogfoot.net Tue Mar 11 08:46:06 2003
Received: with ECARTIS (v1.0.0; list netdev); Tue, 11 Mar 2003 08:46:13 -0800 (PST)
Received: from oasis.frogfoot.net (oasis.frogfoot.net [66.8.28.51])
	by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2BGjtq9005232
	for ; Tue, 11 Mar 2003 08:45:59 -0800
Received: (qmail 7880 invoked by uid 1001); 11 Mar 2003 15:54:09 -0000
Date: Tue, 11 Mar 2003 17:54:09 +0200
From: Abraham van der Merwe
To: devik
Cc: bert hubert , Alexey Kuznetsov , Linux Kernel Mailinlist , David Jarvis , netdev@oss.sgi.com
Subject: Re: kernel panic: bug in sch_sfq.c
Message-ID: <20030311155409.GB7641@oasis.frogfoot.net>
Mail-Followup-To: devik , bert hubert , Alexey Kuznetsov , Linux Kernel Mailinlist , David Jarvis , netdev@oss.sgi.com
References: <20030311094420.GB19658@outpost.ds9a.nl>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.3.28i
Organization: Frogfoot Networks
X-Operating-System: Debian GNU/Linux oasis 2.4.20-rc1 i686
X-GPG-Public-Key: http://oasis.frogfoot.net/pgpkeys/keys/frogfoot.gpg
X-Uptime: 17:40:32 up 69 days, 5:07, 6 users, load average: 0.00, 0.03, 0.00
X-Edited-With-Muttmode: muttmail.sl - 2001-09-27
X-archive-position: 1929
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: abz@frogfoot.net
Precedence: bulk
X-list: netdev
Content-Length: 8729
Lines: 185

Hi devik!

In this case, I added the rules in the morning, but there was no traffic
flowing through the machine. That evening we redirected traffic through the
box (so HTB obviously kicked in). About 10 minutes later it crashed.

I think the only common thing I've seen so far is that it crashes when HTB is
actually doing a lot of shaping/prioritizing, which doesn't help much.

Also, if I compile the kernel with all debugging enabled (CONFIG_DEBUG_SLAB,
etc) I can reliably trigger the BUG() on line 1263 in mm/slab.c. I don't know
if this is related to the HTB problem; I've tested this on many different
machines with many different kernels.

If I disable CONFIG_DEBUG_SLAB, I don't get that (obviously). The first crash
since I started disabling CONFIG_DEBUG_SLAB on our Linux QoS boxes has been
this one.

> Hmm,
> I looked at it. It seems that the skb linked list was corrupted
> (it contained a NULL pointer). It could be due to one of two problems:
> either someone (maybe HTB too) has overwritten memory, or HTB
> called dequeue with a wrong argument.
> The latter is unlikely because I call q->dequeue and sfq's
> dequeue was really called, so the pointer is OK.
> Let's examine how HTB could meddle with qdisc internals. If
> HTB thought that a leaf was an inner node, inner.feed[0] would be a
> pointer equal to leaf.q. I examined the code, but there is no
> way to make this mistake.
>
> In the last 3 days I got 3 bug reports. Each crashes in a different
> place and all seem unrelated, though each is a NULL pointer
> dereference.
> I think some code somewhere is writing to random places
> in memory :-\
>
> A question for everyone who has seen such an Oops: have you used
> dynamic tc class changes, like creating/deleting/changing/viewing
> classes often at runtime? (I'm trying to find a common trigger.)
>
> thanks, devik
> HTB maintainer

--

Regards
 Abraham

Reporter (to Mahatma Gandhi): Mr Gandhi, what do you think of Western
Civilization?
Gandhi: I think it would be a good idea.

___________________________________________________
 Abraham vd Merwe - Frogfoot Networks CC
 9 Kinnaird Court, 33 Main Street, Newlands, 7700
 Phone: +27 21 686 1674 Cell: +27 82 565 4451
 Http: http://www.frogfoot.net/ Email: abz@frogfoot.net

From rddunlap@osdl.org Tue Mar 11 11:58:29 2003
Received: with ECARTIS (v1.0.0; list netdev); Tue, 11 Mar 2003 11:58:33 -0800 (PST)
Received: from mail.osdl.org (air-2.osdl.org [65.172.181.6])
	by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2BJwSq9007519
	for ; Tue, 11 Mar 2003 11:58:29 -0800
Received: from dragon.pdx.osdl.net (dragon.pdx.osdl.net [172.20.1.27])
	by mail.osdl.org (8.11.6/8.11.6) with SMTP id h2BJwQG28469;
	Tue, 11 Mar 2003 11:58:26 -0800
Date: Tue, 11 Mar 2003 11:56:12 -0800
From: "Randy.Dunlap"
To: linux-net@vger.kernel.org
Cc: netdev@oss.sgi.com
Subject: updating MIBs/statistics
Message-Id: <20030311115612.73821921.rddunlap@osdl.org>
Organization: OSDL
X-Mailer: Sylpheed version 0.8.6 (GTK+ 1.2.10; i586-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-archive-position: 1930
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: rddunlap@osdl.org
Precedence: bulk
X-list: netdev
Content-Length: 412
Lines: 16

Hi,

I'm looking into doing some MIB updates, such as making the
IPv6 stats table be on a per interface basis (per RFC 2465)
and adding a UDP listener table and TCP connection table
as well as some other IPv6 MIB requirements.

Has anyone else tackled this?

Assuming that the patches for this are in Linux style,
are there any issues with doing updates like these?
Like non-technical issues?

Thanks,
--
~Randy

From abz@oasis.frogfoot.net Tue Mar 11 13:20:29 2003
Received: with ECARTIS (v1.0.0; list netdev); Tue, 11 Mar 2003 13:20:39 -0800 (PST)
Received: from oasis.frogfoot.net (oasis.frogfoot.net [66.8.28.51])
	by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2BLKNq9011281
	for ; Tue, 11 Mar 2003 13:20:27 -0800
Received: (qmail 8925 invoked by uid 1001); 11 Mar 2003 21:19:02 -0000
Date: Tue, 11 Mar 2003 23:19:02 +0200
From: Abraham van der Merwe
To: kuznet@ms2.inr.ac.ru
Cc: devik@cdi.cz, ahu@ds9a.nl, linux-kernel@vger.kernel.org,
	david@uninetwork.co.za, netdev@oss.sgi.com
Subject: Re: kernel panic: bug in sch_sfq.c
Message-ID: <20030311211902.GA8699@oasis.frogfoot.net>
Mail-Followup-To: kuznet@ms2.inr.ac.ru, devik@cdi.cz, ahu@ds9a.nl,
	linux-kernel@vger.kernel.org, david@uninetwork.co.za, netdev@oss.sgi.com
References: <20030311155409.GB7641@oasis.frogfoot.net> <200303111608.TAA13074@sex.inr.ac.ru>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="BXVAT5kNtrzKuDFl"
Content-Disposition: inline
In-Reply-To: <200303111608.TAA13074@sex.inr.ac.ru>
User-Agent: Mutt/1.3.28i
Organization: Frogfoot Networks
X-Operating-System: Debian GNU/Linux oasis 2.4.20-rc1 i686
X-GPG-Public-Key: http://oasis.frogfoot.net/pgpkeys/keys/frogfoot.gpg
X-Uptime: 22:58:38 up 69 days, 10:25, 8 users, load average: 0.02, 0.01, 0.00
X-Edited-With-Muttmode: muttmail.sl - 2001-09-27
X-archive-position: 1931
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: abz@frogfoot.net
Precedence: bulk
X-list: netdev
Content-Length: 2439
Lines: 101

--BXVAT5kNtrzKuDFl
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi kuznet!

> > Also, if I compile the kernel with all debugging enabled (CONFIG_DEBUG_SLAB,
> > etc) I can reliably trigger the BUG() on line 1263 in mm/slab.c
>
> How does backtrace oops look?

I didn't write down most of the BUG() panics, but here is one (unfortunately
it doesn't have any QoS code in the stack trace):

------------< snip <------< snip <------< snip <------------
root@trillian:~/uni-qos# cat panic.txt
c0192eb1 c0176c8c c017718c c0176afa c010810f c01082b3 c0105240
c0105240 c0105240 c0105240 c0105263 c01052d2 c0105000 c0105027

0f 0b ef 04 e0 87 1e c0 f7 c5 00 04 00 00 74 36 b8 a5 c2 0f

EIP: 0010:c012642e
ESP: c0221eb4

KERNEL BUG slab.c:1263
root@trillian:~/uni-qos#
------------< snip <------< snip <------< snip <------------

A quick objdump through the kernel's vmlinux image reveals that the stack
trace above looks as follows:

------------< snip <------< snip <------< snip <------------
c0192eb1 alloc_skb
c0176c8c speedo_refill_rx_buf
c017718c speedo_rx
c0176afa speedo_interrupt
c010810f handle_IRQ_event
c01082b3 do_IRQ
c0105240 default_idle
c0105240 default_idle
c0105240 default_idle
c0105240
c0105263 default_idle
c01052d2 cpu_idle
c0105000 rest_init
c0105027 rest_init
------------< snip <------< snip <------< snip <------------

It crashes when it hits BUG(); in slab.c:

------------< snip <------< snip <------< snip <------------
#if DEBUG
	if (cachep->flags & SLAB_POISON)
		if (kmem_check_poison_obj(cachep, objp))
			BUG();
------------< snip <------< snip <------< snip <------------

--

Regards
 Abraham

Nothing is so often irretrievably missed as a daily opportunity.
		-- Ebner-Eschenbach

___________________________________________________
 Abraham vd Merwe - Frogfoot Networks CC
 9 Kinnaird Court, 33 Main Street, Newlands, 7700
 Phone: +27 21 686 1674 Cell: +27 82 565 4451
 Http: http://www.frogfoot.net/ Email: abz@frogfoot.net

--BXVAT5kNtrzKuDFl
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.5 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE+blLG0jJV70h31dERAiQ7AJ0b2vqj6ouXT7nlHnSGB2Y8JfLqawCfeKW7
8CmtViJv+1OGwWaCRYR/M+k=
=l040
-----END PGP SIGNATURE-----

--BXVAT5kNtrzKuDFl--

From rreddy@c.psc.edu Tue Mar 11 15:11:16 2003
Received: with ECARTIS (v1.0.0; list netdev); Tue, 11 Mar 2003 15:11:20 -0800 (PST)
Received: from c.psc.edu (c.psc.edu [128.182.73.106])
	by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2BNAZq9015153
	for ; Tue, 11 Mar 2003 15:11:16 -0800
Received: by c.psc.edu for NETDEV@OSS.SGI.COM; Tue, 11 Mar 2003 18:10:35 -0500
Date: Tue, 11 Mar 2003 18:10:35 -0500
From: "Raghurama 'REDDY'"
Reply-To: rreddy@psc.edu
To: NETDEV@OSS.SGI.COM
CC: RREDDY@vms.psc.edu
Message-Id: <03031118103534.2221577a.8921380@psc.edu>
Subject: Bug or feature: raw sockets ignores IP_DF when packet is bigger than pmtu
X-archive-position: 1932
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: rreddy@psc.edu
Precedence: bulk
X-list: netdev
Content-Length: 1611
Lines: 52

Please let me know if this is the expected behavior:

I am running a Linux 2.4.20 system.
The socket type is SOCK_RAW and the protocol is IPPROTO_RAW.
I am filling in the IP header myself, and IP_DF is set in the header.
I have experimented with and without IP_HDRINCL, and it made no
difference. It appears that on Linux, if the protocol and the socket
are RAW, the header is assumed to be included.

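For anyone wanting to reproduce this, here is a minimal sketch of what
"IP_DF is set in the header" amounts to, assuming a plain 20-byte IPv4
header without options. The function names (build_ipv4_header_df,
ipv4_header_has_df, ipv4_checksum) are made up for illustration and this
is not the actual test program; it only builds the bytes and does not
open a socket.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Flag bits in the 16-bit flags/fragment-offset field (RFC 791). */
#define IPV4_FLAG_DF 0x4000u /* Don't Fragment */

/* Internet checksum over an even number of bytes. */
static uint16_t ipv4_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical helper: fill a 20-byte header in network byte order
 * with DF set and fragment offset 0. */
static void build_ipv4_header_df(uint8_t hdr[20], uint16_t total_len,
                                 uint8_t ttl, uint8_t protocol,
                                 const uint8_t src[4], const uint8_t dst[4])
{
    uint16_t csum;

    memset(hdr, 0, 20);
    hdr[0] = 0x45;                         /* version 4, IHL 5 words */
    hdr[2] = (uint8_t)(total_len >> 8);    /* total length, high byte */
    hdr[3] = (uint8_t)(total_len & 0xff);  /* total length, low byte */
    hdr[6] = (uint8_t)(IPV4_FLAG_DF >> 8); /* 0x40: DF bit */
    hdr[7] = 0;                            /* fragment offset 0 */
    hdr[8] = ttl;
    hdr[9] = protocol;
    memcpy(hdr + 12, src, 4);
    memcpy(hdr + 16, dst, 4);

    csum = ipv4_checksum(hdr, 20);         /* checksum field is still 0 */
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xff);
}

/* Hypothetical helper: test the DF bit of a header seen on the wire. */
static int ipv4_header_has_df(const uint8_t hdr[20])
{
    return (hdr[6] & 0x40) != 0;
}
```

The same bit is what a capture filter would look at: DF is 0x40 in byte 6
of the IP header.
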
For the test:

  The interface MTU is 4470
  Next hop is a router with an MTU of 1500
  Packet size being sent out is about 2000 bytes

What is observed on running a program like traceroute with the "-M" option is:

The first time I run it, I do get the "Fragmentation required" message as
expected. We can also observe in the tcpdump output that DF is set.

If I run the test again immediately, we can see in the tcpdump output on the
outgoing interface that IP fragments the message to 1500 bytes and sends the
fragments out without setting the DF bit. This is because the route cache
has the path MTU stored as 1500.

If I wait for some time (until the cache entry expires), or explicitly flush
the cache with:

  echo 1 > /proc/sys/net/ipv4/route/flush

and rerun the test, it works as expected and returns "Fragmentation required"
packets.

So the conjecture is that IP on the "host" fragments the packets if it knows
the path MTU is not large enough to send the packet without fragmentation
(even when the DF bit is set).

Apparently this is consistent with the IPv6 spec, which says that routers
cannot fragment packets, and that hosts may.

Thanks!
--rr

From seong@etri.re.kr Tue Mar 11 17:01:05 2003
Received: with ECARTIS (v1.0.0; list netdev); Tue, 11 Mar 2003 17:01:11 -0800 (PST)
Received: from cms1.etri.re.kr (cms1.etri.re.kr [129.254.16.11]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2C113q9020564 for ; Tue, 11 Mar 2003 17:01:05 -0800
Received: from SEONG ([129.254.172.40]) by cms1.etri.re.kr with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id GRNPMBYW; Wed, 12 Mar 2003 10:00:43 +0900
Message-ID: <001801c2e833$0d44e430$28acfe81@seong>
From: "Seong Moon"
To:
Subject: arp cache deletion and netlink ?
Date: Wed, 12 Mar 2003 10:02:28 +0900 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.50.4920.2300 X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4920.2300 X-archive-position: 1933 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: seong@etri.re.kr Precedence: bulk X-list: netdev Content-Length: 171 Lines: 8 Hi there. Can I monitor the deletion of an arp cache entry through netlink interface ? I looked into kernel source, then I found there is no implementation about that. From davem@redhat.com Tue Mar 11 23:40:12 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 11 Mar 2003 23:40:15 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2C7dVq9012071 for ; Tue, 11 Mar 2003 23:40:12 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id XAA06207; Tue, 11 Mar 2003 23:38:43 -0800 Date: Tue, 11 Mar 2003 23:38:43 -0800 (PST) Message-Id: <20030311.233843.97559124.davem@redhat.com> To: yoshfuji@linux-ipv6.org Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, usagi@linux-ipv6.org Subject: Re: [PATCH] IPSEC: typo in xfrm_sk_clone_policy() From: "David S. Miller" In-Reply-To: <20030312.161749.123173528.yoshfuji@linux-ipv6.org> References: <20030312.161749.123173528.yoshfuji@linux-ipv6.org> X-FalunGong: Information control. 
X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=iso-2022-jp Content-Transfer-Encoding: 7bit X-archive-position: 1934 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 316 Lines: 9 From: YOSHIFUJI Hideaki / $B5HF#1QL@(B Date: Wed, 12 Mar 2003 16:17:49 +0900 (JST) I think following patch fixes a typo in xfrm_sk_clone_policy() which results in infinite loop if sk->policy[0] or sk->policy[1] is true. Patch is for 2.5.64. Patch applied, thank you. From yoshfuji@linux-ipv6.org Wed Mar 12 00:11:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 12 Mar 2003 00:11:13 -0800 (PST) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2C8B6q9012767 for ; Wed, 12 Mar 2003 00:11:08 -0800 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h2C7HnUl019287; Wed, 12 Mar 2003 16:17:50 +0900 Date: Wed, 12 Mar 2003 16:17:49 +0900 (JST) Message-Id: <20030312.161749.123173528.yoshfuji@linux-ipv6.org> To: davem@redhat.com, kuznet@ms2.inr.ac.ru CC: netdev@oss.sgi.com, usagi@linux-ipv6.org Subject: [PATCH] IPSEC: typo in xfrm_sk_clone_policy() From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= Organization: USAGI Project X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1935 X-ecartis-version: Ecartis 
v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev Content-Length: 844 Lines: 29 Hello. I think following patch fixes a typo in xfrm_sk_clone_policy() which results in infinite loop if sk->policy[0] or sk->policy[1] is true. Patch is for 2.5.64. Thanks. Index: include/net/xfrm.h =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/include/net/xfrm.h,v retrieving revision 1.1.1.7 diff -u -r1.1.1.7 xfrm.h --- include/net/xfrm.h 16 Feb 2003 04:09:06 -0000 1.1.1.7 +++ include/net/xfrm.h 12 Mar 2003 07:06:20 -0000 @@ -335,7 +335,7 @@ static inline int xfrm_sk_clone_policy(struct sock *sk) { if (unlikely(sk->policy[0] || sk->policy[1])) - return xfrm_sk_clone_policy(sk); + return __xfrm_sk_clone_policy(sk); return 0; } -- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From maxiu@man.poznan.pl Wed Mar 12 04:28:09 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 12 Mar 2003 04:28:16 -0800 (PST) Received: from rose.man.poznan.pl (rose.man.poznan.pl [150.254.173.3]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2CCS6q9030640 for ; Wed, 12 Mar 2003 04:28:08 -0800 Received: from rose.man.poznan.pl (localhost [127.0.0.1]) by rose.man.poznan.pl (8.12.5/8.12.5/auth/ldap/milter/tls) with ESMTP id h2CCHlUM018534 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Wed, 12 Mar 2003 13:17:48 +0100 (CET) Received: from localhost (maxiu@localhost) by rose.man.poznan.pl (8.12.5/8.12.5/Submit) with ESMTP id h2CCHl5O018530 for ; Wed, 12 Mar 2003 13:17:47 +0100 (CET) X-Authentication-Warning: rose.man.poznan.pl: maxiu owned process doing -bs Date: Wed, 12 Mar 2003 13:17:47 +0100 (CET) From: Marcin Kaminski To: netdev@oss.sgi.com Subject: socket(PF_INET6, SOCK_RAW, IPPROTO_ICMPV6) Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=iso-8859-2 
X-RAVMilter-Version: 8.4.1(snapshot 20020919) (rose) Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from QUOTED-PRINTABLE to 8bit by oss.sgi.com id h2CCS6q9030640 X-archive-position: 1936 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: maxiu@man.poznan.pl Precedence: bulk X-list: netdev Content-Length: 777 Lines: 24 Hi I've found, that on my system (2.2.22) RAW sockets for IPv6 works different than for IPv4. When I create socket like: interfaceSocket = socket(PF_INET, SOCK_RAW, IPPROTO_ICMP); I receive ICMPv4 packets with IP header, but when I use interfaceSocket = socket(PF_INET6, SOCK_RAW, IPPROTO_ICMPV6); I receive ICMPv6 packets WITHOUT IPv6 header. What should I do in order to get full packet? Man pages of raw(7) tell: For receiving the IP header is always included in the packet. But it is not true for IPv6 :( With regards - Marcin Kaminski --------------------------------- maxiu - --- software developer ------------------- 6net project --- ----- network administrator -------- Best Group admin ----- ------- Poznañ Supercomputing and Networking Center ------- From yoshfuji@linux-ipv6.org Wed Mar 12 04:39:43 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 12 Mar 2003 04:39:46 -0800 (PST) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2CCd2q9031045 for ; Wed, 12 Mar 2003 04:39:42 -0800 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h2CCd8Ul021264; Wed, 12 Mar 2003 21:39:08 +0900 Date: Wed, 12 Mar 2003 21:39:03 +0900 (JST) Message-Id: <20030312.213903.04850857.yoshfuji@linux-ipv6.org> To: maxiu@man.poznan.pl Cc: netdev@oss.sgi.com Subject: Re: socket(PF_INET6, SOCK_RAW, IPPROTO_ICMPV6) From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: References: Organization: USAGI Project X-URL: 
http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1937 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev Content-Length: 1006 Lines: 29 In article (at Wed, 12 Mar 2003 13:17:47 +0100 (CET)), Marcin Kaminski says: > When I create socket like: > interfaceSocket = socket(PF_INET, SOCK_RAW, IPPROTO_ICMP); > I receive ICMPv4 packets with IP header, but when I use > interfaceSocket = socket(PF_INET6, SOCK_RAW, IPPROTO_ICMPV6); > I receive ICMPv6 packets WITHOUT IPv6 header. It is because of the specification (RFC2292). > What should I do in order to get full packet? > Man pages of raw(7) tell: There're no portable way to send/receive whole packet including IPv6 header (and possible extension header(s)). > What should I do in order to get full packet? > Man pages of raw(7) tell: > > For receiving the IP header is always included in the packet. > > But it is not true for IPv6 :( It is an error of that manpage. 
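[Editorial sketch, not part of the original reply.] Since an IPv4 raw socket
hands back the IP header while an IPv6 raw socket does not, one possible way to
feed both into the same ICMP handling code is to step over the IPv4 header
first. The helper below is only an illustration under that assumption; the
name icmp4_payload is hypothetical and does not come from this thread or from
any library.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return a pointer to the ICMP message inside a packet read from an
 * IPv4 raw socket, or NULL if the buffer is too short or is not IPv4.
 * The IPv4 header length (IHL) is the low nibble of the first byte,
 * counted in 32-bit words, so options are skipped as well.
 * Hypothetical helper for illustration only.
 */
const uint8_t *icmp4_payload(const uint8_t *pkt, size_t len, size_t *icmp_len)
{
    size_t hlen;

    /* A valid IPv4 header is at least 20 bytes and has version 4. */
    if (len < 20 || (pkt[0] >> 4) != 4)
        return NULL;

    hlen = (size_t)(pkt[0] & 0x0f) * 4;
    if (hlen < 20 || hlen > len)
        return NULL;

    *icmp_len = len - hlen;
    return pkt + hlen;
}
```

With this, the buffer from the PF_INET socket would be passed through
icmp4_payload() first, while the buffer from the PF_INET6 socket can be used
directly, because it already starts at the ICMPv6 header.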
-- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From maxiu@man.poznan.pl Wed Mar 12 04:52:44 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 12 Mar 2003 04:52:46 -0800 (PST) Received: from rose.man.poznan.pl (rose.man.poznan.pl [150.254.173.3]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2CCq3q9032104 for ; Wed, 12 Mar 2003 04:52:44 -0800 Received: from rose.man.poznan.pl (localhost [127.0.0.1]) by rose.man.poznan.pl (8.12.5/8.12.5/auth/ldap/milter/tls) with ESMTP id h2CCprUM023964 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Wed, 12 Mar 2003 13:51:57 +0100 (CET) Received: from localhost (maxiu@localhost) by rose.man.poznan.pl (8.12.5/8.12.5/Submit) with ESMTP id h2CCprbK023960; Wed, 12 Mar 2003 13:51:53 +0100 (CET) X-Authentication-Warning: rose.man.poznan.pl: maxiu owned process doing -bs Date: Wed, 12 Mar 2003 13:51:53 +0100 (CET) From: Marcin Kaminski To: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= cc: netdev@oss.sgi.com Subject: Re: socket(PF_INET6, SOCK_RAW, IPPROTO_ICMPV6) In-Reply-To: <20030312.213903.04850857.yoshfuji@linux-ipv6.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=iso-8859-2 X-RAVMilter-Version: 8.4.1(snapshot 20020919) (rose) X-archive-position: 1938 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: maxiu@man.poznan.pl Precedence: bulk X-list: netdev Content-Length: 444 Lines: 13 On Wed, 12 Mar 2003, YOSHIFUJI Hideaki / [iso-2022-jp] $B5HF#1QL@(B wrote: > > What should I do in order to get full packet? > > There're no portable way to send/receive whole packet including > IPv6 header (and possible extension header(s)). OK. Is there simplier method of obtaining source and destaination address of icmp packet (the only informations I need from ipv6 header) than setting IPV6_PKTINFO and receiving them with recvmsg? 
From yoshfuji@linux-ipv6.org Wed Mar 12 07:33:33 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 12 Mar 2003 07:33:37 -0800 (PST) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2CFXWq9029353 for ; Wed, 12 Mar 2003 07:33:33 -0800 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h2CFXfUl022095; Thu, 13 Mar 2003 00:33:42 +0900 Date: Thu, 13 Mar 2003 00:33:41 +0900 (JST) Message-Id: <20030313.003341.49620358.yoshfuji@linux-ipv6.org> To: maxiu@man.poznan.pl Cc: netdev@oss.sgi.com Subject: Re: socket(PF_INET6, SOCK_RAW, IPPROTO_ICMPV6) From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: References: <20030312.213903.04850857.yoshfuji@linux-ipv6.org> Organization: USAGI Project X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 1939 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev Content-Length: 484 Lines: 10 In article (at Wed, 12 Mar 2003 13:51:53 +0100 (CET)), Marcin Kaminski says: > Is there simplier method of obtaining source and destaination address of > icmp packet (the only informations I need from ipv6 header) than setting > IPV6_PKTINFO and receiving them with recvmsg? No; recvmsg() and IPV6_PKTINFO socket options is the SIMPLE way for obtaining source and destination address. 
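[Editorial sketch, not part of the original reply.] The recvmsg() plus
IPV6_PKTINFO approach described above could look roughly like this. The source
address comes back in msg_name and the destination address in the ancillary
data. All function names here (enable_pktinfo, find_pktinfo, recv_with_addrs)
and the local struct pktinfo6 are hypothetical; the struct mirrors the
in6_pktinfo layout from RFC 2292/3542 because glibc only exposes the real type
under _GNU_SOURCE, and real code would normally use struct in6_pktinfo
directly. Newer stacks that follow RFC 3542 enable delivery with
IPV6_RECVPKTINFO, so the sketch prefers that name when the headers define it.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>

/* Local mirror of struct in6_pktinfo (assumption: identical layout). */
struct pktinfo6 {
    struct in6_addr addr;     /* destination address of the received packet */
    unsigned int    ifindex;  /* index of the receiving interface */
};

/* Ask the kernel to attach packet info to every packet received on fd. */
int enable_pktinfo(int fd)
{
    int on = 1;
#ifdef IPV6_RECVPKTINFO
    /* RFC 3542 name for the receive-side switch. */
    return setsockopt(fd, IPPROTO_IPV6, IPV6_RECVPKTINFO, &on, sizeof(on));
#else
    /* RFC 2292 name, as used in this thread. */
    return setsockopt(fd, IPPROTO_IPV6, IPV6_PKTINFO, &on, sizeof(on));
#endif
}

/* Scan the ancillary data for IPV6_PKTINFO; 0 if found, -1 otherwise. */
int find_pktinfo(struct msghdr *msg, struct pktinfo6 *out)
{
    struct cmsghdr *cm;

    for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
        if (cm->cmsg_level == IPPROTO_IPV6 &&
            cm->cmsg_type == IPV6_PKTINFO &&
            cm->cmsg_len >= CMSG_LEN(sizeof(*out))) {
            memcpy(out, CMSG_DATA(cm), sizeof(*out));
            return 0;
        }
    }
    return -1;
}

/*
 * Receive one packet. The source address is filled into *src by
 * recvmsg() itself, the destination comes from the ancillary data.
 * Returns the payload length, or -1 on error or missing pktinfo.
 */
ssize_t recv_with_addrs(int fd, void *buf, size_t len,
                        struct sockaddr_in6 *src, struct pktinfo6 *dst)
{
    union {
        char buf[CMSG_SPACE(sizeof(struct pktinfo6))];
        struct cmsghdr align;   /* forces suitable alignment */
    } ctl;
    struct iovec iov;
    struct msghdr msg;
    ssize_t n;

    iov.iov_base = buf;
    iov.iov_len = len;

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = src;
    msg.msg_namelen = sizeof(*src);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    n = recvmsg(fd, &msg, 0);
    if (n < 0)
        return -1;
    if (find_pktinfo(&msg, dst) != 0)
        return -1;
    return n;
}
```

A caller would create the socket with socket(PF_INET6, SOCK_RAW,
IPPROTO_ICMPV6), call enable_pktinfo() once, and then loop on
recv_with_addrs(); the buffer it fills starts at the ICMPv6 header.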
--yoshfuji From ps41@hotmail.com Wed Mar 12 15:04:17 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 12 Mar 2003 15:04:20 -0800 (PST) Received: from hotmail.com (bay1-f208.bay1.hotmail.com [65.54.245.208]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2CN4Gq9004839 for ; Wed, 12 Mar 2003 15:04:16 -0800 Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC; Wed, 12 Mar 2003 15:04:11 -0800 Received: from 136.162.88.194 by by1fd.bay1.hotmail.msn.com with HTTP; Wed, 12 Mar 2003 23:04:10 GMT X-Originating-IP: [136.162.88.194] From: "Parag Sharma" To: netdev@oss.sgi.com Subject: faulting in user pages for zero copy transmit Date: Wed, 12 Mar 2003 14:04:10 -0900 Mime-Version: 1.0 Content-Type: text/plain; format=flowed Message-ID: X-OriginalArrivalTime: 12 Mar 2003 23:04:11.0102 (UTC) FILETIME=[B16637E0:01C2E8EB] X-archive-position: 1940 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ps41@hotmail.com Precedence: bulk X-list: netdev Content-Length: 473 Lines: 22 Hi, I am looking at 2.4.19-ac4 source code and trying to understand the zero copy transmit. I have been trying to figure out how/where are the user pages, that might have been swapped out, brought into memory prior to DMA? I would appreciate any help in figuring this one out. 
thanks Parag From maxiu@man.poznan.pl Thu Mar 13 02:41:10 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 13 Mar 2003 02:41:21 -0800 (PST) Received: from rose.man.poznan.pl (rose.man.poznan.pl [150.254.173.3]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2DAf7q9023691 for ; Thu, 13 Mar 2003 02:41:09 -0800 Received: from rose.man.poznan.pl (localhost [127.0.0.1]) by rose.man.poznan.pl (8.12.5/8.12.5/auth/ldap/milter/tls) with ESMTP id h2DAeuUM021469 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Thu, 13 Mar 2003 11:40:57 +0100 (CET) Received: from localhost (maxiu@localhost) by rose.man.poznan.pl (8.12.5/8.12.5/Submit) with ESMTP id h2DAetTY021466; Thu, 13 Mar 2003 11:40:56 +0100 (CET) X-Authentication-Warning: rose.man.poznan.pl: maxiu owned process doing -bs Date: Thu, 13 Mar 2003 11:40:55 +0100 (CET) From: Marcin Kaminski To: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= cc: netdev@oss.sgi.com Subject: Re: socket(PF_INET6, SOCK_RAW, IPPROTO_ICMPV6) In-Reply-To: <20030313.003341.49620358.yoshfuji@linux-ipv6.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=iso-8859-2 X-RAVMilter-Version: 8.4.1(snapshot 20020919) (rose) X-archive-position: 1943 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: maxiu@man.poznan.pl Precedence: bulk X-list: netdev Content-Length: 728 Lines: 17 On Thu, 13 Mar 2003, YOSHIFUJI Hideaki / [iso-2022-jp] $B5HF#1QL@(B wrote: > No; recvmsg() and IPV6_PKTINFO socket options is the SIMPLE way for > obtaining source and destination address. OK, it works very well, but is there a way which is common to IPv4 and IPv6 to get ICMP packets?
Obtaining addresses is common to both protocols, but now I get IP header + ICMP header from IPv4 sockets, and ICMP header from IPv6 sockets, so I must process them differently (basic ICMP packets have the same structure, only different values so routines for ICMP could be universal). You wrote that there is no portable way to obtain IPv6 + ICMPv6, so is there a way to portable obtain only ICMPv4 (without IPv4 header)? With regards From anton@samba.org Thu Mar 13 11:25:50 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 13 Mar 2003 11:26:00 -0800 (PST) Received: from lists.samba.org (dp.samba.org [66.70.73.150]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2DJP9q9003141 for ; Thu, 13 Mar 2003 11:25:50 -0800 Received: by lists.samba.org (Postfix, from userid 504) id 26CED2C053; Thu, 13 Mar 2003 19:25:09 +0000 (GMT) Date: Fri, 14 Mar 2003 06:24:57 +1100 From: Anton Blanchard To: netdev@oss.sgi.com Cc: davem@redhat.com, akpm@digeo.com, bcrl@redhat.com Subject: recvmsg compat code Message-ID: <20030313192457.GA3279@krispykreme> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.3i X-archive-position: 1945 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: anton@samba.org Precedence: bulk X-list: netdev Content-Length: 1522 Lines: 55 Hi, The recent clean up of the duplicated recvmsg code (which I was happy to see go in) broke my sshd. It turns out compat handling of fd passing is broken. We were looking for the MSG_CMSG_COMPAT flag in msg->msg_flags. I was just about to pass down flags into the two problem functions, however put_cmsg is called from a bunch of places. Any thoughts? 
Anton ===== net/core/scm.c 1.6 vs edited ===== --- 1.6/net/core/scm.c Fri Mar 7 06:06:44 2003 +++ edited/net/core/scm.c Fri Mar 14 06:18:03 2003 @@ -165,14 +165,15 @@ return err; } -int put_cmsg(struct msghdr * msg, int level, int type, int len, void *data) +int put_cmsg(struct msghdr * msg, int level, int type, int len, void *data, + unsigned int flags) { struct cmsghdr *cm = (struct cmsghdr*)msg->msg_control; struct cmsghdr cmhdr; int cmlen = CMSG_LEN(len); int err; - if (MSG_CMSG_COMPAT & msg->msg_flags) + if (MSG_CMSG_COMPAT & flags) return put_cmsg_compat(msg, level, type, len, data); if (cm==NULL || msg->msg_controllen < sizeof(*cm)) { @@ -200,7 +201,8 @@ return err; } -void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm) +void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm, + unsigned long flags) { struct cmsghdr *cm = (struct cmsghdr*)msg->msg_control; @@ -210,7 +212,7 @@ int *cmfptr; int err = 0, i; - if (MSG_CMSG_COMPAT & msg->msg_flags) + if (MSG_CMSG_COMPAT & flags) return scm_detach_fds_compat(msg, scm); if (msg->msg_controllen > sizeof(struct cmsghdr)) From mcr@sandelman.ottawa.on.ca Sun Mar 16 20:25:58 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 16 Mar 2003 20:26:13 -0800 (PST) Received: from noxmail.sandelman.ottawa.on.ca ([192.139.46.78]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2H4Ppq9001931 for ; Sun, 16 Mar 2003 20:25:58 -0800 Received: from sandelman.ottawa.on.ca (wl-129-246.wireless.ietf56.ietf.org [130.129.129.246]) by noxmail.sandelman.ottawa.on.ca (8.11.6/8.11.6) with ESMTP id h2H4PRD04384 (using TLSv1/SSLv3 with cipher EDH-RSA-DES-CBC3-SHA (168 bits) verified OK) for ; Sun, 16 Mar 2003 23:25:29 -0500 (EST) Received: from marajade.sandelman.ottawa.on.ca (marajade [127.0.0.1] (may be forged)) by sandelman.ottawa.on.ca (8.12.3/8.12.3/Debian -4) with ESMTP id h2H4PPWm006365 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO) for ; Sun, 16 Mar 2003 20:25:26 -0800 Received: from 
marajade.sandelman.ottawa.on.ca (mcr@localhost) by marajade.sandelman.ottawa.on.ca (8.12.3/8.12.3/Debian-5) with ESMTP id h2H4PO8h006362 for ; Sun, 16 Mar 2003 20:25:25 -0800 Message-Id: <200303170425.h2H4PO8h006362@marajade.sandelman.ottawa.on.ca> To: netdev@oss.sgi.com Subject: BUG: 2.4.21-pre5 changes network scan order Mime-Version: 1.0 (generated by tm-edit 1.8) Content-Type: text/plain; charset=US-ASCII Date: Sun, 16 Mar 2003 20:25:24 -0800 From: Michael Richardson X-archive-position: 1953 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mcr@sandelman.ottawa.on.ca Precedence: bulk X-list: netdev Content-Length: 2928 Lines: 67 -----BEGIN PGP SIGNED MESSAGE----- We installed 2.4.21-pre5 on a system that had 2.4.20 on it before. A) The network scan order changes. It used to be: eth0 = Intel (on motherboard) eth1 = DP8381x eth2 = DS21140 which WAS the PCI BIOS order. 2.4.21-pre5 does NOT get it in the PCI bios order. B) Since we all agree that it is unacceptable to make a change like this in the production stream, I expect that this is a bug. dhs-[~] root 31 #lspci 00:00.0 Host bridge: Intel Corp. 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 0) 00:01.0 PCI bridge: Intel Corp. 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03) 00:07.0 ISA bridge: Intel Corp. 82371AB/EB/MB PIIX4 ISA (rev 02) 00:07.1 IDE interface: Intel Corp. 82371AB/EB/MB PIIX4 IDE (rev 01) 00:07.2 USB Controller: Intel Corp. 82371AB/EB/MB PIIX4 USB (rev 01) 00:07.3 Bridge: Intel Corp. 82371AB/EB/MB PIIX4 ACPI (rev 02) 00:0d.0 Ethernet controller: Intel Corp. 82559ER (rev 09) 00:11.0 Ethernet controller: National Semiconductor Corporation DP83815 (MacPhyr 00:13.0 Ethernet controller: Digital Equipment Corporation DECchip 21140 [Faste) 01:00.0 VGA compatible controller: S3 Inc. 86c368 [Trio 3D/2X] (rev 02) tulip0: EEPROM default media type Autosense. tulip0: Index #0 - Media MII (#11) described by a 21140 MII PHY (1) block. 
tulip0: MII transceiver #0 config 1000 status 7809 advertising 01e1. divert: allocating divert_blk for eth0 eth0: Digital DS21140 Tulip rev 34 at 0xec00, 00:40:05:A3:52:E6, IRQ 10. eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100.html eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin and others PCI: Found IRQ 9 for device 00:0d.0 divert: allocating divert_blk for eth1 eth1: Intel Corp. 82559ER, 00:60:EF:11:4E:5A, IRQ 9. Receiver lock-up bug exists -- enabling work-around. Board assembly 645520-034, Physical connectors present: RJ45 Primary interface chip DP83840 PHY #1. DP83840 specific setup, setting register 23 to 0422. General self-test: passed. Serial sub-system self-test: passed. Internal registers self-test: passed. ROM checksum self-test: passed (0xdbd8681d). Receiver lock-up workaround activated. natsemi dp8381x driver, version 1.07+LK1.0.17, Sep 27, 2002 originally by Donald Becker http://www.scyld.com/network/natsemi.html 2.4.x kernel port by Jeff Garzik, Tjeerd Mulder PCI: Found IRQ 9 for device 00:11.0 divert: allocating divert_blk for eth2 eth2: NatSemi DP8381[56] at 0xe080f000, 00:a0:cc:a1:fd:84, IRQ 9. 
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.7 (GNU/Linux) Comment: Finger me for keys iQCVAwUBPnVOI4qHRg3pndX9AQEY5QP/Uu0jvlmLSdH94lyn3yln0Psszc3HjTw/ oc3CiuuIFTlZ7brJ/IuBuGjCrgS31+oaOUMq+xgqQDlXL+Pj9Xw3Q8llphjyWTKz M15s2If2fTAVTYDuKa9WMGBT8HfmSDrE+V/2UulsG1PivUyztID8VVmqeDECctL0 nivCqDooFV4= =mUBg -----END PGP SIGNATURE----- From hadi@cyberus.ca Sun Mar 16 20:43:46 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 16 Mar 2003 20:43:52 -0800 (PST) Received: from mx03.cyberus.ca (mx03.cyberus.ca [216.191.240.24]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2H4hiq9003158 for ; Sun, 16 Mar 2003 20:43:45 -0800 Received: from shell.cyberus.ca ([216.191.240.114]) by mx03.cyberus.ca with esmtp (Exim 4.10) id 18umTP-0004hU-00; Sun, 16 Mar 2003 23:43:43 -0500 Received: from shell.cyberus.ca (localhost.cyberus.ca [127.0.0.1]) by shell.cyberus.ca (8.12.6/8.12.6) with ESMTP id h2H4hCqu009296; Sun, 16 Mar 2003 23:43:12 -0500 (EST) (envelope-from hadi@cyberus.ca) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.12.6/8.12.6/Submit) with ESMTP id h2H4hBTQ009293; Sun, 16 Mar 2003 23:43:11 -0500 (EST) (envelope-from hadi@cyberus.ca) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 16 Mar 2003 23:43:10 -0500 (EST) From: jamal To: Michael Richardson cc: netdev@oss.sgi.com, Marcelo Tosatti Subject: Re: BUG: 2.4.21-pre5 changes network scan order In-Reply-To: <200303170425.h2H4PO8j006362@marajade.sandelman.ottawa.on.ca> Message-ID: <20030316233446.T9241@shell.cyberus.ca> References: <200303170425.h2H4PO8j006362@marajade.sandelman.ottawa.on.ca> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 1954 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Content-Length: 3533 Lines: 88 This is definitely unacceptable behavior. Did driverfs sneak into 2.4 ?
cheers, jamal On Sun, 16 Mar 2003, Michael Richardson wrote: > ------- Blind-Carbon-Copy > > To: netdev@oss.sgi.com > Subject: BUG: 2.4.21-pre5 changes network scan order > Mime-Version: 1.0 (generated by tm-edit 1.8) > Content-Type: text/plain; charset=US-ASCII > Date: Sun, 16 Mar 2003 20:25:24 -0800 > From: Michael Richardson > > - -----BEGIN PGP SIGNED MESSAGE----- > > > We installed 2.4.21-pre5 on a system that had 2.4.20 on it before. > A) The network scan order changes. It used to be: > eth0 = Intel (on motherboard) > eth1 = DP8381x > eth2 = DS21140 > > which WAS the PCI BIOS order. > > 2.4.21-pre5 does NOT get it in the PCI bios order. > > B) Since we all agree that it is unacceptable to make a change like this in > the production stream, I expect that this is a bug. > > > dhs-[~] root 31 #lspci > 00:00.0 Host bridge: Intel Corp. 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 0) > 00:01.0 PCI bridge: Intel Corp. 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03) > 00:07.0 ISA bridge: Intel Corp. 82371AB/EB/MB PIIX4 ISA (rev 02) > 00:07.1 IDE interface: Intel Corp. 82371AB/EB/MB PIIX4 IDE (rev 01) > 00:07.2 USB Controller: Intel Corp. 82371AB/EB/MB PIIX4 USB (rev 01) > 00:07.3 Bridge: Intel Corp. 82371AB/EB/MB PIIX4 ACPI (rev 02) > 00:0d.0 Ethernet controller: Intel Corp. 82559ER (rev 09) > 00:11.0 Ethernet controller: National Semiconductor Corporation DP83815 (MacPhyr > 00:13.0 Ethernet controller: Digital Equipment Corporation DECchip 21140 [Faste) > 01:00.0 VGA compatible controller: S3 Inc. 86c368 [Trio 3D/2X] (rev 02) > > > > tulip0: EEPROM default media type Autosense. > tulip0: Index #0 - Media MII (#11) described by a 21140 MII PHY (1) block. > tulip0: MII transceiver #0 config 1000 status 7809 advertising 01e1. > divert: allocating divert_blk for eth0 > eth0: Digital DS21140 Tulip rev 34 at 0xec00, 00:40:05:A3:52:E6, IRQ 10. 
> eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100.html > eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin and others > PCI: Found IRQ 9 for device 00:0d.0 > divert: allocating divert_blk for eth1 > eth1: Intel Corp. 82559ER, 00:60:EF:11:4E:5A, IRQ 9. > Receiver lock-up bug exists -- enabling work-around. > Board assembly 645520-034, Physical connectors present: RJ45 > Primary interface chip DP83840 PHY #1. > DP83840 specific setup, setting register 23 to 0422. > General self-test: passed. > Serial sub-system self-test: passed. > Internal registers self-test: passed. > ROM checksum self-test: passed (0xdbd8681d). > Receiver lock-up workaround activated. > natsemi dp8381x driver, version 1.07+LK1.0.17, Sep 27, 2002 > originally by Donald Becker > http://www.scyld.com/network/natsemi.html > 2.4.x kernel port by Jeff Garzik, Tjeerd Mulder > PCI: Found IRQ 9 for device 00:11.0 > divert: allocating divert_blk for eth2 > eth2: NatSemi DP8381[56] at 0xe080f000, 00:a0:cc:a1:fd:84, IRQ 9. 
> - -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.0.7 (GNU/Linux) > Comment: Finger me for keys > > iQCVAwUBPnVOI4qHRg3pndX9AQEY5QP/Uu0jvlmLSdH94lyn3yln0Psszc3HjTw/ > oc3CiuuIFTlZ7brJ/IuBuGjCrgS31+oaOUMq+xgqQDlXL+Pj9Xw3Q8llphjyWTKz > M15s2If2fTAVTYDuKa9WMGBT8HfmSDrE+V/2UulsG1PivUyztID8VVmqeDECctL0 > nivCqDooFV4= > =mUBg > - -----END PGP SIGNATURE----- > > ------- End of Blind-Carbon-Copy > From andwes-8@student.luth.se Mon Mar 17 06:06:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 17 Mar 2003 06:07:03 -0800 (PST) Received: from gepetto.dc.luth.se (gepetto.dc.luth.se [130.240.42.40]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2HE6Dq9025447 for ; Mon, 17 Mar 2003 06:06:56 -0800 Received: from legion (tomten.campus.luth.se [130.240.221.171]) by gepetto.dc.luth.se (8.12.5/8.12.5) with SMTP id h2HE6CEZ002858 for ; Mon, 17 Mar 2003 15:06:12 +0100 (MET) Message-ID: <001101c2ec8e$5da12bf0$abddf082@legion> From: "Andreas Westin" To: Subject: bug ? Date: Mon, 17 Mar 2003 15:06:06 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2800.1106 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106 X-archive-position: 1955 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: andwes-8@student.luth.se Precedence: bulk X-list: netdev Content-Length: 2767 Lines: 82 Hello, I've gotten this "crash" now several times with >=2.4.20. It usually happens within 1-2 days after bootup while downloading a larger amount of data at around 500k/s. Using gentoo linux on a dual 400mhz (celeron) abit bp6, i know this could be due to hardware problems but i thought that I might aswell report it if its a real bug. I hope that the info below is all that you need. 
/Andreas Network driver: 8139too Fast Ethernet driver 0.9.26 eth0: RealTek RTL8139 Fast Ethernet at 0xd8800000, 00:d0:70:01:0f:2f, IRQ 18 eth0: Identified 8139 chip type 'RTL-8139C' Unable to handle kernel paging request at virtual address e9272d39 c026c49f *pde = 00000000 Oops: 0000 CPU: 0 EIP: 0010:[] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00210246 eax: 00008000 ebx: e9272d0d ecx: d561e3a0 edx: 00000000 esi: cad75f6c edi: cc47bec0 ebp: cad75f6c esp: cad75ee4 ds: 0018 es: 0018 ss: 0018 Process ncftp (pid: 31976, stackpage=cad75000) Stack: d561e3a0 cad75f6c 00008000 00000000 00000000 cad75efc 00000000 00008000 cad75f20 c022c008 cc47bec0 cad75f6c 00008000 00000000 cad75f20 00000000 00000000 00000000 00000000 00000000 00000049 cad74000 00000064 d65bb520 Call Trace: [] [] [] [] Code: ff 53 2c 85 c0 89 c2 78 07 8b 44 24 18 89 46 04 8b 5c 24 1c >>EIP; c026c49f <===== Trace; c022c008 Trace; c022c147 Trace; c013ce97 Trace; c010773f Code; c026c49f 00000000 <_EIP>: Code; c026c49f <===== 0: ff 53 2c call *0x2c(%ebx) <===== Code; c026c4a2 3: 85 c0 test %eax,%eax Code; c026c4a4 5: 89 c2 mov %eax,%edx Code; c026c4a6 7: 78 07 js 10 <_EIP+0x10> Code; c026c4a8 9: 8b 44 24 18 mov 0x18(%esp,1),%eax Code; c026c4ac d: 89 46 04 mov %eax,0x4(%esi) Code; c026c4af 10: 8b 5c 24 1c mov 0x1c(%esp,1),%ebx Linux hostname 2.4.21-pre5 #2 SMP Fri Feb 28 00:03:43 CET 2003 i686 Celeron (Mendocino) GenuineIntel GNU/Linux Gnu C 3.2.2 Gnu make 3.80 util-linux 2.11z mount 2.11z modutils 2.4.23 e2fsprogs 1.32 reiserfsprogs 3.6.3 Linux C Library 2.3.2 Dynamic linker (ldd) 2.3.2 Procps 3.1.6 Net-tools 1.60 Kbd 1.06 Sh-utils 2.0.15 From ralph@istop.com Mon Mar 17 19:43:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 17 Mar 2003 19:43:32 -0800 (PST) Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2I3hOq9018274 for ; Mon, 17 Mar 2003 19:43:25 -0800 Received: from ns.istop.com (ns.istop.com 
[66.11.168.199]) by smtp.istop.com (Postfix) with ESMTP id 187DD36A1E for ; Mon, 17 Mar 2003 22:43:24 -0500 (EST) Date: Mon, 17 Mar 2003 22:43:38 -0500 (EST) From: Ralph Doncaster Reply-To: ralph+d@istop.com To: netdev@oss.sgi.com Subject: Re: Linux router performance (3c59x) (fwd) Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1957 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ralph@istop.com Precedence: bulk X-list: netdev Content-Length: 6555 Lines: 167 I haven't heard from Jamal or Dave, so perhaps someone from this list has some wisdom to impart. Currently the box in question is running a 67% system load with ~40kpps. Here's the switch port stats that the 2 3c905cx cards are plugged into: 5 minute input rate 36143000 bits/sec, 8914 packets/sec 5 minute output rate 54338000 bits/sec, 10722 packets/sec - 5 minute input rate 50585000 bits/sec, 12445 packets/sec 5 minute output rate 34326000 bits/sec, 9596 packets/sec Ralph Doncaster principal, IStop.com ---------- Forwarded message ---------- Date: Mon, 17 Mar 2003 11:18:25 -0500 (EST) From: Ralph Doncaster To: jamal Cc: "mcr@sandelman.ottawa.on.ca" , "vortex@scyld.com" , "davem@redhat.com" Subject: Re: Linux router performance (3c59x) Hi Jamal, I found a 3c59x NAPI patch (from orr.falooley.org/pub/linux/net/, which seems to be down right now), and applied that against the stock 2.4.20 kernel. Unfortunately I don't see a noticable improvement from 2.4.19 without NAPI. When I send a 10kpps flood of 64-byte frames through the router, the CPU flatlines (duron 750). 
The number of interrupts/sec doesn't go down and the context switching is reduced, so NAPI is having some effect, but not the intended reduction in CPU load (the 10kpps flood was done during the middle of this vmstat log, when you see idle go to 0):

root@tor-router /usr/src# vmstat 2
   procs                      memory    swap          io      system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 0  0  1    932 623600  15636  27916   0   0     2    51  2762  1085  13  41  47
 0  0  1    932 623600  15636  27916   0   0     0     0 18603  1660   0  43  57
 0  0  1    932 623600  15636  27916   0   0     0     0 18495  1593   0  50  50
 0  0  1    932 623584  15652  27916   0   0     0    20 18949  1671   0  49  51
 0  0  1    932 623584  15652  27916   0   0     0     0 18768  1192   0  63  37
 0  0  1    932 623584  15652  27916   0   0     0     0 16084    62   0 100   0
 0  0  1    932 623584  15652  27916   0   0     0     0 16059   132   0 100   0
 0  0  1    932 623576  15660  27916   0   0     0     8 18043    24   0 100   0
 0  0  1    932 623576  15660  27916   0   0     0     0 17795    71   0 100   0
 0  0  1    932 623576  15660  27916   0   0     0     0 14181    70   0 100   0
 0  0  1    932 623576  15660  27916   0   0     0     0 16764   122   0 100   0
 0  0  1    932 623576  15660  27916   0   0     0     0 16802    63   0 100   0
 0  0  1    932 623568  15668  27916   0   0     0     8 17044    23   0 100   0
 0  0  1    932 623568  15668  27916   0   0     0     0 19198  1520   0  48  52
 0  0  1    932 623568  15668  27916   0   0     0     0 18684  1611   0  39  61
 0  0  1    932 623568  15668  27916   0   0     0     0 18256  1518   0  44  56

This is a box doing straight routing (no firewalling), with a full bgp4 routing table (>100k routes). Kernel advanced router config option as well as fastroute was chosen.

Is the 3c59x NAPI patch just no good, or is there something else I should be doing to get decent linux routing throughput with it?

Ralph Doncaster
principal, IStop.com

On Sat, 14 Dec 2002, jamal wrote:

>
>
> On Fri, 13 Dec 2002, Ralph Doncaster wrote:
>
> > Hi Jamal,
> >
> > I'm running 2.4.19 on a linux router in Toronto. It's got 2 3c905CX
> > cards, and I've disabled rx_copybreak in the driver. FASTROUTE is not
> > turned on. CPU is a Duron-750. At around 40kpps, the box hits 100% CPU.
> > > > And it should probably die if you start hitting around 60kpps i.e no > packets make it. > > > Based on your numbers for 2.2.14, it would seem FASTROUTE would make a big > > difference. > > http://robur.slu.se/Linux/net-development/jamal/FF-html/img7.htm > > > > It has its disadvantages: > It chews a lot of CPU and theres a lot of things you must bypass > by virtue of DMA-DMA connectivity. > > > Comparing the Usenix paper results for 2.4 seem to show that FASTROUTE > > doesn't make as much difference. Since your numbers show almost 100kpps > > I think if you only have a couple of interfaces on a P2 you should pretty > much be able to do about 100Kpps on each. > > > for regular 2.4 I'm guessing that means the irqmitigation of the stock > > 3c59x.c sucks, even though it looks like it will process multiple packets > > per interrupt under load (max_interrupt_work). > > irq mitigation is only done by a few NICs. NAPI does a better mitigation > in s/ware without requiring h/ware support. The mitigation is based on > feedback from the system; so if the system is slow (pentium vs P3) you > process less and NEVER die. I believe theres 3c59x.c NAPI driver. > > > DaveM was rather terse > > when I communicated with him recently, but what he clearly said was the > > e1000 is the best performer under linux due to software IRQ mitigation > > features in the driver (not the hardware RxIntDelay feature). > > > > He was more than likely refering to NAPI. e1000 is definetly the best; but > i dont own any; Robert Olson owns a few and he swears by them. I can email > him for details if you are interested. > > > Now that 2.4.20 includes the e1000 driver, it would seem the easiest way > > to get high-performance routing under Linux would be for me to upgrade > > from 2.4.19 to 2.4.20 with the FASTROUTE enabled, and swap my 3C905CX > > cards for a couple of e1000's. > > > > No. Forget FASTROUTE. 
I don't think anyone is looking at it at all or that it
> is ever being updated; we killed it with NAPI: performance-wise, no
> difference, and feature-wise NAPI is superior.
> Although recently I have been thinking of experimenting with CISCO-like
> adjacencies/CEF (but that is a totally different thing).
>
> > Looking at the README
> > ftp://robur.slu.se/pub/Linux/net-development/NAPI/README
> > It seems to indicate the 2.4.20 e1000 driver is NAPIfied, so I shouldn't
> > need any NAPI patches.
> >
>
> 2.4.20 already has NAPI built in. When you compile the kernel, you have
> it.
>
> > Lastly, your comments to MCR about NAPI being better than FASTROUTE seem
> > to imply that I don't need FASTROUTE. However I would expect FASTROUTE to
> > provide additional performance when used with NAPI (since it avoids the
> > codepath for firewalling & NAT).
> >
>
> If you don't have any firewalling policies on, there's no difference.
> NAT is a different beast - that thing puts Linux to shame.
> So, no, you don't need FASTROUTE.
> > cheers, > jamal > > From greearb@candelatech.com Mon Mar 17 20:48:28 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 17 Mar 2003 20:48:31 -0800 (PST) Received: from grok.yi.org (IDENT:i310CLwQNXg+lmUPA94EVJ/GaxeEpdtH@dhcp93-dsl-usw3.w-link.net [206.129.84.93]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2I4mRq9019113 for ; Mon, 17 Mar 2003 20:48:28 -0800 Received: from candelatech.com (IDENT:FBAqSCWmwi3t9499U+VZcluml7HMdYK9@localhost.localdomain [127.0.0.1]) by grok.yi.org (8.11.6/8.11.6) with ESMTP id h2I4m8a26649; Mon, 17 Mar 2003 20:48:09 -0800 Message-ID: <3E76A508.30007@candelatech.com> Date: Mon, 17 Mar 2003 20:48:08 -0800 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3b) Gecko/20030210 X-Accept-Language: en-us, en MIME-Version: 1.0 To: ralph+d@istop.com CC: netdev@oss.sgi.com Subject: Re: Linux router performance (3c59x) (fwd) References: In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1958 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Content-Length: 1074 Lines: 32 Ralph Doncaster wrote: > I haven't heard from Jamal or Dave, so perhaps someone from this list has > some wisdom to impart. > Currently the box in question is running a 67% system load with ~40kpps. > Here's the switch port stats that the 2 3c905cx cards are plugged into: > > 5 minute input rate 36143000 bits/sec, 8914 packets/sec > 5 minute output rate 54338000 bits/sec, 10722 packets/sec > - > 5 minute input rate 50585000 bits/sec, 12445 packets/sec > 5 minute output rate 34326000 bits/sec, 9596 packets/sec When using larger packets, NAPI doesn't have much effect. Have you tried routing with simple routing tables to see if that speeds anything up? 
Could also try an e100 or Tulip NIC. Those usually work pretty good... Or, could use an e1000 GigE NIC... It's also possible that you are just reaching the limit of your system. Ben -- Ben Greear President of Candela Technologies Inc http://www.candelatech.com ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear From ralph@istop.com Mon Mar 17 21:10:10 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 17 Mar 2003 21:10:47 -0800 (PST) Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2I59Tq9030978 for ; Mon, 17 Mar 2003 21:10:10 -0800 Received: from ns.istop.com (ns.istop.com [66.11.168.199]) by smtp.istop.com (Postfix) with ESMTP id 81DF9369CD; Tue, 18 Mar 2003 00:09:28 -0500 (EST) Date: Tue, 18 Mar 2003 00:09:42 -0500 (EST) From: Ralph Doncaster Reply-To: ralph+d@istop.com To: Ben Greear Cc: "netdev@oss.sgi.com" Subject: Re: Linux router performance (3c59x) (fwd) In-Reply-To: <3E76A508.30007@candelatech.com> Message-ID: References: <3E76A508.30007@candelatech.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1959 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ralph@istop.com Precedence: bulk X-list: netdev Content-Length: 1604 Lines: 42 On Mon, 17 Mar 2003, Ben Greear wrote: > Ralph Doncaster wrote: [...] > > Currently the box in question is running a 67% system load with ~40kpps. > > Here's the switch port stats that the 2 3c905cx cards are plugged into: > > > > 5 minute input rate 36143000 bits/sec, 8914 packets/sec > > 5 minute output rate 54338000 bits/sec, 10722 packets/sec > > - > > 5 minute input rate 50585000 bits/sec, 12445 packets/sec > > 5 minute output rate 34326000 bits/sec, 9596 packets/sec > > When using larger packets, NAPI doesn't have much effect. 
So I should just give up on Linux and go with FreeBSD?
http://info.iet.unipi.it/~luigi/polling/

> Have you tried routing with simple routing tables to see if that
> speeds anything up?

No, but I did read through a bunch of the route-cache code, and even with the dynamic hashtable size introduced in recent 2.4 revs, it looks very inefficient for core routing. I'd expect a speedup with a small routing table, but then it would be useless as a core router in my network.

> Could also try an e100 or Tulip NIC. Those usually work pretty
> good... Or, could use an e1000 GigE NIC...

If I can get confirmation that under similar conditions the e1000 performs significantly better, then I'll go that route.

> It's also possible that you are just reaching the limit of your
> system.

The NAPI docs imply 144kpps is easily attainable on lesser hardware than mine. Also, I can't see bandwidth being the issue, as I'm moving <25Mbytes/sec over the PCI bus. I should be able to do more than double that before I have to worry about PCI saturation.
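
The bandwidth figure above can be sanity-checked with a little arithmetic. The sketch below sums the four switch-port counters quoted at the top of this thread, assuming every frame the switch counts crosses the router's PCI bus exactly once (what the switch sends, the router receives, and vice versa). The 133 Mbytes/sec ceiling is the theoretical peak of a 32-bit/33 MHz PCI bus, which is an assumption: the thread does not say which bus the router has.

```python
# Switch-port 5-minute rates quoted earlier in the thread, as
# (bits/sec, packets/sec) pairs: input and output for each of the two ports.
PORT_RATES = [
    (36_143_000, 8_914),   # port 1 input
    (54_338_000, 10_722),  # port 1 output
    (50_585_000, 12_445),  # port 2 input
    (34_326_000, 9_596),   # port 2 output
]

# Assumption (not stated in the thread): a 32-bit, 33 MHz PCI bus,
# whose theoretical peak is about 133 Mbytes/sec.
PCI_PEAK_MBYTES = 133.0


def bus_load(rates):
    """Return (total packets/sec, total Mbytes/sec, mean bytes per packet)."""
    total_bits = sum(bits for bits, _ in rates)
    total_pps = sum(pps for _, pps in rates)
    total_bytes = total_bits / 8
    return total_pps, total_bytes / 1e6, total_bytes / total_pps


pps, mbytes, mean_size = bus_load(PORT_RATES)
print(f"{pps} packets/sec, {mbytes:.1f} Mbytes/sec, "
      f"{mean_size:.0f} bytes/packet on average, "
      f"{100 * mbytes / PCI_PEAK_MBYTES:.0f}% of the assumed PCI peak")
```

The totals come to roughly 41.7 kpps and 21.9 Mbytes/sec, in line with the ~40kpps and <25Mbytes/sec quoted (each forwarded packet is counted once on receive and once on transmit). The mean packet size is a little over 500 bytes, so this is not a small-packet load.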
-Ralph From greearb@candelatech.com Mon Mar 17 22:30:51 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 17 Mar 2003 22:30:55 -0800 (PST) Received: from grok.yi.org (IDENT:k1HCooalvCfNsQtKVDEeZFYhe2Gu5Aay@dhcp93-dsl-usw3.w-link.net [206.129.84.93]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2I6Unq9004679 for ; Mon, 17 Mar 2003 22:30:50 -0800 Received: from candelatech.com (IDENT:8NC5ZwCIsZC9Zhav0mw0rrPZ1eJIXTC/@localhost.localdomain [127.0.0.1]) by grok.yi.org (8.11.6/8.11.6) with ESMTP id h2I6Ula07511; Mon, 17 Mar 2003 22:30:47 -0800 Message-ID: <3E76BD17.7060208@candelatech.com> Date: Mon, 17 Mar 2003 22:30:47 -0800 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3b) Gecko/20030210 X-Accept-Language: en-us, en MIME-Version: 1.0 To: ralph+d@istop.com CC: "netdev@oss.sgi.com" Subject: Re: Linux router performance (3c59x) (fwd) References: <3E76A508.30007@candelatech.com> In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1960 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Content-Length: 2899 Lines: 85 Ralph Doncaster wrote: > On Mon, 17 Mar 2003, Ben Greear wrote: > > >>Ralph Doncaster wrote: > > [...] > >>>Currently the box in question is running a 67% system load with ~40kpps. >>>Here's the switch port stats that the 2 3c905cx cards are plugged into: >>> >>> 5 minute input rate 36143000 bits/sec, 8914 packets/sec >>> 5 minute output rate 54338000 bits/sec, 10722 packets/sec >>>- >>> 5 minute input rate 50585000 bits/sec, 12445 packets/sec >>> 5 minute output rate 34326000 bits/sec, 9596 packets/sec >> >>When using larger packets, NAPI doesn't have much effect. > > > So I should just give up on Linux and go with FreeBSD? 
> http://info.iet.unipi.it/~luigi/polling/ It would be interesting to see a performance comparison. >>Have you tried routing with simple routing tables to see if that >>speeds anything up? > > No, but I did read through a bunch of the route-cache code and even with > the dynamic hashtable size introduced in recent 2.4 revs, it looks very > ineficient for core routing. I'd expect a speedup with a small routing > table, but then it would be useless as a core router in my network. So, if making the routing table smaller 'fixes' things, then NAPI and your NIC is not the problem. >>Could also try an e100 or Tulip NIC. Those usually work pretty >>good... Or, could use an e1000 GigE NIC... > > > If I can get confirmation that under similar conditions the e1000 performs > significantly better, then I'll go that route. In my testing, I could get about 140kpps (64-byte packets) tx or rx on a single port. Bi-directional I got about 90kpps. This was a 1.8Ghz AMD processor with a tulip driver. When using MTU sized packets, could fill 4 ports with tx+rx traffic at 90+Mbps. With e1000 on a 64/66 PCI bus, I could transmit around 860Mbps with 1500 byte packets (tx + rx on the same machine, but different ports of a dual-port NIC), and could generate maybe 400kpps with small packets (I don't remember the exact number here...) This was using a slightly modified (and slower) pktgen module, which is standard in the latest kernels. So, sending/receiving packets at extreme rates is possible. Routing with 100k entries may not work nearly so well. >>It's also possible that you are just reaching the limit of your >>system. > > > The NAPI docs imply 144kpps is easily attainable on lesser hardware than > mine. Also I can't see bandwidth being the issue as I'm moving > <25Mbytes/sec over the PCI bus. I should be able to do more than double > that before I have to worry about PCI saturation. So, test w/smaller routing tables so you can see if it's routing or the NIC that is slowing you down. 
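
For context on the packet rates above, the following sketch computes the theoretical maximum frame rate of an Ethernet link. It assumes standard Ethernet framing: each frame occupies an extra 20 bytes on the wire (8 bytes of preamble plus a 12-byte inter-frame gap), and the frame sizes given include the FCS. The function name is made up for illustration.

```python
# Per-frame wire overhead on Ethernet: 8 bytes preamble/SFD + 12 bytes
# inter-frame gap. Frame sizes passed in are assumed to include the FCS.
WIRE_OVERHEAD_BYTES = 20


def max_frames_per_sec(link_bits_per_sec, frame_bytes):
    """Theoretical upper bound on frames/sec for a given frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bits_per_sec / bits_on_wire


for label, link in (("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)):
    for size in (64, 1518):
        print(f"{label}, {size}-byte frames: "
              f"{max_frames_per_sec(link, size):,.0f} frames/sec max")
```

With 64-byte frames a 100 Mbit/s link tops out near 148.8 kpps, so, assuming the tulip ports were 100 Mbit, the roughly 140kpps reported on a single port is close to wire speed.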
> > -Ralph > -- Ben Greear President of Candela Technologies Inc http://www.candelatech.com ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear From Robert.Olsson@data.slu.se Tue Mar 18 01:55:41 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 18 Mar 2003 01:55:47 -0800 (PST) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2I9sxq9008613 for ; Tue, 18 Mar 2003 01:55:40 -0800 Received: (from robert@localhost) by robur.slu.se (8.9.3/8.9.3) id KAA12892; Tue, 18 Mar 2003 10:54:48 +0100 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <15990.60648.230534.852040@robur.slu.se> Date: Tue, 18 Mar 2003 10:54:48 +0100 To: ralph+d@istop.com Cc: netdev@oss.sgi.com, Robert.Olsson@data.slu.se Subject: Re: Linux router performance (3c59x) (fwd) In-Reply-To: References: X-Mailer: VM 6.92 under Emacs 19.34.1 X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1961 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Content-Length: 2113 Lines: 52 Ralph Doncaster writes: > I haven't heard from Jamal or Dave, so perhaps someone from this list has > some wisdom to impart. > Currently the box in question is running a 67% system load with ~40kpps. > Here's the switch port stats that the 2 3c905cx cards are plugged into: Hello! First we do a lot of testing with routing path but have no experience with the hardware you have 3c59x or duron. In general it seems hard to extrapolate performance X1 % CPU at X2 pps. You don't see CPU used in IRQ context and not in some of softIRQ's. I think a better way for this tests is to input "overload" so your system gets saturated. You get the DoS test for free... After getting the throughput you have figure out what's your bottleneck CPU, PCI etc. 
> This is a box doing straight routing (no firewalling), with a full bgp4
> routing table (>100k routes). Kernel advanced router config option as
> well as fastroute was chosen.

The size of the routing table itself has no effect... The challenge comes when there is a high number of new "flows" per second, so garbage collection gets active. This can be seen with the rtstat program in the iproute2 package.

Currently there is no driver with FASTROUTE support in the kernel, so this will not do you any good now. But Linux routing (and packet overload) performance is still very good. You can see performance numbers as well as profiles for different setups at:

http://robur.slu.se/Linux/net-development/experiments/router-profile.html

As seen, packet memory allocation is one of the CPU consumers. We also see that slab is not fully per-CPU, so we are spinning in the SMP case. And as seen, UP gives about 345 kpps. skb recycling bumps this up to 507 kpps. The challenge for now is to get aggregated performance with SMP.

Also remember that networking, and routing in particular, is very much data transport: DMA transfers from and to memory, which have to interact with the CPU/driver arbitrating for the bus to manage these DMAs. Latencies and serializations are not obvious at this level.

Cheers.
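
The throughput figures above translate directly into a per-packet time budget, which makes the effect of skb recycling easier to picture. A small sketch, using only the two uniprocessor rates quoted (345 kpps and 507 kpps); the function name is hypothetical:

```python
def per_packet_budget_us(packets_per_sec):
    """Average time available per packet, in microseconds, at a given rate."""
    return 1e6 / packets_per_sec


baseline = per_packet_budget_us(345_000)   # UP forwarding rate quoted above
recycled = per_packet_budget_us(507_000)   # same, with skb recycling

print(f"baseline:      {baseline:.2f} us/packet")
print(f"skb recycling: {recycled:.2f} us/packet")
print(f"saved:         {baseline - recycled:.2f} us/packet "
      f"({100 * (507_000 / 345_000 - 1):.0f}% more packets/sec)")
```

In other words, the whole forwarding path has somewhere between two and three microseconds per packet at these rates, which is why allocator cost shows up so clearly in the profiles.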
--ro From erik@hensema.net Tue Mar 18 05:46:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 18 Mar 2003 05:46:18 -0800 (PST) Received: from dexter.hensema.net (cc78409-a.hnglo1.ov.home.nl [212.120.97.185]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2IDkAq9023728 for ; Tue, 18 Mar 2003 05:46:12 -0800 Received: from bender.home.hensema.net (bender.ipv6.hensema.net [IPv6:2001:888:10a1:0:202:44ff:fe69:60f5]) by dexter.hensema.net (8.12.3/8.12.3) with ESMTP id h2IDk7PA023144 for ; Tue, 18 Mar 2003 14:46:07 +0100 Received: from bender.home.hensema.net ([127.0.0.1]) by bender.home.hensema.net (8.12.3/8.12.3) with ESMTP id h2IDB5nB016389 for ; Tue, 18 Mar 2003 14:11:05 +0100 Received: (from erik@localhost) by bender.home.hensema.net (8.12.3/8.12.3/Submit) id h2IDB4iP016388 for netdev@oss.sgi.com; Tue, 18 Mar 2003 14:11:04 +0100 Date: Tue, 18 Mar 2003 14:11:04 +0100 From: Erik Hensema To: netdev@oss.sgi.com Subject: TCP/IPv6 broken in Linux 2.5.64? Message-ID: <20030318131104.GA16367@hensema.net> Reply-To: erik@hensema.net Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.27i X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1962 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: erik@hensema.net Precedence: bulk X-list: netdev Content-Length: 2916 Lines: 35 Hi, I'm trying to upgrade to Linux 2.5.x from 2.4.x. It seems to be working fine, except for IPv6: a TCP session can be established, but I can't send data. I can receive it though. 
This is a tcpdump session of me telnetting to the smtp port of my server, which is on local ethernet; native ipv6:

14:01:24.169321 bender.ipv6.hensema.net.32926 > dexter.ipv6.hensema.net.smtp: S 3441486378:3441486378(0) win 5760
14:01:24.169532 dexter.ipv6.hensema.net.smtp > bender.ipv6.hensema.net.32926: S 2141989668:2141989668(0) ack 3441486379 win 5712
14:01:24.170104 bender.ipv6.hensema.net.32926 > dexter.ipv6.hensema.net.smtp: . ack 1 win 5760
14:01:24.187583 dexter.ipv6.hensema.net.47911 > bender.ipv6.hensema.net.ident: S 2130642408:2130642408(0) win 5760
14:01:24.187612 bender.ipv6.hensema.net.ident > dexter.ipv6.hensema.net.47911: R 0:0(0) ack 2130642409 win 0
14:01:24.198246 dexter.ipv6.hensema.net.smtp > bender.ipv6.hensema.net.32926: P 1:87(86) ack 1 win 5712
14:01:24.198285 bender.ipv6.hensema.net.32926 > dexter.ipv6.hensema.net.smtp: . ack 87 win 5760
14:01:27.397607 bender.ipv6.hensema.net.32926 > dexter.ipv6.hensema.net.smtp: P 1:7(6) ack 87 win 5760
14:01:27.598549 bender.ipv6.hensema.net.32926 > dexter.ipv6.hensema.net.smtp: P 1:7(6) ack 87 win 5760
14:01:28.199460 bender.ipv6.hensema.net.32926 > dexter.ipv6.hensema.net.smtp: P 1:7(6) ack 87 win 5760
14:01:29.223402 bender.ipv6.hensema.net.32926 > dexter.ipv6.hensema.net.smtp: P 1:7(6) ack 87 win 5760
14:01:31.015042 bender.ipv6.hensema.net.32926 > dexter.ipv6.hensema.net.smtp: P 1:7(6) ack 87 win 5760
14:01:34.342488 bender.ipv6.hensema.net.32926 > dexter.ipv6.hensema.net.smtp: P 1:7(6) ack 87 win 5760
14:01:40.997436 bender.ipv6.hensema.net.32926 > dexter.ipv6.hensema.net.smtp: P 1:7(6) ack 87 win 5760
14:01:53.795498 bender.ipv6.hensema.net.32925 > dexter.ipv6.hensema.net.smtp: P 0:6(6) ack 1 win 5760
14:01:54.051386 bender.ipv6.hensema.net.32926 > dexter.ipv6.hensema.net.smtp: P 1:7(6) ack 87 win 5760

I do see the SMTP greeting. However, when I send a RSET, there's no response from the server.

IPv4 is working fine. icmpv6/ipv6 and udp/ipv6 too.

bender is running linux 2.5.64.
dexter is running linux 2.4.18, mostly the SuSE 8.0 version (that is: quite heavily patched). -- Erik Hensema (erik@hensema.net) From solt@dns.toxicfilms.tv Tue Mar 18 05:54:59 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 18 Mar 2003 05:55:01 -0800 (PST) Received: from dns.toxicfilms.tv (dns.toxicfilms.tv [150.254.37.24]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2IDsHq9024095 for ; Tue, 18 Mar 2003 05:54:58 -0800 Received: by dns.toxicfilms.tv (Postfix, from userid 1000) id 8EE2E1CD30; Tue, 18 Mar 2003 14:54:15 +0100 (CET) Received: from localhost (localhost [127.0.0.1]) by dns.toxicfilms.tv (Postfix) with ESMTP id 600AE80016; Tue, 18 Mar 2003 14:54:15 +0100 (CET) Date: Tue, 18 Mar 2003 14:54:15 +0100 (CET) From: Maciej Soltysiak To: Erik Hensema Cc: netdev@oss.sgi.com Subject: Re: TCP/IPv6 broken in Linux 2.5.64? In-Reply-To: <20030318131104.GA16367@hensema.net> Message-ID: References: <20030318131104.GA16367@hensema.net> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1963 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: solt@dns.toxicfilms.tv Precedence: bulk X-list: netdev Content-Length: 195 Lines: 8 Hi, > bender is running linux 2.5.64. dexter is running linux 2.4.18, Are you a cartoons fan ? Dexter's laboratory, Futurama ? That's a cool way to name hosts. 
I am inspired :) Regards, Maciej From ahu@outpost.ds9a.nl Tue Mar 18 08:10:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 18 Mar 2003 08:10:24 -0800 (PST) Received: from outpost.ds9a.nl (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2IG9Xq9027286 for ; Tue, 18 Mar 2003 08:10:14 -0800 Received: by outpost.ds9a.nl (Postfix, from userid 1000) id D7E894020; Tue, 18 Mar 2003 17:09:31 +0100 (CET) Date: Tue, 18 Mar 2003 17:09:31 +0100 From: bert hubert To: netdev@oss.sgi.com Cc: linux-kernel@vger.kernel.org Subject: interop success, ip6sec Linux 2.5.65 vs FreeBSD 4.7-STABLE Message-ID: <20030318160931.GA9529@outpost.ds9a.nl> Mail-Followup-To: bert hubert , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1964 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ahu@ds9a.nl Precedence: bulk X-list: netdev Content-Length: 698 Lines: 20 Thanks to Niels Bakker I can report that stock Linux 2.5.65 can talk IP6SEC with FreeBSD 4.7-STABLE (KAME 20010528/FreeBSD). It worked on the first try. This using ipsec-tools-0.2.2 and manual keying. Racoon is reported not to listen on IPv6 sockets yet, but we didn't try this. The configuration used is that described in http://lartc.org/howto/lartc.ipsec.html 'Intro with Manual Keying', with IPv4 addresses replaced by IPv6 addresses. Thanks for making this possible! 
Regards, Bert -- http://www.PowerDNS.com Open source, database driven DNS Software http://lartc.org Linux Advanced Routing & Traffic Control HOWTO http://netherlabs.nl Consulting From ahu@outpost.ds9a.nl Tue Mar 18 08:26:15 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 18 Mar 2003 08:26:18 -0800 (PST) Received: from outpost.ds9a.nl (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2IGPXq9027668 for ; Tue, 18 Mar 2003 08:26:14 -0800 Received: by outpost.ds9a.nl (Postfix, from userid 1000) id C8AE740F7; Tue, 18 Mar 2003 17:25:32 +0100 (CET) Date: Tue, 18 Mar 2003 17:25:32 +0100 From: bert hubert To: Erik Hensema Cc: netdev@oss.sgi.com Subject: Re: TCP/IPv6 broken in Linux 2.5.64? Message-ID: <20030318162532.GA9705@outpost.ds9a.nl> Mail-Followup-To: bert hubert , Erik Hensema , netdev@oss.sgi.com References: <20030318131104.GA16367@hensema.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030318131104.GA16367@hensema.net> User-Agent: Mutt/1.3.28i X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1965 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ahu@ds9a.nl Precedence: bulk X-list: netdev Content-Length: 1579 Lines: 51 On Tue, Mar 18, 2003 at 02:11:04PM +0100, Erik Hensema wrote: > Hi, > > I'm trying to upgrade to Linux 2.5.x from 2.4.x. It seems to be working > fine, except for IPv6: a TCP session can be established, but I can't send > data. I can receive it though. I can confirm this. 2.5.65 can connect to other hosts out there, like ipv6 irc servers, or an IPv6 zonetransfer. However, when I try to ssh from 2.5.65 to another 2.5.65, nothing happens. 2.4.18 can also ssh to 2.5.65. 
So we have:

          to:   2.4.18  2.5.65
from:
2.4.18          OK      OK
2.5.65          OK      ERROR

tcpdump from 2.5.65 to 2.5.65:

25.987220 hostA.33180 > hostB.22: S 2721590261:2721590261(0) win 5760
25.987490 hostB.22 > hostA.33180: S 2377513523:2377513523(0) ack 2721590262 win 5712
25.987622 hostA.33180 > hostB.22: . ack 1 win 5760
25.993273 hostB.22 > hostA.33180: P 1:41(40) ack 1 win 5712
26.193443 hostB.22 > hostA.33180: P 1:41(40) ack 1 win 5712
26.757236 hostB.22 > hostA.33180: P 1:41(40) ack 1 win 5712

The originating host does not ACK the received data, it appears.

No IPSEC is involved with this.

Regards,

bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software
http://lartc.org             Linux Advanced Routing & Traffic Control HOWTO
http://netherlabs.nl         Consulting

From ahu@outpost.ds9a.nl Tue Mar 18 08:51:27 2003
Received: with ECARTIS (v1.0.0; list netdev); Tue, 18 Mar 2003 08:51:32 -0800 (PST)
Received: from outpost.ds9a.nl (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2IGpPq9028216 for ; Tue, 18 Mar 2003 08:51:26 -0800
Received: by outpost.ds9a.nl (Postfix, from userid 1000) id 80DA044D6; Tue, 18 Mar 2003 17:51:24 +0100 (CET)
Date: Tue, 18 Mar 2003 17:51:24 +0100
From: bert hubert
To: Erik Hensema , netdev@oss.sgi.com
Subject: Re: TCP/IPv6 broken in Linux 2.5.64?
Message-ID: <20030318165124.GA10127@outpost.ds9a.nl>
Mail-Followup-To: bert hubert , Erik Hensema , netdev@oss.sgi.com
References: <20030318131104.GA16367@hensema.net> <20030318162532.GA9705@outpost.ds9a.nl>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20030318162532.GA9705@outpost.ds9a.nl>
User-Agent: Mutt/1.3.28i
X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp)
X-archive-position: 1966
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: ahu@ds9a.nl
Precedence: bulk
X-list: netdev
Content-Length: 2564
Lines: 65

On Tue, Mar 18, 2003 at 05:25:32PM +0100, bert hubert wrote:
> So we have:
>
> to:      2.4.18  2.5.65
> from:
> 2.4.18   OK      OK
> 2.5.65   OK      ERROR

Ok, here is a counterexample of 2.5.65 happily talking to 2.5.65, so it sometimes does work:

33.933107 hostC.33255 > hostB.22: SWE 3663286543:3663286543(0) win 5680
33.933203 hostB.22 > hostC.33255: S 3554639364:3554639364(0) ack 3663286544 win 5712
33.999407 hostC.33255 > hostB.22: . ack 1 win 5680
34.007239 hostB.22 > hostC.33255: P 1:41(40) ack 1 win 5712
34.072108 hostC.33255 > hostB.22: . ack 41 win 5680
34.091633 hostC.33255 > hostB.22: P 1:40(39) ack 41 win 5680
34.097058 hostB.22 > hostC.33255: . ack 40 win 5712

Here is a Mac OS X laptop, hostD, trying and failing to talk to hostB, which runs 2.5.65:

16.829237 hostD.56023 > hostB.22: S 3755233012:3755233012(0) win 32768
16.829482 hostB.22 > hostD.56023: S 3776737964:3776737964(0) ack 3755233013 win 5712
17.296953 hostD.56023 > hostB.22: . ack 1 win 32844
17.301934 hostB.22 > hostD.56023: P 1:41(40) ack 1 win 5712
18.953105 hostB.22 > hostD.56023: P 1:41(40) ack 1 win 5712
21.768126 hostB.22 > hostD.56023: P 1:41(40) ack 1 win 5712
27.398163 hostB.22 > hostD.56023: P 1:41(40) ack 1 win 5712

Closer inspection shows:

17:47:39.206188 HostB.22 > HostD.56030: P [bad tcp cksum 407f!] 1:41(40) ack 1 win 5712 (len 72, hlim 64)

Note the bad checksum! It appears hostB is the culprit here, the one constant factor in the entire story. HostB is a Pentium PRO, the other machines aren't. Perhaps this might be it?

This dump was run on hostB, so no chance of bad media there. Let me know if I can do more research - icmp6 works just fine.

Regards,

bert

-- 
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO
http://netherlabs.nl Consulting

From larslan@merete.zapto.org Tue Mar 18 10:29:19 2003
Received: with ECARTIS (v1.0.0; list netdev); Tue, 18 Mar 2003 10:29:23 -0800 (PST)
Received: from merete.balder.no (197.80-202-160.nextgentel.com [80.202.160.197]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2IITGq9002055 for ; Tue, 18 Mar 2003 10:29:18 -0800
Received: from localhost (larslan@localhost) by merete.balder.no (8.11.6/8.11.6) with ESMTP id h2IIM8w14281 for ; Tue, 18 Mar 2003 19:22:09 +0100
Date: Tue, 18 Mar 2003 19:22:08 +0100 (CET)
From: Lars Landmark
X-X-Sender: larslan@merete.balder.no
To: netdev@oss.sgi.com
Subject: class/qdisc question
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp)
X-archive-position: 1967
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: larslan@merete.zapto.org
Precedence: bulk
X-list: netdev
Content-Length: 1664
Lines: 56

Hi,

I am trying to write my own class-based queue, but as usual some problems remain unresolved.

I have managed to send packets through my queue. This works as long as I do not attach a class or filters. If I do try to attach a class or a filter, my computer hangs. I cannot read any messages or do anything else; my only choice is to push the power button to reboot.

My "queue" is compiled as a module, and if I run "insmod" it is loaded into the kernel. This operation does not report any error.

[root@lars larslan]# /sbin/insmod sch_kll
Using /lib/modules/2.4.20/kernel/net/sched/sch_kll.o

/sbin/lsmod reports:
Module                  Size  Used by    Not tainted
sch_htb                21088   1
sch_kll                 9608   0  (autoclean) (unused)
3c59x                  28520   2

When I now configure this module with my patched tc
************
root@lars iproute2.lars]# ./tc/tc qdisc add dev eth0 root handle 1: kll default 5
************
the output from /sbin/lsmod does not change. It still says used by 0 and (unused).

I have added some output to every procedure, and dmesg reports that these procedures are called:
*************
[root@lars iproute2.lars]# dmesg
KLL: inne i classify?
KLL: inne i dequeue?
KLL: inne i dequeue?
....
*************
So my question is: how can this happen? I thought that as soon as I configure the module, its usage count would be incremented. Is it possible that my computer crashes when I attach a filter because the kernel does not know the kll module is in use?

I would be very happy if someone could tell me what I have missed.
Any suggestion is appreciated, Thanks in advance Lars Student From mk@karaba.org Tue Mar 18 10:32:30 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 18 Mar 2003 10:32:34 -0800 (PST) Received: from zanzibar.karaba.org (karaba.org [218.219.152.88] (may be forged)) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2IIWSq9002432 for ; Tue, 18 Mar 2003 10:32:29 -0800 Received: from [3ffe:501:1057:710::53] (helo=hyakusiki.karaba.org) by zanzibar.karaba.org with esmtp (Exim 3.35 #1 (Debian)) id 18vLsb-0005QQ-00; Wed, 19 Mar 2003 03:32:05 +0900 Date: Tue, 18 Mar 2003 10:32:27 -0800 Message-ID: <87of48h6f8.wl@karaba.org> From: Mitsuru KANDA / =?ISO-2022-JP?B?GyRCP0BFRBsoQiAbJEI9PBsoQg==?= To: davem@redhat.com, kuznet@ms2.inr.ac.ru Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com, usagi-core@linux-ipv6.org Subject: [PATCH] IPv6 Extension headers (Re: [PATCH] IPv6 IPsec support) In-Reply-To: <20030305.204348.130225511.davem@redhat.com> References: <20030305233025.784feb00.kazunori@miyazawa.org> <20030305.152530.70806720.davem@redhat.com> <20030306093219.1a702868.kazunori@miyazawa.org> <20030305.204348.130225511.davem@redhat.com> MIME-Version: 1.0 (generated by SEMI 1.14.4 - "Hosorogi") Content-Type: text/plain; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1968 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mk@karaba.org Precedence: bulk X-list: netdev Content-Length: 21766 Lines: 789 Hello, At Wed, 05 Mar 2003 20:43:48 -0800 (PST), "David S. Miller" wrote: > > From: Kazunori Miyazawa > Date: Thu, 6 Mar 2003 09:32:19 +0900 > > - Extension Header Processing on inbound: > As a result of IPv6 IPsec support, Extension Header processing is devided > into ipv6_parse_exthdrs and ipproto->handler. I think it is better to merge > other Extension Header handling into ipproto->handler. > > Ok. 
This patch merges inbound IPv6 extension header processing parts into inet6_protocols{} like a IPv6 AH/ESP headers. As a result of this patch, I removed destopt parsing part in xfrm6_rcv() and removed ipv6_parse_exthdrs(). Could you check this patch? (This patch is against 2.5.65.) Best Regards, -mk Index: include/net/ipv6.h =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/include/net/ipv6.h,v retrieving revision 1.1.1.4 diff -u -r1.1.1.4 ipv6.h --- include/net/ipv6.h 9 Jan 2003 11:14:19 -0000 1.1.1.4 +++ include/net/ipv6.h 18 Mar 2003 05:11:39 -0000 @@ -203,11 +203,7 @@ extern int ip6_call_ra_chain(struct sk_buff *skb, int sel); -extern int ipv6_reassembly(struct sk_buff **skb, int); - extern int ipv6_parse_hopopts(struct sk_buff *skb, int); - -extern int ipv6_parse_exthdrs(struct sk_buff **skb, int); extern struct ipv6_txoptions * ipv6_dup_options(struct sock *sk, struct ipv6_txoptions *opt); Index: include/net/protocol.h =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/include/net/protocol.h,v retrieving revision 1.1.1.3 diff -u -r1.1.1.3 protocol.h --- include/net/protocol.h 11 Nov 2002 04:08:20 -0000 1.1.1.3 +++ include/net/protocol.h 18 Mar 2003 05:11:39 -0000 @@ -44,7 +44,7 @@ #if defined(CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) struct inet6_protocol { - int (*handler)(struct sk_buff *skb); + int (*handler)(struct sk_buff **skbp); void (*err_handler)(struct sk_buff *skb, struct inet6_skb_parm *opt, Index: include/net/transp_v6.h =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/include/net/transp_v6.h,v retrieving revision 1.1.1.1 diff -u -r1.1.1.1 transp_v6.h --- include/net/transp_v6.h 7 Oct 2002 10:22:46 -0000 1.1.1.1 +++ include/net/transp_v6.h 18 Mar 2003 05:11:39 -0000 @@ -15,6 +15,14 @@ struct flowi; +/* extention headers */ +extern void 
ipv6_hopopts_init(void); +extern void ipv6_rthdr_init(void); +extern void ipv6_frag_init(void); +extern void ipv6_nodata_init(void); +extern void ipv6_destopt_init(void); + +/* transport protocols */ extern void rawv6_init(void); extern void udpv6_init(void); extern void tcpv6_init(void); Index: include/net/xfrm.h =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/include/net/xfrm.h,v retrieving revision 1.1.1.8 diff -u -r1.1.1.8 xfrm.h --- include/net/xfrm.h 13 Mar 2003 17:29:53 -0000 1.1.1.8 +++ include/net/xfrm.h 18 Mar 2003 05:11:39 -0000 @@ -415,7 +415,7 @@ extern void xfrm_replay_advance(struct xfrm_state *x, u32 seq); extern int xfrm_check_selectors(struct xfrm_state **x, int n, struct flowi *fl); extern int xfrm4_rcv(struct sk_buff *skb); -extern int xfrm6_rcv(struct sk_buff *skb); +extern int xfrm6_rcv(struct sk_buff **pskb); extern int xfrm6_clear_mutable_options(struct sk_buff *skb, u16 *nh_offset, int dir); extern int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen); Index: net/ipv4/xfrm_input.c =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/net/ipv4/xfrm_input.c,v retrieving revision 1.1.1.4 diff -u -r1.1.1.4 xfrm_input.c --- net/ipv4/xfrm_input.c 13 Mar 2003 17:29:03 -0000 1.1.1.4 +++ net/ipv4/xfrm_input.c 18 Mar 2003 05:11:39 -0000 @@ -311,8 +311,9 @@ return nexthdr; } -int xfrm6_rcv(struct sk_buff *skb) +int xfrm6_rcv(struct sk_buff **pskb) { + struct sk_buff *skb = *pskb; int err; u32 spi, seq; struct xfrm_state *xfrm_vec[XFRM_MAX_DEPTH]; @@ -325,12 +326,8 @@ u16 nh_offset = 0; u8 nexthdr = 0; - if (hdr->nexthdr == IPPROTO_AH || hdr->nexthdr == IPPROTO_ESP) { - nh_offset = ((unsigned char*)&skb->nh.ipv6h->nexthdr) - skb->nh.raw; - hdr_len = sizeof(struct ipv6hdr); - } else { - hdr_len = skb->h.raw - skb->nh.raw; - } + nh_offset = ((unsigned char*)&skb->nh.ipv6h->nexthdr) - skb->nh.raw; + hdr_len 
= sizeof(struct ipv6hdr); tmp_hdr = kmalloc(hdr_len, GFP_ATOMIC); if (!tmp_hdr) @@ -378,18 +375,6 @@ xfrm_vec[xfrm_nr++] = x; iph = skb->nh.ipv6h; /* ??? */ - - if (nexthdr == NEXTHDR_DEST) { - if (!pskb_may_pull(skb, (skb->h.raw-skb->data)+8) || - !pskb_may_pull(skb, (skb->h.raw-skb->data)+((skb->h.raw[1]+1)<<3))) { - err = -EINVAL; - goto drop; - } - nexthdr = skb->h.raw[0]; - nh_offset = skb->h.raw - skb->nh.raw; - skb_pull(skb, (skb->h.raw[1]+1)<<3); - skb->h.raw = skb->data; - } if (x->props.mode) { /* XXX */ if (iph->nexthdr != IPPROTO_IPV6) Index: net/ipv6/af_inet6.c =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/net/ipv6/af_inet6.c,v retrieving revision 1.1.1.7 diff -u -r1.1.1.7 af_inet6.c --- net/ipv6/af_inet6.c 25 Feb 2003 05:33:26 -0000 1.1.1.7 +++ net/ipv6/af_inet6.c 18 Mar 2003 05:11:40 -0000 @@ -793,6 +793,13 @@ addrconf_init(); sit_init(); + /* Init v6 extention headers. */ + ipv6_hopopts_init(); + ipv6_rthdr_init(); + ipv6_frag_init(); + ipv6_nodata_init(); + ipv6_destopt_init(); + /* Init v6 transport protocols. */ udpv6_init(); tcpv6_init(); Index: net/ipv6/exthdrs.c =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/net/ipv6/exthdrs.c,v retrieving revision 1.1.1.3 diff -u -r1.1.1.3 exthdrs.c --- net/ipv6/exthdrs.c 20 Feb 2003 08:34:32 -0000 1.1.1.3 +++ net/ipv6/exthdrs.c 18 Mar 2003 05:11:40 -0000 @@ -18,6 +18,9 @@ /* Changes: * yoshfuji : ensure not to overrun while parsing * tlv options. + * Mitsuru KANDA @USAGI : Remove ipv6_parse_exthdrs(). + * : Register inbound extention header + * : handlers as inet6_protocol{}. */ #include @@ -44,20 +47,6 @@ #include /* - * Parsing inbound headers. - * - * Parsing function "func" returns offset wrt skb->nh of the place, - * where next nexthdr value is stored or NULL, if parsing - * failed. It should also update skb->h tp point at the next header. 
- */ - -struct hdrtype_proc -{ - int type; - int (*func) (struct sk_buff **, int offset); -}; - -/* * Parsing tlv encoded headers. * * Parsing function "func" returns 1, if parsing succeed @@ -164,49 +153,77 @@ {-1, NULL} }; -static int ipv6_dest_opt(struct sk_buff **skb_ptr, int nhoff) +int ipv6_destopt_rcv(struct sk_buff **skbp) { - struct sk_buff *skb=*skb_ptr; + struct sk_buff *skb = *skbp; struct inet6_skb_parm *opt = (struct inet6_skb_parm *)skb->cb; + u8 nexthdr = 0; if (!pskb_may_pull(skb, (skb->h.raw-skb->data)+8) || !pskb_may_pull(skb, (skb->h.raw-skb->data)+((skb->h.raw[1]+1)<<3))) { kfree_skb(skb); - return -1; + return 0; } + nexthdr = ((struct ipv6_destopt_hdr *)skb->h.raw)->nexthdr; + opt->dst1 = skb->h.raw - skb->nh.raw; if (ip6_parse_tlv(tlvprocdestopt_lst, skb)) { skb->h.raw += ((skb->h.raw[1]+1)<<3); - return opt->dst1; + return -nexthdr; } + + return 0; +} - return -1; +static struct inet6_protocol destopt_protocol = +{ + .handler = ipv6_destopt_rcv, +}; + +void __init ipv6_destopt_init(void) +{ + if (inet6_add_protocol(&destopt_protocol, IPPROTO_DSTOPTS) < 0) + printk(KERN_ERR "ipv6_destopt_init: Could not register protocol\n"); } /******************************** NONE header. No data in packet. ********************************/ -static int ipv6_nodata(struct sk_buff **skb_ptr, int nhoff) +int ipv6_nodata_rcv(struct sk_buff **skbp) { - kfree_skb(*skb_ptr); - return -1; + struct sk_buff *skb = *skbp; + + kfree_skb(skb); + return 0; +} + +static struct inet6_protocol nodata_protocol = +{ + .handler = ipv6_nodata_rcv, +}; + +void __init ipv6_nodata_init(void) +{ + if (inet6_add_protocol(&nodata_protocol, IPPROTO_NONE) < 0) + printk(KERN_ERR "ipv6_nodata_init: Could not register protocol\n"); } /******************************** Routing header. 
********************************/ -static int ipv6_routing_header(struct sk_buff **skb_ptr, int nhoff) +int ipv6_rthdr_rcv(struct sk_buff **skbp) { - struct sk_buff *skb = *skb_ptr; + struct sk_buff *skb = *skbp; struct inet6_skb_parm *opt = (struct inet6_skb_parm *)skb->cb; struct in6_addr *addr; struct in6_addr daddr; int addr_type; int n, i; + u8 nexthdr = 0; struct ipv6_rt_hdr *hdr; struct rt0_hdr *rthdr; @@ -215,15 +232,16 @@ !pskb_may_pull(skb, (skb->h.raw-skb->data)+((skb->h.raw[1]+1)<<3))) { IP6_INC_STATS_BH(Ip6InHdrErrors); kfree_skb(skb); - return -1; + return 0; } hdr = (struct ipv6_rt_hdr *) skb->h.raw; + nexthdr = hdr->nexthdr; if ((ipv6_addr_type(&skb->nh.ipv6h->daddr)&IPV6_ADDR_MULTICAST) || skb->pkt_type != PACKET_HOST) { kfree_skb(skb); - return -1; + return 0; } looped_back: @@ -232,24 +250,24 @@ skb->h.raw += (hdr->hdrlen + 1) << 3; opt->dst0 = opt->dst1; opt->dst1 = 0; - return (&hdr->nexthdr) - skb->nh.raw; + return -nexthdr; } if (hdr->type != IPV6_SRCRT_TYPE_0 || (hdr->hdrlen & 0x01)) { icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, hdr->type != IPV6_SRCRT_TYPE_0 ? 2 : 1); - return -1; + return 0; } /* * This is the routing header forwarding algorithm from - * RFC 1883, page 17. + * RFC 2460, page 16. */ n = hdr->hdrlen >> 1; if (hdr->segments_left > n) { icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, (&hdr->segments_left) - skb->nh.raw); - return -1; + return 0; } /* We are about to mangle packet header. Be careful! 
@@ -259,8 +277,8 @@ struct sk_buff *skb2 = skb_copy(skb, GFP_ATOMIC); kfree_skb(skb); if (skb2 == NULL) - return -1; - *skb_ptr = skb = skb2; + return 0; + *skbp = skb = skb2; opt = (struct inet6_skb_parm *)skb2->cb; hdr = (struct ipv6_rt_hdr *) skb2->h.raw; } @@ -278,7 +296,7 @@ if (addr_type&IPV6_ADDR_MULTICAST) { kfree_skb(skb); - return -1; + return 0; } ipv6_addr_copy(&daddr, addr); @@ -289,23 +307,34 @@ ip6_route_input(skb); if (skb->dst->error) { dst_input(skb); - return -1; + return 0; } if (skb->dst->dev->flags&IFF_LOOPBACK) { if (skb->nh.ipv6h->hop_limit <= 1) { icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT, 0, skb->dev); kfree_skb(skb); - return -1; + return 0; } skb->nh.ipv6h->hop_limit--; goto looped_back; } dst_input(skb); - return -1; + return 0; } +static struct inet6_protocol rthdr_protocol = +{ + .handler = ipv6_rthdr_rcv, +}; + +void __init ipv6_rthdr_init(void) +{ + if (inet6_add_protocol(&rthdr_protocol, IPPROTO_ROUTING) < 0) + printk(KERN_ERR "ipv6_rthdr_init: Could not register protocol\n"); +}; + /* This function inverts received rthdr. NOTE: specs allow to make it automatically only if @@ -371,97 +400,6 @@ return opt; } -/******************************** - AUTH header. - ********************************/ - -/* - rfc1826 said, that if a host does not implement AUTH header - it MAY ignore it. We use this hole 8) - - Actually, now we can implement OSPFv6 without kernel IPsec. - Authentication for poors may be done in user space with the same success. - - Yes, it means, that we allow application to send/receive - raw authentication header. Apparently, we suppose, that it knows - what it does and calculates authentication data correctly. - Certainly, it is possible only for udp and raw sockets, but not for tcp. - - AUTH header has 4byte granular length, which kills all the idea - behind AUTOMATIC 64bit alignment of IPv6. Now we will lose - cpu ticks, checking that sender did not something stupid - and opt->hdrlen is even. Shit! 
--ANK (980730) - */ - -static int ipv6_auth_hdr(struct sk_buff **skb_ptr, int nhoff) -{ - struct sk_buff *skb=*skb_ptr; - struct inet6_skb_parm *opt = (struct inet6_skb_parm *)skb->cb; - int len; - - if (!pskb_may_pull(skb, (skb->h.raw-skb->data)+8)) - goto fail; - - /* - * RFC2402 2.2 Payload Length - * The 8-bit field specifies the length of AH in 32-bit words - * (4-byte units), minus "2". - * -- Noriaki Takamiya @USAGI Project - */ - len = (skb->h.raw[1]+2)<<2; - - if (len&7) - goto fail; - - if (!pskb_may_pull(skb, (skb->h.raw-skb->data)+len)) - goto fail; - - opt->auth = skb->h.raw - skb->nh.raw; - skb->h.raw += len; - return opt->auth; - -fail: - kfree_skb(skb); - return -1; -} - -/* This list MUST NOT contain entry for NEXTHDR_HOP. - It is parsed immediately after packet received - and if it occurs somewhere in another place we must - generate error. - */ - -static struct hdrtype_proc hdrproc_lst[] = { - {NEXTHDR_FRAGMENT, ipv6_reassembly}, - {NEXTHDR_ROUTING, ipv6_routing_header}, - {NEXTHDR_DEST, ipv6_dest_opt}, - {NEXTHDR_NONE, ipv6_nodata}, - {NEXTHDR_AUTH, ipv6_auth_hdr}, - /* - {NEXTHDR_ESP, ipv6_esp_hdr}, - */ - {-1, NULL} -}; - -int ipv6_parse_exthdrs(struct sk_buff **skb_in, int nhoff) -{ - struct hdrtype_proc *hdrt; - u8 nexthdr = (*skb_in)->nh.raw[nhoff]; - -restart: - for (hdrt=hdrproc_lst; hdrt->type >= 0; hdrt++) { - if (hdrt->type == nexthdr) { - if ((nhoff = hdrt->func(skb_in, nhoff)) >= 0) { - nexthdr = (*skb_in)->nh.raw[nhoff]; - goto restart; - } - return -1; - } - } - return nhoff; -} - - /********************************** Hop-by-hop options. **********************************/ @@ -530,6 +468,34 @@ if (ip6_parse_tlv(tlvprochopopt_lst, skb)) return sizeof(struct ipv6hdr); return -1; +} + +/* This is fake. We have already parsed hopopts in ipv6_rcv(). 
-mk */ +int ipv6_hopopts_rcv(struct sk_buff **skbp) +{ + struct sk_buff *skb = *skbp; + u8 nexthdr = 0; + + if (!pskb_may_pull(skb, (skb->h.raw-skb->data)+8) || + !pskb_may_pull(skb, (skb->h.raw-skb->data)+((skb->h.raw[1]+1)<<3))) { + kfree_skb(skb); + return 0; + } + nexthdr = ((struct ipv6_hopopt_hdr *)skb->h.raw)->nexthdr; + skb->h.raw += (skb->h.raw[1]+1)<<3; + + return -nexthdr; +} + +static struct inet6_protocol hopopts_protocol = +{ + .handler = ipv6_hopopts_rcv, +}; + +void __init ipv6_hopopts_init(void) +{ + if (inet6_add_protocol(&hopopts_protocol, IPPROTO_HOPOPTS) < 0) + printk(KERN_ERR "ipv6_hopopts_init: Could not register protocol\n"); } /* Index: net/ipv6/icmp.c =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/net/ipv6/icmp.c,v retrieving revision 1.1.1.7 diff -u -r1.1.1.7 icmp.c --- net/ipv6/icmp.c 13 Mar 2003 17:29:06 -0000 1.1.1.7 +++ net/ipv6/icmp.c 18 Mar 2003 05:11:40 -0000 @@ -74,7 +74,7 @@ static struct socket *__icmpv6_socket[NR_CPUS]; #define icmpv6_socket __icmpv6_socket[smp_processor_id()] -static int icmpv6_rcv(struct sk_buff *skb); +static int icmpv6_rcv(struct sk_buff **pskb); static struct inet6_protocol icmpv6_protocol = { .handler = icmpv6_rcv, @@ -458,8 +458,9 @@ * Handle icmp messages */ -static int icmpv6_rcv(struct sk_buff *skb) +static int icmpv6_rcv(struct sk_buff **pskb) { + struct sk_buff *skb = *pskb; struct net_device *dev = skb->dev; struct in6_addr *saddr, *daddr; struct ipv6hdr *orig_hdr; Index: net/ipv6/ip6_input.c =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/net/ipv6/ip6_input.c,v retrieving revision 1.1.1.6 diff -u -r1.1.1.6 ip6_input.c --- net/ipv6/ip6_input.c 13 Mar 2003 17:29:06 -0000 1.1.1.6 +++ net/ipv6/ip6_input.c 18 Mar 2003 05:11:40 -0000 @@ -15,6 +15,10 @@ * as published by the Free Software Foundation; either version * 2 of the License, or (at your option) any later 
version. */ +/* Changes + * + * Mitsuru KANDA @USAGI : Remove ipv6_parse_exthdrs(). + */ #include #include @@ -127,38 +131,11 @@ struct inet6_protocol *ipprot; struct sock *raw_sk; int nhoff; - int nexthdr; + int nexthdr = hdr->nexthdr; u8 hash; skb->h.raw = skb->nh.raw + sizeof(struct ipv6hdr); - /* - * Parse extension headers - */ - - nexthdr = hdr->nexthdr; - nhoff = offsetof(struct ipv6hdr, nexthdr); - - /* Skip hop-by-hop options, they are already parsed. */ - if (nexthdr == NEXTHDR_HOP) { - nhoff = sizeof(struct ipv6hdr); - nexthdr = skb->h.raw[0]; - skb->h.raw += (skb->h.raw[1]+1)<<3; - } - - /* This check is sort of optimization. - It would be stupid to detect for optional headers, - which are missing with probability of 200% - */ - if (nexthdr != IPPROTO_TCP && nexthdr != IPPROTO_UDP && - nexthdr != NEXTHDR_AUTH && nexthdr != NEXTHDR_ESP) { - nhoff = ipv6_parse_exthdrs(&skb, nhoff); - if (nhoff < 0) - return 0; - nexthdr = skb->nh.raw[nhoff]; - hdr = skb->nh.ipv6h; - } - if (!pskb_pull(skb, skb->h.raw - skb->data)) goto discard; @@ -173,7 +150,7 @@ hash = nexthdr & (MAX_INET_PROTOS - 1); if ((ipprot = inet6_protos[hash]) != NULL) { - int ret = ipprot->handler(skb); + int ret = ipprot->handler(&skb); if (ret < 0) { nexthdr = -ret; goto resubmit; @@ -182,6 +159,7 @@ } else { if (!raw_sk) { IP6_INC_STATS_BH(Ip6InUnknownProtos); + nhoff = offsetof(struct ipv6hdr, nexthdr); icmpv6_param_prob(skb, ICMPV6_UNK_NEXTHDR, nhoff); } else { IP6_INC_STATS_BH(Ip6InDelivers); Index: net/ipv6/reassembly.c =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/net/ipv6/reassembly.c,v retrieving revision 1.1.1.4 diff -u -r1.1.1.4 reassembly.c --- net/ipv6/reassembly.c 20 Feb 2003 08:34:32 -0000 1.1.1.4 +++ net/ipv6/reassembly.c 18 Mar 2003 05:11:40 -0000 @@ -23,6 +23,7 @@ * Horst von Brand Add missing #include * Alexey Kuznetsov SMP races, threading, cleanup. * Patrick McHardy LRU queue of frag heads for evictor. 
+ * Mitsuru KANDA @USAGI Register inet6_protocol{}. */ #include #include @@ -525,6 +526,7 @@ int remove_fraghdr = 0; int payload_len; int nhoff; + u8 nexthdr = 0; fq_kill(fq); @@ -535,6 +537,8 @@ payload_len = (head->data - head->nh.raw) - sizeof(struct ipv6hdr) + fq->len; nhoff = head->h.raw - head->nh.raw; + nexthdr = ((struct frag_hdr*)head->h.raw)->nexthdr; + if (payload_len > 65535) { payload_len -= 8; if (payload_len > 65535) @@ -609,9 +613,13 @@ if (head->ip_summed == CHECKSUM_HW) head->csum = csum_partial(head->nh.raw, head->h.raw-head->nh.raw, head->csum); + if (!pskb_pull(head, head->h.raw - head->data)) { + goto out_fail; + } + IP6_INC_STATS_BH(Ip6ReasmOKs); fq->fragments = NULL; - return nhoff; + return nexthdr; out_oversize: if (net_ratelimit()) @@ -622,16 +630,18 @@ printk(KERN_DEBUG "ip6_frag_reasm: no memory for reassembly\n"); out_fail: IP6_INC_STATS_BH(Ip6ReasmFails); - return -1; + return 0; } -int ipv6_reassembly(struct sk_buff **skbp, int nhoff) +int ipv6_frag_rcv(struct sk_buff **skbp) { struct sk_buff *skb = *skbp; struct net_device *dev = skb->dev; struct frag_hdr *fhdr; struct frag_queue *fq; struct ipv6hdr *hdr; + int nhoff = skb->h.raw - skb->nh.raw; + u8 nexthdr = 0; hdr = skb->nh.ipv6h; @@ -640,15 +650,16 @@ /* Jumbo payload inhibits frag. 
header */ if (hdr->payload_len==0) { icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, skb->h.raw-skb->nh.raw); - return -1; + goto discard; } if (!pskb_may_pull(skb, (skb->h.raw-skb->data)+sizeof(struct frag_hdr))) { icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, skb->h.raw-skb->nh.raw); - return -1; + goto discard; } hdr = skb->nh.ipv6h; fhdr = (struct frag_hdr *)skb->h.raw; + nexthdr = fhdr->nexthdr; if (!(fhdr->frag_off & htons(0xFFF9))) { /* It is not a fragmented frame */ @@ -674,10 +685,22 @@ spin_unlock(&fq->lock); fq_put(fq); - return ret; + return -ret; } +discard: IP6_INC_STATS_BH(Ip6ReasmFails); kfree_skb(skb); - return -1; + return 0; +} + +static struct inet6_protocol frag_protocol = +{ + .handler = ipv6_frag_rcv, +}; + +void __init ipv6_frag_init(void) +{ + if (inet6_add_protocol(&frag_protocol, IPPROTO_FRAGMENT) < 0) + printk(KERN_ERR "ipv6_frag_init: Could not register protocol\n"); } Index: net/ipv6/tcp_ipv6.c =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/net/ipv6/tcp_ipv6.c,v retrieving revision 1.1.1.8 diff -u -r1.1.1.8 tcp_ipv6.c --- net/ipv6/tcp_ipv6.c 13 Mar 2003 17:29:06 -0000 1.1.1.8 +++ net/ipv6/tcp_ipv6.c 18 Mar 2003 05:11:40 -0000 @@ -1591,8 +1591,9 @@ return 0; } -static int tcp_v6_rcv(struct sk_buff *skb) +static int tcp_v6_rcv(struct sk_buff **pskb) { + struct sk_buff *skb = *pskb; struct tcphdr *th; struct sock *sk; int ret; Index: net/ipv6/udp.c =================================================================== RCS file: /cvsroot/usagi/usagi-backport/linux25/net/ipv6/udp.c,v retrieving revision 1.1.1.7 diff -u -r1.1.1.7 udp.c --- net/ipv6/udp.c 13 Mar 2003 17:29:06 -0000 1.1.1.7 +++ net/ipv6/udp.c 18 Mar 2003 05:11:40 -0000 @@ -641,8 +641,9 @@ read_unlock(&udp_hash_lock); } -static int udpv6_rcv(struct sk_buff *skb) +static int udpv6_rcv(struct sk_buff **pskb) { + struct sk_buff *skb = *pskb; struct sock *sk; struct udphdr *uh; struct net_device *dev = skb->dev; From 
jleu@nero.doit.wisc.edu Tue Mar 18 12:52:43 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 18 Mar 2003 12:52:51 -0800 (PST) Received: from nero.doit.wisc.edu (nero.doit.wisc.edu [128.104.17.130]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2IKq2q9012457 for ; Tue, 18 Mar 2003 12:52:43 -0800 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.11.6/8.11.6) id h2IMk0b26157; Tue, 18 Mar 2003 16:46:00 -0600 Date: Tue, 18 Mar 2003 16:45:59 -0600 From: "James R. Leu" To: Lars Landmark Cc: netdev@oss.sgi.com Subject: Re: class/qdisc question Message-ID: <20030318164559.A26154@mindspring.com> Reply-To: jleu@mindspring.com References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: ; from larslan@merete.zapto.org on Tue, Mar 18, 2003 at 07:22:08PM +0100 Organization: none X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1969 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jleu@mindspring.com Precedence: bulk X-list: netdev Content-Length: 2027 Lines: 65 I suggest moving your development to user-mode-linux, then you can attach a debugger and track down the location of the lockup. http://sf.net/projects/user-mode-linux/ On Tue, Mar 18, 2003 at 07:22:08PM +0100, Lars Landmark wrote: > HI; > > I am trying to write my own class based queue. But as usual some problems > seems not to be resolved. > > I have achieved to send package through my queue. This can be done if I > not attach class or filters. If I do try to attach class or filter, my > computer stops. I can not read any message, nor do anything. My only > choice is to push power button in order to reboot. > > My "queue" is compiled as module and if I do "insmod", > it is loaded in to kernel. This operation do not report any error. 
> > [root@lars larslan]# /sbin/insmod sch_kll > Using /lib/modules/2.4.20/kernel/net/sched/sch_kll.o > > > /sbin/lsmod report > Module Size Used by Not tainted > sch_htb 21088 1 > sch_kll 9608 0 (autoclean) (unused) > 3c59x 28520 2 > > > When I now configure this module width my patched tc file > ************ > root@lars iproute2.lars]# ./tc/tc qdisc add dev eth0 root handle 1: kll > default 5 > ************ > output from /sbin/lsmod do not change. It still says used by 0 and > (unused). > > I have written som output in every procedyre, and dmesg report that this > procedures are called > ************* > [root@lars iproute2.lars]# dmesg > KLL: inne i classify? > KLL: inne i dequeue? > KLL: inne i dequeue? > .... > ************* > So my question is, how can this happen? > I thought that at once I configure module, > modules usage count would be incremented??? > Is there any possibility, that when I attach a filter my computer crash > because kernel do not know kll-module is in use??? > > I would be very happy if some could tell me what I have been missed... > > Any suggestion is appreciated, > Thanks in advance > > Lars > Student > > -- James R. Leu From MAILER-DAEMON@oss.sgi.com Wed Mar 19 04:48:13 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 04:48:18 -0800 (PST) Received: from kastor.ds.pg.gda.pl (postfix@kastor.ds.pg.gda.pl [213.192.72.3]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2JClUq9032595 for ; Wed, 19 Mar 2003 04:48:12 -0800 Received: by kastor.ds.pg.gda.pl (Postfix, from userid 8) id D4A6B2EEB1; Wed, 19 Mar 2003 13:47:26 +0100 (CET) X-Scanned-By: Bylem tu. 
Amavis :) Received: from vger.kernel.org (vger.kernel.org [209.116.70.75]) by kastor.ds.pg.gda.pl (Postfix) with ESMTP id 469BE2EEAC for ; Wed, 19 Mar 2003 13:47:24 +0100 (CET) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id ; Wed, 19 Mar 2003 07:34:42 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id ; Wed, 19 Mar 2003 07:34:42 -0500 Received: from outpost.ds9a.nl ([213.244.168.210]:9357 "EHLO outpost.ds9a.nl") by vger.kernel.org with ESMTP id ; Wed, 19 Mar 2003 07:34:36 -0500 Received: by outpost.ds9a.nl (Postfix, from userid 1000) id B36C14508; Wed, 19 Mar 2003 13:45:33 +0100 (CET) Date: Wed, 19 Mar 2003 13:45:33 +0100 From: bert hubert To: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: [BUG] 2.5.65 ipv6 TCP checksum errors (capture attached) Message-ID: <20030319124533.GA14363@outpost.ds9a.nl> Mail-Followup-To: bert hubert , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="ew6BAiZeqk4r7MaW" Content-Disposition: inline User-Agent: Mutt/1.3.28i Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1970 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ahu@ds9a.nl Precedence: bulk X-list: netdev Content-Length: 3966 Lines: 71 --ew6BAiZeqk4r7MaW Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Interestingly, the initial ssh connection worked, the second one failed. Subsequent attempts fail too. This all over ipv6: hubert# tcpdump -r file -v -v 29.09 snapcount.33408 > hubert.ssh: S [tcp sum ok] 2737328594:2737328594(0) win 5760 (len 40, hlim 64) 29.09 hubert.ssh > snapcount.33408: S [tcp sum ok] 2399386333:2399386333(0) ack 2737328595 win 5712 (len 40, hlim 64) 29.09 snapcount.33408 > hubert.ssh: . 
[tcp sum ok] 1:1(0) ack 1 win 5760 (len 32, hlim 64) 29.10 hubert.ssh > snapcount.33408: P [bad tcp cksum 4f2!] 1:41(40) ack 1 win 5712 (len 72, hlim 64) 29.30 hubert.ssh > snapcount.33408: P [bad tcp cksum 3bf1!] 1:41(40) ack 1 win 5712 (len 72, hlim 64) 29.83 hubert.ssh > snapcount.33408: P [bad tcp cksum 23ef!] 1:41(40) ack 1 win 5712 (len 72, hlim 64) 30.86 hubert.ssh > snapcount.33408: P [bad tcp cksum 23eb!] 1:41(40) ack 1 win 5712 (len 72, hlim 64) Both hosts run 2.5.65. hubert.ipv6.ds9a.nl (publically routable, so you can try to ssh to me as long as I'm not asleep, the machine is next to my bed) is a pentium pro 200. Kernel was make mrpropered before compiling, virgin kernel. Capture attached. Regards, bert -- http://www.PowerDNS.com Open source, database driven DNS Software http://lartc.org Linux Advanced Routing & Traffic Control HOWTO http://netherlabs.nl Consulting --ew6BAiZeqk4r7MaW Content-Type: application/octet-stream Content-Disposition: attachment; filename=bad-csum Content-Transfer-Encoding: base64 1MOyoQIABAAAAAAAAAAAANwFAAABAAAAiWR4Pht3AQBeAAAAXgAAAAAIoRnw8ACgzMjyXIbd YAAAAAAoBkAgAQiIEDYAAAIIof/+GfDxIAEIiBA2AAACCKH//hnw8IKAABajKFHSAAAAAKAC FoACJAAAAgQFoAQCCAoAox+fAAAAAAEDAwCJZHg+wHcBAF4AAABeAAAAAKDMyPJcAAihGfDw ht1gAAAAACgGQCABCIgQNgAAAgih//4Z8PAgAQiIEDYAAAIIof/+GfDxABaCgI8Dut2jKFHT oBIWUOuhAAACBAWgBAIICgClzBoAox+fAQMDAIlkeD5KeAEAVgAAAFYAAAAACKEZ8PAAoMzI 8lyG3WAAAAAAIAZAIAEIiBA2AAACCKH//hnw8SABCIgQNgAAAgih//4Z8PCCgAAWoyhR048D ut6AEBaAGiIAAAEBCAoAox+gAKXMGolkeD5XigEAfgAAAH4AAAAAoMzI8lwACKEZ8PCG3WAA AAAASAZAIAEIiBA2AAACCKH//hnw8CABCIgQNgAAAgih//4Z8PEAFoKAjwO63qMoUdOAGBZQ aA0AAAEBCAoApcwfAKMfoFNTSC0xLjk5LU9wZW5TU0hfMy41cDEgRGViaWFuIDE6My41cDEt NQqJZHg+7pkEAH4AAAB+AAAAAKDMyPJcAAihGfDwht1gAAAAAEgGQCABCIgQNgAAAgih//4Z 8PAgAQiIEDYAAAIIof/+GfDxABaCgI8Dut6jKFHTgBgWUGgNAAABAQgKAKXM6ACjH6BTU0gt MS45OS1PcGVuU1NIXzMuNXAxIERlYmlhbiAxOjMuNXAxLTUKiWR4Pv/GDAB+AAAAfgAAAACg zMjyXAAIoRnw8IbdYAAAAABIBkAgAQiIEDYAAAIIof/+GfDwIAEIiBA2AAACCKH//hnw8QAW 
goCPA7reoyhR04AYFlBoDQAAAQEICgClzwAAox+gU1NILTEuOTktT3BlblNTSF8zLjVwMSBE ZWJpYW4gMTozLjVwMS01CopkeD4pJA0AfgAAAH4AAAAAoMzI8lwACKEZ8PCG3WAAAAAASAZA IAEIiBA2AAACCKH//hnw8CABCIgQNgAAAgih//4Z8PEAFoKAjwO63qMoUdOAGBZQaA0AAAEB CAoApdMAAKMfoFNTSC0xLjk5LU9wZW5TU0hfMy41cDEgRGViaWFuIDE6My41cDEtNQqMZHg+ 0/QJAH4AAAB+AAAAAKDMyPJcAAihGfDwht1gAAAAAEgGQCABCIgQNgAAAgih//4Z8PAgAQiI EDYAAAIIof/+GfDxABaCgI8Dut6jKFHTgBgWUGgNAAABAQgKAKXaAACjH6BTU0gtMS45OS1P cGVuU1NIXzMuNXAxIERlYmlhbiAxOjMuNXAxLTUKj2R4PkXxDgB+AAAAfgAAAACgzMjyXAAI oRnw8IbdYAAAAABIBkAgAQiIEDYAAAIIof/+GfDwIAEIiBA2AAACCKH//hnw8QAWgoCPA7re oyhR04AYFlBoDQAAAQEICgCl5wAAox+gU1NILTEuOTktT3BlblNTSF8zLjVwMSBEZWJpYW4g MTozLjVwMS01CpZkeD6mqAkAfgAAAH4AAAAAoMzI8lwACKEZ8PCG3WAAAAAASAZAIAEIiBA2 AAACCKH//hnw8CABCIgQNgAAAgih//4Z8PEAFoKAjwO63qMoUdOAGBZQaA0AAAEBCAoApgEA AKMfoFNTSC0xLjk5LU9wZW5TU0hfMy41cDEgRGViaWFuIDE6My41cDEtNQo= --ew6BAiZeqk4r7MaW-- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ From archie@precisionio.com Wed Mar 19 14:09:22 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 14:09:31 -0800 (PST) Received: from mailman.precisionio.com (www.precisionio.com [65.192.41.225]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2JM8eq9022769 for ; Wed, 19 Mar 2003 14:09:22 -0800 Received: from bubba.precisionio.com (bubba.precisionio.com [172.16.0.223]) by mailman.precisionio.com (8.12.6/8.12.6) with ESMTP id h2JM8YWV043716; Wed, 19 Mar 2003 14:08:34 -0800 (PST) (envelope-from archie@precisionio.com) Received: from bubba.precisionio.com (localhost [127.0.0.1]) by bubba.precisionio.com (8.12.7/8.12.7) with ESMTP id h2JM8YoX037111; Wed, 19 Mar 2003 14:08:34 -0800 (PST) (envelope-from archie@bubba.precisionio.com) Received: (from archie@localhost) by bubba.precisionio.com (8.12.7/8.12.7/Submit) id h2JM8Yx1037110; Wed, 19 Mar 2003 
14:08:34 -0800 (PST) From: Archie Cobbs Message-Id: <200303192208.h2JM8Yx1037110@bubba.precisionio.com> Subject: [PATCH] sk_buff's allocated from private pools To: netdev@oss.sgi.com Date: Wed, 19 Mar 2003 14:08:34 -0800 (PST) CC: Archie Cobbs X-Mailer: ELM [version 2.4ME+ PL82 (25)] MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1971 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: archie@precisionio.com Precedence: bulk X-list: netdev Content-Length: 5910 Lines: 202 Hello, I'm submitting this patch for inclusion in the Linux kernel if deemed generally useful. The purpose of this patch is to add a new function called alloc_skb_custom() (or whatever) that allows the data portion of an sk_buff to reside in any memory region, not just a region returned by kmalloc(). For example, if a networking device has a restriction on where receive buffers may reside, then the device driver can avoid copying every incoming packet if it is able to create an sk_buff that points to the receive buffer memory. Basically this amounts to adding a 'free_data' function pointer to the sk_buff structure. By default this points to kfree() but in general could point to anywhere. FYI FreeBSD has had equivalent functionality in its 'struct mbuf' for many years (I'm also a FreeBSD developer). Thanks for your review. 
Cheers, -Archie __________________________________________________________________________ Archie Cobbs * Precision I/O * http://www.precisionio.com Index: include/linux/skbuff.h =================================================================== RCS file: /home/cvs/linux-2.4.20/include/linux/skbuff.h,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -u -r1.1.1.1 -r1.2 --- include/linux/skbuff.h 3 Jan 2003 22:31:40 -0000 1.1.1.1 +++ include/linux/skbuff.h 15 Mar 2003 01:13:35 -0000 1.2 @@ -193,6 +193,7 @@ unsigned char *tail; /* Tail pointer */ unsigned char *end; /* End pointer */ + void (*free_data)(const void *); /* Free data buffer function */ void (*destructor)(struct sk_buff *); /* Destruct function */ #ifdef CONFIG_NETFILTER /* Can be used for communication between hooks. */ @@ -230,6 +231,8 @@ extern void __kfree_skb(struct sk_buff *skb); extern struct sk_buff * alloc_skb(unsigned int size, int priority); +extern struct sk_buff * alloc_skb_custom(unsigned int size, int priority, + void (*free_data)(const void *), u8 *data); extern void kfree_skbmem(struct sk_buff *skb); extern struct sk_buff * skb_clone(struct sk_buff *skb, int priority); extern struct sk_buff * skb_copy(const struct sk_buff *skb, int priority); Index: net/netsyms.c =================================================================== RCS file: /home/cvs/linux-2.4.20/net/netsyms.c,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -u -r1.1.1.1 -r1.2 --- net/netsyms.c 3 Jan 2003 22:31:42 -0000 1.1.1.1 +++ net/netsyms.c 15 Mar 2003 01:13:35 -0000 1.2 @@ -489,6 +489,7 @@ EXPORT_SYMBOL(eth_copy_and_sum); #endif EXPORT_SYMBOL(alloc_skb); +EXPORT_SYMBOL(alloc_skb_custom); EXPORT_SYMBOL(__kfree_skb); EXPORT_SYMBOL(skb_clone); EXPORT_SYMBOL(skb_copy); Index: net/core/skbuff.c =================================================================== RCS file: /home/cvs/linux-2.4.20/net/core/skbuff.c,v retrieving revision 1.1 retrieving revision 1.3 diff -u -r1.1 -r1.3 --- net/core/skbuff.c 
7 Jan 2003 00:35:25 -0000 1.1 +++ net/core/skbuff.c 18 Mar 2003 23:14:54 -0000 1.3 @@ -149,7 +149,7 @@ */ /** - * alloc_skb - allocate a network buffer + * alloc_skb - allocate a network buffer using kmalloc * @size: size to allocate * @gfp_mask: allocation mask * @@ -169,8 +169,46 @@ if (in_interrupt() && (gfp_mask & __GFP_WAIT)) { static int count = 0; if (++count < 5) { - printk(KERN_ERR "alloc_skb called nonatomically " - "from interrupt %p\n", NET_CALLER(size)); + printk(KERN_ERR "%s called nonatomically from " + "interrupt %p\n", "alloc_skb", NET_CALLER(size)); + BUG(); + } + gfp_mask &= ~__GFP_WAIT; + } + + /* Get the DATA. Size must match skb_add_mtu(). */ + size = SKB_DATA_ALIGN(size); + data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask); + if (data == NULL) + return NULL; + + /* Allocate the rest of the skb */ + if ((skb = alloc_skb_custom(size, gfp_mask, kfree, data)) == NULL) + kfree(data); + + /* Done */ + return skb; +} + +/** + * alloc_skb_custom - allocate a network buffer + * using the supplied data area + * + * This assumes that size is aligned via SKB_DATA_ALIGN(), and + * that 'data' points to size + sizeof(struct skb_shared_info) + * bytes. + */ + +struct sk_buff *alloc_skb_custom(unsigned int size, int gfp_mask, + void (*free_data)(const void *), u8 *data) +{ + struct sk_buff *skb; + + if (in_interrupt() && (gfp_mask & __GFP_WAIT)) { + static int count = 0; + if (++count < 5) { + printk(KERN_ERR "%s called nonatomically from " + "interrupt %p\n", "alloc_skb_custom", NET_CALLER(size)); BUG(); } gfp_mask &= ~__GFP_WAIT; @@ -184,11 +222,9 @@ goto nohead; } - /* Get the DATA. Size must match skb_add_mtu(). */ - size = SKB_DATA_ALIGN(size); - data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask); - if (data == NULL) - goto nodata; + /* Size must match skb_add_mtu(). 
*/ + if (size != SKB_DATA_ALIGN(size)) + BUG(); /* XXX: does not include slab overhead */ skb->truesize = size + sizeof(struct sk_buff); @@ -198,6 +234,7 @@ skb->data = data; skb->tail = data; skb->end = data + size; + skb->free_data = free_data; /* Set up other state */ skb->len = 0; @@ -210,8 +247,6 @@ skb_shinfo(skb)->frag_list = NULL; return skb; -nodata: - skb_head_to_pool(skb); nohead: return NULL; } @@ -285,7 +320,10 @@ if (skb_shinfo(skb)->frag_list) skb_drop_fraglist(skb); - kfree(skb->head); + if (skb->free_data == NULL) + BUG(); + (*skb->free_data)(skb->head); + skb->free_data = NULL; } } @@ -384,6 +422,7 @@ C(tail); C(end); n->destructor = NULL; + C(free_data); #ifdef CONFIG_NETFILTER C(nfmark); C(nfcache); @@ -520,6 +559,7 @@ skb->head = data; skb->end = data + size; + skb->free_data = kfree; /* Set up new pointers */ skb->h.raw += offset; @@ -647,6 +687,7 @@ skb->head = data; skb->end = data+size; + skb->free_data = kfree; skb->data += off; skb->tail += off; From davem@redhat.com Wed Mar 19 16:33:23 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 16:33:31 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K0XLq9025811 for ; Wed, 19 Mar 2003 16:33:22 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id QAA12026; Wed, 19 Mar 2003 16:31:06 -0800 Date: Wed, 19 Mar 2003 16:31:05 -0800 (PST) Message-Id: <20030319.163105.44963500.davem@redhat.com> To: dlstevens@us.ibm.com Cc: kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] anycast support for IPv6, updated to 2.5.44 From: "David S. Miller" In-Reply-To: References: X-FalunGong: Information control. 
X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1972 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 444 Lines: 12 From: "David Stevens" Date: Mon, 28 Oct 2002 14:06:00 -0700 Below is a patch to add anycast support for IPv6. It's the same patch as I've posted previously, but updated with comments from Chris Hellwig and for kernel version 2.5.44. I'm going to apply this, with the small change that dev_getany() is renamed to dev_get_by_flags() which more accurately describes what the routine does. Thanks David. From yoshfuji@wide.ad.jp Wed Mar 19 19:01:21 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 19:01:32 -0800 (PST) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K31Jq9028566 for ; Wed, 19 Mar 2003 19:01:21 -0800 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h2K31aUl005092; Thu, 20 Mar 2003 12:01:37 +0900 Date: Thu, 20 Mar 2003 12:01:36 +0900 (JST) Message-Id: <20030320.120136.108400165.yoshfuji@wide.ad.jp> To: davem@redhat.com Cc: dlstevens@us.ibm.com, kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] anycast support for IPv6, updated to 2.5.44 From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: <20030319.163105.44963500.davem@redhat.com> References: <20030319.163105.44963500.davem@redhat.com> X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Mailer: Mew 
version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1973 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@wide.ad.jp Precedence: bulk X-list: netdev Content-Length: 1059 Lines: 25 In article <20030319.163105.44963500.davem@redhat.com> (at Wed, 19 Mar 2003 16:31:05 -0800 (PST)), "David S. Miller" says: > From: "David Stevens" > Date: Mon, 28 Oct 2002 14:06:00 -0700 > > Below is a patch to add anycast support for IPv6. It's the same patch as > I've posted previously, but updated with comments from Chris Hellwig and > for kernel version 2.5.44. > > I'm going to apply this, with the small change that dev_getany() is > renamed to dev_get_by_flags() which more accurately describes > what the routine does. Again: I don't like API at all. Anycast address management itself in that patch would be ok. However, JOIN/LEAVE is NOT useful and userland application will be incompatible with other implementation. (sigh...) I think linux likes unicast model (assign address like unicast address), too. And, we see __constant_{hton,ntoh}{l,h}() again... 
-- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From davem@redhat.com Wed Mar 19 19:25:42 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 19:25:45 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K3Pfq9029060 for ; Wed, 19 Mar 2003 19:25:42 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id TAA12333; Wed, 19 Mar 2003 19:23:32 -0800 Date: Wed, 19 Mar 2003 19:23:31 -0800 (PST) Message-Id: <20030319.192331.95884882.davem@redhat.com> To: yoshfuji@wide.ad.jp Cc: dlstevens@us.ibm.com, kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] anycast support for IPv6, updated to 2.5.44 From: "David S. Miller" In-Reply-To: <20030320.120136.108400165.yoshfuji@wide.ad.jp> References: <20030319.163105.44963500.davem@redhat.com> <20030320.120136.108400165.yoshfuji@wide.ad.jp> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=iso-2022-jp Content-Transfer-Encoding: 7bit X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1974 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 948 Lines: 22 From: YOSHIFUJI Hideaki / $B5HF#1QL@(B Date: Thu, 20 Mar 2003 12:01:36 +0900 (JST) In article <20030319.163105.44963500.davem@redhat.com> (at Wed, 19 Mar 2003 16:31:05 -0800 (PST)), "David S. Miller" says: > I'm going to apply this, with the small change that dev_getany() is > renamed to dev_get_by_flags() which more accurately describes > what the routine does. Again: I don't like API at all. Anycast address management itself in that patch would be ok. 
However, JOIN/LEAVE is NOT useful and userland application will be incompatible with other implementation. (sigh...) I think linux likes unicast model (assign address like unicast address), too. Please propose alternative API, or do you suggest not to export this facility to user at all? And, we see __constant_{hton,ntoh}{l,h}() again... I will fix this, thank you for mentioning this. From yoshfuji@wide.ad.jp Wed Mar 19 19:44:11 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 19:44:14 -0800 (PST) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K3i9q9029478 for ; Wed, 19 Mar 2003 19:44:11 -0800 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h2K3iTUl005429; Thu, 20 Mar 2003 12:44:29 +0900 Date: Thu, 20 Mar 2003 12:44:28 +0900 (JST) Message-Id: <20030320.124428.95965257.yoshfuji@wide.ad.jp> To: davem@redhat.com Cc: dlstevens@us.ibm.com, kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] anycast support for IPv6, updated to 2.5.44 From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: <20030319.192331.95884882.davem@redhat.com> References: <20030319.163105.44963500.davem@redhat.com> <20030320.120136.108400165.yoshfuji@wide.ad.jp> <20030319.192331.95884882.davem@redhat.com> X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1975 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@wide.ad.jp 
Precedence: bulk X-list: netdev Content-Length: 1131 Lines: 25 In article <20030319.192331.95884882.davem@redhat.com> (at Wed, 19 Mar 2003 19:23:31 -0800 (PST)), "David S. Miller" says: > > I'm going to apply this, with the small change that dev_getany() is > > renamed to dev_get_by_flags() which more accurately describes > > what the routine does. > > Again: I don't like API at all. > > Anycast address management itself in that patch would be ok. > However, JOIN/LEAVE is NOT useful and userland application will be > incompatible with other implementation. (sigh...) > I think linux likes unicast model (assign address like unicast address), too. > > Please propose alternative API, or do you suggest not > to export this facility to user at all? I like to assign address like unicast (using ioctl and rtnetlink (RTN_ANYCAST)). We suggest you not exporting this facilicy until finishing new API (And, another API would be standardized; This is another reason why I am against exporting that API for now.) -- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From davem@redhat.com Wed Mar 19 19:49:49 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 19:49:53 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K3nlq9029816 for ; Wed, 19 Mar 2003 19:49:49 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id TAA12413; Wed, 19 Mar 2003 19:47:35 -0800 Date: Wed, 19 Mar 2003 19:47:35 -0800 (PST) Message-Id: <20030319.194735.31799019.davem@redhat.com> To: yoshfuji@wide.ad.jp Cc: dlstevens@us.ibm.com, kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [PATCH] anycast support for IPv6, updated to 2.5.44 From: "David S. 
Miller" In-Reply-To: <20030320.124428.95965257.yoshfuji@wide.ad.jp> References: <20030320.120136.108400165.yoshfuji@wide.ad.jp> <20030319.192331.95884882.davem@redhat.com> <20030320.124428.95965257.yoshfuji@wide.ad.jp> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=iso-2022-jp Content-Transfer-Encoding: 7bit X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1976 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 847 Lines: 19 From: YOSHIFUJI Hideaki / $B5HF#1QL@(B Date: Thu, 20 Mar 2003 12:44:28 +0900 (JST) In article <20030319.192331.95884882.davem@redhat.com> (at Wed, 19 Mar 2003 19:23:31 -0800 (PST)), "David S. Miller" says: > Please propose alternative API, or do you suggest not > to export this facility to user at all? I like to assign address like unicast (using ioctl and rtnetlink (RTN_ANYCAST)). We suggest you not exporting this facilicy until finishing new API (And, another API would be standardized; This is another reason why I am against exporting that API for now.) I think anycast addresses are more like multicast than unicast. Do you agree about this? But here is what really matters, does the advanced IPV6 socket API say anything about a user API for anycast? 
From nalkunda@cse.msu.edu Wed Mar 19 20:57:25 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 20:57:29 -0800 (PST) Received: from sargasso.cse.msu.edu (sargasso.cse.msu.edu [35.9.20.10]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K4vNq9030458 for ; Wed, 19 Mar 2003 20:57:24 -0800 Received: from elans-pc.cse.msu.edu (elans.cse.msu.edu [35.9.43.164]) by sargasso.cse.msu.edu (8.12.8/8.12.8) with ESMTP id h2K4vHhA003182; Wed, 19 Mar 2003 23:57:17 -0500 (EST) Content-Type: text/plain; charset="us-ascii" From: N N Ashok To: netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Casting (struct rtable*) to (struct dst_entry*) Date: Wed, 19 Mar 2003 23:55:02 -0500 User-Agent: KMail/1.4.3 MIME-Version: 1.0 Message-Id: <200303192355.02509.nalkunda@cse.msu.edu> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h2K4vNq9030458 X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1977 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nalkunda@cse.msu.edu Precedence: bulk X-list: netdev Content-Length: 881 Lines: 25 Hi, I have been looking at the networking code of Linux for my Masters thesis. I observed the following: In ip_route_input(), if a route is found in the cache, the skb->dst is setup with the route found by casting the rtable entry to dst_entry: skb->dst = (struct dst_entry*)rth; Later in ip_route_input(), skb->dst->input() is called: return skb->dst->input(skb); In ip_forward(), skb->dst is again casted to rtable: rt = (struct rtable*)skb->dst; I am unable to understand how a rtable structure casted to dst_entry will give a correct pointer to the input() function. I looked at the fields in rtable and dst_entry, the fields in the structures are cannot be lined up (the fourth field in rtable is not the same type as the fourth field in dst_entry). 
Can anybody help me understand this casting of rtable to dst_entry and then back to rtable? Thanks, Ashok From davem@redhat.com Wed Mar 19 21:00:47 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 21:00:54 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K50kq9030993 for ; Wed, 19 Mar 2003 21:00:46 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id UAA12505; Wed, 19 Mar 2003 20:59:13 -0800 Date: Wed, 19 Mar 2003 20:59:12 -0800 (PST) Message-Id: <20030319.205912.131928327.davem@redhat.com> To: nalkunda@cse.msu.edu Cc: netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Re: Casting (struct rtable*) to (struct dst_entry*) From: "David S. Miller" In-Reply-To: <200303192355.02509.nalkunda@cse.msu.edu> References: <200303192355.02509.nalkunda@cse.msu.edu> X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1978 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 463 Lines: 10 From: N N Ashok Date: Wed, 19 Mar 2003 23:55:02 -0500 I am unable to understand how a rtable structure casted to dst_entry will give a correct pointer to the input() function. I looked at the fields in rtable and dst_entry, the fields in the structures are cannot be lined up (the fourth field in rtable is not the same type as the fourth field in dst_entry). 
"struct rtable" starts with a "struct dst_entry" From nalkunda@cse.msu.edu Wed Mar 19 21:30:04 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 21:30:16 -0800 (PST) Received: from sargasso.cse.msu.edu (sargasso.cse.msu.edu [35.9.20.10]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K5U3q9031959 for ; Wed, 19 Mar 2003 21:30:04 -0800 Received: from elans-pc.cse.msu.edu (elans.cse.msu.edu [35.9.43.164]) by sargasso.cse.msu.edu (8.12.8/8.12.8) with ESMTP id h2K5TvhA006821; Thu, 20 Mar 2003 00:29:57 -0500 (EST) Content-Type: text/plain; charset="iso-8859-1" From: N N Ashok To: "David S. Miller" Subject: Re: Casting (struct rtable*) to (struct dst_entry*) Date: Thu, 20 Mar 2003 00:27:41 -0500 User-Agent: KMail/1.4.3 Cc: netdev@oss.sgi.com, linux-net@vger.kernel.org References: <200303192355.02509.nalkunda@cse.msu.edu> <20030319.205912.131928327.davem@redhat.com> In-Reply-To: <20030319.205912.131928327.davem@redhat.com> MIME-Version: 1.0 Message-Id: <200303200027.41923.nalkunda@cse.msu.edu> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h2K5U3q9031959 X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1979 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nalkunda@cse.msu.edu Precedence: bulk X-list: netdev Content-Length: 1417 Lines: 32 On Wednesday 19 March 2003 23:59, David S. Miller wrote: > From: N N Ashok > Date: Wed, 19 Mar 2003 23:55:02 -0500 > > I am unable to understand how a rtable structure casted to dst_entry > will give a correct pointer to the input() function. I looked at the fields > in rtable and dst_entry, the fields in the structures are cannot be lined > up (the fourth field in rtable is not the same type as the fourth field in > dst_entry). > > "struct rtable" starts with a "struct dst_entry" Thanks David. I did see that. 
But however, I could not understand how "struct rtable" can be casted to "struct dst_entry" and then back again, all the while accessing fields of both structures. When the (struct rtable *)rth is filled in ip_route_input(), the variables accessed are those of rtable. Then rth is cast to (struct dst_entry *) and assigned to skb->dst (which is of type struct dst_entry *). After this, in ip_rcv_finish(), the field of dst_entry is accessed as in: skb->dst->input(). I am unable to understand how, data filled in as rtable fields will be valid when accessed as dst_entry fields. Later in ip_forward() (for a packet to be forwarded), the skb->dst is cast to (struct rtable *) and its fields accessed. A correction to the previous post: the skb->dst->input() is invoked in ip_rcv_finish() and not in ip_route_input() as mentioned in the post. Thanks, Ashok From davem@redhat.com Wed Mar 19 21:36:06 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 21:36:11 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K5ZPq9032346 for ; Wed, 19 Mar 2003 21:36:06 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id VAA12634; Wed, 19 Mar 2003 21:33:52 -0800 Date: Wed, 19 Mar 2003 21:33:52 -0800 (PST) Message-Id: <20030319.213352.50358237.davem@redhat.com> To: nalkunda@cse.msu.edu Cc: netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Re: Casting (struct rtable*) to (struct dst_entry*) From: "David S. Miller" In-Reply-To: <200303200027.41923.nalkunda@cse.msu.edu> References: <200303192355.02509.nalkunda@cse.msu.edu> <20030319.205912.131928327.davem@redhat.com> <200303200027.41923.nalkunda@cse.msu.edu> X-FalunGong: Information control. 
X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1980 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 637 Lines: 24

   From: N N Ashok
   Date: Thu, 20 Mar 2003 00:27:41 -0500

   I did see that. But however, I could not understand how "struct rtable"
   can be casted to "struct dst_entry" and then back again, all the while
   accessing fields of both structures.

You miss the point that they are the same structure. It is allocated
the size of "struct rtable" but it may be casted back and forth
between rtable and dst_entry as desired.

void foo(void)
{
	struct rtable rt;
	struct dst_entry *dst;

	rt.u.dst.bar = 1;

	dst = (struct dst_entry *) &rt;
	ASSERT(dst->bar == 1);

	dst = &rt.u.dst;
	ASSERT(dst->bar == 1);
}

From nalkunda@cse.msu.edu Wed Mar 19 22:05:40 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 22:05:43 -0800 (PST) Received: from sargasso.cse.msu.edu (sargasso.cse.msu.edu [35.9.20.10]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K650q9000310 for ; Wed, 19 Mar 2003 22:05:40 -0800 Received: from elans-pc.cse.msu.edu (elans.cse.msu.edu [35.9.43.164]) by sargasso.cse.msu.edu (8.12.8/8.12.8) with ESMTP id h2K64shA010265; Thu, 20 Mar 2003 01:04:54 -0500 (EST) Content-Type: text/plain; charset="iso-8859-1" From: N N Ashok To: "David S.
Miller" Subject: Re: Casting (struct rtable*) to (struct dst_entry*) Date: Thu, 20 Mar 2003 01:02:38 -0500 User-Agent: KMail/1.4.3 Cc: netdev@oss.sgi.com, linux-net@vger.kernel.org References: <200303192355.02509.nalkunda@cse.msu.edu> <200303200027.41923.nalkunda@cse.msu.edu> <20030319.213352.50358237.davem@redhat.com> In-Reply-To: <20030319.213352.50358237.davem@redhat.com> MIME-Version: 1.0 Message-Id: <200303200102.38813.nalkunda@cse.msu.edu> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h2K650q9000310 X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1981 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nalkunda@cse.msu.edu Precedence: bulk X-list: netdev Content-Length: 1221 Lines: 38

On Thursday 20 March 2003 00:33, David S. Miller wrote:
>    From: N N Ashok
>    Date: Thu, 20 Mar 2003 00:27:41 -0500
>
>    I did see that. But however, I could not understand how "struct rtable"
>    can be casted to "struct dst_entry" and then back again, all the while
>    accessing fields of both structures.
>
> You miss the point that they are the same structure. It is allocated
> the size of "struct rtable" but it may be casted back and forth
> between rtable and dst_entry as desired.
>
> void foo(void)
> {
> 	struct rtable rt;
> 	struct dst_entry *dst;
>
> 	rt.u.dst.bar = 1;
>
> 	dst = (struct dst_entry *) &rt;
> 	ASSERT(dst->bar == 1);
>
> 	dst = &rt.u.dst;
> 	ASSERT(dst->bar == 1);
> }

I think I finally understand the whole setup. Please correct me if I'm wrong.

"struct rtable" has as its first field "u", which is a union of "struct dst_entry" and "struct rtable *". Thus when we cast rtable to dst_entry, we are accessing rtable.u.dst itself and not any other part of rtable. Since the data was originally allocated with the size of "struct rtable", when we cast "dst_entry" back to "struct rtable" we can access all the fields of "struct rtable".
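
The layout described above can be checked with a small user-space sketch. This is not kernel code: the two structs are cut-down stand-ins that keep only the shape under discussion (a "u" union as the first member of struct rtable, holding a struct dst_entry), and refcnt, rt_gateway, demo_input(), rt_to_dst(), dst_to_rt() and demo_cast() are names made up for illustration. Because the union sits at offset 0, a pointer to the rtable and a pointer to its u.dst have the same value, so casting in either direction reaches the same memory.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the kernel structure; fields are hypothetical. */
struct dst_entry {
	int refcnt;
	int (*input)(int pkt);          /* plays the role of dst->input() */
};

/* Cut-down stand-in: the union holding the dst_entry is the FIRST member. */
struct rtable {
	union {
		struct dst_entry dst;
		struct rtable *rt_next;
	} u;
	unsigned int rt_gateway;        /* a field that only rtable has */
};

/* Hypothetical input handler, only here so the call can be observed. */
static int demo_input(int pkt)
{
	return pkt + 1;
}

/* "Upcast": valid because u (and so u.dst) lives at offset 0. */
static struct dst_entry *rt_to_dst(struct rtable *rt)
{
	return (struct dst_entry *)rt;
}

/* "Downcast": only valid if dst really is embedded in a struct rtable. */
static struct rtable *dst_to_rt(struct dst_entry *dst)
{
	return (struct rtable *)dst;
}

/* Fill the object as an rtable, use it as a dst_entry, then cast back. */
static int demo_cast(void)
{
	struct rtable rt = {0};
	struct dst_entry *dst;
	struct rtable *back;

	rt.u.dst.input = demo_input;    /* written through the rtable view */
	rt.rt_gateway = 0x0a000001u;

	dst = rt_to_dst(&rt);
	if (dst != &rt.u.dst)           /* same address, two types */
		return -1;

	back = dst_to_rt(dst);
	if (back->rt_gateway != 0x0a000001u)
		return -1;

	return dst->input(41);          /* read through the dst_entry view */
}
```

Calling demo_cast() exercises the whole round trip. The cast back to struct rtable is only safe because the object was created as a full struct rtable in the first place; a bare struct dst_entry would not have the extra fields behind it.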
Thanks a lot for the clarification. Ashok From dlstevens@us.ibm.com Wed Mar 19 23:34:53 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 19 Mar 2003 23:35:00 -0800 (PST) Received: from e32.co.us.ibm.com (e32.co.us.ibm.com [32.97.110.130]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K7Ypq9001312 for ; Wed, 19 Mar 2003 23:34:53 -0800 Received: from westrelay04.boulder.ibm.com (westrelay04.boulder.ibm.com [9.17.193.32]) by e32.co.us.ibm.com (8.12.8/8.12.2) with ESMTP id h2K7XqTr046592; Thu, 20 Mar 2003 02:33:52 -0500 Received: from d03nm121.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by westrelay04.boulder.ibm.com (8.12.8/NCO/VER6.5) with ESMTP id h2K7YiZt223914; Thu, 20 Mar 2003 00:34:44 -0700 Importance: Normal Sensitivity: Subject: Re: [PATCH] anycast support for IPv6, updated to 2.5.44 To: "David S. Miller" Cc: yoshfuji@wide.ad.jp, kuznet@ms2.inr.ac.ru, linux-kernel@vger.kernel.org, netdev@oss.sgi.com X-Mailer: Lotus Notes Release 5.0.4a July 24, 2000 Message-ID: From: David Stevens Date: Thu, 20 Mar 2003 00:34:41 -0700 X-MIMETrack: Serialize by Router on D03NM121/03/M/IBM(Release 6.0 [IBM]|December 16, 2002) at 03/20/2003 00:34:43 MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1982 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dlstevens@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 1009 Lines: 23 Yoshifuji, I created the multicast-like API because, aside from the in-kernel use, there was no way to use anycasting otherwise, and I believe for at least the high-availability case, it doesn't make any sense to treat it like a unicast address. An exited DNS server program with the server machine still up will in fact deny service to clients that might otherwise find a working server if the "permanent" address model were not there. 
With a multicast-like interface at least available, programs have the choice of tying the anycast address to whether or not the service that needs it is running. That said, there's no reason why you can't have both, and that's straightforward with the code (but not implemented). I think it's too early to be concerned with compatibility since there is no alternative non-permanent anycast address API. If Linux has an API to do something that can't be done at all on other systems, there clearly isn't a portability issue. +-DLS From nalkunda@cse.msu.edu Thu Mar 20 00:34:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 00:34:51 -0800 (PST) Received: from sargasso.cse.msu.edu (sargasso.cse.msu.edu [35.9.20.10]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K8Yiq9006643 for ; Thu, 20 Mar 2003 00:34:45 -0800 Received: from elans-pc.cse.msu.edu (elans.cse.msu.edu [35.9.43.164]) by sargasso.cse.msu.edu (8.12.8/8.12.8) with ESMTP id h2K8YchA024595; Thu, 20 Mar 2003 03:34:38 -0500 (EST) Content-Type: text/plain; charset="us-ascii" From: N N Ashok To: netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Keeping track of an interface Date: Thu, 20 Mar 2003 03:32:22 -0500 User-Agent: KMail/1.4.3 MIME-Version: 1.0 Message-Id: <200303200332.22747.nalkunda@cse.msu.edu> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h2K8Yiq9006643 X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1983 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nalkunda@cse.msu.edu Precedence: bulk X-list: netdev Content-Length: 1129 Lines: 23 Hi, I have a situation where I want to record which interface (say eth2) was used for a packet (say the current packet). Later I want to be able to send other packets over the interface which I recorded if it is up and able to send packets. 
What would be the best way to keep track of an interface for the above scenario? Could we use the "struct net_device" pointer to the interface? If we kept track of the "struct net_device" pointer and that interface later flapped, the pointer would no longer be valid (I assume so). How should we detect that? One way I thought of is to keep track of the "oif" (in "struct fib_nh") for the particular device. Is it guaranteed that the same device will always have the same "oif" even if it flapped? If so, it would be a simple matter to record the "oif" number of the device and later send the packets on that interface after checking that the interface is up and able to send packets. The broader aim is to route packets of a connection over the same interface, like routing all packets of the same TCP session over the same interface. Thanks, Ashok From Robert.Olsson@data.slu.se Thu Mar 20 01:33:31 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 01:33:36 -0800 (PST) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2K9XSq9008073 for ; Thu, 20 Mar 2003 01:33:30 -0800 Received: (from robert@localhost) by robur.slu.se (8.9.3/8.9.3) id KAA25659; Thu, 20 Mar 2003 10:33:21 +0100 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <15993.35552.587283.291895@robur.slu.se> Date: Thu, 20 Mar 2003 10:33:20 +0100 To: Archie Cobbs Cc: netdev@oss.sgi.com Subject: [PATCH] sk_buff's allocated from private pools In-Reply-To: <200303192208.h2JM8Yx1037110@bubba.precisionio.com> References: <200303192208.h2JM8Yx1037110@bubba.precisionio.com> X-Mailer: VM 6.92 under Emacs 19.34.1 X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1985 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: 
netdev Content-Length: 1496 Lines: 38 Archie Cobbs writes: > Hello, > > I'm submitting this patch for inclusion in the Linux kernel if deemed > generally useful. > > The purpose of this patch is to add a new function called alloc_skb_custom() > (or whatever) that allows the data portion of an sk_buff to reside in any > memory region, not just a region returned by kmalloc(). For example, if a > networking device has a restriction on where receive buffers may reside, > then the device driver can avoid copying every incoming packet if it is > able to create an sk_buff that points to the receive buffer memory. > > Basically this amounts to adding a 'free_data' function pointer to the > sk_buff structure. By default this points to kfree() but in general could > point to anywhere. FYI. The skb recycling patches I am playing with use the same callback, and have an implementation for private buffers that syncs with outstanding callback-marked skbs etc. ftp://robur.slu.se/pub/Linux/net-development/skb_recycling/recycle19.pat ftp://robur.slu.se/pub/Linux/net-development/skb_recycling/e1000-RC-030217.pat Also, for SMP, it records in the skb header which CPU skb_headerinit was done on, so the callback has a chance to re-route the skb to the originating CPU and minimize cache bouncing when recycling. Also, skb_headerinit is moved to be the first operation in the life of an skb, not the last. The current implementation uses only kmalloc for the data part, so your alloc_skb_custom adds some new value. Cheers.
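The 'free_data' idea discussed above can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the code from either patch: struct buf, buf_alloc(), buf_alloc_custom(), buf_free() and the one-slot pool are hypothetical names, and malloc()/free() stand in for kmalloc()/kfree().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an sk_buff-like object (not the kernel struct). */
struct buf {
    unsigned char *data;            /* payload memory */
    size_t len;                     /* payload size in bytes */
    void (*free_data)(void *data);  /* how to release 'data'; may be NULL */
};

/* Custom path: wrap caller-provided memory together with its release hook. */
struct buf *buf_alloc_custom(void *data, size_t len, void (*free_data)(void *))
{
    struct buf *b;

    assert(data != NULL);           /* wrapping nothing is a caller bug */
    b = malloc(sizeof(*b));
    if (b == NULL)
        return NULL;
    b->data = data;
    b->len = len;
    b->free_data = free_data;
    return b;
}

/* Default path: payload comes from malloc() and is released with free(). */
struct buf *buf_alloc(size_t len)
{
    void *data = malloc(len);
    struct buf *b;

    if (data == NULL)
        return NULL;
    b = buf_alloc_custom(data, len, free);
    if (b == NULL)
        free(data);
    return b;
}

/* Release the payload through the hook, then the descriptor itself. */
void buf_free(struct buf *b)
{
    if (b == NULL)
        return;
    if (b->free_data != NULL)
        b->free_data(b->data);
    free(b);
}

/* Hypothetical private pool: one static slot, as a driver-owned region. */
static unsigned char pool_slot[2048];
static int pool_slot_in_use;

/* Release hook for the pool: nothing is freed, the slot is just reusable. */
static void pool_release(void *data)
{
    (void)data;
    pool_slot_in_use = 0;
}

/* Hand out the pool slot without copying, or NULL if it is busy. */
struct buf *pool_buf_get(void)
{
    struct buf *b;

    if (pool_slot_in_use)
        return NULL;
    b = buf_alloc_custom(pool_slot, sizeof(pool_slot), pool_release);
    if (b != NULL)
        pool_slot_in_use = 1;
    return b;
}
```

A caller frees every buffer with buf_free() regardless of where the payload lives; the hook decides whether that means free() or returning the slot to the pool, which is roughly the property both the private-pool and the recycling use cases rely on.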
--ro From hshmulik@intel.com Thu Mar 20 07:15:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 07:15:48 -0800 (PST) Received: from caduceus.fm.intel.com (fmr02.intel.com [192.55.52.25]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KFFgq9024841 for ; Thu, 20 Mar 2003 07:15:43 -0800 Received: from talaria.fm.intel.com (talaria.fm.intel.com [10.1.192.39]) by caduceus.fm.intel.com (8.11.6/8.11.6/d: outer.mc,v 1.51 2002/09/23 20:43:23 dmccart Exp $) with ESMTP id h2KF96605132 for ; Thu, 20 Mar 2003 15:09:06 GMT Received: from fmsmsxv040-1.fm.intel.com (fmsmsxvs040.fm.intel.com [132.233.42.124]) by talaria.fm.intel.com (8.11.6/8.11.6/d: inner.mc,v 1.28 2003/01/13 19:44:39 dmccart Exp $) with SMTP id h2KFH5c07934 for ; Thu, 20 Mar 2003 15:17:05 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxv040-1.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003032007155221563 ; Thu, 20 Mar 2003 07:15:53 -0800 Date: Thu, 20 Mar 2003 17:15:36 +0200 (IST) From: Shmulik Hen X-X-Sender: hshmulik@jrslxjul4.npdj.intel.com To: Bonding Developement list , Bonding Announce list , Linux Net Mailing list , Linux Kernel Mailing list , Oss SGI Netdev list , Jeff Garzik Subject: [patch] (2/8) Add 802.3ad support to bonding Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1990 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hshmulik@intel.com Precedence: bulk X-list: netdev Content-Length: 5930 Lines: 192 This patch complements the latest drop of bonding from source-forge (2.4.20-20030317) by incorporating the changes to bond_release_all() too. 
It also fixes a hang when releasing a slave while outgoing traffic is running, that looks like a deadlock between the BR_NETPROTO_LOCK, dev->xmit_lock and the bond lock (happens on quad processor machines, but KDB back trace wasn't clear enough). This patch is against bonding 2.4.20-20030317. diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding.c linux-2.4.20-bonding-20030317-devel/drivers/net/bonding.c --- linux-2.4.20-bonding-20030317/drivers/net/bonding.c 2003-03-18 17:03:24.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding.c 2003-03-18 17:03:24.000000000 +0200 @@ -286,6 +286,15 @@ * checking slave and slave->dev (which only worked by accident). * - Misc code cleanup: get arp_send() prototype from header file, * add max_bonds to bonding.txt. + * + * 2003/03/18 - Tsippy Mendelson and + * Shmulik Hen + * - Make sure only bond_attach_slave() and bond_detach_slave() can + * manipulate the slave list, including slave_cnt, even when in + * bond_release_all(). + * - Fixed hang in bond_release() while traffic is running. + * netdev_set_master() must not be called from within the bond lock. 
+ * */ #include @@ -326,8 +335,8 @@ #include #include -#define DRV_VERSION "2.4.20-20030207" -#define DRV_RELDATE "February 7, 2003" +#define DRV_VERSION "2.4.20-20030317" +#define DRV_RELDATE "March 17, 2003" #define DRV_NAME "bonding" #define DRV_DESCRIPTION "Ethernet Channel Bonding Driver" @@ -1469,16 +1478,14 @@ static int bond_release(struct net_devic bond = (struct bonding *) master->priv; - write_lock_irqsave(&bond->lock, flags); - /* master already enslaved, or slave not enslaved, or no slave for this master */ if ((master->flags & IFF_SLAVE) || !(slave->flags & IFF_SLAVE)) { printk (KERN_DEBUG "%s: cannot release %s.\n", master->name, slave->name); - write_unlock_irqrestore(&bond->lock, flags); return -EINVAL; } + write_lock_irqsave(&bond->lock, flags); bond->current_arp_slave = NULL; our_slave = (slave_t *)bond; old_current = bond->current_slave; @@ -1497,38 +1504,7 @@ static int bond_release(struct net_devic } else { printk(".\n"); } - - /* release the slave from its bond */ - - if (multicast_mode == BOND_MULTICAST_ALL) { - /* flush master's mc_list from slave */ - bond_mc_list_flush (slave, master); - - /* unset promiscuity level from slave */ - if (master->flags & IFF_PROMISC) - dev_set_promiscuity(slave, -1); - - /* unset allmulti level from slave */ - if (master->flags & IFF_ALLMULTI) - dev_set_allmulti(slave, -1); - } - - netdev_set_master(slave, NULL); - - /* only restore its RUNNING flag if monitoring set it down */ - if (slave->flags & IFF_UP) { - slave->flags |= IFF_RUNNING; - } - - if (slave->flags & IFF_NOARP || - bond->current_slave != NULL) { - dev_close(slave); - our_slave->original_flags &= ~IFF_UP; - } - - bond_restore_slave_flags(our_slave); - kfree(our_slave); - + if (bond->current_slave == NULL) { printk(KERN_INFO "%s: now running without any active interface !\n", @@ -1539,16 +1515,51 @@ static int bond_release(struct net_devic bond->primary_slave = NULL; } - write_unlock_irqrestore(&bond->lock, flags); - return 0; /* deletion OK */ 
+ break; } - } - /* if we get here, it's because the device was not found */ + } write_unlock_irqrestore(&bond->lock, flags); + + if (our_slave == (slave_t *)bond) { + /* if we get here, it's because the device was not found */ + printk (KERN_INFO "%s: %s not enslaved\n", master->name, slave->name); + return -EINVAL; + } + + /* undo settings and restore original values */ + + if (multicast_mode == BOND_MULTICAST_ALL) { + /* flush master's mc_list from slave */ + bond_mc_list_flush (slave, master); - printk (KERN_INFO "%s: %s not enslaved\n", master->name, slave->name); - return -EINVAL; + /* unset promiscuity level from slave */ + if (master->flags & IFF_PROMISC) + dev_set_promiscuity(slave, -1); + + /* unset allmulti level from slave */ + if (master->flags & IFF_ALLMULTI) + dev_set_allmulti(slave, -1); + } + + netdev_set_master(slave, NULL); + + /* only restore its RUNNING flag if monitoring set it down */ + if (slave->flags & IFF_UP) { + slave->flags |= IFF_RUNNING; + } + + if (slave->flags & IFF_NOARP || + bond->current_slave != NULL) { + dev_close(slave); + our_slave->original_flags &= ~IFF_UP; + } + + bond_restore_slave_flags(our_slave); + + kfree(our_slave); + + return 0; /* deletion OK */ } /* @@ -1571,10 +1582,12 @@ static int bond_release_all(struct net_d bond = (struct bonding *) master->priv; bond->current_arp_slave = NULL; + bond->current_slave = NULL; + bond->primary_slave = NULL; while ((our_slave = bond->prev) != (slave_t *)bond) { slave_dev = our_slave->dev; - bond->prev = our_slave->prev; + bond_detach_slave(bond, our_slave); if (multicast_mode == BOND_MULTICAST_ALL || (multicast_mode == BOND_MULTICAST_ACTIVE @@ -1604,10 +1617,6 @@ static int bond_release_all(struct net_d dev_close(slave_dev); } - bond->current_slave = NULL; - bond->next = (slave_t *)bond; - bond->slave_cnt = 0; - bond->primary_slave = NULL; printk (KERN_INFO "%s: released all slaves\n", master->name); return 0; -- | Shmulik Hen | | Israel Design Center (Jerusalem) | | LAN Access 
Division | | Intel Communications Group, Intel corp. | | | | Anti-Spam: shmulik dot hen at intel dot com | From hshmulik@intel.com Thu Mar 20 07:15:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 07:15:17 -0800 (PST) Received: from hermes.fm.intel.com (fmr01.intel.com [192.55.52.18]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KFFEq9024645 for ; Thu, 20 Mar 2003 07:15:14 -0800 Received: from talaria.fm.intel.com (talaria.fm.intel.com [10.1.192.39]) by hermes.fm.intel.com (8.11.6/8.11.6/d: outer.mc,v 1.51 2002/09/23 20:43:23 dmccart Exp $) with ESMTP id h2KFBif07915 for ; Thu, 20 Mar 2003 15:11:45 GMT Received: from fmsmsxv040-1.fm.intel.com (fmsmsxvs040.fm.intel.com [132.233.42.124]) by talaria.fm.intel.com (8.11.6/8.11.6/d: inner.mc,v 1.28 2003/01/13 19:44:39 dmccart Exp $) with SMTP id h2KFGac07600 for ; Thu, 20 Mar 2003 15:16:36 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxv040-1.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003032007152306522 ; Thu, 20 Mar 2003 07:15:25 -0800 Date: Thu, 20 Mar 2003 17:15:07 +0200 (IST) From: Shmulik Hen X-X-Sender: hshmulik@jrslxjul4.npdj.intel.com To: Bonding Developement list , Bonding Announce list , Linux Net Mailing list , Linux Kernel Mailing list , Oss SGI Netdev list , Jeff Garzik Subject: [patch] (1/8) Adding 802.3ad support to bonding Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1989 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hshmulik@intel.com Precedence: bulk X-list: netdev Content-Length: 3792 Lines: 100 This patch adds support for point to point protocols (e.g. 802.3ad) over bonding that need to know the physical device the skb came on. 
It saves the real device in a new field in skbuff before overwriting it with the virtual interface device in skb_bond() and __vlan_hwaccel_rx(). This patch is against 2.4.20 kernel and gives compatibility for anyone who wants to test the latest release of bonding from source-forge (2.4.20-20030317). diff -Nuarp linux-2.4.20-orig/include/linux/if_vlan.h linux-2.4.20-devel/include/linux/if_vlan.h --- linux-2.4.20-orig/include/linux/if_vlan.h 2002-11-29 01:53:15.000000000 +0200 +++ linux-2.4.20-devel/include/linux/if_vlan.h 2003-03-04 12:13:23.000000000 +0200 @@ -148,6 +148,9 @@ static inline int __vlan_hwaccel_rx(stru { struct net_device_stats *stats; +#ifdef BOND_POINT_TO_POINT_PROT + skb->real_dev = skb->dev; +#endif //BOND_POINT_TO_POINT_PROT skb->dev = grp->vlan_devices[vlan_tag & VLAN_VID_MASK]; if (skb->dev == NULL) { kfree_skb(skb); diff -Nuarp linux-2.4.20-orig/include/linux/skbuff.h linux-2.4.20-devel/include/linux/skbuff.h --- linux-2.4.20-orig/include/linux/skbuff.h 2002-08-03 03:39:46.000000000 +0300 +++ linux-2.4.20-devel/include/linux/skbuff.h 2003-03-04 11:59:29.000000000 +0200 @@ -135,6 +135,11 @@ struct sk_buff { struct sock *sk; /* Socket we are owned by */ struct timeval stamp; /* Time we arrived */ struct net_device *dev; /* Device we arrived on/are leaving by */ +#define BOND_POINT_TO_POINT_PROT + struct net_device *real_dev; /* For support of point to point protocols + (e.g. 802.3ad) over bonding, we must save the + physical device that got the packet before + replacing skb->dev with the virtual device. 
*/ /* Transport layer header */ union diff -Nuarp linux-2.4.20-orig/net/core/dev.c linux-2.4.20-devel/net/core/dev.c --- linux-2.4.20-orig/net/core/dev.c 2002-11-29 01:53:15.000000000 +0200 +++ linux-2.4.20-devel/net/core/dev.c 2003-03-03 19:48:15.000000000 +0200 @@ -1328,8 +1328,12 @@ static __inline__ void skb_bond(struct s { struct net_device *dev = skb->dev; - if (dev->master) - skb->dev = dev->master; + if (dev->master) { +#ifdef BOND_POINT_TO_POINT_PROT + skb->real_dev = skb->dev; +#endif //BOND_POINT_TO_POINT_PROT + skb->dev = dev->master; + } } static void net_tx_action(struct softirq_action *h) diff -Nuarp linux-2.4.20-orig/net/core/skbuff.c linux-2.4.20-devel/net/core/skbuff.c --- linux-2.4.20-orig/net/core/skbuff.c 2002-08-03 03:39:46.000000000 +0300 +++ linux-2.4.20-devel/net/core/skbuff.c 2003-03-03 19:51:39.000000000 +0200 @@ -231,6 +231,9 @@ static inline void skb_headerinit(void * skb->sk = NULL; skb->stamp.tv_sec=0; /* No idea about time */ skb->dev = NULL; +#ifdef BOND_POINT_TO_POINT_PROT + skb->real_dev = NULL; +#endif //BOND_POINT_TO_POINT_PROT skb->dst = NULL; memset(skb->cb, 0, sizeof(skb->cb)); skb->pkt_type = PACKET_HOST; /* Default type */ @@ -362,6 +365,9 @@ struct sk_buff *skb_clone(struct sk_buff n->sk = NULL; C(stamp); C(dev); +#ifdef BOND_POINT_TO_POINT_PROT + C(real_dev); +#endif //BOND_POINT_TO_POINT_PROT C(h); C(nh); C(mac); @@ -417,6 +423,9 @@ static void copy_skb_header(struct sk_bu new->list=NULL; new->sk=NULL; new->dev=old->dev; +#ifdef BOND_POINT_TO_POINT_PROT + new->real_dev=old->real_dev; +#endif //BOND_POINT_TO_POINT_PROT new->priority=old->priority; new->protocol=old->protocol; new->dst=dst_clone(old->dst); -- | Shmulik Hen | | Israel Design Center (Jerusalem) | | LAN Access Division | | Intel Communications Group, Intel corp. 
| | | | Anti-Spam: shmulik dot hen at intel dot com | From hshmulik@intel.com Thu Mar 20 07:14:34 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 07:14:38 -0800 (PST) Received: from hermes.fm.intel.com (fmr01.intel.com [192.55.52.18]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KFEWq9024559 for ; Thu, 20 Mar 2003 07:14:33 -0800 Received: from talaria.fm.intel.com (talaria.fm.intel.com [10.1.192.39]) by hermes.fm.intel.com (8.11.6/8.11.6/d: outer.mc,v 1.51 2002/09/23 20:43:23 dmccart Exp $) with ESMTP id h2KFAkf07558 for ; Thu, 20 Mar 2003 15:10:55 GMT Received: from fmsmsxv040-1.fm.intel.com (fmsmsxvs040.fm.intel.com [132.233.42.124]) by talaria.fm.intel.com (8.11.6/8.11.6/d: inner.mc,v 1.28 2003/01/13 19:44:39 dmccart Exp $) with SMTP id h2KFFcc06713 for ; Thu, 20 Mar 2003 15:15:38 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxv040-1.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003032007142513089 ; Thu, 20 Mar 2003 07:14:27 -0800 Date: Thu, 20 Mar 2003 17:14:09 +0200 (IST) From: Shmulik Hen X-X-Sender: hshmulik@jrslxjul4.npdj.intel.com To: Bonding Developement list , Bonding Announce list , Linux Net Mailing list , Linux Kernel Mailing list , Oss SGI Netdev list , Jeff Garzik Subject: [patch] (0/8) Adding 802.3ad support to bonding Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1988 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hshmulik@intel.com Precedence: bulk X-list: netdev Content-Length: 3746 Lines: 102 This patch adds support for point to point protocols (e.g. 802.3ad) over bonding that need to know the physical device the skb came on. It saves the real device in a new field in skbuff before overwriting it with the virtual interface device in skb_bond() and __vlan_hwaccel_rx(). This patch is against 2.4.21-pre5 kernel. 
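The effect of the change described above can be modelled outside the kernel as follows. This is a simplified sketch, assuming pared-down stand-ins for struct net_device and struct sk_buff that keep only the fields involved; rx_port_name() is a hypothetical helper added for illustration and is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, pared-down stand-ins for the kernel structures. */
struct net_device {
    const char *name;
    struct net_device *master;   /* bonding master, or NULL if not enslaved */
};

struct sk_buff {
    struct net_device *dev;      /* device the upper layers see */
    struct net_device *real_dev; /* physical device that received the frame */
};

/* Modelled on the patched skb_bond(): remember the slave, then switch
 * skb->dev to the master so upper layers see the virtual device. */
void skb_bond(struct sk_buff *skb)
{
    struct net_device *dev;

    assert(skb != NULL && skb->dev != NULL);
    dev = skb->dev;
    if (dev->master) {
        skb->real_dev = skb->dev;
        skb->dev = dev->master;
    }
}

/* Hypothetical helper: a per-port protocol (e.g. 802.3ad) can still tell
 * which physical port a frame arrived on after skb_bond() has run. */
const char *rx_port_name(const struct sk_buff *skb)
{
    return skb->real_dev ? skb->real_dev->name : skb->dev->name;
}
```

With a frame received on a slave, skb->dev ends up pointing at the master while rx_port_name() still reports the slave; for a device with no master, real_dev stays NULL and the helper falls back to skb->dev.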
diff -Nuarp linux-2.4.21-pre5-orig/include/linux/if_vlan.h linux-2.4.21-pre5-new/include/linux/if_vlan.h --- linux-2.4.21-pre5-orig/include/linux/if_vlan.h 2002-11-29 01:53:15.000000000 +0200 +++ linux-2.4.21-pre5-new/include/linux/if_vlan.h 2003-03-04 14:01:56.000000000 +0200 @@ -148,6 +148,9 @@ static inline int __vlan_hwaccel_rx(stru { struct net_device_stats *stats; +#ifdef BOND_POINT_TO_POINT_PROT + skb->real_dev = skb->dev; +#endif //BOND_POINT_TO_POINT_PROT skb->dev = grp->vlan_devices[vlan_tag & VLAN_VID_MASK]; if (skb->dev == NULL) { kfree_skb(skb); diff -Nuarp linux-2.4.21-pre5-orig/include/linux/skbuff.h linux-2.4.21-pre5-new/include/linux/skbuff.h --- linux-2.4.21-pre5-orig/include/linux/skbuff.h 2003-03-04 13:43:27.000000000 +0200 +++ linux-2.4.21-pre5-new/include/linux/skbuff.h 2003-03-04 14:13:25.000000000 +0200 @@ -135,6 +135,11 @@ struct sk_buff { struct sock *sk; /* Socket we are owned by */ struct timeval stamp; /* Time we arrived */ struct net_device *dev; /* Device we arrived on/are leaving by */ +#define BOND_POINT_TO_POINT_PROT + struct net_device *real_dev; /* For support of point to point protocols + (e.g. 802.3ad) over bonding, we must save the + physical device that got the packet before + replacing skb->dev with the virtual device. 
*/ /* Transport layer header */ union diff -Nuarp linux-2.4.21-pre5-orig/net/core/dev.c linux-2.4.21-pre5-new/net/core/dev.c --- linux-2.4.21-pre5-orig/net/core/dev.c 2003-03-04 13:43:28.000000000 +0200 +++ linux-2.4.21-pre5-new/net/core/dev.c 2003-03-04 14:14:56.000000000 +0200 @@ -1328,8 +1328,12 @@ static __inline__ void skb_bond(struct s { struct net_device *dev = skb->dev; - if (dev->master) - skb->dev = dev->master; + if (dev->master) { +#ifdef BOND_POINT_TO_POINT_PROT + skb->real_dev = skb->dev; +#endif //BOND_POINT_TO_POINT_PROT + skb->dev = dev->master; + } } static void net_tx_action(struct softirq_action *h) diff -Nuarp linux-2.4.21-pre5-orig/net/core/skbuff.c linux-2.4.21-pre5-new/net/core/skbuff.c --- linux-2.4.21-pre5-orig/net/core/skbuff.c 2003-03-04 13:43:28.000000000 +0200 +++ linux-2.4.21-pre5-new/net/core/skbuff.c 2003-03-04 14:17:44.000000000 +0200 @@ -231,6 +231,9 @@ static inline void skb_headerinit(void * skb->sk = NULL; skb->stamp.tv_sec=0; /* No idea about time */ skb->dev = NULL; +#ifdef BOND_POINT_TO_POINT_PROT + skb->real_dev = NULL; +#endif //BOND_POINT_TO_POINT_PROT skb->dst = NULL; memset(skb->cb, 0, sizeof(skb->cb)); skb->pkt_type = PACKET_HOST; /* Default type */ @@ -362,6 +365,9 @@ struct sk_buff *skb_clone(struct sk_buff n->sk = NULL; C(stamp); C(dev); +#ifdef BOND_POINT_TO_POINT_PROT + C(real_dev); +#endif //BOND_POINT_TO_POINT_PROT C(h); C(nh); C(mac); @@ -417,6 +423,9 @@ static void copy_skb_header(struct sk_bu new->list=NULL; new->sk=NULL; new->dev=old->dev; +#ifdef BOND_POINT_TO_POINT_PROT + new->real_dev=old->real_dev; +#endif //BOND_POINT_TO_POINT_PROT new->priority=old->priority; new->protocol=old->protocol; new->dst=dst_clone(old->dst); -- | Shmulik Hen | | Israel Design Center (Jerusalem) | | LAN Access Division | | Intel Communications Group, Intel corp. 
| | | | Anti-Spam: shmulik dot hen at intel dot com | From hshmulik@intel.com Thu Mar 20 07:16:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 07:16:21 -0800 (PST) Received: from caduceus.fm.intel.com (fmr02.intel.com [192.55.52.25]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KFGEq9025187 for ; Thu, 20 Mar 2003 07:16:14 -0800 Received: from talaria.fm.intel.com (talaria.fm.intel.com [10.1.192.39]) by caduceus.fm.intel.com (8.11.6/8.11.6/d: outer.mc,v 1.51 2002/09/23 20:43:23 dmccart Exp $) with ESMTP id h2KF9b605562 for ; Thu, 20 Mar 2003 15:09:37 GMT Received: from fmsmsxv040-1.fm.intel.com (fmsmsxvs040.fm.intel.com [132.233.42.124]) by talaria.fm.intel.com (8.11.6/8.11.6/d: inner.mc,v 1.28 2003/01/13 19:44:39 dmccart Exp $) with SMTP id h2KFHUc08232 for ; Thu, 20 Mar 2003 15:17:30 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxv040-1.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003032007161716904 ; Thu, 20 Mar 2003 07:16:19 -0800 Date: Thu, 20 Mar 2003 17:16:01 +0200 (IST) From: Shmulik Hen X-X-Sender: hshmulik@jrslxjul4.npdj.intel.com To: Bonding Developement list , Bonding Announce list , Linux Net Mailing list , Linux Kernel Mailing list , Oss SGI Netdev list , Jeff Garzik Subject: [patch] (3/8) Add 802.3ad support to bonding Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1991 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hshmulik@intel.com Precedence: bulk X-list: netdev Content-Length: 3378 Lines: 103 This patch fixes a hang when enslaving a new slave while incoming traffic is running, that looks like a deadlock between the BR_NETPROTO_LOCK, dev->xmit_lock and the bond lock (happens on quad processor machines, but KDB back trace wasn't clear enough). This patch is against bonding 2.4.20-20030317. 
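The locking rule behind this fix (keep netdev_set_master() outside the bond lock, and hold the lock only around the slave-list update) can be modelled in a small single-threaded sketch. Everything here is an assumption made for illustration: the integer 'locked' flag stands in for bond->lock, set_master_stub() stands in for netdev_set_master(), and bond_enslave_sketch() is not the driver function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct slave {
    struct slave *next;
    int id;
};

/* Hypothetical model of the bond; 'locked' stands in for bond->lock. */
struct bond {
    int locked;
    struct slave *slaves;
    int slave_cnt;
};

static void bond_lock(struct bond *b)
{
    assert(!b->locked);
    b->locked = 1;
}

static void bond_unlock(struct bond *b)
{
    assert(b->locked);
    b->locked = 0;
}

/* Stand-in for netdev_set_master(): it may need other locks, so in this
 * model it fails if the caller still holds the bond lock. */
static int set_master_stub(const struct bond *b, int slave_id)
{
    (void)slave_id;
    return b->locked ? -1 : 0;
}

/* Ordering used by the fix: allocate and call out with the lock NOT held,
 * then take the lock only for the list manipulation. */
int bond_enslave_sketch(struct bond *bond, int slave_id)
{
    struct slave *s;
    int err;

    s = calloc(1, sizeof(*s));
    if (s == NULL)
        return -1;
    s->id = slave_id;

    err = set_master_stub(bond, slave_id);  /* lock not held here */
    if (err)
        goto err_free;

    bond_lock(bond);
    s->next = bond->slaves;                 /* attach under the lock */
    bond->slaves = s;
    bond->slave_cnt++;
    bond_unlock(bond);
    return 0;

err_free:
    free(s);
    return err;
}
```

The single error label mirrors the err_free path the patch introduces: since no lock is held on the early exits, cleanup reduces to freeing the new slave.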
diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding.c linux-2.4.20-bonding-20030317-devel/drivers/net/bonding.c --- linux-2.4.20-bonding-20030317/drivers/net/bonding.c 2003-03-18 17:03:25.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding.c 2003-03-18 17:03:26.000000000 +0200 @@ -295,6 +295,10 @@ * - Fixed hang in bond_release() while traffic is running. * netdev_set_master() must not be called from within the bond lock. * + * 2003/03/18 - Tsippy Mendelson and + * Shmulik Hen + * - Fixed hang in bond_enslave(): netdev_set_master() must not be + * called from within the bond lock while traffic is running. */ #include @@ -1066,14 +1070,12 @@ static int bond_enslave(struct net_devic "Warning : no link monitoring support for %s\n", slave_dev->name); } - write_lock_irqsave(&bond->lock, flags); /* not running. */ if ((slave_dev->flags & IFF_UP) != IFF_UP) { #ifdef BONDING_DEBUG printk(KERN_CRIT "Error, slave_dev is not running\n"); #endif - write_unlock_irqrestore(&bond->lock, flags); return -EINVAL; } @@ -1082,12 +1084,10 @@ static int bond_enslave(struct net_devic #ifdef BONDING_DEBUG printk(KERN_CRIT "Error, Device was already enslaved\n"); #endif - write_unlock_irqrestore(&bond->lock, flags); return -EBUSY; } if ((new_slave = kmalloc(sizeof(slave_t), GFP_ATOMIC)) == NULL) { - write_unlock_irqrestore(&bond->lock, flags); return -ENOMEM; } memset(new_slave, 0, sizeof(slave_t)); @@ -1100,9 +1100,7 @@ static int bond_enslave(struct net_devic #ifdef BONDING_DEBUG printk(KERN_CRIT "Error %d calling netdev_set_master\n", err); #endif - kfree(new_slave); - write_unlock_irqrestore(&bond->lock, flags); - return err; + goto err_free; } new_slave->dev = slave_dev; @@ -1121,6 +1119,8 @@ static int bond_enslave(struct net_devic dev_mc_add (slave_dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); } + write_lock_irqsave(&bond->lock, flags); + bond_attach_slave(bond, new_slave); new_slave->delay = 0; new_slave->link_failure_count = 0; @@ -1259,7 +1259,11 @@ 
static int bond_enslave(struct net_devic new_slave->state == BOND_STATE_ACTIVE ? "n active" : " backup", new_slave->link == BOND_LINK_UP ? "n up" : " down"); + //enslave is successfull return 0; +err_free: + kfree(new_slave); + return err; } /* @@ -1607,6 +1611,9 @@ static int bond_release_all(struct net_d kfree(our_slave); + /* Can be safely called from inside the bond lock + since traffic and timers have already stopped + */ netdev_set_master(slave_dev, NULL); /* only restore its RUNNING flag if monitoring set it down */ -- | Shmulik Hen | | Israel Design Center (Jerusalem) | | LAN Access Division | | Intel Communications Group, Intel corp. | | | | Anti-Spam: shmulik dot hen at intel dot com | From hshmulik@intel.com Thu Mar 20 07:16:34 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 07:16:37 -0800 (PST) Received: from hermes.fm.intel.com (fmr01.intel.com [192.55.52.18]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KFGXq9025338 for ; Thu, 20 Mar 2003 07:16:33 -0800 Received: from talaria.fm.intel.com (talaria.fm.intel.com [10.1.192.39]) by hermes.fm.intel.com (8.11.6/8.11.6/d: outer.mc,v 1.51 2002/09/23 20:43:23 dmccart Exp $) with ESMTP id h2KFD2f09202 for ; Thu, 20 Mar 2003 15:13:03 GMT Received: from fmsmsxv040-1.fm.intel.com (fmsmsxvs040.fm.intel.com [132.233.42.124]) by talaria.fm.intel.com (8.11.6/8.11.6/d: inner.mc,v 1.28 2003/01/13 19:44:39 dmccart Exp $) with SMTP id h2KFHsc08500 for ; Thu, 20 Mar 2003 15:17:54 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxv040-1.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003032007164119211 ; Thu, 20 Mar 2003 07:16:43 -0800 Date: Thu, 20 Mar 2003 17:16:25 +0200 (IST) From: Shmulik Hen X-X-Sender: hshmulik@jrslxjul4.npdj.intel.com To: Bonding Developement list , Bonding Announce list , Linux Net Mailing list , Linux Kernel Mailing list , Oss SGI Netdev list , Jeff Garzik Subject: [patch] (4/8) Add 802.3ad support to bonding Message-ID: MIME-Version: 1.0 Content-Type: 
TEXT/PLAIN; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1992 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hshmulik@intel.com Precedence: bulk X-list: netdev Content-Length: 4132 Lines: 136 This patch adds support for getting slave's speed and duplex via ethtool (Needed for 802.3ad and other future modes). This patch is against bonding 2.4.20-20030317. diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding.c linux-2.4.20-bonding-20030317-devel/drivers/net/bonding.c --- linux-2.4.20-bonding-20030317/drivers/net/bonding.c 2003-03-18 17:03:26.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding.c 2003-03-18 17:03:27.000000000 +0200 @@ -299,6 +299,10 @@ * Shmulik Hen * - Fixed hang in bond_enslave(): netdev_set_master() must not be * called from within the bond lock while traffic is running. + * + * 2003/03/18 - Amir Noam + * - Added support for getting slave's speed and duplex via ethtool. + * Needed for 802.3ad and other future modes. */ #include @@ -649,6 +653,59 @@ bond_attach_slave(struct bonding *bond, set_fs(fs); \ ret; }) +/* + * Get link speed and duplex from the slave's base driver + * using ethtool. If for some reason the call fails or the + * values are invalid, fake speed and duplex to 100/Full + * and return error. 
+ */ +static int bond_update_speed_duplex(struct slave *slave) +{ + struct net_device *dev = slave->dev; + static int (* ioctl)(struct net_device *, struct ifreq *, int); + struct ifreq ifr; + struct ethtool_cmd etool; + + ioctl = dev->do_ioctl; + if (ioctl) { + etool.cmd = ETHTOOL_GSET; + ifr.ifr_data = (char*)&etool; + if (IOCTL(dev, &ifr, SIOCETHTOOL) == 0) { + slave->speed = etool.speed; + slave->duplex = etool.duplex; + } else { + goto err_out; + } + } else { + goto err_out; + } + + switch (slave->speed) { + case SPEED_10: + case SPEED_100: + case SPEED_1000: + break; + default: + goto err_out; + } + + switch (slave->duplex) { + case DUPLEX_FULL: + case DUPLEX_HALF: + break; + default: + goto err_out; + } + + return 0; + +err_out: + //Fake speed and duplex + slave->speed = SPEED_100; + slave->duplex = DUPLEX_FULL; + return -1; +} + /* * if supports MII link status reporting, check its link status. * @@ -1173,6 +1230,13 @@ static int bond_enslave(struct net_devic new_slave->link = BOND_LINK_DOWN; } + if (bond_update_speed_duplex(new_slave) && (new_slave->link == BOND_LINK_UP) ) { + printk(KERN_WARNING + "bond_enslave(): failed to get speed/duplex from %s, " + "speed forced to 100Mbps, duplex forced to Full.\n", + new_slave->dev->name); + } + /* if we're in active-backup mode, we need one and only one active * interface. 
The backup interfaces will have their NOARP flag set * because we need them to be completely deaf and not to respond to @@ -1821,6 +1885,9 @@ static void bond_mii_monitor(struct net_ } break; } /* end of switch */ + + bond_update_speed_duplex(slave); + } /* end of while */ /* diff -Nuarp linux-2.4.20-bonding-20030317/include/linux/if_bonding.h linux-2.4.20-bonding-20030317-devel/include/linux/if_bonding.h --- linux-2.4.20-bonding-20030317/include/linux/if_bonding.h 2003-03-18 17:03:26.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/include/linux/if_bonding.h 2003-03-18 17:03:27.000000000 +0200 @@ -11,6 +11,9 @@ * This software may be used and distributed according to the terms * of the GNU Public License, incorporated herein by reference. * + * 2003/03/18 - Amir Noam + * - Added support for getting slave's speed and duplex via ethtool. + * Needed for 802.3ad and other future modes. */ #ifndef _LINUX_IF_BONDING_H @@ -89,6 +92,8 @@ typedef struct slave { char state; /* one of BOND_STATE_XXXX */ unsigned short original_flags; u32 link_failure_count; + u16 speed; + u8 duplex; } slave_t; /* -- | Shmulik Hen | | Israel Design Center (Jerusalem) | | LAN Access Division | | Intel Communications Group, Intel corp. 
| | | | Anti-Spam: shmulik dot hen at intel dot com | From hshmulik@intel.com Thu Mar 20 07:17:22 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 07:17:26 -0800 (PST) Received: from caduceus.fm.intel.com (fmr02.intel.com [192.55.52.25]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KFHIq9026084 for ; Thu, 20 Mar 2003 07:17:19 -0800 Received: from talaria.fm.intel.com (talaria.fm.intel.com [10.1.192.39]) by caduceus.fm.intel.com (8.11.6/8.11.6/d: outer.mc,v 1.51 2002/09/23 20:43:23 dmccart Exp $) with ESMTP id h2KFAc606657 for ; Thu, 20 Mar 2003 15:10:39 GMT Received: from fmsmsxv040-1.fm.intel.com (fmsmsxvs040.fm.intel.com [132.233.42.124]) by talaria.fm.intel.com (8.11.6/8.11.6/d: inner.mc,v 1.28 2003/01/13 19:44:39 dmccart Exp $) with SMTP id h2KFIac08996 for ; Thu, 20 Mar 2003 15:18:36 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxv040-1.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003032007172214967 ; Thu, 20 Mar 2003 07:17:24 -0800 Date: Thu, 20 Mar 2003 17:17:07 +0200 (IST) From: Shmulik Hen X-X-Sender: hshmulik@jrslxjul4.npdj.intel.com To: Bonding Developement list , Bonding Announce list , Linux Net Mailing list , Linux Kernel Mailing list , Oss SGI Netdev list , Jeff Garzik Subject: [patch] (5/8) Add 802.3ad support to bonding Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1993 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hshmulik@intel.com Precedence: bulk X-list: netdev Content-Length: 17875 Lines: 511 This patch enables support of modes that need to use the unique mac address of each slave. It moves setting the slave's mac address and opening it from the application to the driver. This breaks backward compatibility between the new driver and older applications ! 
It also blocks the possibility of enslaving before the master is up (to prevent putting the system in an unstable state), and removes the code that unconditionally restores all of the base driver's flags (flags are automatically restored once all undo stages are done in the proper order). This patch is against bonding 2.4.20-20030317. diff -Nuarp linux-2.4.20-bonding-20030317/Documentation/networking/ifenslave.c linux-2.4.20-bonding-20030317-devel/Documentation/networking/ifenslave.c --- linux-2.4.20-bonding-20030317/Documentation/networking/ifenslave.c 2003-03-18 17:03:28.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/Documentation/networking/ifenslave.c 2003-03-18 17:03:28.000000000 +0200 @@ -51,6 +51,15 @@ * multiple interfaces are specified on a single ifenslave command * (ifenslave bond0 eth0 eth1). * + * - 2003/03/18 - Tsippy Mendelson and + * Shmulik Hen + * - Moved setting the slave's mac address and openning it, from + * the application to the driver. This enables support of modes + * that need to use the unique mac address of each slave. + * The driver also takes care of closing the slave and restoring its + * original mac address upon release. + * In addition, block possibility of enslaving before the master is up. + * This prevents putting the system in an undefined state. */ static char *version = @@ -278,30 +287,11 @@ main(int argc, char **argv) fprintf(stderr, "SIOCBONDRELEASE: cannot detach %s from %s. 
errno=%s.\n", slave_ifname, master_ifname, strerror(errno)); } - else { /* we'll set the interface down to avoid any conflicts due to - same IP/MAC */ - strncpy(ifr2.ifr_name, slave_ifname, IFNAMSIZ); - if (ioctl(skfd, SIOCGIFFLAGS, &ifr2) < 0) { - int saved_errno = errno; - fprintf(stderr, "SIOCGIFFLAGS on %s failed: %s\n", slave_ifname, - strerror(saved_errno)); - } - else { - ifr2.ifr_flags &= ~(IFF_UP | IFF_RUNNING); - if (ioctl(skfd, SIOCSIFFLAGS, &ifr2) < 0) { - int saved_errno = errno; - fprintf(stderr, "Shutting down interface %s failed: %s\n", - slave_ifname, strerror(saved_errno)); - } - } - } + /* the bonding module takes care of restoring the slaves original + * mac address and closing its net device + */ } else { /* attach a slave interface to the master */ - /* two possibilities : - - if hwaddr_notset, do nothing. The bond will assign the - hwaddr from it's first slave. - - if !hwaddr_notset, assign the master's hwaddr to each slave - */ strncpy(ifr2.ifr_name, slave_ifname, IFNAMSIZ); if (ioctl(skfd, SIOCGIFFLAGS, &ifr2) < 0) { @@ -311,6 +301,7 @@ main(int argc, char **argv) return 1; } + /* if hwaddr_notset, assign the slave hw address to the master */ if (hwaddr_notset) { /* assign the slave hw address to the * master since it currently does not @@ -341,6 +332,10 @@ main(int argc, char **argv) */ master_up = 1; } + } else { + fprintf(stderr, "Cannot enslave; the specified master interface '%s' is not up.\n", master_ifname); + + exit (1); } if (!goterr) { @@ -389,41 +384,10 @@ main(int argc, char **argv) } } - } else { - /* we'll assign master's hwaddr to this slave */ - if (ifr2.ifr_flags & IFF_UP) { - ifr2.ifr_flags &= ~IFF_UP; - if (ioctl(skfd, SIOCSIFFLAGS, &ifr2) < 0) { - int saved_errno = errno; - fprintf(stderr, "Shutting down interface %s failed: %s\n", - slave_ifname, strerror(saved_errno)); - } - } - - strncpy(if_hwaddr.ifr_name, slave_ifname, IFNAMSIZ); - if (ioctl(skfd, SIOCSIFHWADDR, &if_hwaddr) < 0) { - int saved_errno = errno; - 
fprintf(stderr, "SIOCSIFHWADDR on %s failed: %s\n", if_hwaddr.ifr_name, - strerror(saved_errno)); - if (saved_errno == EBUSY) - fprintf(stderr, " The slave device %s is busy: it must be" - " idle before running this command.\n", slave_ifname); - else if (saved_errno == EOPNOTSUPP) - fprintf(stderr, " The slave device you specified does not support" - " setting the MAC address.\n Your kernel likely does not" - " support slave devices.\n"); - else if (saved_errno == EINVAL) - fprintf(stderr, " The slave device's address type does not match" - " the master's address type.\n"); - } else { - if (verbose) { - unsigned char *hwaddr = if_hwaddr.ifr_hwaddr.sa_data; - printf("Slave's (%s) hardware address set to " - "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n", slave_ifname, - hwaddr[0], hwaddr[1], hwaddr[2], hwaddr[3], hwaddr[4], hwaddr[5]); - } - } } + /* the bonding module takes care of setting the slave's mac address + * according to the mode requirements. + */ if (*spp && !strcmp(*spp, "metric")) { if (*++spp == NULL) { @@ -500,18 +464,18 @@ main(int argc, char **argv) } } - ifr2.ifr_flags |= IFF_UP; /* the interface will need to be up to be bonded */ - if ((ifr2.ifr_flags &= ~(IFF_SLAVE | IFF_MASTER)) == 0 - || strncpy(ifr2.ifr_name, slave_ifname, IFNAMSIZ) <= 0 - || ioctl(skfd, SIOCSIFFLAGS, &ifr2) < 0) { - fprintf(stderr, - "Something broke setting the slave (%s) flags: %s.\n", - slave_ifname, strerror(errno)); - } else { - if (verbose) - printf("Set the slave's (%s) flags %4.4x.\n", slave_ifname, if_flags.ifr_flags); + /* the bonding module takes care of openning the interface + * after setting its mac address + */ + if (ifr2.ifr_flags & IFF_UP) { // the interface will need to be down + ifr2.ifr_flags &= ~IFF_UP; + if (ioctl(skfd, SIOCSIFFLAGS, &ifr2) < 0) { + int saved_errno = errno; + fprintf(stderr, "Shutting down interface %s failed: %s\n", + slave_ifname, strerror(saved_errno)); + } } - + /* Do the real thing */ if ( ! 
opt_r) { strncpy(if_flags.ifr_name, master_ifname, IFNAMSIZ); diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding.c linux-2.4.20-bonding-20030317-devel/drivers/net/bonding.c --- linux-2.4.20-bonding-20030317/drivers/net/bonding.c 2003-03-18 17:03:28.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding.c 2003-03-18 17:03:28.000000000 +0200 @@ -303,6 +303,22 @@ * 2003/03/18 - Amir Noam * - Added support for getting slave's speed and duplex via ethtool. * Needed for 802.3ad and other future modes. + * + * 2003/03/18 - Tsippy Mendelson and + * Shmulik Hen + * - Enable support of modes that need to use the unique mac address of + * each slave. + * * bond_enslave(): Moved setting the slave's mac address, and + * openning it, from the application to the driver. This breaks + * backward comaptibility with old versions of ifenslave that open + * the slave before enalsving it !!!. + * * bond_release(): The driver also takes care of closing the slave + * and restoring its original mac address. + * - Removed the code that restores all base driver's flags. + * Flags are automatically restored once all undo stages are done + * properly. + * - Block possibility of enslaving before the master is up. This + * prevents putting the system in an unstable state. 
*/ #include @@ -433,7 +449,6 @@ static void bond_mii_monitor(struct net_ static void loadbalance_arp_monitor(struct net_device *dev); static void activebackup_arp_monitor(struct net_device *dev); static int bond_event(struct notifier_block *this, unsigned long event, void *ptr); -static void bond_restore_slave_flags(slave_t *slave); static void bond_mc_list_destroy(struct bonding *bond); static void bond_mc_add(bonding_t *bond, void *addr, int alen); static void bond_mc_delete(bonding_t *bond, void *addr, int alen); @@ -509,11 +524,6 @@ multicast_mode_name(void) } } -static void bond_restore_slave_flags(slave_t *slave) -{ - slave->dev->flags = slave->original_flags; -} - static void bond_set_slave_inactive_flags(slave_t *slave) { slave->state = BOND_STATE_BACKUP; @@ -1110,12 +1120,12 @@ static int bond_enslave(struct net_devic slave_t *new_slave = NULL; unsigned long flags = 0; unsigned long rflags = 0; - int ndx = 0; int err = 0; struct dev_mc_list *dmi; struct in_ifaddr **ifap; struct in_ifaddr *ifa; int link_reporting; + struct sockaddr addr; if (master_dev == NULL || slave_dev == NULL) { return -ENODEV; @@ -1128,12 +1138,14 @@ static int bond_enslave(struct net_devic slave_dev->name); } - /* not running. */ - if ((slave_dev->flags & IFF_UP) != IFF_UP) { + /* This breaks backward comaptibility with old versions + of ifenslave which open the slave before enalsving */ + /* already up. 
*/ + if ((slave_dev->flags & IFF_UP) == IFF_UP) { #ifdef BONDING_DEBUG - printk(KERN_CRIT "Error, slave_dev is not running\n"); + printk(KERN_CRIT "Error, slave_dev is up\n"); #endif - return -EINVAL; + return -EBUSY; } /* already enslaved */ @@ -1144,20 +1156,66 @@ static int bond_enslave(struct net_devic return -EBUSY; } + /* bond must be initialize by bond_open() before enslaving */ + if ((master_dev->flags & IFF_UP) != IFF_UP) { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "Error, master_dev is not up\n"); +#endif + return -EPERM; + } + + if (slave_dev->set_mac_address == NULL) { + printk(KERN_CRIT " The slave device you specified does not support" + " setting the MAC address.\n Your kernel likely does not" + " support slave devices.\n"); + return -EOPNOTSUPP; + } + if ((new_slave = kmalloc(sizeof(slave_t), GFP_ATOMIC)) == NULL) { return -ENOMEM; } memset(new_slave, 0, sizeof(slave_t)); - /* save flags before call to netdev_set_master */ + /* save slave's original flags before calling */ + /* netdev_set_master and dev_open */ new_slave->original_flags = slave_dev->flags; + + /* save slave's original ("permanent") mac address for + modes that needs it, and for restoring it upon release, + and then set it to the master's address */ + memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN); + + if (bond->next != (slave_t*)bond) { + /* set slave to master's mac address + The application already set the master's + mac address to that of the first slave */ + memcpy(addr.sa_data, master_dev->dev_addr, ETH_ALEN); + addr.sa_family = slave_dev->type; + err = slave_dev->set_mac_address(slave_dev, &addr); + if (err) { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "Error %d calling set_mac_address\n", err); +#endif + goto err_free; + } + } + + /* open the slave since the application closed it */ + err = dev_open(slave_dev); + if (err) { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "Openning slave %s failed\n", slave_dev->name); +#endif + goto err_restore_mac; + } + err = 
netdev_set_master(slave_dev, master_dev); if (err) { #ifdef BONDING_DEBUG printk(KERN_CRIT "Error %d calling netdev_set_master\n", err); #endif - goto err_free; + goto err_close; } new_slave->dev = slave_dev; @@ -1285,39 +1343,6 @@ static int bond_enslave(struct net_devic write_unlock_irqrestore(&bond->lock, flags); - /* - * !!! This is to support old versions of ifenslave. We can remove - * this in 2.5 because our ifenslave takes care of this for us. - * We check to see if the master has a mac address yet. If not, - * we'll give it the mac address of our slave device. - */ - for (ndx = 0; ndx < slave_dev->addr_len; ndx++) { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "Checking ndx=%d of master_dev->dev_addr\n", - ndx); -#endif - if (master_dev->dev_addr[ndx] != 0) { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "Found non-zero byte at ndx=%d\n", - ndx); -#endif - break; - } - } - if (ndx == slave_dev->addr_len) { - /* - * We got all the way through the address and it was - * all 0's. - */ -#ifdef BONDING_DEBUG - printk(KERN_CRIT "%s doesn't have a MAC address yet. ", - master_dev->name); - printk(KERN_CRIT "Going to give assign it from %s.\n", - slave_dev->name); -#endif - bond_sethwaddr(master_dev, slave_dev); - } - printk (KERN_INFO "%s: enslaving %s as a%s interface with a%s link.\n", master_dev->name, slave_dev->name, new_slave->state == BOND_STATE_ACTIVE ? 
"n active" : " backup", @@ -1325,6 +1350,16 @@ static int bond_enslave(struct net_devic //enslave is successfull return 0; + +// Undo stages on error +err_close: + dev_close(slave_dev); + +err_restore_mac: + memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + slave_dev->set_mac_address(slave_dev, &addr); + err_free: kfree(new_slave); return err; @@ -1539,6 +1574,7 @@ static int bond_release(struct net_devic bonding_t *bond; slave_t *our_slave, *old_current; unsigned long flags; + struct sockaddr addr; if (master == NULL || slave == NULL) { return -ENODEV; @@ -1612,21 +1648,29 @@ static int bond_release(struct net_devic netdev_set_master(slave, NULL); - /* only restore its RUNNING flag if monitoring set it down */ - if (slave->flags & IFF_UP) { - slave->flags |= IFF_RUNNING; - } + /* close slave before restoring its mac address */ + dev_close(slave); - if (slave->flags & IFF_NOARP || - bond->current_slave != NULL) { - dev_close(slave); - our_slave->original_flags &= ~IFF_UP; + /* restore original ("permanent") mac address*/ + memcpy(addr.sa_data, our_slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave->type; + slave->set_mac_address(slave, &addr); + + /* restore the original state of the IFF_NOARP flag that might have */ + /* been set by bond_set_slave_inactive_flags() */ + if ((our_slave->original_flags & IFF_NOARP) == 0) { + slave->flags &= ~IFF_NOARP; } - bond_restore_slave_flags(our_slave); - kfree(our_slave); + /* if the last slave was removed, zero the mac address + of the master so it will be set by the application + to the mac address of the first slave */ + if (bond->next == (slave_t*)bond) { + memset(master->dev_addr, 0, master->addr_len); + } + return 0; /* deletion OK */ } @@ -1639,6 +1683,7 @@ static int bond_release_all(struct net_d bonding_t *bond; slave_t *our_slave; struct net_device *slave_dev; + struct sockaddr addr; if (master == NULL) { return -ENODEV; @@ -1673,21 +1718,33 @@ static int 
bond_release_all(struct net_d dev_set_allmulti(slave_dev, -1); } - kfree(our_slave); - /* Can be safely called from inside the bond lock since traffic and timers have already stopped */ netdev_set_master(slave_dev, NULL); - /* only restore its RUNNING flag if monitoring set it down */ - if (slave_dev->flags & IFF_UP) - slave_dev->flags |= IFF_RUNNING; + /* close slave before restoring its mac address */ + dev_close(slave_dev); + + /* restore original ("permanent") mac address*/ + memcpy(addr.sa_data, our_slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + slave_dev->set_mac_address(slave_dev, &addr); + + /* restore the original state of the IFF_NOARP flag that might have */ + /* been set by bond_set_slave_inactive_flags() */ + if ((our_slave->original_flags & IFF_NOARP) == 0) { + slave_dev->flags &= ~IFF_NOARP; + } - if (slave_dev->flags & IFF_NOARP) - dev_close(slave_dev); + kfree(our_slave); } + /* zero the mac address of the master so it will be + set by the application to the mac address of the + first slave */ + memset(master->dev_addr, 0, master->addr_len); + printk (KERN_INFO "%s: released all slaves\n", master->name); return 0; @@ -2904,6 +2961,15 @@ static int bond_get_info(char *buf, char "up\n" : "down\n"); len += sprintf(buf + len, "Link Failure Count: %d\n", slave->link_failure_count); + + len += sprintf(buf + len, + "Permanent HW addr: %02x:%02x:%02x:%02x:%02x:%02x\n", + slave->perm_hwaddr[0], + slave->perm_hwaddr[1], + slave->perm_hwaddr[2], + slave->perm_hwaddr[3], + slave->perm_hwaddr[4], + slave->perm_hwaddr[5]); } read_unlock_irqrestore(&bond->lock, flags); diff -Nuarp linux-2.4.20-bonding-20030317/include/linux/if_bonding.h linux-2.4.20-bonding-20030317-devel/include/linux/if_bonding.h --- linux-2.4.20-bonding-20030317/include/linux/if_bonding.h 2003-03-18 17:03:28.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/include/linux/if_bonding.h 2003-03-18 17:03:28.000000000 +0200 @@ -14,6 +14,11 @@ * 2003/03/18 - Amir Noam * - 
Added support for getting slave's speed and duplex via ethtool. * Needed for 802.3ad and other future modes. + * + * 2003/03/18 - Tsippy Mendelson and + * Shmulik Hen + * - Enable support of modes that need to use the unique mac address of + * each slave. */ #ifndef _LINUX_IF_BONDING_H @@ -94,6 +99,7 @@ typedef struct slave { u32 link_failure_count; u16 speed; u8 duplex; + u8 perm_hwaddr[ETH_ALEN]; } slave_t; /* -- | Shmulik Hen | | Israel Design Center (Jerusalem) | | LAN Access Division | | Intel Communications Group, Intel corp. | | | | Anti-Spam: shmulik dot hen at intel dot com | From hshmulik@intel.com Thu Mar 20 07:18:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 07:18:07 -0800 (PST) Received: from hermes.fm.intel.com (fmr01.intel.com [192.55.52.18]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KFI0q9026416 for ; Thu, 20 Mar 2003 07:18:00 -0800 Received: from talaria.fm.intel.com (talaria.fm.intel.com [10.1.192.39]) by hermes.fm.intel.com (8.11.6/8.11.6/d: outer.mc,v 1.51 2002/09/23 20:43:23 dmccart Exp $) with ESMTP id h2KFETf10650 for ; Thu, 20 Mar 2003 15:14:30 GMT Received: from fmsmsxv040-1.fm.intel.com (fmsmsxvs040.fm.intel.com [132.233.42.124]) by talaria.fm.intel.com (8.11.6/8.11.6/d: inner.mc,v 1.28 2003/01/13 19:44:39 dmccart Exp $) with SMTP id h2KFJFc09412 for ; Thu, 20 Mar 2003 15:19:15 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxv040-1.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003032007180220753 ; Thu, 20 Mar 2003 07:18:04 -0800 Date: Thu, 20 Mar 2003 17:17:46 +0200 (IST) From: Shmulik Hen X-X-Sender: hshmulik@jrslxjul4.npdj.intel.com To: Bonding Developement list , Bonding Announce list , Linux Net Mailing list , Linux Kernel Mailing list , Oss SGI Netdev list , Jeff Garzik Subject: [patch] (7/8) Add 802.3ad support to bonding Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1994 
X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hshmulik@intel.com Precedence: bulk X-list: netdev Content-Length: 6313 Lines: 206 This patch moves the driver's private data types from include/linux/if_bonding.h to the local bonding.h. This patch is against bonding 2.4.20-20030317. diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding/bonding.h linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bonding.h --- linux-2.4.20-bonding-20030317/drivers/net/bonding/bonding.h 1970-01-01 02:00:00.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bonding.h 2003-03-18 17:03:31.000000000 +0200 @@ -0,0 +1,68 @@ +/* + * Bond several ethernet interfaces into a Cisco, running 'Etherchannel'. + * + * Portions are (c) Copyright 1995 Simon "Guru Aleph-Null" Janes + * NCM: Network and Communications Management, Inc. + * + * BUT, I'm the one who modified it for ethernet, so: + * (c) Copyright 1999, Thomas Davis, tadavis@lbl.gov + * + * This software may be used and distributed according to the terms + * of the GNU Public License, incorporated herein by reference. + * + */ + +#ifndef _LINUX_BONDING_H +#define _LINUX_BONDING_H + +#include +#include + +typedef struct slave { + struct slave *next; + struct slave *prev; + struct net_device *dev; + short delay; + unsigned long jiffies; + char link; /* one of BOND_LINK_XXXX */ + char state; /* one of BOND_STATE_XXXX */ + unsigned short original_flags; + u32 link_failure_count; + u16 speed; + u8 duplex; + u8 perm_hwaddr[ETH_ALEN]; +} slave_t; + +/* + * Here are the locking policies for the two bonding locks: + * + * 1) Get bond->lock when reading/writing slave list. + * 2) Get bond->ptrlock when reading/writing bond->current_slave. + * (It is unnecessary when the write-lock is put with bond->lock.) + * 3) When we lock with bond->ptrlock, we must lock with bond->lock + * beforehand. 
+ */ +typedef struct bonding { + slave_t *next; + slave_t *prev; + slave_t *current_slave; + slave_t *primary_slave; + slave_t *current_arp_slave; + __s32 slave_cnt; + rwlock_t lock; + rwlock_t ptrlock; + struct timer_list mii_timer; + struct timer_list arp_timer; + struct net_device_stats *stats; +#ifdef CONFIG_PROC_FS + struct proc_dir_entry *bond_proc_dir; + struct proc_dir_entry *bond_proc_info_file; +#endif /* CONFIG_PROC_FS */ + struct bonding *next_bond; + struct net_device *device; + struct dev_mc_list *mc_list; + unsigned short flags; +} bonding_t; + +#endif /* _LINUX_BONDING_H */ + diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding/bond_main.c linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bond_main.c --- linux-2.4.20-bonding-20030317/drivers/net/bonding/bond_main.c 2003-03-18 17:03:30.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bond_main.c 2003-03-18 17:03:31.000000000 +0200 @@ -358,6 +358,7 @@ #include #include #include +#include "bonding.h" #define DRV_VERSION "2.4.20-20030317" #define DRV_RELDATE "March 17, 2003" @@ -380,6 +381,11 @@ DRV_NAME ".c:v" DRV_VERSION " (" DRV_REL #define MAX_ARP_IP_TARGETS 16 #endif +struct bond_parm_tbl { + char *modename; + int mode; +}; + static int arp_interval = BOND_LINK_ARP_INTERV; static char *arp_ip_target[MAX_ARP_IP_TARGETS] = { NULL, }; static unsigned long arp_target[MAX_ARP_IP_TARGETS] = { 0, } ; diff -Nuarp linux-2.4.20-bonding-20030317/include/linux/if_bonding.h linux-2.4.20-bonding-20030317-devel/include/linux/if_bonding.h --- linux-2.4.20-bonding-20030317/include/linux/if_bonding.h 2003-03-18 17:03:30.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/include/linux/if_bonding.h 2003-03-18 17:03:31.000000000 +0200 @@ -19,18 +19,18 @@ * Shmulik Hen * - Enable support of modes that need to use the unique mac address of * each slave. 
+ * + * 2003/03/18 - Tsippy Mendelson and + * Amir Noam + * - Moved driver's private data types to bonding.h */ #ifndef _LINUX_IF_BONDING_H #define _LINUX_IF_BONDING_H -#ifdef __KERNEL__ -#include #include -#include -#endif /* __KERNEL__ */ - #include +#include /* * We can remove these ioctl definitions in 2.5. People should use the @@ -66,11 +66,6 @@ #define BOND_MULTICAST_ACTIVE 1 #define BOND_MULTICAST_ALL 2 -struct bond_parm_tbl { - char *modename; - int mode; -}; - typedef struct ifbond { __s32 bond_mode; __s32 num_slaves; @@ -86,55 +81,7 @@ typedef struct ifslave __u32 link_failure_count; } ifslave; -#ifdef __KERNEL__ -typedef struct slave { - struct slave *next; - struct slave *prev; - struct net_device *dev; - short delay; - unsigned long jiffies; - char link; /* one of BOND_LINK_XXXX */ - char state; /* one of BOND_STATE_XXXX */ - unsigned short original_flags; - u32 link_failure_count; - u16 speed; - u8 duplex; - u8 perm_hwaddr[ETH_ALEN]; -} slave_t; - -/* - * Here are the locking policies for the two bonding locks: - * - * 1) Get bond->lock when reading/writing slave list. - * 2) Get bond->ptrlock when reading/writing bond->current_slave. - * (It is unnecessary when the write-lock is put with bond->lock.) - * 3) When we lock with bond->ptrlock, we must lock with bond->lock - * beforehand. 
- */ -typedef struct bonding { - slave_t *next; - slave_t *prev; - slave_t *current_slave; - slave_t *primary_slave; - slave_t *current_arp_slave; - __s32 slave_cnt; - rwlock_t lock; - rwlock_t ptrlock; - struct timer_list mii_timer; - struct timer_list arp_timer; - struct net_device_stats *stats; -#ifdef CONFIG_PROC_FS - struct proc_dir_entry *bond_proc_dir; - struct proc_dir_entry *bond_proc_info_file; -#endif /* CONFIG_PROC_FS */ - struct bonding *next_bond; - struct net_device *device; - struct dev_mc_list *mc_list; - unsigned short flags; -} bonding_t; -#endif /* __KERNEL__ */ - -#endif /* _LINUX_BOND_H */ +#endif /* _LINUX_IF_BONDING_H */ /* * Local variables: -- | Shmulik Hen | | Israel Design Center (Jerusalem) | | LAN Access Division | | Intel Communications Group, Intel corp. | | | | Anti-Spam: shmulik dot hen at intel dot com | From hshmulik@intel.com Thu Mar 20 07:18:21 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 07:18:42 -0800 (PST) Received: from caduceus.fm.intel.com (fmr02.intel.com [192.55.52.25]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KFHqq9026366 for ; Thu, 20 Mar 2003 07:17:53 -0800 Received: from talaria.fm.intel.com (talaria.fm.intel.com [10.1.192.39]) by caduceus.fm.intel.com (8.11.6/8.11.6/d: outer.mc,v 1.51 2002/09/23 20:43:23 dmccart Exp $) with ESMTP id h2KFB4607130 for ; Thu, 20 Mar 2003 15:11:08 GMT Received: from fmsmsxv040-1.fm.intel.com (fmsmsxvs040.fm.intel.com [132.233.42.124]) by talaria.fm.intel.com (8.11.6/8.11.6/d: inner.mc,v 1.28 2003/01/13 19:44:39 dmccart Exp $) with SMTP id h2KFIvc09227 for ; Thu, 20 Mar 2003 15:18:57 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxv040-1.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003032007174120925 ; Thu, 20 Mar 2003 07:17:43 -0800 Date: Thu, 20 Mar 2003 17:17:25 +0200 (IST) From: Shmulik Hen X-X-Sender: hshmulik@jrslxjul4.npdj.intel.com To: Bonding Developement list , Bonding Announce list , Linux Net Mailing list , Linux Kernel 
Mailing list , Oss SGI Netdev list , Jeff Garzik Subject: [patch] (6/8) Add 802.3ad support to bonding Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1995 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hshmulik@intel.com Precedence: bulk X-list: netdev Content-Length: 215563 Lines: 6940 This patch enables support for multiple files in the project. It moves bonding.c to a subdirectory of its own (drivers/net/bonding/) and renames it to bond_main.c. This patch is against bonding 2.4.20-20030317. diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding/bond_main.c linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bond_main.c --- linux-2.4.20-bonding-20030317/drivers/net/bonding/bond_main.c 1970-01-01 02:00:00.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bond_main.c 2003-03-18 17:03:29.000000000 +0200 @@ -0,0 +1,3434 @@ +/* + * originally based on the dummy device. + * + * Copyright 1999, Thomas Davis, tadavis@lbl.gov. + * Licensed under the GPL. Based on dummy.c, and eql.c devices. + * + * bonding.c: an Ethernet Bonding driver + * + * This is useful to talk to a Cisco EtherChannel compatible equipment: + * Cisco 5500 + * Sun Trunking (Solaris) + * Alteon AceDirector Trunks + * Linux Bonding + * and probably many L2 switches ... + * + * How it works: + * ifconfig bond0 ipaddress netmask up + * will setup a network device, with an ip address. No mac address + * will be assigned at this time. The hw mac address will come from + * the first slave bonded to the channel. All slaves will then use + * this hw mac address. + * + * ifconfig bond0 down + * will release all slaves, marking them as down. + * + * ifenslave bond0 eth0 + * will attach eth0 to bond0 as a slave. 
eth0 hw mac address will either + * a: be used as initial mac address + * b: if a hw mac address already is there, eth0's hw mac address + * will then be set from bond0. + * + * v0.1 - first working version. + * v0.2 - changed stats to be calculated by summing slaves stats. + * + * Changes: + * Arnaldo Carvalho de Melo + * - fix leaks on failure at bond_init + * + * 2000/09/30 - Willy Tarreau + * - added trivial code to release a slave device. + * - fixed security bug (CAP_NET_ADMIN not checked) + * - implemented MII link monitoring to disable dead links : + * All MII capable slaves are checked every milliseconds + * (100 ms seems good). This value can be changed by passing it to + * insmod. A value of zero disables the monitoring (default). + * - fixed an infinite loop in bond_xmit_roundrobin() when there's no + * good slave. + * - made the code hopefully SMP safe + * + * 2000/10/03 - Willy Tarreau + * - optimized slave lists based on relevant suggestions from Thomas Davis + * - implemented active-backup method to obtain HA with two switches: + * stay as long as possible on the same active interface, while we + * also monitor the backup one (MII link status) because we want to know + * if we are able to switch at any time. ( pass "mode=1" to insmod ) + * - lots of stress testings because we need it to be more robust than the + * wires ! :-> + * + * 2000/10/09 - Willy Tarreau + * - added up and down delays after link state change. + * - optimized the slaves chaining so that when we run forward, we never + * repass through the bond itself, but we can find it by searching + * backwards. Renders the deletion more difficult, but accelerates the + * scan. + * - smarter enslaving and releasing. 
+ * - finer and more robust SMP locking + * + * 2000/10/17 - Willy Tarreau + * - fixed two potential SMP race conditions + * + * 2000/10/18 - Willy Tarreau + * - small fixes to the monitoring FSM in case of zero delays + * 2000/11/01 - Willy Tarreau + * - fixed first slave not automatically used in trunk mode. + * 2000/11/10 : spelling of "EtherChannel" corrected. + * 2000/11/13 : fixed a race condition in case of concurrent accesses to ioctl(). + * 2000/12/16 : fixed improper usage of rtnl_exlock_nowait(). + * + * 2001/1/3 - Chad N. Tindel + * - The bonding driver now simulates MII status monitoring, just like + * a normal network device. It will show that the link is down iff + * every slave in the bond shows that their links are down. If at least + * one slave is up, the bond's MII status will appear as up. + * + * 2001/2/7 - Chad N. Tindel + * - Applications can now query the bond from user space to get + * information which may be useful. They do this by calling + * the BOND_INFO_QUERY ioctl. Once the app knows how many slaves + * are in the bond, it can call the BOND_SLAVE_INFO_QUERY ioctl to + * get slave specific information (# link failures, etc). See + * for more details. The structs of interest + * are ifbond and ifslave. + * + * 2001/4/5 - Chad N. Tindel + * - Ported to 2.4 Kernel + * + * 2001/5/2 - Jeffrey E. Mast + * - When a device is detached from a bond, the slave device is no longer + * left thinking that is has a master. + * + * 2001/5/16 - Jeffrey E. Mast + * - memset did not appropriately initialized the bond rw_locks. Used + * rwlock_init to initialize to unlocked state to prevent deadlock when + * first attempting a lock + * - Called SET_MODULE_OWNER for bond device + * + * 2001/5/17 - Tim Anderson + * - 2 paths for releasing for slave release; 1 through ioctl + * and 2) through close. Both paths need to release the same way. + * - the free slave in bond release is changing slave status before + * the free. 
The netdev_set_master() is intended to change slave state + * so it should not be done as part of the release process. + * - Simple rule for slave state at release: only the active in A/B and + * only one in the trunked case. + * + * 2001/6/01 - Tim Anderson + * - Now call dev_close when releasing a slave so it doesn't screw up + * out routing table. + * + * 2001/6/01 - Chad N. Tindel + * - Added /proc support for getting bond and slave information. + * Information is in /proc/net//info. + * - Changed the locking when calling bond_close to prevent deadlock. + * + * 2001/8/05 - Janice Girouard + * - correct problem where refcnt of slave is not incremented in bond_ioctl + * so the system hangs when halting. + * - correct locking problem when unable to malloc in bond_enslave. + * - adding bond_xmit_xor logic. + * - adding multiple bond device support. + * + * 2001/8/13 - Erik Habbinga + * - correct locking problem with rtnl_exlock_nowait + * + * 2001/8/23 - Janice Girouard + * - bzero initial dev_bonds, to correct oops + * - convert SIOCDEVPRIVATE to new MII ioctl calls + * + * 2001/9/13 - Takao Indoh + * - Add the BOND_CHANGE_ACTIVE ioctl implementation + * + * 2001/9/14 - Mark Huth + * - Change MII_LINK_READY to not check for end of auto-negotiation, + * but only for an up link. + * + * 2001/9/20 - Chad N. Tindel + * - Add the device field to bonding_t. Previously the net_device + * corresponding to a bond wasn't available from the bonding_t + * structure. + * + * 2001/9/25 - Janice Girouard + * - add arp_monitor for active backup mode + * + * 2001/10/23 - Takao Indoh + * - Various memory leak fixes + * + * 2001/11/5 - Mark Huth + * - Don't take rtnl lock in bond_mii_monitor as it deadlocks under + * certain hotswap conditions. + * Note: this same change may be required in bond_arp_monitor ??? 
+ * - Remove possibility of calling bond_sethwaddr with NULL slave_dev ptr + * - Handle hot swap ethernet interface deregistration events to remove + * kernel oops following hot swap of enslaved interface + * + * 2002/1/2 - Chad N. Tindel + * - Restore original slave flags at release time. + * + * 2002/02/18 - Erik Habbinga + * - bond_release(): calling kfree on our_slave after call to + * bond_restore_slave_flags, not before + * - bond_enslave(): saving slave flags into original_flags before + * call to netdev_set_master, so the IFF_SLAVE flag doesn't end + * up in original_flags + * + * 2002/04/05 - Mark Smith and + * Steve Mead + * - Port Gleb Natapov's multicast support patchs from 2.4.12 + * to 2.4.18 adding support for multicast. + * + * 2002/06/10 - Tony Cureington + * - corrected uninitialized pointer (ifr.ifr_data) in bond_check_dev_link; + * actually changed function to use MIIPHY, then MIIREG, and finally + * ETHTOOL to determine the link status + * - fixed bad ifr_data pointer assignments in bond_ioctl + * - corrected mode 1 being reported as active-backup in bond_get_info; + * also added text to distinguish type of load balancing (rr or xor) + * - change arp_ip_target module param from "1-12s" (array of 12 ptrs) + * to "s" (a single ptr) + * + * 2002/08/30 - Jay Vosburgh + * - Removed acquisition of xmit_lock in set_multicast_list; caused + * deadlock on SMP (lock is held by caller). + * - Revamped SIOCGMIIPHY, SIOCGMIIREG portion of bond_check_dev_link(). + * + * 2002/09/18 - Jay Vosburgh + * - Fixed up bond_check_dev_link() (and callers): removed some magic + * numbers, banished local MII_ defines, wrapped ioctl calls to + * prevent EFAULT errors + * + * 2002/9/30 - Jay Vosburgh + * - make sure the ip target matches the arp_target before saving the + * hw address. + * + * 2002/9/30 - Dan Eisner + * - make sure my_ip is set before taking down the link, since + * not all switches respond if the source ip is not set. 
+ * + * 2002/10/8 - Janice Girouard + * - read in the local ip address when enslaving a device + * - add primary support + * - make sure 2*arp_interval has passed when a new device + * is brought on-line before taking it down. + * + * 2002/09/11 - Philippe De Muyter + * - Added bond_xmit_broadcast logic. + * - Added bond_mode() support function. + * + * 2002/10/26 - Laurent Deniel + * - allow to register multicast addresses only on active slave + * (useful in active-backup mode) + * - add multicast module parameter + * - fix deletion of multicast groups after unloading module + * + * 2002/11/06 - Kameshwara Rayaprolu + * - Changes to prevent panic from closing the device twice; if we close + * the device in bond_release, we must set the original_flags to down + * so it won't be closed again by the network layer. + * + * 2002/11/07 - Tony Cureington + * - Fix arp_target_hw_addr memory leak + * - Created activebackup_arp_monitor function to handle arp monitoring + * in active backup mode - the bond_arp_monitor had several problems... 
+ * such as allowing slaves to tx arps sequentially without any delay + * for a response + * - Renamed bond_arp_monitor to loadbalance_arp_monitor and re-wrote + * this function to just handle arp monitoring in load-balancing mode; + * it is a lot more compact now + * - Changes to ensure one and only one slave transmits in active-backup + * mode + * - Robustesize parameters; warn users about bad combinations of + * parameters; also if miimon is specified and a network driver does + * not support MII or ETHTOOL, inform the user of this + * - Changes to support link_failure_count when in arp monitoring mode + * - Fix up/down delay reported in /proc + * - Added version; log version; make version available from "modinfo -d" + * - Fixed problem in bond_check_dev_link - if the first IOCTL (SIOCGMIIPH) + * failed, the ETHTOOL ioctl never got a chance + * + * 2002/11/16 - Laurent Deniel + * - fix multicast handling in activebackup_arp_monitor + * - remove one unnecessary and confusing current_slave == slave test + * in activebackup_arp_monitor + * + * 2002/11/17 - Laurent Deniel + * - fix bond_slave_info_query when slave_id = num_slaves + * + * 2002/11/19 - Janice Girouard + * - correct ifr_data reference. Update ifr_data reference + * to mii_ioctl_data struct values to avoid confusion. + * + * 2002/11/22 - Bert Barbe + * - Add support for multiple arp_ip_target + * + * 2002/12/13 - Jay Vosburgh + * - Changed to allow text strings for mode and multicast, e.g., + * insmod bonding mode=active-backup. The numbers still work. + * One change: an invalid choice will cause module load failure, + * rather than the previous behavior of just picking one. + * - Minor cleanups; got rid of dup ctype stuff, atoi function + * + * 2003/02/07 - Jay Vosburgh + * - Added use_carrier module parameter that causes miimon to + * use netif_carrier_ok() test instead of MII/ETHTOOL ioctls. + * - Minor cleanups; consolidated ioctl calls to one function. 
+ * + * 2003/02/07 - Tony Cureington + * - Fix bond_mii_monitor() logic error that could result in + * bonding round-robin mode ignoring links after failover/recovery + * + * 2003/03/17 - Jay Vosburgh + * - kmalloc fix (GFP_KERNEL to GFP_ATOMIC) reported by + * Shmulik dot Hen at intel.com. + * - Based on discussion on mailing list, changed use of + * update_slave_cnt(), created wrapper functions for adding/removing + * slaves, changed bond_xmit_xor() to check slave_cnt instead of + * checking slave and slave->dev (which only worked by accident). + * - Misc code cleanup: get arp_send() prototype from header file, + * add max_bonds to bonding.txt. + * + * 2003/03/18 - Tsippy Mendelson and + * Shmulik Hen + * - Make sure only bond_attach_slave() and bond_detach_slave() can + * manipulate the slave list, including slave_cnt, even when in + * bond_release_all(). + * - Fixed hang in bond_release() while traffic is running. + * netdev_set_master() must not be called from within the bond lock. + * + * 2003/03/18 - Tsippy Mendelson and + * Shmulik Hen + * - Fixed hang in bond_enslave(): netdev_set_master() must not be + * called from within the bond lock while traffic is running. + * + * 2003/03/18 - Amir Noam + * - Added support for getting slave's speed and duplex via ethtool. + * Needed for 802.3ad and other future modes. + * + * 2003/03/18 - Tsippy Mendelson and + * Shmulik Hen + * - Enable support of modes that need to use the unique mac address of + * each slave. + * * bond_enslave(): Moved setting the slave's mac address, and + * opening it, from the application to the driver. This breaks + * backward compatibility with old versions of ifenslave that open + * the slave before enslaving it !!!. + * * bond_release(): The driver also takes care of closing the slave + * and restoring its original mac address. + * - Removed the code that restores all base driver's flags. + * Flags are automatically restored once all undo stages are done + * properly. 
+ * - Block possibility of enslaving before the master is up. This + * prevents putting the system in an unstable state. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include + +#define DRV_VERSION "2.4.20-20030317" +#define DRV_RELDATE "March 17, 2003" +#define DRV_NAME "bonding" +#define DRV_DESCRIPTION "Ethernet Channel Bonding Driver" + +static const char *version = +DRV_NAME ".c:v" DRV_VERSION " (" DRV_RELDATE ")\n"; + +/* monitor all links that often (in milliseconds). <=0 disables monitoring */ +#ifndef BOND_LINK_MON_INTERV +#define BOND_LINK_MON_INTERV 0 +#endif + +#ifndef BOND_LINK_ARP_INTERV +#define BOND_LINK_ARP_INTERV 0 +#endif + +#ifndef MAX_ARP_IP_TARGETS +#define MAX_ARP_IP_TARGETS 16 +#endif + +static int arp_interval = BOND_LINK_ARP_INTERV; +static char *arp_ip_target[MAX_ARP_IP_TARGETS] = { NULL, }; +static unsigned long arp_target[MAX_ARP_IP_TARGETS] = { 0, } ; +static int arp_ip_count = 0; +static u32 my_ip = 0; +char *arp_target_hw_addr = NULL; + +static char *primary= NULL; + +static int max_bonds = BOND_DEFAULT_MAX_BONDS; +static int miimon = BOND_LINK_MON_INTERV; +static int use_carrier = 1; +static int bond_mode = BOND_MODE_ROUNDROBIN; +static int updelay = 0; +static int downdelay = 0; + +static char *mode = NULL; + +static struct bond_parm_tbl bond_mode_tbl[] = { +{ "balance-rr", BOND_MODE_ROUNDROBIN}, +{ "active-backup", BOND_MODE_ACTIVEBACKUP}, +{ "balance-xor", BOND_MODE_XOR}, +{ "broadcast", BOND_MODE_BROADCAST}, +{ NULL, -1}, +}; + +static int multicast_mode = BOND_MULTICAST_ALL; +static char *multicast = NULL; + +static struct bond_parm_tbl bond_mc_tbl[] = { +{ "disabled", BOND_MULTICAST_DISABLED}, +{ "active", 
BOND_MULTICAST_ACTIVE}, +{ "all", BOND_MULTICAST_ALL}, +{ NULL, -1}, +}; + +static int first_pass = 1; +static struct bonding *these_bonds = NULL; +static struct net_device *dev_bonds = NULL; + +MODULE_PARM(max_bonds, "i"); +MODULE_PARM_DESC(max_bonds, "Max number of bonded devices"); +MODULE_PARM(miimon, "i"); +MODULE_PARM_DESC(miimon, "Link check interval in milliseconds"); +MODULE_PARM(use_carrier, "i"); +MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; 0 for off, 1 for on (default)"); +MODULE_PARM(mode, "s"); +MODULE_PARM_DESC(mode, "Mode of operation : 0 for round robin, 1 for active-backup, 2 for xor"); +MODULE_PARM(arp_interval, "i"); +MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds"); +MODULE_PARM(arp_ip_target, "1-" __MODULE_STRING(MAX_ARP_IP_TARGETS) "s"); +MODULE_PARM_DESC(arp_ip_target, "arp targets in n.n.n.n form"); +MODULE_PARM(updelay, "i"); +MODULE_PARM_DESC(updelay, "Delay before considering link up, in milliseconds"); +MODULE_PARM(downdelay, "i"); +MODULE_PARM_DESC(downdelay, "Delay before considering link down, in milliseconds"); +MODULE_PARM(primary, "s"); +MODULE_PARM_DESC(primary, "Primary network device to use"); +MODULE_PARM(multicast, "s"); +MODULE_PARM_DESC(multicast, "Mode for multicast support : 0 for none, 1 for active slave, 2 for all slaves (default)"); + +static int bond_xmit_roundrobin(struct sk_buff *skb, struct net_device *dev); +static int bond_xmit_xor(struct sk_buff *skb, struct net_device *dev); +static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *dev); +static struct net_device_stats *bond_get_stats(struct net_device *dev); +static void bond_mii_monitor(struct net_device *dev); +static void loadbalance_arp_monitor(struct net_device *dev); +static void activebackup_arp_monitor(struct net_device *dev); +static int bond_event(struct notifier_block *this, unsigned long event, void *ptr); +static void bond_mc_list_destroy(struct bonding *bond); +static void 
bond_mc_add(bonding_t *bond, void *addr, int alen); +static void bond_mc_delete(bonding_t *bond, void *addr, int alen); +static int bond_mc_list_copy (struct dev_mc_list *src, struct bonding *dst, int gpf_flag); +static inline int dmi_same(struct dev_mc_list *dmi1, struct dev_mc_list *dmi2); +static void bond_set_promiscuity(bonding_t *bond, int inc); +static void bond_set_allmulti(bonding_t *bond, int inc); +static struct dev_mc_list* bond_mc_list_find_dmi(struct dev_mc_list *dmi, struct dev_mc_list *mc_list); +static void bond_mc_update(bonding_t *bond, slave_t *new, slave_t *old); +static void bond_set_slave_inactive_flags(slave_t *slave); +static void bond_set_slave_active_flags(slave_t *slave); +static int bond_enslave(struct net_device *master, struct net_device *slave); +static int bond_release(struct net_device *master, struct net_device *slave); +static int bond_release_all(struct net_device *master); +static int bond_sethwaddr(struct net_device *master, struct net_device *slave); + +/* + * bond_get_info is the interface into the /proc filesystem. This is + * a different interface than the BOND_INFO_QUERY ioctl. That is done + * through the generic networking ioctl interface, and bond_info_query + * is the internal function which provides that information. 
+ */ +static int bond_get_info(char *buf, char **start, off_t offset, int length); + +/* #define BONDING_DEBUG 1 */ + +/* several macros */ + +#define IS_UP(dev) ((((dev)->flags & (IFF_UP)) == (IFF_UP)) && \ + (netif_running(dev) && netif_carrier_ok(dev))) + +static void arp_send_all(slave_t *slave) +{ + int i; + + for (i = 0; (i < MAX_ARP_IP_TARGETS) && arp_target[i]; i++) { + arp_send(ARPOP_REQUEST, ETH_P_ARP, arp_target[i], slave->dev, + my_ip, arp_target_hw_addr, slave->dev->dev_addr, + arp_target_hw_addr); + } +} + + +static const char * +bond_mode_name(void) +{ + switch (bond_mode) { + case BOND_MODE_ROUNDROBIN : + return "load balancing (round-robin)"; + case BOND_MODE_ACTIVEBACKUP : + return "fault-tolerance (active-backup)"; + case BOND_MODE_XOR : + return "load balancing (xor)"; + case BOND_MODE_BROADCAST : + return "fault-tolerance (broadcast)"; + default : + return "unknown"; + } +} + +static const char * +multicast_mode_name(void) +{ + switch(multicast_mode) { + case BOND_MULTICAST_DISABLED : + return "disabled"; + case BOND_MULTICAST_ACTIVE : + return "active slave only"; + case BOND_MULTICAST_ALL : + return "all slaves"; + default : + return "unknown"; + } +} + +static void bond_set_slave_inactive_flags(slave_t *slave) +{ + slave->state = BOND_STATE_BACKUP; + slave->dev->flags |= IFF_NOARP; +} + +static void bond_set_slave_active_flags(slave_t *slave) +{ + slave->state = BOND_STATE_ACTIVE; + slave->dev->flags &= ~IFF_NOARP; +} + +/* + * This function counts and verifies the number of attached + * slaves, checking the count against the expected value (given that incr + * is either 1 or -1, for add or removal of a slave). Only + * bond_xmit_xor() uses the slave_cnt value, but this is still a good + * consistency check. 
+ */ +static inline void +update_slave_cnt(bonding_t *bond, int incr) +{ + slave_t *slave = NULL; + int expect = bond->slave_cnt + incr; + + bond->slave_cnt = 0; + for (slave = bond->prev; slave != (slave_t*)bond; + slave = slave->prev) { + bond->slave_cnt++; + } + + if (expect != bond->slave_cnt) + BUG(); +} + +/* + * This function detaches the given slave from the bond's slave list. + * WARNING: no check is made to verify that the slave actually + * belongs to the bond. It returns the slave in case it's needed. + * Nothing is freed on return, structures are just unchained. + * If the bond->current_slave pointer was pointing to the detached slave, + * it's replaced with slave->next, or NULL if it was the only slave. + * + * bond->lock held by caller. + */ +static slave_t * +bond_detach_slave(bonding_t *bond, slave_t *slave) +{ + if ((bond == NULL) || (slave == NULL) || + ((void *)bond == (void *)slave)) { + printk(KERN_ERR + "bond_detach_slave(): trying to detach " + "slave %p from bond %p\n", slave, bond); + return slave; + } + + if (bond->next == slave) { /* is the slave at the head ? */ + if (bond->prev == slave) { /* is the slave alone ? */ + write_lock(&bond->ptrlock); + bond->current_slave = NULL; /* no slave anymore */ + write_unlock(&bond->ptrlock); + bond->prev = bond->next = (slave_t *)bond; + } else { /* not alone */ + bond->next = slave->next; + slave->next->prev = (slave_t *)bond; + bond->prev->next = slave->next; + + write_lock(&bond->ptrlock); + if (bond->current_slave == slave) { + bond->current_slave = slave->next; + } + write_unlock(&bond->ptrlock); + } + } else { + slave->prev->next = slave->next; + if (bond->prev == slave) { /* is this slave the last one ? 
*/ + bond->prev = slave->prev; + } else { + slave->next->prev = slave->prev; + } + + write_lock(&bond->ptrlock); + if (bond->current_slave == slave) { + bond->current_slave = slave->next; + } + write_unlock(&bond->ptrlock); + } + + update_slave_cnt(bond, -1); + + return slave; +} + +static void +bond_attach_slave(struct bonding *bond, struct slave *new_slave) +{ + /* + * queue to the end of the slaves list, make the first element its + * successor, the last one its predecessor, and make it the bond's + * predecessor. + * + * Just to clarify, so future bonding driver hackers don't go through + * the same confusion stage I did trying to figure this out, the + * slaves are stored in a double linked circular list, sortof. + * In the ->next direction, the last slave points to the first slave, + * bypassing bond; only the slaves are in the ->next direction. + * In the ->prev direction, however, the first slave points to bond + * and bond points to the last slave. + * + * It looks like a circle with a little bubble hanging off one side + * in the ->prev direction only. + * + * When going through the list once, its best to start at bond->prev + * and go in the ->prev direction, testing for bond. Doing this + * in the ->next direction doesn't work. Trust me, I know this now. + * :) -mts 2002.03.14 + */ + new_slave->prev = bond->prev; + new_slave->prev->next = new_slave; + bond->prev = new_slave; + new_slave->next = bond->next; + + update_slave_cnt(bond, 1); +} + + +/* + * Less bad way to call ioctl from within the kernel; this needs to be + * done some other way to get the call out of interrupt context. + * Needs "ioctl" variable to be supplied by calling context. + */ +#define IOCTL(dev, arg, cmd) ({ \ + int ret; \ + mm_segment_t fs = get_fs(); \ + set_fs(get_ds()); \ + ret = ioctl(dev, arg, cmd); \ + set_fs(fs); \ + ret; }) + +/* + * Get link speed and duplex from the slave's base driver + * using ethtool. 
If for some reason the call fails or the + * values are invalid, fake speed and duplex to 100/Full + * and return error. + */ +static int bond_update_speed_duplex(struct slave *slave) +{ + struct net_device *dev = slave->dev; + static int (* ioctl)(struct net_device *, struct ifreq *, int); + struct ifreq ifr; + struct ethtool_cmd etool; + + ioctl = dev->do_ioctl; + if (ioctl) { + etool.cmd = ETHTOOL_GSET; + ifr.ifr_data = (char*)&etool; + if (IOCTL(dev, &ifr, SIOCETHTOOL) == 0) { + slave->speed = etool.speed; + slave->duplex = etool.duplex; + } else { + goto err_out; + } + } else { + goto err_out; + } + + switch (slave->speed) { + case SPEED_10: + case SPEED_100: + case SPEED_1000: + break; + default: + goto err_out; + } + + switch (slave->duplex) { + case DUPLEX_FULL: + case DUPLEX_HALF: + break; + default: + goto err_out; + } + + return 0; + +err_out: + //Fake speed and duplex + slave->speed = SPEED_100; + slave->duplex = DUPLEX_FULL; + return -1; +} + +/* + * If dev supports MII link status reporting, check its link status. + * + * We either do MII/ETHTOOL ioctls, or check netif_carrier_ok(), + * depending upon the setting of the use_carrier parameter. + * + * Return either BMSR_LSTATUS, meaning that the link is up (or we + * can't tell and just pretend it is), or 0, meaning that the link is + * down. + * + * If reporting is non-zero, instead of faking link up, return -1 if + * both ETHTOOL and MII ioctls fail (meaning the device does not + * support them). If use_carrier is set, return whatever it says. + * It'd be nice if there was a good way to tell if a driver supports + * netif_carrier, but there really isn't. + */ +static int +bond_check_dev_link(struct net_device *dev, int reporting) +{ + static int (* ioctl)(struct net_device *, struct ifreq *, int); + struct ifreq ifr; + struct mii_ioctl_data *mii; + struct ethtool_value etool; + + if (use_carrier) { + return netif_carrier_ok(dev) ? 
BMSR_LSTATUS : 0; + } + + ioctl = dev->do_ioctl; + if (ioctl) { + /* TODO: set pointer to correct ioctl on a per team member */ + /* bases to make this more efficient. that is, once */ + /* we determine the correct ioctl, we will always */ + /* call it and not the others for that team */ + /* member. */ + + /* + * We cannot assume that SIOCGMIIPHY will also read a + * register; not all network drivers (e.g., e100) + * support that. + */ + + /* Yes, the mii is overlaid on the ifreq.ifr_ifru */ + mii = (struct mii_ioctl_data *)&ifr.ifr_data; + if (IOCTL(dev, &ifr, SIOCGMIIPHY) == 0) { + mii->reg_num = MII_BMSR; + if (IOCTL(dev, &ifr, SIOCGMIIREG) == 0) { + return mii->val_out & BMSR_LSTATUS; + } + } + + /* try SIOCETHTOOL ioctl, some drivers cache ETHTOOL_GLINK */ + /* for a period of time so we attempt to get link status */ + /* from it last if the above MII ioctls fail... */ + etool.cmd = ETHTOOL_GLINK; + ifr.ifr_data = (char*)&etool; + if (IOCTL(dev, &ifr, SIOCETHTOOL) == 0) { + if (etool.data == 1) { + return BMSR_LSTATUS; + } else { +#ifdef BONDING_DEBUG + printk(KERN_INFO + ":: SIOCETHTOOL shows link down \n"); +#endif + return 0; + } + } + + } + + /* + * If reporting, report that either there's no dev->do_ioctl, + * or both SIOCGMIIREG and SIOCETHTOOL failed (meaning that we + * cannot report link status). If not reporting, pretend + * we're ok. + */ + return reporting ? -1 : BMSR_LSTATUS; +} + +static u16 bond_check_mii_link(bonding_t *bond) +{ + int has_active_interface = 0; + unsigned long flags; + + read_lock_irqsave(&bond->lock, flags); + read_lock(&bond->ptrlock); + has_active_interface = (bond->current_slave != NULL); + read_unlock(&bond->ptrlock); + read_unlock_irqrestore(&bond->lock, flags); + + return (has_active_interface ? 
BMSR_LSTATUS : 0); +} + +static int bond_open(struct net_device *dev) +{ + struct timer_list *timer = &((struct bonding *)(dev->priv))->mii_timer; + struct timer_list *arp_timer = &((struct bonding *)(dev->priv))->arp_timer; + MOD_INC_USE_COUNT; + + if (miimon > 0) { /* link check interval, in milliseconds. */ + init_timer(timer); + timer->expires = jiffies + (miimon * HZ / 1000); + timer->data = (unsigned long)dev; + timer->function = (void *)&bond_mii_monitor; + add_timer(timer); + } + + if (arp_interval> 0) { /* arp interval, in milliseconds. */ + init_timer(arp_timer); + arp_timer->expires = jiffies + (arp_interval * HZ / 1000); + arp_timer->data = (unsigned long)dev; + if (bond_mode == BOND_MODE_ACTIVEBACKUP) { + arp_timer->function = (void *)&activebackup_arp_monitor; + } else { + arp_timer->function = (void *)&loadbalance_arp_monitor; + } + add_timer(arp_timer); + } + return 0; +} + +static int bond_close(struct net_device *master) +{ + bonding_t *bond = (struct bonding *) master->priv; + unsigned long flags; + + write_lock_irqsave(&bond->lock, flags); + + if (miimon > 0) { /* link check interval, in milliseconds. */ + del_timer(&bond->mii_timer); + } + if (arp_interval> 0) { /* arp interval, in milliseconds. 
*/ + del_timer(&bond->arp_timer); + if (arp_target_hw_addr != NULL) { + kfree(arp_target_hw_addr); + arp_target_hw_addr = NULL; + } + } + + /* Release the bonded slaves */ + bond_release_all(master); + bond_mc_list_destroy (bond); + + write_unlock_irqrestore(&bond->lock, flags); + + MOD_DEC_USE_COUNT; + return 0; +} + +/* + * flush all members of flush->mc_list from device dev->mc_list + */ +static void bond_mc_list_flush(struct net_device *dev, struct net_device *flush) +{ + struct dev_mc_list *dmi; + + for (dmi = flush->mc_list; dmi != NULL; dmi = dmi->next) + dev_mc_delete(dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); +} + +/* + * Totally destroys the mc_list in bond + */ +static void bond_mc_list_destroy(struct bonding *bond) +{ + struct dev_mc_list *dmi; + + dmi = bond->mc_list; + while (dmi) { + bond->mc_list = dmi->next; + kfree(dmi); + dmi = bond->mc_list; + } +} + +/* + * Add a Multicast address to every slave in the bonding group + */ +static void bond_mc_add(bonding_t *bond, void *addr, int alen) +{ + slave_t *slave; + switch (multicast_mode) { + case BOND_MULTICAST_ACTIVE : + /* write lock already acquired */ + if (bond->current_slave != NULL) + dev_mc_add(bond->current_slave->dev, addr, alen, 0); + break; + case BOND_MULTICAST_ALL : + for (slave = bond->prev; slave != (slave_t*)bond; slave = slave->prev) + dev_mc_add(slave->dev, addr, alen, 0); + break; + case BOND_MULTICAST_DISABLED : + break; + } +} + +/* + * Remove a multicast address from every slave in the bonding group + */ +static void bond_mc_delete(bonding_t *bond, void *addr, int alen) +{ + slave_t *slave; + switch (multicast_mode) { + case BOND_MULTICAST_ACTIVE : + /* write lock already acquired */ + if (bond->current_slave != NULL) + dev_mc_delete(bond->current_slave->dev, addr, alen, 0); + break; + case BOND_MULTICAST_ALL : + for (slave = bond->prev; slave != (slave_t*)bond; slave = slave->prev) + dev_mc_delete(slave->dev, addr, alen, 0); + break; + case BOND_MULTICAST_DISABLED : + break; + } 
+} + +/* + * Copy all the Multicast addresses from src to the bonding device dst + */ +static int bond_mc_list_copy (struct dev_mc_list *src, struct bonding *dst, + int gpf_flag) +{ + struct dev_mc_list *dmi, *new_dmi; + + for (dmi = src; dmi != NULL; dmi = dmi->next) { + new_dmi = kmalloc(sizeof(struct dev_mc_list), gpf_flag); + + if (new_dmi == NULL) { + return -ENOMEM; + } + + new_dmi->next = dst->mc_list; + dst->mc_list = new_dmi; + + new_dmi->dmi_addrlen = dmi->dmi_addrlen; + memcpy(new_dmi->dmi_addr, dmi->dmi_addr, dmi->dmi_addrlen); + new_dmi->dmi_users = dmi->dmi_users; + new_dmi->dmi_gusers = dmi->dmi_gusers; + } + return 0; +} + +/* + * Returns non-zero if dmi1 and dmi2 are the same, 0 otherwise + */ +static inline int dmi_same(struct dev_mc_list *dmi1, struct dev_mc_list *dmi2) +{ + return memcmp(dmi1->dmi_addr, dmi2->dmi_addr, dmi1->dmi_addrlen) == 0 && + dmi1->dmi_addrlen == dmi2->dmi_addrlen; +} + +/* + * Push the promiscuity flag down to all slaves + */ +static void bond_set_promiscuity(bonding_t *bond, int inc) +{ + slave_t *slave; + switch (multicast_mode) { + case BOND_MULTICAST_ACTIVE : + /* write lock already acquired */ + if (bond->current_slave != NULL) + dev_set_promiscuity(bond->current_slave->dev, inc); + break; + case BOND_MULTICAST_ALL : + for (slave = bond->prev; slave != (slave_t*)bond; slave = slave->prev) + dev_set_promiscuity(slave->dev, inc); + break; + case BOND_MULTICAST_DISABLED : + break; + } +} + +/* + * Push the allmulti flag down to all slaves + */ +static void bond_set_allmulti(bonding_t *bond, int inc) +{ + slave_t *slave; + switch (multicast_mode) { + case BOND_MULTICAST_ACTIVE : + /* write lock already acquired */ + if (bond->current_slave != NULL) + dev_set_allmulti(bond->current_slave->dev, inc); + break; + case BOND_MULTICAST_ALL : + for (slave = bond->prev; slave != (slave_t*)bond; slave = slave->prev) + dev_set_allmulti(slave->dev, inc); + break; + case BOND_MULTICAST_DISABLED : + break; + } +} + +/* + * returns dmi 
entry if found, NULL otherwise + */ +static struct dev_mc_list* bond_mc_list_find_dmi(struct dev_mc_list *dmi, + struct dev_mc_list *mc_list) +{ + struct dev_mc_list *idmi; + + for (idmi = mc_list; idmi != NULL; idmi = idmi->next) { + if (dmi_same(dmi, idmi)) { + return idmi; + } + } + return NULL; +} + +static void set_multicast_list(struct net_device *master) +{ + bonding_t *bond = master->priv; + struct dev_mc_list *dmi; + unsigned long flags = 0; + + if (multicast_mode == BOND_MULTICAST_DISABLED) + return; + /* + * Lock the private data for the master + */ + write_lock_irqsave(&bond->lock, flags); + + /* set promiscuity flag to slaves */ + if ( (master->flags & IFF_PROMISC) && !(bond->flags & IFF_PROMISC) ) + bond_set_promiscuity(bond, 1); + + if ( !(master->flags & IFF_PROMISC) && (bond->flags & IFF_PROMISC) ) + bond_set_promiscuity(bond, -1); + + /* set allmulti flag to slaves */ + if ( (master->flags & IFF_ALLMULTI) && !(bond->flags & IFF_ALLMULTI) ) + bond_set_allmulti(bond, 1); + + if ( !(master->flags & IFF_ALLMULTI) && (bond->flags & IFF_ALLMULTI) ) + bond_set_allmulti(bond, -1); + + bond->flags = master->flags; + + /* looking for addresses to add to slaves' mc list */ + for (dmi = master->mc_list; dmi != NULL; dmi = dmi->next) { + if (bond_mc_list_find_dmi(dmi, bond->mc_list) == NULL) + bond_mc_add(bond, dmi->dmi_addr, dmi->dmi_addrlen); + } + + /* looking for addresses to delete from slaves' list */ + for (dmi = bond->mc_list; dmi != NULL; dmi = dmi->next) { + if (bond_mc_list_find_dmi(dmi, master->mc_list) == NULL) + bond_mc_delete(bond, dmi->dmi_addr, dmi->dmi_addrlen); + } + + + /* save master's multicast list */ + bond_mc_list_destroy (bond); + bond_mc_list_copy (master->mc_list, bond, GFP_ATOMIC); + + write_unlock_irqrestore(&bond->lock, flags); +} + +/* + * Update the mc list and multicast-related flags for the new and + * old active slaves (if any) according to the multicast mode + */ +static void bond_mc_update(bonding_t *bond, slave_t *new, 
slave_t *old) +{ + struct dev_mc_list *dmi; + + switch(multicast_mode) { + case BOND_MULTICAST_ACTIVE : + if (bond->device->flags & IFF_PROMISC) { + if (old != NULL && new != old) + dev_set_promiscuity(old->dev, -1); + dev_set_promiscuity(new->dev, 1); + } + if (bond->device->flags & IFF_ALLMULTI) { + if (old != NULL && new != old) + dev_set_allmulti(old->dev, -1); + dev_set_allmulti(new->dev, 1); + } + /* first remove all mc addresses from old slave if any, + and _then_ add them to new active slave */ + if (old != NULL && new != old) { + for (dmi = bond->device->mc_list; dmi != NULL; dmi = dmi->next) + dev_mc_delete(old->dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); + } + for (dmi = bond->device->mc_list; dmi != NULL; dmi = dmi->next) + dev_mc_add(new->dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); + break; + case BOND_MULTICAST_ALL : + /* nothing to do: mc list is already up-to-date on all slaves */ + break; + case BOND_MULTICAST_DISABLED : + break; + } +} + +/* enslave device to bond device */ +static int bond_enslave(struct net_device *master_dev, + struct net_device *slave_dev) +{ + bonding_t *bond = NULL; + slave_t *new_slave = NULL; + unsigned long flags = 0; + unsigned long rflags = 0; + int err = 0; + struct dev_mc_list *dmi; + struct in_ifaddr **ifap; + struct in_ifaddr *ifa; + int link_reporting; + struct sockaddr addr; + + if (master_dev == NULL || slave_dev == NULL) { + return -ENODEV; + } + bond = (struct bonding *) master_dev->priv; + + if (slave_dev->do_ioctl == NULL) { + printk(KERN_DEBUG + "Warning : no link monitoring support for %s\n", + slave_dev->name); + } + + /* This breaks backward compatibility with old versions + of ifenslave which open the slave before enslaving */ + /* already up. 
*/ + if ((slave_dev->flags & IFF_UP) == IFF_UP) { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "Error, slave_dev is up\n"); +#endif + return -EBUSY; + } + + /* already enslaved */ + if (master_dev->flags & IFF_SLAVE || slave_dev->flags & IFF_SLAVE) { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "Error, Device was already enslaved\n"); +#endif + return -EBUSY; + } + + /* bond must be initialized by bond_open() before enslaving */ + if ((master_dev->flags & IFF_UP) != IFF_UP) { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "Error, master_dev is not up\n"); +#endif + return -EPERM; + } + + if (slave_dev->set_mac_address == NULL) { + printk(KERN_CRIT " The slave device you specified does not support" + " setting the MAC address.\n Your kernel likely does not" + " support slave devices.\n"); + return -EOPNOTSUPP; + } + + if ((new_slave = kmalloc(sizeof(slave_t), GFP_ATOMIC)) == NULL) { + return -ENOMEM; + } + memset(new_slave, 0, sizeof(slave_t)); + + /* save slave's original flags before calling */ + /* netdev_set_master and dev_open */ + new_slave->original_flags = slave_dev->flags; + + /* save slave's original ("permanent") mac address for + modes that need it, and for restoring it upon release, + and then set it to the master's address */ + memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN); + + if (bond->next != (slave_t*)bond) { + /* set slave to master's mac address + The application already set the master's + mac address to that of the first slave */ + memcpy(addr.sa_data, master_dev->dev_addr, ETH_ALEN); + addr.sa_family = slave_dev->type; + err = slave_dev->set_mac_address(slave_dev, &addr); + if (err) { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "Error %d calling set_mac_address\n", err); +#endif + goto err_free; + } + } + + /* open the slave since the application closed it */ + err = dev_open(slave_dev); + if (err) { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "Opening slave %s failed\n", slave_dev->name); +#endif + goto err_restore_mac; + } + + err = 
netdev_set_master(slave_dev, master_dev); + + if (err) { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "Error %d calling netdev_set_master\n", err); +#endif + goto err_close; + } + + new_slave->dev = slave_dev; + + if (multicast_mode == BOND_MULTICAST_ALL) { + /* set promiscuity level to new slave */ + if (master_dev->flags & IFF_PROMISC) + dev_set_promiscuity(slave_dev, 1); + + /* set allmulti level to new slave */ + if (master_dev->flags & IFF_ALLMULTI) + dev_set_allmulti(slave_dev, 1); + + /* upload master's mc_list to new slave */ + for (dmi = master_dev->mc_list; dmi != NULL; dmi = dmi->next) + dev_mc_add (slave_dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); + } + + write_lock_irqsave(&bond->lock, flags); + + bond_attach_slave(bond, new_slave); + new_slave->delay = 0; + new_slave->link_failure_count = 0; + + if (miimon > 0 && !use_carrier) { + link_reporting = bond_check_dev_link(slave_dev, 1); + + if ((link_reporting == -1) && (arp_interval == 0)) { + /* + * miimon is set but a bonded network driver + * does not support ETHTOOL/MII and + * arp_interval is not set. Note: if + * use_carrier is enabled, we will never go + * here (because netif_carrier is always + * supported); thus, we don't need to change + * the messages for netif_carrier. + */ + printk(KERN_ERR + "bond_enslave(): MII and ETHTOOL support not " + "available for interface %s, and " + "arp_interval/arp_ip_target module parameters " + "not specified, thus bonding will not detect " + "link failures! 
see bonding.txt for details.\n", + slave_dev->name); + } else if (link_reporting == -1) { + /* unable get link status using mii/ethtool */ + printk(KERN_WARNING + "bond_enslave: can't get link status from " + "interface %s; the network driver associated " + "with this interface does not support " + "MII or ETHTOOL link status reporting, thus " + "miimon has no effect on this interface.\n", + slave_dev->name); + } + } + + /* check for initial state */ + if ((miimon <= 0) || + (bond_check_dev_link(slave_dev, 0) == BMSR_LSTATUS)) { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "Initial state of slave_dev is BOND_LINK_UP\n"); +#endif + new_slave->link = BOND_LINK_UP; + new_slave->jiffies = jiffies; + } + else { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "Initial state of slave_dev is BOND_LINK_DOWN\n"); +#endif + new_slave->link = BOND_LINK_DOWN; + } + + if (bond_update_speed_duplex(new_slave) && (new_slave->link == BOND_LINK_UP) ) { + printk(KERN_WARNING + "bond_enslave(): failed to get speed/duplex from %s, " + "speed forced to 100Mbps, duplex forced to Full.\n", + new_slave->dev->name); + } + + /* if we're in active-backup mode, we need one and only one active + * interface. The backup interfaces will have their NOARP flag set + * because we need them to be completely deaf and not to respond to + * any ARP request on the network to avoid fooling a switch. Thus, + * since we guarantee that current_slave always point to the last + * usable interface, we just have to verify this interface's flag. 
+ */ + if (bond_mode == BOND_MODE_ACTIVEBACKUP) { + if (((bond->current_slave == NULL) + || (bond->current_slave->dev->flags & IFF_NOARP)) + && (new_slave->link == BOND_LINK_UP)) { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "This is the first active slave\n"); +#endif + /* first slave or no active slave yet, and this link + is OK, so make this interface the active one */ + bond->current_slave = new_slave; + bond_set_slave_active_flags(new_slave); + bond_mc_update(bond, new_slave, NULL); + } + else { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "This is just a backup slave\n"); +#endif + bond_set_slave_inactive_flags(new_slave); + } + read_lock_irqsave(&(((struct in_device *)slave_dev->ip_ptr)->lock), rflags); + ifap= &(((struct in_device *)slave_dev->ip_ptr)->ifa_list); + ifa = *ifap; + my_ip = ifa->ifa_address; + read_unlock_irqrestore(&(((struct in_device *)slave_dev->ip_ptr)->lock), rflags); + + /* if there is a primary slave, remember it */ + if (primary != NULL) + if( strcmp(primary, new_slave->dev->name) == 0) + bond->primary_slave = new_slave; + } else { +#ifdef BONDING_DEBUG + printk(KERN_CRIT "This slave is always active in trunk mode\n"); +#endif + /* always active in trunk mode */ + new_slave->state = BOND_STATE_ACTIVE; + if (bond->current_slave == NULL) + bond->current_slave = new_slave; + } + + write_unlock_irqrestore(&bond->lock, flags); + + printk (KERN_INFO "%s: enslaving %s as a%s interface with a%s link.\n", + master_dev->name, slave_dev->name, + new_slave->state == BOND_STATE_ACTIVE ? "n active" : " backup", + new_slave->link == BOND_LINK_UP ? "n up" : " down"); + + //enslave is successful + return 0; + +// Undo stages on error +err_close: + dev_close(slave_dev); + +err_restore_mac: + memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + slave_dev->set_mac_address(slave_dev, &addr); + +err_free: + kfree(new_slave); + return err; +} + +/* + * This function changes the active slave to slave_dev. 
+ * It returns -EINVAL in the following cases. + * - slave_dev is not found in the list. + * - There is no active slave now. + * - slave_dev is already active. + * - The link state of slave_dev is not BOND_LINK_UP. + * - slave_dev is not running. + * In these cases, this function does nothing. + * In the other cases, current_slave pointer is changed and 0 is returned. + */ +static int bond_change_active(struct net_device *master_dev, struct net_device *slave_dev) +{ + bonding_t *bond; + slave_t *slave; + slave_t *oldactive = NULL; + slave_t *newactive = NULL; + unsigned long flags; + int ret = 0; + + if (master_dev == NULL || slave_dev == NULL) { + return -ENODEV; + } + + bond = (struct bonding *) master_dev->priv; + write_lock_irqsave(&bond->lock, flags); + slave = (slave_t *)bond; + oldactive = bond->current_slave; + + while ((slave = slave->prev) != (slave_t *)bond) { + if(slave_dev == slave->dev) { + newactive = slave; + break; + } + } + + if ((newactive != NULL)&& + (oldactive != NULL)&& + (newactive != oldactive)&& + (newactive->link == BOND_LINK_UP)&& + IS_UP(newactive->dev)) { + bond_set_slave_inactive_flags(oldactive); + bond_set_slave_active_flags(newactive); + bond_mc_update(bond, newactive, oldactive); + bond->current_slave = newactive; + printk("%s : activate %s(old : %s)\n", + master_dev->name, newactive->dev->name, + oldactive->dev->name); + } + else { + ret = -EINVAL; + } + write_unlock_irqrestore(&bond->lock, flags); + return ret; +} + +/* Choose a new valid interface from the pool, set it active + * and make it the current slave. If no valid interface is + * found, the oldest slave in BACK state is chosen and + * activated. If none is found, it's considered as no + * interfaces left so the current slave is set to NULL. + * The result is a pointer to the current slave. + * + * Since this function sends message tails through printk, the caller + * must have started something like `printk(KERN_INFO "xxxx ");'. + * + * Warning: must put locks around the call to this function if needed. 
+ */ +slave_t *change_active_interface(bonding_t *bond) +{ + slave_t *newslave, *oldslave; + slave_t *bestslave = NULL; + int mintime; + + read_lock(&bond->ptrlock); + newslave = oldslave = bond->current_slave; + read_unlock(&bond->ptrlock); + + if (newslave == NULL) { /* there were no active slaves left */ + if (bond->next != (slave_t *)bond) { /* found one slave */ + write_lock(&bond->ptrlock); + newslave = bond->current_slave = bond->next; + write_unlock(&bond->ptrlock); + } else { + + printk (" but could not find any %s interface.\n", + (bond_mode == BOND_MODE_ACTIVEBACKUP) ? "backup":"other"); + write_lock(&bond->ptrlock); + bond->current_slave = (slave_t *)NULL; + write_unlock(&bond->ptrlock); + return NULL; /* still no slave, return NULL */ + } + } else if (bond_mode == BOND_MODE_ACTIVEBACKUP) { + /* make sure oldslave doesn't send arps - this could + * cause a ping-pong effect between interfaces since they + * would be able to tx arps - in active backup only one + * slave should be able to tx arps, and that should be + * the current_slave; the only exception is when all + * slaves have gone down, then only one non-current slave can + * send arps at a time; clearing oldslaves' mc list is handled + * later in this function. 
+ */ + bond_set_slave_inactive_flags(oldslave); + } + + mintime = updelay; + + /* first try the primary link; if arping, a link must tx/rx traffic + * before it can be considered the current_slave - also, we would skip + * slaves between the current_slave and primary_slave that may be up + * and able to arp + */ + if ((bond->primary_slave != NULL) && (arp_interval == 0)) { + if (IS_UP(bond->primary_slave->dev)) + newslave = bond->primary_slave; + } + + do { + if (IS_UP(newslave->dev)) { + if (newslave->link == BOND_LINK_UP) { + /* this one is immediately usable */ + if (bond_mode == BOND_MODE_ACTIVEBACKUP) { + bond_set_slave_active_flags(newslave); + bond_mc_update(bond, newslave, oldslave); + printk (" and making interface %s the active one.\n", + newslave->dev->name); + } + else { + printk (" and setting pointer to interface %s.\n", + newslave->dev->name); + } + + write_lock(&bond->ptrlock); + bond->current_slave = newslave; + write_unlock(&bond->ptrlock); + return newslave; + } + else if (newslave->link == BOND_LINK_BACK) { + /* link up, but waiting for stabilization */ + if (newslave->delay < mintime) { + mintime = newslave->delay; + bestslave = newslave; + } + } + } + } while ((newslave = newslave->next) != oldslave); + + /* no usable backup found, we'll see if we at least got a link that was + coming back for a long time, and could possibly already be usable. + */ + + if (bestslave != NULL) { + /* early take-over. 
*/ + printk (" and making interface %s the active one %d ms earlier.\n", + bestslave->dev->name, + (updelay - bestslave->delay)*miimon); + + bestslave->delay = 0; + bestslave->link = BOND_LINK_UP; + bestslave->jiffies = jiffies; + bond_set_slave_active_flags(bestslave); + bond_mc_update(bond, bestslave, oldslave); + write_lock(&bond->ptrlock); + bond->current_slave = bestslave; + write_unlock(&bond->ptrlock); + return bestslave; + } + + if ((bond_mode == BOND_MODE_ACTIVEBACKUP) && + (multicast_mode == BOND_MULTICAST_ACTIVE) && + (oldslave != NULL)) { + /* flush bonds (master's) mc_list from oldslave since it wasn't + * updated (and deleted) above + */ + bond_mc_list_flush(oldslave->dev, bond->device); + if (bond->device->flags & IFF_PROMISC) { + dev_set_promiscuity(oldslave->dev, -1); + } + if (bond->device->flags & IFF_ALLMULTI) { + dev_set_allmulti(oldslave->dev, -1); + } + } + + printk (" but could not find any %s interface.\n", + (bond_mode == BOND_MODE_ACTIVEBACKUP) ? "backup":"other"); + + /* absolutely nothing found. let's return NULL */ + write_lock(&bond->ptrlock); + bond->current_slave = (slave_t *)NULL; + write_unlock(&bond->ptrlock); + return NULL; +} + +/* + * Try to release the slave device from the bond device + * It is legal to access current_slave without a lock because all the function + * is write-locked. + * + * The rules for slave state should be: + * for Active/Backup: + * Active stays on all backups go down + * for Bonded connections: + * The first up interface should be left on and all others downed. 
+ */ +static int bond_release(struct net_device *master, struct net_device *slave) +{ + bonding_t *bond; + slave_t *our_slave, *old_current; + unsigned long flags; + struct sockaddr addr; + + if (master == NULL || slave == NULL) { + return -ENODEV; + } + + bond = (struct bonding *) master->priv; + + /* master already enslaved, or slave not enslaved, + or no slave for this master */ + if ((master->flags & IFF_SLAVE) || !(slave->flags & IFF_SLAVE)) { + printk (KERN_DEBUG "%s: cannot release %s.\n", master->name, slave->name); + return -EINVAL; + } + + write_lock_irqsave(&bond->lock, flags); + bond->current_arp_slave = NULL; + our_slave = (slave_t *)bond; + old_current = bond->current_slave; + while ((our_slave = our_slave->prev) != (slave_t *)bond) { + if (our_slave->dev == slave) { + bond_detach_slave(bond, our_slave); + + printk (KERN_INFO "%s: releasing %s interface %s", + master->name, + (our_slave->state == BOND_STATE_ACTIVE) ? "active" : "backup", + slave->name); + + if (our_slave == old_current) { + /* find a new interface and be verbose */ + change_active_interface(bond); + } else { + printk(".\n"); + } + + if (bond->current_slave == NULL) { + printk(KERN_INFO + "%s: now running without any active interface !\n", + master->name); + } + + if (bond->primary_slave == our_slave) { + bond->primary_slave = NULL; + } + + break; + } + + } + write_unlock_irqrestore(&bond->lock, flags); + + if (our_slave == (slave_t *)bond) { + /* if we get here, it's because the device was not found */ + printk (KERN_INFO "%s: %s not enslaved\n", master->name, slave->name); + return -EINVAL; + } + + /* undo settings and restore original values */ + + if (multicast_mode == BOND_MULTICAST_ALL) { + /* flush master's mc_list from slave */ + bond_mc_list_flush (slave, master); + + /* unset promiscuity level from slave */ + if (master->flags & IFF_PROMISC) + dev_set_promiscuity(slave, -1); + + /* unset allmulti level from slave */ + if (master->flags & IFF_ALLMULTI) + 
dev_set_allmulti(slave, -1); + } + + netdev_set_master(slave, NULL); + + /* close slave before restoring its mac address */ + dev_close(slave); + + /* restore original ("permanent") mac address*/ + memcpy(addr.sa_data, our_slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave->type; + slave->set_mac_address(slave, &addr); + + /* restore the original state of the IFF_NOARP flag that might have */ + /* been set by bond_set_slave_inactive_flags() */ + if ((our_slave->original_flags & IFF_NOARP) == 0) { + slave->flags &= ~IFF_NOARP; + } + + kfree(our_slave); + + /* if the last slave was removed, zero the mac address + of the master so it will be set by the application + to the mac address of the first slave */ + if (bond->next == (slave_t*)bond) { + memset(master->dev_addr, 0, master->addr_len); + } + + return 0; /* deletion OK */ +} + +/* + * This function releases all slaves. + * Warning: must put write-locks around the call to this function. + */ +static int bond_release_all(struct net_device *master) +{ + bonding_t *bond; + slave_t *our_slave; + struct net_device *slave_dev; + struct sockaddr addr; + + if (master == NULL) { + return -ENODEV; + } + + if (master->flags & IFF_SLAVE) { + return -EINVAL; + } + + bond = (struct bonding *) master->priv; + bond->current_arp_slave = NULL; + bond->current_slave = NULL; + bond->primary_slave = NULL; + + while ((our_slave = bond->prev) != (slave_t *)bond) { + slave_dev = our_slave->dev; + bond_detach_slave(bond, our_slave); + + if (multicast_mode == BOND_MULTICAST_ALL + || (multicast_mode == BOND_MULTICAST_ACTIVE + && bond->current_slave == our_slave)) { + + /* flush master's mc_list from slave */ + bond_mc_list_flush (slave_dev, master); + + /* unset promiscuity level from slave */ + if (master->flags & IFF_PROMISC) + dev_set_promiscuity(slave_dev, -1); + + /* unset allmulti level from slave */ + if (master->flags & IFF_ALLMULTI) + dev_set_allmulti(slave_dev, -1); + } + + /* Can be safely called from inside the bond lock + 
since traffic and timers have already stopped + */ + netdev_set_master(slave_dev, NULL); + + /* close slave before restoring its mac address */ + dev_close(slave_dev); + + /* restore original ("permanent") mac address*/ + memcpy(addr.sa_data, our_slave->perm_hwaddr, ETH_ALEN); + addr.sa_family = slave_dev->type; + slave_dev->set_mac_address(slave_dev, &addr); + + /* restore the original state of the IFF_NOARP flag that might have */ + /* been set by bond_set_slave_inactive_flags() */ + if ((our_slave->original_flags & IFF_NOARP) == 0) { + slave_dev->flags &= ~IFF_NOARP; + } + + kfree(our_slave); + } + + /* zero the mac address of the master so it will be + set by the application to the mac address of the + first slave */ + memset(master->dev_addr, 0, master->addr_len); + + printk (KERN_INFO "%s: released all slaves\n", master->name); + + return 0; +} + +/* this function is called regularly to monitor each slave's link. */ +static void bond_mii_monitor(struct net_device *master) +{ + bonding_t *bond = (struct bonding *) master->priv; + slave_t *slave, *bestslave, *oldcurrent; + unsigned long flags; + int slave_died = 0; + + read_lock_irqsave(&bond->lock, flags); + + /* we will try to read the link status of each of our slaves, and + * set their IFF_RUNNING flag appropriately. For each slave not + * supporting MII status, we won't do anything so that a user-space + * program could monitor the link itself if needed. 
+ */ + + bestslave = NULL; + slave = (slave_t *)bond; + + read_lock(&bond->ptrlock); + oldcurrent = bond->current_slave; + read_unlock(&bond->ptrlock); + + while ((slave = slave->prev) != (slave_t *)bond) { + /* use updelay+1 to match an UP slave even when updelay is 0 */ + int mindelay = updelay + 1; + struct net_device *dev = slave->dev; + int link_state; + + link_state = bond_check_dev_link(dev, 0); + + switch (slave->link) { + case BOND_LINK_UP: /* the link was up */ + if (link_state == BMSR_LSTATUS) { + /* link stays up, tell that this one + is immediately available */ + if (IS_UP(dev) && (mindelay > -2)) { + /* -2 is the best case : + this slave was already up */ + mindelay = -2; + bestslave = slave; + } + break; + } + else { /* link going down */ + slave->link = BOND_LINK_FAIL; + slave->delay = downdelay; + if (slave->link_failure_count < UINT_MAX) { + slave->link_failure_count++; + } + if (downdelay > 0) { + printk (KERN_INFO + "%s: link status down for %sinterface " + "%s, disabling it in %d ms.\n", + master->name, + IS_UP(dev) + ? ((bond_mode == BOND_MODE_ACTIVEBACKUP) + ? ((slave == oldcurrent) + ? "active " : "backup ") + : "") + : "idle ", + dev->name, + downdelay * miimon); + } + } + /* no break ! 
fall through the BOND_LINK_FAIL test to + ensure proper action to be taken + */ + case BOND_LINK_FAIL: /* the link has just gone down */ + if (link_state != BMSR_LSTATUS) { + /* link stays down */ + if (slave->delay <= 0) { + /* link down for too long time */ + slave->link = BOND_LINK_DOWN; + /* in active/backup mode, we must + completely disable this interface */ + if (bond_mode == BOND_MODE_ACTIVEBACKUP) { + bond_set_slave_inactive_flags(slave); + } + printk(KERN_INFO + "%s: link status definitely down " + "for interface %s, disabling it", + master->name, + dev->name); + + read_lock(&bond->ptrlock); + if (slave == bond->current_slave) { + read_unlock(&bond->ptrlock); + /* find a new interface and be verbose */ + change_active_interface(bond); + } else { + read_unlock(&bond->ptrlock); + printk(".\n"); + } + slave_died = 1; + } else { + slave->delay--; + } + } else { + /* link up again */ + slave->link = BOND_LINK_UP; + slave->jiffies = jiffies; + printk(KERN_INFO + "%s: link status up again after %d ms " + "for interface %s.\n", + master->name, + (downdelay - slave->delay) * miimon, + dev->name); + + if (IS_UP(dev) && (mindelay > -1)) { + /* -1 is a good case : this slave went + down only for a short time */ + mindelay = -1; + bestslave = slave; + } + } + break; + case BOND_LINK_DOWN: /* the link was down */ + if (link_state != BMSR_LSTATUS) { + /* the link stays down, nothing more to do */ + break; + } else { /* link going up */ + slave->link = BOND_LINK_BACK; + slave->delay = updelay; + + if (updelay > 0) { + /* if updelay == 0, no need to + advertise about a 0 ms delay */ + printk (KERN_INFO + "%s: link status up for interface" + " %s, enabling it in %d ms.\n", + master->name, + dev->name, + updelay * miimon); + } + } + /* no break ! fall through the BOND_LINK_BACK state in + case there's something to do. 
+ */ + case BOND_LINK_BACK: /* the link has just come back */ + if (link_state != BMSR_LSTATUS) { + /* link down again */ + slave->link = BOND_LINK_DOWN; + printk(KERN_INFO + "%s: link status down again after %d ms " + "for interface %s.\n", + master->name, + (updelay - slave->delay) * miimon, + dev->name); + } else { + /* link stays up */ + if (slave->delay == 0) { + /* now the link has been up for long time enough */ + slave->link = BOND_LINK_UP; + slave->jiffies = jiffies; + + if (bond_mode != BOND_MODE_ACTIVEBACKUP) { + /* make it immediately active */ + slave->state = BOND_STATE_ACTIVE; + } else if (slave != bond->primary_slave) { + /* prevent it from being the active one */ + slave->state = BOND_STATE_BACKUP; + } + + printk(KERN_INFO + "%s: link status definitely up " + "for interface %s.\n", + master->name, + dev->name); + + if ( (bond->primary_slave != NULL) + && (slave == bond->primary_slave) ) + change_active_interface(bond); + } + else + slave->delay--; + + /* we'll also look for the mostly eligible slave */ + if (bond->primary_slave == NULL) { + if (IS_UP(dev) && (slave->delay < mindelay)) { + mindelay = slave->delay; + bestslave = slave; + } + } else if ( (IS_UP(bond->primary_slave->dev)) || + ( (!IS_UP(bond->primary_slave->dev)) && + (IS_UP(dev) && (slave->delay < mindelay)) ) ) { + mindelay = slave->delay; + bestslave = slave; + } + } + break; + } /* end of switch */ + + bond_update_speed_duplex(slave); + + } /* end of while */ + + /* + * if there's no active interface and we discovered that one + * of the slaves could be activated earlier, so we do it. + */ + read_lock(&bond->ptrlock); + oldcurrent = bond->current_slave; + read_unlock(&bond->ptrlock); + + /* no active interface at the moment or need to bring up the primary */ + if (oldcurrent == NULL) { /* no active interface at the moment */ + if (bestslave != NULL) { /* last chance to find one ? 
*/ + if (bestslave->link == BOND_LINK_UP) { + printk (KERN_INFO + "%s: making interface %s the new active one.\n", + master->name, bestslave->dev->name); + } else { + printk (KERN_INFO + "%s: making interface %s the new " + "active one %d ms earlier.\n", + master->name, bestslave->dev->name, + (updelay - bestslave->delay) * miimon); + + bestslave->delay = 0; + bestslave->link = BOND_LINK_UP; + bestslave->jiffies = jiffies; + } + + if (bond_mode == BOND_MODE_ACTIVEBACKUP) { + bond_set_slave_active_flags(bestslave); + bond_mc_update(bond, bestslave, NULL); + } else { + bestslave->state = BOND_STATE_ACTIVE; + } + write_lock(&bond->ptrlock); + bond->current_slave = bestslave; + write_unlock(&bond->ptrlock); + } else if (slave_died) { + /* print this message only once a slave has just died */ + printk(KERN_INFO + "%s: now running without any active interface !\n", + master->name); + } + } + + read_unlock_irqrestore(&bond->lock, flags); + /* re-arm the timer */ + mod_timer(&bond->mii_timer, jiffies + (miimon * HZ / 1000)); +} + +/* + * this function is called regularly to monitor each slave's link + * ensuring that traffic is being sent and received when arp monitoring + * is used in load-balancing mode. if the adapter has been dormant, then an + * arp is transmitted to generate traffic. see activebackup_arp_monitor for + * arp monitoring in active backup mode. + */ +static void loadbalance_arp_monitor(struct net_device *master) +{ + bonding_t *bond; + unsigned long flags; + slave_t *slave; + int the_delta_in_ticks = arp_interval * HZ / 1000; + int next_timer = jiffies + (arp_interval * HZ / 1000); + + bond = (struct bonding *) master->priv; + if (master->priv == NULL) { + mod_timer(&bond->arp_timer, next_timer); + return; + } + + read_lock_irqsave(&bond->lock, flags); + + /* TODO: investigate why rtnl_shlock_nowait and rtnl_exlock_nowait + * are called below and add comment why they are required... 
+ */ + if ((!IS_UP(master)) || rtnl_shlock_nowait()) { + mod_timer(&bond->arp_timer, next_timer); + read_unlock_irqrestore(&bond->lock, flags); + return; + } + + if (rtnl_exlock_nowait()) { + rtnl_shunlock(); + mod_timer(&bond->arp_timer, next_timer); + read_unlock_irqrestore(&bond->lock, flags); + return; + } + + /* see if any of the previous devices are up now (i.e. they have + * xmt and rcv traffic). the current_slave does not come into + * the picture unless it is null. also, slave->jiffies is not needed + * here because we send an arp on each slave and give a slave as + * long as it needs to get the tx/rx within the delta. + * TODO: what about up/down delay in arp mode? it wasn't here before + * so it can wait + */ + slave = (slave_t *)bond; + while ((slave = slave->prev) != (slave_t *)bond) { + + if (slave->link != BOND_LINK_UP) { + + if (((jiffies - slave->dev->trans_start) <= + the_delta_in_ticks) && + ((jiffies - slave->dev->last_rx) <= + the_delta_in_ticks)) { + + slave->link = BOND_LINK_UP; + slave->state = BOND_STATE_ACTIVE; + + /* primary_slave has no meaning in round-robin + * mode. the window of a slave being up and + * current_slave being null after enslaving + * is closed. 
+ */ + read_lock(&bond->ptrlock); + if (bond->current_slave == NULL) { + read_unlock(&bond->ptrlock); + printk(KERN_INFO + "%s: link status definitely up " + "for interface %s, ", + master->name, + slave->dev->name); + change_active_interface(bond); + } else { + read_unlock(&bond->ptrlock); + printk(KERN_INFO + "%s: interface %s is now up\n", + master->name, + slave->dev->name); + } + } + } else { + /* slave->link == BOND_LINK_UP */ + + /* not all switches will respond to an arp request + * when the source ip is 0, so don't take the link down + * if we don't know our ip yet + */ + if (((jiffies - slave->dev->trans_start) >= + (2*the_delta_in_ticks)) || + (((jiffies - slave->dev->last_rx) >= + (2*the_delta_in_ticks)) && my_ip !=0)) { + slave->link = BOND_LINK_DOWN; + slave->state = BOND_STATE_BACKUP; + if (slave->link_failure_count < UINT_MAX) { + slave->link_failure_count++; + } + printk(KERN_INFO + "%s: interface %s is now down.\n", + master->name, + slave->dev->name); + + read_lock(&bond->ptrlock); + if (slave == bond->current_slave) { + read_unlock(&bond->ptrlock); + change_active_interface(bond); + } else { + read_unlock(&bond->ptrlock); + } + } + } + + /* note: if switch is in round-robin mode, all links + * must tx arp to ensure all links rx an arp - otherwise + * links may oscillate or not come up at all; if switch is + * in something like xor mode, there is nothing we can + * do - all replies will be rx'ed on same link causing slaves + * to be unstable during low/no traffic periods + */ + if (IS_UP(slave->dev)) { + arp_send_all(slave); + } + } + + rtnl_exunlock(); + rtnl_shunlock(); + read_unlock_irqrestore(&bond->lock, flags); + + /* re-arm the timer */ + mod_timer(&bond->arp_timer, next_timer); +} + +/* + * When using arp monitoring in active-backup mode, this function is + * called to determine if any backup slaves have went down or a new + * current slave needs to be found. 
+ * The backup slaves never generate traffic, they are considered up by merely + * receiving traffic. If the current slave goes down, each backup slave will + * be given the opportunity to tx/rx an arp before being taken down - this + * prevents all slaves from being taken down due to the current slave not + * sending any traffic for the backups to receive. The arps are not necessarily + * necessary, any tx and rx traffic will keep the current slave up. While any + * rx traffic will keep the backup slaves up, the current slave is responsible + * for generating traffic to keep them up regardless of any other traffic they + * may have received. + * see loadbalance_arp_monitor for arp monitoring in load balancing mode + */ +static void activebackup_arp_monitor(struct net_device *master) +{ + bonding_t *bond; + unsigned long flags; + slave_t *slave; + int the_delta_in_ticks = arp_interval * HZ / 1000; + int next_timer = jiffies + (arp_interval * HZ / 1000); + + bond = (struct bonding *) master->priv; + if (master->priv == NULL) { + mod_timer(&bond->arp_timer, next_timer); + return; + } + + read_lock_irqsave(&bond->lock, flags); + + if (!IS_UP(master)) { + mod_timer(&bond->arp_timer, next_timer); + read_unlock_irqrestore(&bond->lock, flags); + return; + } + + /* determine if any slave has come up or any backup slave has + * gone down + * TODO: what about up/down delay in arp mode? 
it wasn't here before + * so it can wait + */ + slave = (slave_t *)bond; + while ((slave = slave->prev) != (slave_t *)bond) { + + if (slave->link != BOND_LINK_UP) { + if ((jiffies - slave->dev->last_rx) <= + the_delta_in_ticks) { + + slave->link = BOND_LINK_UP; + write_lock(&bond->ptrlock); + if ((bond->current_slave == NULL) && + ((jiffies - slave->dev->trans_start) <= + the_delta_in_ticks)) { + bond->current_slave = slave; + bond_set_slave_active_flags(slave); + bond_mc_update(bond, slave, NULL); + bond->current_arp_slave = NULL; + } else if (bond->current_slave != slave) { + /* this slave has just come up but we + * already have a current slave; this + * can also happen if bond_enslave adds + * a new slave that is up while we are + * searching for a new slave + */ + bond_set_slave_inactive_flags(slave); + bond->current_arp_slave = NULL; + } + + if (slave == bond->current_slave) { + printk(KERN_INFO + "%s: %s is up and now the " + "active interface\n", + master->name, + slave->dev->name); + } else { + printk(KERN_INFO + "%s: backup interface %s is " + "now up\n", + master->name, + slave->dev->name); + } + + write_unlock(&bond->ptrlock); + } + } else { + read_lock(&bond->ptrlock); + if ((slave != bond->current_slave) && + (bond->current_arp_slave == NULL) && + (((jiffies - slave->dev->last_rx) >= + 3*the_delta_in_ticks) && (my_ip != 0))) { + /* a backup slave has gone down; three times + * the delta allows the current slave to be + * taken out before the backup slave. 
+ * note: a non-null current_arp_slave indicates + * the current_slave went down and we are + * searching for a new one; under this + * condition we only take the current_slave + * down - this gives each slave a chance to + * tx/rx traffic before being taken out + */ + read_unlock(&bond->ptrlock); + slave->link = BOND_LINK_DOWN; + if (slave->link_failure_count < UINT_MAX) { + slave->link_failure_count++; + } + bond_set_slave_inactive_flags(slave); + printk(KERN_INFO + "%s: backup interface %s is now down\n", + master->name, + slave->dev->name); + } else { + read_unlock(&bond->ptrlock); + } + } + } + + read_lock(&bond->ptrlock); + slave = bond->current_slave; + read_unlock(&bond->ptrlock); + + if (slave != NULL) { + + /* if we have sent traffic in the past 2*arp_intervals but + * haven't xmit and rx traffic in that time interval, select + * a different slave. slave->jiffies is only updated when + * a slave first becomes the current_slave - not necessarily + * after every arp; this ensures the slave has a full 2*delta + * before being taken out. 
if a primary is being used, check + * if it is up and needs to take over as the current_slave + */ + if ((((jiffies - slave->dev->trans_start) >= + (2*the_delta_in_ticks)) || + (((jiffies - slave->dev->last_rx) >= + (2*the_delta_in_ticks)) && (my_ip != 0))) && + ((jiffies - slave->jiffies) >= 2*the_delta_in_ticks)) { + + slave->link = BOND_LINK_DOWN; + if (slave->link_failure_count < UINT_MAX) { + slave->link_failure_count++; + } + printk(KERN_INFO "%s: link status down for " + "active interface %s, disabling it", + master->name, + slave->dev->name); + slave = change_active_interface(bond); + bond->current_arp_slave = slave; + if (slave != NULL) { + slave->jiffies = jiffies; + } + + } else if ((bond->primary_slave != NULL) && + (bond->primary_slave != slave) && + (bond->primary_slave->link == BOND_LINK_UP)) { + /* at this point, slave is the current_slave */ + printk(KERN_INFO + "%s: changing from interface %s to primary " + "interface %s\n", + master->name, + slave->dev->name, + bond->primary_slave->dev->name); + + /* primary is up so switch to it */ + bond_set_slave_inactive_flags(slave); + bond_mc_update(bond, bond->primary_slave, slave); + write_lock(&bond->ptrlock); + bond->current_slave = bond->primary_slave; + write_unlock(&bond->ptrlock); + slave = bond->primary_slave; + bond_set_slave_active_flags(slave); + slave->jiffies = jiffies; + } else { + bond->current_arp_slave = NULL; + } + + /* the current slave must tx an arp to ensure backup slaves + * rx traffic + */ + if ((slave != NULL) && + (((jiffies - slave->dev->last_rx) >= the_delta_in_ticks) && + (my_ip != 0))) { + arp_send_all(slave); + } + } + + /* if we don't have a current_slave, search for the next available + * backup slave from the current_arp_slave and make it the candidate + * for becoming the current_slave + */ + if (slave == NULL) { + + if ((bond->current_arp_slave == NULL) || + (bond->current_arp_slave == (slave_t *)bond)) { + bond->current_arp_slave = bond->prev; + } + + if 
(bond->current_arp_slave != (slave_t *)bond) { + bond_set_slave_inactive_flags(bond->current_arp_slave); + slave = bond->current_arp_slave->next; + + /* search for next candidate */ + do { + if (IS_UP(slave->dev)) { + slave->link = BOND_LINK_BACK; + bond_set_slave_active_flags(slave); + arp_send_all(slave); + slave->jiffies = jiffies; + bond->current_arp_slave = slave; + break; + } + + /* if the link state is up at this point, we + * mark it down - this can happen if we have + * simultaneous link failures and + * change_active_interface doesn't make this + * one the current slave so it is still marked + * up when it is actually down + */ + if (slave->link == BOND_LINK_UP) { + slave->link = BOND_LINK_DOWN; + if (slave->link_failure_count < + UINT_MAX) { + slave->link_failure_count++; + } + + bond_set_slave_inactive_flags(slave); + printk(KERN_INFO + "%s: backup interface " + "%s is now down.\n", + master->name, + slave->dev->name); + } + } while ((slave = slave->next) != + bond->current_arp_slave->next); + } + } + + mod_timer(&bond->arp_timer, next_timer); + read_unlock_irqrestore(&bond->lock, flags); +} + +typedef uint32_t in_addr_t; + +int +my_inet_aton(char *cp, unsigned long *the_addr) { + static const in_addr_t max[4] = { 0xffffffff, 0xffffff, 0xffff, 0xff }; + in_addr_t val; + char c; + union iaddr { + uint8_t bytes[4]; + uint32_t word; + } res; + uint8_t *pp = res.bytes; + int digit,base; + + res.word = 0; + + c = *cp; + for (;;) { + /* + * Collect number up to ``.''. + * Values are specified as for C: + * 0x=hex, 0=octal, isdigit=decimal. 
+ */ + if (!isdigit(c)) goto ret_0; + val = 0; base = 10; digit = 0; + for (;;) { + if (isdigit(c)) { + val = (val * base) + (c - '0'); + c = *++cp; + digit = 1; + } else { + break; + } + } + if (c == '.') { + /* + * Internet format: + * a.b.c.d + * a.b.c (with c treated as 16 bits) + * a.b (with b treated as 24 bits) + */ + if (pp > res.bytes + 2 || val > 0xff) { + goto ret_0; + } + *pp++ = val; + c = *++cp; + } else + break; + } + /* + * Check for trailing characters. + */ + if (c != '\0' && (!isascii(c) || !isspace(c))) { + goto ret_0; + } + /* + * Did we get a valid digit? + */ + if (!digit) { + goto ret_0; + } + + /* Check whether the last part is in its limits depending on + the number of parts in total. */ + if (val > max[pp - res.bytes]) { + goto ret_0; + } + + if (the_addr != NULL) { + *the_addr = res.word | htonl (val); + } + + return (1); + +ret_0: + return (0); +} + +static int bond_sethwaddr(struct net_device *master, struct net_device *slave) +{ +#ifdef BONDING_DEBUG + printk(KERN_CRIT "bond_sethwaddr: master=%x\n", (unsigned int)master); + printk(KERN_CRIT "bond_sethwaddr: slave=%x\n", (unsigned int)slave); + printk(KERN_CRIT "bond_sethwaddr: slave->addr_len=%d\n", slave->addr_len); +#endif + memcpy(master->dev_addr, slave->dev_addr, slave->addr_len); + return 0; +} + +static int bond_info_query(struct net_device *master, struct ifbond *info) +{ + bonding_t *bond = (struct bonding *) master->priv; + slave_t *slave; + unsigned long flags; + + info->bond_mode = bond_mode; + info->num_slaves = 0; + info->miimon = miimon; + + read_lock_irqsave(&bond->lock, flags); + for (slave = bond->prev; slave != (slave_t *)bond; slave = slave->prev) { + info->num_slaves++; + } + read_unlock_irqrestore(&bond->lock, flags); + + return 0; +} + +static int bond_slave_info_query(struct net_device *master, + struct ifslave *info) +{ + bonding_t *bond = (struct bonding *) master->priv; + slave_t *slave; + int cur_ndx = 0; + unsigned long flags; + + if (info->slave_id < 0) { 
+ return -ENODEV; + } + + read_lock_irqsave(&bond->lock, flags); + for (slave = bond->prev; + slave != (slave_t *)bond && cur_ndx < info->slave_id; + slave = slave->prev) { + cur_ndx++; + } + read_unlock_irqrestore(&bond->lock, flags); + + if (slave != (slave_t *)bond) { + strcpy(info->slave_name, slave->dev->name); + info->link = slave->link; + info->state = slave->state; + info->link_failure_count = slave->link_failure_count; + } else { + return -ENODEV; + } + + return 0; +} + +static int bond_ioctl(struct net_device *master_dev, struct ifreq *ifr, int cmd) +{ + struct net_device *slave_dev = NULL; + struct ifbond *u_binfo = NULL, k_binfo; + struct ifslave *u_sinfo = NULL, k_sinfo; + struct mii_ioctl_data *mii = NULL; + int ret = 0; + +#ifdef BONDING_DEBUG + printk(KERN_INFO "bond_ioctl: master=%s, cmd=%d\n", + master_dev->name, cmd); +#endif + + switch (cmd) { + case SIOCGMIIPHY: + mii = (struct mii_ioctl_data *)&ifr->ifr_data; + if (mii == NULL) { + return -EINVAL; + } + mii->phy_id = 0; + /* Fall Through */ + case SIOCGMIIREG: + /* + * We do this again just in case we were called by SIOCGMIIREG + * instead of SIOCGMIIPHY. 
+ */ + mii = (struct mii_ioctl_data *)&ifr->ifr_data; + if (mii == NULL) { + return -EINVAL; + } + if (mii->reg_num == 1) { + mii->val_out = bond_check_mii_link( + (struct bonding *)master_dev->priv); + } + return 0; + case BOND_INFO_QUERY_OLD: + case SIOCBONDINFOQUERY: + u_binfo = (struct ifbond *)ifr->ifr_data; + if (copy_from_user(&k_binfo, u_binfo, sizeof(ifbond))) { + return -EFAULT; + } + ret = bond_info_query(master_dev, &k_binfo); + if (ret == 0) { + if (copy_to_user(u_binfo, &k_binfo, sizeof(ifbond))) { + return -EFAULT; + } + } + return ret; + case BOND_SLAVE_INFO_QUERY_OLD: + case SIOCBONDSLAVEINFOQUERY: + u_sinfo = (struct ifslave *)ifr->ifr_data; + if (copy_from_user(&k_sinfo, u_sinfo, sizeof(ifslave))) { + return -EFAULT; + } + ret = bond_slave_info_query(master_dev, &k_sinfo); + if (ret == 0) { + if (copy_to_user(u_sinfo, &k_sinfo, sizeof(ifslave))) { + return -EFAULT; + } + } + return ret; + } + + if (!capable(CAP_NET_ADMIN)) { + return -EPERM; + } + + slave_dev = dev_get_by_name(ifr->ifr_slave); + +#ifdef BONDING_DEBUG + printk(KERN_INFO "slave_dev=%x: \n", (unsigned int)slave_dev); + printk(KERN_INFO "slave_dev->name=%s: \n", slave_dev->name); +#endif + + if (slave_dev == NULL) { + ret = -ENODEV; + } else { + switch (cmd) { + case BOND_ENSLAVE_OLD: + case SIOCBONDENSLAVE: + ret = bond_enslave(master_dev, slave_dev); + break; + case BOND_RELEASE_OLD: + case SIOCBONDRELEASE: + ret = bond_release(master_dev, slave_dev); + break; + case BOND_SETHWADDR_OLD: + case SIOCBONDSETHWADDR: + ret = bond_sethwaddr(master_dev, slave_dev); + break; + case BOND_CHANGE_ACTIVE_OLD: + case SIOCBONDCHANGEACTIVE: + if (bond_mode == BOND_MODE_ACTIVEBACKUP) { + ret = bond_change_active(master_dev, slave_dev); + } + else { + ret = -EINVAL; + } + break; + default: + ret = -EOPNOTSUPP; + } + dev_put(slave_dev); + } + return ret; +} + +#ifdef CONFIG_NET_FASTROUTE +static int bond_accept_fastpath(struct net_device *dev, struct dst_entry *dst) +{ + return -1; +} +#endif + +/* 
+ * in broadcast mode, we send everything to all usable interfaces. + */ +static int bond_xmit_broadcast(struct sk_buff *skb, struct net_device *dev) +{ + slave_t *slave, *start_at; + struct bonding *bond = (struct bonding *) dev->priv; + unsigned long flags; + struct net_device *device_we_should_send_to = 0; + + if (!IS_UP(dev)) { /* bond down */ + dev_kfree_skb(skb); + return 0; + } + + read_lock_irqsave(&bond->lock, flags); + + read_lock(&bond->ptrlock); + slave = start_at = bond->current_slave; + read_unlock(&bond->ptrlock); + + if (slave == NULL) { /* we're at the root, get the first slave */ + /* no suitable interface, frame not sent */ + read_unlock_irqrestore(&bond->lock, flags); + dev_kfree_skb(skb); + return 0; + } + + do { + if (IS_UP(slave->dev) + && (slave->link == BOND_LINK_UP) + && (slave->state == BOND_STATE_ACTIVE)) { + if (device_we_should_send_to) { + struct sk_buff *skb2; + if ((skb2 = skb_clone(skb, GFP_ATOMIC)) == NULL) { + printk(KERN_ERR "bond_xmit_broadcast: skb_clone() failed\n"); + continue; + } + + skb2->dev = device_we_should_send_to; + skb2->priority = 1; + dev_queue_xmit(skb2); + } + device_we_should_send_to = slave->dev; + } + } while ((slave = slave->next) != start_at); + + if (device_we_should_send_to) { + skb->dev = device_we_should_send_to; + skb->priority = 1; + dev_queue_xmit(skb); + } else + dev_kfree_skb(skb); + + /* frame sent to all suitable interfaces */ + read_unlock_irqrestore(&bond->lock, flags); + return 0; +} + +static int bond_xmit_roundrobin(struct sk_buff *skb, struct net_device *dev) +{ + slave_t *slave, *start_at; + struct bonding *bond = (struct bonding *) dev->priv; + unsigned long flags; + + if (!IS_UP(dev)) { /* bond down */ + dev_kfree_skb(skb); + return 0; + } + + read_lock_irqsave(&bond->lock, flags); + + read_lock(&bond->ptrlock); + slave = start_at = bond->current_slave; + read_unlock(&bond->ptrlock); + + if (slave == NULL) { /* we're at the root, get the first slave */ + /* no suitable interface, frame 
not sent */ + dev_kfree_skb(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + + do { + if (IS_UP(slave->dev) + && (slave->link == BOND_LINK_UP) + && (slave->state == BOND_STATE_ACTIVE)) { + + skb->dev = slave->dev; + skb->priority = 1; + dev_queue_xmit(skb); + + write_lock(&bond->ptrlock); + bond->current_slave = slave->next; + write_unlock(&bond->ptrlock); + + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + } while ((slave = slave->next) != start_at); + + /* no suitable interface, frame not sent */ + dev_kfree_skb(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; +} + +/* + * in XOR mode, we determine the output device by performing xor on + * the source and destination hw addresses. If this device is not + * enabled, find the next slave following this xor slave. + */ +static int bond_xmit_xor(struct sk_buff *skb, struct net_device *dev) +{ + slave_t *slave, *start_at; + struct bonding *bond = (struct bonding *) dev->priv; + unsigned long flags; + struct ethhdr *data = (struct ethhdr *)skb->data; + int slave_no; + + if (!IS_UP(dev)) { /* bond down */ + dev_kfree_skb(skb); + return 0; + } + + read_lock_irqsave(&bond->lock, flags); + slave = bond->prev; + + /* we're at the root, get the first slave */ + if (bond->slave_cnt == 0) { + /* no suitable interface, frame not sent */ + dev_kfree_skb(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + + slave_no = (data->h_dest[5]^slave->dev->dev_addr[5]) % bond->slave_cnt; + + while ( (slave_no > 0) && (slave != (slave_t *)bond) ) { + slave = slave->prev; + slave_no--; + } + start_at = slave; + + do { + if (IS_UP(slave->dev) + && (slave->link == BOND_LINK_UP) + && (slave->state == BOND_STATE_ACTIVE)) { + + skb->dev = slave->dev; + skb->priority = 1; + dev_queue_xmit(skb); + + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + } while ((slave = slave->next) != start_at); + + /* no suitable interface, frame not sent */ + dev_kfree_skb(skb); + 
read_unlock_irqrestore(&bond->lock, flags); + return 0; +} + +/* + * in active-backup mode, we know that bond->current_slave is always valid if + * the bond has a usable interface. + */ +static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *dev) +{ + struct bonding *bond = (struct bonding *) dev->priv; + unsigned long flags; + int ret; + + if (!IS_UP(dev)) { /* bond down */ + dev_kfree_skb(skb); + return 0; + } + + /* if we are sending arp packets, try to at least + identify our own ip address */ + if ( (arp_interval > 0) && (my_ip == 0) && + (skb->protocol == __constant_htons(ETH_P_ARP) ) ) { + char *the_ip = (((char *)skb->data)) + + sizeof(struct ethhdr) + + sizeof(struct arphdr) + + ETH_ALEN; + memcpy(&my_ip, the_ip, 4); + } + + /* if we are sending arp packets and don't know + * the target hw address, save it so we don't need + * to use a broadcast address. + * don't do this if in active backup mode because the slaves must + * receive packets to stay up, and the only ones they receive are + * broadcasts. 
+ */ + if ( (bond_mode != BOND_MODE_ACTIVEBACKUP) && + (arp_ip_count == 1) && + (arp_interval > 0) && (arp_target_hw_addr == NULL) && + (skb->protocol == __constant_htons(ETH_P_IP) ) ) { + struct ethhdr *eth_hdr = + (struct ethhdr *) (((char *)skb->data)); + struct iphdr *ip_hdr = (struct iphdr *)(eth_hdr + 1); + + if (arp_target[0] == ip_hdr->daddr) { + arp_target_hw_addr = kmalloc(ETH_ALEN, GFP_KERNEL); + if (arp_target_hw_addr != NULL) + memcpy(arp_target_hw_addr, eth_hdr->h_dest, ETH_ALEN); + } + } + + read_lock_irqsave(&bond->lock, flags); + + read_lock(&bond->ptrlock); + if (bond->current_slave != NULL) { /* one usable interface */ + skb->dev = bond->current_slave->dev; + read_unlock(&bond->ptrlock); + skb->priority = 1; + ret = dev_queue_xmit(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + else { + read_unlock(&bond->ptrlock); + } + + /* no suitable interface, frame not sent */ +#ifdef BONDING_DEBUG + printk(KERN_INFO "There was no suitable interface, so we don't transmit\n"); +#endif + dev_kfree_skb(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; +} + +static struct net_device_stats *bond_get_stats(struct net_device *dev) +{ + bonding_t *bond = dev->priv; + struct net_device_stats *stats = bond->stats, *sstats; + slave_t *slave; + unsigned long flags; + + memset(bond->stats, 0, sizeof(struct net_device_stats)); + + read_lock_irqsave(&bond->lock, flags); + + for (slave = bond->prev; slave != (slave_t *)bond; slave = slave->prev) { + sstats = slave->dev->get_stats(slave->dev); + + stats->rx_packets += sstats->rx_packets; + stats->rx_bytes += sstats->rx_bytes; + stats->rx_errors += sstats->rx_errors; + stats->rx_dropped += sstats->rx_dropped; + + stats->tx_packets += sstats->tx_packets; + stats->tx_bytes += sstats->tx_bytes; + stats->tx_errors += sstats->tx_errors; + stats->tx_dropped += sstats->tx_dropped; + + stats->multicast += sstats->multicast; + stats->collisions += sstats->collisions; + + stats->rx_length_errors += 
sstats->rx_length_errors; + stats->rx_over_errors += sstats->rx_over_errors; + stats->rx_crc_errors += sstats->rx_crc_errors; + stats->rx_frame_errors += sstats->rx_frame_errors; + stats->rx_fifo_errors += sstats->rx_fifo_errors; + stats->rx_missed_errors += sstats->rx_missed_errors; + + stats->tx_aborted_errors += sstats->tx_aborted_errors; + stats->tx_carrier_errors += sstats->tx_carrier_errors; + stats->tx_fifo_errors += sstats->tx_fifo_errors; + stats->tx_heartbeat_errors += sstats->tx_heartbeat_errors; + stats->tx_window_errors += sstats->tx_window_errors; + + } + + read_unlock_irqrestore(&bond->lock, flags); + return stats; +} + +static int bond_get_info(char *buf, char **start, off_t offset, int length) +{ + bonding_t *bond = these_bonds; + int len = 0; + off_t begin = 0; + u16 link; + slave_t *slave = NULL; + unsigned long flags; + + while (bond != NULL) { + /* + * This function locks the mutex, so we can't lock it until + * afterwards + */ + link = bond_check_mii_link(bond); + + len += sprintf(buf + len, "Bonding Mode: %s\n", + bond_mode_name()); + + if (bond_mode == BOND_MODE_ACTIVEBACKUP) { + read_lock_irqsave(&bond->lock, flags); + read_lock(&bond->ptrlock); + if (bond->current_slave != NULL) { + len += sprintf(buf + len, + "Currently Active Slave: %s\n", + bond->current_slave->dev->name); + } + read_unlock(&bond->ptrlock); + read_unlock_irqrestore(&bond->lock, flags); + } + + len += sprintf(buf + len, "MII Status: "); + len += sprintf(buf + len, + link == BMSR_LSTATUS ? 
"up\n" : "down\n"); + len += sprintf(buf + len, "MII Polling Interval (ms): %d\n", + miimon); + len += sprintf(buf + len, "Up Delay (ms): %d\n", + updelay * miimon); + len += sprintf(buf + len, "Down Delay (ms): %d\n", + downdelay * miimon); + len += sprintf(buf + len, "Multicast Mode: %s\n", + multicast_mode_name()); + + read_lock_irqsave(&bond->lock, flags); + for (slave = bond->prev; slave != (slave_t *)bond; + slave = slave->prev) { + len += sprintf(buf + len, "\nSlave Interface: %s\n", slave->dev->name); + + len += sprintf(buf + len, "MII Status: "); + + len += sprintf(buf + len, + slave->link == BOND_LINK_UP ? + "up\n" : "down\n"); + len += sprintf(buf + len, "Link Failure Count: %d\n", + slave->link_failure_count); + + len += sprintf(buf + len, + "Permanent HW addr: %02x:%02x:%02x:%02x:%02x:%02x\n", + slave->perm_hwaddr[0], + slave->perm_hwaddr[1], + slave->perm_hwaddr[2], + slave->perm_hwaddr[3], + slave->perm_hwaddr[4], + slave->perm_hwaddr[5]); + } + read_unlock_irqrestore(&bond->lock, flags); + + /* + * Figure out the calcs for the /proc/net interface + */ + *start = buf + (offset - begin); + len -= (offset - begin); + if (len > length) { + len = length; + } + if (len < 0) { + len = 0; + } + + + bond = bond->next_bond; + } + return len; +} + +static int bond_event(struct notifier_block *this, unsigned long event, + void *ptr) +{ + struct bonding *this_bond = (struct bonding *)these_bonds; + struct bonding *last_bond; + struct net_device *event_dev = (struct net_device *)ptr; + + /* while there are bonds configured */ + while (this_bond != NULL) { + if (this_bond == event_dev->priv ) { + switch (event) { + case NETDEV_UNREGISTER: + /* + * remove this bond from a linked list of + * bonds + */ + if (this_bond == these_bonds) { + these_bonds = this_bond->next_bond; + } else { + for (last_bond = these_bonds; + last_bond != NULL; + last_bond = last_bond->next_bond) { + if (last_bond->next_bond == + this_bond) { + last_bond->next_bond = + this_bond->next_bond; 
+ } + } + } + return NOTIFY_DONE; + + default: + return NOTIFY_DONE; + } + } else if (this_bond->device == event_dev->master) { + switch (event) { + case NETDEV_UNREGISTER: + bond_release(this_bond->device, event_dev); + break; + } + return NOTIFY_DONE; + } + this_bond = this_bond->next_bond; + } + return NOTIFY_DONE; +} + +static struct notifier_block bond_netdev_notifier = { + notifier_call: bond_event, +}; + +static int __init bond_init(struct net_device *dev) +{ + bonding_t *bond, *this_bond, *last_bond; + int count; + +#ifdef BONDING_DEBUG + printk (KERN_INFO "Begin bond_init for %s\n", dev->name); +#endif + bond = kmalloc(sizeof(struct bonding), GFP_KERNEL); + if (bond == NULL) { + return -ENOMEM; + } + memset(bond, 0, sizeof(struct bonding)); + + /* initialize rwlocks */ + rwlock_init(&bond->lock); + rwlock_init(&bond->ptrlock); + + bond->stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL); + if (bond->stats == NULL) { + kfree(bond); + return -ENOMEM; + } + memset(bond->stats, 0, sizeof(struct net_device_stats)); + + bond->next = bond->prev = (slave_t *)bond; + bond->current_slave = NULL; + bond->current_arp_slave = NULL; + bond->device = dev; + dev->priv = bond; + + /* Initialize the device structure. */ + switch (bond_mode) { + case BOND_MODE_ACTIVEBACKUP: + dev->hard_start_xmit = bond_xmit_activebackup; + break; + case BOND_MODE_ROUNDROBIN: + dev->hard_start_xmit = bond_xmit_roundrobin; + break; + case BOND_MODE_XOR: + dev->hard_start_xmit = bond_xmit_xor; + break; + case BOND_MODE_BROADCAST: + dev->hard_start_xmit = bond_xmit_broadcast; + break; + default: + printk(KERN_ERR "Unknown bonding mode %d\n", bond_mode); + kfree(bond->stats); + kfree(bond); + return -EINVAL; + } + + dev->get_stats = bond_get_stats; + dev->open = bond_open; + dev->stop = bond_close; + dev->set_multicast_list = set_multicast_list; + dev->do_ioctl = bond_ioctl; + + /* + * Fill in the fields of the device structure with ethernet-generic + * values. 
+ */ + + ether_setup(dev); + + dev->tx_queue_len = 0; + dev->flags |= IFF_MASTER|IFF_MULTICAST; +#ifdef CONFIG_NET_FASTROUTE + dev->accept_fastpath = bond_accept_fastpath; +#endif + + printk(KERN_INFO "%s registered with", dev->name); + if (miimon > 0) { + printk(" MII link monitoring set to %d ms", miimon); + updelay /= miimon; + downdelay /= miimon; + } else { + printk("out MII link monitoring"); + } + printk(", in %s mode.\n", bond_mode_name()); + + printk(KERN_INFO "%s registered with", dev->name); + if (arp_interval > 0) { + printk(" ARP monitoring set to %d ms with %d target(s):", + arp_interval, arp_ip_count); + for (count=0 ; count<arp_ip_count ; count++) + printk (" %s", arp_ip_target[count]); + printk("\n"); + } else { + printk("out ARP monitoring\n"); + } + +#ifdef CONFIG_PROC_FS + bond->bond_proc_dir = proc_mkdir(dev->name, proc_net); + if (bond->bond_proc_dir == NULL) { + printk(KERN_ERR "%s: Cannot init /proc/net/%s/\n", + dev->name, dev->name); + kfree(bond->stats); + kfree(bond); + return -ENOMEM; + } + bond->bond_proc_info_file = + create_proc_info_entry("info", 0, bond->bond_proc_dir, + bond_get_info); + if (bond->bond_proc_info_file == NULL) { + printk(KERN_ERR "%s: Cannot init /proc/net/%s/info\n", + dev->name, dev->name); + remove_proc_entry(dev->name, proc_net); + kfree(bond->stats); + kfree(bond); + return -ENOMEM; + } +#endif /* CONFIG_PROC_FS */ + + if (first_pass == 1) { + these_bonds = bond; + register_netdevice_notifier(&bond_netdev_notifier); + first_pass = 0; + } else { + last_bond = these_bonds; + this_bond = these_bonds->next_bond; + while (this_bond != NULL) { + last_bond = this_bond; + this_bond = this_bond->next_bond; + } + last_bond->next_bond = bond; + } + + return 0; +} + +/* +static int __init bond_probe(struct net_device *dev) +{ + bond_init(dev); + return 0; +} + */ + +/* + * Convert string input module parms. Accept either the + * number of the mode or its string name. 
+ */ +static inline int +bond_parse_parm(char *mode_arg, struct bond_parm_tbl *tbl) +{ + int i; + + for (i = 0; tbl[i].modename != NULL; i++) { + if ((isdigit(*mode_arg) && + tbl[i].mode == simple_strtol(mode_arg, NULL, 0)) || + (0 == strncmp(mode_arg, tbl[i].modename, + strlen(tbl[i].modename)))) { + return tbl[i].mode; + } + } + + return -1; +} + + +static int __init bonding_init(void) +{ + int no; + int err; + + /* Find a name for this unit */ + static struct net_device *dev_bond = NULL; + + printk(KERN_INFO "%s", version); + + /* + * Convert string parameters. + */ + if (mode) { + bond_mode = bond_parse_parm(mode, bond_mode_tbl); + if (bond_mode == -1) { + printk(KERN_WARNING + "bonding_init(): Invalid bonding mode \"%s\"\n", + mode == NULL ? "NULL" : mode); + return -EINVAL; + } + } + + if (multicast) { + multicast_mode = bond_parse_parm(multicast, bond_mc_tbl); + if (multicast_mode == -1) { + printk(KERN_WARNING + "bonding_init(): Invalid multicast mode \"%s\"\n", + multicast == NULL ? 
"NULL" : multicast); + return -EINVAL; + } + } + + if (max_bonds < 1 || max_bonds > INT_MAX) { + printk(KERN_WARNING + "bonding_init(): max_bonds (%d) not in range %d-%d, " + "so it was reset to BOND_DEFAULT_MAX_BONDS (%d)", + max_bonds, 1, INT_MAX, BOND_DEFAULT_MAX_BONDS); + max_bonds = BOND_DEFAULT_MAX_BONDS; + } + dev_bond = dev_bonds = kmalloc(max_bonds*sizeof(struct net_device), + GFP_KERNEL); + if (dev_bond == NULL) { + return -ENOMEM; + } + memset(dev_bonds, 0, max_bonds*sizeof(struct net_device)); + + if (miimon < 0) { + printk(KERN_WARNING + "bonding_init(): miimon module parameter (%d), " + "not in range 0-%d, so it was reset to %d\n", + miimon, INT_MAX, BOND_LINK_MON_INTERV); + miimon = BOND_LINK_MON_INTERV; + } + + if (updelay < 0) { + printk(KERN_WARNING + "bonding_init(): updelay module parameter (%d), " + "not in range 0-%d, so it was reset to 0\n", + updelay, INT_MAX); + updelay = 0; + } + + if (downdelay < 0) { + printk(KERN_WARNING + "bonding_init(): downdelay module parameter (%d), " + "not in range 0-%d, so it was reset to 0\n", + downdelay, INT_MAX); + downdelay = 0; + } + + if (miimon == 0) { + if ((updelay != 0) || (downdelay != 0)) { + /* just warn the user the up/down delay will have + * no effect since miimon is zero... 
+ */ + printk(KERN_WARNING + "bonding_init(): miimon module parameter not " + "set and updelay (%d) or downdelay (%d) module " + "parameter is set; updelay and downdelay have " + "no effect unless miimon is set\n", + updelay, downdelay); + } + } else { + /* don't allow arp monitoring */ + if (arp_interval != 0) { + printk(KERN_WARNING + "bonding_init(): miimon (%d) and arp_interval " + "(%d) can't be used simultaneously, " + "disabling ARP monitoring\n", + miimon, arp_interval); + arp_interval = 0; + } + + if ((updelay % miimon) != 0) { + /* updelay will be rounded in bond_init() when it + * is divided by miimon, we just inform user here + */ + printk(KERN_WARNING + "bonding_init(): updelay (%d) is not a multiple " + "of miimon (%d), updelay rounded to %d ms\n", + updelay, miimon, (updelay / miimon) * miimon); + } + + if ((downdelay % miimon) != 0) { + /* downdelay will be rounded in bond_init() when it + * is divided by miimon, we just inform user here + */ + printk(KERN_WARNING + "bonding_init(): downdelay (%d) is not a " + "multiple of miimon (%d), downdelay rounded " + "to %d ms\n", + downdelay, miimon, + (downdelay / miimon) * miimon); + } + } + + if (arp_interval < 0) { + printk(KERN_WARNING + "bonding_init(): arp_interval module parameter (%d), " + "not in range 0-%d, so it was reset to %d\n", + arp_interval, INT_MAX, BOND_LINK_ARP_INTERV); + arp_interval = BOND_LINK_ARP_INTERV; + } + + for (arp_ip_count=0 ; + (arp_ip_count < MAX_ARP_IP_TARGETS) && arp_ip_target[arp_ip_count]; + arp_ip_count++ ) { + /* TODO: check and log bad ip address */ + if (my_inet_aton(arp_ip_target[arp_ip_count], + &arp_target[arp_ip_count]) == 0) { + printk(KERN_WARNING + "bonding_init(): bad arp_ip_target module " + "parameter (%s), ARP monitoring will not be " + "performed\n", + arp_ip_target[arp_ip_count]); + arp_interval = 0; + } + } + + + if ( (arp_interval > 0) && (arp_ip_count==0)) { + /* don't allow arping if no arp_ip_target given... 
*/ + printk(KERN_WARNING + "bonding_init(): arp_interval module parameter " + "(%d) specified without providing an arp_ip_target " + "parameter, arp_interval was reset to 0\n", + arp_interval); + arp_interval = 0; + } + + if ((miimon == 0) && (arp_interval == 0)) { + /* miimon and arp_interval not set, we need one so things + * work as expected, see bonding.txt for details + */ + printk(KERN_ERR + "bonding_init(): either miimon or " + "arp_interval and arp_ip_target module parameters " + "must be specified, otherwise bonding will not detect " + "link failures! see bonding.txt for details.\n"); + } + + if ((primary != NULL) && (bond_mode != BOND_MODE_ACTIVEBACKUP)){ + /* currently, using a primary only makes sense + * in active backup mode + */ + printk(KERN_WARNING + "bonding_init(): %s primary device specified but has " + " no effect in %s mode\n", + primary, bond_mode_name()); + primary = NULL; + } + + + for (no = 0; no < max_bonds; no++) { + dev_bond->init = bond_init; + + err = dev_alloc_name(dev_bond,"bond%d"); + if (err < 0) { + kfree(dev_bonds); + return err; + } + SET_MODULE_OWNER(dev_bond); + if (register_netdev(dev_bond) != 0) { + kfree(dev_bonds); + return -EIO; + } + dev_bond++; + } + return 0; +} + +static void __exit bonding_exit(void) +{ + struct net_device *dev_bond = dev_bonds; + struct bonding *bond; + int no; + + unregister_netdevice_notifier(&bond_netdev_notifier); + + for (no = 0; no < max_bonds; no++) { + +#ifdef CONFIG_PROC_FS + bond = (struct bonding *) dev_bond->priv; + remove_proc_entry("info", bond->bond_proc_dir); + remove_proc_entry(dev_bond->name, proc_net); +#endif + unregister_netdev(dev_bond); + kfree(bond->stats); + kfree(dev_bond->priv); + + dev_bond->priv = NULL; + dev_bond++; + } + kfree(dev_bonds); +} + +module_init(bonding_init); +module_exit(bonding_exit); +MODULE_LICENSE("GPL"); +MODULE_DESCRIPTION(DRV_DESCRIPTION ", v" DRV_VERSION); + +/* + * Local variables: + * c-indent-level: 8 + * c-basic-offset: 8 + * tab-width: 8 + * 
End: + */ diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding/Makefile linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/Makefile --- linux-2.4.20-bonding-20030317/drivers/net/bonding/Makefile 1970-01-01 02:00:00.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/Makefile 2003-03-18 17:03:29.000000000 +0200 @@ -0,0 +1,12 @@ +# +# Makefile for the Ethernet Bonding driver +# + +O_TARGET := bonding.o + +obj-y := bond_main.o + +obj-m := $(O_TARGET) + +include $(TOPDIR)/Rules.make + diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding.c linux-2.4.20-bonding-20030317-devel/drivers/net/bonding.c --- linux-2.4.20-bonding-20030317/drivers/net/bonding.c 2003-03-18 17:03:29.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding.c 1970-01-01 02:00:00.000000000 +0200 @@ -1,3434 +0,0 @@ -/* - * originally based on the dummy device. - * - * Copyright 1999, Thomas Davis, tadavis@lbl.gov. - * Licensed under the GPL. Based on dummy.c, and eql.c devices. - * - * bonding.c: an Ethernet Bonding driver - * - * This is useful to talk to a Cisco EtherChannel compatible equipment: - * Cisco 5500 - * Sun Trunking (Solaris) - * Alteon AceDirector Trunks - * Linux Bonding - * and probably many L2 switches ... - * - * How it works: - * ifconfig bond0 ipaddress netmask up - * will setup a network device, with an ip address. No mac address - * will be assigned at this time. The hw mac address will come from - * the first slave bonded to the channel. All slaves will then use - * this hw mac address. - * - * ifconfig bond0 down - * will release all slaves, marking them as down. - * - * ifenslave bond0 eth0 - * will attach eth0 to bond0 as a slave. eth0 hw mac address will either - * a: be used as initial mac address - * b: if a hw mac address already is there, eth0's hw mac address - * will then be set from bond0. - * - * v0.1 - first working version. - * v0.2 - changed stats to be calculated by summing slaves stats. 
- * - * Changes: - * Arnaldo Carvalho de Melo - * - fix leaks on failure at bond_init - * - * 2000/09/30 - Willy Tarreau - * - added trivial code to release a slave device. - * - fixed security bug (CAP_NET_ADMIN not checked) - * - implemented MII link monitoring to disable dead links : - * All MII capable slaves are checked every milliseconds - * (100 ms seems good). This value can be changed by passing it to - * insmod. A value of zero disables the monitoring (default). - * - fixed an infinite loop in bond_xmit_roundrobin() when there's no - * good slave. - * - made the code hopefully SMP safe - * - * 2000/10/03 - Willy Tarreau - * - optimized slave lists based on relevant suggestions from Thomas Davis - * - implemented active-backup method to obtain HA with two switches: - * stay as long as possible on the same active interface, while we - * also monitor the backup one (MII link status) because we want to know - * if we are able to switch at any time. ( pass "mode=1" to insmod ) - * - lots of stress testings because we need it to be more robust than the - * wires ! :-> - * - * 2000/10/09 - Willy Tarreau - * - added up and down delays after link state change. - * - optimized the slaves chaining so that when we run forward, we never - * repass through the bond itself, but we can find it by searching - * backwards. Renders the deletion more difficult, but accelerates the - * scan. - * - smarter enslaving and releasing. - * - finer and more robust SMP locking - * - * 2000/10/17 - Willy Tarreau - * - fixed two potential SMP race conditions - * - * 2000/10/18 - Willy Tarreau - * - small fixes to the monitoring FSM in case of zero delays - * 2000/11/01 - Willy Tarreau - * - fixed first slave not automatically used in trunk mode. - * 2000/11/10 : spelling of "EtherChannel" corrected. - * 2000/11/13 : fixed a race condition in case of concurrent accesses to ioctl(). - * 2000/12/16 : fixed improper usage of rtnl_exlock_nowait(). - * - * 2001/1/3 - Chad N. 
Tindel - * - The bonding driver now simulates MII status monitoring, just like - * a normal network device. It will show that the link is down iff - * every slave in the bond shows that their links are down. If at least - * one slave is up, the bond's MII status will appear as up. - * - * 2001/2/7 - Chad N. Tindel - * - Applications can now query the bond from user space to get - * information which may be useful. They do this by calling - * the BOND_INFO_QUERY ioctl. Once the app knows how many slaves - * are in the bond, it can call the BOND_SLAVE_INFO_QUERY ioctl to - * get slave specific information (# link failures, etc). See - * for more details. The structs of interest - * are ifbond and ifslave. - * - * 2001/4/5 - Chad N. Tindel - * - Ported to 2.4 Kernel - * - * 2001/5/2 - Jeffrey E. Mast - * - When a device is detached from a bond, the slave device is no longer - * left thinking that is has a master. - * - * 2001/5/16 - Jeffrey E. Mast - * - memset did not appropriately initialized the bond rw_locks. Used - * rwlock_init to initialize to unlocked state to prevent deadlock when - * first attempting a lock - * - Called SET_MODULE_OWNER for bond device - * - * 2001/5/17 - Tim Anderson - * - 2 paths for releasing for slave release; 1 through ioctl - * and 2) through close. Both paths need to release the same way. - * - the free slave in bond release is changing slave status before - * the free. The netdev_set_master() is intended to change slave state - * so it should not be done as part of the release process. - * - Simple rule for slave state at release: only the active in A/B and - * only one in the trunked case. - * - * 2001/6/01 - Tim Anderson - * - Now call dev_close when releasing a slave so it doesn't screw up - * out routing table. - * - * 2001/6/01 - Chad N. Tindel - * - Added /proc support for getting bond and slave information. - * Information is in /proc/net//info. - * - Changed the locking when calling bond_close to prevent deadlock. 
- * - * 2001/8/05 - Janice Girouard - * - correct problem where refcnt of slave is not incremented in bond_ioctl - * so the system hangs when halting. - * - correct locking problem when unable to malloc in bond_enslave. - * - adding bond_xmit_xor logic. - * - adding multiple bond device support. - * - * 2001/8/13 - Erik Habbinga - * - correct locking problem with rtnl_exlock_nowait - * - * 2001/8/23 - Janice Girouard - * - bzero initial dev_bonds, to correct oops - * - convert SIOCDEVPRIVATE to new MII ioctl calls - * - * 2001/9/13 - Takao Indoh - * - Add the BOND_CHANGE_ACTIVE ioctl implementation - * - * 2001/9/14 - Mark Huth - * - Change MII_LINK_READY to not check for end of auto-negotiation, - * but only for an up link. - * - * 2001/9/20 - Chad N. Tindel - * - Add the device field to bonding_t. Previously the net_device - * corresponding to a bond wasn't available from the bonding_t - * structure. - * - * 2001/9/25 - Janice Girouard - * - add arp_monitor for active backup mode - * - * 2001/10/23 - Takao Indoh - * - Various memory leak fixes - * - * 2001/11/5 - Mark Huth - * - Don't take rtnl lock in bond_mii_monitor as it deadlocks under - * certain hotswap conditions. - * Note: this same change may be required in bond_arp_monitor ??? - * - Remove possibility of calling bond_sethwaddr with NULL slave_dev ptr - * - Handle hot swap ethernet interface deregistration events to remove - * kernel oops following hot swap of enslaved interface - * - * 2002/1/2 - Chad N. Tindel - * - Restore original slave flags at release time. 
- * - * 2002/02/18 - Erik Habbinga - * - bond_release(): calling kfree on our_slave after call to - * bond_restore_slave_flags, not before - * - bond_enslave(): saving slave flags into original_flags before - * call to netdev_set_master, so the IFF_SLAVE flag doesn't end - * up in original_flags - * - * 2002/04/05 - Mark Smith and - * Steve Mead - * - Port Gleb Natapov's multicast support patchs from 2.4.12 - * to 2.4.18 adding support for multicast. - * - * 2002/06/10 - Tony Cureington - * - corrected uninitialized pointer (ifr.ifr_data) in bond_check_dev_link; - * actually changed function to use MIIPHY, then MIIREG, and finally - * ETHTOOL to determine the link status - * - fixed bad ifr_data pointer assignments in bond_ioctl - * - corrected mode 1 being reported as active-backup in bond_get_info; - * also added text to distinguish type of load balancing (rr or xor) - * - change arp_ip_target module param from "1-12s" (array of 12 ptrs) - * to "s" (a single ptr) - * - * 2002/08/30 - Jay Vosburgh - * - Removed acquisition of xmit_lock in set_multicast_list; caused - * deadlock on SMP (lock is held by caller). - * - Revamped SIOCGMIIPHY, SIOCGMIIREG portion of bond_check_dev_link(). - * - * 2002/09/18 - Jay Vosburgh - * - Fixed up bond_check_dev_link() (and callers): removed some magic - * numbers, banished local MII_ defines, wrapped ioctl calls to - * prevent EFAULT errors - * - * 2002/9/30 - Jay Vosburgh - * - make sure the ip target matches the arp_target before saving the - * hw address. - * - * 2002/9/30 - Dan Eisner - * - make sure my_ip is set before taking down the link, since - * not all switches respond if the source ip is not set. - * - * 2002/10/8 - Janice Girouard - * - read in the local ip address when enslaving a device - * - add primary support - * - make sure 2*arp_interval has passed when a new device - * is brought on-line before taking it down. - * - * 2002/09/11 - Philippe De Muyter - * - Added bond_xmit_broadcast logic. 
- * - Added bond_mode() support function. - * - * 2002/10/26 - Laurent Deniel - * - allow to register multicast addresses only on active slave - * (useful in active-backup mode) - * - add multicast module parameter - * - fix deletion of multicast groups after unloading module - * - * 2002/11/06 - Kameshwara Rayaprolu - * - Changes to prevent panic from closing the device twice; if we close - * the device in bond_release, we must set the original_flags to down - * so it won't be closed again by the network layer. - * - * 2002/11/07 - Tony Cureington - * - Fix arp_target_hw_addr memory leak - * - Created activebackup_arp_monitor function to handle arp monitoring - * in active backup mode - the bond_arp_monitor had several problems... - * such as allowing slaves to tx arps sequentially without any delay - * for a response - * - Renamed bond_arp_monitor to loadbalance_arp_monitor and re-wrote - * this function to just handle arp monitoring in load-balancing mode; - * it is a lot more compact now - * - Changes to ensure one and only one slave transmits in active-backup - * mode - * - Robustesize parameters; warn users about bad combinations of - * parameters; also if miimon is specified and a network driver does - * not support MII or ETHTOOL, inform the user of this - * - Changes to support link_failure_count when in arp monitoring mode - * - Fix up/down delay reported in /proc - * - Added version; log version; make version available from "modinfo -d" - * - Fixed problem in bond_check_dev_link - if the first IOCTL (SIOCGMIIPH) - * failed, the ETHTOOL ioctl never got a chance - * - * 2002/11/16 - Laurent Deniel - * - fix multicast handling in activebackup_arp_monitor - * - remove one unnecessary and confusing current_slave == slave test - * in activebackup_arp_monitor - * - * 2002/11/17 - Laurent Deniel - * - fix bond_slave_info_query when slave_id = num_slaves - * - * 2002/11/19 - Janice Girouard - * - correct ifr_data reference. 
Update ifr_data reference - * to mii_ioctl_data struct values to avoid confusion. - * - * 2002/11/22 - Bert Barbe - * - Add support for multiple arp_ip_target - * - * 2002/12/13 - Jay Vosburgh - * - Changed to allow text strings for mode and multicast, e.g., - * insmod bonding mode=active-backup. The numbers still work. - * One change: an invalid choice will cause module load failure, - * rather than the previous behavior of just picking one. - * - Minor cleanups; got rid of dup ctype stuff, atoi function - * - * 2003/02/07 - Jay Vosburgh - * - Added use_carrier module parameter that causes miimon to - * use netif_carrier_ok() test instead of MII/ETHTOOL ioctls. - * - Minor cleanups; consolidated ioctl calls to one function. - * - * 2003/02/07 - Tony Cureington - * - Fix bond_mii_monitor() logic error that could result in - * bonding round-robin mode ignoring links after failover/recovery - * - * 2003/03/17 - Jay Vosburgh - * - kmalloc fix (GFP_KERNEL to GFP_ATOMIC) reported by - * Shmulik dot Hen at intel.com. - * - Based on discussion on mailing list, changed use of - * update_slave_cnt(), created wrapper functions for adding/removing - * slaves, changed bond_xmit_xor() to check slave_cnt instead of - * checking slave and slave->dev (which only worked by accident). - * - Misc code cleanup: get arp_send() prototype from header file, - * add max_bonds to bonding.txt. - * - * 2003/03/18 - Tsippy Mendelson and - * Shmulik Hen - * - Make sure only bond_attach_slave() and bond_detach_slave() can - * manipulate the slave list, including slave_cnt, even when in - * bond_release_all(). - * - Fixed hang in bond_release() while traffic is running. - * netdev_set_master() must not be called from within the bond lock. - * - * 2003/03/18 - Tsippy Mendelson and - * Shmulik Hen - * - Fixed hang in bond_enslave(): netdev_set_master() must not be - * called from within the bond lock while traffic is running. 
- * - * 2003/03/18 - Amir Noam - * - Added support for getting slave's speed and duplex via ethtool. - * Needed for 802.3ad and other future modes. - * - * 2003/03/18 - Tsippy Mendelson and - * Shmulik Hen - * - Enable support of modes that need to use the unique mac address of - * each slave. - * * bond_enslave(): Moved setting the slave's mac address, and - * openning it, from the application to the driver. This breaks - * backward comaptibility with old versions of ifenslave that open - * the slave before enalsving it !!!. - * * bond_release(): The driver also takes care of closing the slave - * and restoring its original mac address. - * - Removed the code that restores all base driver's flags. - * Flags are automatically restored once all undo stages are done - * properly. - * - Block possibility of enslaving before the master is up. This - * prevents putting the system in an unstable state. - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include -#include -#include -#include - -#include -#include -#include -#include -#include -#include - -#define DRV_VERSION "2.4.20-20030317" -#define DRV_RELDATE "March 17, 2003" -#define DRV_NAME "bonding" -#define DRV_DESCRIPTION "Ethernet Channel Bonding Driver" - -static const char *version = -DRV_NAME ".c:v" DRV_VERSION " (" DRV_RELDATE ")\n"; - -/* monitor all links that often (in milliseconds). 
<=0 disables monitoring */ -#ifndef BOND_LINK_MON_INTERV -#define BOND_LINK_MON_INTERV 0 -#endif - -#ifndef BOND_LINK_ARP_INTERV -#define BOND_LINK_ARP_INTERV 0 -#endif - -#ifndef MAX_ARP_IP_TARGETS -#define MAX_ARP_IP_TARGETS 16 -#endif - -static int arp_interval = BOND_LINK_ARP_INTERV; -static char *arp_ip_target[MAX_ARP_IP_TARGETS] = { NULL, }; -static unsigned long arp_target[MAX_ARP_IP_TARGETS] = { 0, } ; -static int arp_ip_count = 0; -static u32 my_ip = 0; -char *arp_target_hw_addr = NULL; - -static char *primary= NULL; - -static int max_bonds = BOND_DEFAULT_MAX_BONDS; -static int miimon = BOND_LINK_MON_INTERV; -static int use_carrier = 1; -static int bond_mode = BOND_MODE_ROUNDROBIN; -static int updelay = 0; -static int downdelay = 0; - -static char *mode = NULL; - -static struct bond_parm_tbl bond_mode_tbl[] = { -{ "balance-rr", BOND_MODE_ROUNDROBIN}, -{ "active-backup", BOND_MODE_ACTIVEBACKUP}, -{ "balance-xor", BOND_MODE_XOR}, -{ "broadcast", BOND_MODE_BROADCAST}, -{ NULL, -1}, -}; - -static int multicast_mode = BOND_MULTICAST_ALL; -static char *multicast = NULL; - -static struct bond_parm_tbl bond_mc_tbl[] = { -{ "disabled", BOND_MULTICAST_DISABLED}, -{ "active", BOND_MULTICAST_ACTIVE}, -{ "all", BOND_MULTICAST_ALL}, -{ NULL, -1}, -}; - -static int first_pass = 1; -static struct bonding *these_bonds = NULL; -static struct net_device *dev_bonds = NULL; - -MODULE_PARM(max_bonds, "i"); -MODULE_PARM_DESC(max_bonds, "Max number of bonded devices"); -MODULE_PARM(miimon, "i"); -MODULE_PARM_DESC(miimon, "Link check interval in milliseconds"); -MODULE_PARM(use_carrier, "i"); -MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; 0 for off, 1 for on (default)"); -MODULE_PARM(mode, "s"); -MODULE_PARM_DESC(mode, "Mode of operation : 0 for round robin, 1 for active-backup, 2 for xor"); -MODULE_PARM(arp_interval, "i"); -MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds"); -MODULE_PARM(arp_ip_target, "1-"
__MODULE_STRING(MAX_ARP_IP_TARGETS) "s"); -MODULE_PARM_DESC(arp_ip_target, "arp targets in n.n.n.n form"); -MODULE_PARM(updelay, "i"); -MODULE_PARM_DESC(updelay, "Delay before considering link up, in milliseconds"); -MODULE_PARM(downdelay, "i"); -MODULE_PARM_DESC(downdelay, "Delay before considering link down, in milliseconds"); -MODULE_PARM(primary, "s"); -MODULE_PARM_DESC(primary, "Primary network device to use"); -MODULE_PARM(multicast, "s"); -MODULE_PARM_DESC(multicast, "Mode for multicast support : 0 for none, 1 for active slave, 2 for all slaves (default)"); - -static int bond_xmit_roundrobin(struct sk_buff *skb, struct net_device *dev); -static int bond_xmit_xor(struct sk_buff *skb, struct net_device *dev); -static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *dev); -static struct net_device_stats *bond_get_stats(struct net_device *dev); -static void bond_mii_monitor(struct net_device *dev); -static void loadbalance_arp_monitor(struct net_device *dev); -static void activebackup_arp_monitor(struct net_device *dev); -static int bond_event(struct notifier_block *this, unsigned long event, void *ptr); -static void bond_mc_list_destroy(struct bonding *bond); -static void bond_mc_add(bonding_t *bond, void *addr, int alen); -static void bond_mc_delete(bonding_t *bond, void *addr, int alen); -static int bond_mc_list_copy (struct dev_mc_list *src, struct bonding *dst, int gpf_flag); -static inline int dmi_same(struct dev_mc_list *dmi1, struct dev_mc_list *dmi2); -static void bond_set_promiscuity(bonding_t *bond, int inc); -static void bond_set_allmulti(bonding_t *bond, int inc); -static struct dev_mc_list* bond_mc_list_find_dmi(struct dev_mc_list *dmi, struct dev_mc_list *mc_list); -static void bond_mc_update(bonding_t *bond, slave_t *new, slave_t *old); -static void bond_set_slave_inactive_flags(slave_t *slave); -static void bond_set_slave_active_flags(slave_t *slave); -static int bond_enslave(struct net_device *master, struct net_device 
*slave); -static int bond_release(struct net_device *master, struct net_device *slave); -static int bond_release_all(struct net_device *master); -static int bond_sethwaddr(struct net_device *master, struct net_device *slave); - -/* - * bond_get_info is the interface into the /proc filesystem. This is - * a different interface than the BOND_INFO_QUERY ioctl. That is done - * through the generic networking ioctl interface, and bond_info_query - * is the internal function which provides that information. - */ -static int bond_get_info(char *buf, char **start, off_t offset, int length); - -/* #define BONDING_DEBUG 1 */ - -/* several macros */ - -#define IS_UP(dev) ((((dev)->flags & (IFF_UP)) == (IFF_UP)) && \ - (netif_running(dev) && netif_carrier_ok(dev))) - -static void arp_send_all(slave_t *slave) -{ - int i; - - for (i = 0; (i<MAX_ARP_IP_TARGETS) && arp_target[i]; i++) { - arp_send(ARPOP_REQUEST, ETH_P_ARP, arp_target[i], slave->dev, - my_ip, arp_target_hw_addr, slave->dev->dev_addr, - arp_target_hw_addr); - } -} - - -static const char * -bond_mode_name(void) -{ - switch (bond_mode) { - case BOND_MODE_ROUNDROBIN : - return "load balancing (round-robin)"; - case BOND_MODE_ACTIVEBACKUP : - return "fault-tolerance (active-backup)"; - case BOND_MODE_XOR : - return "load balancing (xor)"; - case BOND_MODE_BROADCAST : - return "fault-tolerance (broadcast)"; - default : - return "unknown"; - } -} - -static const char * -multicast_mode_name(void) -{ - switch(multicast_mode) { - case BOND_MULTICAST_DISABLED : - return "disabled"; - case BOND_MULTICAST_ACTIVE : - return "active slave only"; - case BOND_MULTICAST_ALL : - return "all slaves"; - default : - return "unknown"; - } -} - -static void bond_set_slave_inactive_flags(slave_t *slave) -{ - slave->state = BOND_STATE_BACKUP; - slave->dev->flags |= IFF_NOARP; -} - -static void bond_set_slave_active_flags(slave_t *slave) -{ - slave->state = BOND_STATE_ACTIVE; - slave->dev->flags &= ~IFF_NOARP; -} - -/* - * This function counts and verifies the the number of attached - * slaves, checking the count against the expected value
(given that incr - * is either 1 or -1, for add or removal of a slave). Only - * bond_xmit_xor() uses the slave_cnt value, but this is still a good - * consistency check. - */ -static inline void -update_slave_cnt(bonding_t *bond, int incr) -{ - slave_t *slave = NULL; - int expect = bond->slave_cnt + incr; - - bond->slave_cnt = 0; - for (slave = bond->prev; slave != (slave_t*)bond; - slave = slave->prev) { - bond->slave_cnt++; - } - - if (expect != bond->slave_cnt) - BUG(); -} - -/* - * This function detaches the slave from the list . - * WARNING: no check is made to verify if the slave effectively - * belongs to . It returns in case it's needed. - * Nothing is freed on return, structures are just unchained. - * If the bond->current_slave pointer was pointing to , - * it's replaced with slave->next, or if not applicable. - * - * bond->lock held by caller. - */ -static slave_t * -bond_detach_slave(bonding_t *bond, slave_t *slave) -{ - if ((bond == NULL) || (slave == NULL) || - ((void *)bond == (void *)slave)) { - printk(KERN_ERR - "bond_detach_slave(): trying to detach " - "slave %p from bond %p\n", bond, slave); - return slave; - } - - if (bond->next == slave) { /* is the slave at the head ? */ - if (bond->prev == slave) { /* is the slave alone ? */ - write_lock(&bond->ptrlock); - bond->current_slave = NULL; /* no slave anymore */ - write_unlock(&bond->ptrlock); - bond->prev = bond->next = (slave_t *)bond; - } else { /* not alone */ - bond->next = slave->next; - slave->next->prev = (slave_t *)bond; - bond->prev->next = slave->next; - - write_lock(&bond->ptrlock); - if (bond->current_slave == slave) { - bond->current_slave = slave->next; - } - write_unlock(&bond->ptrlock); - } - } else { - slave->prev->next = slave->next; - if (bond->prev == slave) { /* is this slave the last one ? 
*/ - bond->prev = slave->prev; - } else { - slave->next->prev = slave->prev; - } - - write_lock(&bond->ptrlock); - if (bond->current_slave == slave) { - bond->current_slave = slave->next; - } - write_unlock(&bond->ptrlock); - } - - update_slave_cnt(bond, -1); - - return slave; -} - -static void -bond_attach_slave(struct bonding *bond, struct slave *new_slave) -{ - /* - * queue to the end of the slaves list, make the first element its - * successor, the last one its predecessor, and make it the bond's - * predecessor. - * - * Just to clarify, so future bonding driver hackers don't go through - * the same confusion stage I did trying to figure this out, the - * slaves are stored in a double linked circular list, sortof. - * In the ->next direction, the last slave points to the first slave, - * bypassing bond; only the slaves are in the ->next direction. - * In the ->prev direction, however, the first slave points to bond - * and bond points to the last slave. - * - * It looks like a circle with a little bubble hanging off one side - * in the ->prev direction only. - * - * When going through the list once, its best to start at bond->prev - * and go in the ->prev direction, testing for bond. Doing this - * in the ->next direction doesn't work. Trust me, I know this now. - * :) -mts 2002.03.14 - */ - new_slave->prev = bond->prev; - new_slave->prev->next = new_slave; - bond->prev = new_slave; - new_slave->next = bond->next; - - update_slave_cnt(bond, 1); -} - - -/* - * Less bad way to call ioctl from within the kernel; this needs to be - * done some other way to get the call out of interrupt context. - * Needs "ioctl" variable to be supplied by calling context. - */ -#define IOCTL(dev, arg, cmd) ({ \ - int ret; \ - mm_segment_t fs = get_fs(); \ - set_fs(get_ds()); \ - ret = ioctl(dev, arg, cmd); \ - set_fs(fs); \ - ret; }) - -/* - * Get link speed and duplex from the slave's base driver - * using ethtool. 
If for some reason the call fails or the - * values are invalid, fake speed and duplex to 100/Full - * and return error. - */ -static int bond_update_speed_duplex(struct slave *slave) -{ - struct net_device *dev = slave->dev; - static int (* ioctl)(struct net_device *, struct ifreq *, int); - struct ifreq ifr; - struct ethtool_cmd etool; - - ioctl = dev->do_ioctl; - if (ioctl) { - etool.cmd = ETHTOOL_GSET; - ifr.ifr_data = (char*)&etool; - if (IOCTL(dev, &ifr, SIOCETHTOOL) == 0) { - slave->speed = etool.speed; - slave->duplex = etool.duplex; - } else { - goto err_out; - } - } else { - goto err_out; - } - - switch (slave->speed) { - case SPEED_10: - case SPEED_100: - case SPEED_1000: - break; - default: - goto err_out; - } - - switch (slave->duplex) { - case DUPLEX_FULL: - case DUPLEX_HALF: - break; - default: - goto err_out; - } - - return 0; - -err_out: - //Fake speed and duplex - slave->speed = SPEED_100; - slave->duplex = DUPLEX_FULL; - return -1; -} - -/* - * if supports MII link status reporting, check its link status. - * - * We either do MII/ETHTOOL ioctls, or check netif_carrier_ok(), - * depening upon the setting of the use_carrier parameter. - * - * Return either BMSR_LSTATUS, meaning that the link is up (or we - * can't tell and just pretend it is), or 0, meaning that the link is - * down. - * - * If reporting is non-zero, instead of faking link up, return -1 if - * both ETHTOOL and MII ioctls fail (meaning the device does not - * support them). If use_carrier is set, return whatever it says. - * It'd be nice if there was a good way to tell if a driver supports - * netif_carrier, but there really isn't. - */ -static int -bond_check_dev_link(struct net_device *dev, int reporting) -{ - static int (* ioctl)(struct net_device *, struct ifreq *, int); - struct ifreq ifr; - struct mii_ioctl_data *mii; - struct ethtool_value etool; - - if (use_carrier) { - return netif_carrier_ok(dev) ? 
BMSR_LSTATUS : 0; - } - - ioctl = dev->do_ioctl; - if (ioctl) { - /* TODO: set pointer to correct ioctl on a per team member */ - /* bases to make this more efficient. that is, once */ - /* we determine the correct ioctl, we will always */ - /* call it and not the others for that team */ - /* member. */ - - /* - * We cannot assume that SIOCGMIIPHY will also read a - * register; not all network drivers (e.g., e100) - * support that. - */ - - /* Yes, the mii is overlaid on the ifreq.ifr_ifru */ - mii = (struct mii_ioctl_data *)&ifr.ifr_data; - if (IOCTL(dev, &ifr, SIOCGMIIPHY) == 0) { - mii->reg_num = MII_BMSR; - if (IOCTL(dev, &ifr, SIOCGMIIREG) == 0) { - return mii->val_out & BMSR_LSTATUS; - } - } - - /* try SIOCETHTOOL ioctl, some drivers cache ETHTOOL_GLINK */ - /* for a period of time so we attempt to get link status */ - /* from it last if the above MII ioctls fail... */ - etool.cmd = ETHTOOL_GLINK; - ifr.ifr_data = (char*)&etool; - if (IOCTL(dev, &ifr, SIOCETHTOOL) == 0) { - if (etool.data == 1) { - return BMSR_LSTATUS; - } else { -#ifdef BONDING_DEBUG - printk(KERN_INFO - ":: SIOCETHTOOL shows link down \n"); -#endif - return 0; - } - } - - } - - /* - * If reporting, report that either there's no dev->do_ioctl, - * or both SIOCGMIIREG and SIOCETHTOOL failed (meaning that we - * cannot report link status). If not reporting, pretend - * we're ok. - */ - return reporting ? -1 : BMSR_LSTATUS; -} - -static u16 bond_check_mii_link(bonding_t *bond) -{ - int has_active_interface = 0; - unsigned long flags; - - read_lock_irqsave(&bond->lock, flags); - read_lock(&bond->ptrlock); - has_active_interface = (bond->current_slave != NULL); - read_unlock(&bond->ptrlock); - read_unlock_irqrestore(&bond->lock, flags); - - return (has_active_interface ? 
BMSR_LSTATUS : 0); -} - -static int bond_open(struct net_device *dev) -{ - struct timer_list *timer = &((struct bonding *)(dev->priv))->mii_timer; - struct timer_list *arp_timer = &((struct bonding *)(dev->priv))->arp_timer; - MOD_INC_USE_COUNT; - - if (miimon > 0) { /* link check interval, in milliseconds. */ - init_timer(timer); - timer->expires = jiffies + (miimon * HZ / 1000); - timer->data = (unsigned long)dev; - timer->function = (void *)&bond_mii_monitor; - add_timer(timer); - } - - if (arp_interval> 0) { /* arp interval, in milliseconds. */ - init_timer(arp_timer); - arp_timer->expires = jiffies + (arp_interval * HZ / 1000); - arp_timer->data = (unsigned long)dev; - if (bond_mode == BOND_MODE_ACTIVEBACKUP) { - arp_timer->function = (void *)&activebackup_arp_monitor; - } else { - arp_timer->function = (void *)&loadbalance_arp_monitor; - } - add_timer(arp_timer); - } - return 0; -} - -static int bond_close(struct net_device *master) -{ - bonding_t *bond = (struct bonding *) master->priv; - unsigned long flags; - - write_lock_irqsave(&bond->lock, flags); - - if (miimon > 0) { /* link check interval, in milliseconds. */ - del_timer(&bond->mii_timer); - } - if (arp_interval> 0) { /* arp interval, in milliseconds. 
*/ - del_timer(&bond->arp_timer); - if (arp_target_hw_addr != NULL) { - kfree(arp_target_hw_addr); - arp_target_hw_addr = NULL; - } - } - - /* Release the bonded slaves */ - bond_release_all(master); - bond_mc_list_destroy (bond); - - write_unlock_irqrestore(&bond->lock, flags); - - MOD_DEC_USE_COUNT; - return 0; -} - -/* - * flush all members of flush->mc_list from device dev->mc_list - */ -static void bond_mc_list_flush(struct net_device *dev, struct net_device *flush) -{ - struct dev_mc_list *dmi; - - for (dmi = flush->mc_list; dmi != NULL; dmi = dmi->next) - dev_mc_delete(dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); -} - -/* - * Totally destroys the mc_list in bond - */ -static void bond_mc_list_destroy(struct bonding *bond) -{ - struct dev_mc_list *dmi; - - dmi = bond->mc_list; - while (dmi) { - bond->mc_list = dmi->next; - kfree(dmi); - dmi = bond->mc_list; - } -} - -/* - * Add a Multicast address to every slave in the bonding group - */ -static void bond_mc_add(bonding_t *bond, void *addr, int alen) -{ - slave_t *slave; - switch (multicast_mode) { - case BOND_MULTICAST_ACTIVE : - /* write lock already acquired */ - if (bond->current_slave != NULL) - dev_mc_add(bond->current_slave->dev, addr, alen, 0); - break; - case BOND_MULTICAST_ALL : - for (slave = bond->prev; slave != (slave_t*)bond; slave = slave->prev) - dev_mc_add(slave->dev, addr, alen, 0); - break; - case BOND_MULTICAST_DISABLED : - break; - } -} - -/* - * Remove a multicast address from every slave in the bonding group - */ -static void bond_mc_delete(bonding_t *bond, void *addr, int alen) -{ - slave_t *slave; - switch (multicast_mode) { - case BOND_MULTICAST_ACTIVE : - /* write lock already acquired */ - if (bond->current_slave != NULL) - dev_mc_delete(bond->current_slave->dev, addr, alen, 0); - break; - case BOND_MULTICAST_ALL : - for (slave = bond->prev; slave != (slave_t*)bond; slave = slave->prev) - dev_mc_delete(slave->dev, addr, alen, 0); - break; - case BOND_MULTICAST_DISABLED : - break; - } 
-} - -/* - * Copy all the Multicast addresses from src to the bonding device dst - */ -static int bond_mc_list_copy (struct dev_mc_list *src, struct bonding *dst, - int gpf_flag) -{ - struct dev_mc_list *dmi, *new_dmi; - - for (dmi = src; dmi != NULL; dmi = dmi->next) { - new_dmi = kmalloc(sizeof(struct dev_mc_list), gpf_flag); - - if (new_dmi == NULL) { - return -ENOMEM; - } - - new_dmi->next = dst->mc_list; - dst->mc_list = new_dmi; - - new_dmi->dmi_addrlen = dmi->dmi_addrlen; - memcpy(new_dmi->dmi_addr, dmi->dmi_addr, dmi->dmi_addrlen); - new_dmi->dmi_users = dmi->dmi_users; - new_dmi->dmi_gusers = dmi->dmi_gusers; - } - return 0; -} - -/* - * Returns 0 if dmi1 and dmi2 are the same, non-0 otherwise - */ -static inline int dmi_same(struct dev_mc_list *dmi1, struct dev_mc_list *dmi2) -{ - return memcmp(dmi1->dmi_addr, dmi2->dmi_addr, dmi1->dmi_addrlen) == 0 && - dmi1->dmi_addrlen == dmi2->dmi_addrlen; -} - -/* - * Push the promiscuity flag down to all slaves - */ -static void bond_set_promiscuity(bonding_t *bond, int inc) -{ - slave_t *slave; - switch (multicast_mode) { - case BOND_MULTICAST_ACTIVE : - /* write lock already acquired */ - if (bond->current_slave != NULL) - dev_set_promiscuity(bond->current_slave->dev, inc); - break; - case BOND_MULTICAST_ALL : - for (slave = bond->prev; slave != (slave_t*)bond; slave = slave->prev) - dev_set_promiscuity(slave->dev, inc); - break; - case BOND_MULTICAST_DISABLED : - break; - } -} - -/* - * Push the allmulti flag down to all slaves - */ -static void bond_set_allmulti(bonding_t *bond, int inc) -{ - slave_t *slave; - switch (multicast_mode) { - case BOND_MULTICAST_ACTIVE : - /* write lock already acquired */ - if (bond->current_slave != NULL) - dev_set_allmulti(bond->current_slave->dev, inc); - break; - case BOND_MULTICAST_ALL : - for (slave = bond->prev; slave != (slave_t*)bond; slave = slave->prev) - dev_set_allmulti(slave->dev, inc); - break; - case BOND_MULTICAST_DISABLED : - break; - } -} - -/* - * returns dmi 
entry if found, NULL otherwise - */ -static struct dev_mc_list* bond_mc_list_find_dmi(struct dev_mc_list *dmi, - struct dev_mc_list *mc_list) -{ - struct dev_mc_list *idmi; - - for (idmi = mc_list; idmi != NULL; idmi = idmi->next) { - if (dmi_same(dmi, idmi)) { - return idmi; - } - } - return NULL; -} - -static void set_multicast_list(struct net_device *master) -{ - bonding_t *bond = master->priv; - struct dev_mc_list *dmi; - unsigned long flags = 0; - - if (multicast_mode == BOND_MULTICAST_DISABLED) - return; - /* - * Lock the private data for the master - */ - write_lock_irqsave(&bond->lock, flags); - - /* set promiscuity flag to slaves */ - if ( (master->flags & IFF_PROMISC) && !(bond->flags & IFF_PROMISC) ) - bond_set_promiscuity(bond, 1); - - if ( !(master->flags & IFF_PROMISC) && (bond->flags & IFF_PROMISC) ) - bond_set_promiscuity(bond, -1); - - /* set allmulti flag to slaves */ - if ( (master->flags & IFF_ALLMULTI) && !(bond->flags & IFF_ALLMULTI) ) - bond_set_allmulti(bond, 1); - - if ( !(master->flags & IFF_ALLMULTI) && (bond->flags & IFF_ALLMULTI) ) - bond_set_allmulti(bond, -1); - - bond->flags = master->flags; - - /* looking for addresses to add to slaves' mc list */ - for (dmi = master->mc_list; dmi != NULL; dmi = dmi->next) { - if (bond_mc_list_find_dmi(dmi, bond->mc_list) == NULL) - bond_mc_add(bond, dmi->dmi_addr, dmi->dmi_addrlen); - } - - /* looking for addresses to delete from slaves' list */ - for (dmi = bond->mc_list; dmi != NULL; dmi = dmi->next) { - if (bond_mc_list_find_dmi(dmi, master->mc_list) == NULL) - bond_mc_delete(bond, dmi->dmi_addr, dmi->dmi_addrlen); - } - - - /* save master's multicast list */ - bond_mc_list_destroy (bond); - bond_mc_list_copy (master->mc_list, bond, GFP_ATOMIC); - - write_unlock_irqrestore(&bond->lock, flags); -} - -/* - * Update the mc list and multicast-related flags for the new and - * old active slaves (if any) according to the multicast mode - */ -static void bond_mc_update(bonding_t *bond, slave_t *new, 
slave_t *old) -{ - struct dev_mc_list *dmi; - - switch(multicast_mode) { - case BOND_MULTICAST_ACTIVE : - if (bond->device->flags & IFF_PROMISC) { - if (old != NULL && new != old) - dev_set_promiscuity(old->dev, -1); - dev_set_promiscuity(new->dev, 1); - } - if (bond->device->flags & IFF_ALLMULTI) { - if (old != NULL && new != old) - dev_set_allmulti(old->dev, -1); - dev_set_allmulti(new->dev, 1); - } - /* first remove all mc addresses from old slave if any, - and _then_ add them to new active slave */ - if (old != NULL && new != old) { - for (dmi = bond->device->mc_list; dmi != NULL; dmi = dmi->next) - dev_mc_delete(old->dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); - } - for (dmi = bond->device->mc_list; dmi != NULL; dmi = dmi->next) - dev_mc_add(new->dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); - break; - case BOND_MULTICAST_ALL : - /* nothing to do: mc list is already up-to-date on all slaves */ - break; - case BOND_MULTICAST_DISABLED : - break; - } -} - -/* enslave device to bond device */ -static int bond_enslave(struct net_device *master_dev, - struct net_device *slave_dev) -{ - bonding_t *bond = NULL; - slave_t *new_slave = NULL; - unsigned long flags = 0; - unsigned long rflags = 0; - int err = 0; - struct dev_mc_list *dmi; - struct in_ifaddr **ifap; - struct in_ifaddr *ifa; - int link_reporting; - struct sockaddr addr; - - if (master_dev == NULL || slave_dev == NULL) { - return -ENODEV; - } - bond = (struct bonding *) master_dev->priv; - - if (slave_dev->do_ioctl == NULL) { - printk(KERN_DEBUG - "Warning : no link monitoring support for %s\n", - slave_dev->name); - } - - /* This breaks backward comaptibility with old versions - of ifenslave which open the slave before enalsving */ - /* already up. 
*/ - if ((slave_dev->flags & IFF_UP) == IFF_UP) { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "Error, slave_dev is up\n"); -#endif - return -EBUSY; - } - - /* already enslaved */ - if (master_dev->flags & IFF_SLAVE || slave_dev->flags & IFF_SLAVE) { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "Error, Device was already enslaved\n"); -#endif - return -EBUSY; - } - - /* bond must be initialize by bond_open() before enslaving */ - if ((master_dev->flags & IFF_UP) != IFF_UP) { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "Error, master_dev is not up\n"); -#endif - return -EPERM; - } - - if (slave_dev->set_mac_address == NULL) { - printk(KERN_CRIT " The slave device you specified does not support" - " setting the MAC address.\n Your kernel likely does not" - " support slave devices.\n"); - return -EOPNOTSUPP; - } - - if ((new_slave = kmalloc(sizeof(slave_t), GFP_ATOMIC)) == NULL) { - return -ENOMEM; - } - memset(new_slave, 0, sizeof(slave_t)); - - /* save slave's original flags before calling */ - /* netdev_set_master and dev_open */ - new_slave->original_flags = slave_dev->flags; - - /* save slave's original ("permanent") mac address for - modes that needs it, and for restoring it upon release, - and then set it to the master's address */ - memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN); - - if (bond->next != (slave_t*)bond) { - /* set slave to master's mac address - The application already set the master's - mac address to that of the first slave */ - memcpy(addr.sa_data, master_dev->dev_addr, ETH_ALEN); - addr.sa_family = slave_dev->type; - err = slave_dev->set_mac_address(slave_dev, &addr); - if (err) { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "Error %d calling set_mac_address\n", err); -#endif - goto err_free; - } - } - - /* open the slave since the application closed it */ - err = dev_open(slave_dev); - if (err) { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "Openning slave %s failed\n", slave_dev->name); -#endif - goto err_restore_mac; - } - - err = 
netdev_set_master(slave_dev, master_dev); - - if (err) { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "Error %d calling netdev_set_master\n", err); -#endif - goto err_close; - } - - new_slave->dev = slave_dev; - - if (multicast_mode == BOND_MULTICAST_ALL) { - /* set promiscuity level to new slave */ - if (master_dev->flags & IFF_PROMISC) - dev_set_promiscuity(slave_dev, 1); - - /* set allmulti level to new slave */ - if (master_dev->flags & IFF_ALLMULTI) - dev_set_allmulti(slave_dev, 1); - - /* upload master's mc_list to new slave */ - for (dmi = master_dev->mc_list; dmi != NULL; dmi = dmi->next) - dev_mc_add (slave_dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); - } - - write_lock_irqsave(&bond->lock, flags); - - bond_attach_slave(bond, new_slave); - new_slave->delay = 0; - new_slave->link_failure_count = 0; - - if (miimon > 0 && !use_carrier) { - link_reporting = bond_check_dev_link(slave_dev, 1); - - if ((link_reporting == -1) && (arp_interval == 0)) { - /* - * miimon is set but a bonded network driver - * does not support ETHTOOL/MII and - * arp_interval is not set. Note: if - * use_carrier is enabled, we will never go - * here (because netif_carrier is always - * supported); thus, we don't need to change - * the messages for netif_carrier. - */ - printk(KERN_ERR - "bond_enslave(): MII and ETHTOOL support not " - "available for interface %s, and " - "arp_interval/arp_ip_target module parameters " - "not specified, thus bonding will not detect " - "link failures! 
see bonding.txt for details.\n", - slave_dev->name); - } else if (link_reporting == -1) { - /* unable get link status using mii/ethtool */ - printk(KERN_WARNING - "bond_enslave: can't get link status from " - "interface %s; the network driver associated " - "with this interface does not support " - "MII or ETHTOOL link status reporting, thus " - "miimon has no effect on this interface.\n", - slave_dev->name); - } - } - - /* check for initial state */ - if ((miimon <= 0) || - (bond_check_dev_link(slave_dev, 0) == BMSR_LSTATUS)) { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "Initial state of slave_dev is BOND_LINK_UP\n"); -#endif - new_slave->link = BOND_LINK_UP; - new_slave->jiffies = jiffies; - } - else { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "Initial state of slave_dev is BOND_LINK_DOWN\n"); -#endif - new_slave->link = BOND_LINK_DOWN; - } - - if (bond_update_speed_duplex(new_slave) && (new_slave->link == BOND_LINK_UP) ) { - printk(KERN_WARNING - "bond_enslave(): failed to get speed/duplex from %s, " - "speed forced to 100Mbps, duplex forced to Full.\n", - new_slave->dev->name); - } - - /* if we're in active-backup mode, we need one and only one active - * interface. The backup interfaces will have their NOARP flag set - * because we need them to be completely deaf and not to respond to - * any ARP request on the network to avoid fooling a switch. Thus, - * since we guarantee that current_slave always point to the last - * usable interface, we just have to verify this interface's flag. 
- */ - if (bond_mode == BOND_MODE_ACTIVEBACKUP) { - if (((bond->current_slave == NULL) - || (bond->current_slave->dev->flags & IFF_NOARP)) - && (new_slave->link == BOND_LINK_UP)) { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "This is the first active slave\n"); -#endif - /* first slave or no active slave yet, and this link - is OK, so make this interface the active one */ - bond->current_slave = new_slave; - bond_set_slave_active_flags(new_slave); - bond_mc_update(bond, new_slave, NULL); - } - else { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "This is just a backup slave\n"); -#endif - bond_set_slave_inactive_flags(new_slave); - } - read_lock_irqsave(&(((struct in_device *)slave_dev->ip_ptr)->lock), rflags); - ifap= &(((struct in_device *)slave_dev->ip_ptr)->ifa_list); - ifa = *ifap; - my_ip = ifa->ifa_address; - read_unlock_irqrestore(&(((struct in_device *)slave_dev->ip_ptr)->lock), rflags); - - /* if there is a primary slave, remember it */ - if (primary != NULL) - if( strcmp(primary, new_slave->dev->name) == 0) - bond->primary_slave = new_slave; - } else { -#ifdef BONDING_DEBUG - printk(KERN_CRIT "This slave is always active in trunk mode\n"); -#endif - /* always active in trunk mode */ - new_slave->state = BOND_STATE_ACTIVE; - if (bond->current_slave == NULL) - bond->current_slave = new_slave; - } - - write_unlock_irqrestore(&bond->lock, flags); - - printk (KERN_INFO "%s: enslaving %s as a%s interface with a%s link.\n", - master_dev->name, slave_dev->name, - new_slave->state == BOND_STATE_ACTIVE ? "n active" : " backup", - new_slave->link == BOND_LINK_UP ? "n up" : " down"); - - //enslave is successfull - return 0; - -// Undo stages on error -err_close: - dev_close(slave_dev); - -err_restore_mac: - memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - slave_dev->set_mac_address(slave_dev, &addr); - -err_free: - kfree(new_slave); - return err; -} - -/* - * This function changes the active slave to slave <slave_dev>.
- * It returns -EINVAL in the following cases. - * - <slave_dev> is not found in the list. - * - There is not active slave now. - * - <slave_dev> is already active. - * - The link state of <slave_dev> is not BOND_LINK_UP. - * - <slave_dev> is not running. - * In these cases, this fuction does nothing. - * In the other cases, currnt_slave pointer is changed and 0 is returned. - */ -static int bond_change_active(struct net_device *master_dev, struct net_device *slave_dev) -{ - bonding_t *bond; - slave_t *slave; - slave_t *oldactive = NULL; - slave_t *newactive = NULL; - unsigned long flags; - int ret = 0; - - if (master_dev == NULL || slave_dev == NULL) { - return -ENODEV; - } - - bond = (struct bonding *) master_dev->priv; - write_lock_irqsave(&bond->lock, flags); - slave = (slave_t *)bond; - oldactive = bond->current_slave; - - while ((slave = slave->prev) != (slave_t *)bond) { - if(slave_dev == slave->dev) { - newactive = slave; - break; - } - } - - if ((newactive != NULL)&& - (oldactive != NULL)&& - (newactive != oldactive)&& - (newactive->link == BOND_LINK_UP)&& - IS_UP(newactive->dev)) { - bond_set_slave_inactive_flags(oldactive); - bond_set_slave_active_flags(newactive); - bond_mc_update(bond, newactive, oldactive); - bond->current_slave = newactive; - printk("%s : activate %s(old : %s)\n", - master_dev->name, newactive->dev->name, - oldactive->dev->name); - } - else { - ret = -EINVAL; - } - write_unlock_irqrestore(&bond->lock, flags); - return ret; -} - -/* Choose a new valid interface from the pool, set it active - * and make it the current slave. If no valid interface is - * found, the oldest slave in BACK state is choosen and - * activated. If none is found, it's considered as no - * interfaces left so the current slave is set to NULL. - * The result is a pointer to the current slave. - * - * Since this function sends messages tails through printk, the caller - * must have started something like `printk(KERN_INFO "xxxx ");'. - * - * Warning: must put locks around the call to this function if needed.
- */ -slave_t *change_active_interface(bonding_t *bond) -{ - slave_t *newslave, *oldslave; - slave_t *bestslave = NULL; - int mintime; - - read_lock(&bond->ptrlock); - newslave = oldslave = bond->current_slave; - read_unlock(&bond->ptrlock); - - if (newslave == NULL) { /* there were no active slaves left */ - if (bond->next != (slave_t *)bond) { /* found one slave */ - write_lock(&bond->ptrlock); - newslave = bond->current_slave = bond->next; - write_unlock(&bond->ptrlock); - } else { - - printk (" but could not find any %s interface.\n", - (bond_mode == BOND_MODE_ACTIVEBACKUP) ? "backup":"other"); - write_lock(&bond->ptrlock); - bond->current_slave = (slave_t *)NULL; - write_unlock(&bond->ptrlock); - return NULL; /* still no slave, return NULL */ - } - } else if (bond_mode == BOND_MODE_ACTIVEBACKUP) { - /* make sure oldslave doesn't send arps - this could - * cause a ping-pong effect between interfaces since they - * would be able to tx arps - in active backup only one - * slave should be able to tx arps, and that should be - * the current_slave; the only exception is when all - * slaves have gone down, then only one non-current slave can - * send arps at a time; clearing oldslaves' mc list is handled - * later in this function. 
- */ - bond_set_slave_inactive_flags(oldslave); - } - - mintime = updelay; - - /* first try the primary link; if arping, a link must tx/rx traffic - * before it can be considered the current_slave - also, we would skip - * slaves between the current_slave and primary_slave that may be up - * and able to arp - */ - if ((bond->primary_slave != NULL) && (arp_interval == 0)) { - if (IS_UP(bond->primary_slave->dev)) - newslave = bond->primary_slave; - } - - do { - if (IS_UP(newslave->dev)) { - if (newslave->link == BOND_LINK_UP) { - /* this one is immediately usable */ - if (bond_mode == BOND_MODE_ACTIVEBACKUP) { - bond_set_slave_active_flags(newslave); - bond_mc_update(bond, newslave, oldslave); - printk (" and making interface %s the active one.\n", - newslave->dev->name); - } - else { - printk (" and setting pointer to interface %s.\n", - newslave->dev->name); - } - - write_lock(&bond->ptrlock); - bond->current_slave = newslave; - write_unlock(&bond->ptrlock); - return newslave; - } - else if (newslave->link == BOND_LINK_BACK) { - /* link up, but waiting for stabilization */ - if (newslave->delay < mintime) { - mintime = newslave->delay; - bestslave = newslave; - } - } - } - } while ((newslave = newslave->next) != oldslave); - - /* no usable backup found, we'll see if we at least got a link that was - coming back for a long time, and could possibly already be usable. - */ - - if (bestslave != NULL) { - /* early take-over. 
*/ - printk (" and making interface %s the active one %d ms earlier.\n", - bestslave->dev->name, - (updelay - bestslave->delay)*miimon); - - bestslave->delay = 0; - bestslave->link = BOND_LINK_UP; - bestslave->jiffies = jiffies; - bond_set_slave_active_flags(bestslave); - bond_mc_update(bond, bestslave, oldslave); - write_lock(&bond->ptrlock); - bond->current_slave = bestslave; - write_unlock(&bond->ptrlock); - return bestslave; - } - - if ((bond_mode == BOND_MODE_ACTIVEBACKUP) && - (multicast_mode == BOND_MULTICAST_ACTIVE) && - (oldslave != NULL)) { - /* flush bonds (master's) mc_list from oldslave since it wasn't - * updated (and deleted) above - */ - bond_mc_list_flush(oldslave->dev, bond->device); - if (bond->device->flags & IFF_PROMISC) { - dev_set_promiscuity(oldslave->dev, -1); - } - if (bond->device->flags & IFF_ALLMULTI) { - dev_set_allmulti(oldslave->dev, -1); - } - } - - printk (" but could not find any %s interface.\n", - (bond_mode == BOND_MODE_ACTIVEBACKUP) ? "backup":"other"); - - /* absolutely nothing found. let's return NULL */ - write_lock(&bond->ptrlock); - bond->current_slave = (slave_t *)NULL; - write_unlock(&bond->ptrlock); - return NULL; -} - -/* - * Try to release the slave device from the bond device - * It is legal to access current_slave without a lock because all the function - * is write-locked. - * - * The rules for slave state should be: - * for Active/Backup: - * Active stays on all backups go down - * for Bonded connections: - * The first up interface should be left on and all others downed. 
- */ -static int bond_release(struct net_device *master, struct net_device *slave) -{ - bonding_t *bond; - slave_t *our_slave, *old_current; - unsigned long flags; - struct sockaddr addr; - - if (master == NULL || slave == NULL) { - return -ENODEV; - } - - bond = (struct bonding *) master->priv; - - /* master already enslaved, or slave not enslaved, - or no slave for this master */ - if ((master->flags & IFF_SLAVE) || !(slave->flags & IFF_SLAVE)) { - printk (KERN_DEBUG "%s: cannot release %s.\n", master->name, slave->name); - return -EINVAL; - } - - write_lock_irqsave(&bond->lock, flags); - bond->current_arp_slave = NULL; - our_slave = (slave_t *)bond; - old_current = bond->current_slave; - while ((our_slave = our_slave->prev) != (slave_t *)bond) { - if (our_slave->dev == slave) { - bond_detach_slave(bond, our_slave); - - printk (KERN_INFO "%s: releasing %s interface %s", - master->name, - (our_slave->state == BOND_STATE_ACTIVE) ? "active" : "backup", - slave->name); - - if (our_slave == old_current) { - /* find a new interface and be verbose */ - change_active_interface(bond); - } else { - printk(".\n"); - } - - if (bond->current_slave == NULL) { - printk(KERN_INFO - "%s: now running without any active interface !\n", - master->name); - } - - if (bond->primary_slave == our_slave) { - bond->primary_slave = NULL; - } - - break; - } - - } - write_unlock_irqrestore(&bond->lock, flags); - - if (our_slave == (slave_t *)bond) { - /* if we get here, it's because the device was not found */ - printk (KERN_INFO "%s: %s not enslaved\n", master->name, slave->name); - return -EINVAL; - } - - /* undo settings and restore original values */ - - if (multicast_mode == BOND_MULTICAST_ALL) { - /* flush master's mc_list from slave */ - bond_mc_list_flush (slave, master); - - /* unset promiscuity level from slave */ - if (master->flags & IFF_PROMISC) - dev_set_promiscuity(slave, -1); - - /* unset allmulti level from slave */ - if (master->flags & IFF_ALLMULTI) - 
dev_set_allmulti(slave, -1); - } - - netdev_set_master(slave, NULL); - - /* close slave before restoring its mac address */ - dev_close(slave); - - /* restore original ("permanent") mac address*/ - memcpy(addr.sa_data, our_slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave->type; - slave->set_mac_address(slave, &addr); - - /* restore the original state of the IFF_NOARP flag that might have */ - /* been set by bond_set_slave_inactive_flags() */ - if ((our_slave->original_flags & IFF_NOARP) == 0) { - slave->flags &= ~IFF_NOARP; - } - - kfree(our_slave); - - /* if the last slave was removed, zero the mac address - of the master so it will be set by the application - to the mac address of the first slave */ - if (bond->next == (slave_t*)bond) { - memset(master->dev_addr, 0, master->addr_len); - } - - return 0; /* deletion OK */ -} - -/* - * This function releases all slaves. - * Warning: must put write-locks around the call to this function. - */ -static int bond_release_all(struct net_device *master) -{ - bonding_t *bond; - slave_t *our_slave; - struct net_device *slave_dev; - struct sockaddr addr; - - if (master == NULL) { - return -ENODEV; - } - - if (master->flags & IFF_SLAVE) { - return -EINVAL; - } - - bond = (struct bonding *) master->priv; - bond->current_arp_slave = NULL; - bond->current_slave = NULL; - bond->primary_slave = NULL; - - while ((our_slave = bond->prev) != (slave_t *)bond) { - slave_dev = our_slave->dev; - bond_detach_slave(bond, our_slave); - - if (multicast_mode == BOND_MULTICAST_ALL - || (multicast_mode == BOND_MULTICAST_ACTIVE - && bond->current_slave == our_slave)) { - - /* flush master's mc_list from slave */ - bond_mc_list_flush (slave_dev, master); - - /* unset promiscuity level from slave */ - if (master->flags & IFF_PROMISC) - dev_set_promiscuity(slave_dev, -1); - - /* unset allmulti level from slave */ - if (master->flags & IFF_ALLMULTI) - dev_set_allmulti(slave_dev, -1); - } - - /* Can be safely called from inside the bond lock - 
since traffic and timers have already stopped - */ - netdev_set_master(slave_dev, NULL); - - /* close slave before restoring its mac address */ - dev_close(slave_dev); - - /* restore original ("permanent") mac address*/ - memcpy(addr.sa_data, our_slave->perm_hwaddr, ETH_ALEN); - addr.sa_family = slave_dev->type; - slave_dev->set_mac_address(slave_dev, &addr); - - /* restore the original state of the IFF_NOARP flag that might have */ - /* been set by bond_set_slave_inactive_flags() */ - if ((our_slave->original_flags & IFF_NOARP) == 0) { - slave_dev->flags &= ~IFF_NOARP; - } - - kfree(our_slave); - } - - /* zero the mac address of the master so it will be - set by the application to the mac address of the - first slave */ - memset(master->dev_addr, 0, master->addr_len); - - printk (KERN_INFO "%s: released all slaves\n", master->name); - - return 0; -} - -/* this function is called regularly to monitor each slave's link. */ -static void bond_mii_monitor(struct net_device *master) -{ - bonding_t *bond = (struct bonding *) master->priv; - slave_t *slave, *bestslave, *oldcurrent; - unsigned long flags; - int slave_died = 0; - - read_lock_irqsave(&bond->lock, flags); - - /* we will try to read the link status of each of our slaves, and - * set their IFF_RUNNING flag appropriately. For each slave not - * supporting MII status, we won't do anything so that a user-space - * program could monitor the link itself if needed. 
- */ - - bestslave = NULL; - slave = (slave_t *)bond; - - read_lock(&bond->ptrlock); - oldcurrent = bond->current_slave; - read_unlock(&bond->ptrlock); - - while ((slave = slave->prev) != (slave_t *)bond) { - /* use updelay+1 to match an UP slave even when updelay is 0 */ - int mindelay = updelay + 1; - struct net_device *dev = slave->dev; - int link_state; - - link_state = bond_check_dev_link(dev, 0); - - switch (slave->link) { - case BOND_LINK_UP: /* the link was up */ - if (link_state == BMSR_LSTATUS) { - /* link stays up, tell that this one - is immediately available */ - if (IS_UP(dev) && (mindelay > -2)) { - /* -2 is the best case : - this slave was already up */ - mindelay = -2; - bestslave = slave; - } - break; - } - else { /* link going down */ - slave->link = BOND_LINK_FAIL; - slave->delay = downdelay; - if (slave->link_failure_count < UINT_MAX) { - slave->link_failure_count++; - } - if (downdelay > 0) { - printk (KERN_INFO - "%s: link status down for %sinterface " - "%s, disabling it in %d ms.\n", - master->name, - IS_UP(dev) - ? ((bond_mode == BOND_MODE_ACTIVEBACKUP) - ? ((slave == oldcurrent) - ? "active " : "backup ") - : "") - : "idle ", - dev->name, - downdelay * miimon); - } - } - /* no break ! 
fall through the BOND_LINK_FAIL test to - ensure proper action to be taken - */ - case BOND_LINK_FAIL: /* the link has just gone down */ - if (link_state != BMSR_LSTATUS) { - /* link stays down */ - if (slave->delay <= 0) { - /* link down for too long time */ - slave->link = BOND_LINK_DOWN; - /* in active/backup mode, we must - completely disable this interface */ - if (bond_mode == BOND_MODE_ACTIVEBACKUP) { - bond_set_slave_inactive_flags(slave); - } - printk(KERN_INFO - "%s: link status definitely down " - "for interface %s, disabling it", - master->name, - dev->name); - - read_lock(&bond->ptrlock); - if (slave == bond->current_slave) { - read_unlock(&bond->ptrlock); - /* find a new interface and be verbose */ - change_active_interface(bond); - } else { - read_unlock(&bond->ptrlock); - printk(".\n"); - } - slave_died = 1; - } else { - slave->delay--; - } - } else { - /* link up again */ - slave->link = BOND_LINK_UP; - slave->jiffies = jiffies; - printk(KERN_INFO - "%s: link status up again after %d ms " - "for interface %s.\n", - master->name, - (downdelay - slave->delay) * miimon, - dev->name); - - if (IS_UP(dev) && (mindelay > -1)) { - /* -1 is a good case : this slave went - down only for a short time */ - mindelay = -1; - bestslave = slave; - } - } - break; - case BOND_LINK_DOWN: /* the link was down */ - if (link_state != BMSR_LSTATUS) { - /* the link stays down, nothing more to do */ - break; - } else { /* link going up */ - slave->link = BOND_LINK_BACK; - slave->delay = updelay; - - if (updelay > 0) { - /* if updelay == 0, no need to - advertise about a 0 ms delay */ - printk (KERN_INFO - "%s: link status up for interface" - " %s, enabling it in %d ms.\n", - master->name, - dev->name, - updelay * miimon); - } - } - /* no break ! fall through the BOND_LINK_BACK state in - case there's something to do. 
- */ - case BOND_LINK_BACK: /* the link has just come back */ - if (link_state != BMSR_LSTATUS) { - /* link down again */ - slave->link = BOND_LINK_DOWN; - printk(KERN_INFO - "%s: link status down again after %d ms " - "for interface %s.\n", - master->name, - (updelay - slave->delay) * miimon, - dev->name); - } else { - /* link stays up */ - if (slave->delay == 0) { - /* now the link has been up for long time enough */ - slave->link = BOND_LINK_UP; - slave->jiffies = jiffies; - - if (bond_mode != BOND_MODE_ACTIVEBACKUP) { - /* make it immediately active */ - slave->state = BOND_STATE_ACTIVE; - } else if (slave != bond->primary_slave) { - /* prevent it from being the active one */ - slave->state = BOND_STATE_BACKUP; - } - - printk(KERN_INFO - "%s: link status definitely up " - "for interface %s.\n", - master->name, - dev->name); - - if ( (bond->primary_slave != NULL) - && (slave == bond->primary_slave) ) - change_active_interface(bond); - } - else - slave->delay--; - - /* we'll also look for the mostly eligible slave */ - if (bond->primary_slave == NULL) { - if (IS_UP(dev) && (slave->delay < mindelay)) { - mindelay = slave->delay; - bestslave = slave; - } - } else if ( (IS_UP(bond->primary_slave->dev)) || - ( (!IS_UP(bond->primary_slave->dev)) && - (IS_UP(dev) && (slave->delay < mindelay)) ) ) { - mindelay = slave->delay; - bestslave = slave; - } - } - break; - } /* end of switch */ - - bond_update_speed_duplex(slave); - - } /* end of while */ - - /* - * if there's no active interface and we discovered that one - * of the slaves could be activated earlier, so we do it. - */ - read_lock(&bond->ptrlock); - oldcurrent = bond->current_slave; - read_unlock(&bond->ptrlock); - - /* no active interface at the moment or need to bring up the primary */ - if (oldcurrent == NULL) { /* no active interface at the moment */ - if (bestslave != NULL) { /* last chance to find one ? 
*/ - if (bestslave->link == BOND_LINK_UP) { - printk (KERN_INFO - "%s: making interface %s the new active one.\n", - master->name, bestslave->dev->name); - } else { - printk (KERN_INFO - "%s: making interface %s the new " - "active one %d ms earlier.\n", - master->name, bestslave->dev->name, - (updelay - bestslave->delay) * miimon); - - bestslave->delay = 0; - bestslave->link = BOND_LINK_UP; - bestslave->jiffies = jiffies; - } - - if (bond_mode == BOND_MODE_ACTIVEBACKUP) { - bond_set_slave_active_flags(bestslave); - bond_mc_update(bond, bestslave, NULL); - } else { - bestslave->state = BOND_STATE_ACTIVE; - } - write_lock(&bond->ptrlock); - bond->current_slave = bestslave; - write_unlock(&bond->ptrlock); - } else if (slave_died) { - /* print this message only once a slave has just died */ - printk(KERN_INFO - "%s: now running without any active interface !\n", - master->name); - } - } - - read_unlock_irqrestore(&bond->lock, flags); - /* re-arm the timer */ - mod_timer(&bond->mii_timer, jiffies + (miimon * HZ / 1000)); -} - -/* - * this function is called regularly to monitor each slave's link - * ensuring that traffic is being sent and received when arp monitoring - * is used in load-balancing mode. if the adapter has been dormant, then an - * arp is transmitted to generate traffic. see activebackup_arp_monitor for - * arp monitoring in active backup mode. - */ -static void loadbalance_arp_monitor(struct net_device *master) -{ - bonding_t *bond; - unsigned long flags; - slave_t *slave; - int the_delta_in_ticks = arp_interval * HZ / 1000; - int next_timer = jiffies + (arp_interval * HZ / 1000); - - bond = (struct bonding *) master->priv; - if (master->priv == NULL) { - mod_timer(&bond->arp_timer, next_timer); - return; - } - - read_lock_irqsave(&bond->lock, flags); - - /* TODO: investigate why rtnl_shlock_nowait and rtnl_exlock_nowait - * are called below and add comment why they are required... 
- */ - if ((!IS_UP(master)) || rtnl_shlock_nowait()) { - mod_timer(&bond->arp_timer, next_timer); - read_unlock_irqrestore(&bond->lock, flags); - return; - } - - if (rtnl_exlock_nowait()) { - rtnl_shunlock(); - mod_timer(&bond->arp_timer, next_timer); - read_unlock_irqrestore(&bond->lock, flags); - return; - } - - /* see if any of the previous devices are up now (i.e. they have - * xmt and rcv traffic). the current_slave does not come into - * the picture unless it is null. also, slave->jiffies is not needed - * here because we send an arp on each slave and give a slave as - * long as it needs to get the tx/rx within the delta. - * TODO: what about up/down delay in arp mode? it wasn't here before - * so it can wait - */ - slave = (slave_t *)bond; - while ((slave = slave->prev) != (slave_t *)bond) { - - if (slave->link != BOND_LINK_UP) { - - if (((jiffies - slave->dev->trans_start) <= - the_delta_in_ticks) && - ((jiffies - slave->dev->last_rx) <= - the_delta_in_ticks)) { - - slave->link = BOND_LINK_UP; - slave->state = BOND_STATE_ACTIVE; - - /* primary_slave has no meaning in round-robin - * mode. the window of a slave being up and - * current_slave being null after enslaving - * is closed. 
- */ - read_lock(&bond->ptrlock); - if (bond->current_slave == NULL) { - read_unlock(&bond->ptrlock); - printk(KERN_INFO - "%s: link status definitely up " - "for interface %s, ", - master->name, - slave->dev->name); - change_active_interface(bond); - } else { - read_unlock(&bond->ptrlock); - printk(KERN_INFO - "%s: interface %s is now up\n", - master->name, - slave->dev->name); - } - } - } else { - /* slave->link == BOND_LINK_UP */ - - /* not all switches will respond to an arp request - * when the source ip is 0, so don't take the link down - * if we don't know our ip yet - */ - if (((jiffies - slave->dev->trans_start) >= - (2*the_delta_in_ticks)) || - (((jiffies - slave->dev->last_rx) >= - (2*the_delta_in_ticks)) && my_ip !=0)) { - slave->link = BOND_LINK_DOWN; - slave->state = BOND_STATE_BACKUP; - if (slave->link_failure_count < UINT_MAX) { - slave->link_failure_count++; - } - printk(KERN_INFO - "%s: interface %s is now down.\n", - master->name, - slave->dev->name); - - read_lock(&bond->ptrlock); - if (slave == bond->current_slave) { - read_unlock(&bond->ptrlock); - change_active_interface(bond); - } else { - read_unlock(&bond->ptrlock); - } - } - } - - /* note: if switch is in round-robin mode, all links - * must tx arp to ensure all links rx an arp - otherwise - * links may oscillate or not come up at all; if switch is - * in something like xor mode, there is nothing we can - * do - all replies will be rx'ed on same link causing slaves - * to be unstable during low/no traffic periods - */ - if (IS_UP(slave->dev)) { - arp_send_all(slave); - } - } - - rtnl_exunlock(); - rtnl_shunlock(); - read_unlock_irqrestore(&bond->lock, flags); - - /* re-arm the timer */ - mod_timer(&bond->arp_timer, next_timer); -} - -/* - * When using arp monitoring in active-backup mode, this function is - * called to determine if any backup slaves have went down or a new - * current slave needs to be found. 
- * The backup slaves never generate traffic, they are considered up by merely - * receiving traffic. If the current slave goes down, each backup slave will - * be given the opportunity to tx/rx an arp before being taken down - this - * prevents all slaves from being taken down due to the current slave not - * sending any traffic for the backups to receive. The arps are not necessarily - * necessary, any tx and rx traffic will keep the current slave up. While any - * rx traffic will keep the backup slaves up, the current slave is responsible - * for generating traffic to keep them up regardless of any other traffic they - * may have received. - * see loadbalance_arp_monitor for arp monitoring in load balancing mode - */ -static void activebackup_arp_monitor(struct net_device *master) -{ - bonding_t *bond; - unsigned long flags; - slave_t *slave; - int the_delta_in_ticks = arp_interval * HZ / 1000; - int next_timer = jiffies + (arp_interval * HZ / 1000); - - bond = (struct bonding *) master->priv; - if (master->priv == NULL) { - mod_timer(&bond->arp_timer, next_timer); - return; - } - - read_lock_irqsave(&bond->lock, flags); - - if (!IS_UP(master)) { - mod_timer(&bond->arp_timer, next_timer); - read_unlock_irqrestore(&bond->lock, flags); - return; - } - - /* determine if any slave has come up or any backup slave has - * gone down - * TODO: what about up/down delay in arp mode? 
it wasn't here before - * so it can wait - */ - slave = (slave_t *)bond; - while ((slave = slave->prev) != (slave_t *)bond) { - - if (slave->link != BOND_LINK_UP) { - if ((jiffies - slave->dev->last_rx) <= - the_delta_in_ticks) { - - slave->link = BOND_LINK_UP; - write_lock(&bond->ptrlock); - if ((bond->current_slave == NULL) && - ((jiffies - slave->dev->trans_start) <= - the_delta_in_ticks)) { - bond->current_slave = slave; - bond_set_slave_active_flags(slave); - bond_mc_update(bond, slave, NULL); - bond->current_arp_slave = NULL; - } else if (bond->current_slave != slave) { - /* this slave has just come up but we - * already have a current slave; this - * can also happen if bond_enslave adds - * a new slave that is up while we are - * searching for a new slave - */ - bond_set_slave_inactive_flags(slave); - bond->current_arp_slave = NULL; - } - - if (slave == bond->current_slave) { - printk(KERN_INFO - "%s: %s is up and now the " - "active interface\n", - master->name, - slave->dev->name); - } else { - printk(KERN_INFO - "%s: backup interface %s is " - "now up\n", - master->name, - slave->dev->name); - } - - write_unlock(&bond->ptrlock); - } - } else { - read_lock(&bond->ptrlock); - if ((slave != bond->current_slave) && - (bond->current_arp_slave == NULL) && - (((jiffies - slave->dev->last_rx) >= - 3*the_delta_in_ticks) && (my_ip != 0))) { - /* a backup slave has gone down; three times - * the delta allows the current slave to be - * taken out before the backup slave. 
- * note: a non-null current_arp_slave indicates - * the current_slave went down and we are - * searching for a new one; under this - * condition we only take the current_slave - * down - this gives each slave a chance to - * tx/rx traffic before being taken out - */ - read_unlock(&bond->ptrlock); - slave->link = BOND_LINK_DOWN; - if (slave->link_failure_count < UINT_MAX) { - slave->link_failure_count++; - } - bond_set_slave_inactive_flags(slave); - printk(KERN_INFO - "%s: backup interface %s is now down\n", - master->name, - slave->dev->name); - } else { - read_unlock(&bond->ptrlock); - } - } - } - - read_lock(&bond->ptrlock); - slave = bond->current_slave; - read_unlock(&bond->ptrlock); - - if (slave != NULL) { - - /* if we have sent traffic in the past 2*arp_intervals but - * haven't xmit and rx traffic in that time interval, select - * a different slave. slave->jiffies is only updated when - * a slave first becomes the current_slave - not necessarily - * after every arp; this ensures the slave has a full 2*delta - * before being taken out. 
if a primary is being used, check - * if it is up and needs to take over as the current_slave - */ - if ((((jiffies - slave->dev->trans_start) >= - (2*the_delta_in_ticks)) || - (((jiffies - slave->dev->last_rx) >= - (2*the_delta_in_ticks)) && (my_ip != 0))) && - ((jiffies - slave->jiffies) >= 2*the_delta_in_ticks)) { - - slave->link = BOND_LINK_DOWN; - if (slave->link_failure_count < UINT_MAX) { - slave->link_failure_count++; - } - printk(KERN_INFO "%s: link status down for " - "active interface %s, disabling it", - master->name, - slave->dev->name); - slave = change_active_interface(bond); - bond->current_arp_slave = slave; - if (slave != NULL) { - slave->jiffies = jiffies; - } - - } else if ((bond->primary_slave != NULL) && - (bond->primary_slave != slave) && - (bond->primary_slave->link == BOND_LINK_UP)) { - /* at this point, slave is the current_slave */ - printk(KERN_INFO - "%s: changing from interface %s to primary " - "interface %s\n", - master->name, - slave->dev->name, - bond->primary_slave->dev->name); - - /* primary is up so switch to it */ - bond_set_slave_inactive_flags(slave); - bond_mc_update(bond, bond->primary_slave, slave); - write_lock(&bond->ptrlock); - bond->current_slave = bond->primary_slave; - write_unlock(&bond->ptrlock); - slave = bond->primary_slave; - bond_set_slave_active_flags(slave); - slave->jiffies = jiffies; - } else { - bond->current_arp_slave = NULL; - } - - /* the current slave must tx an arp to ensure backup slaves - * rx traffic - */ - if ((slave != NULL) && - (((jiffies - slave->dev->last_rx) >= the_delta_in_ticks) && - (my_ip != 0))) { - arp_send_all(slave); - } - } - - /* if we don't have a current_slave, search for the next available - * backup slave from the current_arp_slave and make it the candidate - * for becoming the current_slave - */ - if (slave == NULL) { - - if ((bond->current_arp_slave == NULL) || - (bond->current_arp_slave == (slave_t *)bond)) { - bond->current_arp_slave = bond->prev; - } - - if 
(bond->current_arp_slave != (slave_t *)bond) { - bond_set_slave_inactive_flags(bond->current_arp_slave); - slave = bond->current_arp_slave->next; - - /* search for next candidate */ - do { - if (IS_UP(slave->dev)) { - slave->link = BOND_LINK_BACK; - bond_set_slave_active_flags(slave); - arp_send_all(slave); - slave->jiffies = jiffies; - bond->current_arp_slave = slave; - break; - } - - /* if the link state is up at this point, we - * mark it down - this can happen if we have - * simultaneous link failures and - * change_active_interface doesn't make this - * one the current slave so it is still marked - * up when it is actually down - */ - if (slave->link == BOND_LINK_UP) { - slave->link = BOND_LINK_DOWN; - if (slave->link_failure_count < - UINT_MAX) { - slave->link_failure_count++; - } - - bond_set_slave_inactive_flags(slave); - printk(KERN_INFO - "%s: backup interface " - "%s is now down.\n", - master->name, - slave->dev->name); - } - } while ((slave = slave->next) != - bond->current_arp_slave->next); - } - } - - mod_timer(&bond->arp_timer, next_timer); - read_unlock_irqrestore(&bond->lock, flags); -} - -typedef uint32_t in_addr_t; - -int -my_inet_aton(char *cp, unsigned long *the_addr) { - static const in_addr_t max[4] = { 0xffffffff, 0xffffff, 0xffff, 0xff }; - in_addr_t val; - char c; - union iaddr { - uint8_t bytes[4]; - uint32_t word; - } res; - uint8_t *pp = res.bytes; - int digit,base; - - res.word = 0; - - c = *cp; - for (;;) { - /* - * Collect number up to ``.''. - * Values are specified as for C: - * 0x=hex, 0=octal, isdigit=decimal. 
- */ - if (!isdigit(c)) goto ret_0; - val = 0; base = 10; digit = 0; - for (;;) { - if (isdigit(c)) { - val = (val * base) + (c - '0'); - c = *++cp; - digit = 1; - } else { - break; - } - } - if (c == '.') { - /* - * Internet format: - * a.b.c.d - * a.b.c (with c treated as 16 bits) - * a.b (with b treated as 24 bits) - */ - if (pp > res.bytes + 2 || val > 0xff) { - goto ret_0; - } - *pp++ = val; - c = *++cp; - } else - break; - } - /* - * Check for trailing characters. - */ - if (c != '\0' && (!isascii(c) || !isspace(c))) { - goto ret_0; - } - /* - * Did we get a valid digit? - */ - if (!digit) { - goto ret_0; - } - - /* Check whether the last part is in its limits depending on - the number of parts in total. */ - if (val > max[pp - res.bytes]) { - goto ret_0; - } - - if (the_addr != NULL) { - *the_addr = res.word | htonl (val); - } - - return (1); - -ret_0: - return (0); -} - -static int bond_sethwaddr(struct net_device *master, struct net_device *slave) -{ -#ifdef BONDING_DEBUG - printk(KERN_CRIT "bond_sethwaddr: master=%x\n", (unsigned int)master); - printk(KERN_CRIT "bond_sethwaddr: slave=%x\n", (unsigned int)slave); - printk(KERN_CRIT "bond_sethwaddr: slave->addr_len=%d\n", slave->addr_len); -#endif - memcpy(master->dev_addr, slave->dev_addr, slave->addr_len); - return 0; -} - -static int bond_info_query(struct net_device *master, struct ifbond *info) -{ - bonding_t *bond = (struct bonding *) master->priv; - slave_t *slave; - unsigned long flags; - - info->bond_mode = bond_mode; - info->num_slaves = 0; - info->miimon = miimon; - - read_lock_irqsave(&bond->lock, flags); - for (slave = bond->prev; slave != (slave_t *)bond; slave = slave->prev) { - info->num_slaves++; - } - read_unlock_irqrestore(&bond->lock, flags); - - return 0; -} - -static int bond_slave_info_query(struct net_device *master, - struct ifslave *info) -{ - bonding_t *bond = (struct bonding *) master->priv; - slave_t *slave; - int cur_ndx = 0; - unsigned long flags; - - if (info->slave_id < 0) { 
- return -ENODEV; - } - - read_lock_irqsave(&bond->lock, flags); - for (slave = bond->prev; - slave != (slave_t *)bond && cur_ndx < info->slave_id; - slave = slave->prev) { - cur_ndx++; - } - read_unlock_irqrestore(&bond->lock, flags); - - if (slave != (slave_t *)bond) { - strcpy(info->slave_name, slave->dev->name); - info->link = slave->link; - info->state = slave->state; - info->link_failure_count = slave->link_failure_count; - } else { - return -ENODEV; - } - - return 0; -} - -static int bond_ioctl(struct net_device *master_dev, struct ifreq *ifr, int cmd) -{ - struct net_device *slave_dev = NULL; - struct ifbond *u_binfo = NULL, k_binfo; - struct ifslave *u_sinfo = NULL, k_sinfo; - struct mii_ioctl_data *mii = NULL; - int ret = 0; - -#ifdef BONDING_DEBUG - printk(KERN_INFO "bond_ioctl: master=%s, cmd=%d\n", - master_dev->name, cmd); -#endif - - switch (cmd) { - case SIOCGMIIPHY: - mii = (struct mii_ioctl_data *)&ifr->ifr_data; - if (mii == NULL) { - return -EINVAL; - } - mii->phy_id = 0; - /* Fall Through */ - case SIOCGMIIREG: - /* - * We do this again just in case we were called by SIOCGMIIREG - * instead of SIOCGMIIPHY. 
- */ - mii = (struct mii_ioctl_data *)&ifr->ifr_data; - if (mii == NULL) { - return -EINVAL; - } - if (mii->reg_num == 1) { - mii->val_out = bond_check_mii_link( - (struct bonding *)master_dev->priv); - } - return 0; - case BOND_INFO_QUERY_OLD: - case SIOCBONDINFOQUERY: - u_binfo = (struct ifbond *)ifr->ifr_data; - if (copy_from_user(&k_binfo, u_binfo, sizeof(ifbond))) { - return -EFAULT; - } - ret = bond_info_query(master_dev, &k_binfo); - if (ret == 0) { - if (copy_to_user(u_binfo, &k_binfo, sizeof(ifbond))) { - return -EFAULT; - } - } - return ret; - case BOND_SLAVE_INFO_QUERY_OLD: - case SIOCBONDSLAVEINFOQUERY: - u_sinfo = (struct ifslave *)ifr->ifr_data; - if (copy_from_user(&k_sinfo, u_sinfo, sizeof(ifslave))) { - return -EFAULT; - } - ret = bond_slave_info_query(master_dev, &k_sinfo); - if (ret == 0) { - if (copy_to_user(u_sinfo, &k_sinfo, sizeof(ifslave))) { - return -EFAULT; - } - } - return ret; - } - - if (!capable(CAP_NET_ADMIN)) { - return -EPERM; - } - - slave_dev = dev_get_by_name(ifr->ifr_slave); - -#ifdef BONDING_DEBUG - printk(KERN_INFO "slave_dev=%x: \n", (unsigned int)slave_dev); - printk(KERN_INFO "slave_dev->name=%s: \n", slave_dev->name); -#endif - - if (slave_dev == NULL) { - ret = -ENODEV; - } else { - switch (cmd) { - case BOND_ENSLAVE_OLD: - case SIOCBONDENSLAVE: - ret = bond_enslave(master_dev, slave_dev); - break; - case BOND_RELEASE_OLD: - case SIOCBONDRELEASE: - ret = bond_release(master_dev, slave_dev); - break; - case BOND_SETHWADDR_OLD: - case SIOCBONDSETHWADDR: - ret = bond_sethwaddr(master_dev, slave_dev); - break; - case BOND_CHANGE_ACTIVE_OLD: - case SIOCBONDCHANGEACTIVE: - if (bond_mode == BOND_MODE_ACTIVEBACKUP) { - ret = bond_change_active(master_dev, slave_dev); - } - else { - ret = -EINVAL; - } - break; - default: - ret = -EOPNOTSUPP; - } - dev_put(slave_dev); - } - return ret; -} - -#ifdef CONFIG_NET_FASTROUTE -static int bond_accept_fastpath(struct net_device *dev, struct dst_entry *dst) -{ - return -1; -} -#endif - -/* 
- * in broadcast mode, we send everything to all usable interfaces. - */ -static int bond_xmit_broadcast(struct sk_buff *skb, struct net_device *dev) -{ - slave_t *slave, *start_at; - struct bonding *bond = (struct bonding *) dev->priv; - unsigned long flags; - struct net_device *device_we_should_send_to = 0; - - if (!IS_UP(dev)) { /* bond down */ - dev_kfree_skb(skb); - return 0; - } - - read_lock_irqsave(&bond->lock, flags); - - read_lock(&bond->ptrlock); - slave = start_at = bond->current_slave; - read_unlock(&bond->ptrlock); - - if (slave == NULL) { /* we're at the root, get the first slave */ - /* no suitable interface, frame not sent */ - read_unlock_irqrestore(&bond->lock, flags); - dev_kfree_skb(skb); - return 0; - } - - do { - if (IS_UP(slave->dev) - && (slave->link == BOND_LINK_UP) - && (slave->state == BOND_STATE_ACTIVE)) { - if (device_we_should_send_to) { - struct sk_buff *skb2; - if ((skb2 = skb_clone(skb, GFP_ATOMIC)) == NULL) { - printk(KERN_ERR "bond_xmit_broadcast: skb_clone() failed\n"); - continue; - } - - skb2->dev = device_we_should_send_to; - skb2->priority = 1; - dev_queue_xmit(skb2); - } - device_we_should_send_to = slave->dev; - } - } while ((slave = slave->next) != start_at); - - if (device_we_should_send_to) { - skb->dev = device_we_should_send_to; - skb->priority = 1; - dev_queue_xmit(skb); - } else - dev_kfree_skb(skb); - - /* frame sent to all suitable interfaces */ - read_unlock_irqrestore(&bond->lock, flags); - return 0; -} - -static int bond_xmit_roundrobin(struct sk_buff *skb, struct net_device *dev) -{ - slave_t *slave, *start_at; - struct bonding *bond = (struct bonding *) dev->priv; - unsigned long flags; - - if (!IS_UP(dev)) { /* bond down */ - dev_kfree_skb(skb); - return 0; - } - - read_lock_irqsave(&bond->lock, flags); - - read_lock(&bond->ptrlock); - slave = start_at = bond->current_slave; - read_unlock(&bond->ptrlock); - - if (slave == NULL) { /* we're at the root, get the first slave */ - /* no suitable interface, frame 
not sent */ - dev_kfree_skb(skb); - read_unlock_irqrestore(&bond->lock, flags); - return 0; - } - - do { - if (IS_UP(slave->dev) - && (slave->link == BOND_LINK_UP) - && (slave->state == BOND_STATE_ACTIVE)) { - - skb->dev = slave->dev; - skb->priority = 1; - dev_queue_xmit(skb); - - write_lock(&bond->ptrlock); - bond->current_slave = slave->next; - write_unlock(&bond->ptrlock); - - read_unlock_irqrestore(&bond->lock, flags); - return 0; - } - } while ((slave = slave->next) != start_at); - - /* no suitable interface, frame not sent */ - dev_kfree_skb(skb); - read_unlock_irqrestore(&bond->lock, flags); - return 0; -} - -/* - * in XOR mode, we determine the output device by performing xor on - * the source and destination hw adresses. If this device is not - * enabled, find the next slave following this xor slave. - */ -static int bond_xmit_xor(struct sk_buff *skb, struct net_device *dev) -{ - slave_t *slave, *start_at; - struct bonding *bond = (struct bonding *) dev->priv; - unsigned long flags; - struct ethhdr *data = (struct ethhdr *)skb->data; - int slave_no; - - if (!IS_UP(dev)) { /* bond down */ - dev_kfree_skb(skb); - return 0; - } - - read_lock_irqsave(&bond->lock, flags); - slave = bond->prev; - - /* we're at the root, get the first slave */ - if (bond->slave_cnt == 0) { - /* no suitable interface, frame not sent */ - dev_kfree_skb(skb); - read_unlock_irqrestore(&bond->lock, flags); - return 0; - } - - slave_no = (data->h_dest[5]^slave->dev->dev_addr[5]) % bond->slave_cnt; - - while ( (slave_no > 0) && (slave != (slave_t *)bond) ) { - slave = slave->prev; - slave_no--; - } - start_at = slave; - - do { - if (IS_UP(slave->dev) - && (slave->link == BOND_LINK_UP) - && (slave->state == BOND_STATE_ACTIVE)) { - - skb->dev = slave->dev; - skb->priority = 1; - dev_queue_xmit(skb); - - read_unlock_irqrestore(&bond->lock, flags); - return 0; - } - } while ((slave = slave->next) != start_at); - - /* no suitable interface, frame not sent */ - dev_kfree_skb(skb); - 
read_unlock_irqrestore(&bond->lock, flags); - return 0; -} - -/* - * in active-backup mode, we know that bond->current_slave is always valid if - * the bond has a usable interface. - */ -static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *dev) -{ - struct bonding *bond = (struct bonding *) dev->priv; - unsigned long flags; - int ret; - - if (!IS_UP(dev)) { /* bond down */ - dev_kfree_skb(skb); - return 0; - } - - /* if we are sending arp packets, try to at least - identify our own ip address */ - if ( (arp_interval > 0) && (my_ip == 0) && - (skb->protocol == __constant_htons(ETH_P_ARP) ) ) { - char *the_ip = (((char *)skb->data)) - + sizeof(struct ethhdr) - + sizeof(struct arphdr) + - ETH_ALEN; - memcpy(&my_ip, the_ip, 4); - } - - /* if we are sending arp packets and don't know - * the target hw address, save it so we don't need - * to use a broadcast address. - * don't do this if in active backup mode because the slaves must - * receive packets to stay up, and the only ones they receive are - * broadcasts. 
- */ - if ( (bond_mode != BOND_MODE_ACTIVEBACKUP) && - (arp_ip_count == 1) && - (arp_interval > 0) && (arp_target_hw_addr == NULL) && - (skb->protocol == __constant_htons(ETH_P_IP) ) ) { - struct ethhdr *eth_hdr = - (struct ethhdr *) (((char *)skb->data)); - struct iphdr *ip_hdr = (struct iphdr *)(eth_hdr + 1); - - if (arp_target[0] == ip_hdr->daddr) { - arp_target_hw_addr = kmalloc(ETH_ALEN, GFP_KERNEL); - if (arp_target_hw_addr != NULL) - memcpy(arp_target_hw_addr, eth_hdr->h_dest, ETH_ALEN); - } - } - - read_lock_irqsave(&bond->lock, flags); - - read_lock(&bond->ptrlock); - if (bond->current_slave != NULL) { /* one usable interface */ - skb->dev = bond->current_slave->dev; - read_unlock(&bond->ptrlock); - skb->priority = 1; - ret = dev_queue_xmit(skb); - read_unlock_irqrestore(&bond->lock, flags); - return 0; - } - else { - read_unlock(&bond->ptrlock); - } - - /* no suitable interface, frame not sent */ -#ifdef BONDING_DEBUG - printk(KERN_INFO "There was no suitable interface, so we don't transmit\n"); -#endif - dev_kfree_skb(skb); - read_unlock_irqrestore(&bond->lock, flags); - return 0; -} - -static struct net_device_stats *bond_get_stats(struct net_device *dev) -{ - bonding_t *bond = dev->priv; - struct net_device_stats *stats = bond->stats, *sstats; - slave_t *slave; - unsigned long flags; - - memset(bond->stats, 0, sizeof(struct net_device_stats)); - - read_lock_irqsave(&bond->lock, flags); - - for (slave = bond->prev; slave != (slave_t *)bond; slave = slave->prev) { - sstats = slave->dev->get_stats(slave->dev); - - stats->rx_packets += sstats->rx_packets; - stats->rx_bytes += sstats->rx_bytes; - stats->rx_errors += sstats->rx_errors; - stats->rx_dropped += sstats->rx_dropped; - - stats->tx_packets += sstats->tx_packets; - stats->tx_bytes += sstats->tx_bytes; - stats->tx_errors += sstats->tx_errors; - stats->tx_dropped += sstats->tx_dropped; - - stats->multicast += sstats->multicast; - stats->collisions += sstats->collisions; - - stats->rx_length_errors += 
sstats->rx_length_errors; - stats->rx_over_errors += sstats->rx_over_errors; - stats->rx_crc_errors += sstats->rx_crc_errors; - stats->rx_frame_errors += sstats->rx_frame_errors; - stats->rx_fifo_errors += sstats->rx_fifo_errors; - stats->rx_missed_errors += sstats->rx_missed_errors; - - stats->tx_aborted_errors += sstats->tx_aborted_errors; - stats->tx_carrier_errors += sstats->tx_carrier_errors; - stats->tx_fifo_errors += sstats->tx_fifo_errors; - stats->tx_heartbeat_errors += sstats->tx_heartbeat_errors; - stats->tx_window_errors += sstats->tx_window_errors; - - } - - read_unlock_irqrestore(&bond->lock, flags); - return stats; -} - -static int bond_get_info(char *buf, char **start, off_t offset, int length) -{ - bonding_t *bond = these_bonds; - int len = 0; - off_t begin = 0; - u16 link; - slave_t *slave = NULL; - unsigned long flags; - - while (bond != NULL) { - /* - * This function locks the mutex, so we can't lock it until - * afterwards - */ - link = bond_check_mii_link(bond); - - len += sprintf(buf + len, "Bonding Mode: %s\n", - bond_mode_name()); - - if (bond_mode == BOND_MODE_ACTIVEBACKUP) { - read_lock_irqsave(&bond->lock, flags); - read_lock(&bond->ptrlock); - if (bond->current_slave != NULL) { - len += sprintf(buf + len, - "Currently Active Slave: %s\n", - bond->current_slave->dev->name); - } - read_unlock(&bond->ptrlock); - read_unlock_irqrestore(&bond->lock, flags); - } - - len += sprintf(buf + len, "MII Status: "); - len += sprintf(buf + len, - link == BMSR_LSTATUS ? 
"up\n" : "down\n"); - len += sprintf(buf + len, "MII Polling Interval (ms): %d\n", - miimon); - len += sprintf(buf + len, "Up Delay (ms): %d\n", - updelay * miimon); - len += sprintf(buf + len, "Down Delay (ms): %d\n", - downdelay * miimon); - len += sprintf(buf + len, "Multicast Mode: %s\n", - multicast_mode_name()); - - read_lock_irqsave(&bond->lock, flags); - for (slave = bond->prev; slave != (slave_t *)bond; - slave = slave->prev) { - len += sprintf(buf + len, "\nSlave Interface: %s\n", slave->dev->name); - - len += sprintf(buf + len, "MII Status: "); - - len += sprintf(buf + len, - slave->link == BOND_LINK_UP ? - "up\n" : "down\n"); - len += sprintf(buf + len, "Link Failure Count: %d\n", - slave->link_failure_count); - - len += sprintf(buf + len, - "Permanent HW addr: %02x:%02x:%02x:%02x:%02x:%02x\n", - slave->perm_hwaddr[0], - slave->perm_hwaddr[1], - slave->perm_hwaddr[2], - slave->perm_hwaddr[3], - slave->perm_hwaddr[4], - slave->perm_hwaddr[5]); - } - read_unlock_irqrestore(&bond->lock, flags); - - /* - * Figure out the calcs for the /proc/net interface - */ - *start = buf + (offset - begin); - len -= (offset - begin); - if (len > length) { - len = length; - } - if (len < 0) { - len = 0; - } - - - bond = bond->next_bond; - } - return len; -} - -static int bond_event(struct notifier_block *this, unsigned long event, - void *ptr) -{ - struct bonding *this_bond = (struct bonding *)these_bonds; - struct bonding *last_bond; - struct net_device *event_dev = (struct net_device *)ptr; - - /* while there are bonds configured */ - while (this_bond != NULL) { - if (this_bond == event_dev->priv ) { - switch (event) { - case NETDEV_UNREGISTER: - /* - * remove this bond from a linked list of - * bonds - */ - if (this_bond == these_bonds) { - these_bonds = this_bond->next_bond; - } else { - for (last_bond = these_bonds; - last_bond != NULL; - last_bond = last_bond->next_bond) { - if (last_bond->next_bond == - this_bond) { - last_bond->next_bond = - this_bond->next_bond; 
- } - } - } - return NOTIFY_DONE; - - default: - return NOTIFY_DONE; - } - } else if (this_bond->device == event_dev->master) { - switch (event) { - case NETDEV_UNREGISTER: - bond_release(this_bond->device, event_dev); - break; - } - return NOTIFY_DONE; - } - this_bond = this_bond->next_bond; - } - return NOTIFY_DONE; -} - -static struct notifier_block bond_netdev_notifier = { - notifier_call: bond_event, -}; - -static int __init bond_init(struct net_device *dev) -{ - bonding_t *bond, *this_bond, *last_bond; - int count; - -#ifdef BONDING_DEBUG - printk (KERN_INFO "Begin bond_init for %s\n", dev->name); -#endif - bond = kmalloc(sizeof(struct bonding), GFP_KERNEL); - if (bond == NULL) { - return -ENOMEM; - } - memset(bond, 0, sizeof(struct bonding)); - - /* initialize rwlocks */ - rwlock_init(&bond->lock); - rwlock_init(&bond->ptrlock); - - bond->stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL); - if (bond->stats == NULL) { - kfree(bond); - return -ENOMEM; - } - memset(bond->stats, 0, sizeof(struct net_device_stats)); - - bond->next = bond->prev = (slave_t *)bond; - bond->current_slave = NULL; - bond->current_arp_slave = NULL; - bond->device = dev; - dev->priv = bond; - - /* Initialize the device structure. */ - switch (bond_mode) { - case BOND_MODE_ACTIVEBACKUP: - dev->hard_start_xmit = bond_xmit_activebackup; - break; - case BOND_MODE_ROUNDROBIN: - dev->hard_start_xmit = bond_xmit_roundrobin; - break; - case BOND_MODE_XOR: - dev->hard_start_xmit = bond_xmit_xor; - break; - case BOND_MODE_BROADCAST: - dev->hard_start_xmit = bond_xmit_broadcast; - break; - default: - printk(KERN_ERR "Unknown bonding mode %d\n", bond_mode); - kfree(bond->stats); - kfree(bond); - return -EINVAL; - } - - dev->get_stats = bond_get_stats; - dev->open = bond_open; - dev->stop = bond_close; - dev->set_multicast_list = set_multicast_list; - dev->do_ioctl = bond_ioctl; - - /* - * Fill in the fields of the device structure with ethernet-generic - * values. 
- */ - - ether_setup(dev); - - dev->tx_queue_len = 0; - dev->flags |= IFF_MASTER|IFF_MULTICAST; -#ifdef CONFIG_NET_FASTROUTE - dev->accept_fastpath = bond_accept_fastpath; -#endif - - printk(KERN_INFO "%s registered with", dev->name); - if (miimon > 0) { - printk(" MII link monitoring set to %d ms", miimon); - updelay /= miimon; - downdelay /= miimon; - } else { - printk("out MII link monitoring"); - } - printk(", in %s mode.\n", bond_mode_name()); - - printk(KERN_INFO "%s registered with", dev->name); - if (arp_interval > 0) { - printk(" ARP monitoring set to %d ms with %d target(s):", - arp_interval, arp_ip_count); - for (count=0 ; countbond_proc_dir = proc_mkdir(dev->name, proc_net); - if (bond->bond_proc_dir == NULL) { - printk(KERN_ERR "%s: Cannot init /proc/net/%s/\n", - dev->name, dev->name); - kfree(bond->stats); - kfree(bond); - return -ENOMEM; - } - bond->bond_proc_info_file = - create_proc_info_entry("info", 0, bond->bond_proc_dir, - bond_get_info); - if (bond->bond_proc_info_file == NULL) { - printk(KERN_ERR "%s: Cannot init /proc/net/%s/info\n", - dev->name, dev->name); - remove_proc_entry(dev->name, proc_net); - kfree(bond->stats); - kfree(bond); - return -ENOMEM; - } -#endif /* CONFIG_PROC_FS */ - - if (first_pass == 1) { - these_bonds = bond; - register_netdevice_notifier(&bond_netdev_notifier); - first_pass = 0; - } else { - last_bond = these_bonds; - this_bond = these_bonds->next_bond; - while (this_bond != NULL) { - last_bond = this_bond; - this_bond = this_bond->next_bond; - } - last_bond->next_bond = bond; - } - - return 0; -} - -/* -static int __init bond_probe(struct net_device *dev) -{ - bond_init(dev); - return 0; -} - */ - -/* - * Convert string input module parms. Accept either the - * number of the mode or its string name. 
- */ -static inline int -bond_parse_parm(char *mode_arg, struct bond_parm_tbl *tbl) -{ - int i; - - for (i = 0; tbl[i].modename != NULL; i++) { - if ((isdigit(*mode_arg) && - tbl[i].mode == simple_strtol(mode_arg, NULL, 0)) || - (0 == strncmp(mode_arg, tbl[i].modename, - strlen(tbl[i].modename)))) { - return tbl[i].mode; - } - } - - return -1; -} - - -static int __init bonding_init(void) -{ - int no; - int err; - - /* Find a name for this unit */ - static struct net_device *dev_bond = NULL; - - printk(KERN_INFO "%s", version); - - /* - * Convert string parameters. - */ - if (mode) { - bond_mode = bond_parse_parm(mode, bond_mode_tbl); - if (bond_mode == -1) { - printk(KERN_WARNING - "bonding_init(): Invalid bonding mode \"%s\"\n", - mode == NULL ? "NULL" : mode); - return -EINVAL; - } - } - - if (multicast) { - multicast_mode = bond_parse_parm(multicast, bond_mc_tbl); - if (multicast_mode == -1) { - printk(KERN_WARNING - "bonding_init(): Invalid multicast mode \"%s\"\n", - multicast == NULL ? 
"NULL" : multicast); - return -EINVAL; - } - } - - if (max_bonds < 1 || max_bonds > INT_MAX) { - printk(KERN_WARNING - "bonding_init(): max_bonds (%d) not in range %d-%d, " - "so it was reset to BOND_DEFAULT_MAX_BONDS (%d)", - max_bonds, 1, INT_MAX, BOND_DEFAULT_MAX_BONDS); - max_bonds = BOND_DEFAULT_MAX_BONDS; - } - dev_bond = dev_bonds = kmalloc(max_bonds*sizeof(struct net_device), - GFP_KERNEL); - if (dev_bond == NULL) { - return -ENOMEM; - } - memset(dev_bonds, 0, max_bonds*sizeof(struct net_device)); - - if (miimon < 0) { - printk(KERN_WARNING - "bonding_init(): miimon module parameter (%d), " - "not in range 0-%d, so it was reset to %d\n", - miimon, INT_MAX, BOND_LINK_MON_INTERV); - miimon = BOND_LINK_MON_INTERV; - } - - if (updelay < 0) { - printk(KERN_WARNING - "bonding_init(): updelay module parameter (%d), " - "not in range 0-%d, so it was reset to 0\n", - updelay, INT_MAX); - updelay = 0; - } - - if (downdelay < 0) { - printk(KERN_WARNING - "bonding_init(): downdelay module parameter (%d), " - "not in range 0-%d, so it was reset to 0\n", - downdelay, INT_MAX); - downdelay = 0; - } - - if (miimon == 0) { - if ((updelay != 0) || (downdelay != 0)) { - /* just warn the user the up/down delay will have - * no effect since miimon is zero... 
- */ - printk(KERN_WARNING - "bonding_init(): miimon module parameter not " - "set and updelay (%d) or downdelay (%d) module " - "parameter is set; updelay and downdelay have " - "no effect unless miimon is set\n", - updelay, downdelay); - } - } else { - /* don't allow arp monitoring */ - if (arp_interval != 0) { - printk(KERN_WARNING - "bonding_init(): miimon (%d) and arp_interval " - "(%d) can't be used simultaneously, " - "disabling ARP monitoring\n", - miimon, arp_interval); - arp_interval = 0; - } - - if ((updelay % miimon) != 0) { - /* updelay will be rounded in bond_init() when it - * is divided by miimon, we just inform user here - */ - printk(KERN_WARNING - "bonding_init(): updelay (%d) is not a multiple " - "of miimon (%d), updelay rounded to %d ms\n", - updelay, miimon, (updelay / miimon) * miimon); - } - - if ((downdelay % miimon) != 0) { - /* downdelay will be rounded in bond_init() when it - * is divided by miimon, we just inform user here - */ - printk(KERN_WARNING - "bonding_init(): downdelay (%d) is not a " - "multiple of miimon (%d), downdelay rounded " - "to %d ms\n", - downdelay, miimon, - (downdelay / miimon) * miimon); - } - } - - if (arp_interval < 0) { - printk(KERN_WARNING - "bonding_init(): arp_interval module parameter (%d), " - "not in range 0-%d, so it was reset to %d\n", - arp_interval, INT_MAX, BOND_LINK_ARP_INTERV); - arp_interval = BOND_LINK_ARP_INTERV; - } - - for (arp_ip_count=0 ; - (arp_ip_count < MAX_ARP_IP_TARGETS) && arp_ip_target[arp_ip_count]; - arp_ip_count++ ) { - /* TODO: check and log bad ip address */ - if (my_inet_aton(arp_ip_target[arp_ip_count], - &arp_target[arp_ip_count]) == 0) { - printk(KERN_WARNING - "bonding_init(): bad arp_ip_target module " - "parameter (%s), ARP monitoring will not be " - "performed\n", - arp_ip_target[arp_ip_count]); - arp_interval = 0; - } - } - - - if ( (arp_interval > 0) && (arp_ip_count==0)) { - /* don't allow arping if no arp_ip_target given... 
*/ - printk(KERN_WARNING - "bonding_init(): arp_interval module parameter " - "(%d) specified without providing an arp_ip_target " - "parameter, arp_interval was reset to 0\n", - arp_interval); - arp_interval = 0; - } - - if ((miimon == 0) && (arp_interval == 0)) { - /* miimon and arp_interval not set, we need one so things - * work as expected, see bonding.txt for details - */ - printk(KERN_ERR - "bonding_init(): either miimon or " - "arp_interval and arp_ip_target module parameters " - "must be specified, otherwise bonding will not detect " - "link failures! see bonding.txt for details.\n"); - } - - if ((primary != NULL) && (bond_mode != BOND_MODE_ACTIVEBACKUP)){ - /* currently, using a primary only makes sence - * in active backup mode - */ - printk(KERN_WARNING - "bonding_init(): %s primary device specified but has " - " no effect in %s mode\n", - primary, bond_mode_name()); - primary = NULL; - } - - - for (no = 0; no < max_bonds; no++) { - dev_bond->init = bond_init; - - err = dev_alloc_name(dev_bond,"bond%d"); - if (err < 0) { - kfree(dev_bonds); - return err; - } - SET_MODULE_OWNER(dev_bond); - if (register_netdev(dev_bond) != 0) { - kfree(dev_bonds); - return -EIO; - } - dev_bond++; - } - return 0; -} - -static void __exit bonding_exit(void) -{ - struct net_device *dev_bond = dev_bonds; - struct bonding *bond; - int no; - - unregister_netdevice_notifier(&bond_netdev_notifier); - - for (no = 0; no < max_bonds; no++) { - -#ifdef CONFIG_PROC_FS - bond = (struct bonding *) dev_bond->priv; - remove_proc_entry("info", bond->bond_proc_dir); - remove_proc_entry(dev_bond->name, proc_net); -#endif - unregister_netdev(dev_bond); - kfree(bond->stats); - kfree(dev_bond->priv); - - dev_bond->priv = NULL; - dev_bond++; - } - kfree(dev_bonds); -} - -module_init(bonding_init); -module_exit(bonding_exit); -MODULE_LICENSE("GPL"); -MODULE_DESCRIPTION(DRV_DESCRIPTION ", v" DRV_VERSION); - -/* - * Local variables: - * c-indent-level: 8 - * c-basic-offset: 8 - * tab-width: 8 - * 
End: - */ diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/Makefile linux-2.4.20-bonding-20030317-devel/drivers/net/Makefile --- linux-2.4.20-bonding-20030317/drivers/net/Makefile 2003-03-18 17:03:29.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/Makefile 2003-03-18 17:03:29.000000000 +0200 @@ -29,6 +29,10 @@ ifeq ($(CONFIG_E1000),y) obj-y += e1000/e1000.o endif +ifeq ($(CONFIG_BONDING),y) + obj-y += bonding/bonding.o +endif + ifeq ($(CONFIG_ISDN_PPP),y) obj-$(CONFIG_ISDN) += slhc.o endif @@ -46,6 +50,7 @@ subdir-$(CONFIG_SK98LIN) += sk98lin subdir-$(CONFIG_SKFP) += skfp subdir-$(CONFIG_E100) += e100 subdir-$(CONFIG_E1000) += e1000 +subdir-$(CONFIG_BONDING) += bonding # # link order important here @@ -157,7 +162,6 @@ endif obj-$(CONFIG_STRIP) += strip.o obj-$(CONFIG_DUMMY) += dummy.o -obj-$(CONFIG_BONDING) += bonding.o obj-$(CONFIG_DE600) += de600.o obj-$(CONFIG_DE620) += de620.o obj-$(CONFIG_AT1500) += lance.o -- | Shmulik Hen | | Israel Design Center (Jerusalem) | | LAN Access Division | | Intel Communications Group, Intel corp. 
| | | | Anti-Spam: shmulik dot hen at intel dot com | From hshmulik@intel.com Thu Mar 20 07:18:57 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 07:19:06 -0800 (PST) Received: from caduceus.fm.intel.com (fmr02.intel.com [192.55.52.25]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KFIeq9026739 for ; Thu, 20 Mar 2003 07:18:40 -0800 Received: from talaria.fm.intel.com (talaria.fm.intel.com [10.1.192.39]) by caduceus.fm.intel.com (8.11.6/8.11.6/d: outer.mc,v 1.51 2002/09/23 20:43:23 dmccart Exp $) with ESMTP id h2KFBt608010 for ; Thu, 20 Mar 2003 15:11:59 GMT Received: from fmsmsxv040-1.fm.intel.com (fmsmsxvs040.fm.intel.com [132.233.42.124]) by talaria.fm.intel.com (8.11.6/8.11.6/d: inner.mc,v 1.28 2003/01/13 19:44:39 dmccart Exp $) with SMTP id h2KFJsc09831 for ; Thu, 20 Mar 2003 15:19:54 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxv040-1.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003032007183911683 ; Thu, 20 Mar 2003 07:18:41 -0800 Date: Thu, 20 Mar 2003 17:18:23 +0200 (IST) From: Shmulik Hen X-X-Sender: hshmulik@jrslxjul4.npdj.intel.com To: Bonding Developement list , Bonding Announce list , Linux Net Mailing list , Linux Kernel Mailing list , Oss SGI Netdev list , Jeff Garzik Subject: [patch] (8/8) Add 802.3ad support to bonding Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from QUOTED-PRINTABLE to 8bit by oss.sgi.com id h2KFIeq9026739 X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1996 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hshmulik@intel.com Precedence: bulk X-list: netdev Content-Length: 122965 Lines: 3413 This patch adds the code that implements IEEE 802.3ad dynamic link aggregation.
This mode offers the following advantages: automatic configuration, rapid configuration and reconfiguration, and deterministic behavior. It forms aggregation groups that include only members with full duplex and the same speed, and all adapters in the active aggregator simultaneously receive and transmit data. This patch is against bonding 2.4.20-20030317. diff -Nuarp linux-2.4.20-bonding-20030317/Documentation/networking/bonding.txt linux-2.4.20-bonding-20030317-devel/Documentation/networking/bonding.txt --- linux-2.4.20-bonding-20030317/Documentation/networking/bonding.txt 2003-03-18 17:24:24.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/Documentation/networking/bonding.txt 2003-03-18 17:24:25.000000000 +0200 @@ -237,6 +237,11 @@ text or numeric option): Broadcast policy: transmits everything on all slave interfaces. This mode provides fault tolerance. + 802.3ad or 4 + IEEE 802.3ad Dynamic link aggregation. Creates aggregation + groups that share the same speed and duplex settings. + Transmits and receives on all slaves in the active aggregator. + miimon Specifies the frequency in milli-seconds that MII link monitoring will @@ -412,7 +417,7 @@ Switch Configuration While the switch does not need to be configured when the active-backup policy is used (mode=1), it does need to be configured for the round-robin, -XOR, and broadcast policies (mode=0, mode=2, and mode=3). +XOR, broadcast, and 802.3ad policies (mode=0, mode=2, mode=3, and mode=4). Verifying Bond Configuration @@ -445,7 +450,7 @@ parameters of mode=0 and miimon=1000 is The network configuration can be verified using the ifconfig command. In the example below, the bond0 interface is the master (MASTER) while eth0 and eth1 are slaves (SLAVE). Notice all slaves of bond0 have the same MAC address -(HWaddr) as bond0. +(HWaddr) as bond0 (except for 802.3ad mode). [root]# /sbin/ifconfig bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 @@ -538,6 +543,13 @@ Frequently Asked Questions units. 
* Linux bonding, of course ! + In 802.3ad mode, it works with systems that support IEEE 802.3ad + Dynamic Link Aggregation: + + * Extreme Networks Summit 7i (look for link-aggregation). + * Cisco 6500 series (look for lacp). + * Foundry Big Iron 4000 + In active-backup mode, it should work with any Layer-II switche. @@ -591,6 +603,9 @@ Frequently Asked Questions Broadcast policy transmits everything on all slave interfaces. + 802.3ad policy, based on XOR, distributes traffic among all interfaces + in the active aggregator. + High Availability ================= diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding/bond_3ad.c linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bond_3ad.c --- linux-2.4.20-bonding-20030317/drivers/net/bonding/bond_3ad.c 1970-01-01 02:00:00.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bond_3ad.c 2003-03-18 17:24:25.000000000 +0200 @@ -0,0 +1,2454 @@ +/**************************************************************************** + Copyright(c) 1999 - 2003 Intel Corporation. All rights reserved. + + This program is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, but WITHOUT + ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + You should have received a copy of the GNU General Public License along with + this program; if not, write to the Free Software Foundation, Inc., 59 + Temple Place - Suite 330, Boston, MA 02111-1307, USA. + + The full GNU General Public License is included in this distribution in the + file called LICENSE.
+*****************************************************************************/ + +#include +#include +#include +#include +#include +#include +#include "bonding.h" +#include "bond_3ad.h" + +// General definitions +#define AD_SHORT_TIMEOUT 1 +#define AD_LONG_TIMEOUT 0 +#define AD_STANDBY 0x2 +#define AD_MAX_TX_IN_SECOND 3 +#define AD_COLLECTOR_MAX_DELAY 0 + +// Timer definitions(43.4.4 in the 802.3ad standard) +#define AD_FAST_PERIODIC_TIME 1 +#define AD_SLOW_PERIODIC_TIME 30 +#define AD_SHORT_TIMEOUT_TIME (3*AD_FAST_PERIODIC_TIME) +#define AD_LONG_TIMEOUT_TIME (3*AD_SLOW_PERIODIC_TIME) +#define AD_CHURN_DETECTION_TIME 60 +#define AD_AGGREGATE_WAIT_TIME 2 + +// Port state definitions(43.4.2.2 in the 802.3ad standard) +#define AD_STATE_LACP_ACTIVITY 0x1 +#define AD_STATE_LACP_TIMEOUT 0x2 +#define AD_STATE_AGGREGATION 0x4 +#define AD_STATE_SYNCHRONIZATION 0x8 +#define AD_STATE_COLLECTING 0x10 +#define AD_STATE_DISTRIBUTING 0x20 +#define AD_STATE_DEFAULTED 0x40 +#define AD_STATE_EXPIRED 0x80 + +// Port Variables definitions used by the State Machines(43.4.7 in the 802.3ad standard) +#define AD_PORT_BEGIN 0x1 +#define AD_PORT_LACP_ENABLED 0x2 +#define AD_PORT_ACTOR_CHURN 0x4 +#define AD_PORT_PARTNER_CHURN 0x8 +#define AD_PORT_READY 0x10 +#define AD_PORT_READY_N 0x20 +#define AD_PORT_MATCHED 0x40 +#define AD_PORT_STANDBY 0x80 +#define AD_PORT_SELECTED 0x100 +#define AD_PORT_MOVED 0x200 + +// Port Key definitions +// key is determined according to the link speed, duplex and +// user key(which is yet not supported) +// ------------------------------------------------------------ +// Port key : | User key | Speed |Duplex| +// ------------------------------------------------------------ +// 16 6 1 0 +#define AD_DUPLEX_KEY_BITS 0x1 +#define AD_SPEED_KEY_BITS 0x3E +#define AD_USER_KEY_BITS 0xFFC0 + +//dalloun +#define AD_LINK_SPEED_BITMASK_1MBPS 0x1 +#define AD_LINK_SPEED_BITMASK_10MBPS 0x2 +#define AD_LINK_SPEED_BITMASK_100MBPS 0x4 +#define AD_LINK_SPEED_BITMASK_1000MBPS 0x8 
+//endalloun + +// compare MAC addresses +#define MAC_ADDRESS_COMPARE(A, B) memcmp(A, B, ETH_ALEN) + +static struct mac_addr null_mac_addr = {{0, 0, 0, 0, 0, 0}}; +static u16 ad_ticks_per_sec; + +// ================= 3AD api to bonding and kernel code ================== +static u16 __get_link_speed(struct port *port); +static u8 __get_duplex(struct port *port); +static inline void __initialize_port_locks(struct port *port); +static inline void __deinitialize_port_locks(struct port *port); +//conversions +static void __ntohs_lacpdu(struct lacpdu *lacpdu); +static u16 __ad_timer_to_ticks(u16 timer_type, u16 Par); + + +// ================= ad code helper functions ================== +//needed by ad_rx_machine(...) +static void __record_pdu(struct lacpdu *lacpdu, struct port *port); +static void __record_default(struct port *port); +static void __update_selected(struct lacpdu *lacpdu, struct port *port); +static void __update_default_selected(struct port *port); +static void __choose_matched(struct lacpdu *lacpdu, struct port *port); +static void __update_ntt(struct lacpdu *lacpdu, struct port *port); + +//needed for ad_mux_machine(..) +static void __attach_bond_to_agg(struct port *port); +static void __detach_bond_from_agg(struct port *port); +static int __agg_ports_are_ready(struct aggregator *aggregator); +static void __set_agg_ports_ready(struct aggregator *aggregator, int val); + +//needed for ad_agg_selection_logic(...) 
+static u32 __get_agg_bandwidth(struct aggregator *aggregator); +static struct aggregator *__get_active_agg(struct aggregator *aggregator); + + +// ================= main 802.3ad protocol functions ================== +static int ad_lacpdu_send(struct port *port); +static int ad_marker_send(struct port *port, struct marker *marker); +static void ad_mux_machine(struct port *port); +static void ad_rx_machine(struct lacpdu *lacpdu, struct port *port); +static void ad_tx_machine(struct port *port); +static void ad_periodic_machine(struct port *port); +static void ad_port_selection_logic(struct port *port); +static void ad_agg_selection_logic(struct aggregator *aggregator); +static void ad_clear_agg(struct aggregator *aggregator); +static void ad_initialize_agg(struct aggregator *aggregator); +static void ad_initialize_port(struct port *port); +static void ad_initialize_lacpdu(struct lacpdu *Lacpdu); +static void ad_enable_collecting_distributing(struct port *port); +static void ad_disable_collecting_distributing(struct port *port); +static void ad_marker_info_received(struct marker *marker_info, struct port *port); +static void ad_marker_response_received(struct marker *marker, struct port *port); + + +///////////////////////////////////////////////////////////////////////////////// +// ================= api to bonding and kernel code ================== +///////////////////////////////////////////////////////////////////////////////// + +/** + * __get_bond_by_port - get the port's bonding struct + * @port: the port we're looking at + * + * Return @port's bonding struct, or %NULL if it can't be found. + */ +static inline struct bonding *__get_bond_by_port(struct port *port) +{ + if (port->slave == NULL) { + return NULL; + } + + return bond_get_bond_by_slave(port->slave); +} + +/** + * __get_first_port - get the first port in the bond + * @bond: the bond we're looking at + * + * Return the port of the first slave in @bond, or %NULL if it can't be found. 
+ */ +static inline struct port *__get_first_port(struct bonding *bond) +{ + struct slave *slave = bond->next; + + if (slave == (struct slave *)bond) { + return NULL; + } + + return &(SLAVE_AD_INFO(slave).port); +} + +/** + * __get_next_port - get the next port in the bond + * @port: the port we're looking at + * + * Return the port of the slave that is next in line after @port's slave in the + * bond, or %NULL if it can't be found. + */ +static inline struct port *__get_next_port(struct port *port) +{ + struct bonding *bond = __get_bond_by_port(port); + struct slave *slave = port->slave; + + // If there's no bond for this port, or this is the last slave + if ((bond == NULL) || (slave->next == bond->next)) { + return NULL; + } + + return &(SLAVE_AD_INFO(slave->next).port); +} + +/** + * __get_first_agg - get the first aggregator in the bond + * @port: a port that belongs to the bond we're looking at + * + * Return the aggregator of the first slave in @port's bond, or %NULL if it + * can't be found. + */ +static inline struct aggregator *__get_first_agg(struct port *port) +{ + struct bonding *bond = __get_bond_by_port(port); + + // If there's no bond for this port, or the bond has no slaves + if ((bond == NULL) || (bond->next == (struct slave *)bond)) { + return NULL; + } + + return &(SLAVE_AD_INFO(bond->next).aggregator); +} + +/** + * __get_next_agg - get the next aggregator in the bond + * @aggregator: the aggregator we're looking at + * + * Return the aggregator of the slave that is next in line after @aggregator's + * slave in the bond, or %NULL if it can't be found.
+ */ +static inline struct aggregator *__get_next_agg(struct aggregator *aggregator) +{ + struct slave *slave = aggregator->slave; + struct bonding *bond = bond_get_bond_by_slave(slave); + + // If there's no bond for this aggregator, or this is the last slave + if ((bond == NULL) || (slave->next == bond->next)) { + return NULL; + } + + return &(SLAVE_AD_INFO(slave->next).aggregator); +} + +/** + * __disable_port - disable the port's slave + * @port: the port we're looking at + * + */ +static inline void __disable_port(struct port *port) +{ + bond_set_slave_inactive_flags(port->slave); +} + +/** + * __enable_port - enable the port's slave, if it's up + * @port: the port we're looking at + * + */ +static inline void __enable_port(struct port *port) +{ + struct slave *slave = port->slave; + + if ((slave->link == BOND_LINK_UP) && IS_UP(slave->dev)) { + bond_set_slave_active_flags(slave); + } +} + +/** + * __port_is_enabled - check if the port's slave is in active state + * @port: the port we're looking at + * + */ +static inline int __port_is_enabled(struct port *port) +{ + return(port->slave->state == BOND_STATE_ACTIVE); +} + +/** + * __get_agg_selection_mode - get the aggregator selection mode + * @port: the port we're looking at + * + * Get the aggregator selection mode. Can be %BANDWIDTH or %COUNT. + */ +static inline u32 __get_agg_selection_mode(struct port *port) +{ + struct bonding *bond = __get_bond_by_port(port); + + if (bond == NULL) { + return AD_BANDWIDTH; + } + + return BOND_AD_INFO(bond).agg_select_mode; +} + +/** + * __check_agg_selection_timer - check if the selection timer has expired + * @port: the port we're looking at + * + */ +static inline int __check_agg_selection_timer(struct port *port) +{ + struct bonding *bond = __get_bond_by_port(port); + + if (bond == NULL) { + return 0; + } + + return BOND_AD_INFO(bond).agg_select_timer ? 
1 : 0; +} + +/** + * __get_rx_machine_lock - lock the port's RX machine + * @port: the port we're looking at + * + */ +static inline void __get_rx_machine_lock(struct port *port) +{ + spin_lock(&(SLAVE_AD_INFO(port->slave).rx_machine_lock)); +} + +/** + * __release_rx_machine_lock - unlock the port's RX machine + * @port: the port we're looking at + * + */ +static inline void __release_rx_machine_lock(struct port *port) +{ + spin_unlock(&(SLAVE_AD_INFO(port->slave).rx_machine_lock)); +} + +/** + * __get_link_speed - get a port's speed + * @port: the port we're looking at + * + * Return @port's speed in 802.3ad bitmask format. i.e. one of: + * 0, + * %AD_LINK_SPEED_BITMASK_10MBPS, + * %AD_LINK_SPEED_BITMASK_100MBPS, + * %AD_LINK_SPEED_BITMASK_1000MBPS + */ +static u16 __get_link_speed(struct port *port) +{ + struct slave *slave = port->slave; + u16 speed; + + /* this if covers only a special case: when the configuration starts with + * link down, it sets the speed to 0. + * This is done in spite of the fact that the e100 driver reports 0 to be + * compatible with MVT in the future.*/ + if (slave->link != BOND_LINK_UP) { + speed=0; + } else { + switch (slave->speed) { + case SPEED_10: + speed = AD_LINK_SPEED_BITMASK_10MBPS; + break; + + case SPEED_100: + speed = AD_LINK_SPEED_BITMASK_100MBPS; + break; + + case SPEED_1000: + speed = AD_LINK_SPEED_BITMASK_1000MBPS; + break; + + default: + speed = 0; // unknown speed value from ethtool. shouldn't happen + break; + } + } + + BOND_PRINT_DBG(("Port %d Received link speed %d update from adapter", port->actor_port_number, speed)); + return speed; +} + +/** + * __get_duplex - get a port's duplex + * @port: the port we're looking at + * + * Return @port's duplex in 802.3ad bitmask format. 
i.e.: + * 0x01 if in full duplex + * 0x00 otherwise + */ +static u8 __get_duplex(struct port *port) +{ + struct slave *slave = port->slave; + + u8 retval; + + // handling a special case: when the configuration starts with + // link down, it sets the duplex to 0. + if (slave->link != BOND_LINK_UP) { + retval=0x0; + } else { + switch (slave->duplex) { + case DUPLEX_FULL: + retval=0x1; + BOND_PRINT_DBG(("Port %d Received status full duplex update from adapter", port->actor_port_number)); + break; + case DUPLEX_HALF: + default: + retval=0x0; + BOND_PRINT_DBG(("Port %d Received status NOT full duplex update from adapter", port->actor_port_number)); + break; + } + } + return retval; +} + +/** + * __initialize_port_locks - initialize a port's RX machine spinlock + * @port: the port we're looking at + * + */ +static inline void __initialize_port_locks(struct port *port) +{ + // make sure it isn't called twice + spin_lock_init(&(SLAVE_AD_INFO(port->slave).rx_machine_lock)); +} + +/** + * __deinitialize_port_locks - deinitialize a port's RX machine spinlock + * @port: the port we're looking at + * + */ +static inline void __deinitialize_port_locks(struct port *port) +{ +} + +//conversions +/** + * __ntohs_lacpdu - convert the contents of a LACPDU to host byte order + * @lacpdu: the specified lacpdu + * + * For each multi-byte field in the lacpdu, convert its content + */ +static void __ntohs_lacpdu(struct lacpdu *lacpdu) +{ + if (lacpdu) { + lacpdu->actor_system_priority = ntohs(lacpdu->actor_system_priority); + lacpdu->actor_key = ntohs(lacpdu->actor_key); + lacpdu->actor_port_priority = ntohs(lacpdu->actor_port_priority); + lacpdu->actor_port = ntohs(lacpdu->actor_port); + lacpdu->partner_system_priority = ntohs(lacpdu->partner_system_priority); + lacpdu->partner_key = ntohs(lacpdu->partner_key); + lacpdu->partner_port_priority = ntohs(lacpdu->partner_port_priority); + lacpdu->partner_port = ntohs(lacpdu->partner_port); + lacpdu->collector_max_delay =
ntohs(lacpdu->collector_max_delay); + } +} + +/** + * __ad_timer_to_ticks - convert a given timer type to AD module ticks + * @timer_type: which timer to operate + * @par: timer parameter. see below + * + * If @timer_type is %current_while_timer, @par indicates long/short timer. + * If @timer_type is %periodic_timer, @par is one of %FAST_PERIODIC_TIME, + * %SLOW_PERIODIC_TIME. + */ +static u16 __ad_timer_to_ticks(u16 timer_type, u16 par) +{ + u16 retval=0; //to silence the compiler + + switch (timer_type) { + case AD_CURRENT_WHILE_TIMER: // for rx machine usage + if (par) { // for short or long timeout + retval = (AD_SHORT_TIMEOUT_TIME*ad_ticks_per_sec); // short timeout + } else { + retval = (AD_LONG_TIMEOUT_TIME*ad_ticks_per_sec); // long timeout + } + break; + case AD_ACTOR_CHURN_TIMER: // for local churn machine + retval = (AD_CHURN_DETECTION_TIME*ad_ticks_per_sec); + break; + case AD_PERIODIC_TIMER: // for periodic machine + retval = (par*ad_ticks_per_sec); // long timeout + break; + case AD_PARTNER_CHURN_TIMER: // for remote churn machine + retval = (AD_CHURN_DETECTION_TIME*ad_ticks_per_sec); + break; + case AD_WAIT_WHILE_TIMER: // for selection machine + retval = (AD_AGGREGATE_WAIT_TIME*ad_ticks_per_sec); + break; + } + return retval; +} + + +///////////////////////////////////////////////////////////////////////////////// +// ================= ad_rx_machine helper functions ================== +///////////////////////////////////////////////////////////////////////////////// + +/** + * __record_pdu - record parameters from a received lacpdu + * @lacpdu: the lacpdu we've received + * @port: the port we're looking at + * + * Record the parameter values for the Actor carried in a received lacpdu as + * the current partner operational parameter values and sets + * actor_oper_port_state.defaulted to FALSE. 
+ */ +static void __record_pdu(struct lacpdu *lacpdu, struct port *port) +{ + // validate lacpdu and port + if (lacpdu && port) { + // record the new parameter values for the partner operational + port->partner_oper_port_number = lacpdu->actor_port; + port->partner_oper_port_priority = lacpdu->actor_port_priority; + port->partner_oper_system = lacpdu->actor_system; + port->partner_oper_system_priority = lacpdu->actor_system_priority; + port->partner_oper_key = lacpdu->actor_key; + // zero partner's last states + port->partner_oper_port_state = 0; + port->partner_oper_port_state |= (lacpdu->actor_state & AD_STATE_LACP_ACTIVITY); + port->partner_oper_port_state |= (lacpdu->actor_state & AD_STATE_LACP_TIMEOUT); + port->partner_oper_port_state |= (lacpdu->actor_state & AD_STATE_AGGREGATION); + port->partner_oper_port_state |= (lacpdu->actor_state & AD_STATE_SYNCHRONIZATION); + port->partner_oper_port_state |= (lacpdu->actor_state & AD_STATE_COLLECTING); + port->partner_oper_port_state |= (lacpdu->actor_state & AD_STATE_DISTRIBUTING); + port->partner_oper_port_state |= (lacpdu->actor_state & AD_STATE_DEFAULTED); + port->partner_oper_port_state |= (lacpdu->actor_state & AD_STATE_EXPIRED); + + // set actor_oper_port_state.defaulted to FALSE + port->actor_oper_port_state &= ~AD_STATE_DEFAULTED; + + // set the partner sync. to on if the partner is sync. and the port is matched + if ((port->sm_vars & AD_PORT_MATCHED) && (lacpdu->actor_state & AD_STATE_SYNCHRONIZATION)) { + port->partner_oper_port_state |= AD_STATE_SYNCHRONIZATION; + } else { + port->partner_oper_port_state &= ~AD_STATE_SYNCHRONIZATION; + } + } +} + +/** + * __record_default - record default parameters + * @port: the port we're looking at + * + * This function records the default parameter values for the partner carried + * in the Partner Admin parameters as the current partner operational parameter + * values and sets actor_oper_port_state.defaulted to TRUE.
+ */ +static void __record_default(struct port *port) +{ + // validate the port + if (port) { + // record the partner admin parameters + port->partner_oper_port_number = port->partner_admin_port_number; + port->partner_oper_port_priority = port->partner_admin_port_priority; + port->partner_oper_system = port->partner_admin_system; + port->partner_oper_system_priority = port->partner_admin_system_priority; + port->partner_oper_key = port->partner_admin_key; + port->partner_oper_port_state = port->partner_admin_port_state; + + // set actor_oper_port_state.defaulted to true + port->actor_oper_port_state |= AD_STATE_DEFAULTED; + } +} + +/** + * __update_selected - update a port's Selected variable from a received lacpdu + * @lacpdu: the lacpdu we've received + * @port: the port we're looking at + * + * Update the value of the selected variable, using parameter values from a + * newly received lacpdu. The parameter values for the Actor carried in the + * received PDU are compared with the corresponding operational parameter + * values for the ports partner. If one or more of the comparisons shows that + * the value(s) received in the PDU differ from the current operational values, + * then selected is set to FALSE and actor_oper_port_state.synchronization is + * set to out_of_sync. Otherwise, selected remains unchanged. 
+ */ +static void __update_selected(struct lacpdu *lacpdu, struct port *port) +{ + // validate lacpdu and port + if (lacpdu && port) { + // check if any parameter is different + if ((lacpdu->actor_port != port->partner_oper_port_number) || + (lacpdu->actor_port_priority != port->partner_oper_port_priority) || + MAC_ADDRESS_COMPARE(&(lacpdu->actor_system), &(port->partner_oper_system)) || + (lacpdu->actor_system_priority != port->partner_oper_system_priority) || + (lacpdu->actor_key != port->partner_oper_key) || + ((lacpdu->actor_state & AD_STATE_AGGREGATION) != (port->partner_oper_port_state & AD_STATE_AGGREGATION)) + ) { + // update the state machine Selected variable + port->sm_vars &= ~AD_PORT_SELECTED; + } + } +} + +/** + * __update_default_selected - update a port's Selected variable from Partner + * @port: the port we're looking at + * + * This function updates the value of the selected variable, using the partner + * administrative parameter values. The administrative values are compared with + * the corresponding operational parameter values for the partner. If one or + * more of the comparisons shows that the administrative value(s) differ from + * the current operational values, then Selected is set to FALSE and + * actor_oper_port_state.synchronization is set to OUT_OF_SYNC. Otherwise, + * Selected remains unchanged. 
+ */ +static void __update_default_selected(struct port *port) +{ + // validate the port + if (port) { + // check if any parameter is different + if ((port->partner_admin_port_number != port->partner_oper_port_number) || + (port->partner_admin_port_priority != port->partner_oper_port_priority) || + MAC_ADDRESS_COMPARE(&(port->partner_admin_system), &(port->partner_oper_system)) || + (port->partner_admin_system_priority != port->partner_oper_system_priority) || + (port->partner_admin_key != port->partner_oper_key) || + ((port->partner_admin_port_state & AD_STATE_AGGREGATION) != (port->partner_oper_port_state & AD_STATE_AGGREGATION)) + ) { + // update the state machine Selected variable + port->sm_vars &= ~AD_PORT_SELECTED; + } + } +} + +/** + * __choose_matched - update a port's matched variable from a received lacpdu + * @lacpdu: the lacpdu we've received + * @port: the port we're looking at + * + * Update the value of the matched variable, using parameter values from a + * newly received lacpdu. Parameter values for the partner carried in the + * received PDU are compared with the corresponding operational parameter + * values for the actor. Matched is set to TRUE if all of these parameters + * match and the PDU parameter partner_state.aggregation has the same value as + * actor_oper_port_state.aggregation and lacp will actively maintain the link + * in the aggregation. Matched is also set to TRUE if the value of + * actor_state.aggregation in the received PDU is set to FALSE, i.e., indicates + * an individual link and lacp will actively maintain the link. Otherwise, + * matched is set to FALSE. LACP is considered to be actively maintaining the + * link if either the PDU's actor_state.lacp_activity variable is TRUE or both + * the actor's actor_oper_port_state.lacp_activity and the PDU's + * partner_state.lacp_activity variables are TRUE. 
+ */ +static void __choose_matched(struct lacpdu *lacpdu, struct port *port) +{ + // validate lacpdu and port + if (lacpdu && port) { + // check if all parameters are alike + if (((lacpdu->partner_port == port->actor_port_number) && + (lacpdu->partner_port_priority == port->actor_port_priority) && + !MAC_ADDRESS_COMPARE(&(lacpdu->partner_system), &(port->actor_system)) && + (lacpdu->partner_system_priority == port->actor_system_priority) && + (lacpdu->partner_key == port->actor_oper_port_key) && + ((lacpdu->partner_state & AD_STATE_AGGREGATION) == (port->actor_oper_port_state & AD_STATE_AGGREGATION))) || + // or this is individual link(aggregation == FALSE) + ((lacpdu->actor_state & AD_STATE_AGGREGATION) == 0) + ) { + // update the state machine Matched variable + port->sm_vars |= AD_PORT_MATCHED; + } else { + port->sm_vars &= ~AD_PORT_MATCHED; + } + } +} + +/** + * __update_ntt - update a port's ntt variable from a received lacpdu + * @lacpdu: the lacpdu we've received + * @port: the port we're looking at + * + * Updates the value of the ntt variable, using parameter values from a newly + * received lacpdu. The parameter values for the partner carried in the + * received PDU are compared with the corresponding operational parameter + * values for the Actor. If one or more of the comparisons shows that the + * value(s) received in the PDU differ from the current operational values, + * then ntt is set to TRUE. Otherwise, ntt remains unchanged. 
+ */ +static void __update_ntt(struct lacpdu *lacpdu, struct port *port) +{ + // validate lacpdu and port + if (lacpdu && port) { + // check if any parameter is different + if ((lacpdu->partner_port != port->actor_port_number) || + (lacpdu->partner_port_priority != port->actor_port_priority) || + MAC_ADDRESS_COMPARE(&(lacpdu->partner_system), &(port->actor_system)) || + (lacpdu->partner_system_priority != port->actor_system_priority) || + (lacpdu->partner_key != port->actor_oper_port_key) || + ((lacpdu->partner_state & AD_STATE_LACP_ACTIVITY) != (port->actor_oper_port_state & AD_STATE_LACP_ACTIVITY)) || + ((lacpdu->partner_state & AD_STATE_LACP_TIMEOUT) != (port->actor_oper_port_state & AD_STATE_LACP_TIMEOUT)) || + ((lacpdu->partner_state & AD_STATE_SYNCHRONIZATION) != (port->actor_oper_port_state & AD_STATE_SYNCHRONIZATION)) || + ((lacpdu->partner_state & AD_STATE_AGGREGATION) != (port->actor_oper_port_state & AD_STATE_AGGREGATION)) + ) { + // set ntt to be TRUE + port->ntt = 1; + } + } +} + +/** + * __attach_bond_to_agg + * @port: the port we're looking at + * + * Handle the attaching of the port's control parser/multiplexer and the + * aggregator. This function does nothing since the parser/multiplexer of the + * receive and the parser/multiplexer of the aggregator are already combined. + */ +static void __attach_bond_to_agg(struct port *port) +{ + port=NULL; // just to satisfy the compiler + // This function does nothing since the parser/multiplexer of the receive + // and the parser/multiplexer of the aggregator are already combined +} + +/** + * __detach_bond_from_agg + * @port: the port we're looking at + * + * Handle the detaching of the port's control parser/multiplexer from the + * aggregator. This function does nothing since the parser/multiplexer of the + * receive and the parser/multiplexer of the aggregator are already combined.
+ */ +static void __detach_bond_from_agg(struct port *port) +{ + port=NULL; // just to satisfy the compiler + // This function does nothing since the parser/multiplexer of the receive + // and the parser/multiplexer of the aggregator are already combined +} + +/** + * __agg_ports_are_ready - check if all ports in an aggregator are ready + * @aggregator: the aggregator we're looking at + * + */ +static int __agg_ports_are_ready(struct aggregator *aggregator) +{ + struct port *port; + int retval = 1; + + if (aggregator) { + // scan all ports in this aggregator to verify if they are all ready + for (port=aggregator->lag_ports; port; port=port->next_port_in_aggregator) { + if (!(port->sm_vars & AD_PORT_READY_N)) { + retval = 0; + break; + } + } + } + + return retval; +} + +/** + * __set_agg_ports_ready - set value of Ready bit in all ports of an aggregator + * @aggregator: the aggregator we're looking at + * @val: Should the ports' ready bit be set on or off + * + */ +static void __set_agg_ports_ready(struct aggregator *aggregator, int val) +{ + struct port *port; + + for (port=aggregator->lag_ports; port; port=port->next_port_in_aggregator) { + if (val) { + port->sm_vars |= AD_PORT_READY; + } else { + port->sm_vars &= ~AD_PORT_READY; + } + } +} + +/** + * __get_agg_bandwidth - get the total bandwidth of an aggregator + * @aggregator: the aggregator we're looking at + * + */ +static u32 __get_agg_bandwidth(struct aggregator *aggregator) +{ + u32 bandwidth=0; + u32 basic_speed; + + if (aggregator->num_of_ports) { + basic_speed = __get_link_speed(aggregator->lag_ports); + switch (basic_speed) { + case AD_LINK_SPEED_BITMASK_1MBPS: + bandwidth = aggregator->num_of_ports; + break; + case AD_LINK_SPEED_BITMASK_10MBPS: + bandwidth = aggregator->num_of_ports * 10; + break; + case AD_LINK_SPEED_BITMASK_100MBPS: + bandwidth = aggregator->num_of_ports * 100; + break; + case AD_LINK_SPEED_BITMASK_1000MBPS: + bandwidth = aggregator->num_of_ports * 1000; + break; + default: +
bandwidth=0; // to silence the compiler + } + } + return bandwidth; +} + +/** + * __get_active_agg - get the current active aggregator + * @aggregator: the aggregator we're looking at + * + */ +static struct aggregator *__get_active_agg(struct aggregator *aggregator) +{ + struct aggregator *retval = NULL; + + for (; aggregator; aggregator = __get_next_agg(aggregator)) { + if (aggregator->is_active) { + retval = aggregator; + break; + } + } + + return retval; +} + +////////////////////////////////////////////////////////////////////////////////////// +// ================= main 802.3ad protocol code ====================================== +////////////////////////////////////////////////////////////////////////////////////// + +/** + * ad_lacpdu_send - send out a lacpdu packet on a given port + * @port: the port we're looking at + * + * Returns: 0 on success + * < 0 on error + */ +static int ad_lacpdu_send(struct port *port) +{ + struct slave *slave = port->slave; + struct sk_buff *skb; + struct lacpdu_header *lacpdu_header; + int length = sizeof(struct lacpdu_header); + struct mac_addr lacpdu_multicast_address = AD_MULTICAST_LACPDU_ADDR; + + skb = dev_alloc_skb(length); + if (!skb) { + return -ENOMEM; + } + + skb->dev = slave->dev; + skb->mac.raw = skb->data; + skb->nh.raw = skb->data + ETH_HLEN; + skb->protocol = PKT_TYPE_LACPDU; + + lacpdu_header = (struct lacpdu_header *)skb_put(skb, length); + + lacpdu_header->ad_header.destination_address = lacpdu_multicast_address; + /* Note: source address is set to be the member's PERMANENT address, because we use it + to identify loopback lacpdus in receive.
*/ + lacpdu_header->ad_header.source_address = *((struct mac_addr *)(slave->perm_hwaddr)); + lacpdu_header->ad_header.length_type = PKT_TYPE_LACPDU; + + lacpdu_header->lacpdu = port->lacpdu; // struct copy + + dev_queue_xmit(skb); + + return 0; +} + +/** + * ad_marker_send - send marker information/response on a given port + * @port: the port we're looking at + * @marker: marker data to send + * + * Returns: 0 on success + * < 0 on error + */ +static int ad_marker_send(struct port *port, struct marker *marker) +{ + struct slave *slave = port->slave; + struct sk_buff *skb; + struct marker_header *marker_header; + int length = sizeof(struct marker_header); + struct mac_addr lacpdu_multicast_address = AD_MULTICAST_LACPDU_ADDR; + + skb = dev_alloc_skb(length + 16); + if (!skb) { + return -ENOMEM; + } + + skb_reserve(skb, 16); + + skb->dev = slave->dev; + skb->mac.raw = skb->data; + skb->nh.raw = skb->data + ETH_HLEN; + skb->protocol = PKT_TYPE_LACPDU; + + marker_header = (struct marker_header *)skb_put(skb, length); + + marker_header->ad_header.destination_address = lacpdu_multicast_address; + /* Note: source address is set to be the member's PERMANENT address, because we use it + to identify loopback MARKERs in receive.
*/ + marker_header->ad_header.source_address = *((struct mac_addr *)(slave->perm_hwaddr)); + marker_header->ad_header.length_type = PKT_TYPE_LACPDU; + + marker_header->marker = *marker; // struct copy + + dev_queue_xmit(skb); + + return 0; +} + +/** + * ad_mux_machine - handle a port's mux state machine + * @port: the port we're looking at + * + */ +static void ad_mux_machine(struct port *port) +{ + mux_states_t last_state; + + // keep current State Machine state to compare later if it was changed + last_state = port->sm_mux_state; + + if (port->sm_vars & AD_PORT_BEGIN) { + port->sm_mux_state = AD_MUX_DETACHED; // next state + } else { + switch (port->sm_mux_state) { + case AD_MUX_DETACHED: + if ((port->sm_vars & AD_PORT_SELECTED) || (port->sm_vars & AD_PORT_STANDBY)) { // if SELECTED or STANDBY + port->sm_mux_state = AD_MUX_WAITING; // next state + } + break; + case AD_MUX_WAITING: + // if SELECTED == FALSE return to DETACH state + if (!(port->sm_vars & AD_PORT_SELECTED)) { // if UNSELECTED + port->sm_vars &= ~AD_PORT_READY_N; + // in order to withhold the Selection Logic to check all ports READY_N value + // every callback cycle to update ready variable, we check READY_N and update READY here + __set_agg_ports_ready(port->aggregator, __agg_ports_are_ready(port->aggregator)); + port->sm_mux_state = AD_MUX_DETACHED; // next state + break; + } + + // check if the wait_while_timer expired + if (port->sm_mux_timer_counter && !(--port->sm_mux_timer_counter)) { + port->sm_vars |= AD_PORT_READY_N; + } + + // in order to withhold the selection logic to check all ports READY_N value + // every callback cycle to update ready variable, we check READY_N and update READY here + __set_agg_ports_ready(port->aggregator, __agg_ports_are_ready(port->aggregator)); + + // if the wait_while_timer expired, and the port is in READY state, move to ATTACHED state + if ((port->sm_vars & AD_PORT_READY) && !port->sm_mux_timer_counter) { + port->sm_mux_state = AD_MUX_ATTACHED; // next state + 
} + break; + case AD_MUX_ATTACHED: + // check also if agg_select_timer expired(so that enabling the port takes place only after this timer) + if ((port->sm_vars & AD_PORT_SELECTED) && (port->partner_oper_port_state & AD_STATE_SYNCHRONIZATION) && !__check_agg_selection_timer(port)) { + port->sm_mux_state = AD_MUX_COLLECTING_DISTRIBUTING;// next state + } else if (!(port->sm_vars & AD_PORT_SELECTED) || (port->sm_vars & AD_PORT_STANDBY)) { // if UNSELECTED or STANDBY + port->sm_vars &= ~AD_PORT_READY_N; + // in order to withhold the selection logic to check all ports READY_N value + // every callback cycle to update ready variable, we check READY_N and update READY here + __set_agg_ports_ready(port->aggregator, __agg_ports_are_ready(port->aggregator)); + port->sm_mux_state = AD_MUX_DETACHED;// next state + } + break; + case AD_MUX_COLLECTING_DISTRIBUTING: + if (!(port->sm_vars & AD_PORT_SELECTED) || (port->sm_vars & AD_PORT_STANDBY) || + !(port->partner_oper_port_state & AD_STATE_SYNCHRONIZATION) + ) { + port->sm_mux_state = AD_MUX_ATTACHED;// next state + + } else { + // if port state hasn't changed make + // sure that a collecting distributing + // port in an active aggregator is enabled + if (port->aggregator && + port->aggregator->is_active && + !__port_is_enabled(port)) { + + __enable_port(port); + } + } + break; + default: //to silence the compiler + break; + } + } + + // check if the state machine was changed + if (port->sm_mux_state != last_state) { + BOND_PRINT_DBG(("Mux Machine: Port=%d, Last State=%d, Curr State=%d", port->actor_port_number, last_state, port->sm_mux_state)); + switch (port->sm_mux_state) { + case AD_MUX_DETACHED: + __detach_bond_from_agg(port); + port->actor_oper_port_state &= ~AD_STATE_SYNCHRONIZATION; + ad_disable_collecting_distributing(port); + port->actor_oper_port_state &= ~AD_STATE_COLLECTING; + port->actor_oper_port_state &= ~AD_STATE_DISTRIBUTING; + port->ntt = 1; + break; + case AD_MUX_WAITING: + port->sm_mux_timer_counter = 
__ad_timer_to_ticks(AD_WAIT_WHILE_TIMER, 0); + break; + case AD_MUX_ATTACHED: + __attach_bond_to_agg(port); + port->actor_oper_port_state |= AD_STATE_SYNCHRONIZATION; + port->actor_oper_port_state &= ~AD_STATE_COLLECTING; + port->actor_oper_port_state &= ~AD_STATE_DISTRIBUTING; + ad_disable_collecting_distributing(port); + port->ntt = 1; + break; + case AD_MUX_COLLECTING_DISTRIBUTING: + port->actor_oper_port_state |= AD_STATE_COLLECTING; + port->actor_oper_port_state |= AD_STATE_DISTRIBUTING; + ad_enable_collecting_distributing(port); + port->ntt = 1; + break; + default: //to silence the compiler + break; + } + } +} + +/** + * ad_rx_machine - handle a port's rx State Machine + * @lacpdu: the lacpdu we've received + * @port: the port we're looking at + * + * If lacpdu arrived, stop previous timer (if exists) and set the next state as + * CURRENT. If timer expired set the state machine in the proper state. + * In other cases, this function checks if we need to switch to other state. + */ +static void ad_rx_machine(struct lacpdu *lacpdu, struct port *port) +{ + rx_states_t last_state; + + // Lock to prevent 2 instances of this function to run simultaneously(rx interrupt and periodic machine callback) + __get_rx_machine_lock(port); + + // keep current State Machine state to compare later if it was changed + last_state = port->sm_rx_state; + + // check if state machine should change state + // first, check if port was reinitialized + if (port->sm_vars & AD_PORT_BEGIN) { + port->sm_rx_state = AD_RX_INITIALIZE; // next state + } + // check if port is not enabled + else if (!(port->sm_vars & AD_PORT_BEGIN) && !port->is_enabled && !(port->sm_vars & AD_PORT_MOVED)) { + port->sm_rx_state = AD_RX_PORT_DISABLED; // next state + } + // check if new lacpdu arrived + else if (lacpdu && ((port->sm_rx_state == AD_RX_EXPIRED) || (port->sm_rx_state == AD_RX_DEFAULTED) || (port->sm_rx_state == AD_RX_CURRENT))) { + port->sm_rx_timer_counter = 0; // zero timer + port->sm_rx_state = 
AD_RX_CURRENT; + } else { + // if timer is on, and if it is expired + if (port->sm_rx_timer_counter && !(--port->sm_rx_timer_counter)) { + switch (port->sm_rx_state) { + case AD_RX_EXPIRED: + port->sm_rx_state = AD_RX_DEFAULTED; // next state + break; + case AD_RX_CURRENT: + port->sm_rx_state = AD_RX_EXPIRED; // next state + break; + default: //to silence the compiler + break; + } + } else { + // if no lacpdu arrived and no timer is on + switch (port->sm_rx_state) { + case AD_RX_PORT_DISABLED: + if (port->sm_vars & AD_PORT_MOVED) { + port->sm_rx_state = AD_RX_INITIALIZE; // next state + } else if (port->is_enabled && (port->sm_vars & AD_PORT_LACP_ENABLED)) { + port->sm_rx_state = AD_RX_EXPIRED; // next state + } else if (port->is_enabled && ((port->sm_vars & AD_PORT_LACP_ENABLED) == 0)) { + port->sm_rx_state = AD_RX_LACP_DISABLED; // next state + } + break; + default: //to silence the compiler + break; + + } + } + } + + // check if the State machine was changed or new lacpdu arrived + if ((port->sm_rx_state != last_state) || (lacpdu)) { + BOND_PRINT_DBG(("Rx Machine: Port=%d, Last State=%d, Curr State=%d", port->actor_port_number, last_state, port->sm_rx_state)); + switch (port->sm_rx_state) { + case AD_RX_INITIALIZE: + if (!(port->actor_oper_port_key & AD_DUPLEX_KEY_BITS)) { + port->sm_vars &= ~AD_PORT_LACP_ENABLED; + } else { + port->sm_vars |= AD_PORT_LACP_ENABLED; + } + port->sm_vars &= ~AD_PORT_SELECTED; + __record_default(port); + port->actor_oper_port_state &= ~AD_STATE_EXPIRED; + port->sm_vars &= ~AD_PORT_MOVED; + port->sm_rx_state = AD_RX_PORT_DISABLED; // next state + + /*- Fall Through -*/ + + case AD_RX_PORT_DISABLED: + port->sm_vars &= ~AD_PORT_MATCHED; + break; + case AD_RX_LACP_DISABLED: + port->sm_vars &= ~AD_PORT_SELECTED; + __record_default(port); + port->partner_oper_port_state &= ~AD_STATE_AGGREGATION; + port->sm_vars |= AD_PORT_MATCHED; + port->actor_oper_port_state &= ~AD_STATE_EXPIRED; + break; + case AD_RX_EXPIRED: + //Reset of the 
Synchronization flag. (Standard 43.4.12) + //This reset cause to disable this port in the COLLECTING_DISTRIBUTING state of the + //mux machine in case of EXPIRED even if LINK_DOWN didn't arrive for the port. + port->partner_oper_port_state &= ~AD_STATE_SYNCHRONIZATION; + port->sm_vars &= ~AD_PORT_MATCHED; + port->partner_oper_port_state |= AD_SHORT_TIMEOUT; + port->sm_rx_timer_counter = __ad_timer_to_ticks(AD_CURRENT_WHILE_TIMER, (u16)(AD_SHORT_TIMEOUT)); + port->actor_oper_port_state |= AD_STATE_EXPIRED; + break; + case AD_RX_DEFAULTED: + __update_default_selected(port); + __record_default(port); + port->sm_vars |= AD_PORT_MATCHED; + port->actor_oper_port_state &= ~AD_STATE_EXPIRED; + break; + case AD_RX_CURRENT: + // detect loopback situation + if (!MAC_ADDRESS_COMPARE(&(lacpdu->actor_system), &(port->actor_system))) { + // INFO_RECEIVED_LOOPBACK_FRAMES + printk(KERN_ERR "bonding: An illegal loopback occurred on adapter (%s)\n", + port->slave->dev->name); + printk(KERN_ERR "Check the configuration to verify that all Adapters " + "are connected to 802.3ad compliant switch ports\n"); + __release_rx_machine_lock(port); + return; + } + __update_selected(lacpdu, port); + __update_ntt(lacpdu, port); + __record_pdu(lacpdu, port); + __choose_matched(lacpdu, port); + port->sm_rx_timer_counter = __ad_timer_to_ticks(AD_CURRENT_WHILE_TIMER, (u16)(port->actor_oper_port_state & AD_STATE_LACP_TIMEOUT)); + port->actor_oper_port_state &= ~AD_STATE_EXPIRED; + // verify that if the aggregator is enabled, the port is enabled too. 
+ //(because if the link goes down for a short time, the 802.3ad will not + // catch it, and the port will continue to be disabled) + if (port->aggregator && port->aggregator->is_active && !__port_is_enabled(port)) { + __enable_port(port); + } + break; + default: //to silence the compiler + break; + } + } + __release_rx_machine_lock(port); +} + +/** + * ad_tx_machine - handle a port's tx state machine + * @port: the port we're looking at + * + */ +static void ad_tx_machine(struct port *port) +{ + struct lacpdu *lacpdu = &port->lacpdu; + + // check if tx timer expired, to verify that we do not send more than 3 packets per second + if (port->sm_tx_timer_counter && !(--port->sm_tx_timer_counter)) { + // check if there is something to send + if (port->ntt && (port->sm_vars & AD_PORT_LACP_ENABLED)) { + //update current actual Actor parameters + //lacpdu->subtype initialized + //lacpdu->version_number initialized + //lacpdu->tlv_type_actor_info initialized + //lacpdu->actor_information_length initialized + lacpdu->actor_system_priority = port->actor_system_priority; + lacpdu->actor_system = port->actor_system; + lacpdu->actor_key = port->actor_oper_port_key; + lacpdu->actor_port_priority = port->actor_port_priority; + lacpdu->actor_port = port->actor_port_number; + lacpdu->actor_state = port->actor_oper_port_state; + //lacpdu->reserved_3_1 initialized + //lacpdu->tlv_type_partner_info initialized + //lacpdu->partner_information_length initialized + lacpdu->partner_system_priority = port->partner_oper_system_priority; + lacpdu->partner_system = port->partner_oper_system; + lacpdu->partner_key = port->partner_oper_key; + lacpdu->partner_port_priority = port->partner_oper_port_priority; + lacpdu->partner_port = port->partner_oper_port_number; + lacpdu->partner_state = port->partner_oper_port_state; + //lacpdu->reserved_3_2 initialized + //lacpdu->tlv_type_collector_info initialized + //lacpdu->collector_information_length initialized + //collector_max_delay initialized + 
//reserved_12[12] initialized + //tlv_type_terminator initialized + //terminator_length initialized + //reserved_50[50] initialized + + // We need to convert all non u8 parameters to Big Endian for transmit + __ntohs_lacpdu(lacpdu); + // send the lacpdu + if (ad_lacpdu_send(port) >= 0) { + BOND_PRINT_DBG(("Sent LACPDU on port %d", port->actor_port_number)); + // mark ntt as false, so it will not be sent again until demanded + port->ntt = 0; + } + } + // restart tx timer(to verify that we will not exceed AD_MAX_TX_IN_SECOND) + port->sm_tx_timer_counter=ad_ticks_per_sec/AD_MAX_TX_IN_SECOND; + } +} + +/** + * ad_periodic_machine - handle a port's periodic state machine + * @port: the port we're looking at + * + * Turn ntt flag on periodically to perform periodic transmission of lacpdu's. + */ +static void ad_periodic_machine(struct port *port) +{ + periodic_states_t last_state; + + // keep current state machine state to compare later if it was changed + last_state = port->sm_periodic_state; + + // check if port was reinitialized + if (((port->sm_vars & AD_PORT_BEGIN) || !(port->sm_vars & AD_PORT_LACP_ENABLED) || !port->is_enabled) || + (!(port->actor_oper_port_state & AD_STATE_LACP_ACTIVITY) && !(port->partner_oper_port_state & AD_STATE_LACP_ACTIVITY)) + ) { + port->sm_periodic_state = AD_NO_PERIODIC; // next state + } + // check if state machine should change state + else if (port->sm_periodic_timer_counter) { + // check if periodic state machine expired + if (!(--port->sm_periodic_timer_counter)) { + // if expired then do tx + port->sm_periodic_state = AD_PERIODIC_TX; // next state + } else { + // If not expired, check if there is some new timeout parameter from the partner state + switch (port->sm_periodic_state) { + case AD_FAST_PERIODIC: + if (!(port->partner_oper_port_state & AD_STATE_LACP_TIMEOUT)) { + port->sm_periodic_state = AD_SLOW_PERIODIC; // next state + } + break; + case AD_SLOW_PERIODIC: + if ((port->partner_oper_port_state & AD_STATE_LACP_TIMEOUT)) { + 
// stop current timer + port->sm_periodic_timer_counter = 0; + port->sm_periodic_state = AD_PERIODIC_TX; // next state + } + break; + default: //to silence the compiler + break; + } + } + } else { + switch (port->sm_periodic_state) { + case AD_NO_PERIODIC: + port->sm_periodic_state = AD_FAST_PERIODIC; // next state + break; + case AD_PERIODIC_TX: + if (!(port->partner_oper_port_state & AD_STATE_LACP_TIMEOUT)) { + port->sm_periodic_state = AD_SLOW_PERIODIC; // next state + } else { + port->sm_periodic_state = AD_FAST_PERIODIC; // next state + } + break; + default: //to silence the compiler + break; + } + } + + // check if the state machine was changed + if (port->sm_periodic_state != last_state) { + BOND_PRINT_DBG(("Periodic Machine: Port=%d, Last State=%d, Curr State=%d", port->actor_port_number, last_state, port->sm_periodic_state)); + switch (port->sm_periodic_state) { + case AD_NO_PERIODIC: + port->sm_periodic_timer_counter = 0; // zero timer + break; + case AD_FAST_PERIODIC: + port->sm_periodic_timer_counter = __ad_timer_to_ticks(AD_PERIODIC_TIMER, (u16)(AD_FAST_PERIODIC_TIME))-1; // decrement 1 tick we lost in the PERIODIC_TX cycle + break; + case AD_SLOW_PERIODIC: + port->sm_periodic_timer_counter = __ad_timer_to_ticks(AD_PERIODIC_TIMER, (u16)(AD_SLOW_PERIODIC_TIME))-1; // decrement 1 tick we lost in the PERIODIC_TX cycle + break; + case AD_PERIODIC_TX: + port->ntt = 1; + break; + default: //to silence the compiler + break; + } + } +} + +/** + * ad_port_selection_logic - select aggregation groups + * @port: the port we're looking at + * + * Select aggregation groups, and assign each port to its aggregator. The + * selection logic is called in the initialization (after all the handshakes), + * and after every lacpdu receive (if selected is off). 
+ */ +static void ad_port_selection_logic(struct port *port) +{ + struct aggregator *aggregator, *free_aggregator = NULL, *temp_aggregator; + struct port *last_port = NULL, *curr_port; + int found = 0; + + // if the port is already Selected, do nothing + if (port->sm_vars & AD_PORT_SELECTED) { + return; + } + + // if the port is connected to other aggregator, detach it + if (port->aggregator) { + // detach the port from its former aggregator + temp_aggregator=port->aggregator; + for (curr_port=temp_aggregator->lag_ports; curr_port; last_port=curr_port, curr_port=curr_port->next_port_in_aggregator) { + if (curr_port == port) { + temp_aggregator->num_of_ports--; + if (!last_port) {// if it is the first port attached to the aggregator + temp_aggregator->lag_ports=port->next_port_in_aggregator; + } else {// not the first port attached to the aggregator + last_port->next_port_in_aggregator=port->next_port_in_aggregator; + } + + // clear the port's relations to this aggregator + port->aggregator = NULL; + port->next_port_in_aggregator=NULL; + port->actor_port_aggregator_identifier=0; + + BOND_PRINT_DBG(("Port %d left LAG %d", port->actor_port_number, temp_aggregator->aggregator_identifier)); + // if the aggregator is empty, clear its parameters, and set it ready to be attached + if (!temp_aggregator->lag_ports) { + ad_clear_agg(temp_aggregator); + } + break; + } + } + if (!curr_port) { // meaning: the port was related to an aggregator but was not on the aggregator port list + printk(KERN_WARNING "bonding: Warning: Port %d (on %s) was " + "related to aggregator %d but was not on its port list\n", + port->actor_port_number, port->slave->dev->name, + port->aggregator->aggregator_identifier); + } + } + // search on all aggregators for a suitable aggregator for this port + for (aggregator = __get_first_agg(port); aggregator; + aggregator = __get_next_agg(aggregator)) { + + // keep a free aggregator for later use(if needed) + if (!aggregator->lag_ports) { + if 
(!free_aggregator) { + free_aggregator=aggregator; + } + continue; + } + // check if current aggregator suits us + if (((aggregator->actor_oper_aggregator_key == port->actor_oper_port_key) && // if all parameters match AND + !MAC_ADDRESS_COMPARE(&(aggregator->partner_system), &(port->partner_oper_system)) && + (aggregator->partner_system_priority == port->partner_oper_system_priority) && + (aggregator->partner_oper_aggregator_key == port->partner_oper_key) + ) && + ((MAC_ADDRESS_COMPARE(&(port->partner_oper_system), &(null_mac_addr)) && // partner answers + !aggregator->is_individual) // but is not individual OR + ) + ) { + // attach to the founded aggregator + port->aggregator = aggregator; + port->actor_port_aggregator_identifier=port->aggregator->aggregator_identifier; + port->next_port_in_aggregator=aggregator->lag_ports; + port->aggregator->num_of_ports++; + aggregator->lag_ports=port; + BOND_PRINT_DBG(("Port %d joined LAG %d(existing LAG)", port->actor_port_number, port->aggregator->aggregator_identifier)); + + // mark this port as selected + port->sm_vars |= AD_PORT_SELECTED; + found = 1; + break; + } + } + + // the port couldn't find an aggregator - attach it to a new aggregator + if (!found) { + if (free_aggregator) { + // assign port a new aggregator + port->aggregator = free_aggregator; + port->actor_port_aggregator_identifier=port->aggregator->aggregator_identifier; + + // update the new aggregator's parameters + // if port was responsed from the end-user + if (port->actor_oper_port_key & AD_DUPLEX_KEY_BITS) {// if port is full duplex + port->aggregator->is_individual = 0; + } else { + port->aggregator->is_individual = 1; + } + + port->aggregator->actor_admin_aggregator_key = port->actor_admin_port_key; + port->aggregator->actor_oper_aggregator_key = port->actor_oper_port_key; + port->aggregator->partner_system=port->partner_oper_system; + port->aggregator->partner_system_priority = port->partner_oper_system_priority; + 
port->aggregator->partner_oper_aggregator_key = port->partner_oper_key; + port->aggregator->receive_state = 1; + port->aggregator->transmit_state = 1; + port->aggregator->lag_ports = port; + port->aggregator->num_of_ports++; + + // mark this port as selected + port->sm_vars |= AD_PORT_SELECTED; + + BOND_PRINT_DBG(("Port %d joined LAG %d(new LAG)", port->actor_port_number, port->aggregator->aggregator_identifier)); + } else { + printk(KERN_ERR "bonding: Port %d (on %s) did not find a suitable aggregator\n", + port->actor_port_number, port->slave->dev->name); + } + } + // if all aggregator's ports are READY_N == TRUE, set ready=TRUE in all aggregator's ports + // else set ready=FALSE in all aggregator's ports + __set_agg_ports_ready(port->aggregator, __agg_ports_are_ready(port->aggregator)); + + if (!__check_agg_selection_timer(port) && (aggregator = __get_first_agg(port))) { + ad_agg_selection_logic(aggregator); + } +} + +/** + * ad_agg_selection_logic - select an aggregation group for a team + * @aggregator: the aggregator we're looking at + * + * It is assumed that only one aggregator may be selected for a team. + * The logic of this function is to select (at first time) the aggregator with + * the most ports attached to it, and to reselect the active aggregator only if + * the previous aggregator has no more ports related to it. + * + * FIXME: this function MUST be called with the first agg in the bond, or + * __get_active_agg() won't work correctly. This function should be better + * called with the bond itself, and retrieve the first agg from it. 
+ */ +static void ad_agg_selection_logic(struct aggregator *aggregator) +{ + struct aggregator *best_aggregator = NULL, *active_aggregator = NULL; + struct aggregator *last_active_aggregator = NULL, *origin_aggregator; + struct port *port; + u16 num_of_aggs=0; + + origin_aggregator = aggregator; + + //get current active aggregator + last_active_aggregator = __get_active_agg(aggregator); + + // search for the aggregator with the most ports attached to it. + do { + // count how many candidate lag's we have + if (aggregator->lag_ports) { + num_of_aggs++; + } + if (aggregator->is_active && !aggregator->is_individual && // if current aggregator is the active aggregator + MAC_ADDRESS_COMPARE(&(aggregator->partner_system), &(null_mac_addr))) { // and partner answers to 802.3ad PDUs + if (aggregator->num_of_ports) { // if any ports attached to the current aggregator + best_aggregator=NULL; // disregard the best aggregator that was chosen by now + break; // stop the selection of other aggregator if there are any ports attached to this active aggregator + } else { // no ports attached to this active aggregator + aggregator->is_active = 0; // mark this aggregator as not active anymore + } + } + if (aggregator->num_of_ports) { // if any ports attached + if (best_aggregator) { // if there is a candidte aggregator + //The reasons for choosing new best aggregator: + // 1. if current agg is NOT individual and the best agg chosen so far is individual OR + // current and best aggs are both individual or both not individual, AND + // 2a. current agg partner reply but best agg partner do not reply OR + // 2b. 
current agg partner reply OR current agg partner do not reply AND best agg partner also do not reply AND + // current has more ports/bandwidth, or same amount of ports but current has faster ports, THEN + // current agg become best agg so far + + //if current agg is NOT individual and the best agg chosen so far is individual change best_aggregator + if (!aggregator->is_individual && best_aggregator->is_individual) { + best_aggregator=aggregator; + } + // current and best aggs are both individual or both not individual + else if ((aggregator->is_individual && best_aggregator->is_individual) || + (!aggregator->is_individual && !best_aggregator->is_individual)) { + // current and best aggs are both individual or both not individual AND + // current agg partner reply but best agg partner do not reply + if ((MAC_ADDRESS_COMPARE(&(aggregator->partner_system), &(null_mac_addr)) && + !MAC_ADDRESS_COMPARE(&(best_aggregator->partner_system), &(null_mac_addr)))) { + best_aggregator=aggregator; + } + // current agg partner reply OR current agg partner do not reply AND best agg partner also do not reply + else if (! 
(!MAC_ADDRESS_COMPARE(&(aggregator->partner_system), &(null_mac_addr)) && + MAC_ADDRESS_COMPARE(&(best_aggregator->partner_system), &(null_mac_addr)))) { + if ((__get_agg_selection_mode(aggregator->lag_ports) == AD_BANDWIDTH)&& + (__get_agg_bandwidth(aggregator) > __get_agg_bandwidth(best_aggregator))) { + best_aggregator=aggregator; + } else if (__get_agg_selection_mode(aggregator->lag_ports) == AD_COUNT) { + if (((aggregator->num_of_ports > best_aggregator->num_of_ports) && + (aggregator->actor_oper_aggregator_key & AD_SPEED_KEY_BITS))|| + ((aggregator->num_of_ports == best_aggregator->num_of_ports) && + ((u16)(aggregator->actor_oper_aggregator_key & AD_SPEED_KEY_BITS) > + (u16)(best_aggregator->actor_oper_aggregator_key & AD_SPEED_KEY_BITS)))) { + best_aggregator=aggregator; + } + } + } + } + } else { + best_aggregator=aggregator; + } + } + aggregator->is_active = 0; // mark all aggregators as not active anymore + } while ((aggregator = __get_next_agg(aggregator))); + + // if we have new aggregator selected, don't replace the old aggregator if it has an answering partner, + // or if both old aggregator and new aggregator don't have answering partner + if (best_aggregator) { + if (last_active_aggregator && last_active_aggregator->lag_ports && last_active_aggregator->lag_ports->is_enabled && + (MAC_ADDRESS_COMPARE(&(last_active_aggregator->partner_system), &(null_mac_addr)) || // partner answers OR + (!MAC_ADDRESS_COMPARE(&(last_active_aggregator->partner_system), &(null_mac_addr)) && // both old and new + !MAC_ADDRESS_COMPARE(&(best_aggregator->partner_system), &(null_mac_addr)))) // partner do not answer + ) { + // if new aggregator has link, and old aggregator does not, replace old aggregator.(do nothing) + // -> don't replace otherwise. 
+ if (!(!last_active_aggregator->actor_oper_aggregator_key && best_aggregator->actor_oper_aggregator_key)) { + best_aggregator=NULL; + last_active_aggregator->is_active = 1; // don't replace good old aggregator + + } + } + } + + // if there is new best aggregator, activate it + if (best_aggregator) { + for (aggregator = __get_first_agg(best_aggregator->lag_ports); + aggregator; + aggregator = __get_next_agg(aggregator)) { + + BOND_PRINT_DBG(("Agg=%d; Ports=%d; a key=%d; p key=%d; Indiv=%d; Active=%d", + aggregator->aggregator_identifier, aggregator->num_of_ports, + aggregator->actor_oper_aggregator_key, aggregator->partner_oper_aggregator_key, + aggregator->is_individual, aggregator->is_active)); + } + + // check if any partner replys + if (best_aggregator->is_individual) { + printk(KERN_WARNING "bonding: Warning: No 802.3ad response from the link partner " + "for any adapters in the bond\n"); + } + + // check if there are more than one aggregator + if (num_of_aggs > 1) { + BOND_PRINT_DBG(("Warning: More than one Link Aggregation Group was " + "found in the bond. 
Only one group will function in the bond")); + } + + best_aggregator->is_active = 1; + BOND_PRINT_DBG(("LAG %d choosed as the active LAG", best_aggregator->aggregator_identifier)); + BOND_PRINT_DBG(("Agg=%d; Ports=%d; a key=%d; p key=%d; Indiv=%d; Active=%d", + best_aggregator->aggregator_identifier, best_aggregator->num_of_ports, + best_aggregator->actor_oper_aggregator_key, best_aggregator->partner_oper_aggregator_key, + best_aggregator->is_individual, best_aggregator->is_active)); + + // disable the ports that were related to the former active_aggregator + if (last_active_aggregator) { + for (port=last_active_aggregator->lag_ports; port; port=port->next_port_in_aggregator) { + __disable_port(port); + } + } + } + + // if the selected aggregator is of join individuals(partner_system is NULL), enable their ports + active_aggregator = __get_active_agg(origin_aggregator); + + if (active_aggregator) { + if (!MAC_ADDRESS_COMPARE(&(active_aggregator->partner_system), &(null_mac_addr))) { + for (port=active_aggregator->lag_ports; port; port=port->next_port_in_aggregator) { + __enable_port(port); + } + } + } +} + +/** + * ad_clear_agg - clear a given aggregator's parameters + * @aggregator: the aggregator we're looking at + * + */ +static void ad_clear_agg(struct aggregator *aggregator) +{ + if (aggregator) { + aggregator->is_individual = 0; + aggregator->actor_admin_aggregator_key = 0; + aggregator->actor_oper_aggregator_key = 0; + aggregator->partner_system = null_mac_addr; + aggregator->partner_system_priority = 0; + aggregator->partner_oper_aggregator_key = 0; + aggregator->receive_state = 0; + aggregator->transmit_state = 0; + aggregator->lag_ports = NULL; + aggregator->is_active = 0; + aggregator->num_of_ports = 0; + BOND_PRINT_DBG(("LAG %d was cleared", aggregator->aggregator_identifier)); + } +} + +/** + * ad_initialize_agg - initialize a given aggregator's parameters + * @aggregator: the aggregator we're looking at + * + */ +static void ad_initialize_agg(struct 
aggregator *aggregator) +{ + if (aggregator) { + ad_clear_agg(aggregator); + + aggregator->aggregator_mac_address = null_mac_addr; + aggregator->aggregator_identifier = 0; + aggregator->slave = NULL; + } +} + +/** + * ad_initialize_port - initialize a given port's parameters + * @port: the port we're looking at + * + */ +static void ad_initialize_port(struct port *port) +{ + if (port) { + port->actor_port_number = 1; + port->actor_port_priority = 0xff; + port->actor_system = null_mac_addr; + port->actor_system_priority = 0xffff; + port->actor_port_aggregator_identifier = 0; + port->ntt = 0; + port->actor_admin_port_key = 1; + port->actor_oper_port_key = 1; + port->actor_admin_port_state = AD_STATE_AGGREGATION | AD_STATE_LACP_ACTIVITY; + port->actor_oper_port_state = AD_STATE_AGGREGATION | AD_STATE_LACP_ACTIVITY; + port->partner_admin_system = null_mac_addr; + port->partner_oper_system = null_mac_addr; + port->partner_admin_system_priority = 0xffff; + port->partner_oper_system_priority = 0xffff; + port->partner_admin_key = 1; + port->partner_oper_key = 1; + port->partner_admin_port_number = 1; + port->partner_oper_port_number = 1; + port->partner_admin_port_priority = 0xff; + port->partner_oper_port_priority = 0xff; + port->partner_admin_port_state = 1; + port->partner_oper_port_state = 1; + port->is_enabled = 1; + // ****** private parameters ****** + port->sm_vars = 0x3; + port->sm_rx_state = 0; + port->sm_rx_timer_counter = 0; + port->sm_periodic_state = 0; + port->sm_periodic_timer_counter = 0; + port->sm_mux_state = 0; + port->sm_mux_timer_counter = 0; + port->sm_tx_state = 0; + port->sm_tx_timer_counter = 0; + port->slave = NULL; + port->aggregator = NULL; + port->next_port_in_aggregator = NULL; + port->transaction_id = 0; + + ad_initialize_lacpdu(&(port->lacpdu)); + } +} + +/** + * ad_enable_collecting_distributing - enable a port's transmit/receive + * @port: the port we're looking at + * + * Enable @port if it's in an active aggregator + */ 
+static void ad_enable_collecting_distributing(struct port *port) +{ + if (port->aggregator->is_active) { + BOND_PRINT_DBG(("Enabling port %d(LAG %d)", port->actor_port_number, port->aggregator->aggregator_identifier)); + __enable_port(port); + } +} + +/** + * ad_disable_collecting_distributing - disable a port's transmit/receive + * @port: the port we're looking at + * + */ +static void ad_disable_collecting_distributing(struct port *port) +{ + if (port->aggregator && MAC_ADDRESS_COMPARE(&(port->aggregator->partner_system), &(null_mac_addr))) { + BOND_PRINT_DBG(("Disabling port %d(LAG %d)", port->actor_port_number, port->aggregator->aggregator_identifier)); + __disable_port(port); + } +} + +#if 0 +/** + * ad_marker_info_send - send a marker information frame + * @port: the port we're looking at + * + * This function does nothing since we decided not to implement send and handle + * response for marker PDU's, in this stage, but only to respond to marker + * information. + */ +static void ad_marker_info_send(struct port *port) +{ + struct marker marker; + u16 index; + + // fill the marker PDU with the appropriate values + marker.subtype = 0x02; + marker.version_number = 0x01; + marker.tlv_type = AD_MARKER_INFORMATION_SUBTYPE; + marker.marker_length = 0x16; + // convert requester_port to Big Endian + marker.requester_port = (((port->actor_port_number & 0xFF) << 8) |((u16)(port->actor_port_number & 0xFF00) >> 8)); + marker.requester_system = port->actor_system; + // convert requester_transaction_id(u32) to Big Endian + marker.requester_transaction_id = (((++port->transaction_id & 0xFF) << 24) |((port->transaction_id & 0xFF00) << 8) |((port->transaction_id & 0xFF0000) >> 8) |((port->transaction_id & 0xFF000000) >> 24)); + marker.pad = 0; + marker.tlv_type_terminator = 0x00; + marker.terminator_length = 0x00; + for (index=0; index<90; index++) { + marker.reserved_90[index]=0; + } + + // send the marker information + if (ad_marker_send(port, &marker) >= 0) { + 
BOND_PRINT_DBG(("Sent Marker Information on port %d", port->actor_port_number)); + } +} +#endif + +/** + * ad_marker_info_received - handle receive of a Marker information frame + * @marker_info: Marker info received + * @port: the port we're looking at + * + */ +static void ad_marker_info_received(struct marker *marker_info,struct port *port) +{ + struct marker marker; + + // copy the received marker data to the response marker + //marker = *marker_info; + memcpy(&marker, marker_info, sizeof(struct marker)); + // change the marker subtype to marker response + marker.tlv_type=AD_MARKER_RESPONSE_SUBTYPE; + // send the marker response + + if (ad_marker_send(port, &marker) >= 0) { + BOND_PRINT_DBG(("Sent Marker Response on port %d", port->actor_port_number)); + } +} + +/** + * ad_marker_response_received - handle receive of a marker response frame + * @marker: marker PDU received + * @port: the port we're looking at + * + * This function does nothing since we decided not to implement send and handle + * response for marker PDU's, in this stage, but only to respond to marker + * information. 
+ */ +static void ad_marker_response_received(struct marker *marker, struct port *port) +{ + marker=NULL; // just to satisfy the compiler + port=NULL; // just to satisfy the compiler + // DO NOTHING, SINCE WE DECIDED NOT TO IMPLEMENT THIS FEATURE FOR NOW +} + +/** + * ad_initialize_lacpdu - initialize a given lacpdu structure + * @lacpdu: lacpdu structure to initialize + * + */ +static void ad_initialize_lacpdu(struct lacpdu *lacpdu) +{ + u16 index; + + // initialize lacpdu data + lacpdu->subtype = 0x01; + lacpdu->version_number = 0x01; + lacpdu->tlv_type_actor_info = 0x01; + lacpdu->actor_information_length = 0x14; + // lacpdu->actor_system_priority updated on send + // lacpdu->actor_system updated on send + // lacpdu->actor_key updated on send + // lacpdu->actor_port_priority updated on send + // lacpdu->actor_port updated on send + // lacpdu->actor_state updated on send + lacpdu->tlv_type_partner_info = 0x02; + lacpdu->partner_information_length = 0x14; + for (index=0; index<=2; index++) { + lacpdu->reserved_3_1[index]=0; + } + // lacpdu->partner_system_priority updated on send + // lacpdu->partner_system updated on send + // lacpdu->partner_key updated on send + // lacpdu->partner_port_priority updated on send + // lacpdu->partner_port updated on send + // lacpdu->partner_state updated on send + for (index=0; index<=2; index++) { + lacpdu->reserved_3_2[index]=0; + } + lacpdu->tlv_type_collector_info = 0x03; + lacpdu->collector_information_length= 0x10; + lacpdu->collector_max_delay = AD_COLLECTOR_MAX_DELAY; + for (index=0; index<=11; index++) { + lacpdu->reserved_12[index]=0; + } + lacpdu->tlv_type_terminator = 0x00; + lacpdu->terminator_length = 0; + for (index=0; index<=49; index++) { + lacpdu->reserved_50[index]=0; + } +} + +////////////////////////////////////////////////////////////////////////////////////// +// ================= AD exported functions to the main bonding code ================== 
+////////////////////////////////////////////////////////////////////////////////////// + +// Check aggregators status in team every T seconds +#define AD_AGGREGATOR_SELECTION_TIMER 8 + +static u16 aggregator_identifier; + +/** + * bond_3ad_initialize - initialize a bond's 802.3ad parameters and structures + * @bond: bonding struct to work on + * @tick_resolution: tick duration (millisecond resolution) + * + * Can be called only after the mac address of the bond is set. + */ +void bond_3ad_initialize(struct bonding *bond, u16 tick_resolution) +{ + // check that the bond is not initialized yet + if (MAC_ADDRESS_COMPARE(&(BOND_AD_INFO(bond).system.sys_mac_addr), &(bond->device->dev_addr))) { + + aggregator_identifier = 0; + + BOND_AD_INFO(bond).system.sys_priority = 0xFFFF; + BOND_AD_INFO(bond).system.sys_mac_addr = *((struct mac_addr *)bond->device->dev_addr); + + // initialize how many times this module is called in one second(should be about every 100ms) + ad_ticks_per_sec = tick_resolution; + + // initialize the aggregator selection timer(to activate an aggregation selection after initialize) + BOND_AD_INFO(bond).agg_select_timer = (AD_AGGREGATOR_SELECTION_TIMER * ad_ticks_per_sec); + BOND_AD_INFO(bond).agg_select_mode = AD_BANDWIDTH; + } +} + +/** + * bond_3ad_bind_slave - initialize a slave's port + * @slave: slave struct to work on + * + * Returns: 0 on success + * < 0 on error + */ +int bond_3ad_bind_slave(struct slave *slave) +{ + struct bonding *bond = bond_get_bond_by_slave(slave); + struct port *port; + struct aggregator *aggregator; + + if (bond == NULL) { + printk(KERN_CRIT "The slave %s is not attached to its bond\n", slave->dev->name); + return -1; + } + + //check that the slave has not been initialized yet.
+ if (SLAVE_AD_INFO(slave).port.slave != slave) { + + // port initialization + port = &(SLAVE_AD_INFO(slave).port); + + ad_initialize_port(port); + + port->slave = slave; + port->actor_port_number = SLAVE_AD_INFO(slave).id; + // key is determined according to the link speed, duplex and user key(which is yet not supported) + // ------------------------------------------------------------ + // Port key : | User key | Speed |Duplex| + // ------------------------------------------------------------ + // 16 6 1 0 + port->actor_admin_port_key = 0; // initialize this parameter + port->actor_admin_port_key |= __get_duplex(port); + port->actor_admin_port_key |= (__get_link_speed(port) << 1); + port->actor_oper_port_key = port->actor_admin_port_key; + // if the port is not full duplex, then the port should be not lacp Enabled + if (!(port->actor_oper_port_key & AD_DUPLEX_KEY_BITS)) { + port->sm_vars &= ~AD_PORT_LACP_ENABLED; + } + // actor system is the bond's system + port->actor_system = BOND_AD_INFO(bond).system.sys_mac_addr; + // tx timer(to verify that no more than MAX_TX_IN_SECOND lacpdu's are sent in one second) + port->sm_tx_timer_counter = ad_ticks_per_sec/AD_MAX_TX_IN_SECOND; + port->aggregator = NULL; + port->next_port_in_aggregator = NULL; + + __disable_port(port); + __initialize_port_locks(port); + + + // aggregator initialization + aggregator = &(SLAVE_AD_INFO(slave).aggregator); + + ad_initialize_agg(aggregator); + + aggregator->aggregator_mac_address = *((struct mac_addr *)bond->device->dev_addr); + aggregator->aggregator_identifier = (++aggregator_identifier); + aggregator->slave = slave; + aggregator->is_active = 0; + aggregator->num_of_ports = 0; + } + + return 0; +} + +/** + * bond_3ad_unbind_slave - deinitialize a slave's port + * @slave: slave struct to work on + * + * Search for the aggregator that is related to this port, remove the + * aggregator and assign another aggregator for other port related to it + * (if any), and remove the port. 
+ */ +void bond_3ad_unbind_slave(struct slave *slave) +{ + struct port *port, *prev_port, *temp_port; + struct aggregator *aggregator, *new_aggregator, *temp_aggregator; + int select_new_active_agg = 0; + + // find the aggregator related to this slave + aggregator = &(SLAVE_AD_INFO(slave).aggregator); + + // find the port related to this slave + port = &(SLAVE_AD_INFO(slave).port); + + // if slave is null, the whole port is not initialized + if (!port->slave) { + printk(KERN_WARNING "bonding: Trying to unbind an uninitialized port on %s\n", slave->dev->name); + return; + } + + bond_3ad_link_status_changed(slave, 0); + + // disable the port + ad_disable_collecting_distributing(port); + + // deinitialize port's locks if necessary(os-specific) + __deinitialize_port_locks(port); + + BOND_PRINT_DBG(("Unbinding Link Aggregation Group %d", aggregator->aggregator_identifier)); + // check if this aggregator is occupied + if (aggregator->lag_ports) { + // check if there are other ports related to this aggregator except + // the port related to this slave(thats ensure us that there is a + // reason to search for new aggregator, and that we will find one + if ((aggregator->lag_ports != port) || (aggregator->lag_ports->next_port_in_aggregator)) { + // find new aggregator for the related port(s) + new_aggregator = __get_first_agg(port); + for (; new_aggregator; new_aggregator = __get_next_agg(new_aggregator)) { + // if the new aggregator is empty, or it connected to to our port only + if (!new_aggregator->lag_ports || ((new_aggregator->lag_ports == port) && !new_aggregator->lag_ports->next_port_in_aggregator)) { + break; + } + } + // if new aggregator found, copy the aggregator's parameters + // and connect the related lag_ports to the new aggregator + if ((new_aggregator) && ((!new_aggregator->lag_ports) || ((new_aggregator->lag_ports == port) && !new_aggregator->lag_ports->next_port_in_aggregator))) { + BOND_PRINT_DBG(("Some port(s) related to LAG %d - replaceing with LAG %d", 
aggregator->aggregator_identifier, new_aggregator->aggregator_identifier)); + + if ((new_aggregator->lag_ports == port) && new_aggregator->is_active) { + printk(KERN_INFO "bonding: Removing an active aggregator\n"); + // select new active aggregator + select_new_active_agg = 1; + } + + new_aggregator->is_individual = aggregator->is_individual; + new_aggregator->actor_admin_aggregator_key = aggregator->actor_admin_aggregator_key; + new_aggregator->actor_oper_aggregator_key = aggregator->actor_oper_aggregator_key; + new_aggregator->partner_system = aggregator->partner_system; + new_aggregator->partner_system_priority = aggregator->partner_system_priority; + new_aggregator->partner_oper_aggregator_key = aggregator->partner_oper_aggregator_key; + new_aggregator->receive_state = aggregator->receive_state; + new_aggregator->transmit_state = aggregator->transmit_state; + new_aggregator->lag_ports = aggregator->lag_ports; + new_aggregator->is_active = aggregator->is_active; + new_aggregator->num_of_ports = aggregator->num_of_ports; + + // update the information that is written on the ports about the aggregator + for (temp_port=aggregator->lag_ports; temp_port; temp_port=temp_port->next_port_in_aggregator) { + temp_port->aggregator=new_aggregator; + temp_port->actor_port_aggregator_identifier = new_aggregator->aggregator_identifier; + } + + // clear the aggregator + ad_clear_agg(aggregator); + + if (select_new_active_agg) { + ad_agg_selection_logic(__get_first_agg(port)); + } + } else { + printk(KERN_WARNING "bonding: Warning: unbinding aggregator, " + "and could not find a new aggregator for its ports\n"); + } + } else { // in case that the only port related to this aggregator is the one we want to remove + select_new_active_agg = aggregator->is_active; + // clear the aggregator + ad_clear_agg(aggregator); + if (select_new_active_agg) { + printk(KERN_INFO "Removing an active aggregator\n"); + // select new active aggregator + ad_agg_selection_logic(__get_first_agg(port)); 
+ } + } + } + + BOND_PRINT_DBG(("Unbinding port %d", port->actor_port_number)); + // find the aggregator that this port is connected to + temp_aggregator = __get_first_agg(port); + for (; temp_aggregator; temp_aggregator = __get_next_agg(temp_aggregator)) { + prev_port = NULL; + // search the port in the aggregator's related ports + for (temp_port=temp_aggregator->lag_ports; temp_port; prev_port=temp_port, temp_port=temp_port->next_port_in_aggregator) { + if (temp_port == port) { // the aggregator found - detach the port from this aggregator + if (prev_port) { + prev_port->next_port_in_aggregator = temp_port->next_port_in_aggregator; + } else { + temp_aggregator->lag_ports = temp_port->next_port_in_aggregator; + } + temp_aggregator->num_of_ports--; + if (temp_aggregator->num_of_ports==0) { + select_new_active_agg = temp_aggregator->is_active; + // clear the aggregator + ad_clear_agg(temp_aggregator); + if (select_new_active_agg) { + printk(KERN_INFO "Removing an active aggregator\n"); + // select new active aggregator + ad_agg_selection_logic(__get_first_agg(port)); + } + } + break; + } + } + } + port->slave=NULL; +} + +/** + * bond_3ad_state_machine_handler - handle state machines timeout + * @bond: bonding struct to work on + * + * The state machine handling concept in this module is to check every tick + * which state machine should operate any function. The execution order is + * round robin, so when we have an interaction between state machines, the + * reply of one to each other might be delayed until next tick. + * + * This function also complete the initialization when the agg_select_timer + * times out, and it selects an aggregator for the ports that are yet not + * related to any aggregator, and selects the active aggregator for a bond. 
+ */ +void bond_3ad_state_machine_handler(struct bonding *bond) +{ + struct port *port; + struct aggregator *aggregator; + unsigned long flags; + + read_lock_irqsave(&bond->lock, flags); + + //check if there are any slaves + if (bond->next == (struct slave *)bond) { + goto end; + } + + if ((bond->device->flags & IFF_UP) != IFF_UP) { + goto end; + } + + // check if agg_select_timer timer after initialize is timed out + if (BOND_AD_INFO(bond).agg_select_timer && !(--BOND_AD_INFO(bond).agg_select_timer)) { + // select the active aggregator for the bond + if ((port = __get_first_port(bond))) { + if (!port->slave) { + printk(KERN_WARNING "bonding: Warning: bond's first port is uninitialized\n"); + goto end; + } + + aggregator = __get_first_agg(port); + ad_agg_selection_logic(aggregator); + } + } + + // for each port run the state machines + for (port = __get_first_port(bond); port; port = __get_next_port(port)) { + if (!port->slave) { + printk(KERN_WARNING "bonding: Warning: Found an uninitialized port\n"); + goto end; + } + + ad_rx_machine(NULL, port); + ad_periodic_machine(port); + ad_port_selection_logic(port); + ad_mux_machine(port); + ad_tx_machine(port); + + // turn off the BEGIN bit, since we already handled it + if (port->sm_vars & AD_PORT_BEGIN) { + port->sm_vars &= ~AD_PORT_BEGIN; + } + } + +end: + read_unlock_irqrestore(&bond->lock, flags); + + + if ((bond->device->flags & IFF_UP) == IFF_UP) { + /* re-arm the timer */ + mod_timer(&(BOND_AD_INFO(bond).ad_timer), jiffies + (AD_TIMER_INTERVAL * HZ / 1000)); + } +} + +/** + * bond_3ad_rx_indication - handle a received frame + * @lacpdu: received lacpdu + * @slave: slave struct to work on + * @length: length of the data received + * + * It is assumed that frames that were sent on this NIC don't returned as new + * received frames (loopback). Since only the payload is given to this + * function, it check for loopback. 
+ */ +void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u16 length) +{ + struct port *port; + + if (length >= sizeof(struct lacpdu)) { + + port = &(SLAVE_AD_INFO(slave).port); + + if (!port->slave) { + printk(KERN_WARNING "bonding: Warning: port of slave %s is uninitialized\n", slave->dev->name); + return; + } + + switch (lacpdu->subtype) { + case AD_TYPE_LACPDU: + __ntohs_lacpdu(lacpdu); + BOND_PRINT_DBG(("Received LACPDU on port %d", port->actor_port_number)); + ad_rx_machine(lacpdu, port); + break; + + case AD_TYPE_MARKER: + // No need to convert fields to Little Endian since we don't use the marker's fields. + + switch (((struct marker *)lacpdu)->tlv_type) { + case AD_MARKER_INFORMATION_SUBTYPE: + BOND_PRINT_DBG(("Received Marker Information on port %d", port->actor_port_number)); + ad_marker_info_received((struct marker *)lacpdu, port); + break; + + case AD_MARKER_RESPONSE_SUBTYPE: + BOND_PRINT_DBG(("Received Marker Response on port %d", port->actor_port_number)); + ad_marker_response_received((struct marker *)lacpdu, port); + break; + + default: + BOND_PRINT_DBG(("Received an unknown Marker subtype on slot %d", port->actor_port_number)); + } + } + } +} + +/** + * bond_3ad_adapter_speed_changed - handle a slave's speed change indication + * @slave: slave struct to work on + * + * Handle reselection of aggregator (if needed) for this port. 
+ */ +void bond_3ad_adapter_speed_changed(struct slave *slave) +{ + struct port *port; + + port = &(SLAVE_AD_INFO(slave).port); + + // if slave is null, the whole port is not initialized + if (!port->slave) { + printk(KERN_WARNING "bonding: Warning: speed changed for uninitialized port on %s\n", + slave->dev->name); + return; + } + + port->actor_admin_port_key &= ~AD_SPEED_KEY_BITS; + port->actor_oper_port_key=port->actor_admin_port_key |= (__get_link_speed(port) << 1); + BOND_PRINT_DBG(("Port %d changed speed", port->actor_port_number)); + // there is no need to reselect a new aggregator, just signal the + // state machines to reinitialize + port->sm_vars |= AD_PORT_BEGIN; +} + +/** + * bond_3ad_adapter_duplex_changed - handle a slave's duplex change indication + * @slave: slave struct to work on + * + * Handle reselection of aggregator (if needed) for this port. + */ +void bond_3ad_adapter_duplex_changed(struct slave *slave) +{ + struct port *port; + + port=&(SLAVE_AD_INFO(slave).port); + + // if slave is null, the whole port is not initialized + if (!port->slave) { + printk(KERN_WARNING "bonding: Warning: duplex changed for uninitialized port on %s\n", + slave->dev->name); + return; + } + + port->actor_admin_port_key &= ~AD_DUPLEX_KEY_BITS; + port->actor_oper_port_key=port->actor_admin_port_key |= __get_duplex(port); + BOND_PRINT_DBG(("Port %d changed duplex", port->actor_port_number)); + // there is no need to reselect a new aggregator, just signal the + // state machines to reinitialize + port->sm_vars |= AD_PORT_BEGIN; +} + +/** + * bond_3ad_link_status_changed - handle a slave's link status change indication + * @slave: slave struct to work on + * @status: whether the link is now up or down + * + * Handle reselection of aggregator (if needed) for this port. 
+ */ +void bond_3ad_link_status_changed(struct slave *slave, int status) +{ + struct port *port; + + port = &(SLAVE_AD_INFO(slave).port); + + // if slave is null, the whole port is not initialized + if (!port->slave) { + printk(KERN_WARNING "bonding: Warning: link status changed for uninitialized port on %s\n", + slave->dev->name); + return; + } + + // on link down we are zeroing duplex and speed since some of the adaptors(ce1000.lan) report full duplex/speed instead of N/A(duplex) / 0(speed) + // on link up we are forcing recheck on the duplex and speed since some of he adaptors(ce1000.lan) report + if (status) { // is up + port->is_enabled = 1; + port->actor_admin_port_key &= ~AD_DUPLEX_KEY_BITS; + port->actor_oper_port_key=port->actor_admin_port_key |= __get_duplex(port); + port->actor_admin_port_key &= ~AD_SPEED_KEY_BITS; + port->actor_oper_port_key=port->actor_admin_port_key |= (__get_link_speed(port) << 1); + } else { + port->is_enabled = 0; + port->actor_admin_port_key &= ~AD_DUPLEX_KEY_BITS; + port->actor_oper_port_key= (port->actor_admin_port_key &= ~AD_SPEED_KEY_BITS); + } + BOND_PRINT_DBG(("Port %d changed link status to %s", port->actor_port_number, (status?"UP":"DOWN"))); + // there is no need to reselect a new aggregator, just signal the + // state machines to reinitialize + port->sm_vars |= AD_PORT_BEGIN; +} + +/** + * bond_3ad_get_active_agg_info - get information of the active aggregator + * @bond: bonding struct to work on + * @ad_info: ad_info struct to fill with the bond's info + * + * Returns: 0 on success + * < 0 on error + */ +int bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info) +{ + struct aggregator *aggregator = NULL; + struct port *port; + + for (port = __get_first_port(bond); port; port = __get_next_port(port)) { + if (port->aggregator && port->aggregator->is_active) { + aggregator = port->aggregator; + break; + } + } + + if (aggregator) { + ad_info->aggregator_id = aggregator->aggregator_identifier; + 
ad_info->ports = aggregator->num_of_ports; + ad_info->actor_key = aggregator->actor_oper_aggregator_key; + ad_info->partner_key = aggregator->partner_oper_aggregator_key; + memcpy(ad_info->partner_system, aggregator->partner_system.mac_addr_value, ETH_ALEN); + return 0; + } + + return -1; +} + +int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev) +{ + slave_t *slave, *start_at; + struct bonding *bond = (struct bonding *) dev->priv; + unsigned long flags; + struct ethhdr *data = (struct ethhdr *)skb->data; + int slave_agg_no; + int slaves_in_agg; + int agg_id; + struct ad_info ad_info; + + if (!IS_UP(dev)) { /* bond down */ + dev_kfree_skb(skb); + return 0; + } + + if (bond == NULL) { + printk(KERN_CRIT "bonding: Error: bond is NULL on device %s\n", dev->name); + dev_kfree_skb(skb); + return 0; + } + + read_lock_irqsave(&bond->lock, flags); + slave = bond->prev; + + /* check if bond is empty */ + if ((slave == (struct slave *) bond) || (bond->slave_cnt == 0)) { + printk(KERN_DEBUG "ERROR: bond is empty\n"); + dev_kfree_skb(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + + if (bond_3ad_get_active_agg_info(bond, &ad_info)) { + printk(KERN_DEBUG "ERROR: bond_3ad_get_active_agg_info failed\n"); + dev_kfree_skb(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + + slaves_in_agg = ad_info.ports; + agg_id = ad_info.aggregator_id; + + if (slaves_in_agg == 0) { + /*the aggregator is empty*/ + printk(KERN_DEBUG "ERROR: active aggregator is empty\n"); + dev_kfree_skb(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + + /* we're at the root, get the first slave */ + if ((slave == NULL) || (slave->dev == NULL)) { + /* no suitable interface, frame not sent */ + dev_kfree_skb(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + + slave_agg_no = (data->h_dest[5]^slave->dev->dev_addr[5]) % slaves_in_agg; + while (slave != (slave_t *)bond) { + struct aggregator *agg = 
SLAVE_AD_INFO(slave).port.aggregator; + + if (agg && (agg->aggregator_identifier == agg_id)) { + slave_agg_no--; + if (slave_agg_no < 0) { + break; + } + } + + slave = slave->prev; + if (slave == NULL) { + printk(KERN_ERR "bonding: Error: slave is NULL\n"); + dev_kfree_skb(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + } + + if (slave == (slave_t *)bond) { + printk(KERN_ERR "bonding: Error: Couldn't find a slave to tx on for aggregator ID %d\n", agg_id); + dev_kfree_skb(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + + start_at = slave; + + do { + int slave_agg_id = 0; + struct aggregator *agg; + + if (slave == NULL) { + printk(KERN_ERR "bonding: Error: slave is NULL\n"); + dev_kfree_skb(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + + agg = SLAVE_AD_INFO(slave).port.aggregator; + + if (agg) { + slave_agg_id = agg->aggregator_identifier; + } + + if (SLAVE_IS_OK(slave) && + agg && (slave_agg_id == agg_id)) { + skb->dev = slave->dev; + skb->priority = 1; + dev_queue_xmit(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; + } + } while ((slave = slave->next) != start_at); + + /* no suitable interface, frame not sent */ + dev_kfree_skb(skb); + read_unlock_irqrestore(&bond->lock, flags); + return 0; +} + +int bond_3ad_lacpdu_recv(struct sk_buff *skb, struct net_device *dev, struct packet_type* ptype) +{ + struct bonding *bond = (struct bonding *)dev->priv; + struct slave *slave = NULL; + unsigned long flags; + int ret = NET_RX_DROP; + + if (!(dev->flags & IFF_MASTER)) { + goto out; + } + + read_lock_irqsave(&bond->lock, flags); +#ifdef BOND_POINT_TO_POINT_PROT + slave = bond_get_slave_by_dev((struct bonding *) dev->priv, skb->real_dev); +#else +#warning "skb->real_dev not defined. apply bond-p2p patch for the module to work !!!" 
+#endif //BOND_POINT_TO_POINT_PROT + + if (slave == NULL) { + goto out_unlock; + } + + bond_3ad_rx_indication((struct lacpdu *) skb->data, slave, skb->len); + + ret = NET_RX_SUCCESS; + +out_unlock: + read_unlock_irqrestore(&bond->lock, flags); +out: + dev_kfree_skb(skb); + + return ret; +} + diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding/bond_3ad.h linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bond_3ad.h --- linux-2.4.20-bonding-20030317/drivers/net/bonding/bond_3ad.h 1970-01-01 02:00:00.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bond_3ad.h 2003-03-18 17:24:25.000000000 +0200 @@ -0,0 +1,281 @@ +/**************************************************************************** + Copyright(c) 1999 - 2003 Intel Corporation. All rights reserved. + + This program is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, but WITHOUT + ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + You should have received a copy of the GNU General Public License along with + this program; if not, write to the Free Software Foundation, Inc., 59 + Temple Place - Suite 330, Boston, MA 02111-1307, USA. + + The full GNU General Public License is included in this distribution in the + file called LICENSE. 
+*****************************************************************************/ + +#ifndef __BOND_3AD_H__ +#define __BOND_3AD_H__ + +#include +#include +#include + +// General definitions +#define BOND_ETH_P_LACPDU 0x8809 +#define PKT_TYPE_LACPDU __constant_htons(BOND_ETH_P_LACPDU) +#define AD_TIMER_INTERVAL 100 /*msec*/ + +#define MULTICAST_LACPDU_ADDR {0x01, 0x80, 0xC2, 0x00, 0x00, 0x02} +#define AD_MULTICAST_LACPDU_ADDR {MULTICAST_LACPDU_ADDR} + +typedef struct mac_addr { + u8 mac_addr_value[ETH_ALEN]; +} mac_addr_t; + +typedef enum { + AD_BANDWIDTH = 0, + AD_COUNT +} agg_selection_t; + +// rx machine states(43.4.11 in the 802.3ad standard) +typedef enum { + AD_RX_DUMMY, + AD_RX_INITIALIZE, // rx Machine + AD_RX_PORT_DISABLED, // rx Machine + AD_RX_LACP_DISABLED, // rx Machine + AD_RX_EXPIRED, // rx Machine + AD_RX_DEFAULTED, // rx Machine + AD_RX_CURRENT // rx Machine +} rx_states_t; + +// periodic machine states(43.4.12 in the 802.3ad standard) +typedef enum { + AD_PERIODIC_DUMMY, + AD_NO_PERIODIC, // periodic machine + AD_FAST_PERIODIC, // periodic machine + AD_SLOW_PERIODIC, // periodic machine + AD_PERIODIC_TX // periodic machine +} periodic_states_t; + +// mux machine states(43.4.13 in the 802.3ad standard) +typedef enum { + AD_MUX_DUMMY, + AD_MUX_DETACHED, // mux machine + AD_MUX_WAITING, // mux machine + AD_MUX_ATTACHED, // mux machine + AD_MUX_COLLECTING_DISTRIBUTING // mux machine +} mux_states_t; + +// tx machine states(43.4.15 in the 802.3ad standard) +typedef enum { + AD_TX_DUMMY, + AD_TRANSMIT // tx Machine +} tx_states_t; + +// rx indication types +typedef enum { + AD_TYPE_LACPDU = 1, // type lacpdu + AD_TYPE_MARKER // type marker +} pdu_type_t; + +// rx marker indication types +typedef enum { + AD_MARKER_INFORMATION_SUBTYPE = 1, // marker information subtype + AD_MARKER_RESPONSE_SUBTYPE // marker response subtype +} marker_subtype_t; + +// timers types(43.4.9 in the 802.3ad standard) +typedef enum { + AD_CURRENT_WHILE_TIMER, +
AD_ACTOR_CHURN_TIMER, + AD_PERIODIC_TIMER, + AD_PARTNER_CHURN_TIMER, + AD_WAIT_WHILE_TIMER +} ad_timers_t; + +#pragma pack(1) + +typedef struct ad_header { + struct mac_addr destination_address; + struct mac_addr source_address; + u16 length_type; +} ad_header_t; + +// Link Aggregation Control Protocol(LACP) data unit structure(43.4.2.2 in the 802.3ad standard) +typedef struct lacpdu { + u8 subtype; // = LACP(= 0x01) + u8 version_number; + u8 tlv_type_actor_info; // = actor information(type/length/value) + u8 actor_information_length; // = 20 + u16 actor_system_priority; + struct mac_addr actor_system; + u16 actor_key; + u16 actor_port_priority; + u16 actor_port; + u8 actor_state; + u8 reserved_3_1[3]; // = 0 + u8 tlv_type_partner_info; // = partner information + u8 partner_information_length; // = 20 + u16 partner_system_priority; + struct mac_addr partner_system; + u16 partner_key; + u16 partner_port_priority; + u16 partner_port; + u8 partner_state; + u8 reserved_3_2[3]; // = 0 + u8 tlv_type_collector_info; // = collector information + u8 collector_information_length; // = 16 + u16 collector_max_delay; + u8 reserved_12[12]; + u8 tlv_type_terminator; // = terminator + u8 terminator_length; // = 0 + u8 reserved_50[50]; // = 0 +} lacpdu_t; + +typedef struct lacpdu_header { + struct ad_header ad_header; + struct lacpdu lacpdu; +} lacpdu_header_t; + +// Marker Protocol Data Unit(PDU) structure(43.5.3.2 in the 802.3ad standard) +typedef struct marker { + u8 subtype; // = 0x02 (marker PDU) + u8 version_number; // = 0x01 + u8 tlv_type; // = 0x01 (marker information) + // = 0x02 (marker response information) + u8 marker_length; // = 0x16 + u16 requester_port; // The number assigned to the port by the requester + struct mac_addr requester_system; // The requester’s system id + u32 requester_transaction_id; // The transaction id allocated by the requester, + u16 pad; // = 0 + u8 tlv_type_terminator; // = 0x00 + u8 terminator_length; // = 0x00 + u8 reserved_90[90]; // = 0 +} 
marker_t; + +typedef struct marker_header { + struct ad_header ad_header; + struct marker marker; +} marker_header_t; + +#pragma pack() + +struct slave; +struct bonding; +struct ad_info; +struct port; + +#ifdef __ia64__ +#pragma pack(8) +#endif + +// aggregator structure(43.4.5 in the 802.3ad standard) +typedef struct aggregator { + struct mac_addr aggregator_mac_address; + u16 aggregator_identifier; + u16 is_individual; // BOOLEAN + u16 actor_admin_aggregator_key; + u16 actor_oper_aggregator_key; + struct mac_addr partner_system; + u16 partner_system_priority; + u16 partner_oper_aggregator_key; + u16 receive_state; // BOOLEAN + u16 transmit_state; // BOOLEAN + struct port *lag_ports; + // ****** PRIVATE PARAMETERS ****** + struct slave *slave; // pointer to the bond slave that this aggregator belongs to + u16 is_active; // BOOLEAN. Indicates if this aggregator is active + u16 num_of_ports; +} aggregator_t; + +// port structure(43.4.6 in the 802.3ad standard) +typedef struct port { + u16 actor_port_number; + u16 actor_port_priority; + struct mac_addr actor_system; // This parameter is added here although it is not specified in the standard, just for simplification + u16 actor_system_priority; // This parameter is added here although it is not specified in the standard, just for simplification + u16 actor_port_aggregator_identifier; + u16 ntt; // BOOLEAN + u16 actor_admin_port_key; + u16 actor_oper_port_key; + u8 actor_admin_port_state; + u8 actor_oper_port_state; + struct mac_addr partner_admin_system; + struct mac_addr partner_oper_system; + u16 partner_admin_system_priority; + u16 partner_oper_system_priority; + u16 partner_admin_key; + u16 partner_oper_key; + u16 partner_admin_port_number; + u16 partner_oper_port_number; + u16 partner_admin_port_priority; + u16 partner_oper_port_priority; + u8 partner_admin_port_state; + u8 partner_oper_port_state; + u16 is_enabled; // BOOLEAN + // ****** PRIVATE PARAMETERS ****** + u16 sm_vars; // all state machines variables 
for this port + rx_states_t sm_rx_state; // state machine rx state + u16 sm_rx_timer_counter; // state machine rx timer counter + periodic_states_t sm_periodic_state;// state machine periodic state + u16 sm_periodic_timer_counter; // state machine periodic timer counter + mux_states_t sm_mux_state; // state machine mux state + u16 sm_mux_timer_counter; // state machine mux timer counter + tx_states_t sm_tx_state; // state machine tx state + u16 sm_tx_timer_counter; // state machine tx timer counter(always on - enters the transmit state 3 times per second) + struct slave *slave; // pointer to the bond slave that this port belongs to + struct aggregator *aggregator; // pointer to the aggregator that this port is related to + struct port *next_port_in_aggregator; // Next port on the linked list of the parent aggregator + u32 transaction_id; // continuous number for identification of Marker PDU's; + struct lacpdu lacpdu; // the lacpdu that will be sent for this port +} port_t; + +// system structure +typedef struct ad_system { + u16 sys_priority; + struct mac_addr sys_mac_addr; +} ad_system_t; + +#ifdef __ia64__ +#pragma pack() +#endif + +// ================= AD Exported structures to the main bonding code ================== +#define BOND_AD_INFO(bond) ((bond)->ad_info) +#define SLAVE_AD_INFO(slave) ((slave)->ad_info) + +struct ad_bond_info { + ad_system_t system; // 802.3ad system structure + u32 agg_select_timer; // Timer to select aggregator after all adapter's hand shakes + u32 agg_select_mode; // Mode of selection of active aggregator(bandwidth/count) + struct timer_list ad_timer; + struct packet_type ad_pkt_type; +}; + +struct ad_slave_info { + struct aggregator aggregator; // 802.3ad aggregator structure + struct port port; // 802.3ad port structure + spinlock_t rx_machine_lock; // To avoid race condition between callback and receive interrupt + u16 id; +}; + +// ================= AD Exported functions to the main bonding code ================== +void
bond_3ad_initialize(struct bonding *bond, u16 tick_resolution); +int bond_3ad_bind_slave(struct slave *slave); +void bond_3ad_unbind_slave(struct slave *slave); +void bond_3ad_state_machine_handler(struct bonding *bond); +void bond_3ad_rx_indication(struct lacpdu *lacpdu, struct slave *slave, u16 length); +void bond_3ad_adapter_speed_changed(struct slave *slave); +void bond_3ad_adapter_duplex_changed(struct slave *slave); +void bond_3ad_link_status_changed(struct slave *slave, int status); +int bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info); +int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev); +int bond_3ad_lacpdu_recv(struct sk_buff *skb, struct net_device *dev, struct packet_type* ptype); +#endif //__BOND_3AD_H__ + diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding/bonding.h linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bonding.h --- linux-2.4.20-bonding-20030317/drivers/net/bonding/bonding.h 2003-03-18 17:24:24.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bonding.h 2003-03-18 17:24:25.000000000 +0200 @@ -10,6 +10,11 @@ * This software may be used and distributed according to the terms * of the GNU Public License, incorporated herein by reference. * + * + * 2003/03/18 - Amir Noam , + * Tsippy Mendelson and + * Shmulik Hen + * - Added support for IEEE 802.3ad Dynamic link aggregation mode. */ #ifndef _LINUX_BONDING_H @@ -17,6 +22,36 @@ #include #include +#include "bond_3ad.h" + +#ifdef BONDING_DEBUG + +// use this like so: BOND_PRINT_DBG(("foo = %d, bar = %d", foo, bar)); +#define BOND_PRINT_DBG(X) \ +do { \ + printk(KERN_DEBUG "%s (%d)", __FUNCTION__, __LINE__); \ + printk X; \ + printk("\n"); \ +} while(0) + +#else +#define BOND_PRINT_DBG(X) +#endif /* BONDING_DEBUG */ + +#define IS_UP(dev) ((((dev)->flags & (IFF_UP)) == (IFF_UP)) && \ + (netif_running(dev) && netif_carrier_ok(dev))) + +/* Checks whether the dev is ready for transmit.
We do not check netif_running */ +/* since a device can be stopped by the driver for short periods of time for */ +/* maintenance. dev_queue_xmit() handles this by queuing the packet until */ +/* the dev is running again. Keeping packet ordering requires sticking to the */ +/* same dev as much as possible */ +#define SLAVE_IS_OK(slave) \ + ((((slave)->dev->flags & (IFF_UP)) == (IFF_UP)) && \ + netif_carrier_ok((slave)->dev) && \ + ((slave)->link == BOND_LINK_UP) && \ + ((slave)->state == BOND_STATE_ACTIVE)) + typedef struct slave { struct slave *next; @@ -31,6 +66,7 @@ typedef struct slave { u16 speed; u8 duplex; u8 perm_hwaddr[ETH_ALEN]; + struct ad_slave_info ad_info; // HUGE struct. maybe alloc dynamically } slave_t; /* @@ -62,7 +98,53 @@ typedef struct bonding { struct net_device *device; struct dev_mc_list *mc_list; unsigned short flags; + struct ad_bond_info ad_info; } bonding_t; +void bond_set_slave_active_flags(slave_t *slave); +void bond_set_slave_inactive_flags(slave_t *slave); + +//this function can be used for iterating the slave list (which is circular) +//must be locked with bond RW lock +extern inline struct slave* +bond_get_next_slave(struct bonding *bond, struct slave *slave) +{ + //If we have reached the last slave - return NULL + if (slave->next == bond->next) { + return NULL; + } + return slave->next; +} + +//must be locked with bond RW lock +//returns NULL if the net_device does not belong to any of the bond's slaves +extern inline struct slave* +bond_get_slave_by_dev(struct bonding *bond, struct net_device *slave_dev) +{ + struct slave *our_slave = bond->next; + + //check if the list of slaves is empty + if (our_slave == (slave_t *)bond) { + return NULL; + } + + for (; our_slave; our_slave = bond_get_next_slave(bond, our_slave)) { + if (our_slave->dev == slave_dev) { + break; + } + } + return our_slave; +} + +extern inline struct bonding* +bond_get_bond_by_slave(struct slave *slave) +{ + if (!slave || !slave->dev->master) { + return NULL;
+ } + + return (struct bonding *)(slave->dev->master->priv); +} + #endif /* _LINUX_BONDING_H */ diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding/bond_main.c linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bond_main.c --- linux-2.4.20-bonding-20030317/drivers/net/bonding/bond_main.c 2003-03-18 17:24:24.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/bond_main.c 2003-03-18 17:24:25.000000000 +0200 @@ -319,6 +319,11 @@ * properly. * - Block possibility of enslaving before the master is up. This * prevents putting the system in an unstable state. + * + * 2003/03/18 - Amir Noam , + * Tsippy Mendelson and + * Shmulik Hen + * - Added support for IEEE 802.3ad Dynamic link aggregation mode. */ #include @@ -359,6 +364,7 @@ #include #include #include "bonding.h" +#include "bond_3ad.h" #define DRV_VERSION "2.4.20-20030317" #define DRV_RELDATE "March 17, 2003" @@ -409,6 +415,7 @@ static struct bond_parm_tbl bond_mode_tb { "active-backup", BOND_MODE_ACTIVEBACKUP}, { "balance-xor", BOND_MODE_XOR}, { "broadcast", BOND_MODE_BROADCAST}, +{ "802.3ad", BOND_MODE_8023AD}, { NULL, -1}, }; @@ -464,8 +471,6 @@ static void bond_set_promiscuity(bonding static void bond_set_allmulti(bonding_t *bond, int inc); static struct dev_mc_list* bond_mc_list_find_dmi(struct dev_mc_list *dmi, struct dev_mc_list *mc_list); static void bond_mc_update(bonding_t *bond, slave_t *new, slave_t *old); -static void bond_set_slave_inactive_flags(slave_t *slave); -static void bond_set_slave_active_flags(slave_t *slave); static int bond_enslave(struct net_device *master, struct net_device *slave); static int bond_release(struct net_device *master, struct net_device *slave); static int bond_release_all(struct net_device *master); @@ -483,9 +488,6 @@ static int bond_get_info(char *buf, char /* several macros */ -#define IS_UP(dev) ((((dev)->flags & (IFF_UP)) == (IFF_UP)) && \ - (netif_running(dev) && netif_carrier_ok(dev))) - static void arp_send_all(slave_t *slave) {
int i; @@ -510,6 +512,8 @@ bond_mode_name(void) return "load balancing (xor)"; case BOND_MODE_BROADCAST : return "fault-tolerance (broadcast)"; + case BOND_MODE_8023AD: + return "IEEE 802.3ad Dynamic link aggregation"; default : return "unknown"; } @@ -530,13 +534,13 @@ multicast_mode_name(void) } } -static void bond_set_slave_inactive_flags(slave_t *slave) +void bond_set_slave_inactive_flags(slave_t *slave) { slave->state = BOND_STATE_BACKUP; slave->dev->flags |= IFF_NOARP; } -static void bond_set_slave_active_flags(slave_t *slave) +void bond_set_slave_active_flags(slave_t *slave) { slave->state = BOND_STATE_ACTIVE; slave->dev->flags &= ~IFF_NOARP; @@ -815,8 +819,29 @@ static u16 bond_check_mii_link(bonding_t return (has_active_interface ? BMSR_LSTATUS : 0); } +//register to receive lacpdus on a bond +static void bond_register_lacpdu(struct bonding *bond) +{ + struct packet_type* pk_type = &(BOND_AD_INFO(bond).ad_pkt_type); + + //initialize packet type + pk_type->type = PKT_TYPE_LACPDU; + pk_type->dev = bond->device; + pk_type->func = bond_3ad_lacpdu_recv; + pk_type->data = (void*)1; // understand shared skbs + + dev_add_pack(pk_type); +} + +//unregister from receiving lacpdus on a bond +static void bond_unregister_lacpdu(struct bonding *bond) +{ + dev_remove_pack(&(BOND_AD_INFO(bond).ad_pkt_type)); +} + static int bond_open(struct net_device *dev) { + struct bonding *bond = (struct bonding *)(dev->priv); struct timer_list *timer = &((struct bonding *)(dev->priv))->mii_timer; struct timer_list *arp_timer = &((struct bonding *)(dev->priv))->arp_timer; MOD_INC_USE_COUNT; @@ -840,6 +865,19 @@ static int bond_open(struct net_device * } add_timer(arp_timer); } + + if (bond_mode == BOND_MODE_8023AD) { + struct timer_list *ad_timer = &(BOND_AD_INFO(bond).ad_timer); + init_timer(ad_timer); + ad_timer->expires = jiffies + (AD_TIMER_INTERVAL * HZ / 1000); + ad_timer->data = (unsigned long)bond; + ad_timer->function = (void *)&bond_3ad_state_machine_handler; + add_timer(ad_timer);
+ + //register to receive LACPDUs + bond_register_lacpdu(bond); + } + return 0; } @@ -861,8 +899,18 @@ static int bond_close(struct net_device } } - /* Release the bonded slaves */ - bond_release_all(master); + if (bond_mode == BOND_MODE_8023AD) { + del_timer_sync(&(BOND_AD_INFO(bond).ad_timer)); + + //Unregister the receive of LACPDUs + bond_unregister_lacpdu(bond); + } + + if (bond->next != (struct slave *) bond) { + /* Release the bonded slaves */ + bond_release_all(master); + } + bond_mc_list_destroy (bond); write_unlock_irqrestore(&bond->lock, flags); @@ -880,6 +928,13 @@ static void bond_mc_list_flush(struct ne for (dmi = flush->mc_list; dmi != NULL; dmi = dmi->next) dev_mc_delete(dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); + + if (bond_mode == BOND_MODE_8023AD) { + /*del lacpdu mc addr to mc list*/ + u8 lacpdu_multicast[ETH_ALEN] = MULTICAST_LACPDU_ADDR; + + dev_mc_delete(dev, lacpdu_multicast, ETH_ALEN, 0); + } } /* @@ -1240,6 +1295,13 @@ static int bond_enslave(struct net_devic dev_mc_add (slave_dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); } + if (bond_mode == BOND_MODE_8023AD) { + /*add lacpdu mc addr to mc list*/ + u8 lacpdu_multicast[ETH_ALEN] = MULTICAST_LACPDU_ADDR; + + dev_mc_add(slave_dev, lacpdu_multicast, ETH_ALEN, 0); + } + write_lock_irqsave(&bond->lock, flags); bond_attach_slave(bond, new_slave); @@ -1299,6 +1361,11 @@ static int bond_enslave(struct net_devic "bond_enslave(): failed to get speed/duplex from %s, " "speed forced to 100Mbps, duplex forced to Full.\n", new_slave->dev->name); + if (bond_mode == BOND_MODE_8023AD) { + printk(KERN_WARNING + "Operation of 802.3ad mode requires ETHTOOL support " + "in base driver for proper aggregator selection.\n"); + } } /* if we're in active-backup mode, we need one and only one active @@ -1337,6 +1404,23 @@ static int bond_enslave(struct net_devic if (primary != NULL) if( strcmp(primary, new_slave->dev->name) == 0) bond->primary_slave = new_slave; + } else if (bond_mode == BOND_MODE_8023AD) { + /* in 
802.3ad mode, the internal mechanism + will activate the slaves in the selected + aggregator */ + bond_set_slave_inactive_flags(new_slave); + //if this is the first slave + if (new_slave == bond->next) { + SLAVE_AD_INFO(new_slave).id = 1; + /*Initialize AD with the number of times that the AD timer is called in 1 second*/ + /*can be called only after the mac address of the bond is set*/ + bond_3ad_initialize(bond, 1000/AD_TIMER_INTERVAL); + } else { + SLAVE_AD_INFO(new_slave).id = + SLAVE_AD_INFO(new_slave->prev).id + 1; + } + + bond_3ad_bind_slave(new_slave); } else { #ifdef BONDING_DEBUG printk(KERN_CRIT "This slave is always active in trunk mode\n"); @@ -1601,6 +1685,12 @@ static int bond_release(struct net_devic old_current = bond->current_slave; while ((our_slave = our_slave->prev) != (slave_t *)bond) { if (our_slave->dev == slave) { + /* Inform AD package of unbinding of slave. */ + if (bond_mode == BOND_MODE_8023AD) { + bond_3ad_unbind_slave(our_slave); + } + + /* release the slave from its bond */ bond_detach_slave(bond, our_slave); printk (KERN_INFO "%s: releasing %s interface %s", @@ -1705,6 +1795,12 @@ static int bond_release_all(struct net_d bond->primary_slave = NULL; while ((our_slave = bond->prev) != (slave_t *)bond) { + /* Inform AD package of unbinding of slave + before slave is detached from the list. 
*/ + if (bond_mode == BOND_MODE_8023AD) { + bond_3ad_unbind_slave(our_slave); + } + slave_dev = our_slave->dev; bond_detach_slave(bond, our_slave); @@ -1784,6 +1880,8 @@ static void bond_mii_monitor(struct net_ int mindelay = updelay + 1; struct net_device *dev = slave->dev; int link_state; + u16 old_speed = slave->speed; + u8 old_duplex = slave->duplex; link_state = bond_check_dev_link(dev, 0); @@ -1832,7 +1930,7 @@ static void bond_mii_monitor(struct net_ slave->link = BOND_LINK_DOWN; /* in active/backup mode, we must completely disable this interface */ - if (bond_mode == BOND_MODE_ACTIVEBACKUP) { + if ((bond_mode == BOND_MODE_ACTIVEBACKUP) || (bond_mode == BOND_MODE_8023AD)) { bond_set_slave_inactive_flags(slave); } printk(KERN_INFO @@ -1841,6 +1939,11 @@ static void bond_mii_monitor(struct net_ master->name, dev->name); + //notify ad that the link status has changed + if (bond_mode == BOND_MODE_8023AD) { + bond_3ad_link_status_changed(slave, 0); + } + read_lock(&bond->ptrlock); if (slave == bond->current_slave) { read_unlock(&bond->ptrlock); @@ -1911,8 +2014,12 @@ static void bond_mii_monitor(struct net_ /* now the link has been up for long time enough */ slave->link = BOND_LINK_UP; slave->jiffies = jiffies; - - if (bond_mode != BOND_MODE_ACTIVEBACKUP) { + + if (bond_mode == BOND_MODE_8023AD) { + /* prevent it from being the active one */ + slave->state = BOND_STATE_BACKUP; + } + else if (bond_mode != BOND_MODE_ACTIVEBACKUP) { /* make it immediately active */ slave->state = BOND_STATE_ACTIVE; } else if (slave != bond->primary_slave) { @@ -1926,7 +2033,12 @@ static void bond_mii_monitor(struct net_ master->name, dev->name); - if ( (bond->primary_slave != NULL) + //notify ad that the link status has changed + if (bond_mode == BOND_MODE_8023AD) { + bond_3ad_link_status_changed(slave, 1); + } + + if ( (bond->primary_slave != NULL) && (slave == bond->primary_slave) ) change_active_interface(bond); } @@ -1950,7 +2062,16 @@ static void bond_mii_monitor(struct net_ } 
/* end of switch */ bond_update_speed_duplex(slave); - + + if (bond_mode == BOND_MODE_8023AD) { + if (old_speed != slave->speed) { + bond_3ad_adapter_speed_changed(slave); + } + if (old_duplex != slave->duplex) { + bond_3ad_adapter_duplex_changed(slave); + } + } + } /* end of while */ /* @@ -1978,12 +2099,17 @@ static void bond_mii_monitor(struct net_ bestslave->delay = 0; bestslave->link = BOND_LINK_UP; bestslave->jiffies = jiffies; + + //notify ad that the link status has changed + if (bond_mode == BOND_MODE_8023AD) { + bond_3ad_link_status_changed(bestslave, 1); + } } if (bond_mode == BOND_MODE_ACTIVEBACKUP) { bond_set_slave_active_flags(bestslave); bond_mc_update(bond, bestslave, NULL); - } else { + } else if (bond_mode != BOND_MODE_8023AD) { bestslave->state = BOND_STATE_ACTIVE; } write_lock(&bond->ptrlock); @@ -2956,6 +3082,31 @@ static int bond_get_info(char *buf, char multicast_mode_name()); read_lock_irqsave(&bond->lock, flags); + + if (bond_mode == BOND_MODE_8023AD) { + struct ad_info ad_info; + + len += sprintf(buf + len, "\n802.3ad info\n"); + + if (bond_3ad_get_active_agg_info(bond, &ad_info)) { + len += sprintf(buf + len, "bond %s has no active aggregator\n", bond->device->name); + } else { + len += sprintf(buf + len, "Active Aggregator Info:\n"); + + len += sprintf(buf + len, "\tAggregator ID: %d\n", ad_info.aggregator_id); + len += sprintf(buf + len, "\tNumber of ports: %d\n", ad_info.ports); + len += sprintf(buf + len, "\tActor Key: %d\n", ad_info.actor_key); + len += sprintf(buf + len, "\tPartner Key: %d\n", ad_info.partner_key); + len += sprintf(buf + len, "\tPartner Mac Address: %02x:%02x:%02x:%02x:%02x:%02x\n", + ad_info.partner_system[0], + ad_info.partner_system[1], + ad_info.partner_system[2], + ad_info.partner_system[3], + ad_info.partner_system[4], + ad_info.partner_system[5]); + } + } + for (slave = bond->prev; slave != (slave_t *)bond; slave = slave->prev) { len += sprintf(buf + len, "\nSlave Interface: %s\n", slave->dev->name); @@ 
-2976,6 +3127,17 @@ static int bond_get_info(char *buf, char slave->perm_hwaddr[3], slave->perm_hwaddr[4], slave->perm_hwaddr[5]); + + if (bond_mode == BOND_MODE_8023AD) { + struct aggregator *agg = SLAVE_AD_INFO(slave).port.aggregator; + + if (agg) { + len += sprintf(buf + len, "Aggregator ID: %d\n", + agg->aggregator_identifier); + } else { + len += sprintf(buf + len, "Aggregator ID: N/A\n"); + } + } } read_unlock_irqrestore(&bond->lock, flags); @@ -3093,6 +3255,9 @@ static int __init bond_init(struct net_d case BOND_MODE_BROADCAST: dev->hard_start_xmit = bond_xmit_broadcast; break; + case BOND_MODE_8023AD: + dev->hard_start_xmit = bond_3ad_xmit_xor; + break; default: printk(KERN_ERR "Unknown bonding mode %d\n", bond_mode); kfree(bond->stats); @@ -3280,6 +3445,35 @@ static int __init bonding_init(void) downdelay = 0; } + /* reset values for 802.3ad */ + if (bond_mode == BOND_MODE_8023AD) { + if (arp_interval != 0) { + printk(KERN_WARNING "bonding_init(): ARP monitoring " + "can't be used simultaneously with 802.3ad, " + "disabling ARP monitoring\n" + ); + arp_interval = 0; + } + + if (miimon == 0) { + printk(KERN_ERR + "bonding_init(): miimon must be specified, " + "otherwise bonding will not detect link failure, " + "speed and duplex which are essential " + "for 802.3ad operation. " + "Forcing miimon to 100msec\n"); + miimon = 100; + } + + if (multicast_mode != BOND_MULTICAST_ALL) { + printk(KERN_ERR + "bonding_init(): Multicast mode must " + "be set to ALL for 802.3ad, " + "Forcing Multicast mode to ALL\n"); + multicast_mode = BOND_MULTICAST_ALL; + } + } + if (miimon == 0) { if ((updelay != 0) || (downdelay != 0)) { /* just warn the user the up/down delay will have diff -Nuarp linux-2.4.20-bonding-20030317/drivers/net/bonding/Makefile linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/Makefile --- linux-2.4.20-bonding-20030317/drivers/net/bonding/Makefile 2003-03-18 17:24:24.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/drivers/net/bonding/Makefile
2003-03-18 17:24:25.000000000 +0200 @@ -4,7 +4,8 @@ O_TARGET := bonding.o -obj-y := bond_main.o +obj-y := bond_main.o \ + bond_3ad.o obj-m := $(O_TARGET) diff -Nuarp linux-2.4.20-bonding-20030317/include/linux/if_bonding.h linux-2.4.20-bonding-20030317-devel/include/linux/if_bonding.h --- linux-2.4.20-bonding-20030317/include/linux/if_bonding.h 2003-03-18 17:24:24.000000000 +0200 +++ linux-2.4.20-bonding-20030317-devel/include/linux/if_bonding.h 2003-03-18 17:24:25.000000000 +0200 @@ -23,6 +23,11 @@ * 2003/03/18 - Tsippy Mendelson and * Amir Noam * - Moved driver's private data types to bonding.h + * + * 2003/03/18 - Amir Noam , + * Tsippy Mendelson and + * Shmulik Hen + * - Added support for IEEE 802.3ad Dynamic link aggregation mode. */ #ifndef _LINUX_IF_BONDING_H @@ -49,6 +54,7 @@ #define BOND_MODE_ACTIVEBACKUP 1 #define BOND_MODE_XOR 2 #define BOND_MODE_BROADCAST 3 +#define BOND_MODE_8023AD 4 /* each slave's link has 4 states */ #define BOND_LINK_UP 0 /* link is up and running */ @@ -81,6 +87,14 @@ typedef struct ifslave __u32 link_failure_count; } ifslave; +struct ad_info { + __u16 aggregator_id; + __u16 ports; + __u16 actor_key; + __u16 partner_key; + __u8 partner_system[ETH_ALEN]; +}; + #endif /* _LINUX_IF_BONDING_H */ /* -- | Shmulik Hen | | Israel Design Center (Jerusalem) | | LAN Access Division | | Intel Communications Group, Intel corp. | | | | Anti-Spam: shmulik dot hen at intel dot com | From kuznet@ms2.inr.ac.ru Thu Mar 20 07:50:12 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 07:50:16 -0800 (PST) Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KFo8q9028733 for ; Thu, 20 Mar 2003 07:50:10 -0800 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id SAA10972; Thu, 20 Mar 2003 18:49:49 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200303201549.SAA10972@sex.inr.ac.ru> Subject: Re: TCP/IPv6 broken in Linux 2.5.64?
To: ahu@ds9a.NL (bert hubert) Date: Thu, 20 Mar 2003 18:49:49 +0300 (MSK) Cc: netdev@oss.sgi.com In-Reply-To: <20030318162532.GA9705@outpost.ds9a.nl> from "bert hubert" at Mar 18, 3 07:45:02 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1997 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Content-Length: 756 Lines: 22 Hello! > irc servers, or an IPv6 zonetransfer. However, when I try to ssh from 2.5.65 Try this. I have just found this lost patch, it is from 2.4 tree, but it should fit to 2.5 as well. Alexey ===== net/ipv6/tcp_ipv6.c 1.19 vs edited ===== --- 1.19/net/ipv6/tcp_ipv6.c Thu Jan 23 21:14:18 2003 +++ edited/net/ipv6/tcp_ipv6.c Thu Mar 20 18:44:17 2003 @@ -983,7 +983,7 @@ struct ipv6_pinfo *np = &sk->net_pinfo.af_inet6; if (skb->ip_summed == CHECKSUM_HW) { - th->check = csum_ipv6_magic(&np->saddr, &np->daddr, len, IPPROTO_TCP, 0); + th->check = ~csum_ipv6_magic(&np->saddr, &np->daddr, len, IPPROTO_TCP, 0); skb->csum = offsetof(struct tcphdr, check); } else { th->check = csum_ipv6_magic(&np->saddr, &np->daddr, len, IPPROTO_TCP, From ahu@outpost.ds9a.nl Thu Mar 20 07:55:32 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 07:55:43 -0800 (PST) Received: from outpost.ds9a.nl (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KFsfq9029102 for ; Thu, 20 Mar 2003 07:55:22 -0800 Received: by outpost.ds9a.nl (Postfix, from userid 1000) id 21F8A45A0; Thu, 20 Mar 2003 16:19:12 +0100 (CET) Date: Thu, 20 Mar 2003 16:19:12 +0100 From: bert hubert To: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Cc: davem@redhat.com Subject: [some more data] Re: [BUG] 2.5.65 ipv6 TCP checksum errors (capture attached) Message-ID: <20030320151912.GA25487@outpost.ds9a.nl> Mail-Followup-To: bert hubert , 
netdev@oss.sgi.com, linux-kernel@vger.kernel.org, davem@redhat.com References: <20030319124533.GA14363@outpost.ds9a.nl> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="NzB8fVQJ5HfG6fxh" Content-Disposition: inline In-Reply-To: <20030319124533.GA14363@outpost.ds9a.nl> User-Agent: Mutt/1.3.28i X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1998 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ahu@ds9a.nl Precedence: bulk X-list: netdev Content-Length: 3833 Lines: 69 --NzB8fVQJ5HfG6fxh Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Wed, Mar 19, 2003 at 01:45:33PM +0100, bert hubert wrote: > Interestingly, the initial ssh connection worked, the second one failed. > Subsequent attempts fail too. I've since let loose the excellent ethereal on this and found out: > hubert# tcpdump -r file -v -v > 29.09 snapcount.33408 > hubert.ssh: S [tcp sum ok] 2737328594:2737328594(0) win 5760 (len 40, hlim 64) > 29.09 hubert.ssh > snapcount.33408: S [tcp sum ok] 2399386333:2399386333(0) ack 2737328595 win 5712 (len 40, hlim 64) > 29.09 snapcount.33408 > hubert.ssh: . [tcp sum ok] 1:1(0) ack 1 win 5760 (len 32, hlim 64) So far so good. > 29.10 hubert.ssh > snapcount.33408: P [bad tcp cksum 4f2!] 1:41(40) ack 1 win 5712 (len 72, hlim 64) > 29.30 hubert.ssh > snapcount.33408: P [bad tcp cksum 3bf1!] 1:41(40) ack 1 win 5712 (len 72, hlim 64) > 29.83 hubert.ssh > snapcount.33408: P [bad tcp cksum 23ef!] 1:41(40) ack 1 win 5712 (len 72, hlim 64) > 30.86 hubert.ssh > snapcount.33408: P [bad tcp cksum 23eb!] 1:41(40) ack 1 win 5712 (len 72, hlim 64) These packets all have an identical csum of 0x680d, it is not being updated. So, the SYN/SYNACK/ACK stuff went fine, the initial data however has a wrong checksum. For completeness, I've attached the capture again. 
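The capture may be easier to read with the usual checksum-offload contract in mind. The sketch below is a user-space model, not kernel code, and every function name in it is hypothetical. It assumes that the sender stores the uncomplemented IPv6 pseudo-header sum in the TCP check field and that the device (or a software fallback) then sums the whole segment, seed included, and writes back the complement. Under that assumption an uncomplemented seed reproduces the plain software checksum, while a complemented seed does not. That would be consistent with the fix in the CHECKSUM_HW branch posted in the parallel thread, provided csum_ipv6_magic() returns the complemented sum as its IPv4 counterpart does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add the big-endian 16-bit words of buf to a running 32-bit sum.
 * An odd trailing byte is padded with zero on the right. */
static uint32_t csum_add(uint32_t acc, const uint8_t *buf, size_t len)
{
    while (len > 1) {
        acc += ((uint32_t)buf[0] << 8) | buf[1];
        buf += 2;
        len -= 2;
    }
    if (len)
        acc += (uint32_t)buf[0] << 8;
    return acc;
}

/* Fold the carries back in to get a 16-bit one's-complement sum. */
static uint16_t csum_fold16(uint32_t acc)
{
    while (acc >> 16)
        acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)acc;
}

/* Uncomplemented sum over the IPv6 pseudo-header: source address,
 * destination address, 32-bit upper-layer length, next-header value. */
static uint16_t pseudo_hdr_sum(const uint8_t src[16], const uint8_t dst[16],
                               uint32_t len, uint8_t next_hdr)
{
    uint32_t acc = 0;

    acc = csum_add(acc, src, 16);
    acc = csum_add(acc, dst, 16);
    acc += len >> 16;
    acc += len & 0xffff;
    acc += next_hdr;
    return csum_fold16(acc);
}

/* Reference result: zero the check field (bytes 16..17 of the TCP header),
 * sum pseudo-header plus segment in software, return the complement. */
static uint16_t tcp6_csum_software(const uint8_t src[16], const uint8_t dst[16],
                                   uint8_t *seg, size_t len)
{
    uint32_t acc;

    assert(len >= 20);
    seg[16] = 0;
    seg[17] = 0;
    acc = pseudo_hdr_sum(src, dst, (uint32_t)len, 6);
    acc = csum_add(acc, seg, len);
    return (uint16_t)~csum_fold16(acc);
}

/* Model of the offload step: write the seed into the check field, sum the
 * segment exactly as it stands, return the complement of that sum. */
static uint16_t nic_finish(uint8_t *seg, size_t len, uint16_t seed)
{
    assert(len >= 20);
    seg[16] = (uint8_t)(seed >> 8);
    seg[17] = (uint8_t)(seed & 0xff);
    return (uint16_t)~csum_fold16(csum_add(0, seg, len));
}
```

Calling nic_finish() with the value from pseudo_hdr_sum() should give the same result as tcp6_csum_software(); calling it with the bitwise complement of that value should not, except in the degenerate case where the pseudo-header sum is 0xffff.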
Regards, bert -- http://www.PowerDNS.com Open source, database driven DNS Software http://lartc.org Linux Advanced Routing & Traffic Control HOWTO http://netherlabs.nl Consulting --NzB8fVQJ5HfG6fxh Content-Type: application/octet-stream Content-Disposition: attachment; filename=bad-csum Content-Transfer-Encoding: base64 1MOyoQIABAAAAAAAAAAAANwFAAABAAAAiWR4Pht3AQBeAAAAXgAAAAAIoRnw8ACgzMjyXIbd YAAAAAAoBkAgAQiIEDYAAAIIof/+GfDxIAEIiBA2AAACCKH//hnw8IKAABajKFHSAAAAAKAC FoACJAAAAgQFoAQCCAoAox+fAAAAAAEDAwCJZHg+wHcBAF4AAABeAAAAAKDMyPJcAAihGfDw ht1gAAAAACgGQCABCIgQNgAAAgih//4Z8PAgAQiIEDYAAAIIof/+GfDxABaCgI8Dut2jKFHT oBIWUOuhAAACBAWgBAIICgClzBoAox+fAQMDAIlkeD5KeAEAVgAAAFYAAAAACKEZ8PAAoMzI 8lyG3WAAAAAAIAZAIAEIiBA2AAACCKH//hnw8SABCIgQNgAAAgih//4Z8PCCgAAWoyhR048D ut6AEBaAGiIAAAEBCAoAox+gAKXMGolkeD5XigEAfgAAAH4AAAAAoMzI8lwACKEZ8PCG3WAA AAAASAZAIAEIiBA2AAACCKH//hnw8CABCIgQNgAAAgih//4Z8PEAFoKAjwO63qMoUdOAGBZQ aA0AAAEBCAoApcwfAKMfoFNTSC0xLjk5LU9wZW5TU0hfMy41cDEgRGViaWFuIDE6My41cDEt NQqJZHg+7pkEAH4AAAB+AAAAAKDMyPJcAAihGfDwht1gAAAAAEgGQCABCIgQNgAAAgih//4Z 8PAgAQiIEDYAAAIIof/+GfDxABaCgI8Dut6jKFHTgBgWUGgNAAABAQgKAKXM6ACjH6BTU0gt MS45OS1PcGVuU1NIXzMuNXAxIERlYmlhbiAxOjMuNXAxLTUKiWR4Pv/GDAB+AAAAfgAAAACg zMjyXAAIoRnw8IbdYAAAAABIBkAgAQiIEDYAAAIIof/+GfDwIAEIiBA2AAACCKH//hnw8QAW goCPA7reoyhR04AYFlBoDQAAAQEICgClzwAAox+gU1NILTEuOTktT3BlblNTSF8zLjVwMSBE ZWJpYW4gMTozLjVwMS01CopkeD4pJA0AfgAAAH4AAAAAoMzI8lwACKEZ8PCG3WAAAAAASAZA IAEIiBA2AAACCKH//hnw8CABCIgQNgAAAgih//4Z8PEAFoKAjwO63qMoUdOAGBZQaA0AAAEB CAoApdMAAKMfoFNTSC0xLjk5LU9wZW5TU0hfMy41cDEgRGViaWFuIDE6My41cDEtNQqMZHg+ 0/QJAH4AAAB+AAAAAKDMyPJcAAihGfDwht1gAAAAAEgGQCABCIgQNgAAAgih//4Z8PAgAQiI EDYAAAIIof/+GfDxABaCgI8Dut6jKFHTgBgWUGgNAAABAQgKAKXaAACjH6BTU0gtMS45OS1P cGVuU1NIXzMuNXAxIERlYmlhbiAxOjMuNXAxLTUKj2R4PkXxDgB+AAAAfgAAAACgzMjyXAAI oRnw8IbdYAAAAABIBkAgAQiIEDYAAAIIof/+GfDwIAEIiBA2AAACCKH//hnw8QAWgoCPA7re oyhR04AYFlBoDQAAAQEICgCl5wAAox+gU1NILTEuOTktT3BlblNTSF8zLjVwMSBEZWJpYW4g 
MTozLjVwMS01CpZkeD6mqAkAfgAAAH4AAAAAoMzI8lwACKEZ8PCG3WAAAAAASAZAIAEIiBA2 AAACCKH//hnw8CABCIgQNgAAAgih//4Z8PEAFoKAjwO63qMoUdOAGBZQaA0AAAEBCAoApgEA AKMfoFNTSC0xLjk5LU9wZW5TU0hfMy41cDEgRGViaWFuIDE6My41cDEtNQo= --NzB8fVQJ5HfG6fxh-- From garzik@gtf.org Thu Mar 20 08:57:06 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 08:57:12 -0800 (PST) Received: from havoc.gtf.org (havoc.daloft.com [64.213.145.173]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KGuPq9002397 for ; Thu, 20 Mar 2003 08:57:05 -0800 Received: by havoc.gtf.org (Postfix, from userid 500) id A6DD96654; Thu, 20 Mar 2003 16:56:18 +0000 (US/Central) Date: Thu, 20 Mar 2003 11:56:18 -0500 From: Jeff Garzik To: Shmulik Hen Cc: Bonding Developement list , Bonding Announce list , Linux Net Mailing list , Linux Kernel Mailing list , Oss SGI Netdev list Subject: Re: [patch] (0/8) Adding 802.3ad support to bonding Message-ID: <20030320165618.GB8256@gtf.org> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.3.28i X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 1999 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Content-Length: 300 Lines: 15 I (and many others) will be going over these patches. I also see that somebody (davem?) applied your divide-by-zero patch to the mainline kernel. My initial comment is that we will want to work to eliminate these ifdefs. Other comments will follow. Thanks to Intel for these efforts! 
Jeff From ahu@outpost.ds9a.nl Thu Mar 20 09:01:32 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 09:01:35 -0800 (PST) Received: from outpost.ds9a.nl (postfix@outpost.ds9a.nl [213.244.168.210]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KH1Lq9003191 for ; Thu, 20 Mar 2003 09:01:22 -0800 Received: by outpost.ds9a.nl (Postfix, from userid 1000) id DEC313FDD; Thu, 20 Mar 2003 17:31:07 +0100 (CET) Date: Thu, 20 Mar 2003 17:31:07 +0100 From: bert hubert To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Subject: Re: TCP/IPv6 broken in Linux 2.5.64? Message-ID: <20030320163107.GA28229@outpost.ds9a.nl> Mail-Followup-To: bert hubert , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com References: <20030318162532.GA9705@outpost.ds9a.nl> <200303201549.SAA10972@sex.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200303201549.SAA10972@sex.inr.ac.ru> User-Agent: Mutt/1.3.28i X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 2000 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ahu@ds9a.nl Precedence: bulk X-list: netdev Content-Length: 687 Lines: 21 On Thu, Mar 20, 2003 at 06:49:49PM +0300, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > irc servers, or an IPv6 zonetransfer. However, when I try to ssh from 2.5.65 > > Try this. I have just found this lost patch, it is from 2.4 tree, but > it should fit to 2.5 as well. This has solved my problem 100%. The earlier mentioned hostA and hostB can now cheerfully connect to each other. Everything else I try works too. So I suggest this be sent linus-wards. Thanks!
Regards, bert -- http://www.PowerDNS.com Open source, database driven DNS Software http://lartc.org Linux Advanced Routing & Traffic Control HOWTO http://netherlabs.nl Consulting From hshmulik@intel.com Thu Mar 20 09:57:59 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 09:58:06 -0800 (PST) Received: from caduceus.fm.intel.com (fmr02.intel.com [192.55.52.25]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KHvwq9003999 for ; Thu, 20 Mar 2003 09:57:59 -0800 Received: from talaria.fm.intel.com (talaria.fm.intel.com [10.1.192.39]) by caduceus.fm.intel.com (8.11.6/8.11.6/d: outer.mc,v 1.51 2002/09/23 20:43:23 dmccart Exp $) with ESMTP id h2KHpH621838 for ; Thu, 20 Mar 2003 17:51:21 GMT Received: from fmsmsxv040-1.fm.intel.com (fmsmsxvs040.fm.intel.com [132.233.42.124]) by talaria.fm.intel.com (8.11.6/8.11.6/d: inner.mc,v 1.28 2003/01/13 19:44:39 dmccart Exp $) with SMTP id h2KFEpc05915 for ; Thu, 20 Mar 2003 15:14:51 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxv040-1.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003032007133815638 ; Thu, 20 Mar 2003 07:13:40 -0800 Date: Thu, 20 Mar 2003 17:13:23 +0200 (IST) From: Shmulik Hen X-X-Sender: hshmulik@jrslxjul4.npdj.intel.com To: Bonding Developement list , Bonding Announce list , Linux Net Mailing list , Linux Kernel Mailing list , Oss SGI Netdev list , Jeff Garzik Subject: [Bonding][patch set] - Adding IEEE 802.3ad Dynamic link aggregation support Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 2001 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hshmulik@intel.com Precedence: bulk X-list: netdev Content-Length: 1228 Lines: 31 Hello, The following set of 7(+2) patches add support for 802.3ad link aggregation mode on top of the latest release of bonding from source-forge (2.4.20-20030317). 
They also include a set of bug fixes for problems discovered during the past several weeks of extensive testing by our QA group. This comes as one of several enhancements Intel has decided to contribute to the open source community. This code is ported from our iANS product which has been around for some time. We are in the process of porting our advanced networking features from iANS to the bonding driver. In future releases we plan to add more features and improvements and to adapt the code for 2.5.x kernels. The first 2 patches add support for point-to-point protocols to the 2.4.20/2.4.21-pre5 kernels in the net subtree, and are a prerequisite for the 802.3ad feature. The following patches only modify the bonding files. -- | Shmulik Hen | | Israel Design Center (Jerusalem) | | LAN Access Division | | Intel Communications Group, Intel corp. | | | | Anti-Spam: shmulik dot hen at intel dot com | From erik@hensema.net Thu Mar 20 10:10:51 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 10:10:54 -0800 (PST) Received: from dexter.hensema.net (cc78409-a.hnglo1.ov.home.nl [212.120.97.185]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KIA7q9004459 for ; Thu, 20 Mar 2003 10:10:50 -0800 Received: from bender.home.hensema.net (bender.ipv6.hensema.net [IPv6:2001:888:10a1:0:202:44ff:fe69:60f5]) by dexter.hensema.net (8.12.3/8.12.3) with ESMTP id h2KIA5PA009275; Thu, 20 Mar 2003 19:10:05 +0100 Received: from bender.home.hensema.net (localhost [127.0.0.1]) by bender.home.hensema.net (8.12.3/8.12.3) with ESMTP id h2KIA5dN019976; Thu, 20 Mar 2003 19:10:05 +0100 Received: (from erik@localhost) by bender.home.hensema.net (8.12.3/8.12.3/Submit) id h2KIA50J019975; Thu, 20 Mar 2003 19:10:05 +0100 Date: Thu, 20 Mar 2003 19:10:05 +0100 From: Erik Hensema To: bert hubert Cc: netdev@oss.sgi.com Subject: Re: TCP/IPv6 broken in Linux 2.5.64?
Message-ID: <20030320181004.GA19970@hensema.net> Reply-To: erik@hensema.net References: <20030318162532.GA9705@outpost.ds9a.nl> <200303201549.SAA10972@sex.inr.ac.ru> <20030320163107.GA28229@outpost.ds9a.nl> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030320163107.GA28229@outpost.ds9a.nl> User-Agent: Mutt/1.3.27i X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 2002 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: erik@hensema.net Precedence: bulk X-list: netdev Content-Length: 644 Lines: 18 On Thu, Mar 20, 2003 at 05:31:07PM +0100, bert hubert wrote: > On Thu, Mar 20, 2003 at 06:49:49PM +0300, kuznet@ms2.inr.ac.ru wrote: > > Hello! > > > > > irc servers, or an IPv6 zonetransfer. However, when I try to ssh from 2.5.65 > > > > Try this. I have just found this lost patch, it is from 2.4 tree, but > > it should fit to 2.5 as well. > > This has solved my problem 100%. The earlier mentioned hostA and hostB can > now cheerfully connect to each other. Everything else I try works too. > > So I suggest this be sent linus-wards. Thanks! I can confirm that this fixed my IPv6 problems on 2.5.x.
-- Erik Hensema (erik@hensema.net) From joern@wohnheim.fh-wedel.de Thu Mar 20 14:33:40 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 14:33:45 -0800 (PST) Received: from wohnheim.fh-wedel.de (wohnheim.fh-wedel.de [195.37.86.122]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KMXdq9023743 for ; Thu, 20 Mar 2003 14:33:40 -0800 Received: from joern by wohnheim.fh-wedel.de with local (Exim 3.35 #1 (Debian)) id 18w8bJ-0005iA-00; Thu, 20 Mar 2003 23:33:29 +0100 Date: Thu, 20 Mar 2003 23:33:29 +0100 From: =?iso-8859-1?Q?J=F6rn?= Engel To: linux-kernel@vger.kernel.org Cc: netdev@oss.sgi.com, acme@conectiva.com.br Subject: [PATCH] clean up net/802/Makefile (small version) Message-ID: <20030320223329.GB13641@wohnheim.fh-wedel.de> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit User-Agent: Mutt/1.3.28i X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 2003 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: joern@wohnheim.fh-wedel.de Precedence: bulk X-list: netdev Content-Length: 653 Lines: 27 This patch simply removes a couple of lines with duplicated functionality. Patch is against 2.4.20. Arnaldo, are you the correct maintainer for this? Jörn -- Victory in war is not repetitious. 
-- Sun Tzu --- linux-2.4.20/net/802/Makefile Sat Aug 3 02:39:46 2002 +++ linux-2.4.20/net/802/Makefile.1 Thu Mar 20 23:20:05 2003 @@ -15,13 +15,9 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_802.o obj-$(CONFIG_LLC) += llc_sendpdu.o llc_utility.o cl2llc.o llc_macinit.o -ifeq ($(CONFIG_SYSCTL),y) -obj-y += sysctl_net_802.o -endif ifeq ($(CONFIG_LLC),y) subdir-y += transit -obj-y += llc_sendpdu.o llc_utility.o cl2llc.o llc_macinit.o SNAP = y endif From joern@wohnheim.fh-wedel.de Thu Mar 20 14:35:48 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 14:35:51 -0800 (PST) Received: from wohnheim.fh-wedel.de (wohnheim.fh-wedel.de [195.37.86.122]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KMZlq9023908 for ; Thu, 20 Mar 2003 14:35:48 -0800 Received: from joern by wohnheim.fh-wedel.de with local (Exim 3.35 #1 (Debian)) id 18w8dS-0007EF-00; Thu, 20 Mar 2003 23:35:42 +0100 Date: Thu, 20 Mar 2003 23:35:42 +0100 From: =?iso-8859-1?Q?J=F6rn?= Engel To: linux-kernel@vger.kernel.org Cc: netdev@oss.sgi.com, acme@conectiva.com.br Subject: Re: [PATCH] clean up net/802/Makefile (large version) Message-ID: <20030320223542.GC13641@wohnheim.fh-wedel.de> References: <20030320223329.GB13641@wohnheim.fh-wedel.de> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20030320223329.GB13641@wohnheim.fh-wedel.de> User-Agent: Mutt/1.3.28i X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 2004 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: joern@wohnheim.fh-wedel.de Precedence: bulk X-list: netdev Content-Length: 1420 Lines: 77 This one tries to clean up the other code as well. Jörn -- The cheapest, fastest and most reliable components of a computer system are those that aren't there. 
-- Gordon Bell, DEC laboratories

--- linux-2.4.20/net/802/Makefile Sat Aug 3 02:39:46 2002
+++ linux-2.4.20/net/802/Makefile.2 Thu Mar 20 23:26:22 2003
@@ -11,48 +11,26 @@
 export-objs = llc_macinit.o p8022.o psnap.o
+snap-objs = p8022.o psnap.o
+
 obj-y = p8023.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_802.o
-obj-$(CONFIG_LLC) += llc_sendpdu.o llc_utility.o cl2llc.o llc_macinit.o
-ifeq ($(CONFIG_SYSCTL),y)
-obj-y += sysctl_net_802.o
-endif
-
-ifeq ($(CONFIG_LLC),y)
-subdir-y += transit
-obj-y += llc_sendpdu.o llc_utility.o cl2llc.o llc_macinit.o
-SNAP = y
-endif
-
-ifdef CONFIG_TR
-obj-y += tr.o
- SNAP=y
-endif
-
-ifdef CONFIG_NET_FC
-obj-y += fc.o
-endif
-
-ifdef CONFIG_FDDI
-obj-y += fddi.o
-endif
-
-ifdef CONFIG_HIPPI
-obj-y += hippi.o
-endif
-
-ifdef CONFIG_IPX
- SNAP=y
-endif
-
-ifdef CONFIG_ATALK
- SNAP=y
-endif
-
-ifeq ($(SNAP),y)
-obj-y += p8022.o psnap.o
-endif
+obj-$(CONFIG_LLC) += llc_sendpdu.o llc_utility.o cl2llc.o llc_macinit.o $(snap-objs)
+
+subdir-$(CONFIG_LLC) += transit
+
+obj-$(CONFIG_TR) += tr.o $(snap-objs)
+
+obj-$(CONFIG_NET_FC) += fc.o
+
+obj-$(CONFIG_FDDI) += fddi.o
+
+obj-$(CONFIG_HIPPI) += hippi.o
+
+obj-$(CONFIG_IPX) += $(snap-objs)
+
+obj-$(CONFIG_ATALK) += $(snap-objs)
 include $(TOPDIR)/Rules.make

From fubar@us.ibm.com Thu Mar 20 14:54:41 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 14:54:46 -0800 (PST) Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.132]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2KMrsq9024540 for ; Thu, 20 Mar 2003 14:54:41 -0800 Received: from westrelay02.boulder.ibm.com (westrelay02.boulder.ibm.com [9.17.195.11]) by e34.co.us.ibm.com (8.12.8/8.12.2) with ESMTP id h2KMrY64075902; Thu, 20 Mar 2003 17:53:34 -0500 Received: from d03nm121.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by westrelay02.boulder.ibm.com (8.12.8/NCO/VER6.5) with ESMTP id h2KMrYEc084888; Thu, 20 Mar 2003 15:53:34 -0700 Importance: Normal Sensitivity: Subject: Re: [Bonding-devel] [patch] (2/8) Add
802.3ad support to bonding (released to bonding on sourceforge) To: Shmulik Hen Cc: Bonding Developement list , Bonding Announce list , Linux Net Mailing list , Linux Kernel Mailing list , Oss SGI Netdev list , Jeff Garzik X-Mailer: Lotus Notes Release 5.0.7 March 21, 2001 Message-ID: From: Jay Vosburgh Date: Thu, 20 Mar 2003 14:53:14 -0800 X-MIMETrack: Serialize by Router on D03NM121/03/M/IBM(Release 6.0 [IBM]|December 16, 2002) at 03/20/2003 15:53:34 MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 2005 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: fubar@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 315 Lines: 14 I have incorporated Shmulik Hen's bug fix patches to bonding (patch numbers 2 and 3) into the current code and released the new patch to sourceforge.net/projects/bonding. The current bonding update is bonding-2.4.20-20030320. The only changes I made were minor spelling / formatting fixes. 
-J From davem@redhat.com Thu Mar 20 16:11:23 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 16:11:31 -0800 (PST) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2L0Agq9001959 for ; Thu, 20 Mar 2003 16:11:23 -0800 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id QAA14618; Thu, 20 Mar 2003 16:08:46 -0800 Date: Thu, 20 Mar 2003 16:08:45 -0800 (PST) Message-Id: <20030320.160845.121240938.davem@redhat.com> To: fubar@us.ibm.com Cc: hshmulik@intel.com, bonding-devel@lists.sourceforge.net, bonding-announce@lists.sourceforge.net, linux-net@vger.kernel.org, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, jgarzik@pobox.com Subject: Re: [Bonding-devel] [patch] (2/8) Add 802.3ad support to bonding (released to bonding on sourceforge) From: "David S. Miller" In-Reply-To: References: X-FalunGong: Information control. X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 2006 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 830 Lines: 19 From: Jay Vosburgh Date: Thu, 20 Mar 2003 14:53:14 -0800 I have incorporated Shmulik Hen's bug fix patches to bonding (patch numbers 2 and 3) into the current code and released the new patch to sourceforge.net/projects/bonding. The current bonding update is bonding-2.4.20-20030320. The only changes I made were minor spelling / formatting fixes. So when do these changes end up being sent to myself or Jeff for mainline inclusion? 
I have no objection to the sourceforge project for bonding, but I do object to there being such latency between what the sourceforge tree has (especially bug fixes) and what gets submitted into the mainline. Personally, I'd prefer that all development occur in the mainline tree. That gives you testing coverage that is impossible otherwise. From toml@us.ibm.com Thu Mar 20 16:40:22 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 16:40:56 -0800 (PST) Received: from e3.ny.us.ibm.com (e3.ny.us.ibm.com [32.97.182.103]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2L0dWq9003027 for ; Thu, 20 Mar 2003 16:40:22 -0800 Received: from northrelay01.pok.ibm.com (northrelay01.pok.ibm.com [9.56.224.149]) by e3.ny.us.ibm.com (8.12.8/8.12.2) with ESMTP id h2L0cskD137186; Thu, 20 Mar 2003 19:38:54 -0500 Received: from tomlt2.austin.ibm.com (tomlt2.austin.ibm.com [9.41.94.20]) by northrelay01.pok.ibm.com (8.12.8/NCO/VER6.5) with ESMTP id h2L0cprD116838; Thu, 20 Mar 2003 19:38:51 -0500 Subject: [PATCH] IPSec: IPV6_IPSEC_POLICY / IPV6_XFRM_POLICY socket options From: Tom Lendacky To: netdev@oss.sgi.com Cc: davem@redhat.com, kuznet@ms2.inr.ac.ru, toml@us.ibm.com Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.8 (1.0.8-10) Date: 20 Mar 2003 18:40:16 -0600 Message-Id: <1048207217.1212.3.camel@tomlt2.tomloffice.austin.ibm.com> Mime-Version: 1.0 X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 2007 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: toml@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 3165 Lines: 114 I've created a patch to fix the problem of racoon not being able to listen on IPv6 addresses. The problem occurs from not having support for the IP(V6)_IPSEC_POLICY and IP(V6)_XFRM_POLICY socket options in IPv6. Please review the patch below and let me know if my fix is ok. 
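For readers skimming the thread, the per-family option check that the patch introduces can be sketched in plain C. This is only an illustrative userspace sketch under stated assumptions, not the kernel code: the sketch_* helper names and SKETCH_* constants are made up here, the values 16, 34 and 35 are taken from this mail and its in6.h hunk, and 17 for the IPv4 XFRM option is an assumption.

```c
#include <errno.h>
#include <sys/socket.h>   /* AF_INET, AF_INET6 */

/* Option numbers used by this sketch. 16, 34 and 35 come from the mail
 * and the in6.h hunk; 17 for the IPv4 XFRM option is an assumption. */
enum {
    SKETCH_IP_IPSEC_POLICY   = 16,
    SKETCH_IP_XFRM_POLICY    = 17,
    SKETCH_IPV6_IPSEC_POLICY = 34,
    SKETCH_IPV6_XFRM_POLICY  = 35
};

/* Hypothetical helper: accept opt only if it is the policy option that
 * belongs to the socket's address family.
 * Returns 0 on a match, -EOPNOTSUPP for the wrong option on a known
 * family, and -EINVAL for any other family. */
int sketch_check_policy_opt(int family, int opt, int want_v4, int want_v6)
{
    switch (family) {
    case AF_INET:
        return (opt == want_v4) ? 0 : -EOPNOTSUPP;
    case AF_INET6:
        return (opt == want_v6) ? 0 : -EOPNOTSUPP;
    default:
        return -EINVAL;
    }
}

/* Check for the PF_KEY style policy option (IP_IPSEC_POLICY family). */
int sketch_check_ipsec_opt(int family, int opt)
{
    return sketch_check_policy_opt(family, opt,
                                   SKETCH_IP_IPSEC_POLICY,
                                   SKETCH_IPV6_IPSEC_POLICY);
}

/* Check for the xfrm_user style policy option (IP_XFRM_POLICY family). */
int sketch_check_xfrm_opt(int family, int opt)
{
    return sketch_check_policy_opt(family, opt,
                                   SKETCH_IP_XFRM_POLICY,
                                   SKETCH_IPV6_XFRM_POLICY);
}
```

In this sketch an AF_INET6 caller passing 34 is accepted, while the same caller passing the IPv4 value 16 gets -EOPNOTSUPP, which is why a userspace program compiled with the wrong #define would still be rejected.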
Additionally, for those wanting to run racoon you will have to update the sockmisc.c file. You will need to change the #define of IPV6_IPSEC_POLICY to use the value 34 and not 16 (which is the IP_IPSEC_POLICY value). This will allow racoon to listen on an IPv6 address, but I'm still not having luck getting racoon working over IPv6.

Thanks,
Tom

diff -ur linux-2.5.65-orig/include/linux/in6.h linux-2.5.65/include/linux/in6.h
--- linux-2.5.65-orig/include/linux/in6.h 2003-03-17 15:44:11.000000000 -0600
+++ linux-2.5.65/include/linux/in6.h 2003-03-20 10:51:33.000000000 -0600
@@ -176,5 +176,8 @@
 #define IPV6_FLOWLABEL_MGR 32
 #define IPV6_FLOWINFO_SEND 33
+#define IPV6_IPSEC_POLICY 34
+#define IPV6_XFRM_POLICY 35
+
 #endif
diff -ur linux-2.5.65-orig/net/ipv4/xfrm_user.c linux-2.5.65/net/ipv4/xfrm_user.c
--- linux-2.5.65-orig/net/ipv4/xfrm_user.c 2003-03-17 15:44:08.000000000 -0600
+++ linux-2.5.65/net/ipv4/xfrm_user.c 2003-03-20 09:24:53.000000000 -0600
@@ -1080,10 +1080,26 @@
         struct xfrm_policy *xp;
         int nr;
-        if (opt != IP_XFRM_POLICY) {
-                *dir = -EOPNOTSUPP;
+        switch (family) {
+        case AF_INET:
+                if (opt != IP_XFRM_POLICY) {
+                        *dir = -EOPNOTSUPP;
+                        return NULL;
+                }
+                break;
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+        case AF_INET6:
+                if (opt != IPV6_XFRM_POLICY) {
+                        *dir = -EOPNOTSUPP;
+                        return NULL;
+                }
+                break;
+#endif
+        default:
+                *dir = -EINVAL;
                 return NULL;
         }
+
         *dir = -EINVAL;
         if (len < sizeof(*p) ||
diff -ur linux-2.5.65-orig/net/ipv6/ipv6_sockglue.c linux-2.5.65/net/ipv6/ipv6_sockglue.c
--- linux-2.5.65-orig/net/ipv6/ipv6_sockglue.c 2003-03-17 15:43:39.000000000 -0600
+++ linux-2.5.65/net/ipv6/ipv6_sockglue.c 2003-03-20 10:07:46.000000000 -0600
@@ -47,6 +47,7 @@
 #include
 #include
 #include
+#include
 #include
@@ -386,6 +387,10 @@
         case IPV6_FLOWLABEL_MGR:
                 retv = ipv6_flowlabel_opt(sk, optval, optlen);
                 break;
+        case IPV6_IPSEC_POLICY:
+        case IPV6_XFRM_POLICY:
+                retv = xfrm_user_policy(sk, optname, optval, optlen);
+                break;
 #ifdef CONFIG_NETFILTER
         default:
diff -ur linux-2.5.65-orig/net/key/af_key.c linux-2.5.65/net/key/af_key.c
--- linux-2.5.65-orig/net/key/af_key.c 2003-03-17 15:43:49.000000000 -0600
+++ linux-2.5.65/net/key/af_key.c 2003-03-20 16:25:10.000000000 -0600
@@ -2415,8 +2415,23 @@
         struct xfrm_policy *xp;
         struct sadb_x_policy *pol = (struct sadb_x_policy*)data;
-        if (opt != IP_IPSEC_POLICY) {
-                *dir = -EOPNOTSUPP;
+        switch (family) {
+        case AF_INET:
+                if (opt != IP_IPSEC_POLICY) {
+                        *dir = -EOPNOTSUPP;
+                        return NULL;
+                }
+                break;
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+        case AF_INET6:
+                if (opt != IPV6_IPSEC_POLICY) {
+                        *dir = -EOPNOTSUPP;
+                        return NULL;
+                }
+                break;
+#endif
+        default:
+                *dir = -EINVAL;
                 return NULL;
         }

From fubar@us.ibm.com Thu Mar 20 16:44:19 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 20 Mar 2003 16:44:22 -0800 (PST) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.133]) by oss.sgi.com (8.12.8/8.12.5) with SMTP id h2L0iHq9003374 for ; Thu, 20 Mar 2003 16:44:18 -0800 Received: from westrelay04.boulder.ibm.com (westrelay04.boulder.ibm.com [9.17.193.32]) by e35.co.us.ibm.com (8.12.8/8.12.2) with ESMTP id h2L0i5gJ021158; Thu, 20 Mar 2003 19:44:05 -0500 Received: from d03nm121.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by westrelay04.boulder.ibm.com (8.12.8/NCO/VER6.5) with ESMTP id h2L0i3sf090166; Thu, 20 Mar 2003 17:44:04 -0700 Importance: Normal Sensitivity: Subject: Re: [Bonding-devel] [patch] (2/8) Add 802.3ad support to bonding (released to bonding on sourceforge) To: "David S.
Miller" Cc: hshmulik@intel.com, bonding-devel@lists.sourceforge.net, bonding-announce@lists.sourceforge.net, linux-net@vger.kernel.org, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, jgarzik@pobox.com X-Mailer: Lotus Notes Release 5.0.7 March 21, 2001 Message-ID: From: Jay Vosburgh Date: Thu, 20 Mar 2003 16:43:52 -0800 X-MIMETrack: Serialize by Router on D03NM121/03/M/IBM(Release 6.0 [IBM]|December 16, 2002) at 03/20/2003 17:44:04 MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII X-Spam-Checker-Version: SpamAssassin 2.50 (1.173-2003-02-20-exp) X-archive-position: 2008 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: fubar@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 969 Lines: 31 >So when do these changes end up being sent to myself or >Jeff for mainline inclusion? > >I have no objection to the sourceforge project for bonding, but >I do object to there being such latency between what the sourceforge >tree has (especially bug fixes) and what gets submitted into the >mainline. > >Personally, I'd prefer that all development occur in the mainline >tree. That gives you testing coverage that is impossible otherwise. Fair enough; the delay has gotten excessive of late. Would it be satisfactory going forward for the sourceforge site to contain patches to "standard" releases (e.g., 2.4.20), and do updates to the current development kernel and the sourceforge site simultaneously? In other words, sourceforge has a patch containing all bonding updates since 2.4.20 (or whichever version) was released, and each time that patch is updated, the incremental update goes out for inclusion in the development kernel. -J From jgarzik@pobox.com Thu Mar 20 16:56:45 2003 Received: with ECARTIS (v1.0.0; l