From owner-netdev@oss.sgi.com Thu Aug 1 06:34:06 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g71DY6Rw032765 for ; Thu, 1 Aug 2002 06:34:06 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g71DY5cv032764 for netdev-outgoing; Thu, 1 Aug 2002 06:34:05 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from serwus.bnet.pl (serwus.bnet.pl [217.97.249.1]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g71DXcRw032754 for ; Thu, 1 Aug 2002 06:33:40 -0700 Received: by serwus.bnet.pl (Postfix, from userid 507) id 21CFCF04F; Thu, 1 Aug 2002 15:35:07 +0200 (CEST) Date: Thu, 1 Aug 2002 15:35:06 +0200 From: Jacek Konieczny To: maxk@qualcomm.com, vtun@office.satix.net Cc: netdev@oss.sgi.com, underley@underley.eu.org, linux-kernel@vger.kernel.org Subject: "new style" netdevice allocation patch for TUN driver (2.4.18 kernel) Message-ID: <20020801133506.GA22073@serwus.bnet.pl> Reply-To: jajcus@bnet.pl Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.25i X-Spam-Status: No, hits=-5.0 required=5.0 tests=UNIFIED_PATCH version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk I had a lot of problem with tun devices created with both openvpn and vtund. When I wanted to shut down my system when the devices were in use (eg. TCP connection established on tun0 interface), even if the tunneling daemon was killed, it stopped while trying to deconfigure network. And "unregister_netdevice: waiting for tun0 to become free" message was displayed again and again. I tried to resolve this problem using Google, but I have only found out, that this is behaviour of 2.4 kernels, and that it is proper. After further investigation, in kernel sources, I found out, that there are "old style" and "new style" network devices, and that only the "old style" devices have this problem. 
I had similar problem with VLAN devices some time ago, so I checked VLAN driver sources too. As I suspected, it was "new style" device now. The patch below is my try to make tun device "new style" too. It seems to work for me, but I am not sure if it is 100% proper. This is patch against 2.4.18 sources. Sorry, for spamming all those addresses, but I am not sure which one is correct. Driver on URL given in MAINTAINERS file seems to be a bit outdated. Greets, Jacek --- linux/drivers/net/tun.c.orig Sun Sep 30 21:26:07 2001 +++ linux/drivers/net/tun.c Thu Aug 1 14:41:12 2002 @@ -20,6 +20,14 @@ * Modifications for 2.3.99-pre5 kernel. */ +/* + * 01.08.2002 + * Jacek Konieczny + * Modifications for "new style" device allocation + * (fixes "wating for tunX to become free" problem) + */ + + #define TUN_VER "1.4" #include @@ -159,6 +167,17 @@ return 0; } +void tun_net_destruct(struct net_device *dev) +{ + if (dev) { + if (dev->priv) { + kfree(dev->priv); + dev->priv=NULL; + MOD_DEC_USE_COUNT; + } + } +} + /* Character device part */ /* Poll */ @@ -204,14 +223,14 @@ skb_reserve(skb, 2); copy_from_user(skb_put(skb, len), ptr, len); - skb->dev = &tun->dev; + skb->dev = tun->dev; switch (tun->flags & TUN_TYPE_MASK) { case TUN_TUN_DEV: skb->mac.raw = skb->data; skb->protocol = pi.proto; break; case TUN_TAP_DEV: - skb->protocol = eth_type_trans(skb, &tun->dev); + skb->protocol = eth_type_trans(skb, tun->dev); break; }; @@ -310,7 +329,7 @@ schedule(); continue; } - netif_start_queue(&tun->dev); + netif_start_queue(tun->dev); if (!verify_area(VERIFY_WRITE, buf, count)) ret = tun_put_user(tun, skb, buf, count); @@ -357,8 +376,6 @@ init_waitqueue_head(&tun->read_wait); tun->owner = -1; - tun->dev.init = tun_net_init; - tun->dev.priv = tun; err = -EINVAL; @@ -377,17 +394,21 @@ if (*ifr->ifr_name) name = ifr->ifr_name; - if ((err = dev_alloc_name(&tun->dev, name)) < 0) - goto failed; - if ((err = register_netdevice(&tun->dev))) - goto failed; - - MOD_INC_USE_COUNT; + dev = 
dev_alloc(name, &err); + if (!dev) goto failed; + + tun->dev=dev; + dev->init = tun_net_init; + dev->priv = tun; + dev->destructor = tun_net_destruct; + dev->features |= NETIF_F_DYNALLOC; + tun->name = dev->name; - tun->name = tun->dev.name; - } + err=register_netdevice(dev); + if (err<0) goto failed; - DBG(KERN_INFO "%s: tun_set_iff\n", tun->name); + MOD_INC_USE_COUNT; + } if (ifr->ifr_flags & IFF_NO_PI) tun->flags |= TUN_NO_PI; @@ -402,6 +423,7 @@ return 0; failed: + kfree(dev); kfree(tun); return err; } @@ -532,10 +554,8 @@ skb_queue_purge(&tun->readq); if (!(tun->flags & TUN_PERSIST)) { - dev_close(&tun->dev); - unregister_netdevice(&tun->dev); - kfree(tun); - MOD_DEC_USE_COUNT; + dev_close(tun->dev); + unregister_netdevice(tun->dev); } rtnl_unlock(); --- linux/include/linux/if_tun.h.orig Tue Jun 12 04:15:27 2001 +++ linux/include/linux/if_tun.h Thu Aug 1 14:33:40 2002 @@ -40,7 +40,7 @@ wait_queue_head_t read_wait; struct sk_buff_head readq; - struct net_device dev; + struct net_device *dev; struct net_device_stats stats; struct fasync_struct *fasync; From owner-netdev@oss.sgi.com Thu Aug 1 07:03:21 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g71E3KRw002823 for ; Thu, 1 Aug 2002 07:03:20 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g71E3KN0002822 for netdev-outgoing; Thu, 1 Aug 2002 07:03:20 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from servidor.linux-ha.org (cpe-24-221-212-80.co.sprintbbd.net [24.221.212.80]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g71E3BRw002812 for ; Thu, 1 Aug 2002 07:03:12 -0700 Received: from unix.sh (localhost [127.0.0.1]) by servidor.linux-ha.org (Postfix on SuSE Linux 8.0 (i386)) with ESMTP id C54E5307F3; Thu, 1 Aug 2002 08:04:21 -0600 (MDT) Message-ID: <3D493FE5.4090102@unix.sh> Date: Thu, 01 Aug 2002 08:04:21 -0600 From: Alan Robertson Organization: IBM Linux 
Technology Center
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204
X-Accept-Language: en-us
MIME-Version: 1.0
To: netdev@oss.sgi.com
Cc: Kevin Dwyer
Subject: SIOCGIFBRDADDR?
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, hits=1.8 required=5.0 tests=SUBJ_ENDS_IN_Q_MARK,SUBJ_ALL_CAPS version=2.20
X-Spam-Level: *
Sender: owner-netdev@oss.sgi.com
Precedence: bulk

Hi,

Some folks using the linux-ha (heartbeat) software have been running into
some network-related problems, and Alan Cox suggested that you folks could
help us out.

One of the ways you can configure linux-ha protocols is to send broadcast
packets. In this case, our software needs to get the broadcast address from
the OS, so we can proceed. We currently use the SIOCGIFBRDADDR ioctl to do
this for us.

This has worked pretty well for us for quite some time, but recently some
users have run into some problems. It appears that if one configures an
interface using ifconfig, then the SIOCGIFBRDADDR ioctl works quite nicely
to determine the broadcast address of an interface. However, if one uses
the iproute2 tools to configure the interface, then it appears that the
SIOCGIFBRDADDR ioctl returns bogus results on that interface.

Is this supposed to be the case? Is there a different method we're supposed
to use instead of SIOCGIFBRDADDR?

Thanks!
-- 
    Alan Robertson alanr@unix.sh
    http://linux-ha.org/

From owner-netdev@oss.sgi.com Thu Aug 1 10:43:03 2002
Received: from oss.sgi.com (localhost [127.0.0.1])
    by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g71Hh2Rw008377
    for ; Thu, 1 Aug 2002 10:43:02 -0700
Received: (from majordomo@localhost)
    by oss.sgi.com (8.12.5/8.12.3/Submit) id g71Hh2II008376
    for netdev-outgoing; Thu, 1 Aug 2002 10:43:02 -0700
X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f
Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165])
    by oss.sgi.com (8.12.5/8.12.5) with SMTP id g71HgrRw008366
    for ; Thu, 1 Aug 2002 10:42:54 -0700
Received: (from kuznet@localhost)
    by sex.inr.ac.ru (8.6.13/ANK) id VAA07150;
    Thu, 1 Aug 2002 21:43:23 +0400
From: kuznet@ms2.inr.ac.ru
Message-Id: <200208011743.VAA07150@sex.inr.ac.ru>
Subject: Re: "new style" netdevice allocation patch for TUN driver (2.4.18 kernel)
To: jajcus@bnet.pl
Date: Thu, 1 Aug 2002 21:43:23 +0400 (MSD)
Cc: netdev@oss.sgi.com
In-Reply-To: <20020801133506.GA22073@serwus.bnet.pl> from "Jacek Konieczny" at Aug 1, 2 05:45:01 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20
X-Spam-Level:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk

Hello!

> network. And "unregister_netdevice: waiting for tun0 to become free"

This smells like a leakage. I think this should be investigated.

> devices, and that only the "old style" devices have this problem.
> I had similar problem with VLAN devices some time ago, so I checked VLAN
> driver sources too.

Well, switching to "new style" does not solve the problem. If we have
a leakage, it will continue silently, which is even worse.

You should probably define NET_REFCNT_DEBUG in net/core/dev.c
to track what happens with the device.
Alexey

From owner-netdev@oss.sgi.com Thu Aug 1 10:51:02 2002
Received: from oss.sgi.com (localhost [127.0.0.1])
    by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g71Hp2Rw008538
    for ; Thu, 1 Aug 2002 10:51:02 -0700
Received: (from majordomo@localhost)
    by oss.sgi.com (8.12.5/8.12.3/Submit) id g71Hp2kR008537
    for netdev-outgoing; Thu, 1 Aug 2002 10:51:02 -0700
X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f
Received: from nic.bnet.pl (komp42@aleja.bmj.net.pl [195.82.161.242])
    by oss.sgi.com (8.12.5/8.12.5) with SMTP id g71HotRw008528
    for ; Thu, 1 Aug 2002 10:50:56 -0700
Received: by nic.bnet.pl (Postfix, from userid 500)
    id DB0A6D004; Thu, 1 Aug 2002 19:52:06 +0200 (CEST)
Date: Thu, 1 Aug 2002 19:52:06 +0200
From: Jacek Konieczny
To: kuznet@ms2.inr.ac.ru
Cc: netdev@oss.sgi.com
Subject: Re: "new style" netdevice allocation patch for TUN driver (2.4.18 kernel)
Message-ID: <20020801175204.GC17308@nic.nigdzie>
Mail-Followup-To: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com
References: <20020801133506.GA22073@serwus.bnet.pl> <200208011743.VAA07150@sex.inr.ac.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200208011743.VAA07150@sex.inr.ac.ru>
User-Agent: Mutt/1.3.99i
X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20
X-Spam-Level:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk

On Thu, Aug 01, 2002 at 09:43:23PM +0400, kuznet@ms2.inr.ac.ru wrote:
> Hello!
>
> > network. And "unregister_netdevice: waiting for tun0 to become free"
>
> This smells like a leakage. I think this should be investigated.

I am not sure. Sometimes it is not easy to free such a device.

> > devices, and that only the "old style" devices have this problem.
> > I had similar problem with VLAN devices some time ago, so I checked VLAN
> > driver sources too.
>
> Well, switching to "new style" does not solve the problem. If we have
> a leakage, it will continue silently which is even worse.
> Probably, you should define NET_REFCNT_DEBUG in net/core/dev.c
> to track what happens with the device.

After applying my patch, the device unregistration is usually deferred,
but finally it is finished. On system shutdown, when all routes are
removed, all interfaces downed and all processes killed the device will
not be used any more and the device is freed.

Without the patch my system doesn't even come to this point, as it stops
on the first process which tries to access any network device (e.g. any
iproute2 command). Usually it stops on postfix shutdown, probably because
I use the tun device for tunneling SMTP connections to my home machine.
Shutting down postfix before openvpn would probably help, but failing to
do this should not hang the system.

And even if there is a bug in the tun module, it should never prevent the
system from shutting down cleanly, even if the device is never freed and
the module never unloaded. So I think this patch (or a similar one) should
be applied.

Greets,
    Jacek

From owner-netdev@oss.sgi.com Thu Aug 1 11:00:54 2002
Received: from oss.sgi.com (localhost [127.0.0.1])
    by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g71I0sRw008852
    for ; Thu, 1 Aug 2002 11:00:54 -0700
Received: (from majordomo@localhost)
    by oss.sgi.com (8.12.5/8.12.3/Submit) id g71I0rc6008851
    for netdev-outgoing; Thu, 1 Aug 2002 11:00:53 -0700
X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f
Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165])
    by oss.sgi.com (8.12.5/8.12.5) with SMTP id g71I0nRw008842
    for ; Thu, 1 Aug 2002 11:00:50 -0700
Received: (from kuznet@localhost)
    by sex.inr.ac.ru (8.6.13/ANK) id WAA07242;
    Thu, 1 Aug 2002 22:01:37 +0400
From: kuznet@ms2.inr.ac.ru
Message-Id: <200208011801.WAA07242@sex.inr.ac.ru>
Subject: Re: "new style" netdevice allocation patch for TUN driver (2.4.18 kernel)
To: jajcus@bnet.pl (Jacek Konieczny)
Date: Thu, 1 Aug 2002 22:01:37 +0400 (MSD)
Cc: netdev@oss.sgi.com
In-Reply-To:
<20020801175204.GC17308@nic.nigdzie> from "Jacek Konieczny" at Aug 1, 2 07:52:06 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20
X-Spam-Level:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk

Hello!

> but finally it is finished. On system shutdown, when all routes are
> removed, all interfaces downed and all processes killed the device will
> not be used any more and the device is freed.

I do not understand this. It is a leak, and we have to find who continues
to use the device.

> And even if there is a bug

Nope. The bug comes first, the cleanup second.

Alexey

From owner-netdev@oss.sgi.com Thu Aug 1 17:26:01 2002
Received: from oss.sgi.com (localhost [127.0.0.1])
    by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g720Q1Rw019701
    for ; Thu, 1 Aug 2002 17:26:01 -0700
Received: (from majordomo@localhost)
    by oss.sgi.com (8.12.5/8.12.3/Submit) id g720Q1Xt019700
    for netdev-outgoing; Thu, 1 Aug 2002 17:26:01 -0700
X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f
Received: from cms1.etri.re.kr (cms1.etri.re.kr [129.254.16.11])
    by oss.sgi.com (8.12.5/8.12.5) with SMTP id g720PuRw019691
    for ; Thu, 1 Aug 2002 17:25:57 -0700
Received: from seong (seong1.etri.re.kr [129.254.171.33])
    by cms1.etri.re.kr with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13)
    id PXQFN0G5; Fri, 2 Aug 2002 09:22:48 +0900
Message-ID: <004501c239bb$6ca70d30$21abfe81@seong>
From: "??"
To:
References: <20020801133506.GA22073@serwus.bnet.pl> <200208011743.VAA07150@sex.inr.ac.ru> <20020801175204.GC17308@nic.nigdzie>
Subject: two net_device ?
Date: Fri, 2 Aug 2002 09:27:46 +0900
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4807.1700
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4807.1700
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from base64 to 8bit by oss.sgi.com id g720PvRw019692
X-Spam-Status: No, hits=-0.1 required=5.0 tests=SUBJ_ENDS_IN_Q_MARK version=2.20
X-Spam-Level:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk

Hi there.

I implemented 32 pseudo interfaces (eth1 ~ eth32) using a module.
When I bring the interfaces up using the "ifconfig eth1 xx.xx.xx.xx"
command, there are two "eth1" interfaces. Before I bring the interfaces
up, there is a single "eth1" interface. I can't find what is going wrong.

As a test, I made just 2 pseudo interfaces using the same mechanism;
then the above problem doesn't occur.

Any suggestions? Thanks.

I use the 2.4.17 kernel.

From owner-netdev@oss.sgi.com Sat Aug 3 12:58:23 2002
Received: from oss.sgi.com (localhost [127.0.0.1])
    by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g73JwNRw030432
    for ; Sat, 3 Aug 2002 12:58:23 -0700
Received: (from majordomo@localhost)
    by oss.sgi.com (8.12.5/8.12.3/Submit) id g73JwM5j030431
    for netdev-outgoing; Sat, 3 Aug 2002 12:58:22 -0700
X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f
Received: from oknodo.bof.de (bof.de [195.4.223.10])
    by oss.sgi.com (8.12.5/8.12.5) with SMTP id g73JtQRw030395
    for ; Sat, 3 Aug 2002 12:55:27 -0700
Received: (qmail 348 invoked by uid 500); 3 Aug 2002 19:55:46 -0000
Date: Sat, 3 Aug 2002 21:55:46 +0200
From: Patrick Schaaf
To: netfilter-devel@lists.netfilter.org
Cc: netdev@oss.sgi.com
Subject: [PATCH] my recent meddling with ip_conntrack
Message-ID: <20020803215546.A284@oknodo.bof.de>
References: <20020802102451.A685@oknodo.bof.de>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="OXfL5xGRrasGEqWY"
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To:
<20020802102451.A685@oknodo.bof.de>; from bof@bof.de on Fri, Aug 02, 2002 at 10:24:51AM +0200
X-Spam-Status: No, hits=-9.4 required=5.0 tests=IN_REP_TO,UNIFIED_PATCH version=2.20
X-Spam-Level:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk

--OXfL5xGRrasGEqWY
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi netfilter-devel & netdev,

I have pulled my recent ip_conntrack patches up to 2.4.19, and have
that merge running now on my shiny new dual P-MMX 200. No surprises.
It's already up 40 minutes with hundreds of connections tracked!

Patch appended for curious people and would-be testers. All comments
welcome. This is not meant for inclusion anywhere right now; I am just
looking for some eyeballs.

have a nice weekend
  Patrick

Short Changelog, in order of probable importance:

- netfilter hook statistics, /proc/net/nf_stat_hook*, as a compile
  option found under "Networking Options". Per-hook-function rdtscll()
  based timing and occurrence counting. See netfilter in action for
  yourself!
- remove unnecessary add_timer() calls from per-packet processing.
  Introduces new ip_conntrack->timeout_target, 4 bytes in size.
  The running timer is never disturbed as long as the timeout only
  increases monotonically. That covers the normal ESTABLISHED case.
  When the timer runs out, it possibly restarts itself to the
  then-current timeout_target.
- prefer to allocate the ip_conntrack hash using get_free_pages()
- use a singly linked list to hash them. BTW, with bucket count
  autoselection, this change doubles the number of available buckets.
  Saves four bytes per ip_conntrack_tuple_hash, 8 bytes per ip_conntrack.
- in include/linux/skbuff.h, introduce skb_nf_forget(), and use that
  to clean up several places in ipv4/ core stack code.
- make init_conntrack() a bit more sane, removing unnecessary hash
  computations.
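The add_timer() avoidance in the second changelog item can be illustrated with a small user-space sketch. This is not the code from the patch: struct conn_sketch and the conn_*() functions are hypothetical names, time is a plain counter standing in for jiffies, and counter wraparound is ignored.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a tracked connection; not the kernel struct. */
struct conn_sketch {
    unsigned long timer_expires;  /* when the armed timer will fire */
    unsigned long timeout_target; /* when the connection should really expire */
    unsigned long rearm_count;    /* how often the timer had to be (re)armed */
};

/* Arm the timer once when the connection is created. */
void conn_init(struct conn_sketch *c, unsigned long now, unsigned long timeout)
{
    c->timer_expires = now + timeout;
    c->timeout_target = now + timeout;
    c->rearm_count = 1;
}

/*
 * Per-packet refresh: record the new deadline, and touch the timer only
 * when the deadline moved EARLIER than the currently armed expiry.
 */
void conn_refresh(struct conn_sketch *c, unsigned long now, unsigned long timeout)
{
    unsigned long target = now + timeout;

    c->timeout_target = target;
    if (target < c->timer_expires) {
        c->timer_expires = target;
        c->rearm_count++;
    }
}

/*
 * Timer handler: returns true if the connection has really expired.
 * If the deadline was pushed out in the meantime, re-arm to the
 * then-current target instead and keep the connection alive.
 */
bool conn_timer_fired(struct conn_sketch *c, unsigned long now)
{
    if (c->timeout_target > now) {
        c->timer_expires = c->timeout_target;
        c->rearm_count++;
        return false;
    }
    return true;
}
```

In this sketch a stream of packets that each extend the deadline costs one field write per packet; the timer is re-armed only when it fires early or when a shorter timeout is requested.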
--OXfL5xGRrasGEqWY Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="bof-ct-merged-20020803.Changelog" ---------------------- Would send the following csets --------------------- ChangeSet@1.597, 2002-08-03 18:20:26+02:00, bof@cdr.(none) Merge bkbits-linux-2.4 after 2.4.19 ChangeSet@1.582.9.7, 2002-08-02 09:02:19+02:00, bof@cdr.(none) ip_conntrack_standalone.c: cleanup /proc/net/ip_conntrack output - same as always, now. ChangeSet@1.582.9.6, 2002-08-02 09:00:13+02:00, bof@cdr.(none) ip_conntrack.h: prototype for new ip_ct_sudden_death() ip_conntrack_proto_icmp.c, ip_conntrack_proto_tcp.c: use ip_ct_sudden_death(), instead of fiddling with ct->timeout directly. ip_conntrack_core.c: introduce ct->timeout_target, makes add_timer() a rare event. ip_conntrack_standalone.c: show ct->timeout_target in /proc/net/ip_conntrack. ChangeSet@1.582.9.5, 2002-08-01 22:52:01+02:00, bof@cdr.(none) ip_conntrack_core.c: more add_timer avoidance ChangeSet@1.582.9.4, 2002-08-01 22:27:28+02:00, bof@cdr.(none) ip_conntrack.h, ip_conntrack_core.c: begin add_timer avoidance ChangeSet@1.582.9.3, 2002-08-01 21:21:14+02:00, bof@cdr.(none) ip_conntrack_core.c: get_free_pages allocation for ip_conntrack_hash ChangeSet@1.582.7.5, 2002-08-01 19:04:41+02:00, bof@cdr.(none) netfilter.c: remove KERN_NOTICE output ChangeSet@1.582.9.2, 2002-08-01 09:45:24+02:00, bof@cdr.(none) srlist.h: fix single ring list code ip_conntrack_core.c: type agnostic ip_conntrack_hash allocation ChangeSet@1.582.9.1, 2002-07-31 20:57:39+02:00, bof@cdr.(none) include/linux/netfilter_ipv4/srlist.h: introduced. a single ring list implementation with almost the same interface as the netfilter_ipv4/listhelp.h include/linux/netfilter_ipv4/ip_conntrack*.h: use srlist_head in place of list_head for conntrack tuple hashing. net/ipv4/netfilter/ip_conntrack_{core,standalone}.c: use srlist_head instead of list_head for conntrack tuple hashing. 
ChangeSet@1.582.7.4, 2002-07-29 09:25:21+02:00, bof@cdr.(none) netfilter.c: some more comments, minimal cleanup, KERN_NOTICE upon (un)registration. Configure.help: friendly help and advise regarding CONFIG_NETFILTER_HOOK_STAT ChangeSet@1.582.7.3, 2002-07-28 11:58:38+02:00, bof@cdr.(none) net/core/netfilter.c: remove debug printks related to slabifying hook statistic counters. ChangeSet@1.582.7.2, 2002-07-28 11:54:23+02:00, bof@cdr.(none) net/core/netfilter.c: slabify, make per-cpu counters. ChangeSet@1.582.7.1, 2002-07-27 19:16:23+02:00, bof@cdr.(none) Config.in, netfilter.h, netfilter.c: netfilter hook statistics ChangeSet@1.582.4.2, 2002-07-22 21:07:02+02:00, bof@cdr.(none) overall: compiles now, skb_nf_forget() introduction probably OK. skbuff.h: sk_buff speling fix ChangeSet@1.582.4.1, 2002-07-22 20:45:22+02:00, bof@cdr.(none) skbuff.h: define skb_nf_forget() skbuff.c, ip_conntrack_core.c, ipt_REJECT.c, ipip.c, ip_gre.c, sit.c: use skb_nf_forget() ip_input.c, ipmr.c: use skb_nf_forget() NOTE: original code did not clear nf_debug. Now it will. ChangeSet@1.582.2.70, 2002-07-22 09:32:20+02:00, bof@cdr.(none) ip_conntrack_core.c: in init_conntrack(), rename drop_next to drop_rotor: document recent change. ChangeSet@1.582.2.69, 2002-07-22 09:31:29+02:00, bof@cdr.(none) ip_conntrack_core.c: in init_conntrack(), narrow scope of static drop_next. ChangeSet@1.582.2.68, 2002-07-22 09:30:29+02:00, bof@cdr.(none) ip_conntrack_core.c: sanitize typing for hash_conntrack() return value: always use u_int32_t. ChangeSet@1.582.2.67, 2002-07-22 09:24:39+02:00, bof@cdr.(none) ip_conntrack_core.c: remove hash calculation from unconditional part of init_conntrack(), to the rare place where it is needed. ChangeSet@1.582.2.66, 2002-07-22 09:23:20+02:00, bof@cdr.(none) ip_conntrack_core.c: remove repl_hash calculation in init_conntrack(): it was not used. 
--------------------------------------------------------------------------- --OXfL5xGRrasGEqWY Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="bof-ct-merged-20020803.patch" diff -urN ex1/Documentation/Configure.help ex2/Documentation/Configure.help --- ex1/Documentation/Configure.help Sat Aug 3 21:21:17 2002 +++ ex2/Documentation/Configure.help Sat Aug 3 21:24:54 2002 @@ -2429,6 +2429,23 @@ You can say Y here if you want to get additional messages useful in debugging the netfilter code. +Netfilter hook statistics +CONFIG_NETFILTER_HOOK_STAT + If you say Y here, the time spent in the various netfilter hook + functions is measured, using the TSC of your processor. Your + kernel won't boot when you don't have a working TSC. + Say N when you don't have a modern Intel/AMD processor. + + When enabled, look at /proc/net/nf_stat_hook_* for the actual + measurement results, presented in a format easy to guess by + any well-calibrated crystal ball. + + The timing imposes a processing overhead that may be relevant + on machines with high packet rates. The overhead is estimated + at about 5% of the time used by the hook functions, themselves. + + The safe thing is to say N. + Connection tracking (required for masq/NAT) CONFIG_IP_NF_CONNTRACK Connection tracking keeps a record of what packets have passed diff -urN ex1/include/linux/netfilter.h ex2/include/linux/netfilter.h --- ex1/include/linux/netfilter.h Sat Aug 3 21:21:14 2002 +++ ex2/include/linux/netfilter.h Sat Aug 3 21:24:50 2002 @@ -51,6 +51,9 @@ int hooknum; /* Hooks are ordered in ascending priority. 
*/ int priority; +#ifdef CONFIG_NETFILTER_HOOK_STAT + void *hook_stat; +#endif }; struct nf_sockopt_ops diff -urN ex1/include/linux/netfilter_ipv4/ip_conntrack.h ex2/include/linux/netfilter_ipv4/ip_conntrack.h --- ex1/include/linux/netfilter_ipv4/ip_conntrack.h Sat Aug 3 21:21:09 2002 +++ ex2/include/linux/netfilter_ipv4/ip_conntrack.h Sat Aug 3 21:24:47 2002 @@ -97,6 +97,7 @@ volatile unsigned long status; /* Timer function; drops refcnt when it goes off. */ + unsigned long timeout_target; struct timer_list timeout; /* If we're expecting another related connection, this will be @@ -160,6 +161,9 @@ extern int invert_tuplepr(struct ip_conntrack_tuple *inverse, const struct ip_conntrack_tuple *orig); + +/* Kill this conntrack immediately, without regard to timeouts. */ +extern int ip_ct_sudden_death(struct ip_conntrack *ct); /* Refresh conntrack for this many jiffies */ extern void ip_ct_refresh(struct ip_conntrack *ct, diff -urN ex1/include/linux/netfilter_ipv4/ip_conntrack_core.h ex2/include/linux/netfilter_ipv4/ip_conntrack_core.h --- ex1/include/linux/netfilter_ipv4/ip_conntrack_core.h Sat Aug 3 21:21:21 2002 +++ ex2/include/linux/netfilter_ipv4/ip_conntrack_core.h Sat Aug 3 21:24:55 2002 @@ -1,6 +1,7 @@ #ifndef _IP_CONNTRACK_CORE_H #define _IP_CONNTRACK_CORE_H #include +#include /* This header is used to share core functionality between the standalone connection tracking module, and the compatibility layer's use @@ -44,7 +45,7 @@ return NF_ACCEPT; } -extern struct list_head *ip_conntrack_hash; +extern struct srlist_head *ip_conntrack_hash; extern struct list_head expect_list; DECLARE_RWLOCK_EXTERN(ip_conntrack_lock); #endif /* _IP_CONNTRACK_CORE_H */ diff -urN ex1/include/linux/netfilter_ipv4/ip_conntrack_tuple.h ex2/include/linux/netfilter_ipv4/ip_conntrack_tuple.h --- ex1/include/linux/netfilter_ipv4/ip_conntrack_tuple.h Sat Aug 3 21:21:16 2002 +++ ex2/include/linux/netfilter_ipv4/ip_conntrack_tuple.h Sat Aug 3 21:24:54 2002 @@ -1,6 +1,8 @@ #ifndef 
_IP_CONNTRACK_TUPLE_H #define _IP_CONNTRACK_TUPLE_H +#include + /* A `tuple' is a structure containing the information to uniquely identify a connection. ie. if two packets have the same tuple, they are in the same connection; if not, they are not. @@ -85,7 +87,7 @@ /* Connections have two entries in the hash table: one for each way */ struct ip_conntrack_tuple_hash { - struct list_head list; + struct srlist_head list; struct ip_conntrack_tuple tuple; diff -urN ex1/include/linux/netfilter_ipv4/srlist.h ex2/include/linux/netfilter_ipv4/srlist.h --- ex1/include/linux/netfilter_ipv4/srlist.h Thu Jan 1 01:00:00 1970 +++ ex2/include/linux/netfilter_ipv4/srlist.h Sat Aug 3 21:24:35 2002 @@ -0,0 +1,78 @@ +#ifndef __NETFILTER_IPV4_SRLIST_H +#define __NETFILTER_IPV4_SRLIST_H + +struct srlist_head { + struct srlist_head *next; +}; + +#define INIT_SRLIST_HEAD(ptr) do { (ptr)->next = (ptr); } while (0) + +#define SRLIST_FIND(srl, cmpfn, type, args...) \ +({ \ + struct srlist_head *__head = (struct srlist_head *) (srl); \ + struct srlist_head *__i; \ + \ + ASSERT_READ_LOCK(__head); \ + __i = __head; \ + do { \ + if (__i->next == __head) { __i = 0; break; } \ + __i = __i->next; \ + } while (!cmpfn((const type)__i , ## args)); \ + (type)__i; \ +}) + +#define SRLIST_FIND_W(srl, cmpfn, type, args...) 
\ +({ \ + struct srlist_head *__head = (struct srlist_head *) (srl); \ + struct srlist_head *__i; \ + \ + ASSERT_WRITE_LOCK(__head); \ + __i = __head; \ + do { \ + if (__i->next == __head) { __i = 0; break; } \ + __i = __i->next; \ + } while (!cmpfn((const type)__i , ## args)); \ + (type)__i; \ +}) + +#ifndef CONFIG_NETFILTER_DEBUG +#define SRLIST_DELETE_WARN(estr, e, hstr) do{}while (0) +#else +#define SRLIST_DELETE_WARN(estr, e, hstr) \ + printk("TUPLE_DELETE: %s:%u `%s'(%p) not in %s.\n", \ + __FILE__, __LINE__, estr, e, hstr) +#endif + +#define SRLIST_DELETE(srl, elem) \ +do { \ + struct srlist_head *__head = (struct srlist_head *) (srl); \ + struct srlist_head *__elem = (struct srlist_head *) (elem); \ + struct srlist_head *__i; \ + \ + ASSERT_WRITE_LOCK(__head); \ + __i = __head; \ + while (1) { \ + struct srlist_head *__next = __i->next; \ + \ + if (__next == __head) { \ + SRLIST_DELETE_WARN(#elem, __elem, #srl); \ + break; \ + } \ + if (__next == __elem) { \ + __i->next = __elem->next; \ + break; \ + } \ + __i = __next; \ + } \ +} while (0) + +#define SRLIST_PREPEND(srl, elem) \ +do { \ + struct srlist_head *__head = (struct srlist_head *) (srl); \ + struct srlist_head *__elem = (struct srlist_head *) (elem); \ + \ + __elem->next = __head->next; \ + __head->next = __elem; \ +} while (0) + +#endif diff -urN ex1/include/linux/skbuff.h ex2/include/linux/skbuff.h --- ex1/include/linux/skbuff.h Sat Aug 3 21:20:56 2002 +++ ex2/include/linux/skbuff.h Sat Aug 3 21:24:37 2002 @@ -1144,6 +1144,17 @@ if (nfct) atomic_inc(&nfct->master->use); } +static inline void +skb_nf_forget(struct sk_buff *skb) +{ + nf_conntrack_put(skb->nfct); + skb->nfct = NULL; +#ifdef CONFIG_NETFILTER_DEBUG + skb->nf_debug = 0; +#endif +} +#else +static inline void skb_nf_forget(struct sk_buff *skb) {} #endif #endif /* __KERNEL__ */ diff -urN ex1/net/Config.in ex2/net/Config.in --- ex1/net/Config.in Sat Aug 3 21:21:01 2002 +++ ex2/net/Config.in Sat Aug 3 21:24:41 2002 @@ -13,6 +13,7 @@ bool 
'Network packet filtering (replaces ipchains)' CONFIG_NETFILTER if [ "$CONFIG_NETFILTER" = "y" ]; then bool ' Network packet filtering debugging' CONFIG_NETFILTER_DEBUG + bool ' Netfilter hook statistics' CONFIG_NETFILTER_HOOK_STAT fi bool 'Socket Filtering' CONFIG_FILTER tristate 'Unix domain sockets' CONFIG_UNIX diff -urN ex1/net/core/netfilter.c ex2/net/core/netfilter.c --- ex1/net/core/netfilter.c Sat Aug 3 21:21:12 2002 +++ ex2/net/core/netfilter.c Sat Aug 3 21:24:49 2002 @@ -47,6 +47,293 @@ struct list_head nf_hooks[NPROTO][NF_MAX_HOOKS]; static LIST_HEAD(nf_sockopts); +#ifdef CONFIG_NETFILTER_HOOK_STAT + +/* + * menuconfig this under "Network options" >> "Netfilter hook statistics" + * + * The following code, up to the next #endif, implements per hook + * statistics counting. If enabled, look at /proc/net/nf_stat_hook* + * for the results. + * + */ + +#include +#include +#include + +/* + * nf_stat_hook_proc[pf][hooknum] is a flag per protocol/hook, telling + * whether we have already created the /proc/net/nf_stat_hook_X.Y file. + * The array is only consulted during module registration. This code + * never removes the proc files; when all hook functions unregister, + * an empty file remains. + * + * Not used under normal per-packet processing. + */ +static unsigned char nf_stat_hook_proc[NPROTO][NF_MAX_HOOKS]; + +/* + * struct nf_stat_hook_sample is used in nf_inject(), to record the + * beginning of the operation. After calling the hook function, + * it is reused to compute the duration of the hook function call, + * which is then recorded in nf_hook_ops->stat[percpu]. + * + * CPU-local data on the stack, unshared. + */ +struct nf_stat_hook_sample { + unsigned long long stamp; +}; + +/* + * struct nf_stat_hook is our main statistics state structure. + * It is kept cache-aligned and per-cpu, summing the per-cpu + * values only when read through the /proc interface. + * + * CPU-local data, read across all CPUs only on user request. 
+ * Updated locally on each CPU, one update per packet and hook function.
+ */
+struct nf_stat_hook {
+        unsigned long long count;
+        unsigned long long sum;
+} __attribute__ ((__aligned__(SMP_CACHE_BYTES)));
+
+/*
+ * The nf_stat_hook structures come from our private slab cache.
+ */
+static kmem_cache_t *nf_stat_hook_slab;
+
+/*
+ * nf_stat_hook_zero() is the slab ctor/dtor
+ */
+static void nf_stat_hook_zero(void *data, kmem_cache_t *slab, unsigned long x)
+{
+        struct nf_stat_hook *stat = data;
+        int i;
+
+        for (i=0; i<NR_CPUS; i++, stat++)
+                stat->count = stat->sum = 0;
+}
+
+/*
+ * nf_stat_hook_setup() is the one-time initialization routine.
+ * It allocates the slab cache for our statistics counters,
+ * and initializes the "proc registration" flag array.
+ */
+static void __init nf_stat_hook_setup(void)
+{
+        /* early rdtsc to catch booboo at boot time */
+        { struct nf_stat_hook_sample sample; rdtscll(sample.stamp); }
+
+        nf_stat_hook_slab = kmem_cache_create("nf_stat_hook",
+                                NR_CPUS * sizeof(struct nf_stat_hook),
+                                0, SLAB_HWCACHE_ALIGN,
+                                nf_stat_hook_zero, nf_stat_hook_zero);
+        if (!nf_stat_hook_slab)
+                printk(KERN_ERR "nf_stat_hook will NOT WORK - no slab.\n");
+
+        memset(nf_stat_hook_proc, 0, sizeof(nf_stat_hook_proc));
+}
+
+/*
+ * nf_stat_hook_read_proc() is a proc_fs read_proc() callback.
+ * Called per protocol/hook, the statistics of all netfilter
+ * hook elements sitting on that hook, are shown, in priority
+ * order. On SMP, the per-cpu counters are summed here.
+ * For accuracy, maybe we need to take some write lock. Later.
+ *
+ * Readings might look strange, until such locking is done.
+ * If you need to compensate, read several times, and throw
+ * out the strange results. Look for silly non-monotony.
+ *
+ * Output fields are separated by a single blank, and represent:
+ *   [0] address of 'struct nf_hook_ops'. (pointer, in unadorned 8-byte hex)
+ *   [1] address of nf_hook_ops->hook() function pointer.
When the + * hook module is built into the kernel, you can find this + * in System.map. (pointer, in unadorned 8-byte hex) + * [2] hook priority. (signed integer, in ascii) + * [3] number of times hook was called. (unsigned 64 bit integer, in ascii) + * [4] total number of cycles spent in the hook function, measured by + * summing the rdtscll() differences across the calls. (unsigned + * 64 bit integer, in ascii) + * + * Additional fields may be added in the future; if any field is eventually + * retired, it will be set to neutral values: '00000000' for the pointer + * fields, and '0' for the integer fields. That's theory, not guarantee. :) + */ +static int nf_stat_hook_read_proc( + char *page, + char **start, + off_t off, + int count, + int *eof, + void *data +) { + struct list_head *l; + int res; + + for ( res = 0, l = ((struct list_head *)data)->next; + l != data; + l = l->next + ) { + int i; + struct nf_hook_ops *elem = (struct nf_hook_ops *) l; + struct nf_stat_hook *stat = elem->hook_stat; + + if (stat) { + unsigned long long count; + unsigned long long sum; + /* maybe write_lock something here */ + for (i=0, count=0, sum=0; i<NR_CPUS; i++, stat++) { + count += stat->count; + sum += stat->sum; + } + /* and then write_unlock it here */ + i = sprintf(page+res, "%p %p %d %Lu %Lu\n", + elem, elem->hook, elem->priority, + count, sum); + } else { + i = sprintf(page+res, "%p %p %d 0 0\n", + elem, elem->hook, elem->priority); + } + if (i <= 0) + break; + res += i; + } + return res; +} + +/* + * nf_stat_hook_register() is called whenever a hook element registers. + * When necessary, we create a /proc/net/nf_stat_hook* file here, + * and we always allocate one struct nf_stat_hook. + */ +static void nf_stat_hook_register(struct nf_hook_ops *elem) +{ + elem->hook_stat = (NULL == nf_stat_hook_slab) + ?
0 : kmem_cache_alloc(nf_stat_hook_slab, SLAB_ATOMIC); + if (!elem->hook_stat) return; + if (!nf_stat_hook_proc[elem->pf][elem->hooknum]) { + char buf[64]; + char hookname_buf[16]; + char pfname_buf[16]; + char *hookname; + char *pfname; + struct proc_dir_entry *proc; + + switch(elem->pf) { + case 2: + pfname = "ipv4"; + switch(elem->hooknum) { + case 0: + hookname = "PRE-ROUTING"; + break; + case 1: + hookname = "LOCAL-IN"; + break; + case 2: + hookname = "FORWARD"; + break; + case 3: + hookname = "LOCAL-OUT"; + break; + case 4: + hookname = "POST-ROUTING"; + break; + default: + sprintf(hookname_buf, "hook%d", + elem->hooknum); + hookname = hookname_buf; + break; + } + break; + default: + sprintf(hookname_buf, "hook%d", + elem->hooknum); + hookname = hookname_buf; + sprintf(pfname_buf, "pf%d", + elem->pf); + pfname = pfname_buf; + break; + } + sprintf(buf, "net/nf_stat_hook_%s.%s", pfname, hookname); + proc = create_proc_read_entry(buf, 0644, NULL, + nf_stat_hook_read_proc, + &nf_hooks[elem->pf][elem->hooknum] + ); + if (!proc) { + printk(KERN_ERR "cannot create %s\n", buf); + kmem_cache_free(nf_stat_hook_slab, elem->hook_stat); + elem->hook_stat = 0; + return; + } + proc->owner = THIS_MODULE; + } + nf_stat_hook_proc[elem->pf][elem->hooknum]++; +} + +/* + * nf_stat_hook_unregister() is called when a hook element unregisters. + * The statistics structure is freed, but we NEVER remove the /proc/net + * file entry. Maybe we should. nf_stat_hook_proc[][] contains the correct + * counter, I think (modulo races). + */ +static void nf_stat_hook_unregister(struct nf_hook_ops *elem) +{ + if (elem->hook_stat) + kmem_cache_free(nf_stat_hook_slab, elem->hook_stat); + nf_stat_hook_proc[elem->pf][elem->hooknum]--; +} + +/* + * Finally, the next two functions implement the real timekeeping. + * If rdtscll() proves problematic, these have to be changed. + * The _begin() function is called before a specific hook entry + * function gets called - it starts the timer. 
+ * The _end() function is called after the hook entry function, + * and it stops the timer, and remembers the interval in the + * statistics structure (per-cpu). + */ + +static inline void nf_stat_hook_begin(struct nf_stat_hook_sample *sample) +{ + rdtscll(sample->stamp); +} + +static inline void nf_stat_hook_end( + struct nf_stat_hook_sample *sample, + struct nf_hook_ops *elem, + int verdict +) { + struct nf_stat_hook *stat = elem->hook_stat; + struct nf_stat_hook_sample now; + if (!stat) return; + rdtscll(now.stamp); now.stamp -= sample->stamp; + stat += smp_processor_id(); + stat->count++; + stat->sum += now.stamp; +} + +#else + +/* + * Here, a set of empty macros provides for nice ifdef free callers into + * this statistics code. If CONFIG_NETFILTER_HOOK_STAT is NOT defined, + * these should make the compiled code identical to what we had before. + */ +struct nf_stat_hook_sample {}; +#define nf_stat_hook_begin(a) do{}while(0) +#define nf_stat_hook_end(a,b,c) do{}while(0) +#define nf_stat_hook_register(a) do{}while(0) +#define nf_stat_hook_unregister(a) do{}while(0) +#define nf_stat_hook_setup() do{}while(0) + +/* + * End of new statistics stuff. On with the traditional net/core/netfilter.c + * Search below for "nf_stat_hook" to see where we call into the statistics. + */ +#endif + /* * A queue handler may be registered for each protocol. Each is protected by * long term mutex. 
The handler must provide an an outfn() to accept packets @@ -68,6 +355,7 @@ if (reg->priority < ((struct nf_hook_ops *)i)->priority) break; } + nf_stat_hook_register(reg); list_add(&reg->list, i->prev); br_write_unlock_bh(BR_NETPROTO_LOCK); return 0; @@ -77,6 +365,7 @@ { br_write_lock_bh(BR_NETPROTO_LOCK); list_del(&reg->list); + nf_stat_hook_unregister(reg); br_write_unlock_bh(BR_NETPROTO_LOCK); } @@ -346,14 +635,19 @@ { for (*i = (*i)->next; *i != head; *i = (*i)->next) { struct nf_hook_ops *elem = (struct nf_hook_ops *)*i; + struct nf_stat_hook_sample sample; + nf_stat_hook_begin(&sample); switch (elem->hook(hook, skb, indev, outdev, okfn)) { case NF_QUEUE: + nf_stat_hook_end(&sample, elem, NF_QUEUE); return NF_QUEUE; case NF_STOLEN: + nf_stat_hook_end(&sample, elem, NF_STOLEN); return NF_STOLEN; case NF_DROP: + nf_stat_hook_end(&sample, elem, NF_DROP); return NF_DROP; case NF_REPEAT: @@ -369,6 +663,7 @@ elem->hook, hook); #endif } + nf_stat_hook_end(&sample, elem, NF_ACCEPT); } return NF_ACCEPT; } @@ -638,4 +933,5 @@ for (h = 0; h < NF_MAX_HOOKS; h++) INIT_LIST_HEAD(&nf_hooks[i][h]); } + nf_stat_hook_setup(); } diff -urN ex1/net/core/skbuff.c ex2/net/core/skbuff.c --- ex1/net/core/skbuff.c Sat Aug 3 21:21:00 2002 +++ ex2/net/core/skbuff.c Sat Aug 3 21:24:40 2002 @@ -323,9 +323,7 @@ } skb->destructor(skb); } -#ifdef CONFIG_NETFILTER - nf_conntrack_put(skb->nfct); -#endif + skb_nf_forget(skb); skb_headerinit(skb, NULL, 0); /* clean state */ kfree_skbmem(skb); } diff -urN ex1/net/ipv4/ip_gre.c ex2/net/ipv4/ip_gre.c --- ex1/net/ipv4/ip_gre.c Sat Aug 3 21:21:16 2002 +++ ex2/net/ipv4/ip_gre.c Sat Aug 3 21:24:54 2002 @@ -644,13 +644,7 @@ skb->dev = tunnel->dev; dst_release(skb->dst); skb->dst = NULL; -#ifdef CONFIG_NETFILTER - nf_conntrack_put(skb->nfct); - skb->nfct = NULL; -#ifdef CONFIG_NETFILTER_DEBUG - skb->nf_debug = 0; -#endif -#endif + skb_nf_forget(skb); ipgre_ecn_decapsulate(iph, skb); netif_rx(skb); read_unlock(&ipgre_lock); @@ -876,13 +870,7 @@ } } -#ifdef
CONFIG_NETFILTER - nf_conntrack_put(skb->nfct); - skb->nfct = NULL; -#ifdef CONFIG_NETFILTER_DEBUG - skb->nf_debug = 0; -#endif -#endif + skb_nf_forget(skb); IPTUNNEL_XMIT(); tunnel->recursion--; diff -urN ex1/net/ipv4/ip_input.c ex2/net/ipv4/ip_input.c --- ex1/net/ipv4/ip_input.c Sat Aug 3 21:20:57 2002 +++ ex2/net/ipv4/ip_input.c Sat Aug 3 21:24:37 2002 @@ -226,12 +226,9 @@ __skb_pull(skb, ihl); -#ifdef CONFIG_NETFILTER /* Free reference early: we don't need it any more, and it may hold ip_conntrack module loaded indefinitely. */ - nf_conntrack_put(skb->nfct); - skb->nfct = NULL; -#endif /*CONFIG_NETFILTER*/ + skb_nf_forget(skb); /* Point into the IP datagram, just past the header. */ skb->h.raw = skb->data; diff -urN ex1/net/ipv4/ipip.c ex2/net/ipv4/ipip.c --- ex1/net/ipv4/ipip.c Sat Aug 3 21:21:14 2002 +++ ex2/net/ipv4/ipip.c Sat Aug 3 21:24:50 2002 @@ -493,13 +493,7 @@ skb->dev = tunnel->dev; dst_release(skb->dst); skb->dst = NULL; -#ifdef CONFIG_NETFILTER - nf_conntrack_put(skb->nfct); - skb->nfct = NULL; -#ifdef CONFIG_NETFILTER_DEBUG - skb->nf_debug = 0; -#endif -#endif + skb_nf_forget(skb); ipip_ecn_decapsulate(iph, skb); netif_rx(skb); read_unlock(&ipip_lock); @@ -644,13 +638,7 @@ if ((iph->ttl = tiph->ttl) == 0) iph->ttl = old_iph->ttl; -#ifdef CONFIG_NETFILTER - nf_conntrack_put(skb->nfct); - skb->nfct = NULL; -#ifdef CONFIG_NETFILTER_DEBUG - skb->nf_debug = 0; -#endif -#endif + skb_nf_forget(skb); IPTUNNEL_XMIT(); tunnel->recursion--; diff -urN ex1/net/ipv4/ipmr.c ex2/net/ipv4/ipmr.c --- ex1/net/ipv4/ipmr.c Sat Aug 3 21:21:13 2002 +++ ex2/net/ipv4/ipmr.c Sat Aug 3 21:24:49 2002 @@ -1096,10 +1096,7 @@ skb->h.ipiph = skb->nh.iph; skb->nh.iph = iph; -#ifdef CONFIG_NETFILTER - nf_conntrack_put(skb->nfct); - skb->nfct = NULL; -#endif + skb_nf_forget(skb); } static inline int ipmr_forward_finish(struct sk_buff *skb) @@ -1441,10 +1438,7 @@ skb->dst = NULL; ((struct net_device_stats*)reg_dev->priv)->rx_bytes += skb->len; ((struct 
net_device_stats*)reg_dev->priv)->rx_packets++; -#ifdef CONFIG_NETFILTER - nf_conntrack_put(skb->nfct); - skb->nfct = NULL; -#endif + skb_nf_forget(skb); netif_rx(skb); dev_put(reg_dev); return 0; @@ -1508,10 +1502,7 @@ ((struct net_device_stats*)reg_dev->priv)->rx_bytes += skb->len; ((struct net_device_stats*)reg_dev->priv)->rx_packets++; skb->dst = NULL; -#ifdef CONFIG_NETFILTER - nf_conntrack_put(skb->nfct); - skb->nfct = NULL; -#endif + skb_nf_forget(skb); netif_rx(skb); dev_put(reg_dev); return 0; diff -urN ex1/net/ipv4/netfilter/ip_conntrack_core.c ex2/net/ipv4/netfilter/ip_conntrack_core.c --- ex1/net/ipv4/netfilter/ip_conntrack_core.c Sat Aug 3 21:20:55 2002 +++ ex2/net/ipv4/netfilter/ip_conntrack_core.c Sat Aug 3 21:24:34 2002 @@ -52,9 +52,31 @@ unsigned int ip_conntrack_htable_size = 0; static int ip_conntrack_max = 0; static atomic_t ip_conntrack_count = ATOMIC_INIT(0); -struct list_head *ip_conntrack_hash; +struct srlist_head *ip_conntrack_hash; +static int ip_conntrack_hash_vmalloced; static kmem_cache_t *ip_conntrack_cachep; +static __init void alloc_ip_conntrack_hash(void) +{ + const size_t s = sizeof(*ip_conntrack_hash) * ip_conntrack_htable_size; + IP_NF_ASSERT(ip_conntrack_hash == 0); + ip_conntrack_hash = (void *) __get_free_pages(GFP_KERNEL, get_order(s)); + if (!ip_conntrack_hash) { + ip_conntrack_hash = vmalloc(s); + if (!ip_conntrack_hash) BUG(); + ip_conntrack_hash_vmalloced = 1; + } +} + +static void free_ip_conntrack_hash(void) +{ + const size_t s = sizeof(*ip_conntrack_hash) * ip_conntrack_htable_size; + if (ip_conntrack_hash_vmalloced) + vfree(ip_conntrack_hash); + else + free_pages((unsigned long)ip_conntrack_hash, get_order(s)); +} + extern struct ip_conntrack_protocol ip_conntrack_generic_protocol; static inline int proto_cmpfn(const struct ip_conntrack_protocol *curr, @@ -155,12 +177,12 @@ { MUST_BE_WRITE_LOCKED(&ip_conntrack_lock); /* Remove from both hash lists: must not NULL out next ptrs, - otherwise we'll look unconfirmed. 
Fortunately, LIST_DELETE + otherwise we'll look unconfirmed. Fortunately, SRLIST_DELETE doesn't do this. --RR */ - LIST_DELETE(&ip_conntrack_hash + SRLIST_DELETE(&ip_conntrack_hash [hash_conntrack(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple)], &ct->tuplehash[IP_CT_DIR_ORIGINAL]); - LIST_DELETE(&ip_conntrack_hash + SRLIST_DELETE(&ip_conntrack_hash [hash_conntrack(&ct->tuplehash[IP_CT_DIR_REPLY].tuple)], &ct->tuplehash[IP_CT_DIR_REPLY]); /* If our expected is in the list, take it out. */ @@ -196,14 +218,46 @@ atomic_dec(&ip_conntrack_count); } +static inline int later_than(unsigned long this, unsigned long ref) +{ + return this > ref + || (ref > ((unsigned long)-1) - 864000 && this < ref + 864000); +} + +static inline int earlier_than(unsigned long this, unsigned long ref) +{ + return this != ref && !later_than(this, ref); +} + +static inline void activate_timeout_target(struct ip_conntrack *ct) +{ + ct->timeout.expires = ct->timeout_target; + add_timer(&ct->timeout); +} + static void death_by_timeout(unsigned long ul_conntrack) { struct ip_conntrack *ct = (void *)ul_conntrack; WRITE_LOCK(&ip_conntrack_lock); + if (later_than(ct->timeout_target, ct->timeout.expires)) { + activate_timeout_target(ct); + WRITE_UNLOCK(&ip_conntrack_lock); + return; + } + clean_from_lists(ct); + WRITE_UNLOCK(&ip_conntrack_lock); + ip_conntrack_put(ct); +} + +int ip_ct_sudden_death(struct ip_conntrack *ct) +{ + if (!del_timer(&ct->timeout)) return 0; + WRITE_LOCK(&ip_conntrack_lock); clean_from_lists(ct); WRITE_UNLOCK(&ip_conntrack_lock); ip_conntrack_put(ct); + return 1; } static inline int @@ -223,7 +277,7 @@ struct ip_conntrack_tuple_hash *h; MUST_BE_READ_LOCKED(&ip_conntrack_lock); - h = LIST_FIND(&ip_conntrack_hash[hash_conntrack(tuple)], + h = SRLIST_FIND(&ip_conntrack_hash[hash_conntrack(tuple)], conntrack_tuple_cmp, struct ip_conntrack_tuple_hash *, tuple, ignored_conntrack); @@ -271,7 +325,7 @@ int __ip_conntrack_confirm(struct nf_ct_info *nfct) { - unsigned int hash, repl_hash; + 
u_int32_t hash, repl_hash; struct ip_conntrack *ct; enum ip_conntrack_info ctinfo; @@ -301,23 +355,19 @@ /* See if there's one in the list already, including reverse: NAT could have grabbed it without realizing, since we're not in the hash. If there is, we lost race. */ - if (!LIST_FIND(&ip_conntrack_hash[hash], + if (!SRLIST_FIND(&ip_conntrack_hash[hash], conntrack_tuple_cmp, struct ip_conntrack_tuple_hash *, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, NULL) - && !LIST_FIND(&ip_conntrack_hash[repl_hash], + && !SRLIST_FIND(&ip_conntrack_hash[repl_hash], conntrack_tuple_cmp, struct ip_conntrack_tuple_hash *, &ct->tuplehash[IP_CT_DIR_REPLY].tuple, NULL)) { - list_prepend(&ip_conntrack_hash[hash], + SRLIST_PREPEND(&ip_conntrack_hash[hash], &ct->tuplehash[IP_CT_DIR_ORIGINAL]); - list_prepend(&ip_conntrack_hash[repl_hash], + SRLIST_PREPEND(&ip_conntrack_hash[repl_hash], &ct->tuplehash[IP_CT_DIR_REPLY]); - /* Timer relative to confirmation time, not original - setting time, otherwise we'd get timer wrap in - wierd delay cases. */ - ct->timeout.expires += jiffies; - add_timer(&ct->timeout); + activate_timeout_target(ct); atomic_inc(&ct->ct_general.use); WRITE_UNLOCK(&ip_conntrack_lock); return NF_ACCEPT; @@ -435,19 +485,22 @@ /* There's a small race here where we may free a just-assured connection. Too bad: we're in trouble anyway. 
*/ -static inline int unreplied(const struct ip_conntrack_tuple_hash *i) +static inline int unreplied(const struct ip_conntrack_tuple_hash *i, + struct ip_conntrack_tuple_hash **lru) { - return !(i->ctrack->status & IPS_ASSURED); + if (!(i->ctrack->status & IPS_ASSURED)) + *lru = (struct ip_conntrack_tuple_hash *) i; + return 0; } -static int early_drop(struct list_head *chain) +static int early_drop(struct srlist_head *chain) { /* Traverse backwards: gives us oldest, which is roughly LRU */ - struct ip_conntrack_tuple_hash *h; + struct ip_conntrack_tuple_hash *h = 0; int dropped = 0; READ_LOCK(&ip_conntrack_lock); - h = LIST_FIND(chain, unreplied, struct ip_conntrack_tuple_hash *); + SRLIST_FIND(chain, unreplied, struct ip_conntrack_tuple_hash *, &h); if (h) atomic_inc(&h->ctrack->ct_general.use); READ_UNLOCK(&ip_conntrack_lock); @@ -455,10 +508,7 @@ if (!h) return dropped; - if (del_timer(&h->ctrack->timeout)) { - death_by_timeout((unsigned long)h->ctrack); - dropped = 1; - } + dropped = ip_ct_sudden_death(h->ctrack); ip_conntrack_put(h->ctrack); return dropped; } @@ -485,22 +535,19 @@ { struct ip_conntrack *conntrack; struct ip_conntrack_tuple repl_tuple; - size_t hash, repl_hash; struct ip_conntrack_expect *expected; int i; - static unsigned int drop_next = 0; - - hash = hash_conntrack(tuple); if (ip_conntrack_max && atomic_read(&ip_conntrack_count) >= ip_conntrack_max) { /* Try dropping from random chain, or else from the chain about to put into (in case they're trying to bomb one hash chain). 
*/ - unsigned int next = (drop_next++)%ip_conntrack_htable_size; + static u_int32_t drop_rotor = 0; + u_int32_t next = (drop_rotor++)%ip_conntrack_htable_size; if (!early_drop(&ip_conntrack_hash[next]) - && !early_drop(&ip_conntrack_hash[hash])) { + && !early_drop(&ip_conntrack_hash[hash_conntrack(tuple)])) { if (net_ratelimit()) printk(KERN_WARNING "ip_conntrack: table full, dropping" @@ -513,7 +560,6 @@ DEBUGP("Can't invert tuple.\n"); return NULL; } - repl_hash = hash_conntrack(&repl_tuple); conntrack = kmem_cache_alloc(ip_conntrack_cachep, GFP_ATOMIC); if (!conntrack) { @@ -689,8 +735,7 @@ ret = proto->packet(ct, (*pskb)->nh.iph, (*pskb)->len, ctinfo); if (ret == -1) { /* Invalid */ - nf_conntrack_put((*pskb)->nfct); - (*pskb)->nfct = NULL; + skb_nf_forget(*pskb); return NF_ACCEPT; } @@ -699,8 +744,7 @@ ct, ctinfo); if (ret == -1) { /* Invalid */ - nf_conntrack_put((*pskb)->nfct); - (*pskb)->nfct = NULL; + skb_nf_forget(*pskb); return NF_ACCEPT; } } @@ -808,7 +852,7 @@ return 0; } -static inline int unhelp(struct ip_conntrack_tuple_hash *i, +static inline int unhelp(const struct ip_conntrack_tuple_hash *i, const struct ip_conntrack_helper *me) { if (i->ctrack->helper == me) { @@ -834,7 +878,7 @@ /* Get rid of expecteds, set helpers to NULL. */ for (i = 0; i < ip_conntrack_htable_size; i++) - LIST_FIND_W(&ip_conntrack_hash[i], unhelp, + SRLIST_FIND_W(&ip_conntrack_hash[i], unhelp, struct ip_conntrack_tuple_hash *, me); WRITE_UNLOCK(&ip_conntrack_lock); @@ -851,15 +895,11 @@ IP_NF_ASSERT(ct->timeout.data == (unsigned long)ct); WRITE_LOCK(&ip_conntrack_lock); - /* If not in hash table, timer will not be active yet */ - if (!is_confirmed(ct)) - ct->timeout.expires = extra_jiffies; - else { - /* Need del_timer for race avoidance (may already be dying). 
*/ - if (del_timer(&ct->timeout)) { - ct->timeout.expires = jiffies + extra_jiffies; - add_timer(&ct->timeout); - } + ct->timeout_target = jiffies + extra_jiffies; + if ( is_confirmed(ct) + && earlier_than(ct->timeout_target, ct->timeout.expires) + && del_timer(&ct->timeout)) { + activate_timeout_target(ct); } WRITE_UNLOCK(&ip_conntrack_lock); } @@ -942,7 +982,7 @@ READ_LOCK(&ip_conntrack_lock); for (i = 0; !h && i < ip_conntrack_htable_size; i++) { - h = LIST_FIND(&ip_conntrack_hash[i], do_kill, + h = SRLIST_FIND(&ip_conntrack_hash[i], do_kill, struct ip_conntrack_tuple_hash *, kill, data); } if (h) @@ -961,10 +1001,7 @@ /* This is order n^2, by the way. */ while ((h = get_next_corpse(kill, data)) != NULL) { /* Time to push up daises... */ - if (del_timer(&h->ctrack->timeout)) - death_by_timeout((unsigned long)h->ctrack); - /* ... else the timer will get him soon. */ - + ip_ct_sudden_death(h->ctrack); ip_conntrack_put(h->ctrack); } } @@ -1073,7 +1110,7 @@ } kmem_cache_destroy(ip_conntrack_cachep); - vfree(ip_conntrack_hash); + free_ip_conntrack_hash(); nf_unregister_sockopt(&so_getorigdst); } @@ -1092,7 +1129,7 @@ } else { ip_conntrack_htable_size = (((num_physpages << PAGE_SHIFT) / 16384) - / sizeof(struct list_head)); + / sizeof(*ip_conntrack_hash)); if (num_physpages > (1024 * 1024 * 1024 / PAGE_SIZE)) ip_conntrack_htable_size = 8192; if (ip_conntrack_htable_size < 16) @@ -1107,8 +1144,7 @@ if (ret != 0) return ret; - ip_conntrack_hash = vmalloc(sizeof(struct list_head) - * ip_conntrack_htable_size); + alloc_ip_conntrack_hash(); if (!ip_conntrack_hash) { nf_unregister_sockopt(&so_getorigdst); return -ENOMEM; @@ -1119,7 +1155,7 @@ SLAB_HWCACHE_ALIGN, NULL, NULL); if (!ip_conntrack_cachep) { printk(KERN_ERR "Unable to create ip_conntrack slab cache\n"); - vfree(ip_conntrack_hash); + free_ip_conntrack_hash(); nf_unregister_sockopt(&so_getorigdst); return -ENOMEM; } @@ -1133,7 +1169,7 @@ WRITE_UNLOCK(&ip_conntrack_lock); for (i = 0; i < ip_conntrack_htable_size; 
i++) - INIT_LIST_HEAD(&ip_conntrack_hash[i]); + INIT_SRLIST_HEAD(&ip_conntrack_hash[i]); /* This is fucking braindead. There is NO WAY of doing this without the CONFIG_SYSCTL unless you don't want to detect errors. @@ -1143,7 +1179,7 @@ = register_sysctl_table(ip_conntrack_root_table, 0); if (ip_conntrack_sysctl_header == NULL) { kmem_cache_destroy(ip_conntrack_cachep); - vfree(ip_conntrack_hash); + free_ip_conntrack_hash(); nf_unregister_sockopt(&so_getorigdst); return -ENOMEM; } diff -urN ex1/net/ipv4/netfilter/ip_conntrack_proto_icmp.c ex2/net/ipv4/netfilter/ip_conntrack_proto_icmp.c --- ex1/net/ipv4/netfilter/ip_conntrack_proto_icmp.c Sat Aug 3 21:20:56 2002 +++ ex2/net/ipv4/netfilter/ip_conntrack_proto_icmp.c Sat Aug 3 21:24:36 2002 @@ -77,9 +77,8 @@ means this will only run once even if count hits zero twice (theoretically possible with SMP) */ if (CTINFO2DIR(ctinfo) == IP_CT_DIR_REPLY) { - if (atomic_dec_and_test(&ct->proto.icmp.count) - && del_timer(&ct->timeout)) - ct->timeout.function((unsigned long)ct); + if (atomic_dec_and_test(&ct->proto.icmp.count)) + ip_ct_sudden_death(ct); } else { atomic_inc(&ct->proto.icmp.count); ip_ct_refresh(ct, ICMP_TIMEOUT); diff -urN ex1/net/ipv4/netfilter/ip_conntrack_proto_tcp.c ex2/net/ipv4/netfilter/ip_conntrack_proto_tcp.c --- ex1/net/ipv4/netfilter/ip_conntrack_proto_tcp.c Sat Aug 3 21:21:12 2002 +++ ex2/net/ipv4/netfilter/ip_conntrack_proto_tcp.c Sat Aug 3 21:24:49 2002 @@ -189,8 +189,7 @@ problem case, so we can delete the conntrack immediately. 
--RR */ if (!(conntrack->status & IPS_SEEN_REPLY) && tcph->rst) { - if (del_timer(&conntrack->timeout)) - conntrack->timeout.function((unsigned long)conntrack); + ip_ct_sudden_death(conntrack); } else { /* Set ASSURED if we see see valid ack in ESTABLISHED after SYN_RECV */ if (oldtcpstate == TCP_CONNTRACK_SYN_RECV diff -urN ex1/net/ipv4/netfilter/ip_conntrack_standalone.c ex2/net/ipv4/netfilter/ip_conntrack_standalone.c --- ex1/net/ipv4/netfilter/ip_conntrack_standalone.c Sat Aug 3 21:21:12 2002 +++ ex2/net/ipv4/netfilter/ip_conntrack_standalone.c Sat Aug 3 21:24:48 2002 @@ -83,7 +83,7 @@ conntrack->tuplehash[IP_CT_DIR_ORIGINAL] .tuple.dst.protonum, timer_pending(&conntrack->timeout) - ? (conntrack->timeout.expires - jiffies)/HZ : 0); + ? (conntrack->timeout_target - jiffies)/HZ : 0); len += proto->print_conntrack(buffer + len, conntrack); len += print_tuple(buffer + len, @@ -140,7 +140,7 @@ READ_LOCK(&ip_conntrack_lock); /* Traverse hash; print originals then reply. */ for (i = 0; i < ip_conntrack_htable_size; i++) { - if (LIST_FIND(&ip_conntrack_hash[i], conntrack_iterate, + if (SRLIST_FIND(&ip_conntrack_hash[i], conntrack_iterate, struct ip_conntrack_tuple_hash *, buffer, offset, &upto, &len, length)) goto finished; diff -urN ex1/net/ipv4/netfilter/ipt_REJECT.c ex2/net/ipv4/netfilter/ipt_REJECT.c --- ex1/net/ipv4/netfilter/ipt_REJECT.c Sat Aug 3 21:21:04 2002 +++ ex2/net/ipv4/netfilter/ipt_REJECT.c Sat Aug 3 21:24:43 2002 @@ -69,12 +69,8 @@ return; /* This packet will not be the same as the other: clear nf fields */ - nf_conntrack_put(nskb->nfct); - nskb->nfct = NULL; nskb->nfcache = 0; -#ifdef CONFIG_NETFILTER_DEBUG - nskb->nf_debug = 0; -#endif + skb_nf_forget(nskb); tcph = (struct tcphdr *)((u_int32_t*)nskb->nh.iph + nskb->nh.iph->ihl); diff -urN ex1/net/ipv6/sit.c ex2/net/ipv6/sit.c --- ex1/net/ipv6/sit.c Sat Aug 3 21:21:17 2002 +++ ex2/net/ipv6/sit.c Sat Aug 3 21:24:54 2002 @@ -403,13 +403,7 @@ skb->dev = tunnel->dev; dst_release(skb->dst); skb->dst = 
NULL; -#ifdef CONFIG_NETFILTER - nf_conntrack_put(skb->nfct); - skb->nfct = NULL; -#ifdef CONFIG_NETFILTER_DEBUG - skb->nf_debug = 0; -#endif -#endif + skb_nf_forget(skb); ipip6_ecn_decapsulate(iph, skb); netif_rx(skb); read_unlock(&ipip6_lock); @@ -600,13 +594,7 @@ if ((iph->ttl = tiph->ttl) == 0) iph->ttl = iph6->hop_limit; -#ifdef CONFIG_NETFILTER - nf_conntrack_put(skb->nfct); - skb->nfct = NULL; -#ifdef CONFIG_NETFILTER_DEBUG - skb->nf_debug = 0; -#endif -#endif + skb_nf_forget(skb); IPTUNNEL_XMIT(); tunnel->recursion--; --OXfL5xGRrasGEqWY-- From owner-netdev@oss.sgi.com Mon Aug 5 23:46:20 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g766kKRw001989 for ; Mon, 5 Aug 2002 23:46:20 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g766kKx3001988 for netdev-outgoing; Mon, 5 Aug 2002 23:46:20 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from coruscant.gnumonks.org (mail@coruscant.franken.de [193.174.159.226]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g766k9Rw001979 for ; Mon, 5 Aug 2002 23:46:10 -0700 Received: from [192.168.200.2] (helo=sunbeam.gnumonks.org) by coruscant.gnumonks.org with esmtp (TLSv1:DES-CBC3-SHA:168) (Exim 3.34 #1) id 17by8R-0002rH-00; Tue, 06 Aug 2002 08:48:03 +0200 Received: from laforge by sunbeam.gnumonks.org with local (Exim 3.34 #1) id 17bxwT-0003Hx-00; Tue, 06 Aug 2002 08:35:41 +0200 Date: Tue, 6 Aug 2002 08:35:41 +0200 From: Harald Welte To: netfilter-devel@lists.netfilter.org Cc: netdev@oss.sgi.com Subject: [RFC] Options for ECN target Message-ID: <20020806083541.L11828@sunbeam.de.gnumonks.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="Bzq2cJcN05fcPrs+" Content-Disposition: inline User-Agent: Mutt/1.3.17i X-Operating-System: Linux sunbeam.de.gnumonks.org 2.4.19-pre10-newnat-pptp X-Date: Today is Boomtime, 
the 71st day of Confusion in the YOLD 3168 X-Spam-Status: No, hits=0.0 required=5.0 tests= version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk --Bzq2cJcN05fcPrs+ Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

Hi!

Some time ago I wrote an ECN target for the iptables mangle table. It has now undergone some final bugfixes and I intend to submit it to the kernel.

The main goal of this target is to be able to selectively work around known ECN blackholes rather than disabling ECN for the whole host using "echo 0 > /proc/sys/net/ipv4/tcp_ecn".

There is one question left: How much flexibility do we want to give the user?

The ECN target currently allows:

--ecn-tcp-remove
    Remove CWR+ECE bits from TCP header. Should be used on TCP SYN packets to prevent ECN negotiation
--ecn-ip-ect [0..3]
    Allows arbitrary setting of the ECT codepoint
--ecn-tcp-cwr [0|1]
    Allows setting or clearing the TCP CWR bit
--ecn-tcp-ece [0|1]
    Allows setting or clearing the TCP ECE bit

The first option is necessary and is the primary use of the target.

The last three options are more experimental and would allow somebody to play with 'simulated congestion' by setting the ECT in IP, etc. However, this is potentially very dangerous and I'm not sure it is a good idea to give this power directly to the user.

Do you suggest removing the last three options and keeping only --ecn-tcp-remove?

Thanks for your assistance,

--
Live long and prosper
- Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o?
K- w--- O- M+ V-- PS++ PE-- Y++ PGP++ t+ 5-- !X !R tv-- b+++ !DI !D G+ e* h--- r++ y+(*) --Bzq2cJcN05fcPrs+ Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQE9T246XaXGVTD0i/8RAhsAAJ42gv0vrnBq9eu9jb64krktAQ9GugCff58F 8M2WB1EKkDz337K3iAlFFQ8= =z7Ku -----END PGP SIGNATURE----- --Bzq2cJcN05fcPrs+-- From owner-netdev@oss.sgi.com Tue Aug 6 06:46:15 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g76DkFRw025731 for ; Tue, 6 Aug 2002 06:46:15 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g76DkFYn025730 for netdev-outgoing; Tue, 6 Aug 2002 06:46:15 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g76Dk9Rw025717 for ; Tue, 6 Aug 2002 06:46:10 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id RAA24323; Tue, 6 Aug 2002 17:47:56 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208061347.RAA24323@sex.inr.ac.ru> Subject: Re: [RFC] Options for ECN target To: laforge@gnumonks.ORG (Harald Welte) Date: Tue, 6 Aug 2002 17:47:56 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <20020806083541.L11828@sunbeam.de.gnumonks.org> from "Harald Welte" at Aug 6, 2 12:15:09 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > The first option is necessary and is the primary use of the target. > The last three options are more experimental and would allow somebody > to play with 'simulated congestion' by setting the ECT in IP, etc. ... > However, this is potentially very dangerous Yes, they are illegal. I would not say "very dangerous".
Surely, much less dangerous than cp /dev/zero /dev/mem. :-) So, I think it can be added for debugging purposes, provided it is hidden in a section of the manual devoted to debugging and not shown in the command-line help.

Alexey

From owner-netdev@oss.sgi.com Wed Aug 7 17:23:03 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g780N3Rw021947 for ; Wed, 7 Aug 2002 17:23:03 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g780N3Rk021946 for netdev-outgoing; Wed, 7 Aug 2002 17:23:03 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from blackbird.intercode.com.au (blackbird.intercode.com.au [203.32.101.10]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g780MWRw021937 for ; Wed, 7 Aug 2002 17:22:34 -0700 Received: from localhost (jmorris@localhost) by blackbird.intercode.com.au (8.9.3/8.9.3) with ESMTP id KAA16866; Thu, 8 Aug 2002 10:24:19 +1000 Date: Thu, 8 Aug 2002 10:24:19 +1000 (EST) From: James Morris To: "David S. Miller" , cc: netdev@oss.sgi.com, Matthew Wilcox Subject: [PATCH] minor socket ioctl cleanup for 2.5.30 Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-5.0 required=5.0 tests=UNIFIED_PATCH version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk

Suggested by Matthew Wilcox, the patch below consolidates FIOSETOWN etc. ioctl handling into the socket layer, making it common for all sockets.

 econet/af_econet.c     | 12 ------------
 ipv4/af_inet.c         | 16 ----------------
 ipv6/af_inet6.c        | 14 --------------
 packet/af_packet.c     | 14 --------------
 socket.c               | 30 ++++++++++++++++++++++++++++--
 wanrouter/af_wanpipe.c | 14 --------------
 6 files changed, 28 insertions, 72 deletions

Btw, is af_wanpipe.c likely to stay in the tree? It doesn't seem to be used anymore.
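
[Editorial illustration, not part of the original message or patch.] The consolidation described above can be sketched in plain user-space C: a generic layer answers the "owner" commands once, and every other command is delegated to the per-protocol handler. All names here (`demo_sock`, `demo_sock_ioctl`, `demo_proto_ioctl`, the `DEMO_*` command numbers) are hypothetical stand-ins, and the permission check the real code performs (current->pid, current->pgrp, capable(CAP_NET_ADMIN)) is deliberately left out. Treat it as a rough model under those assumptions, not as kernel code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical command numbers for this sketch only; the patch itself
 * deals with FIOSETOWN/SIOCSPGRP and FIOGETOWN/SIOCGPGRP. */
enum demo_cmd {
	DEMO_SETOWN = 1,
	DEMO_GETOWN = 2,
	DEMO_PROTO_PRIVATE = 100	/* stands in for e.g. SIOCGSTAMP */
};

struct demo_sock;

/* Per-protocol operations, loosely analogous to sock->ops->ioctl. */
struct demo_proto_ops {
	int (*ioctl)(struct demo_sock *sk, int cmd, int *arg);
};

struct demo_sock {
	int owner;				/* loosely analogous to sk->proc */
	const struct demo_proto_ops *ops;
	int proto_calls;			/* counts delegated calls */
};

/* Protocol handler: it no longer carries its own SETOWN/GETOWN cases. */
static int demo_proto_ioctl(struct demo_sock *sk, int cmd, int *arg)
{
	sk->proto_calls++;
	if (cmd == DEMO_PROTO_PRIVATE) {
		*arg = 42;		/* arbitrary protocol-specific answer */
		return 0;
	}
	return -EINVAL;
}

static const struct demo_proto_ops demo_ops = { demo_proto_ioctl };

/* Generic layer: handle the owner commands once, delegate the rest. */
static int demo_sock_ioctl(struct demo_sock *sk, int cmd, int *arg)
{
	if (arg == NULL)
		return -EFAULT;

	switch (cmd) {
	case DEMO_SETOWN:
		sk->owner = *arg;	/* permission check omitted in this sketch */
		return 0;
	case DEMO_GETOWN:
		*arg = sk->owner;
		return 0;
	default:
		return sk->ops->ioctl(sk, cmd, arg);
	}
}
```

A caller would set up a `struct demo_sock` with `ops` pointing at `demo_ops` and go through `demo_sock_ioctl()` only. The owner commands then never reach the protocol handler, which is why each address family can drop its duplicated cases.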
- James -- James Morris diff -urN -X dontdiff linux-2.5.30.orig/net/econet/af_econet.c linux-2.5.30.w1/net/econet/af_econet.c --- linux-2.5.30.orig/net/econet/af_econet.c Sat Aug 3 23:40:30 2002 +++ linux-2.5.30.w1/net/econet/af_econet.c Wed Aug 7 23:33:57 2002 @@ -643,21 +643,9 @@ static int econet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) { struct sock *sk = sock->sk; - int pid; switch(cmd) { - case FIOSETOWN: - case SIOCSPGRP: - if (get_user(pid, (int *) arg)) - return -EFAULT; - if (current->pid != pid && current->pgrp != -pid && !capable(CAP_NET_ADMIN)) - return -EPERM; - sk->proc = pid; - return(0); - case FIOGETOWN: - case SIOCGPGRP: - return put_user(sk->proc, (int *)arg); case SIOCGSTAMP: if(sk->stamp.tv_sec==0) return -ENOENT; diff -urN -X dontdiff linux-2.5.30.orig/net/ipv4/af_inet.c linux-2.5.30.w1/net/ipv4/af_inet.c --- linux-2.5.30.orig/net/ipv4/af_inet.c Sat Aug 3 23:40:09 2002 +++ linux-2.5.30.w1/net/ipv4/af_inet.c Wed Aug 7 23:25:23 2002 @@ -850,24 +850,8 @@ { struct sock *sk = sock->sk; int err = 0; - int pid; switch (cmd) { - case FIOSETOWN: - case SIOCSPGRP: - if (get_user(pid, (int *)arg)) - err = -EFAULT; - else if (current->pid != pid && - current->pgrp != -pid && - !capable(CAP_NET_ADMIN)) - err = -EPERM; - else - sk->proc = pid; - break; - case FIOGETOWN: - case SIOCGPGRP: - err = put_user(sk->proc, (int *)arg); - break; case SIOCGSTAMP: if (!sk->stamp.tv_sec) err = -ENOENT; diff -urN -X dontdiff linux-2.5.30.orig/net/ipv6/af_inet6.c linux-2.5.30.w1/net/ipv6/af_inet6.c --- linux-2.5.30.orig/net/ipv6/af_inet6.c Sat Aug 3 23:40:22 2002 +++ linux-2.5.30.w1/net/ipv6/af_inet6.c Wed Aug 7 23:30:36 2002 @@ -455,23 +455,9 @@ { struct sock *sk = sock->sk; int err = -EINVAL; - int pid; switch(cmd) { - case FIOSETOWN: - case SIOCSPGRP: - if (get_user(pid, (int *) arg)) - return -EFAULT; - /* see sock_no_fcntl */ - if (current->pid != pid && current->pgrp != -pid && - !capable(CAP_NET_ADMIN)) - return -EPERM; - sk->proc = pid; - 
return(0); - case FIOGETOWN: - case SIOCGPGRP: - return put_user(sk->proc,(int *)arg); case SIOCGSTAMP: if(sk->stamp.tv_sec==0) return -ENOENT; diff -urN -X dontdiff linux-2.5.30.orig/net/packet/af_packet.c linux-2.5.30.w1/net/packet/af_packet.c --- linux-2.5.30.orig/net/packet/af_packet.c Sat Aug 3 23:40:30 2002 +++ linux-2.5.30.w1/net/packet/af_packet.c Thu Aug 8 01:34:44 2002 @@ -1458,20 +1458,6 @@ spin_unlock_bh(&sk->receive_queue.lock); return put_user(amount, (int *)arg); } - case FIOSETOWN: - case SIOCSPGRP: { - int pid; - if (get_user(pid, (int *) arg)) - return -EFAULT; - if (current->pid != pid && current->pgrp != -pid && - !capable(CAP_NET_ADMIN)) - return -EPERM; - sk->proc = pid; - break; - } - case FIOGETOWN: - case SIOCGPGRP: - return put_user(sk->proc, (int *)arg); case SIOCGSTAMP: if(sk->stamp.tv_sec==0) return -ENOENT; diff -urN -X dontdiff linux-2.5.30.orig/net/socket.c linux-2.5.30.w1/net/socket.c --- linux-2.5.30.orig/net/socket.c Sun Aug 4 23:59:29 2002 +++ linux-2.5.30.w1/net/socket.c Wed Aug 7 23:39:00 2002 @@ -683,11 +683,37 @@ unsigned long arg) { struct socket *sock; - int err; + struct sock *sk; + int err = 0; unlock_kernel(); sock = SOCKET_I(inode); - err = sock->ops->ioctl(sock, cmd, arg); + sk = sock->sk; + + switch(cmd) { + case FIOSETOWN: + case SIOCSPGRP: { + int pid; + + if (get_user(pid, (int *)arg)) + err = -EFAULT; + else if (current->pid != pid && current->pgrp != -pid && + !capable(CAP_NET_ADMIN)) + err = -EPERM; + else + sk->proc = pid; + break; + } + + case FIOGETOWN: + case SIOCGPGRP: + err = put_user(sk->proc, (int *)arg); + break; + + default: + err = sock->ops->ioctl(sock, cmd, arg); + } + lock_kernel(); return err; diff -urN -X dontdiff linux-2.5.30.orig/net/wanrouter/af_wanpipe.c linux-2.5.30.w1/net/wanrouter/af_wanpipe.c --- linux-2.5.30.orig/net/wanrouter/af_wanpipe.c Sat Aug 3 23:39:41 2002 +++ linux-2.5.30.w1/net/wanrouter/af_wanpipe.c Wed Aug 7 23:33:26 2002 @@ -1867,23 +1867,9 @@ { struct sock *sk = sock->sk; 
int err; - int pid; switch(cmd) { - case FIOSETOWN: - case SIOCSPGRP: - err = get_user(pid, (int *) arg); - if (err) - return err; - if (current->pid != pid && current->pgrp != -pid && - !capable(CAP_NET_ADMIN)) - return -EPERM; - sk->proc = pid; - return(0); - case FIOGETOWN: - case SIOCGPGRP: - return put_user(sk->proc, (int *)arg); case SIOCGSTAMP: if(sk->stamp.tv_sec==0) return -ENOENT; From owner-netdev@oss.sgi.com Thu Aug 8 02:29:42 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g789TgRw002481 for ; Thu, 8 Aug 2002 02:29:42 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g789TgLa002480 for netdev-outgoing; Thu, 8 Aug 2002 02:29:42 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from color.sics.se (color.sics.se [193.10.66.199]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g789TaRw002471 for ; Thu, 8 Aug 2002 02:29:37 -0700 Received: from r2d2.sics.se (r2d2.sics.se [193.10.66.198]) by color.sics.se (8.9.3/8.9.3) with ESMTP id LAA15287 for ; Thu, 8 Aug 2002 11:31:40 +0200 (MET DST) env-to () env-from (gabriel@sics.se) Received: from localhost (gabriel@localhost) by r2d2.sics.se (8.11.6/8.9.3) with ESMTP id g789VeU27595 for ; Thu, 8 Aug 2002 11:31:40 +0200 env-from (gabriel@sics.se) X-Authentication-Warning: r2d2.sics.se: gabriel owned process doing -bs Date: Thu, 8 Aug 2002 11:31:40 +0200 (CEST) From: Gabriel Paues To: netdev@oss.sgi.com Subject: TBF timing issues In-Reply-To: <3D4768DC.7070907@cdac.ernet.in> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! I use the TBF filter with good results on my computer. When I set a rate with the filter the traffic get shaped accordingly. 
I have set up a couple of UML instances (user-mode-linux) as a network to
test different QoS strategies. My problem is that the TBF calculates the
wrong rates. A UML instance may get interrupted just like any other
program. Does the timing code in TBF presume that the kernel won't be
interrupted, and therefore generate the wrong rates?

Regards,
Gabriel Paues

From owner-netdev@oss.sgi.com Thu Aug 8 04:29:31 2002
Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g78BTVRw004470 for ; Thu, 8 Aug 2002 04:29:31 -0700
Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g78BTVwu004468 for netdev-outgoing; Thu, 8 Aug 2002 04:29:31 -0700
X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f
Received: from coruscant.gnumonks.org (mail@coruscant.franken.de [193.174.159.226]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g78BTJRw004458 for ; Thu, 8 Aug 2002 04:29:20 -0700
Received: from [192.168.200.2] (helo=sunbeam.gnumonks.org) by coruscant.gnumonks.org with esmtp (TLSv1:DES-CBC3-SHA:168) (Exim 3.34 #1) id 17clVi-0000xK-00; Thu, 08 Aug 2002 13:31:23 +0200
Received: from laforge by sunbeam.gnumonks.org with local (Exim 3.34 #1) id 17clVY-0004ro-00; Thu, 08 Aug 2002 13:31:12 +0200
Date: Thu, 8 Aug 2002 13:31:12 +0200
From: Harald Welte
To: David Miller
Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: [PATCH] fix HIPQUAD macro in kernel.h
Message-ID: <20020808133112.E11828@sunbeam.de.gnumonks.org>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="HQbpMUFNRY4iYVZ3"
Content-Disposition: inline
User-Agent: Mutt/1.3.17i
X-Operating-System: Linux sunbeam.de.gnumonks.org 2.4.19-pre10-newnat-pptp
X-Date: Today is Boomtime, the 71st day of Confusion in the YOLD 3168
X-Spam-Status: No, hits=-5.0 required=5.0 tests=UNIFIED_PATCH version=2.20
X-Spam-Level:
Sender: owner-netdev@oss.sgi.com
Precedence:
bulk

--HQbpMUFNRY4iYVZ3
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi Dave!

Below is a fix for the HIPQUAD macro in kernel.h. The macro is currently
not endian-aware - it just assumes running on a little-endian machine.

If you don't like the #ifdefs in kernel.h, the macros could be moved into
include/linux/byteorder/.

Please apply, thanks

--- linux-2.4.19-rc5-plain/include/linux/kernel.h Wed Aug 7 22:55:03 2002
+++ linux-2.4.19-rc5-endian/include/linux/kernel.h Thu Aug 8 11:34:13 2002
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include
 
 /* Optimization barrier */
 /* The "volatile" is due to gcc bugs */
@@ -128,11 +129,17 @@
 	((unsigned char *)&addr)[2], \
 	((unsigned char *)&addr)[3]
 
+#if defined(__LITTLE_ENDIAN)
 #define HIPQUAD(addr) \
 	((unsigned char *)&addr)[3], \
 	((unsigned char *)&addr)[2], \
 	((unsigned char *)&addr)[1], \
 	((unsigned char *)&addr)[0]
+#elif defined(__BIG_ENDIAN)
+#define HIPQUAD NIPQUAD
+#else
+#error "Please fix asm/byteorder.h"
+#endif /* __LITTLE_ENDIAN */
 
 /*
  * min()/max() macros that also do
-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o?
K- w--- O- M+
V-- PS++ PE-- Y++ PGP++ t+ 5-- !X !R tv-- b+++ !DI !D G+ e* h--- r++ y+(*)

--HQbpMUFNRY4iYVZ3
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE9UlZ/XaXGVTD0i/8RArgWAJ9v+OI9pUwvTFy5Cojwkr1Ks3+qsQCffIuQ
6Xs89gLqtxxuRoMB4rJeSF0=
=PdjG
-----END PGP SIGNATURE-----

--HQbpMUFNRY4iYVZ3--

From owner-netdev@oss.sgi.com Thu Aug 8 04:39:20 2002
Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g78BdJRw004622 for ; Thu, 8 Aug 2002 04:39:19 -0700
Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g78BdJ4G004621 for netdev-outgoing; Thu, 8 Aug 2002 04:39:19 -0700
X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f
Received: from Cantor.suse.de (ns.suse.de [213.95.15.193]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g78BdFRw004612 for ; Thu, 8 Aug 2002 04:39:16 -0700
Received: from Hermes.suse.de (Charybdis.suse.de [213.95.15.201]) by Cantor.suse.de (Postfix) with ESMTP id ABC3F14573; Thu, 8 Aug 2002 13:41:15 +0200 (MEST)
Date: Thu, 8 Aug 2002 13:41:13 +0200
From: Andi Kleen
To: Harald Welte
Cc: David Miller , netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] fix HIPQUAD macro in kernel.h
Message-ID: <20020808134113.A2552@wotan.suse.de>
References: <20020808133112.E11828@sunbeam.de.gnumonks.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20020808133112.E11828@sunbeam.de.gnumonks.org>
User-Agent: Mutt/1.3.22.1i
X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20
X-Spam-Level:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk

On Thu, Aug 08, 2002 at 01:31:12PM +0200, Harald Welte wrote:
> Hi Dave!
>
> Below is a fix for the HIPQUAD macro in kernel.h.
The macro is currently > not endian-aware - it just assumes running on a little-endian machine. > > If you don't like the #ifdefs in kernel.h, the macros could be moved into > include/linux/byteorder/. > > Please apply, thanks That change is wrong. IP address should be always in network order (=BE) while in kernel. -Andi From owner-netdev@oss.sgi.com Thu Aug 8 05:04:53 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g78C4rRw018463 for ; Thu, 8 Aug 2002 05:04:53 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g78C4rb8018426 for netdev-outgoing; Thu, 8 Aug 2002 05:04:53 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from coruscant.gnumonks.org (mail@coruscant.franken.de [193.174.159.226]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g78C4fRw017428 for ; Thu, 8 Aug 2002 05:04:42 -0700 Received: from [192.168.200.2] (helo=sunbeam.gnumonks.org) by coruscant.gnumonks.org with esmtp (TLSv1:DES-CBC3-SHA:168) (Exim 3.34 #1) id 17cm3x-0001Ms-00; Thu, 08 Aug 2002 14:06:46 +0200 Received: from laforge by sunbeam.gnumonks.org with local (Exim 3.34 #1) id 17cm3k-0004tR-00; Thu, 08 Aug 2002 14:06:32 +0200 Date: Thu, 8 Aug 2002 14:06:32 +0200 From: Harald Welte To: Andi Kleen Cc: David Miller , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH] fix HIPQUAD macro in kernel.h Message-ID: <20020808140632.I11828@sunbeam.de.gnumonks.org> References: <20020808133112.E11828@sunbeam.de.gnumonks.org> <20020808134113.A2552@wotan.suse.de> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="tB3UQx9o7itSJcWB" Content-Disposition: inline User-Agent: Mutt/1.3.17i In-Reply-To: <20020808134113.A2552@wotan.suse.de>; from ak@suse.de on Thu, Aug 08, 2002 at 01:41:13PM +0200 X-Operating-System: Linux sunbeam.de.gnumonks.org 2.4.19-pre10-newnat-pptp X-Date: Today is 
Boomtime, the 71st day of Confusion in the YOLD 3168
X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20
X-Spam-Level:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk

--tB3UQx9o7itSJcWB
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Aug 08, 2002 at 01:41:13PM +0200, Andi Kleen wrote:
> On Thu, Aug 08, 2002 at 01:31:12PM +0200, Harald Welte wrote:
> > Hi Dave!
> >
> > Below is a fix for the HIPQUAD macro in kernel.h. The macro is currently
> > not endian-aware - it just assumes running on a little-endian machine.
> >
> > If you don't like the #ifdefs in kernel.h, the macros could be moved into
> > include/linux/byteorder/.
> >
> > Please apply, thanks
>
> That change is wrong. IP address should be always in network order (=BE)
> while in kernel.

Well, there is for example a short codepath in ip_conntrack_irc.c, where
the IP address is parsed from the packet payload (after which it is present
in host byte order). Before it is written to the appropriate data structures,
we convert it to network byte order. And just before this happens, there
are debug printk's which use HIPQUAD.

What is the point of providing two macros (HIPQUAD and NIPQUAD), if one
of them only works on little-endian?

I would understand your point if the HIPQUAD macro wasn't present at all
(and only NIPQUAD existed).

I assumed that NIPQUAD parses an IP address in network byte order, and
HIPQUAD in host byte order. If they are really meant for little or big
endian, they should be renamed to BEIPQUAD and LEIPQUAD.
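
To make the host-order versus network-order distinction concrete, here is a
rough user-space sketch of the two macros as the patch defines them. It is
not kernel code: the compiler's predefined __BYTE_ORDER__ macros (GCC/Clang)
stand in for the kernel's __LITTLE_ENDIAN / __BIG_ENDIAN, and the helpers
format_host_addr() and format_net_addr() are hypothetical names used only
for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* NIPQUAD: bytes in memory order, which is correct for network byte order. */
#define NIPQUAD(addr) \
	((unsigned char *)&addr)[0], \
	((unsigned char *)&addr)[1], \
	((unsigned char *)&addr)[2], \
	((unsigned char *)&addr)[3]

/*
 * HIPQUAD: most significant byte first for a host-order value.
 * Assumption: __BYTE_ORDER__ replaces the kernel's byteorder defines here.
 */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define HIPQUAD(addr) \
	((unsigned char *)&addr)[3], \
	((unsigned char *)&addr)[2], \
	((unsigned char *)&addr)[1], \
	((unsigned char *)&addr)[0]
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define HIPQUAD NIPQUAD
#else
#error "unknown byte order"
#endif

/* Hypothetical helper: print an address held in host byte order. */
void format_host_addr(uint32_t host_addr, char *buf, size_t len)
{
	snprintf(buf, len, "%u.%u.%u.%u", HIPQUAD(host_addr));
}

/* Hypothetical helper: print an address held in network byte order. */
void format_net_addr(uint32_t net_addr, char *buf, size_t len)
{
	snprintf(buf, len, "%u.%u.%u.%u", NIPQUAD(net_addr));
}
```

With the endian-aware definition, a host-order value such as 0xC0A80001
formats as 192.168.0.1 on either kind of machine, which is the behaviour
the debug printk's in question rely on.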
> -Andi --=20 Live long and prosper - Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/ =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M+= =20 V-- PS++ PE-- Y++ PGP++ t+ 5-- !X !R tv-- b+++ !DI !D G+ e* h--- r++ y+(*) --tB3UQx9o7itSJcWB Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQE9Ul7IXaXGVTD0i/8RAl0ZAJ9s9rJCx2NuiYpt/xG+mYgZIgHAFACfZoLi kyE0bvhAyI0FaZO8ENYQYwk= =HTLa -----END PGP SIGNATURE----- --tB3UQx9o7itSJcWB-- From owner-netdev@oss.sgi.com Thu Aug 8 07:59:48 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g78ExmRw017457 for ; Thu, 8 Aug 2002 07:59:48 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g78Exm66017456 for netdev-outgoing; Thu, 8 Aug 2002 07:59:48 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g78ExjRw017447 for ; Thu, 8 Aug 2002 07:59:45 -0700 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id HAA07337; Thu, 8 Aug 2002 07:48:53 -0700 Date: Thu, 08 Aug 2002 07:48:53 -0700 (PDT) Message-Id: <20020808.074853.114346036.davem@redhat.com> To: ak@suse.de Cc: laforge@gnumonks.org, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH] fix HIPQUAD macro in kernel.h From: "David S. 
Miller" In-Reply-To: <20020808134113.A2552@wotan.suse.de> References: <20020808133112.E11828@sunbeam.de.gnumonks.org> <20020808134113.A2552@wotan.suse.de> X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk From: Andi Kleen Date: Thu, 8 Aug 2002 13:41:13 +0200 That change is wrong. IP address should be always in network order (=BE) while in kernel. He's fixing the HIPQUAD ('H' as in 'host') not NIPQUAD ('N' as in 'network') macro. If you disagree with people using HIPQUAD at all, recommend that it be deleted. Until then, it ought to be fixed :-) From owner-netdev@oss.sgi.com Thu Aug 8 08:09:09 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g78F99Rw017964 for ; Thu, 8 Aug 2002 08:09:09 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g78F99aU017963 for netdev-outgoing; Thu, 8 Aug 2002 08:09:09 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g78F96Rw017954 for ; Thu, 8 Aug 2002 08:09:06 -0700 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id HAA07511; Thu, 8 Aug 2002 07:58:16 -0700 Date: Thu, 08 Aug 2002 07:58:16 -0700 (PDT) Message-Id: <20020808.075816.56749431.davem@redhat.com> To: laforge@gnumonks.org Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH] fix HIPQUAD macro in kernel.h From: "David S. 
Miller" In-Reply-To: <20020808133112.E11828@sunbeam.de.gnumonks.org> References: <20020808133112.E11828@sunbeam.de.gnumonks.org> X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk From: Harald Welte Date: Thu, 8 Aug 2002 13:31:12 +0200 Below is a fix for the HIPQUAD macro in kernel.h. The macro is currently not endian-aware - it just assumes running on a little-endian machine. This looks fine, I've added it to both my 2.4.x and 2.5.x networking trees. Thanks. From owner-netdev@oss.sgi.com Thu Aug 8 08:44:51 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g78FipRw018785 for ; Thu, 8 Aug 2002 08:44:51 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g78Fipp1018784 for netdev-outgoing; Thu, 8 Aug 2002 08:44:51 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g78FilRw018774 for ; Thu, 8 Aug 2002 08:44:47 -0700 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id IAA07720; Thu, 8 Aug 2002 08:33:21 -0700 Date: Thu, 08 Aug 2002 08:33:20 -0700 (PDT) Message-Id: <20020808.083320.100990288.davem@redhat.com> To: jmorris@intercode.com.au Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, willy@debian.org Subject: Re: [PATCH] minor socket ioctl cleanup for 2.5.30 From: "David S. 
Miller" In-Reply-To: References: X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk From: James Morris Date: Thu, 8 Aug 2002 10:24:19 +1000 (EST) Suggested by Matthew Wilcox, the patch below consolidates FIOSETOWN etc. ioctl handling into the socket layer, making it common for all sockets. Do we really want to do this? What if some socket family either doesn't want to support it or wants to handle it differently? Btw, is af_wanpipe.c likely to stay in the tree? It doesn't seem to be used anymore. I have no idea. Ask the WAN people :-) From owner-netdev@oss.sgi.com Thu Aug 8 09:05:22 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g78G5MRw019595 for ; Thu, 8 Aug 2002 09:05:22 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g78G5MLN019594 for netdev-outgoing; Thu, 8 Aug 2002 09:05:22 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from www.linux.org.uk (parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g78G5FRw019585 for ; Thu, 8 Aug 2002 09:05:16 -0700 Received: from willy by www.linux.org.uk with local (Exim 3.33 #5) id 17cpom-000666-00; Thu, 08 Aug 2002 17:07:20 +0100 Date: Thu, 8 Aug 2002 17:07:20 +0100 From: Matthew Wilcox To: "David S. 
Miller"
Cc: jmorris@intercode.com.au, kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, willy@debian.org
Subject: Re: [PATCH] minor socket ioctl cleanup for 2.5.30
Message-ID: <20020808170720.N24631@parcelfarce.linux.theplanet.co.uk>
References: <20020808.083320.100990288.davem@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5.1i
In-Reply-To: <20020808.083320.100990288.davem@redhat.com>; from davem@redhat.com on Thu, Aug 08, 2002 at 08:33:20AM -0700
X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20
X-Spam-Level:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk

On Thu, Aug 08, 2002 at 08:33:20AM -0700, David S. Miller wrote:
> From: James Morris
> Date: Thu, 8 Aug 2002 10:24:19 +1000 (EST)
>
> Suggested by Matthew Wilcox, the patch below consolidates FIOSETOWN etc.
> ioctl handling into the socket layer, making it common for all sockets.
>
> Do we really want to do this? What if some socket family either
> doesn't want to support it or wants to handle it differently?

I rather think we do. It's analogous to saying "What if some filesystem
either doesn't want to support it or wants to handle it differently?"
-- tough! This is Unix and filesystems (socket families) support this.

Have you read Forsyth's paper "Sending UNIX to the Fat Farm"?
http://www.caldo.demon.co.uk/doc/taste.pdf
Section 3.3 is relevant here ... though I think you'll find great
amusement in his other criticisms of Solaris.

-- 
Revolutions do not require corporate support.
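
For readers following the thread, the consolidation being debated can be
modelled roughly in user space as below. This is only a sketch of the
dispatch shape in the net/socket.c hunk, not the patch itself: struct task,
struct sock_model, the CMD_* numbers and family_ioctl_stub() are all
hypothetical stand-ins, and the net_admin flag stands in for
capable(CAP_NET_ADMIN).

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not real kernel types. */
struct task {
	int pid;
	int pgrp;
	bool net_admin;		/* models capable(CAP_NET_ADMIN) */
};

struct sock_model {
	int proc;		/* models sk->proc */
	int (*family_ioctl)(struct sock_model *sk, unsigned int cmd, int *arg);
};

/* Placeholder command numbers, not the real ioctl values. */
enum { CMD_SETOWN = 1, CMD_GETOWN = 2, CMD_OTHER = 3 };

/* The ownership check that each family used to repeat. */
int may_set_owner(const struct task *cur, int pid)
{
	return cur->pid == pid || cur->pgrp == -pid || cur->net_admin;
}

/* Generic layer: handle owner commands once, delegate everything else. */
int sock_ioctl_model(const struct task *cur, struct sock_model *sk,
		     unsigned int cmd, int *arg)
{
	switch (cmd) {
	case CMD_SETOWN:
		if (!may_set_owner(cur, *arg))
			return -EPERM;
		sk->proc = *arg;
		return 0;
	case CMD_GETOWN:
		*arg = sk->proc;
		return 0;
	default:
		return sk->family_ioctl(sk, cmd, arg);
	}
}

/* Hypothetical per-family handler that knows no commands at all. */
int family_ioctl_stub(struct sock_model *sk, unsigned int cmd, int *arg)
{
	(void)sk;
	(void)cmd;
	(void)arg;
	return -EINVAL;
}
```

In this model a family can no longer override the owner commands, which is
exactly the trade-off questioned in the quoted text above.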
From owner-netdev@oss.sgi.com Thu Aug 8 10:11:38 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g78HBcRw020450 for ; Thu, 8 Aug 2002 10:11:38 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g78HBcqr020449 for netdev-outgoing; Thu, 8 Aug 2002 10:11:38 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g78HBXRw020440 for ; Thu, 8 Aug 2002 10:11:33 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id VAA02896; Thu, 8 Aug 2002 21:13:15 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208081713.VAA02896@sex.inr.ac.ru> Subject: Re: [PATCH] minor socket ioctl cleanup for 2.5.30 To: willy@debian.org (Matthew Wilcox) Date: Thu, 8 Aug 2002 21:13:15 +0400 (MSD) Cc: davem@redhat.com, jmorris@intercode.com.au, netdev@oss.sgi.com, willy@debian.org In-Reply-To: <20020808170720.N24631@parcelfarce.linux.theplanet.co.uk> from "Matthew Wilcox" at Aug 8, 2 05:07:20 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > > Do we really want to do this? What if some socket family either > > doesn't want to support it or wants to handle it differently? > > I rather think we do. It's analagous to saying "What if some filesystem > either doesn't want to support it or wants to handle it differently?" > -- tough! This is unix and filesystems (socket families) support this. No, this is not true. This creepy ioctl is specific to TCP (well, x.25 also uses SIGURG), which use kill*(sk->proc, SIGURG) directly. Probably, it is better to move sk->proc to TCP private data, this ioctl to tcp_ioctl(). Or... 
find a way to get rid of this completely, not breaking compatibility with a few BSDish applications. Alexey From owner-netdev@oss.sgi.com Thu Aug 8 10:15:21 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g78HFLRw020573 for ; Thu, 8 Aug 2002 10:15:21 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g78HFLk8020572 for netdev-outgoing; Thu, 8 Aug 2002 10:15:21 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from www.linux.org.uk (parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g78HFGRw020562 for ; Thu, 8 Aug 2002 10:15:17 -0700 Received: from willy by www.linux.org.uk with local (Exim 3.33 #5) id 17cquT-0007zC-00; Thu, 08 Aug 2002 18:17:17 +0100 Date: Thu, 8 Aug 2002 18:17:17 +0100 From: Matthew Wilcox To: kuznet@ms2.inr.ac.ru Cc: Matthew Wilcox , davem@redhat.com, jmorris@intercode.com.au, netdev@oss.sgi.com Subject: Re: [PATCH] minor socket ioctl cleanup for 2.5.30 Message-ID: <20020808181717.P24631@parcelfarce.linux.theplanet.co.uk> References: <20020808170720.N24631@parcelfarce.linux.theplanet.co.uk> <200208081713.VAA02896@sex.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <200208081713.VAA02896@sex.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Thu, Aug 08, 2002 at 09:13:15PM +0400 X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Thu, Aug 08, 2002 at 09:13:15PM +0400, kuznet@ms2.inr.ac.ru wrote: > No, this is not true. This creepy ioctl is specific to TCP > (well, x.25 also uses SIGURG), which use kill*(sk->proc, SIGURG) directly. > > Probably, it is better to move sk->proc to TCP private data, > this ioctl to tcp_ioctl(). Or... 
find a way to get rid of this completely, > not breaking compatibility with a few BSDish applications. jamesm also has patches which remove sk->proc altogether and make TCP use the normal fasync methods. this ioctl then does an f_setown and most of the creepiness is gone. consider this patch a stepping-stone. -- Revolutions do not require corporate support. From owner-netdev@oss.sgi.com Thu Aug 8 10:25:51 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g78HPpRw020730 for ; Thu, 8 Aug 2002 10:25:51 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g78HPpAF020729 for netdev-outgoing; Thu, 8 Aug 2002 10:25:51 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from blackbird.intercode.com.au (blackbird.intercode.com.au [203.32.101.10]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g78HOTRw020714 for ; Thu, 8 Aug 2002 10:24:30 -0700 Received: from localhost (jmorris@localhost) by blackbird.intercode.com.au (8.9.3/8.9.3) with ESMTP id DAA20038; Fri, 9 Aug 2002 03:26:11 +1000 Date: Fri, 9 Aug 2002 03:26:10 +1000 (EST) From: James Morris To: Matthew Wilcox cc: kuznet@ms2.inr.ac.ru, , Subject: Re: [PATCH] minor socket ioctl cleanup for 2.5.30 In-Reply-To: <20020808181717.P24631@parcelfarce.linux.theplanet.co.uk> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-9.4 required=5.0 tests=IN_REP_TO,UNIFIED_PATCH version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Thu, 8 Aug 2002, Matthew Wilcox wrote: > On Thu, Aug 08, 2002 at 09:13:15PM +0400, kuznet@ms2.inr.ac.ru wrote: > > No, this is not true. This creepy ioctl is specific to TCP > > (well, x.25 also uses SIGURG), which use kill*(sk->proc, SIGURG) directly. > > > > Probably, it is better to move sk->proc to TCP private data, > > this ioctl to tcp_ioctl(). Or... 
find a way to get rid of this completely, > > not breaking compatibility with a few BSDish applications. > > jamesm also has patches which remove sk->proc altogether and make TCP > use the normal fasync methods. this ioctl then does an f_setown and > most of the creepiness is gone. consider this patch a stepping-stone. > Yes, this patch is complete (see below), but is waiting on another fix to go in (which I sent you recently). The unification of the ioctl code also now means SIOCSPGRP works exactly the same as F_SETOWN (i.e. it also works now for sigio). - James -- James Morris diff -urN -X dontdiff linux-2.5.27.orig/drivers/char/tty_io.c linux-2.5.27.w1/drivers/char/tty_io.c --- linux-2.5.27.orig/drivers/char/tty_io.c Wed Jul 17 19:10:44 2002 +++ linux-2.5.27.w1/drivers/char/tty_io.c Sun Jul 28 03:12:50 2002 @@ -1457,11 +1457,8 @@ if (on) { if (!waitqueue_active(&tty->read_wait)) tty->minimum_to_wake = 1; - if (filp->f_owner.pid == 0) { - filp->f_owner.pid = (-tty->pgrp) ? : current->pid; - filp->f_owner.uid = current->uid; - filp->f_owner.euid = current->euid; - } + if (filp->f_owner.pid == 0) + f_setown(filp, (-tty->pgrp) ? 
: current->pid); } else { if (!tty->fasync && !waitqueue_active(&tty->read_wait)) tty->minimum_to_wake = N_TTY_BUF_SIZE; diff -urN -X dontdiff linux-2.5.27.orig/drivers/net/tun.c linux-2.5.27.w1/drivers/net/tun.c --- linux-2.5.27.orig/drivers/net/tun.c Fri Jun 21 13:12:48 2002 +++ linux-2.5.27.w1/drivers/net/tun.c Sun Jul 28 03:30:05 2002 @@ -514,11 +514,8 @@ if (on) { tun->flags |= TUN_FASYNC; - if (!file->f_owner.pid) { - file->f_owner.pid = current->pid; - file->f_owner.uid = current->uid; - file->f_owner.euid = current->euid; - } + if (!file->f_owner.pid) + f_setown(file, current->pid); } else tun->flags &= ~TUN_FASYNC; diff -urN -X dontdiff linux-2.5.27.orig/fs/Makefile linux-2.5.27.w1/fs/Makefile --- linux-2.5.27.orig/fs/Makefile Wed Jul 17 19:10:45 2002 +++ linux-2.5.27.w1/fs/Makefile Sun Jul 28 03:24:30 2002 @@ -8,7 +8,7 @@ O_TARGET := fs.o export-objs := filesystems.o open.o dcache.o buffer.o bio.o inode.o dquot.o \ - mpage.o + mpage.o fcntl.o obj-y := open.o read_write.o devices.o file_table.o buffer.o \ bio.o super.o block_dev.o char_dev.o stat.o exec.o pipe.o \ diff -urN -X dontdiff linux-2.5.27.orig/fs/dnotify.c linux-2.5.27.w1/fs/dnotify.c --- linux-2.5.27.orig/fs/dnotify.c Tue Jun 18 01:20:45 2002 +++ linux-2.5.27.w1/fs/dnotify.c Sun Jul 28 03:12:11 2002 @@ -93,9 +93,7 @@ } prev = &odn->dn_next; } - filp->f_owner.pid = current->pid; - filp->f_owner.uid = current->uid; - filp->f_owner.euid = current->euid; + f_setown(filp, current->pid); dn->dn_mask = arg; dn->dn_fd = fd; dn->dn_filp = filp; diff -urN -X dontdiff linux-2.5.27.orig/fs/fcntl.c linux-2.5.27.w1/fs/fcntl.c --- linux-2.5.27.orig/fs/fcntl.c Sun Jul 21 10:57:51 2002 +++ linux-2.5.27.w1/fs/fcntl.c Sun Jul 28 18:23:03 2002 @@ -11,12 +11,12 @@ #include #include #include +#include #include #include #include -extern int sock_fcntl (struct file *, unsigned int cmd, unsigned long arg); extern int fcntl_setlease(unsigned int fd, struct file *filp, long arg); extern int fcntl_getlease(struct file 
*filp); @@ -259,6 +259,18 @@ return 0; } +long f_getown(struct file *filp) +{ + return filp->f_owner.pid; +} + +void f_setown(struct file *filp, unsigned long arg) +{ + filp->f_owner.pid = arg; + filp->f_owner.uid = current->uid; + filp->f_owner.euid = current->euid; +} + static long do_fcntl(unsigned int fd, unsigned int cmd, unsigned long arg, struct file * filp) { @@ -301,16 +313,12 @@ * current syscall conventions, the only way * to fix this will be in libc. */ - err = filp->f_owner.pid; + err = f_getown(filp); break; case F_SETOWN: - lock_kernel(); - filp->f_owner.pid = arg; - filp->f_owner.uid = current->uid; - filp->f_owner.euid = current->euid; err = 0; - if (S_ISSOCK (filp->f_dentry->d_inode->i_mode)) - err = sock_fcntl (filp, F_SETOWN, arg); + lock_kernel(); + f_setown(filp, arg); unlock_kernel(); break; case F_GETSIG: @@ -334,10 +342,6 @@ err = fcntl_dirnotify(fd, filp, arg); break; default: - /* sockets need a few special fcntls. */ - err = -EINVAL; - if (S_ISSOCK (filp->f_dentry->d_inode->i_mode)) - err = sock_fcntl (filp, cmd, arg); break; } @@ -400,15 +404,22 @@ POLLHUP | POLLERR /* POLL_HUP */ }; +static inline int sigio_perm(struct task_struct *p, + struct fown_struct *fown) +{ + return ((fown->euid == 0) || + (fown->euid == p->suid) || (fown->euid == p->uid) || + (fown->uid == p->suid) || (fown->uid == p->uid)); +} + static void send_sigio_to_task(struct task_struct *p, struct fown_struct *fown, int fd, int reason) { - if ((fown->euid != 0) && - (fown->euid ^ p->suid) && (fown->euid ^ p->uid) && - (fown->uid ^ p->suid) && (fown->uid ^ p->uid)) + if (!sigio_perm(p, fown)) return; + switch (fown->signum) { siginfo_t si; default: @@ -461,6 +472,35 @@ read_unlock(&tasklist_lock); } +static void send_sigurg_to_task(struct task_struct *p, + struct fown_struct *fown) +{ + if (sigio_perm(p, fown)) + send_sig(SIGURG, p, 1); +} + +void send_sigurg(struct fown_struct *fown) +{ + struct task_struct *p; + int pid = fown->pid; + + read_lock(&tasklist_lock); + 
if ((pid > 0) && (p = find_task_by_pid(pid))) { + send_sigurg_to_task(p, fown); + goto out; + } + for_each_task(p) { + int match = p->pid; + if (pid < 0) + match = -p->pgrp; + if (pid != match) + continue; + send_sigurg_to_task(p, fown); + } +out: + read_unlock(&tasklist_lock); +} + static rwlock_t fasync_lock = RW_LOCK_UNLOCKED; static kmem_cache_t *fasync_cache; @@ -544,3 +584,5 @@ } module_init(fasync_init) + +EXPORT_SYMBOL(f_setown); diff -urN -X dontdiff linux-2.5.27.orig/fs/locks.c linux-2.5.27.w1/fs/locks.c --- linux-2.5.27.orig/fs/locks.c Sat Jul 6 11:01:01 2002 +++ linux-2.5.27.w1/fs/locks.c Sun Jul 28 18:23:18 2002 @@ -1309,9 +1309,7 @@ fl->fl_next = *before; *before = fl; list_add(&fl->fl_link, &file_lock_list); - filp->f_owner.pid = current->pid; - filp->f_owner.uid = current->uid; - filp->f_owner.euid = current->euid; + f_setown(filp, current->pid); out_unlock: unlock_kernel(); return error; diff -urN -X dontdiff linux-2.5.27.orig/include/linux/fs.h linux-2.5.27.w1/include/linux/fs.h --- linux-2.5.27.orig/include/linux/fs.h Sun Jul 21 10:57:51 2002 +++ linux-2.5.27.w1/include/linux/fs.h Sun Jul 28 18:22:10 2002 @@ -604,6 +604,10 @@ /* only for net: no internal synchronization */ extern void __kill_fasync(struct fasync_struct *, int, int); +extern void send_sigurg(struct fown_struct *fown); +extern long f_getown(struct file *filp); +extern void f_setown(struct file *filp, unsigned long arg); + /* * Umount options */ diff -urN -X dontdiff linux-2.5.27.orig/include/net/inet_common.h linux-2.5.27.w1/include/net/inet_common.h --- linux-2.5.27.orig/include/net/inet_common.h Tue Aug 24 03:01:02 1999 +++ linux-2.5.27.w1/include/net/inet_common.h Sat Jul 27 01:40:08 2002 @@ -34,9 +34,6 @@ extern int inet_getsockopt(struct socket *sock, int level, int optname, char *optval, int *optlen); -extern int inet_fcntl(struct socket *sock, - unsigned int cmd, - unsigned long arg); extern int inet_listen(struct socket *sock, int backlog); extern void 
inet_sock_release(struct sock *sk); diff -urN -X dontdiff linux-2.5.27.orig/include/net/sock.h linux-2.5.27.w1/include/net/sock.h --- linux-2.5.27.orig/include/net/sock.h Tue Jun 18 01:19:54 2002 +++ linux-2.5.27.w1/include/net/sock.h Sat Jul 27 01:40:08 2002 @@ -132,7 +132,6 @@ unsigned char rcvtstamp; /* Hole of 1 byte. Try to pack. */ int route_caps; - int proc; unsigned long lingertime; int hashent; @@ -362,6 +361,7 @@ int *errcode); extern void *sock_kmalloc(struct sock *sk, int size, int priority); extern void sock_kfree_s(struct sock *sk, void *mem, int size); +extern int sk_send_sigurg(struct sock *sk); /* * Functions to fill in entries in struct proto_ops when a protocol diff -urN -X dontdiff linux-2.5.27.orig/kernel/futex.c linux-2.5.27.w1/kernel/futex.c --- linux-2.5.27.orig/kernel/futex.c Wed Jul 17 19:10:45 2002 +++ linux-2.5.27.w1/kernel/futex.c Sun Jul 28 15:02:28 2002 @@ -276,9 +276,7 @@ filp->f_dentry = dget(futex_mnt->mnt_root); if (signal) { - filp->f_owner.pid = current->tgid; - filp->f_owner.uid = current->uid; - filp->f_owner.euid = current->euid; + f_setown(filp, current->tgid); filp->f_owner.signum = signal; } diff -urN -X dontdiff linux-2.5.27.orig/net/core/sock.c linux-2.5.27.w1/net/core/sock.c --- linux-2.5.27.orig/net/core/sock.c Tue Jun 18 01:20:53 2002 +++ linux-2.5.27.w1/net/core/sock.c Sun Jul 28 15:02:53 2002 @@ -1048,34 +1048,6 @@ return -EOPNOTSUPP; } -/* - * Note: if you add something that sleeps here then change sock_fcntl() - * to do proper fd locking. - */ -int sock_no_fcntl(struct socket *sock, unsigned int cmd, unsigned long arg) -{ - struct sock *sk = sock->sk; - - switch(cmd) - { - case F_SETOWN: - /* - * This is a little restrictive, but it's the only - * way to make sure that you can't send a sigurg to - * another process. 
- */ - if (current->pgrp != -arg && - current->pid != arg && - !capable(CAP_KILL)) return(-EPERM); - sk->proc = arg; - return(0); - case F_GETOWN: - return(sk->proc); - default: - return(-EINVAL); - } -} - int sock_no_sendmsg(struct socket *sock, struct msghdr *m, int flags, struct scm_cookie *scm) { @@ -1177,6 +1149,15 @@ { if (sk->protinfo) kfree(sk->protinfo); +} + +int sk_send_sigurg(struct sock *sk) +{ + if (sk->socket && sk->socket->file && sk->socket->file->f_owner.pid) { + send_sigurg(&sk->socket->file->f_owner); + return 1; + } + return 0; } void sock_init_data(struct socket *sock, struct sock *sk) diff -urN -X dontdiff linux-2.5.27.orig/net/econet/af_econet.c linux-2.5.27.w1/net/econet/af_econet.c --- linux-2.5.27.orig/net/econet/af_econet.c Tue Jun 18 01:19:54 2002 +++ linux-2.5.27.w1/net/econet/af_econet.c Sun Jul 28 21:54:52 2002 @@ -643,21 +643,9 @@ static int econet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) { struct sock *sk = sock->sk; - int pid; switch(cmd) { - case FIOSETOWN: - case SIOCSPGRP: - if (get_user(pid, (int *) arg)) - return -EFAULT; - if (current->pid != pid && current->pgrp != -pid && !capable(CAP_NET_ADMIN)) - return -EPERM; - sk->proc = pid; - return(0); - case FIOGETOWN: - case SIOCGPGRP: - return put_user(sk->proc, (int *)arg); case SIOCGSTAMP: if(sk->stamp.tv_sec==0) return -ENOENT; diff -urN -X dontdiff linux-2.5.27.orig/net/ipv4/af_inet.c linux-2.5.27.w1/net/ipv4/af_inet.c --- linux-2.5.27.orig/net/ipv4/af_inet.c Tue Jun 18 01:21:13 2002 +++ linux-2.5.27.w1/net/ipv4/af_inet.c Sun Jul 28 22:01:28 2002 @@ -850,24 +850,8 @@ { struct sock *sk = sock->sk; int err = 0; - int pid; switch (cmd) { - case FIOSETOWN: - case SIOCSPGRP: - if (get_user(pid, (int *)arg)) - err = -EFAULT; - else if (current->pid != pid && - current->pgrp != -pid && - !capable(CAP_NET_ADMIN)) - err = -EPERM; - else - sk->proc = pid; - break; - case FIOGETOWN: - case SIOCGPGRP: - err = put_user(sk->proc, (int *)arg); - break; case 
SIOCGSTAMP: if (!sk->stamp.tv_sec) err = -ENOENT; diff -urN -X dontdiff linux-2.5.27.orig/net/ipv4/tcp_input.c linux-2.5.27.w1/net/ipv4/tcp_input.c --- linux-2.5.27.orig/net/ipv4/tcp_input.c Tue Jun 18 01:20:53 2002 +++ linux-2.5.27.w1/net/ipv4/tcp_input.c Sat Jul 27 01:40:08 2002 @@ -3093,13 +3093,8 @@ return; /* Tell the world about our new urgent pointer. */ - if (sk->proc != 0) { - if (sk->proc > 0) - kill_proc(sk->proc, SIGURG, 1); - else - kill_pg(-sk->proc, SIGURG, 1); + if (sk_send_sigurg(sk)) sk_wake_async(sk, 3, POLL_PRI); - } /* We may be adding urgent data when the last byte read was * urgent. To do this requires some care. We cannot just ignore diff -urN -X dontdiff linux-2.5.27.orig/net/ipv4/tcp_minisocks.c linux-2.5.27.w1/net/ipv4/tcp_minisocks.c --- linux-2.5.27.orig/net/ipv4/tcp_minisocks.c Tue Jun 18 01:19:54 2002 +++ linux-2.5.27.w1/net/ipv4/tcp_minisocks.c Sat Jul 27 01:40:08 2002 @@ -676,7 +676,6 @@ newsk->done = 0; newsk->userlocks = sk->userlocks & ~SOCK_BINDPORT_LOCK; - newsk->proc = 0; newsk->backlog.head = newsk->backlog.tail = NULL; newsk->callback_lock = RW_LOCK_UNLOCKED; skb_queue_head_init(&newsk->error_queue); diff -urN -X dontdiff linux-2.5.27.orig/net/ipv6/af_inet6.c linux-2.5.27.w1/net/ipv6/af_inet6.c --- linux-2.5.27.orig/net/ipv6/af_inet6.c Fri Jun 21 13:12:48 2002 +++ linux-2.5.27.w1/net/ipv6/af_inet6.c Sun Jul 28 22:06:54 2002 @@ -455,23 +455,9 @@ { struct sock *sk = sock->sk; int err = -EINVAL; - int pid; switch(cmd) { - case FIOSETOWN: - case SIOCSPGRP: - if (get_user(pid, (int *) arg)) - return -EFAULT; - /* see sock_no_fcntl */ - if (current->pid != pid && current->pgrp != -pid && - !capable(CAP_NET_ADMIN)) - return -EPERM; - sk->proc = pid; - return(0); - case FIOGETOWN: - case SIOCGPGRP: - return put_user(sk->proc,(int *)arg); case SIOCGSTAMP: if(sk->stamp.tv_sec==0) return -ENOENT; diff -urN -X dontdiff linux-2.5.27.orig/net/packet/af_packet.c linux-2.5.27.w1/net/packet/af_packet.c --- 
linux-2.5.27.orig/net/packet/af_packet.c Tue Jun 18 01:19:54 2002 +++ linux-2.5.27.w1/net/packet/af_packet.c Sun Jul 28 22:03:07 2002 @@ -1458,20 +1458,6 @@ spin_unlock_bh(&sk->receive_queue.lock); return put_user(amount, (int *)arg); } - case FIOSETOWN: - case SIOCSPGRP: { - int pid; - if (get_user(pid, (int *) arg)) - return -EFAULT; - if (current->pid != pid && current->pgrp != -pid && - !capable(CAP_NET_ADMIN)) - return -EPERM; - sk->proc = pid; - break; - } - case FIOGETOWN: - case SIOCGPGRP: - return put_user(sk->proc, (int *)arg); case SIOCGSTAMP: if(sk->stamp.tv_sec==0) return -ENOENT; diff -urN -X dontdiff linux-2.5.27.orig/net/socket.c linux-2.5.27.w1/net/socket.c --- linux-2.5.27.orig/net/socket.c Sat Jul 6 11:01:02 2002 +++ linux-2.5.27.w1/net/socket.c Sun Jul 28 22:08:37 2002 @@ -674,20 +674,36 @@ file, vector, count, tot_len); } -/* - * With an ioctl arg may well be a user mode pointer, but we don't know what to do - * with it - that's up to the protocol still. - */ - int sock_ioctl(struct inode *inode, struct file *file, unsigned int cmd, unsigned long arg) { struct socket *sock; - int err; + int err = 0; + int pid; unlock_kernel(); sock = SOCKET_I(inode); - err = sock->ops->ioctl(sock, cmd, arg); + + switch(cmd) { + case FIOSETOWN: + case SIOCSPGRP: + if (get_user(pid, (int *)arg)) + err = -EFAULT; + else + f_setown(sock->file, pid); + break; + + case FIOGETOWN: + case SIOCGPGRP: + pid = f_getown(sock->file); + err = put_user(pid, (int *)arg); + break; + + default: + err = sock->ops->ioctl(sock, cmd, arg); + + } + lock_kernel(); return err; @@ -1514,24 +1530,6 @@ sockfd_put(sock); out: return err; -} - - -/* - * Perform a file control on a socket file descriptor. - * - * Doesn't acquire a fd lock, because no network fcntl - * function sleeps currently. 
- */ - -int sock_fcntl(struct file *filp, unsigned int cmd, unsigned long arg) -{ - struct socket *sock; - - sock = SOCKET_I (filp->f_dentry->d_inode); - if (sock && sock->ops) - return sock_no_fcntl(sock, cmd, arg); - return(-EINVAL); } /* Argument list sizes for sys_socketcall */ diff -urN -X dontdiff linux-2.5.27.orig/net/wanrouter/af_wanpipe.c linux-2.5.27.w1/net/wanrouter/af_wanpipe.c --- linux-2.5.27.orig/net/wanrouter/af_wanpipe.c Tue Jun 18 01:19:39 2002 +++ linux-2.5.27.w1/net/wanrouter/af_wanpipe.c Sun Jul 28 21:56:01 2002 @@ -1867,23 +1867,9 @@ { struct sock *sk = sock->sk; int err; - int pid; switch(cmd) { - case FIOSETOWN: - case SIOCSPGRP: - err = get_user(pid, (int *) arg); - if (err) - return err; - if (current->pid != pid && current->pgrp != -pid && - !capable(CAP_NET_ADMIN)) - return -EPERM; - sk->proc = pid; - return(0); - case FIOGETOWN: - case SIOCGPGRP: - return put_user(sk->proc, (int *)arg); case SIOCGSTAMP: if(sk->stamp.tv_sec==0) return -ENOENT; diff -urN -X dontdiff linux-2.5.27.orig/net/x25/x25_in.c linux-2.5.27.w1/net/x25/x25_in.c --- linux-2.5.27.orig/net/x25/x25_in.c Tue Jun 18 01:19:10 2002 +++ linux-2.5.27.w1/net/x25/x25_in.c Sat Jul 27 01:40:08 2002 @@ -283,13 +283,8 @@ skb_queue_tail(&x25->interrupt_in_queue, skb); queued = 1; } - if (sk->proc != 0) { - if (sk->proc > 0) - kill_proc(sk->proc, SIGURG, 1); - else - kill_pg(-sk->proc, SIGURG, 1); + if (sk_send_sigurg(sk)) sock_wake_async(sk->socket, 3, POLL_PRI); - } x25_write_internal(sk, X25_INTERRUPT_CONFIRMATION); break; From owner-netdev@oss.sgi.com Thu Aug 8 10:40:12 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g78HeCRw020986 for ; Thu, 8 Aug 2002 10:40:12 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g78HeCAl020985 for netdev-outgoing; Thu, 8 Aug 2002 10:40:12 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from 
www.linux.org.uk (parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g78He0Rw020971 for ; Thu, 8 Aug 2002 10:40:01 -0700 Received: from willy by www.linux.org.uk with local (Exim 3.33 #5) id 17crIU-0000Oj-00; Thu, 08 Aug 2002 18:42:06 +0100 Date: Thu, 8 Aug 2002 18:42:06 +0100 From: Matthew Wilcox To: James Morris Cc: Matthew Wilcox , kuznet@ms2.inr.ac.ru, davem@redhat.com, netdev@oss.sgi.com Subject: Re: [PATCH] minor socket ioctl cleanup for 2.5.30 Message-ID: <20020808184206.Q24631@parcelfarce.linux.theplanet.co.uk> References: <20020808181717.P24631@parcelfarce.linux.theplanet.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: ; from jmorris@intercode.com.au on Fri, Aug 09, 2002 at 03:26:10AM +1000 X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Fri, Aug 09, 2002 at 03:26:10AM +1000, James Morris wrote: > Yes, this patch is complete (see below), but is waiting on another fix to > go in (which I sent you recently). The unification of the ioctl code also > now means SIOCSPGRP works exactly the same as F_SETOWN (i.e. it also works > now for sigio). this patch is for information purposes only, right? ;-) > diff -urN -X dontdiff linux-2.5.27.orig/fs/fcntl.c linux-2.5.27.w1/fs/fcntl.c > --- linux-2.5.27.orig/fs/fcntl.c Sun Jul 21 10:57:51 2002 > +++ linux-2.5.27.w1/fs/fcntl.c Sun Jul 28 18:23:03 2002 > @@ -259,6 +259,18 @@ > return 0; > } > > +long f_getown(struct file *filp) > +{ > + return filp->f_owner.pid; > +} i still think this function shouldn't exist -- a direct reference is no problem. 
> case F_SETOWN: > - lock_kernel(); > - filp->f_owner.pid = arg; > - filp->f_owner.uid = current->uid; > - filp->f_owner.euid = current->euid; > err = 0; > - if (S_ISSOCK (filp->f_dentry->d_inode->i_mode)) > - err = sock_fcntl (filp, F_SETOWN, arg); > + lock_kernel(); > + f_setown(filp, arg); > unlock_kernel(); this is now hooked by the LSM folks so patch will need to be updated for 2.5.30. > diff -urN -X dontdiff linux-2.5.27.orig/net/socket.c linux-2.5.27.w1/net/socket.c > --- linux-2.5.27.orig/net/socket.c Sat Jul 6 11:01:02 2002 > +++ linux-2.5.27.w1/net/socket.c Sun Jul 28 22:08:37 2002 > + switch(cmd) { > + case FIOSETOWN: > + case SIOCSPGRP: > + if (get_user(pid, (int *)arg)) > + err = -EFAULT; > + else > + f_setown(sock->file, pid); > + break; i'm pretty sure you need a lock_kernel + unlock_kernel around the else. if two people are doing a F_SETOWN / SIOCSPGRP at the same time, you could have a race. -- Revolutions do not require corporate support. From owner-netdev@oss.sgi.com Thu Aug 8 10:44:25 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g78HiPRw021112 for ; Thu, 8 Aug 2002 10:44:25 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g78HiOM8021111 for netdev-outgoing; Thu, 8 Aug 2002 10:44:24 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from blackbird.intercode.com.au (blackbird.intercode.com.au [203.32.101.10]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g78HiGRw021101 for ; Thu, 8 Aug 2002 10:44:17 -0700 Received: from localhost (jmorris@localhost) by blackbird.intercode.com.au (8.9.3/8.9.3) with ESMTP id DAA20157; Fri, 9 Aug 2002 03:46:05 +1000 Date: Fri, 9 Aug 2002 03:46:05 +1000 (EST) From: James Morris To: Matthew Wilcox cc: kuznet@ms2.inr.ac.ru, , Subject: Re: [PATCH] minor socket ioctl cleanup for 2.5.30 In-Reply-To: <20020808184206.Q24631@parcelfarce.linux.theplanet.co.uk> Message-ID: 
MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Thu, 8 Aug 2002, Matthew Wilcox wrote: > this patch is for information purposes only, right? ;-) Yes. > > +long f_getown(struct file *filp) > > +{ > > + return filp->f_owner.pid; > > +} > > i still think this function shouldn't exist -- a direct reference is > no problem. Agreed. > > > > + f_setown(filp, arg); > this is now hooked by the LSM folks so patch will need to be updated for > 2.5.30. Yep, a more recent version of this patch consolidates the LSM file_set_fowner hook into f_setown(). - James -- James Morris From owner-netdev@oss.sgi.com Fri Aug 9 13:12:35 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g79KCZRw019954 for ; Fri, 9 Aug 2002 13:12:35 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g79KCZ06019953 for netdev-outgoing; Fri, 9 Aug 2002 13:12:35 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from touchme.toronto.redhat.com (to-velocet.redhat.com [216.138.202.10]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g79KCSRw019943 for ; Fri, 9 Aug 2002 13:12:29 -0700 Received: from toomuch.toronto.redhat.com (toomuch.toronto.redhat.com [172.16.14.22]) by touchme.toronto.redhat.com (Postfix) with ESMTP id 69EBEB883E; Fri, 9 Aug 2002 16:14:39 -0400 (EDT) Received: (from bcrl@localhost) by toomuch.toronto.redhat.com (8.11.6/8.11.6) id g79KEdP12533; Fri, 9 Aug 2002 16:14:39 -0400 Date: Fri, 9 Aug 2002 16:14:39 -0400 From: Benjamin LaHaise To: davem@redhat.com, netdev@oss.sgi.com Subject: [patch] bug prematurely setting nr_frags Message-ID: <20020809161439.E10640@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i X-Spam-Status: No, hits=-5.0 
required=5.0 tests=UNIFIED_PATCH version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello Dave et al, The patch below fixes a case where nr_frags will be incorrectly set when an allocation fails in sock_alloc_send_pskb. This bug was found while trying to track down a problem that shows up as an oops attempting to free a page that comes from an uninitialized fragment entry in an skb, and this problem looks like a possible causes. Thanks goes to Stephen Tweedie for digging through the crash dump to find several key bits of data. -ben -- "You will be reincarnated as a toad; and you will be much happier." :r ~/patches/v2.4/v2.4.20-pre1-nr_frags.diff diff -urN v2.4.20-pre1/net/core/sock.c net-2.4.20-pre1/net/core/sock.c --- v2.4.20-pre1/net/core/sock.c Fri Aug 9 13:50:46 2002 +++ net-2.4.20-pre1/net/core/sock.c Fri Aug 9 15:46:46 2002 @@ -785,7 +785,6 @@ npages = (data_len + (PAGE_SIZE - 1)) >> PAGE_SHIFT; skb->truesize += data_len; - skb_shinfo(skb)->nr_frags = npages; for (i = 0; i < npages; i++) { struct page *page; skb_frag_t *frag; @@ -804,6 +803,9 @@ PAGE_SIZE : data_len); data_len -= PAGE_SIZE; + + /* frag[i] is now initialized */ + skb_shinfo(skb)->nr_frags = i + 1; } /* Full success... 
*/ From owner-netdev@oss.sgi.com Sat Aug 10 00:37:56 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7A7buRw032664 for ; Sat, 10 Aug 2002 00:37:56 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7A7bu8f032663 for netdev-outgoing; Sat, 10 Aug 2002 00:37:56 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from foonix.foonet.net (IDENT:root@foonix.foonet.net [216.207.29.74]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7A7bmRw032654 for ; Sat, 10 Aug 2002 00:37:48 -0700 Received: from foonix.foonet.net (IDENT:xerox@foonix.foonet.net [216.207.29.74]) by foonix.foonet.net (8.11.6/8.11.6) with ESMTP id g7A7e0710638 for ; Sat, 10 Aug 2002 03:40:01 -0400 Date: Sat, 10 Aug 2002 03:40:00 -0400 (EDT) From: CIT/FOONET Admin To: netdev@oss.sgi.com Subject: Serious weirdness w/ cbq and everything else Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=0.0 required=5.0 tests= version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk I just can't get this to shape right.. I'm using 2.4.19-rc1 but I tried other versions.. what happens is it does not shape to the bandwidth I set it at.. sometimes no where even near it.. I just don't get it.. i even tried changing my HZ to 1000 and trying low latency patches, etc.. What's the deal? 
My config is just something like this: DEV=eth0 tc qdisc del dev $DEV root 2> /dev/null > /dev/null tc qdisc add dev $DEV root handle 10:0 cbq bandwidth 100mbit avpkt 200 mpu 64 tc class add dev $DEV parent 10:0 classid 10:1 est 1sec 8sec cbq bandwidth 250Kbit rate 250Kbit \ weight 25Kbit allot 1500 prio 3 minburst 64 maxburst 256 avpkt 128 cell 8 bounded tc qdisc add dev $DEV parent 10:1 sfq quantum 1514b perturb 5 tc class add dev $DEV parent 10:0 classid 10:51 est 1sec 8sec cbq bandwidth 100mbit rate \ 100Mbit allot 1514 prio 3 maxburst 10 cell 8 avpkt 500 bounded tc qdisc add dev $DEV parent 10:51 tbf rate 1mbit buffer 2kb latency 100ms mtu 1514 then filters........ pointing to that.. i even added the rate estimator just recently and look what it shows: class cbq 10:1 parent 10: leaf 800d: rate 250000bps cell 8b (bounded) prio 3/3 weight 250bps allot 1514b level 0 ewma 5 avpkt 200b Sent 6294040846 bytes 20138257 pkts (dropped 6644, overlimits 21092797) rate 241446bps 555pps borrowed 0 overactions 762540 avgidle 0 undertime -3 that class is actually doing about 2Mbps not .241Mbps like the estimator says.. i tried even setting the bandwidth to 2500000 bps as you see and it still lets it go over 5Mbps for i have no idea why except that the estimator shows 400000bps when it's doing 5Mbps for some odd reason.. that doesn't add up at all!! Any ideas? 
Paul From owner-netdev@oss.sgi.com Sat Aug 10 03:44:47 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7AAilRw001907 for ; Sat, 10 Aug 2002 03:44:47 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7AAik6g001906 for netdev-outgoing; Sat, 10 Aug 2002 03:44:46 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from anakin.wychk.org (CPE-144-132-195-245.nsw.bigpond.net.au [144.132.195.245]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7AAiPRw001897 for ; Sat, 10 Aug 2002 03:44:29 -0700 Received: from anakin.wychk.org (localhost [127.0.0.1]) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) with ESMTP id g7AAOfES001216 for ; Sat, 10 Aug 2002 20:24:41 +1000 Received: (from glee@localhost) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) id g7AAOfVZ001215 for netdev@oss.sgi.com; Sat, 10 Aug 2002 20:24:41 +1000 Date: Sat, 10 Aug 2002 20:24:41 +1000 From: Geoffrey Lee To: netdev@oss.sgi.com Subject: [PATCH] connect() return value. Message-ID: <20020810102441.GA1126@anakin.wychk.org> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="gKMricLos+KVdGMg" Content-Disposition: inline User-Agent: Mutt/1.4i X-Spam-Status: No, hits=0.0 required=5.0 tests= version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk --gKMricLos+KVdGMg Content-Type: text/plain; charset=big5 Content-Disposition: inline Hi, I posted this on linux-kernel but didn't get a reply. Anyway here's a patch for it. Problem: (1) While you try to connect() to port 0 on ipv4 / ipv6, it happily does it. This doesn't seem to be correct, as I remember port 0 is reserved. For reference FreeBSD / Solaris / OSF return -EADDRNOTAVAIL for this. This is what my patch fixes, for ipv4 and ipv6 for both tcp and udp, I don't know what should be done for the other protocols that is supported. 
(2) connect() doesn't return -EINTR on signal, rather it exits the signal handler after processing it and returns an error if an error occurred for the connection or does the connect to completion. The Linux man page implies that we don't ever return an -EINTR on Linux, but I don't know what is the correct behavior. For reference, FreeBSD / Solaris / OSF return -EINTR for an interrupted connect(). The patch also addresses this. However, it is noted on OSF that even though connect() returns with -EINTR if the connection succeeds you can still send and receive through that socket. (Please CC:, not on list. Thanks). -- G. --gKMricLos+KVdGMg Content-Type: text/plain; charset=big5 Content-Disposition: attachment; filename=network-patch diff -ruNp linux-2.4.19/net/ipv4/tcp_ipv4.c linux-2.4.19-glee/net/ipv4/tcp_ipv4.c --- linux-2.4.19/net/ipv4/tcp_ipv4.c 2002-08-03 19:26:04.000000000 +1000 +++ linux-2.4.19-glee/net/ipv4/tcp_ipv4.c 2002-08-10 19:34:24.000000000 +1000 @@ -763,6 +763,9 @@ int tcp_v4_connect(struct sock *sk, stru if (usin->sin_family != AF_INET) return(-EAFNOSUPPORT); + if (usin->sin_port == 0) + return(-EADDRNOTAVAIL); + nexthop = daddr = usin->sin_addr.s_addr; if (sk->protinfo.af_inet.opt && sk->protinfo.af_inet.opt->srr) { if (daddr == 0) diff -ruNp linux-2.4.19/net/ipv4/udp.c linux-2.4.19-glee/net/ipv4/udp.c --- linux-2.4.19/net/ipv4/udp.c 2002-08-03 19:26:04.000000000 +1000 +++ linux-2.4.19-glee/net/ipv4/udp.c 2002-08-10 19:34:52.000000000 +1000 @@ -723,6 +723,9 @@ int udp_connect(struct sock *sk, struct if (usin->sin_family != AF_INET) return -EAFNOSUPPORT; + if (usin->sin_port == 0) + return -EADDRNOTAVAIL; + sk_dst_reset(sk); oif = sk->bound_dev_if; diff -ruNp linux-2.4.19/net/ipv6/tcp_ipv6.c linux-2.4.19-glee/net/ipv6/tcp_ipv6.c --- linux-2.4.19/net/ipv6/tcp_ipv6.c 2002-08-03 19:26:04.000000000 +1000 +++ linux-2.4.19-glee/net/ipv6/tcp_ipv6.c 2002-08-10 19:35:57.000000000 +1000 @@ -539,6 +539,9 @@ static int tcp_v6_connect(struct sock *s if 
(usin->sin6_family != AF_INET6) return(-EAFNOSUPPORT); + if (usin->sin6_port == 0) + return -EADDRNOTAVAIL; + fl.fl6_flowlabel = 0; if (np->sndflow) { fl.fl6_flowlabel = usin->sin6_flowinfo&IPV6_FLOWINFO_MASK; diff -ruNp linux-2.4.19/net/ipv6/udp.c linux-2.4.19-glee/net/ipv6/udp.c --- linux-2.4.19/net/ipv6/udp.c 2002-08-03 19:26:04.000000000 +1000 +++ linux-2.4.19-glee/net/ipv6/udp.c 2002-08-10 19:36:22.000000000 +1000 @@ -231,6 +231,9 @@ int udpv6_connect(struct sock *sk, struc if (usin->sin6_family != AF_INET6) return -EAFNOSUPPORT; + if (usin->sin6_port == 0) + return -EADDRNOTAVAIL; + fl.fl6_flowlabel = 0; if (np->sndflow) { fl.fl6_flowlabel = usin->sin6_flowinfo&IPV6_FLOWINFO_MASK; diff -ruNp linux-2.4.19/net/socket.c linux-2.4.19-glee/net/socket.c --- linux-2.4.19/net/socket.c 2002-08-03 19:26:04.000000000 +1000 +++ linux-2.4.19-glee/net/socket.c 2002-08-10 19:37:23.000000000 +1000 @@ -1118,6 +1118,9 @@ asmlinkage long sys_connect(int fd, stru out_put: sockfd_put(sock); out: + if (signal_pending(current)) + return -EINTR; + return err; } --gKMricLos+KVdGMg-- From owner-netdev@oss.sgi.com Sat Aug 10 07:26:29 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7AEQTRw004594 for ; Sat, 10 Aug 2002 07:26:29 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7AEQTIF004593 for netdev-outgoing; Sat, 10 Aug 2002 07:26:29 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from cyberus.ca (mail.cyberus.ca [216.191.240.111]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7AEQJRw004584 for ; Sat, 10 Aug 2002 07:26:20 -0700 Received: from shell.cyberus.ca (shell [216.191.240.114]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) 
with ESMTP id KAA21107; Sat, 10 Aug 2002 10:28:28 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.11.6+Sun/8.11.6) with ESMTP id g7AELuX03725; Sat, 10 Aug 2002 10:21:57 -0400 (EDT) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sat, 10 Aug 2002 10:21:56 -0400 (EDT) From: jamal To: Harald Welte cc: , Subject: Re: [RFC] Options for ECN target In-Reply-To: <20020806083541.L11828@sunbeam.de.gnumonks.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk The last 3 options are dangerous; although i am pretty sure it is too late to complain about it since you have released the code at least once. When ECN nonces comes into effect, it may become a non-issue (but would still make interesting effect). suggestion: get rid of them. cheers, jamal On Tue, 6 Aug 2002, Harald Welte wrote: > > There is one question left: How much flexibility do we want to give the user? > > The ECN target currently allows: > > --ecn-tcp-remove Remove CWR+ECE bits from TCP header. Should be used > on TCP syn packets to prevent ECN negotiation > > --ecn-ip-ect [0..3] Allows arbitrary setting of the ECT codepoint > --ecn-tcp-cwr [0|1] Allows setting or clearing the TCP CWR bit > --ecn-tcp-ece [0|1] Allows setting or clearing the TCP ECE bit > > > The first option is necessarry and is the primary use of the target. > The last three options are more experimental and would allow somebody > to play with 'simulated congestion' by setting the ECT in IP, etc. > > However, this is potentially very dangerous and I'm not sure if it was > a good idea to give this power directly to the user. > > Do you suggest removing the last three options and just keep the > --ecn-tcp-remove ? 
> > Thanks for your assistance, > > -- > Live long and prosper > - Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/ > ============================================================================ > GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M+ > V-- PS++ PE-- Y++ PGP++ t+ 5-- !X !R tv-- b+++ !DI !D G+ e* h--- r++ y+(*) > From owner-netdev@oss.sgi.com Sat Aug 10 07:38:41 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7AEcfRw004795 for ; Sat, 10 Aug 2002 07:38:41 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7AEcf6t004794 for netdev-outgoing; Sat, 10 Aug 2002 07:38:41 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from cyberus.ca (mail.cyberus.ca [216.191.240.111]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7AEcaRw004782 for ; Sat, 10 Aug 2002 07:38:36 -0700 Received: from shell.cyberus.ca (shell [216.191.240.114]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id KAA24251; Sat, 10 Aug 2002 10:40:50 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.11.6+Sun/8.11.6) with ESMTP id g7AEYJ403763; Sat, 10 Aug 2002 10:34:19 -0400 (EDT) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sat, 10 Aug 2002 10:34:19 -0400 (EDT) From: jamal To: Gabriel Paues cc: Subject: Re: TBF timing issues In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk UML will never work well. The issue is related to the timer effects in UML. I believe its more of a clock skew and innacuracy than TBF making assumptions. cheers, jamal On Thu, 8 Aug 2002, Gabriel Paues wrote: > Hello! > > I use the TBF filter with good results on my computer. 
When I set a rate > with the filter the traffic gets shaped accordingly. > > I have set up a couple of UML-instances (user-mode-linux) as a network > to test different QoS strategies. My problem is that the TBF calculates > the wrong rates. A UML-instance may get interrupted just like any other > program. Does the timing code in TBF presume that the kernel won't be > interrupted, and therefore generate the wrong rates? > > Regards, > > Gabriel Paues > From owner-netdev@oss.sgi.com Sun Aug 11 13:45:48 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7BKjmRw022603 for ; Sun, 11 Aug 2002 13:45:48 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7BKjmup022602 for netdev-outgoing; Sun, 11 Aug 2002 13:45:48 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7BKjhRw022591 for ; Sun, 11 Aug 2002 13:45:44 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id AAA16082; Mon, 12 Aug 2002 00:46:51 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208112046.AAA16082@sex.inr.ac.ru> Subject: Re: [PATCH] connect() return value. To: glee@gnupilgrims.ORG (Geoffrey Lee) Date: Mon, 12 Aug 2002 00:46:51 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <20020810102441.GA1126@anakin.wychk.org> from "Geoffrey Lee" at Aug 10, 2 03:15:01 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > connect() doesn't return -EINTR on signal, It does, when you ask not to restart after signal. > The patch also addresses this. It would be better if it did not. 
:-) Alexey From owner-netdev@oss.sgi.com Sun Aug 11 14:06:08 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7BL68Rw022880 for ; Sun, 11 Aug 2002 14:06:08 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7BL68Ys022879 for netdev-outgoing; Sun, 11 Aug 2002 14:06:08 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7BL63Rw022870 for ; Sun, 11 Aug 2002 14:06:04 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id BAA16136; Mon, 12 Aug 2002 01:07:33 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208112107.BAA16136@sex.inr.ac.ru> Subject: Re: [patch] bug prematurely setting nr_frags To: bcrl@redhat.COM (Benjamin LaHaise) Date: Mon, 12 Aug 2002 01:07:33 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <20020809161439.E10640@redhat.com> from "Benjamin LaHaise" at Aug 11, 2 02:45:07 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > that comes from an uninitialized fragment entry in an skb, and this problem > looks like a possible causes. This function is not used, by the way. 
Alexey From owner-netdev@oss.sgi.com Sun Aug 11 16:45:59 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7BNjxRw026102 for ; Sun, 11 Aug 2002 16:45:59 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7BNjx2U026101 for netdev-outgoing; Sun, 11 Aug 2002 16:45:59 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from anakin.wychk.org (CPE-144-132-195-245.nsw.bigpond.net.au [144.132.195.245]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7BNjkRw026092 for ; Sun, 11 Aug 2002 16:45:50 -0700 Received: from anakin.wychk.org (localhost [127.0.0.1]) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) with ESMTP id g7BNPkN5027238; Mon, 12 Aug 2002 09:25:46 +1000 Received: (from glee@localhost) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) id g7BNPkx1027237; Mon, 12 Aug 2002 09:25:46 +1000 Date: Mon, 12 Aug 2002 09:25:46 +1000 From: Geoffrey Lee To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Subject: Re: [PATCH] connect() return value. Message-ID: <20020811232546.GA27168@anakin.wychk.org> References: <20020810102441.GA1126@anakin.wychk.org> <200208112046.AAA16082@sex.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=big5 Content-Disposition: inline In-Reply-To: <200208112046.AAA16082@sex.inr.ac.ru> User-Agent: Mutt/1.4i X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Yo, On Mon, Aug 12, 2002 at 12:46:51AM +0400, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > connect() doesn't return -EINTR on signal, > > It does, when you ask not to restart after signal. > True. What I was getting at is the default behavior of an OS with interruptibility for connect(), i.e. 
using signal() instead of setting or clearing the SA_RESTART flag with sigaction(), though I will acknowledge that if you're using signal() you may be in for some unpleasant surprises with signal interruptibility anyway.

I've done a bit more testing. It seems that OSF1 (Digital UNIX) will not return -EINTR by default with signal(), contrary to what I previously stated. I must have been on drugs. However, Solaris (SunOS 5.6) does return it by default with signal(), and I wanted to know what is the "correct" behavior for connect() by default with signal().

So, the current standings for connect() interruptibility by default (i.e. with signal()) are:

FreeBSD 4.6-STABLE:        -EINTR
Digital UNIX (OSF1 V4.0):  To completion
SunOS 5.6:                 -EINTR
Linux (2.4):               To completion

>
> > The patch also addresses this.
>
> It would be better if it did not. :-)
>

I only patched it from what I observed with other Unices. :-) I patched it largely due to observing what FreeBSD does with its socket connect() call.

Because my patch addresses 2 issues, if you are going to apply the patch for the -EADDRNOTAVAIL issue, and you want a re-worked patch without the connect() -EINTR issue, please let me know. But it should be trivial to hand edit my previous patch posted.

--
G.
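As a rough illustration of the signal()/sigaction() distinction discussed above: POSIX leaves it to the platform whether a handler installed with signal() restarts interrupted system calls, whereas sigaction() lets the caller choose explicitly through SA_RESTART. The sketch below is an assumption-laden example, not code from this thread; on_signal(), install_handler() and handler_restarts() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical no-op handler; its only job is to interrupt a blocking call. */
static void on_signal(int signo)
{
	(void)signo;
}

/*
 * Install on_signal() for signo. With restart != 0 the kernel is asked to
 * restart interrupted system calls (SA_RESTART); with restart == 0 an
 * interrupted blocking call is expected to fail with EINTR instead.
 * Returns 0 on success, -1 on error (errno set by sigaction()).
 */
int install_handler(int signo, int restart)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_signal;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = restart ? SA_RESTART : 0;
	return sigaction(signo, &sa, NULL);
}

/*
 * Report whether the current disposition of signo has SA_RESTART set:
 * 1 if set, 0 if clear, -1 if the query fails.
 */
int handler_restarts(int signo)
{
	struct sigaction cur;

	if (sigaction(signo, NULL, &cur) < 0)
		return -1;
	return (cur.sa_flags & SA_RESTART) ? 1 : 0;
}
```

Calling install_handler(SIGALRM, 0) before a blocking connect() is one way to request the -EINTR behavior explicitly rather than relying on whatever signal() does on a given system. As the table above suggests, whether connect() then actually returns -EINTR still varies between Unices.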
From owner-netdev@oss.sgi.com Sun Aug 11 18:27:20 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7C1RKRw027918 for ; Sun, 11 Aug 2002 18:27:20 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7C1RKNf027917 for netdev-outgoing; Sun, 11 Aug 2002 18:27:20 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7C1RFRw027908 for ; Sun, 11 Aug 2002 18:27:15 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id FAA16791; Mon, 12 Aug 2002 05:28:38 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208120128.FAA16791@sex.inr.ac.ru> Subject: Re: [PATCH] connect() return value. To: glee@gnupilgrims.org (Geoffrey Lee) Date: Mon, 12 Aug 2002 05:28:38 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <20020811232546.GA27168@anakin.wychk.org> from "Geoffrey Lee" at Aug 12, 2 09:25:46 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > I wanted to know what > is the "correct" behavior for connect() by default with signal(). This simply does not matter, that's answer. Actually, it is not so easy to implement restartable connect(). I think this is the only reason why it not restartable in some OSes.
BTW could you make the following experiments on the same OSes: connect() on nonblocking socket, then repeat the connect() until it returns EISCONN. I guess the behaviour also will be different. > it should be trivial to hand edit my prevoius patch posted. Yes, of course. Though the first part of it also from the class "does not matter", it is worth to do this just for sanity. Alexey From owner-netdev@oss.sgi.com Mon Aug 12 05:19:24 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7CCJORw003320 for ; Mon, 12 Aug 2002 05:19:24 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7CCJNMj003319 for netdev-outgoing; Mon, 12 Aug 2002 05:19:23 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from mailFA11.rediffmail.com ([203.199.83.246]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7CCJERw003309 for ; Mon, 12 Aug 2002 05:19:17 -0700 Received: (qmail 15833 invoked by uid 510); 12 Aug 2002 12:21:01 -0000 Date: 12 Aug 2002 12:21:01 -0000 Message-ID: <20020812122101.15832.qmail@mailFA11.rediffmail.com> Received: from unknown (210.214.216.219) by rediffmail.com via HTTP; 12 aug 2002 12:21:01 -0000 MIME-Version: 1.0 From: "v.chitra" Reply-To: "v.chitra" To: davem@redhat.com Cc: ak@muc.de, kuznet@msz.inr.ac.ru.sgi.com, netdev@oss.sgi.com Subject: ipv6 problem Content-type: text/plain; format=flowed Content-Disposition: inline X-Spam-Status: No, hits=0.0 required=5.0 tests= version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk hye i am working in drdl. i am creating socket library for IPv6 in Linux OS. i did a sample program giving AF_INET6 to socket function. but when i compiled the program, it is saying that Family not supported. for compilation i gave the following command: g++ -o server server.cpp i want to know how to compile programs for IPv6. what are the linker options we have to give. 
since i month i am facing this problem. reply as soon as possible. thank you bye From owner-netdev@oss.sgi.com Mon Aug 12 12:19:45 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7CJJjRw023257 for ; Mon, 12 Aug 2002 12:19:45 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7CJJiTe023256 for netdev-outgoing; Mon, 12 Aug 2002 12:19:44 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7CJJcRw023247 for ; Mon, 12 Aug 2002 12:19:39 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id XAA18870; Mon, 12 Aug 2002 23:20:56 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208121920.XAA18870@sex.inr.ac.ru> Subject: Re: raw ipv6 broken in 2.4.19 To: gandalf@wlug.westbo.se (Martin Josefsson) Date: Mon, 12 Aug 2002 23:20:56 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <1029179259.791.31.camel@tux> from "Martin Josefsson" at Aug 12, 2 09:07:38 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Type: text/plain; charset=KOI8-R Content-Transfer-Encoding: 8bit X-Spam-Status: No, hits=-2.6 required=5.0 tests=IN_REP_TO,NO_REAL_NAME,PORN_12,PORN_3 version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! 
> raw ipv6 doesn't work in 2.4.19, Seems, more information is required because of: root@mops:~ # cat /proc/version Linux version 2.4.20-pre1-mops (root@mops) (gcc version 3.0.2 20010905 (ASPLinux 7.2 3.0.1-3)) #3370 ÷ÔÒ á×Ç 6 18:18:59 MSD 2002 root@mops:~ # traceroute6 sex traceroute to sex.inr.ac.ru (3ffe:2400:0:5:202:b3ff:fe90:d922) from 3ffe:2400:0:4:240:d0ff:fe16:a83d, 30 hops max, 16 byte packets 1 dust.inr.ac.ru (3ffe:2400:0:4:2c0:95ff:fee7:f562) 0.412 ms 0.195 ms 0.143 ms 2 sex.inr.ac.ru (3ffe:2400:0:5:202:b3ff:fe90:d922) 0.216 ms 0.199 ms 0.207 ms root@mops:~ # ping6 sex PING sex(sex.inr.ac.ru) 56 data bytes 64 bytes from sex.inr.ac.ru: icmp_seq=1 ttl=63 time=0.218 ms 64 bytes from sex.inr.ac.ru: icmp_seq=2 ttl=63 time=0.229 ms 64 bytes from sex.inr.ac.ru: icmp_seq=3 ttl=63 time=0.206 ms 64 bytes from sex.inr.ac.ru: icmp_seq=4 ttl=63 time=0.223 ms --- sex ping statistics --- 4 packets transmitted, 4 received, 0% loss, time 3030ms rtt min/avg/max/mdev = 0.206/0.219/0.229/0.008 ms root@mops:~ # So, please, make binary tcpdumps and tell me version of iputils, which you use and make straces of ping. 
Alexey From owner-netdev@oss.sgi.com Mon Aug 12 12:28:48 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7CJSmRw023401 for ; Mon, 12 Aug 2002 12:28:48 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7CJSmGo023400 for netdev-outgoing; Mon, 12 Aug 2002 12:28:48 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from netcore.fi (netcore.fi [193.94.160.1]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7CJSfRw023389 for ; Mon, 12 Aug 2002 12:28:42 -0700 Received: from localhost (pekkas@localhost) by netcore.fi (8.11.6/8.11.6) with ESMTP id g7CJUtI21922; Mon, 12 Aug 2002 22:30:56 +0300 Date: Mon, 12 Aug 2002 22:30:55 +0300 (EEST) From: Pekka Savola To: "v.chitra" cc: netdev@oss.sgi.com Subject: Re: ipv6 problem In-Reply-To: <20020812122101.15832.qmail@mailFA11.rediffmail.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk '/sbin/modprobe ipv6'. On 12 Aug 2002, v.chitra wrote: > hye > i am working in drdl. i am creating socket library for IPv6 in > Linux OS. > i did a sample program giving AF_INET6 to socket function. > but when i compiled the program, it is saying that Family not > supported. > for compilation i gave the following command: > > g++ -o server server.cpp > > i want to know how to compile programs for IPv6. what are the > linker options we have to give. since i month i am facing this > problem. > > reply as soon as possible. > thank you > bye > > -- Pekka Savola "Tell me of difficulties surmounted, Netcore Oy not those you stumble over and fall" Systems. Networks. Security. 
-- Robert Jordan: A Crown of Swords From owner-netdev@oss.sgi.com Mon Aug 12 12:43:00 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7CJh0Rw023640 for ; Mon, 12 Aug 2002 12:43:00 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7CJh0H1023639 for netdev-outgoing; Mon, 12 Aug 2002 12:43:00 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from tux.rsn.bth.se (postfix@tux.rsn.bth.se [194.47.143.135]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7CJgURw023628 for ; Mon, 12 Aug 2002 12:42:30 -0700 Received: by tux.rsn.bth.se (Postfix, from userid 501) id 2D4EB3701B; Mon, 12 Aug 2002 21:07:39 +0200 (CEST) Subject: raw ipv6 broken in 2.4.19 From: Martin Josefsson To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.7 Date: 12 Aug 2002 21:07:38 +0200 Message-Id: <1029179259.791.31.camel@tux> Mime-Version: 1.0 X-Spam-Status: No, hits=-5.0 required=5.0 tests=UNIFIED_PATCH version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hi Alexey, raw ipv6 doesn't work in 2.4.19, I've traced it to part of a patch that went in between -pre7 and -pre8 back in April 23. Reverting the patch below makes it work again in two machines. (part of changeset 1.383.17.3 in marcelos BK tree) The description of that part in the changeset is: "IPv6 raw had missing sk->filter handling and rawv6_rcv missing some checksum processing." The symptoms were that ping6 didn't work, it complained about: ping: recvmsg: No route to host but icmp echo-requests were sent out and icmp echo-replies were recieved. Ip6InDiscards increased for each icmp echo-reply recieved, but Ip6InDelivers also increased for each packet recieved. 
And traceroute6 returned bogus addresses most of the time (it was either the correct address or a bogus one but always the same bogus address independent of which ip the response came from)

Tested against Linux and OpenBSD with the same results.

--- 1.8/net/ipv6/raw.c	Thu Mar 14 00:46:57 2002
+++ 1.9/net/ipv6/raw.c	Tue Apr 23 04:13:30 2002
@@ -278,6 +278,16 @@
 
 static inline int rawv6_rcv_skb(struct sock * sk, struct sk_buff * skb)
 {
+#if defined(CONFIG_FILTER)
+	if (sk->filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
+		if ((unsigned short)csum_fold(skb_checksum(skb, 0, skb->len, skb->csum))) {
+			IP6_INC_STATS_BH(Ip6InDiscards);
+			kfree_skb(skb);
+			return 0;
+		}
+		skb->ip_summed = CHECKSUM_UNNECESSARY;
+	}
+#endif
 	/* Charge it to the socket. */
 	if (sock_queue_rcv_skb(sk,skb)<0) {
 		IP6_INC_STATS_BH(Ip6InDiscards);
@@ -298,9 +308,33 @@
  */
 int rawv6_rcv(struct sock *sk, struct sk_buff *skb)
 {
+	if (!sk->tp_pinfo.tp_raw.checksum)
+		skb->ip_summed = CHECKSUM_UNNECESSARY;
+
+	if (skb->ip_summed != CHECKSUM_UNNECESSARY) {
+		if (skb->ip_summed == CHECKSUM_HW) {
+			skb->ip_summed = CHECKSUM_UNNECESSARY;
+			if (csum_ipv6_magic(&skb->nh.ipv6h->saddr,
+					    &skb->nh.ipv6h->daddr,
+					    skb->len, sk->num, skb->csum)) {
+				NETDEBUG(if (net_ratelimit()) printk(KERN_DEBUG "raw v6 hw csum failure.\n"));
+				skb->ip_summed = CHECKSUM_NONE;
+			}
+		}
+		if (skb->ip_summed == CHECKSUM_NONE)
+			skb->csum = ~csum_ipv6_magic(&skb->nh.ipv6h->saddr,
+						     &skb->nh.ipv6h->daddr,
+						     skb->len, sk->num, 0);
+	}
+
 	if (sk->protinfo.af_inet.hdrincl) {
-		__skb_push(skb, skb->nh.raw - skb->data);
-		skb->h.raw = skb->nh.raw;
+		if (skb->ip_summed != CHECKSUM_UNNECESSARY &&
+		    (unsigned short)csum_fold(skb_checksum(skb, 0, skb->len, skb->csum))) {
+			IP6_INC_STATS_BH(Ip6InDiscards);
+			kfree_skb(skb);
+			return 0;
+		}
+		skb->ip_summed = CHECKSUM_UNNECESSARY;
 	}
 
 	rawv6_rcv_skb(sk, skb);
@@ -339,7 +373,17 @@
 		msg->msg_flags |= MSG_TRUNC;
 	}
 
-	err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+	if (skb->ip_summed==CHECKSUM_UNNECESSARY) {
+		err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+	} else if (msg->msg_flags&MSG_TRUNC) {
+		if ((unsigned short)csum_fold(skb_checksum(skb, 0, skb->len, skb->csum)))
+			goto csum_copy_err;
+		err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
+	} else {
+		err = skb_copy_and_csum_datagram_iovec(skb, 0, msg->msg_iov);
+		if (err == -EINVAL)
+			goto csum_copy_err;
+	}
 	if (err)
 		goto out_free;
 
@@ -366,6 +410,27 @@
 	skb_free_datagram(sk, skb);
 out:
 	return err;
+
+csum_copy_err:
+	/* Clear queue. */
+	if (flags&MSG_PEEK) {
+		int clear = 0;
+		spin_lock_irq(&sk->receive_queue.lock);
+		if (skb == skb_peek(&sk->receive_queue)) {
+			__skb_unlink(skb, &sk->receive_queue);
+			clear = 1;
+		}
+		spin_unlock_irq(&sk->receive_queue.lock);
+		if (clear)
+			kfree_skb(skb);
+	}
+
+	/* Error for blocking case is chosen to masquerade
+	   as some normal condition.
+	 */
+	err = (flags&MSG_DONTWAIT) ? -EAGAIN : -EHOSTUNREACH;
+	IP6_INC_STATS_USER(Ip6InDiscards);
+	goto out_free;
 }
 
 /*

--
/Martin

Never argue with an idiot. They drag you down to their level, then beat you with experience.
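The checks in the patch above all have the form csum_fold(skb_checksum(...)) != 0, which relies on a property of the 16-bit ones'-complement Internet checksum: summing data that already carries a correct checksum folds to zero. The following is only a user-space sketch of that arithmetic in the style of RFC 1071. The names csum_partial_sketch() and csum_fold_sketch() are hypothetical, they are not the kernel helpers, and the IPv6 pseudo-header that csum_ipv6_magic() contributes is left out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Add a buffer to a running sum as big-endian 16-bit words.
 * An odd trailing byte is padded with a zero byte on the right.
 * The 32-bit accumulator is adequate for packet-sized buffers only.
 */
uint32_t csum_partial_sketch(const uint8_t *data, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;
	return sum;
}

/* Fold the carries back into the low 16 bits and return the complement. */
uint16_t csum_fold_sketch(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

A receiver that gets a nonzero value back from the fold treats the packet as corrupt. In the patch, that is the path that bumps Ip6InDiscards, and in the blocking recvmsg() case it is reported as -EHOSTUNREACH, which matches the "No route to host" message described above.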
From owner-netdev@oss.sgi.com Mon Aug 12 14:00:15 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7CL0FRw024395 for ; Mon, 12 Aug 2002 14:00:15 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7CL0FsX024394 for netdev-outgoing; Mon, 12 Aug 2002 14:00:15 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from tux.rsn.bth.se (postfix@tux.rsn.bth.se [194.47.143.135]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7CL07Rw024385 for ; Mon, 12 Aug 2002 14:00:08 -0700 Received: by tux.rsn.bth.se (Postfix, from userid 501) id 252513700E; Mon, 12 Aug 2002 23:02:30 +0200 (CEST) Subject: Re: raw ipv6 broken in 2.4.19 From: Martin Josefsson To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com In-Reply-To: <200208121920.XAA18870@sex.inr.ac.ru> References: <200208121920.XAA18870@sex.inr.ac.ru> Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.7 Date: 12 Aug 2002 23:02:29 +0200 Message-Id: <1029186150.700.39.camel@tux> Mime-Version: 1.0 X-Spam-Status: No, hits=-4.8 required=5.0 tests=IN_REP_TO,SUPERLONG_LINE version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Mon, 2002-08-12 at 21:20, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > raw ipv6 doesn't work in 2.4.19, > > Seems, more information is required because of: > So, please, make binary tcpdumps and tell me version of iputils, > which you use and make straces of ping. I think I've narrowed it down to what's causing the problem now. I didn't think far enough before. It seems to be the tulip driver, if I use it things stop working but if I use an old ISA ne2k clone it works fine. The NIC is a D-Link DFE-570TX (quad tulip with "Digital DS21143 Tulip rev 65" for those who don't know) Both the tulip-driver in vanilla 2.4.19 (without NAPI patch) and the tulip-NAPI-011103 driver causes this. 
Sorry for pointing fingers at the ipv6 code; it was the most logical explanation to my brain (helped by the fact that I found something to back out that made it work again :)

I think the strace and tcpdumps are uninteresting now so I'm not attaching them, the tcpdump looked perfectly normal to me. But here is a small piece of the strace:

sendto(3, "\200\0\0\0\304\2\3\0Z\16X=\331%\17\0\10\t\n\v\f\r\16\17"..., 64, 0, {sin_family=AF_INET6, sin6_port=htons(58), inet_pton(AF_INET6, "3ffe:200:3d:1:2d0:b7ff:fe3f:b7b", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 64
recvmsg(3, 0xbfffea80, 0) = -1 EHOSTUNREACH (No route to host)
recvmsg(3, 0xbfffe800, MSG_ERRQUEUE|MSG_DONTWAIT) = -1 EAGAIN (Resource temporarily unavailable)
write(2, "ping: recvmsg: No route to host\n", 32ping: recvmsg: No route to host
) = 32

I'm going to take a look at the driver now but I don't really know what to look for.

--
/Martin

Never argue with an idiot. They drag you down to their level, then beat you with experience.

From owner-netdev@oss.sgi.com Mon Aug 12 16:05:25 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7CN5PRw026724 for ; Mon, 12 Aug 2002 16:05:25 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7CN5PKa026723 for netdev-outgoing; Mon, 12 Aug 2002 16:05:25 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from touchme.toronto.redhat.com (to-velocet.redhat.com [216.138.202.10]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7CN5LRw026714 for ; Mon, 12 Aug 2002 16:05:21 -0700 Received: from toomuch.toronto.redhat.com (toomuch.toronto.redhat.com [172.16.14.22]) by touchme.toronto.redhat.com (Postfix) with ESMTP id 0D636B8049; Mon, 12 Aug 2002 19:07:45 -0400 (EDT) Received: (from bcrl@localhost) by toomuch.toronto.redhat.com (8.11.6/8.11.6) id g7CN7j908815; Mon, 12 Aug 2002 19:07:45 -0400 Date: Mon, 12 Aug 2002 19:07:44 -0400 From: Benjamin
LaHaise To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Subject: Re: [patch] bug prematurely setting nr_frags Message-ID: <20020812190744.R1781@redhat.com> References: <20020809161439.E10640@redhat.com> <200208112107.BAA16136@sex.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <200208112107.BAA16136@sex.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Mon, Aug 12, 2002 at 01:07:33AM +0400 X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Mon, Aug 12, 2002 at 01:07:33AM +0400, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > that comes from an uninitialized fragment entry in an skb, and this problem > > looks like a possible causes. > > This function is not used, by the way. Huh? It's called from sock_alloc_send_skb, which is called from all over the stack. -ben -- "You will be reincarnated as a toad; and you will be much happier." 
From owner-netdev@oss.sgi.com Mon Aug 12 19:41:51 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7D2fpRw012175 for ; Mon, 12 Aug 2002 19:41:51 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7D2fphW012174 for netdev-outgoing; Mon, 12 Aug 2002 19:41:51 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from anakin.wychk.org (CPE-144-132-195-245.nsw.bigpond.net.au [144.132.195.245]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7D2fVRw012154 for ; Mon, 12 Aug 2002 19:41:31 -0700 Received: from anakin.wychk.org (localhost [127.0.0.1]) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) with ESMTP id g7D2LFN5012669; Tue, 13 Aug 2002 12:21:15 +1000 Received: (from glee@localhost) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) id g7D2LFDf012668; Tue, 13 Aug 2002 12:21:15 +1000 Date: Tue, 13 Aug 2002 12:21:15 +1000 From: Geoffrey Lee To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Subject: Re: [PATCH] connect() return value. Message-ID: <20020813022115.GA11627@anakin.wychk.org> References: <20020811232546.GA27168@anakin.wychk.org> <200208120128.FAA16791@sex.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=big5 Content-Disposition: inline In-Reply-To: <200208120128.FAA16791@sex.inr.ac.ru> User-Agent: Mutt/1.4i X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Mon, Aug 12, 2002 at 05:28:38AM +0400, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > I wanted to know what > > is the "correct" behavior for connect() by default with signal(). > > This simply does not matter, that's answer. > > Actually, it is not so easy to implement restartable connect(). > I think this is the only reason why it not restartable in some OSes. > True, probably. Solaris will not restart a connect() even with the SA_RESTART set in sa_flags. 
> BTW could you make the following experiments on the same OSes:
> connect() on nonblocking socket, then repeat the connect()
> until it returns EISCONN. I guess the behaviour also will be different.
>
>

In brief, it is what we expect. After we do a connect() on a non-blocking socket, it returns -EINPROGRESS. On subsequent attempts it returns -EALREADY, then after it is connected it returns -EISCONN. This behavior is consistent in all operating systems which I have tested, which include the following:

Digital UNIX (OSF1 4.0)
Solaris (SunOS 5.6)
Linux (2.4.18)

On a LAN, when we run the program for the first time, it is usually possible to see -EINPROGRESS, -EALREADY, then -EISCONN. On subsequent runs it normally returns -EINPROGRESS, then -EISCONN. This behavior can be explained by ARP interaction and is consistent with what we expect. On a WAN link, we see it return -EINPROGRESS, -EALREADY, then -EISCONN all of the time.

Not tested on BSD yet, but I expect the results to be consistent with my current findings. I can post the results later if you want them.

This is all fine, but what is interesting is what follows afterwards. It was noted in Richard Stevens' UNIX Network Programming that there are several ways to find out whether the connection has succeeded in the background. One way mentioned is to call read() with a length of 0. This method implies that read() must return an error for an in-progress socket, otherwise there would be no way to distinguish between an in-progress socket and a completed socket.

To test this, we write a program to connect to an open TCP port on a remote machine on the WAN (to maximize the delay). We set the socket to non-blocking, then we issue a TCP connect(). connect() will return immediately while the connection is attempted asynchronously in the background. After the return from connect() we immediately issue a read() with a length of 0, and we note the return value from read() and the errno (if applicable).
Finally, we issue a second connect() and we note the return value and errno (if applicable).

This is what happens on Digital UNIX (OSF1 4.0):

read() returns 0 with an errno of 36. 36 on Digital UNIX corresponds to -EINPROGRESS. The errno isn't from the read but from our first connect(). On second connect(), it returns -1, with an errno of -EALREADY.

On Solaris (SunOS 5.6):

read() returns -1 with an errno of ENOTCONN. On second connect() it returns -1 with an errno of EALREADY.

On Linux (v 2.4.18):

read() returns 0 with an errno of 115. On Linux 115 corresponds to -EINPROGRESS. So as with Digital UNIX, the errno isn't from read but from our first connect(). On second connect(), it returns -1 with an errno of -EALREADY.

Is using write() instead of read() a better way? To find out, we change our previous program: instead of reading 0 bytes, this time we issue a write() of 0 bytes.

On Solaris (SunOS 5.6):

write() returns -1 with an errno of -ENOTCONN.

On Digital UNIX (OSF1 4.0):

write() returns -1 with an errno of -ENOTCONN. We note that previously a read() returned 0. This behavior is inconsistent. There is a bug in Digital UNIX.

On Linux (v 2.4.18):

write() returns 0.

We note that Solaris and Linux are each consistent between read() and write() for an in-progress socket.

I guess the moral of the story is: don't try to use read() or write() to test whether a socket is connected. :-) It looks to me that issuing a second connect() and checking whether it returns -EISCONN is probably a more portable way. Of course, following up on the previous discussion, one must handle a -EALREADY from connect() as well.

> > it should be trivial to hand edit my previous patch posted.
>
> Yes, of course. Though the first part of it also from the class
> "does not matter", it is worth to do this just for sanity.
>

Yep. Though I would really like it if it conformed to what the rest of the Unices do with that behavior. So, we can agree that the first part should be applied on the next networking merge?
:-) -- G. From owner-netdev@oss.sgi.com Mon Aug 12 21:02:32 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7D42WRw014036 for ; Mon, 12 Aug 2002 21:02:32 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7D42WLn014035 for netdev-outgoing; Mon, 12 Aug 2002 21:02:32 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7D42QRw014025 for ; Mon, 12 Aug 2002 21:02:27 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id IAA20030; Tue, 13 Aug 2002 08:04:40 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208130404.IAA20030@sex.inr.ac.ru> Subject: Re: raw ipv6 broken in 2.4.19 To: gandalf@wlug.westbo.se (Martin Josefsson) Date: Tue, 13 Aug 2002 08:04:40 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <1029186150.700.39.camel@tux> from "Martin Josefsson" at Aug 12, 2 11:02:29 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > It seems to be the tulip driver, Apparently, it corrupts IPv6 packets. I have seen this bug, it was cured by tuning pci settings in bios setup to be more conservative. The most intersting thing was that only ICMP IPv6 were corrupted. :-) > attaching them, the tcpdump looked perfectly normal to me. I do not think so, packets should have wrong checksum. 
Alexey From owner-netdev@oss.sgi.com Mon Aug 12 21:04:00 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7D43xRw014146 for ; Mon, 12 Aug 2002 21:04:00 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7D43xKa014144 for netdev-outgoing; Mon, 12 Aug 2002 21:03:59 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7D43tRw014132 for ; Mon, 12 Aug 2002 21:03:56 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id IAA20038; Tue, 13 Aug 2002 08:06:13 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208130406.IAA20038@sex.inr.ac.ru> Subject: Re: [patch] bug prematurely setting nr_frags To: bcrl@redhat.com (Benjamin LaHaise) Date: Tue, 13 Aug 2002 08:06:13 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <20020812190744.R1781@redhat.com> from "Benjamin LaHaise" at Aug 12, 2 07:07:44 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > Huh? It's called from sock_alloc_send_skb, which is called from all > over the stack. It is used with data_len==0. The branch generating fragments is dead. 
Alexey From owner-netdev@oss.sgi.com Mon Aug 12 21:42:27 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7D4gRRw014782 for ; Mon, 12 Aug 2002 21:42:27 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7D4gRYb014781 for netdev-outgoing; Mon, 12 Aug 2002 21:42:27 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7D4gJRw014768 for ; Mon, 12 Aug 2002 21:42:20 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id IAA20082; Tue, 13 Aug 2002 08:44:21 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208130444.IAA20082@sex.inr.ac.ru> Subject: Re: [PATCH] connect() return value. To: glee@gnupilgrims.org (Geoffrey Lee) Date: Tue, 13 Aug 2002 08:44:21 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <20020813022115.GA11627@anakin.wychk.org> from "Geoffrey Lee" at Aug 13, 2 12:21:15 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > it returns -EISCONN. This behavior is consistent in all operating > systems which I have tested, which include the following: This is very strange. Linux should return success once. Probably, I should explain why I ask this. This is a necessary element of the implementation of a restartable connect(). If connect() is restarted after a signal, it should sometimes return success on a connected socket rather than -EISCONN. > On Linux (v 2.4.18): > > read() returns 0 with an errno of 115. On Linux 115 corresponds to > -EINPROGRESS. I am sorry, this machine is not running Linux. A zero-length read() always returns zero, regardless of the socket state. > Yep.
Though I would really like it if it conforms to what the rest of > the Unices does with that behavior. Non-sense. We are not going to emulate pathological cases. > So, we can agree Did we argue about this? :-) Alexey From owner-netdev@oss.sgi.com Mon Aug 12 22:57:32 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7D5vWRw015415 for ; Mon, 12 Aug 2002 22:57:32 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7D5vWiQ015414 for netdev-outgoing; Mon, 12 Aug 2002 22:57:32 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from anakin.wychk.org (CPE-144-132-195-245.nsw.bigpond.net.au [144.132.195.245]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7D5vHRw015405 for ; Mon, 12 Aug 2002 22:57:20 -0700 Received: from anakin.wychk.org (localhost [127.0.0.1]) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) with ESMTP id g7D5b4N5014890; Tue, 13 Aug 2002 15:37:04 +1000 Received: (from glee@localhost) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) id g7D5b4eJ014889; Tue, 13 Aug 2002 15:37:04 +1000 Date: Tue, 13 Aug 2002 15:37:04 +1000 From: Geoffrey Lee To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Subject: Re: [PATCH] connect() return value. Message-ID: <20020813053704.GA14788@anakin.wychk.org> References: <20020813022115.GA11627@anakin.wychk.org> <200208130444.IAA20082@sex.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=big5 Content-Disposition: inline In-Reply-To: <200208130444.IAA20082@sex.inr.ac.ru> User-Agent: Mutt/1.4i X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Tue, Aug 13, 2002 at 08:44:21AM +0400, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > it returns -EISCONN. This behavior is consistent in all operating > > systems which I have tested, which include the following: > > This is very strange. 
Linux should return success once. > > Probably, I should explain why I ask this. This is a necessary > element of implementation of restartable connect(). If connect() > is restarted after signal, it should return success on connected > socket sometimes rather than -EISCONN. > I see. Possibly, I should be a bit clearer, to make sure that I am not not confusing with what are you are asking for. The summaries so far, for Linux: (1) no signals (so it runs the sighandler but doesn't return from the sys call until it completes), non-blocking socket: -EINPROGRESS, -EALREADY, -EISCONN, in that order. Never returns 0. (2) SIGALRM, set for 1 second, (and reschedules another for 1 second), non-blocking -EINPROGRESS, -EALREADY, -EISCONN, in that order. We note that in -EALREADY a signal is processed. That is the SIGALRM handler. (3) SIGALRM, blocking: will run sighandler but connect() returns 0 once, running connect() again gives EISCONN. Is that what you expected? > > > On Linux (v 2.4.18): > > > > read() returns 0 with an errno of 115. On Linux 115 corresponds to > > -EINPROGRESS. > > I am sorry, this machine is not running Linux. > > Zero length read() always returns zero, not depending of socket state. > > It doesn't seem to be standardized what one should return. So read() / write() is not a portable way to test if a socket has been connected with non-blocking connect. -- G. 
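
For reference, a minimal sketch of an approach that is not discussed in this thread but is commonly used instead of read(), write() or a second connect(): wait until the socket becomes writable, then read the pending error with getsockopt(SO_ERROR). The helper name connect_status() and its return convention are assumptions made for illustration only.

```c
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: report the outcome of a non-blocking connect().
 * Returns 0 if the socket is connected, -1 if the connect is still in
 * progress after timeout_ms milliseconds, and otherwise a positive
 * errno value describing why the connect (or the check itself) failed.
 */
static int connect_status(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int soerr = 0;
    socklen_t len = sizeof(soerr);
    int n;

    pfd.fd = fd;
    pfd.events = POLLOUT;   /* writable means the connect has finished */
    pfd.revents = 0;

    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);

    if (n < 0)
        return errno;       /* poll() itself failed */
    if (n == 0)
        return -1;          /* nothing happened yet: still in progress */

    /* SO_ERROR holds 0 on success, or the errno of the failed connect. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &soerr, &len) < 0)
        return errno;
    return soerr;
}
```

A caller would issue the non-blocking connect() once, treat -EINPROGRESS as "pending", and then call connect_status(fd, timeout) until it returns something other than -1.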
From owner-netdev@oss.sgi.com Tue Aug 13 03:15:31 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DAFVRw029287 for ; Tue, 13 Aug 2002 03:15:31 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DAFVeb029286 for netdev-outgoing; Tue, 13 Aug 2002 03:15:31 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DAFORw029277 for ; Tue, 13 Aug 2002 03:15:25 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id OAA20516; Tue, 13 Aug 2002 14:17:41 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208131017.OAA20516@sex.inr.ac.ru> Subject: Re: [PATCH] connect() return value. To: glee@gnupilgrims.org (Geoffrey Lee) Date: Tue, 13 Aug 2002 14:17:41 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <20020813053704.GA14788@anakin.wychk.org> from "Geoffrey Lee" at Aug 13, 2 03:37:04 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > (1) no signals (so it runs the sighandler but doesn't return from the > sys call until it completes), non-blocking socket: > > > -EINPROGRESS, -EALREADY, -EISCONN, in that order. Never returns 0. 
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3 fcntl64(0x3, 0x4, 0x800, 0x400151cc) = 0 connect(3, {sin_family=AF_INET, sin_port=htons(9), sin_addr=inet_addr("193.233.7.75")}}, 16) = -1 EINPROGRESS (Operation now in progress) connect(3, {sin_family=AF_INET, sin_port=htons(9), sin_addr=inet_addr("193.233.7.75")}}, 16) = -1 EALREADY (Operation already in progress) connect(3, {sin_family=AF_INET, sin_port=htons(9), sin_addr=inet_addr("193.233.7.75")}}, 16) = -1 EALREADY (Operation already in progress) connect(3, {sin_family=AF_INET, sin_port=htons(9), sin_addr=inet_addr("193.233.7.75")}}, 16) = -1 EALREADY (Operation already in progress) connect(3, {sin_family=AF_INET, sin_port=htons(9), sin_addr=inet_addr("193.233.7.75")}}, 16) = -1 EALREADY (Operation already in progress) connect(3, {sin_family=AF_INET, sin_port=htons(9), sin_addr=inet_addr("193.233.7.75")}}, 16) = 0 connect(3, {sin_family=AF_INET, sin_port=htons(9), sin_addr=inet_addr("193.233.7.75")}}, 16) = -1 EISCONN (Transport endpoint is already connected) > Is that what you expected? Well, it proves that one of us makes tests not on linux. 
Alexey From owner-netdev@oss.sgi.com Tue Aug 13 04:55:14 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DBtDRw031549 for ; Tue, 13 Aug 2002 04:55:13 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DBtDsP031548 for netdev-outgoing; Tue, 13 Aug 2002 04:55:13 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from anakin.wychk.org (CPE-144-132-195-245.nsw.bigpond.net.au [144.132.195.245]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DBt6Rw031539 for ; Tue, 13 Aug 2002 04:55:09 -0700 Received: from anakin.wychk.org (localhost [127.0.0.1]) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) with ESMTP id g7DBYoN5019112; Tue, 13 Aug 2002 21:34:50 +1000 Received: (from glee@localhost) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) id g7DBYo3o019111; Tue, 13 Aug 2002 21:34:50 +1000 Date: Tue, 13 Aug 2002 21:34:50 +1000 From: Geoffrey Lee To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Subject: Re: [PATCH] connect() return value. Message-ID: <20020813113450.GA19058@anakin.wychk.org> References: <20020813053704.GA14788@anakin.wychk.org> <200208131017.OAA20516@sex.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=big5 Content-Disposition: inline In-Reply-To: <200208131017.OAA20516@sex.inr.ac.ru> User-Agent: Mutt/1.4i X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk > > Is that what you expected? > > Well, it proves that one of us makes tests not on linux. > Ah, yes you're right. I was feeling stupid and I forgot to clear the errno before re-trying the connect(). It looks like my brain has gone mouldy. Sorry. :-) -- G.
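
The errno pitfall admitted above can be sketched as follows. errno is only meaningful immediately after a call that reported failure; a successful call is not required to reset it, so a stale value from an earlier connect() can be mistaken for a fresh error. The struct and function names here are hypothetical and exist only for this illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical pair: the return value plus the errno that belongs to it. */
struct call_result {
    int ret;    /* what connect() returned */
    int err;    /* errno if ret < 0, otherwise 0 */
};

/*
 * Call connect() and capture errno only when the call actually failed.
 * A success is reported with err == 0 no matter what errno held before.
 */
static struct call_result checked_connect(int fd, const struct sockaddr *sa,
                                          socklen_t len)
{
    struct call_result r;

    r.ret = connect(fd, sa, len);
    r.err = (r.ret < 0) ? errno : 0;
    return r;
}
```

With this shape, a loop that retries a non-blocking connect() can print r.ret and r.err together and cannot pick up an -EINPROGRESS left over from the first attempt.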
From owner-netdev@oss.sgi.com Tue Aug 13 05:02:59 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DC2rRw031757 for ; Tue, 13 Aug 2002 05:02:59 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DC2r7R031756 for netdev-outgoing; Tue, 13 Aug 2002 05:02:53 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from tux.rsn.bth.se (postfix@tux.rsn.bth.se [194.47.143.135]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DC2TRw031746 for ; Tue, 13 Aug 2002 05:02:30 -0700 Received: by tux.rsn.bth.se (Postfix, from userid 501) id 19EB836FC9; Tue, 13 Aug 2002 13:32:36 +0200 (CEST) Subject: Re: raw ipv6 broken in 2.4.19 From: Martin Josefsson To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com In-Reply-To: <200208130404.IAA20030@sex.inr.ac.ru> References: <200208130404.IAA20030@sex.inr.ac.ru> Content-Type: multipart/mixed; boundary="=-ayFoqu3ajXYTjlIN3X1s" X-Mailer: Ximian Evolution 1.0.7 Date: 13 Aug 2002 13:32:36 +0200 Message-Id: <1029238356.785.7.camel@tux> Mime-Version: 1.0 X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk --=-ayFoqu3ajXYTjlIN3X1s Content-Type: text/plain Content-Transfer-Encoding: 7bit On Tue, 2002-08-13 at 06:04, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > It seems to be the tulip driver, > > Apparently, it corrupts IPv6 packets. I have seen this bug, > it was cured by tuning pci settings in bios setup to be more conservative. > The most intersting thing was that only ICMP IPv6 were corrupted. :-) Eek! I have this problem in two machines, one SMP and one UP, both with i440BX chipset. I've tried setting the BIOS settings as conservative as possible in one machine but it didn't help. Will try the other one a little later. I've heard something about a PCI posting bug in the tulip hardware? could this have anything to do with that? 
> > attaching them, the tcpdump looked perfectly normal to me. > > I do not think so, packets should have wrong checksum. I've attached a small dump. captured while trying to ping an OpenBSD machine. I also found that ping6 -s 37 and lower works fine but -s 38 and higher doesn't. -- /Martin Never argue with an idiot. They drag you down to their level, then beat you with experience. --=-ayFoqu3ajXYTjlIN3X1s Content-Disposition: attachment; filename=dumpfile Content-Type: application/octet-stream; name=dumpfile Content-Transfer-Encoding: base64 1MOyoQIABAAAAAAAAAAAAP//AAABAAAATexYPYK+BQB2AAAAdgAAAADQtz8LewCAyMpNmYbdYAAA AABAOkA//gIAAD0AAQKAyP/+yk2ZP/4CAAA9AAEC0Lf//j8Le4AATq7mBAAATexYPVO+BQAICQoL DA0ODxAREhMUFRYXGBkaGxwdHh8gISIjJCUmJygpKissLS4vMDEyMzQ1NjdN7Fg9x8IFAHYAAAB2 AAAAAIDIyk2ZANC3Pwt7ht1gAAAAAEA6QD/+AgAAPQABAtC3//4/C3s//gIAAD0AAQKAyP/+yk2Z gQBNruYEAABN7Fg9U74FAAgJCgsMDQ4PEBESExQVFhcYGRobHB0eHyAhIiMkJSYnKCkqKywtLi8w MTIzNDU2N07sWD2cuAUAdgAAAHYAAAAA0Lc/C3sAgMjKTZmG3WAAAAAAQDpAP/4CAAA9AAECgMj/ /spNmT/+AgAAPQABAtC3//4/C3uAAA605gQBAE7sWD2RuAUACAkKCwwNDg8QERITFBUWFxgZGhsc HR4fICEiIyQlJicoKSorLC0uLzAxMjM0NTY3TuxYPZjABQB2AAAAdgAAAACAyMpNmQDQtz8Le4bd YAAAAABAOkA//gIAAD0AAQLQt//+Pwt7P/4CAAA9AAECgMj//spNmYEADbTmBAEATuxYPZG4BQAI CQoLDA0ODxAREhMUFRYXGBkaGxwdHh8gISIjJCUmJygpKissLS4vMDEyMzQ1NjdP7Fg91bcFAHYA AAB2AAAAANC3Pwt7AIDIyk2Zht1gAAAAAEA6QD/+AgAAPQABAoDI//7KTZk//gIAAD0AAQLQt//+ Pwt7gADStOYEAgBP7Fg9y7cFAAgJCgsMDQ4PEBESExQVFhcYGRobHB0eHyAhIiMkJSYnKCkqKywt Li8wMTIzNDU2N0/sWD1YvwUAdgAAAHYAAAAAgMjKTZkA0Lc/C3uG3WAAAAAAQDpAP/4CAAA9AAEC 0Lf//j8Lez/+AgAAPQABAoDI//7KTZmBANG05gQCAE/sWD3LtwUACAkKCwwNDg8QERITFBUWFxgZ GhscHR4fICEiIyQlJicoKSorLC0uLzAxMjM0NTY3UOxYPRi3BQB2AAAAdgAAAADQtz8LewCAyMpN mYbdYAAAAABAOkA//gIAAD0AAQKAyP/+yk2ZP/4CAAA9AAEC0Lf//j8Le4AAjbXmBAMAUOxYPQ63 BQAICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8gISIjJCUmJygpKissLS4vMDEyMzQ1NjdQ7Fg9P74F AHYAAAB2AAAAAIDIyk2ZANC3Pwt7ht1gAAAAAEA6QD/+AgAAPQABAtC3//4/C3s//gIAAD0AAQKA 
yP/+yk2ZgQCMteYEAwBQ7Fg9DrcFAAgJCgsMDQ4PEBESExQVFhcYGRobHB0eHyAhIiMkJSYnKCkq KywtLi8wMTIzNDU2N1HsWD1gtgUAdgAAAHYAAAAA0Lc/C3sAgMjKTZmG3WAAAAAAQDpAP/4CAAA9 AAECgMj//spNmT/+AgAAPQABAtC3//4/C3uAAEO25gQEAFHsWD1WtgUACAkKCwwNDg8QERITFBUW FxgZGhscHR4fICEiIyQlJicoKSorLC0uLzAxMjM0NTY3UexYPRS5BQB2AAAAdgAAAACAyMpNmQDQ tz8Le4bdYAAAAABAOkA//gIAAD0AAQLQt//+Pwt7P/4CAAA9AAECgMj//spNmYEAQrbmBAQAUexY PVa2BQAICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8gISIjJCUmJygpKissLS4vMDEyMzQ1NjdS7Fg9 kLUFAFYAAABWAAAAANC3Pwt7AIDIyk2Zht1gAAAAACA6//6AAAAAAAAAAoDI//7KTZk//gIAAD0A AQLQt//+Pwt7hwA8zQAAAAA//gIAAD0AAQLQt//+Pwt7AQEAgMjKTZlS7Fg9BrYFAHYAAAB2AAAA ANC3Pwt7AIDIyk2Zht1gAAAAAEA6QD/+AgAAPQABAoDI//7KTZk//gIAAD0AAQLQt//+Pwt7gAC4 tuYEBQBS7Fg937UFAAgJCgsMDQ4PEBESExQVFhcYGRobHB0eHyAhIiMkJSYnKCkqKywtLi8wMTIz NDU2N1LsWD2OvQUATgAAAE4AAAAAgMjKTZkA0Lc/C3uG3WAAAAAAGDr//oAAAAAAAAAC0Lf//j8L e/6AAAAAAAAAAoDI//7KTZmIAFd1QAAAAD/+AgAAPQABAtC3//4/C3tS7Fg9lb0FAHYAAAB2AAAA AIDIyk2ZANC3Pwt7ht1gAAAAAEA6QD/+AgAAPQABAtC3//4/C3s//gIAAD0AAQKAyP/+yk2ZgQC3 tuYEBQBS7Fg937UFAAgJCgsMDQ4PEBESExQVFhcYGRobHB0eHyAhIiMkJSYnKCkqKywtLi8wMTIz NDU2N1LsWD3ZqQoAVgAAAFYAAAAAgMjKTZkA0Lc/C3uG3WAAAAAAIDr/P/4CAAA9AAEC0Lf//j8L ez/+AgAAPQABAoDI//7KTZmHAPkRAAAAAD/+AgAAPQABAoDI//7KTZkBAQDQtz8Le1LsWD32qQoA VgAAAFYAAAAA0Lc/C3sAgMjKTZmG3WAAAAAAIDr/P/4CAAA9AAECgMj//spNmT/+AgAAPQABAtC3 //4/C3uIAEO4YAAAAD/+AgAAPQABAoDI//7KTZkCAQCAyMpNmVPsWD3ctAUAdgAAAHYAAAAA0Lc/ C3sAgMjKTZmG3WAAAAAAQDpAP/4CAAA9AAECgMj//spNmT/+AgAAPQABAtC3//4/C3uAAMK35gQG AFPsWD3TtAUACAkKCwwNDg8QERITFBUWFxgZGhscHR4fICEiIyQlJicoKSorLC0uLzAxMjM0NTY3 U+xYPba9BQB2AAAAdgAAAACAyMpNmQDQtz8Le4bdYAAAAABAOkA//gIAAD0AAQLQt//+Pwt7P/4C AAA9AAECgMj//spNmYEAwbfmBAYAU+xYPdO0BQAICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8gISIj JCUmJygpKissLS4vMDEyMzQ1NjdU7Fg9JLQFAHYAAAB2AAAAANC3Pwt7AIDIyk2Zht1gAAAAAEA6 QD/+AgAAPQABAoDI//7KTZk//gIAAD0AAQLQt//+Pwt7gAB3uOYEBwBU7Fg9HLQFAAgJCgsMDQ4P EBESExQVFhcYGRobHB0eHyAhIiMkJSYnKCkqKywtLi8wMTIzNDU2N1TsWD01uAUAdgAAAHYAAAAA 
gMjKTZkA0Lc/C3uG3WAAAAAAQDpAP/4CAAA9AAEC0Lf//j8Lez/+AgAAPQABAoDI//7KTZmBAHa4 5gQHAFTsWD0ctAUACAkKCwwNDg8QERITFBUWFxgZGhscHR4fICEiIyQlJicoKSorLC0uLzAxMjM0 NTY3VexYPWizBQB2AAAAdgAAAADQtz8LewCAyMpNmYbdYAAAAABAOkA//gIAAD0AAQKAyP/+yk2Z P/4CAAA9AAEC0Lf//j8Le4AAMrnmBAgAVexYPV+zBQAICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8g ISIjJCUmJygpKissLS4vMDEyMzQ1NjdV7Fg9RLsFAHYAAAB2AAAAAIDIyk2ZANC3Pwt7ht1gAAAA AEA6QD/+AgAAPQABAtC3//4/C3s//gIAAD0AAQKAyP/+yk2ZgQAxueYECABV7Fg9X7MFAAgJCgsM DQ4PEBESExQVFhcYGRobHB0eHyAhIiMkJSYnKCkqKywtLi8wMTIzNDU2Nw== --=-ayFoqu3ajXYTjlIN3X1s-- From owner-netdev@oss.sgi.com Tue Aug 13 06:28:51 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DDSpRw003444 for ; Tue, 13 Aug 2002 06:28:51 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DDSpRp003443 for netdev-outgoing; Tue, 13 Aug 2002 06:28:51 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DDSkRw003432 for ; Tue, 13 Aug 2002 06:28:46 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id RAA20947; Tue, 13 Aug 2002 17:30:55 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208131330.RAA20947@sex.inr.ac.ru> Subject: Re: raw ipv6 broken in 2.4.19 To: gandalf@wlug.westbo.se (Martin Josefsson) Date: Tue, 13 Aug 2002 17:30:55 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <1029238356.785.7.camel@tux> from "Martin Josefsson" at Aug 13, 2 01:32:36 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > I've heard something about a PCI posting bug in the tulip hardware? > could this have anything to do with that? No ideas. Eventually I moved that card to an ancient Neptunand it works flawlessly there. 
> I've attached a small dump. captured while trying to ping an OpenBSD It looks boringly correct, indeed. Was it really taken on the faulting machine? Alexey From owner-netdev@oss.sgi.com Tue Aug 13 06:45:03 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DDj3Rw005146 for ; Tue, 13 Aug 2002 06:45:03 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DDj3cZ005145 for netdev-outgoing; Tue, 13 Aug 2002 06:45:03 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DDixRw005133 for ; Tue, 13 Aug 2002 06:45:00 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id RAA21080; Tue, 13 Aug 2002 17:47:14 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208131347.RAA21080@sex.inr.ac.ru> Subject: Re: [PATCH] connect() return value. To: glee@gnupilgrims.org (Geoffrey Lee) Date: Tue, 13 Aug 2002 17:47:14 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <20020813113450.GA19058@anakin.wychk.org> from "Geoffrey Lee" at Aug 13, 2 09:34:50 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > Ah, yes you're right. I was feeling stupid and I forgot to clear the > errno before re-trying the connect(). Good. :-) Could you repeat the corrected test with repeated nonblocking connect() on those OSes?
Alexey From owner-netdev@oss.sgi.com Tue Aug 13 06:56:51 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DDupRw005334 for ; Tue, 13 Aug 2002 06:56:51 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DDuouH005333 for netdev-outgoing; Tue, 13 Aug 2002 06:56:50 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DDujRw005324 for ; Tue, 13 Aug 2002 06:56:46 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id RAA21125; Tue, 13 Aug 2002 17:58:58 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208131358.RAA21125@sex.inr.ac.ru> Subject: Re: raw ipv6 broken in 2.4.19 To: gandalf@wlug.westbo.se (Martin Josefsson) Date: Tue, 13 Aug 2002 17:58:58 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <1029246435.1135.22.camel@tux> from "Martin Josefsson" at Aug 13, 2 03:47:15 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > If the packets recieved is really corrupted wouldn't something have > complained before that patch was merged in 2.4.19-pre8 ? No. We simply forgot to verify checksum on raw packets. :-) > if the ipv6 header was corrupt it would have been dropped? No, ipv6 has no checksum on header. Corruptions there will remain unnoticed. > if the icmp packet was corrupt, wouldn't ping6 have complained? Provided wrong bits are in data area, it would dump wrong bits comparing them to original. But if it is in timestamp or in header, it also would remain unnoticed before 2.4.19. Very strange. We have similar phenomenon reported with TCP, by the way. 
So, I have to assume that checksumming routine is wrong and does some shit sort of relying on an uninitialized data. Alexey From owner-netdev@oss.sgi.com Tue Aug 13 07:16:06 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DEG6Rw005624 for ; Tue, 13 Aug 2002 07:16:06 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DEG66t005623 for netdev-outgoing; Tue, 13 Aug 2002 07:16:06 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from tux.rsn.bth.se (postfix@tux.rsn.bth.se [194.47.143.135]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DEFnRw005611 for ; Tue, 13 Aug 2002 07:15:49 -0700 Received: by tux.rsn.bth.se (Postfix, from userid 501) id C0E9D36FC9; Tue, 13 Aug 2002 15:47:15 +0200 (CEST) Subject: Re: raw ipv6 broken in 2.4.19 From: Martin Josefsson To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com In-Reply-To: <200208131330.RAA20947@sex.inr.ac.ru> References: <200208131330.RAA20947@sex.inr.ac.ru> Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.7 Date: 13 Aug 2002 15:47:15 +0200 Message-Id: <1029246435.1135.22.camel@tux> Mime-Version: 1.0 X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Tue, 2002-08-13 at 15:30, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > I've heard something about a PCI posting bug in the tulip hardware? > > could this have anything to do with that? > > No ideas. Eventually I moved that card to an ancient Neptunand it works > flawlessly there. I'm going to test it with a shitty i820 chipset later. Asus P3C-D and the D-Link DFE570-TX is a really bad combination (thats why I don't use that board anymore :), having ~90% cpu idle while maxing at ~75Mbit/s routed is not fun. One NIC active at once is ok, but two or more just destroys performance. 
But I've never had that problem with 440BX chipsets. > > I've attached a small dump. captured while trying to ping an OpenBSD > > It looks boringly correct, indeed. Was it really taken at faulting machine? Yes it was. If the packets recieved is really corrupted wouldn't something have complained before that patch was merged in 2.4.19-pre8 ? if the ipv6 header was corrupt it would have been dropped? if the icmp packet was corrupt, wouldn't ping6 have complained? But then on the other side, it works fine with other NIC's... And the machine answers icmp echo-requests without any problems with the tulip. Nothing's beeing discarded then. I think I'll just go with reverting that patch for now. -- /Martin Never argue with an idiot. They drag you down to their level, then beat you with experience. From owner-netdev@oss.sgi.com Tue Aug 13 07:53:49 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DErmRw006009 for ; Tue, 13 Aug 2002 07:53:48 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DErm4A006008 for netdev-outgoing; Tue, 13 Aug 2002 07:53:48 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from anakin.wychk.org (CPE-144-132-195-245.nsw.bigpond.net.au [144.132.195.245]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DEr6Rw005994 for ; Tue, 13 Aug 2002 07:53:07 -0700 Received: from anakin.wychk.org (localhost [127.0.0.1]) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) with ESMTP id g7DEWlN5021518; Wed, 14 Aug 2002 00:32:47 +1000 Received: (from glee@localhost) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) id g7DEWkDl021517; Wed, 14 Aug 2002 00:32:46 +1000 Date: Wed, 14 Aug 2002 00:32:46 +1000 From: Geoffrey Lee To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Subject: Re: [PATCH] connect() return value. 
Message-ID: <20020813143246.GA20943@anakin.wychk.org> References: <20020813113450.GA19058@anakin.wychk.org> <200208131347.RAA21080@sex.inr.ac.ru> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="x+6KMIRAuhnl3hBn" Content-Disposition: inline In-Reply-To: <200208131347.RAA21080@sex.inr.ac.ru> User-Agent: Mutt/1.4i X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk --x+6KMIRAuhnl3hBn Content-Type: text/plain; charset=big5 Content-Disposition: inline On Tue, Aug 13, 2002 at 05:47:14PM +0400, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > Ah, yes you're right. I was feeling stupid and I forgot to clear the > > errno because re-trying the connect(). > > Good. :-) > > Could you repeat corrected test with repreated nonblocking connect() > on those OSes? > Alright, it looks like I've been smoking too much pot. I'm going to post the source code [ugly] here as well so what you see is what you get (no more oops-I-forgot-to-clear-errno oddities again :-) Let's redo the experiment from the start: This is the following OS I will test against: SunOS 5.6 OSF1 4.0 Linux v2.4.18 Source files: connect.c connect3.c connect.c tests for interruptibility with a particular OS. connect3.c tests for what connect returns if the socket is set into non-blocking mode. First off, we note the behavior of the with regard to interruptibility of connect. I use SIGALRM, sigaction with SA_RESTART, and connect to a valid IP where the host is down / no host is present. Default blocking behavior. SunOS 5.6: sighandler called, connect fails with -EINTR OSF1 4.0 sighandler called, connect fails with -EINTR (I have no idea why I "corrected" myself before that it's restartable, possibly some cruft I left behind with a read() following connect. So a return of -EINTR is my Final Answer. Sorry for that confusion.) 
Linux 2.4.18: sighandler is called, but we see that the program does not exit; this seems to imply that the connect is restarted.

Now, we try with no signals, connecting to a valid IP where the host is up with a valid port listening. We use fcntl() to set the socket to non-blocking mode.

SunOS 5.6: -EINPROGRESS, -EALREADY, -EISCONN, in that order.
OSF1 4.0: -EINPROGRESS, -EALREADY, -EISCONN, in that order.
Linux 2.4.18: -EINPROGRESS, -EALREADY, 0 (Success), -EISCONN, in that order.

So Linux does indeed return 0, and this is the magic value that we are looking for. People writing portably will be required to handle 0 in Linux, as well as possibly an error of -EISCONN (this error is ok and seems to be expected in OSF1 4.0 and SunOS 5.6).

-- G.

--x+6KMIRAuhnl3hBn
Content-Type: text/x-csrc; charset=big5
Content-Disposition: attachment; filename="connect.c"

/*
 * connect.c: is a blocking connect() interrupted (EINTR) or restarted
 * when SIGALRM arrives and the handler was installed with SA_RESTART?
 *
 * The header names were lost in the archived copy; the list below is
 * simply what the code needs, not necessarily the original list.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define PORT 22
#define HOST "129.94.44.44"

void handler(int);

int main(int argc, char **argv)
{
    struct sockaddr_in saddr;
    int ret;
    int c;
    int fd;
    struct sigaction act, oact;
    char *host = HOST;
    unsigned short port = PORT;

    while ((c = getopt(argc, argv, "h:p:")) != -1) {
        switch (c) {
        case 'h':
            host = strdup(optarg);
            break;
        case 'p':
            port = atoi(optarg);
            break;
        default:
            break;
        }
    }

    /* Install the SIGALRM handler and ask for restartable system calls. */
    memset(&act, 0, sizeof(struct sigaction));
    memset(&oact, 0, sizeof(struct sigaction));
    act.sa_flags |= SA_RESTART;
    act.sa_handler = handler;
    ret = sigaction(SIGALRM, &act, &oact);
    if (ret < 0) {
        printf("sigaction\n");
        exit(1);
    }

    memset(&saddr, 0, sizeof(struct sockaddr_in));
    saddr.sin_family = AF_INET;
    saddr.sin_port = htons(port);
    saddr.sin_addr.s_addr = inet_addr(host);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        printf("socket\n");
        exit(1);
    }

    /* Fire SIGALRM while the blocking connect() is still waiting. */
    alarm(1);
    errno = 0;
    ret = connect(fd, (struct sockaddr *)&saddr, sizeof(saddr));
    printf("returned from connect\n");
    if (ret < 0) {
        printf("connect\n");
        printf("%s\n", strerror(errno));
    }
    close(fd);
    return(0);
}

void handler(int signo)
{
    printf("in signal hander\n");
    return ;
}

--x+6KMIRAuhnl3hBn
Content-Type: text/x-csrc; charset=big5
Content-Disposition: attachment; filename="connect3.c"

/*
 * connect3.c: what does connect() return when it is called repeatedly
 * on a non-blocking socket?
 *
 * The header names were lost in the archived copy; the list below is
 * simply what the code needs, not necessarily the original list.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define HOST "192.168.0.1"
#define PORT 22

void handler(int);

int main(int argc, char **argv)
{
    struct sockaddr_in saddr;
    int ret;
    int fd;
    int i;
    int val;
    int c;
    unsigned short port = PORT;
    char *host = HOST;

    while ((c = getopt(argc, argv, "h:p:")) != -1) {
        switch (c) {
        case 'h':
            host = strdup(optarg);
            break;
        case 'p':
            port = atoi(optarg);
            break;
        default:
            break;
        }
    }

    memset(&saddr, 0, sizeof(struct sockaddr_in));
    saddr.sin_family = AF_INET;
    saddr.sin_port = htons(port);
    saddr.sin_addr.s_addr = inet_addr(host);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        printf("socket\n");
        exit(1);
    }

    /* Switch the socket to non-blocking mode. */
    val = fcntl(fd, F_GETFL);
    if (val < 0) {
        printf("fcntl get\n");
        exit(1);
    }
    ret = fcntl(fd, F_SETFL, val | O_NONBLOCK);
    if (ret < 0) {
        printf("fcntl set\n");
        exit(1);
    }

    /* Keep calling connect() until it reports EISCONN. */
    for (i = 1 ;; i++) {
        errno = 0;  /* clear it, so a successful call prints error 0 */
        printf("entering %d connect\n", i);
        ret = connect(fd, (struct sockaddr *)&saddr, sizeof(struct sockaddr_in));
        printf("leaving %d connect, error %d msg = %s\n", i, errno, strerror(errno));
        if (errno == EISCONN)
            break;
    }
    close(fd);
    exit(0);
}

/* Not installed anywhere in this test; left over from the signal variant. */
void handler(int signo)
{
    alarm(1);
    printf("in signal hander\n");
    return ;
}

--x+6KMIRAuhnl3hBn--

From owner-netdev@oss.sgi.com Tue Aug 13 08:26:29 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DFQTRw006694 for ; Tue, 13 Aug 2002 08:26:29 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DFQT8Q006693 for netdev-outgoing; Tue, 13 Aug 2002 08:26:29 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165])
by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DFQPRw006684 for ; Tue, 13 Aug 2002 08:26:26 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id TAA21381; Tue, 13 Aug 2002 19:28:41 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208131528.TAA21381@sex.inr.ac.ru> Subject: Re: [PATCH] connect() return value. To: glee@gnupilgrims.org (Geoffrey Lee) Date: Tue, 13 Aug 2002 19:28:41 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <20020813143246.GA20943@anakin.wychk.org> from "Geoffrey Lee" at Aug 14, 2 00:32:46 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > First off, et al. Great, I think we will change this too. Could you repeat that test with read() after incomplete connect? That your test was apparently invalid because of the same errno mistake, so results may be different. Only, please, use read() with non-zero length. 
Alexey From owner-netdev@oss.sgi.com Tue Aug 13 10:52:46 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DHqkRw010887 for ; Tue, 13 Aug 2002 10:52:46 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DHqkPG010886 for netdev-outgoing; Tue, 13 Aug 2002 10:52:46 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from tux.rsn.bth.se (postfix@tux.rsn.bth.se [194.47.143.135]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DHqSRw010875 for ; Tue, 13 Aug 2002 10:52:29 -0700 Received: by tux.rsn.bth.se (Postfix, from userid 501) id B130936FC9; Tue, 13 Aug 2002 19:14:34 +0200 (CEST) Subject: Re: raw ipv6 broken in 2.4.19 From: Martin Josefsson To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com In-Reply-To: <200208131358.RAA21125@sex.inr.ac.ru> References: <200208131358.RAA21125@sex.inr.ac.ru> Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.7 Date: 13 Aug 2002 19:14:34 +0200 Message-Id: <1029258874.772.97.camel@tux> Mime-Version: 1.0 X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Tue, 2002-08-13 at 15:58, kuznet@ms2.inr.ac.ru wrote: > Very strange. We have similar phenomenon reported with TCP, by the way. > So, I have to assume that checksumming routine is wrong and does some shit > sort of relying on an uninitialized data. I've added some debug printk's and found out that it's the call to csum_fold that fails in skb_copy_and_csum_datagram_iovec. 
skb_copy_and_csum_datagram_iovec is called from:

net/ipv6/raw.c:rawv6_recvmsg()

    if (skb->ip_summed==CHECKSUM_UNNECESSARY) {
        err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
    } else if (msg->msg_flags&MSG_TRUNC) {
        if ((unsigned short)csum_fold(skb_checksum(skb, 0, skb->len, skb->csum)))
            goto csum_copy_err;
        err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
    } else {
        err = skb_copy_and_csum_datagram_iovec(skb, 0, msg->msg_iov);
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        if (err == -EINVAL)
            goto csum_copy_err;
    }

This obviously works with my old ISA ne2k clone but not with the tulip.
And the tcpdump looked ok.

Does anyone else have an idea or suggestion I can try?

Alexey, when you had checksum problems, did you see invalid checksums in
tcpdump?

-- 
/Martin

Never argue with an idiot. They drag you down to their level, then beat you with experience.

From owner-netdev@oss.sgi.com Tue Aug 13 11:28:58 2002
Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DISvRw011394 for ; Tue, 13 Aug 2002 11:28:58 -0700
Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DISv1n011393 for netdev-outgoing; Tue, 13 Aug 2002 11:28:57 -0700
X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f
Received: from touchme.toronto.redhat.com (to-velocet.redhat.com [216.138.202.10]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DISrRw011384 for ; Tue, 13 Aug 2002 11:28:53 -0700
Received: from toomuch.toronto.redhat.com (toomuch.toronto.redhat.com [172.16.14.22]) by touchme.toronto.redhat.com (Postfix) with ESMTP id 1027FB804C; Tue, 13 Aug 2002 14:31:21 -0400 (EDT)
Received: (from bcrl@localhost) by toomuch.toronto.redhat.com (8.11.6/8.11.6) id g7DIVKI12861; Tue, 13 Aug 2002 14:31:20 -0400
Date: Tue, 13 Aug 2002 14:31:20 -0400
From: Benjamin LaHaise
To: kuznet@ms2.inr.ac.ru
Cc: netdev@oss.sgi.com
Subject: Re: [patch] bug prematurely setting nr_frags
Message-ID:
<20020813143120.C12730@redhat.com> References: <20020812190744.R1781@redhat.com> <200208130406.IAA20038@sex.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5.1i In-Reply-To: <200208130406.IAA20038@sex.inr.ac.ru>; from kuznet@ms2.inr.ac.ru on Tue, Aug 13, 2002 at 08:06:13AM +0400 X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Tue, Aug 13, 2002 at 08:06:13AM +0400, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > Huh? It's called from sock_alloc_send_skb, which is called from all > > over the stack. > > It is used with data_len==0. The branch generating fragments is dead. Ah, I see. Care for a patch to remove it altogether then? As for the bug, does anyone else see a place where nr_frags can be set without initializing a page pointer? Otherwise it looks like a random memory corruption... what fun. -ben -- "You will be reincarnated as a toad; and you will be much happier." 
From owner-netdev@oss.sgi.com Tue Aug 13 13:04:43 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DK4hRw027142 for ; Tue, 13 Aug 2002 13:04:43 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DK4hNV027141 for netdev-outgoing; Tue, 13 Aug 2002 13:04:43 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DK4bRw027129 for ; Tue, 13 Aug 2002 13:04:38 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id AAA21878; Wed, 14 Aug 2002 00:06:48 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208132006.AAA21878@sex.inr.ac.ru> Subject: Re: raw ipv6 broken in 2.4.19 To: gandalf@wlug.westbo.se (Martin Josefsson) Date: Wed, 14 Aug 2002 00:06:48 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <1029258874.772.97.camel@tux> from "Martin Josefsson" at Aug 13, 2 07:14:34 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > Alexey, when you had checksum problems, did you see invalid checksums in > tcpdump? No. It is dubiously similar to your case. It is the sibj "Linux TCP problem while talking to hostme.bkbits.net" in linux-kernel. It is mistique not less than your case. :-) > Does anyone else have an idea or suggestion I can try? Well, let's look what exactly checksum routines calculate on the steps. Maybe, this will give a clue. Actually, tulip is really different of another cards, this creature generates badly aligned packets. F.e. it is possible a checksumming routine folds some bytes beyond end of frame. :-) I have lots of verious tulips here, but have never seen such a shit. 
:-) Alexey From owner-netdev@oss.sgi.com Tue Aug 13 14:12:05 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DLC5Rw011777 for ; Tue, 13 Aug 2002 14:12:05 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DLC54V011776 for netdev-outgoing; Tue, 13 Aug 2002 14:12:05 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from u.domain.uli (ja.mac.ssi.bg [212.95.166.194]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DLBvRw011766 for ; Tue, 13 Aug 2002 14:11:58 -0700 Received: from localhost (IDENT:ja@localhost [127.0.0.1]) by u.domain.uli (8.11.6/8.11.6) with ESMTP id g7E0F9b01428; Wed, 14 Aug 2002 00:15:09 GMT Date: Wed, 14 Aug 2002 00:15:09 +0000 (GMT) From: Julian Anastasov X-X-Sender: ja@u.domain.uli To: Martin Josefsson cc: kuznet@ms2.inr.ac.ru, Subject: Re: raw ipv6 broken in 2.4.19 In-Reply-To: <1029258874.772.97.camel@tux> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello, On 13 Aug 2002, Martin Josefsson wrote: > On Tue, 2002-08-13 at 15:58, kuznet@ms2.inr.ac.ru wrote: > > > Very strange. We have similar phenomenon reported with TCP, by the way. > > So, I have to assume that checksumming routine is wrong and does some shit > > sort of relying on an uninitialized data. > > I've added some debug printk's and found out that it's the call to > csum_fold that fails in skb_copy_and_csum_datagram_iovec. What about the known problem with csum_partial called with zero length. IIRC, on CONFIG_X86_USE_PPRO_CHECKSUM compilation this function depends on the data alignment (addr&2!=0 => bug) - calling csum_partial with zero length in 2.4+ is a bug, checks should be added in the caller. Can this be problem with skb_copy_and_csum_datagram_iovec? 
I see that net/ipv6/raw.c provides 0 as hlen. Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Tue Aug 13 14:27:22 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DLRMRw011983 for ; Tue, 13 Aug 2002 14:27:22 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DLRMWR011982 for netdev-outgoing; Tue, 13 Aug 2002 14:27:22 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from tux.rsn.bth.se (postfix@tux.rsn.bth.se [194.47.143.135]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DLRFRw011973 for ; Tue, 13 Aug 2002 14:27:15 -0700 Received: by tux.rsn.bth.se (Postfix, from userid 501) id AA4D736FC9; Tue, 13 Aug 2002 23:29:35 +0200 (CEST) Subject: Re: raw ipv6 broken in 2.4.19 From: Martin Josefsson To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com In-Reply-To: <200208132006.AAA21878@sex.inr.ac.ru> References: <200208132006.AAA21878@sex.inr.ac.ru> Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.7 Date: 13 Aug 2002 23:29:35 +0200 Message-Id: <1029274175.1135.129.camel@tux> Mime-Version: 1.0 X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Tue, 2002-08-13 at 22:06, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > Alexey, when you had checksum problems, did you see invalid checksums in > > tcpdump? > > No. It is dubiously similar to your case. It is the sibj > "Linux TCP problem while talking to hostme.bkbits.net" in linux-kernel. > It is mistique not less than your case. :-) > > > Does anyone else have an idea or suggestion I can try? > > Well, let's look what exactly checksum routines calculate on the steps. > Maybe, this will give a clue. Just tell me what you want me to dig out from the checksumming stuff. 
> Actually, tulip is really different of another cards, this creature > generates badly aligned packets. F.e. it is possible a checksumming routine > folds some bytes beyond end of frame. :-) I have lots of verious tulips here, > but have never seen such a shit. :-) hehe, this isn't a clone, it's a quad with "Digital DS21143 Tulip rev 65" chips (from tulip driver) behind a "PCI bridge: Digital Equipment Corporation DECchip 21152" (from lspci) -- /Martin Never argue with an idiot. They drag you down to their level, then beat you with experience. From owner-netdev@oss.sgi.com Tue Aug 13 14:58:09 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DLw9Rw012249 for ; Tue, 13 Aug 2002 14:58:09 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DLw86H012248 for netdev-outgoing; Tue, 13 Aug 2002 14:58:08 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DLw3Rw012239 for ; Tue, 13 Aug 2002 14:58:04 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id CAA22153; Wed, 14 Aug 2002 02:00:06 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208132200.CAA22153@sex.inr.ac.ru> Subject: Re: raw ipv6 broken in 2.4.19 To: ja@ssi.bg (Julian Anastasov) Date: Wed, 14 Aug 2002 02:00:06 +0400 (MSD) Cc: gandalf@wlug.westbo.se, netdev@oss.sgi.com In-Reply-To: from "Julian Anastasov" at Aug 14, 2 00:15:09 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > What about the known problem with csum_partial called > with zero length. Maybe it is known to you, but not to me. > (addr&2!=0 => bug) Sorry? What do you mean? 
Checksumming routines must not have any alignment constraints, they used with arbitrary combinations of alignments. Alexey From owner-netdev@oss.sgi.com Tue Aug 13 15:09:35 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DM9ZRw012478 for ; Tue, 13 Aug 2002 15:09:35 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DM9ZC1012477 for netdev-outgoing; Tue, 13 Aug 2002 15:09:35 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from u.domain.uli (ja.mac.ssi.bg [212.95.166.194]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DM9PRw012468 for ; Tue, 13 Aug 2002 15:09:29 -0700 Received: from localhost (IDENT:ja@localhost [127.0.0.1]) by u.domain.uli (8.11.6/8.11.6) with ESMTP id g7E1Cfb01643; Wed, 14 Aug 2002 01:12:41 GMT Date: Wed, 14 Aug 2002 01:12:41 +0000 (GMT) From: Julian Anastasov X-X-Sender: ja@u.domain.uli To: kuznet@ms2.inr.ac.ru cc: gandalf@wlug.westbo.se, Subject: Re: raw ipv6 broken in 2.4.19 In-Reply-To: <200208132200.CAA22153@sex.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello, On Wed, 14 Aug 2002 kuznet@ms2.inr.ac.ru wrote: > > What about the known problem with csum_partial called > > with zero length. > > Maybe it is known to you, but not to me. Hm, some links: http://marc.theaimsgroup.com/?l=linux-kernel&m=99489362903633&w=2 http://marc.theaimsgroup.com/?l=linux-virtual-server&m=100332800916641&w=2 > > (addr&2!=0 => bug) > > Sorry? What do you mean? Checksumming routines must not have any alignment > constraints, they used with arbitrary combinations of alignments. Do you believe, they have :) See the 2nd URL, there is a fix tested from Wensong (old LVS problem). 
> Alexey Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Tue Aug 13 15:11:46 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DMBfRw012605 for ; Tue, 13 Aug 2002 15:11:41 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DMBf1Q012604 for netdev-outgoing; Tue, 13 Aug 2002 15:11:41 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DMBbRw012595 for ; Tue, 13 Aug 2002 15:11:38 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id CAA22204; Wed, 14 Aug 2002 02:13:55 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208132213.CAA22204@sex.inr.ac.ru> Subject: Re: raw ipv6 broken in 2.4.19 To: gandalf@wlug.westbo.se (Martin Josefsson) Date: Wed, 14 Aug 2002 02:13:55 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <1029274175.1135.129.camel@tux> from "Martin Josefsson" at Aug 13, 2 11:29:35 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > Just tell me what you want me to dig out from the checksumming stuff. First of all, value returned by routine. Apparently, it is wrong. Then arguments (pointers, lengths). 
Alexey From owner-netdev@oss.sgi.com Tue Aug 13 15:16:49 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DMGnRw012747 for ; Tue, 13 Aug 2002 15:16:49 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DMGn5V012746 for netdev-outgoing; Tue, 13 Aug 2002 15:16:49 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DMGiRw012737 for ; Tue, 13 Aug 2002 15:16:45 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id CAA22224; Wed, 14 Aug 2002 02:18:50 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208132218.CAA22224@sex.inr.ac.ru> Subject: Re: raw ipv6 broken in 2.4.19 To: ja@ssi.bg (Julian Anastasov) Date: Wed, 14 Aug 2002 02:18:50 +0400 (MSD) Cc: gandalf@wlug.westbo.se, netdev@oss.sgi.com In-Reply-To: from "Julian Anastasov" at Aug 14, 2 01:12:41 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > > Sorry? What do you mean? Checksumming routines must not have any alignment > > constraints, they used with arbitrary combinations of alignments. > > Do you believe, they have :) See the 2nd URL, Well, if this is only in combination with zero length, it is not a disaster. 
Alexey From owner-netdev@oss.sgi.com Tue Aug 13 15:25:46 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DMPkRw012923 for ; Tue, 13 Aug 2002 15:25:46 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DMPke7012922 for netdev-outgoing; Tue, 13 Aug 2002 15:25:46 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DMPeRw012912 for ; Tue, 13 Aug 2002 15:25:41 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id CAA22255; Wed, 14 Aug 2002 02:27:56 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208132227.CAA22255@sex.inr.ac.ru> Subject: Re: [patch] bug prematurely setting nr_frags To: bcrl@redhat.com (Benjamin LaHaise) Date: Wed, 14 Aug 2002 02:27:56 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <20020813143120.C12730@redhat.com> from "Benjamin LaHaise" at Aug 13, 2 02:31:20 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > Ah, I see. Care for a patch to remove it altogether then? Well, someone wrote it for some purpose. :-) > As for the > bug, does anyone else see a place where nr_frags can be set without > initializing a page pointer? No. When does this happen? > Otherwise it looks like a random memory corruption... what fun. Well, even if this is corruption, it is unlikely to be random. This maybe write beyond end of an skb, corrupting skb_shared_info. At least, we had such place in netfilter fixed some time ago. 
Alexey From owner-netdev@oss.sgi.com Tue Aug 13 15:35:37 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DMZbRw013066 for ; Tue, 13 Aug 2002 15:35:37 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DMZbER013065 for netdev-outgoing; Tue, 13 Aug 2002 15:35:37 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from u.domain.uli (ja.mac.ssi.bg [212.95.166.194]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DMZQRw013056 for ; Tue, 13 Aug 2002 15:35:29 -0700 Received: from localhost (IDENT:ja@localhost [127.0.0.1]) by u.domain.uli (8.11.6/8.11.6) with ESMTP id g7E1ceb01749; Wed, 14 Aug 2002 01:38:40 GMT Date: Wed, 14 Aug 2002 01:38:40 +0000 (GMT) From: Julian Anastasov X-X-Sender: ja@u.domain.uli To: kuznet@ms2.inr.ac.ru cc: gandalf@wlug.westbo.se, Subject: Re: raw ipv6 broken in 2.4.19 In-Reply-To: <200208132218.CAA22224@sex.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello, On Wed, 14 Aug 2002 kuznet@ms2.inr.ac.ru wrote: > Well, if this is only in combination with zero length, it is not a disaster. May be. It seems there were no such callers but they appeared. I'm not sure a comment can avoid further problems. As for the fix, we don't see often lengths of 0 and 1 for unaligned addrs, it is in the slow path in csum_partial. IMHO, it will not hurt if we apply the fix instead of auditing everything. 
> Alexey Regards -- Julian Anastasov From owner-netdev@oss.sgi.com Tue Aug 13 16:42:47 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7DNglRw013522 for ; Tue, 13 Aug 2002 16:42:47 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7DNgleu013521 for netdev-outgoing; Tue, 13 Aug 2002 16:42:47 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from tux.rsn.bth.se (postfix@tux.rsn.bth.se [194.47.143.135]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7DNgVRw013511 for ; Tue, 13 Aug 2002 16:42:32 -0700 Received: by tux.rsn.bth.se (Postfix, from userid 501) id 368D336FC9; Wed, 14 Aug 2002 01:02:22 +0200 (CEST) Subject: Re: raw ipv6 broken in 2.4.19 From: Martin Josefsson To: Julian Anastasov Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com In-Reply-To: References: Content-Type: text/plain Content-Transfer-Encoding: 7bit X-Mailer: Ximian Evolution 1.0.7 Date: 14 Aug 2002 01:02:21 +0200 Message-Id: <1029279741.772.134.camel@tux> Mime-Version: 1.0 X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Wed, 2002-08-14 at 03:38, Julian Anastasov wrote: > > Hello, > > On Wed, 14 Aug 2002 kuznet@ms2.inr.ac.ru wrote: > > > Well, if this is only in combination with zero length, it is not a disaster. > > May be. It seems there were no such callers but they > appeared. I'm not sure a comment can avoid further problems. > As for the fix, we don't see often lengths of 0 and 1 for > unaligned addrs, it is in the slow path in csum_partial. IMHO, > it will not hurt if we apply the fix instead of auditing everything. I applied the patch and now it's working fine, both ping6 and traceroute6 seems to be working fine. I'll leave it to you guys to debate this :) -- /Martin Never argue with an idiot. They drag you down to their level, then beat you with experience. 
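The zero-length case debated in the csum_partial thread above can be
illustrated with a small portable reference checksum. This is only a
sketch under stated assumptions: it is not the kernel's csum_partial
(which is architecture-specific), and the names csum_partial_ref and
csum_fold_ref are made up for this illustration. It sums 16-bit words in
network byte order. The point is that a zero-length call returns the
incoming sum unchanged whatever the alignment of the buffer pointer,
which is what callers in the thread expected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine: add the bytes of buf, taken as
 * big-endian 16-bit words, to a running 32-bit ones-complement sum. */
uint32_t csum_partial_ref(const unsigned char *buf, size_t len, uint32_t sum)
{
	uint64_t acc = sum;
	size_t i;

	assert(buf != NULL || len == 0);

	/* With len == 0 neither the loop nor the tail runs, so the
	 * incoming sum comes back unchanged for any pointer value. */
	for (i = 0; i + 1 < len; i += 2)
		acc += ((uint32_t)buf[i] << 8) | buf[i + 1];

	/* An odd trailing byte is padded with a zero low byte. */
	if (len & 1)
		acc += (uint32_t)buf[len - 1] << 8;

	/* Reduce to 32 bits with end-around carry. */
	while (acc >> 32)
		acc = (acc & 0xffffffffu) + (acc >> 32);

	return (uint32_t)acc;
}

/* Fold a 32-bit sum to 16 bits and complement it. */
uint16_t csum_fold_ref(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

A routine written this way has no alignment constraints; an optimized
version has to preserve the same results for lengths 0 and 1 at odd
addresses.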
From owner-netdev@oss.sgi.com Tue Aug 13 17:12:19 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7E0CJRw013977 for ; Tue, 13 Aug 2002 17:12:19 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7E0CJvo013976 for netdev-outgoing; Tue, 13 Aug 2002 17:12:19 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from anakin.wychk.org (CPE-144-132-195-245.nsw.bigpond.net.au [144.132.195.245]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7E0BWRw013960 for ; Tue, 13 Aug 2002 17:11:33 -0700 Received: from anakin.wychk.org (localhost [127.0.0.1]) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) with ESMTP id g7DNp4N5028899; Wed, 14 Aug 2002 09:51:04 +1000 Received: (from glee@localhost) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) id g7DNp3XS028898; Wed, 14 Aug 2002 09:51:03 +1000 Date: Wed, 14 Aug 2002 09:51:03 +1000 From: Geoffrey Lee To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Subject: Re: [PATCH] connect() return value. Message-ID: <20020813235103.GA28432@anakin.wychk.org> References: <20020813143246.GA20943@anakin.wychk.org> <200208131528.TAA21381@sex.inr.ac.ru> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="rwEMma7ioTxnRzrJ" Content-Disposition: inline In-Reply-To: <200208131528.TAA21381@sex.inr.ac.ru> User-Agent: Mutt/1.4i X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk --rwEMma7ioTxnRzrJ Content-Type: text/plain; charset=big5 Content-Disposition: inline On Tue, Aug 13, 2002 at 07:28:41PM +0400, kuznet@ms2.inr.ac.ru wrote: > Hello! > > > First off, > et al. > > Great, I think we will change this too. > What are you intending to change? :-) connect behaivor or read behavior? > Could you repeat that test with read() after incomplete connect? 
> That your test was apparently invalid because of the same errno mistake,
> so results may be different. Only, please, use read() with non-zero length.
>

Sure. (Though I don't think it should matter, since with connect() the test
was bogus because I forgot to reset errno every time I went through the loop.)
I will test write as well, because read and write should be consistent
with each other.

Again, I am attaching the programs that I used for this test.

In this mail:

connect-read.c (test for 0 length read)
connect-write.c (test for 0 length write)

Both are modified versions of the original connect.c.

Tested operating systems:

SunOS 5.6
OSF1 4.0
Linux 2.4.18

In this test we use the programs to connect to an open TCP port on a host.
We connect to a host on the WAN so that the connect takes as long as
possible.

First we create a socket and set it to non-blocking. We connect once, and
test for an errno of EINPROGRESS (expected). We set errno to 0, to guarantee
that should read / write return an error, it is from the read / write call.
After that, we issue a 0 length read / write, and test for an error.
Finally, we issue a second connect and check that we get an EALREADY for
errno, to make sure that the 3 way handshake is not complete and we indeed
tried to write to / read from an in-progress socket.

SunOS 5.6

We find the errno after the first connect is indeed EINPROGRESS. A
read with 0 length returns -1, with an errno of 134. That corresponds
to ENOTCONN on Solaris. The second connect returns EALREADY, so we are sure
that the 3 way handshake is not complete.

Similarly for write: when we issue a zero-length write, write
returns -1 with an errno of 134 (ENOTCONN).

OSF1 4.0

We find that we get an expected EINPROGRESS after the first connect. For
read, it returns 0. On the second connect, it returns EALREADY.

It is interesting to note that on OSF1 4.0, write fails with an errno of 57
(ENOTCONN).
Behavior is not consistent between their socket read and write. We have
found a bug in Digital UNIX.

Linux 2.4.18

We find that we indeed get an EINPROGRESS after the first connect. After a
read, we find that read returns 0. On the second connect, it returns an error
with errno set to EALREADY.

Similarly for write: when we issue a zero-length write, write returns 0.

So to summarize:

To portably test for a connected socket, you can:

* use connect, but be prepared to handle an errno of 0 from Linux;

* handle the differences between read / write. In particular Digital UNIX
  has the extremely unsocial behavior of returning something totally
  different for read and write.

* Use getpeername. We expect ENOTCONN if the connect has not completed, or
  some error occurred.

Another implementation detail: OSF1 4.0 and Solaris require getpeername's
second argument to be valid, or -EBADF will be returned. Linux does not
require this and returns -ENOTCONN even if the second argument is NULL.
SunOS 5.6 and OSF1 4.0 both return -ENOTCONN if you run getpeername with a
non-NULL second argument.

i.e.

getpeername(fd, NULL, NULL) --> ok on linux, not ok on OSF1/SunOS
getpeername(fd, &peeraddr, NULL) --> ok on all

Non-blocking sockets are as non-portable as can be. :-)

	-- G.

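The getpeername approach from the summary above can be sketched as a small
helper. This is a minimal sketch, not code from the attached programs: the
name socket_is_connected is made up for this illustration, and it passes a
valid address buffer and a valid length pointer, which is the conservative
choice given the differences between systems described above.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if fd has a peer (connect completed),
 * 0 if the socket is not connected yet (ENOTCONN), and -1 on any other
 * error, with errno left as set by getpeername().
 */
int socket_is_connected(int fd)
{
	struct sockaddr_storage peer;
	socklen_t len = sizeof(peer);

	/* Always pass valid pointers for both the address and the length. */
	if (getpeername(fd, (struct sockaddr *)&peer, &len) == 0)
		return 1;

	return (errno == ENOTCONN) ? 0 : -1;
}
```

After a non-blocking connect() has returned EINPROGRESS, a caller would
typically wait for the socket to become writable and then call the helper,
rather than calling it in a busy loop.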
--rwEMma7ioTxnRzrJ
Content-Type: text/x-csrc; charset=big5
Content-Disposition: attachment; filename="connect-read.c"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

extern char *optarg;

#define HOST "192.168.0.1"
#define PORT 0

void handler(int);

int main(int argc, char **argv)
{
	struct sockaddr_in saddr;
	int ret;
	int fd;
	char buf[BUFSIZ];
	int val;
	int c;
	unsigned short port = PORT;
	size_t len;
	char *host = HOST;

	val = 0;
	len = sizeof(val);

	while ((c = getopt(argc, argv, "h:p:")) != -1) {
		switch (c) {
		case 'h':
			host = (char *)strdup(optarg);
			break;
		case 'p':
			port = atoi(optarg);
			break;
		default:
			break;
		}
	}

	memset(&saddr, 0, sizeof(struct sockaddr_in));
	saddr.sin_family = AF_INET;
	saddr.sin_port = htons(port);
	saddr.sin_addr.s_addr = inet_addr(host);

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		printf("socket\n");
		exit(1);
	}

	val = fcntl(fd, F_GETFL);
	if (val < 0) {
		printf("fcntl get\n");
		exit(1);
	}

	ret = fcntl(fd, F_SETFL, val | O_NONBLOCK);
	if (ret < 0) {
		printf("fcntl set\n");
		exit(1);
	}

	ret = connect(fd, (struct sockaddr *)&saddr, sizeof(struct sockaddr_in));
	if (ret < 0) {
		if (errno == EINPROGRESS) {
			printf("connect: EINPROGRESS (expected)\n");
		} else {
			printf("unexpected errno from connect, errno = %d"
			       " msg = %s\n", errno, strerror(errno));
		}
	}

	/* make sure errno is 0 */
	errno = 0;
	val = read(fd, buf, 0);
	printf("read() returns %d, errno = %d, msg = %s\n", val, errno,
	       (val < 0) ? strerror(errno) : "");

	errno = 0;
	ret = connect(fd, (struct sockaddr *)&saddr, sizeof(struct sockaddr_in));
	printf("connect returns %d, err = %s\n", ret, strerror(errno));

	close(fd);
	exit(0);
}

--rwEMma7ioTxnRzrJ
Content-Type: text/x-csrc; charset=big5
Content-Disposition: attachment; filename="connect-write.c"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

extern char *optarg;

#define HOST "192.168.0.1"
#define PORT 0

void handler(int);

int main(int argc, char **argv)
{
	struct sockaddr_in saddr;
	int ret;
	int fd;
	char buf[BUFSIZ];
	int val;
	int c;
	unsigned short port = PORT;
	size_t len;
	char *host = HOST;

	val = 0;
	len = sizeof(val);

	while ((c = getopt(argc, argv, "h:p:")) != -1) {
		switch (c) {
		case 'h':
			host = (char *)strdup(optarg);
			break;
		case 'p':
			port = atoi(optarg);
			break;
		default:
			break;
		}
	}

	memset(&saddr, 0, sizeof(struct sockaddr_in));
	saddr.sin_family = AF_INET;
	saddr.sin_port = htons(port);
	saddr.sin_addr.s_addr = inet_addr(host);

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		printf("socket\n");
		exit(1);
	}

	val = fcntl(fd, F_GETFL);
	if (val < 0) {
		printf("fcntl get\n");
		exit(1);
	}

	ret = fcntl(fd, F_SETFL, val | O_NONBLOCK);
	if (ret < 0) {
		printf("fcntl set\n");
		exit(1);
	}

	ret = connect(fd, (struct sockaddr *)&saddr, sizeof(struct sockaddr_in));
	if (ret < 0) {
		if (errno == EINPROGRESS) {
			printf("connect: EINPROGRESS (expected)\n");
		} else {
			printf("unexpected errno from connect, errno = %d"
			       " msg = %s\n", errno, strerror(errno));
		}
	}

	/* make sure errno is 0 */
	errno = 0;
	val = write(fd, "1", 0);
	printf("write() returns %d, errno = %d, msg = %s\n", val, errno,
	       (val < 0) ? strerror(errno) : "");

	errno = 0;
	ret = connect(fd, (struct sockaddr *)&saddr, sizeof(struct sockaddr_in));
	printf("connect returns %d, err = %s\n", ret, strerror(errno));

	close(fd);
	exit(0);
}

--rwEMma7ioTxnRzrJ--

From owner-netdev@oss.sgi.com Tue Aug 13 17:34:15 2002
Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7E0YFRw014316 for ; Tue, 13 Aug 2002 17:34:15 -0700
Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7E0YFEV014315 for netdev-outgoing; Tue, 13 Aug 2002 17:34:15 -0700
X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f
Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7E0Y9Rw014306 for ; Tue, 13 Aug 2002 17:34:09 -0700
Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id EAA22833; Wed, 14 Aug 2002 04:36:24 +0400
From: kuznet@ms2.inr.ac.ru
Message-Id: <200208140036.EAA22833@sex.inr.ac.ru>
Subject: Re: [PATCH] connect() return value.
To: glee@gnupilgrims.org (Geoffrey Lee)
Date: Wed, 14 Aug 2002 04:36:24 +0400 (MSD)
Cc: netdev@oss.sgi.com
In-Reply-To: <20020813235103.GA28432@anakin.wychk.org> from "Geoffrey Lee" at Aug 14, 2 09:51:03 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20
X-Spam-Level:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk

Hello!

> What are you intending to change? :-)

connect.

> > Could you repeat that test with read() after incomplete connect?
> > That your test was apparently invalid because of the same errno mistake,
> > so results may be different. Only, please, use read() with non-zero length.
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> the read / write call. After that, we then we issue a 0 length
> read / write,

I asked you not to do this. This is pathological case, and return
value from this presents only academic interest.

> We find the errno after the first connect is indeed EINPROGRESS. A > read with 0 length returns -1, with an errno of 134. That corresponds > to ENOTCONN on Solaris. F.e. if Solaris behaves in this way on normal not-nil read/write, it is fatally buggy. I do not understand why it works after this, just by plain luck. read/write on not-yet-connected socket can only block (when blocking) or return EAGAIN otherwise. Well, or complete sucessfully. Alexey From owner-netdev@oss.sgi.com Tue Aug 13 20:23:28 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7E3NSRw015906 for ; Tue, 13 Aug 2002 20:23:28 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7E3NRjk015905 for netdev-outgoing; Tue, 13 Aug 2002 20:23:27 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from anakin.wychk.org (CPE-144-132-195-245.nsw.bigpond.net.au [144.132.195.245]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7E3N9Rw015894 for ; Tue, 13 Aug 2002 20:23:12 -0700 Received: from anakin.wychk.org (localhost [127.0.0.1]) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) with ESMTP id g7E32hN5031320; Wed, 14 Aug 2002 13:02:43 +1000 Received: (from glee@localhost) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) id g7E32gP4031319; Wed, 14 Aug 2002 13:02:42 +1000 Date: Wed, 14 Aug 2002 13:02:42 +1000 From: Geoffrey Lee To: kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com Subject: Re: [PATCH] connect() return value. 
Message-ID: <20020814030242.GA30872@anakin.wychk.org>
References: <20020813235103.GA28432@anakin.wychk.org> <200208140036.EAA22833@sex.inr.ac.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=big5
Content-Disposition: inline
In-Reply-To: <200208140036.EAA22833@sex.inr.ac.ru>
User-Agent: Mutt/1.4i

>
> I asked you not to do this. This is pathological case, and return
> value from this presents only academic interest.
>

Burk. Sorry for that error.

> > We find the errno after the first connect is indeed EINPROGRESS. A
> > read with 0 length returns -1, with an errno of 134. That corresponds
> > to ENOTCONN on Solaris.
>
> F.e. if Solaris behaves in this way on normal not-nil read/write,
> it is fatally buggy. I do not understand why it works after this,
> just by plain luck.
>
> read/write on not-yet-connected socket can only block (when blocking)
> or return EAGAIN otherwise. Well, or complete successfully.
>

Ok, let's try again. Same OS, same conditions, same program, but this
time we read and write with a count of 1.

OSF1 4.0: Even with a non-nil write it is very anti-social and returns
different error codes. For read it returns -EWOULDBLOCK, while for
write it returns -ENOTCONN. On OSF1, the errno EWOULDBLOCK is the same
as EAGAIN. It looks like OSF1 does the sanity checks for socket reads
and writes in a different order. This is a bug.

SunOS 5.6: -ENOTCONN for both a read and a write of 1 byte.

Linux 2.4.18: -EAGAIN for both a read and a write of 1 byte.

You said that the fact that Solaris works at all with those semantics
is pure luck, so we try to understand it a bit better.

We modify the program further, and after the connection succeeds, we
read some data and print it out.
For the write case, we use a simple `echo' server, write some data to
the server, read it back, and see that it is ok. We sleep for some time
between the second connect call and the read / write call to give the
3 way handshake ample time to complete. As a control, the same test is
done across all 3 operating systems. We first start the server and
telnet to it to make sure that it works alright. We verify that this is
true. We also start the server on the same computer to rule out
endianness problems.

SunOS 5.6: read case is ok. write case is ok.

OSF1 4.0: read case is ok. write case is ok.

Linux 2.4.18: read case is ok. write case is ok.

So SunOS with -ENOTCONN semantics works. So does OSF1, but we note that
write returns -ENOTCONN. So if you say that it is indeed a bug, then
both SunOS and OSF1 are lucky. But now that we have determined that it
is not portable to use a read or a write to test whether a non-blocking
socket is connected, I'd like to hear your reasons why a read or a write
returning -ENOTCONN is buggy behavior.

	-- G.

From owner-netdev@oss.sgi.com Tue Aug 13 21:14:39 2002
From: kuznet@ms2.inr.ac.ru
Message-Id: <200208140416.IAA23134@sex.inr.ac.ru>
Subject: Re: [PATCH] connect() return value.
To: glee@gnupilgrims.org (Geoffrey Lee)
Date: Wed, 14 Aug 2002 08:16:44 +0400 (MSD)
Cc: netdev@oss.sgi.com
In-Reply-To: <20020814030242.GA30872@anakin.wychk.org> from "Geoffrey Lee" at Aug 14, 2 01:02:42 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0

Hello!

> to hear your reasons why a read or a write returning -ENOTCONN is a

Because any error but EAGAIN/EINTR is a failure. As you have convinced
yourself, there is no good way to detect connection completion, so in
fact writing a correct program is next to impossible.

Actually, never in my life have I seen a program which waits for EISCONN
after EINTR on connect() (but lots of them fail when not seeing 0), so I
have no idea why they work under these OSes. Apparently, they fail
randomly.

Alexey

From owner-netdev@oss.sgi.com Tue Aug 13 22:18:31 2002
Date: Wed, 14 Aug 2002 14:57:55 +1000
From: Geoffrey Lee
To: kuznet@ms2.inr.ac.ru
Cc: netdev@oss.sgi.com
Subject: Re: [PATCH] connect() return value.
Message-ID: <20020814045755.GB32315@anakin.wychk.org>
References: <20020814030242.GA30872@anakin.wychk.org> <200208140416.IAA23134@sex.inr.ac.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=big5
Content-Disposition: inline
In-Reply-To: <200208140416.IAA23134@sex.inr.ac.ru>
User-Agent: Mutt/1.4i

>
> Because any error but EAGAIN/EINTR is failure. As you convinced

Yep. I thought about this as well. If you used read with a byte count of
0 / 1, and it returned -1 to the program with errno set to ENOTCONN, you
cannot tell whether it is still trying to complete the 3 way handshake
or the connection failed.

> yourself there is no good way to detect connection completion,
> so in fact writing correct program is next to impossible.
>

I am a bit jaded about non-blocking connects on Unix, because it is even
more non-portable than I ever imagined.

> Actually, never in my life I have seen a pattern of program
> with waits for EISCONN after EINTR on connect() (but lots of them
> fail when not seeing 0), so I have no idea why they work under these OSes.
> Apparently, they fail randomly.
>

Yep. On Solaris / Digital UNIX the 3 way handshake completes
asynchronously on EINTR. Checking for EISCONN is one imperfect way to
check for successful connection establishment.

So, in the near future, we can expect Linux to not return 0 for
non-blocking connects (i.e. the change you mentioned)?

	-- G.
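
For reference, one commonly used way to detect completion of a
non-blocking connect() without relying on read()/write() return values
is to wait for the socket to become writable and then fetch the pending
error with SO_ERROR. This is not something proposed in this thread; the
sketch below is a minimal illustration only, and the helper name
wait_connected() is made up for the example.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not from the thread): wait up to timeout_ms for
 * a connect() that returned EINPROGRESS to finish.
 * Returns 0 once the socket is connected, -1 with errno set otherwise.
 */
int wait_connected(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int err = 0;
	socklen_t len = sizeof(err);
	int n;

	/* Simplification: the full timeout is reused if poll() is interrupted. */
	do {
		n = poll(&pfd, 1, timeout_ms);
	} while (n < 0 && errno == EINTR);

	if (n < 0)
		return -1;
	if (n == 0) {
		errno = ETIMEDOUT;	/* handshake still in progress */
		return -1;
	}

	/* Writable (or error/hangup): SO_ERROR says how the connect ended. */
	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
		return -1;
	if (err != 0) {
		errno = err;
		return -1;
	}
	return 0;
}
```

A caller would issue the non-blocking connect() as in the test programs
above and, on EINPROGRESS, call wait_connected(fd, timeout) instead of
probing the socket with read() or write().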

From owner-netdev@oss.sgi.com Wed Aug 14 01:39:42 2002
Subject: new net device feature: shared-ipv6-cards
To: netdev@oss.sgi.com
Cc: utz.bacher@de.ibm.com
X-Mailer: Lotus Notes Release 5.0.4a July 24, 2000
Message-ID:
From: "Andreas Herrmann"
Date: Wed, 14 Aug 2002 10:41:40 +0200
MIME-Version: 1.0
Content-type: text/plain; charset=us-ascii

Hi,

I ask for integration of the attached patch into the stock kernel.
The patch is against kernel version 2.4.19.

This patch allows the 0xFFFE part of an EUI-64 based interface
identifier to be replaced by another 16-bit value. The new net device
feature is needed to avoid duplicate address conflicts on Linux for
z/Series when shared OSA cards are used with IPv6.
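
The substitution described above can be sketched in user space as
follows. This is an illustration only, not the kernel code:
make_interface_id() and its shared and dev_id parameters are
hypothetical stand-ins for the NETIF_F_SHARED_IPV6 feature flag and the
per-instance id that the patch adds to struct net_device.

```c
#include <string.h>

/*
 * Build an 8-byte interface identifier from a 6-byte MAC address.
 * shared == 0: standard EUI-64 (insert 0xFFFE, flip the "u" bit).
 * shared != 0: insert dev_id instead and leave the "u" bit alone.
 */
void make_interface_id(const unsigned char mac[6], int shared,
		       unsigned short dev_id, unsigned char eui[8])
{
	memcpy(eui, mac, 3);		/* first three MAC bytes */
	memcpy(eui + 5, mac + 3, 3);	/* last three MAC bytes */

	if (shared) {
		eui[3] = (dev_id >> 8) & 0xff;
		eui[4] = dev_id & 0xff;
	} else {
		eui[3] = 0xFF;
		eui[4] = 0xFE;
		eui[0] ^= 2;		/* invert universal/local bit */
	}
}
```

With the MAC address 00:11:22:33:44:55, the standard case yields
02:11:22:ff:fe:33:44:55, while a shared card with id 0x1234 yields
00:11:22:12:34:33:44:55, so two instances with different ids get
different interface identifiers.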

The following changes are performed:

o Documentation/Configure.help
  net/Config.in
	Configure option and help text for the new feature.
o arch/s390/defconfig
  arch/s390x/defconfig
	Set option on by default for Linux for z/Series.
o include/linux/netdevice.h
	Changed struct net_device to store an id for the card user
	instance. Introduced macro for the new net device feature.
o net/ipv6/addrconf.c
	Replace 0xFFFE by a card user instance id in function
	ipv6_generate_eui64().
o net/8021q/vlan.c
	Take new feature into account for VLAN devices in function
	register_802_1Q_vlan_device().

On Linux for zSeries, OSA network cards can be shared among several
Linux instances. The OSA card has only one MAC address. With IPv6 and a
vanilla kernel this leads to duplicate address conflicts if more than
one Linux instance uses the same card.

But the device driver for the card can deliver a unique 16-bit
identifier for each Linux instance sharing the same card. This
identifier replaces the 0xFFFE part of the interface identifier. The
"u" bit of the interface identifier is not inverted when the new
feature is used. Hence the resulting interface identifier has local
scope according to RFC2373. Consequently this change of the
autoconfiguration does not violate any RFCs.

Regards,

Andreas

--
Linux for eServer Development
Tel :  +49-7031-16-4640
Notes mail :  Andreas Herrmann/GERMANY/IBM@IBMDE
email :  aherrman@de.ibm.com

diff -Naur linux-2.4.19.old/Documentation/Configure.help linux-2.4.19/Documentation/Configure.help
--- linux-2.4.19.old/Documentation/Configure.help	Mon Aug 5 10:54:29 2002
+++ linux-2.4.19/Documentation/Configure.help	Mon Aug 5 10:56:50 2002
@@ -5404,6 +5404,16 @@
 
   It is safe to say N here for now.
 
+Prepare net_device struct for shared IPv6 cards
+CONFIG_SHARED_IPV6_CARDS
+  This prepares the net_device structure to contain a card user instance
+  id. On some systems, e.g. IBM zSeries, networking cards can be shared.
+  In order to make IPv6 autoconfiguration useful, each user of the
+  networking card will get a different id which is used for unique
+  address generation (the id is used in the EUI-64 generation).
+
+  Only say yes on IBM zSeries or S/390 systems.
+
 # 2.5 tree only
 IPv6: routing messages via old netlink
 CONFIG_IPV6_NETLINK
diff -Naur linux-2.4.19.old/arch/s390/defconfig linux-2.4.19/arch/s390/defconfig
--- linux-2.4.19.old/arch/s390/defconfig	Mon Aug 5 10:54:28 2002
+++ linux-2.4.19/arch/s390/defconfig	Mon Aug 5 12:22:33 2002
@@ -146,6 +146,7 @@
 # CONFIG_INET_ECN is not set
 # CONFIG_SYN_COOKIES is not set
 CONFIG_IPV6=m
+CONFIG_SHARED_IPV6_CARDS=y
 # CONFIG_KHTTPD is not set
 # CONFIG_ATM is not set
 # CONFIG_VLAN_8021Q is not set
diff -Naur linux-2.4.19.old/arch/s390x/defconfig linux-2.4.19/arch/s390x/defconfig
--- linux-2.4.19.old/arch/s390x/defconfig	Mon Aug 5 10:54:29 2002
+++ linux-2.4.19/arch/s390x/defconfig	Mon Aug 5 12:23:29 2002
@@ -146,6 +146,7 @@
 # CONFIG_INET_ECN is not set
 # CONFIG_SYN_COOKIES is not set
 CONFIG_IPV6=m
+CONFIG_SHARED_IPV6_CARDS=y
 # CONFIG_KHTTPD is not set
 # CONFIG_ATM is not set
 # CONFIG_VLAN_8021Q is not set
diff -Naur linux-2.4.19.old/include/linux/netdevice.h linux-2.4.19/include/linux/netdevice.h
--- linux-2.4.19.old/include/linux/netdevice.h	Mon Aug 5 10:54:19 2002
+++ linux-2.4.19/include/linux/netdevice.h	Mon Aug 5 12:18:41 2002
@@ -362,6 +362,9 @@
 #define NETIF_F_HW_VLAN_RX	256	/* Receive VLAN hw acceleration */
 #define NETIF_F_HW_VLAN_FILTER	512	/* Receive filtering on VLAN */
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
+#define NETIF_F_SHARED_IPV6	2048	/* make IPv6 address autogeneration
+					 * network card instance aware
+					 */
 
 	/* Called after device is detached from network. */
 	void			(*uninit)(struct net_device *dev);
@@ -431,6 +434,9 @@
 	/* this will get initialized at each interface type init routine */
 	struct divert_blk	*divert;
 #endif /* CONFIG_NET_DIVERT */
+#ifdef CONFIG_SHARED_IPV6_CARDS
+	unsigned short		dev_id;
+#endif /* CONFIG_SHARED_IPV6_CARDS */
 };
 
 
diff -Naur linux-2.4.19.old/net/8021q/vlan.c linux-2.4.19/net/8021q/vlan.c
--- linux-2.4.19.old/net/8021q/vlan.c	Mon Aug 5 10:54:19 2002
+++ linux-2.4.19/net/8021q/vlan.c	Mon Aug 5 10:56:50 2002
@@ -437,6 +437,10 @@
 	/* IFF_BROADCAST|IFF_MULTICAST; ??? */
 	new_dev->flags = real_dev->flags;
 	new_dev->flags &= ~IFF_UP;
+#ifdef CONFIG_SHARED_IPV6_CARDS
+	new_dev->features |= (real_dev->features & NETIF_F_SHARED_IPV6);
+	new_dev->dev_id = real_dev->dev_id;
+#endif
 
 	/* Make this thing known as a VLAN device */
 	new_dev->priv_flags |= IFF_802_1Q_VLAN;
diff -Naur linux-2.4.19.old/net/Config.in linux-2.4.19/net/Config.in
--- linux-2.4.19.old/net/Config.in	Mon Aug 5 10:54:19 2002
+++ linux-2.4.19/net/Config.in	Mon Aug 5 10:56:50 2002
@@ -25,6 +25,7 @@
    if [ "$CONFIG_IPV6" != "n" ]; then
       source net/ipv6/Config.in
    fi
+   bool ' Prepare net_device struct for shared IPv6 cards' CONFIG_SHARED_IPV6_CARDS
 fi
 if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then
    source net/khttpd/Config.in
diff -Naur linux-2.4.19.old/net/ipv6/addrconf.c linux-2.4.19/net/ipv6/addrconf.c
--- linux-2.4.19.old/net/ipv6/addrconf.c	Mon Aug 5 10:54:19 2002
+++ linux-2.4.19/net/ipv6/addrconf.c	Mon Aug 5 10:56:50 2002
@@ -690,9 +690,20 @@
 			return -1;
 		memcpy(eui, dev->dev_addr, 3);
 		memcpy(eui + 5, dev->dev_addr+3, 3);
+#ifdef CONFIG_SHARED_IPV6_CARDS
+		if (dev->features&NETIF_F_SHARED_IPV6) {
+			eui[3] = (dev->dev_id>>8)&0xff;
+			eui[4] = dev->dev_id&0xff;
+		} else {
+			eui[3] = 0xFF;
+			eui[4] = 0xFE;
+			eui[0] ^= 2;
+		}
+#else /* CONFIG_SHARED_IPV6_CARDS */
 		eui[3] = 0xFF;
 		eui[4] = 0xFE;
 		eui[0] ^= 2;
+#endif /* CONFIG_SHARED_IPV6_CARDS */
 		return 0;
 	}
 	return -1;

From owner-netdev@oss.sgi.com Wed Aug 14 05:20:59 2002
Date: Wed, 14 Aug 2002 08:15:47 -0400 (EDT)
From: Donald Becker
To: kuznet@ms2.inr.ac.ru
cc: Martin Josefsson ,
Subject: Re: raw ipv6 broken in 2.4.19
In-Reply-To: <200208132006.AAA21878@sex.inr.ac.ru>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Wed, 14 Aug 2002 kuznet@ms2.inr.ac.ru wrote:

> Actually, tulip is really different of another cards, this creature
> generates badly aligned packets. F.e. it is possible a checksumming routine
> folds some bytes beyond end of frame. :-) I have lots of various tulips here,
> but have never seen such a shit. :-)

I'm not certain what you mean by this: there are many NIC chips that have
alignment requirements similar to the Tulip design.

Note that the Tulip is no more likely than other cards to trigger
motherboard PCI implementation bugs. It's just that there are far more
NIC chip types supported by the Tulip driver than other drivers.

--
Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave.
Suite 210				Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993

From owner-netdev@oss.sgi.com Wed Aug 14 06:11:23 2002
From: kuznet@ms2.inr.ac.ru
Message-Id: <200208141313.RAA23897@sex.inr.ac.ru>
Subject: Re: raw ipv6 broken in 2.4.19
To: becker@scyld.com (Donald Becker)
Date: Wed, 14 Aug 2002 17:13:21 +0400 (MSD)
Cc: gandalf@wlug.westbo.se, netdev@oss.sgi.com
In-Reply-To: from "Donald Becker" at Aug 14, 2 08:15:47 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0

Hello!

> I'm not certain what you mean by this: there are many NIC chips that have
> alignment requirements similar to the Tulip design.

Compared to _his_ ones.

> Note that the Tulip is no more likely than other cards to trigger
> motherboard PCI implementation bugs.

Donald, it was a brain attack. When speaking about alignment we had
already switched to investigating our checksumming routines.

Alexey

From owner-netdev@oss.sgi.com Wed Aug 14 10:23:54 2002
From: kuznet@ms2.inr.ac.ru
Message-Id: <200208141725.VAA24463@sex.inr.ac.ru>
Subject: Re: [PATCH] connect() return value.
To: glee@gnupilgrims.org (Geoffrey Lee)
Date: Wed, 14 Aug 2002 21:25:39 +0400 (MSD)
Cc: davem@redhat.com (Dave Miller), netdev@oss.sgi.com
In-Reply-To: <20020814045755.GB32315@anakin.wychk.org> from "Geoffrey Lee" at Aug 14, 2 02:57:55 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0

Hello!

> So, in the near future, we can expect Linux to not return 0 for
> non-blocking connects (i.e. the change you mentioned)?

I wanted to, but I changed my mind about this. It is pretty strange to
mimic the behavior of Solaris/Tru64 and Co., which really stinks.

So, although our current behavior is not perfect and not even quite
self-consistent, it is the best one. And it does not contradict
SUS/POSIX.

So, connect() remains restartable, and it will return 0 on success
instead of EISCONN crap.

Alexey

From owner-netdev@oss.sgi.com Wed Aug 14 20:24:16 2002
Date: Thu, 15 Aug 2002 11:25:43 +0800
From: glee@gnupilgrims.org
To: kuznet@ms2.inr.ac.ru
Cc: Dave Miller , netdev@oss.sgi.com
Subject: Re: [PATCH] connect() return value.
Message-ID: <20020815032543.GA3083@gandalf.chinesecodefoo.org>
References: <20020814045755.GB32315@anakin.wychk.org> <200208141725.VAA24463@sex.inr.ac.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200208141725.VAA24463@sex.inr.ac.ru>
User-Agent: Mutt/1.3.28i

On Wed, Aug 14, 2002 at 09:25:39PM +0400, kuznet@ms2.inr.ac.ru wrote:
> Hello!
>
> > So, in the near future, we can expect Linux to not return 0 for
> > non-blocking connects (i.e. the change you mentioned)?
>
> I wanted, but I changed my opinion about this. It is pretty strange to mimic
> behavior of solaris/tru64 and Co, which is really stinking.
>

True, it would be bad to go for a technically inferior solution in which
EINTR must be handled by the userspace code.

> So, despite of our current behavior is not perfect and even not
> quite self-consistent, it is the best one.
> Well, and it does not contradict to sus/posix.
>

Agreed on both points. As far as I recall it is not standardized, not
even what "may" or "should" be done.

> So, connect() remains restartable, and it will return 0 on success
> instead of EISCONN crap.

Yep.

I research how sockets are done on various Unix as a hobby. So, one of
the motives of my original letter was that I wanted to know whether
there are any special reasons (well, now I know after this discussion)
why Linux's connect is restartable. I will make a special note of that
in my notes.

Thanks for this fruitful discussion. :-)

	-- G.

From owner-netdev@oss.sgi.com Wed Aug 14 20:27:12 2002
Date: Wed, 14 Aug 2002 20:15:14 -0700 (PDT)
Message-Id: <20020814.201514.30457132.davem@redhat.com>
To: glee@gnupilgrims.org
Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com
Subject: Re: [PATCH] connect() return value.
From: "David S. Miller"
In-Reply-To: <20020815032543.GA3083@gandalf.chinesecodefoo.org>
References: <20020814045755.GB32315@anakin.wychk.org> <200208141725.VAA24463@sex.inr.ac.ru> <20020815032543.GA3083@gandalf.chinesecodefoo.org>
X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

   From: glee@gnupilgrims.org
   Date: Thu, 15 Aug 2002 11:25:43 +0800

   I research how sockets are done on various Unix as a hobby.

How extensive are your notes on poll() behavior on TCP
sockets? :-)

From owner-netdev@oss.sgi.com Thu Aug 15 02:44:22 2002
Message-ID:
From: "Olson, John C"
To: "'netdev@oss.sgi.com'"
Subject: system hang under remove security
Date: Wed, 14 Aug 2002 11:46:41 -0400
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2653.19)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Hello,

Have been working with Andrey Savochkin on a problem that I have been
experiencing with the 2.4.16 kernel and he suggested that I contact you to
see if you had experience with it.

The key points are:
* my system hung (didn't respond to keyboard, remote and so on) when I ran
  remote network scanner (nessus - specifically udp scans for mstream and
  trinoo although it fails for more than that)
* I had all my services disabled (see netstat output below)
* I've tried it with eepro100 and 3com card, with the same result
* my kernel is 2.4.16 (SuSE provided for 7.3 professional)
* Running on a Compaq ML370 with a 4200 controller

Here is a listing of my netstat -a:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
raw        0      0 *:raw                   *:*                     7
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  6      [ ]         DGRAM                    620    /dev/log
unix  3      [ ]         STREAM     CONNECTED     3352
unix  3      [ ]         STREAM     CONNECTED     3351
unix  2      [ ]         DGRAM                    3309
unix  2      [ ]         DGRAM                    1207
unix  2      [ ]         DGRAM                    1058
unix  2      [ ]         DGRAM                    826

Here is a listing of lsmod:

Module                  Size  Used by
af_packet              12976   1  (autoclean)
3c59x                  22240   1  (autoclean)
pci-scan                3440   1  (autoclean) [3c59x]
lvm-mod                45440  13  (autoclean)
reiserfs              153520   8
ncr53c8xx              51856   0  (unused)
cpqarray               16208   4

The question: are there known vulnerabilities of this kind?

It seems like I have (in my mind) narrowed this problem down to either a
kernel or IP stack problem. Any help would be very much appreciated.

Thanks,
John

-----Original Message-----
From: Andrey Savochkin [mailto:saw@saw.sw.com.sg]
Sent: Wednesday, August 14, 2002 10:43 AM
To: Olson, John C
Subject: Re: 2.4.16 freezed up with eepro100 module

On Wed, Aug 14, 2002 at 10:24:39AM -0400, Olson, John C wrote:
> BTW - just tried the same thing with sshd turned off as well (i.e. only
> thing listening was raw) and it still crashed. Doesn't that mean that the
> only things left to check are the ip stack, kernel and driver? Since I've
> gone through multiple drivers and cards, shouldn't that take out the driver
> leaving the stack and kernel?

You've done it already, by trying 2 different drivers: eepro100 and 3com,
right? So, you've eliminated the driver.

I think, it's the time to ask other kernel people, Alan Cox or the mailing
list netdev@oss.sgi.com.
The key points are:
- your system hung (didn't respond to keyboard and so on) when you ran
  remote network scanner, doing nessus or whatever attacks
- you had all your services disabled (provide netstat output)
- you've tried it with eepro100 and 3com card, with the same result
- your kernel is 2.4.16 (add whether it's a mainstream or redhat kernel)
The question: are there known vulnerabilities of this kind?

And pick a reasonable subject, like
"system hang under remote security scan" :-)

	Andrey

From owner-netdev@oss.sgi.com Thu Aug 15 04:22:08 2002
Date: Thu, 15 Aug 2002 21:02:10 +1000
From: Geoffrey Lee
To: "David S. Miller"
Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com
Subject: Re: [PATCH] connect() return value.
Message-ID: <20020815110210.GA15218@anakin.wychk.org>
References: <20020814045755.GB32315@anakin.wychk.org> <200208141725.VAA24463@sex.inr.ac.ru> <20020815032543.GA3083@gandalf.chinesecodefoo.org> <20020814.201514.30457132.davem@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=big5
Content-Disposition: inline
In-Reply-To: <20020814.201514.30457132.davem@redhat.com>
User-Agent: Mutt/1.4i

On Wed, Aug 14, 2002 at 08:15:14PM -0700, David S. Miller wrote:
>    From: glee@gnupilgrims.org
>    Date: Thu, 15 Aug 2002 11:25:43 +0800
>
>    I research how sockets are done on various Unix as a hobby.
>
> How extensive are your notes on poll() behavior on TCP
> sockets? :-)

Hmm, in which regard? Is there something specific that you are
searching for? :-)

	-- G.

From owner-netdev@oss.sgi.com Thu Aug 15 08:01:36 2002
Message-ID: <3D5BC2AD.3070905@unix.sh>
Date: Thu, 15 Aug 2002 09:03:09 -0600
From: Alan Robertson
Organization: IBM Linux Technology Center
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204
X-Accept-Language: en-us
MIME-Version: 1.0
To: netdev@oss.sgi.com
Cc: Kevin Dwyer
Subject: Resend: SIOCGIFBRDADDR?
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

I sent this earlier, but I got no reply, and can't seem to find it in the
archives. If you reply to this, could you please copy both Kevin Dwyer and
myself on the reply? Many thanks!

Hi,

Some folks using the linux-ha (heartbeat) software have been running into
some network-related problems, and Alan Cox suggested that you folks could
help us out.

One of the ways you can configure linux-ha protocols is to send broadcast
packets. In this case, our software needs to get the broadcast address from
the OS, so we can proceed.
We currently use the SIOCGIFBRDADDR ioctl to do this for us. This has worked pretty well for us for quite some time, but recently some users have run into some problems. It appears that if one configures an interface using ifconfig, then the SIOCGIFBRDADDR ioctl works quite nicely to determine the broadcast address of an interface. However, if one uses the iproute2 tools to configure the interface, then it appears that the SIOCGIFBRDADDR ioctl returns bogus results on that interface. Is this supposed to be the case? Is there a different method we're supposed to use instead of SIOCGIFBRDADDR? Thanks! -- Alan Robertson alanr@unix.sh http://linux-ha.org/ From owner-netdev@oss.sgi.com Thu Aug 15 08:31:44 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7FFViRw005613 for ; Thu, 15 Aug 2002 08:31:44 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7FFVij1005612 for netdev-outgoing; Thu, 15 Aug 2002 08:31:44 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7FFVeRw005602 for ; Thu, 15 Aug 2002 08:31:41 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id TAA26970; Thu, 15 Aug 2002 19:33:20 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208151533.TAA26970@sex.inr.ac.ru> Subject: Re: Resend: SIOCGIFBRDADDR? To: alanr@unix.sh (Alan Robertson) Date: Thu, 15 Aug 2002 19:33:20 +0400 (MSD) Cc: netdev@oss.sgi.com In-Reply-To: <3D5BC2AD.3070905@unix.sh> from "Alan Robertson" at Aug 15, 2 07:15:01 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.9 required=5.0 tests=IN_REP_TO,SUBJ_ENDS_IN_Q_MARK,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > Is this supposed to be the case? You forgot to say what kind of "bogus result" is obtained. 
Alexey From owner-netdev@oss.sgi.com Thu Aug 15 09:07:53 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7FG7qRw006391 for ; Thu, 15 Aug 2002 09:07:53 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7FG7qQk006390 for netdev-outgoing; Thu, 15 Aug 2002 09:07:52 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from kaneda.pheared.net (root@kaneda.isrd.net [206.205.246.39]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7FG7dRw006380 for ; Thu, 15 Aug 2002 09:07:39 -0700 Received: from kaneda.isrd.lan (kevin@localhost [127.0.0.1]) by kaneda.pheared.net (8.12.3/8.12.3) with ESMTP id g7FGABCK010105; Thu, 15 Aug 2002 12:10:11 -0400 Received: from localhost (kevin@localhost) by kaneda.isrd.lan (8.12.3/8.12.3/Submit) with ESMTP id g7FGABdL010101; Thu, 15 Aug 2002 12:10:11 -0400 X-Authentication-Warning: kaneda.isrd.lan: kevin owned process doing -bs Date: Thu, 15 Aug 2002 12:10:11 -0400 (EDT) From: Kevin Dwyer To: kuznet@ms2.inr.ac.ru cc: Alan Robertson , Subject: Re: Resend: SIOCGIFBRDADDR? In-Reply-To: <200208151533.TAA26970@sex.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.5 required=5.0 tests=IN_REP_TO,SUBJ_ENDS_IN_Q_MARK version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Thu, 15 Aug 2002, kuznet@ms2.inr.ac.ru grunted something like: > You forgot to say what kind of "bogus result" is obtained. Since I'm the one who has primarily been chasing this problem, I'll explain. (cut from my original email...) 
Specifically, it uses iproute to add aliases to a device, like eth0 in this example: ip -f inet addr add 10.5.5.1 dev eth0 scope link ip -f inet addr add 10.5.5.2 dev eth0 scope link ip -f inet addr add 10.5.5.3 dev eth0 scope link ip -f inet addr add 10.5.5.4 dev eth0 scope link ip -f inet addr add 10.5.5.5 dev eth0 scope link That's all fine, except that it causes really weird stuff to happen to the output of ifconfig. For one, the inet addr: field will change once in a while if I make a modification with ifconfig. That doesn't affect heartbeat, but the Bcast: field is wrongly reported as 0.0.0.0. Now, I know for a fact that I set it correctly when I bring up the interface, prior to using the ip route commands. Once they've been applied, it goes crazy. So my question: Does anyone know if there's an ip route command to make the Bcast field show up as the correct value, or is there a better way for heartbeat to get the broadcast address? (and from a followup email...) # ip addr show eth0 primary scope global 2: eth0: mtu 1500 qdisc pfifo_fast qlen 100 link/ether 00:c0:95:e1:bc:64 brd ff:ff:ff:ff:ff:ff inet 10.5.5.8/24 brd 10.5.5.255 scope global eth0 That command gives me the correct broadcast address in the brd field. Correct meaning, the one we're looking for, and would otherwise have gotten if I hadn't used the before-mentioned iproute commands. Any help is appreciated. Thanks. -[ kevin@pheared.net devel.pheared.net ]- -[ Rather be forgotten, than remembered for giving in. 
]- -[ ZZ = g ^ (xb * xa) mod p g = h^{(p-1)/q} mod p ]- From owner-netdev@oss.sgi.com Thu Aug 15 09:34:06 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7FGY6Rw007008 for ; Thu, 15 Aug 2002 09:34:06 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7FGY6Qb007007 for netdev-outgoing; Thu, 15 Aug 2002 09:34:06 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7FGY1Rw006996 for ; Thu, 15 Aug 2002 09:34:01 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id UAA27112; Thu, 15 Aug 2002 20:35:59 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208151635.UAA27112@sex.inr.ac.ru> Subject: Re: Resend: SIOCGIFBRDADDR? To: kevin@pheared.net (Kevin Dwyer) Date: Thu, 15 Aug 2002 20:35:59 +0400 (MSD) Cc: alanr@unix.sh, netdev@oss.sgi.com In-Reply-To: from "Kevin Dwyer" at Aug 15, 2 12:10:11 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.9 required=5.0 tests=IN_REP_TO,SUBJ_ENDS_IN_Q_MARK,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > crazy. So my question: Does anyone know if there's an ip route command to > make the Bcast field show up as the correct value, Yes, that one which is described in manual (and equivalent to corresponding ifconfig) ifconfig eth0:tra-ta-ta 10.x.y.z etc. > or is there a better > way for heartbeat to get the broadcast address? Well, yes. Initialize ifreq with right address, for instance. Probably, your program has the same problem, which is in net-tools package. Or this: > That command gives me the correct broadcast address in the brd field. 
Alexey From owner-netdev@oss.sgi.com Thu Aug 15 12:50:48 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7FJomRw010341 for ; Thu, 15 Aug 2002 12:50:48 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7FJomFW010340 for netdev-outgoing; Thu, 15 Aug 2002 12:50:48 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from kaneda.pheared.net (root@kaneda.isrd.net [206.205.246.39]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7FJmvRw010281 for ; Thu, 15 Aug 2002 12:48:57 -0700 Received: from kaneda.isrd.lan (kevin@localhost [127.0.0.1]) by kaneda.pheared.net (8.12.3/8.12.3) with ESMTP id g7FJD8CK010296 for ; Thu, 15 Aug 2002 15:13:08 -0400 Received: from localhost (kevin@localhost) by kaneda.isrd.lan (8.12.3/8.12.3/Submit) with ESMTP id g7FJBmQd010292; Thu, 15 Aug 2002 15:12:28 -0400 X-Authentication-Warning: kaneda.isrd.lan: kevin owned process doing -bs Date: Thu, 15 Aug 2002 15:11:48 -0400 (EDT) From: Kevin Dwyer To: kuznet@ms2.inr.ac.ru cc: alanr@unix.sh, Subject: Re: Resend: SIOCGIFBRDADDR? In-Reply-To: <200208151635.UAA27112@sex.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.5 required=5.0 tests=IN_REP_TO,SUBJ_ENDS_IN_Q_MARK version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Thu, 15 Aug 2002, kuznet@ms2.inr.ac.ru grunted something like: Thanks for the response... > > crazy. So my question: Does anyone know if there's an ip route command to > > make the Bcast field show up as the correct value, > > Yes, that one which is described in manual (and equivalent to corresponding > ifconfig) > > ifconfig eth0:tra-ta-ta 10.x.y.z etc. Perhaps I should clarify a bit further. 
When we are starting heartbeat, the interface is setup thusly: (I've simplified this to make it easier to read) ifconfig eth0 10.5.5.8 broadcast 10.5.5.255 netmask 255.255.255.0 ip -f inet addr add 10.5.5.1 dev eth0 scope link ip -f inet addr add 10.5.5.2 dev eth0 scope link After executing those commands, we see the following: ifconfig eth0 eth0 Link encap:Ethernet HWaddr 00:C0:95:E1:BC:64 inet addr:10.5.5.1 Bcast:0.0.0.0 Mask:255.255.255.255 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 (Note: the inet addr, Bcast, and Mask fields are all incorrectly reporting ifconfig's initial settings, which is why heartbeat is also confused.) 2: eth0: mtu 1500 qdisc pfifo_fast qlen 100 link/ether 00:c0:95:e1:bc:64 brd ff:ff:ff:ff:ff:ff inet 10.5.5.1/32 scope link eth0 inet 10.5.5.2/32 scope link eth0 inet 10.5.5.8/24 brd 10.5.5.255 scope global eth0 So we have eth0 setup as 10.5.5.8, and two aliases setup with iproute at 10.5.5.1 and 10.5.5.2. They are not however aliases like eth0:1, eth0:2, etc. But, iproute has setup the kernel to reply to ARPs for those IPs. So, heartbeat needs to know which interface to use for sending broadcast packets to another node. I've defined eth0 as said interface, and it uses eth0 as the ifreq for the ioctl call. However, the ioctl call returns 0.0.0.0, when in fact I need 10.5.5.255. > > or is there a better > > way for heartbeat to get the broadcast address? > > Well, yes. Initialize ifreq with right address, for instance. > Probably, your program has the same problem, which is in net-tools package. See above.. the problem is that using eth0 is the right address. > Or this: > > > That command gives me the correct broadcast address in the brd field. The only problem I see with emulating iproute's commands is that you'll need netlink (is this correct?) so it won't work for everyone until everyone starts adding that to their kernel config. (A possibly reasonable request, if this is indeed the best way to attack the problem. 
This is partially why we're asking for commentary/help.) -[ kevin@pheared.net devel.pheared.net ]- -[ Rather be forgotten, than remembered for giving in. ]- -[ ZZ = g ^ (xb * xa) mod p g = h^{(p-1)/q} mod p ]- From owner-netdev@oss.sgi.com Thu Aug 15 14:35:56 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7FLZuRw011542 for ; Thu, 15 Aug 2002 14:35:56 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7FLZuTT011541 for netdev-outgoing; Thu, 15 Aug 2002 14:35:56 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7FLZpRw011532 for ; Thu, 15 Aug 2002 14:35:52 -0700 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id OAA12170; Thu, 15 Aug 2002 14:23:37 -0700 Date: Thu, 15 Aug 2002 14:23:37 -0700 (PDT) Message-Id: <20020815.142337.95077599.davem@redhat.com> To: glee@gnupilgrims.org Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: [PATCH] connect() return value. From: "David S. Miller" In-Reply-To: <20020815110210.GA15218@anakin.wychk.org> References: <20020815032543.GA3083@gandalf.chinesecodefoo.org> <20020814.201514.30457132.davem@redhat.com> <20020815110210.GA15218@anakin.wychk.org> X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk From: Geoffrey Lee Date: Thu, 15 Aug 2002 21:02:10 +1000 On Wed, Aug 14, 2002 at 08:15:14PM -0700, David S. Miller wrote: > How extensive are your notes on poll() behavior on TCP > sockets? :-) Hmm, in which regard? Is there something specific that you are searching for? 
:-) POLLHUP in particular. I remember we verified that our behavior matched Solaris but no checks were performed against others. From owner-netdev@oss.sgi.com Thu Aug 15 18:28:52 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7G1SqRw015645 for ; Thu, 15 Aug 2002 18:28:52 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7G1SpFp015644 for netdev-outgoing; Thu, 15 Aug 2002 18:28:51 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from anakin.wychk.org (CPE-144-132-195-245.nsw.bigpond.net.au [144.132.195.245]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7G1S7Rw015628 for ; Thu, 15 Aug 2002 18:28:10 -0700 Received: from anakin.wychk.org (localhost [127.0.0.1]) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) with ESMTP id g7G18B2M028944; Fri, 16 Aug 2002 11:08:11 +1000 Received: (from glee@localhost) by anakin.wychk.org (8.12.5/8.12.5/Debian-1) id g7G189Mb028943; Fri, 16 Aug 2002 11:08:09 +1000 Date: Fri, 16 Aug 2002 11:08:09 +1000 From: Geoffrey Lee To: "David S. Miller" Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: [PATCH] connect() return value. Message-ID: <20020816010809.GA28446@anakin.wychk.org> References: <20020815032543.GA3083@gandalf.chinesecodefoo.org> <20020814.201514.30457132.davem@redhat.com> <20020815110210.GA15218@anakin.wychk.org> <20020815.142337.95077599.davem@redhat.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="AqsLC8rIMeq19msA" Content-Disposition: inline In-Reply-To: <20020815.142337.95077599.davem@redhat.com> User-Agent: Mutt/1.4i X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk --AqsLC8rIMeq19msA Content-Type: text/plain; charset=big5 Content-Disposition: inline On Thu, Aug 15, 2002 at 02:23:37PM -0700, David S. 
Miller wrote: > From: Geoffrey Lee > Date: Thu, 15 Aug 2002 21:02:10 +1000 > > On Wed, Aug 14, 2002 at 08:15:14PM -0700, David S. Miller wrote: > > How extensive are your notes on poll() behavior on TCP > > sockets? :-) > > Hmm, in which regard? > > Is there something specific that you are searching for? :-) > > POLLHUP in particular. I remember we verified that our behavior > matched Solaris but no checks were performed against others. In all honesty, none for that. But it would be a worthwhile experiment I think. Let us see what happens. Today, we are going to see what happens with POLLHUP on different Unix implementations. Source code is provided. poll.c (server that uses poll) linger.c (our client to invoke the POLLHUP condition) Note in poll.c I used size_t as the 3rd argument to accept instead of socklen_t, as only Linux had socklen_t. We also get a compile warning when compiling the client, as on SunOS 5.6 the 4th argument to setsockopt is a char *. But we can ignore that warning. The following operating systems will be tested against: OSF1 4.0 (Digital UNIX) SunOS 5.6 (Solaris) Linux 2.4.18 We will start the server on a machine with some specified port. We call listen on the specified socket, and accept. Next, we block in the call to poll with fd set to the socket descriptor returned by accept, and events and revents zeroed. On the call to poll, we specify -1 as a sentinel that we want to block forever until something interesting arrives (the error condition). On the client, we connect to the server on the specified port as normal. We specify the SO_LINGER option, with l_onoff set to 1 and linger time set to 0. After we successfully connect to the server, we call close immediately, to issue a RST to the server. This is what happens on Linux 2.4.18: $ ./poll -p 8888 accept ok. poll: POLLERR set poll: POLLHUP set sever terminating $ On the client side: $ ./linger -h [ip] -p 8888 connecting .. closing ... closed.
$ So, sending a RST will trigger POLLHUP and POLLERR on Linux. On OSF1 4.0, we get this: On the server: $ ./poll -p 8888 accept ok. On the client side: $ ./linger -h [ip] -p 8888 connecting .. closing ... closed. $ So on Digital UNIX, it blocks forever in the call to poll. That's strange. Let's see what happens on SunOS 5.6: ./poll -p 8888 accept ok. On the client side: ./linger -h [ip] -p 8888 connecting .. closing ... closed. $ So on SunOS 5.6, it also blocks forever. Do you know which Solaris was tested the last time it was found that Linux behaves the same way as Solaris? By the way, while we are on the subject of poll, what should one include to get POLLRDNORM and INFTIM? I found on OSF1 4.0 one can have both by including , on SunOS 5.6, one can get POLLRDNORM by including , for INFTIM, one must include as well. On Linux, one will not be able to get them by including both or . will not work either. by including , then one can get POLLRDNORM defined. But I could not find INFTIM anywhere on Linux. -- G. 
--AqsLC8rIMeq19msA Content-Type: text/x-csrc; charset=big5 Content-Disposition: attachment; filename="linger.c" #include #include #include #include #include #include /* getopt */ #define HOST "192.168.0.1" #define PORT (22) int main(argc, argv) int argc; char *argv[]; { int sockfd; int c; struct linger ling; struct sockaddr_in servaddr; char *host = HOST; unsigned short port = PORT; while ((c = getopt(argc, argv, "h:p:")) != -1) { switch (c) { case 'h': host = (char *)strdup(optarg); break; case 'p': port = atoi(optarg); break; default: break; } } sockfd = socket(AF_INET, SOCK_STREAM, 0); if (sockfd < 0) { printf("socket\n"); exit(1); } bzero(&servaddr, sizeof(struct sockaddr_in)); servaddr.sin_family = AF_INET; servaddr.sin_port = htons(port); servaddr.sin_addr.s_addr = inet_addr(host); ling.l_onoff = 1; ling.l_linger = 0; if (setsockopt(sockfd, SOL_SOCKET, SO_LINGER, &ling, sizeof(ling)) < 0) { printf("setsockopt\n"); exit(1); } printf("connecting ..\n"); if (connect(sockfd, (struct sockaddr *)&servaddr, sizeof(struct sockaddr_in)) < 0) { printf("connect: %s\n", strerror(errno)); exit(1); } printf("closing ...\n"); close(sockfd); /* send RST */ printf("closed.\n"); exit(0); } --AqsLC8rIMeq19msA Content-Type: text/x-csrc; charset=big5 Content-Disposition: attachment; filename="poll.c" #include #include #include #include #include #include #include /* strdup */ #include /* getopt */ #define PORT (22) #define BACKLOG 20 #ifndef INFTIM #define INFTIM (-1) #endif extern char *optarg; int main(argc, argv) int argc; char *argv[]; { struct pollfd nfd; struct sockaddr_in saddr, caddr; int fd; int c; int clientfd; size_t len; unsigned short port = PORT; bzero(&saddr, sizeof(struct sockaddr_in)); bzero(&caddr, sizeof(struct sockaddr_in)); while ((c = getopt(argc, argv, "h:p:")) != -1) { switch (c) { case 'p': port = atoi(optarg); break; default: break; } } saddr.sin_family = AF_INET; saddr.sin_port = htons(port); saddr.sin_addr.s_addr = INADDR_ANY; fd = socket(AF_INET, 
SOCK_STREAM, 0); if (fd < 0) { printf("socket\n"); exit(1); } if (bind(fd, (struct sockaddr *)&saddr, sizeof(struct sockaddr_in)) < 0) { printf("bind\n"); exit(1); } if (listen(fd, BACKLOG) < 0) { printf("listen\n"); exit(1); } len = sizeof(struct sockaddr_in); if ((clientfd = accept(fd, (struct sockaddr *)&caddr, &len)) < 0) { printf("accept: %s\n", strerror(errno)), exit(1); } printf("accept ok.\n"); /* reset */ nfd.fd = clientfd; nfd.events = 0; nfd.revents = 0; if (poll(&nfd, 1, INFTIM) < 0) printf("poll: %s\n", strerror(errno)), exit(1); if (nfd.revents & POLLERR) printf("poll: POLLERR set\n"); if (nfd.revents & POLLHUP) printf("poll: POLLHUP set\n"); if (nfd.revents & POLLNVAL) printf("poll: POLLNVAL set\n"); /* should not happen */ close(clientfd); printf("sever terminating\n"); exit(0); } --AqsLC8rIMeq19msA-- From owner-netdev@oss.sgi.com Thu Aug 15 22:53:29 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7G5rTRw026819 for ; Thu, 15 Aug 2002 22:53:29 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7G5rT1Z026818 for netdev-outgoing; Thu, 15 Aug 2002 22:53:29 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from smtp-send.myrealbox.com (smtp-send.myrealbox.com [192.108.102.143]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7G5rPRw026809 for ; Thu, 15 Aug 2002 22:53:25 -0700 Received: from bin_ren [192.18.43.11] by myrealbox.com with NetMail ModWeb Module; Thu, 15 Aug 2002 23:56:05 -0600 Subject: the logic behind rt_hash_code() Reply-To: bin_ren@myrealbox.com From: "Bin Ren" To: netdev@oss.sgi.com Date: Thu, 15 Aug 2002 23:56:05 -0600 X-Mailer: NetMail ModWeb Module X-Sender: bin_ren MIME-Version: 1.0 Message-ID: <1029477365.be039780bin_ren@myrealbox.com> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id 
g7G5rPRw026810 X-Spam-Status: No, hits=0.0 required=5.0 tests= version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hi, I have a question. When we look up the IP routing info while sending a packet, we first try to find a match in the routing hash bucket by calling ip_route_output(). If nothing is found, we call ip_route_output_slow(), which looks in the FIB table. In the first case, where we search in the rt_hash_bucket, we first compute a hash number by calling rt_hash_code(). I don't understand the logic behind that computation: it does a lot of XORing and shifting to produce the hash number. Does anyone know the reasoning behind this hash function? Could you explain it to me? Thanks in advance. Laudney Ren From owner-netdev@oss.sgi.com Fri Aug 16 07:20:19 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7GEKJRw023610 for ; Fri, 16 Aug 2002 07:20:19 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7GEKJsA023609 for netdev-outgoing; Fri, 16 Aug 2002 07:20:19 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from gw-nl3.philips.com (gw-nl3.philips.com [212.153.190.5]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7GEK4Rw023599 for ; Fri, 16 Aug 2002 07:20:04 -0700 Received: from smtpscan-nl2.philips.com (smtpscan-nl2.philips.com [130.139.36.22]) by gw-nl3.philips.com (Postfix) with ESMTP id 454AB3B1A7 for ; Fri, 16 Aug 2002 15:57:26 +0200 (MET DST) Received: from smtprelay-nl1.philips.com (localhost [127.0.0.1]) by smtpscan-nl2.philips.com (8.9.3/8.8.5-1.2.2m-19990317) with ESMTP id PAA09925 for ; Fri, 16 Aug 2002 15:57:25 +0200 (MET DST) Received: from prle1.natlab.research.philips.com (prle1.natlab.research.philips.com [130.145.137.162]) by smtprelay-nl1.philips.com
(8.9.3/8.8.5-1.2.2m-19990317) with ESMTP id PAA21266 for ; Fri, 16 Aug 2002 15:57:24 +0200 (MET DST) Received: from PC5175 (PC5175.ddns.htc.nl.philips.com [130.145.174.158]) by prle1.natlab.research.philips.com (8.11.1/8.11.1) with SMTP id g7GDkf400242 for ; Fri, 16 Aug 2002 15:46:41 +0200 (METDST) Message-ID: <003d01c2452b$5a2cc930$9eae9182@ddns.htc.nl.philips.com> From: " F.H.G. Ogg" To: Subject: IPv6 traffic class header field Date: Fri, 16 Aug 2002 15:46:41 +0200 MIME-Version: 1.0 Content-Type: text/plain; charset="Windows-1252" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.50.4807.1700 X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4910.0300 X-Spam-Status: No, hits=0.0 required=5.0 tests= version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hi, is there anyone who can tell me how to set the traffic class in an IPv6 UDP socket? (SOCK_DGRAM) I can already set the flow label, but not the Traffic Class. I run Linux, kernel 2.4.18 (and above). hope you can help greetings _____________________________________________ Felix Ogg (TU/e graduate student) Tel: +31 40 27 45605 Department Software Architectures [WDC 2.010] Philips Research/Natuurkundig Laboratorium Prof. 
Holstlaan 4, 5656 AA EINDHOVEN From owner-netdev@oss.sgi.com Sun Aug 18 07:06:04 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7IE64Rw013743 for ; Sun, 18 Aug 2002 07:06:04 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7IE64AI013742 for netdev-outgoing; Sun, 18 Aug 2002 07:06:04 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from smp.paktronix.com ([65.103.169.59]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7IE5rRw013729 for ; Sun, 18 Aug 2002 07:05:54 -0700 Received: from netmonster.pakint.net (netmonster [192.168.3.13]) by smp.paktronix.com (8.9.3/8.9.3) with ESMTP id JAA10435; Sun, 18 Aug 2002 09:15:04 -0500 Date: Sun, 18 Aug 2002 09:21:12 -0500 (CDT) From: "Matthew G. Marsh" X-X-Sender: mgm@netmonster.pakint.net To: Kevin Dwyer cc: netdev@oss.sgi.com Subject: Re: Resend: SIOCGIFBRDADDR? In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.5 required=5.0 tests=IN_REP_TO,SUBJ_ENDS_IN_Q_MARK version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Thu, 15 Aug 2002, Kevin Dwyer wrote: > On Thu, 15 Aug 2002, kuznet@ms2.inr.ac.ru grunted something like: > > Thanks for the response... > > > > crazy. So my question: Does anyone know if there's an ip route command to > > > make the Bcast field show up as the correct value, > > > > Yes, that one which is described in manual (and equivalent to corresponding > > ifconfig) > > > > ifconfig eth0:tra-ta-ta 10.x.y.z etc. > > Perhaps I should clarify a bit further. 
> When we are starting heartbeat, the interface is setup thusly: > (I've simplified this to make it easier to read) > > ifconfig eth0 10.5.5.8 broadcast 10.5.5.255 netmask 255.255.255.0 > ip -f inet addr add 10.5.5.1 dev eth0 scope link > ip -f inet addr add 10.5.5.2 dev eth0 scope link Why do you use the 'scope link' statements? And as far as your original question try the following: ip -f inet addr add 10.5.5.8/24 dev eth0 brd + ip -f inet addr add 10.5.5.1 dev eth0 scope link brd 10.5.5.255 ip -f inet addr add 10.5.5.2 dev eth0 scope link brd 10.5.5.255 And then you will have the output of 'ip ad li dev eth0' as: 3: eth0: mtu 1500 qdisc pfifo_fast qlen 100 link/ether 00:c0:9f:0c:5f:db brd ff:ff:ff:ff:ff:ff inet 10.5.5.1/32 brd 10.5.5.255 scope link eth0 inet 10.5.5.2/32 brd 10.5.5.255 scope link eth0 inet 10.5.5.8/24 brd 10.5.5.255 scope global eth0 And also you will then have 'ifconfig eth0': eth0 Link encap:Ethernet HWaddr 00:C0:9F:0C:5F:DB inet addr:10.5.5.2 Bcast:10.5.5.255 Mask:255.255.255.255 UP BROADCAST MULTICAST MTU:1500 Metric:1 Work for you? [snip] > -[ kevin@pheared.net devel.pheared.net ]- > -[ Rather be forgotten, than remembered for giving in. ]- > -[ ZZ = g ^ (xb * xa) mod p g = h^{(p-1)/q} mod p ]- And if I am way off base here due to jumping in just ignore me. -------------------------------------------------- Matthew G. 
Marsh, President Paktronix Systems LLC 1506 North 59th Street Omaha NE 68104 Phone: (402) 932-7250 x101 Email: mgm@paktronix.com WWW: http://www.paktronix.com -------------------------------------------------- From owner-netdev@oss.sgi.com Sun Aug 18 11:12:40 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7IICeRw016824 for ; Sun, 18 Aug 2002 11:12:40 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7IICeQn016823 for netdev-outgoing; Sun, 18 Aug 2002 11:12:40 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from kaneda.pheared.net (root@kaneda.isrd.net [206.205.246.39]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7IICVRw016810 for ; Sun, 18 Aug 2002 11:12:32 -0700 Received: from kaneda.isrd.lan (kevin@localhost [127.0.0.1]) by kaneda.pheared.net (8.12.3/8.12.3) with ESMTP id g7IIFICK014285; Sun, 18 Aug 2002 14:15:18 -0400 Received: from localhost (kevin@localhost) by kaneda.isrd.lan (8.12.3/8.12.3/Submit) with ESMTP id g7IIFHEQ014281; Sun, 18 Aug 2002 14:15:18 -0400 X-Authentication-Warning: kaneda.isrd.lan: kevin owned process doing -bs Date: Sun, 18 Aug 2002 14:15:17 -0400 (EDT) From: Kevin Dwyer To: "Matthew G. Marsh" cc: netdev@oss.sgi.com Subject: Re: Resend: SIOCGIFBRDADDR? In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.5 required=5.0 tests=IN_REP_TO,SUBJ_ENDS_IN_Q_MARK version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Sun, 18 Aug 2002, Matthew G. Marsh grunted something like: > > ifconfig eth0 10.5.5.8 broadcast 10.5.5.255 netmask 255.255.255.0 > > ip -f inet addr add 10.5.5.1 dev eth0 scope link > > ip -f inet addr add 10.5.5.2 dev eth0 scope link > > Why do you use the 'scope link' statements? Why do I use the 'scope link' argument, or the statement as a whole? 
Those two statements give the same effect as aliases (more or less) without having to ifconfig eth0:X for each X. My understanding was the 'scope link' part was useful in keeping the effects of the command local to the device. However, I also did not generate these commands on my own: fwbuilder is doing that. We're trying to get a setup that can be easily managed by other people who may not have the same experience with command-line tools and such. > And as far as your original question try the following: > > ip -f inet addr add 10.5.5.8/24 dev eth0 brd + > ip -f inet addr add 10.5.5.1 dev eth0 scope link brd 10.5.5.255 > ip -f inet addr add 10.5.5.2 dev eth0 scope link brd 10.5.5.255 Aha! At the suggestion of the iproute2 documentation, and another person on the linux-ha list, I tried using 'brd +' on the aliased IPs, but apparently we were reading the docs wrong. Setting it explicitly works! > And also you will then have 'ifconfig eth0': > > eth0 Link encap:Ethernet HWaddr 00:C0:9F:0C:5F:DB > inet addr:10.5.5.2 Bcast:10.5.5.255 Mask:255.255.255.255 > UP BROADCAST MULTICAST MTU:1500 Metric:1 > > Work for you? Exactly what we need. Now the ioctl can pull the broadcast address. > And if I am way off base here due to jumping in just ignore me. Not at all, thanks for the suggestion. It seems now that I need to try to encourage the fwbuilder folks to specify the broadcast address on each alias. At the very least, I should be able to whip out a patch to do it, and hopefully they'll accept it. /* kevin@pheared.net http://devel.pheared.net/ */ /* Network Security Engineer http://pheared.net/~kevin */ /* Sabotage will set us free. Throw a rock in the machine. 
*/ From owner-netdev@oss.sgi.com Sun Aug 18 11:21:02 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7IIL2Rw017008 for ; Sun, 18 Aug 2002 11:21:02 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7IIL2cA017007 for netdev-outgoing; Sun, 18 Aug 2002 11:21:02 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from mail.sirinet.net (mail.sirinet.net [198.203.196.92]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7IIKkRw016996 for ; Sun, 18 Aug 2002 11:20:46 -0700 Received: from [10.0.0.198] (ppp22.sirinet.net [207.3.80.22]) by mail.sirinet.net (8.12.4/8.12.2) with ESMTP id g7IINHho023465; Sun, 18 Aug 2002 13:23:28 -0500 Subject: UDP Wide Broadcast patch From: John Tobin To: netdev@oss.sgi.com Cc: ttimo@idsoftware.com Content-Type: multipart/mixed; boundary="=-ZOumsiTdBhg3WQaoNHwE" X-Mailer: Ximian Evolution 1.0.7 Date: 18 Aug 2002 13:22:41 -0500 Message-Id: <1029694978.330.6.camel@ogre> Mime-Version: 1.0 X-Spam-Status: No, hits=-5.0 required=5.0 tests=UNIFIED_PATCH version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk --=-ZOumsiTdBhg3WQaoNHwE Content-Type: text/plain Content-Transfer-Encoding: 7bit Hi, DocWilco and I came up with this patch for the Linux kernel at Quakecon last night, based on his BSD patch. It works around the programming oversight in Quake 3 that causes packets from the dedicated server to not broadcast when the +set net_ip option is used on a LAN. It isn't an issue with internet servers, since they send heartbeats to the id master server. The behavior is off by default and can be turned on using a sysctl.
John Tobin --=-ZOumsiTdBhg3WQaoNHwE Content-Disposition: attachment; filename=udp_wide_broadcast.patch Content-Transfer-Encoding: 7bit Content-Type: text/plain; name=udp_wide_broadcast.patch; charset=ANSI_X3.4-1968 --- linux/include/linux/sysctl.h.orig 2002-08-17 19:52:27.000000000 -0500 +++ linux/include/linux/sysctl.h 2002-08-17 19:53:00.000000000 -0500 @@ -291,7 +291,8 @@ NET_IPV4_NONLOCAL_BIND=88, NET_IPV4_ICMP_RATELIMIT=89, NET_IPV4_ICMP_RATEMASK=90, - NET_TCP_TW_REUSE=91 + NET_TCP_TW_REUSE=91, + NET_UDP_WIDE_BROADCAST=92 }; enum { --- linux-modified/net/ipv4/sysctl_net_ipv4.c.orig 2002-08-17 19:48:19.000000000 -0500 +++ linux-modified/net/ipv4/sysctl_net_ipv4.c 2002-08-17 19:52:03.000000000 -0500 @@ -45,6 +45,9 @@ extern int inet_peer_gc_mintime; extern int inet_peer_gc_maxtime; +/* From udp.c */ +extern int sysctl_udp_wide_broadcast; + #ifdef CONFIG_SYSCTL static int tcp_retr1_max = 255; static int ip_local_port_range_min[] = { 1, 1 }; @@ -221,6 +224,8 @@ &sysctl_icmp_ratemask, sizeof(int), 0644, NULL, &proc_dointvec}, {NET_TCP_TW_REUSE, "tcp_tw_reuse", &sysctl_tcp_tw_reuse, sizeof(int), 0644, NULL, &proc_dointvec}, + {NET_UDP_WIDE_BROADCAST, "udp_wide_broadcast", + &sysctl_udp_wide_broadcast, sizeof(int), 0644, NULL, &proc_dointvec}, {0} }; --- linux-modified/net/ipv4/udp.c.orig 2002-08-17 19:40:59.000000000 -0500 +++ linux-modified/net/ipv4/udp.c 2002-08-17 23:37:47.000000000 -0500 @@ -94,6 +94,8 @@ #include #include +int sysctl_udp_wide_broadcast = 0; + /* * Snmp MIB for the UDP layer */ @@ -272,9 +274,10 @@ if ((s->num != hnum) || (s->daddr && s->daddr!=rmt_addr) || (s->dport != rmt_port && s->dport != 0) || - (s->rcv_saddr && s->rcv_saddr != loc_addr) || - (s->bound_dev_if && s->bound_dev_if != dif)) + !(sysctl_udp_wide_broadcast || !(s->rcv_saddr && s->rcv_saddr != loc_addr)) || + (s->bound_dev_if && s->bound_dev_if != dif)) { continue; + } break; } return s; --=-ZOumsiTdBhg3WQaoNHwE
Content-Disposition: attachment; filename=readme.txt Content-Transfer-Encoding: 7bit Content-Type: text/plain; name=readme.txt; charset=ANSI_X3.4-1968 UDP Wide Broadcast Patch for Kernel 2.4.19 Main purpose is to allow Quake 3 servers, and all games powered by the engine, to be visible in the server browser when the servers are being run on aliased IP addresses and multiple NICs using the +set net_ip option. To apply the patch run "patch -p1 < udp_wide_broadcast.patch" from your source directory and recompile. Add "echo "1" > /proc/sys/net/ipv4/udp_wide_broadcast" to one of your startup scripts, i.e. /etc/rc.d/rc.local, and run thousands of servers from your computer. Patch by: Rogier Mulhujzen John Tobin A patch with the same functionality for FreeBSD is available from http://www.bsdchicks.com/patches --=-ZOumsiTdBhg3WQaoNHwE-- From owner-netdev@oss.sgi.com Sun Aug 18 17:01:14 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7J01ERw022271 for ; Sun, 18 Aug 2002 17:01:14 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7J01Ell022270 for netdev-outgoing; Sun, 18 Aug 2002 17:01:14 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from smp.paktronix.com ([65.103.169.59]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7J011Rw022261 for ; Sun, 18 Aug 2002 17:01:01 -0700 Received: from netmonster.pakint.net (netmonster [192.168.3.13]) by smp.paktronix.com (8.9.3/8.9.3) with ESMTP id TAA11380; Sun, 18 Aug 2002 19:10:19 -0500 Date: Sun, 18 Aug 2002 19:16:28 -0500 (CDT) From: "Matthew G. Marsh" X-X-Sender: mgm@netmonster.pakint.net To: Kevin Dwyer cc: netdev@oss.sgi.com Subject: Re: Resend: SIOCGIFBRDADDR?
In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.5 required=5.0 tests=IN_REP_TO,SUBJ_ENDS_IN_Q_MARK version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk On Sun, 18 Aug 2002, Kevin Dwyer wrote: > On Sun, 18 Aug 2002, Matthew G. Marsh grunted something like: > > Why do you use the 'scope link' statements? > > Why do I use the 'scope link' argument, or the statement as a whole? > Those two statements give the same effect as aliases (more or less) > without having to ifconfig eth0:X for each X. My understanding was the > 'scope link' part was useful in keeping the effects of the command local > to the device. However, I also did not generate these commands on my own: > fwbuilder is doing that. We're trying to get a setup that can be easily > managed by other people who may not have the same experience with > command-line tools and such. Ah OK. I was wondering because if you leave off the 'scope link' then you get inheritance of the broadcast for secondary ip addrs. > > And as far as your original question try the following: > > > > ip -f inet addr add 10.5.5.8/24 dev eth0 brd + > > ip -f inet addr add 10.5.5.1 dev eth0 scope link brd 10.5.5.255 > > ip -f inet addr add 10.5.5.2 dev eth0 scope link brd 10.5.5.255 > > Aha! At the suggestion of the iproute2 documentation, and another person > on the linux-ha list, I tried using 'brd +' on the aliased IPs, but > apparently we were reading the docs wrong. Setting it explicitly works! Yep - that is the 'scope link' override that basically tells the system to ignore any inheritance and treat the address as a special local case. Note that the order in which the addresses are applied is critical. For example if you do: ip addr add 10.5.5.1/24 dev eth0 scope link brd + Then you get the appropriate broadcast address for the /24 netmask. 
BUT then you cannot add in a global address in the same network as in: ip addr add 10.5.5.8/24 dev eth0 scope global brd 10.5.5.255 RTNETLINK answers: Invalid argument You have to change the netmask at this point (think scope~=mask). Essentially you cannot have a primary and a secondary address on a single device with different scopes. Think of IPv6 scope and it becomes a little clearer where the addressing conflicts. Your best bet when in doubt is to always specify the actual broadcast address you require. The 'brd +' is merely a shortcut to specify the broadcast address associated with the specified CIDR mask. In fact you can have way interesting amounts of fun by specifying alternate broadcasts from CIDR masks as in: ip addr add 10.5.5.1/24 dev eth0 brd 10.5.255.255 Using this on a "Class B" network allows you to "be seen" by other devices but only speak to/from devices within your CIDR scope (hint: look at the output of 'ip ro li tab local') as well as really fsck routers and ARP tables for the network... ;-} [snip] > > And if I am way off base here due to jumping in just ignore me. > > Not at all, thanks for the suggestion. It seems now that I need to try to > encourage the fwbuilder folks to specify the broadcast address on each > alias. At the very least, I should be able to whip out a patch to do > it, and hopefully they'll accept it. > > /* kevin@pheared.net http://devel.pheared.net/ */ > /* Network Security Engineer http://pheared.net/~kevin */ > /* Sabotage will set us free. Throw a rock in the machine. */ -------------------------------------------------- Matthew G.
Marsh, President Paktronix Systems LLC 1506 North 59th Street Omaha NE 68104 Phone: (402) 932-7250 x101 Email: mgm@paktronix.com WWW: http://www.paktronix.com -------------------------------------------------- From owner-netdev@oss.sgi.com Mon Aug 19 01:18:42 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7J8IfRw002607 for ; Mon, 19 Aug 2002 01:18:42 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7J8If9I002606 for netdev-outgoing; Mon, 19 Aug 2002 01:18:41 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from d12lmsgate-2.de.ibm.com (d12lmsgate-2.de.ibm.com [195.212.91.200]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7J8IGRw002595 for ; Mon, 19 Aug 2002 01:18:17 -0700 Received: from d12relay01.de.ibm.com (d12relay01.de.ibm.com [9.165.215.22]) by d12lmsgate-2.de.ibm.com (8.12.3/8.12.3) with ESMTP id g7J8KM5x028164; Mon, 19 Aug 2002 10:20:22 +0200 Received: from d12ml033.de.ibm.com (d12ml033_cs0 [9.165.223.11]) by d12relay01.de.ibm.com (8.12.3/NCO/VER6.3) with ESMTP id g7J8KLMD115208; Mon, 19 Aug 2002 10:20:21 +0200 Subject: [PATCH] new net device feature: shared-ipv6-cards To: Pekka Savola , netdev@oss.sgi.com Cc: "David S. Miller" , Alexey Kuznetsov , Andi Kleen X-Mailer: Lotus Notes Release 5.0.4a July 24, 2000 Message-ID: From: "Andreas Herrmann" Date: Mon, 19 Aug 2002 10:20:19 +0200 X-MIMETrack: Serialize by Router on D12ML033/12/M/IBM(Release 5.0.9a |January 7, 2002) at 19/08/2002 10:20:21 MIME-Version: 1.0 Content-type: text/plain; charset=us-ascii X-Spam-Status: No, hits=-5.0 required=5.0 tests=UNIFIED_PATCH version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk (Obviously, I've forgotten to address the networking kernel maintainers in my previous mail. So here we go again ...;-) Hi, I ask for integration of the attached patch into the stock kernel. 
The patch is against kernel version 2.4.19. This patch allows replacing the 0xFFFE part of an EUI-64 based interface identifier with another 16-bit value. The new net device feature is needed to avoid duplicate address conflicts on Linux for zSeries when shared OSA cards are used with IPv6. The following changes are performed: o Documentation/Configure.help net/Config.in Configure option and help text for the new feature. o arch/s390/defconfig arch/s390x/defconfig Set option on by default for Linux for zSeries. o include/linux/netdevice.h Changed struct net_device to store an id for the card user instance. Introduced macro for the new net device feature. o net/ipv6/addrconf.c Replace 0xFFFE by a card user instance id in function ipv6_generate_eui64(). o net/8021q/vlan.c Take new feature into account for VLAN devices in function register_802_1Q_vlan_device(). On Linux for zSeries, OSA network cards can be shared among several Linux instances. The OSA card has only one MAC address. With IPv6 and a vanilla kernel, this leads to duplicate address conflicts if more than one Linux instance uses the same card. But the device driver for the card can deliver a unique 16-bit identifier for each Linux instance sharing the same card. This identifier is used in place of the 0xFFFE part of the interface identifier. The "u" bit of the interface identifier is not inverted when the new feature is used. Hence the resulting interface identifier has local scope according to RFC2373. Consequently this change of the autoconfiguration does not violate any RFCs.
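As a rough user-space illustration of the address generation described above (this is not the kernel code; the function name make_interface_id and its signature are made up for this sketch, and only the byte layout follows the description in this mail), the two cases could look like this:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build an 8-byte IPv6 interface identifier from a 6-byte MAC address.
 *
 * shared == 0: usual EUI-64 derivation, i.e. insert FF FE in the middle
 *              and invert the "u" (universal/local) bit.
 * shared != 0: behaviour described in the mail: insert the 16-bit
 *              per-user id instead of FF FE and leave the "u" bit alone.
 *
 * Hypothetical helper for illustration only.
 */
void make_interface_id(uint8_t eui[8], const uint8_t mac[6],
                       int shared, uint16_t dev_id)
{
    assert(eui != NULL && mac != NULL);

    memcpy(eui, mac, 3);          /* first three MAC bytes */
    memcpy(eui + 5, mac + 3, 3);  /* last three MAC bytes  */

    if (shared) {
        eui[3] = (uint8_t)(dev_id >> 8);    /* high byte of the id */
        eui[4] = (uint8_t)(dev_id & 0xff);  /* low byte of the id  */
    } else {
        eui[3] = 0xFF;
        eui[4] = 0xFE;
        eui[0] ^= 0x02;           /* invert the "u" bit */
    }
}
```

Two users of the same card (same MAC) that are given different ids end up with different interface identifiers, which is the point of the feature.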
Regards, Andreas -- Linux for eServer Development Tel : +49-7031-16-4640 Notes mail : Andreas Herrmann/GERMANY/IBM@IBMDE email : aherrman@de.ibm.com diff -Naur linux-2.4.19.old/Documentation/Configure.help linux-2.4.19/Documentation/Configure.help --- linux-2.4.19.old/Documentation/Configure.help Mon Aug 5 10:54:29 2002 +++ linux-2.4.19/Documentation/Configure.help Mon Aug 5 10:56:50 2002 @@ -5404,6 +5404,16 @@ It is safe to say N here for now. +Prepare net_device struct for shared IPv6 cards +CONFIG_SHARED_IPV6_CARDS + This prepares the net_device structure to contain a card user instance + id. On some systems, e.g. IBM zSeries, networking cards can be shared. + In order to make IPv6 autoconfiguration useful, each user of the + networking card will get a different id which is used for unique + address generation (the id is used in the EUI-64 generation). + + Only say yes on IBM zSeries or S/390 systems. + # 2.5 tree only IPv6: routing messages via old netlink CONFIG_IPV6_NETLINK diff -Naur linux-2.4.19.old/arch/s390/defconfig linux-2.4.19/arch/s390/defconfig --- linux-2.4.19.old/arch/s390/defconfig Mon Aug 5 10:54:28 2002 +++ linux-2.4.19/arch/s390/defconfig Mon Aug 5 12:22:33 2002 @@ -146,6 +146,7 @@ # CONFIG_INET_ECN is not set # CONFIG_SYN_COOKIES is not set CONFIG_IPV6=m +CONFIG_SHARED_IPV6_CARDS=y # CONFIG_KHTTPD is not set # CONFIG_ATM is not set # CONFIG_VLAN_8021Q is not set diff -Naur linux-2.4.19.old/arch/s390x/defconfig linux-2.4.19/arch/s390x/defconfig --- linux-2.4.19.old/arch/s390x/defconfig Mon Aug 5 10:54:29 2002 +++ linux-2.4.19/arch/s390x/defconfig Mon Aug 5 12:23:29 2002 @@ -146,6 +146,7 @@ # CONFIG_INET_ECN is not set # CONFIG_SYN_COOKIES is not set CONFIG_IPV6=m +CONFIG_SHARED_IPV6_CARDS=y # CONFIG_KHTTPD is not set # CONFIG_ATM is not set # CONFIG_VLAN_8021Q is not set diff -Naur linux-2.4.19.old/include/linux/netdevice.h linux-2.4.19/include/linux/netdevice.h --- linux-2.4.19.old/include/linux/netdevice.h Mon Aug 5 10:54:19 2002 +++ 
linux-2.4.19/include/linux/netdevice.h Mon Aug 5 12:18:41 2002 @@ -362,6 +362,9 @@ #define NETIF_F_HW_VLAN_RX 256 /* Receive VLAN hw acceleration */ #define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */ #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */ +#define NETIF_F_SHARED_IPV6 2048 /* make IPv6 address autogeneration + * network card instance aware + */ /* Called after device is detached from network. */ void (*uninit)(struct net_device *dev); @@ -431,6 +434,9 @@ /* this will get initialized at each interface type init routine */ struct divert_blk *divert; #endif /* CONFIG_NET_DIVERT */ +#ifdef CONFIG_SHARED_IPV6_CARDS + unsigned short dev_id; +#endif /* CONFIG_SHARED_IPV6_CARDS */ }; diff -Naur linux-2.4.19.old/net/8021q/vlan.c linux-2.4.19/net/8021q/vlan.c --- linux-2.4.19.old/net/8021q/vlan.c Mon Aug 5 10:54:19 2002 +++ linux-2.4.19/net/8021q/vlan.c Mon Aug 5 10:56:50 2002 @@ -437,6 +437,10 @@ /* IFF_BROADCAST|IFF_MULTICAST; ??? */ new_dev->flags = real_dev->flags; new_dev->flags &= ~IFF_UP; +#ifdef CONFIG_SHARED_IPV6_CARDS + new_dev->features |= (real_dev->features & NETIF_F_SHARED_IPV6); + new_dev->dev_id = real_dev->dev_id; +#endif /* Make this thing known as a VLAN device */ new_dev->priv_flags |= IFF_802_1Q_VLAN; diff -Naur linux-2.4.19.old/net/Config.in linux-2.4.19/net/Config.in --- linux-2.4.19.old/net/Config.in Mon Aug 5 10:54:19 2002 +++ linux-2.4.19/net/Config.in Mon Aug 5 10:56:50 2002 @@ -25,6 +25,7 @@ if [ "$CONFIG_IPV6" != "n" ]; then source net/ipv6/Config.in fi + bool ' Prepare net_device struct for shared IPv6 cards' CONFIG_SHARED_IPV6_CARDS fi if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then source net/khttpd/Config.in diff -Naur linux-2.4.19.old/net/ipv6/addrconf.c linux-2.4.19/net/ipv6/addrconf.c --- linux-2.4.19.old/net/ipv6/addrconf.c Mon Aug 5 10:54:19 2002 +++ linux-2.4.19/net/ipv6/addrconf.c Mon Aug 5 10:56:50 2002 @@ -690,9 +690,20 @@ return -1; memcpy(eui, dev->dev_addr, 3); memcpy(eui + 5, 
dev->dev_addr+3, 3); +#ifdef CONFIG_SHARED_IPV6_CARDS + if (dev->features&NETIF_F_SHARED_IPV6) { + eui[3] = (dev->dev_id>>8)&0xff; + eui[4] = dev->dev_id&0xff; + } else { + eui[3] = 0xFF; + eui[4] = 0xFE; + eui[0] ^= 2; + } +#else /* CONFIG_SHARED_IPV6_CARDS */ eui[3] = 0xFF; eui[4] = 0xFE; eui[0] ^= 2; +#endif /* CONFIG_SHARED_IPV6_CARDS */ return 0; } return -1; From owner-netdev@oss.sgi.com Mon Aug 19 01:23:19 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7J8NJRw002731 for ; Mon, 19 Aug 2002 01:23:19 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7J8NIZo002730 for netdev-outgoing; Mon, 19 Aug 2002 01:23:18 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7J8NFRw002721 for ; Mon, 19 Aug 2002 01:23:15 -0700 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id BAA09986; Mon, 19 Aug 2002 01:11:32 -0700 Date: Mon, 19 Aug 2002 01:11:31 -0700 (PDT) Message-Id: <20020819.011131.82964805.davem@redhat.com> To: AHERRMAN@de.ibm.com Cc: pekkas@netcore.fi, netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru, ak@suse.de Subject: Re: [PATCH] new net device feature: shared-ipv6-cards From: "David S. Miller" In-Reply-To: References: X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk From: "Andreas Herrmann" Date: Mon, 19 Aug 2002 10:20:19 +0200 The following changes are performed: o Documentation/Configure.help net/Config.in Configure option and help text for the new feature. No config option please. 
If device driver indicates it wants this behavior with a flag bit, no need to duplicate this logic. It is likely that it will not be just z-series machines that will find this useful. From owner-netdev@oss.sgi.com Mon Aug 19 02:55:18 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7J9tIRw003860 for ; Mon, 19 Aug 2002 02:55:18 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7J9tInA003859 for netdev-outgoing; Mon, 19 Aug 2002 02:55:18 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from d12lmsgate.de.ibm.com (d12lmsgate.de.ibm.com [195.212.91.199]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7J9sYS0003841 for ; Mon, 19 Aug 2002 02:55:12 -0700 Received: from d12relay02.de.ibm.com (d12relay02.de.ibm.com [9.165.215.23]) by d12lmsgate.de.ibm.com (8.12.3/8.12.3) with ESMTP id g7J9vEcU031908; Mon, 19 Aug 2002 11:57:14 +0200 Received: from d12ml033.de.ibm.com (d12ml033_cs0 [9.165.223.11]) by d12relay02.de.ibm.com (8.12.3/NCO/VER6.3) with ESMTP id g7J9vDFh115318; Mon, 19 Aug 2002 11:57:14 +0200 Subject: Re: [PATCH] new net device feature: shared-ipv6-cards To: "David S. Miller" Cc: ak@suse.de, kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, owner-netdev@oss.sgi.com, pekkas@netcore.fi X-Mailer: Lotus Notes Release 5.0.4a July 24, 2000 Message-ID: From: "Andreas Herrmann" Date: Mon, 19 Aug 2002 11:57:11 +0200 X-MIMETrack: Serialize by Router on D12ML033/12/M/IBM(Release 5.0.9a |January 7, 2002) at 19/08/2002 11:57:13 MIME-Version: 1.0 Content-type: text/plain; charset=us-ascii X-Spam-Status: No, hits=0.0 required=5.0 tests= version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk "David S. 
Miller" 08/19/02 10:11 AM > From: "Andreas Herrmann" > Date: Mon, 19 Aug 2002 10:20:19 +0200 > The following changes are performed: > o Documentation/Configure.help > net/Config.in > Configure option and help text for the new feature. > No config option please. If device driver indicates it wants this > behavior with a flag bit, no need to duplicate this logic. > It is likely that it will not be just z-series machines that > will find this useful. Ok, that's quite sensible. Then I'd like to add some documentation either to Documentation/networking/netdevices.txt or a new file (say Documentation/networking/ipv6-addrconf.txt) to describe the new feature. Although netdevices.txt seems to be rather incomplete, I think the description of this netdevice feature belongs herein. Any other thoughts, comments regarding the patch? Andreas -- Linux for eServer Development Tel : +49-7031-16-4640 Notes mail : Andreas Herrmann/GERMANY/IBM@IBMDE email : aherrman@de.ibm.com From owner-netdev@oss.sgi.com Mon Aug 19 21:40:28 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7K4eSEC009218 for ; Mon, 19 Aug 2002 21:40:28 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7K4eStI009217 for netdev-outgoing; Mon, 19 Aug 2002 21:40:28 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from kegel.com (c-24-126-73-164.we.client2.attbi.com [24.126.73.164]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7K4dTEC009125 for ; Mon, 19 Aug 2002 21:39:29 -0700 Received: (from dank@localhost) by kegel.com (8.11.6/8.11.6) id g7K4kso02198; Mon, 19 Aug 2002 21:46:54 -0700 Date: Mon, 19 Aug 2002 21:46:54 -0700 From: dank@kegel.com Message-Id: <200208200446.g7K4kso02198@kegel.com> To: netdev@oss.sgi.com Subject: [PATCH] khttpd crash fix Cc: dank@alumni.caltech.edu X-Spam-Status: No, hits=-4.4 required=5.0 tests=NO_REAL_NAME,UNIFIED_PATCH version=2.20 
X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk I've been using this patch for some time now; it fixes a nasty oops in khttpd on smp, and adds crucial details to the doc. Sure, khttpd is going away in 2.5, but 2.4 is stable, let's fix it there. I sent it multiple times to the author of khttpd and to the khttpd-users mailing list, and even to linux-kernel; I received good comments from khttpd-users but silence elsewhere. In desperation, I finally read the MAINTAINERS file, and learned that there is no official maintainer for khttpd, so here I am mailing it to the 'mail patches to' address for General Networking Stuff. Any chance this could be blessed for transmission to Marcello for 2.4.20? Thanks, Dan Kegel diff -x '*times*' -x '.*' -x '*.o' -Naur linux-2.4.17-orig/net/khttpd/README linux/net/khttpd/README --- linux-2.4.17-orig/net/khttpd/README Thu Nov 16 14:07:53 2000 +++ linux/net/khttpd/README Mon May 27 22:43:43 2002 @@ -14,14 +14,15 @@ other webservers in that it runs from within the Linux-kernel as a module (device-driver). - kHTTPd handles only static (file based) web-pages, and passes all requests - for non-static information to a regular userspace-webserver such as Apache or - Zeus. The userspace-daemon doesn't have to be altered in any way. + kHTTPd handles only static (file based) web-pages, and passes all requests + for non-static information to a regular userspace-webserver such as Apache + or Zeus. The userspace-daemon doesn't have to be altered in any way. Static web-pages are not a very complex thing to serve, but these are very important nevertheless, since virtually all images are static, and a large portion of the html-pages are static also. A "regular" webserver has little - added value for static pages, it is simply a "copy file to network"-operation. + added value for static pages, it is simply a "copy file to network" + operation. 
This can be done very efficiently from within the Linux-kernel, for example the nfs (network file system) daemon performs a similar task and also runs in the kernel. @@ -44,6 +45,7 @@ echo 1 > /proc/sys/net/khttpd/stop echo 1 > /proc/sys/net/khttpd/unload + sleep 2 rmmod khttpd @@ -71,7 +73,7 @@ Before you can start using kHTTPd, you have to configure it. This is done through the /proc filesystem, and can thus be done from inside - a script. Most parameters can only be set when kHTTPd is not active. + a script. Most parameters can only be set when kHTTPd is stopped. The following things need configuration: @@ -117,26 +119,31 @@ Port 8080 + Starting kHTTPd + =============== + Once you have set up the configuration, start kHTTPD by running + echo 1 > /proc/sys/net/khttpd/start + It may take a jiffie or two to start. - Stopping kHTTPd =============== - In order to change the configuration, you should stop kHTTPd by typing + To stop kHTTPd, do echo 1 > /proc/sys/net/khttpd/stop - on a command-prompt. + It should stop in a jiffy or two. - If you want to unload the module, you should type + Unloading kHTTPd + =============== + To unload the module, do + echo 1 > /proc/sys/net/khttpd/stop echo 1 > /proc/sys/net/khttpd/unload - after stopping kHTTPd first. + #killall -HUP khttpd + sleep 2 + rmmod khttpd - If this doesn't work fast enough for you (the commands above can wait for + If this doesn't work fast enough for you (unloading can wait for a remote connection to close down), you can send the daemons a "HUP" signal after you told them to stop. This will cause the daemon-threads to stop immediately. - - Note that the daemons will restart immediately if they are not told to - stop. - 4. Permissions @@ -212,7 +219,21 @@ maxconnect 1000 Maximum number of concurrent connections -6. More information +6. Known Issues + kHTTPd is *not* currently compatible with tmpfs. Trying to serve + files stored on a tmpfs partition is known to cause kernel oopses + as of 2.4.18. 
This is due to the same problem that prevents sendfile() + from being usable with tmpfs. A tmpfs patch is floating around that seems + to fix this, but has not been released as of 27 May 2002. + kHTTPD does work fine with ramfs, though. + + There is debate about whether to remove kHTTPd from the main + kernel sources. This will probably happen in the 2.5 kernel series, + after which khttpd will still be available as a patch. + + The kHTTPd source code could use a good spring cleaning. + +7. More information ------------------- More information about the architecture of kHTTPd, the mailinglist and configuration-examples can be found at the kHTTPd homepage: @@ -221,4 +242,6 @@ Bugreports, patches, etc can be send to the mailinglist (khttpd-users@zgp.org) or to khttpd@fenrus.demon.nl + Mailing list archives are at + http://lists.alt.org/mailman/listinfo/khttpd-users diff -x '*times*' -x '.*' -x '*.o' -Naur linux-2.4.17-orig/net/khttpd/main.c linux/net/khttpd/main.c --- linux-2.4.17-orig/net/khttpd/main.c Sun Mar 25 18:14:25 2001 +++ linux/net/khttpd/main.c Mon May 27 23:17:08 2002 @@ -95,11 +95,17 @@ { int CPUNR; sigset_t tmpsig; + int old_stop_count; DECLARE_WAITQUEUE(main_wait,current); MOD_INC_USE_COUNT; + /* Remember value of stop count. If it changes, user must have + * asked us to stop. Sensing this is much less racy than + * directly sensing sysctl_khttpd_stop. 
- dank + */ + old_stop_count = atomic_read(&khttpd_stopCount); CPUNR=0; if (cpu_pointer!=NULL) @@ -125,11 +131,9 @@ atomic_inc(&DaemonCount); atomic_set(&Running[CPUNR],1); - while (sysctl_khttpd_stop==0) + while (old_stop_count == atomic_read(&khttpd_stopCount)) { int changes = 0; - - changes +=AcceptConnections(CPUNR,MainSocket); if (ConnectionsPending(CPUNR)) @@ -194,11 +198,9 @@ DECLARE_WAIT_QUEUE_HEAD(WQ); - sprintf(current->comm,"khttpd manager"); daemonize(); - /* Block all signals except SIGKILL and SIGSTOP */ spin_lock_irq(&current->sigmask_lock); tmpsig = current->blocked; @@ -206,42 +208,35 @@ recalc_sigpending(current); spin_unlock_irq(&current->sigmask_lock); - /* main loop */ while (sysctl_khttpd_unload==0) { int I; - + int old_stop_count; /* First : wait for activation */ - - sysctl_khttpd_start = 0; - while ( (sysctl_khttpd_start==0) && (!signal_pending(current)) && (sysctl_khttpd_unload==0) ) { current->state = TASK_INTERRUPTIBLE; interruptible_sleep_on_timeout(&WQ,HZ); } - if ( (signal_pending(current)) || (sysctl_khttpd_unload!=0) ) break; + sysctl_khttpd_stop = 0; /* Then start listening and spawn the daemons */ - if (StartListening(sysctl_khttpd_serverport)==0) { + sysctl_khttpd_start = 0; continue; } - + ActualThreads = sysctl_khttpd_threads; if (ActualThreads<1) ActualThreads = 1; - if (ActualThreads>CONFIG_KHTTPD_NUMCPU) ActualThreads = CONFIG_KHTTPD_NUMCPU; - /* Write back the actual value */ - sysctl_khttpd_threads = ActualThreads; InitUserspace(ActualThreads); @@ -249,87 +244,63 @@ if (InitDataSending(ActualThreads)!=0) { StopListening(); + sysctl_khttpd_start = 0; continue; } if (InitWaitHeaders(ActualThreads)!=0) { - I=0; - while (I0) - interruptible_sleep_on_timeout(&WQ,HZ); - StopListening(); - } - - - + /* Wait for the daemons to stop, one second per iteration */ + while (atomic_read(&DaemonCount)>0) + interruptible_sleep_on_timeout(&WQ,HZ); + StopListening(); + sysctl_khttpd_start = 0; + /* reap the zombie-daemons */ + do + waitpid_result =
waitpid(-1,NULL,__WCLONE|WNOHANG); + while (waitpid_result>0); } - + sysctl_khttpd_start = 0; sysctl_khttpd_stop = 1; + atomic_inc(&khttpd_stopCount); /* Wait for the daemons to stop, one second per iteration */ while (atomic_read(&DaemonCount)>0) interruptible_sleep_on_timeout(&WQ,HZ); - - - waitpid_result = 1; + StopListening(); /* reap the zombie-daemons */ - while (waitpid_result>0) + do waitpid_result = waitpid(-1,NULL,__WCLONE|WNOHANG); - - StopListening(); - + while (waitpid_result>0); (void)printk(KERN_NOTICE "kHTTPd: Management daemon stopped. \n You can unload the module now.\n"); @@ -344,16 +315,13 @@ MOD_INC_USE_COUNT; - I=0; - while (INext; } - LeaveFunction("WaitHeaders"); + LeaveFunction("WaitForHeaders"); return count; } @@ -178,6 +178,12 @@ EnterFunction("DecodeHeader"); + if (Buffer[CPUNR] == NULL) { + /* see comments in main.c regarding buffer managemnet - dank */ + printk(KERN_CRIT "khttpd: lost my buffer"); + BUG(); + } + /* First, read the data */ msg.msg_name = 0; --- linux-2.4.17-orig/Documentation/networking/khttpd.txt Wed Dec 31 16:00:00 1969 +++ linux/Documentation/networking/khttpd.txt Mon May 27 23:19:03 2002 @@ -0,0 +1 @@ +See net/khttpd/README for documentation on khttpd, the kernel http server. 
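The stop-counter idea from the patch comment in main.c (remember the counter value at start, keep running only while it is unchanged) can be sketched in user space roughly as follows. This is only an illustration using C11 atomics, not kernel code; stop_count, request_stop, worker_begin and worker_should_run are hypothetical names standing in for khttpd_stopCount and the daemon loop condition.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for khttpd_stopCount; starts at zero. */
static atomic_int stop_count;

/* Manager side: every stop request bumps the counter and never resets it. */
void request_stop(void)
{
    atomic_fetch_add(&stop_count, 1);
}

/* Worker side: take a snapshot of the counter when the worker starts. */
int worker_begin(void)
{
    return atomic_load(&stop_count);
}

/*
 * Worker loop condition: keep running only while the counter still
 * equals the snapshot. Unlike a 0/1 stop flag, a later restart cannot
 * "un-stop" a worker that has not yet noticed the stop request,
 * because the counter only ever moves forward.
 */
int worker_should_run(int snapshot)
{
    return atomic_load(&stop_count) == snapshot;
}
```

With a plain flag, the manager clearing the flag for a restart before an old daemon has looked at it would leave that daemon running; with the counter, the old snapshot stays stale whatever happens later.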
From owner-netdev@oss.sgi.com Tue Aug 20 05:51:34 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7KCpYEC026154 for ; Tue, 20 Aug 2002 05:51:34 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7KCpYNs026153 for netdev-outgoing; Tue, 20 Aug 2002 05:51:34 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from hydrogen.rooted.net (hydrogen.chelford.enformatica.com [193.133.49.25]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7KCpSEC026143 for ; Tue, 20 Aug 2002 05:51:29 -0700 Received: from [193.133.49.229] (helo=banana) by hydrogen.rooted.net with esmtp (Exim 3.14 #2) id 17h8XP-0007eq-00; Tue, 20 Aug 2002 13:55:11 +0100 Message-Id: <4.2.0.58.20020820135148.02f70ed8@193.133.49.25> X-Sender: tompop@193.133.49.25 X-Mailer: QUALCOMM Windows Eudora Pro Version 4.2.0.58 X-Priority: 2 (High) Date: Tue, 20 Aug 2002 13:54:39 +0100 To: netdev@oss.sgi.com From: Tom Parker Subject: igmp kernel issues. Cc: tom@rooted.net Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; format=flowed X-Spam-Status: No, hits=0.0 required=5.0 tests= version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk I've been given this address by Alan Cox in regard to some Linux kernel networking issues. Here is the original question I put to Alan: --- I'm currently rewriting large parts of mrouted for a custom application. One of the functions I'm implementing is a "promiscuous" mode, whereby mrouted will service IGMP requests from any subnet, even if there is no explicit route for the subnet in the kernel routing tables. Currently the kernel doesn't seem to pass IGMP reports to user space unless there is a route for the subnet of the host which the report came from. I've searched through ipmr.c and igmp.c to find where it's dropping the IGMP report but haven't found anything obvious. Do you know where it does this?
--- Any additional information on this is greatly appreciated. Thanks in advance Tom Parker From owner-netdev@oss.sgi.com Wed Aug 21 08:26:54 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7LFQsEC000601 for ; Wed, 21 Aug 2002 08:26:54 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7LFQs96000600 for netdev-outgoing; Wed, 21 Aug 2002 08:26:54 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from d12lmsgate-3.de.ibm.com (d12lmsgate-3.de.ibm.com [195.212.91.201]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7LFQ5EE000586 for ; Wed, 21 Aug 2002 08:26:39 -0700 Received: from d12relay02.de.ibm.com (d12relay02.de.ibm.com [9.165.215.23]) by d12lmsgate-3.de.ibm.com (8.12.3/8.12.3) with ESMTP id g7LFSsA9021548; Wed, 21 Aug 2002 17:28:54 +0200 Received: from d12ml033.de.ibm.com (d12ml033_cs0 [9.165.223.11]) by d12relay02.de.ibm.com (8.12.3/NCO/VER6.3) with ESMTP id g7LFSrJB085938; Wed, 21 Aug 2002 17:28:53 +0200 Subject: Re: [PATCH] new net device feature: shared-ipv6-cards To: "David S. Miller" Cc: ak@suse.de, kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, owner-netdev@oss.sgi.com, pekkas@netcore.fi X-Mailer: Lotus Notes Release 5.0.4a July 24, 2000 Message-ID: From: "Andreas Herrmann" Date: Wed, 21 Aug 2002 17:28:52 +0200 X-MIMETrack: Serialize by Router on D12ML033/12/M/IBM(Release 5.0.9a |January 7, 2002) at 21/08/2002 17:28:53 MIME-Version: 1.0 Content-type: text/plain; charset=us-ascii X-Spam-Status: No, hits=-5.0 required=5.0 tests=UNIFIED_PATCH version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hi, I removed the config option, added documentation to Documentation/networking/netdevices.txt and created a new patch for the new net device feature. 

Regards,

Andreas

--
diff -Naur linux-2.4.19.old/Documentation/networking/netdevices.txt linux-2.4.19/Documentation/networking/netdevices.txt
--- linux-2.4.19.old/Documentation/networking/netdevices.txt Thu Nov 9 03:09:49 2000
+++ linux-2.4.19/Documentation/networking/netdevices.txt Wed Aug 21 17:11:28 2002
@@ -40,3 +40,18 @@
         Sleeping: NO
+struct net_device feature NETIF_F_SHARED_IPV6
+=============================================
+
+On some systems, e.g. IBM zSeries, networking cards can be shared.
+In order to make IPv6 autoconfiguration useful, each user of the
+networking card will get a different id ("card user instance id")
+which is used for unique address generation in ipv6_generate_eui64().
+
+Using this feature, the part 0xFFFE of an EUI-64 based interface
+identifier is replaced by the id, and the "u" bit is not inverted.
+So the "u" bit will indicate local scope.
+
+The id is stored in:
+
+        dev->dev_id
diff -Naur linux-2.4.19.old/include/linux/netdevice.h linux-2.4.19/include/linux/netdevice.h
--- linux-2.4.19.old/include/linux/netdevice.h Sat Aug 3 00:39:45 2002
+++ linux-2.4.19/include/linux/netdevice.h Wed Aug 21 16:01:54 2002
@@ -362,6 +362,9 @@
 #define NETIF_F_HW_VLAN_RX 256 /* Receive VLAN hw acceleration */
 #define NETIF_F_HW_VLAN_FILTER 512 /* Receive filtering on VLAN */
 #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */
+#define NETIF_F_SHARED_IPV6 2048 /* make IPv6 address autogeneration
+                                  * network card instance aware
+                                  */
         /* Called after device is detached from network. */
         void (*uninit)(struct net_device *dev);
@@ -431,6 +434,9 @@
         /* this will get initialized at each interface type init routine */
         struct divert_blk *divert;
 #endif /* CONFIG_NET_DIVERT */
+
+        /* for feature NETIF_F_SHARED_IPV6 to store id of shared card */
+        unsigned short dev_id;
 };
diff -Naur linux-2.4.19.old/net/8021q/vlan.c linux-2.4.19/net/8021q/vlan.c
--- linux-2.4.19.old/net/8021q/vlan.c Sat Aug 3 00:39:46 2002
+++ linux-2.4.19/net/8021q/vlan.c Wed Aug 21 15:45:31 2002
@@ -437,6 +437,8 @@
         /* IFF_BROADCAST|IFF_MULTICAST; ??? */
         new_dev->flags = real_dev->flags;
         new_dev->flags &= ~IFF_UP;
+        new_dev->features |= (real_dev->features & NETIF_F_SHARED_IPV6);
+        new_dev->dev_id = real_dev->dev_id;
         /* Make this thing known as a VLAN device */
         new_dev->priv_flags |= IFF_802_1Q_VLAN;
diff -Naur linux-2.4.19.old/net/ipv6/addrconf.c linux-2.4.19/net/ipv6/addrconf.c
--- linux-2.4.19.old/net/ipv6/addrconf.c Sat Aug 3 00:39:46 2002
+++ linux-2.4.19/net/ipv6/addrconf.c Wed Aug 21 15:37:03 2002
@@ -690,9 +690,14 @@
                         return -1;
                 memcpy(eui, dev->dev_addr, 3);
                 memcpy(eui + 5, dev->dev_addr+3, 3);
-                eui[3] = 0xFF;
-                eui[4] = 0xFE;
-                eui[0] ^= 2;
+                if (dev->features&NETIF_F_SHARED_IPV6) {
+                        eui[3] = (dev->dev_id>>8)&0xff;
+                        eui[4] = dev->dev_id&0xff;
+                } else {
+                        eui[3] = 0xFF;
+                        eui[4] = 0xFE;
+                        eui[0] ^= 2;
+                }
                 return 0;
         }
         return -1;

From owner-netdev@oss.sgi.com Wed Aug 21 08:53:37 2002
Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7LFrbEC001661 for ; Wed, 21 Aug 2002 08:53:37 -0700
Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7LFrb7N001660 for netdev-outgoing; Wed, 21 Aug 2002 08:53:37 -0700
X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f
Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7LFrWEC001651 for ; Wed, 21 Aug 2002 08:53:33 -0700
Received: (from kuznet@localhost) by sex.inr.ac.ru
(8.6.13/ANK) id TAA15473; Wed, 21 Aug 2002 19:55:32 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208211555.TAA15473@sex.inr.ac.ru> Subject: Re: [PATCH] new net device feature: shared-ipv6-cards To: AHERRMAN@de.ibm.com (Andreas Herrmann) Date: Wed, 21 Aug 2002 19:55:32 +0400 (MSD) Cc: davem@redhat.com, netdev@oss.sgi.com, pekkas@netcore.fi In-Reply-To: from "Andreas Herrmann" at Aug 21, 2 05:28:52 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > I removed the config option, added documentation to > Documentation/networking/netdevices.txt > and created a new patch for the new net device feature. One thought: if you have to add a new field to struct netdevice, it makes lots of sense to move further and to add special method dev->generate_eui64() instead, which will be initialized to standard eth_generate_eui64() for normal ethernets and to something private for your device. In this way we are guaranteed not to have problems of this kind in future. 
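
A minimal sketch of the per-device method suggested above, written as a userspace C mock rather than kernel code. The type struct net_device_sketch, its generate_eui64 member, and shared_generate_eui64() are hypothetical names used only for illustration; eth_generate_eui64() and ipv6_generate_eui64() borrow the names mentioned in this thread, and the byte layout follows the addrconf.c hunk of the posted patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for struct net_device: only the fields
 * this sketch needs. The generate_eui64 member is the proposed hook and
 * is an assumption, not an existing kernel field. */
struct net_device_sketch {
    unsigned char dev_addr[6];  /* hardware (MAC) address */
    unsigned short dev_id;      /* id of this user of a shared card */
    int (*generate_eui64)(uint8_t *eui, const struct net_device_sketch *dev);
};

/* Standard Ethernet rule: put FF:FE in the middle and invert the "u" bit. */
static int eth_generate_eui64(uint8_t *eui, const struct net_device_sketch *dev)
{
    memcpy(eui, dev->dev_addr, 3);
    memcpy(eui + 5, dev->dev_addr + 3, 3);
    eui[3] = 0xFF;
    eui[4] = 0xFE;
    eui[0] ^= 2;
    return 0;
}

/* Driver-private rule for shared cards (hypothetical name): dev_id takes
 * the place of FF:FE and the "u" bit is left untouched. */
static int shared_generate_eui64(uint8_t *eui, const struct net_device_sketch *dev)
{
    memcpy(eui, dev->dev_addr, 3);
    memcpy(eui + 5, dev->dev_addr + 3, 3);
    eui[3] = (dev->dev_id >> 8) & 0xff;
    eui[4] = dev->dev_id & 0xff;
    return 0;
}

/* What the IPv6 side would do: no feature-flag test, just call the method. */
static int ipv6_generate_eui64(uint8_t *eui, const struct net_device_sketch *dev)
{
    if (!dev->generate_eui64)
        return -1;
    return dev->generate_eui64(eui, dev);
}
```

With this shape the address-configuration code never needs to know about shared cards; a driver would only point generate_eui64 at its own routine when it initializes the device.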
Alexey From owner-netdev@oss.sgi.com Wed Aug 21 09:58:37 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7LGwbEC007630 for ; Wed, 21 Aug 2002 09:58:37 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7LGwbK4007629 for netdev-outgoing; Wed, 21 Aug 2002 09:58:37 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from d12lmsgate-2.de.ibm.com (d12lmsgate-2.de.ibm.com [195.212.91.200]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7LGwNEE007617 for ; Wed, 21 Aug 2002 09:58:31 -0700 Received: from d12relay02.de.ibm.com (d12relay02.de.ibm.com [9.165.215.23]) by d12lmsgate-2.de.ibm.com (8.12.3/8.12.3) with ESMTP id g7LH0g5x031038; Wed, 21 Aug 2002 19:00:42 +0200 Received: from d12ml033.de.ibm.com (d12ml033_cs0 [9.165.223.11]) by d12relay02.de.ibm.com (8.12.3/NCO/VER6.3) with ESMTP id g7LH0fJB095968; Wed, 21 Aug 2002 19:00:41 +0200 Subject: Re: [PATCH] new net device feature: shared-ipv6-cards To: kuznet@ms2.inr.ac.ru Cc: davem@redhat.com, netdev@oss.sgi.com, owner-netdev@oss.sgi.com, pekkas@netcore.fi X-Mailer: Lotus Notes Release 5.0.4a July 24, 2000 Message-ID: From: "Andreas Herrmann" Date: Wed, 21 Aug 2002 19:00:39 +0200 X-MIMETrack: Serialize by Router on D12ML033/12/M/IBM(Release 5.0.9a |January 7, 2002) at 21/08/2002 19:00:41 MIME-Version: 1.0 Content-type: text/plain; charset=us-ascii X-Spam-Status: No, hits=0.0 required=5.0 tests= version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk kuznet@ms2.inr.ac.ru Sent by: owner-netdev@oss.sgi.com 08/21/02 05:55 PM > One thought: if you have to add a new field to struct netdevice, > it makes lots of sense to move further and to add special method > dev->generate_eui64() instead, which will be initialized > to standard eth_generate_eui64() for normal ethernets > and to something private for your device. I agree with this. 
Of course this implies changes in the device driver itself. (Currently our device driver for the shared cards relies on the changes described in the patch.) But its the way it should be done. > In this way we are guaranteed not to have problems of this kind > in future. What about implementing your alternative proposal in 2.5 and to use the current approach in a next 2.4.x kernel? Regards, Andreas -- Linux for eServer Development Tel : +49-7031-16-4640 Notes mail : Andreas Herrmann/GERMANY/IBM@IBMDE email : aherrman@de.ibm.com From owner-netdev@oss.sgi.com Wed Aug 21 10:11:50 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7LHBnRA008638 for ; Wed, 21 Aug 2002 10:11:49 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7LHBnnf008637 for netdev-outgoing; Wed, 21 Aug 2002 10:11:49 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7LHBgRA008395; Wed, 21 Aug 2002 10:11:43 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id VAA18485; Wed, 21 Aug 2002 21:13:33 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208211713.VAA18485@sex.inr.ac.ru> Subject: Re: [PATCH] new net device feature: shared-ipv6-cards To: AHERRMAN@de.ibm.com (Andreas Herrmann) Date: Wed, 21 Aug 2002 21:13:33 +0400 (MSD) Cc: davem@redhat.com, netdev@oss.sgi.com, owner-netdev@oss.sgi.com, pekkas@netcore.fi In-Reply-To: from "Andreas Herrmann" at Aug 21, 2 07:00:39 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-Spam-Status: No, hits=-3.8 required=5.0 tests=IN_REP_TO,NO_REAL_NAME version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hello! > What about implementing your alternative proposal in 2.5 and to use > the current approach in a next 2.4.x kernel? I do not see any good reasons to do this differently. 
Alexey From owner-netdev@oss.sgi.com Wed Aug 21 10:22:16 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7LHMFRA009335 for ; Wed, 21 Aug 2002 10:22:15 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7LHMFmJ009334 for netdev-outgoing; Wed, 21 Aug 2002 10:22:15 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from sj-msg-core-1.cisco.com (sj-msg-core-1.cisco.com [171.71.163.11]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7LHM8RA009324 for ; Wed, 21 Aug 2002 10:22:08 -0700 Received: from mira-sjcd-1.cisco.com (IDENT:mirapoint@mira-sjcd-1.cisco.com [171.69.43.44]) by sj-msg-core-1.cisco.com (8.12.2/8.12.2) with ESMTP id g7LHP1AK028695; Wed, 21 Aug 2002 10:25:01 -0700 (PDT) Received: from dhcp-128-107-163-94.cisco.com (dhcp-128-107-163-94.cisco.com [128.107.163.94]) by mira-sjcd-1.cisco.com (Mirapoint) with ESMTP id ADH91719; Wed, 21 Aug 2002 10:25:28 -0700 (PDT) Date: Wed, 21 Aug 2002 10:23:37 -0700 (PDT) From: Vince Laviano To: Tom Parker cc: Subject: Re: igmp kernel issues. In-Reply-To: <4.2.0.58.20020820135148.02f70ed8@193.133.49.25> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk Hi Tom, One possible culprit is the following code from the beginning of ip_rcv_finish(), called for incoming packets after they go through netfilter, in ip_input.c: /* * Initialise the virtual path cache for the packet. It describes * how the packet travels inside Linux networking. */ if (skb->dst == NULL) { if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev)) goto drop; } } Vince On Tue, 20 Aug 2002, Tom Parker wrote: > > Ive been given this address by Alan Cox in regard to some linux kernel > networking > issues.. 
Follows is the original question I put to Alan: > > --- > Im currently rewriting large parts of mrouted for a custom application. > One of the functions im implementing is a "promiscuous" mode, > where by mrouted will service IGMP requests from any subnet, even > if there is no explicit route for the subnet in the kernel routing tables. > > Currently the kernel dosnt seem to pass IGMP reports to user space > unless there is a route for the subnet of the host which the report came > from. > > Ive searched through ipmr.c and igmp.c to find where its dropping the > IGMP report but havnt found anything obvious.. Do you know where > it does this? > --- > > Any additional information on this is greatly appreciated. > > Thanks in advance > Tom Parker > > From owner-netdev@oss.sgi.com Wed Aug 21 15:07:44 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7LM7hBM019247 for ; Wed, 21 Aug 2002 15:07:43 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7LM7hGA019246 for netdev-outgoing; Wed, 21 Aug 2002 15:07:43 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7LM7aBM019236; Wed, 21 Aug 2002 15:07:37 -0700 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id OAA20246; Wed, 21 Aug 2002 14:55:10 -0700 Date: Wed, 21 Aug 2002 14:55:09 -0700 (PDT) Message-Id: <20020821.145509.22238146.davem@redhat.com> To: AHERRMAN@de.ibm.com Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, owner-netdev@oss.sgi.com, pekkas@netcore.fi Subject: Re: [PATCH] new net device feature: shared-ipv6-cards From: "David S. 
Miller" In-Reply-To: References: X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk From: "Andreas Herrmann" Date: Wed, 21 Aug 2002 19:00:39 +0200 > In this way we are guaranteed not to have problems of this kind > in future. What about implementing your alternative proposal in 2.5 and to use the current approach in a next 2.4.x kernel? Just because you have some code depending upon this doesn't mean that is the way we should implement it in 2.4.x We should do it right from the start. Because doing it properly hurts no other driver except for yours. :-) From owner-netdev@oss.sgi.com Thu Aug 22 06:33:01 2002 Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7MDWuBM006914 for ; Thu, 22 Aug 2002 06:32:56 -0700 Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7MDWuAu006913 for netdev-outgoing; Thu, 22 Aug 2002 06:32:56 -0700 X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f Received: from d12lmsgate-3.de.ibm.com (d12lmsgate-3.de.ibm.com [195.212.91.201]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7MDWcBO006903 for ; Thu, 22 Aug 2002 06:32:47 -0700 Received: from d12relay02.de.ibm.com (d12relay02.de.ibm.com [9.165.215.23]) by d12lmsgate-3.de.ibm.com (8.12.3/8.12.3) with ESMTP id g7MDYwA9040834; Thu, 22 Aug 2002 15:34:58 +0200 Received: from d12ml033.de.ibm.com (d12ml033_cs0 [9.165.223.11]) by d12relay02.de.ibm.com (8.12.3/NCO/VER6.3) with ESMTP id g7MDYvqL104656; Thu, 22 Aug 2002 15:34:57 +0200 Subject: Re: [PATCH] new net device feature: shared-ipv6-cards To: "David S. 
Miller" , kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com, owner-netdev@oss.sgi.com, pekkas@netcore.fi X-Mailer: Lotus Notes Release 5.0.4a July 24, 2000 Message-ID: From: "Andreas Herrmann" Date: Thu, 22 Aug 2002 15:34:55 +0200 X-MIMETrack: Serialize by Router on D12ML033/12/M/IBM(Release 5.0.9a |January 7, 2002) at 22/08/2002 15:34:58 MIME-Version: 1.0 Content-type: text/plain; charset=us-ascii X-Spam-Status: No, hits=0.0 required=5.0 tests= version=2.20 X-Spam-Level: Sender: owner-netdev@oss.sgi.com Precedence: bulk kuznet@ms2.inr.ac.ru 08/21/02 07:13 PM > Hello! > > What about implementing your alternative proposal in 2.5 and to use > > the current approach in a next 2.4.x kernel? > I do not see any good reasons to do this differently. > Alexey "David S. Miller" 08/21/02 11:55 PM > From: "Andreas Herrmann" > Date: Wed, 21 Aug 2002 19:00:39 +0200 > > In this way we are guaranteed not to have problems of this kind > > in future. > What about implementing your alternative proposal in 2.5 and to use > the current approach in a next 2.4.x kernel? > Just because you have some code depending upon this doesn't > mean that is the way we should implement it in 2.4.x > We should do it right from the start. Because doing it properly > hurts no other driver except for yours. :-) Well, I tried it;-) Sometimes programmers are as lazy as something. But in fact, I'm on your side. So it's time for another patch. However, I cannot foresee when I finish this work. 

Regards,

Andreas

--
Linux for eServer Development
Tel : +49-7031-16-4640
Notes mail : Andreas Herrmann/GERMANY/IBM@IBMDE
email : aherrman@de.ibm.com

From owner-netdev@oss.sgi.com Thu Aug 22 23:49:36 2002
Received: from oss.sgi.com (localhost [127.0.0.1]) by oss.sgi.com (8.12.5/8.12.5) with ESMTP id g7N6naBM014214 for ; Thu, 22 Aug 2002 23:49:36 -0700
Received: (from majordomo@localhost) by oss.sgi.com (8.12.5/8.12.3/Submit) id g7N6najT014213 for netdev-outgoing; Thu, 22 Aug 2002 23:49:36 -0700
X-Authentication-Warning: oss.sgi.com: majordomo set sender to owner-netdev@oss.sgi.com using -f
Received: from u.domain.uli (ja.mac.ssi.bg [212.95.166.194]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7N6nRBM014198 for ; Thu, 22 Aug 2002 23:49:29 -0700
Received: from localhost (IDENT:ja@localhost [127.0.0.1]) by u.domain.uli (8.11.6/8.11.6) with ESMTP id g7N9rrx01171; Fri, 23 Aug 2002 09:53:55 GMT
Date: Fri, 23 Aug 2002 09:53:53 +0000 (GMT)
From: Julian Anastasov
X-X-Sender: ja@u.domain.uli
To: Alexey Kuznetsov
cc: netdev@oss.sgi.com, Alan Cox
Subject: TCP fixes for 2.2: forget sendmsg OOM, tcp_reset error_report
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Spam-Status: No, hits=-5.0 required=5.0 tests=UNIFIED_PATCH version=2.20
X-Spam-Level:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk

Hello,

Below are two fixes useful for 2.2.22. The first fixes SIGIO issue with tcp_reset, stolen from 2.4 - we should send event on error. The second one fixes TCP sendmsg problem on OOM (broken terminals, etc) - we should forget the old error (ENOMEM) and not to fail with EFAULT.

--- v2.2.21/linux/net/ipv4/tcp_input.c Sun Nov 4 10:16:16 2001
+++ linux/net/ipv4/tcp_input.c Thu Aug 22 22:14:23 2002
@@ -296,7 +296,7 @@
         }
         sk->shutdown = SHUTDOWN_MASK;
         if (!sk->dead)
-                sk->state_change(sk);
+                sk->error_report(sk);
 }
 /* This tags the retransmission queue when SACKs arrive. */
--- v2.2.21/linux/net/ipv4/tcp.c Sun Nov 4 10:16:16 2001
+++ linux/net/ipv4/tcp.c Thu Aug 22 22:16:02 2002
@@ -940,6 +940,7 @@
                         if (!err)
                                 tcp_push_pending_frames(sk, tp);
                         wait_for_tcp_memory(sk, err);
+                        err = 0;
                         /* If SACK's were formed or PMTU events happened,
                          * we must find out about it.

Regards

--
Julian Anastasov

From ralf@oss.sgi.com Fri Aug 23 16:41:22 2002
Received: with ECARTIS (v1.0.0; list netdev); Fri, 23 Aug 2002 16:51:43 -0700 (PDT)
Received: from shaft19-f87.dialo.tiscali.de (shaft19-f87.dialo.tiscali.de [62.246.19.87]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7NNf19D002570; Fri, 23 Aug 2002 16:41:02 -0700
Received: (ralf@lappi.linux-mips.net) by ralf.linux-mips.org id ; Sat, 24 Aug 2002 01:44:11 +0200
Date: Sat, 24 Aug 2002 01:44:11 +0200
From: Ralf Baechle
To: undisclosed-recipients:;
Subject: ADMIN: New mailing list software on oss
Message-ID: <20020824014411.A10670@bacchus.dhis.org>
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5.1i
X-Accept-Language: de,en,fr
Content-Transfer-Encoding: 8bit
X-archive-position: 2
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: ralf@oss.sgi.com
Precedence: bulk
X-list: netdev
Content-Length: 498
Lines: 10

After a long time we're finally retiring Majordomo as mailing list manager for oss.sgi.com and have replaced it with Ecartis. From a user perspective Ecartis is similar to Majordomo, that is subscribing and unsubscribing do work the same way. In addition we now offer new features such as a digest mode. For more information send an email containing the word "help" to Ecartis@oss.sgi.com. For a while we'll still have it available under the old majordomo@oss.sgi.com address as well.

  Ralf

From hadi@cyberus.ca Sun Aug 25 08:59:37 2002
Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Aug 2002 08:59:39 -0700 (PDT)
Received: from cyberus.ca (mail.cyberus.ca [216.191.240.111]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7PFxb9D003455 for ; Sun, 25 Aug 2002 08:59:37 -0700
Received: from shell.cyberus.ca (shell [216.191.240.114]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id MAA28239; Sun, 25 Aug 2002 12:02:56 -0400 (EDT)
Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.11.6+Sun/8.11.6) with ESMTP id g7PFuDb29969; Sun, 25 Aug 2002 11:56:13 -0400 (EDT)
X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs
Date: Sun, 25 Aug 2002 11:56:12 -0400 (EDT)
From: jamal
To:
cc:
Subject: Re: packet re-ordering on SMP machines.
Message-ID:
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
X-archive-position: 4
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: hadi@cyberus.ca
Precedence: bulk
X-list: netdev
Content-Length: 251
Lines: 12

NAPI fixes packet reordering problems.

Could people please post network-related questions to netdev? I think it even says so in the FAQ Richard Gooch maintains. Believe it or not, quite a few people are not subscribed to lk.

cheers,
jamal

From hadi@cyberus.ca Sun Aug 25 09:21:17 2002
Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Aug 2002 09:21:18 -0700 (PDT)
Received: from cyberus.ca (mail.cyberus.ca [216.191.240.111]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7PGLH9D003606 for ; Sun, 25 Aug 2002 09:21:17 -0700
Received: from shell.cyberus.ca (shell [216.191.240.114]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id MAA04044; Sun, 25 Aug 2002 12:24:25 -0400 (EDT)
Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.11.6+Sun/8.11.6) with ESMTP id g7PGHgb29999; Sun, 25 Aug 2002 12:17:42 -0400 (EDT)
X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs
Date: Sun, 25 Aug 2002 12:17:42 -0400 (EDT)
From: jamal
To:
cc: Mala Anand , , Robert Olsson
Subject: Re: [Lse-tech] Re: (RFC): SKB Initialization
Message-ID:
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
X-archive-position: 5
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: hadi@cyberus.ca
Precedence: bulk
X-list: netdev
Content-Length: 1297
Lines: 33

Mala,

Could you please at least cc netdev on networking-related issues? It says so in the kernel FAQ. I swore back around 95 to join lk only when Linux gets an IDE maintainer who is not insane. Hasn't happened yet.

Can you repeat your tests with the hotlist turned off (i.e. set to 0)? Also, if you would be doing tests on NAPI, please copy either us or netdev; it is not nice to read weeks after you post.

Also, Robert and I did a few tests and we did find that skb recycling (based on a patch from Robert a few years back) was in fact giving performance improvements of up to 15% over regular slab. Did you test with that patch for the e1000 he pointed you at? I repeated the tests (around June/July) with the tulip with input rates of a few 100K packets/sec and noticed an improvement over regular NAPI of about 10%. There's one bug on the tulip which we are chasing that might be related to the tulip's alignment requirements.

Freeing an skb only on the CPU that allocated it comes for free with the e1000 NAPI driver style, but not with the tulip NAPI, where a transmit interrupt might happen on a different CPU. The skb recycler patch only recycles if allocation and freeing are happening on the same CPU; otherwise we let the slab take the hit.
On the tulip this happens about 50% of the time.

cheers,
jamal

From greearb@candelatech.com Sun Aug 25 11:29:02 2002
Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Aug 2002 11:29:04 -0700 (PDT)
Received: from grok.yi.org (IDENT:BqPwX5WRhqGNJGLLuqMZ3EgR8a8LiMnt@dhcp101-dsl-usw4.w-link.net [208.161.125.101]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7PIT19D004308 for ; Sun, 25 Aug 2002 11:29:01 -0700
Received: from candelatech.com (IDENT:UM8o5j7CEVluOX3bmA1WsbteiUxtxZp5@localhost.localdomain [127.0.0.1]) by grok.yi.org (8.11.6/8.11.2) with ESMTP id g7PIWI726378; Sun, 25 Aug 2002 11:32:18 -0700
Message-ID: <3D6922B2.1020400@candelatech.com>
Date: Sun, 25 Aug 2002 11:32:18 -0700
From: Ben Greear
Organization: Candela Technologies
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1b) Gecko/20020722
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: jamal
CC: netdev@oss.sgi.com
Subject: Re: packet re-ordering on SMP machines.
References:
Content-type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
X-archive-position: 6
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: greearb@candelatech.com
Precedence: bulk
X-list: netdev
Content-Length: 2313
Lines: 57

jamal wrote:
>
>
>
> NAPI fixes packet reordering problems.

It does indeed. I just patched the e1000 with the latest NAPI patch I could find (from Aug 15 or so), and the re-ordering problems went away.

The amount of packets dropped decreased too, but I still see about 1 out of 1000 packets dropped due to rx-FIFO or rx-dropped. This is when trying to run 60,000 pps of 1514 byte packets from one port to the other on the same dual-port e1000 NIC (copper). It will generate up to about 72,000 pps without dropping too many more...

I will do some more tests on two single-port NICs soon to see if that performs better.

Also, I see the hard_start_xmit call failing 5876 times out of 2719493 calls (for example). The code that calls the method looks like this:

        spin_lock_bh(&odev->xmit_lock);
        if (!netif_queue_stopped(odev)) {
                if (odev->hard_start_xmit(next->skb, odev)) {
                        if (net_ratelimit()) {
                                printk(KERN_INFO "Hard xmit error\n");
                        }
                        next->errors++;
                        next->last_ok = 0;
                }
                else {
                        next->last_ok = 1;
                        next->sofar++;
                        next->tx_bytes += (next->cur_pkt_size + 4); /* count csum */
                }
                next->next_tx_ns = getRelativeCurNs() + next->ipg;
        }
        else { /* Re-try it next time */
                next->last_ok = 0;
        }
        spin_unlock_bh(&odev->xmit_lock);

I have not seen hard_start_xmit fail on other drivers, even when over-driving them well beyond their capabilities.

Any ideas what causes the hard_start_xmit errors?

Thanks,
Ben

--
Ben Greear
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear

From manand@us.ibm.com Sun Aug 25 13:09:09 2002
Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Aug 2002 13:09:13 -0700 (PDT)
Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.101]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7PK989D005199 for ; Sun, 25 Aug 2002 13:09:09 -0700
Received: from northrelay02.pok.ibm.com (northrelay02.pok.ibm.com [9.56.224.150]) by e1.ny.us.ibm.com (8.12.2/8.12.2) with ESMTP id g7PKC9nM159294; Sun, 25 Aug 2002 16:12:09 -0400
Received: from d03nm123.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by northrelay02.pok.ibm.com (8.12.3/NCO/VER6.4) with ESMTP id g7PKC6Xs079606; Sun, 25 Aug 2002 16:12:06 -0400
Subject: Re: [Lse-tech] Re: (RFC): SKB Initialization
To: jamal
Cc: linux-kernel@vger.kernel.org, "Mala Anand" , netdev@oss.sgi.com, Robert Olsson
X-Mailer: Lotus Notes Release 5.0.3 (Intl) 21 March 2000
Message-ID:
From: "Mala Anand"
Date: Sun, 25 Aug 2002 15:12:06 -0500
X-MIMETrack: Serialize by Router on D03NM123/03/M/IBM(Release 5.0.10 |March 22, 2002) at 08/25/2002 02:12:08 PM
MIME-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
X-archive-position: 8
X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: manand@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 2855 Lines: 76 Jamal wrote .. >Could you please at least cc netdev on networking related issues? >It says so in the kernel FAQ. >I swore back around 95 to join lk only when Linux gets a IDE maintainer >who is not insane. Hasnt happened yet. Yes I will, it is a mistake from my part, >Can you repeat your tests with the hotlist turned off (i.e set to 0)? Even if I turned the hot list, the slab allocator has a per cpu array of objects. In this case it keeps by default 60 objects and hot list keeps 126 objects. So there may be a difference. I will try this. This skb init work is the result of my probing in to the slab cache work. Read my posting on slab cache: http://marc.theaimsgroup.com/?l=linux-kernel&m=102773718023056&w=2 This work triggered the skb init patch. To quantify the effect of bouncing the objects between cpus, I choose skb to measure. And it turns out that the limited cpu array is not the culprit in this case, it is how the objects are allocated in one cpu and freed in another cpu is what causing the bouncing of objects between cpus. >Also if you would be doing tests on NAPI please either copy us or netdev; >it is not nice to read weeks after you post. Yes I will. >Also Robert and i did a few tests and we did find skb recycling (based on >a patch from Robert a few years back) was infact giving perfomance >improvements of upto 15% over regular slab. >Did you test with that patch for the e1000 he pointed you at? >I repeated the tests (around June/July) with the tulip with input rates of >a few 100K packets/sec and noticed a improvement over regular NAPI by >about 10%. Theres one bug on the tulip which we are chasing that >might be related to tulips alignment requirements; Yes I got the patch from Robert and I am planning on testing the patch. 
My understanding is that skbs are recylced in other operating systems as well to improve performance. And it particularly helps in architectures where pci mapping is expensive and when skbs are recycled, remapping is eliminated. I think the skbinit patch and recycling skbs are mutually exclusive. Recycling skbs will reduce the number of times we hit alloc_skb and __kfree_skb. >The idea of only freeing on the same CPU a skb allocated is free with >the e1000 NAPI driver style but not in the tulip NAPI where a txmit >interupt might happen on a different CPU. The skb recycler patch only >recylces if allocation and freeing are happening on the same CPU; >otherwise we let the slab take the hit. On the tulip this happens about >50% of the time. So skbinit patch will help the other case. Regards, Mala Mala Anand IBM Linux Technology Center - Kernel Performance E-mail:manand@us.ibm.com http://www-124.ibm.com/developerworks/opensource/linuxperf http://www-124.ibm.com/developerworks/projects/linuxperf Phone:838-8088; Tie-line:678-8088 From davem@redhat.com Sun Aug 25 15:53:25 2002 Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Aug 2002 15:53:28 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7PMrO9D006504 for ; Sun, 25 Aug 2002 15:53:24 -0700 Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id PAA15429; Sun, 25 Aug 2002 15:51:18 -0700 Date: Sun, 25 Aug 2002 15:51:17 -0700 (PDT) Message-Id: <20020825.155117.112262015.davem@redhat.com> To: hadi@cyberus.ca Cc: linux-kernel@vger.kernel.org, manand@us.ibm.com, netdev@oss.sgi.com, Robert.Olsson@data.slu.se Subject: Re: [Lse-tech] Re: (RFC): SKB Initialization From: "David S. 
Miller" In-Reply-To: References: X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit X-archive-position: 9 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Content-Length: 348 Lines: 13 From: jamal Date: Sun, 25 Aug 2002 12:17:42 -0400 (EDT) I swore back around 95 to join lk only when Linux gets a IDE maintainer who is not insane. Hasnt happened yet. Alan is now one of the 2.5.x IDE maintainers, would you like me to add you to linux-kernel? :-) Franks a lot, David S. Miller davem@redhat.com From hadi@cyberus.ca Sun Aug 25 17:55:48 2002 Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Aug 2002 17:55:50 -0700 (PDT) Received: from cyberus.ca (mail.cyberus.ca [216.191.240.111]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7Q0tl9D007208 for ; Sun, 25 Aug 2002 17:55:48 -0700 Received: from shell.cyberus.ca (shell [216.191.240.114]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id UAA26745; Sun, 25 Aug 2002 20:59:09 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.11.6+Sun/8.11.6) with ESMTP id g7Q0qIq00846; Sun, 25 Aug 2002 20:52:26 -0400 (EDT) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 25 Aug 2002 20:52:18 -0400 (EDT) From: jamal To: Ben Greear cc: Subject: Re: packet re-ordering on SMP machines. 
In-Reply-To: <3D6922B2.1020400@candelatech.com> Message-ID: MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit X-archive-position: 10 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Content-Length: 1373 Lines: 46 On Sun, 25 Aug 2002, Ben Greear wrote: > jamal wrote: > > > > > > > > NAPI fixes packet reordering problems. > > It does indeed. I just patched the e1000 with the latest NAPI patch > I could find (from Aug 15 or so), and the re-ordering problems went away. > > The amount of packets dropped decreased too, but I still see about 1 out of > 1000 packets dropped due to rx-FIFO or rx-dropped. This is when trying to run > 60,000 pps of 1514 byte packets from one port to the other on the same dual-port e1000 > NIC (copper). It will generate up to about 72,000 pps without dropping too many > more... > That doesnt sound impressive at all. I know it's about .8 of wire rate but you should be able to exceed that. Robert was generating in the range of 800Kpps with that NIC if i recall corectly > I will do some more tests on two single-port NICs soon to see if that > performs better. You should see better numbers. Also if you have SMP, tie each onto a CPU. Additionaly get the skb recycler patch from Robert, it should improve things even more. > > Also, I see the hard_start_xmit call failing 5876 times out of 2719493 > calls (for example). The code that calls the method looks like this: > I dont have access to that NIC. But a stoopid question: Have you tried increasing the transmit queue via ifconfig? 1000 packets is reasonable for gige. 
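
As a rough check on the wire-rate figures in this exchange, the arithmetic can be sketched in C. This is only a back-of-the-envelope helper, not code from any driver: the function names are made up for illustration, and it assumes that the 1514-byte packets exclude the 4-byte FCS and that each frame also costs 8 bytes of preamble and a 12-byte inter-frame gap on the wire.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame overhead on the wire that is not part of the 1514-byte frame.
 * Assumption: 1514 = 14-byte Ethernet header + 1500-byte payload, no FCS. */
#define ETH_FCS_BYTES      4   /* frame check sequence */
#define ETH_PREAMBLE_BYTES 8   /* preamble plus start-of-frame delimiter */
#define ETH_IFG_BYTES      12  /* minimum inter-frame gap */

/* Hypothetical helper: maximum frames per second that a link of
 * link_bps bits/s can carry for a given frame size (rounded down). */
static uint64_t wire_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
    assert(link_bps > 0 && frame_bytes > 0);
    uint64_t wire_bits = 8ULL * (frame_bytes + ETH_FCS_BYTES +
                                 ETH_PREAMBLE_BYTES + ETH_IFG_BYTES);
    return link_bps / wire_bits;
}

/* Hypothetical helper: achieved rate in parts per thousand of the
 * maximum rate (rounded down). */
static uint64_t utilisation_permille(uint64_t achieved_pps, uint64_t max_pps)
{
    assert(max_pps > 0);
    return achieved_pps * 1000 / max_pps;
}
```

Under these assumptions, wire_rate_pps(1000000000, 1514) is 81274 frames per second, so 60,000 pps is roughly 74% of wire rate and 72,000 pps roughly 88%, which is in line with the "about .8" estimate above.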
cheers, jamal From hadi@cyberus.ca Sun Aug 25 18:05:54 2002 Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Aug 2002 18:05:55 -0700 (PDT) Received: from cyberus.ca (mail.cyberus.ca [216.191.240.111]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7Q15r9D007359 for ; Sun, 25 Aug 2002 18:05:54 -0700 Received: from shell.cyberus.ca (shell [216.191.240.114]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id VAA29563; Sun, 25 Aug 2002 21:09:13 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.11.6+Sun/8.11.6) with ESMTP id g7Q12Vt00909; Sun, 25 Aug 2002 21:02:31 -0400 (EDT) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Sun, 25 Aug 2002 21:02:31 -0400 (EDT) From: jamal To: Mala Anand cc: , , Robert Olsson Subject: Re: [Lse-tech] Re: (RFC): SKB Initialization In-Reply-To: Message-ID: MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit X-archive-position: 11 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Content-Length: 2952 Lines: 81 On Sun, 25 Aug 2002, Mala Anand wrote: > > Jamal wrote .. > > >Can you repeat your tests with the hotlist turned off (i.e set to 0)? > > Even if I turned the hot list, the slab allocator has a per cpu array of > objects. In this case it keeps by default 60 objects and hot list keeps > 126 objects. So there may be a difference. I will try this. > Well, the hotlist is supposed to be one way to reduce the effect of initialization. > This skb init work is the result of my probing in to the slab cache work. > Read > my posting on slab cache: > http://marc.theaimsgroup.com/?l=linux-kernel&m=102773718023056&w=2 > This work triggered the skb init patch. To quantify the effect of bouncing > the objects between cpus, I choose skb to measure. 
And it turns out that > the > limited cpu array is not the culprit in this case, it is how the objects > are > allocated in one cpu and freed in another cpu is what causing the bouncing > of objects between cpus. > Same conclusion as us then > >Also Robert and i did a few tests and we did find skb recycling (based on > >a patch from Robert a few years back) was infact giving perfomance > >improvements of upto 15% over regular slab. > >Did you test with that patch for the e1000 he pointed you at? > >I repeated the tests (around June/July) with the tulip with input rates of > >a few 100K packets/sec and noticed a improvement over regular NAPI by > >about 10%. Theres one bug on the tulip which we are chasing that > >might be related to tulips alignment requirements; > > Yes I got the patch from Robert and I am planning on testing the patch. My > understanding is that skbs are recylced in other operating systems as well > to > improve performance. And it particularly helps in architectures where pci > mapping is expensive and when skbs are recycled, remapping is eliminated. standard practise for eons on networking in rtoses at least. Slab on Linux was supposed to get rid of this. Thats why Robert hid his original patch. >I think the skbinit patch and recycling skbs are mutually exclusive. I would say they are more orthogonal than mutually exclusive. Although ou still need to prove that relocating the code actually helps in real life. On paper it looks good. > Recycling > skbs will reduce the number of times we hit alloc_skb and __kfree_skb. > Thats what we see. > >The idea of only freeing on the same CPU a skb allocated is free with > >the e1000 NAPI driver style but not in the tulip NAPI where a txmit > >interupt might happen on a different CPU. The skb recycler patch only > >recylces if allocation and freeing are happening on the same CPU; > >otherwise we let the slab take the hit. On the tulip this happens about > >50% of the time. 
> So skbinit patch will help the other case.
>

It will. Note, however, that with the e1000 style of coding the hit ratio is much higher -- almost 100%, in theory at least. Perhaps it's time to convert the tulip ...

cheers,
jamal

From flygong@yahoo.com Sun Aug 25 19:12:50 2002
Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Aug 2002 19:12:52 -0700 (PDT)
Received: from web14502.mail.yahoo.com (web14502.mail.yahoo.com [216.136.224.65]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7Q2Co9D008981 for ; Sun, 25 Aug 2002 19:12:50 -0700
Message-ID: <20020826021611.64023.qmail@web14502.mail.yahoo.com>
Received: from [211.167.226.83] by web14502.mail.yahoo.com via HTTP; Sun, 25 Aug 2002 19:16:11 PDT
Date: Sun, 25 Aug 2002 19:16:11 -0700 (PDT)
From: Bergs
Subject: serious packet loss in 25Mbps load,frame size 64byte,duration 60 second. tested with smartbits
To: netdev@oss.sgi.com
MIME-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
X-archive-position: 12
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: flygong@yahoo.com
Precedence: bulk
X-list: netdev
Content-Length: 839
Lines: 33

Hi,

My kernel is 2.4.16. I am using a SmartBits test device to test the performance of the eepro100 driver, version 1.23b. I am testing on an Intel s220 server with two Intel 82550PM network interfaces. The traffic generated by the SmartBits device enters through NIC 1 and leaves through NIC 2.

When the load is 25% of 100Mbps, bi-directional, with 64-byte frames and a test duration of 60 seconds, the results show serious packet loss. When the duration is reduced to 10 seconds, with all other test conditions unchanged, there is no packet loss.

I wonder whether the system is running out of memory. Why does this happen, and how can I solve it?

Thanks a lot,
bergs

__________________________________________________
Do You Yahoo!?
Yahoo!
Finance - Get real-time stock quotes http://finance.yahoo.com From greearb@candelatech.com Sun Aug 25 21:32:14 2002 Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Aug 2002 21:32:15 -0700 (PDT) Received: from grok.yi.org (IDENT:6OSyl9pZShe2bZAxniA0TZanXwWPvcmg@dhcp101-dsl-usw4.w-link.net [208.161.125.101]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7Q4W89D010524 for ; Sun, 25 Aug 2002 21:32:13 -0700 Received: from candelatech.com (IDENT:11rXpaaoZXTBcqT0sko7CrnRYHZU4n3Q@localhost.localdomain [127.0.0.1]) by grok.yi.org (8.11.6/8.11.2) with ESMTP id g7Q4YlT03608; Sun, 25 Aug 2002 21:34:58 -0700 Message-ID: <3D69AFE7.6020902@candelatech.com> Date: Sun, 25 Aug 2002 21:34:47 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1b) Gecko/20020722 X-Accept-Language: en-us, en MIME-Version: 1.0 To: jamal CC: netdev@oss.sgi.com Subject: Re: packet re-ordering on SMP machines. References: Content-type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit X-archive-position: 15 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Content-Length: 1961 Lines: 58 jamal wrote: > That doesnt sound impressive at all. I know it's about .8 of wire rate > but you should be able to exceed that. > Robert was generating in the range of 800Kpps with that NIC if i recall > corectly I had only tested 1514 byte pkts, so I was getting around 880Mbps, which is pretty good as far as I know. I see about 255 kpps when sending 64 byte pkts to myself. Still dropping about 1 in 4000 packets at this speed. I think most of Robert's tests didn't involve actually doing something with the received packet though, and I am inspecting it for latency, sequence number, etc. I'm even doing a __get_timeofday() call to calculate the latency...need to find a faster way to do that... 
If I only allocate/scan 1 per 100 packets (ie alloc one packet and send it 100 times), then I get a more respectable 365kpps. Robert's patch should definately help! > Also if you have SMP, tie each onto a CPU. That's with the irq_afinity thing in proc, right? > Additionaly get the skb recycler patch from Robert, it should improve > things even more. Do you happen to have a URL for this? Actually, the various network tweaks are relatively hard to find (at least to find the most up-to-date coppies). It would be great if there was a place where they were all concentrated. > > >>Also, I see the hard_start_xmit call failing 5876 times out of 2719493 >>calls (for example). The code that calls the method looks like this: >> > > > I dont have access to that NIC. But a stoopid question: Have you tried > increasing the transmit queue via ifconfig? 1000 packets is reasonable > for gige. I upped it, but it didn't stop the errors. The NIC is still performing, so it may not be a real problem... Thanks for the info, Ben -- Ben Greear President of Candela Technologies Inc http://www.candelatech.com ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear From hadi@cyberus.ca Mon Aug 26 04:24:12 2002 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Aug 2002 04:24:15 -0700 (PDT) Received: from cyberus.ca (mail.cyberus.ca [216.191.240.111]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7QBOC9D024905 for ; Mon, 26 Aug 2002 04:24:12 -0700 Received: from shell.cyberus.ca (shell [216.191.240.114]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id HAA02054; Mon, 26 Aug 2002 07:27:34 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.11.6+Sun/8.11.6) with ESMTP id g7QBKl202087; Mon, 26 Aug 2002 07:20:51 -0400 (EDT) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Mon, 26 Aug 2002 07:20:47 -0400 (EDT) From: jamal To: Ben Greear cc: Subject: Re: packet re-ordering on SMP machines. 
In-Reply-To: <3D69AFE7.6020902@candelatech.com> Message-ID: MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit X-archive-position: 16 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Content-Length: 2550 Lines: 82 On Sun, 25 Aug 2002, Ben Greear wrote: > jamal wrote: > > > That doesnt sound impressive at all. I know it's about .8 of wire rate > > but you should be able to exceed that. > > Robert was generating in the range of 800Kpps with that NIC if i recall > > corectly > > I had only tested 1514 byte pkts, so I was getting around 880Mbps, > which is pretty good as far as I know. theres no reason you shouldnt be able to do wire rate. > > I see about 255 kpps when sending 64 byte pkts to myself. Still > dropping about 1 in 4000 packets at this speed. I think most of Robert's > tests didn't involve actually doing something with the received packet > though, and I am inspecting it for latency, sequence number, etc. > > I'm even doing a __get_timeofday() call to calculate the latency...need > to find a faster way to do that... > ouch. for latency or sequencing you dont really need to all packets. Read academic papers on the subject. You probably need about 5% of the total packets. Also you dont have to do the checks at runtime, you can do them once the run is complete (which you should be able to tell since you control both send and receive). > If I only allocate/scan 1 per 100 packets (ie alloc one packet and send it 100 times), > then I get a more respectable 365kpps. Robert's patch should definately help! > Yes, clearly you will benefit. > > Also if you have SMP, tie each onto a CPU. > > That's with the irq_afinity thing in proc, right? yes. > > > Additionaly get the skb recycler patch from Robert, it should improve > > things even more. > > Do you happen to have a URL for this? 
> > Actually, the various network tweaks are relatively hard to find > (at least to find the most up-to-date coppies). It would be great if > there was a place where they were all concentrated. Roberts site is the main repository; it may have READMEs with URLs pointing to various locations. ftp://130.238.98.12/pub/Linux/net-development/ and look at the recycling and NAPI sub-directories. > > > > > > >>Also, I see the hard_start_xmit call failing 5876 times out of 2719493 > >>calls (for example). The code that calls the method looks like this: > >> > > > > > > I dont have access to that NIC. But a stoopid question: Have you tried > > increasing the transmit queue via ifconfig? 1000 packets is reasonable > > for gige. > > I upped it, but it didn't stop the errors. The NIC is still performing, > so it may not be a real problem... > I dont have this NIC. When Robert shows up he may be able to explain this. cheers, jamal From manand@us.ibm.com Mon Aug 26 06:00:46 2002 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Aug 2002 06:00:47 -0700 (PDT) Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.129]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7QD0k9D026629 for ; Mon, 26 Aug 2002 06:00:46 -0700 Received: from westrelay02.boulder.ibm.com (westrelay02.boulder.ibm.com [9.17.194.23]) by e31.co.us.ibm.com (8.12.2/8.12.2) with ESMTP id g7QD45c5028400; Mon, 26 Aug 2002 09:04:05 -0400 Received: from d03nm123.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by westrelay02.boulder.ibm.com (8.12.3/NCO/VER6.4) with ESMTP id g7QD43N6097918; Mon, 26 Aug 2002 07:04:04 -0600 Subject: Re: [Lse-tech] Re: (RFC): SKB Initialization To: jamal , davem@redhat.com Cc: netdev@oss.sgi.com, Robert Olsson , "Bill Hartner" , X-Mailer: Lotus Notes Release 5.0.7 March 21, 2001 Message-ID: From: "Mala Anand" Date: Mon, 26 Aug 2002 08:04:05 -0500 X-MIMETrack: Serialize by Router on D03NM123/03/M/IBM(Release 5.0.10 |March 22, 2002) at 08/26/2002 07:04:04 AM MIME-Version: 1.0 
Content-type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit X-archive-position: 17 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: manand@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 821 Lines: 28 >>I think the skbinit patch and recycling skbs are mutually exclusive. >I would say they are more orthogonal than mutually exclusive. >Although ou still need to prove that relocating the code actually helps in >real life. On paper it looks good. Troy Wilson (who works with me) posted SPECweb99 results using my skbinit patch to lkml on Friday: http://www.uwsg.iu.edu/hypermail/linux/kernel/0208.2/1470.html I know you don't subscribe to lkml. Have you seen these results? On Numa machine it showed around 3% improvement using SPECweb99. Regards, Mala Mala Anand IBM Linux Technology Center - Kernel Performance E-mail:manand@us.ibm.com http://www-124.ibm.com/developerworks/opensource/linuxperf http://www-124.ibm.com/developerworks/projects/linuxperf Phone:838-8088; Tie-line:678-8088 From weixl@caltech.edu Mon Aug 26 16:00:47 2002 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Aug 2002 16:00:49 -0700 (PDT) Received: from chamber.cco.caltech.edu (chamber.its.caltech.edu [131.215.48.55] (may be forged)) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7QN0ltG010360 for ; Mon, 26 Aug 2002 16:00:47 -0700 Received: from weixl (sonata.caltech.edu [131.215.220.1]) by chamber.cco.caltech.edu (8.12.3/8.12.3) with ESMTP id g7QN40Fn021118; Mon, 26 Aug 2002 16:04:05 -0700 (PDT) Message-ID: <009701c24d54$d27304a0$f1fa010a@weixl> From: "Xiaoliang \(David\) Wei" To: "Ben Greear" , "jamal" , "Cheng Jin" , "Cheng Hu" , "Steven Low" Cc: References: <3D69AFE7.6020902@candelatech.com> Subject: Re: packet re-ordering on SMP machines. 
Date: Mon, 26 Aug 2002 16:03:36 -0700 MIME-Version: 1.0 Content-type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2600.0000 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000 X-archive-position: 18 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: weixl@caltech.edu Precedence: bulk X-list: netdev Content-Length: 2979 Lines: 90 Hi Ben and Jamal, Are you guys sure that getdayoftime per packet is a big overhead on Gbps connection? Do you compare the performance with getdayoftime per packet and without? I guess RFC 1323 specifies that each packet should have a timestamp (although not from getdayoftime). Also, what's your testbed's configuration, Ben? (I guess if we can use faster hardware to overcome this effect...) Thank you:) ps: I am working on some high speed TCP experiment and may want to make getdayoftime every packet... -David Xiaoliang (David) Wei Graduate Student in CS@Caltech http://www.cs.caltech.edu/~weixl ==================================================== ----- Original Message ----- From: "Ben Greear" To: "jamal" Cc: Sent: Sunday, August 25, 2002 9:34 PM Subject: Re: packet re-ordering on SMP machines. > > jamal wrote: > > > That doesnt sound impressive at all. I know it's about .8 of wire rate > > but you should be able to exceed that. > > Robert was generating in the range of 800Kpps with that NIC if i recall > > corectly > > I had only tested 1514 byte pkts, so I was getting around 880Mbps, > which is pretty good as far as I know. > > I see about 255 kpps when sending 64 byte pkts to myself. Still > dropping about 1 in 4000 packets at this speed. I think most of Robert's > tests didn't involve actually doing something with the received packet > though, and I am inspecting it for latency, sequence number, etc. 
> > I'm even doing a __get_timeofday() call to calculate the latency...need > to find a faster way to do that... > > If I only allocate/scan 1 per 100 packets (ie alloc one packet and send it 100 times), > then I get a more respectable 365kpps. Robert's patch should definately help! > > > Also if you have SMP, tie each onto a CPU. > > That's with the irq_afinity thing in proc, right? > > > Additionaly get the skb recycler patch from Robert, it should improve > > things even more. > > Do you happen to have a URL for this? > > Actually, the various network tweaks are relatively hard to find > (at least to find the most up-to-date coppies). It would be great if > there was a place where they were all concentrated. > > > > > > >>Also, I see the hard_start_xmit call failing 5876 times out of 2719493 > >>calls (for example). The code that calls the method looks like this: > >> > > > > > > I dont have access to that NIC. But a stoopid question: Have you tried > > increasing the transmit queue via ifconfig? 1000 packets is reasonable > > for gige. > > I upped it, but it didn't stop the errors. The NIC is still performing, > so it may not be a real problem... 
> > Thanks for the info, > Ben > > -- > Ben Greear > President of Candela Technologies Inc http://www.candelatech.com > ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear > > > > > From greearb@candelatech.com Mon Aug 26 16:17:39 2002 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Aug 2002 16:17:40 -0700 (PDT) Received: from grok.yi.org (IDENT:G4+K4Ar+2jyXhhVg3JHuE2eQlrhtUtl9@dhcp101-dsl-usw4.w-link.net [208.161.125.101]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7QNHctG010538 for ; Mon, 26 Aug 2002 16:17:38 -0700 Received: from candelatech.com (IDENT:MI7NKb48JHy9zwt8AzqWv3zrnEvlCuDm@localhost.localdomain [127.0.0.1]) by grok.yi.org (8.11.6/8.11.2) with ESMTP id g7QNKfT17030; Mon, 26 Aug 2002 16:20:41 -0700 Message-ID: <3D6AB7C9.3020802@candelatech.com> Date: Mon, 26 Aug 2002 16:20:41 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1b) Gecko/20020722 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "Xiaoliang (David) Wei" CC: jamal , Cheng Jin , Cheng Hu , Steven Low , netdev@oss.sgi.com Subject: Re: packet re-ordering on SMP machines. References: <3D69AFE7.6020902@candelatech.com> <009701c24d54$d27304a0$f1fa010a@weixl> Content-type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit X-archive-position: 19 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Content-Length: 3790 Lines: 118 Xiaoliang (David) Wei wrote: > Hi Ben and Jamal, > Are you guys sure that getdayoftime per packet is a big overhead on > Gbps connection? > Do you compare the performance with getdayoftime per packet and > without? I guess RFC 1323 specifies that each packet should have a timestamp > (although not from getdayoftime). > Also, what's your testbed's configuration, Ben? (I guess if we can > use faster hardware to overcome this effect...) 
> Thank you:) > > ps: I am working on some high speed TCP experiment and may want to make > getdayoftime every packet... Actually, now that I think back, I believe the generic ethernet code timestamps each skb when it's received anyway.... So, my hit probably comes mostly from allocating new buffers and potentially the gettimeofday that is done then. I have not benchmarked the kernel gettimeofday call in any sort of isolated case. It does not appear that the CPU is what is limiting my particular test, I think it's either the NIC or the driver, or more likely, the way I'm driving it... Ben > > -David > Xiaoliang (David) Wei Graduate Student in CS@Caltech > http://www.cs.caltech.edu/~weixl > ==================================================== > ----- Original Message ----- > From: "Ben Greear" > To: "jamal" > Cc: > Sent: Sunday, August 25, 2002 9:34 PM > Subject: Re: packet re-ordering on SMP machines. > > > >>jamal wrote: >> >> >>>That doesnt sound impressive at all. I know it's about .8 of wire rate >>>but you should be able to exceed that. >>>Robert was generating in the range of 800Kpps with that NIC if i recall >>>corectly >> >>I had only tested 1514 byte pkts, so I was getting around 880Mbps, >>which is pretty good as far as I know. >> >>I see about 255 kpps when sending 64 byte pkts to myself. Still >>dropping about 1 in 4000 packets at this speed. I think most of Robert's >>tests didn't involve actually doing something with the received packet >>though, and I am inspecting it for latency, sequence number, etc. >> >>I'm even doing a __get_timeofday() call to calculate the latency...need >>to find a faster way to do that... >> >>If I only allocate/scan 1 per 100 packets (ie alloc one packet and send it > > 100 times), > >>then I get a more respectable 365kpps. Robert's patch should definately > > help! > >>>Also if you have SMP, tie each onto a CPU. >> >>That's with the irq_afinity thing in proc, right? 
>> >> >>>Additionaly get the skb recycler patch from Robert, it should improve >>>things even more. >> >>Do you happen to have a URL for this? >> >>Actually, the various network tweaks are relatively hard to find >>(at least to find the most up-to-date coppies). It would be great if >>there was a place where they were all concentrated. >> >> >>> >>>>Also, I see the hard_start_xmit call failing 5876 times out of 2719493 >>>>calls (for example). The code that calls the method looks like this: >>>> >>> >>> >>>I dont have access to that NIC. But a stoopid question: Have you tried >>>increasing the transmit queue via ifconfig? 1000 packets is reasonable >>>for gige. >> >>I upped it, but it didn't stop the errors. The NIC is still performing, >>so it may not be a real problem... >> >>Thanks for the info, >>Ben >> >>-- >>Ben Greear >>President of Candela Technologies Inc http://www.candelatech.com >>ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear >> >> >> >> >> > > -- Ben Greear President of Candela Technologies Inc http://www.candelatech.com ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear From manand@us.ibm.com Mon Aug 26 19:50:28 2002 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Aug 2002 19:50:30 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.133]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7R2oStG012399 for ; Mon, 26 Aug 2002 19:50:28 -0700 Received: from westrelay04.boulder.ibm.com (westrelay04.boulder.ibm.com [9.17.193.32]) by e35.co.us.ibm.com (8.12.2/8.12.2) with ESMTP id g7R2renO013956; Mon, 26 Aug 2002 22:53:40 -0400 Received: from d03nm123.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by westrelay04.boulder.ibm.com (8.12.3/NCO/VER6.4) with ESMTP id g7R2s4kU102724; Mon, 26 Aug 2002 20:54:05 -0600 Subject: Re: [Lse-tech] Re: (RFC): SKB Initialization To: Robert Olsson Cc: "Bill Hartner" , davem@redhat.com, jamal , linux-kernel@vger.kernel.org, "Mala Anand" , netdev@oss.sgi.com, 
Robert Olsson
X-Mailer: Lotus Notes Release 5.0.3 (Intl) 21 March 2000
Message-ID:
From: "Mala Anand"
Date: Mon, 26 Aug 2002 21:53:37 -0500
X-MIMETrack: Serialize by Router on D03NM123/03/M/IBM(Release 5.0.10 |March 22, 2002) at 08/26/2002 08:53:39 PM
MIME-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
X-archive-position: 20
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: manand@us.ibm.com
Precedence: bulk
X-list: netdev
Content-Length: 1084
Lines: 34

Robert Olsson wrote..

>In slab terms you moved part of the destructor to the constructor
>but the main problem is still there. The skb entered the "wrong" CPU
>so to be "reused from the slab again" the work has to done regardless
>if it's in the constructor or destructor.

That is true on a uniprocessor, but on SMP the initialization, if it happens on two different CPUs, hurts performance due to cache effects. The problem of object (skb) allocation, use and deallocation occurring on multiple CPUs needs to be addressed separately. This patch does not attempt to address that.

>Eventually if we accept some cache misses a skb could possibly be re-routed
>to the proper slab/CPU for this we would need some skb coloring.

You can still do this. I don't see the skbinit patch hindering it.
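
The same-CPU rule described earlier in this thread for the skb recycler (recycle only when allocation and freeing happen on the same CPU, otherwise let the slab take the hit) can be illustrated with a small user-space sketch. This is not Robert's patch and not kernel code: struct buf, NCPUS, RECYCLE_MAX and both functions are hypothetical names chosen for illustration, and the CPU number is passed in explicitly instead of being read from the running processor.

```c
#include <assert.h>
#include <stddef.h>

#define NCPUS       2   /* assumed CPU count for the illustration */
#define RECYCLE_MAX 4   /* assumed cap on each per-CPU recycle list */

/* Stand-in for an skb: remembers which CPU allocated it. */
struct buf {
    int alloc_cpu;
    struct buf *next;
};

/* One private free list per CPU, so the list itself never bounces. */
struct recycle_list {
    struct buf *head;
    int len;
};

static struct recycle_list recycle[NCPUS];

/* Try to recycle b on the CPU that is freeing it.
 * Returns 1 if b was kept on the local list, 0 if the caller must
 * hand it back to the general allocator (the "slab takes the hit" case). */
static int buf_release(struct buf *b, int cur_cpu)
{
    assert(cur_cpu >= 0 && cur_cpu < NCPUS);
    struct recycle_list *rl = &recycle[cur_cpu];

    if (b->alloc_cpu != cur_cpu || rl->len >= RECYCLE_MAX)
        return 0;

    b->next = rl->head;
    rl->head = b;
    rl->len++;
    return 1;
}

/* Take a recycled buffer from the local list, or NULL if it is empty. */
static struct buf *buf_get_recycled(int cur_cpu)
{
    assert(cur_cpu >= 0 && cur_cpu < NCPUS);
    struct recycle_list *rl = &recycle[cur_cpu];
    struct buf *b = rl->head;

    if (b) {
        rl->head = b->next;
        rl->len--;
        b->next = NULL;
    }
    return b;
}
```

A buffer freed on the CPU that allocated it comes back from buf_get_recycled() on that CPU without touching the general allocator; a buffer freed on a different CPU is rejected and falls through to the normal free path, which is the case the skbinit discussion is about.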
Regards, Mala Mala Anand IBM Linux Technology Center - Kernel Performance E-mail:manand@us.ibm.com http://www-124.ibm.com/developerworks/opensource/linuxperf http://www-124.ibm.com/developerworks/projects/linuxperf Phone:838-8088; Tie-line:678-8088 From naim@Ic4ic.com Mon Aug 26 22:00:19 2002 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Aug 2002 22:00:20 -0700 (PDT) Received: from exchange.Ic4ic.com ([194.90.135.194]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7R50HtG013612 for ; Mon, 26 Aug 2002 22:00:18 -0700 Received: through eSafe SMTP Relay 1029955789; Tue Aug 27 08:05:30 2002 content-class: urn:content-classes:message MIME-Version: 1.0 Content-type: text/plain Subject: Limited Transmit X-MimeOLE: Produced By Microsoft Exchange V6.0.4417.0 Date: Tue, 27 Aug 2002 08:03:05 +0200 Message-ID: <88BC9E379956AE4DB689CC5FF6F5A43D2970AB@exchange.Ic4ic.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Limited Transmit Thread-Index: AcJNj2i8zex3SKYrSRmkifs/o5elXg== From: "Naim Far" To: Content-Transfer-Encoding: 8bit X-archive-position: 21 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: naim@Ic4ic.com Precedence: bulk X-list: netdev Content-Length: 156 Lines: 8 How do I add the "Limited Trasnmit" (RFC 3042) feature to the TCP implementation in Linux ? I hope I can get CCed with the answer. THANX in advance... From hadi@cyberus.ca Tue Aug 27 03:21:21 2002 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 03:21:22 -0700 (PDT) Received: from cyberus.ca (mail.cyberus.ca [216.191.240.111]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RALLtG021300 for ; Tue, 27 Aug 2002 03:21:21 -0700 Received: from shell.cyberus.ca (shell [216.191.240.114]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) 
with ESMTP id GAA13228; Tue, 27 Aug 2002 06:24:30 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.11.6+Sun/8.11.6) with ESMTP id g7RAHgM08708; Tue, 27 Aug 2002 06:17:43 -0400 (EDT) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 27 Aug 2002 06:17:42 -0400 (EDT) From: jamal To: Mala Anand cc: , , Robert Olsson , Bill Hartner , Subject: Re: [Lse-tech] Re: (RFC): SKB Initialization In-Reply-To: Message-ID: MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit X-archive-position: 22 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Content-Length: 675 Lines: 22 On Mon, 26 Aug 2002, Mala Anand wrote: > Troy Wilson (who works with me) posted SPECweb99 results using my > skbinit patch to lkml on Friday: > http://www.uwsg.iu.edu/hypermail/linux/kernel/0208.2/1470.html > I know you don't subscribe to lkml. Have you seen these results? > On Numa machine it showed around 3% improvement using SPECweb99. > The posting you pointed to says 1% - not that it matters. It becomes more insignificant when skb recycling comes in play mostly because the alloc and freeing of skbs doesnt really show up as hotlist item within the profile. I am not saying it is totaly useless -- anything that will save a few cycles is good; cheers, jamal From hadi@cyberus.ca Tue Aug 27 04:02:56 2002 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 04:02:57 -0700 (PDT) Received: from cyberus.ca (mail.cyberus.ca [216.191.240.111]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RB2ttG022149 for ; Tue, 27 Aug 2002 04:02:55 -0700 Received: from shell.cyberus.ca (shell [216.191.240.114]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) 
with ESMTP id HAA24063; Tue, 27 Aug 2002 07:06:21 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.11.6+Sun/8.11.6) with ESMTP id g7RAxXS08772; Tue, 27 Aug 2002 06:59:33 -0400 (EDT) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 27 Aug 2002 06:59:33 -0400 (EDT) From: jamal To: "Xiaoliang (David) Wei" cc: Ben Greear , Cheng Jin , Cheng Hu , Steven Low , Subject: Re: packet re-ordering on SMP machines. In-Reply-To: <009701c24d54$d27304a0$f1fa010a@weixl> Message-ID: MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit X-archive-position: 23 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Content-Length: 736 Lines: 28 On Mon, 26 Aug 2002, Xiaoliang (David) Wei wrote: > Hi Ben and Jamal, > Are you guys sure that getdayoftime per packet is a big overhead on > Gbps connection? We may be talking about different things; I am talking about do_gettimeofday -- which is very expensive. Anyone who has time could look at improving that. It is run per incoming packet. > Do you compare the performance with getdayoftime per packet and > without? I think it would be pretty noticeable if you got rid of the per-incoming-packet calls to do_gettimeofday > I guess RFC 1323 specifies that each packet should have a timestamp > (although not from getdayoftime). In Linux, this is cleverly based on the system clock (jiffies). 
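As a side note on the jiffies remark: a tick counter like jiffies wraps around, so code that compares two tick-based timestamps normally uses signed subtraction instead of a plain `<`. The sketch below is a user-space illustration only. The names `tick_after` and `tick_elapsed` are hypothetical (in the kernel the `time_after()` macro plays this role), and the comparison is assumed valid only while the two values are less than 2^31 ticks apart.

```c
#include <stdint.h>

/* Hypothetical helper: non-zero if tick value a is later than b.
 * Assumes a and b are less than 2^31 ticks apart. */
static int tick_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Hypothetical helper: ticks elapsed from start to now.
 * Unsigned subtraction gives the right answer across a wrap. */
static uint32_t tick_elapsed(uint32_t now, uint32_t start)
{
    return now - start;
}
```

With these helpers, a counter that has wrapped from 0xFFFFFFF0 to 2 still compares as "later", which a plain `now > start` test would get wrong.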
cheers, jamal From ak@suse.de Tue Aug 27 04:09:07 2002 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 04:09:10 -0700 (PDT) Received: from Cantor.suse.de (ns.suse.de [213.95.15.193]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RB96tG022575 for ; Tue, 27 Aug 2002 04:09:06 -0700 Received: from Hermes.suse.de (Charybdis.suse.de [213.95.15.201]) by Cantor.suse.de (Postfix) with ESMTP id 557901455A; Tue, 27 Aug 2002 13:12:28 +0200 (MEST) Date: Tue, 27 Aug 2002 13:12:27 +0200 From: Andi Kleen To: jamal Cc: "Xiaoliang (David) Wei" , Ben Greear , Cheng Jin , Cheng Hu , Steven Low , netdev@oss.sgi.com Subject: Re: packet re-ordering on SMP machines. Message-ID: <20020827131227.A16565@wotan.suse.de> References: <009701c24d54$d27304a0$f1fa010a@weixl> Mime-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.3.22.1i Content-Transfer-Encoding: 8bit X-archive-position: 24 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ak@suse.de Precedence: bulk X-list: netdev Content-Length: 1089 Lines: 34 On Tue, Aug 27, 2002 at 06:59:33AM -0400, jamal wrote: > > > > On Mon, 26 Aug 2002, Xiaoliang (David) Wei wrote: > > > Hi Ben and Jamal, > > Are you guys sure that getdayoftime per packet is a big overhead on > > Gbps connection? > > We may be talking about different things; > I am talking about do_gettimeofday -- which is very expensive. > Anyone who has time could look at improving that. It is run per incoming > packet. That is because of the lock it takes. Locks are always slow. 
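For readers wondering what reading a clock without a lock can look like: one common pattern is a sequence counter that the writer bumps before and after each update, so readers simply retry if they raced with an update. The following is a user-space sketch with C11 atomics and hypothetical names (`seq_clock`, `seq_clock_set`, `seq_clock_get`). It assumes a single writer and is not the code of any kernel architecture discussed in this thread.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared clock state: one writer (e.g. a timer tick), many readers. */
struct seq_clock {
    atomic_uint seq;        /* odd while an update is in progress */
    _Atomic int64_t sec;
    _Atomic int64_t usec;
};

/* Writer side (single writer assumed): make seq odd, update, make it even. */
static void seq_clock_set(struct seq_clock *c, int64_t sec, int64_t usec)
{
    unsigned s = atomic_load_explicit(&c->seq, memory_order_relaxed);

    atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&c->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&c->usec, usec, memory_order_relaxed);
    atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Reader side: takes no lock; retries if a writer was active meanwhile. */
static void seq_clock_get(struct seq_clock *c, int64_t *sec, int64_t *usec)
{
    unsigned before, after;

    do {
        before = atomic_load_explicit(&c->seq, memory_order_acquire);
        *sec = atomic_load_explicit(&c->sec, memory_order_relaxed);
        *usec = atomic_load_explicit(&c->usec, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&c->seq, memory_order_relaxed);
    } while ((before & 1u) || before != after);
}
```

The trade-off in this pattern is that readers never block the writer, at the price of an occasional retry when a read overlaps an update.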
Older kernels used gettimeoffset which ran without lock, but that was changed because in some very obscure cases it could cause non monotonous timestamps when the user turns on timestamp receiving to user space (kernel protocols do not care) Possibilities: - Ignore the problem and switch back to gettimeoffset again - Switch to gettimeoffset but add some correction step for the unlikely case that someone wants the timestamp from user space (would be my prefered solution) - Implement lockless gettimeofday like x86-64 or sparc (good one too, but likely slower than last) -Andi From hadi@cyberus.ca Tue Aug 27 04:21:58 2002 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 04:21:59 -0700 (PDT) Received: from cyberus.ca (mail.cyberus.ca [216.191.240.111]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RBLvtG023038 for ; Tue, 27 Aug 2002 04:21:58 -0700 Received: from shell.cyberus.ca (shell [216.191.240.114]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id HAA29673; Tue, 27 Aug 2002 07:25:25 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.11.6+Sun/8.11.6) with ESMTP id g7RBIf608801; Tue, 27 Aug 2002 07:18:41 -0400 (EDT) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 27 Aug 2002 07:18:41 -0400 (EDT) From: jamal To: Naim Far cc: Subject: Re: Limited Transmit In-Reply-To: <88BC9E379956AE4DB689CC5FF6F5A43D2970AB@exchange.Ic4ic.com> Message-ID: MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit X-archive-position: 25 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Content-Length: 383 Lines: 17 On Tue, 27 Aug 2002, Naim Far wrote: > How do I add the "Limited Trasnmit" (RFC 3042) feature to the TCP > implementation in Linux ? > I am not sure if this RFC will add a lot of value if any to Linux. 
Linux already knows how to undo changes that are caused by false alarms
like reordering. To implement it, start with linux/net/ipv4/* and look at the
tcp files.

cheers,
jamal

From hadi@cyberus.ca Tue Aug 27 05:08:22 2002
Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 05:08:24 -0700 (PDT)
Received: from cyberus.ca (mail.cyberus.ca [216.191.240.111]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RC8LtG027739 for ; Tue, 27 Aug 2002 05:08:21 -0700
Received: from shell.cyberus.ca (shell [216.191.240.114]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id IAA17058; Tue, 27 Aug 2002 08:11:48 -0400 (EDT)
Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.11.6+Sun/8.11.6) with ESMTP id g7RC54B08997; Tue, 27 Aug 2002 08:05:05 -0400 (EDT)
X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs
Date: Tue, 27 Aug 2002 08:05:04 -0400 (EDT)
From: jamal
To: Andi Kleen
cc: "Xiaoliang (David) Wei" , Ben Greear , Cheng Jin , Cheng Hu , Steven Low ,
Subject: Re: packet re-ordering on SMP machines.
In-Reply-To: <20020827131227.A16565@wotan.suse.de>
Message-ID:
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
X-archive-position: 26
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: hadi@cyberus.ca
Precedence: bulk
X-list: netdev
Content-Length: 830
Lines: 35

On Tue, 27 Aug 2002, Andi Kleen wrote:

>
> That is because of the lock it takes. Locks are always slow.

xtime_lock?

>
> Older kernels used gettimeoffset which ran without lock, but that was
> changed because in some very obscure cases it could cause non monotonous
> timestamps when the user turns on timestamp receiving to user space
> (kernel protocols do not care)
>
> Possibilities:
>
> - Ignore the problem and switch back to gettimeoffset again

Is it safe to call gettimeoffset without the lock?
> - Switch to gettimeoffset but add some correction step for the unlikely > case that someone wants the timestamp from user space > (would be my prefered solution) > - Implement lockless gettimeofday like x86-64 or sparc > (good one too, but likely slower than last) ia64 seems to also have the lock. cheers, jamal From ak@suse.de Tue Aug 27 05:16:43 2002 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 05:16:44 -0700 (PDT) Received: from Cantor.suse.de (ns.suse.de [213.95.15.193]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RCGgtG028177 for ; Tue, 27 Aug 2002 05:16:42 -0700 Received: from Hermes.suse.de (Charybdis.suse.de [213.95.15.201]) by Cantor.suse.de (Postfix) with ESMTP id CA44F14716; Tue, 27 Aug 2002 14:20:04 +0200 (MEST) Date: Tue, 27 Aug 2002 14:20:04 +0200 From: Andi Kleen To: jamal Cc: Andi Kleen , "Xiaoliang (David) Wei" , Ben Greear , Cheng Jin , Cheng Hu , Steven Low , netdev@oss.sgi.com Subject: Re: packet re-ordering on SMP machines. Message-ID: <20020827142004.C4358@wotan.suse.de> References: <20020827131227.A16565@wotan.suse.de> Mime-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.3.22.1i Content-Transfer-Encoding: 8bit X-archive-position: 27 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ak@suse.de Precedence: bulk X-list: netdev Content-Length: 1354 Lines: 49 On Tue, Aug 27, 2002 at 08:05:04AM -0400, jamal wrote: > > > > On Tue, 27 Aug 2002, Andi Kleen wrote: > > > > > That is because of the lock it takes. Locks are always slow. > > xtime_lock? Yes. It also has some other overhead. 
> > > > > Older kernels used gettimeoffset which ran without lock, but that was > > changed because in some very obscure cases it could cause non monotonous > > timestamps when the user turns on timestamp receiving to user space > > (kernel protocols do not care) > > > > Possibilities: > > > > - Ignore the problem and switch back to gettimeoffset again > > Is it safe to call gettimeoffset without the lock? Of course. The only problem is that the clock can be non mononotonous sometimes and not be in sync with gettimeofday, but at least the kernel users of packet timestamps do not care. The only problem is the socket option, but it is obscure enough that I would not worry too much about it. > > > - Switch to gettimeoffset but add some correction step for the unlikely > > case that someone wants the timestamp from user space > > (would be my prefered solution) > > - Implement lockless gettimeofday like x86-64 or sparc > > (good one too, but likely slower than last) > > > ia64 seems to also have the lock. Quick fix is to just use gettimeoffset in netif_rx again. Should be fine for you. -Andi From kuznet@ms2.inr.ac.ru Tue Aug 27 06:01:14 2002 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 06:01:18 -0700 (PDT) Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RD1CtG029370 for ; Tue, 27 Aug 2002 06:01:13 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id RAA19164; Tue, 27 Aug 2002 17:06:30 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208271306.RAA19164@sex.inr.ac.ru> Subject: Re: packet re-ordering on SMP machines. 
To: ak@suse.DE (Andi Kleen)
Date: Tue, 27 Aug 2002 17:06:30 +0400 (MSD)
Cc: netdev@oss.sgi.com
In-Reply-To: <20020827142004.C4358@wotan.suse.de> from "Andi Kleen" at Aug 27, 2 04:45:00 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
X-archive-position: 28
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: kuznet@ms2.inr.ac.ru
Precedence: bulk
X-list: netdev
Content-Length: 897
Lines: 29

Hello!

> Of course. The only problem is that the clock can be non mononotonous
> sometimes and not be in sync with gettimeofday, but at least the kernel
> users of packet timestamps do not care.

What kernel users? Where did you find them? :-)

> The only problem is the socket option, but it is obscure enough that I
> would not worry too much about it.

I am very sorry, but passing the timestamp to user level is the only purpose
of timestamping, and it _MUST_ be monotonic and synchronous to the time of day;
otherwise it is completely useless.

In short, this timestamp must be synchronous to timeofday.

> > > - Implement lockless

You have talked about this for ages. :-)

Actually, the problem is solved very easily: deprecate SIOCGSTAMP,
and either count users of SO_TIMESTAMP and enable timestamping only
when it is required or, alternatively, move timestamp retrieval to the
socket level.

Alexey

From ak@suse.de Tue Aug 27 06:09:56 2002
Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 06:09:57 -0700 (PDT)
Received: from Cantor.suse.de (ns.suse.de [213.95.15.193]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RD9ttG029833 for ; Tue, 27 Aug 2002 06:09:55 -0700
Received: from Hermes.suse.de (Charybdis.suse.de [213.95.15.201]) by Cantor.suse.de (Postfix) with ESMTP id 13580145ED; Tue, 27 Aug 2002 15:13:18 +0200 (MEST)
Date: Tue, 27 Aug 2002 15:13:17 +0200
From: Andi Kleen
To: kuznet@ms2.inr.ac.ru
Cc: Andi Kleen , netdev@oss.sgi.com
Subject: Re: packet re-ordering on SMP machines.
Message-ID: <20020827151317.A3389@wotan.suse.de> References: <20020827142004.C4358@wotan.suse.de> <200208271306.RAA19164@sex.inr.ac.ru> Mime-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200208271306.RAA19164@sex.inr.ac.ru> User-Agent: Mutt/1.3.22.1i Content-Transfer-Encoding: 8bit X-archive-position: 29 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ak@suse.de Precedence: bulk X-list: netdev Content-Length: 1678 Lines: 50 On Tue, Aug 27, 2002 at 05:06:30PM +0400, A.N.Kuznetsov wrote: > Hello! > > > Of course. The only problem is that the clock can be non mononotonous > > sometimes and not be in sync with gettimeofday, but at least the kernel > > users of packet timestamps do not care. > > What kernel users? Where did you find them? :-) Hmm, I thought TCP used it, but it seems to use jiffies directly. Ok, no kernel users then. Not sure about sunrpc and out of tree stuff like SCTP. > > The only problem is the socket option, but it is obscure enough that I > > would not worry too much about it. > > I am very sorry, but passing timestamp to user level is the only purpose > of timestamping and it _MUST_ be monotonic and synchronous to time of day, > otherwise it is completely useless. That make monotonous step doesn't need to be in netif_rx. My old proposal was to move it to socket layer. Then it would be only done when needed. Unfortunately it could get somewhat inaccurate when the queueing delay is too long. > > > > - Implement lockless > > You talk about this for ages. :-) It is nearly there for x86-64 ;) (code is in for vsyscalls, just kernel do_gettimeofday doesn't use it yet) > > > Actually, the problem is solved very easily. Deprecate SIOCGSTAMP, > and either count users of SO_TIMESTAMP and enable timestamping only > when it is required, or, alternatively, to move retrirval timestamp to socket > level. 
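The option quoted above, counting the users of SO_TIMESTAMP and stamping packets only while that count is non-zero, can be modeled in user space roughly as follows. This is a sketch under stated assumptions, not kernel code: `timestamp_users`, `timestamp_enable`, `timestamp_disable`, `rx_packet` and `struct pkt` are hypothetical names, `gettimeofday()` stands in for the expensive in-kernel clock read, and a real implementation would need an atomic counter.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical model of refcount-gated packet timestamping. */
static int timestamp_users;        /* sockets that asked for timestamps */
static unsigned long clock_reads;  /* how often the expensive clock was read */

struct pkt {
    struct timeval stamp;          /* all zero means "not stamped" */
};

/* Called when a socket turns timestamping on or off. */
static void timestamp_enable(void)
{
    timestamp_users++;
}

static void timestamp_disable(void)
{
    if (timestamp_users > 0)
        timestamp_users--;
}

/* Receive path: pay for the clock read only if someone wants the result. */
static void rx_packet(struct pkt *p)
{
    if (timestamp_users > 0) {
        gettimeofday(&p->stamp, NULL);
        clock_reads++;
    } else {
        p->stamp.tv_sec = 0;
        p->stamp.tv_usec = 0;
    }
}
```

In this model a machine with no timestamp consumers never touches the clock in the receive path, while packets received during the window in which a consumer is registered are stamped at arrival.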
Moving it later may make it useless for RTT purposes when the queueing delays are too long. But if no kernel users exist then just making it a global refcnt could work nicely. Then most people would not eat the overhead when count == 0. -Andi From manand@us.ibm.com Tue Aug 27 06:15:29 2002 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 06:15:31 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.133]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RDFTtG030223 for ; Tue, 27 Aug 2002 06:15:29 -0700 Received: from westrelay02.boulder.ibm.com (westrelay02.boulder.ibm.com [9.17.194.23]) by e35.co.us.ibm.com (8.12.2/8.12.2) with ESMTP id g7RDIZnO076726; Tue, 27 Aug 2002 09:18:36 -0400 Received: from d03nm123.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by westrelay02.boulder.ibm.com (8.12.3/NCO/VER6.4) with ESMTP id g7RDIYE3085602; Tue, 27 Aug 2002 07:18:34 -0600 Subject: Re: [Lse-tech] Re: (RFC): SKB Initialization To: jamal Cc: Bill Hartner , davem@redhat.com, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, Robert Olsson X-Mailer: Lotus Notes Release 5.0.7 March 21, 2001 Message-ID: From: "Mala Anand" Date: Tue, 27 Aug 2002 08:18:36 -0500 X-MIMETrack: Serialize by Router on D03NM123/03/M/IBM(Release 5.0.10 |March 22, 2002) at 08/27/2002 07:18:34 AM MIME-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit X-archive-position: 30 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: manand@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 1843 Lines: 54 Jamal wrote .. >On Mon, 26 Aug 2002, Mala Anand wrote: >> Troy Wilson (who works with me) posted SPECweb99 results using my >> skbinit patch to lkml on Friday: >> http://www.uwsg.iu.edu/hypermail/linux/kernel/0208.2/1470.html >> I know you don't subscribe to lkml. Have you seen these results? >> On Numa machine it showed around 3% improvement using SPECweb99. 
>The posting you pointed to says 1% - not that it matters. It becomes more >insignificant when skb recycling comes in play mostly because the alloc >and freeing of skbs doesnt really show up as hotlist item within >the profile. >I am not saying it is totaly useless -- anything that will save a few >cycles is good; SPECweb99 profile shows that __kfree_skb is in the top 5 hot routines. We will test the skb recycle patch on SPECweb99 and add skbinit patch to that and see how it helps. What I understand is that the skb recycle patch does not attempt to recycle if the skbs are allocated on CPU and freed on another CPU. Is that right? If so, skbinit patch will help those cases. I think this patch is pretty safe and I anticipate greater gains on NUMA systems. BTW the 3% gain that I reported earlier on NUMA is done at another site of IBM and it turned out to be it is not a NUMA machine. It is also an 8-way SMP machine, however those are non-complaint SPECWeb99 runs so I won't be able to use those results. The alloc and free routines are not hot in netperf3 profiles. However I am seeing some gains there also, not significant. I will post netperf3 results with skbinit patch later. Regards, Mala Mala Anand IBM Linux Technology Center - Kernel Performance E-mail:manand@us.ibm.com http://www-124.ibm.com/developerworks/opensource/linuxperf http://www-124.ibm.com/developerworks/projects/linuxperf Phone:838-8088; Tie-line:678-8088 From kuznet@ms2.inr.ac.ru Tue Aug 27 06:19:01 2002 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 06:19:02 -0700 (PDT) Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RDJ0tG030601 for ; Tue, 27 Aug 2002 06:19:00 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id RAA19244; Tue, 27 Aug 2002 17:24:24 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200208271324.RAA19244@sex.inr.ac.ru> Subject: Re: packet re-ordering on SMP machines. 
To: ak@suse.de (Andi Kleen)
Date: Tue, 27 Aug 2002 17:24:24 +0400 (MSD)
Cc: ak@suse.de, netdev@oss.sgi.com
In-Reply-To: <20020827151317.A3389@wotan.suse.de> from "Andi Kleen" at Aug 27, 2 03:13:17 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
X-archive-position: 31
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: kuznet@ms2.inr.ac.ru
Precedence: bulk
X-list: netdev
Content-Length: 370
Lines: 12

Hello!

> Moving it later may make it useless for RTT purposes when the queueing
> delays are too long.

Absolutely wrong. RTT is always calculated end-to-end; otherwise it is some
meaningless quantity, be it sctp, rpc or something.

The only place where the precision of the timestamp is more or less
interesting is tcpdump. But not enough to make it not monotonic. :-)

Alexey

From mcmohd@rediffmail.com Tue Aug 27 06:56:19 2002
Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 06:56:20 -0700 (PDT)
Received: from webmail15.rediffmail.com (webmail15.rediffmail.com [203.199.83.25] (may be forged)) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RDuGtG001553 for ; Tue, 27 Aug 2002 06:56:17 -0700
Received: (qmail 17674 invoked by uid 510); 27 Aug 2002 13:57:27 -0000
Date: 27 Aug 2002 13:57:27 -0000
Message-ID: <20020827135727.17673.qmail@webmail15.rediffmail.com>
Received: from unknown (12.40.51.195) by rediffmail.com via HTTP; 27 aug 2002 13:57:27 -0000
MIME-Version: 1.0
From: "Mohd. Mohtashim"
Reply-To: "Mohd. Mohtashim"
To: "Andi Kleen"
Cc: "Xiaoliang\(David\)Wei" , "Ben Greear" , "Cheng Jin" , "Cheng Hu" , "Steven Low" , netdev@oss.sgi.com, "jamal"
Subject: Re: Re: packet re-ordering on SMP machines.
Content-type: text/plain Content-Disposition: inline Content-Transfer-Encoding: 8bit X-archive-position: 32 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mcmohd@rediffmail.com Precedence: bulk X-list: netdev Content-Length: 1486 Lines: 67 On Tue, 27 Aug 2002 Andi Kleen wrote : > >On Tue, Aug 27, 2002 at 08:05:04AM -0400, jamal wrote: > > > > > > > > On Tue, 27 Aug 2002, Andi Kleen wrote: > > > > > > > > That is because of the lock it takes. Locks are always >slow. > > > > xtime_lock? > >Yes. > >It also has some other overhead. > > > > > > > > > Older kernels used gettimeoffset which ran without lock, but >that was > > > changed because in some very obscure cases it could cause >non monotonous > > > timestamps when the user turns on timestamp receiving to >user space > > > (kernel protocols do not care) > > > > > > Possibilities: > > > > > > - Ignore the problem and switch back to gettimeoffset >again > > > > Is it safe to call gettimeoffset without the lock? > > >Of course. The only problem is that the clock can be non >mononotonous >sometimes and not be in sync with gettimeofday, but at least the >kernel >users of packet timestamps do not care. >The only problem is the socket option, but it is obscure enough >that I >would not worry too much about it. > > > > > - Switch to gettimeoffset but add some correction step for >the unlikely > > > case that someone wants the timestamp from user space > > > (would be my prefered solution) > > > - Implement lockless gettimeofday like x86-64 or sparc > > > (good one too, but likely slower than last) > > > > > > ia64 seems to also have the lock. > >Quick fix is to just use gettimeoffset in netif_rx again. >Should >be fine for you. 
> >-Andi > From hadi@cyberus.ca Tue Aug 27 08:52:50 2002 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 08:52:53 -0700 (PDT) Received: from cyberus.ca (mail.cyberus.ca [216.191.240.111]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RFqmtG004618 for ; Tue, 27 Aug 2002 08:52:49 -0700 Received: from shell.cyberus.ca (shell [216.191.240.114]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id LAA25966; Tue, 27 Aug 2002 11:56:13 -0400 (EDT) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.11.6+Sun/8.11.6) with ESMTP id g7RFnTJ10635; Tue, 27 Aug 2002 11:49:29 -0400 (EDT) X-Authentication-Warning: shell.cyberus.ca: hadi owned process doing -bs Date: Tue, 27 Aug 2002 11:49:29 -0400 (EDT) From: jamal To: Mala Anand cc: Bill Hartner , , , , Robert Olsson Subject: Re: [Lse-tech] Re: (RFC): SKB Initialization In-Reply-To: Message-ID: MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit X-archive-position: 33 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Content-Length: 543 Lines: 18 On Tue, 27 Aug 2002, Mala Anand wrote: > SPECweb99 profile shows that __kfree_skb is in the top 5 hot routines. We > will test the skb recycle patch on SPECweb99 and add skbinit patch > to that and see how it helps. What I understand is that the skb recycle > patch does not attempt to recycle if the skbs are allocated on CPU > and freed on another CPU. Is that right? If so, skbinit patch will help > those cases. yes it will. Not significant is my current thinking. i.e i wouldnt write my mother to tell her about it. 
cheers, jamal From chengjin@cs.caltech.edu Tue Aug 27 10:18:45 2002 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 10:18:47 -0700 (PDT) Received: from swordfish.cs.caltech.edu (swordfish.cs.caltech.edu [131.215.44.124]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RHIjtG007031 for ; Tue, 27 Aug 2002 10:18:45 -0700 Received: from orchestra.cs.caltech.edu (orchestra.cs.caltech.edu [131.215.44.20]) by swordfish.cs.caltech.edu (Postfix) with ESMTP id 92BCBDF274; Tue, 27 Aug 2002 10:22:13 -0700 (PDT) Received: from localhost (chengjin@localhost) by orchestra.cs.caltech.edu (8.11.6/8.9.3) with ESMTP id g7RHMDF27878; Tue, 27 Aug 2002 10:22:13 -0700 X-Authentication-Warning: orchestra.cs.caltech.edu: chengjin owned process doing -bs Date: Tue, 27 Aug 2002 10:22:13 -0700 (PDT) From: Cheng Jin To: Andi Kleen Cc: jamal , "Xiaoliang (David) Wei" , Ben Greear , Cheng Hu , Steven Low , "netdev@oss.sgi.com" Subject: Re: packet re-ordering on SMP machines. In-Reply-To: <20020827142004.C4358@wotan.suse.de> Message-ID: MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit X-archive-position: 34 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: chengjin@cs.caltech.edu Precedence: bulk X-list: netdev Content-Length: 805 Lines: 22 Hi, Andi, > Quick fix is to just use gettimeoffset in netif_rx again. Should > be fine for you. There doesn't appear to be a function called gettimeoffset in 2.4.18 anymore. The closest I found was do_fast_gettimeoffset in "arch/i386/kernel/time.c" This appears to be the unlocked version that you are referring to, except I can't tell why the higher 32 bits (edx) of the timestamp isn't used. (maybe the asm code takes care of it, but it seems that the result is stored in edx so) What you said about a light-weight gettime function makes sense. 
For our purpose of timing RTTs, any gettime function with a resolution finer
than 1 ms will probably be enough. The time doesn't need to be exactly in sync
with the one obtained from the locking version of the gettime function.

Thanks,

Cheng

From ak@suse.de Tue Aug 27 10:29:45 2002
Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 10:29:46 -0700 (PDT)
Received: from Cantor.suse.de (ns.suse.de [213.95.15.193]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RHTitG008865 for ; Tue, 27 Aug 2002 10:29:44 -0700
Received: from Hermes.suse.de (Charybdis.suse.de [213.95.15.201]) by Cantor.suse.de (Postfix) with ESMTP id CEBCF14884; Tue, 27 Aug 2002 19:33:07 +0200 (MEST)
Date: Tue, 27 Aug 2002 19:33:03 +0200
From: Andi Kleen
To: Cheng Jin
Cc: Andi Kleen , jamal , "Xiaoliang (David) Wei" , Ben Greear , Cheng Hu , Steven Low , "netdev@oss.sgi.com"
Subject: Re: packet re-ordering on SMP machines.
Message-ID: <20020827193303.B4971@wotan.suse.de>
References: <20020827142004.C4358@wotan.suse.de>
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.3.22.1i
Content-Transfer-Encoding: 8bit
X-archive-position: 35
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: ak@suse.de
Precedence: bulk
X-list: netdev
Content-Length: 996
Lines: 27

On Tue, Aug 27, 2002 at 10:22:13AM -0700, Cheng Jin wrote:
> Hi, Andi,
>
> > Quick fix is to just use gettimeoffset in netif_rx again. Should
> > be fine for you.
>
> There doesn't appear to be a function called gettimeoffset in 2.4.18
> anymore. The closest I found was do_fast_gettimeoffset in
> "arch/i386/kernel/time.c" This appears to be the unlocked version that

Yes, I mean do_fast_gettimeoffset.

> you are referring to, except I can't tell why the higher 32 bits (edx) of
> the timestamp isn't used.
(maybe the asm code takes care of it, but it seems > that the result is stored in edx so) 32bit precision are probably enough for this. > > What you said about a light-weight gettime function makes sense. For our > purpose of timing RTTs, any gettime function with a resolution higher than > 1 ms will probably be enough. The time doesn't need to be in exactly in sync > with the one obtained from the locking version of the gettime function. TSC should be fine then. -Andi From weixl@caltech.edu Tue Aug 27 12:40:07 2002 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 12:40:08 -0700 (PDT) Received: from vestibule.its.caltech.edu (vestibule.its.caltech.edu [131.215.48.17]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7RJe7tG015629 for ; Tue, 27 Aug 2002 12:40:07 -0700 Received: from weixl (charter-DHCP-122.caltech.edu [131.215.186.122]) by vestibule.its.caltech.edu (8.12.3/8.12.3) with ESMTP id g7RJhTdK009442; Tue, 27 Aug 2002 12:43:30 -0700 (PDT) Message-ID: <002701c24e01$f59b6190$8900a8c0@weixl> From: "Xiaoliang \(David\) Wei" To: "jamal" , "Andi Kleen" Cc: "Ben Greear" , "Cheng Jin" , "Cheng Hu" , "Steven Low" , References: Subject: Re: packet re-ordering on SMP machines. Date: Tue, 27 Aug 2002 12:43:02 -0700 MIME-Version: 1.0 Content-type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2600.0000 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000 X-archive-position: 36 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: weixl@caltech.edu Precedence: bulk X-list: netdev Content-Length: 702 Lines: 33 > > > > That is because of the lock it takes. Locks are always slow. > > xtime_lock? I guess so, after looked at do_gettimeofday > > > Possibilities: > > > > - Ignore the problem and switch back to gettimeoffset again > > Is it safe to call gettimeoffset without the lock? 
What's the possible danger to ignore the lock? Can I read the xtime directly? > > > - Switch to gettimeoffset but add some correction step for the unlikely > > case that someone wants the timestamp from user space > > (would be my prefered solution) > > - Implement lockless gettimeofday like x86-64 or sparc > > (good one too, but likely slower than last) > > > ia64 seems to also have the lock. > > cheers, > jamal > > > > From greearb@candelatech.com Tue Aug 27 23:22:54 2002 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Aug 2002 23:22:56 -0700 (PDT) Received: from grok.yi.org (IDENT:pHRxtV1fB26mQhYhEc1Y7IDb6JqIiTcc@dhcp101-dsl-usw4.w-link.net [208.161.125.101]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7S6MrtG010760 for ; Tue, 27 Aug 2002 23:22:54 -0700 Received: from candelatech.com (IDENT:QvRaSi3CLqlbb5NJpp+hxesaRtNYFpB/@localhost.localdomain [127.0.0.1]) by grok.yi.org (8.11.6/8.11.2) with ESMTP id g7S6PwT15912; Tue, 27 Aug 2002 23:25:58 -0700 Message-ID: <3D6C6CF6.9040002@candelatech.com> Date: Tue, 27 Aug 2002 23:25:58 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1b) Gecko/20020722 X-Accept-Language: en-us, en MIME-Version: 1.0 To: jamal CC: "Bloch, Jack" , "'netdev@oss.sgi.com'" Subject: Re: IP stack question (how to force pkts to not route locally, but go out interfaces regardless of destination) References: Content-type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit X-archive-position: 38 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Content-Length: 1593 Lines: 48 jamal wrote: > Ben Wrote: >>I would actually like to be able to force a machine to not do local >>routing as well, and force packets out over an interface even if >>the destination is a local IP, using source-based-routing, >>or something similar. 
There is no way to do this currently?
>>
>
>
> Try that SO_DONTROUTE and see if solves your problem; you probably have to
> bind the socket to a specific device as well;
> For all that trouble, i would suggest you may just as well write a sock
> packet based app.

I am back to trying to figure out how to make this work. I've tried source
based routing, and it does not work (it routes internally).

I read the help on SO_DONTROUTE, but it seems to make the kernel unable
to send to a router. I would like to be able to route, i.e. port a -> router -> port b

I already bind to a particular port and IP, and use policy based routing
(source based routing) to make sure the packet is sent out the correct
local interface.

I just need to find the routing logic that notices the destination IP is local
and tell it to quit looking (probably for a particular socket, as I can
see how this could break applications that didn't expect it).

I dug through the code once before looking for this, and didn't find
what I needed. Can anyone suggest the right files and/or methods to
look in?
Thanks, Ben > > cheers, > jamal > -- Ben Greear President of Candela Technologies Inc http://www.candelatech.com ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear From greearb@candelatech.com Wed Aug 28 00:29:54 2002 Received: with ECARTIS (v1.0.0; list netdev); Wed, 28 Aug 2002 00:29:56 -0700 (PDT) Received: from grok.yi.org (IDENT:EgrYHmByOeKRm+USTUJlyNgpU1IqYPbn@dhcp101-dsl-usw4.w-link.net [208.161.125.101]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7S7TrtG023155 for ; Wed, 28 Aug 2002 00:29:53 -0700 Received: from candelatech.com (IDENT:LnPv/H3XEnytSDaJwXOlXW9MFsiQIdsy@localhost.localdomain [127.0.0.1]) by grok.yi.org (8.11.6/8.11.2) with ESMTP id g7S7XNT24893 for ; Wed, 28 Aug 2002 00:33:24 -0700 Message-ID: <3D6C7CC3.5040105@candelatech.com> Date: Wed, 28 Aug 2002 00:33:23 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1b) Gecko/20020722 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "'netdev@oss.sgi.com'" Subject: [Fwd: dev_get_by_index is not hashed: O(n)] Content-type: text/plain Content-Transfer-Encoding: 8bit X-archive-position: 39 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Content-Length: 1360 Lines: 42 -- Ben Greear President of Candela Technologies Inc http://www.candelatech.com ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear -- Attached file included as plaintext by Ecartis -- -- File: dev_get_by_index is not hashed: O(n) Message-ID: <3D6C781B.8010506@candelatech.com> Date: Wed, 28 Aug 2002 00:13:31 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1b) Gecko/20020722 X-Accept-Language: en-us, en MIME-Version: 1.0 To: linux-net Subject: dev_get_by_index is not hashed: O(n) Content-Type: text/plain; charset=us-ascii; format=flowed 
Content-Transfer-Encoding: 7bit I just noticed that dev_get_by_index is not hashed, it is just a linear search through all interfaces. This call is made in many places in the kernel, and particularly in route.c For machines with large numbers of interfaces (VLANs for instance), this could be a real performance drag. Any reason we don't keep interfaces in a hash-table by index? Ben -- Ben Greear President of Candela Technologies Inc http://www.candelatech.com ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear From jleu@nero.doit.wisc.edu Wed Aug 28 04:44:54 2002 Received: with ECARTIS (v1.0.0; list netdev); Wed, 28 Aug 2002 04:44:59 -0700 (PDT) Received: from nero.doit.wisc.edu (IDENT:root@nero.doit.wisc.edu [128.104.17.130]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7SBirtG002757 for ; Wed, 28 Aug 2002 04:44:54 -0700 Received: (from jleu@localhost) by nero.doit.wisc.edu (8.11.6/8.11.6) id g7SCfD113886; Wed, 28 Aug 2002 07:41:13 -0500 Date: Wed, 28 Aug 2002 07:41:12 -0500 From: "James R. 
Leu" To: Ben Greear Cc: jamal , "Bloch, Jack" , "'netdev@oss.sgi.com'" Subject: Re: IP stack question (how to force pkts to not route locally, but go out interfaces regardless of destination) Message-ID: <20020828074112.A13868@nero.doit.wisc.edu> Reply-To: jleu@mindspring.com References: <3D6C6CF6.9040002@candelatech.com> Mime-Version: 1.0 Content-type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0.1i In-Reply-To: <3D6C6CF6.9040002@candelatech.com>; from greearb@candelatech.com on Tue, Aug 27, 2002 at 11:25:58PM -0700 Organization: none Content-Transfer-Encoding: 8bit X-archive-position: 40 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jleu@mindspring.com Precedence: bulk X-list: netdev Content-Length: 1985 Lines: 61 Hello, How about changing the preference of the 'local' table (kernel change required) and inserting a new table (via iptables) that had rules like "if src is local, and dest = 192.168.1.1 then send out eth0" Jim On Tue, Aug 27, 2002 at 11:25:58PM -0700, Ben Greear wrote: > > jamal wrote: > > Ben Wrote: > >>I would actually like to be able to force a machine to not do local > >>routing as well, and force packets out over an interface even if > >>the destination is a local IP, using source-based-routing, > >>or something similar. There is no way to do this currently? > >> > > > > > > Try that SO_DONTROUTE and see if solves your problem; you probably have to > > bind the socket to a specific device as well; > > For all that trouble, i would suggest you may just as well write a sock > > packet based app. > > I am back to trying to figure out how to make this work. I'm tried source > based routing, and it does not work (it routes internally). > > I read the help on SO_DONTROUTE, but it seems to make the kernel not able > to send to a router. 
I would like to be able to route, ie port a -> router -> port b > > I already bind to a particular port and IP, and use policy based routing > (source based routing) to make sure the packet is sent out the correct > local interface. > > I just need to find the routing logic that notices the destination IP is local > and tell it to quit looking (probably for a particular socket, as I can > see how this could break applications who didn't expect it) > > I dug through the code once before looking for this, and didn't find > what I needed. Can anyone suggest the right files and/or methods to > look in? > > Thanks, > Ben > > > > > cheers, > > jamal > > > > > -- > Ben Greear > President of Candela Technologies Inc http://www.candelatech.com > ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear > > -- James R. Leu From lunz@falooley.org Wed Aug 28 11:52:39 2002 Received: with ECARTIS (v1.0.0; list netdev); Wed, 28 Aug 2002 11:52:43 -0700 (PDT) Received: from crown.reflexsecurity.com (dsl-65-188-226-101.telocity.com [65.188.226.101]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7SIqatG025949 for ; Wed, 28 Aug 2002 11:52:37 -0700 Received: from stoli.localnet ([192.168.0.106]) by crown.reflexsecurity.com with smtp (Exim 3.35 #1 (Debian)) id 17k7zh-0004Jw-00; Wed, 28 Aug 2002 14:56:45 -0400 Received: by stoli.localnet (sSMTP sendmail emulation); Wed, 28 Aug 2002 14:56:12 -0400 Date: Wed, 28 Aug 2002 14:56:12 -0400 From: Jason Lunz To: netdev@oss.sgi.com Cc: jgarzik@mandrakesoft.com, becker@scyld.com Subject: [PATCH] 2.4.20-pre sundance.c cleanups Message-ID: <20020828185612.GA14342@reflexsecurity.com> Mime-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i Content-Transfer-Encoding: 8bit X-archive-position: 41 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: lunz@falooley.org Precedence: bulk X-list: netdev Content-Length: 24420 
Lines: 705 I've had longstanding problems using the sundance.c driver on a D-Link 580-TX 4-port ethercard. I tried both Becker's driver and the in-kernel one; both behave similarly, but only Becker's driver detects the card properly. The included patch merges most of the important changes from Becker's current 1.09 driver into the main kernel. I have some questions, though, about the following issues: On Thu, 11 Jul 2002, Jeff Garzik wrote: > Donald Becker wrote: > > The mdelay(300) is completely bogus. While that is a typical period > > for autonegotiation to complete on a short link, the spec says that > > it can take up to 3 seconds, 10X longer, to complete > > autonegotiation. Given that the driver must be able to handle a > > longer autonegotiation period and no link beat, why call mdelay() at > > all? > > Ouch. You are absolutely right, and I take the blame for not > reviewing more closely. That's what I get for trusting vendors too > much ;-) [D-Link has been the one patching sundance and dl2k for a > while now] This mdelay has been removed, but... > I've been meaning to go through several drivers and fix up the stupid > assumptions they make about autonegotiation completion time. There are > a couple other drivers that do somewhat the same thing, though with a > different [if equally silly] implementation. > > And finally, most drivers need to be updated to follow the logic: call > netif_carrier_off(). Wait for autoneg complete and link OK, before > calling netif_carrier_on(). ...nothing like this has been done. I added calls to netif_carrier_on/off in response to LinkChange interrupts, but I don't know if that's even the right thing to do. I'm sure that needs more cleanup. On Thu, 11 Jul 2002, Donald Becker wrote: > The DLink modified driver is still missing a few important changes for > the new chip version in MMIO mode. For instance, ASICCtrl apparently > must now be read and written as a 32 bit word, while the older chip > worked with writew(). 
This is done. There are many other instances, where one driver does readb() while the other does readw() for example. I have no idea whether that makes any difference. Finally, when testing earlier versions of Becker's driver, it was discovered that IntrRxDone is completely broken on my card. His current driver and this patch therefore ignore it completely for all cards; is that the right thing to do for older cards? Should that only be done on the broken cards, or is it unimportant? also, what about the version numbering? I just incremented the kernel version from 1.01b to 1.01c, but we also have Becker's 1.09 and it's rumored that d-link has 1.02 and 1.03, but i've been unable to find them. Jason --- sundance.c Tue Aug 27 15:30:08 2002 +++ sundance-merge.c Wed Aug 28 13:17:27 2002 @@ -1,6 +1,6 @@ /* sundance.c: A Linux device driver for the Sundance ST201 "Alta". */ /* - Written 1999-2000 by Donald Becker. + Written 1999-2002 by Donald Becker. This software may be used and distributed according to the terms of the GNU General Public License (GPL), incorporated herein by reference. @@ -23,13 +23,21 @@ Version 1.01b (D-Link): - Add new board to PCI ID list - + + Version 1.01c (Jason Lunz): + - merged changes from Donald Becker's sundance.c v1.09: + . use IO ops by default (needed for D-Link 580TX) + . autodetect need for mii_preamble_required + . add per-adapter mtu change support + . update driver status in SIOCSMIIREG ioctl + . ignore IntrRxDone (buggy on some chipsets) + - minor cleanups */ #define DRV_NAME "sundance" -#define DRV_VERSION "1.01b" -#define DRV_RELDATE "17-Jan-2002" +#define DRV_VERSION "1.01c" +#define DRV_RELDATE "27-Aug-2002" /* The user-configurable values. @@ -37,7 +45,6 @@ static int debug = 1; /* 1 normal messages, 0 quiet .. 7 verbose. */ /* Maximum events (Rx packets, etc.) to handle at each interrupt. */ static int max_interrupt_work = 30; -static int mtu; /* Maximum number of multicast addresses to filter (vs. rx-all-multicast). 
Typical is a 64 element hash table based on the Ethernet CRC. */ static int multicast_filter_limit = 32; @@ -48,6 +55,7 @@ need a copy-align. */ static int rx_copybreak; +#define MAX_UNITS 8 /* More are supported, limit only on options */ /* media[] specifies the media type the NIC operates at. autosense Autosensing active media. 10mbps_hd 10Mbps half duplex. @@ -60,15 +68,31 @@ 3 100Mbps half duplex. 4 100Mbps full duplex. */ -#define MAX_UNITS 8 static char *media[MAX_UNITS]; +/* Used to pass the media type, etc. + Both 'options[]' and 'full_duplex[]' should exist for driver + interoperability. + The media type is usually passed in 'options[]'. + The default is autonegotation for speed and duplex. + This should rarely be overridden. + Use option values 0x10/0x20 for 10Mbps, 0x100,0x200 for 100Mbps. + Use option values 0x10 and 0x100 for forcing half duplex fixed speed. + Use option values 0x20 and 0x200 for forcing full duplex operation. +*/ +static int options[MAX_UNITS] = {-1, -1, -1, -1, -1, -1, -1, -1}; + +/* Set iff a MII transceiver on any interface requires mdio preamble. + This only set with older tranceivers, so the extra + code size of a per-interface flag is not worthwhile. */ +static int mii_preamble_required = 0; + /* Operational parameters that are set at compile time. */ /* Keep the ring sizes a power of two for compile efficiency. The compiler will convert '%'<2^N> into a bit mask. Making the Tx ring too large decreases the effectiveness of channel - bonding and packet priority, and more than 128 requires modifying the - Tx error recovery. + bonding and packet priority, and more than 31 requires modifying the + Tx status handling error recovery. Large receive rings merely waste memory. */ #define TX_RING_SIZE 16 #define TX_QUEUE_LEN 10 /* Limit ring entries actually used. 
*/ @@ -125,23 +149,31 @@ MODULE_LICENSE("GPL"); MODULE_PARM(max_interrupt_work, "i"); -MODULE_PARM(mtu, "i"); MODULE_PARM(debug, "i"); MODULE_PARM(rx_copybreak, "i"); MODULE_PARM(media, "1-" __MODULE_STRING(MAX_UNITS) "s"); -MODULE_PARM_DESC(max_interrupt_work, "Sundance Alta maximum events handled per interrupt"); -MODULE_PARM_DESC(mtu, "Sundance Alta MTU (all boards)"); +MODULE_PARM(options, "1-" __MODULE_STRING(MAX_UNITS) "i"); +MODULE_PARM(multicast_filter_limit, "i"); +MODULE_PARM(mii_preamble_required, "i"); MODULE_PARM_DESC(debug, "Sundance Alta debug level (0-5)"); +MODULE_PARM_DESC(media, "Sundance Alta force fixed speed+duplex"); +MODULE_PARM_DESC(options, "Sundance Alta force transceiver type or fixed speed+duplex"); +MODULE_PARM_DESC(max_interrupt_work, "Sundance Alta maximum events handled per interrupt"); MODULE_PARM_DESC(rx_copybreak, "Sundance Alta copy breakpoint for copy-only-tiny-frames"); +MODULE_PARM_DESC(multicast_filter_limit, "Sundance Alta multicast addresses before switching to Rx-all-multicast"); +MODULE_PARM_DESC(mii_preamble_required, "Sundance Alta force sending a preamble before MII management transactions"); /* Theory of Operation I. Board Compatibility This driver is designed for the Sundance Technologies "Alta" ST201 chip. +The Kendin KS8723 is the same design with an integrated transceiver. II. Board-specific settings +This is an all-in-one chip, so there are no board-specific settings. + III. Driver operation IIIa. Ring buffers @@ -200,8 +232,9 @@ IVb. References The Sundance ST201 datasheet, preliminary version. -http://cesdis.gsfc.nasa.gov/linux/misc/100mbps.html -http://cesdis.gsfc.nasa.gov/linux/misc/NWay.html +The Kendin KS8723 datasheet, preliminary version. +http://www.scyld.com/expert/100mbps.html +http://www.scyld.com/expert/NWay.html IVc. Errata @@ -209,6 +242,11 @@ +/* Work-around for Kendin chip bugs. 
*/ +#ifndef USE_MEM_OPS +#define USE_IO_OPS 1 +#endif + enum pci_id_flags_bits { /* Set PCI command register bits before calling probe1(). */ PCI_USES_IO=1, PCI_USES_MEM=2, PCI_USES_MASTER=4, @@ -311,7 +349,7 @@ MACCtrl0 = 0x50, MACCtrl1 = 0x52, StationAddr = 0x54, - MaxTxSize = 0x5A, + MaxFrameSize = 0x5A, RxMode = 0x5c, MIICtrl = 0x5e, MulticastFilter0 = 0x60, @@ -402,7 +440,6 @@ int chip_id, drv_flags; unsigned int cur_rx, dirty_rx; /* Producer/consumer ring indices */ unsigned int rx_buf_sz; /* Based on MTU+slack. */ - spinlock_t txlock; /* Group with Tx control cache line. */ struct netdev_desc *last_tx; /* Last Tx descriptor used. */ unsigned int cur_tx, dirty_tx; unsigned int tx_full:1; /* The Tx queue is full. */ @@ -410,13 +447,12 @@ unsigned int full_duplex:1; /* Full-duplex operation requested. */ unsigned int medialock:1; /* Do not sense media. */ unsigned int default_port:4; /* Last dev->if_port value. */ - unsigned int an_enable:1; - unsigned int speed; /* Multicast and receive mode. */ spinlock_t mcastlock; /* SMP lock multicast updates. */ u16 mcast_filter[4]; /* MII transceiver section. */ int mii_cnt; /* MII device addresses. */ + int link_status; u16 advertising; /* NWay media advertisement */ unsigned char phys[MII_CNT]; /* MII device addresses, only first one used. */ struct pci_dev *pci_dev; @@ -425,6 +461,7 @@ /* The station address location in the EEPROM. 
*/ #define EEPROM_SA_OFFSET 0x10 +static int change_mtu(struct net_device *dev, int new_mtu); static int eeprom_read(long ioaddr, int location); static int mdio_read(struct net_device *dev, int phy_id, int location); static void mdio_write(struct net_device *dev, int phy_id, int location, int value); @@ -437,11 +474,10 @@ static void intr_handler(int irq, void *dev_instance, struct pt_regs *regs); static void netdev_error(struct net_device *dev, int intr_status); static int netdev_rx(struct net_device *dev); -static void netdev_error(struct net_device *dev, int intr_status); static void set_rx_mode(struct net_device *dev); -static struct net_device_stats *get_stats(struct net_device *dev); -static int netdev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd); +static int netdev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd); static int netdev_close(struct net_device *dev); +static struct net_device_stats *get_stats(struct net_device *dev); @@ -455,9 +491,9 @@ int irq; int i; long ioaddr; - u16 mii_ctl; void *ring_space; dma_addr_t ring_dma; + int option = card_idx < MAX_UNITS ? 
options[card_idx] : 0; /* when built into the kernel, we only print version if device is found */ @@ -524,11 +560,9 @@ dev->do_ioctl = &netdev_ioctl; dev->tx_timeout = &tx_timeout; dev->watchdog_timeo = TX_TIMEOUT; + dev->change_mtu = &change_mtu; pci_set_drvdata(pdev, dev); - if (mtu) - dev->mtu = mtu; - i = register_netdev(dev); if (i) goto err_out_unmap_rx; @@ -542,77 +576,70 @@ if (1) { int phy, phy_idx = 0; np->phys[0] = 1; /* Default setting */ + mii_preamble_required++; for (phy = 0; phy < 32 && phy_idx < MII_CNT; phy++) { int mii_status = mdio_read(dev, phy, 1); if (mii_status != 0xffff && mii_status != 0x0000) { np->phys[phy_idx++] = phy; np->advertising = mdio_read(dev, phy, 4); + if ((mii_status & 0x0040) == 0) + mii_preamble_required++; printk(KERN_INFO "%s: MII PHY found at address %d, status " "0x%4.4x advertising %4.4x.\n", dev->name, phy, mii_status, np->advertising); } } + mii_preamble_required--; np->mii_cnt = phy_idx; if (phy_idx == 0) printk(KERN_INFO "%s: No MII transceiver found!, ASIC status %x\n", dev->name, readl(ioaddr + ASICCtrl)); } + /* Parse override configuration */ - np->an_enable = 1; if (card_idx < MAX_UNITS) { if (media[card_idx] != NULL) { - np->an_enable = 0; if (strcmp (media[card_idx], "100mbps_fd") == 0 || strcmp (media[card_idx], "4") == 0) { - np->speed = 100; - np->full_duplex = 1; + option |= 0x200; } else if (strcmp (media[card_idx], "100mbps_hd") == 0 || strcmp (media[card_idx], "3") == 0) { - np->speed = 100; - np->full_duplex = 0; + option |= 0x100; } else if (strcmp (media[card_idx], "10mbps_fd") == 0 || strcmp (media[card_idx], "2") == 0) { - np->speed = 10; - np->full_duplex = 1; + option |= 0x20; } else if (strcmp (media[card_idx], "10mbps_hd") == 0 || strcmp (media[card_idx], "1") == 0) { - np->speed = 10; - np->full_duplex = 0; - } else { - np->an_enable = 1; + option |= 0x10; } } } - /* Fibre PHY? 
*/ - if (readl (ioaddr + ASICCtrl) & 0x80) { - /* Default 100Mbps Full */ - if (np->an_enable) { - np->speed = 100; - np->full_duplex = 1; - np->an_enable = 0; - } + /* Fibre PHY? Default 100Mbps Full */ + if((readl(ioaddr + ASICCtrl) & 0x80) && (0 == (option & 0x3ff))) { + option |= 0x200; } - /* Reset PHY */ - mdio_write (dev, np->phys[0], MII_BMCR, BMCR_RESET); - mdelay (300); - mdio_write (dev, np->phys[0], MII_BMCR, BMCR_ANENABLE|BMCR_ANRESTART); - /* Force media type */ - if (!np->an_enable) { - mii_ctl = 0; - mii_ctl |= (np->speed == 100) ? BMCR_SPEED100 : 0; - mii_ctl |= (np->full_duplex) ? BMCR_FULLDPLX : 0; - mdio_write (dev, np->phys[0], MII_BMCR, mii_ctl); - printk (KERN_INFO "Override speed=%d, %s duplex\n", - np->speed, np->full_duplex ? "Full" : "Half"); + /* Allow forcing the media type. */ + if (option > 0) { + if (option & 0x220) + np->full_duplex = 1; + np->default_port = option & 0x3ff; + if (np->default_port & 0x330) { + np->medialock = 1; + printk(KERN_INFO " Forcing %dMbs %s-duplex operation.\n", + (option & 0x300 ? 100 : 10), + (option & 0x220 ? "full" : "half")); + mdio_write(dev, np->phys[0], MII_BMCR, + ((option & 0x300) ? BMCR_SPEED100 : 0) | + ((option & 0x220) ? BMCR_FULLDPLX : 0)); + } } - /* Perhaps move the reset here? */ /* Reset the chip to erase previous misconfiguration. */ if (debug > 1) printk("ASIC Control is %x.\n", readl(ioaddr + ASICCtrl)); - writew(0x007f, ioaddr + ASICCtrl + 2); + writel(0x007f0000 | readl(ioaddr + ASICCtrl), ioaddr + ASICCtrl); if (debug > 1) printk("ASIC Control is now %x.\n", readl(ioaddr + ASICCtrl)); @@ -636,10 +663,21 @@ } + +static int change_mtu(struct net_device *dev, int new_mtu) +{ + if ((new_mtu < 68) || (new_mtu > 8191)) /* Set by RxDMAFrameLen */ + return -EINVAL; + if (netif_running(dev)) + return -EBUSY; + dev->mtu = new_mtu; + return 0; +} + /* Read the EEPROM and MII Management Data I/O (MDIO) interfaces. 
*/ static int __devinit eeprom_read(long ioaddr, int location) { - int boguscnt = 1000; /* Typical 190 ticks. */ + int boguscnt = 2000; /* Typical 190 ticks. */ writew(0x0200 | (location & 0xff), ioaddr + EECtrl); do { if (! (readw(ioaddr + EECtrl) & 0x8000)) { @@ -658,11 +696,6 @@ met by back-to-back 33Mhz PCI cycles. */ #define mdio_delay() readb(mdio_addr) -/* Set iff a MII transceiver on any interface requires mdio preamble. - This only set with older tranceivers, so the extra - code size of a per-interface flag is not worthwhile. */ -static const char mii_preamble_required = 1; - enum mii_reg_bits { MDIO_ShiftClk=0x0001, MDIO_Data=0x0002, MDIO_EnbOutput=0x0004, }; @@ -761,6 +794,11 @@ init_ring(dev); + if (dev->if_port == 0) + dev->if_port = np->default_port; + + np->mcastlock = (spinlock_t) SPIN_LOCK_UNLOCKED; + writel(np->rx_ring_dma, ioaddr + RxListPtr); /* The Tx list pointer is written as packets are queued. */ @@ -768,26 +806,27 @@ writeb(dev->dev_addr[i], ioaddr + StationAddr + i); /* Initialize other registers. */ - /* Configure the PCI bus bursts and FIFO thresholds. */ - - if (dev->if_port == 0) - dev->if_port = np->default_port; - - np->mcastlock = (spinlock_t) SPIN_LOCK_UNLOCKED; + np->link_status = readb(ioaddr + MIICtrl) & 0xE0; + writew((np->full_duplex || (np->link_status & 0x20)) ? 0x120 : 0, + ioaddr + MACCtrl0); + writew(dev->mtu + 14, ioaddr + MaxFrameSize); + if (dev->mtu > 2047) + writel(readl(ioaddr + ASICCtrl) | 0x0C, ioaddr + ASICCtrl); + /* Configure the PCI bus bursts and FIFO thresholds. */ set_rx_mode(dev); writew(0, ioaddr + IntrEnable); writew(0, ioaddr + DownCounter); /* Set the chip to poll every N*320nsec. */ writeb(100, ioaddr + RxDescPoll); writeb(127, ioaddr + TxDescPoll); - netif_start_queue(dev); /* Enable interrupts by setting the interrupt mask. 
*/ - writew(IntrRxDone | IntrRxDMADone | IntrPCIErr | IntrDrvRqst | IntrTxDone + writew(IntrRxDMADone | IntrPCIErr | IntrDrvRqst | IntrTxDone | StatsMax | LinkChange, ioaddr + IntrEnable); writew(StatsEnable | RxEnable | TxEnable, ioaddr + MACCtrl1); + netif_start_queue(dev); if (debug > 2) printk(KERN_DEBUG "%s: Done netdev_open(), status: Rx %x Tx %x " @@ -815,7 +854,7 @@ int duplex; /* Force media */ - if (!np->an_enable || mii_lpa == 0xffff) { + if (np->medialock || mii_lpa == 0xffff) { if (np->full_duplex) writew (readw (ioaddr + MACCtrl0) | EnbFullDuplex, ioaddr + MACCtrl0); @@ -875,7 +914,7 @@ /* Stop and restart the chip's Tx processes . */ /* Trigger an immediate transmit demand. */ - writew(IntrRxDone | IntrRxDMADone | IntrPCIErr | IntrDrvRqst | IntrTxDone + writew(IntrRxDMADone | IntrPCIErr | IntrDrvRqst | IntrTxDone | StatsMax | LinkChange, ioaddr + IntrEnable); dev->trans_start = jiffies; @@ -896,7 +935,7 @@ np->cur_rx = np->cur_tx = 0; np->dirty_rx = np->dirty_tx = 0; - np->rx_buf_sz = (dev->mtu <= 1500 ? PKT_BUF_SZ : dev->mtu + 32); + np->rx_buf_sz = (dev->mtu <= 1500 ? PKT_BUF_SZ : dev->mtu + 36); /* Initialize all Rx descriptors. */ for (i = 0; i < RX_RING_SIZE; i++) { @@ -904,7 +943,7 @@ ((i+1)%RX_RING_SIZE)*sizeof(*np->rx_ring)); np->rx_ring[i].status = 0; np->rx_ring[i].frag[0].length = 0; - np->rx_skbuff[i] = 0; + np->rx_skbuff[i] = NULL; } /* Fill in the Rx buffers. Handle allocation failure gracefully. 
*/ @@ -923,7 +962,7 @@ np->dirty_rx = (unsigned int)(i - RX_RING_SIZE); for (i = 0; i < TX_RING_SIZE; i++) { - np->tx_skbuff[i] = 0; + np->tx_skbuff[i] = NULL; np->tx_ring[i].status = 0; } return; @@ -972,8 +1011,8 @@ dev->trans_start = jiffies; if (debug > 4) { - printk(KERN_DEBUG "%s: Transmit frame #%d queued in slot %d.\n", - dev->name, np->cur_tx, entry); + printk(KERN_DEBUG "%s: Transmit frame #%d len %d queued in slot %d.\n", + dev->name, np->cur_tx, skb->len, entry); } return 0; } @@ -993,18 +1032,18 @@ do { int intr_status = readw(ioaddr + IntrStatus); - writew(intr_status & (IntrRxDone | IntrRxDMADone | IntrPCIErr | - IntrDrvRqst | IntrTxDone | IntrTxDMADone | StatsMax | - LinkChange), ioaddr + IntrStatus); + if ((intr_status & ~IntrRxDone) == 0 || intr_status == 0xffff) + break; + + writew(intr_status & (IntrRxDMADone | IntrPCIErr | + IntrDrvRqst | IntrTxDone | IntrTxDMADone | + StatsMax | LinkChange), ioaddr + IntrStatus); if (debug > 4) printk(KERN_DEBUG "%s: Interrupt, status %4.4x.\n", dev->name, intr_status); - if (intr_status == 0) - break; - - if (intr_status & (IntrRxDone|IntrRxDMADone)) + if (intr_status & IntrRxDMADone) netdev_rx(dev); if (intr_status & IntrTxDone) { @@ -1026,7 +1065,8 @@ if (tx_status & 0x02) np->stats.tx_window_errors++; /* This reset has not been verified!. */ if (tx_status & 0x10) { /* Reset the Tx. */ - writew(0x001c, ioaddr + ASICCtrl + 2); + writel(0x001c0000 | readl(ioaddr + ASICCtrl), + ioaddr + ASICCtrl); #if 0 /* Do we need to reset the Tx pointer here? */ writel(np->tx_ring_dma + np->dirty_tx*sizeof(*np->tx_ring), @@ -1067,6 +1107,7 @@ /* Abnormal error summary/uncommon events handlers. 
*/ if (intr_status & (IntrDrvRqst | IntrPCIErr | LinkChange | StatsMax)) netdev_error(dev, intr_status); + if (--boguscnt < 0) { get_stats(dev); if (debug > 1) @@ -1196,8 +1237,6 @@ { long ioaddr = dev->base_addr; struct netdev_private *np = dev->priv; - u16 mii_ctl, mii_advertise, mii_lpa; - int speed; if (intr_status & IntrDrvRqst) { /* Stop the down counter and turn interrupts back on. */ @@ -1205,37 +1244,32 @@ printk("%s: Turning interrupts back on.\n", dev->name); writew(0, ioaddr + IntrEnable); writew(0, ioaddr + DownCounter); - writew(IntrRxDone | IntrRxDMADone | IntrPCIErr | IntrDrvRqst | + writew(IntrRxDMADone | IntrPCIErr | IntrDrvRqst | IntrTxDone | StatsMax | LinkChange, ioaddr + IntrEnable); /* Ack buggy InRequest */ writew (IntrDrvRqst, ioaddr + IntrStatus); } if (intr_status & LinkChange) { - if (np->an_enable) { - mii_advertise = mdio_read (dev, np->phys[0], MII_ADVERTISE); - mii_lpa= mdio_read (dev, np->phys[0], MII_LPA); - mii_advertise &= mii_lpa; - printk (KERN_INFO "%s: Link changed: ", dev->name); - if (mii_advertise & ADVERTISE_100FULL) - printk ("100Mbps, full duplex\n"); - else if (mii_advertise & ADVERTISE_100HALF) - printk ("100Mbps, half duplex\n"); - else if (mii_advertise & ADVERTISE_10FULL) - printk ("10Mbps, full duplex\n"); - else if (mii_advertise & ADVERTISE_10HALF) - printk ("10Mbps, half duplex\n"); + int new_status = readb(ioaddr + MIICtrl) & 0xE0; + u16 mii_advertise = mdio_read (dev, np->phys[0], MII_ADVERTISE); + u16 mii_lpa = mdio_read (dev, np->phys[0], MII_LPA); + + printk(KERN_ERR "%s: Link changed: Autonegotiation advertising " + "%dMbps %s duplex, partner %dMbps %s duplex.\n", + dev->name, + (mii_advertise & BMCR_SPEED100) ? 100 : 10, + (mii_advertise & BMCR_FULLDPLX) ? "full" : "half", + (mii_lpa & BMCR_SPEED100) ? 100 : 10, + (mii_lpa & BMCR_FULLDPLX) ? 
"full" : "half"); + if ((np->link_status ^ new_status) & 0x80) { + /* need to check if this is even remotely correct */ + if (new_status & 0x80) + netif_carrier_on(dev); else - printk ("\n"); - - } else { - mii_ctl = mdio_read (dev, np->phys[0], MII_BMCR); - speed = (mii_ctl & BMCR_SPEED100) ? 100 : 10; - printk (KERN_INFO "%s: Link changed: %dMbps ,", - dev->name, speed); - printk ("%s duplex.\n", (mii_ctl & BMCR_FULLDPLX) ? - "full" : "half"); + netif_carrier_off(dev); } - check_duplex (dev); + np->link_status = new_status; + check_duplex(dev); } if (intr_status & StatsMax) { get_stats(dev); @@ -1253,6 +1287,9 @@ struct netdev_private *np = dev->priv; int i; + if (readw(ioaddr + TxOctetsHigh) == 0xffff) + return &np->stats; + /* We should lock this segment of code for SMP eventually, although the vulnerability window is very small and statistics are non-critical. */ @@ -1336,6 +1373,7 @@ static int netdev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) { + struct netdev_private *np = dev->priv; struct mii_ioctl_data *data = (struct mii_ioctl_data *)&rq->ifr_data; switch(cmd) { @@ -1343,7 +1381,7 @@ return netdev_ethtool_ioctl(dev, (void *) rq->ifr_data); case SIOCGMIIPHY: /* Get address of MII PHY in use. */ case SIOCDEVPRIVATE: /* for binary compat, remove in 2.5 */ - data->phy_id = ((struct netdev_private *)dev->priv)->phys[0] & 0x1f; + data->phy_id = np->phys[0] & 0x1f; /* Fall Through */ case SIOCGMIIREG: /* Read MII PHY register. */ @@ -1355,6 +1393,17 @@ case SIOCDEVPRIVATE+2: /* for binary compat, remove in 2.5 */ if (!capable(CAP_NET_ADMIN)) return -EPERM; + if(data->phy_id == np->phys[0]) { + switch(data->reg_num) { + case 0: + np->medialock = (data->val_in & (BMCR_RESET|BMCR_ANENABLE)) ? 0 : 1; + if(np->medialock) + np->full_duplex = (data->val_in & BMCR_FULLDPLX) ? 1 : 0; + break; + case 4: np->advertising = data->val_in; break; + } + /* Perhaps check_duplex(dev), depending on chip semantics. 
*/ + } mdio_write(dev, data->phy_id & 0x1f, data->reg_num & 0x1f, data->val_in); return 0; default: @@ -1418,7 +1467,7 @@ np->rx_ring[i].frag[0].addr, np->rx_buf_sz, PCI_DMA_FROMDEVICE); dev_kfree_skb(skb); - np->rx_skbuff[i] = 0; + np->rx_skbuff[i] = NULL; } } for (i = 0; i < TX_RING_SIZE; i++) { @@ -1428,7 +1477,7 @@ np->tx_ring[i].frag[0].addr, skb->len, PCI_DMA_TODEVICE); dev_kfree_skb(skb); - np->tx_skbuff[i] = 0; + np->tx_skbuff[i] = NULL; } } From manand@us.ibm.com Wed Aug 28 14:49:30 2002 Received: with ECARTIS (v1.0.0; list netdev); Wed, 28 Aug 2002 14:49:31 -0700 (PDT) Received: from e2.ny.us.ibm.com (e2.ny.us.ibm.com [32.97.182.102]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7SLnTtG001760 for ; Wed, 28 Aug 2002 14:49:30 -0700 Received: from northrelay02.pok.ibm.com (northrelay02.pok.ibm.com [9.56.224.150]) by e2.ny.us.ibm.com (8.12.2/8.12.2) with ESMTP id g7SLqvIV077562; Wed, 28 Aug 2002 17:52:57 -0400 Received: from d03nm123.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by northrelay02.pok.ibm.com (8.12.3/NCO/VER6.4) with ESMTP id g7SLqrv0020860; Wed, 28 Aug 2002 17:52:53 -0400 Subject: IPV4 and IPV6 tcp_stream comparison To: netdev@oss.sgi.com, linux-net@vger.kernel.org, usagi-users@linux-ipv6.org, linux-kernel@vger.kernel.org Cc: "Bill Hartner" , "Venkata Jagana" X-Mailer: Lotus Notes Release 5.0.7 March 21, 2001 Message-ID: From: "Mala Anand" Date: Wed, 28 Aug 2002 16:52:59 -0500 X-MIMETrack: Serialize by Router on D03NM123/03/M/IBM(Release 5.0.10 |March 22, 2002) at 08/28/2002 03:52:55 PM MIME-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit X-archive-position: 42 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: manand@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 671 Lines: 23 I did a comparison test of IPV4 and IPV6 using 2.4.17 kernel for IPV4 and 2.4.17 kernel+USAGI-linux24-s20020415-2.4.17.diff patch 
running netperf3, tcp_stream 1 adapter, 2 adapters test on UNI, SMP kernels using a 2-way machine. The test setup/results/profile can be found at: http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/may_02/netperf3_ipv6_2.4.17resutls.htm Regards, Mala Mala Anand IBM Linux Technology Center - Kernel Performance E-mail:manand@us.ibm.com http://www-124.ibm.com/developerworks/opensource/linuxperf http://www-124.ibm.com/developerworks/projects/linuxperf Phone:838-8088; Tie-line:678-8088 From lunz@gtf.org Wed Aug 28 16:09:57 2002 Received: with ECARTIS (v1.0.0; list netdev); Wed, 28 Aug 2002 16:10:01 -0700 (PDT) Received: from crown.reflexsecurity.com (dsl-65-188-226-101.telocity.com [65.188.226.101]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7SN9ttG005998 for ; Wed, 28 Aug 2002 16:09:56 -0700 Received: from stoli.localnet ([192.168.0.106]) by crown.reflexsecurity.com with smtp (Exim 3.35 #1 (Debian)) id 17kC0j-0004rN-00; Wed, 28 Aug 2002 19:14:05 -0400 Received: by stoli.localnet (sSMTP sendmail emulation); Wed, 28 Aug 2002 19:13:33 -0400 Date: Wed, 28 Aug 2002 19:13:33 -0400 From: Jason Lunz To: netdev@oss.sgi.com Cc: becker@scyld.com, jgarzik@mandrakesoft.com, "Patrick R. McManus" Subject: Re: [PATCH] 2.4.20-pre sundance.c cleanups Message-ID: <20020828231333.GA15183@reflexsecurity.com> References: <20020828185612.GA14342@reflexsecurity.com> Mime-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20020828185612.GA14342@reflexsecurity.com> User-Agent: Mutt/1.3.28i Content-Transfer-Encoding: 8bit X-archive-position: 43 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: lunz@gtf.org Precedence: bulk X-list: netdev Content-Length: 27549 Lines: 834 I forgot to mention that the last patch, like every other sundance driver I've ever tried, has transmit timeouts under heavy tx load. 
So I merged in only the tx changes from Edward Peng's sundance.c v1.03 (i ignored the rest of it because it moved rx handling to a tasklet in what looks like an emulation of a NAPI dev->poll function). This is the first sundance driver I've used that handles both heavy RX and TX load (where heavy is ~90kpps) without the card resetting on transmit timeout. It's also the first in-kernel sundance.c that recognizes a D-Link 580-TX 4-port ethercard. It remains to be seen whether this driver still works with older sundance cards; i'd appreciate it if anyone can test that. Jason --- sundance-kernel.c Tue Aug 27 15:30:08 2002 +++ sundance-kernel-cleanup-b.c Wed Aug 28 19:03:39 2002 @@ -1,6 +1,6 @@ /* sundance.c: A Linux device driver for the Sundance ST201 "Alta". */ /* - Written 1999-2000 by Donald Becker. + Written 1999-2002 by Donald Becker. This software may be used and distributed according to the terms of the GNU General Public License (GPL), incorporated herein by reference. @@ -23,13 +23,27 @@ Version 1.01b (D-Link): - Add new board to PCI ID list - + + Version 1.01c (Jason Lunz): + - merged changes from Donald Becker's sundance.c v1.09: + . use IO ops by default (needed for D-Link 580TX) + . autodetect need for mii_preamble_required + . add per-adapter mtu change support + . update driver status in SIOCSMIIREG ioctl + . ignore IntrRxDone (buggy on some chipsets) + - minor cleanups + + Version 1.01d (Jason Lunz): + - merged changes from Edward Peng's (of D-Link) sundance.c v1.03: + . increase tx ring size + . tx interrupt coalescing + . support for flow control */ #define DRV_NAME "sundance" -#define DRV_VERSION "1.01b" -#define DRV_RELDATE "17-Jan-2002" +#define DRV_VERSION "1.01d" +#define DRV_RELDATE "28-Aug-2002" /* The user-configurable values. @@ -37,7 +51,6 @@ static int debug = 1; /* 1 normal messages, 0 quiet .. 7 verbose. */ /* Maximum events (Rx packets, etc.) to handle at each interrupt. 
*/ static int max_interrupt_work = 30; -static int mtu; /* Maximum number of multicast addresses to filter (vs. rx-all-multicast). Typical is a 64 element hash table based on the Ethernet CRC. */ static int multicast_filter_limit = 32; @@ -47,7 +60,10 @@ This chip can receive into offset buffers, so the Alpha does not need a copy-align. */ static int rx_copybreak; +static int tx_coalesce=1; +static int flowctrl=1; +#define MAX_UNITS 8 /* More are supported, limit only on options */ /* media[] specifies the media type the NIC operates at. autosense Autosensing active media. 10mbps_hd 10Mbps half duplex. @@ -60,18 +76,34 @@ 3 100Mbps half duplex. 4 100Mbps full duplex. */ -#define MAX_UNITS 8 static char *media[MAX_UNITS]; +/* Used to pass the media type, etc. + Both 'options[]' and 'full_duplex[]' should exist for driver + interoperability. + The media type is usually passed in 'options[]'. + The default is autonegotation for speed and duplex. + This should rarely be overridden. + Use option values 0x10/0x20 for 10Mbps, 0x100,0x200 for 100Mbps. + Use option values 0x10 and 0x100 for forcing half duplex fixed speed. + Use option values 0x20 and 0x200 for forcing full duplex operation. +*/ +static int options[MAX_UNITS] = {-1, -1, -1, -1, -1, -1, -1, -1}; + +/* Set iff a MII transceiver on any interface requires mdio preamble. + This only set with older tranceivers, so the extra + code size of a per-interface flag is not worthwhile. */ +static int mii_preamble_required = 0; + /* Operational parameters that are set at compile time. */ /* Keep the ring sizes a power of two for compile efficiency. The compiler will convert '%'<2^N> into a bit mask. Making the Tx ring too large decreases the effectiveness of channel - bonding and packet priority, and more than 128 requires modifying the - Tx error recovery. + bonding and packet priority, and more than 31 requires modifying the + Tx status handling error recovery. Large receive rings merely waste memory. 
*/ -#define TX_RING_SIZE 16 -#define TX_QUEUE_LEN 10 /* Limit ring entries actually used. */ +#define TX_RING_SIZE 64 +#define TX_QUEUE_LEN (TX_RING_SIZE - 1) /* Limit ring entries actually used. */ #define RX_RING_SIZE 32 #define TX_TOTAL_SIZE TX_RING_SIZE*sizeof(struct netdev_desc) #define RX_TOTAL_SIZE RX_RING_SIZE*sizeof(struct netdev_desc) @@ -125,23 +157,33 @@ MODULE_LICENSE("GPL"); MODULE_PARM(max_interrupt_work, "i"); -MODULE_PARM(mtu, "i"); MODULE_PARM(debug, "i"); MODULE_PARM(rx_copybreak, "i"); MODULE_PARM(media, "1-" __MODULE_STRING(MAX_UNITS) "s"); -MODULE_PARM_DESC(max_interrupt_work, "Sundance Alta maximum events handled per interrupt"); -MODULE_PARM_DESC(mtu, "Sundance Alta MTU (all boards)"); +MODULE_PARM(options, "1-" __MODULE_STRING(MAX_UNITS) "i"); +MODULE_PARM(multicast_filter_limit, "i"); +MODULE_PARM(mii_preamble_required, "i"); +MODULE_PARM(flowctrl, "i"); MODULE_PARM_DESC(debug, "Sundance Alta debug level (0-5)"); +MODULE_PARM_DESC(media, "Sundance Alta force fixed speed+duplex"); +MODULE_PARM_DESC(options, "Sundance Alta force transceiver type or fixed speed+duplex"); +MODULE_PARM_DESC(max_interrupt_work, "Sundance Alta maximum events handled per interrupt"); MODULE_PARM_DESC(rx_copybreak, "Sundance Alta copy breakpoint for copy-only-tiny-frames"); +MODULE_PARM_DESC(multicast_filter_limit, "Sundance Alta multicast addresses before switching to Rx-all-multicast"); +MODULE_PARM_DESC(mii_preamble_required, "Sundance Alta force sending a preamble before MII management transactions"); +MODULE_PARM_DESC(flowctrl, "Sundance Alta flow control (0|1, default 1)"); /* Theory of Operation I. Board Compatibility This driver is designed for the Sundance Technologies "Alta" ST201 chip. +The Kendin KS8723 is the same design with an integrated transceiver. II. Board-specific settings +This is an all-in-one chip, so there are no board-specific settings. + III. Driver operation IIIa. Ring buffers @@ -200,8 +242,9 @@ IVb. 
References The Sundance ST201 datasheet, preliminary version. -http://cesdis.gsfc.nasa.gov/linux/misc/100mbps.html -http://cesdis.gsfc.nasa.gov/linux/misc/NWay.html +The Kendin KS8723 datasheet, preliminary version. +http://www.scyld.com/expert/100mbps.html +http://www.scyld.com/expert/NWay.html IVc. Errata @@ -209,6 +252,11 @@ +/* Work-around for Kendin chip bugs. */ +#ifndef USE_MEM_OPS +#define USE_IO_OPS 1 +#endif + enum pci_id_flags_bits { /* Set PCI command register bits before calling probe1(). */ PCI_USES_IO=1, PCI_USES_MEM=2, PCI_USES_MASTER=4, @@ -311,7 +359,7 @@ MACCtrl0 = 0x50, MACCtrl1 = 0x52, StationAddr = 0x54, - MaxTxSize = 0x5A, + MaxFrameSize = 0x5A, RxMode = 0x5c, MIICtrl = 0x5e, MulticastFilter0 = 0x60, @@ -402,21 +450,20 @@ int chip_id, drv_flags; unsigned int cur_rx, dirty_rx; /* Producer/consumer ring indices */ unsigned int rx_buf_sz; /* Based on MTU+slack. */ - spinlock_t txlock; /* Group with Tx control cache line. */ struct netdev_desc *last_tx; /* Last Tx descriptor used. */ unsigned int cur_tx, dirty_tx; unsigned int tx_full:1; /* The Tx queue is full. */ + unsigned int flowctrl:1; /* These values are keep track of the transceiver/media in use. */ unsigned int full_duplex:1; /* Full-duplex operation requested. */ unsigned int medialock:1; /* Do not sense media. */ unsigned int default_port:4; /* Last dev->if_port value. */ - unsigned int an_enable:1; - unsigned int speed; /* Multicast and receive mode. */ spinlock_t mcastlock; /* SMP lock multicast updates. */ u16 mcast_filter[4]; /* MII transceiver section. */ int mii_cnt; /* MII device addresses. */ + int link_status; u16 advertising; /* NWay media advertisement */ unsigned char phys[MII_CNT]; /* MII device addresses, only first one used. */ struct pci_dev *pci_dev; @@ -425,6 +472,7 @@ /* The station address location in the EEPROM. 
*/ #define EEPROM_SA_OFFSET 0x10 +static int change_mtu(struct net_device *dev, int new_mtu); static int eeprom_read(long ioaddr, int location); static int mdio_read(struct net_device *dev, int phy_id, int location); static void mdio_write(struct net_device *dev, int phy_id, int location, int value); @@ -437,11 +485,10 @@ static void intr_handler(int irq, void *dev_instance, struct pt_regs *regs); static void netdev_error(struct net_device *dev, int intr_status); static int netdev_rx(struct net_device *dev); -static void netdev_error(struct net_device *dev, int intr_status); static void set_rx_mode(struct net_device *dev); -static struct net_device_stats *get_stats(struct net_device *dev); -static int netdev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd); +static int netdev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd); static int netdev_close(struct net_device *dev); +static struct net_device_stats *get_stats(struct net_device *dev); @@ -455,9 +502,9 @@ int irq; int i; long ioaddr; - u16 mii_ctl; void *ring_space; dma_addr_t ring_dma; + int option = card_idx < MAX_UNITS ? 
options[card_idx] : 0; /* when built into the kernel, we only print version if device is found */ @@ -524,11 +571,9 @@ dev->do_ioctl = &netdev_ioctl; dev->tx_timeout = &tx_timeout; dev->watchdog_timeo = TX_TIMEOUT; + dev->change_mtu = &change_mtu; pci_set_drvdata(pdev, dev); - if (mtu) - dev->mtu = mtu; - i = register_netdev(dev); if (i) goto err_out_unmap_rx; @@ -542,77 +587,79 @@ if (1) { int phy, phy_idx = 0; np->phys[0] = 1; /* Default setting */ + mii_preamble_required++; for (phy = 0; phy < 32 && phy_idx < MII_CNT; phy++) { int mii_status = mdio_read(dev, phy, 1); if (mii_status != 0xffff && mii_status != 0x0000) { np->phys[phy_idx++] = phy; np->advertising = mdio_read(dev, phy, 4); + if ((mii_status & 0x0040) == 0) + mii_preamble_required++; printk(KERN_INFO "%s: MII PHY found at address %d, status " "0x%4.4x advertising %4.4x.\n", dev->name, phy, mii_status, np->advertising); } } + mii_preamble_required--; np->mii_cnt = phy_idx; if (phy_idx == 0) printk(KERN_INFO "%s: No MII transceiver found!, ASIC status %x\n", dev->name, readl(ioaddr + ASICCtrl)); } + + if(tx_coalesce < 1) { + tx_coalesce = 1; + } else if(tx_coalesce > TX_QUEUE_LEN - 1) { + tx_coalesce = TX_QUEUE_LEN - 1; + } + if(flowctrl == 0) { + np->flowctrl = 0; + } + /* Parse override configuration */ - np->an_enable = 1; if (card_idx < MAX_UNITS) { if (media[card_idx] != NULL) { - np->an_enable = 0; if (strcmp (media[card_idx], "100mbps_fd") == 0 || strcmp (media[card_idx], "4") == 0) { - np->speed = 100; - np->full_duplex = 1; + option |= 0x200; } else if (strcmp (media[card_idx], "100mbps_hd") == 0 || strcmp (media[card_idx], "3") == 0) { - np->speed = 100; - np->full_duplex = 0; + option |= 0x100; } else if (strcmp (media[card_idx], "10mbps_fd") == 0 || strcmp (media[card_idx], "2") == 0) { - np->speed = 10; - np->full_duplex = 1; + option |= 0x20; } else if (strcmp (media[card_idx], "10mbps_hd") == 0 || strcmp (media[card_idx], "1") == 0) { - np->speed = 10; - np->full_duplex = 0; - } else { - 
np->an_enable = 1; + option |= 0x10; } } } - /* Fibre PHY? */ - if (readl (ioaddr + ASICCtrl) & 0x80) { - /* Default 100Mbps Full */ - if (np->an_enable) { - np->speed = 100; - np->full_duplex = 1; - np->an_enable = 0; - } + /* Fibre PHY? Default 100Mbps Full */ + if((readl(ioaddr + ASICCtrl) & 0x80) && (0 == (option & 0x3ff))) { + option |= 0x200; } - /* Reset PHY */ - mdio_write (dev, np->phys[0], MII_BMCR, BMCR_RESET); - mdelay (300); - mdio_write (dev, np->phys[0], MII_BMCR, BMCR_ANENABLE|BMCR_ANRESTART); - /* Force media type */ - if (!np->an_enable) { - mii_ctl = 0; - mii_ctl |= (np->speed == 100) ? BMCR_SPEED100 : 0; - mii_ctl |= (np->full_duplex) ? BMCR_FULLDPLX : 0; - mdio_write (dev, np->phys[0], MII_BMCR, mii_ctl); - printk (KERN_INFO "Override speed=%d, %s duplex\n", - np->speed, np->full_duplex ? "Full" : "Half"); + /* Allow forcing the media type. */ + if (option > 0) { + if (option & 0x220) + np->full_duplex = 1; + np->default_port = option & 0x3ff; + if (np->default_port & 0x330) { + np->medialock = 1; + printk(KERN_INFO " Forcing %dMbps %s-duplex operation.\n", + (option & 0x300 ? 100 : 10), + np->full_duplex ? "full" : "half"); + mdio_write(dev, np->phys[0], MII_BMCR, + ((option & 0x300) ? BMCR_SPEED100 : 0) | + (np->full_duplex ? BMCR_FULLDPLX : 0)); + } } - /* Perhaps move the reset here? */ /* Reset the chip to erase previous misconfiguration. */ if (debug > 1) printk("ASIC Control is %x.\n", readl(ioaddr + ASICCtrl)); - writew(0x007f, ioaddr + ASICCtrl + 2); + writel(0x007f0000 | readl(ioaddr + ASICCtrl), ioaddr + ASICCtrl); if (debug > 1) printk("ASIC Control is now %x.\n", readl(ioaddr + ASICCtrl)); @@ -636,10 +683,21 @@ } + +static int change_mtu(struct net_device *dev, int new_mtu) +{ + if ((new_mtu < 68) || (new_mtu > 8191)) /* Set by RxDMAFrameLen */ + return -EINVAL; + if (netif_running(dev)) + return -EBUSY; + dev->mtu = new_mtu; + return 0; +} + /* Read the EEPROM and MII Management Data I/O (MDIO) interfaces.
*/ static int __devinit eeprom_read(long ioaddr, int location) { - int boguscnt = 1000; /* Typical 190 ticks. */ + int boguscnt = 2000; /* Typical 190 ticks. */ writew(0x0200 | (location & 0xff), ioaddr + EECtrl); do { if (! (readw(ioaddr + EECtrl) & 0x8000)) { @@ -658,11 +716,6 @@ met by back-to-back 33Mhz PCI cycles. */ #define mdio_delay() readb(mdio_addr) -/* Set iff a MII transceiver on any interface requires mdio preamble. - This only set with older tranceivers, so the extra - code size of a per-interface flag is not worthwhile. */ -static const char mii_preamble_required = 1; - enum mii_reg_bits { MDIO_ShiftClk=0x0001, MDIO_Data=0x0002, MDIO_EnbOutput=0x0004, }; @@ -761,6 +814,11 @@ init_ring(dev); + if (dev->if_port == 0) + dev->if_port = np->default_port; + + np->mcastlock = (spinlock_t) SPIN_LOCK_UNLOCKED; + writel(np->rx_ring_dma, ioaddr + RxListPtr); /* The Tx list pointer is written as packets are queued. */ @@ -768,26 +826,27 @@ writeb(dev->dev_addr[i], ioaddr + StationAddr + i); /* Initialize other registers. */ - /* Configure the PCI bus bursts and FIFO thresholds. */ - - if (dev->if_port == 0) - dev->if_port = np->default_port; - - np->mcastlock = (spinlock_t) SPIN_LOCK_UNLOCKED; + np->link_status = readb(ioaddr + MIICtrl) & 0xE0; + writew((np->full_duplex || (np->link_status & 0x20)) ? 0x120 : 0, + ioaddr + MACCtrl0); + writew(dev->mtu + 14, ioaddr + MaxFrameSize); + if (dev->mtu > 2047) + writel(readl(ioaddr + ASICCtrl) | 0x0C, ioaddr + ASICCtrl); + /* Configure the PCI bus bursts and FIFO thresholds. */ set_rx_mode(dev); writew(0, ioaddr + IntrEnable); writew(0, ioaddr + DownCounter); /* Set the chip to poll every N*320nsec. */ writeb(100, ioaddr + RxDescPoll); writeb(127, ioaddr + TxDescPoll); - netif_start_queue(dev); /* Enable interrupts by setting the interrupt mask. 
*/ - writew(IntrRxDone | IntrRxDMADone | IntrPCIErr | IntrDrvRqst | IntrTxDone + writew(IntrRxDMADone | IntrPCIErr | IntrDrvRqst | IntrTxDone | StatsMax | LinkChange, ioaddr + IntrEnable); writew(StatsEnable | RxEnable | TxEnable, ioaddr + MACCtrl1); + netif_start_queue(dev); if (debug > 2) printk(KERN_DEBUG "%s: Done netdev_open(), status: Rx %x Tx %x " @@ -815,21 +874,23 @@ int duplex; /* Force media */ - if (!np->an_enable || mii_lpa == 0xffff) { + if (np->medialock || mii_lpa == 0xffff) { if (np->full_duplex) writew (readw (ioaddr + MACCtrl0) | EnbFullDuplex, ioaddr + MACCtrl0); return; } /* Autonegotiation */ - duplex = (negotiated & 0x0100) || (negotiated & 0x01C0) == 0x0040; + duplex = (negotiated & LPA_100FULL) || + (negotiated & (LPA_100FULL | LPA_100HALF | LPA_10FULL)) == LPA_10FULL; if (np->full_duplex != duplex) { np->full_duplex = duplex; if (debug) printk(KERN_INFO "%s: Setting %s-duplex based on MII #%d " "negotiated capability %4.4x.\n", dev->name, duplex ? "full" : "half", np->phys[0], negotiated); - writew(duplex ? 0x20 : 0, ioaddr + MACCtrl0); + writew(duplex ? (readw(ioaddr + MACCtrl0) | EnbFullDuplex) : 0, + ioaddr + MACCtrl0); } } @@ -875,13 +936,13 @@ /* Stop and restart the chip's Tx processes . */ /* Trigger an immediate transmit demand. */ - writew(IntrRxDone | IntrRxDMADone | IntrPCIErr | IntrDrvRqst | IntrTxDone + writew(IntrRxDMADone | IntrPCIErr | IntrDrvRqst | IntrTxDone | StatsMax | LinkChange, ioaddr + IntrEnable); dev->trans_start = jiffies; np->stats.tx_errors++; - if (!np->tx_full) + if(!netif_queue_stopped(dev)) netif_wake_queue(dev); } @@ -892,11 +953,10 @@ struct netdev_private *np = dev->priv; int i; - np->tx_full = 0; np->cur_rx = np->cur_tx = 0; np->dirty_rx = np->dirty_tx = 0; - np->rx_buf_sz = (dev->mtu <= 1500 ? PKT_BUF_SZ : dev->mtu + 32); + np->rx_buf_sz = (dev->mtu <= 1500 ? PKT_BUF_SZ : dev->mtu + 36); /* Initialize all Rx descriptors. 
*/ for (i = 0; i < RX_RING_SIZE; i++) { @@ -904,7 +964,7 @@ ((i+1)%RX_RING_SIZE)*sizeof(*np->rx_ring)); np->rx_ring[i].status = 0; np->rx_ring[i].frag[0].length = 0; - np->rx_skbuff[i] = 0; + np->rx_skbuff[i] = NULL; } /* Fill in the Rx buffers. Handle allocation failure gracefully. */ @@ -923,7 +983,7 @@ np->dirty_rx = (unsigned int)(i - RX_RING_SIZE); for (i = 0; i < TX_RING_SIZE; i++) { - np->tx_skbuff[i] = 0; + np->tx_skbuff[i] = NULL; np->tx_ring[i].status = 0; } return; @@ -945,8 +1005,12 @@ txdesc->next_desc = 0; /* Note: disable the interrupt generation here before releasing. */ - txdesc->status = - cpu_to_le32((entry<<2) | DescIntrOnDMADone | DescIntrOnTx | DisableAlign); + if(entry % tx_coalesce == 0) { + txdesc->status = cpu_to_le32((entry<<2) | DescIntrOnTx | DisableAlign); + + } else { + txdesc->status = cpu_to_le32((entry<<2) | DisableAlign); + } txdesc->frag[0].addr = cpu_to_le32(pci_map_single(np->pci_dev, skb->data, skb->len, PCI_DMA_TODEVICE)); txdesc->frag[0].length = cpu_to_le32(skb->len | LastFrag); @@ -958,10 +1022,10 @@ /* On some architectures: explicitly flush cache lines here. */ - if (np->cur_tx - np->dirty_tx < TX_QUEUE_LEN - 1) { + if((np->cur_tx - np->dirty_tx < TX_QUEUE_LEN - 1) && + !netif_queue_stopped(dev)) { /* do nothing */ } else { - np->tx_full = 1; netif_stop_queue(dev); } /* Side effect: The read wakes the potentially-idle transmit channel. 
*/ @@ -972,9 +1036,13 @@ dev->trans_start = jiffies; if (debug > 4) { - printk(KERN_DEBUG "%s: Transmit frame #%d queued in slot %d.\n", - dev->name, np->cur_tx, entry); + printk(KERN_DEBUG "%s: Transmit frame #%d len %d queued in slot %d.\n", + dev->name, np->cur_tx, skb->len, entry); } + + if(tx_coalesce > 1) + writel(1000, dev->base_addr + DownCounter); + return 0; } @@ -993,21 +1061,21 @@ do { int intr_status = readw(ioaddr + IntrStatus); - writew(intr_status & (IntrRxDone | IntrRxDMADone | IntrPCIErr | - IntrDrvRqst | IntrTxDone | IntrTxDMADone | StatsMax | - LinkChange), ioaddr + IntrStatus); + if ((intr_status & ~IntrRxDone) == 0 || intr_status == 0xffff) + break; + + writew(intr_status & (IntrRxDMADone | IntrPCIErr | + IntrDrvRqst | IntrTxDone | IntrTxDMADone | + StatsMax | LinkChange), ioaddr + IntrStatus); if (debug > 4) printk(KERN_DEBUG "%s: Interrupt, status %4.4x.\n", dev->name, intr_status); - if (intr_status == 0) - break; - - if (intr_status & (IntrRxDone|IntrRxDMADone)) + if (intr_status & IntrRxDMADone) netdev_rx(dev); - if (intr_status & IntrTxDone) { + if (intr_status & (IntrTxDone | IntrDrvRqst)) { int boguscnt = 32; int tx_status = readw(ioaddr + TxStatus); while (tx_status & 0x80) { @@ -1026,7 +1094,8 @@ if (tx_status & 0x02) np->stats.tx_window_errors++; /* This reset has not been verified!. */ if (tx_status & 0x10) { /* Reset the Tx. */ - writew(0x001c, ioaddr + ASICCtrl + 2); + writel(0x001c0000 | readl(ioaddr + ASICCtrl), + ioaddr + ASICCtrl); #if 0 /* Do we need to reset the Tx pointer here? */ writel(np->tx_ring_dma + np->dirty_tx*sizeof(*np->tx_ring), @@ -1038,7 +1107,7 @@ } /* Yup, this is a documentation bug. It cost me *hours*. 
*/ writew(0, ioaddr + TxStatus); - tx_status = readb(ioaddr + TxStatus); + tx_status = readw(ioaddr + TxStatus); if (--boguscnt < 0) break; } @@ -1057,26 +1126,22 @@ dev_kfree_skb_irq(skb); np->tx_skbuff[entry] = 0; } - if (np->tx_full - && np->cur_tx - np->dirty_tx < TX_QUEUE_LEN - 4) { + if (netif_queue_stopped(dev) && + np->cur_tx - np->dirty_tx < TX_QUEUE_LEN - 4) { /* The ring is no longer full, clear tbusy. */ - np->tx_full = 0; netif_wake_queue(dev); } /* Abnormal error summary/uncommon events handlers. */ - if (intr_status & (IntrDrvRqst | IntrPCIErr | LinkChange | StatsMax)) + if (intr_status & (IntrPCIErr | LinkChange | StatsMax)) netdev_error(dev, intr_status); + if (--boguscnt < 0) { get_stats(dev); if (debug > 1) printk(KERN_WARNING "%s: Too much work at interrupt, " "status=0x%4.4x / 0x%4.4x.\n", dev->name, intr_status, readw(ioaddr + IntrClear)); - /* Re-enable us in 3.2msec. */ - writew(0, ioaddr + IntrEnable); - writew(1000, ioaddr + DownCounter); - writew(IntrDrvRqst, ioaddr + IntrEnable); break; } } while (1); @@ -1085,6 +1150,9 @@ printk(KERN_DEBUG "%s: exiting interrupt, status=%#4.4x.\n", dev->name, readw(ioaddr + IntrStatus)); + if(np->cur_tx - np->dirty_tx > 0 && tx_coalesce > 1) + writel(100, ioaddr + DownCounter); + spin_unlock(&np->lock); } @@ -1196,46 +1264,31 @@ { long ioaddr = dev->base_addr; struct netdev_private *np = dev->priv; - u16 mii_ctl, mii_advertise, mii_lpa; - int speed; - if (intr_status & IntrDrvRqst) { - /* Stop the down counter and turn interrupts back on. 
*/ - if (debug > 1) - printk("%s: Turning interrupts back on.\n", dev->name); - writew(0, ioaddr + IntrEnable); - writew(0, ioaddr + DownCounter); - writew(IntrRxDone | IntrRxDMADone | IntrPCIErr | IntrDrvRqst | - IntrTxDone | StatsMax | LinkChange, ioaddr + IntrEnable); - /* Ack buggy InRequest */ - writew (IntrDrvRqst, ioaddr + IntrStatus); - } if (intr_status & LinkChange) { - if (np->an_enable) { - mii_advertise = mdio_read (dev, np->phys[0], MII_ADVERTISE); - mii_lpa= mdio_read (dev, np->phys[0], MII_LPA); - mii_advertise &= mii_lpa; - printk (KERN_INFO "%s: Link changed: ", dev->name); - if (mii_advertise & ADVERTISE_100FULL) - printk ("100Mbps, full duplex\n"); - else if (mii_advertise & ADVERTISE_100HALF) - printk ("100Mbps, half duplex\n"); - else if (mii_advertise & ADVERTISE_10FULL) - printk ("10Mbps, full duplex\n"); - else if (mii_advertise & ADVERTISE_10HALF) - printk ("10Mbps, half duplex\n"); + int new_status = readb(ioaddr + MIICtrl) & 0xE0; + u16 mii_advertise = mdio_read (dev, np->phys[0], MII_ADVERTISE); + u16 mii_lpa = mdio_read (dev, np->phys[0], MII_LPA); + + printk(KERN_ERR "%s: Link changed: Autonegotiation advertising " + "%dMbps %s duplex, partner %dMbps %s duplex.\n", + dev->name, + (mii_advertise & (ADVERTISE_100FULL | ADVERTISE_100HALF)) ? 100 : 10, + (mii_advertise & (ADVERTISE_100FULL | ADVERTISE_10FULL)) ? "full" : "half", + (mii_lpa & (LPA_100FULL | LPA_100HALF)) ? 100 : 10, + (mii_lpa & (LPA_100FULL | LPA_10FULL)) ? "full" : "half"); + if ((np->link_status ^ new_status) & 0x80) { + /* need to check if this is even remotely correct */ + if (new_status & 0x80) + netif_carrier_on(dev); else - printk ("\n"); - - } else { - mii_ctl = mdio_read (dev, np->phys[0], MII_BMCR); - speed = (mii_ctl & BMCR_SPEED100) ? 100 : 10; - printk (KERN_INFO "%s: Link changed: %dMbps ,", - dev->name, speed); - printk ("%s duplex.\n", (mii_ctl & BMCR_FULLDPLX) ? 
- "full" : "half"); + netif_carrier_off(dev); } - check_duplex (dev); + np->link_status = new_status; + check_duplex(dev); + if(np->flowctrl == 0) + writew(readw(ioaddr + MACCtrl0) & ~EnbFlowCtrl, + ioaddr + MACCtrl0); } if (intr_status & StatsMax) { get_stats(dev); @@ -1253,6 +1306,9 @@ struct netdev_private *np = dev->priv; int i; + if (readw(ioaddr + TxOctetsHigh) == 0xffff) + return &np->stats; + /* We should lock this segment of code for SMP eventually, although the vulnerability window is very small and statistics are non-critical. */ @@ -1329,6 +1385,38 @@ return 0; } + case ETHTOOL_GSET: { + struct ethtool_cmd ecmd = { ETHTOOL_GSET }; + + ecmd.supported = SUPPORTED_Autoneg; + + spin_lock_irq(&np->lock); + if((readl(dev->base_addr + ASICCtrl) & 0x80)) { + ecmd.supported |= SUPPORTED_FIBRE; + } else { + ecmd.supported |= (SUPPORTED_100baseT_Half | + SUPPORTED_100baseT_Full | + SUPPORTED_10baseT_Half | + SUPPORTED_10baseT_Full | + SUPPORTED_MII); + } + + ecmd.advertising = np->advertising; + ecmd.speed = 0; + ecmd.duplex = np->full_duplex; + ecmd.port = 0; + ecmd.phy_address = np->phys[0]; + ecmd.transceiver = 0; + ecmd.autoneg = !np->medialock; + ecmd.maxtxpkt = 0; + ecmd.maxrxpkt = 0; + + spin_unlock_irq(&np->lock); + if(copy_to_user(useraddr, &ecmd, sizeof(ecmd))) + return -EFAULT; + return 0; + } + } return -EOPNOTSUPP; @@ -1336,6 +1424,7 @@ static int netdev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) { + struct netdev_private *np = dev->priv; struct mii_ioctl_data *data = (struct mii_ioctl_data *)&rq->ifr_data; switch(cmd) { @@ -1343,7 +1432,7 @@ return netdev_ethtool_ioctl(dev, (void *) rq->ifr_data); case SIOCGMIIPHY: /* Get address of MII PHY in use. */ case SIOCDEVPRIVATE: /* for binary compat, remove in 2.5 */ - data->phy_id = ((struct netdev_private *)dev->priv)->phys[0] & 0x1f; + data->phy_id = np->phys[0] & 0x1f; /* Fall Through */ case SIOCGMIIREG: /* Read MII PHY register. 
*/ @@ -1355,6 +1444,17 @@ case SIOCDEVPRIVATE+2: /* for binary compat, remove in 2.5 */ if (!capable(CAP_NET_ADMIN)) return -EPERM; + if(data->phy_id == np->phys[0]) { + switch(data->reg_num) { + case 0: + np->medialock = (data->val_in & (BMCR_RESET|BMCR_ANENABLE)) ? 0 : 1; + if(np->medialock) + np->full_duplex = (data->val_in & BMCR_FULLDPLX) ? 1 : 0; + break; + case 4: np->advertising = data->val_in; break; + } + /* Perhaps check_duplex(dev), depending on chip semantics. */ + } mdio_write(dev, data->phy_id & 0x1f, data->reg_num & 0x1f, data->val_in); return 0; default: @@ -1418,7 +1518,7 @@ np->rx_ring[i].frag[0].addr, np->rx_buf_sz, PCI_DMA_FROMDEVICE); dev_kfree_skb(skb); - np->rx_skbuff[i] = 0; + np->rx_skbuff[i] = NULL; } } for (i = 0; i < TX_RING_SIZE; i++) { @@ -1428,7 +1528,7 @@ np->tx_ring[i].frag[0].addr, skb->len, PCI_DMA_TODEVICE); dev_kfree_skb(skb); - np->tx_skbuff[i] = 0; + np->tx_skbuff[i] = NULL; } } From kumarkr@us.ibm.com Wed Aug 28 16:53:58 2002 Received: with ECARTIS (v1.0.0; list netdev); Wed, 28 Aug 2002 16:54:00 -0700 (PDT) Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.129]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7SNrwtG007121 for ; Wed, 28 Aug 2002 16:53:58 -0700 Received: from westrelay02.boulder.ibm.com (westrelay02.boulder.ibm.com [9.17.194.23]) by e31.co.us.ibm.com (8.12.2/8.12.2) with ESMTP id g7SNvV8v048366; Wed, 28 Aug 2002 19:57:31 -0400 Received: from d03nm801.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by westrelay02.boulder.ibm.com (8.12.3/NCO/VER6.4) with ESMTP id g7SNvUBM080202; Wed, 28 Aug 2002 17:57:30 -0600 Subject: Re: [PATCH] IPv6 Prefix List support for 2.5.31 To: kuznet@ms2.inr.ac.ru Cc: linux-net@vger.kernel.org, linux-kernel@vger.kernel.org, netdev@oss.sgi.com X-Mailer: Lotus Notes Release 5.0.7 March 21, 2001 Message-ID: From: "Krishna Kumar" Date: Wed, 28 Aug 2002 16:55:35 -0700 X-MIMETrack: Serialize by Router on D03NM801/03/M/IBM(Release 5.0.10 |March 22, 2002) at 08/28/2002 
05:57:30 PM MIME-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit X-archive-position: 44 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kumarkr@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 16812 Lines: 496 Hi Alexey, The reasons I didn't put this in the routing table are as follows : - Routing table will have lots of address prefixes that are not available to the local interface addresses. Eg, none of my interfaces might have the address 2002:2.., but a route entry can exist for this. More code needs to be present to parse each entry in this case. - To be a prefix list entry, it should come via an RA. So a manual ifconfig
should not cause a prefix entry to be created. But doing an ifconfig as above would create an entry in the routing table. Then the only way to figure out that this is not a real Prefix entry would involve more code to determine whether it is an RA-added routing entry or a manual one.
- Also, the search over a longer routing table across all nodes is more time consuming.

Adding to the per-device interface structures makes it easier to maintain, since the prefix list is per interface (each interface has 'n' prefixes supported on it, updated by the RAs received on that interface).

thanks,

- KK

PS: I forgot to include netdev on this mail earlier, hence I am giving the entire patch in this mail again.

kuznet@ms2.inr.ac.ru
Sent by: linux-net-owner@vger.kernel.org
To: Krishna Kumar/Beaverton/IBM@IBMUS
cc: linux-net@vger.kernel.org
Subject: Re: [PATCH] IPv6 Prefix List support for 2.5.31
08/28/2002 03:45 PM

Hello!

> This patch implements Prefix List support in IPv6. The reasons for the
> patch are :

Listen, let me to ask stupid question: why normal way of storing prefix list in routing table is not enough???

Alexey

------------------------------------------------------------------------

Hi all,

This patch implements Prefix List support in IPv6. The reasons for the patch are :
- RFC conformance (RFC 2461 - Neighbor Discovery, Section 5.1 and others).
- Prefix List is needed to support Mobile IPv6 when it gets submitted to the kernel list.

This code has been tested both within IPv6 and with Mobile IPv6. It has also been integrated into the USAGI kernel. Any comments are welcome.
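
As a rough illustration of the prefix masking that the patch below relies on (its ipv6_addr_prefix() helper), here is a minimal userspace sketch. The name addr_prefix and the byte-wise approach are assumptions made for this example only; the patch itself works on 32-bit words inside the kernel, and this is not a substitute for it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace helper (not the kernel function from the patch):
 * copy the first prefix_len bits of a 16-byte IPv6 address into out and
 * leave all remaining bits zero. Out-of-range lengths are clamped.
 */
static void addr_prefix(uint8_t out[16], const uint8_t addr[16], int prefix_len)
{
    int nbytes, nbits;

    assert(out != NULL && addr != NULL);

    memset(out, 0, 16);
    if (prefix_len <= 0)
        return;
    if (prefix_len > 128)
        prefix_len = 128;

    nbytes = prefix_len / 8;   /* whole bytes covered by the prefix */
    nbits = prefix_len % 8;    /* leftover bits in the following byte */

    memcpy(out, addr, (size_t)nbytes);
    if (nbits != 0)
        out[nbytes] = addr[nbytes] & (uint8_t)(0xff << (8 - nbits));
}
```

For instance, masking an all-ones address to a /20 leaves the bytes ff ff f0 followed by zeros. Working byte-wise avoids the byte-order conversion that a word-wise mask needs, at the cost of a few more operations.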
Thanks, - KK --- linux-2.5.31.org/net/ipv6/addrconf.c Sat Aug 10 18:41:55 2002 +++ linux-2.5.31/net/ipv6/addrconf.c Tue Aug 20 14:18:28 2002 @@ -26,6 +26,7 @@ * packets. * yoshfuji@USAGI : Fixed interval between DAD * packets. + * Krishna Kumar@IBM : Added Prefix List Support. */ #include @@ -64,6 +65,10 @@ #define IPV6_MAX_ADDRESSES 16 +#ifdef CONFIG_IPV6_PREFIXLIST_DEBUG +#include +#endif + /* Set to 3 to get tracing... */ #define ACONF_DEBUG 2 @@ -229,6 +234,7 @@ struct net_device *dev = idev->dev; BUG_TRAP(idev->addr_list==NULL); BUG_TRAP(idev->mc_list==NULL); + BUG_TRAP(list_empty(&idev->prefix_list) == 1); #ifdef NET_REFCNT_DEBUG printk(KERN_DEBUG "in6_dev_finish_destroy: %s\n", dev ? dev->name : "NIL"); #endif @@ -257,6 +263,8 @@ ndev->lock = RW_LOCK_UNLOCKED; ndev->dev = dev; + ndev->prefix_lock = SPIN_LOCK_UNLOCKED; + INIT_LIST_HEAD(&ndev->prefix_list); memcpy(&ndev->cnf, &ipv6_devconf_dflt, sizeof(ndev->cnf)); ndev->cnf.mtu6 = dev->mtu; ndev->cnf.sysctl = NULL; @@ -300,6 +308,37 @@ return idev; } +void ipv6_addr_prefix(struct in6_addr *prefix, + struct in6_addr *addr, int prefix_len) +{ + unsigned long mask; + int ncopy, nbits; + + memset(prefix, 0, sizeof(*prefix)); + + if (prefix_len <= 0) + return; + if (prefix_len > 128) + prefix_len = 128; + + ncopy = prefix_len / 32; + switch (ncopy) { + case 4: prefix->s6_addr32[3] = addr->s6_addr32[3]; + case 3: prefix->s6_addr32[2] = addr->s6_addr32[2]; + case 2: prefix->s6_addr32[1] = addr->s6_addr32[1]; + case 1: prefix->s6_addr32[0] = addr->s6_addr32[0]; + case 0: break; + } + nbits = prefix_len % 32; + if (nbits == 0) + return; + + mask = ~((1 << (32 - nbits)) - 1); + mask = htonl(mask); + + prefix->s6_addr32[ncopy] = addr->s6_addr32[ncopy] & mask; +} + static void addrconf_forward_change(struct inet6_dev *idev) { struct net_device *dev; @@ -447,6 +486,56 @@ in6_ifa_put(ifp); } +int ipv6_get_prefix_entries(struct prefix_info **plist, int ifindex, int plen) +{ + int count; + struct net_device *dev; + struct 
inet6_dev *idev; + struct list_head *head; + struct prefix_element *p; + + if (plist == NULL) { + BUG_TRAP(plist != NULL); + return -EINVAL; + } + if ((dev = dev_get_by_index(ifindex)) == NULL) { + printk(KERN_WARNING "Bad I/F (%d) in ipv6_get_prefix_entries\n", + ifindex); + return -EINVAL; + } + + if ((idev = __in6_dev_get(dev)) == NULL) { + dev_put(dev); + return -EINVAL; + } + + read_lock_bh(&idev->lock); + if (!(count = idev->prefix_count)) { + /* No elements on list */ + goto out; + } + if ((*plist = kmalloc(count * sizeof(struct prefix_info), + GFP_ATOMIC)) == NULL) { + count = -ENOMEM; + goto out; + } + count = 0; + spin_lock_bh(&idev->prefix_lock); + list_for_each(head, &idev->prefix_list) { + p = list_entry(head, struct prefix_element, list); + if (plen == 0 || p->pinfo.prefix_len == plen) { + memcpy(*plist + count, &p->pinfo, + sizeof(struct prefix_info)); + count++; + } + } + spin_unlock_bh(&idev->prefix_lock); +out: + read_unlock_bh(&idev->lock); + dev_put(dev); + return count; +} + /* * Choose an apropriate source address * should do: @@ -803,6 +892,82 @@ return idev; } +static int ipv6_add_prefix(struct inet6_dev *idev, struct prefix_info *pinfo, + __u32 lifetime) +{ + struct in6_addr prefix; + struct list_head *pos; + struct prefix_element *pfx; +#ifdef CONFIG_IPV6_PREFIXLIST_DEBUG + char abuf[64]; +#endif + + ipv6_addr_prefix(&prefix, &pinfo->prefix, (int)pinfo->prefix_len); + +#ifdef CONFIG_IPV6_PREFIXLIST_DEBUG + in6_ntop(&prefix, abuf); +#endif + + /* Check if the prefix already exists in the list */ + read_lock_bh(&idev->lock); + spin_lock_bh(&idev->prefix_lock); + list_for_each(pos, &idev->prefix_list) { + pfx = list_entry(pos, struct prefix_element, list); + if (pfx->pinfo.prefix_len == pinfo->prefix_len + && ipv6_addr_cmp(&pfx->pinfo.prefix, &prefix) == 0) { + /* Found the prefix */ + if (lifetime == 0) { + /* If lifetime = 0, delete the prefix */ +#ifdef CONFIG_IPV6_PREFIXLIST_DEBUG + printk(KERN_INFO "%s: deleting prefix %s/%d\n", + 
__FUNCTION__, abuf, pfx->pinfo.prefix_len); +#endif + list_del(&pfx->list); + kfree(pfx); + goto out; + } + pfx->pinfo.valid = lifetime; + pfx->timestamp = jiffies; + spin_unlock_bh(&idev->prefix_lock); + read_unlock_bh(&idev->lock); +#ifdef CONFIG_IPV6_PREFIXLIST_DEBUG + printk(KERN_INFO "%s: changing prefix %s/%d, lifetime = %d\n", + __FUNCTION__, abuf, pfx->pinfo.prefix_len, lifetime); +#endif + return 0; + } + } + if (lifetime == 0) { + /* Prefix was not on the list and lifetime = 0, do nothing */ + goto out; + } + + /* New Prefix, allocate one and fill in */ + if ((pfx = kmalloc(sizeof(struct prefix_element), GFP_ATOMIC)) == NULL) { + ADBG(("ipv6_add_prefix: malloc failed\n")); + spin_unlock_bh(&idev->prefix_lock); + read_unlock_bh(&idev->lock); + return -1; + } + INIT_LIST_HEAD(&pfx->list); + memcpy(&pfx->pinfo, pinfo, sizeof(struct prefix_info)); + pfx->pinfo.valid = lifetime; + pfx->timestamp = jiffies; + idev->prefix_count++; + ipv6_addr_copy(&pfx->pinfo.prefix, &prefix); + + list_add(&pfx->list, idev->prefix_list.prev); /* add to end of list */ +#ifdef CONFIG_IPV6_PREFIXLIST_DEBUG + printk(KERN_INFO "%s: adding prefix %s/%d, lifetime = %d\n", + __FUNCTION__, abuf, pfx->pinfo.prefix_len, lifetime); +#endif +out: + spin_unlock_bh(&idev->prefix_lock); + read_unlock_bh(&idev->lock); + + return 0; +} + void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len) { struct prefix_info *pinfo; @@ -880,6 +1045,11 @@ if (rt) dst_release(&rt->u.dst); + if (pinfo->onlink) { + /* Add this prefix to the list of prefixes on this interface */ + ipv6_add_prefix(in6_dev, pinfo, valid_lft); + } + /* Try to figure out our local address for this prefix */ if (pinfo->autoconf && in6_dev->cnf.autoconf) { @@ -1325,6 +1495,8 @@ struct inet6_dev *idev; struct inet6_ifaddr *ifa, **bifa; int i; + struct list_head *pos, *n; + struct prefix_element *pfx; ASSERT_RTNL(); @@ -1387,6 +1559,24 @@ else ipv6_mc_down(idev); + /* Step 5: Free up Prefix List */ + 
spin_lock_bh(&idev->prefix_lock); + list_for_each_safe(pos, n, &idev->prefix_list) { + pfx = list_entry(pos, struct prefix_element, list); +#ifdef CONFIG_IPV6_PREFIXLIST_DEBUG + { + char abuf[64]; + struct prefix_element *pl = (struct prefix_element *)pfx; + in6_ntop(&pl->pinfo.prefix, abuf); + printk(KERN_INFO "%s: deleting prefix %s/%d, lifetime = %d\n", + __FUNCTION__, abuf, pl->pinfo.prefix_len, pl->pinfo.valid); + } +#endif + kfree(pfx); + } + INIT_LIST_HEAD(&idev->prefix_list); + spin_unlock_bh(&idev->prefix_lock); + /* Shot the device (if unregistered) */ if (how == 1) { @@ -1618,6 +1808,10 @@ struct inet6_ifaddr *ifp; unsigned long now = jiffies; int i; + struct list_head *pos, *n; + struct prefix_element *pfx; + struct net_device *dev; + struct inet6_dev *idev; for (i=0; i < IN6_ADDR_HSIZE; i++) { @@ -1659,6 +1853,48 @@ write_unlock(&addrconf_hash_lock); } + /* + * We need to expire prefixes even if no addresses are deleted in the + * loop above, since autoconfiguration may not be set in all router + * advertisements. 
+ */ + read_lock(&dev_base_lock); + for (dev = dev_base; dev; dev = dev->next) { + unsigned long age; + if (!(idev = __in6_dev_get(dev))) { + continue; + } + read_lock_bh(&idev->lock); + spin_lock_bh(&idev->prefix_lock); + if (list_empty(&idev->prefix_list)) { + spin_unlock_bh(&idev->prefix_lock); + read_unlock_bh(&idev->lock); + continue; + } + list_for_each_safe(pos, n, &idev->prefix_list) { + pfx = list_entry(pos, struct prefix_element, list); + if (pfx->pinfo.valid != PINFO_VALID_LIFETIME_INFINITE) { +#ifdef CONFIG_IPV6_PREFIXLIST_DEBUG + char abuf[64]; + + in6_ntop(&pfx->pinfo.prefix, abuf); +#endif + age = (now - pfx->timestamp) / HZ; + if (age > pfx->pinfo.valid) { +#ifdef CONFIG_IPV6_PREFIXLIST_DEBUG + printk(KERN_INFO "%s: deleting prefix %s/%d, lifetime = %d\n", + __FUNCTION__, abuf, pfx->pinfo.prefix_len, pfx->pinfo.valid); +#endif + list_del(&pfx->list); + kfree(pfx); + } + } + } + spin_unlock_bh(&idev->prefix_lock); + read_unlock_bh(&idev->lock); + } + read_unlock(&dev_base_lock); + mod_timer(&addr_chk_timer, jiffies + ADDR_CHECK_FREQUENCY); } --- linux-2.5.31.org/include/net/addrconf.h Mon Aug 19 15:52:53 2002 +++ linux-2.5.31/include/net/addrconf.h Tue Aug 20 14:13:56 2002 @@ -6,6 +6,8 @@ #define MAX_RTR_SOLICITATIONS 3 #define RTR_SOLICITATION_INTERVAL (4*HZ) +#define PINFO_VALID_LIFETIME_INFINITE 0xffffffff /* infinite lifetime */ + #define ADDR_CHECK_FREQUENCY (120*HZ) struct prefix_info { @@ -40,6 +42,12 @@ #define IN6_ADDR_HSIZE 16 +struct prefix_element { + struct list_head list; + struct prefix_info pinfo; + unsigned long timestamp; +}; + extern void addrconf_init(void); extern void addrconf_cleanup(void); @@ -88,6 +96,10 @@ extern void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len); +extern int ipv6_get_prefix_entries( + struct prefix_info **plist, + int ifindex, int plen); + /* Device notifier */ extern int register_inet6addr_notifier(struct notifier_block *nb); extern int unregister_inet6addr_notifier(struct notifier_block 
*nb); --- linux-2.5.31.org/include/net/if_inet6.h Sat Aug 10 18:41:56 2002 +++ linux-2.5.31/include/net/if_inet6.h Tue Aug 20 11:02:01 2002 @@ -96,6 +96,9 @@ struct inet6_ifaddr *addr_list; struct ifmcaddr6 *mc_list; + struct list_head prefix_list; + int prefix_count; + spinlock_t prefix_lock; rwlock_t lock; atomic_t refcnt; __u32 if_flags; --- linux-2.5.31.org/include/linux/inet.h.org Wed Aug 21 15:46:05 2002 +++ linux-2.5.31/include/linux/inet.h Wed Aug 21 15:46:05 2002 @@ -49,5 +49,20 @@ extern void inet_proto_init(struct net_proto *pro); extern __u32 in_aton(const char *str); +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) +#include +extern __inline__ char *in6_ntop(const struct in6_addr *in6, char *buf){ + if (!buf) + return NULL; + sprintf(buf, + "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x", + ntohs(in6->s6_addr16[0]), ntohs(in6->s6_addr16[1]), + ntohs(in6->s6_addr16[2]), ntohs(in6->s6_addr16[3]), + ntohs(in6->s6_addr16[4]), ntohs(in6->s6_addr16[5]), + ntohs(in6->s6_addr16[6]), ntohs(in6->s6_addr16[7])); + return buf; +} +#endif + #endif #endif /* _LINUX_INET_H */ --- linux-2.5.31.org/net/ipv6/Config.in Sat Aug 10 18:41:29 2002 +++ linux-2.5.31/net/ipv6/Config.in Wed Aug 21 10:25:36 2002 @@ -2,6 +2,14 @@ # IPv6 configuration # +# --- overall --- +bool ' IPv6: Verbose debugging messages' CONFIG_IPV6_DEBUG + +# --- NDP (RFC2461) --- +if [ "$CONFIG_IPV6_DEBUG" = "y" ]; then + bool ' IPv6: Prefix List Debugging' CONFIG_IPV6_PREFIXLIST_DEBUG +fi + #bool ' IPv6: flow policy support' CONFIG_RT6_POLICY #bool ' IPv6: firewall support' CONFIG_IPV6_FIREWALL From kuznet@ms2.inr.ac.ru Wed Aug 28 17:27:07 2002 Received: with ECARTIS (v1.0.0; list netdev); Wed, 28 Aug 2002 17:27:08 -0700 (PDT) Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7T0R6tG007909 for ; Wed, 28 Aug 2002 17:27:07 -0700 Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id EAA25255; Thu, 29 Aug 2002 04:32:22 +0400 From: 
kuznet@ms2.inr.ac.ru Message-Id: <200208290032.EAA25255@sex.inr.ac.ru> Subject: Re: [PATCH] IPv6 Prefix List support for 2.5.31 To: kumarkr@us.ibm.com (Krishna Kumar) Date: Thu, 29 Aug 2002 04:32:22 +0400 (MSD) Cc: linux-net@vger.kernel.org, linux-kernel@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: from "Krishna Kumar" at Aug 28, 2 04:55:35 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 X-archive-position: 45 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Content-Length: 486 Lines: 19 Hello! > - Routing table will have lots of address prefixes that are not available > to the local interface addresses. Well, all the direct routes are prefixes by definition. > - To be a prefix list entry, it should come via an RA. Wrong. But this does not matter, RA routes may be tagged with a flag. > - Also, the search over a longer routing table across all nodes is more > time consuming. Do you jest? You compare linear search with lookup in radix tree. :-) Alexey From weixl@caltech.edu Thu Aug 29 03:13:54 2002 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Aug 2002 03:13:56 -0700 (PDT) Received: from chamber.cco.caltech.edu (chamber.its.caltech.edu [131.215.48.55] (may be forged)) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7TADstG028605 for ; Thu, 29 Aug 2002 03:13:54 -0700 Received: from weixl (sonata.caltech.edu [131.215.220.1]) by chamber.cco.caltech.edu (8.12.3/8.12.3) with ESMTP id g7TAHSne006609 for ; Thu, 29 Aug 2002 03:17:28 -0700 (PDT) Message-ID: <037601c24f45$4549fe10$f1fa010a@weixl> From: "Xiaoliang \(David\) Wei" To: References: <3D6C6CF6.9040002@candelatech.com> <20020828074112.A13868@nero.doit.wisc.edu> Subject: A question on RTT estimation of SACKed packet. 
Date: Thu, 29 Aug 2002 03:17:24 -0700 MIME-Version: 1.0 Content-type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2600.0000 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000 X-archive-position: 46 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: weixl@caltech.edu Precedence: bulk X-list: netdev Content-Length: 1023 Lines: 24 Hi Everyone, I am studying the Linux 2.4.19 TCP code. I have a question on the RTT estimation. In the function tcp_clean_rtx_queue (tcp_input.c), all the packet before snd_una is checked: 1. When a packet was not SACKed before, we can calculate the seq_rtt from its skb's timestamp (now - scb->when). 2. When a packet was SACKed but not retransmitted, the linux also calculate the seq_rtt from it when there is no unSACKed packet in the queue. I cannot understand the second situation: The packet was SACKed before, that means it arrived the receiver and triggered a SACK sometime before. The interval between when packet is sent and when the SACK is received should be the RTT for experienced this packet. Even now the packet is ACKed, I don't think this ACK is triggered by this packet. Why is it used to calculate the RTT? Thanks. 
-David Xiaoliang (David) Wei Graduate Student in CS@Caltech http://www.cs.caltech.edu/~weixl ==================================================== From kumarkr@us.ibm.com Thu Aug 29 11:26:04 2002 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Aug 2002 11:26:05 -0700 (PDT) Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.129]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7TIQ4tG022523 for ; Thu, 29 Aug 2002 11:26:04 -0700 Received: from westrelay02.boulder.ibm.com (westrelay02.boulder.ibm.com [9.17.194.23]) by e31.co.us.ibm.com (8.12.2/8.12.2) with ESMTP id g7TITe8v076144; Thu, 29 Aug 2002 14:29:40 -0400 Received: from d03nm801.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by westrelay02.boulder.ibm.com (8.12.3/NCO/VER6.4) with ESMTP id g7TITccs122090; Thu, 29 Aug 2002 12:29:39 -0600 Subject: Re: [PATCH] IPv6 Prefix List support for 2.5.31 To: kuznet@ms2.inr.ac.ru Cc: linux-kernel@vger.kernel.org, linux-net@vger.kernel.org, netdev@oss.sgi.com X-Mailer: Lotus Notes Release 5.0.7 March 21, 2001 Message-ID: From: "Krishna Kumar" Date: Thu, 29 Aug 2002 11:28:15 -0700 X-MIMETrack: Serialize by Router on D03NM801/03/M/IBM(Release 5.0.10 |March 22, 2002) at 08/29/2002 12:29:38 PM MIME-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit X-archive-position: 47 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kumarkr@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 2412 Lines: 72 Hi Alexey, Thanks for your detailed response. I have included my comments to the points you have raised below : > > - Routing table will have lots of address prefixes that are not available > > to the local interface addresses. > > Well, all the direct routes are prefixes by definition. I had written in my previous mail "More code needs to be present to parse each entry in this case.", which is what I think you have also implied. 
E.g., for each routing entry, you need to make sure it is the same device, that it is not a Link Local destination, and that there is no 'next hop' field.

> > - To be a prefix list entry, it should come via an RA.
>
> Wrong. But this does not matter, RA routes may be tagged with a flag.

I disagree about the first part :-) A user can add an address with a new prefix, but that should not be present in the list of prefixes supported on that subnet. The prefix list should contain elements that an RA has advertised as part of the prefixes that the router supports. The RFC is specific about how prefix list entries are to be created (e.g. 6.3.4).

Regarding the second point you made about the 'flag', I agree with you, and I did say something to that effect in my previous mail : "Then the only solution to figure out that this is not a real Prefix entry would involve more code to determine if this is an RA added routing entry or a manual one."

> > - Also, the search over a longer routing table across all nodes is more
> > time consuming.
>
> Do you jest? You compare linear search with lookup in radix tree. :-)

Since there is no key to look up, you always have to walk the entire tree (the key is an address/prefix, but we don't have it). I think the proposed approach would be faster in the case where the routing table is very large and there aren't too many interfaces (I haven't defined "too many" here :-). The number of prefixes on an interface is not important for the search property. The difference is that in the case of a routing table lookup, you go through the entire tree and parse each entry, while in the proposed approach, the linear search is only done to locate the 'dev', and then the work is more straightforward: return all entries after getting the correct 'dev'.

I agree that the Prefix List could be implemented on either the idev list OR the routing table; we just need to agree which is the better place to put the list.
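
The per-interface lookup argued for above can be sketched in userspace C. This is only a rough model under stated assumptions, not the code from the patch: struct iface, struct prefix_entry, add_prefix(), get_prefix_entries(), demo_prefix_lookup() and MAX_PREFIXES are hypothetical names, the list is a fixed array rather than a kernel list_head, there is no locking, and the masking works byte by byte instead of on 32-bit words.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PREFIXES 8          /* hypothetical per-interface limit */

struct prefix_entry {
    uint8_t prefix[16];         /* masked IPv6 prefix */
    int prefix_len;             /* length in bits, 0..128 */
};

struct iface {
    int ifindex;
    int count;
    struct prefix_entry list[MAX_PREFIXES];
};

/* Keep only the leading prefix_len bits of addr and zero the rest. */
void addr_prefix(uint8_t out[16], const uint8_t addr[16], int prefix_len)
{
    int full, rem;

    memset(out, 0, 16);
    if (prefix_len <= 0)
        return;
    if (prefix_len > 128)
        prefix_len = 128;

    full = prefix_len / 8;      /* whole bytes to copy */
    rem = prefix_len % 8;       /* leftover bits in the next byte */
    memcpy(out, addr, (size_t)full);
    if (rem)
        out[full] = addr[full] & (uint8_t)(0xff << (8 - rem));
}

/* Append a masked prefix to one interface's list; -1 if the list is full. */
int add_prefix(struct iface *ifp, const uint8_t addr[16], int prefix_len)
{
    if (ifp->count >= MAX_PREFIXES)
        return -1;
    addr_prefix(ifp->list[ifp->count].prefix, addr, prefix_len);
    ifp->list[ifp->count].prefix_len = prefix_len;
    ifp->count++;
    return 0;
}

/*
 * Linear search for the interface, then copy out every entry on it
 * (plen == 0 means "any length"). Returns the number of entries copied,
 * or -1 if no interface has that index.
 */
int get_prefix_entries(const struct iface *tbl, int nifaces, int ifindex,
                       int plen, struct prefix_entry *out, int max)
{
    int i, j, n = 0;

    for (i = 0; i < nifaces; i++) {
        if (tbl[i].ifindex != ifindex)
            continue;
        for (j = 0; j < tbl[i].count && n < max; j++) {
            if (plen == 0 || tbl[i].list[j].prefix_len == plen)
                out[n++] = tbl[i].list[j];
        }
        return n;
    }
    return -1;
}

/* Hypothetical demo: two interfaces, ask for the /64 prefixes on ifindex 2. */
int demo_prefix_lookup(void)
{
    static const uint8_t a[16] = { 0x20, 0x01, 0x0d, 0xb8, 0x12, 0x34 };
    struct iface tbl[2];
    struct prefix_entry out[MAX_PREFIXES];

    memset(tbl, 0, sizeof(tbl));
    tbl[0].ifindex = 1;
    tbl[1].ifindex = 2;
    add_prefix(&tbl[0], a, 64);
    add_prefix(&tbl[1], a, 64);
    add_prefix(&tbl[1], a, 48);

    return get_prefix_entries(tbl, 2, 2, 64, out, MAX_PREFIXES);
}
```

The cost of the lookup here depends on the number of interfaces plus the number of prefixes on the one interface found, which is the property the argument above relies on; it says nothing about how a flagged-route scheme in the routing table would compare in practice.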
Thanks,

- KK

From kuznet@ms2.inr.ac.ru Thu Aug 29 11:56:00 2002
Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Aug 2002 11:56:01 -0700 (PDT)
Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7TItwtG023310 for ; Thu, 29 Aug 2002 11:55:59 -0700
Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id XAA27022; Thu, 29 Aug 2002 23:01:11 +0400
From: kuznet@ms2.inr.ac.ru
Message-Id: <200208291901.XAA27022@sex.inr.ac.ru>
Subject: Re: A question on RTT estimation of SACKed packet.
To: weixl@caltech.EDU (Xiaoliang \(David\), Wei)
Date: Thu, 29 Aug 2002 23:01:11 +0400 (MSD)
Cc: netdev@oss.sgi.com
In-Reply-To: <037601c24f45$4549fe10$f1fa010a@weixl> from "Xiaoliang \(David\) Wei" at Aug 29, 2 02:45:00 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
X-archive-position: 48
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: kuznet@ms2.inr.ac.ru
Precedence: bulk
X-list: netdev
Content-Length: 853
Lines: 19

Hello!

> 2. When a packet was SACKed but not retransmitted, the linux also
> calculate the seq_rtt from it when there is no unSACKed packet in the queue.
> I cannot understand the second situation: The packet was SACKed before,
> that means it arrived the receiver and triggered a SACK sometime before. The
> interval between when packet is sent and when the SACK is received should be
> the RTT for experienced this packet. Even now the packet is ACKed, I don't
> think this ACK is triggered by this packet. Why is it used to calculate the
> RTT?

It is not used. When a segment fills a hole, tcp uses skb->when of the segment which _filled_ the hole. See?

What's about using SACKs to give additional feed to rtt estimator, even when ACK is duplicate, it is interesting idea, I even read about this somewhere. But we do not use this.
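
The rule stated above (sample from the segment that filled the hole, not from segments that were SACKed earlier) can be modelled roughly as follows. This is a simplified sketch under assumptions, not the logic of tcp_clean_rtx_queue: struct seg and rtt_sample() are hypothetical names, the queue is a plain array of the segments covered by one cumulative ACK, and any retransmitted segment is treated as making the sample ambiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of one segment on the retransmit queue. */
struct seg {
    unsigned long when;   /* send timestamp, in ticks */
    int sacked;           /* already reported by an earlier SACK */
    int retrans;          /* was retransmitted at least once */
};

/*
 * q[0..nacked-1] are the segments newly covered by a cumulative ACK,
 * oldest first. Returns an RTT sample in ticks, or -1 if none is usable.
 */
long rtt_sample(const struct seg *q, size_t nacked, unsigned long now)
{
    size_t i;

    for (i = 0; i < nacked; i++) {
        if (q[i].retrans)
            return -1;    /* cannot tell which transmission was ACKed */
        if (!q[i].sacked)
            return (long)(now - q[i].when);  /* the segment that filled the hole */
        /* previously SACKed: its round trip ended earlier, skip it */
    }
    return -1;
}
```

With a never-SACKed head segment sent at tick 100 followed by two previously SACKed segments, an ACK at tick 150 yields a sample of 50 ticks in this model; if every covered segment had already been SACKed, no sample is taken.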
Alexey

From weixl@caltech.edu Thu Aug 29 15:15:42 2002
Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Aug 2002 15:15:43 -0700 (PDT)
Received: from chamber.cco.caltech.edu (chamber.its.caltech.edu [131.215.48.55] (may be forged)) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7TMFgtG026858 for ; Thu, 29 Aug 2002 15:15:42 -0700
Received: from weixl (sonata.caltech.edu [131.215.220.1]) by chamber.cco.caltech.edu (8.12.3/8.12.3) with ESMTP id g7TMJDne001205; Thu, 29 Aug 2002 15:19:13 -0700 (PDT)
Message-ID: <028501c24faa$18693d60$f1fa010a@weixl>
From: "Xiaoliang \(David\) Wei"
To:
Cc:
References: <200208291901.XAA27022@sex.inr.ac.ru>
Subject: Re: A question on RTT estimation of SACKed packet.
Date: Thu, 29 Aug 2002 15:19:08 -0700
MIME-Version: 1.0
Content-type: text/plain; charset=gb2312
Content-Transfer-Encoding: 8bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000
X-archive-position: 49
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: weixl@caltech.edu
Precedence: bulk
X-list: netdev
Content-Length: 1038
Lines: 27

Thanks, Alexey.

> > 2. When a packet was SACKed but not retransmitted, the linux also
> > calculate the seq_rtt from it when there is no unSACKed packet in the queue.
>
> It is not used. When a segment fills a hole, tcp uses skb->when of the segment
> which _filled_ the hole. See?

Yes, this is exactly my question: when this segment fills a hole, it may have been SACKed before. If it was SACKed before, its "round trip" had already finished before this ACK came. It is filling the hole now only because something strange (such as ACK loss) happened. So I don't think we should use its skb->when to estimate RTT. But in Linux, such a (SACKed) packet may still provide an RTT estimate.
> > What's about using SACKs to give additional feed to rtt estimator, > even when ACK is duplicate, it is intersting idea, I even read about > this somewhere. But we do not use this. Actually, I am going to do that: Can I get a safe RTT estimation from SACK, when the SACKed packet (in rtx queue) is never SACKed or Retransmitted? -David From pekkas@netcore.fi Thu Aug 29 23:34:11 2002 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Aug 2002 23:34:12 -0700 (PDT) Received: from netcore.fi (netcore.fi [193.94.160.1]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7U6Y9tG004787 for ; Thu, 29 Aug 2002 23:34:10 -0700 Received: from localhost (pekkas@localhost) by netcore.fi (8.11.6/8.11.6) with ESMTP id g7U6bcg12850; Fri, 30 Aug 2002 09:37:39 +0300 Date: Fri, 30 Aug 2002 09:37:38 +0300 (EEST) From: Pekka Savola To: David Stevens cc: linux-kernel@vger.kernel.org, , Subject: Re: [PATCH] anycast support for IPv6, linux-2.5.31 In-Reply-To: Message-ID: MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit X-archive-position: 50 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pekkas@netcore.fi Precedence: bulk X-list: netdev Content-Length: 2386 Lines: 40 On Wed, 28 Aug 2002, David Stevens wrote: > 1) The API > Although the RFC's liken anycasting to ordinary unicasting, I think > it's more appropriate to tie it closely to particular applications, so I've > chosen an API similar to multicasting. So, rather than having a permanent > anycast address associated with the machine, particular applications > that use anycasting can join or leave "anycast groups," and the machine will > recognize the anycast addresses as its own when one or more applications have > joined the group. 
> So, for example, someone using anycasting for DNS high availability > can add a join to the anycast group in the server and as long as the DNS server > is running, the machine will answer to that anycast address. But the machine > will not respond to anycasts when the service that's using it isn't available, > so a broken server application that has exited won't deny that service if > there are other working members of the anycast group on other hosts. > I don't know if that's controversial or not-- the RFC's are written > more from the external context, but seem to imply a model along the lines of > using "ifconfig" to add anycast addresses. I think that model doesn't fit the > best uses of anycasting, but I'd like to hear your thoughts on it. > The application interface for joining and leaving anycast groups is 2 > new setsockopt() calls: IPV6_JOIN_ANYCAST and IPV6_LEAVE_ANYCAST. The arguments > are the same as the corresponding multicast operations. The kernel keeps a > reference count of members; when that goes to zero, the anycast address is not > recognized as a local address. While nonzero, the host listens on the solicited > node for that address, sends advertisements in response to solicitations (with > override=0) and delivers packets sent to the anycast address to upper layers. > There's also an in-kernel interface described below, which is used by > IPv6 mobility, for example. Before going too much down this path, I think one should write an Internet Draft about the proposed API (should be quite short & simple) and see what kind of response it has in the relevant working groups. -- Pekka Savola "Tell me of difficulties surmounted, Netcore Oy not those you stumble over and fall" Systems. Networks. Security. 
-- Robert Jordan: A Crown of Swords From dlstevens@us.ibm.com Fri Aug 30 01:13:14 2002 Received: with ECARTIS (v1.0.0; list netdev); Fri, 30 Aug 2002 01:13:15 -0700 (PDT) Received: from e2.ny.us.ibm.com (e2.ny.us.ibm.com [32.97.182.102]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7U8DDtG007175 for ; Fri, 30 Aug 2002 01:13:13 -0700 Received: from northrelay02.pok.ibm.com (northrelay02.pok.ibm.com [9.56.224.150]) by e2.ny.us.ibm.com (8.12.2/8.12.2) with ESMTP id g7U8GmIV271238; Fri, 30 Aug 2002 04:16:48 -0400 Received: from d03nm035.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by northrelay02.pok.ibm.com (8.12.3/NCO/VER6.4) with ESMTP id g7U8Gj0Z020368; Fri, 30 Aug 2002 04:16:45 -0400 From: "David Stevens" Importance: Normal Sensitivity: Subject: Re: [PATCH] anycast support for IPv6, linux-2.5.31 To: Pekka Savola Cc: linux-kernel@vger.kernel.org, , X-Mailer: Lotus Notes Release 5.0.4a July 24, 2000 Message-ID: Date: Fri, 30 Aug 2002 01:16:39 -0700 X-MIMETrack: Serialize by Router on D03NM035/03/M/IBM(Release 5.0.10 |March 22, 2002) at 08/30/2002 02:16:46 AM MIME-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit X-archive-position: 51 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: dlstevens@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 1399 Lines: 43 Pekka, You wrote: >Before going too much down this path, I think one should write an Internet >Draft about the proposed API (should be quite short & simple) and see what >kind of response it has in the relevant working groups. I don't disagree with that, for informational purposes, but it doesn't conflict with the RFC's, which of course don't cover API's, and don't specify any interface for anycasting. However, my primary goal is to get anycasting support with an in-kernel interface in 2.5 before the freeze. 
:-) I used the setsockopt() API for testing, and left it in the patch for others to do the same. Though I think it's the right approach, for the reasons I mentioned, I'd rather see that portion pulled from the patch if it's controversial, than have the in-kernel interface and anycasting proper delayed over that. The one use of anycast I'm aware of right now is for IPv6 mobility, which needs the in-kernel interface. The user-level interface is important for future applications, and a reference-counted setsockopt() interface doesn't mean we can't also have an ip/ifconfig interface for permanent anycast addresses, too (the required anycast addresses in this patch are permanent, for example). So I don't see it as committing to one choice, but having in-kernel anycast support (soon) I think is the more important first step. +-DLS From sbansal@aplion.stpn.soft.net Fri Aug 30 04:01:34 2002 Received: with ECARTIS (v1.0.0; list netdev); Fri, 30 Aug 2002 04:01:35 -0700 (PDT) Received: from dharti.aplion.stpn.soft.net ([203.190.133.225]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7UB1QtG009974 for ; Fri, 30 Aug 2002 04:01:31 -0700 Received: from Saurabh (localhost [127.0.0.1]) by dharti.aplion.stpn.soft.net (8.11.2/8.11.2) with SMTP id g7UB6Cv24544; Fri, 30 Aug 2002 16:36:12 +0530 Message-ID: <010901c25016$d59d17a0$2304a8c0@Ionizer> From: "Saurabh Bansal" To: Cc: Subject: TCP stops sending ACKs after 32 mesg... 
Date: Fri, 30 Aug 2002 16:47:30 +0530 MIME-Version: 1.0 Content-type: text/plain X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2600.0000 X-MIMEOLE: Produced By Microsoft MimeOLE V6.00.2600.0000 Content-Transfer-Encoding: 8bit X-archive-position: 52 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sbansal@aplion.stpn.soft.net Precedence: bulk X-list: netdev Content-Length: 1674 Lines: 38 Hi, It has been more than a week since I posted this query to the LINUX group and haven't heard anything from the group-side in this context. Is my query under processing/observation OR something else?? Hope to get some positive response from your side this time! Thanks And Regards Saurabh ----- Original Message ----- From: Saurabh Bansal To: netdev@oss.sgi.com Cc: linux-net@vger.kernel.org Sent: Wednesday, August 21, 2002 1:26 PM Hi all, I did this experiment with the a set of two processes behaving as a sender/receiver set on the same Linux machine. What the experiment was that--- "The sender has to send some fixed-size (20 bytes) mesgs to the receiver but the receiver has not to pick-up anyone of them." What has been observed is, TCP buffers the first 32 messages into the receive socket queue of receiver, AND THEN it stops sending ack's for next mesg. As a result, the sender keeps on trying to send the 33rd mesg, but gets no ack's and connection gets down within 15-20 minutes. Another important observations about this experiment are : 1) These two processes were running on the same Linux system with kernel version 2.2.xx. This problem doesn't come with Linux-kernel 2.4.xx. 2) If these processes are run on different Linux m/c's with kernel 2.2.xx, the problem doesn't appear. 3) The socket APIs used for sending/receiving mesg are send()/recv(). 4) The problem has no relevance with the size of mesg. I have seen the same behavior with 1-byte mesg to 1k-byte mesg. 
Can anyone tell me if there is any bug/limitation with the Linux-kernel 2.2.xx in comparison to kernel 2.4.xx in the above problem-context?? Saurabh Bansal From matti.aarnio@zmailer.org Fri Aug 30 04:21:12 2002 Received: with ECARTIS (v1.0.0; list netdev); Fri, 30 Aug 2002 04:21:14 -0700 (PDT) Received: from mail.zmailer.org (mail.zmailer.org [62.240.94.4]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7UBLAtG011127 for ; Fri, 30 Aug 2002 04:21:11 -0700 Received: (mea@mea-ext) by mail.zmailer.org id ; Fri, 30 Aug 2002 14:24:48 +0300 Date: Fri, 30 Aug 2002 14:24:48 +0300 From: Matti Aarnio To: Saurabh Bansal Cc: netdev@oss.sgi.com, linux-net@vger.kernel.org Subject: Re: TCP stops sending ACKs after 32 mesg... Message-ID: <20020830112448.GS16533@mea-ext.zmailer.org> References: <010901c25016$d59d17a0$2304a8c0@Ionizer> Mime-Version: 1.0 Content-type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <010901c25016$d59d17a0$2304a8c0@Ionizer> Content-Transfer-Encoding: 8bit X-archive-position: 53 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: matti.aarnio@zmailer.org Precedence: bulk X-list: netdev Content-Length: 2368 Lines: 68 On Fri, Aug 30, 2002 at 04:47:30PM +0530, Saurabh Bansal wrote: > Hi, > > It has been more than a week since I posted this query to the LINUX > group and haven't heard anything from the group-side in this context. > Is my query under processing/observation OR something else?? Slipped in between spam-floods, or something... > Hope to get some positive response from your side this time! > Thanks And Regards > Saurabh > > ----- Original Message ----- > From: Saurabh Bansal > To: netdev@oss.sgi.com > Cc: linux-net@vger.kernel.org > Sent: Wednesday, August 21, 2002 1:26 PM > > Hi all, > > I did this experiment with the a set of two processes behaving as > a sender/receiver set on the same Linux machine. 
What the experiment was
> that---
>
> "The sender has to send some fixed-size (20 bytes) mesgs to the
> receiver but the receiver has not to pick-up anyone of them."
>
> What has been observed is, TCP buffers the first 32 messages into the
> receive socket queue of receiver, AND THEN it stops sending ack's for
> next mesg. As a result, the sender keeps on trying to send the 33rd
> mesg, but gets no ack's and connection gets down within 15-20 minutes.

Are you sure you are emptying the receiver socket when some new data arrives?

> Another important observations about this experiment are :
>
> 1) These two processes were running on the same Linux system with
> kernel version 2.2.xx. This problem doesn't come with Linux-kernel
> 2.4.xx.

I recall having seen this kind of problem in the 2.1/2.2 series; possibly a bug-fix was added in late 2.2 to restore the earlier behaviour of 2.0...

I implemented a workaround in one of my software suites by making the socket non-blocking and adding proper select() calls.

> 2) If these processes are run on different Linux m/c's with kernel
> 2.2.xx, the problem doesn't appear.

Quite so.

> 3) The socket APIs used for sending/receiving mesg are send()/recv().
> 4) The problem has no relevance with the size of mesg. I have seen the
> same behavior with 1-byte mesg to 1k-byte mesg.
>
> Can anyone tell me if there is any bug/limitation with the Linux-kernel
> 2.2.xx in comparison to kernel 2.4.xx in the above problem-context??

I faintly recall something which might be what you are describing.
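
The non-blocking socket plus select() approach mentioned above might look roughly like the following on the receiving side. It is a minimal sketch, not the code from the software suite referred to: set_nonblocking(), drain_socket() and demo_drain() are hypothetical names, and the demo uses an AF_UNIX socketpair as a stand-in for a local TCP connection, so it shows only the receive pattern and does not reproduce the 2.2 behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; returns 0 on success. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Wait up to timeout_ms for data, then read until the socket would block.
 * Returns the number of bytes drained (0 on timeout), or -1 on error.
 */
long drain_socket(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    char buf[4096];
    long total = 0;
    int ready;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;

    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);

        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            break;                      /* peer closed the connection */
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                      /* queue is empty for now */
        return -1;
    }
    return total;
}

/* Hypothetical demo: 40 messages of 20 bytes over a local socket pair. */
long demo_drain(void)
{
    int sv[2];
    char msg[20] = "twenty-byte message";   /* 19 characters plus NUL */
    long got = -1;
    int i;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (set_nonblocking(sv[1]) == 0) {
        for (i = 0; i < 40; i++) {
            if (send(sv[0], msg, sizeof(msg), 0) != (ssize_t)sizeof(msg))
                break;
        }
        if (i == 40)
            got = drain_socket(sv[1], 1000);
    }
    close(sv[0]);
    close(sv[1]);
    return got;
}
```

A real receiver would call drain_socket() in a loop for as long as the connection is open, so that the receive queue never stays full.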
> Saurabh Bansal

/Matti Aarnio

From kuznet@ms2.inr.ac.ru Fri Aug 30 06:24:51 2002
Received: with ECARTIS (v1.0.0; list netdev); Fri, 30 Aug 2002 06:24:52 -0700 (PDT)
Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7UDOotG016886 for ; Fri, 30 Aug 2002 06:24:50 -0700
Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id RAA29187; Fri, 30 Aug 2002 17:29:56 +0400
From: kuznet@ms2.inr.ac.ru
Message-Id: <200208301329.RAA29187@sex.inr.ac.ru>
Subject: Re: TCP stops sending ACKs after 32 mesg...
To: sbansal@aplion.stpn.soft.NET (Saurabh Bansal)
Date: Fri, 30 Aug 2002 17:29:56 +0400 (MSD)
Cc: netdev@oss.sgi.com
In-Reply-To: <010901c25016$d59d17a0$2304a8c0@Ionizer> from "Saurabh Bansal" at Aug 30, 2 05:15:02 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
X-archive-position: 54
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: kuznet@ms2.inr.ac.ru
Precedence: bulk
X-list: netdev
Content-Length: 303
Lines: 10

Hello!

> Can anyone tell me if there is any bug/limitation with the Linux-kernel 2.2.xx in comparison to kernel 2.4.xx in the above problem-context??

Yes, it is well known problem of 2.2.

You forget to read at receiver side, do not you?
Well, the only workaround is not to forget to do this.

Alexey

From kuznet@ms2.inr.ac.ru Fri Aug 30 06:34:59 2002
Received: with ECARTIS (v1.0.0; list netdev); Fri, 30 Aug 2002 06:35:00 -0700 (PDT)
Received: from sex.inr.ac.ru (sex.inr.ac.ru [193.233.7.165]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g7UDYwtG017259 for ; Fri, 30 Aug 2002 06:34:58 -0700
Received: (from kuznet@localhost) by sex.inr.ac.ru (8.6.13/ANK) id RAA29215; Fri, 30 Aug 2002 17:40:08 +0400
From: kuznet@ms2.inr.ac.ru
Message-Id: <200208301340.RAA29215@sex.inr.ac.ru>
Subject: Re: TCP stops sending ACKs after 32 mesg...
To: matti.aarnio@zmailer.ORG (Matti Aarnio)
Date: Fri, 30 Aug 2002 17:40:08 +0400 (MSD)
Cc: netdev@oss.sgi.com
In-Reply-To: <20020830112448.GS16533@mea-ext.zmailer.org> from "Matti Aarnio" at Aug 30, 2 05:15:02 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
X-archive-position: 55
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: kuznet@ms2.inr.ac.ru
Precedence: bulk
X-list: netdev
Content-Length: 610
Lines: 21

Hello!

> I recall having seen this kind of problems in 2.1/2.2 series,
> possibly a bug-fix was added in late 2.2 to restore earlier
> behaviour in 2.0...

No, it was repaired only in 2.4.

Actually, I think 2.0 behaved approximately in the same way,
the source of the bug is deep in linux approach to memory accounting
and 2.0 was not different.

> > 2) If these processes are run on different Linux m/c's with kernel
> > 2.2.xx, the problem doesn't appear.
>
> Quite so.

No, it appears. Only number "32" is different, on ethernet it changes
to ~90 (for size > rx_copy_break) or 1000.

Alexey

From jmorris@intercode.com.au Sat Aug 31 19:01:49 2002
Received: with ECARTIS (v1.0.0; list netdev); Sat, 31 Aug 2002 19:01:51 -0700 (PDT)
Received: from blackbird.intercode.com.au (blackbird.intercode.com.au [203.32.101.10]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g8121ktG015542 for ; Sat, 31 Aug 2002 19:01:47 -0700
Received: from localhost (jmorris@localhost) by blackbird.intercode.com.au (8.9.3/8.9.3) with ESMTP id MAA13962; Sun, 1 Sep 2002 12:05:13 +1000
Date: Sun, 1 Sep 2002 12:05:13 +1000 (EST)
From: James Morris
To: "David S.
Miller" ,
cc: netdev@oss.sgi.com,
Subject: [PATCH] ipv6 compile fix, __FUNCTION__ pasting, 2.5.33
Message-ID:
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
X-archive-position: 57
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: jmorris@intercode.com.au
Precedence: bulk
X-list: netdev

Another __FUNCTION__ pasting fix, for 2.5.33.

- James
--
James Morris

diff -urN -X dontdiff linux-2.5.33.orig/net/ipv6/af_inet6.c linux-2.5.33.w1/net/ipv6/af_inet6.c
--- linux-2.5.33.orig/net/ipv6/af_inet6.c	Sun Sep 1 11:34:46 2002
+++ linux-2.5.33.w1/net/ipv6/af_inet6.c	Sun Sep 1 12:00:07 2002
@@ -663,8 +663,8 @@
 					   sizeof(struct raw6_sock), 0,
 					   SLAB_HWCACHE_ALIGN, 0, 0);
 	if (!tcp6_sk_cachep || !udp6_sk_cachep || !raw6_sk_cachep)
-		printk(KERN_CRIT __FUNCTION__
-			": Can't create protocol sock SLAB caches!\n");
+		printk(KERN_CRIT "%s: Can't create protocol sock SLAB "
+			"caches!\n", __FUNCTION__);
 
 	/* Register the socket-side information for inet6_create. */
 	for(r = &inetsw6[0]; r < &inetsw6[SOCK_MAX]; ++r)

From davem@redhat.com Sat Aug 31 23:06:49 2002
Received: with ECARTIS (v1.0.0; list netdev); Sat, 31 Aug 2002 23:06:53 -0700 (PDT)
Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.5/8.12.5) with SMTP id g8166ntG017623 for ; Sat, 31 Aug 2002 23:06:49 -0700
Received: from localhost (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with ESMTP id XAA20385; Sat, 31 Aug 2002 23:03:40 -0700
Date: Sat, 31 Aug 2002 23:03:39 -0700 (PDT)
Message-Id: <20020831.230339.121297948.davem@redhat.com>
To: jmorris@intercode.com.au
Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ipv6 compile fix, __FUNCTION__ pasting, 2.5.33
From: "David S.
Miller"
In-Reply-To:
References:
X-Mailer: Mew version 2.1 on Emacs 21.1 / Mule 5.0 (SAKAKI)
Mime-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit
X-archive-position: 58
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davem@redhat.com
Precedence: bulk
X-list: netdev

   From: James Morris
   Date: Sun, 1 Sep 2002 12:05:13 +1000 (EST)

   Another __FUNCTION__ pasting fix, for 2.5.33.

Applied to my tree.
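
For readers following the __FUNCTION__ change in the patch above, here is a
minimal userspace sketch of the same idea. It is an illustration only, not
kernel code: LOG_ERR and report_cache_failure are hypothetical names, the
"<2>" prefix merely mimics what KERN_CRIT expanded to at the time, and the
sketch uses the C99 spelling __func__ in place of __FUNCTION__. The point is
that the function name is not a string literal, so it cannot be pasted next
to one and has to be passed as a "%s" argument instead.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for printk(): formats the message into buf
 * instead of writing to the kernel log. "<2>" mimics KERN_CRIT. */
#define LOG_ERR(buf, size, fmt, ...) \
	snprintf((buf), (size), "<2>" fmt, __VA_ARGS__)

/*
 * The old form relied on pasting the function name onto a literal:
 *
 *     LOG_ERR(buf, size, __FUNCTION__ ": Can't create caches!\n");
 *
 * Compilers that treat the function name as an identifier rather than
 * a string literal reject this, hence the "%s" form below.
 */

/* Hypothetical function; builds the message the way the patch does. */
static int report_cache_failure(char *buf, size_t size)
{
	/* Adjacent literals still concatenate; the name is an argument. */
	return LOG_ERR(buf, size, "%s: Can't create protocol sock SLAB "
		       "caches!\n", __func__);
}
```

Calling report_cache_failure(buf, sizeof(buf)) on a sufficiently large
buffer leaves the prefixed message, including the function's own name, in
buf and returns its length.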