From owner-netdev@oss.sgi.com Wed Mar 1 01:27:27 2000
Received: by oss.sgi.com id ; Wed, 1 Mar 2000 01:27:18 -0800
Received: from pizda.ninka.net ([216.101.162.242]:16000 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Wed, 1 Mar 2000 01:27:02 -0800
Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id BAA01711; Wed, 1 Mar 2000 01:22:47 -0800
Date: Wed, 1 Mar 2000 01:22:47 -0800
Message-Id: <200003010922.BAA01711@pizda.ninka.net>
X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f
From: "David S. Miller"
To: rusty@linuxcare.com.au
CC: netdev@oss.sgi.com
In-reply-to: (message from Rusty Russell on Wed, 01 Mar 2000 14:55:32 +1100)
Subject: Re: [PATCH] Baby Bear: Netfilter merge patch I vs. vger
References:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

   From: Rusty Russell
   Date: Wed, 01 Mar 2000 14:55:32 +1100

   This is JamesM's netfilter queueing registration changes.

Applied, sent to Linus, thanks a lot.

Later,
David S. Miller
davem@redhat.com

From owner-netdev@oss.sgi.com Wed Mar 1 10:46:12 2000
Received: by oss.sgi.com id ; Wed, 1 Mar 2000 10:46:02 -0800
Received: from ppp68.arobas.net ([205.205.36.138]:5380 "HELO dialin156.ottawa.globalserve.net") by oss.sgi.com with SMTP id ; Wed, 1 Mar 2000 10:45:41 -0800
Received: (qmail 571 invoked by uid 1000); 1 Mar 2000 18:43:28 -0000
Date: Wed, 1 Mar 2000 13:43:28 -0500
From: jetienne@arobas.net
To: netdev@oss.sgi.com
Subject: write with mmaped packet ?
Message-ID: <20000301134328.A490@long-haul.net>
Reply-To: jetienne@arobas.net
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Mutt/1.0i
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

hi,

i know it is possible to receive packets with mmapped io (alexey's
tcpdump), but is it possible to send packets with a similar mechanism ?
thanks

From owner-netdev@oss.sgi.com Wed Mar 1 14:57:01 2000
Received: by oss.sgi.com id ; Wed, 1 Mar 2000 14:56:51 -0800
Received: from gw.chygwyn.com ([62.172.158.50]:58119 "EHLO gw.chygwyn.com") by oss.sgi.com with ESMTP id ; Wed, 1 Mar 2000 14:56:37 -0800
Received: (from steve@localhost) by gw.chygwyn.com (8.9.3/8.9.3) id VAA23256; Wed, 1 Mar 2000 21:41:04 GMT
From: Steve Whitehouse
Message-Id: <200003012141.VAA23256@gw.chygwyn.com>
Subject: New DECnet patch & proposed removal of DECnet raw sockets
To: linux-decnet@dreamtime.org, netdev@oss.sgi.com
Date: Wed, 1 Mar 2000 21:41:04 +0000 (GMT)
Organization: ChyGywn Limited
X-RegisteredOffice: 7, New Yatt Road, Witney, Oxfordshire. OX8 6NU England
X-RegisteredNumber: 03887683
Reply-To: Steve Whitehouse
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hi,

My latest DECnet patch is out (against 2.3.48) with the following
features:

 o Bug fix sending disc messages (from Eduardo Serrat)
 o DECnet is now fully threaded (Correctly I hope!)
 o Bug fix to receiving disc messages (complement to Eduardo's patch)
 o Bug fix to make dnetd work once again
 o DECnet is going to use netfilter to move routing packets to a
   userland routing daemon, hence the new hooks and the increment
   of NF_MAX_HOOKS which is why I've cc'd Rusty.

I don't normally send these announcements to netdev, but I'm doing so
this time as I've a question to ask which I'd like to reach a wider
audience than usual...

Has anybody used DECnet raw sockets for anything? So far as I'm aware
nobody has. I added raw sockets to DECnet only to allow a userland
routing daemon (as yet unwritten) to read DECnet routing messages. This
can be much more easily achieved using netfilter so I'm proposing the
removal of raw sockets (they are probably not safe now with the new SMP
DECnet code anyway).
If I don't hear any objections, I'll send in a patch to remove DECnet
raw sockets from the kernel in a week or so,

Steve.

P.S. The patch is in the usual place on ftp.sucs.swan.ac.uk and has been
submitted to the main kernel tree, so should appear shortly.

From owner-netdev@oss.sgi.com Thu Mar 2 01:25:45 2000
Received: by oss.sgi.com id ; Thu, 2 Mar 2000 01:25:34 -0800
Received: from fep4-orange.clear.net.nz ([203.97.32.4]:6298 "EHLO fep4-orange.clear.net.nz") by oss.sgi.com with ESMTP id ; Thu, 2 Mar 2000 01:25:13 -0800
Received: from klaatu.patho.gen.nz (d1-u17.test.clear.net.nz [203.97.54.209]) by fep4-orange.clear.net.nz (1.5/1.6) with ESMTP id WAA02139; Thu, 2 Mar 2000 22:24:18 +1300 (NZDT)
Message-ID: <38BE3341.79D4A80B@klaatu.patho.gen.nz>
Date: Thu, 02 Mar 2000 22:24:17 +1300
From: Michael Clark
X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14 i686)
X-Accept-Language: en
MIME-Version: 1.0
To: Bill Wendling
CC: root, linux-kernel@vger.rutgers.edu, linux-net@vger.rutgers.edu, netdev@oss.sgi.com
Subject: Re: bug: 3c509 broken on 2.3.48
References: <20000302102603.A1091@sirius.ftsm.ukm.my> <20000302015357.A10322@ganymede.isdn.uiuc.edu>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Bill Wendling wrote:
>
> Also sprach root:
> } hai,
> }
> } a bug report:
> }
> } 3c509 is broken on 2.3.48. trying to manually using ifconfig and route give
> } the following error;
> }
> } ifup: SIOCSIFFLAGS: No such device
> } ifup: SIOCADDRT: Network is down
> } ifup: SIOCADDRT: Network is unreachable
> } ifup: SIOCADDRT: Network is unreachable
> }
> } 3c509 work on 2.3.45. I haven't try 2.3.[46-47]
> }
> It doesn't work for them either...And I don't think it'll work for 4[89].
> No patches have been sent for them...

[snip]

This patch was sent. Also fixed NE2000 clone for me.
I'm surprised more net drivers were not broken - 2.3.47 had some big net
changes with drivers checking link state differently.

Has someone got this patch into 2.3.49?

~mc

Tim Waugh wrote:
>
> On Sun, 27 Feb 2000, Pete Clements wrote:
>
> > FYI:
> > With both 2.3.47 and 48, unable to up the 3c509 driver when compiled
> > into the kernel. Works fine when compiled as module.
>
> Does this patch help?
>
> Tim.
> */
>
> --- linux/net/core/dev.c~	Sun Feb 27 17:53:51 2000
> +++ linux/net/core/dev.c	Sun Feb 27 17:54:06 2000
> @@ -2128,6 +2128,7 @@
>  			dev->iflink = dev->ifindex;
>  			if (dev->rebuild_header == NULL)
>  				dev->rebuild_header = default_rebuild_header;
> +			set_bit(__LINK_STATE_PRESENT, &dev->state);
>  			dev_init_scheduler(dev);
>  		}
>  	}

From owner-netdev@oss.sgi.com Thu Mar 2 05:07:26 2000
Received: by oss.sgi.com id ; Thu, 2 Mar 2000 05:07:16 -0800
Received: from c18787251.telekabel.chello.nl ([212.187.87.251]:48117 "HELO blackmail.zwoel.org") by oss.sgi.com with SMTP id ; Thu, 2 Mar 2000 05:06:55 -0800
Received: by blackmail.zwoel.org (Postfix, from userid 801) id 005C412007; Thu, 2 Mar 2000 14:06:40 +0100 (CET)
Date: Thu, 2 Mar 2000 14:06:40 +0100
From: Johan van Selst
To: netdev@oss.sgi.com
Subject: Strange IPv6 fragmentation
Message-ID: <20000302140640.A1859@panther.zwoel.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Mutt/1.1.2i
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Linux replies to large (fragmented) ping6 packets with the fragments
sent in an unusual order. This seems to confuse Cisco routers, and with
large packets (16k) it actually makes a pinging Cisco crash (ok, that's
his fault, but still...) Other systems we have tried (BSD/KAME) don't
show this strange behaviour.
Is there any reason why Linux sends fragments in this order?
tcpdump snippet (turtle is a FreeBSD 3.4+KAME machine; panther is a
Linux 2.3.48 system, but 2.2.10 shows the same behaviour):

13:48:13.014602 turtle.ipv6.stack.nl > panther.ipv6.stack.nl: frag (0|1432) icmp6: echo request
13:48:13.015091 turtle.ipv6.stack.nl > panther.ipv6.stack.nl: frag (1432|1432)
13:48:13.016322 turtle.ipv6.stack.nl > panther.ipv6.stack.nl: frag (2864|1432)
13:48:13.017264 turtle.ipv6.stack.nl > panther.ipv6.stack.nl: frag (4296|512)
13:48:13.410755 panther.ipv6.stack.nl > turtle.ipv6.stack.nl: frag (2864|1432)
13:48:13.570641 panther.ipv6.stack.nl > turtle.ipv6.stack.nl: frag (1432|1432)
13:48:13.665278 panther.ipv6.stack.nl > turtle.ipv6.stack.nl: frag (0|1432) icmp6: echo reply
13:48:13.700859 panther.ipv6.stack.nl > turtle.ipv6.stack.nl: frag (4296|512)

Greetings,
Johan van Selst

From owner-netdev@oss.sgi.com Thu Mar 2 05:18:46 2000
Received: by oss.sgi.com id ; Thu, 2 Mar 2000 05:18:26 -0800
Received: from c18787251.telekabel.chello.nl ([212.187.87.251]:49909 "HELO blackmail.zwoel.org") by oss.sgi.com with SMTP id ; Thu, 2 Mar 2000 05:18:20 -0800
Received: by blackmail.zwoel.org (Postfix, from userid 801) id D127F12007; Thu, 2 Mar 2000 14:18:15 +0100 (CET)
Date: Thu, 2 Mar 2000 14:18:15 +0100
From: Johan van Selst
To: netdev@oss.sgi.com
Subject: Re: Strange IPv6 fragmentation
Message-ID: <20000302141815.B1859@panther.zwoel.org>
References: <20000302140640.A1859@panther.zwoel.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Mutt/1.1.2i
In-Reply-To: <20000302140640.A1859@panther.zwoel.org>; from johans@stack.nl on Thu, Mar 02, 2000 at 02:06:40PM +0100
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Johan van Selst wrote:
> Linux replies to large (fragmented) ping6 packets with the fragments
> sent in an unusual order.

I should check things properly before I send out an email, so that I
won't have to reply to myself.
Usually Linux seems to send fragments in reverse order. With IPv6 it
does the same, however the last fragment now is sent at the end (ie
just after the first fragment). This behaviour is not specific for ping
replies.

Ciao,
Johan

From owner-netdev@oss.sgi.com Thu Mar 2 11:03:32 2000
Received: by oss.sgi.com id ; Thu, 2 Mar 2000 11:03:21 -0800
Received: from minus.inr.ac.ru ([193.233.7.97]:45582 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Thu, 2 Mar 2000 11:02:59 -0800
Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA28655; Thu, 2 Mar 2000 22:02:40 +0300
From: kuznet@ms2.inr.ac.ru
Message-Id: <200003021902.WAA28655@ms2.inr.ac.ru>
Subject: Re: Strange IPv6 fragmentation
To: johans@stack.NL (Johan van Selst)
Date: Thu, 2 Mar 2000 22:02:40 +0300 (MSK)
Cc: netdev@oss.sgi.com
In-Reply-To: <20000302140640.A1859@panther.zwoel.org> from "Johan van Selst" at Mar 2, 0 04:13:13 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Length: 395
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hello!

> behaviour. Is there any reason why Linux sends fragments in this order?

It used to be the most efficient one with Linux defragmenter
used during 2.1. I remade it about 2.2.0 to make particular order
of fragments not so essential, but fragmenter was not updated
to look nicer.

Well, if this order really confuses some broken software,
it is reason to think about changing it.
Alexey

From owner-netdev@oss.sgi.com Thu Mar 2 11:58:32 2000
Received: by oss.sgi.com id ; Thu, 2 Mar 2000 11:58:23 -0800
Received: from kogge.hanse.de ([192.76.134.17]:46346 "EHLO kogge.Hanse.DE") by oss.sgi.com with ESMTP id ; Thu, 2 Mar 2000 11:58:12 -0800
Received: (from uucp@localhost) by kogge.Hanse.DE (8.9.3/8.9.1) with UUCP id VAA30531; Thu, 2 Mar 2000 21:00:01 +0100 (CET) (envelope-from eis@baty.hanse.de)
Received: (from eis@localhost) by baty.hanse.de (8.9.3/8.9.3) id UAA13661; Thu, 2 Mar 2000 20:34:46 +0100
Date: Thu, 2 Mar 2000 20:34:46 +0100
From: Henner Eisen
Message-Id: <200003021934.UAA13661@baty.hanse.de>
To: Steve@ChyGwyn.com
CC: netdev@oss.sgi.com
In-reply-to: <200002290930.JAA00926@gw.chygwyn.com> (message from Steve Whitehouse on Tue, 29 Feb 2000 09:30:31 +0000 (GMT))
Subject: Re: MSG_EOR flag
References: <200002290930.JAA00926@gw.chygwyn.com>
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hi,

>>>>> "Steve" == Steve Whitehouse writes:

    Steve> Hi,

    Steve> The flag is part of the draft POSIX standard, which I
    Steve> followed when updating the code. Basically it works like
    Steve> this...

Is this described in the documents available on the net? Last time
when I checked (probably too long ago), socket stuff not important for
AF_INET was not really covered by the documents.

    Steve> o sendmsg() - MSG_EOR indicates that the data sent is the
    Steve> last in a record. Data in a sendmsg() call must all be part
    Steve> of one record. If the sendmsg() call returns before all
    Steve> the data is sent, then the MSG_EOR is ignored so that the
    Steve> rest of the record may be sent in a later sendmsg() call
    Steve> when more buffer space is available.

o.k.

    Steve> o recvmsg() - Each call may only return data from a single
    Steve> record. MSG_EOR is returned when the last bit of data from
    Steve> a record is returned.

Now I'm a little confused.
Does POSIX always allow to return partial packets (some fragments
only), just the recvmsg() containing the last fragment needs to set
MSG_EOR? Or does POSIX even require that each segment is passed
separately?

    Steve> There was an obvious problem, in that with write() you
    Steve> can't specify any flags, so the implicit MSG_EOR was added
    Steve> so that it could do something vaguely sensible without
    Steve> forcing people to use sendmsg() to terminate each
    Steve> record. (POSIX doesn't actually say if write() should have
    Steve> an implicit MSG_EOR or not, it stays silent on the matter)

    Steve> As to what to do when the sockets don't support sending
    Steve> fragments, I'd suggest that they ought to print out a
    Steve> message (suitably rate limited) that the application is an
    Steve> old one and should be updated if it does not set
    Steve> MSG_EOR. This seems to be the usual approach that I've seen
    Steve> elsewhere around the kernel. It's probably also a good idea
    Steve> to update as many applications as possible before this
    Steve> message is added though.

Should older kernels (2.2.x) also be modified to at least accept MSG_EOR
such that the recompiled application (which set MSG_EOR properly) will
continue to work with 2.2.x?

    Steve> btw, are you thinking about AX.25 ? So far as I'm aware it's
    Steve> the only other SEQPACKET supporting protocol...

I was thinking about X.25, where write() broke from the implicit MSG_EOR.
I did fix that soon, and also made MSG_EOR mandatory because the current
X.25 does not support sending incomplete packets. From your rationale above
(thanks!), this seems to be the correct behavior. Now, I understand, that
I also need to always set MSG_EOR in recvmsg (because the current PF_X25
does never return partial packets).

The further question is what to do with 2.2.15. As PF_X25 is marked
CONFIG_EXPERIMENTAL, people are expected to deal with occasional API
changes (I don't think that there are applications using send/recvmsg
anyway).
As people will continue to use 2.2.x for some time, I think
it is appropriate to set MSG_EOR on recvmsg() and accept (but not enforce)
MSG_EOR on sendmsg() for 2.2.x, too.

Or should I better leave it in 2.2.x as it currently is?

Henner

From owner-netdev@oss.sgi.com Thu Mar 2 12:23:35 2000
Received: by oss.sgi.com id ; Thu, 2 Mar 2000 12:23:17 -0800
Received: from ganymede.isdn.uiuc.edu ([192.17.19.210]:14601 "EHLO ganymede.isdn.uiuc.edu") by oss.sgi.com with ESMTP id ; Thu, 2 Mar 2000 12:22:58 -0800
Received: (from wendling@localhost) by ganymede.isdn.uiuc.edu (8.9.3/8.9.3) id PAA16835; Thu, 2 Mar 2000 15:21:36 -0500
Date: Thu, 2 Mar 2000 14:21:36 -0600
From: Bill Wendling
To: Michael Clark
Cc: root, linux-kernel@vger.rutgers.edu, linux-net@vger.rutgers.edu, netdev@oss.sgi.com
Subject: Re: bug: 3c509 broken on 2.3.48
Message-ID: <20000302142136.C16280@ganymede.isdn.uiuc.edu>
References: <20000302102603.A1091@sirius.ftsm.ukm.my> <20000302015357.A10322@ganymede.isdn.uiuc.edu> <38BE3341.79D4A80B@klaatu.patho.gen.nz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
In-Reply-To: <38BE3341.79D4A80B@klaatu.patho.gen.nz>; from mclark@klaatu.patho.gen.nz on Thu, Mar 02, 2000 at 10:24:17PM +1300
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Also sprach Michael Clark:
} This patch was sent. Also fixed NE2000 clone for me. I'm surprised more
} net drivers were not broken - 2.3.47 had some big net changes with
} drivers checking link state differently.
}
} Has someone got this patch into 2.3.49?
}
} ~mc
}
} Tim Waugh wrote:
} >
} > On Sun, 27 Feb 2000, Pete Clements wrote:
} >
} > > FYI:
} > > With both 2.3.47 and 48, unable to up the 3c509 driver when compiled
} > > into the kernel. Works fine when compiled as module.
} >
} > Does this patch help?
} >
} > Tim.
} > */
} >
} > --- linux/net/core/dev.c~	Sun Feb 27 17:53:51 2000
} > +++ linux/net/core/dev.c	Sun Feb 27 17:54:06 2000
} > @@ -2128,6 +2128,7 @@
} >  			dev->iflink = dev->ifindex;
} >  			if (dev->rebuild_header == NULL)
} >  				dev->rebuild_header = default_rebuild_header;
} > +			set_bit(__LINK_STATE_PRESENT, &dev->state);
} >  			dev_init_scheduler(dev);
} >  		}
} >  	}
}

This is in the 2.3.49pre2 kernel. I didn't see it before. Sorry.
However, the sequence of instructions is

	dev_init_scheduler(dev);
	set_bit(__LINK_STATE_PRESENT, &dev->state);

which I'm hoping is okay...

-- 
|| Bill Wendling			wendling@ganymede.isdn.uiuc.edu

From owner-netdev@oss.sgi.com Thu Mar 2 13:28:56 2000
Received: by oss.sgi.com id ; Thu, 2 Mar 2000 13:28:37 -0800
Received: from gw.chygwyn.com ([62.172.158.50]:61705 "EHLO gw.chygwyn.com") by oss.sgi.com with ESMTP id ; Thu, 2 Mar 2000 13:28:07 -0800
Received: (from steve@localhost) by gw.chygwyn.com (8.9.3/8.9.3) id VAA28963; Thu, 2 Mar 2000 21:21:41 GMT
From: Steve Whitehouse
Message-Id: <200003022121.VAA28963@gw.chygwyn.com>
Subject: Re: MSG_EOR flag
To: eis@baty.hanse.de (Henner Eisen)
Date: Thu, 2 Mar 2000 21:21:41 +0000 (GMT)
Cc: netdev@oss.sgi.com
In-Reply-To: <200003021934.UAA13661@baty.hanse.de> from "Henner Eisen" at Mar 02, 2000 08:34:46 PM
Organization: ChyGywn Limited
X-RegisteredOffice: 7, New Yatt Road, Witney, Oxfordshire. OX8 6NU England
X-RegisteredNumber: 03887683
Reply-To: Steve Whitehouse
X-Mailer: ELM [version 2.5 PL1]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hi,

>
>
> Hi,
>
> >>>>> "Steve" == Steve Whitehouse writes:
>
>     Steve> Hi,
>
>     Steve> The flag is part of the draft POSIX standard, which I
>     Steve> followed when updating the code. Basically it works like
>     Steve> this...
>
> Is this described in the documents available on the net? Last time
> when I checked (probably too long ago), socket stuff not important for
> AF_INET was not really covered by the documents.
>
So far as I'm aware there is nothing available on the net, unfortunately.

[some deleted text]

>
>     Steve> o recvmsg() - Each call may only return data from a single
>     Steve> record. MSG_EOR is returned when the last bit of data from
>     Steve> a record is returned.
>
> Now I'm a little confused. Does POSIX always allow to return partial
> packets (some fragments only), just the recvmsg() containing the last
> fragment needs to set MSG_EOR? Or does POSIX even require that each
> segment is passed separately?
>

You can return partial packets, but each call to recvmsg() must only
return part (or whole) of one record. It must never return parts of
more than one record in a single call. MSG_EOR is only set by recvmsg()
on the final part of a record.

You can just set MSG_EOR (as you suggest below) on each recvmsg() call
if each call always results in a whole record being copied to the user.
This is very unlikely to result in correct behaviour though... you've no
idea (from the kernel side) how big a buffer a user is going to give you
to put the data in. Unlike SOCK_DGRAM you must not discard records which
don't fit in the buffer, but must keep the part not yet sent to the user
so the user can request it later.

POSIX says that SOCK_SEQPACKET is basically identical to SOCK_STREAM
except for the MSG_EOR markers which just act like tags at certain
points in the data stream. Every time you come across a MSG_EOR marker
it marks the end of a recvmsg() or sendmsg() call.

If you are looking for the "quick fix" for 2.2.xx though I'd certainly
support your suggestion of always returning MSG_EOR from recvmsg() over
the current behaviour. So long as the applications always have a large
enough buffer size, which from your comments I gather they do, then
everything should work fine.
[more deleted text]

>
> Should older kernels (2.2.x) also be modified to at least accept MSG_EOR
> such that the recompiled application (which set MSG_EOR properly) will
> continue to work with 2.2.x?
>
Hmm. Good question. It's not a bad idea to do that I guess. Probably the
person to ask is Alan Cox since he looks after the 2.2.xx kernels now.

>     Steve> btw, are you thinking about AX.25 ? So far as I'm aware it's
>     Steve> the only other SEQPACKET supporting protocol...
>
> I was thinking about X.25, where write() broke from the implicit MSG_EOR.
> I did fix that soon, and also made MSG_EOR mandatory because the current
> X.25 does not support sending incomplete packets. From your rationale above
> (thanks!), this seems to be the correct behavior. Now, I understand, that
> I also need to always set MSG_EOR in recvmsg (because the current PF_X25
> does never return partial packets).
>
Yes, but see my earlier comments.

> The further question is what to do with 2.2.15. As PF_X25 is marked
> CONFIG_EXPERIMENTAL, people are expected to deal with occasional API
> changes (I don't think that there are applications using send/recvmsg
> anyway). As people will continue to use 2.2.x for some time, I think
> it is appropriate to set MSG_EOR on recvmsg() and accept (but not enforce)
> MSG_EOR on sendmsg() for 2.2.x, too.
>
> Or should I better leave it in 2.2.x as it currently is?
>
> Henner
>
I think it's probably a good idea to do the patch for 2.2.xx. It's fairly
easily done and I can't see it having any unwanted side effects,

Steve.
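
The recvmsg() rules described in this thread (one call returns data from
at most one record, the unread tail of a record is kept for the next
call, and the end-of-record flag is set only on the final part) can be
sketched in user space. This is a minimal illustration under stated
assumptions, not kernel code and not the DECnet or X.25 implementation:
the names demo_record, demo_queue, demo_recv and DEMO_MSG_EOR are
hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag for this sketch; it plays the role MSG_EOR plays
 * in the msg_flags returned by recvmsg(). */
#define DEMO_MSG_EOR 0x80

/* One queued record (hypothetical type). */
struct demo_record {
	const char *data;
	size_t len;
};

/* A queue of records plus the read position inside the current one. */
struct demo_queue {
	const struct demo_record *recs;
	size_t nrecs;
	size_t cur;	/* index of the record being read */
	size_t off;	/* bytes of that record already delivered */
};

/*
 * Copy at most buflen bytes into buf, taken from the current record
 * only. The undelivered tail stays queued for the next call, and
 * DEMO_MSG_EOR is reported only when the last byte of the record has
 * been handed over. Returns the number of bytes copied.
 */
size_t demo_recv(struct demo_queue *q, char *buf, size_t buflen, int *flags)
{
	const struct demo_record *r;
	size_t left, n;

	*flags = 0;
	if (q->cur >= q->nrecs || buflen == 0)
		return 0;

	r = &q->recs[q->cur];
	assert(q->off <= r->len);

	left = r->len - q->off;
	n = left < buflen ? left : buflen;
	memcpy(buf, r->data + q->off, n);
	q->off += n;

	if (q->off == r->len) {
		/* Final part of this record: flag it and move on. */
		*flags |= DEMO_MSG_EOR;
		q->cur++;
		q->off = 0;
	}
	return n;
}
```

With a 4-byte buffer, a 5-byte record followed by a 3-byte record comes
back as three reads of 4, 1 and 3 bytes, with the flag set on the second
and third reads only. The "quick fix" of always setting the flag is
equivalent to this sketch only when the caller's buffer is never smaller
than a record.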

From owner-netdev@oss.sgi.com Sat Mar 4 08:07:07 2000
Received: by oss.sgi.com id ; Sat, 4 Mar 2000 08:06:47 -0800
Received: from linuxcare.canberra.net.au ([203.29.91.49]:22541 "EHLO front.linuxcare.com.au") by oss.sgi.com with ESMTP id ; Sat, 4 Mar 2000 08:06:20 -0800
Received: from halfway.linuxcare.com.au (penicillin.linuxcare.com.au [10.61.2.27]) by front.linuxcare.com.au (8.9.3/8.9.3/Debian 8.9.3-6) with ESMTP id DAA09297; Sun, 5 Mar 2000 03:06:09 +1100
X-Authentication-Warning: front.linuxcare.com.au: Host penicillin.linuxcare.com.au [10.61.2.27] claimed to be halfway.linuxcare.com.au
Received: from linuxcare.com.au (really [127.0.0.1]) by linuxcare.com.au via in.smtpd with esmtp id (Debian Smail3.2.0.102) for ; Sun, 5 Mar 2000 03:06:08 +1100 (EST)
Message-Id:
From: Rusty Russell
To: davem@redhat.com
cc: netdev@oss.sgi.com
Subject: [PATCH] Mummy Bear: Netfilter merge patch II vs. vger
Date: Sun, 05 Mar 2000 03:06:08 +1100
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

This requires a bit more code for connection tracking with tunnels to
work: `nf_conntrack_put(skb->nfct); skb->nfct=NULL;' when untunnelling
packets. If a tunnelling person wants to add that...

Index: include/linux/netfilter.h
===================================================================
RCS file: /cvs/linux/linux/include/linux/netfilter.h,v
retrieving revision 1.7
diff -u -r1.7 netfilter.h
--- include/linux/netfilter.h	2000/03/01 20:34:48	1.7
+++ include/linux/netfilter.h	2000/03/03 08:10:16
@@ -179,7 +179,6 @@
 	NF_REASON_SET_BY_IPCHAINS,
 	NF_REASON_FOR_ROUTING,
 	NF_REASON_FOR_CLS_FW,
-	NF_REASON_MIN_RESERVED_FOR_CONNTRACK = 1024,
 };
 
 #endif /*__LINUX_NETFILTER_H*/
Index: include/linux/skbuff.h
===================================================================
RCS file: /cvs/linux/linux/include/linux/skbuff.h,v
retrieving revision 1.66
diff -u -r1.66 skbuff.h
--- include/linux/skbuff.h	2000/02/18 16:47:01	1.66
+++ include/linux/skbuff.h	2000/03/03 08:10:18
@@ -37,6 +37,17 @@
 #define NET_CALLER(arg) __builtin_return_address(0)
 #endif
 
+#ifdef CONFIG_NETFILTER
+struct nf_conntrack {
+	atomic_t use;
+	void (*destroy)(struct nf_conntrack *);
+};
+
+struct nf_ct_info {
+	struct nf_conntrack *master;
+};
+#endif
+
 struct sk_buff_head {
 	/* These two members must be first. */
 	struct sk_buff	* next;
@@ -115,6 +126,8 @@
 	__u32		nfreason;
 	/* Cache info */
 	__u32		nfcache;
+	/* Associated connection, if any */
+	struct nf_ct_info *nfct;
 #ifdef CONFIG_NETFILTER_DEBUG
 	unsigned int nf_debug;
 #endif
@@ -634,6 +647,21 @@
 
 extern void skb_init(void);
 extern void skb_add_mtu(int mtu);
+
+#ifdef CONFIG_NETFILTER
+extern __inline__ void
+nf_conntrack_put(struct nf_ct_info *nfct)
+{
+	if (nfct && atomic_dec_and_test(&nfct->master->use))
+		nfct->master->destroy(nfct->master);
+}
+extern __inline__ void
+nf_conntrack_get(struct nf_ct_info *nfct)
+{
+	if (nfct)
+		atomic_inc(&nfct->master->use);
+}
+#endif
 
 #endif	/* __KERNEL__ */
 #endif	/* _LINUX_SKBUFF_H */
Index: net/core/skbuff.c
===================================================================
RCS file: /cvs/linux/linux/net/core/skbuff.c,v
retrieving revision 1.68
diff -u -r1.68 skbuff.c
--- net/core/skbuff.c	2000/02/18 16:47:18	1.68
+++ net/core/skbuff.c	2000/03/03 08:10:26
@@ -204,6 +204,7 @@
 	skb->rx_dev = NULL;
 #ifdef CONFIG_NETFILTER
 	skb->nfmark = skb->nfreason = skb->nfcache = 0;
+	skb->nfct = NULL;
 #ifdef CONFIG_NETFILTER_DEBUG
 	skb->nf_debug = 0;
 #endif
@@ -246,6 +247,9 @@
 		}
 		skb->destructor(skb);
 	}
+#ifdef CONFIG_NETFILTER
+	nf_conntrack_put(skb->nfct);
+#endif
 #ifdef CONFIG_NET
 	if(skb->rx_dev)
 		dev_put(skb->rx_dev);
@@ -282,6 +286,9 @@
 	n->is_clone = 1;
 	atomic_set(&n->users, 1);
 	n->destructor = NULL;
+#ifdef CONFIG_NETFILTER
+	nf_conntrack_get(skb->nfct);
+#endif
 	return n;
 }
 
@@ -314,6 +321,8 @@
 	new->nfmark=old->nfmark;
 	new->nfreason=old->nfreason;
 	new->nfcache=old->nfcache;
+	new->nfct=old->nfct;
+	nf_conntrack_get(new->nfct);
 #ifdef CONFIG_NETFILTER_DEBUG
 	new->nf_debug=old->nf_debug;
 #endif
Index: net/ipv4/ip_output.c
===================================================================
RCS file: /cvs/linux/linux/net/ipv4/ip_output.c,v
retrieving revision 1.80
diff -u -r1.80 ip_output.c
--- net/ipv4/ip_output.c	2000/02/09 11:16:41	1.80
+++ net/ipv4/ip_output.c	2000/03/03 08:10:30
@@ -890,6 +890,12 @@
 		ptr += len;
 		offset += len;
 
+#ifdef CONFIG_NETFILTER
+		/* Connection association is same as pre-frag packet */
+		skb2->nfct = skb->nfct;
+		nf_conntrack_get(skb2->nfct);
+#endif
+
 		/*
 		 *	Put this fragment into the sending queue.
 		 */

--
Hacking time.

From owner-netdev@oss.sgi.com Sun Mar 5 22:49:41 2000
Received: by oss.sgi.com id ; Sun, 5 Mar 2000 22:49:30 -0800
Received: from cerberus.nemoto.ecei.tohoku.ac.jp ([130.34.199.67]:57092 "EHLO cerberus.nemoto.ecei.tohoku.ac.jp") by oss.sgi.com with ESMTP id ; Sun, 5 Mar 2000 22:49:05 -0800
Received: from localhost (yoshfuji@localhost [127.0.0.1]) by cerberus.nemoto.ecei.tohoku.ac.jp (8.9.3+3.2W/8.9.3/Debian 8.9.3-6) with ESMTP id PAA01705; Mon, 6 Mar 2000 15:48:14 +0900
To: misiek@pld.org.pl
CC: netdev@oss.sgi.com, core@kame.net
Subject: Re: SIOCGIFCONF and IPv6 addresses
From: Hideaki YOSHIFUJI (=?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=)
In-Reply-To: <16134.951541573@coconut.itojun.org>
References: <20000223181035.A637@admin.misiek.eu.org> <16134.951541573@coconut.itojun.org>
X-Mailer: Mew version 1.94 on XEmacs 20.4 (Emerald)
X-URL: http://www.ecei.tohoku.ac.jp/%7Eyoshfuji/
X-Fingerprint: F7 31 65 99 5E B2 BB A7 15 15 13 23 18 06 A9 6F 57 00 6B 25
X-Pgp5-Key-Url: http://cerberus.nemoto.ecei.tohoku.ac.jp/%7Eyoshfuji/yoshfuji@ecei.tohoku.ac.jp.asc
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-Id: <20000306154813K.yoshfuji@cerberus.nemoto.ecei.tohoku.ac.jp>
Date: Mon, 06 Mar 2000 15:48:13 +0900
X-Dispatcher: imput version 990905(IM130)
Lines: 30
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hi,

In article <16134.951541573@coconut.itojun.org> (at Sat, 26 Feb 2000 14:06:13 +0900), itojun@iijlab.net says:

> And my recommendation to linux camp is:
> - include getifaddrs into libc.
> - if you have sysctl(RT_NET_IFLIST):
>	- use sysctl(RT_NET_IFLIST) as backend of getifaddrs.
>	- may implement SIOCGLIFCONF as extra interface.
> - if you do not have sysctl(RT_NET_IFLIST):
>	- implement SIOCGLIFCONF. use it as backend of getifaddrs.

I've implemented *prototype* of getifaddrs() for Linux 2.2 (or later)
using PF_NETLINK socket (instead of RT_NET_IFLIST): .

We should discuss on:
 1. Which family should be returned for interface addresses?
    (AF_LINK on *BSDs)
 2. What data should be returned through ifa_data:
    (struct if_data on *BSDs)

To do:
 getifaddrs() for Linux 2.0 (or earlier)...

-- 
Hideaki YOSHIFUJI
Web Page: http://www.ecei.tohoku.ac.jp/%7Eyoshfuji/
PGP5i FP: F731 6599 5EB2 BBA7 1515 1323 1806 A96F 5700 6B25

From owner-netdev@oss.sgi.com Sun Mar 5 23:50:00 2000
Received: by oss.sgi.com id ; Sun, 5 Mar 2000 23:49:51 -0800
Received: from mcn.xidian.edu.cn ([202.117.114.10]:65284 "EHLO mcn.xidian.edu.cn") by oss.sgi.com with ESMTP id ; Sun, 5 Mar 2000 23:49:38 -0800
Received: from mcn.xidian.edu.cn (xhmeng [192.168.1.7]) by mcn.xidian.edu.cn (8.8.7/8.8.7) with ESMTP id PAA01081 for ; Mon, 6 Mar 2000 15:31:35 +0800
Message-ID: <38C362B6.5B136D67@mcn.xidian.edu.cn>
Date: Mon, 06 Mar 2000 15:48:06 +0800
From: Meng Xiaohu
X-Mailer: Mozilla 4.51 [en] (Win95; I)
X-Accept-Language: en
MIME-Version: 1.0
To: linux
Subject: How to configure "tunnel"?
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

I am doing research on Mobile IP.
But I don't know well how to configure "tunnel" on Linux.
Where can I learn it?
Thanks!

-- 
Meng Xiaohu
State Key Laboratory on Integrated Service Networks
P.O. Box 297, Xidian University
Xian, Shaanxi, 710071, P.R.China
E-mail:xhmeng@mcn.xidian.edu.cn

From owner-netdev@oss.sgi.com Mon Mar 6 00:05:50 2000
Received: by oss.sgi.com id ; Mon, 6 Mar 2000 00:05:40 -0800
Received: from quechua.inka.de ([212.227.14.2]:22348 "EHLO mail.inka.de") by oss.sgi.com with ESMTP id ; Mon, 6 Mar 2000 00:05:24 -0800
Received: from dungeon.inka.de by mail.inka.de with uucp (rmailwrap 0.4) id 12RsVw-0005sb-00; Mon, 6 Mar 2000 09:05:16 +0100
Received: by dungeon.inka.de (Postfix, from userid 1000) id 32F15B78F0; Mon, 6 Mar 2000 09:05:10 +0100 (CET)
Date: Mon, 6 Mar 2000 09:05:10 +0100
From: Andreas Jellinghaus
To: Meng Xiaohu
Cc: linux
Subject: Re: How to configure "tunnel"?
Message-ID: <20000306090510.B20670@dungeon.inka.de>
References: <38C362B6.5B136D67@mcn.xidian.edu.cn>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
User-Agent: Mutt/1.0.1i
In-Reply-To: <38C362B6.5B136D67@mcn.xidian.edu.cn>; from xhmeng@mcn.xidian.edu.cn on Mon, Mar 06, 2000 at 03:48:06PM +0800
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

On Mon, Mar 06, 2000 at 03:48:06PM +0800, Meng Xiaohu wrote:
> I am doing research on Mobile IP.
> But I don't know well how to configure "tunnel" on Linux.

the mailing list is meant for development. user questions maybe better
go to linux-net@vger.rutgers.edu mailing list or news groups.

look at the iproute command and its documentation.
tunnels are getting very easy, for example:

ip tunnel add tun0 mode gre local my.ip.addr.ess remote your.ip.addr.ess
ip route add some.net.wor.k/mask dev tun0

regards, andreas

From owner-netdev@oss.sgi.com Mon Mar 6 00:36:11 2000
Received: by oss.sgi.com id ; Mon, 6 Mar 2000 00:36:01 -0800
Received: from pizda.ninka.net ([216.101.162.242]:18436 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Mon, 6 Mar 2000 00:35:47 -0800
Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id AAA11520; Mon, 6 Mar 2000 00:31:09 -0800
Date: Mon, 6 Mar 2000 00:31:09 -0800
Message-Id: <200003060831.AAA11520@pizda.ninka.net>
X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f
From: "David S. Miller"
To: rusty@linuxcare.com.au
CC: netdev@oss.sgi.com
In-reply-to: (message from Rusty Russell on Sun, 05 Mar 2000 03:06:08 +1100)
Subject: Re: [PATCH] Mummy Bear: Netfilter merge patch II vs. vger
References:
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

   From: Rusty Russell
   Date: Sun, 05 Mar 2000 03:06:08 +1100

   This requires a bit more code for connection tracking with tunnels
   to work: `nf_conntrack_put(skb->nfct); skb->nfct=NULL;' when
   untunnelling packets. If a tunnelling person wants to add that...

Applied, sent to Linus, thanks.

Later,
David S. Miller
davem@redhat.com

From owner-netdev@oss.sgi.com Mon Mar 6 00:57:01 2000
Received: by oss.sgi.com id ; Mon, 6 Mar 2000 00:56:51 -0800
Received: from smtp01ffm.de.uu.net ([192.76.144.150]:46793 "EHLO smtp01ffm.de.uu.net") by oss.sgi.com with ESMTP id ; Mon, 6 Mar 2000 00:56:34 -0800
Received: from gatekeeper.syskonnect.de (gatekeeper.skd.de [193.27.19.129] (may be forged)) by smtp01ffm.de.uu.net (5.5.5/5.5.5) with ESMTP id JAA12272 for ; Mon, 6 Mar 2000 09:56:28 +0100 (MET)
Received: from syskonnect.de (spock [193.27.19.1]) by gatekeeper.syskonnect.de (8.9.3/8.8.8) with ESMTP id JAA28682 for ; Mon, 6 Mar 2000 09:57:32 +0100 (MET)
Received: from cgoos-nt (localhost [127.0.0.1]) by syskonnect.de (8.9.3/8.6.12) with SMTP id JAA01115 for ; Mon, 6 Mar 2000 09:55:33 +0100 (MET)
Message-Id: <200003060855.JAA01115@syskonnect.de>
From: "Christoph Goos"
Organization: syskonnect.de
To: netdev@oss.sgi.com
Date: Mon, 6 Mar 2000 09:55:21 +0100
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-transfer-encoding: 7BIT
Subject: Question about net_init.c
X-mailer: Pegasus Mail for Win32 (v3.01d)
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hello,

I made some changes to the sktr driver in 2.2.14 lately, and came
across the MTU size of 2000 that is used by Token Ring on Linux.
There is a comment in drivers/net/net_init.c, saying
"bug in fragmenter..." at the MTU setting.

Is this still true for 2.2.14, or can the correct TR MTU (4491) be
used ?

I hope you can help me with this, or direct me to someone who can.

Best Regards,
Christoph Goos

-------------------------------------
Christoph Goos (cgoos@syskonnect.de)
Software Developer at SysKonnect
Siemensstr. 23
D-76275 Ettlingen
phone: +49 7243 502 351
-------------------------------------

From owner-netdev@oss.sgi.com Tue Mar 7 13:39:26 2000
Received: by oss.sgi.com id ; Tue, 7 Mar 2000 13:39:07 -0800
Received: from kogge.hanse.de ([192.76.134.17]:3601 "EHLO kogge.Hanse.DE") by oss.sgi.com with ESMTP id ; Tue, 7 Mar 2000 13:38:40 -0800
Received: (from uucp@localhost) by kogge.Hanse.DE (8.9.3/8.9.1) with UUCP id WAA27883; Tue, 7 Mar 2000 22:40:28 +0100 (CET) (envelope-from eis@baty.hanse.de)
Received: (from eis@localhost) by baty.hanse.de (8.9.3/8.9.3) id TAA26191; Tue, 7 Mar 2000 19:36:11 +0100
To: Steve Whitehouse
Cc: netdev@oss.sgi.com
Subject: Re: MSG_EOR flag
References: <200003022121.VAA28963@gw.chygwyn.com>
From: Henner Eisen
Date: 07 Mar 2000 19:36:11 +0100
In-Reply-To: Steve Whitehouse's message of "Thu, 2 Mar 2000 21:21:41 +0000 (GMT)"
Message-ID:
Lines: 55
X-Mailer: Gnus v5.5/Emacs 20.3
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

>>>>> "Steve" == Steve Whitehouse writes:

    Steve> You can return partial packets, but each call to recvmsg()
    Steve> must only return part (or whole) of one record. It must
    Steve> never return parts of more than one record in a single
    Steve> call. MSG_EOR is only set by recvmsg() on the final part of
    Steve> a record.

    Steve> You can just set MSG_EOR (as you suggest below) on each
    Steve> recvmsg() call if each call always results in a whole
    Steve> record being copied to the user. This is very unlikely to
    Steve> result in correct behaviour though... you've no idea (from
    Steve> the kernel side) how big a buffer a user is going to give
    Steve> you to put the data in. Unlike SOCK_DGRAM you must not
    Steve> discard records which don't fit in the buffer, but must keep
    Steve> the part not yet sent to the user so the user can request
    Steve> it later.
Yes, I know, but the current method of re-assembling X.25 packets by means of the M-bit can fail anyway (currently, this will trigger an X.25 reset for the virtual connection). I agree that this is broken, but that's how it is implemented today. It is certainly worth fixing. But this needs some deeper changes in the X.25 code which cannot be applied as a last minute patch for 2.4.

Steve> If you are looking for the "quick fix" for 2.2.xx though
Steve> I'd certainly support your suggestion of always returning
Steve> MSG_EOR from recvmsg() over the current behaviour. So long

o.k.

Steve> as the applications always have a large enough buffer size,
Steve> which from your comments I gather they do, then everything
Steve> should work fine.

A related question is how to handle message boundaries in read() and write() consistently. If write() in 2.3.x implicitly sets MSG_EOR, which is interpreted as `each write() should generate a single, complete message in terms of the underlying protocol' by many protocol families, I think read() from a SOCK_SEQPACKET socket should behave consistently. That means it should only return once the last fragment has been received (unless the read buffer space is too small, in which case read() should return an error). But as Linux maps all read() calls to recvmsg() internally, the socket layer only sees a recvmsg() call and cannot determine whether it originated from a read(). Thus, it will be necessary to add a flag to recvmsg(), which is always set when recvmsg() is called on behalf of read(). This flag would request that recvmsg() return only if either the final part of the message has arrived or the receive buffer size is exceeded.

Is this what MSG_WAITALL is intended for?
Henner From owner-netdev@oss.sgi.com Tue Mar 7 15:03:48 2000 Received: by oss.sgi.com id ; Tue, 7 Mar 2000 15:03:38 -0800 Received: from gw.chygwyn.com ([62.172.158.50]:49170 "EHLO gw.chygwyn.com") by oss.sgi.com with ESMTP id ; Tue, 7 Mar 2000 15:03:13 -0800 Received: (from steve@localhost) by gw.chygwyn.com (8.9.3/8.9.3) id WAA18346; Tue, 7 Mar 2000 22:57:23 GMT From: Steve Whitehouse Message-Id: <200003072257.WAA18346@gw.chygwyn.com> Subject: Re: MSG_EOR flag To: eis@baty.hanse.de (Henner Eisen) Date: Tue, 7 Mar 2000 22:57:23 +0000 (GMT) Cc: netdev@oss.sgi.com In-Reply-To: from "Henner Eisen" at Mar 07, 2000 07:36:11 PM Organization: ChyGywn Limited X-RegisteredOffice: 7, New Yatt Road, Witney, Oxfordshire. OX8 6NU England X-RegisteredNumber: 03887683 Reply-To: Steve Whitehouse X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing [deleted text] > > A related question is how to handle message boundaries in read() > and write() constistently. If write() in 2.3.x implicitly sets MSG_EOR, > which is interpreted as `each write() should generate a single, complete > message in terms of the underlaying protocol' by many protocol families, > I think read() from a SEQ_PACKET socket should behave consistently. That > means it should only return if the last fragment was received (unless > the read buffer space is to small in which case read() should return > an error). But as linux maps all read() to recvmsg() internally, the > socket layer only sees a recvmsg() call and cannot determine whether > it originated from a read(). Thus, it will be necessary to add a flag > to recvmsg, which is always set when recvmsg is called on behalf of read(). > This flag would request that recvmsg should return only if either the > final part of the messages arrived or the receive buffer size is exceeded. 
>
> Is this what MSG_WAITALL is intended for?
>
> Henner
>

Hi,

The question of read() and write() is a difficult one. For write() to work without assistance from other syscalls (i.e. sendmsg(), or perhaps one might invent ioctl(SIOEOR)) it seemed that adding an implicit MSG_EOR was the only sensible option. There is nothing in POSIX to say either way, and I didn't have access to any other OS which implemented SEQPACKET sockets in order to check what they did (although that's probably a good thing to try and find out if possible - I'd be interested to know).

I feel that read() is less of a problem. The behaviour of recvmsg() is retained to a certain extent by the fact that recvmsg() is called to "do the work" as you said. This means that any read() call will only return the whole, or part, of a single record, and never more than one record. The philosophy here was simply "if you want to know where the record boundaries are, use a function which returns flags; if you don't care where the boundaries are, or you already know because your protocol determines the record size in advance, use read()".

If you are suggesting (I think you are, but I'm not 100% sure) that the protocol not transfer a single byte of data to userspace in a read() call until the EOR marker has been seen, this has problems. Firstly, upon the "buffer not big enough" error the userspace program has to find out somehow how big the buffer needs to be (probably another ioctl()). Secondly, the kernel side buffers now have to be big enough to store a complete record from the transmitting application. There is nothing to say how large a record may be - it could be many times larger than the physical memory of the receiving machine. Within a specific protocol there may well be limits, but in some there aren't. DECnet is one of the protocols that are unlimited in this way, which is most of the reason for the current behaviour.
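The rule described above (each recvmsg() call returns the whole or a part of exactly one record, and MSG_EOR marks the final part) suggests a userland loop roughly like the following. This is only a sketch under assumptions: read_record() and its parameters are made-up names for illustration, and it assumes the protocol really does report MSG_EOR in msg_flags on the last part of a record.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: collect one record from a SOCK_SEQPACKET socket.
 * Returns the number of bytes stored in buf, or -1 if recvmsg() failed.
 * *complete is set to 1 if MSG_EOR was seen, and stays 0 if the buffer
 * filled up or recvmsg() returned 0 before the end of the record.
 */
static ssize_t read_record(int fd, char *buf, size_t cap, int *complete)
{
	size_t used = 0;

	*complete = 0;
	while (used < cap) {
		struct iovec iov = { .iov_base = buf + used, .iov_len = cap - used };
		struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
		ssize_t n = recvmsg(fd, &msg, 0);

		if (n < 0)
			return -1;	/* errno was set by recvmsg() */
		if (n == 0)
			break;		/* treated as "peer closed" in this sketch */
		used += (size_t)n;
		if (msg.msg_flags & MSG_EOR) {
			*complete = 1;	/* final part of the record */
			break;
		}
	}
	return (ssize_t)used;
}
```

If *complete is still 0 on return, the caller knows the record did not fit (or the peer went away) and can call again with more buffer space to fetch the remainder, which is the "keep the part not yet sent to the user" behaviour discussed earlier in the thread.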
Overall, I prefer the option of keeping the behaviour of read() as simple as possible and just using the more comprehensive recvmsg() when more information is required. MSG_WAITALL means don't return until the specified number of bytes have been read. For SEQPACKET, that has to be amended so that early return occurs at message boundaries, otherwise the rule of no more than one record per recvmsg() call could be broken. However I don't think that MSG_WAITALL should be merged into read() for SEQPACKET sockets, simply because it gives no more information to userland than the current scheme, Steve. From owner-netdev@oss.sgi.com Wed Mar 8 14:32:02 2000 Received: by oss.sgi.com id ; Wed, 8 Mar 2000 14:31:52 -0800 Received: from daiquiri.bb-data.de ([193.31.178.129]:2737 "HELO daiquiri.bb-data.de") by oss.sgi.com with SMTP id ; Wed, 8 Mar 2000 14:31:40 -0800 Received: by daiquiri.bb-data.de; id PAA04398; Tue, 7 Mar 2000 15:27:40 +0100 Received: from el-presidente.bb-data.de(193.31.178.42) by daiquiri.bb-data.de via smap (4.0a) id xma004253; Tue, 7 Mar 00 15:27:20 +0100 Received: from el-presidente.bb-data.de (localhost [127.0.0.1]) by el-presidente.bb-data.de (8.9.3/8.9.3) with ESMTP id PAA20007 for ; Tue, 7 Mar 2000 15:22:45 +0100 (MET) Received: from sunrise.bb-data.de (sunrise.bb-data.de [193.31.178.37]) by el-presidente.bb-data.de (8.9.3/8.9.3) with ESMTP id PAA20003 for ; Tue, 7 Mar 2000 15:22:45 +0100 (MET) Received: from bankenservice.de ([10.1.32.9]) by sunrise.bb-data.de (Netscape Messaging Server 3.6) with ESMTP id AAA7C0 for ; Tue, 7 Mar 2000 15:23:36 +0100 Message-ID: <38C511B2.C1BC2445@bankenservice.de> Date: Tue, 07 Mar 2000 15:26:58 +0100 From: "Henrik Wellschmidt" X-Mailer: Mozilla 4.61 [en] (WinNT; I) X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com Subject: remove Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing remove 
From owner-netdev@oss.sgi.com Thu Mar 9 11:14:20 2000 Received: by oss.sgi.com id ; Thu, 9 Mar 2000 11:13:59 -0800 Received: from kogge.hanse.de ([192.76.134.17]:64261 "EHLO kogge.Hanse.DE") by oss.sgi.com with ESMTP id ; Thu, 9 Mar 2000 11:13:37 -0800 Received: (from uucp@localhost) by kogge.Hanse.DE (8.9.3/8.9.1) with UUCP id UAA05951; Thu, 9 Mar 2000 20:15:39 +0100 (CET) (envelope-from eis@baty.hanse.de) Received: (from eis@localhost) by baty.hanse.de (8.9.3/8.9.3) id UAA04891; Thu, 9 Mar 2000 20:06:42 +0100 Date: Thu, 9 Mar 2000 20:06:42 +0100 From: Henner Eisen Message-Id: <200003091906.UAA04891@baty.hanse.de> To: Steve@ChyGwyn.com CC: netdev@oss.sgi.com In-reply-to: <200003072257.WAA18346@gw.chygwyn.com> (message from Steve Whitehouse on Tue, 7 Mar 2000 22:57:23 +0000 (GMT)) Subject: Re: MSG_EOR flag References: <200003072257.WAA18346@gw.chygwyn.com> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hi, >>>>> "Steve" == Steve Whitehouse writes: Steve> If you are suggesting that (I think you are, but I'm not Steve> 100% sure) that the protocol not transfer a single byte of Yes, I was thinking about something like this ... Steve> data to userspace in a read() call until the EOR marker has Steve> been seen, this has problems. Firstly, upon the "buffer not Steve> big enough error" the userspace program has to find out Steve> somehow how big the buffer needs to be (probably another Steve> ioctl()). Secondly, the kernel side buffers now have to be Steve> big enough to store a complete record from the transmitting Steve> application. There is nothing to say how large a record Steve> maybe - it could be many times larger than the physical Steve> memory of the receiving machine. Within a specific protocol Steve> there may well be limits, but in some there aren't. DECnet Steve> is one of the protocols that are unlimited in this way, Steve> which is most of the reason for the current behaviour. ... 
and I'm also aware of these problems.

Steve> Overall, I prefer the option of keeping the behaviour of
Steve> read() as simple as possible and just using the more
Steve> comprehensive recvmsg() when more information is required.

Steve> MSG_WAITALL means don't return until the specified number
Steve> of bytes have been read. For SEQPACKET, that has to be
Steve> amended so that early return occurs at message boundaries,
Steve> otherwise the rule of no more than one record per recvmsg()
Steve> call could be broken. However I don't think that
Steve> MSG_WAITALL should be merged into read() for SEQPACKET
Steve> sockets, simply because it gives no more information to
Steve> userland than the current scheme,

I agree.

The real source of the (potential) compatibility problem is that certain protocol families support SOCK_SEQPACKET by means of the datagram_* methods in their code. They can be easily identified because they put datagram_poll() in their proto_ops and call skb_recv_datagram() from their recvmsg() method. The protocol families affected are ax25, netrom, rose, and x25. The result is that their SOCK_SEQPACKET sockets behave more like SOCK_DGRAM sockets with additional reliability (the reliability is however affected by the problems mentioned above). The other SOCK_SEQPACKET sockets (and this seems to be in line with POSIX requirements) behave more like SOCK_STREAM with an additional packet boundary marker (the MSG_EOR flag).

The result is that the sockets using the datagram methods behave like this:

- sendmsg() always behaves as if MSG_EOR was set (although it is currently not set by applications; in 2.2.x, sendmsg() will even return an error if MSG_EOR is set).
- recvmsg() always behaves as if MSG_WAITALL was set (although neither current applications nor sock_read() set it). 2.2.x as well as 2.3.x will even return an error if MSG_WAITALL is set. MSG_TRUNC is however set correctly if the receive buffer in recvmsg() is too small; the remaining data are discarded. But as.
Until this is fixed, I think that the above mentioned protocol families should deal with the MSG_* flags as follows in order to match POSIX semantics as closely as possible:

(1) sendmsg() should accept the flag MSG_EOR.
(2) recvmsg() should accept the flag MSG_WAITALL.
(3) recvmsg() should set MSG_EOR before it returns. Question: shall it also do so if MSG_TRUNC is set?
(4) As long as the implementations do not support sending partial messages, sendmsg() should return with an error when MSG_EOR was not set.

It is very unlikely that (1) and (2) will break any existing application. They will allow applications which have been fixed to set the flags correctly to continue to run with the broken kernel level protocol implementation. Thus, I'd suggest that these changes are done even in 2.2.x.

(3) is a slightly more dangerous compatibility issue. I'd suggest applying this to 2.3.x only. (Compatibility problems of this sort should be expected when upgrading to a new major kernel version.)

(4) will cause the largest compatibility problem with current applications. I think it is o.k. to do that in x25 (which is CONFIG_EXPERIMENTAL) for 2.3.x. For the other protocol families, the maintainers would probably decide not to do so. (Maybe print out a net_ratelimit()'ed kernel debug message.)

Do you agree?

When the protocols are fixed in future, applications using read() should be able to reproduce the current semantics (read() returning fully re-assembled messages) by setting the socket options SO_RCVLOWAT and SO_RCVTIMEO to very large values. If necessary, the default values for these options can be increased, but this can be discussed when the fixing is actually done.
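As a rough sketch of that last point, an application could raise both options from userland as shown here. set_rcv_thresholds() is a hypothetical helper name, the values passed in are up to the caller, and whether a given protocol family honours SO_RCVLOWAT and SO_RCVTIMEO for its SOCK_SEQPACKET sockets is an assumption, not something the current implementations guarantee.

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper: ask the socket layer not to complete a read until
 * at least lowat_bytes are available, and to give up after timeout_sec
 * seconds. Returns 0 on success, -1 if either setsockopt() call fails.
 */
static int set_rcv_thresholds(int fd, int lowat_bytes, long timeout_sec)
{
	struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

	if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT,
		       &lowat_bytes, sizeof(lowat_bytes)) < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
		return -1;
	return 0;
}
```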
Henner

From owner-netdev@oss.sgi.com Mon Mar 13 19:22:11 2000
Received: by oss.sgi.com id ; Mon, 13 Mar 2000 19:22:00 -0800
Received: from [202.102.223.33] ([202.102.223.33]:9832 "EHLO ns.cstnet-hf.net.cn") by oss.sgi.com with ESMTP id ; Mon, 13 Mar 2000 19:21:42 -0800
Received: from ustc.edu.cn (hpe25.nic.ustc.edu.cn [202.38.64.1]) by ns.cstnet-hf.net.cn (8.8.7/8.8.6) with SMTP id LAA09758; Tue, 14 Mar 2000 11:11:00 -0800
Received: from tarn.isdn.ustc.edu.cn by ustc.edu.cn with ESMTP (8.6.10/16.2) id KAA16804; Tue, 14 Mar 2000 10:51:52 +0800
Date: Tue, 14 Mar 2000 11:03:50 +0800 (CST)
From: Zam
Reply-To: zam_ustc@263.net
To: netdev@oss.sgi.com
cc: kuznet@ms2.inr.ac.ru
Subject: A small bug in net/sched/cls_u32.c
In-Reply-To: <200003072257.WAA18346@gw.chygwyn.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hi,

Recently I have been studying the packet scheduling code, but it is a bit hard for me to understand the filters, especially the u32 classifier. Today I found some strange code in the file cls_u32.c. The code is as follows (my kernel version is 2.2.13):

    static u32 gen_new_htid(struct tc_u_common *tp_c)
    {
            int i = 0x800;

            do {
                    if (++tp_c->hgenerator == 0x7FF)
                            tp_c->hgenerator = 1;
            } while (i>0 && u32_lookup_ht(tp_c, (tp_c->hgenerator|0x800)<<20));

            return i > 0 ? (tp_c->hgenerator|0x800)<<20 : 0;
    }

What is the use of the variable "i" here? Furthermore, if the handles are used up, an endless loop results. Maybe the author forgot to add "i--". The bug still exists in kernel 2.2.14.

Does anybody know where to get material explaining the mechanism of the u32 classifier and the usage of tc? It is too complicated for me to understand.

Thanks a lot.

Zam

Join us in the Linux world!
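For illustration, the loop with the decrement Zam suggests can be tried out in user space as sketched here. This is not the kernel code: struct tc_u_common is reduced to a stand-in with a "used" table, u32_lookup_ht() is a stub that consults that table, and folding the decrement into the loop condition as --i is just one possible way to add the missing "i--".

```c
#include <stdint.h>

/* User-space stand-in for the kernel structure (assumption for this sketch). */
struct tc_u_common {
	uint32_t hgenerator;
	unsigned char used[0x800];	/* used[n] != 0: hash table id n is taken */
};

/* Stub for u32_lookup_ht(): nonzero if the handle is already in use. */
static int u32_lookup_ht(struct tc_u_common *tp_c, uint32_t handle)
{
	return tp_c->used[(handle >> 20) & 0x7FF];
}

static uint32_t gen_new_htid(struct tc_u_common *tp_c)
{
	int i = 0x800;

	do {
		if (++tp_c->hgenerator == 0x7FF)
			tp_c->hgenerator = 1;
		/* --i bounds the search, so a full table no longer loops forever */
	} while (--i > 0 && u32_lookup_ht(tp_c, (tp_c->hgenerator | 0x800) << 20));

	/* i == 0 means every candidate was tried and found in use */
	return i > 0 ? (tp_c->hgenerator | 0x800) << 20 : 0;
}
```

With the counter actually decremented, the function gives up after 0x800 attempts and returns 0 when no free handle is left, which is what the existing "i > 0" tests appear to be meant for.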
From owner-netdev@oss.sgi.com Wed Mar 15 00:24:48 2000 Received: by oss.sgi.com id ; Wed, 15 Mar 2000 00:24:38 -0800 Received: from marsonia.tel.fer.hr ([161.53.19.140]:9991 "EHLO marsonia.tel.fer.hr") by oss.sgi.com with ESMTP id ; Wed, 15 Mar 2000 00:24:19 -0800 Received: by marsonia.tel.fer.hr via sendmail from stdin id (Debian Smail3.2.0.102) for netdev@oss.sgi.com; Wed, 15 Mar 2000 09:23:42 +0100 (CET) To: netdev@oss.sgi.com Subject: Neighbour and Destination Cache X-Face: #,z/^>"$sSF,#@D@uI]X=i.9Ln#j- i)z#UJSpbint-hkmX6r2XIGSbwhsG4S>R5]?1URkz4NQlKYY>vc+]' >F@Q!Df9#Qn`>;b8.n&{DV0|UVxZmc From: Josip Gracin Date: 15 Mar 2000 09:23:41 +0100 Message-ID: <874sa84p6a.fsf@marsonia.tel.fer.hr> Lines: 11 User-Agent: Gnus/5.070099 (Pterodactyl Gnus v0.99) XEmacs/21.1 (Bryce Canyon) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! Could someone please explain to me the concept of The Generic Address Resolution Entity (neighbour.c) and of Protocol Independent Destination Cache (dst.c). How is it used in the Linux networking? I do have an idea of how it works but I would need someone to explicitly define it for me so that I get the whole picture. I'd really appreciate some help. Josip From owner-netdev@oss.sgi.com Wed Mar 15 06:00:00 2000 Received: by oss.sgi.com id ; Wed, 15 Mar 2000 05:59:42 -0800 Received: from ikar.t17.ds.pwr.wroc.pl ([156.17.215.227]:4624 "HELO ikar.t17.ds.pwr.wroc.pl") by oss.sgi.com with SMTP id ; Wed, 15 Mar 2000 05:59:29 -0800 Received: by ikar.t17.ds.pwr.wroc.pl (Postfix, from userid 1002) id BA782C81E5; Wed, 15 Mar 2000 14:56:14 +0100 (CET) Date: Wed, 15 Mar 2000 14:56:01 +0100 From: Arkadiusz Miskiewicz To: netdev@oss.sgi.com Subject: ip_dev_loopback_xmit: bad owned skb = c1e292c0: etc... 
Message-ID: <20000315145601.A552@admin.misiek.eu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-2 Content-Transfer-Encoding: 8bit X-Mailer: Mutt 1.0i X-URL: http://www.misiek.eu.org X-Operating-System: Linux sunsite 4.0.20 #119 Tue Jan 16 12:21:53 MET 2001 i986 pld Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hi all. I *probably* found a bug in network part of kernel: root@dark pocket# ./bootpc --dev lo ip_dev_loopback_xmit: bad owned skb = c1e292c0: skb: pf=2 (unowned) dev=dummy0 len=328 PROTO=17 192.168.0.1:68 255.255.255.255:67 L=328 S=0x00 I=0 F=0x4000 T=64 ip_finish_output: bad owned skb = c1e293a0: skb: pf=2 (owned) dev=dummy0 len=328 PROTO=17 192.168.0.1:68 255.255.255.255:67 L=328 S=0x00 I=0 F=0x4000 T=64 ip_local_deliver: bad non-lo skb: skb: pf=2 (unowned) dev=dummy0 len=328 PROTO=17 192.168.0.1:68 255.255.255.255:67 L=328 S=0x00 I=0 F=0x4000 T=64 ip_dev_loopback_xmit: bad owned skb = c1e29100: skb: pf=2 (unowned) dev=dummy0 len=328 PROTO=17 192.168.0.1:68 255.255.255.255:67 L=328 S=0x00 I=0 F=0x4000 T=64 ip_finish_output: bad owned skb = c1e29aa0: skb: pf=2 (owned) dev=dummy0 len=328 PROTO=17 192.168.0.1:68 255.255.255.255:67 L=328 S=0x00 I=0 F=0x4000 T=64 ip_local_deliver: bad non-lo skb: skb: pf=2 (unowned) dev=dummy0 len=328 PROTO=17 192.168.0.1:68 255.255.255.255:67 L=328 S=0x00 I=0 F=0x4000 T=64 root@dark pocket# ip addr show 1: lo: mtu 3924 qdisc noqueue link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope global lo inet6 ::1/128 scope host 2: sit0@NONE: mtu 1480 qdisc noop link/sit 0.0.0.0 brd 0.0.0.0 3: dummy0: mtu 1500 qdisc noqueue link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff inet 192.168.0.1/32 brd 192.168.0.1 scope global dummy0 inet6 fe80::200:ff:fe00:0/10 scope link inet6 3ffe:902:12::10/128 scope global root@dark pocket# ip link show 1: lo: mtu 3924 qdisc noqueue link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: sit0@NONE: mtu 1480 
qdisc noop link/sit 0.0.0.0 brd 0.0.0.0 3: dummy0: mtu 1500 qdisc noqueue link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff root@dark pocket# uname -a Linux dark 2.3.51 #8 Mon Mar 13 19:12:19 CET 2000 i586 pld root@dark pocket# I just tried one thing with old (0.64) bootpc client. -- Arkadiusz Mi¶kiewicz http://www.misiek.eu.org/ PLD GNU/Linux [IPv6 enabled] http://www.pld.org.pl/ From owner-netdev@oss.sgi.com Wed Mar 15 06:19:00 2000 Received: by oss.sgi.com id ; Wed, 15 Mar 2000 06:18:41 -0800 Received: from colin.muc.de ([193.149.48.1]:59397 "HELO colin.muc.de") by oss.sgi.com with SMTP id ; Wed, 15 Mar 2000 06:18:15 -0800 Received: by colin.muc.de id <140600-3>; Wed, 15 Mar 2000 15:18:07 +0100 Message-ID: <20000315151806.09819@colin.muc.de> From: Andi Kleen To: Arkadiusz Miskiewicz Cc: netdev@oss.sgi.com Subject: Re: ip_dev_loopback_xmit: bad owned skb = c1e292c0: etc... References: <20000315145601.A552@admin.misiek.eu.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.88e In-Reply-To: <20000315145601.A552@admin.misiek.eu.org>; from Arkadiusz Miskiewicz on Wed, Mar 15, 2000 at 03:02:15PM +0100 Date: Wed, 15 Mar 2000 15:18:06 +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Wed, Mar 15, 2000 at 03:02:15PM +0100, Arkadiusz Miskiewicz wrote: > Hi all. > > I *probably* found a bug in network part of kernel: [...] Turn off netfilter debugging, it is broken currently in 2.3. 
-Andi From owner-netdev@oss.sgi.com Wed Mar 15 06:22:51 2000 Received: by oss.sgi.com id ; Wed, 15 Mar 2000 06:22:31 -0800 Received: from pizda.ninka.net ([216.101.162.242]:14208 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Wed, 15 Mar 2000 06:22:23 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id GAA01737; Wed, 15 Mar 2000 06:17:24 -0800 Date: Wed, 15 Mar 2000 06:17:24 -0800 Message-Id: <200003151417.GAA01737@pizda.ninka.net> X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f From: "David S. Miller" To: misiek@pld.org.pl CC: netdev@oss.sgi.com In-reply-to: <20000315145601.A552@admin.misiek.eu.org> (message from Arkadiusz Miskiewicz on Wed, 15 Mar 2000 14:56:01 +0100) Subject: Re: ip_dev_loopback_xmit: bad owned skb = c1e292c0: etc... References: <20000315145601.A552@admin.misiek.eu.org> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Turn off NETFILTER debugging, it's overly verbose. Later, David S. Miller davem@redhat.com From owner-netdev@oss.sgi.com Thu Mar 16 01:12:36 2000 Received: by oss.sgi.com id ; Thu, 16 Mar 2000 01:12:26 -0800 Received: from chaos.thphy.uni-duesseldorf.de ([134.99.64.99]:28422 "EHLO chaos.thphy.uni-duesseldorf.de") by oss.sgi.com with ESMTP id ; Thu, 16 Mar 2000 01:12:09 -0800 Received: from localhost (kai@localhost) by chaos.thphy.uni-duesseldorf.de (8.9.3/8.8.7) with ESMTP id KAA17327 for ; Thu, 16 Mar 2000 10:13:43 +0100 X-Authentication-Warning: chaos.thphy.uni-duesseldorf.de: kai owned process doing -bs Date: Thu, 16 Mar 2000 10:13:42 +0100 (CET) From: Kai Germaschewski To: netdev@oss.sgi.com Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hey there! Trying to clean up the current ISDN net interface code for 2.4, I'm facing the following problem. 
Basically, the structure is the following: There's the so-called ISDN link layer (LL), which interfaces to the network layer as a struct net_device and also interfaces to the HL (hardware layer) - drivers that provide the actual channels to transfer data on.

The problem is flow control and locking the xmit path. Basically, it should easily work the following way:

    isdn_net_hard_start_xmit()
    {
            frame data accordingly (e.g. syncppp, ciscohdlc or whatever)
            send data to HL channel.
    }

In this setup, the network layer guarantees that hard_start_xmit() is called single threaded, and therefore the HL driver send routine as well.

Now, we also have to send supervisory frames sometimes, e.g. negotiation frames which are sent from userspace (ipppd) or from a timer (keep-alive packets), compression reset frames, ...

Basically I can see two ways to ensure that send_data_to_HL is called single threaded:

1) Take the supervisory frames and stuff them into the network queue, using TC_PRIO_CONTROL and dev_queue_xmit(). (That's what syncppp.c does)
2) Lock the xmit path explicitly.

Additional constraints: multi link has to be considered as well, e.g. bundling two physical B-channels into one net interface.

This particularly means that the supervisory frames are sometimes bound to a specific channel; we would lose that information using dev_queue_xmit().

Worse, even if we encode the channel into the skb somehow, flow control is a problem. Flow control works the obvious way, i.e. netif_stop_queue() if all channels are busy and netif_wake_queue() if at least one channel becomes non-busy. So now hard_start_xmit() might give us the control frame for a specific channel, which could still be busy though, because we just know that (any) one channel is non-busy.

Locking the xmit path ourselves seems like somewhat duplicate effort, but I can't think of any better way. BTW: which spinlock kind should be used for code which can be called from user process, hard_start_xmit() and a task_queue?
spinlock_bh() is not documented in Documentations/spinlocks.txt, but I guess that's the right one? Thanks, Kai From owner-netdev@oss.sgi.com Thu Mar 16 04:09:46 2000 Received: by oss.sgi.com id ; Thu, 16 Mar 2000 04:09:36 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:53969 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Thu, 16 Mar 2000 04:09:13 -0800 Received: from fred.muc.de (none@ns1144.munich.netsurf.de [195.180.235.144]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id NAA13783; Thu, 16 Mar 2000 13:08:58 +0100 (MET) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 12VYW1-0001Kg-00; Thu, 16 Mar 2000 12:32:33 +0100 Date: Thu, 16 Mar 2000 12:32:33 +0100 From: Andi Kleen To: Kai Germaschewski Cc: netdev@oss.sgi.com Subject: Re: your mail Message-ID: <20000316123233.A1849@fred.muc.de> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: ; from Kai Germaschewski on Thu, Mar 16, 2000 at 10:14:31AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Thu, Mar 16, 2000 at 10:14:31AM +0100, Kai Germaschewski wrote: > 1) Take the supervisory frames and stuff them into the network queue, > using TC_PRIO_CONTROL and dev_queue_xmit(). (That's what syncppp.c does) > 2) Lock the xmit path explicitely. I think (1) is much cleaner > > Additional constraints: multi link has to be considered as well, e.g. > bundling two physical B-channels into one net interface. > > This particularly means that the supervisory frames are sometimes bound to > a specific channel, we would loose that information using > dev_queue_xmit(). You could put that information into the skb's cb structure. TCP uses it for similar purposes. It is currently 48 bytes. It is free to use for you. > > Worse, even if we encode the channel into the skb somehow, flow control is > a problem. Flow control works the obvious way, i.e. 
netif_stop_queue() if > all channels are busy and netif_wake_queue() if at least one channel > becomes non-busy. So now hard_start_xmit() might give us the control frame > for a specific channel, which could still be busy though, because we just > know that (any) one channel is non-busy. Just return 1 then and sch_generic will requeue the packet. > > Locking the xmit path ourselves seems like somewhat duplicate effort, but > I can't think of any better way. BTW: which spinlock kind should be used > for code which can be called from user process, hard_start_xmit() and a > task_queue? spinlock_bh() is not documented in > Documentations/spinlocks.txt, but I guess that's the right one? Yes. -Andi -- This is like TV. I don't like TV. From owner-netdev@oss.sgi.com Sat Mar 18 02:46:52 2000 Received: by oss.sgi.com id ; Sat, 18 Mar 2000 02:46:41 -0800 Received: from kogge.hanse.de ([192.76.134.17]:31762 "EHLO kogge.Hanse.DE") by oss.sgi.com with ESMTP id ; Sat, 18 Mar 2000 02:46:30 -0800 Received: (from uucp@localhost) by kogge.Hanse.DE (8.9.3/8.9.1) with UUCP id LAA47321; Sat, 18 Mar 2000 11:48:34 +0100 (CET) (envelope-from eis@baty.hanse.de) Received: (from eis@localhost) by baty.hanse.de (8.9.3/8.9.3) id JAA20232; Sat, 18 Mar 2000 09:48:58 +0100 To: Andi Kleen Cc: Kai Germaschewski , netdev@oss.sgi.com, i4ldeveloper@listserv.isdn4linux.de Subject: ppp control frame passing (was: (none) / Re: your mail) References: <20000316123233.A1849@fred.muc.de> From: Henner Eisen Date: 18 Mar 2000 09:48:57 +0100 In-Reply-To: Andi Kleen's message of "Thu, 16 Mar 2000 12:32:33 +0100" Message-ID: Lines: 59 X-Mailer: Gnus v5.5/Emacs 20.3 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> "Andi" == Andi Kleen writes: >> Worse, even if we encode the channel into the skb somehow, >> flow control is a problem. Flow control works the obvious way, >> i.e. 
netif_stop_queue() if all channels are busy and
>> netif_wake_queue() if at least one channel becomes non-busy. So
>> now hard_start_xmit() might give us the control frame for a
>> specific channel, which could still be busy though, because we
>> just know that (any) one channel is non-busy.

Andi> Just return 1 then and sch_generic will requeue the packet.

Does this "return 1" depend on a special network scheduler being attached to the netif?

Although I also think that passing (and queuing) isdn_ppp control frames via the standard netdevice interface (there should be no internal queues in ppp, only in the ppp netdevice's network scheduler as in the hardware) would be the cleanest solution, I see yet another little problem:

If we need to pass a control frame while the netif is busy, we must rely on the network scheduler to queue the ppp control frame at the head of all output frames. Otherwise (e.g. if another frame with the same priority is already queued), doing netif_wake_queue() will pass us the wrong frame first.

I see several possibilities: When the netif is in the 'await xmit ppp control frame' state and a non-ppp control frame is passed

(1) send the non-control frame first (which will delay the ppp control frame) or
(2) discard the non ppp control frame (which will decrease performance of the application having set the high priority intentionally).
(3) requeue the non ppp control frame using dev_queue_xmit() and return 0;
(4) have pppd queue the ppp control frame not with dev_queue_xmit() but with a special dev_queue_xmit_first() which would ensure that the frame is sent before all other currently queued frames. The network scheduler would provide a special q->enqueue_first() method to support this, such that dev_queue_xmit_first() would look like this:

    if (q->enqueue_first) {
            q->enqueue_first(skb);
    } else {
            q->enqueue(skb);
    }

Standard dev_queue_xmit() would remain unchanged such that the fast path is not affected.
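To make possibility (4) a little more concrete, here is a toy user-space model of the proposed fallback. Nothing in it is existing kernel API: struct pkt and struct queue merely stand in for struct sk_buff and the network scheduler, and queue_xmit_first() plays the role of the proposed dev_queue_xmit_first(). A queue that provides enqueue_first() gets the control frame at its head; one that does not simply falls back to the normal enqueue().

```c
#include <stddef.h>

#define QLEN 16

/* Toy stand-ins for struct sk_buff and the scheduler (hypothetical). */
struct pkt { int id; };

struct queue {
	struct pkt *slot[QLEN];		/* ring buffer of queued packets */
	size_t head, count;
	int (*enqueue)(struct queue *q, struct pkt *p);
	int (*enqueue_first)(struct queue *q, struct pkt *p);	/* optional, may be NULL */
};

/* Append at the tail; returns -1 if the queue is full. */
static int fifo_enqueue(struct queue *q, struct pkt *p)
{
	if (q->count == QLEN)
		return -1;
	q->slot[(q->head + q->count) % QLEN] = p;
	q->count++;
	return 0;
}

/* Insert in front of everything already queued. */
static int fifo_enqueue_first(struct queue *q, struct pkt *p)
{
	if (q->count == QLEN)
		return -1;
	q->head = (q->head + QLEN - 1) % QLEN;
	q->slot[q->head] = p;
	q->count++;
	return 0;
}

/* Remove and return the packet at the head, or NULL if empty. */
static struct pkt *fifo_dequeue(struct queue *q)
{
	struct pkt *p;

	if (q->count == 0)
		return NULL;
	p = q->slot[q->head];
	q->head = (q->head + 1) % QLEN;
	q->count--;
	return p;
}

/* Model of the proposed dev_queue_xmit_first() dispatch. */
static int queue_xmit_first(struct queue *q, struct pkt *p)
{
	if (q->enqueue_first)
		return q->enqueue_first(q, p);
	return q->enqueue(q, p);
}
```

The model only shows the dispatch; it says nothing about locking or about how a real scheduler with several priority bands would implement enqueue_first().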
BTW: It seems that even more simplifications in all ppp drivers could result from using the standard netdevice interface also for ppp control frames. If received control frames were also passed upstream via netif_rx(), we could get rid of a special /dev/[i]ppp* devices and all the related code in the ppp drivers. pppd could just send and receive the ppp control frames via a PF_PACKET socket. (Well, we still need the /dev/*ppp* files for supporting the ppp ioctl()s, but all the code dealing with reading, writing and queueing control frames could be removed.) Well, but this seems to be 2.5.x issue... Henner From owner-netdev@oss.sgi.com Sat Mar 18 04:14:43 2000 Received: by oss.sgi.com id ; Sat, 18 Mar 2000 04:14:32 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:13740 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Sat, 18 Mar 2000 04:14:18 -0800 Received: from fred.muc.de (none@ns1032.munich.netsurf.de [195.180.235.32]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id NAA02578; Sat, 18 Mar 2000 13:14:13 +0100 (MET) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 12WI9l-0001ST-00; Sat, 18 Mar 2000 13:16:37 +0100 Date: Sat, 18 Mar 2000 13:16:37 +0100 From: Andi Kleen To: Henner Eisen Cc: Andi Kleen , Kai Germaschewski , netdev@oss.sgi.com, i4ldeveloper@listserv.isdn4linux.de Subject: Re: ppp control frame passing (was: (none) / Re: your mail) Message-ID: <20000318131637.A5599@fred.muc.de> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: ; from Henner Eisen on Sat, Mar 18, 2000 at 11:46:23AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, Mar 18, 2000 at 11:46:23AM +0100, Henner Eisen wrote: > >>>>> "Andi" == Andi Kleen writes: > > >> Worse, even if we encode the channel into the skb somehow, > >> flow control is a problem. Flow control works the obvious way, > >> i.e. 
netif_stop_queue() if all channels are busy and > >> netif_wake_queue() if at least one channel becomes non-busy. So > >> now hard_start_xmit() might give us the control frame for a > >> specific channel, which could still be busy though, because we > >> just know that (any) one channel is non-busy. > > Andi> Just return 1 then and sch_generic will requeue the packet. > > Does this "return 1" depend on a special network scheduler beeing > attached to the netif? No. All network schedulers should implement requeue, and the check is done by the generic queueing code. > (2) discard the non ppp control frame (which will decrease performance > of the application having set the high priority intentionally). .. and there is no guarantee that you will get the control frame even then. > (3) requeue the non ppp control frame using dev_queue_xmit() and return 0; > (4) pppd queued the ppp control frame not with dev_queue_xmit() but with > a special dev_queue_xmit_first() which would ensure that the frame > is send before all other currently queued frames. The network scheduler > would provide a special q->enqueue_first() method to support this such > that dev_queue_xmit_first() would look like this: > if (q->enqueue_first){ > q->enqueue_first(skb); > } else { > q->enqueue(skb); > } It is probably easier to use a special scheduler for that that calls the normal scheduler as a child (and make sure PPP devices always have that special scheduler pushed). That scheduler would make sure that control packets come always first. > > Standard dev_queue_xmit() would remain unchanged such that fast > path is not affected. > > BTW: It seems that even more simplifications in all ppp drivers could > result from using the standard netdevice interface also for ppp > control frames. If received control frames were also passed upstream > via netif_rx(), we could get rid of a special /dev/[i]ppp* devices and > all the related code in the ppp drivers. 
pppd could just send and > receive the ppp control frames via a PF_PACKET socket. (Well, we > still need the /dev/*ppp* files for supporting the ppp ioctl()s, but > all the code dealing with reading, writing and queueing control > frames could be removed.) Well, but this seems to be 2.5.x issue... The PPP ioctls could be replaced by netlink messages to get rid of the devices (just who wants to do that work?). I'm not sure if it is worth it. -Andi -- This is like TV. I don't like TV.
From owner-netdev@oss.sgi.com Sat Mar 18 08:08:54 2000 Received: by oss.sgi.com id ; Sat, 18 Mar 2000 08:08:45 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:44811 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Sat, 18 Mar 2000 08:08:18 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id TAA02148; Sat, 18 Mar 2000 19:08:05 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200003181608.TAA02148@ms2.inr.ac.ru> Subject: Re: ppp control frame passing (was: (none) / Re: your mail) To: eis@baty.hanse.DE (Henner Eisen) Date: Sat, 18 Mar 2000 19:08:05 +0300 (MSK) Cc: netdev@oss.sgi.com, ak@muc.de (Andi Kleen) In-Reply-To: from "Henner Eisen" at Mar 18, 0 02:13:29 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 1614 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > >> Worse, even if we encode the channel into the skb somehow, > >> flow control is a problem. Flow control works the obvious way, > >> i.e.
netif_stop_queue() if all channels are busy and > >> netif_wake_queue() if at least one channel becomes non-busy. So > >> now hard_start_xmit() might give us the control frame for a > >> specific channel, which could still be busy though, because we > >> just know that (any) one channel is non-busy. Alert!!! Try to avoid this. This system has _no_ support for multiple hardware queues. Essentially, you have two choices: to send control frames internally without dev_queue_xmit() (it is easy), or to move channel demultiplexing to special qdisc (f.e. as sch_atm does). It is still not clear how support for multiple hardware queues can be organized. I am inclined to believe that it is simply not well-defined task. Probably, it is better to try to move channel multiplexing out of hardware level. > Andi> Just return 1 then and sch_generic will requeue the packet. And also it will send it immediately again, if tbusy was not set, or block whole device, if it is set. No, it will not help with multiple links. > BTW: It seems that even more simplifications in all ppp drivers could > result from using the standard netdevice interface also for ppp > control frames. If received control frames were also passed upstream > via netif_rx(), It is not related problem. Just add support for this to device and you will be able to receive/send such frames via packet socket. And it does not solve your problem at all. 
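The effect described above can be seen in a toy user-space model. This is a sketch under simplifying assumptions, not the kernel's queueing code: toy_hard_start_xmit() and toy_run_queue() are hypothetical names, the queue is a plain array, and a refused frame simply stays at the head and is offered again. With one channel busy and one free, a control frame bound to the busy channel keeps being retried and blocks the data frame behind it, although a free channel exists.

```c
#include <assert.h>
#include <stddef.h>

enum { NCHAN = 2, ANY_CHANNEL = -1 };

/* A frame is either bound to one channel or may use any free one. */
struct toy_frame {
	int id;
	int channel;	/* ANY_CHANNEL or an index below NCHAN */
};

/* Model of hard_start_xmit(): 0 = frame accepted, 1 = busy, requeue it. */
int toy_hard_start_xmit(const struct toy_frame *f, int busy[NCHAN])
{
	int ch;

	assert(f != NULL);
	if (f->channel != ANY_CHANNEL) {
		if (busy[f->channel])
			return 1;	/* the one channel we need is busy */
		busy[f->channel] = 1;
		return 0;
	}
	for (ch = 0; ch < NCHAN; ch++) {
		if (!busy[ch]) {
			busy[ch] = 1;
			return 0;
		}
	}
	return 1;
}

/* Model of the generic queue runner: it always offers the head frame,
 * and a refused frame stays at the head. Returns the number of frames
 * sent within max_attempts tries. */
int toy_run_queue(const struct toy_frame *queue, int nframes,
		  int busy[NCHAN], int max_attempts)
{
	int head = 0, attempts;

	for (attempts = 0; attempts < max_attempts && head < nframes; attempts++) {
		if (toy_hard_start_xmit(&queue[head], busy) == 0)
			head++;
		/* else: requeued at the head and retried on the next pass */
	}
	return head;
}
```

With channel 0 busy, channel 1 free, and a queue holding a control frame for channel 0 followed by an ordinary data frame, toy_run_queue() sends nothing no matter how many attempts it is given, and channel 1 stays idle. Once channel 0 becomes free, both frames go out.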
Alexey From owner-netdev@oss.sgi.com Mon Mar 20 06:31:58 2000 Received: by oss.sgi.com id ; Mon, 20 Mar 2000 06:31:49 -0800 Received: from corderoatado.arnet.com.ar ([200.45.0.3]:59409 "EHLO mx2.arnet.com.ar") by oss.sgi.com with ESMTP id ; Mon, 20 Mar 2000 06:31:27 -0800 Received: from mail pickup service by mx2.arnet.com.ar with Microsoft SMTPSVC; Mon, 20 Mar 2000 11:28:38 -0300 Received: from oss.sgi.com ([216.32.174.118]) by mx1.arnet.com.ar with Microsoft SMTPSVC(5.5.1877.357.35); Thu, 16 Mar 2000 06:12:52 -0300 Received: by oss.sgi.com id ; Thu, 16 Mar 2000 01:12:26 -0800 Received: from chaos.thphy.uni-duesseldorf.de ([134.99.64.99]:28422 "EHLO chaos.thphy.uni-duesseldorf.de") by oss.sgi.com with ESMTP id ; Thu, 16 Mar 2000 01:12:09 -0800 Received: from localhost (kai@localhost) by chaos.thphy.uni-duesseldorf.de (8.9.3/8.8.7) with ESMTP id KAA17327 for ; Thu, 16 Mar 2000 10:13:43 +0100 X-Authentication-Warning: chaos.thphy.uni-duesseldorf.de: kai owned process doing -bs Date: Thu, 16 Mar 2000 10:13:42 +0100 (CET) From: Kai Germaschewski To: netdev@oss.sgi.com Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hey there! Trying to clean up the current ISDN net interface code for 2.4, I'm facing the following problem. Basically, the structure is the following: There's the so called ISDN link layer (LL), which interfaces to the network layer as a struct net_device and also interface to HL (hardware layer) - drivers that provide actual channels to transfer data on. The problem is flow control and locking the xmit path. Basically, it should easily work the following way: isdn_net_hard_start_xmit() { frame data accordingly (e.g. syncppp, ciscohdlc or whatever) send data to HL channel. } In this setup, the network layer guarantees that hard_start_xmit() is called single threaded, and therefore the HL driver send routine as well. 
Now, we also have to send supervisory frames sometimes, e.g. negotiation frames which are sent from userspace (ipppd), or timer (keep-alive packets), compression reset frames,...

Basically I can see two ways to ensure that send_data_to_HL is called single threaded:

1) Take the supervisory frames and stuff them into the network queue, using TC_PRIO_CONTROL and dev_queue_xmit(). (That's what syncppp.c does)
2) Lock the xmit path explicitly.

Additional constraints: multi link has to be considered as well, e.g. bundling two physical B-channels into one net interface. This particularly means that the supervisory frames are sometimes bound to a specific channel; we would lose that information using dev_queue_xmit(). Worse, even if we encode the channel into the skb somehow, flow control is a problem. Flow control works the obvious way, i.e. netif_stop_queue() if all channels are busy and netif_wake_queue() if at least one channel becomes non-busy. So now hard_start_xmit() might give us the control frame for a specific channel, which could still be busy though, because we just know that (any) one channel is non-busy.

Locking the xmit path ourselves seems like somewhat duplicate effort, but I can't think of any better way. BTW: which spinlock kind should be used for code which can be called from a user process, hard_start_xmit() and a task_queue? spinlock_bh() is not documented in Documentations/spinlocks.txt, but I guess that's the right one?
Thanks, Kai From owner-netdev@oss.sgi.com Mon Mar 20 06:54:30 2000 Received: by oss.sgi.com id ; Mon, 20 Mar 2000 06:54:10 -0800 Received: from linuxcare.canberra.net.au ([203.29.91.49]:45326 "EHLO front.linuxcare.com.au") by oss.sgi.com with ESMTP id ; Mon, 20 Mar 2000 06:53:44 -0800 Received: from halfway.linuxcare.com.au (halfway.linuxcare.com.au [10.61.2.46]) by front.linuxcare.com.au (8.9.3/8.9.3/Debian 8.9.3-6) with ESMTP id BAA22453 for ; Tue, 21 Mar 2000 01:53:39 +1100 Received: from linuxcare.com.au (really [127.0.0.1]) by halfway.linuxcare.com.au via in.smtpd with esmtp id (Debian Smail3.2.0.102) for ; Tue, 21 Mar 2000 01:53:39 +1100 (EST) Message-Id: From: Rusty Russell To: netdev@oss.sgi.com Subject: (FORWARD) James Morris: [PATCH] ip_queue fucked-up oops fix Date: Tue, 21 Mar 2000 01:53:39 +1100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hi Alexey, James looks after the (EXPERIMENTAL) ip_queue netfilter code. Please apply. From: James Morris To: Rusty Russell Subject: [PATCH] ip_queue fucked-up oops fix Hi Rusty, The patch below against 2.3.99-pre2-4 fixes a problem in ip_queue which can lead to a kernel crash on SMP machines. 
diff -ur --exclude=*.[oa] --exclude=.* linux-2.3.99-pre2-4/net/ipv4/netfilter/ip_queue.c linux/net/ipv4/netfilter/ip_queue.c --- linux-2.3.99-pre2-4/net/ipv4/netfilter/ip_queue.c Sat Mar 18 23:22:33 2000 +++ linux/net/ipv4/netfilter/ip_queue.c Sun Mar 19 13:16:43 2000 @@ -491,7 +491,7 @@ skb = netlink_build_message(e, &status); if (skb == NULL) return status; - return netlink_unicast(nfnl, skb, nlq->peer.pid, 0); + return netlink_unicast(nfnl, skb, nlq->peer.pid, MSG_DONTWAIT); } static struct sk_buff *
From owner-netdev@oss.sgi.com Mon Mar 20 12:49:35 2000 Received: by oss.sgi.com id ; Mon, 20 Mar 2000 12:49:15 -0800 Received: from kogge.hanse.de ([192.76.134.17]:5638 "EHLO kogge.Hanse.DE") by oss.sgi.com with ESMTP id ; Mon, 20 Mar 2000 12:48:46 -0800 Received: (from uucp@localhost) by kogge.Hanse.DE (8.9.3/8.9.1) with UUCP id VAA39786; Mon, 20 Mar 2000 21:50:56 +0100 (CET) (envelope-from eis@baty.hanse.de) Received: (from eis@localhost) by baty.hanse.de (8.9.3/8.9.3) id QAA21169; Sat, 18 Mar 2000 16:56:09 +0100 Date: Sat, 18 Mar 2000 16:56:09 +0100 From: Henner Eisen Message-Id: <200003181556.QAA21169@baty.hanse.de> To: ak@muc.de CC: ak@muc.de, kai@thphy.uni-duesseldorf.de, netdev@oss.sgi.com, i4ldeveloper@listserv.isdn4linux.de In-reply-to: <20000318131637.A5599@fred.muc.de>
(message from Andi Kleen on Sat, 18 Mar 2000 13:16:37 +0100) Subject: Re: ppp control frame passing (was: (none) / Re: your mail) References: <20000318131637.A5599@fred.muc.de> Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

>>>>> "Andi" == Andi Kleen writes:

Andi> It is probably easier to use a special scheduler for that
Andi> that calls the normal scheduler as a child (and make sure
Andi> PPP devices always have that special scheduler pushed). That
Andi> scheduler would make sure that control packets come always
Andi> first.

Sounds good! I think we should take that direction. This would also be totally independent of the underlying ppp implementation. That means, if it works, it can be re-used for the other ppp implementations as well (syncppp, generic). Then, pppd support can migrate to using packet sockets instead of /dev/*ppp* and all ppp stacks can migrate smoothly.

In the long run, I think we should support the generic ppp by isdn, but with the current call / demand dialing this won't be easy before a major rewrite of the isdn link level has been done. (This rewrite would of course use the generic ppp).

Henner

From owner-netdev@oss.sgi.com Mon Mar 20 19:21:09 2000 Received: by oss.sgi.com id ; Mon, 20 Mar 2000 19:21:00 -0800 Received: from [202.102.223.33] ([202.102.223.33]:35191 "EHLO ns.cstnet-hf.net.cn") by oss.sgi.com with ESMTP id ; Mon, 20 Mar 2000 19:20:38 -0800 Received: from ustc.edu.cn (hpe25.nic.ustc.edu.cn [202.38.64.1]) by ns.cstnet-hf.net.cn (8.8.7/8.8.6) with SMTP id KAA27181 for ; Tue, 21 Mar 2000 10:43:00 -0800 Received: from mail.ustc.edu.cn by ustc.edu.cn with SMTP (8.6.10/16.2) id KAA19826; Tue, 21 Mar 2000 10:25:37 +0800 Received: (qmail 7183 invoked by uid 2746); 21 Mar 2000 02:19:11 -0000 Date: Tue, 21 Mar 2000 10:19:11 +0800 (CST) From: YaNan Guo To: netdev@oss.sgi.com Subject: ipv6 help in freeswan-1.3!!!!!
Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello every one of u: I install freeswan1.3 in two machine(both installed mandrake7.0), now i want to test ipsec used in ipv6 environment. But i search all documents attached the freeswan, i found no test example about ipsec for ipv6. Here is my ipv6 environment: one has ipv6 address 3ffe:3216:2101:1000:101::1/112(eth2 as subnet), 3ffe:3216:2101:ffff::1/64(eth0 as gateway), 202.38.64.185/24(eth1 for ipv4); the other is a machine in subnet ,it has only one nic, 3ffe:3216:2101:1000:101::2/112(eth0), these two machine can ping each other, how can i test ipsec in ipv6. who can tell me how to do? thanks a lot !!!!!!! Best wishes!!! From owner-netdev@oss.sgi.com Mon Mar 20 21:08:49 2000 Received: by oss.sgi.com id ; Mon, 20 Mar 2000 21:08:40 -0800 Received: from cpu2747.adsl.bellglobal.com ([207.236.55.216]:44527 "EHLO grendel.conscoop.ottawa.on.ca") by oss.sgi.com with ESMTP id ; Mon, 20 Mar 2000 21:08:30 -0800 Received: (from rgb@localhost) by grendel.conscoop.ottawa.on.ca (8.9.0/8.9.0) id AAA05887; Tue, 21 Mar 2000 00:04:43 -0500 Date: Tue, 21 Mar 2000 00:04:43 -0500 From: Richard Guy Briggs To: YaNan Guo Cc: netdev@oss.sgi.com Subject: Re: ipv6 help in freeswan-1.3!!!!! 
Message-ID: <20000321000443.H3638@grendel.conscoop.ottawa.on.ca> References: Mime-Version: 1.0 Content-Type: multipart/signed; boundary=nqkreNcslJAfgyzk; micalg=pgp-md5; protocol="application/pgp-signature" X-Mailer: Mutt 0.95.7i In-Reply-To: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

--nqkreNcslJAfgyzk
Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable

On Tue, Mar 21, 2000 at 10:19:11AM +0800, YaNan Guo wrote:
> Hello every one of u:
> I install freeswan1.3 in two machine(both installed mandrake7.0), now
> i want to test ipsec used in ipv6 environment. But i search all documents
> attached the freeswan, i found no test example about ipsec for ipv6. Here
> is my ipv6 environment:
> one has ipv6 address 3ffe:3216:2101:1000:101::1/112(eth2 as subnet),
> 3ffe:3216:2101:ffff::1/64(eth0 as gateway),
> 202.38.64.185/24(eth1 for ipv4);
> the other is a machine in subnet ,it has only one nic,
> 3ffe:3216:2101:1000:101::2/112(eth0),
> these two machine can ping each other, how can i test ipsec in ipv6.

You can't with a stock 1.3. There is work being done by Gerhard Goessler to start porting the userspace tools and he may go on to do some kernel stuff if I don't get there first.

...in Oz until May, presently visiting LinuxCare OzLabs in Canberra,

slainte mhath, RGB

-- 
Richard Guy Briggs -- PGP key available Auto-Free Ottawa! Canada Prevent Internet Wiretapping! -- FreeS/WAN: Thanks for voting Green!
-- Marillion: --nqkreNcslJAfgyzk Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: 2.6.3i iQCVAwUBONcC6N+sBuIhFagtAQFrPgP+JAvyp1c5FByJJk91+eJwsPLK9M3HR2Jy miHApmLKw/nQ4MaA8sV6JbPSPXSdXqsXBbX29ysYxFMUedbD29LiDJ1ex910B1Qv 5Mux4ufEhB8WCImawpQGVx3DjG3RQL3BGSlw57Vr5eMuGyCvQMLJgPSjGHIa70fH 5tbItDsZEzA= =+HTd -----END PGP SIGNATURE----- --nqkreNcslJAfgyzk-- From owner-netdev@oss.sgi.com Mon Mar 20 23:56:21 2000 Received: by oss.sgi.com id ; Mon, 20 Mar 2000 23:56:00 -0800 Received: from iabgfw.iabg.de ([194.139.245.2]:36512 "EHLO iabgfw.iabg.de") by oss.sgi.com with ESMTP id ; Mon, 20 Mar 2000 23:55:53 -0800 Received: by iabgfw.iabg.de; id IAA19509; Tue, 21 Mar 2000 08:55:48 +0100 (MET) Received: from iabgmh.iabg.de(10.255.255.2) by iabgfw.iabg.de via smap (V4.2) id xma019043; Tue, 21 Mar 00 08:54:56 +0100 Received: from iabgvw.iabg.de ([10.255.255.8]) by iabgmh.iabg.de (Post.Office MTA v3.5.1 release 219 ID# 127-59214U1600L300S0V35) with ESMTP id de; Tue, 21 Mar 2000 08:54:54 +0100 Received: from iabgdns.iabg.de (localhost [127.0.0.1]) by iabgvw.iabg.de (8.8.8+Sun/8.8.8) with ESMTP id IAA17750; Tue, 21 Mar 2000 08:54:53 +0100 (MET) Received: from iabg.de (cc31pc12.iabg.de [10.3.0.20]) by iabgdns.iabg.de (8.8.8+Sun/8.8.8) with ESMTP id IAA04321; Tue, 21 Mar 2000 08:54:53 +0100 (MET) Message-ID: <38D72AB3.E51E1CB5@iabg.de> Date: Tue, 21 Mar 2000 08:54:27 +0100 From: Gerhard Gessler X-Mailer: Mozilla 4.5 [en] (WinNT; I) X-Accept-Language: en MIME-Version: 1.0 To: YaNan Guo CC: netdev@oss.sgi.com Subject: Re: ipv6 help in freeswan-1.3!!!!! References: Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing YaNan Guo wrote: > > Hello every one of u: > I install freeswan1.3 in two machine(both installed mandrake7.0), now > i want to test ipsec used in ipv6 environment. 
But i search all documents > attached the freeswan, i found no test example about ipsec for ipv6. Here > is my ipv6 environment: > one has ipv6 address 3ffe:3216:2101:1000:101::1/112(eth2 as subnet), > 3ffe:3216:2101:ffff::1/64(eth0 as gateway), > 202.38.64.185/24(eth1 for ipv4); > the other is a machine in subnet ,it has only one nic, > 3ffe:3216:2101:1000:101::2/112(eth0), > these two machine can ping each other, how can i test ipsec in ipv6. > who can tell me how to do? thanks a lot !!!!!!! > Best wishes!!!

Hello YaNan,

FreeS/WAN does not currently support IPv6. I am working on this and I hope to release the first patches for FreeS/WAN in the next few days. I think there is a good chance that the changes and enhancements I have made are going into the original distribution. So, please be patient.

By the way, I don't think this is the right mailing list for IPSec and/or FreeS/WAN. You might want to take a look at "www.freeswan.org". This is the official website of the FreeS/WAN group and there you can find the address of the mailing list.

Regards, Gerhard

---------------------------------------------------
Gerhard Geßler IABG mbH, Abteilung IK42 Einsteinstr. 20 85521 Ottobrunn Tel.
(089) 6088 2021 Fax: (089) 6088 2845 E-Mail: gessler@iabg.de From owner-netdev@oss.sgi.com Tue Mar 21 04:35:02 2000 Received: by oss.sgi.com id ; Tue, 21 Mar 2000 04:34:43 -0800 Received: from smtprch1.nortelnetworks.com ([192.135.215.14]:10181 "EHLO smtprch1.nortel.com") by oss.sgi.com with ESMTP id ; Tue, 21 Mar 2000 04:34:17 -0800 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by smtprch1.nortel.com; Tue, 21 Mar 2000 06:33:46 -0600 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id HA5D99DN; Tue, 21 Mar 2000 20:33:23 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id HK0MJSG2; Tue, 21 Mar 2000 23:33:25 +1100 Received: from uow.edu.au (IDENT:andrewm@[47.181.207.103]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id XAA17990 for ; Tue, 21 Mar 2000 23:33:24 +1100 Message-ID: <38D76D15.95E8645B@uow.edu.au> Date: Tue, 21 Mar 2000 12:37:41 +0000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.2.13-7mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: netdev Subject: [patch] skbuff cleanup Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing All the ethernet drivers do this in their Rx ISR:: skb = dev_alloc_skb(length); ... skb_reserve(skb,2); /* Force 16 byte alignment */ This patch collapses these into a single call (dev_hdr_alloc_skb) which has the same number of instructions as dev_alloc_skb. So instead of the above we can simply use: skb = dev_hdr_alloc_skb(length, ETH_HLEN); This function can be used for other L2's, however note that it will align the L3 data on a 32 bit boundary only. This just happens to be a 16 byte boundary for Ethernet. 
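To make the padding arithmetic concrete, here is a small stand-alone sketch of the expression used for xlength in the patch in this message. The helper names hdr_headroom() and l3_offset() are hypothetical, and the sketch assumes the buffer itself starts on a 16 byte boundary.

```c
#include <assert.h>

/* Headroom reserved by the proposed dev_hdr_alloc_skb(): 0 to 3 pad
 * bytes so that the L3 header is longword aligned, plus the fixed
 * 16 bytes from the patch's expression. */
int hdr_headroom(int proto_hdr_length)
{
	assert(proto_hdr_length >= 0);
	return ((4 - (proto_hdr_length & 3)) & 3) + 16;
}

/* Offset of the L3 header from the start of the buffer. */
int l3_offset(int proto_hdr_length)
{
	return hdr_headroom(proto_hdr_length) + proto_hdr_length;
}
```

For ETH_HLEN = 14 this gives 2 pad bytes plus 16, i.e. 18 bytes of headroom, so the L3 header starts at offset 32. For an 18 byte L2 header the headroom is also 18 and the L3 header starts at offset 36, which is 4 byte aligned but not 16 byte aligned.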
--- linux-2.3.99-pre1/include/linux/skbuff.h Tue Mar 14 21:20:48 2000 +++ linux/include/linux/skbuff.h Tue Mar 21 23:24:36 2000 @@ -626,6 +626,25 @@ return skb; } +/* + * Like dev_alloc_skb, except we are passed the length of the L2 layer's + * header (eg the Ethernet MAC header). We allocate and then reserve + * 0 to 3 bytes of additional room in the skb to ensure that the Layer 3 + * data will start on a longword boundary + */ + +extern __inline__ struct sk_buff * +dev_hdr_alloc_skb(unsigned int length, const int proto_hdr_length) +{ + struct sk_buff *skb; + const int xlength = ((4 - (proto_hdr_length & 3)) & 3 ) + 16; + + skb = alloc_skb(length+xlength, GFP_ATOMIC); + if (skb) + skb_reserve(skb,xlength); + return skb; +} + extern __inline__ struct sk_buff * skb_cow(struct sk_buff *skb, unsigned int headroom) { From owner-netdev@oss.sgi.com Tue Mar 21 14:24:14 2000 Received: by oss.sgi.com id ; Tue, 21 Mar 2000 14:24:04 -0800 Received: from pizda.ninka.net ([216.101.162.242]:3968 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Tue, 21 Mar 2000 14:23:40 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id OAA00807; Tue, 21 Mar 2000 14:18:36 -0800 Date: Tue, 21 Mar 2000 14:18:36 -0800 Message-Id: <200003212218.OAA00807@pizda.ninka.net> X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f From: "David S. Miller" To: rusty@linuxcare.com.au CC: netdev@oss.sgi.com In-reply-to: (message from Rusty Russell on Tue, 21 Mar 2000 01:53:39 +1100) Subject: Re: (FORWARD) James Morris: [PATCH] ip_queue fucked-up oops fix References: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing From: Rusty Russell Date: Tue, 21 Mar 2000 01:53:39 +1100 James looks after the (EXPERIMENTAL) ip_queue netfilter code. Please apply. Patch applied, thanks. Later, David S. 
Miller davem@redhat.com From owner-netdev@oss.sgi.com Tue Mar 21 16:38:25 2000 Received: by oss.sgi.com id ; Tue, 21 Mar 2000 16:38:05 -0800 Received: from w078.z209220232.was-dc.dsl.cnc.net ([209.220.232.78]:2549 "EHLO vaio.greennet") by oss.sgi.com with ESMTP id ; Tue, 21 Mar 2000 16:37:46 -0800 Received: from localhost (becker@localhost) by vaio.greennet (8.9.3/8.8.7) with ESMTP id JAA02250; Tue, 21 Mar 2000 09:23:09 -0500 Date: Tue, 21 Mar 2000 09:23:08 -0500 (EST) From: Donald Becker X-Sender: becker@vaio.greennet To: Andrew Morton cc: netdev Subject: Re: [patch] skbuff cleanup In-Reply-To: <38D76D15.95E8645B@uow.edu.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Tue, 21 Mar 2000, Andrew Morton wrote: > All the ethernet drivers do this in their Rx ISR:: > > skb = dev_alloc_skb(length); > ... > skb_reserve(skb,2); /* Force 16 byte alignment */ Some bus master drivers require that the Rx data buffers begin on a long word (or rarely, worse) boundary. The drivers that have this behavior do document it, but not at every alloc_skb() instance, so don't break the current semantics of dev_alloc_skb() when adding the new function. Chips that have this requirement also require that the Rx data buffer end on a longword boundary, so you can't work around the alignment by having a 14 byte buffer in the beginning of the descriptor chain :-<. Most drivers give you the option to copy-align with the 'rx_copybreak' parameter. 
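As a rough illustration of the copy-align option mentioned above, the following userspace sketch models the usual rx_copybreak decision: frames shorter than the threshold are copied into a fresh buffer with a 2-byte offset so the L3 header is longword aligned, while longer frames are handed up in the buffer the chip wrote into. The names rx_should_copy and rx_copy_aligned are hypothetical and plain memory stands in for skbs; this is a sketch under those assumptions, not code from any particular driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN 14 /* Ethernet MAC header length in bytes */

/*
 * Hypothetical helper: true if the frame should be copied into a new,
 * alignment-friendly buffer, false if the original Rx buffer is passed up.
 */
bool rx_should_copy(size_t pkt_len, size_t rx_copybreak)
{
    return pkt_len < rx_copybreak;
}

/*
 * Hypothetical helper: copy a frame so that the L3 header lands on a
 * 4-byte boundary. dst must be 4-byte aligned and hold at least
 * pkt_len + 2 bytes. Returns the start of the copied frame inside dst.
 */
uint8_t *rx_copy_aligned(uint8_t *dst, const uint8_t *frame, size_t pkt_len)
{
    uint8_t *start;

    assert(((uintptr_t)dst & 3) == 0);
    start = dst + 2;               /* same effect as skb_reserve(skb, 2) */
    memcpy(start, frame, pkt_len);
    return start;
}
```

Setting the threshold above the largest frame size makes every packet take the copy path, which is the "copy-align" behaviour: the chip keeps its longword-aligned Rx buffers and the stack still sees an aligned L3 header, at the cost of one copy per packet.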
Donald Becker Scyld Computing Corporation, becker@scyld.com From owner-netdev@oss.sgi.com Wed Mar 22 02:54:23 2000 Received: by oss.sgi.com id ; Wed, 22 Mar 2000 02:54:13 -0800 Received: from chaos.thphy.uni-duesseldorf.de ([134.99.64.99]:19975 "EHLO chaos.thphy.uni-duesseldorf.de") by oss.sgi.com with ESMTP id ; Wed, 22 Mar 2000 02:54:13 -0800 Received: from localhost (kai@localhost) by chaos.thphy.uni-duesseldorf.de (8.9.3/8.8.7) with ESMTP id LAA32089; Wed, 22 Mar 2000 11:55:28 +0100 X-Authentication-Warning: chaos.thphy.uni-duesseldorf.de: kai owned process doing -bs Date: Wed, 22 Mar 2000 11:55:27 +0100 (CET) From: Kai Germaschewski To: kuznet@ms2.inr.ac.ru cc: netdev@oss.sgi.com, ak@muc.de, eis@baty.hanse.DE Subject: Re: ppp control frame passing (was: (none) / Re: your mail) Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing > >> Worse, even if we encode the channel into the skb somehow, > >> flow control is a problem. Flow control works the obvious way, > >> i.e. netif_stop_queue() if all channels are busy and > >> netif_wake_queue() if at least one channel becomes non-busy. So > >> now hard_start_xmit() might give us the control frame for a > >> specific channel, which could still be busy though, because we > >> just know that (any) one channel is non-busy. > Alert!!! Try to avoid this. This system has _no_ support for multiple > hardware queues. Essentially, you have two choices: to send control > frames > internally without dev_queue_xmit() (it is easy), or to move channel > demultiplexing to special qdisc (f.e. as sch_atm does). It is still not > clear how support for multiple hardware queues can be organized. I am > inclined to believe that it is simply not well-defined task. Probably, > it > is better to try to move channel multiplexing out of hardware level. Okay, for the time being I went for the former solution, i.e. 
I do send control frames internally. BTW: Sorry for the late reply, this list's lag is incredible. Just one more question, about backporting: spinlock_bh doesn't exist in 2.2 AFAICS, I guess I need to use spinlock_irqsave? I'd be very happy if someone could comment on the locking/procedure I used - thanks! isdn_net_writebuf_skb(channel, skb) { /* we are guaranteed that this channel is not (yet) busy */ /* channel is always protected by channel->xmit_lock spinlock */ write skb to channel; atomic_inc(&channel->frame_cnt); if (all channels in this bundle busy) { /* a channel is busy if channel->frame_cnt >= 2 */ netif_stop_queue(); } } isdn_net_stat_callback(channel) /* this is called from hardware channel when frame has actually been sent down the line */ /* this may be called in hard-irq context, so we use the task queue for sending (don't want to call isdn_net_writebuf_skb() in irq context) */ { atomic_dec(&channel->frame_cnt); if (!channel_busy(channel)) { if (!skb_queue_empty(&channel->super_tx_queue)) { /* if there is supervisory data waiting, send it first */ queue_task(&channel->tqueue, &tq_immediate); } else { netif_wake_queue(); } } } where the task on tq_immediate would do the following: { spin_lock_bh(&channel->xmit_lock); while (!channel_busy(channel)) { skb = skb_dequeue(&channel->super_tx_queue); if (!skb) break; isdn_net_writebuf_skb(channel, skb); } spin_unlock_bh(&channel->xmit_lock); } isdn_net_hard_start_xmit(net_device, skb) { for all channels in bundle(net_device) { spin_lock_bh(&channel->xmit_lock); if (!channel_busy(channel)) break; spin_unlock_bh(&channel->xmit_lock); } if (!channel) /* no non-busy channel found */ return 1; isdn_net_writebuf_skb(channel, skb); spin_unlock_bh(&channel->xmit_lock); return 0; } isdn_net_write_super(channel, skb) /* this can be called from task queue / process / timer */ /* we send the skb directly if possible, if not we queue it on channel->super_tx_queue */ { spin_lock_bh(&channel->xmit_lock); if
(!channel_busy(channel)) { isdn_net_writebuf_skb(channel, skb); } else { skb_queue_tail(&channel->super_tx_queue, skb); } spin_unlock_bh(&channel->xmit_lock); } From owner-netdev@oss.sgi.com Wed Mar 22 06:19:34 2000 Received: by oss.sgi.com id ; Wed, 22 Mar 2000 06:19:14 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:7945 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Wed, 22 Mar 2000 06:19:07 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id RAA03876; Wed, 22 Mar 2000 17:18:55 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200003221418.RAA03876@ms2.inr.ac.ru> Subject: Re: ppp control frame passing (was: (none) / Re: your mail) To: kai@thphy.uni-duesseldorf.de (Kai Germaschewski) Date: Wed, 22 Mar 2000 17:18:55 +0300 (MSK) Cc: netdev@oss.sgi.com, ak@muc.de, eis@baty.hanse.DE In-Reply-To: from "Kai Germaschewski" at Mar 22, 0 11:55:27 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 893 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > Just one more question, about backporting: spinlock_bh doesn't exist in > 2.2 AFAICS, I guess I need to use spinlock_irqsave? You need not any spinlock in 2.2. In 2.2 all xmit path executes only on BH, it is single thread and no more locks are required. Yes, it is true provided it is never used from hard irq! If it is, then you have to use irq protection in 2.3 as well. Or better to queue a BH task. > netif_stop_queue(); It is better to avoid to use netif_stop_queue(dev) outside of spin_lock_bh(&dev->xmit_lock). Most of devices sets it only in hard_start_xmit(), where this lock is grabbed by caller. If you submit frame internally, it is not bad idea to acquire this lock. It is not necessary, but you will lose the property that hard_start_xmit is not entered when device is throttled. If it is not a problem for your device, then this is not required. 
Alexey From owner-netdev@oss.sgi.com Wed Mar 22 11:30:45 2000 Received: by oss.sgi.com id ; Wed, 22 Mar 2000 11:30:26 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:16908 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Wed, 22 Mar 2000 11:29:56 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA09872; Wed, 22 Mar 2000 22:29:39 +0300 From: kuznet@ms2.inr.ac.ru Message-Id: <200003221929.WAA09872@ms2.inr.ac.ru> Subject: Re: ppp control frame passing (was: (none) / Re: your mail) To: kai@thphy.uni-duesseldorf.de (Kai Germaschewski) Date: Wed, 22 Mar 2000 22:29:39 +0300 (MSK) Cc: netdev@oss.sgi.com, ak@muc.de, eis@baty.hanse.DE In-Reply-To: from "Kai Germaschewski" at Mar 22, 0 07:33:25 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 812 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > Okay, yeah, now I remember... However, it's not true as long as the lock > might be grabbed from (pppd) process context, right. > have two options: Use spinlock_irqsave, which is always safe, or schedule > a task from process context, so the lock will always be grabbed from BH. I > think I prefer the latter. Third option in 2.2 is plain start_bh_atomic(). > However, it's okay to grab dev->xmit_lock? I mean, is it kind of an > exported interface? Mmm... it is good and difficult question. 8)8) Well, it exists, hence, you may use it. 8) Some tasks require to serialize driver code wrt hard_start_xmit() (mainly, dev->ioctl()). Actually, if it will want to disappear or to change its sense one day, the fact that you use it and the way which you use it can be even useful information. 
8) Alexey From owner-netdev@oss.sgi.com Wed Mar 22 12:26:36 2000 Received: by oss.sgi.com id ; Wed, 22 Mar 2000 12:26:16 -0800 Received: from dialin120.pg4.hamburg.nikoma.de ([213.54.3.120]:3844 "EHLO felix.home.kai") by oss.sgi.com with ESMTP id ; Wed, 22 Mar 2000 12:26:09 -0800 Received: from localhost (kai@localhost) by felix.home.kai (8.8.7/8.8.7) with ESMTP id TAA01758; Wed, 22 Mar 2000 19:33:25 +0100 X-Authentication-Warning: felix.home.kai: kai owned process doing -bs Date: Wed, 22 Mar 2000 19:33:25 +0100 (CET) From: Kai Germaschewski X-Sender: kai@felix.home.kai To: kuznet@ms2.inr.ac.ru cc: netdev@oss.sgi.com, ak@muc.de, eis@baty.hanse.DE Subject: Re: ppp control frame passing (was: (none) / Re: your mail) In-Reply-To: <200003221418.RAA03876@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hey! On Wed, 22 Mar 2000 kuznet@ms2.inr.ac.ru wrote: > > Just one more question, about backporting: spinlock_bh doesn't exist in > > 2.2 AFAICS, I guess I need to use spinlock_irqsave? > > You need not any spinlock in 2.2. > > In 2.2 all xmit path executes only on BH, it is single thread > and no more locks are required. Okay, yeah, now I remember... However, it's not true as long as the lock might be grabbed from (pppd) process context, right. So, basically I do have two options: Use spinlock_irqsave, which is always safe, or schedule a task from process context, so the lock will always be grabbed from BH. I think I prefer the latter. > Yes, it is true provided it is never used from hard irq! > If it is, then you have to use irq protection in 2.3 as well. > Or better to queue a BH task. If it was called from interrupt, that would be a bug that needs to be fixed (I put some detection code in, so I'll find out. 
Doesn't seem to happen) > > > netif_stop_queue(); > > It is better to avoid to use netif_stop_queue(dev) outside > of spin_lock_bh(&dev->xmit_lock). Most of devices sets it only > in hard_start_xmit(), where this lock is grabbed by caller. > > If you submit frame internally, it is not bad idea to acquire > this lock. It is not necessary, but you will lose the property > that hard_start_xmit is not entered when device is throttled. > If it is not a problem for your device, then this is not required. Well, if hard_start_xmit() is called although we're busy (due to the apparent race when not holding the lock, I guess), I'll notice and return 1. So no problem here. However, it's okay to grab dev->xmit_lock? I mean, is it kind of an exported interface? If I can use it, I can get rid of my channel->xmit_lock entirely. That'd make the code easier, so it's a good thing. The channel lock has finer granularity, but the contention equals zero in real life any way, so who cares? cu, Kai From owner-netdev@oss.sgi.com Thu Mar 23 06:44:10 2000 Received: by oss.sgi.com id ; Thu, 23 Mar 2000 06:44:00 -0800 Received: from smtprich.nortel.com ([192.135.215.8]:2043 "EHLO smtprich.nortel.com") by oss.sgi.com with ESMTP id ; Thu, 23 Mar 2000 06:43:43 -0800 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprich.nortel.com; Thu, 23 Mar 2000 08:43:04 -0600 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id H30RM3AN; Thu, 23 Mar 2000 08:42:07 -0600 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id HNN1JDCF; Fri, 24 Mar 2000 01:42:07 +1100 Received: from uow.edu.au (IDENT:andrewm@[47.181.207.102]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id BAA13233; Fri, 24 Mar 2000 01:42:00 +1100 Message-ID: <38DA2E44.9841923D@uow.edu.au> Date: Thu, 
23 Mar 2000 14:46:28 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.2.13-7mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: netdev , "Lee, Kuan-Meng" Subject: [patch] 2.3.99 cs89x0.c Content-Type: multipart/mixed; boundary="------------73AFAD9EC54554733D210295" X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing This is a multi-part message in MIME format. --------------73AFAD9EC54554733D210295 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit A few things: - skb_reserve two bytes to longword-align the L3 part of incoming packets. Thanks to Philip Blundell for pointing this out. - Removed a delay loop in dma_rx() which snuck through in my previous patch... - Replace an occurrence of 100 with HZ - Remove some outmoded fiddling with skbuff internals. - Fix initialisation of dev->priv's spinlock in non-module mode. - Added 'cs89x0_dma=N' __setup() option so DMA Rx mode can be used when the driver is linked in. (Yes, the __setup function does return 1!). - Updated documentation for the above. Patch against 2.3.99-pre1 is attached. -- -akpm- --------------73AFAD9EC54554733D210295 Content-Type: text/plain; charset=us-ascii; name="cs89x0.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="cs89x0.patch" --- linux-2.3.99-pre1/Documentation/networking/cs89x0.txt Tue Mar 14 21:20:31 2000 +++ linux.akpm/Documentation/networking/cs89x0.txt Fri Mar 24 01:13:21 2000 @@ -308,6 +308,30 @@ l) If during DMA operation you find erratic behavior or network data corruption you should use your PC's BIOS to slow the EISA bus clock. +m) If the cs89x0 driver is compiled directly into the kernel + (non-modular) then its I/O address is automatically determined by + ISA bus probing. The IRQ number, media options, etc are determined + from the card's EEPROM.
+ +n) If the cs89x0 driver is compiled directly into the kernel, DMA + mode may be selected by providing the kernel with a boot option + 'cs89x0_dma=N' where 'N' is the desired DMA channel number (5, 6 or + 7). + + Kernel boot options may be provided on the LILO command line: + + LILO boot: linux cs89x0_dma=5 + + or they may be placed in /etc/lilo.conf: + + image=/boot/bzImage-2.3.48 + append="cs89x0_dma=5" + label=linux + root=/dev/hda5 + read-only + + The DMA Rx buffer size is hardwared to 16 kbytes in this mode. + (64k mode is not available). 4.0 COMPILING THE DRIVER =============================================================================== --- linux-2.3.99-pre1/drivers/net/cs89x0.c Tue Mar 14 21:20:37 2000 +++ linux.akpm/drivers/net/cs89x0.c Fri Mar 24 01:15:28 2000 @@ -48,10 +48,18 @@ : Don't call netif_wake_queue() in net_send_packet() : Fixed an out-of-mem bug in dma_rx() : Updated Documentation/cs89x0.txt + + Andrew Morton : andrewm@uow.edu.au / Kernel 2.3.99-pre1 + : Use skb_reserve to longword align IP header (two places) + : Remove a delay loop from dma_rx() + : Replace '100' with HZ + : Clean up a couple of skb API abuses + : Added 'cs89x0_dma=N' kernel boot option + : Correctly initialise lp->lock in non-module compile */ static char *version = -"cs89x0.c: (kernel 2.3.48) Russell Nelson , Andrew Morton \n"; +"cs89x0.c: v2.3.99-pre1-2 Russell Nelson , Andrew Morton \n"; /* ======================= end of configuration ======================= */ @@ -121,7 +129,7 @@ { 0x300, 0x320, 0x340, 0x360, 0x200, 0x220, 0x240, 0x260, 0x280, 0x2a0, 0x2c0, 0x2e0, 0}; #if DEBUGGING -static unsigned int net_debug = 5; +static unsigned int net_debug = DEBUGGING; #else #define net_debug 0 /* gcc will remove all the debug code for us */ #endif @@ -190,6 +198,21 @@ /* Example routines you must write ;->. 
*/ #define tx_done(dev) 1 +/* + * Permit 'cs89x0_dma=N' in the kernel boot environment + */ +#if !defined(MODULE) && (ALLOW_DMA != 0) +static int g_cs89x0_dma; + +static int __init dma_fn(char *str) +{ + g_cs89x0_dma = simple_strtol(str,NULL,0); + return 1; +} + +__setup("cs89x0_dma=", dma_fn); +#endif /* !defined(MODULE) && (ALLOW_DMA != 0) */ + /* Check for a network adaptor of this type, and return '0' iff one exists. If dev->base_addr == 0, probe all likely locations. @@ -318,7 +341,17 @@ retval = ENOMEM; goto out; } - memset(dev->priv, 0, sizeof(struct net_local)); + lp = (struct net_local *)dev->priv; + memset(lp, 0, sizeof(*lp)); + spin_lock_init(&lp->lock); +#if !defined(MODULE) && (ALLOW_DMA != 0) + if (g_cs89x0_dma) + { + lp->use_dma = 1; + lp->dma = g_cs89x0_dma; + lp->dmasize = 16; /* Could make this an option... */ + } +#endif } lp = (struct net_local *)dev->priv; @@ -612,12 +645,6 @@ int status, length; unsigned char *bp = lp->rx_dma_ptr; - { - int i; - for (i = 0; i < 1000; i++) - ; - } - status = bp[0] + (bp[1]<<8); length = bp[2] + (bp[3]<<8); bp += 4; @@ -632,7 +659,7 @@ } /* Malloc up new buffer. */ - skb = alloc_skb(length, GFP_ATOMIC); + skb = dev_alloc_skb(length + 2); if (skb == NULL) { if (net_debug) /* I don't think we want to do this to a stressed system */ printk("%s: Memory squeeze, dropping packet.\n", dev->name); @@ -645,8 +672,7 @@ lp->rx_dma_ptr = bp; return; } - - skb->len = length; + skb_reserve(skb, 2); /* longword align L3 header */ skb->dev = dev; if (bp + length > lp->end_dma_buff) { @@ -720,7 +746,7 @@ writereg(dev, PP_SelfCTL, selfcontrol); /* Wait for the DC/DC converter to power up - 500ms */ - while (jiffies - timenow < 100) + while (jiffies - timenow < HZ) ; } @@ -1317,7 +1343,7 @@ } /* Malloc up new buffer. 
*/ - skb = alloc_skb(length, GFP_ATOMIC); + skb = dev_alloc_skb(length + 2); if (skb == NULL) { #if 0 /* Again, this seems a cruel thing to do */ printk(KERN_WARNING "%s: Memory squeeze, dropping packet.\n", dev->name); @@ -1325,10 +1351,10 @@ lp->stats.rx_dropped++; return; } - skb->len = length; + skb_reserve(skb, 2); /* longword align L3 header */ skb->dev = dev; - insw(ioaddr + RX_FRAME_PORT, skb->data, length >> 1); + insw(ioaddr + RX_FRAME_PORT, skb_put(skb, length), length >> 1); if (length & 1) skb->data[length-1] = inw(ioaddr + RX_FRAME_PORT); --------------73AFAD9EC54554733D210295-- From owner-netdev@oss.sgi.com Thu Mar 23 23:10:44 2000 Received: by oss.sgi.com id ; Thu, 23 Mar 2000 23:10:34 -0800 Received: from linuxcare.canberra.net.au ([203.29.91.49]:20228 "EHLO front.linuxcare.com.au") by oss.sgi.com with ESMTP id ; Thu, 23 Mar 2000 23:10:20 -0800 Received: from elm.linuxcare.com.au (elm.linuxcare.com.au [10.61.2.17]) by front.linuxcare.com.au (8.9.3/8.9.3/Debian 8.9.3-21) with ESMTP id SAA09342; Fri, 24 Mar 2000 18:10:13 +1100 Received: from linuxcare.com (localhost [127.0.0.1]) by elm.linuxcare.com.au (8.9.3/8.9.3/Debian 8.9.3-21) with ESMTP id SAA02284; Fri, 24 Mar 2000 18:10:12 +1100 Message-Id: <200003240710.SAA02284@elm.linuxcare.com.au> X-Authentication-Warning: elm.linuxcare.com.au: Host localhost [127.0.0.1] claimed to be linuxcare.com To: netdev@oss.sgi.com cc: rusty@linuxcare.com, tridge@linuxcare.com Subject: TCP hang with 2.2.14 <-> 2.2.15pre5 MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----- =_aaaaaaaaaa0" Content-ID: <2074.953881057.0@linuxcare.com> Date: Fri, 24 Mar 2000 18:10:12 +1100 From: Stephen Rothwell Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing ------- =_aaaaaaaaaa0 Content-Type: text/plain; charset="us-ascii" Content-ID: <2074.953881057.1@linuxcare.com> Hi all, In diagnosing a report of a bug in rsync, we have discovered what looks like a bug in the 2.2 TCP 
stack. After transfering some amount of data (usually 30MB or more) the connection hangs. Below I have included the ends of tcpdumps from both ends. The two machines are called elm (running 2.2.15pre5 and doing most of the transmitting) and owl (running 2.2.14). Also elm gets a reasonable number (~10%) transmit errors. We tried turning off SACK on the 2.2.14 machine. This just seemed to make it fail quicker (though this may not be so). The first dump was collected on elm, the second on owl. I have both complete dumps (from tcpdump -w) if you need them. Cheers, Stephen -- Stephen Rothwell, Open Source Project Engineer, Linuxcare, Inc. +61-2-62628990 tel, +61-2-62628991 fax sfr@linuxcare.com, http://www.linuxcare.com/ Linuxcare. Support for the revolution. ------- =_aaaaaaaaaa0 Content-Type: text/plain; name="elm.tcpdump"; charset="us-ascii" Content-ID: <2074.953881057.2@linuxcare.com> 16:21:43.390093 elm.1216 > owl.rsync: P 31342678:31344126(1448) ack 76839 win 1448 (DF) 16:21:43.390104 elm.1216 > owl.rsync: P 31344126:31345574(1448) ack 76839 win 1448 (DF) 16:21:43.390114 elm.1216 > owl.rsync: P 31345574:31347022(1448) ack 76839 win 1448 (DF) 16:21:43.390124 elm.1216 > owl.rsync: P 31347022:31348470(1448) ack 76839 win 1448 (DF) 16:21:43.390132 elm.1216 > owl.rsync: P 31348470:31349918(1448) ack 76839 win 1448 (DF) 16:21:43.390843 owl.rsync > elm.1216: . ack 31348470 win 30408 (DF) 16:21:43.390944 elm.1216 > owl.rsync: P 31349918:31351366(1448) ack 76839 win 1448 (DF) 16:21:43.390954 elm.1216 > owl.rsync: P 31351366:31352814(1448) ack 76839 win 1448 (DF) 16:21:43.390964 elm.1216 > owl.rsync: P 31352814:31354262(1448) ack 76839 win 1448 (DF) 16:21:43.390972 elm.1216 > owl.rsync: P 31354262:31355710(1448) ack 76839 win 1448 (DF) 16:21:43.390980 elm.1216 > owl.rsync: P 31355710:31357158(1448) ack 76839 win 1448 (DF) 16:21:43.391759 owl.rsync > elm.1216: . ack 31348470 win 30408 (DF) 16:21:43.391763 owl.rsync > elm.1216: . 
ack 31348470 win 30408 (DF) 16:21:43.391766 owl.rsync > elm.1216: . ack 31348470 win 30408 (DF) 16:21:43.391837 elm.1216 > owl.rsync: P 31348470:31349918(1448) ack 76839 win 1448 (DF) 16:21:43.391780 owl.rsync > elm.1216: . ack 31348470 win 30408 (DF) 16:21:43.391815 owl.rsync > elm.1216: . ack 31348470 win 30408 (DF) 16:21:43.391864 elm.1216 > owl.rsync: P 31357158:31358606(1448) ack 76839 win 1448 (DF) 16:21:43.392252 owl.rsync > elm.1216: . ack 31357158 win 24616 (DF) 16:21:43.392286 elm.1216 > owl.rsync: P 31358606:31360054(1448) ack 76839 win 1448 (DF) 16:21:43.392296 elm.1216 > owl.rsync: P 31360054:31361502(1448) ack 76839 win 1448 (DF) 16:21:43.392702 owl.rsync > elm.1216: . ack 31360054 win 23168 (DF) 16:21:43.392730 elm.1216 > owl.rsync: P 31361502:31362950(1448) ack 76839 win 1448 (DF) 16:21:43.392739 elm.1216 > owl.rsync: P 31362950:31364398(1448) ack 76839 win 1448 (DF) 16:21:43.392750 elm.1216 > owl.rsync: P 31364398:31365846(1448) ack 76839 win 1448 (DF) 16:21:43.393367 owl.rsync > elm.1216: . ack 31362950 win 31856 (DF) 16:21:43.393407 elm.1216 > owl.rsync: P 31365846:31367294(1448) ack 76839 win 1448 (DF) 16:21:43.393417 elm.1216 > owl.rsync: P 31367294:31368742(1448) ack 76839 win 1448 (DF) 16:21:43.393379 owl.rsync > elm.1216: . ack 31365846 win 30408 (DF) 16:21:43.393441 elm.1216 > owl.rsync: P 31368742:31370190(1448) ack 76839 win 1448 (DF) 16:21:43.393451 elm.1216 > owl.rsync: P 31370190:31371638(1448) ack 76839 win 1448 (DF) 16:21:43.394073 owl.rsync > elm.1216: . ack 31368742 win 31856 (DF) 16:21:43.394108 elm.1216 > owl.rsync: P 31371638:31373086(1448) ack 76839 win 1448 (DF) 16:21:43.394119 elm.1216 > owl.rsync: P 31373086:31374534(1448) ack 76839 win 1448 (DF) 16:21:43.394148 owl.rsync > elm.1216: . 
ack 31371638 win 31856 (DF) 16:21:43.394176 elm.1216 > owl.rsync: P 31374534:31375982(1448) ack 76839 win 1448 (DF) 16:21:43.394186 elm.1216 > owl.rsync: P 31375982:31377430(1448) ack 76839 win 1448 (DF) 16:21:43.394195 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 76839 win 1448 (DF) 16:21:43.394853 owl.rsync > elm.1216: . ack 31377430 win 31856 (DF) 16:21:43.394884 elm.1216 > owl.rsync: P 31378878:31380326(1448) ack 76839 win 1448 (DF) 16:21:43.394894 elm.1216 > owl.rsync: P 31380326:31381774(1448) ack 76839 win 1448 (DF) 16:21:43.394903 elm.1216 > owl.rsync: P 31381774:31383222(1448) ack 76839 win 1448 (DF) 16:21:43.394913 elm.1216 > owl.rsync: P 31383222:31384670(1448) ack 76839 win 1448 (DF) 16:21:43.395436 owl.rsync > elm.1216: . ack 31377430 win 31856 (DF) 16:21:43.395491 owl.rsync > elm.1216: . ack 31377430 win 31856 (DF) 16:21:43.566543 owl.rsync > elm.1216: P 76839:78287(1448) ack 31377430 win 31856 (DF) 16:21:43.566608 elm.1216 > owl.rsync: . ack 78287 win 1448 (DF) 16:21:43.566949 owl.rsync > elm.1216: . 78287:79735(1448) ack 31377430 win 31856 (DF) 16:21:43.587451 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:21:43.588103 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 28960 (DF) 16:21:43.588136 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:21:43.786560 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:21:43.786598 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:21:43.987447 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:21:43.987798 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:21:43.987818 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:21:44.186492 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:21:44.186529 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:21:44.787469 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:21:44.787825 owl.rsync > elm.1216: . 
ack 31383222 win 31856 (DF) 16:21:44.787860 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:21:44.986440 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:21:44.986473 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:21:46.387470 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:21:46.387893 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:21:46.387929 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:21:46.586341 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:21:46.586375 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:21:49.587480 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:21:49.587803 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:21:49.587839 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:21:49.786129 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:21:49.786165 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:21:55.987474 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:21:55.987864 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:21:55.987902 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:21:56.185739 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:21:56.185810 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:22:08.787452 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:22:08.787805 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:22:08.787833 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:22:08.984888 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:22:08.984922 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:22:34.387464 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:22:34.387756 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:22:34.387783 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:22:34.583229 owl.rsync > elm.1216: . 
79735:81183(1448) ack 31383222 win 31856 (DF) 16:22:34.583265 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:23:25.587464 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:23:25.587819 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:23:25.587843 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:23:25.779959 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:23:25.779993 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:25:07.987473 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:25:07.987870 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:25:07.987906 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:25:08.173267 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:25:08.173305 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:27:07.987482 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:27:07.987829 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:27:07.987865 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:27:08.165481 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:27:08.165518 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:29:07.987484 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:29:07.987827 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:29:07.987863 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:29:08.157687 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:29:08.157723 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:31:07.987495 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:31:07.987909 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:31:07.987952 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:31:08.149903 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:31:08.149958 elm.1216 > owl.rsync: . 
ack 79735 win 0 (DF) 16:33:07.987479 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:33:07.987844 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:33:07.987880 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:33:08.142121 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:33:08.142157 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:35:07.987478 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:35:07.987844 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:35:07.987879 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:35:08.134331 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:35:08.134366 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:37:07.987475 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:37:07.987900 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:37:07.987946 elm.1216 > owl.rsync: R 2680405324:2680405324(0) win 0 16:37:07.989904 elm.1216 > owl.rsync: R 31436798:31436798(0) ack 79735 win 31856 (DF) ------- =_aaaaaaaaaa0 Content-Type: text/plain; name="owl.tcpdump"; charset="us-ascii" Content-ID: <2074.953881057.3@linuxcare.com> 16:24:52.341380 elm.1216 > owl.rsync: P 31342678:31344126(1448) ack 76839 win 1448 (DF) 16:24:52.341504 elm.1216 > owl.rsync: P 31344126:31345574(1448) ack 76839 win 1448 (DF) 16:24:52.341536 owl.rsync > elm.1216: . ack 31345574 win 31856 (DF) 16:24:52.341626 elm.1216 > owl.rsync: P 31345574:31347022(1448) ack 76839 win 1448 (DF) 16:24:52.341777 elm.1216 > owl.rsync: P 31347022:31348470(1448) ack 76839 win 1448 (DF) 16:24:52.341810 owl.rsync > elm.1216: . ack 31348470 win 30408 (DF) 16:24:52.342233 elm.1216 > owl.rsync: P 31349918:31351366(1448) ack 76839 win 1448 (DF) 16:24:52.342279 owl.rsync > elm.1216: . ack 31348470 win 30408 (DF) 16:24:52.342355 elm.1216 > owl.rsync: P 31351366:31352814(1448) ack 76839 win 1448 (DF) 16:24:52.342390 owl.rsync > elm.1216: . 
ack 31348470 win 30408 (DF) 16:24:52.342489 elm.1216 > owl.rsync: P 31352814:31354262(1448) ack 76839 win 1448 (DF) 16:24:52.342523 owl.rsync > elm.1216: . ack 31348470 win 30408 (DF) 16:24:52.342621 elm.1216 > owl.rsync: P 31354262:31355710(1448) ack 76839 win 1448 (DF) 16:24:52.342654 owl.rsync > elm.1216: . ack 31348470 win 30408 (DF) 16:24:52.342752 elm.1216 > owl.rsync: P 31355710:31357158(1448) ack 76839 win 1448 (DF) 16:24:52.342784 owl.rsync > elm.1216: . ack 31348470 win 30408 (DF) 16:24:52.343124 elm.1216 > owl.rsync: P 31348470:31349918(1448) ack 76839 win 1448 (DF) 16:24:52.343164 owl.rsync > elm.1216: . ack 31357158 win 24616 (DF) 16:24:52.343247 elm.1216 > owl.rsync: P 31357158:31358606(1448) ack 76839 win 1448 (DF) 16:24:52.343579 elm.1216 > owl.rsync: P 31358606:31360054(1448) ack 76839 win 1448 (DF) 16:24:52.343638 owl.rsync > elm.1216: . ack 31360054 win 23168 (DF) 16:24:52.343697 elm.1216 > owl.rsync: P 31360054:31361502(1448) ack 76839 win 1448 (DF) 16:24:52.344028 elm.1216 > owl.rsync: P 31361502:31362950(1448) ack 76839 win 1448 (DF) 16:24:52.344142 elm.1216 > owl.rsync: P 31362950:31364398(1448) ack 76839 win 1448 (DF) 16:24:52.344264 elm.1216 > owl.rsync: P 31364398:31365846(1448) ack 76839 win 1448 (DF) 16:24:52.344334 owl.rsync > elm.1216: . ack 31362950 win 31856 (DF) 16:24:52.344348 owl.rsync > elm.1216: . ack 31365846 win 30408 (DF) 16:24:52.344703 elm.1216 > owl.rsync: P 31365846:31367294(1448) ack 76839 win 1448 (DF) 16:24:52.344819 elm.1216 > owl.rsync: P 31367294:31368742(1448) ack 76839 win 1448 (DF) 16:24:52.344856 owl.rsync > elm.1216: . ack 31368742 win 31856 (DF) 16:24:52.344941 elm.1216 > owl.rsync: P 31368742:31370190(1448) ack 76839 win 1448 (DF) 16:24:52.345068 elm.1216 > owl.rsync: P 31370190:31371638(1448) ack 76839 win 1448 (DF) 16:24:52.345100 owl.rsync > elm.1216: . 
ack 31371638 win 31856 (DF) 16:24:52.345397 elm.1216 > owl.rsync: P 31371638:31373086(1448) ack 76839 win 1448 (DF) 16:24:52.345519 elm.1216 > owl.rsync: P 31373086:31374534(1448) ack 76839 win 1448 (DF) 16:24:52.345553 owl.rsync > elm.1216: . ack 31374534 win 31856 (DF) 16:24:52.345642 elm.1216 > owl.rsync: P 31374534:31375982(1448) ack 76839 win 1448 (DF) 16:24:52.345792 elm.1216 > owl.rsync: P 31375982:31377430(1448) ack 76839 win 1448 (DF) 16:24:52.345822 owl.rsync > elm.1216: . ack 31377430 win 31856 (DF) 16:24:52.346174 elm.1216 > owl.rsync: P 31378878:31380326(1448) ack 76839 win 1448 (DF) 16:24:52.346218 owl.rsync > elm.1216: . ack 31377430 win 31856 (DF) 16:24:52.346295 elm.1216 > owl.rsync: P 31380326:31381774(1448) ack 76839 win 1448 (DF) 16:24:52.346328 owl.rsync > elm.1216: . ack 31377430 win 31856 (DF) 16:24:52.346430 elm.1216 > owl.rsync: P 31381774:31383222(1448) ack 76839 win 1448 (DF) 16:24:52.346461 owl.rsync > elm.1216: . ack 31377430 win 31856 (DF) 16:24:52.517284 owl.rsync > elm.1216: P 76839:78287(1448) ack 31377430 win 31856 (DF) 16:24:52.517673 elm.1216 > owl.rsync: . ack 78287 win 1448 (DF) 16:24:52.517697 owl.rsync > elm.1216: . 78287:79735(1448) ack 31377430 win 31856 (DF) 16:24:52.538760 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:24:52.538849 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 28960 (DF) 16:24:52.539204 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:24:52.737310 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:24:52.737680 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:24:52.938772 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:24:52.938803 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:24:52.938904 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:24:53.137279 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:24:53.137629 elm.1216 > owl.rsync: . 
ack 79735 win 0 (DF) 16:24:53.738854 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:24:53.738880 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:24:53.739000 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:24:53.937279 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:24:53.937625 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:24:55.338968 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:24:55.339047 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:24:55.339177 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:24:55.537284 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:24:55.537632 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:24:58.539180 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:24:58.539212 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:24:58.539333 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:24:58.737280 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:24:58.737628 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:25:04.939591 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:25:04.939642 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:25:04.939772 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:25:05.137294 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:25:05.137695 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:25:17.740386 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:25:17.740417 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:25:17.740529 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:25:17.937282 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:25:17.937630 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:25:43.342057 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:25:43.342092 owl.rsync > elm.1216: . 
ack 31383222 win 31856 (DF) 16:25:43.342203 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:25:43.537284 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:25:43.537632 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:26:34.545390 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:26:34.545479 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:26:34.545590 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:26:34.737322 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:26:34.737690 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:28:16.952043 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:28:16.952100 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:28:16.952227 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:28:17.137283 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:28:17.137635 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:30:16.959839 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:30:16.959900 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:30:16.960021 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:30:17.137286 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:30:17.137636 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:32:16.967633 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:32:16.967689 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:32:16.967812 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:32:17.137285 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:32:17.137634 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:34:16.975434 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:34:16.975505 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:34:16.975640 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:34:17.137284 owl.rsync > elm.1216: . 
79735:81183(1448) ack 31383222 win 31856 (DF) 16:34:17.137656 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:36:16.983199 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:36:16.983264 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:36:16.983391 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:36:17.137283 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:36:17.137634 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:38:16.990990 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:38:16.991058 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:38:16.991184 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:38:17.137283 owl.rsync > elm.1216: . 79735:81183(1448) ack 31383222 win 31856 (DF) 16:38:17.137632 elm.1216 > owl.rsync: . ack 79735 win 0 (DF) 16:40:16.998785 elm.1216 > owl.rsync: P 31377430:31378878(1448) ack 79735 win 0 (DF) 16:40:16.998866 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) 16:40:16.999003 elm.1216 > owl.rsync: R 2680405324:2680405324(0) win 0 16:40:17.000988 elm.1216 > owl.rsync: R 31436798:31436798(0) ack 79735 win 31856 (DF) ------- =_aaaaaaaaaa0-- From owner-netdev@oss.sgi.com Fri Mar 24 01:33:14 2000 Received: by oss.sgi.com id ; Fri, 24 Mar 2000 01:33:05 -0800 Received: from laurin.munich.netsurf.de ([194.64.166.1]:21957 "EHLO laurin.munich.netsurf.de") by oss.sgi.com with ESMTP id ; Fri, 24 Mar 2000 01:32:49 -0800 Received: from fred.muc.de (ns1170.munich.netsurf.de [195.180.235.170]) by laurin.munich.netsurf.de (8.9.3/8.9.3) with ESMTP id KAA16174; Fri, 24 Mar 2000 10:32:40 +0100 (MET) Received: from andi by fred.muc.de with local (Exim 2.05 #1) id 12YQV1-0001fS-00; Fri, 24 Mar 2000 10:35:23 +0100 Date: Fri, 24 Mar 2000 10:35:23 +0100 From: Andi Kleen To: Stephen Rothwell Cc: netdev@oss.sgi.com, rusty@linuxcare.com, tridge@linuxcare.com Subject: Re: TCP hang with 2.2.14 <-> 2.2.15pre5 Message-ID: <20000324103523.A6373@fred.muc.de> References: 
<200003240710.SAA02284@elm.linuxcare.com.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <200003240710.SAA02284@elm.linuxcare.com.au>; from Stephen Rothwell on Fri, Mar 24, 2000 at 08:12:08AM +0100 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Fri, Mar 24, 2000 at 08:12:08AM +0100, Stephen Rothwell wrote: > 16:37:07.987900 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) > 16:37:07.987946 elm.1216 > owl.rsync: R 2680405324:2680405324(0) win 0 > 16:37:07.989904 elm.1216 > owl.rsync: R 31436798:31436798(0) ack 79735 win 31856 (DF) The second reset is not generated by the Linux TCP/IP stack (we never send RSTs with options and windows). It looks like a normal ACK with a few bits flipped (?). The first one is bogus too because it has wrong sequence numbers, but at least the rest of the header looks normal. I would check the ethernet driver/card. -Andi From owner-netdev@oss.sgi.com Sat Mar 25 06:18:28 2000 Received: by oss.sgi.com id ; Sat, 25 Mar 2000 06:18:17 -0800 Received: from smtprtp1.ntcom.nortel.net ([137.118.22.14]:4525 "EHLO smtprtp1.ntcom.nortel.net") by oss.sgi.com with ESMTP id ; Sat, 25 Mar 2000 06:18:01 -0800 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by smtprtp1.ntcom.nortel.net; Sat, 25 Mar 2000 09:08:04 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id H3S3WRPX; Sat, 25 Mar 2000 22:08:01 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id HNN1JF1F; Sun, 26 Mar 2000 01:08:04 +1100 Received: from uow.edu.au (IDENT:andrewm@[47.181.207.103]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id BAA02221 for ; Sun, 26 Mar 2000 01:07:55 +1100 Message-ID: <38DCC94C.9B82856B@uow.edu.au> 
Date: Sat, 25 Mar 2000 14:12:28 +0000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.2.13-7mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: netdev Subject: 3c59x.c Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Would anyone object if I did some work on this driver? (2.3.99) Let me assume the answer is "no" and plunge on... I have identified a stack of things which _could_ be done. I've broken them down into optional, desirable and essential. I would propose that the optional stuff gets noted in a comment and the desirable and essential things get implemented. Would those in the know please review my notes below, correct any misconceptions, add things I've missed? There are a few questions in here as well. They are marked with "???". Thanks. [ A lot of the "vortex_foo" comments also apply to "boomerang_foo" ] Optional ======= - Debug code is _always_ compiled in, and debug level defaults to 1. This is OK, it would be better to make it easy for 'debug' to be literal "0". This reduces the driver size by 1.5k. 'debug' should be made always a literal constant in a non-modular compile because it can't be altered at runtime or boot. - struct pci_id_info has a 'probe1' method. It is never used. Remove this. ??? - why do we independently kmalloc dev->priv? Why not let init_etherdev() do it? ??? - Where does 'option' come from in vortex_probe1()? Don't understand this. - some printk's which don't use KERN_XXXX - make ram_split[] static (shorter code). - vortex_open(): this is dodgy: for (i = 2000; i >= 0 ; i--) if ( ! (inw(ioaddr + EL3_STATUS) & CmdInProgress)) break; Is 2000 'huge' enough to still work on a 3GHz CPU? Would prefer: for (i = 10; i >= 0 ; i--) { if ( ! (inw(ioaddr + EL3_STATUS) & CmdInProgress)) break; udelay(10); } - priv.in_interrupt is not used. - vortex_tx_timeout(): for (j = 200; j >= 0 ; j--) if ( ! 
(inw(ioaddr + EL3_STATUS) & CmdInProgress)) break; CPU-speed dependent. - vortex_error(): for (i = 2000; i >= 0 ; i--) if ( ! (inw(ioaddr + EL3_STATUS) & CmdInProgress)) break; (three places) - vortex_start_xmit(): if (inw(ioaddr + TxFree) > 1536) { netif_wake_queue(dev); But there is this: static const int mtu = 1500; use this 'knob' rather than "1536". - boomerang_start_xmit(): if (1) { huh? - Inconsistent use of 'vortex_debug' and 'debug' - vortex_interrupt(): What does the stuff under if (status & DownComplete) { do? Looks like transmitter shutdown handling? - vortex_rx(): while (inw(ioaddr + Wn7_MasterStatus) & 0x8000) ; while we're waiting for DMA to complete. Is this efficient from a bus utilisation POV? - [OT] Gack. Can eth_type_trans() be sped up? - vortex_rw(): for (i = 200; i >= 0; i--) if ( ! (inw(ioaddr + EL3_STATUS) & CmdInProgress)) break; two places. - mdio_read(): for (i = 14; i >= 0; i--) { int dataval = (read_cmd&(1<>= 1) { int dataval = (read_cmd&i) ? MDIO_DATA_WRITE1 : MDIO_DATA_WRITE0; Big deal. These aren't in a fast path. - mdio_write(): Ditto . - #if ! defined(final_version). This seems to be cruft - it's always compiled in (there is no final version!) Desirable ========= - return value from init_etherdev() is not checked. - return value from kmalloc() is not checked (two places) - return from pci_alloc_consistent() not checked. - return from request_region is unchecked. - vortex_open(): /* Use the now-standard shared IRQ implementation. */ if (request_irq(dev->irq, &vortex_interrupt, SA_SHIRQ, dev->name, dev)) { return -EAGAIN; } This happens _after_ we've called init_timer()/add_timer, so vortex_timer() will end up getting called on a driver for which the open failed. Use del_timer() here. ??? - vortex_open() skb = dev_alloc_skb(PKT_BUF_SZ); ... skb_reserve(skb, 2); /* Align IP on 16 byte boundaries */ Shouldn't that be PKT_BUF_SZ + 2? Essential ========= - Are there MOD_INC/DEC_USE_COUNT races? What are the rules here? 
- vortex_start_xmit() is racy. It is not protected from h/w interrupts. It needs a spin_lock_irqsave(priv->lock) and vortex_interrupt() needs a spin_lock(priv->lock); In vortex_start_xmit(), the spin_lock_irq() should happen immediately before the netif_stop_queue(). * vortex_start_xmit() calls netif_stop_queue() and then under some circumstances (non-DMA o/p and there is room in the Tx buffer) it calls netif_wake_queue(). Seems OK, as long as it's done under the spin_lock_irqsave(). - Ditto boomerang_start_xmit() - vortex_tx_timeout() doesn't always call netif_wake_queue. Is this OK? - vortex_tx_timeout() fiddles with h/w and is not protected from vortex_interrupt() which also fiddles with h/w registers. Use priv->lock for this case as well. - boomerang_start_xmit() calls cli(). Remove this and wrap the entire function in spin_lock_irqsave(priv->lock) - vortex_get_stats() needs a spinlock in update_stats(). (lock in dev.priv, call spinlock_init()). Remove cli(). - set_rx_mode() probably needs a spinlock. Safest to add it. - vortex_interrupt() uses dev_kfree_skb_irq(), but vortex_interrupt() is called from elsewhere in non-IRQ context! Use dev_kfree_skb_any(). ??? - vortex_ioctl(): switch(cmd) { case SIOCDEVPRIVATE: /* Get the address of the PHY in use. */ data[0] = phy; case SIOCDEVPRIVATE+1: /* Read the specified MII register. */ EL3WINDOW(4); data[3] = mdio_read(ioaddr, data[0] & 0x1f, data[1] & 0x1f); Do we mean to fall through the first case? This is either very devious or plain wrong. - [OT] pci.c is racy. - [OT] waah! skeleton.c uses global cli()/sti()! - [OT] I had a peek at eepro100.c. The use of wait_for_cmd_done() is racy. 
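For anyone following the vortex_ioctl() question above, here is a minimal userspace sketch of what the missing break does. It is not driver code: the command numbers, fake_mdio_read() and mii_ioctl_sketch() are made up for illustration, and only the shape of the switch is taken from the snippet quoted in the list.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical command numbers; they mirror the two cases in the quoted
 * switch, not the values in the kernel headers. */
enum { CMD_GET_PHY = 0, CMD_READ_MII = 1 };

/* Stub for mdio_read(): encodes its arguments in the result so the caller
 * can see which PHY and which register were actually used. */
uint16_t fake_mdio_read(int phy_id, int reg)
{
    return (uint16_t)((phy_id << 8) | reg);
}

/* Same control flow as the quoted switch: no break after the first case. */
int mii_ioctl_sketch(int cmd, uint16_t *data, int phy)
{
    assert(data);
    switch (cmd) {
    case CMD_GET_PHY:           /* Get the address of the PHY in use. */
        data[0] = (uint16_t)phy;
        /* fall through */
    case CMD_READ_MII:          /* Read the specified MII register. */
        data[3] = fake_mdio_read(data[0] & 0x1f, data[1] & 0x1f);
        return 0;
    default:
        return -1;
    }
}
```

With the first command the caller gets the PHY address in data[0] and, because control falls into the second case, a register read from that same PHY in data[3]. Whether that is the intended interface is exactly the question asked above; the sketch only shows the behaviour.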
-- -akpm- From owner-netdev@oss.sgi.com Sat Mar 25 09:13:39 2000 Received: by oss.sgi.com id ; Sat, 25 Mar 2000 09:13:29 -0800 Received: from adsl-151-196-249-3.bellatlantic.net ([151.196.249.3]:48881 "EHLO vaio.greennet") by oss.sgi.com with ESMTP id ; Sat, 25 Mar 2000 09:13:08 -0800 Received: from localhost (becker@localhost) by vaio.greennet (8.9.3/8.8.7) with ESMTP id MAA20427; Sat, 25 Mar 2000 12:15:11 -0500 Date: Sat, 25 Mar 2000 12:15:11 -0500 (EST) From: Donald Becker X-Sender: becker@vaio.greennet To: Andrew Morton cc: netdev Subject: Re: 3c59x.c In-Reply-To: <38DCC94C.9B82856B@uow.edu.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sat, 25 Mar 2000, Andrew Morton wrote: > To: netdev > Subject: 3c59x.c > > Would anyone object if I did some work on this driver? (2.3.99) You should bring this up on the linux-vortex mailing list. If you do not read the linux-vortex mailing list, you should not even think about modifying the driver until you understand more about it. > - Debug code is _always_ compiled in, and debug level defaults to 1. > This is OK, it would be better to make it easy for 'debug' to be > literal "0". This reduces the driver size by 1.5k. This is highly useful information when tracking down problems. Most problems are not in the driver structure, but problems with the system (IRQs, PCI bus, and the like) or the media selection. > - struct pci_id_info has a 'probe1' method. It is never used. Remove this. This is because it was incompletely converted for 2.3.99. Rule: It's not reasonable to halfway convert something and walk away. > - why do we independently kmalloc dev->priv? Why not let > init_etherdev() do it? Because of alignment requirements. Some descriptors must be longword aligned, and they perform much better when cache line aligned. You must understand caching systems before touching this. 
It's OK to take slight performance hits on rarely used systems like the Sparc, but you should throughly understand what the PCI bus implementation on the adapter is doing. > ??? > - Where does 'option' come from in vortex_probe1()? Don't understand > this. LILO parameters. > - some printk's which don't use KERN_XXXX You only put that at the beginning of the emitted line! I see only one place where this is incorrect, on line 848, and that should only show up for certain CardBus adapters. (Hmmm, that line came from a *later* driver version than 99H!) > - vortex_open(): this is dodgy: No, it's not. > for (i = 2000; i >= 0 ; i--) > if ( ! (inw(ioaddr + EL3_STATUS) & CmdInProgress)) > break; > > Is 2000 'huge' enough to still work on a 3GHz CPU? Yes. These are based on PCI bus transactions. This is not a CPU timing loop. > Would prefer: > > for (i = 10; i >= 0 ; i--) > { > if ( ! (inw(ioaddr + EL3_STATUS) & CmdInProgress)) > break; > udelay(10); > } No. This is wrong, and potentially very, very bad. Do a profile on how many times this loop executes. It is typically zero or one times. Regarding the similar reset code, I explicitly changed 3Com drivers back after Alan changed them to use udelay(). The udelay() code was broken at one point -- it didn't take into account running faster because of cache alignment. Alan blamed me for using the wrong timing info, and left a note about my "bug" in the driver. Neither I nor 3Com could reproduce the problem. Later the udelay() fix was very quietly slipped into place. > - priv.in_interrupt is not used. The code for this lock was eliminated in recent 2.3.* changes. But whoever did that change did only a minimal, hackish job of updating the driver. > - vortex_tx_timeout(): > > for (j = 200; j >= 0 ; j--) > if ( ! (inw(ioaddr + EL3_STATUS) & CmdInProgress)) > break; > > CPU-speed dependent. No, it's not. See above. You *must* understand what is going on before you touch the code. 
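The point about the polling loops can be illustrated with a small userspace model. This is a sketch under assumptions, not the driver: struct fake_nic and fake_read_status() are hypothetical stand-ins for the card and for inw(ioaddr + EL3_STATUS), and the CMD_IN_PROGRESS bit position is picked for illustration. What it shows is that the loop bound counts status reads, so the loop ends on the first read that sees the bit clear, however large the bound is.

```c
#include <assert.h>
#include <stdint.h>

#define CMD_IN_PROGRESS (1u << 12)  /* assumed bit position, illustration only */

/* Hypothetical device model: the busy bit stays set for a fixed number of
 * status reads and then clears. */
struct fake_nic {
    int reads_until_idle;   /* status reads that will still report busy */
    int reads_done;         /* status reads issued so far */
};

/* Stand-in for one status register read over the bus. */
uint16_t fake_read_status(struct fake_nic *nic)
{
    nic->reads_done++;
    if (nic->reads_until_idle > 0) {
        nic->reads_until_idle--;
        return CMD_IN_PROGRESS;
    }
    return 0;
}

/* Bounded poll in the shape of the quoted loop.  Returns the number of
 * status reads issued, or -1 if the bit never cleared within the bound. */
int wait_cmd_done(struct fake_nic *nic, int max_polls)
{
    int i;

    assert(nic && max_polls >= 0);
    for (i = max_polls; i >= 0; i--)
        if (!(fake_read_status(nic) & CMD_IN_PROGRESS))
            return nic->reads_done;
    return -1;
}
```

If the command has already completed, the function returns after a single read; the bound of 2000 only comes into play when the bit stays set.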
> - vortex_start_xmit(): > > if (inw(ioaddr + TxFree) > 1536) { > netif_wake_queue(dev); > > But there is this: > > static const int mtu = 1500; > > use this 'knob' rather than "1536". Noooooo. No. > - boomerang_start_xmit(): > > if (1) { > > huh? Hackish driver update to 2.3.*. Not my fault. Look at the original 99H source, or better, the 99M source code to understand why the original code made sense. > - vortex_interrupt(): > What does the stuff under > if (status & DownComplete) { > do? Looks like transmitter shutdown handling? No. Think "DownloadComplete", except I used the names defined in the 3Com manual when they didn't conflict with some other convention. > - vortex_rx(): > while (inw(ioaddr + Wn7_MasterStatus) & 0x8000) > ; > > while we're waiting for DMA to complete. Is this efficient from a > bus utilisation POV? Usually no. This is a bad loop, and violates my usual coding standards. But in this special case, for the 3c590 series only, we are waiting on a single PCI bus burst transaction that we just triggered. The typical loop iteration count is zero, but has no easily calculated upper bound. > - [OT] Gack. Can eth_type_trans() be sped up? Gackkkk! It *is* horrible, and a cache pig to boot. Not my fault, and it certainly should be part of netif_rx() not a separate call in the driver. It was put in by the people that do PPP and didn't care about performance. > - mdio_read(): .. > - mdio_write(): [[ Pointless code change. Don't touch touch that code. It take me a long time to get the serial EEPROM code and the MDIO code just right. Once I did, and verified the cases on an o-scope, I copied that code verbatim across the various drivers. If you think it's easy, write a duplicate function from scratch, just reading the datasheet. ]] > ??? > - vortex_open() > > skb = dev_alloc_skb(PKT_BUF_SZ); > ... > skb_reserve(skb, 2); /* Align IP on 16 byte boundaries */ > > Shouldn't that be PKT_BUF_SZ + 2? No. PKT_BUF_SZ is 1536, not 1514/1518. 
It ends on a cache line boundary. Drivers that handle "porky packets" (and the recent 3Com cards have the hardware support to do this) use vp->rx_buf_sz instead of a constant. See the Hamachi driver for an example. > Essential > ========= > > - Are there MOD_INC/DEC_USE_COUNT races? What are the rules here? The locking for this is handled outside the drivers, as it should be. > - vortex_start_xmit() is racy. It is not protected from h/w > interrupts. It needs a spin_lock_irqsave(priv->lock) and > vortex_interrupt() needs a spin_lock(priv->lock); The vortex_start_xmit() is for the 3c590 series only, not the later hardware. The hardware only needs to be protected against simultaneous transmit attempts, not a receive that happens in the middle of transferring a packet to the card's (large!) FIFO. Putting a spin lock there will waste a *lot* of spin time, since the old 3c590 isn't especially fast, especially if we are using PIO mode. > * vortex_start_xmit() calls netif_stop_queue() and then under some > circumstances (non-DMA o/p and there is room in the Tx buffer) it > calls netif_wake_queue(). Seems OK, as long as it's done under the > spin_lock_irqsave(). These macros are only OK if they are trivial locking functions. When they are more complex functions every card without a large Tx queue will suffer badly. > - vortex_tx_timeout() doesn't always call netif_wake_queue. Is this > OK? It depends on the Tx scheme in use. > - vortex_ioctl(): > > switch(cmd) { > case SIOCDEVPRIVATE: /* Get the address of the PHY in use. */ > data[0] = phy; > case SIOCDEVPRIVATE+1: /* Read the specified MII register. */ > EL3WINDOW(4); > data[3] = mdio_read(ioaddr, data[0] & 0x1f, data[1] & 0x1f); > > Do we mean to fall through the first case? This is either very > devious or plain wrong. Yes. All MII drivers use this code (and some without MII even fake the registers). 
You should understand how this driver relates to other drivers, the diagnostics code, and the programs that read MII registers. > - [OT] waah! skeleton.c uses global cli()/sti()! I've had pci-skeleton.c for a long time, but it apparently wasn't worthy. It is now incorrect for 2.3.*. At least parts of the ancient skeleton.c have been updated with the new interface. > - [OT] I had a peek at eepro100.c. The use of wait_for_cmd_done() is > racy. It's not quite as it appears. Most run-time commands are instantly accepted or slow the next PCI response. The only commands that take visible time are the initialization. Remember that locks are not free. They cost a lot of time, and measuring that time usually changes the timing to understate the cost. Donald Becker Scyld Computing Corporation, becker@scyld.com From owner-netdev@oss.sgi.com Sat Mar 25 12:29:50 2000 Received: by oss.sgi.com id ; Sat, 25 Mar 2000 12:29:41 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:33801 "EHLO grok.myip.org") by oss.sgi.com with ESMTP id ; Sat, 25 Mar 2000 12:29:23 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.myip.org (8.9.3/8.9.3) with ESMTP id LAA21074; Sat, 25 Mar 2000 11:58:10 -0700 Message-ID: <38DD0C42.5C58A3EA@candelatech.com> Date: Sat, 25 Mar 2000 11:58:10 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.12-20 i586) X-Accept-Language: en MIME-Version: 1.0 To: Andrew Morton CC: netdev Subject: Re: 3c59x.c References: <38DCC94C.9B82856B@uow.edu.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Andrew Morton wrote: > > Would anyone object if I did some work on this driver? (2.3.99) > > Let me assume the answer is "no" and plunge on... Sounds good to me. 
I'd like to see support for the 4 extra bytes that 802.1Q vlan needs, as well as the ability to change the MAC on the card, if possible. (If these are already available, then please ignore me..I don't actually have one of these cards.) Thanks, Ben -- Ben Greear (greearb@candelatech.com) http://scry.wanfear.com/~greear Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com From owner-netdev@oss.sgi.com Sat Mar 25 16:15:01 2000 Received: by oss.sgi.com id ; Sat, 25 Mar 2000 16:14:52 -0800 Received: from linuxcare.canberra.net.au ([203.29.91.49]:7436 "EHLO front.linuxcare.com.au") by oss.sgi.com with ESMTP id ; Sat, 25 Mar 2000 16:14:32 -0800 Received: (from sfr@localhost) by front.linuxcare.com.au (8.9.3/8.9.3/Debian 8.9.3-21) id KAA12821; Sun, 26 Mar 2000 10:14:15 +1000 Date: Sun, 26 Mar 2000 10:14:15 +1000 From: Stephen Rothwell Message-Id: <200003260014.KAA12821@front.linuxcare.com.au> To: ak@muc.de, sfr@linuxcare.com Subject: Re: TCP hang with 2.2.14 <-> 2.2.15pre5 Cc: netdev@oss.sgi.com, rusty@linuxcare.com, tridge@linuxcare.com Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hi Andi, From: Andi Kleen > > On Fri, Mar 24, 2000 at 08:12:08AM +0100, Stephen Rothwell wrote: > > 16:37:07.987900 owl.rsync > elm.1216: . ack 31383222 win 31856 (DF) > > 16:37:07.987946 elm.1216 > owl.rsync: R 2680405324:2680405324(0) win 0 > > 16:37:07.989904 elm.1216 > owl.rsync: R 31436798:31436798(0) ack 79735 win 31856 (DF) > > The second reset is not generated by the Linux TCP/IP stack (we never send > RSTs with options and windows). It looks like a normal ACK with a few bits > flipped (?). > > The first one is bogus too because it has wrong sequence numbers, but at least > the rest of the header looks normal. > > I would check the ethernet driver/card. I was not concerned about the resets. I was concerned about the series of retransmits near the start of the dumps. 
Basically a packet from elm gets dropped and the correct ack comes back (for the packet just before the dropped one) and then elm retransmits the dropped packet. Owl then acks the last but one packet that elm thinks it has sent (presumably another one has been dropped). Owl sends some data to elm until the window fills and then no more progress is made from either end. elm is sending acks with window 0 but owl persists in trying to send another segment. My understanding of TCP is not wonderful, but this doesn't seem correct. Just to reiterate, owl is running 2.2.14 (with SACK disabled) and elm is running 2.2.15pre5. I would be happy to be told that this is a bug in 2.2.14 that is already fixed. Cheers, Stephen Rothwell sfr@linuxcare.com From owner-netdev@oss.sgi.com Sun Mar 26 01:18:12 2000 Received: by oss.sgi.com id ; Sun, 26 Mar 2000 01:17:53 -0800 Received: from pizda.ninka.net ([216.101.162.242]:51328 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sun, 26 Mar 2000 01:17:34 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id BAA02517; Sun, 26 Mar 2000 01:12:04 -0800 Date: Sun, 26 Mar 2000 01:12:04 -0800 Message-Id: <200003260912.BAA02517@pizda.ninka.net> X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f From: "David S. Miller" To: rusty@linuxcare.com.au CC: netdev@oss.sgi.com In-reply-to: (message from Rusty Russell on Wed, 05 Apr 2000 20:14:09 +1000) Subject: Re: [PATCH] What-The-Fuck-Is-That-Monstrosity Bear: Netfilter merge patch III References: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing From: Rusty Russell Date: Wed, 05 Apr 2000 20:14:09 +1000 Merge teething problems, mainly. This patch does: ... Patch applied, thanks. I'll send this off to Linus during my next merge. Later, David S.
Miller davem@redhat.com From owner-netdev@oss.sgi.com Sun Mar 26 03:27:33 2000 Received: by oss.sgi.com id ; Sun, 26 Mar 2000 03:27:14 -0800 Received: from dow.sw.com.sg ([203.120.9.222]:14856 "EHLO dow.sw.com.sg") by oss.sgi.com with ESMTP id ; Sun, 26 Mar 2000 03:26:53 -0800 Received: from kuznet by dow.sw.com.sg with local (Exim 3.13 #7) id 12ZBGU-0000ST-00; Sun, 26 Mar 2000 19:31:30 +0800 Subject: Re: TCP hang with 2.2.14 <-> 2.2.15pre5 To: sfr@linuxcare.COM.AU (Stephen Rothwell) Date: Sun, 26 Mar 2000 19:31:30 +0800 (SGT) Cc: netdev@oss.sgi.com In-Reply-To: <200003260014.KAA12821@front.linuxcare.com.au> from "Stephen Rothwell" at Mar 26, 2000 05:13:05 AM From: kuznet@ms2.inr.ac.ru (Alexey Kuznetosv) X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > elm is sending acks with window 0 but owl persists in trying to send > another segment. > > My understanding of TCP is not wonderful, but this doesn't seem correct. Yes, it really tries to send an out-of-window segment. This is not good, and it is not clear why it occurred; the dumps are too short. Could you prepare dumps with sequence numbers covering stalled seqno - 64K? > I would be happy to be told that this is a bug in 2.2.14 that is already > fixed. I do not remember. Let's search. But! Despite all this, rsync behaves strangely. The window became zero not because of a bug in TCP, but because of a bug in rsync. If it forgets to read data, it is doomed to deadlock sooner or later.
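A toy model of the receiver side, assuming (as a simplification, not as a description of the kernel's window calculation) that the advertised window is simply the free space left in a fixed receive buffer. The names and the buffer size are invented; 1448 is just the segment size seen in the dumps. It shows why a reader that stops reading ends up advertising window 0 and stays there.

```c
#include <assert.h>
#include <stdint.h>

/* Toy receiver: sizes are illustrative, not taken from the kernel. */
struct rx_model {
    uint32_t buf_size;  /* total receive buffer, in bytes */
    uint32_t queued;    /* bytes received but not yet read by the app */
};

/* Window offered to the peer: whatever space is still free. */
uint32_t advertised_window(const struct rx_model *rx)
{
    assert(rx->queued <= rx->buf_size);
    return rx->buf_size - rx->queued;
}

/* A segment arrives; only what fits is accepted.  Returns bytes accepted. */
uint32_t rx_segment(struct rx_model *rx, uint32_t len)
{
    uint32_t room = advertised_window(rx);
    uint32_t take = len < room ? len : room;

    rx->queued += take;
    return take;
}

/* The application reads; in this model nothing else reopens the window. */
uint32_t app_read(struct rx_model *rx, uint32_t len)
{
    uint32_t take = len < rx->queued ? len : rx->queued;

    rx->queued -= take;
    return take;
}
```

In this model, once one 1448-byte segment fills a 1448-byte buffer, further segments are refused until app_read() is called. If the application is itself blocked waiting to write to the same peer, neither side moves.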
Alexey
From owner-netdev@oss.sgi.com Sun Mar 26 03:33:54 2000 Received: by oss.sgi.com id ; Sun, 26 Mar 2000 03:33:35 -0800 Received: from pizda.ninka.net ([216.101.162.242]:4736 "EHLO pizda.ninka.net") by oss.sgi.com with ESMTP id ; Sun, 26 Mar 2000 03:33:22 -0800 Received: (from davem@localhost) by pizda.ninka.net (8.9.3/8.9.3) id DAA01525; Sun, 26 Mar 2000 03:27:41 -0800 Date: Sun, 26 Mar 2000 03:27:41 -0800 Message-Id: <200003261127.DAA01525@pizda.ninka.net> X-Authentication-Warning: pizda.ninka.net: davem set sender to davem@redhat.com using -f From: "David S. Miller" To: kuznet@ms2.inr.ac.ru CC: sfr@linuxcare.COM.AU, netdev@oss.sgi.com In-reply-to: (kuznet@ms2.inr.ac.ru) Subject: Re: TCP hang with 2.2.14 <-> 2.2.15pre5 References: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

   Date: Sun, 26 Mar 2000 19:31:30 +0800 (SGT)
   From: kuznet@ms2.inr.ac.ru (Alexey Kuznetosv)

   > I would be happy to be told that this is a bug in 2.2.14 that is
   > already fixed.

   I do not remember. Let's search. But! Despite of all this, rsync behaves strangely. Window became zero not because of bug in tcp, but because of bug in rsync. If it forgets to read data, it is deemed to dead-lock soon or later.

It is bug both in rsync and kernel :-) Kernel fix is in 2.2.15-pre* for some time now, look at this change to tcp_timer.c:tcp_write_err()

@@ -131,6 +131,7 @@
 	} else {
 		/* Clean up time. */
 		tcp_set_state(sk, TCP_CLOSE);
+		sk->shutdown |= SHUTDOWN_MASK;
 		return 0;
 	}
 	return 1;

Later,
David S.
Miller davem@redhat.com From owner-netdev@oss.sgi.com Sun Mar 26 06:01:54 2000 Received: by oss.sgi.com id ; Sun, 26 Mar 2000 06:01:45 -0800 Received: from dow.sw.com.sg ([203.120.9.222]:22024 "EHLO dow.sw.com.sg") by oss.sgi.com with ESMTP id ; Sun, 26 Mar 2000 06:01:26 -0800 Received: from kuznet by dow.sw.com.sg with local (Exim 3.13 #7) id 12ZDft-0000p1-00; Sun, 26 Mar 2000 22:05:53 +0800 Subject: Re: TCP hang with 2.2.14 <-> 2.2.15pre5 To: davem@redhat.com (David S. Miller) Date: Sun, 26 Mar 2000 22:05:53 +0800 (SGT) Cc: kuznet@ms2.inr.ac.ru, sfr@linuxcare.COM.AU, netdev@oss.sgi.com In-Reply-To: <200003261127.DAA01525@pizda.ninka.net> from "David S. Miller" at Mar 26, 2000 03:27:41 AM From: kuznet@ms2.inr.ac.ru (Alexey Kuznetosv) X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > Kernel fix is in 2.2.15-pre* for some time now, look at this > change to tcp_timer.c:tcp_write_err() But I do not see how it could explain sending out-of-window segment. Look, instead of zero window probe it retransmits next segment. It is interesting that it is even correct behaviour, the result is the same as with sending zero window probe 8)8) But it was not expected yet! Actually, it looks like receiver shrunk window (there was not enough information in tcpdump to check this), or like sender miscalculated it. Well, both sides are linuxes, so that all the bugs are ours.
Alexey From owner-netdev@oss.sgi.com Sun Mar 26 06:56:45 2000 Received: by oss.sgi.com id ; Sun, 26 Mar 2000 06:56:26 -0800 Received: from smtprtp1.ntcom.nortel.net ([137.118.22.14]:62659 "EHLO smtprtp1.ntcom.nortel.net") by oss.sgi.com with ESMTP id ; Sun, 26 Mar 2000 06:55:57 -0800 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by smtprtp1.ntcom.nortel.net; Sun, 26 Mar 2000 09:55:26 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id H3S3WWBF; Sun, 26 Mar 2000 22:55:18 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id HNN1JFJD; Mon, 27 Mar 2000 00:55:21 +1000 Received: from uow.edu.au (IDENT:andrewm@[47.181.207.103]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id AAA15253; Mon, 27 Mar 2000 00:55:11 +1000 Message-ID: <38DE25E0.EAF22B3@uow.edu.au> Date: Sun, 26 Mar 2000 14:59:44 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.2.13-7mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: Donald Becker CC: netdev Subject: Re: 3c59x.c References: <38DCC94C.9B82856B@uow.edu.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Donald, thanks for the reply - I hoped you would be listening. I'd like to work this through a little more then summarise the outcome as a plan of what (if anything) needs to be done. Once the intended changes are well reviewed someone (doesn't really matter who) can cut a patch. Sound OK? Donald Becker wrote: > > On Sat, 25 Mar 2000, Andrew Morton wrote: > > > To: netdev > > Subject: 3c59x.c > > > > Would anyone object if I did some work on this driver? 
(2.3.99) > > You should bring this up on the linux-vortex mailing list. > If you do not read the linux-vortex mailing list, you should not even think > about modifying the driver until you understand more about it. Thanks - I'll check it out. > > - Debug code is _always_ compiled in, and debug level defaults to 1. > > This is OK, it would be better to make it easy for 'debug' to be > > literal "0". This reduces the driver size by 1.5k. > > This is highly useful information when tracking down problems. > Most problems are not in the driver structure, but problems with the system > (IRQs, PCI bus, and the like) or the media selection. I wasn't proposing that any diagnostic tools be removed. I was noting that under some circumstances, dead code elimination could be used to remove code which is never executed. In particular, when the driver is linked into the kernel there appears to be no way to alter the value of 'debug'. May as well make it a literal constant and let the compiler remove the dead code completely. Minor issue. > ... > > > - why do we independently kmalloc dev->priv? Why not let > > init_etherdev() do it? > > Because of alignment requirements. Some descriptors must be longword > aligned, and they perform much better when cache line aligned. You must > understand caching systems before touching this. It's OK to take slight > performance hits on rarely used systems like the Sparc, but you should > throughly understand what the PCI bus implementation on the adapter is doing. > init_etherdev() aligns dev->priv on a 32 byte boundary (this was recently fixed up). Is this not suitable? > ... > > - Are there MOD_INC/DEC_USE_COUNT races? What are the rules here? > > The locking for this is handled outside the drivers, as it should be. Sorry, this was a comment to myself which snuck through. I am told that there are insmod/rmmod races in various drivers but I've never seen a description of the sorts of things to look out for. > > - vortex_start_xmit() is racy. 
It is not protected from h/w > > interrupts. It needs a spin_lock_irqsave(priv->lock) and > > vortex_interrupt() needs a spin_lock(priv->lock); > > The vortex_start_xmit() is for the 3c590 series only, not the later hardware. > The hardware only needs to be protected against simultaneous transmit > attempts, not a receive that happens in the middle of transferring a packet > to the card's (large!) FIFO. Are you sure about this? That a Tx complete interrupt (or a vortex_interrupt() call from vortex_tx_timeout()) cannot occur when vortex_start_xmit() is executing? > Putting a spin lock there will waste a *lot* of spin time, since the old > 3c590 isn't especially fast, especially if we are using PIO mode. I assume the wastage you are talking about it stalling the Rx interrupt while the PIO Tx is in progress? > > * vortex_start_xmit() calls netif_stop_queue() and then under some > > circumstances (non-DMA o/p and there is room in the Tx buffer) it > > calls netif_wake_queue(). Seems OK, as long as it's done under the > > spin_lock_irqsave(). > > These macros are only OK if they are trivial locking functions. When they > are more complex functions every card without a large Tx queue will suffer > badly. They're quite lightweight. Probably < 100 insns for the worst case. > > - [OT] waah! skeleton.c uses global cli()/sti()! > > I've had pci-skeleton.c for a long time, but it apparently wasn't worthy. > It is now incorrect for 2.3.*. That went over my head... I find skeleton.c very useful. I'd be interested in pci-skeleton.c. Is it available somewhere? > At least parts of the ancient skeleton.c have been updated with the new > interface. Parts, yes. It's missing the align-IP-header-on-longword fiddle as well as using the global cli(). There could be other things. I'll have a closer look. You didn't address some of my observations - I think they were pretty straightforward things like checking returns from resource allocation attempts. 
I'll included them when I revisit this later in the week. -- -akpm- From owner-netdev@oss.sgi.com Sun Mar 26 11:21:57 2000 Received: by oss.sgi.com id ; Sun, 26 Mar 2000 11:21:47 -0800 Received: from adsl-151-196-249-3.bellatlantic.net ([151.196.249.3]:58353 "EHLO vaio.greennet") by oss.sgi.com with ESMTP id ; Sun, 26 Mar 2000 11:21:28 -0800 Received: from localhost (becker@localhost) by vaio.greennet (8.9.3/8.8.7) with ESMTP id OAA25888; Sun, 26 Mar 2000 14:20:02 -0500 Date: Sun, 26 Mar 2000 14:20:02 -0500 (EST) From: Donald Becker X-Sender: becker@vaio.greennet To: Andrew Morton cc: netdev Subject: Re: 3c59x.c In-Reply-To: <38DE25E0.EAF22B3@uow.edu.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Sun, 26 Mar 2000, Andrew Morton wrote: > I'd like to work this through a little more then summarise the outcome > as a plan of what (if anything) needs to be done. Once the intended > changes are well reviewed someone (doesn't really matter who) can cut a > patch. Sound OK? You should understand my stance on this: I wrote and maintained many drivers for seven years. In the past seven months significant, poorly considered changes were made to the driver interface. For most of those changes I wasn't consulted, or even given warning about them. The only notice was mail to the linux-kernel (not even to linux-net) list that the interface had already changed. Some of the interface changes were obviously first passes, and had never really been tested. They were quickly fixed, but it was clear that they were not thought through before being implemented. Interface changes are major undertakings, and they should have been treated as such. Instead the attitude was "Here are changes. Fix your drivers to work with them." > > You should bring this up on the linux-vortex mailing list. 
> > If you do not read the linux-vortex mailing list, you should not even think > > about modifying the driver until you understand more about it. > > Thanks - I'll check it out. It should have been the first place to check. > > > - why do we independently kmalloc dev->priv? Why not let > > > init_etherdev() do it? > > > > Because of alignment requirements. Some descriptors must be longword .. > init_etherdev() aligns dev->priv on a 32 byte boundary (this was > recently fixed up). Is this not suitable? The proper statement was "this was recently fixed up, after yet another person broke it". Counting on init_etherdev() to provide a correctly aligned, properly allocated structure will continue to be a bad assumption. > > > - Are there MOD_INC/DEC_USE_COUNT races? What are the rules here? > > > > The locking for this is handled outside the drivers, as it should be. > > Sorry, this was a comment to myself which snuck through. I am told that > there are insmod/rmmod races in various drivers but I've never seen a > description of the sorts of things to look out for. Yes, there are races in some drivers. But I don't believe there are any here. The driver interface should be designed so that locking is unnecessary in most common cases. Driver writer can always introduce their own races, of course. > > > - vortex_start_xmit() is racy. It is not protected from h/w > > > interrupts. It needs a spin_lock_irqsave(priv->lock) and > > > vortex_interrupt() needs a spin_lock(priv->lock); > > > > The vortex_start_xmit() is for the 3c590 series only, not the later hardware. > > The hardware only needs to be protected against simultaneous transmit > > attempts, not a receive that happens in the middle of transferring a packet > > to the card's (large!) FIFO. > > Are you sure about this? That a Tx complete interrupt (or a > vortex_interrupt() call from vortex_tx_timeout()) cannot occur when > vortex_start_xmit() is executing? 
The Tx-packet-load register activity is independent of the interrupt handling. Although that should be verified with the current SMP setup. You should be able to ignore the timeout. The semantics of watchdog timeouts are that they are never called during a transmit attempt. Hmmm, actually, that *should* be reviewed. I can't find anyplace that the new semantics in pre-2.4 have been defined, so there may well be a new race condition here. If there isn't one currently, it could easily grow one. It would bad to add the overhead of a spin lock during normal operation just to handle this rare error case. > > Putting a spin lock there will waste a *lot* of spin time, since the old > > 3c590 isn't especially fast, especially if we are using PIO mode. > > I assume the wastage you are talking about it stalling the Rx interrupt > while the PIO Tx is in progress? .. > > These macros are only OK if they are trivial locking functions. When they > > are more complex functions every card without a large Tx queue will suffer > > badly. > > They're quite lightweight. Probably < 100 insns for the worst case. Acckkk!! 100 instructions is *not* light weight! This is the normal run-time situation, not an error recovery path. > > I've had pci-skeleton.c for a long time, but it apparently wasn't worthy. > > It is now incorrect for 2.3.*. > > That went over my head... > > I find skeleton.c very useful. I'd be interested in pci-skeleton.c. Is > it available somewhere? The version in http://cesdis.gsfc.nasa.gov/linux/drivers/kern-2.3/index.html ftp://cesdis.gsfc.nasa.gov/pub/linux/drivers/kern-2.3/pci-skeleton.c is the latest of mine, but keep in mind that it uses my PCI scan structure and backwards compatibility code that will not be accepted into the kernel. > > At least parts of the ancient skeleton.c have been updated with the new > > interface. > > Parts, yes. It's missing the align-IP-header-on-longword fiddle as well > as using the global cli(). There could be other things. 
I'll have a > closer look. It was designed as an ISA skeleton. Some patches did not update it when people made interface changes, so it doesn't represent an accurate example. The pci-skeleton.c code does compile, and my drivers attempt to closely match its structure and semantics. Well, moderated by the desire to minimize the code changes from previous versions. > You didn't address some of my observations - I think they were pretty > straightforward things like checking returns from resource allocation > attempts. I'll included them when I revisit this later in the week. Previously GFP_KERNEL meant "wait until you have the memory", so you didn't have to check all allocations. I believe that has changed in pre-2.4 (and other GPLed OSes that use the drivers, such as L4 and OStk, don't make that assertion) so checking the allocations is now needed. Recovering from a failed allocation *will* add noticeable code to attach() and open(). The descriptor-based drivers should all already handle failed skbuff allocations, although they have never been tested when you cannot allocate any skbuffs at all during initialization.
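A minimal user-space sketch of the "check every allocation and unwind" pattern discussed above, offered only as an illustration. It is not code from 3c59x.c or pci-skeleton.c: `RX_RING_SIZE`, `struct rx_ring`, `rx_ring_fill()`, `rx_ring_free()` and `limited_alloc()` are all hypothetical names, and plain `malloc()` stands in for the kernel's skbuff allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RX_RING_SIZE 4          /* hypothetical ring size, not taken from any driver */

/* Allocator hook, so that a caller can simulate memory exhaustion. */
typedef void *(*alloc_fn)(size_t);

struct rx_ring {
    void *buf[RX_RING_SIZE];    /* stand-ins for receive buffers */
};

/*
 * Fill every slot of the ring. If any allocation fails, release the
 * buffers obtained so far and return -1, leaving all slots NULL.
 * Returns 0 when the whole ring was populated.
 */
static int rx_ring_fill(struct rx_ring *ring, alloc_fn alloc, size_t buf_len)
{
    int i;

    assert(ring != NULL && alloc != NULL);

    for (i = 0; i < RX_RING_SIZE; i++)
        ring->buf[i] = NULL;

    for (i = 0; i < RX_RING_SIZE; i++) {
        ring->buf[i] = alloc(buf_len);
        if (ring->buf[i] == NULL) {
            /* Unwind: free slots i-1 .. 0 that were already filled. */
            while (--i >= 0) {
                free(ring->buf[i]);
                ring->buf[i] = NULL;
            }
            return -1;
        }
    }
    return 0;
}

/* Release every buffer in the ring. */
static void rx_ring_free(struct rx_ring *ring)
{
    int i;

    for (i = 0; i < RX_RING_SIZE; i++) {
        free(ring->buf[i]);
        ring->buf[i] = NULL;
    }
}

/* Test allocator: succeeds only while alloc_budget is positive. */
static int alloc_budget = RX_RING_SIZE;

static void *limited_alloc(size_t n)
{
    if (alloc_budget <= 0)
        return NULL;
    alloc_budget--;
    return malloc(n);
}
```

Setting `alloc_budget` below `RX_RING_SIZE` before calling `rx_ring_fill(&ring, limited_alloc, 1536)` exercises the failure path, which is the case described above as never having been tested.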
Donald Becker Scyld Computing Corporation, becker@scyld.com From owner-netdev@oss.sgi.com Mon Mar 27 03:21:11 2000 Received: by oss.sgi.com id ; Mon, 27 Mar 2000 03:20:52 -0800 Received: from dow.sw.com.sg ([203.120.9.222]:26897 "EHLO dow.sw.com.sg") by oss.sgi.com with ESMTP id ; Mon, 27 Mar 2000 03:20:32 -0800 Received: from kuznet by dow.sw.com.sg with local (Exim 3.13 #7) id 12ZXe0-0004ZI-00; Mon, 27 Mar 2000 19:25:16 +0800 Subject: Re: 3c59x.c To: becker@scyld.COM (Donald Becker) Date: Mon, 27 Mar 2000 19:25:16 +0800 (SGT) Cc: netdev@oss.sgi.com In-Reply-To: from "Donald Becker" at Mar 27, 2000 12:13:19 AM From: kuznet@ms2.inr.ac.ru (Alexey Kuznetosv) X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > for seven years. In the past seven months significant, poorly considered > changes were made to the driver interface. For most of those changes I > wasn't consulted, or even given warning about them. Donald, I apologize, but I daresay you seems to have short memory. The changes in interface driver <-> linux/net were proposed by you on September 20 1998. If you forgot this I can bounce your draft back to you. It is worth to note that massive changes in the names of routines (sort of hard_start_xmit() -> queue_tx()) proposed by you are not made. tbusy is renamed not to tx_full, but encapsulated to macros instead. But if I rememeber correctly, even name netif_wake_queue() was invented by you. > Some of the interface changes were obviously first passes, and had never > really been tested. They were quickly fixed, ? They are finished on September 1999 and no bits were changed since that time. > Yes, there are races in some drivers. But I don't believe there are any > here. > > The driver interface should be designed so that locking is unnecessary in > most common cases. 
Top level provides proper serialization to allow unload/load modules any moment. Actually, MOD_* reference count is not a true reference count (it is broken in this sense, by the way), but rather plain hint for automatic module unloader. > You should be able to ignore the timeout. The semantics of watchdog > timeouts are that they are never called during a transmit attempt. Right. > Hmmm, actually, that *should* be reviewed. I can't find anyplace that the > new semantics in pre-2.4 have been defined, so there may well be a new race > condition here. I said you several times that no semantic changes to the interface driver<->core may be made without your advice. It is your right not to believe to this warranty, certainly. tx_timeout routine was invented by you and does the thing, which you expected, if I understood your proposal correctly. Actually, it would be not so bad, if _you_ prepared this patch, then you would not have such terrible suspections, I think. I am sorry, I honestly waited for it for year and we had no more time to wait. 
Alexey From owner-netdev@oss.sgi.com Mon Mar 27 06:46:22 2000 Received: by oss.sgi.com id ; Mon, 27 Mar 2000 06:46:12 -0800 Received: from smtprtp1.ntcom.nortel.net ([137.118.22.14]:64209 "EHLO smtprtp1.ntcom.nortel.net") by oss.sgi.com with ESMTP id ; Mon, 27 Mar 2000 06:45:54 -0800 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by smtprtp1.ntcom.nortel.net; Mon, 27 Mar 2000 09:45:04 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id H3S3XFTY; Mon, 27 Mar 2000 22:44:56 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id HNN1JHC4; Tue, 28 Mar 2000 00:44:59 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.207.103]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id AAA25178; Tue, 28 Mar 2000 00:44:46 +1000 Message-ID: <38DF7503.D498C734@uow.edu.au> Date: Mon, 27 Mar 2000 14:49:39 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.2.13-7mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: Alexey Kuznetosv CC: Donald Becker , netdev@oss.sgi.com Subject: Re: 3c59x.c References: from "Donald Becker" at Mar 27, 2000 12:13:19 AM Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Alexey Kuznetosv wrote: > > Donald, I apologize, but I daresay you seems to have short memory. uh-oh. Guys, we all know there's a lot of history. Could we please just concentrate on the future? There are drivers to be got going, no? > ... > Actually, it would be not so bad, if _you_ prepared this patch, If Don has the bandwidth for this it would be very good. 
One of the great plusses of his influence is that all the drivers have basically the same structure. If he could use his experience and knowledge to prepare a reference driver for the 2.4 framework, others could follow that. This is something I have been struggling with recently - many of the drivers are subtly different wrt their interworking with the higher layer. Of course, copying another driver is one way of getting the job done. The other is to actually understand what is going on. AFAIK there is no description of the softnet<->driver interface which allows driver writers to gain this understanding. A simple functional API description doesn't cut it - we need to know what the dynamic relationships are, what serialisation guarantees the higher layer makes, etc. Simply looking at the code isn't a good solution here because: 1: All driver maintainers need to do it and they may make mistakes and 2: Maintainers don't know what behaviour is a permanent part of the interface and what behaviour is incidental and may be taken away. Yes, I know of davem's email and Jamal's doc. They're not enough. The lack of this architectural description will adversely affect Linux's overall quality. Is doing so, in fact. 
WTFM :-) -- -akpm- From owner-netdev@oss.sgi.com Mon Mar 27 07:59:52 2000 Received: by oss.sgi.com id ; Mon, 27 Mar 2000 07:59:43 -0800 Received: from dow.sw.com.sg ([203.120.9.222]:59666 "EHLO dow.sw.com.sg") by oss.sgi.com with ESMTP id ; Mon, 27 Mar 2000 07:59:29 -0800 Received: from kuznet by dow.sw.com.sg with local (Exim 3.13 #7) id 12Zbxw-0005bX-00; Tue, 28 Mar 2000 00:02:08 +0800 Subject: Re: 3c59x.c To: andrewm@uow.edu.au (Andrew Morton) Date: Tue, 28 Mar 2000 00:02:08 +0800 (SGT) Cc: kuznet@ms2.inr.ac.ru (Alexey Kuznetosv), becker@scyld.COM (Donald Becker), netdev@oss.sgi.com In-Reply-To: <38DF7503.D498C734@uow.edu.au> from "Andrew Morton" at Mar 27, 2000 02:49:39 PM From: kuznet@ms2.inr.ac.ru (Alexey Kuznetosv) X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > Guys, we all know there's a lot of history. Could we please just > concentrate on the future? There are drivers to be got going, no? Right. Alas, 2.4 is rather present than future tense, so that let's concentrate on the present yet. 8) > basically the same structure. If he could use his experience and > knowledge to prepare a reference driver for the 2.4 framework, others > could follow that. 8) If you know the history, you know that it is exactly which I prayed to make. I am sorry but this proposal was refused by all the sides. it is question of the past though. 8) > The other is to actually understand what is going on. AFAIK there is no > description of the softnet<->driver interface which allows driver > writers to gain this understanding. A simple functional API description > doesn't cut it - we need to know what the dynamic relationships are, > what serialisation guarantees the higher layer makes, etc. Jamal's document covered all the _necessary_ topics with pretty deep explanations. 
If you have something to add to the list of "necessary" topics, please, add. > Yes, I know of davem's email and Jamal's doc. They're not enough. The > lack of this architectural description will adversely affect Linux's > overall quality. Is doing so, in fact. Ask some concrete questions better. All such documents are result of dialogue, rather than broadcast from a godlike being. No questions --- no answers. BTW why did you have no questions before softnet? 8) The situation was much worse that time and its explanations really required volumes of controversial texts. 8) Alexey From owner-netdev@oss.sgi.com Mon Mar 27 09:34:43 2000 Received: by oss.sgi.com id ; Mon, 27 Mar 2000 09:34:24 -0800 Received: from dow.sw.com.sg ([203.120.9.222]:30739 "EHLO dow.sw.com.sg") by oss.sgi.com with ESMTP id ; Mon, 27 Mar 2000 09:34:02 -0800 Received: from kuznet by dow.sw.com.sg with local (Exim 3.13 #7) id 12ZdSO-0005xT-00; Tue, 28 Mar 2000 01:37:40 +0800 Subject: Re: TCP hang with 2.2.14 <-> 2.2.15pre5 To: davem@redhat.com (David S. Miller) Date: Tue, 28 Mar 2000 01:37:40 +0800 (SGT) Cc: kuznet@ms2.inr.ac.ru, sfr@linuxcare.COM.AU, netdev@oss.sgi.com In-Reply-To: <200003261127.DAA01525@pizda.ninka.net> from "David S. Miller" at Mar 26, 2000 03:27:41 AM From: kuznet@ms2.inr.ac.ru (Alexey Kuznetsov) X-Mailer: ELM [version 2.5 PL1] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! Parsing dumps shows that 2.2 moves snd_nxt beyond right edge of window... :-( The result is that acks from it are never accepted by another side and it is the finish, certainly. I still do not know why it occurs. It smells like a race condition. It is natural to suspect that immediate reason is full-duplex io in rsync, 2.2 is pretty dirty with mutual locking of multiple processes accessing the socket... 
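The "snd_nxt beyond right edge of window" condition can be stated as a small check. The following is a user-space sketch only, assuming the usual sender variables snd_una, snd_nxt and snd_wnd; `struct snd_state` and the function names are hypothetical and this is not the 2.2 kernel code. Comparisons are done with signed 32-bit differences so that sequence-number wraparound is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the sender-side variables named in the thread. */
struct snd_state {
    uint32_t snd_una;   /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;   /* next sequence number to be sent */
    uint32_t snd_wnd;   /* window advertised by the peer */
};

/* Wraparound-safe test: is sequence number a strictly after b? */
static int seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Right edge of the send window; unsigned arithmetic wraps modulo 2^32. */
static uint32_t right_edge(const struct snd_state *s)
{
    return s->snd_una + s->snd_wnd;
}

/* Non-zero while snd_nxt has not moved past the right edge. */
static int snd_nxt_in_window(const struct snd_state *s)
{
    assert(s != NULL);
    return !seq_after(s->snd_nxt, right_edge(s));
}

/* Number of new bytes that may still be sent; 0 if snd_nxt is at or past the edge. */
static uint32_t usable_window(const struct snd_state *s)
{
    assert(s != NULL);
    if (seq_after(s->snd_nxt, right_edge(s)))
        return 0;
    return right_edge(s) - s->snd_nxt;
}
```

With snd_una = 1000 and snd_wnd = 500, a snd_nxt of 1500 sits exactly on the right edge (nothing more may be sent), while 1600 is past it, which is the state reported in the dumps.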
Stephen, please, try 2.3 as aggressively as you are able. If it also locks up, it is not a race and it is easier to investigate there. Alexey From owner-netdev@oss.sgi.com Mon Mar 27 17:24:52 2000 Received: by oss.sgi.com id ; Mon, 27 Mar 2000 17:24:32 -0800 Received: from cx97923-a.phnx3.az.home.com ([24.9.112.194]:60688 "EHLO grok.myip.org") by oss.sgi.com with ESMTP id ; Mon, 27 Mar 2000 17:24:02 -0800 Received: from candelatech.com (IDENT:greear@localhost [127.0.0.1]) by grok.myip.org (8.9.3/8.9.3) with ESMTP id SAA31119; Mon, 27 Mar 2000 18:50:55 -0700 Message-ID: <38E00FFF.FB8C1480@candelatech.com> Date: Mon, 27 Mar 2000 18:50:55 -0700 From: Ben Greear Organization: Candela Technologies X-Mailer: Mozilla 4.72 [en] (X11; U; Linux 2.2.12-20 i586) X-Accept-Language: en MIME-Version: 1.0 To: "netdev@oss.sgi.com" CC: Wallace Davis Subject: Hairy routing question. Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing As usual, I seem to be trying to do something that is inherently not how things work. But, if you have any ideas, I'd love to hear them :) Are there any linux-related mailing lists where this is a more appropriate question? Basically, I want a single PC to look like a bunch of PCs. So, I might have:

                          Ether
--------                    S        ---------
Client -eth0 172.20.20.3    W        | ServerPC
PC     -eth1 172.20.20.4    I   eth0-| 172.20.20.1
       -eth2 172.20.20.5    T        |_________
--------                    C
                            H

Now, I would like to be able to have eth0 have one IP address (no virtual interfaces, at least in one configuration), and be able to route packets over a specific eth interface on the Client PC. Assume that a plain old ethernet switch sits between them. So, can this be done with something like source-routing? The ServerPC can just send out its pkts on eth0, so it's pretty simple, but what about the Client PC?
Can I somehow tell the kernel that if the packet is from a certain IP, then it is to send it out a certain ethernet port? If that's possible, can I make sure that the ARP from ServerPC is answered correctly so that the pkt comes to the right ClientPC ethernet device (and right port on the switch)? Thanks, Ben -- Ben Greear (greearb@candelatech.com) http://scry.wanfear.com/~greear Author of ScryMUD: scry.wanfear.com 4444 (Released under GPL) http://scry.wanfear.com From owner-netdev@oss.sgi.com Mon Mar 27 20:01:04 2000 Received: by oss.sgi.com id ; Mon, 27 Mar 2000 20:00:54 -0800 Received: from wirespeed.solidum.com ([216.13.130.242]:30929 "EHLO solidum.com") by oss.sgi.com with ESMTP id ; Mon, 27 Mar 2000 20:00:31 -0800 Received: from phobos.solidum.com (mcr@phobos.solidum.com [192.168.1.13]) by solidum.com (8.8.7/8.8.7) with ESMTP id XAA08972 for ; Mon, 27 Mar 2000 23:00:29 -0500 Message-Id: <200003280400.XAA08972@solidum.com> To: "netdev@oss.sgi.com" Subject: Re: Hairy routing question. In-Reply-To: Your message of "Mon, 27 Mar 2000 18:50:55 MST." <38E00FFF.FB8C1480@candelatech.com> Mime-Version: 1.0 (generated by tm-edit 7.108) Content-Type: text/plain; charset=US-ASCII Date: Mon, 27 Mar 2000 23:00:28 -0500 From: Michael Richardson Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing >>>>> "Ben" == Ben Greear writes: Ben> As usual, I seem to be trying to do something that is inherently not Ben> how things work. But, if you have any ideas, I'd love to hear them Ben> :) Ben> Are there any linux-related mailing lists where this is a more Ben> appropriate question? Ben> Basically, I want a single PC to look like a bunch of PCs.
Ben> So, I might have: Ether -------- S --------- Client -eth0 172.20.20.3 W | ServerPC PC -eth1 172.20.20.4 I eth0-| 172.20.20.1 -eth2 172.20.20.5 T |_________ -------- C H Ben> Now, I would like to be able to have eth0 have one IP address (no Ben> virtual interfaces, at least in one configuration), and be able to Ben> route packets over a specific eth interface on the Client PC. Ben> Assume that a plain old ethernet switch sits between them. You are sort of doing trunking. Ben> So, can this be done with something like source-routing? The You don't need to do this, just bind the local port to the address on which you wish to send. But that may not work. Ben> If that's possible, can I make sure that the ARP fromm ServerPC is Ben> answered correctly so that the pkt comes to the right ClientPC Ben> ethernet device (and right port on the switch)? You can't due to the fact that Alexei doesn't believe that people put multiple network interfaces on the same physical wire. :!mcr!: | Solidum Systems Corporation, http://www.solidum.com Michael Richardson |For a better connected world,where data flows faster Personal: http://www.sandelman.ottawa.on.ca/People/Michael_Richardson/Bio.html mailto:mcr@sandelman.ottawa.on.ca mailto:mcr@solidum.com From owner-netdev@oss.sgi.com Tue Mar 28 06:57:57 2000 Received: by oss.sgi.com id ; Tue, 28 Mar 2000 06:57:38 -0800 Received: from smtprch1.nortelnetworks.com ([192.135.215.14]:4302 "EHLO smtprch1.nortel.com") by oss.sgi.com with ESMTP id ; Tue, 28 Mar 2000 06:57:19 -0800 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch1.nortel.com; Tue, 28 Mar 2000 08:57:19 -0600 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id HXWMN1ZG; Tue, 28 Mar 2000 08:56:53 -0600 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 
5.5.2650.21) id HNN1J2PX; Wed, 29 Mar 2000 00:56:58 +1000 Received: from uow.edu.au (IDENT:akpm@[47.181.207.103]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id AAA02569; Wed, 29 Mar 2000 00:56:47 +1000 Message-ID: <38E0C957.8BEE9D05@uow.edu.au> Date: Tue, 28 Mar 2000 15:01:43 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.2.13-7mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: Alexey Kuznetosv CC: netdev@oss.sgi.com Subject: Re: 3c59x.c References: <38DF7503.D498C734@uow.edu.au> from "Andrew Morton" at Mar 27, 2000 02:49:39 PM Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Alexey Kuznetosv wrote: > > Hello! Hi, Alexey. > ... > > The other is to actually understand what is going on. AFAIK there is no > > description of the softnet<->driver interface which allows driver > > writers to gain this understanding. A simple functional API description > > doesn't cut it - we need to know what the dynamic relationships are, > > what serialisation guarantees the higher layer makes, etc. > > Jamal's document covered all the _necessary_ topics with > pretty deep explanations. If you have something to add to the list > of "necessary" topics, please, add. Sorry, I guess I was venting a little frustration. As an experienced system programmer (20 years. gad.) I _expect_ to be able to pick up the architectural aspects quickly but after quite a few evenings work I keep on finding surprises in this stuff. Getting old, I guess. > > > Yes, I know of davem's email and Jamal's doc. They're not enough. The > > lack of this architectural description will adversely affect Linux's > > overall quality. Is doing so, in fact. > > Ask some concrete questions better. All such documents are result > of dialogue, rather than broadcast from a godlike being. > No questions --- no answers. mm.. 
The problem with this approach is that it doesn't scale. _I_ learn the answer, and then the next person comes along... Here are a few: - Why do some drivers statically allocate their struct net_device (eepro, cs89x0) whereas others call init_etherdev()? What's the right thing to do? - Why do some go (in probe1): if (dev == NULL) init_etherdev() whereas others do not test for null? Is the test necessary? - The manipulation of the pci_root_buses and pci_devices lists in pci.c has no SMP or IRQ protection. The net drivers call into pci.c to add/remove things from these lists but provide no race avoidance. Is there something higher up which guarantees that these list operations are serialised wrt some random soundcard driver or is this a bug? - As far as I can tell, many Ethernet chips separate the Tx and Rx functions sufficiently well for the driver author to be able to let the Tx and Rx threads operate independently. And the orginal architecture allowed this to occur. But the recommended practice of locking the whole driver within hard_start_xmit() will penalise the Rx threads. If this is correct, should we be using separate rx and tx device-private locks? - Should we be hardwiring ISAPNP databases into the drivers? Shouldn't these be on disk? (this question is rhetorical - I don't think anyone's interested in ISA any more). A lot of these questions would be perfectly answered if someone such as yourself were to give the skeleton driver an overhaul. Then if any drivers deviate from its recipe we need to understand why. > BTW why did you have no questions before softnet? 8) I've only been looking at this stuff for a few weeks. When Alan marked the cs89x0 driver as 'OBSOLETE' I had to get involved. That thing cost me nine bucks! BTW: a while back I noticed that the driver's probe1() is being called a huge number of times at bootup from net/dev.c. Is this known about? [ Shit. I haven't done any work on this today. 
I learnt quite a bit about IDE tho :-) ] -- -akpm- From owner-netdev@oss.sgi.com Tue Mar 28 09:35:48 2000 Received: by oss.sgi.com id ; Tue, 28 Mar 2000 09:35:38 -0800 Received: from adsl-151-196-242-9.bellatlantic.net ([151.196.242.9]:17652 "EHLO vaio.greennet") by oss.sgi.com with ESMTP id ; Tue, 28 Mar 2000 09:35:21 -0800 Received: from localhost (becker@localhost) by vaio.greennet (8.9.3/8.8.7) with ESMTP id MAA11569; Tue, 28 Mar 2000 12:32:01 -0500 Date: Tue, 28 Mar 2000 12:32:01 -0500 (EST) From: Donald Becker X-Sender: becker@vaio.greennet To: Andrew Morton cc: Alexey Kuznetosv , netdev@oss.sgi.com Subject: Re: 3c59x.c In-Reply-To: <38E0C957.8BEE9D05@uow.edu.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Tue, 28 Mar 2000, Andrew Morton wrote: I can answer the two questions that are historical. > Here are a few: > > - Why do some drivers statically allocate their struct net_device > (eepro, cs89x0) whereas others call init_etherdev()? What's the right > thing to do? Because they are semi-broken. Someone, I don't recall the details, converted drivers to modules. But they had the perspective that only one card of each type might exist. One of the few correct drivers in this respect is the znet.c driver, where only one device can exist. (It uses *two* DMA channels and only had a on-motherboard implementation.) Calling init_etherdev() to get a dynamically allocated device is the correct approach. Ideally the driver interface would have been defined to always use this mechanism, but that wouldn't work with compiled-in drivers that took LILO parameters. By the time LILO parameters became more flexible, the limited "drivers/net/Space.c" configuration was too well established in the documentation. > - Why do some go (in probe1): > > if (dev == NULL) > init_etherdev() > > whereas others do not test for null? Is the test necessary? No. 
The init_etherdev() call should work for both NULL and partially pre-allocated parameters. If NULL, it should return the "next" device (a struct net_device). It's not assured, but the intent is that it will return the first unused device from {"eth0" "eth1"...}. If preallocated, the structure is initialized for a generic Ethernet-like interface. Devices that approximate Ethernet (e.g. wireless cards, PLIP, USB links) should use this interface and tweak the parameters rather than writing their own init_*() routine. The second parameter is the size for allocating a dev->priv region. In general using this feature is depricated. One reason is the alignment and memory region attributes of the dev->priv region is historically unpredictable. You should pass '0' and allocate dev->priv yourself. > BTW: a while back I noticed that the driver's probe1() is being called a > huge number of times at bootup from net/dev.c. Is this known about? I think that this new behavior is bogus. Multiple calls should only occur for ISA drivers that have previously reported that they found a card. The PCI et al. driver scan routines should only be called once. Donald Becker Scyld Computing Corporation, becker@scyld.com From owner-netdev@oss.sgi.com Wed Mar 29 17:55:53 2000 Received: by oss.sgi.com id ; Wed, 29 Mar 2000 17:55:43 -0800 Received: from dhcp-192-52.ietf.connect.com.au ([169.208.192.52]:3844 "EHLO halfway.linuxcare.com.au") by oss.sgi.com with ESMTP id ; Wed, 29 Mar 2000 17:55:22 -0800 Received: from linuxcare.com.au (really [127.0.0.1]) by linuxcare.com.au via in.smtpd with esmtp id (Debian Smail3.2.0.102) for ; Thu, 30 Mar 2000 11:24:11 +0930 (CST) Message-Id: From: Rusty Russell To: Ganesh Sittampalam Cc: torvalds@transmeta.com, netdev@oss.sgi.com Subject: Re: 2.3.99-pre3 netfilter oops/crash In-reply-to: Your message of "Sun, 26 Mar 2000 22:24:38 +0100." 
Date: Thu, 30 Mar 2000 11:24:11 +0930
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

In message you write:
> Basically, if you repeatedly insert and remove the ip_conntrack module
> under high networking load, it provokes a kernel oops which sometimes
> escalates to an aiee killing the interrupt handler (and thus a complete
> system crash).

Linus, please apply.

Hi Ganesh,

Thanks! This is a problem with conntracks lying around after the
module had been removed. This fixes it (we don't care about module
removal performance).

--- linux-2.3.99-pre3/net/ipv4/netfilter/ip_conntrack_core.c.~1~	Sat Apr 8 17:59:21 2000
+++ linux-2.3.99-pre3/net/ipv4/netfilter/ip_conntrack_core.c	Thu Mar 30 10:42:41 2000
@@ -831,6 +831,15 @@
 	unregister_sysctl_table(ip_conntrack_sysctl_header);
 #endif
 	ip_ct_selective_cleanup(kill_all, NULL);
+
+	/* Now, no more packets coming in, but some connections may be
+	   still alive due to skbs on other CPUs, or queued (ip_queue,
+	   device queues, etc). */
+	while (atomic_read(&ip_conntrack_count) != 0) {
+		if (current->need_resched)
+			schedule();
+	}
+
 	kmem_cache_destroy(ip_conntrack_cachep);
 	vfree(ip_conntrack_hash);
 	nf_unregister_sockopt(&so_getorigdst);
--
Hacking time.

From owner-netdev@oss.sgi.com Wed Mar 29 17:56:03 2000
Received: by oss.sgi.com id ; Wed, 29 Mar 2000 17:55:53 -0800
Received: from dhcp-192-52.ietf.connect.com.au ([169.208.192.52]:4612 "EHLO halfway.linuxcare.com.au") by oss.sgi.com with ESMTP id ; Wed, 29 Mar 2000 17:55:45 -0800
Received: from linuxcare.com.au (really [127.0.0.1]) by linuxcare.com.au via in.smtpd with esmtp id (Debian Smail3.2.0.102) for ; Thu, 30 Mar 2000 11:24:06 +0930 (CST)
Message-Id:
From: Rusty Russell
To: Arjan van de Ven
Cc: torvalds@transmeta.com, netdev@oss.sgi.com
Subject: Re: Netfilter compilation patch
In-reply-to: Your message of "Mon, 27 Mar 2000 19:02:06 +0200."
Date: Thu, 30 Mar 2000 11:24:05 +0930
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

In message you write
> The ipfwadm_core.c file didn't compile if CONFIG_PROC_FS wasn't selected.

Linus, please apply.

Thanks Arjan. This is my preferred patch (rather than having those
functions sitting around even though they are never used).

--- net/ipv4/netfilter/ipfwadm_core.c.~1~	Sat Mar 18 05:26:20 2000
+++ net/ipv4/netfilter/ipfwadm_core.c	Thu Mar 30 10:49:35 2000
@@ -1370,12 +1370,14 @@
 	if (ret < 0)
 		goto cleanup_nothing;
 
+#ifdef CONFIG_PROC_FS
 #ifdef CONFIG_IP_ACCT
 	proc_net_create("ip_acct", S_IFREG | S_IRUGO | S_IWUSR, ip_acct_procinfo);
 #endif
 	proc_net_create("ip_input", S_IFREG | S_IRUGO | S_IWUSR, ip_fw_in_procinfo);
 	proc_net_create("ip_output", S_IFREG | S_IRUGO | S_IWUSR, ip_fw_out_procinfo);
 	proc_net_create("ip_forward", S_IFREG | S_IRUGO | S_IWUSR, ip_fw_fwd_procinfo);
+#endif /*CONFIG_PROC_FS*/
 
 	/* Register for device up/down reports */
 	register_netdevice_notifier(&ipfw_dev_notifier);
@@ -1391,12 +1393,14 @@
 #endif
 	unregister_netdevice_notifier(&ipfw_dev_notifier);
 
+#ifdef CONFIG_PROC_FS
 #ifdef CONFIG_IP_ACCT
 	proc_net_remove("ip_acct");
 #endif
 	proc_net_remove("ip_input");
 	proc_net_remove("ip_output");
 	proc_net_remove("ip_forward");
+#endif /*CONFIG_PROC_FS*/
 
 	free_fw_chain(chains[IP_FW_FWD]);
 	free_fw_chain(chains[IP_FW_IN]);
--
Hacking time.
From owner-netdev@oss.sgi.com Wed Mar 29 21:14:15 2000 Received: by oss.sgi.com id ; Wed, 29 Mar 2000 21:14:05 -0800 Received: from dhcp-192-52.ietf.connect.com.au ([169.208.192.52]:45828 "EHLO halfway.linuxcare.com.au") by oss.sgi.com with ESMTP id ; Wed, 29 Mar 2000 21:13:48 -0800 Received: from linuxcare.com.au (really [127.0.0.1]) by linuxcare.com.au via in.smtpd with esmtp id (Debian Smail3.2.0.102) for ; Thu, 30 Mar 2000 14:42:47 +0930 (CST) Message-Id: From: Rusty Russell To: torvalds@transmeta.com cc: netdev@oss.sgi.com Subject: [PATCH] Compile fixes for pre4-1 Date: Thu, 30 Mar 2000 14:42:41 +0930 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Linus, please apply. Need IPPROTO_ macros from in.h; something changed recently in a header, so now #include it directly. Thanks, Rusty. --- linux-2.3.99-pre3/net/ipv4/netfilter/ip_conntrack_proto_icmp.c.~1~ Sat Mar 18 05:26:20 2000 +++ linux-2.3.99-pre3/net/ipv4/netfilter/ip_conntrack_proto_icmp.c Thu Mar 30 14:06:40 2000 @@ -2,6 +2,7 @@ #include #include #include +#include #include #include --- linux-2.3.99-pre3/net/ipv4/netfilter/ip_conntrack_proto_tcp.c.~1~ Thu Mar 30 12:10:52 2000 +++ linux-2.3.99-pre3/net/ipv4/netfilter/ip_conntrack_proto_tcp.c Thu Mar 30 14:06:18 2000 @@ -4,6 +4,7 @@ #include #include #include +#include #include #include #include --- linux-2.3.99-pre3/net/ipv4/netfilter/ip_conntrack_proto_udp.c.~1~ Sat Mar 18 05:26:20 2000 +++ linux-2.3.99-pre3/net/ipv4/netfilter/ip_conntrack_proto_udp.c Thu Mar 30 14:06:35 2000 @@ -2,6 +2,7 @@ #include #include #include +#include #include #include --- linux-2.3.99-pre3/net/ipv4/netfilter/ip_fw_compat_masq.c.~1~ Sat Mar 18 05:26:20 2000 +++ linux-2.3.99-pre3/net/ipv4/netfilter/ip_fw_compat_masq.c Thu Mar 30 14:07:34 2000 @@ -5,6 +5,7 @@ DO IT. */ #include +#include #include #include #include -- Hacking time. 
From owner-netdev@oss.sgi.com Thu Mar 30 07:30:32 2000 Received: by oss.sgi.com id ; Thu, 30 Mar 2000 07:30:12 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:13574 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Thu, 30 Mar 2000 07:29:43 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id TAA11239; Thu, 30 Mar 2000 19:29:29 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200003301529.TAA11239@ms2.inr.ac.ru> Subject: Re: 3c59x.c To: andrewm@uow.edu.au (Andrew Morton) Date: Thu, 30 Mar 2000 19:29:29 +0400 (MSK DST) Cc: netdev@oss.sgi.com In-Reply-To: <38E0C957.8BEE9D05@uow.edu.au> from "Andrew Morton" at Mar 28, 0 03:01:43 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 4073 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > mm.. The problem with this approach is that it doesn't scale. _I_ > learn the answer, and then the next person comes along... I do not know better one, I am sorry. > - Why do some drivers statically allocate their struct net_device > (eepro, cs89x0) whereas others call init_etherdev()? What's the right > thing to do? By historical reasons. This mess was supposed to be rectified some time ago in 2.3 (when file drivers/net/setup.c was created), but the cleanup was not completed, unfortunately. I think right way is not to use static net_device structures at all. But it can (and do) result in problems with passing kernel boot options to driver, which are unsolved until now. > - Why do some go (in probe1): > > if (dev == NULL) > init_etherdev() > > whereas others do not test for null? Is the test necessary? It is consequence of the first. There were three major kinds of device bootstrap: 1. Devices which were in initial "boot" chain were allocated statically (or declared statically in module), and probe1() got a struct and allocated private data itself. (it was the case when check dev == NULL was made before calling init_etherdev()). 2. 
Device is not in initial "boot" chain, but its "eth*" label is found in initial chain, so that only private data are allocated. 3. Device is not in boot chain and allocated as whole with private data piggybacked to main struct. To be honest, I do not know what way is correct. This three-fold way was just shit, new one is apparently cleaner, but it is incomplete, at least passing options via kernel command line does not work now, which is bad. I am not an expert here, alas. > - The manipulation of the pci_root_buses and pci_devices lists in pci.c > has no SMP or IRQ protection. The net drivers call into pci.c to > add/remove things from these lists but provide no race avoidance. Is > there something higher up which guarantees that these list operations > are serialised wrt some random soundcard driver or is this a bug? I do not know, I am not an expert here too. Seems, all this code is supposed to be executed only under kernel lock and never executed from an interrupt context. Then it should be valid. > - As far as I can tell, many Ethernet chips separate the Tx and Rx > functions sufficiently well for the driver author to be able to let the > Tx and Rx threads operate independently. And the orginal architecture > allowed this to occur. It did not change. > But the recommended practice of locking the whole driver within > hard_start_xmit() will penalise the Rx threads. This practice was not recommended, certainly! If you mean "controller" lock, it is TX lock as rule. RX locking is made automatically by IRQ serialization logic, so that the question occurs only when driver needs to make something serialized with RX interrupts itself. No help from top level is possible here, it is an internal driver problem and driver have to create its private locks, if it is necessary, or, alternatively, to disable device irqs, while doing dangerous job. > If this is correct, should we be using separate rx and tx device-private > locks? If driver needs them. 
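The locking split described in the reply above can be illustrated with a small user-space model. This is only a sketch, not kernel code: it assumes, as the reply states, that the Rx interrupt path is already serialized by the IRQ logic and so takes no lock, while the transmit routine and the Tx-completion interrupt share state and therefore use a driver-private lock. Every name here (toy_lock, toy_nic, toy_xmit, toy_tx_complete_irq, toy_rx_irq) is hypothetical, and the spinlock built on C11 atomic_flag merely stands in for a kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy spinlock; a stand-in (assumption) for a kernel spinlock. */
struct toy_lock {
    atomic_flag held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;
}

static void toy_lock_release(struct toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Hypothetical device state: only the tx_* fields are shared between paths. */
struct toy_nic {
    struct toy_lock tx_lock;   /* driver-private, guards tx_queued and tx_done */
    unsigned int tx_queued;    /* packets handed to the hardware */
    unsigned int tx_done;      /* packets the hardware has finished sending */
    unsigned int rx_packets;   /* touched only by the Rx path, so unlocked */
};

static void toy_nic_init(struct toy_nic *nic)
{
    atomic_flag_clear(&nic->tx_lock.held);
    nic->tx_queued = 0;
    nic->tx_done = 0;
    nic->rx_packets = 0;
}

/* Transmit path: queues one packet under the private Tx lock. */
static void toy_xmit(struct toy_nic *nic)
{
    toy_lock_acquire(&nic->tx_lock);
    nic->tx_queued++;
    toy_lock_release(&nic->tx_lock);
}

/* Tx-completion interrupt: takes the same lock because it shares Tx state. */
static void toy_tx_complete_irq(struct toy_nic *nic)
{
    toy_lock_acquire(&nic->tx_lock);
    if (nic->tx_done < nic->tx_queued)
        nic->tx_done++;
    toy_lock_release(&nic->tx_lock);
}

/* Rx interrupt: assumed serialized by the IRQ logic, so no lock is taken. */
static void toy_rx_irq(struct toy_nic *nic)
{
    nic->rx_packets++;
}
```

In this model the Rx path never waits on Tx activity. If a driver's Rx path did share state with its Tx or control code, that shared state would need its own protection.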
> - Should we be hardwiring ISAPNP databases into the drivers? Shouldn't > these be on disk? (this question is rhetorical - I don't think anyone's > interested in ISA any more). I have no idea. I know about this even less than about PCI interface, i.e. less then nothing. > A lot of these questions would be perfectly answered if someone such as > yourself were to give the skeleton driver an overhaul. Then if any > drivers deviate from its recipe we need to understand why. I can provide only that part of the skeleton, which interfaces to linux/net/. PCI (or ISA PNP) interface is terra incognita for me. Such skeleton requires collective work. > BTW: a while back I noticed that the driver's probe1() is being called a > huge number of times at bootup from net/dev.c. Is this known about? No... Could you give more information? For what device does this occur? Alexey From owner-netdev@oss.sgi.com Thu Mar 30 09:51:14 2000 Received: by oss.sgi.com id ; Thu, 30 Mar 2000 09:50:55 -0800 Received: from adsl-151-196-244-176.bellatlantic.net ([151.196.244.176]:1784 "EHLO vaio.greennet") by oss.sgi.com with ESMTP id ; Thu, 30 Mar 2000 09:50:36 -0800 Received: from localhost (becker@localhost) by vaio.greennet (8.9.3/8.8.7) with ESMTP id MAA03259; Thu, 30 Mar 2000 12:53:54 -0500 Date: Thu, 30 Mar 2000 12:53:54 -0500 (EST) From: Donald Becker X-Sender: becker@vaio.greennet To: netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru cc: Andrew Morton Subject: Re: 3c59x.c In-Reply-To: <200003301529.TAA11239@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Thu, 30 Mar 2000 kuznet@ms2.inr.ac.ru wrote: > > - Why do some drivers statically allocate their struct net_device > > (eepro, cs89x0) whereas others call init_etherdev()? What's the right > > thing to do? > > By historical reasons. 
This mess was supposed to be rectified > some time ago in 2.3 (when file drivers/net/setup.c was created), > but the cleanup was not completed, unfortunately. No, this is unrelated. The specific problem was a crude coversion of drivers to use modules, but the conversion only supported a single card via a single static structure. I didn't use that module conversion, and instead added module support independently. Thus drivers maintained "in-kernel" kept the cruft, while my versions never had it. > I think right way is not to use static net_device structures at all. > But it can (and do) result in problems with passing kernel boot options > to driver, which are unsolved until now. ... > To be honest, I do not know what way is correct. This three-fold way > was just shit, new one is apparently cleaner, but it is incomplete, > at least passing options via kernel command line does not work now, > which is bad. I am not an expert here, alas. The reason Space.c still exists was the big problem when I tried to remove it. See net_init.c, which was supposed to take over the functionality of Space.c. But using it exclusively broke the documentation, which advised people to directly edit Space.c (LILO parameters and modules were not fully supported back then). So Space.c had to remain. I assumed that Space.c would be around only temporarily, but people then people continue to add ever more elaborate initialization cruft to Space.c. [[ It's like a shed wiht a flawed foundation that is still being used. The plan is for it to be torn down later when everything is moved to a new building, but you come back later to find a twenty story apartment building with everyone blaming you for the foundation. ]] We can now get rid of Space.c, since no current documentation advises people to modify it. But it can and should be done in a way that does not break LILO-passed parameters, which are still required for some configurations. 
> > - The manipulation of the pci_root_buses and pci_devices lists in pci.c > > has no SMP or IRQ protection. The net drivers call into pci.c to > > add/remove things from these lists but provide no race avoidance. Is > > there something higher up which guarantees that these list operations > > are serialised wrt some random soundcard driver or is this a bug? > > I do not know, I am not an expert here too. Seems, all this code is supposed > to be executed only under kernel lock and never executed from an > interrupt context. Then it should be valid. Do you mean under a global whole-kernel lock? > > - As far as I can tell, many Ethernet chips separate the Tx and Rx > > functions sufficiently well for the driver author to be able to let the > > Tx and Rx threads operate independently. And the orginal architecture > > allowed this to occur. .. > > But the recommended practice of locking the whole driver within > > hard_start_xmit() will penalise the Rx threads. > > This practice was not recommended, certainly! Not recommended? But look at what the only guide, the driver conversions, have as a de facto recommendation -- a horrible spinning waste of time, the cost of which is hidden because it's difficult to measure. > If you mean "controller" lock, it is TX lock as rule. The queue logic should default to seeing that driver Tx queue routine is serialized. This is a cleaner interface, results in less wasted time, and should almost never be a SMP performance hit. There are few (no?) drivers that do substantial-but-parallelizable work when queuing a packet for transmission. The existing practice makes it appear that the driver must always protect itself against simultaneous transmission attempts from multiple processors. > RX locking is made automatically by IRQ serialization logic, > so that the question occurs only when driver needs to make something > serialized with RX interrupts itself. 
No help from top level > is possible here, it is an internal driver problem and driver > have to create its private locks, if it is necessary, > or, alternatively, to disable device irqs, while doing dangerous job. > > > If this is correct, should we be using separate rx and tx device-private > > locks? > > If driver needs them. Arrgggg. It's hard to tell what is an SMP bug and what is expected behavior that the driver must protect itself against. For instance, there was a bug in (IIRC) early 2.2 kernels where an interrupt handler would be re-entered on SMP systems. It took a long time to track this down. The reaction I got (which was later corrected) was that it was perfectly reasonable for two processors to be running the interrupt handler simultaneously, and that driver should add locks to make certain that this worked. Yes, that could be made to work, but it would add horrible locking overhead for the very rare case where more than one processor could do work. > > BTW: a while back I noticed that the driver's probe1() is being called a > > huge number of times at bootup from net/dev.c. Is this known about? > > No... Could you give more information? For what device does this occur? This is a buglet where the driver probes are called a zillion times, every probe called for each entry in Space.c, rather than calling the PCI (and other non-ISA) probes just once in an initial phase. This behavior is always wrong, even for the ISA drivers which should only be called multiple time if a card is found. I took special note of this because the bug was blamed on my drivers, as part of the general PCI scan flame, rather than the flawed change to Space.c which enabled the eth1..ethN entries. 
This Donald Becker Scyld Computing Corporation, becker@scyld.com From owner-netdev@oss.sgi.com Thu Mar 30 10:46:16 2000 Received: by oss.sgi.com id ; Thu, 30 Mar 2000 10:46:05 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:36618 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Thu, 30 Mar 2000 10:45:37 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA19064; Thu, 30 Mar 2000 22:45:15 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200003301845.WAA19064@ms2.inr.ac.ru> Subject: Re: 3c59x.c To: becker@scyld.com (Donald Becker) Date: Thu, 30 Mar 2000 22:45:15 +0400 (MSK DST) Cc: netdev@oss.sgi.com, andrewm@uow.edu.au In-Reply-To: from "Donald Becker" at Mar 30, 0 12:53:54 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 5822 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > No, this is unrelated. The specific problem was a crude coversion of > drivers to use modules, but the conversion only supported a single card via > a single static structure. > > I didn't use that module conversion, and instead added module support > independently. Thus drivers maintained "in-kernel" kept the cruft, while my > versions never had it. You are right. > The reason Space.c still exists was the big problem when I tried to remove > it. See net_init.c, which was supposed to take over the functionality of > Space.c. But using it exclusively broke the documentation, which advised > people to directly edit Space.c (LILO parameters and modules were not fully > supported back then). So Space.c had to remain. > > I assumed that Space.c would be around only temporarily, but people then > people continue to add ever more elaborate initialization cruft to Space.c. > > We can now get rid of Space.c, since no current documentation advises people > to modify it. But it can and should be done in a way that does not break > LILO-passed parameters, which are still required for some configurations. 
I tried to kill it while 2.1 and failed miserably by the same reason. 8) Too much of hacks... It was difficult to clean this without harm. The attempt made recently (seems, by Alan) is great step, but it is incomplete again. Yes, Space.c must die. > > I do not know, I am not an expert here too. Seems, all this code is supposed > > to be executed only under kernel lock and never executed from an > > interrupt context. Then it should be valid. > > Do you mean under a global whole-kernel lock? Yes. It was true in 2.2 and it should be still true in 2.3. In current 2.3 only calls made from networking layer are made not under big kernel lock. > Not recommended? But look at what the only guide, the driver conversions, > have as a de facto recommendation -- a horrible spinning waste of time, the > cost of which is hidden because it's difficult to measure. No, Donald. It does not... Well, it was not supposed to tell this at least. Probably, it is badly written. > > If you mean "controller" lock, it is TX lock as rule. > > The queue logic should default to seeing that driver Tx queue routine is > serialized. This is a cleaner interface, results in less wasted time, and > should almost never be a SMP performance hit. There are few (no?) drivers > that do substantial-but-parallelizable work when queuing a packet for > transmission. > > The existing practice makes it appear that the driver must always protect > itself against simultaneous transmission attempts from multiple processors. It is _really_ protected by top level (dev->xmit_lock). The problem is that top level cannot provide this lock for TX completion IRQs and, even more, driver cannot grab this top level lock too, because this lock cannot be get from IRQ context. So, driver is deemed either to serialize itself via tbusy (as tulip did), or to use TX lock (as eepro100 did). Both of the ways are described in Jamal's doc. The second way is much simpler logically. 
The first is cheaper from performance viewpoint, but it is really complicated. It requires understanding CPU ordering, memory barriers etc. We cannot force people to use this, actually the only known example of flawlessly working driver using this approach was your tulip. But I am afraid even it will not work on smp alpha or ultra, due to absence of cpu ordering there. > > RX locking is made automatically by IRQ serialization logic, > > so that the question occurs only when driver needs to make something > > serialized with RX interrupts itself. No help from top level > > is possible here, it is an internal driver problem and driver > > have to create its private locks, if it is necessary, > > or, alternatively, to disable device irqs, while doing dangerous job. > > > > > If this is correct, should we be using separate rx and tx device-private > > > locks? > > > > If driver needs them. > > Arrgggg. It's hard to tell what is an SMP bug and what is expected behavior > that the driver must protect itself against. Let me to reproduce sentence several lines ago: > > RX locking is made automatically by IRQ serialization logic, See? But driver really can require separate RX path serialization wrt control functions or wrt TX path, f.e. ne2000 needs this. In these cases an additional technique is still required. Such technique cannot supplied by top level by clear reasons, if not to consdier spinlocks and disable_irq_nosync() as such technique, certainly. > For instance, there was a bug in (IIRC) early 2.2 kernels where an interrupt > handler would be re-entered on SMP systems. It took a long time to track > this down. The reaction I got (which was later corrected) was that it was > perfectly reasonable for two processors to be running the interrupt handler > simultaneously, and that driver should add locks to make certain that this > worked. 
Yes, that could be made to work, but it would add horrible locking > overhead for the very rare case where more than one processor could do work. It was _my_ statement. I was beated and recognized the defeat. 8) [ Though to be honest, I still suspect sometimes that it was right way. Look at current IRQ engine, it is _very_ complicated due to this service. It was not easy to recognize that such overhead at the deepest level is good. But all this is history in any case. ] > This is a buglet where the driver probes are called a zillion times, every > probe called for each entry in Space.c, rather than calling the PCI (and > other non-ISA) probes just once in an initial phase. This behavior is > always wrong, even for the ISA drivers which should only be called multiple > time if a card is found. Indeed... I see, I see. Alexey From owner-netdev@oss.sgi.com Thu Mar 30 11:54:45 2000 Received: by oss.sgi.com id ; Thu, 30 Mar 2000 11:54:35 -0800 Received: from adsl-151-196-244-176.bellatlantic.net ([151.196.244.176]:7416 "EHLO vaio.greennet") by oss.sgi.com with ESMTP id ; Thu, 30 Mar 2000 11:54:28 -0800 Received: from localhost (becker@localhost) by vaio.greennet (8.9.3/8.8.7) with ESMTP id OAA03612; Thu, 30 Mar 2000 14:55:26 -0500 Date: Thu, 30 Mar 2000 14:55:26 -0500 (EST) From: Donald Becker X-Sender: becker@vaio.greennet To: netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru cc: andrewm@uow.edu.au Subject: Queue and SMP locking discussion (was Re: 3c59x.c) In-Reply-To: <200003301845.WAA19064@ms2.inr.ac.ru> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Thu, 30 Mar 2000 kuznet@ms2.inr.ac.ru wrote: > > We can now get rid of Space.c, since no current documentation advises people .. > I tried to kill it while 2.1 and failed miserably by the same reason. 8) > Too much of hacks... It was difficult to clean this without harm. 
Look at the note in net_init.c, which I created to replace Space.c. For those that don't get the reference "pl14" refers to kernel 0.99pl14, i.e. 0.0.99-pre14 ! Once an bad interface is put into place, it's very difficult to change. For instance, it's far easier to update a bug-free library function than to fix buggy version because someone might be depending on the bugs. Further, the cruftier the interface the more people rely on out-of-interface hooks. > Yes, Space.c must die. We should lead a chant.. "Space.c must die. Space.c must die.." Yet people want to change non-broken things first. [[ PCI access functions. ]] > Yes. It was true in 2.2 and it should be still true in 2.3. > In current 2.3 only calls made from networking layer are made > not under big kernel lock. This is an important detail: no PCI config modifications in a running driver, only during initialization and after shutdown. > > Not recommended? But look at what the only guide, the driver conversions, > > have as a de facto recommendation -- a horrible spinning waste of time, the > > cost of which is hidden because it's difficult to measure. > > No, Donald. It does not... Well, it was not supposed to tell this at least. > Probably, it is badly written. Yes. The de facto guide, the existing SMP changes, are inefficient. (And "it's not my fault". I know I'm sounding like a broken record.) > > The existing practice makes it appear that the driver must always protect > > itself against simultaneous transmission attempts from multiple processors. ... > So, driver is deemed either to serialize itself via tbusy (as tulip did), > or to use TX lock (as eepro100 did). Historical note: The dev->tbusy flag turned into a lock because of a bug. Timer-based retransmission were not added to the Tx queue, but instead went through the normal queueing path during the timer IRQ. 
So if there was no Tx packet in the queue, it was sent to the driver immediately, even if the timer IRQ was interrupting the queue output routine!

The cost of this extra lock in each driver was mitigated (but not overcome) by permitting less expensive, non-strict queue locking and using the Tx attempts that slipped through as a watchdog timer.

> The first is cheaper from performance viewpoint, but it is really
> complicated. It requires understanding CPU ordering, memory barriers etc.
> We cannot force people to use this, actually the only known example
> of flawlessly working driver using this approach was your tulip.
> But I am afraid even it will not work on smp alpha or ultra,
> due to absence of cpu ordering there.

The tricky part for avoiding further locking is making certain that descriptor index entries are integers that may be atomically incremented WRT to the Tx queuing.

In the Tx packet routine

    entry = tp->cur_tx % TX_RING_SIZE;
    ... queue a packet to the 'entry' slot
    tp->cur_tx++;

In the interrupt (Tx clean-up) routine

    for (dirty_tx = tp->dirty_tx; tp->cur_tx - dirty_tx > 0; dirty_tx++) {

This should *not* require locking on any architecture.

What may require locking is when the Tx queue is shared by the Rx filter setup logic, as on the Tulip. The set_rx_mode() code must either be able to spin on the Tx queue lock or otherwise have a way to be Tx-serialized. Setting the Rx filter mode is very rare compared to sending packets, so adding a new Tx lock for this is a bad design.

> > > RX locking is made automatically by IRQ serialization logic,
> > > so that the question occurs only when driver needs to make something
> > > serialized with RX interrupts itself. No help from top level
> > > is possible here, it is an internal driver problem and driver
> > > have to create its private locks, if it is necessary,
> > > or, alternatively, to disable device irqs, while doing dangerous job.
Disabling the device's IRQ generation is not valid in a shared IRQ environment. Doing this used to be a common technique in other OSes, and even some types of Linux device drivers, but I always avoided it. The only reasonable approach is disabling all IRQs or selectively disabling one IRQ chain. The cli() approach used to be very cheap, and is now expensive only on SMPs. Disabling an IRQ chain used to be very expensive, and is only slightly less expensive now. > > For instance, there was a bug in (IIRC) early 2.2 kernels where an interrupt > > handler would be re-entered on SMP systems. It took a long time to track > > this down. The reaction I got (which was later corrected) was that it was > > perfectly reasonable for two processors to be running the interrupt handler > > simultaneously, and that driver should add locks to make certain that this > > worked. Yes, that could be made to work, but it would add horrible locking > > overhead for the very rare case where more than one processor could do work. > > It was _my_ statement. I was beated and recognized the defeat. 8) I wasn't naming names.. (And I don't think that you were responsible for the original bug, just for not seeing that the semantics that it created were unreasonable.) > [ Though to be honest, I still suspect sometimes that it was right way. > Look at current IRQ engine, it is _very_ complicated due to this > service. It was not easy to recognize that such overhead > at the deepest level is good. But all this is history in any case. > ] That way creates continuously costly overhead for the rare case that two processors would try to handle the same interrupt. And in that rare case you would end up with two processors vying for a single unit of work, or both handling work while pounding on each others cache lines. It sounds so elegant, having multiple processors leaping up to handle an interrupt like men leaping to open a door for a girl, but you get the same low efficiency. 
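
The cur_tx/dirty_tx index scheme from earlier in this message can be sketched in isolation. What follows is a minimal single-threaded user-space model, not code from tulip.c: the struct and function names are hypothetical, TX_RING_SIZE is assumed to be a power of two, and memory barriers are deliberately left out, which is the very part Alexey doubts on weakly ordered SMP machines.

```c
#include <stdint.h>

#define TX_RING_SIZE 16u   /* assumed power of two so '%' survives counter wraparound */

/* Hypothetical ring state: each counter has exactly one writer. */
struct tx_ring {
    uint32_t cur_tx;    /* advanced only by the transmit path */
    uint32_t dirty_tx;  /* advanced only by the Tx clean-up path */
};

/* Slots queued but not yet reclaimed; unsigned subtraction is wrap-safe. */
static uint32_t tx_in_flight(const struct tx_ring *r)
{
    return r->cur_tx - r->dirty_tx;
}

/* Transmit path: claim the next slot, or return -1 if the ring is full. */
static int tx_queue_one(struct tx_ring *r)
{
    if (tx_in_flight(r) >= TX_RING_SIZE)
        return -1;
    int entry = (int)(r->cur_tx % TX_RING_SIZE);
    /* ... a real driver would fill descriptor 'entry' here ... */
    r->cur_tx++;
    return entry;
}

/* Clean-up path: reclaim at most 'completed' slots, return how many. */
static uint32_t tx_clean(struct tx_ring *r, uint32_t completed)
{
    uint32_t n = 0;
    uint32_t dirty = r->dirty_tx;

    while (r->cur_tx - dirty > 0 && n < completed) {
        dirty++;
        n++;
    }
    r->dirty_tx = dirty;
    return n;
}
```

Because each counter has a single writer and the counters are never reduced modulo the ring size, the difference cur_tx - dirty_tx stays meaningful across 32-bit wraparound. Whether that is enough without barriers on a given SMP architecture is the open question in this thread.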
Donald Becker Scyld Computing Corporation, becker@scyld.com From owner-netdev@oss.sgi.com Thu Mar 30 13:00:17 2000 Received: by oss.sgi.com id ; Thu, 30 Mar 2000 13:00:08 -0800 Received: from mail.cyberus.ca ([209.195.95.1]:17351 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Thu, 30 Mar 2000 12:59:53 -0800 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.) with ESMTP id PAA16812; Thu, 30 Mar 2000 15:48:11 -0500 (EST) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id PAA15263; Thu, 30 Mar 2000 15:48:10 -0500 (EST) Date: Thu, 30 Mar 2000 15:48:10 -0500 (EST) From: jamal To: Donald Becker cc: netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru, andrewm@uow.edu.au Subject: Re: Queue and SMP locking discussion (was Re: 3c59x.c) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Thu, 30 Mar 2000, Donald Becker wrote: > On Thu, 30 Mar 2000 kuznet@ms2.inr.ac.ru wrote: > > > We can now get rid of Space.c, since no current documentation advises people > .. > > > > Not recommended? But look at what the only guide, the driver conversions, > > > have as a de facto recommendation -- a horrible spinning waste of time, the > > > cost of which is hidden because it's difficult to measure. > > > > No, Donald. It does not... Well, it was not supposed to tell this at least. > > Probably, it is badly written. > > Yes. The de facto guide, the existing SMP changes, are inefficient. > (And "it's not my fault". I know I'm sounding like a broken record.) > Donald, I am to blame for that -- as is stated in the document disclaimer. Here's the reasoning: In my analysis i noted that the "tx timeout" problems under moderate network loads was _mostly_ because the tx thread was being starved. 
(i was blasting a lot of 64 byte packets at the tulip and eepro and trying to see where they start dying). One of the main reasons was that the rx thread interupt was consistently pre-empting the tx. Is the total paralelization you are proposing providing any protection against such issues? The suggested locks do fix this problem. cheers, jamal From owner-netdev@oss.sgi.com Thu Mar 30 14:17:29 2000 Received: by oss.sgi.com id ; Thu, 30 Mar 2000 14:17:12 -0800 Received: from mainframe.dgrc.crc.ca ([142.92.38.206]:63441 "EHLO mainframe.dgrc.crc.ca") by oss.sgi.com with ESMTP id ; Thu, 30 Mar 2000 14:16:52 -0800 Received: from crc.ca (curly [142.92.38.251]) by mainframe.dgrc.crc.ca (8.9.3/8.9.3) with ESMTP id RAA14010 for ; Thu, 30 Mar 2000 17:16:43 -0500 (EST) Message-ID: <38E3D24B.642448EB@crc.ca> Date: Thu, 30 Mar 2000 17:16:43 -0500 From: Guilhem Tardy Organization: CRC X-Mailer: Mozilla 4.7 [en] (X11; I; SunOS 5.7 sun4u) X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com Subject: IPv6 send router advertisement References: <200003301529.TAA11239@ms2.inr.ac.ru> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hi! Could anyone tell me where (supposedly in net/ipv6/ndisc.c) the router advertisements are created and sent? Thanks, Guilhem. 
From owner-netdev@oss.sgi.com Thu Mar 30 15:10:51 2000 Received: by oss.sgi.com id ; Thu, 30 Mar 2000 15:10:35 -0800 Received: from quechua.inka.de ([212.227.14.2]:11126 "EHLO mail.inka.de") by oss.sgi.com with ESMTP id ; Thu, 30 Mar 2000 15:10:16 -0800 Received: from dungeon.inka.de by mail.inka.de with uucp (rmailwrap 0.4) id 12ao4r-0004ba-00; Fri, 31 Mar 2000 01:10:13 +0200 Received: by dungeon.inka.de (Postfix, from userid 1000) id 835C2B7854; Fri, 31 Mar 2000 00:33:29 +0200 (CEST) Date: Fri, 31 Mar 2000 00:33:29 +0200 From: Andreas Jellinghaus To: Guilhem Tardy Cc: netdev@oss.sgi.com Subject: Re: IPv6 send router advertisement Message-ID: <20000331003329.A1940@dungeon.inka.de> References: <200003301529.TAA11239@ms2.inr.ac.ru> <38E3D24B.642448EB@crc.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii User-Agent: Mutt/1.0.1i In-Reply-To: <38E3D24B.642448EB@crc.ca>; from Guilhem.Tardy@crc.ca on Thu, Mar 30, 2000 at 05:16:43PM -0500 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing router advertisements are created by an application such as the router advertisement daemon (radvd) or the zebra router daemon (www.zebra.org). the kernel listens to router advertisements, asks for them, but does not create them. regards, andreas p.s. this question is about use, not development, so some newsgroup such as comp.os.linux.networking or a mailing list such as linux-net@vger.rutgers.edu might be a better place to ask next time. keep the traffik low, so the hackers can hack and improve. advanced users will answer user questions. 
From owner-netdev@oss.sgi.com Thu Mar 30 18:29:14 2000 Received: by oss.sgi.com id ; Thu, 30 Mar 2000 18:29:06 -0800 Received: from smtprich.nortel.com ([192.135.215.8]:38360 "EHLO smtprich.nortel.com") by oss.sgi.com with ESMTP id ; Thu, 30 Mar 2000 18:28:45 -0800 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprich.nortel.com; Thu, 30 Mar 2000 20:28:58 -0600 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id H90BKHVR; Thu, 30 Mar 2000 20:27:59 -0600 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id HNN1JMRD; Fri, 31 Mar 2000 12:27:58 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id MAA01784; Fri, 31 Mar 2000 12:27:50 +1000 Message-ID: <38E40D26.D34D72CB@uow.edu.au> Date: Fri, 31 Mar 2000 02:27:50 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.3.99-pre2 i686) X-Accept-Language: en MIME-Version: 1.0 To: jamal CC: Donald Becker , netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru Subject: Re: Queue and SMP locking discussion (was Re: 3c59x.c) References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing jamal wrote: > > ... > In my analysis i noted that the "tx timeout" problems under moderate > network loads was _mostly_ because the tx thread was being starved. > (i was blasting a lot of 64 byte packets at the tulip and eepro and > trying to see where they start dying). > One of the main reasons was that the rx thread interupt was consistently > pre-empting the tx. > Is the total paralelization you are proposing providing any protection > against such issues? 
> The suggested locks do fix this problem.

I believe that this tx starvation is due to the decision to schedule the tx in the device ISR, for BH handling, rather than to actually dequeue and send packets within the Tx ISR. I can see why the bh scheduling is simpler...

I like the loop-until-max_interrupt_work-exceeded architecture. It's _very_ efficient compared with interrupt per packet, and it kicks in when the system is under stress. But it's not being leveraged for transmits.

BTW::::

The 3c59x driver's ISR does this:

    while (stuff_to_do && (count++ < max_interrupt_work))
    {
        if (the device has room for a tx packet)
            netif_wake_queue()
    }

It appears to me that netif_wake_queue can be called multiple times within this loop, at considerable expense, when the system is under Rx stress.

Wouldn't it be better to have a local flag in the ISR which prevents this?

    bool done_wake = false;

    while (count++ < max_interrupt_work)
    {
        if (!done_wake && the device has room for a tx packet)
        {
            done_wake = true;
            netif_wake_queue();
        }
    }

-- 
-akpm-

From owner-netdev@oss.sgi.com Thu Mar 30 20:42:56 2000 Received: by oss.sgi.com id ; Thu, 30 Mar 2000 20:42:47 -0800 Received: from smtprch1.nortelnetworks.com ([192.135.215.14]:17407 "EHLO smtprch1.nortel.com") by oss.sgi.com with ESMTP id ; Thu, 30 Mar 2000 20:42:39 -0800 Received: from zsngd101.asiapac.nortel.com (actually znsgd101) by smtprch1.nortel.com; Thu, 30 Mar 2000 20:47:37 -0600 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zsngd101.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id H3S3X8HZ; Fri, 31 Mar 2000 10:43:44 +0800 Received: from pwold011.asiapac.nortel.com ([47.181.193.45]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id HNN1JMR7; Fri, 31 Mar 2000 12:43:43 +1000 Received: from uow.edu.au (IDENT:akpm@localhost [127.0.0.1]) by pwold011.asiapac.nortel.com (8.9.3/8.9.3) with ESMTP id MAA01919; Fri, 31 Mar 2000
12:43:36 +1000 Message-ID: <38E410D8.1517FE1B@uow.edu.au> Date: Fri, 31 Mar 2000 02:43:36 +0000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.3.99-pre2 i686) X-Accept-Language: en MIME-Version: 1.0 To: jamal , Donald Becker , netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru Subject: Re: Queue and SMP locking discussion (was Re: 3c59x.c) References: <38E40D26.D34D72CB@uow.edu.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing

Andrew Morton wrote:
>
> It appears to me that netif_wake_queue can be called multiple times
> within this loop,

Spit. I missed the test_and_set_bit(__LINK_STATE_XOFF...

I still question it though:

    status = inw(ioaddr + EL3_STATUS);
    do {
        ...
        if (status & TxAvailable) {
            outw(AckIntr | TxAvailable, ioaddr + EL3_CMD);
            netif_wake_queue(dev);
        }
        ....
    } while ((status = inw(ioaddr + EL3_STATUS)) & (IntLatch | RxComplete));

Will 'inw(ioaddr + EL3_STATUS)' continue to have the TxAvailable bit set even after it has been acked?
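
The effect of the bit test mentioned above can be modelled in user space. This is a sketch under stated assumptions rather than the kernel's code: every model_* name is hypothetical, a plain bool stands in for the atomic XOFF bit, and a counter stands in for the expensive reschedule.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the device state touched by the wake call. */
struct model_dev {
    bool xoff;       /* true while the Tx queue is stopped */
    int  schedules;  /* how often the costly reschedule path ran */
};

static void model_stop_queue(struct model_dev *d)
{
    d->xoff = true;
}

/* Wake only if the queue was actually stopped (test-and-clear, non-atomic here). */
static void model_wake_queue(struct model_dev *d)
{
    if (d->xoff) {
        d->xoff = false;
        d->schedules++;
    }
}

/* ISR-style loop in which the device reports Tx room on every pass. */
static void model_isr(struct model_dev *d, int max_interrupt_work)
{
    for (int count = 0; count < max_interrupt_work; count++)
        model_wake_queue(d);
}
```

In this model the repeated calls cost one flag test each after the first, so a local done_wake flag would save little. The sketch only covers the software side; whether the status register keeps reporting TxAvailable after the ack is a hardware question it does not address.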
-- -akpm- From owner-netdev@oss.sgi.com Fri Mar 31 02:52:07 2000 Received: by oss.sgi.com id ; Fri, 31 Mar 2000 02:51:58 -0800 Received: from lrcsun15.epfl.ch ([128.178.156.77]:18570 "EHLO lrcsun15.epfl.ch") by oss.sgi.com with ESMTP id ; Fri, 31 Mar 2000 02:51:38 -0800 Received: (from almesber@localhost) by lrcsun15.epfl.ch (8.8.X/EPFL-8.1a) id MAA22283; Fri, 31 Mar 2000 12:51:30 +0200 (MET DST) From: Werner Almesberger Message-Id: <200003311051.MAA22283@lrcsun15.epfl.ch> Subject: Re: Queue and SMP locking discussion (was Re: 3c59x.c) To: becker@scyld.com (Donald Becker) Date: Fri, 31 Mar 2000 12:51:30 +0200 (MET DST) Cc: netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru, andrewm@uow.edu.au In-Reply-To: from "Donald Becker" at Mar 30, 2000 02:55:26 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Donald Becker wrote: > We should lead a chant.. "Space.c must die. Space.c must die.." Or maybe we could adapt "War" by Bruce Springsteen: "Space.c ! UH ! What is it good for ? Absolutely nothin' !" > Historical note: The dev->tbusy flag turned into a lock because of a bug. Considering all the hairy locking issues and that there seems to be a general trend of moving things into tasklets, I wonder if we wouldn't be better off with a tasklet-based structure. Basically, my assumptions are that (1) most work should actually be done in a tasklet, and (2) give up and retry later (i.e. when the lock is available) is a better strategy than spin locking non-trivial portions of code. It should be straightforward to implement an efficient non-blocking "run >= once" (*) mutex with tasklets, so all that's needed on the infrastructure side should be that (1) the important services (e.g. 
timers) either come as "well-known" tasklets, or at least don't incur too much overhead for starting a tasklet to do the work, and (2) any entries to a driver can be retried by starting some tasklet.

(*) "run >= once", because it seems that in most cases, we have some while (there_is_work) do_work(); construct anyway, so the occasional extra invocation doesn't hurt that much. An efficient "run once" implementation would be better, of course.

This doesn't help the existing drivers, but may make life easier for future drivers (and in cases where the locking just gets dreadful enough that a conversion is the lesser evil ;-)

Example:

    static struct tasklet_table tbl;

    ... foo_init(...)
    {
        tbl.nr = 3;
        tbl.task[0] = foo_work_task;
        tbl.task[1] = TIMER_TASKLET(timer);
        tbl.task[2] = well_known_tx_tasklet;
    }

    ... foo_timer(...)  /* run by TIMER_TASKLET(timer) */
    {
        ...
        if (tasklet_begin_mutex(&tbl,1)) return;
        ...
        tasklet_end_mutex(&tbl);
        ...
    }

    ... foo_int(...)
    {
        /* enqueue work */
        tasklet_schedule(&dev->work_task);
        ...
    }

etc.

The explicit initialization could of course be some set of functions that add all the tasklets related to a given service, e.g. if tx may be invoked by multiple tasklets of the "core", all of them would have to be added. (Then we probably should have a calling_tasklet argument, too.)

Example code for the mutex (untested and probably buggy quick brain dump) below. Adding the usual performance boosters, a la tasklet_begin_mutex_spin, __tasklet_end_mutex (no re-scheduling), etc., would be trivial.

Opinions ?
- Werner

---------------------------------- cut here -----------------------------------

    struct tasklet_table {
        int busy;
        unsigned long need_to_run;
        int nr;
        struct tasklet_struct task[sizeof(unsigned long)*8];
            /* or use dynamic allocation */
    };

    int tasklet_begin_mutex(struct tasklet_table *tbl,int me)
    {
        set_bit(me,&tbl->need_to_run);
        mb();
        if (test_and_set_bit(0,&tbl->busy)) return -EBUSY;
        clear_bit(me,&tbl->need_to_run);
        return 0;
    }

    void tasklet_end_mutex(struct tasklet_table *tbl)
    {
        int i;

        clear_bit(0,&tbl->busy);
        for (i = 0; i < tbl->nr; i++)
            if (test_bit(i,&tbl->need_to_run))
                tasklet_schedule(tbl->task[i]);
    }

-- 
  _________________________________________________________________________
 / Werner Almesberger, ICA, EPFL, CH       werner.almesberger@ica.epfl.ch /
/_IN_N_032__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/

From owner-netdev@oss.sgi.com Fri Mar 31 06:02:12 2000 Received: by oss.sgi.com id ; Fri, 31 Mar 2000 06:01:57 -0800 Received: from mail.cyberus.ca ([209.195.95.1]:20109 "EHLO cyberus.ca") by oss.sgi.com with ESMTP id ; Fri, 31 Mar 2000 06:01:37 -0800 Received: from shell.cyberus.ca (shell [209.195.95.7]) by cyberus.ca (8.9.3/8.9.3/Cyberus Online Inc.)
with ESMTP id IAA05271; Fri, 31 Mar 2000 08:55:53 -0500 (EST) Received: from localhost (hadi@localhost) by shell.cyberus.ca (8.9.1b+Sun/8.9.3) with ESMTP id IAA16635; Fri, 31 Mar 2000 08:55:53 -0500 (EST) Date: Fri, 31 Mar 2000 08:55:53 -0500 (EST) From: jamal To: Andrew Morton cc: Donald Becker , netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru Subject: Re: Queue and SMP locking discussion (was Re: 3c59x.c) In-Reply-To: <38E40D26.D34D72CB@uow.edu.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Fri, 31 Mar 2000, Andrew Morton wrote: > I believe that this tx starvation is due to the decision to schedule the > tx in the device ISR, for BH handling, rather than to actually dequeue > and send packets within the Tx ISR. I can see why the bh scheduling is > simpler... > > I like the loop-until-max_interrupt_work-exceeded architecture. It's > _very_ efficient compared with interrupt per packet, and it kicks in > when the system is under stress. But it's not being leveraged for > transmits. > loop-until-max_interrupt_work-exceeded will *not* help you in this. Packet arrivals still mean interupts. Mitigation (which seems to be added to some of Donalds drivers by Jeff Garzik and Andrey Savochkin) will to a certain extent. 
cheers, jamal From owner-netdev@oss.sgi.com Fri Mar 31 07:06:26 2000 Received: by oss.sgi.com id ; Fri, 31 Mar 2000 07:06:07 -0800 Received: from adsl-151-196-244-176.bellatlantic.net ([151.196.244.176]:7165 "EHLO vaio.greennet") by oss.sgi.com with ESMTP id ; Fri, 31 Mar 2000 07:05:53 -0800 Received: from localhost (becker@localhost) by vaio.greennet (8.9.3/8.8.7) with ESMTP id KAA06715; Fri, 31 Mar 2000 10:08:00 -0500 Date: Fri, 31 Mar 2000 10:08:00 -0500 (EST) From: Donald Becker X-Sender: becker@vaio.greennet To: Andrew Morton cc: jamal , netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru Subject: Re: Queue and SMP locking discussion (was Re: 3c59x.c) In-Reply-To: <38E40D26.D34D72CB@uow.edu.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Fri, 31 Mar 2000, Andrew Morton wrote: > I believe that this tx starvation is due to the decision to schedule the > tx in the device ISR, for BH handling, rather than to actually dequeue > and send packets within the Tx ISR. I can see why the bh scheduling is > simpler... Not only is the Linux scheme simpler, it's much better. The BSD stack uses the scheme of dequeuing packets in the ISR. This was a good design in the VAX days, and with primative hardware that handled only single packets. But it has horrible cache behavior, needs an extra lock, and can result the interrupt service routine running a very long time, blocking interrupts. I think the problem is that our recieve routines, eth_type_trans() and netif_rx(), have grown to be too complex, and they have similar bad cache behavior to Tx queuing in the ISR. At least Tx starvation is better than Rx deafness. > I like the loop-until-max_interrupt_work-exceeded architecture. It's > _very_ efficient compared with interrupt per packet, and it kicks in > when the system is under stress. But it's not being leveraged for > transmits. Thanks. 
(I obviously think it's a good design, or I wouldn't have done it.) Many people just want to turn it off, but the network usually isn't the only part of the system that needs to get work done. > The 3c59x driver's ISR does this: > while (stuff_to_do && (count++ < max_interrupt_work)) > { > if (the device has room for a tx packet) > netif_wake_queue() .. > It appears to me that netif_wake_queue can be called multiple times > within this loop, at considerable expense, when the system is under Rx > stress. Not quite: it only does clear tbusy == netif_wake_queue() once. The TxAvailable indication is only used on the 3c590 series Vortex, and it only triggers once. The 3c900 series has a modern descriptor list and uses the other block of code, where netif_wake_queue() is called only when the 16/32 element Tx queue was full. Most drivers, including pci-skeleton.c, have some hysteresis so that we don't pound against the full limit. We should have two or four Tx slot free before we transition to non-full. Acckk!!! I just saw that someone put netif_wake_queue() in the normal path of the 3c59x.c Tx routine! This is BAD. That is putting an expensive call in the critical path, and it's not even the right semantics. The original intent of netif_wake_queue() was that it would be very lightweight. It should only clear the flag and set the BH bit. But it seems to have grown in complexity... 
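
The hysteresis described above (leave the full state only once a few Tx slots are free) can be sketched as follows. This is a minimal user-space model under stated assumptions, not code from pci-skeleton.c or 3c59x.c: the struct, the function names and the TX_WAKE_SLACK value of 4 are hypothetical, and a counter marks where a real driver would call netif_wake_queue().

```c
#include <stdbool.h>
#include <stdint.h>

#define TX_RING_SIZE  16u
#define TX_WAKE_SLACK  4u   /* assumed: free slots required before leaving "full" */

/* Hypothetical per-device Tx bookkeeping. */
struct tx_state {
    uint32_t cur_tx;    /* next slot to fill */
    uint32_t dirty_tx;  /* oldest slot not yet reclaimed */
    bool     tx_full;   /* queue has been stopped */
    int      wakeups;   /* counts where netif_wake_queue() would be called */
};

/* Transmit path; the caller is assumed not to enqueue while tx_full is set. */
static void tx_enqueue(struct tx_state *s)
{
    s->cur_tx++;
    if (s->cur_tx - s->dirty_tx >= TX_RING_SIZE)
        s->tx_full = true;   /* a real driver would stop the queue here */
}

/* Clean-up path; 'done' is assumed not to exceed the slots in flight. */
static void tx_reclaim(struct tx_state *s, uint32_t done)
{
    s->dirty_tx += done;
    if (s->tx_full &&
        s->cur_tx - s->dirty_tx <= TX_RING_SIZE - TX_WAKE_SLACK) {
        s->tx_full = false;
        s->wakeups++;        /* wake only after enough room has opened up */
    }
}
```

With the slack in place the wake call sits off the normal transmit path and runs once per full-to-non-full transition, instead of once per reclaimed descriptor when the ring hovers at the limit.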
Donald Becker Scyld Computing Corporation, becker@scyld.com From owner-netdev@oss.sgi.com Fri Mar 31 08:15:25 2000 Received: by oss.sgi.com id ; Fri, 31 Mar 2000 08:14:58 -0800 Received: from adsl-151-196-244-176.bellatlantic.net ([151.196.244.176]:9725 "EHLO vaio.greennet") by oss.sgi.com with ESMTP id ; Fri, 31 Mar 2000 08:14:35 -0800 Received: from localhost (becker@localhost) by vaio.greennet (8.9.3/8.8.7) with ESMTP id LAA07012; Fri, 31 Mar 2000 11:16:56 -0500 Date: Fri, 31 Mar 2000 11:16:56 -0500 (EST) From: Donald Becker X-Sender: becker@vaio.greennet To: jamal cc: Andrew Morton , netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru Subject: Re: Queue and SMP locking discussion (was Re: 3c59x.c) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing On Fri, 31 Mar 2000, jamal wrote: > On Fri, 31 Mar 2000, Andrew Morton wrote: > > > I believe that this tx starvation is due to the decision to schedule the > > tx in the device ISR, for BH handling, rather than to actually dequeue > > and send packets within the Tx ISR. I can see why the bh scheduling is > > simpler... > > > > I like the loop-until-max_interrupt_work-exceeded architecture. It's > > _very_ efficient compared with interrupt per packet, and it kicks in > > when the system is under stress. But it's not being leveraged for > > transmits. > > loop-until-max_interrupt_work-exceeded will *not* help you in this. It does limit the work done by a specific subsystem so that other devices on the same IRQ chain can do their work. At the end of the handler scan the interrupt dispatch system a chance to run other IRQ chains. > Packet arrivals still mean interupts. Most machines could never see a regime where they are overwhelmed by just accepting incoming packets. In the situation where it occurs, usually only gigabit cards or multiple 100baseTx connections, there must be discard/ignore policy. 
You must drop packets on the floor, and it's arguably best to not be transmitting while the discard is occurring. Given that this is a box-stopping load, I don't see transmit starvation as even an issue, let alone a problem. Transmit starvation is arguably the best behavior. If you must deploy a machine in these conditions, use newer hardware that implements hardware-triggered flow control. Even the $12 RTL8139B cards implement it. > Mitigation (which seems to be added to some of Donalds drivers by Jeff > Garzik and Andrey Savochkin) will to a certain extent. Huh? I've used software interrupt mitigation all along, and typically used the hardware interrupt mitigation where it existed. I don't see that any new driver structure has been added by Garzik or Savochkin. Adding non-hardware interrupt mitigation to the receive routine is a bad idea. You could easily end up with a single packet sitting in Rx queue for ages. Donald Becker Scyld Computing Corporation, becker@scyld.com From owner-netdev@oss.sgi.com Fri Mar 31 09:58:39 2000 Received: by oss.sgi.com id ; Fri, 31 Mar 2000 09:58:25 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:53513 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Fri, 31 Mar 2000 09:58:10 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id VAA12052; Fri, 31 Mar 2000 21:57:56 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200003311757.VAA12052@ms2.inr.ac.ru> Subject: Re: Queue and SMP locking discussion (was Re: 3c59x.c) To: becker@scyld.com (Donald Becker) Date: Fri, 31 Mar 2000 21:57:56 +0400 (MSK DST) Cc: netdev@oss.sgi.com, andrewm@uow.edu.au In-Reply-To: from "Donald Becker" at Mar 30, 0 02:55:26 pm X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 2296 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > Yes. The de facto guide, the existing SMP changes, are inefficient. Donald, I do not understand you, honestly. Design better scheme. Just do it. 
If you have no time to do it, you will have to work in environment, designed by stupid guys sort of me, who always have lots of time, unfortunately. It is evident, is not it? > The tricky part for avoiding further locking is making certain that > descriptor index entries are integers that may be atomically incremented WRT > to the Tx queuing. > In the Tx packet routine > entry = tp->cur_tx % TX_RING_SIZE; > ... queue a packet to the 'entry' slot > tp->cur_tx++; > > In the interrupt (Tx clean-up) routine > for (dirty_tx = tp->dirty_tx; tp->cur_tx - dirty_tx > 0; dirty_tx++) { > > This should *not* require locking on any architecture. If to forget that queue has another end (when it becomes full). If to remember about this, the sheme becomes more complicated. All the problems are really there, at the another end. > What may require locking is when the Tx queue is shared by the Rx filter > setup logic, as on the Tulip. The set_rx_mode() code must either be able to > spin on the Tx queue lock or otherwise have a way to be Tx-serialized. Top level guarantees that it does not overlap to hard_start_xmit(). > Setting the Rx filter mode is very rare compared to sending packets, so > adding a new Tx lock for this is a bad design. Agreed. BTW do you remember the story with eepro100? Why did it fail doing set_rx_mode() before Torvalds added TX lock there? > The only reasonable approach is disabling all IRQs or selectively disabling > one IRQ chain. The cli() approach used to be very cheap, and is now > expensive only on SMPs. Disabling an IRQ chain used to be very expensive, > and is only slightly less expensive now. Not all the devices are DMAing yet. And without DMA cli() is reliable death of serial interfaces. > I wasn't naming names.. (And I don't think that you were responsible for > the original bug, just for not seeing that the semantics that it created > were unreasonable.) 
I simply knew about this bug (experienced on my own skin) long before this question was raised, managed to ensure myself that it is not a bug, but feature and tried to propagate this "knowledge". 8)8) Alexey From owner-netdev@oss.sgi.com Fri Mar 31 10:47:33 2000 Received: by oss.sgi.com id ; Fri, 31 Mar 2000 10:47:14 -0800 Received: from minus.inr.ac.ru ([193.233.7.97]:33796 "HELO ms2.inr.ac.ru") by oss.sgi.com with SMTP id ; Fri, 31 Mar 2000 10:47:03 -0800 Received: (from kuznet@localhost) by ms2.inr.ac.ru (8.6.13/ANK) id WAA00485; Fri, 31 Mar 2000 22:46:15 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200003311846.WAA00485@ms2.inr.ac.ru> Subject: Re: Queue and SMP locking discussion (was Re: 3c59x.c) To: becker@scyld.com (Donald Becker) Date: Fri, 31 Mar 2000 22:46:15 +0400 (MSK DST) Cc: andrewm@uow.edu.au, hadi@cyberus.ca, netdev@oss.sgi.com In-Reply-To: from "Donald Becker" at Mar 31, 0 10:08:00 am X-Mailer: ELM [version 2.4 PL24] MIME-Version: 1.0 Content-Length: 925 Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing Hello! > Acckk!!! I just saw that someone put netif_wake_queue() in the normal path > of the 3c59x.c Tx routine! This is BAD. That is putting an expensive call > in the critical path, and it's not even the right semantics. Well, do not stop queue, then there will be no reasons to wake it. If we stop queue, we have to wake it. > Most machines could never see a regime where they are overwhelmed by just > accepting incoming packets. In the situation where it occurs, usually only > gigabit cards or multiple 100baseTx connections, there must be > discard/ignore policy. Khm... Please, get some simple benchmark applet sort of netperf and enjoy with this impossible phenomenon on single 100Mbit interface. Despite of all the "max job on interrupt" linux-2.2 never leaves irq handler and does no job in result. BSD (and NT, by a strange reason) with their silly approach _work_ at any load level, by the way. 
Alexey

From owner-netdev@oss.sgi.com Fri Mar 31 11:06:04 2000
Received: by oss.sgi.com id ; Fri, 31 Mar 2000 11:05:44 -0800
Received: from mainframe.dgrc.crc.ca ([142.92.38.206]:48526 "EHLO mainframe.dgrc.crc.ca") by oss.sgi.com with ESMTP id ; Fri, 31 Mar 2000 11:05:36 -0800
Received: from crc.ca (curly [142.92.38.251]) by mainframe.dgrc.crc.ca (8.9.3/8.9.3) with ESMTP id OAA00651; Fri, 31 Mar 2000 14:05:26 -0500 (EST)
Message-ID: <38E4F6F7.53FD5F12@crc.ca>
Date: Fri, 31 Mar 2000 14:05:27 -0500
From: Guilhem Tardy
Organization: CRC
X-Mailer: Mozilla 4.7 [en] (X11; I; SunOS 5.7 sun4u)
X-Accept-Language: en
MIME-Version: 1.0
To: Andreas Jellinghaus
CC: netdev@oss.sgi.com
Subject: Re: IPv6 send router advertisement
References: <200003301529.TAA11239@ms2.inr.ac.ru> <38E3D24B.642448EB@crc.ca> <20000331003329.A1940@dungeon.inka.de>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Andreas Jellinghaus wrote:
>
> router advertisements are created by an application such as
> the router advertisement daemon (radvd) or the zebra router daemon
> (www.zebra.org).
>
> the kernel listens to router advertisements, asks for them, but does
> not create them.

OK, very instructive. I couldn't find any such piece of code in ndisc.c, so now I wonder why the decision was made to create the router advertisements in a separate process.

Anyway, I wrote a hack for the reception of router advertisements on the 2.3.x kernel to support the new bit fields and options defined by Mobile-IPv6... and I guess I shall now hack one of the daemons you mentioned (if it isn't done already) so that it actually creates such RAs...

> regards, andreas
> p.s. this question is about use, not development, so some newsgroup such as
> comp.os.linux.networking or a mailing list such as linux-net@vger.rutgers.edu
> might be a better place to ask next time. keep the traffic low, so the hackers
> can hack and improve. advanced users will answer user questions.

Oh, sorry Mr. Hacker, I would prefer not to have to ask those questions if I could find everything I want in the IPv6 stack of Linux (and related daemons, as you noted)... until then, I am going to have to do some development myself. ;)

It is fair to say I am not a guru with Linux (it has been less than a year), but is that relevant here? If I can help improve the conformance of Linux to IEEE standards, whether in the kernel or in user space, I think that is all that matters, and your help is very much appreciated - not your disdain.

Guilhem.

From owner-netdev@oss.sgi.com Fri Mar 31 11:08:54 2000
Received: by oss.sgi.com id ; Fri, 31 Mar 2000 11:08:34 -0800
Received: from wirespeed.solidum.com ([216.13.130.242]:32988 "EHLO solidum.com") by oss.sgi.com with ESMTP id ; Fri, 31 Mar 2000 11:08:21 -0800
Received: from phobos.solidum.com (mcr@phobos.solidum.com [192.168.1.13]) by solidum.com (8.8.7/8.8.7) with ESMTP id OAA00894 for ; Fri, 31 Mar 2000 14:08:19 -0500
Message-Id: <200003311908.OAA00894@solidum.com>
To: netdev@oss.sgi.com
Subject: Re: Queue and SMP locking discussion (was Re: 3c59x.c)
In-Reply-To: Your message of "Fri, 31 Mar 2000 22:46:15 +0400." <200003311846.WAA00485@ms2.inr.ac.ru>
Mime-Version: 1.0 (generated by tm-edit 7.108)
Content-Type: text/plain; charset=US-ASCII
Date: Fri, 31 Mar 2000 14:08:19 -0500
From: Michael Richardson
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

>>>>> "kuznet" == kuznet writes:

kuznet> Khm... Please, get some simple benchmark applet sort of netperf
kuznet> and enjoy with this impossible phenomenon on single 100Mbit
kuznet> interface. Despite of all the "max job on interrupt" linux-2.2

Alexey is totally right. Even with the maximum of work per interrupt, a Linux box facing full-speed fast ethernet simply stops.

kuznet> never leaves irq handler and does no job in result. BSD (and NT,
kuznet> by a strange reason) with their silly approach _work_ at any load
kuznet> level, by the way.

That is essentially because BSD goes and processes the packet, discarding thousands at the NIC layer (because the card is out of receive buffers), and therefore gets a little bit of work done.

Linux tries (smartly) to empty the entire queue from the network card before processing any packets, and since it just can't keep up, period, it never finishes, regardless of max-interrupt-work --- there is immediately another interrupt.

See http://www.research.solidum.com/papers/ols1999/top.html

NT "solves" the problem because most of its drivers are actually threads, so they must submit to scheduling as well. (That is also what makes them so unresponsive.)

To solve the problem, you have to do QoS on CPU scheduling, or offload the bulk of the work to smarter hardware...

:!mcr!: | Solidum Systems Corporation, http://www.solidum.com
Michael Richardson |For a better connected world,where data flows faster
Personal: http://www.sandelman.ottawa.on.ca/People/Michael_Richardson/Bio.html
mailto:mcr@sandelman.ottawa.on.ca mailto:mcr@solidum.com

From owner-netdev@oss.sgi.com Fri Mar 31 19:30:51 2000
Received: by oss.sgi.com id ; Fri, 31 Mar 2000 19:30:31 -0800
Received: from saw.sw.com.sg ([203.120.9.98]:5252 "HELO saw.sw.com.sg") by oss.sgi.com with SMTP id ; Fri, 31 Mar 2000 19:30:22 -0800
Received: (qmail 20808 invoked by uid 577); 1 Apr 2000 03:30:10 -0000
Message-ID: <20000401113010.A20780@saw.sw.com.sg>
Date: Sat, 1 Apr 2000 11:30:10 +0800
From: Andrey Savochkin
To: Donald Becker , jamal
Cc: Andrew Morton , netdev@oss.sgi.com, kuznet@ms2.inr.ac.ru
Subject: Re: Queue and SMP locking discussion (was Re: 3c59x.c)
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.93.2i
In-Reply-To: ; from "Donald Becker" on Fri, Mar 31, 2000 at 11:16:56AM
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Hello,

On Fri, Mar 31, 2000 at 11:16:56AM -0500, Donald Becker wrote:
> On Fri, 31 Mar 2000, jamal wrote:
> > Mitigation (which seems to be added to some of Donalds drivers by Jeff
> > Garzik and Andrey Savochkin) will to a certain extent.

At least for eepro100, receive interrupt mitigation doesn't exist in the driver. There were some changes to TX completion interrupts, but I consider them rather irrelevant here.

A separate point is that it's possible to turn on hardware interrupt mitigation on Intel's chips by uploading microcode. Intel's driver claims to do it.

Best regards
Andrey V. Savochkin

From owner-netdev@oss.sgi.com Fri Mar 31 21:47:51 2000
Received: by oss.sgi.com id ; Fri, 31 Mar 2000 21:47:42 -0800
Received: from [202.102.223.33] ([202.102.223.33]:31353 "EHLO ns.cstnet-hf.net.cn") by oss.sgi.com with ESMTP id ; Fri, 31 Mar 2000 21:47:30 -0800
Received: from ustc.edu.cn (hpe25.nic.ustc.edu.cn [202.38.64.1]) by ns.cstnet-hf.net.cn (8.8.7/8.8.6) with SMTP id NAA31418; Sat, 1 Apr 2000 13:34:38 -0800
Received: from tarn.isdn.ustc.edu.cn by ustc.edu.cn with ESMTP (8.6.10/16.2) id NAA28911; Sat, 1 Apr 2000 13:20:02 +0800
Date: Sat, 1 Apr 2000 13:32:11 +0800 (CST)
From: Wang Hui
X-Sender: hwang@tarn.isdn.ustc.edu.cn
To: Guilhem Tardy
cc: Andreas Jellinghaus , netdev@oss.sgi.com
Subject: Re: IPv6 send router advertisement
In-Reply-To: <38E4F6F7.53FD5F12@crc.ca>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-netdev@oss.sgi.com
Precedence: bulk
Return-Path:
X-Orcpt: rfc822;netdev-outgoing

Is anyone developing code according to RFC2765 and RFC2766? I have just read them through and want to implement them, but I don't know whether someone else is already doing it.