From vnuorval@tcs.hut.fi Fri Aug 1 04:15:27 2003
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 04:15:42 -0700 (PDT)
Received: from mail.tcs.hut.fi (mail.tcs.hut.fi [130.233.215.20]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71BFPFl004394 for ; Fri, 1 Aug 2003 04:15:26 -0700
Received: from rhea.tcs.hut.fi (rhea.tcs.hut.fi [130.233.215.147]) by mail.tcs.hut.fi (Postfix) with ESMTP id 61FB38001D1; Fri, 1 Aug 2003 14:15:23 +0300 (EEST)
Received: from rhea.tcs.hut.fi (localhost [127.0.0.1]) by rhea.tcs.hut.fi (8.12.3/8.12.3/Debian-5) with ESMTP id h71BFN5L016559; Fri, 1 Aug 2003 14:15:23 +0300
Received: from localhost (vnuorval@localhost) by rhea.tcs.hut.fi (8.12.3/8.12.3/Debian-5) with ESMTP id h71BFLaU016555; Fri, 1 Aug 2003 14:15:22 +0300
Date: Fri, 1 Aug 2003 14:15:21 +0300 (EEST)
From: Ville Nuorvala
To: yoshfuji@linux-ipv6.org,
Cc: netdev@oss.sgi.com
Subject: [PATCH] IPV6: Incorrect hoplimit in ip6_push_pending_frames()
In-Reply-To:
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 4419
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: vnuorval@tcs.hut.fi
Precedence: bulk
X-list: netdev

Hi,

I noticed the hop limit passed to ip6_append_data() isn't used by
ip6_push_pending_frames(), which might lead to unexpected behavior with
multicast and (ipv6-in-ipv6) tunneled packets.

This patch (against Linux 2.6.0-test2 and cset 1.1595) fixes the problem.

Thanks,
Ville

diff -Nur linux-2.5.OLD/include/linux/ipv6.h linux-2.5/include/linux/ipv6.h
--- linux-2.5.OLD/include/linux/ipv6.h	Thu Jul 31 18:07:13 2003
+++ linux-2.5/include/linux/ipv6.h	Wed Jul 30 15:53:12 2003
@@ -189,6 +189,7 @@
 		struct ipv6_txoptions *opt;
 		struct rt6_info *rt;
 		struct flowi *fl;
+		int hop_limit;
 	} cork;
 };
diff -Nur linux-2.5.OLD/net/ipv6/ip6_output.c linux-2.5/net/ipv6/ip6_output.c
--- linux-2.5.OLD/net/ipv6/ip6_output.c	Thu Jul 31 18:07:30 2003
+++ linux-2.5/net/ipv6/ip6_output.c	Wed Jul 30 22:11:51 2003
@@ -1243,6 +1243,7 @@
 		dst_hold(&rt->u.dst);
 		np->cork.rt = rt;
 		np->cork.fl = fl;
+		np->cork.hop_limit = hlimit;
 		inet->cork.fragsize = mtu = dst_pmtu(&rt->u.dst);
 		inet->cork.length = 0;
 		inet->sndmsg_page = NULL;
@@ -1465,7 +1466,7 @@
 		hdr->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
 	else
 		hdr->payload_len = 0;
-	hdr->hop_limit = np->hop_limit;
+	hdr->hop_limit = np->cork.hop_limit;
 	hdr->nexthdr = proto;
 	ipv6_addr_copy(&hdr->saddr, &fl->fl6_src);
 	ipv6_addr_copy(&hdr->daddr, final_dst);

--
Ville Nuorvala
Research Assistant, Institute of Digital Communications,
Helsinki University of Technology
email: vnuorval@tcs.hut.fi, phone: +358 (0)9 451 5257

From chas@locutus.cmf.nrl.navy.mil Fri Aug 1 07:02:14 2003
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 07:02:26 -0700 (PDT)
Received: from ginger.cmf.nrl.navy.mil (ginger.cmf.nrl.navy.mil [134.207.10.161]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71E2CFn021118 for ; Fri, 1 Aug 2003 07:02:13 -0700
Received: from locutus.cmf.nrl.navy.mil (locutus.cmf.nrl.navy.mil [134.207.10.66]) by ginger.cmf.nrl.navy.mil (8.12.7/8.12.7) with ESMTP id h6THHosG027846; Tue, 29 Jul 2003 13:17:51 -0400 (EDT)
Message-Id: <200307291717.h6THHosG027846@ginger.cmf.nrl.navy.mil>
To: Mitchell Blank Jr
cc: davem@redhat.com, netdev@oss.sgi.com
Reply-To: chas3@users.sourceforge.net
Subject: Re: [atmdrvr zatm] Remove obsolete EXACT_TS support
In-reply-to: Your message of "Mon, 28 Jul 2003 00:13:23 PDT."
	<20030728071323.GT32831@gaz.sfgoth.com>
Date: Tue, 29 Jul 2003 13:15:09 -0400
From: chas williams
X-Spam-Score: () hits=-2.9
X-Virus-Scanned: NAI Completed
X-Scanned-By: MIMEDefang 2.30 (www . roaringpenguin . com / mimedefang)
X-archive-position: 4420
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: chas@cmf.nrl.navy.mil
Precedence: bulk
X-list: netdev

dave, please apply the following patch (hopefully one will arrive
shortly that removes cli() et al from zatm as well):

In message <20030728071323.GT32831@gaz.sfgoth.com>,Mitchell Blank Jr writes:
>Chas - here's another 2.6 atm driver patch. Please push it upstream.
>
>This removes the obsolete "exact timestamp" support from the zatm driver.
>My understanding is that it was part of a research thing Werner did 8 or
>so years ago. It has no purpose for any production use. I think 2.6 is
>its time to die.
>
>Besides, these days we use do_gettimeofday() instead of xtime so we should
>have a reasonably accurate timestamp anyways.
>
>The only program that uses the ZATM_GETTHIST ioctl is the src/debug/znth.c
>from the userland distribution. This isn't even compiled as part of the
>make process so I don't feel any guilt about breaking it. It should
>probably also just go away.
>
>I don't have the hardware (and really doubt anyone else does either, but
>that's another matter entirely) but it still compiles and insmod's.
>
>Patch is versus 2.6.0-test2.

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.1596 -> 1.1597
# drivers/atm/zatm.c 1.12 -> 1.13
# drivers/atm/Kconfig 1.5 -> 1.6
# drivers/atm/zatm.h 1.1 -> 1.2
# include/linux/atm_zatm.h 1.1 -> 1.2
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/07/28 chas@relax.cmf.nrl.navy.mil 1.1597
# remove EXACT_TS remove from zatm (untested)
# --------------------------------------------
#
diff -Nru a/drivers/atm/Kconfig b/drivers/atm/Kconfig
--- a/drivers/atm/Kconfig	Tue Jul 29 13:15:41 2003
+++ b/drivers/atm/Kconfig	Tue Jul 29 13:15:41 2003
@@ -164,18 +164,6 @@
 	  Note that extended debugging may create certain race conditions
 	  itself. Enable this ONLY if you suspect problems with the driver.
-config ATM_ZATM_EXACT_TS
-	bool "Enable usec resolution timestamps"
-	depends on ATM_ZATM && X86
-	help
-	  The uPD98401 SAR chip supports a high-resolution timer (approx. 30
-	  MHz) that is used for very accurate reception timestamps. Because
-	  that timer overflows after 140 seconds, and also to avoid timer
-	  drift, time measurements need to be periodically synchronized with
-	  the normal system time. Enabling this feature will add some general
-	  overhead for timer synchronization and also per-packet overhead for
-	  time conversion.
-
 # bool 'Rolfs TI TNETA1570' CONFIG_ATM_TNETA1570 y
 # if [ "$CONFIG_ATM_TNETA1570" = "y" ]; then
 # bool ' Enable extended debugging' CONFIG_ATM_TNETA1570_DEBUG n
diff -Nru a/drivers/atm/zatm.c b/drivers/atm/zatm.c
--- a/drivers/atm/zatm.c	Tue Jul 29 13:15:41 2003
+++ b/drivers/atm/zatm.c	Tue Jul 29 13:15:41 2003
@@ -52,13 +52,6 @@
 #define DPRINTK(format,args...)
 #endif
-#ifndef __i386__
-#ifdef CONFIG_ATM_ZATM_EXACT_TS
-#warning Precise timestamping only available on i386 platform
-#undef CONFIG_ATM_ZATM_EXACT_TS
-#endif
-#endif
-
 #ifndef CONFIG_ATM_ZATM_DEBUG
@@ -347,150 +340,6 @@
 	restore_flags(flags);
 }
-
-/*----------------------- high-precision timestamps -------------------------*/
-
-
-#ifdef CONFIG_ATM_ZATM_EXACT_TS
-
-static struct timer_list sync_timer;
-
-
-/*
- * Note: the exact time is not normalized, i.e. tv_usec can be > 1000000.
- * This must be handled by higher layers.
- */
-
-static inline struct timeval exact_time(struct zatm_dev *zatm_dev,u32 ticks)
-{
-	struct timeval tmp;
-
-	tmp = zatm_dev->last_time;
-	tmp.tv_usec += ((s64) (ticks-zatm_dev->last_clk)*
-	    (s64) zatm_dev->factor) >> TIMER_SHIFT;
-	return tmp;
-}
-
-
-static void zatm_clock_sync(unsigned long dummy)
-{
-	struct atm_dev *atm_dev;
-	struct zatm_dev *zatm_dev;
-
-	for (atm_dev = zatm_boards; atm_dev; atm_dev = zatm_dev->more) {
-		unsigned long flags,interval;
-		int diff;
-		struct timeval now,expected;
-		u32 ticks;
-
-		zatm_dev = ZATM_DEV(atm_dev);
-		save_flags(flags);
-		cli();
-		ticks = zpeekl(zatm_dev,uPD98401_TSR);
-		do_gettimeofday(&now);
-		restore_flags(flags);
-		expected = exact_time(zatm_dev,ticks);
-		diff = 1000000*(expected.tv_sec-now.tv_sec)+
-		    (expected.tv_usec-now.tv_usec);
-		zatm_dev->timer_history[zatm_dev->th_curr].real = now;
-		zatm_dev->timer_history[zatm_dev->th_curr].expected = expected;
-		zatm_dev->th_curr = (zatm_dev->th_curr+1) &
-		    (ZATM_TIMER_HISTORY_SIZE-1);
-		interval = 1000000*(now.tv_sec-zatm_dev->last_real_time.tv_sec)
-		    +(now.tv_usec-zatm_dev->last_real_time.tv_usec);
-		if (diff >= -ADJ_REP_THRES && diff <= ADJ_REP_THRES)
-			zatm_dev->timer_diffs = 0;
-		else
-#ifndef AGGRESSIVE_DEBUGGING
-			if (++zatm_dev->timer_diffs >= ADJ_MSG_THRES)
-#endif
-			{
-			zatm_dev->timer_diffs = 0;
-			printk(KERN_INFO DEV_LABEL ": TSR update after %ld us:"
-			    " calculation differed by %d us\n",interval,diff);
-#ifdef AGGRESSIVE_DEBUGGING
-			printk(KERN_DEBUG " %d.%08d -> %d.%08d (%lu)\n",
-			    zatm_dev->last_real_time.tv_sec,
-			    zatm_dev->last_real_time.tv_usec,
-			    now.tv_sec,now.tv_usec,interval);
-			printk(KERN_DEBUG " %u -> %u (%d)\n",
-			    zatm_dev->last_clk,ticks,ticks-zatm_dev->last_clk);
-			printk(KERN_DEBUG " factor %u\n",zatm_dev->factor);
-#endif
-			}
-		if (diff < -ADJ_IGN_THRES || diff > ADJ_IGN_THRES) {
-			/* filter out any major changes (e.g. time zone setup and
-			   such) */
-			zatm_dev->last_time = now;
-			zatm_dev->factor =
-			    (1000 << TIMER_SHIFT)/(zatm_dev->khz+1);
-		}
-		else {
-			zatm_dev->last_time = expected;
-			/*
-			 * Is the accuracy of udelay really only about 1:300 on
-			 * a 90 MHz Pentium ? Well, the following line avoids
-			 * the problem, but ...
-			 *
-			 * What it does is simply:
-			 *
-			 * zatm_dev->factor = (interval << TIMER_SHIFT)/
-			 * (ticks-zatm_dev->last_clk);
-			 */
-#define S(x) #x /* "stringification" ... */
-#define SX(x) S(x)
-			asm("movl %2,%%ebx\n\t"
-			    "subl %3,%%ebx\n\t"
-			    "xorl %%edx,%%edx\n\t"
-			    "shldl $" SX(TIMER_SHIFT) ",%1,%%edx\n\t"
-			    "shl $" SX(TIMER_SHIFT) ",%1\n\t"
-			    "divl %%ebx\n\t"
-			    : "=a" (zatm_dev->factor)
-			    : "0" (interval-diff),"g" (ticks),
-			    "g" (zatm_dev->last_clk)
-			    : "ebx","edx","cc");
-#undef S
-#undef SX
-#ifdef AGGRESSIVE_DEBUGGING
-			printk(KERN_DEBUG " (%ld << %d)/(%u-%u) = %u\n",
-			    interval,TIMER_SHIFT,ticks,zatm_dev->last_clk,
-			    zatm_dev->factor);
-#endif
-		}
-		zatm_dev->last_real_time = now;
-		zatm_dev->last_clk = ticks;
-	}
-	mod_timer(&sync_timer,sync_timer.expires+POLL_INTERVAL*HZ);
-}
-
-
-static void __init zatm_clock_init(struct zatm_dev *zatm_dev)
-{
-	static int start_timer = 1;
-	unsigned long flags;
-
-	zatm_dev->factor = (1000 << TIMER_SHIFT)/(zatm_dev->khz+1);
-	zatm_dev->timer_diffs = 0;
-	memset(zatm_dev->timer_history,0,sizeof(zatm_dev->timer_history));
-	zatm_dev->th_curr = 0;
-	save_flags(flags);
-	cli();
-	do_gettimeofday(&zatm_dev->last_time);
-	zatm_dev->last_clk = zpeekl(zatm_dev,uPD98401_TSR);
-	if (start_timer) {
-		start_timer = 0;
-		init_timer(&sync_timer);
-		sync_timer.expires = jiffies+POLL_INTERVAL*HZ;
-		sync_timer.function = zatm_clock_sync;
-		add_timer(&sync_timer);
-	}
-	restore_flags(flags);
-}
-
-
-#endif
-
-
 /*----------------------------------- RX ------------------------------------*/
@@ -581,11 +430,7 @@
 		EVENT("error code 0x%x/0x%x\n",(here[3] & uPD98401_AAL5_ES) >>
 		    uPD98401_AAL5_ES_SHIFT,error);
 		skb = ((struct rx_buffer_head *) bus_to_virt(here[2]))->skb;
-#ifdef CONFIG_ATM_ZATM_EXACT_TS
-		skb->stamp = exact_time(zatm_dev,here[1]);
-#else
 		do_gettimeofday(&skb->stamp);
-#endif
 #if 0
 printk("[-3..0] 0x%08lx 0x%08lx 0x%08lx 0x%08lx\n",((unsigned *) skb->data)[-3],
   ((unsigned *) skb->data)[-2],((unsigned *) skb->data)[-1],
@@ -1455,9 +1300,6 @@
 	    "MHz\n",dev->number,
 	    (zin(VER) & uPD98401_MAJOR) >> uPD98401_MAJOR_SHIFT,
 	    zin(VER) & uPD98401_MINOR,zatm_dev->khz/1000,zatm_dev->khz % 1000);
-#ifdef CONFIG_ATM_ZATM_EXACT_TS
-	zatm_clock_init(zatm_dev);
-#endif
 	return uPD98402_init(dev);
 }
@@ -1699,22 +1541,6 @@
 			restore_flags(flags);
 			return 0;
 		}
-#ifdef CONFIG_ATM_ZATM_EXACT_TS
-		case ZATM_GETTHIST:
-			{
-				int i;
-				struct zatm_t_hist hs[ZATM_TIMER_HISTORY_SIZE];
-				save_flags(flags);
-				cli();
-				for (i = 0; i < ZATM_TIMER_HISTORY_SIZE; i++)
-					hs[i] = zatm_dev->timer_history[
-					    (zatm_dev->th_curr+i) &
-					    (ZATM_TIMER_HISTORY_SIZE-1)];
-				restore_flags(flags);
-				return copy_to_user((struct zatm_t_hist *) arg,
-				    hs, sizeof(hs)) ? -EFAULT : 0;
-			}
-#endif
 		default:
 			if (!dev->phy->ioctl) return -ENOIOCTLCMD;
 			return dev->phy->ioctl(dev,cmd,arg);
diff -Nru a/drivers/atm/zatm.h b/drivers/atm/zatm.h
--- a/drivers/atm/zatm.h	Tue Jul 29 13:15:41 2003
+++ b/drivers/atm/zatm.h	Tue Jul 29 13:15:41 2003
@@ -40,31 +40,6 @@
 #define MBX_TX_0 2
 #define MBX_TX_1 3
-
-/*
- * mkdep doesn't spot this dependency, but that's okay, because zatm.c uses
- * CONFIG_ATM_ZATM_EXACT_TS too.
- */
-
-#ifdef CONFIG_ATM_ZATM_EXACT_TS
-#define POLL_INTERVAL 60 /* TSR poll interval in seconds; must be <=
-			    (2^31-1)/clock */
-#define TIMER_SHIFT 20 /* scale factor for fixed-point arithmetic;
-			    1 << TIMER_SHIFT must be
-			    (1) <= (2^64-1)/(POLL_INTERVAL*clock),
-			    (2) >> clock/10^6, and
-			    (3) <= (2^32-1)/1000 */
-#define ADJ_IGN_THRES 1000000 /* don't adjust if we're off by more than that
-			    many usecs - this filters clock corrections,
-			    time zone changes, etc. */
-#define ADJ_REP_THRES 20000 /* report only differences of more than that
-			    many usecs (don't mention single lost timer
-			    ticks; 10 msec is only 0.03% anyway) */
-#define ADJ_MSG_THRES 5 /* issue complaints only if getting that many
-			    significant timer differences in a row */
-#endif
-
-
 struct zatm_vcc {
 	/*-------------------------------- RX part */
 	int rx_chan; /* RX channel, 0 if none */
@@ -103,17 +78,6 @@
 	u32 pool_base; /* Free buffer pool dsc (word addr) */
 	/*-------------------------------- ZATM links */
 	struct atm_dev *more; /* other ZATM devices */
-#ifdef CONFIG_ATM_ZATM_EXACT_TS
-	/*-------------------------------- timestamp calculation */
-	u32 last_clk; /* results of last poll: clock, */
-	struct timeval last_time; /* virtual time and */
-	struct timeval last_real_time; /* real time */
-	u32 factor; /* multiplication factor */
-	int timer_diffs; /* number of significant deviations */
-	struct zatm_t_hist timer_history[ZATM_TIMER_HISTORY_SIZE];
-	/* record of timer synchronizations */
-	int th_curr; /* current position */
-#endif
 	/*-------------------------------- general information */
 	int mem; /* RAM on board (in bytes) */
 	int khz; /* timer clock */
diff -Nru a/include/linux/atm_zatm.h b/include/linux/atm_zatm.h
--- a/include/linux/atm_zatm.h	Tue Jul 29 13:15:41 2003
+++ b/include/linux/atm_zatm.h	Tue Jul 29 13:15:41 2003
@@ -21,9 +21,6 @@
 	/* get statistics and zero */
 #define ZATM_SETPOOL _IOW('a',ATMIOC_SARPRV+3,struct atmif_sioc)
 	/* set pool parameters */
-#define ZATM_GETTHIST _IOW('a',ATMIOC_SARPRV+4,struct atmif_sioc)
-	/* get a history of timer
-	   differences */
 struct zatm_pool_info {
 	int ref_count; /* free buffer pool usage counters */

From chas@locutus.cmf.nrl.navy.mil Fri Aug 1 07:02:13 2003
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 07:02:26 -0700 (PDT)
Received: from ginger.cmf.nrl.navy.mil (ginger.cmf.nrl.navy.mil [134.207.10.161]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71E2CFl021118 for ; Fri, 1 Aug 2003 07:02:12 -0700
Received: from locutus.cmf.nrl.navy.mil (locutus.cmf.nrl.navy.mil [134.207.10.66]) by ginger.cmf.nrl.navy.mil (8.12.7/8.12.7) with ESMTP id h6VEQgsG023826; Thu, 31 Jul 2003 10:26:42 -0400 (EDT)
Message-Id: <200307311426.h6VEQgsG023826@ginger.cmf.nrl.navy.mil>
To: Mitchell Blank Jr
cc: davem@redhat.com, netdev@oss.sgi.com
Reply-To: chas3@users.sourceforge.net
Subject: Re: [Linux-ATM-General] Re: [atmdrvr zatm] Remove obsolete EXACT_TS support
In-reply-to: Your message of "Wed, 30 Jul 2003 15:57:42 PDT."
	<20030730225741.GA57991@gaz.sfgoth.com>
Date: Thu, 31 Jul 2003 10:23:58 -0400
From: chas williams
X-Spam-Score: () hits=-0.3
X-Virus-Scanned: NAI Completed
X-Scanned-By: MIMEDefang 2.30 (www . roaringpenguin . com / mimedefang)
X-archive-position: 4420
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: chas@cmf.nrl.navy.mil
Precedence: bulk
X-list: netdev

please apply to 2.6. zatm will now compile on smp. it might actually
work if someone had some hardware to test it.

[atm]: [zatm] convert cli() to spinlock

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.1597 -> 1.1598
# drivers/atm/zatm.c 1.13 -> 1.14
# drivers/atm/uPD98402.c 1.4 -> 1.5
# drivers/atm/zatm.h 1.2 -> 1.3
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/07/31 chas@relax.cmf.nrl.navy.mil 1.1598
# [zatm] convert cli() to spinlock
# --------------------------------------------
#
diff -Nru a/drivers/atm/uPD98402.c b/drivers/atm/uPD98402.c
--- a/drivers/atm/uPD98402.c	Thu Jul 31 10:25:25 2003
+++ b/drivers/atm/uPD98402.c	Thu Jul 31 10:25:25 2003
@@ -27,6 +27,7 @@
 	struct k_sonet_stats sonet_stats;/* link diagnostics */
 	unsigned char framing; /* SONET/SDH framing */
 	int loop_mode; /* loopback mode */
+	spinlock_t lock;
 };
@@ -71,14 +72,13 @@
 		default:
 			return -EINVAL;
 	}
-	save_flags(flags);
-	cli();
+	spin_lock_irqsave(&PRIV(dev)->lock, flags);
 	PUT(set[0],C11T);
 	PUT(set[1],C12T);
 	PUT(set[2],C13T);
 	PUT((GET(MDR) & ~uPD98402_MDR_SS_MASK) |
 	    (set[3] << uPD98402_MDR_SS_SHIFT),MDR);
-	restore_flags(flags);
+	spin_unlock_irqrestore(&PRIV(dev)->lock, flags);
 	return 0;
 }
@@ -88,12 +88,11 @@
 	unsigned long flags;
 	unsigned char s[3];
-	save_flags(flags);
-	cli();
+	spin_lock_irqsave(&PRIV(dev)->lock, flags);
 	s[0] = GET(C11R);
 	s[1] = GET(C12R);
 	s[2] = GET(C13R);
-	restore_flags(flags);
+	spin_unlock_irqrestore(&PRIV(dev)->lock, flags);
 	return (put_user(s[0], arg) || put_user(s[1], arg+1) ||
 	    put_user(s[2], arg+2) || put_user(0xff, arg+3) ||
 	    put_user(0xff, arg+4) || put_user(0xff, arg+5)) ? -EFAULT : 0;
@@ -214,6 +213,7 @@
 	DPRINTK("phy_start\n");
 	if (!(PRIV(dev) = kmalloc(sizeof(struct uPD98402_priv),GFP_KERNEL)))
 		return -ENOMEM;
+	spin_lock_init(&PRIV(dev)->lock);
 	memset(&PRIV(dev)->sonet_stats,0,sizeof(struct k_sonet_stats));
 	(void) GET(PCR); /* clear performance events */
 	PUT(uPD98402_PFM_FJ,PCMR); /* ignore frequency adj */
diff -Nru a/drivers/atm/zatm.c b/drivers/atm/zatm.c
--- a/drivers/atm/zatm.c	Thu Jul 31 10:25:25 2003
+++ b/drivers/atm/zatm.c	Thu Jul 31 10:25:25 2003
@@ -195,11 +195,10 @@
 		    sizeof(struct rx_buffer_head);
 	}
 	size += align;
-	save_flags(flags);
-	cli();
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 	free = zpeekl(zatm_dev,zatm_dev->pool_base+2*pool) &
 	    uPD98401_RXFP_REMAIN;
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	if (free >= zatm_dev->pool_info[pool].low_water) return;
 	EVENT("starting ... POOL: 0x%x, 0x%x\n",
 	    zpeekl(zatm_dev,zatm_dev->pool_base+2*pool),
@@ -228,22 +227,22 @@
 		head->skb = skb;
 		EVENT("enq skb 0x%08lx/0x%08lx\n",(unsigned long) skb,
 		    (unsigned long) head);
-		cli();
+		spin_lock_irqsave(&zatm_dev->lock, flags);
 		if (zatm_dev->last_free[pool])
 			((struct rx_buffer_head *) (zatm_dev->last_free[pool]->
 			    data))[-1].link = virt_to_bus(head);
 		zatm_dev->last_free[pool] = skb;
 		skb_queue_tail(&zatm_dev->pool[pool],skb);
-		restore_flags(flags);
+		spin_unlock_irqrestore(&zatm_dev->lock, flags);
 		free++;
 	}
 	if (first) {
-		cli();
+		spin_lock_irqsave(&zatm_dev->lock, flags);
 		zwait;
 		zout(virt_to_bus(first),CER);
 		zout(uPD98401_ADD_BAT | (pool << uPD98401_POOL_SHIFT) | count,
 		    CMR);
-		restore_flags(flags);
+		spin_unlock_irqrestore(&zatm_dev->lock, flags);
 		EVENT ("POOL: 0x%x, 0x%x\n",
 		    zpeekl(zatm_dev,zatm_dev->pool_base+2*pool),
 		    zpeekl(zatm_dev,zatm_dev->pool_base+2*pool+1));
@@ -286,8 +285,7 @@
 	size = pool-ZATM_AAL5_POOL_BASE;
 	if (size < 0) size = 0; /* 64B... */
 	else if (size > 10) size = 10; /* ... 64kB */
-	save_flags(flags);
-	cli();
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 	zpokel(zatm_dev,((zatm_dev->pool_info[pool].low_water/4) <<
 	    uPD98401_RXFP_ALERT_SHIFT) |
 	    (1 << uPD98401_RXFP_BTSZ_SHIFT) |
@@ -295,7 +293,7 @@
 	    zatm_dev->pool_base+pool*2);
 	zpokel(zatm_dev,(unsigned long) dummy,zatm_dev->pool_base+
 	    pool*2+1);
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	zatm_dev->last_free[pool] = NULL;
 	refill_pool(dev,pool);
 }
@@ -315,29 +313,29 @@
 {
 	struct zatm_pool_info *pool;
 	unsigned long offset,flags;
+	struct zatm_dev *zatm_dev = ZATM_DEV(vcc->dev);
 	DPRINTK("start 0x%08lx dest 0x%08lx len %d\n",start,dest,len);
 	if (len < PAGE_SIZE) return;
-	pool = &ZATM_DEV(vcc->dev)->pool_info[ZATM_VCC(vcc)->pool];
+	pool = &zatm_dev->pool_info[ZATM_VCC(vcc)->pool];
 	offset = (dest-start) & (PAGE_SIZE-1);
-	save_flags(flags);
-	cli();
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 	if (!offset || pool->offset == offset) {
 		pool->next_cnt = 0;
-		restore_flags(flags);
+		spin_unlock_irqrestore(&zatm_dev->lock, flags);
 		return;
 	}
 	if (offset != pool->next_off) {
 		pool->next_off = offset;
 		pool->next_cnt = 0;
-		restore_flags(flags);
+		spin_unlock_irqrestore(&zatm_dev->lock, flags);
 		return;
 	}
 	if (++pool->next_cnt >= pool->next_thres) {
 		pool->offset = pool->next_off;
 		pool->next_cnt = 0;
 	}
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 }
 /*----------------------------------- RX ------------------------------------*/
@@ -535,20 +533,19 @@
 		zatm_vcc->pool = ZATM_AAL0_POOL;
 	}
 	if (zatm_vcc->pool < 0) return -EMSGSIZE;
-	save_flags(flags);
-	cli();
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 	zwait;
 	zout(uPD98401_OPEN_CHAN,CMR);
 	zwait;
 	DPRINTK("0x%x 0x%x\n",zin(CMR),zin(CER));
 	chan = (zin(CMR) & uPD98401_CHAN_ADDR) >> uPD98401_CHAN_ADDR_SHIFT;
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	DPRINTK("chan is %d\n",chan);
 	if (!chan) return -EAGAIN;
 	use_pool(vcc->dev,zatm_vcc->pool);
 	DPRINTK("pool %d\n",zatm_vcc->pool);
 	/* set up VC descriptor */
-	cli();
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 	zpokel(zatm_dev,zatm_vcc->pool << uPD98401_RXVC_POOL_SHIFT,
 	    chan*VC_SIZE/4);
 	zpokel(zatm_dev,uPD98401_RXVC_OD | (vcc->qos.aal == ATM_AAL5 ?
@@ -556,7 +553,7 @@
 	zpokel(zatm_dev,0,chan*VC_SIZE/4+2);
 	zatm_vcc->rx_chan = chan;
 	zatm_dev->rx_map[chan] = vcc;
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	return 0;
 }
@@ -572,14 +569,13 @@
 	zatm_dev = ZATM_DEV(vcc->dev);
 	zatm_vcc = ZATM_VCC(vcc);
 	if (!zatm_vcc->rx_chan) return 0;
-	save_flags(flags);
-	cli();
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 	/* should also handle VPI @@@ */
 	pos = vcc->vci >> 1;
 	shift = (1-(vcc->vci & 1)) << 4;
 	zpokel(zatm_dev,(zpeekl(zatm_dev,pos) & ~(0xffff << shift)) |
 	    ((zatm_vcc->rx_chan | uPD98401_RXLT_ENBL) << shift),pos);
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	return 0;
 }
@@ -596,9 +592,8 @@
 	if (!zatm_vcc->rx_chan) return;
 	DPRINTK("close_rx\n");
 	/* disable receiver */
-	save_flags(flags);
 	if (vcc->vpi != ATM_VPI_UNSPEC && vcc->vci != ATM_VCI_UNSPEC) {
-		cli();
+		spin_lock_irqsave(&zatm_dev->lock, flags);
 		pos = vcc->vci >> 1;
 		shift = (1-(vcc->vci & 1)) << 4;
 		zpokel(zatm_dev,zpeekl(zatm_dev,pos) & ~(0xffff << shift),pos);
@@ -606,9 +601,9 @@
 		zout(uPD98401_NOP,CMR);
 		zwait;
 		zout(uPD98401_NOP,CMR);
-		restore_flags(flags);
+		spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	}
-	cli();
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 	zwait;
 	zout(uPD98401_DEACT_CHAN | uPD98401_CHAN_RT | (zatm_vcc->rx_chan <<
 	    uPD98401_CHAN_ADDR_SHIFT),CMR);
@@ -620,7 +615,7 @@
 	if (!(zin(CMR) & uPD98401_CHAN_ADDR))
 		printk(KERN_CRIT DEV_LABEL "(itf %d): can't close RX channel "
 		    "%d\n",vcc->dev->number,zatm_vcc->rx_chan);
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	zatm_dev->rx_map[zatm_vcc->rx_chan] = NULL;
 	zatm_vcc->rx_chan = 0;
 	unuse_pool(vcc->dev,zatm_vcc->pool);
@@ -673,11 +668,10 @@
 	zatm_dev = ZATM_DEV(vcc->dev);
 	zatm_vcc = ZATM_VCC(vcc);
 	EVENT("iovcnt=%d\n",skb_shinfo(skb)->nr_frags,0);
-	save_flags(flags);
-	cli();
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 	if (!skb_shinfo(skb)->nr_frags) {
 		if (zatm_vcc->txing == RING_ENTRIES-1) {
-			restore_flags(flags);
+			spin_unlock_irqrestore(&zatm_dev->lock, flags);
 			return RING_BUSY;
 		}
 		zatm_vcc->txing++;
@@ -732,7 +726,7 @@
 	zwait;
 	zout(uPD98401_TX_READY | (zatm_vcc->tx_chan <<
 	    uPD98401_CHAN_ADDR_SHIFT),CMR);
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	EVENT("done\n",0,0);
 	return 0;
 }
@@ -866,15 +860,14 @@
 		if (zatm_dev->tx_bw < *pcr) return -EAGAIN;
 		zatm_dev->tx_bw -= *pcr;
 	}
-	save_flags(flags);
-	cli();
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 	DPRINTK("i = %d, m = %d, PCR = %d\n",i,m,*pcr);
 	zpokel(zatm_dev,(i << uPD98401_IM_I_SHIFT) | m,uPD98401_IM(shaper));
 	zpokel(zatm_dev,c << uPD98401_PC_C_SHIFT,uPD98401_PC(shaper));
 	zpokel(zatm_dev,0,uPD98401_X(shaper));
 	zpokel(zatm_dev,0,uPD98401_Y(shaper));
 	zpokel(zatm_dev,uPD98401_PS_E,uPD98401_PS(shaper));
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	return shaper;
 }
@@ -889,11 +882,10 @@
 		if (--zatm_dev->ubr_ref_cnt) return;
 		zatm_dev->ubr = -1;
 	}
-	save_flags(flags);
-	cli();
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 	zpokel(zatm_dev,zpeekl(zatm_dev,uPD98401_PS(shaper)) & ~uPD98401_PS_E,
 	    uPD98401_PS(shaper));
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	zatm_dev->free_shapers |= 1 << shaper;
 }
@@ -912,8 +904,6 @@
 	chan = zatm_vcc->tx_chan;
 	if (!chan) return;
 	DPRINTK("close_tx\n");
-	save_flags(flags);
-	cli();
 	while (skb_peek(&zatm_vcc->backlog)) {
 		if (once) {
 			printk("waiting for backlog to drain ...\n");
@@ -932,6 +922,7 @@
 		DPRINTK("waiting for TX queue to drain ... %p\n",skb);
 		sleep_on(&zatm_vcc->tx_wait);
 	}
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 #if 0
 	zwait;
 	zout(uPD98401_DEACT_CHAN | (chan << uPD98401_CHAN_ADDR_SHIFT),CMR);
@@ -942,7 +933,7 @@
 	if (!(zin(CMR) & uPD98401_CHAN_ADDR))
 		printk(KERN_CRIT DEV_LABEL "(itf %d): can't close TX channel "
 		    "%d\n",vcc->dev->number,chan);
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	zatm_vcc->tx_chan = 0;
 	zatm_dev->tx_map[chan] = NULL;
 	if (zatm_vcc->shaper != zatm_dev->ubr) {
@@ -967,14 +958,13 @@
 	zatm_vcc = ZATM_VCC(vcc);
 	zatm_vcc->tx_chan = 0;
 	if (vcc->qos.txtp.traffic_class == ATM_NONE) return 0;
-	save_flags(flags);
-	cli();
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 	zwait;
 	zout(uPD98401_OPEN_CHAN,CMR);
 	zwait;
 	DPRINTK("0x%x 0x%x\n",zin(CMR),zin(CER));
 	chan = (zin(CMR) & uPD98401_CHAN_ADDR) >> uPD98401_CHAN_ADDR_SHIFT;
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	DPRINTK("chan is %d\n",chan);
 	if (!chan) return -EAGAIN;
 	unlimited = vcc->qos.txtp.traffic_class == ATM_UBR &&
@@ -1022,15 +1012,14 @@
 	zatm_dev = ZATM_DEV(vcc->dev);
 	zatm_vcc = ZATM_VCC(vcc);
 	if (!zatm_vcc->tx_chan) return 0;
-	save_flags(flags);
 	/* set up VC descriptor */
-	cli();
+	spin_lock_irqsave(&zatm_dev->lock, flags);
 	zpokel(zatm_dev,0,zatm_vcc->tx_chan*VC_SIZE/4);
 	zpokel(zatm_dev,uPD98401_TXVC_L | (zatm_vcc->shaper <<
 	    uPD98401_TXVC_SHP_SHIFT) | (vcc->vpi << uPD98401_TXVC_VPI_SHIFT) |
 	    vcc->vci,zatm_vcc->tx_chan*VC_SIZE/4+1);
 	zpokel(zatm_dev,0,zatm_vcc->tx_chan*VC_SIZE/4+2);
-	restore_flags(flags);
+	spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	zatm_dev->tx_map[zatm_vcc->tx_chan] = vcc;
 	return 0;
 }
@@ -1236,6 +1225,7 @@
 	DPRINTK(">zatm_init\n");
 	zatm_dev = ZATM_DEV(dev);
+	spin_lock_init(&zatm_dev->lock);
 	pci_dev = zatm_dev->pci_dev;
 	zatm_dev->base = pci_resource_start(pci_dev, 0);
 	zatm_dev->irq = pci_dev->irq;
@@ -1285,14 +1275,13 @@
 	do {
 		unsigned long flags;
-		save_flags(flags);
-		cli();
+		spin_lock_irqsave(&zatm_dev->lock, flags);
 		t0 = zpeekl(zatm_dev,uPD98401_TSR);
 		udelay(10);
 		t1 = zpeekl(zatm_dev,uPD98401_TSR);
 		udelay(1010);
 		t2 = zpeekl(zatm_dev,uPD98401_TSR);
-		restore_flags(flags);
+		spin_unlock_irqrestore(&zatm_dev->lock, flags);
 	}
 	while (t0 > t1 || t1 > t2); /* loop if wrapping ... */
 	zatm_dev->khz = t2-2*t1+t0;
@@ -1492,14 +1481,13 @@
 					return -EFAULT;
 				if (pool < 0 || pool > ZATM_LAST_POOL)
 					return -EINVAL;
-				save_flags(flags);
-				cli();
+				spin_lock_irqsave(&zatm_dev->lock, flags);
 				info = zatm_dev->pool_info[pool];
 				if (cmd == ZATM_GETPOOLZ) {
 					zatm_dev->pool_info[pool].rqa_count = 0;
 					zatm_dev->pool_info[pool].rqu_count = 0;
 				}
-				restore_flags(flags);
+				spin_unlock_irqrestore(&zatm_dev->lock, flags);
 				return copy_to_user(
 				    &((struct zatm_pool_req *) arg)->info,
 				    &info,sizeof(info)) ? -EFAULT : 0;
@@ -1530,15 +1518,14 @@
 				if (info.low_water >= info.high_water ||
 				    info.low_water < 0)
 					return -EINVAL;
-				save_flags(flags);
-				cli();
+				spin_lock_irqsave(&zatm_dev->lock, flags);
 				zatm_dev->pool_info[pool].low_water =
 				    info.low_water;
 				zatm_dev->pool_info[pool].high_water =
 				    info.high_water;
 				zatm_dev->pool_info[pool].next_thres =
 				    info.next_thres;
-				restore_flags(flags);
+				spin_unlock_irqrestore(&zatm_dev->lock, flags);
 				return 0;
 			}
 		default:
diff -Nru a/drivers/atm/zatm.h b/drivers/atm/zatm.h
--- a/drivers/atm/zatm.h	Thu Jul 31 10:25:25 2003
+++ b/drivers/atm/zatm.h	Thu Jul 31 10:25:25 2003
@@ -85,6 +85,7 @@
 	unsigned char irq; /* IRQ */
 	unsigned int base; /* IO base address */
 	struct pci_dev *pci_dev; /* PCI stuff */
+	spinlock_t lock;
 };

From willy@www.linux.org.uk Fri Aug 1 08:02:35 2003
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 08:02:45 -0700 (PDT)
Received: from www.linux.org.uk (IDENT:zD29xXS/6K4bSxwPbxUDl3tjJbe2uGQJ@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71F2XFl030278 for ; Fri, 1 Aug 2003 08:02:35 -0700
Received: from willy by www.linux.org.uk with local (Exim 4.14) id 19ibQO-0003NL-OB for netdev@oss.sgi.com; Fri, 01 Aug 2003
	16:02:32 +0100
Date: Fri, 1 Aug 2003 16:02:32 +0100
From: Matthew Wilcox
To: netdev@oss.sgi.com
Subject: [PATCH] ethtool_ops rev 4
Message-ID: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4.1i
X-archive-position: 4421
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: willy@debian.org
Precedence: bulk
X-list: netdev

At 55k, I doubt you want to see it posted to the list; patch is available from
http://ftp.linux.org.uk/pub/linux/willy/patches/ethtool4.diff
and here's the diffstat

 drivers/net/8139too.c     | 330 ++++++++--------------
 drivers/net/tg3.c         | 584 ++++++++++++++++------------------------
 include/linux/ethtool.h   | 100 ++++++
 include/linux/netdevice.h | 5
 net/core/Makefile         | 4
 net/core/dev.c            | 16 -
 net/core/ethtool.c        | 671 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 1154 insertions(+), 556 deletions(-)

Patch has received light testing on an rtl8139c card:

Settings for eth0:
	Supported ports: [ TP MII ]
	Supported link modes: 10baseT/Half 10baseT/Full
	                      100baseT/Half 100baseT/Full
	Supports auto-negotiation: Yes
	Advertised link modes: 10baseT/Half 10baseT/Full
	                       100baseT/Half 100baseT/Full
	Advertised auto-negotiation: Yes
	Speed: 100Mb/s
	Duplex: Full
	Port: MII
	PHYAD: 32
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: pumbg
	Wake-on: d
	Current message level: 0xffffffff (-1)
	Link detected: yes

but obviously it doesn't support all the ethtool options that some cards do.

--
"It's not Hollywood. War is real, war is primarily not about defeat or
victory, it is about death. I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?"
 -- Robert Fisk

From garzik@gtf.org Fri Aug 1 08:40:29 2003
Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 08:40:34 -0700 (PDT)
Received: from havoc.gtf.org (host-64-213-145-173.atlantasolutions.com [64.213.145.173] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71FeSFl027853 for ; Fri, 1 Aug 2003 08:40:29 -0700
Received: by havoc.gtf.org (Postfix, from userid 500) id D82626698; Fri, 1 Aug 2003 11:40:21 -0400 (EDT)
Date: Fri, 1 Aug 2003 11:40:21 -0400
From: Jeff Garzik
To: Matthew Wilcox
Cc: netdev@oss.sgi.com
Subject: Re: [PATCH] ethtool_ops rev 4
Message-ID: <20030801154021.GA7696@gtf.org>
References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk>
User-Agent: Mutt/1.3.28i
X-archive-position: 4422
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: jgarzik@pobox.com
Precedence: bulk
X-list: netdev

On Fri, Aug 01, 2003 at 04:02:32PM +0100, Matthew Wilcox wrote:
> and here's the diffstat
>
>  drivers/net/8139too.c     | 330 ++++++++--------------
>  drivers/net/tg3.c         | 584 ++++++++++++++++------------------------
>  include/linux/ethtool.h   | 100 ++++++
>  include/linux/netdevice.h | 5
>  net/core/Makefile         | 4
>  net/core/dev.c            | 16 -
>  net/core/ethtool.c        | 671 ++++++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 1154 insertions(+), 556 deletions(-)

Comments:

* need SET_ETHTOOL_OPS macro or HAVE_ETHTOOL_OPS test macro or similar

* I still do not see the need to change a simple storage of a constant
(into ethtool_gdrvinfo) into _four_ separate function call hooks (reg
dump len, eeprom dump len, nic-specific stats len, self-test len).
Internal kernel code that needs this information is always a slow path
anyway, so just call the ->get_drvinfo hook internally.

* I prefer not to add '#include ' to ethtool.h Other than those, looks real good. Jeff From jmorris@intercode.com.au Fri Aug 1 08:51:17 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 08:51:20 -0700 (PDT) Received: from blackbird.intercode.com.au (IDENT:TamECck9nHItRCfvPti5PCOPYy+eE6V7@blackbird.intercode.com.au [203.32.101.10]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71FpEFl029615 for ; Fri, 1 Aug 2003 08:51:16 -0700 Received: from excalibur.intercode.com.au (excalibur.intercode.com.au [203.32.101.12]) by blackbird.intercode.com.au (8.11.6p2/8.9.3) with ESMTP id h71Fowr27206; Sat, 2 Aug 2003 01:50:58 +1000 Date: Sat, 2 Aug 2003 01:50:57 +1000 (EST) From: James Morris To: Zwane Mwaikambo cc: netdev@oss.sgi.com Subject: Re: oops in raw_rcv_skb In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4423 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jmorris@intercode.com.au Precedence: bulk X-list: netdev On Fri, 1 Aug 2003, Zwane Mwaikambo wrote: > You can reproduce this one easily by doing 5-6 ping -f of a system on the > network (not loopback), originally picked up at http://bugme.osdl.org/show_bug.cgi?id=937 Any chance of getting a gdb traceback on this one? 
:-) - James -- James Morris From garzik@gtf.org Fri Aug 1 09:25:42 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 09:25:48 -0700 (PDT) Received: from havoc.gtf.org (host-64-213-145-173.atlantasolutions.com [64.213.145.173] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71GPfFl001543 for ; Fri, 1 Aug 2003 09:25:42 -0700 Received: by havoc.gtf.org (Postfix, from userid 500) id 6EBE76696; Fri, 1 Aug 2003 12:25:36 -0400 (EDT) Date: Fri, 1 Aug 2003 12:25:36 -0400 From: Jeff Garzik To: Matthew Wilcox Cc: netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-ID: <20030801162536.GA18574@gtf.org> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> User-Agent: Mutt/1.3.28i X-archive-position: 4424 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev On Fri, Aug 01, 2003 at 04:46:56PM +0100, Matthew Wilcox wrote: > On Fri, Aug 01, 2003 at 11:40:21AM -0400, Jeff Garzik wrote: > > Comments: > > > > * need SET_ETHTOOL_OPS macro or HAVE_ETHTOOL_OPS test macro or similar > > DaveM disagreed with that... It's standard netdevice.h practice, and, he didn't disagree w/ my rebuttal. It is needed. > > * I still do not see the need to change a simple storage of a constant > > (into ethtool_gdrvinfo) into _four_ separate function call hooks (reg > > dump len, eeprom dump len, nic-specific stats len, self-test len). > > Internal kernel code that needs this information is always a slow path > > anyway, so just call the ->get_drvinfo hook internally. > > slow path, sure, but increased stack usage. it's a tradeoff, and this way > feels more clean to me. 
Adding a function hook each time you want to retrieve a new integer value? That feels overly excessive to me. > > * I prefer not to add '#include ' to ethtool.h > > That means that any code which includes ethtool.h has to include types.h > first (either implicitly or explicitly). The rule so far has been that > header files should call out their dependencies explicitly with an include > of the appropriate file. So why *don't* you want it? Because I copy it to userspace :) Jeff From zwane@arm.linux.org.uk Fri Aug 1 09:26:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 09:26:30 -0700 (PDT) Received: from hemi.commfireservices.com ([66.212.224.118]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71GQMFl002069 for ; Fri, 1 Aug 2003 09:26:23 -0700 Received: from montezuma.mastecende.com (cuda.commfireservices.com [24.202.53.9]) by hemi.commfireservices.com (Postfix) with ESMTP id 0AB23BC54; Fri, 1 Aug 2003 12:15:16 -0400 (EDT) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by montezuma.mastecende.com (8.12.8/8.12.8) with ESMTP id h71GEftE031939; Fri, 1 Aug 2003 12:14:42 -0400 Date: Fri, 1 Aug 2003 12:14:41 -0400 (EDT) From: Zwane Mwaikambo X-X-Sender: zwane@montezuma.mastecende.com To: James Morris Cc: netdev@oss.sgi.com Subject: Re: oops in raw_rcv_skb In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4425 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: zwane@arm.linux.org.uk Precedence: bulk X-list: netdev On Sat, 2 Aug 2003, James Morris wrote: > On Fri, 1 Aug 2003, Zwane Mwaikambo wrote: > > > You can reproduce this one easily by doing 5-6 ping -f of a system on the > > network (not loopback), originally picked up at http://bugme.osdl.org/show_bug.cgi?id=937 > > Any chance of getting a gdb traceback on this one? :-) Here is a new oops with the corresponding code.
2.6.0-test2-mm2 (gdb) list *raw_rcv_skb+0x1b5 0xc04e2235 is in raw_rcv_skb (sock.h:942). 937 938 skb->dev = NULL; 939 skb_set_owner_r(skb, sk); 940 skb_queue_tail(&sk->sk_receive_queue, skb); 941 if (!sock_flag(sk, SOCK_DEAD)) 942 sk->sk_data_ready(sk, skb->len); 943 out: 944 return err; 945 } Unable to handle kernel paging request at virtual address c3148068 printing eip: c04e2235 *pde = 0000d067 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010246 EIP is at raw_rcv_skb+0x1b5/0x270 eax: 00000000 ebx: 00000104 ecx: 00000104 edx: 00000001 esi: c7fae004 edi: 00000000 ebp: c3148004 esp: cbf4fecc ds: 007b es: 007b ss: 0068 Process ksoftirqd/0 (pid: 3, threadinfo=cbf4e000 task=cbf81000) Stack: c7fae06c cbf4e000 00000206 00000000 c3148000 0000005a c3148004 c96e7024 c7fae004 cab51004 c04e237c c7fae004 c3148004 00000020 c7fae004 c96e7024 c04e1ead c7fae004 c3148004 00000001 ca214004 cab51004 0a00a8c0 c04bd389 Call Trace: [] raw_rcv+0x8c/0xe0 [] raw_v4_input+0xbd/0x150 [] ip_local_deliver+0xc9/0x270 [] ip_rcv+0x37c/0x4e0 [] netif_receive_skb+0x153/0x1d0 [] process_backlog+0x87/0x160 [] net_rx_action+0x84/0x160 [] do_softirq+0xd3/0xe0 [] ksoftirqd+0xbc/0x100 [] ksoftirqd+0x0/0x100 [] kernel_thread_helper+0x5/0x10 Code: 43 86 56 68 ff 74 24 08 9d 8b 54 24 04 8b 5a 14 4b 89 5a 14 8b 42 08 83 e0 08 <0>Kernel panic: Fatal exception in interrupt In interrupt handler - not syncing -- function.linuxpower.ca From zwane@arm.linux.org.uk Fri Aug 1 09:29:54 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 09:29:57 -0700 (PDT) Received: from hemi.commfireservices.com ([66.212.224.118]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71GTrFl002789 for ; Fri, 1 Aug 2003 09:29:53 -0700 Received: from montezuma.mastecende.com (cuda.commfireservices.com [24.202.53.9]) by hemi.commfireservices.com (Postfix) with ESMTP id 38570BC56; Fri, 1 Aug 2003 12:18:47 -0400 (EDT) Received: from localhost.localdomain (localhost.localdomain 
[127.0.0.1]) by montezuma.mastecende.com (8.12.8/8.12.8) with ESMTP id h71GIDtE031957; Fri, 1 Aug 2003 12:18:13 -0400 Date: Fri, 1 Aug 2003 12:18:13 -0400 (EDT) From: Zwane Mwaikambo X-X-Sender: zwane@montezuma.mastecende.com To: James Morris Cc: netdev@oss.sgi.com Subject: Re: oops in raw_rcv_skb In-Reply-To: Message-ID: References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4426 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: zwane@arm.linux.org.uk Precedence: bulk X-list: netdev On Fri, 1 Aug 2003, Zwane Mwaikambo wrote: > Here is a new oops with the corresponding code. 2.6.0-test2-mm2 > > (gdb) list *raw_rcv_skb+0x1b5 > 0xc04e2235 is in raw_rcv_skb (sock.h:942). > 937 > 938 skb->dev = NULL; > 939 skb_set_owner_r(skb, sk); > 940 skb_queue_tail(&sk->sk_receive_queue, skb); > 941 if (!sock_flag(sk, SOCK_DEAD)) > 942 sk->sk_data_ready(sk, skb->len); > 943 out: > 944 return err; > 945 } seems to be the same bug as the previous one i posted. 
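A note on the listing: line 942 reads `skb->len` after line 940 has already placed the skb on the receive queue. One possible explanation, which this thread does not confirm, is that a reader on another CPU dequeues and frees the skb in between, so the late read touches freed memory (with DEBUG_PAGEALLOC such an access would show up as a paging fault). The userspace sketch below only illustrates the general pattern of copying the length before handing the buffer off. `struct pkt`, `struct queue`, `record_len` and `queue_and_notify` are invented names; this is not the kernel code and not a proposed patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for an skb and a socket receive queue. */
struct pkt {
    size_t len;
    struct pkt *next;
};

struct queue {
    struct pkt *head;
    struct pkt *tail;
    void (*data_ready)(size_t len);   /* notification callback */
};

static size_t last_notified_len;

/* Example callback: remember the length we were told about. */
void record_len(size_t len)
{
    last_notified_len = len;
}

/* Append p to q and notify the reader.
 * The length is copied while the caller still owns p; once p is
 * linked into the queue a consumer could in principle take and
 * free it, so p is not dereferenced after that point. */
void queue_and_notify(struct queue *q, struct pkt *p)
{
    size_t len;

    assert(q != NULL && p != NULL);
    len = p->len;                     /* read before publishing */

    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;                      /* ownership handed off here */

    if (q->data_ready)
        q->data_ready(len);           /* saved copy, not p->len */
}
```

In a real concurrent queue the linking step would also need locking; the sketch leaves that out to keep the ordering of the read visible.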
From nebuchadnezzar@nerim.net Fri Aug 1 10:53:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 10:53:43 -0700 (PDT) Received: from cerbere (nebuchadnezzar.net1.nerim.net [213.41.153.130]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71HrVFl009240 for ; Fri, 1 Aug 2003 10:53:35 -0700 Received: from [2001:7a8:5982:1:209:5bff:fe1c:f0b8] (helo=zion.matrix) by cerbere with esmtp (Exim 4.20) id 19ie5k-0000Lz-JF for netdev@oss.sgi.com; Fri, 01 Aug 2003 19:53:24 +0200 Received: from localhost ([::1] helo=zion.nerim.net) by zion.matrix with esmtp (Exim 4.20) id 19ie5k-0007tc-0j for netdev@oss.sgi.com; Fri, 01 Aug 2003 19:53:24 +0200 To: netdev@oss.sgi.com Subject: [PATCH] 2.4.x USAGI mipv6_ha_ipsec From: "Daniel 'NebuchadnezzaR' Dehennin" Organisation: CaLviX Date: Fri, 01 Aug 2003 19:53:23 +0200 Message-ID: <87n0etgt7w.fsf@zion.matrix> User-Agent: Gnus/5.1002 (Gnus v5.10.2) Emacs/21.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 4427 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nebuchadnezzar@nerim.net Precedence: bulk X-list: netdev Hello, My 2.4.21 tree with USAGI 20030801 does not build /net/ipv6/mobile_ip6/mipv6_ha_ipsec.c cleanly: mipv6_ha_ipsec.c: In function `mipv6_change_sa_index': mipv6_ha_ipsec.c:118: warning: implicit declaration of function `in6_ntop' mipv6_ha_ipsec.c:118: warning: format argument is not a pointer (arg 4) mipv6_ha_ipsec.c:119: warning: format argument is not a pointer (arg 4) mipv6_ha_ipsec.c:126: warning: format argument is not a pointer (arg 4) mipv6_ha_ipsec.c:127: warning: format argument is not a pointer (arg 4) [...] I searched for the definition of in6_ntop; it is in include/linux/inet.h, so I made this patch. Thanks.
--- linux-2.4.21/net/ipv6/mobile_ip6/mipv6_ha_ipsec.c.orig 2003-08-01 19:37:22.000000000 +0200 +++ linux-2.4.21/net/ipv6/mobile_ip6/mipv6_ha_ipsec.c 2003-08-01 19:03:42.000000000 +0200 @@ -62,6 +62,7 @@ #include #include #include +#include #include #include #include -- Daniel 'NebuchadnezzaR' Dehennin From nebuchadnezzar@nerim.net Fri Aug 1 11:09:33 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 11:09:36 -0700 (PDT) Received: from cerbere (nebuchadnezzar.net1.nerim.net [213.41.153.130]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71I9VFl010664 for ; Fri, 1 Aug 2003 11:09:32 -0700 Received: from zion.matrix ([2001:7a8:5982:1:209:5bff:fe1c:f0b8]) by cerbere with esmtp (Exim 4.20) id 19ieLF-0000M2-NW for netdev@oss.sgi.com; Fri, 01 Aug 2003 20:09:25 +0200 Received: from localhost ([::1] helo=zion.nerim.net) by zion.matrix with esmtp (Exim 4.20) id 19ieLF-0007xu-Db for netdev@oss.sgi.com; Fri, 01 Aug 2003 20:09:25 +0200 To: Linux Networking List Subject: [PATCH 2] 2.4.x USAGI unused variables in mipv6_ha_ipsec.c From: "Daniel 'NebuchadnezzaR' Dehennin" Organisation: CaLviX Date: Fri, 01 Aug 2003 20:09:25 +0200 Message-ID: <87fzklgsh6.fsf@zion.matrix> User-Agent: Gnus/5.1002 (Gnus v5.10.2) Emacs/21.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 4428 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nebuchadnezzar@nerim.net Precedence: bulk X-list: netdev Hello again ;-), A patch to remove unused variables : mipv6_ha_ipsec.c: In function `__mipv6_mn_change_tunnel_ipsec_by_proto': mipv6_ha_ipsec.c:216: warning: unused variable `ret' mipv6_ha_ipsec.c: In function `__mipv6_ha_change_tunnel_ipsec_by_proto': mipv6_ha_ipsec.c:338: warning: unused variable `ret' See you. 
--- linux-2.4.21/net/ipv6/mobile_ip6/mipv6_ha_ipsec.c.orig 2003-08-01 20:06:15.000000000 +0200 +++ linux-2.4.21/net/ipv6/mobile_ip6/mipv6_ha_ipsec.c 2003-08-01 20:06:42.000000000 +0200 @@ -213,7 +213,6 @@ int __mipv6_mn_change_tunnel_ipsec_by_pr struct in6_addr dst; struct in6_addr src; struct in6_addr *coa = &entry->coa; - int ret = 0; /* * Phase 1: Change the following SA/SPD @@ -335,7 +334,6 @@ int __mipv6_ha_change_tunnel_ipsec_by_pr struct in6_addr dst; struct in6_addr src; struct in6_addr *coa = &entry->coa; - int ret = 0; /* * Phase 1: Change the following SA/SPD -- Daniel 'NebuchadnezzaR' Dehennin From willy@www.linux.org.uk Fri Aug 1 12:17:11 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 12:17:20 -0700 (PDT) Received: from www.linux.org.uk (IDENT:dWRqvOFfILtpyOjLUmme1m8+W8uXBtM7@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71JH9Fl005257 for ; Fri, 1 Aug 2003 12:17:10 -0700 Received: from willy by www.linux.org.uk with local (Exim 4.14) id 19ic7M-0004BK-Kb; Fri, 01 Aug 2003 16:46:56 +0100 Date: Fri, 1 Aug 2003 16:46:56 +0100 From: Matthew Wilcox To: Jeff Garzik Cc: Matthew Wilcox , netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-ID: <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030801154021.GA7696@gtf.org> User-Agent: Mutt/1.4.1i X-archive-position: 4429 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: willy@debian.org Precedence: bulk X-list: netdev On Fri, Aug 01, 2003 at 11:40:21AM -0400, Jeff Garzik wrote: > Comments: > > * need SET_ETHTOOL_OPS macro or HAVE_ETHTOOL_OPS test macro or similar DaveM disagreed with that... 
> * I still do not see the need to change a simple storage of a constant > (into ethtool_gdrvinfo) into _four_ separate function call hooks (reg > dump len, eeprom dump len, nic-specific stats len, self-test len). > Internal kernel code that needs this information is always a slow path > anyway, so just call the ->get_drvinfo hook internally. slow path, sure, but increased stack usage. it's a tradeoff, and this way feels more clean to me. > * I prefer not to add '#include ' to ethtool.h That means that any code which includes ethtool.h has to include types.h first (either implicitly or explicitly). The rule so far has been that header files should call out their dependencies explictly with an include of the appropriate file. So why *don't* you want it? -- "It's not Hollywood. War is real, war is primarily not about defeat or victory, it is about death. I've seen thousands and thousands of dead bodies. Do you think I want to have an academic debate on this subject?" -- Robert Fisk From garzik@gtf.org Fri Aug 1 12:44:29 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 12:44:38 -0700 (PDT) Received: from havoc.gtf.org (host-64-213-145-173.atlantasolutions.com [64.213.145.173] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71JiRFl006741 for ; Fri, 1 Aug 2003 12:44:28 -0700 Received: by havoc.gtf.org (Postfix, from userid 500) id 458486698; Fri, 1 Aug 2003 15:44:22 -0400 (EDT) Date: Fri, 1 Aug 2003 15:44:20 -0400 From: Jeff Garzik To: torvalds@osdl.org Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: [BK PATCHES] 2.6.x net driver merges Message-ID: <20030801194420.GD3571@gtf.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-archive-position: 4430 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Linus, please do a bk pull 
bk://gkernel.bkbits.net/net-drivers-2.5 Others may download the patch from ftp://ftp.??.kernel.org/pub/linux/kernel/people/jgarzik/patchkits/2.6/2.6.0-test2-netdrvr1.patch.bz2 This will update the following files: Documentation/networking/bonding.txt | 343 ++++++++++++++++++++++++----------- Documentation/networking/ifenslave.c | 3 drivers/net/arcnet/com20020-isa.c | 2 drivers/net/tokenring/ibmtr.c | 3 drivers/net/wireless/airo.c | 104 +++++++--- 5 files changed, 309 insertions(+), 146 deletions(-) through these ChangeSets: (03/08/01 1.1547.8.10) Cset exclude: jgarzik@redhat.com|ChangeSet|20030731201437|53548 My fix was wrong, and, mainline now has a better fix. (03/07/31 1.1547.8.9) [tokenring ibmtr_cs] fix build, due to missing ibmtr.c build Note: Better fix is needed. Contributed by Mike Phillips. (03/07/31 1.1547.8.8) [arcnet com20020-isa] fix build broken by lack of ->owner (03/07/31 1.1547.8.7) [netdrvr bonding] fix ifenslave build on ia64 Forward port from 2.4. (03/07/31 1.1547.8.6) [netdrvr bonding] update docs (03/07/29 1.1547.8.5) [wireless airo] adds support for noise level reporting (if available) (03/07/29 1.1547.8.4) [wireless airo] makes the card passive when entering monitor mode (03/07/29 1.1547.8.3) [wireless airo] eliminate infinite loop makes sure a possible (never happened, but just in case) infinite loop in the transmission code terminates. 
(03/07/29 1.1547.8.2) [wireless airo] safer shutdown sequence changes the card shutdown sequence to a safer one (03/07/29 1.1547.8.1) [wireless airo] fix Tx race From davem@redhat.com Fri Aug 1 13:24:31 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 13:24:39 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71KOUFl008291 for ; Fri, 1 Aug 2003 13:24:31 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id NAA07798; Fri, 1 Aug 2003 13:20:37 -0700 Date: Fri, 1 Aug 2003 13:20:37 -0700 From: "David S. Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801132037.3f3542ae.davem@redhat.com> In-Reply-To: <20030801162536.GA18574@gtf.org> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4431 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 1 Aug 2003 12:25:36 -0400 Jeff Garzik wrote: > On Fri, Aug 01, 2003 at 04:46:56PM +0100, Matthew Wilcox wrote: > > On Fri, Aug 01, 2003 at 11:40:21AM -0400, Jeff Garzik wrote: > > > Comments: > > > > > > * need SET_ETHTOOL_OPS macro or HAVE_ETHTOOL_OPS test macro or similar > > > > DaveM disagreed with that... > > It's standard netdevice.h practice, and, he didn't disagree w/ my > rebuttal. > > It is needed. Absolutely not, it makes no sense whatsoever to have this. Jeff, stop and think. 
The whole _POINT_ of these ops is to avoid duplicated code. If someone is absolutely adamant about supporting kernels without ops support they should not support it at all. The point is to avoid code duplication, but what you suggest can only be used to keep the duplicated code around "just in case". This makes exactly no sense at all; it serves only to defeat the whole purpose of the change in the first place. I am totally against making an ifdef test available for this; it can only result in illogical things being done by driver maintainers. From jgarzik@pobox.com Fri Aug 1 15:35:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 15:35:49 -0700 (PDT) Received: from www.linux.org.uk (IDENT:rg5SenHWUdrUH12rD8TwSji2gubj2u02@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71MZiFl015583 for ; Fri, 1 Aug 2003 15:35:44 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19iiUw-0000tp-Gp; Fri, 01 Aug 2003 23:35:42 +0100 Message-ID: <3F2AEB33.9050506@pobox.com> Date: Fri, 01 Aug 2003 18:35:31 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S.
Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> In-Reply-To: <3F2AE91D.5090705@pobox.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4432 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Jeff Garzik wrote: > It's an explicit goal to avoid changing the driver API in such a way > that there is a remotely sane path to supporting older kernels. I, of course, meant the exact opposite here :) We want to provide a sane, ifdef-free path to kcompat, where feasible. From davem@redhat.com Fri Aug 1 15:36:55 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 15:37:00 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71MatFl015832 for ; Fri, 1 Aug 2003 15:36:55 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id PAA08138; Fri, 1 Aug 2003 15:32:55 -0700 Date: Fri, 1 Aug 2003 15:32:55 -0700 From: "David S. 
Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801153255.204baf66.davem@redhat.com> In-Reply-To: <3F2AE91D.5090705@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4433 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 18:26:37 -0400 Jeff Garzik wrote: > Strangely enough, creating a SET_ETHTOOL_OPS() macro (or > netif_ethtool_ops or pick your name) reduces ifdefs. And then we'll have all of these functions present in the driver, unused, and we'll get tons of warnings from gcc. The duplication of code is still there, and this is the main point. > I feel that I've helped shepherd the net driver and PCI APIs to maintain > something fairly interesting: It's not interesting in this case. > It's an explicit goal to avoid changing the driver API in such a way > that there is a remotely sane path to supporting older kernels. This enhancement we're talking about basically has no value unless you accept an appearance of breakage in this particular area. You can't get rid of the duplicated code without accepting that you will have separate 2.6.x and 2.4.x strains of your driver. If you aren't willing to accept separate strains of your driver, you simply don't use netdev_ops. It is the end of the conversation. > the few things that is not easily work-around-able is new additions to > existing structures (which wouldn't exist in older kernels).
That's > what SET_ETHTOOL_OPS would wrap, while also providing a trigger for > generic compat glue. What gets rid of the static functions that do the work when SET_ETHTOOL_OPS() is a nop? I do not accept a scheme where the functions stay there in the driver anyways. All you seem to be talking about is a compat library which provides netdev_ops in library form or something silly like that. > This (IMO) feature continually saves me real time I don't argue that, just don't use netdev_ops in drivers you wish to keep doing this with :-) Look at drivers/net/acenic.c, that's similar to what your drivers will begin to look like if you don't start accepting a disconnect in certain areas. From davem@redhat.com Fri Aug 1 15:38:33 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 15:38:36 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71McWFl016352 for ; Fri, 1 Aug 2003 15:38:32 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id PAA08157; Fri, 1 Aug 2003 15:34:39 -0700 Date: Fri, 1 Aug 2003 15:34:39 -0700 From: "David S. 
Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801153439.4a324c36.davem@redhat.com> In-Reply-To: <3F2AEB33.9050506@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <3F2AEB33.9050506@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4434 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 18:35:31 -0400 Jeff Garzik wrote: > We want to provide a sane, ifdef-free path to kcompat, where feasible. I don't believe it's possible with netdev_ops, without undoing the entire purpose of what netdev_ops is trying to accomplish (elimination of code duplication). Show me, in code not words, how you are able to accomplish this with SET_NETDEV_OPS() or whatever. 
I will not read English text describing the scheme, I will read only code :) From greearb@candelatech.com Fri Aug 1 15:55:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 15:55:07 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71Mt1Fl017421 for ; Fri, 1 Aug 2003 15:55:02 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h71Msttf013879 for ; Fri, 1 Aug 2003 15:54:56 -0700 Message-ID: <3F2AEFBF.3040604@candelatech.com> Date: Fri, 01 Aug 2003 15:54:55 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "'netdev@oss.sgi.com'" Subject: 2.4.21: bug report for tg3: tx lockup when changing MTU Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4435 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev I just noticed that if you change the MTU of a tg3 NIC, it will not work until you ifdown/ifup it. This problem is repeatable on tg3, and does not happen with the e1000 driver/cards. I am setting the MTU via an ioctl call, not via ifconfig or something like that.
When the tg3 is locked up, I see this on the console: Aug 1 15:05:44 demo2 kernel: NETDEV WATCHDOG: eth5: transmit timed out Aug 1 15:05:44 demo2 kernel: tg3: eth5: transmit timed out, resetting Aug 1 15:05:44 demo2 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2 Aug 1 15:05:44 demo2 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2 Aug 1 15:05:44 demo2 kernel: tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2 Aug 1 15:05:54 demo2 kernel: NETDEV WATCHDOG: eth5: transmit timed out Aug 1 15:05:54 demo2 kernel: tg3: eth5: transmit timed out, resetting Aug 1 15:05:54 demo2 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2 Aug 1 15:05:54 demo2 kernel: tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2 ... Kernel is 2.4.21 + custom patches (which should not affect tg3). lspci says the NIC is: Altima AC9100 (rev 15) I will be happy to provide more information as needed. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From jgarzik@pobox.com Fri Aug 1 16:01:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:01:40 -0700 (PDT) Received: from www.linux.org.uk (IDENT:Zn4088h0c9junvtyMz48dB3BmWWWI3H8@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71N1YFl018056 for ; Fri, 1 Aug 2003 16:01:35 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19iitw-000162-PQ; Sat, 02 Aug 2003 00:01:32 +0100 Message-ID: <3F2AF141.2010308@pobox.com> Date: Fri, 01 Aug 2003 19:01:21 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. 
Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <20030801153255.204baf66.davem@redhat.com> In-Reply-To: <20030801153255.204baf66.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4436 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Fri, 01 Aug 2003 18:26:37 -0400 > Jeff Garzik wrote: > > >>Strangely enough, creating a SET_ETHTOOL_OPS() macro (or >>netif_ethtool_ops or pick your name) reduces ifdefs. > > > And then we'll have all of these functions present in > the driver, unused, and we'll get tons of warning from > gcc. > > The duplication of code is still there, and this is the > main point. Not correct: there is nothing unused, there are no warnings, in either the in-kernel case or the older-kernel case. Look at kcompat. That is code that is working, and producing the 2.4/2.6-ready vendor drivers I spoke of. I'm apparently not communicating the design that exists in kcompat, if you think this. The design is: code for 2.6, and it magically works in 2.4 It's a back-compat system that is so good you don't even know it's there. It's completely invisible to the mainline kernel -- as it should be -- presuming that one pays attention to subtle API change effects. Do you see yet how there is no code duplication, no ifdefs, no warnings about unused functions? That is the key point of the whole design, and key to the thread of discussion here. 
> You can't get rid of the duplicated code without accepting that you > will have seperate 2.6.x and 2.4.x strains of your driver. > > If you aren't willing to accept seperate strains of your driver, you > simply don't use netdev_ops. Look at kcompat. That is real, working code that demonstrates the approach. >>the few things that is not easily work-around-able is new additions to >>existing structures (which wouldn't exist in older kernels). That's >>what SET_ETHTOOL_OPS would wrap, while also providing a trigger for >>generic compat glue. > > > What gets rid of the static functions that do the work when > SET_ETHTOOL_OPS() is a nop? SET_ETHTOOL_OPS is never a no-op. The back-compat form of SET_ETHTOOL_OPS registers the ethtool_ops pointer in storage for later use. A DO_ETHTOOL_OPS macro in the driver's ->do_ioctl -- intentionally not included in the kernel -- does the rest, calling kcompat's backported net/core/ethtool.c, which in turn calls the ethtool_ops hooks in the driver. Making the kcompat'd net driver ready for 2.6 would then involve simply deleting one line. That's why there is no code duplication or unused driver code. Jeff From davem@redhat.com Fri Aug 1 16:05:34 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:05:37 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71N5XFl018589 for ; Fri, 1 Aug 2003 16:05:33 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA08249; Fri, 1 Aug 2003 16:01:36 -0700 Date: Fri, 1 Aug 2003 16:01:36 -0700 From: "David S. 
Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801160136.3342c5cc.davem@redhat.com> In-Reply-To: <3F2AF141.2010308@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <20030801153255.204baf66.davem@redhat.com> <3F2AF141.2010308@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4437 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 19:01:21 -0400 Jeff Garzik wrote: > A DO_ETHTOOL_OPS macro in the driver's ->do_ioctl -- intentionally not > included in the kernel -- does the rest, I don't understand. Where does this DO_ETHTOOL_OPS macro come from? Is it defined by kcompat? If so, how will drivers in vanilla 2.4.x trees end up with the DO_ETHTOOL_OPS define? From davem@redhat.com Fri Aug 1 16:12:50 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:12:52 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NCnFl019319 for ; Fri, 1 Aug 2003 16:12:49 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA08283; Fri, 1 Aug 2003 16:08:57 -0700 Date: Fri, 1 Aug 2003 16:08:57 -0700 From: "David S. 
Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801160857.32ebbf22.davem@redhat.com> In-Reply-To: <3F2AF32F.7090201@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <3F2AEB33.9050506@pobox.com> <20030801153439.4a324c36.davem@redhat.com> <3F2AF32F.7090201@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4438 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 19:09:35 -0400 Jeff Garzik wrote: > #define SET_ETHTOOL_OPS kcompat_set_ethtool_ops > > #define DO_ETHTOOL_OPS /* duplicate net/core/ethtool.c, basically */ Where does kcompat_set_ethtool_ops store the pointer if it does not exist in struct netdevice? From jgarzik@pobox.com Fri Aug 1 16:18:11 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:18:15 -0700 (PDT) Received: from www.linux.org.uk (IDENT:iDc7ycOqp9NNPy2+dMfDWg8UaR/Y+gOS@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NIAFl019905 for ; Fri, 1 Aug 2003 16:18:10 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19ijA0-0001GK-DD; Sat, 02 Aug 2003 00:18:08 +0100 Message-ID: <3F2AF525.3000605@pobox.com> Date: Fri, 01 Aug 2003 19:17:57 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. 
Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <20030801153255.204baf66.davem@redhat.com> <3F2AF141.2010308@pobox.com> <20030801160136.3342c5cc.davem@redhat.com> In-Reply-To: <20030801160136.3342c5cc.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4439 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Fri, 01 Aug 2003 19:01:21 -0400 > Jeff Garzik wrote: > > >>A DO_ETHTOOL_OPS macro in the driver's ->do_ioctl -- intentionally not >>included in the kernel -- does the rest, > > > I don't understand. > > Where does this DO_ETHTOOL_OPS macro come from? Is it defined > by kcompat? If so, how will drivers in vanilla 2.4.x trees end > up with the DO_ETHTOOL_OPS define? If one wishes to implement kcompat design ("it looks like a 2.6 driver"), then you have two needs over and above Matthew's current ethtool_ops patch: (1) naked struct deref of netdev->ethtool_ops will break immediately on older kernels, and (2) to avoid code duplication, you need to insert a call to kcompat's do_ethtool_handling_the_old_way... i.e. basically what net/core/ethtool.c does now. Problem #1 is solved with a wrapper macro that disguises the naked struct deref to ->ethtool_ops. Problem #2 is solved by adding a call to DO_ETHTOOL_OPS macro in a driver's ->do_ioctl handler. So, with those two minor changes, a 2.6 driver will work on an older kernel. 
To answer your question above, DO_ETHTOOL_OPS can occur one of two ways: (1) my preferred approach, define a no-op DO_ETHTOOL_OPS macro in-kernel -- but I did not think this would get accepted, so I chose (2) DO_ETHTOOL_OPS exists entirely in kcompat, and people submitting kcompat users to mainline would simply delete the one line calling DO_ETHTOOL_OPS. Solution #2 chooses to create a tiny bit more merge-to-mainline pain, but also keeps the mainline kernel drivers more clean. Jeff From davem@redhat.com Fri Aug 1 16:23:31 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:23:34 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NNVFl020521 for ; Fri, 1 Aug 2003 16:23:31 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA08330; Fri, 1 Aug 2003 16:19:38 -0700 Date: Fri, 1 Aug 2003 16:19:37 -0700 From: "David S. 
Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801161937.1d9a7126.davem@redhat.com> In-Reply-To: <3F2AF525.3000605@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <20030801153255.204baf66.davem@redhat.com> <3F2AF141.2010308@pobox.com> <20030801160136.3342c5cc.davem@redhat.com> <3F2AF525.3000605@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4440 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 19:17:57 -0400 Jeff Garzik wrote: > Solution #2 chooses to create a tiny bit more > merge-to-mainline pain, but also keeps the mainline kernel drivers more > clean. You don't need DO_ETHTOOL_OPS and thus the merge-to-mainline pain at all if you do something like: 1) SET_ETHDEV_OPS() also overrides the ->do_ioctl() setting to a kcompat_netdev_ioctl() one, but remembers the original pointer somewhere. 2) kcompat_netdev_ioctl() does the things DO_ETHTOOL_OPS would have done, failing that it calls the saved ->do_ioctl() pointer. 
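[Editorial sketch, not part of the original message.] The two steps suggested above can be illustrated with a small userspace mock-up. Everything here is an assumption made for illustration: the struct layouts are simplified stand-ins for the kernel types, the fixed-size table stands in for whatever storage kcompat really uses, and the ethtool dispatch is reduced to a single get_link hook. It is not the kcompat implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Same numeric value as Linux SIOCETHTOOL; redefined so the sketch is standalone. */
#define KCOMPAT_SIOCETHTOOL 0x8946

struct net_device;
typedef int (*ioctl_fn)(struct net_device *dev, void *ifr, int cmd);

/* Hypothetical, heavily reduced ethtool_ops: one hook only. */
struct ethtool_ops {
    int (*get_link)(struct net_device *dev);
};

/* Stand-in for an old-kernel net_device: note there is no ethtool_ops member. */
struct net_device {
    const char *name;
    ioctl_fn do_ioctl;
};

/* Lookup table kept by the compat library (hypothetical storage scheme). */
#define KCOMPAT_MAX_DEVS 8
struct kcompat_entry {
    struct net_device *dev;
    struct ethtool_ops *ops;
    ioctl_fn saved_ioctl;
};
static struct kcompat_entry kcompat_table[KCOMPAT_MAX_DEVS];

static struct kcompat_entry *kcompat_find(struct net_device *dev)
{
    for (size_t i = 0; i < KCOMPAT_MAX_DEVS; i++)
        if (kcompat_table[i].dev == dev)
            return &kcompat_table[i];
    return NULL;
}

/* Step 2: handle ethtool requests via the stored ops, else call the saved handler. */
static int kcompat_netdev_ioctl(struct net_device *dev, void *ifr, int cmd)
{
    struct kcompat_entry *e = kcompat_find(dev);

    if (e == NULL)
        return -EOPNOTSUPP;
    if (cmd == KCOMPAT_SIOCETHTOOL && e->ops != NULL && e->ops->get_link != NULL)
        return e->ops->get_link(dev);
    if (e->saved_ioctl != NULL)
        return e->saved_ioctl(dev, ifr, cmd);
    return -EOPNOTSUPP;
}

/* Step 1: remember the ops and the original ->do_ioctl, then override ->do_ioctl. */
static int kcompat_set_ethtool_ops(struct net_device *dev, struct ethtool_ops *ops)
{
    struct kcompat_entry *e;

    assert(dev != NULL && ops != NULL);
    e = kcompat_find(dev);
    if (e != NULL) {            /* already hooked: just update the ops pointer */
        e->ops = ops;
        return 0;
    }
    e = kcompat_find(NULL);     /* first free slot */
    if (e == NULL)
        return -ENOMEM;
    e->dev = dev;
    e->ops = ops;
    e->saved_ioctl = dev->do_ioctl;
    dev->do_ioctl = kcompat_netdev_ioctl;
    return 0;
}

#define SET_ETHTOOL_OPS(dev, ops) kcompat_set_ethtool_ops((dev), (ops))

/* Tiny fake driver used only to exercise the sketch. */
static int demo_get_link(struct net_device *dev) { (void)dev; return 1; }
static int demo_private_ioctl(struct net_device *dev, void *ifr, int cmd)
{
    (void)dev; (void)ifr; (void)cmd;
    return 42;
}

/* Returns 0 if both the ethtool path and the fall-through path behave as described. */
int kcompat_demo(void)
{
    static struct ethtool_ops demo_ops = { demo_get_link };
    static struct net_device demo_dev = { "eth0", demo_private_ioctl };

    if (SET_ETHTOOL_OPS(&demo_dev, &demo_ops) != 0)
        return -1;
    if (demo_dev.do_ioctl(&demo_dev, NULL, KCOMPAT_SIOCETHTOOL) != 1)
        return -2;
    if (demo_dev.do_ioctl(&demo_dev, NULL, 0x89F0) != 42)   /* non-ethtool command */
        return -3;
    return 0;
}
```

With this shape the driver source contains only the SET_ETHTOOL_OPS() call, so no DO_ETHTOOL_OPS line has to be deleted when the driver is merged to mainline, where the macro would reduce to a plain assignment to the ethtool_ops member.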
From jgarzik@pobox.com Fri Aug 1 16:35:34 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:35:39 -0700 (PDT) Received: from www.linux.org.uk (IDENT:J3yO/Z/hGYogdXpUd6WP37Z2oMWhXKE3@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NZWFl021452 for ; Fri, 1 Aug 2003 16:35:33 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19ijQp-0001NP-4x; Sat, 02 Aug 2003 00:35:31 +0100 Message-ID: <3F2AF938.7050608@pobox.com> Date: Fri, 01 Aug 2003 19:35:20 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <3F2AEB33.9050506@pobox.com> <20030801153439.4a324c36.davem@redhat.com> <3F2AF32F.7090201@pobox.com> <20030801160857.32ebbf22.davem@redhat.com> In-Reply-To: <20030801160857.32ebbf22.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4441 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Fri, 01 Aug 2003 19:09:35 -0400 > Jeff Garzik wrote: > > >>#define SET_ETHTOOL_OPS kcompat_set_ethtool_ops >> >>#define DO_ETHTOOL_OPS /* duplicate net/core/ethtool.c, basically */ > > > Where does kcompat_set_ethtool_ops store the pointer if > it does not exist in struct netdevice? Inside an area allocated by the kcompat lib. 
SET_ETHTOOL_OPS takes 'struct net_device *' and 'struct ethtool_ops *' arguments, so it simply needs to create a lookup list/table somewhere. You keep asking for code, read kcompat :) kcompat_set_ethtool_ops has exactly the same task as the 2.2.x-era backcompat implementation of pci_{get,set}_drvdata. The perfect back-porting/back-compat system would magically make all Linus-tree drivers work without any change on older kernels. I really think the kcompat design is as close as you can come to that. Here is a linux-kernel-friendly version of the kcompat design: "naked struct derefs hurt. otherwise, happy hacking!" And further, experience shows that the number of naked struct derefs that matter is fairly small. (Another less-common area that hurts besides naked-struct-deref is function return type, which is why Linus created irqreturn_t) Jeff From davem@redhat.com Fri Aug 1 16:38:11 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:38:19 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NcAFl021877 for ; Fri, 1 Aug 2003 16:38:10 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA08362; Fri, 1 Aug 2003 16:34:15 -0700 Date: Fri, 1 Aug 2003 16:34:15 -0700 From: "David S. 
Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801163415.1c3fd6fb.davem@redhat.com> In-Reply-To: <3F2AF938.7050608@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <3F2AEB33.9050506@pobox.com> <20030801153439.4a324c36.davem@redhat.com> <3F2AF32F.7090201@pobox.com> <20030801160857.32ebbf22.davem@redhat.com> <3F2AF938.7050608@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4442 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 19:35:20 -0400 Jeff Garzik wrote: > Inside an area allocated by the kcompat lib. SET_ETHTOOL_OPS takes > 'struct net_device *' and 'struct ethtool_ops *' arguments, so it simply > needs to create a lookup list/table somewhere. Ok ok ok, we're converging :-) Please just comment on my other email suggesting a way to do away with DO_ETHTOOL_OPS. I'm OK with a SET_ETHTOOL_OPS() macro. From davem@redhat.com Fri Aug 1 16:47:22 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:47:29 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NlMFl022685 for ; Fri, 1 Aug 2003 16:47:22 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA08392; Fri, 1 Aug 2003 16:43:29 -0700 Date: Fri, 1 Aug 2003 16:43:28 -0700 From: "David S. 
Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030801164328.5b5bc145.davem@redhat.com> In-Reply-To: <3F2AFAF4.3040604@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <20030801153255.204baf66.davem@redhat.com> <3F2AF141.2010308@pobox.com> <20030801160136.3342c5cc.davem@redhat.com> <3F2AF525.3000605@pobox.com> <20030801161937.1d9a7126.davem@redhat.com> <3F2AFAF4.3040604@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4443 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 01 Aug 2003 19:42:44 -0400 Jeff Garzik wrote: > Still need the boring and obvious definition of SET_ETHTOOL_OPS in > mainline, though. Like I said, I've got no problem with that part. 
From jgarzik@pobox.com Fri Aug 1 16:58:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 16:58:06 -0700 (PDT) Received: from www.linux.org.uk (IDENT:SvMafmN//vZSJhMjuzBvSDjdG3jdClZ3@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h71NvwFl023495 for ; Fri, 1 Aug 2003 16:57:59 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19ijmW-0001WQ-Qd; Sat, 02 Aug 2003 00:57:56 +0100 Message-ID: <3F2AFE7A.10203@pobox.com> Date: Fri, 01 Aug 2003 19:57:46 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Ben Greear CC: "'netdev@oss.sgi.com'" Subject: Re: 2.4.21: bug report for tg3: tx lockup when changing MTU References: <3F2AEFBF.3040604@candelatech.com> In-Reply-To: <3F2AEFBF.3040604@candelatech.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4444 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Ben Greear wrote: > I just noticed that if you change the MTU of a tg3 NIC, it will not work > untill you ifdown/ifup it. This problem is repeatable on tg3, and does not > happen with the e1000 driver/cards. > > I am setting the MTU via an ioctl call, not via ifconfig or something like > that. Can you provide the ioctl call info, so I can reproduce? And, are you changing MTU when the interface is up or down? 
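[Editorial sketch, not part of the original message.] Ben's actual ioctl call is not shown in the thread. For readers unfamiliar with it, changing an interface MTU through an ioctl on Linux usually looks roughly like the following; the helper name set_mtu is made up for illustration, and the call normally needs CAP_NET_ADMIN to succeed.

```c
#define _DEFAULT_SOURCE   /* expose ifr_mtu and friends on glibc */
#include <errno.h>
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Set the MTU of interface `ifname` with SIOCSIFMTU.
 * Returns 0 on success, -1 on failure with errno set.
 * set_mtu is a hypothetical helper name, not taken from the thread.
 */
int set_mtu(const char *ifname, int mtu)
{
    struct ifreq ifr;
    int fd, ret, saved_errno;

    /* Any socket will do; it only serves as a handle for the ioctl. */
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);   /* stays NUL-terminated */
    ifr.ifr_mtu = mtu;

    ret = ioctl(fd, SIOCSIFMTU, &ifr);
    saved_errno = errno;
    close(fd);
    errno = saved_errno;   /* keep the ioctl's error, not close()'s */

    return ret < 0 ? -1 : 0;
}
```

A call such as set_mtu("eth5", 4000) issued while the interface is up and passing traffic would correspond to the scenario described in the report.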
From jgarzik@pobox.com Fri Aug 1 17:00:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 17:00:32 -0700 (PDT) Received: from www.linux.org.uk (IDENT:mbLF7vnQT6LUjTsH4zlXT2XhjWraHjZQ@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7200PFl023951 for ; Fri, 1 Aug 2003 17:00:26 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19ij1u-0001Be-Az; Sat, 02 Aug 2003 00:09:46 +0100 Message-ID: <3F2AF32F.7090201@pobox.com> Date: Fri, 01 Aug 2003 19:09:35 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <3F2AEB33.9050506@pobox.com> <20030801153439.4a324c36.davem@redhat.com> In-Reply-To: <20030801153439.4a324c36.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4446 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Fri, 01 Aug 2003 18:35:31 -0400 > Jeff Garzik wrote: > > >>We want to provide a sane, ifdef-free path to kcompat, where feasible. > > > I don't believe it's possible with netdev_ops, without > undoing the entire purpose of what netdev_ops is trying > to accomplish (elimination of code duplication). > > Show me, in code not words, how you are able to accomplish > this with SET_NETDEV_OPS() or whatever. 
I will not read > english text describing the scheme, I will read only code :) Read kcompat. Then: #define SET_ETHTOOL_OPS kcompat_set_ethtool_ops #define DO_ETHTOOL_OPS /* duplicate net/core/ethtool.c, basically */ I would define both of these in Matthew's patch, but one only _needs_ to define SET_ETHTOOL_OPS, so I pushed for the latter course. So why is SET_ETHTOOL_OPS needed? It covered up the one place It intentionally follows the same design as SET_MODULE_OWNER, and for the same purpose: hiding what would otherwise be a naked struct deref to a struct member that does not exist on an older kernel. Hiding naked struct derefs is also the reason I created pci_{get,set}_drvdata, pci_resource_*, etc. Back compat is really a big syntactic sugar game, and naked struct derefs are really the only big thorn in the side. Everything else can be beaten down with syntactic sugar behind the scenes, that never ever gets merged into the upstream kernel. Jeff From jgarzik@pobox.com Fri Aug 1 17:00:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 17:00:31 -0700 (PDT) Received: from www.linux.org.uk (IDENT:allKiZwinkLTubXzBheWp6K1ooOLy/4T@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7200NFl023944 for ; Fri, 1 Aug 2003 17:00:23 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19iiMK-0000pf-0I; Fri, 01 Aug 2003 23:26:48 +0100 Message-ID: <3F2AE91D.5090705@pobox.com> Date: Fri, 01 Aug 2003 18:26:37 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. 
Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> In-Reply-To: <20030801132037.3f3542ae.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4445 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David S. Miller wrote: > The whole _POINT_ of these ops are to avoid duplicated code. > If someone is absolutely adament about supporting kernels > without ops support they should not support it at all. > > The point is to avoid code duplication, but what you suggest can only > be used to keep the duplicated code around "just in case". This makes > exactly no sense at all, it severs only to defeat the whole purpose > of the change in the first place. > > I totally am against making an ifdef test available for this, it can > only result in illogical things being done by driver maintainers. Strangely enough, creating a SET_ETHTOOL_OPS() macro (or netif_ethtool_ops or pick your name) reduces ifdefs. I feel that I've helped shepherd the net driver and PCI APIs to maintain something fairly interesting: a driver API that [for the most part...] allows one to write a driver completely without compatibility ifdefs, and ancient-kernel junk. When married with a compat glue lib outside the tree, the same ifdef-free driver works on older kernels. It's an explicit goal to avoid changing the driver API in such a way that there is a remotely sane path to supporting older kernels. One of the few things that is not easily work-around-able is new additions to existing structures (which wouldn't exist in older kernels). 
That's what SET_ETHTOOL_OPS would wrap, while also providing a trigger for generic compat glue. This trigger is what _reduces_ code duplication. Given such a trigger, a generic library can implement compat code on older kernels. The drivers remain ifdef-free and compat-junk-free. This is the method used by the kcompat toolkit (http://sf.net/projects/gkernel/). This (IMO) feature continually saves me real time, again and again, when merging a new net driver into the kernel. It saves me time debugging a driver in both 2.4 and 2.6. The time savings is in the minimization (is that a word?) of changes across kernel versions, and this ethtool_ops change will be a particular thorn. This ethtool_ops change _is_ trivially made backward-compatible, with a simple macro. Look at the future, where vendors are submitting 2.6-ready net drivers, because we made it easier for them to support their existing platform. Over and above the time savings, vendors _will_ start submitting drivers that actually look like Linux drivers. This has already started happening :) Just today I received a Via-rhine gbit driver (GPL'd) at Red Hat, which I am preparing to merge into the kernel. After removing the awful Hungarian notation and silly procfs apis, the driver's actually pretty close to a mergeable driver. It uses the kcompat stuff, and as such isn't full of ifdefs and the typical vendor cpp maze. So, for the benefit of saving me real wall-clock hours, and pushing the vendors to create ready-for-the-kernel drivers more often, the cost is a simple one-line wrapper macro that in-kernel drivers would rarely use. In the long run, I'm trying to use and abuse Intel as an example for other vendors to follow (using netdev@, splitting up patches, etc.), and push the driver maintenance load onto the vendors (where they're willing, etc., like Intel). 
If vendors are willing to respond to feedback and follow standard linux-kernel email development, I'm more than happy for them to become a learned funnel of patches to netdev for review :) This kcompat strategy -- back-compat without ifdefs -- goes a long way towards that, and SET_ETHTOOL_OPS is a big piece of that puzzle right now. Jeff From greearb@candelatech.com Fri Aug 1 17:24:16 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 17:24:54 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h720OFFl025687 for ; Fri, 1 Aug 2003 17:24:16 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h720O0tf025224; Fri, 1 Aug 2003 17:24:10 -0700 Message-ID: <3F2B04A0.9030101@candelatech.com> Date: Fri, 01 Aug 2003 17:24:00 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Jeff Garzik CC: "'netdev@oss.sgi.com'" Subject: Re: 2.4.21: bug report for tg3: tx lockup when changing MTU References: <3F2AEFBF.3040604@candelatech.com> <3F2AFE7A.10203@pobox.com> In-Reply-To: <3F2AFE7A.10203@pobox.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4447 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Jeff Garzik wrote: > Ben Greear wrote: > >> I just noticed that if you change the MTU of a tg3 NIC, it will not work >> untill you ifdown/ifup it. This problem is repeatable on tg3, and >> does not >> happen with the e1000 driver/cards. >> >> I am setting the MTU via an ioctl call, not via ifconfig or something >> like >> that. > > > > Can you provide the ioctl call info, so I can reproduce? 
> > And, are you changing MTU when the interface is up or down? Interface is up and transmitting/receiving pkts at the time. I just reproduced it with commands below. It is probably a race, so not sure that either of these will always fail. Running about 10kpps rx+tx. Was sending pktgen (UDP) traffic of fixed length, so the actual transmitted packet sizes remains the same in this case. # MTU is at 1500 ifconfig eth5 mtu 4096 #worked ifconfig eth5 mtu 4000 # failed. -- Ben Greear Candela Technologies Inc http://www.candelatech.com From jgarzik@pobox.com Fri Aug 1 18:07:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 18:07:05 -0700 (PDT) Received: from www.linux.org.uk (IDENT:w1RYSqFLki8rczBJvztL9jiQ2bgDrUfw@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7216xFl028005 for ; Fri, 1 Aug 2003 18:07:00 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19ijXz-0001Q9-5p; Sat, 02 Aug 2003 00:42:55 +0100 Message-ID: <3F2AFAF4.3040604@pobox.com> Date: Fri, 01 Aug 2003 19:42:44 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "David S. 
Miller" CC: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030801132037.3f3542ae.davem@redhat.com> <3F2AE91D.5090705@pobox.com> <20030801153255.204baf66.davem@redhat.com> <3F2AF141.2010308@pobox.com> <20030801160136.3342c5cc.davem@redhat.com> <3F2AF525.3000605@pobox.com> <20030801161937.1d9a7126.davem@redhat.com> In-Reply-To: <20030801161937.1d9a7126.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4448 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Fri, 01 Aug 2003 19:17:57 -0400 > Jeff Garzik wrote: > > >>Solution #2 chooses to create a tiny bit more >>merge-to-mainline pain, but also keeps the mainline kernel drivers more >>clean. > > > You don't need DO_ETHTOOL_OPS and thus the merge-to-mainline pain > at all if you do something like: > > 1) SET_ETHDEV_OPS() also overrides the ->do_ioctl() setting to > a kcompat_netdev_ioctl() one, but remembers the original pointer > somewhere. > > 2) kcompat_netdev_ioctl() does the things DO_ETHTOOL_OPS would > have done, failing that it calls the saved ->do_ioctl() pointer. Certainly. That's a bit nicer than the back-compat gunk I was plotting, even. Still need the boring and obvious definition of SET_ETHTOOL_OPS in mainline, though. 
Jeff From takamiya@po.ntts.co.jp Fri Aug 1 19:59:17 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 01 Aug 2003 19:59:20 -0700 (PDT) Received: from mail1.ics.ntts.co.jp (mail1.ics.ntts.co.jp [202.32.24.45]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h722xDFl001702 for ; Fri, 1 Aug 2003 19:59:16 -0700 Received: from mail26.silk.ntts.co.jp by mail1.ics.ntts.co.jp (8.9.3p2/3.7W-NTTSOFT-SUR2.0) id LAA11990 for ; Sat, 2 Aug 2003 11:59:12 +0900 (JST) (envelope-from takamiya@po.ntts.co.jp) Received: from daemon.inl.ntts.co.jp by mail26.silk.ntts.co.jp (8.11.7/3.7W-silk-4.6) id h722xB316268 for ; Sat, 2 Aug 2003 11:59:11 +0900 (JST) (envelope-from takamiya@po.ntts.co.jp) Received: (qmail 54448 invoked by alias); 2 Aug 2003 11:59:10 +0900 Received: (qmail 54428 invoked from network); 2 Aug 2003 11:59:10 +0900 Received: from localhost by localhost with SMTP; 2 Aug 2003 11:59:10 +0900 Date: Sat, 02 Aug 2003 11:59:09 +0900 (JST) Message-Id: <20030802.115909.576029077.takamiya@po.ntts.co.jp> To: nebuchadnezzar@nerim.net Cc: netdev@oss.sgi.com, takamiya@po.ntts.co.jp Subject: Re: [PATCH] 2.4.x USAGI mipv6_ha_ipsec From: Noriaki Takamiya In-Reply-To: <87n0etgt7w.fsf@zion.matrix> <87fzklgsh6.fsf@zion.matrix> References: <87n0etgt7w.fsf@zion.matrix> X-Face: +<)&j!Ce24nM@a.\f6TA,]^9Q76[_QN_[QR-(bT&>b40Oo[:`R(>b7!b-|q5k&.8CO[_Oh_ !9Nk0rikK70~?|08EFH|:]iF6pwPlnfEn-wo-voY:rP?%7p%cxjnbf'hglO'se&QwZN7/RVX!U7*P% cTV('HfHp+?g1+hx7\+J.W]G zYWv%LsDc X-Mailer: Mew version 3.2rc1 on XEmacs 21.4.8 (Honest Recruiter) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4449 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: takamiya@po.ntts.co.jp Precedence: bulk X-list: netdev Hi, >> Fri, 01 Aug 2003 19:53:23 +0200 >> [Subject: [PATCH] 2.4.x USAGI mipv6_ha_ipsec] >> "Daniel 'NebuchadnezzaR' Dehennin" wrote... 
nebuchadnezzar> I search for the definition of in6_ntop, it is in include/linux/inet.h nebuchadnezzar> so I make that patch. >> Fri, 01 Aug 2003 20:09:25 +0200 >> [Subject: [PATCH 2] 2.4.x USAGI unused variables in mipv6_ha_ipsec.c] >> "Daniel 'NebuchadnezzaR' Dehennin" wrote... nebuchadnezzar> Hello again ;-), nebuchadnezzar> nebuchadnezzar> A patch to remove unused variables : Applied both fixes. Thanks. -- Noriaki Takamiya From akpm@osdl.org Sat Aug 2 01:12:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 01:12:16 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h728BxFl020572 for ; Sat, 2 Aug 2003 01:12:07 -0700 Received: from mnm (build.pdx.osdl.net [172.20.1.2]) by mail.osdl.org (8.11.6/8.11.6) with ESMTP id h728BnI26175 for ; Sat, 2 Aug 2003 01:11:51 -0700 Date: Sat, 2 Aug 2003 01:12:48 -0700 From: Andrew Morton To: netdev@oss.sgi.com Subject: Fw: [Bugme-new] [Bug 1030] New: racoon causes oops when implementing IPSec key Message-Id: <20030802011248.6772c9cd.akpm@osdl.org> X-Mailer: Sylpheed version 0.9.4 (GTK+ 1.2.10; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4450 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: akpm@osdl.org Precedence: bulk X-list: netdev Begin forwarded message: Date: Sat, 2 Aug 2003 01:01:24 -0700 From: bugme-daemon@osdl.org To: bugme-new@lists.osdl.org Subject: [Bugme-new] [Bug 1030] New: racoon causes oops when implementing IPSec key http://bugme.osdl.org/show_bug.cgi?id=1030 Summary: racoon causes oops when implementing IPSec key Kernel Version: 2.6.0-test1 Status: NEW Severity: normal Owner: acme@conectiva.com.br Submitter: jsanchez@cs.ucf.edu Distribution: SuSE and LFS Hardware Environment: e100 cards Software Environment: ipsec-tools 0.2.2 Problem Description: I setkey with a policy to 
use esp and ah on each box. I start racoon on each box. I punch up a web page on one from the other. Insta-oops x 2. Unable to handle kernel NULL pointer dereference at virtual address 00000000 printing eip: c02bbd06 *pde = 00000000 Oops: 0000 [#1] CPU: 0 EIP: 0060:[] Not tainted EFLAGS: 00010206 EIP is at memcpy+0x1e/0x39 eax: 00000018 ebx: f6fe8a00 ecx: 00000006 edx: 00000000 esi: 00000000 edi: 00000000 ebp: c0562520 esp: f6fb5ccc ds: 007b es: 007b ss:0068 Process racoon (pid: 418, threadinfo=f6fb4000 task=f6fbb300) Stack: Call Trace: xfrm_state_update pfkey_add parse_exthdrs pfkey_process pfkey_sendmsg sock_sendmsg verify_iovec sys_sendmsg sockfd_lookup sys_sendto sys_getsockname __pollwait update_process sys_send sys_socketcall syscall_call Code: f3 a5 a8 02 74 02 66 a5 a8 01 74 01 a4 89 d0 8b 74 24 02 8b <0>Kernel panic: Fatal exception in interrupt In interrupt handler = not syncing For some of the other numbers that didn't get copied, check 67.9.9.32/oops.jpg. Email me if its dead, which it will be after 20 august. Steps to reproduce: >From each box: #!setkey -f flush; spdflush; spdadd $this_box $other_box any -P out ipsec esp/transport//use ah/transport//use; spdadd $other_box $this_box any -P in ipsec esp/transport//use ah/transport//use; Set up racoon (the default config would probably work, here is the gist of mine) remote anonymous { exchange_mode main; my_identifier address; peers_identifier address; lifetime time 1 min; # sec,min,hour proposal { encryption_algorithm 3des; hash_algorithm sha1; authentication_method pre_shared_key ; dh_group 2; } } sainfo anonymous { lifetime time 20 min; encryption_algorithm 3des ; authentication_algorithm hmac_sha1; compression_algorithm deflate ; } Start racoon on each box. Open a new connection to cause a key exchange. Hit the reset button on each box. ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. 
From davem@redhat.com Sat Aug 2 01:17:58 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 01:18:02 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h728HvFl020948 for ; Sat, 2 Aug 2003 01:17:58 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id BAA09115; Sat, 2 Aug 2003 01:13:58 -0700 Date: Sat, 2 Aug 2003 01:13:58 -0700 From: "David S. Miller" To: Andrew Morton Cc: netdev@oss.sgi.com Subject: Re: Fw: [Bugme-new] [Bug 1030] New: racoon causes oops when implementing IPSec key Message-Id: <20030802011358.0524c88c.davem@redhat.com> In-Reply-To: <20030802011248.6772c9cd.akpm@osdl.org> References: <20030802011248.6772c9cd.akpm@osdl.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4451 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Old bug, fixed in current sources. 
From sascha@schumann.cx Sat Aug 2 02:44:47 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 02:44:55 -0700 (PDT) Received: from milton.schell.de (kdserv.de [217.160.72.35]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h729ijFl027701 for ; Sat, 2 Aug 2003 02:44:46 -0700 Received: (qmail 29266 invoked by uid 501); 2 Aug 2003 09:44:44 -0000 Received: from unknown (HELO eco.foo) (80.143.24.176) by kdserv.de with SMTP; 2 Aug 2003 09:44:44 -0000 Received: from localhost (localhost [127.0.0.1]) by eco.foo (Postfix) with ESMTP id 554E437045; Sat, 2 Aug 2003 11:44:43 +0200 (CEST) Date: Sat, 2 Aug 2003 11:44:43 +0200 (CEST) From: Sascha Schumann X-X-Sender: sas@eco.foo To: Ben Greear Cc: "'netdev@oss.sgi.com'" Subject: Re: 2.4.21: bug report for tg3: tx lockup when changing MTU In-Reply-To: <3F2AEFBF.3040604@candelatech.com> Message-ID: References: <3F2AEFBF.3040604@candelatech.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4452 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sascha@schumann.cx Precedence: bulk X-list: netdev > Kernel is 2.4.21 + custom patches (which should not affect tg3). > > lspci says the NIC is: Altima AC9100 (rev 15) [1] says that the AC9100 based Netgear GA302T cards don't support jumbo frames. I'm seeing regular lockups once packets larger than 1500bytes flow through the NIC. It would be cool though if this turned out to be a driver limitation and not a (crippled) chipset issue. 
[1] http://www.google.de/search?q=cache:y_kVF_dR3TkJ:www.lanshop.co.uk/html/ga302tq.htm+netgear+ga302t+jumbo+frames&hl=de&ie=UTF-8 - Sascha From daniel.ritz@gmx.ch Sat Aug 2 04:53:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 04:54:03 -0700 (PDT) Received: from ritz.dnsalias.org (dclient217-162-108-200.hispeed.ch [217.162.108.200]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72BrsFl002495 for ; Sat, 2 Aug 2003 04:53:56 -0700 Received: from toshba.local (toshba.local [192.168.100.12]) by ritz.dnsalias.org (Postfix) with ESMTP id C83ED4FD7F; Sat, 2 Aug 2003 13:55:45 +0200 (CEST) From: Daniel Ritz To: "David S. Miller" Subject: [PATCH 2.6] Fix IPv6 esp mem leak in esp6_input Date: Sat, 2 Aug 2003 13:50:23 +0200 User-Agent: KMail/1.5.2 Cc: linux-net , "linux-netdev" MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200308021350.23342.daniel.ritz@gmx.ch> X-archive-position: 4453 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: daniel.ritz@gmx.ch Precedence: bulk X-list: netdev fixes a mem leak in esp6_input() in the error paths. and return -ENOMEM, not -EINVAL when out of memory. 
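The error-path shape that the patch gives esp6_input() can be sketched in plain userspace C. This is only an illustration under stated assumptions: `process_packet`, its parameters, and the use of `malloc`/`free` in place of `kmalloc`/`kfree` are hypothetical stand-ins, not the kernel code. Failures before the allocation jump past the free, failures after it go through the label that releases the saved header, and each failure sets its own error code.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for the esp6_input() error handling.
 * Returns the byte following the header on success (a stand-in for the
 * "next header" value), or a negative errno value on failure.
 * blksize is assumed to be a power of two.
 */
int process_packet(const unsigned char *pkt, size_t len, size_t hdr_len,
                   size_t blksize, int payload_ok)
{
    unsigned char *tmp_hdr = NULL;
    size_t elen;
    int ret = 0;

    /* Checks made before anything is allocated: nothing to free. */
    if (pkt == NULL || hdr_len == 0 || len <= hdr_len) {
        ret = -EINVAL;
        goto out_nofree;
    }
    elen = len - hdr_len;
    if (elen & (blksize - 1)) {
        ret = -EINVAL;
        goto out_nofree;
    }

    tmp_hdr = malloc(hdr_len);
    if (!tmp_hdr) {
        ret = -ENOMEM;          /* out of memory is not "invalid input" */
        goto out_nofree;
    }
    memcpy(tmp_hdr, pkt, hdr_len);

    /* Any failure from here on must release tmp_hdr. */
    if (!payload_ok) {
        ret = -EINVAL;
        goto out;
    }

    ret = pkt[hdr_len];         /* success: report the "next header" byte */
out:
    free(tmp_hdr);
out_nofree:
    return ret;
}
```

With a 12-byte buffer, a 4-byte header and a block size of 8, the success path returns the byte at offset 4, and the same call with `payload_ok` set to 0 returns `-EINVAL` after freeing the copy.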
against 2.6.0-test2-bk --- 1.19/net/ipv6/esp6.c Mon Jul 21 02:46:12 2003 +++ edited/net/ipv6/esp6.c Sat Aug 2 13:02:11 2003 @@ -200,18 +200,24 @@ int hdr_len = skb->h.raw - skb->nh.raw; int nfrags; - u8 ret_nexthdr = 0; unsigned char *tmp_hdr = NULL; + int ret = 0; - if (!pskb_may_pull(skb, sizeof(struct ipv6_esp_hdr))) - goto out; + if (!pskb_may_pull(skb, sizeof(struct ipv6_esp_hdr))) { + ret = -EINVAL; + goto out_nofree; + } - if (elen <= 0 || (elen & (blksize-1))) - goto out; + if (elen <= 0 || (elen & (blksize-1))) { + ret = -EINVAL; + goto out_nofree; + } tmp_hdr = kmalloc(hdr_len, GFP_ATOMIC); - if (!tmp_hdr) - goto out; + if (!tmp_hdr) { + ret = -ENOMEM; + goto out_nofree; + } memcpy(tmp_hdr, skb->nh.raw, hdr_len); /* If integrity check is required, do this. */ @@ -226,12 +232,15 @@ if (unlikely(memcmp(sum, sum1, alen))) { x->stats.integrity_failed++; + ret = -EINVAL; goto out; } } - if ((nfrags = skb_cow_data(skb, 0, &trailer)) < 0) + if ((nfrags = skb_cow_data(skb, 0, &trailer)) < 0) { + ret = -EINVAL; goto out; + } skb->ip_summed = CHECKSUM_NONE; @@ -251,8 +260,10 @@ if (unlikely(nfrags > MAX_SG_ONSTACK)) { sg = kmalloc(sizeof(struct scatterlist)*nfrags, GFP_ATOMIC); - if (!sg) + if (!sg) { + ret = -ENOMEM; goto out; + } } skb_to_sgvec(skb, sg, sizeof(struct ipv6_esp_hdr) + esp->conf.ivlen, elen); crypto_cipher_decrypt(esp->conf.tfm, sg, sg, elen); @@ -267,6 +278,7 @@ if (net_ratelimit()) { printk(KERN_WARNING "ipsec esp packet is garbage padlen=%d, elen=%d\n", padlen+2, elen); } + ret = -EINVAL; goto out; } /* ... check padding bits here. Silly. 
:-) */ @@ -277,13 +289,13 @@ memcpy(skb->nh.raw, tmp_hdr, hdr_len); skb->nh.ipv6h->payload_len = htons(skb->len - sizeof(struct ipv6hdr)); ip6_find_1stfragopt(skb, &prevhdr); - ret_nexthdr = *prevhdr = nexthdr[1]; + ret = *prevhdr = nexthdr[1]; } - kfree(tmp_hdr); - return ret_nexthdr; out: - return -EINVAL; + kfree(tmp_hdr); +out_nofree: + return ret; } static u32 esp6_get_max_size(struct xfrm_state *x, int mtu) From daniel.ritz@gmx.ch Sat Aug 2 08:46:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 08:46:49 -0700 (PDT) Received: from ritz.dnsalias.org (dclient217-162-108-200.hispeed.ch [217.162.108.200]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72FkhFl016663 for ; Sat, 2 Aug 2003 08:46:44 -0700 Received: from toshba.local (toshba.local [192.168.100.12]) by ritz.dnsalias.org (Postfix) with ESMTP id C3B4C4FD7F; Sat, 2 Aug 2003 17:48:35 +0200 (CEST) From: Daniel Ritz To: Jeff Garzik Subject: [PATCH] fix airo memory leak Date: Sat, 2 Aug 2003 17:43:12 +0200 User-Agent: KMail/1.5.2 Cc: linux-net , "linux-netdev" MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200308021743.12635.daniel.ritz@gmx.ch> X-archive-position: 4454 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: daniel.ritz@gmx.ch Precedence: bulk X-list: netdev fixes a memory leak: memory for the airo_devices list is allocated but never freed. against 2.6.0-test2-bk, but should apply to 2.4 as well... 
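The leak described above comes from unlinking a list entry without freeing it. A minimal userspace sketch of unlink-and-free follows, assuming a singly linked list; `struct dev_list`, `add_dev` and `del_dev` are hypothetical names, an integer id stands in for the `struct net_device` pointer, and `free` stands in for `kfree`. It keeps the pointer-to-pointer walk of the old function and only adds the missing release, whereas the patch itself uses a `prev` pointer.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the airo device list entry. */
struct dev_list {
    int dev_id;                 /* stands in for struct net_device *dev */
    struct dev_list *next;
};

/* Prepend an entry; returns 0 on success, -1 if allocation fails. */
int add_dev(struct dev_list **head, int dev_id)
{
    struct dev_list *node = malloc(sizeof(*node));

    if (!node)
        return -1;
    node->dev_id = dev_id;
    node->next = *head;
    *head = node;
    return 0;
}

/*
 * Unlink the first entry matching dev_id and free it.
 * Returns 1 if an entry was removed, 0 if none matched.
 */
int del_dev(struct dev_list **head, int dev_id)
{
    struct dev_list **p = head;
    struct dev_list *victim;

    while (*p && (*p)->dev_id != dev_id)
        p = &(*p)->next;
    if (!*p)
        return 0;

    victim = *p;
    *p = victim->next;          /* unlink ... */
    free(victim);               /* ... and release, which the old code skipped */
    return 1;
}
```

Because `p` always points at the link that leads to the current entry, the head of the list needs no special case.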
--- 1.54/drivers/net/wireless/airo.c Sun Jul 20 05:17:02 2003 +++ edited/drivers/net/wireless/airo.c Sat Aug 2 17:33:24 2003 @@ -4038,12 +4038,23 @@ return 0; } -static void del_airo_dev( struct net_device *dev ) { - struct net_device_list **p = &airo_devices; - while( *p && ( (*p)->dev != dev ) ) - p = &(*p)->next; - if ( *p && (*p)->dev == dev ) - *p = (*p)->next; +static void del_airo_dev(struct net_device *dev) +{ + struct net_device_list *this = airo_devices, *prev = NULL; + + while (this) { + if (this->dev == dev) { + if (prev) + prev->next = this->next; + else + airo_devices = this->next; + kfree(this); + break; + } + + prev = this; + this = this->next; + } } #ifdef CONFIG_PCI From werner@almesberger.net Sat Aug 2 10:04:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 10:05:08 -0700 (PDT) Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72H4uFl022817 for ; Sat, 2 Aug 2003 10:04:56 -0700 Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h72H4oG24090; Sat, 2 Aug 2003 10:04:50 -0700 Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h72H4i330124; Sat, 2 Aug 2003 14:04:44 -0300 Date: Sat, 2 Aug 2003 14:04:44 -0300 From: Werner Almesberger To: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: TOE brain dump Message-ID: <20030802140444.E5798@almesberger.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline X-archive-position: 4455 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: werner@almesberger.net Precedence: bulk X-list: netdev At OLS, there was a bit of discussion on (true and false *) TOEs (TCP Offload Engines). 
In the course of this discussion, I've suggested what might be a novel approach, so in case this is a good idea, I'd like to dump my thoughts on it, before someone tries to patent my ideas. (Most likely, some of this has already been done or tried elsewhere, but it can't hurt to try to err on the safe side.) (*) The InfiniBand people unfortunately call also their TCP/IP bypass "TOE" (for which they promptly get shouted down, every time they use that word). This is misleading, because there is no TCP that's getting offloaded, but TCP is simply never done. I would consider it to be more accurate to view this as a separate networking technology, with semantics different from TCP/IP, similar to ATM and AAL5. While I'm not entirely convinced about the usefulness of TOE in all the cases it's been suggested for, I can see value in certain areas, e.g. when TCP per-packet overhead becomes an issue. However, I consider the approach of putting a new or heavily modified stack, which duplicates a considerable amount of the functionality in the main kernel, on a separate piece of hardware questionable at best. Some of the issues: - if this stack is closed source or generally hard to modify, security fixes will be slowed down - if this stack is closed source or generally hard to modify, TOE will not be available to projects modifying the stack, e.g. 
any of the research projects trying to make TCP work at gigabit speeds - this stack either needs to implement all administrative interfaces of the regular kernel, or such a system would have non-uniform configuration/monitoring across interfaces - in some cases, administrative interfaces will require a NIC/TOE-specific switch in the kernel (netlink helps here) - route changes on multi-homed hosts (or any similar kind of failover) are difficult if the state of TCP connections is tied to specific NICs (I've discussed some issues when "migrating" TCP connections in the documentation of tcpcp, http://www.almesberger.net/tcpcp/) - new kernel features will always lag behind on this kind of TOE, and different kernels will require different "firmware" - last but not least, keeping TOE firmware up to date with the TCP/IP stack in the mainstream kernel will require - for each such TOE device - a significant and continuous effort over a long period of time In short, I think such a solution is either a pain to use, or unmaintainable, or - most likely - both. So, how to do better ? Easy: use the Source, Luke. Here's my idea: - instead of putting a different stack on the TOE, a general-purpose processor (probably with some enhancements, and certainly with optimized data paths) is added to the NIC - that processor runs the same Linux kernel image as the host, acting like a NUMA system - a selectable part of TCP/IP is handled on the NIC, and the rest of the system runs on the host processor - instrumentation is added to the mainstream kernel to ensure that as little data as possible is shared between the main CPU and such peripheral CPUs. Note that such instrumentation would be generic, outlining possible boundaries, and not tied to a specific TOE design. - depending on hardware details (cache coherence, etc.), the instrumentation mentioned above may even be necessary for correctness. 
This would have the unfortunate effect of making the design very fragile with respect to changes in the mainstream kernel. (Performance loss in the case of imperfect instrumentation would be preferable.) - further instrumentation may be needed to let the kernel switch CPUs (i.e. host to NIC, and vice versa) at the right time - since the NIC would probably use a CPU design different from the host CPU, we'd need "fat" kernel binaries: - data structures are the same, i.e. word sizes, byte order, bit numbering, etc. are compatible, and alignments are chosen such that all CPUs involved are reasonably happy - kernels live in the same address space - function pointers become arrays, with one pointer per architecture. When comparing pointers, the first element is used. - if one should choose to also run parts of user space on the NIC, fat binaries would also be needed for this (along with other complications) Benefits: - putting the CPU next to the NIC keeps data paths short, and allows for all kinds of optimizations (e.g. 
a pipelined memory architecture) - the design is fairly generic, and would equally apply to other areas of the kernel than TCP/IP - using the same kernel image eliminates most maintenance problems, and encourages experimenting with the stack - using the same kernel image (and compatible data structures) guarantees that administrative interfaces are uniform in the entire system - such a design is likely to be able to allow TCP state to be moved to a different NIC, if necessary Possible problems, that may kill this idea: - it may be too hard to achieve correctness - it may be too hard to switch CPUs properly - it may not be possible to express copy operations efficiently in such a context - there may be no way to avoid sharing of hardware-specific data structures, such as page tables, or to emulate their use - people may consider the instrumentation required for this, although fairly generic, too intrusive - all this instrumentation may eat too much performance - nobody may be interested in building hardware for this - nobody may be patient enough to pursue such long-termish development, with uncertain outcome - something I haven't thought of I lack the resources (hardware, financial, and otherwise) to actually do something with these ideas, so please feel free to put them to some use. 
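The "fat" function pointer idea from the list above (one pointer per architecture, comparison by the first element) could look roughly like this in C. It is a sketch under assumptions, not part of any kernel: `enum arch`, `struct fat_fn`, `fat_call` and `fat_same` are hypothetical names, and two ordinary C functions stand in for one function compiled once per architecture.

```c
/* Hypothetical set of CPU architectures sharing one kernel image. */
enum arch { ARCH_HOST = 0, ARCH_NIC = 1, ARCH_COUNT };

typedef int (*op_fn)(int);

/* "Fat" function pointer: one entry point per architecture. */
struct fat_fn {
    op_fn entry[ARCH_COUNT];
};

/* Stand-ins for a single function built once for each architecture. */
int double_it_host(int x) { return 2 * x; }
int double_it_nic(int x)  { return 2 * x; }

const struct fat_fn double_it = { { double_it_host, double_it_nic } };

/* Call through the slot that matches the CPU currently executing. */
int fat_call(const struct fat_fn *f, enum arch cur, int arg)
{
    return f->entry[cur](arg);
}

/* Identity is decided by the first element only, as proposed above. */
int fat_same(const struct fat_fn *a, const struct fat_fn *b)
{
    return a->entry[0] == b->entry[0];
}
```

Comparing only the first slot keeps pointer equality meaningful on every CPU, at the price of making each function pointer as wide as the number of architectures.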
- Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From niv@us.ibm.com Sat Aug 2 10:32:53 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 10:33:04 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.133]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72HWkFl024975 for ; Sat, 2 Aug 2003 10:32:53 -0700 Received: from westrelay03.boulder.ibm.com (westrelay03.boulder.ibm.com [9.17.195.12]) by e35.co.us.ibm.com (8.12.9/8.12.2) with ESMTP id h72HWVc8270888; Sat, 2 Aug 2003 13:32:31 -0400 Received: from us.ibm.com (d03av03.boulder.ibm.com [9.17.193.83]) by westrelay03.boulder.ibm.com (8.12.9/NCO/VER6.5) with ESMTP id h72HWUYc053666; Sat, 2 Aug 2003 11:32:31 -0600 Message-ID: <3F2BF5C7.90400@us.ibm.com> Date: Sat, 02 Aug 2003 10:32:55 -0700 From: Nivedita Singhvi User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2.1) Gecko/20021130 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Werner Almesberger CC: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> In-Reply-To: <20030802140444.E5798@almesberger.net> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4456 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: niv@us.ibm.com Precedence: bulk X-list: netdev Werner Almesberger wrote: > (*) The InfiniBand people unfortunately call also their TCP/IP > bypass "TOE" (for which they promptly get shouted down, > every time they use that word). This is misleading, because Thank you! Yes! All in favor say Aye..AYE!!! Motion passes, the infiniband people don't get to call it TOE anymore.. 
> While I'm not entirely convinced about the usefulness of TOE in
> all the cases it's been suggested for, I can see value in certain
> areas, e.g. when TCP per-packet overhead becomes an issue.

Ditto, but I see it being used to roll out the idea and process, rather than anything of value now, and the lessons are being learned for the future, when we reach 20Gb, 40Gb, even faster networks of tomorrow. The processors might keep up, but nothing else will, for sure.

> However, I consider the approach of putting a new or heavily
> modified stack, which duplicates a considerable amount of the
> functionality in the main kernel, on a separate piece of hardware
> questionable at best. Some of the issues:
>
> - if this stack is closed source or generally hard to modify,
> security fixes will be slowed down

as will bug fixes, and debugging becomes a right royal pain. Also, most profiles of networking applications show the largest blip is essentially the user<->kernel transfer, and that would still remain the unaddressed bottleneck.

> So, how to do better ? Easy: use the Source, Luke. Here's my
> idea:
>
> - instead of putting a different stack on the TOE, a
> general-purpose processor (probably with some enhancements,
> and certainly with optimized data paths) is added to the NIC

The thing is, all the TOE efforts are proprietary ones, to my limited knowledge. Thus all the design is occurring in confidential, vendor internal forums. How will they/we come up with the really needed, _common_ design approach? Or is this not so needed?

thanks,
Nivedita

From werner@almesberger.net Sat Aug 2 11:06:12 2003
Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 11:06:21 -0700 (PDT)
Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72I6BFl027777 for ; Sat, 2 Aug 2003 11:06:11 -0700
Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h72I65G24280; Sat, 2 Aug 2003 11:06:06 -0700
Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h72I60t30481; Sat, 2 Aug 2003 15:06:00 -0300
Date: Sat, 2 Aug 2003 15:06:00 -0300
From: Werner Almesberger
To: Nivedita Singhvi
Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: TOE brain dump
Message-ID: <20030802150600.F5798@almesberger.net>
References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <3F2BF5C7.90400@us.ibm.com>; from niv@us.ibm.com on Sat, Aug 02, 2003 at 10:32:55AM -0700
X-archive-position: 4457
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: werner@almesberger.net
Precedence: bulk
X-list: netdev

Nivedita Singhvi wrote:
> Also, most profiles of networking applications show the
> largest blip is essentially the user<->kernel transfer, and
> that would still remain the unaddressed bottleneck.

I have some hope that sendfile plus a NUMA-like approach will be sufficient for keeping transfers away from buses and memory they don't need to touch.

> The thing is, all the TOE efforts are proprietary ones, to
> my limited knowledge.

Many companies default to "closed" designs if they're not given a convincing reason for going "open". The approach I've described may provide that reason. There are also historical reasons, e.g.
if you want to interface with the stack of Windows, or any proprietary Unix, you probably need to obtain some of their source under NDA, and use some of that information in your own drivers or firmware. Of course, none of this is an issue here. Since we're talking about 1-2 years of development time anyway, legacy hardware (i.e. hardware choices influenced by information obtained under an NDA) will be quite obsolete by then and doesn't matter. > Or is this not so needed? Exactly. The "NUMA" approach would avoid the "common TOE design" problem. All you need is a reasonably well documented "general-purpose" CPU (that doesn't mean it has to be an off-the-shelf design, but most likely, the core would be an off-the-shelf one), plus some NIC hardware. Now, if that NIC in turn has some hidden secrets, this isn't an issue as long as one can still write a GPLed driver for it. Of course, there would be elements in such a system that vendors would like to keep secret. But then, there always are, and so far, we've found reasonable compromises most of the time, so I don't see why this couldn't happen here, too. Also, if "classical TOE" patches keep getting rejected, but an open and maintainable approach makes it into the mainstream kernel, also the business aspects should become fairly clear. 
- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

From jgarzik@pobox.com Sat Aug 2 12:09:07 2003
Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 12:09:18 -0700 (PDT)
Received: from www.linux.org.uk (IDENT:yTEIorJOZS7ZcYhSCsBkoXNszRdAMY2B@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72J96Fl032752 for ; Sat, 2 Aug 2003 12:09:07 -0700
Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19j1kV-0003a8-LU; Sat, 02 Aug 2003 20:09:03 +0100
Message-ID: <3F2C0C44.6020002@pobox.com>
Date: Sat, 02 Aug 2003 15:08:52 -0400
From: Jeff Garzik
Organization: none
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk
X-Accept-Language: en
MIME-Version: 1.0
To: Nivedita Singhvi
CC: Werner Almesberger , netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: TOE brain dump
References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com>
In-Reply-To: <3F2BF5C7.90400@us.ibm.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-archive-position: 4458
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: jgarzik@pobox.com
Precedence: bulk
X-list: netdev

My own brain dump:

If one wants to go straight from disk to network, why is anyone bothering to involve the host CPU and host memory bus at all? Memory bandwidth and PCI bus bandwidth are still bottlenecks, no matter how much of the net stack you offload.

Regardless of how fast your network zooms packets, you've gotta keep that pipeline full to make use of it.
And you've gotta do something intelligent with it, which in TCP's case involves the host CPU quite a bit.

TCP is sufficiently complex, for a reason. It has to handle all manner of disturbingly slow and disturbingly fast net connections, all jabbering at the same time. TCP is a "one size fits all" solution, but it doesn't work well for everyone. The "TCP Offload Everything" people really need to look at what data your users want to push, at such high speeds. It's obviously not over a WAN... so steer users away from TCP, to an IP protocol that is tuned for your LAN needs, and more friendly to some sort of h/w offloading solution.

A "foo over ipv6" protocol that was designed for h/w offloading from the start, would be a far better idea than full TCP offload will ever be.

In any case, when you approach these high speeds, you really must take a good look at the other end of the pipeline: what are you serving at 10Gb/s, 20Gb/s, 40Gb/s? For some time, I think the answer will be "highly specialized stuff". At some point, Intel networking gear will be able to transfer more bits per second than there exist atoms on planet Earth :) Garbage in, garbage out.

So, fix the other end of the pipeline too, otherwise this fast network stuff is flashy but pointless. If you want to serve up data from disk, then start creating PCI cards that have both Serial ATA and ethernet connectors on them :) Cut out the middleman of the host CPU and host memory bus instead of offloading portions of TCP that do not need to be offloaded.
Jeff From alan@lxorguk.ukuu.org.uk Sat Aug 2 14:01:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 14:01:52 -0700 (PDT) Received: from lxorguk.ukuu.org.uk (pc1-cwma1-5-cust4.swan.cable.ntl.com [80.5.120.4]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72L1hFl024055 for ; Sat, 2 Aug 2003 14:01:44 -0700 Received: from dhcp22.swansea.linux.org.uk (dhcp22.swansea.linux.org.uk [127.0.0.1]) by lxorguk.ukuu.org.uk (8.12.8/8.12.5) with ESMTP id h72KvkC3020394; Sat, 2 Aug 2003 21:57:47 +0100 Received: (from alan@localhost) by dhcp22.swansea.linux.org.uk (8.12.8/8.12.8/Submit) id h72KvjLd020392; Sat, 2 Aug 2003 21:57:45 +0100 X-Authentication-Warning: dhcp22.swansea.linux.org.uk: alan set sender to alan@lxorguk.ukuu.org.uk using -f Subject: Re: TOE brain dump From: Alan Cox To: Werner Almesberger Cc: netdev@oss.sgi.com, Linux Kernel Mailing List In-Reply-To: <20030802140444.E5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: Message-Id: <1059857864.20305.14.camel@dhcp22.swansea.linux.org.uk> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 (1.2.2-5) Date: 02 Aug 2003 21:57:44 +0100 X-archive-position: 4459 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@lxorguk.ukuu.org.uk Precedence: bulk X-list: netdev On Sad, 2003-08-02 at 18:04, Werner Almesberger wrote: > - last but not least, keeping TOE firmware up to date with the > TCP/IP stack in the mainstream kernel will require - for each > such TOE device - a significant and continuous effort over a > long period of time or even the protocol and protocol refinements.. 
> - instead of putting a different stack on the TOE, a
> general-purpose processor (probably with some enhancements,
> and certainly with optimized data paths) is added to the NIC

Like, say, an Opteron in the 2nd socket on the motherboard

> Benefits:
>
> - putting the CPU next to the NIC keeps data paths short, and
> allows for all kinds of optimizations (e.g. a pipelined
> memory architecture)

It moves the cost, it doesn't make it vanish.

If I read you right you are arguing for a second processor running Linux, with its own independent memory bus. AMD make those already; it's called AMD64. I don't know anyone thinking at that level about partitioning one as an I/O processor.

From werner@almesberger.net Sat Aug 2 14:49:18 2003
Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 14:49:28 -0700 (PDT)
Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72LnGFl028183 for ; Sat, 2 Aug 2003 14:49:17 -0700
Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h72Ln8G25059; Sat, 2 Aug 2003 14:49:08 -0700
Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h72Ln1e31495; Sat, 2 Aug 2003 18:49:01 -0300
Date: Sat, 2 Aug 2003 18:49:01 -0300
From: Werner Almesberger
To: Jeff Garzik
Cc: Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: TOE brain dump
Message-ID: <20030802184901.G5798@almesberger.net>
References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <3F2C0C44.6020002@pobox.com>; from jgarzik@pobox.com on Sat, Aug 02, 2003 at 03:08:52PM -0400
X-archive-position: 4460
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: werner@almesberger.net
Precedence: bulk
X-list:
netdev Jeff Garzik wrote: > jabbering at the same time. TCP is a "one size fits all" solution, but > it doesn't work well for everyone. But then, ten "optimized xxPs" that work well in two different scenarios each, but not so good in the 98 others, wouldn't be much fun either. It's been tried a number of times. Usually, real life sneaks in at one point or another, leaving behind a complex mess. When they've sorted out these problems, regular TCP has caught up with the great optimized transport protocols. At that point, they return to their niche, sometimes tail between legs and muttering curses, sometimes shaking their fist and boldly proclaiming how badly they'll rub TCP in the dirt in the next round. Maybe they shed off some of the complexity, and trade it for even more aggressive optimization, which puts them into their niche even more firmly. Eventually, they fade away. There are cases where TCP doesn't work well, like a path of badly mismatched link layers, but such paths don't treat any protocol following the end-to-end principle kindly. Another problem of TCP is that it has grown a bit too many knobs you need to turn before it works over your really fast really long pipe. (In one of the OLS after dinner speeches, this was quite appropriately called the "wizard gap".) > It's obviously not over a WAN... That's why NFS turned off UDP checksums ;-) As soon as you put it on IP, it will crawl to distances you didn't imagine in your wildest dreams. It always does. > So, fix the other end of the pipeline too, otherwise this fast network > stuff is flashly but pointless. If you want to serve up data from disk, > then start creating PCI cards that have both Serial ATA and ethernet > connectors on them :) Cut out the middleman of the host CPU and host > memory bus instead of offloading portions of TCP that do not need to be > offloaded. That's a good point. A hierarchical memory structure can help here. Moving one end closer to the hardware, and letting it know (e.g. 
through sendfile) that also the other end is close (or can be reached more directly that through some hopelessly crowded main bus) may help too. - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From werner@almesberger.net Sat Aug 2 15:14:19 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 15:14:22 -0700 (PDT) Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72MEJFl030541 for ; Sat, 2 Aug 2003 15:14:19 -0700 Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h72MEGG25185; Sat, 2 Aug 2003 15:14:17 -0700 Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h72MEBR31594; Sat, 2 Aug 2003 19:14:11 -0300 Date: Sat, 2 Aug 2003 19:14:11 -0300 From: Werner Almesberger To: Alan Cox Cc: netdev@oss.sgi.com, Linux Kernel Mailing List Subject: Re: TOE brain dump Message-ID: <20030802191411.H5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> <1059857864.20305.14.camel@dhcp22.swansea.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1059857864.20305.14.camel@dhcp22.swansea.linux.org.uk>; from alan@lxorguk.ukuu.org.uk on Sat, Aug 02, 2003 at 09:57:44PM +0100 X-archive-position: 4461 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: werner@almesberger.net Precedence: bulk X-list: netdev Alan Cox wrote: > It moves the cost it doesnt make it vanish I don't think it really can. What it can do is reduce the overhead (which usually translates to latency and burstiness) and the sharing. 
> If I read you right you are arguing for a second processor running > Linux.with its own independant memory bus. AMD make those already its > called AMD64. I don't know anyone thinking at that level about > partitioning one as an I/O processor. That's taking this idea to an extreme, yes. I'd think of using something as big as an amd64 for this as "too expensive", but perhaps it's cheap enough in the long run, compared to some "optimized" design. It would certainly have the advantage of already solving various consistency and compatibility issues. (That is, if your host CPUs is/are also amd64.) - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From willy@www.linux.org.uk Sat Aug 2 15:21:49 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 15:21:54 -0700 (PDT) Received: from www.linux.org.uk (IDENT:UcG6xcZ7ts6X+JIzSklM4po5qU39agf+@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72MLlFl031389 for ; Sat, 2 Aug 2003 15:21:48 -0700 Received: from willy by www.linux.org.uk with local (Exim 4.14) id 19j4kz-0004pd-Mh; Sat, 02 Aug 2003 23:21:45 +0100 Date: Sat, 2 Aug 2003 23:21:45 +0100 From: Matthew Wilcox To: Jeff Garzik Cc: Matthew Wilcox , netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-ID: <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030801162536.GA18574@gtf.org> User-Agent: Mutt/1.4.1i X-archive-position: 4462 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: 
netdev-bounce@oss.sgi.com X-original-sender: willy@debian.org Precedence: bulk X-list: netdev On Fri, Aug 01, 2003 at 12:25:36PM -0400, Jeff Garzik wrote: > On Fri, Aug 01, 2003 at 04:46:56PM +0100, Matthew Wilcox wrote: > > On Fri, Aug 01, 2003 at 11:40:21AM -0400, Jeff Garzik wrote: > > > * need SET_ETHTOOL_OPS macro or HAVE_ETHTOOL_OPS test macro or similar > > It's standard netdevice.h practice, and, he didn't disagree w/ my > rebuttal. OK, now that the two of you thrashed out a design, here's my implementation: diff -u drivers/net/8139too.c drivers/net/8139too.c --- drivers/net/8139too.c 31 Jul 2003 17:09:52 -0000 +++ drivers/net/8139too.c 2 Aug 2003 18:38:25 -0000 @@ -973,7 +973,7 @@ dev->do_ioctl = netdev_ioctl; dev->tx_timeout = rtl8139_tx_timeout; dev->watchdog_timeo = TX_TIMEOUT; - dev->ethtool_ops = &rtl8139_ethtool_ops; + set_ethtool_ops(dev, &rtl8139_ethtool_ops); /* note: the hardware is not capable of sg/csum/highdma, however * through the use of skb_copy_and_csum_dev we enable these diff -u drivers/net/tg3.c drivers/net/tg3.c --- drivers/net/tg3.c 31 Jul 2003 11:12:10 -0000 +++ drivers/net/tg3.c 2 Aug 2003 18:37:54 -0000 @@ -6724,11 +6724,11 @@ dev->do_ioctl = tg3_ioctl; dev->tx_timeout = tg3_tx_timeout; dev->poll = tg3_poll; - dev->ethtool_ops = &tg3_ethtool_ops; dev->weight = 64; dev->watchdog_timeo = TG3_TX_TIMEOUT; dev->change_mtu = tg3_change_mtu; dev->irq = pdev->irq; + set_ethtool_ops(dev, &tg3_ethtool_ops); err = tg3_get_invariants(tp); if (err) { diff -u include/linux/netdevice.h include/linux/netdevice.h --- include/linux/netdevice.h 31 Jul 2003 13:06:23 -0000 +++ include/linux/netdevice.h 2 Aug 2003 18:37:16 -0000 @@ -477,6 +477,10 @@ */ #define SET_NETDEV_DEV(net, pdev) ((net)->class_dev.dev = (pdev)) +static inline void set_ethtool_ops(struct net_device *dev, struct ethtool_ops * ops) +{ + dev->ethtool_ops = ops; +} struct packet_type { Happy with that? 
> > > * I still do not see the need to change a simple storage of a constant
> > > (into ethtool_gdrvinfo) into _four_ separate function call hooks (reg
> > > dump len, eeprom dump len, nic-specific stats len, self-test len).
> > > Internal kernel code that needs this information is always a slow path
> > > anyway, so just call the ->get_drvinfo hook internally.
> >
> > slow path, sure, but increased stack usage. it's a tradeoff, and this way
> > feels more clean to me.
>
> Adding a function hook each time you want to retrieve a new integer
> value? That feels overly excessive to me.

Actually, it's a useful thing to do because it specifies what kind of answer we want. For example, up here, you called them all foo_len. That's not true. Some of them are a byte-count (== len), but some of them are a count of N-byte quantities. That's an unfortunate bit of design, but at least we can make it obvious to the driver-writer what we're expecting of them.

> > > * I prefer not to add '#include ' to ethtool.h
> >
> > That means that any code which includes ethtool.h has to include types.h
> > first (either implicitly or explicitly). The rule so far has been that
> > header files should call out their dependencies explicitly with an include
> > of the appropriate file. So why *don't* you want it?
>
> Because I copy it to userspace :)

linux/types.h exists in userspace ;-) You even _expect_ userspace to have already included it -- or where else are the `u32' quantities defined?

--
"It's not Hollywood. War is real, war is primarily not about defeat or victory, it is about death. I've seen thousands and thousands of dead bodies. Do you think I want to have an academic debate on this subject?"
-- Robert Fisk From jgarzik@pobox.com Sat Aug 2 15:35:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 15:35:04 -0700 (PDT) Received: from www.linux.org.uk (IDENT:VMLs27LF89sjE5OoM6LvqsoQ8uGmYllr@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h72MYwFl032658 for ; Sat, 2 Aug 2003 15:34:59 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19j4xl-0004tx-Rf; Sat, 02 Aug 2003 23:34:57 +0100 Message-ID: <3F2C3C86.6000202@pobox.com> Date: Sat, 02 Aug 2003 18:34:46 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Matthew Wilcox CC: netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> In-Reply-To: <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4463 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Matthew Wilcox wrote: > On Fri, Aug 01, 2003 at 12:25:36PM -0400, Jeff Garzik wrote: > >>On Fri, Aug 01, 2003 at 04:46:56PM +0100, Matthew Wilcox wrote: >> >>>On Fri, Aug 01, 2003 at 11:40:21AM -0400, Jeff Garzik wrote: >>> >>>>* need SET_ETHTOOL_OPS macro or HAVE_ETHTOOL_OPS test macro or similar >> >>It's standard netdevice.h practice, and, he didn't disagree w/ my >>rebuttal. 
> > > OK, now that the two of you thrashed out a design, here's my implementation: > > diff -u drivers/net/8139too.c drivers/net/8139too.c > --- drivers/net/8139too.c 31 Jul 2003 17:09:52 -0000 > +++ drivers/net/8139too.c 2 Aug 2003 18:38:25 -0000 > @@ -973,7 +973,7 @@ > dev->do_ioctl = netdev_ioctl; > dev->tx_timeout = rtl8139_tx_timeout; > dev->watchdog_timeo = TX_TIMEOUT; > - dev->ethtool_ops = &rtl8139_ethtool_ops; > + set_ethtool_ops(dev, &rtl8139_ethtool_ops); > > /* note: the hardware is not capable of sg/csum/highdma, however > * through the use of skb_copy_and_csum_dev we enable these > diff -u drivers/net/tg3.c drivers/net/tg3.c > --- drivers/net/tg3.c 31 Jul 2003 11:12:10 -0000 > +++ drivers/net/tg3.c 2 Aug 2003 18:37:54 -0000 > @@ -6724,11 +6724,11 @@ > dev->do_ioctl = tg3_ioctl; > dev->tx_timeout = tg3_tx_timeout; > dev->poll = tg3_poll; > - dev->ethtool_ops = &tg3_ethtool_ops; > dev->weight = 64; > dev->watchdog_timeo = TG3_TX_TIMEOUT; > dev->change_mtu = tg3_change_mtu; > dev->irq = pdev->irq; > + set_ethtool_ops(dev, &tg3_ethtool_ops); > > err = tg3_get_invariants(tp); > if (err) { > diff -u include/linux/netdevice.h include/linux/netdevice.h > --- include/linux/netdevice.h 31 Jul 2003 13:06:23 -0000 > +++ include/linux/netdevice.h 2 Aug 2003 18:37:16 -0000 > @@ -477,6 +477,10 @@ > */ > #define SET_NETDEV_DEV(net, pdev) ((net)->class_dev.dev = (pdev)) > > +static inline void set_ethtool_ops(struct net_device *dev, struct ethtool_ops * > ops) > +{ > + dev->ethtool_ops = ops; > +} It needs to be a macro for maximum flexibility. Also, no need to convert in-kernel drivers over to using it... Let driver authors use it or not as they choose. 
Jeff From davem@redhat.com Sat Aug 2 17:32:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 17:32:22 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h730WBFl009320 for ; Sat, 2 Aug 2003 17:32:14 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id RAA10269; Sat, 2 Aug 2003 17:28:07 -0700 Date: Sat, 2 Aug 2003 17:28:07 -0700 From: "David S. Miller" To: Jeff Garzik Cc: willy@debian.org, netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-Id: <20030802172807.3d56b4ea.davem@redhat.com> In-Reply-To: <3F2C3C86.6000202@pobox.com> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> <3F2C3C86.6000202@pobox.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4464 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Sat, 02 Aug 2003 18:34:46 -0400 Jeff Garzik wrote: > Matthew Wilcox wrote: > > +static inline void set_ethtool_ops(struct net_device *dev, struct ethtool_ops * > > ops) > > +{ > > + dev->ethtool_ops = ops; > > +} > > > It needs to be a macro for maximum flexibility. Yes, and please name it with capitol letters, ie. SET_ETHTOOL_OPS(), I have no idea why you used lower-case letters when Jeff and I referred to it consistently with caps. 
:-) From davem@redhat.com Sat Aug 2 18:37:31 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 18:37:35 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h731bUFl014661 for ; Sat, 2 Aug 2003 18:37:31 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id SAA10442; Sat, 2 Aug 2003 18:33:10 -0700 Date: Sat, 2 Aug 2003 18:33:10 -0700 From: "David S. Miller" To: chas3@users.sourceforge.net Cc: chas@cmf.nrl.navy.mil, mitch@sfgoth.com, netdev@oss.sgi.com Subject: Re: [Linux-ATM-General] Re: [atmdrvr zatm] Remove obsolete EXACT_TS support Message-Id: <20030802183310.05e2cbbc.davem@redhat.com> In-Reply-To: <200307311426.h6VEQgsG023826@ginger.cmf.nrl.navy.mil> References: <20030730225741.GA57991@gaz.sfgoth.com> <200307311426.h6VEQgsG023826@ginger.cmf.nrl.navy.mil> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4465 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Thu, 31 Jul 2003 10:23:58 -0400 chas williams wrote: > please apply to 2.6. zatm will now compile on smp. it might > actually work if someone had some hardware to test it. Applied. 
From jgarzik@pobox.com Sat Aug 2 20:14:40 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 20:14:49 -0700 (PDT) Received: from www.linux.org.uk (IDENT:KG4N4yt9e4roVKuUrU9+s/Z/FMeFYkHk@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h733EdFl022330 for ; Sat, 2 Aug 2003 20:14:40 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19j9KQ-0006uX-5l; Sun, 03 Aug 2003 04:14:38 +0100 Message-ID: <3F2C7E12.8070904@pobox.com> Date: Sat, 02 Aug 2003 23:14:26 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Matthew Wilcox CC: netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> <3F2C3C86.6000202@pobox.com> <20030803002744.GF22222@parcelfarce.linux.theplanet.co.uk> In-Reply-To: <20030803002744.GF22222@parcelfarce.linux.theplanet.co.uk> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4466 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Matthew Wilcox wrote: > On Sat, Aug 02, 2003 at 06:34:46PM -0400, Jeff Garzik wrote: > >>>diff -u include/linux/netdevice.h include/linux/netdevice.h >>>--- include/linux/netdevice.h 31 Jul 2003 13:06:23 -0000 >>>+++ include/linux/netdevice.h 2 Aug 2003 18:37:16 -0000 >>>@@ -477,6 +477,10 @@ >>> */ >>>#define SET_NETDEV_DEV(net, pdev) ((net)->class_dev.dev = (pdev)) >>> >>>+static inline void set_ethtool_ops(struct net_device *dev, struct 
>>>ethtool_ops *
>>>ops)
>>>+{
>>>+ dev->ethtool_ops = ops;
>>>+}
>>
>>
>>It needs to be a macro for maximum flexibility.
>
>
> Nothing stops it being implemented as a macro in kcompat. Having it as
> an inline function gives it argument typechecking which always gives me
> the warm fuzzies.

No, it _needs_ to be a macro for maximum flexibility.

Most importantly, kcompat code may use '#ifndef SET_ETHTOOL_OPS' as a trigger, to signal that compat code is needed. No need for drivers to create tons of kernel-version-code ifdefs, just to test for when ethtool_ops appeared in 2.6, for when it starts appearing in 2.4 vendor backports, and (possibly) 2.4 itself.

Also, doing it at the cpp level allows compat code to #undef it, if it _really_ knows what it's doing, and the situation calls for it.

>>Also, no need to convert in-kernel drivers over to using it... Let
>>driver authors use it or not as they choose.
>
>
> I took "Like pci_set_drvdata" as the most important part of your
> argument... having everyone use it is no bad thing.

Certainly. I have no real preferences either way, just noting that in-kernel drivers don't _need_ to use this macro.
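
The '#ifndef SET_ETHTOOL_OPS' trigger described above could look roughly like the following. This is only a user-space sketch under stated assumptions: the two structs are stand-ins for the real kernel types, and the compat fallback, COMPAT_NEEDS_ETHTOOL_IOCTL, demo_ethtool_ops and demo_probe() are hypothetical names invented for illustration, not taken from any posted patch.

```c
#include <stddef.h>

/* Stand-in types for this sketch only; the real definitions live in
 * the kernel's netdevice.h and ethtool.h. */
struct ethtool_ops { int placeholder; };
struct net_device { struct ethtool_ops *ethtool_ops; };

/* What a kernel header might provide once ethtool_ops exists. On an
 * older kernel this definition would simply be absent. */
#define SET_ETHTOOL_OPS(netdev, ops) ((netdev)->ethtool_ops = (ops))

/* Hypothetical kcompat fragment: the macro's absence is the trigger,
 * so no kernel-version ifdefs are needed. Not compiled in this sketch
 * because the macro is defined just above. */
#ifndef SET_ETHTOOL_OPS
#define SET_ETHTOOL_OPS(netdev, ops) do { } while (0)
#define COMPAT_NEEDS_ETHTOOL_IOCTL 1
#endif

static struct ethtool_ops demo_ethtool_ops;

/* Driver setup code reads the same on every kernel. */
static void demo_probe(struct net_device *dev)
{
	SET_ETHTOOL_OPS(dev, &demo_ethtool_ops);
}
```

Calling demo_probe() on a zeroed struct net_device leaves its ethtool_ops pointing at demo_ethtool_ops. An inline function would give the type checking Matthew likes, but the preprocessor cannot test for its presence, which is the point of the macro argument.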
Jeff From greearb@candelatech.com Sat Aug 2 20:48:59 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 20:49:08 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h733mwFl025313 for ; Sat, 2 Aug 2003 20:48:58 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h733mptf003818; Sat, 2 Aug 2003 20:48:52 -0700 Message-ID: <3F2C8623.2080106@candelatech.com> Date: Sat, 02 Aug 2003 20:48:51 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Sascha Schumann CC: "'netdev@oss.sgi.com'" Subject: Re: 2.4.21: bug report for tg3: tx lockup when changing MTU References: <3F2AEFBF.3040604@candelatech.com> In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4467 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Sascha Schumann wrote: >>Kernel is 2.4.21 + custom patches (which should not affect tg3). >> >>lspci says the NIC is: Altima AC9100 (rev 15) > > > [1] says that the AC9100 based Netgear GA302T cards don't > support jumbo frames. I'm seeing regular lockups once > packets larger than 1500bytes flow through the NIC. > > It would be cool though if this turned out to be a driver > limitation and not a (crippled) chipset issue. It definately handles 4000 byte frames just fine, you just need to ifdown and ifup it after changing the MTU much of the time...or maybe only when running it under heavy load when you make the MTU change... 
Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb@candelatech.com Sat Aug 2 21:01:44 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 21:01:54 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7341iFl026650 for ; Sat, 2 Aug 2003 21:01:44 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h7341Vtf005412; Sat, 2 Aug 2003 21:01:31 -0700 Message-ID: <3F2C891B.7080004@candelatech.com> Date: Sat, 02 Aug 2003 21:01:31 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Jeff Garzik CC: Nivedita Singhvi , Werner Almesberger , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> In-Reply-To: <3F2C0C44.6020002@pobox.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4468 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Jeff Garzik wrote: > So, fix the other end of the pipeline too, otherwise this fast network > stuff is flashly but pointless. If you want to serve up data from disk, > then start creating PCI cards that have both Serial ATA and ethernet > connectors on them :) Cut out the middleman of the host CPU and host I for one would love to see something like this, and not just Serial ATA.. 
but maybe 8x Serial ATA and RAID :) Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From scott.feldman@intel.com Sat Aug 2 21:34:50 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 21:34:58 -0700 (PDT) Received: from caduceus.jf.intel.com (fmr06.intel.com [134.134.136.7]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h734YmFl028503 for ; Sat, 2 Aug 2003 21:34:49 -0700 Received: from talaria.jf.intel.com (talaria.jf.intel.com [10.7.209.7]) by caduceus.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h734Sid08681 for ; Sun, 3 Aug 2003 04:28:44 GMT Received: from orsmsxvs041.jf.intel.com (orsmsxvs041.jf.intel.com [192.168.65.54]) by talaria.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h733wGY02567 for ; Sun, 3 Aug 2003 03:58:17 GMT Received: from orsmsx332.amr.corp.intel.com ([192.168.65.60]) by orsmsxvs041.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080221344226801 for ; Sat, 02 Aug 2003 21:34:42 -0700 Received: from orsmsx402.amr.corp.intel.com ([192.168.65.208]) by orsmsx332.amr.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Sat, 2 Aug 2003 21:34:42 -0700 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: e100 "Ferguson" release Date: Sat, 2 Aug 2003 21:34:42 -0700 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: e100 "Ferguson" release Thread-Index: AcNZeI636C/uaYjsSwqQ/jrIhuMDyw== From: "Feldman, Scott" To: X-OriginalArrivalTime: 03 Aug 2003 04:34:42.0802 (UTC) FILETIME=[8F15BD20:01C35978] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h734YmFl028503 X-archive-position: 4469 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: scott.feldman@intel.com 
Precedence: bulk X-list: netdev New development version: http://sf.net/projects/e1000, e100-3.0.0_dev11.tar.gz Many thanks to JC [jchapman@katalix.com] for exploring the small packet performance w/ and w/o NAPI. This version includes one of his optimization; others may follow, but I wanted to get this goodness out now. * added opportunistic fast loop (no udelays) in e100_exec_cmd to wait for previous cmd to be accepted before queuing next cmd. Boost small packet performance. [jchapman@katalix.com]. * Use correct versions of dev_kfree_skb for depending on possible contexts. [jchapman@katalix.com]. * Added SET_NETDEV_DEV(). Looking for more testing on non-IA archs, power management, cardbus nics, and WoL. -scott From willy@www.linux.org.uk Sat Aug 2 22:01:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 22:01:14 -0700 (PDT) Received: from www.linux.org.uk (IDENT:xUszKN2jPgzXzfcq0ScA0iCxK31fnXwn@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73516Fl030805 for ; Sat, 2 Aug 2003 22:01:08 -0700 Received: from willy by www.linux.org.uk with local (Exim 4.14) id 19j6iu-0005dM-4h; Sun, 03 Aug 2003 01:27:44 +0100 Date: Sun, 3 Aug 2003 01:27:44 +0100 From: Matthew Wilcox To: Jeff Garzik Cc: Matthew Wilcox , netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-ID: <20030803002744.GF22222@parcelfarce.linux.theplanet.co.uk> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> <3F2C3C86.6000202@pobox.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3F2C3C86.6000202@pobox.com> User-Agent: Mutt/1.4.1i X-archive-position: 4470 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com 
X-original-sender: willy@debian.org Precedence: bulk X-list: netdev On Sat, Aug 02, 2003 at 06:34:46PM -0400, Jeff Garzik wrote: > >diff -u include/linux/netdevice.h include/linux/netdevice.h > >--- include/linux/netdevice.h 31 Jul 2003 13:06:23 -0000 > >+++ include/linux/netdevice.h 2 Aug 2003 18:37:16 -0000 > >@@ -477,6 +477,10 @@ > > */ > > #define SET_NETDEV_DEV(net, pdev) ((net)->class_dev.dev = (pdev)) > > > >+static inline void set_ethtool_ops(struct net_device *dev, struct > >ethtool_ops * > >ops) > >+{ > >+ dev->ethtool_ops = ops; > >+} > > > It needs to be a macro for maximum flexibility. Nothing stops it being implemented as a macro in kcompat. Having it as an inline function gives it argument typechecking which always gives me the warm fuzzies. > Also, no need to convert in-kernel drivers over to using it... Let > driver authors use it or not as they choose. I took "Like pci_set_drvdata" as the most important part of your argument... having everyone use it is no bad thing. -- "It's not Hollywood. War is real, war is primarily not about defeat or victory, it is about death. I've seen thousands and thousands of dead bodies. Do you think I want to have an academic debate on this subject?" 
-- Robert Fisk From jsanchez@cs.ucf.edu Sat Aug 2 22:50:58 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 22:51:04 -0700 (PDT) Received: from longwood.cs.ucf.edu (longwood.cs.ucf.edu [132.170.108.1]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h735ouFl002605 for ; Sat, 2 Aug 2003 22:50:57 -0700 Received: from mobile (eola [132.170.108.2]) by longwood.cs.ucf.edu (8.12.2/8.12.2) with ESMTP id h735oqB4001424 for ; Sun, 3 Aug 2003 01:50:52 -0400 (EDT) Subject: Re: [Bug 1030] New: racoon causes oops when implementing IPSec key From: Justin Sanchez To: netdev@oss.sgi.com In-Reply-To: <20030802212018.B14141@electric-eye.fr.zoreil.com> References: <89550000.1059833972@[10.10.2.4]> <20030802163333.A12217@electric-eye.fr.zoreil.com> <1059850039.1187.2.camel@mobile> <20030802212018.B14141@electric-eye.fr.zoreil.com> Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="=-2Y8LE63gCCNUG9OvqJTX" X-Mailer: Ximian Evolution 1.0.8 Date: 03 Aug 2003 01:51:22 -0400 Message-Id: <1059889883.1187.15.camel@mobile> Mime-Version: 1.0 X-archive-position: 4471 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jsanchez@cs.ucf.edu Precedence: bulk X-list: netdev --=-2Y8LE63gCCNUG9OvqJTX Content-Type: text/plain Content-Transfer-Encoding: quoted-printable Hi. I had this problem on 2.6.0-test1 and -test2 and -test2-bk2, so I'll try to report it. I'm new to the scene, so I apologize in advance for this post. Background. 2 machines. e100 cards on each, if it matters. ipsec-tools 0.2.2. I give each of them directives to use esp and ah in transport mode. I turn on racoon on each box. I ping. Both panic. Following is about the message:. 
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c02bbd06
*pde = 00000000
Oops: 0000 [#1]
CPU: 0
EIP: 0060:[] Not tainted
EFLAGS: 00010206
EIP is at memcpy+0x1e/0x39
eax: 00000018 ebx: f6fe8a00 ecx: 00000006 edx: 00000000
esi: 00000000 edi: 00000000 ebp: c0562520 esp: f6fb5ccc
ds: 007b es: 007b ss:0068
Process racoon (pid: 418, threadinfo=f6fb4000 task=f6fbb300)
Stack:
Call Trace:
 xfrm_state_update
 pfkey_add
 parse_exthdrs
 pfkey_process
 pfkey_sendmsg
 sock_sendmsg
 verify_iovec
 sys_sendmsg
 sockfd_lookup
 sys_sendto
 sys_getsockname
 __pollwait
 update_process
 sys_send
 sys_socketcall
 syscall_call
Code: f3 a5 a8 02 74 02 66 a5 a8 01 74 01 a4 89 d0 8b 74 24 02 8b
<0>Kernel panic: Fatal exception in interrupt
In interrupt handler = not syncing

If you want the full text of it, it's at 67.9.9.32/oops.jpg.

I'm probably just doing something stupid...

On Sat, 2003-08-02 at 15:20, Francois Romieu wrote:
> Justin Sanchez :
> [...]
> > How current? I've just seen it in -test2-bk2.
>
> Forwarded to davem@redhat.com.
>
> You may consider posting the data of the bug-report updated to -test2-bk2
> on netdev@oss.sgi.com.
>
> --
> Ueimor
>

--=-2Y8LE63gCCNUG9OvqJTX
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This is a digitally signed message part

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQA/LKLaiLmb/rWLQdQRAmWuAJ4g7wXLF1O+gFi+jrLeThezwWAsywCgkiao
YUjA6YtWFR9yOVO/5JnRKZc=
=6zEM
-----END PGP SIGNATURE-----

--=-2Y8LE63gCCNUG9OvqJTX--

From davem@redhat.com Sat Aug 2 23:00:29 2003
Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 23:00:32 -0700 (PDT)
Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7360SFl003649 for ; Sat, 2 Aug 2003 23:00:29 -0700
Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id WAA10944; Sat, 2 Aug 2003 22:56:20 -0700
Date: Sat, 2 Aug 2003 22:56:19 -0700
From: "David S. Miller"
To: Daniel Ritz
Cc: linux-net@vger.kernel.org, netdev@oss.sgi.com
Subject: Re: [PATCH 2.6] Fix IPv6 esp mem leak in esp6_input
Message-Id: <20030802225619.17d477e3.davem@redhat.com>
In-Reply-To: <200308021350.23342.daniel.ritz@gmx.ch>
References: <200308021350.23342.daniel.ritz@gmx.ch>
X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-archive-position: 4472
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davem@redhat.com
Precedence: bulk
X-list: netdev

On Sat, 2 Aug 2003 13:50:23 +0200 Daniel Ritz wrote:

> fixes a mem leak in esp6_input() in the error paths. and return -ENOMEM,
> not -EINVAL when out of memory. against 2.6.0-test2-bk

Patch applied, thanks Daniel.
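
Daniel's patch itself is not reproduced in this message, so the following is not the esp6_input() change. It is only a generic user-space sketch of the pattern his description implies: every error path releases what was already allocated, and an allocation failure is reported as -ENOMEM rather than -EINVAL. All names (demo_alloc, demo_free, demo_input, the two counters) are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical test hooks so the failure path can be exercised. */
static int demo_fail_second_alloc;
static int demo_live_allocs;

static void *demo_alloc(size_t n, int is_second)
{
	void *p;

	if (is_second && demo_fail_second_alloc)
		return NULL;		/* simulate running out of memory */
	p = malloc(n);
	if (p)
		demo_live_allocs++;
	return p;
}

static void demo_free(void *p)
{
	if (p) {
		demo_live_allocs--;
		free(p);
	}
}

/* Two allocations, one exit label: nothing leaks on any path. */
static int demo_input(void)
{
	int ret = 0;
	void *iv = NULL;
	void *sg = NULL;

	iv = demo_alloc(16, 0);
	if (!iv) {
		ret = -ENOMEM;
		goto out;
	}
	sg = demo_alloc(64, 1);
	if (!sg) {
		ret = -ENOMEM;		/* out of memory, not a bad argument */
		goto out;
	}
	/* ... real work would happen here ... */
out:
	demo_free(sg);
	demo_free(iv);
	return ret;
}
```

With demo_fail_second_alloc set, demo_input() returns -ENOMEM and demo_live_allocs is back at zero afterwards, which is the property the description of the fix asks for.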
From davem@redhat.com Sat Aug 2 23:05:03 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 23:05:07 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73652Fl004305 for ; Sat, 2 Aug 2003 23:05:03 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id WAA10978; Sat, 2 Aug 2003 22:59:48 -0700 Date: Sat, 2 Aug 2003 22:59:48 -0700 From: "David S. Miller" To: Ville Nuorvala Cc: yoshfuji@linux-ipv6.org, netdev@oss.sgi.com Subject: Re: [PATCH] IPV6: Incorrect hoplimit in ip6_push_pending_frames() Message-Id: <20030802225948.01c96fb7.davem@redhat.com> In-Reply-To: References: X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4473 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 1 Aug 2003 14:15:21 +0300 (EEST) Ville Nuorvala wrote: > I noticed the hop limit passed to ip6_append_data() isn't used by > ip6_push_pending_frames(), which might lead to unexpected behavior with > multicast and (ipv6-in-ipv6) tunneled packets. This patch (against Linux > 2.6.0-test2 and cset 1.1595) fixes the problem. Applied, thank you. 
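
For readers skimming the archive, the idea behind Ville's fix can be modelled in a few lines of user-space C. This is a simplified sketch, not kernel code: the structs below are cut-down stand-ins for the socket's IPv6 state and the IPv6 header, and append_data() and push_pending_frames() only mimic the roles of ip6_append_data() and ip6_push_pending_frames().

```c
#include <stdint.h>

/* Cut-down stand-ins; the kernel structures carry many more fields. */
struct cork_state { int hop_limit; };
struct pinfo {
	int hop_limit;			/* per-socket default */
	struct cork_state cork;		/* state saved while data is corked */
};
struct hdr { uint8_t hop_limit; };

/* Models the cork setup in the append step: remember the hop limit
 * the caller asked for. */
static void append_data(struct pinfo *np, int hlimit)
{
	np->cork.hop_limit = hlimit;
}

/* Models the push step: build the header from the corked value rather
 * than the socket default. */
static void push_pending_frames(const struct pinfo *np, struct hdr *h)
{
	h->hop_limit = (uint8_t)np->cork.hop_limit;
}
```

With a socket default of 64 and a caller passing 1 (as a multicast sender might), the header ends up with hop limit 1. Before the fix the header would have been filled from np->hop_limit, that is 64.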
From jgarzik@pobox.com Sat Aug 2 23:13:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 23:13:06 -0700 (PDT) Received: from www.linux.org.uk (IDENT:brcXRjmAJ7+L5OReqrzNa8QAiyIV4/9i@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h736D1Fl005185 for ; Sat, 2 Aug 2003 23:13:02 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19jC72-0002bL-FW; Sun, 03 Aug 2003 07:13:00 +0100 Message-ID: <3F2CA7E1.6060800@pobox.com> Date: Sun, 03 Aug 2003 02:12:49 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "Feldman, Scott" CC: netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: <3F2CA65F.8060105@pobox.com> In-Reply-To: <3F2CA65F.8060105@pobox.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4474 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Jeff Garzik wrote: > * (extremely minor) some people (like me :)) consider dead reads like > the readb() call in e100_write_flush er, that was a bit incomplete. completing: ... needing to be marked explicitly with a "(void) " prefix, indicating it is intentionally a dead read. Maintainer's call, ultimately, though... 
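A minimal sketch of the "(void)" convention mentioned above, assuming a user-space stand-in for readb(). The names model_readb() and model_write_flush() are hypothetical and are not taken from the e100 driver; the point is only how the cast marks the discarded read as intentional.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's readb(): one volatile 8-bit load. */
uint8_t model_readb(const volatile uint8_t *addr)
{
    return *addr;
}

/* Read a register back and throw the value away, as a write flush does.
 * The (void) cast tells the reader (and the compiler) that ignoring the
 * result is deliberate, not an oversight. */
void model_write_flush(const volatile uint8_t *status_reg)
{
    (void) model_readb(status_reg);
}
```

The cast changes nothing at run time; it is purely documentation at the call site.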
From alan@storlinksemi.com Sat Aug 2 23:23:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 23:23:12 -0700 (PDT) Received: from smtp013.mail.yahoo.com (smtp013.mail.yahoo.com [216.136.173.57]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h736N7Fl006172 for ; Sat, 2 Aug 2003 23:23:08 -0700 Received: from cpe-66-1-155-95.ca.sprintbbd.net (HELO AlanLap) (alansuntzishih@66.1.155.95 with login) by smtp.mail.vip.sc5.yahoo.com with SMTP; 3 Aug 2003 06:23:06 -0000 From: "Alan Shih" To: "Ben Greear" , "Jeff Garzik" Cc: "Nivedita Singhvi" , "Werner Almesberger" , , Subject: RE: TOE brain dump Date: Sat, 2 Aug 2003 23:22:52 -0700 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2727.1300 In-Reply-To: <3F2C891B.7080004@candelatech.com> X-archive-position: 4475 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@storlinksemi.com Precedence: bulk X-list: netdev A DMA xfer that fills the NIC pipe with IDE source. That's not very hard... need a lot of bufferring/FIFO though. May require large modification to the file serving applications? Alan -----Original Message----- From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Ben Greear Sent: Saturday, August 02, 2003 9:02 PM To: Jeff Garzik Cc: Nivedita Singhvi; Werner Almesberger; netdev@oss.sgi.com; linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Jeff Garzik wrote: > So, fix the other end of the pipeline too, otherwise this fast network > stuff is flashly but pointless. 
If you want to serve up data from disk, > then start creating PCI cards that have both Serial ATA and ethernet > connectors on them :) Cut out the middleman of the host CPU and host I for one would love to see something like this, and not just Serial ATA.. but maybe 8x Serial ATA and RAID :) Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ From jgarzik@pobox.com Sat Aug 2 23:40:47 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 23:41:00 -0700 (PDT) Received: from www.linux.org.uk (IDENT:NELjEc2FssOMkU3BYyvJ/bjDS2/IyxUi@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h736ejFl007715 for ; Sat, 2 Aug 2003 23:40:46 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19jCXs-0002jd-97; Sun, 03 Aug 2003 07:40:44 +0100 Message-ID: <3F2CAE61.7070401@pobox.com> Date: Sun, 03 Aug 2003 02:40:33 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: netdev@oss.sgi.com, linux-kernel@vger.kernel.org CC: Werner Almesberger , Nivedita Singhvi Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> In-Reply-To: <20030802184901.G5798@almesberger.net> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4476 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Werner 
Almesberger wrote:
> Jeff Garzik wrote:
>
>>jabbering at the same time. TCP is a "one size fits all" solution, but
>>it doesn't work well for everyone.
>
>
> But then, ten "optimized xxPs" that work well in two different
> scenarios each, but not so good in the 98 others, wouldn't be
> much fun either.
>
> It's been tried a number of times. Usually, real life sneaks
> in at one point or another, leaving behind a complex mess.
> When they've sorted out these problems, regular TCP has caught
> up with the great optimized transport protocols. At that point,
> they return to their niche, sometimes tail between legs and
> muttering curses, sometimes shaking their fist and boldly
> proclaiming how badly they'll rub TCP in the dirt in the next
> round. Maybe they shed off some of the complexity, and trade it
> for even more aggressive optimization, which puts them into
> their niche even more firmly. Eventually, they fade away.
>
> There are cases where TCP doesn't work well, like a path of
> badly mismatched link layers, but such paths don't treat any
> protocol following the end-to-end principle kindly.
>
> Another problem of TCP is that it has grown a bit too many
> knobs you need to turn before it works over your really fast
> really long pipe. (In one of the OLS after dinner speeches,
> this was quite appropriately called the "wizard gap".)
>
>
>>It's obviously not over a WAN...
>
>
> That's why NFS turned off UDP checksums ;-) As soon as you put
> it on IP, it will crawl to distances you didn't imagine in your
> wildest dreams. It always does.

Really fast, really long pipes in practice don't exist for 99.9% of all
Internet users.

When you approach traffic levels that push you to want to offload most
of the TCP net stack, then TCP isn't the right solution for you anymore,
all things considered.

The Linux net stack just isn't built to be offloaded. TOE engines will
either need to (1) fall back to Linux software for all-but-the-common
case (otherwise netfilter, etc.
break), or, (2) will need to be hideously complex beasts themselves.
And I can't see ASIC and firmware designers being excited about
implementing netfilter on a PCI card :)

Unfortunately some vendors seem to be choosing TOE option #3: TCP
offload which introduces many limitations (connection limits, netfilter
not supported, etc.) which Linux never had before. Vendors don't seem
to realize TOE has real potential to damage the "good network neighbor"
image the net stack has. The Linux net stack's behavior is known,
documented, predictable. TOE changes all that.

There is one interesting TOE solution, that I have yet to see created:
run Linux on an embedded processor, on the NIC. This stripped-down
Linux kernel would perform all the header parsing, checksumming, etc.
into the NIC's local RAM. The Linux OS driver interface becomes a
virtual interface with a large MTU, that communicates from host CPU to
NIC across the PCI bus using jumbo-ethernet-like data frames.
Management frames would control the ethernet interface on the other side
of the PCI bus "tunnel".

>>So, fix the other end of the pipeline too, otherwise this fast network
>>stuff is flashly but pointless. If you want to serve up data from disk,
>>then start creating PCI cards that have both Serial ATA and ethernet
>>connectors on them :) Cut out the middleman of the host CPU and host
>>memory bus instead of offloading portions of TCP that do not need to be
>>offloaded.
>
>
> That's a good point. A hierarchical memory structure can help
> here. Moving one end closer to the hardware, and letting it
> know (e.g. through sendfile) that also the other end is close
> (or can be reached more directly that through some hopelessly
> crowded main bus) may help too.

Definitely.

Jeff From jgarzik@pobox.com Sat Aug 2 23:41:50 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 02 Aug 2003 23:41:53 -0700 (PDT) Received: from www.linux.org.uk (IDENT:r2QFDDBea5MOAjl2dLgz0iK1HyGBBbZ4@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h736fmFl008008 for ; Sat, 2 Aug 2003 23:41:49 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19jCYr-0002jy-5D; Sun, 03 Aug 2003 07:41:45 +0100 Message-ID: <3F2CAE9D.5090401@pobox.com> Date: Sun, 03 Aug 2003 02:41:33 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Alan Shih CC: Ben Greear , Nivedita Singhvi , Werner Almesberger , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4477 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Alan Shih wrote: > A DMA xfer that fills the NIC pipe with IDE source. That's not very hard... > need a lot of bufferring/FIFO though. May require large modification to the > file serving applications? Nope, that's using the existing sendfile(2) facility. 
Jeff From jgarzik@pobox.com Sun Aug 3 00:00:30 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 00:00:36 -0700 (PDT) Received: from www.linux.org.uk (IDENT:f0VnXDXUgPOR/pJVSNM/K9f0DO93mSKY@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7370SFl009721 for ; Sun, 3 Aug 2003 00:00:29 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19jC0o-0002Yz-G8; Sun, 03 Aug 2003 07:06:34 +0100 Message-ID: <3F2CA65F.8060105@pobox.com> Date: Sun, 03 Aug 2003 02:06:23 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "Feldman, Scott" CC: netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4478 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Comments: * Given that e100 is only 10/100 hardware, I like the decision to not support rx/tx checksumming and zero-copy. Particularly with some e100's, this eliminates various worries related to chip errata. And as with any "do it in software" solution, you guarantee that the chip never screws up and "acks" a checksum incorrectly, thus passing corrupted data up into the net stack. * (API) Does the out-of-tx-resources condition in e100_xmit_frame ever really happen? I am under the impression that returning non-zero in ->hard_start_xmit results in the packet sometimes being requeued and sometimes dropped. I prefer to guarantee a more-steady state, by simply dropping the packet unconditionally, when this uncommon condition occurs. 
So, I would
a) mark the failure condition with unlikely(), and
b) if the condition occurs, simply drop the packet (tx_dropped++, kfree
skb), and return zero.

Though, ultimately, I wish the net stack would support some way to
_guarantee_ that the skb is requeued for transmit. Some packet
schedulers in the kernel will drop the skb even if the ->hard_start_xmit
return code indicates "requeue". This makes sense from the rule of
"skbs are lossy, and can be dropped"... but it really sucks on hardware
where unexpected -- but temporary -- loss of TX resources occurs. One
can prevent 20-50% (or more) packet loss on certain classes of
connections, simply by being able to tell the net stack "hey, if I could
go back in time and issue a netif_stop_queue, before you called
->hard_start_xmit, I would" :)

* (minor) for completeness, you should limit the PCI class in the
pci_device_id table to PCI_CLASS_NETWORK_ETHERNET. There are
one-in-a-million cases where this matters, but it's usually a BIOS bug.
Still, it's there in pci_device_id table, and it's an easy change, so
might as well use it. This is a good janitor task for other PCI net
drivers, too.

* (long term) I really like Ben H.'s work in
drivers/net/sungem_phy.[ch] -- and similar benh code in ibm_emac -- and
want to make his code generic for most MII phys. Just something to read
and keep in mind.

* (style) your struct config definition is terribly clever. perhaps
too clever, making it unreadable? Not a specific complaint, mind you,
just something that caught my eye.

* (minor) in tg3, my own benchmarks and experiments showed it helped to
explicitly use ____cacheline_aligned markers when defining certain
sections of members in struct tg3 (or struct nic, in e100's case). You
already clearly pay attention to member layout WRT cache effects, but if
you have a clear dividing line, or lines, in struct nic you can use
____cacheline_aligned for even greater benefit.
At a minimum test it with a cpu-usage-measuring benchmark like ttcp,
though, of course :) IIRC I divided tg3's struct into rx, tx, and
"other" sections.

* (extremely minor) some people (like me :)) consider dead reads like
the readb() call in e100_write_flush

* (major?) Aren't there some clunky e100 adapters that don't do MMIO?
Do we care?

* I would love to see feedback from people testing this driver on ppc64
and sparc64, particularly.

* (style, minor) My eyes would prefer functions like e100_hw_reset to
have a few more blank lines in them, spreading code+comment blocks out a
bit.

* (extremely minor) one wonders if you really need the write flush in
mdio_ctrl. If the flush is removed, the same net effect appears to
occur.

* (boring but needed) convert all the magic numbers in e100_configure
into constants, or at least add comments describing the magic numbers.
If the value is just one bit, you might simply append "/* true */", for
example. The general idea is to make the "member name = value" list a
little bit more readable to somebody who doesn't know the hardware, and
struct config, intimately.

* IIRC Donald's MII phy scanning code scans MII phy ids like this:
1..31,0. Or maybe 1..31, and then 0 iff no MII phys were found. In
general I would prefer to follow his eepro100.c probe order. Some phys
need this because they will report on both phy id #0 (which is magical)
and phy id #(non-zero). Donald would know more than me, here.

* I like the e100_exec_cb stuff, with the callbacks.

* Is it easy to support MII phy interrupts? It would be nice to get a
callback that was handled immediately, on phys that do support such
interrupts.

* do we care about spinlocks around the update_stats and get_stats code?

* (bugs) in e100_up, you should undo mod_timer [major] and
netif_start_queue [minor], if request_irq fails. And maybe stop the
receiver, too?

* for all constants 0xffffffff (and others as well if you so choose),
prefer the C99 suffix to a cast.
This is particularly relevant in pci_set_dma_mask calls, where one should be using 0xffffffffULL, but applies to other constants as well. * (potential races) in e100_probe, you want to call register_netdev as basically the last operation that can fail, if possible. Particularly, you need to move the PCI API operations above register_netdev. Remember, register_netdev winds up calling /sbin/hotplug, which in turn calls programs that will want to start using the interface. So you need to have everything set up by that point, really. * in e100_probe, "if(nic->csr == 0UL) {" should really just test for NULL, because ioremap is defined to return a pointer... * (minor) use a netif_msg_xxx wrapper/constant in e100_init_module test? From greearb@candelatech.com Sun Aug 3 00:32:09 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 00:32:19 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h737W8Fl012468 for ; Sun, 3 Aug 2003 00:32:09 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h737W1tf031554; Sun, 3 Aug 2003 00:32:01 -0700 Message-ID: <3F2CBA71.2070503@candelatech.com> Date: Sun, 03 Aug 2003 00:32:01 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Jeff Garzik CC: "Feldman, Scott" , netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: <3F2CA65F.8060105@pobox.com> In-Reply-To: <3F2CA65F.8060105@pobox.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4479 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Jeff Garzik wrote: > Comments: > * (API) Does the 
out-of-tx-resources condition in e100_xmit_frame ever
> really happen? I am under the impression that returning non-zero in
> ->hard_start_xmit results in the packet sometimes being requeued and
> sometimes dropped. I prefer to guarantee a more-steady state, by simply
> dropping the packet unconditionally, when this uncommon condition
> occurs. So, I would
> a) mark the failure condition with unlikely(), and
> b) if the condition occurs, simply drop the packet (tx_dropped++, kfree
> skb), and return zero.
>
> Though, ultimately, I wish the net stack would support some way to
> _guarantee_ that the skb is requeued for transmit. Some packet
> schedulers in the kernel will drop the skb even if the ->hard_start_xmit
> return code indicates "requeue". This makes sense from the rule of
> "skbs are lossy, and can be dropped"... but it really sucks on hardware
> where unexpected -- but temporary -- loss of TX resources occurs. One
> can prevent 20-50% (or more) packet loss on certain classes of
> connections, simply by being able to tell the net stack "hey, if I could
> go back in time and issue a netif_stop_queue, before you called
> ->hard_start_xmit, I would" :)

Although I have not tried this latest patch, the existing e100 and e1000 in
2.4.21 seldom seem to return true to this method: netif_queue_stopped(odev),
even when the next hard_start_xmit() call fails.

For instance, this is the code I use in pktgen.c:

	if (!netif_queue_stopped(odev)) {
		if (odev->hard_start_xmit(next->skb, odev)) {
			if (net_ratelimit()) {
				printk(KERN_INFO "Hard xmit error\n");
			}
			next->errors++;
			next->last_ok = 0;
			queue_stopped++;
		} else {
			queue_stopped = 0;
			next->last_ok = 1;
			next->sofar++;
			next->tx_bytes += (next->cur_pkt_size + 4); /* count csum */
		}

With e100 and e1000, I see very large numbers of hard_start_xmit
failures when running very high packets-per-second rates (small packets).
I see virtually no failures with tulip.
pktgen knows how to re-queue, but it's curious it has to so often. For code that does not requeue, this could be even more of a bummer. To point b), I think if the driver accepts the packet in hard_start_xmit, it should be able to send the packet out, otherwise return the 'requeue' value and let the calling code know. It is very important to me, at least, to know if a packet has really been sent or not. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From davem@redhat.com Sun Aug 3 00:36:54 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 00:37:01 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h737aqFl013180 for ; Sun, 3 Aug 2003 00:36:53 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id AAA11387; Sun, 3 Aug 2003 00:32:39 -0700 Date: Sun, 3 Aug 2003 00:32:39 -0700 From: "David S. Miller" To: Ben Greear Cc: jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release Message-Id: <20030803003239.4257ef24.davem@redhat.com> In-Reply-To: <3F2CBA71.2070503@candelatech.com> References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4480 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev > Although I have not tried this latest patch, the existing e100 and e1000 in > 2.4.21 seldom seem to return true to this method: netif_queue_stopped(odev), > even when the next hard_start_xmit() call fails. 
Returning an error from hard_start_xmit() from normal ethernet
drivers is, frankly, a driver bug and should never happen.

I don't know if there is somehow something special about the
e100, but even if there is, hard_start_xmit() failures can be
avoided by properly calling netif_{stop,wake}_queue() in the right
places.

From david.lang@digitalinsight.com Sun Aug 3 01:27:39 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 01:27:50 -0700 (PDT)
Received: from warden.diginsite.com (warden-p.diginsite.com [208.29.163.248])
	by oss.sgi.com (8.12.9/8.12.9) with SMTP id h738RcFl017767
	for ; Sun, 3 Aug 2003 01:27:39 -0700
Received: from wlvims01.diginsite.com by warden.diginsite.com
	via smtpd (for oss.SGI.COM [192.48.159.27]) with SMTP; Sun, 3 Aug 2003 01:27:38 -0700
Received: from calexc01.digitalinsight.com ([10.200.0.20]) by wlvims01.digitalinsight.com
	(Post.Office MTA v3.5.3 release 223 ID# 0-0U10L2S100V35)
	with ESMTP id com; Sun, 3 Aug 2003 01:26:48 -0700
Received: by calexc01.diginsite.com with Internet Mail Service (5.5.2653.19)
	id ; Sun, 3 Aug 2003 01:27:31 -0700
Received: from dlang.diginsite.com ([10.201.10.67]) by wlvexc00.digitalinsight.com
	with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2656.59)
	id P0FXW757; Sun, 3 Aug 2003 01:27:21 -0700
From: David Lang
To: Alan Shih
Cc: Ben Greear , Jeff Garzik , Nivedita Singhvi , Werner Almesberger ,
	netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Date: Sun, 3 Aug 2003 01:25:48 -0700 (PDT)
Subject: RE: TOE brain dump
In-Reply-To:
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 4481
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: david.lang@digitalinsight.com
Precedence: bulk
X-list: netdev

do you really want the processor on the card to be running
apache/NFS/Samba/etc ?

putting enough linux on the card to act as a router (which would include
the netfilter stuff) is one thing.

putting the userspace code that interfaces with the outside world for file
transfers is something else.

if you really want the disk connected to your network card you are just
talking about a low-end linux box. forget all this stuff about it being on
a card and just use a full box (economies of scale will make this cheaper)

making a firewall that's a core system with a dozen slave systems attached
to it (the network cards) sounds like the type of clustering that Linux
has been used for with compute nodes. complicated to set up, but extremely
powerful and scalable once configured.

if you want more than a router on the card then Alan Cox is right, just add
another processor to the system, it's easier and cheaper.

David Lang

On Sat, 2 Aug 2003, Alan Shih wrote:

> Date: Sat, 2 Aug 2003 23:22:52 -0700
> From: Alan Shih
> To: Ben Greear , Jeff Garzik
> Cc: Nivedita Singhvi ,
> Werner Almesberger , netdev@oss.sgi.com,
> linux-kernel@vger.kernel.org
> Subject: RE: TOE brain dump
>
> A DMA xfer that fills the NIC pipe with IDE source. That's not very hard...
> need a lot of bufferring/FIFO though. May require large modification to the
> file serving applications?
>
> Alan
>
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Ben Greear
> Sent: Saturday, August 02, 2003 9:02 PM
> To: Jeff Garzik
> Cc: Nivedita Singhvi; Werner Almesberger; netdev@oss.sgi.com;
> linux-kernel@vger.kernel.org
> Subject: Re: TOE brain dump
>
>
> Jeff Garzik wrote:
>
> > So, fix the other end of the pipeline too, otherwise this fast network
> > stuff is flashly but pointless.
If you want to serve up data from disk, > > then start creating PCI cards that have both Serial ATA and ethernet > > connectors on them :) Cut out the middleman of the host CPU and host > > I for one would love to see something like this, and not just Serial ATA.. > but maybe 8x Serial ATA and RAID :) > > Ben > > > -- > Ben Greear > Candela Technologies Inc http://www.candelatech.com > > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > From willy@www.linux.org.uk Sun Aug 3 07:57:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 07:57:05 -0700 (PDT) Received: from www.linux.org.uk (IDENT:dxP8jV70pveQkX3/KA355Yf9YCtbJJSj@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73EuwFl025115 for ; Sun, 3 Aug 2003 07:57:00 -0700 Received: from willy by www.linux.org.uk with local (Exim 4.14) id 19jKI4-0006o5-8W; Sun, 03 Aug 2003 15:56:56 +0100 Date: Sun, 3 Aug 2003 15:56:56 +0100 From: Matthew Wilcox To: Jeff Garzik Cc: Matthew Wilcox , netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-ID: <20030803145656.GI22222@parcelfarce.linux.theplanet.co.uk> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> <3F2C3C86.6000202@pobox.com> <20030803002744.GF22222@parcelfarce.linux.theplanet.co.uk> <3F2C7E12.8070904@pobox.com> Mime-Version: 1.0 Content-Type: 
text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3F2C7E12.8070904@pobox.com> User-Agent: Mutt/1.4.1i X-archive-position: 4482 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: willy@debian.org Precedence: bulk X-list: netdev On Sat, Aug 02, 2003 at 11:14:26PM -0400, Jeff Garzik wrote: > Matthew Wilcox wrote: > >Nothing stops it being implemented as a macro in kcompat. Having it as > >an inline function gives it argument typechecking which always gives me > >the warm fuzzies. > > No, it _needs_ to be a macro for maximum flexibility. > > Most importantly, kcompat code may use '#ifndef SET_ETHTOOL_OPS' as a > trigger, to signal that compat code is needed. No need for drivers to > create tons of kernel-version-code ifdefs, just to test for when > ethtool_ops appeared in 2.6, for when it starts appearing in 2.4 vendor > backports, and (possibly) 2.4 itself. Also, doing it at the cpp level > allows compat code to #undef it, if it _really_ knows what its doing, > and the situation calls for it. OK. At this point, I really feel like I'm getting in the way and hindering more than I'm helping. Can I pass the torch to you and let you finish the job? -- "It's not Hollywood. War is real, war is primarily not about defeat or victory, it is about death. I've seen thousands and thousands of dead bodies. Do you think I want to have an academic debate on this subject?" 
-- Robert Fisk

From wsx@6com.sk Sun Aug 3 08:44:34 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 08:44:40 -0700 (PDT)
Received: from mail.6com.sk (cement.ksp.edi.fmph.uniba.sk [158.195.16.151])
	by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73FiXFl028554
	for ; Sun, 3 Aug 2003 08:44:33 -0700
Received: by mail.6com.sk (Postfix, from userid 501)
	id 84173630E; Sun, 3 Aug 2003 17:44:27 +0200 (CEST)
Date: Sun, 3 Aug 2003 17:44:27 +0200
From: Jan Oravec
To: netdev@oss.sgi.com
Subject: problem setting net.ipvX.conf.all.forwarding via sysctl() system call
Message-ID: <20030803154427.GA12926@wsx.ksp.sk>
Reply-To: Jan Oravec
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4.1i
X-Operating-System: UNIX
X-archive-position: 4483
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: jan.oravec@6com.sk
Precedence: bulk
X-list: netdev

Hello,

When net.ipvX.conf.all.forwarding is enabled via the sysctl() system call,
forwarding is not enabled on all interfaces, as it is when the value is
changed through the /proc filesystem.

For IPv6, this is obviously because no sysctl 'strategy' handler is defined.

For IPv4, it is because ipv4_sysctl_forward_strategy only copies the new
value to check whether it has changed, and does not update
ipv4_devconf.forwarding before calling inet_forward_change(). (The value is
copied internally by sysctl after ipv4_sysctl_forward_strategy, because we
return a positive number.)

I am not familiar with the kernel's concurrency rules, so I do not know
whether this requires some locking or whether it is safe to do:

--- sysctl_net_ipv4.c.old 2003-08-03 17:37:44.000000000 +0200
+++ sysctl_net_ipv4.c 2003-08-03 17:38:18.000000000 +0200
@@ -109,8 +109,9 @@ static int ipv4_sysctl_forward_strategy(
 		}
 	}
 
+	ipv4_devconf.forwarding=new;
 	inet_forward_change();
-	return 1;
+	return 0;
 }
 
 ctl_table ipv4_table[] = {

Best Regards,

-- 
Jan Oravec
XS26 coordinator
6COM s.r.o.
'Access to IPv6' http://www.6com.sk http://www.xs26.net From jgarzik@pobox.com Sun Aug 3 11:00:28 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 11:00:36 -0700 (PDT) Received: from www.linux.org.uk (IDENT:sjW7I344dNw6uUHLQE/tIbYI710RnggI@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73I0RFl007871 for ; Sun, 3 Aug 2003 11:00:28 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19jMMR-0007pt-LH; Sun, 03 Aug 2003 18:09:35 +0100 Message-ID: <3F2D41B7.7040205@pobox.com> Date: Sun, 03 Aug 2003 13:09:11 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Matthew Wilcox CC: netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> <3F2C3C86.6000202@pobox.com> <20030803002744.GF22222@parcelfarce.linux.theplanet.co.uk> <3F2C7E12.8070904@pobox.com> <20030803145656.GI22222@parcelfarce.linux.theplanet.co.uk> In-Reply-To: <20030803145656.GI22222@parcelfarce.linux.theplanet.co.uk> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4484 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Matthew Wilcox wrote: > On Sat, Aug 02, 2003 at 11:14:26PM -0400, Jeff Garzik wrote: > >>Matthew Wilcox wrote: >> >>>Nothing stops it being implemented as a macro in kcompat. Having it as >>>an inline function gives it argument typechecking which always gives me >>>the warm fuzzies. 
>>
>>No, it _needs_ to be a macro for maximum flexibility.
>>
>>Most importantly, kcompat code may use '#ifndef SET_ETHTOOL_OPS' as a
>>trigger, to signal that compat code is needed. No need for drivers to
>>create tons of kernel-version-code ifdefs, just to test for when
>>ethtool_ops appeared in 2.6, for when it starts appearing in 2.4 vendor
>>backports, and (possibly) 2.4 itself. Also, doing it at the cpp level
>>allows compat code to #undef it, if it _really_ knows what its doing,
>>and the situation calls for it.
>
>
> OK. At this point, I really feel like I'm getting in the way and
> hindering more than I'm helping. Can I pass the torch to you and let
> you finish the job?

Sorry to give that impression :(

I think we're pretty much "there". But if you wanna hand it off to me
for the last little bits, and merging, that's fine too. I'll leave it
up to you.

	Jeff

From werner@almesberger.net Sun Aug 3 11:05:57 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 11:06:00 -0700 (PDT)
Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73I5uFl008556 for ; Sun, 3 Aug 2003 11:05:57 -0700
Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h73I5oG04155; Sun, 3 Aug 2003 11:05:50 -0700
Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h73I5c110505; Sun, 3 Aug 2003 15:05:38 -0300
Date: Sun, 3 Aug 2003 15:05:37 -0300
From: Werner Almesberger
To: David Lang
Cc: Alan Shih , Ben Greear , Jeff Garzik , Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: TOE brain dump
Message-ID: <20030803150537.C10280@almesberger.net>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: ; from david.lang@digitalinsight.com on Sun, Aug 03, 2003 at 01:25:48AM -0700
X-archive-position: 4485
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: werner@almesberger.net
Precedence: bulk
X-list: netdev

David Lang wrote:
> do you really want the processor on the card to be tunning
> apache/NFS/Samba/etc ?

If it runs a Linux kernel, that's not a problem. Whether you
actually want to do this or not, becomes an entirely separate
issue.

- Werner

--
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina     werner@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

From hadi@cyberus.ca Sun Aug 3 11:15:10 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 11:15:19 -0700 (PDT)
Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73IF9Fl009421 for ; Sun, 3 Aug 2003 11:15:10 -0700
Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19jNNq-000OgI-00; Sun, 03 Aug 2003 14:15:06 -0400
Subject: Re: [RFC] High Performance Packet Classifiction for tc framework
From: jamal
Reply-To: hadi@cyberus.ca
To: Michael Bellion and Thomas Heinz
Cc: linux-net@vger.kernel.org, netdev@oss.sgi.com
In-Reply-To: <3F16A0E5.1080007@hipac.org>
References: <200307141045.40999.nf@hipac.org>
 <1058328537.1797.24.camel@jzny.localdomain>
 <3F16A0E5.1080007@hipac.org>
Content-Type: text/plain
Organization: jamalopolis
Message-Id: <1059934468.1103.41.camel@jzny.localdomain>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.2
Date: 03 Aug 2003 14:14:28 -0400
Content-Transfer-Encoding: 7bit
X-archive-position: 4486
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: hadi@cyberus.ca
Precedence: bulk
X-list: netdev

Hi,

Apologies for late response.
It's funny how i thought i was going to have more time in the last 2
weeks but due to bad scheduling that wasn't the case.

On Thu, 2003-07-17 at 09:13, Michael Bellion and Thomas Heinz wrote:
> Hi Jamal
>
> You wrote:
> > This is good.I may have emailed you about this topic before?
>
> Yes, but at that time we had not any concrete plans to
> integrate hipac into tc. We focussed on making nf-hipac as
> expressive as iptables first.
>

Good goal.

> > It's a classifier therefore it makes sense ;->
>
> :-)
>
> > nice. What would be interesting is to see your rule update rates vs
> > iptables (i expect iptables to suck) - but how do you compare aginst any
> > of the tc classifiers for example?
>
> Regarding the rule update rates we have not done any measurements
> yet but nf-hipac should be faster than iptables (even more when
> we have implemented the selective cloning stuff). On the other
> hand we are probably slower than tc because in addition to the
> insert operation into an internal chain there is the actual hipac
> insert operation. The insertion in the internal chain is quicker
> than the tc insert operation because we use doubly linked lists.
>

I think i will have to look at your code to make comments.

> Regarding the matching performance one has to consider a few things.
> The currently existing tc classifiers are an abstraction for rules
> (iptables "slang") whilst hipac is an abstraction for a set of rules
> (including the chain semantics known from iptables), i.e. a table in
> the iptables world.

Not entirely accurate. Depends which tc classifier.
u32 hash tables are in fact like iptables chains.
Note, the concept of priorities, which is used for conflict resolution
as well as for further separating sets of rules, doesn't exist in iptables.

> Of course it is possible to have some sort
> of extended classifying in tc too,

True, i overlooked this.

> i.e. you can add several fw or u32
> filters with the same prio which allows the filters to be hashed.

You can also have them use different priorities and, with the continue
operator, first classify based on packet data then on metadata or on
another packet header filter.

> One disadvantage of this concept is that the hashed filters
> must be compact, i.e. there cannot be other classifiers in between.

I didn't understand this. Are you talking about conflict resolving of
overlapping filters?

> Another major disadvantage is caused by the hashing scheme.
> If you use the hash for 1 dimension you have to make sure that
> either all filters in a certain bucket are disjoint or you must have
> an implicit ordering of the rules (according to the insertion order
> or something). This scheme is not extendable to 2 or more dimensions,
> i.e. 1 hash for src ip, #(src ip buckets) many dst ip hashes and so
> on, because you simply cannot express arbitrary rulesets.

If i understood you - you are referring to a way to reduce the number of
lookups by having disjoint hashes. I suppose for something as simple as
a five tuple lookup, this is almost solvable by hardcoding the different
fields into multiway hashes. It's when you try to generalize that it
becomes an issue.

> Another general problem is of course that the user has to manually
> setup the hash which is rather inconvenient.
>

Yes. Take a look at Werner's tcng - he has a clever way to hide things
from the user. I did experimentation on u32 with a kernel thread which
rearranged things when they seemed out of balance but i haven't
experimented with a lot of rules.

> Now, what are the implications on the matching performance:
> tc vs. nf-hipac? As long as the extended hashing stuff is not used
> nf-hipac is clearly superior to tc.

You are referring to u32. You mean as long as u32 stored things in a single
linked list, you win - correct?

> When hashing is used it _really_
> depends. If there is only one classifier (with hashing) per interface
> and the number of rules per bucket is very small the performance should
> be comparable. As soon as you add other classifiers nf-hipac will
> outperform tc again.
>

If we take a simple user interface abstraction like tcng which hides the
evil of u32 and then take simple 5 tuple rules - i doubt you will see
any difference. For more generic setup, the kernel thread i refer to
would work - but may slow insertion.

> >>The tc framework is very flexible with respect to where filters can be
> >>attached. Unfortunately this cannot be mapped into one HIPAC data
> >>structure. Our current design allows to attach filters anywhere but
> >>only the filters attached to the top level qdisc would benefit from the
> >>HIPAC algorithm. Would this be a noticeable restriction?
> >
> > I dont think so, but can ytou describe this restriction?
>
> Well, we thought a little more about the design and came to the
> conclusion that it is not necessary to have a HIPAC qdisc at root
> but it suffices to ensure that the HIPAC classifier occurs only
> once per interface. As you can guess from the last sentence we
> dropped the HIPAC qdisc design and changed to the following scheme:
>
> - there no special HIPAC qdisc at all :-)
> - the HIPAC classifier is no longer a simple rule but represents
>   the whole table
> - the HIPAC classifier can occur in any qdisc but at most once
>   per interface
>
> So, basically HIPAC is just a normal classifier like any other
> with two exceptions:
>    a) it can occur only once per interface
>    b) the rules within the classifier can contain other classifiers,
>       e.g. u32, fw, tc_index, as matches
>

But why restriction a)?
Also why should we need hipac to hold other filters when the
infrastructure itself can hold the extended filters just fine?
I think you may actually be trying to say why somewhere in the email,
but it must not be making a significant impression on my brain.

> There is just one problem with the current tc framework. Once
> a new filter is inserted into the chain it is not removed even
> if the change function of the classifier returns < 0
> (2.6.0-test1: net/sched/cls_api.c: line 280f).
> This should be changed anyway, shouldn't it?
>

Are you referring to this piece of code?:
----
	err = tp->ops->change(tp, cl, t->tcm_handle, tca, &fh);
	if (err == 0)
		tfilter_notify(skb, n, tp, fh, RTM_NEWTFILTER);

errout:
	if (cl)
		cops->put(q, cl);
	return err;
---

change() should not return <0 if it has installed the filter i think.
Should the top level code be responsible for removing filters?

> >>- new HIPAC classifier which supports all native nf-hipac matches
> >>  (src/dst ip, proto, src/dst port, ttl, state, in_iface, icmp type,
> >>  tcpflags, fragments) and additionally fwmark
> >
> > I would think for cleanliness fwmark or any metadata related
> > classification would be separate from one that is based on packet bits.
>
> Since our classifier represents a table of rules and the rules are
> based on different matches, like src/dst ip and also fwmark (native)
> or u32 (subclassifier as match), this is definitely a clean design.
>

I think we need to have the infrastructure in the main tc code.
It's already there - may not be very clean right now.

> >>- the HIPAC classifier can only be attached to the HIPAC qdisc and vice
> >>  versa the HIPAC qdisc only accepts HIPAC classifiers
> >
> >
> > We do have an issue with being able to do extended classification
> > but building a qdisc for it is a no no. Building a qdisc that will force
> > other classifier to structure themselves after it is even a bigger sin.
> > Look at the action code i have (i can send you an updated patch); a
> > better idea is to make extended classifiers an action based on another
> > filter match. At least this is what i have been toying with and i dont
> > think it is clean enough. what we need is to extend the filtering
> > framework itself to have extended classifiers.
>
> The new design should be much cleaner. Originally we also thought about
> making HIPAC a classifier only but we expected some problems related
> to this approach. Finally we discovered that this is not the case :)
>

Consider what i said above. I'll try n cobble together some examples to
demonstrate (although it seems you already know this).
To allow for anyone to install classifiers-du-jour without being
dependent on hipac would be very useful. So ideas that you have for
enabling this cleanly should be moved to cls_api.

cheers,
jamal

From andersen@codepoet.org Sun Aug 3 11:28:03 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 11:28:14 -0700 (PDT)
Received: from winder.codepoet.org (postfix@codepoet.org [166.70.99.138]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73IS2Fl010527 for ; Sun, 3 Aug 2003 11:28:03 -0700
Received: by winder.codepoet.org (Codepoet.org Mail Daemon, from userid 1000) id 2E763157577; Sun, 3 Aug 2003 12:27:56 -0600 (MDT)
Date: Sun, 3 Aug 2003 12:27:55 -0600
From: Erik Andersen
To: Werner Almesberger
Cc: Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi
Subject: Re: TOE brain dump
Message-ID: <20030803182755.GA16770@codepoet.org>
Reply-To: andersen@codepoet.org
Mail-Followup-To: Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi
References: <20030802140444.E5798@almesberger.net>
 <3F2BF5C7.90400@us.ibm.com>
 <3F2C0C44.6020002@pobox.com>
 <20030802184901.G5798@almesberger.net>
 <3F2CAE61.7070401@pobox.com>
 <20030803145737.B10280@almesberger.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20030803145737.B10280@almesberger.net>
User-Agent: Mutt/1.3.28i
X-Operating-System: Linux 2.4.19-rmk7, Rebel-NetWinder(Intel StrongARM 110 rev 3), 185.95 BogoMips
X-No-Junk-Mail: I do not want to get *any* junk mail.
X-archive-position: 4487
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: andersen@codepoet.org
Precedence: bulk
X-list: netdev

On Sun Aug 03, 2003 at 02:57:37PM -0300, Werner Almesberger wrote:
> > There is one interesting TOE solution, that I have yet to see created:
> > run Linux on an embedded processor, on the NIC.
>
> That's basically what I've been talking about all the
> while :-)

http://www.snapgear.com/pci630.html

 -Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

From akpm@osdl.org Sun Aug 3 12:01:38 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 12:01:43 -0700 (PDT)
Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73J1bFl013068 for ; Sun, 3 Aug 2003 12:01:37 -0700
Received: from mnm (build.pdx.osdl.net [172.20.1.2]) by mail.osdl.org (8.11.6/8.11.6) with ESMTP id h73J17I30784; Sun, 3 Aug 2003 12:01:07 -0700
Date: Sun, 3 Aug 2003 12:02:23 -0700
From: Andrew Morton
To: Stephen Rothwell
Cc: netdev@oss.sgi.com, janfrode@parallab.no
Subject: Fw: [Bugme-new] [Bug 1036] New: Badness in local_bh_enable at kernel/softirq.c:113
Message-Id: <20030803120223.738a7453.akpm@osdl.org>
X-Mailer: Sylpheed version 0.9.4 (GTK+ 1.2.10; i686-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-archive-position: 4488
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: akpm@osdl.org
Precedence: bulk
X-list: netdev

(The "badness" warning is a tty locking problem.
It does not explain the pptp client failures)

Begin forwarded message:

Date: Sun, 3 Aug 2003 04:53:31 -0700
From: bugme-daemon@osdl.org
To: bugme-new@lists.osdl.org
Subject: [Bugme-new] [Bug 1036] New: Badness in local_bh_enable at kernel/softirq.c:113

http://bugme.osdl.org/show_bug.cgi?id=1036

Summary: Badness in local_bh_enable at kernel/softirq.c:113
Kernel Version: 2.6.0-test2
Status: NEW
Severity: high
Owner: bugme-janitors@lists.osdl.org
Submitter: janfrode@parallab.no

Distribution: gentoo
Hardware Environment: AMD AthlonXP
Software Environment: ppp-2.4.1-r14 pptpclient-1.2.0

Problem Description:

My pptp client connections keeps dying, syslogging:

Aug 3 13:35:36 [pppd] Using interface ppp0
Aug 3 13:35:36 [pppd] Connect: ppp0 <--> /dev/pts/4
Aug 3 13:35:36 [/etc/hotplug/net.agent] NET add event not supported
Aug 3 13:35:38 [pptp] anon log[decaps_hdlc:pptp_gre.c:198]: PPP mode seems to be Asynchronous._
Aug 3 13:35:39 [pppd] Remote message: Welcome^M^J
Aug 3 13:35:41 [pppd] local IP address 129.177.43.23
Aug 3 13:35:41 [pppd] remote IP address 129.177.43.1
Aug 3 13:36:07 [pppd] Unsupported protocol 0xd44a received
Aug 3 13:36:57 [pppd] Unsupported protocol 0xcc4a received
aug 3 13:38:20 [su(pam_unix)] session opened for user root by (uid=1001)
Aug 3 13:39:21 [anacron] Job `cron.daily' started
Aug 3 13:39:29 [crontab] (root) LIST (root)_
Aug 3 13:39:37 [pptp] anon warn[decaps_gre:pptp_gre.c:300]: short read (-1): Message too long
Aug 3 13:39:37 [pptp] anon log[callmgr_main:pptp_callmgr.c:234]: Closing connection
Aug 3 13:39:37 [pptp] anon log[pptp_conn_close:pptp_ctrl.c:308]: Closing PPTP connection
Aug 3 13:39:39 [pptp] anon log[call_callback:pptp_callmgr.c:74]: Closing connection
Aug 3 13:39:39 [pppd] Hangup (SIGHUP)
Aug 3 13:39:39 [kernel] Badness in local_bh_enable at kernel/softirq.c:113
Aug 3 13:39:39 [pppd] Modem hangup
Aug 3 13:39:39 [pppd] Connection terminated.
Aug 3 13:39:39 [pppd] Connect time 4.1 minutes.
Aug 3 13:39:39 [pppd] Sent 310556 bytes, received 1615363 bytes.
Aug 3 13:39:39 [/etc/hotplug/net.agent] NET remove event not supported
Aug 3 13:39:39 [pppd] Failed to open /dev/pts/4: No such file or directory
- Last output repeated 9 times -
Aug 3 13:39:39 [pppd] Exit.

And giving this call trace in the kernel log:

Badness in local_bh_enable at kernel/softirq.c:113
Call Trace:
 [] local_bh_enable+0x88/0x90
 [] ppp_async_push+0xa4/0x1b0
 [] __lookup_hash+0x64/0xd0
 [] ppp_asynctty_wakeup+0x31/0x60
 [] pty_unthrottle+0x56/0x60
 [] check_unthrottle+0x3a/0x40
 [] n_tty_flush_buffer+0x14/0x50
 [] pty_flush_buffer+0x5e/0x60
 [] do_tty_hangup+0x3ac/0x420
 [] release_dev+0x5b3/0x600
 [] snd_pcm_oss_init_substream+0x50/0x90
 [] zap_pmd_range+0x4e/0x70
 [] unmap_page_range+0x4e/0x90
 [] tty_release+0x2b/0x60
 [] __fput+0xce/0xe0
 [] filp_close+0x4b/0x80
 [] put_files_struct+0x6c/0xe0
 [] do_exit+0x165/0x340
 [] sys_exit+0x15/0x20
 [] syscall_call+0x7/0xb

Steps to reproduce:

Don't know how to trigger it, but it happens all the time.

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

From hadi@cyberus.ca Sun Aug 3 12:06:46 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 12:06:51 -0700 (PDT)
Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73J6jFl013745 for ; Sun, 3 Aug 2003 12:06:46 -0700
Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19jOBo-0002VY-00; Sun, 03 Aug 2003 15:06:44 -0400
Subject: Re: multiple unicast mac address (was Re: netdev_ops retraction)
From: jamal
Reply-To: hadi@cyberus.ca
To: Rick Payne
Cc: Jeff Garzik , netdev@oss.sgi.com
In-Reply-To: <2147483647.1059667766@fozzy.rossfell.co.uk>
References: <20030730184416.GI22222@parcelfarce.linux.theplanet.co.uk>
 <2147483647.1059659359@fozzy.rossfell.co.uk>
 <3F292B38.4070508@pobox.com>
 <2147483647.1059667766@fozzy.rossfell.co.uk>
Content-Type: text/plain
Organization: jamalopolis
Message-Id: <1059937567.1102.77.camel@jzny.localdomain>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.2
Date: 03 Aug 2003 15:06:07 -0400
Content-Transfer-Encoding: 7bit
X-archive-position: 4489
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: hadi@cyberus.ca
Precedence: bulk
X-list: netdev

Last discussion that happened:
http://marc.theaimsgroup.com/?t=104163060100001&r=1&w=2

cheers,
jamal

On Thu, 2003-07-31 at 11:09, Rick Payne wrote:
> --On Thursday, July 31, 2003 10:44 am -0400 Jeff Garzik
> wrote:
>
> > Hardware that filters N MAC addresses (unicast filtering) doesn't have a
> > terribly standard interface, and the unicast filter must be adjusted at
>
> Indeed but where its possible to support it, it can be - and those cards
> will be specified by those who need it (for HA, VRRP etc).
>
> > different times on different hardware. Also, chip bugs lead one to think
> > unicast filtering will work where it doesn't. Also, chip limits for some
> > of the more popular chips are unknown. Also, the need for this feature
> > is very uncommon, and can be achieved in other ways.
>
> As I said - promiscuous mode and filtering on the receive side - which is
> what you have to resort to anyway for those cards that don't support it.
>
> Alternatively, its just another patch people need to add to make things do
> what they want - just like the ARP patch.
>
> Rick
>
>

From ebiederm@xmission.com Sun Aug 3 12:24:39 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 12:24:51 -0700 (PDT)
Received: from frodo.biederman.org (ebiederm.dsl.xmission.com [166.70.28.69]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73JOcFl015227 for ; Sun, 3 Aug 2003 12:24:39 -0700
Received: (from eric@localhost) by frodo.biederman.org (8.9.3/8.9.3) id NAA26235; Sun, 3 Aug 2003 13:21:09 -0600
To: Werner Almesberger
Cc: Jeff Garzik , Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: TOE brain dump
References: <20030802140444.E5798@almesberger.net>
 <3F2BF5C7.90400@us.ibm.com>
 <3F2C0C44.6020002@pobox.com>
 <20030802184901.G5798@almesberger.net>
From: ebiederm@xmission.com (Eric W. Biederman)
Date: 03 Aug 2003 13:21:09 -0600
In-Reply-To: <20030802184901.G5798@almesberger.net>
Message-ID:
Lines: 59
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-archive-position: 4490
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: ebiederm@xmission.com
Precedence: bulk
X-list: netdev

Werner Almesberger writes:

> Jeff Garzik wrote:
> > jabbering at the same time. TCP is a "one size fits all" solution, but
> > it doesn't work well for everyone.
>
> But then, ten "optimized xxPs" that work well in two different
> scenarios each, but not so good in the 98 others, wouldn't be
> much fun either.

The optimized for low latency cases seem to have a strong
market in clusters. And they are currently keeping alive quite a
few technologies. Myrinet, Infiniband, Quadric's Elan, etc.
Having low latency and switch technologies that scale is quite
rare currently.

> Another problem of TCP is that it has grown a bit too many
> knobs you need to turn before it works over your really fast
> really long pipe. (In one of the OLS after dinner speeches,
> this was quite appropriately called the "wizard gap".)

Does anyone know which knobs to turn to make TCP go fast over
Infiniband. (A low latency high bandwidth network?) I get to
deal with them on a regular basis...

There is one place in low latency communications that I can think
of where TCP/IP is not the proper solution. For low latency
communication the checksum is at the wrong end of the packet.
IB gets this one correct and places the checksum at the tail end of
the packet. This allows the packet to start transmitting before the
checksum is computed, possibly even having the receive start at the
other end before the tail of the packet is transmitted.

Would it make any sense to do a low latency variation on TCP that
fixes that problem?

For the IP header we are fine as the data precedes the checksum.
But the problem appears to affect all of the upper level protocols
that ride on IP, UDP, TCP, SCTP...

> > So, fix the other end of the pipeline too, otherwise this fast network
> > stuff is flashly but pointless. If you want to serve up data from disk,
> > then start creating PCI cards that have both Serial ATA and ethernet
> > connectors on them :) Cut out the middleman of the host CPU and host
> > memory bus instead of offloading portions of TCP that do not need to be
> > offloaded.
>
> That's a good point. A hierarchical memory structure can help
> here. Moving one end closer to the hardware, and letting it
> know (e.g. through sendfile) that also the other end is close
> (or can be reached more directly that through some hopelessly
> crowded main bus) may help too.

On that score it is worth noting that the next generation of
peripheral busses (Hypertransport, PCI Express, etc) are all switched.
Which means that device to device communication may be more
reasonable. Going from a bussed interconnect to a switched
interconnect is certainly a dramatic change in infrastructure. How
that will affect the tradeoffs I don't know.

Eric

From lm@bitmover.com Sun Aug 3 12:40:33 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 12:40:44 -0700 (PDT)
Received: from smtp.bitmover.com (smtp.bitmover.com [192.132.92.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73JeWFl018109 for ; Sun, 3 Aug 2003 12:40:33 -0700
Received: from work.bitmover.com (ipcop.bitmover.com [192.132.92.15]) by smtp.bitmover.com (8.12.9/8.12.9) with ESMTP id h743iem7002500; Sun, 3 Aug 2003 20:44:40 -0700
Received: (from lm@localhost) by work.bitmover.com (8.11.6/8.11.6) id h73JeB108493; Sun, 3 Aug 2003 12:40:11 -0700
Date: Sun, 3 Aug 2003 12:40:11 -0700
From: Larry McVoy
To: Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi
Subject: Re: TOE brain dump
Message-ID: <20030803194011.GA8324@work.bitmover.com>
Mail-Followup-To: Larry McVoy , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi
References: <20030802140444.E5798@almesberger.net>
 <3F2BF5C7.90400@us.ibm.com>
 <3F2C0C44.6020002@pobox.com>
 <20030802184901.G5798@almesberger.net>
 <3F2CAE61.7070401@pobox.com>
 <20030803145737.B10280@almesberger.net>
 <20030803182755.GA16770@codepoet.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20030803182755.GA16770@codepoet.org>
User-Agent: Mutt/1.4i
X-MailScanner-Information: Please contact the ISP for more information
X-MailScanner: Found to be clean
X-MailScanner-SpamCheck: not spam (whitelisted), SpamAssassin (score=0.5, required 7, AWL, DATE_IN_PAST_06_12)
X-archive-position: 4491
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: lm@bitmover.com
Precedence: bulk
X-list: netdev

On Sun, Aug 03, 2003 at 12:27:55PM -0600, Erik Andersen wrote:
> On Sun Aug 03, 2003 at 02:57:37PM -0300, Werner Almesberger wrote:
> > > There is one interesting TOE solution, that I have yet to see created:
> > > run Linux on an embedded processor, on the NIC.
> >
> > That's basically what I've been talking about all the
> > while :-)
>
> http://www.snapgear.com/pci630.html

ipcop plus a new PC for $200 is way higher performance and does more.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

From david.lang@digitalinsight.com Sun Aug 3 13:15:08 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 13:15:18 -0700 (PDT)
Received: from warden.diginsite.com (warden-p.diginsite.com [208.29.163.248]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73KF7Fl020689 for ; Sun, 3 Aug 2003 13:15:07 -0700
Received: from wlvims01.diginsite.com by warden.diginsite.com via smtpd (for oss.SGI.COM [192.48.159.27]) with SMTP; Sun, 3 Aug 2003 13:15:07 -0700
Received: from calexc01.digitalinsight.com ([10.200.0.20]) by wlvims01.digitalinsight.com (Post.Office MTA v3.5.3 release 223 ID# 0-0U10L2S100V35) with ESMTP id com; Sun, 3 Aug 2003 13:14:17 -0700
Received: by calexc01.diginsite.com with Internet Mail Service (5.5.2653.19) id ; Sun, 3 Aug 2003 13:15:01 -0700
Received: from dlang.diginsite.com ([10.201.10.67]) by wlvexc00.digitalinsight.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2656.59) id QF5KMNL2; Sun, 3 Aug 2003 13:14:59 -0700
From: David Lang
To: Larry McVoy
Cc: Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi
Date: Sun, 3 Aug 2003 13:13:24 -0700 (PDT)
Subject: Re: TOE brain dump
In-Reply-To: <20030803194011.GA8324@work.bitmover.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 4492
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: david.lang@digitalinsight.com
Precedence: bulk
X-list: netdev

On Sun, 3 Aug 2003, Larry McVoy wrote:

> On Sun, Aug 03, 2003 at 12:27:55PM -0600, Erik Andersen wrote:
> > On Sun Aug 03, 2003 at 02:57:37PM -0300, Werner Almesberger wrote:
> > > > There is one interesting TOE solution, that I have yet to see created:
> > > > run Linux on an embedded processor, on the NIC.
> > >
> > > That's basically what I've been talking about all the
> > > while :-)
> >
> > http://www.snapgear.com/pci630.html
>
> ipcop plus a new PC for $200 is way higher performance and does more.

however I can see situations where you would put multiple cards in one box
and there could be an advantage to using PCI (or PCI-X) for your
communications between the different 'nodes' of your routing cluster
instead of gig ethernet.

if this is the approach that the networking guys really want to encourage
how about defining an API that you would be willing to support and you can
even implement it and then any card that is produced would be supported
from day 1.

this interface would not have to cover the configuration of the card (that
can be done with userspace tools that talk to the card over the 'network'),
it just needs to cover the ability to do what is effectively IP over PCI.

Linus has commented that in many ways Linux is not designed for any
existing CPU, it's designed for a virtual CPU that implements all the
features we want and those features that aren't implemented in the chips
get emulated as needed (obviously what is actually implemented and the
speed of emulation are serious considerations for performance)

why doesn't the network team define what they think the ideal NIC interface
would be. I can see three categories of 'ideal' cards

1. cards that are directly driven by the kernel IP stack (similar to what
we support now, but an ideal version)

2. router nodes that have access to main memory (PCI card running linux
acting as a router/firewall/VPN to offload the main CPU's)

3. router nodes that don't have access to main memory (things like
USB/fibrechannel/infiniband/etc versions of #2, the node can run linux and
deal with the outside world, only sending the data that is needed to/from
the host)

even if nobody makes hardware that supports all the desired features
directly, having a 'this is the ideal driver' reference should improve
future drivers by letting them use this as the core and implementing code
to simulate the features not in hardware.

they claim they need this sort of performance, you say 'not that way do it
sanely' why not give them a sane way to do it?

David Lang

From lm@bitmover.com Sun Aug 3 13:31:02 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 13:31:08 -0700 (PDT)
Received: from smtp.bitmover.com (smtp.bitmover.com [192.132.92.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73KV1Fl021999 for ; Sun, 3 Aug 2003 13:31:02 -0700
Received: from work.bitmover.com (ipcop.bitmover.com [192.132.92.15]) by smtp.bitmover.com (8.12.9/8.12.9) with ESMTP id h744ZKm7003004; Sun, 3 Aug 2003 21:35:20 -0700
Received: (from lm@localhost) by work.bitmover.com (8.11.6/8.11.6) id h73KUp509118; Sun, 3 Aug 2003 13:30:51 -0700
Date: Sun, 3 Aug 2003 13:30:51 -0700
From: Larry McVoy
To: David Lang
Cc: Larry McVoy , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi
Subject: Re: TOE brain dump
Message-ID: <20030803203051.GA9057@work.bitmover.com>
Mail-Followup-To: Larry McVoy , David Lang , Larry McVoy , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi
References: <20030803194011.GA8324@work.bitmover.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.4i
X-MailScanner-Information: Please contact the ISP for more information
X-MailScanner: Found to be clean
X-MailScanner-SpamCheck: not spam (whitelisted), SpamAssassin (score=0.5, required 7, AWL, DATE_IN_PAST_06_12)
X-archive-position: 4493
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: lm@bitmover.com
Precedence: bulk
X-list: netdev

On Sun, Aug 03, 2003 at 01:13:24PM -0700, David Lang wrote:
> 2. router nodes that have access to main memory (PCI card running linux
> acting as a router/firewall/VPN to offload the main CPU's)

I can get an entire machine, memory, disk, > Ghz CPU, case, power supply,
cdrom, floppy, onboard enet extra net card for routing, for $250 or less,
quantity 1, shipped to my door.

Why would I want to spend money on some silly offload card when I can get
the whole PC for less than the card?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

From alan@lxorguk.ukuu.org.uk Sun Aug 3 13:55:26 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 13:55:31 -0700 (PDT)
Received: from lxorguk.ukuu.org.uk (pc1-cwma1-5-cust4.swan.cable.ntl.com [80.5.120.4]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73KtOFl023931 for ; Sun, 3 Aug 2003 13:55:25 -0700
Received: from dhcp22.swansea.linux.org.uk (dhcp22.swansea.linux.org.uk [127.0.0.1]) by lxorguk.ukuu.org.uk (8.12.8/8.12.5) with ESMTP id h73KpOC3031925; Sun, 3 Aug 2003 21:51:25 +0100
Received: (from alan@localhost) by dhcp22.swansea.linux.org.uk (8.12.8/8.12.8/Submit) id h73KpMDK031923; Sun, 3 Aug 2003 21:51:22 +0100
X-Authentication-Warning: dhcp22.swansea.linux.org.uk: alan set sender to alan@lxorguk.ukuu.org.uk using -f
Subject: Re: TOE brain dump
From: Alan Cox
To: Werner Almesberger
Cc: netdev@oss.sgi.com, Linux Kernel Mailing List
In-Reply-To: <20030802191411.H5798@almesberger.net>
References: <20030802140444.E5798@almesberger.net>
 <1059857864.20305.14.camel@dhcp22.swansea.linux.org.uk>
 <20030802191411.H5798@almesberger.net>
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Organization:
Message-Id: <1059943881.31900.1.camel@dhcp22.swansea.linux.org.uk>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.2 (1.2.2-5)
Date: 03 Aug 2003 21:51:21 +0100
X-archive-position: 4494
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: alan@lxorguk.ukuu.org.uk
Precedence: bulk
X-list: netdev

On Sad, 2003-08-02 at 23:14, Werner Almesberger wrote:
> That's taking this idea to an extreme, yes. I'd think of
> using something as big as an amd64 for this as "too
> expensive", but perhaps it's cheap enough in the long run,
> compared to some "optimized" design.

Volume makes cheap. If you look at software v hardware raid controllers
the hardware people are permanently being killed by the low volume of
cards.

From alan@lxorguk.ukuu.org.uk Sun Aug 3 13:56:17 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 13:56:21 -0700 (PDT)
Received: from lxorguk.ukuu.org.uk (pc1-cwma1-5-cust4.swan.cable.ntl.com [80.5.120.4]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73KuGFl024198 for ; Sun, 3 Aug 2003 13:56:16 -0700
Received: from dhcp22.swansea.linux.org.uk (dhcp22.swansea.linux.org.uk [127.0.0.1]) by lxorguk.ukuu.org.uk (8.12.8/8.12.5) with ESMTP id h73KqIC3031940; Sun, 3 Aug 2003 21:52:19 +0100
Received: (from alan@localhost) by dhcp22.swansea.linux.org.uk (8.12.8/8.12.8/Submit) id h73KqEOh031938; Sun, 3 Aug 2003 21:52:14 +0100
X-Authentication-Warning: dhcp22.swansea.linux.org.uk: alan set sender to alan@lxorguk.ukuu.org.uk using -f
Subject: Re: TOE brain dump
From: Alan Cox
To: Ben Greear
Cc: Jeff Garzik , Nivedita Singhvi , Werner Almesberger , netdev@oss.sgi.com, Linux Kernel Mailing List
In-Reply-To: <3F2C891B.7080004@candelatech.com>
References: <20030802140444.E5798@almesberger.net>
 <3F2BF5C7.90400@us.ibm.com>
 <3F2C0C44.6020002@pobox.com>
 <3F2C891B.7080004@candelatech.com>
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Organization:
Message-Id: <1059943933.31901.3.camel@dhcp22.swansea.linux.org.uk>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.2 (1.2.2-5)
Date: 03 Aug 2003 21:52:13 +0100
X-archive-position: 4495
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: alan@lxorguk.ukuu.org.uk
Precedence: bulk
X-list: netdev

On Sul, 2003-08-03 at 05:01, Ben Greear wrote:
>
Jeff Garzik wrote: > > > So, fix the other end of the pipeline too, otherwise this fast network > > stuff is flashly but pointless. If you want to serve up data from disk, > > then start creating PCI cards that have both Serial ATA and ethernet > > connectors on them :) Cut out the middleman of the host CPU and host > > I for one would love to see something like this, and not just Serial ATA.. > but maybe 8x Serial ATA and RAID :) There is a protocol floating around for ATA over ethernet, no TCP layer or nasty latency eating complexities in the middle From hadi@cyberus.ca Sun Aug 3 14:17:06 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 14:17:18 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73LH5Fl025960 for ; Sun, 3 Aug 2003 14:17:06 -0700 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19jPZk-0008ov-00; Sun, 03 Aug 2003 16:35:33 -0400 Subject: Re: TOE brain dump From: jamal Reply-To: hadi@cyberus.ca To: Larry McVoy Cc: Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi In-Reply-To: <20030803194011.GA8324@work.bitmover.com> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <3F2CAE61.7070401@pobox.com> <20030803145737.B10280@almesberger.net> <20030803182755.GA16770@codepoet.org> <20030803194011.GA8324@work.bitmover.com> Content-Type: text/plain Organization: jamalopolis Message-Id: <1059942894.1103.96.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 03 Aug 2003 16:34:54 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4496 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca 
Precedence: bulk X-list: netdev On Sun, 2003-08-03 at 15:40, Larry McVoy wrote: > On Sun, Aug 03, 2003 at 12:27:55PM -0600, Erik Andersen wrote: > > On Sun Aug 03, 2003 at 02:57:37PM -0300, Werner Almesberger wrote: > > > > There is one interesting TOE solution, that I have yet to see created: > > > > run Linux on an embedded processor, on the NIC. > > > > > > That's basically what I've been talking about all the > > > while :-) > > > > http://www.snapgear.com/pci630.html > > ipcop plus a new PC for $200 is way higher performance and does more. ;-> Actually this proves that putting the whole stack on the NIC is the wrong way to go ;-> That poor piece of NIC was obsoleted before it was born, on pricing alone and not just on the compute power it was supposed to liberate us from. I think the idea of hierarchical memories and computation is certainly interesting. Put a CPU and memory on the NIC, but not to do the work that Linux already does. Instead think of the NIC and its memory + CPU as an L1 data and code cache for TCP processing. The idea posed by Davem is intriguing: the only thing the NIC should do is TCP fast path processing based on cached control data generated by the main CPU stack. Any time the fast path gets violated, the cache gets invalidated and the packet becomes an exception to be handled by the main CPU stack. IMO, the only time this will make sense is when the setup cost (downloading the cache, or cookies as Dave calls them) is amortized by the data that follows. For example, it may not make sense to worry about an HTTP 1.0 flow which has 3-4 packets after the SYN-ACK. Bulk transfers make sense (storage, file serving). I don't remember the Mogul paper details but I think this is what he was implying. 
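The "NIC as a cache for the fast path" idea above can be pictured with a small user-space model. This is only a sketch under stated assumptions: the names flow_cache_entry, flow_cache_install() and nic_rx() are hypothetical, the only cached state is the next expected sequence number, and "exception" is reduced to a single flag. It is not taken from any real NIC, driver, or kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum rx_path { RX_FAST_PATH, RX_SLOW_PATH };

/* Hypothetical per-flow control data that the host stack would
 * download to the NIC (the "cookie" in the discussion above). */
struct flow_cache_entry {
    bool valid;            /* false: NIC has no usable state for this flow */
    uint32_t expected_seq; /* next in-order TCP sequence number */
};

/* Host side: install (or re-install) the cached state for one flow.
 * This is the setup cost that has to be amortized by later data. */
static void flow_cache_install(struct flow_cache_entry *e, uint32_t next_seq)
{
    e->valid = true;
    e->expected_seq = next_seq;
}

/* NIC side: take the fast path only for in-order data segments on a
 * flow with valid cached state. Anything else invalidates the entry
 * and punts the segment to the host stack as an exception. */
static enum rx_path nic_rx(struct flow_cache_entry *e, uint32_t seq,
                           uint32_t len, bool has_exception_flags)
{
    if (!e->valid)
        return RX_SLOW_PATH;

    if (has_exception_flags || len == 0 || seq != e->expected_seq) {
        e->valid = false; /* fast path violated: drop the cached state */
        return RX_SLOW_PATH;
    }

    e->expected_seq = seq + len; /* unsigned wrap matches TCP sequence space */
    return RX_FAST_PATH;
}
```

In this model every segment after a violation goes to the host until the host calls flow_cache_install() again, which is why a flow of only 3-4 packets would pay the setup cost without ever benefiting from it.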
cheers, jamal From david.lang@digitalinsight.com Sun Aug 3 14:23:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 14:23:03 -0700 (PDT) Received: from warden3.diginsite.com (warden3-p.diginsite.com [208.147.64.186]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73LMxFl026639 for ; Sun, 3 Aug 2003 14:23:00 -0700 Received: from no.name.available by warden3.diginsite.com via smtpd (for oss.SGI.COM [192.48.159.27]) with SMTP; Sun, 3 Aug 2003 14:16:04 -0700 Received: from ata-navgw-how1.anytimeaccess.com ([10.210.80.95]) by ata-mail.anytimeaccess.com (Post.Office MTA v3.5.3 release 223 ID# 0-0U10L2S100V35) with SMTP id com for ; Sun, 3 Aug 2003 14:19:17 -0700 Received: from sacexc01.digitalinsight.com ([10.210.80.155]) by ata-navgw-how1.anytimeaccess.com (NAVIEG 2.1 bld 63) with SMTP id M2003080314134107659 ; Sun, 03 Aug 2003 14:13:41 -0700 Received: by sacexc01.anytimeaccess.com with Internet Mail Service (5.5.2656.59) id ; Sun, 3 Aug 2003 14:22:50 -0700 Received: from dlang.diginsite.com ([10.201.10.67]) by wlvexc00.digitalinsight.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2656.59) id QF5KMN9H; Sun, 3 Aug 2003 14:22:47 -0700 From: David Lang To: Larry McVoy Cc: Erik Andersen , Werner Almesberger, Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi Date: Sun, 3 Aug 2003 14:21:12 -0700 (PDT) Subject: Re: TOE brain dump In-Reply-To: <20030803203051.GA9057@work.bitmover.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4497 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: david.lang@digitalinsight.com Precedence: bulk X-list: netdev On Sun, 3 Aug 2003, Larry McVoy wrote: > On Sun, Aug 03, 2003 at 01:13:24PM -0700, David Lang wrote: > > 2. 
router nodes that have access to main memory (PCI card running linux > > acting as a router/firewall/VPN to offload the main CPU's) > > I can get an entire machine, memory, disk, > Ghz CPU, case, power supply, > cdrom, floppy, onboard enet extra net card for routing, for $250 or less, > quantity 1, shipped to my door. > > Why would I want to spend money on some silly offload card when I can get > the whole PC for less than the card? you may want to do this for a database box where you want to dedicate your main processing power to the database task. if you use a separate box you still have to talk to that box over a network; if you have it as a card you can talk to the card much more efficiently than you can talk to the separate machine. if your 8-way opteron database box is already the bottleneck for your system you will have to spend a LOT of money to get anything that gives you more available processing power, so getting a card to offload any processing from the main CPU's can be a win. yes this is somewhat of a niche market, but as you point out adding more and more processors in a SMP model is not the ideal way to go, either from performance or from the cost point of view. on the webserver front there are a lot of companies making a living by selling cards and boxes to offload processing from the main CPU's of the webservers (cards to do gzip compression are a relatively new addition, but cards to do SSL handshakes have been around for a while). used properly these can be a very worthwhile investment for high-volume webserver companies. also the cost of an extra box can be considerably higher than just the cost of the hardware. 
I know of one situation where between Linux OS license fees (redhat advanced server) and security software (intrusion detection, auditing, privilege management, etc) a company is looking at ~$4000 in licensing fees for every box they put in their datacenter (and this is for boxes just running apache, add something like an oracle or J2EE appserver software and the cost goes up even more). at this point the fact that the box only cost $200 doesn't really matter, spending an extra $500 each on 4 boxes to eliminate the need for a 5th is easily worth it. (and this company is re-examining hardware raid controllers after having run software raid for years because they are realizing that this is requiring them to run more servers due to the load on the CPU's) at the low end you are right, just add another box or add another CPU to an existing box, but there are conditions that make adding specialized cards to offload specific functionality a win (for that matter, even at the low end people routinely offload graphics processing to specialized cards, simply to make their games run faster) David Lang From alan@storlinksemi.com Sun Aug 3 15:02:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 15:02:33 -0700 (PDT) Received: from smtp011.mail.yahoo.com (smtp011.mail.yahoo.com [216.136.173.31]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73M2OFl029358 for ; Sun, 3 Aug 2003 15:02:24 -0700 Received: from cpe-66-1-155-95.ca.sprintbbd.net (HELO AlanLap) (alansuntzishih@66.1.155.95 with login) by smtp.mail.vip.sc5.yahoo.com with SMTP; 3 Aug 2003 22:02:23 -0000 From: "Alan Shih" To: "David Lang" Cc: "Ben Greear" , "Jeff Garzik" , "Nivedita Singhvi" , "Werner Almesberger" , , Subject: RE: TOE brain dump Date: Sun, 3 Aug 2003 15:02:09 -0700 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) In-Reply-To: 
Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2727.1300 X-archive-position: 4498 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@storlinksemi.com Precedence: bulk X-list: netdev On an embedded system, no processor will be fast enough to compete with direct DMA xfer. So just provide sendfile hooks that allow the kernel to initiate data filling from source to dest then allow TSO to take place. Kernel still needs to take care of the TCP stack. I don't see this as building extensive customization though. Alan -----Original Message----- From: David Lang [mailto:david.lang@digitalinsight.com] Sent: Sunday, August 03, 2003 1:26 AM To: Alan Shih Cc: Ben Greear; Jeff Garzik; Nivedita Singhvi; Werner Almesberger; netdev@oss.sgi.com; linux-kernel@vger.kernel.org Subject: RE: TOE brain dump do you really want the processor on the card to be running apache/NFS/Samba/etc ? putting enough linux on the card to act as a router (which would include the netfilter stuff) is one thing. putting the userspace code that interfaces with the outside world for file transfers is something else. if you really want the disk connected to your network card you are just talking a low-end linux box. forget all this stuff about it being on a card and just use a full box (economies of scale will make this cheaper) making a firewall that's a core system with a dozen slave systems attached to it (the network cards) sounds like the type of clustering that Linux has been used for for compute nodes. complicated to setup, but extremely powerful and scalable once configured. if you want more than a router on the card then Alan Cox is right, just add another processor to the system, it's easier and cheaper. 
David Lang On Sat, 2 Aug 2003, Alan Shih wrote: > Date: Sat, 2 Aug 2003 23:22:52 -0700 > From: Alan Shih > To: Ben Greear , Jeff Garzik > Cc: Nivedita Singhvi , > Werner Almesberger , netdev@oss.sgi.com, > linux-kernel@vger.kernel.org > Subject: RE: TOE brain dump > > A DMA xfer that fills the NIC pipe with IDE source. That's not very hard... > need a lot of bufferring/FIFO though. May require large modification to the > file serving applications? > > Alan > > -----Original Message----- > From: linux-kernel-owner@vger.kernel.org > [mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Ben Greear > Sent: Saturday, August 02, 2003 9:02 PM > To: Jeff Garzik > Cc: Nivedita Singhvi; Werner Almesberger; netdev@oss.sgi.com; > linux-kernel@vger.kernel.org > Subject: Re: TOE brain dump > > > Jeff Garzik wrote: > > > So, fix the other end of the pipeline too, otherwise this fast network > > stuff is flashly but pointless. If you want to serve up data from disk, > > then start creating PCI cards that have both Serial ATA and ethernet > > connectors on them :) Cut out the middleman of the host CPU and host > > I for one would love to see something like this, and not just Serial ATA.. 
> but maybe 8x Serial ATA and RAID :) > > Ben > > > -- > Ben Greear > Candela Technologies Inc http://www.candelatech.com > > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > From lm@bitmover.com Sun Aug 3 16:44:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 16:44:46 -0700 (PDT) Received: from smtp.bitmover.com (smtp.bitmover.com [192.132.92.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h73NiYFl004045 for ; Sun, 3 Aug 2003 16:44:35 -0700 Received: from work.bitmover.com (ipcop.bitmover.com [192.132.92.15]) by smtp.bitmover.com (8.12.9/8.12.9) with ESMTP id h747mnm7005317; Mon, 4 Aug 2003 00:48:49 -0700 Received: (from lm@localhost) by work.bitmover.com (8.11.6/8.11.6) id h73NiJM13637; Sun, 3 Aug 2003 16:44:19 -0700 Date: Sun, 3 Aug 2003 16:44:19 -0700 From: Larry McVoy To: David Lang Cc: Larry McVoy , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi Subject: Re: TOE brain dump Message-ID: <20030803234419.GA13604@work.bitmover.com> Mail-Followup-To: Larry McVoy , David Lang , Larry McVoy , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi References: <20030803203051.GA9057@work.bitmover.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4i X-MailScanner-Information: Please contact the ISP for more information X-MailScanner: Found to be clean X-MailScanner-SpamCheck: not spam 
(whitelisted), SpamAssassin (score=0.5, required 7, AWL, DATE_IN_PAST_06_12) X-archive-position: 4499 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: lm@bitmover.com Precedence: bulk X-list: netdev On Sun, Aug 03, 2003 at 02:21:12PM -0700, David Lang wrote: > if your 8-way opteron database box is already the bottleneck for your > system you will have to spend a LOT of money to get anything that gives > you more available processing power, getting a card to offload any > processing from the main CPU's can be a win. I'd like to see data which supports this. CPUs have gotten so fast and disk I/O still sucks. All the systems I've seen are CPU rich and I/O starved. The smartest thing you could do would be to get a cheap box with a GB of ram as a disk cache and make it be a SAN device. Make N of those and you have tons of disk space and tons of cache and your 8 way opteron database box is going to work just fine. -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm From david-b@pacbell.net Sun Aug 3 20:06:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 20:06:37 -0700 (PDT) Received: from mta4.rcsntx.swbell.net (mta4.rcsntx.swbell.net [151.164.30.28]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7435xFl018707 for ; Sun, 3 Aug 2003 20:06:00 -0700 Received: from pacbell.net (ppp-67-118-247-123.dialup.pltn13.pacbell.net [67.118.247.123]) by mta4.rcsntx.swbell.net (8.12.9/8.12.3) with ESMTP id h7435gjA011136; Sun, 3 Aug 2003 22:05:43 -0500 (CDT) Message-ID: <3F2DCE56.6030601@pacbell.net> Date: Sun, 03 Aug 2003 20:09:10 -0700 From: David Brownell User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225 X-Accept-Language: en-us, en, fr MIME-Version: 1.0 To: "David S. 
Miller" CC: Ben Greear , jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> In-Reply-To: <20030803003239.4257ef24.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4500 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: david-b@pacbell.net Precedence: bulk X-list: netdev David S. Miller wrote: >>Although I have not tried this latest patch, the existing e100 and e1000 in >>2.4.21 seldom seem to return true to this method: netif_queue_stopped(odev), >>even when the next hard_start_xmit() call fails. > > > Returning an error from hard_start_xmit() from normal ethernet > drivers is, frankly, a driver bug and should never happen. What's "normal" mean? With the current USB stack, network adapters tend to need memory allocations there. Those can fail, though it seems that's not very common. Doesn't seem like a bug, for all that I'd rather see the those paths be zero-alloc in 2.7. - Dave From davem@redhat.com Sun Aug 3 20:13:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 20:13:19 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h743DDFl019496 for ; Sun, 3 Aug 2003 20:13:14 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id UAA18891; Sun, 3 Aug 2003 20:08:51 -0700 Date: Sun, 3 Aug 2003 20:08:51 -0700 From: "David S. 
Miller" To: David Brownell Cc: greearb@candelatech.com, jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release Message-Id: <20030803200851.7d46a605.davem@redhat.com> In-Reply-To: <3F2DCE56.6030601@pacbell.net> References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> <3F2DCE56.6030601@pacbell.net> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4501 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Sun, 03 Aug 2003 20:09:10 -0700 David Brownell wrote: > David S. Miller wrote: > >>Although I have not tried this latest patch, the existing e100 and e1000 in > >>2.4.21 seldom seem to return true to this method: netif_queue_stopped(odev), > >>even when the next hard_start_xmit() call fails. > > > > > > Returning an error from hard_start_xmit() from normal ethernet > > drivers is, frankly, a driver bug and should never happen. > > What's "normal" mean? One that can manage it's own TX resources. > With the current USB stack, network adapters tend to need > memory allocations there. Those can fail, though it seems > that's not very common. Doesn't seem like a bug, for all > that I'd rather see the those paths be zero-alloc in 2.7. Any particular reason why the SKB data itself can't be mapped directly? We created all of these DMA mapping abstractions remember? :-) Another option is to pre-allocate, such that while the TX queue is awake we know we have enough resources to send any given packet. Then in ->hard_start_xmit() after using a buffer we allocate a replacement buffer, if this fails in such a way that a subsequent ->hard_start_xmit() could possibly fail, we do netif_stop_queue(). 
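As a rough illustration of the "stop the queue before a transmit can fail" scheme described above, here is a small user-space model. It is a sketch under stated assumptions, not code from any real driver: tx_ring, tx_ring_xmit(), tx_ring_complete(), TX_RING_SIZE and MAX_FRAGS_PER_PACKET are all hypothetical names and values, and the queue_stopped flag merely stands in for the state that netif_stop_queue() would set in a real driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limits, chosen only for illustration. */
#define MAX_FRAGS_PER_PACKET 4  /* worst-case fragments in one packet */
#define TX_RING_SIZE 16         /* total TX descriptors in the ring */

struct tx_ring {
    int free_entries;   /* descriptors not currently owned by hardware */
    bool queue_stopped; /* models the stopped/awake state of the TX queue */
};

static void tx_ring_init(struct tx_ring *ring)
{
    ring->free_entries = TX_RING_SIZE;
    ring->queue_stopped = false;
}

/* Models the driver's transmit hook. It is only called while the queue
 * is awake, and the stop rule below ensures that an awake queue always
 * has room for a worst-case packet, so this function has no failure path. */
static void tx_ring_xmit(struct tx_ring *ring, int nr_frags)
{
    int needed = nr_frags + 1; /* one descriptor for the head, one per fragment */

    assert(!ring->queue_stopped);
    assert(nr_frags >= 0 && nr_frags <= MAX_FRAGS_PER_PACKET);
    assert(ring->free_entries >= needed);

    ring->free_entries -= needed;

    /* Stop early: the next worst-case packet might not fit any more. */
    if (ring->free_entries <= MAX_FRAGS_PER_PACKET + 1)
        ring->queue_stopped = true;
}

/* Models TX completion: hardware hands descriptors back, and the queue
 * is woken once a worst-case packet is guaranteed to fit again. */
static void tx_ring_complete(struct tx_ring *ring, int reclaimed)
{
    ring->free_entries += reclaimed;
    if (ring->queue_stopped && ring->free_entries > MAX_FRAGS_PER_PACKET + 1)
        ring->queue_stopped = false;
}
```

The point of the model is the invariant: whenever the queue is awake, free_entries exceeds the worst-case need, so the transmit path never has to report an error back to the stack.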
Look to tg3 to see what I'm talking about in general. netif_stop_queue() is called as soon as it becomes possible that we cannot accept the queueing of another TX packet. This means that when TX entries available <= MAX_SKB_FRAGS + 1, we stop the queue. This guarantees that we will always be able to handle any packet given to us via ->hard_start_xmit(). From david-b@pacbell.net Sun Aug 3 20:41:38 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 20:41:43 -0700 (PDT) Received: from mta4.rcsntx.swbell.net (mta4.rcsntx.swbell.net [151.164.30.28]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h743fcFl021918 for ; Sun, 3 Aug 2003 20:41:38 -0700 Received: from pacbell.net (ppp-67-118-247-123.dialup.pltn13.pacbell.net [67.118.247.123]) by mta4.rcsntx.swbell.net (8.12.9/8.12.3) with ESMTP id h743fXjA026551; Sun, 3 Aug 2003 22:41:33 -0500 (CDT) Message-ID: <3F2DD6BD.7070504@pacbell.net> Date: Sun, 03 Aug 2003 20:45:01 -0700 From: David Brownell User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225 X-Accept-Language: en-us, en, fr MIME-Version: 1.0 To: "David S. 
Miller" CC: greearb@candelatech.com, jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> <3F2DCE56.6030601@pacbell.net> <20030803200851.7d46a605.davem@redhat.com> In-Reply-To: <20030803200851.7d46a605.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4502 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: david-b@pacbell.net Precedence: bulk X-list: netdev >>>>Although I have not tried this latest patch, the existing e100 and e1000 in >>>>2.4.21 seldom seem to return true to this method: netif_queue_stopped(odev), >>>>even when the next hard_start_xmit() call fails. >>> >>> >>>Returning an error from hard_start_xmit() from normal ethernet >>>drivers is, frankly, a driver bug and should never happen. >> >>What's "normal" mean? > > > One that can manage it's own TX resources. Which for the moment, would seem to exclude USB. >>With the current USB stack, network adapters tend to need >>memory allocations there. Those can fail, though it seems >>that's not very common. Doesn't seem like a bug, for all >>that I'd rather see the those paths be zero-alloc in 2.7. > > > Any particular reason why the SKB data itself can't be > mapped directly? We created all of these DMA mapping > abstractions remember? :-) Yes, but the network drivers weren't the ones that needed them! Some older drivers do copy the buffer out of (or for rx, into) the skb, but newer ones just pass the skb data, avoiding a copy. In either case, the buffer was always DMA mapped. Nowadays, some drivers will even set NETIF_F_HIGHDMA if they're going out through a host controller that allows it! (Intel boxes only, AFAIK.) 
> Another option is to pre-allocate, such that while the TX > queue is awake we know we have enough resources to send any > given packet. Then in ->hard_start_xmit() after using a buffer > we allocate a replacement buffer, if this fails in such a way > that a subsequent ->hard_start_xmit() could possibly fail, we > do netif_stop_queue(). Pre-allocation can be done for the URBs that wrap the data buffers, yes. Not often done today; but it could be. What can't be pre-allocated in a reliable way is the resources used by the host controller drivers, specifically the transfer descriptors. EHCI and OHCI usually need one per URB, unless MTU is over 4 KB. UHCI normally needs quite a few. The API that works inside USB "gadgets' does allow pre-allocation at all those levels, mostly because it's factored to make the submission and completion paths fast. So that "stop if can't pre-allocate" scheme would work, given an "ether.c" patch! :) - Dave From davem@redhat.com Sun Aug 3 20:51:30 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 20:51:34 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h743pUFl022879 for ; Sun, 3 Aug 2003 20:51:30 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id UAA19000; Sun, 3 Aug 2003 20:46:42 -0700 Date: Sun, 3 Aug 2003 20:46:42 -0700 From: "David S. 
Miller" To: David Brownell Cc: greearb@candelatech.com, jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release Message-Id: <20030803204642.684c6075.davem@redhat.com> In-Reply-To: <3F2DD6BD.7070504@pacbell.net> References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> <3F2DCE56.6030601@pacbell.net> <20030803200851.7d46a605.davem@redhat.com> <3F2DD6BD.7070504@pacbell.net> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4503 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Sun, 03 Aug 2003 20:45:01 -0700 David Brownell wrote: > What can't be pre-allocated in a reliable way is the resources > used by the host controller drivers, specifically the transfer > descriptors. EHCI and OHCI usually need one per URB, unless > MTU is over 4 KB. UHCI normally needs quite a few. Ok, that's interesting. Is there a callback that tells the USB driver that some host controller "resources" have become available? I mean, these host controllers either have to queue requests when out of resources or provide a callback so that the drivers can resubmit. Right? 
From david-b@pacbell.net Sun Aug 3 21:05:09 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 21:05:15 -0700 (PDT) Received: from mta4.rcsntx.swbell.net (mta4.rcsntx.swbell.net [151.164.30.28]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74458Fl024201 for ; Sun, 3 Aug 2003 21:05:09 -0700 Received: from pacbell.net (ppp-67-118-247-123.dialup.pltn13.pacbell.net [67.118.247.123]) by mta4.rcsntx.swbell.net (8.12.9/8.12.3) with ESMTP id h7444wjA026527; Sun, 3 Aug 2003 23:05:04 -0500 (CDT) Message-ID: <3F2DDC3A.2040707@pacbell.net> Date: Sun, 03 Aug 2003 21:08:26 -0700 From: David Brownell User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225 X-Accept-Language: en-us, en, fr MIME-Version: 1.0 To: "David S. Miller" CC: greearb@candelatech.com, jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> <3F2DCE56.6030601@pacbell.net> <20030803200851.7d46a605.davem@redhat.com> <3F2DD6BD.7070504@pacbell.net> <20030803204642.684c6075.davem@redhat.com> In-Reply-To: <20030803204642.684c6075.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4504 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: david-b@pacbell.net Precedence: bulk X-list: netdev David S. Miller wrote: > On Sun, 03 Aug 2003 20:45:01 -0700 > David Brownell wrote: > > >>What can't be pre-allocated in a reliable way is the resources >>used by the host controller drivers, specifically the transfer >>descriptors. EHCI and OHCI usually need one per URB, unless >>MTU is over 4 KB. UHCI normally needs quite a few. > > > Ok, that's interesting. All TDs get allocated in usb_submit_urb(), which is the first time the "real" core of USB connects an urb with an I/O queue. 
That's host-side, not device-side. > Is there a callback that tells the USB driver that some host > controller "resources" have become available? I mean, these host > controllers either have to queue requests when out of resources or > provide a callback so that the drivers can resubmit. No such callback. If no resources, they fail -ENOMEM and the caller must recover. Which is why hard_start_xmit() needs to do something. - Dave From davem@redhat.com Sun Aug 3 21:17:52 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 21:18:01 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h744HqFl025423 for ; Sun, 3 Aug 2003 21:17:52 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id VAA19041; Sun, 3 Aug 2003 21:13:34 -0700 Date: Sun, 3 Aug 2003 21:13:33 -0700 From: "David S. Miller" To: David Brownell Cc: greearb@candelatech.com, jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release Message-Id: <20030803211333.12839f66.davem@redhat.com> In-Reply-To: <3F2DDC3A.2040707@pacbell.net> References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> <3F2DCE56.6030601@pacbell.net> <20030803200851.7d46a605.davem@redhat.com> <3F2DD6BD.7070504@pacbell.net> <20030803204642.684c6075.davem@redhat.com> <3F2DDC3A.2040707@pacbell.net> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4505 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Sun, 03 Aug 2003 21:08:26 -0700 David Brownell wrote: > No such callback. 
If no resources, they fail -ENOMEM and the
> caller must recover. Which is why hard_start_xmit() needs to
> do something.

I would suggest something different :-)

For example, what do USB block device drivers do when -ENOMEM
comes back? Do they just drop the request on the floor? No,
rather they resubmit the request later without the scsi/block
layer knowing anything about what happened, right?

How do the USB block device drivers know when "later" is?

This is why I can't believe there is not some kind of "some USB
resources have been freed" event of some sort which USB drivers
can use to deal with this. :-)

From davem@redhat.com Sun Aug 3 22:26:02 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 22:26:09 -0700 (PDT)
Received: from rth.ninka.net (rth.ninka.net [216.101.162.244]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h745Q1Fl030678 for ; Sun, 3 Aug 2003 22:26:02 -0700
Received: from rth.ninka.net (localhost.localdomain [127.0.0.1]) by rth.ninka.net (8.12.8/8.12.8) with SMTP id h745PsSG027235; Sun, 3 Aug 2003 22:25:55 -0700
Date: Sun, 3 Aug 2003 22:25:54 -0700
From: "David S. Miller"
To: Glen Turner
Cc: jgarzik@pobox.com, netdev@oss.sgi.com
Subject: Re: TOE brain dump
Message-Id: <20030803222554.7027a160.davem@redhat.com>
In-Reply-To: <3F2DBB2B.9050803@aarnet.edu.au>
References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <3F2CAE61.7070401@pobox.com> <3F2DBB2B.9050803@aarnet.edu.au>
X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.10; i686-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-archive-position: 4506
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davem@redhat.com
Precedence: bulk
X-list: netdev

[ This discussion belongs on netdev, not linux-kernel.
]

On Mon, 04 Aug 2003 11:17:23 +0930
Glen Turner wrote:

> That's Matt Mathis's phrase. The Web100 project
> has a set of patches to the kernel
> which go a long way to reducing the wizard gap. It would be
> nice to see those patches eventually appear in the Linux
> mainstream.

The web100 patches aren't in the kernel because 1) they've
never even been submitted and 2) they need a large cleanup.

I sort of get the impression that the web100 folks actually
like that their changes are not in the main sources, it keeps
their work "special".

From werner@almesberger.net Sun Aug 3 22:51:33 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 22:51:44 -0700 (PDT)
Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h745pWFl000336 for ; Sun, 3 Aug 2003 22:51:32 -0700
Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h73HvkG04076; Sun, 3 Aug 2003 10:57:50 -0700
Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h73Hvc310464; Sun, 3 Aug 2003 14:57:38 -0300
Date: Sun, 3 Aug 2003 14:57:37 -0300
From: Werner Almesberger
To: Jeff Garzik
Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi
Subject: Re: TOE brain dump
Message-ID: <20030803145737.B10280@almesberger.net>
References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <3F2CAE61.7070401@pobox.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <3F2CAE61.7070401@pobox.com>; from jgarzik@pobox.com on Sun, Aug 03, 2003 at 02:40:33AM -0400
X-archive-position: 4507
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: werner@almesberger.net
Precedence: bulk
X-list: netdev

Jeff Garzik wrote:
> Really fast, really long pipes in practice
don't exist for 99.9% of all
> Internet users.

It matters to some right now, i.e. the ones who are interested in
TOE in the first place. (And there are also those who try to tweak
TCP to actually work over such links. Right now, its congestion
control doesn't scale that well.)

Also, IT has been good at making all that elitist high-performance
gear available to the common people rather quickly, and I don't see
that changing. The Crisis just alters the pace a little.

> When you approach traffic levels that push you to want to offload most of
> the TCP net stack, then TCP isn't the right solution for you anymore,
> all things considered.

No. Ironically, TCP is almost always the right solution.
Sometimes people try to use something else. Eventually,
their protocol wants to go over WANs or something that looks
suspiciously like a WAN (MAN or such). At that point, they
usually realize that TCP provides exactly the functionality
they need.

In theory, one could implement the same functionality in
other protocols. There was even talk at IETF to support a
generic congestion control manager for this purpose. That
was many years ago, and I haven't seen anything come out
of this.

So it seems that, by the time your protocol grows up to want
to play in the real world, it wants to be so much like TCP
that you're better off using TCP.

The amusing bit here is to watch all the "competitors" pop
up, grow, fail, and eventually die.

> The Linux net stack just isn't built to be offloaded.

Yes! And that's not a flaw of the stack, but it's simply
a fact of life. I think that no "real life" stack can be
offloaded (in the traditional sense).

> And I can't see ASIC and firmware
> designers being excited about implementing netfilter on a PCI card :)

And when they're done with netfilter, you can throw IPsec,
IPv6, or traffic control at them.
Eventually, you'll wear them down ;-)

> Unfortunately some vendors seem to be choosing TOE option #3: TCP offload
> which introduces many limitations (connection limits, netfilter not
> supported, etc.) which Linux never had before.

That's when that little word "no" comes into play, i.e. when
their modifications to the stack show up on netdev or
linux-kernel. Dave Miller seems to be pretty good at saying
"no". I hope he keeps on being good at this ;-)

> There is one interesting TOE solution, that I have yet to see created:
> run Linux on an embedded processor, on the NIC.

That's basically what I've been talking about all the while :-)

> The Linux OS driver interface becomes a virtual interface
> with a large MTU,

Probably not. I think you also want to push some knowledge
of where the data ultimately goes to the NIC. This could
be something like sendfile, something new, or just a few
bytes of user space code.

- Werner

--
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina     werner@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

From jgarzik@pobox.com Sun Aug 3 23:00:29 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 23:00:34 -0700 (PDT)
Received: from www.linux.org.uk (IDENT:LyFAj4hc+YOjEDA9RBFVQujz5PCEzoP5@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7460RFl001286 for ; Sun, 3 Aug 2003 23:00:28 -0700
Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19jQrk-00026Y-PG; Sun, 03 Aug 2003 22:58:12 +0100
Message-ID: <3F2D8569.1010109@pobox.com>
Date: Sun, 03 Aug 2003 17:58:01 -0400
From: Jeff Garzik
Organization: none
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk
X-Accept-Language: en
MIME-Version: 1.0
To: Larry McVoy
CC: David Lang , Erik Andersen , Werner Almesberger ,
	netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi
Subject: Re: TOE brain dump
References: <20030803194011.GA8324@work.bitmover.com> <20030803203051.GA9057@work.bitmover.com>
In-Reply-To: <20030803203051.GA9057@work.bitmover.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
X-archive-position: 4508
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: jgarzik@pobox.com
Precedence: bulk
X-list: netdev

Larry McVoy wrote:
> I can get an entire machine, memory, disk, > Ghz CPU, case, power supply,
> cdrom, floppy, onboard enet extra net card for routing, for $250 or less,
> quantity 1, shipped to my door.
>
> Why would I want to spend money on some silly offload card when I can get
> the whole PC for less than the card?

Yep. I think we are entering the era of what I call RAIC (pronounced
"rake") -- redundant array of inexpensive computers. For organizations
that can handle the space/power/temperature load, a powerful cluster of
supercheap PCs, the "Wal-Mart Supercomputer", can be built for a
rock-bottom price.

From pekkas@netcore.fi Sun Aug 3 23:06:13 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 23:06:16 -0700 (PDT)
Received: from netcore.fi (netcore.fi [193.94.160.1]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7466BFl002029 for ; Sun, 3 Aug 2003 23:06:12 -0700
Received: from localhost (pekkas@localhost) by netcore.fi (8.11.6/8.11.6) with ESMTP id h74664b12177 for ; Mon, 4 Aug 2003 09:06:05 +0300
Date: Mon, 4 Aug 2003 09:06:04 +0300 (EEST)
From: Pekka Savola
To: netdev@oss.sgi.com
Subject: multicast IP datagram forwarding bug and fix (fwd)
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 4509
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: pekkas@netcore.fi
Precedence: bulk
X-list: netdev

I didn't see followups to this, so I'm re-sending to the list just in case
it got dropped in the cracks..

--
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings

---------- Forwarded message ----------
Date: Mon, 28 Jul 2003 13:20:31 -0400
From: "Weng, Wending"
To: netdev@oss.sgi.com
Subject: multicast IP datagram forwarding bug and fix

> Hi,
>
> LINUX doesn't forward multicast IP datagram if it has option(s), there is a bug in the module ipmr.c, function
> ipmr_forward_finish, below is the current version of this function:
>
> static inline int ipmr_forward_finish(struct sk_buff *skb)
> {
> 	struct dst_entry *dst = skb->dst;
>
> 	if (skb->len <= dst->pmtu)
> 		return dst->output(skb);
> 	else
> 		return ip_fragment(skb, dst->output);
> }
>
> it forgets to recalculate the checksum in case the option is modified.
>
> The following code works properly:
>
> static inline int ipmr_forward_finish(struct sk_buff *skb)
> {
> 	struct dst_entry *dst = skb->dst;
>
> 	ip_forward_options (skb); /* this line recalculates checksum if needed.
*/
>
> 	if (skb->len <= dst->pmtu)
> 		return dst->output(skb);
> 	else
> 		return ip_fragment(skb, dst->output);
> }
>
> Wending Weng

From davem@redhat.com Sun Aug 3 23:10:14 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 23:10:20 -0700 (PDT)
Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h746AAFl002620 for ; Sun, 3 Aug 2003 23:10:13 -0700
Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id XAA19251; Sun, 3 Aug 2003 23:05:52 -0700
Date: Sun, 3 Aug 2003 23:05:52 -0700
From: "David S. Miller"
To: Pekka Savola
Cc: netdev@oss.sgi.com
Subject: Re: multicast IP datagram forwarding bug and fix (fwd)
Message-Id: <20030803230552.1aab9411.davem@redhat.com>
In-Reply-To:
References:
X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-archive-position: 4510
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davem@redhat.com
Precedence: bulk
X-list: netdev

On Mon, 4 Aug 2003 09:06:04 +0300 (EEST)
Pekka Savola wrote:

> I didn't see followups to this, so I'm re-sending to the list just in case
> it got dropped in the cracks..

I've already checked in a correct fix for this problem
from Alexey:

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.1584.2.13 -> 1.1584.2.14
#	     net/ipv4/ipmr.c	1.27    -> 1.28
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/08/02	kuznet@ms2.inr.ac.ru	1.1584.2.14
# [IPV4]: IP options were not updated while forwarding multicasts.
# --------------------------------------------
#
diff -Nru a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
--- a/net/ipv4/ipmr.c	Sun Aug 3 23:07:44 2003
+++ b/net/ipv4/ipmr.c	Sun Aug 3 23:07:44 2003
@@ -1100,6 +1100,7 @@
 
 	skb->h.ipiph = skb->nh.iph;
 	skb->nh.iph = iph;
+	memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
 #ifdef CONFIG_NETFILTER
 	nf_conntrack_put(skb->nfct);
 	skb->nfct = NULL;
@@ -1108,12 +1109,14 @@
 
 static inline int ipmr_forward_finish(struct sk_buff *skb)
 {
-	struct dst_entry *dst = skb->dst;
+	struct ip_options * opt = &(IPCB(skb)->opt);
 
-	if (skb->len <= dst_pmtu(dst))
-		return dst_output(skb);
-	else
-		return ip_fragment(skb, dst_output);
+	IP_INC_STATS_BH(IpForwDatagrams);
+
+	if (unlikely(opt->optlen))
+		ip_forward_options(skb);
+
+	return dst_output(skb);
 }
 
 /*

From pekkas@netcore.fi Sun Aug 3 23:11:39 2003
Received: with ECARTIS (v1.0.0; list netdev); Sun, 03 Aug 2003 23:11:44 -0700 (PDT)
Received: from netcore.fi (netcore.fi [193.94.160.1]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h746BbFl003012 for ; Sun, 3 Aug 2003 23:11:38 -0700
Received: from localhost (pekkas@localhost) by netcore.fi (8.11.6/8.11.6) with ESMTP id h746AsA12243; Mon, 4 Aug 2003 09:11:00 +0300
Date: Mon, 4 Aug 2003 09:10:53 +0300 (EEST)
From: Pekka Savola
To: Lamont Granquist
cc: Bill Davidsen , "David S. Miller" , Carlos Velasco , , , , , , ,
Subject: Re: [2.4 PATCH] bugfix: ARP respond on all devices
In-Reply-To: <20030728213933.F81299@coredump.scriptkiddie.org>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 4511
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: pekkas@netcore.fi
Precedence: bulk
X-list: netdev

Hi,

Just a thought..

How about considering this change for the 2.6 kernel series at this point,
and not backporting it to 2.4, at least at first, and/or making the
behaviour configurable?
Upgrading from 2.4 to 2.6 should be a step big enough that folks should
revisit their more advanced configurations, causing smaller surprises.
Changing the behaviour inside the 2.4.x series might not be reasonable.

On Mon, 28 Jul 2003, Lamont Granquist wrote:
> On Mon, 28 Jul 2003, Bill Davidsen wrote:
> > On Sun, 27 Jul 2003, David S. Miller wrote:
> > > This particular case has been discussed to death in the past
> > > and I really recommend people read up there before dragging this
> > > out further.
> >
> > It will keep coming back because it's a real problem. I do agree that the
> > hidden patch is not the desired way to solve the problem, but until there
> > is a reasonable (not requiring a guru or large manual effort) solution
> > people will keep bringing it up.
>
> And it severely violates the principle of least surprise. It's unfortunate
> that this principle isn't more widely discussed and considered on lkml.
>

--
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R.
Martin: A Clash of Kings

From andi@averellmail.firstfloor.org Mon Aug 4 05:50:32 2003
Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 05:50:45 -0700 (PDT)
Received: from zero.aec.at (Bishop.Potter@zero.aec.at [193.170.194.10]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74CoSFl017874 for ; Mon, 4 Aug 2003 05:50:30 -0700
Received: from fred.muc.de (Jared.Oopf@localhost.localdomain [127.0.0.1]) by zero.aec.at (8.11.6/8.11.2) with ESMTP id h74CoLm04438 for ; Mon, 4 Aug 2003 14:50:21 +0200
Received: by fred.muc.de (Postfix on SuSE Linux 7.3 (i386), from userid 500) id C18D35BB86; Mon, 4 Aug 2003 14:50:22 +0200 (CEST)
Date: Mon, 4 Aug 2003 14:50:22 +0200
From: Andi Kleen
To: netdev@oss.sgi.com
Subject: [PATCH] Make XFRM optional
Message-ID: <20030804125022.GA8167@averell>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4i
X-archive-position: 4512
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: ak@muc.de
Precedence: bulk
X-list: netdev

Only compile in the xfrm subsystem when it's needed by any config options.
This avoids some code/data structure bloat in case you don't use IP tunneling
or IPsec.

Also adds a net_ratelimit() to an unprotected printk.
For 2.6.0test2 -Andi diff -u linux-work/include/net/dst.h-XFRM linux-work/include/net/dst.h --- linux-work/include/net/dst.h-XFRM 2003-07-18 02:40:02.000000000 +0200 +++ linux-work/include/net/dst.h 2003-08-03 23:12:24.000000000 +0200 @@ -247,8 +247,16 @@ extern void dst_init(void); struct flowi; +#ifndef CONFIG_XFRM +static inline int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl, + struct sock *sk, int flags) +{ + return 0; +} +#else extern int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl, struct sock *sk, int flags); #endif +#endif #endif /* _NET_DST_H */ diff -u linux-work/include/net/xfrm.h-XFRM linux-work/include/net/xfrm.h --- linux-work/include/net/xfrm.h-XFRM 2003-07-28 23:12:30.000000000 +0200 +++ linux-work/include/net/xfrm.h 2003-08-03 23:14:04.000000000 +0200 @@ -587,6 +587,8 @@ return !0; } +#ifdef CONFIG_XFRM + extern int __xfrm_policy_check(struct sock *, int dir, struct sk_buff *skb, unsigned short family); static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) @@ -652,6 +654,26 @@ } } +#else + +static inline void xfrm_sk_free_policy(struct sock *sk) {} +static inline int xfrm_sk_clone_policy(struct sock *sk) { return 0; } +static inline int xfrm6_route_forward(struct sk_buff *skb) { return 1; } +static inline int xfrm4_route_forward(struct sk_buff *skb) { return 1; } +static inline int xfrm6_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + return 1; +} +static inline int xfrm4_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + return 1; +} +static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) +{ + return 1; +} +#endif + static __inline__ xfrm_address_t *xfrm_flowi_daddr(struct flowi *fl, unsigned short family) { @@ -782,12 +804,32 @@ extern int xfrm_check_selectors(struct xfrm_state **x, int n, struct flowi *fl); extern int xfrm_check_output(struct xfrm_state *x, struct sk_buff *skb, unsigned 
short family); extern int xfrm4_rcv(struct sk_buff *skb); -extern int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type); extern int xfrm4_tunnel_register(struct xfrm_tunnel *handler); extern int xfrm4_tunnel_deregister(struct xfrm_tunnel *handler); extern int xfrm4_tunnel_check_size(struct sk_buff *skb); extern int xfrm6_rcv(struct sk_buff **pskb, unsigned int *nhoffp); + +#ifdef CONFIG_XFRM +extern int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type); extern int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen); +extern int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); +#else +static inline int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen) +{ + return -ENOPROTOOPT; +} + +static inline int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type) +{ + /* should not happen */ + kfree_skb(skb); + return 0; +} +static inline int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family) +{ + return -EINVAL; +} +#endif void xfrm_policy_init(void); void xfrm4_policy_init(void); @@ -809,7 +851,6 @@ extern int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol); extern struct xfrm_policy *xfrm_sk_policy_lookup(struct sock *sk, int dir, struct flowi *fl); extern int xfrm_flush_bundles(struct xfrm_state *x); -extern int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); extern wait_queue_head_t km_waitq; extern void km_state_expired(struct xfrm_state *x, int hard); diff -u linux-work/net/core/skbuff.c-XFRM linux-work/net/core/skbuff.c --- linux-work/net/core/skbuff.c-XFRM 2003-07-18 02:39:47.000000000 +0200 +++ linux-work/net/core/skbuff.c 2003-08-03 23:12:25.000000000 +0200 @@ -225,7 +225,7 @@ } dst_release(skb->dst); -#ifdef CONFIG_INET +#ifdef CONFIG_XFRM secpath_put(skb->sp); #endif if(skb->destructor) { diff -u linux-work/net/ipv4/Kconfig-XFRM linux-work/net/ipv4/Kconfig --- linux-work/net/ipv4/Kconfig-XFRM 
2003-07-18 02:42:42.000000000 +0200 +++ linux-work/net/ipv4/Kconfig 2003-08-03 23:12:25.000000000 +0200 @@ -187,6 +187,7 @@ config NET_IPIP tristate "IP: tunneling" depends on INET + select XFRM ---help--- Tunneling means encapsulating data of one protocol type within another protocol and sending it over a channel that understands the @@ -205,6 +206,7 @@ config NET_IPGRE tristate "IP: GRE tunnels over IP" depends on INET + select XFRM help Tunneling means encapsulating data of one protocol type within another protocol and sending it over a channel that understands the @@ -343,6 +345,7 @@ config INET_AH tristate "IP: AH transformation" + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -354,6 +357,7 @@ config INET_ESP tristate "IP: ESP transformation" + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -366,6 +370,7 @@ config INET_IPCOMP tristate "IP: IPComp transformation" + select XFRM select CRYPTO select CRYPTO_DEFLATE ---help--- diff -u linux-work/net/ipv4/Makefile-XFRM linux-work/net/ipv4/Makefile --- linux-work/net/ipv4/Makefile-XFRM 2003-07-18 02:42:42.000000000 +0200 +++ linux-work/net/ipv4/Makefile 2003-08-03 23:12:25.000000000 +0200 @@ -23,4 +23,4 @@ obj-$(CONFIG_NETFILTER) += netfilter/ obj-$(CONFIG_IP_VS) += ipvs/ -obj-y += xfrm4_policy.o xfrm4_state.o xfrm4_input.o xfrm4_tunnel.o +obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o xfrm4_tunnel.o diff -u linux-work/net/ipv4/route.c-XFRM linux-work/net/ipv4/route.c --- linux-work/net/ipv4/route.c-XFRM 2003-07-18 02:39:31.000000000 +0200 +++ linux-work/net/ipv4/route.c 2003-08-03 23:12:25.000000000 +0200 @@ -2785,8 +2785,10 @@ create_proc_read_entry("net/rt_acct", 0, 0, ip_rt_acct_read, NULL); #endif #endif +#ifdef CONFIG_XFRM xfrm_init(); xfrm4_init(); +#endif out: return rc; out_enomem: diff -u linux-work/net/ipv4/udp.c-XFRM linux-work/net/ipv4/udp.c --- linux-work/net/ipv4/udp.c-XFRM 2003-07-18 02:42:43.000000000 +0200 +++ linux-work/net/ipv4/udp.c 
2003-08-03 23:31:05.000000000 +0200 @@ -938,6 +938,9 @@ */ static int udp_encap_rcv(struct sock * sk, struct sk_buff *skb) { +#ifndef CONFIG_XFRM + return 1; +#else struct udp_opt *up = udp_sk(sk); struct udphdr *uh = skb->h.uh; struct iphdr *iph; @@ -997,10 +1000,12 @@ return -1; default: - printk(KERN_INFO "udp_encap_rcv(): Unhandled UDP encap type: %u\n", - encap_type); + if (net_ratelimit()) + printk(KERN_INFO "udp_encap_rcv(): Unhandled UDP encap type: %u\n", + encap_type); return 1; } +#endif } /* returns: diff -u linux-work/net/ipv6/Kconfig-XFRM linux-work/net/ipv6/Kconfig --- linux-work/net/ipv6/Kconfig-XFRM 2003-07-18 02:39:29.000000000 +0200 +++ linux-work/net/ipv6/Kconfig 2003-08-03 23:12:25.000000000 +0200 @@ -4,6 +4,7 @@ config IPV6_PRIVACY bool "IPv6: Privacy Extensions (RFC 3041) support" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_MD5 ---help--- @@ -22,6 +23,7 @@ config INET6_AH tristate "IPv6: AH transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -34,6 +36,7 @@ config INET6_ESP tristate "IPv6: ESP transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -47,6 +50,7 @@ config INET6_IPCOMP tristate "IPv6: IPComp transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_DEFLATE ---help--- @@ -57,6 +61,7 @@ config IPV6_TUNNEL tristate "IPv6: IPv6-in-IPv6 tunnel" + select XFRM depends on IPV6 ---help--- Support for IPv6-in-IPv6 tunnels described in RFC 2473. 
diff -u linux-work/net/ipv6/Makefile-XFRM linux-work/net/ipv6/Makefile --- linux-work/net/ipv6/Makefile-XFRM 2003-07-18 02:39:29.000000000 +0200 +++ linux-work/net/ipv6/Makefile 2003-08-03 23:12:25.000000000 +0200 @@ -8,8 +8,9 @@ route.o ip6_fib.o ipv6_sockglue.o ndisc.o udp.o raw.o \ protocol.o icmp.o mcast.o reassembly.o tcp_ipv6.o \ exthdrs.o sysctl_net_ipv6.o datagram.o proc.o \ - ip6_flowlabel.o ipv6_syms.o \ - xfrm6_policy.o xfrm6_state.o xfrm6_input.o + ip6_flowlabel.o ipv6_syms.o + +obj-$(CONFIG_XFRM) += xfrm6_policy.o xfrm6_state.o xfrm6_input.o obj-$(CONFIG_INET6_AH) += ah6.o obj-$(CONFIG_INET6_ESP) += esp6.o diff -u linux-work/net/ipv6/ipv6_syms.c-XFRM linux-work/net/ipv6/ipv6_syms.c --- linux-work/net/ipv6/ipv6_syms.c-XFRM 2003-07-18 02:39:31.000000000 +0200 +++ linux-work/net/ipv6/ipv6_syms.c 2003-08-03 23:14:41.000000000 +0200 @@ -36,7 +36,9 @@ EXPORT_SYMBOL(in6addr_loopback); EXPORT_SYMBOL(in6_dev_finish_destroy); EXPORT_SYMBOL(ip6_find_1stfragopt); +#ifdef CONFIG_XFRM EXPORT_SYMBOL(xfrm6_rcv); +#endif EXPORT_SYMBOL(rt6_lookup); EXPORT_SYMBOL(fl6_sock_lookup); EXPORT_SYMBOL(ipv6_ext_hdr); diff -u linux-work/net/ipv6/route.c-XFRM linux-work/net/ipv6/route.c --- linux-work/net/ipv6/route.c-XFRM 2003-07-28 23:12:32.000000000 +0200 +++ linux-work/net/ipv6/route.c 2003-08-03 23:12:25.000000000 +0200 @@ -1988,7 +1988,9 @@ if (p) p->proc_fops = &rt6_stats_seq_fops; #endif +#ifdef CONFIG_XFRM xfrm6_init(); +#endif } #ifdef MODULE diff -u linux-work/net/xfrm/Kconfig-XFRM linux-work/net/xfrm/Kconfig --- linux-work/net/xfrm/Kconfig-XFRM 2003-05-27 03:00:40.000000000 +0200 +++ linux-work/net/xfrm/Kconfig 2003-08-03 23:12:25.000000000 +0200 @@ -1,9 +1,13 @@ # # XFRM configuration # +config XFRM + bool + depends on NET + config XFRM_USER tristate "IPsec user configuration interface" - depends on INET + depends on INET && XFRM ---help--- Support for IPsec user configuration interface used by native Linux tools. 
diff -u linux-work/net/xfrm/Makefile-XFRM linux-work/net/xfrm/Makefile --- linux-work/net/xfrm/Makefile-XFRM 2003-05-27 03:01:03.000000000 +0200 +++ linux-work/net/xfrm/Makefile 2003-08-03 23:12:25.000000000 +0200 @@ -2,6 +2,7 @@ # Makefile for the XFRM subsystem. # -obj-y := xfrm_policy.o xfrm_state.o xfrm_input.o xfrm_algo.o xfrm_output.o +obj-$(CONFIG_XFRM) := xfrm_policy.o xfrm_state.o xfrm_input.o xfrm_algo.o xfrm_output.o \ + xfrm_export.o obj-$(CONFIG_XFRM_USER) += xfrm_user.o diff -u linux-work/net/xfrm/xfrm_export.c-XFRM linux-work/net/xfrm/xfrm_export.c --- linux-work/net/xfrm/xfrm_export.c-XFRM 2003-08-03 23:12:25.000000000 +0200 +++ linux-work/net/xfrm/xfrm_export.c 2003-08-03 23:16:06.000000000 +0200 @@ -0,0 +1,76 @@ +#include +#include + +EXPORT_SYMBOL(xfrm_user_policy); +EXPORT_SYMBOL(km_waitq); +EXPORT_SYMBOL(km_new_mapping); +EXPORT_SYMBOL(xfrm_cfg_sem); +EXPORT_SYMBOL(xfrm_policy_alloc); +EXPORT_SYMBOL(__xfrm_policy_destroy); +EXPORT_SYMBOL(xfrm_lookup); +EXPORT_SYMBOL(__xfrm_policy_check); +EXPORT_SYMBOL(__xfrm_route_forward); +EXPORT_SYMBOL(xfrm_state_alloc); +EXPORT_SYMBOL(__xfrm_state_destroy); +EXPORT_SYMBOL(xfrm_state_find); +EXPORT_SYMBOL(xfrm_state_insert); +EXPORT_SYMBOL(xfrm_state_add); +EXPORT_SYMBOL(xfrm_state_update); +EXPORT_SYMBOL(xfrm_state_check_expire); +EXPORT_SYMBOL(xfrm_state_check_space); +EXPORT_SYMBOL(xfrm_state_lookup); +EXPORT_SYMBOL(xfrm_state_register_afinfo); +EXPORT_SYMBOL(xfrm_state_unregister_afinfo); +EXPORT_SYMBOL(xfrm_state_get_afinfo); +EXPORT_SYMBOL(xfrm_state_put_afinfo); +EXPORT_SYMBOL(xfrm_state_delete_tunnel); +EXPORT_SYMBOL(xfrm_replay_check); +EXPORT_SYMBOL(xfrm_replay_advance); +EXPORT_SYMBOL(xfrm_check_selectors); +EXPORT_SYMBOL(xfrm_check_output); +EXPORT_SYMBOL(__secpath_destroy); +EXPORT_SYMBOL(xfrm_get_acqseq); +EXPORT_SYMBOL(xfrm_parse_spi); +EXPORT_SYMBOL(xfrm4_rcv); +EXPORT_SYMBOL(xfrm4_tunnel_register); +EXPORT_SYMBOL(xfrm4_tunnel_deregister); +EXPORT_SYMBOL(xfrm4_tunnel_check_size); 
+EXPORT_SYMBOL(xfrm_register_type); +EXPORT_SYMBOL(xfrm_unregister_type); +EXPORT_SYMBOL(xfrm_get_type); +EXPORT_SYMBOL(inet_peer_idlock); +EXPORT_SYMBOL(xfrm_register_km); +EXPORT_SYMBOL(xfrm_unregister_km); +EXPORT_SYMBOL(xfrm_state_delete); +EXPORT_SYMBOL(xfrm_state_walk); +EXPORT_SYMBOL(xfrm_find_acq_byseq); +EXPORT_SYMBOL(xfrm_find_acq); +EXPORT_SYMBOL(xfrm_alloc_spi); +EXPORT_SYMBOL(xfrm_state_flush); +EXPORT_SYMBOL(xfrm_policy_kill); +EXPORT_SYMBOL(xfrm_policy_bysel); +EXPORT_SYMBOL(xfrm_policy_insert); +EXPORT_SYMBOL(xfrm_policy_walk); +EXPORT_SYMBOL(xfrm_policy_flush); +EXPORT_SYMBOL(xfrm_policy_byid); +EXPORT_SYMBOL(xfrm_policy_list); +EXPORT_SYMBOL(xfrm_dst_lookup); +EXPORT_SYMBOL(xfrm_policy_register_afinfo); +EXPORT_SYMBOL(xfrm_policy_unregister_afinfo); +EXPORT_SYMBOL(xfrm_policy_get_afinfo); +EXPORT_SYMBOL(xfrm_policy_put_afinfo); + +EXPORT_SYMBOL_GPL(xfrm_probe_algs); +EXPORT_SYMBOL_GPL(xfrm_count_auth_supported); +EXPORT_SYMBOL_GPL(xfrm_count_enc_supported); +EXPORT_SYMBOL_GPL(xfrm_aalg_get_byidx); +EXPORT_SYMBOL_GPL(xfrm_ealg_get_byidx); +EXPORT_SYMBOL_GPL(xfrm_calg_get_byidx); +EXPORT_SYMBOL_GPL(xfrm_aalg_get_byid); +EXPORT_SYMBOL_GPL(xfrm_ealg_get_byid); +EXPORT_SYMBOL_GPL(xfrm_calg_get_byid); +EXPORT_SYMBOL_GPL(xfrm_aalg_get_byname); +EXPORT_SYMBOL_GPL(xfrm_ealg_get_byname); +EXPORT_SYMBOL_GPL(xfrm_calg_get_byname); + +EXPORT_SYMBOL_GPL(skb_icv_walk); diff -u linux-work/net/Kconfig-XFRM linux-work/net/Kconfig --- linux-work/net/Kconfig-XFRM 2003-05-27 03:00:21.000000000 +0200 +++ linux-work/net/Kconfig 2003-08-03 23:12:24.000000000 +0200 @@ -143,6 +143,7 @@ config NET_KEY tristate "PF_KEY sockets" + select XFRM ---help--- PF_KEYv2 socket family, compatible to KAME ones. 
They are required if you are going to use IPsec tools ported diff -u linux-work/net/netsyms.c-XFRM linux-work/net/netsyms.c --- linux-work/net/netsyms.c-XFRM 2003-07-28 23:12:33.000000000 +0200 +++ linux-work/net/netsyms.c 2003-08-03 23:16:23.000000000 +0200 @@ -56,7 +56,6 @@ #include #include #include -#include #if defined(CONFIG_INET_AH) || defined(CONFIG_INET_AH_MODULE) || defined(CONFIG_INET6_AH) || defined(CONFIG_INET6_AH_MODULE) #include #endif @@ -294,78 +293,6 @@ /* needed for ip_gre -cw */ EXPORT_SYMBOL(ip_statistics); -EXPORT_SYMBOL(xfrm_user_policy); -EXPORT_SYMBOL(km_waitq); -EXPORT_SYMBOL(km_new_mapping); -EXPORT_SYMBOL(xfrm_cfg_sem); -EXPORT_SYMBOL(xfrm_policy_alloc); -EXPORT_SYMBOL(__xfrm_policy_destroy); -EXPORT_SYMBOL(xfrm_lookup); -EXPORT_SYMBOL(__xfrm_policy_check); -EXPORT_SYMBOL(__xfrm_route_forward); -EXPORT_SYMBOL(xfrm_state_alloc); -EXPORT_SYMBOL(__xfrm_state_destroy); -EXPORT_SYMBOL(xfrm_state_find); -EXPORT_SYMBOL(xfrm_state_insert); -EXPORT_SYMBOL(xfrm_state_add); -EXPORT_SYMBOL(xfrm_state_update); -EXPORT_SYMBOL(xfrm_state_check_expire); -EXPORT_SYMBOL(xfrm_state_check_space); -EXPORT_SYMBOL(xfrm_state_lookup); -EXPORT_SYMBOL(xfrm_state_register_afinfo); -EXPORT_SYMBOL(xfrm_state_unregister_afinfo); -EXPORT_SYMBOL(xfrm_state_get_afinfo); -EXPORT_SYMBOL(xfrm_state_put_afinfo); -EXPORT_SYMBOL(xfrm_state_delete_tunnel); -EXPORT_SYMBOL(xfrm_replay_check); -EXPORT_SYMBOL(xfrm_replay_advance); -EXPORT_SYMBOL(xfrm_check_selectors); -EXPORT_SYMBOL(xfrm_check_output); -EXPORT_SYMBOL(__secpath_destroy); -EXPORT_SYMBOL(xfrm_get_acqseq); -EXPORT_SYMBOL(xfrm_parse_spi); -EXPORT_SYMBOL(xfrm4_rcv); -EXPORT_SYMBOL(xfrm4_tunnel_register); -EXPORT_SYMBOL(xfrm4_tunnel_deregister); -EXPORT_SYMBOL(xfrm4_tunnel_check_size); -EXPORT_SYMBOL(xfrm_register_type); -EXPORT_SYMBOL(xfrm_unregister_type); -EXPORT_SYMBOL(xfrm_get_type); -EXPORT_SYMBOL(inet_peer_idlock); -EXPORT_SYMBOL(xfrm_register_km); -EXPORT_SYMBOL(xfrm_unregister_km); 
-EXPORT_SYMBOL(xfrm_state_delete); -EXPORT_SYMBOL(xfrm_state_walk); -EXPORT_SYMBOL(xfrm_find_acq_byseq); -EXPORT_SYMBOL(xfrm_find_acq); -EXPORT_SYMBOL(xfrm_alloc_spi); -EXPORT_SYMBOL(xfrm_state_flush); -EXPORT_SYMBOL(xfrm_policy_kill); -EXPORT_SYMBOL(xfrm_policy_bysel); -EXPORT_SYMBOL(xfrm_policy_insert); -EXPORT_SYMBOL(xfrm_policy_walk); -EXPORT_SYMBOL(xfrm_policy_flush); -EXPORT_SYMBOL(xfrm_policy_byid); -EXPORT_SYMBOL(xfrm_policy_list); -EXPORT_SYMBOL(xfrm_dst_lookup); -EXPORT_SYMBOL(xfrm_policy_register_afinfo); -EXPORT_SYMBOL(xfrm_policy_unregister_afinfo); -EXPORT_SYMBOL(xfrm_policy_get_afinfo); -EXPORT_SYMBOL(xfrm_policy_put_afinfo); - -EXPORT_SYMBOL_GPL(xfrm_probe_algs); -EXPORT_SYMBOL_GPL(xfrm_count_auth_supported); -EXPORT_SYMBOL_GPL(xfrm_count_enc_supported); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byname); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byname); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byname); -EXPORT_SYMBOL_GPL(skb_icv_walk); #if defined(CONFIG_INET_ESP) || defined(CONFIG_INET_ESP_MODULE) || defined(CONFIG_INET6_ESP) || defined(CONFIG_INET6_ESP_MODULE) EXPORT_SYMBOL_GPL(skb_cow_data); EXPORT_SYMBOL_GPL(pskb_put); From yoshfuji@linux-ipv6.org Mon Aug 4 05:58:03 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 05:58:08 -0700 (PDT) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74Cw1Fl018295 for ; Mon, 4 Aug 2003 05:58:03 -0700 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h74Cw11M000463; Mon, 4 Aug 2003 21:58:01 +0900 Date: Mon, 04 Aug 2003 21:58:01 +0900 (JST) Message-Id: <20030804.215801.124854897.yoshfuji@linux-ipv6.org> To: ak@muc.de Cc: netdev@oss.sgi.com, 
yoshfuji@linux-ipv6.org Subject: Re: [PATCH] Make XFRM optional From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: <20030804125022.GA8167@averell> References: <20030804125022.GA8167@averell> Organization: USAGI Project X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4513 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev Hello. In article <20030804125022.GA8167@averell> (at Mon, 4 Aug 2003 14:50:22 +0200), Andi Kleen says: > diff -u linux-work/net/ipv6/Kconfig-XFRM linux-work/net/ipv6/Kconfig > --- linux-work/net/ipv6/Kconfig-XFRM 2003-07-18 02:39:29.000000000 +0200 > +++ linux-work/net/ipv6/Kconfig 2003-08-03 23:12:25.000000000 +0200 > @@ -4,6 +4,7 @@ > config IPV6_PRIVACY > bool "IPv6: Privacy Extensions (RFC 3041) support" > depends on IPV6 > + select XFRM > select CRYPTO > select CRYPTO_MD5 > ---help--- We do not need this. > @@ -57,6 +61,7 @@ > > config IPV6_TUNNEL > tristate "IPv6: IPv6-in-IPv6 tunnel" > + select XFRM > depends on IPV6 > ---help--- > Support for IPv6-in-IPv6 tunnels described in RFC 2473. We do not need this for now. 
-- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From ak@muc.de Mon Aug 4 06:04:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 06:04:29 -0700 (PDT) Received: from colin2.muc.de (qmailr@colin2.muc.de [193.149.48.15]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74D4CFl018727 for ; Mon, 4 Aug 2003 06:04:14 -0700 Received: (qmail 39137 invoked by uid 3709); 4 Aug 2003 13:04:08 -0000 Date: 4 Aug 2003 15:04:08 +0200 Date: Mon, 4 Aug 2003 15:04:08 +0200 From: Andi Kleen To: "YOSHIFUJI Hideaki / ?$B5HF#1QL@" Cc: ak@muc.de, netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional Message-ID: <20030804130408.GA36367@colin2.muc.de> References: <20030804125022.GA8167@averell> <20030804.215801.124854897.yoshfuji@linux-ipv6.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030804.215801.124854897.yoshfuji@linux-ipv6.org> User-Agent: Mutt/1.4.1i X-archive-position: 4514 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ak@colin2.muc.de Precedence: bulk X-list: netdev On Mon, Aug 04, 2003 at 09:58:01PM +0900, YOSHIFUJI Hideaki / ?$B5HF#1QL@ wrote: > Hello. > > In article <20030804125022.GA8167@averell> (at Mon, 4 Aug 2003 14:50:22 +0200), Andi Kleen says: > > > diff -u linux-work/net/ipv6/Kconfig-XFRM linux-work/net/ipv6/Kconfig > > --- linux-work/net/ipv6/Kconfig-XFRM 2003-07-18 02:39:29.000000000 +0200 > > +++ linux-work/net/ipv6/Kconfig 2003-08-03 23:12:25.000000000 +0200 > > @@ -4,6 +4,7 @@ > > config IPV6_PRIVACY > > bool "IPv6: Privacy Extensions (RFC 3041) support" > > depends on IPV6 > > + select XFRM > > select CRYPTO > > select CRYPTO_MD5 > > ---help--- > > We do not need this. Thanks for the feedback. Here is a new patch with the two hunks removed. 
-Andi diff -u linux-work/include/net/dst.h-XFRM linux-work/include/net/dst.h --- linux-work/include/net/dst.h-XFRM 2003-07-18 02:40:02.000000000 +0200 +++ linux-work/include/net/dst.h 2003-08-03 23:12:24.000000000 +0200 @@ -247,8 +247,16 @@ extern void dst_init(void); struct flowi; +#ifndef CONFIG_XFRM +static inline int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl, + struct sock *sk, int flags) +{ + return 0; +} +#else extern int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl, struct sock *sk, int flags); #endif +#endif #endif /* _NET_DST_H */ diff -u linux-work/include/net/xfrm.h-XFRM linux-work/include/net/xfrm.h --- linux-work/include/net/xfrm.h-XFRM 2003-07-28 23:12:30.000000000 +0200 +++ linux-work/include/net/xfrm.h 2003-08-03 23:14:04.000000000 +0200 @@ -587,6 +587,8 @@ return !0; } +#ifdef CONFIG_XFRM + extern int __xfrm_policy_check(struct sock *, int dir, struct sk_buff *skb, unsigned short family); static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) @@ -652,6 +654,26 @@ } } +#else + +static inline void xfrm_sk_free_policy(struct sock *sk) {} +static inline int xfrm_sk_clone_policy(struct sock *sk) { return 0; } +static inline int xfrm6_route_forward(struct sk_buff *skb) { return 1; } +static inline int xfrm4_route_forward(struct sk_buff *skb) { return 1; } +static inline int xfrm6_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + return 1; +} +static inline int xfrm4_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + return 1; +} +static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) +{ + return 1; +} +#endif + static __inline__ xfrm_address_t *xfrm_flowi_daddr(struct flowi *fl, unsigned short family) { @@ -782,12 +804,32 @@ extern int xfrm_check_selectors(struct xfrm_state **x, int n, struct flowi *fl); extern int xfrm_check_output(struct xfrm_state *x, struct sk_buff *skb, unsigned short family); 
extern int xfrm4_rcv(struct sk_buff *skb); -extern int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type); extern int xfrm4_tunnel_register(struct xfrm_tunnel *handler); extern int xfrm4_tunnel_deregister(struct xfrm_tunnel *handler); extern int xfrm4_tunnel_check_size(struct sk_buff *skb); extern int xfrm6_rcv(struct sk_buff **pskb, unsigned int *nhoffp); + +#ifdef CONFIG_XFRM +extern int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type); extern int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen); +extern int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); +#else +static inline int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen) +{ + return -ENOPROTOOPT; +} + +static inline int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type) +{ + /* should not happen */ + kfree_skb(skb); + return 0; +} +static inline int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family) +{ + return -EINVAL; +} +#endif void xfrm_policy_init(void); void xfrm4_policy_init(void); @@ -809,7 +851,6 @@ extern int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol); extern struct xfrm_policy *xfrm_sk_policy_lookup(struct sock *sk, int dir, struct flowi *fl); extern int xfrm_flush_bundles(struct xfrm_state *x); -extern int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); extern wait_queue_head_t km_waitq; extern void km_state_expired(struct xfrm_state *x, int hard); diff -u linux-work/net/core/skbuff.c-XFRM linux-work/net/core/skbuff.c --- linux-work/net/core/skbuff.c-XFRM 2003-07-18 02:39:47.000000000 +0200 +++ linux-work/net/core/skbuff.c 2003-08-03 23:12:25.000000000 +0200 @@ -225,7 +225,7 @@ } dst_release(skb->dst); -#ifdef CONFIG_INET +#ifdef CONFIG_XFRM secpath_put(skb->sp); #endif if(skb->destructor) { diff -u linux-work/net/ipv4/Kconfig-XFRM linux-work/net/ipv4/Kconfig --- linux-work/net/ipv4/Kconfig-XFRM 2003-07-18 
02:42:42.000000000 +0200 +++ linux-work/net/ipv4/Kconfig 2003-08-03 23:12:25.000000000 +0200 @@ -187,6 +187,7 @@ config NET_IPIP tristate "IP: tunneling" depends on INET + select XFRM ---help--- Tunneling means encapsulating data of one protocol type within another protocol and sending it over a channel that understands the @@ -205,6 +206,7 @@ config NET_IPGRE tristate "IP: GRE tunnels over IP" depends on INET + select XFRM help Tunneling means encapsulating data of one protocol type within another protocol and sending it over a channel that understands the @@ -343,6 +345,7 @@ config INET_AH tristate "IP: AH transformation" + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -354,6 +357,7 @@ config INET_ESP tristate "IP: ESP transformation" + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -366,6 +370,7 @@ config INET_IPCOMP tristate "IP: IPComp transformation" + select XFRM select CRYPTO select CRYPTO_DEFLATE ---help--- diff -u linux-work/net/ipv4/Makefile-XFRM linux-work/net/ipv4/Makefile --- linux-work/net/ipv4/Makefile-XFRM 2003-07-18 02:42:42.000000000 +0200 +++ linux-work/net/ipv4/Makefile 2003-08-03 23:12:25.000000000 +0200 @@ -23,4 +23,4 @@ obj-$(CONFIG_NETFILTER) += netfilter/ obj-$(CONFIG_IP_VS) += ipvs/ -obj-y += xfrm4_policy.o xfrm4_state.o xfrm4_input.o xfrm4_tunnel.o +obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o xfrm4_tunnel.o diff -u linux-work/net/ipv4/route.c-XFRM linux-work/net/ipv4/route.c --- linux-work/net/ipv4/route.c-XFRM 2003-07-18 02:39:31.000000000 +0200 +++ linux-work/net/ipv4/route.c 2003-08-03 23:12:25.000000000 +0200 @@ -2785,8 +2785,10 @@ create_proc_read_entry("net/rt_acct", 0, 0, ip_rt_acct_read, NULL); #endif #endif +#ifdef CONFIG_XFRM xfrm_init(); xfrm4_init(); +#endif out: return rc; out_enomem: diff -u linux-work/net/ipv4/udp.c-XFRM linux-work/net/ipv4/udp.c --- linux-work/net/ipv4/udp.c-XFRM 2003-07-18 02:42:43.000000000 +0200 +++ linux-work/net/ipv4/udp.c 2003-08-03 
23:31:05.000000000 +0200 @@ -938,6 +938,9 @@ */ static int udp_encap_rcv(struct sock * sk, struct sk_buff *skb) { +#ifndef CONFIG_XFRM + return 1; +#else struct udp_opt *up = udp_sk(sk); struct udphdr *uh = skb->h.uh; struct iphdr *iph; @@ -997,10 +1000,12 @@ return -1; default: - printk(KERN_INFO "udp_encap_rcv(): Unhandled UDP encap type: %u\n", - encap_type); + if (net_ratelimit()) + printk(KERN_INFO "udp_encap_rcv(): Unhandled UDP encap type: %u\n", + encap_type); return 1; } +#endif } /* returns: diff -u linux-work/net/ipv6/Kconfig-XFRM linux-work/net/ipv6/Kconfig --- linux-work/net/ipv6/Kconfig-XFRM 2003-07-18 02:39:29.000000000 +0200 +++ linux-work/net/ipv6/Kconfig 2003-08-03 23:12:25.000000000 +0200 @@ -22,6 +23,7 @@ config INET6_AH tristate "IPv6: AH transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -34,6 +36,7 @@ config INET6_ESP tristate "IPv6: ESP transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -47,6 +50,7 @@ config INET6_IPCOMP tristate "IPv6: IPComp transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_DEFLATE ---help--- diff -u linux-work/net/ipv6/Makefile-XFRM linux-work/net/ipv6/Makefile --- linux-work/net/ipv6/Makefile-XFRM 2003-07-18 02:39:29.000000000 +0200 +++ linux-work/net/ipv6/Makefile 2003-08-03 23:12:25.000000000 +0200 @@ -8,8 +8,9 @@ route.o ip6_fib.o ipv6_sockglue.o ndisc.o udp.o raw.o \ protocol.o icmp.o mcast.o reassembly.o tcp_ipv6.o \ exthdrs.o sysctl_net_ipv6.o datagram.o proc.o \ - ip6_flowlabel.o ipv6_syms.o \ - xfrm6_policy.o xfrm6_state.o xfrm6_input.o + ip6_flowlabel.o ipv6_syms.o + +obj-$(CONFIG_XFRM) += xfrm6_policy.o xfrm6_state.o xfrm6_input.o obj-$(CONFIG_INET6_AH) += ah6.o obj-$(CONFIG_INET6_ESP) += esp6.o diff -u linux-work/net/ipv6/ipv6_syms.c-XFRM linux-work/net/ipv6/ipv6_syms.c --- linux-work/net/ipv6/ipv6_syms.c-XFRM 2003-07-18 02:39:31.000000000 +0200 +++ linux-work/net/ipv6/ipv6_syms.c 
2003-08-03 23:14:41.000000000 +0200 @@ -36,7 +36,9 @@ EXPORT_SYMBOL(in6addr_loopback); EXPORT_SYMBOL(in6_dev_finish_destroy); EXPORT_SYMBOL(ip6_find_1stfragopt); +#ifdef CONFIG_XFRM EXPORT_SYMBOL(xfrm6_rcv); +#endif EXPORT_SYMBOL(rt6_lookup); EXPORT_SYMBOL(fl6_sock_lookup); EXPORT_SYMBOL(ipv6_ext_hdr); diff -u linux-work/net/ipv6/route.c-XFRM linux-work/net/ipv6/route.c --- linux-work/net/ipv6/route.c-XFRM 2003-07-28 23:12:32.000000000 +0200 +++ linux-work/net/ipv6/route.c 2003-08-03 23:12:25.000000000 +0200 @@ -1988,7 +1988,9 @@ if (p) p->proc_fops = &rt6_stats_seq_fops; #endif +#ifdef CONFIG_XFRM xfrm6_init(); +#endif } #ifdef MODULE diff -u linux-work/net/xfrm/Kconfig-XFRM linux-work/net/xfrm/Kconfig --- linux-work/net/xfrm/Kconfig-XFRM 2003-05-27 03:00:40.000000000 +0200 +++ linux-work/net/xfrm/Kconfig 2003-08-03 23:12:25.000000000 +0200 @@ -1,9 +1,13 @@ # # XFRM configuration # +config XFRM + bool + depends on NET + config XFRM_USER tristate "IPsec user configuration interface" - depends on INET + depends on INET && XFRM ---help--- Support for IPsec user configuration interface used by native Linux tools. diff -u linux-work/net/xfrm/Makefile-XFRM linux-work/net/xfrm/Makefile --- linux-work/net/xfrm/Makefile-XFRM 2003-05-27 03:01:03.000000000 +0200 +++ linux-work/net/xfrm/Makefile 2003-08-03 23:12:25.000000000 +0200 @@ -2,6 +2,7 @@ # Makefile for the XFRM subsystem. 
# -obj-y := xfrm_policy.o xfrm_state.o xfrm_input.o xfrm_algo.o xfrm_output.o +obj-$(CONFIG_XFRM) := xfrm_policy.o xfrm_state.o xfrm_input.o xfrm_algo.o xfrm_output.o \ + xfrm_export.o obj-$(CONFIG_XFRM_USER) += xfrm_user.o diff -u linux-work/net/xfrm/xfrm_export.c-XFRM linux-work/net/xfrm/xfrm_export.c --- linux-work/net/xfrm/xfrm_export.c-XFRM 2003-08-03 23:12:25.000000000 +0200 +++ linux-work/net/xfrm/xfrm_export.c 2003-08-03 23:16:06.000000000 +0200 @@ -0,0 +1,76 @@ +#include +#include + +EXPORT_SYMBOL(xfrm_user_policy); +EXPORT_SYMBOL(km_waitq); +EXPORT_SYMBOL(km_new_mapping); +EXPORT_SYMBOL(xfrm_cfg_sem); +EXPORT_SYMBOL(xfrm_policy_alloc); +EXPORT_SYMBOL(__xfrm_policy_destroy); +EXPORT_SYMBOL(xfrm_lookup); +EXPORT_SYMBOL(__xfrm_policy_check); +EXPORT_SYMBOL(__xfrm_route_forward); +EXPORT_SYMBOL(xfrm_state_alloc); +EXPORT_SYMBOL(__xfrm_state_destroy); +EXPORT_SYMBOL(xfrm_state_find); +EXPORT_SYMBOL(xfrm_state_insert); +EXPORT_SYMBOL(xfrm_state_add); +EXPORT_SYMBOL(xfrm_state_update); +EXPORT_SYMBOL(xfrm_state_check_expire); +EXPORT_SYMBOL(xfrm_state_check_space); +EXPORT_SYMBOL(xfrm_state_lookup); +EXPORT_SYMBOL(xfrm_state_register_afinfo); +EXPORT_SYMBOL(xfrm_state_unregister_afinfo); +EXPORT_SYMBOL(xfrm_state_get_afinfo); +EXPORT_SYMBOL(xfrm_state_put_afinfo); +EXPORT_SYMBOL(xfrm_state_delete_tunnel); +EXPORT_SYMBOL(xfrm_replay_check); +EXPORT_SYMBOL(xfrm_replay_advance); +EXPORT_SYMBOL(xfrm_check_selectors); +EXPORT_SYMBOL(xfrm_check_output); +EXPORT_SYMBOL(__secpath_destroy); +EXPORT_SYMBOL(xfrm_get_acqseq); +EXPORT_SYMBOL(xfrm_parse_spi); +EXPORT_SYMBOL(xfrm4_rcv); +EXPORT_SYMBOL(xfrm4_tunnel_register); +EXPORT_SYMBOL(xfrm4_tunnel_deregister); +EXPORT_SYMBOL(xfrm4_tunnel_check_size); +EXPORT_SYMBOL(xfrm_register_type); +EXPORT_SYMBOL(xfrm_unregister_type); +EXPORT_SYMBOL(xfrm_get_type); +EXPORT_SYMBOL(inet_peer_idlock); +EXPORT_SYMBOL(xfrm_register_km); +EXPORT_SYMBOL(xfrm_unregister_km); +EXPORT_SYMBOL(xfrm_state_delete); +EXPORT_SYMBOL(xfrm_state_walk); 
+EXPORT_SYMBOL(xfrm_find_acq_byseq); +EXPORT_SYMBOL(xfrm_find_acq); +EXPORT_SYMBOL(xfrm_alloc_spi); +EXPORT_SYMBOL(xfrm_state_flush); +EXPORT_SYMBOL(xfrm_policy_kill); +EXPORT_SYMBOL(xfrm_policy_bysel); +EXPORT_SYMBOL(xfrm_policy_insert); +EXPORT_SYMBOL(xfrm_policy_walk); +EXPORT_SYMBOL(xfrm_policy_flush); +EXPORT_SYMBOL(xfrm_policy_byid); +EXPORT_SYMBOL(xfrm_policy_list); +EXPORT_SYMBOL(xfrm_dst_lookup); +EXPORT_SYMBOL(xfrm_policy_register_afinfo); +EXPORT_SYMBOL(xfrm_policy_unregister_afinfo); +EXPORT_SYMBOL(xfrm_policy_get_afinfo); +EXPORT_SYMBOL(xfrm_policy_put_afinfo); + +EXPORT_SYMBOL_GPL(xfrm_probe_algs); +EXPORT_SYMBOL_GPL(xfrm_count_auth_supported); +EXPORT_SYMBOL_GPL(xfrm_count_enc_supported); +EXPORT_SYMBOL_GPL(xfrm_aalg_get_byidx); +EXPORT_SYMBOL_GPL(xfrm_ealg_get_byidx); +EXPORT_SYMBOL_GPL(xfrm_calg_get_byidx); +EXPORT_SYMBOL_GPL(xfrm_aalg_get_byid); +EXPORT_SYMBOL_GPL(xfrm_ealg_get_byid); +EXPORT_SYMBOL_GPL(xfrm_calg_get_byid); +EXPORT_SYMBOL_GPL(xfrm_aalg_get_byname); +EXPORT_SYMBOL_GPL(xfrm_ealg_get_byname); +EXPORT_SYMBOL_GPL(xfrm_calg_get_byname); + +EXPORT_SYMBOL_GPL(skb_icv_walk); diff -u linux-work/net/Kconfig-XFRM linux-work/net/Kconfig --- linux-work/net/Kconfig-XFRM 2003-05-27 03:00:21.000000000 +0200 +++ linux-work/net/Kconfig 2003-08-03 23:12:24.000000000 +0200 @@ -143,6 +143,7 @@ config NET_KEY tristate "PF_KEY sockets" + select XFRM ---help--- PF_KEYv2 socket family, compatible to KAME ones. 
They are required if you are going to use IPsec tools ported diff -u linux-work/net/netsyms.c-XFRM linux-work/net/netsyms.c --- linux-work/net/netsyms.c-XFRM 2003-07-28 23:12:33.000000000 +0200 +++ linux-work/net/netsyms.c 2003-08-03 23:16:23.000000000 +0200 @@ -56,7 +56,6 @@ #include #include #include -#include #if defined(CONFIG_INET_AH) || defined(CONFIG_INET_AH_MODULE) || defined(CONFIG_INET6_AH) || defined(CONFIG_INET6_AH_MODULE) #include #endif @@ -294,78 +293,6 @@ /* needed for ip_gre -cw */ EXPORT_SYMBOL(ip_statistics); -EXPORT_SYMBOL(xfrm_user_policy); -EXPORT_SYMBOL(km_waitq); -EXPORT_SYMBOL(km_new_mapping); -EXPORT_SYMBOL(xfrm_cfg_sem); -EXPORT_SYMBOL(xfrm_policy_alloc); -EXPORT_SYMBOL(__xfrm_policy_destroy); -EXPORT_SYMBOL(xfrm_lookup); -EXPORT_SYMBOL(__xfrm_policy_check); -EXPORT_SYMBOL(__xfrm_route_forward); -EXPORT_SYMBOL(xfrm_state_alloc); -EXPORT_SYMBOL(__xfrm_state_destroy); -EXPORT_SYMBOL(xfrm_state_find); -EXPORT_SYMBOL(xfrm_state_insert); -EXPORT_SYMBOL(xfrm_state_add); -EXPORT_SYMBOL(xfrm_state_update); -EXPORT_SYMBOL(xfrm_state_check_expire); -EXPORT_SYMBOL(xfrm_state_check_space); -EXPORT_SYMBOL(xfrm_state_lookup); -EXPORT_SYMBOL(xfrm_state_register_afinfo); -EXPORT_SYMBOL(xfrm_state_unregister_afinfo); -EXPORT_SYMBOL(xfrm_state_get_afinfo); -EXPORT_SYMBOL(xfrm_state_put_afinfo); -EXPORT_SYMBOL(xfrm_state_delete_tunnel); -EXPORT_SYMBOL(xfrm_replay_check); -EXPORT_SYMBOL(xfrm_replay_advance); -EXPORT_SYMBOL(xfrm_check_selectors); -EXPORT_SYMBOL(xfrm_check_output); -EXPORT_SYMBOL(__secpath_destroy); -EXPORT_SYMBOL(xfrm_get_acqseq); -EXPORT_SYMBOL(xfrm_parse_spi); -EXPORT_SYMBOL(xfrm4_rcv); -EXPORT_SYMBOL(xfrm4_tunnel_register); -EXPORT_SYMBOL(xfrm4_tunnel_deregister); -EXPORT_SYMBOL(xfrm4_tunnel_check_size); -EXPORT_SYMBOL(xfrm_register_type); -EXPORT_SYMBOL(xfrm_unregister_type); -EXPORT_SYMBOL(xfrm_get_type); -EXPORT_SYMBOL(inet_peer_idlock); -EXPORT_SYMBOL(xfrm_register_km); -EXPORT_SYMBOL(xfrm_unregister_km); 
-EXPORT_SYMBOL(xfrm_state_delete); -EXPORT_SYMBOL(xfrm_state_walk); -EXPORT_SYMBOL(xfrm_find_acq_byseq); -EXPORT_SYMBOL(xfrm_find_acq); -EXPORT_SYMBOL(xfrm_alloc_spi); -EXPORT_SYMBOL(xfrm_state_flush); -EXPORT_SYMBOL(xfrm_policy_kill); -EXPORT_SYMBOL(xfrm_policy_bysel); -EXPORT_SYMBOL(xfrm_policy_insert); -EXPORT_SYMBOL(xfrm_policy_walk); -EXPORT_SYMBOL(xfrm_policy_flush); -EXPORT_SYMBOL(xfrm_policy_byid); -EXPORT_SYMBOL(xfrm_policy_list); -EXPORT_SYMBOL(xfrm_dst_lookup); -EXPORT_SYMBOL(xfrm_policy_register_afinfo); -EXPORT_SYMBOL(xfrm_policy_unregister_afinfo); -EXPORT_SYMBOL(xfrm_policy_get_afinfo); -EXPORT_SYMBOL(xfrm_policy_put_afinfo); - -EXPORT_SYMBOL_GPL(xfrm_probe_algs); -EXPORT_SYMBOL_GPL(xfrm_count_auth_supported); -EXPORT_SYMBOL_GPL(xfrm_count_enc_supported); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byname); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byname); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byname); -EXPORT_SYMBOL_GPL(skb_icv_walk); #if defined(CONFIG_INET_ESP) || defined(CONFIG_INET_ESP_MODULE) || defined(CONFIG_INET6_ESP) || defined(CONFIG_INET6_ESP_MODULE) EXPORT_SYMBOL_GPL(skb_cow_data); EXPORT_SYMBOL_GPL(pskb_put); From nf@hipac.org Mon Aug 4 06:18:46 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 06:19:00 -0700 (PDT) Received: from indyio.rz.uni-saarland.de (indyio.rz.uni-saarland.de [134.96.7.3]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74DIhFl019267 for ; Mon, 4 Aug 2003 06:18:45 -0700 Received: from mars.rz.uni-saarland.de (mars.rz.uni-saarland.de [134.96.7.4]) by indyio.rz.uni-saarland.de (8.12.9/8.12.5) with ESMTP id h74DIZqk6640013; Mon, 4 Aug 2003 15:18:35 +0200 (CEST) Received: from e002.stw.stud.uni-saarland.de (e002.stw.stud.uni-saarland.de [134.96.65.17]) by mars.rz.uni-saarland.de 
(8.9.3p2/8.8.4/8.8.2) with ESMTP id PAA26020101; Mon, 4 Aug 2003 15:18:34 +0200 (CEST) Received: from e226.stw.stud.uni-saarland.de ([134.96.65.241] helo=hipac.org) by e002.stw.stud.uni-saarland.de with esmtp (Exim 3.35 #1 (Debian)) id 19jfEQ-0003Qv-00; Mon, 04 Aug 2003 15:18:34 +0200 Message-ID: <3F2E5CD6.4030500@hipac.org> Date: Mon, 04 Aug 2003 15:17:10 +0200 From: Michael Bellion and Thomas Heinz User-Agent: Mozilla/5.0 (X11; U; Linux i686; de-AT; rv:1.4) Gecko/20030714 Debian/1.4-2 X-Accept-Language: de, en MIME-Version: 1.0 To: hadi@cyberus.ca CC: linux-net@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [RFC] High Performance Packet Classifiction for tc framework References: <200307141045.40999.nf@hipac.org> <1058328537.1797.24.camel@jzny.localdomain> <3F16A0E5.1080007@hipac.org> <1059934468.1103.41.camel@jzny.localdomain> In-Reply-To: <1059934468.1103.41.camel@jzny.localdomain> X-Enigmail-Version: 0.76.2.0 X-Enigmail-Supports: pgp-inline, pgp-mime Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig1F772C011F16724D016A230F" X-archive-position: 4515 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nf@hipac.org Precedence: bulk X-list: netdev This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig1F772C011F16724D016A230F Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Hi Jamal You wrote: > Apologies for late response. Its funny how i thought i was going to have > more time in the last 2 weeks but due to bad scheduling that wasnt the > case. No problemo ... > I think i will have to look at your code to make comments. Curious about it. > Not entirely accurate. Depends which tc classifier. u32 hash tables are > infact like iptables chains. Hm, we don't think so. 
Unfortunately, there does not seem to be much information about the
inner workings of u32 and we currently don't have the time to deduce
the whole algorithm from the code.

Here is a short overview of our current view on u32:
  - each u32 filter "rule" consists of possibly several u32 matches,
    i.e. tc_u32_sel with nkeys tc_u32_key's
    => one rule is basically represented as a struct tc_u_knode
  - a set of u32 filter rules with same priority is in general a tree
    of hashes like for example:

      hash1:                 |--|--|
                            /       \
      hash2:      |--|--|--|        hash3: |--|--|--|--|
                   |  |  |                  |  |  |  |
                   r1 r2 r3                 r4 r5 r6 r7

    where the r_i are in fact lists of rules (-> hashing with chaining)
    => if there is more than one single filter with same prio there
       is always a tree of hashes (with possibly only 1 node (=hash))
  - within such a tree of u32 filters (with same prio) there is no
    concept of prioritizing them any further
    => the rules must be conflict free
  - there is no way of optimizing filters with different priorities
    since one cannot assume that the intermediate classifiers are
    all of the same type

If this is not the way it _really_ works we'd appreciate it if you could
describe the general principles behind u32.

> Note, the concept of priorities which is used for conflict resolution as
> well as further separating sets of rules doesnt exist in iptables.

Well, iptables rule position and tc filter priorities are just the
same apart from the fact that iptables does not allow multiple rules
to have the same priority (read: position). Therefore iptables rulesets
don't suffer from conflicts.

> You can also have them use different priorities and with the continue
> operator first clasify based on packet data then on metadata or on
> another packet header filter.

Ok but then you fall back to the linear approach. Since with u32 only
blocks of rules with same prio can be optimized one has to implement
a ruleset using as few different prioritized blocks of filters as
possible to achieve maximum performance.
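The "tree of hashes with chaining" described above can be illustrated with a small user-space sketch. This is not the kernel's u32 code: the types (struct demo_rule, struct demo_bucket, struct demo_table), the function names and the bucket function are all made up for illustration, and the real classifier selects and folds key bits in a more general way. The sketch only shows the shape of the lookup: descend through linked hash tables, then scan the chain in the final bucket linearly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule: matches if (key & mask) == value. */
struct demo_rule {
	uint32_t mask;
	uint32_t value;
	int classid;             /* result returned on a match */
	struct demo_rule *next;  /* chaining within one bucket */
};

struct demo_table;

/* A bucket either links to a child table or holds a chain of rules. */
struct demo_bucket {
	struct demo_table *child;
	struct demo_rule *rules;
};

struct demo_table {
	unsigned int shift;      /* which key bits select the bucket */
	unsigned int divisor;    /* bucket count minus one, used as a bit mask */
	struct demo_bucket *buckets;
};

/* Illustrative bucket function; NOT the kernel's u32_hash_fold(). */
static unsigned int demo_bucket_index(const struct demo_table *t, uint32_t key)
{
	return (key >> t->shift) & t->divisor;
}

/*
 * Walk down the tree, then scan the chain of the final bucket in order.
 * Returns the classid of the first matching rule, or -1 if none matches.
 */
static int demo_lookup(const struct demo_table *root, uint32_t key)
{
	const struct demo_table *t = root;

	while (t != NULL) {
		const struct demo_bucket *b;
		const struct demo_rule *r;

		assert(t->buckets != NULL);
		b = &t->buckets[demo_bucket_index(t, key)];
		if (b->child != NULL) {
			t = b->child;    /* descend to the next hash */
			continue;
		}
		for (r = b->rules; r != NULL; r = r->next)
			if ((key & r->mask) == r->value)
				return r->classid;
		return -1;
	}
	return -1;
}
```

Note that inside one chain the first match wins, so the result depends on insertion order unless the rules in a bucket are disjoint. That is the "conflict free" requirement mentioned in the overview.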
>>One disadvantage of this concept is that the hashed filters
>>must be compact, i.e. there cannot be other classifiers in between.
>
> I didnt understand this. Are you talking about conflict resolving of
> overlapping filters?

No, the issue is just that within a block of filters with same prio
there cannot be another type of filter, e.g. one cannot put a route
classifier inside a hash of u32 classifiers.

>>Another major disadvantage is caused by the hashing scheme.
>>If you use the hash for 1 dimension you have to make sure that
>>either all filters in a certain bucket are disjoint or you must have
>>an implicit ordering of the rules (according to the insertion order
>>or something). This scheme is not extendable to 2 or more dimensions,
>>i.e. 1 hash for src ip, #(src ip buckets) many dst ip hashes and so
>>on, because you simply cannot express arbitrary rulesets.
>
> If i understood you - you are refering to a way to reduce the number of
> lookups by having disjoint hashes. I suppose for something as simple as
> a five tuple lookup, this is almost solvable by hardcoding the different
> fields into multiway hashes. Its when you try to generalize that it
> becomes an issue.

What do you mean exactly by "five tuple"? Do you refer to rules which
consist of 5 punctiform matches, i.e. no masks or ranges or wildcards
allowed, like
(src ip 1.2.3.4, dst ip 3.4.5.6, proto tcp, src port 123, dst port 456)?

Of course the scheme works for such cases (which consist of
non-conflicting rules) although the user must be aware of the concrete
hash function: divisor & u32_hash_fold(key, sel) because the mask would
be 0xffffffff for the ip's.

If ranges or overlapping masks are involved it gets really complicated
and we doubt that people are able to manage such scenarios.

>>Another general problem is of course that the user has to manually
>>setup the hash which is rather inconvenient.
>
> Yes. Take a look at Werners tcng - he has a clever way to hide things
> from the user. I did experimentation on u32 with a kernel thread which
> rearranged things when they seemed out of balance but i havent
> experimented with a lot of rules.

We had a look at the tcng paper. Here it says that the u32 classifier
is not used in the optimal way. Since we didn't have a look at the
current tcng release it might well be that these problems are already
addressed. Is that the case?

BTW, why do you want to rearrange the tree of hashes and based on
which heuristics? Why is a kernel thread needed? Isn't it
possible to arrange the tree directly after insert/delete operations?

>>Now, what are the implications on the matching performance:
>>tc vs. nf-hipac? As long as the extended hashing stuff is not used
>>nf-hipac is clearly superior to tc.
>
> You are refering to u32. You mean as long as u32 stored things in a
> single linked list, you win - correct?

Yes, but this is not only true for u32. As long as the ruleset
looks like: "n filters with n different priorities which can be
translated into n nf-hipac rules" nf-hipac is clearly faster because
in this case tc uses the linear approach.

>>When hashing is used it _really_
>>depends. If there is only one classifier (with hashing) per interface
>>and the number of rules per bucket is very small the performance should
>>be comparable. As soon as you add other classifiers nf-hipac will
>>outperform tc again.
>
> If we take a simple user interface abstraction like tcng which hides the
> evil of u32 and then take simple 5 tuple rules - i doubt you will see
> any difference. For more generic setup, the kernel thread i refer to
> would work - but may slow insertion.

For the simple punctiform examples as described above you may be
right that nf-hipac and tc should perform similarly but it's not clear
to us how you want to achieve universality (including mask, ranges and
wildcards) by this kernel thread rearranging approach.
Basically you have to address the following problem:
Given an arbitrary set of u32 rules with different priorities
you have to compute a semantically equivalent representation
with a tree of hashes.

>>So, basically HIPAC is just a normal classifier like any other
>>with two exceptions:
>>   a) it can occur only once per interface
>>   b) the rules within the classifier can contain other classifiers,
>>      e.g. u32, fw, tc_index, as matches
>
> But why restriction a)?

Well, the restriction is necessary because of the new hipac design in
which nf-hipac (i.e. firewalling), routing and cls_hipac (i.e. tc)
are just applications for the classification framework. The basic idea
is to reduce the number of classifications on the forwarding path to a
single one (in the best case). In order to truly understand the
requirement it would be necessary to explain the idea behind the new
stage concept which is beyond the scope of this e-mail :-/.

> Also why should we need hipac to hold other filters when the
> infrastructure itself can hold the extended filters just fine?
> I think you may actually be trying to say why somewhere in the email,
> but it must not be making a significant impression on my brain.

The idea is to reduce the embedded classifiers to a match, i.e.
their return value is ignored. This offers the possibility of
expressing a conjunction of native matches and classifiers in the
very same way nf-hipac rules support iptables matches. This enhances
the expressiveness of classification rules.

A rule |nat. match 1|...|nat. match n|emb. cls 1|...|emb. cls m|
matches if nat. match 1-n and emb. cls 1-m match.

>>There is just one problem with the current tc framework. Once
>>a new filter is inserted into the chain it is not removed even
>>if the change function of the classifier returns < 0
>>(2.6.0-test1: net/sched/cls_api.c: line 280f).
>>This should be changed anyway, shouldn't it?
>
> Are you refering to this piece of code?:
> ----
>         err = tp->ops->change(tp, cl, t->tcm_handle, tca, &fh);
>         if (err == 0)
>                 tfilter_notify(skb, n, tp, fh, RTM_NEWTFILTER);
>
> errout:
>         if (cl)
>                 cops->put(q, cl);
>         return err;
> ---

Yes.

> change() should not return <0 if it has installed the filter i think.
> Should the top level code be responsible for removing filters?

The top level code (cls_hipac.c:tc_ctl_filter) is responsible for creating
new tcf_proto structs (if not existent) and enqueuing the struct into
the chain. Therefore it is also responsible for taking the stuff out of
the chain again if necessary.

In case we have just created a new tcf_proto and change fails
it would be better if the new tcf_proto is removed afterwards, i.e.

        write_lock(&qdisc_tree_lock);
        spin_lock_bh(&dev->queue_lock);
        *back = tp->next;
        spin_unlock_bh(&dev->queue_lock);
        write_unlock(&qdisc_tree_lock);
        tp->ops->destroy(tp);
        module_put(tp->ops->owner);
        kfree(tp);

is issued. Do you agree?

> Consider what i said above. I'll try n cobble together some examples to
> demonstrate (although it seems you already know this).
> To allow for anyone to install classifiers-du-jour without being
> dependet on hipac would be very useful. So ideas that you have for
> enabling this cleanly should be moved to cls_api.

Nobody will be forced to use hipac :-). It's just another classifier
like u32. We didn't even have to modify cls_api so far. Everything
integrates just fine.
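The cleanup proposed above (unlinking a freshly created tcf_proto again when change() fails) can be sketched in user space. This is only an illustration under simplifying assumptions: struct demo_proto, demo_add_filter() and demo_change() are made-up stand-ins rather than kernel APIs, there is no locking, and free() takes the place of the destroy()/module_put()/kfree() sequence.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for struct tcf_proto: one node per priority in a sorted chain. */
struct demo_proto {
	int prio;
	int nfilters;
	struct demo_proto *next;
};

/* Stand-in for tp->ops->change(): returns 0 on success, < 0 on error. */
typedef int (*demo_change_fn)(struct demo_proto *tp, int arg);

static int demo_add_filter(struct demo_proto **chain, int prio,
			   demo_change_fn change, int arg)
{
	struct demo_proto **back, *tp;
	int created = 0;
	int err;

	assert(chain != NULL && change != NULL);

	/* Find the insertion point; the chain is kept sorted by prio. */
	for (back = chain; (tp = *back) != NULL; back = &tp->next)
		if (tp->prio >= prio)
			break;

	if (tp == NULL || tp->prio != prio) {
		/* No node for this priority yet: create and enqueue one. */
		tp = calloc(1, sizeof(*tp));
		if (tp == NULL)
			return -1;
		tp->prio = prio;
		tp->next = *back;
		*back = tp;
		created = 1;
	}

	err = change(tp, arg);
	if (err < 0 && created) {
		/* Undo the enqueue so a failed change leaves no empty node. */
		*back = tp->next;
		free(tp);
	}
	return err;
}

/* Example callback: rejects negative arguments, otherwise adds a filter. */
static int demo_change(struct demo_proto *tp, int arg)
{
	if (arg < 0)
		return -22; /* illustrative error code */
	tp->nfilters++;
	return 0;
}
```

The created flag matters: a node that already existed before the call keeps its place in the chain even if the change fails, since it may still hold other filters.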
Regards, +-----------------------+----------------------+ | Michael Bellion | Thomas Heinz | | | | +-----------------------+----------------------+ | High Performance Packet Classification | | nf-hipac: http://www.hipac.org/ | +----------------------------------------------+ --------------enig1F772C011F16724D016A230F Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) Comment: Using GnuPG with Debian - http://enigmail.mozdev.org iD8DBQE/LlzdtXh2AYIMjggRAtcvAKCUZykozfMnI5MmRMo0j/zH6TDg7gCdGl20 ngF9kmhPF45vfAYjTq6sd/U= =qy5Z -----END PGP SIGNATURE----- --------------enig1F772C011F16724D016A230F-- From niv@us.ibm.com Mon Aug 4 08:51:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 08:51:07 -0700 (PDT) Received: from e33.co.us.ibm.com (e33.co.us.ibm.com [32.97.110.131]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74FotFl021660 for ; Mon, 4 Aug 2003 08:51:02 -0700 Received: from westrelay02.boulder.ibm.com (westrelay02.boulder.ibm.com [9.17.195.11]) by e33.co.us.ibm.com (8.12.9/8.12.2) with ESMTP id h74FoHj3303868; Mon, 4 Aug 2003 11:50:17 -0400 Received: from us.ibm.com (d03av02.boulder.ibm.com [9.17.193.82]) by westrelay02.boulder.ibm.com (8.12.9/NCO/VER6.5) with ESMTP id h74FoGiQ067578; Mon, 4 Aug 2003 09:50:17 -0600 Message-ID: <3F2E80CD.3090206@us.ibm.com> Date: Mon, 04 Aug 2003 08:50:37 -0700 From: Nivedita Singhvi User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.2.1) Gecko/20021130 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Andi Kleen CC: netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional References: <20030804125022.GA8167@averell> In-Reply-To: <20030804125022.GA8167@averell> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4516 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: niv@us.ibm.com Precedence: bulk X-list: netdev 
Andi Kleen wrote: > Only compile in the xfrm subsystem when it's needed by any config options. > > This avoids some code/data structure bloat in case you don't use IP > tunneling or IPsec. Yes, I would like this too, please. thanks, Nivedita From hadi@cyberus.ca Mon Aug 4 08:51:44 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 08:51:53 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74FpgFl021732 for ; Mon, 4 Aug 2003 08:51:43 -0700 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19jhcb-000EBD-00; Mon, 04 Aug 2003 11:51:41 -0400 Subject: Re: [RFC] High Performance Packet Classifiction for tc framework From: jamal Reply-To: hadi@cyberus.ca To: Michael Bellion and Thomas Heinz Cc: linux-net@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: <3F2E5CD6.4030500@hipac.org> References: <200307141045.40999.nf@hipac.org> <1058328537.1797.24.camel@jzny.localdomain> <3F16A0E5.1080007@hipac.org> <1059934468.1103.41.camel@jzny.localdomain> <3F2E5CD6.4030500@hipac.org> Content-Type: text/plain Organization: jamalopolis Message-Id: <1060012260.1103.380.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 04 Aug 2003 11:51:01 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4517 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Olla, On Mon, 2003-08-04 at 09:17, Michael Bellion and Thomas Heinz wrote: > > I think i will have to look at your code to make comments. > > Curious about it. > I promise i will. I dont think i will do it justice spending 5 minutes on it. I take it you have written extensive docs too ;-> > > Not entirely accurate. Depends which tc classifier. u32 hash tables are > > infact like iptables chains. 
>
> Hm, we don't think so. Unfortunately, there does not seem to be much
> information about the inner workings of u32 and we currently don't have
> the time to deduce the whole algorithm from the code.
>

Unfortunately it is more exciting to write code than documents.
I almost got someone to document at least its proper usage but they
backed away at the last minute.

> Here is a short overview of our current view on u32:
>   - each u32 filter "rule" consists of possibly several u32 matches,
>     i.e. tc_u32_sel with nkeys tc_u32_key's
>     => one rule is basically represented as a struct tc_u_knode
>   - a set of u32 filter rules with same priority is in general a
>     tree of hashes like for example:
>       hash1:              |--|--|
>                           /     \
>       hash2:   |--|--|--|        hash3:  |--|--|--|--|
>                 |  |  |                   |  |  |  |
>                 r1 r2 r3                  r4 r5 r6 r7
>     where the r_i are in fact lists of rules (-> hashing with
>     chaining)
>     => if there is more than one single filter with same prio
>        there is always a tree of hashes (with possibly only 1 node
>        (=hash))
>   - within such a tree of u32 filters (with same prio) there is
>     no concept of prioritizing them any further => the rules must
>     be conflict free
>   - there is no way of optimizing filters with different priorities
>     since one cannot assume that the intermediate classifiers are all
>     of the same type
>
> If this is not the way it _really_ works we'd appreciate it if you could
> describe the general principles behind u32.
>

u32 is a swiss knife, so to go into general principles requires some
time, motivation, and more importantly patience. I possess none of these
nice attributes at the moment. You are doing a good job; keep reading
the code.
I dont wanna go into a lot of details, but one important detail is that
keynodes can also lead to other hash tables. So you can split the packet
parsing across multiple hashes - this is where the comparison with
chains comes in. There are several ways to do this.
I'll show you the brute force way and you can make it more usable with
the "hashkey" and "sample" operators.
Stealing from your example:

      hash1:     |--|--|
                  /
      hash2:  |--|--|--|
               |   |   |
               r1  r2  r3
               |   |
            hash3 hash4
               |   |
               r4  r5

Example, you go into hash2 for all IP packets. The rules on hash2
look at the protocol type and select a different hash table for TCP,
UDP, ICMP etc. - so the general rule is: Put your most hit rules at the
highest priority so they are found first.

Heres an example, i havent tested this (i can send you a tested example
if you cant get this to work):

-------
TCF="tc filter add dev eth0 parent ffff: protocol ip prio 10"
# add hash table 1
$TCF handle 1::: u32 divisor 1
# add hash table 2
$TCF handle 2::: u32 divisor 1
# add your filter rules to specific tables: ICMP to table 1, TCP to table
# 6 etc
.
.
# ICMP gets matched in table 1
$TCF u32 match ip protocol 1 0xff link 1:0:0
.
.
----------

Makes sense? Note, this doesnt say much about the user usability of
u32 - it just says it can be done.

> > Note, the concept of priorities which is used for conflict resolution as
> > well as further separating sets of rules doesnt exist in iptables.
>
> Well, iptables rule position and tc filter priorities are just the
> same apart from the fact that iptables does not allow multiple rules
> to have the same priority (read: position). Therefore iptables rulesets
> don't suffer from conflicts.
>

sure position could be used as a priority. It is easier/intuitive
to just have explicit priorities.

> > You can also have them use different priorities and with the continue
> > operator first clasify based on packet data then on metadata or on
> > another packet header filter.
>
> Ok but then you fall back to the linear approach. Since with u32 only
> blocks of rules with same prio can be optimized one has to implement a
> ruleset using as few different prioritized blocks of filters as possible
> to achieve maximum performance.
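
The cascaded tables from the tree and tc script earlier in this message
can be modeled roughly as follows. This is only an illustration of the
idea, not how cls_u32 is implemented: struct rule, struct table,
struct root and classify() are made-up names, the first level is a plain
array indexed by IP protocol, and the second level hashes the
destination port with a simple modulo instead of u32's own masked
key folding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One rule in a bucket chain (hashing with chaining). */
struct rule {
    uint16_t dst_port;
    int class_id;
    const struct rule *next;
};

/* Second-level table: 'divisor' buckets, each a list of rules. */
struct table {
    unsigned divisor;
    const struct rule *const *buckets;
};

/* First-level table: dispatches on the IP protocol number only. */
struct root {
    const struct table *by_proto[256];
};

/*
 * Two-step lookup: pick the per-protocol table, then walk just one
 * bucket of it. Returns the class id, or -1 if nothing matches.
 */
static int classify(const struct root *root, uint8_t proto,
                    uint16_t dst_port)
{
    const struct table *t;
    const struct rule *r;

    assert(root != NULL);

    t = root->by_proto[proto];
    if (t == NULL || t->divisor == 0)
        return -1;

    for (r = t->buckets[dst_port % t->divisor]; r != NULL; r = r->next)
        if (r->dst_port == dst_port)
            return r->class_id;
    return -1;
}
```

A packet for an unlinked protocol never touches the TCP rules at all,
which is the sense in which a linked hash table behaves like a chain.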
>

Read what i said above; if you still hold the same opinion, lets discuss.
What "optimizes" could be a user interface or the thread i was talking
about earlier.

> >>One disadvantage of this concept is that the hashed filters
> >>must be compact, i.e. there cannot be other classifiers in between.
> >
> > I didnt understand this. Are you talking about conflict resolving of
> > overlapping filters?
>
> No, the issue is just that within a block of filters with same prio
> there cannot be another type of filter, e.g. one cannot put a route
> classifier inside a hash of u32 classifiers.
>

But you dont need to, as i was pointing out earlier. You can have
fwmark, tcindex, u32, rsvp etc all being invoked one after the other.

> >>Another major disadvantage is caused by the hashing scheme.
> >>If you use the hash for 1 dimension you have to make sure that
> >>either all filters in a certain bucket are disjoint or you must have
> >>an implicit ordering of the rules (according to the insertion order
> >>or something). This scheme is not extendable to 2 or more dimensions,
> >>i.e. 1 hash for src ip, #(src ip buckets) many dst ip hashes and so
> >>on, because you simply cannot express arbitrary rulesets.
> >
> > If i understood you - you are referring to a way to reduce the number of
> > lookups by having disjoint hashes. I suppose for something as simple as
> > a five tuple lookup, this is almost solvable by hardcoding the different
> > fields into multiway hashes. Its when you try to generalize that it
> > becomes an issue.
>
> What do you mean exactly by "five tuple"? Do you refer to rules which
> consist of 5 punctiform matches, i.e. no masks or ranges or wildcards
> allowed, like (src ip 1.2.3.4, dst ip 3.4.5.6, proto tcp, src port 123,
> dst port 456)?
>

The above, but with masks. "5 tuple" is a classical name for the above.
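
To make "5 tuple with masks" concrete, here is a small sketch. The
struct layout and the names five_tuple, five_tuple_rule, ip4() and
rule_matches() are assumptions made for this example only; nothing here
is taken from u32 or nf-hipac. Each field carries a value and a mask,
and a packet matches when it agrees with the value on every masked bit.
A mask of all ones gives the punctiform match from the quoted question,
a mask of zero is a wildcard.

```c
#include <assert.h>
#include <stdint.h>

/* The classic five fields; addresses are IPv4 in host byte order. */
struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;               /* IP protocol number, e.g. 6 = TCP */
};

/* A rule is a value plus a per-field mask. */
struct five_tuple_rule {
    struct five_tuple val;
    struct five_tuple mask;
};

/* Build a host-order IPv4 address from dotted-quad parts. */
static uint32_t ip4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
}

/* Nonzero if packet p agrees with rule r on all masked bits. */
static int rule_matches(const struct five_tuple_rule *r,
                        const struct five_tuple *p)
{
    assert(r && p);
    return ((p->src_ip   ^ r->val.src_ip)   & r->mask.src_ip)   == 0 &&
           ((p->dst_ip   ^ r->val.dst_ip)   & r->mask.dst_ip)   == 0 &&
           ((p->src_port ^ r->val.src_port) & r->mask.src_port) == 0 &&
           ((p->dst_port ^ r->val.dst_port) & r->mask.dst_port) == 0 &&
           ((p->proto    ^ r->val.proto)    & r->mask.proto)    == 0;
}
```

For instance, a rule with src ip 1.2.3.0 masked by 0xffffff00, dst ip
3.4.5.6 fully masked, proto tcp, any source port and dst port 456 covers
the packet from the quoted example (1.2.3.4 -> 3.4.5.6, tcp, 123 -> 456).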
>
> Of course the scheme works for such cases (which consist of
> non-conflicting rules) although the user must be aware of the
> concrete hash function: divisor & u32_hash_fold(key, sel)
> because the mask would be 0xffffffff for the ip's.
>
> If ranges or overlapping masks are involved it gets really complicated
> and we doubt that people are able to manage such scenarios.
>

I was referring to the cascaded hash tables i described earlier.
Depending on the rules, you could optimize differently.

> >>Another general problem is of course that the user has to manually
> >>setup the hash which is rather inconvenient.
> >
> > Yes. Take a look at Werners tcng - he has a clever way to hide things
> > from the user. I did experimentation on u32 with a kernel thread which
> > rearranged things when they seemed out of balance but i havent
> > experimented with a lot of rules.
>
> We had a look at the tcng paper. Here it says that the u32 classifier
> is not used in the optimal way. Since we didn't have a look at the
> current tcng release it might well be that these problems are already
> addressed. Is that the case?
>

He doesnt fix the u32; rather, if you use his wrappers he outputs
optimized u32 rules. All that is hidden from the user.

> BTW, why do you want to rearrange the tree of hashes and based on which
> heuristics? Why is there a kernel thread needed? Isn't it possible to
> arrange the tree directly after insert/delete operations?
>

You can do that, but then you are adding delay to the insertion/deletion
rates, which are very important metrics. Another way to do it is to fire
a netlink message every time a hash table's keynodes exceed a threshold
value and have user space compute a rearrangement. Essentially you have
to weigh your tradeoffs.

> >>Now, what are the implications on the matching performance:
> >>tc vs. nf-hipac? As long as the extended hashing stuff is not used
> >>nf-hipac is clearly superior to tc.
> >
> > You are referring to u32.
> > You mean as long as u32 stored things in a
> > single linked list, you win - correct?
>
> Yes, but this is not only true for u32. As long as the ruleset
> looks like: "n filters with n different priorities which can
> be translated into n nf-hipac rules" nf-hipac is clearly faster
> because in this case tc uses the linear approach.
>

If you still hold this opinion after my explanation on cascaded hash
tables, then lets discuss again.

> >>When hashing is used it _really_
> >>depends. If there is only one classifier (with hashing) per interface
> >>and the number of rules per bucket is very small the performance should
> >>be comparable. As soon as you add other classifiers nf-hipac will
> >>outperform tc again.
> >
> > If we take a simple user interface abstraction like tcng which hides the
> > evil of u32 and then take simple 5 tuple rules - i doubt you will see
> > any difference. For more generic setup, the kernel thread i refer to
> > would work - but may slow insertion.
>
> For the simple punctiform examples like described above you may be right
> that nf-hipac and tc should perform similar but it's not clear to us
> how you want to achieve universality (including mask, ranges and
> wildcards) by this kernel thread rearranging approach. Basically you
> have to address the following problem: Given an arbitrary set of u32
> rules with different priorities you have to compute a semantically
> equivalent representation with a tree of hashes.
>

yes - that is the challenge to resolve ;->

> >>So, basically HIPAC is just a normal classifier like any other
> >>with two exceptions:
> >>   a) it can occur only once per interface
> >>   b) the rules within the classifier can contain other classifiers,
> >>      e.g. u32, fw, tc_index, as matches
> >
> > But why restriction a)?
>
> Well, the restriction is necessary because of the new hipac design in
> which nf-hipac (i.e. firewalling), routing and cls_hipac (i.e. tc) are
> just applications for the classification framework.
> The basic idea is
> to reduce the number of classifications on the forwarding path to a
> single one (in the best case). In order to truly understand the
> requirement it would be necessary to explain the idea behind the new
> stage concept which is beyond the scope of this e-mail :-/.
>

Ok - maybe when you explain the concept later i will get it. Is your
plan to put this in other places other than Linux?

> > Also why should we need hipac to hold other filters when the
> > infrastructure itself can hold the extended filters just fine?
> > I think you may actually be trying to say why somewhere in the email,
> > but it must not be making a significant impression on my brain.
>
> The idea is to reduce the embedded classifiers to a match, i.e.
> their return value is ignored. This offers the possibility of
> expressing a conjunction of native matches and classifiers in the
> very same way nf-hipac rules support iptables matches. This enhances
> the expressiveness of classification rules.
> A rule |nat. match 1|...|nat. match n|emb. cls 1|...|emb. cls m|
> matches if nat. match 1-n and emb. cls 1-m match.
>

So you got this thought from iptables and took it to the next level?
I am still not sure i understand why not use what already exists - but
i'll just say i dont see it right now.

>
> The top level code (cls_hipac.c:tc_ctl_filter) is responsible for
> creating new tcf_proto structs (if not existent) and enqueuing the
> struct into the chain. Therefore it is also responsible for taking
> the stuff out of the chain again if necessary. In case we have just
> created a new tcf_proto and change fails it would be better if the new
> tcf_proto is removed afterwards, i.e.
>         write_lock(&qdisc_tree_lock);
>         spin_lock_bh(&dev->queue_lock);
>         *back = tp->next;
>         spin_unlock_bh(&dev->queue_lock);
>         write_unlock(&qdisc_tree_lock);
>         tp->ops->destroy(tp);
>         module_put(tp->ops->owner);
>         kfree(tp);
> is issued.
> Do you agree?
>

It doesnt appear harmful to leave it there without destroying it.
The next time someone adds a filter of the same protocol + priority, it
will already exist. If you want to be accurate (because it does get
destroyed when the init() fails), then destroy it, but you need to put
in checks for "in case we have added a new tcf_proto" which may not look
pretty. Is this causing you some discomfort?

> > Consider what i said above. I'll try n cobble together some examples to
> > demonstrate (although it seems you already know this).
> > To allow for anyone to install classifiers-du-jour without being
> > dependent on hipac would be very useful. So ideas that you have for
> > enabling this cleanly should be moved to cls_api.
>
> Nobody will be forced to use hipac :-). It's just another classifier
> like u32. We didn't even have to modify cls_api so far. Everything
> integrates just fine.
>

cool. Keep up the good work.

cheers,
jamal

From mathis@psc.edu Mon Aug 4 09:21:04 2003
Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 09:21:12 -0700 (PDT)
Received: from zippy.psc.edu (pa-monroeville3a-31.pit.adelphia.net [24.53.185.31]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74GL2Fl023294 for ; Mon, 4 Aug 2003 09:21:03 -0700
Received: from localhost (mathis@localhost) by zippy.psc.edu (8.11.6/8.11.6) with ESMTP id h74GKlB27764; Mon, 4 Aug 2003 12:20:47 -0400
X-Authentication-Warning: zippy.psc.edu: mathis owned process doing -bs
Date: Mon, 4 Aug 2003 12:20:47 -0400 (EDT)
From: Matt Mathis
To: "David S. Miller"
cc: netdev@oss.sgi.com, John Heffner
Subject: Web100
In-Reply-To: <20030803222554.7027a160.davem@redhat.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-archive-position: 4518
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: mathis@psc.edu
Precedence: bulk
X-list: netdev

On Sun, 3 Aug 2003, David S.
Miller wrote:

> The web100 patches aren't in the kernel because 1) they've
> never even been submitted and 2) they need a large cleanup.

Furthermore 1 is due to 2....

We know our code is not ready for kernel inclusion, and are having a
little trouble seeing the path through to doing so ourselves. A big part
of the problem is that I am not a kernel guy - my focus is on the
protocol and measurement issues and not on the implementation details.
Although John could probably get it together by himself, he is split
between several projects and it isn't clear that incrementally
submitting substandard patches is a cost effective strategy for getting
it done.

It would be a lot easier if we had 1) a mentor who was experienced at
kernel inclusion, 2) specific guidance on some of the non-network
components, such as the API (currently using /proc) and 3) a laundry
list of things that we need to fix.

> I sort of get the impression that the web100 folks actually like that
> their changes are not in the main sources, it keeps their work
> "special".

Nope, not at all. Actually I find kernel inclusion rather daunting.

One of our collaborators was asked some very pointed questions about the
TCP ESTATS MIB by somebody at M$. I would hate to have the first general
release be in anything but Linux.

Any takers on helping us?
Thanks, --MM-- ------------------------------------------- Matt Mathis http://www.psc.edu/~mathis Work:412.268.3319 Home/Cell:412.654.7529 ------------------------------------------- From hadi@cyberus.ca Mon Aug 4 09:46:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 09:46:08 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74GjxFl024775 for ; Mon, 4 Aug 2003 09:46:00 -0700 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19jiT8-000JAa-00; Mon, 04 Aug 2003 12:45:59 -0400 Subject: Re: TOE brain dump From: jamal Reply-To: hadi@cyberus.ca To: netdev@oss.sgi.com Cc: "Ihar 'Philips' Filipau" Content-Type: text/plain Organization: jamalopolis Message-Id: <1060015518.1103.399.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 04 Aug 2003 12:45:18 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4519 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Can you please post to netdev? Posting networking related issues to linux kernel alone is considered rude. Posting them to netdev only is acceptable. > Ihar 'Philips' Filipau wrote: > > >Werner Almesberger wrote: > > Ihar 'Philips' Filipau wrote: > > > >| | | Modern NPUs generally do this. > > > > > > Unfortunately, they don't - they run *some* code, but that > > is rarely a Linux kernel, or a substantial part of it. > > > > Embedded CPU we are using is based MIPS, and has a lot of specialized > instructions. > It makes not that much sense to run kernel (especially Linux) on CPU > which is optimized for handling of network packets. (And has actually > several co-processors to help in this task). 
The coprocessors are useful, but that has nothing to do with the value
of the NPU. You can add those within a general processor system.
I am also in the camp that to be really useful these things need to run
a real OS - Linux.

> How much sense it makes to run general purpose OS (optimized for PCs
> and servers) on devices which can make only couple of functions? (and no
> MMU btw)
>
> It is a whole idea behind this kind of CPUs - to do a few of
> functions - but to do them good.
>
> If you will start stretching CPUs like this to fit Linux kernel - it
> will generally just increase price. Probably there are some markets
> which can afford this.
>

Actually i believe it will lower the prices. I am waiting for intel to
get hyperthreading right - then we'll see these things disappear. The
only thing useful about NPUs is their ability to manage the discrepancy
between memory latency and CPU speeds.
Trust me, i used to be in the same camp as you. If you note, a lot of
these things appeared around the height of the .com days. VCs were
looking for something new and exciting.

> Remember - "Small is beautiful" (c) - and linux kernel far from it.
> Our routing code which handles two GE interfaces (actually not pure
> GE, but up to 2.5GB) fits into 3k. 3k of code - and that's it. not 650kb
> of bzip compressed bloat. And it handles two interfaces, handles fast
> data path from sibling interfaces, handles up to 1E6 routes. 3k of code.
> not 650k of bzip.

If all you wanted was to do L3 - why not just buy a $5 chip that
can do this for a lot more interfaces? Why sweat over
optimizing L3 routing in a 3K space?

to nit: Its no longer about routing or bridging, friend. Thats like
getting fries at mcdonalds.
cheers, jamal From alan@storlinksemi.com Mon Aug 4 10:19:39 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 10:19:49 -0700 (PDT) Received: from smtp016.mail.yahoo.com (smtp016.mail.yahoo.com [216.136.174.113]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74HJdFl025593 for ; Mon, 4 Aug 2003 10:19:39 -0700 Received: from adsl-63-203-236-74.dsl.snfc21.pacbell.net (HELO AlanLap) (alansuntzishih@63.203.236.74 with login) by smtp.mail.vip.sc5.yahoo.com with SMTP; 4 Aug 2003 17:19:38 -0000 From: "Alan Shih" To: "Ingo Oeser" , "Jeff Garzik" Cc: "Nivedita Singhvi" , "Werner Almesberger" , , Subject: RE: TOE brain dump Date: Mon, 4 Aug 2003 10:19:21 -0700 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) In-Reply-To: <20030804163606.Q639@nightmaster.csn.tu-chemnitz.de> Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2727.1300 X-archive-position: 4520 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@storlinksemi.com Precedence: bulk X-list: netdev Hmm, So would main processor still need a copy of the data for re-transmission? Won't that defeat the purpose? Alan -----Original Message----- From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org]On Behalf Of Ingo Oeser Sent: Monday, August 04, 2003 7:36 AM To: Jeff Garzik Cc: Nivedita Singhvi; Werner Almesberger; netdev@oss.sgi.com; linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Hi Jeff, On Sat, Aug 02, 2003 at 03:08:52PM -0400, Jeff Garzik wrote: > So, fix the other end of the pipeline too, otherwise this fast network > stuff is flashly but pointless. 
If you want to serve up data from disk, > then start creating PCI cards that have both Serial ATA and ethernet > connectors on them :) Cut out the middleman of the host CPU and host > memory bus instead of offloading portions of TCP that do not need to be > offloaded. Exactly what I suggested: sys_ioroute() "Providing generic pipelines and io routing as Linux service" Msg-ID: <20030718134235.K639@nightmaster.csn.tu-chemnitz.de> on linux-kernel and linux-fsdevel Be my guest. I know, that you mean doing it in hardware, but you cannot accelerate sth. which the kernel doesn't do ;-) Regards Ingo Oeser - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ From inaky.perez-gonzalez@intel.com Mon Aug 4 11:36:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 11:36:32 -0700 (PDT) Received: from caduceus.jf.intel.com (fmr06.intel.com [134.134.136.7]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74IaNFl027540 for ; Mon, 4 Aug 2003 11:36:24 -0700 Received: from talaria.jf.intel.com (talaria.jf.intel.com [10.7.209.7]) by caduceus.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h74IUIQ07776 for ; Mon, 4 Aug 2003 18:30:18 GMT Received: from orsmsxvs041.jf.intel.com (orsmsxvs041.jf.intel.com [192.168.65.54]) by talaria.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h74Hxgl00580 for ; Mon, 4 Aug 2003 17:59:42 GMT Received: from orsmsx332.amr.corp.intel.com ([192.168.65.60]) by orsmsxvs041.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080411361613752 ; Mon, 04 Aug 2003 11:36:16 -0700 Received: from orsmsx409.amr.corp.intel.com ([192.168.65.58]) by orsmsx332.amr.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Mon, 4 Aug 2003 11:36:16 -0700 content-class: 
urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: RE: TOE brain dump Date: Mon, 4 Aug 2003 11:36:15 -0700 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: TOE brain dump Thread-Index: AcNZ/oaUAcgg0owhSWG6t+TBx1MScAAbhUUA From: "Perez-Gonzalez, Inaky" To: "Larry McVoy" , "David Lang" Cc: "Erik Andersen" , "Werner Almesberger" , "Jeff Garzik" , , , "Nivedita Singhvi" X-OriginalArrivalTime: 04 Aug 2003 18:36:16.0508 (UTC) FILETIME=[4A1793C0:01C35AB7] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h74IaNFl027540 X-archive-position: 4521 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: inaky.perez-gonzalez@intel.com Precedence: bulk X-list: netdev > From: Larry McVoy [mailto:lm@bitmover.com] > > > 2. router nodes that have access to main memory (PCI card running linux > > acting as a router/firewall/VPN to offload the main CPU's) > > I can get an entire machine, memory, disk, > Ghz CPU, case, power supply, > cdrom, floppy, onboard enet extra net card for routing, for $250 or less, > quantity 1, shipped to my door. > > Why would I want to spend money on some silly offload card when I can get > the whole PC for less than the card? Because you want to stack 200 of those together in a huge data center interconnecting whatever you want to interconnect and you don't want your maintenance costs to go up to the sky? 
I see your point, though :) Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault) From filia@softhome.net Mon Aug 4 11:47:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 11:47:53 -0700 (PDT) Received: from jive.SoftHome.net (jive.SoftHome.net [66.54.152.27]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74IliFl028072 for ; Mon, 4 Aug 2003 11:47:45 -0700 Received: (qmail 8633 invoked by uid 417); 4 Aug 2003 18:47:44 -0000 Received: from shunt-smtp-out-0 (HELO softhome.net) (172.16.3.12) by shunt-smtp-out-0 with SMTP; 4 Aug 2003 18:47:44 -0000 Received: from softhome.net ([212.18.200.6]) (AUTH: PLAIN filia@softhome.net) by softhome.net with esmtp; Mon, 04 Aug 2003 12:47:42 -0600 Message-ID: <3F2EAA78.60202@softhome.net> Date: Mon, 04 Aug 2003 20:48:24 +0200 From: "Ihar 'Philips' Filipau" Organization: Home Sweet Home User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030701 X-Accept-Language: en-us, en MIME-Version: 1.0 To: hadi@cyberus.ca CC: netdev@oss.sgi.com Subject: Re: TOE brain dump References: <1060015518.1103.399.camel@jzny.localdomain> In-Reply-To: <1060015518.1103.399.camel@jzny.localdomain> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4522 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: filia@softhome.net Precedence: bulk X-list: netdev jamal wrote: > to nit: Its no longer about routing or bridging, friend. Thats like getting > fries at mcdonalds. > 1GE/10GE - for $5? I'm first in the shoping queue!!!-))) Since I see no reasonable out-come of this discussion I left it. TOE as I see - since my company utilizes several of them - are too different and too specialized to application/protocols. And yes - price of development/deployment maters too. Linux support for those protocols is inmature. It cannot handle or requirements even software-wise. 
I'm not talking about timing requirements - linux network in general is not (even soft) real-time. My personal flame-meter is out of scale ;-) I shall join the discussion back when I will see any real ideas. > If all you wanted was to do L3 - why not just buy a $5 chip that > can do this for a lot more interfaces? Why sweat over > optimizing L3 routing in a 3K space? We are doing not a teapot, and high level spec for this code takes around 15 pages. 3k - it is not optimized - we have limit around 2GB ;-) It just takes only 3k. And it handles some special (read - proprietary) functions too - some bugs of some other pieces of hardware. NPU does all stuff by itself, but sometimes we need to extract configuration information which is direct to us, for example. From davem@redhat.com Mon Aug 4 11:49:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 11:49:39 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74InYFl028410 for ; Mon, 4 Aug 2003 11:49:35 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id LAA20777; Mon, 4 Aug 2003 11:45:07 -0700 Date: Mon, 4 Aug 2003 11:45:07 -0700 From: "David S. 
Miller" To: Andi Kleen Cc: yoshfuji@linux-ipv6.org, ak@muc.de, netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional Message-Id: <20030804114507.6e496c77.davem@redhat.com> In-Reply-To: <20030804130408.GA36367@colin2.muc.de> References: <20030804125022.GA8167@averell> <20030804.215801.124854897.yoshfuji@linux-ipv6.org> <20030804130408.GA36367@colin2.muc.de> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4523 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On 4 Aug 2003 15:04:08 +0200 Andi Kleen wrote: > Thanks for the feedback. Here is a new patch with the two hunks > removed. Still broken in two areas: 1) You moved inet_peer_idlock into net/xfrm/xfrm_exports.c, that looks quite wrong. 2) Your patch doesn't apply to Linus's current tree because "secpath_dup" got added to net/netsyms.c since 2.6.0-test2 got released. I wanted to merge this, but I can't until you fix the above problems. Thanks. 
From alan@lxorguk.ukuu.org.uk Mon Aug 4 12:07:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 12:07:30 -0700 (PDT) Received: from lxorguk.ukuu.org.uk (pc1-cwma1-5-cust4.swan.cable.ntl.com [80.5.120.4]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74J7OFl029104 for ; Mon, 4 Aug 2003 12:07:26 -0700 Received: from dhcp22.swansea.linux.org.uk (dhcp22.swansea.linux.org.uk [127.0.0.1]) by lxorguk.ukuu.org.uk (8.12.8/8.12.5) with ESMTP id h74J3EC3001142; Mon, 4 Aug 2003 20:03:15 +0100 Received: (from alan@localhost) by dhcp22.swansea.linux.org.uk (8.12.8/8.12.8/Submit) id h74J3BPF001140; Mon, 4 Aug 2003 20:03:11 +0100 X-Authentication-Warning: dhcp22.swansea.linux.org.uk: alan set sender to alan@lxorguk.ukuu.org.uk using -f Subject: RE: TOE brain dump From: Alan Cox To: "Perez-Gonzalez, Inaky" Cc: Larry McVoy , David Lang , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, Linux Kernel Mailing List , Nivedita Singhvi In-Reply-To: References: Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: Message-Id: <1060023790.723.23.camel@dhcp22.swansea.linux.org.uk> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 (1.2.2-5) Date: 04 Aug 2003 20:03:11 +0100 X-archive-position: 4524 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: alan@lxorguk.ukuu.org.uk Precedence: bulk X-list: netdev On Llu, 2003-08-04 at 19:36, Perez-Gonzalez, Inaky wrote: > > Why would I want to spend money on some silly offload card when I can get > > the whole PC for less than the card? > > Because you want to stack 200 of those together in a huge > data center interconnecting whatever you want to interconnect > and you don't want your maintenance costs to go up to the sky? 17cm squared, fanless, network booting. 
It's not as big a cost as you might think, and TOE cards fail too, the difference being that if they are now out of production you have a nasty mess on your hands. From werner@almesberger.net Mon Aug 4 12:24:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 12:24:54 -0700 (PDT) Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74JOiFl029623 for ; Mon, 4 Aug 2003 12:24:45 -0700 Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h74JOdG11924; Mon, 4 Aug 2003 12:24:39 -0700 Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h74JOXM16998; Mon, 4 Aug 2003 16:24:33 -0300 Date: Mon, 4 Aug 2003 16:24:33 -0300 From: Werner Almesberger To: "Eric W. Biederman" Cc: Jeff Garzik , Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030804162433.L5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from ebiederm@xmission.com on Sun, Aug 03, 2003 at 01:21:09PM -0600 X-archive-position: 4525 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: werner@almesberger.net Precedence: bulk X-list: netdev Eric W. Biederman wrote: > The optimized for low latency cases seem to have a strong > market in clusters. Clusters have captive, no, _desperate_ customers ;-) And it seems that people are just as happy putting MPI as their transport on top of all those link-layer technologies. > There is one place in low latency communications that I can think > of where TCP/IP is not the proper solution. For low latency > communication the checksum is at the wrong end of the packet. 
That's one of the few things ATM's AAL5 got right. But in the end, I think it doesn't really matter. At 1 Gbps, an MTU-sized packet flies by within 13 us. At 10 Gbps, it's only 1.3 us. At that point, you may well treat it as an atomic unit. > On that score it is worth noting that the next generation of > peripheral busses (Hypertransport, PCI Express, etc) are all switched. And it's about time for that :-) - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From davem@redhat.com Mon Aug 4 12:31:11 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 12:31:16 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74JVBFl030047 for ; Mon, 4 Aug 2003 12:31:11 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id MAA20942; Mon, 4 Aug 2003 12:26:32 -0700 Date: Mon, 4 Aug 2003 12:26:32 -0700 From: "David S. 
Miller" To: Werner Almesberger Cc: ebiederm@xmission.com, jgarzik@pobox.com, niv@us.ibm.com, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-Id: <20030804122632.65ba2122.davem@redhat.com> In-Reply-To: <20030804162433.L5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <20030804162433.L5798@almesberger.net> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4526 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Mon, 4 Aug 2003 16:24:33 -0300 Werner Almesberger wrote: > Eric W. Biederman wrote: > > There is one place in low latency communications that I can think > > of where TCP/IP is not the proper solution. For low latency > > communication the checksum is at the wrong end of the packet. > > That's one of the few things ATM's AAL5 got right. Let's recall how long the IFF_TRAILERS hack from BSD lasted :-) > But in the end, I think it doesn't really matter. I tend to agree on this one. And on the transmit side, if you have more than 1 pending TX frame, you can always be prefetching the next one into the fifo so that by the time the medium is ready all the checksum bits have been done. In fact I'd be surprised if current generation 1g/10g cards are not doing something like this. 
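As a rough check on the per-packet times quoted in this thread (about 13 us at 1 Gbps and 1.3 us at 10 Gbps for an MTU-sized packet), here is a minimal sketch of the arithmetic. The helper name wire_time_ns is hypothetical and not taken from any kernel source, and the calculation assumes a plain bits-over-rate model that ignores preamble, inter-frame gap and line encoding.

```c
#include <stdint.h>

/*
 * Time, in nanoseconds, needed to clock `frame_bytes` onto a link that
 * runs at `link_bps` bits per second.
 *
 * Hypothetical helper for illustration only: it models nothing but
 * bits / rate, so preamble, inter-frame gap and PHY encoding overhead
 * are deliberately left out.
 */
static uint64_t wire_time_ns(uint64_t frame_bytes, uint64_t link_bps)
{
	/* Multiply before dividing so sub-microsecond results keep precision. */
	return frame_bytes * 8u * 1000000000ull / link_bps;
}
```

For a 1500-byte payload this gives 12000 ns at 1 Gbps and 1200 ns at 10 Gbps; counting the Ethernet header, FCS, preamble and inter-frame gap (1538 bytes on the wire, an assumption about the framing) moves that to about 12.3 us and 1.23 us, which is in the same range as the figures quoted in the thread.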
From hadi@cyberus.ca Mon Aug 4 12:43:12 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 12:43:17 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74JhAFl030573 for ; Mon, 4 Aug 2003 12:43:11 -0700 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19jlEc-0008eK-00; Mon, 04 Aug 2003 15:43:10 -0400 Subject: Re: TOE brain dump From: jamal Reply-To: hadi@cyberus.ca To: "Ihar 'Philips' Filipau" Cc: netdev@oss.sgi.com In-Reply-To: <3F2EAA78.60202@softhome.net> References: <1060015518.1103.399.camel@jzny.localdomain> <3F2EAA78.60202@softhome.net> Content-Type: text/plain Organization: jamalopolis Message-Id: <1060026149.1102.411.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 04 Aug 2003 15:42:29 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4527 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Mon, 2003-08-04 at 14:48, Ihar 'Philips' Filipau wrote: > jamal wrote: > > to nit: It's no longer about routing or bridging, friend. That's like getting > > fries at McDonald's. > > > > 1GE/10GE - for $5? > I'm first in the shopping queue!!!-))) > I thought you were talking about a 2 Gige interface doing routing, no? Do the math: Dell will happily sell you a (managed?) switch which has 8Giges on it for about $300. It does wire rate on all 8 interfaces. All ready to go in a 1U form factor. How much do you think that chip costs? Let's say it doesn't do L3, how much more do you think it will cost to do L3 in quantities? > Since I see no reasonable outcome of this discussion I left it. > > TOE as I see - since my company utilizes several of them - are too > different and too specialized to application/protocols. 
And yes - price > of development/deployment matters too. Linux support for those protocols > is immature. It cannot handle our requirements even software-wise. I'm > not talking about timing requirements - linux network in general is not > (even soft) real-time. > Now this is anti-social talk;-> Why do you need to have realtime for any of this stuff? > My personal flame-meter is out of scale ;-) > I shall join the discussion back when I will see any real ideas. > Please don't disappear, a lot of questions need answers;-> > > > If all you wanted was to do L3 - why not just buy a $5 chip that > > can do this for a lot more interfaces? Why sweat over > > optimizing L3 routing in a 3K space? > > We are doing not a teapot, and high level spec for this code takes > around 15 pages. > 3k - it is not optimized - we have limit around 2GB ;-) I am really confused now. We must be talking about different classes of devices. NPUs as I know them are very limited in how much code you can stash in them. The 10K range is already overkill. Do you have any URL I can look at on what you are describing? > It just takes only 3k. And it handles some special (read - > proprietary) functions too - some bugs of some other pieces of hardware. > NPU does all stuff by itself, but sometimes we need to extract > configuration information which is direct to us, for example. Please provide me a pointer if you can - I am very interested in the 2G code space you mention. 
cheers, jamal > From filia@softhome.net Mon Aug 4 13:05:53 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 13:05:57 -0700 (PDT) Received: from jive.SoftHome.net (jive.SoftHome.net [66.54.152.27]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74K5qFl031227 for ; Mon, 4 Aug 2003 13:05:53 -0700 Received: (qmail 22843 invoked by uid 417); 4 Aug 2003 20:05:52 -0000 Received: from shunt-smtp-out-0 (HELO softhome.net) (172.16.3.12) by shunt-smtp-out-0 with SMTP; 4 Aug 2003 20:05:52 -0000 Received: from softhome.net ([212.18.200.6]) (AUTH: PLAIN filia@softhome.net) by softhome.net with esmtp; Mon, 04 Aug 2003 14:05:51 -0600 Message-ID: <3F2EBCCA.5060708@softhome.net> Date: Mon, 04 Aug 2003 22:06:34 +0200 From: "Ihar 'Philips' Filipau" Organization: Home Sweet Home User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030701 X-Accept-Language: en-us, en MIME-Version: 1.0 To: hadi@cyberus.ca CC: netdev@oss.sgi.com Subject: Re: TOE brain dump References: <1060015518.1103.399.camel@jzny.localdomain> <3F2EAA78.60202@softhome.net> <1060026149.1102.411.camel@jzny.localdomain> In-Reply-To: <1060026149.1102.411.camel@jzny.localdomain> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4528 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: filia@softhome.net Precedence: bulk X-list: netdev jamal wrote: > >> It just takes only 3k. And it handles some special (read - >>proprietary) functions too - some bugs of some other pieces of hardware. >>NPU does all stuff by itself, but sometimes we need to extract >>configuration information which is direct to us, for example. > > > Please provide me a pointer if you can - I am very interested in the 2G > code space you mention. > I'm not sure - actually as I wrote - immediately gone checking specs. try: http://www.vitesse.com/products/categories.cfm?family_id=5&category_id=16 ... 
[ Okay I got to docs server. ] You are right - It has limit of 4K insns == 16k of executable memory. Sorry for confusion :( We really can address a lot of memory - we have 32MB for routing info and configuration - but for execution only 16kB of memory is available... From ak@muc.de Mon Aug 4 13:35:44 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 13:35:54 -0700 (PDT) Received: from colin2.muc.de (qmailr@colin2.muc.de [193.149.48.15]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74KZgFl031967 for ; Mon, 4 Aug 2003 13:35:43 -0700 Received: (qmail 21559 invoked by uid 3709); 4 Aug 2003 20:35:24 -0000 Date: 4 Aug 2003 22:35:24 +0200 Date: Mon, 4 Aug 2003 22:35:24 +0200 From: Andi Kleen To: "David S. Miller" Cc: yoshfuji@linux-ipv6.org, ak@muc.de, netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional Message-ID: <20030804203524.GA15874@colin2.muc.de> References: <20030804125022.GA8167@averell> <20030804.215801.124854897.yoshfuji@linux-ipv6.org> <20030804130408.GA36367@colin2.muc.de> <20030804114507.6e496c77.davem@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030804114507.6e496c77.davem@redhat.com> User-Agent: Mutt/1.4.1i X-archive-position: 4529 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ak@colin2.muc.de Precedence: bulk X-list: netdev Ok, here is a new patch against current BKCVS. It also moves the inet_peer_idlock only inside netsyms. 
-Andi diff -u linux-xfrm/include/net/dst.h-XFRM linux-xfrm/include/net/dst.h --- linux-xfrm/include/net/dst.h-XFRM 2003-06-29 12:29:21.000000000 +0200 +++ linux-xfrm/include/net/dst.h 2003-08-04 22:16:49.000000000 +0200 @@ -247,8 +247,16 @@ extern void dst_init(void); struct flowi; +#ifndef CONFIG_XFRM +static inline int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl, + struct sock *sk, int flags) +{ + return 0; +} +#else extern int xfrm_lookup(struct dst_entry **dst_p, struct flowi *fl, struct sock *sk, int flags); #endif +#endif #endif /* _NET_DST_H */ diff -u linux-xfrm/include/net/xfrm.h-XFRM linux-xfrm/include/net/xfrm.h --- linux-xfrm/include/net/xfrm.h-XFRM 2003-08-04 22:09:46.000000000 +0200 +++ linux-xfrm/include/net/xfrm.h 2003-08-04 22:16:49.000000000 +0200 @@ -588,6 +588,8 @@ return !0; } +#ifdef CONFIG_XFRM + extern int __xfrm_policy_check(struct sock *, int dir, struct sk_buff *skb, unsigned short family); static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) @@ -653,6 +655,26 @@ } } +#else + +static inline void xfrm_sk_free_policy(struct sock *sk) {} +static inline int xfrm_sk_clone_policy(struct sock *sk) { return 0; } +static inline int xfrm6_route_forward(struct sk_buff *skb) { return 1; } +static inline int xfrm4_route_forward(struct sk_buff *skb) { return 1; } +static inline int xfrm6_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + return 1; +} +static inline int xfrm4_policy_check(struct sock *sk, int dir, struct sk_buff *skb) +{ + return 1; +} +static inline int xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb, unsigned short family) +{ + return 1; +} +#endif + static __inline__ xfrm_address_t *xfrm_flowi_daddr(struct flowi *fl, unsigned short family) { @@ -783,12 +805,32 @@ extern int xfrm_check_selectors(struct xfrm_state **x, int n, struct flowi *fl); extern int xfrm_check_output(struct xfrm_state *x, struct sk_buff *skb, unsigned short family); 
extern int xfrm4_rcv(struct sk_buff *skb); -extern int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type); extern int xfrm4_tunnel_register(struct xfrm_tunnel *handler); extern int xfrm4_tunnel_deregister(struct xfrm_tunnel *handler); extern int xfrm4_tunnel_check_size(struct sk_buff *skb); extern int xfrm6_rcv(struct sk_buff **pskb, unsigned int *nhoffp); + +#ifdef CONFIG_XFRM +extern int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type); extern int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen); +extern int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); +#else +static inline int xfrm_user_policy(struct sock *sk, int optname, u8 *optval, int optlen) +{ + return -ENOPROTOOPT; +} + +static inline int xfrm4_rcv_encap(struct sk_buff *skb, __u16 encap_type) +{ + /* should not happen */ + kfree_skb(skb); + return 0; +} +static inline int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family) +{ + return -EINVAL; +} +#endif void xfrm_policy_init(void); void xfrm4_policy_init(void); @@ -810,7 +852,6 @@ extern int xfrm_sk_policy_insert(struct sock *sk, int dir, struct xfrm_policy *pol); extern struct xfrm_policy *xfrm_sk_policy_lookup(struct sock *sk, int dir, struct flowi *fl); extern int xfrm_flush_bundles(struct xfrm_state *x); -extern int xfrm_dst_lookup(struct xfrm_dst **dst, struct flowi *fl, unsigned short family); extern wait_queue_head_t km_waitq; extern void km_state_expired(struct xfrm_state *x, int hard); diff -u linux-xfrm/net/core/skbuff.c-XFRM linux-xfrm/net/core/skbuff.c --- linux-xfrm/net/core/skbuff.c-XFRM 2003-06-19 09:21:04.000000000 +0200 +++ linux-xfrm/net/core/skbuff.c 2003-08-04 22:16:49.000000000 +0200 @@ -225,7 +225,7 @@ } dst_release(skb->dst); -#ifdef CONFIG_INET +#ifdef CONFIG_XFRM secpath_put(skb->sp); #endif if(skb->destructor) { diff -u linux-xfrm/net/ipv4/Kconfig-XFRM linux-xfrm/net/ipv4/Kconfig --- linux-xfrm/net/ipv4/Kconfig-XFRM 2003-08-04 
22:09:47.000000000 +0200 +++ linux-xfrm/net/ipv4/Kconfig 2003-08-04 22:16:49.000000000 +0200 @@ -187,6 +187,7 @@ config NET_IPIP tristate "IP: tunneling" depends on INET + select XFRM ---help--- Tunneling means encapsulating data of one protocol type within another protocol and sending it over a channel that understands the @@ -205,6 +206,7 @@ config NET_IPGRE tristate "IP: GRE tunnels over IP" depends on INET + select XFRM help Tunneling means encapsulating data of one protocol type within another protocol and sending it over a channel that understands the @@ -343,6 +345,7 @@ config INET_AH tristate "IP: AH transformation" + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -354,6 +357,7 @@ config INET_ESP tristate "IP: ESP transformation" + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -366,6 +370,7 @@ config INET_IPCOMP tristate "IP: IPComp transformation" + select XFRM select CRYPTO select CRYPTO_DEFLATE ---help--- diff -u linux-xfrm/net/ipv4/Makefile-XFRM linux-xfrm/net/ipv4/Makefile --- linux-xfrm/net/ipv4/Makefile-XFRM 2003-08-04 22:09:47.000000000 +0200 +++ linux-xfrm/net/ipv4/Makefile 2003-08-04 22:16:49.000000000 +0200 @@ -23,4 +23,4 @@ obj-$(CONFIG_NETFILTER) += netfilter/ obj-$(CONFIG_IP_VS) += ipvs/ -obj-y += xfrm4_policy.o xfrm4_state.o xfrm4_input.o xfrm4_tunnel.o +obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o xfrm4_tunnel.o diff -u linux-xfrm/net/ipv4/route.c-XFRM linux-xfrm/net/ipv4/route.c --- linux-xfrm/net/ipv4/route.c-XFRM 2003-06-19 09:21:04.000000000 +0200 +++ linux-xfrm/net/ipv4/route.c 2003-08-04 22:16:49.000000000 +0200 @@ -2785,8 +2785,10 @@ create_proc_read_entry("net/rt_acct", 0, 0, ip_rt_acct_read, NULL); #endif #endif +#ifdef CONFIG_XFRM xfrm_init(); xfrm4_init(); +#endif out: return rc; out_enomem: diff -u linux-xfrm/net/ipv4/udp.c-XFRM linux-xfrm/net/ipv4/udp.c --- linux-xfrm/net/ipv4/udp.c-XFRM 2003-08-04 22:09:47.000000000 +0200 +++ linux-xfrm/net/ipv4/udp.c 2003-08-04 
22:16:49.000000000 +0200 @@ -938,6 +938,9 @@ */ static int udp_encap_rcv(struct sock * sk, struct sk_buff *skb) { +#ifndef CONFIG_XFRM + return 1; +#else struct udp_opt *up = udp_sk(sk); struct udphdr *uh = skb->h.uh; struct iphdr *iph; @@ -997,10 +1000,12 @@ return -1; default: - printk(KERN_INFO "udp_encap_rcv(): Unhandled UDP encap type: %u\n", - encap_type); + if (net_ratelimit()) + printk(KERN_INFO "udp_encap_rcv(): Unhandled UDP encap type: %u\n", + encap_type); return 1; } +#endif } /* returns: diff -u linux-xfrm/net/ipv6/Kconfig-XFRM linux-xfrm/net/ipv6/Kconfig --- linux-xfrm/net/ipv6/Kconfig-XFRM 2003-08-04 22:09:48.000000000 +0200 +++ linux-xfrm/net/ipv6/Kconfig 2003-08-04 22:16:49.000000000 +0200 @@ -22,6 +22,7 @@ config INET6_AH tristate "IPv6: AH transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -34,6 +35,7 @@ config INET6_ESP tristate "IPv6: ESP transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_HMAC select CRYPTO_MD5 @@ -47,6 +49,7 @@ config INET6_IPCOMP tristate "IPv6: IPComp transformation" depends on IPV6 + select XFRM select CRYPTO select CRYPTO_DEFLATE ---help--- diff -u linux-xfrm/net/ipv6/Makefile-XFRM linux-xfrm/net/ipv6/Makefile --- linux-xfrm/net/ipv6/Makefile-XFRM 2003-06-14 12:19:38.000000000 +0200 +++ linux-xfrm/net/ipv6/Makefile 2003-08-04 22:16:49.000000000 +0200 @@ -8,8 +8,9 @@ route.o ip6_fib.o ipv6_sockglue.o ndisc.o udp.o raw.o \ protocol.o icmp.o mcast.o reassembly.o tcp_ipv6.o \ exthdrs.o sysctl_net_ipv6.o datagram.o proc.o \ - ip6_flowlabel.o ipv6_syms.o \ - xfrm6_policy.o xfrm6_state.o xfrm6_input.o + ip6_flowlabel.o ipv6_syms.o + +obj-$(CONFIG_XFRM) += xfrm6_policy.o xfrm6_state.o xfrm6_input.o obj-$(CONFIG_INET6_AH) += ah6.o obj-$(CONFIG_INET6_ESP) += esp6.o diff -u linux-xfrm/net/ipv6/ipv6_syms.c-XFRM linux-xfrm/net/ipv6/ipv6_syms.c --- linux-xfrm/net/ipv6/ipv6_syms.c-XFRM 2003-06-16 09:04:50.000000000 +0200 +++ linux-xfrm/net/ipv6/ipv6_syms.c 
2003-08-04 22:16:49.000000000 +0200 @@ -36,7 +36,9 @@ EXPORT_SYMBOL(in6addr_loopback); EXPORT_SYMBOL(in6_dev_finish_destroy); EXPORT_SYMBOL(ip6_find_1stfragopt); +#ifdef CONFIG_XFRM EXPORT_SYMBOL(xfrm6_rcv); +#endif EXPORT_SYMBOL(rt6_lookup); EXPORT_SYMBOL(fl6_sock_lookup); EXPORT_SYMBOL(ipv6_ext_hdr); diff -u linux-xfrm/net/ipv6/route.c-XFRM linux-xfrm/net/ipv6/route.c --- linux-xfrm/net/ipv6/route.c-XFRM 2003-08-04 22:09:48.000000000 +0200 +++ linux-xfrm/net/ipv6/route.c 2003-08-04 22:16:49.000000000 +0200 @@ -1988,7 +1988,9 @@ if (p) p->proc_fops = &rt6_stats_seq_fops; #endif +#ifdef CONFIG_XFRM xfrm6_init(); +#endif } #ifdef MODULE diff -u linux-xfrm/net/xfrm/Kconfig-XFRM linux-xfrm/net/xfrm/Kconfig --- linux-xfrm/net/xfrm/Kconfig-XFRM 2003-06-14 12:19:38.000000000 +0200 +++ linux-xfrm/net/xfrm/Kconfig 2003-08-04 22:16:49.000000000 +0200 @@ -1,9 +1,13 @@ # # XFRM configuration # +config XFRM + bool + depends on NET + config XFRM_USER tristate "IPsec user configuration interface" - depends on INET + depends on INET && XFRM ---help--- Support for IPsec user configuration interface used by native Linux tools. diff -u linux-xfrm/net/xfrm/Makefile-XFRM linux-xfrm/net/xfrm/Makefile --- linux-xfrm/net/xfrm/Makefile-XFRM 2003-06-14 12:19:38.000000000 +0200 +++ linux-xfrm/net/xfrm/Makefile 2003-08-04 22:16:49.000000000 +0200 @@ -2,6 +2,7 @@ # Makefile for the XFRM subsystem. # -obj-y := xfrm_policy.o xfrm_state.o xfrm_input.o xfrm_algo.o xfrm_output.o +obj-$(CONFIG_XFRM) := xfrm_policy.o xfrm_state.o xfrm_input.o xfrm_algo.o xfrm_output.o \ + xfrm_export.o obj-$(CONFIG_XFRM_USER) += xfrm_user.o diff -u linux-xfrm/net/Kconfig-XFRM linux-xfrm/net/Kconfig --- linux-xfrm/net/Kconfig-XFRM 2003-08-04 22:09:47.000000000 +0200 +++ linux-xfrm/net/Kconfig 2003-08-04 22:16:49.000000000 +0200 @@ -83,6 +83,7 @@ config NET_KEY tristate "PF_KEY sockets" + select XFRM ---help--- PF_KEYv2 socket family, compatible to KAME ones. 
They are required if you are going to use IPsec tools ported diff -u linux-xfrm/net/netsyms.c-XFRM linux-xfrm/net/netsyms.c --- linux-xfrm/net/netsyms.c-XFRM 2003-08-04 22:09:47.000000000 +0200 +++ linux-xfrm/net/netsyms.c 2003-08-04 22:19:14.000000000 +0200 @@ -56,7 +56,6 @@ #include #include #include -#include #if defined(CONFIG_INET_AH) || defined(CONFIG_INET_AH_MODULE) || defined(CONFIG_INET6_AH) || defined(CONFIG_INET6_AH_MODULE) #include #endif @@ -276,6 +275,7 @@ EXPORT_SYMBOL(inetdev_by_index); EXPORT_SYMBOL(in_dev_finish_destroy); EXPORT_SYMBOL(ip_defrag); +EXPORT_SYMBOL(inet_peer_idlock); /* Route manipulation */ EXPORT_SYMBOL(ip_rt_ioctl); @@ -293,80 +293,6 @@ /* needed for ip_gre -cw */ EXPORT_SYMBOL(ip_statistics); - -EXPORT_SYMBOL(xfrm_user_policy); -EXPORT_SYMBOL(km_waitq); -EXPORT_SYMBOL(km_new_mapping); -EXPORT_SYMBOL(xfrm_cfg_sem); -EXPORT_SYMBOL(xfrm_policy_alloc); -EXPORT_SYMBOL(__xfrm_policy_destroy); -EXPORT_SYMBOL(xfrm_lookup); -EXPORT_SYMBOL(__xfrm_policy_check); -EXPORT_SYMBOL(__xfrm_route_forward); -EXPORT_SYMBOL(xfrm_state_alloc); -EXPORT_SYMBOL(__xfrm_state_destroy); -EXPORT_SYMBOL(xfrm_state_find); -EXPORT_SYMBOL(xfrm_state_insert); -EXPORT_SYMBOL(xfrm_state_add); -EXPORT_SYMBOL(xfrm_state_update); -EXPORT_SYMBOL(xfrm_state_check_expire); -EXPORT_SYMBOL(xfrm_state_check_space); -EXPORT_SYMBOL(xfrm_state_lookup); -EXPORT_SYMBOL(xfrm_state_register_afinfo); -EXPORT_SYMBOL(xfrm_state_unregister_afinfo); -EXPORT_SYMBOL(xfrm_state_get_afinfo); -EXPORT_SYMBOL(xfrm_state_put_afinfo); -EXPORT_SYMBOL(xfrm_state_delete_tunnel); -EXPORT_SYMBOL(xfrm_replay_check); -EXPORT_SYMBOL(xfrm_replay_advance); -EXPORT_SYMBOL(xfrm_check_selectors); -EXPORT_SYMBOL(xfrm_check_output); -EXPORT_SYMBOL(__secpath_destroy); -EXPORT_SYMBOL(secpath_dup); -EXPORT_SYMBOL(xfrm_get_acqseq); -EXPORT_SYMBOL(xfrm_parse_spi); -EXPORT_SYMBOL(xfrm4_rcv); -EXPORT_SYMBOL(xfrm4_tunnel_register); -EXPORT_SYMBOL(xfrm4_tunnel_deregister); -EXPORT_SYMBOL(xfrm4_tunnel_check_size); 
-EXPORT_SYMBOL(xfrm_register_type); -EXPORT_SYMBOL(xfrm_unregister_type); -EXPORT_SYMBOL(xfrm_get_type); -EXPORT_SYMBOL(inet_peer_idlock); -EXPORT_SYMBOL(xfrm_register_km); -EXPORT_SYMBOL(xfrm_unregister_km); -EXPORT_SYMBOL(xfrm_state_delete); -EXPORT_SYMBOL(xfrm_state_walk); -EXPORT_SYMBOL(xfrm_find_acq_byseq); -EXPORT_SYMBOL(xfrm_find_acq); -EXPORT_SYMBOL(xfrm_alloc_spi); -EXPORT_SYMBOL(xfrm_state_flush); -EXPORT_SYMBOL(xfrm_policy_kill); -EXPORT_SYMBOL(xfrm_policy_bysel); -EXPORT_SYMBOL(xfrm_policy_insert); -EXPORT_SYMBOL(xfrm_policy_walk); -EXPORT_SYMBOL(xfrm_policy_flush); -EXPORT_SYMBOL(xfrm_policy_byid); -EXPORT_SYMBOL(xfrm_policy_list); -EXPORT_SYMBOL(xfrm_dst_lookup); -EXPORT_SYMBOL(xfrm_policy_register_afinfo); -EXPORT_SYMBOL(xfrm_policy_unregister_afinfo); -EXPORT_SYMBOL(xfrm_policy_get_afinfo); -EXPORT_SYMBOL(xfrm_policy_put_afinfo); - -EXPORT_SYMBOL_GPL(xfrm_probe_algs); -EXPORT_SYMBOL_GPL(xfrm_count_auth_supported); -EXPORT_SYMBOL_GPL(xfrm_count_enc_supported); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byidx); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byid); -EXPORT_SYMBOL_GPL(xfrm_aalg_get_byname); -EXPORT_SYMBOL_GPL(xfrm_ealg_get_byname); -EXPORT_SYMBOL_GPL(xfrm_calg_get_byname); -EXPORT_SYMBOL_GPL(skb_icv_walk); #if defined(CONFIG_INET_ESP) || defined(CONFIG_INET_ESP_MODULE) || defined(CONFIG_INET6_ESP) || defined(CONFIG_INET6_ESP_MODULE) EXPORT_SYMBOL_GPL(skb_cow_data); EXPORT_SYMBOL_GPL(pskb_put); From shemminger@osdl.org Mon Aug 4 16:43:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 16:43:34 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74NhPFl004508 for ; Mon, 4 Aug 2003 16:43:26 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id 
h74NLlI08855; Mon, 4 Aug 2003 16:21:47 -0700 Date: Mon, 4 Aug 2003 16:21:47 -0700 From: Stephen Hemminger To: Jeff Garzik Cc: netdev@oss.sgi.com Subject: [PATCH] convert lp486e driver to dynamic allocation Message-Id: <20030804162147.591c55f6.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.3claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4530 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev Convert this driver from static net_device to using alloc_etherdev. Patch against 2.6.0-test2. Loads and unloads, but don't have the actual hardware. diff -Nru a/drivers/net/lp486e.c b/drivers/net/lp486e.c --- a/drivers/net/lp486e.c Mon Aug 4 14:53:55 2003 +++ b/drivers/net/lp486e.c Mon Aug 4 14:53:55 2003 @@ -975,15 +975,7 @@ return -EBUSY; } - /* - * Allocate working memory, 16-byte aligned - */ - dev->mem_start = (unsigned long) kmalloc(sizeof(struct i596_private) + 0x0f, GFP_KERNEL); - if (!dev->mem_start) - goto err_out; - dev->priv = (void *)((dev->mem_start + 0xf) & 0xfffffff0); lp = (struct i596_private *) dev->priv; - memset((void *)lp, 0, sizeof(struct i596_private)); spin_lock_init(&lp->cmd_lock); /* @@ -997,7 +989,6 @@ dev->base_addr = IOADDR; dev->irq = IRQ; - ether_setup(dev); /* * How do we find the ethernet address? I don't know. 
@@ -1045,8 +1036,6 @@ return 0; err_out_kfree: - kfree ((void *) dev->mem_start); -err_out: release_region(IOADDR, LP486E_TOTAL_SIZE); return ret; } @@ -1318,29 +1307,36 @@ MODULE_PARM(options, "1-" __MODULE_STRING(MAX_UNITS) "i"); MODULE_PARM(full_duplex, "1-" __MODULE_STRING(MAX_UNITS) "i"); -static struct net_device dev_lp486e; +static struct net_device *dev_lp486e; static int full_duplex; static int options; static int io = IOADDR; static int irq = IRQ; static int __init lp486e_init_module(void) { - struct net_device *dev = &dev_lp486e; + struct net_device *dev; + + dev = alloc_etherdev(sizeof(struct i596_private)); + if (!dev) + return -ENOMEM; + dev->irq = irq; dev->base_addr = io; dev->init = lp486e_probe; - if (register_netdev(dev) != 0) + if (register_netdev(dev) != 0) { + kfree(dev); return -EIO; + } + dev_lp486e = dev; full_duplex = 0; options = 0; return 0; } static void __exit lp486e_cleanup_module(void) { - unregister_netdev(&dev_lp486e); - kfree((void *)dev_lp486e.mem_start); - dev_lp486e.priv = NULL; - release_region(dev_lp486e.base_addr, LP486E_TOTAL_SIZE); + unregister_netdev(dev_lp486e); + release_region(dev_lp486e->base_addr, LP486E_TOTAL_SIZE); + kfree(dev_lp486e); } module_init(lp486e_init_module); From davem@redhat.com Mon Aug 4 16:53:51 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 16:53:55 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74NroFl004999 for ; Mon, 4 Aug 2003 16:53:51 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA21337; Mon, 4 Aug 2003 16:49:20 -0700 Date: Mon, 4 Aug 2003 16:49:20 -0700 From: "David S. 
Miller" To: Andi Kleen Cc: yoshfuji@linux-ipv6.org, ak@muc.de, netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional Message-Id: <20030804164920.371d5afd.davem@redhat.com> In-Reply-To: <20030804203524.GA15874@colin2.muc.de> References: <20030804125022.GA8167@averell> <20030804.215801.124854897.yoshfuji@linux-ipv6.org> <20030804130408.GA36367@colin2.muc.de> <20030804114507.6e496c77.davem@redhat.com> <20030804203524.GA15874@colin2.muc.de> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4531 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On 4 Aug 2003 22:35:24 +0200 Andi Kleen wrote: > Ok, here is a new patch against current BKCVS. It also moves the > inet_peer_idlock only inside netsyms. Applied, thanks Andi. From davem@redhat.com Mon Aug 4 16:56:06 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 16:56:12 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h74Nu5Fl005356 for ; Mon, 4 Aug 2003 16:56:06 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA21354; Mon, 4 Aug 2003 16:51:37 -0700 Date: Mon, 4 Aug 2003 16:51:37 -0700 From: "David S. 
Miller" To: Andi Kleen Cc: yoshfuji@linux-ipv6.org, ak@muc.de, netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional Message-Id: <20030804165137.40d744c5.davem@redhat.com> In-Reply-To: <20030804203524.GA15874@colin2.muc.de> References: <20030804125022.GA8167@averell> <20030804.215801.124854897.yoshfuji@linux-ipv6.org> <20030804130408.GA36367@colin2.muc.de> <20030804114507.6e496c77.davem@redhat.com> <20030804203524.GA15874@colin2.muc.de> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4532 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On 4 Aug 2003 22:35:24 +0200 Andi Kleen wrote: > Ok, here is a new patch again current BKCVS. It also moves the inet_peer_idlock > only inside netsyms. This one is missing net/xfrm/xfrm_export.c :-( Andi, please be more careful with your patches. I'd suggest using Subversion or whatever source management system you like best to help avoid these problems in the future. You seem to be chronically making mistakes like this, as if you're rushing things. From davem@redhat.com Mon Aug 4 17:02:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 17:02:51 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7502iFl005850 for ; Mon, 4 Aug 2003 17:02:44 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA21376; Mon, 4 Aug 2003 16:57:46 -0700 Date: Mon, 4 Aug 2003 16:57:46 -0700 From: "David S. 
Miller" To: Krishna Kumar Cc: kuznet@ms2.inr.ac.ru, yoshfuji@linux-ipv6.org, netdev@oss.sgi.com, krkumar@us.ibm.com Subject: Re: O/M flags against 2.6.0-test1 Message-Id: <20030804165746.133f370a.davem@redhat.com> In-Reply-To: References: <20030730220223.4c25fcfe.davem@redhat.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4533 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Thu, 31 Jul 2003 13:33:27 -0700 (PDT) Krishna Kumar wrote: > > Ok, but then please use "__s32". > > OK, slowly getting there :-) > > Latest patch follows : Krishna is away, but let us make more progress on this patch. I see some problems with it that still need to be resolved: > +/* Subtype attributes for IFLA_PROTINFO */ > +enum > +{ > + IFLA_INET6_UNSPEC, > + IFLA_INET6_FLAGS, /* link flags */ > + IFLA_INET6_CONF, /* sysctl parameters */ > + IFLA_INET6_STATS, /* statistics */ > + IFLA_INET6_MCAST, /* MC things. What of them? */ > +}; > + > +#define IFLA_INET6_MAX IFLA_INET6_MCAST Ok, how does this actually work? The code does RTA_PUT(...IFLA_INET6_*...) but IFLA_PROTINFO is not actually used anywhere. This cannot work, it makes these RTA attributes just look like whatever IFLA_* ones have the same values as the inet6 ones in this enumeration. Alexey, how did you intend this stuff to be done? Certainly not like this :-) > + /* return the device sysctl params */ > + if ((array = kmalloc(DEVCONF_MAX * sizeof(*array), GFP_KERNEL)) == NULL) > + goto rtattr_failure; > + ipv6_store_devconf(&idev->cnf, array); > + RTA_PUT(skb, IFLA_INET6_CONF, DEVCONF_MAX * sizeof(*array), array); This is what I'm talking about. Maybe there is something I'm missing. How does APP know to interpret IFLA_INET6_CONF as "sub-attribute" of IFLA_PROTINFO? 
Also, missing "memset(array, 0, sizeof(*array));" else we leak uninitialized kernel memory into user space. Another bug, GFP_KERNEL memory allocation with dev_base_lock held. Otherwise I am OK with the patch. From davem@redhat.com Mon Aug 4 17:03:29 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 17:03:33 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7503TFl006007 for ; Mon, 4 Aug 2003 17:03:29 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA21402; Mon, 4 Aug 2003 16:59:04 -0700 Date: Mon, 4 Aug 2003 16:59:04 -0700 From: "David S. Miller" To: Stephen Hemminger Cc: jgarzik@pobox.com, netdev@oss.sgi.com Subject: Re: [PATCH] convert lp486e driver to dynamic allocation Message-Id: <20030804165904.0e9f60ab.davem@redhat.com> In-Reply-To: <20030804162147.591c55f6.shemminger@osdl.org> References: <20030804162147.591c55f6.shemminger@osdl.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4534 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Mon, 4 Aug 2003 16:21:47 -0700 Stephen Hemminger wrote: > Convert this driver from static net_device to using alloc_etherdev. Applied, thanks. 
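On the uninitialized-memory point raised in the O/M flags review above: a minimal userspace sketch of the idea, not the kernel code itself. It assumes a store routine that may fill only part of the buffer (whether ipv6_store_devconf() fills every slot is not stated in the thread). DEMO_DEVCONF_MAX and all demo_* names are hypothetical stand-ins. Note that the clear has to span the whole allocation; sizeof(*array) on its own is the size of a single element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for DEVCONF_MAX; the real value lives in kernel headers. */
#define DEMO_DEVCONF_MAX 16

/* Fill only the first `used` slots, the way a partial store routine might. */
static void demo_store_conf(int32_t *array, size_t used)
{
    for (size_t i = 0; i < used; i++)
        array[i] = (int32_t)(i + 1);
}

/* Allocate the buffer, zero ALL of it, then store the known values.
 * The memset length is element count times element size, so no slot
 * keeps whatever the allocator happened to hand back. */
static int32_t *demo_build_conf(size_t used)
{
    int32_t *array;

    assert(used <= DEMO_DEVCONF_MAX);
    array = malloc(DEMO_DEVCONF_MAX * sizeof(*array));
    if (array == NULL)
        return NULL;
    memset(array, 0, DEMO_DEVCONF_MAX * sizeof(*array));
    demo_store_conf(array, used);
    return array;
}

/* Count slots past `used` that are non-zero, i.e. would expose stale data
 * if the whole buffer were copied to another party. */
static size_t demo_stale_slots(const int32_t *array, size_t used)
{
    size_t stale = 0;

    for (size_t i = used; i < DEMO_DEVCONF_MAX; i++)
        if (array[i] != 0)
            stale++;
    return stale;
}
```

A caller would do something like `int32_t *a = demo_build_conf(4);`, hand all DEMO_DEVCONF_MAX entries to the consumer, then `free(a)`. The second issue in the review (a sleeping allocation while a lock is held) has no direct userspace analogue and is not modelled here.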
From scott.feldman@intel.com Mon Aug 4 20:45:15 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 20:45:24 -0700 (PDT) Received: from hermes.jf.intel.com (fmr05.intel.com [134.134.136.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h753jFFl017502 for ; Mon, 4 Aug 2003 20:45:15 -0700 Received: from petasus.jf.intel.com (petasus.jf.intel.com [10.7.209.6]) by hermes.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h753h3x04231 for ; Tue, 5 Aug 2003 03:43:03 GMT Received: from orsmsxvs040.jf.intel.com (orsmsxvs040.jf.intel.com [192.168.65.206]) by petasus.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h753eBv26263 for ; Tue, 5 Aug 2003 03:40:11 GMT Received: from orsmsx332.amr.corp.intel.com ([192.168.65.60]) by orsmsxvs040.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080420571626008 ; Mon, 04 Aug 2003 20:57:16 -0700 Received: from orsmsx402.amr.corp.intel.com ([192.168.65.208]) by orsmsx332.amr.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Mon, 4 Aug 2003 20:45:09 -0700 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: RE: e100 "Ferguson" release Date: Mon, 4 Aug 2003 20:45:08 -0700 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: e100 "Ferguson" release Thread-Index: AcNZhWYRC0Gz1n9oToGU+hvgKaMpJwBWPZNQ From: "Feldman, Scott" To: "Jeff Garzik" Cc: X-OriginalArrivalTime: 05 Aug 2003 03:45:09.0070 (UTC) FILETIME=[F76DBEE0:01C35B03] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h753jFFl017502 X-archive-position: 4535 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: scott.feldman@intel.com Precedence: bulk X-list: netdev New one: http://sf.net/projects/e1000, e100-3.0.0_dev12.tar.gz 
> Comments: Thanks Jeff! > * (API) Does the out-of-tx-resources condition in > e100_xmit_frame ever really happen? I am under the > impression that returning non-zero in ->hard_start_xmit > results in the packet sometimes being requeued and > sometimes dropped. I prefer to guarantee a more-steady > state, by simply dropping the packet unconditionally, > when this uncommon condition occurs. So, I would > a) mark the failure condition with unlikely(), and > b) if the condition occurs, simply drop the packet > (tx_dropped++, kfree > skb), and return zero. Stop the queue also? if(unlikely(e100_exec_cb(nic, skb, e100_xmit_prepare) == -ENOMEM)) { netif_stop_queue(netdev); nic->net_stats.tx_dropped++; dev_kfree_skb(skb); return 0; } Added some more likely/unlikely's in the perf paths. > * (minor) for completeness, you should limit the PCI class in the > pci_device_id table to PCI_CLASS_NETWORK_ETHERNET. There are > one-in-a-million cases where this matters, but it's usually a > BIOS bug. Still, it's there in pci_device_id table, and it's an easy > change, so might as well use it. OK > * (style) your struct config definition is terribly clever. > perhaps too clever, making it unreadable? Not a specific complaint, > mind you, just something that caught my eye. Then the driver would be perfect. We can't have that. ;-) > * (minor) in tg3, my own benchmarks and experiments showed it > helped to explictly use ____cacheline_aligned markers when > defining certain sections of members in struct tg3 > (or struct nic, in e100's case). You already clearly pay > attention to member layout WRT cache effects, but if > you have a clear dividing line, or lines, in struct nic you can use > _____cacheline_aligned for even greater benefit. At a > minimum test it with a cpu-usage-measuring benchmark like ttcp, > though, of course :) OK > * (extremely minor) some people (like me :)) consider dead reads like > the readb() call in e100_write_flush OK > * (major?) 
Aren't there some clunky e100 adapters that don't do MMIO? > Do we care? Not that I'm aware of. Current e100 doesn't support them if they're out there. > * I would love to see feedback from people testing this > driver on ppc64 and sparc64, particularly. Me too. Things seem to work on ppc (Mac) and ia64. > * (style, minor) My eyes would prefer functions like e100_hw_reset to > have a few more blank lines in them, spreading code+comment > blocks out a bit. OK > * (extremely minor) one wonders if you really need the write flush in > mdio_ctrl. If the flush is removed, the same net effect > appears to occur. Good catch. > * (boring but needed) convert all the magic numbers in e100_configure > into constants, or at least add comments describing the magic > numbers. If the value is just one bit, you might simply append "/* > true */", for example. The general idea is to make the "member name = > value" list a little bit more readable to somebody who doesn't know the > hardware, and struct config, intimately. That _was_ boring. > * IIRC Donald's MII phy scanning code scans MII phy ids like this: > 1..31,0. Or maybe 1..31, and then 0 iff no MII phys were found. In > general I would prefer to follow his eepro100.c probe order. > Some phys need this because they will report on both phy id #0 (which > is magical) and phy id #(non-zero). Donald would know more than me, here. [kernel] eepro100 gets the ID from the eeprom, so no scanning there. Current e100 goes 1, 0..31, which is what we've always done, IIRC. > * Is it easy to support MII phy interrupts? It would be nice > to get a callback that was handled immediately, on phys that > do support such interrupts. I don't see those being passed through and handled by the MAC. > * do we care about spinlocks around the update_stats and > get_stats code? Not sure. update_stats runs in a timer callback. Can get_stats jump in? 
> * (bugs) in e100_up, you should undo mod_timer [major] and > netif_start_queue [minor], if request_irq fails. And maybe stop the > receiver, too? OK > * for all constants 0xffffffff (and others as well if you so choose), > prefer the C99 suffix to a cast. This is particularly relevant in > pci_set_dma_mask calls, where one should be using 0xffffffffULL, but > applies to other constants as well. I didn't see any other constant casts besides the pci_set_dma_mask call. That one is fixed. > * (potential races) in e100_probe, you want to call > register_netdev as basically the last operation that can > fail, if possible. Particularly, you need to move the > PCI API operations above register_netdev. > Remember, register_netdev winds up calling /sbin/hotplug, > which in turn calls programs that will want to start using > the interface. So you need to have everything set up by > that point, really. OK (nice catch). > * in e100_probe, "if(nic->csr == 0UL) {" should really just test for > NULL, because ioremap is defined to return a pointer... OK > * (minor) use a netif_msg_xxx wrapper/constant in > e100_init_module test? Can't - don't have nic->msg_enable allocated yet. 
:( -scott From jgarzik@pobox.com Mon Aug 4 22:29:55 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 04 Aug 2003 22:30:40 -0700 (PDT) Received: from www.linux.org.uk (IDENT:h2Rxu3GU7PMeJPrMvDTL9MOVi3QhHI88@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h755TsFl022666 for ; Mon, 4 Aug 2003 22:29:55 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19juOO-00060o-CI; Tue, 05 Aug 2003 06:29:52 +0100 Message-ID: <3F2F40C5.9070601@pobox.com> Date: Tue, 05 Aug 2003 01:29:41 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "Feldman, Scott" CC: netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4536 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Feldman, Scott wrote: >>* (API) Does the out-of-tx-resources condition in >>e100_xmit_frame ever really happen? I am under the >>impression that returning non-zero in ->hard_start_xmit >>results in the packet sometimes being requeued and >>sometimes dropped. I prefer to guarantee a more-steady >>state, by simply dropping the packet unconditionally, >>when this uncommon condition occurs. So, I would >>a) mark the failure condition with unlikely(), and >>b) if the condition occurs, simply drop the packet >>(tx_dropped++, kfree >>skb), and return zero. > > > Stop the queue also? > > if(unlikely(e100_exec_cb(nic, skb, e100_xmit_prepare) == -ENOMEM)) { > netif_stop_queue(netdev); > nic->net_stats.tx_dropped++; > dev_kfree_skb(skb); > return 0; > } Yes. 
I would also printk(KERN_ERR "we have a bug!") or somesuch, like several other drivers do, too. >>* IIRC Donald's MII phy scanning code scans MII phy ids like this: >>1..31,0. Or maybe 1..31, and then 0 iff no MII phys were found. In >>general I would prefer to follow his eepro100.c probe order. >>Some phys need this because they will report on both phy id #0 (which >>is magical) and phy id #(non-zero). Donald would know more than me, > > here. > > [kernel] eepro100 gets the ID from the eeprom, so no scanning there. > Current e100 goes 1, 0..31, which is what we've always done, IIRC. hmmm. I prefer the phy scanning to checking eeprom, since it reduces the chance of eeprom screwups. However, I still think there's some issue related to phy id #0. Oh well, fine for now, I guess. >>* do we care about spinlocks around the update_stats and >>get_stats code? > > > Not sure. update_stats runs in a timer callback. Can get_stats jump > in? Well, the ->get_stats only returns a pointer to the stats, which are then accessed in an unlocked manner. Since the net stats are unsigned longs, asynchronously reading and updating them isn't a big deal in practice. >>* (minor) use a netif_msg_xxx wrapper/constant in >>e100_init_module test? > > > Can't - don't have nic->msg_enable allocated yet. :( You could always use "(1 << debug) - 1"... :) I dunno if it's worth worrying about. Jeff From davem@redhat.com Tue Aug 5 00:21:41 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 00:22:20 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h757L0Fl030028 for ; Tue, 5 Aug 2003 00:21:41 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id AAA22135; Tue, 5 Aug 2003 00:16:31 -0700 Date: Tue, 5 Aug 2003 00:16:31 -0700 From: "David S. 
Miller" To: "Feldman, Scott" Cc: jgarzik@pobox.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release Message-Id: <20030805001631.2fb55f38.davem@redhat.com> In-Reply-To: References: X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4537 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Mon, 4 Aug 2003 20:45:08 -0700 "Feldman, Scott" wrote: > > * I would love to see feedback from people testing this > > driver on ppc64 and sparc64, particularly. > > Me too. Things seem to work on ppc (Mac) and ia64. This gets things building on sparc64, I'll stick an e100 into my workstation and use it for everything for a while using this driver. --- Makefile.~1~ 2003-08-04 20:20:42.000000000 -0700 +++ Makefile 2003-08-05 00:12:29.000000000 -0700 @@ -96,10 +96,15 @@ endif # pick a compiler -ifneq (,$(findstring egcs-2.91.66, $(shell cat /proc/version))) - CC := kgcc gcc cc +ARCH := $(shell uname -m | sed 's/i.86/i386/') +ifeq ($(ARCH),sparc64) +CC := $(shell if gcc -m64 -S -o /dev/null -xc /dev/null >/dev/null 2>&1; then echo gcc; else echo sparc64-linux-gcc; fi ) else - CC := gcc cc + ifneq (,$(findstring egcs-2.91.66, $(shell cat /proc/version))) + CC := kgcc gcc cc + else + CC := gcc cc + endif endif test_cc = $(shell which $(cc) > /dev/null 2>&1 && echo $(cc)) CC := $(foreach cc, $(CC), $(test_cc)) @@ -198,10 +203,30 @@ # we need to know what platform the driver is being built on # some additional features are only built on Intel platforms -ARCH := $(shell uname -m | sed 's/i.86/i386/') ifeq ($(ARCH),alpha) CFLAGS += -ffixed-8 -mno-fp-regs endif +ifeq ($(ARCH),sparc64) + NEW_GCC := $(shell if $(CC) -m64 -mcmodel=medlow -S -o /dev/null -xc /dev/null >/dev/null 2>&1; then echo y; else echo n; fi; ) + UNDECLARED_REGS := 
$(shell if $(CC) -c -x assembler /dev/null -Wa,--help | grep undeclared-regs > /dev/null; then echo y; else echo n; fi; ) + INLINE_LIMIT := $(shell if $(CC) -m64 -finline-limit=100000 -S -o /dev/null -xc /dev/null >/dev/null 2>&1; then echo y; else echo n; fi; ) + ifneq ($(UNDECLARED_REGS),y) + CC_UNDECL = + else + CC_UNDECL = -Wa,--undeclared-regs + endif + ifneq ($(NEW_GCC),y) + CFLAGS += -pipe -mno-fpu -mtune=ultrasparc -mmedlow \ + -ffixed-g4 -fcall-used-g5 -fcall-used-g7 -Wno-sign-compare + else + CFLAGS += -m64 -pipe -mno-fpu -mcpu=ultrasparc -mcmodel=medlow \ + -ffixed-g4 -fcall-used-g5 -fcall-used-g7 -Wno-sign-compare \ + $(CC_UNDECL) + endif + ifeq ($(INLINE_LIMIT),y) + CFLAGS := $(CFLAGS) -finline-limit=100000 + endif +endif # depmod version for rpm builds DEPVER := $(shell /sbin/depmod -V 2>/dev/null | awk 'BEGIN {FS="."} NR==1 {print $$2}') --- e100.c.~1~ 2003-08-04 20:20:42.000000000 -0700 +++ e100.c 2003-08-05 00:13:23.000000000 -0700 @@ -150,6 +150,7 @@ #include #include #include +#include #include "kcompat.h" From felix@allot.com Tue Aug 5 01:23:06 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 01:23:14 -0700 (PDT) Received: from mxout3.netvision.net.il (mxout3.netvision.net.il [194.90.9.24]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h758N4Fl001990 for ; Tue, 5 Aug 2003 01:23:05 -0700 Received: from exg.allot.com ([199.203.223.202]) by mxout3.netvision.net.il (iPlanet Messaging Server 5.2 HotFix 1.14 (built Mar 18 2003)) with ESMTP id <0HJ50039I0M9WZ@mxout3.netvision.net.il> for netdev@oss.sgi.com; Tue, 05 Aug 2003 11:22:57 +0300 (IDT) Received: from allot.com (199.203.223.201 [199.203.223.201]) by exg.allot.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id QG1CBB3A; Tue, 05 Aug 2003 11:25:57 +0200 Date: Tue, 05 Aug 2003 11:23:22 +0300 From: Felix Radensky Subject: Re: e100 "Ferguson" release To: Ben Greear Cc: Jeff Garzik , "Feldman, Scott" , netdev@oss.sgi.com Message-id: 
<3F2F697A.2020708@allot.com> Organization: Allot Communications Ltd. MIME-version: 1.0 Content-type: text/plain; charset=us-ascii; format=flowed Content-transfer-encoding: 7BIT X-Accept-Language: en-us, en User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030208 Netscape/7.02 References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> X-archive-position: 4538 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: felix@allot.com Precedence: bulk X-list: netdev I've also noticed that the number of hard_start_xmit failures in e1000 has increased significantly in version 5.1.13-k1. In version 5.0.43-k1 the number of failures was much smaller. Felix. Ben Greear wrote: > > > > With e100 and e1000, I see the very large numbers of the > hard_start_xmit failure > when running very high packets-per-second rates (small packets). > I see virtually no failures with tulip. pktgen knows how to re-queue, > but it's > curious it has to so often. For code that does not requeue, this > could be even > more of a bummer. 
> > > From kuznet@ms2.inr.ac.ru Tue Aug 5 06:41:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 06:41:07 -0700 (PDT) Received: from dub.inr.ac.ru (dub.inr.ac.ru [193.233.7.105]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75Df0Fl005543 for ; Tue, 5 Aug 2003 06:41:01 -0700 Received: (from kuznet@localhost) by dub.inr.ac.ru (8.6.13/ANK) id RAA28267; Tue, 5 Aug 2003 17:40:42 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200308051340.RAA28267@dub.inr.ac.ru> Subject: [PATCH] repairing rtcache killer To: davem@redhat.com, Robert.Olsson@data.slu.se, netdev@oss.sgi.com Date: Tue, 5 Aug 2003 17:40:42 +0400 (MSD) X-Mailer: ELM [version 2.5 PL6] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4539 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Hello! Alexey # This is a BitKeeper generated patch for the following project: # Project Name: Linux kernel tree # This patch format is intended for GNU patch command version 2.5 or higher. # This patch includes the following deltas: # ChangeSet 1.1613 -> 1.1614 # net/ipv4/route.c 1.66 -> 1.67 # # The following is the BitKeeper ChangeSet Log # -------------------------------------------- # 03/08/05 kuznet@mops.inr.ac.ru 1.1614 # route.c: # [IPV4] Repair calculation of rtcache entries score # # Two serious and interesting mistakes were made in the patch of 2003-06-16. # 1. Variance of hash chain turned out to be unexpectedly high, so truncation # chain length at <=ip_rt_gc_elasticity results in strong growth of # cache misses. Set the threshould to 2*ip_rt_gc_elasticity. # And continue to think how to switch to mode when lots of cache # entries are used once or twice, so truncation should be done at 1. # 2. The selection rt_score() function based on use count resulted in killing # new fresh entries. 
Actually, it is clear when minimal brain efforts # are applied. :-) So, switch to scoring using last used time, which # should give real LRU behaviour. # -------------------------------------------- # diff -Nru a/net/ipv4/route.c b/net/ipv4/route.c --- a/net/ipv4/route.c Tue Aug 5 17:37:41 2003 +++ b/net/ipv4/route.c Tue Aug 5 17:37:41 2003 @@ -463,7 +463,9 @@ */ static inline u32 rt_score(struct rtable *rt) { - u32 score = rt->u.dst.__use; + u32 score = jiffies - rt->u.dst.lastuse; + + score = ~score & ~(3<<30); if (rt_valuable(rt)) score |= (1<<31); @@ -807,8 +809,7 @@ * The second limit is less certain. At the moment it allows * only 2 entries per bucket. We will see. */ - if (chain_length > ip_rt_gc_elasticity || - (chain_length > 1 && !(min_score & (1<<31)))) { + if (chain_length > 2*ip_rt_gc_elasticity) { *candp = cand->u.rt_next; rt_free(cand); } From vnuorval@tcs.hut.fi Tue Aug 5 07:20:09 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 07:20:17 -0700 (PDT) Received: from mail.tcs.hut.fi (mail.tcs.hut.fi [130.233.215.20]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75EJwFl007784 for ; Tue, 5 Aug 2003 07:19:59 -0700 Received: from rhea.tcs.hut.fi (rhea.tcs.hut.fi [130.233.215.147]) by mail.tcs.hut.fi (Postfix) with ESMTP id E93028001CD; Tue, 5 Aug 2003 16:42:32 +0300 (EEST) Received: from rhea.tcs.hut.fi (localhost [127.0.0.1]) by rhea.tcs.hut.fi (8.12.3/8.12.3/Debian-5) with ESMTP id h75DgW5L031191; Tue, 5 Aug 2003 16:42:32 +0300 Received: from localhost (vnuorval@localhost) by rhea.tcs.hut.fi (8.12.3/8.12.3/Debian-5) with ESMTP id h75DgWhQ031187; Tue, 5 Aug 2003 16:42:32 +0300 Date: Tue, 5 Aug 2003 16:42:32 +0300 (EEST) From: Ville Nuorvala To: davem@redhat.com Cc: netdev@oss.sgi.com Subject: [PATCH] IPV6: Fix bugs in ip6ip6_tnl_xmit() In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-377318441-99616309-1060088089=:30970" Content-ID: X-archive-position: 4540 X-ecartis-version: Ecartis v1.0.0 Sender: 
netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: vnuorval@tcs.hut.fi Precedence: bulk X-list: netdev This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. Send mail to mime@docserver.cac.washington.edu for more info. ---377318441-99616309-1060088089=:30970 Content-Type: TEXT/PLAIN; CHARSET=US-ASCII Content-ID: Hi, There were two bugs in ip6ip6_tnl_xmit() which are fixed in this patch (made against Linux 2.6.0-test2 + cset 1.1612): - ip6_tunnel must give its own getfrag function to ip6_append_data() - fix dst leakage when encapsulated packet too big Please apply! Thanks, Ville -- Ville Nuorvala Research Assistant, Institute of Digital Communications, Helsinki University of Technology email: vnuorval@tcs.hut.fi, phone: +358 (0)9 451 5257 ---377318441-99616309-1060088089=:30970 Content-Type: TEXT/PLAIN; charset=US-ASCII; name="ip6_tnl_xmit.patch" Content-Transfer-Encoding: BASE64 Content-ID: Content-Description: Content-Disposition: attachment; filename="ip6_tnl_xmit.patch" ZGlmZiAtTnVyIC0tZXhjbHVkZT1SQ1MgLS1leGNsdWRlPUNWUyAtLWV4Y2x1 ZGU9U0NDUyAtLWV4Y2x1ZGU9Qml0S2VlcGVyIC0tZXhjbHVkZT1DaGFuZ2VT ZXQgbGludXgtMi41Lk9MRC9uZXQvaXB2Ni9pcDZfdHVubmVsLmMgbGludXgt Mi41L25ldC9pcHY2L2lwNl90dW5uZWwuYw0KLS0tIGxpbnV4LTIuNS5PTEQv bmV0L2lwdjYvaXA2X3R1bm5lbC5jCVR1ZSBBdWcgIDUgMTU6MTU6MDcgMjAw Mw0KKysrIGxpbnV4LTIuNS9uZXQvaXB2Ni9pcDZfdHVubmVsLmMJVHVlIEF1 ZyAgNSAxNTo0NTo0MSAyMDAzDQpAQCAtNjIxLDYgKzYyMSwxNCBAQA0KIAly ZXR1cm4gb3B0Ow0KIH0NCiANCitzdGF0aWMgaW50IA0KK2lwNmlwNl9nZXRm cmFnKHZvaWQgKmZyb20sIGNoYXIgKnRvLCBpbnQgb2Zmc2V0LCBpbnQgbGVu LCBpbnQgb2RkLCANCisJCXN0cnVjdCBza19idWZmICpza2IpDQorew0KKwlt ZW1jcHkodG8sIChjaGFyICopIGZyb20gKyBvZmZzZXQsIGxlbik7DQorCXJl dHVybiAwOw0KK30NCisNCiAvKioNCiAgKiBpcDZpcDZfdG5sX2FkZHJfY29u ZmxpY3QgLSBjb21wYXJlIHBhY2tldCBhZGRyZXNzZXMgdG8gdHVubmVsJ3Mg b3duDQogICogICBAdDogdGhlIG91dGdvaW5nIHR1bm5lbCBkZXZpY2UNCkBA 
IC03NTUsOSArNzYzLDkgQEANCiAJfQ0KIAlpZiAoc2tiLT5sZW4gPiBtdHUp IHsNCiAJCWljbXB2Nl9zZW5kKHNrYiwgSUNNUFY2X1BLVF9UT09CSUcsIDAs IG10dSwgZGV2KTsNCi0JCWdvdG8gdHhfZXJyX29wdF9yZWxlYXNlOw0KKwkJ Z290byB0eF9lcnJfZHN0X3JlbGVhc2U7DQogCX0NCi0JZXJyID0gaXA2X2Fw cGVuZF9kYXRhKHNrLCBpcF9nZW5lcmljX2dldGZyYWcsIHNrYi0+bmgucmF3 LCBza2ItPmxlbiwgMCwNCisJZXJyID0gaXA2X2FwcGVuZF9kYXRhKHNrLCBp cDZpcDZfZ2V0ZnJhZywgc2tiLT5uaC5yYXcsIHNrYi0+bGVuLCAwLA0KIAkJ CSAgICAgIHQtPnBhcm1zLmhvcF9saW1pdCwgb3B0LCAmZmwsIA0KIAkJCSAg ICAgIChzdHJ1Y3QgcnQ2X2luZm8gKilkc3QsIE1TR19ET05UV0FJVCk7DQog DQpAQCAtNzg1LDcgKzc5Myw2IEBADQogCXJldHVybiAwOw0KIHR4X2Vycl9k c3RfcmVsZWFzZToNCiAJZHN0X3JlbGVhc2UoZHN0KTsNCi10eF9lcnJfb3B0 X3JlbGVhc2U6DQogCWlmIChvcHQgJiYgb3B0ICE9IG9yaWdfb3B0KQ0KIAkJ c29ja19rZnJlZV9zKHNrLCBvcHQsIG9wdC0+dG90X2xlbik7DQogdHhfZXJy X2ZyZWVfZmxfbGJsOg0KZGlmZiAtTnVyIC0tZXhjbHVkZT1SQ1MgLS1leGNs dWRlPUNWUyAtLWV4Y2x1ZGU9U0NDUyAtLWV4Y2x1ZGU9Qml0S2VlcGVyIC0t ZXhjbHVkZT1DaGFuZ2VTZXQgbGludXgtMi41Lk9MRC9uZXQvbmV0c3ltcy5j IGxpbnV4LTIuNS9uZXQvbmV0c3ltcy5jDQotLS0gbGludXgtMi41Lk9MRC9u ZXQvbmV0c3ltcy5jCVR1ZSBBdWcgIDUgMTU6MTU6MDMgMjAwMw0KKysrIGxp bnV4LTIuNS9uZXQvbmV0c3ltcy5jCVR1ZSBBdWcgIDUgMTM6NTg6MzQgMjAw Mw0KQEAgLTQ4MiwxMCArNDgyLDggQEANCiBFWFBPUlRfU1lNQk9MKHN5c2N0 bF9tYXhfc3luX2JhY2tsb2cpOw0KICNlbmRpZg0KIA0KLSNlbmRpZg0KLQ0K LSNpZiBkZWZpbmVkIChDT05GSUdfSVBWNl9NT0RVTEUpIHx8IGRlZmluZWQg KENPTkZJR19JUF9TQ1RQX01PRFVMRSkgfHwgZGVmaW5lZCAoQ09ORklHX0lQ VjZfVFVOTkVMX01PRFVMRSkNCiBFWFBPUlRfU1lNQk9MKGlwX2dlbmVyaWNf Z2V0ZnJhZyk7DQorDQogI2VuZGlmDQogDQogRVhQT1JUX1NZTUJPTCh0Y3Bf cmVhZF9zb2NrKTsNCg== ---377318441-99616309-1060088089=:30970-- From scott.feldman@intel.com Tue Aug 5 07:29:07 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 07:29:10 -0700 (PDT) Received: from hermes.jf.intel.com (fmr05.intel.com [134.134.136.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75ET6Fl008508 for ; Tue, 5 Aug 2003 07:29:06 -0700 Received: from petasus.jf.intel.com (petasus.jf.intel.com [10.7.209.6]) by hermes.jf.intel.com 
(8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h75EQrw18915 for ; Tue, 5 Aug 2003 14:26:54 GMT Received: from orsmsxvs041.jf.intel.com (orsmsxvs041.jf.intel.com [192.168.65.54]) by petasus.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h75EO1c21378 for ; Tue, 5 Aug 2003 14:24:01 GMT Received: from orsmsx332.amr.corp.intel.com ([192.168.65.60]) by orsmsxvs041.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080507285928763 ; Tue, 05 Aug 2003 07:28:59 -0700 Received: from orsmsx402.amr.corp.intel.com ([192.168.65.208]) by orsmsx332.amr.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Tue, 5 Aug 2003 07:28:59 -0700 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: RE: e100 "Ferguson" release Date: Tue, 5 Aug 2003 07:28:58 -0700 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: e100 "Ferguson" release Thread-Index: AcNbEppm4ua1VvpURRC1DmNP6YxZrAASpVCA From: "Feldman, Scott" To: "Jeff Garzik" Cc: X-OriginalArrivalTime: 05 Aug 2003 14:28:59.0646 (UTC) FILETIME=[E90BD9E0:01C35B5D] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h75ET6Fl008508 X-archive-position: 4541 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: scott.feldman@intel.com Precedence: bulk X-list: netdev > > if(unlikely(e100_exec_cb(nic, skb, e100_xmit_prepare) == -ENOMEM)) { > > netif_stop_queue(netdev); > > nic->net_stats.tx_dropped++; > > dev_kfree_skb(skb); > > return 0; > > } > > Yes. I would also printk(KERN_ERR "we have a bug!") or > somesuch, like several other drivers do, too. It's there, sorry, was trying to keep the code snippet small. > >>* (minor) use a netif_msg_xxx wrapper/constant in > >>e100_init_module test? 
> > > > > > Can't - don't have nic->msg_enable allocated yet. :( > > You could always use "(1 << debug) - 1"... :) I dunno if it's worth > worrying about. (1 << debug) - 1) & NETIF_MSG_DRV is what's there now. -scott From david-b@pacbell.net Tue Aug 5 08:14:55 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 08:15:03 -0700 (PDT) Received: from mta7.pltn13.pbi.net (mta7.pltn13.pbi.net [64.164.98.8]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75FEsFl012006 for ; Tue, 5 Aug 2003 08:14:55 -0700 Received: from pacbell.net (ppp-67-118-247-188.dialup.pltn13.pacbell.net [67.118.247.188]) by mta7.pltn13.pbi.net (8.12.9/8.12.3) with ESMTP id h75FEgeC006162; Tue, 5 Aug 2003 08:14:43 -0700 (PDT) Message-ID: <3F2E9A09.7000707@pacbell.net> Date: Mon, 04 Aug 2003 10:38:17 -0700 From: David Brownell User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225 X-Accept-Language: en-us, en, fr MIME-Version: 1.0 To: "David S. Miller" CC: greearb@candelatech.com, jgarzik@pobox.com, scott.feldman@intel.com, netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release References: <3F2CA65F.8060105@pobox.com> <3F2CBA71.2070503@candelatech.com> <20030803003239.4257ef24.davem@redhat.com> <3F2DCE56.6030601@pacbell.net> <20030803200851.7d46a605.davem@redhat.com> <3F2DD6BD.7070504@pacbell.net> <20030803204642.684c6075.davem@redhat.com> <3F2DDC3A.2040707@pacbell.net> <20030803211333.12839f66.davem@redhat.com> In-Reply-To: <20030803211333.12839f66.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4542 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: david-b@pacbell.net Precedence: bulk X-list: netdev David S. Miller wrote: > > For example, what do USB block device drivers do when -ENOMEM comes > back? Do they just drop the request on the floor? 
No, rather they > resubmit the request later without the scsi/block layer knowing > anything about what happened, right? I didn't notice any code to retry, but I did see some that morphed ENOMEM into a generic scsi "error". Scsi presumably does something more or less intelligent then. The network layer on the other hand _does_ have hooks for retrying, not that they're used much. - Dave From scott.feldman@intel.com Tue Aug 5 08:19:33 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 08:19:39 -0700 (PDT) Received: from hermes.jf.intel.com (fmr05.intel.com [134.134.136.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75FJWFl013136 for ; Tue, 5 Aug 2003 08:19:32 -0700 Received: from petasus.jf.intel.com (petasus.jf.intel.com [10.7.209.6]) by hermes.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h75FHKw28917 for ; Tue, 5 Aug 2003 15:17:20 GMT Received: from orsmsxvs041.jf.intel.com (orsmsxvs041.jf.intel.com [192.168.65.54]) by petasus.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h75FERc21323 for ; Tue, 5 Aug 2003 15:14:27 GMT Received: from orsmsx331.amr.corp.intel.com ([192.168.65.56]) by orsmsxvs041.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080508192629015 ; Tue, 05 Aug 2003 08:19:26 -0700 Received: from orsmsx402.amr.corp.intel.com ([192.168.65.208]) by orsmsx331.amr.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Tue, 5 Aug 2003 08:19:26 -0700 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: RE: e100 "Ferguson" release Date: Tue, 5 Aug 2003 08:19:25 -0700 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: e100 "Ferguson" release Thread-Index: AcNbKsmh/6+8q5R8RsOBg+Su6l5c9gANLF4Q From: "Feldman, Scott" To: "Felix Radensky" , "Ben Greear" Cc: "Jeff Garzik" , X-OriginalArrivalTime: 05 Aug 2003 
15:19:26.0092 (UTC) FILETIME=[F4F2DCC0:01C35B64] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h75FJWFl013136 X-archive-position: 4543 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: scott.feldman@intel.com Precedence: bulk X-list: netdev > I've also noticed that the number of hard_start_xmit failures > in e1000 has increased significantly in version 5.1.13-k1. In > version 5.0.43-k1 the number of failures was much smaller. Interesting. Felix, would you undo the change[1] below in 5.1.13-k1 and see what happens? With the change below, 5.1.13 would be more aggressive on Tx cleanup, so we'll be quicker waking the queue than before. -scott for(i = 0; i < E1000_MAX_INTR; i++) - if(!e1000_clean_rx_irq(adapter) && + if(!e1000_clean_rx_irq(adapter) & !e1000_clean_tx_irq(adapter)) break; [1] Something still bothers me about this new form where we're mixing a bit-wise operator with logical operands. Should this bother me? 
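On the question in [1], a minimal sketch of how the two forms differ. The helpers clean_rx() and clean_tx() and the call counters below are hypothetical stand-ins for e1000_clean_rx_irq() and e1000_clean_tx_irq(), assumed only to return nonzero when they found work; this is not the driver code. With &&, the right-hand call is skipped whenever the left-hand test is already false. With &, both calls always run, and because ! yields exactly 0 or 1 the bitwise result has the same truth value as the logical one.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the e1000 cleanup routines: each returns
 * true if it found work to do and counts how often it was called. */
static int rx_calls, tx_calls;

static bool clean_rx(bool had_work) { rx_calls++; return had_work; }
static bool clean_tx(bool had_work) { tx_calls++; return had_work; }

/* Old form: && short-circuits, so clean_tx() is skipped when RX had work. */
static bool idle_logical(bool rx_work, bool tx_work)
{
    return !clean_rx(rx_work) && !clean_tx(tx_work);
}

/* New form: & evaluates both operands, so clean_tx() always runs.
 * !x is always 0 or 1, so the result matches the logical form. */
static bool idle_bitwise(bool rx_work, bool tx_work)
{
    return !clean_rx(rx_work) & !clean_tx(tx_work);
}
```

Both helpers return the same value for the same inputs; only the number of clean_tx() calls differs, which is the behaviour change the diff above relies on.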
From garzik@gtf.org Tue Aug 5 08:24:25 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 08:24:30 -0700 (PDT) Received: from havoc.gtf.org (host-64-213-145-173.atlantasolutions.com [64.213.145.173] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75FOOFl013723 for ; Tue, 5 Aug 2003 08:24:25 -0700 Received: by havoc.gtf.org (Postfix, from userid 500) id EBC946663; Tue, 5 Aug 2003 11:24:18 -0400 (EDT) Date: Tue, 5 Aug 2003 11:24:18 -0400 From: Jeff Garzik To: "Feldman, Scott" Cc: Felix Radensky , Ben Greear , netdev@oss.sgi.com Subject: Re: e100 "Ferguson" release Message-ID: <20030805152418.GB6695@gtf.org> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.3.28i X-archive-position: 4544 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev On Tue, Aug 05, 2003 at 08:19:25AM -0700, Feldman, Scott wrote: > > I've also noticed that the number of hard_start_xmit failures > > in e1000 has increased significantly in version 5.1.13-k1. In > > version 5.0.43-k1 the number of failures was much smaller. > > Interesting. Felix, would you undo the change[1] below in 5.1.13-k1 and > see what happens? With the change below, 5.1.13 would be more > aggressive on Tx cleanup, so we'll be quicker waking the queue than > before. > > -scott > > for(i = 0; i < E1000_MAX_INTR; i++) > - if(!e1000_clean_rx_irq(adapter) && > + if(!e1000_clean_rx_irq(adapter) & > !e1000_clean_tx_irq(adapter)) > break; > > [1] Something still bothers me about this new form where we're mixing a > bit-wise operator with logical operands. Should this bother me? 
It doesn't matter to the compiler if you make it explicit: unsigned int rx_work = e1000_clean_rx_irq(adapter); unsigned int tx_work = e1000_clean_tx_irq(adapter); if (!rx_work && !tx_work) break; From Robert.Olsson@data.slu.se Tue Aug 5 10:08:34 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 10:08:42 -0700 (PDT) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75H8UFl019597 for ; Tue, 5 Aug 2003 10:08:33 -0700 Received: (from robert@localhost) by robur.slu.se (8.9.3p2/8.9.3) id TAA27260; Tue, 5 Aug 2003 19:08:23 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16175.58503.134543.310459@robur.slu.se> Date: Tue, 5 Aug 2003 19:08:23 +0200 To: kuznet@ms2.inr.ac.ru Cc: davem@redhat.com, Robert.Olsson@data.slu.se, netdev@oss.sgi.com Subject: [PATCH] repairing rtcache killer In-Reply-To: <200308051340.RAA28267@dub.inr.ac.ru> References: <200308051340.RAA28267@dub.inr.ac.ru> X-Mailer: VM 6.92 under Emacs 19.34.1 X-archive-position: 4545 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev kuznet@ms2.inr.ac.ru writes: > # Two serious and interesting mistakes were made in the patch of 2003-06-16. > # 1. Variance of hash chain turned out to be unexpectedly high, so truncation > # chain length at <=ip_rt_gc_elasticity results in strong growth of > # cache misses. Set the threshold to 2*ip_rt_gc_elasticity. > # And continue to think how to switch to mode when lots of cache > # entries are used once or twice, so truncation should be done at 1. Hello! I guess the setting was very much affected by the DoS attack discussion, which indicated very different flow lengths compared to our actual measurements for Uppsala University, which had 65 pkts per new DST entry.
Probably due to the "new" applications and lots of students. For autotuning I think we can get help from the ratio of warm cache hits (in_hit) to misses (in_slow_tot) to set the threshold for trimming hash chain lengths. Cheers. --ro From ebiederm@xmission.com Tue Aug 5 10:22:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 10:22:46 -0700 (PDT) Received: from frodo.biederman.org (ebiederm.dsl.xmission.com [166.70.28.69]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75HMYFl020574 for ; Tue, 5 Aug 2003 10:22:34 -0700 Received: (from eric@localhost) by frodo.biederman.org (8.9.3/8.9.3) id LAA05791; Tue, 5 Aug 2003 11:19:09 -0600 To: Werner Almesberger Cc: Jeff Garzik , Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <20030804162433.L5798@almesberger.net> From: ebiederm@xmission.com (Eric W. Biederman) Date: 05 Aug 2003 11:19:09 -0600 In-Reply-To: <20030804162433.L5798@almesberger.net> Message-ID: Lines: 68 User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 4546 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ebiederm@xmission.com Precedence: bulk X-list: netdev Werner Almesberger writes: > Eric W. Biederman wrote: > > The optimized for low latency cases seem to have a strong > > market in clusters. > > Clusters have captive, no, _desperate_ customers ;-) And it > seems that people are just as happy putting MPI as their > transport on top of all those link-layer technologies. MPI is not a transport. It is an interface like the Berkeley sockets layer. The semantics it wants right now are usually mapped to TCP/IP when used on an IP network. Though I suspect SCTP might be a better fit.
But right now nothing in the IP stack is a particularly good fit. Right now there is a very strong feeling among most of the people using and developing on clusters that by and large what they are doing is not of interest to the general kernel community, and so has no chance of going in. So you see hack piled on top of hack piled on top of hack. Mostly I think that is less true, at least if they can stand the process of severe code review and cleaning up their code. If we can put in code to scale the kernel to 64 processors, NIC drivers for fast interconnects and a few similar tweaks can't hurt either. But of course to get through the peer review process people need to understand what they are doing. > > There is one place in low latency communications that I can think > > of where TCP/IP is not the proper solution. For low latency > > communication the checksum is at the wrong end of the packet. > > That's one of the few things ATM's AAL5 got right. But in the end, > I think it doesn't really matter. At 1 Gbps, an MTU-sized packet > flies by within 13 us. At 10 Gbps, it's only 1.3 us. At that point, > you may well treat it as an atomic unit. So store and forward of packets in a 3 layer switch hierarchy, at 1.3 us per copy. 1.3us to the NIC + 1.3us to the first switch chip + 1.3us to the second switch chip + 1.3us to the top level switch chip + 1.3us to a middle layer switch chip + 1.3us to the receiving NIC + 1.3us to the receiver. 1.3us * 7 = 9.1us to deliver a packet to the other side. That is still quite painful. Right now I can get better latencies over any of the cluster interconnects. I think 5 us is the current low end, with the high end being about 1 us. Quite often in MPI when a message is sent the program cannot continue until the reply is received. Possibly this is a fundamental problem with the application programming model, encouraging applications to be latency sensitive.
But it is a well established API and programming paradigm so it has to be lived with. All of this is pretty much the reverse of the TOE case. Things are latency sensitive because real work needs to be done. And the more latency you have the slower that work gets done. A lot of the NICs which are used for MPI tend to be smart for two reasons. 1) So they can do source routing. 2) So they can safely export some of their interface to user space, so in the fast path they can bypass the kernel. Eric From ebiederm@xmission.com Tue Aug 5 10:29:21 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 10:29:26 -0700 (PDT) Received: from frodo.biederman.org (ebiederm.dsl.xmission.com [166.70.28.69]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75HTLFl021304 for ; Tue, 5 Aug 2003 10:29:21 -0700 Received: (from eric@localhost) by frodo.biederman.org (8.9.3/8.9.3) id LAA05837; Tue, 5 Aug 2003 11:25:57 -0600 To: "David S. Miller" Cc: Werner Almesberger jgarzik@pobox.com, niv@us.ibm.com, netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <20030804162433.L5798@almesberger.net> <20030804122632.65ba2122.davem@redhat.com> From: ebiederm@xmission.com (Eric W. Biederman) Date: 05 Aug 2003 11:25:57 -0600 In-Reply-To: <20030804122632.65ba2122.davem@redhat.com> Message-ID: Lines: 48 User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 4547 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ebiederm@xmission.com Precedence: bulk X-list: netdev "David S. Miller" writes: > On Mon, 4 Aug 2003 16:24:33 -0300 > Werner Almesberger wrote: > > > Eric W. 
Biederman wrote: > > > There is one place in low latency communications that I can think > > > of where TCP/IP is not the proper solution. For low latency > > > communication the checksum is at the wrong end of the packet. > > > > That's one of the few things ATM's AAL5 got right. > > Let's recall how long the IFF_TRAILERS hack from BSD :-) Putting the variable length headers on the end of a packet? Or was that something other than RFC893? I think IPv6 solves that much more cleanly by simply deleting them. > > But in the end, I think it doesn't really matter. > > I tend to agree on this one. > > And on the transmit side if you have more than 1 pending TX frame, you > can always be prefetching the next one into the fifo so that by the > time the medium is ready all the checksum bits have been done. For large data transmissions that happens. > In fact I'd be surprised if current generation 1g/10g cards are not > doing something like this. Well at this point before I propose anything concrete I suspect I need to profile some actual application and see how things go. But from a very latency sensitive perspective, I would be surprised if the problem goes away with faster technology. For now I am happy just to insert the peculiar thought that latency across the entire cluster/lan is of great importance to some applications. 
Eric From ingo.oeser@informatik.tu-chemnitz.de Tue Aug 5 10:33:50 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 10:33:54 -0700 (PDT) Received: from meg.hrz.tu-chemnitz.de (meg.hrz.tu-chemnitz.de [134.109.132.57]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75HXmFl021963 for ; Tue, 5 Aug 2003 10:33:50 -0700 Received: from tnt188.hrz.tu-chemnitz.de ([134.109.156.188] helo=nightmaster.csn.tu-chemnitz.de ident=root) by meg.hrz.tu-chemnitz.de with esmtp (Exim 4.12) id 19jhum-0003rB-00; Mon, 04 Aug 2003 18:10:30 +0200 Received: (from ioe@localhost) by nightmaster.csn.tu-chemnitz.de (8.9.1/8.9.1) id QAA23195; Mon, 4 Aug 2003 16:36:06 +0200 Date: Mon, 4 Aug 2003 16:36:06 +0200 From: Ingo Oeser To: Jeff Garzik Cc: Nivedita Singhvi , Werner Almesberger , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030804163606.Q639@nightmaster.csn.tu-chemnitz.de> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: <3F2C0C44.6020002@pobox.com>; from jgarzik@pobox.com on Sat, Aug 02, 2003 at 03:08:52PM -0400 X-Spam-Score: -5.0 (-----) X-Scanner: exiscan for exim4 (http://duncanthrax.net/exiscan/) *19jhum-0003rB-00*vFn3hP0u2Ks* X-archive-position: 4548 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ingo.oeser@informatik.tu-chemnitz.de Precedence: bulk X-list: netdev Hi Jeff, On Sat, Aug 02, 2003 at 03:08:52PM -0400, Jeff Garzik wrote: > So, fix the other end of the pipeline too, otherwise this fast network > stuff is flashy but pointless.
If you want to serve up data from disk, > then start creating PCI cards that have both Serial ATA and ethernet > connectors on them :) Cut out the middleman of the host CPU and host > memory bus instead of offloading portions of TCP that do not need to be > offloaded. Exactly what I suggested: sys_ioroute() "Providing generic pipelines and io routing as Linux service" Msg-ID: <20030718134235.K639@nightmaster.csn.tu-chemnitz.de> on linux-kernel and linux-fsdevel Be my guest. I know, that you mean doing it in hardware, but you cannot accelerate sth. which the kernel doesn't do ;-) Regards Ingo Oeser From miller@techsource.com Tue Aug 5 12:15:39 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 12:15:50 -0700 (PDT) Received: from kinesis.swishmail.com (qmailr@kinesis.swishmail.com [209.10.110.86]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75JFcFl028716 for ; Tue, 5 Aug 2003 12:15:39 -0700 Received: (qmail 42158 invoked by uid 89); 5 Aug 2003 19:15:37 -0000 Received: from unknown (HELO techsource.com) (209.208.48.130) by kinesis.swishmail.com with SMTP; 5 Aug 2003 19:15:37 -0000 Message-ID: <3F300549.60800@techsource.com> Date: Tue, 05 Aug 2003 15:28:09 -0400 From: Timothy Miller User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020823 Netscape/7.0 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Larry McVoy CC: David Lang , Erik Andersen , Werner Almesberger , Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Nivedita Singhvi Subject: Re: TOE brain dump References: <20030803194011.GA8324@work.bitmover.com> <20030803203051.GA9057@work.bitmover.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4549 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: miller@techsource.com Precedence: bulk X-list: netdev Larry McVoy wrote: > On Sun, Aug 03, 2003 at 01:13:24PM -0700, David Lang 
wrote: > >>2. router nodes that have access to main memory (PCI card running linux >>acting as a router/firewall/VPN to offload the main CPU's) > > > I can get an entire machine, memory, disk, > Ghz CPU, case, power supply, > cdrom, floppy, onboard enet extra net card for routing, for $250 or less, > quantity 1, shipped to my door. > > Why would I want to spend money on some silly offload card when I can get > the whole PC for less than the card? Physical space? Power usage? Heat dissipation? Optimization for the specific task? Fast, low latency communication between CPU and device (ie. local bus)? Maintenance? Lots of reasons why one might pay more for the offload card. If you're cheap, you'll just use the software stack and a $10 NIC and just live with the corresponding CPU usage. If you're a performance freak, you'll spend whatever you have to to squeeze out every last bit of performance you can. Mind you, another option, if you're dealing with the kind of load that requires that much network performance, is to use redundant servers, like Google. No one server is exceptionally fast, but if not many people are using it, it's fast enough. From shemminger@osdl.org Tue Aug 5 14:46:38 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 14:46:44 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75LkbFl005904 for ; Tue, 5 Aug 2003 14:46:37 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h75LkNI01529; Tue, 5 Aug 2003 14:46:23 -0700 Date: Tue, 5 Aug 2003 14:46:22 -0700 From: Stephen Hemminger To: Ralf Baechle , "David S.
Miller" Cc: linux-hams@vger.kernel.org, netdev@oss.sgi.com Subject: [PATCH] (2/2) Convert ROSE to seq_file Message-Id: <20030805144622.100f208d.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.3claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4551 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev The existing ROSE /proc interface has no module owner, and doesn't check for bounds overflow. Easier to just convert it to the seq_file wrapper functions. This patch is against 2.6.0-test2 (offsets assume earlier patch). diff -Nru a/include/net/rose.h b/include/net/rose.h --- a/include/net/rose.h Tue Aug 5 14:36:07 2003 +++ b/include/net/rose.h Tue Aug 5 14:36:07 2003 @@ -140,6 +140,9 @@ #define rose_sk(__sk) ((rose_cb *)(__sk)->sk_protinfo) +/* Magic value indicating first entry in /proc (ie header) */ +#define ROSE_PROC_START ((void *) 1) + /* af_rose.c */ extern ax25_address rose_callsign; extern int sysctl_rose_restart_request_timeout; @@ -154,7 +157,7 @@ extern int sysctl_rose_window_size; extern int rosecmp(rose_address *, rose_address *); extern int rosecmpm(rose_address *, rose_address *, unsigned short); -extern char *rose2asc(rose_address *); +extern const char *rose2asc(const rose_address *); extern struct sock *rose_find_socket(unsigned int, struct rose_neigh *); extern void rose_kill_by_neigh(struct rose_neigh *); extern unsigned int rose_new_lci(struct rose_neigh *); @@ -193,6 +196,9 @@ /* rose_route.c */ extern struct rose_neigh *rose_loopback_neigh; +extern struct file_operations rose_neigh_fops; +extern struct file_operations rose_nodes_fops; 
+extern struct file_operations rose_routes_fops; extern int rose_add_loopback_neigh(void); extern int rose_add_loopback_node(rose_address *); @@ -207,9 +213,6 @@ extern int rose_rt_ioctl(unsigned int, void *); extern void rose_link_failed(ax25_cb *, int); extern int rose_route_frame(struct sk_buff *, ax25_cb *); -extern int rose_nodes_get_info(char *, char **, off_t, int); -extern int rose_neigh_get_info(char *, char **, off_t, int); -extern int rose_routes_get_info(char *, char **, off_t, int); extern void rose_rt_free(void); /* rose_subr.c */ diff -Nru a/net/rose/af_rose.c b/net/rose/af_rose.c --- a/net/rose/af_rose.c Tue Aug 5 14:36:07 2003 +++ b/net/rose/af_rose.c Tue Aug 5 14:36:07 2003 @@ -39,6 +39,7 @@ #include #include #include +#include #include #include #include @@ -56,8 +57,8 @@ int sysctl_rose_maximum_vcs = ROSE_DEFAULT_MAXVC; int sysctl_rose_window_size = ROSE_DEFAULT_WINDOW_SIZE; -static HLIST_HEAD(rose_list); -static spinlock_t rose_list_lock = SPIN_LOCK_UNLOCKED; +HLIST_HEAD(rose_list); +spinlock_t rose_list_lock = SPIN_LOCK_UNLOCKED; static struct proto_ops rose_proto_ops; @@ -66,7 +67,7 @@ /* * Convert a ROSE address into text. 
*/ -char *rose2asc(rose_address *addr) +const char *rose2asc(const rose_address *addr) { static char buffer[11]; @@ -1332,29 +1333,57 @@ return 0; } -static int rose_get_info(char *buffer, char **start, off_t offset, int length) +#ifdef CONFIG_PROC_FS +static void *rose_info_start(struct seq_file *seq, loff_t *pos) { + int i; struct sock *s; struct hlist_node *node; - struct net_device *dev; - const char *devname, *callsign; - int len = 0; - off_t pos = 0; - off_t begin = 0; spin_lock_bh(&rose_list_lock); + if (*pos == 0) + return ROSE_PROC_START; + + i = 1; + sk_for_each(s, node, &rose_list) { + if (i == *pos) + return s; + ++i; + } + return NULL; +} - len += sprintf(buffer, "dest_addr dest_call src_addr src_call dev lci neigh st vs vr va t t1 t2 t3 hb idle Snd-Q Rcv-Q inode\n"); +static void *rose_info_next(struct seq_file *seq, void *v, loff_t *pos) +{ + ++*pos; - sk_for_each(s, node, &rose_list) { + return (v == ROSE_PROC_START) ? sk_head(&rose_list) + : sk_next((struct sock *)v); +} + +static void rose_info_stop(struct seq_file *seq, void *v) +{ + spin_unlock_bh(&rose_list_lock); +} + +static int rose_info_show(struct seq_file *seq, void *v) +{ + if (v == ROSE_PROC_START) + seq_puts(seq, + "dest_addr dest_call src_addr src_call dev lci neigh st vs vr va t t1 t2 t3 hb idle Snd-Q Rcv-Q inode\n"); + + else { + struct sock *s = v; rose_cb *rose = rose_sk(s); + const char *devname, *callsign; + const struct net_device *dev = rose->device; - if ((dev = rose->device) == NULL) + if (!dev) devname = "???"; else devname = dev->name; - - len += sprintf(buffer + len, "%-10s %-9s ", + + seq_printf(seq, "%-10s %-9s ", rose2asc(&rose->dest_addr), ax2asc(&rose->dest_call)); @@ -1363,7 +1392,8 @@ else callsign = ax2asc(&rose->source_call); - len += sprintf(buffer + len, "%-10s %-9s %-5s %3.3X %05d %d %d %d %d %3lu %3lu %3lu %3lu %3lu %3lu/%03lu %5d %5d %ld\n", + seq_printf(seq, + "%-10s %-9s %-5s %3.3X %05d %d %d %d %d %3lu %3lu %3lu %3lu %3lu %3lu/%03lu %5d %5d %ld\n", 
rose2asc(&rose->source_addr), callsign, devname, @@ -1383,27 +1413,32 @@ atomic_read(&s->sk_wmem_alloc), atomic_read(&s->sk_rmem_alloc), s->sk_socket ? SOCK_INODE(s->sk_socket)->i_ino : 0L); - - pos = begin + len; - - if (pos < offset) { - len = 0; - begin = pos; - } - - if (pos > offset + length) - break; } - spin_unlock_bh(&rose_list_lock); - *start = buffer + (offset - begin); - len -= (offset - begin); + return 0; +} - if (len > length) len = length; +static struct seq_operations rose_info_seqops = { + .start = rose_info_start, + .next = rose_info_next, + .stop = rose_info_stop, + .show = rose_info_show, +}; - return len; +static int rose_info_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &rose_info_seqops); } +static struct file_operations rose_info_fops = { + .owner = THIS_MODULE, + .open = rose_info_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; +#endif /* CONFIG_PROC_FS */ + static struct net_proto_family rose_family_ops = { .family = PF_ROSE, .create = rose_create, @@ -1499,10 +1534,11 @@ rose_add_loopback_neigh(); - proc_net_create("rose", 0, rose_get_info); - proc_net_create("rose_neigh", 0, rose_neigh_get_info); - proc_net_create("rose_nodes", 0, rose_nodes_get_info); - proc_net_create("rose_routes", 0, rose_routes_get_info); + proc_net_fops_create("rose", S_IRUGO, &rose_info_fops); + proc_net_fops_create("rose_neigh", S_IRUGO, &rose_neigh_fops); + proc_net_fops_create("rose_nodes", S_IRUGO, &rose_nodes_fops); + proc_net_fops_create("rose_routes", S_IRUGO, &rose_routes_fops); + return 0; } module_init(rose_proto_init); diff -Nru a/net/rose/rose_route.c b/net/rose/rose_route.c --- a/net/rose/rose_route.c Tue Aug 5 14:36:07 2003 +++ b/net/rose/rose_route.c Tue Aug 5 14:36:07 2003 @@ -35,12 +35,13 @@ #include #include #include +#include static unsigned int rose_neigh_no = 1; static struct rose_node *rose_node_list; static spinlock_t rose_node_list_lock = SPIN_LOCK_UNLOCKED; -static struct rose_neigh 
*rose_neigh_list; +struct rose_neigh *rose_neigh_list; static spinlock_t rose_neigh_list_lock = SPIN_LOCK_UNLOCKED; static struct rose_route *rose_route_list; static spinlock_t rose_route_list_lock = SPIN_LOCK_UNLOCKED; @@ -1066,165 +1067,248 @@ return res; } -int rose_nodes_get_info(char *buffer, char **start, off_t offset, int length) +#ifdef CONFIG_PROC_FS + +static void *rose_node_start(struct seq_file *seq, loff_t *pos) { struct rose_node *rose_node; - int len = 0; - off_t pos = 0; - off_t begin = 0; - int i; + int i = 1; spin_lock_bh(&rose_neigh_list_lock); + if (*pos == 0) + return ROSE_PROC_START; + + for (rose_node = rose_node_list; rose_node && i < *pos; + rose_node = rose_node->next, ++i); + + return (i == *pos) ? rose_node : NULL; +} - len += sprintf(buffer, "address mask n neigh neigh neigh\n"); +static void *rose_node_next(struct seq_file *seq, void *v, loff_t *pos) +{ + ++*pos; + + return (v == ROSE_PROC_START) ? rose_node_list + : ((struct rose_node *)v)->next; +} - for (rose_node = rose_node_list; rose_node != NULL; rose_node = rose_node->next) { +static void rose_node_stop(struct seq_file *seq, void *v) +{ + spin_unlock_bh(&rose_neigh_list_lock); +} + +static int rose_node_show(struct seq_file *seq, void *v) +{ + int i; + + if (v == ROSE_PROC_START) + seq_puts(seq, "address mask n neigh neigh neigh\n"); + else { + const struct rose_node *rose_node = v; /* if (rose_node->loopback) { - len += sprintf(buffer + len, "%-10s %04d 1 loopback\n", + seq_printf(seq, "%-10s %04d 1 loopback\n", rose2asc(&rose_node->address), rose_node->mask); } else { */ - len += sprintf(buffer + len, "%-10s %04d %d", + seq_printf(seq, "%-10s %04d %d", rose2asc(&rose_node->address), rose_node->mask, rose_node->count); for (i = 0; i < rose_node->count; i++) - len += sprintf(buffer + len, " %05d", + seq_printf(seq, " %05d", rose_node->neighbour[i]->number); - len += sprintf(buffer + len, "\n"); + seq_puts(seq, "\n"); /* } */ + } + return 0; +} - pos = begin + len; +static 
struct seq_operations rose_node_seqops = { + .start = rose_node_start, + .next = rose_node_next, + .stop = rose_node_stop, + .show = rose_node_show, +}; + +static int rose_nodes_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &rose_node_seqops); +} + +struct file_operations rose_nodes_fops = { + .owner = THIS_MODULE, + .open = rose_nodes_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; - if (pos < offset) { - len = 0; - begin = pos; - } +static void *rose_neigh_start(struct seq_file *seq, loff_t *pos) +{ + struct rose_neigh *rose_neigh; + int i = 1; - if (pos > offset + length) - break; - } - spin_unlock_bh(&rose_neigh_list_lock); + spin_lock_bh(&rose_neigh_list_lock); + if (*pos == 0) + return ROSE_PROC_START; - *start = buffer + (offset - begin); - len -= (offset - begin); + for (rose_neigh = rose_neigh_list; rose_neigh && i < *pos; + rose_neigh = rose_neigh->next, ++i); - if (len > length) - len = length; + return (i == *pos) ? rose_neigh : NULL; +} - return len; +static void *rose_neigh_next(struct seq_file *seq, void *v, loff_t *pos) +{ + ++*pos; + + return (v == ROSE_PROC_START) ? 
rose_neigh_list + : ((struct rose_neigh *)v)->next; } -int rose_neigh_get_info(char *buffer, char **start, off_t offset, int length) +static void rose_neigh_stop(struct seq_file *seq, void *v) { - struct rose_neigh *rose_neigh; - int len = 0; - off_t pos = 0; - off_t begin = 0; - int i; + spin_unlock_bh(&rose_neigh_list_lock); +} - spin_lock_bh(&rose_neigh_list_lock); +static int rose_neigh_show(struct seq_file *seq, void *v) +{ + int i; - len += sprintf(buffer, "addr callsign dev count use mode restart t0 tf digipeaters\n"); + if (v == ROSE_PROC_START) + seq_puts(seq, + "addr callsign dev count use mode restart t0 tf digipeaters\n"); + else { + struct rose_neigh *rose_neigh = v; - for (rose_neigh = rose_neigh_list; rose_neigh != NULL; rose_neigh = rose_neigh->next) { /* if (!rose_neigh->loopback) { */ - len += sprintf(buffer + len, "%05d %-9s %-4s %3d %3d %3s %3s %3lu %3lu", - rose_neigh->number, - (rose_neigh->loopback) ? "RSLOOP-0" : ax2asc(&rose_neigh->callsign), - rose_neigh->dev ? rose_neigh->dev->name : "???", - rose_neigh->count, - rose_neigh->use, - (rose_neigh->dce_mode) ? "DCE" : "DTE", - (rose_neigh->restarted) ? "yes" : "no", - ax25_display_timer(&rose_neigh->t0timer) / HZ, - ax25_display_timer(&rose_neigh->ftimer) / HZ); - - if (rose_neigh->digipeat != NULL) { - for (i = 0; i < rose_neigh->digipeat->ndigi; i++) - len += sprintf(buffer + len, " %s", ax2asc(&rose_neigh->digipeat->calls[i])); - } - - len += sprintf(buffer + len, "\n"); - - pos = begin + len; - - if (pos < offset) { - len = 0; - begin = pos; - } + seq_printf(seq, "%05d %-9s %-4s %3d %3d %3s %3s %3lu %3lu", + rose_neigh->number, + (rose_neigh->loopback) ? "RSLOOP-0" : ax2asc(&rose_neigh->callsign), + rose_neigh->dev ? rose_neigh->dev->name : "???", + rose_neigh->count, + rose_neigh->use, + (rose_neigh->dce_mode) ? "DCE" : "DTE", + (rose_neigh->restarted) ? 
"yes" : "no", + ax25_display_timer(&rose_neigh->t0timer) / HZ, + ax25_display_timer(&rose_neigh->ftimer) / HZ); + + if (rose_neigh->digipeat != NULL) { + for (i = 0; i < rose_neigh->digipeat->ndigi; i++) + seq_printf(seq, " %s", ax2asc(&rose_neigh->digipeat->calls[i])); + } - if (pos > offset + length) - break; - /* } */ + seq_puts(seq, "\n"); } + return 0; +} - spin_unlock_bh(&rose_neigh_list_lock); - - *start = buffer + (offset - begin); - len -= (offset - begin); - if (len > length) - len = length; +static struct seq_operations rose_neigh_seqops = { + .start = rose_neigh_start, + .next = rose_neigh_next, + .stop = rose_neigh_stop, + .show = rose_neigh_show, +}; - return len; +static int rose_neigh_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &rose_neigh_seqops); } -int rose_routes_get_info(char *buffer, char **start, off_t offset, int length) +struct file_operations rose_neigh_fops = { + .owner = THIS_MODULE, + .open = rose_neigh_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + + +static void *rose_route_start(struct seq_file *seq, loff_t *pos) { struct rose_route *rose_route; - int len = 0; - off_t pos = 0; - off_t begin = 0; + int i = 1; spin_lock_bh(&rose_route_list_lock); + if (*pos == 0) + return ROSE_PROC_START; - len += sprintf(buffer, "lci address callsign neigh <-> lci address callsign neigh\n"); + for (rose_route = rose_route_list; rose_route && i < *pos; + rose_route = rose_route->next, ++i); - for (rose_route = rose_route_list; rose_route != NULL; rose_route = rose_route->next) { - if (rose_route->neigh1 != NULL) { - len += sprintf(buffer + len, "%3.3X %-10s %-9s %05d ", - rose_route->lci1, - rose2asc(&rose_route->src_addr), - ax2asc(&rose_route->src_call), - rose_route->neigh1->number); - } else { - len += sprintf(buffer + len, "000 * * 00000 "); - } + return (i == *pos) ? 
rose_route : NULL; +} + +static void *rose_route_next(struct seq_file *seq, void *v, loff_t *pos) +{ + ++*pos; + + return (v == ROSE_PROC_START) ? rose_route_list + : ((struct rose_route *)v)->next; +} - if (rose_route->neigh2 != NULL) { - len += sprintf(buffer + len, "%3.3X %-10s %-9s %05d\n", +static void rose_route_stop(struct seq_file *seq, void *v) +{ + spin_unlock_bh(&rose_route_list_lock); +} + +static int rose_route_show(struct seq_file *seq, void *v) +{ + if (v == ROSE_PROC_START) + seq_puts(seq, + "lci address callsign neigh <-> lci address callsign neigh\n"); + else { + struct rose_route *rose_route = v; + + if (rose_route->neigh1) + seq_printf(seq, + "%3.3X %-10s %-9s %05d ", + rose_route->lci1, + rose2asc(&rose_route->src_addr), + ax2asc(&rose_route->src_call), + rose_route->neigh1->number); + else + seq_puts(seq, + "000 * * 00000 "); + + if (rose_route->neigh2) + seq_printf(seq, + "%3.3X %-10s %-9s %05d\n", rose_route->lci2, rose2asc(&rose_route->dest_addr), ax2asc(&rose_route->dest_call), rose_route->neigh2->number); - } else { - len += sprintf(buffer + len, "000 * * 00000\n"); - } - - pos = begin + len; - - if (pos < offset) { - len = 0; - begin = pos; + else + seq_puts(seq, + "000 * * 00000\n"); } + return 0; +} - if (pos > offset + length) - break; - } - - spin_unlock_bh(&rose_route_list_lock); - - *start = buffer + (offset - begin); - len -= (offset - begin); - - if (len > length) - len = length; +static struct seq_operations rose_route_seqops = { + .start = rose_route_start, + .next = rose_route_next, + .stop = rose_route_stop, + .show = rose_route_show, +}; + +static int rose_route_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &rose_route_seqops); +} + +struct file_operations rose_routes_fops = { + .owner = THIS_MODULE, + .open = rose_route_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; - return len; -} +#endif /* CONFIG_PROC_FS */ /* * Release all memory associated with ROSE routing 
structures. From shemminger@osdl.org Tue Aug 5 14:46:35 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 14:46:44 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75LkYFl005901 for ; Tue, 5 Aug 2003 14:46:35 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h75LkHI01525; Tue, 5 Aug 2003 14:46:18 -0700 Date: Tue, 5 Aug 2003 14:46:17 -0700 From: Stephen Hemminger To: Ralf Baechle , "David S. Miller" Cc: linux-hams@vger.kernel.org, netdev@oss.sgi.com Subject: [PATCH 2.6.0-test2] (1/2) Dynamically allocate net_device structures for ROSE Message-Id: <20030805144617.2e856d6d.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.3claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4550 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev This patch changes the ROSE protocol to allocate an array of pointers and to allocate each network device separately. This sets up a later change in which net_device objects are released on last use, which may be after the module is unloaded. The patch is against 2.6.0-test2 (though this code hasn't changed in a long time). Allocation is done via alloc_netdev, so the dev->priv area is already reserved and doesn't need to be allocated separately.
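As a rough userspace illustration of the allocation pattern described above (an array of pointers, with each object allocated separately and earlier allocations unwound on failure), here is a minimal sketch. It is not the patch's code: plain calloc/free stand in for alloc_netdev/kfree, and the names struct fake_dev, devs_alloc, devs_free, and the fail_at parameter are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct net_device; not a kernel type. */
struct fake_dev {
    char name[16];
};

/* Free the first n devices and then the pointer array itself. */
static void devs_free(struct fake_dev **devs, int n)
{
    if (devs == NULL)
        return;
    while (--n >= 0)
        free(devs[n]);
    free(devs);
}

/*
 * Allocate an array of n pointers, then each device separately.
 * fail_at simulates an allocation failure at that index (-1: never).
 * Returns NULL on failure, with everything allocated so far released.
 */
static struct fake_dev **devs_alloc(int n, int fail_at)
{
    struct fake_dev **devs;
    int i;

    assert(n > 0);
    devs = calloc((size_t)n, sizeof(*devs));
    if (devs == NULL)
        return NULL;

    for (i = 0; i < n; i++) {
        struct fake_dev *dev =
            (i == fail_at) ? NULL : calloc(1, sizeof(*dev));

        if (dev == NULL) {
            devs_free(devs, i); /* unwind entries 0..i-1 and the array */
            return NULL;
        }
        snprintf(dev->name, sizeof(dev->name), "rose%d", i);
        devs[i] = dev;
    }
    return devs;
}
```

The point of the sketch is only that the error path releases both the partially filled entries and the pointer array, and returns once the unwind loop has finished.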
diff -Nru a/include/net/rose.h b/include/net/rose.h --- a/include/net/rose.h Tue Aug 5 14:35:52 2003 +++ b/include/net/rose.h Tue Aug 5 14:35:52 2003 @@ -163,7 +163,7 @@ /* rose_dev.c */ extern int rose_rx_ip(struct sk_buff *, struct net_device *); -extern int rose_init(struct net_device *); +extern void rose_setup(struct net_device *); /* rose_in.c */ extern int rose_process_rx_frame(struct sock *, struct sk_buff *); diff -Nru a/net/rose/af_rose.c b/net/rose/af_rose.c --- a/net/rose/af_rose.c Tue Aug 5 14:35:52 2003 +++ b/net/rose/af_rose.c Tue Aug 5 14:35:52 2003 @@ -43,7 +43,7 @@ #include #include -int rose_ndevs = 10; +static int rose_ndevs = 10; int sysctl_rose_restart_request_timeout = ROSE_DEFAULT_T0; int sysctl_rose_call_request_timeout = ROSE_DEFAULT_T1; @@ -56,7 +56,7 @@ int sysctl_rose_maximum_vcs = ROSE_DEFAULT_MAXVC; int sysctl_rose_window_size = ROSE_DEFAULT_WINDOW_SIZE; -HLIST_HEAD(rose_list); +static HLIST_HEAD(rose_list); static spinlock_t rose_list_lock = SPIN_LOCK_UNLOCKED; static struct proto_ops rose_proto_ops; @@ -1435,7 +1435,7 @@ .notifier_call = rose_device_event, }; -static struct net_device *dev_rose; +static struct net_device **dev_rose; static const char banner[] = KERN_INFO "F6FBB/G4KLX ROSE for Linux. 
Version 0.62 for AX25.037 Linux 2.4\n"; @@ -1450,17 +1450,39 @@ return -1; } - if ((dev_rose = kmalloc(rose_ndevs * sizeof(struct net_device), GFP_KERNEL)) == NULL) { + dev_rose = kmalloc(rose_ndevs * sizeof(struct net_device *), GFP_KERNEL); + if (dev_rose == NULL) { printk(KERN_ERR "ROSE: rose_proto_init - unable to allocate device structure\n"); return -1; } - memset(dev_rose, 0x00, rose_ndevs * sizeof(struct net_device)); + memset(dev_rose, 0x00, rose_ndevs * sizeof(struct net_device*)); + for (i = 0; i < rose_ndevs; i++) { + struct net_device *dev; + char name[IFNAMSIZ]; + + sprintf(name, "rose%d", i); + dev = alloc_netdev(sizeof(struct net_device_stats), + name, rose_setup); + if (!dev) { + printk(KERN_ERR "ROSE: rose_proto_init - unable to allocate memory\n"); + while (--i >= 0) + kfree(dev_rose[i]); + return -ENOMEM; + } + dev_rose[i] = dev; + } for (i = 0; i < rose_ndevs; i++) { - sprintf(dev_rose[i].name, "rose%d", i); - dev_rose[i].init = rose_init; - register_netdev(&dev_rose[i]); + if (register_netdev(dev_rose[i])) { + printk(KERN_ERR "ROSE: netdevice regeistration failed\n"); + while (--i >= 0) { + unregister_netdev(dev_rose[i]); + kfree(dev_rose[i]); + return -EIO; + } + } + } sock_register(&rose_family_ops); @@ -1518,10 +1540,11 @@ sock_unregister(PF_ROSE); for (i = 0; i < rose_ndevs; i++) { - if (dev_rose[i].priv != NULL) { - kfree(dev_rose[i].priv); - dev_rose[i].priv = NULL; - unregister_netdev(&dev_rose[i]); + struct net_device *dev = dev_rose[i]; + + if (dev) { + unregister_netdev(dev); + kfree(dev); } } diff -Nru a/net/rose/rose_dev.c b/net/rose/rose_dev.c --- a/net/rose/rose_dev.c Tue Aug 5 14:35:52 2003 +++ b/net/rose/rose_dev.c Tue Aug 5 14:35:52 2003 @@ -165,7 +165,7 @@ return (struct net_device_stats *)dev->priv; } -int rose_init(struct net_device *dev) +void rose_setup(struct net_device *dev) { SET_MODULE_OWNER(dev); dev->mtu = ROSE_MAX_PACKET_SIZE - 2; @@ -182,13 +182,5 @@ /* New-style flags. 
*/ dev->flags = 0; - - if ((dev->priv = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL)) == NULL) - return -ENOMEM; - - memset(dev->priv, 0, sizeof(struct net_device_stats)); - dev->get_stats = rose_get_stats; - - return 0; -}; +} From shemminger@osdl.org Tue Aug 5 14:57:16 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 14:57:23 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75LvFFl007116 for ; Tue, 5 Aug 2003 14:57:16 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h75LuwI03721; Tue, 5 Aug 2003 14:56:58 -0700 Date: Tue, 5 Aug 2003 14:56:58 -0700 From: Stephen Hemminger To: Ralf Baechle , "David S. Miller" Cc: linux-hams@vger.kernel.org, netdev@oss.sgi.com Subject: [PATCH] Fix use after free in AX.25 Message-Id: <20030805145658.1b3f194b.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.3claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4552 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev This patch is against 2.6.0-test2. The problem is that the ax25_destroy_socket function frees the socket buffer, but then ax25_release dereferences this causing an OOPS. To reproduce: modprobe ax25; ifconfig -a Replaced sk_free with sock_put which will free if this is the last reference. 
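The difference between freeing outright and dropping a reference can be sketched in userspace as follows. This only illustrates the reference-counting idea behind the fix: struct obj, obj_new, obj_hold, and obj_put are hypothetical names, and the kernel's sock_hold/sock_put use an atomic counter rather than a plain int.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object; stands in for struct sock. */
struct obj {
    int refcnt;
    int *freed; /* set to 1 when the object is released (demo only) */
};

static struct obj *obj_new(int *freed)
{
    struct obj *o = malloc(sizeof(*o));

    if (o != NULL) {
        o->refcnt = 1; /* the creator holds the first reference */
        o->freed = freed;
        *freed = 0;
    }
    return o;
}

/* Take an additional reference. */
static void obj_hold(struct obj *o)
{
    o->refcnt++;
}

/* Drop one reference; free only when the last one goes away. */
static void obj_put(struct obj *o)
{
    assert(o->refcnt > 0);
    if (--o->refcnt == 0) {
        *o->freed = 1;
        free(o);
    }
}
```

With this scheme, a release path that still holds its own reference can keep using the object after a destroy path has dropped a different one; the memory goes away only on the final put.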
diff -urNp -X dontdiff net-2.5/net/ax25/af_ax25.c linux-2.5-net/net/ax25/af_ax25.c --- net-2.5/net/ax25/af_ax25.c 2003-08-04 09:32:21.000000000 -0700 +++ linux-2.5-net/net/ax25/af_ax25.c 2003-08-05 14:34:21.000000000 -0700 @@ -349,7 +349,7 @@ void ax25_destroy_socket(ax25_cb *ax25) ax25->timer.data = (unsigned long)ax25; add_timer(&ax25->timer); } else { - sk_free(ax25->sk); + sock_put(ax25->sk); } } else { ax25_free_cb(ax25); @@ -944,15 +944,13 @@ static int ax25_release(struct socket *s switch (ax25->state) { case AX25_STATE_0: ax25_disconnect(ax25, 0); - ax25_destroy_socket(ax25); - break; + goto drop; case AX25_STATE_1: case AX25_STATE_2: ax25_send_control(ax25, AX25_DISC, AX25_POLLON, AX25_COMMAND); ax25_disconnect(ax25, 0); - ax25_destroy_socket(ax25); - break; + goto drop; case AX25_STATE_3: case AX25_STATE_4: @@ -995,13 +993,16 @@ static int ax25_release(struct socket *s sk->sk_shutdown |= SEND_SHUTDOWN; sk->sk_state_change(sk); sock_set_flag(sk, SOCK_DEAD); - ax25_destroy_socket(ax25); + goto drop; } sock->sk = NULL; sk->sk_socket = NULL; /* Not used, but we should do this */ release_sock(sk); - + return 0; + drop: + release_sock(sk); + ax25_destroy_socket(ax25); return 0; } From shemminger@osdl.org Tue Aug 5 15:01:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 15:01:30 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75M1QFl007706 for ; Tue, 5 Aug 2003 15:01:26 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h75M1AI04456; Tue, 5 Aug 2003 15:01:10 -0700 Date: Tue, 5 Aug 2003 15:01:10 -0700 From: Stephen Hemminger To: Henner Eisen , "David S. Miller" Cc: linux-x25@vger.kernel.org, netdev@oss.sgi.com Subject: [PATCH] Fix X.25 use after free. 
Message-Id: <20030805150110.0e2753ab.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.3claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4553 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev The conversion from cli/sti to locking in X.25 must not have been tested on a real SMP machine with memory debugging enabled. It OOPSes right away if I do: modprobe x25; ifconfig -a The problem is that it dereferences the socket after it has already been freed. The fix is to let the later sock_put call in x25_destroy_socket do the free. Also, a goto is needed in x25_release to avoid touching the socket after it has been destroyed. This patch is against 2.6.0-test2.
diff -urNp -X dontdiff net-2.5/net/x25/af_x25.c linux-2.5-net/net/x25/af_x25.c --- net-2.5/net/x25/af_x25.c 2003-08-01 11:12:02.000000000 -0700 +++ linux-2.5-net/net/x25/af_x25.c 2003-08-05 12:14:42.000000000 -0700 @@ -350,8 +350,11 @@ void x25_destroy_socket(struct sock *sk) sk->sk_timer.function = x25_destroy_timer; sk->sk_timer.data = (unsigned long)sk; add_timer(&sk->sk_timer); - } else - sk_free(sk); + } else { + /* drop last reference so sock_put will free */ + __sock_put(sk); + } + release_sock(sk); sock_put(sk); } @@ -553,7 +556,7 @@ static int x25_release(struct socket *so case X25_STATE_2: x25_disconnect(sk, 0, 0, 0); x25_destroy_socket(sk); - break; + goto out; case X25_STATE_1: case X25_STATE_3: From felix@allot.com Tue Aug 5 15:14:46 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 15:14:49 -0700 (PDT) Received: from mxout1.netvision.net.il (mxout1.netvision.net.il [194.90.9.20]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75MEiFl008833 for ; Tue, 5 Aug 2003 15:14:45 -0700 Received: from exg.allot.com ([199.203.223.202]) by mxout1.netvision.net.il (iPlanet Messaging Server 5.2 HotFix 1.14 (built Mar 18 2003)) with ESMTP id <0HJ5001RIL0W5P@mxout1.netvision.net.il> for netdev@oss.sgi.com; Tue, 05 Aug 2003 18:43:44 +0300 (IDT) Received: from allot.com (199.203.223.201 [199.203.223.201]) by exg.allot.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id QG1CBDM4; Tue, 05 Aug 2003 18:46:45 +0200 Date: Tue, 05 Aug 2003 18:44:10 +0300 From: Felix Radensky Subject: Re: e100 "Ferguson" release To: "Feldman, Scott" Cc: Ben Greear , Jeff Garzik , netdev@oss.sgi.com Message-id: <3F2FD0CA.1080403@allot.com> Organization: Allot Communications Ltd. 
MIME-version: 1.0 Content-type: multipart/alternative; boundary="Boundary_(ID_Lg9l6CsjxHY6kAV0DHDKJw)" X-Accept-Language: en-us, en User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030208 Netscape/7.02 References: X-archive-position: 4554 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: felix@allot.com Precedence: bulk X-list: netdev --Boundary_(ID_Lg9l6CsjxHY6kAV0DHDKJw) Content-type: text/plain; charset=us-ascii; format=flowed Content-transfer-encoding: 7BIT Hi, Scott

This change seems to fix the problem.
Thanks a lot!

Felix.

Feldman, Scott wrote:
>> I've also noticed that the number of hard_start_xmit failures
>> in e1000 has increased significantly in version 5.1.13-k1. In
>> version 5.0.43-k1 the number of failures was much smaller.
>
> Interesting.  Felix, would you undo the change[1] below in 5.1.13-k1 and
> see what happens?  With the change below, 5.1.13 would be more
> aggressive on Tx cleanup, so we'll be quicker waking the queue than
> before.
>
> -scott
>
>         for(i = 0; i < E1000_MAX_INTR; i++)
> -               if(!e1000_clean_rx_irq(adapter) &&
> +               if(!e1000_clean_rx_irq(adapter) &
>                    !e1000_clean_tx_irq(adapter))
>                         break;
>
> [1] Something still bothers me about this new form where we're mixing a
> bit-wise operator with logical operands.  Should this bother me?
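
On the question in footnote [1]: because `!` always yields 0 or 1, `&` and `&&` produce the same truth value here. The practical difference is evaluation: `&&` skips its right operand when the left one is false, while `&` always evaluates both (in an order C leaves unspecified). A small standalone sketch, in which clean_rx, clean_tx, and the call counters are hypothetical stand-ins and not the driver's functions:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical call counters standing in for the driver's clean-up work. */
static int rx_calls, tx_calls;

static bool clean_rx(bool did_work) { rx_calls++; return did_work; }
static bool clean_tx(bool did_work) { tx_calls++; return did_work; }

/* Logical AND: clean_tx() is skipped whenever clean_rx() did work. */
static bool idle_logical(bool rx_work, bool tx_work)
{
    return !clean_rx(rx_work) && !clean_tx(tx_work);
}

/* Bitwise AND: both operands are always evaluated. */
static bool idle_bitwise(bool rx_work, bool tx_work)
{
    return !clean_rx(rx_work) & !clean_tx(tx_work);
}
```

Calling idle_logical(true, false) leaves tx_calls untouched, whereas idle_bitwise(true, false) increments both counters. That matches the remark above that the `&` form is more aggressive about Tx cleanup.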

  

--Boundary_(ID_Lg9l6CsjxHY6kAV0DHDKJw)-- From nf@hipac.org Tue Aug 5 15:23:25 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 15:23:34 -0700 (PDT) Received: from smtprelay02.ispgateway.de (smtprelay02.ispgateway.de [62.67.200.157]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75MNNFl009687 for ; Tue, 5 Aug 2003 15:23:24 -0700 Received: (qmail 18331 invoked from network); 5 Aug 2003 22:23:20 -0000 Received: from unknown (HELO portal.lan) (134300@[80.138.239.123]) (envelope-sender ) by smtprelay02.ispgateway.de (qmail-ldap-1.03) with SMTP for ; 5 Aug 2003 22:23:20 -0000 Received: from hipac.org (tmobile.lan [192.168.0.6]) by portal.lan (Postfix) with ESMTP id 235E14B0B6; Tue, 5 Aug 2003 22:46:13 +0200 (CEST) Message-ID: <3F302E04.1090503@hipac.org> Date: Wed, 06 Aug 2003 00:21:56 +0200 From: Michael Bellion and Thomas Heinz User-Agent: Mozilla/5.0 (X11; U; Linux i686; de-AT; rv:1.4) Gecko/20030714 Debian/1.4-2 X-Accept-Language: de, en MIME-Version: 1.0 To: hadi@cyberus.ca Cc: linux-net@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [RFC] High Performance Packet Classifiction for tc framework References: <200307141045.40999.nf@hipac.org> <1058328537.1797.24.camel@jzny.localdomain> <3F16A0E5.1080007@hipac.org> <1059934468.1103.41.camel@jzny.localdomain> <3F2E5CD6.4030500@hipac.org> <1060012260.1103.380.camel@jzny.localdomain> In-Reply-To: <1060012260.1103.380.camel@jzny.localdomain> X-Enigmail-Version: 0.76.2.0 X-Enigmail-Supports: pgp-inline, pgp-mime Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig2AA484285077C06548045724" X-archive-position: 4555 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nf@hipac.org Precedence: bulk X-list: netdev This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig2AA484285077C06548045724 Content-Type: text/plain; charset=us-ascii; format=flowed 
Content-Transfer-Encoding: 7bit Hi Jamal You wrote: > I promise i will. I dont think i will do it justice spending 5 minutes > on it. I take it you have written extensive docs too ;-> Of course ;-) Well, actually we are going to present an overview of the hipac algorithm at the netfilter developer workshop in Budapest. Hope to see you there. > Unfortunately it is more exciting to write code than documents. I almost > got someone to document at least its proper usage but they backed away > at the last minute. lol > I dont wanna go in a lot of details, but one important detail is that > keynodes can also lead to other hash tables. So you can split the packet > parsing across multiple hashes - this is where the comparison with > chains comes in. There are several ways to do this. I'll show you the > brute force way and you can make it more usable with "hashkey" and > "sample" operator. Stealing from your example: > > [example snipped] > > Makes sense? Yes, it does. Still the question is how to solve this generally. Consider the following example ruleset: 1) src ip 10.0.0.0/30 dst ip 20.0.0.0/20 2) src ip 10.0.0.0/28 dst ip 20.0.0.0/22 3) src ip 10.0.0.0/26 dst ip 20.0.0.0/24 4) src ip 10.0.0.0/24 dst ip 20.0.0.0/26 5) src ip 10.0.0.0/22 dst ip 20.0.0.0/28 6) src ip 10.0.0.0/20 dst ip 20.0.0.0/30 So you have 1 src ip hash and #buckets(src ip hash) many dst ip hashes. In order to achieve maximum performance you have to minimize the number of collisions in the hash buckets. How would you choose the hash function and what would the construction look like? In principle the tree of hashes approach is capable to express a general access list like ruleset, i.e. a set of terminal rules with different priorities. The problem is that the approach is only efficient if the number of collisions is O(1) -> no amortized analysis but rather per bucket. In theory you can do the following. Let's consider one dimension. 
The matches in one dimension form a set of elementary intervals which are overlapped by certain rules. Example: |------| |---------| |----------------| |------------------| |---------------| |----|---|--|---|-----|---|----|-------|--|------|-------| The '|-----|' reflect the matches and the bottom line represents the set of elementary intervals introduced by the matches. Now, you can decide for each elementary interval which rule matches since the rules are terminal and uniquely prioritized. The next step would be to create a hash with #elementary intervals many buckets and create a hash function which maps the keys to the appropriate buckets like in the picture. In this case you have exactly 1 entry per hash bucket. Sounds fine BUT it is not possible to generically deduce an easily (= fast) computable hash function with the described requirements. BTW, this approach can be extended to 2 or more dimensions where the hash function for each hash has to meet the requirement. Of course this information is not very helpful since the problem of defining appropriate hash functions remains ;) Obviously this way is not viable but supposedly the only one to achieve ultimate performance with the tree of hashes concept. BTW, the way hipac works is basically not so different from the idea described above but since we use efficient btrees we don't have to define hash functions. > sure position could be used as a priority. It is easier/intuitive to > just have explicit priorities. Merely a matter of taste. The way iptables and nf-hipac use priorities is somewhat more dynamic than the tc way because they are automatically adjusted if a rule is inserted in between others. > What "optimizes" could be a user interface or the thread i was talking > about earlier. Hm, this rebalancing is not clear to us. Do you want to rebalance the tree of hashes? 
This seems a little strange at the first glance because the performance of the tree of hashes is dominated by the number of collisions that need to be resolved and not the depth of the tree. > Is your plan to put this in other places other than Linux? Currently we are working on the integration in linux. In general the hipac core is OS and application independent, so basically it could also be used for some userspace program which is related to classification and of course in other OS's. Any special reason why you are asking this question? > So you got this thought from iptables and took it to the next level? Well, in order to support iptables matches and targets we had to create an appropriate abstraction for them on the hipac layer. This abstraction can also be used for tc classifiers if the tcf_result is ignored, i.e. you just consider whether the filter matched or not. > I am still not sure i understand why not use what already exists - but > i'll just say i dont see it right now. If hipac had no support for embedded classifiers you couldn't express a ruleset like: 1) [native hipac matches] [u32 filter] [classid] 2) [native hipac matches] [classid] You would have to construct rule 1) in a way that it "jumps" to an external u32 filter. Unfortunately, you cannot jump back to the hipac filter again in case the u32 filter does not match so rule 2) is unreachable. This problem is caused by the fact that cls_hipac can occur at most once per interface. > It doesnt appear harmful to leave it there without destroying it. > The next time someome adds a filter of the same protocol + priority, it > will already exist. If you want to be accurate (because it does get > destroyed when the init() fails), then destroy it but you need to put > checks for "incase we have added a new tcf_proto" which may not look > pretty. Is this causing you some discomfort? No, actually not. 
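The one-dimensional elementary-interval idea discussed earlier in this message can be sketched as below. This is an illustration only and not hipac's implementation: it uses a sorted array with binary search in place of the btrees mentioned above, and struct rule, struct table, table_build, table_lookup, and MAX_POINTS are hypothetical names. Rules are terminal, and a lower array index means higher priority.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical one-dimensional rule: matches keys in [lo, hi]. */
struct rule {
    unsigned lo, hi;
};

#define MAX_POINTS 64

/* starts[k] is the first key of elementary interval k; winner[k] is the
 * index of the highest-priority rule covering it, or -1 if none. */
struct table {
    unsigned starts[MAX_POINTS];
    int winner[MAX_POINTS];
    int n;
};

static int cmp_unsigned(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;

    return (x > y) - (x < y);
}

/* Build the table; assumes every hi is below UINT_MAX. */
static void table_build(struct table *t, const struct rule *rules, int nrules)
{
    int i, k, n = 0;

    assert(2 * nrules + 1 <= MAX_POINTS);
    t->starts[n++] = 0;                   /* keys below the first rule */
    for (i = 0; i < nrules; i++) {
        t->starts[n++] = rules[i].lo;     /* an interval begins at lo */
        t->starts[n++] = rules[i].hi + 1; /* and another right after hi */
    }
    qsort(t->starts, (size_t)n, sizeof(t->starts[0]), cmp_unsigned);

    /* Drop duplicate boundary points. */
    for (i = 1, k = 1; i < n; i++)
        if (t->starts[i] != t->starts[k - 1])
            t->starts[k++] = t->starts[i];
    t->n = k;

    /* All keys in one elementary interval match the same rules, so
     * testing the interval's first key is enough to pick the winner. */
    for (k = 0; k < t->n; k++) {
        t->winner[k] = -1;
        for (i = 0; i < nrules; i++)
            if (rules[i].lo <= t->starts[k] && t->starts[k] <= rules[i].hi) {
                t->winner[k] = i;
                break;
            }
    }
}

/* Binary search for the elementary interval containing key. */
static int table_lookup(const struct table *t, unsigned key)
{
    int lo = 0, hi = t->n - 1;

    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;

        if (t->starts[mid] <= key)
            lo = mid;
        else
            hi = mid - 1;
    }
    return t->winner[lo];
}
```

Each lookup costs O(log n) comparisons over the interval boundaries and needs no hash function, at the price of rebuilding the table when rules change.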
Regards, +-----------------------+----------------------+ | Michael Bellion | Thomas Heinz | | | | +-----------------------+----------------------+ | High Performance Packet Classification | | nf-hipac: http://www.hipac.org/ | +----------------------------------------------+ --------------enig2AA484285077C06548045724 Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.2 (GNU/Linux) Comment: Using GnuPG with Debian - http://enigmail.mozdev.org iD8DBQE/MC4FtXh2AYIMjggRAvm5AJ4r5t7eKXHNt/mWCIcS93+l/Gh+tgCdH82Z 76Nh+wx5v75reDsjfY1SJY4= =NW50 -----END PGP SIGNATURE----- --------------enig2AA484285077C06548045724-- From shemminger@osdl.org Tue Aug 5 15:43:54 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 15:44:05 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h75MhrFl011720 for ; Tue, 5 Aug 2003 15:43:54 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h75MhZI14668; Tue, 5 Aug 2003 15:43:36 -0700 Date: Tue, 5 Aug 2003 15:43:35 -0700 From: Stephen Hemminger To: Henner Eisen , "David S. Miller" , linux-x25@vger.kernel.org, netdev@oss.sgi.com Subject: [PATCH] X.25 async net_device fixup Message-Id: <20030805154335.7abfcb92.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.3claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4556 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev Convert X.25 async driver to have dynamic net_device's. 
This driver is a lot like SLIP so the code changes were similar. - Added similar locking to SLIP - replaced code that snooped for MTU changes with the net_device change mtu callback. - cleaned up the statistics by using the net_device_stats structure. Patch is against 2.6.0-test2. Not sure if anyone ever uses this. I tested by bringing up an x.25 async line using a modified version of slattach. diff -urNp -X dontdiff net-2.5/drivers/net/wan/x25_asy.c linux-2.5-net/drivers/net/wan/x25_asy.c --- net-2.5/drivers/net/wan/x25_asy.c 2003-08-01 11:11:42.000000000 -0700 +++ linux-2.5-net/drivers/net/wan/x25_asy.c 2003-07-31 13:22:41.000000000 -0700 @@ -34,81 +34,67 @@ #include #include "x25_asy.h" -typedef struct x25_ctrl { - struct x25_asy ctrl; /* X.25 things */ - struct net_device dev; /* the device */ -} x25_asy_ctrl_t; - -static x25_asy_ctrl_t **x25_asy_ctrls = NULL; - -int x25_asy_maxdev = SL_NRUNIT; /* Can be overridden with insmod! */ +static struct net_device **x25_asy_devs; +static int x25_asy_maxdev = SL_NRUNIT; MODULE_PARM(x25_asy_maxdev, "i"); MODULE_LICENSE("GPL"); static int x25_asy_esc(unsigned char *p, unsigned char *d, int len); static void x25_asy_unesc(struct x25_asy *sl, unsigned char c); +static void x25_asy_setup(struct net_device *dev); /* Find a free X.25 channel, and link in this `tty' line. */ -static inline struct x25_asy *x25_asy_alloc(void) +static struct x25_asy *x25_asy_alloc(void) { - x25_asy_ctrl_t *slp = NULL; + struct net_device *dev = NULL; + struct x25_asy *sl; int i; - if (x25_asy_ctrls == NULL) + if (x25_asy_devs == NULL) return NULL; /* Master array missing ! */ - for (i = 0; i < x25_asy_maxdev; i++) - { - slp = x25_asy_ctrls[i]; + for (i = 0; i < x25_asy_maxdev; i++) { + dev = x25_asy_devs[i]; + /* Not allocated ? */ - if (slp == NULL) + if (dev == NULL) break; + + sl = dev->priv; /* Not in use ? 
*/ - if (!test_and_set_bit(SLF_INUSE, &slp->ctrl.flags)) - break; + if (!test_and_set_bit(SLF_INUSE, &sl->flags)) + return sl; } - /* SLP is set.. */ + /* Sorry, too many, all slots in use */ if (i >= x25_asy_maxdev) return NULL; /* If no channels are available, allocate one */ - if (!slp && - (x25_asy_ctrls[i] = (x25_asy_ctrl_t *)kmalloc(sizeof(x25_asy_ctrl_t), - GFP_KERNEL)) != NULL) { - slp = x25_asy_ctrls[i]; - memset(slp, 0, sizeof(x25_asy_ctrl_t)); + if (!dev) { + char name[IFNAMSIZ]; + sprintf(name, "x25asy%d", i); + + dev = alloc_netdev(sizeof(struct x25_asy), + name, x25_asy_setup); + if (!dev) + return NULL; /* Initialize channel control data */ - set_bit(SLF_INUSE, &slp->ctrl.flags); - slp->ctrl.tty = NULL; - sprintf(slp->dev.name, "x25asy%d", i); - slp->dev.base_addr = i; - slp->dev.priv = (void*)&(slp->ctrl); - slp->dev.next = NULL; - slp->dev.init = x25_asy_init; - } - if (slp != NULL) - { + sl = dev->priv; + dev->base_addr = i; /* register device so that it can be ifconfig'ed */ - /* x25_asy_init() will be called as a side-effect */ - /* SIDE-EFFECT WARNING: x25_asy_init() CLEARS slp->ctrl ! */ - - if (register_netdev(&(slp->dev)) == 0) - { + if (register_netdev(dev) == 0) { /* (Re-)Set the INUSE bit. Very Important! */ - set_bit(SLF_INUSE, &slp->ctrl.flags); - slp->ctrl.dev = &(slp->dev); - slp->dev.priv = (void*)&(slp->ctrl); - return (&(slp->ctrl)); - } - else - { - clear_bit(SLF_INUSE,&(slp->ctrl.flags)); + set_bit(SLF_INUSE, &sl->flags); + x25_asy_devs[i] = dev; + return sl; + } else { printk("x25_asy_alloc() - register_netdev() failure.\n"); + kfree(dev); } } return NULL; @@ -116,8 +102,7 @@ static inline struct x25_asy *x25_asy_al /* Free an X.25 channel. */ - -static inline void x25_asy_free(struct x25_asy *sl) +static void x25_asy_free(struct x25_asy *sl) { /* Free all X.25 frame buffers. */ if (sl->rbuff) { @@ -134,18 +119,11 @@ static inline void x25_asy_free(struct x } } -/* MTU has been changed by the IP layer. 
Unfortunately we are not told - about this, but we spot it ourselves and fix things up. We could be - in an upcall from the tty driver, or in an ip packet queue. */ - -static void x25_asy_changed_mtu(struct x25_asy *sl) -{ - struct net_device *dev = sl->dev; - unsigned char *xbuff, *rbuff, *oxbuff, *orbuff; - int len; - unsigned long flags; - - len = dev->mtu * 2; +static int x25_asy_change_mtu(struct net_device *dev, int newmtu) +{ + struct x25_asy *sl = dev->priv; + unsigned char *xbuff, *rbuff; + int len = 2* newmtu; xbuff = (unsigned char *) kmalloc (len + 4, GFP_ATOMIC); rbuff = (unsigned char *) kmalloc (len + 4, GFP_ATOMIC); @@ -153,52 +131,47 @@ static void x25_asy_changed_mtu(struct x if (xbuff == NULL || rbuff == NULL) { printk("%s: unable to grow X.25 buffers, MTU change cancelled.\n", - sl->dev->name); - dev->mtu = sl->mtu; + dev->name); if (xbuff != NULL) kfree(xbuff); if (rbuff != NULL) kfree(rbuff); - return; + return -ENOMEM; } - save_flags(flags); - cli(); - - oxbuff = sl->xbuff; - sl->xbuff = xbuff; - orbuff = sl->rbuff; - sl->rbuff = rbuff; - + spin_lock_bh(&sl->lock); + xbuff = xchg(&sl->xbuff, xbuff); if (sl->xleft) { if (sl->xleft <= len) { memcpy(sl->xbuff, sl->xhead, sl->xleft); } else { sl->xleft = 0; - sl->tx_dropped++; + sl->stats.tx_dropped++; } } sl->xhead = sl->xbuff; + rbuff = xchg(&sl->rbuff, rbuff); if (sl->rcount) { if (sl->rcount <= len) { - memcpy(sl->rbuff, orbuff, sl->rcount); + memcpy(sl->rbuff, rbuff, sl->rcount); } else { sl->rcount = 0; - sl->rx_over_errors++; + sl->stats.rx_over_errors++; set_bit(SLF_ERROR, &sl->flags); } } - sl->mtu = dev->mtu; + dev->mtu = newmtu; sl->buffsize = len; - restore_flags(flags); + spin_unlock_bh(&sl->lock); - if (oxbuff != NULL) - kfree(oxbuff); - if (orbuff != NULL) - kfree(orbuff); + if (xbuff != NULL) + kfree(xbuff); + if (rbuff != NULL) + kfree(rbuff); + return 0; } @@ -226,13 +199,13 @@ static void x25_asy_bump(struct x25_asy int err; count = sl->rcount; - sl->rx_bytes+=count; + 
sl->stats.rx_bytes+=count; skb = dev_alloc_skb(count+1); if (skb == NULL) { printk("%s: memory squeeze, dropping packet.\n", sl->dev->name); - sl->rx_dropped++; + sl->stats.rx_dropped++; return; } skb_push(skb,1); /* LAPB internal control */ @@ -249,7 +222,7 @@ static void x25_asy_bump(struct x25_asy { netif_rx(skb); sl->dev->last_rx = jiffies; - sl->rx_packets++; + sl->stats.rx_packets++; } } @@ -257,19 +230,13 @@ static void x25_asy_bump(struct x25_asy static void x25_asy_encaps(struct x25_asy *sl, unsigned char *icp, int len) { unsigned char *p; - int actual, count; - + int actual, count, mtu = sl->dev->mtu; - if (sl->mtu != sl->dev->mtu) { /* Someone has been ifconfigging */ - - x25_asy_changed_mtu(sl); - } - - if (len > sl->mtu) + if (len > mtu) { /* Sigh, shouldn't occur BUT ... */ - len = sl->mtu; + len = mtu; printk ("%s: truncating oversized transmit packet!\n", sl->dev->name); - sl->tx_dropped++; + sl->stats.tx_dropped++; x25_asy_unlock(sl); return; } @@ -310,7 +277,7 @@ static void x25_asy_write_wakeup(struct { /* Now serial buffer is almost free & we can start * transmission of another packet */ - sl->tx_packets++; + sl->stats.tx_packets++; tty->flags &= ~(1 << TTY_DO_WRITE_WAKEUP); x25_asy_unlock(sl); return; @@ -324,15 +291,20 @@ static void x25_asy_write_wakeup(struct static void x25_asy_timeout(struct net_device *dev) { struct x25_asy *sl = (struct x25_asy*)(dev->priv); - /* May be we must check transmitter timeout here ? - * 14 Oct 1994 Dmitry Gorodchanin. - */ - printk(KERN_WARNING "%s: transmit timed out, %s?\n", dev->name, - (sl->tty->driver->chars_in_buffer(sl->tty) || sl->xleft) ? - "bad line quality" : "driver error"); - sl->xleft = 0; - sl->tty->flags &= ~(1 << TTY_DO_WRITE_WAKEUP); - x25_asy_unlock(sl); + + spin_lock(&sl->lock); + if (netif_queue_stopped(dev)) { + /* May be we must check transmitter timeout here ? + * 14 Oct 1994 Dmitry Gorodchanin. 
+ */ + printk(KERN_WARNING "%s: transmit timed out, %s?\n", dev->name, + (sl->tty->driver->chars_in_buffer(sl->tty) || sl->xleft) ? + "bad line quality" : "driver error"); + sl->xleft = 0; + sl->tty->flags &= ~(1 << TTY_DO_WRITE_WAKEUP); + x25_asy_unlock(sl); + } + spin_unlock(&sl->lock); } /* Encapsulate an IP datagram and kick it into a TTY queue. */ @@ -342,10 +314,10 @@ static int x25_asy_xmit(struct sk_buff * struct x25_asy *sl = (struct x25_asy*)(dev->priv); int err; - if (!netif_running(sl->dev)) - { + if (!netif_running(sl->dev)) { printk("%s: xmit call when iface is down\n", dev->name); - return 1; + kfree_skb(skb); + return 0; } switch(skb->data[0]) @@ -409,8 +381,11 @@ static int x25_asy_data_indication(void static void x25_asy_data_transmit(void *token, struct sk_buff *skb) { struct x25_asy *sl=token; - if (netif_queue_stopped(sl->dev)) + + spin_lock(&sl->lock); + if (netif_queue_stopped(sl->dev) || sl->tty == NULL) { + spin_unlock(&sl->lock); printk(KERN_ERR "x25_asy: tbusy drop\n"); kfree_skb(skb); return; @@ -419,10 +394,11 @@ static void x25_asy_data_transmit(void * if (skb != NULL) { x25_asy_lock(sl); - sl->tx_bytes+=skb->len; + sl->stats.tx_bytes+=skb->len; x25_asy_encaps(sl, skb->data, skb->len); dev_kfree_skb(skb); } + spin_unlock(&sl->lock); } /* @@ -475,12 +451,20 @@ static void x25_asy_disconnected(void *t sl->dev->last_rx = jiffies; } +static struct lapb_register_struct x25_asy_callbacks = { + .connect_confirmation = x25_asy_connected, + .connect_indication = x25_asy_connected, + .disconnect_confirmation = x25_asy_disconnected, + .disconnect_indication = x25_asy_disconnected, + .data_indication = x25_asy_data_indication, + .data_transmit = x25_asy_data_transmit, + +}; -/* Open the low-level part of the X.25 channel. Easy! */ +/* Open the low-level part of the X.25 channel. Easy! 
*/ static int x25_asy_open(struct net_device *dev) { - struct lapb_register_struct x25_asy_callbacks; struct x25_asy *sl = (struct x25_asy*)(dev->priv); unsigned long len; int err; @@ -505,7 +489,7 @@ static int x25_asy_open(struct net_devic if (sl->xbuff == NULL) { goto noxbuff; } - sl->mtu = dev->mtu; + sl->buffsize = len; sl->rcount = 0; sl->xleft = 0; @@ -516,14 +500,6 @@ static int x25_asy_open(struct net_devic /* * Now attach LAPB */ - - x25_asy_callbacks.connect_confirmation=x25_asy_connected; - x25_asy_callbacks.connect_indication=x25_asy_connected; - x25_asy_callbacks.disconnect_confirmation=x25_asy_disconnected; - x25_asy_callbacks.disconnect_indication=x25_asy_disconnected; - x25_asy_callbacks.data_indication=x25_asy_data_indication; - x25_asy_callbacks.data_transmit=x25_asy_data_transmit; - if((err=lapb_register(sl, &x25_asy_callbacks))==LAPB_OK) return 0; @@ -542,13 +518,16 @@ static int x25_asy_close(struct net_devi struct x25_asy *sl = (struct x25_asy*)(dev->priv); int err; - if (sl->tty == NULL) - return -EBUSY; + spin_lock(&sl->lock); + if (sl->tty) + sl->tty->flags &= ~(1 << TTY_DO_WRITE_WAKEUP); - sl->tty->flags &= ~(1 << TTY_DO_WRITE_WAKEUP); netif_stop_queue(dev); + sl->rcount = 0; + sl->xleft = 0; if((err=lapb_unregister(sl))!=LAPB_OK) printk(KERN_ERR "x25_asy_close: lapb_unregister error -%d\n",err); + spin_unlock(&sl->lock); return 0; } @@ -571,20 +550,12 @@ static void x25_asy_receive_buf(struct t if (!sl || sl->magic != X25_ASY_MAGIC || !netif_running(sl->dev)) return; - /* - * Argh! mtu change time! 
- costs us the packet part received - * at the change - */ - if (sl->mtu != sl->dev->mtu) { - - x25_asy_changed_mtu(sl); - } /* Read the characters out of the buffer */ while (count--) { if (fp && *fp++) { if (!test_and_set_bit(SLF_ERROR, &sl->flags)) { - sl->rx_errors++; + sl->stats.rx_errors++; } cp++; continue; @@ -659,27 +630,14 @@ static void x25_asy_close_tty(struct tty tty->disc_data = 0; sl->tty = NULL; x25_asy_free(sl); - unregister_netdev(sl->dev); } static struct net_device_stats *x25_asy_get_stats(struct net_device *dev) { - static struct net_device_stats stats; struct x25_asy *sl = (struct x25_asy*)(dev->priv); - memset(&stats, 0, sizeof(struct net_device_stats)); - - stats.rx_packets = sl->rx_packets; - stats.tx_packets = sl->tx_packets; - stats.rx_bytes = sl->rx_bytes; - stats.tx_bytes = sl->tx_bytes; - stats.rx_dropped = sl->rx_dropped; - stats.tx_dropped = sl->tx_dropped; - stats.tx_errors = sl->tx_errors; - stats.rx_errors = sl->rx_errors; - stats.rx_over_errors = sl->rx_over_errors; - return (&stats); + return &sl->stats; } @@ -757,7 +715,7 @@ static void x25_asy_unesc(struct x25_asy sl->rbuff[sl->rcount++] = s; return; } - sl->rx_over_errors++; + sl->stats.rx_over_errors++; set_bit(SLF_ERROR, &sl->flags); } } @@ -799,18 +757,14 @@ static int x25_asy_open_dev(struct net_d } /* Initialise the X.25 driver. Called by the device init code */ -int x25_asy_init(struct net_device *dev) +static void x25_asy_setup(struct net_device *dev) { - struct x25_asy *sl = (struct x25_asy*)(dev->priv); - - if (sl == NULL) /* Allocation failed ?? */ - return -ENODEV; - - /* Set up the control block. (And clear statistics) */ + struct x25_asy *sl = dev->priv; - memset(sl, 0, sizeof (struct x25_asy)); sl->magic = X25_ASY_MAGIC; sl->dev = dev; + spin_lock_init(&sl->lock); + set_bit(SLF_INUSE, &sl->flags); /* * Finish setting up the DEVICE info. 
@@ -823,6 +777,7 @@ int x25_asy_init(struct net_device *dev) dev->open = x25_asy_open_dev; dev->stop = x25_asy_close; dev->get_stats = x25_asy_get_stats; + dev->change_mtu = x25_asy_change_mtu; dev->hard_header_len = 0; dev->addr_len = 0; dev->type = ARPHRD_X25; @@ -830,8 +785,6 @@ int x25_asy_init(struct net_device *dev) /* New-style flags. */ dev->flags = IFF_NOARP; - - return 0; } static struct tty_ldisc x25_ldisc = { @@ -853,13 +806,15 @@ static int __init init_x25_asy(void) printk(KERN_INFO "X.25 async: version 0.00 ALPHA " "(dynamic channels, max=%d).\n", x25_asy_maxdev ); - x25_asy_ctrls = kmalloc(sizeof(void*)*x25_asy_maxdev, GFP_KERNEL); - if (!x25_asy_ctrls) { + + x25_asy_devs = kmalloc(sizeof(struct net_device *)*x25_asy_maxdev, + GFP_KERNEL); + if (!x25_asy_devs) { printk(KERN_WARNING "X25 async: Can't allocate x25_asy_ctrls[] " "array! Uaargh! (-> No X.25 available)\n"); return -ENOMEM; } - memset(x25_asy_ctrls, 0, sizeof(void*)*x25_asy_maxdev); /* Pointers */ + memset(x25_asy_devs, 0, sizeof(struct net_device *)*x25_asy_maxdev); return tty_register_ldisc(N_X25, &x25_ldisc); } @@ -867,22 +822,29 @@ static int __init init_x25_asy(void) static void __exit exit_x25_asy(void) { + struct net_device *dev; int i; for (i = 0; i < x25_asy_maxdev; i++) { - if (x25_asy_ctrls[i]) { + dev = x25_asy_devs[i]; + if (dev) { + struct x25_asy *sl = dev->priv; + + spin_lock_bh(&sl->lock); + if (sl->tty) + tty_hangup(sl->tty); + + spin_unlock_bh(&sl->lock); /* * VSV = if dev->start==0, then device * unregistered while close proc. 
*/ - if (netif_running(&(x25_asy_ctrls[i]->dev))) - unregister_netdev(&(x25_asy_ctrls[i]->dev)); - - kfree(x25_asy_ctrls[i]); + unregister_netdev(dev); + kfree(dev); } } - kfree(x25_asy_ctrls); + kfree(x25_asy_devs); tty_register_ldisc(N_X25, NULL); } diff -urNp -X dontdiff net-2.5/drivers/net/wan/x25_asy.h linux-2.5-net/drivers/net/wan/x25_asy.h --- net-2.5/drivers/net/wan/x25_asy.h 2003-08-01 11:11:42.000000000 -0700 +++ linux-2.5-net/drivers/net/wan/x25_asy.h 2003-07-30 14:29:19.000000000 -0700 @@ -18,8 +18,9 @@ struct x25_asy { int magic; /* Various fields. */ + spinlock_t lock; struct tty_struct *tty; /* ptr to TTY structure */ - struct net_device *dev; /* easy for intr handling */ + struct net_device *dev; /* easy for intr handling */ /* These are pointers to the malloc()ed frame buffers. */ unsigned char *rbuff; /* receiver buffer */ @@ -29,17 +30,8 @@ struct x25_asy { int xleft; /* bytes left in XMIT queue */ /* X.25 interface statistics. */ - unsigned long rx_packets; /* inbound frames counter */ - unsigned long tx_packets; /* outbound frames counter */ - unsigned long rx_bytes; /* inbound byte counte */ - unsigned long tx_bytes; /* outbound byte counter */ - unsigned long rx_errors; /* Parity, etc. errors */ - unsigned long tx_errors; /* Planned stuff */ - unsigned long rx_dropped; /* No memory for skb */ - unsigned long tx_dropped; /* When MTU change */ - unsigned long rx_over_errors; /* Frame bigger then X.25 buf. */ + struct net_device_stats stats; - int mtu; /* Our mtu (to spot changes!) 
*/ int buffsize; /* Max buffers sizes */ unsigned long flags; /* Flag values/ mode etc */ From willy@www.linux.org.uk Tue Aug 5 17:00:36 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 17:00:45 -0700 (PDT) Received: from www.linux.org.uk (IDENT:19Zp21VAhIurssXahOjUtOITdxsgIIfO@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7600ZFl016767 for ; Tue, 5 Aug 2003 17:00:36 -0700 Received: from willy by www.linux.org.uk with local (Exim 4.14) id 19k2r7-0004ko-Gu; Tue, 05 Aug 2003 15:32:05 +0100 Date: Tue, 5 Aug 2003 15:32:05 +0100 From: Matthew Wilcox To: Jeff Garzik Cc: Matthew Wilcox , netdev@oss.sgi.com Subject: Re: [PATCH] ethtool_ops rev 4 Message-ID: <20030805143205.GP22222@parcelfarce.linux.theplanet.co.uk> References: <20030801150232.GV22222@parcelfarce.linux.theplanet.co.uk> <20030801154021.GA7696@gtf.org> <20030801154656.GW22222@parcelfarce.linux.theplanet.co.uk> <20030801162536.GA18574@gtf.org> <20030802222145.GE22222@parcelfarce.linux.theplanet.co.uk> <3F2C3C86.6000202@pobox.com> <20030803002744.GF22222@parcelfarce.linux.theplanet.co.uk> <3F2C7E12.8070904@pobox.com> <20030803145656.GI22222@parcelfarce.linux.theplanet.co.uk> <3F2D41B7.7040205@pobox.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3F2D41B7.7040205@pobox.com> User-Agent: Mutt/1.4.1i X-archive-position: 4557 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: willy@debian.org Precedence: bulk X-list: netdev On Sun, Aug 03, 2003 at 01:09:11PM -0400, Jeff Garzik wrote: > Matthew Wilcox wrote: > >OK. At this point, I really feel like I'm getting in the way and > >hindering more than I'm helping. Can I pass the torch to you and let > >you finish the job? > > Sorry to give that impression :( I think we're pretty much "there". 
> But if you wanna hand it off to me for the last little bits, and > merging, that's fine too. I'll leave it up to you. Oh, I completely agree, I think we're down to quibbling over the last tiny details. And I think that's exactly why I should bow out at this point; you know this area much better than I do. I'm not leaving in a huff or anything -- this was a weekend hack rather than a major project to me. -- "It's not Hollywood. War is real, war is primarily not about defeat or victory, it is about death. I've seen thousands and thousands of dead bodies. Do you think I want to have an academic debate on this subject?" -- Robert Fisk From greearb@candelatech.com Tue Aug 5 17:24:49 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 17:24:57 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h760OnFl017694 for ; Tue, 5 Aug 2003 17:24:49 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h760Odtf023281 for ; Tue, 5 Aug 2003 17:24:43 -0700 Message-ID: <3F304AC7.6070808@candelatech.com> Date: Tue, 05 Aug 2003 17:24:39 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "'netdev@oss.sgi.com'" Subject: PATCH: Add comment to make finding the priv_flags definition easier. Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4558 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev This helps me, at least, remember where the definitions are at! 
--- linux-2.4.21/include/linux/netdevice.h 2003-06-13 07:51:38.000000000 -0700 +++ linux-2.4.21.amds/include/linux/netdevice.h 2003-07-30 16:27:20.000000000 -0700 @@ -296,7 +296,9 @@ unsigned short flags; /* interface flags (a la BSD) */ unsigned short gflags; - unsigned short priv_flags; /* Like 'flags' but invisible to userspace. */ + unsigned short priv_flags; /* Like 'flags' but invisible to userspace, + * see: if.h for flag definitions. + */ unsigned short unused_alignment_fixer; /* Because we need priv_flags, * and we want to be 32-bit aligned. */ -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb@candelatech.com Tue Aug 5 17:27:22 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 17:27:26 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h760RMFl018032 for ; Tue, 5 Aug 2003 17:27:22 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h760REtf023621 for ; Tue, 5 Aug 2003 17:27:16 -0700 Message-ID: <3F304B62.3010505@candelatech.com> Date: Tue, 05 Aug 2003 17:27:14 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "'netdev@oss.sgi.com'" Subject: VLAN patch for 2.4.21 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4559 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Here is a patch that adds a few new IOCTL options (not new IOCTLs per se) for the 802.1Q VLANs. One ioctl allows one to get the VID for a device by the interface name. A second gets the name of the underlying device for the VLAN device. Tested on x86 and PPC. 
Comments welcome! Thanks, Ben --- linux-2.4.21/net/8021q/vlan_dev.c 2003-06-13 07:51:39.000000000 -0700 +++ linux-2.4.21.amds/net/8021q/vlan_dev.c 2003-07-30 16:20:41.000000000 -0700 @@ -1,4 +1,4 @@ -/* +/* -*- linux-c -*- * INET 802.1Q VLAN * Ethernet-type device handling. * @@ -632,6 +632,59 @@ return -EINVAL; } + +int vlan_dev_get_realdev_name(const char *dev_name, char* result) +{ + struct net_device *dev = dev_get_by_name(dev_name); + int rv = 0; + + if (dev) { + if (dev->priv_flags & IFF_802_1Q_VLAN) { + strncpy(result, VLAN_DEV_INFO(dev)->real_dev->name, 23); + dev_put(dev); + rv = 0; + } else { + printk(KERN_ERR + "%s: %s is not a vlan device, priv_flags: %hX.\n", + __FUNCTION__, dev->name, dev->priv_flags); + dev_put(dev); + rv = -EINVAL; + } + } else { + printk(KERN_ERR "%s: Could not find device: %s\n", + __FUNCTION__, dev_name); + rv = -ENODEV; + } + + return rv; +} + +int vlan_dev_get_vid(const char *dev_name, unsigned short* result) +{ + struct net_device *dev = dev_get_by_name(dev_name); + int rv = 0; + + if (dev) { + if (dev->priv_flags & IFF_802_1Q_VLAN) { + *result = VLAN_DEV_INFO(dev)->vlan_id; + dev_put(dev); + rv = 0; + } else { + printk(KERN_ERR + "%s: %s is not a vlan device, priv_flags: %hX.\n", + __FUNCTION__, dev->name, dev->priv_flags); + dev_put(dev); + rv = -EINVAL; + } + } else { + printk(KERN_ERR "%s: Could not find device: %s\n", + __FUNCTION__, dev_name); + rv = -ENODEV; + } + + return rv; +} + int vlan_dev_set_mac_address(struct net_device *dev, void *addr_struct_p) { struct sockaddr *addr = (struct sockaddr *)(addr_struct_p); --- linux-2.4.21/net/8021q/vlan.c 2003-06-13 07:51:39.000000000 -0700 +++ linux-2.4.21.amds/net/8021q/vlan.c 2003-07-30 16:20:41.000000000 -0700 @@ -1,4 +1,4 @@ -/* +/* -*- linux-c -*- * INET 802.1Q VLAN * Ethernet-type device handling. 
* @@ -655,12 +655,9 @@ int vlan_ioctl_handler(unsigned long arg) { int err = 0; + unsigned short vid = 0; struct vlan_ioctl_args args; - /* everything here needs root permissions, except aguably the - * hack ioctls for sending packets. However, I know _I_ don't - * want users running that on my network! --BLG - */ if (!capable(CAP_NET_ADMIN)) return -EPERM; @@ -678,24 +675,32 @@ switch (args.cmd) { case SET_VLAN_INGRESS_PRIORITY_CMD: + if (!capable(CAP_NET_ADMIN)) + return -EPERM; err = vlan_dev_set_ingress_priority(args.device1, args.u.skb_priority, args.vlan_qos); break; case SET_VLAN_EGRESS_PRIORITY_CMD: + if (!capable(CAP_NET_ADMIN)) + return -EPERM; err = vlan_dev_set_egress_priority(args.device1, args.u.skb_priority, args.vlan_qos); break; case SET_VLAN_FLAG_CMD: + if (!capable(CAP_NET_ADMIN)) + return -EPERM; err = vlan_dev_set_vlan_flag(args.device1, args.u.flag, args.vlan_qos); break; case SET_VLAN_NAME_TYPE_CMD: + if (!capable(CAP_NET_ADMIN)) + return -EPERM; if ((args.u.name_type >= 0) && (args.u.name_type < VLAN_NAME_TYPE_HIGHEST)) { vlan_name_type = args.u.name_type; @@ -705,17 +710,9 @@ } break; - /* TODO: Figure out how to pass info back... - case GET_VLAN_INGRESS_PRIORITY_IOCTL: - err = vlan_dev_get_ingress_priority(args); - break; - - case GET_VLAN_EGRESS_PRIORITY_IOCTL: - err = vlan_dev_get_egress_priority(args); - break; - */ - case ADD_VLAN_CMD: + if (!capable(CAP_NET_ADMIN)) + return -EPERM; /* we have been given the name of the Ethernet Device we want to * talk to: args.dev1 We also have the * VLAN ID: args.u.VID @@ -728,12 +725,53 @@ break; case DEL_VLAN_CMD: + if (!capable(CAP_NET_ADMIN)) + return -EPERM; /* Here, the args.dev1 is the actual VLAN we want * to get rid of. 
*/ err = unregister_vlan_device(args.device1); break; + case GET_VLAN_INGRESS_PRIORITY_CMD: + /* TODO: Implement + err = vlan_dev_get_ingress_priority(args); + if (copy_to_user((void*)arg, &args, + sizeof(struct vlan_ioctl_args))) { + err = -EFAULT; + } + */ + err = -EINVAL; + break; + + case GET_VLAN_EGRESS_PRIORITY_CMD: + /* TODO: Implement + err = vlan_dev_get_egress_priority(args.device1, &(args.args); + if (copy_to_user((void*)arg, &args, + sizeof(struct vlan_ioctl_args))) { + err = -EFAULT; + } + */ + err = -EINVAL; + break; + + case GET_VLAN_REALDEV_NAME_CMD: + err = vlan_dev_get_realdev_name(args.device1, args.u.device2); + if (copy_to_user((void*)arg, &args, + sizeof(struct vlan_ioctl_args))) { + err = -EFAULT; + } + break; + + case GET_VLAN_VID_CMD: + err = vlan_dev_get_vid(args.device1, &vid); + args.u.VID = vid; + if (copy_to_user((void*)arg, &args, + sizeof(struct vlan_ioctl_args))) { + err = -EFAULT; + } + break; + default: /* pass on to underlying device instead?? */ printk(VLAN_DBG "%s: Unknown VLAN CMD: %x \n", --- linux-2.4.21/net/8021q/vlan.h 2002-08-02 17:39:46.000000000 -0700 +++ linux-2.4.21.amds/net/8021q/vlan.h 2003-07-30 16:30:53.000000000 -0700 @@ -72,6 +72,8 @@ int vlan_dev_set_ingress_priority(char* dev_name, __u32 skb_prio, short vlan_prio); int vlan_dev_set_egress_priority(char* dev_name, __u32 skb_prio, short vlan_prio); int vlan_dev_set_vlan_flag(char* dev_name, __u32 flag, short flag_val); +int vlan_dev_get_realdev_name(const char* dev_name, char* result); +int vlan_dev_get_vid(const char* dev_name, unsigned short* result); void vlan_dev_set_multicast_list(struct net_device *vlan_dev); #endif /* !(__BEN_VLAN_802_1Q_INC__) */ --- linux-2.4.21/include/linux/if_vlan.h 2002-11-28 15:53:15.000000000 -0800 +++ linux-2.4.21.amds/include/linux/if_vlan.h 2003-07-30 16:29:30.000000000 -0700 @@ -212,7 +212,9 @@ GET_VLAN_INGRESS_PRIORITY_CMD, GET_VLAN_EGRESS_PRIORITY_CMD, SET_VLAN_NAME_TYPE_CMD, - SET_VLAN_FLAG_CMD + SET_VLAN_FLAG_CMD, + 
GET_VLAN_REALDEV_NAME_CMD, /* If this works, you know it's a VLAN device, btw */ + GET_VLAN_VID_CMD /* Get the VID of this VLAN (specified by name) */ }; enum vlan_name_types { -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb@candelatech.com Tue Aug 5 17:33:12 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 17:33:14 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h760XBFl018375 for ; Tue, 5 Aug 2003 17:33:11 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h760X3tf024354 for ; Tue, 5 Aug 2003 17:33:06 -0700 Message-ID: <3F304CBF.6050902@candelatech.com> Date: Tue, 05 Aug 2003 17:33:03 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "'netdev@oss.sgi.com'" Subject: MAC-VLANS Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4560 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev In order to get MAC-VLANs to work, the only way I can see to make it happen is to put a hook into dev.c in the: int netif_receive_skb(struct sk_buff *skb) method. The hook looks like this, and immediately follows the bridging code hook in the same method: #if defined(CONFIG_MACVLAN) || defined(CONFIG_MACVLAN_MODULE) if (skb->dev->macvlan_priv != NULL && macvlan_handle_frame_hook != NULL) { if (handle_macvlan(skb) >= 0) { /* consumed by mac-vlan...it would have been * re-sent to this method with a different * device... 
*/ return 0; } else { /* Let it fall through and be processed normally */ } } #endif So, the question is: Will this feature be allowed to go in since it needs this hook, regardless of other issues? If it's possible, I'll break out the rest of the patch for inspection... Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From jmorris@intercode.com.au Tue Aug 5 17:34:39 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 17:34:41 -0700 (PDT) Received: from blackbird.intercode.com.au (IDENT:f2CjXfr43uFTNwRG0cB0WMezYNuQ3wx7@blackbird.intercode.com.au [203.32.101.10]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h760YZFl018703 for ; Tue, 5 Aug 2003 17:34:37 -0700 Received: from excalibur.intercode.com.au (excalibur.intercode.com.au [203.32.101.12]) by blackbird.intercode.com.au (8.11.6p2/8.9.3) with ESMTP id h760Y4r26614; Wed, 6 Aug 2003 10:34:04 +1000 Date: Wed, 6 Aug 2003 10:34:03 +1000 (EST) From: James Morris To: Stephen Hemminger cc: Ralf Baechle , "David S. 
Miller" , , Subject: Re: [PATCH] (2/2) Convert ROSE to seq_file In-Reply-To: <20030805144622.100f208d.shemminger@osdl.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4561 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jmorris@intercode.com.au Precedence: bulk X-list: netdev Applied to bk://kernel.bkbits.net/jmorris/net-2.5 - James -- James Morris From jmorris@intercode.com.au Tue Aug 5 17:34:44 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 17:34:47 -0700 (PDT) Received: from blackbird.intercode.com.au (IDENT:hoeVbT2rTMYdyvIW67JSUPZyRoC9OW6H@blackbird.intercode.com.au [203.32.101.10]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h760YfFl018709 for ; Tue, 5 Aug 2003 17:34:43 -0700 Received: from excalibur.intercode.com.au (excalibur.intercode.com.au [203.32.101.12]) by blackbird.intercode.com.au (8.11.6p2/8.9.3) with ESMTP id h760YLr26620; Wed, 6 Aug 2003 10:34:21 +1000 Date: Wed, 6 Aug 2003 10:34:21 +1000 (EST) From: James Morris To: Stephen Hemminger cc: Ralf Baechle , "David S. 
Miller" , , Subject: Re: [PATCH 2.6.0-test2] (1/2) Dynamically allocate net_device structures for ROSE In-Reply-To: <20030805144617.2e856d6d.shemminger@osdl.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4562 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jmorris@intercode.com.au Precedence: bulk X-list: netdev Applied to bk://kernel.bkbits.net/jmorris/net-2.5 -- James Morris From jmorris@intercode.com.au Tue Aug 5 17:34:52 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 17:34:54 -0700 (PDT) Received: from blackbird.intercode.com.au (IDENT:Jlzxj8MKIjsXElf9D/uLWFqiRBt7Dl8R@blackbird.intercode.com.au [203.32.101.10]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h760YnFl018769 for ; Tue, 5 Aug 2003 17:34:50 -0700 Received: from excalibur.intercode.com.au (excalibur.intercode.com.au [203.32.101.12]) by blackbird.intercode.com.au (8.11.6p2/8.9.3) with ESMTP id h760YTr26627; Wed, 6 Aug 2003 10:34:29 +1000 Date: Wed, 6 Aug 2003 10:34:29 +1000 (EST) From: James Morris To: Stephen Hemminger cc: Ralf Baechle , "David S. 
Miller" , , Subject: Re: [PATCH] Fix use after free in AX.25 In-Reply-To: <20030805145658.1b3f194b.shemminger@osdl.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4563 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jmorris@intercode.com.au Precedence: bulk X-list: netdev Applied to bk://kernel.bkbits.net/jmorris/net-2.5 -- James Morris From jmorris@intercode.com.au Tue Aug 5 17:35:13 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 17:35:16 -0700 (PDT) Received: from blackbird.intercode.com.au (IDENT:k3krbhVq4Bw3NqfG7XRvhOqSUlr5Pwa5@blackbird.intercode.com.au [203.32.101.10]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h760ZAFl018891 for ; Tue, 5 Aug 2003 17:35:12 -0700 Received: from excalibur.intercode.com.au (excalibur.intercode.com.au [203.32.101.12]) by blackbird.intercode.com.au (8.11.6p2/8.9.3) with ESMTP id h760Yir26637; Wed, 6 Aug 2003 10:34:44 +1000 Date: Wed, 6 Aug 2003 10:34:43 +1000 (EST) From: James Morris To: Stephen Hemminger cc: Henner Eisen , "David S. Miller" , , Subject: Re: [PATCH] Fix X.25 use after free. 
In-Reply-To: <20030805150110.0e2753ab.shemminger@osdl.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4564 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jmorris@intercode.com.au Precedence: bulk X-list: netdev Applied to bk://kernel.bkbits.net/jmorris/net-2.5 -- James Morris From jmorris@intercode.com.au Tue Aug 5 17:35:42 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 17:35:45 -0700 (PDT) Received: from blackbird.intercode.com.au (IDENT:v1VYsOypAvseLXmDKGTfY8uAhfRd78CF@blackbird.intercode.com.au [203.32.101.10]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h760ZdFl019427 for ; Tue, 5 Aug 2003 17:35:40 -0700 Received: from excalibur.intercode.com.au (excalibur.intercode.com.au [203.32.101.12]) by blackbird.intercode.com.au (8.11.6p2/8.9.3) with ESMTP id h760ZKr26651; Wed, 6 Aug 2003 10:35:20 +1000 Date: Wed, 6 Aug 2003 10:35:19 +1000 (EST) From: James Morris To: Stephen Hemminger cc: Henner Eisen , "David S. 
Miller" , , Subject: Re: [PATCH] X.25 async net_device fixup In-Reply-To: <20030805154335.7abfcb92.shemminger@osdl.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4565 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jmorris@intercode.com.au Precedence: bulk X-list: netdev Also applied to bk://kernel.bkbits.net/jmorris/net-2.5 -- James Morris From jgarzik@pobox.com Tue Aug 5 19:58:27 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 19:58:37 -0700 (PDT) Received: from www.linux.org.uk (IDENT:kFrzWfvhJzSSlooxnx+eUym2uOAeLT4I@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h762wQFl001953 for ; Tue, 5 Aug 2003 19:58:27 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19kEVM-0008Fs-3P; Wed, 06 Aug 2003 03:58:24 +0100 Message-ID: <3F306EC4.1030109@pobox.com> Date: Tue, 05 Aug 2003 22:58:12 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: "Feldman, Scott" CC: Samuel Flory , netdev@oss.sgi.com Subject: Re: More 2.4.22pre10 ACPI breakage References: In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4566 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Feldman, Scott wrote: >> It appears that the intel Se7501BR mother is also having >>issues with ACPI. When ACPI support is enable the e1000 >>controller stops working printing "<6>NETDEV WATCHDOG: >>eth0: transmit timed out". > > > Must...have...interrupts. You are 100% correct. That said... 
I'm tempted to extend NAPI just a bit, to provide an "always poll" mode. It seems like all the bug reports I get these days for 8139too are caused by x86 ACPI/APIC/irq routing troubles completely unrelated to the driver. Tulip-almost-NAPI in 2.4 has an always-poll mode, so I have a convenient excuse :) Jeff From davem@redhat.com Tue Aug 5 20:45:04 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 20:45:10 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h763j3Fl014425 for ; Tue, 5 Aug 2003 20:45:04 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id UAA24203; Tue, 5 Aug 2003 20:40:28 -0700 Date: Tue, 5 Aug 2003 20:40:28 -0700 From: "David S. Miller" To: Ben Greear Cc: netdev@oss.sgi.com Subject: Re: MAC-VLANS Message-Id: <20030805204028.644895dc.davem@redhat.com> In-Reply-To: <3F304CBF.6050902@candelatech.com> References: <3F304CBF.6050902@candelatech.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4567 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Tue, 05 Aug 2003 17:33:03 -0700 Ben Greear wrote: > In order to get MAC-VLANs to work, the only way I can see to make > it happen is to put a hook into dev.c in the: Why not do it the same we do normal VLAN's? Ie. directly in the device driver receive method via something akin to the vlan_hwaccel_*() routines. 
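A rough userspace sketch of the hook contract from the MAC-VLANS proposal in this thread may help when weighing the two placements (generic receive path versus per-driver receive method). It only illustrates the idea as described in the mail: a return value >= 0 means the frame was consumed and re-attributed to a virtual device, and a negative value means normal processing continues. Every name here (demo_dev, demo_frame, demo_macvlan_handle, demo_receive) is hypothetical and stands in for kernel structures; this is not the actual patch and not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_ETH_ALEN 6 /* length of an Ethernet MAC address */

/* Hypothetical stand-in for struct net_device. */
struct demo_dev {
	const char *name;
	unsigned char mac[DEMO_ETH_ALEN];
};

/* Hypothetical stand-in for struct sk_buff. */
struct demo_frame {
	unsigned char dst[DEMO_ETH_ALEN]; /* destination MAC from the header */
	const struct demo_dev *dev;       /* device the frame is attributed to */
};

/*
 * Look up the destination MAC among the virtual devices.
 * Returns the index (>= 0) of the matching device after re-attributing
 * the frame to it, or -1 so the caller falls through to normal handling.
 */
int demo_macvlan_handle(struct demo_frame *frame,
			const struct demo_dev *vdevs, size_t n)
{
	size_t i;

	assert(frame != NULL);
	for (i = 0; i < n; i++) {
		if (memcmp(frame->dst, vdevs[i].mac, DEMO_ETH_ALEN) == 0) {
			frame->dev = &vdevs[i];
			return (int)i;
		}
	}
	return -1;
}

/*
 * Simplified receive path: returns 1 if the hook consumed the frame,
 * 0 if the frame should be processed normally on its original device.
 */
int demo_receive(struct demo_frame *frame,
		 const struct demo_dev *vdevs, size_t n)
{
	/* Only consult the hook when virtual devices are configured. */
	if (vdevs != NULL && demo_macvlan_handle(frame, vdevs, n) >= 0)
		return 1;
	return 0;
}
```

Because the lookup needs nothing beyond the destination MAC, the same check could in principle sit in either place discussed in the thread; the sketch takes no position on which is preferable.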
From greearb@candelatech.com Tue Aug 5 21:10:54 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 21:10:59 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h764ArFl016407 for ; Tue, 5 Aug 2003 21:10:54 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h764Altf019406; Tue, 5 Aug 2003 21:10:47 -0700 Message-ID: <3F307FC7.10908@candelatech.com> Date: Tue, 05 Aug 2003 21:10:47 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "David S. Miller" CC: netdev@oss.sgi.com Subject: Re: MAC-VLANS References: <3F304CBF.6050902@candelatech.com> <20030805204028.644895dc.davem@redhat.com> In-Reply-To: <20030805204028.644895dc.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4568 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Tue, 05 Aug 2003 17:33:03 -0700 > Ben Greear wrote: > > >>In order to get MAC-VLANs to work, the only way I can see to make >>it happen is to put a hook into dev.c in the: > > > Why not do it the same we do normal VLAN's? Ie. directly > in the device driver receive method via something akin to > the vlan_hwaccel_*() routines. You mean in eth.c or something? I don't want to have to put identical code in all drivers, if that's what you're suggesting. It could be done to common drivers if the feature is used enough to warrant it, but we definitely need a fallback case to work with all generic drivers, just as .1q does. 
802.1q works because we have an extra shim header in there..but MAC-vlans have no extra header info. It would be nice to have a separate 'protocol' list that was able to consume the pkt: that would allow this to work w/out additional hacks, and could work for pktgen rx and even bridging. Of course, not all could be active at once, but that is no worse than 'hooks' in that regard. And, evil ppl could re-write the IP stack, of course, but that doesn't bother me as much as some folks :) Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From greearb@candelatech.com Tue Aug 5 21:15:06 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 21:15:15 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h764F5Fl016979 for ; Tue, 5 Aug 2003 21:15:06 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h764F0tf019943 for ; Tue, 5 Aug 2003 21:15:00 -0700 Message-ID: <3F3080C4.9070507@candelatech.com> Date: Tue, 05 Aug 2003 21:15:00 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "'netdev@oss.sgi.com'" Subject: VLAN patch try 2, tabs instead of spaces Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4569 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Same as last time, but now using tabs instead of spaces. --- linux-2.4.21/net/8021q/vlan_dev.c 2003-06-13 07:51:39.000000000 -0700 +++ linux-2.4.21.amds/net/8021q/vlan_dev.c 2003-08-05 20:38:25.000000000 -0700 @@ -1,18 +1,18 @@ -/* +/* -*- linux-c -*- * INET 802.1Q VLAN * Ethernet-type device handling. 
* * Authors: Ben Greear - * Please send support related email to: vlan@scry.wanfear.com - * VLAN Home Page: http://www.candelatech.com/~greear/vlan.html + * Please send support related email to: vlan@scry.wanfear.com + * VLAN Home Page: http://www.candelatech.com/~greear/vlan.html * - * Fixes: Mar 22 2001: Martin Bokaemper - * - reset skb->pkt_type on incoming packets when MAC was changed - * - see that changed MAC is saddr for outgoing packets - * Oct 20, 2001: Ard van Breeman: - * - Fix MC-list, finally. - * - Flush MC-list on VLAN destroy. - * + * Fixes: Mar 22 2001: Martin Bokaemper + * - reset skb->pkt_type on incoming packets when MAC was changed + * - see that changed MAC is saddr for outgoing packets + * Oct 20, 2001: Ard van Breeman: + * - Fix MC-list, finally. + * - Flush MC-list on VLAN destroy. + * * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License @@ -99,18 +99,18 @@ * NOTE: Should be similar to ethernet/eth.c. * * SANITY NOTE: This method is called when a packet is moving up the stack - * towards userland. To get here, it would have already passed - * through the ethernet/eth.c eth_type_trans() method. + * towards userland. To get here, it would have already passed + * through the ethernet/eth.c eth_type_trans() method. * SANITY NOTE 2: We are referencing to the VLAN_HDR frields, which MAY be - * stored UNALIGNED in the memory. RISC systems don't like - * such cases very much... + * stored UNALIGNED in the memory. RISC systems don't like + * such cases very much... * SANITY NOTE 2a: According to Dave Miller & Alexey, it will always be aligned, - * so there doesn't need to be any of the unaligned stuff. It has - * been commented out now... --Ben + * so there doesn't need to be any of the unaligned stuff. It has + * been commented out now... 
--Ben * */ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev, - struct packet_type* ptype) + struct packet_type* ptype) { unsigned char *rawp = NULL; struct vlan_hdr *vhdr = (struct vlan_hdr *)(skb->data); @@ -170,7 +170,7 @@ spin_unlock_bh(&vlan_group_lock); #ifdef VLAN_DEBUG - printk(VLAN_DBG "%s: dropping skb: %p because came in on wrong device, dev: %s real_dev: %s, skb_dev: %s\n", + printk(VLAN_DBG "%s: dropping skb: %p because came in on wrong device, dev: %s real_dev: %s, skb_dev: %s\n", __FUNCTION__ skb, dev->name, VLAN_DEV_INFO(skb->dev)->real_dev->name, skb->dev->name); @@ -324,8 +324,8 @@ * physical devices. */ int vlan_dev_hard_header(struct sk_buff *skb, struct net_device *dev, - unsigned short type, void *daddr, void *saddr, - unsigned len) + unsigned short type, void *daddr, void *saddr, + unsigned len) { struct vlan_hdr *vhdr; unsigned short veth_TCI = 0; @@ -613,7 +613,7 @@ dev_put(dev); return 0; } else { - printk(KERN_ERR "%s: flag %i is not valid.\n", + printk(KERN_ERR "%s: flag %i is not valid.\n", __FUNCTION__, (int)(flag)); dev_put(dev); return -EINVAL; @@ -625,13 +625,66 @@ dev_put(dev); } } else { - printk(KERN_ERR "%s: Could not find device: %s\n", + printk(KERN_ERR "%s: Could not find device: %s\n", __FUNCTION__, dev_name); } return -EINVAL; } + +int vlan_dev_get_realdev_name(const char *dev_name, char* result) +{ + struct net_device *dev = dev_get_by_name(dev_name); + int rv = 0; + + if (dev) { + if (dev->priv_flags & IFF_802_1Q_VLAN) { + strncpy(result, VLAN_DEV_INFO(dev)->real_dev->name, 23); + dev_put(dev); + rv = 0; + } else { + printk(KERN_ERR + "%s: %s is not a vlan device, priv_flags: %hX.\n", + __FUNCTION__, dev->name, dev->priv_flags); + dev_put(dev); + rv = -EINVAL; + } + } else { + printk(KERN_ERR "%s: Could not find device: %s\n", + __FUNCTION__, dev_name); + rv = -ENODEV; + } + + return rv; +} + +int vlan_dev_get_vid(const char *dev_name, unsigned short* result) +{ + struct net_device *dev = 
dev_get_by_name(dev_name); + int rv = 0; + + if (dev) { + if (dev->priv_flags & IFF_802_1Q_VLAN) { + *result = VLAN_DEV_INFO(dev)->vlan_id; + dev_put(dev); + rv = 0; + } else { + printk(KERN_ERR + "%s: %s is not a vlan device, priv_flags: %hX.\n", + __FUNCTION__, dev->name, dev->priv_flags); + dev_put(dev); + rv = -EINVAL; + } + } else { + printk(KERN_ERR "%s: Could not find device: %s\n", + __FUNCTION__, dev_name); + rv = -ENODEV; + } + + return rv; +} + int vlan_dev_set_mac_address(struct net_device *dev, void *addr_struct_p) { struct sockaddr *addr = (struct sockaddr *)(addr_struct_p); @@ -671,7 +724,7 @@ } static inline int vlan_dmi_equals(struct dev_mc_list *dmi1, - struct dev_mc_list *dmi2) + struct dev_mc_list *dmi2) { return ((dmi1->dmi_addrlen == dmi2->dmi_addrlen) && (memcmp(dmi1->dmi_addr, dmi2->dmi_addr, dmi1->dmi_addrlen) == 0)); --- linux-2.4.21/net/8021q/vlan.c 2003-06-13 07:51:39.000000000 -0700 +++ linux-2.4.21.amds/net/8021q/vlan.c 2003-08-05 20:53:31.000000000 -0700 @@ -1,13 +1,13 @@ -/* +/* -*- linux-c -*- * INET 802.1Q VLAN * Ethernet-type device handling. * * Authors: Ben Greear - * Please send support related email to: vlan@scry.wanfear.com - * VLAN Home Page: http://www.candelatech.com/~greear/vlan.html + * Please send support related email to: vlan@scry.wanfear.com + * VLAN Home Page: http://www.candelatech.com/~greear/vlan.html * * Fixes: - * Fix for packet capture - Nick Eggleston ; + * Fix for packet capture - Nick Eggleston ; * Add HW acceleration hooks - David S. Miller ; * Correct all the locking - David S. Miller ; * Use hash table for VLAN groups - David S. Miller @@ -173,7 +173,7 @@ *pprev = grp->next; } -/* Find the protocol handler. Assumes VID < VLAN_VID_MASK. +/* Find the protocol handler. Assumes VID < VLAN_VID_MASK. * * Must be invoked with vlan_group_lock held. 
*/ @@ -183,7 +183,7 @@ struct vlan_group *grp = __vlan_find_group(real_dev->ifindex); if (grp) - return grp->vlan_devices[VID]; + return grp->vlan_devices[VID]; return NULL; } @@ -270,7 +270,7 @@ } } - return ret; + return ret; } static int unregister_vlan_device(const char *vlan_IF_name) @@ -655,17 +655,14 @@ int vlan_ioctl_handler(unsigned long arg) { int err = 0; + unsigned short vid = 0; struct vlan_ioctl_args args; - /* everything here needs root permissions, except aguably the - * hack ioctls for sending packets. However, I know _I_ don't - * want users running that on my network! --BLG - */ if (!capable(CAP_NET_ADMIN)) return -EPERM; if (copy_from_user(&args, (void*)arg, - sizeof(struct vlan_ioctl_args))) + sizeof(struct vlan_ioctl_args))) return -EFAULT; /* Null terminate this sucker, just in case. */ @@ -678,24 +675,32 @@ switch (args.cmd) { case SET_VLAN_INGRESS_PRIORITY_CMD: + if (!capable(CAP_NET_ADMIN)) + return -EPERM; err = vlan_dev_set_ingress_priority(args.device1, args.u.skb_priority, args.vlan_qos); break; case SET_VLAN_EGRESS_PRIORITY_CMD: + if (!capable(CAP_NET_ADMIN)) + return -EPERM; err = vlan_dev_set_egress_priority(args.device1, args.u.skb_priority, args.vlan_qos); break; case SET_VLAN_FLAG_CMD: + if (!capable(CAP_NET_ADMIN)) + return -EPERM; err = vlan_dev_set_vlan_flag(args.device1, args.u.flag, args.vlan_qos); break; case SET_VLAN_NAME_TYPE_CMD: + if (!capable(CAP_NET_ADMIN)) + return -EPERM; if ((args.u.name_type >= 0) && (args.u.name_type < VLAN_NAME_TYPE_HIGHEST)) { vlan_name_type = args.u.name_type; @@ -705,17 +710,9 @@ } break; - /* TODO: Figure out how to pass info back... 
- case GET_VLAN_INGRESS_PRIORITY_IOCTL: - err = vlan_dev_get_ingress_priority(args); - break; - - case GET_VLAN_EGRESS_PRIORITY_IOCTL: - err = vlan_dev_get_egress_priority(args); - break; - */ - case ADD_VLAN_CMD: + if (!capable(CAP_NET_ADMIN)) + return -EPERM; /* we have been given the name of the Ethernet Device we want to * talk to: args.dev1 We also have the * VLAN ID: args.u.VID @@ -728,12 +725,53 @@ break; case DEL_VLAN_CMD: + if (!capable(CAP_NET_ADMIN)) + return -EPERM; /* Here, the args.dev1 is the actual VLAN we want * to get rid of. */ err = unregister_vlan_device(args.device1); break; + case GET_VLAN_INGRESS_PRIORITY_CMD: + /* TODO: Implement + err = vlan_dev_get_ingress_priority(args); + if (copy_to_user((void*)arg, &args, + sizeof(struct vlan_ioctl_args))) { + err = -EFAULT; + } + */ + err = -EINVAL; + break; + + case GET_VLAN_EGRESS_PRIORITY_CMD: + /* TODO: Implement + err = vlan_dev_get_egress_priority(args.device1, &(args.args); + if (copy_to_user((void*)arg, &args, + sizeof(struct vlan_ioctl_args))) { + err = -EFAULT; + } + */ + err = -EINVAL; + break; + + case GET_VLAN_REALDEV_NAME_CMD: + err = vlan_dev_get_realdev_name(args.device1, args.u.device2); + if (copy_to_user((void*)arg, &args, + sizeof(struct vlan_ioctl_args))) { + err = -EFAULT; + } + break; + + case GET_VLAN_VID_CMD: + err = vlan_dev_get_vid(args.device1, &vid); + args.u.VID = vid; + if (copy_to_user((void*)arg, &args, + sizeof(struct vlan_ioctl_args))) { + err = -EFAULT; + } + break; + default: /* pass on to underlying device instead?? 
*/ printk(VLAN_DBG "%s: Unknown VLAN CMD: %x \n", --- linux-2.4.21/net/8021q/vlan.h 2002-08-02 17:39:46.000000000 -0700 +++ linux-2.4.21.amds/net/8021q/vlan.h 2003-07-30 16:30:53.000000000 -0700 @@ -72,6 +72,8 @@ int vlan_dev_set_ingress_priority(char* dev_name, __u32 skb_prio, short vlan_prio); int vlan_dev_set_egress_priority(char* dev_name, __u32 skb_prio, short vlan_prio); int vlan_dev_set_vlan_flag(char* dev_name, __u32 flag, short flag_val); +int vlan_dev_get_realdev_name(const char* dev_name, char* result); +int vlan_dev_get_vid(const char* dev_name, unsigned short* result); void vlan_dev_set_multicast_list(struct net_device *vlan_dev); #endif /* !(__BEN_VLAN_802_1Q_INC__) */ --- linux-2.4.21/include/linux/if_vlan.h 2002-11-28 15:53:15.000000000 -0800 +++ linux-2.4.21.amds/include/linux/if_vlan.h 2003-07-30 16:29:30.000000000 -0700 @@ -212,7 +212,9 @@ GET_VLAN_INGRESS_PRIORITY_CMD, GET_VLAN_EGRESS_PRIORITY_CMD, SET_VLAN_NAME_TYPE_CMD, - SET_VLAN_FLAG_CMD + SET_VLAN_FLAG_CMD, + GET_VLAN_REALDEV_NAME_CMD, /* If this works, you know it's a VLAN device, btw */ + GET_VLAN_VID_CMD /* Get the VID of this VLAN (specified by name) */ }; enum vlan_name_types { -- Ben Greear Candela Technologies Inc http://www.candelatech.com From werner@almesberger.net Tue Aug 5 22:13:17 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 22:13:25 -0700 (PDT) Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h765DGFl022080 for ; Tue, 5 Aug 2003 22:13:17 -0700 Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h765DEG23709; Tue, 5 Aug 2003 22:13:14 -0700 Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h765D4D31551; Wed, 6 Aug 2003 02:13:04 -0300 Date: Wed, 6 Aug 2003 02:13:04 -0300 From: Werner Almesberger To: "Eric W. 
Biederman" Cc: Jeff Garzik , Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030806021304.E5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <20030804162433.L5798@almesberger.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from ebiederm@xmission.com on Tue, Aug 05, 2003 at 11:19:09AM -0600 X-archive-position: 4570 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: werner@almesberger.net Precedence: bulk X-list: netdev Eric W. Biederman wrote: > MPI is not a transport. It an interface like the Berkeley sockets > layer. Hmm, but doesn't it also unify transport semantics (i.e. chop TCP streams into messages), maybe add reliability to transports that don't have it, and provide addressing ? Okay, perhaps you wouldn't call this a transport in the OSI sense, but it still seems to have considerably more functionality than just providing an API. > Mostly I think the that is less true, at least if they can stand the > process of severe code review and cleaning up their code. Hmm, people putting dozens of millions into building clusters can't afford to have what is probably their most essential infrastructure code reviewed and cleaned up ? Oh dear. > But of course to get through the peer review process people need > to understand what they are doing. A good point :-) > So store and forward of packets in a 3 layer switch hierarchy, at 1.3 us > per copy. But your switch could just do cut-through, no ? Or do they need to recompute checksums ? > A lot of the NICs which are used for MPI tend to be smart for two > reasons. 1) So they can do source routing. 
2) So they can safely > export some of their interface to user space, so in the fast path > they can bypass the kernel. The second part could be interesting for TOE, too. Only that the interface exported would just be the socket interface. - Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From davem@redhat.com Tue Aug 5 23:13:17 2003 Received: with ECARTIS (v1.0.0; list netdev); Tue, 05 Aug 2003 23:13:20 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h766DFFl026190 for ; Tue, 5 Aug 2003 23:13:16 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id XAA24465; Tue, 5 Aug 2003 23:08:37 -0700 Date: Tue, 5 Aug 2003 23:08:37 -0700 From: "David S. Miller" To: Andi Kleen Cc: netdev@oss.sgi.com Subject: Re: [PATCH] Make XFRM optional Message-Id: <20030805230837.5609f130.davem@redhat.com> In-Reply-To: <20030805135315.GB63394@colin2.muc.de> References: <20030804125022.GA8167@averell> <20030804.215801.124854897.yoshfuji@linux-ipv6.org> <20030804130408.GA36367@colin2.muc.de> <20030804114507.6e496c77.davem@redhat.com> <20030804203524.GA15874@colin2.muc.de> <20030804165137.40d744c5.davem@redhat.com> <20030805135315.GB63394@colin2.muc.de> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4571 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On 5 Aug 2003 15:53:15 +0200 Andi Kleen wrote: > Here is a new patch that includes the missing file. 
Something is still wrong; in particular, it seems the net/ipv6/Makefile changes are wrong, resulting in the following when IPV6 is built modular: *** Warning: "xfrm6_fini" [net/ipv6/ipv6.ko] undefined! *** Warning: "xfrm6_init" [net/ipv6/ipv6.ko] undefined! *** Warning: "xfrm6_rcv" [net/ipv6/ipv6.ko] undefined! Please fix, thanks. From andre@linux-ide.org Wed Aug 6 00:23:03 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 00:23:18 -0700 (PDT) Received: from master.linux-ide.org (astound-64-85-224-253.ca.astound.net [64.85.224.253]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h767N2Fl031304 for ; Wed, 6 Aug 2003 00:23:03 -0700 Received: from localhost (andre@localhost) by master.linux-ide.org (8.11.0/8.11.0/SuSE Linux 8.11.0-0.4) with ESMTP id h767CaF25450; Wed, 6 Aug 2003 00:12:36 -0700 Date: Wed, 6 Aug 2003 00:12:36 -0700 (PDT) From: Andre Hedrick To: Jeff Garzik cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Werner Almesberger , Nivedita Singhvi Subject: Re: TOE brain dump In-Reply-To: <3F2CAE61.7070401@pobox.com> Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 4572 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: andre@linux-ide.org Precedence: bulk X-list: netdev Jeff, Do be sure to check that your data payload is correct. Everyone knows that a router/gateway/switch with a sticky bit in its memory will recompute the net crc16 checksum to ensure it passes to the NIC regardless. It is amazing how much data can be corrupted by a network environment via all the NFS/NBD/etc. wannabe storage products out there. Just a chuckle for you to ponder. --a On Sun, 3 Aug 2003, Jeff Garzik wrote: > Werner Almesberger wrote: > > Jeff Garzik wrote: > > > >>jabbering at the same time. TCP is a "one size fits all" solution, but > >>it doesn't work well for everyone.
> > > > > > But then, ten "optimized xxPs" that work well in two different > > scenarios each, but not so good in the 98 others, wouldn't be > > much fun either. > > > > It's been tried a number of times. Usually, real life sneaks > > in at one point or another, leaving behind a complex mess. > > When they've sorted out these problems, regular TCP has caught > > up with the great optimized transport protocols. At that point, > > they return to their niche, sometimes tail between legs and > > muttering curses, sometimes shaking their fist and boldly > > proclaiming how badly they'll rub TCP in the dirt in the next > > round. Maybe they shed off some of the complexity, and trade it > > for even more aggressive optimization, which puts them into > > their niche even more firmly. Eventually, they fade away. > > > > There are cases where TCP doesn't work well, like a path of > > badly mismatched link layers, but such paths don't treat any > > protocol following the end-to-end principle kindly. > > > > Another problem of TCP is that it has grown a bit too many > > knobs you need to turn before it works over your really fast > > really long pipe. (In one of the OLS after dinner speeches, > > this was quite appropriately called the "wizard gap".) > > > > > >>It's obviously not over a WAN... > > > > > > That's why NFS turned off UDP checksums ;-) As soon as you put > > it on IP, it will crawl to distances you didn't imagine in your > > wildest dreams. It always does. > > Really fast, really long pipes in practice don't exist for 99.9% of all > Internet users. > > > When you approach traffic levels that push you want to offload most of > the TCP net stack, then TCP isn't the right solution for you anymore, > all things considered. > > > The Linux net stack just isn't built to be offloaded. TOE engines will > either need to (1) fall back to Linux software for all-but-the-common > case (otherwise netfilter, etc. 
break), or, (2) will need to be > hideously complex beasts themselves. And I can't see ASIC and firmware > designers being excited about implementing netfilter on a PCI card :) > > Unfortunately some vendors seem to choosing TOE option #3: TCP offload > which introduces many limitations (connection limits, netfilter not > supported, etc.) which Linux never had before. Vendors don't seem to > realize TOE has real potential to damage the "good network neighbor" > image the net stack has. The Linux net stack's behavior is known, > documented, predictable. TOE changes all that. > > There is one interesting TOE solution, that I have yet to see created: > run Linux on an embedded processor, on the NIC. This stripped-down > Linux kernel would perform all the header parsing, checksumming, etc. > into the NIC's local RAM. The Linux OS driver interface becomes a > virtual interface with a large MTU, that communicates from host CPU to > NIC across the PCI bus using jumbo-ethernet-like data frames. > Management frames would control the ethernet interface on the other side > of the PCI bus "tunnel". > > > >>So, fix the other end of the pipeline too, otherwise this fast network > >>stuff is flashly but pointless. If you want to serve up data from disk, > >>then start creating PCI cards that have both Serial ATA and ethernet > >>connectors on them :) Cut out the middleman of the host CPU and host > >>memory bus instead of offloading portions of TCP that do not need to be > >>offloaded. > > > > > > That's a good point. A hierarchical memory structure can help > > here. Moving one end closer to the hardware, and letting it > > know (e.g. through sendfile) that also the other end is close > > (or can be reached more directly that through some hopelessly > > crowded main bus) may help too. > > Definitely. 
> > Jeff > > > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > From kazunori@miyazawa.org Wed Aug 6 00:27:54 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 00:28:02 -0700 (PDT) Received: from miyazawa.org (usen-221x116x13x66.ap-US01.usen.ad.jp [221.116.13.66]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h767RqFl031930 for ; Wed, 6 Aug 2003 00:27:54 -0700 Received: from monza.miyazawa.org ([::ffff:203.178.141.107]) (AUTH: LOGIN kazunori, ) by miyazawa.org with esmtp; Wed, 06 Aug 2003 16:20:16 +0900 Date: Wed, 6 Aug 2003 16:28:08 +0900 From: Kazunori Miyazawa To: davem@redhat.com, kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com, usagi@linux-ipv6.org, latten@austin.ibm.com Subject: [PATCH][IPV6] fix clearing in ah6 input Message-Id: <20030806162808.4edf9eeb.kazunori@miyazawa.org> X-Mailer: Sylpheed version 0.9.3 (GTK+ 1.2.10; i386-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4573 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kazunori@miyazawa.org Precedence: bulk X-list: netdev Hello, This patch fixes zero-clear in ah6_input. If calling pskb_expand_head, the kernel clears wrong memory. This patch is against linux-2.6.0-test2 Best regards, diff -ruN a/net/ipv6/ah6.c b/net/ipv6/ah6.c --- a/net/ipv6/ah6.c 2003-07-28 02:05:05.000000000 +0900 +++ b/net/ipv6/ah6.c 2003-08-06 12:57:19.000000000 +0900 @@ -262,13 +262,12 @@ * There is offset of AH before IPv6 header after the process. 
*/ - struct ipv6hdr *iph = skb->nh.ipv6h; struct ipv6_auth_hdr *ah; struct ah_data *ahp; unsigned char *tmp_hdr = NULL; - u16 hdr_len = skb->data - skb->nh.raw; + u16 hdr_len; u16 ah_hlen; - u16 cleared_hlen = hdr_len; + u16 cleared_hlen; u16 nh_offset = 0; u8 nexthdr = 0; u8 *prevhdr; @@ -276,6 +275,14 @@ if (!pskb_may_pull(skb, sizeof(struct ip_auth_hdr))) goto out; + /* We are going to _remove_ AH header to keep sockets happy, + * so... Later this can change. */ + if (skb_cloned(skb) && + pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) + goto out; + + hdr_len = skb->data - skb->nh.raw; + cleared_hlen = hdr_len; ah = (struct ipv6_auth_hdr*)skb->data; ahp = x->data; nexthdr = ah->nexthdr; @@ -294,27 +301,22 @@ if (!pskb_may_pull(skb, ah_hlen)) goto out; - /* We are going to _remove_ AH header to keep sockets happy, - * so... Later this can change. */ - if (skb_cloned(skb) && - pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) - goto out; - tmp_hdr = kmalloc(cleared_hlen, GFP_ATOMIC); if (!tmp_hdr) goto out; memcpy(tmp_hdr, skb->nh.raw, cleared_hlen); ipv6_clear_mutable_options(skb, &nh_offset, XFRM_POLICY_IN); - iph->priority = 0; - iph->flow_lbl[0] = 0; - iph->flow_lbl[1] = 0; - iph->flow_lbl[2] = 0; - iph->hop_limit = 0; + skb->nh.ipv6h->priority = 0; + skb->nh.ipv6h->flow_lbl[0] = 0; + skb->nh.ipv6h->flow_lbl[1] = 0; + skb->nh.ipv6h->flow_lbl[2] = 0; + skb->nh.ipv6h->hop_limit = 0; { u8 auth_data[ahp->icv_trunc_len]; memcpy(auth_data, ah->auth_data, ahp->icv_trunc_len); + memset(ah->auth_data, 0, ahp->icv_trunc_len); skb_push(skb, skb->data - skb->nh.raw); ahp->icv(ahp, skb, ah->auth_data); if (memcmp(ah->auth_data, auth_data, ahp->icv_trunc_len)) { From kazunori@miyazawa.org Wed Aug 6 00:43:52 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 00:44:00 -0700 (PDT) Received: from miyazawa.org (usen-221x116x13x66.ap-US01.usen.ad.jp [221.116.13.66]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h767hpFl000755 for ; Wed, 6 Aug 2003 00:43:52 -0700 Received: from 
monza.miyazawa.org ([::ffff:203.178.141.107]) (AUTH: LOGIN kazunori, ) by miyazawa.org with esmtp; Wed, 06 Aug 2003 16:36:15 +0900 Date: Wed, 6 Aug 2003 16:44:13 +0900 From: Kazunori Miyazawa To: davem@redhat.com, kuznet@ms2.inr.ac.ru Cc: netdev@oss.sgi.com, usagi@linux-ipv6.org, latten@austin.ibm.com Subject: [PATCH][IPV6] fixed authentication error with TCP Message-Id: <20030806164413.669ef5f8.kazunori@miyazawa.org> X-Mailer: Sylpheed version 0.9.3 (GTK+ 1.2.10; i386-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4574 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kazunori@miyazawa.org Precedence: bulk X-list: netdev Hello, Miss Joy (@IBM) and I investigated the bug in which an "authentication error" occurred when using TCP with AH in IPv6. This patch fixes the bug by making the kernel take the extension header length stored in a dst into account. This patch works together with my previous patch, which fixes zero-clear in ah6_input. Please add the name "Joy Latten" to the log. #I'm on summer holiday until 10th August.
I will response very slowly because I only have dial-up line with 30kbps :-p Best regards, diff -ruN a/include/net/ipv6.h b/include/net/ipv6.h --- a/include/net/ipv6.h 2003-07-28 02:07:24.000000000 +0900 +++ b/include/net/ipv6.h 2003-08-06 14:10:36.000000000 +0900 @@ -353,9 +353,7 @@ extern void ip6_flush_pending_frames(struct sock *sk); -extern int ip6_dst_lookup(struct sock *sk, - struct dst_entry **dst, - struct flowi *fl); +extern struct dst_entry * ip6_dst_lookup(struct sock *sk, struct flowi *fl); /* * skb processing functions diff -ruN a/net/ipv6/icmp.c b/net/ipv6/icmp.c --- a/net/ipv6/icmp.c 2003-07-28 01:59:40.000000000 +0900 +++ b/net/ipv6/icmp.c 2003-08-06 14:20:29.000000000 +0900 @@ -355,8 +355,8 @@ if (!fl.oif && ipv6_addr_is_multicast(&fl.fl6_dst)) fl.oif = np->mcast_oif; - err = ip6_dst_lookup(sk, &dst, &fl); - if (err) goto out; + dst = ip6_dst_lookup(sk, &fl); + if (dst->error) goto out; if (hlimit < 0) { if (ipv6_addr_is_multicast(&fl.fl6_dst)) @@ -434,9 +434,9 @@ if (!fl.oif && ipv6_addr_is_multicast(&fl.fl6_dst)) fl.oif = np->mcast_oif; - err = ip6_dst_lookup(sk, &dst, &fl); + dst = ip6_dst_lookup(sk, &fl); - if (err) goto out; + if (dst->error) goto out; if (hlimit < 0) { if (ipv6_addr_is_multicast(&fl.fl6_dst)) diff -ruN a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c --- a/net/ipv6/ip6_output.c 2003-07-28 01:57:01.000000000 +0900 +++ b/net/ipv6/ip6_output.c 2003-08-06 15:35:23.000000000 +0900 @@ -211,10 +211,6 @@ u32 mtu; int err = 0; - if ((err = xfrm_lookup(&skb->dst, fl, sk, 0)) < 0) { - return err; - } - if (opt) { int head_room; @@ -1141,72 +1137,73 @@ return err; } -int ip6_dst_lookup(struct sock *sk, struct dst_entry **dst, struct flowi *fl) +struct dst_entry *ip6_dst_lookup(struct sock *sk, struct flowi *fl) { - struct ipv6_pinfo *np = inet6_sk(sk); + struct dst_entry *dst = NULL; int err = 0; - *dst = __sk_dst_check(sk, np->dst_cookie); - if (*dst) { - struct rt6_info *rt = (struct rt6_info*)*dst; - - /* Yes, checking route validity 
in not connected - case is not very simple. Take into account, - that we do not support routing by source, TOS, - and MSG_DONTROUTE --ANK (980726) - - 1. If route was host route, check that - cached destination is current. - If it is network route, we still may - check its validity using saved pointer - to the last used address: daddr_cache. - We do not want to save whole address now, - (because main consumer of this service - is tcp, which has not this problem), - so that the last trick works only on connected - sockets. - 2. oif also should be the same. - */ - - if (((rt->rt6i_dst.plen != 128 || - ipv6_addr_cmp(&fl->fl6_dst, &rt->rt6i_dst.addr)) - && (np->daddr_cache == NULL || - ipv6_addr_cmp(&fl->fl6_dst, np->daddr_cache))) - || (fl->oif && fl->oif != (*dst)->dev->ifindex)) { - *dst = NULL; - } else - dst_hold(*dst); + if (sk) { + struct ipv6_pinfo *np = inet6_sk(sk); + + dst = __sk_dst_check(sk, np->dst_cookie); + if (dst) { + struct rt6_info *rt = (struct rt6_info*)dst; + + /* Yes, checking route validity in not connected + case is not very simple. Take into account, + that we do not support routing by source, TOS, + and MSG_DONTROUTE --ANK (980726) + + 1. If route was host route, check that + cached destination is current. + If it is network route, we still may + check its validity using saved pointer + to the last used address: daddr_cache. + We do not want to save whole address now, + (because main consumer of this service + is tcp, which has not this problem), + so that the last trick works only on connected + sockets. + 2. oif also should be the same. 
+ */ + + if (((rt->rt6i_dst.plen != 128 || + ipv6_addr_cmp(&fl->fl6_dst, &rt->rt6i_dst.addr)) + && (np->daddr_cache == NULL || + ipv6_addr_cmp(&fl->fl6_dst, np->daddr_cache))) + || (fl->oif && fl->oif != dst->dev->ifindex)) { + dst = NULL; + } else + dst_hold(dst); + } } - if (*dst == NULL) - *dst = ip6_route_output(sk, fl); + if (dst == NULL) + dst = ip6_route_output(sk, fl); - if ((*dst)->error) { - IP6_INC_STATS(Ip6OutNoRoutes); - dst_release(*dst); - return -ENETUNREACH; - } + if (dst->error) + return dst; if (ipv6_addr_any(&fl->fl6_src)) { - err = ipv6_get_saddr(*dst, &fl->fl6_dst, &fl->fl6_src); + err = ipv6_get_saddr(dst, &fl->fl6_dst, &fl->fl6_src); if (err) { #if IP6_DEBUG >= 2 printk(KERN_DEBUG "ip6_build_xmit: " "no available source address\n"); #endif - return err; + dst->error = err; + return dst; } } - if (*dst) { - if ((err = xfrm_lookup(dst, fl, sk, 0)) < 0) { - dst_release(*dst); - return -ENETUNREACH; + if (dst) { + if ((err = xfrm_lookup(&dst, fl, sk, 0)) < 0) { + dst->error = -ENETUNREACH; } } - return 0; + return dst; } int ip6_append_data(struct sock *sk, int getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb), diff -ruN a/net/ipv6/raw.c b/net/ipv6/raw.c --- a/net/ipv6/raw.c 2003-07-28 02:00:40.000000000 +0900 +++ b/net/ipv6/raw.c 2003-08-06 14:19:32.000000000 +0900 @@ -658,8 +658,8 @@ if (!fl.oif && ipv6_addr_is_multicast(&fl.fl6_dst)) fl.oif = np->mcast_oif; - err = ip6_dst_lookup(sk, &dst, &fl); - if (err) + dst = ip6_dst_lookup(sk, &fl); + if (dst->error) goto out; if (hlimit < 0) { diff -ruN a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c --- a/net/ipv6/tcp_ipv6.c 2003-07-28 02:03:09.000000000 +0900 +++ b/net/ipv6/tcp_ipv6.c 2003-08-06 16:13:21.000000000 +0900 @@ -663,7 +663,7 @@ ipv6_addr_copy(&fl.fl6_dst, rt0->addr); } - dst = ip6_route_output(sk, &fl); + dst = ip6_dst_lookup(sk, &fl); if ((err = dst->error) != 0) { dst_release(dst); @@ -691,6 +691,8 @@ tp->ext_header_len = 0; if (np->opt) tp->ext_header_len = 
np->opt->opt_flen + np->opt->opt_nflen; + tp->ext2_header_len = dst->header_len; + tp->mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr); inet->dport = usin->sin6_port; @@ -788,7 +790,7 @@ fl.fl_ip_dport = inet->dport; fl.fl_ip_sport = inet->sport; - dst = ip6_route_output(sk, &fl); + dst = ip6_dst_lookup(sk, &fl); } else dst_hold(dst); @@ -889,7 +891,7 @@ ipv6_addr_copy(&fl.fl6_dst, rt0->addr); } - dst = ip6_route_output(sk, &fl); + dst = ip6_dst_lookup(sk, &fl); if (dst->error) goto done; } @@ -1018,7 +1020,7 @@ fl.fl_ip_sport = t1->source; /* sk = NULL, but it is safe for now. RST socket required. */ - buff->dst = ip6_route_output(NULL, &fl); + buff->dst = ip6_dst_lookup(NULL, &fl); if (buff->dst->error == 0) { ip6_xmit(NULL, buff, &fl, NULL, 0); @@ -1081,7 +1083,7 @@ fl.fl_ip_dport = t1->dest; fl.fl_ip_sport = t1->source; - buff->dst = ip6_route_output(NULL, &fl); + buff->dst = ip6_dst_lookup(NULL, &fl); if (buff->dst->error == 0) { ip6_xmit(NULL, buff, &fl, NULL, 0); @@ -1329,7 +1331,7 @@ fl.fl_ip_dport = req->rmt_port; fl.fl_ip_sport = inet_sk(sk)->sport; - dst = ip6_route_output(sk, &fl); + dst = ip6_dst_lookup(sk, &fl); } if (dst->error) @@ -1401,6 +1403,7 @@ if (newnp->opt) newtp->ext_header_len = newnp->opt->opt_nflen + newnp->opt->opt_flen; + newtp->ext2_header_len = dst->header_len; tcp_sync_mss(newsk, dst_pmtu(dst)); newtp->advmss = dst_metric(dst, RTAX_ADVMSS); @@ -1727,7 +1730,7 @@ ipv6_addr_copy(&fl.fl6_dst, rt0->addr); } - dst = ip6_route_output(sk, &fl); + dst = ip6_dst_lookup(sk, &fl); if (dst->error) { err = dst->error; @@ -1770,7 +1773,7 @@ dst = __sk_dst_check(sk, np->dst_cookie); if (dst == NULL) { - dst = ip6_route_output(sk, &fl); + dst = ip6_dst_lookup(sk, &fl); if (dst->error) { sk->sk_err_soft = -dst->error; diff -ruN a/net/ipv6/udp.c b/net/ipv6/udp.c --- a/net/ipv6/udp.c 2003-07-28 02:07:29.000000000 +0900 +++ b/net/ipv6/udp.c 2003-08-06 14:19:23.000000000 +0900 @@ -928,8 +928,8 @@ if (!fl.oif && 
ipv6_addr_is_multicast(&fl.fl6_dst)) fl.oif = np->mcast_oif; - err = ip6_dst_lookup(sk, &dst, &fl); - if (err) + dst = ip6_dst_lookup(sk, &fl); + if (dst->error) goto out; if (hlimit < 0) { From davem@redhat.com Wed Aug 6 00:57:40 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 00:57:45 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h767vdFl001909 for ; Wed, 6 Aug 2003 00:57:40 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id AAA24739; Wed, 6 Aug 2003 00:52:24 -0700 Date: Wed, 6 Aug 2003 00:52:24 -0700 From: "David S. Miller" To: Robert Olsson Cc: kuznet@ms2.inr.ac.ru, Robert.Olsson@data.slu.se, netdev@oss.sgi.com Subject: Re: [PATCH] repairing rtcache killer Message-Id: <20030806005224.4798f744.davem@redhat.com> In-Reply-To: <16175.58503.134543.310459@robur.slu.se> References: <200308051340.RAA28267@dub.inr.ac.ru> <16175.58503.134543.310459@robur.slu.se> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4575 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Tue, 5 Aug 2003 19:08:23 +0200 Robert Olsson wrote: > > kuznet@ms2.inr.ac.ru writes: > > # Two serious and interesting mistakes were made in the patch of 2003-06-16. Mama mia! This patch exists in 2.4.22-preX too, so full fix becomes more urgent. > For autotuning I think we can have help from a ratio of warm cache > hits (in_hit) and misses (in_slow_tot) to set threshhold to trim > hash chain lengths. Yes, I agree, and algorithm can be even not too smart, something like the following. 
Before scan loop, we compute:

    in_hit = in_slow_tot = 0;
    for (i = 0; i < NR_CPUS; i++) {
        if (!cpu_possible(i))
            continue;
        in_hit += per_cpu_ptr(rt_cache_stat, i)->in_hit;
        in_slow_tot += per_cpu_ptr(rt_cache_stat, i)->in_slow_tot;
    }

    aggressive = 0;
    if (in_hit < (in_slow_tot >> 2))
        aggressive = 1;

    thresh = ip_rt_gc_elasticity;
    if (!aggressive)
        thresh <<= 1;

Then the purging test becomes:

    if (chain_length > thresh ||
        (aggressive && chain_length > 1 &&
         !(min_score & (1<<31)))) {
        *candp = cand->u.rt_next;
        rt_free(cand);
    }

To make algorithm cheaper, we can even use only the current cpu's rt_cache_stat in order to make our decisions about whether to enter aggressive mode or not.

Alexey, given all this what would you like to do? Should I push your patch urgently into 2.4.x or spend some more time trying to solve this issue?

From ebiederm@xmission.com Wed Aug 6 01:02:23 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 01:02:27 -0700 (PDT) Received: from frodo.biederman.org (ebiederm.dsl.xmission.com [166.70.28.69]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7682MFl002543 for ; Wed, 6 Aug 2003 01:02:23 -0700 Received: (from eric@localhost) by frodo.biederman.org (8.9.3/8.9.3) id BAA09468; Wed, 6 Aug 2003 01:58:56 -0600 To: Werner Almesberger Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <20030804162433.L5798@almesberger.net> <20030806021304.E5798@almesberger.net> From: ebiederm@xmission.com (Eric W.
Biederman) Date: 06 Aug 2003 01:58:56 -0600 In-Reply-To: <20030806021304.E5798@almesberger.net> Message-ID: Lines: 63 User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-archive-position: 4576 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ebiederm@xmission.com Precedence: bulk X-list: netdev Werner Almesberger writes: > Eric W. Biederman wrote: > > MPI is not a transport. It an interface like the Berkeley sockets > > layer. > > Hmm, but doesn't it also unify transport semantics (i.e. chop > TCP streams into messages), maybe add reliability to transports > that don't have it, and provide addressing ? Okay, perhaps you > wouldn't call this a transport in the OSI sense, but it still > seems to have considerably more functionality than just > providing an API. Those are all features of the MPI implementation. It is not that MPI does not have an underlying transport. MPI has a lot of underlying transports. And there is a different MPI implementation for each transport. Although a lot of them start with a common base. > > Mostly I think the that is less true, at least if they can stand the > > process of severe code review and cleaning up their code. > > Hmm, people putting dozens of millions into building clusters > can't afford to have what is probably their most essential > infrastructure code reviewed and cleaned up ? Oh dear. Afford, they can do. A lot of the users are researchers and a lot of people doing the code are researchers. So corralling them up and getting production quality code can be a challenge, or getting them to take small enough steps that they don't frighten the rest of the world. Plus ten million dollars pretty much buys you a spot in the top 10 of the top 500 supercomputers. The bulk of the clusters are a lot less expensive than that. 
> > But of course to get through the peer review process people need > > to understand what they are doing. > > A good point :-) > > > So store and forward of packets in a 3 layer switch hierarchy, at 1.3 us > > per copy. > > But your switch could just do cut-through, no ? Or do they > need to recompute checksums ? Correct, switches can and generally do implement cut-through in that kind of environment. I was just showing that even at 10Gbps treating a packet as an atomic unit has issues. cut-through is necessary to keep your latency down. Do any ethernet switches do cut-through? > > A lot of the NICs which are used for MPI tend to be smart for two > > reasons. 1) So they can do source routing. 2) So they can safely > > export some of their interface to user space, so in the fast path > > they can bypass the kernel. > > The second part could be interesting for TOE, too. Only that > the interface exported would just be the socket interface. Agreed. Eric From klein@SANRAD.COM Wed Aug 6 01:07:29 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 01:07:33 -0700 (PDT) Received: from SANSRV1.SAN-RAD.CO.IL ([80.74.102.50]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7686mFl003584 for ; Wed, 6 Aug 2003 01:07:16 -0700 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Subject: iSCSI driver for SGI X-MimeOLE: Produced By Microsoft Exchange V6.0.6249.0 Date: Wed, 6 Aug 2003 11:08:25 +0300 Message-ID: <838D8D2617300146B7F47E4D9AE7FF1086F746@SANSRV1.SAN-RAD.CO.IL> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: iSCSI driver for SGI Thread-Index: AcNb8gqC4BtDU4xtSu2129FQuoH5cg== From: "Yaron Klein" To: Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h7686mFl003584 X-archive-position: 4577 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: klein@SANRAD.COM Precedence: bulk 
X-list: netdev Hi there, My name is Yaron Klein and I'm the Technical Lead of Sanrad (www.sanrad.com). We manufacture iSCSI switches with virtualization capabilities. We have several clients with SGI servers that would like to connect them to our switch. Is there any iSCSI driver for SGI servers? Do you know of any company that distributes such driver? Thanks Yaron Klein Sanrad From ltd@cisco.com Wed Aug 6 01:20:25 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 01:20:35 -0700 (PDT) Received: from sj-iport-2.cisco.com (sj-iport-2-in.cisco.com [171.71.176.71]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h768KOFl004789 for ; Wed, 6 Aug 2003 01:20:24 -0700 Received: from cisco.com (64.104.193.198) by sj-iport-2.cisco.com with ESMTP; 06 Aug 2003 01:24:53 -0700 Received: from cisco.com (localhost [127.0.0.1]) by syd-msg-core-1.cisco.com (8.12.2/8.12.6) with ESMTP id h768KFQt008150; Wed, 6 Aug 2003 18:20:15 +1000 (EST) Received: from ltd-t30.cisco.com (syd-vpn-client-255-68.cisco.com [10.66.255.68]) by cisco.com (8.8.8/2.6/Cisco List Logging/8.8.8) with ESMTP id SAA25057; Wed, 6 Aug 2003 18:24:31 +1000 (EST) Message-Id: <5.1.0.14.2.20030806181359.02bf9570@mira-sjcm-3.cisco.com> X-Sender: ltd@mira-sjcm-3.cisco.com X-Mailer: QUALCOMM Windows Eudora Version 5.1 Date: Wed, 06 Aug 2003 18:20:06 +1000 To: Andre Hedrick From: Lincoln Dale Subject: Re: TOE brain dump Cc: Jeff Garzik , netdev@oss.sgi.com, linux-kernel@vger.kernel.org, Werner Almesberger , Nivedita Singhvi In-Reply-To: References: <3F2CAE61.7070401@pobox.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; format=flowed X-archive-position: 4578 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ltd@cisco.com Precedence: bulk X-list: netdev At 05:12 PM 6/08/2003, Andre Hedrick wrote: >Do be sure to check that your data payload is correct.
>Everyone knows that a router/gateway/switch with a sticky bit in its >memory will recompute the net crc16 checksum insure it pass the to the nic >regardless. It is amazing how much data can be corrupted by a network >environment via all the NFS/NBD/etc wantabie storage products out there. Andre, you are wrong. firstly, do you REALLY think that most router(s)/switch(es) out there recompute IP checksums because they did a IP TTL decrement when routing an IP packet or NAT IP addresses? no, they don't. just like netfilter or router-on-linux is smart enough to be able to re-code an IP checksum by unmasking and re-masking the old/new values in a header, so does the most router vendor's code. secondly, why would a router or switch even be touching the data at layer-4 (TCP), let alone recalculating a CRC? i know you really like your "we do ERL 2 in iSCSI" pitch, but lets stick to facts here eh? cheers, lincoln. From davem@redhat.com Wed Aug 6 01:27:05 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 01:27:11 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h768R4Fl005507 for ; Wed, 6 Aug 2003 01:27:05 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id BAA24823; Wed, 6 Aug 2003 01:22:18 -0700 Date: Wed, 6 Aug 2003 01:22:18 -0700 From: "David S. 
Miller" To: Lincoln Dale Cc: andre@linux-ide.org, jgarzik@pobox.com, netdev@oss.sgi.com, linux-kernel@vger.kernel.org, werner@almesberger.net, niv@us.ibm.com Subject: Re: TOE brain dump Message-Id: <20030806012218.4012d9e4.davem@redhat.com> In-Reply-To: <5.1.0.14.2.20030806181359.02bf9570@mira-sjcm-3.cisco.com> References: <3F2CAE61.7070401@pobox.com> <5.1.0.14.2.20030806181359.02bf9570@mira-sjcm-3.cisco.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4579 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Wed, 06 Aug 2003 18:20:06 +1000 Lincoln Dale wrote: > secondly, why would a router or switch even be touching the data at layer-4 > (TCP), let alone recalculating a CRC? To make sure emails about Falun Gong and other undesirable topics don't make it into China. 
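
A note on Lincoln Dale's checksum point earlier in this thread: patching the IPv4 header checksum after a TTL decrement, rather than recomputing it over the whole header, is the incremental update described in RFC 1624, HC' = ~(~HC + ~m + m'). What follows is only a minimal sketch of that idea, assuming an IPv4 header without options held as ten host-order 16-bit words (word 4 is TTL and protocol, word 5 is the checksum). The names csum_full, csum_update and forward_decrement_ttl are made up for illustration; this is not netfilter code or any router vendor's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full one's-complement checksum over n 16-bit words.
 * The checksum word itself must be zero while this runs. */
static uint16_t csum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~fold(sum);
}

/* RFC 1624 eqn. 3: HC' = ~(~HC + ~m + m'), where m is the old
 * value of one 16-bit header word and m' is its new value. */
static uint16_t csum_update(uint16_t old_csum, uint16_t old_word,
                            uint16_t new_word)
{
    uint32_t sum = (uint16_t)~old_csum;
    sum += (uint16_t)~old_word;
    sum += new_word;
    return (uint16_t)~fold(sum);
}

/* Hypothetical forwarding step: decrement the TTL (high byte of
 * word 4) and patch the checksum (word 5) without touching the
 * other eight words. */
static void forward_decrement_ttl(uint16_t hdr[10])
{
    uint16_t old_word = hdr[4];

    assert((old_word >> 8) > 0);          /* caller drops TTL 0 packets */
    hdr[4] = (uint16_t)(old_word - 0x0100);
    hdr[5] = csum_update(hdr[5], old_word, hdr[4]);
}
```

The update reads only the old checksum and the one changed word, so its cost does not depend on header length, and it never looks at the payload or at the TCP checksum.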
From yoshfuji@linux-ipv6.org Wed Aug 6 03:28:20 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 03:28:38 -0700 (PDT) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76ASIFl014451 for ; Wed, 6 Aug 2003 03:28:19 -0700 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h76ASO1M013276; Wed, 6 Aug 2003 19:28:24 +0900 Date: Wed, 06 Aug 2003 19:28:24 +0900 (JST) Message-Id: <20030806.192824.129498139.yoshfuji@linux-ipv6.org> To: vnuorval@tcs.hut.fi Cc: netdev@oss.sgi.com, usagi-core@linux-ipv6.org Subject: Re: (usagi-core 14846) [PATCH] IPv6: No fragmentation of packets with length <= mtu From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: References: Organization: USAGI Project X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4580 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev In article (at Wed, 6 Aug 2003 13:00:41 +0300 (EEST)), Ville Nuorvala says: > I noticed ip6_append_data() always reserves space for the fragment header > at the cost of the payload data. This leads to the unnecessary > fragmentation of packets with lengths close (or equal) to the link-mtu. > This can for example be seen with ping6 -s 1448 (on an ethernet link). 
> > My attached patch _seems_ to fix the problem without breaking anything > else, but can you still verify this? Well, your patch breaks something; the idea of append_data. User may "push" multiple times to generate a packet. I'm chasing this bug. Since this bug is not grave, we do not need to fix this ASAP; We need to fix a grave issue with UDPv6 with MSG_MORE flag before this. -- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From vnuorval@tcs.hut.fi Wed Aug 6 03:36:50 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 03:36:56 -0700 (PDT) Received: from mail.tcs.hut.fi (mail.tcs.hut.fi [130.233.215.20]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76AadFl014813 for ; Wed, 6 Aug 2003 03:36:40 -0700 Received: from rhea.tcs.hut.fi (rhea.tcs.hut.fi [130.233.215.147]) by mail.tcs.hut.fi (Postfix) with ESMTP id 886078001E3; Wed, 6 Aug 2003 13:00:42 +0300 (EEST) Received: from rhea.tcs.hut.fi (localhost [127.0.0.1]) by rhea.tcs.hut.fi (8.12.3/8.12.3/Debian-5) with ESMTP id h76A0g5L001820; Wed, 6 Aug 2003 13:00:42 +0300 Received: from localhost (vnuorval@localhost) by rhea.tcs.hut.fi (8.12.3/8.12.3/Debian-5) with ESMTP id h76A0fT0001816; Wed, 6 Aug 2003 13:00:42 +0300 Date: Wed, 6 Aug 2003 13:00:41 +0300 (EEST) From: Ville Nuorvala To: usagi-core@linux-ipv6.org Cc: netdev@oss.sgi.com Subject: [PATCH] IPv6: No fragmentation of packets with length <= mtu Message-ID: MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-377318441-1293612270-1060158514=:1401" Content-ID: X-archive-position: 4581 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: vnuorval@tcs.hut.fi Precedence: bulk X-list: netdev This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. Send mail to mime@docserver.cac.washington.edu for more info. 
---377318441-1293612270-1060158514=:1401 Content-Type: TEXT/PLAIN; CHARSET=US-ASCII Content-ID: Hi guys, I noticed ip6_append_data() always reserves space for the fragment header at the cost of the payload data. This leads to the unnecessary fragmentation of packets with lengths close (or equal) to the link-mtu. This can for example be seen with ping6 -s 1448 (on an ethernet link). My attached patch _seems_ to fix the problem without breaking anything else, but can you still verify this? Thanks, Ville -- Ville Nuorvala Research Assistant, Institute of Digital Communications, Helsinki University of Technology email: vnuorval@tcs.hut.fi, phone: +358 (0)9 451 5257 ---377318441-1293612270-1060158514=:1401 Content-Type: TEXT/PLAIN; charset=US-ASCII; name="ip6_output_frag.patch" Content-Transfer-Encoding: BASE64 Content-ID: Content-Description: Content-Disposition: attachment; filename="ip6_output_frag.patch" LS0tIGxpbnV4LTIuNS5PTEQvbmV0L2lwdjYvaXA2X291dHB1dC5jCTIwMDMt MDgtMDUgMTU6MTU6MDcuMDAwMDAwMDAwICswMzAwDQorKysgbGludXgtMi41 L25ldC9pcHY2L2lwNl9vdXRwdXQuYwkyMDAzLTA4LTA2IDExOjI3OjM4LjAw MDAwMDAwMCArMDMwMA0KQEAgLTEyMTcsNyArMTIxNyw3IEBADQogCXN0cnVj dCBpbmV0X29wdCAqaW5ldCA9IGluZXRfc2soc2spOw0KIAlzdHJ1Y3QgaXB2 Nl9waW5mbyAqbnAgPSBpbmV0Nl9zayhzayk7DQogCXN0cnVjdCBza19idWZm ICpza2I7DQotCXVuc2lnbmVkIGludCBtYXhmcmFnbGVuLCBmcmFnaGVhZGVy bGVuOw0KKwl1bnNpZ25lZCBpbnQgbWF4ZnJhZ2xlbiwgZnJhZ2hlYWRlcmxl biwgeHRyYV9maF9sZW47DQogCWludCBleHRoZHJsZW47DQogCWludCBoaF9s ZW47DQogCWludCBtdHU7DQpAQCAtMTI2Myw3ICsxMjYzLDEyIEBADQogCWho X2xlbiA9IChydC0+dS5kc3QuZGV2LT5oYXJkX2hlYWRlcl9sZW4mfjE1KSAr IDE2Ow0KIA0KIAlmcmFnaGVhZGVybGVuID0gc2l6ZW9mKHN0cnVjdCBpcHY2 aGRyKSArIChvcHQgPyBvcHQtPm9wdF9uZmxlbiA6IDApOw0KLQltYXhmcmFn bGVuID0gKChtdHUgLSBmcmFnaGVhZGVybGVuKSAmIH43KSArIGZyYWdoZWFk ZXJsZW4gLSBzaXplb2Yoc3RydWN0IGZyYWdfaGRyKTsNCisJbWF4ZnJhZ2xl biA9ICgobXR1IC0gZnJhZ2hlYWRlcmxlbikgJiB+NykgKyBmcmFnaGVhZGVy bGVuOw0KKw0KKwkvKiBPbmx5IHJlc2VydmUgc3BhY2UgZm9yIGZyYWdfaGRy 
IGlmIHRoZSBwYWNrZXQgd2lsbCBiZSBmcmFnbWVudGVkICovDQorCXh0cmFf ZmhfbGVuID0gKGxlbmd0aCA+IG1heGZyYWdsZW4gPyBzaXplb2Yoc3RydWN0 IGZyYWdfaGRyKSA6IDApOw0KKw0KKwltYXhmcmFnbGVuIC09IHh0cmFfZmhf bGVuOw0KIA0KIAlpZiAobXR1IDw9IHNpemVvZihzdHJ1Y3QgaXB2Nmhkcikg KyBJUFY2X01BWFBMRU4pIHsNCiAJCWlmIChpbmV0LT5jb3JrLmxlbmd0aCAr IGxlbmd0aCA+IHNpemVvZihzdHJ1Y3QgaXB2NmhkcikgKyBJUFY2X01BWFBM RU4gLSBmcmFnaGVhZGVybGVuKSB7DQpAQCAtMTI5NCw3ICsxMjk5LDcgQEAN CiAJCQkJYWxsb2NsZW4gPSBtYXhmcmFnbGVuOw0KIAkJCWVsc2UNCiAJCQkJ YWxsb2NsZW4gPSBmcmFnbGVuOw0KLQkJCWFsbG9jbGVuICs9IHNpemVvZihz dHJ1Y3QgZnJhZ19oZHIpOw0KKwkJCWFsbG9jbGVuICs9IHh0cmFfZmhfbGVu Ow0KIAkJCWlmICh0cmFuc2hkcmxlbikgew0KIAkJCQlza2IgPSBzb2NrX2Fs bG9jX3NlbmRfc2tiKHNrLA0KIAkJCQkJCWFsbG9jbGVuICsgaGhfbGVuICsg MTUsDQpAQCAtMTMxNyw3ICsxMzIyLDcgQEANCiAJCQlza2ItPmlwX3N1bW1l ZCA9IGNzdW1tb2RlOw0KIAkJCXNrYi0+Y3N1bSA9IDA7DQogCQkJLyogcmVz ZXJ2ZSA4IGJ5dGUgZm9yIGZyYWdtZW50YXRpb24gKi8NCi0JCQlza2JfcmVz ZXJ2ZShza2IsIGhoX2xlbitzaXplb2Yoc3RydWN0IGZyYWdfaGRyKSk7DQor CQkJc2tiX3Jlc2VydmUoc2tiLCBoaF9sZW4gKyB4dHJhX2ZoX2xlbik7DQog DQogCQkJLyoNCiAJCQkgKglGaW5kIHdoZXJlIHRvIHN0YXJ0IHB1dHRpbmcg Ynl0ZXMNCg== ---377318441-1293612270-1060158514=:1401-- From davem@redhat.com Wed Aug 6 03:51:21 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 03:51:29 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76ApKFl015260 for ; Wed, 6 Aug 2003 03:51:21 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id DAA24968; Wed, 6 Aug 2003 03:46:43 -0700 Date: Wed, 6 Aug 2003 03:46:42 -0700 From: "David S. Miller" To: Ben Greear Cc: netdev@oss.sgi.com Subject: Re: PATCH: Add comment to make finding the priv_flags definition easier. 
Message-Id: <20030806034642.4d91641c.davem@redhat.com> In-Reply-To: <3F304AC7.6070808@candelatech.com> References: <3F304AC7.6070808@candelatech.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4582 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Tue, 05 Aug 2003 17:24:39 -0700 Ben Greear wrote: > This helps me, at least, remember where the definitions are at! This is silly, adding one for 'priv_flags' and not one for 'flags'. I really don't have a taste for these "look in file foo for interesting stuff about bar" type comments :-) From vnuorval@tcs.hut.fi Wed Aug 6 04:43:30 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 04:43:38 -0700 (PDT) Received: from mail.tcs.hut.fi (mail.tcs.hut.fi [130.233.215.20]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76BhIFl016836 for ; Wed, 6 Aug 2003 04:43:19 -0700 Received: from rhea.tcs.hut.fi (rhea.tcs.hut.fi [130.233.215.147]) by mail.tcs.hut.fi (Postfix) with ESMTP id 693528001BC; Wed, 6 Aug 2003 14:00:21 +0300 (EEST) Received: from rhea.tcs.hut.fi (localhost [127.0.0.1]) by rhea.tcs.hut.fi (8.12.3/8.12.3/Debian-5) with ESMTP id h76B0L5L002021; Wed, 6 Aug 2003 14:00:21 +0300 Received: from localhost (vnuorval@localhost) by rhea.tcs.hut.fi (8.12.3/8.12.3/Debian-5) with ESMTP id h76B0LXF002017; Wed, 6 Aug 2003 14:00:21 +0300 Date: Wed, 6 Aug 2003 14:00:21 +0300 (EEST) From: Ville Nuorvala To: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= Cc: netdev@oss.sgi.com, Subject: Re: (usagi-core 14846) [PATCH] IPv6: No fragmentation of packets with length <= mtu In-Reply-To: <20030806.192824.129498139.yoshfuji@linux-ipv6.org> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=iso-8859-15 X-archive-position: 4583 
X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: vnuorval@tcs.hut.fi Precedence: bulk X-list: netdev On Wed, 6 Aug 2003, YOSHIFUJI Hideaki / [iso-2022-jp] $B5HF#1QL@(B wrote: > Well, your patch breaks something; the idea of append_data. > User may "push" multiple times to generate a packet. Yeah, I thought there might be something like this, since I'm not yet that familiar with the new fragmentation code. :) > I'm chasing this bug. > Since this bug is not grave, we do not need to fix this ASAP; > We need to fix a grave issue with UDPv6 with MSG_MORE flag > before this. No problem. Regards, Ville -- Ville Nuorvala Research Assistant, Institute of Digital Communications, Helsinki University of Technology email: vnuorval@tcs.hut.fi, phone: +358 (0)9 451 5257 From jesse@cats-chateau.net Wed Aug 6 05:47:12 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 05:47:25 -0700 (PDT) Received: from tabby.cats.internal (34.mufa.noln.chcgil24.dsl.att.net [12.100.181.34]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76ClBFl017849 for ; Wed, 6 Aug 2003 05:47:12 -0700 Received: from localhost (localhost [[UNIX: localhost]]) by tabby.cats.internal (8.11.4/8.11.4) id h76CkuR08429; Wed, 6 Aug 2003 07:46:56 -0500 Content-Type: text/plain; charset="iso-8859-1" From: Jesse Pollard To: ebiederm@xmission.com (Eric W. 
Biederman), Werner Almesberger Subject: Re: TOE brain dump Date: Wed, 6 Aug 2003 07:46:33 -0500 X-Mailer: KMail [version 1.2] Cc: Jeff Garzik , Nivedita Singhvi , netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20030802140444.E5798@almesberger.net> <20030804162433.L5798@almesberger.net> In-Reply-To: MIME-Version: 1.0 Message-Id: <03080607463300.08387@tabby> Content-Transfer-Encoding: 8bit X-archive-position: 4584 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jesse@cats-chateau.net Precedence: bulk X-list: netdev On Tuesday 05 August 2003 12:19, Eric W. Biederman wrote: > Werner Almesberger writes: > > Eric W. Biederman wrote: > > > The optimized for low latency cases seem to have a strong > > > market in clusters. > > > > Clusters have captive, no, _desperate_ customers ;-) And it > > seems that people are just as happy putting MPI as their > > transport on top of all those link-layer technologies. > > MPI is not a transport. It an interface like the Berkeley sockets > layer. The semantics it wants right now are usually mapped to > TCP/IP when used on an IP network. Though I suspect SCTP might > be a better fit. > > But right now nothing in the IP stack is a particularly good fit. > > Right now there is a very strong feeling among most of the people > using and developing on clusters that by and large what they are doing > is not of interest to the general kernel community, and so has no > chance of going in. So you see hack piled on top of hack piled on > top of hack. > > Mostly I think the that is less true, at least if they can stand the > process of severe code review and cleaning up their code. If we can > put in code to scale the kernel to 64 processors. NIC drivers for > fast interconnects and a few similar tweaks can't hurt either. > > But of course to get through the peer review process people need > to understand what they are doing. 
> > > > There is one place in low latency communications that I can think > > > of where TCP/IP is not the proper solution. For low latency > > > communication the checksum is at the wrong end of the packet. > > > > That's one of the few things ATM's AAL5 got right. But in the end, > > I think it doesn't really matter. At 1 Gbps, an MTU-sized packet > > flies by within 13 us. At 10 Gbps, it's only 1.3 us. At that point, > > you may well treat it as an atomic unit. > > So store and forward of packets in a 3 layer switch hierarchy, at 1.3 us > per copy. 1.3us to the NIC + 1.3us to the first switch chip + 1.3us to the > second switch chip + 1.3us to the top level switch chip + 1.3us to a middle > layer switch chip + 1.3us to the receiving NIC + 1.3us the receiver. > > 1.3us * 7 = 9.1us to deliver a packet to the other side. That is > still quite painful. Right now I can get better latencies over any of > the cluster interconnects. I think 5 us is the current low end, with > the high end being about 1 us. I think you are off here since the second and third layer should not recompute checksums other than for the header (if they even did that). Most of the switches I used (mind, not configured) were wire speed. Only header checksums had recomputes, and I understood it was only for routing. > Quite often in MPI when a message is sent the program cannot continue > until the reply is received. Possibly this is a fundamental problem > with the application programming model, encouraging applications to > be latency sensitive. But it is a well established API and > programming paradigm so it has to be lived with. > > All of this is pretty much the reverse of the TOE case. Things are > latency sensitive because real work needs to be done. And the more > latency you have the slower that work gets done. > > A lot of the NICs which are used for MPI tend to be smart for two > reasons. 1) So they can do source routing. 
2) So they can safely > export some of their interface to user space, so in the fast path > they can bypass the kernel. And bypass any security checks required. A single rogue MPI application using such an interface can/will bring the cluster down. Now this is not as much of a problem since many clusters use a standalone internal network, AND are single application clusters. These clusters tend to be relatively small (32 - 64 nodes? perhaps 16-32 is better. The clusters I've worked with have always been large 128-300 nodes, so I'm not a good judge of "small"). This is immediately broken when you schedule two or more batch jobs on a cluster in parallel. It is also broken if the two jobs require different security contexts. From jesse@cats-chateau.net Wed Aug 6 06:08:14 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 06:08:24 -0700 (PDT) Received: from tabby.cats.internal (34.mufa.noln.chcgil24.dsl.att.net [12.100.181.34]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76D8DFl018443 for ; Wed, 6 Aug 2003 06:08:14 -0700 Received: from localhost (localhost [[UNIX: localhost]]) by tabby.cats.internal (8.11.4/8.11.4) id h76D7wW08472; Wed, 6 Aug 2003 08:07:58 -0500 Content-Type: text/plain; charset="iso-8859-1" From: Jesse Pollard To: "David S. 
Miller" , Lincoln Dale Subject: Re: TOE brain dump Date: Wed, 6 Aug 2003 08:07:37 -0500 X-Mailer: KMail [version 1.2] Cc: andre@linux-ide.org, jgarzik@pobox.com, netdev@oss.sgi.com, linux-kernel@vger.kernel.org, werner@almesberger.net, niv@us.ibm.com References: <3F2CAE61.7070401@pobox.com> <5.1.0.14.2.20030806181359.02bf9570@mira-sjcm-3.cisco.com> <20030806012218.4012d9e4.davem@redhat.com> In-Reply-To: <20030806012218.4012d9e4.davem@redhat.com> MIME-Version: 1.0 Message-Id: <03080608073701.08387@tabby> Content-Transfer-Encoding: 8bit X-archive-position: 4585 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jesse@cats-chateau.net Precedence: bulk X-list: netdev On Wednesday 06 August 2003 03:22, David S. Miller wrote: > On Wed, 06 Aug 2003 18:20:06 +1000 > > Lincoln Dale wrote: > > secondly, why would a router or switch even be touching the data at > > layer-4 (TCP), let alone recalculating a CRC? > > To make sure emails about Falun Gong and other undesirable topics > don't make it into China. Thats not a router, or switch... It's a firewall :-) From werner@almesberger.net Wed Aug 6 06:38:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 06:38:34 -0700 (PDT) Received: from host.almesberger.net (almesberger.net [63.105.73.239] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76Dc6Fl019724 for ; Wed, 6 Aug 2003 06:38:07 -0700 Received: from almesberger.net (vpnwa-home [10.200.0.2]) by host.almesberger.net (8.11.6/8.9.3) with ESMTP id h76Dc3G27530; Wed, 6 Aug 2003 06:38:03 -0700 Received: (from werner@localhost) by almesberger.net (8.11.6/8.11.6) id h76Dbwk01468; Wed, 6 Aug 2003 10:37:58 -0300 Date: Wed, 6 Aug 2003 10:37:58 -0300 From: Werner Almesberger To: "Eric W. 
Biederman" Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030806103758.H5798@almesberger.net> References: <20030802140444.E5798@almesberger.net> <3F2BF5C7.90400@us.ibm.com> <3F2C0C44.6020002@pobox.com> <20030802184901.G5798@almesberger.net> <20030804162433.L5798@almesberger.net> <20030806021304.E5798@almesberger.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: ; from ebiederm@xmission.com on Wed, Aug 06, 2003 at 01:58:56AM -0600 X-archive-position: 4586 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: werner@almesberger.net Precedence: bulk X-list: netdev Eric W. Biederman wrote: > Afford, they can do. A lot of the users are researchers and > a lot of people doing the code are researchers. So corralling > them up and getting production quality code can be a challenge, Ah, the joy of herding cats :-) But I guess you just need a sufficiently competent and sufficiently well-funded group that goes ahead and does it. There is usually little point in directly involving everyone who may have an opinion. > to keep your latency down. Do any ethernet switches do cut-through? According to Google, many at least claim to do this. 
- Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ From adi@hexapodia.org Wed Aug 6 09:26:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 09:26:12 -0700 (PDT) Received: from pirx.hexapodia.org (postfix@pirx.hexapodia.org [208.42.114.113]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76GPwFl001281 for ; Wed, 6 Aug 2003 09:25:59 -0700 Received: by pirx.hexapodia.org (Postfix, from userid 22448) id D6341B404; Wed, 6 Aug 2003 11:25:56 -0500 (CDT) Date: Wed, 6 Aug 2003 11:25:56 -0500 From: Andy Isaacson To: Jesse Pollard Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030806112556.C26920@hexapodia.org> References: <20030802140444.E5798@almesberger.net> <20030804162433.L5798@almesberger.net> <03080607463300.08387@tabby> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <03080607463300.08387@tabby>; from jesse@cats-chateau.net on Wed, Aug 06, 2003 at 07:46:33AM -0500 X-PGP-Fingerprint: 48 01 21 E2 D4 E4 68 D1 B8 DF 39 B2 AF A3 16 B9 X-PGP-Key-URL: http://web.hexapodia.org/~adi/pgp.txt X-Domestic-Surveillance: money launder bomb tax evasion X-archive-position: 4587 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: adi@hexapodia.org Precedence: bulk X-list: netdev On Wed, Aug 06, 2003 at 07:46:33AM -0500, Jesse Pollard wrote: > On Tuesday 05 August 2003 12:19, Eric W. Biederman wrote: > > So store and forward of packets in a 3 layer switch hierarchy, at 1.3 us > > per copy. 
1.3us to the NIC + 1.3us to the first switch chip + 1.3us to the > > second switch chip + 1.3us to the top level switch chip + 1.3us to a middle > > layer switch chip + 1.3us to the receiving NIC + 1.3us the receiver. > > > > 1.3us * 7 = 9.1us to deliver a packet to the other side. That is > > still quite painful. Right now I can get better latencies over any of > > the cluster interconnects. I think 5 us is the current low end, with > > the high end being about 1 us. > > I think you are off here since the second and third layer should not recompute > checksums other than for the header (if they even did that). Most of the > switches I used (mind, not configured) were wire speed. Only header checksums > had recomputes, and I understood it was only for routing. The switches may be "wire speed" but that doesn't help the latency any. AFAIK all GigE switches are store-and-forward, which automatically costs you the full 1.3us for each link hop. (I didn't check Eric's numbers, so I don't know that 1.3us is the right value, but it sounds right.) Also I think you might be confused about what Eric meant by "3 layer switch hierarchy"; he's referring to a tree topology network with layer-one switches connecting hosts, layer-two switches connecting layer-one switches, and layer-three switches connecting layer-two switches. This means that your worst-case node-to-node latency has 6 wire hops with 7 "read the entire packet into memory" operations, depending on how you count the initiating node's generation of the packet. [snip] > > Quite often in MPI when a message is sent the program cannot continue > > until the reply is received. Possibly this is a fundamental problem > > with the application programming model, encouraging applications to > > be latency sensitive. But it is a well established API and > > programming paradigm so it has to be lived with. This is true, in HPC. 
Some of the problem is the APIs encouraging such behavior; another part of the problem is that sometimes, the problem has fundamental latency dependencies that cannot be programmed around. > > A lot of the NICs which are used for MPI tend to be smart for two > > reasons. 1) So they can do source routing. 2) So they can safely > > export some of their interface to user space, so in the fast path > > they can bypass the kernel. > > And bypass any security checks required. A single rogue MPI application > using such an interface can/will bring the cluster down. This is just false. Kernel bypass (done properly) has no negative effect on system stability, either on-node or on-network. By "done properly" I mean that the NIC has mappings programmed into it by the kernel at app-startup time, and properly bounds-checks all remote DMA, and has a method for verifying that incoming packets are not rogue or corrupt. (Of course a rogue *kernel* can probably interfere with other *applications* on the network it's connected to, by inserting malicious packets into the datastream, but even that is soluble with cookies or routing checks. However, I don't believe any systems try to defend against rogue nodes today.) I believe that Myrinet's hardware has the capability to meet the "kernel bypass done properly" requirement I state above; I make no claim that their GM implementation actually meets the requirement (although I think it might). It's pretty likely that QSW's Elan hardware can, too, but I know even less about that. 
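The hop counting earlier in this message can be written down as a small calculation. This is only a sketch: the 1.3us per-copy figure is the unverified number quoted above, and the generalisation to an arbitrary number of switch layers (2 * levels wire hops, one more full-packet copy than wire hops) is an assumption read off the three-layer tree described here. The function names are made up for illustration.

```c
#include <stdint.h>

/*
 * Worst-case path in a tree with `levels` switch layers: up through
 * `levels` switches, back down through `levels - 1` of them, with a host
 * at each end. That gives 2 * levels links on the path.
 */
static uint32_t worst_case_wire_hops(uint32_t levels)
{
    return 2 * levels;
}

/*
 * Every node on the path (both hosts and every switch) holds the whole
 * packet once, so there is one more copy than there are links.
 */
static uint32_t worst_case_copies(uint32_t levels)
{
    return worst_case_wire_hops(levels) + 1;
}

/* Store-and-forward latency in nanoseconds for a given per-copy cost. */
static uint64_t worst_case_latency_ns(uint32_t levels, uint64_t per_copy_ns)
{
    return (uint64_t)worst_case_copies(levels) * per_copy_ns;
}
```

With levels = 3 and 1300 ns per copy this reproduces the 6 wire hops, 7 copies and 9.1us quoted above; whether the sending host's own copy should be counted is the open question already noted.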
-andy From Robert.Olsson@data.slu.se Wed Aug 6 10:01:59 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 10:02:03 -0700 (PDT) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76H1vFl004216 for ; Wed, 6 Aug 2003 10:01:58 -0700 Received: (from robert@localhost) by robur.slu.se (8.9.3p2/8.9.3) id TAA32274; Wed, 6 Aug 2003 19:01:47 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16177.13435.45921.229742@robur.slu.se> Date: Wed, 6 Aug 2003 19:01:47 +0200 To: kuznet@ms2.inr.ac.ru Cc: davem@redhat.com, Robert.Olsson@data.slu.se, netdev@oss.sgi.com Subject: [PATCH] repairing rtcache killer In-Reply-To: <200308051340.RAA28267@dub.inr.ac.ru> References: <200308051340.RAA28267@dub.inr.ac.ru> X-Mailer: VM 6.92 under Emacs 19.34.1 X-archive-position: 4588 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev kuznet@ms2.inr.ac.ru writes: > # This is a BitKeeper generated patch for the following project: > # Project Name: Linux kernel tree > # This patch format is intended for GNU patch command version 2.5 or higher. > # This patch includes the following deltas: > # ChangeSet 1.1613 -> 1.1614 > # net/ipv4/route.c 1.66 -> 1.67 Hello! Crap. We are back to the dst cache overflow again, even with routing tables loaded. Well, the test is now on SMP and 2.6.0-test1. Undo the min_score test and give it a retry? Or is there some new RCU stuff to discover? This is the current lab setup. Cheers.
--ro From kuznet@ms2.inr.ac.ru Wed Aug 6 10:15:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 10:15:11 -0700 (PDT) Received: from dub.inr.ac.ru (dub.inr.ac.ru [193.233.7.105]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76HF6Fl005434 for ; Wed, 6 Aug 2003 10:15:08 -0700 Received: (from kuznet@localhost) by dub.inr.ac.ru (8.6.13/ANK) id VAA01614; Wed, 6 Aug 2003 21:14:56 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200308061714.VAA01614@dub.inr.ac.ru> Subject: Re: [PATCH] repairing rtcache killer To: Robert.Olsson@data.slu.se (Robert Olsson) Date: Wed, 6 Aug 2003 21:14:56 +0400 (MSD) Cc: davem@redhat.com, Robert.Olsson@data.slu.se, netdev@oss.sgi.com In-Reply-To: <16175.58503.134543.310459@robur.slu.se> from "Robert Olsson" at Aug 05, 2003 07:08:23 PM X-Mailer: ELM [version 2.5 PL6] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4589 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Hello! > For autotuning I think we can have help from a ratio of warm cache hits (in_hit) > and misses (in_slow_tot) to set threshold to trim hash chain lengths. Have you already forgotten the whole day lost staring at some impossible numbers for the number of misses? :-) The only thing you can do with the hits/misses ratio is to _increase_ the size of the cache when the number of misses grows. That is not only useless, it is exactly the opposite of the behaviour you want to see. We want to shrink it under DoS, remember? I still do not know of any criterion; apparently it should be based not on the hits/misses ratio but on absolute rates or something like that. average/max = 1/2 is always acceptable, perfect at normal flow rates and not disastrous even for 1 packet/flow.
Alexey From Robert.Olsson@data.slu.se Wed Aug 6 10:24:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 10:24:07 -0700 (PDT) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76HO1Fl006290 for ; Wed, 6 Aug 2003 10:24:02 -0700 Received: (from robert@localhost) by robur.slu.se (8.9.3p2/8.9.3) id TAA32372; Wed, 6 Aug 2003 19:23:52 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16177.14760.612648.582063@robur.slu.se> Date: Wed, 6 Aug 2003 19:23:52 +0200 To: "David S. Miller" Cc: Robert Olsson , kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: Re: [PATCH] repairing rtcache killer In-Reply-To: <20030806005224.4798f744.davem@redhat.com> References: <200308051340.RAA28267@dub.inr.ac.ru> <16175.58503.134543.310459@robur.slu.se> <20030806005224.4798f744.davem@redhat.com> X-Mailer: VM 6.92 under Emacs 19.34.1 X-archive-position: 4590 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev David S. Miller writes: > To make algorithm cheaper, we can even use only the current > cpu's rt_cache_stat in order to make our decisions about whether > to enter aggressive mode or not. We get into complex scenarios... we risk that not all CPUs see "attacks", so they still contribute to the hash chain and spinning. As an alternative, we can run the chain length calculation at intervals. One starts dreaming of having a route hash per CPU to avoid cache bouncing in the hash lookup. :-) Well, not all cache bouncing will disappear (read DoS), but nowadays there is "more" affinity on input due to NAPI, the scheduler, etc.
Somewhat reworked code:

--- include/net/route.h.orig 2003-07-14 05:28:54.000000000 +0200
+++ include/net/route.h 2003-08-06 18:27:52.000000000 +0200
@@ -89,7 +89,9 @@
 struct rt_cache_stat
 {
         unsigned int in_hit;
+        unsigned int in_hit_last;
         unsigned int in_slow_tot;
+        unsigned int in_slow_tot_last;
         unsigned int in_slow_mc;
         unsigned int in_no_route;
         unsigned int in_brd;
--- net/ipv4/route.c.ank-repair-hash 2003-08-06 16:27:19.000000000 +0200
+++ net/ipv4/route.c 2003-08-06 17:53:43.000000000 +0200
@@ -118,6 +118,8 @@
 int ip_rt_error_cost = HZ;
 int ip_rt_error_burst = 5 * HZ;
 int ip_rt_gc_elasticity = 8;
+int ip_rt_gc_elasticity2 = 2 * 8;
+int ip_rt_gc_elasticity2_recalc = 0;
 int ip_rt_mtu_expires = 10 * 60 * HZ;
 int ip_rt_min_pmtu = 512 + 20 + 20;
 int ip_rt_min_advmss = 256;
@@ -747,6 +749,40 @@
         int chain_length;
         int attempts = !in_softirq();
+
+        if (! (ip_rt_gc_elasticity2_recalc++ % 200 )) {
+                unsigned in_hit = 0, in_slow_tot = 0;
+                int i;
+
+                for (i = 0; i < NR_CPUS; i++) {
+                        if (!cpu_possible(i))
+                                continue;
+
+                        in_hit += per_cpu_ptr(rt_cache_stat, i)->in_hit -
+                                per_cpu_ptr(rt_cache_stat, i)->in_hit_last;
+
+                        per_cpu_ptr(rt_cache_stat, i)->in_hit_last =
+                                per_cpu_ptr(rt_cache_stat, i)->in_hit;
+
+                        in_slow_tot += per_cpu_ptr(rt_cache_stat, i)->in_slow_tot -
+                                per_cpu_ptr(rt_cache_stat, i)->in_slow_tot_last;
+
+                        per_cpu_ptr(rt_cache_stat, i)->in_slow_tot_last =
+                                per_cpu_ptr(rt_cache_stat, i)->in_slow_tot;
+
+                }
+
+                if (in_hit < in_slow_tot) {
+                        /* Aggressive */
+                        if(ip_rt_gc_elasticity2 > 1)
+                                ip_rt_gc_elasticity2 >>= 1;
+                }
+                else
+                        if(ip_rt_gc_elasticity2 < 2*ip_rt_gc_elasticity) {
+                                ip_rt_gc_elasticity2 <<= 1;
+                        }
+        }
+
 
 restart:
         chain_length = 0;
         min_score = ~(u32)0;
@@ -801,13 +837,10 @@
         }
 
         if (cand) {
-                /* ip_rt_gc_elasticity used to be average length of chain
-                 * length, when exceeded gc becomes really aggressive.
-                 *
-                 * The second limit is less certain. At the moment it allows
-                 * only 2 entries per bucket. We will see.
+                /* ip_rt_gc_elasticity2 used to limit length of chain
+                 * when exceeded gc becomes really aggressive.
                  */
-                if (chain_length > 2*ip_rt_gc_elasticity) {
+                if (chain_length > ip_rt_gc_elasticity2) {
                         *candp = cand->u.rt_next;
                         rt_free(cand);
                 }

Cheers. --ro From greearb@candelatech.com Wed Aug 6 10:30:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 10:30:32 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76HUNFl006981 for ; Wed, 6 Aug 2003 10:30:24 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h76HUItf023006; Wed, 6 Aug 2003 10:30:18 -0700 Message-ID: <3F313B2A.4080500@candelatech.com> Date: Wed, 06 Aug 2003 10:30:18 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "David S. Miller" CC: netdev@oss.sgi.com Subject: Re: PATCH: Add comment to make finding the priv_flags definition easier. References: <3F304AC7.6070808@candelatech.com> <20030806034642.4d91641c.davem@redhat.com> In-Reply-To: <20030806034642.4d91641c.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4591 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Tue, 05 Aug 2003 17:24:39 -0700 > Ben Greear wrote: > > >>This helps me, at least, remember where the definitions are at! > > > This is silly, adding one for 'priv_flags' and not one for 'flags'. > > I really don't have a taste for these "look in file foo for > interesting stuff about bar" type comments :-) Since they are not enums, it is hard to know where they are properly defined.
For someone who is new to the code, I think it helps a great deal to say where the possible values are defined. I can add a comment for 'flags' as well, but not if no one cares anyway. Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From kuznet@ms2.inr.ac.ru Wed Aug 6 10:58:19 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 10:58:27 -0700 (PDT) Received: from dub.inr.ac.ru (dub.inr.ac.ru [193.233.7.105]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76HwHFl009056 for ; Wed, 6 Aug 2003 10:58:18 -0700 Received: (from kuznet@localhost) by dub.inr.ac.ru (8.6.13/ANK) id VAA01743; Wed, 6 Aug 2003 21:58:10 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200308061758.VAA01743@dub.inr.ac.ru> Subject: Re: [PATCH] repairing rtcache killer To: Robert.Olsson@data.slu.se (Robert Olsson) Date: Wed, 6 Aug 2003 21:58:09 +0400 (MSD) Cc: davem@redhat.com, Robert.Olsson@data.slu.se, netdev@oss.sgi.com In-Reply-To: <16177.13435.45921.229742@robur.slu.se> from "Robert Olsson" at Aug 06, 2003 07:01:47 PM X-Mailer: ELM [version 2.5 PL6] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4592 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Hello! > Crap. We are back to the dst cache overflow again even with routing tables > loaded. Well test is now on SMP and 2.6.0-test1. Undo the min_score test > and give it retry? No. This patch is not related to RCU problem at all. It just sanitizes craziness of balancing introduced by chain truncation, that's why the subject is different. :-) I think you did not apply patch, which is responsible for repairing rcu troubles. 
Alexey From kuznet@ms2.inr.ac.ru Wed Aug 6 11:06:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 11:06:59 -0700 (PDT) Received: from dub.inr.ac.ru (dub.inr.ac.ru [193.233.7.105]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76I6sFl010003 for ; Wed, 6 Aug 2003 11:06:55 -0700 Received: (from kuznet@localhost) by dub.inr.ac.ru (8.6.13/ANK) id WAA01763; Wed, 6 Aug 2003 22:06:45 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200308061806.WAA01763@dub.inr.ac.ru> Subject: Re: [PATCH] repairing rtcache killer To: Robert.Olsson@data.slu.se (Robert Olsson) Date: Wed, 6 Aug 2003 22:06:45 +0400 (MSD) Cc: davem@redhat.com, Robert.Olsson@data.slu.se, netdev@oss.sgi.com In-Reply-To: <16177.14760.612648.582063@robur.slu.se> from "Robert Olsson" at Aug 06, 2003 07:23:52 PM X-Mailer: ELM [version 2.5 PL6] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4593 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Hello! > + if (in_hit < in_slow_tot) { > + /* Aggressive */ > + if(ip_rt_gc_elasticity2 > 1) > + ip_rt_gc_elasticity2 >>= 1; > + } > + else > + if(ip_rt_gc_elasticity2 < 2*ip_rt_gc_elasticity) { > + ip_rt_gc_elasticity2 <<= 1; > + } It is the system with positive feedback. Reduction of chain length results in increasing amount of misses and so on. Under normal load it has the only stable state, zero chain length and will never leave it. hits/misses is wrong feedback, unless you use it to increase chain length. 
:-) Alexey From Robert.Olsson@data.slu.se Wed Aug 6 11:21:07 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 11:21:11 -0700 (PDT) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76IL6Fl011304 for ; Wed, 6 Aug 2003 11:21:06 -0700 Received: (from robert@localhost) by robur.slu.se (8.9.3p2/8.9.3) id UAA32639; Wed, 6 Aug 2003 20:20:59 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16177.18187.429888.128929@robur.slu.se> Date: Wed, 6 Aug 2003 20:20:59 +0200 To: kuznet@ms2.inr.ac.ru Cc: Robert.Olsson@data.slu.se (Robert Olsson), davem@redhat.com, netdev@oss.sgi.com Subject: Re: [PATCH] repairing rtcache killer In-Reply-To: <200308061758.VAA01743@dub.inr.ac.ru> References: <16177.13435.45921.229742@robur.slu.se> <200308061758.VAA01743@dub.inr.ac.ru> X-Mailer: VM 6.92 under Emacs 19.34.1 X-archive-position: 4594 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev kuznet@ms2.inr.ac.ru writes: > I think you did not apply patch, which is responsible for repairing > rcu troubles. > > Alexey No, correct, the RCU patches are not applied, but even before the RCU patches we didn't see dst cache overflow during DoS if the routing table was fully loaded and we used the hash chain limit patch. Cheers.
--ro From kuznet@ms2.inr.ac.ru Wed Aug 6 11:35:07 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 11:35:14 -0700 (PDT) Received: from dub.inr.ac.ru (dub.inr.ac.ru [193.233.7.105]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76IZ4Fl012551 for ; Wed, 6 Aug 2003 11:35:07 -0700 Received: (from kuznet@localhost) by dub.inr.ac.ru (8.6.13/ANK) id WAA01878; Wed, 6 Aug 2003 22:34:56 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200308061834.WAA01878@dub.inr.ac.ru> Subject: Re: [PATCH] repairing rtcache killer To: Robert.Olsson@data.slu.se (Robert Olsson) Date: Wed, 6 Aug 2003 22:34:51 +0400 (MSD) Cc: Robert.Olsson@data.slu.se, davem@redhat.com, netdev@oss.sgi.com In-Reply-To: <16177.18187.429888.128929@robur.slu.se> from "Robert Olsson" at Aug 06, 2003 08:20:59 PM X-Mailer: ELM [version 2.5 PL6] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4595 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Hello! > No correct RCU patches are not applied but even before the RCU pathes we > didn't see dst cache overflow during DoS if the routing table was fully > loaded and we used hash chain limit patch. Sure. :-) Robert, did not we discover a week ago that the reason of rcu stalls is rt_run_flush() which runs only when routes change? :-) By the way, to refresh your memory, months ago there was another reason for overflows. It was fixed by setting sane value to ip_rt_gc_min_interval. RCU showed on surface after this. 
Alexey From Robert.Olsson@data.slu.se Wed Aug 6 11:50:36 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 11:50:45 -0700 (PDT) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76IoZFl013910 for ; Wed, 6 Aug 2003 11:50:36 -0700 Received: (from robert@localhost) by robur.slu.se (8.9.3p2/8.9.3) id UAA32750; Wed, 6 Aug 2003 20:50:28 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16177.19956.596928.527997@robur.slu.se> Date: Wed, 6 Aug 2003 20:50:28 +0200 To: kuznet@ms2.inr.ac.ru Cc: Robert.Olsson@data.slu.se (Robert Olsson), davem@redhat.com, netdev@oss.sgi.com Subject: Re: [PATCH] repairing rtcache killer In-Reply-To: <200308061806.WAA01763@dub.inr.ac.ru> References: <16177.14760.612648.582063@robur.slu.se> <200308061806.WAA01763@dub.inr.ac.ru> X-Mailer: VM 6.92 under Emacs 19.34.1 X-archive-position: 4596 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev kuznet@ms2.inr.ac.ru writes: > It is the system with positive feedback. Reduction of chain length > results in increasing amount of misses and so on. Under normal load > it has the only stable state, zero chain length and will never leave it. > > hits/misses is wrong feedback, unless you use it to increase chain length. :-) > Well, it was not the intention to find any optimum or equilibrium point; the idea was just to get a different and more aggressive setting for pure DoS attacks to start with. Something like:

if (in_hit < in_slow_tot)
        ip_rt_gc_elasticity2 = 1;
else
        ip_rt_gc_elasticity2 = 2*ip_rt_gc_elasticity;

would have been cleaner, but maybe it's not worth it. Cheers.
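For reference, the two-state rule above can be pulled out into a standalone function together with the per-interval counter delta that the earlier patch computes inline. This is a minimal sketch rather than code from any posted patch: interval_delta() and chain_limit() are hypothetical names, and fixed-width uint32_t stands in for the kernel's unsigned int counters.

```c
#include <stdint.h>

/*
 * Events counted since the previous sample. Unsigned subtraction gives
 * the right answer even if the running counter wrapped once in between.
 */
static uint32_t interval_delta(uint32_t now, uint32_t last)
{
    return now - last;
}

/*
 * Two-state chain length limit: 1 when misses outnumbered hits during
 * the last interval, otherwise twice the configured elasticity.
 * `elasticity` plays the role of ip_rt_gc_elasticity.
 */
static int chain_limit(uint32_t hits, uint32_t misses, int elasticity)
{
    if (hits < misses)
        return 1;
    return 2 * elasticity;
}
```

With the default elasticity of 8 this yields a limit of either 1 or 16. The rule only compares the two counts, so it cannot tell an attack apart from any other interval in which misses happen to dominate.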
--ro From jesse@cats-chateau.net Wed Aug 6 11:59:22 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 11:59:30 -0700 (PDT) Received: from tabby.cats.internal (34.mufa.noln.chcgil24.dsl.att.net [12.100.181.34]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76IxLFl014762 for ; Wed, 6 Aug 2003 11:59:22 -0700 Received: from localhost (localhost [[UNIX: localhost]]) by tabby.cats.internal (8.11.4/8.11.4) id h76IxDo09250; Wed, 6 Aug 2003 13:59:13 -0500 Content-Type: text/plain; charset="iso-8859-1" From: Jesse Pollard To: Andy Isaacson Subject: Re: TOE brain dump Date: Wed, 6 Aug 2003 13:58:59 -0500 X-Mailer: KMail [version 1.2] Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org References: <20030802140444.E5798@almesberger.net> <03080607463300.08387@tabby> <20030806112556.C26920@hexapodia.org> In-Reply-To: <20030806112556.C26920@hexapodia.org> MIME-Version: 1.0 Message-Id: <03080613585900.09086@tabby> Content-Transfer-Encoding: 8bit X-archive-position: 4597 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jesse@cats-chateau.net Precedence: bulk X-list: netdev On Wednesday 06 August 2003 11:25, Andy Isaacson wrote: > On Wed, Aug 06, 2003 at 07:46:33AM -0500, Jesse Pollard wrote: > > On Tuesday 05 August 2003 12:19, Eric W. Biederman wrote: > > > So store and forward of packets in a 3 layer switch hierarchy, at 1.3 > > > us per copy. 1.3us to the NIC + 1.3us to the first switch chip + 1.3us > > > to the second switch chip + 1.3us to the top level switch chip + 1.3us > > > to a middle layer switch chip + 1.3us to the receiving NIC + 1.3us the > > > receiver. > > > > > > 1.3us * 7 = 9.1us to deliver a packet to the other side. That is > > > still quite painful. Right now I can get better latencies over any of > > > the cluster interconnects. I think 5 us is the current low end, with > > > the high end being about 1 us. 
> > > > I think you are off here since the second and third layer should not > > recompute checksums other than for the header (if they even did that). > > Most of the switches I used (mind, not configured) were wire speed. Only > > header checksums had recomputes, and I understood it was only for > > routing. > > The switches may be "wire speed" but that doesn't help the latency any. > AFAIK all GigE switches are store-and-forward, which automatically costs > you the full 1.3us for each link hop. (I didn't check Eric's numbers, > so I don't know that 1.3us is the right value, but it sounds right.) > Also I think you might be confused about what Eric meant by "3 layer > switch hierarchy"; he's referring to a tree topology network with > layer-one switches connecting hosts, layer-two switches connecting > layer-one switches, and layer-three switches connecting layer-two > switches. This means that your worst-case node-to-node latency has 6 > wire hops with 7 "read the entire packet into memory" operations, > depending on how you count the initiating node's generation of the > packet. If it reads the packet into memory before starting transmission, it isn't "wire speed". It is a router. > [snip] > > > > Quite often in MPI when a message is sent the program cannot continue > > > until the reply is received. Possibly this is a fundamental problem > > > with the application programming model, encouraging applications to > > > be latency sensitive. But it is a well established API and > > > programming paradigm so it has to be lived with. > > This is true, in HPC. Some of the problem is the APIs encouraging such > behavior; another part of the problem is that sometimes, the problem has > fundamental latency dependencies that cannot be programmed around. > > > > A lot of the NICs which are used for MPI tend to be smart for two > > > reasons. 1) So they can do source routing. 
2) So they can safely > > > export some of their interface to user space, so in the fast path > > > they can bypass the kernel. > > > > And bypass any security checks required. A single rogue MPI application > > using such an interface can/will bring the cluster down. > > This is just false. Kernel bypass (done properly) has no negative > effect on system stability, either on-node or on-network. By "done > properly" I mean that the NIC has mappings programmed into it by the > kernel at app-startup time, and properly bounds-checks all remote DMA, > and has a method for verifying that incoming packets are not rogue or > corrupt. (Of course a rogue *kernel* can probably interfere with other > *applications* on the network it's connected to, by inserting malicious > packets into the datastream, but even that is soluble with cookies or > routing checks. However, I don't believe any systems try to defend > against rogue nodes today.) Just because the packet gets transferred to a buffer correctly does not mean that buffer is the one it should have been sent to. If it didn't have this problem, then there would be no kernel TCP/IP interaction. Just open the Ethernet device and start writing/reading. Oops: a known security failure. > > I believe that Myrinet's hardware has the capability to meet the "kernel > bypass done properly" requirement I state above; I make no claim that > their GM implementation actually meets the requirement (although I think > it might). It's pretty likely that QSW's Elan hardware can, too, but I > know even less about that. Since the routing is done in user mode, as part of the library, it can be used to directly affect processes NOT owned by the user. This bypasses the kernel security checks by definition. This is already known to happen with raw Myrinet, so there is a kernel layer on top of it to shield it (or at least try to). If there is no kernel involvement, then there can be no restrictions on what can be passed down the line to the device.
Now some of the modifications for myrinet were to use normal TCP/IP to establish source/destination header information, then bypass any packet handshake, but force EACH packet to include the pre-established source/destination header info. This is equivalent to UDP, but without any checksums, and sometimes can bypass part of the kernel cache. Unfortunately, it also means that sometimes incoming data is NOT destined for the user, and must be erased/copied before the final destination is achieved. This introduces leaks due to the race condition caused by the transfer to the wrong buffer. You can't DMA directly to a users buffer, because you MUST verify the header before the data... and you can't do that until the buffer is in memory... So bypassing the kernel generates security failures. This is already a problem in fibre channel devices, and in other network devices. Anytime you bypass the kernel security you also void any restrictions on the network, and any hosts it is attached to. From kuznet@ms2.inr.ac.ru Wed Aug 6 12:01:23 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 12:01:26 -0700 (PDT) Received: from dub.inr.ac.ru (dub.inr.ac.ru [193.233.7.105]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76J1MFl015254 for ; Wed, 6 Aug 2003 12:01:22 -0700 Received: (from kuznet@localhost) by dub.inr.ac.ru (8.6.13/ANK) id XAA01942; Wed, 6 Aug 2003 23:01:13 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200308061901.XAA01942@dub.inr.ac.ru> Subject: Re: [PATCH] repairing rtcache killer To: Robert.Olsson@data.slu.se (Robert Olsson) Date: Wed, 6 Aug 2003 23:01:13 +0400 (MSD) Cc: Robert.Olsson@data.slu.se, davem@redhat.com, netdev@oss.sgi.com In-Reply-To: <16177.19956.596928.527997@robur.slu.se> from "Robert Olsson" at Aug 06, 2003 08:50:28 PM X-Mailer: ELM [version 2.5 PL6] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4598 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com 
Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Hello! > if (in_hit < in_slow_tot) > ip_rt_gc_elasticity2 = 1; > else > ip_rt_gc_elasticity2 = 2*ip_rt_gc_elasticity; > > would have been cleaner It is _much_ cleaner. > but maybe it's not worth it. Well, it is wrong before all. With default settings for number of flows at pktgen you will get always 1 not depending on length of flow at all. No hits just because cache is too small. See? It is the problem. Alexey From adi@hexapodia.org Wed Aug 6 12:39:59 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 12:40:06 -0700 (PDT) Received: from pirx.hexapodia.org (postfix@pirx.hexapodia.org [208.42.114.113]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76JdvFl019374 for ; Wed, 6 Aug 2003 12:39:58 -0700 Received: by pirx.hexapodia.org (Postfix, from userid 22448) id C396CB404; Wed, 6 Aug 2003 14:39:56 -0500 (CDT) Date: Wed, 6 Aug 2003 14:39:56 -0500 From: Andy Isaacson To: Jesse Pollard Cc: netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030806143956.B15543@hexapodia.org> References: <20030802140444.E5798@almesberger.net> <03080607463300.08387@tabby> <20030806112556.C26920@hexapodia.org> <03080613585900.09086@tabby> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <03080613585900.09086@tabby>; from jesse@cats-chateau.net on Wed, Aug 06, 2003 at 01:58:59PM -0500 X-PGP-Fingerprint: 48 01 21 E2 D4 E4 68 D1 B8 DF 39 B2 AF A3 16 B9 X-PGP-Key-URL: http://web.hexapodia.org/~adi/pgp.txt X-Domestic-Surveillance: money launder bomb tax evasion X-archive-position: 4599 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: adi@hexapodia.org Precedence: bulk X-list: netdev On Wed, Aug 06, 2003 at 01:58:59PM -0500, Jesse Pollard wrote: > On Wednesday 06 August 2003 11:25, 
Andy Isaacson wrote: > > The switches may be "wire speed" but that doesn't help the latency any. > > AFAIK all GigE switches are store-and-forward, which automatically costs > > you the full 1.3us for each link hop. (I didn't check Eric's numbers, > > so I don't know that 1.3us is the right value, but it sounds right.) > > Also I think you might be confused about what Eric meant by "3 layer > > switch hierarchy"; he's referring to a tree topology network with > > layer-one switches connecting hosts, layer-two switches connecting > > layer-one switches, and layer-three switches connecting layer-two > > switches. This means that your worst-case node-to-node latency has 6 > > wire hops with 7 "read the entire packet into memory" operations, > > depending on how you count the initiating node's generation of the > > packet. > > If it reads the packet into memory before starting transmission, it isn't > "wire speed". It is a router. [Please read an implied "I might be totally off base here, since I've never designed an Ethernet switch" disclaimer into this paragraph.] This statement is completely false. Ethernet switches *do* read the packet into memory before starting transmission. This must be so, because an Ethernet switch does not propagate runts, jabber frames, or frames with an incorrect ethernet crc. If the switch starts transmission before it's received the last bit, it is provably impossible for it to avoid propagating crc-failing-frames; ergo, switches must have the entire packet on hand before starting transmission. > > > > A lot of the NICs which are used for MPI tend to be smart for two > > > > reasons. 1) So they can do source routing. 2) So they can safely > > > > export some of their interface to user space, so in the fast path > > > > they can bypass the kernel. > > > > > > And bypass any security checks required. A single rogue MPI application > > > using such an interface can/will bring the cluster down. > > > > This is just false. 
Kernel bypass (done properly) has no negative > > effect on system stability, either on-node or on-network. By "done > > properly" I mean that the NIC has mappings programmed into it by the > > kernel at app-startup time, and properly bounds-checks all remote DMA, > > and has a method for verifying that incoming packets are not rogue or > > corrupt. (Of course a rogue *kernel* can probably interfere with other > > *applications* on the network it's connected to, by inserting malicious > > packets into the datastream, but even that is soluble with cookies or > > routing checks. However, I don't believe any systems try to defend > > against rogue nodes today.) > > Just because the packet gets transfered to a buffer correctly does not > mean that buffer is the one it should have been sent to. If it didn't > have this problem, then there would be no kernel TCP/IP interaction. Just > open the ethernet device and start writing/reading. Ooops. known security > failure. You're ignoring the fact that there's a complete, programmable RISC CPU on the Myrinet card which is running code (the MCP, Myrinet Control Program) installed into it by the kernel. The kernel tells the MCP to allow access to a given app (by mapping a page of PCI IO addresses into the user's virtual address space), and the MCP checks the user's DMA requests for validity. The user cannot generate arbitrary Myrinet routing requests, cannot write to arbitrary addresses, cannot send messages to hosts not in his allowed lists, et cetera. We do know that the buffer is the one it should have been sent to, because the MCP on the sending end verified that it was an allowed destination host, and the MCP on the receiving end verified that the destination address was valid. Myrinet Inc even offers a SDK allowing you to write your own MCP, if you so desire, and various research projects have done precisely that. 
Demonstrating that dumb Ethernet cards cannot be smart does not demonstrate that smart FooNet cards cannot be smart. (s/FooNet/$x/ as desired.) > > I believe that Myrinet's hardware has the capability to meet the "kernel > > bypass done properly" requirement I state above; I make no claim that > > their GM implementation actually meets the requirement (although I think > > it might). It's pretty likely that QSW's Elan hardware can, too, but I > > know even less about that. > > since the routing is done is user mode, as part of the library, it can be > used to directly affect processes NOT owned by the user. This > bypasses the kernel security checks by definition. The routing is done on the MCP, not in a library. (Or at least, it could be -- I don't know offhand how GM1 and GM2 work.) This is not an insoluble problem. > Already known to happen with raw myrinet, so there is a kernel layer > on top of it to shield it (or at least try to). Perhaps that's the case with GM1 (I don't know) but it is not a fundamental flaw of the hardware or the network. > If there is no kernel involvement, then there can be no restrictions > on what can be passed down the line to the device. The MCP provides the necessary checking. > Now some of the modifications for myrinet were to use normal TCP/IP to > establish source/destination header information, then bypass any > packet handshake, but force EACH packet to include the pre-established > source/destination header info. I don't know what you're talking about here; perhaps this was some early "TCP over Myrinet" thing. Currently on a host with GM1 running, the myri0 interface shows up as an almost-normal Ethernet interface, and most of the relevant networking ioctls work just fine. I can even tcpdump it. On a related topic, there is a Myrinet line card with a GigE port available. 
I haven't looked into the software end deeply, but apparently you just stick a standard Myrinet route to that switch port on the front of the Myrinet frame, append an Ethernet frame, and your Myrinet host can send GigE packets without bother. I don't know how incoming ethernet packets are routed, alas -- presumably a Myrinet route is encoded in the MAC somehow. > This is equivalent to UDP, but without any checksums, and sometimes > can bypass part of the kernel cache. Unfortunately, it also means that > sometimes incoming data is NOT destined for the user, and must be > erased/copied before the final destination is achieved. This introduces leaks > due to the race condition caused by the transfer to the wrong buffer. > > You can't DMA directly to a users buffer, because you MUST verify the header > before the data... and you can't do that until the buffer is in memory... > So bypassing the kernel generates security failures. Again, the security problems are solved by having the MCP check the necessary conditions. You bring up a good point WRT error resilience, though -- I don't know how Myrinet handles media bit errors. You *can* DMA directly to a user's buffer, because the necessary header information was checked on the MCP before the bits even touch the PCI bus. > This is already a problem in fibre channel devices, and in other network > devices. Anytime you bypass the kernel security you also void any > restrictions on the network, and any hosts it is attached to. Sufficiently advanced HBA hardware and software solve this problem. Please pick another windmill to tilt at. (Like the error one; I need to find out what the answer to that is.) 
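The "kernel bypass done properly" argument in this thread (the kernel programs mappings at app-startup time, and the NIC firmware bounds-checks every request against them) can be sketched in plain C. This is only an illustration under stated assumptions, not GM or MCP code: struct nic_mapping, MAX_PEERS, and the functions peer_allowed(), range_in_window(), send_permitted() and recv_permitted() are all hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PEERS 8   /* hypothetical limit, chosen only for the sketch */

/* Hypothetical per-application state that the kernel would program into
 * the NIC at app startup; the application itself cannot change it. */
struct nic_mapping {
    uint64_t base;              /* start of the registered buffer */
    uint64_t len;               /* length of the registered buffer */
    uint16_t peers[MAX_PEERS];  /* hosts this application may talk to */
    size_t   n_peers;
};

/* Is the remote host on the application's allowed list? */
static bool peer_allowed(const struct nic_mapping *m, uint16_t host)
{
    assert(m != NULL && m->n_peers <= MAX_PEERS);
    for (size_t i = 0; i < m->n_peers; i++)
        if (m->peers[i] == host)
            return true;
    return false;
}

/* Does [addr, addr + len) lie entirely inside the registered buffer?
 * Written without computing addr + len, so it cannot overflow. */
static bool range_in_window(const struct nic_mapping *m,
                            uint64_t addr, uint64_t len)
{
    assert(m != NULL);
    if (len == 0 || len > m->len || addr < m->base)
        return false;
    return addr - m->base <= m->len - len;
}

/* Check made on the sending side before a message leaves the host. */
static bool send_permitted(const struct nic_mapping *m, uint16_t dest_host,
                           uint64_t src_addr, uint64_t len)
{
    return peer_allowed(m, dest_host) && range_in_window(m, src_addr, len);
}

/* Check made on the receiving side before any data is DMAed to memory. */
static bool recv_permitted(const struct nic_mapping *m, uint16_t src_host,
                           uint64_t dst_addr, uint64_t len)
{
    return peer_allowed(m, src_host) && range_in_window(m, dst_addr, len);
}
```

The point of the sketch is only that both checks happen before the data reaches the host bus, which is the property the argument above relies on.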
-andy From rddunlap@osdl.org Wed Aug 6 13:24:30 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 13:24:36 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76KOTFl023067 for ; Wed, 6 Aug 2003 13:24:30 -0700 Received: from dragon.pdx.osdl.net (dragon.pdx.osdl.net [172.20.1.27]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h76KONI02312; Wed, 6 Aug 2003 13:24:23 -0700 Date: Wed, 6 Aug 2003 13:20:32 -0700 From: "Randy.Dunlap" To: netdev@oss.sgi.com Cc: mlev@despammed.com Subject: Fw: TcpOutSegs in tcp_mib not RFC1213 compliant? Message-Id: <20030806132032.012d69f4.rddunlap@osdl.org> Organization: OSDL X-Mailer: Sylpheed version 0.9.4 (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: +5V?h'hZQPB9kW Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4600 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: rddunlap@osdl.org Precedence: bulk X-list: netdev This would be more appropriate on the netdev mailing list. (from lkml) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Date: Tue, 5 Aug 2003 20:01:22 -0400 From: Lev Makhlis To: linux-kernel@vger.kernel.org Subject: TcpOutSegs in tcp_mib not RFC1213 compliant? Hi, From looking at net/ipv4/tcp_output.c, TcpOutSegs counts all outgoing packets, including pure retransmits. This seems to contradict RFC 1213 (MIB-II):

    tcpOutSegs OBJECT-TYPE
        SYNTAX  Counter
        ACCESS  read-only
        STATUS  mandatory
        DESCRIPTION
            "The total number of segments sent, including
            those on current connections but excluding those
            containing only retransmitted octets."
        ::= { tcp 11 }

Is that intentional or an oversight?
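To make the RFC 1213 wording concrete, here is a minimal user-space sketch of the counting rule in C. It is not what net/ipv4/tcp_output.c does and is not a proposed patch: struct out_segment, struct tcp_counters and account_segment() are hypothetical names made up for the illustration. One assumption to note: a segment carrying no data at all (a pure ACK) is counted in out_segs here, since it does not contain "only retransmitted octets"; that reading of the RFC is an interpretation. The companion counter follows the RFC 1213 tcpRetransSegs definition (segments containing one or more previously transmitted octets).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one outgoing segment. */
struct out_segment {
    size_t new_octets;      /* payload bytes never sent before */
    size_t retrans_octets;  /* payload bytes that were sent before */
};

/* Hypothetical counter block modelled on the two RFC 1213 objects. */
struct tcp_counters {
    unsigned long out_segs;      /* tcpOutSegs */
    unsigned long retrans_segs;  /* tcpRetransSegs */
};

/* Account for one transmitted segment according to the RFC text. */
static void account_segment(struct tcp_counters *c,
                            const struct out_segment *seg)
{
    assert(c != NULL && seg != NULL);

    /* "containing only retransmitted octets": some old data, no new data */
    bool only_retrans = seg->retrans_octets > 0 && seg->new_octets == 0;

    if (seg->retrans_octets > 0)
        c->retrans_segs++;      /* any previously transmitted octet */
    if (!only_retrans)
        c->out_segs++;          /* excluded only for pure retransmits */
}
```

With this rule a pure retransmit bumps only retrans_segs, while a segment mixing old and new data bumps both counters.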
Lev - From kuznet@ms2.inr.ac.ru Wed Aug 6 14:23:29 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 14:23:33 -0700 (PDT) Received: from dub.inr.ac.ru (dub.inr.ac.ru [193.233.7.105]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76LNRFl032108 for ; Wed, 6 Aug 2003 14:23:29 -0700 Received: (from kuznet@localhost) by dub.inr.ac.ru (8.6.13/ANK) id BAA02118; Thu, 7 Aug 2003 01:23:18 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200308062123.BAA02118@dub.inr.ac.ru> Subject: Re: [PATCH] repairing rtcache killer To: davem@redhat.com (David S. Miller) Date: Thu, 7 Aug 2003 01:23:18 +0400 (MSD) Cc: Robert.Olsson@data.slu.se, kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com In-Reply-To: <20030806005224.4798f744.davem@redhat.com> from "David S. Miller" at Aug 06, 2003 12:52:24 AM X-Mailer: ELM [version 2.5 PL6] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4601 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Hello! > > > # Two serious and interesting mistakes were made in the patch of 2003-06-16. > > Mama mia! This patch exists in 2.4.22-preX too, so full fix > becomes more urgent. I exaggerated saying "serious". The emphasis is rather on "interesting". :-) The mistakes were not evident; Robert and I spent a day and a half to figure out what the hell is going on. :-) It shows only at high flow rate and it is just a suboptimal thing, not a disaster. > Alexey, given all this what would you like to do? Should I push > your patch urgently into 2.4.x or spend some more time trying to > solve this issue? An hour ago I would say "yes". But this bus trip happened to be productive. :-) Seems, I know how to do the right thing. I will code it now, and if Robert will be happy... Robert, look, the idea is: 1. Periodically we reset elasticity2 to 2*elasticity, f.e. from periodic gc timer. 2.
We measure hits and misses with higher frequency, f.e. from forced gc. The measurement are suppressed for some time after each flush while cache collects new fresh entries. F.e. if (misses > rt_hash_mask+1 && hits < misses) elasticity2 = 0; else elasticity2 = 2*elasticity; misses > rt_hash_mask+1 guarantees that cache is populated and probed enough, rt_hash_mask+1 is not a random number, it corresponds to maximal size with elasticity2 = 0. Seems, it should work. And it is simple enough. Alexey From macro@ds2.pg.gda.pl Wed Aug 6 14:48:22 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 14:48:25 -0700 (PDT) Received: from delta.ds2.pg.gda.pl (root@delta.ds2.pg.gda.pl [213.192.72.1]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76LmJFl001981 for ; Wed, 6 Aug 2003 14:48:20 -0700 Received: from localhost by delta.ds2.pg.gda.pl (8.9.3/8.9.3) with SMTP id XAA02537; Wed, 6 Aug 2003 23:05:20 +0200 (MET DST) X-Authentication-Warning: delta.ds2.pg.gda.pl: macro owned process doing -bs Date: Wed, 6 Aug 2003 23:05:19 +0200 (MET DST) From: "Maciej W. Rozycki" Reply-To: "Maciej W. Rozycki" To: Linus Torvalds cc: linux-net@vger.kernel.org, netdev@oss.sgi.com Subject: [patch] defxx: Maintenance + DMA API fixes Message-ID: Organization: Technical University of Gdansk MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4602 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: macro@ds2.pg.gda.pl Precedence: bulk X-list: netdev Hello Linus, all, Having necessary resources, I've decided to take over the maintenance of the defxx driver for the PDQ-based family of DEC FDDI controllers (the DEFEA for EISA and the DEFPA for PCI are the models currently handled). I've talked to Larry, the original author and the last maintainer of the code, and he's said he'd be happy about it. He's asked me to update his long-outdated contact information. 
Here is a patch to update the driver to the PCI version of the DMA API. It works for my system (using a DEFPA). I encourage everyone using one of the FDDI controllers to test the changes. In particular, I would like to hear from DEFEA owners as I don't have such a controller for testing (nor an EISA system at all). I have patches for the DMA API for 2.4.21 as well. I've made all these patches available at: 'ftp://ftp.ds2.pg.gda.pl/pub/macro/drivers/defxx/'. The patch includes appropriate status and contact information updates. Linus, please apply. Maciej -- + Maciej W. Rozycki, Technical University of Gdansk, Poland + +--------------------------------------------------------------+ + e-mail: macro@ds2.pg.gda.pl, PGP key available + patch-2.6.0-test2-defxx-dma-5 diff -up --recursive --new-file linux-2.6.0-test2.macro/MAINTAINERS linux-2.6.0-test2/MAINTAINERS --- linux-2.6.0-test2.macro/MAINTAINERS 2003-08-02 13:53:58.000000000 +0000 +++ linux-2.6.0-test2/MAINTAINERS 2003-08-04 01:08:32.000000000 +0000 @@ -531,6 +531,11 @@ W: http://www.sucs.swan.ac.uk/~rohan/DEC L: linux-decnet-user@lists.sourceforge.net S: Maintained +DEFXX FDDI NETWORK DRIVER +P: Maciej W. Rozycki +M: macro@ds2.pg.gda.pl +S: Maintained + DELL LAPTOP SMM DRIVER P: Massimo Dal Zotto M: dz@debian.org diff -up --recursive --new-file linux-2.6.0-test2.macro/drivers/net/defxx.c linux-2.6.0-test2/drivers/net/defxx.c --- linux-2.6.0-test2.macro/drivers/net/defxx.c 2003-08-01 21:58:40.000000000 +0000 +++ linux-2.6.0-test2/drivers/net/defxx.c 2003-08-04 00:15:38.000000000 +0000 @@ -15,19 +15,11 @@ * DEC FDDIcontroller/EISA (DEFEA) * DEC FDDIcontroller/PCI (DEFPA) * - * Maintainers: - * LVS Lawrence V. Stefani - * - * Contact: - * The author may be reached at: + * The original author: + * LVS Lawrence V. Stefani * - * Inet: stefani@lkg.dec.com - * (NOTE! 
this address no longer works -jgarzik) - * - * Mail: Digital Equipment Corporation - * 550 King Street - * M/S: LKG1-3/M07 - * Littleton, MA 01460 + * Maintainers: + * macro Maciej W. Rozycki * * Credits: * I'd like to thank Patricia Cross for helping me get started with @@ -197,10 +189,9 @@ * Sep 2000 tjeerd Fix leak on unload, cosmetic code cleanup * Feb 2001 Skb allocation fixes * Feb 2001 davej PCI enable cleanups. + * 04 Aug 2003 macro Converted to the DMA API. */ -#error Please convert me to Documentation/DMA-mapping.txt - /* Include files */ #include @@ -226,7 +217,7 @@ /* Version information string - should be updated prior to each new release!!! */ static char version[] __devinitdata = - "defxx.c:v1.05e 2001/02/03 Lawrence V. Stefani and others\n"; + "defxx.c:v1.06 2003/08/04 Lawrence V. Stefani and others\n"; #define DYNAMIC_BUFFERS 1 @@ -413,6 +404,7 @@ static int __devinit dfx_init_one_pci_or { struct net_device *dev; DFX_board_t *bp; /* board pointer */ + int alloc_size; /* total buffer size used */ int err; #ifndef MODULE @@ -486,7 +478,16 @@ static int __devinit dfx_init_one_pci_or return 0; err_out_kfree: - if (bp->kmalloced) kfree(bp->kmalloced); + alloc_size = sizeof(PI_DESCR_BLOCK) + + PI_CMD_REQ_K_SIZE_MAX + PI_CMD_RSP_K_SIZE_MAX + +#ifndef DYNAMIC_BUFFERS + (bp->rcv_bufs_to_post * PI_RCV_DATA_K_SIZE_MAX) + +#endif + sizeof(PI_CONSUMER_BLOCK) + + (PI_ALIGN_K_DESC_BLK - 1); + if (bp->kmalloced) + pci_free_consistent(pdev, alloc_size, + bp->kmalloced, bp->kmalloced_dma); err_out_region: release_region(ioaddr, pdev ? PFI_K_CSR_IO_LEN : PI_ESIC_K_CSR_IO_LEN); err_out: @@ -781,8 +782,8 @@ static void __devinit dfx_bus_config_che * or read adapter MAC address * * Assumptions: - * Memory allocated from kmalloc() call is physically contiguous, locked - * memory whose physical address equals its virtual address. + * Memory allocated from pci_alloc_consistent() call is physically + * contiguous, locked memory. 
* * Side Effects: * Adapter is reset and should be in DMA_UNAVAILABLE state before @@ -794,7 +795,7 @@ static int __devinit dfx_driver_init(str DFX_board_t *bp = dev->priv; int alloc_size; /* total buffer size needed */ char *top_v, *curr_v; /* virtual addrs into memory block */ - u32 top_p, curr_p; /* physical addrs into memory block */ + dma_addr_t top_p, curr_p; /* physical addrs into memory block */ u32 data; /* host data register value */ DBG_printk("In dfx_driver_init...\n"); @@ -904,14 +905,15 @@ static int __devinit dfx_driver_init(str #endif sizeof(PI_CONSUMER_BLOCK) + (PI_ALIGN_K_DESC_BLK - 1); - bp->kmalloced = top_v = (char *) kmalloc(alloc_size, GFP_KERNEL); + bp->kmalloced = top_v = pci_alloc_consistent(bp->pci_dev, alloc_size, + &bp->kmalloced_dma); if (top_v == NULL) { printk("%s: Could not allocate memory for host buffers and structures!\n", dev->name); return(DFX_K_FAILURE); } memset(top_v, 0, alloc_size); /* zero out memory before continuing */ - top_p = virt_to_bus(top_v); /* get physical address of buffer */ + top_p = bp->kmalloced_dma; /* get physical address of buffer */ /* * To guarantee the 8K alignment required for the descriptor block, 8K - 1 @@ -925,7 +927,7 @@ static int __devinit dfx_driver_init(str * for allocating the needed memory. */ - curr_p = (u32) (ALIGN(top_p, PI_ALIGN_K_DESC_BLK)); + curr_p = ALIGN(top_p, PI_ALIGN_K_DESC_BLK); curr_v = top_v + (curr_p - top_p); /* Reserve space for descriptor block */ @@ -2744,7 +2746,10 @@ static int dfx_rcv_init(DFX_board_t *bp, */ my_skb_align(newskb, 128); - bp->descr_block_virt->rcv_data[i+j].long_1 = virt_to_bus(newskb->data); + bp->descr_block_virt->rcv_data[i + j].long_1 = + (u32)pci_map_single(bp->pci_dev, newskb->data, + NEW_SKB_SIZE, + PCI_DMA_FROMDEVICE); /* * p_rcv_buff_va is only used inside the * kernel so we put the skb pointer here. 
@@ -2858,9 +2863,17 @@ static void dfx_rcv_queue_process( my_skb_align(newskb, 128); skb = (struct sk_buff *)bp->p_rcv_buff_va[entry]; + pci_unmap_single(bp->pci_dev, + bp->descr_block_virt->rcv_data[entry].long_1, + NEW_SKB_SIZE, + PCI_DMA_FROMDEVICE); skb_reserve(skb, RCV_BUFF_K_PADDING); bp->p_rcv_buff_va[entry] = (char *)newskb; - bp->descr_block_virt->rcv_data[entry].long_1 = virt_to_bus(newskb->data); + bp->descr_block_virt->rcv_data[entry].long_1 = + (u32)pci_map_single(bp->pci_dev, + newskb->data, + NEW_SKB_SIZE, + PCI_DMA_FROMDEVICE); } else skb = NULL; } else @@ -2933,7 +2946,7 @@ static void dfx_rcv_queue_process( * is contained in a single physically contiguous buffer * in which the virtual address of the start of packet * (skb->data) can be converted to a physical address - * by using virt_to_bus(). + * by using pci_map_single(). * * Since the adapter architecture requires a three byte * packet request header to prepend the start of packet, @@ -3081,12 +3094,13 @@ static int dfx_xmt_queue_pkt( * skb->data. * 6. The physical address of the start of packet * can be determined from the virtual address - * by using virt_to_bus() and is only 32-bits + * by using pci_map_single() and is only 32-bits * wide. 
*/ p_xmt_descr->long_0 = (u32) (PI_XMT_DESCR_M_SOP | PI_XMT_DESCR_M_EOP | ((skb->len) << PI_XMT_DESCR_V_SEG_LEN)); - p_xmt_descr->long_1 = (u32) virt_to_bus(skb->data); + p_xmt_descr->long_1 = (u32)pci_map_single(bp->pci_dev, skb->data, + skb->len, PCI_DMA_TODEVICE); /* * Verify that descriptor is actually available @@ -3170,6 +3184,7 @@ static int dfx_xmt_done(DFX_board_t *bp) { XMT_DRIVER_DESCR *p_xmt_drv_descr; /* ptr to transmit driver descriptor */ PI_TYPE_2_CONSUMER *p_type_2_cons; /* ptr to rcv/xmt consumer block register */ + u8 comp; /* local transmit completion index */ int freed = 0; /* buffers freed */ /* Service all consumed transmit frames */ @@ -3187,7 +3202,11 @@ static int dfx_xmt_done(DFX_board_t *bp) bp->xmt_total_bytes += p_xmt_drv_descr->p_skb->len; /* Return skb to operating system */ - + comp = bp->rcv_xmt_reg.index.xmt_comp; + pci_unmap_single(bp->pci_dev, + bp->descr_block_virt->xmt_data[comp].long_1, + p_xmt_drv_descr->p_skb->len, + PCI_DMA_TODEVICE); dev_kfree_skb_irq(p_xmt_drv_descr->p_skb); /* @@ -3296,6 +3315,7 @@ static void dfx_xmt_flush( DFX_board_t * { u32 prod_cons; /* rcv/xmt consumer block longword */ XMT_DRIVER_DESCR *p_xmt_drv_descr; /* ptr to transmit driver descriptor */ + u8 comp; /* local transmit completion index */ /* Flush all outstanding transmit frames */ @@ -3306,7 +3326,11 @@ static void dfx_xmt_flush( DFX_board_t * p_xmt_drv_descr = &(bp->xmt_drv_descr_blk[bp->rcv_xmt_reg.index.xmt_comp]); /* Return skb to operating system */ - + comp = bp->rcv_xmt_reg.index.xmt_comp; + pci_unmap_single(bp->pci_dev, + bp->descr_block_virt->xmt_data[comp].long_1, + p_xmt_drv_descr->p_skb->len, + PCI_DMA_TODEVICE); dev_kfree_skb(p_xmt_drv_descr->p_skb); /* Increment transmit error counter */ @@ -3336,11 +3360,22 @@ static void dfx_xmt_flush( DFX_board_t * static void __devexit dfx_remove_one_pci_or_eisa(struct pci_dev *pdev, struct net_device *dev) { - DFX_board_t *bp = dev->priv; + DFX_board_t *bp = dev->priv; + int alloc_size; /* 
total buffer size used */ unregister_netdev(dev); release_region(dev->base_addr, pdev ? PFI_K_CSR_IO_LEN : PI_ESIC_K_CSR_IO_LEN ); - if (bp->kmalloced) kfree(bp->kmalloced); + + alloc_size = sizeof(PI_DESCR_BLOCK) + + PI_CMD_REQ_K_SIZE_MAX + PI_CMD_RSP_K_SIZE_MAX + +#ifndef DYNAMIC_BUFFERS + (bp->rcv_bufs_to_post * PI_RCV_DATA_K_SIZE_MAX) + +#endif + sizeof(PI_CONSUMER_BLOCK) + + (PI_ALIGN_K_DESC_BLK - 1); + if (bp->kmalloced) + pci_free_consistent(pdev, alloc_size, bp->kmalloced, + bp->kmalloced_dma); kfree(dev); } diff -up --recursive --new-file linux-2.6.0-test2.macro/drivers/net/defxx.h linux-2.6.0-test2/drivers/net/defxx.h --- linux-2.6.0-test2.macro/drivers/net/defxx.h 2003-08-01 21:47:38.000000000 +0000 +++ linux-2.6.0-test2/drivers/net/defxx.h 2003-08-04 00:15:41.000000000 +0000 @@ -12,17 +12,11 @@ * Contains all definitions specified by port specification and required * by the defxx.c driver. * - * Maintainers: - * LVS Lawrence V. Stefani - * - * Contact: - * The author may be reached at: + * The original author: + * LVS Lawrence V. Stefani * - * Inet: stefani@lkg.dec.com - * Mail: Digital Equipment Corporation - * 550 King Street - * M/S: LKG1-3/M07 - * Littleton, MA 01460 + * Maintainers: + * macro Maciej W. Rozycki * * Modification History: * Date Name Description @@ -30,6 +24,7 @@ * 09-Sep-96 LVS Added group_prom field. Moved read/write I/O * macros to DEFXX.C. * 12-Sep-96 LVS Removed packet request header pointers. + * 04 Aug 2003 macro Converted to the DMA API. 
*/ #ifndef _DEFXX_H_ @@ -1697,17 +1692,19 @@ typedef struct DFX_board_tag { /* Keep virtual and physical pointers to locked, physically contiguous memory */ - char *kmalloced; /* kfree this on unload */ + char *kmalloced; /* pci_free_consistent this on unload */ + dma_addr_t kmalloced_dma; + /* DMA handle for the above */ PI_DESCR_BLOCK *descr_block_virt; /* PDQ descriptor block virt address */ - u32 descr_block_phys; /* PDQ descriptor block phys address */ + dma_addr_t descr_block_phys; /* PDQ descriptor block phys address */ PI_DMA_CMD_REQ *cmd_req_virt; /* Command request buffer virt address */ - u32 cmd_req_phys; /* Command request buffer phys address */ + dma_addr_t cmd_req_phys; /* Command request buffer phys address */ PI_DMA_CMD_RSP *cmd_rsp_virt; /* Command response buffer virt address */ - u32 cmd_rsp_phys; /* Command response buffer phys address */ + dma_addr_t cmd_rsp_phys; /* Command response buffer phys address */ char *rcv_block_virt; /* LLC host receive queue buf blk virt */ - u32 rcv_block_phys; /* LLC host receive queue buf blk phys */ + dma_addr_t rcv_block_phys; /* LLC host receive queue buf blk phys */ PI_CONSUMER_BLOCK *cons_block_virt; /* PDQ consumer block virt address */ - u32 cons_block_phys; /* PDQ consumer block phys address */ + dma_addr_t cons_block_phys; /* PDQ consumer block phys address */ /* Keep local copies of Type 1 and Type 2 register data */ From Robert.Olsson@data.slu.se Wed Aug 6 15:57:36 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 15:57:44 -0700 (PDT) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76MvYFl004393 for ; Wed, 6 Aug 2003 15:57:36 -0700 Received: (from robert@localhost) by robur.slu.se (8.9.3p2/8.9.3) id AAA01287; Thu, 7 Aug 2003 00:57:28 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16177.34776.207150.232716@robur.slu.se> Date: Thu, 7 Aug 2003 
00:57:28 +0200 To: kuznet@ms2.inr.ac.ru Cc: davem@redhat.com (David S. Miller), Robert.Olsson@data.slu.se, netdev@oss.sgi.com Subject: Re: [PATCH] repairing rtcache killer In-Reply-To: <200308062123.BAA02118@dub.inr.ac.ru> References: <20030806005224.4798f744.davem@redhat.com> <200308062123.BAA02118@dub.inr.ac.ru> X-Mailer: VM 6.92 under Emacs 19.34.1 X-archive-position: 4603 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev

kuznet@ms2.inr.ac.ru writes:

> Robert, look, the idea is:
>
> 1. Periodically we reset elasticity2 to 2*elasticity, f.e. from
>    periodic gc timer.

This solves the "positive" feedback. Actually the code I tested moved from elasticity2=1 to elasticity*2 as well, but this seems more reliable.

> 2. We measure hits and misses with higher frequency, f.e. from
>    forced gc. The measurement are suppressed for some time
>    after each flush while cache collects new fresh entries.

And the measurement (forced gc) should not be inhibited by any elasticity2 setting.

> if (misses > rt_hash_mask+1 && hits < misses)
>         elasticity2 = 0;
> else
>         elasticity2 = 2*elasticity;

( hits < misses ) is a delicate balancing point, but actually it didn't look too bad in the lab setup.

> misses > rt_hash_mask+1 guarantees that cache is populated and probed
> enough, rt_hash_mask+1 is not a random number, it corresponds
> to maximal size with elasticity2 = 0.

Yes, better.

> Seems, it should work. And it is simple enough.

Let's try... ;-)

Cheers.
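The two-step heuristic discussed in this thread can be sketched as a small user-space model in C. This is only an illustration, not the kernel patch: struct rt_gc_state and the functions rt_gc_periodic_reset() and rt_gc_measure() are hypothetical names. Only the condition (misses > rt_hash_mask+1 && hits < misses) and the two resulting values (0 and 2*elasticity) are taken from the mails; clearing the counters after each measurement is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the state the heuristic needs. */
struct rt_gc_state {
    unsigned int  rt_hash_mask;  /* number of hash buckets minus one */
    int           elasticity;    /* configured elasticity */
    int           elasticity2;   /* effective value chosen by the heuristic */
    unsigned long hits;          /* cache hits since the last measurement */
    unsigned long misses;        /* cache misses since the last measurement */
};

/* Step 1: called from the periodic gc timer; go back to the generous
 * setting so the cache gets a chance to prove itself useful again. */
static void rt_gc_periodic_reset(struct rt_gc_state *s)
{
    assert(s != NULL);
    s->elasticity2 = 2 * s->elasticity;
}

/* Step 2: called more often (e.g. from forced gc). Shrink the cache only
 * if it has been probed enough and still misses more often than it hits. */
static void rt_gc_measure(struct rt_gc_state *s)
{
    assert(s != NULL);
    if (s->misses > (unsigned long)s->rt_hash_mask + 1 &&
        s->hits < s->misses)
        s->elasticity2 = 0;
    else
        s->elasticity2 = 2 * s->elasticity;

    /* Assumption of this sketch: start a fresh measurement window. */
    s->hits = 0;
    s->misses = 0;
}
```

Note how a small number of misses never triggers the shrink, whatever the hit count: the misses > rt_hash_mask+1 guard is what keeps a freshly flushed cache from being judged too early.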
--ro From sri@us.ibm.com Wed Aug 6 16:37:37 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 16:37:40 -0700 (PDT) Received: from e3.ny.us.ibm.com (e3.ny.us.ibm.com [32.97.182.103]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h76NbaFl005914 for ; Wed, 6 Aug 2003 16:37:37 -0700 Received: from northrelay02.pok.ibm.com (northrelay02.pok.ibm.com [9.56.224.150]) by e3.ny.us.ibm.com (8.12.9/8.12.2) with ESMTP id h76NbSpW167892; Wed, 6 Aug 2003 19:37:28 -0400 Received: from w-sridhar.beaverton.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by northrelay02.pok.ibm.com (8.12.9/NCO/VER6.5) with ESMTP id h76NbPJD033516; Wed, 6 Aug 2003 19:37:26 -0400 Date: Wed, 6 Aug 2003 16:32:01 -0700 (PDT) From: Sridhar Samudrala X-X-Sender: sridhar@w-sridhar.beaverton.ibm.com To: davem@redhat.com cc: netdev@oss.sgi.com Subject: [BK PATCH] Minor updates to SCTP Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 4604 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sri@us.ibm.com Precedence: bulk X-list: netdev Hi Dave, Please do a bk pull http://linux-lksctp.bkbits.net/lksctp-2.5 to get the following updates to SCTP on top of linux 2.6.0-test2. 
# This patch includes the following deltas: # ChangeSet 1.1595 -> 1.1597 # net/sctp/associola.c 1.57 -> 1.58 # net/sctp/input.c 1.34 -> 1.35 # net/sctp/sm_statefuns.c 1.61 -> 1.62 # net/sctp/endpointola.c 1.27 -> 1.28 # net/sctp/sm_make_chunk.c 1.57 -> 1.58 # include/net/sctp/constants.h 1.15 -> 1.16 # net/sctp/debug.c 1.8 -> 1.9 # net/sctp/proc.c 1.6 -> 1.7 # include/net/sctp/sm.h 1.26 -> 1.27 # net/sctp/protocol.c 1.56 -> 1.57 # include/linux/sctp.h 1.6 -> 1.7 # net/sctp/sm_sideeffect.c 1.47 -> 1.48 # net/sctp/sm_statetable.c 1.18 -> 1.19 # include/net/sctp/command.h 1.13 -> 1.14 # include/net/sctp/structs.h 1.69 -> 1.71 # net/sctp/socket.c 1.88 -> 1.89 # # The following is the BitKeeper ChangeSet Log # -------------------------------------------- # 03/08/04 sri@us.ibm.com 1.1596 # [SCTP] ADDIP basic infrastructure support. (Ardelle.Fan) # -------------------------------------------- # 03/08/04 sri@us.ibm.com 1.1597 # [SCTP] Fix to avoid large kmalloc failures on 64-bit platforms. # # When spinlock debugging is enabled, the size of assoc hash table and # port hash table for a fixed value of 4096 entries exceeds the size # of the largest possible kmalloc() on 64-bit platforms. This problem is # avoided by using page allocations similar to the methodology followed # for tcp hash table allocations. # -------------------------------------------- Thanks Sridhar From davem@redhat.com Wed Aug 6 17:00:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 17:01:05 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7700tFl006460 for ; Wed, 6 Aug 2003 17:00:56 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id QAA26552; Wed, 6 Aug 2003 16:56:10 -0700 Date: Wed, 6 Aug 2003 16:56:10 -0700 From: "David S. 
Miller" To: Sridhar Samudrala Cc: netdev@oss.sgi.com Subject: Re: [BK PATCH] Minor updates to SCTP Message-Id: <20030806165610.47505d4e.davem@redhat.com> In-Reply-To: References: X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4605 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Wed, 6 Aug 2003 16:32:01 -0700 (PDT) Sridhar Samudrala wrote: > Please do a > bk pull http://linux-lksctp.bkbits.net/lksctp-2.5 Pulled, thanks. From scott.feldman@intel.com Wed Aug 6 17:33:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 17:33:33 -0700 (PDT) Received: from caduceus.jf.intel.com (fmr06.intel.com [134.134.136.7]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h770XPFl007068 for ; Wed, 6 Aug 2003 17:33:26 -0700 Received: from petasus.jf.intel.com (petasus.jf.intel.com [10.7.209.6]) by caduceus.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h770RIe27200 for ; Thu, 7 Aug 2003 00:27:18 GMT Received: from orsmsxvs040.jf.intel.com (orsmsxvs040.jf.intel.com [192.168.65.206]) by petasus.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h76MTYn18377 for ; Wed, 6 Aug 2003 22:29:41 GMT Received: from orsmsx332.amr.corp.intel.com ([192.168.65.60]) by orsmsxvs040.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080615464411703 ; Wed, 06 Aug 2003 15:46:44 -0700 Received: from orsmsx402.amr.corp.intel.com ([192.168.65.208]) by orsmsx332.amr.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Wed, 6 Aug 2003 15:34:33 -0700 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: RE: More 2.4.22pre10 ACPI 
breakage Date: Wed, 6 Aug 2003 15:34:33 -0700 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: More 2.4.22pre10 ACPI breakage Thread-Index: AcNbxpyURp/8RGKwQ4mmXlQY2KiudAAoMH1g From: "Feldman, Scott" To: "Jeff Garzik" Cc: "Samuel Flory" , X-OriginalArrivalTime: 06 Aug 2003 22:34:33.0716 (UTC) FILETIME=[E8B7E340:01C35C6A] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h770XPFl007068 X-archive-position: 4606 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: scott.feldman@intel.com Precedence: bulk X-list: netdev > That said... I'm tempted to extend NAPI just a bit, to provide an > "always poll" mode. It seems like all the bug reports I get > these days for 8139too are caused by x86 ACPI/APIC/irq routing troubles > completely unrelated to the driver. Tulip-almost-NAPI in 2.4 has an > always-poll mode, so I have a convenient excuse :) NAPI always-poll mode...that would be fun to play with. JC was getting his best results for small packets when he modified the dev e100 driver to stay in polling mode, even if the quota wasn't met. Basically running without interrupts. If there is some way for the driver to sample/ack the device for events when interrupts are disabled/unrouted, then these async events can be handled in the poll routine. I'm thinking of events like link-status-change. Is this what you're thinking: 1) block any place the driver enables interrupts so interrupts stay disabled, 2) ignore netif_rx_complete so we stay in polling mode, 3) ignore return code from netdev->poll. For 1), the driver needs some way to know that we're in always-poll-mode so enabling interrupts is a nop. Just thinking out loud - haven't tried any of this.
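The three steps above can be sketched as a small userspace model. This is only an illustration of the idea, not driver code: every name here (`toy_nic`, `toy_enable_irq`, `toy_rx_complete`, `toy_poll`) is hypothetical, and the real netif_rx_complete and netdev->poll interfaces are not reproduced.

```c
#include <stdbool.h>

/* Hypothetical toy model; none of these names are real kernel or driver APIs. */
struct toy_nic {
    bool always_poll;   /* assumed mode flag the driver can test */
    bool irq_enabled;   /* models the device interrupt mask */
    bool on_poll_list;  /* models "device is scheduled for polling" */
    int  rx_pending;    /* packets waiting in the RX ring */
    bool link_changed;  /* async event normally reported by an interrupt */
    int  link_events;   /* link changes handled so far */
};

/* Step 1: every place that would enable interrupts goes through here. */
static void toy_enable_irq(struct toy_nic *nic)
{
    if (nic->always_poll)
        return;                 /* enabling interrupts is a no-op */
    nic->irq_enabled = true;
}

/* Step 2: the request to leave polling mode is ignored in always-poll mode. */
static void toy_rx_complete(struct toy_nic *nic)
{
    if (nic->always_poll)
        return;                 /* stay on the poll list */
    nic->on_poll_list = false;
    toy_enable_irq(nic);
}

/*
 * Poll routine: returns the number of packets processed.
 * Step 3 is the caller's side: in always-poll mode the caller would
 * reschedule the poll regardless of this return value.
 */
static int toy_poll(struct toy_nic *nic, int quota)
{
    int done = 0;

    /* Sample events that an interrupt would otherwise have signalled. */
    if (nic->link_changed) {
        nic->link_changed = false;
        nic->link_events++;
    }

    while (nic->rx_pending > 0 && done < quota) {
        nic->rx_pending--;
        done++;
    }

    if (nic->rx_pending == 0)
        toy_rx_complete(nic);

    return done;
}
```

With `always_poll` set, a call to `toy_poll()` drains the ring and handles the link event, but the device stays on the poll list with interrupts masked; with the flag clear, the same call drops back to interrupt mode once the ring is empty.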
-scott From shmulik.hen@intel.com Wed Aug 6 23:23:05 2003 Received: with ECARTIS (v1.0.0; list netdev); Wed, 06 Aug 2003 23:23:13 -0700 (PDT) Received: from hermes.iil.intel.com (hermes.iil.intel.com [192.198.152.99]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h776N3Fl027308 for ; Wed, 6 Aug 2003 23:23:04 -0700 Received: from petasus.iil.intel.com (petasus.iil.intel.com [143.185.77.3]) by hermes.iil.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36
rfjohns1 Exp $) with ESMTP id h776HhK02664 for ; Thu, 7 Aug 2003 06:17:43 GMT Received: from hasmsxvs01.iil.intel.com (hasmsxvs01.iil.intel.com [143.185.63.58]) by petasus.iil.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h776PwZ01435 for ; Thu, 7 Aug 2003 06:25:58 GMT Received: from hasmsx331.ger.corp.intel.com ([143.185.63.144]) by hasmsxvs01.iil.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080709300018391 ; Thu, 07 Aug 2003 09:30:00 +0300 Received: from hasmsx403.ger.corp.intel.com ([143.185.63.109]) by hasmsx331.ger.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Thu, 7 Aug 2003 09:22:53 +0300 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: RE: VLAN patch for 2.4.21 Date: Thu, 7 Aug 2003 09:22:53 +0300 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: VLAN patch for 2.4.21 Thread-Index: AcNbsZtW9W2htjy0TYejl3bEB6UcnQA+ibUg From: "Hen, Shmulik" To: "Ben Greear" , X-OriginalArrivalTime: 07 Aug 2003 06:22:53.0577 (UTC) FILETIME=[558A6790:01C35CAC] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h776N3Fl027308 X-archive-position: 4608 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shmulik.hen@intel.com Precedence: bulk X-list: netdev > -----Original Message----- > From: Ben Greear [mailto:greearb@candelatech.com] > Sent: Wednesday, August 06, 2003 3:27 AM > To: 'netdev@oss.sgi.com' > Subject: VLAN patch for 2.4.21 > > > Here is a patch that adds a few new IOCTL options (not new > IOCTLs per se) for the 802.1Q VLANs. > One ioctl allows one to get the VID for a device by > the interface name. A second gets the name of the underlying > device for the VLAN device. Tested on x86 and PPC. > > Comments welcome! > > Thanks, > Ben > Oh, this is great. 
You just saved me the work I was going to do for fixing VLAN stuff over bonding :) Any idea how to export those in a way that would enable bonding and the like to include them without becoming dependent on the 8021q module being loaded? (I'm guessing an inline function in a .h file, but any other solution is welcome; code sample?) -- | Shmulik Hen Advanced Network Services | | Israel Design Center, Jerusalem | | LAN Access Division, Platform Networking | | Intel Communications Group, Intel corp. | From greearb@candelatech.com Thu Aug 7 00:28:20 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 00:28:32 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h777SJFl031873 for ; Thu, 7 Aug 2003 00:28:20 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h777SCtf007851; Thu, 7 Aug 2003 00:28:13 -0700 Message-ID: <3F31FF8C.9080705@candelatech.com> Date: Thu, 07 Aug 2003 00:28:12 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "Hen, Shmulik" CC: netdev@oss.sgi.com Subject: Re: VLAN patch for 2.4.21 References: In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4609 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Hen, Shmulik wrote: >>-----Original Message----- >>From: Ben Greear [mailto:greearb@candelatech.com] >>Sent: Wednesday, August 06, 2003 3:27 AM >>To: 'netdev@oss.sgi.com' >>Subject: VLAN patch for 2.4.21 >> >> >>Here is a patch that adds a few new IOCTL options (not new >>IOCTLs per se) for the 802.1Q VLANs.
>>One ioctl allows one to get the VID for a device by >>the interface name. A second gets the name of the underlying >>device for the VLAN device. Tested on x86 and PPC. >> >>Comments welcome! >> >>Thanks, >>Ben >> > > > Oh, this is great. You just saved me the work I was going to do > for fixing VLAN stuff over bonding :) > Any idea how to export those in a way that would enable bonding > and alike to include them without getting dependant on 8021q module > being loaded ? (I'm guessing inline function in a .h file, but any > other solution is welcome (coed sample ?). You can just check the things in the net_device struct directly I imagine. The calls I added are mainly to provide the info to user-space. What information do you need, and where do you need it? Also, no word yet from the guys who actually take patches, so dunno if/when this will get into the tree. Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From davem@redhat.com Thu Aug 7 00:34:55 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 00:35:02 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h777YrFl032602 for ; Thu, 7 Aug 2003 00:34:54 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id AAA27518; Thu, 7 Aug 2003 00:30:04 -0700 Date: Thu, 7 Aug 2003 00:30:04 -0700 From: "David S. 
Miller" To: Ben Greear Cc: shmulik.hen@intel.com, netdev@oss.sgi.com Subject: Re: VLAN patch for 2.4.21 Message-Id: <20030807003004.53544de7.davem@redhat.com> In-Reply-To: <3F31FF8C.9080705@candelatech.com> References: <3F31FF8C.9080705@candelatech.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4610 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Thu, 07 Aug 2003 00:28:12 -0700 Ben Greear wrote: > Also, no word yet from the guys who actually take patches, so dunno > if/when this will get into the tree. I'll review and add your patch, I have some bigger fires to put out first :-) From felix@allot.com Thu Aug 7 02:11:52 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 02:12:02 -0700 (PDT) Received: from mxout2.netvision.net.il (mxout2.netvision.net.il [194.90.9.21]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h779BnFl004086 for ; Thu, 7 Aug 2003 02:11:51 -0700 Received: from exg.allot.com ([199.203.223.202]) by mxout2.netvision.net.il (iPlanet Messaging Server 5.2 HotFix 1.14 (built Mar 18 2003)) with ESMTP id <0HJ80043US7FRK@mxout2.netvision.net.il> for netdev@oss.sgi.com; Thu, 07 Aug 2003 12:11:43 +0300 (IDT) Received: from allot.com (199.203.223.201 [199.203.223.201]) by exg.allot.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id QNW7ZTMQ; Thu, 07 Aug 2003 12:14:41 +0200 Date: Thu, 07 Aug 2003 12:12:07 +0300 From: Felix Radensky Subject: Ethernet bridge performance To: netdev@oss.sgi.com Message-id: <3F3217E7.2080903@allot.com> Organization: Allot Communications Ltd. 
MIME-version: 1.0 Content-type: multipart/mixed; boundary="Boundary_(ID_RVVz94TACxHZ39XZVjgHQQ)" X-Accept-Language: en-us, en User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030208 Netscape/7.02 X-archive-position: 4611 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: felix@allot.com Precedence: bulk X-list: netdev This is a multi-part message in MIME format. --Boundary_(ID_RVVz94TACxHZ39XZVjgHQQ) Content-type: text/plain; charset=us-ascii; format=flowed Content-transfer-encoding: 7BIT Hi, I'm evaluating the performance of a dual port ethernet bridge, and the results are a bit disappointing. I would appreciate any hints on improving the results. I'm using a Dual Xeon 2.66 GHz box based on Intel 5701 chipset with 1G of RAM. NICs are e1000 82546 connected to PCIX bus. Kernel is 2.4.22-pre8, e1000 driver version 5.1.13-k1 with NAPI support. NICs' interrupts are bound to CPU0. The test consists of sending 200 byte UDP packets from 2 ports of Gigabit IXIA traffic generator to 2 bridge ports. The bridge is able to sustain a rate of ~170000 pps from each IXIA port without drops. I was expecting it to be able to do at least 250000 pps (our own bridge code based on 2.2.x kernel sustains ~266000 pps on the same hardware). e1000 driver drops 0 packets, all drops occur at higher level. The output of oprofile is attached. I'd be happy to provide any info you may need. Thanks in advance for your help. Felix.
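To put the figures above in perspective, here is a rough back-of-the-envelope sketch in C. It assumes, and this is only an assumption, that all forwarding work for both ports lands on CPU0 (since both NICs' interrupts are bound there) and that the 200 bytes are counted as stated, without Ethernet framing overhead. The function names are made up for illustration.

```c
/* Cycles available per forwarded packet if one CPU handles every port. */
static double cycles_per_packet(double cpu_hz, double pps_per_port, unsigned ports)
{
    return cpu_hz / (pps_per_port * ports);
}

/* Offered load per port in Mbit/s for a given packet rate and packet size. */
static double mbit_per_port(double pps, unsigned packet_bytes)
{
    return pps * packet_bytes * 8.0 / 1e6;
}
```

Under those assumptions, `cycles_per_packet(2.66e9, 170000, 2)` comes to roughly 7800 cycles per packet at the measured rate, while the hoped-for 250000 pps per port would leave about 5300 cycles. `mbit_per_port(170000, 200)` gives 272 Mbit/s per port, well below gigabit line rate, which suggests per-packet CPU cost rather than link bandwidth is the limit.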
--Boundary_(ID_RVVz94TACxHZ39XZVjgHQQ) Content-type: text/plain; name=oprofile.log Content-transfer-encoding: 7BIT Content-disposition: inline; filename=oprofile.log CPU: P4 / Xeon, speed 2666.83 MHz (estimated) Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (count cycles when processor is active) count 1333415 vma samples % symbol name c01a13ac 7983 12.5065 eth_type_trans c01a1590 7629 11.9519 qdisc_restart c0197720 7365 11.5383 skb_release_data c010c170 5962 9.3403 do_gettimeofday c019a9b4 5207 8.1575 dev_queue_xmit c012df98 4243 6.6472 free_block c019ad44 3404 5.3328 netif_rx c01974e0 3266 5.1166 alloc_skb c012e1b4 2600 4.0733 kmalloc c012e3b8 2448 3.8351 kfree c019b284 2401 3.7615 process_backlog c012debc 1941 3.0408 kmem_cache_alloc_batch c0197790 1555 2.4361 kfree_skbmem c01977f8 1296 2.0304 __kfree_skb c012e07c 1203 1.8847 kmem_cache_alloc c019afa4 1079 1.6904 net_tx_action c019b0bc 932 1.4601 netif_receive_skb c01a1a04 750 1.1750 pfifo_fast_dequeue c019accc 445 0.6972 get_sample_stats c012e338 431 0.6752 kmem_cache_free c01a1990 222 0.3478 pfifo_fast_enqueue c01087f0 180 0.2820 do_IRQ c01142a8 152 0.2381 schedule c010c434 131 0.2052 timer_interrupt c010b04c 123 0.1927 IRQ0x31_interrupt c010b040 115 0.1802 IRQ0x30_interrupt c011c400 82 0.1285 cpu_raise_softirq c019b3ac 62 0.0971 net_rx_action c01052a0 57 0.0893 default_idle c011be40 55 0.0862 do_softirq c0112b28 47 0.0736 end_level_ioapic_irq c016b210 39 0.0611 add_timer_randomness c016afb0 38 0.0595 add_entropy_words c016b0a4 37 0.0580 batch_entropy_store c0105944 35 0.0548 __switch_to c0115778 32 0.0501 wake_up_process c0112de0 23 0.0360 do_check_pgt_cache c010b760 22 0.0345 apic_timer_interrupt c010ae10 21 0.0329 common_interrupt c01da960 13 0.0204 __rdtsc_delay c0106f3d 11 0.0172 restore_all c0111d0c 11 0.0172 smp_apic_timer_interrupt c01245a8 11 0.0172 check_pgt_cache c016b158 11 0.0172 batch_entropy_process c0107138 9 0.0141 page_fault c0108654 
9 0.0141 handle_IRQ_event c010ae30 9 0.0141 IRQ0x00_interrupt c0125414 9 0.0141 do_wp_page c011f1b0 8 0.0125 update_one_process c011f36c 8 0.0125 timer_bh c011c33c 7 0.0110 ksoftirqd c0113eb8 6 0.0094 reschedule_idle c011c0cc 5 0.0078 tasklet_hi_action c011c204 5 0.0078 bh_action c010ae23 4 0.0063 call_do_IRQ c011eb90 4 0.0063 add_timer c011f284 4 0.0063 update_process_times c0125970 4 0.0063 do_anonymous_page c016b320 4 0.0063 add_interrupt_randomness c0106ef4 3 0.0047 system_call c010c78c 3 0.0047 inc_new_microsec_time c0124850 3 0.0047 zap_page_range c01daa70 3 0.0047 __generic_copy_to_user c010b778 2 0.0031 call_apic_timer_interrupt c0112ae8 2 0.0031 ack_edge_ioapic_irq c011327c 2 0.0031 do_page_fault c01147c8 2 0.0031 __wake_up c011ee88 2 0.0031 tqueue_bh c0124698 2 0.0031 copy_page_range c0125a5c 2 0.0031 do_no_page c01282d8 2 0.0031 __find_get_page c0128e1c 2 0.0031 file_read_actor c01d1880 2 0.0031 fn_hash_lookup c01db1b0 2 0.0031 number c01052f4 1 0.0016 cpu_idle c0106628 1 0.0016 setup_frame c0106fb0 1 0.0016 ret_from_intr c0106fb7 1 0.0016 ret_from_exception c0106fe4 1 0.0016 error_code c01114f8 1 0.0016 flush_tlb_page c01198ec 1 0.0016 get_module_list c0119d30 1 0.0016 s_show c011c2d0 1 0.0016 __run_task_queue c011f73c 1 0.0016 do_timer c012093c 1 0.0016 sys_rt_sigprocmask c0125bf8 1 0.0016 handle_mm_fault c0125cc8 1 0.0016 pte_alloc c0125fd8 1 0.0016 lock_vma_mappings c0126348 1 0.0016 do_mmap_pgoff c012ca90 1 0.0016 vmfree_area_pages c012dcac 1 0.0016 kmem_cache_grow c012f000 1 0.0016 __lru_cache_del c0130288 1 0.0016 rmqueue c01308a8 1 0.0016 __free_pages c0130acf 1 0.0016 .text.lock.page_alloc c0134460 1 0.0016 fd_install c013fb70 1 0.0016 link_path_walk c01557f8 1 0.0016 proc_pid_statm c016b078 1 0.0016 credit_entropy_store c01804dc 1 0.0016 ide_inb c0180524 1 0.0016 ide_outb c0180a60 1 0.0016 drive_is_ready c0198d3c 1 0.0016 skb_headerinit c01a58e4 1 0.0016 ip_route_input_slow c01a8324 1 0.0016 ip_rcv c01d0c68 1 0.0016 fib_semantic_match 
--Boundary_(ID_RVVz94TACxHZ39XZVjgHQQ)-- From shmulik.hen@intel.com Thu Aug 7 03:22:18 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 03:22:21 -0700 (PDT) Received: from hermes.iil.intel.com (hermes.iil.intel.com [192.198.152.99]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77AMDFl008024 for ; Thu, 7 Aug 2003 03:22:16 -0700 Received: from petasus.iil.intel.com (petasus.iil.intel.com [143.185.77.3]) by hermes.iil.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h77AGuJ03927 for ; Thu, 7 Aug 2003 10:16:56 GMT Received: from hasmsxvs01.iil.intel.com (hasmsxvs01.iil.intel.com [143.185.63.58]) by petasus.iil.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h77APBE02478 for ; Thu, 7 Aug 2003 10:25:11 GMT Received: from hasmsx331.ger.corp.intel.com ([143.185.63.144]) by hasmsxvs01.iil.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080713291305899 ; Thu, 07 Aug 2003 13:29:13 +0300 Received: from hasmsx403.ger.corp.intel.com ([143.185.63.109]) by hasmsx331.ger.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Thu, 7 Aug 2003 13:22:06 +0300 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: RE: VLAN patch for 2.4.21 Date: Thu, 7 Aug 2003 13:22:05 +0300 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: VLAN patch for 2.4.21 Thread-Index: AcNctawwoqf9COSQT/e3elVv8xNmhQAFpVsA From: "Hen, Shmulik" To: "Ben Greear" Cc: X-OriginalArrivalTime: 07 Aug 2003 10:22:06.0546 (UTC) FILETIME=[C093AB20:01C35CCD] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h77AMDFl008024 X-archive-position: 4612 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shmulik.hen@intel.com Precedence: bulk X-list: netdev > -----Original 
Message----- > From: Ben Greear [mailto:greearb@candelatech.com] > Sent: Thursday, August 07, 2003 10:28 AM > To: Hen, Shmulik > Cc: netdev@oss.sgi.com > Subject: Re: VLAN patch for 2.4.21 > > > You can just check the things in the net_device struct directly > I imagine. The calls I added are mainly to provide the info to > user-space. That was my guess too. I'll figure out a way to do that safely from kernel space. > What information do you need, and where do you need it? In bonding, to better handle self generated packets, I'll need to know what vlan ID's are on top of the bond device. So, I'll need to listen to net dev registration notifications and sort out which ones are for VLAN devices, and then see if they were added on top of a bond device. Once I've got that, I'll need to get the VID and store it in bonding, so both your additions do exactly what I need. I also heard from the bridge developers that they wanted similar support, so that's 2 birds... Shmulik. From ingo.oeser@informatik.tu-chemnitz.de Thu Aug 7 04:05:40 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 04:06:20 -0700 (PDT) Received: from tom.hrz.tu-chemnitz.de (tom.hrz.tu-chemnitz.de [134.109.132.38]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77B5WFl009111 for ; Thu, 7 Aug 2003 04:05:38 -0700 Received: from tnt200.hrz.tu-chemnitz.de ([134.109.156.200] helo=nightmaster.csn.tu-chemnitz.de ident=root) by tom.hrz.tu-chemnitz.de with esmtp (Exim 4.12) id 19jwzG-00034j-00; Tue, 05 Aug 2003 10:16:07 +0200 Received: (from ioe@localhost) by nightmaster.csn.tu-chemnitz.de (8.9.1/8.9.1) id KAA29171; Tue, 5 Aug 2003 10:15:25 +0200 Date: Tue, 5 Aug 2003 10:15:25 +0200 From: Ingo Oeser To: Alan Shih Cc: Jeff Garzik , Nivedita Singhvi , Werner Almesberger , netdev@oss.sgi.com, linux-kernel@vger.kernel.org Subject: Re: TOE brain dump Message-ID: <20030805101525.P670@nightmaster.csn.tu-chemnitz.de> References: <20030804163606.Q639@nightmaster.csn.tu-chemnitz.de> Mime-Version: 1.0 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i In-Reply-To: ; from alan@storlinksemi.com on Mon, Aug 04, 2003 at 10:19:21AM -0700 X-Spam-Score: -4.5 (----) X-Scanner: exiscan for exim4 (http://duncanthrax.net/exiscan/) *19jwzG-00034j-00*RgAXBNAHfIU* X-archive-position: 4613 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ingo.oeser@informatik.tu-chemnitz.de Precedence: bulk X-list: netdev On Mon, Aug 04, 2003 at 10:19:21AM -0700, Alan Shih wrote: > So would main processor still need a copy of the data for re-transmission? > Won't that defeat the purpose? No, since I didn't state that a retransmission is done along the pipe, because you cannot go back in a pipeline. A retransmission can be done at the end of the pipe, where this can also be done in hardware. Regards Ingo Oeser From davids@webmaster.com Thu Aug 7 08:05:25 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 08:05:35 -0700 (PDT) Received: from shell.webmaster.com (mail.webmaster.com [216.152.64.131]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77F5OFl022065 for ; Thu, 7 Aug 2003 08:05:25 -0700 Received: from however ([206.171.168.138]) by shell.webmaster.com (Post.Office MTA v3.5.3 release 223 ID# 0-12345L500S10000V35) with SMTP id com; Wed, 6 Aug 2003 14:11:32 -0700 From: "David Schwartz" To: "Andy Isaacson" , "Jesse Pollard" Cc: , Subject: RE: TOE brain dump Date: Wed, 6 Aug 2003 14:13:47 -0700 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.6604 (9.0.2911.0) In-Reply-To: <20030806143956.B15543@hexapodia.org> X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106 Importance: Normal X-archive-position: 4614 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com 
X-original-sender: davids@webmaster.com Precedence: bulk X-list: netdev > This statement is completely false. Ethernet switches *do* read the > packet into memory before starting transmission. Some do. Some don't. Some are configurable. > This must be so, > because an Ethernet switch does not propagate runts, jabber frames, or > frames with an incorrect ethernet crc. If they use cut-through switching, they do. Some use adaptive switching, which means they use cut-through switching but change to store and forward if there are too many runts, jabber frames, bad CRCs, and so on. Obviously, you can't always do a cut-through. If the target port is busy, cut-through is impossible. If the ports are different speeds, cut-through is impossible. The Intel 510T switch for my home network does adaptive switching with configurable error thresholds. In fact, it's even smarter than that, with an intermediate mode that suppresses runts without doing a full store and forward. See: http://www.intel.com/support/express/switches/23188.htm > If the switch starts > transmission before it's received the last bit, it is provably > impossible for it to avoid propagating crc-failing-frames; ergo, > switches must have the entire packet on hand before starting > transmission. Except not all switches always avoid propagating bad frames.
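The mode selection described above might be sketched roughly as follows. This is a hypothetical model, not the behaviour of the 510T or any real switch: the names, the thresholds, and the order of the checks are invented for illustration. The intermediate mode is modelled as waiting for the first 64 bytes (the minimum Ethernet frame size), which is an assumption about how runts would be filtered.

```c
#include <stdbool.h>

/* Hypothetical forwarding modes, named after the description above. */
enum fwd_mode { CUT_THROUGH, RUNT_FREE, STORE_AND_FORWARD };

struct port_state {
    unsigned speed_mbps;
    bool busy;
};

/* Assumed configurable thresholds, counted over some recent interval. */
struct adaptive_cfg {
    unsigned runt_threshold;
    unsigned error_threshold;
};

static enum fwd_mode choose_mode(const struct port_state *in,
                                 const struct port_state *out,
                                 unsigned recent_runts,
                                 unsigned recent_crc_errors,
                                 const struct adaptive_cfg *cfg)
{
    /* Cut-through is impossible if the target is busy or speeds differ. */
    if (out->busy || in->speed_mbps != out->speed_mbps)
        return STORE_AND_FORWARD;
    /* Too many bad CRCs: only a full store and forward can catch them. */
    if (recent_crc_errors > cfg->error_threshold)
        return STORE_AND_FORWARD;
    /* Too many runts: wait just long enough to rule a runt out. */
    if (recent_runts > cfg->runt_threshold)
        return RUNT_FREE;
    return CUT_THROUGH;
}

/* Bytes that must arrive before transmission on the output port may start. */
static unsigned bytes_before_tx(enum fwd_mode mode, unsigned frame_len)
{
    switch (mode) {
    case CUT_THROUGH:
        return 6;           /* destination MAC address */
    case RUNT_FREE:
        return 64;          /* minimum frame size; a runt never gets this far */
    default:
        return frame_len;   /* whole frame, so the CRC can be checked */
    }
}
```

In this model only `STORE_AND_FORWARD` has the whole frame on hand before transmitting, so it is the only mode that can refuse to propagate a frame with a bad CRC.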
DS From akpm@osdl.org Thu Aug 7 08:48:55 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 08:49:03 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77FmsFl025800 for ; Thu, 7 Aug 2003 08:48:54 -0700 Received: from mnm (build.pdx.osdl.net [172.20.1.2]) by mail.osdl.org (8.11.6/8.11.6) with ESMTP id h77FmjI18699; Thu, 7 Aug 2003 08:48:45 -0700 Date: Thu, 7 Aug 2003 08:50:43 -0700 From: Andrew Morton To: netdev@oss.sgi.com Cc: laforge@gnumonks.org, Rusty Russell , temnota@kmv.ru Subject: Fw: [Bugme-new] [Bug 1054] New: loading iptables modules kill raid5 kernel thread Message-Id: <20030807085043.3b794387.akpm@osdl.org> X-Mailer: Sylpheed version 0.9.4 (GTK+ 1.2.10; i686-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4615 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: akpm@osdl.org Precedence: bulk X-list: netdev This is weird. It looks like something on the netfilter module initialisation path has called smp_call_function(garbage_address). But I cannot see where anything like that could happen. 
Begin forwarded message: Date: Thu, 7 Aug 2003 07:49:13 -0700 From: bugme-daemon@osdl.org To: bugme-new@lists.osdl.org Subject: [Bugme-new] [Bug 1054] New: loading iptables modules kill raid5 kernel thread http://bugme.osdl.org/show_bug.cgi?id=1054 Summary: loading iptables modules kill raid5 kernel thread Kernel Version: 2.4.22-pre10 Status: NEW Severity: normal Owner: laforge@gnumonks.org Submitter: temnota@kmv.ru Distribution: RedHat 7.1 Hardware Environment: HP NetServer 5/LS $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 5 model : 2 model name : Pentium 75 - 200 stepping : 5 fdiv_bug : no hlt_bug : no f00f_bug : yes coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse msr mce cx8 apic bogomips : 53.04 processor : 1 vendor_id : GenuineIntel cpu family : 5 model : 2 model name : Pentium 75 - 200 stepping : 5 fdiv_bug : no hlt_bug : no f00f_bug : yes coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse msr mce cx8 apic bogomips : 53.24 $ lspci -v 00:00.0 Host bridge: Intel Corporation 82452KX/GX [Orion] (rev 02) Flags: bus master, medium devsel, latency 6 00:0d.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 01) Flags: bus master, medium devsel, latency 66, IRQ 10 Memory at ffe7f000 (32-bit, prefetchable) [size=4K] I/O ports at ef80 [size=32] Memory at ff600000 (32-bit, non-prefetchable) [size=1M] Expansion ROM at [disabled] [size=1M] 00:0e.0 Non-VGA unclassified device: Intel Corporation 82375EB (rev 05) Flags: bus master, medium devsel, latency 248 00:0f.0 Class ff00: Intel Corporation: Unknown device 0008 Subsystem: Unknown device ec08:ffe7 Flags: fast devsel Memory at ffe7ec00 (32-bit, prefetchable) [size=1K] Memory at 12000000 (32-bit, prefetchable) [size=1K] Memory at 12000400 (32-bit, prefetchable) [size=1K] Memory at 12000800 (32-bit, prefetchable) [size=1K] Memory at 12000c00 (32-bit, prefetchable) [size=1K] Memory at 12001000 
(32-bit, prefetchable) [size=1K] Expansion ROM at fffff800 [disabled] [size=2K] 01:00.0 Host bridge: Intel Corporation 82452KX/GX [Orion] (rev 02) Flags: bus master, medium devsel, latency 6 01:0c.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08) Subsystem: Intel Corporation EtherExpress PRO/100+ Management Adapter Flags: bus master, medium devsel, latency 66, IRQ 9 Memory at ffcfe000 (32-bit, non-prefetchable) [size=4K] I/O ports at f8c0 [size=64] Memory at ffb00000 (32-bit, non-prefetchable) [size=1M] Expansion ROM at [disabled] [size=1M] Capabilities: [dc] Power Management version 2 01:0d.0 SCSI storage controller: Adaptec AHA-294x / AIC-7870 (rev 03) Flags: bus master, medium devsel, latency 64, IRQ 11 I/O ports at fc00 [disabled] [size=256] Memory at ffcff000 (32-bit, non-prefetchable) [size=4K] Expansion ROM at [disabled] [size=64K] 01:0e.0 SCSI storage controller: Adaptec AHA-294x / AIC-7870 (rev 03) Flags: bus master, medium devsel, latency 64, IRQ 9 I/O ports at f400 [disabled] [size=256] Memory at ffcfd000 (32-bit, non-prefetchable) [size=4K] Expansion ROM at [disabled] [size=64K] $ cat /proc/modules ipt_TOS 1008 0 ipt_tos 448 0 (unused) iptable_mangle 2144 1 ipt_TCPMSS 2336 3 ipt_tcpmss 800 0 (unused) ipt_LOG 3568 20 ipt_MARK 720 0 (unused) ipt_REDIRECT 768 0 (unused) iptable_nat 23264 1 [ipt_REDIRECT] ipt_REJECT 3136 0 (unused) ipt_mac 656 12 ipt_mark 464 0 (unused) ipt_multiport 640 0 (unused) iptable_filter 1712 1 ipt_state 576 8 ipt_limit 1216 171 ip_conntrack_ftp 4512 0 (unused) ip_conntrack 29664 3 [ipt_REDIRECT iptable_nat ipt_state ip_conntrack_ftp] ip_tables 15008 18 [ipt_TOS ipt_tos iptable_mangle ipt_TCPMSS ipt_tcpmss ipt_LOG ipt_MARK ipt_REDIRECT iptable_nat ipt_REJECT ipt_mac ipt_mark ipt_multiport iptable_filter ipt_state ipt_limit] Software Environment: Software Raid5 + iptables modules Problem Description: While the RAID array is recovering disks (after an unclean shutdown), loading the iptables modules kills the raid5 kernel thread. Unable
to handle kernel NULL pointer dereference at virtual address 00000212 c102c04d *pde = 00000000 Oops: 0002 CPU: 1 EIP: 0010:[] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010082 eax: d0355bb3 ebx: 00000001 ecx: c102c01c edx: 00000212 esi: 00000019 edi: d1af3000 ebp: d1af4000 esp: d1d95dec ds: 0018 es: 0018 ss: 0018 Process raid5d (pid: 13, stackpage=d1d95000) Stack: c0113b18 00000212 d1af3cc0 c010ca6a d1af3cc0 d1af5cc0 d1af4cc0 00000019 d1af3000 d1af4000 0080e85d 00000018 00000018 fffffffb c01fb83a 00000010 00000282 00000003 d1af6c00 c01fbef7 00001000 d1af4000 d1af3000 d1af5000 Call Trace: [] [] [] [] [] [] [] [] [] [] [] [] [] Code: c0 02 c1 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >>EIP; c102c04d <_end+cae92d/124d9940> <===== Trace; c0113b18 Trace; c010ca6a Trace; c01fb83a Trace; c01fbef7 Trace; c01f734e Trace; c01b1261 Trace; c01f82e6 Trace; c01bb224 Trace; c01f8ae1 Trace; c01f8a10 Trace; c0200515 Trace; c0105883 Trace; c0200370 Code; c102c04d <_end+cae92d/124d9940> 00000000 <_EIP>: Code; c102c04d <_end+cae92d/124d9940> 0: c0 02 c1 rolb $0xc1,(%edx) Unable to handle kernel NULL pointer dereference at virtual address 00000217 c102c04d *pde = 00000000 Oops: 0002 CPU: 0 EIP: 0010:[] Not tainted EFLAGS: 00010082 eax: cb189bb3 ebx: 00000001 ecx: c102c01c edx: 00000217 esi: 00000021 edi: d1c51000 ebp: d1c52000 esp: d1d95dec ds: 0018 es: 0018 ss: 0018 Process raid5d (pid: 13, stackpage=d1d95000) Stack: c0113b18 00000217 d1c51bc0 c010ca6a d1c51bc0 d1c53be0 d1c52bc0 00000021 d1c51000 d1c52000 807944a9 00000018 00000018 fffffffb c01fb85b 00000010 00000286 00000003 d1c54c00 c01fbef7 00001000 d1c52000 d1c51000 d1c53000 Call Trace: [] [] [] [] [] [] [] [] [] [] [] [] [] Code: c0 02 c1 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >>EIP; c102c04d <_end+cae92d/124d9940> <===== Trace; c0113b18 Trace; c010ca6a Trace; c01fb85b Trace; c01fbef7 Trace; c01f734e Trace; c01b1261 Trace; c01f82e6 Trace; c01bb224 Trace; c01f8ae1 Trace; c01f8a10 
Trace; c0200515 Trace; c0105883 Trace; c0200370 Code; c102c04d <_end+cae92d/124d9940> 00000000 <_EIP>: Code; c102c04d <_end+cae92d/124d9940> 0: c0 02 c1 rolb $0xc1,(%edx) Steps to reproduce: raidsetfaulty /dev/md0 /dev/sde1 raidhotremove /dev/md0 /dev/sde1 raidhotadd /dev/md0 /dev/sde1 and load iptables modules. OOPS ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From greearb@candelatech.com Thu Aug 7 08:49:55 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 08:50:00 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77FnsFl026011 for ; Thu, 7 Aug 2003 08:49:55 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h77FnWtf001210; Thu, 7 Aug 2003 08:49:33 -0700 Message-ID: <3F32750C.4000600@candelatech.com> Date: Thu, 07 Aug 2003 08:49:32 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "Hen, Shmulik" CC: netdev@oss.sgi.com Subject: Re: VLAN patch for 2.4.21 References: In-Reply-To: Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4616 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Hen, Shmulik wrote: >>-----Original Message----- >>From: Ben Greear [mailto:greearb@candelatech.com] >>Sent: Thursday, August 07, 2003 10:28 AM >>To: Hen, Shmulik >>Cc: netdev@oss.sgi.com >>Subject: Re: VLAN patch for 2.4.21 >> >> >>You can just check the things in the net_device struct directly >>I imagine. The calls I added are mainly to provide the info to >>user-space. > > > That was my guess too. 
I'll figure out a way to do that > safely from kernel space. > > >>What information do you need, and where do you need it? > > > In bonding, to better handle self generated packets, I'll need > to know what vlan ID's are on top of the bond device. So, I'll > need to listen to net dev registration notifications and sort > out which ones are for VLAN devices, and then see if they were > added on top of a bond device. Once I've got that, I'll need to > get the VID and store it in bonding, so both your additions > do exactly what I need. I also heard from the bridge developers > that they wanted similar support, so that's 2 birds... If it's a VLAN device, it will have priv_flags & 0x1 turned on, see dev->priv_flags and if.h for possible values that priv_flags can have: /* Private (from user) interface flags (netdevice->priv_flags). */ #define IFF_802_1Q_VLAN 0x1 /* 802.1Q VLAN device. */ You can then get it's vlan-ID by looking for: VLAN_DEV_INFO(dev)->vlan_id; VLAN_DEV_INFO is defined in if_vlan.h as: #define VLAN_DEV_INFO(x) ((struct vlan_dev_info *)(x->priv)) > > > Shmulik. 
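The check Ben describes above can be sketched in plain C. This is only a userspace mock, not kernel code: the two structs are hypothetical, stripped-down stand-ins for the kernel's struct net_device and struct vlan_dev_info, and vlan_id_of() is an illustrative helper name that does not exist in the kernel. Only IFF_802_1Q_VLAN and VLAN_DEV_INFO follow the definitions quoted in the message.

```c
#include <stddef.h>

/* Flag value as quoted above from if.h. */
#define IFF_802_1Q_VLAN 0x1 /* 802.1Q VLAN device. */

/* Hypothetical stand-in: the real struct vlan_dev_info has more fields. */
struct vlan_dev_info {
    unsigned short vlan_id;
};

/* Hypothetical stand-in: only the fields the check needs. */
struct net_device {
    const char *name;
    unsigned short priv_flags;
    void *priv;
};

/* Same shape as the macro quoted from if_vlan.h (argument parenthesised). */
#define VLAN_DEV_INFO(x) ((struct vlan_dev_info *)((x)->priv))

/*
 * Illustrative helper (not a kernel function): returns the VLAN ID if
 * dev is flagged as an 802.1Q VLAN device, or -1 otherwise.
 */
static int vlan_id_of(const struct net_device *dev)
{
    if (dev == NULL || !(dev->priv_flags & IFF_802_1Q_VLAN))
        return -1;          /* not a VLAN device */
    if (dev->priv == NULL)
        return -1;          /* defensive: no VLAN info attached */
    return VLAN_DEV_INFO(dev)->vlan_id;
}
```

In a netdevice notifier, as Shmulik outlines, the same test would presumably run on each registration event to pick out the VLAN devices before checking whether they sit on top of a bond device; that part is not shown here.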
> > -- Ben Greear Candela Technologies Inc http://www.candelatech.com From shemminger@osdl.org Thu Aug 7 08:59:46 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 08:59:51 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77FxjFl026997 for ; Thu, 7 Aug 2003 08:59:46 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h77FxaI22159; Thu, 7 Aug 2003 08:59:36 -0700 Date: Thu, 7 Aug 2003 08:59:30 -0700 From: Stephen Hemminger To: Felix Radensky Cc: netdev@oss.sgi.com Subject: Re: Ethernet bridge performance Message-Id: <20030807085930.032b0602.shemminger@osdl.org> In-Reply-To: <3F3217E7.2080903@allot.com> References: <3F3217E7.2080903@allot.com> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.4claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4617 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev What kernel version? 2.6 should be faster. On Thu, 07 Aug 2003 12:12:07 +0300 Felix Radensky wrote: > Hi, > > I'm evaluating a performance of a dual port ethernet bridge, and the > results are a bit disappointing. I would appreciate any hints on improving > the results. > > I'm using a Dual Xeon 2.66 GHz box based on Intel 5701 chipset with 1G > of RAM. NICs are e1000 82546 connected to PCIX bus. Kernel is 2.4.22-pre8, > e1000 driver version 5.1.13-k1 with NAPI support. NICs' interrupts are > bound > to CPU0. > > The test consists of sending 200 byte UDP packets from 2 ports of Gigabit > IXIA traffic generator to 2 bridge ports. 
The bridge is capable to sustain > the rate of ~170000 pps from each IXIA port without drops. I was > expecting it to be able to do at least 250000 pps (our own bridge code > based > on 2.2.x kernel sustains ~266000 pps on the same hardware). > e1000 driver drops 0 packets, all drops occur at higher level. > > The output of oprofile attached. I'd be happy to provide any info you may > need. > > Thanks in advance for your help. > > Felix. > From felix@allot.com Thu Aug 7 09:31:49 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 09:31:58 -0700 (PDT) Received: from mxout4.netvision.net.il (mxout4.netvision.net.il [194.90.9.27]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77GVlFl029254 for ; Thu, 7 Aug 2003 09:31:48 -0700 Received: from exg.allot.com ([199.203.223.202]) by mxout4.netvision.net.il (iPlanet Messaging Server 5.2 HotFix 1.14 (built Mar 18 2003)) with ESMTP id <0HJ900428BCEH8@mxout4.netvision.net.il> for netdev@oss.sgi.com; Thu, 07 Aug 2003 19:05:03 +0300 (IDT) Received: from allot.com (199.203.223.201 [199.203.223.201]) by exg.allot.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id QNW7ZVDG; Thu, 07 Aug 2003 19:08:06 +0200 Date: Thu, 07 Aug 2003 19:05:31 +0300 From: Felix Radensky Subject: Re: Ethernet bridge performance To: Stephen Hemminger Cc: netdev@oss.sgi.com Message-id: <3F3278CB.5070505@allot.com> Organization: Allot Communications Ltd. 
MIME-version: 1.0 Content-type: multipart/alternative; boundary="Boundary_(ID_5fAUGaz5oHMtphX6UtIryg)" X-Accept-Language: en-us, en User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030208 Netscape/7.02 References: <3F3217E7.2080903@allot.com> <20030807085930.032b0602.shemminger@osdl.org> X-archive-position: 4618 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: felix@allot.com Precedence: bulk X-list: netdev --Boundary_(ID_5fAUGaz5oHMtphX6UtIryg) Content-type: text/plain; charset=us-ascii; format=flowed Content-transfer-encoding: 7BIT It's 2.4.22-pre8. I have to work with 2.4, 2.6 is not an option at this point. Felix. Stephen Hemminger wrote: >What kernel version? 2.6 should be faster. > >On Thu, 07 Aug 2003 12:12:07 +0300 >Felix Radensky wrote: > > > >>Hi, >> >>I'm evaluating a performance of a dual port ethernet bridge, and the >>results are a bit disappointing. I would appreciate any hints on improving >>the results. >> >>I'm using a Dual Xeon 2.66 GHz box based on Intel 5701 chipset with 1G >>of RAM. NICs are e1000 82546 connected to PCIX bus. Kernel is 2.4.22-pre8, >>e1000 driver version 5.1.13-k1 with NAPI support. NICs' interrupts are >>bound >>to CPU0. >> >>The test consists of sending 200 byte UDP packets from 2 ports of Gigabit >>IXIA traffic generator to 2 bridge ports. The bridge is capable to sustain >>the rate of ~170000 pps from each IXIA port without drops. I was >>expecting it to be able to do at least 250000 pps (our own bridge code >> based >>on 2.2.x kernel sustains ~266000 pps on the same hardware). >>e1000 driver drops 0 packets, all drops occur at higher level. >> >>The output of oprofile attached. I'd be happy to provide any info you may >>need. >> >>Thanks in advance for your help. >> >>Felix. >> >> >> > > > --Boundary_(ID_5fAUGaz5oHMtphX6UtIryg) Content-type: text/html; charset=us-ascii Content-transfer-encoding: 7BIT It's 2.4.22-pre8. 
--Boundary_(ID_5fAUGaz5oHMtphX6UtIryg)-- From greearb@candelatech.com Thu Aug 7 09:57:23 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 09:57:32 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77GvMFl030528 for ; Thu, 7 Aug 2003 09:57:23 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h77GvEtf009001; Thu, 7 Aug 2003 09:57:14 -0700 Message-ID: <3F3284EA.5050406@candelatech.com> Date: Thu, 07 Aug 2003 09:57:14 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Felix Radensky CC: netdev@oss.sgi.com Subject: Re: Ethernet bridge performance References: <3F3217E7.2080903@allot.com> In-Reply-To: <3F3217E7.2080903@allot.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4619 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Felix Radensky wrote: > Hi, > > I'm evaluating a performance of a dual port ethernet bridge, and the > results are a bit disappointing. I would appreciate any hints on improving > the results. > c01a13ac 7983 12.5065 eth_type_trans > c01a1590 7629 11.9519 qdisc_restart > c0197720 7365 11.5383 skb_release_data > c010c170 5962 9.3403 do_gettimeofday If that do_gettimeofday is happening in the skb rx code, then you could gain ~10% by disabling it somehow..as it should not matter for a bridge. I bet Robert's skb-recycle patch would help here too, especially if you allowed the NICs to save up a large number of skbs so that alloc was less likely to fail. 
Btw, I've considered saving, say, 10k skbs on a list in my module, allocated by GFP_KERNEL at module load time, and using them when GFP_ATOMIC skb_alloc fails in the IRQ handling portion of the code.... Anyone think that's a good idea? :) Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From felix@allot.com Thu Aug 7 10:18:51 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 10:18:56 -0700 (PDT) Received: from mxout5.netvision.net.il (mxout5.netvision.net.il [194.90.9.29]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77HInFl032042 for ; Thu, 7 Aug 2003 10:18:50 -0700 Received: from exg.allot.com ([199.203.223.202]) by mxout5.netvision.net.il (iPlanet Messaging Server 5.2 HotFix 1.14 (built Mar 18 2003)) with ESMTP id <0HJ90073VER7OM@mxout5.netvision.net.il> for netdev@oss.sgi.com; Thu, 07 Aug 2003 20:18:43 +0300 (IDT) Received: from allot.com (199.203.223.201 [199.203.223.201]) by exg.allot.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id QNW7ZVML; Thu, 07 Aug 2003 20:21:46 +0200 Date: Thu, 07 Aug 2003 20:19:11 +0300 From: Felix Radensky Subject: Re: Ethernet bridge performance To: Ben Greear Cc: netdev@oss.sgi.com Message-id: <3F328A0F.3040005@allot.com> Organization: Allot Communications Ltd. MIME-version: 1.0 Content-type: text/plain; charset=us-ascii; format=flowed Content-transfer-encoding: 7BIT X-Accept-Language: en-us, en User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030208 Netscape/7.02 References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> X-archive-position: 4620 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: felix@allot.com Precedence: bulk X-list: netdev Thanks for your help, Ben. What is skb-recycle patch and where can I find it ? Felix. 
Ben Greear wrote: > Felix Radensky wrote: > >> Hi, >> >> I'm evaluating a performance of a dual port ethernet bridge, and the >> results are a bit disappointing. I would appreciate any hints on >> improving >> the results. > > >> c01a13ac 7983 12.5065 eth_type_trans >> c01a1590 7629 11.9519 qdisc_restart >> c0197720 7365 11.5383 skb_release_data >> c010c170 5962 9.3403 do_gettimeofday > > > If that do_gettimeofday is happening in the skb rx code, then > you could gain ~10% by disabling it somehow..as it should not > matter for a bridge. I bet Robert's skb-recycle patch would > help here too, especially if you allowed the NICs to save up a large > number of skbs so that alloc was less likely to fail. > > Btw, I've considered saving, say, 10k skbs on a list in my module, > allocated by GFP_KERNEL at module load time, and using them when > GFP_ATOMIC skb_alloc fails in the IRQ handling portion of the code.... > > Anyone think that's a good idea? :) > > Ben > From Robert.Olsson@data.slu.se Thu Aug 7 12:09:54 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 12:10:04 -0700 (PDT) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77J9qFl008522 for ; Thu, 7 Aug 2003 12:09:54 -0700 Received: (from robert@localhost) by robur.slu.se (8.9.3p2/8.9.3) id VAA05772; Thu, 7 Aug 2003 21:09:44 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16178.41976.3643.584516@robur.slu.se> Date: Thu, 7 Aug 2003 21:09:44 +0200 To: Felix Radensky Cc: Ben Greear , netdev@oss.sgi.com Subject: Re: Ethernet bridge performance In-Reply-To: <3F328A0F.3040005@allot.com> References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> <3F328A0F.3040005@allot.com> X-Mailer: VM 6.92 under Emacs 19.34.1 X-archive-position: 4621 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com 
X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Felix Radensky writes: > Thanks for your help, Ben. What is skb-recycle patch > and where can I find it ? It's experimental and not updated for almost a year and current implementation does not add anything to SMP. Got some idea how to improve this... but try to keep to slab as long as possible it has been improved. Routing/bridging on SMP has affinty problem. If you are passing skb's say from eth0 to eth1 and they are bound on different CPU's you get cache boucing since the TX-interrupts come on another CPU. In a recent test with pktgen: 300 kpps with TX interrupts on same CPU as sender. 198 kpps with TX intr on different CPU as sender. Recycling tries to address this but current implementation fails as said. But you are probably hit by something else... Check were the drops happens qdisc?. NIC ring RX/TX size, Number of interrupts. ksoftird priority, link HW_FLOW control, checksumming, affinity etc. Cheers. --ro From hadi@cyberus.ca Thu Aug 7 12:21:38 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 12:21:49 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77JLbFl009582 for ; Thu, 7 Aug 2003 12:21:38 -0700 Received: from [216.209.86.2] (helo=[10.0.0.9] ident=jamal) by mail.cyberus.ca with esmtp (Exim 4.12) id 19kqKN-0009tM-00; Thu, 07 Aug 2003 15:21:35 -0400 Subject: Re: Ethernet bridge performance From: jamal Reply-To: hadi@cyberus.ca To: Robert Olsson Cc: Felix Radensky , Ben Greear , netdev@oss.sgi.com In-Reply-To: <16178.41976.3643.584516@robur.slu.se> References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> <3F328A0F.3040005@allot.com> <16178.41976.3643.584516@robur.slu.se> Content-Type: text/plain Organization: jamalopolis Message-Id: <1060284094.1024.36.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 07 Aug 2003 15:21:34 -0400 
Content-Transfer-Encoding: 7bit X-archive-position: 4622 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Actually seems his biggest problem is he is not running the NAPI driver cheers, jamal On Thu, 2003-08-07 at 15:09, Robert Olsson wrote: > Felix Radensky writes: > > Thanks for your help, Ben. What is skb-recycle patch > > and where can I find it ? > > It's experimental and not updated for almost a year and current > implementation does not add anything to SMP. Got some idea how > to improve this... but try to keep to slab as long as possible > it has been improved. > > Routing/bridging on SMP has affinty problem. If you are passing > skb's say from eth0 to eth1 and they are bound on different CPU's > you get cache boucing since the TX-interrupts come on another CPU. > > In a recent test with pktgen: > 300 kpps with TX interrupts on same CPU as sender. > 198 kpps with TX intr on different CPU as sender. > > Recycling tries to address this but current implementation fails > as said. > > But you are probably hit by something else... Check were the drops > happens qdisc?. NIC ring RX/TX size, Number of interrupts. ksoftird > priority, link HW_FLOW control, checksumming, affinity etc. > > > Cheers. > --ro > > From davem@redhat.com Thu Aug 7 12:40:43 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 12:40:52 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77JegFl011058 for ; Thu, 7 Aug 2003 12:40:43 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id MAA29280; Thu, 7 Aug 2003 12:35:48 -0700 Date: Thu, 7 Aug 2003 12:35:47 -0700 From: "David S. 
Miller" To: Ben Greear Cc: felix@allot.com, netdev@oss.sgi.com Subject: Re: Ethernet bridge performance Message-Id: <20030807123547.1dcf2353.davem@redhat.com> In-Reply-To: <3F3284EA.5050406@candelatech.com> References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4623 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Thu, 07 Aug 2003 09:57:14 -0700 Ben Greear wrote: > Btw, I've considered saving, say, 10k skbs on a list in my module, > allocated by GFP_KERNEL at module load time, and using them when > GFP_ATOMIC skb_alloc fails in the IRQ handling portion of the code.... > > Anyone think that's a good idea? :) Not really. GFP_ATOMIC should not fail regularly under normal (even heavy load) operation. If it does, it means the amount of reserved pages the kernel keeps around is not set correctly for your system. In 2.6.x, play with /proc/sys/vm/min_free_kbytes From greearb@candelatech.com Thu Aug 7 12:51:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 12:51:12 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77Jp7Fl012026 for ; Thu, 7 Aug 2003 12:51:07 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h77Jostf031569; Thu, 7 Aug 2003 12:50:54 -0700 Message-ID: <3F32AD9D.4010504@candelatech.com> Date: Thu, 07 Aug 2003 12:50:53 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5a) Gecko/20030718 X-Accept-Language: en-us, en MIME-Version: 1.0 To: "David S. 
Miller" CC: felix@allot.com, netdev@oss.sgi.com Subject: Re: Ethernet bridge performance References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> <20030807123547.1dcf2353.davem@redhat.com> In-Reply-To: <20030807123547.1dcf2353.davem@redhat.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4624 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev David S. Miller wrote: > On Thu, 07 Aug 2003 09:57:14 -0700 > Ben Greear wrote: > > >>Btw, I've considered saving, say, 10k skbs on a list in my module, >>allocated by GFP_KERNEL at module load time, and using them when >>GFP_ATOMIC skb_alloc fails in the IRQ handling portion of the code.... >> >>Anyone think that's a good idea? :) > > > Not really. > > GFP_ATOMIC should not fail regularly under normal (even heavy load) > operation. If it does, it means the amount of reserved pages > the kernel keeps around is not set correctly for your system. > > In 2.6.x, play with /proc/sys/vm/min_free_kbytes Anything to set for 2.4? I've looked for how to tune the 2.4 VM for some time, but never found anything. 
Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From hadi@cyberus.ca Thu Aug 7 12:58:53 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 12:59:01 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77JwqFl013273 for ; Thu, 7 Aug 2003 12:58:53 -0700 Received: from [216.209.86.2] (helo=[10.0.0.9] ident=jamal) by mail.cyberus.ca with esmtp (Exim 4.12) id 19kquS-000ECS-00; Thu, 07 Aug 2003 15:58:52 -0400 Subject: Re: [RFC] High Performance Packet Classifiction for tc framework From: jamal Reply-To: hadi@cyberus.ca To: Michael Bellion and Thomas Heinz Cc: linux-net@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: <3F302E04.1090503@hipac.org> References: <200307141045.40999.nf@hipac.org> <1058328537.1797.24.camel@jzny.localdomain> <3F16A0E5.1080007@hipac.org> <1059934468.1103.41.camel@jzny.localdomain> <3F2E5CD6.4030500@hipac.org> <1060012260.1103.380.camel@jzny.localdomain> <3F302E04.1090503@hipac.org> Content-Type: text/plain Organization: jamalopolis Message-Id: <1060286331.1025.73.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 07 Aug 2003 15:58:51 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4625 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Hi there, On Tue, 2003-08-05 at 18:21, Michael Bellion and Thomas Heinz wrote: > Hi Jamal > > You wrote: > > I promise i will. I dont think i will do it justice spending 5 minutes > > on it. I take it you have written extensive docs too ;-> > > Of course ;-) > Well, actually we are going to present an overview of the hipac > algorithm at the netfilter developer workshop in Budapest. > Hope to see you there. > Unfortunately due to economical reasons i wont be able to make it. I mentioned it to LaForge. 
> > Unfortunately it is more exciting to write code than documents. I almost > > got someone to document at least its proper usage but they backed away > > at the last minute. > > lol > It was very close ;-> The guy looked motivated i felt scared for a while that he will be asking a lot of questions. Then i never heard about it again ;-> I think he left town too. > Yes, it does. Still the question is how to solve this > generally. Consider the following example ruleset: > > 1) src ip 10.0.0.0/30 dst ip 20.0.0.0/20 > 2) src ip 10.0.0.0/28 dst ip 20.0.0.0/22 > 3) src ip 10.0.0.0/26 dst ip 20.0.0.0/24 > 4) src ip 10.0.0.0/24 dst ip 20.0.0.0/26 > 5) src ip 10.0.0.0/22 dst ip 20.0.0.0/28 > 6) src ip 10.0.0.0/20 dst ip 20.0.0.0/30 > > So you have 1 src ip hash and #buckets(src ip hash) many > dst ip hashes. In order to achieve maximum performance > you have to minimize the number of collisions in the > hash buckets. How would you choose the hash function > and what would the construction look like? > It can be done by using the masks - but it would look really ugly. I suppose just providing a good user interface is valuable. > In principle the tree of hashes approach is capable to > express a general access list like ruleset, i.e. a set > of terminal rules with different priorities. The problem > is that the approach is only efficient if the number of > collisions is O(1) -> no amortized analysis but rather > per bucket. > true. > In theory you can do the following. Let's consider one > dimension. The matches in one dimension form a set of > elementary intervals which are overlapped by certain rules. > Example: > > |------| |---------| > |----------------| > |------------------| > |---------------| > > |----|---|--|---|-----|---|----|-------|--|------|-------| > > The '|-----|' reflect the matches and the bottom line > represents the set of elementary intervals introduced > by the matches. 
Now, you can decide for each elementary > interval which rule matches since the rules are terminal > and uniquely prioritized. > right. Why do you refer to this as one dimension? > The next step would be to create a hash with #elementary > intervals many buckets and create a hash function which > maps the keys to the appropriate buckets like in the picture. > In this case you have exactly 1 entry per hash bucket. > Sounds fine BUT it is not possible to generically deduce > an easily (= fast) computable hash function with the > described requirements. > nod. > BTW, this approach can be extended to 2 or more dimensions > where the hash function for each hash has to meet the > requirement. Of course this information is not very helpful > since the problem of defining appropriate hash functions > remains ;) > Is that problem even solvable? > Obviously this way is not viable but supposedly the only > one to achieve ultimate performance with the tree of hashes > concept. > > BTW, the way hipac works is basically not so different > from the idea described above but since we use efficient > btrees we don't have to define hash functions. > This is why i was wondering how fast your instertions and deletions are. Seems to me you will have to sort the rules everytime. > > sure position could be used as a priority. It is easier/intuitive to > > just have explicit priorities. > > Merely a matter of taste. The way iptables and nf-hipac use > priorities is somewhat more dynamic than the tc way because > they are automatically adjusted if a rule is inserted in between > others. > Dont you think this "dynamic adjustment" is unnecessary. Essentially you enforce a model where every rule is a different priority. > > What "optimizes" could be a user interface or the thread i was talking > > about earlier. > > Hm, this rebalancing is not clear to us. Do you want to rebalance > the tree of hashes? 
This seems a little strange at the first > glance because the performance of the tree of hashes is dominated > by the number of collisions that need to be resolved and > not the depth of the tree. > The general idea is to recreate the tree if need be based on colisions. I just hope some idiot reading this doesnt go and patent it(has happened before). Think of it as dynamic hash adjustment. Talk to me in private if you are really interested. > > Is your plan to put this in other places other than Linux? > > Currently we are working on the integration in linux. > In general the hipac core is OS and application independent, > so basically it could also be used for some userspace program > which is related to classification and of course in other OS's. > > Any special reason why you are asking this question? > I was wondering why not just optimize it for Linux. You are trying to center the world around nf-hipac - I would just center it around Linux ;-> > > So you got this thought from iptables and took it to the next level? > > Well, in order to support iptables matches and targets we had > to create an appropriate abstraction for them on the hipac > layer. This abstraction can also be used for tc classifiers > if the tcf_result is ignored, i.e. you just consider whether > the filter matched or not. > I am not sure i understood the part about ignoring tcf_result. > > I am still not sure i understand why not use what already exists - but > > i'll just say i dont see it right now. > > If hipac had no support for embedded classifiers you couldn't > express a ruleset like: > 1) [native hipac matches] [u32 filter] [classid] > 2) [native hipac matches] [classid] > You would have to construct rule 1) in a way that it "jumps" > to an external u32 filter. Unfortunately, you cannot jump > back to the hipac filter again in case the u32 filter does > not match so rule 2) is unreachable. This problem is caused > by the fact that cls_hipac can occur at most once per interface. 
> You show only one classid per rule .. I think i can see what you meant by ignoring tcf_result - essentially you want to have a series of filter rules with different classification enngines, no? But could you not have the filters repeat the same classid for every filter? Also it seems you want to be able to have an action defined for "nomatch" as well as "match" - is that correct? Some form of reclassification when nomatch ? cheers, jamal From davem@redhat.com Thu Aug 7 13:03:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 13:03:27 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77K3NFl013952 for ; Thu, 7 Aug 2003 13:03:24 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id MAA29325; Thu, 7 Aug 2003 12:58:29 -0700 Date: Thu, 7 Aug 2003 12:58:28 -0700 From: "David S. Miller" To: Ben Greear Cc: felix@allot.com, netdev@oss.sgi.com Subject: Re: Ethernet bridge performance Message-Id: <20030807125828.625b6640.davem@redhat.com> In-Reply-To: <3F32AD9D.4010504@candelatech.com> References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> <20030807123547.1dcf2353.davem@redhat.com> <3F32AD9D.4010504@candelatech.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4626 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Thu, 07 Aug 2003 12:50:53 -0700 Ben Greear wrote: > David S. Miller wrote: > > In 2.6.x, play with /proc/sys/vm/min_free_kbytes > > Anything to set for 2.4? I've looked for how to tune the 2.4 VM for > some time, but never found anything. Good question. Nothing exists there. 
The per-zone ->pages_min value is what is controlled by this settting. It should be easy to backport the 2.6.x sysctl to 2.4.x, even for an amateur :-) From davem@redhat.com Thu Aug 7 13:10:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 13:10:08 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77KA1Fl014729 for ; Thu, 7 Aug 2003 13:10:02 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id NAA29361; Thu, 7 Aug 2003 13:05:03 -0700 Date: Thu, 7 Aug 2003 13:05:02 -0700 From: "David S. Miller" To: hadi@cyberus.ca Cc: nf@hipac.org, linux-net@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [RFC] High Performance Packet Classifiction for tc framework Message-Id: <20030807130502.4af9c815.davem@redhat.com> In-Reply-To: <1060286331.1025.73.camel@jzny.localdomain> References: <200307141045.40999.nf@hipac.org> <1058328537.1797.24.camel@jzny.localdomain> <3F16A0E5.1080007@hipac.org> <1059934468.1103.41.camel@jzny.localdomain> <3F2E5CD6.4030500@hipac.org> <1060012260.1103.380.camel@jzny.localdomain> <3F302E04.1090503@hipac.org> <1060286331.1025.73.camel@jzny.localdomain> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4627 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On 07 Aug 2003 15:58:51 -0400 jamal wrote: > > Yes, it does. Still the question is how to solve this > > generally. 
Consider the following example ruleset: > > > > 1) src ip 10.0.0.0/30 dst ip 20.0.0.0/20 > > 2) src ip 10.0.0.0/28 dst ip 20.0.0.0/22 > > 3) src ip 10.0.0.0/26 dst ip 20.0.0.0/24 > > 4) src ip 10.0.0.0/24 dst ip 20.0.0.0/26 > > 5) src ip 10.0.0.0/22 dst ip 20.0.0.0/28 > > 6) src ip 10.0.0.0/20 dst ip 20.0.0.0/30 > > > > So you have 1 src ip hash and #buckets(src ip hash) many > > dst ip hashes. In order to achieve maximum performance > > you have to minimize the number of collisions in the > > hash buckets. How would you choose the hash function > > and what would the construction look like? > > > > It can be done by using the masks - but it would look really ugly. I > suppose just providing a good user interface is valuable. If you input all the keys into the Jenkins hash, how does it perform? Has anyone even tried that and compared it to all of these fancy multi-level tree like hash things? I think Jenkins would work very well for exactly this kind of application. And it's fully available to the entire kernel via linux/jhash.h and already in use by other things such as the routing cache and the netfilter conntrack code. From shemminger@osdl.org Thu Aug 7 14:22:12 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 14:22:16 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77LMAFl021019 for ; Thu, 7 Aug 2003 14:22:12 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h77LLvo06916; Thu, 7 Aug 2003 14:21:58 -0700 Date: Thu, 7 Aug 2003 14:21:51 -0700 From: Stephen Hemminger To: "David S. 
Miller" Cc: netdev@oss.sgi.com Subject: [PATCH] skb_pull add unlikely Message-Id: <20030807142151.376e5b3c.shemminger@osdl.org> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.4claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4628 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev Yet another case where giving compiler hints may speed up packet fast path. diff -Nru a/include/linux/skbuff.h b/include/linux/skbuff.h --- a/include/linux/skbuff.h Thu Aug 7 14:20:09 2003 +++ b/include/linux/skbuff.h Thu Aug 7 14:20:09 2003 @@ -883,7 +883,7 @@ */ static inline unsigned char *skb_pull(struct sk_buff *skb, unsigned int len) { - return (len > skb->len) ? NULL : __skb_pull(skb, len); + return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len); } extern unsigned char *__pskb_pull_tail(struct sk_buff *skb, int delta); From davem@redhat.com Thu Aug 7 14:34:12 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 14:34:15 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77LYBFl022100 for ; Thu, 7 Aug 2003 14:34:11 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id OAA29639; Thu, 7 Aug 2003 14:29:15 -0700 Date: Thu, 7 Aug 2003 14:29:15 -0700 From: "David S. 
Miller" To: Stephen Hemminger Cc: netdev@oss.sgi.com Subject: Re: [PATCH] skb_pull add unlikely Message-Id: <20030807142915.64081139.davem@redhat.com> In-Reply-To: <20030807142151.376e5b3c.shemminger@osdl.org> References: <20030807142151.376e5b3c.shemminger@osdl.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4629 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Thu, 7 Aug 2003 14:21:51 -0700 Stephen Hemminger wrote: > Yet another case where giving compiler hints may speed up packet fast path. Applied, thanks Stephen. From shemminger@osdl.org Thu Aug 7 15:45:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 15:45:47 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77MjiFl024962 for ; Thu, 7 Aug 2003 15:45:45 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h77MjUo29143; Thu, 7 Aug 2003 15:45:30 -0700 Date: Thu, 7 Aug 2003 15:45:24 -0700 From: Stephen Hemminger To: bellucda@tiscali.it, "David S. 
Miller" Cc: netdev@oss.sgi.com Subject: [PATCH] tun driver not cleaning up on module remove Message-Id: <20030807154524.4794ad45.shemminger@osdl.org> In-Reply-To: <200308051910.55823.bellucda@tiscali.it> References: <200308051630.28552.bellucda@tiscali.it> <20030805090647.691daa7e.shemminger@osdl.org> <200308051910.55823.bellucda@tiscali.it> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.4claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4630 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev This should fix module unload issues with tun driver in 2.6-test2. Driver was not cleaning up it's devices on module exit. 
diff -Nru a/drivers/net/tun.c b/drivers/net/tun.c --- a/drivers/net/tun.c Thu Aug 7 15:41:10 2003 +++ b/drivers/net/tun.c Thu Aug 7 15:41:10 2003 @@ -605,7 +605,18 @@ void tun_cleanup(void) { + struct net_device *dev, *nxt; + misc_deregister(&tun_miscdev); + + rtnl_lock(); + for (dev = dev_base; dev; dev = nxt) { + nxt = dev->next; + if (dev->init == tun_net_init) + unregister_netdevice(dev); + } + rtnl_unlock(); + } module_init(tun_init); From Robert.Olsson@data.slu.se Thu Aug 7 15:49:52 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 15:49:55 -0700 (PDT) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77MnoFl025286 for ; Thu, 7 Aug 2003 15:49:52 -0700 Received: (from robert@localhost) by robur.slu.se (8.9.3p2/8.9.3) id AAA06579; Fri, 8 Aug 2003 00:49:34 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16178.55166.157720.889528@robur.slu.se> Date: Fri, 8 Aug 2003 00:49:34 +0200 To: hadi@cyberus.ca Cc: Robert Olsson , Felix Radensky , Ben Greear , netdev@oss.sgi.com Subject: Re: Ethernet bridge performance In-Reply-To: <1060284094.1024.36.camel@jzny.localdomain> References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> <3F328A0F.3040005@allot.com> <16178.41976.3643.584516@robur.slu.se> <1060284094.1024.36.camel@jzny.localdomain> X-Mailer: VM 6.92 under Emacs 19.34.1 X-archive-position: 4631 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev jamal writes: > Actually seems his biggest problem is he is not running > the NAPI driver Oh! I missed this. Cheers. 
--ro From davem@redhat.com Thu Aug 7 16:04:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 16:04:03 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h77N3vFl025797 for ; Thu, 7 Aug 2003 16:04:00 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id PAA29885; Thu, 7 Aug 2003 15:59:01 -0700 Date: Thu, 7 Aug 2003 15:59:01 -0700 From: "David S. Miller" To: Stephen Hemminger Cc: bellucda@tiscali.it, netdev@oss.sgi.com Subject: Re: [PATCH] tun driver not cleaning up on module remove Message-Id: <20030807155901.49f1a424.davem@redhat.com> In-Reply-To: <20030807154524.4794ad45.shemminger@osdl.org> References: <200308051630.28552.bellucda@tiscali.it> <20030805090647.691daa7e.shemminger@osdl.org> <200308051910.55823.bellucda@tiscali.it> <20030807154524.4794ad45.shemminger@osdl.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4632 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Thu, 7 Aug 2003 15:45:24 -0700 Stephen Hemminger wrote: > This should fix module unload issues with tun driver in 2.6-test2. > Driver was not cleaning up it's devices on module exit. The fix looks correct, but the dev->init test looks kind of grotty. Why not add a list_head to tun_struct, and then maintain a list rooted in 'tun.c:tun_alldevs_list', then iterate over that in the module_exit() routine? 
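For illustration only, the list-based cleanup suggested above might look roughly like the following userspace sketch. It is not the tun driver: the tiny list helpers stand in for the kernel's list.h, free() stands in for unregister_netdevice() (which the real driver would call under rtnl_lock()), and tun_create() and tun_cleanup_all() are hypothetical names. Only tun_struct and tun_alldevs_list come from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Append node n just before head h, i.e. at the tail of the list. */
void list_add_tail_node(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Unlink node n and leave it pointing at itself. */
void list_del_node(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n;
	n->prev = n;
}

/* Recover the enclosing structure from a pointer to its list member. */
#define entry_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, cut-down tun_struct: only the list linkage matters here. */
struct tun_struct {
	int id;
	struct list_head list;
};

/* Every live tun device is linked here, so cleanup needs no dev->init test. */
struct list_head tun_alldevs_list = { &tun_alldevs_list, &tun_alldevs_list };

/* Allocate a device and remember it on the global list. */
struct tun_struct *tun_create(int id)
{
	struct tun_struct *tun = malloc(sizeof(*tun));

	if (!tun)
		return NULL;
	tun->id = id;
	list_add_tail_node(&tun->list, &tun_alldevs_list);
	return tun;
}

/* Module-exit analogue: tear down every tracked device, return the count. */
int tun_cleanup_all(void)
{
	struct list_head *pos, *nxt;
	struct tun_struct *tun;
	int n = 0;

	for (pos = tun_alldevs_list.next; pos != &tun_alldevs_list; pos = nxt) {
		nxt = pos->next;	/* saved first: pos is freed below */
		tun = entry_of(pos, struct tun_struct, list);
		list_del_node(pos);
		free(tun);		/* kernel code would unregister the netdev here */
		n++;
	}
	assert(tun_alldevs_list.next == &tun_alldevs_list);
	return n;
}
```

As in the posted patch, the next pointer is saved before the current entry is released, so the walk stays valid while entries are removed.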
From garzik@gtf.org Thu Aug 7 17:05:15 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 17:05:23 -0700 (PDT) Received: from havoc.gtf.org (host-64-213-145-173.atlantasolutions.com [64.213.145.173] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7805DFl031771 for ; Thu, 7 Aug 2003 17:05:14 -0700 Received: by havoc.gtf.org (Postfix, from userid 500) id 649D3667A; Thu, 7 Aug 2003 20:05:08 -0400 (EDT) Date: Thu, 7 Aug 2003 20:05:08 -0400 From: Jeff Garzik To: torvalds@osdl.org Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: [bk patches] 2.6.x net driver updates Message-ID: <20030808000508.GA4464@gtf.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-archive-position: 4633 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Linus, please do a bk pull bk://kernel.bkbits.net/jgarzik/net-drivers-2.6 Others may download the patch from ftp://ftp.??.kernel.org/pub/linux/kernel/people/jgarzik/patchkits/2.6/2.6.0-test2-bk7-netdrvr1.patch.bz2 This will update the following files: CREDITS | 15 + MAINTAINERS | 8 drivers/net/arm/am79c961a.c | 7 drivers/net/arm/ether00.c | 81 ++++------ drivers/net/arm/ether1.c | 9 - drivers/net/arm/ether3.c | 7 drivers/net/arm/etherh.c | 16 + drivers/net/pcmcia/3c574_cs.c | 18 -- drivers/net/pcmcia/3c589_cs.c | 18 -- drivers/net/pcmcia/axnet_cs.c | 19 -- drivers/net/pcmcia/com20020_cs.c | 14 - drivers/net/pcmcia/fmvj18x_cs.c | 18 -- drivers/net/pcmcia/ibmtr_cs.c | 15 - drivers/net/pcmcia/nmclan_cs.c | 17 -- drivers/net/pcmcia/pcnet_cs.c | 17 -- drivers/net/pcmcia/smc91c92_cs.c | 17 -- drivers/net/pcmcia/xirc2ps_cs.c | 18 -- drivers/net/wireless/airo.c | 33 ++-- drivers/net/wireless/airo_cs.c | 22 -- drivers/net/wireless/netwave_cs.c | 20 -- drivers/net/wireless/orinoco_cs.c | 16 - drivers/net/wireless/ray_cs.c | 
22 -- drivers/net/wireless/wavelan_cs.c | 15 - drivers/net/wireless/wavelan_cs.p.h | 2 drivers/net/wireless/wl3501.h | 244 ++++++++++++++++++------------ drivers/net/wireless/wl3501_cs.c | 290 +++++++++++++++++++++++------------- 26 files changed, 534 insertions(+), 444 deletions(-) through these ChangeSets: (03/08/07 1.1130) [netdrvr airo] now that it builds, re-enable wireless_ext (03/08/07 1.1129) [netdrvr airo] Fix adhoc config (03/08/07 1.1128) [netdrvr airo] safer unload code (03/08/07 1.1127) [netdrvr airo] MIC support with newer firmware (03/08/07 1.1126) [netdrvr airo] add missing lines for Wireless Extensions 16 (03/08/07 1.1125) [netdrvr airo] MAC type changed to unsigned (03/08/07 1.1124) [netdrvr airo] Missing defines (only for documentation) (03/08/07 1.1123) [netdrvr pcmcia] remove the release timer from all pcmcia net drivers Ack'd by Russell King as well. (03/08/05 1.1106.1.11) [netdrvr ARM] alloc_etherdev updates (03/07/20 1.1046.409.66) o MAINTAINERS: add acme as wl3501 maintainer Also add Niemeyer to CREDITS for his work on early stages of wireless extensions support for the wl3501 card. (03/07/20 1.1046.409.65) o wl3501: add a first cut, lazy scan triggering for set_scan (03/07/20 1.1046.409.64) o wl3501: implement {get,set}_scan wireless extensions set_scan still needs to trigger a scan, but for now doing something that resets the card, like iwconfig eth0 mode ad-hoc triggers a scanning, and even without that we report the last scan results, good enough for now 8) But it will be implemented, don't worry! :-) (03/07/20 1.1046.409.63) o wl3501: introduce iw_mgmt_data_rset and rate labels enum (03/07/20 1.1046.409.62) o wl3501: introduce struct iw_mgmt_cf_pset Just for completeness, it is included in the mgmt frames, but not used in this driver, i.e. it may well be that this driver supports contention free service, but the original driver had no use for it at all. 
(03/07/20 1.1046.409.61) o wl3501: introduce iw_mgmt_ibss_pset (03/07/20 1.1046.409.60) o wl3501: fix bug in iw_mgmt_info_element id field and more . unfortunately we can't use enum iw_mgmt_info_element_ids for the id field in iw_mgmt_info_element, as it has to be u8 and sizeof(enum) is bigger than that, but we use the enum in the relevant functions to help catch invalid elements being used. . also we can't have iw_mgmt_info_element with a fixed size data field, as it is variable as per the 802.11 specs, so I do a poor man's OOP by subclassing iw_mgmt_info_element into the standard element types. Done up to now with iw_mgmt_essid_pset and iw_mgmt_ds_pset, others will follow. (03/07/19 1.1046.409.59) o wl3501: fix set_essid wireless extension, using the flags for any (03/07/19 1.1046.409.58) o wl3501: use iw_mgmt_info_element for phy_pset (now ds_parameter_set) Clarifying stuff is good: with this I have fixed a bug in join, where the element id and size were not being set... longstanding one, since original driver times... (03/07/19 1.1046.409.57) o wl3501: introduce iw_mgmt_info_element & associate functions and enums Also aimed at inclusion on the core wireless extensions, with this we are closer to 802.11 specs with regards to frame management elements stuff. Next patches will deal with other elements that are done in a raw way such as the phys parameter set (DS in this driver). 
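As a rough illustration of the "poor man's OOP" described in the 1.1046.409.60 entry, an element header with a one-octet id can be embedded as the first member of each concrete element type, with the enum used only in the function signature. This is a standalone sketch, not the wl3501 code: the enum constant names, ESSID_MAX_LEN, the field layout and the helper functions are assumptions made for the example. The numeric ids 0 (SSID) and 3 (DS parameter set) follow the 802.11 element numbering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ESSID_MAX_LEN 32	/* assumed maximum ESSID length */

/* Used in function signatures so the compiler can flag a bogus id. */
enum iw_mgmt_info_element_ids {
	IW_MGMT_INFO_ELEMENT_SSID = 0,
	IW_MGMT_INFO_ELEMENT_DS_PARAMETER_SET = 3,
};

/* Common header: the id is a single octet, so it cannot be the enum type. */
struct iw_mgmt_info_element {
	uint8_t id;
	uint8_t len;
};

/* "Subclasses": the header comes first, then the element-specific payload. */
struct iw_mgmt_essid_pset {
	struct iw_mgmt_info_element el;
	uint8_t essid[ESSID_MAX_LEN];
};

struct iw_mgmt_ds_pset {
	struct iw_mgmt_info_element el;
	uint8_t chan;
};

/* Fill in the common header; the enum is narrowed to u8 only here. */
void iw_set_mgmt_info_element(enum iw_mgmt_info_element_ids id,
			      struct iw_mgmt_info_element *el, uint8_t len)
{
	el->id = (uint8_t)id;
	el->len = len;
}

/* Build an ESSID element from a byte string of the given length. */
void iw_set_essid(struct iw_mgmt_essid_pset *p, const void *essid, uint8_t len)
{
	assert(len <= ESSID_MAX_LEN);
	iw_set_mgmt_info_element(IW_MGMT_INFO_ELEMENT_SSID, &p->el, len);
	memcpy(p->essid, essid, len);
}

/* Build a DS parameter set element carrying the channel number. */
void iw_set_ds_chan(struct iw_mgmt_ds_pset *p, uint8_t chan)
{
	iw_set_mgmt_info_element(IW_MGMT_INFO_ELEMENT_DS_PARAMETER_SET,
				 &p->el, sizeof(p->chan));
	p->chan = chan;
}

/* Header plus payload: the element size is variable, as the log notes. */
size_t iw_element_wire_size(const struct iw_mgmt_info_element *el)
{
	return sizeof(*el) + el->len;
}
```

Setting the id and length in one helper is also what guards against the kind of join bug mentioned in the 1.1046.409.58 entry, where those two fields were never filled in.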
From jgarzik@pobox.com Thu Aug 7 17:13:28 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 17:13:35 -0700 (PDT) Received: from www.linux.org.uk (IDENT:iDu4psLct9F2vi2WRRZDaza09B0IHN9Q@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h780DQFl032207 for ; Thu, 7 Aug 2003 17:13:27 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19kusl-0005PP-QF; Fri, 08 Aug 2003 01:13:23 +0100 Message-ID: <3F32EB18.2040801@pobox.com> Date: Thu, 07 Aug 2003 20:13:12 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Amir Noam CC: fubar@us.ibm.com, bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com Subject: Re: [PATCH 2/5] [bonding] backport 2.6 changes to 2.4 References: <200307302007.09735.amir.noam@intel.com> In-Reply-To: <200307302007.09735.amir.noam@intel.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4634 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev applied to 2.4 From jgarzik@pobox.com Thu Aug 7 17:28:34 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 17:28:39 -0700 (PDT) Received: from www.linux.org.uk (IDENT:rrdIOIvwAUxWDfPPS1dqztE3tJNYzQJq@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h780SWFl000561 for ; Thu, 7 Aug 2003 17:28:33 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19kv7P-0005Vd-Bd; Fri, 08 Aug 2003 01:28:31 +0100 Message-ID: <3F32EEA3.2000507@pobox.com> Date: Thu, 07 Aug 2003 20:28:19 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 
(X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Amir Noam CC: fubar@us.ibm.com, bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com Subject: Re: [PATCH 4/5] [bonding] backport 2.6 changes to 2.4 References: <200307302007.21369.amir.noam@intel.com> In-Reply-To: <200307302007.21369.amir.noam@intel.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4635 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Amir Noam wrote: > Backported from 2.6: > Don't dynamically allocate a net_device_stats structure for each bond, > instead allocate it with the bonding structure. > > Since they are always allocated together anyway, we might as well put > the stats struct within the bond. applied to 2.4. From garzik@gtf.org Thu Aug 7 17:59:25 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 17:59:31 -0700 (PDT) Received: from havoc.gtf.org (host-64-213-145-173.atlantasolutions.com [64.213.145.173] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h780xOFl002085 for ; Thu, 7 Aug 2003 17:59:25 -0700 Received: by havoc.gtf.org (Postfix, from userid 500) id 2FAE16643; Thu, 7 Aug 2003 20:59:19 -0400 (EDT) Date: Thu, 7 Aug 2003 20:59:19 -0400 From: Jeff Garzik To: netdev@oss.sgi.com Cc: linux-kernel@vger.kernel.org Subject: [bk patches] 2.4.x net driver updates Message-ID: <20030808005919.GA14081@gtf.org> Reply-To: netdev@oss.sgi.com Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-archive-position: 4636 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev (this will be sent to Marcelo when 2.4.23-pre1 opens) BK users: bk 
pull bk://kernel.bkbits.net/jgarzik/net-drivers-2.4 GNU diff: ftp://ftp.??.kernel.org/pub/linux/kernel/people/jgarzik/patchkits/2.4/2.4.22-rc1-netdrvr1.patch.bz2 This will update the following files: drivers/net/bonding/bond_main.c | 17 +++-------------- drivers/net/bonding/bonding.h | 2 +- drivers/net/net_init.c | 3 ++- drivers/net/wireless/airo.c | 31 +++++++++++++++++++------------ include/linux/netdevice.h | 2 ++ 5 files changed, 27 insertions(+), 28 deletions(-) through these ChangeSets: (03/08/07 1.1072) [netdrvr bonding] embed stats struct inside bonding private struct Simplification: Don't allocate the stats struct via kmalloc, embed it inside it's parent bonding_t. (03/08/07 1.1071) [net] export alloc_netdev (03/08/07 1.1070) [PATCH] Fix adhoc config (03/08/07 1.1069) [PATCH] Safer unload code (03/08/07 1.1068) [PATCH] MIC support with newer firmware (03/08/07 1.1067) [PATCH] Missing lines for Wireless Extensions 16 (03/08/07 1.1066) [netdrvr airo] MAC type changed to unsigned (03/08/07 1.1065) [netdrvr airo] Missing defines (only for documentation) From minniewkitty@yahoo.com.cn Thu Aug 7 19:55:19 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 19:55:21 -0700 (PDT) Received: from web15211.mail.bjs.yahoo.com (web15211.mail.bjs.yahoo.com [202.3.77.141]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h782tGFl015323 for ; Thu, 7 Aug 2003 19:55:18 -0700 Message-ID: <20030808025509.4328.qmail@web15211.mail.bjs.yahoo.com> Received: from [218.90.189.254] by web15211.mail.bjs.yahoo.com via HTTP; Fri, 08 Aug 2003 10:55:09 CST Date: Fri, 8 Aug 2003 10:55:09 +0800 (CST) From: =?gb2312?q?minnie=20wu?= Subject: How to improve small packet performance? 
To: netdev@oss.sgi.com MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="0-1326245407-1060311309=:1975" Content-Transfer-Encoding: 8bit X-archive-position: 4637 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: minniewkitty@yahoo.com.cn Precedence: bulk X-list: netdev --0-1326245407-1060311309=:1975 Content-Type: text/plain; charset=gb2312 Content-Transfer-Encoding: 8bit Hi, all! I'm working on e100/eepro100, kernel version 2.4.20. I used NAPI, but the 64 bit small packet dual throughput is only 28M/s. How can I improve small packet performance further? --0-1326245407-1060311309=:1975 Content-Type: text/html; charset=gb2312 Content-Transfer-Encoding: 8bit
--0-1326245407-1060311309=:1975-- From garzik@gtf.org Thu Aug 7 20:32:26 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 20:32:32 -0700 (PDT) Received: from havoc.gtf.org (host-64-213-145-173.atlantasolutions.com [64.213.145.173] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h783WPFl018945 for ; Thu, 7 Aug 2003 20:32:26 -0700 Received: by havoc.gtf.org (Postfix, from userid 500) id 9B9616643; Thu, 7 Aug 2003 23:32:19 -0400 (EDT) Date: Thu, 7 Aug 2003 23:32:19 -0400 From: Jeff Garzik To: netdev@oss.sgi.com Cc: linux-kernel@vger.kernel.org Subject: [bk patches 2.6] ethtool_ops Message-ID: <20030808033219.GA5779@gtf.org> Reply-To: netdev@oss.sgi.com Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-archive-position: 4638 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Eventually destined for Linus, after some testing, and I after I merge willy's 8139too.c conversion. This ethtool_ops update doesn't break existing drivers, allowing for piecemeal migration.
BK users: bk pull http://gkernel.bkbits.net/ethtool-2.6 GNU diff: ftp://ftp.??.kernel.org/pub/linux/kernel/people/jgarzik/patchkits/2.6/2.6.0-test2-bk7-ethtool1.patch.bz2 This will update the following files: drivers/net/tg3.c | 664 +++++++++++++++++++-------------------------- include/linux/ethtool.h | 99 ++++++ include/linux/netdevice.h | 9 net/core/Makefile | 4 net/core/dev.c | 16 - net/core/ethtool.c | 672 ++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 1076 insertions(+), 388 deletions(-) through these ChangeSets: (03/08/07 1.1119.3.3) [netdrvr] add SET_ETHTOOL_OPS back-compat hook (03/08/07 1.1119.3.2) [netdrvr tg3] convert to using ethtool_ops [also contributed by Matthew Wilcox -jg] (03/08/07 1.1119.3.1) [netdrvr] add ethtool_ops to struct net_device, and associated infrastructure Contributed by Matthew Wilcox. From davem@redhat.com Thu Aug 7 20:44:49 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 20:44:53 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h783imFl020861 for ; Thu, 7 Aug 2003 20:44:49 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id UAA30328; Thu, 7 Aug 2003 20:39:54 -0700 Date: Thu, 7 Aug 2003 20:39:53 -0700 From: "David S. 
Miller" To: netdev@oss.sgi.com Cc: jgarzik@pobox.com, linux-kernel@vger.kernel.org Subject: Re: [bk patches 2.6] ethtool_ops Message-Id: <20030807203953.5b4bbc6f.davem@redhat.com> In-Reply-To: <20030808033219.GA5779@gtf.org> References: <20030808033219.GA5779@gtf.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4639 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Thu, 7 Aug 2003 23:32:19 -0400 Jeff Garzik wrote: > ftp://ftp.??.kernel.org/pub/linux/kernel/people/jgarzik/patchkits/2.6/2.6.0-test2-bk7-ethtool1.patch.bz2 Two thumbs up :) From davem@redhat.com Thu Aug 7 21:38:57 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 21:39:01 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h784cuFl032707 for ; Thu, 7 Aug 2003 21:38:57 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id VAA30550; Thu, 7 Aug 2003 21:32:59 -0700 Date: Thu, 7 Aug 2003 21:32:59 -0700 From: "David S. 
Miller" To: Ville Nuorvala Cc: netdev@oss.sgi.com Subject: Re: [PATCH] IPV6: Fix bugs in ip6ip6_tnl_xmit() Message-Id: <20030807213259.56c953d5.davem@redhat.com> In-Reply-To: References: X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4640 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Tue, 5 Aug 2003 16:42:32 +0300 (EEST) Ville Nuorvala wrote: > There were two bugs in ip6ip6_tnl_xmit() which are fixed in this patch > (made against Linux 2.6.0-test2 + cset 1.1612): > > - ip6_tunnel must give its own getfrag function to ip6_append_data() > - fix dst leakage when encapsulated packet too big Patch applied, thank you Ville. From davem@redhat.com Thu Aug 7 21:53:04 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 21:53:09 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h784r3Fl000697 for ; Thu, 7 Aug 2003 21:53:04 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id VAA30601; Thu, 7 Aug 2003 21:47:33 -0700 Date: Thu, 7 Aug 2003 21:47:32 -0700 From: "David S. 
Miller" To: Kazunori Miyazawa Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, usagi@linux-ipv6.org, latten@austin.ibm.com Subject: Re: [PATCH][IPV6] fix clearing in ah6 input Message-Id: <20030807214732.1d16cc94.davem@redhat.com> In-Reply-To: <20030806162808.4edf9eeb.kazunori@miyazawa.org> References: <20030806162808.4edf9eeb.kazunori@miyazawa.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4641 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Wed, 6 Aug 2003 16:28:08 +0900 Kazunori Miyazawa wrote: > This patch fixes zero-clear in ah6_input. > If calling pskb_expand_head, the kernel clears wrong memory. Patch applied, thank you very much. From davem@redhat.com Thu Aug 7 22:11:09 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 22:11:13 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h785B9Fl002516 for ; Thu, 7 Aug 2003 22:11:09 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id WAA30648; Thu, 7 Aug 2003 22:05:39 -0700 Date: Thu, 7 Aug 2003 22:05:39 -0700 From: "David S. 
Miller" To: Kazunori Miyazawa Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, usagi@linux-ipv6.org, latten@austin.ibm.com Subject: Re: [PATCH][IPV6] fixed authentication error with TCP Message-Id: <20030807220539.4555db2d.davem@redhat.com> In-Reply-To: <20030806164413.669ef5f8.kazunori@miyazawa.org> References: <20030806164413.669ef5f8.kazunori@miyazawa.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4642 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Wed, 6 Aug 2003 16:44:13 +0900 Kazunori Miyazawa wrote: > Miss Joy (@IBM) and I investigated the bug that "authentication > error" occurred with using TCP and AH in IPv6. This patch fixes the > bug. This patch makes the kernel consider extension header length in > a dst. > > This patch works with my previous patch which fixes zero-clear in ah6_input. > > Please append the name "Joy Latten" into the log. I have applied this patch, thank you. But I see a small area for improvement. Look at the place inside of ip6_dst_lookup() where we do source address selection. If this fails, we mark error to dst->error. Is it correct? This 'dst' route might otherwise be perfectly fine. But now that dst->error is set, it is poisoned for other users and they are not able to use it. A similar case occurs further down after the xfrm_lookup() call, but this one I think is correct. It seems to me that it is only OK for dst->error to be set on routes that may not be used validly for anything. Alexey, do I understand this stuff correctly?
From davem@redhat.com Thu Aug 7 22:39:53 2003 Received: with ECARTIS (v1.0.0; list netdev); Thu, 07 Aug 2003 22:40:00 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h785dqFl004988 for ; Thu, 7 Aug 2003 22:39:52 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id WAA30718; Thu, 7 Aug 2003 22:34:22 -0700 Date: Thu, 7 Aug 2003 22:34:21 -0700 From: "David S. Miller" To: yoshfuji@linux-ipv6.org Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com Subject: ipv6 UDP MSG_MORE oops fix Message-Id: <20030807223421.70497d61.davem@redhat.com> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4643 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev Yoshfuji-san, I found the MSG_MORE udp bug. If "np->pending" is true, we call ip_dst_store() with uninitialized dst, it could be any kind of garbage. This is dereferenced and we crash. If np->pending, we have some 'dst' stored into the socket already. So no need to relookup dst or anything like that. This is the fix I am using, on top of Miyazawa-san's ah6_input and authentication error patches. # This is a BitKeeper generated patch for the following project: # Project Name: Linux kernel tree # This patch format is intended for GNU patch command version 2.5 or higher. 
# This patch includes the following deltas: # ChangeSet 1.1167 -> 1.1168 # net/ipv6/ip6_output.c 1.40 -> 1.41 # net/ipv6/raw.c 1.38 -> 1.39 # net/ipv6/udp.c 1.45 -> 1.46 # # The following is the BitKeeper ChangeSet Log # -------------------------------------------- # 03/08/07 davem@nuts.ninka.net 1.1168 # [IPV6]: Make sure errors propagate properly in {udp,raw} sendmsg. # -------------------------------------------- # diff -Nru a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c --- a/net/ipv6/ip6_output.c Thu Aug 7 22:34:06 2003 +++ b/net/ipv6/ip6_output.c Thu Aug 7 22:34:06 2003 @@ -209,7 +209,6 @@ int seg_len = skb->len; int hlimit; u32 mtu; - int err = 0; if (opt) { int head_room; diff -Nru a/net/ipv6/raw.c b/net/ipv6/raw.c --- a/net/ipv6/raw.c Thu Aug 7 22:34:06 2003 +++ b/net/ipv6/raw.c Thu Aug 7 22:34:06 2003 @@ -659,7 +659,7 @@ fl.oif = np->mcast_oif; dst = ip6_dst_lookup(sk, &fl); - if (dst->error) + if ((err = dst->error)) goto out; if (hlimit < 0) { diff -Nru a/net/ipv6/udp.c b/net/ipv6/udp.c --- a/net/ipv6/udp.c Thu Aug 7 22:34:06 2003 +++ b/net/ipv6/udp.c Thu Aug 7 22:34:06 2003 @@ -811,8 +811,10 @@ * The socket lock must be held while it's corked. */ lock_sock(sk); - if (likely(up->pending)) + if (likely(up->pending)) { + dst = NULL; goto do_append_data; + } release_sock(sk); } ulen += sizeof(struct udphdr); @@ -929,7 +931,7 @@ fl.oif = np->mcast_oif; dst = ip6_dst_lookup(sk, &fl); - if (dst->error) + if ((err = dst->error)) goto out; if (hlimit < 0) { @@ -968,9 +970,10 @@ else if (!corkreq) err = udp_v6_push_pending_frames(sk, up); - ip6_dst_store(sk, dst, - !ipv6_addr_cmp(&fl.fl6_dst, &np->daddr) ? - &np->daddr : NULL); + if (dst) + ip6_dst_store(sk, dst, + !ipv6_addr_cmp(&fl.fl6_dst, &np->daddr) ? + &np->daddr : NULL); if (err > 0) err = np->recverr ? 
net_xmit_errno(err) : 0; release_sock(sk); From yoshfuji@linux-ipv6.org Fri Aug 8 01:08:40 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 01:08:46 -0700 (PDT) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7888cFl017755 for ; Fri, 8 Aug 2003 01:08:39 -0700 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h7888d1M027014; Fri, 8 Aug 2003 17:08:40 +0900 Date: Fri, 08 Aug 2003 17:08:39 +0900 (JST) Message-Id: <20030808.170839.90822982.yoshfuji@linux-ipv6.org> To: davem@redhat.com Cc: kuznet@ms2.inr.ac.ru, netdev@oss.sgi.com, yoshfuji@linux-ipv6.org Subject: Re: ipv6 UDP MSG_MORE oops fix From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: <20030807223421.70497d61.davem@redhat.com> References: <20030807223421.70497d61.davem@redhat.com> Organization: USAGI Project X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4644 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev In article <20030807223421.70497d61.davem@redhat.com> (at Thu, 7 Aug 2003 22:34:21 -0700), "David S. Miller" says: > Yoshfuji-san, I found the MSG_MORE udp bug. Yes, thanks. This is what I told you before. I'm going to test if problems go away with this patch. 
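For readers following this thread, the pattern behind the fix can be sketched in plain C outside the kernel. This is only an illustration under stated assumptions: `dst_sketch`, `sock_sketch`, `dst_store_sketch` and `sendmsg_sketch` are hypothetical stand-ins, not the kernel's types or functions, and the route lookup is replaced by a pointer passed in by the caller. It shows the two halves of the change: the corked fast path sets `dst` to NULL before jumping, and the store is skipped when `dst` is NULL.

```c
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's structures. */
struct dst_sketch { int error; };
struct sock_sketch {
    int pending;                /* non-zero while the socket is corked */
    struct dst_sketch *cached;  /* route remembered by an earlier call */
    int stores;                 /* counts calls to dst_store_sketch() */
};

static void dst_store_sketch(struct sock_sketch *sk, struct dst_sketch *dst)
{
    sk->cached = dst;
    sk->stores++;
}

/*
 * Returns 0 on success or the lookup error. "looked_up" stands in for the
 * result of a route lookup; it is only touched on the non-corked path.
 */
int sendmsg_sketch(struct sock_sketch *sk, struct dst_sketch *looked_up)
{
    struct dst_sketch *dst;
    int err = 0;

    if (sk->pending) {
        /* A route was stored by the call that corked the socket,
         * so no lookup is done; dst must not be left uninitialized. */
        dst = NULL;
        goto do_append_data;
    }

    dst = looked_up;
    if ((err = dst->error))     /* propagate the lookup error */
        goto out;
    sk->pending = 1;

do_append_data:
    /* ... payload would be appended to the pending frame here ... */
    if (dst)                    /* only store a route we actually looked up */
        dst_store_sketch(sk, dst);
out:
    return err;
}
```

Calling `sendmsg_sketch()` twice on the same socket stores the route once: the second call takes the corked path and leaves the cached route alone.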
--yoshfuji From yoshfuji@linux-ipv6.org Fri Aug 8 01:45:08 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 01:45:18 -0700 (PDT) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h788j6Fl001169 for ; Fri, 8 Aug 2003 01:45:07 -0700 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h788j51M029224; Fri, 8 Aug 2003 17:45:06 +0900 Date: Fri, 08 Aug 2003 17:45:04 +0900 (JST) Message-Id: <20030808.174504.14391608.yoshfuji@linux-ipv6.org> To: jan.oravec@6com.sk Cc: netdev@oss.sgi.com, yoshfuji@linux-ipv6.org Subject: Re: problem setting net.ipvX.conf.all.forwarding via sysctl() system call From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: <20030803154427.GA12926@wsx.ksp.sk> References: <20030803154427.GA12926@wsx.ksp.sk> Organization: USAGI Project X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4645 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev In article <20030803154427.GA12926@wsx.ksp.sk> (at Sun, 3 Aug 2003 17:44:27 +0200), Jan Oravec says: > For IPv4, it is because ipv4_sysctl_forward_strategy only copy new value to > check whether it has changed and does not update ipv4_devconf.forwarding > before calling inet_forward_change(). 
(it is copied internally by sysctl > after ipv4_sysctl_forward_strategy because we return positive number) > > I am not good in kernel parallel computing strategy, whether it requires > some locking or it is safe to do: > > --- sysctl_net_ipv4.c.old 2003-08-03 17:37:44.000000000 +0200 > +++ sysctl_net_ipv4.c 2003-08-03 17:38:18.000000000 +0200 > @@ -109,8 +109,9 @@ static int ipv4_sysctl_forward_strategy( > } > } > > + ipv4_devconf.forwarding = new; > inet_forward_change(); > - return 1; > + return 0; > } > > ctl_table ipv4_table[] = { It seems correct, however, this patch cannot apply against current tree. Please resend the patch. And please make sure to diff like this: diff -u linux-2.6/net/ipv4/sysctl_net_ipv4.c.old linux-2.6/net/ipv4/sysctl_net_ipv4.c Thank you. -- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From yoshfuji@linux-ipv6.org Fri Aug 8 01:50:27 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 01:50:30 -0700 (PDT) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h788oPFl001570 for ; Fri, 8 Aug 2003 01:50:26 -0700 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h788oU1M029268; Fri, 8 Aug 2003 17:50:31 +0900 Date: Fri, 08 Aug 2003 17:50:30 +0900 (JST) Message-Id: <20030808.175030.19527061.yoshfuji@linux-ipv6.org> To: davem@redhat.com CC: yoshfuji@linux-ipv6.org, jan.oravec@6com.sk, netdev@oss.sgi.com Subject: [PATCH] IPV6: strategy hander for net.ipv6.conf.*.forwarding (is Re: problem setting net.ipvX.conf.all.forwarding via sysctl() system call) From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: <20030803154427.GA12926@wsx.ksp.sk> References: <20030803154427.GA12926@wsx.ksp.sk> Organization: USAGI Project X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 
62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4646 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev In article <20030803154427.GA12926@wsx.ksp.sk> (at Sun, 3 Aug 2003 17:44:27 +0200), Jan Oravec says: > For IPv6, it is obviously because sysctl 'strategy' handler is not defined. Here's the patch to add a strategy handler for net.ipv6.conf.*.forwarding. Thanks. Index: linux-2.6/net/ipv6/addrconf.c =================================================================== RCS file: /home/cvs/linux-2.5/net/ipv6/addrconf.c,v retrieving revision 1.48 diff -u -r1.48 addrconf.c --- linux-2.6/net/ipv6/addrconf.c 25 Jul 2003 23:58:59 -0000 1.48 +++ linux-2.6/net/ipv6/addrconf.c 8 Aug 2003 07:12:13 -0000 @@ -2593,6 +2593,48 @@ return ret; } +static int addrconf_sysctl_forward_strategy(ctl_table *table, + int *name, int nlen, + void *oldval, size_t *oldlenp, + void *newval, size_t newlen, + void **context) +{ + int *valp = table->data; + int new; + + if (!newval || !newlen) + return 0; + if (newlen != sizeof(int)) + return -EINVAL; + if (get_user(new, (int *)newval)) + return -EFAULT; + if (new == *valp) + return 0; + if (oldval && oldlenp) { + size_t len; + if (get_user(len, oldlenp)) + return -EFAULT; + if (len) { + if (len > table->maxlen) + len = table->maxlen; + if (copy_to_user(oldval, valp, len)) + return -EFAULT; + if (put_user(len, oldlenp)) + return -EFAULT; + } + } + + *valp = new; + if (valp != &ipv6_devconf.forwarding && + valp != &ipv6_devconf_dflt.forwarding) { + struct inet6_dev *idev = (struct inet6_dev 
*)table->extra1; + if (!idev) + return -ENODEV; + addrconf_forward_change(idev); + } + return 0; +} + static struct addrconf_sysctl_table { struct ctl_table_header *sysctl_header; @@ -2611,6 +2653,7 @@ .maxlen = sizeof(int), .mode = 0644, .proc_handler = &addrconf_sysctl_forward, + .strategy = &addrconf_sysctl_forward_strategy, }, { .ctl_name = NET_IPV6_HOP_LIMIT, -- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From yoshfuji@linux-ipv6.org Fri Aug 8 02:51:33 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 02:51:43 -0700 (PDT) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h789pVFl014572 for ; Fri, 8 Aug 2003 02:51:32 -0700 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h789pZ1M029766; Fri, 8 Aug 2003 18:51:35 +0900 Date: Fri, 08 Aug 2003 18:51:35 +0900 (JST) Message-Id: <20030808.185135.112441851.yoshfuji@linux-ipv6.org> To: jan.oravec@6com.sk, davem@redhat.com, netdev@oss.sgi.com Subject: Re: [PATCH] IPV6: strategy hander for net.ipv6.conf.*.forwarding (is Re: problem setting net.ipvX.conf.all.forwarding via sysctl() system call) From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: <20030808093704.GA18131@wsx.ksp.sk> References: <20030803154427.GA12926@wsx.ksp.sk> <20030808.175030.19527061.yoshfuji@linux-ipv6.org> <20030808093704.GA18131@wsx.ksp.sk> Organization: USAGI Project X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii 
Content-Transfer-Encoding: 7bit X-archive-position: 4647 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev In article <20030808093704.GA18131@wsx.ksp.sk> (at Fri, 8 Aug 2003 11:37:04 +0200), Jan Oravec says: > On Fri, Aug 08, 2003 at 05:50:30PM +0900, YOSHIFUJI Hideaki / 吉藤英明 wrote: > > > + *valp = new; > > + if (valp != &ipv6_devconf.forwarding && > > + valp != &ipv6_devconf_dflt.forwarding) { > > + struct inet6_dev *idev = (struct inet6_dev *)table->extra1; > > + if (!idev) > > + return -ENODEV; > > + addrconf_forward_change(idev); > > + } > > + return 0; > > +} > > Shouldn't we set ipv6_devconf_dflt.forwarding and call > addr_forward_change(NULL) in case that valp==&ipv6_devconf.forwarding? Oh, you're right. Here's the revised one: Index: linux-2.6/net/ipv6/addrconf.c =================================================================== RCS file: /home/cvs/linux-2.5/net/ipv6/addrconf.c,v retrieving revision 1.48 diff -u -r1.48 addrconf.c --- linux-2.6/net/ipv6/addrconf.c 25 Jul 2003 23:58:59 -0000 1.48 +++ linux-2.6/net/ipv6/addrconf.c 8 Aug 2003 08:21:56 -0000 @@ -2593,6 +2593,51 @@ return ret; } +static int addrconf_sysctl_forward_strategy(ctl_table *table, + int *name, int nlen, + void *oldval, size_t *oldlenp, + void *newval, size_t newlen, + void **context) +{ + int *valp = table->data; + int new; + + if (!newval || !newlen) + return 0; + if (newlen != sizeof(int)) + return -EINVAL; + if (get_user(new, (int *)newval)) + return -EFAULT; + if (new == *valp) + return 0; + if (oldval && oldlenp) { + size_t len; + if (get_user(len, oldlenp)) + return -EFAULT; + if (len) { + if (len > table->maxlen) + len = table->maxlen; + if (copy_to_user(oldval, valp, len)) + return -EFAULT; + if (put_user(len, oldlenp)) + return -EFAULT; + } + } + + *valp = new; + if (valp != &ipv6_devconf_dflt.forwarding) { + struct inet6_dev *idev; + if 
(valp != &ipv6_devconf.forwarding) { + idev = (struct inet6_dev *)table->extra1; + if (unlikely(idev == NULL)) + return -ENODEV; + } else + idev = NULL; + addrconf_forward_change(idev); + } + return 0; +} + static struct addrconf_sysctl_table { struct ctl_table_header *sysctl_header; @@ -2611,6 +2656,7 @@ .maxlen = sizeof(int), .mode = 0644, .proc_handler = &addrconf_sysctl_forward, + .strategy = &addrconf_sysctl_forward_strategy, }, { .ctl_name = NET_IPV6_HOP_LIMIT, -- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From wsx@6com.sk Fri Aug 8 02:52:53 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 02:53:02 -0700 (PDT) Received: from mail.6com.sk (cement.ksp.edi.fmph.uniba.sk [158.195.16.151] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h789qgFl014863 for ; Fri, 8 Aug 2003 02:52:42 -0700 Received: by mail.6com.sk (Postfix, from userid 501) id 45EF328A8E; Fri, 8 Aug 2003 11:11:24 +0200 (CEST) Date: Fri, 8 Aug 2003 11:11:24 +0200 From: Jan Oravec To: "YOSHIFUJI Hideaki / 吉藤英明" Cc: netdev@oss.sgi.com, davem@redhat.com Subject: Re: problem setting net.ipvX.conf.all.forwarding via sysctl() system call Message-ID: <20030808091124.GA17961@wsx.ksp.sk> Reply-To: Jan Oravec References: <20030803154427.GA12926@wsx.ksp.sk> <20030808.174504.14391608.yoshfuji@linux-ipv6.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030808.174504.14391608.yoshfuji@linux-ipv6.org> User-Agent: Mutt/1.4.1i X-Operating-System: UNIX X-archive-position: 4648 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jan.oravec@6com.sk Precedence: bulk X-list: netdev Hello, this is a new patch > It seems correct, however, this patch cannot apply against > current tree. Please resend the patch. 
> And please make sure to diff like this: > diff -u linux-2.6/net/ipv4/sysctl_net_ipv4.c.old linux-2.6/net/ipv4/sysctl_net_ipv4.c diff -u linux-2.6.0-test2/net/ipv4/sysctl_net_ipv4.c.old linux-2.6.0-test2/net/ipv4/sysctl_net_ipv4.c --- linux-2.6.0-test2/net/ipv4/sysctl_net_ipv4.c.old 2003-07-27 19:06:19.000000000 +0200 +++ linux-2.6.0-test2/net/ipv4/sysctl_net_ipv4.c 2003-08-08 10:56:31.000000000 +0200 @@ -109,8 +109,9 @@ } } + ipv4_devconf.forwarding = new; inet_forward_change(); - return 1; + return 0; } ctl_table ipv4_table[] = { Thanks, -- Jan Oravec XS26 coordinator 6COM s.r.o. 'Access to IPv6' http://www.6com.sk http://www.xs26.net From wsx@6com.sk Fri Aug 8 03:09:32 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 03:09:40 -0700 (PDT) Received: from mail.6com.sk (cement.ksp.edi.fmph.uniba.sk [158.195.16.151] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h78A9LFl016402 for ; Fri, 8 Aug 2003 03:09:22 -0700 Received: by mail.6com.sk (Postfix, from userid 501) id EE09528A98; Fri, 8 Aug 2003 11:37:04 +0200 (CEST) Date: Fri, 8 Aug 2003 11:37:04 +0200 From: Jan Oravec To: "YOSHIFUJI Hideaki / 吉藤英明" Cc: davem@redhat.com, netdev@oss.sgi.com Subject: Re: [PATCH] IPV6: strategy hander for net.ipv6.conf.*.forwarding (is Re: problem setting net.ipvX.conf.all.forwarding via sysctl() system call) Message-ID: <20030808093704.GA18131@wsx.ksp.sk> Reply-To: Jan Oravec References: <20030803154427.GA12926@wsx.ksp.sk> <20030808.175030.19527061.yoshfuji@linux-ipv6.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030808.175030.19527061.yoshfuji@linux-ipv6.org> User-Agent: Mutt/1.4.1i X-Operating-System: UNIX X-archive-position: 4649 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jan.oravec@6com.sk Precedence: bulk X-list: netdev On Fri, Aug 08, 2003 at 05:50:30PM +0900, YOSHIFUJI Hideaki / 吉藤英明 
wrote: > + *valp = new; > + if (valp != &ipv6_devconf.forwarding && > + valp != &ipv6_devconf_dflt.forwarding) { > + struct inet6_dev *idev = (struct inet6_dev *)table->extra1; > + if (!idev) > + return -ENODEV; > + addrconf_forward_change(idev); > + } > + return 0; > +} Shouldn't we set ipv6_devconf_dflt.forwarding and call addr_forward_change(NULL) in case that valp==&ipv6_devconf.forwarding? -- Jan Oravec XS26 coordinator 6COM s.r.o. 'Access to IPv6' http://www.6com.sk http://www.xs26.net From jchapman@katalix.com Fri Aug 8 03:27:42 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 03:27:50 -0700 (PDT) Received: from plesk.avahost.net ([216.40.206.200]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h78ARfFl017883 for ; Fri, 8 Aug 2003 03:27:42 -0700 Received: (qmail 6399 invoked by uid 10623); 8 Aug 2003 17:38:57 -0000 Received: from 212.56.89.216 ( [212.56.89.216]) as user jchapman@localhost by webmail.katalix.com with HTTP; Fri, 8 Aug 2003 12:38:57 -0500 Message-ID: <1060364337.3f33e0318bf82@webmail.katalix.com> Date: Fri, 8 Aug 2003 12:38:57 -0500 From: jchapman@katalix.com To: netdev@oss.sgi.com Subject: Re: How to improve small packet performance MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit User-Agent: Internet Messaging Program (IMP) 3.1 X-Originating-IP: 212.56.89.216 X-archive-position: 4650 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jchapman@katalix.com Precedence: bulk X-list: netdev Try the latest e100 development driver from sf.net/projects/e1000. Make sure you compile it with NAPI enabled. - jc minnie wu wrote: > Hi, all! > I'm working on e100/eepro100, kernel version 2.4.20. I used NAPI, but > the 64 bit small packet dual throughout is only 28M/s. How to improve small > packet performance further? 
------------------------------------------------- This mail sent through IMP: http://horde.org/imp/ From wsx@6com.sk Fri Aug 8 04:40:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 04:40:09 -0700 (PDT) Received: from mail.6com.sk (cement.ksp.edi.fmph.uniba.sk [158.195.16.151] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h78Be0Fl023512 for ; Fri, 8 Aug 2003 04:40:01 -0700 Received: by mail.6com.sk (Postfix, from userid 501) id 32E1020D1A; Fri, 8 Aug 2003 13:39:55 +0200 (CEST) Date: Fri, 8 Aug 2003 13:39:55 +0200 From: Jan Oravec To: "YOSHIFUJI Hideaki / 吉藤英明" Cc: davem@redhat.com, netdev@oss.sgi.com Subject: Re: [PATCH] IPV6: strategy hander for net.ipv6.conf.*.forwarding (is Re: problem setting net.ipvX.conf.all.forwarding via sysctl() system call) Message-ID: <20030808113955.GA18307@wsx.ksp.sk> Reply-To: Jan Oravec References: <20030803154427.GA12926@wsx.ksp.sk> <20030808.175030.19527061.yoshfuji@linux-ipv6.org> <20030808093704.GA18131@wsx.ksp.sk> <20030808.185135.112441851.yoshfuji@linux-ipv6.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030808.185135.112441851.yoshfuji@linux-ipv6.org> User-Agent: Mutt/1.4.1i X-Operating-System: UNIX X-archive-position: 4651 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jan.oravec@6com.sk Precedence: bulk X-list: netdev On Fri, Aug 08, 2003 at 06:51:35PM +0900, YOSHIFUJI Hideaki / 吉藤英明 wrote: > > Oh, You're right. Here's the revised one: Thanks, I tried it and it works. -- Jan Oravec XS26 coordinator 6COM s.r.o. 
'Access to IPv6' http://www.6com.sk http://www.xs26.net From yoshfuji@linux-ipv6.org Fri Aug 8 06:37:36 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 06:37:40 -0700 (PDT) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h78DbYFl007489 for ; Fri, 8 Aug 2003 06:37:35 -0700 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h78Dbg1M031881; Fri, 8 Aug 2003 22:37:43 +0900 Date: Fri, 08 Aug 2003 22:37:42 +0900 (JST) Message-Id: <20030808.223742.55652930.yoshfuji@linux-ipv6.org> To: davem@redhat.com CC: netdev@oss.sgi.com, yoshfuji@linux-ipv6.org Subject: [PATCH] IPV6: typo in include/linux/ipv6.h From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= Organization: USAGI Project X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4652 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev Hello. Typo in definition in include/linux/ipv6.h. 2.4.x has similar bug, too. Thanks. 
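As a standalone illustration of why the stray semicolon matters, here is a minimal sketch in plain C. The struct names `rt_hdr_sketch` and `rt0_sketch` and the helper `rt0_is_type0()` are hypothetical and do not reproduce the kernel's layout; only the macro itself follows the definition in question.

```c
/* Hypothetical stand-ins for the routing header structures. */
struct rt_hdr_sketch {
    unsigned char type;
};

struct rt0_sketch {
    struct rt_hdr_sketch rt_hdr;
};

/* Corrected form: no trailing semicolon, so the macro can be used
 * anywhere a member name can, including inside expressions. */
#define rt0_type rt_hdr.type

/*
 * With the old form,
 *     #define rt0_type rt_hdr.type;
 * the comparison below would expand to "h->rt_hdr.type; == 0" and fail
 * to compile. A plain assignment statement such as "h->rt0_type = 0;"
 * would still compile, since it only gains an empty statement, which is
 * how a typo like this can go unnoticed.
 */
int rt0_is_type0(const struct rt0_sketch *h)
{
    return h->rt0_type == 0;
}
```

Any use of the macro inside a condition or as a function argument exposes the problem at compile time, so the bug only bites code that does more than assign to the field.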
Index: linux-2.6/include/linux/ipv6.h =================================================================== RCS file: /home/cvs/linux-2.5/include/linux/ipv6.h,v retrieving revision 1.11 diff -u -r1.11 ipv6.h --- linux-2.6/include/linux/ipv6.h 3 Aug 2003 18:34:10 -0000 1.11 +++ linux-2.6/include/linux/ipv6.h 8 Aug 2003 12:05:41 -0000 @@ -71,7 +71,7 @@ __u32 bitmap; /* strict/loose bit map */ struct in6_addr addr[0]; -#define rt0_type rt_hdr.type; +#define rt0_type rt_hdr.type }; struct ipv6_auth_hdr { -- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From shmulik.hen@intel.com Fri Aug 8 07:45:10 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 07:45:16 -0700 (PDT) Received: from caduceus.jf.intel.com (fmr06.intel.com [134.134.136.7]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h78Ej8Fl018008 for ; Fri, 8 Aug 2003 07:45:09 -0700 Received: from talaria.jf.intel.com (talaria.jf.intel.com [10.7.209.7]) by caduceus.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h78Ecxw27752 for ; Fri, 8 Aug 2003 14:39:00 GMT Received: from orsmsxvs040.jf.intel.com (orsmsxvs040.jf.intel.com [192.168.65.206]) by talaria.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h78E86T06961 for ; Fri, 8 Aug 2003 14:08:06 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by orsmsxvs040.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080807571209908 ; Fri, 08 Aug 2003 07:57:13 -0700 Content-Type: text/plain; charset="us-ascii" From: Shmulik Hen Reply-To: shmulik.hen@intel.com Organization: Intel corp. 
Subject: [SET 2][PATCH 2/8][bonding] Propagating master's settings to slaves Date: Fri, 8 Aug 2003 17:44:58 +0300 User-Agent: KMail/1.4.3 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit To: bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com Message-Id: <200308081744.58946.shmulik.hen@intel.com> X-archive-position: 4662 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shmulik.hen@intel.com Precedence: bulk X-list: netdev 2 - Change the monitoring functions to use the new functionality. diff -Nuarp linux-2.4.22-rc1/drivers/net/bonding/bond_main.c linux-2.4.22-rc1-devel/drivers/net/bonding/bond_main.c --- linux-2.4.22-rc1/drivers/net/bonding/bond_main.c Fri Aug 8 14:03:16 2003 +++ linux-2.4.22-rc1-devel/drivers/net/bonding/bond_main.c Fri Aug 8 14:03:17 2003 @@ -2207,8 +2207,9 @@ out: static void bond_mii_monitor(struct net_device *master) { bonding_t *bond = (struct bonding *) master->priv; - slave_t *slave, *bestslave, *oldcurrent; + slave_t *slave, *oldcurrent; int slave_died = 0; + int do_failover = 0; read_lock(&bond->lock); @@ -2218,7 +2219,6 @@ static void bond_mii_monitor(struct net_ * program could monitor the link itself if needed. 
*/ - bestslave = NULL; slave = (slave_t *)bond; read_lock(&bond->ptrlock); @@ -2226,8 +2226,6 @@ static void bond_mii_monitor(struct net_ read_unlock(&bond->ptrlock); while ((slave = slave->prev) != (slave_t *)bond) { - /* use updelay+1 to match an UP slave even when updelay is 0 */ - int mindelay = updelay + 1; struct net_device *dev = slave->dev; int link_state; u16 old_speed = slave->speed; @@ -2238,14 +2236,7 @@ static void bond_mii_monitor(struct net_ switch (slave->link) { case BOND_LINK_UP: /* the link was up */ if (link_state == BMSR_LSTATUS) { - /* link stays up, tell that this one - is immediately available */ - if (IS_UP(dev) && (mindelay > -2)) { - /* -2 is the best case : - this slave was already up */ - mindelay = -2; - bestslave = slave; - } + /* link stays up, nothing more to do */ break; } else { /* link going down */ @@ -2285,6 +2276,7 @@ static void bond_mii_monitor(struct net_ (bond_mode == BOND_MODE_8023AD)) { bond_set_slave_inactive_flags(slave); } + printk(KERN_INFO "%s: link status definitely down " "for interface %s, disabling it", @@ -2301,12 +2293,10 @@ static void bond_mii_monitor(struct net_ bond_alb_handle_link_change(bond, slave, BOND_LINK_DOWN); } - write_lock(&bond->ptrlock); - if (slave == bond->current_slave) { - /* find a new interface and be verbose */ - reselect_active_interface(bond); + if (slave == oldcurrent) { + do_failover = 1; } - write_unlock(&bond->ptrlock); + slave_died = 1; } else { slave->delay--; @@ -2321,13 +2311,6 @@ static void bond_mii_monitor(struct net_ master->name, (downdelay - slave->delay) * miimon, dev->name); - - if (IS_UP(dev) && (mindelay > -1)) { - /* -1 is a good case : this slave went - down only for a short time */ - mindelay = -1; - bestslave = slave; - } } break; case BOND_LINK_DOWN: /* the link was down */ @@ -2397,26 +2380,12 @@ static void bond_mii_monitor(struct net_ bond_alb_handle_link_change(bond, slave, BOND_LINK_UP); } - write_lock(&bond->ptrlock); - if ( (bond->primary_slave != NULL) - 
&& (slave == bond->primary_slave) ) - reselect_active_interface(bond); - write_unlock(&bond->ptrlock); - } - else + if ((oldcurrent == NULL) || + (slave == bond->primary_slave)) { + do_failover = 1; + } + } else { slave->delay--; - - /* we'll also look for the mostly eligible slave */ - if (bond->primary_slave == NULL) { - if (IS_UP(dev) && (slave->delay < mindelay)) { - mindelay = slave->delay; - bestslave = slave; - } - } else if ( (IS_UP(bond->primary_slave->dev)) || - ( (!IS_UP(bond->primary_slave->dev)) && - (IS_UP(dev) && (slave->delay < mindelay)) ) ) { - mindelay = slave->delay; - bestslave = slave; } } break; @@ -2435,26 +2404,17 @@ static void bond_mii_monitor(struct net_ } /* end of while */ - /* - * if there's no active interface and we discovered that one - * of the slaves could be activated earlier, so we do it. - */ - read_lock(&bond->ptrlock); - oldcurrent = bond->current_slave; - read_unlock(&bond->ptrlock); + if (do_failover) { + write_lock(&bond->ptrlock); - /* no active interface at the moment or need to bring up the primary */ - if (oldcurrent == NULL) { /* no active interface at the moment */ - if (bestslave != NULL) { /* last chance to find one ? 
*/ - write_lock(&bond->ptrlock); - change_active_interface(bond, bestslave); - write_unlock(&bond->ptrlock); - } else if (slave_died) { - /* print this message only once a slave has just died */ + reselect_active_interface(bond); + if (oldcurrent && !bond->current_slave) { printk(KERN_INFO "%s: now running without any active interface !\n", master->name); } + + write_unlock(&bond->ptrlock); } read_unlock(&bond->lock); @@ -2472,9 +2432,10 @@ static void bond_mii_monitor(struct net_ static void loadbalance_arp_monitor(struct net_device *master) { bonding_t *bond; - slave_t *slave; + slave_t *slave, *oldcurrent; int the_delta_in_ticks = arp_interval * HZ / 1000; int next_timer = jiffies + (arp_interval * HZ / 1000); + int do_failover = 0; bond = (struct bonding *) master->priv; if (master->priv == NULL) { @@ -2498,6 +2459,10 @@ static void loadbalance_arp_monitor(stru read_lock(&bond->lock); + read_lock(&bond->ptrlock); + oldcurrent = bond->current_slave; + read_unlock(&bond->ptrlock); + /* see if any of the previous devices are up now (i.e. they have * xmt and rcv traffic). the current_slave does not come into * the picture unless it is null. also, slave->jiffies is not needed @@ -2524,21 +2489,19 @@ static void loadbalance_arp_monitor(stru * current_slave being null after enslaving * is closed. 
*/ - write_lock(&bond->ptrlock); - if (bond->current_slave == NULL) { + if (oldcurrent == NULL) { printk(KERN_INFO "%s: link status definitely up " "for interface %s, ", master->name, slave->dev->name); - reselect_active_interface(bond); + do_failover = 1; } else { printk(KERN_INFO "%s: interface %s is now up\n", master->name, slave->dev->name); } - write_unlock(&bond->ptrlock); } } else { /* slave->link == BOND_LINK_UP */ @@ -2561,11 +2524,9 @@ static void loadbalance_arp_monitor(stru master->name, slave->dev->name); - write_lock(&bond->ptrlock); - if (slave == bond->current_slave) { - reselect_active_interface(bond); + if (slave == oldcurrent) { + do_failover = 1; } - write_unlock(&bond->ptrlock); } } @@ -2579,6 +2540,19 @@ static void loadbalance_arp_monitor(stru if (IS_UP(slave->dev)) { arp_send_all(slave); } + } + + if (do_failover) { + write_lock(&bond->ptrlock); + + reselect_active_interface(bond); + if (oldcurrent && !bond->current_slave) { + printk(KERN_INFO + "%s: now running without any active interface !\n", + master->name); + } + + write_unlock(&bond->ptrlock); } read_unlock(&bond->lock); -- | Shmulik Hen Advanced Network Services | | Israel Design Center, Jerusalem | | LAN Access Division, Platform Networking | | Intel Communications Group, Intel corp. 
| From garzik@gtf.org Fri Aug 8 10:39:42 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 10:39:55 -0700 (PDT) Received: from havoc.gtf.org (host-64-213-145-173.atlantasolutions.com [64.213.145.173] (may be forged)) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h78HdfFl026326 for ; Fri, 8 Aug 2003 10:39:42 -0700 Received: by havoc.gtf.org (Postfix, from userid 500) id 12F87667E; Fri, 8 Aug 2003 13:39:33 -0400 (EDT) Date: Fri, 8 Aug 2003 13:39:33 -0400 From: Jeff Garzik To: netdev@oss.sgi.com Subject: RFR: new SiS gige driver Message-ID: <20030808173932.GA4077@gtf.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.3.28i X-archive-position: 4678 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev

Every so often, new vendor drivers are submitted to me for inclusion in the Linux kernel. In an attempt to obtain greater pre-merge peer review, I'm going to start posting some of them here on netdev.

This driver is actually a lot cleaner than many that crop up. So far most of my nits are minor:

* it needs a copyright/license blurb at the top
* should not wait for autoneg to complete in ->open
* worst-case delay in smdio_write is very long... try moving most phy interaction to process context, or examining if delays exist only to paper over PCI posting (for example) bugs
* should use netif_carrier_xxx (required for bonding and linkwatch, these days)
* remove #if 0'd code, commented-out code ("// ..."), and other dead code
* since it's gige, it should definitely be using NAPI

/* ========================================================================= SiS190.c: A SiS190 Gigabit Ethernet driver for Linux kernel 2.6.x. -------------------------------------------------------------------- History: Aug 7 2003 - created initially by K.M. Liu . 
========================================================================= VERSION 1.0 <2003/8/7> Initial by K.M. Liu, test 100bps Full in 2.6.0 O.K. 1.1 <2003/8/8> Add mode detection. The bit4:0 of MII register 4 is called "selector field", and have to be 00001b to indicate support of IEEE std 802.3 during NWay process of exchanging Link Code Word (FLP). */ #include #include #include #include #include #include #include #include #define SiS190_VERSION "1.1" #define MODULENAME "SiS190" #define SiS190_DRIVER_NAME MODULENAME " Gigabit Ethernet driver " SiS190_VERSION #define PFX MODULENAME ": " #ifdef SiS190_DEBUG #define assert(expr) \ if(!(expr)) { \ printk( "Assertion failed! %s,%s,%s,line=%d\n", \ #expr,__FILE__,__FUNCTION__,__LINE__); \ } #else #define assert(expr) do {} while (0) #endif /* media options */ #define MAX_UNITS 8 /* Maximum events (Rx packets, etc.) to handle at each interrupt. */ static int max_interrupt_work = 20; /* Maximum number of multicast addresses to filter (vs. Rx-all-multicast). The chips use a 64 element hash table based on the Ethernet CRC. */ static int multicast_filter_limit = 32; /* MAC address length*/ #define MAC_ADDR_LEN 6 /* max supported gigabit ethernet frame size -- must be at least (dev->mtu+14+4).*/ #define MAX_ETH_FRAME_SIZE 1536 #define TX_FIFO_THRESH 256 /* In bytes */ #define RX_FIFO_THRESH 7 /* 7 means NO threshold, Rx buffer level before first PCI xfer. 
*/ #define RX_DMA_BURST 6 /* Maximum PCI burst, '6' is 1024 */ #define TX_DMA_BURST 6 /* Maximum PCI burst, '6' is 1024 */ #define EarlyTxThld 0x3F /* 0x3F means NO early transmit */ #define RxPacketMaxSize 0x0800 /* Maximum size supported is 16K-1 */ #define InterFrameGap 0x03 /* 3 means InterFrameGap = the shortest one */ #define NUM_TX_DESC 64 /* Number of Tx descriptor registers */ #define NUM_RX_DESC 64 /* Number of Rx descriptor registers */ #define RX_BUF_SIZE 1536 /* Rx Buffer size */ #define SiS190_MIN_IO_SIZE 0x80 #define TX_TIMEOUT (6*HZ) /* enhanced PHY access register bit definitions */ #define EhnMIIread 0x0000 #define EhnMIIwrite 0x0020 #define EhnMIIdataShift 16 #define EhnMIIpmdShift 6 /* 7016 only */ #define EhnMIIregShift 11 #define EhnMIIreq 0x0010 #define EhnMIInotDone 0x0010 //------------------------------------------------------------------------- // Bit Mask definitions //------------------------------------------------------------------------- #define BIT_0 0x0001 #define BIT_1 0x0002 #define BIT_2 0x0004 #define BIT_3 0x0008 #define BIT_4 0x0010 #define BIT_5 0x0020 #define BIT_6 0x0040 #define BIT_7 0x0080 #define BIT_8 0x0100 #define BIT_9 0x0200 #define BIT_10 0x0400 #define BIT_11 0x0800 #define BIT_12 0x1000 #define BIT_13 0x2000 #define BIT_14 0x4000 #define BIT_15 0x8000 #define BIT_16 0x10000 #define BIT_17 0x20000 #define BIT_18 0x40000 #define BIT_19 0x80000 #define BIT_20 0x100000 #define BIT_21 0x200000 #define BIT_22 0x400000 #define BIT_23 0x800000 #define BIT_24 0x1000000 #define BIT_25 0x2000000 #define BIT_26 0x04000000 #define BIT_27 0x08000000 #define BIT_28 0x10000000 #define BIT_29 0x20000000 #define BIT_30 0x40000000 #define BIT_31 0x80000000 /* write/read MMIO register */ #define SiS_W8(reg, val8) writeb ((val8), ioaddr + (reg)) #define SiS_W16(reg, val16) writew ((val16), ioaddr + (reg)) #define SiS_W32(reg, val32) writel ((val32), ioaddr + (reg)) #define SiS_R8(reg) readb (ioaddr + (reg)) #define SiS_R16(reg) 
readw (ioaddr + (reg)) #define SiS_R32(reg) ((unsigned long) readl (ioaddr + (reg))) static struct { const char *name; } board_info[] __devinitdata = { { "SiS190 Gigabit Ethernet"},}; static struct pci_device_id sis190_pci_tbl[] __devinitdata = { {0x1039, 0x0190, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0}, {0,}, }; MODULE_DEVICE_TABLE(pci, sis190_pci_tbl); enum SiS190_registers { TxControl=0x0, TxDescStartAddr =0x4, TxNextDescAddr =0x0c, RxControl=0x10, RxDescStartAddr =0x14, RxNextDescAddr =0x1c, IntrStatus = 0x20, IntrMask = 0x24, IntrControl = 0x28, IntrTimer = 0x2c, PMControl = 0x30, ROMControl=0x38, ROMInterface=0x3c, StationControl=0x40, GMIIControl=0x44, TxMacControl=0x50, RxMacControl=0x60, RxMacAddr=0x62, RxHashTable=0x68, RxWakeOnLan=0x70, RxMPSControl=0x78, }; enum sis190_register_content { /*InterruptStatusBits */ SoftInt=0x40000000, Timeup=0x20000000, PauseFrame = 0x80000, MagicPacket=0x40000, WakeupFrame = 0x20000, LinkChange = 0x10000, RxQEmpty = 0x80, RxQInt = 0x40, TxQ1Empty = 0x20, TxQ1Int = 0x10, TxQ0Empty = 0x08, TxQ0Int = 0x04, RxHalt = 0x02, TxHalt=0x01, /*RxStatusDesc */ RxRES = 0x00200000, RxCRC = 0x00080000, RxRUNT = 0x00100000, RxRWT = 0x00400000, /*ChipCmdBits */ CmdReset = 0x10, CmdRxEnb = 0x08, CmdTxEnb = 0x01, RxBufEmpty = 0x01, /*Cfg9346Bits */ Cfg9346_Lock = 0x00, Cfg9346_Unlock = 0xC0, /*rx_mode_bits */ AcceptErr = 0x20, AcceptRunt = 0x10, AcceptBroadcast = 0x0800, AcceptMulticast = 0x0400, AcceptMyPhys = 0x0200, AcceptAllPhys = 0x0100, /*RxConfigBits */ RxCfgFIFOShift = 13, RxCfgDMAShift = 8, /*TxConfigBits */ TxInterFrameGapShift = 24, TxDMAShift = 8, /* DMA burst value (0-7) is shift this many bits */ /*_PHYstatus */ TBI_Enable = 0x80, TxFlowCtrl = 0x40, RxFlowCtrl = 0x20, _1000bpsF = 0x1c, _1000bpsH = 0x0c, _100bpsF = 0x18, _100bpsH = 0x08, _10bpsF = 0x14, _10bpsH = 0x04, LinkStatus = 0x02, FullDup = 0x01, /*GIGABIT_PHY_registers */ PHY_CTRL_REG = 0, PHY_STAT_REG = 1, PHY_AUTO_NEGO_REG = 4, PHY_1000_CTRL_REG = 9, /*GIGABIT_PHY_REG_BIT */ 
PHY_Restart_Auto_Nego = 0x0200, PHY_Enable_Auto_Nego = 0x1000, //PHY_STAT_REG = 1; PHY_Auto_Neco_Comp = 0x0020, //PHY_AUTO_NEGO_REG = 4; PHY_Cap_10_Half = 0x0020, PHY_Cap_10_Full = 0x0040, PHY_Cap_100_Half = 0x0080, PHY_Cap_100_Full = 0x0100, //PHY_1000_CTRL_REG = 9; PHY_Cap_1000_Full = 0x0200, PHY_Cap_Null = 0x0, /*_MediaType*/ _10_Half = 0x01, _10_Full = 0x02, _100_Half = 0x04, _100_Full = 0x08, _1000_Full = 0x10, /*_TBICSRBit*/ TBILinkOK = 0x02000000, }; const static struct { const char *name; u8 version; /* depend on docs */ u32 RxConfigMask; /* should clear the bits supported by this chip */ } sis_chip_info[] = { { "SiS-0190", 0x00, 0xff7e1880,},}; enum _DescStatusBit { OWNbit = 0x80000000, INTbit = 0x40000000, DEFbit = 0x200000, CRCbit = 0x20000, PADbit=0x10000, ENDbit=0x80000000, }; struct TxDesc { u32 PSize; u32 status; u32 buf_addr; u32 buf_Len; }; struct RxDesc { u32 PSize; u32 status; u32 buf_addr; u32 buf_Len; }; struct sis190_private { void *mmio_addr; /* memory map physical address */ struct pci_dev *pci_dev; /* Index of PCI device */ struct net_device_stats stats; /* statistics of net device */ spinlock_t lock; /* spin lock flag */ int chipset; unsigned long cur_rx; /* Index into the Rx descriptor buffer of next Rx pkt. */ unsigned long cur_tx; /* Index into the Tx descriptor buffer of next Rx pkt. */ unsigned long dirty_tx; unsigned char *TxDescArrays; /* Index of Tx Descriptor buffer */ unsigned char *RxDescArrays; /* Index of Rx Descriptor buffer */ struct TxDesc *TxDescArray; /* Index of 256-alignment Tx Descriptor buffer */ struct RxDesc *RxDescArray; /* Index of 256-alignment Rx Descriptor buffer */ unsigned char *RxBufferRings; /* Index of Rx Buffer */ unsigned char *RxBufferRing[NUM_RX_DESC]; /* Index of Rx Buffer array */ struct sk_buff *Tx_skbuff[NUM_TX_DESC]; /* Index of Transmit data buffer */ }; MODULE_AUTHOR("K.M. 
Liu999) printk(KERN_ERR PFX "Phy write Error!!!\n"); } int smdio_read(void *ioaddr, int RegAddr) { u32 l; u16 i; u32 pmd; pmd=1; l=0; l = EhnMIIread |EhnMIIreq | (((u32)RegAddr)<999) printk(KERN_ERR PFX "Phy Read Error!!!\n"); } l=SiS_R32(GMIIControl); return( (u16) ( l>>EhnMIIdataShift ) ); } int ReadEEprom(void *ioaddr, u32 RegAddr) { u16 data; u32 i; u32 ulValue; if(!(SiS_R32(ROMControl)&BIT_1)) { return 0; } ulValue = (BIT_7 | (0x2 << 8) | (RegAddr << 10)); SiS_W32(ROMInterface, ulValue); for(i=0 ; i < 200; i++) { if(!(SiS_R32(ROMInterface)& BIT_7)) break; udelay(1000); } data = (u16)((SiS_R32(ROMInterface) & 0xffff0000) >> 16); return data; } static int __devinit SiS190_init_board(struct pci_dev *pdev, struct net_device **dev_out, void **ioaddr_out) { void *ioaddr = NULL; struct net_device *dev; struct sis190_private *tp; u16 rc; unsigned long mmio_start, mmio_end, mmio_flags, mmio_len; // u32 tmp; assert(pdev != NULL); assert(ioaddr_out != NULL); *ioaddr_out = NULL; *dev_out = NULL; // dev zeroed in init_etherdev // printk("SiS190_init_board\n"); dev = alloc_etherdev(sizeof (*tp)); // dev = init_etherdev(NULL, sizeof (*tp)); if (dev == NULL) { printk(KERN_ERR PFX "unable to alloc new ethernet\n"); return -ENOMEM; } SET_MODULE_OWNER(dev); SET_NETDEV_DEV(dev, &pdev->dev); tp = dev->priv; // SET_MODULE_OWNER(dev); // tp = dev->priv; // enable device (incl. 
PCI PM wakeup and hotplug setup) rc = pci_enable_device(pdev); if (rc) goto err_out; mmio_start = pci_resource_start(pdev, 0); mmio_end = pci_resource_end(pdev, 0); mmio_flags = pci_resource_flags(pdev, 0); mmio_len = pci_resource_len(pdev, 0); // make sure PCI base addr 0 is MMIO if (!(mmio_flags & IORESOURCE_MEM)) { printk(KERN_ERR PFX "region #0 not an MMIO resource, aborting\n"); rc = -ENODEV; goto err_out; } // check for weird/broken PCI region reporting if (mmio_len < SiS190_MIN_IO_SIZE) { printk(KERN_ERR PFX "Invalid PCI region size(s), aborting\n"); rc = -ENODEV; goto err_out; } rc = pci_request_regions(pdev, dev->name); if (rc) goto err_out; // enable PCI bus-mastering pci_set_master(pdev); // ioremap MMIO region ioaddr = ioremap(mmio_start, mmio_len); if (ioaddr == NULL) { printk(KERN_ERR PFX "cannot remap MMIO, aborting\n"); rc = -EIO; goto err_out_free_res; } // Soft reset the chip. //_W8(ChipCmd, CmdReset); SiS_W32(IntrControl,0x8000); udelay(1000); SiS_W32(IntrControl,0x0); SiS_W32(TxControl,0x1a00); SiS_W32(RxControl,0x1a00); udelay(1000); //match: *ioaddr_out = ioaddr; *dev_out = dev; return 0; err_out_free_res: pci_release_regions(pdev); err_out: // err_out_disable: pci_disable_device(pdev); unregister_netdev(dev); kfree(dev); return rc; } static int __devinit SiS190_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) { struct net_device *dev = NULL; struct sis190_private *tp = NULL; void *ioaddr = NULL; static int board_idx = -1; static int printed_version = 0; int i,rc; u16 reg31; // int option = -1, Cap10_100 = 0, Cap1000 = 0; assert(pdev != NULL); assert(ent != NULL); board_idx++; // printk("SiS190_init_one! 
\n"); if (!printed_version) { printk(KERN_INFO SiS190_DRIVER_NAME " loaded\n"); printed_version = 1; } i = SiS190_init_board(pdev, &dev, &ioaddr); if (i < 0) { return i; } tp = dev->priv; assert(ioaddr != NULL); assert(dev != NULL); assert(tp != NULL); // Get MAC address // // Read node address from the EEPROM if(SiS_R32(ROMControl)&0x2){ for (i=0; i< 6; i += 2){ SiS_W16(RxMacAddr+i,ReadEEprom(ioaddr, 3 + (i/2))); } }else{ SiS_W32(RxMacAddr,0x11111100); //If 9346 does not exist SiS_W32(RxMacAddr+2,0x00111111); } for (i = 0; i < MAC_ADDR_LEN; i++) { dev->dev_addr[i] = SiS_R8(RxMacAddr+i); printk("SiS_R8(RxMacAddr+%x)= %x ",i,SiS_R8(RxMacAddr+i)); } dev->open = SiS190_open; dev->hard_start_xmit = SiS190_start_xmit; dev->get_stats = SiS190_get_stats; dev->stop = SiS190_close; dev->tx_timeout = SiS190_tx_timeout; dev->set_multicast_list = SiS190_set_rx_mode; dev->watchdog_timeo = TX_TIMEOUT; dev->irq = pdev->irq; dev->base_addr = (unsigned long) ioaddr; // dev->do_ioctl = mii_ioctl; tp = dev->priv; // private data // tp->pci_dev = pdev; tp->mmio_addr = ioaddr; printk(KERN_DEBUG "%s: Identified chip type is '%s'.\n", dev->name, sis_chip_info[tp->chipset].name); spin_lock_init(&tp->lock); rc = register_netdev(dev); if (rc) { iounmap(ioaddr); pci_release_regions(pdev); pci_disable_device(pdev); kfree(dev); return rc; } printk(KERN_DEBUG "%s: Identified chip type is '%s'.\n", dev->name, sis_chip_info[tp->chipset].name); pci_set_drvdata(pdev, dev); printk(KERN_INFO "%s: %s at 0x%lx, " "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x, " "IRQ %d\n", dev->name, board_info[ent->driver_data].name, dev->base_addr, dev->dev_addr[0], dev->dev_addr[1], dev->dev_addr[2], dev->dev_addr[3], dev->dev_addr[4], dev->dev_addr[5], dev->irq); int val = smdio_read(ioaddr, PHY_AUTO_NEGO_REG); printk(KERN_INFO "%s: Auto-negotiation Enabled.\n", dev->name); // enable 10/100 Full/Half Mode, leave PHY_AUTO_NEGO_REG bit4:0 unchanged smdio_write(ioaddr, PHY_AUTO_NEGO_REG, PHY_Cap_10_Half | PHY_Cap_10_Full | 
PHY_Cap_100_Half | PHY_Cap_100_Full | (val & 0x1F)); // enable 1000 Full Mode smdio_write(ioaddr, PHY_1000_CTRL_REG, PHY_Cap_1000_Full); // Enable auto-negotiation and restart auto-nigotiation smdio_write(ioaddr, PHY_CTRL_REG, PHY_Enable_Auto_Nego | PHY_Restart_Auto_Nego); udelay(100); // wait for auto-negotiation process for (i = 10000; i > 0; i--) { //check if auto-negotiation complete if (smdio_read(ioaddr, PHY_STAT_REG) & PHY_Auto_Neco_Comp) { udelay(100); reg31=smdio_read(ioaddr,31); reg31 &= 0x1c; //bit 4:2 switch(reg31){ case _1000bpsF: SiS_W16( 0x40, 0x1c01); printk("SiS190 Link on 1000 bps Full Duplex mode. \n"); break; case _1000bpsH: SiS_W16( 0x40, 0x0c01); printk("SiS190 Link on 1000 bps Half Duplex mode. \n"); break; case _100bpsF: SiS_W16( 0x40, 0x1801); printk("SiS190 Link on 100 bps Full Duplex mode. \n"); break; case _100bpsH: SiS_W16( 0x40, 0x0801); printk("SiS190 Link on 100 bps Half Duplex mode. \n"); break; case _10bpsF: SiS_W16( 0x40, 0x1401); printk("SiS190 Link on 10 bps Full Duplex mode. \n"); break; case _10bpsH: SiS_W16( 0x40, 0x0401); printk("SiS190 Link on 10 bps Half Duplex mode. \n"); break; default: printk(KERN_ERR PFX "Error! SiS190 Can not detect mode !!! 
\n"); break; } break; } else { udelay(100); } } // end for-loop to wait for auto-negotiation process return 0; } static void __devexit SiS190_remove_one(struct pci_dev *pdev) { struct net_device *dev = pci_get_drvdata(pdev); struct sis190_private *tp = (struct sis190_private *) (dev->priv); assert(dev != NULL); assert(tp != NULL); unregister_netdev(dev); iounmap(tp->mmio_addr); pci_release_regions(pdev); // poison memory before freeing memset(dev, 0xBC, sizeof (struct net_device) + sizeof (struct sis190_private)); kfree(dev); pci_set_drvdata(pdev, NULL); } static int SiS190_open(struct net_device *dev) { struct sis190_private *tp = dev->priv; int retval; u8 diff; u32 TxPhyAddr, RxPhyAddr; retval = request_irq(dev->irq, SiS190_interrupt, SA_SHIRQ, dev->name, dev); if (retval) { return retval; } tp->TxDescArrays = kmalloc(NUM_TX_DESC * sizeof (struct TxDesc) + 256, GFP_KERNEL); // Tx Desscriptor needs 256 bytes alignment; TxPhyAddr = virt_to_bus(tp->TxDescArrays); diff = 256 - (TxPhyAddr - ((TxPhyAddr >> 8) << 8)); TxPhyAddr += diff; tp->TxDescArray = (struct TxDesc *) (tp->TxDescArrays + diff); tp->RxDescArrays = kmalloc(NUM_RX_DESC * sizeof (struct RxDesc) + 256, GFP_KERNEL); // Rx Desscriptor needs 256 bytes alignment; RxPhyAddr = virt_to_bus(tp->RxDescArrays); diff = 256 - (RxPhyAddr - ((RxPhyAddr >> 8) << 8)); RxPhyAddr += diff; tp->RxDescArray = (struct RxDesc *) (tp->RxDescArrays + diff); if (tp->TxDescArrays == NULL || tp->RxDescArrays == NULL) { printk(KERN_INFO "Allocate RxDescArray or TxDescArray failed\n"); free_irq(dev->irq, dev); if (tp->TxDescArrays) kfree(tp->TxDescArrays); if (tp->RxDescArrays) kfree(tp->RxDescArrays); return -ENOMEM; } tp->RxBufferRings = kmalloc(RX_BUF_SIZE * NUM_RX_DESC, GFP_KERNEL); if (tp->RxBufferRings == NULL) { printk(KERN_INFO "Allocate RxBufferRing failed\n"); } SiS190_init_ring(dev); SiS190_hw_start(dev); return 0; } static void SiS190_hw_start(struct net_device *dev) { struct sis190_private *tp = dev->priv; void *ioaddr = 
tp->mmio_addr; /* Soft reset the chip. */ //_W8(ChipCmd, CmdReset); SiS_W32(IntrControl,0x8000); udelay(1000); SiS_W32(IntrControl,0x0); SiS_W32( 0x0, 0x01a00); SiS_W32( 0x4, virt_to_bus(tp->TxDescArray)); SiS_W32( 0x10, 0x1a00); SiS_W32( 0x14, virt_to_bus(tp->RxDescArray)); SiS_W32( 0x20, 0xffffffff); SiS_W32( 0x24, 0x0); SiS_W16( 0x40, 0x1901); //default is 100Mbps SiS_W32( 0x44, 0x0); SiS_W32( 0x50, 0x60); SiS_W16( 0x60, 0x02); SiS_W32( 0x68, 0x0); SiS_W32( 0x6c, 0x0); SiS_W32( 0x70, 0x0); SiS_W32( 0x74, 0x0); // Set Rx Config register tp->cur_rx = 0; udelay(10); SiS190_set_rx_mode(dev); /* Enable all known interrupts by setting the interrupt mask. */ SiS_W32(IntrMask, sis190_intr_mask); SiS_W32( 0x0, 0x1a01); SiS_W32( 0x10, 0x1a1d); netif_start_queue(dev); } static void SiS190_init_ring(struct net_device *dev) { struct sis190_private *tp = dev->priv; int i; tp->cur_rx = 0; tp->cur_tx = 0; tp->dirty_tx = 0; memset(tp->TxDescArray, 0x0, NUM_TX_DESC * sizeof (struct TxDesc)); memset(tp->RxDescArray, 0x0, NUM_RX_DESC * sizeof (struct RxDesc)); for (i = 0; i < NUM_TX_DESC; i++) { tp->Tx_skbuff[i] = NULL; } for (i = 0; i < NUM_RX_DESC; i++) { tp->RxDescArray[i].PSize = 0x0; if (i == (NUM_RX_DESC - 1)) tp->RxDescArray[i].buf_Len = BIT_31 + RX_BUF_SIZE; //bit 31 is End bit else tp->RxDescArray[i].buf_Len = RX_BUF_SIZE; tp->RxBufferRing[i] = &(tp->RxBufferRings[i * RX_BUF_SIZE]); tp->RxDescArray[i].buf_addr = virt_to_bus(tp->RxBufferRing[i]); tp->RxDescArray[i].status = OWNbit | INTbit; } } static void SiS190_tx_clear(struct sis190_private *tp) { int i; tp->cur_tx = 0; for (i = 0; i < NUM_TX_DESC; i++) { if (tp->Tx_skbuff[i] != NULL) { dev_kfree_skb(tp->Tx_skbuff[i]); tp->Tx_skbuff[i] = NULL; tp->stats.tx_dropped++; } } } static void SiS190_tx_timeout(struct net_device *dev) { struct sis190_private *tp = dev->priv; void *ioaddr = tp->mmio_addr; u8 tmp8; /* disable Tx, if not already */ tmp8 = SiS_R8(TxControl); if (tmp8 & CmdTxEnb) SiS_W8(TxControl, tmp8 & ~CmdTxEnb); 
/* Disable interrupts by clearing the interrupt mask. */ SiS_W32(IntrMask, 0x0000); /* Stop a shared interrupt from scavenging while we are. */ spin_lock_irq(&tp->lock); SiS190_tx_clear(tp); spin_unlock_irq(&tp->lock); /* ...and finally, reset everything */ SiS190_hw_start(dev); netif_wake_queue(dev); } static int SiS190_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct sis190_private *tp = dev->priv; void *ioaddr = tp->mmio_addr; int entry = tp->cur_tx % NUM_TX_DESC; if (skb->len < ETH_ZLEN) { skb = skb_padto(skb, ETH_ZLEN); if (skb == NULL) return 0; } spin_lock_irq(&tp->lock); if ((tp->TxDescArray[entry].status & OWNbit) == 0) { tp->Tx_skbuff[entry] = skb; tp->TxDescArray[entry].buf_addr = virt_to_bus(skb->data); tp->TxDescArray[entry].PSize = ((skb->len > ETH_ZLEN) ? skb->len : ETH_ZLEN); if (entry != (NUM_TX_DESC - 1)){ tp->TxDescArray[entry].buf_Len = tp->TxDescArray[entry].PSize; }else{ tp->TxDescArray[entry].buf_Len = tp->TxDescArray[entry].PSize|ENDbit; } tp->TxDescArray[entry].status |= (OWNbit | INTbit | DEFbit |CRCbit |PADbit); //_W8(TxPoll, 0x40); //set polling bit SiS_W32(TxControl,0x1a11); //Start Send dev->trans_start = jiffies; tp->cur_tx++; } spin_unlock_irq(&tp->lock); if ((tp->cur_tx - NUM_TX_DESC) == tp->dirty_tx) { netif_stop_queue(dev); } return 0; } static void SiS190_tx_interrupt(struct net_device *dev, struct sis190_private *tp, void *ioaddr) { unsigned long dirty_tx, tx_left = 0; int entry = tp->cur_tx % NUM_TX_DESC; assert(dev != NULL); assert(tp != NULL); assert(ioaddr != NULL); dirty_tx = tp->dirty_tx; tx_left = tp->cur_tx - dirty_tx; while (tx_left > 0) { if ((tp->TxDescArray[entry].status & OWNbit) == 0) { dev_kfree_skb_irq(tp-> Tx_skbuff[dirty_tx % NUM_TX_DESC]); tp->Tx_skbuff[dirty_tx % NUM_TX_DESC] = NULL; tp->stats.tx_packets++; dirty_tx++; tx_left--; entry++; } } if (tp->dirty_tx != dirty_tx) { tp->dirty_tx = dirty_tx; if (netif_queue_stopped(dev)) netif_wake_queue(dev); } } static void SiS190_rx_interrupt(struct 
net_device *dev, struct sis190_private *tp, void *ioaddr) { int cur_rx; struct sk_buff *skb; int pkt_size = 0; assert(dev != NULL); assert(tp != NULL); assert(ioaddr != NULL); cur_rx = tp->cur_rx; while ((tp->RxDescArray[cur_rx].status & OWNbit) == 0) { if (tp->RxDescArray[cur_rx].PSize & 0x0080000) { printk(KERN_INFO "%s: Rx ERROR!!!\n", dev->name); tp->stats.rx_errors++; tp->stats.rx_length_errors++; }else if(!(tp->RxDescArray[cur_rx].PSize & 0x0010000)){ printk(KERN_INFO "%s: Rx ERROR!!!\n", dev->name); tp->stats.rx_errors++; tp->stats.rx_crc_errors++; } else { pkt_size = (int) (tp->RxDescArray[cur_rx]. PSize & 0x0000FFFF) - 4; skb = dev_alloc_skb(pkt_size + 2); if (skb != NULL) { skb->dev = dev; skb_reserve(skb, 2); // 16 byte align the IP fields. // eth_copy_and_sum(skb, tp->RxBufferRing[cur_rx], pkt_size, 0); skb_put(skb, pkt_size); skb->protocol = eth_type_trans(skb, dev); netif_rx(skb); tp->RxDescArray[cur_rx].PSize =0x0; if (cur_rx == (NUM_RX_DESC - 1)) tp->RxDescArray[cur_rx].buf_Len = ENDbit+RX_BUF_SIZE; else tp->RxDescArray[cur_rx].buf_Len = RX_BUF_SIZE; tp->RxDescArray[cur_rx].buf_addr = virt_to_bus(tp->RxBufferRing[cur_rx]); dev->last_rx = jiffies; tp->stats.rx_bytes += pkt_size; tp->stats.rx_packets++; tp->RxDescArray[cur_rx].status = OWNbit|INTbit; } else { printk(KERN_WARNING "%s: Memory squeeze, deferring packet.\n", dev->name); /* We should check that some rx space is free. If not, free one and mark stats->rx_dropped++. */ tp->stats.rx_dropped++; } } cur_rx = (cur_rx + 1) % NUM_RX_DESC; } tp->cur_rx = cur_rx; } /* The interrupt handler does all of the Rx thread work and cleans up after the Tx thread. 
*/ static irqreturn_t SiS190_interrupt(int irq, void *dev_instance, struct pt_regs *regs) { struct net_device *dev = (struct net_device *) dev_instance; struct sis190_private *tp = dev->priv; int boguscnt = max_interrupt_work; void *ioaddr = tp->mmio_addr; unsigned long status = 0; int handled = 0; do { status = SiS_R32(IntrStatus); /* h/w no longer present (hotplug?) or major error, bail */ SiS_W32(IntrStatus,status); if ((status & (TxQ0Int | RxQInt)) == 0) break; // Rx interrupt if (status & (RxQInt)) { SiS190_rx_interrupt(dev, tp, ioaddr); } // Tx interrupt if (status & (TxQ0Int)) { spin_lock(&tp->lock); SiS190_tx_interrupt(dev, tp, ioaddr); spin_unlock(&tp->lock); } boguscnt--; } while (boguscnt > 0); if (boguscnt <= 0) { printk(KERN_WARNING "%s: Too much work at interrupt!\n", dev->name); /* Clear all interrupt sources. */ SiS_W32(IntrStatus, 0xffffffff); } return IRQ_RETVAL(handled); } static int SiS190_close(struct net_device *dev) { struct sis190_private *tp = dev->priv; void *ioaddr = tp->mmio_addr; int i; netif_stop_queue(dev); spin_lock_irq(&tp->lock); /* Stop the chip's Tx and Rx DMA processes. */ SiS_W32(TxControl,0x1a00); SiS_W32(RxControl,0x1a00); /* Disable interrupts by clearing the interrupt mask. */ SiS_W32(IntrMask, 0x0000); /* Update the error counts. */ //tp->stats.rx_missed_errors += _R32(RxMissed); //_W32(RxMissed, 0); spin_unlock_irq(&tp->lock); synchronize_irq(); free_irq(dev->irq, dev); SiS190_tx_clear(tp); kfree(tp->TxDescArrays); kfree(tp->RxDescArrays); tp->TxDescArrays = NULL; tp->RxDescArrays = NULL; tp->TxDescArray = NULL; tp->RxDescArray = NULL; kfree(tp->RxBufferRings); for (i = 0; i < NUM_RX_DESC; i++) { tp->RxBufferRing[i] = NULL; } return 0; } static void SiS190_set_rx_mode(struct net_device *dev) { struct sis190_private *tp = dev->priv; void *ioaddr = tp->mmio_addr; unsigned long flags; u32 mc_filter[2]; /* Multicast hash filter */ int i, rx_mode; u32 tmp = 0; if (dev->flags & IFF_PROMISC) { /* Unconditionally log net taps. 
*/ printk(KERN_NOTICE "%s: Promiscuous mode enabled.\n", dev->name); rx_mode = AcceptBroadcast | AcceptMulticast | AcceptMyPhys | AcceptAllPhys; mc_filter[1] = mc_filter[0] = 0xffffffff; } else if ((dev->mc_count > multicast_filter_limit) || (dev->flags & IFF_ALLMULTI)) { /* Too many to filter perfectly -- accept all multicasts. */ rx_mode = AcceptBroadcast | AcceptMulticast | AcceptMyPhys; mc_filter[1] = mc_filter[0] = 0xffffffff; } else { struct dev_mc_list *mclist; rx_mode = AcceptBroadcast | AcceptMyPhys; mc_filter[1] = mc_filter[0] = 0; for (i = 0, mclist = dev->mc_list; mclist && i < dev->mc_count; i++, mclist = mclist->next) { int bit_nr = ether_crc(ETH_ALEN, mclist->dmi_addr) >> 26; mc_filter[bit_nr >> 5] |= 1 << (bit_nr & 31); rx_mode |= AcceptMulticast; } } spin_lock_irqsave(&tp->lock, flags); tmp = rx_mode | 0x2; SiS_W16(RxMacControl, tmp); SiS_W32(RxHashTable , mc_filter[0]); SiS_W32(RxHashTable + 4, mc_filter[1]); spin_unlock_irqrestore(&tp->lock, flags); } struct net_device_stats * SiS190_get_stats(struct net_device *dev) { struct sis190_private *tp = dev->priv; return &tp->stats; } static struct pci_driver sis190_pci_driver = { .name = MODULENAME, .id_table = sis190_pci_tbl, .probe = SiS190_init_one, .remove = SiS190_remove_one, .suspend = NULL, .resume = NULL, }; static int __init SiS190_init_module(void) { return pci_module_init(&sis190_pci_driver); } static void __exit SiS190_cleanup_module(void) { pci_unregister_driver(&sis190_pci_driver); } module_init(SiS190_init_module); module_exit(SiS190_cleanup_module); From shemminger@osdl.org Fri Aug 8 11:34:24 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 11:34:31 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h78IYMFl030031 for ; Fri, 8 Aug 2003 11:34:23 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by mail.osdl.org (8.11.6/8.11.6) with SMTP id h78IYAo00661; Fri, 8 Aug 2003 
11:34:10 -0700 Date: Fri, 8 Aug 2003 11:34:04 -0700 From: Stephen Hemminger To: "David S. Miller" Cc: netdev@oss.sgi.com Subject: [PATCH] tun driver use private linked list. Message-Id: <20030808113404.0e9e1e6d.shemminger@osdl.org> In-Reply-To: <20030807155901.49f1a424.davem@redhat.com> References: <200308051630.28552.bellucda@tiscali.it> <20030805090647.691daa7e.shemminger@osdl.org> <200308051910.55823.bellucda@tiscali.it> <20030807154524.4794ad45.shemminger@osdl.org> <20030807155901.49f1a424.davem@redhat.com> Organization: Open Source Development Lab X-Mailer: Sylpheed version 0.9.4claws (GTK+ 1.2.10; i686-pc-linux-gnu) X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[ Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4679 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev Less grotty version, applies over earlier patch. - keep a private list. - fix debug format strings. - drop the name entry in the private data structure since it already has a pointer to netdev that has name. diff -Nru a/drivers/net/tun.c b/drivers/net/tun.c --- a/drivers/net/tun.c Fri Aug 8 11:26:07 2003 +++ b/drivers/net/tun.c Fri Aug 8 11:26:07 2003 @@ -51,6 +51,8 @@ /* Network device part of the driver */ +static LIST_HEAD(tun_dev_list); + /* Net device open. 
*/ static int tun_net_open(struct net_device *dev) { @@ -70,7 +72,7 @@ { struct tun_struct *tun = (struct tun_struct *)dev->priv; - DBG(KERN_INFO "%s: tun_net_xmit %d\n", tun->name, skb->len); + DBG(KERN_INFO "%s: tun_net_xmit %d\n", tun->dev->name, skb->len); /* Drop packet if interface is not attached */ if (!tun->attached) @@ -120,7 +122,7 @@ { struct tun_struct *tun = (struct tun_struct *)dev->priv; - DBG(KERN_INFO "%s: tun_net_init\n", tun->name); + DBG(KERN_INFO "%s: tun_net_init\n", tun->dev->name); switch (tun->flags & TUN_TYPE_MASK) { case TUN_TUN_DEV: @@ -161,7 +163,7 @@ if (!tun) return -EBADFD; - DBG(KERN_INFO "%s: tun_chr_poll\n", tun->name); + DBG(KERN_INFO "%s: tun_chr_poll\n", tun->dev->name); poll_wait(file, &tun->read_wait, wait); @@ -226,7 +228,7 @@ if (!tun) return -EBADFD; - DBG(KERN_INFO "%s: tun_chr_write %d\n", tun->name, count); + DBG(KERN_INFO "%s: tun_chr_write %ld\n", tun->dev->name, count); for (i = 0, len = 0; i < count; i++) { if (verify_area(VERIFY_READ, iv[i].iov_base, iv[i].iov_len)) @@ -290,7 +292,7 @@ if (!tun) return -EBADFD; - DBG(KERN_INFO "%s: tun_chr_read\n", tun->name); + DBG(KERN_INFO "%s: tun_chr_read\n", tun->dev->name); for (i = 0, len = 0; i < count; i++) { if (verify_area(VERIFY_WRITE, iv[i].iov_base, iv[i].iov_len)) @@ -350,7 +352,7 @@ tun->owner = -1; dev->init = tun_net_init; - tun->name = dev->name; + SET_MODULE_OWNER(dev); dev->open = tun_net_open; dev->hard_start_xmit = tun_net_xmit; @@ -359,27 +361,40 @@ dev->destructor = (void (*)(struct net_device *))kfree; } +static struct tun_struct *tun_get_by_name(const char *name) +{ + struct tun_struct *tun; + + ASSERT_RTNL(); + list_for_each_entry(tun, &tun_dev_list, list) { + if (!strncmp(tun->dev->name, name, IFNAMSIZ)) + return tun; + } + + return NULL; +} + static int tun_set_iff(struct file *file, struct ifreq *ifr) { struct tun_struct *tun; - struct net_device *dev; int err; - dev = __dev_get_by_name(ifr->ifr_name); - if (dev) { - /* Device exist */ - tun = 
dev->priv; - - if (dev->init != tun_net_init || tun->attached) + tun = tun_get_by_name(ifr->ifr_name); + if (tun) { + if (tun->attached) return -EBUSY; /* Check permissions */ - if (tun->owner != -1) - if (current->euid != tun->owner && !capable(CAP_NET_ADMIN)) - return -EPERM; - } else { + if (tun->owner != -1 && + current->euid != tun->owner && !capable(CAP_NET_ADMIN)) + return -EPERM; + } + else if (__dev_get_by_name(ifr->ifr_name)) + return -EINVAL; + else { char *name; unsigned long flags = 0; + struct net_device *dev; err = -EINVAL; @@ -420,9 +435,10 @@ goto failed; } + list_add(&tun->list, &tun_dev_list); } - DBG(KERN_INFO "%s: tun_set_iff\n", tun->name); + DBG(KERN_INFO "%s: tun_set_iff\n", tun->dev->name); if (ifr->ifr_flags & IFF_NO_PI) tun->flags |= TUN_NO_PI; @@ -433,7 +449,7 @@ file->private_data = tun; tun->attached = 1; - strcpy(ifr->ifr_name, tun->name); + strcpy(ifr->ifr_name, tun->dev->name); return 0; failed: return err; @@ -466,7 +482,7 @@ if (!tun) return -EBADFD; - DBG(KERN_INFO "%s: tun_chr_ioctl cmd %d\n", tun->name, cmd); + DBG(KERN_INFO "%s: tun_chr_ioctl cmd %d\n", tun->dev->name, cmd); switch (cmd) { case TUNSETNOCSUM: @@ -477,7 +493,7 @@ tun->flags &= ~TUN_NOCHECKSUM; DBG(KERN_INFO "%s: checksum %s\n", - tun->name, arg ? "disabled" : "enabled"); + tun->dev->name, arg ? "disabled" : "enabled"); break; case TUNSETPERSIST: @@ -488,14 +504,14 @@ tun->flags &= ~TUN_PERSIST; DBG(KERN_INFO "%s: persist %s\n", - tun->name, arg ? "disabled" : "enabled"); + tun->dev->name, arg ? 
"disabled" : "enabled"); break; case TUNSETOWNER: /* Set owner of the device */ tun->owner = (uid_t) arg; - DBG(KERN_INFO "%s: owner set to %d\n", tun->owner); + DBG(KERN_INFO "%s: owner set to %d\n", tun->dev->name, tun->owner); break; #ifdef TUN_DEBUG @@ -519,7 +535,7 @@ if (!tun) return -EBADFD; - DBG(KERN_INFO "%s: tun_chr_fasync %d\n", tun->name, on); + DBG(KERN_INFO "%s: tun_chr_fasync %d\n", tun->dev->name, on); if ((ret = fasync_helper(fd, file, on, &tun->fasync)) < 0) return ret; @@ -549,7 +565,7 @@ if (!tun) return 0; - DBG(KERN_INFO "%s: tun_chr_close\n", tun->name); + DBG(KERN_INFO "%s: tun_chr_close\n", tun->dev->name); tun_chr_fasync(-1, file, 0); @@ -562,8 +578,10 @@ /* Drop read queue */ skb_queue_purge(&tun->readq); - if (!(tun->flags & TUN_PERSIST)) + if (!(tun->flags & TUN_PERSIST)) { + list_del(&tun->list); unregister_netdevice(tun->dev); + } rtnl_unlock(); @@ -605,15 +623,14 @@ void tun_cleanup(void) { - struct net_device *dev, *nxt; + struct tun_struct *tun, *nxt; misc_deregister(&tun_miscdev); rtnl_lock(); - for (dev = dev_base; dev; dev = nxt) { - nxt = dev->next; - if (dev->init == tun_net_init) - unregister_netdevice(dev); + list_for_each_entry_safe(tun, nxt, &tun_dev_list, list) { + DBG(KERN_INFO "%s cleaned up\n", tun->dev->name); + unregister_netdevice(tun->dev); } rtnl_unlock(); diff -Nru a/include/linux/if_tun.h b/include/linux/if_tun.h --- a/include/linux/if_tun.h Fri Aug 8 11:29:09 2003 +++ b/include/linux/if_tun.h Fri Aug 8 11:29:09 2003 @@ -32,7 +32,7 @@ #endif struct tun_struct { - char *name; + struct list_head list; unsigned long flags; int attached; uid_t owner; From shemminger@osdl.org Fri Aug 8 12:02:04 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 12:02:06 -0700 (PDT) Received: from mail.osdl.org (fw.osdl.org [65.172.181.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h78J22Fl032216 for ; Fri, 8 Aug 2003 12:02:03 -0700 Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60]) by 
mail.osdl.org (8.11.6/8.11.6) with SMTP id h78J1uo06384; Fri, 8 Aug 2003 12:01:57 -0700
Date: Fri, 8 Aug 2003 12:01:50 -0700
From: Stephen Hemminger
To: Jeff Garzik
Cc: netdev@oss.sgi.com
Subject: Re: RFR: new SiS gige driver
Message-Id: <20030808120150.45f091d8.shemminger@osdl.org>
In-Reply-To: <20030808173932.GA4077@gtf.org>
References: <20030808173932.GA4077@gtf.org>
Organization: Open Source Development Lab
X-Mailer: Sylpheed version 0.9.4claws (GTK+ 1.2.10; i686-pc-linux-gnu)
X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$ /`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-archive-position: 4680
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: shemminger@osdl.org
Precedence: bulk
X-list: netdev

Comments:
- Run it through Lindent; the indenting is non-standard.
- struct board_info is defined but only used in one debug message, and it is just a string.
- C99 initializers?
- If code is commented out and outdated, remove it (see SiS190_init_board).
- Poisoning memory is useless and is already done by the allocator (see SiS190_remove_one).
- assert checks for NULL pointers are overkill.
- ethtool?

Overall, it suffers a little from the "having more debug code makes my driver more reliable" fallacy.
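
For readers unfamiliar with the "C99 initializers?" comment above: it most likely refers to replacing positional initializers such as {0x1039, 0x0190, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0} with designated ones. What follows is only a minimal userspace sketch, not the driver's code. It assumes a hypothetical struct pci_id_sketch that merely mirrors some field names of the kernel's struct pci_device_id; SKETCH_ANY_ID and match_id() are likewise illustrative and are not kernel API.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for struct pci_device_id (assumption). */
struct pci_id_sketch {
	unsigned int vendor, device;
	unsigned int subvendor, subdevice;
	unsigned long driver_data;
};

/* Stand-in for PCI_ANY_ID (assumption). */
#define SKETCH_ANY_ID (~0U)

/* C99 designated initializers: each field is named, so the order does not
 * matter and any field left out is zero-initialized. */
static const struct pci_id_sketch sis190_ids[] = {
	{
		.vendor    = 0x1039,
		.device    = 0x0190,
		.subvendor = SKETCH_ANY_ID,
		.subdevice = SKETCH_ANY_ID,
	},
	{ 0 }	/* all-zero terminator entry */
};

/* Walk the table up to the terminator; return the matching entry or NULL. */
static const struct pci_id_sketch *match_id(unsigned int vendor,
					    unsigned int device)
{
	const struct pci_id_sketch *id;

	for (id = sis190_ids; id->vendor != 0; id++)
		if (id->vendor == vendor && id->device == device)
			return id;
	return NULL;
}
```

Because unnamed fields default to zero, the trailing zeros of the positional form can simply be dropped, and the table stays correct if fields are later reordered or added.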
From shmulik.hen@intel.com Fri Aug 8 13:01:04 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 13:01:09 -0700 (PDT) Received: from hermes.jf.intel.com (fmr05.intel.com [134.134.136.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h78K13Fl004794 for ; Fri, 8 Aug 2003 13:01:04 -0700 Received: from talaria.jf.intel.com (talaria.jf.intel.com [10.7.209.7]) by hermes.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h78EhKc05732 for ; Fri, 8 Aug 2003 14:43:20 GMT Received: from orsmsxvs040.jf.intel.com (orsmsxvs040.jf.intel.com [192.168.65.206]) by talaria.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h78E8VT07438 for ; Fri, 8 Aug 2003 14:08:31 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by orsmsxvs040.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080807573707052 ; Fri, 08 Aug 2003 07:57:38 -0700 Content-Type: text/plain; charset="us-ascii" From: Shmulik Hen Reply-To: shmulik.hen@intel.com Organization: Intel corp. Subject: [SET 2][PATCH 8/8][bonding] Propagating master's settings to slaves Date: Fri, 8 Aug 2003 17:45:23 +0300 User-Agent: KMail/1.4.3 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit To: bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com Message-Id: <200308081745.23436.shmulik.hen@intel.com> X-archive-position: 4681 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shmulik.hen@intel.com Precedence: bulk X-list: netdev 8 - Enhance netdev notification handling. Add comment block and bump version. 
diff -Nuarp linux-2.4.22-rc1/drivers/net/bonding/bond_main.c linux-2.4.22-rc1-devel/drivers/net/bonding/bond_main.c --- linux-2.4.22-rc1/drivers/net/bonding/bond_main.c Fri Aug 8 14:03:27 2003 +++ linux-2.4.22-rc1-devel/drivers/net/bonding/bond_main.c Fri Aug 8 14:03:28 2003 @@ -408,6 +408,18 @@ * and free it separately; use standard list operations instead * of pre-allocated array of bonds. * Version to 2.3.0. + * + * 2003/08/07 - Jay Vosburgh , + * Amir Noam and + * Shmulik Hen + * - Propagating master's settings: Distinguish between modes that + * use a primary slave from those that don't, and propagate settings + * accordingly; Consolidate change_active opeartions and add + * reselect_active and find_best opeartions; Decouple promiscuous + * handling from the multicast mode setting; Add support for changing + * HW address and MTU with proper unwind; Consolidate procfs code, + * add CHANGENAME handler; Enhance netdev notification handling. + * Version to 2.4.0. */ #include @@ -452,8 +464,8 @@ #include "bond_3ad.h" #include "bond_alb.h" -#define DRV_VERSION "2.3.0" -#define DRV_RELDATE "August 6, 2003" +#define DRV_VERSION "2.4.0" +#define DRV_RELDATE "August 7, 2003" #define DRV_NAME "bonding" #define DRV_DESCRIPTION "Ethernet Channel Bonding Driver" @@ -572,7 +584,6 @@ static struct net_device_stats *bond_get static void bond_mii_monitor(struct net_device *dev); static void loadbalance_arp_monitor(struct net_device *dev); static void activebackup_arp_monitor(struct net_device *dev); -static int bond_event(struct notifier_block *this, unsigned long event, void *ptr); static void bond_mc_list_destroy(struct bonding *bond); static void bond_mc_add(bonding_t *bond, void *addr, int alen); static void bond_mc_delete(bonding_t *bond, void *addr, int alen); @@ -3468,7 +3479,6 @@ static int bond_read_proc(char *buf, cha } #endif /* CONFIG_PROC_FS */ - static int bond_create_proc_info(struct bonding *bond) { #ifdef CONFIG_PROC_FS @@ -3633,21 +3643,134 @@ unwind: return 
error; } -static int bond_event(struct notifier_block *this, unsigned long event, - void *ptr) +/* + * Change device name + */ +static inline int bond_event_changename(struct bonding *bond) +{ + int error; + + bond_destroy_proc_info(bond); + error = bond_create_proc_info(bond); + if (error) { + return NOTIFY_BAD; + } + return NOTIFY_DONE; +} + +static int bond_master_netdev_event(unsigned long event, struct net_device *event_dev) +{ + struct bonding *bond, *event_bond = NULL; + + list_for_each_entry(bond, &bond_dev_list, bond_list) { + if (bond == event_dev->priv) { + event_bond = bond; + break; + } + } + + if (event_bond == NULL) { + return NOTIFY_DONE; + } + + switch (event) { + case NETDEV_CHANGENAME: + return bond_event_changename(event_bond); + case NETDEV_UNREGISTER: + /* + * TODO: remove a bond from the list? + */ + break; + default: + break; + } + + return NOTIFY_DONE; +} + +static int bond_slave_netdev_event(unsigned long event, struct net_device *event_dev) { - struct net_device *event_dev = (struct net_device *)ptr; struct net_device *master = event_dev->master; - if ((event == NETDEV_UNREGISTER) && (master != NULL)) { - bond_release(master, event_dev); + switch (event) { + case NETDEV_UNREGISTER: + if (master != NULL) { + bond_release(master, event_dev); + } + break; + case NETDEV_CHANGE: + /* + * TODO: is this what we get if somebody + * sets up a hierarchical bond, then rmmod's + * one of the slave bonding devices? + */ + break; + case NETDEV_DOWN: + /* + * ... Or is it this? + */ + break; + case NETDEV_CHANGEMTU: + /* + * TODO: Should slaves be allowed to + * independently alter their MTU? For + * an active-backup bond, slaves need + * not be the same type of device, so + * MTUs may vary. For other modes, + * slaves arguably should have the + * same MTUs. To do this, we'd need to + * take over the slave's change_mtu + * function for the duration of their + * servitude. 
+ */ + break; + case NETDEV_CHANGENAME: + /* + * TODO: handle changing the primary's name + */ + break; + default: + break; } return NOTIFY_DONE; } +/* + * bond_netdev_event: handle netdev notifier chain events. + * + * This function receives events for the netdev chain. The caller (an + * ioctl handler calling notifier_call_chain) holds the necessary + * locks for us to safely manipulate the slave devices (RTNL lock, + * dev_probe_lock). + */ +static int bond_netdev_event(struct notifier_block *this, unsigned long event, void *ptr) +{ + struct net_device *event_dev = (struct net_device *)ptr; + unsigned short flags; + int res = NOTIFY_DONE; + + dprintk(KERN_INFO "bond_netdev_event n_b %p ev %lx ptr %p\n", + this, event, ptr); + + flags = event_dev->flags & (IFF_MASTER | IFF_SLAVE); + switch (flags) { + case IFF_MASTER: + res = bond_master_netdev_event(event, event_dev); + break; + case IFF_SLAVE: + res = bond_slave_netdev_event(event, event_dev); + break; + default: + /* A master that is also a slave ? */ + break; + } + + return res; +} + static struct notifier_block bond_netdev_notifier = { - notifier_call: bond_event, + notifier_call: bond_netdev_event, }; static void bond_deinit(struct net_device *dev) -- | Shmulik Hen Advanced Network Services | | Israel Design Center, Jerusalem | | LAN Access Division, Platform Networking | | Intel Communications Group, Intel corp. | From hadi@cyberus.ca Fri Aug 8 14:51:49 2003 Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 14:51:58 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h78LpmFl012028 for ; Fri, 8 Aug 2003 14:51:49 -0700 Received: from [216.209.86.2] (helo=[10.0.0.9] ident=jamal) by mail.cyberus.ca with esmtp (Exim 4.12) id 19lF9C-0005f9-00; Fri, 08 Aug 2003 17:51:42 -0400 Subject: Re: [RFC] High Performance Packet Classifiction for tc framework From: jamal Reply-To: hadi@cyberus.ca To: "David S. 
Miller"
Cc: nf@hipac.org, linux-net@vger.kernel.org, netdev@oss.sgi.com
In-Reply-To: <20030807130502.4af9c815.davem@redhat.com>
References: <200307141045.40999.nf@hipac.org>
 <1058328537.1797.24.camel@jzny.localdomain>
 <3F16A0E5.1080007@hipac.org>
 <1059934468.1103.41.camel@jzny.localdomain>
 <3F2E5CD6.4030500@hipac.org>
 <1060012260.1103.380.camel@jzny.localdomain>
 <3F302E04.1090503@hipac.org>
 <1060286331.1025.73.camel@jzny.localdomain>
 <20030807130502.4af9c815.davem@redhat.com>
Content-Type: text/plain
Organization: jamalopolis
Message-Id: <1060379500.1723.214.camel@jzny.localdomain>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.2
Date: 08 Aug 2003 17:51:40 -0400
Content-Transfer-Encoding: 7bit
X-archive-position: 4682
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: hadi@cyberus.ca
Precedence: bulk
X-list: netdev

On Thu, 2003-08-07 at 16:05, David S. Miller wrote:
> On 07 Aug 2003 15:58:51 -0400
> jamal wrote:
> If you input all the keys into the Jenkins hash, how does
> it perform? Has anyone even tried that and compared it
> to all of these fancy multi-level tree like hash things?

AFAIK, no one has tried it. I will try it out at some point.

> I think Jenkins would work very well for exactly this kind
> of application. And it's fully available to the entire kernel
> via linux/jhash.h and already in use by other things such
> as the routing cache and the netfilter conntrack code.

A good reason for the multilevel stuff is to support arbitrary packet
offsets, i.e. you don't know ahead of time which bits in the packet you
are interested in. It's easy to use hashes when you know that you need
to find, for example, ip src/dst.
Since it's in the kernel I will look into it, but it has to meet the
arbitrary offset requirement.

cheers,
jamal

From hadi@cyberus.ca Fri Aug 8 15:01:18 2003
Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 15:01:24 -0700 (PDT)
Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h78M1HFl012776 for ; Fri, 8 Aug 2003 15:01:18 -0700
Received: from [216.209.86.2] (helo=[10.0.0.9] ident=jamal) by mail.cyberus.ca with esmtp (Exim 4.12) id 19lFIT-0006e9-00; Fri, 08 Aug 2003 18:01:17 -0400
Subject: Re: [SET 2][PATCH 2/8][bonding] Propagating master's settings to slaves
From: jamal
Reply-To: hadi@cyberus.ca
To: shmulik.hen@intel.com
Cc: bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com
In-Reply-To: <200308081744.58946.shmulik.hen@intel.com>
References: <200308081744.58946.shmulik.hen@intel.com>
Content-Type: text/plain
Organization: jamalopolis
Message-Id: <1060380076.1717.233.camel@jzny.localdomain>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.2
Date: 08 Aug 2003 18:01:16 -0400
Content-Transfer-Encoding: 7bit
X-archive-position: 4683
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: hadi@cyberus.ca
Precedence: bulk
X-list: netdev

Shmulik,

Some of this bonding stuff is pretty scary. Lotsa policies in the
kernel, and communication seems to be centred around /proc.
Shouldn't policies on failover really be driven from user space?
Also, shouldn't communication be using something like netlink?

cheers,
jamal

On Fri, 2003-08-08 at 10:44, Shmulik Hen wrote:
> 2 - Change monitoring function use the new functionality.
>

From ahtraps@runbox.com Fri Aug 8 16:56:00 2003
Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 16:56:06 -0700 (PDT)
Received: from aibo.runbox.com (cujo.runbox.com [193.71.199.138]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h78NtwFl019920 for ; Fri, 8 Aug 2003 16:56:00 -0700
Received: from [10.9.9.15] (helo=odie.runbox.com) by lufsen.runbox.com with esmtp (Exim 4.20) id 19lH5K-0001Hk-QD; Sat, 09 Aug 2003 01:55:50 +0200
Received: from mail by odie.runbox.com with local (Exim 4.20) id 19lH5A-0005yt-J9; Sat, 09 Aug 2003 01:55:40 +0200
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
MIME-Version: 1.0
From:
Reply-To: ahtraps@runbox.com
To: linux-net@vger.kernel.org, netdev@oss.sgi.com
Subject: A question on inet_select_addr()
Date: Fri, 08 Aug 2003 23:55:40 GMT
X-Sender: 243652
X-Mailer: RMM
Message-Id:
X-Sender: unknown
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h78NtwFl019920
X-archive-position: 4684
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: ahtraps@runbox.com
Precedence: bulk
X-list: netdev

Hello,

I don't quite understand the comment, and the logic following it, in
the inet_select_addr() function. Specifically:

1. How is it guaranteed that a loopback address is not chosen? (There
   is no check for 127.X.X.X addresses!)

2. Why is it important that the loopback interface be the first in the
   list? What would go wrong if the search started from the second
   device in the list (i.e., skipping the loopback device)?

3. And what does this test mean:
   (ifa->ifa_scope != RT_SCOPE_LINK && ifa->ifa_scope <= scope)?

u32 inet_select_addr(const struct net_device *dev, u32 dst, int scope)
{
	... SNIP ...

	/* Not loopback addresses on loopback should be preferred
	   in this case. It is importnat that lo is the first interface
	   in dev_base list.
	 */
	read_lock(&dev_base_lock);
	read_lock(&inetdev_lock);
	for (dev=dev_base; dev; dev=dev->next) {
		if ((in_dev=__in_dev_get(dev)) == NULL)
			continue;

		read_lock(&in_dev->lock);
		for_primary_ifa(in_dev) {
			if (ifa->ifa_scope != RT_SCOPE_LINK &&
			    ifa->ifa_scope <= scope) {
				read_unlock(&in_dev->lock);
				read_unlock(&inetdev_lock);
				read_unlock(&dev_base_lock);
				return ifa->ifa_local;
			}
		} endfor_ifa(in_dev);
		read_unlock(&in_dev->lock);
	}
	read_unlock(&inetdev_lock);
	read_unlock(&dev_base_lock);

	return 0;
}

tx
Andy

From davem@redhat.com Fri Aug 8 17:06:42 2003
Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 17:06:48 -0700 (PDT)
Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7906fFl020917 for ; Fri, 8 Aug 2003 17:06:42 -0700
Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id RAA32358; Fri, 8 Aug 2003 17:01:30 -0700
Date: Fri, 8 Aug 2003 17:01:30 -0700
From: "David S.
Miller"
To: hadi@cyberus.ca
Cc: nf@hipac.org, linux-net@vger.kernel.org, netdev@oss.sgi.com
Subject: Re: [RFC] High Performance Packet Classifiction for tc framework
Message-Id: <20030808170130.578ff441.davem@redhat.com>
In-Reply-To: <1060379500.1723.214.camel@jzny.localdomain>
References: <200307141045.40999.nf@hipac.org>
 <1058328537.1797.24.camel@jzny.localdomain>
 <3F16A0E5.1080007@hipac.org>
 <1059934468.1103.41.camel@jzny.localdomain>
 <3F2E5CD6.4030500@hipac.org>
 <1060012260.1103.380.camel@jzny.localdomain>
 <3F302E04.1090503@hipac.org>
 <1060286331.1025.73.camel@jzny.localdomain>
 <20030807130502.4af9c815.davem@redhat.com>
 <1060379500.1723.214.camel@jzny.localdomain>
X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-archive-position: 4685
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davem@redhat.com
Precedence: bulk
X-list: netdev

On 08 Aug 2003 17:51:40 -0400
jamal wrote:

> Its easy to use hashes when you know that you need to
> find example ip src/dst.

Jenkins is rather agnostic about the input bits, that's
what makes it so powerful. It performs about as well
for random input as it does for input which has various
patterns.

Wait, are you saying the input key size can change?
Yes, that's an interesting problem. But for things
where you always want some 96-bit key, Jenkins is
probably best.

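The fixed 96-bit case mentioned above can be sketched in user space as follows. This is a minimal sketch, assuming a key of exactly three 32-bit words and a power-of-two table size. It is modeled on the mixing step of Bob Jenkins' lookup2 hash, which is the family that linux/jhash.h draws on; hash_3words, bucket_of and HASH_GOLDEN_RATIO are names chosen for this example, and kernel code would call the helpers in linux/jhash.h directly rather than copy this.

```c
#include <stdint.h>

/* Arbitrary constant used to seed the internal state (illustrative name). */
#define HASH_GOLDEN_RATIO 0x9e3779b9u

/* One pass of the lookup2-style mix over three 32-bit words. */
#define HASH_MIX(a, b, c) do {                  \
	a -= b; a -= c; a ^= (c >> 13);         \
	b -= c; b -= a; b ^= (a << 8);          \
	c -= a; c -= b; c ^= (b >> 13);         \
	a -= b; a -= c; a ^= (c >> 12);         \
	b -= c; b -= a; b ^= (a << 16);         \
	c -= a; c -= b; c ^= (b >> 5);          \
	a -= b; a -= c; a ^= (c >> 3);          \
	b -= c; b -= a; b ^= (a << 10);         \
	c -= a; c -= b; c ^= (b >> 15);         \
} while (0)

/* Hash a fixed 96-bit key given as three words, plus a caller-chosen seed. */
uint32_t hash_3words(uint32_t a, uint32_t b, uint32_t c, uint32_t seed)
{
	a += HASH_GOLDEN_RATIO;
	b += HASH_GOLDEN_RATIO;
	c += seed;
	HASH_MIX(a, b, c);
	return c;
}

/*
 * Map a key (for example source address, destination address and the
 * two ports packed into one word) to a bucket index.
 * table_size is assumed to be a power of two.
 */
uint32_t bucket_of(uint32_t src, uint32_t dst, uint32_t ports,
		   uint32_t seed, uint32_t table_size)
{
	return hash_3words(src, dst, ports, seed) & (table_size - 1);
}
```

The sketch only covers a key whose size and layout are known in advance. It does not address the arbitrary-offset requirement raised earlier in the thread, where the bits of interest are not known ahead of time.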
From davem@redhat.com Fri Aug 8 23:44:52 2003
Received: with ECARTIS (v1.0.0; list netdev); Fri, 08 Aug 2003 23:44:57 -0700 (PDT)
Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h796ipFl005112 for ; Fri, 8 Aug 2003 23:44:52 -0700
Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id XAA00635; Fri, 8 Aug 2003 23:39:40 -0700
Date: Fri, 8 Aug 2003 23:39:40 -0700
From: "David S. Miller"
To: Jeff Garzik
Cc: netdev@oss.sgi.com
Subject: Re: RFR: new SiS gige driver
Message-Id: <20030808233940.6ad767f5.davem@redhat.com>
In-Reply-To: <20030808173932.GA4077@gtf.org>
References: <20030808173932.GA4077@gtf.org>
X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-archive-position: 4686
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davem@redhat.com
Precedence: bulk
X-list: netdev

On Fri, 8 Aug 2003 13:39:33 -0400
Jeff Garzik wrote:

> This driver is actually a lot more clean than many that crop up. So far
> most of my nits are minor:

egrep virt_to_bus sis190.c

:-( I think this is more important to fix than any of the
other things which have been listed.

I really believe we're at the point where putting PCI device
drivers into the tree still using virt_to_bus() and friends
should simply not be allowed...

From yoshfuji@linux-ipv6.org Sat Aug 9 00:29:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 00:29:53 -0700 (PDT) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h797ThFl008075 for ; Sat, 9 Aug 2003 00:29:44 -0700 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h797Tq1M004639; Sat, 9 Aug 2003 16:29:53 +0900 Date: Sat, 09 Aug 2003 16:29:52 +0900 (JST) Message-Id: <20030809.162952.114376111.yoshfuji@linux-ipv6.org> To: davem@redhat.com CC: netdev@oss.sgi.com Subject: [PATCH] IPVS: linkage error without CONFIG_IP_VS_PROTO_TCP From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= Organization: USAGI Project X-URL: http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4687 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev This patch fixes linkage error occurs if CONFIG_IP_VS_PROTO_TCP is not set. Thanks. 
Index: linux26/net/ipv4/ipvs/ip_vs_core.c =================================================================== RCS file: /cvsroot/usagi/usagi/kernel/linux26/net/ipv4/ipvs/ip_vs_core.c,v retrieving revision 1.1.1.1 diff -u -r1.1.1.1 ip_vs_core.c --- linux26/net/ipv4/ipvs/ip_vs_core.c 15 Jul 2003 07:42:29 -0000 1.1.1.1 +++ linux26/net/ipv4/ipvs/ip_vs_core.c 9 Aug 2003 07:14:55 -0000 @@ -53,7 +53,9 @@ EXPORT_SYMBOL(ip_vs_conn_new); EXPORT_SYMBOL(ip_vs_conn_in_get); EXPORT_SYMBOL(ip_vs_conn_out_get); +#ifdef CONFIG_IP_VS_PROTO_TCP EXPORT_SYMBOL(ip_vs_tcp_conn_listen); +#endif EXPORT_SYMBOL(ip_vs_conn_put); #ifdef CONFIG_IP_VS_DEBUG EXPORT_SYMBOL(ip_vs_get_debug_level); -- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From davem@redhat.com Sat Aug 9 01:11:32 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 01:11:42 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h798BVFl009984 for ; Sat, 9 Aug 2003 01:11:31 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id BAA00751; Sat, 9 Aug 2003 01:06:06 -0700 Date: Sat, 9 Aug 2003 01:06:06 -0700 From: "David S. 
Miller" To: Jan Oravec Cc: yoshfuji@linux-ipv6.org, netdev@oss.sgi.com Subject: Re: problem setting net.ipvX.conf.all.forwarding via sysctl() system call Message-Id: <20030809010606.6cfce75c.davem@redhat.com> In-Reply-To: <20030808091124.GA17961@wsx.ksp.sk> References: <20030803154427.GA12926@wsx.ksp.sk> <20030808.174504.14391608.yoshfuji@linux-ipv6.org> <20030808091124.GA17961@wsx.ksp.sk> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4688 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 8 Aug 2003 11:11:24 +0200 Jan Oravec wrote: > this is a new patch ... > + ipv4_devconf.forwarding = new; > inet_forward_change(); > - return 1; > + return 0; This is still wrong. First of all we have the table->data pointer in "valp" so let's use that to set the value. In this way we can use this function for other sysctl values if we ever desire to do that. Second, if we set the table->data value, we must return > 0. This tells the caller that we've done the sysctl value update. If we return zero, it would update the value a second time. Here is the fix I'm going to use. # This is a BitKeeper generated patch for the following project: # Project Name: Linux kernel tree # This patch format is intended for GNU patch command version 2.5 or higher. # This patch includes the following deltas: # ChangeSet 1.1136 -> 1.1137 # net/ipv4/sysctl_net_ipv4.c 1.11 -> 1.12 # # The following is the BitKeeper ChangeSet Log # -------------------------------------------- # 03/08/09 davem@nuts.ninka.net 1.1137 # [IPV4]: Fix setting net.ipv4.conf.all.forwarding via sysctl() system call. 
# -------------------------------------------- # diff -Nru a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c --- a/net/ipv4/sysctl_net_ipv4.c Sat Aug 9 01:02:17 2003 +++ b/net/ipv4/sysctl_net_ipv4.c Sat Aug 9 01:02:17 2003 @@ -109,6 +109,7 @@ } } + *valp = new; inet_forward_change(); return 1; } From davem@redhat.com Sat Aug 9 01:14:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 01:14:04 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h798DxFl010350 for ; Sat, 9 Aug 2003 01:14:00 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id BAA00766; Sat, 9 Aug 2003 01:08:45 -0700 Date: Sat, 9 Aug 2003 01:08:45 -0700 From: "David S. Miller" To: "YOSHIFUJI Hideaki / _$B5HF#1QL@" Cc: jan.oravec@6com.sk, netdev@oss.sgi.com Subject: Re: [PATCH] IPV6: strategy hander for net.ipv6.conf.*.forwarding (is Re: problem setting net.ipvX.conf.all.forwarding via sysctl() system call) Message-Id: <20030809010845.01ebcbe9.davem@redhat.com> In-Reply-To: <20030808.185135.112441851.yoshfuji@linux-ipv6.org> References: <20030803154427.GA12926@wsx.ksp.sk> <20030808.175030.19527061.yoshfuji@linux-ipv6.org> <20030808093704.GA18131@wsx.ksp.sk> <20030808.185135.112441851.yoshfuji@linux-ipv6.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4689 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 08 Aug 2003 18:51:35 +0900 (JST) YOSHIFUJI Hideaki / _$B5HF#1QL@ wrote: > In article <20030808093704.GA18131@wsx.ksp.sk> (at Fri, 8 Aug 2003 11:37:04 +0200), Jan Oravec says: > > > On Fri, Aug 08, 2003 at 05:50:30PM +0900, YOSHIFUJI Hideaki / 
?$B5HF#1QL@ wrote: > > > > > + *valp = new; > > > + if (valp != &ipv6_devconf.forwarding && > > > + valp != &ipv6_devconf_dflt.forwarding) { > > > + struct inet6_dev *idev = (struct inet6_dev *)table->extra1; > > > + if (!idev) > > > + return -ENODEV; > > > + addrconf_forward_change(idev); > > > + } > > > + return 0; > > > +} > > > > Shouldn't we set ipv6_devconf_dflt.forwarding and call > > addr_forward_change(NULL) in case that valp==&ipv6_devconf.forwarding? > > Oh, You're right. Here's the revised one: As I mentioned for the ipv4 forwarding sysctl bug fix, if you will set table->data yourself you should return > 0 (for example "1") from your strategy handler. The patch looks fine otherwise. Please fix this, thank you. From davem@redhat.com Sat Aug 9 01:15:40 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 01:15:42 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h798FdFl010757 for ; Sat, 9 Aug 2003 01:15:40 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id BAA00796; Sat, 9 Aug 2003 01:10:31 -0700 Date: Sat, 9 Aug 2003 01:10:31 -0700 From: "David S. 
Miller" To: "YOSHIFUJI Hideaki / _$B5HF#1QL@" Cc: netdev@oss.sgi.com, yoshfuji@linux-ipv6.org Subject: Re: [PATCH] IPV6: typo in include/linux/ipv6.h Message-Id: <20030809011031.14b9f639.davem@redhat.com> In-Reply-To: <20030808.223742.55652930.yoshfuji@linux-ipv6.org> References: <20030808.223742.55652930.yoshfuji@linux-ipv6.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4690 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 08 Aug 2003 22:37:42 +0900 (JST) YOSHIFUJI Hideaki / _$B5HF#1QL@ wrote: > Typo in definition in include/linux/ipv6.h. > 2.4.x has similar bug, too. Applied, thank you. From davem@redhat.com Sat Aug 9 01:23:10 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 01:23:17 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h798NAFl011425 for ; Sat, 9 Aug 2003 01:23:10 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id BAA00838; Sat, 9 Aug 2003 01:18:02 -0700 Date: Sat, 9 Aug 2003 01:18:01 -0700 From: "David S. Miller" To: Stephen Hemminger Cc: netdev@oss.sgi.com Subject: Re: [PATCH] tun driver use private linked list. 
Message-Id: <20030809011801.3258f7af.davem@redhat.com> In-Reply-To: <20030808113404.0e9e1e6d.shemminger@osdl.org> References: <200308051630.28552.bellucda@tiscali.it> <20030805090647.691daa7e.shemminger@osdl.org> <200308051910.55823.bellucda@tiscali.it> <20030807154524.4794ad45.shemminger@osdl.org> <20030807155901.49f1a424.davem@redhat.com> <20030808113404.0e9e1e6d.shemminger@osdl.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4691 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Fri, 8 Aug 2003 11:34:04 -0700 Stephen Hemminger wrote: > Less grotty version, applies over earlier patch. > - keep a private list. > - fix debug format strings. > - drop the name entry in the private data structure since it already > has a pointer to netdev that has name. Applied, thanks for following up on this Stephen. From davem@redhat.com Sat Aug 9 02:39:21 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 02:39:24 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h799dKFl017451 for ; Sat, 9 Aug 2003 02:39:20 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id CAA01103; Sat, 9 Aug 2003 02:34:05 -0700 Date: Sat, 9 Aug 2003 02:34:05 -0700 From: "David S. 
Miller" To: "YOSHIFUJI Hideaki / _$B5HF#1QL@" Cc: netdev@oss.sgi.com Subject: Re: [PATCH] IPVS: linkage error without CONFIG_IP_VS_PROTO_TCP Message-Id: <20030809023405.3aa597dd.davem@redhat.com> In-Reply-To: <20030809.162952.114376111.yoshfuji@linux-ipv6.org> References: <20030809.162952.114376111.yoshfuji@linux-ipv6.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4692 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Sat, 09 Aug 2003 16:29:52 +0900 (JST) YOSHIFUJI Hideaki / _$B5HF#1QL@ wrote: > This patch fixes linkage error occurs if CONFIG_IP_VS_PROTO_TCP > is not set. Applied, thank you. From yoshfuji@linux-ipv6.org Sat Aug 9 03:21:59 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 03:22:06 -0700 (PDT) Received: from yue.hongo.wide.ad.jp (yue.hongo.wide.ad.jp [203.178.139.94]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h79ALvFl022472 for ; Sat, 9 Aug 2003 03:21:58 -0700 Received: from localhost (localhost [127.0.0.1]) by yue.hongo.wide.ad.jp (8.12.3+3.5Wbeta/8.12.3/Debian-5) with ESMTP id h79ALx1M005553; Sat, 9 Aug 2003 19:21:59 +0900 Date: Sat, 09 Aug 2003 19:21:58 +0900 (JST) Message-Id: <20030809.192158.47862071.yoshfuji@linux-ipv6.org> To: davem@redhat.com Cc: jan.oravec@6com.sk, netdev@oss.sgi.com Subject: Re: [PATCH] IPV6: strategy hander for net.ipv6.conf.*.forwarding (is Re: problem setting net.ipvX.conf.all.forwarding via sysctl() system call) From: YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= In-Reply-To: <20030809010845.01ebcbe9.davem@redhat.com> References: <20030808093704.GA18131@wsx.ksp.sk> <20030808.185135.112441851.yoshfuji@linux-ipv6.org> <20030809010845.01ebcbe9.davem@redhat.com> Organization: USAGI Project X-URL: 
http://www.yoshifuji.org/%7Ehideaki/ X-Fingerprint: 90 22 65 EB 1E CF 3A D1 0B DF 80 D8 48 07 F8 94 E0 62 0E EA X-PGP-Key-URL: http://www.yoshifuji.org/%7Ehideaki/hideaki@yoshifuji.org.asc X-Face: "5$Al-.M>NJ%a'@hhZdQm:."qn~PA^gq4o*>iCFToq*bAi#4FRtx}enhuQKz7fNqQz\BYU] $~O_5m-9'}MIs`XGwIEscw;e5b>n"B_?j/AkL~i/MEaZBLP X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.1 (AOI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4693 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: yoshfuji@linux-ipv6.org Precedence: bulk X-list: netdev In article <20030809010845.01ebcbe9.davem@redhat.com> (at Sat, 9 Aug 2003 01:08:45 -0700), "David S. Miller" says: > As I mentioned for the ipv4 forwarding sysctl bug fix, > if you will set table->data yourself you should return > > 0 (for example "1") from your strategy handler. : > Please fix this, thank you. Here it is. Index: linux-2.6/net/ipv6/addrconf.c =================================================================== RCS file: /home/cvs/linux-2.5/net/ipv6/addrconf.c,v retrieving revision 1.48 diff -u -r1.48 addrconf.c --- linux-2.6/net/ipv6/addrconf.c 25 Jul 2003 23:58:59 -0000 1.48 +++ linux-2.6/net/ipv6/addrconf.c 9 Aug 2003 08:51:42 -0000 @@ -2593,6 +2593,53 @@ return ret; } +static int addrconf_sysctl_forward_strategy(ctl_table *table, + int *name, int nlen, + void *oldval, size_t *oldlenp, + void *newval, size_t newlen, + void **context) +{ + int *valp = table->data; + int new; + + if (!newval || !newlen) + return 0; + if (newlen != sizeof(int)) + return -EINVAL; + if (get_user(new, (int *)newval)) + return -EFAULT; + if (new == *valp) + return 0; + if (oldval && oldlenp) { + size_t len; + if (get_user(len, oldlenp)) + return -EFAULT; + if (len) { + if (len > table->maxlen) + len = table->maxlen; + if (copy_to_user(oldval, valp, len)) + return -EFAULT; + if (put_user(len, oldlenp)) + return 
-EFAULT; + } + } + + if (valp != &ipv6_devconf_dflt.forwarding) { + struct inet6_dev *idev; + if (valp != &ipv6_devconf.forwarding) { + idev = (struct inet6_dev *)table->extra1; + if (unlikely(idev == NULL)) + return -ENODEV; + } else + idev = NULL; + *valp = new; + addrconf_forward_change(idev); + } else + *valp = new; + + return 1; +} + static struct addrconf_sysctl_table { struct ctl_table_header *sysctl_header; @@ -2611,6 +2658,7 @@ .maxlen = sizeof(int), .mode = 0644, .proc_handler = &addrconf_sysctl_forward, + .strategy = &addrconf_sysctl_forward_strategy, }, { .ctl_name = NET_IPV6_HOP_LIMIT, -- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From shmulik.hen@intel.com Sat Aug 9 03:29:19 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 03:29:22 -0700 (PDT) Received: from hermes.iil.intel.com (hermes.iil.intel.com [192.198.152.99]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h79ATGFl023468 for ; Sat, 9 Aug 2003 03:29:18 -0700 Received: from petasus.iil.intel.com (petasus.iil.intel.com [143.185.77.3]) by hermes.iil.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h79ANvM28038 for ; Sat, 9 Aug 2003 10:23:57 GMT Received: from hasmsxvs01.iil.intel.com (hasmsxvs01.iil.intel.com [143.185.63.58]) by petasus.iil.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h79AWD017421 for ; Sat, 9 Aug 2003 10:32:15 GMT Received: from hasmsx331.ger.corp.intel.com ([143.185.63.144]) by hasmsxvs01.iil.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080913361709031 ; Sat, 09 Aug 2003 13:36:17 +0300 Received: from hasmsx403.ger.corp.intel.com ([143.185.63.109]) by hasmsx331.ger.corp.intel.com with Microsoft SMTPSVC(5.0.2195.5329); Sat, 9 Aug 2003 13:29:08 +0300 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-MimeOLE: Produced By Microsoft Exchange V6.0.6375.0 Subject: RE: 
[SET 2][PATCH 2/8][bonding] Propagating master's settings to slaves Date: Sat, 9 Aug 2003 13:29:08 +0300 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: [SET 2][PATCH 2/8][bonding] Propagating master's settings toslaves Thread-Index: AcNd+KHqOJcY40/CSMqRlLdGamwFSwAZpZEA From: "Hen, Shmulik" To: Cc: , X-OriginalArrivalTime: 09 Aug 2003 10:29:08.0350 (UTC) FILETIME=[10D159E0:01C35E61] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h79ATGFl023468 X-archive-position: 4694 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shmulik.hen@intel.com Precedence: bulk X-list: netdev > -----Original Message----- > From: jamal [mailto:hadi@cyberus.ca] > Sent: Saturday, August 09, 2003 1:01 AM > To: Hen, Shmulik > Cc: bonding-devel@lists.sourceforge.net; netdev@oss.sgi.com > Subject: Re: [SET 2][PATCH 2/8][bonding] Propagating master's settings > to slaves > > Shmulik, > > Some of this bonding stuff is pretty scary. Lotsa policies in the > kernel and communication seems to be centred around /proc. > Shouldnt policies on failover be really driven from user space? > Also shouldnt communication be using something like netlink? > > cheers, > jamal > > On Fri, 2003-08-08 at 10:44, Shmulik Hen wrote: > > 2 - Change monitoring function use the new functionality. > > > Not sure I fully understood the concerns above, but I'll try to explain what the change was all about. By monitoring, I meant the 3 timer function running in bonding to monitor link changes and act once a link fail/recovery is detected. The old code used to do all the activity related to changing the current active slave separately in each timer function and it seemed redundant since it was basically the same thing repeated 3 times. 
Instead, we thought it would be best if we put that into 3 new functions - reselect_active, find_best_slave and change_active that do all the actual stuff of swapping an old current with the new one. The change we did in /proc was to reduce the amount of data extracted each time the proc entry is polled. Instead of dumping all the data of all the bond devices that exist, each bond returns just data that is relevant to itself. In the long term, the drive is to move any *smart* code done in the config application into the driver itself and be left with the smallest, most compact application possible. This is the trend we've seen in the VLAN config app, and the bridge module. All the "brain" is in the kernel module and very little should be done in the application. Shmulik. From shmulik.hen@intel.com Sat Aug 9 05:40:55 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 05:41:01 -0700 (PDT) Received: from caduceus.jf.intel.com (fmr06.intel.com [134.134.136.7]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h79CesFl001100 for ; Sat, 9 Aug 2003 05:40:55 -0700 Received: from talaria.jf.intel.com (talaria.jf.intel.com [10.7.209.7]) by caduceus.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h78EdBw27945 for ; Fri, 8 Aug 2003 14:39:11 GMT Received: from orsmsxvs040.jf.intel.com (orsmsxvs040.jf.intel.com [192.168.65.206]) by talaria.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h78E8IT07196 for ; Fri, 8 Aug 2003 14:08:18 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by orsmsxvs040.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080807572403054 ; Fri, 08 Aug 2003 07:57:25 -0700 Content-Type: text/plain; charset="us-ascii" From: Shmulik Hen Reply-To: shmulik.hen@intel.com Organization: Intel corp.
Subject: [SET 2][PATCH 5/8][bonding] Propagating master's settings to slaves Date: Fri, 8 Aug 2003 17:45:10 +0300 User-Agent: KMail/1.4.3 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit To: bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com Message-Id: <200308081745.10710.shmulik.hen@intel.com> X-archive-position: 4695 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shmulik.hen@intel.com Precedence: bulk X-list: netdev 5 - Add support for changing HW address and MTU with proper unwind. diff -Nuarp linux-2.4.22-rc1/drivers/net/bonding/bond_main.c linux-2.4.22-rc1-devel/drivers/net/bonding/bond_main.c --- linux-2.4.22-rc1/drivers/net/bonding/bond_main.c Fri Aug 8 14:03:22 2003 +++ linux-2.4.22-rc1-devel/drivers/net/bonding/bond_main.c Fri Aug 8 14:03:23 2003 @@ -592,6 +592,11 @@ static struct slave *find_best_interface /* #define BONDING_DEBUG 1 */ +#ifdef BONDING_DEBUG +#define dprintk(x...) printk(x...) +#else /* BONDING_DEBUG */ +#define dprintk(x...) do {} while (0) +#endif /* BONDING_DEBUG */ /* several macros */ @@ -680,19 +685,6 @@ update_slave_cnt(bonding_t *bond, int in BUG(); } -/* - * Set MAC. Differs from eth_mac_addr in that we allow changes while - * netif_running(). - */ -static int -bond_set_mac_address(struct net_device *dev, void *p) -{ - struct sockaddr *addr = p; - - memcpy(dev->dev_addr, addr->sa_data, dev->addr_len); - return 0; -} - /* * This function detaches the slave from the list . 
* WARNING: no check is made to verify if the slave effectively @@ -3037,10 +3029,6 @@ static int bond_ioctl(struct net_device case SIOCBONDRELEASE: ret = bond_release(master_dev, slave_dev); break; - case BOND_SETHWADDR_OLD: - case SIOCBONDSETHWADDR: - ret = bond_sethwaddr(master_dev, slave_dev); - break; case BOND_CHANGE_ACTIVE_OLD: case SIOCBONDCHANGEACTIVE: if (USES_PRIMARY(bond_mode)) { @@ -3480,6 +3468,132 @@ static int bond_read_proc(char *buf, cha } #endif /* CONFIG_PROC_FS */ + +/* + * Change HW address + * + * Note that many devices must be down to change the HW address, and + * downing the master releases all slaves. We can make bonds full of + * bonding devices to test this, however. + */ +static inline int +bond_set_mac_address(struct net_device *dev, void *addr) +{ + struct bonding *bond = dev->priv; + struct sockaddr *sa = addr, tmp_sa; + struct slave *slave; + int error; + + dprintk(KERN_INFO "bond_set_mac_address %p %s\n", dev, + dev->name); + + if (!is_valid_ether_addr(sa->sa_data)) { + return -EADDRNOTAVAIL; + } + + for (slave = bond->prev; slave != (struct slave *)bond; + slave = slave->prev) { + dprintk(KERN_INFO "bond_set_mac: slave %p %s\n", slave, + slave->dev->name); + if (slave->dev->set_mac_address == NULL) { + error = -EOPNOTSUPP; + dprintk(KERN_INFO "bond_set_mac EOPNOTSUPP %s\n", + slave->dev->name); + goto unwind; + } + + error = slave->dev->set_mac_address(slave->dev, addr); + if (error) { + /* TODO: consider downing the slave + * and retry ? + * User should expect communications + * breakage anyway until ARP finish + * updating, so... 
+ */ + dprintk(KERN_INFO "bond_set_mac err %d %s\n", + error, slave->dev->name); + goto unwind; + } + } + + /* success */ + memcpy(dev->dev_addr, sa->sa_data, dev->addr_len); + return 0; + +unwind: + memcpy(tmp_sa.sa_data, dev->dev_addr, dev->addr_len); + tmp_sa.sa_family = dev->type; + + for (slave = slave->next; slave != bond->next; + slave = slave->next) { + int tmp_error; + + tmp_error = slave->dev->set_mac_address(slave->dev, &tmp_sa); + if (tmp_error) { + dprintk(KERN_INFO "bond_set_mac_address: " + "unwind err %d dev %s\n", + tmp_error, slave->dev->name); + } + } + + return error; +} + +/* + * Change the MTU of all of a master's slaves to match the master + */ +static inline int +bond_change_mtu(struct net_device *dev, int newmtu) +{ + bonding_t *bond = dev->priv; + slave_t *slave; + int error; + + dprintk(KERN_INFO "CM: b %p nm %d\n", bond, newmtu); + for (slave = bond->prev; slave != (slave_t *)bond; + slave = slave->prev) { + dprintk(KERN_INFO "CM: s %p s->p %p c_m %p\n", slave, + slave->prev, slave->dev->change_mtu); + if (slave->dev->change_mtu) { + error = slave->dev->change_mtu(slave->dev, newmtu); + } else { + slave->dev->mtu = newmtu; + error = 0; + } + + if (error) { + /* If we failed to set the slave's mtu to the new value + * we must abort the operation even in ACTIVE_BACKUP + * mode, because if we allow the backup slaves to have + * different mtu values than the active slave we'll + * need to change their mtu when doing a failover. That + * means changing their mtu from timer context, which + * is probably not a good idea. 
+ */ + dprintk(KERN_INFO "bond_change_mtu err %d %s\n", + error, slave->dev->name); + goto unwind; + } + } + + dev->mtu = newmtu; + return 0; + + +unwind: + for (slave = slave->next; slave != bond->next; + slave = slave->next) { + + if (slave->dev->change_mtu) { + slave->dev->change_mtu(slave->dev, dev->mtu); + } else { + slave->dev->mtu = dev->mtu; + } + } + + return error; +} + static int bond_event(struct notifier_block *this, unsigned long event, void *ptr) { @@ -3572,6 +3686,7 @@ static int __init bond_init(struct net_d dev->stop = bond_close; dev->set_multicast_list = set_multicast_list; dev->do_ioctl = bond_ioctl; + dev->change_mtu = bond_change_mtu; dev->set_mac_address = bond_set_mac_address; dev->tx_queue_len = 0; dev->flags |= IFF_MASTER|IFF_MULTICAST; -- | Shmulik Hen Advanced Network Services | | Israel Design Center, Jerusalem | | LAN Access Division, Platform Networking | | Intel Communications Group, Intel corp. | From shmulik.hen@intel.com Sat Aug 9 05:40:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 05:41:01 -0700 (PDT) Received: from caduceus.jf.intel.com (fmr06.intel.com [134.134.136.7]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h79CesFn001100 for ; Sat, 9 Aug 2003 05:40:56 -0700 Received: from talaria.jf.intel.com (talaria.jf.intel.com [10.7.209.7]) by caduceus.jf.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h78Ed7w27888 for ; Fri, 8 Aug 2003 14:39:08 GMT Received: from orsmsxvs040.jf.intel.com (orsmsxvs040.jf.intel.com [192.168.65.206]) by talaria.jf.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h78E8ET07131 for ; Fri, 8 Aug 2003 14:08:14 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by orsmsxvs040.jf.intel.com (NAVGW 2.5.2.11) with SMTP id M2003080807572021576 ; Fri, 08 Aug 2003 07:57:21 -0700 Content-Type: text/plain; charset="us-ascii" From: Shmulik Hen Reply-To: shmulik.hen@intel.com Organization: 
Intel corp. Subject: [SET 2][PATCH 4/8][bonding] Propagating master's settings to slaves Date: Fri, 8 Aug 2003 17:45:06 +0300 User-Agent: KMail/1.4.3 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit To: bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com Message-Id: <200308081745.06800.shmulik.hen@intel.com> X-archive-position: 4696 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shmulik.hen@intel.com Precedence: bulk X-list: netdev 4 - Decouple promiscuous handling from the multicast mode setting. diff -Nuarp linux-2.4.22-rc1/drivers/net/bonding/bond_main.c linux-2.4.22-rc1-devel/drivers/net/bonding/bond_main.c --- linux-2.4.22-rc1/drivers/net/bonding/bond_main.c Fri Aug 8 14:03:20 2003 +++ linux-2.4.22-rc1-devel/drivers/net/bonding/bond_main.c Fri Aug 8 14:03:21 2003 @@ -1167,23 +1167,22 @@ static inline int dmi_same(struct dev_mc } /* - * Push the promiscuity flag down to all slaves + * Push the promiscuity flag down to appropriate slaves */ static void bond_set_promiscuity(bonding_t *bond, int inc) { slave_t *slave; - switch (multicast_mode) { - case BOND_MULTICAST_ACTIVE : - /* write lock already acquired */ - if (bond->current_slave != NULL) + + if (USES_PRIMARY(bond_mode)) { + if (bond->current_slave) { dev_set_promiscuity(bond->current_slave->dev, inc); - break; - case BOND_MULTICAST_ALL : - for (slave = bond->prev; slave != (slave_t*)bond; slave = slave->prev) + } + + } else { + for (slave = bond->prev; slave != (slave_t*)bond; + slave = slave->prev) { dev_set_promiscuity(slave->dev, inc); - break; - case BOND_MULTICAST_DISABLED : - break; + } } } @@ -1229,20 +1228,23 @@ static void set_multicast_list(struct ne bonding_t *bond = master->priv; struct dev_mc_list *dmi; - if (multicast_mode == BOND_MULTICAST_DISABLED) - return; - /* - * Lock the private data for the master - */ write_lock_bh(&bond->lock); - /* set promiscuity flag to slaves */ + /* + * Do promisc before checking 
multicast_mode + */ if ( (master->flags & IFF_PROMISC) && !(bond->flags & IFF_PROMISC) ) bond_set_promiscuity(bond, 1); if ( !(master->flags & IFF_PROMISC) && (bond->flags & IFF_PROMISC) ) bond_set_promiscuity(bond, -1); + if (multicast_mode == BOND_MULTICAST_DISABLED) { + bond->flags = master->flags; + write_unlock_bh(&bond->lock); + return; + } + /* set allmulti flag to slaves */ if ( (master->flags & IFF_ALLMULTI) && !(bond->flags & IFF_ALLMULTI) ) bond_set_allmulti(bond, 1); @@ -1274,32 +1276,40 @@ static void set_multicast_list(struct ne /* * Update the mc list and multicast-related flags for the new and - * old active slaves (if any) according to the multicast mode + * old active slaves (if any) according to the multicast mode, and + * promiscuous flags unconditionally. */ static void bond_mc_update(bonding_t *bond, slave_t *new, slave_t *old) { struct dev_mc_list *dmi; - switch(multicast_mode) { - case BOND_MULTICAST_ACTIVE : + if (USES_PRIMARY(bond_mode)) { if (bond->device->flags & IFF_PROMISC) { - if (old != NULL && new != old) + if (old) dev_set_promiscuity(old->dev, -1); - dev_set_promiscuity(new->dev, 1); + if (new) + dev_set_promiscuity(new->dev, 1); } + } + + switch(multicast_mode) { + case BOND_MULTICAST_ACTIVE : if (bond->device->flags & IFF_ALLMULTI) { - if (old != NULL && new != old) + if (old) dev_set_allmulti(old->dev, -1); - dev_set_allmulti(new->dev, 1); + if (new) + dev_set_allmulti(new->dev, 1); } /* first remove all mc addresses from old slave if any, and _then_ add them to new active slave */ - if (old != NULL && new != old) { + if (old) { for (dmi = bond->device->mc_list; dmi != NULL; dmi = dmi->next) dev_mc_delete(old->dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); } - for (dmi = bond->device->mc_list; dmi != NULL; dmi = dmi->next) - dev_mc_add(new->dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); + if (new) { + for (dmi = bond->device->mc_list; dmi != NULL; dmi = dmi->next) + dev_mc_add(new->dev, dmi->dmi_addr, dmi->dmi_addrlen, 0); + } break; 
case BOND_MULTICAST_ALL : /* nothing to do: mc list is already up-to-date on all slaves */ @@ -1461,11 +1471,19 @@ static int bond_enslave(struct net_devic } } - if (multicast_mode == BOND_MULTICAST_ALL) { - /* set promiscuity level to new slave */ - if (master_dev->flags & IFF_PROMISC) + /* set promiscuity level to new slave */ + if (master_dev->flags & IFF_PROMISC) { + /* If the mode USES_PRIMARY, then the new slave gets the + * master's promisc (and mc) settings only if it becomes the + * current_slave, and that is taken care of later when calling + * bond_change_active() + */ + if (!USES_PRIMARY(bond_mode)) { dev_set_promiscuity(slave_dev, 1); + } + } + if (multicast_mode == BOND_MULTICAST_ALL) { /* set allmulti level to new slave */ if (master_dev->flags & IFF_ALLMULTI) dev_set_allmulti(slave_dev, 1); @@ -2040,16 +2058,22 @@ static int bond_release(struct net_devic return -EINVAL; } + /* unset promiscuity level from slave */ + if (master->flags & IFF_PROMISC) { + /* If the mode USES_PRIMARY, then we should only remove its + * promisc settings if it was the current_slave, but that was + * already taken care of above when we detached the slave + */ + if (!USES_PRIMARY(bond_mode)) { + dev_set_promiscuity(slave, -1); + } + } + /* undo settings and restore original values */ - if (multicast_mode == BOND_MULTICAST_ALL) { /* flush master's mc_list from slave */ bond_mc_list_flush (slave, master); - /* unset promiscuity level from slave */ - if (master->flags & IFF_PROMISC) - dev_set_promiscuity(slave, -1); - /* unset allmulti level from slave */ if (master->flags & IFF_ALLMULTI) dev_set_allmulti(slave, -1); @@ -2145,17 +2169,17 @@ static int bond_release_all(struct net_d */ write_unlock_bh(&bond->lock); - if (multicast_mode == BOND_MULTICAST_ALL - || (multicast_mode == BOND_MULTICAST_ACTIVE - && old_current == our_slave)) { + /* unset promiscuity level from slave */ + if (master->flags & IFF_PROMISC) { + if (!USES_PRIMARY(bond_mode)) { + 
dev_set_promiscuity(slave_dev, -1); + } + } + if (multicast_mode == BOND_MULTICAST_ALL) { /* flush master's mc_list from slave */ bond_mc_list_flush (slave_dev, master); - /* unset promiscuity level from slave */ - if (master->flags & IFF_PROMISC) - dev_set_promiscuity(slave_dev, -1); - /* unset allmulti level from slave */ if (master->flags & IFF_ALLMULTI) dev_set_allmulti(slave_dev, -1); @@ -3650,6 +3674,12 @@ static int __init bonding_init(void) mode == NULL ? "NULL" : mode); return -EINVAL; } + } + + if (USES_PRIMARY(bond_mode)) { + multicast_mode = BOND_MULTICAST_ACTIVE; + } else { + multicast_mode = BOND_MULTICAST_ALL; } if (multicast) { -- | Shmulik Hen Advanced Network Services | | Israel Design Center, Jerusalem | | LAN Access Division, Platform Networking | | Intel Communications Group, Intel corp. | From mailperson@alexandria.cc Sat Aug 9 06:44:43 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 06:44:52 -0700 (PDT) Received: from alexandria.cc (user-0ccetq0.cable.mindspring.com [24.199.119.64]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h79DifFl005480 for ; Sat, 9 Aug 2003 06:44:42 -0700 From: mailperson@alexandria.cc To: netdev@oss.sgi.com Date: 09 Aug 2003 09:44:34 -0400 Message-ID: <20030809094434.A6929C2F7F46598B@alexandria.cc> MIME-Version: 1.0 X-archive-position: 4697 Subject: (no subject) X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mailperson@alexandria.cc Precedence: bulk X-list: netdev From ak@suse.de Sat Aug 9 07:15:41 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 07:15:45 -0700 (PDT) Received: from Cantor.suse.de (mail.suse.de [213.95.15.193]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h79EFdFl007586 for ; Sat, 9 Aug 2003 07:15:40 -0700 Received: from Hermes.suse.de (Hermes.suse.de [213.95.15.136]) by Cantor.suse.de (Postfix) with ESMTP id D14C214B94; Sat, 9 Aug 2003 16:15:33 +0200 (MEST) Date: Sat, 9 Aug 2003 16:15:33 
+0200 From: Andi Kleen To: Jeff Garzik Cc: netdev@oss.sgi.com Subject: Re: RFR: new SiS gige driver Message-ID: <20030809141533.GB4539@wotan.suse.de> References: <20030808173932.GA4077@gtf.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20030808173932.GA4077@gtf.org> X-archive-position: 4698 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ak@suse.de Precedence: bulk X-list: netdev

> * since it's gige, it should definitely be using NAPI

* it does not check its kmalloc returns.
* and doesn't set a DMA mask. This means it is not DAC capable?
* does not use pci_dma_* mappings like David noted.
* no hardware checksum support? This looks quite poor for a Gigabit chipset.
* same with sendpage - gigabit should have that.
* missing NAPI and even with NAPI it should support interrupt mitigation - but the driver doesn't seem to do that. I suspect it's very easily DoSable in the current form. Even with NAPI interrupt mitigation would be needed, otherwise the start/stop of polling mode can be too expensive for moderate load.
* netif_stop_queue in hard_start_xmit is not protected against the interrupt by the spinlock. That's racy, isn't it?
-Andi

From ak@suse.de Sat Aug 9 08:27:54 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 08:28:01 -0700 (PDT) Received: from Cantor.suse.de (mail.suse.de [213.95.15.193]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h79FRqFl011801 for ; Sat, 9 Aug 2003 08:27:53 -0700 Received: from Hermes.suse.de (Hermes.suse.de [213.95.15.136]) by Cantor.suse.de (Postfix) with ESMTP id 9D0A114B7B; Sat, 9 Aug 2003 17:27:47 +0200 (MEST) Date: Sat, 9 Aug 2003 17:27:47 +0200 From: Andi Kleen To: Jeff Garzik Cc: Andi Kleen , netdev@oss.sgi.com Subject: Re: RFR: new SiS gige driver Message-ID: <20030809152747.GA1618@wotan.suse.de> References: <20030808173932.GA4077@gtf.org> <20030809141533.GB4539@wotan.suse.de> <3F350CC8.3090605@pobox.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3F350CC8.3090605@pobox.com> X-archive-position: 4699 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ak@suse.de Precedence: bulk X-list: netdev

On Sat, Aug 09, 2003 at 11:01:28AM -0400, Jeff Garzik wrote:
> Andi Kleen wrote:
> >* netif_stop_queue in hard_start_xmit is not protected against the
> >interrupt by the spinlock. That's racy, isn't it?
>
> Shouldn't be, if done right. If the interrupt runs a TX completion
> cycle, it will run the code
>     if (work_done && netif_queue_stopped(dev))
>         netif_wake_queue(dev)
>
> Since ->hard_start_xmit is guaranteed never to be called if the queue is
> stopped, you also guaranteed that netif_wake_queue and ->hard_start_xmit
> are mutually exclusive.

The race is

CPU0                            CPU1

hard_start_xmit
release lock
                                TX finished interrupt
my queue is full...
                                clean some packets
                                netif_wake_queue
netif_stop_queue

The netif_wake_queue is lost. It's only deadly when clean some packets clears the full TX ring, otherwise it will likely recover with the next TX finished interrupt but give suboptimal performance.
Fix is to do the my queue is full -> netif_stop_queue inside the spinlock. -Andi From jgarzik@pobox.com Sat Aug 9 09:00:40 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 09:00:46 -0700 (PDT) Received: from www.linux.org.uk (IDENT:U+uCazPeXDTaOTgsXiae7FItTD8TYQfW@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h79G0dFl013827 for ; Sat, 9 Aug 2003 09:00:40 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19lVDv-0007Rh-Rp; Sat, 09 Aug 2003 16:01:39 +0100 Message-ID: <3F350CC8.3090605@pobox.com> Date: Sat, 09 Aug 2003 11:01:28 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Andi Kleen CC: netdev@oss.sgi.com Subject: Re: RFR: new SiS gige driver References: <20030808173932.GA4077@gtf.org> <20030809141533.GB4539@wotan.suse.de> In-Reply-To: <20030809141533.GB4539@wotan.suse.de> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4700 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Andi Kleen wrote: > * netif_stop_queue in hard_start_xmit is not protected against the interrupt by the > spinlock. That's racy, isn't it? Shouldn't be, if done right. If the interrupt runs a TX completion cycle, it will run the code if (work_done && netif_queue_stopped(dev)) netif_wake_queue(dev) Since ->hard_start_xmit is guaranteed never to be called if the queue is stopped, you also guaranteed that netif_wake_queue and ->hard_start_xmit are mutually exclusive. 
This of course assumes certain details about the driver implementation, which may be missing from that driver's TX completion handler :) Thanks for your, and everybody else's comments. They are being saved. Jeff From davem@redhat.com Sat Aug 9 14:49:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 14:49:51 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h79LniFl003121 for ; Sat, 9 Aug 2003 14:49:45 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id OAA02462; Sat, 9 Aug 2003 14:44:16 -0700 Date: Sat, 9 Aug 2003 14:44:16 -0700 From: "David S. Miller" To: "YOSHIFUJI Hideaki / _$B5HF#1QL@" Cc: jan.oravec@6com.sk, netdev@oss.sgi.com Subject: Re: [PATCH] IPV6: strategy hander for net.ipv6.conf.*.forwarding (is Re: problem setting net.ipvX.conf.all.forwarding via sysctl() system call) Message-Id: <20030809144416.181cbad1.davem@redhat.com> In-Reply-To: <20030809.192158.47862071.yoshfuji@linux-ipv6.org> References: <20030808093704.GA18131@wsx.ksp.sk> <20030808.185135.112441851.yoshfuji@linux-ipv6.org> <20030809010845.01ebcbe9.davem@redhat.com> <20030809.192158.47862071.yoshfuji@linux-ipv6.org> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4701 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Sat, 09 Aug 2003 19:21:58 +0900 (JST) YOSHIFUJI Hideaki / _$B5HF#1QL@ wrote: > In article <20030809010845.01ebcbe9.davem@redhat.com> (at Sat, 9 Aug 2003 01:08:45 -0700), "David S. 
Miller" says: > > > As I mentioned for the ipv4 forwarding sysctl bug fix, > > if you will set table->data yourself you should return > > > 0 (for example "1") from your strategy handler. > : > > Please fix this, thank you. > > Here it is. Patch applied, thank you. From kuznet@ms2.inr.ac.ru Sat Aug 9 20:04:21 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 20:04:26 -0700 (PDT) Received: from dub.inr.ac.ru (dub.inr.ac.ru [193.233.7.105]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7A34JFl024026 for ; Sat, 9 Aug 2003 20:04:20 -0700 Received: (from kuznet@localhost) by dub.inr.ac.ru (8.6.13/ANK) id HAA20709; Sun, 10 Aug 2003 07:03:48 +0400 From: kuznet@ms2.inr.ac.ru Message-Id: <200308100303.HAA20709@dub.inr.ac.ru> Subject: Re: Fw: [PATCH] set NLM_F_MULTI in answer of RTM_GETADDR dump answer To: davem@redhat.com (David S. Miller) Date: Sun, 10 Aug 2003 07:03:48 +0400 (MSD) Cc: hadi@cyberus.ca, netdev@oss.sgi.com In-Reply-To: <20030809011153.46a99ef2.davem@redhat.com> from "David S. Miller" at Aug 09, 2003 01:11:53 AM X-Mailer: ELM [version 2.5 PL6] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-archive-position: 4702 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Hello! > After sending RTM_GETADDR dump request to netlink socket, received multipart > answer does not have NLM_F_MULTI set on each message. This patch fix that. OK. Actually, this flag was never used because of its redundancy it is not the only place where it is not set. 
Alexey From davem@redhat.com Sat Aug 9 20:53:19 2003 Received: with ECARTIS (v1.0.0; list netdev); Sat, 09 Aug 2003 20:53:54 -0700 (PDT) Received: from pizda.ninka.net (IDENT:root@pizda.ninka.net [216.101.162.242]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7A3rIFl027672 for ; Sat, 9 Aug 2003 20:53:19 -0700 Received: from pizda.ninka.net (IDENT:davem@localhost.localdomain [127.0.0.1]) by pizda.ninka.net (8.9.3/8.9.3) with SMTP id UAA03046; Sat, 9 Aug 2003 20:47:28 -0700 Date: Sat, 9 Aug 2003 20:47:28 -0700 From: "David S. Miller" To: kuznet@ms2.inr.ac.ru Cc: hadi@cyberus.ca, netdev@oss.sgi.com Subject: Re: Fw: [PATCH] set NLM_F_MULTI in answer of RTM_GETADDR dump answer Message-Id: <20030809204728.03100f55.davem@redhat.com> In-Reply-To: <200308100303.HAA20709@dub.inr.ac.ru> References: <20030809011153.46a99ef2.davem@redhat.com> <200308100303.HAA20709@dub.inr.ac.ru> X-Mailer: Sylpheed version 0.9.2 (GTK+ 1.2.6; sparc-unknown-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-archive-position: 4703 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davem@redhat.com Precedence: bulk X-list: netdev On Sun, 10 Aug 2003 07:03:48 +0400 (MSD) kuznet@ms2.inr.ac.ru wrote: > > After sending RTM_GETADDR dump request to netlink socket, received multipart > > answer does not have NLM_F_MULTI set on each message. This patch fix that. > > OK. > > Actually, this flag was never used because of its redundancy > it is not the only place where it is not set. Ok, I applied the patch. Thanks for the review Alexey. 
From felix@allot.com Sun Aug 10 00:32:20 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 10 Aug 2003 00:32:29 -0700 (PDT) Received: from mxout2.netvision.net.il (mxout2.netvision.net.il [194.90.9.21]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7A7WIFl009758 for ; Sun, 10 Aug 2003 00:32:20 -0700 Received: from exg.allot.com ([199.203.223.202]) by mxout2.netvision.net.il (iPlanet Messaging Server 5.2 HotFix 1.14 (built Mar 18 2003)) with ESMTP id <0HJE00EHM7LN3Z@mxout2.netvision.net.il> for netdev@oss.sgi.com; Sun, 10 Aug 2003 10:32:12 +0300 (IDT) Received: from allot.com (199.203.223.201 [199.203.223.201]) by exg.allot.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id QNW75ASP; Sun, 10 Aug 2003 10:35:17 +0200 Date: Sun, 10 Aug 2003 10:32:43 +0300 From: Felix Radensky Subject: Re: Ethernet bridge performance To: hadi@cyberus.ca Cc: Robert Olsson , Ben Greear , netdev@oss.sgi.com Message-id: <3F35F51B.7080301@allot.com> Organization: Allot Communications Ltd. MIME-version: 1.0 Content-type: multipart/alternative; boundary="Boundary_(ID_UirtBd6b6YWS5joOM6NHtg)" X-Accept-Language: en-us, en User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030208 Netscape/7.02 References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> <3F328A0F.3040005@allot.com> <16178.41976.3643.584516@robur.slu.se> <1060284094.1024.36.camel@jzny.localdomain> X-archive-position: 4704 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: felix@allot.com Precedence: bulk X-list: netdev

--Boundary_(ID_UirtBd6b6YWS5joOM6NHtg) Content-type: text/plain; charset=us-ascii; format=flowed Content-transfer-encoding: 7BIT

Hi, Jamal

I guess you were not reading my first posting
very carefully :)

2.4.22 has NAPI capable e1000 driver and I've
compiled the driver with NAPI support.

So running non-NAPI driver is not my problem.

Felix.
jamal wrote: >Actually seems his biggest problem is he is not running >the NAPI driver > >cheers, >jamal > >On Thu, 2003-08-07 at 15:09, Robert Olsson wrote: > > >>Felix Radensky writes: >> > Thanks for your help, Ben. What is skb-recycle patch >> > and where can I find it ? >> >> It's experimental and not updated for almost a year and current >> implementation does not add anything to SMP. Got some idea how >> to improve this... but try to keep to slab as long as possible >> it has been improved. >> >> Routing/bridging on SMP has affinty problem. If you are passing >> skb's say from eth0 to eth1 and they are bound on different CPU's >> you get cache boucing since the TX-interrupts come on another CPU. >> >> In a recent test with pktgen: >> 300 kpps with TX interrupts on same CPU as sender. >> 198 kpps with TX intr on different CPU as sender. >> >> Recycling tries to address this but current implementation fails >> as said. >> >> But you are probably hit by something else... Check were the drops >> happens qdisc?. NIC ring RX/TX size, Number of interrupts. ksoftird >> priority, link HW_FLOW control, checksumming, affinity etc. >> >> >> Cheers. >> --ro >> >> >> >> > > > --Boundary_(ID_UirtBd6b6YWS5joOM6NHtg) Content-type: text/html; charset=us-ascii Content-transfer-encoding: 7BIT Hi, Jamal

--Boundary_(ID_UirtBd6b6YWS5joOM6NHtg)-- From felix@allot.com Sun Aug 10 02:25:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 10 Aug 2003 02:25:09 -0700 (PDT) Received: from mxout4.netvision.net.il (mxout4.netvision.net.il [194.90.9.27]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7A9OwFl022363 for ; Sun, 10 Aug 2003 02:24:59 -0700 Received: from exg.allot.com ([199.203.223.202]) by mxout4.netvision.net.il (iPlanet Messaging Server 5.2 HotFix 1.14 (built Mar 18 2003)) with ESMTP id <0HJE00BA2BNF0T@mxout4.netvision.net.il> for netdev@oss.sgi.com; Sun, 10 Aug 2003 11:59:40 +0300 (IDT) Received: from allot.com (199.203.223.201 [199.203.223.201]) by exg.allot.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id QNW75BBQ; Sun, 10 Aug 2003 12:02:45 +0200 Date: Sun, 10 Aug 2003 12:00:12 +0300 From: Felix Radensky Subject: Re: e100 "Ferguson" release To: Jeff Garzik Cc: "Feldman, Scott" , Ben Greear , netdev@oss.sgi.com Message-id: <3F36099C.7090002@allot.com> Organization: Allot Communications Ltd. MIME-version: 1.0 Content-type: multipart/alternative; boundary="Boundary_(ID_NRV+4kR+TEKMqsKNRaPtwQ)" X-Accept-Language: en-us, en User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20030208 Netscape/7.02 References: <20030805152418.GB6695@gtf.org> X-archive-position: 4705 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: felix@allot.com Precedence: bulk X-list: netdev --Boundary_(ID_NRV+4kR+TEKMqsKNRaPtwQ) Content-type: text/plain; charset=us-ascii; format=flowed Content-transfer-encoding: 7BIT Hi, Jeff, Scott Are you planning to fix this before 2.4.22-final ? Thanks. Felix. Jeff Garzik wrote: >On Tue, Aug 05, 2003 at 08:19:25AM -0700, Feldman, Scott wrote: > > >>>I've also noticed that the number of hard_start_xmit failures >>>in e1000 has increased significantly in version 5.1.13-k1. 
In >>>version 5.0.43-k1 the number of failures was much smaller. >>> >>> >>Interesting. Felix, would you undo the change[1] below in 5.1.13-k1 and >>see what happens? With the change below, 5.1.13 would be more >>aggressive on Tx cleanup, so we'll be quicker waking the queue than >>before. >> >>-scott >> >> for(i = 0; i < E1000_MAX_INTR; i++) >>- if(!e1000_clean_rx_irq(adapter) && >>+ if(!e1000_clean_rx_irq(adapter) & >> !e1000_clean_tx_irq(adapter)) >> break; >> >>[1] Something still bothers me about this new form where we're mixing a >>bit-wise operator with logical operands. Should this bother me? >> >> > >It doesn't matter to the compiler if you make it explicit: > > unsigned int rx_work = e1000_clean_rx_irq(); > unsigned int tx_work = e1000_clean_tx_irq(); > if (!rx_work && !tx_work) > break; > > > > --Boundary_(ID_NRV+4kR+TEKMqsKNRaPtwQ) Content-type: text/html; charset=us-ascii Content-transfer-encoding: 7BIT Hi, Jeff, Scott

--Boundary_(ID_NRV+4kR+TEKMqsKNRaPtwQ)-- From arekm@risca.sse.pl Sun Aug 10 07:06:52 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 10 Aug 2003 07:07:01 -0700 (PDT) Received: from mail.perfopol.pl (exim@perfo.perfopol.pl [213.25.186.10]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7AE6nFl018183 for ; Sun, 10 Aug 2003 07:06:52 -0700 Received: from amavis by mail.perfopol.pl with scanned-ok (Exim 3.36 #1) id 19lqqM-0001Xb-00 for netdev@oss.sgi.com; Sun, 10 Aug 2003 16:06:46 +0200 Received: from uucp by mail.perfopol.pl with local-rmail (Exim 3.36 #1) id 19lqqK-0001XT-00 for netdev@oss.sgi.com; Sun, 10 Aug 2003 16:06:44 +0200 Received: from arekm by risca.sse.pl with local (Exim 4.20) id 19lqDQ-0004CV-4t for netdev@oss.sgi.com; Sun, 10 Aug 2003 15:26:32 +0200 From: Arkadiusz Miskiewicz (by way of Arkadiusz Miskiewicz ) Organization: SelfOrganizing Subject: watchdog daemon causes kernel oops (2.4.18, 2.4.20, 2.4.21) Date: Sun, 10 Aug 2003 15:26:31 +0200 User-Agent: KMail/1.5.9 To: netdev@oss.sgi.com MIME-Version: 1.0 Content-Disposition: inline Content-Type: text/plain; charset="iso-8859-2" Message-Id: <200308101526.31932.misiek@pld.ORG.PL> X-Virus-Scanned: by AMaViS Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id h7AE6nFl018183 X-archive-position: 4706 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: arekm@pld-linux.org Precedence: bulk X-list: netdev Hi, I'm using watchdog daemon (ftp://ftp.debian.org/debian/pool/main/w/watchdog/watchdog_5.2.4.orig.tar.gz) . The problem is that kernels oops if watchdog is started and it uses softdog driver and then some networking operation like loading driver for network card and setting it up or doing something with nfs occurrs. If network driver is loaded before starting watchdog then everything is fine until for example rmmod network module and load it again+try to setup some ip. 
I've checked it on 2.4.18, 2.4.20, 2.4.21 - everywhere oops. I've also asked two other person to check this - for them it also oopses. Oppses from 2.4.21: Oops: 0002 CPU: 0 EIP: 0010:[] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010202 eax: d08f33a0 ebx: d08ea000 ecx: 00007123 edx: d0864020 esi: d08f02a0 edi: d08ea000 ebp: bffffe7e esp: cb77df80 ds: 0018 es: 0018 ss: 0018 Process rmmod (pid: 21701, stackpage=cb77d000) Stack: d08ea000 00000000 d08ede27 d08f02a0 c011a2b3 d08ea000 00000000 cf247000 bffffe7e c0119751 d08ea000 00000000 cb77c000 00000001 bfffecec bfffecec c010864b bffffe7e bffffd94 bfffecec 00000001 bfffecec bfffecec 00000081 Call Trace: [] [] [] [] [] Code: 89 50 04 89 02 c7 06 00 00 00 00 c7 46 04 00 00 00 00 8b 1d >>EIP; c0196687 <===== >> >>eax; d08f33a0 <[softdog].bss.end+cdd/793d> >>ebx; d08ea000 <[3c59x]__module_kernel_version+0/20> >>edx; d0864020 <[aic7xxx]aic7xxx_pci_driver+0/3f> >>esi; d08f02a0 <[3c59x]vortex_driver+0/3f> >>edi; d08ea000 <[3c59x]__module_kernel_version+0/20> >>esp; cb77df80 <___strtok+b4ad618/1053e698> Trace; d08ede27 <[3c59x]vortex_cleanup+13/25> Trace; d08f02a0 <[3c59x]vortex_driver+0/3f> Trace; c011a2b3 Trace; c0119751 Trace; c010864b <__up_wakeup+f87/1334> Code; c0196687 00000000 <_EIP>: Code; c0196687 <===== 0: 89 50 04 mov %edx,0x4(%eax) <===== Code; c019668a 3: 89 02 mov %eax,(%edx) Code; c019668c 5: c7 06 00 00 00 00 movl $0x0,(%esi) Code; c0196692 b: c7 46 04 00 00 00 00 movl $0x0,0x4(%esi) Code; c0196699 12: 8b 1d 00 00 00 00 mov 0x0,%ebx and second one Unable to handle kernel paging request at virtual address 2cd08f22 d08ed39f *pde = 00000000 Oops: 0000 CPU: 0 EIP: 0010:[] Not tainted Using defaults from ksymoops -t elf32-i386 -a i386 EFLAGS: 00010246 eax: 00000802 ebx: 00000020 ecx: 00000020 edx: e40ce40e esi: ce6de400 edi: ce6de580 ebp: 0000e400 esp: ce5b5e90 ds: 0018 es: 0018 ss: 0018 Process ip (pid: 1071, stackpage=ce5b5000) Stack: 000001f0 ce6de580 00000020 00000020 01000000 
ce6de478 000001f0 e40ce40e e406e40a 0000782d d08ed7da ce6de400 ce6de400 00000000 00001002 00000000 ce6de590 c01a8b19 ce6de400 ce6de400 00001003 c01a9bd5 ce6de400 ce5b5f48 Call Trace: [] [] [] [] [] [] [] [] [] Code: 66 a0 22 8f d0 2c 31 db 8a 49 28 88 4c 24 1b 8b b7 54 01 00 >>EIP; d08ed39f <[3c59x]vortex_down+47/bc> <===== >> >>esi; ce6de400 <___strtok+e40da98/1053e698> >>edi; ce6de580 <___strtok+e40dc18/1053e698> >>esp; ce5b5e90 <___strtok+e2e5528/1053e698> Trace; d08ed7da <[3c59x]netdev_ethtool_ioctl+3e/128> Trace; c01a8b19 Trace; c01a9bd5 Trace; c01dbd91 Trace; c01ddd57 Trace; c01a27f5 Trace; c0142e17 Trace; c010873c <__up_wakeup+1078/1334> Trace; c010864b <__up_wakeup+f87/1334> Code; d08ed39f <[3c59x]vortex_down+47/bc> 00000000 <_EIP>: Code; d08ed39f <[3c59x]vortex_down+47/bc> <===== 0: 66 data16 <===== Code; d08ed3a0 <[3c59x]vortex_down+48/bc> 1: a0 22 8f d0 2c mov 0x2cd08f22,%al Code; d08ed3a5 <[3c59x]vortex_down+4d/bc> 6: 31 db xor %ebx,%ebx Code; d08ed3a7 <[3c59x]vortex_down+4f/bc> 8: 8a 49 28 mov 0x28(%ecx),%cl Code; d08ed3aa <[3c59x]vortex_down+52/bc> b: 88 4c 24 1b mov %cl,0x1b(%esp,1) Code; d08ed3ae <[3c59x]vortex_down+56/bc> f: 8b b7 54 01 00 00 mov 0x154(%edi),%esi -- Arkadiusz Mi¶kiewicz CS at FoE, Wroclaw University of Technology arekm@sse.pl AM2-6BONE, 1024/3DB19BBD, arekm(at)ircnet, PLD/Linux - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- Arkadiusz Mi¶kiewicz CS at FoE, Wroclaw University of Technology arekm@sse.pl AM2-6BONE, 1024/3DB19BBD, arekm(at)ircnet, PLD/Linux From greearb@candelatech.com Sun Aug 10 11:13:30 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 10 Aug 2003 11:13:35 -0700 (PDT) Received: from grok.yi.org (evrtwa1-ar2-4-33-045-074.evrtwa1.dsl-verizon.net [4.33.45.74]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7AIDTFl027086 
for ; Sun, 10 Aug 2003 11:13:29 -0700 Received: from candelatech.com (localhost.localdomain [127.0.0.1]) by grok.yi.org (8.12.8/8.12.8) with ESMTP id h7AIDFtf019363; Sun, 10 Aug 2003 11:13:15 -0700 Message-ID: <3F368B3A.3070009@candelatech.com> Date: Sun, 10 Aug 2003 11:13:14 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030529 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Felix Radensky CC: Robert Olsson , netdev@oss.sgi.com, "Feldman, Scott" Subject: Re: Ethernet bridge performance References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> <3F328A0F.3040005@allot.com> <16178.41976.3643.584516@robur.slu.se> <3F3601F3.6000001@allot.com> In-Reply-To: <3F3601F3.6000001@allot.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4707 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Felix Radensky wrote: > I've mentioned in my first post that there are zero drops at driver level > (as shown by ifconfig). I'm also using IRQ affinity feature, binding > interrupts > of eth0 and eth1 to CPU0, so I guess NIC ring RX/TX size, link HW_FLOW > control > and affinity are not my problems. Maybe try un-binding the cpus, or binding them to different procs? > > Speaking of checksumming. This can be a problem indeed. ethtool -S shows > a lot > of rx_csum_offload errors. Scott, what could possibly be a problem ? The > NIC > is dual port 82546, driver is 5.1.13-k1. Why would a bridge be checksumming anything? > > I've failed to find a discussion about ksoftird priority. Can someone > please > provide a link. I've increased my performance by setting softirqd priority to -18, but since you seem to be dropping someplace other than the driver, I'm not sure that will help. > > Thanks a lot. > > Felix. 
> -- Ben Greear Candela Technologies Inc http://www.candelatech.com From ak@suse.de Sun Aug 10 12:47:31 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 10 Aug 2003 12:47:39 -0700 (PDT) Received: from Cantor.suse.de (mail.suse.de [213.95.15.193]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7AJlUFl000556 for ; Sun, 10 Aug 2003 12:47:31 -0700 Received: from Hermes.suse.de (Hermes.suse.de [213.95.15.136]) by Cantor.suse.de (Postfix) with ESMTP id 9010F149EF; Sun, 10 Aug 2003 21:47:24 +0200 (MEST) Date: Sun, 10 Aug 2003 21:47:23 +0200 From: Andi Kleen To: Ben Greear Cc: Felix Radensky , Robert Olsson , netdev@oss.sgi.com, "Feldman, Scott" Subject: Re: Ethernet bridge performance Message-ID: <20030810194723.GA14822@wotan.suse.de> References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> <3F328A0F.3040005@allot.com> <16178.41976.3643.584516@robur.slu.se> <3F3601F3.6000001@allot.com> <3F368B3A.3070009@candelatech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3F368B3A.3070009@candelatech.com> X-archive-position: 4708 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ak@suse.de Precedence: bulk X-list: netdev > >Speaking of checksumming. This can be a problem indeed. ethtool -S shows > >a lot > >of rx_csum_offload errors. Scott, what could possibly be a problem ? The > >NIC > >is dual port 82546, driver is 5.1.13-k1. > > Why would a bridge be checksumming anything? A NIC normally checksums every incoming packet when hardware checksumming is enabled. But of course the bridge doesn't use that information. 
-Andi
From Robert.Olsson@data.slu.se Sun Aug 10 14:49:57 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 10 Aug 2003 14:50:03 -0700 (PDT) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7ALntFl008644 for ; Sun, 10 Aug 2003 14:49:57 -0700 Received: (from robert@localhost) by robur.slu.se (8.9.3p2/8.9.3) id XAA19725; Sun, 10 Aug 2003 23:49:43 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16182.48631.720695.768876@robur.slu.se> Date: Sun, 10 Aug 2003 23:49:43 +0200 To: Felix Radensky Cc: Robert Olsson , Ben Greear , netdev@oss.sgi.com, "Feldman, Scott" Subject: Re: Ethernet bridge performance In-Reply-To: <3F3601F3.6000001@allot.com> References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> <3F328A0F.3040005@allot.com> <16178.41976.3643.584516@robur.slu.se> <3F3601F3.6000001@allot.com> X-Mailer: VM 6.92 under Emacs 19.34.1 X-archive-position: 4709 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev

Felix Radensky writes:
> Is slab good enough in 2.4 ? I was thinking that one of the goals
> of skb-recycle patch was to avoid skb allocations and deallocations
> which consume quite a lot of CPU time (as profile shows). Are you
> saying that your patch is not helping to reduce CPU load ?

Try to understand why/where packets get dropped in your setup to start with. Bridging shouldn't be different from routing, which I experiment with. Check /proc/interrupts, /proc/net/softnet_stat and check for drops at qdisc ("tc -s qdisc"; you might have to re-add the qdisc just to get the stats).

Possibly your TX-side cannot keep up with RX. Often the TX ring is not cleared aggressively enough at high rates due to interrupt mitigation etc., or possibly HW_FLOWCTRL from the sink device. Disable it for testing.

When you cure the cause of the in-kernel drops you should have packet drops on the DMA-ring (in the driver) with NAPI drivers, and no unnecessary skb allocation/CPU use or DMA's.

> echo 1 > /proc/irq/48/smp_affinity
> to avoid this kind of problem, or something else is required ?

If you have both incoming and outgoing on the same CPU there will be no cache bouncing of course, and a UP kernel would be faster if this is all your job.

Cheers.
--ro
From Robert.Olsson@data.slu.se Sun Aug 10 15:23:04 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 10 Aug 2003 15:23:09 -0700 (PDT) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7AMN3Fl012194 for ; Sun, 10 Aug 2003 15:23:03 -0700 Received: (from robert@localhost) by robur.slu.se (8.9.3p2/8.9.3) id AAA19854; Mon, 11 Aug 2003 00:22:50 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16182.50618.69504.611254@robur.slu.se> Date: Mon, 11 Aug 2003 00:22:50 +0200 To: "Feldman, Scott" Cc: "Jeff Garzik" , "Samuel Flory" , Subject: RE: More 2.4.22pre10 ACPI breakage In-Reply-To: References: X-Mailer: VM 6.92 under Emacs 19.34.1 X-archive-position: 4710 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev

Feldman, Scott writes:
> NAPI always-poll mode...that would be fun to play with...
> Is this what you're thinking: 1) block any place the driver enables
> interrupts so interrupts stay disabled, 2) ignore netif_rx_complete so
> we stay in polling mode, 3) ignore return code from netdev->poll.
>
> For 1), the driver needs some way to know that we're in always-poll-mode
> so enabling interrupts is a nop.
>

It will work but I doubt the usefulness of it as we spin aggressively even when there is low or no load.
This as NAPI tries to serve your dev->poll fastest possible given the fairness conditions are met. I could think of a variant... As dev->poll is callback we could possibly schedule (and delay) via a timer or something with in turn does the the schedule dev->poll for us. We have to return netif_rx_complete and have RX-buffers etc. > Just thinking out loud - haven't tried any of this. Same here... :-) Cheers. --ro From hadi@cyberus.ca Sun Aug 10 19:51:58 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 10 Aug 2003 19:52:05 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7B2pvFl023976 for ; Sun, 10 Aug 2003 19:51:58 -0700 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19m2mq-000JaO-00; Sun, 10 Aug 2003 22:51:56 -0400 Subject: RE: [SET 2][PATCH 2/8][bonding] Propagating master's settings to slaves From: jamal Reply-To: hadi@cyberus.ca To: "Hen, Shmulik" Cc: bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com In-Reply-To: References: Content-Type: text/plain Organization: jamalopolis Message-Id: <1060570284.1056.15.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 10 Aug 2003 22:51:25 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4711 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Sat, 2003-08-09 at 06:29, Hen, Shmulik wrote: > > > > Not sure I fully understood the concerns above, but I'll try > to explain what the change was all about. > I think it wasnt the one specific change rather a few posted that i spent a minute or two staring at. And you confirm my suspicion below. [..] 
> > In the lonf term, the drive is to move any *smart* code done in > the config application into the driver itself and be left with > the smallest, most compact application as possible. This is the > trend we've seen in the VLAN config app, and the bridge module. > All the "brain" is in the kernel module and very little should be > done in the application. I am not very familiar with the bonding code although i think you guys have been doing very good work since you got involved. In any case the approach you state above is wrong. Actually Stephen Hemminger and I discussed this for bridging. Post 2.6 he is going to remove a lot of the bridge policy (or "brain" as you call it) out of the kernel. Netlink for kernel<->userspace not /proc. I think we should head towards that direction so we can have more sophisticated management. Thoughts? cheers, jamal From hadi@cyberus.ca Sun Aug 10 20:45:38 2003 Received: with ECARTIS (v1.0.0; list netdev); Sun, 10 Aug 2003 20:45:47 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7B3jbFl024627 for ; Sun, 10 Aug 2003 20:45:38 -0700 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.99.32] helo=[10.0.0.9]) by mail.cyberus.ca with esmtp (Exim 4.12) id 19m2qu-000Jri-00; Sun, 10 Aug 2003 22:56:09 -0400 Subject: Re: Ethernet bridge performance From: jamal Reply-To: hadi@cyberus.ca To: Felix Radensky Cc: Robert Olsson , Ben Greear , netdev@oss.sgi.com In-Reply-To: <3F35F51B.7080301@allot.com> References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> <3F328A0F.3040005@allot.com> <16178.41976.3643.584516@robur.slu.se> <1060284094.1024.36.camel@jzny.localdomain> <3F35F51B.7080301@allot.com> Content-Type: text/plain Organization: jamalopolis Message-Id: <1060570537.1050.17.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 10 Aug 2003 22:55:37 -0400 Content-Transfer-Encoding: 7bit 
X-archive-position: 4712 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Felix, Actually i based my comments on the profiles you posted ;-> Are you running any other nics? If not that profile does look strange. cheers, jamal On Sun, 2003-08-10 at 03:32, Felix Radensky wrote: > Hi, Jamal > > I guess you were not reading my first posting > very carefully :) > > 2.4.22 has NAPI capable e1000 driver and I've > compiled the driver with NAPI support. > > So running non-NAPI driver is not my problem. > > Felix. > > jamal wrote: > > Actually seems his biggest problem is he is not running > > the NAPI driver > > > > cheers, > > jamal > > > > On Thu, 2003-08-07 at 15:09, Robert Olsson wrote: > > > > > Felix Radensky writes: > > > > Thanks for your help, Ben. What is skb-recycle patch > > > > and where can I find it ? > > > > > > It's experimental and not updated for almost a year and current > > > implementation does not add anything to SMP. Got some idea how > > > to improve this... but try to keep to slab as long as possible > > > it has been improved. > > > > > > Routing/bridging on SMP has affinty problem. If you are passing > > > skb's say from eth0 to eth1 and they are bound on different CPU's > > > you get cache boucing since the TX-interrupts come on another CPU. > > > > > > In a recent test with pktgen: > > > 300 kpps with TX interrupts on same CPU as sender. > > > 198 kpps with TX intr on different CPU as sender. > > > > > > Recycling tries to address this but current implementation fails > > > as said. > > > > > > But you are probably hit by something else... Check were the drops > > > happens qdisc?. NIC ring RX/TX size, Number of interrupts. ksoftird > > > priority, link HW_FLOW control, checksumming, affinity etc. > > > > > > > > > Cheers. 
> > > --ro > > > > > > > > > > > > > > From Robert.Olsson@data.slu.se Mon Aug 11 00:52:56 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 00:53:05 -0700 (PDT) Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7B7qsFl002302 for ; Mon, 11 Aug 2003 00:52:56 -0700 Received: (from robert@localhost) by robur.slu.se (8.9.3p2/8.9.3) id JAA21747; Mon, 11 Aug 2003 09:52:40 +0200 From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <16183.19272.107650.647577@robur.slu.se> Date: Mon, 11 Aug 2003 09:52:40 +0200 To: hadi@cyberus.ca Cc: Felix Radensky , Robert Olsson , Ben Greear , netdev@oss.sgi.com Subject: Re: Ethernet bridge performance In-Reply-To: <1060570537.1050.17.camel@jzny.localdomain> References: <3F3217E7.2080903@allot.com> <3F3284EA.5050406@candelatech.com> <3F328A0F.3040005@allot.com> <16178.41976.3643.584516@robur.slu.se> <1060284094.1024.36.camel@jzny.localdomain> <3F35F51B.7080301@allot.com> <1060570537.1050.17.camel@jzny.localdomain> X-Mailer: VM 6.92 under Emacs 19.34.1 X-archive-position: 4713 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev jamal writes: > > Felix, > > Actually i based my comments on the profiles you posted ;-> > Are you running any other nics? If not that profile does look strange. Yes netif_rx the non-NAPI entrance to the upper layers is in the profiles. c019ad44 3404 5.3328 netif_rx Cheers. 
--ro
From jgarzik@pobox.com Mon Aug 11 03:01:20 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 03:01:29 -0700 (PDT) Received: from www.linux.org.uk (IDENT:54dtAvtP/RiVqdP6WrDgwivooPn4OWk4@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BA1IFl013548 for ; Mon, 11 Aug 2003 03:01:20 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19m19x-0007tc-PD; Mon, 11 Aug 2003 02:07:41 +0100 Message-ID: <3F36EC53.8040404@pobox.com> Date: Sun, 10 Aug 2003 21:07:31 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: Andi Kleen CC: netdev@oss.sgi.com Subject: Re: RFR: new SiS gige driver References: <20030808173932.GA4077@gtf.org> <20030809141533.GB4539@wotan.suse.de> <3F350CC8.3090605@pobox.com> <20030809152747.GA1618@wotan.suse.de> In-Reply-To: <20030809152747.GA1618@wotan.suse.de> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4714 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev

Andi Kleen wrote:
> The netif_wake_queue is lost. It's only deadly when clean some packets
> clears the full TX ring, otherwise it will likely recover with the
> next TX finished interrupt but give suboptimal performance.
>
> Fix is to do the my queue is full -> netif_stop_queue inside the spinlock.

"a fix" not "the fix" :)

You can also do what some drivers do, and move the netif_stop_queue above the queue-packet-to-hw section of the driver's ->hard_start_xmit. Then when this uncommon race occurs, you are guaranteed another TX-complete interrupt, even if the queue is stopped prematurely.
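To make the ordering concrete, here is a userspace sketch of the variant Andi describes, where the "ring is full, stop the queue" decision is taken while holding the same lock the TX-completion path takes. Everything in it is an assumption made for illustration: a pthread mutex stands in for the driver spinlock, a boolean for the netif queue state, and the names `tx_ring`, `ring_xmit` and `ring_tx_complete` are invented. Because the fullness check and the stop happen under one lock, a completion cannot slip in between them and leave the queue stopped with nothing left to wake it.

```c
#include <pthread.h>
#include <stdbool.h>

#define RING_SIZE 4

/* Hypothetical TX ring state; not taken from any real driver. */
struct tx_ring {
	pthread_mutex_t lock; /* stands in for the driver spinlock    */
	int in_flight;        /* descriptors handed to the "hardware" */
	bool stopped;         /* stands in for the netif queue state  */
};

static void ring_init(struct tx_ring *r)
{
	pthread_mutex_init(&r->lock, NULL);
	r->in_flight = 0;
	r->stopped = false;
}

/* Transmit path: returns false if the packet was not queued. */
static bool ring_xmit(struct tx_ring *r)
{
	bool queued = false;

	pthread_mutex_lock(&r->lock);
	if (!r->stopped && r->in_flight < RING_SIZE) {
		r->in_flight++;
		queued = true;
		/* Full check and stop are done under the lock. */
		if (r->in_flight == RING_SIZE)
			r->stopped = true;
	}
	pthread_mutex_unlock(&r->lock);
	return queued;
}

/* TX-complete path: frees one slot and wakes the queue if needed. */
static void ring_tx_complete(struct tx_ring *r)
{
	pthread_mutex_lock(&r->lock);
	if (r->in_flight > 0)
		r->in_flight--;
	if (r->stopped && r->in_flight < RING_SIZE)
		r->stopped = false;
	pthread_mutex_unlock(&r->lock);
}
```

The alternative of stopping the queue before handing the packet to the hardware needs no lock around the check, at the cost of occasionally stopping the queue one packet early.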
A lot of drivers netif_stop_queue after, not before, so this is indeed an issue that needs paying attention to. Jeff From shmulik.hen@intel.com Mon Aug 11 03:08:58 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 03:09:05 -0700 (PDT) Received: from caduceus.sc.intel.com (fmr04.intel.com [143.183.121.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BA8vFl014346 for ; Mon, 11 Aug 2003 03:08:58 -0700 Received: from petasus.sc.intel.com (petasus.sc.intel.com [10.3.253.4]) by caduceus.sc.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h7BA7Uk13727 for ; Mon, 11 Aug 2003 10:07:30 GMT Received: from fmsmsxvs043.fm.intel.com (fmsmsxvs043.fm.intel.com [132.233.42.129]) by petasus.sc.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h7BA7NT13657 for ; Mon, 11 Aug 2003 10:07:23 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxvs043.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003081103054101644 ; Mon, 11 Aug 2003 03:05:42 -0700 Content-Type: text/plain; charset="iso-8859-1" From: Shmulik Hen Reply-To: shmulik.hen@intel.com Organization: Intel corp. 
To: hadi@cyberus.ca Subject: Re: [SET 2][PATCH 2/8][bonding] Propagating master's settings to slaves Date: Mon, 11 Aug 2003 13:08:48 +0300 User-Agent: KMail/1.4.3 References: <1060570284.1056.15.camel@jzny.localdomain> In-Reply-To: <1060570284.1056.15.camel@jzny.localdomain> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Cc: bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com Message-Id: <200308111308.48263.shmulik.hen@intel.com> X-archive-position: 4715 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shmulik.hen@intel.com Precedence: bulk X-list: netdev On Monday 11 August 2003 05:51 am, you wrote: > On Sat, 2003-08-09 at 06:29, Hen, Shmulik wrote: > > Not sure I fully understood the concerns above, but I'll try > > to explain what the change was all about. > > I think it wasnt the one specific change rather a few posted that i > spent a minute or two staring at. And you confirm my suspicion > below. I probably didn't make myself clear - by "understood" I wanted to say I probably didn't get the *meaning* of the whole sentence , and not "I don't under stand why you are concerned". (English is not my native tongue :) ). > I am not very familiar with the bonding code although i think you > guys have been doing very good work since you got involved. > In any case the approach you state above is wrong. Actually Stephen > Hemminger and I discussed this for bridging. Post 2.6 he is going > to remove a lot of the bridge policy (or "brain" as you call it) > out of the kernel. Netlink for kernel<->userspace not /proc. I > think we should head towards that direction so we can have more > sophisticated management. I, on the other hand, am not familiar with the bridging code and I don't know what it actually does internally, I just noticed that regarding config operations, most of the code is done at the kernel level as response to ioctl commands. I'll try to clarify how that relates to bonding. 
The ifenslave utility has very little "brain" as it is, and all it currently knows how to do is enslave/release slave devices and change the current active slave. It also has some ability to extract status info from the bond and present it nicely to a user. The "brain" I was referring to in the bonding module itself has to do with timer functions that monitor link status or Tx/Rx activity of the slaves and, once a faulty slave is detected, switch to another one according to the teaming mode. There is no large-scale decision making, nor any major CPU-consuming computation, in the continuous operation of the module, which basically handles Rx/Tx on the slaves. The bonding module doesn't need to access any special info that is normally available to user space apps. What it does need is a very short response time and access to kernel internal resources like net device info to make it a high availability intermediate driver. Trying to move that from the kernel module into the config application seems to be a very hard task to implement, since we'll have to find a way to make the application constantly aware of specifics like current topology, slave-to-bond affiliation, updated status of each slave, etc. It would also mean that the driver will have to wait for the application to tell it what to do each time it needs a decision, and by that we'll surely suffer some performance hit and probably get low availability or temporary loss of communications. Going back to the first problem, discussions on the bonding development list pointed out that it might be better if we moved the configuration-time decision making to the driver, so the application wouldn't have to deal with situations like: 1) get the master's MTU settings, master's teaming mode, communication version, backwards compatibility issues, etc.
2) figure out whether the MTU needs to be set on the slave according to all that,
3) try to set that on the new slave being added,
4) if not successful, decide whether it may be enslaved anyway or,
5) maybe undo all previous settings already done to the slave (needs a way to retrieve old values),
6) decide whether to go on or fail any further operations,
7) repeat the above for all other settings.

On the other hand, what we want to get to is something more like:
1) tell bonding to add slave X to bond Y,
2) watch for error returns,
3) print a nice message according to the type of the error.

Meanwhile the driver, already aware of all possible relevant data, makes all decisions, performs settings, handles compatibility issues, checks for failures at each stage, handles any undo steps, and returns success/error values accordingly.

> > Thoughts?

Mostly explanations :)

Is there anywhere I can see what you referred to as discussions with Stephen Hemminger? I would really like to know how and what could also be applied to bonding.

Regards, Shmulik.
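The split described in the message above (the driver applies the settings and undoes them on failure, the application only reports the result) can be sketched in plain user-space C. This is only an illustration under stated assumptions: `struct fake_dev`, `fake_enslave()` and `fake_enslave_message()` are hypothetical names, the MAC address is an assumed second setting chosen to show the undo path, and none of this is the bonding driver's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for a network device; NOT the kernel's struct net_device. */
struct fake_dev {
    int mtu;                 /* current MTU */
    int max_mtu;             /* largest MTU the device accepts */
    int can_set_mac;         /* non-zero if the MAC address may be changed */
    unsigned char mac[6];    /* current MAC address */
    int enslaved;            /* non-zero once attached to a bond */
};

/* "Driver side": copy the master's settings to the slave, undoing on failure. */
static int fake_enslave(const struct fake_dev *master, struct fake_dev *slave)
{
    int old_mtu;

    assert(master != NULL && slave != NULL);

    if (slave->enslaved)
        return -EBUSY;

    /* Step 1: propagate the MTU, remembering the old value for the undo path. */
    old_mtu = slave->mtu;
    if (master->mtu > slave->max_mtu)
        return -EINVAL;      /* nothing changed yet, so nothing to undo */
    slave->mtu = master->mtu;

    /* Step 2: propagate the MAC address; roll back step 1 if this fails. */
    if (!slave->can_set_mac) {
        slave->mtu = old_mtu;
        return -EOPNOTSUPP;
    }
    memcpy(slave->mac, master->mac, sizeof(slave->mac));

    slave->enslaved = 1;
    return 0;
}

/* "Application side": all that is left is turning the result into a message. */
static const char *fake_enslave_message(int err)
{
    switch (err) {
    case 0:           return "slave added";
    case -EBUSY:      return "slave already belongs to a bond";
    case -EINVAL:     return "slave cannot use the master's MTU";
    case -EOPNOTSUPP: return "slave cannot take the master's MAC address";
    default:          return "unexpected error";
    }
}
```

A caller would invoke `fake_enslave(&bond, &eth)` once and pass the return value to `fake_enslave_message()`; on any failure the slave is left with its previous settings.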
From laurent.deniel@thalesatm.com Mon Aug 11 07:08:00 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 07:08:06 -0700 (PDT) Received: from gwsmtp.thomson-csf.com (gwsmtp.thomson-csf.com [195.101.39.226]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BE7vFl002473 for ; Mon, 11 Aug 2003 07:07:59 -0700 Received: from thalescan.corp.thales (200.3.2.3) by gwsmtp.thomson-csf.com (NPlex 6.5.026) id 3F3761E40000A8E8 for netdev@oss.sgi.com; Mon, 11 Aug 2003 16:07:51 +0200 Received: from bgxplex.bgx.airsys.thomson-csf.com ([220.1.107.25]) by thalescan with InterScan Messaging Security Suite; Mon, 11 Aug 2003 16:07:45 +0200 Received: from bgxplex2.bgx.airsys.thomson-csf.com (1.38.9.152) by bgxplex.bgx.airsys.thomson-csf.com (NPlex 6.5.026) id 3F15565F0003ACC2; Mon, 11 Aug 2003 16:07:58 +0200 Received: from bgxplex2.bgx.airsys.thomson-csf.com (1.38.9.152) by bgxplex2.bgx.airsys.thomson-csf.com (NPlex 6.5.026) id 3F370B62000016DA; Mon, 11 Aug 2003 16:07:45 +0200 Received: from 200.64.130.8 by bgxplex2.bgx.airsys.thomson-csf.com (InterScan E-Mail VirusWall NT); Mon, 11 Aug 2003 16:07:45 +0200 Message-ID: <3F37A331.88EB0B1D@thalesatm.com> Date: Mon, 11 Aug 2003 16:07:45 +0200 From: Laurent DENIEL Organization: THALES ATM X-Mailer: Mozilla 4.75 [fr] (Windows NT 5.0; U) X-Accept-Language: fr,en-GB MIME-Version: 1.0 To: hadi@cyberus.ca CC: shmulik.hen@intel.com, bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com Subject: Re: [Bonding-devel] Re: [SET 2][PATCH 2/8][bonding] Propagating master's settings toslaves References: <1060570284.1056.15.camel@jzny.localdomain> <200308111308.48263.shmulik.hen@intel.com> <1060607079.1050.144.camel@jzny.localdomain> Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from Quoted-Printable to 8bit by oss.sgi.com id h7BE7vFl002473 X-archive-position: 4716 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: 
laurent.deniel@thalesatm.com Precedence: bulk X-list: netdev jamal a écrit : > > > Trying to move that from the kernel module into the config application > > seems to be a very hard task to implement since we'll have to find a > > way to make the application constantly aware to the specifics like > > current topology, slave-to-bond affiliation, updated status of each > > slave, etc., etc. It would also mean that the driver will have to > > wait for the application to tell it what to do each time it needs a > > decision, and by that we'll surely suffer some performance hit and > > probably get low availability or temporary loss of communications. > > > > Not at all. If you let some app control this i am sure whoever writes > the app has vested interest in getting fast failovers etc. > > > Basically what i described at the top. Move any "richness" to user > space. HP/Compaq/Digital used to have the same approach with their Netrain implementation, and from one release of Tru64 UNIX to another, they could no longer support resolution ala milli-seconds but only seconds due to the move of such "richness" to user space (among other things). I am not saying that doing so on Linux will result to the same, but a minimal failover policy shall remain in the kernel for performance reason ... (or a user space facility could exist to *configure* such policy but without direct interaction with user space when the kernel has to decide). 
Laurent From shmulik.hen@intel.com Mon Aug 11 07:20:49 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 07:20:58 -0700 (PDT) Received: from caduceus.sc.intel.com (fmr04.intel.com [143.183.121.6]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BEKlFl003456 for ; Mon, 11 Aug 2003 07:20:48 -0700 Received: from petasus.sc.intel.com (petasus.sc.intel.com [10.3.253.4]) by caduceus.sc.intel.com (8.11.6p2/8.11.6/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h7BEJL807844 for ; Mon, 11 Aug 2003 14:19:21 GMT Received: from fmsmsxvs043.fm.intel.com (fmsmsxvs043.fm.intel.com [132.233.42.129]) by petasus.sc.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h7BEJEa14666 for ; Mon, 11 Aug 2003 14:19:14 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxvs043.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003081107173124813 ; Mon, 11 Aug 2003 07:17:32 -0700 Content-Type: text/plain; charset="iso-8859-1" From: Shmulik Hen Reply-To: shmulik.hen@intel.com Organization: Intel corp. 
To: Laurent DENIEL , hadi@cyberus.ca Subject: Re: [Bonding-devel] Re: [SET 2][PATCH 2/8][bonding] Propagating master's settings toslaves Date: Mon, 11 Aug 2003 17:20:38 +0300 User-Agent: KMail/1.4.3 Cc: bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com References: <1060607079.1050.144.camel@jzny.localdomain> <3F37A331.88EB0B1D@thalesatm.com> In-Reply-To: <3F37A331.88EB0B1D@thalesatm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Message-Id: <200308111720.38472.shmulik.hen@intel.com> X-archive-position: 4717 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shmulik.hen@intel.com Precedence: bulk X-list: netdev On Monday 11 August 2003 05:07 pm, Laurent DENIEL wrote: > HP/Compaq/Digital used to have the same approach with their Netrain > implementation, and from one release of Tru64 UNIX to another, they > could no longer support resolution ala milli-seconds but only > seconds due to the move of such "richness" to user space (among > other things). I am not saying that doing so on Linux will result > to the same, but a minimal failover policy shall remain in the > kernel for performance reason ... (or a user space facility could > exist to *configure* such policy but without direct interaction > with user space when the kernel has to decide). > > Laurent That was my point. Thank you for putting it into better words. If high availability and fast failovers are what's needed, why move it out of kernel space and put it in an application? How fast could it work compared to a kernel module? Why need an extra piece of code running in user space (daemon?) to monitor a module when the module can do that itself? If smarter behavior is needed (e.g. falling to eth4 instead of eth1 when eth0 fails), we can add some priority mechanism to the driver to do that when it decides to swap. Otherwise, we'll be developing applications from now on, not the Linux kernel :) Shmulik.
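The "priority mechanism" suggested in the message above is only an idea in this thread, not an existing bonding feature. A minimal user-space sketch of what such a selection rule might look like follows, assuming a hypothetical `struct slave_info` with a per-slave `prio` field and a hypothetical helper `pick_next_active()`; none of these names come from the bonding driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-slave record; the prio field is an assumption for this sketch. */
struct slave_info {
    const char *name;   /* interface name, e.g. "eth4" */
    int link_up;        /* non-zero if the link monitor considers the slave usable */
    int prio;           /* higher value = preferred as the next active slave */
};

/*
 * Return the usable slave with the highest priority, or NULL if every
 * link is down. Ties are resolved in favour of the earlier array entry.
 */
static const struct slave_info *
pick_next_active(const struct slave_info *slaves, size_t n)
{
    const struct slave_info *best = NULL;
    size_t i;

    assert(slaves != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (!slaves[i].link_up)
            continue;                       /* skip slaves whose link is down */
        if (best == NULL || slaves[i].prio > best->prio)
            best = &slaves[i];
    }
    return best;
}
```

With eth0 down, eth1 at priority 1 and eth4 at priority 5, the helper picks eth4, which matches the "eth4 instead of eth1" case mentioned in the message. The decision needs nothing but state the driver already holds, which is the point being argued.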
From hadi@cyberus.ca Mon Aug 11 07:34:43 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 07:34:51 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BEYgFl004672 for ; Mon, 11 Aug 2003 07:34:43 -0700 Received: from [216.209.86.2] (helo=[10.0.0.9] ident=jamal) by mail.cyberus.ca with esmtp (Exim 4.12) id 19mDkw-000JoR-00; Mon, 11 Aug 2003 10:34:42 -0400 Subject: Re: [Bonding-devel] Re: [SET 2][PATCH 2/8][bonding] Propagating master's settings toslaves From: jamal Reply-To: hadi@cyberus.ca To: shmulik.hen@intel.com Cc: Laurent DENIEL , bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com In-Reply-To: <200308111720.38472.shmulik.hen@intel.com> References: <1060607079.1050.144.camel@jzny.localdomain> <3F37A331.88EB0B1D@thalesatm.com> <200308111720.38472.shmulik.hen@intel.com> Content-Type: text/plain Organization: jamalopolis Message-Id: <1060612481.1034.15.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 11 Aug 2003 10:34:41 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4718 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Mon, 2003-08-11 at 10:20, Shmulik Hen wrote: > On Monday 11 August 2003 05:07 pm, Laurent DENIEL wrote: > > HP/Compaq/Digital used to have the same approach with their Netrain > > implementation, and from one release of Tru64 UNIX to another, they > > could no longer support resolution ala milli-seconds but only > > seconds due to the move of such "richness" to user space (among > > other things). I am not saying that doing so on Linux will result > > to the same, but a minimal failover policy shall remain in the > > kernel for performance reason ... 
(or a user space facility could > > exist to *configure* such policy but without direct interaction > > with user space when the kernel has to decide). > > > > Laurent > > That was my point. Thank you for putting it into better words. > If high availability and fast failovers are what's needed, why move it > out of kernel space and put it in an application ? How fast could it > work compared to a kernel module ? Why need an extra piece of code > running in user space (daemon?) to monitor a module when the module > can do that itself ? > > If smarter behavior is needed (e.g. falling to eth4 instead of eth1 > when eth0 fails), we can add some priority mechanism to the driver to > do that when it decides to swap. Otherwise, we'll be developing > applications from now on, not the Linux kernel :) > So how many smart things are you going to add to the driver? ;-> Do you wanna add the qos policy changeover as well? What about route changes, firewalling etc. What about slicing bread and adding butter? Where do you draw the line? BTW, I don't understand why it would slow down failover; sure it will a tiny bit because you have to cross user space to look up the policy. Maybe this is the part that i haven't made clear, here's an example:
- User space gets notified link eth0 went down
- User space looks up a policy config on what to do when eth0 goes down
- User space executes commands which may include telling kernel to move activity to eth1.
Note: I agree on a minimal failover policy staying in the kernel; very basic stuff like what bonding used to do (may still do, don't know).
cheers, jamal From matthew.hare@amsjv.com Mon Aug 11 09:17:30 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 09:17:34 -0700 (PDT) Received: from smtp2.bae.co.uk (smtp2.bae.co.uk [20.138.254.62]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BGHQFl011780 for ; Mon, 11 Aug 2003 09:17:30 -0700 Received: from ngbaux (ngbaux.msd.bae.co.uk [141.245.68.234]) by smtp2.bae.co.uk (Switch-2.2.6/Switch-2.2.6) with ESMTP id h7BGHHr06783 for ; Mon, 11 Aug 2003 17:17:17 +0100 (BST) Received: from ngban8.ng.bae.co.uk ([141.245.68.229]) by ngbaux.net.bae.co.uk (PMDF V5.2-33 #44998) with ESMTP id <0HJG006ITQJOQD@ngbaux.net.bae.co.uk> for netdev@oss.sgi.com; Mon, 11 Aug 2003 17:16:38 +0100 (BST) Received: from dsnmtrs1.nm.ds.bae.co.uk (unverified) by ngban8.ng.bae.co.uk (Content Technologies SMTPRS 2.0.15) with ESMTP id ; Mon, 11 Aug 2003 17:17:02 +0100 Received: from nmtr01.nm.dsx.bae.co.uk (jupiter [172.30.192.21]) by dsnmtrs1.nm.ds.bae.co.uk with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2653.13) id 3G131QDH; Mon, 11 Aug 2003 17:17:03 +0100 Received: by nmtr01.nm.dsx.bae.co.uk with Internet Mail Service (5.5.2653.19) id ; Mon, 11 Aug 2003 17:23:51 +0100 Date: Mon, 11 Aug 2003 17:22:04 +0100 From: "Hare, Matthew" Subject: TCP/IP Window Size Problem/Query To: "'davem@redhat.com'" , "'ak@muc.de'" , "'kuznet@ms2.inr.ac.ru'" , "'netdev@oss.sgi.com'" Message-id: <3E28BEE03F7CD3118E540008C7F35FF602C7AFAB@NMEX01> MIME-version: 1.0 X-Mailer: Internet Mail Service (5.5.2653.19) Content-type: text/plain; charset="iso-8859-1" X-archive-position: 4719 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: matthew.hare@amsjv.com Precedence: bulk X-list: netdev Forgive this mass mailing out but I am getting desperate for help. My problem is with reported TCP window sizes. I have two systems running SuSE Linux. System 1 is running SuSE 7.1 with the 2.2.18 kernel. 
System 2 is running SuSE 7.3 with the 2.4.17 kernel. Both systems are PC based (Pentium III, 512MB RAM). When I open a TCP connection from System 1 I get a reported window size (I use Ethereal to check window sizes) of 32k. When I do the same with System 2 I get a reported window size of 5840 bytes, and the window size increases as the rate of data being sent via TCP increases. This is causing me problems: I need a fixed window size of 32k on System 2 so that it matches System 1. I have done a bit of testing to try and work out where this window size is set. I am by no means a Linux guru, so I was working with little knowledge at the start. The application we use to open the TCP connections is our own code (C++), so I have hacked that and used some setsockopt() calls to set SO_SNDBUF. I tried setting this to 32k but this appears to have no effect on my reported window sizes. Does anyone know if setsockopt() can be used to change the size of the window I am seeing? Is SO_SNDBUF the same as the TCP window size that Ethereal is reporting? Is there a way to fix the TCP window size in the 2.4.17 kernel? I am familiar with recompiling the kernel but I've not found any kernel options that seem to do what I want. Can anyone help me or at least point me in the right direction? Thanks for your time. Mat Hare SAINTT Developer AMS UK ******************************************************************** This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
******************************************************************** From hadi@cyberus.ca Mon Aug 11 09:19:45 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 09:19:51 -0700 (PDT) Received: from mail.cyberus.ca (mail.cyberus.ca [209.195.118.111]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BGJiFl012240 for ; Mon, 11 Aug 2003 09:19:44 -0700 Received: from [216.209.86.2] (helo=[10.0.0.9] ident=jamal) by mail.cyberus.ca with esmtp (Exim 4.12) id 19mD1Q-000E4B-00; Mon, 11 Aug 2003 09:47:41 -0400 Subject: Re: [SET 2][PATCH 2/8][bonding] Propagating master's settings to slaves From: jamal Reply-To: hadi@cyberus.ca To: shmulik.hen@intel.com Cc: bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com In-Reply-To: <200308111308.48263.shmulik.hen@intel.com> References: <1060570284.1056.15.camel@jzny.localdomain> <200308111308.48263.shmulik.hen@intel.com> Content-Type: text/plain Organization: jamalopolis Message-Id: <1060607079.1050.144.camel@jzny.localdomain> Mime-Version: 1.0 X-Mailer: Ximian Evolution 1.2.2 Date: 11 Aug 2003 09:47:39 -0400 Content-Transfer-Encoding: 7bit X-archive-position: 4720 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Mon, 2003-08-11 at 06:08, Shmulik Hen wrote: > On Monday 11 August 2003 05:51 am, you wrote: > > On Sat, 2003-08-09 at 06:29, Hen, Shmulik wrote: > > > Not sure I fully understood the concerns above, but I'll try > > > to explain what the change was all about. > > > > I think it wasnt the one specific change rather a few posted that i > > spent a minute or two staring at. And you confirm my suspicion > > below. > > I probably didn't make myself clear - by "understood" I wanted to say > I probably didn't get the *meaning* of the whole sentence , and not > "I don't under stand why you are concerned". > (English is not my native tongue :) ). 
> > > I am not very familiar with the bonding code although i think you > > guys have been doing very good work since you got involved. > > In any case the approach you state above is wrong. Actually Stephen > > Hemminger and I discussed this for bridging. Post 2.6 he is going > > to remove a lot of the bridge policy (or "brain" as you call it) > > out of the kernel. Netlink for kernel<->userspace not /proc. I > > think we should head towards that direction so we can have more > > sophisticated management. > > I, on the other hand, am not familiar with the bridging code and I > don't know what it actually does internally, I just noticed that > regarding config operations, most of the code is done at the kernel > level as response to ioctl commands. > Theres two main components to it: a control protocol and a forwarding path. The control protocol known as STP tells the forwarding path how to behave. Essentially, STP carries the policy implemented by the forwarding path. This is the same breakdown to say routing protocols like OSPF and regular forwarding path. At the moment STP sits in the kernel. STP is really the "brains". > I'll try to clarify how that relates to bonding. The ifenslave utility > has very little "brain" as it is, and all it knows how to do > currently is enslave/release slave devices and change the current > active slave. It also has some ability to extract status info from > the bond and present it nicely for a user. > > The "brain" I was referring to in the bonding module itself has to do > with timer functions monitoring link status or Tx/Rx activity of the > slaves, and once a faulty slave is detected, switch to use another > one instead according to the teaming mode. > There are no large scale > decision making nor major CPU consuming computations that are part of > the continuous operation of the module that is basically handle Rx/Tx > on slaves. 
> > The bonding module doesn't need to access any special info that is > normally available to user space apps. What it does need is very > short response time and accessibility to kernel internal resources > like net devices info to make it a high availability intermediate > driver. > > Trying to move that from the kernel module into the config application > seems to be a very hard task to implement since we'll have to find a > way to make the application constantly aware to the specifics like > current topology, slave-to-bond affiliation, updated status of each > slave, etc., etc. It would also mean that the driver will have to > wait for the application to tell it what to do each time it needs a > decision, and by that we'll surely suffer some performance hit and > probably get low availability or temporary loss of communications. > Not at all. If you let some app control this i am sure whoever writes the app has vested interest in getting fast failovers etc. > Going back to the first problem, discussions on the bonding > development list pointed that it might be better if we moved the > configuration-time decisions making to the driver, so the application > wouldn't have to deal with situations like: > 1) get the master's MTU settings, master's teaming mode, communication > version, backwards compatibility issues, etc. > 2) figure if need to set MTU to slave according to all that, > 3) try to set that on the new slave being added, > 4) if not successfull, decide if may enslave anyway or, > 5) maybe undo all previous settings already done to the slave > (needs a way to retrieve old values) > 6) decide if should go on or fail any further operations > 7) repeat the above for all other settings > > On the other hand, what we want to get to is something more like: > 1) tell bonding to add slave X to bond Y, > 2) watch for error returns, > 3) print a nice message according to the type of the error. 
> Dont you think that anything thats "rich" like you list above should stay out of the kernel? In any case, if you have a controlling app, you could do more interesting things; example add or delete routes, firewall rules, qos policies etc which all have very strong correlation with availability - these are examples btw, not an exhaustive list. If all you are satisfied with is link management alone, then by all means hardcoding behavior into the kernel is fine. I dont think it is sufficient. > While the driver, already aware of all possible relevant data, makes > all decisions, performs settings, handles compatibility issues, > checks for failures at each stage, handles any undo steps, and return > success/error values accordingly. > Driver - actually bonding - should have minimal failover policy built in for the lazy; example what i used to know about bonding - failover to the next link, maybe send a grat arp etc. If I want more than basic, then send me netlink events to user space and let me control how it goes. Maybe i dont want to go to the second link but rather the 4th link. > > > > Thoughts? > > Mostly explanations :) > > Is there anywhere I can see what you refereed to as discussions with > Stephen Hemminger ? I would really like to know how and what could > also be applied to bonding. > Basically what i described at the top. Move any "richness" to user space. 
cheers, jamal From shmulik.hen@intel.com Mon Aug 11 09:25:49 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 09:25:56 -0700 (PDT) Received: from hermes-pilot.fm.intel.com (fmr99.intel.com [192.55.52.32]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BGPmFl012927 for ; Mon, 11 Aug 2003 09:25:49 -0700 Received: from petasus.fm.intel.com (petasus.fm.intel.com [10.1.192.37]) by hermes-pilot.fm.intel.com (8.12.9/8.12.9/d: outer.mc,v 1.66 2003/05/22 21:17:36 rfjohns1 Exp $) with ESMTP id h7BGKs14016460 for ; Mon, 11 Aug 2003 16:20:54 GMT Received: from fmsmsxv040-1.fm.intel.com (fmsmsxv040-1.fm.intel.com [132.233.48.108]) by petasus.fm.intel.com (8.11.6p2/8.11.6/d: inner.mc,v 1.35 2003/05/22 21:18:01 rfjohns1 Exp $) with SMTP id h7BGIEG10949 for ; Mon, 11 Aug 2003 16:18:14 GMT Received: from jrslxjul4.npdj.intel.com ([10.12.254.188]) by fmsmsxv040-1.fm.intel.com (NAVGW 2.5.2.11) with SMTP id M2003081109244309038 ; Mon, 11 Aug 2003 09:24:44 -0700 Content-Type: text/plain; charset="iso-8859-1" From: Shmulik Hen Reply-To: shmulik.hen@intel.com Organization: Intel corp. To: hadi@cyberus.ca Subject: Re: [Bonding-devel] Re: [SET 2][PATCH 2/8][bonding] Propagating master's settings toslaves Date: Mon, 11 Aug 2003 19:25:38 +0300 User-Agent: KMail/1.4.3 Cc: Laurent DENIEL , bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com References: <200308111720.38472.shmulik.hen@intel.com> <1060612481.1034.15.camel@jzny.localdomain> In-Reply-To: <1060612481.1034.15.camel@jzny.localdomain> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Message-Id: <200308111925.38278.shmulik.hen@intel.com> X-archive-position: 4721 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shmulik.hen@intel.com Precedence: bulk X-list: netdev On Monday 11 August 2003 05:34 pm, jamal wrote: > So how many smart things are you going to add to the driver? ;-> > Do you wanna add the qos policy changeover as well? 
What about > route changes, firewalling etc. What about slicing bread and > adding butter? Where do you draw the line? > BTW, I don't understand why it would slow down failover; sure it > will a tiny bit because you have to cross user space to look up the > policy. Maybe this is the part that i haven't made clear, here's an > example: - User space gets notified link eth0 went down > - User space looks up a policy config on what to do when eth0 goes > down - user space executes commands which may include telling > kernel to move activity to eth1. > > Note: I agree on a minimal failover policy staying in the kernel; > very basic stuff like what bonding used to do (may still do, don't > know). > > cheers, > jamal Why have any kernel code other than device drivers in the first place? Why not move all of the TCP/IP stack out of kernel space and put it in an application? Let's have the entire ARP mechanism in an application and let it handle everything from routing table management to ARP negotiation, while the kernel will only know how to take the ARP packets it gets from that app and send them away? It doesn't need to have the know-how. Say we do things your way and use the notification mechanism: how long do you think it's going to take for the whole operation to finish, taking into consideration how the kernel runs user space applications in comparison with kernel code? What happens when the system is heavily loaded? What happens if the application dies for some reason? Why should the bonding driver even care about routes or firewalling? It's only meant to take several physical Ethernet devices and group them under one logical device to handle teaming solutions. -- | Shmulik Hen Advanced Network Services | | Israel Design Center, Jerusalem | | LAN Access Division, Platform Networking | | Intel Communications Group, Intel corp.
| From jgarzik@pobox.com Mon Aug 11 09:44:02 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 09:44:08 -0700 (PDT) Received: from www.linux.org.uk (IDENT:FMFX82S5sRmm6CLXA6TgyC8iPBP2Kvzd@parcelfarce.linux.theplanet.co.uk [195.92.249.252]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BGi0Fl014519 for ; Mon, 11 Aug 2003 09:44:01 -0700 Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com) by www.linux.org.uk with esmtp (Exim 4.14) id 19mFm2-0003Ki-By; Mon, 11 Aug 2003 17:43:58 +0100 Message-ID: <3F37C7C3.7070807@pobox.com> Date: Mon, 11 Aug 2003 12:43:47 -0400 From: Jeff Garzik Organization: none User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk X-Accept-Language: en MIME-Version: 1.0 To: shmulik.hen@intel.com, hadi@cyberus.ca CC: Laurent DENIEL , bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com Subject: Re: [Bonding-devel] Re: [SET 2][PATCH 2/8][bonding] Propagating master's settings toslaves References: <200308111720.38472.shmulik.hen@intel.com> <1060612481.1034.15.camel@jzny.localdomain> <200308111925.38278.shmulik.hen@intel.com> In-Reply-To: <200308111925.38278.shmulik.hen@intel.com> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 4722 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev The answer is, like life, it's a balance. As a general rule, we do prefer to move all code possible out of the Linux kernel. We have even created "initramfs", which for 2.7, will be used as a vehicle to move code from the kernel to userspace, that previously had to be in the kernel only because it was a task that "had to be performed at boot time". However, one must consider (1) does moving code to userspace create any security holes? 
(2) does moving code to userspace dramatically increase the number of context switches? (3) does moving code to userspace violate some atomicity that being inside the kernel guarantees? In practice, #3 is the showstopper that occurs most often. This is why I push for a "bonding-utils" package from Jay.... because of the general rule above: put it into userspace, where possible. Jeff From davej@codemonkey.org.uk Mon Aug 11 09:49:29 2003 Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 09:49:35 -0700 (PDT) Received: from lacrosse.corp.redhat.com (pix-525-pool.redhat.com [66.187.233.200]) by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BGnRFl016954 for ; Mon, 11 Aug 2003 09:49:28 -0700 Received: from tetrachloride (davej.cipe.redhat.com [10.0.1.164]) by lacrosse.corp.redhat.com (8.11.6/8.9.3) with ESMTP id h7BGnQK22583 for ; Mon, 11 Aug 2003 12:49:26 -0400 Received: from davej by tetrachloride with local (Exim 3.36 #1 (Debian)) id 19mFqr-00068W-00 for ; Mon, 11 Aug 2003 17:48:57 +0100 To: netdev@oss.sgi.com From: davej@redhat.com Subject: [PATCH] Duplicate access_ok in sunrpc Message-Id: Date: Mon, 11 Aug 2003 17:48:57 +0100 X-archive-position: 4723 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: davej@redhat.com Precedence: bulk X-list: netdev Already checked some lines above. 

diff -urpN --exclude-from=/home/davej/.exclude bk-linus/net/sunrpc/sysctl.c linux-2.5/net/sunrpc/sysctl.c
--- bk-linus/net/sunrpc/sysctl.c	2003-06-30 14:01:04.000000000 +0100
+++ linux-2.5/net/sunrpc/sysctl.c	2003-06-30 16:04:03.000000000 +0100
@@ -102,7 +102,7 @@ proc_dodebug(ctl_table *table, int write
 		len = sprintf(tmpbuf, "%d", *(unsigned int *) table->data);
 		if (len > left)
 			len = left;
-		copy_to_user(buffer, tmpbuf, len);
+		__copy_to_user(buffer, tmpbuf, len);
 		if ((left -= len) > 0) {
 			put_user('\n', (char *)buffer + len);
 			left--;


From davej@codemonkey.org.uk Mon Aug 11 09:49:29 2003
Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 09:49:35 -0700 (PDT)
Received: from lacrosse.corp.redhat.com (pix-525-pool.redhat.com [66.187.233.200])
	by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BGnRFl016953
	for ; Mon, 11 Aug 2003 09:49:28 -0700
Received: from tetrachloride (davej.cipe.redhat.com [10.0.1.164])
	by lacrosse.corp.redhat.com (8.11.6/8.9.3) with ESMTP id h7BGnQK22570
	for ; Mon, 11 Aug 2003 12:49:26 -0400
Received: from davej by tetrachloride with local (Exim 3.36 #1 (Debian))
	id 19mFqr-00068Q-00
	for ; Mon, 11 Aug 2003 17:48:57 +0100
To: netdev@oss.sgi.com
From: davej@redhat.com
Subject: [PATCH] ipv4 reuses freed mem
Message-Id:
Date: Mon, 11 Aug 2003 17:48:57 +0100
X-archive-position: 4723
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davej@redhat.com
Precedence: bulk
X-list: netdev

cat /proc/net/pnp will return garbage, as the fields it dumps are initdata.

diff -urpN --exclude-from=/home/davej/.exclude bk-linus/net/ipv4/ipconfig.c linux-2.5/net/ipv4/ipconfig.c
--- bk-linus/net/ipv4/ipconfig.c	2003-07-10 01:02:21.000000000 +0100
+++ linux-2.5/net/ipv4/ipconfig.c	2003-08-09 16:08:13.000000000 +0100
@@ -125,14 +125,14 @@ int ic_proto_enabled __initdata = 0
 
 int ic_host_name_set __initdata = 0;		/* Host name set by us? */
 
-u32 ic_myaddr __initdata = INADDR_NONE;		/* My IP address */
-u32 ic_netmask __initdata = INADDR_NONE;	/* Netmask for local subnet */
-u32 ic_gateway __initdata = INADDR_NONE;	/* Gateway IP address */
+u32 ic_myaddr = INADDR_NONE;		/* My IP address */
+u32 ic_netmask = INADDR_NONE;	/* Netmask for local subnet */
+u32 ic_gateway = INADDR_NONE;	/* Gateway IP address */
 
-u32 ic_servaddr __initdata = INADDR_NONE;	/* Boot server IP address */
+u32 ic_servaddr = INADDR_NONE;	/* Boot server IP address */
 
-u32 root_server_addr __initdata = INADDR_NONE;	/* Address of NFS server */
-u8 root_server_path[256] __initdata = { 0, };	/* Path to mount as root */
+u32 root_server_addr = INADDR_NONE;	/* Address of NFS server */
+u8 root_server_path[256] = { 0, };	/* Path to mount as root */
 
 /* Persistent data: */
 


From davej@codemonkey.org.uk Mon Aug 11 09:49:29 2003
Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 09:49:35 -0700 (PDT)
Received: from lacrosse.corp.redhat.com (pix-525-pool.redhat.com [66.187.233.200])
	by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BGnRFl016952
	for ; Mon, 11 Aug 2003 09:49:28 -0700
Received: from tetrachloride (davej.cipe.redhat.com [10.0.1.164])
	by lacrosse.corp.redhat.com (8.11.6/8.9.3) with ESMTP id h7BGnQK22573
	for ; Mon, 11 Aug 2003 12:49:26 -0400
Received: from davej by tetrachloride with local (Exim 3.36 #1 (Debian))
	id 19mFqr-00068T-00
	for ; Mon, 11 Aug 2003 17:48:57 +0100
To: netdev@oss.sgi.com
From: davej@redhat.com
Subject: [PATCH] Missing break in switch statement.
Message-Id:
Date: Mon, 11 Aug 2003 17:48:57 +0100
X-archive-position: 4724
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: davej@redhat.com
Precedence: bulk
X-list: netdev

Is this intentional? It should at least have a
/* FALLTHROUGH */ or similar if so.

		Dave

diff -urpN --exclude-from=/home/davej/.exclude bk-linus/net/ipv6/raw.c linux-2.5/net/ipv6/raw.c
--- bk-linus/net/ipv6/raw.c	2003-07-11 13:57:41.000000000 +0100
+++ linux-2.5/net/ipv6/raw.c	2003-07-11 14:08:05.000000000 +0100
@@ -833,6 +833,7 @@ static int rawv6_getsockopt(struct sock
 			val = -1;
 		else
 			val = opt->offset;
+		break;
 
 	default:
 		return -ENOPROTOOPT;


From yoshfuji@hongo.wide.ad.jp Mon Aug 11 10:14:17 2003
Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 10:14:20 -0700 (PDT)
Received: from linux6.nezu.wide.ad.jp (linux6.nezu.wide.ad.jp [203.178.142.218])
	by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BHEFFl022424
	for ; Mon, 11 Aug 2003 10:14:17 -0700
Received: from localhost (localhost [127.0.0.1])
	by linux6.nezu.wide.ad.jp (8.12.3/8.12.3/Debian-6.4) with ESMTP id h7BHEEAn001729;
	Tue, 12 Aug 2003 02:14:14 +0900
Date: Tue, 12 Aug 2003 02:14:14 +0900 (JST)
Message-Id: <20030812.021414.128414948.yoshfuji@hongo.wide.ad.jp>
To: davem@redhat.com, davej@redhat.com
Cc: netdev@oss.sgi.com, yoshfuji@linux-ipv6.org
Subject: Re: [PATCH] Missing break in switch statement.
From: YOSHIFUJI Hideaki
In-Reply-To:
References:
X-Mailer: Mew version 2.2 on Emacs 20.7 / Mule 4.0 (HANANOEN)
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-archive-position: 4725
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: yoshfuji@hongo.wide.ad.jp
Precedence: bulk
X-list: netdev

In article (at Mon, 11 Aug 2003 17:48:57 +0100), davej@redhat.com says:

> Is this intentional? It should at least have a
> /* FALLTHROUGH */ or similar if so.

Of course not.

> 		else
> 			val = opt->offset;
> +		break;
>
> 	default:
> 		return -ENOPROTOOPT;

agreed.
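
For readers following the thread, the effect of the missing break can be sketched in plain userspace C. This is only an illustration under stated assumptions: the function names, the DEMO_OPT_* constants and the parameters are hypothetical and are not taken from net/ipv6/raw.c; only the shape of the switch mirrors the quoted hunk.

```c
#include <errno.h>

/* Hypothetical option identifiers, used only for this illustration. */
enum demo_opt { DEMO_OPT_CHECKSUM = 1, DEMO_OPT_OTHER = 2 };

/*
 * Buggy shape: the DEMO_OPT_CHECKSUM case has no break, so control runs
 * on into default and the caller gets -ENOPROTOOPT even though the option
 * was recognised and *val was already filled in.
 */
int demo_getsockopt_buggy(int optname, int checksum_enabled, int offset,
			  int *val)
{
	switch (optname) {
	case DEMO_OPT_CHECKSUM:
		if (!checksum_enabled)
			*val = -1;
		else
			*val = offset;
		/* No break here: this is the bug being illustrated. */
		/* falls through */
	default:
		return -ENOPROTOOPT;
	}
	return 0;
}

/* Fixed shape: the added break lets the recognised option return 0. */
int demo_getsockopt_fixed(int optname, int checksum_enabled, int offset,
			  int *val)
{
	switch (optname) {
	case DEMO_OPT_CHECKSUM:
		if (!checksum_enabled)
			*val = -1;
		else
			*val = offset;
		break;
	default:
		return -ENOPROTOOPT;
	}
	return 0;
}
```

With these definitions, demo_getsockopt_buggy(DEMO_OPT_CHECKSUM, 1, 4, &v) stores 4 in v but still returns -ENOPROTOOPT, while demo_getsockopt_fixed() with the same arguments stores 4 and returns 0. An unhandled option such as DEMO_OPT_OTHER returns -ENOPROTOOPT in both versions.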

--yoshfuji


From laurent.deniel@thalesatm.com Mon Aug 11 10:31:38 2003
Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 10:31:43 -0700 (PDT)
Received: from gwsmtp.thomson-csf.com (gwsmtp.thomson-csf.com [195.101.39.226])
	by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BHVaFl023827
	for ; Mon, 11 Aug 2003 10:31:38 -0700
Received: from thalescan.corp.thales (200.3.2.3) by gwsmtp.thomson-csf.com (NPlex 6.5.026)
	id 3F3761E40000FCA0 for netdev@oss.sgi.com; Mon, 11 Aug 2003 19:31:31 +0200
Received: from bgxplex.bgx.airsys.thomson-csf.com ([220.1.107.25])
	by thalescan with InterScan Messaging Security Suite; Mon, 11 Aug 2003 19:31:26 +0200
Received: from bgxplex2.bgx.airsys.thomson-csf.com (1.38.9.152) by bgxplex.bgx.airsys.thomson-csf.com (NPlex 6.5.026)
	id 3F15565F0003B327; Mon, 11 Aug 2003 19:31:39 +0200
Received: from bgxplex2.bgx.airsys.thomson-csf.com (1.38.9.152) by bgxplex2.bgx.airsys.thomson-csf.com (NPlex 6.5.026)
	id 3F370B6200001C1F; Mon, 11 Aug 2003 19:31:26 +0200
Received: from 200.64.130.8 by bgxplex2.bgx.airsys.thomson-csf.com (InterScan E-Mail VirusWall NT);
	Mon, 11 Aug 2003 19:31:26 +0200
Message-ID: <3F37D2ED.B4B9223C@thalesatm.com>
Date: Mon, 11 Aug 2003 19:31:25 +0200
From: Laurent DENIEL
Organization: THALES ATM
X-Mailer: Mozilla 4.75 [fr] (Windows NT 5.0; U)
X-Accept-Language: fr,en-GB
MIME-Version: 1.0
To: Jeff Garzik
CC: shmulik.hen@intel.com, hadi@cyberus.ca,
	bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com
Subject: Re: [Bonding-devel] Re: [SET 2][PATCH 2/8][bonding] Propagating master'ssettings toslaves
References: <200308111720.38472.shmulik.hen@intel.com> <1060612481.1034.15.camel@jzny.localdomain> <200308111925.38278.shmulik.hen@intel.com> <3F37C7C3.7070807@pobox.com>
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from Quoted-Printable to 8bit by oss.sgi.com id h7BHVaFl023827
X-archive-position: 4726
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: laurent.deniel@thalesatm.com
Precedence: bulk
X-list: netdev

Jeff Garzik wrote:
>
> The answer is, like life, it's a balance.
>
> As a general rule, we do prefer to move all code possible out of the
> Linux kernel. We have even created "initramfs", which for 2.7, will be
> used as a vehicle to move code from the kernel to userspace, that
> previously had to be in the kernel only because it was a task that "had
> to be performed at boot time".
>
> However, one must consider
> (1) does moving code to userspace create any security holes?
> (2) does moving code to userspace dramatically increase the number of
> context switches?
> (3) does moving code to userspace violate some atomicity that being
> inside the kernel guarantees?

You forgot one important aspect :

(4) does moving code to userspace break compatibility (or behavior)
    with user land applications (or systems)

What can one do if say, kernel 2.[4|5] switches the NIC in 10 milliseconds
while kernel 2.7 with user land daemon switches in a few seconds ?
nothing but stay with the previous version or fork the driver development ;-(

But I agree that it is interesting to do some stuff at user land, and if
the bonding had an option to disable the automatic failover policy,
this could be implemented with trigger towards user land application that
could use an ioctl call to switch to the appropriate NIC according to
the user land configuration ...

But the fast and simple failover policy shall remain in kernel code.

Laurent


From jgarzik@pobox.com Mon Aug 11 10:43:43 2003
Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 10:43:54 -0700 (PDT)
Received: from www.linux.org.uk (IDENT:0JIazgH4OLXAUw9cDLemfpgzTPTBra1l@parcelfarce.linux.theplanet.co.uk [195.92.249.252])
	by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BHhfFl024883
	for ; Mon, 11 Aug 2003 10:43:42 -0700
Received: from rdu26-227-011.nc.rr.com ([66.26.227.11] helo=pobox.com)
	by www.linux.org.uk with esmtp (Exim 4.14)
	id 19mGhn-00047P-36; Mon, 11 Aug 2003 18:43:39 +0100
Message-ID: <3F37D5BF.8000702@pobox.com>
Date: Mon, 11 Aug 2003 13:43:27 -0400
From: Jeff Garzik
Organization: none
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021213 Debian/1.2.1-2.bunk
X-Accept-Language: en
MIME-Version: 1.0
To: Laurent DENIEL
CC: shmulik.hen@intel.com, hadi@cyberus.ca,
	bonding-devel@lists.sourceforge.net, netdev@oss.sgi.com
Subject: Re: [Bonding-devel] Re: [SET 2][PATCH 2/8][bonding] Propagating master'ssettings toslaves
References: <200308111720.38472.shmulik.hen@intel.com> <1060612481.1034.15.camel@jzny.localdomain> <200308111925.38278.shmulik.hen@intel.com> <3F37C7C3.7070807@pobox.com> <3F37D2ED.B4B9223C@thalesatm.com>
In-Reply-To: <3F37D2ED.B4B9223C@thalesatm.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
X-archive-position: 4727
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: jgarzik@pobox.com
Precedence: bulk
X-list: netdev

Laurent DENIEL wrote:
> Jeff Garzik wrote:
>
>>The answer is, like life, it's a balance.
>>
>>As a general rule, we do prefer to move all code possible out of the
>>Linux kernel. We have even created "initramfs", which for 2.7, will be
>>used as a vehicle to move code from the kernel to userspace, that
>>previously had to be in the kernel only because it was a task that "had
>>to be performed at boot time".
>>
>>However, one must consider
>>(1) does moving code to userspace create any security holes?
>>(2) does moving code to userspace dramatically increase the number of
>>context switches?
>>(3) does moving code to userspace violate some atomicity that being
>>inside the kernel guarantees?
>
>
> You forgot one important aspect :
>
> (4) does moving code to userspace break compatibility (or behavior)
>     with user land applications (or systems)

I agree... assuming these userland interfaces are fairly standard and
widely deployed.


> What can one do if say, kernel 2.[4|5] switches the NIC in 10 milliseconds
> while kernel 2.7 with user land daemon switches in a few seconds ?
> nothing but stay with the previous version or fork the driver development ;-(

This is a silly example. If that happens in practice, then that is a
bug in the configuration of the userland daemon, or a bug in the
kernel<->userland ABI.


> But I agree that it is interesting to do some stuff at user land, and if
> the bonding had an option to disable the automatic failover policy,
> this could be implemented with trigger towards user land application that
> could use an ioctl call to switch to the appropriate NIC according to
> the user land configuration ...

Remember, ioctls are bad. :) Unix design mistake.


> But the fast and simple failover policy shall remain in kernel code.

I would not make such absolute predictions, especially about policy :)

	Jeff


From shemminger@osdl.org Mon Aug 11 11:48:55 2003
Received: with ECARTIS (v1.0.0; list netdev); Mon, 11 Aug 2003 11:49:04 -0700 (PDT)
Received: from mail.osdl.org (fw.osdl.org [65.172.181.6])
	by oss.sgi.com (8.12.9/8.12.9) with SMTP id h7BImsFl027380
	for ; Mon, 11 Aug 2003 11:48:55 -0700
Received: from dell_ss3.pdx.osdl.net (dell_ss3.pdx.osdl.net [172.20.1.60])
	by mail.osdl.org (8.11.6/8.11.6) with SMTP id h7BImVo27603;
	Mon, 11 Aug 2003 11:48:31 -0700
Date: Mon, 11 Aug 2003 11:48:23 -0700
From: Stephen Hemminger
To: chas williams , "David S. Miller"
Cc: netdev@oss.sgi.com
Subject: [RFT][PATCH] cleanup net/atm/br2684.c
Message-Id: <20030811114823.5b81474c.shemminger@osdl.org>
Organization: Open Source Development Lab
X-Mailer: Sylpheed version 0.9.4claws (GTK+ 1.2.10; i686-pc-linux-gnu)
X-Face: &@E+xe?c%:&e4D{>f1O<&U>2qwRREG5!}7R4;D<"NO^UI2mJ[eEOA2*3>(`Th.yP,VDPo9$
	/`~cw![cmj~~jWe?AHY7D1S+\}5brN0k*NE?pPh_'_d>6;XGG[\KDRViCfumZT3@[
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-archive-position: 4728
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: shemminger@osdl.org
Precedence: bulk
X-list: netdev

I fixed up some things in br2684 but don't have ATM hardware to test it
well enough. The patch is against 2.6.0-test3 and module loads/unloads fine.

Fixed:
	- Allocate network device with alloc_netdev and embed private data
	  via dev->priv. This allows for future fix of rmmod race with sysfs.
	- Get rid of all the MOD_INC stuff. MOD_INC is not a spinlock or
	  semaphore and don't use it like that! Have driver clean itself up
	  properly on unload.
	- Add required owner field for /proc interface. Thought about
	  converting to seq_file, but existing output format and ordering
	  makes that hard.

diff -Nru a/net/atm/br2684.c b/net/atm/br2684.c
--- a/net/atm/br2684.c	Mon Aug 11 11:42:19 2003
+++ b/net/atm/br2684.c	Mon Aug 11 11:42:19 2003
@@ -81,7 +81,7 @@
 };
 
 struct br2684_dev {
-	struct net_device net_dev;
+	struct net_device *net_dev;
 	struct list_head br2684_devs;
 	int number;
 	struct list_head brvccs; /* one device <=> one vcc (before xmas) */
@@ -137,8 +137,8 @@
 	case BR2684_FIND_BYIFNAME:
 		list_for_each(lh, &br2684_devs) {
 			brdev = list_entry_brdev(lh);
-			if (!strncmp(brdev->net_dev.name, s->spec.ifname,
-			    sizeof brdev->net_dev.name))
+			if (!strncmp(brdev->net_dev->name, s->spec.ifname,
+			    IFNAMSIZ))
 				return brdev;
 		}
 		break;
@@ -400,7 +400,6 @@
 	brvcc->atmvcc->user_back = NULL;	/* what about vcc->recvq ??? */
 	brvcc->old_push(brvcc->atmvcc, NULL);	/* pass on the bad news */
 	kfree(brvcc);
-	MOD_DEC_USE_COUNT;
 }
 
 /* when AAL5 PDU comes in: */
@@ -418,8 +417,8 @@
 			read_lock(&devs_lock);
 			list_del(&brdev->br2684_devs);
 			read_unlock(&devs_lock);
-			unregister_netdev(&brdev->net_dev);
-			kfree(brdev);
+			unregister_netdev(brdev->net_dev);
+			kfree(brdev->net_dev);
 		}
 		return;
 	}
@@ -464,7 +463,7 @@
 #endif /* CONFIG_BR2684_FAST_TRANS */
 #else
 	skb_pull(skb, plen - ETH_HLEN);
-	skb->protocol = eth_type_trans(skb, &brdev->net_dev);
+	skb->protocol = eth_type_trans(skb, brdev->net_dev);
 #endif /* FASTER_VERSION */
 #ifdef CONFIG_ATM_BR2684_IPFILTER
 	if (packet_fails_filter(skb->protocol, brvcc, skb)) {
@@ -473,11 +472,11 @@
 		return;
 	}
 #endif /* CONFIG_ATM_BR2684_IPFILTER */
-	skb->dev = &brdev->net_dev;
+	skb->dev = brdev->net_dev;
 	ATM_SKB(skb)->vcc = atmvcc;	/* needed ? */
 	DPRINTK("received packet's protocol: %x\n", ntohs(skb->protocol));
 	skb_debug(skb);
-	if (!(brdev->net_dev.flags & IFF_UP)) { /* sigh, interface is down */
+	if (!(brdev->net_dev->flags & IFF_UP)) { /* sigh, interface is down */
 		brdev->stats.rx_dropped++;
 		dev_kfree_skb(skb);
 		return;
@@ -500,9 +499,7 @@
 	struct br2684_dev *brdev;
 	struct atm_backend_br2684 be;
 
-	MOD_INC_USE_COUNT;
 	if (copy_from_user(&be, (void *) arg, sizeof be)) {
-		MOD_DEC_USE_COUNT;
 		return -EFAULT;
 	}
 	write_lock_irq(&devs_lock);
@@ -539,10 +536,10 @@
 	if (list_empty(&brdev->brvccs) && !brdev->mac_was_set) {
 		unsigned char *esi = atmvcc->dev->esi;
 		if (esi[0] | esi[1] | esi[2] | esi[3] | esi[4] | esi[5])
-			memcpy(brdev->net_dev.dev_addr, esi,
-			    brdev->net_dev.addr_len);
+			memcpy(brdev->net_dev->de