From amine@anevia.com Thu Sep 1 08:09:59 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 01 Sep 2005 08:10:03 -0700 (PDT) Received: from smtp8.wanadoo.fr (smtp8.wanadoo.fr [193.252.22.23]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j81F9wiL015566 for ; Thu, 1 Sep 2005 08:09:59 -0700 Received: from me-wanadoo.net (localhost [127.0.0.1]) by mwinf0808.wanadoo.fr (SMTP Server) with ESMTP id ECFD61C00223 for ; Thu, 1 Sep 2005 17:07:28 +0200 (CEST) Received: from goliath.anevia.com (LSt-Amand-152-31-11-137.w82-127.abo.wanadoo.fr [82.127.10.137]) by mwinf0808.wanadoo.fr (SMTP Server) with ESMTP id D326B1C0021C for ; Thu, 1 Sep 2005 17:07:28 +0200 (CEST) X-ME-UUID: 20050901150728864.D326B1C0021C@mwinf0808.wanadoo.fr Received: from therese.anevia.com (therese.anevia.com [10.0.1.33]) by goliath.anevia.com (Postfix) with ESMTP id D0DE91300048 for ; Thu, 1 Sep 2005 17:07:31 +0200 (CEST) From: amine To: netdev@oss.sgi.com Subject: Linux multicast support Date: Thu, 1 Sep 2005 17:05:47 +0200 User-Agent: KMail/1.7.2 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200509011705.47623.amine@anevia.com> X-archive-position: 3588 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: amine@anevia.com Precedence: bulk X-list: netdev Hi, I have a question about multicast in the Linux IP stack. Why is "dev->xmit_lock" taken when the device multicast list is changed? Is it required to prevent that handler and set_multicast_list from running in parallel? 
Thanks in advance -- EL HEDADI Amine R&D phone : Email : amine@anevia.com From manfred@colorfullife.com Sun Sep 4 05:36:49 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 04 Sep 2005 05:36:57 -0700 (PDT) Received: from dbl.q-ag.de (dbl.q-ag.de [213.172.117.3]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j84CaiiL027751 for ; Sun, 4 Sep 2005 05:36:47 -0700 Received: from [127.0.0.2] (dbl [127.0.0.1]) by dbl.q-ag.de (8.13.3/8.13.3/Debian-6) with ESMTP id j84CeUvV015789; Sun, 4 Sep 2005 14:40:31 +0200 Message-ID: <431AE9B7.2040300@colorfullife.com> Date: Sun, 04 Sep 2005 14:33:59 +0200 From: Manfred Spraul User-Agent: Mozilla/5.0 (X11; U; Linux i686; fr-FR; rv:1.7.10) Gecko/20050719 Fedora/1.7.10-1.5.1 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Linux Kernel Mailing List , Netdev CC: Ayaz Abdulla Subject: [CFT] forcedeth backport to 2.4 Content-Type: multipart/mixed; boundary="------------040003030102010107040708" X-archive-position: 3591 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: manfred@colorfullife.com Precedence: bulk X-list: netdev Content-Length: 51824 Lines: 1630 This is a multi-part message in MIME format. --------------040003030102010107040708 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Hi, Attached is a backport of the latest forcedeth version to 2.4. It includes lots of changes, among them: - a critical bugfix for nv_open(): ifdown/ifup cycles resulted in an incomplete initialization that caused hangs after a few MB of network traffic. - jumbo frame support - far better ethtool support - 64-bit DMA support - support for additional nForce versions. It compiles and boots, but I can't test it properly. Could you give it a try? 
-- Manfred --------------040003030102010107040708 Content-Type: text/plain; name="patch-forcedeth-backport" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="patch-forcedeth-backport" --- 2.4/drivers/net/forcedeth.c 2005-01-19 15:09:56.000000000 +0100 +++ build-2.4/drivers/net/forcedeth.c 2005-09-04 13:58:07.000000000 +0200 @@ -79,6 +79,22 @@ * 0.30: 25 Sep 2004: rx checksum support for nf 250 Gb. Add rx reset * into nv_close, otherwise reenabling for wol can * cause DMA to kfree'd memory. + * 0.31: 14 Nov 2004: ethtool support for getting/setting link + * capabilities. + * 0.32: 16 Apr 2005: RX_ERROR4 handling added. + * 0.33: 16 May 2005: Support for MCP51 added. + * 0.34: 18 Jun 2005: Add DEV_NEED_LINKTIMER to all nForce nics. + * 0.35: 26 Jun 2005: Support for MCP55 added. + * 0.36: 28 Jun 2005: Add jumbo frame support. + * 0.37: 10 Jul 2005: Additional ethtool support, cleanup of pci id list + * 0.38: 16 Jul 2005: tx irq rewrite: Use global flags instead of + * per-packet flags. + * 0.39: 18 Jul 2005: Add 64bit descriptor support. + * 0.40: 19 Jul 2005: Add support for mac address change. + * 0.41: 30 Jul 2005: Write back original MAC in nv_close instead + * of nv_remove + * 0.42: 06 Aug 2005: Fix lack of link speed initialization + * in the second (and later) nv_open call * * Known bugs: * We suspect that on some hardware no TX done interrupts are generated. @@ -90,7 +106,7 @@ * DEV_NEED_TIMERIRQ will not harm you on sane hardware, only generating a few * superfluous timer interrupts from the nic. 
*/ -#define FORCEDETH_VERSION "0.30" +#define FORCEDETH_VERSION "0.42" #define DRV_NAME "forcedeth" #include @@ -108,6 +124,7 @@ #include #include #include +#include #include #include @@ -125,11 +142,10 @@ * Hardware access: */ -#define DEV_NEED_LASTPACKET1 0x0001 /* set LASTPACKET1 in tx flags */ -#define DEV_IRQMASK_1 0x0002 /* use NVREG_IRQMASK_WANTED_1 for irq mask */ -#define DEV_IRQMASK_2 0x0004 /* use NVREG_IRQMASK_WANTED_2 for irq mask */ -#define DEV_NEED_TIMERIRQ 0x0008 /* set the timer irq flag in the irq mask */ -#define DEV_NEED_LINKTIMER 0x0010 /* poll link settings. Relies on the timer irq */ +#define DEV_NEED_TIMERIRQ 0x0001 /* set the timer irq flag in the irq mask */ +#define DEV_NEED_LINKTIMER 0x0002 /* poll link settings. Relies on the timer irq */ +#define DEV_HAS_LARGEDESC 0x0004 /* device supports jumbo frames and needs packet format 2 */ +#define DEV_HAS_HIGH_DMA 0x0008 /* device supports 64bit dma */ enum { NvRegIrqStatus = 0x000, @@ -140,13 +156,16 @@ #define NVREG_IRQ_RX 0x0002 #define NVREG_IRQ_RX_NOBUF 0x0004 #define NVREG_IRQ_TX_ERR 0x0008 -#define NVREG_IRQ_TX2 0x0010 +#define NVREG_IRQ_TX_OK 0x0010 #define NVREG_IRQ_TIMER 0x0020 #define NVREG_IRQ_LINK 0x0040 +#define NVREG_IRQ_TX_ERROR 0x0080 #define NVREG_IRQ_TX1 0x0100 -#define NVREG_IRQMASK_WANTED_1 0x005f -#define NVREG_IRQMASK_WANTED_2 0x0147 -#define NVREG_IRQ_UNKNOWN (~(NVREG_IRQ_RX_ERROR|NVREG_IRQ_RX|NVREG_IRQ_RX_NOBUF|NVREG_IRQ_TX_ERR|NVREG_IRQ_TX2|NVREG_IRQ_TIMER|NVREG_IRQ_LINK|NVREG_IRQ_TX1)) +#define NVREG_IRQMASK_WANTED 0x00df + +#define NVREG_IRQ_UNKNOWN (~(NVREG_IRQ_RX_ERROR|NVREG_IRQ_RX|NVREG_IRQ_RX_NOBUF|NVREG_IRQ_TX_ERR| \ + NVREG_IRQ_TX_OK|NVREG_IRQ_TIMER|NVREG_IRQ_LINK|NVREG_IRQ_TX_ERROR| \ + NVREG_IRQ_TX1)) NvRegUnknownSetupReg6 = 0x008, #define NVREG_UNKSETUP6_VAL 3 @@ -211,6 +230,7 @@ #define NVREG_LINKSPEED_10 1000 #define NVREG_LINKSPEED_100 100 #define NVREG_LINKSPEED_1000 50 +#define NVREG_LINKSPEED_MASK (0xFFF) NvRegUnknownSetupReg5 = 0x130, #define 
NVREG_UNKSETUP5_BIT31 (1<<31) NvRegUnknownSetupReg3 = 0x13c, @@ -279,6 +299,18 @@ u32 FlagLen; }; +struct ring_desc_ex { + u32 PacketBufferHigh; + u32 PacketBufferLow; + u32 Reserved; + u32 FlagLen; +}; + +typedef union _ring_type { + struct ring_desc* orig; + struct ring_desc_ex* ex; +} ring_type; + #define FLAG_MASK_V1 0xffff0000 #define FLAG_MASK_V2 0xffffc000 #define LEN_MASK_V1 (0xffffffff ^ FLAG_MASK_V1) @@ -286,7 +318,7 @@ #define NV_TX_LASTPACKET (1<<16) #define NV_TX_RETRYERROR (1<<19) -#define NV_TX_LASTPACKET1 (1<<24) +#define NV_TX_FORCED_INTERRUPT (1<<24) #define NV_TX_DEFERRED (1<<26) #define NV_TX_CARRIERLOST (1<<27) #define NV_TX_LATECOLLISION (1<<28) @@ -296,7 +328,7 @@ #define NV_TX2_LASTPACKET (1<<29) #define NV_TX2_RETRYERROR (1<<18) -#define NV_TX2_LASTPACKET1 (1<<23) +#define NV_TX2_FORCED_INTERRUPT (1<<30) #define NV_TX2_DEFERRED (1<<25) #define NV_TX2_CARRIERLOST (1<<26) #define NV_TX2_LATECOLLISION (1<<27) @@ -362,7 +394,7 @@ #define RX_RING 128 #define TX_RING 64 -/* +/* * If your nic mysteriously hangs then try to reduce the limits * to 1/0: It might be required to set NV_TX_LASTPACKET in the * last valid ring entry. But this would be impossible to @@ -372,15 +404,19 @@ #define TX_LIMIT_START 62 /* rx/tx mac addr + type + vlan + align + slack*/ -#define RX_NIC_BUFSIZE (ETH_DATA_LEN + 64) -/* even more slack */ -#define RX_ALLOC_BUFSIZE (ETH_DATA_LEN + 128) +#define NV_RX_HEADERS (64) +/* even more slack. */ +#define NV_RX_ALLOC_PAD (64) + +/* maximum mtu size */ +#define NV_PKTLIMIT_1 ETH_DATA_LEN /* hard limit not known */ +#define NV_PKTLIMIT_2 9100 /* Actual limit according to NVidia: 9202 */ #define OOM_REFILL (1+HZ/20) #define POLL_WAIT (1+HZ/100) #define LINK_TIMEOUT (3*HZ) -/* +/* * desc_ver values: * This field has two purposes: * - Newer nics uses a different ring layout. 
The layout is selected by @@ -389,6 +425,7 @@ */ #define DESC_VER_1 0x0 #define DESC_VER_2 (0x02100|NVREG_TXRXCTL_RXCHECK) +#define DESC_VER_3 (0x02200|NVREG_TXRXCTL_RXCHECK) /* PHY defines */ #define PHY_OUI_MARVELL 0x5043 @@ -442,6 +479,8 @@ int in_shutdown; u32 linkspeed; int duplex; + int autoneg; + int fixed_mode; int phyaddr; int wolenabled; unsigned int phy_oui; @@ -454,14 +493,17 @@ u32 irqmask; u32 desc_ver; + void __iomem *base; + /* rx specific fields. * Locking: Within irq hander or disable_irq+spin_lock(&np->lock); */ - struct ring_desc *rx_ring; + ring_type rx_ring; unsigned int cur_rx, refill_rx; struct sk_buff *rx_skbuff[RX_RING]; dma_addr_t rx_dma[RX_RING]; unsigned int rx_buf_sz; + unsigned int pkt_limit; struct timer_list oom_kick; struct timer_list nic_poll; @@ -473,7 +515,7 @@ /* * tx specific fields. */ - struct ring_desc *tx_ring; + ring_type tx_ring; unsigned int next_tx, nic_tx; struct sk_buff *tx_skbuff[TX_RING]; dma_addr_t tx_dma[TX_RING]; @@ -488,15 +530,15 @@ static inline struct fe_priv *get_nvpriv(struct net_device *dev) { - return (struct fe_priv *) dev->priv; + return netdev_priv(dev); } -static inline u8 *get_hwbase(struct net_device *dev) +static inline u8 __iomem *get_hwbase(struct net_device *dev) { - return (u8 *) dev->base_addr; + return get_nvpriv(dev)->base; } -static inline void pci_push(u8 * base) +static inline void pci_push(u8 __iomem *base) { /* force out pending posted writes */ readl(base); @@ -508,10 +550,15 @@ & ((v == DESC_VER_1) ? 
LEN_MASK_V1 : LEN_MASK_V2); } +static inline u32 nv_descr_getlength_ex(struct ring_desc_ex *prd, u32 v) +{ + return le32_to_cpu(prd->FlagLen) & LEN_MASK_V2; +} + static int reg_delay(struct net_device *dev, int offset, u32 mask, u32 target, int delay, int delaymax, const char *msg) { - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); pci_push(base); do { @@ -533,7 +580,7 @@ */ static int mii_rw(struct net_device *dev, int addr, int miireg, int value) { - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); u32 reg; int retval; @@ -604,7 +651,7 @@ static int phy_init(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); u32 phyinterface, phy_reserved, mii_status, mii_control, mii_control_1000,reg; /* set advertise register */ @@ -681,7 +728,7 @@ static void nv_start_rx(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); dprintk(KERN_DEBUG "%s: nv_start_rx\n", dev->name); /* Already running? Stop it. 
*/ @@ -699,7 +746,7 @@ static void nv_stop_rx(struct net_device *dev) { - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); dprintk(KERN_DEBUG "%s: nv_stop_rx\n", dev->name); writel(0, base + NvRegReceiverControl); @@ -713,7 +760,7 @@ static void nv_start_tx(struct net_device *dev) { - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); dprintk(KERN_DEBUG "%s: nv_start_tx\n", dev->name); writel(NVREG_XMITCTL_START, base + NvRegTransmitterControl); @@ -722,7 +769,7 @@ static void nv_stop_tx(struct net_device *dev) { - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); dprintk(KERN_DEBUG "%s: nv_stop_tx\n", dev->name); writel(0, base + NvRegTransmitterControl); @@ -737,7 +784,7 @@ static void nv_txrx_reset(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); dprintk(KERN_DEBUG "%s: nv_txrx_reset\n", dev->name); writel(NVREG_TXRXCTL_BIT2 | NVREG_TXRXCTL_RESET | np->desc_ver, base + NvRegTxRxControl); @@ -764,50 +811,6 @@ return &np->stats; } -static void nv_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info) -{ - struct fe_priv *np = get_nvpriv(dev); - strcpy(info->driver, "forcedeth"); - strcpy(info->version, FORCEDETH_VERSION); - strcpy(info->bus_info, pci_name(np->pci_dev)); -} - -static void nv_get_wol(struct net_device *dev, struct ethtool_wolinfo *wolinfo) -{ - struct fe_priv *np = get_nvpriv(dev); - wolinfo->supported = WAKE_MAGIC; - - spin_lock_irq(&np->lock); - if (np->wolenabled) - wolinfo->wolopts = WAKE_MAGIC; - spin_unlock_irq(&np->lock); -} - -static int nv_set_wol(struct net_device *dev, struct ethtool_wolinfo *wolinfo) -{ - struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); - - spin_lock_irq(&np->lock); - if (wolinfo->wolopts == 0) { - writel(0, base + NvRegWakeUpFlags); - np->wolenabled = 0; - } - if (wolinfo->wolopts & WAKE_MAGIC) { - writel(NVREG_WAKEUPFLAGS_ENABLE, base + NvRegWakeUpFlags); - 
np->wolenabled = 1; - } - spin_unlock_irq(&np->lock); - return 0; -} - -static struct ethtool_ops ops = { - .get_drvinfo = nv_get_drvinfo, - .get_link = ethtool_op_get_link, - .get_wol = nv_get_wol, - .set_wol = nv_set_wol, -}; - /* * nv_alloc_rx: fill rx ring entries. * Return 1 if the allocations for the skbs failed and the @@ -825,7 +828,7 @@ nr = refill_rx % RX_RING; if (np->rx_skbuff[nr] == NULL) { - skb = dev_alloc_skb(RX_ALLOC_BUFSIZE); + skb = dev_alloc_skb(np->rx_buf_sz + NV_RX_ALLOC_PAD); if (!skb) break; @@ -836,9 +839,16 @@ } np->rx_dma[nr] = pci_map_single(np->pci_dev, skb->data, skb->len, PCI_DMA_FROMDEVICE); - np->rx_ring[nr].PacketBuffer = cpu_to_le32(np->rx_dma[nr]); - wmb(); - np->rx_ring[nr].FlagLen = cpu_to_le32(RX_NIC_BUFSIZE | NV_RX_AVAIL); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) { + np->rx_ring.orig[nr].PacketBuffer = cpu_to_le32(np->rx_dma[nr]); + wmb(); + np->rx_ring.orig[nr].FlagLen = cpu_to_le32(np->rx_buf_sz | NV_RX_AVAIL); + } else { + np->rx_ring.ex[nr].PacketBufferHigh = cpu_to_le64(np->rx_dma[nr]) >> 32; + np->rx_ring.ex[nr].PacketBufferLow = cpu_to_le64(np->rx_dma[nr]) & 0x0FFFFFFFF; + wmb(); + np->rx_ring.ex[nr].FlagLen = cpu_to_le32(np->rx_buf_sz | NV_RX2_AVAIL); + } dprintk(KERN_DEBUG "%s: nv_alloc_rx: Packet %d marked as Available\n", dev->name, refill_rx); refill_rx++; @@ -864,19 +874,37 @@ enable_irq(dev->irq); } -static int nv_init_ring(struct net_device *dev) +static void nv_init_rx(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); int i; - np->next_tx = np->nic_tx = 0; - for (i = 0; i < TX_RING; i++) - np->tx_ring[i].FlagLen = 0; - np->cur_rx = RX_RING; np->refill_rx = 0; for (i = 0; i < RX_RING; i++) - np->rx_ring[i].FlagLen = 0; + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + np->rx_ring.orig[i].FlagLen = 0; + else + np->rx_ring.ex[i].FlagLen = 0; +} + +static void nv_init_tx(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + int i; + + np->next_tx 
= np->nic_tx = 0; + for (i = 0; i < TX_RING; i++) + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + np->tx_ring.orig[i].FlagLen = 0; + else + np->tx_ring.ex[i].FlagLen = 0; +} + +static int nv_init_ring(struct net_device *dev) +{ + nv_init_tx(dev); + nv_init_rx(dev); return nv_alloc_rx(dev); } @@ -885,7 +913,10 @@ struct fe_priv *np = get_nvpriv(dev); int i; for (i = 0; i < TX_RING; i++) { - np->tx_ring[i].FlagLen = 0; + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + np->tx_ring.orig[i].FlagLen = 0; + else + np->tx_ring.ex[i].FlagLen = 0; if (np->tx_skbuff[i]) { pci_unmap_single(np->pci_dev, np->tx_dma[i], np->tx_skbuff[i]->len, @@ -902,7 +933,10 @@ struct fe_priv *np = get_nvpriv(dev); int i; for (i = 0; i < RX_RING; i++) { - np->rx_ring[i].FlagLen = 0; + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + np->rx_ring.orig[i].FlagLen = 0; + else + np->rx_ring.ex[i].FlagLen = 0; wmb(); if (np->rx_skbuff[i]) { pci_unmap_single(np->pci_dev, np->rx_dma[i], @@ -933,11 +967,19 @@ np->tx_dma[nr] = pci_map_single(np->pci_dev, skb->data,skb->len, PCI_DMA_TODEVICE); - np->tx_ring[nr].PacketBuffer = cpu_to_le32(np->tx_dma[nr]); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + np->tx_ring.orig[nr].PacketBuffer = cpu_to_le32(np->tx_dma[nr]); + else { + np->tx_ring.ex[nr].PacketBufferHigh = cpu_to_le64(np->tx_dma[nr]) >> 32; + np->tx_ring.ex[nr].PacketBufferLow = cpu_to_le64(np->tx_dma[nr]) & 0x0FFFFFFFF; + } spin_lock_irq(&np->lock); wmb(); - np->tx_ring[nr].FlagLen = cpu_to_le32( (skb->len-1) | np->tx_flags ); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + np->tx_ring.orig[nr].FlagLen = cpu_to_le32( (skb->len-1) | np->tx_flags ); + else + np->tx_ring.ex[nr].FlagLen = cpu_to_le32( (skb->len-1) | np->tx_flags ); dprintk(KERN_DEBUG "%s: nv_start_xmit: packet packet %d queued for transmission.\n", dev->name, np->next_tx); { @@ -975,7 +1017,10 @@ while (np->nic_tx != np->next_tx) { i = np->nic_tx % 
TX_RING; - Flags = le32_to_cpu(np->tx_ring[i].FlagLen); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + Flags = le32_to_cpu(np->tx_ring.orig[i].FlagLen); + else + Flags = le32_to_cpu(np->tx_ring.ex[i].FlagLen); dprintk(KERN_DEBUG "%s: nv_tx_done: looking at packet %d, Flags 0x%x.\n", dev->name, np->nic_tx, Flags); @@ -1024,11 +1069,58 @@ static void nv_tx_timeout(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); - dprintk(KERN_DEBUG "%s: Got tx_timeout. irq: %08x\n", dev->name, + printk(KERN_INFO "%s: Got tx_timeout. irq: %08x\n", dev->name, readl(base + NvRegIrqStatus) & NVREG_IRQSTAT_MASK); + { + int i; + + printk(KERN_INFO "%s: Ring at %lx: next %d nic %d\n", + dev->name, (unsigned long)np->ring_addr, + np->next_tx, np->nic_tx); + printk(KERN_INFO "%s: Dumping tx registers\n", dev->name); + for (i=0;i<0x400;i+= 32) { + printk(KERN_INFO "%3x: %08x %08x %08x %08x %08x %08x %08x %08x\n", + i, + readl(base + i + 0), readl(base + i + 4), + readl(base + i + 8), readl(base + i + 12), + readl(base + i + 16), readl(base + i + 20), + readl(base + i + 24), readl(base + i + 28)); + } + printk(KERN_INFO "%s: Dumping tx ring\n", dev->name); + for (i=0;i<TX_RING;i+= 4) { + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) { + printk(KERN_INFO "%03x: %08x %08x // %08x %08x // %08x %08x // %08x %08x\n", + i, + le32_to_cpu(np->tx_ring.orig[i].PacketBuffer), + le32_to_cpu(np->tx_ring.orig[i].FlagLen), + le32_to_cpu(np->tx_ring.orig[i+1].PacketBuffer), + le32_to_cpu(np->tx_ring.orig[i+1].FlagLen), + le32_to_cpu(np->tx_ring.orig[i+2].PacketBuffer), + le32_to_cpu(np->tx_ring.orig[i+2].FlagLen), + le32_to_cpu(np->tx_ring.orig[i+3].PacketBuffer), + le32_to_cpu(np->tx_ring.orig[i+3].FlagLen)); + } else { + printk(KERN_INFO "%03x: %08x %08x %08x // %08x %08x %08x // %08x %08x %08x // %08x %08x %08x\n", + i, + le32_to_cpu(np->tx_ring.ex[i].PacketBufferHigh), + 
le32_to_cpu(np->tx_ring.ex[i].PacketBufferLow), + le32_to_cpu(np->tx_ring.ex[i].FlagLen), + le32_to_cpu(np->tx_ring.ex[i+1].PacketBufferHigh), + le32_to_cpu(np->tx_ring.ex[i+1].PacketBufferLow), + le32_to_cpu(np->tx_ring.ex[i+1].FlagLen), + le32_to_cpu(np->tx_ring.ex[i+2].PacketBufferHigh), + le32_to_cpu(np->tx_ring.ex[i+2].PacketBufferLow), + le32_to_cpu(np->tx_ring.ex[i+2].FlagLen), + le32_to_cpu(np->tx_ring.ex[i+3].PacketBufferHigh), + le32_to_cpu(np->tx_ring.ex[i+3].PacketBufferLow), + le32_to_cpu(np->tx_ring.ex[i+3].FlagLen)); + } + } + } + spin_lock_irq(&np->lock); /* 1) stop tx engine */ @@ -1042,7 +1134,10 @@ printk(KERN_DEBUG "%s: tx_timeout: dead entries!\n", dev->name); nv_drain_tx(dev); np->next_tx = np->nic_tx = 0; - writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc)), base + NvRegTxRingPhysAddr); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc)), base + NvRegTxRingPhysAddr); + else + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc_ex)), base + NvRegTxRingPhysAddr); netif_wake_queue(dev); } @@ -1051,6 +1146,59 @@ spin_unlock_irq(&np->lock); } +/* + * Called when the nic notices a mismatch between the actual data len on the + * wire and the len indicated in the 802 header + */ +static int nv_getlen(struct net_device *dev, void *packet, int datalen) +{ + int hdrlen; /* length of the 802 header */ + int protolen; /* length as stored in the proto field */ + + /* 1) calculate len according to header */ + if ( ((struct vlan_ethhdr *)packet)->h_vlan_proto == __constant_htons(ETH_P_8021Q)) { + protolen = ntohs( ((struct vlan_ethhdr *)packet)->h_vlan_encapsulated_proto ); + hdrlen = VLAN_HLEN; + } else { + protolen = ntohs( ((struct ethhdr *)packet)->h_proto); + hdrlen = ETH_HLEN; + } + dprintk(KERN_DEBUG "%s: nv_getlen: datalen %d, protolen %d, hdrlen %d\n", + dev->name, datalen, protolen, hdrlen); + if (protolen > ETH_DATA_LEN) + return datalen; /* Value 
in proto field not a len, no checks possible */ + + protolen += hdrlen; + /* consistency checks: */ + if (datalen > ETH_ZLEN) { + if (datalen >= protolen) { + /* more data on wire than in 802 header, trim of + * additional data. + */ + dprintk(KERN_DEBUG "%s: nv_getlen: accepting %d bytes.\n", + dev->name, protolen); + return protolen; + } else { + /* less data on wire than mentioned in header. + * Discard the packet. + */ + dprintk(KERN_DEBUG "%s: nv_getlen: discarding long packet.\n", + dev->name); + return -1; + } + } else { + /* short packet. Accept only if 802 values are also short */ + if (protolen > ETH_ZLEN) { + dprintk(KERN_DEBUG "%s: nv_getlen: discarding short packet.\n", + dev->name); + return -1; + } + dprintk(KERN_DEBUG "%s: nv_getlen: accepting %d bytes.\n", + dev->name, datalen); + return datalen; + } +} + static void nv_rx_process(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); @@ -1064,8 +1212,13 @@ break; /* we scanned the whole ring - do not continue */ i = np->cur_rx % RX_RING; - Flags = le32_to_cpu(np->rx_ring[i].FlagLen); - len = nv_descr_getlength(&np->rx_ring[i], np->desc_ver); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) { + Flags = le32_to_cpu(np->rx_ring.orig[i].FlagLen); + len = nv_descr_getlength(&np->rx_ring.orig[i], np->desc_ver); + } else { + Flags = le32_to_cpu(np->rx_ring.ex[i].FlagLen); + len = nv_descr_getlength_ex(&np->rx_ring.ex[i], np->desc_ver); + } dprintk(KERN_DEBUG "%s: nv_rx_process: looking at packet %d, Flags 0x%x.\n", dev->name, np->cur_rx, Flags); @@ -1102,7 +1255,7 @@ np->stats.rx_errors++; goto next_pkt; } - if (Flags & (NV_RX_ERROR1|NV_RX_ERROR2|NV_RX_ERROR3|NV_RX_ERROR4)) { + if (Flags & (NV_RX_ERROR1|NV_RX_ERROR2|NV_RX_ERROR3)) { np->stats.rx_errors++; goto next_pkt; } @@ -1116,22 +1269,24 @@ np->stats.rx_errors++; goto next_pkt; } - if (Flags & NV_RX_ERROR) { - /* framing errors are soft errors, the rest is fatal. 
*/ - if (Flags & NV_RX_FRAMINGERR) { - if (Flags & NV_RX_SUBSTRACT1) { - len--; - } - } else { + if (Flags & NV_RX_ERROR4) { + len = nv_getlen(dev, np->rx_skbuff[i]->data, len); + if (len < 0) { np->stats.rx_errors++; goto next_pkt; } } + /* framing errors are soft errors. */ + if (Flags & NV_RX_FRAMINGERR) { + if (Flags & NV_RX_SUBSTRACT1) { + len--; + } + } } else { if (!(Flags & NV_RX2_DESCRIPTORVALID)) goto next_pkt; - if (Flags & (NV_RX2_ERROR1|NV_RX2_ERROR2|NV_RX2_ERROR3|NV_RX2_ERROR4)) { + if (Flags & (NV_RX2_ERROR1|NV_RX2_ERROR2|NV_RX2_ERROR3)) { np->stats.rx_errors++; goto next_pkt; } @@ -1145,17 +1300,19 @@ np->stats.rx_errors++; goto next_pkt; } - if (Flags & NV_RX2_ERROR) { - /* framing errors are soft errors, the rest is fatal. */ - if (Flags & NV_RX2_FRAMINGERR) { - if (Flags & NV_RX2_SUBSTRACT1) { - len--; - } - } else { + if (Flags & NV_RX2_ERROR4) { + len = nv_getlen(dev, np->rx_skbuff[i]->data, len); + if (len < 0) { np->stats.rx_errors++; goto next_pkt; } } + /* framing errors are soft errors */ + if (Flags & NV_RX2_FRAMINGERR) { + if (Flags & NV_RX2_SUBSTRACT1) { + len--; + } + } Flags &= NV_RX2_CHECKSUMMASK; if (Flags == NV_RX2_CHECKSUMOK1 || Flags == NV_RX2_CHECKSUMOK2 || @@ -1183,15 +1340,133 @@ } } +static void set_bufsize(struct net_device *dev) +{ + struct fe_priv *np = netdev_priv(dev); + + if (dev->mtu <= ETH_DATA_LEN) + np->rx_buf_sz = ETH_DATA_LEN + NV_RX_HEADERS; + else + np->rx_buf_sz = dev->mtu + NV_RX_HEADERS; +} + /* * nv_change_mtu: dev->change_mtu function * Called with dev_base_lock held for read. 
*/ static int nv_change_mtu(struct net_device *dev, int new_mtu) { - if (new_mtu > ETH_DATA_LEN) + struct fe_priv *np = get_nvpriv(dev); + int old_mtu; + + if (new_mtu < 64 || new_mtu > np->pkt_limit) return -EINVAL; + + old_mtu = dev->mtu; dev->mtu = new_mtu; + + /* return early if the buffer sizes will not change */ + if (old_mtu <= ETH_DATA_LEN && new_mtu <= ETH_DATA_LEN) + return 0; + if (old_mtu == new_mtu) + return 0; + + /* synchronized against open : rtnl_lock() held by caller */ + if (netif_running(dev)) { + u8 *base = get_hwbase(dev); + /* + * It seems that the nic preloads valid ring entries into an + * internal buffer. The procedure for flushing everything is + * guessed, there is probably a simpler approach. + * Changing the MTU is a rare event, it shouldn't matter. + */ + disable_irq(dev->irq); + spin_lock_bh(&dev->xmit_lock); + spin_lock(&np->lock); + /* stop engines */ + nv_stop_rx(dev); + nv_stop_tx(dev); + nv_txrx_reset(dev); + /* drain rx queue */ + nv_drain_rx(dev); + nv_drain_tx(dev); + /* reinit driver view of the rx queue */ + nv_init_rx(dev); + nv_init_tx(dev); + /* alloc new rx buffers */ + set_bufsize(dev); + if (nv_alloc_rx(dev)) { + if (!np->in_shutdown) + mod_timer(&np->oom_kick, jiffies + OOM_REFILL); + } + /* reinit nic view of the rx queue */ + writel(np->rx_buf_sz, base + NvRegOffloadConfig); + writel((u32) np->ring_addr, base + NvRegRxRingPhysAddr); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc)), base + NvRegTxRingPhysAddr); + else + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc_ex)), base + NvRegTxRingPhysAddr); + writel( ((RX_RING-1) << NVREG_RINGSZ_RXSHIFT) + ((TX_RING-1) << NVREG_RINGSZ_TXSHIFT), + base + NvRegRingSizes); + pci_push(base); + writel(NVREG_TXRXCTL_KICK|np->desc_ver, get_hwbase(dev) + NvRegTxRxControl); + pci_push(base); + + /* restart rx engine */ + nv_start_rx(dev); + nv_start_tx(dev); + spin_unlock(&np->lock); + 
spin_unlock_bh(&dev->xmit_lock); + enable_irq(dev->irq); + } + return 0; +} + +static void nv_copy_mac_to_hw(struct net_device *dev) +{ + u8 *base = get_hwbase(dev); + u32 mac[2]; + + mac[0] = (dev->dev_addr[0] << 0) + (dev->dev_addr[1] << 8) + + (dev->dev_addr[2] << 16) + (dev->dev_addr[3] << 24); + mac[1] = (dev->dev_addr[4] << 0) + (dev->dev_addr[5] << 8); + + writel(mac[0], base + NvRegMacAddrA); + writel(mac[1], base + NvRegMacAddrB); +} + +/* + * nv_set_mac_address: dev->set_mac_address function + * Called with rtnl_lock() held. + */ +static int nv_set_mac_address(struct net_device *dev, void *addr) +{ + struct fe_priv *np = get_nvpriv(dev); + struct sockaddr *macaddr = (struct sockaddr*)addr; + + if(!is_valid_ether_addr(macaddr->sa_data)) + return -EADDRNOTAVAIL; + + /* synchronized against open : rtnl_lock() held by caller */ + memcpy(dev->dev_addr, macaddr->sa_data, ETH_ALEN); + + if (netif_running(dev)) { + spin_lock_bh(&dev->xmit_lock); + spin_lock_irq(&np->lock); + + /* stop rx engine */ + nv_stop_rx(dev); + + /* set mac address */ + nv_copy_mac_to_hw(dev); + + /* restart rx engine */ + nv_start_rx(dev); + spin_unlock_irq(&np->lock); + spin_unlock_bh(&dev->xmit_lock); + } else { + nv_copy_mac_to_hw(dev); + } return 0; } @@ -1202,7 +1477,7 @@ static void nv_set_multicast(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); u32 addr[2]; u32 mask[2]; u32 pff; @@ -1262,7 +1537,7 @@ static int nv_update_linkspeed(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); int adv, lpa; int newls = np->linkspeed; int newdup = np->duplex; @@ -1285,6 +1560,25 @@ goto set_speed; } + if (np->autoneg == 0) { + dprintk(KERN_DEBUG "%s: nv_update_linkspeed: autoneg off, PHY set to 0x%04x.\n", + dev->name, np->fixed_mode); + if (np->fixed_mode & LPA_100FULL) { + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_100; + 
newdup = 1; + } else if (np->fixed_mode & LPA_100HALF) { + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_100; + newdup = 0; + } else if (np->fixed_mode & LPA_10FULL) { + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_10; + newdup = 1; + } else { + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_10; + newdup = 0; + } + retval = 1; + goto set_speed; + } /* check auto negotiation is complete */ if (!(mii_status & BMSR_ANEGCOMPLETE)) { /* still in autonegotiation - configure nic for 10 MBit HD and wait. */ @@ -1302,7 +1596,7 @@ if ((control_1000 & ADVERTISE_1000FULL) && (status_1000 & LPA_1000FULL)) { - dprintk(KERN_DEBUG "%s: nv_update_linkspeed: GBit ethernet detected.\n", + dprintk(KERN_DEBUG "%s: nv_update_linkspeed: GBit ethernet detected.\n", dev->name); newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_1000; newdup = 1; @@ -1361,9 +1655,9 @@ phyreg &= ~(PHY_HALF|PHY_100|PHY_1000); if (np->duplex == 0) phyreg |= PHY_HALF; - if ((np->linkspeed & 0xFFF) == NVREG_LINKSPEED_100) + if ((np->linkspeed & NVREG_LINKSPEED_MASK) == NVREG_LINKSPEED_100) phyreg |= PHY_100; - else if ((np->linkspeed & 0xFFF) == NVREG_LINKSPEED_1000) + else if ((np->linkspeed & NVREG_LINKSPEED_MASK) == NVREG_LINKSPEED_1000) phyreg |= PHY_1000; writel(phyreg, base + NvRegPhyInterface); @@ -1397,7 +1691,7 @@ static void nv_link_irq(struct net_device *dev) { - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); u32 miistat; miistat = readl(base + NvRegMIIStatus); @@ -1413,7 +1707,7 @@ { struct net_device *dev = (struct net_device *) data; struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); u32 events; int i; @@ -1427,7 +1721,7 @@ if (!(events & np->irqmask)) break; - if (events & (NVREG_IRQ_TX1|NVREG_IRQ_TX2|NVREG_IRQ_TX_ERR)) { + if (events & (NVREG_IRQ_TX1|NVREG_IRQ_TX_OK|NVREG_IRQ_TX_ERROR|NVREG_IRQ_TX_ERR)) { spin_lock(&np->lock); nv_tx_done(dev); spin_unlock(&np->lock); @@ -1485,7 +1779,7 @@ { struct net_device *dev = (struct 
net_device *) data; struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); disable_irq(dev->irq); /* FIXME: Do we need synchronize_irq(dev->irq) here? */ @@ -1499,10 +1793,285 @@ enable_irq(dev->irq); } +#ifdef CONFIG_NET_POLL_CONTROLLER +static void nv_poll_controller(struct net_device *dev) +{ + nv_do_nic_poll((unsigned long) dev); +} +#endif + +static void nv_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info) +{ + struct fe_priv *np = get_nvpriv(dev); + strcpy(info->driver, "forcedeth"); + strcpy(info->version, FORCEDETH_VERSION); + strcpy(info->bus_info, pci_name(np->pci_dev)); +} + +static void nv_get_wol(struct net_device *dev, struct ethtool_wolinfo *wolinfo) +{ + struct fe_priv *np = get_nvpriv(dev); + wolinfo->supported = WAKE_MAGIC; + + spin_lock_irq(&np->lock); + if (np->wolenabled) + wolinfo->wolopts = WAKE_MAGIC; + spin_unlock_irq(&np->lock); +} + +static int nv_set_wol(struct net_device *dev, struct ethtool_wolinfo *wolinfo) +{ + struct fe_priv *np = get_nvpriv(dev); + u8 __iomem *base = get_hwbase(dev); + + spin_lock_irq(&np->lock); + if (wolinfo->wolopts == 0) { + writel(0, base + NvRegWakeUpFlags); + np->wolenabled = 0; + } + if (wolinfo->wolopts & WAKE_MAGIC) { + writel(NVREG_WAKEUPFLAGS_ENABLE, base + NvRegWakeUpFlags); + np->wolenabled = 1; + } + spin_unlock_irq(&np->lock); + return 0; +} + +static int nv_get_settings(struct net_device *dev, struct ethtool_cmd *ecmd) +{ + struct fe_priv *np = netdev_priv(dev); + int adv; + + spin_lock_irq(&np->lock); + ecmd->port = PORT_MII; + if (!netif_running(dev)) { + /* We do not track link speed / duplex setting if the + * interface is disabled. 
Force a link check */ + nv_update_linkspeed(dev); + } + switch(np->linkspeed & (NVREG_LINKSPEED_MASK)) { + case NVREG_LINKSPEED_10: + ecmd->speed = SPEED_10; + break; + case NVREG_LINKSPEED_100: + ecmd->speed = SPEED_100; + break; + case NVREG_LINKSPEED_1000: + ecmd->speed = SPEED_1000; + break; + } + ecmd->duplex = DUPLEX_HALF; + if (np->duplex) + ecmd->duplex = DUPLEX_FULL; + + ecmd->autoneg = np->autoneg; + + ecmd->advertising = ADVERTISED_MII; + if (np->autoneg) { + ecmd->advertising |= ADVERTISED_Autoneg; + adv = mii_rw(dev, np->phyaddr, MII_ADVERTISE, MII_READ); + } else { + adv = np->fixed_mode; + } + if (adv & ADVERTISE_10HALF) + ecmd->advertising |= ADVERTISED_10baseT_Half; + if (adv & ADVERTISE_10FULL) + ecmd->advertising |= ADVERTISED_10baseT_Full; + if (adv & ADVERTISE_100HALF) + ecmd->advertising |= ADVERTISED_100baseT_Half; + if (adv & ADVERTISE_100FULL) + ecmd->advertising |= ADVERTISED_100baseT_Full; + if (np->autoneg && np->gigabit == PHY_GIGABIT) { + adv = mii_rw(dev, np->phyaddr, MII_1000BT_CR, MII_READ); + if (adv & ADVERTISE_1000FULL) + ecmd->advertising |= ADVERTISED_1000baseT_Full; + } + + ecmd->supported = (SUPPORTED_Autoneg | + SUPPORTED_10baseT_Half | SUPPORTED_10baseT_Full | + SUPPORTED_100baseT_Half | SUPPORTED_100baseT_Full | + SUPPORTED_MII); + if (np->gigabit == PHY_GIGABIT) + ecmd->supported |= SUPPORTED_1000baseT_Full; + + ecmd->phy_address = np->phyaddr; + ecmd->transceiver = XCVR_EXTERNAL; + + /* ignore maxtxpkt, maxrxpkt for now */ + spin_unlock_irq(&np->lock); + return 0; +} + +static int nv_set_settings(struct net_device *dev, struct ethtool_cmd *ecmd) +{ + struct fe_priv *np = netdev_priv(dev); + + if (ecmd->port != PORT_MII) + return -EINVAL; + if (ecmd->transceiver != XCVR_EXTERNAL) + return -EINVAL; + if (ecmd->phy_address != np->phyaddr) { + /* TODO: support switching between multiple phys. Should be + * trivial, but not enabled due to lack of test hardware. 
*/ + return -EINVAL; + } + if (ecmd->autoneg == AUTONEG_ENABLE) { + u32 mask; + + mask = ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full | + ADVERTISED_100baseT_Half | ADVERTISED_100baseT_Full; + if (np->gigabit == PHY_GIGABIT) + mask |= ADVERTISED_1000baseT_Full; + + if ((ecmd->advertising & mask) == 0) + return -EINVAL; + + } else if (ecmd->autoneg == AUTONEG_DISABLE) { + /* Note: autonegotiation disable, speed 1000 intentionally + * forbidden - noone should need that. */ + + if (ecmd->speed != SPEED_10 && ecmd->speed != SPEED_100) + return -EINVAL; + if (ecmd->duplex != DUPLEX_HALF && ecmd->duplex != DUPLEX_FULL) + return -EINVAL; + } else { + return -EINVAL; + } + + spin_lock_irq(&np->lock); + if (ecmd->autoneg == AUTONEG_ENABLE) { + int adv, bmcr; + + np->autoneg = 1; + + /* advertise only what has been requested */ + adv = mii_rw(dev, np->phyaddr, MII_ADVERTISE, MII_READ); + adv &= ~(ADVERTISE_ALL | ADVERTISE_100BASE4); + if (ecmd->advertising & ADVERTISED_10baseT_Half) + adv |= ADVERTISE_10HALF; + if (ecmd->advertising & ADVERTISED_10baseT_Full) + adv |= ADVERTISE_10FULL; + if (ecmd->advertising & ADVERTISED_100baseT_Half) + adv |= ADVERTISE_100HALF; + if (ecmd->advertising & ADVERTISED_100baseT_Full) + adv |= ADVERTISE_100FULL; + mii_rw(dev, np->phyaddr, MII_ADVERTISE, adv); + + if (np->gigabit == PHY_GIGABIT) { + adv = mii_rw(dev, np->phyaddr, MII_1000BT_CR, MII_READ); + adv &= ~ADVERTISE_1000FULL; + if (ecmd->advertising & ADVERTISED_1000baseT_Full) + adv |= ADVERTISE_1000FULL; + mii_rw(dev, np->phyaddr, MII_1000BT_CR, adv); + } + + bmcr = mii_rw(dev, np->phyaddr, MII_BMCR, MII_READ); + bmcr |= (BMCR_ANENABLE | BMCR_ANRESTART); + mii_rw(dev, np->phyaddr, MII_BMCR, bmcr); + + } else { + int adv, bmcr; + + np->autoneg = 0; + + adv = mii_rw(dev, np->phyaddr, MII_ADVERTISE, MII_READ); + adv &= ~(ADVERTISE_ALL | ADVERTISE_100BASE4); + if (ecmd->speed == SPEED_10 && ecmd->duplex == DUPLEX_HALF) + adv |= ADVERTISE_10HALF; + if (ecmd->speed == SPEED_10 && 
ecmd->duplex == DUPLEX_FULL) + adv |= ADVERTISE_10FULL; + if (ecmd->speed == SPEED_100 && ecmd->duplex == DUPLEX_HALF) + adv |= ADVERTISE_100HALF; + if (ecmd->speed == SPEED_100 && ecmd->duplex == DUPLEX_FULL) + adv |= ADVERTISE_100FULL; + mii_rw(dev, np->phyaddr, MII_ADVERTISE, adv); + np->fixed_mode = adv; + + if (np->gigabit == PHY_GIGABIT) { + adv = mii_rw(dev, np->phyaddr, MII_1000BT_CR, MII_READ); + adv &= ~ADVERTISE_1000FULL; + mii_rw(dev, np->phyaddr, MII_1000BT_CR, adv); + } + + bmcr = mii_rw(dev, np->phyaddr, MII_BMCR, MII_READ); + /* clear autoneg, speed and duplex bits before forcing the mode */ + bmcr &= ~(BMCR_ANENABLE|BMCR_SPEED100|BMCR_FULLDPLX); + /* test np->fixed_mode: adv may now hold the MII_1000BT_CR value */ + if (np->fixed_mode & (ADVERTISE_10FULL|ADVERTISE_100FULL)) + bmcr |= BMCR_FULLDPLX; + if (np->fixed_mode & (ADVERTISE_100HALF|ADVERTISE_100FULL)) + bmcr |= BMCR_SPEED100; + mii_rw(dev, np->phyaddr, MII_BMCR, bmcr); + + if (netif_running(dev)) { + /* Wait a bit and then reconfigure the nic. */ + udelay(10); + nv_linkchange(dev); + } + } + spin_unlock_irq(&np->lock); + + return 0; +} + +#define FORCEDETH_REGS_VER 1 +#define FORCEDETH_REGS_SIZE 0x400 /* 256 32-bit registers */ + +static int nv_get_regs_len(struct net_device *dev) +{ + return FORCEDETH_REGS_SIZE; +} + +static void nv_get_regs(struct net_device *dev, struct ethtool_regs *regs, void *buf) +{ + struct fe_priv *np = get_nvpriv(dev); + u8 __iomem *base = get_hwbase(dev); + u32 *rbuf = buf; + int i; + + regs->version = FORCEDETH_REGS_VER; + spin_lock_irq(&np->lock); + for (i=0;i<FORCEDETH_REGS_SIZE/sizeof(u32);i++) + rbuf[i] = readl(base + i*sizeof(u32)); + spin_unlock_irq(&np->lock); +} + +static int nv_nway_reset(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + int ret; + + spin_lock_irq(&np->lock); + if (np->autoneg) { + int bmcr; + + bmcr = mii_rw(dev, np->phyaddr, MII_BMCR, MII_READ); + bmcr |= (BMCR_ANENABLE | BMCR_ANRESTART); + mii_rw(dev, np->phyaddr, MII_BMCR, bmcr); + + ret = 0; + } else { + ret = -EINVAL; + } + spin_unlock_irq(&np->lock); + + return ret; +} + +static struct ethtool_ops ops = { + .get_drvinfo = nv_get_drvinfo, + .get_link = ethtool_op_get_link, + .get_wol = nv_get_wol, + .set_wol = 
nv_set_wol, + .get_settings = nv_get_settings, + .set_settings = nv_set_settings, + .get_regs_len = nv_get_regs_len, + .get_regs = nv_get_regs, + .nway_reset = nv_nway_reset, +}; + static int nv_open(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); int ret, oom, i; dprintk(KERN_DEBUG "nv_open: begin\n"); @@ -1521,6 +2090,7 @@ writel(0, base + NvRegAdapterControl); /* 2) initialize descriptor rings */ + set_bufsize(dev); oom = nv_init_ring(dev); writel(0, base + NvRegLinkSpeed); @@ -1531,27 +2101,18 @@ np->in_shutdown = 0; /* 3) set mac address */ - { - u32 mac[2]; - - mac[0] = (dev->dev_addr[0] << 0) + (dev->dev_addr[1] << 8) + - (dev->dev_addr[2] << 16) + (dev->dev_addr[3] << 24); - mac[1] = (dev->dev_addr[4] << 0) + (dev->dev_addr[5] << 8); - - writel(mac[0], base + NvRegMacAddrA); - writel(mac[1], base + NvRegMacAddrB); - } + nv_copy_mac_to_hw(dev); /* 4) give hw rings */ writel((u32) np->ring_addr, base + NvRegRxRingPhysAddr); - writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc)), base + NvRegTxRingPhysAddr); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc)), base + NvRegTxRingPhysAddr); + else + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc_ex)), base + NvRegTxRingPhysAddr); writel( ((RX_RING-1) << NVREG_RINGSZ_RXSHIFT) + ((TX_RING-1) << NVREG_RINGSZ_TXSHIFT), base + NvRegRingSizes); /* 5) continue setup */ - np->linkspeed = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_10; - np->duplex = 0; - writel(np->linkspeed, base + NvRegLinkSpeed); writel(NVREG_UNKSETUP3_VAL1, base + NvRegUnknownSetupReg3); writel(np->desc_ver, base + NvRegTxRxControl); @@ -1569,7 +2130,7 @@ writel(NVREG_MISC1_FORCE | NVREG_MISC1_HD, base + NvRegMisc1); writel(readl(base + NvRegTransmitterStatus), base + NvRegTransmitterStatus); writel(NVREG_PFF_ALWAYS, base + NvRegPacketFilterFlags); - 
writel(NVREG_OFFLOAD_NORMAL, base + NvRegOffloadConfig); + writel(np->rx_buf_sz, base + NvRegOffloadConfig); writel(readl(base + NvRegReceiverStatus), base + NvRegReceiverStatus); get_random_bytes(&i, sizeof(i)); @@ -1620,6 +2181,9 @@ writel(NVREG_MIISTAT_MASK, base + NvRegMIIStatus); dprintk(KERN_INFO "startup: got 0x%08x.\n", miistat); } + /* set linkspeed to invalid value, thus force nv_update_linkspeed + * to init hw */ + np->linkspeed = 0; ret = nv_update_linkspeed(dev); nv_start_rx(dev); nv_start_tx(dev); @@ -1643,7 +2207,7 @@ static int nv_close(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base; + u8 __iomem *base; spin_lock_irq(&np->lock); np->in_shutdown = 1; @@ -1674,6 +2238,12 @@ if (np->wolenabled) nv_start_rx(dev); + /* special op: write back the misordered MAC address - otherwise + * the next nv_probe would see a wrong address. + */ + writel(np->orig_mac[0], base + NvRegMacAddrA); + writel(np->orig_mac[1], base + NvRegMacAddrB); + /* FIXME: power down nic */ return 0; @@ -1684,7 +2254,7 @@ struct net_device *dev; struct fe_priv *np; unsigned long addr; - u8 *base; + u8 __iomem *base; int err, i; dev = alloc_etherdev(sizeof(struct fe_priv)); @@ -1738,30 +2308,59 @@ } /* handle different descriptor versions */ - if (pci_dev->device == PCI_DEVICE_ID_NVIDIA_NVENET_1 || - pci_dev->device == PCI_DEVICE_ID_NVIDIA_NVENET_2 || - pci_dev->device == PCI_DEVICE_ID_NVIDIA_NVENET_3) - np->desc_ver = DESC_VER_1; - else + if (id->driver_data & DEV_HAS_HIGH_DMA) { + /* packet format 3: supports 40-bit addressing */ + np->desc_ver = DESC_VER_3; + if (pci_set_dma_mask(pci_dev, 0x0000007fffffffffULL)) { + printk(KERN_INFO "forcedeth: 64-bit DMA failed, using 32-bit addressing for device %s.\n", + pci_name(pci_dev)); + } + } else if (id->driver_data & DEV_HAS_LARGEDESC) { + /* packet format 2: supports jumbo frames */ np->desc_ver = DESC_VER_2; + } else { + /* original packet format */ + np->desc_ver = DESC_VER_1; + } + + np->pkt_limit = 
NV_PKTLIMIT_1; + if (id->driver_data & DEV_HAS_LARGEDESC) + np->pkt_limit = NV_PKTLIMIT_2; err = -ENOMEM; - dev->base_addr = (unsigned long) ioremap(addr, NV_PCI_REGSZ); - if (!dev->base_addr) + np->base = ioremap(addr, NV_PCI_REGSZ); + if (!np->base) goto out_relreg; + dev->base_addr = (unsigned long)np->base; + dev->irq = pci_dev->irq; - np->rx_ring = pci_alloc_consistent(pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING), - &np->ring_addr); - if (!np->rx_ring) - goto out_unmap; - np->tx_ring = &np->rx_ring[RX_RING]; + + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) { + np->rx_ring.orig = pci_alloc_consistent(pci_dev, + sizeof(struct ring_desc) * (RX_RING + TX_RING), + &np->ring_addr); + if (!np->rx_ring.orig) + goto out_unmap; + np->tx_ring.orig = &np->rx_ring.orig[RX_RING]; + } else { + np->rx_ring.ex = pci_alloc_consistent(pci_dev, + sizeof(struct ring_desc_ex) * (RX_RING + TX_RING), + &np->ring_addr); + if (!np->rx_ring.ex) + goto out_unmap; + np->tx_ring.ex = &np->rx_ring.ex[RX_RING]; + } dev->open = nv_open; dev->stop = nv_close; dev->hard_start_xmit = nv_start_xmit; dev->get_stats = nv_get_stats; dev->change_mtu = nv_change_mtu; + dev->set_mac_address = nv_set_mac_address; dev->set_multicast_list = nv_set_multicast; +#ifdef CONFIG_NET_POLL_CONTROLLER + dev->poll_controller = nv_poll_controller; +#endif SET_ETHTOOL_OPS(dev, &ops); dev->tx_timeout = nv_tx_timeout; dev->watchdog_timeo = NV_WATCHDOG_TIMEO; @@ -1806,17 +2405,10 @@ if (np->desc_ver == DESC_VER_1) { np->tx_flags = NV_TX_LASTPACKET|NV_TX_VALID; - if (id->driver_data & DEV_NEED_LASTPACKET1) - np->tx_flags |= NV_TX_LASTPACKET1; } else { np->tx_flags = NV_TX2_LASTPACKET|NV_TX2_VALID; - if (id->driver_data & DEV_NEED_LASTPACKET1) - np->tx_flags |= NV_TX2_LASTPACKET1; } - if (id->driver_data & DEV_IRQMASK_1) - np->irqmask = NVREG_IRQMASK_WANTED_1; - if (id->driver_data & DEV_IRQMASK_2) - np->irqmask = NVREG_IRQMASK_WANTED_2; + np->irqmask = NVREG_IRQMASK_WANTED; if (id->driver_data 
& DEV_NEED_TIMERIRQ) np->irqmask |= NVREG_IRQ_TIMER; if (id->driver_data & DEV_NEED_LINKTIMER) { @@ -1864,6 +2456,11 @@ phy_init(dev); } + /* set default link speed settings */ + np->linkspeed = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_10; + np->duplex = 0; + np->autoneg = 1; + err = register_netdev(dev); if (err) { printk(KERN_INFO "forcedeth: unable to register netdev: %d\n", err); @@ -1876,8 +2473,12 @@ return 0; out_freering: - pci_free_consistent(np->pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING), - np->rx_ring, np->ring_addr); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + pci_free_consistent(np->pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING), + np->rx_ring.orig, np->ring_addr); + else + pci_free_consistent(np->pci_dev, sizeof(struct ring_desc_ex) * (RX_RING + TX_RING), + np->rx_ring.ex, np->ring_addr); pci_set_drvdata(pci_dev, NULL); out_unmap: iounmap(get_hwbase(dev)); @@ -1895,18 +2496,14 @@ { struct net_device *dev = pci_get_drvdata(pci_dev); struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); unregister_netdev(dev); - /* special op: write back the misordered MAC address - otherwise - * the next nv_probe would see a wrong address. 
- */ - writel(np->orig_mac[0], base + NvRegMacAddrA); - writel(np->orig_mac[1], base + NvRegMacAddrB); - /* free all structures */ - pci_free_consistent(np->pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING), np->rx_ring, np->ring_addr); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + pci_free_consistent(np->pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING), np->rx_ring.orig, np->ring_addr); + else + pci_free_consistent(np->pci_dev, sizeof(struct ring_desc_ex) * (RX_RING + TX_RING), np->rx_ring.ex, np->ring_addr); iounmap(get_hwbase(dev)); pci_release_regions(pci_dev); pci_disable_device(pci_dev); @@ -1916,81 +2513,64 @@ static struct pci_device_id pci_tbl[] = { { /* nForce Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_1, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_IRQMASK_1|DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_1), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER, }, { /* nForce2 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_2, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_2), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER, }, { /* nForce3 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_3, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_3), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER, }, { /* nForce3 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_4, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - 
.driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_4), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC, }, { /* nForce3 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_5, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_5), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC, }, { /* nForce3 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_6, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_6), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC, }, { /* nForce3 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_7, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_7), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC, }, { /* CK804 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_8, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_8), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, }, { /* CK804 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_9, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + 
PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_9), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, }, { /* MCP04 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_10, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_10), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, }, { /* MCP04 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_11, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_11), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, + }, + { /* MCP51 Ethernet Controller */ + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_12), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_HIGH_DMA, + }, + { /* MCP51 Ethernet Controller */ + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_13), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_HIGH_DMA, + }, + { /* MCP55 Ethernet Controller */ + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_14), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, + }, + { /* MCP55 Ethernet Controller */ + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_15), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, }, {0,}, }; @@ -2016,7 +2596,7 @@ module_param(max_interrupt_work, int, 0); MODULE_PARM_DESC(max_interrupt_work, "forcedeth maximum events handled per interrupt"); - + MODULE_AUTHOR("Manfred Spraul "); MODULE_DESCRIPTION("Reverse Engineered nForce ethernet driver"); 
MODULE_LICENSE("GPL"); --- 2.4/include/linux/pci_ids.h 2005-06-01 02:56:56.000000000 +0200 +++ build-2.4/include/linux/pci_ids.h 2005-09-04 13:55:37.000000000 +0200 @@ -1034,6 +1034,10 @@ #define PCI_DEVICE_ID_NVIDIA_GEFORCE3_1 0x0201 #define PCI_DEVICE_ID_NVIDIA_GEFORCE3_2 0x0202 #define PCI_DEVICE_ID_NVIDIA_QUADRO_DDC 0x0203 +#define PCI_DEVICE_ID_NVIDIA_NVENET_12 0x0268 +#define PCI_DEVICE_ID_NVIDIA_NVENET_13 0x0269 +#define PCI_DEVICE_ID_NVIDIA_NVENET_14 0x0372 +#define PCI_DEVICE_ID_NVIDIA_NVENET_15 0x0373 #define PCI_VENDOR_ID_IMS 0x10e0 #define PCI_DEVICE_ID_IMS_8849 0x8849 --------------040003030102010107040708-- From pravin.shelar@gmail.com Sun Sep 4 13:13:37 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 04 Sep 2005 13:13:39 -0700 (PDT) Received: from rproxy.gmail.com (rproxy.gmail.com [64.233.170.192]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j84KDaiL023807 for ; Sun, 4 Sep 2005 13:13:36 -0700 Received: by rproxy.gmail.com with SMTP id 34so188837rns for ; Sun, 04 Sep 2005 13:11:02 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:reply-to:to:subject:mime-version:content-type:content-transfer-encoding:content-disposition; b=aQmsjwuk1upKhLuHC+eyBNwT6pIYz64Q4hMgITJ7Uc8uDg9ovFQcgCTo83Y1KgzmziwAynfmRFf8dJVkaQ1yPawqGcLx2Fs0A1JJoZXelW6ccyKt0bFodzQegW7cflR/hgVG16TWHQxy5IrnmOsNDfnuQ3qUxSvfqG+z5X4VBac= Received: by 10.11.119.4 with SMTP id r4mr72988cwc; Sun, 04 Sep 2005 13:11:02 -0700 (PDT) Received: by 10.11.117.12 with HTTP; Sun, 4 Sep 2005 13:11:02 -0700 (PDT) Message-ID: Date: Mon, 5 Sep 2005 01:41:02 +0530 From: pravin Reply-To: pravin.shelar@gmail.com To: netdev@oss.sgi.com Subject: question abt equal cost multipath networking Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j84KDaiL023807 X-archive-position: 3592 
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: pravin.shelar@gmail.com
Precedence: bulk
X-list: netdev
Content-Length: 754
Lines: 18

Hello everyone,

I am working on the equal-cost multipath networking code in the Linux
kernel, and I have studied its device round robin (drr) algorithm. The
drr algorithm examines the use count of each device to select the
outgoing device. The use count is defined as the number of sessions
opened on that device so far, which does not necessarily reflect the
current load on the device. Some other metric could be used to select
the outgoing device, e.g. the current length of the device's packet
queue.

Is there any specific reason for choosing the use count as the metric
for this algorithm? Can I change it to a different parameter, e.g. the
device queue length or the number of sessions currently open on a
device?

Thanks,
Pravin.

PS. I'm not on the list, so please CC me.

From ravinandan.arakali@neterion.com Tue Sep 6 14:56:12 2005
Received: with ECARTIS (v1.0.0; list netdev); Tue, 06 Sep 2005 14:56:15 -0700 (PDT)
Received: from ns1.s2io.com (ns1.s2io.com [142.46.200.198])
	by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j86LuBiL012791
	for ; Tue, 6 Sep 2005 14:56:12 -0700
Received: from guinness.s2io.com (sentry.s2io.com [142.46.200.199])
	by ns1.s2io.com (8.12.10/8.12.10) with ESMTP id j86LrWcx028112;
	Tue, 6 Sep 2005 17:53:32 -0400 (EDT)
Received: from localhost.localdomain ([10.16.16.97])
	by guinness.s2io.com (8.12.6/8.12.6) with ESMTP id j86LrTlb004377;
	Tue, 6 Sep 2005 17:53:30 -0400 (EDT)
Received: (from root@localhost)
	by localhost.localdomain (8.13.1/8.13.1/Submit) id j874au8m004304;
	Tue, 6 Sep 2005 21:36:56 -0700
Date: Tue, 6 Sep 2005 21:36:56 -0700
Message-Id: <200509070436.j874au8m004304@localhost.localdomain>
To: jgarzik@pobox.com, netdev@oss.sgi.com
CC: raghavendra.koushik@neterion.com, ravinandan.arakali@neterion.com,
	leonid.grossman@neterion.com, rapuru.sriram@neterion.com,
	ananda.raju@neterion.com
From: ravinandan.arakali@neterion.com
Subject: [PATCH 2.6.13] S2io: Hardware and miscellaneous fixes
X-Scanned-By: MIMEDefang 2.34
X-archive-position: 3595
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: ravinandan.arakali@neterion.com
Precedence: bulk
X-list: netdev
Content-Length: 11705
Lines: 307

Hi,
This patch contains the following hardware-related fixes and other
miscellaneous bug fixes.

1. Updated the definitions of single- and double-bit ECC errors.

2. Earlier we were allocating a number of Transmit descriptors equal
   to MAX_SKB_FRAGS. This was causing a boundary condition failure.
   We need to allocate MAX_SKB_FRAGS+1 descriptors.

3. On some platforms (like PPC), pci_alloc_consistent() can return a
   zero DMA address. Since the NIC cannot handle zero addresses, a
   workaround has been provided: we don't use such a page, and we
   reallocate instead.

4. If list_info allocation failed during driver load, check for it
   during driver exit and return instead of trying to dereference a
   NULL pointer.

5. Increase the debug level of a few non-critical debug messages.

6. Reset the card on critical ECC double errors only in the case of
   XframeI, since XframeII can recover from such errors.

7. Print copyright message on driver load.

8. Bumped up the driver version no. to 2.0.8.1

Signed-off-by: Ravinandan Arakali
---
diff -urpN old/drivers/net/s2io-regs.h new/drivers/net/s2io-regs.h
--- old/drivers/net/s2io-regs.h	2005-09-06 04:51:44.000000000 -0700
+++ new/drivers/net/s2io-regs.h	2005-09-06 04:52:08.000000000 -0700
@@ -1,5 +1,5 @@
 /************************************************************************
- * regs.h: A Linux PCI-X Ethernet driver for S2IO 10GbE Server NIC
+ * regs.h: A Linux PCI-X Ethernet driver for Neterion 10GbE Server NIC
  * Copyright(c) 2002-2005 Neterion Inc.
* This software may be used and distributed according to the terms of @@ -713,13 +713,16 @@ typedef struct _XENA_dev_config { u64 mc_err_reg; #define MC_ERR_REG_ECC_DB_ERR_L BIT(14) #define MC_ERR_REG_ECC_DB_ERR_U BIT(15) +#define MC_ERR_REG_MIRI_ECC_DB_ERR_0 BIT(18) +#define MC_ERR_REG_MIRI_ECC_DB_ERR_1 BIT(20) #define MC_ERR_REG_MIRI_CRI_ERR_0 BIT(22) #define MC_ERR_REG_MIRI_CRI_ERR_1 BIT(23) #define MC_ERR_REG_SM_ERR BIT(31) -#define MC_ERR_REG_ECC_ALL_SNG (BIT(6) | \ - BIT(7) | BIT(17) | BIT(19)) -#define MC_ERR_REG_ECC_ALL_DBL (BIT(14) | \ - BIT(15) | BIT(18) | BIT(20)) +#define MC_ERR_REG_ECC_ALL_SNG (BIT(2) | BIT(3) | BIT(4) | BIT(5) |\ + BIT(6) | BIT(7) | BIT(17) | BIT(19)) +#define MC_ERR_REG_ECC_ALL_DBL (BIT(10) | BIT(11) | BIT(12) |\ + BIT(13) | BIT(14) | BIT(15) |\ + BIT(18) | BIT(20)) u64 mc_err_mask; u64 mc_err_alarm; diff -urpN old/drivers/net/s2io.c new/drivers/net/s2io.c --- old/drivers/net/s2io.c 2005-09-06 04:51:44.000000000 -0700 +++ new/drivers/net/s2io.c 2005-09-06 04:52:08.000000000 -0700 @@ -1,5 +1,5 @@ /************************************************************************ - * s2io.c: A Linux PCI-X Ethernet driver for S2IO 10GbE Server NIC + * s2io.c: A Linux PCI-X Ethernet driver for Neterion 10GbE Server NIC * Copyright(c) 2002-2005 Neterion Inc. * This software may be used and distributed according to the terms of @@ -28,7 +28,7 @@ * explaination of all the variables. * rx_ring_num : This can be used to program the number of receive rings used * in the driver. - * rx_ring_len: This defines the number of descriptors each ring can have. This + * rx_ring_sz: This defines the number of descriptors each ring can have. This * is also an array of size 8. * tx_fifo_num: This defines the number of Tx FIFOs thats used int the driver. * tx_fifo_len: This too is an array of 8. Each element defines the number of @@ -67,7 +67,7 @@ /* S2io Driver name & version. 
*/ static char s2io_driver_name[] = "Neterion"; -static char s2io_driver_version[] = "Version 2.0.3.1"; +static char s2io_driver_version[] = "Version 2.0.8.1"; static inline int RXD_IS_UP2DT(RxD_t *rxdp) { @@ -404,7 +404,7 @@ static int init_shared_mem(struct s2io_n config->tx_cfg[i].fifo_len - 1; mac_control->fifos[i].fifo_no = i; mac_control->fifos[i].nic = nic; - mac_control->fifos[i].max_txds = MAX_SKB_FRAGS; + mac_control->fifos[i].max_txds = MAX_SKB_FRAGS + 1; for (j = 0; j < page_num; j++) { int k = 0; @@ -418,6 +418,26 @@ static int init_shared_mem(struct s2io_n DBG_PRINT(ERR_DBG, "failed for TxDL\n"); return -ENOMEM; } + /* If we got a zero DMA address(can happen on + * certain platforms like PPC), reallocate. + * Store virtual address of page we don't want, + * to be freed later. + */ + if (!tmp_p) { + mac_control->zerodma_virt_addr = tmp_v; + DBG_PRINT(INIT_DBG, + "%s: Zero DMA address for TxDL. ", dev->name); + DBG_PRINT(INIT_DBG, + "Virtual address %llx\n", (u64)tmp_v); + tmp_v = pci_alloc_consistent(nic->pdev, + PAGE_SIZE, &tmp_p); + if (!tmp_v) { + DBG_PRINT(ERR_DBG, + "pci_alloc_consistent "); + DBG_PRINT(ERR_DBG, "failed for TxDL\n"); + return -ENOMEM; + } + } while (k < lst_per_page) { int l = (j * lst_per_page) + k; if (l == config->tx_cfg[i].fifo_len) @@ -600,7 +620,7 @@ static void free_shared_mem(struct s2io_ mac_info_t *mac_control; struct config_param *config; int lst_size, lst_per_page; - + struct net_device *dev = nic->dev; if (!nic) return; @@ -616,9 +636,10 @@ static void free_shared_mem(struct s2io_ lst_per_page); for (j = 0; j < page_num; j++) { int mem_blks = (j * lst_per_page); - if ((!mac_control->fifos[i].list_info) || - (!mac_control->fifos[i].list_info[mem_blks]. - list_virt_addr)) + if (!mac_control->fifos[i].list_info) + return; + if (!mac_control->fifos[i].list_info[mem_blks]. + list_virt_addr) break; pci_free_consistent(nic->pdev, PAGE_SIZE, mac_control->fifos[i]. 
@@ -628,6 +649,18 @@ static void free_shared_mem(struct s2io_ list_info[mem_blks]. list_phy_addr); } + /* If we got a zero DMA address during allocation, + * free the page now + */ + if (mac_control->zerodma_virt_addr) { + pci_free_consistent(nic->pdev, PAGE_SIZE, + mac_control->zerodma_virt_addr, + (dma_addr_t)0); + DBG_PRINT(INIT_DBG, + "%s: Freeing TxDL with zero DMA addr. ", dev->name); + DBG_PRINT(INIT_DBG, "Virtual address %llx\n", + (u64)(mac_control->zerodma_virt_addr)); + } kfree(mac_control->fifos[i].list_info); } @@ -2479,9 +2512,10 @@ static void rx_intr_handler(ring_info_t #endif spin_lock(&nic->rx_lock); if (atomic_read(&nic->card_state) == CARD_DOWN) { - DBG_PRINT(ERR_DBG, "%s: %s going down for reset\n", + DBG_PRINT(INTR_DBG, "%s: %s going down for reset\n", __FUNCTION__, dev->name); spin_unlock(&nic->rx_lock); + return; } get_info = ring_data->rx_curr_get_info; @@ -2596,8 +2630,14 @@ static void tx_intr_handler(fifo_info_t if (txdlp->Control_1 & TXD_T_CODE) { unsigned long long err; err = txdlp->Control_1 & TXD_T_CODE; - DBG_PRINT(ERR_DBG, "***TxD error %llx\n", - err); + if ((err >> 48) == 0xA) { + DBG_PRINT(TX_DBG, "TxD returned due \ + to loss of link\n"); + } + else { + DBG_PRINT(ERR_DBG, "***TxD error \ + %llx\n", err); + } } skb = (struct sk_buff *) ((unsigned long) @@ -2689,12 +2729,16 @@ static void alarm_intr_handler(struct s2 if (val64 & MC_ERR_REG_ECC_ALL_DBL) { nic->mac_control.stats_info->sw_stat. 
double_ecc_errs++; - DBG_PRINT(ERR_DBG, "%s: Device indicates ", + DBG_PRINT(INIT_DBG, "%s: Device indicates ", dev->name); - DBG_PRINT(ERR_DBG, "double ECC error!!\n"); + DBG_PRINT(INIT_DBG, "double ECC error!!\n"); if (nic->device_type != XFRAME_II_DEVICE) { - netif_stop_queue(dev); - schedule_work(&nic->rst_timer_task); + /* Reset XframeI only if critical error */ + if (val64 & (MC_ERR_REG_MIRI_ECC_DB_ERR_0 | + MC_ERR_REG_MIRI_ECC_DB_ERR_1)) { + netif_stop_queue(dev); + schedule_work(&nic->rst_timer_task); + } } } else { nic->mac_control.stats_info->sw_stat. @@ -2706,7 +2750,8 @@ static void alarm_intr_handler(struct s2 val64 = readq(&bar0->serr_source); if (val64 & SERR_SOURCE_ANY) { DBG_PRINT(ERR_DBG, "%s: Device indicates ", dev->name); - DBG_PRINT(ERR_DBG, "serious error!!\n"); + DBG_PRINT(ERR_DBG, "serious error %llx!!\n", + (unsigned long long)val64); netif_stop_queue(dev); schedule_work(&nic->rst_timer_task); } @@ -3130,7 +3175,7 @@ int s2io_xmit(struct sk_buff *skb, struc queue_len = mac_control->fifos[queue].tx_curr_put_info.fifo_len + 1; /* Avoid "put" pointer going beyond "get" pointer */ if (txdp->Host_Control || (((put_off + 1) % queue_len) == get_off)) { - DBG_PRINT(ERR_DBG, "Error in xmit, No free TXDs.\n"); + DBG_PRINT(TX_DBG, "Error in xmit, No free TXDs.\n"); netif_stop_queue(dev); dev_kfree_skb(skb); spin_unlock_irqrestore(&sp->tx_lock, flags); @@ -3528,7 +3573,7 @@ static void s2io_set_multicast(struct ne val64 = readq(&bar0->mac_cfg); sp->promisc_flg = 1; - DBG_PRINT(ERR_DBG, "%s: entered promiscuous mode\n", + DBG_PRINT(INFO_DBG, "%s: entered promiscuous mode\n", dev->name); } else if (!(dev->flags & IFF_PROMISC) && (sp->promisc_flg)) { /* Remove the NIC from promiscuous mode */ @@ -3543,7 +3588,7 @@ static void s2io_set_multicast(struct ne val64 = readq(&bar0->mac_cfg); sp->promisc_flg = 0; - DBG_PRINT(ERR_DBG, "%s: left promiscuous mode\n", + DBG_PRINT(INFO_DBG, "%s: left promiscuous mode\n", dev->name); } @@ -5325,7 +5370,7 @@ 
s2io_init_nic(struct pci_dev *pdev, cons break; } } - config->max_txds = MAX_SKB_FRAGS; + config->max_txds = MAX_SKB_FRAGS + 1; /* Rx side parameters. */ if (rx_ring_sz[0] == 0) @@ -5525,9 +5570,14 @@ s2io_init_nic(struct pci_dev *pdev, cons if (sp->device_type & XFRAME_II_DEVICE) { DBG_PRINT(ERR_DBG, "%s: Neterion Xframe II 10GbE adapter ", dev->name); - DBG_PRINT(ERR_DBG, "(rev %d), Driver %s\n", + DBG_PRINT(ERR_DBG, "(rev %d), %s", get_xena_rev_id(sp->pdev), s2io_driver_version); +#ifdef CONFIG_2BUFF_MODE + DBG_PRINT(ERR_DBG, ", Buffer mode %d",2); +#endif + + DBG_PRINT(ERR_DBG, "\nCopyright(c) 2002-2005 Neterion Inc.\n"); DBG_PRINT(ERR_DBG, "MAC ADDR: %02x:%02x:%02x:%02x:%02x:%02x\n", sp->def_mac_addr[0].mac_addr[0], sp->def_mac_addr[0].mac_addr[1], @@ -5544,9 +5594,13 @@ s2io_init_nic(struct pci_dev *pdev, cons } else { DBG_PRINT(ERR_DBG, "%s: Neterion Xframe I 10GbE adapter ", dev->name); - DBG_PRINT(ERR_DBG, "(rev %d), Driver %s\n", + DBG_PRINT(ERR_DBG, "(rev %d), %s", get_xena_rev_id(sp->pdev), s2io_driver_version); +#ifdef CONFIG_2BUFF_MODE + DBG_PRINT(ERR_DBG, ", Buffer mode %d",2); +#endif + DBG_PRINT(ERR_DBG, "\nCopyright(c) 2002-2005 Neterion Inc.\n"); DBG_PRINT(ERR_DBG, "MAC ADDR: %02x:%02x:%02x:%02x:%02x:%02x\n", sp->def_mac_addr[0].mac_addr[0], sp->def_mac_addr[0].mac_addr[1], diff -urpN old/drivers/net/s2io.h new/drivers/net/s2io.h --- old/drivers/net/s2io.h 2005-09-06 04:51:44.000000000 -0700 +++ new/drivers/net/s2io.h 2005-09-06 04:52:08.000000000 -0700 @@ -1,5 +1,5 @@ /************************************************************************ - * s2io.h: A Linux PCI-X Ethernet driver for S2IO 10GbE Server NIC + * s2io.h: A Linux PCI-X Ethernet driver for Neterion 10GbE Server NIC * Copyright(c) 2002-2005 Neterion Inc. 
  * This software may be used and distributed according to the terms of
@@ -622,6 +622,9 @@ typedef struct mac_info {
 	/* Fifo specific structure */
 	fifo_info_t fifos[MAX_TX_FIFOS];
 
+	/* Save virtual address of TxD page with zero DMA addr(if any) */
+	void *zerodma_virt_addr;
+
 	/* rx side stuff */
 	/* Ring specific structure */
 	ring_info_t rings[MAX_RX_RINGS];

From sim@netnation.com Tue Sep 6 16:59:34 2005
Received: with ECARTIS (v1.0.0; list netdev); Tue, 06 Sep 2005 16:59:40 -0700 (PDT)
Received: from peace.netnation.com (newpeace.netnation.com [204.174.223.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j86NxYiL025151 for ; Tue, 6 Sep 2005 16:59:34 -0700
Received: from sim by peace.netnation.com with local (Exim 4.50) id 1ECnJE-0008MJ-B2; Tue, 06 Sep 2005 16:57:00 -0700
Date: Tue, 6 Sep 2005 16:57:00 -0700
From: Simon Kirby
To: Robert Olsson
Cc: Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com
Subject: Re: Route cache performance
Message-ID: <20050906235700.GA31820@netnation.com>
References: <20050815213855.GA17832@netnation.com> <43014E27.1070104@cosmosbay.com> <20050823190852.GA20794@netnation.com> <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <17167.29239.469711.847951@robur.slu.se>
User-Agent: Mutt/1.5.9i
X-archive-position: 3596
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: sim@netnation.com
Precedence: bulk
X-list: netdev
Content-Length: 959
Lines: 35

On Fri, Aug 26, 2005 at 09:49:11PM +0200, Robert Olsson wrote:

> Hello!
>
> This thread seems familiar :)
>
> I think Simon uses UP and it could be idea to check if the RCU deferred
> deletion causes the problem.
>...
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -485,7 +485,11 @@ static struct file_operations rt_cpu_seq
>  static __inline__ void rt_free(struct rtable *rt)
>  {
>  	multipath_remove(rt);
> +#ifdef CONFIG_SMP
>  	call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
> +#else
> +	dst_free((struct dst_entry *)rt);
> +#endif
>  }
>
>  static __inline__ void rt_drop(struct rtable *rt)

Woot!

Yes, this is the difference. With the patch applied (just directly
freeing the dst_entry), everything balances easily, there are no
overflows, and the result of rt_may_expire() looks very reasonable.
(Yay!)

So, this seems to be the culprit. Is NAPI supposed to allow the
queued bh to run or should we just not be queuing this?

Simon-

From kuznet@yakov.inr.ac.ru Tue Sep 6 18:23:03 2005
Received: with ECARTIS (v1.0.0; list netdev); Tue, 06 Sep 2005 18:23:06 -0700 (PDT)
Received: from yakov.inr.ac.ru (yakov.inr.ac.ru [194.67.69.111]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id j871N0iL028695 for ; Tue, 6 Sep 2005 18:23:03 -0700
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=ms2.inr.ac.ru; b=iFjczkWuIH1P4zF6UygCD2wzbbcJ7WITdAhBmRUQ9ETt3lt//K7Sjn1QOXU1fAJ91I3u8LhAQxu6xpPTJwtAhuOX3Wn9rmJ2yEFQFHB7x2OtmRDysvqyIUM/KE1tU/8dL7fH1JbGmaTMhwKMQ4jkUK/oH1xYP5J84Z3RRr64Dbg=;
Received: (from kuznet@localhost) envelope-from=kuznet by yakov.inr.ac.ru (8.6.13/ANK) id FAA25774; Wed, 7 Sep 2005 05:19:59 +0400
Date: Wed, 7 Sep 2005 05:19:59 +0400
From: Alexey Kuznetsov
To: Simon Kirby
Cc: Robert Olsson , Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com
Subject: Re: Route cache performance
Message-ID: <20050907011959.GA25725@yakov.inr.ac.ru>
References: <43014E27.1070104@cosmosbay.com> <20050823190852.GA20794@netnation.com> <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com>
 <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20050906235700.GA31820@netnation.com>
User-Agent: Mutt/1.5.6i
X-archive-position: 3597
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: kuznet@ms2.inr.ac.ru
Precedence: bulk
X-list: netdev
Content-Length: 848
Lines: 23

Hello!

On Tue, Sep 06, 2005 at 04:57:00PM -0700, Simon Kirby wrote:
> On Fri, Aug 26, 2005 at 09:49:11PM +0200, Robert Olsson wrote:
...
> > I think Simon uses UP and it could be idea to check if the RCU deferred
> > deletion causes the problem.
..
> Yes, this is the difference. With the patch applied (just directly
> freeing the dst_entry), everything balances easily, there are no
> overflows, and the result of rt_may_expire() looks very reasonable.
> (Yay!)
>
> So, this seems to be the culprit. Is NAPI supposed to allow the
> queued bh to run or should we just not be queuing this?

It is supposed to work. :-)
The problem is like an unkillable zombie.

Robert, have you seen this phenomenon already? Did you mean that SMP works
or that it never works (but this patch is valid only for UP)? Did it
become worse after 2.6.9?
Alexey From Robert.Olsson@data.slu.se Wed Sep 7 07:48:38 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 07:48:52 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87EmUiL030813 for ; Wed, 7 Sep 2005 07:48:38 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j87Ej5u6023315; Wed, 7 Sep 2005 16:45:05 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id EE4CEEC3CC; Wed, 7 Sep 2005 16:45:03 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17182.64751.340488.996748@robur.slu.se> Date: Wed, 7 Sep 2005 16:45:03 +0200 To: Simon Kirby Cc: Robert Olsson , Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance In-Reply-To: <20050906235700.GA31820@netnation.com> References: <20050815213855.GA17832@netnation.com> <43014E27.1070104@cosmosbay.com> <20050823190852.GA20794@netnation.com> <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> X-Mailer: VM 7.19 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-archive-position: 3598 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Content-Length: 998 Lines: 31 Simon Kirby writes: > Woot! > > Yes, this is the difference. With the patch applied (ajust directly > freeing the dst_entry), everything balances easily, there are no > overflows, and the result of rt_may_expire() looks very reasonable. > (Yay!) > > So, this seems to be the culprit. 
Is NAPI supposed to allow the
 > queued bh to run or should we just not be queuing this?

Packet processing happens in RX_SOFTIRQ. NAPI or non-NAPI makes no
difference: with RCU deferred delete, the freeing should be done by the
RCU tasklet, when tasklets are run after the real softirqs.

There is a limit for RCU work: maxbatch, which is set to 10. You could back
out the patch and try increasing it to 1000/10000, so we know whether this
prevents the freeing of entries.

 module_param(maxbatch, int, 0); /* rcupdate.c */

Also, RCU clearly states that it should be used in read-mostly situations;
rDoS is outside this scope. Anyway, it would be interesting to understand
what's going on.

Cheers.
--ro

From Robert.Olsson@data.slu.se Wed Sep 7 08:06:29 2005
Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 08:06:32 -0700 (PDT)
Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87F6SiL000360 for ; Wed, 7 Sep 2005 08:06:28 -0700
Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j87F3Hxk025818; Wed, 7 Sep 2005 17:03:17 +0200
Received: by robur.slu.se (Postfix, from userid 1000) id 57D39EC3CC; Wed, 7 Sep 2005 17:03:17 +0200 (CEST)
From: Robert Olsson
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <17183.309.317160.103056@robur.slu.se>
Date: Wed, 7 Sep 2005 17:03:17 +0200
To: Alexey Kuznetsov
Cc: Simon Kirby , Robert Olsson , Eric Dumazet , netdev@oss.sgi.com
Subject: Re: Route cache performance
In-Reply-To: <20050907011959.GA25725@yakov.inr.ac.ru>
References: <43014E27.1070104@cosmosbay.com> <20050823190852.GA20794@netnation.com> <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se>
 <20050906235700.GA31820@netnation.com> <20050907011959.GA25725@yakov.inr.ac.ru>
X-Mailer: VM 7.19 under Emacs 21.4.1
X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70
X-archive-position: 3599
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: Robert.Olsson@data.slu.se
Precedence: bulk
X-list: netdev
Content-Length: 710
Lines: 20

Alexey Kuznetsov writes:

 > Robert, have you seen this phenomenon already? Did you mean that SMP works
 > or that it never works (but this patch is valid only for UP)? Did it
 > become worse after 2.6.9?

It has been quite some time since I saw a dst cache overflow, and we use 2.6
in infrastructure. Anyway, I was able to "tune" the route cache so that I see
it in our lab system on an SMP box. I think UP and SMP behave the same,
but with UP we could disable the deferred delete as Simon tested.

I don't know if anything happened in 2.6.9; I don't think so. But any
improvement in drivers or FIB lookup may increase the burden so we get
overflows. We had some code that checked the RCU latency.

Cheers.
--ro From sim@netnation.com Wed Sep 7 09:31:28 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 09:31:33 -0700 (PDT) Received: from peace.netnation.com (newpeace.netnation.com [204.174.223.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87GVSiL009195 for ; Wed, 7 Sep 2005 09:31:28 -0700 Received: from sim by peace.netnation.com with local (Exim 4.50) id 1ED2n8-0007sq-E6; Wed, 07 Sep 2005 09:28:54 -0700 Date: Wed, 7 Sep 2005 09:28:54 -0700 From: Simon Kirby To: Robert Olsson Cc: Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050907162854.GB24735@netnation.com> References: <20050823190852.GA20794@netnation.com> <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17182.64751.340488.996748@robur.slu.se> User-Agent: Mutt/1.5.9i X-archive-position: 3600 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sim@netnation.com Precedence: bulk X-list: netdev Content-Length: 1029 Lines: 23 On Wed, Sep 07, 2005 at 04:45:03PM +0200, Robert Olsson wrote: > Packet processing happens in RX_SOFIRQ. NAPI or non-NAPI is no difference > with RCU deferred delete this should happen by the RCU-tasklet when > tasklets are run after the real SOFTIRQ's. > > There is a limit for RCU work... maxbatch it's set to 10 you could back > out the patch and try increase it 1000/10000 so we know this not prevent > the freeing of entries. 
Yes, setting maxbatch to 10000 also results in working gc, though routing
throughput is about 5.7% higher when just calling dst_free directly.

> Also RCU clearly states that is should be used in read-mostly situations
> rDoS is outside this scope. Anyway it would be interesting to understand
> what's going on.

There was discussion about this before (recycling of existing entries is
also now impossible, as compared with 2.4). It's a shame that this win
for the normal case also hurts the DoS case...and it really hurts when
the DoS case is the normal case.

Simon-

From Robert.Olsson@data.slu.se Wed Sep 7 09:52:02 2005
Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 09:52:06 -0700 (PDT)
Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87GpwiL010409 for ; Wed, 7 Sep 2005 09:52:01 -0700
Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j87Gn4SX010775; Wed, 7 Sep 2005 18:49:04 +0200
Received: by robur.slu.se (Postfix, from userid 1000) id 03AC3EC3CC; Wed, 7 Sep 2005 18:49:03 +0200 (CEST)
From: Robert Olsson
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <17183.6655.977975.249491@robur.slu.se>
Date: Wed, 7 Sep 2005 18:49:03 +0200
To: Simon Kirby
Cc: Robert Olsson , Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com
Subject: Re: Route cache performance
In-Reply-To: <20050907162854.GB24735@netnation.com>
References: <20050823190852.GA20794@netnation.com> <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com>
X-Mailer: VM 7.19 under Emacs
 21.4.1
X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70
X-archive-position: 3601
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: Robert.Olsson@data.slu.se
Precedence: bulk
X-list: netdev
Content-Length: 748
Lines: 21

Simon Kirby writes:

 > Yes, setting maxbatch to 10000 also results in working gc, though routing
 > throughput is about 5.7% higher when just calling dst_free directly.

Oh, that's good news... You lose 5.7% for rDoS but should benefit in
normal conditions.

 > There was discussion about this before (recycling of existing entries is
 > also now impossible, as compared with 2.4). It's a shame that this win
 > for the normal case also hurts the DoS case...and it really hurts when
 > the DoS case is the normal case.

It's called trade-offs :) rDoS is hardly the normal case? But maybe it's time
to compare routing via route hash vs FIB lookup directly again, now that
we have RCU with some FIB lookups too.

Cheers.
--ro From sim@netnation.com Wed Sep 7 09:58:02 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 09:58:05 -0700 (PDT) Received: from peace.netnation.com (newpeace.netnation.com [204.174.223.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87Gw2iL011020 for ; Wed, 7 Sep 2005 09:58:02 -0700 Received: from sim by peace.netnation.com with local (Exim 4.50) id 1ED3Cq-0008UA-KF; Wed, 07 Sep 2005 09:55:28 -0700 Date: Wed, 7 Sep 2005 09:55:28 -0700 From: Simon Kirby To: Robert Olsson Cc: Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050907165528.GC24735@netnation.com> References: <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <20050907011959.GA25725@yakov.inr.ac.ru> <17183.309.317160.103056@robur.slu.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17183.309.317160.103056@robur.slu.se> User-Agent: Mutt/1.5.9i X-archive-position: 3602 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sim@netnation.com Precedence: bulk X-list: netdev Content-Length: 1346 Lines: 28 On Wed, Sep 07, 2005 at 05:03:17PM +0200, Robert Olsson wrote: > It was quite some time since I saw dst cache overflow and we use 2.6 > in infrastructure. Anyway I was able to "tune" route cache so I see > in our lab system on a SMP box. I think UP and SMP behaves the same > but with UP we could disable the deferred delete as Simon tested. > > I don't know if anything happen in 2.6.9 I don't think so. But any > improvement in drivers or FIB lookup may increase the burden so we get > overflows. 
I believe what I've been seeing is a _reduction_ in performance in both
the e1000 driver and other parts of the kernel that result in it handling
these packets much more slowly than in 2.4. The dst cache only overflows
when the thing is completely pegged, so earlier 2.6 versions that were a
little faster (eg: 2.6.11) were only overflowing occasionally depending
on the speed of the input traffic.

I've only been able to send 179 Mbps from one box, so that's what has
been killing it. On the receiving end, 2.6.13-rc6 with the direct
dst_free now drops a bunch but stays responsive with working GC,
routing through about 69.6 Mbps, while 2.4.27 routes 103 Mbps worth.

If it would be helpful, I can build some scripts to do benchmarks with
different kernel combinations, and run it on a bunch of different kernel
versions.

Simon-

From sim@netnation.com Wed Sep 7 10:00:32 2005
Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 10:00:36 -0700 (PDT)
Received: from peace.netnation.com (newpeace.netnation.com [204.174.223.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87H0WiL011509 for ; Wed, 7 Sep 2005 10:00:32 -0700
Received: from sim by peace.netnation.com with local (Exim 4.50) id 1ED3FG-0008W1-Lw; Wed, 07 Sep 2005 09:57:58 -0700
Date: Wed, 7 Sep 2005 09:57:58 -0700
From: Simon Kirby
To: Robert Olsson
Cc: Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com
Subject: Re: Route cache performance
Message-ID: <20050907165758.GD24735@netnation.com>
References: <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <17183.6655.977975.249491@robur.slu.se>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
<17183.6655.977975.249491@robur.slu.se> User-Agent: Mutt/1.5.9i X-archive-position: 3603 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sim@netnation.com Precedence: bulk X-list: netdev Content-Length: 389 Lines: 10 On Wed, Sep 07, 2005 at 06:49:03PM +0200, Robert Olsson wrote: > It's called trade-off's :) rDoS is hardly nomal case? But maybe it's time > to compare routing via route hash vs FIB lookup directly again now when > we have RCU with some FIB lookup's too. I haven't even filled the route tables yet. I've just been testing with a bog standard table (three /24s and one /0). Simon- From Robert.Olsson@data.slu.se Wed Sep 7 10:24:04 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 10:24:08 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87HO1iL013264 for ; Wed, 7 Sep 2005 10:24:04 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j87HLEIG015282; Wed, 7 Sep 2005 19:21:14 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id 1468AEC3CC; Wed, 7 Sep 2005 19:21:14 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17183.8586.47462.585303@robur.slu.se> Date: Wed, 7 Sep 2005 19:21:14 +0200 To: Simon Kirby Cc: Robert Olsson , Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance In-Reply-To: <20050907165528.GC24735@netnation.com> References: <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <20050907011959.GA25725@yakov.inr.ac.ru> 
<17183.309.317160.103056@robur.slu.se> <20050907165528.GC24735@netnation.com> X-Mailer: VM 7.19 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-archive-position: 3604 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Content-Length: 618 Lines: 15 Simon Kirby writes: > I've only been able to send 179 Mbps from one box, so that's what has > been killing it. On the receiving end, 2.6.13-rc6 with the direct > dst_free now drops a bunch but stays responsive with working GC, > routing through about 69.6 Mbps, while 2.4.27 routes 103 Mbps worth. If route hash setup is identical, buckets etc and HZ is same etc. I have no idea about the performance difference. Somebody else? In other case you need to compare (o)profiles and see if this can give us any hints. To test drivers etc you might also want to test with a single flow. Cheers. --ro From kuznet@yakov.inr.ac.ru Wed Sep 7 13:02:13 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 13:02:16 -0700 (PDT) Received: from yakov.inr.ac.ru (yakov.inr.ac.ru [194.67.69.111]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id j87K2BiL026980 for ; Wed, 7 Sep 2005 13:02:12 -0700 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=ms2.inr.ac.ru; b=hHAZRrUeIVToG8JeQmW1EGBVbwL0xYDvpoGcUwjBB5LOaERi2V3eM1KiJTlFkMfeThMO6Cm7K1LENi3nphqLVHN+cYKWwc2hnV1QlUq9jLyVBIrEQ0plFjvUJd8kCaK4waAJJGj99P0UAtAH5USEH3llGKCF32xZSk3BBWWqwVM=; Received: (from kuznet@localhost) envelope-from=kuznet by yakov.inr.ac.ru (8.6.13/ANK) id XAA08451; Wed, 7 Sep 2005 23:59:11 +0400 Date: Wed, 7 Sep 2005 23:59:11 +0400 From: Alexey Kuznetsov To: Simon Kirby Cc: Robert Olsson , Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050907195911.GA8382@yakov.inr.ac.ru> References: <17163.32645.202453.145416@robur.slu.se> 
 <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20050907162854.GB24735@netnation.com>
User-Agent: Mutt/1.5.6i
X-archive-position: 3605
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: kuznet@ms2.inr.ac.ru
Precedence: bulk
X-list: netdev
Content-Length: 341
Lines: 14

Hello!

> Yes, setting maxbatch to 10000 also results in working gc,

Could you try lower values? For example, I guess 300 or a little more
(it is netdev_max_backlog) should be enough.

> for the normal case also hurts the DoS case...and it really hurts when
> the DoS case is the normal case.

5.7% is not "really hurts" yet.
:-) Alexey From bernd-schubert@gmx.de Fri Sep 9 10:37:36 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 10:37:39 -0700 (PDT) Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j89HbWiL014953 for ; Fri, 9 Sep 2005 10:37:35 -0700 Received: from hamilton1.pci.uni-heidelberg.de (hamilton1.pci.uni-heidelberg.de [129.206.21.201]) by relay.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id j89HYmq6005840; Fri, 9 Sep 2005 19:34:48 +0200 Received: from lanczos.pci.uni-heidelberg.de ([129.206.21.135] helo=lanczos ident=foobar) by hamilton1.pci.uni-heidelberg.de with smtp (Exim 3.36 #1 (Debian)) id 1EDmm3-0001Xg-00; Fri, 09 Sep 2005 19:34:51 +0200 Received: by lanczos (sSMTP sendmail emulation); Fri, 9 Sep 2005 19:34:51 +0200 From: Bernd Schubert To: netdev@oss.sgi.com Subject: skge: reboot on sysfs resource0 access Date: Fri, 9 Sep 2005 19:34:50 +0200 User-Agent: KMail/1.7.2 Cc: Stephen Hemminger MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Disposition: inline Message-Id: <200509091934.51301.bernd-schubert@gmx.de> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j89HbWiL014953 X-archive-position: 3606 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bernd-schubert@gmx.de Precedence: bulk X-list: netdev Content-Length: 1404 Lines: 37 Hello, today we tried 2.6.13 on our server and also tried to use the skge driver. Well, in principle it works fine, until I became curious about the sysfs values. Stupid me, I was using the midnight commander to read the values. When I opened "/sys/bus/pci/drivers/skge/0000:01:01.0/resource0", the system immediately rebooted. After the reboot we tested using cat to the resource0 file, which gave an input/output error. Using again the mc, the system again immediately rebooted. 
Well, I guess I'd better not use the midnight commander in the future, but
somehow I think it shouldn't cause the system to reboot, should it?
Is the i/o error of cat supposed to happen?

In case it helps, here is a strace of mc's open for a normal file:

open("/home/bernd/notes", O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 6
fstat64(6, {st_mode=S_IFREG|0644, st_size=96, ...}) = 0
fcntl64(102, F_GETFL) = -1 EBADF (Bad file descriptor)
read(6, "http", 4) = 4
mmap2(NULL, 96, PROT_READ, MAP_SHARED, 6, 0) = 0x402fe000
select(5, [4], NULL, NULL, {0, 0}) = 0 (Timeout)
select(5, [4], NULL, NULL, {0, 0}) = 0 (Timeout)
write(1, "\33[1;1H\33[m\17\33[30m\33[46mFile: notes "..., 4019) = 4019

Thanks,
Bernd

-- 
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: bernd.schubert@pci.uni-heidelberg.de

From kas@fi.muni.cz Fri Sep 9 10:42:10 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 10:42:19 -0700 (PDT)
Received: from tirith.ics.muni.cz (tirith.ics.muni.cz [147.251.4.36]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j89Hg8iL015660 for ; Fri, 9 Sep 2005 10:42:09 -0700
Received: from anxur.fi.muni.cz (anxur.fi.muni.cz [147.251.48.3]) by tirith.ics.muni.cz (8.13.2/8.13.2) with ESMTP id j89HdSww006438; Fri, 9 Sep 2005 19:39:30 +0200
Received: by anxur.fi.muni.cz (Postfix, from userid 11561) id EA97922AF67; Fri, 9 Sep 2005 19:39:28 +0200 (CEST)
Date: Fri, 9 Sep 2005 19:39:28 +0200
From: Jan Kasprzak
To: linux-kernel@vger.kernel.org
Cc: netdev@oss.sgi.com
Subject: TCP segmentation offload performance
Message-ID: <20050909173928.GI4823@fi.muni.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.4.1i
X-Muni-Spam-TestIP: 147.251.48.3
X-Muni-Envelope-From: kas@fi.muni.cz
X-Muni-Virus-Test: Clean
X-archive-position: 3608
X-ecartis-version: Ecartis v1.0.0
Sender: netdev-bounce@oss.sgi.com
Errors-to: netdev-bounce@oss.sgi.com
X-original-sender: kas@fi.muni.cz
Precedence: bulk
X-list: netdev
Content-Length: 2474
Lines: 54

Hello, world!

I tried to find out whether the TCP segmentation offload can perform
better on my server than no TSO at all. My server is a dual Opteron 244 with
a Tyan S2882 board with the following NIC:

eth0: Tigon3 [partno(BCM95704A7) rev 2003 PHY(5704)] (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:e0:81:27:de:17
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1]
eth0: dma_rwctrl[769f4000]

The server runs ProFTPd with sendfile(2) enabled (and I have
verified that it is being used with strace(8)). The kernel is 2.6.12.2.

I have found that according to ethtool -k eth0 the TSO is switched
off by default. So I tried to switch it on (although I wondered why it is
not switched on by default, provided that the hardware supports this feature).

I tried to measure the difference by downloading an ISO image
of FC4 i386 CD1 (665434112 bytes) from two hosts connected to the same
switch. I did 10 transfers of the same file with each setting, and
took the average and maximum of the last five transfers only (to avoid
any start-up temporary conditions). The client Alpha was a dual Opteron 248
with a Tyan S2882 board, and the client Beta was a quad Opteron 848 on an
HP DL-585 board.

Client  TSO  Average speed  Max speed
Alpha   off  108.7 MB/s     110.5 MB/s
Alpha   on   100.9 MB/s     101.2 MB/s
Beta    off  102.1 MB/s     102.4 MB/s
Beta    on    93.2 MB/s      95.5 MB/s

Surprisingly enough, the tests without TSO were faster than
with TSO enabled.
Looking at tcpdump it seems that the system with TSO enabled
sends only 15 KB-sized frames to the NIC instead of full 64 KB-sized ones:

18:45:38.993150 IP odysseus.ftp-data > alpha.33125: P 127424:143352(15928) ack 1 win 1460
18:45:38.993203 IP odysseus.ftp-data > alpha.33125: P 143352:159280(15928) ack 1 win 1460

So I wonder what is wrong with TSO on my hardware and whether
the TSO is expected to be faster than generating MTU-sized packets in the
TCP stack. I did not measure the CPU usage on the server, only the network
speed.

Thanks!

-Yenya

-- 
| Jan "Yenya" Kasprzak |
| GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E |
| http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ |
>>> $ cd my-kernel-tree-2.6 <<<
>>> $ dotest /path/to/mbox # yes, Linus has no taste in naming scripts <<<

From bernd.schubert@pci.uni-heidelberg.de Fri Sep 9 10:41:00 2005
Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 10:41:04 -0700 (PDT)
Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j89HewiL015306 for ; Fri, 9 Sep 2005 10:41:00 -0700
Received: from hamilton1.pci.uni-heidelberg.de (hamilton1.pci.uni-heidelberg.de [129.206.21.201]) by relay.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id j89HcEqR007542; Fri, 9 Sep 2005 19:38:14 +0200
Received: from lanczos.pci.uni-heidelberg.de ([129.206.21.135] helo=lanczos ident=foobar) by hamilton1.pci.uni-heidelberg.de with smtp (Exim 3.36 #1 (Debian)) id 1EDmpO-0001Z3-00; Fri, 09 Sep 2005 19:38:18 +0200
Received: by lanczos (sSMTP sendmail emulation); Fri, 9 Sep 2005 19:38:18 +0200
From: Bernd Schubert
To: netdev@oss.sgi.com
Subject: skge: reboot on sysfs resource0 access
User-Agent: KMail/1.7.2
Cc: Stephen Hemminger
MIME-Version: 1.0
Content-Disposition: inline
Date: Fri, 9 Sep 2005 19:38:17 +0200
Reply-To: bernd-schubert@gmx.de
Content-Type: text/plain; charset="iso-8859-1"
Message-Id:
<200509091938.18079.bernd.schubert@pci.uni-heidelberg.de> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j89HewiL015306 X-archive-position: 3607 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bernd.schubert@pci.uni-heidelberg.de Precedence: bulk X-list: netdev Content-Length: 1572 Lines: 45 Hello, today we tried 2.6.13 on our server and also tried to use the skge driver. Well, in principle it works fine, until I became curious about the sysfs values. Stupid me, I was using the midnight commander to read the values. When I opened "/sys/bus/pci/drivers/skge/0000:01:01.0/resource0", the system immediately rebooted. After the reboot we tested using cat to the resource0 file, which gave an input/output error. Using again the mc, the system again immediately rebooted. Well, I guess I better don't use the midnight commander in the future, but somehow I think it shouldn't cause the system to reboot, should it? Is the i/o error of cat supposed to happen? 
In case it helps, here is an strace of mc opening a normal file:

open("/home/bernd/notes", O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 6
fstat64(6, {st_mode=S_IFREG|0644, st_size=96, ...}) = 0
fcntl64(102, F_GETFL) = -1 EBADF (Bad file descriptor)
read(6, "http", 4) = 4
mmap2(NULL, 96, PROT_READ, MAP_SHARED, 6, 0) = 0x402fe000
select(5, [4], NULL, NULL, {0, 0}) = 0 (Timeout)
select(5, [4], NULL, NULL, {0, 0}) = 0 (Timeout)
write(1, "\33[1;1H\33[m\17\33[30m\33[46mFile: notes "..., 4019) = 4019

Thanks,
Bernd

--
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: bernd.schubert@pci.uni-heidelberg.de

From shemminger@osdl.org Fri Sep 9 11:04:15 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 11:04:20 -0700 (PDT) Received: from smtp.osdl.org (smtp.osdl.org [65.172.181.4]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j89I4FiL018777 for ; Fri, 9 Sep 2005 11:04:15 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id j89I1aBo029724 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Fri, 9 Sep 2005 11:01:37 -0700 Received: from localhost.localdomain (dxpl.pdx.osdl.net [10.8.0.74]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with ESMTP id j89I1as4028712; Fri, 9 Sep 2005 11:01:36 -0700 Date: Fri, 9 Sep 2005 11:01:53 -0700 From: Stephen Hemminger To: bernd-schubert@gmx.de Cc: bernd.schubert@pci.uni-heidelberg.de, netdev@oss.sgi.com Subject: Re: skge: reboot on sysfs resource0 access Message-ID: <20050909110153.5a2e2e90@localhost.localdomain> In-Reply-To: <200509091938.18079.bernd.schubert@pci.uni-heidelberg.de> References: <200509091938.18079.bernd.schubert@pci.uni-heidelberg.de> X-Mailer: Sylpheed-Claws 1.9.13 (GTK+ 2.6.7;
x86_64-redhat-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.115 $ X-Scanned-By: MIMEDefang 2.36 X-archive-position: 3609 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev Content-Length: 1106 Lines: 24

On Fri, 9 Sep 2005 19:38:17 +0200 Bernd Schubert wrote:

> Hello,
>
> today we tried 2.6.13 on our server and also tried to use the skge driver.
> Well, in principle it works fine, until I became curious about the sysfs
> values. Stupid me, I was using the midnight commander to read the values.
> When I opened "/sys/bus/pci/drivers/skge/0000:01:01.0/resource0", the system
> immediately rebooted. After the reboot we tested using cat to the resource0
> file, which gave an input/output error. Using again the mc, the system again
> immediately rebooted.
> Well, I guess I better don't use the midnight commander in the future, but
> somehow I think it shouldn't cause the system to reboot, should it? Is the
> i/o error of cat supposed to happen?
>

Don't do that! resource0 is the PCI space for the card, and reading it directly accesses the memory-mapped space. The register space is sparse, and some locations are inaccessible. Accessing non-existent memory will cause the system to hang and, if you are lucky, time out and reboot.

Sorry, this is not a driver bug.
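
A tool that walks /sys could avoid this by skipping the memory-mapped files by name. The following is only a sketch, assuming the files are named "resource" followed by a number, as in the resource0 path above; the function name is hypothetical and is not part of any kernel or sysfs API.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (assumption, not a kernel API): return 1 if the last
 * path component is "resource" followed only by digits (e.g. "resource0"),
 * and 0 otherwise. A bare "resource" does not match. */
int is_pci_bar_file(const char *path)
{
    const char *prefix = "resource";
    const char *base;
    const char *p;

    assert(path != NULL);

    /* Look only at the last path component. */
    base = strrchr(path, '/');
    base = base ? base + 1 : path;

    if (strncmp(base, prefix, strlen(prefix)) != 0)
        return 0;

    p = base + strlen(prefix);
    if (*p == '\0')
        return 0;               /* no number after "resource" */

    for (; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return 0;
    }
    return 1;
}
```

A directory walker would call is_pci_bar_file() on each name and skip the file when it returns 1, instead of opening and reading it.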
From bernd.schubert@pci.uni-heidelberg.de Fri Sep 9 11:12:23 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 11:12:27 -0700 (PDT) Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j89ICLiL019925 for ; Fri, 9 Sep 2005 11:12:22 -0700 Received: from hamilton1.pci.uni-heidelberg.de (hamilton1.pci.uni-heidelberg.de [129.206.21.201]) by relay.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id j89I9cKN018393; Fri, 9 Sep 2005 20:09:38 +0200 Received: from lanczos.pci.uni-heidelberg.de ([129.206.21.135] helo=lanczos ident=foobar) by hamilton1.pci.uni-heidelberg.de with smtp (Exim 3.36 #1 (Debian)) id 1EDnJl-0001ei-00; Fri, 09 Sep 2005 20:09:41 +0200 Received: by lanczos (sSMTP sendmail emulation); Fri, 9 Sep 2005 20:09:41 +0200 From: Bernd Schubert Reply-To: bernd-schubert@gmx.de To: Stephen Hemminger Subject: Re: skge: reboot on sysfs resource0 access Date: Fri, 9 Sep 2005 20:09:40 +0200 User-Agent: KMail/1.7.2 References: <200509091938.18079.bernd.schubert@pci.uni-heidelberg.de> <20050909110153.5a2e2e90@localhost.localdomain> In-Reply-To: <20050909110153.5a2e2e90@localhost.localdomain> Cc: netdev@oss.sgi.com MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Disposition: inline Message-Id: <200509092009.41497.bernd.schubert@pci.uni-heidelberg.de> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j89ICLiL019925 X-archive-position: 3610 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bernd.schubert@pci.uni-heidelberg.de Precedence: bulk X-list: netdev Content-Length: 1584 Lines: 41 On Friday 09 September 2005 20:01, you wrote: > On Fri, 9 Sep 2005 19:38:17 +0200 > > Bernd Schubert wrote: > > Hello, > > > > today we tried 2.6.13 on our server and also tried to use the skge > > driver. 
Well, in principle it works fine, until I became curious about > > the sysfs values. Stupid me, I was using the midnight commander to read > > the values. When I opened > > "/sys/bus/pci/drivers/skge/0000:01:01.0/resource0", the system > > immediately rebooted. After the reboot we tested using cat to the > > resource0 file, which gave an input/output error. Using again the mc, the > > system again immediately rebooted. > > Well, I guess I better don't use the midnight commander in the future, > > but somehow I think it shouldn't cause the system to reboot, should it? > > Is the i/o error of cat supposed to happen? > > Don't do that! resource0 is the pci space for the card and > reading it directly accesses the memory mapped space. The > register is sparse, and some places are unaccessable. > Accessing non-existent memory will cause system to hang and if you > are lucky a timeout and reboot. > > Sorry, this is not a driver bug. Thanks, I better also won't read the resource values of the other pci-devices. And I think I will search for some documentation of sysfs to know in the future which values one should read and which not. 
Thanks again, Bernd -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: bernd.schubert@pci.uni-heidelberg.de From greearb@candelatech.com Fri Sep 9 11:24:43 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 11:24:50 -0700 (PDT) Received: from www.lanforge.com (ns1.lanforge.com [66.165.47.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j89IOhiL021299 for ; Fri, 9 Sep 2005 11:24:43 -0700 Received: from [71.112.207.5] (pool-71-112-207-5.sttlwa.dsl-w.verizon.net [71.112.207.5]) (authenticated bits=0) by www.lanforge.com (8.12.8/8.12.8) with ESMTP id j89IROo6003730; Fri, 9 Sep 2005 11:27:25 -0700 Message-ID: <4321D2C0.10800@candelatech.com> Date: Fri, 09 Sep 2005 11:21:52 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.10) Gecko/20050719 Fedora/1.7.10-1.3.1 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Stephen Hemminger CC: bernd-schubert@gmx.de, bernd.schubert@pci.uni-heidelberg.de, netdev@oss.sgi.com Subject: Re: skge: reboot on sysfs resource0 access References: <200509091938.18079.bernd.schubert@pci.uni-heidelberg.de> <20050909110153.5a2e2e90@localhost.localdomain> In-Reply-To: <20050909110153.5a2e2e90@localhost.localdomain> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 3611 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Content-Length: 1605 Lines: 46 Stephen Hemminger wrote: > On Fri, 9 Sep 2005 19:38:17 +0200 > Bernd Schubert wrote: > > >>Hello, >> >>today we tried 2.6.13 on our server and also tried to use the skge driver. >>Well, in principle it works fine, until I became curious about the sysfs >>values. Stupid me, I was using the midnight commander to read the values. 
>>When I opened "/sys/bus/pci/drivers/skge/0000:01:01.0/resource0", the system >>immediately rebooted. After the reboot we tested using cat to the resource0 >>file, which gave an input/output error. Using again the mc, the system again >>immediately rebooted. >>Well, I guess I better don't use the midnight commander in the future, but >>somehow I think it shouldn't cause the system to reboot, should it? Is the >>i/o error of cat supposed to happen? >> > > > Don't do that! resource0 is the pci space for the card and > reading it directly accesses the memory mapped space. The > register is sparse, and some places are unaccessable. > Accessing non-existent memory will cause system to hang and if you > are lucky a timeout and reboot. > > Sorry, this is not a driver bug. Does that mean if you do this: find /sys -name "*" -print|xargs grep foo that the system will crash? I certainly would consider that a bug, and even if that somehow works, I'd think that at the least you should be able to read every file in the file system without crashing the system! Do you at least have to be root to cause this crash? 
Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From ananda.raju@neterion.com Fri Sep 9 18:40:47 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 18:40:49 -0700 (PDT) Received: from ns1.s2io.com (ns1.s2io.com [142.46.200.198]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8A1ekiL032452 for ; Fri, 9 Sep 2005 18:40:47 -0700 Received: from guinness.s2io.com (sentry.s2io.com [142.46.200.199]) by ns1.s2io.com (8.12.10/8.12.10) with ESMTP id j8A1c9cx010078 for ; Fri, 9 Sep 2005 21:38:09 -0400 (EDT) Received: from rkoushik ([10.16.16.56]) by guinness.s2io.com (8.12.6/8.12.6) with ESMTP id j8A1c8lb003006; Fri, 9 Sep 2005 21:38:08 -0400 (EDT) Message-Id: <200509100138.j8A1c8lb003006@guinness.s2io.com> From: "Ananda Raju" To: Cc: "'Leonid Grossman'" , Subject: clarification required on UDP sendfile() Date: Fri, 9 Sep 2005 18:37:16 -0700 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook, Build 11.0.5510 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2180 Thread-Index: AcW1qC02FbvWbLv0RS6gMj3bdhNA7g== X-Scanned-By: MIMEDefang 2.34 X-archive-position: 3612 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ananda.raju@neterion.com Precedence: bulk X-list: netdev Content-Length: 2259 Lines: 88 Hi, We are implementing UDP Large send offload (USO) feature for our Xframe-II 10g Ethernet adapter. We are facing problem in using sendfile(). we have written a client server program which uses sendfile() over udp. We are facing a problem in which the last sendfile() operation fails to reach server. This behavior is irrespective of USO feature. Client.c has following code. 
The file used for the transfer, iperf-2.0.1.tar.gz, is 222957 bytes.

--------------------------------------------------
main()
{
    portno=16000;
    fd1 = open("iperf-2.0.1.tar.gz",O_RDWR);
    fd = socket(AF_INET,SOCK_DGRAM,0);
    bzero((char *) &serv_addr, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_port = htons(portno);
    serv_addr.sin_addr.s_addr = inet_addr("172.10.1.227");
    ret = connect(fd,&serv_addr,sizeof(serv_addr));
    len = sizeof(client_addr);
    off=0;
    while (1) {
        size = 40*1024;
        ret = sendfile(fd,fd1,NULL,size);
        printf("size %d \n",ret);
        sleep(1);
        if (ret<=0)
            exit(0);
    }
    close(fd);
    close(fd1);
}
--------------------------------------------

Server.c has the following code:

--------------------------------------------
int portno=16000;
char buf[65000];
main()
{
    fd = socket(AF_INET,SOCK_DGRAM,0);
    bzero((char *) &serv_addr, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_port = htons(portno);
    ret = bind(fd,(struct sockaddr*)&serv_addr,sizeof(serv_addr));
    len = sizeof(client_addr);
    while (1){
        ret = recvfrom(fd,&buf,sizeof(buf),0,(struct sockaddr*)&client_addr,&len);
        printf("size %d \n",ret);
    }
}
-------------------------------------------

# ls -l |grep iperf-2.0.1.tar.gz
-rw-r--r-- 1 root root 222957 Sep 8 08:31 iperf-2.0.1.tar.gz
#
#./client
size 40960
size 40960
size 40960
size 40960
size 40960
size 18157 <<< Didn't reach server
size 0
#
#./server
size 40960
size 40960
size 40960
size 40960
size 40960

The last transmit of 18157 bytes didn't reach the server. Is there any reason why this happens? Sometimes frames in the middle of the transfer also fail to reach the server. We ran tcpdump and observed that the packets are not put on the wire, so they are getting lost in the host network stack.

Is this behavior expected, or are we doing something wrong?
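
For reference, the client output is consistent with the file size: five full 40960-byte chunks plus one 18157-byte remainder add up to 222957 bytes, and the next call returns 0. A minimal sketch of that arithmetic follows; the helper name chunk_len is hypothetical and is not part of the test programs above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: size of the idx-th chunk (0-based) when a file of
 * file_size bytes is sent in pieces of at most chunk bytes. Returns 0 once
 * the file is exhausted, which corresponds to the final "size 0" line. */
size_t chunk_len(size_t file_size, size_t chunk, size_t idx)
{
    size_t start;
    size_t left;

    assert(chunk > 0);

    start = idx * chunk;
    if (start >= file_size)
        return 0;

    left = file_size - start;
    return left < chunk ? left : chunk;
}
```

With file_size = 222957 and chunk = 40*1024, indexes 0 through 4 give full chunks and index 5 gives the 18157-byte tail that the server never printed.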
Regards, Ananda From bernd.schubert@pci.uni-heidelberg.de Mon Sep 12 04:04:41 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 12 Sep 2005 04:04:46 -0700 (PDT) Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8CB4diL027930 for ; Mon, 12 Sep 2005 04:04:40 -0700 Received: from hamilton1.pci.uni-heidelberg.de (hamilton1.pci.uni-heidelberg.de [129.206.21.201]) by relay.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id j8CB1akx018239; Mon, 12 Sep 2005 13:01:44 +0200 Received: from lanczos.pci.uni-heidelberg.de ([129.206.21.135] helo=lanczos ident=foobar) by hamilton1.pci.uni-heidelberg.de with smtp (Exim 3.36 #1 (Debian)) id 1EEm4C-00044a-00; Mon, 12 Sep 2005 13:01:40 +0200 Received: by lanczos (sSMTP sendmail emulation); Mon, 12 Sep 2005 13:01:40 +0200 From: Bernd Schubert Reply-To: bernd-schubert@gmx.de To: Ben Greear Subject: Re: skge: reboot on sysfs resource0 access Date: Mon, 12 Sep 2005 13:01:39 +0200 User-Agent: KMail/1.7.2 Cc: Stephen Hemminger , netdev@oss.sgi.com References: <200509091938.18079.bernd.schubert@pci.uni-heidelberg.de> <20050909110153.5a2e2e90@localhost.localdomain> <4321D2C0.10800@candelatech.com> In-Reply-To: <4321D2C0.10800@candelatech.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200509121301.39924.bernd.schubert@pci.uni-heidelberg.de> X-archive-position: 3615 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bernd.schubert@pci.uni-heidelberg.de Precedence: bulk X-list: netdev Content-Length: 537 Lines: 22 > > Sorry, this is not a driver bug. > > Does that mean if you do this: > > find /sys -name "*" -print|xargs grep foo > > that the system will crash? I would also guess it would happen, though I won't try that now. 
> > I certainly would consider that a bug, and even if that somehow works, I'd > think that at the least you should be able to read every file in the file > system without crashing the system! > > Do you at least have to be root to cause this crash? Yes, the resource0 file has rw access to root only. Cheers, Bernd From bernd.schubert@pci.uni-heidelberg.de Mon Sep 12 08:42:27 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 12 Sep 2005 08:42:32 -0700 (PDT) Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8CFgQiL020242 for ; Mon, 12 Sep 2005 08:42:27 -0700 Received: from hamilton1.pci.uni-heidelberg.de (hamilton1.pci.uni-heidelberg.de [129.206.21.201]) by relay.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id j8CFdhup030146; Mon, 12 Sep 2005 17:39:43 +0200 Received: from lanczos.pci.uni-heidelberg.de ([129.206.21.135] helo=lanczos ident=foobar) by hamilton1.pci.uni-heidelberg.de with smtp (Exim 3.36 #1 (Debian)) id 1EEqPL-0002dI-00; Mon, 12 Sep 2005 17:39:47 +0200 Received: by lanczos (sSMTP sendmail emulation); Mon, 12 Sep 2005 17:39:47 +0200 From: Bernd Schubert Reply-To: bernd-schubert@gmx.de To: netdev@oss.sgi.com Subject: 2.613: network write socket problems Date: Mon, 12 Sep 2005 17:39:45 +0200 User-Agent: KMail/1.7.2 Cc: linux-kernel@vger.kernel.org MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Disposition: inline Message-Id: <200509121739.46172.bernd.schubert@pci.uni-heidelberg.de> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j8CFgQiL020242 X-archive-position: 3616 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bernd.schubert@pci.uni-heidelberg.de Precedence: bulk X-list: netdev Content-Length: 1694 Lines: 45 Hello, on last Friday we switched on our server to 2.6.13 and today we are experiencing 
problems with our NFS clients. In particular I'm talking about the unfs3 daemon, not the kernel NFS daemon. Both are running on the server, but on different ports, of course. Both also serve the same clients, but different directories.

Today the unfs3 daemon has already stalled several times. Ethereal showed no network packets on the unfs3 daemon port during this time. An strace attached to the daemon's PID clearly shows that *some* writes to some network sockets take ages to finish:

write(37, "\200\0\0x\203\326(\5\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 124) = 124

Writes of this kind can take anywhere from seconds to minutes, while they usually complete much faster than I can count. After the write() to the network socket, other operations happen rather fast, until the next write to a network socket. (I identified the troublesome file descriptors by looking at /proc/procid/fd.)

After restarting the unfs3 daemon everything runs smoothly for some time (approximately 20 min to 2 h), until the next write to a file descriptor stalls.

Any idea what's going on? This never happened before today, neither with 2.6.x nor with 2.4.x. As I wrote, on Friday we replaced 2.6.11.12 with 2.6.13. The configuration should be similar; the only changes should be HZ set to 250 and the addition of the skge driver. We have already switched back from skge to sk98lin, but the problem seems to remain.
Thanks, Bernd -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: bernd.schubert@pci.uni-heidelberg.de From y_h_lee@yahoo.com Mon Sep 12 09:54:31 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 12 Sep 2005 09:54:37 -0700 (PDT) Received: from web34211.mail.mud.yahoo.com (web34211.mail.mud.yahoo.com [66.163.178.126]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id j8CGsViL024788 for ; Mon, 12 Sep 2005 09:54:31 -0700 Received: (qmail 51507 invoked by uid 60001); 12 Sep 2005 16:51:53 -0000 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=Message-ID:Received:Date:From:Subject:To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=6fG8hOMm9Mgz2Qo7i0ys2zCYXeq0Inn6MfDv1yq4okgD7AAdaxKCJVJpB25xi29VjjSCDakPmTec+UY4NXqIEDYOR2opTftiWsUQullCtgxoKYgbC1poXx2QF3brq1CeedQnou5FDrVA3B2kmd2e/v0ORRC3qN/zUmJZOsem6xY= ; Message-ID: <20050912165153.51505.qmail@web34211.mail.mud.yahoo.com> Received: from [192.35.17.30] by web34211.mail.mud.yahoo.com via HTTP; Mon, 12 Sep 2005 09:51:53 PDT Date: Mon, 12 Sep 2005 09:51:53 -0700 (PDT) From: YongHan Lee Subject: Writing Kernel Module to get Kernel Routing Table Information To: y_h_lee@yahoo.com MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-archive-position: 3617 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: y_h_lee@yahoo.com Precedence: bulk X-list: netdev Content-Length: 2195 Lines: 72 Dear Linux-Networking Maintainer, I am a student of EPFL in Switzerland and do a semester project in networking. My subject is "ipv6 multipath AODV routing implementation". To achieve my goal I would like to program a kernel module which allows/enables multipath routing. 
This means that if I have several routes to one destination (but with different next hops) in the kernel routing table, I want to choose the route entry with the desired next hop, and not necessarily the first route entry.

Since I do not know the kernel architecture well, especially the kernel networking section, I do not know if and how this is possible.

My first idea was to use netfilter for ipv6, but multiple routing tables with marks are only implemented for ipv4.
https://lists.netfilter.org/pipermail/netfilter/2005-August/062252.html

Now I am trying to write a kernel module which should retrieve all route entries from the kernel routing table (fib6_node, rt6_info or dst_entry struct) by comparing the destination address, source address and next hop address with those of the IP packet. This happens at the POST ROUTING hook. Afterwards, I would change the destination (struct dst_entry) of the sk_buff struct of the IP packet. Getting the next hop address of the IP packet is not the problem.

I have started to write my kernel module, but it was not able to get the route entries from the kernel routing table, because a lot of the functions are static. I wanted to iterate over the fib6_nodes from the root (like fib6_lookup(&ip6_routing_table, daddr, saddr) [from ip6_fib.c]), but the kernel reports:

Sep 12 17:31:34 m66533pp kernel: kaodv: Unknown symbol fib6_lookup
Sep 12 17:31:34 m66533pp kernel: kaodv: Unknown symbol ip6_routing_table

even though they are listed in kallsyms and I included the ip6_fib.h file.

I would be very glad if you could give me some help or advice (how this could be done, or whether my idea is simply impossible). Some links describing the architecture of the kernel routing table would already be a great help for me.

I would like to thank you in advance for your help.

Yours faithfully,
Yong-Han Lee

__________________________________
Yahoo!
Mail - PC Magazine Editors' Choice 2005 http://mail.yahoo.com From bernd.schubert@pci.uni-heidelberg.de Tue Sep 13 02:25:38 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 13 Sep 2005 02:25:44 -0700 (PDT) Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8D9PaiL032121 for ; Tue, 13 Sep 2005 02:25:37 -0700 Received: from hamilton1.pci.uni-heidelberg.de (hamilton1.pci.uni-heidelberg.de [129.206.21.201]) by relay.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id j8D9MnL3004747; Tue, 13 Sep 2005 11:22:49 +0200 Received: from lanczos.pci.uni-heidelberg.de ([129.206.21.135] helo=lanczos ident=foobar) by hamilton1.pci.uni-heidelberg.de with smtp (Exim 3.36 #1 (Debian)) id 1EF709-0001oQ-00; Tue, 13 Sep 2005 11:22:53 +0200 Received: by lanczos (sSMTP sendmail emulation); Tue, 13 Sep 2005 11:22:53 +0200 From: Bernd Schubert Reply-To: TC-ADMIN@listserv.uni-heidelberg.de To: netdev@oss.sgi.com Subject: Re: 2.613: network write socket problems Date: Tue, 13 Sep 2005 11:22:52 +0200 User-Agent: KMail/1.7.2 Cc: linux-kernel@vger.kernel.org References: <200509121739.46172.bernd.schubert@pci.uni-heidelberg.de> In-Reply-To: <200509121739.46172.bernd.schubert@pci.uni-heidelberg.de> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200509131122.53286.bernd.schubert@pci.uni-heidelberg.de> X-archive-position: 3618 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bernd.schubert@pci.uni-heidelberg.de Precedence: bulk X-list: netdev Content-Length: 1120 Lines: 24 On Monday 12 September 2005 17:39, Bernd Schubert wrote: > Hello, > > on last Friday we switched on our server to 2.6.13 and today we are > experiencing problems with our nfs clients. > In particular I'm talking about the unfs3 daemon, not the kernel nfs > daemon. 
Both are running on the server but on different ports, of course. > Both are also serving to the same clients, but different directories. > > Today it already several times happend that the unfs3 daemon stalled. > Ethereal showed no network packages on the unfs3 daemon port during this > time. A strace to the proc-id of the daemon clearly shows that *some* > writes to some network sockets will take ages to finish > > write(37, "\200\0\0x\203\326(\5\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 124) > = 124 Sorry for the noise, its not a kernel problem. Switching back to 2.6.11 didn't help, so we investigated further. It turned out, that one of our clients was in a kind of a zombie state and asking for filehandles, but not answering request from the server. Since unfs3 is only single threaded, all other clients had to wait for timeouts. Bernd From det.nicolas@free.fr Tue Sep 13 06:50:33 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 13 Sep 2005 06:50:41 -0700 (PDT) Received: from smtp11.wanadoo.fr (smtp11.wanadoo.fr [193.252.22.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8DDoWiL025456 for ; Tue, 13 Sep 2005 06:50:32 -0700 Received: from me-wanadoo.net (localhost [127.0.0.1]) by mwinf1109.wanadoo.fr (SMTP Server) with ESMTP id 1C6551C00053 for ; Tue, 13 Sep 2005 15:47:47 +0200 (CEST) Received: from free.fr (ALyon-253-1-35-243.w83-205.abo.wanadoo.fr [83.205.42.243]) by mwinf1109.wanadoo.fr (SMTP Server) with ESMTP id 750DA1C0004E; Tue, 13 Sep 2005 15:47:46 +0200 (CEST) X-ME-UUID: 20050913134746479.750DA1C0004E@mwinf1109.wanadoo.fr From: Nicolas DET To: netdev@oss.sgi.com Cc: Christoph Hellwig , Sven Luther , Dale Farnsworth Subject: is it safe to use skb_linearize() within a network driver X-Mailer: SimpleMail 0.27 (MorphOS/MUI) E-Mail Client (c) 2000-2005 by Hynek Schlawack and Sebastian Bauer Date: 13 Sep 2005 15:42:33 +0100 Message-Id: <20050913134746.750DA1C0004E@mwinf1109.wanadoo.fr> X-archive-position: 3619 X-ecartis-version: Ecartis 
v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: det.nicolas@free.fr Precedence: bulk X-list: netdev Content-Length: 2347 Lines: 72

Hello,

First, you can find a small patch for the mv643xx_eth driver here.

Old:
http://arrakin.homedns.org/~nicolas/mv643xx_eth.tar.gz
http://arrakin.homedns.org/~nicolas/mv643xx_eth.diff.bz2

More up to date:
http://arrakin.homedns.org/~nicolas/mv643xx_eth_small.tar.gz

But this is not my concern at the moment.

I recently noticed that I get the following when stressing the mv643xx_eth driver:

Badness in local_bh_enable at kernel/softirq.c:140
Call trace:
 [c0005758] check_bug_trap+0x98/0xdc
 [c0005904] ProgramCheckException+0x168/0x4c0
 [c0004ee4] ret_from_except_full+0x0/0x4c
 [c0021424] local_bh_enable+0x18/0x88
 [c025c928] skb_copy_bits+0x144/0x37c
 [c0263b64] __skb_linearize+0x90/0x158
 [e106ac68] mv643xx_eth_start_xmit+0x54c/0x610 [mv643xx_eth]
 [c0263df0] dev_queue_xmit+0x1c4/0x35c
 [c028e414] ip_finish_output+0x130/0x2b8
 [c028e85c] ip_queue_xmit+0x2c0/0x540
 [c029e920] tcp_transmit_skb+0x344/0x7c4
 [c029fc4c] __tcp_push_pending_frames+0x228/0x4a0
 [c029ce90] tcp_rcv_established+0x414/0x8b4
 [c02a65bc] tcp_v4_do_rcv+0x16c/0x35c
 [c02a713c] tcp_v4_rcv+0x990/0xac8

The driver does the following on xmit:

	for (frag = 0; frag < skb_shinfo(skb)->nr_frags; frag++) {
		skb_frag_t *fragp;

		fragp = &skb_shinfo(skb)->frags[frag];
		if (fragp->size <= 8 && fragp->page_offset & 0x7) {
			skb_linearize(skb, GFP_ATOMIC);
			printk(KERN_DEBUG "%s: unaligned tiny fragment"
			       "%d of %d, fixed\n", dev->name, frag,
			       skb_shinfo(skb)->nr_frags);
			goto linear;
		}
	}

This routine checks whether the hardware is capable of sending every fragment of a scatter-gather skb. If not, it calls skb_linearize() and sends the packet as it would a linear skb.
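
The per-fragment test in that loop can be isolated as a small predicate. This is only a userspace sketch, assuming plain unsigned integers in place of the skb_frag_t fields; the function name is hypothetical and does not appear in the driver.

```c
/* Hypothetical userspace mirror of the driver's condition: a fragment of
 * 8 bytes or fewer whose offset within the page is not 8-byte aligned is
 * the case the driver handles by linearizing the skb.
 * Parenthesized explicitly: '&' binds tighter than '&&' in C, so this is
 * the same grouping as the unparenthesized driver expression. */
int frag_needs_linearize(unsigned int size, unsigned int page_offset)
{
    return size <= 8 && (page_offset & 0x7) != 0;
}
```

For example, an 8-byte fragment at page offset 4 triggers the fallback, while the same fragment at offset 8, or a 9-byte fragment at offset 4, does not.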
However, this code is executed inside a spin_lock_irq() section, and:

* skb_linearize() (net/core/dev.c) calls skb_copy_bits() (net/core/skbuff.c)
* skb_copy_bits() may call kunmap_skb_frag() (include/linux/skbuff.h)
* kunmap_skb_frag() will call local_bh_enable() (kernel/softirq.c)

and the first line of local_bh_enable() is 'WARN_ON(irqs_disabled());'.

I guess spin_lock_irq() disables IRQs, and local_bh_enable() is then not happy about it.

Maybe it's a trivial issue (?). I just wanted to report it.

Yours sincerely,
--
Nicolas DET
MorphOS & Linux developer

From sim@netnation.com Tue Sep 13 15:17:30 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 13 Sep 2005 15:17:37 -0700 (PDT) Received: from peace.netnation.com (newpeace.netnation.com [204.174.223.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8DMHUiL006322 for ; Tue, 13 Sep 2005 15:17:30 -0700 Received: from sim by peace.netnation.com with local (Exim 4.50) id 1EFJ3A-0007lP-Gm; Tue, 13 Sep 2005 15:14:48 -0700 Date: Tue, 13 Sep 2005 15:14:48 -0700 From: Simon Kirby To: Alexey Kuznetsov , Robert Olsson , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050913221448.GD15704@netnation.com> References: <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <20050907195911.GA8382@yakov.inr.ac.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050907195911.GA8382@yakov.inr.ac.ru> User-Agent: Mutt/1.5.9i X-archive-position: 3620 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sim@netnation.com Precedence: bulk X-list:
netdev Content-Length: 4880 Lines: 116 On Wed, Sep 07, 2005 at 11:59:11PM +0400, Alexey Kuznetsov wrote: > Hello! > > > Yes, setting maxbatch to 10000 also results in working gc, > > Could you try lower values? F.e. I guess 300 or a little more > (it is netdev_max_backlog) should be enough. 300 seems to be sufficient, but I'm not sure what this depends on (load, HZ, timing of some sort?). See below for full tests. > > for the normal case also hurts the DoS case...and it really hurts when > > the when the DoS case is the normal case. > > 5.7% is not "really hurts" yet. :-) I decided to try out FreeBSD in comparison as I've heard people saying that it handles this case quite well. The results are interesting. FreeBSD seems to have a route cache; however, it keys only on destination. When a new destination is seen, the route table entry that matched is "cloned" so that the MTU, etc., is copied, the dst rewritten to the exact IP (as opposed to a network route), and path MTU discovery results are maintained in this entry, keyed by destination address only. I'm not sure if Linux could work in the same way with the source routing tables enabled, but perhaps it's possible to either disable the source side of the route cache when policy routing is disabled. Or perhaps a route cache hash could be instantiated per route table or something. Actually, is there ever a valid case where the source needs to be tracked in the route cache when policy routing is disabled? A local socket will track MSS correctly while a forwarded packet will create or use an entry without touching it, so I don't see why not. Anyway, spoofed source or not go the same speed through FreeBSD. Also, there is a "fastforwarding" sysctl that sends forwarded packets from the input interrupt/poll without queueing them in a soft interrupt ("NETISR"). 
Polling mode on FreeBSD isn't as nice as NAPI in that it's fully manual on or off, and when it's on it triggers entirely from the timer interrupt unless told to also trigger from the idle loop. The user/kernel balancing is also manual but I can't seem to get it to forward as fast as with it disabled no matter how I adjust it.

TEST RESULTS
------------

All Linux tests with NAPI enabled and the e1000 driver native to that kernel unless otherwise specified. maxbatch does not exist in kernels < 2.6.9, and rhash_size does not exist in 2.4.

Sender: 367 Mbps, 717883 pps valid src/dst, 64 byte (Ethernet) packets

2.4.27-rc1: 297 Mbps forwarded (w/idle time?!)
2.4.31: 296 Mbps forwarded (w/idle time?!)
2.6.13-rc6: 173 Mbps forwarded
FreeBSD 5.4-RELEASE (HZ=1000): 103 Mbps forwarded (dead userland)
 `- net.inet.ip.fastforwarding=1: 282 Mbps forwarded (dead userland)
 `- kern.polling.enable=1: 75.3 Mbps forwarded
 `- kern.polling.idle_poll=1: 226 Mbps forwarded

Sender: 348 Mbps, 680416 pps random src, valid dst, 64 bytes

(All FreeBSD tests have identical results.)

2.4.27-rc1: 122 Mbps forwarded
2.4.27-rc1 gc_elasticity=1: 182 Mbps forwarded
2.4.27-rc1+2.4.31_e1000: 117 Mbps forwarded
2.4.27-rc1+2.4.31_e1000 gc_elasticity=1: 170 Mbps forwarded
2.4.31: 95.1 Mbps forwarded
2.4.31 gc_elasticity=1: 122 Mbps forwarded
2.6.13-rc6: <1 Mbps forwarded (dst overflow)
2.6.13-rc6 maxbatch=30: <1 Mbps forwarded (dst overflow)
2.6.13-rc6 maxbatch=60: 1.5 Mbps forwarded (dst overflow)
2.6.13-rc6 maxbatch=100: 2.6 Mbps forwarded (dst overflow)
2.6.13-rc6 maxbatch=150: 3.8 Mbps forwarded (dst overflow)
2.6.13-rc6 maxbatch=200: 6.9 Mbps forwarded (dst overflow)
2.6.13-rc6 maxbatch=250: 15.4 Mbps forwarded (dst overflow)
2.6.13-rc6 maxbatch=300: 58.6 Mbps forwarded (gc balanced)
2.6.13-rc6 maxbatch=350: 60.5 Mbps forwarded
2.6.13-rc6 maxbatch=400: 59.4 Mbps forwarded
2.6.13-rc6 maxbatch=450: 59.1 Mbps forwarded
2.6.13-rc6 maxbatch=500: 62.0 Mbps forwarded
2.6.13-rc6 maxbatch=550: 61.9 Mbps forwarded
2.6.13-rc6 maxbatch=1000: 61.4 Mbps forwarded
2.6.13-rc6 maxbatch=2000: 60.2 Mbps forwarded
2.6.13-rc6 maxbatch=3000: 60.1 Mbps forwarded
2.6.13-rc6 maxbatch=5000: 59.1 Mbps forwarded
2.6.13-rc6 maxbatch=MAXINT: 59.1 Mbps forwarded
2.6.13-rc6 dst_free: 66.0 Mbps forwarded
2.6.13-rc6 dst_free max_size=rhash_size: 79.2 Mbps forwarded

------------

2.6 definitely has better dst cache gc balancing than 2.4. I can set the max_size=rhash_size in 2.6.13-rc6 and it will just work, even without adjusting gc_elasticity or gc_thresh. In 2.4.27 and 2.4.31, the only parameter that appears to help is gc_elasticity. If I just adjust max_size, it overflows and falls over.

I note that the actual read copy update "maxbatch" limit was added in 2.6.9. Before then, it seems there was no limit (infinite). Was it added for latency reasons?

Time permitting, I'd also like to run some profiles. It's interesting to note that 2.6 is slower at forwarding even straight duplicate small packets. We should definitely get to the bottom of that.
Simon- From flamingice@sourmilk.net Tue Sep 13 16:48:13 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 13 Sep 2005 16:48:20 -0700 (PDT) Received: from narnia10.rutgers.edu (eden-out.rutgers.edu [128.6.68.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8DNmDiL014840 for ; Tue, 13 Sep 2005 16:48:13 -0700 Received: from [192.168.0.100] (resnet-172.23.47.44.resnet.rutgers.edu [172.23.47.44]) (using TLSv1 with cipher RC4-MD5 (128/128 bits)) (No client certificate requested) by narnia10.rutgers.edu (Postfix) with ESMTP id 1C0C0185D; Tue, 13 Sep 2005 19:45:30 -0400 (EDT) From: Michael Wu To: netdev@oss.sgi.com Subject: Re: [PATCH 15/29] ieee80211 Renamed ieee80211_hdr to ieee80211_hdr_4addr Date: Tue, 13 Sep 2005 19:45:23 -0400 User-Agent: KMail/1.8.2 References: <43275856.7010305@linux.intel.com> In-Reply-To: <43275856.7010305@linux.intel.com> Cc: ieee80211-devel@lists.sourceforge.net, James Ketrenos MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200509131945.23126.flamingice@sourmilk.net> X-archive-position: 3621 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: flamingice@sourmilk.net Precedence: bulk X-list: netdev Content-Length: 1723 Lines: 54 This patch plus patch 5 looks very similar to my management frame patch except.. On Tuesday 13 September 2005 18:53, James Ketrenos wrote: > @@ -552,6 +577,17 @@ struct ieee80211_authentication { > struct ieee80211_info_element info_element[0]; > } __attribute__ ((packed)); > > +struct ieee80211_disassoc { Add alias for deauthentication frame here? > + struct ieee80211_hdr_3addr header; > + u16 reason_code; If we do not append _code to "status", why append _code to "reason"? 
> + struct ieee80211_info_element info_element[0]; > +} __attribute__ ((packed)); > + > +struct ieee80211_probe_request { > + struct ieee80211_hdr_3addr header; > + struct ieee80211_info_element info_element[0]; > +} __attribute__ ((packed)); > + > struct ieee80211_probe_response { > struct ieee80211_hdr_3addr header; > u32 time_stamp[2]; > @@ -560,14 +596,25 @@ struct ieee80211_probe_response { > struct ieee80211_info_element info_element[0]; > } __attribute__ ((packed)); > > -struct ieee80211_assoc_request_frame { > +/* Alias beacon for probe_response */ > +#define ieee80211_beacon ieee80211_probe_response > + > +struct ieee80211_assoc_request { > + struct ieee80211_hdr_3addr header; > + u16 capability; > + u16 listen_interval; > + struct ieee80211_info_element info_element[0]; > +} __attribute__ ((packed)); > + > +struct ieee80211_reassoc_request { > + struct ieee80211_hdr_3addr header; > __le16 capability; > __le16 listen_interval; > u8 current_ap[ETH_ALEN]; > struct ieee80211_info_element info_element[0]; > } __attribute__ ((packed)); > > -struct ieee80211_assoc_response_frame { > +struct ieee80211_assoc_response { > struct ieee80211_hdr_3addr header; > __le16 capability; > __le16 status; From flamingice@sourmilk.net Tue Sep 13 16:50:43 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 13 Sep 2005 16:50:47 -0700 (PDT) Received: from narnia9.rutgers.edu (eden-out.rutgers.edu [128.6.68.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8DNohiL015299 for ; Tue, 13 Sep 2005 16:50:43 -0700 Received: from [192.168.0.100] (resnet-172.23.47.44.resnet.rutgers.edu [172.23.47.44]) (using TLSv1 with cipher RC4-MD5 (128/128 bits)) (No client certificate requested) by narnia9.rutgers.edu (Postfix) with ESMTP id B41A814DC; Tue, 13 Sep 2005 19:48:04 -0400 (EDT) From: Michael Wu To: netdev@oss.sgi.com Subject: Re: [PATCH 25/29] ieee80211 use endian-aware types Date: Tue, 13 Sep 2005 19:48:01 -0400 User-Agent: KMail/1.8.2 References: 
<43276290.2010809@linux.intel.com> In-Reply-To: <43276290.2010809@linux.intel.com> Cc: James Ketrenos , ieee80211-devel@lists.sourceforge.net MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200509131948.02083.flamingice@sourmilk.net> X-archive-position: 3622 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: flamingice@sourmilk.net Precedence: bulk X-list: netdev Content-Length: 512 Lines: 18 On Tuesday 13 September 2005 19:36, James Ketrenos wrote: > ieee80211: use endian-aware types > > From: Michael Wu > > This patch: > - fixes misc. whitespace/comments > - replaces u16 with __le16/__be16 where appropriate > > Signed-off-by: Michael Wu > Signed-off-by: Jiri Benc > > Signed-off-by: James Ketrenos > The original patch already went in, and this patch looks nothing like the one I sent in. -Michael Wu From flamingice@sourmilk.net Tue Sep 13 16:53:39 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 13 Sep 2005 16:53:43 -0700 (PDT) Received: from narnia10.rutgers.edu (eden-out.rutgers.edu [128.6.68.11]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8DNrciL015820 for ; Tue, 13 Sep 2005 16:53:38 -0700 Received: from [192.168.0.100] (resnet-172.23.47.44.resnet.rutgers.edu [172.23.47.44]) (using TLSv1 with cipher RC4-MD5 (128/128 bits)) (No client certificate requested) by narnia10.rutgers.edu (Postfix) with ESMTP id D561C1856 for ; Tue, 13 Sep 2005 19:50:59 -0400 (EDT) From: Michael Wu To: netdev@oss.sgi.com Subject: Re: [PATCH 23/29] ieee80211 Added ieee80211_radiotap.h Date: Tue, 13 Sep 2005 19:50:56 -0400 User-Agent: KMail/1.8.2 References: <43276231.1000908@linux.intel.com> In-Reply-To: <43276231.1000908@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: 
<200509131950.56673.flamingice@sourmilk.net> X-archive-position: 3623 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: flamingice@sourmilk.net Precedence: bulk X-list: netdev Content-Length: 334 Lines: 9 On Tuesday 13 September 2005 19:35, James Ketrenos wrote: > Added ieee80211_radiotap.h to enhance statistic reporting to user space > from wireless drivers. > > Signed-off-by: James Ketrenos I'm not sure, but didn't Mike Kershaw create this patch? I was expecting a signed-off-by line by him. -Michael Wu From ananda.raju@neterion.com Tue Sep 13 17:15:13 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 13 Sep 2005 17:15:19 -0700 (PDT) Received: from ns1.s2io.com (ns1.s2io.com [142.46.200.198]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8E0FDiL018827 for ; Tue, 13 Sep 2005 17:15:13 -0700 Received: from guinness.s2io.com (sentry.s2io.com [142.46.200.199]) by ns1.s2io.com (8.12.10/8.12.10) with ESMTP id j8E0CVcx023499; Tue, 13 Sep 2005 20:12:31 -0400 (EDT) Received: from localhost.localdomain ([10.16.16.97]) by guinness.s2io.com (8.12.6/8.12.6) with ESMTP id j8E0CRlb017096; Tue, 13 Sep 2005 20:12:29 -0400 (EDT) Received: (from root@localhost) by localhost.localdomain (8.13.1/8.13.1/Submit) id j8E6t1jx003507; Tue, 13 Sep 2005 23:55:01 -0700 Date: Tue, 13 Sep 2005 23:55:01 -0700 Message-Id: <200509140655.j8E6t1jx003507@localhost.localdomain> To: jgarzik@pobox.com, netdev@oss.sgi.com CC: raghavendra.koushik@neterion.com, ravinandan.arakali@neterion.com, leonid.grossman@neterion.com, rapuru.sriram@neterion.com, ananda.raju@neterion.com From: ravinandan.arakali@neterion.com Subject: [PATCH 2.6.13] IPv4/IPv6: USO Scatter-gather approac X-Scanned-By: MIMEDefang 2.34 X-archive-position: 3624 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ravinandan.arakali@neterion.com Precedence: bulk X-list: netdev 
Content-Length: 15469 Lines: 463

Hi,

Attached below is a kernel patch with UDP large send offload which also addresses the sendfile() syscall. This patch uses the scatter-gather support of skb to generate a large UDP datagram. Below is a "how-to" on the changes required in network drivers to use the USO interface.

UDP Large Send Offload (USO) Interface:
--------------------------------------
USO is a feature wherein the Linux kernel network stack offloads the IP fragmentation of large UDP datagrams to hardware. This reduces the overhead of the stack in fragmenting large UDP datagrams into MTU-sized packets.

1) Drivers indicate their USO capability using
   dev->features |= NETIF_F_USO | NETIF_F_HW_CSUM | NETIF_F_SG
   NETIF_F_HW_CSUM is required for USO over IPv6.

2) A USO packet will be submitted for transmission using the driver xmit routine. A USO packet will have a non-zero value for "skb_shinfo(skb)->uso_size".

   skb_shinfo(skb)->uso_size indicates the length of the data part in each IP fragment going out of the adapter after IP fragmentation by hardware.

   skb->data will contain the MAC/IP/UDP header and skb_shinfo(skb)->frags[] contains the data payload. skb->ip_summed will be set to CHECKSUM_HW, indicating that hardware has to do the checksum calculation. Hardware should compute the UDP checksum of the complete datagram and also the IP header checksum of each fragmented IP packet.

   For IPv6, USO provides the fragment identification in skb_shinfo(skb)->ip6_frag_id. The adapter should use this ID for generating IPv6 fragments.
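
The how-to above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the values used for NETIF_F_SG and NETIF_F_HW_CSUM are assumed to match 2.6.13, NETIF_F_USO takes the value this patch proposes, and uso_fixup_features() and uso_fragment_count() are hypothetical helper names, not kernel functions. The fragment count also assumes that uso_size covers the IP payload of each fragment, so the UDP header counts toward the first one.

```c
#include <assert.h>

#define NETIF_F_SG	1	/* assumed 2.6.13 value */
#define NETIF_F_HW_CSUM	8	/* assumed 2.6.13 value */
#define NETIF_F_USO	8192	/* value proposed by this patch */

/*
 * Model of the dependency rule in step 1: USO is only kept when the
 * device also advertises scatter-gather and hardware checksumming.
 */
static unsigned long uso_fixup_features(unsigned long features)
{
	if ((features & NETIF_F_USO) &&
	    (!(features & NETIF_F_HW_CSUM) || !(features & NETIF_F_SG)))
		features &= ~(unsigned long)NETIF_F_USO;
	return features;
}

/*
 * Number of IP fragments the adapter would emit for an IP payload of
 * ip_payload_len bytes (UDP header plus data) when every fragment
 * carries at most uso_size bytes.
 */
static unsigned int uso_fragment_count(unsigned int ip_payload_len,
				       unsigned int uso_size)
{
	assert(uso_size > 0);	/* uso_size == 0 means "not a USO packet" */
	return (ip_payload_len + uso_size - 1) / uso_size;
}
```

For example, with an IPv4 MTU of 1500 and a 20-byte IP header, uso_size would be 1480, so a datagram carrying 8000 bytes of data (8008 bytes including the UDP header) would leave the adapter as 6 fragments under these assumptions.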
Signed-off-by: Ananda Raju --- diff -uNr linux-2.6.13/include/linux/ethtool.h linux-2.6.13_uso/include/linux/ethtool.h --- linux-2.6.13/include/linux/ethtool.h 2005-09-07 06:36:15.000000000 -0700 +++ linux-2.6.13_uso/include/linux/ethtool.h 2005-09-07 06:32:29.000000000 -0700 @@ -261,6 +261,8 @@ int ethtool_op_set_sg(struct net_device *dev, u32 data); u32 ethtool_op_get_tso(struct net_device *dev); int ethtool_op_set_tso(struct net_device *dev, u32 data); +u32 ethtool_op_get_uso(struct net_device *dev); +int ethtool_op_set_uso(struct net_device *dev, u32 data); /** * &ethtool_ops - Alter and report network device settings @@ -290,6 +292,8 @@ * set_sg: Turn scatter-gather on or off * get_tso: Report whether TCP segmentation offload is enabled * set_tso: Turn TCP segmentation offload on or off + * get_uso: Report whether UDP large send offload is enabled + * set_uso: Turn UDP large send offload on or off * self_test: Run specified self-tests * get_strings: Return a set of strings that describe the requested objects * phys_id: Identify the device @@ -354,6 +358,8 @@ void (*get_ethtool_stats)(struct net_device *, struct ethtool_stats *, u64 *); int (*begin)(struct net_device *); void (*complete)(struct net_device *); + u32 (*get_uso)(struct net_device *); + int (*set_uso)(struct net_device *, u32); }; /* CMDs currently supported */ @@ -389,6 +395,8 @@ #define ETHTOOL_GSTATS 0x0000001d /* get NIC-specific statistics */ #define ETHTOOL_GTSO 0x0000001e /* Get TSO enable (ethtool_value) */ #define ETHTOOL_STSO 0x0000001f /* Set TSO enable (ethtool_value) */ +#define ETHTOOL_GUSO 0x00000020 /* Get USO enable (ethtool_value) */ +#define ETHTOOL_SUSO 0x00000021 /* Set USO enable (ethtool_value) */ /* compatibility with older code */ #define SPARC_ETH_GSET ETHTOOL_GSET diff -uNr linux-2.6.13/include/linux/netdevice.h linux-2.6.13_uso/include/linux/netdevice.h --- linux-2.6.13/include/linux/netdevice.h 2005-09-07 04:20:51.000000000 -0700 +++
linux-2.6.13_uso/include/linux/netdevice.h 2005-09-07 04:22:51.000000000 -0700 @@ -408,6 +408,7 @@ #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */ #define NETIF_F_TSO 2048 /* Can offload TCP/IP segmentation */ #define NETIF_F_LLTX 4096 /* LockLess TX */ +#define NETIF_F_USO 8192 /* Can offload UDP Large Send*/ /* Called after device is detached from network. */ void (*uninit)(struct net_device *dev); diff -uNr linux-2.6.13/include/linux/skbuff.h linux-2.6.13_uso/include/linux/skbuff.h --- linux-2.6.13/include/linux/skbuff.h 2005-09-07 04:20:56.000000000 -0700 +++ linux-2.6.13_uso/include/linux/skbuff.h 2005-09-07 04:22:58.000000000 -0700 @@ -137,6 +137,8 @@ unsigned int nr_frags; unsigned short tso_size; unsigned short tso_segs; + unsigned short uso_size; + unsigned int ip6_frag_id; struct sk_buff *frag_list; skb_frag_t frags[MAX_SKB_FRAGS]; }; @@ -327,6 +329,11 @@ extern void skb_under_panic(struct sk_buff *skb, int len, void *here); +extern int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb, + int getfrag(void *from, char *to, int offset, + int len,int odd, struct sk_buff *skb), + void *from, int length); + struct skb_seq_state { __u32 lower_offset; diff -uNr linux-2.6.13/net/core/dev.c linux-2.6.13_uso/net/core/dev.c --- linux-2.6.13/net/core/dev.c 2005-09-07 06:36:25.000000000 -0700 +++ linux-2.6.13_uso/net/core/dev.c 2005-09-07 06:32:02.000000000 -0700 @@ -2706,6 +2706,25 @@ dev->name); dev->features &= ~NETIF_F_TSO; } + /* TSO requires that SG is present as well. 
*/ + if ((dev->features & NETIF_F_TSO) && + !(dev->features & NETIF_F_SG)) { + printk("%s: Dropping NETIF_F_TSO since no SG feature.\n", + dev->name); + dev->features &= ~NETIF_F_TSO; + } + if (dev->features & NETIF_F_USO) { + if (!(dev->features & NETIF_F_HW_CSUM)) { + printk("%s: Dropping NETIF_F_USO since no ", dev->name); + printk("NETIF_F_HW_CSUM feature.\n"); + dev->features &= ~NETIF_F_USO; + } + if (!(dev->features & NETIF_F_SG)) { + printk("%s: Dropping NETIF_F_USO since no ", dev->name); + printk("NETIF_F_SG feature.\n"); + dev->features &= ~NETIF_F_USO; + } + } /* * nil rebuild_header routine, diff -uNr linux-2.6.13/net/core/ethtool.c linux-2.6.13_uso/net/core/ethtool.c --- linux-2.6.13/net/core/ethtool.c 2005-09-07 06:36:34.000000000 -0700 +++ linux-2.6.13_uso/net/core/ethtool.c 2005-09-07 06:32:15.000000000 -0700 @@ -81,6 +81,20 @@ return 0; } +u32 ethtool_op_get_uso(struct net_device *dev) +{ + return (dev->features & NETIF_F_USO) != 0; +} + +int ethtool_op_set_uso(struct net_device *dev, u32 data) +{ + if (data) + dev->features |= NETIF_F_USO; + else + dev->features &= ~NETIF_F_USO; + return 0; +} + /* Handlers for each ethtool command */ static int ethtool_get_settings(struct net_device *dev, void __user *useraddr) @@ -469,6 +483,9 @@ err = dev->ethtool_ops->set_tso(dev, 0); if (err) return err; + err = dev->ethtool_ops->set_uso(dev, 0); + if (err) + return err; } return dev->ethtool_ops->set_sg(dev, data); @@ -557,6 +574,32 @@ return dev->ethtool_ops->set_tso(dev, edata.data); } +static int ethtool_get_uso(struct net_device *dev, char __user *useraddr) +{ + struct ethtool_value edata = { ETHTOOL_GTSO }; + + if (!dev->ethtool_ops->get_uso) + return -EOPNOTSUPP; + edata.data = dev->ethtool_ops->get_uso(dev); + if (copy_to_user(useraddr, &edata, sizeof(edata))) + return -EFAULT; + return 0; +} +static int ethtool_set_uso(struct net_device *dev, char __user *useraddr) +{ + struct ethtool_value edata; + + if (!dev->ethtool_ops->set_uso) + return 
-EOPNOTSUPP; + if (copy_from_user(&edata, useraddr, sizeof(edata))) + return -EFAULT; + if (edata.data && !(dev->features & NETIF_F_SG)) + return -EINVAL; + if (edata.data && !(dev->features & NETIF_F_HW_CSUM)) + return -EINVAL; + return dev->ethtool_ops->set_uso(dev, edata.data); +} + static int ethtool_self_test(struct net_device *dev, char __user *useraddr) { struct ethtool_test test; @@ -806,6 +849,12 @@ case ETHTOOL_GSTATS: rc = ethtool_get_stats(dev, useraddr); break; + case ETHTOOL_GUSO: + rc = ethtool_get_uso(dev, useraddr); + break; + case ETHTOOL_SUSO: + rc = ethtool_set_uso(dev, useraddr); + break; default: rc = -EOPNOTSUPP; } @@ -833,3 +882,5 @@ EXPORT_SYMBOL(ethtool_op_set_tso); EXPORT_SYMBOL(ethtool_op_set_tx_csum); EXPORT_SYMBOL(ethtool_op_set_tx_hw_csum); +EXPORT_SYMBOL(ethtool_op_set_uso); +EXPORT_SYMBOL(ethtool_op_get_uso); diff -uNr linux-2.6.13/net/core/skbuff.c linux-2.6.13_uso/net/core/skbuff.c --- linux-2.6.13/net/core/skbuff.c 2005-09-07 04:21:30.000000000 -0700 +++ linux-2.6.13_uso/net/core/skbuff.c 2005-09-07 06:38:57.000000000 -0700 @@ -159,6 +159,8 @@ skb_shinfo(skb)->tso_size = 0; skb_shinfo(skb)->tso_segs = 0; skb_shinfo(skb)->frag_list = NULL; + skb_shinfo(skb)->uso_size = 0; + skb_shinfo(skb)->ip6_frag_id = 0; out: return skb; nodata: @@ -1654,6 +1656,64 @@ return textsearch_find(config, state); } +/* + * skb_append_datato_frags - append the user data to a skb, + * sk - sock structure which contains skbs for transmission + * getfrag - The function to be called to get the data from the user. 
+ * from - pointer to user message iov + * length - length of the iov message + * + * This procedure will allocate a skb enough to hold protocol headers and + * append the user data in the fragment part of the skb and add the skb to + * socket write queue + */ +int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb, + int getfrag(void *from, char *to, int offset, + int len,int odd, struct sk_buff *skb), + void *from, int length) +{ + int frg_cnt = 0; + skb_frag_t *frag = NULL; + struct page *page = NULL; + int copy, left; + int offset = 0; + do { + frg_cnt = skb_shinfo(skb)->nr_frags; + if (frg_cnt >= MAX_SKB_FRAGS) { + kfree_skb(skb); + return -EFAULT; + } + page = alloc_pages(sk->sk_allocation, 0); + if (page == NULL) { + kfree_skb(skb); + return -ENOMEM; + } + sk->sk_sndmsg_page = page; + sk->sk_sndmsg_off = 0; + skb_fill_page_desc(skb, frg_cnt, page, 0, 0); + frg_cnt = skb_shinfo(skb)->nr_frags; + atomic_add(PAGE_SIZE, &sk->sk_wmem_alloc); + skb->truesize += PAGE_SIZE; + frag = &skb_shinfo(skb)->frags[frg_cnt - 1]; + left = PAGE_SIZE - frag->page_offset; + copy = (length > left)? 
left : length; + if (getfrag(from, page_address(frag->page) + + frag->page_offset+frag->size, + offset, copy, 0, skb) < 0) { + kfree_skb(skb); + return -EFAULT; + } + sk->sk_sndmsg_off += copy; + frag->size += copy; + skb->len += copy; + skb->data_len += copy; + offset += copy; + length -= copy; + page = NULL; + } while (length > 0); + return 0; +} + void __init skb_init(void) { skbuff_head_cache = kmem_cache_create("skbuff_head_cache", @@ -1696,3 +1756,4 @@ EXPORT_SYMBOL(skb_seq_read); EXPORT_SYMBOL(skb_abort_seq_read); EXPORT_SYMBOL(skb_find_text); +EXPORT_SYMBOL(skb_append_datato_frags); diff -uNr linux-2.6.13/net/ipv4/ip_output.c linux-2.6.13_uso/net/ipv4/ip_output.c --- linux-2.6.13/net/ipv4/ip_output.c 2005-09-07 04:21:46.000000000 -0700 +++ linux-2.6.13_uso/net/ipv4/ip_output.c 2005-09-13 07:12:05.000000000 -0700 @@ -280,7 +280,8 @@ { IP_INC_STATS(IPSTATS_MIB_OUTREQUESTS); - if (skb->len > dst_mtu(skb->dst) && !skb_shinfo(skb)->tso_size) + if (skb->len > dst_mtu(skb->dst) && + !(skb_shinfo(skb)->uso_size || skb_shinfo(skb)->tso_size)) return ip_fragment(skb, ip_finish_output); else return ip_finish_output(skb); @@ -781,6 +782,46 @@ csummode = CHECKSUM_HW; inet->cork.length += length; + if (((length > mtu) && (sk->sk_protocol == IPPROTO_UDP)) && + (rt->u.dst.dev->features & NETIF_F_USO)) { + /* There is support for UDP large send offload by network + * device, so create one single skb packet containing complete + * udp datagram + */ + if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL) { + skb = sock_alloc_send_skb(sk, + hh_len + fragheaderlen + transhdrlen + 20, + (flags & MSG_DONTWAIT), &err); + if (skb == NULL) + goto error; + /* reserve space for Hardware header */ + skb_reserve(skb, hh_len); + /* create space for UDP/IP header */ + skb_put(skb,fragheaderlen + transhdrlen); + /* initialize network header pointer */ + skb->nh.raw = skb->data; + /* initialize protocol header pointer */ + skb->h.raw = skb->data + fragheaderlen; + skb->ip_summed = 
CHECKSUM_HW; + skb->csum = 0; + sk->sk_sndmsg_off = 0; + } + err = skb_append_datato_frags(sk,skb, getfrag, from, + (length - transhdrlen)); + if (!err) { + /* specify the length of each IP datagram fragment*/ + skb_shinfo(skb)->uso_size = (mtu - fragheaderlen); + __skb_queue_tail(&sk->sk_write_queue, skb); + return 0; + } else { + /* There is not enough support do UPD LSO, + * so follow normal path + */ + kfree_skb(skb); + goto error; + } + } + /* So, what's going on in the loop below? * @@ -1012,14 +1053,23 @@ return -EINVAL; inet->cork.length += size; + if ((sk->sk_protocol == IPPROTO_UDP) && + (rt->u.dst.dev->features & NETIF_F_USO)) + skb_shinfo(skb)->uso_size = (mtu - fragheaderlen); + while (size > 0) { int i; - /* Check if the remaining data fits into current packet. */ - len = mtu - skb->len; - if (len < size) - len = maxfraglen - skb->len; + if (skb_shinfo(skb)->uso_size) { + len = size; + } else { + + /* Check if the remaining data fits into current packet. */ + len = mtu - skb->len; + if (len < size) + len = maxfraglen - skb->len; + } if (len <= 0) { struct sk_buff *skb_prev; char *data; diff -uNr linux-2.6.13/net/ipv6/ip6_output.c linux-2.6.13_uso/net/ipv6/ip6_output.c --- linux-2.6.13/net/ipv6/ip6_output.c 2005-09-07 04:21:57.000000000 -0700 +++ linux-2.6.13_uso/net/ipv6/ip6_output.c 2005-09-13 07:11:10.000000000 -0700 @@ -147,7 +147,8 @@ int ip6_output(struct sk_buff *skb) { - if (skb->len > dst_mtu(skb->dst) || dst_allfrag(skb->dst)) + if ((skb->len > dst_mtu(skb->dst) && !skb_shinfo(skb)->uso_size) || + dst_allfrag(skb->dst)) return ip6_fragment(skb, ip6_output2); else return ip6_output2(skb); @@ -893,6 +894,50 @@ */ inet->cork.length += length; + if (((length > mtu) && (sk->sk_protocol == IPPROTO_UDP)) && + (rt->u.dst.dev->features & NETIF_F_USO)) { + /* There is support for UDP large send offload by network + * device, so create one single skb packet containing complete + * udp datagram + */ + if ((skb = skb_peek_tail(&sk->sk_write_queue)) == 
NULL) { + skb = sock_alloc_send_skb(sk, + hh_len + fragheaderlen + transhdrlen + 20, + (flags & MSG_DONTWAIT), &err); + if (skb == NULL) + goto error; + /* reserve space for Hardware header */ + skb_reserve(skb, hh_len); + /* create space for UDP/IP header */ + skb_put(skb,fragheaderlen + transhdrlen); + /* initialize network header pointer */ + skb->nh.raw = skb->data; + /* initialize protocol header pointer */ + skb->h.raw = skb->data + fragheaderlen; + skb->ip_summed = CHECKSUM_HW; + skb->csum = 0; + sk->sk_sndmsg_off = 0; + } + err = skb_append_datato_frags(sk,skb, getfrag, from, + (length - transhdrlen)); + if (!err) { + struct frag_hdr fhdr; + + /* specify the length of each IP datagram fragment*/ + skb_shinfo(skb)->uso_size = (mtu - fragheaderlen) - + sizeof(struct frag_hdr); + ipv6_select_ident(skb, &fhdr); + skb_shinfo(skb)->ip6_frag_id = fhdr.identification; + __skb_queue_tail(&sk->sk_write_queue, skb); + return 0; + } else { + /* There is not enough support do UPD LSO, + * so follow normal path + */ + kfree_skb(skb); + goto error; + } + } if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL) goto alloc_new_skb; From ananda.raju@neterion.com Tue Sep 13 17:16:10 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 13 Sep 2005 17:16:14 -0700 (PDT) Received: from ns1.s2io.com (ns1.s2io.com [142.46.200.198]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8E0G9iL019005 for ; Tue, 13 Sep 2005 17:16:10 -0700 Received: from guinness.s2io.com (sentry.s2io.com [142.46.200.199]) by ns1.s2io.com (8.12.10/8.12.10) with ESMTP id j8E0DScx023503; Tue, 13 Sep 2005 20:13:28 -0400 (EDT) Received: from localhost.localdomain ([10.16.16.97]) by guinness.s2io.com (8.12.6/8.12.6) with ESMTP id j8E0DPlb017222; Tue, 13 Sep 2005 20:13:26 -0400 (EDT) Received: (from root@localhost) by localhost.localdomain (8.13.1/8.13.1/Submit) id j8E6u0gj003515; Tue, 13 Sep 2005 23:56:00 -0700 Date: Tue, 13 Sep 2005 23:56:00 -0700 Message-Id: 
<200509140656.j8E6u0gj003515@localhost.localdomain> To: jgarzik@pobox.com, netdev@oss.sgi.com CC: raghavendra.koushik@neterion.com, ravinandan.arakali@neterion.com, leonid.grossman@neterion.com, rapuru.sriram@neterion.com, ananda.raju@neterion.com From: ananda.raju@neterion.com Subject: [PATCH 2.6.13] IPv4/IPv6: USO Scatter-gather approac X-Scanned-By: MIMEDefang 2.34 X-archive-position: 3625 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ananda.raju@neterion.com Precedence: bulk X-list: netdev Content-Length: 15469 Lines: 463

Hi,

Attached below is a kernel patch with UDP large send offload which also addresses the sendfile() syscall. This patch uses the scatter-gather support of skb to generate a large UDP datagram. Below is a "how-to" on the changes required in network drivers to use the USO interface.

UDP Large Send Offload (USO) Interface:
--------------------------------------
USO is a feature wherein the Linux kernel network stack offloads the IP fragmentation of large UDP datagrams to hardware. This reduces the overhead of the stack in fragmenting large UDP datagrams into MTU-sized packets.

1) Drivers indicate their USO capability using
   dev->features |= NETIF_F_USO | NETIF_F_HW_CSUM | NETIF_F_SG
   NETIF_F_HW_CSUM is required for USO over IPv6.

2) A USO packet will be submitted for transmission using the driver xmit routine. A USO packet will have a non-zero value for "skb_shinfo(skb)->uso_size".

   skb_shinfo(skb)->uso_size indicates the length of the data part in each IP fragment going out of the adapter after IP fragmentation by hardware.

   skb->data will contain the MAC/IP/UDP header and skb_shinfo(skb)->frags[] contains the data payload. skb->ip_summed will be set to CHECKSUM_HW, indicating that hardware has to do the checksum calculation. Hardware should compute the UDP checksum of the complete datagram and also the IP header checksum of each fragmented IP packet.
For IPV6 the USO provides the fragment identification-id in skb_shinfo(skb)->ip6_frag_id. The adapter should use this ID for generating IPv6 fragments. Signed-off-by: Ananda Raju --- diff -uNr linux-2.6.13/include/linux/ethtool.h linux-2.6.13_uso/include/linux/ethtool.h --- linux-2.6.13/include/linux/ethtool.h 2005-09-07 06:36:15.000000000 -0700 +++ linux-2.6.13_uso/include/linux/ethtool.h 2005-09-07 06:32:29.000000000 -0700 @@ -261,6 +261,8 @@ int ethtool_op_set_sg(struct net_device *dev, u32 data); u32 ethtool_op_get_tso(struct net_device *dev); int ethtool_op_set_tso(struct net_device *dev, u32 data); +u32 ethtool_op_get_uso(struct net_device *dev); +int ethtool_op_set_uso(struct net_device *dev, u32 data); /** * &ethtool_ops - Alter and report network device settings @@ -290,6 +292,8 @@ * set_sg: Turn scatter-gather on or off * get_tso: Report whether TCP segmentation offload is enabled * set_tso: Turn TCP segmentation offload on or off + * get_uso: Report whether UDP large send offload is enabled + * set_uso: Turn UDP large send offload on or off * self_test: Run specified self-tests * get_strings: Return a set of strings that describe the requested objects * phys_id: Identify the device @@ -354,6 +358,8 @@ void (*get_ethtool_stats)(struct net_device *, struct ethtool_stats *, u64 *); int (*begin)(struct net_device *); void (*complete)(struct net_device *); + u32 (*get_uso)(struct net_device *); + int (*set_uso)(struct net_device *, u32); }; /* CMDs currently supported */ @@ -389,6 +395,8 @@ #define ETHTOOL_GSTATS 0x0000001d /* get NIC-specific statistics */ #define ETHTOOL_GTSO 0x0000001e /* Get TSO enable (ethtool_value) */ #define ETHTOOL_STSO 0x0000001f /* Set TSO enable (ethtool_value) */ +#define ETHTOOL_GUSO 0x00000020 /* Get USO enable (ethtool_value) */ +#define ETHTOOL_SUSO 0x00000021 /* Set USO enable (ethtool_value) */ /* compatibility with older code */ #define SPARC_ETH_GSET ETHTOOL_GSET diff -uNr linux-2.6.13/include/linux/netdevice.h
linux-2.6.13_uso/include/linux/netdevice.h --- linux-2.6.13/include/linux/netdevice.h 2005-09-07 04:20:51.000000000 -0700 +++ linux-2.6.13_uso/include/linux/netdevice.h 2005-09-07 04:22:51.000000000 -0700 @@ -408,6 +408,7 @@ #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */ #define NETIF_F_TSO 2048 /* Can offload TCP/IP segmentation */ #define NETIF_F_LLTX 4096 /* LockLess TX */ +#define NETIF_F_USO 8192 /* Can offload UDP Large Send*/ /* Called after device is detached from network. */ void (*uninit)(struct net_device *dev); diff -uNr linux-2.6.13/include/linux/skbuff.h linux-2.6.13_uso/include/linux/skbuff.h --- linux-2.6.13/include/linux/skbuff.h 2005-09-07 04:20:56.000000000 -0700 +++ linux-2.6.13_uso/include/linux/skbuff.h 2005-09-07 04:22:58.000000000 -0700 @@ -137,6 +137,8 @@ unsigned int nr_frags; unsigned short tso_size; unsigned short tso_segs; + unsigned short uso_size; + unsigned int ip6_frag_id; struct sk_buff *frag_list; skb_frag_t frags[MAX_SKB_FRAGS]; }; @@ -327,6 +329,11 @@ extern void skb_under_panic(struct sk_buff *skb, int len, void *here); +extern int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb, + int getfrag(void *from, char *to, int offset, + int len,int odd, struct sk_buff *skb), + void *from, int length); + struct skb_seq_state { __u32 lower_offset; diff -uNr linux-2.6.13/net/core/dev.c linux-2.6.13_uso/net/core/dev.c --- linux-2.6.13/net/core/dev.c 2005-09-07 06:36:25.000000000 -0700 +++ linux-2.6.13_uso/net/core/dev.c 2005-09-07 06:32:02.000000000 -0700 @@ -2706,6 +2706,25 @@ dev->name); dev->features &= ~NETIF_F_TSO; } + /* TSO requires that SG is present as well. 
*/ + if ((dev->features & NETIF_F_TSO) && + !(dev->features & NETIF_F_SG)) { + printk("%s: Dropping NETIF_F_TSO since no SG feature.\n", + dev->name); + dev->features &= ~NETIF_F_TSO; + } + if (dev->features & NETIF_F_USO) { + if (!(dev->features & NETIF_F_HW_CSUM)) { + printk("%s: Dropping NETIF_F_USO since no ", dev->name); + printk("NETIF_F_HW_CSUM feature.\n"); + dev->features &= ~NETIF_F_USO; + } + if (!(dev->features & NETIF_F_SG)) { + printk("%s: Dropping NETIF_F_USO since no ", dev->name); + printk("NETIF_F_SG feature.\n"); + dev->features &= ~NETIF_F_USO; + } + } /* * nil rebuild_header routine, diff -uNr linux-2.6.13/net/core/ethtool.c linux-2.6.13_uso/net/core/ethtool.c --- linux-2.6.13/net/core/ethtool.c 2005-09-07 06:36:34.000000000 -0700 +++ linux-2.6.13_uso/net/core/ethtool.c 2005-09-07 06:32:15.000000000 -0700 @@ -81,6 +81,20 @@ return 0; } +u32 ethtool_op_get_uso(struct net_device *dev) +{ + return (dev->features & NETIF_F_USO) != 0; +} + +int ethtool_op_set_uso(struct net_device *dev, u32 data) +{ + if (data) + dev->features |= NETIF_F_USO; + else + dev->features &= ~NETIF_F_USO; + return 0; +} + /* Handlers for each ethtool command */ static int ethtool_get_settings(struct net_device *dev, void __user *useraddr) @@ -469,6 +483,9 @@ err = dev->ethtool_ops->set_tso(dev, 0); if (err) return err; + err = dev->ethtool_ops->set_uso(dev, 0); + if (err) + return err; } return dev->ethtool_ops->set_sg(dev, data); @@ -557,6 +574,32 @@ return dev->ethtool_ops->set_tso(dev, edata.data); } +static int ethtool_get_uso(struct net_device *dev, char __user *useraddr) +{ + struct ethtool_value edata = { ETHTOOL_GUSO }; + + if (!dev->ethtool_ops->get_uso) + return -EOPNOTSUPP; + edata.data = dev->ethtool_ops->get_uso(dev); + if (copy_to_user(useraddr, &edata, sizeof(edata))) + return -EFAULT; + return 0; +} +static int ethtool_set_uso(struct net_device *dev, char __user *useraddr) +{ + struct ethtool_value edata; + + if (!dev->ethtool_ops->set_uso) + return 
-EOPNOTSUPP; + if (copy_from_user(&edata, useraddr, sizeof(edata))) + return -EFAULT; + if (edata.data && !(dev->features & NETIF_F_SG)) + return -EINVAL; + if (edata.data && !(dev->features & NETIF_F_HW_CSUM)) + return -EINVAL; + return dev->ethtool_ops->set_uso(dev, edata.data); +} + static int ethtool_self_test(struct net_device *dev, char __user *useraddr) { struct ethtool_test test; @@ -806,6 +849,12 @@ case ETHTOOL_GSTATS: rc = ethtool_get_stats(dev, useraddr); break; + case ETHTOOL_GUSO: + rc = ethtool_get_uso(dev, useraddr); + break; + case ETHTOOL_SUSO: + rc = ethtool_set_uso(dev, useraddr); + break; default: rc = -EOPNOTSUPP; } @@ -833,3 +882,5 @@ EXPORT_SYMBOL(ethtool_op_set_tso); EXPORT_SYMBOL(ethtool_op_set_tx_csum); EXPORT_SYMBOL(ethtool_op_set_tx_hw_csum); +EXPORT_SYMBOL(ethtool_op_set_uso); +EXPORT_SYMBOL(ethtool_op_get_uso); diff -uNr linux-2.6.13/net/core/skbuff.c linux-2.6.13_uso/net/core/skbuff.c --- linux-2.6.13/net/core/skbuff.c 2005-09-07 04:21:30.000000000 -0700 +++ linux-2.6.13_uso/net/core/skbuff.c 2005-09-07 06:38:57.000000000 -0700 @@ -159,6 +159,8 @@ skb_shinfo(skb)->tso_size = 0; skb_shinfo(skb)->tso_segs = 0; skb_shinfo(skb)->frag_list = NULL; + skb_shinfo(skb)->uso_size = 0; + skb_shinfo(skb)->ip6_frag_id = 0; out: return skb; nodata: @@ -1654,6 +1656,64 @@ return textsearch_find(config, state); } +/* + * skb_append_datato_frags - append the user data to a skb, + * sk - sock structure which contains skbs for transmission + * getfrag - The function to be called to get the data from the user. 
+ * from - pointer to user message iov + * length - length of the iov message + * + * This procedure will allocate a skb enough to hold protocol headers and + * append the user data in the fragment part of the skb and add the skb to + * socket write queue + */ +int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb, + int getfrag(void *from, char *to, int offset, + int len,int odd, struct sk_buff *skb), + void *from, int length) +{ + int frg_cnt = 0; + skb_frag_t *frag = NULL; + struct page *page = NULL; + int copy, left; + int offset = 0; + do { + frg_cnt = skb_shinfo(skb)->nr_frags; + if (frg_cnt >= MAX_SKB_FRAGS) { + kfree_skb(skb); + return -EFAULT; + } + page = alloc_pages(sk->sk_allocation, 0); + if (page == NULL) { + kfree_skb(skb); + return -ENOMEM; + } + sk->sk_sndmsg_page = page; + sk->sk_sndmsg_off = 0; + skb_fill_page_desc(skb, frg_cnt, page, 0, 0); + frg_cnt = skb_shinfo(skb)->nr_frags; + atomic_add(PAGE_SIZE, &sk->sk_wmem_alloc); + skb->truesize += PAGE_SIZE; + frag = &skb_shinfo(skb)->frags[frg_cnt - 1]; + left = PAGE_SIZE - frag->page_offset; + copy = (length > left)? 
left : length; + if (getfrag(from, page_address(frag->page) + + frag->page_offset+frag->size, + offset, copy, 0, skb) < 0) { + kfree_skb(skb); + return -EFAULT; + } + sk->sk_sndmsg_off += copy; + frag->size += copy; + skb->len += copy; + skb->data_len += copy; + offset += copy; + length -= copy; + page = NULL; + } while (length > 0); + return 0; +} + void __init skb_init(void) { skbuff_head_cache = kmem_cache_create("skbuff_head_cache", @@ -1696,3 +1756,4 @@ EXPORT_SYMBOL(skb_seq_read); EXPORT_SYMBOL(skb_abort_seq_read); EXPORT_SYMBOL(skb_find_text); +EXPORT_SYMBOL(skb_append_datato_frags); diff -uNr linux-2.6.13/net/ipv4/ip_output.c linux-2.6.13_uso/net/ipv4/ip_output.c --- linux-2.6.13/net/ipv4/ip_output.c 2005-09-07 04:21:46.000000000 -0700 +++ linux-2.6.13_uso/net/ipv4/ip_output.c 2005-09-13 07:12:05.000000000 -0700 @@ -280,7 +280,8 @@ { IP_INC_STATS(IPSTATS_MIB_OUTREQUESTS); - if (skb->len > dst_mtu(skb->dst) && !skb_shinfo(skb)->tso_size) + if (skb->len > dst_mtu(skb->dst) && + !(skb_shinfo(skb)->uso_size || skb_shinfo(skb)->tso_size)) return ip_fragment(skb, ip_finish_output); else return ip_finish_output(skb); @@ -781,6 +782,46 @@ csummode = CHECKSUM_HW; inet->cork.length += length; + if (((length > mtu) && (sk->sk_protocol == IPPROTO_UDP)) && + (rt->u.dst.dev->features & NETIF_F_USO)) { + /* There is support for UDP large send offload by network + * device, so create one single skb packet containing complete + * udp datagram + */ + if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL) { + skb = sock_alloc_send_skb(sk, + hh_len + fragheaderlen + transhdrlen + 20, + (flags & MSG_DONTWAIT), &err); + if (skb == NULL) + goto error; + /* reserve space for Hardware header */ + skb_reserve(skb, hh_len); + /* create space for UDP/IP header */ + skb_put(skb,fragheaderlen + transhdrlen); + /* initialize network header pointer */ + skb->nh.raw = skb->data; + /* initialize protocol header pointer */ + skb->h.raw = skb->data + fragheaderlen; + skb->ip_summed = 
CHECKSUM_HW; + skb->csum = 0; + sk->sk_sndmsg_off = 0; + } + err = skb_append_datato_frags(sk,skb, getfrag, from, + (length - transhdrlen)); + if (!err) { + /* specify the length of each IP datagram fragment*/ + skb_shinfo(skb)->uso_size = (mtu - fragheaderlen); + __skb_queue_tail(&sk->sk_write_queue, skb); + return 0; + } else { + /* There is not enough support do UPD LSO, + * so follow normal path + */ + kfree_skb(skb); + goto error; + } + } + /* So, what's going on in the loop below? * @@ -1012,14 +1053,23 @@ return -EINVAL; inet->cork.length += size; + if ((sk->sk_protocol == IPPROTO_UDP) && + (rt->u.dst.dev->features & NETIF_F_USO)) + skb_shinfo(skb)->uso_size = (mtu - fragheaderlen); + while (size > 0) { int i; - /* Check if the remaining data fits into current packet. */ - len = mtu - skb->len; - if (len < size) - len = maxfraglen - skb->len; + if (skb_shinfo(skb)->uso_size) { + len = size; + } else { + + /* Check if the remaining data fits into current packet. */ + len = mtu - skb->len; + if (len < size) + len = maxfraglen - skb->len; + } if (len <= 0) { struct sk_buff *skb_prev; char *data; diff -uNr linux-2.6.13/net/ipv6/ip6_output.c linux-2.6.13_uso/net/ipv6/ip6_output.c --- linux-2.6.13/net/ipv6/ip6_output.c 2005-09-07 04:21:57.000000000 -0700 +++ linux-2.6.13_uso/net/ipv6/ip6_output.c 2005-09-13 07:11:10.000000000 -0700 @@ -147,7 +147,8 @@ int ip6_output(struct sk_buff *skb) { - if (skb->len > dst_mtu(skb->dst) || dst_allfrag(skb->dst)) + if ((skb->len > dst_mtu(skb->dst) && !skb_shinfo(skb)->uso_size) || + dst_allfrag(skb->dst)) return ip6_fragment(skb, ip6_output2); else return ip6_output2(skb); @@ -893,6 +894,50 @@ */ inet->cork.length += length; + if (((length > mtu) && (sk->sk_protocol == IPPROTO_UDP)) && + (rt->u.dst.dev->features & NETIF_F_USO)) { + /* There is support for UDP large send offload by network + * device, so create one single skb packet containing complete + * udp datagram + */ + if ((skb = skb_peek_tail(&sk->sk_write_queue)) == 
NULL) { + skb = sock_alloc_send_skb(sk, + hh_len + fragheaderlen + transhdrlen + 20, + (flags & MSG_DONTWAIT), &err); + if (skb == NULL) + goto error; + /* reserve space for Hardware header */ + skb_reserve(skb, hh_len); + /* create space for UDP/IP header */ + skb_put(skb,fragheaderlen + transhdrlen); + /* initialize network header pointer */ + skb->nh.raw = skb->data; + /* initialize protocol header pointer */ + skb->h.raw = skb->data + fragheaderlen; + skb->ip_summed = CHECKSUM_HW; + skb->csum = 0; + sk->sk_sndmsg_off = 0; + } + err = skb_append_datato_frags(sk,skb, getfrag, from, + (length - transhdrlen)); + if (!err) { + struct frag_hdr fhdr; + + /* specify the length of each IP datagram fragment*/ + skb_shinfo(skb)->uso_size = (mtu - fragheaderlen) - + sizeof(struct frag_hdr); + ipv6_select_ident(skb, &fhdr); + skb_shinfo(skb)->ip6_frag_id = fhdr.identification; + __skb_queue_tail(&sk->sk_write_queue, skb); + return 0; + } else { + /* There is not enough support do UPD LSO, + * so follow normal path + */ + kfree_skb(skb); + goto error; + } + } if ((skb = skb_peek_tail(&sk->sk_write_queue)) == NULL) goto alloc_new_skb; From jgarzik@pobox.com Tue Sep 13 17:26:53 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 13 Sep 2005 17:26:59 -0700 (PDT) Received: from mail.dvmed.net (mail.dvmed.net [216.237.124.58]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8E0QqiL020202 for ; Tue, 13 Sep 2005 17:26:53 -0700 Received: from cpe-069-134-188-146.nc.res.rr.com ([69.134.188.146] helo=[10.10.10.88]) by mail.dvmed.net with esmtpsa (Exim 4.52 #1 (Red Hat Linux)) id 1EFL4H-0001BJ-Nl; Wed, 14 Sep 2005 00:24:07 +0000 Message-ID: <43276DA1.3030304@pobox.com> Date: Tue, 13 Sep 2005 20:24:01 -0400 From: Jeff Garzik User-Agent: Mozilla Thunderbird 1.0.6-1.1.fc4 (X11/20050720) X-Accept-Language: en-us, en MIME-Version: 1.0 To: ravinandan.arakali@neterion.com CC: netdev@oss.sgi.com, raghavendra.koushik@neterion.com, leonid.grossman@neterion.com, 
rapuru.sriram@neterion.com, ananda.raju@neterion.com Subject: Re: [PATCH 2.6.13] IPv4/IPv6: USO Scatter-gather approac References: <200509140655.j8E6t1jx003507@localhost.localdomain> In-Reply-To: <200509140655.j8E6t1jx003507@localhost.localdomain> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 3626 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jgarzik@pobox.com Precedence: bulk X-list: netdev Content-Length: 130 Lines: 10 Please resend to netdev@vger.kernel.org, so that I may properly comment. netdev@oss.sgi.com has been retired. Thanks, Jeff From Robert.Olsson@data.slu.se Wed Sep 14 01:07:19 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 14 Sep 2005 01:07:30 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8E87GiL023891 for ; Wed, 14 Sep 2005 01:07:19 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j8E84LeN004539; Wed, 14 Sep 2005 10:04:22 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id DBF95EC3CC; Wed, 14 Sep 2005 10:04:21 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17191.55685.861191.831981@robur.slu.se> Date: Wed, 14 Sep 2005 10:04:21 +0200 To: Simon Kirby Cc: Alexey Kuznetsov , Robert Olsson , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance In-Reply-To: <20050913221448.GD15704@netnation.com> References: <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> 
<20050907162854.GB24735@netnation.com> <20050907195911.GA8382@yakov.inr.ac.ru> <20050913221448.GD15704@netnation.com> X-Mailer: VM 7.19 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-archive-position: 3627 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Content-Length: 703 Lines: 25

Simon Kirby writes:

> Sender: 367 Mbps, 717883 pps valid src/dst, 64 byte (Ethernet) packets
>
> 2.4.27-rc1: 297 Mbps forwarded (w/idle time?!)
> 2.4.31: 296 Mbps forwarded (w/idle time?!)
> 2.6.13-rc6: 173 Mbps forwarded

> Time permitting, I'd also like to run some profiles. It's interesting
> to note that 2.6 is slower at forwarding even straight duplicate small
> packets. We should definitely get to the bottom of that.

Yes. Is this a single flow? Strange. Run a fixed-size shot of 10 Mpkts or so on both 2.4 and 2.6 and save /proc/interrupts, /proc/net/softnet_stat, netstat -i and tc -s qdisc output to start with. A profile on 2.6 could clear up the confusion.

Cheers.
--ro From kuznet@yakov.inr.ac.ru Thu Sep 15 14:08:01 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 15 Sep 2005 14:08:13 -0700 (PDT) Received: from yakov.inr.ac.ru (yakov.inr.ac.ru [194.67.69.111]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id j8FL7wiL007050 for ; Thu, 15 Sep 2005 14:08:01 -0700 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=ms2.inr.ac.ru; b=XaOy7620w9hfz4iPjJMu/7jgQUBWyVMWBzGTQCGcOXhzJ2p4IWMnryrK/MbEtB7tMAyisM9Oi7ryfcF9/jJkEtMeqy/xQ6I+wJQQQe0K2UZ/Ea64pFrxmtZeUOng1n+U4J9SlfWj/P0vpEm3e9GPd7HRsMVhEcgGmX91G/zEQDM=; Received: (from kuznet@localhost) envelope-from=kuznet by yakov.inr.ac.ru (8.6.13/ANK) id BAA29625; Fri, 16 Sep 2005 01:04:32 +0400 Date: Fri, 16 Sep 2005 01:04:32 +0400 From: Alexey Kuznetsov To: Simon Kirby Cc: Alexey Kuznetsov , Robert Olsson , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050915210432.GD28925@yakov.inr.ac.ru> References: <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <20050907195911.GA8382@yakov.inr.ac.ru> <20050913221448.GD15704@netnation.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050913221448.GD15704@netnation.com> User-Agent: Mutt/1.5.6i X-archive-position: 3628 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Content-Length: 2812 Lines: 79 Hello! > 300 seems to be sufficient, but I'm not sure what this depends on (load, > HZ, timing of some sort?). It should be enough not depending on anything but sysctl net/core/netdev_max_backlog. 
> I'm not sure if Linux could work in the same way

It could, but it does not. Still.

> Actually, is there ever a valid case where the source needs to be tracked
> in the route cache when policy routing is disabled?

Unfortunately, yes. It caches lots of information depending on the incoming interface and address. All this is mostly useless and can be eliminated, but it is not so trivial.

> there is a "fastforwarding" sysctl that sends forwarded packets from
> the input interrupt/poll without queueing them in a soft interrupt
> ("NETISR").

We used to experiment with this too. "fastroute" was killed completely; napi is a little slower, but much better from the viewpoint of maintainability.

> 2.4.27-rc1: 297 Mbps forwarded (w/idle time?!)
vs
> 2.6.13-rc6: 173 Mbps forwarded
and
> 2.4.27-rc1 gc_elasticity=1: 182 Mbps forwarded
vs
> 2.6.13-rc6 maxbatch=300: 58.6 Mbps forwarded (gc balanced)

No clue! There should be no such big difference. It is some disaster. Something is very wrong, and most of the loss is not even related to the routing cache.

Most likely it is the driver, or something is seriously screwed up in softirq processing. Profiling is really required...

Robert, did you not see anything like this?

> 2.6 definitely has better dst cache gc balancing than 2.4. I can set
> the max_size=rhash_size in 2.6.13-rc6 and it will just work, even without
> adjusting gc_elasticity or gc_thresh. In 2.4.27 and 2.4.31, the only
> parameter that appears to help is gc_elasticity. If I just adjust
> max_size, it overflows and falls over.

I have no idea why it works. The size of the cache is determined by gc_elasticity both in 2.4 and 2.6. Nothing changed. The only difference in 2.4 is that it used to have a wrong default value of 5 seconds for gc_min_interval (0.5 sec in 2.6). Unless this is fixed, gc just does not work at high rates.

Both in 2.6 and 2.4 you must not touch max_size unless you want to _increase_ it; the default value is the minimum allowed by sanity.
Actually there is a hard constraint: gc_elasticity*rhash_size <= max_size/2, if you break this condition, it must break. Probably, you do not see this because you do not change routing tables while testing. > I note that the actual read copy update "maxbatch" limit was added in > 2.6.9. Before then, it seems there was no limit (infinite). Was it > added for latency reasons? Before 2.6.9 rcu worked differently. It run very rarely and had to do lots of work each run, effectively unlimited. Apparently, when RCU folks finally implemented new better mechanism they also added some job limit and did this wrong, 10 is ridiculously low limit. Alexey From Robert.Olsson@data.slu.se Thu Sep 15 14:33:53 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 15 Sep 2005 14:33:59 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8FLXqiL008866 for ; Thu, 15 Sep 2005 14:33:53 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j8FLUsb6021660; Thu, 15 Sep 2005 23:30:54 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id 3A864EC3CC; Thu, 15 Sep 2005 23:30:54 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17193.59406.200787.819069@robur.slu.se> Date: Thu, 15 Sep 2005 23:30:54 +0200 To: Alexey Kuznetsov Cc: Simon Kirby , Robert Olsson , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance In-Reply-To: <20050915210432.GD28925@yakov.inr.ac.ru> References: <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <20050907195911.GA8382@yakov.inr.ac.ru> 
<20050913221448.GD15704@netnation.com> <20050915210432.GD28925@yakov.inr.ac.ru> X-Mailer: VM 7.19 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-archive-position: 3629 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Content-Length: 893 Lines: 25

Alexey Kuznetsov writes:

> Most likely it is driver or something is seriously screwed up in softirq
> processing. Profiling is really required...
>
> Robert, did you not see anything like this?

No. There must be an explanation... I've seen around 1 Mpps in the best single-flow tests w. 2.6 kernels, on decent HW of course. Simon, can you report in pps, since you use 64 byte pkts?

> Before 2.6.9 rcu worked differently. It run very rarely and had to do
> lots of work each run, effectively unlimited. Apparently, when RCU folks
> finally implemented new better mechanism they also added some job limit
> and did this wrong, 10 is ridiculously low limit.

Yes. I'd guess the thinking was that RCU is for read-mostly data and rDoS violates this, but yes, 10 seems dangerously low.

Also interesting to get BSD numbers? Sounds like they use something like the old FASTROUTE.

Cheers.
--ro From kuznet@yakov.inr.ac.ru Thu Sep 15 15:23:56 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 15 Sep 2005 15:24:01 -0700 (PDT) Received: from yakov.inr.ac.ru (yakov.inr.ac.ru [194.67.69.111]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id j8FMNtiL012745 for ; Thu, 15 Sep 2005 15:23:55 -0700 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=ms2.inr.ac.ru; b=P5jrLoWUJYOH1XbVN2xjSHsJpt1rcpaLO/Lq9ipb0eqxMEP6cfV4qxJGC2E09+/6lkDWm7TRlYnmH3iA2uXarB8AolkYOMT2edMbbeCYkckN7dNRdRxShMkuLTIFoZxxs/9Z802AiB8NS2Q3kxfsDrHtJm4BLbSSosum7Err8oU=; Received: (from kuznet@localhost) envelope-from=kuznet by yakov.inr.ac.ru (8.6.13/ANK) id CAA30539; Fri, 16 Sep 2005 02:21:02 +0400 Date: Fri, 16 Sep 2005 02:21:02 +0400 From: Alexey Kuznetsov To: Robert Olsson Cc: Alexey Kuznetsov , Simon Kirby , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050915222102.GA30387@yakov.inr.ac.ru> References: <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <20050907195911.GA8382@yakov.inr.ac.ru> <20050913221448.GD15704@netnation.com> <20050915210432.GD28925@yakov.inr.ac.ru> <17193.59406.200787.819069@robur.slu.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17193.59406.200787.819069@robur.slu.se> User-Agent: Mutt/1.5.6i X-archive-position: 3630 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Content-Length: 815 Lines: 26 Hello! > No. There must be an explanation... I've seen around 1 Mpps in the best > single flow tests w. 2.6 kernels of course decent HW. Simon can you > report in pps as you use 64 byte pkts. 
Sender: 367 Mbps, 717883 pps valid src/dst, 64 byte (Ethernet) packets 2.4.27-rc1: 297 Mbps forwarded (w/idle time?!) So, his best number is (717883/367)*297 ~= 580kpps > Yes. I'll guess the thinking was that RCU is for read mostly RCU should not add essential overhead to DoS, actually. The difference between direct dst_free and RCU is strange as well. > Also interesing to get BSD numbers? Sounds like they use something like > old FASTROUTE. Yes, it is quite funny. I guess it required irq protection to radix tree manipulations, grr... Anyway, I would expect BSD with fastforwarding beat NAPI. Alexey
From Robert.Olsson@data.slu.se Fri Sep 16 05:21:06 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 16 Sep 2005 05:21:23 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8GCL3iL025806 for ; Fri, 16 Sep 2005 05:21:06 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j8GCI18q024616; Fri, 16 Sep 2005 14:18:01 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id 9D4DFEC3CC; Fri, 16 Sep 2005 14:18:01 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17194.47097.607795.141059@robur.slu.se> Date: Fri, 16 Sep 2005 14:18:01 +0200 To: Alexey Kuznetsov Cc: Robert Olsson , Simon Kirby , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance In-Reply-To: <20050915222102.GA30387@yakov.inr.ac.ru> References: <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <20050907195911.GA8382@yakov.inr.ac.ru> <20050913221448.GD15704@netnation.com> <20050915210432.GD28925@yakov.inr.ac.ru> <17193.59406.200787.819069@robur.slu.se> <20050915222102.GA30387@yakov.inr.ac.ru> X-Mailer: VM 7.19 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-archive-position: 3632 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Content-Length: 1885 Lines: 52 Alexey Kuznetsov writes: > Sender: 367 Mbps, 717883 pps valid src/dst, 64 byte (Ethernet) packets > 2.4.27-rc1: 297 Mbps forwarded (w/idle time?!) 
> So, his best number is (717883/367)*297 ~= 580kpps

Yes, sounds familiar: XEON with e1000... So why not for 2.6?

Below is a very quick test from our 1.6 GHz Opteron. Latest GIT tree w. UP.
e1000 at PCI-X 133/100 MHz. 82546 GB dual NICs.

Input 881 kpps into eth0.

Iface   MTU Met   RX-OK  RX-ERR  RX-DRP  RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flags
eth0   1500   0 6733009 3267274 6534548 3267274     180      0      0      0 BRU
eth1   1500   0       6       0       0       0 6732737      0      0      0 BRU

cat /proc/net/softnet_stat
0066bd27 00000000 000057aa 00000000 00000000 00000000 00000000 00000000 00000000

cat /proc/interrupts
           CPU0
 16:        707   IO-APIC-level  eth0
 17:        293   IO-APIC-level  eth1
 18:        286   IO-APIC-level  eth2
 19:        286   IO-APIC-level  eth3

Total routed T-put of 590 Kpps

> > Yes. I'll guess the thinking was that RCU is for read mostly
>
> RCU should not add essential overhead to DoS, actually. The difference
> between direct dst_free and RCU is strange as well.

I think we saw this before. I proposed disabling deferred deletions as with the patch I sent for UP.

> > Also interesing to get BSD numbers? Sounds like they use something like
> > old FASTROUTE.
>
> Yes, it is quite funny. I guess it required irq protection to radix tree
> manipulations, grr... Anyway, I would expect BSD with fastforwarding beat
> NAPI.

BSD uses fixed polling from what I understand, so it should be pretty close to NAPI. With Radix for FIB they need route even more than Linux. But the code path might be more efficient and have fewer hooks. Also I dunno about SMP/NUMA for BSD; we pay some price for it but hopefully we get something in return.

Cheers.
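The softnet_stat row in this message is printed in hexadecimal, one row per CPU. A minimal sketch for decoding it, assuming the first three columns are packets processed, packets dropped and time_squeeze (the column order of the 2.6-era net/core/dev.c as far as I know). The struct and function names are made up for illustration and are not a kernel API:

```c
#include <stdio.h>

/* Hypothetical holder for the first three columns of one
 * /proc/net/softnet_stat row (assumed column order, see above). */
struct softnet_row {
	unsigned int total;        /* packets processed */
	unsigned int dropped;      /* dropped because the backlog was full */
	unsigned int time_squeeze; /* softirq ran out of budget or time */
};

/* Parse the leading hex columns of one row.
 * Returns 0 on success, -1 if fewer than three columns were found. */
static int parse_softnet_row(const char *line, struct softnet_row *row)
{
	if (sscanf(line, "%x %x %x",
		   &row->total, &row->dropped, &row->time_squeeze) != 3)
		return -1;
	return 0;
}
```

Fed the row from this test, it gives total = 6733095 (0x0066bd27), dropped = 0 and time_squeeze = 22442 (0x57aa), which is roughly in line with the RX-OK count for eth0 in the netstat output.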
--ro From kuznet@yakov.inr.ac.ru Fri Sep 16 12:07:07 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 16 Sep 2005 12:07:13 -0700 (PDT) Received: from yakov.inr.ac.ru (yakov.inr.ac.ru [194.67.69.111]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id j8GJ71iL031432 for ; Fri, 16 Sep 2005 12:07:06 -0700 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=ms2.inr.ac.ru; b=S/XUjcetLam24Er01x1MGG1wD8S1eSkHSVhlK5hg/+Gd5GPnyB2jKRukLqSDjMS+O5HtRg7rT3gKrektSdhBqhrRW3Qc8vz7VOohTkdsAY9SFokBTIFkFUpRZTaw3CDuAOyiCYQ8jZRj2GPO+z7IJ9PP0TnwtsYieaf/+XxORv4=; Received: (from kuznet@localhost) envelope-from=kuznet by yakov.inr.ac.ru (8.6.13/ANK) id XAA11132; Fri, 16 Sep 2005 23:04:04 +0400 Date: Fri, 16 Sep 2005 23:04:04 +0400 From: Alexey Kuznetsov To: Robert Olsson Cc: Alexey Kuznetsov , Simon Kirby , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050916190404.GA11012@yakov.inr.ac.ru> References: <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <20050907195911.GA8382@yakov.inr.ac.ru> <20050913221448.GD15704@netnation.com> <20050915210432.GD28925@yakov.inr.ac.ru> <17193.59406.200787.819069@robur.slu.se> <20050915222102.GA30387@yakov.inr.ac.ru> <17194.47097.607795.141059@robur.slu.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17194.47097.607795.141059@robur.slu.se> User-Agent: Mutt/1.5.6i X-archive-position: 3633 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Content-Length: 692 Lines: 19 Hello! > Yes sounds famliar XEON with e1000... So why not for 2.6? Most likely, something is broken in the e1000 driver. Otherwise, no ideas. > I think we saw this before. 
I proposed disabling deferred deletions
> as with the patch I sent for UP.

I do not see _why_. Apparently some overhead is present, but I do not understand why it is so large. Is it just because 300 redundant entries pollute the cache a little more? I do not see any other reason.

Maybe it makes sense to compare this effect with the effect of incrementing gc_elasticity by 1. If it is due to cache pollution, the effect of incrementing gc_elasticity, which increases the size of the cache by rhash_size, should be even worse.

Alexey

From greearb@candelatech.com Fri Sep 16 12:26:17 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 16 Sep 2005 12:26:21 -0700 (PDT) Received: from www.lanforge.com (ns1.lanforge.com [66.165.47.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8GJQGiL004491 for ; Fri, 16 Sep 2005 12:26:17 -0700 Received: from [71.112.207.5] (pool-71-112-207-5.sttlwa.dsl-w.verizon.net [71.112.207.5]) (authenticated bits=0) by www.lanforge.com (8.12.8/8.12.8) with ESMTP id j8GJSno6018435; Fri, 16 Sep 2005 12:28:50 -0700 Message-ID: <432B1B73.2050808@candelatech.com> Date: Fri, 16 Sep 2005 12:22:27 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.10) Gecko/20050719 Fedora/1.7.10-1.3.1 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Alexey Kuznetsov CC: Robert Olsson , Simon Kirby , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance References: <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <20050907195911.GA8382@yakov.inr.ac.ru> <20050913221448.GD15704@netnation.com> <20050915210432.GD28925@yakov.inr.ac.ru> <17193.59406.200787.819069@robur.slu.se> <20050915222102.GA30387@yakov.inr.ac.ru> <17194.47097.607795.141059@robur.slu.se> <20050916190404.GA11012@yakov.inr.ac.ru> In-Reply-To: <20050916190404.GA11012@yakov.inr.ac.ru> Content-Type: text/plain; 
charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 3634 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Content-Length: 959 Lines: 29 Alexey Kuznetsov wrote: > Hello! > > >>Yes sounds famliar XEON with e1000... So why not for 2.6? > > > Most likely, something is broken in the e1000 driver. Otherwise, no ideas. Has anyone tried using bridging to compare numbers? I would assume that the bridging code is lower-overhead than the routing, so if it's a route cache problem, the bridge traffic should be significantly higher than the routed traffic. If they are both about the same, then either bridging has lots of overhead too, or the driver (or other network sub-system) is the bottleneck. For reference, I was able to bridge only about 200kpps (in each direction, 64 byte pkts) on a P-IV 3Ghz system with dual Intel e1000 NIC in a PCI-X 64/133 bus.... I would like to hear of any other bridging benchmarks that someone may have, especially for bi-directional traffic flows. 
Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From Robert.Olsson@data.slu.se Fri Sep 16 13:00:32 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 16 Sep 2005 13:00:35 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8GK0ViL006534 for ; Fri, 16 Sep 2005 13:00:32 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j8GJvVxp019552; Fri, 16 Sep 2005 21:57:31 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id 80F90EC3CC; Fri, 16 Sep 2005 21:57:31 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17195.9131.491717.442214@robur.slu.se> Date: Fri, 16 Sep 2005 21:57:31 +0200 To: Alexey Kuznetsov Cc: Robert Olsson , Simon Kirby , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance In-Reply-To: <20050916190404.GA11012@yakov.inr.ac.ru> References: <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <20050907195911.GA8382@yakov.inr.ac.ru> <20050913221448.GD15704@netnation.com> <20050915210432.GD28925@yakov.inr.ac.ru> <17193.59406.200787.819069@robur.slu.se> <20050915222102.GA30387@yakov.inr.ac.ru> <17194.47097.607795.141059@robur.slu.se> <20050916190404.GA11012@yakov.inr.ac.ru> X-Mailer: VM 7.19 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-archive-position: 3635 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Content-Length: 1056 Lines: 27 Alexey Kuznetsov writes: > Most likely, something is broken in the e1000 driver. Otherwise, no ideas. No it's hard to guess. Simon will hopefully bring some more data. > I do not see _why_. 
Apparently some overhead is present but I do not > understand why it is so large. Is it just because 300 redundant entries > pollute the cache a little more? I do not see any other reason. Yes, when the RX softirq is done, the RCU tasklet has to take over and probably reload the cache with some of the entries to complete the deletion. It might be worth a profile... > Maybe it makes sense to compare this effect with the effect of incrementing > gc_elasticity by 1. If it is due to cache pollution, the effect of incrementing > gc_elasticity, which increases the size of the cache by rhash_size, should be > even worse. Something like that, yes :) but if we increase gc_elasticity we also add more spinning in hash chains. So we need to sort out whether the expected performance drop comes from extra hash spinning or from cache effects of the larger hash. Cheers. --ro From sim@netnation.com Fri Sep 16 17:31:08 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 16 Sep 2005 17:31:13 -0700 (PDT) Received: from peace.netnation.com (newpeace.netnation.com [204.174.223.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8H0V7iL027371 for ; Fri, 16 Sep 2005 17:31:07 -0700 Received: from sim by peace.netnation.com with local (Exim 4.50) id 1EGQZ5-0003n5-Lv; Fri, 16 Sep 2005 17:28:23 -0700 Date: Fri, 16 Sep 2005 17:28:23 -0700 From: Simon Kirby To: Robert Olsson Cc: Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050917002823.GB19112@netnation.com> References: <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <20050907195911.GA8382@yakov.inr.ac.ru> <20050913221448.GD15704@netnation.com> <17191.55685.861191.831981@robur.slu.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline 
In-Reply-To: <17191.55685.861191.831981@robur.slu.se> User-Agent: Mutt/1.5.9i X-archive-position: 3637 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sim@netnation.com Precedence: bulk X-list: netdev Content-Length: 1101 Lines: 30 On Wed, Sep 14, 2005 at 10:04:21AM +0200, Robert Olsson wrote: > > Simon Kirby writes: > > > > Sender: 367 Mbps, 717883 pps valid src/dst, 64 byte (Ethernet) packets > > > > 2.4.27-rc1: 297 Mbps forwarded (w/idle time?!) > > 2.4.31: 296 Mbps forwarded (w/idle time?!) > > 2.6.13-rc6: 173 Mbps forwarded > > > Time permitting, I'd also like to run some profiles. It's interesting > > to note that 2.6 is slower at forwarding even straight duplicate small > > packets. We should definitely get to the bottom of that. > > Yes. This is single flow? Strange. > > Run a fixed size shot 10Mpkts pkts or so for both 2.4 and 2.6 and save > /proc/interrupts, proc/net/softnetstat, netstat -i, tc -s qdisc to start with. I got stuck in some mud again, but I was able to run a small oprofile. nf_iterate was near the top even though the firewall was empty, so I changed CONFIG_IP_NF_IPTABLES=y to CONFIG_IP_NF_IPTABLES=m (and didn't load it). Throughput went up from 173 Mbps to 232 Mbps...yikes. Conntrack was never compiled. I'll do some more profiling when I get a chance... 
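For reference, the sender figures quoted above agree with each other: 717883 pps of 64-byte packets is roughly 367 Mbps when only the 64 frame bytes are counted. The following is a minimal C sketch of that conversion, not part of the original thread. The function names are hypothetical, and it assumes the quoted Mbps figures ignore Ethernet preamble and inter-frame gap, which the thread does not state.

```c
#include <assert.h>
#include <stdio.h>

/* Bits per second carried by `pps` packets of `frame_bytes` bytes each.
 * Assumption: only the frame bytes are counted, not preamble or
 * inter-frame gap. */
unsigned long long pps_to_bps(unsigned long long pps,
                              unsigned long long frame_bytes)
{
    return pps * frame_bytes * 8ULL;
}

/* Inverse conversion, rounded down to whole packets per second. */
unsigned long long bps_to_pps(unsigned long long bps,
                              unsigned long long frame_bytes)
{
    return bps / (frame_bytes * 8ULL);
}

/* Print the packet rates implied by the throughput figures in the thread
 * (all for 64-byte packets). */
void print_thread_rates(void)
{
    /* Sender: 717883 pps should come out near 367 Mbps. */
    assert(pps_to_bps(717883ULL, 64ULL) == 367556096ULL);

    printf("sender          : %llu bit/s\n", pps_to_bps(717883ULL, 64ULL));
    printf("2.4.27-rc1      : %llu pps\n", bps_to_pps(297000000ULL, 64ULL));
    printf("2.6.13-rc6      : %llu pps\n", bps_to_pps(173000000ULL, 64ULL));
    printf("2.6, no iptables: %llu pps\n", bps_to_pps(232000000ULL, 64ULL));
}
```

Calling print_thread_rates() from a main() prints the implied packet rate for each kernel, which makes it easier to compare these runs with results that are reported in pps rather than Mbps.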
Simon- From gandalf@wlug.westbo.se Sat Sep 17 02:07:02 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 17 Sep 2005 02:07:10 -0700 (PDT) Received: from mxfep01.bredband.com (mxfep01.bredband.com [195.54.107.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8H96uiL003048 for ; Sat, 17 Sep 2005 02:06:59 -0700 Received: from tux.rsn.bth.se ([85.228.2.92] [85.228.2.92]) by mxfep01.bredband.com with ESMTP id <20050917090413.BRSH13243.mxfep01.bredband.com@tux.rsn.bth.se>; Sat, 17 Sep 2005 11:04:13 +0200 Received: from localhost.localdomain (localhost [127.0.0.1]) by tux.rsn.bth.se (Postfix) with ESMTP id 6EA4E3F55; Sat, 17 Sep 2005 10:05:21 +0200 (CEST) Subject: Re: Route cache performance From: Martin Josefsson To: Simon Kirby Cc: Robert Olsson , Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com In-Reply-To: <20050917002823.GB19112@netnation.com> References: <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <20050907195911.GA8382@yakov.inr.ac.ru> <20050913221448.GD15704@netnation.com> <17191.55685.861191.831981@robur.slu.se> <20050917002823.GB19112@netnation.com> Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="=-kW9ddZExY4QUTFvf/+VF" Date: Sat, 17 Sep 2005 11:04:09 +0200 Message-Id: <1126947850.4549.11.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Evolution 2.2.3 X-archive-position: 3638 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: gandalf@wlug.westbo.se Precedence: bulk X-list: netdev Content-Length: 1145 Lines: 39 --=-kW9ddZExY4QUTFvf/+VF Content-Type: text/plain Content-Transfer-Encoding: quoted-printable On Fri, 2005-09-16 at 17:28 -0700, Simon Kirby wrote: > I got stuck in 
some mud again, but I was able to run a small oprofile. > > nf_iterate was near the top even though the firewall was empty, so I > changed CONFIG_IP_NF_IPTABLES=y to CONFIG_IP_NF_IPTABLES=m (and didn't > load it). Throughput went up from 173 Mbps to 232 Mbps...yikes. > Conntrack was never compiled. I'll do some more profiling when I get > a chance... Yes, it's bloody slow even without any rules loaded at the moment, it's on my todo list... If you want even less overhead then don't even select CONFIG_NETFILTER, that way you avoid compiling in the netfilter hooks completely. -- /Martin --=-kW9ddZExY4QUTFvf/+VF Content-Type: application/pgp-signature; name=signature.asc Content-Description: This is a digitally signed message part -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQBDK9wJWm2vlfa207ERArTcAJ9tvSLM0aeQEJMqQnOmGcFIYTHoeQCgk98y P5zTc3hcYsjAqfV1HE2ux/0= =OQEB -----END PGP SIGNATURE----- --=-kW9ddZExY4QUTFvf/+VF-- From hadi@cyberus.ca Sat Sep 17 08:20:08 2005 Received: with ECARTIS (v1.0.0; list netdev); Sat, 17 Sep 2005 08:20:14 -0700 (PDT) Received: from mx03.cybersurf.com (mx03.cybersurf.com [209.197.145.106]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8HFK7iL002755 for ; Sat, 17 Sep 2005 08:20:08 -0700 Received: from mail.cyberus.ca ([209.197.145.21]) by mx03.cybersurf.com with esmtp (Exim 4.30) id 1EGeRU-000322-Lf for netdev@oss.sgi.com; Sat, 17 Sep 2005 11:17:28 -0400 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.96.183] helo=[10.0.0.229]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1EGeRQ-0000N0-7t; Sat, 17 Sep 2005 11:17:24 -0400 Subject: Re: Route cache performance From: jamal Reply-To: hadi@cyberus.ca To: Simon Kirby Cc: Robert Olsson , Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com In-Reply-To: <20050917002823.GB19112@netnation.com> References: <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> 
<20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <20050907195911.GA8382@yakov.inr.ac.ru> <20050913221448.GD15704@netnation.com> <17191.55685.861191.831981@robur.slu.se> <20050917002823.GB19112@netnation.com> Content-Type: text/plain Organization: unknown Date: Sat, 17 Sep 2005 11:17:20 -0400 Message-Id: <1126970240.6681.128.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Evolution 2.2.1.1 Content-Transfer-Encoding: 7bit X-archive-position: 3639 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev Content-Length: 533 Lines: 17 On Fri, 2005-16-09 at 17:28 -0700, Simon Kirby wrote: > nf_iterate was near the top even though the firewall was empty, so I > changed CONFIG_IP_NF_IPTABLES=y to CONFIG_IP_NF_IPTABLES=m (and didn't > load it). Throughput went up from 173 Mbps to 232 Mbps...yikes. > Conntrack was never compiled. I'll do some more profiling when I get > a chance... > If you want some basic stateless firewalling, turn off netfilter and use tc ingress/egress actions instead. The impact on performance is a lot more tolerable. 
cheers, jamal From manfred@colorfullife.com Sun Sep 18 07:04:47 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 18 Sep 2005 07:04:55 -0700 (PDT) Received: from dbl.q-ag.de (dbl.q-ag.de [213.172.117.3]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8IE4hiL013000 for ; Sun, 18 Sep 2005 07:04:46 -0700 Received: from [127.0.0.2] (dbl [127.0.0.1]) by dbl.q-ag.de (8.13.3/8.13.3/Debian-6) with ESMTP id j8IE8teB008529; Sun, 18 Sep 2005 16:08:56 +0200 Message-ID: <432D7354.8000503@colorfullife.com> Date: Sun, 18 Sep 2005 16:01:56 +0200 From: Manfred Spraul User-Agent: Mozilla/5.0 (X11; U; Linux i686; fr-FR; rv:1.7.10) Gecko/20050909 Fedora/1.7.10-1.5.2 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Jeff Garzik CC: Netdev , Ayaz Abdulla Subject: [PATCH 2/2] forcedeth: scatter gather and segmentation offload support Content-Type: multipart/mixed; boundary="------------030405010209050508090501" X-archive-position: 3641 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: manfred@colorfullife.com Precedence: bulk X-list: netdev Content-Length: 10893 Lines: 334 This is a multi-part message in MIME format. --------------030405010209050508090501 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit The attached patch adds scatter gather and segmentation offload support into forcedeth driver. This patch has been tested by NVIDIA and reviewed by Manfred. Notes: - Manfred mentioned that mapping of pages could take time and should not be under spinlock for performance reasons - During testing with netperf, I have noticed a connection running segmentation offload gets "unoffloaded" by the kernel due to possible retransmissions. 
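A side note on the descriptor indexing in the attached patch: nv_start_xmit() starts at slot (next_tx + fragments) % TX_RING and steps backwards with nr = (nr - 1) % TX_RING on an unsigned index. That backwards step only wraps from slot 0 to the last slot when the ring size is a power of two. The sketch below illustrates the arithmetic in isolation. It is a standalone illustration, not driver code: RING_SIZE = 64 and all function names are assumptions made for the example.

```c
#include <assert.h>

/* Assumed ring size for this illustration; it must be a power of two. */
#define RING_SIZE 64u

/* Compile-time check that RING_SIZE is a power of two. */
typedef char ring_size_is_pow2[(RING_SIZE & (RING_SIZE - 1u)) == 0 ? 1 : -1];

/* Step one slot backwards. For slot 0 the subtraction wraps to UINT_MAX,
 * and UINT_MAX % RING_SIZE is RING_SIZE - 1 because the unsigned range is
 * a multiple of any power-of-two ring size. */
unsigned int ring_prev(unsigned int slot)
{
    return (slot - 1u) % RING_SIZE;
}

/* Slot that receives the last page fragment of a packet queued at the
 * free-running counter next_tx. */
unsigned int ring_last_slot(unsigned int next_tx, unsigned int fragments)
{
    return (next_tx + fragments) % RING_SIZE;
}

/* Walk backwards over the fragments, as the reverse-order loop does, and
 * return the slot that holds the linear (head) part of the packet. */
unsigned int ring_head_slot(unsigned int next_tx, unsigned int fragments)
{
    unsigned int nr = ring_last_slot(next_tx, fragments);
    unsigned int i;

    for (i = fragments; i >= 1; i--)
        nr = ring_prev(nr);
    return nr;
}

/* Self-check: a packet with 3 fragments queued at counter 62 occupies
 * slots 62, 63, 0 and 1, so the walk crosses the wrap point. */
void ring_selfcheck(void)
{
    assert(ring_prev(0u) == RING_SIZE - 1u);
    assert(ring_last_slot(62u, 3u) == 1u);
    assert(ring_head_slot(62u, 3u) == 62u);
}
```

With a ring size that is not a power of two, the wrapped value of (0 - 1) % size would in general not be size - 1, so the same expression would need an explicit wrap check.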
Thanks, Ayaz Signed-off-by: Ayaz Abdulla Signed-off-By: Manfred Spraul --------------030405010209050508090501 Content-Type: text/plain; name="patch-forcedeth-044-sg-segmentation" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="patch-forcedeth-044-sg-segmentation" --- orig-2.6/drivers/net/forcedeth.c 2005-09-06 11:54:41.000000000 -0700 +++ 2.6/drivers/net/forcedeth.c 2005-09-06 13:52:50.000000000 -0700 @@ -96,6 +96,7 @@ * 0.42: 06 Aug 2005: Fix lack of link speed initialization * in the second (and later) nv_open call * 0.43: 10 Aug 2005: Add support for tx checksum. + * 0.44: 20 Aug 2005: Add support for scatter gather and segmentation. * * Known bugs: * We suspect that on some hardware no TX done interrupts are generated. @@ -107,7 +108,7 @@ * DEV_NEED_TIMERIRQ will not harm you on sane hardware, only generating a few * superfluous timer interrupts from the nic. */ -#define FORCEDETH_VERSION "0.43" +#define FORCEDETH_VERSION "0.44" #define DRV_NAME "forcedeth" #include @@ -340,6 +341,8 @@ /* error and valid are the same for both */ #define NV_TX2_ERROR (1<<30) #define NV_TX2_VALID (1<<31) +#define NV_TX2_TSO (1<<28) +#define NV_TX2_TSO_SHIFT 14 #define NV_TX2_CHECKSUM_L3 (1<<27) #define NV_TX2_CHECKSUM_L4 (1<<26) @@ -901,11 +904,13 @@ int i; np->next_tx = np->nic_tx = 0; - for (i = 0; i < TX_RING; i++) + for (i = 0; i < TX_RING; i++) { if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) np->tx_ring.orig[i].FlagLen = 0; else np->tx_ring.ex[i].FlagLen = 0; + np->tx_skbuff[i] = NULL; + } } static int nv_init_ring(struct net_device *dev) @@ -915,21 +920,44 @@ return nv_alloc_rx(dev); } +static void nv_release_txskb(struct net_device *dev, unsigned int skbnr) +{ + struct fe_priv *np = get_nvpriv(dev); + struct sk_buff *skb = np->tx_skbuff[skbnr];; + unsigned int j, entry, fragments; + + dprintk(KERN_INFO "%s: nv_release_txskb for skbnr %d, skb %p\n", + dev->name, skbnr, np->tx_skbuff[skbnr]); + + entry = skbnr; + if ((fragments = 
skb_shinfo(skb)->nr_frags) != 0) { + for (j = fragments; j >= 1; j--) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[j-1]; + pci_unmap_page(np->pci_dev, np->tx_dma[entry], + frag->size, + PCI_DMA_TODEVICE); + entry = (entry - 1) % TX_RING; + } + } + pci_unmap_single(np->pci_dev, np->tx_dma[entry], + skb->len - skb->data_len, + PCI_DMA_TODEVICE); + dev_kfree_skb_irq(skb); + np->tx_skbuff[skbnr] = NULL; +} + static void nv_drain_tx(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - int i; + unsigned int i; + for (i = 0; i < TX_RING; i++) { if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) np->tx_ring.orig[i].FlagLen = 0; else np->tx_ring.ex[i].FlagLen = 0; if (np->tx_skbuff[i]) { - pci_unmap_single(np->pci_dev, np->tx_dma[i], - np->tx_skbuff[i]->len, - PCI_DMA_TODEVICE); - dev_kfree_skb(np->tx_skbuff[i]); - np->tx_skbuff[i] = NULL; + nv_release_txskb(dev, i); np->stats.tx_dropped++; } } @@ -968,28 +996,69 @@ static int nv_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - int nr = np->next_tx % TX_RING; - u32 tx_checksum = (skb->ip_summed == CHECKSUM_HW ? (NV_TX2_CHECKSUM_L3|NV_TX2_CHECKSUM_L4) : 0); + u32 tx_flags_extra = (np->desc_ver == DESC_VER_1 ? 
NV_TX_LASTPACKET : NV_TX2_LASTPACKET); + unsigned int fragments = skb_shinfo(skb)->nr_frags; + unsigned int nr = (np->next_tx + fragments) % TX_RING; + unsigned int i; + + spin_lock_irq(&np->lock); + wmb(); + + if ((np->next_tx - np->nic_tx + fragments) > TX_LIMIT_STOP) { + spin_unlock_irq(&np->lock); + netif_stop_queue(dev); + return 1; + } np->tx_skbuff[nr] = skb; - np->tx_dma[nr] = pci_map_single(np->pci_dev, skb->data,skb->len, - PCI_DMA_TODEVICE); + + if (fragments) { + dprintk(KERN_DEBUG "%s: nv_start_xmit: buffer contains %d fragments\n", dev->name, fragments); + /* setup descriptors in reverse order */ + for (i = fragments; i >= 1; i--) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i-1]; + np->tx_dma[nr] = pci_map_page(np->pci_dev, frag->page, frag->page_offset, frag->size, + PCI_DMA_TODEVICE); - if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) { + np->tx_ring.orig[nr].PacketBuffer = cpu_to_le32(np->tx_dma[nr]); + np->tx_ring.orig[nr].FlagLen = cpu_to_le32( (frag->size-1) | np->tx_flags | tx_flags_extra); + } else { + np->tx_ring.ex[nr].PacketBufferHigh = cpu_to_le64(np->tx_dma[nr]) >> 32; + np->tx_ring.ex[nr].PacketBufferLow = cpu_to_le64(np->tx_dma[nr]) & 0x0FFFFFFFF; + np->tx_ring.ex[nr].FlagLen = cpu_to_le32( (frag->size-1) | np->tx_flags | tx_flags_extra); + } + + nr = (nr - 1) % TX_RING; + + if (np->desc_ver == DESC_VER_1) + tx_flags_extra &= ~NV_TX_LASTPACKET; + else + tx_flags_extra &= ~NV_TX2_LASTPACKET; + } + } + +#ifdef NETIF_F_TSO + if (skb_shinfo(skb)->tso_size) + tx_flags_extra |= NV_TX2_TSO | (skb_shinfo(skb)->tso_size << NV_TX2_TSO_SHIFT); + else +#endif + tx_flags_extra |= (skb->ip_summed == CHECKSUM_HW ? 
(NV_TX2_CHECKSUM_L3|NV_TX2_CHECKSUM_L4) : 0); + + np->tx_dma[nr] = pci_map_single(np->pci_dev, skb->data, skb->len-skb->data_len, + PCI_DMA_TODEVICE); + + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) { np->tx_ring.orig[nr].PacketBuffer = cpu_to_le32(np->tx_dma[nr]); - else { + np->tx_ring.orig[nr].FlagLen = cpu_to_le32( (skb->len-skb->data_len-1) | np->tx_flags | tx_flags_extra); + } else { np->tx_ring.ex[nr].PacketBufferHigh = cpu_to_le64(np->tx_dma[nr]) >> 32; np->tx_ring.ex[nr].PacketBufferLow = cpu_to_le64(np->tx_dma[nr]) & 0x0FFFFFFFF; - } + np->tx_ring.ex[nr].FlagLen = cpu_to_le32( (skb->len-skb->data_len-1) | np->tx_flags | tx_flags_extra); + } - spin_lock_irq(&np->lock); - wmb(); - if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) - np->tx_ring.orig[nr].FlagLen = cpu_to_le32( (skb->len-1) | np->tx_flags | tx_checksum); - else - np->tx_ring.ex[nr].FlagLen = cpu_to_le32( (skb->len-1) | np->tx_flags | tx_checksum); - dprintk(KERN_DEBUG "%s: nv_start_xmit: packet packet %d queued for transmission\n", - dev->name, np->next_tx); + dprintk(KERN_DEBUG "%s: nv_start_xmit: packet packet %d queued for transmission. 
tx_flags_extra: %x\n", + dev->name, np->next_tx, tx_flags_extra); { int j; for (j=0; j<64; j++) { @@ -1000,11 +1069,9 @@ dprintk("\n"); } - np->next_tx++; + np->next_tx += 1 + fragments; dev->trans_start = jiffies; - if (np->next_tx - np->nic_tx >= TX_LIMIT_STOP) - netif_stop_queue(dev); spin_unlock_irq(&np->lock); writel(NVREG_TXRXCTL_KICK|np->txrxctl_bits, get_hwbase(dev) + NvRegTxRxControl); pci_push(get_hwbase(dev)); @@ -1020,7 +1087,8 @@ { struct fe_priv *np = get_nvpriv(dev); u32 Flags; - int i; + unsigned int i; + struct sk_buff *skb; while (np->nic_tx != np->next_tx) { i = np->nic_tx % TX_RING; @@ -1035,35 +1103,38 @@ if (Flags & NV_TX_VALID) break; if (np->desc_ver == DESC_VER_1) { - if (Flags & (NV_TX_RETRYERROR|NV_TX_CARRIERLOST|NV_TX_LATECOLLISION| - NV_TX_UNDERFLOW|NV_TX_ERROR)) { - if (Flags & NV_TX_UNDERFLOW) - np->stats.tx_fifo_errors++; - if (Flags & NV_TX_CARRIERLOST) - np->stats.tx_carrier_errors++; - np->stats.tx_errors++; - } else { - np->stats.tx_packets++; - np->stats.tx_bytes += np->tx_skbuff[i]->len; + if (Flags & NV_TX_LASTPACKET) { + skb = np->tx_skbuff[i]; + if (Flags & (NV_TX_RETRYERROR|NV_TX_CARRIERLOST|NV_TX_LATECOLLISION| + NV_TX_UNDERFLOW|NV_TX_ERROR)) { + if (Flags & NV_TX_UNDERFLOW) + np->stats.tx_fifo_errors++; + if (Flags & NV_TX_CARRIERLOST) + np->stats.tx_carrier_errors++; + np->stats.tx_errors++; + } else { + np->stats.tx_packets++; + np->stats.tx_bytes += skb->len; + } + nv_release_txskb(dev, i); } } else { - if (Flags & (NV_TX2_RETRYERROR|NV_TX2_CARRIERLOST|NV_TX2_LATECOLLISION| - NV_TX2_UNDERFLOW|NV_TX2_ERROR)) { - if (Flags & NV_TX2_UNDERFLOW) - np->stats.tx_fifo_errors++; - if (Flags & NV_TX2_CARRIERLOST) - np->stats.tx_carrier_errors++; - np->stats.tx_errors++; - } else { - np->stats.tx_packets++; - np->stats.tx_bytes += np->tx_skbuff[i]->len; + if (Flags & NV_TX2_LASTPACKET) { + skb = np->tx_skbuff[i]; + if (Flags & (NV_TX2_RETRYERROR|NV_TX2_CARRIERLOST|NV_TX2_LATECOLLISION| + NV_TX2_UNDERFLOW|NV_TX2_ERROR)) { + if 
(Flags & NV_TX2_UNDERFLOW) + np->stats.tx_fifo_errors++; + if (Flags & NV_TX2_CARRIERLOST) + np->stats.tx_carrier_errors++; + np->stats.tx_errors++; + } else { + np->stats.tx_packets++; + np->stats.tx_bytes += skb->len; + } + nv_release_txskb(dev, i); } } - pci_unmap_single(np->pci_dev, np->tx_dma[i], - np->tx_skbuff[i]->len, - PCI_DMA_TODEVICE); - dev_kfree_skb_irq(np->tx_skbuff[i]); - np->tx_skbuff[i] = NULL; np->nic_tx++; } if (np->next_tx - np->nic_tx < TX_LIMIT_START) @@ -2322,6 +2393,8 @@ if (pci_set_dma_mask(pci_dev, 0x0000007fffffffffULL)) { printk(KERN_INFO "forcedeth: 64-bit DMA failed, using 32-bit addressing for device %s.\n", pci_name(pci_dev)); + } else { + dev->features |= NETIF_F_HIGHDMA; } np->txrxctl_bits = NVREG_TXRXCTL_DESC_3; } else if (id->driver_data & DEV_HAS_LARGEDESC) { @@ -2340,8 +2413,11 @@ if (id->driver_data & DEV_HAS_CHECKSUM) { np->txrxctl_bits |= NVREG_TXRXCTL_RXCHECK; - dev->features |= NETIF_F_HW_CSUM; - } + dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG; +#ifdef NETIF_F_TSO + dev->features |= NETIF_F_TSO; +#endif + } err = -ENOMEM; np->base = ioremap(addr, NV_PCI_REGSZ); @@ -2420,9 +2496,9 @@ np->wolenabled = 0; if (np->desc_ver == DESC_VER_1) { - np->tx_flags = NV_TX_LASTPACKET|NV_TX_VALID; + np->tx_flags = NV_TX_VALID; } else { - np->tx_flags = NV_TX2_LASTPACKET|NV_TX2_VALID; + np->tx_flags = NV_TX2_VALID; } np->irqmask = NVREG_IRQMASK_WANTED; if (id->driver_data & DEV_NEED_TIMERIRQ) --------------030405010209050508090501-- From manfred@colorfullife.com Sun Sep 18 07:04:19 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 18 Sep 2005 07:04:27 -0700 (PDT) Received: from dbl.q-ag.de (dbl.q-ag.de [213.172.117.3]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8IE4HiL012737 for ; Sun, 18 Sep 2005 07:04:19 -0700 Received: from [127.0.0.2] (dbl [127.0.0.1]) by dbl.q-ag.de (8.13.3/8.13.3/Debian-6) with ESMTP id j8IE8Il1008525; Sun, 18 Sep 2005 16:08:19 +0200 Message-ID: <432D7330.6090001@colorfullife.com> Date: 
Sun, 18 Sep 2005 16:01:20 +0200 From: Manfred Spraul User-Agent: Mozilla/5.0 (X11; U; Linux i686; fr-FR; rv:1.7.10) Gecko/20050909 Fedora/1.7.10-1.5.2 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Jeff Garzik CC: Netdev , Ayaz Abdulla Subject: [PATCH 1/2] forcedeth: Add hardware tx checksum support Content-Type: multipart/mixed; boundary="------------090809010604010806080805" X-archive-position: 3640 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: manfred@colorfullife.com Precedence: bulk X-list: netdev Content-Length: 10016 Lines: 247 This is a multi-part message in MIME format. --------------090809010604010806080805 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Recent forcedeth nics support checksum offloading for tx. The attached patch, written by Ayaz Abdulla, adds the support to the driver. It also cleans up the handling of the three dma ring entry formats that are supported by the driver. Signed-off-By: Manfred Spraul --------------090809010604010806080805 Content-Type: text/plain; name="patch-forcedeth-043-txchecksum" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="patch-forcedeth-043-txchecksum" --- orig-2.6/drivers/net/forcedeth.c 2005-09-06 11:46:05.000000000 -0700 +++ 2.6/drivers/net/forcedeth.c 2005-09-06 11:45:15.000000000 -0700 @@ -95,6 +95,7 @@ * of nv_remove * 0.42: 06 Aug 2005: Fix lack of link speed initialization * in the second (and later) nv_open call + * 0.43: 10 Aug 2005: Add support for tx checksum. * * Known bugs: * We suspect that on some hardware no TX done interrupts are generated. @@ -106,7 +107,7 @@ * DEV_NEED_TIMERIRQ will not harm you on sane hardware, only generating a few * superfluous timer interrupts from the nic. 
*/ -#define FORCEDETH_VERSION "0.41" +#define FORCEDETH_VERSION "0.43" #define DRV_NAME "forcedeth" #include @@ -145,6 +146,7 @@ #define DEV_NEED_LINKTIMER 0x0002 /* poll link settings. Relies on the timer irq */ #define DEV_HAS_LARGEDESC 0x0004 /* device supports jumbo frames and needs packet format 2 */ #define DEV_HAS_HIGH_DMA 0x0008 /* device supports 64bit dma */ +#define DEV_HAS_CHECKSUM 0x0010 /* device supports tx and rx checksum offloads */ enum { NvRegIrqStatus = 0x000, @@ -241,6 +243,9 @@ #define NVREG_TXRXCTL_IDLE 0x0008 #define NVREG_TXRXCTL_RESET 0x0010 #define NVREG_TXRXCTL_RXCHECK 0x0400 +#define NVREG_TXRXCTL_DESC_1 0 +#define NVREG_TXRXCTL_DESC_2 0x02100 +#define NVREG_TXRXCTL_DESC_3 0x02200 NvRegMIIStatus = 0x180, #define NVREG_MIISTAT_ERROR 0x0001 #define NVREG_MIISTAT_LINKCHANGE 0x0008 @@ -335,6 +340,8 @@ /* error and valid are the same for both */ #define NV_TX2_ERROR (1<<30) #define NV_TX2_VALID (1<<31) +#define NV_TX2_CHECKSUM_L3 (1<<27) +#define NV_TX2_CHECKSUM_L4 (1<<26) #define NV_RX_DESCRIPTORVALID (1<<16) #define NV_RX_MISSEDFRAME (1<<17) @@ -417,14 +424,14 @@ /* * desc_ver values: - * This field has two purposes: - * - Newer nics uses a different ring layout. The layout is selected by - * comparing np->desc_ver with DESC_VER_xy. - * - It contains bits that are forced on when writing to NvRegTxRxControl. + * The nic supports three different descriptor types: + * - DESC_VER_1: Original + * - DESC_VER_2: support for jumbo frames. + * - DESC_VER_3: 64-bit format. 
*/ -#define DESC_VER_1 0x0 -#define DESC_VER_2 (0x02100|NVREG_TXRXCTL_RXCHECK) -#define DESC_VER_3 (0x02200|NVREG_TXRXCTL_RXCHECK) +#define DESC_VER_1 1 +#define DESC_VER_2 2 +#define DESC_VER_3 3 /* PHY defines */ #define PHY_OUI_MARVELL 0x5043 @@ -491,6 +498,7 @@ u32 orig_mac[2]; u32 irqmask; u32 desc_ver; + u32 txrxctl_bits; void __iomem *base; @@ -786,10 +794,10 @@ u8 __iomem *base = get_hwbase(dev); dprintk(KERN_DEBUG "%s: nv_txrx_reset\n", dev->name); - writel(NVREG_TXRXCTL_BIT2 | NVREG_TXRXCTL_RESET | np->desc_ver, base + NvRegTxRxControl); + writel(NVREG_TXRXCTL_BIT2 | NVREG_TXRXCTL_RESET | np->txrxctl_bits, base + NvRegTxRxControl); pci_push(base); udelay(NV_TXRX_RESET_DELAY); - writel(NVREG_TXRXCTL_BIT2 | np->desc_ver, base + NvRegTxRxControl); + writel(NVREG_TXRXCTL_BIT2 | np->txrxctl_bits, base + NvRegTxRxControl); pci_push(base); } @@ -961,6 +969,7 @@ { struct fe_priv *np = get_nvpriv(dev); int nr = np->next_tx % TX_RING; + u32 tx_checksum = (skb->ip_summed == CHECKSUM_HW ? 
(NV_TX2_CHECKSUM_L3|NV_TX2_CHECKSUM_L4) : 0); np->tx_skbuff[nr] = skb; np->tx_dma[nr] = pci_map_single(np->pci_dev, skb->data,skb->len, @@ -976,10 +985,10 @@ spin_lock_irq(&np->lock); wmb(); if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) - np->tx_ring.orig[nr].FlagLen = cpu_to_le32( (skb->len-1) | np->tx_flags ); + np->tx_ring.orig[nr].FlagLen = cpu_to_le32( (skb->len-1) | np->tx_flags | tx_checksum); else - np->tx_ring.ex[nr].FlagLen = cpu_to_le32( (skb->len-1) | np->tx_flags ); - dprintk(KERN_DEBUG "%s: nv_start_xmit: packet packet %d queued for transmission.\n", + np->tx_ring.ex[nr].FlagLen = cpu_to_le32( (skb->len-1) | np->tx_flags | tx_checksum); + dprintk(KERN_DEBUG "%s: nv_start_xmit: packet packet %d queued for transmission\n", dev->name, np->next_tx); { int j; @@ -997,7 +1006,7 @@ if (np->next_tx - np->nic_tx >= TX_LIMIT_STOP) netif_stop_queue(dev); spin_unlock_irq(&np->lock); - writel(NVREG_TXRXCTL_KICK|np->desc_ver, get_hwbase(dev) + NvRegTxRxControl); + writel(NVREG_TXRXCTL_KICK|np->txrxctl_bits, get_hwbase(dev) + NvRegTxRxControl); pci_push(get_hwbase(dev)); return 0; } @@ -1408,7 +1417,7 @@ writel( ((RX_RING-1) << NVREG_RINGSZ_RXSHIFT) + ((TX_RING-1) << NVREG_RINGSZ_TXSHIFT), base + NvRegRingSizes); pci_push(base); - writel(NVREG_TXRXCTL_KICK|np->desc_ver, get_hwbase(dev) + NvRegTxRxControl); + writel(NVREG_TXRXCTL_KICK|np->txrxctl_bits, get_hwbase(dev) + NvRegTxRxControl); pci_push(base); /* restart rx engine */ @@ -2114,9 +2123,9 @@ /* 5) continue setup */ writel(np->linkspeed, base + NvRegLinkSpeed); writel(NVREG_UNKSETUP3_VAL1, base + NvRegUnknownSetupReg3); - writel(np->desc_ver, base + NvRegTxRxControl); + writel(np->txrxctl_bits, base + NvRegTxRxControl); pci_push(base); - writel(NVREG_TXRXCTL_BIT1|np->desc_ver, base + NvRegTxRxControl); + writel(NVREG_TXRXCTL_BIT1|np->txrxctl_bits, base + NvRegTxRxControl); reg_delay(dev, NvRegUnknownSetupReg5, NVREG_UNKSETUP5_BIT31, NVREG_UNKSETUP5_BIT31, NV_SETUP5_DELAY, NV_SETUP5_DELAYMAX, 
KERN_INFO "open: SetupReg5, Bit 31 remained off\n"); @@ -2314,18 +2323,26 @@ printk(KERN_INFO "forcedeth: 64-bit DMA failed, using 32-bit addressing for device %s.\n", pci_name(pci_dev)); } + np->txrxctl_bits = NVREG_TXRXCTL_DESC_3; } else if (id->driver_data & DEV_HAS_LARGEDESC) { /* packet format 2: supports jumbo frames */ np->desc_ver = DESC_VER_2; + np->txrxctl_bits = NVREG_TXRXCTL_DESC_2; } else { /* original packet format */ np->desc_ver = DESC_VER_1; + np->txrxctl_bits = NVREG_TXRXCTL_DESC_1; } np->pkt_limit = NV_PKTLIMIT_1; if (id->driver_data & DEV_HAS_LARGEDESC) np->pkt_limit = NV_PKTLIMIT_2; + if (id->driver_data & DEV_HAS_CHECKSUM) { + np->txrxctl_bits |= NVREG_TXRXCTL_RXCHECK; + dev->features |= NETIF_F_HW_CSUM; + } + err = -ENOMEM; np->base = ioremap(addr, NV_PCI_REGSZ); if (!np->base) @@ -2525,35 +2542,35 @@ }, { /* nForce3 Ethernet Controller */ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_4), - .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC, + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM, }, { /* nForce3 Ethernet Controller */ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_5), - .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC, + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM, }, { /* nForce3 Ethernet Controller */ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_6), - .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC, + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM, }, { /* nForce3 Ethernet Controller */ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_7), - .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC, + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM, }, { /* CK804 Ethernet Controller */ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 
PCI_DEVICE_ID_NVIDIA_NVENET_8), - .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA, }, { /* CK804 Ethernet Controller */ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_9), - .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA, }, { /* MCP04 Ethernet Controller */ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_10), - .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA, }, { /* MCP04 Ethernet Controller */ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_11), - .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA, }, { /* MCP51 Ethernet Controller */ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_12), @@ -2565,11 +2582,11 @@ }, { /* MCP55 Ethernet Controller */ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_14), - .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA, }, { /* MCP55 Ethernet Controller */ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_15), - .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_CHECKSUM|DEV_HAS_HIGH_DMA, }, {0,}, }; --------------090809010604010806080805-- From manfred@colorfullife.com Sun Sep 18 07:20:55 2005 Received: with ECARTIS (v1.0.0; list 
netdev); Sun, 18 Sep 2005 07:21:02 -0700 (PDT) Received: from dbl.q-ag.de (dbl.q-ag.de [213.172.117.3]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8IEKsiL014774 for ; Sun, 18 Sep 2005 07:20:55 -0700 Received: from [127.0.0.2] (dbl [127.0.0.1]) by dbl.q-ag.de (8.13.3/8.13.3/Debian-6) with ESMTP id j8IEP4rb008578; Sun, 18 Sep 2005 16:25:05 +0200 Message-ID: <432D771D.7050107@colorfullife.com> Date: Sun, 18 Sep 2005 16:18:05 +0200 From: Manfred Spraul User-Agent: Mozilla/5.0 (X11; U; Linux i686; fr-FR; rv:1.7.10) Gecko/20050909 Fedora/1.7.10-1.5.2 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Jeff Garzik CC: Ayaz Abdulla , Netdev Subject: [PATCH 3/2] forcedeth: Compile fix forcedeth 0.44 Content-Type: multipart/mixed; boundary="------------040706090009010407050300" X-archive-position: 3642 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: manfred@colorfullife.com Precedence: bulk X-list: netdev Content-Length: 1179 Lines: 33 This is a multi-part message in MIME format. --------------040706090009010407050300 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Hi, forcedeth-0.44 contains a spurious ; in nv_release_txskb. gcc-4 compiles it with a warning; older compilers might reject it. The attached one-liner fixes that, sorry. 
Signed-off-By: Manfred Spraul --------------040706090009010407050300 Content-Type: text/plain; name="patch-forcedeth-044a-compilefix" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="patch-forcedeth-044a-compilefix" --- 2.6/drivers/net/forcedeth.c 2005-09-18 16:12:10.000000000 +0200 +++ build-2.6/drivers/net/forcedeth.c 2005-09-18 16:14:19.000000000 +0200 @@ -923,7 +923,7 @@ static int nv_init_ring(struct net_devic static void nv_release_txskb(struct net_device *dev, unsigned int skbnr) { struct fe_priv *np = get_nvpriv(dev); - struct sk_buff *skb = np->tx_skbuff[skbnr];; + struct sk_buff *skb = np->tx_skbuff[skbnr]; unsigned int j, entry, fragments; dprintk(KERN_INFO "%s: nv_release_txskb for skbnr %d, skb %p\n", --------------040706090009010407050300-- From manfred@colorfullife.com Sun Sep 18 07:40:01 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 18 Sep 2005 07:40:07 -0700 (PDT) Received: from dbl.q-ag.de (dbl.q-ag.de [213.172.117.3]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8IEe0iL016242 for ; Sun, 18 Sep 2005 07:40:01 -0700 Received: from [127.0.0.2] (dbl [127.0.0.1]) by dbl.q-ag.de (8.13.3/8.13.3/Debian-6) with ESMTP id j8IEiBej008632; Sun, 18 Sep 2005 16:44:11 +0200 Message-ID: <432D7B98.8050307@colorfullife.com> Date: Sun, 18 Sep 2005 16:37:12 +0200 From: Manfred Spraul User-Agent: Mozilla/5.0 (X11; U; Linux i686; fr-FR; rv:1.7.10) Gecko/20050909 Fedora/1.7.10-1.5.2 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Netdev CC: Ayaz Abdulla Subject: [PATCH,CFT] forcedeth: Remove superflous rx engine stop/start cycles. Content-Type: multipart/mixed; boundary="------------050404040603000109000702" X-archive-position: 3643 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: manfred@colorfullife.com Precedence: bulk X-list: netdev Content-Length: 3664 Lines: 102 This is a multi-part message in MIME format. 
--------------050404040603000109000702 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Hi all, Ayaz noticed that forcedeth stops and restarts the rx engine every 3 seconds (link timeout). The attached patch fixes that. It also contains a larger whitespace cleanup: I've replaced a few spaces in the comments with tabs. Please test it. -- Manfred --------------050404040603000109000702 Content-Type: text/plain; name="patch-forcedeth-045-start_stop_rx" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="patch-forcedeth-045-start_stop_rx" --- 2.6/drivers/net/forcedeth.c 2005-09-18 16:21:44.000000000 +0200 +++ build-2.6/drivers/net/forcedeth.c 2005-09-18 16:22:42.000000000 +0200 @@ -80,7 +80,7 @@ * into nv_close, otherwise reenabling for wol can * cause DMA to kfree'd memory. * 0.31: 14 Nov 2004: ethtool support for getting/setting link - * capabilities. + * capabilities. * 0.32: 16 Apr 2005: RX_ERROR4 handling added. * 0.33: 16 May 2005: Support for MCP51 added. * 0.34: 18 Jun 2005: Add DEV_NEED_LINKTIMER to all nForce nics. @@ -89,14 +89,15 @@ * 0.37: 10 Jul 2005: Additional ethtool support, cleanup of pci id list * 0.38: 16 Jul 2005: tx irq rewrite: Use global flags instead of * per-packet flags. - * 0.39: 18 Jul 2005: Add 64bit descriptor support. - * 0.40: 19 Jul 2005: Add support for mac address change. - * 0.41: 30 Jul 2005: Write back original MAC in nv_close instead + * 0.39: 18 Jul 2005: Add 64bit descriptor support. + * 0.40: 19 Jul 2005: Add support for mac address change. + * 0.41: 30 Jul 2005: Write back original MAC in nv_close instead * of nv_remove - * 0.42: 06 Aug 2005: Fix lack of link speed initialization + * 0.42: 06 Aug 2005: Fix lack of link speed initialization * in the second (and later) nv_open call - * 0.43: 10 Aug 2005: Add support for tx checksum. - * 0.44: 20 Aug 2005: Add support for scatter gather and segmentation. + * 0.43: 10 Aug 2005: Add support for tx checksum. 
+ * 0.44: 20 Aug 2005: Add support for scatter gather and segmentation. + * 0.45: 18 Sep 2005: Remove nv_stop/start_rx from every link check * * Known bugs: * We suspect that on some hardware no TX done interrupts are generated. @@ -108,7 +109,7 @@ * DEV_NEED_TIMERIRQ will not harm you on sane hardware, only generating a few * superfluous timer interrupts from the nic. */ -#define FORCEDETH_VERSION "0.44" +#define FORCEDETH_VERSION "0.45" #define DRV_NAME "forcedeth" #include @@ -1613,6 +1614,17 @@ spin_unlock_irq(&np->lock); } +/** + * nv_update_linkspeed: Setup the MAC according to the link partner + * @dev: Network device to be configured + * + * The function queries the PHY and checks if there is a link partner. + * If yes, then it sets up the MAC accordingly. Otherwise, the MAC is + * set to 10 MBit HD. + * + * The function returns 0 if there is no link partner and 1 if there is + * a good link partner. + */ static int nv_update_linkspeed(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); @@ -1752,13 +1764,11 @@ static void nv_linkchange(struct net_device *dev) { if (nv_update_linkspeed(dev)) { - if (netif_carrier_ok(dev)) { - nv_stop_rx(dev); - } else { + if (!netif_carrier_ok(dev)) { netif_carrier_on(dev); printk(KERN_INFO "%s: link up.\n", dev->name); + nv_start_rx(dev); } - nv_start_rx(dev); } else { if (netif_carrier_ok(dev)) { netif_carrier_off(dev); --------------050404040603000109000702-- From pavel@ucw.cz Tue Sep 20 13:48:00 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 20 Sep 2005 13:48:06 -0700 (PDT) Received: from amd.ucw.cz (gprs189-60.eurotel.cz [160.218.189.60]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8KKlriL013060 for ; Tue, 20 Sep 2005 13:47:57 -0700 Received: by amd.ucw.cz (Postfix, from userid 8) id 19B068B063; Tue, 20 Sep 2005 15:28:11 +0200 (CEST) Date: Tue, 20 Sep 2005 15:28:11 +0200 From: Pavel Machek To: Andrew Morton , Jeff Garzik , Netdev list Subject: [patch] fix suspend/resume on b44 
Message-ID: <20050920132811.GA4563@elf.ucw.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline X-Warning: Reading this can be dangerous to your mental health. User-Agent: Mutt/1.5.9i X-archive-position: 3646 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pavel@ucw.cz Precedence: bulk X-list: netdev Content-Length: 1136 Lines: 40 Fix suspend/resume on b44 by freeing/reacquiring irq. Otherwise it hangs on resume. Signed-off-by: Pavel Machek --- commit 7bdc8fc378f053bd4eb4210beb1d494485318512 tree 6e5679697b11eb70b73ff5275aafe7c34a90ffef parent 17cd36a6d0fc36b61fa558cade6a98a3e99a6992 author Tue, 20 Sep 2005 15:26:37 +0200 committer Tue, 20 Sep 2005 15:26:37 +0200 drivers/net/b44.c | 5 +++++ 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/drivers/net/b44.c b/drivers/net/b44.c --- a/drivers/net/b44.c +++ b/drivers/net/b44.c @@ -1930,6 +1930,8 @@ static int b44_suspend(struct pci_dev *p b44_free_rings(bp); spin_unlock_irq(&bp->lock); + + free_irq(dev->irq, dev); pci_disable_device(pdev); return 0; } @@ -1946,6 +1948,9 @@ static int b44_resume(struct pci_dev *pd if (!netif_running(dev)) return 0; + if (request_irq(dev->irq, b44_interrupt, SA_SHIRQ, dev->name, dev)) + printk(KERN_ERR PFX "%s: request_irq failed\n", dev->name); + spin_lock_irq(&bp->lock); b44_init_rings(bp); -- if you have sharp zaurus hardware you don't need... 
you know my address From akpm@osdl.org Tue Sep 20 16:29:15 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 20 Sep 2005 16:29:22 -0700 (PDT) Received: from smtp.osdl.org (smtp.osdl.org [65.172.181.4]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8KNTFiL026145 for ; Tue, 20 Sep 2005 16:29:15 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id j8KNQTBo021674 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Tue, 20 Sep 2005 16:26:30 -0700 Received: from localhost.localdomain (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id j8KNQTto012984; Tue, 20 Sep 2005 16:26:29 -0700 Date: Tue, 20 Sep 2005 16:26:35 -0700 From: Andrew Morton To: Pavel Machek Cc: jgarzik@pobox.com, netdev@oss.sgi.com Subject: Re: [patch] fix suspend/resume on b44 Message-Id: <20050920162635.565e4b46.akpm@osdl.org> In-Reply-To: <20050920132811.GA4563@elf.ucw.cz> References: <20050920132811.GA4563@elf.ucw.cz> X-Mailer: Sylpheed version 1.0.0 (GTK+ 1.2.10; i386-vine-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.117 $ X-Scanned-By: MIMEDefang 2.36 X-archive-position: 3647 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: akpm@osdl.org Precedence: bulk X-list: netdev Content-Length: 957 Lines: 36 Pavel Machek wrote: > > Fix suspend/resume on b44 by freeing/reacquiring irq. Otherwise it > hangs on resume. > > ... 
> diff --git a/drivers/net/b44.c b/drivers/net/b44.c > --- a/drivers/net/b44.c > +++ b/drivers/net/b44.c > @@ -1930,6 +1930,8 @@ static int b44_suspend(struct pci_dev *p > b44_free_rings(bp); > > spin_unlock_irq(&bp->lock); > + > + free_irq(dev->irq, dev); > pci_disable_device(pdev); > return 0; > } > @@ -1946,6 +1948,9 @@ static int b44_resume(struct pci_dev *pd > if (!netif_running(dev)) > return 0; > > + if (request_irq(dev->irq, b44_interrupt, SA_SHIRQ, dev->name, dev)) > + printk(KERN_ERR PFX "%s: request_irq failed\n", dev->name); > + > spin_lock_irq(&bp->lock); > > b44_init_rings(bp); > Why does it hang on suspend/resume? This came up a while back and iirc we decided that adding free_irq() to every ->suspend() handler in the world was the wrong thing to do. Do I misremember? From pavel@atrey.karlin.mff.cuni.cz Wed Sep 21 03:23:43 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 21 Sep 2005 03:23:48 -0700 (PDT) Received: from atrey.karlin.mff.cuni.cz (atrey.karlin.mff.cuni.cz [195.113.31.123]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8LANgiL030978 for ; Wed, 21 Sep 2005 03:23:43 -0700 Received: by atrey.karlin.mff.cuni.cz (Postfix, from userid 512) id 10E6E4B40E4; Wed, 21 Sep 2005 12:20:54 +0200 (CEST) Date: Wed, 21 Sep 2005 12:20:54 +0200 From: Pavel Machek To: Andrew Morton Cc: jgarzik@pobox.com, netdev@oss.sgi.com Subject: Re: [patch] fix suspend/resume on b44 Message-ID: <20050921102054.GE25297@atrey.karlin.mff.cuni.cz> References: <20050920132811.GA4563@elf.ucw.cz> <20050920162635.565e4b46.akpm@osdl.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050920162635.565e4b46.akpm@osdl.org> User-Agent: Mutt/1.5.9i X-archive-position: 3649 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pavel@suse.cz Precedence: bulk X-list: netdev Content-Length: 1120 Lines: 39 Hi! 
> > diff --git a/drivers/net/b44.c b/drivers/net/b44.c > > --- a/drivers/net/b44.c > > +++ b/drivers/net/b44.c > > @@ -1930,6 +1930,8 @@ static int b44_suspend(struct pci_dev *p > > b44_free_rings(bp); > > > > spin_unlock_irq(&bp->lock); > > + > > + free_irq(dev->irq, dev); > > pci_disable_device(pdev); > > return 0; > > } > > @@ -1946,6 +1948,9 @@ static int b44_resume(struct pci_dev *pd > > if (!netif_running(dev)) > > return 0; > > > > + if (request_irq(dev->irq, b44_interrupt, SA_SHIRQ, dev->name, dev)) > > + printk(KERN_ERR PFX "%s: request_irq failed\n", dev->name); > > + > > spin_lock_irq(&bp->lock); > > > > b44_init_rings(bp); > > > > Why does it hang on suspend/resume? > > This came up a while back and iirc we decided that adding free_irq() to > every ->suspend() handler in the world was the wrong thing to do. Do I > misremember? No, you remember right, but b44 needed that free_irq/request_irq even because those ACPI changes. I'm not exactly sure why, something went very wrong otherwise. Pavel -- Boycott Kodak -- for their patent abuse against Java. 
From akpm@osdl.org Wed Sep 21 03:40:17 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 21 Sep 2005 03:40:25 -0700 (PDT) Received: from smtp.osdl.org (smtp.osdl.org [65.172.181.4]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8LAeGiL000820 for ; Wed, 21 Sep 2005 03:40:16 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id j8LAbUBo016645 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Wed, 21 Sep 2005 03:37:31 -0700 Received: from bix (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id j8LAbUlW002164; Wed, 21 Sep 2005 03:37:30 -0700 Date: Wed, 21 Sep 2005 03:36:53 -0700 From: Andrew Morton To: Pavel Machek Cc: jgarzik@pobox.com, netdev@oss.sgi.com Subject: Re: [patch] fix suspend/resume on b44 Message-Id: <20050921033653.05c448df.akpm@osdl.org> In-Reply-To: <20050921102054.GE25297@atrey.karlin.mff.cuni.cz> References: <20050920132811.GA4563@elf.ucw.cz> <20050920162635.565e4b46.akpm@osdl.org> <20050921102054.GE25297@atrey.karlin.mff.cuni.cz> X-Mailer: Sylpheed version 1.0.4 (GTK+ 1.2.10; i386-redhat-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.117 $ X-Scanned-By: MIMEDefang 2.36 X-archive-position: 3650 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: akpm@osdl.org Precedence: bulk X-list: netdev Content-Length: 1354 Lines: 43 Pavel Machek wrote: > > Hi! 
> > > > diff --git a/drivers/net/b44.c b/drivers/net/b44.c > > > --- a/drivers/net/b44.c > > > +++ b/drivers/net/b44.c > > > @@ -1930,6 +1930,8 @@ static int b44_suspend(struct pci_dev *p > > > b44_free_rings(bp); > > > > > > spin_unlock_irq(&bp->lock); > > > + > > > + free_irq(dev->irq, dev); > > > pci_disable_device(pdev); > > > return 0; > > > } > > > @@ -1946,6 +1948,9 @@ static int b44_resume(struct pci_dev *pd > > > if (!netif_running(dev)) > > > return 0; > > > > > > + if (request_irq(dev->irq, b44_interrupt, SA_SHIRQ, dev->name, dev)) > > > + printk(KERN_ERR PFX "%s: request_irq failed\n", dev->name); > > > + > > > spin_lock_irq(&bp->lock); > > > > > > b44_init_rings(bp); > > > > > > > Why does it hang on suspend/resume? > > > > This came up a while back and iirc we decided that adding free_irq() to > > every ->suspend() handler in the world was the wrong thing to do. Do I > > misremember? > > No, you remember right, but b44 needed that free_irq/request_irq even > because those ACPI changes. I'm not exactly sure why, something went > very wrong otherwise. Well I guess we should work out what went wrong ;) What are the symptoms? Screaming interrupt? Can't immediately see why. Does the screaming interrupt detector trigger and disable the IRQ line? 
From pavel@ucw.cz Wed Sep 21 14:16:33 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 21 Sep 2005 14:16:41 -0700 (PDT) Received: from amd.ucw.cz (gprs189-60.eurotel.cz [160.218.189.60]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8LLGOiL011175 for ; Wed, 21 Sep 2005 14:16:29 -0700 Received: by amd.ucw.cz (Postfix, from userid 8) id 4BF448B106; Wed, 21 Sep 2005 23:13:32 +0200 (CEST) Date: Wed, 21 Sep 2005 23:13:32 +0200 From: Pavel Machek To: Andrew Morton Cc: jgarzik@pobox.com, netdev@oss.sgi.com, hmacht@suse.de Subject: Re: [patch] fix suspend/resume on b44 Message-ID: <20050921211332.GA2194@elf.ucw.cz> References: <20050920132811.GA4563@elf.ucw.cz> <20050920162635.565e4b46.akpm@osdl.org> <20050921102054.GE25297@atrey.karlin.mff.cuni.cz> <20050921033653.05c448df.akpm@osdl.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050921033653.05c448df.akpm@osdl.org> X-Warning: Reading this can be dangerous to your mental health. User-Agent: Mutt/1.5.9i X-archive-position: 3651 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pavel@ucw.cz Precedence: bulk X-list: netdev Content-Length: 1943 Lines: 61 Hi! 
> > > > diff --git a/drivers/net/b44.c b/drivers/net/b44.c > > > > --- a/drivers/net/b44.c > > > > +++ b/drivers/net/b44.c > > > > @@ -1930,6 +1930,8 @@ static int b44_suspend(struct pci_dev *p > > > > b44_free_rings(bp); > > > > > > > > spin_unlock_irq(&bp->lock); > > > > + > > > > + free_irq(dev->irq, dev); > > > > pci_disable_device(pdev); > > > > return 0; > > > > } > > > > @@ -1946,6 +1948,9 @@ static int b44_resume(struct pci_dev *pd > > > > if (!netif_running(dev)) > > > > return 0; > > > > > > > > + if (request_irq(dev->irq, b44_interrupt, SA_SHIRQ, dev->name, dev)) > > > > + printk(KERN_ERR PFX "%s: request_irq failed\n", dev->name); > > > > + > > > > spin_lock_irq(&bp->lock); > > > > > > > > b44_init_rings(bp); > > > > > > > > > > Why does it hang on suspend/resume? > > > > > > This came up a while back and iirc we decided that adding free_irq() to > > > every ->suspend() handler in the world was the wrong thing to do. Do I > > > misremember? > > > > No, you remember right, but b44 needed that free_irq/request_irq even > > because those ACPI changes. I'm not exactly sure why, something went > > very wrong otherwise. > > Well I guess we should work out what went wrong ;) > > What are the symptoms? Screaming interrupt? Can't immediately see why. > Does the screaming interrupt detetor trigger and disable the IRQ Line? No, it seems like BUG() triggers in b44. https://bugzilla.novell.com/show_bug.cgi?id=116088 is for basically 2.6.13 kernel (but it was in something as old as 2.6.5, too). Setting machine into suspend to disk with loaded module b44. While resuming, kernel oopses and machine freezes after reloading data from swap. Image will be appended. If this cannot be fixed in time, we could add this module to UNLOAD_MODULES_BEFORE_SUSPEND in powersave configuration. Pavel -- if you have sharp zaurus hardware you don't need... 
you know my address From akpm@osdl.org Wed Sep 21 14:26:14 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 21 Sep 2005 14:26:18 -0700 (PDT) Received: from smtp.osdl.org (smtp.osdl.org [65.172.181.4]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8LLQEiL012658 for ; Wed, 21 Sep 2005 14:26:14 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id j8LLNSBo020023 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Wed, 21 Sep 2005 14:23:28 -0700 Received: from bix (shell0.pdx.osdl.net [10.9.0.31]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with SMTP id j8LLNRra030581; Wed, 21 Sep 2005 14:23:27 -0700 Date: Wed, 21 Sep 2005 14:22:50 -0700 From: Andrew Morton To: Pavel Machek Cc: jgarzik@pobox.com, netdev@oss.sgi.com, hmacht@suse.de Subject: Re: [patch] fix suspend/resume on b44 Message-Id: <20050921142250.30319b23.akpm@osdl.org> In-Reply-To: <20050921211332.GA2194@elf.ucw.cz> References: <20050920132811.GA4563@elf.ucw.cz> <20050920162635.565e4b46.akpm@osdl.org> <20050921102054.GE25297@atrey.karlin.mff.cuni.cz> <20050921033653.05c448df.akpm@osdl.org> <20050921211332.GA2194@elf.ucw.cz> X-Mailer: Sylpheed version 1.0.4 (GTK+ 1.2.10; i386-redhat-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.117 $ X-Scanned-By: MIMEDefang 2.36 X-archive-position: 3652 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: akpm@osdl.org Precedence: bulk X-list: netdev Content-Length: 2383 Lines: 73 Pavel Machek wrote: > > Hi! 
> > > > > > diff --git a/drivers/net/b44.c b/drivers/net/b44.c > > > > > --- a/drivers/net/b44.c > > > > > +++ b/drivers/net/b44.c > > > > > @@ -1930,6 +1930,8 @@ static int b44_suspend(struct pci_dev *p > > > > > b44_free_rings(bp); > > > > > > > > > > spin_unlock_irq(&bp->lock); > > > > > + > > > > > + free_irq(dev->irq, dev); > > > > > pci_disable_device(pdev); > > > > > return 0; > > > > > } > > > > > @@ -1946,6 +1948,9 @@ static int b44_resume(struct pci_dev *pd > > > > > if (!netif_running(dev)) > > > > > return 0; > > > > > > > > > > + if (request_irq(dev->irq, b44_interrupt, SA_SHIRQ, dev->name, dev)) > > > > > + printk(KERN_ERR PFX "%s: request_irq failed\n", dev->name); > > > > > + > > > > > spin_lock_irq(&bp->lock); > > > > > > > > > > b44_init_rings(bp); > > > > > > > > > > > > > Why does it hang on suspend/resume? > > > > > > > > This came up a while back and iirc we decided that adding free_irq() to > > > > every ->suspend() handler in the world was the wrong thing to do. Do I > > > > misremember? > > > > > > No, you remember right, but b44 needed that free_irq/request_irq even > > > because those ACPI changes. I'm not exactly sure why, something went > > > very wrong otherwise. > > > > Well I guess we should work out what went wrong ;) > > > > What are the symptoms? Screaming interrupt? Can't immediately see why. > > Does the screaming interrupt detetor trigger and disable the IRQ Line? > > No, it seems like BUG() triggers in > b44. https://bugzilla.novell.com/show_bug.cgi?id=116088 is for > basically 2.6.13 kernel (but it was in something as old as 2.6.5, > too). 
> That's here: static void b44_tx(struct b44 *bp) { u32 cur, cons; cur = br32(bp, B44_DMATX_STAT) & DMATX_STAT_CDMASK; cur /= sizeof(struct dma_desc); /* XXX needs updating when NETIF_F_SG is supported */ for (cons = bp->tx_cons; cons != cur; cons = NEXT_TX(cons)) { struct ring_info *rp = &bp->tx_buffers[cons]; struct sk_buff *skb = rp->skb; if (unlikely(skb == NULL)) BUG(); So I'd assume that the newly-woken driver took an interrupt, decided that a Tx interrupt was pending, then went BUG when it discovered that it hadn't sent anything. Would be good to find out the value of istat in b44_interrupt() and poke maintainers, rather than proffering strange workarounds ;) From pp@ee.oulu.fi Sun Sep 25 04:01:24 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Sep 2005 04:01:26 -0700 (PDT) Received: from ee.oulu.fi (ee.oulu.fi [130.231.61.23]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8PB1NiL012044 for ; Sun, 25 Sep 2005 04:01:24 -0700 Received: from tk28.oulu.fi (tk28 [130.231.48.68]) by ee.oulu.fi (8.13.3/8.13.3) with ESMTP id j8PAwYc3019496 for ; Sun, 25 Sep 2005 13:58:34 +0300 (EEST) Received: (from pp@localhost) by tk28.oulu.fi (8.13.3/8.13.3/Submit) id j8PAwYwm015311 for netdev@oss.sgi.com; Sun, 25 Sep 2005 13:58:34 +0300 (EEST) Date: Sun, 25 Sep 2005 13:58:34 +0300 From: Pekka Pietikainen To: netdev@oss.sgi.com Subject: rwlock recursion on CPU#0, netfilter related? Message-ID: <20050925105834.GA15243@ee.oulu.fi> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline User-Agent: Mutt/1.4.2i X-archive-position: 3656 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pp@ee.oulu.fi Precedence: bulk X-list: netdev Content-Length: 641 Lines: 16 Just to get a wider audience, somewhere between 2.6.13-git4 and current (2.6.14-rc2-git4 is the last one I tested, which seems to have some fixes in this area wrt. 
git3, but problem remains) my x86_64 crashes quite quickly after boot. Using Fedora devel kernels, I can probably whip up a vanilla kernel if the maintainers in this area prefer that. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=167835 and https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=119228 apart from the crashes I get funny ping times on the kernels that break when they're still up (64 bytes from 10.10.9.1: icmp_seq=0 ttl=255 time=4294971590968 ms) From laforge@gnumonks.org Sun Sep 25 06:46:44 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Sep 2005 06:46:54 -0700 (PDT) Received: from ganesha.gnumonks.org (ganesha.gnumonks.org [213.95.27.120]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8PDkhiL002477 for ; Sun, 25 Sep 2005 06:46:44 -0700 Received: from berligate.hmw-consulting.de ([83.236.178.202] helo=sunbeam.hmw-consulting.de) by ganesha.gnumonks.org with esmtpsa (TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50) id 1EJWnC-0005Tk-Em; Sun, 25 Sep 2005 15:43:46 +0200 Received: from laforge by sunbeam.hmw-consulting.de with local (Exim 4.52) id 1EJWnA-0008FJ-BY; Sun, 25 Sep 2005 15:43:44 +0200 Date: Sun, 25 Sep 2005 15:43:44 +0200 From: Harald Welte To: Pekka Pietikainen Cc: netdev@oss.sgi.com Subject: Re: rwlock recursion on CPU#0, netfilter related? 
Message-ID: <20050925134344.GJ731@sunbeam.de.gnumonks.org> References: <20050925105834.GA15243@ee.oulu.fi> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="OOApBOWN8PWHaZPG" Content-Disposition: inline In-Reply-To: <20050925105834.GA15243@ee.oulu.fi> User-Agent: mutt-ng devel-20050619 (Debian) X-archive-position: 3657 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: laforge@gnumonks.org Precedence: bulk X-list: netdev --OOApBOWN8PWHaZPG Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sun, Sep 25, 2005 at 01:58:34PM +0300, Pekka Pietikainen wrote: > Just to get a wider audience, somewhere between 2.6.13-git4 and current > (2.6.14-rc2-git4 is the last one I tested, which seems to have some > fixes in this are wrt. git3, but problem remains) my x86_64 > crashes quite quickly after boot. Using Fedora devel kernels, I can > probably whip up a vanilla kernel if the maintainers in this area > prefer that. > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=167835 Can you please give some more feedback like 1) how does your kernel .config look like? 2) which modules are loaded 3) how does your ruleset look like? 4) most importantly, have you enabled CONFIG_IP_NF_CONNTRACK_EVENTS ? if yes, please disable, it's broken, a fix has been submitted, but I don't know if it has propagated to Linus yet (netdev Message-ID: <20050922143515.GD8917@rama.de.gnumonks.org>) please also try a) only loading iptable_filter (and ip_tables), but no other modules b) only loading ip_conntrack but no other netfilter modules (no nat, no iptables) c) only loading ip_conntrack and iptable_nat (but no rules) this kind of debugging helps to locate where it is. netfilter has grown big ;) Also, I have that Ping time problem on my x86_64 debian unstable (smp). 
But only in 1 out of ten cases on average (when starting ping, ctrl+c, ping, ctrl+c, ...). I've always assumed it's some 64bit problem in "ping" itself. -- - Harald Welte http://gnumonks.org/ ============================================================================ "Privacy in residential applications is a desirable marketing option." (ETSI EN 300 175-7 Ch. A6) --OOApBOWN8PWHaZPG Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFDNqmQXaXGVTD0i/8RAs8gAJ9WFtsxjpA7z1b9H6kDVrFEhMim+gCfbH7s 5lCq4RbuPdJzClT4RWBL3pw= =vbY3 -----END PGP SIGNATURE----- --OOApBOWN8PWHaZPG-- From pp@ee.oulu.fi Sun Sep 25 13:22:38 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Sep 2005 13:22:45 -0700 (PDT) Received: from ee.oulu.fi (ee.oulu.fi [130.231.61.23]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8PKMaiL014041 for ; Sun, 25 Sep 2005 13:22:37 -0700 Received: from tk28.oulu.fi (tk28 [130.231.48.68]) by ee.oulu.fi (8.13.3/8.13.3) with ESMTP id j8PKJjE8023136; Sun, 25 Sep 2005 23:19:45 +0300 (EEST) Received: (from pp@localhost) by tk28.oulu.fi (8.13.3/8.13.3/Submit) id j8PKJjaO021300; Sun, 25 Sep 2005 23:19:45 +0300 (EEST) Date: Sun, 25 Sep 2005 23:19:45 +0300 From: Pekka Pietikainen To: Harald Welte Cc: netdev@oss.sgi.com Subject: Re: rwlock recursion on CPU#0, netfilter related? 
Message-ID: <20050925201945.GA21176@ee.oulu.fi> References: <20050925105834.GA15243@ee.oulu.fi> <20050925134344.GJ731@sunbeam.de.gnumonks.org> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline In-Reply-To: <20050925134344.GJ731@sunbeam.de.gnumonks.org> User-Agent: Mutt/1.4.2i X-archive-position: 3658 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pp@ee.oulu.fi Precedence: bulk X-list: netdev On Sun, Sep 25, 2005 at 03:43:44PM +0200, Harald Welte wrote: > 1) how does your kernel .config look like? http://cvs.fedora.redhat.com/viewcvs/devel/kernel/configs/config-generic?rev=1.60&view=auto http://cvs.fedora.redhat.com/viewcvs/devel/kernel/configs/config-x86_64-generic?rev=1.16&view=auto > 2) which modules are loaded Module Size Used by w83627hf 46569 0 eeprom 17617 0 i2c_sensor 12225 2 w83627hf,eeprom i2c_isa 11329 0 rfcomm 61033 0 l2cap 46145 5 rfcomm bluetooth 73317 4 rfcomm,l2cap ipv6 325889 16 ppp_synctty 21057 0 ppp_async 22465 1 crc_ccitt 10817 1 ppp_async ppp_generic 41953 6 ppp_synctty,ppp_async slhc 16193 1 ppp_generic ip_conntrack_ftp 82177 0 ipt_ULOG 18913 1 ipt_state 10689 18 ip_conntrack 60053 2 ip_conntrack_ftp,ipt_state iptable_filter 11969 1 ip_tables 32193 3 ipt_ULOG,ipt_state,iptable_filter loop 26449 0 video 27977 0 button 16481 0 battery 19657 0 ac 14409 0 ohci1394 46753 0 ieee1394 381273 1 ohci1394 ohci_hcd 33249 0 ehci_hcd 46157 0 parport_pc 40621 0 parport 52557 1 parport_pc i2c_nforce2 16833 0 i2c_core 34241 5 w83627hf,eeprom,i2c_sensor,i2c_isa,i2c_nforce2 shpchp 108009 0 emu10k1_gp 12865 0 gameport 27089 2 emu10k1_gp snd_emu10k1 138629 0 snd_rawmidi 39521 1 snd_emu10k1 snd_util_mem 14401 1 snd_emu10k1 snd_hwdep 20321 1 snd_emu10k1 snd_intel8x0 46273 0 snd_ac97_codec 106757 2 snd_emu10k1,snd_intel8x0 snd_seq_dummy 12869 0 snd_seq_oss 47012 0 snd_seq_midi_event 17473 1 snd_seq_oss snd_seq 74265 5 
snd_seq_dummy,snd_seq_oss,snd_seq_midi_event snd_seq_device 19281 5 snd_emu10k1,snd_rawmidi,snd_seq_dummy,snd_seq_oss,snd_seq snd_pcm_oss 68465 0 snd_mixer_oss 28225 1 snd_pcm_oss snd_pcm 115401 4 snd_emu10k1,snd_intel8x0,snd_ac97_codec,snd_pcm_oss snd_timer 37577 3 snd_emu10k1,snd_seq,snd_pcm snd 75681 12 snd_emu10k1,snd_rawmidi,snd_hwdep,snd_intel8x0,snd_ac97_codec,snd_seq_oss,snd_seq,snd_seq_device,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_timer soundcore 19809 1 snd snd_page_alloc 21713 3 snd_emu10k1,snd_intel8x0,snd_pcm r8169 43209 0 forcedeth 30657 0 floppy 77865 0 dm_snapshot 26369 0 dm_zero 10817 0 dm_mirror 32433 0 ext3 154577 3 jbd 76145 1 ext3 dm_mod 73873 7 dm_snapshot,dm_zero,dm_mirror sata_nv 19141 3 libata 61649 1 sata_nv sd_mod 29121 4 scsi_mod 167801 2 libata,sd_mod > 3) how does your ruleset look like? *filter :INPUT ACCEPT [0:0] :FORWARD ACCEPT [0:0] :OUTPUT ACCEPT [0:0] :RH-Firewall-1-INPUT - [0:0] -A INPUT -j RH-Firewall-1-INPUT -A FORWARD -j RH-Firewall-1-INPUT -A RH-Firewall-1-INPUT -i lo -j ACCEPT -A RH-Firewall-1-INPUT -i eth1 -j ACCEPT -A RH-Firewall-1-INPUT -p icmp --icmp-type echo-request -j ACCEPT -A RH-Firewall-1-INPUT -p esp -j ACCEPT -A RH-Firewall-1-INPUT -p ah -j ACCEPT -A RH-Firewall-1-INPUT -p ipv6 -j ACCEPT -A RH-Firewall-1-INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT -A RH-Firewall-1-INPUT -j ULOG -A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport x -j ACCEPT -A RH-Firewall-1-INPUT -p udp -m state --state NEW -m udp --dport y -j ACCEPT (for a bunch of ports, some with -s sourcenet/24 etc.) -A RH-Firewall-1-INPUT -j DROP COMMIT # Completed on Sun Sep 28 10:37:44 2003 So basically a single-host firewall with ULOG and ftp conntracking being the only fancy things. > 4) most importantly, have you enabled CONFIG_IP_NF_CONNTRACK_EVENTS ? 
> if yes, please disable, it's broken, a fix has been submitted, but I > don't know if it has propagated to Linus yet (netdev Message-ID: > <20050922143515.GD8917@rama.de.gnumonks.org>) Enabled, so this could be it. But 2.6.14-rc2-git4 did crash too (although it did take a bit longer for that to happen), and the changelog does state: commit 1dfbab59498d6f227c91988bab6c71af049a5333 tree 6b20409a232ebe8c37f16d06b3fbcde6bec8f328 parent a82b748930fce0dab22c64075c38c830ae116904 author Harald Welte Thu, 22 Sep 2005 23:46:57 -0700 committer David S. Miller Thu, 22 Sep 2005 23:46:57 -0700 [NETFILTER] Fix conntrack event cache deadlock/oops Which is this patch, right? Will verify whether disabling the option makes any difference tomorrow, as well as your other recommendations. > Also, I have that Ping time problem on my x86_64 debian unstable (smp). > But only in 1 out of ten cases on average (when starting ping, ctrl+c, > pin, ctrl+c, ...). I've always assumed it's some 64bit problem in > "ping" itself. Happens for all packets on the "broken" kernels, and works a-ok (few ms latencies to the same box) on the 2.6.13-era ones that don't crash. Could be a different bug, sure. 
From mbellion@hipac.org Sun Sep 25 19:54:33 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Sep 2005 19:54:43 -0700 (PDT) Received: from smtprelay01.ispgateway.de (smtprelay01.ispgateway.de [80.67.18.13]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8Q2sViL017814 for ; Sun, 25 Sep 2005 19:54:32 -0700 Received: (qmail 11666 invoked from network); 26 Sep 2005 02:51:42 -0000 Received: from unknown (HELO hpnotebook) (281619@[194.231.230.43]) (envelope-sender ) by smtprelay01.ispgateway.de (qmail-ldap-1.03) with RC4-MD5 encrypted SMTP for ; 26 Sep 2005 02:51:42 -0000 From: Michael Bellion To: linux-kernel@vger.kernel.org, linux-net@vger.kernel.org, netdev@oss.sgi.com Subject: [ANNOUNCE] Release of nf-HiPAC 0.9.0 Date: Mon, 26 Sep 2005 04:45:46 +0200 User-Agent: KMail/1.8.1 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200509260445.46740.mbellion@hipac.org> X-archive-position: 3659 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mbellion@hipac.org Precedence: bulk X-list: netdev Hi I am happy to announce the release of nf-HiPAC version 0.9.0 During the development of version 0.9.0 everything was ported to Linux kernel 2.6 and large parts of the kernel code have been rewritten. The kernel patch is now fairly non-intrusive: it only adds one simple function to ip_tables.c. The rest of the patch introduces new files to the kernel. The new release fixes all known bugs and also introduces some new features. Since the last release I have become part of MARA Systems AB ( http://www.marasystems.com ). MARA Systems AB is now the commercial backer of the HiPAC Project and finances it completely. Together MARA Systems and I will make sure that HiPAC is actively maintained and further developed under the GNU GPL. 
For all of you who don't know nf-HiPAC yet, here is a short overview: nf-HiPAC is a full featured packet filter for Linux which demonstrates the power and flexibility of HiPAC. HiPAC is a novel framework for packet classification which uses an advanced algorithm to reduce the number of memory lookups per packet. It is ideal for environments involving large rule sets and/or high bandwidth networks. nf-HiPAC provides the same rich feature set as iptables, the popular Linux packet filter. The complexity of the sophisticated HiPAC packet classification algorithm is hidden behind an iptables compatible user interface which renders nf-HiPAC a drop-in replacement for iptables. Thereby, the iptables' semantics of the rules is preserved, i.e. you can construct your rules like you are used to. From a user's point of view there is no need to understand anything about the HiPAC algorithm. The nf-hipac user space tool is designed to be as compatible as possible to 'iptables -t filter'. It even supports the full power of iptables targets, matches and stateful packet filtering (connection tracking) besides the native nf-HiPAC matches. This makes a switch from iptables to nf-HiPAC very easy. Usually it is sufficient to replace the calls to iptables with calls to nf-hipac for your filter rules. Why another packet filter? Performance: iptables, like most packet filters, uses a simple packet classification algorithm which traverses the rules in a chain linearly per packet until a matching rule is found (or not). Clearly, this approach lacks efficiency. As networks grow more and more complex and offer a wider bandwidth linear packet filtering is no longer an option if many rules have to be matched per packet. Higher bandwidth means more packets per second which leads to shorter process times per packet. nf-HiPAC outperforms iptables regardless of the number of rules, i.e. the HiPAC classification engine does not impose any overhead even for very small rule sets. 
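To make the linear traversal described above concrete, here is a minimal userspace sketch in C. It is not iptables or nf-HiPAC code: the names struct rule, struct packet and classify_linear, and the reduction of a rule to protocol plus destination port, are assumptions made up for this illustration. It only shows why the per-packet cost of a first-match walk grows with the position of the matching rule in the chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified rule: protocol and destination port only. */
enum verdict { VERDICT_ACCEPT, VERDICT_DROP };

struct rule {
    uint8_t  proto;        /* 0 means "any protocol" */
    uint16_t dport;        /* 0 means "any port" */
    enum verdict verdict;  /* what to do when the rule matches */
};

struct packet {
    uint8_t  proto;
    uint16_t dport;
};

/* A rule matches when every non-wildcard field equals the packet's field. */
static int rule_matches(const struct rule *r, const struct packet *p)
{
    return (r->proto == 0 || r->proto == p->proto) &&
           (r->dport == 0 || r->dport == p->dport);
}

/*
 * Walk the chain in order; the first matching rule decides the verdict.
 * If nothing matches, fall back to the chain policy.
 * *examined receives the number of rules that had to be inspected,
 * which is the per-packet cost of the linear approach.
 */
static enum verdict classify_linear(const struct rule *chain, size_t n,
                                    const struct packet *p,
                                    enum verdict policy, size_t *examined)
{
    size_t i;

    assert(chain != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (rule_matches(&chain[i], p)) {
            *examined = i + 1;
            return chain[i].verdict;
        }
    }
    *examined = n;
    return policy;
}
```

With a chain of n rules, a packet that only matches the last rule (or none) costs n comparisons in this model. How HiPAC avoids that is not shown here.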
Scalability to large rule sets: The performance of nf-HiPAC is nearly independent of the number of rules. nf-HiPAC with thousands of rules still outperforms iptables with 20 rules. Dynamic rule sets: nf-HiPAC offers fast dynamic rule set updates without stalling packet classification, in contrast to iptables, which yields bad update performance along with stalled packet processing during updates. More information about the project can be found at: http://www.hipac.org The releases are published on: http://sourceforge.net/projects/nf-hipac/ Enjoy, +---------------------------+ | Michael Bellion | | | +---------------------------+ From horms@koto.vergenet.net Sun Sep 25 21:27:19 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Sep 2005 21:27:32 -0700 (PDT) Received: from koto.vergenet.net ([210.128.90.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8Q4RIiL030614 for ; Sun, 25 Sep 2005 21:27:18 -0700 Received: by koto.vergenet.net (Postfix, from userid 7100) id DA6FD34036; Mon, 26 Sep 2005 13:24:18 +0900 (JST) Date: Mon, 26 Sep 2005 12:28:08 +0900 From: Horms To: Roger Tsang , Luca Maranzano , "LinuxVirtualServer.org users mailing list." Cc: Nishanth Aravamudan , Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com Subject: Re: ipvs_syncmaster brings cpu to 100% Message-ID: <20050926032807.GI18357@verge.net.au> Mail-Followup-To: Roger Tsang , Luca Maranzano , "LinuxVirtualServer.org users mailing list." 
, Nishanth Aravamudan , Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com References: <68559cef050908090657fc2599@mail.gmail.com> <498263350509081605956a771@mail.gmail.com> <68559cef05092207022f1f0df4@mail.gmail.com> <498263350509230815eb08a73@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <498263350509230815eb08a73@mail.gmail.com> X-Cluestick: seven User-Agent: Mutt/1.5.10i X-archive-position: 3660 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: horms@verge.net.au Precedence: bulk X-list: netdev On Fri, Sep 23, 2005 at 11:15:31AM -0400, Roger Tsang wrote: > As I've said before in this thread, you might want to try changing all the > ssleep() calls to schedule_timeout(). > > Roger > > > On 9/22/05, Luca Maranzano wrote: > > > > Hello all, > > > > here again trying to discover the reason ot the CPU hog for > > ipvs_sync{master,backup}. > > > > I've digged in the sources for ip_vs_sync.c and the main differences > > between kernel 2.6.8 and 2.6.12 is the use of ssleep() instead of > > schedule_timeout(). > > > > The oddity I've seen is that in the header of both files, the version > > is always like this: > > > > * Version: $Id: ip_vs_sync.c,v 1.13 2003/06/08 09:31:19 wensong Exp $ > > * > > * Authors: Wensong Zhang > > > > Is Wensong still the maintainer for this code? Yes, although he is kind of quiet. > > Furthermore, if I make an "rgrep" in the source tree of kernel 2.6.12 > > the function schedule_timeout() is more used than the ssleep() (517 > > occurrencies vs. 43), so why in ip_vs_sync.c there was this change? > > > > The other oddity is that Horms reported on this list that on non Xeon > > CPU the same version of kernel of mine does not present the problem. > > > > I'm getting crazy :-) I've prepared a patch, which reverts the change which was introduced by Nishanth Aravamudan in February. 
I have CCed him, Dave Miller, Wensong Zhang, Julian Anastasov, and the netdev list for comment. Could interested parties please test the patch? Thanks -- Horms Use schedule_timeout() instead of ssleep() in ip_vs_sync daemon, as the latter seems to cause 100% CPU utilisation on HT Xeons. Discussion: http://archive.linuxvirtualserver.org/html/lvs-users/2005-09/msg00031.html Reverts: http://www.kernel.org/git/?p=linux/kernel/git/tglx/history.git;a=commit;h=f8afb60c7537130448cc479d6d8dc9bf4ee06027 Signed-off-by: Horms diff --git a/net/ipv4/ipvs/ip_vs_sync.c b/net/ipv4/ipvs/ip_vs_sync.c --- a/net/ipv4/ipvs/ip_vs_sync.c +++ b/net/ipv4/ipvs/ip_vs_sync.c @@ -655,7 +655,9 @@ static void sync_master_loop(void) if (stop_master_sync) break; - ssleep(1); + __set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(HZ); + __set_current_state(TASK_RUNNING); } /* clean up the sync_buff queue */ @@ -712,7 +714,9 @@ static void sync_backup_loop(void) if (stop_backup_sync) break; - ssleep(1); + __set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(HZ); + __set_current_state(TASK_RUNNING); } /* release the sending multicast socket */ @@ -824,7 +828,9 @@ static int fork_sync_thread(void *startu if ((pid = kernel_thread(sync_thread, startup, 0)) < 0) { IP_VS_ERR("could not create sync_thread due to %d... " "retrying.\n", pid); - ssleep(1); + __set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(HZ); + __set_current_state(TASK_RUNNING); goto repeat; } @@ -858,7 +864,9 @@ int start_sync_thread(int state, char *m if ((pid = kernel_thread(fork_sync_thread, &startup, 0)) < 0) { IP_VS_ERR("could not create fork_sync_thread due to %d... 
" "retrying.\n", pid); - ssleep(1); + __set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(HZ); + __set_current_state(TASK_RUNNING); goto repeat; } From nacc@us.ibm.com Sun Sep 25 21:36:38 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 25 Sep 2005 21:36:42 -0700 (PDT) Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8Q4abiL031632 for ; Sun, 25 Sep 2005 21:36:38 -0700 Received: from westrelay02.boulder.ibm.com (westrelay02.boulder.ibm.com [9.17.195.11]) by e36.co.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8Q4Wfne007169 for ; Mon, 26 Sep 2005 00:32:41 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by westrelay02.boulder.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8Q4XnsH533128 for ; Sun, 25 Sep 2005 22:33:49 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11/8.13.3) with ESMTP id j8Q4XmeL023990 for ; Sun, 25 Sep 2005 22:33:48 -0600 Received: from arkanoid (sig-9-65-2-111.mts.ibm.com [9.65.2.111]) by d03av01.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j8Q4XmOx023982; Sun, 25 Sep 2005 22:33:48 -0600 Received: by arkanoid (Postfix, from userid 1000) id 2AA20EACF4; Sun, 25 Sep 2005 21:34:00 -0700 (PDT) Date: Sun, 25 Sep 2005 21:34:00 -0700 From: Nishanth Aravamudan To: Roger Tsang , Luca Maranzano , "LinuxVirtualServer.org users mailing list." 
, Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com Subject: Re: ipvs_syncmaster brings cpu to 100% Message-ID: <20050926043400.GD5079@us.ibm.com> References: <68559cef050908090657fc2599@mail.gmail.com> <498263350509081605956a771@mail.gmail.com> <68559cef05092207022f1f0df4@mail.gmail.com> <498263350509230815eb08a73@mail.gmail.com> <20050926032807.GI18357@verge.net.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050926032807.GI18357@verge.net.au> X-Operating-System: Linux 2.6.14-rc2 (x86_64) User-Agent: Mutt/1.5.9i X-archive-position: 3661 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nacc@us.ibm.com Precedence: bulk X-list: netdev On 26.09.2005 [12:28:08 +0900], Horms wrote: > On Fri, Sep 23, 2005 at 11:15:31AM -0400, Roger Tsang wrote: > > As I've said before in this thread, you might want to try changing all the > > ssleep() calls to schedule_timeout(). > > > > Roger > > > > > > On 9/22/05, Luca Maranzano wrote: > > > > > > Hello all, > > > > > > here again trying to discover the reason ot the CPU hog for > > > ipvs_sync{master,backup}. > > > > > > I've digged in the sources for ip_vs_sync.c and the main differences > > > between kernel 2.6.8 and 2.6.12 is the use of ssleep() instead of > > > schedule_timeout(). > > > > > > The oddity I've seen is that in the header of both files, the version > > > is always like this: > > > > > > * Version: $Id: ip_vs_sync.c,v 1.13 2003/06/08 09:31:19 wensong Exp $ > > > * > > > * Authors: Wensong Zhang > > > > > > Is Wensong still the maintainer for this code? > > Yes, although he is kind of quiet. > > > > Furthermore, if I make an "rgrep" in the source tree of kernel 2.6.12 > > > the function schedule_timeout() is more used than the ssleep() (517 > > > occurrencies vs. 43), so why in ip_vs_sync.c there was this change? 
> > > > > > The other oddity is that Horms reported on this list that on non Xeon > > > CPU the same version of kernel of mine does not present the problem. > > > > > > I'm getting crazy :-) > > I've prepared a patch, which reverts the change which was introduced > by Nishanth Aravamudan in February. Was the 100% cpu utilization only occurring on Xeon processors? Care to try to use msleep_interruptible() instead of ssleep(), as opposed to schedule_timeout()? In your patch, you do not need to set the state back to TASK_RUNNING, btw. Thanks, Nish From horms@koto.vergenet.net Mon Sep 26 01:15:42 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 01:15:55 -0700 (PDT) Received: from koto.vergenet.net (koto.vergenet.net [210.128.90.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8Q8FfiL024238 for ; Mon, 26 Sep 2005 01:15:41 -0700 Received: by koto.vergenet.net (Postfix, from userid 7100) id 8BF4C3402C; Mon, 26 Sep 2005 17:12:51 +0900 (JST) Date: Mon, 26 Sep 2005 17:05:10 +0900 From: Horms To: Nishanth Aravamudan Cc: Roger Tsang , Luca Maranzano , "LinuxVirtualServer.org users mailing list." , Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com Subject: Re: ipvs_syncmaster brings cpu to 100% Message-ID: <20050926080508.GF11027@verge.net.au> Mail-Followup-To: Nishanth Aravamudan , Roger Tsang , Luca Maranzano , "LinuxVirtualServer.org users mailing list." 
, Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com References: <68559cef050908090657fc2599@mail.gmail.com> <498263350509081605956a771@mail.gmail.com> <68559cef05092207022f1f0df4@mail.gmail.com> <498263350509230815eb08a73@mail.gmail.com> <20050926032807.GI18357@verge.net.au> <20050926043400.GD5079@us.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050926043400.GD5079@us.ibm.com> X-Cluestick: seven User-Agent: Mutt/1.5.10i X-archive-position: 3662 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: horms@verge.net.au Precedence: bulk X-list: netdev On Sun, Sep 25, 2005 at 09:34:00PM -0700, Nishanth Aravamudan wrote: > On 26.09.2005 [12:28:08 +0900], Horms wrote: > > On Fri, Sep 23, 2005 at 11:15:31AM -0400, Roger Tsang wrote: > > > As I've said before in this thread, you might want to try changing all the > > > ssleep() calls to schedule_timeout(). > > > > > > Roger > > > > > > > > > On 9/22/05, Luca Maranzano wrote: > > > > > > > > Hello all, > > > > > > > > here again trying to discover the reason ot the CPU hog for > > > > ipvs_sync{master,backup}. > > > > > > > > I've digged in the sources for ip_vs_sync.c and the main differences > > > > between kernel 2.6.8 and 2.6.12 is the use of ssleep() instead of > > > > schedule_timeout(). > > > > > > > > The oddity I've seen is that in the header of both files, the version > > > > is always like this: > > > > > > > > * Version: $Id: ip_vs_sync.c,v 1.13 2003/06/08 09:31:19 wensong Exp $ > > > > * > > > > * Authors: Wensong Zhang > > > > > > > > Is Wensong still the maintainer for this code? > > > > Yes, although he is kind of quiet. > > > > > > Furthermore, if I make an "rgrep" in the source tree of kernel 2.6.12 > > > > the function schedule_timeout() is more used than the ssleep() (517 > > > > occurrencies vs. 43), so why in ip_vs_sync.c there was this change? 
> > > > > > > > The other oddity is that Horms reported on this list that on non Xeon > > > > CPU the same version of kernel of mine does not present the problem. > > > > > > > > I'm getting crazy :-) > > > > I've prepared a patch, which reverts the change which was introduced > > by Nishanth Aravamudan in February. > > Was the 100% cpu utilization only occurring on Xeon processors? That seems to be the only case where this problem has been observed. I don't have such a processor myself, so I haven't actually been able to reproduce the problem locally. One reason I posted this issue to netdev was to get some more eyes on the problem, as it is puzzling to say the least. > Care to try to use msleep_interruptible() instead of ssleep(), as > opposed to schedule_timeout()? I will send a version that does that shortly. Luca, can you please check that too? > In your patch, you do not need to set the state back to TASK_RUNNING, > btw. Thanks, updated patch below. -- Horms Use schedule_timeout() instead of ssleep() in ip_vs_sync daemon, as the latter seems to cause 100% CPU utilisation on HT Xeons. 
Discussion: http://archive.linuxvirtualserver.org/html/lvs-users/2005-09/msg00031.html Reverts: http://www.kernel.org/git/?p=linux/kernel/git/tglx/history.git;a=commit;h=f8afb60c7537130448cc479d6d8dc9bf4ee06027 Signed-off-by: Horms diff --git a/net/ipv4/ipvs/ip_vs_sync.c b/net/ipv4/ipvs/ip_vs_sync.c --- a/net/ipv4/ipvs/ip_vs_sync.c +++ b/net/ipv4/ipvs/ip_vs_sync.c @@ -655,7 +655,8 @@ static void sync_master_loop(void) if (stop_master_sync) break; - ssleep(1); + __set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(HZ); } /* clean up the sync_buff queue */ @@ -712,7 +713,8 @@ static void sync_backup_loop(void) if (stop_backup_sync) break; - ssleep(1); + __set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(HZ); } /* release the sending multicast socket */ @@ -824,7 +826,8 @@ static int fork_sync_thread(void *startu if ((pid = kernel_thread(sync_thread, startup, 0)) < 0) { IP_VS_ERR("could not create sync_thread due to %d... " "retrying.\n", pid); - ssleep(1); + __set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(HZ); goto repeat; } @@ -858,7 +861,8 @@ int start_sync_thread(int state, char *m if ((pid = kernel_thread(fork_sync_thread, &startup, 0)) < 0) { IP_VS_ERR("could not create fork_sync_thread due to %d... " "retrying.\n", pid); - ssleep(1); + __set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(HZ); goto repeat; } From horms@koto.vergenet.net Mon Sep 26 01:15:41 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 01:15:56 -0700 (PDT) Received: from koto.vergenet.net (koto.vergenet.net [210.128.90.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8Q8FfiL024239 for ; Mon, 26 Sep 2005 01:15:41 -0700 Received: by koto.vergenet.net (Postfix, from userid 7100) id A752434031; Mon, 26 Sep 2005 17:12:51 +0900 (JST) Date: Mon, 26 Sep 2005 17:12:32 +0900 From: Horms To: Nishanth Aravamudan , Roger Tsang , Luca Maranzano , "LinuxVirtualServer.org users mailing list." 
, Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com Subject: Re: ipvs_syncmaster brings cpu to 100% Message-ID: <20050926081229.GA23755@verge.net.au> Mail-Followup-To: Nishanth Aravamudan , Roger Tsang , Luca Maranzano , "LinuxVirtualServer.org users mailing list." , Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com References: <68559cef050908090657fc2599@mail.gmail.com> <498263350509081605956a771@mail.gmail.com> <68559cef05092207022f1f0df4@mail.gmail.com> <498263350509230815eb08a73@mail.gmail.com> <20050926032807.GI18357@verge.net.au> <20050926043400.GD5079@us.ibm.com> <20050926080508.GF11027@verge.net.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050926080508.GF11027@verge.net.au> X-Cluestick: seven User-Agent: Mutt/1.5.10i X-archive-position: 3663 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: horms@verge.net.au Precedence: bulk X-list: netdev On Mon, Sep 26, 2005 at 05:05:10PM +0900, Horms wrote: [snip] > > > > > Furthermore, if I make an "rgrep" in the source tree of kernel 2.6.12 > > > > > the function schedule_timeout() is more used than the ssleep() (517 > > > > > occurrencies vs. 43), so why in ip_vs_sync.c there was this change? > > > > > > > > > > The other oddity is that Horms reported on this list that on non Xeon > > > > > CPU the same version of kernel of mine does not present the problem. > > > > > > > > > > I'm getting crazy :-) > > > > > > I've prepared a patch, which reverts the change which was introduced > > > by Nishanth Aravamudan in February. > > > > Was the 100% cpu utilization only occurring on Xeon processors? > > That seems to be the only case where were this problem has been > observed. I don't have such a processor myself, so I haven't actually > been able to produce the problem locally. 
> > One reason I posted this issue to netdev was to get some more > eyes on the problem as it is puzzling to say the least. > > > Care to try to use msleep_interruptible() instead of ssleep(), as > > opposed to schedule_timeout()? > > I will send a version that does that shortly, Luca, can > you plase check that too? Here is that version of the patch. Nishanth, I take it that I do not need to set TASK_INTERRUPTIBLE before calling msleep_interruptible(); please let me know if I am wrong. Luca, please test. -- Horms *UNTESTED* Use msleep_interruptible() instead of ssleep() in ip_vs_sync daemon, as the latter seems to cause 100% CPU utilisation on HT Xeons. Discussion: http://archive.linuxvirtualserver.org/html/lvs-users/2005-09/msg00031.html Reverts: http://www.kernel.org/git/?p=linux/kernel/git/tglx/history.git;a=commit;h=f8afb60c7537130448cc479d6d8dc9bf4ee06027 Signed-off-by: Horms diff --git a/net/ipv4/ipvs/ip_vs_sync.c b/net/ipv4/ipvs/ip_vs_sync.c --- a/net/ipv4/ipvs/ip_vs_sync.c +++ b/net/ipv4/ipvs/ip_vs_sync.c @@ -655,7 +655,7 @@ static void sync_master_loop(void) if (stop_master_sync) break; - ssleep(1); + msleep_interruptible(1000); } /* clean up the sync_buff queue */ @@ -712,7 +712,7 @@ static void sync_backup_loop(void) if (stop_backup_sync) break; - ssleep(1); + msleep_interruptible(1000); } /* release the sending multicast socket */ @@ -824,7 +824,7 @@ static int fork_sync_thread(void *startu if ((pid = kernel_thread(sync_thread, startup, 0)) < 0) { IP_VS_ERR("could not create sync_thread due to %d... " "retrying.\n", pid); - ssleep(1); + msleep_interruptible(1000); goto repeat; } @@ -858,7 +858,7 @@ int start_sync_thread(int state, char *m if ((pid = kernel_thread(fork_sync_thread, &startup, 0)) < 0) { IP_VS_ERR("could not create fork_sync_thread due to %d... 
" "retrying.\n", pid); - ssleep(1); + msleep_interruptible(1000); goto repeat; } From fleury@cs.aau.dk Mon Sep 26 04:29:28 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 04:29:32 -0700 (PDT) Received: from smtp.cs.aau.dk (smtp.cs.aau.dk [130.225.194.6]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QBTQiL016116 for ; Mon, 26 Sep 2005 04:29:27 -0700 Received: from [130.225.195.123] (fleury@rade7.e.cs.auc.dk [130.225.195.123]) by smtp.cs.aau.dk (8.13.4/8.13.4) with ESMTP id j8QBQX32013814; Mon, 26 Sep 2005 13:26:33 +0200 Message-ID: <4337DA7C.2000804@cs.aau.dk> Date: Mon, 26 Sep 2005 13:24:44 +0200 From: Emmanuel Fleury User-Agent: Debian Thunderbird 1.0.6 (X11/20050802) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Michael Bellion CC: linux-kernel@vger.kernel.org, linux-net@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [ANNOUNCE] Release of nf-HiPAC 0.9.0 References: <200509260445.46740.mbellion@hipac.org> In-Reply-To: <200509260445.46740.mbellion@hipac.org> X-Enigmail-Version: 0.92.0.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.52 on 130.225.194.6 X-archive-position: 3665 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: fleury@cs.aau.dk Precedence: bulk X-list: netdev Hi, Did you solved your "size" issues when entering long list of rules ??? I'm still not convinced by your approach. :-/ These experiments have to be updated but can you comment on this: http://www.cs.aau.dk/~mixxel/cf/experiments.html Regards -- Emmanuel Fleury Houston, we've had a problem here. 
-- Jack Swigert (Appolo XIII, April 13, 1970) From hadi@cyberus.ca Mon Sep 26 04:21:06 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 04:21:28 -0700 (PDT) Received: from mx04.cybersurf.com (mx04.cybersurf.com [209.197.145.108]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QBL5iL014804 for ; Mon, 26 Sep 2005 04:21:05 -0700 Received: from mail.cyberus.ca ([209.197.145.21]) by mx04.cybersurf.com with esmtp (Exim 4.30) id 1EJqzx-0005Ke-Mu for netdev@oss.sgi.com; Mon, 26 Sep 2005 07:18:17 -0400 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.96.183] helo=[10.0.0.229]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1EJqzw-0003Hu-8q; Mon, 26 Sep 2005 07:18:16 -0400 Subject: Re: [ANNOUNCE] Release of nf-HiPAC 0.9.0 From: jamal Reply-To: hadi@cyberus.ca To: Michael Bellion Cc: linux-kernel@vger.kernel.org, linux-net@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: <200509260445.46740.mbellion@hipac.org> References: <200509260445.46740.mbellion@hipac.org> Content-Type: text/plain Organization: unknown Date: Mon, 26 Sep 2005 07:18:12 -0400 Message-Id: <1127733492.6215.274.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Evolution 2.2.1.1 Content-Transfer-Encoding: 7bit X-archive-position: 3664 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Mon, 2005-26-09 at 04:45 +0200, Michael Bellion wrote: > Hi > > I am happy to announce the release of nf-HiPAC version 0.9.0 > > During the development of version 0.9.0 everything was ported to Linux kernel > 2.6 and large parts of the kernel code have been rewritten. > The kernel patch is now fairly non-intrusive: it only adds one simple function > to ip_tables.c. The rest of the patch introduces new files to the kernel. > The new release fixes all known bugs and also introduces some new features. 
> > Since the last release I have become part of MARA Systems AB > ( http://www.marasystems.com ). MARA Systems AB is now the commercial backer > of the HiPAC Project and finances it completely. Together MARA Systems and I > will make sure that HiPAC is actively maintained and further developed under > the GNU GPL. > > Congratulations to yourself as well as your sponsor. I think this is useful. The iptables wrapper is certainly valuable. Can you post some numbers relative to iptables? Some tests with the following parameters would be helpful: - Variable incoming packet rate (in packets per second) - Variable packet sizes - Variable number of users/filters - Effect of adding/removing/modifying policies while under different incoming traffic rates. Just even simple non-stateful comparisons like i did with tc over here: http://www.suug.ch/sucon/04/slides/pkt_cls.pdf Or even better when you do these tests also try out with tc filter. cheers, jamal From nacc@us.ibm.com Mon Sep 26 06:14:06 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 06:14:16 -0700 (PDT) Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QDE0iL027179 for ; Mon, 26 Sep 2005 06:14:06 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e36.co.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8QD9wml009876 for ; Mon, 26 Sep 2005 09:09:58 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay04.boulder.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8QDBgHU529166 for ; Mon, 26 Sep 2005 07:11:42 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11/8.13.3) with ESMTP id j8QDB5CL015237 for ; Mon, 26 Sep 2005 07:11:06 -0600 Received: from arkanoid (sig-9-49-133-243.mts.ibm.com [9.49.133.243]) by d03av01.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j8QDB5Bl015218; Mon, 26 Sep 2005 
07:11:05 -0600 Received: by arkanoid (Postfix, from userid 1000) id 72ED2EACF4; Mon, 26 Sep 2005 06:11:04 -0700 (PDT) Date: Mon, 26 Sep 2005 06:11:04 -0700 From: Nishanth Aravamudan To: Roger Tsang , Luca Maranzano , "LinuxVirtualServer.org users mailing list." , Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com Subject: Re: ipvs_syncmaster brings cpu to 100% Message-ID: <20050926131104.GA7532@us.ibm.com> References: <68559cef050908090657fc2599@mail.gmail.com> <498263350509081605956a771@mail.gmail.com> <68559cef05092207022f1f0df4@mail.gmail.com> <498263350509230815eb08a73@mail.gmail.com> <20050926032807.GI18357@verge.net.au> <20050926043400.GD5079@us.ibm.com> <20050926080508.GF11027@verge.net.au> <20050926081229.GA23755@verge.net.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050926081229.GA23755@verge.net.au> X-Operating-System: Linux 2.6.14-rc2 (x86_64) User-Agent: Mutt/1.5.9i X-archive-position: 3666 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nacc@us.ibm.com Precedence: bulk X-list: netdev On 26.09.2005 [17:12:32 +0900], Horms wrote: > On Mon, Sep 26, 2005 at 05:05:10PM +0900, Horms wrote: > > [snip] > > > > > > > Furthermore, if I make an "rgrep" in the source tree of kernel 2.6.12 > > > > > > the function schedule_timeout() is more used than the ssleep() (517 > > > > > > occurrencies vs. 43), so why in ip_vs_sync.c there was this change? > > > > > > > > > > > > The other oddity is that Horms reported on this list that on non Xeon > > > > > > CPU the same version of kernel of mine does not present the problem. > > > > > > > > > > > > I'm getting crazy :-) > > > > > > > > I've prepared a patch, which reverts the change which was introduced > > > > by Nishanth Aravamudan in February. > > > > > > Was the 100% cpu utilization only occurring on Xeon processors? 
> > > > That seems to be the only case where were this problem has been > > observed. I don't have such a processor myself, so I haven't actually > > been able to produce the problem locally. > > > > One reason I posted this issue to netdev was to get some more > > eyes on the problem as it is puzzling to say the least. > > > > > Care to try to use msleep_interruptible() instead of ssleep(), as > > > opposed to schedule_timeout()? > > > > I will send a version that does that shortly, Luca, can > > you plase check that too? > > Here is that version of the patch. Nishanth, I take it that I do not > need to set TASK_INTERRUPTABLE before calling msleep_interruptible(), > please let me know if I am wrong. Yes, exactly. I'm just trying to narrow it down to see if it's the task state that's causing the issue (which, to be honest, doesn't make a lot of sense to me -- with ssleep() your load average will go up, as the task will be in the UNINTERRUPTIBLE state, but I am not sure why utilisation would rise, as you are still sleeping...)
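The distinction drawn above between load average and CPU utilisation can be illustrated with a small model. This is only a sketch, assuming the fixed-point scheme used by 2.6-era kernels as I recall it (an 11-bit fraction, a 5-second sampling interval, and a 1-minute decay constant of 1884); the function names are hypothetical and the code is not taken from the kernel. The point it shows: tasks sleeping in the uninterruptible (D) state are counted as "active" for the load average even though they consume no CPU time.

```python
# Simplified model of a Linux-style 1-minute load average.
# Assumption: constants mirror the 2.6-era fixed-point scheme
# (FSHIFT = 11, EXP_1 = 1884, one sample every 5 seconds).
FSHIFT = 11
FIXED_1 = 1 << FSHIFT      # 1.0 in fixed point
EXP_1 = 1884               # approx. FIXED_1 / exp(5s / 60s)


def calc_load(load, exp, active):
    """One exponential-decay step, all values in fixed point."""
    return (load * exp + active * (FIXED_1 - exp)) >> FSHIFT


def simulate_load(nr_running, nr_uninterruptible, seconds=600):
    """Return the modelled 1-minute load average after `seconds`.

    Both runnable tasks and tasks in uninterruptible sleep count as
    active, so a sleeping D-state thread raises the load average
    without using any CPU.
    """
    active = (nr_running + nr_uninterruptible) * FIXED_1
    load = 0
    for _ in range(seconds // 5):
        load = calc_load(load, EXP_1, active)
    return load / FIXED_1


if __name__ == "__main__":
    # Two threads sleeping uninterruptibly, nothing actually running.
    print("two D-state threads:", round(simulate_load(0, 2), 2))
    # The same threads sleeping interruptibly are not counted at all.
    print("no active tasks:    ", round(simulate_load(0, 0), 2))
```

Under these assumptions two permanently D-state threads drive the modelled load towards 2 while the number of running tasks stays at zero, which would be consistent with a high load average next to 0% CPU in ps.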
Thanks, Nish From mbellion@hipac.org Mon Sep 26 06:19:13 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 06:19:19 -0700 (PDT) Received: from triton.rz.uni-saarland.de (triton.rz.uni-saarland.de [134.96.7.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QDJAiL028073 for ; Mon, 26 Sep 2005 06:19:13 -0700 Received: from e135.stw.stud.uni-saarland.de (e135.stw.stud.uni-saarland.de [134.96.65.150]) by triton.rz.uni-saarland.de (8.12.10/8.12.10) with ESMTP id j8QDGGn83676011; Mon, 26 Sep 2005 15:16:16 +0200 (CEST) From: Michael Bellion To: hadi@cyberus.ca Subject: Re: [ANNOUNCE] Release of nf-HiPAC 0.9.0 Date: Mon, 26 Sep 2005 15:16:16 +0200 User-Agent: KMail/1.8.1 Cc: linux-kernel@vger.kernel.org, linux-net@vger.kernel.org, netdev@oss.sgi.com References: <200509260445.46740.mbellion@hipac.org> <1127733492.6215.274.camel@localhost.localdomain> In-Reply-To: <1127733492.6215.274.camel@localhost.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-6" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200509261516.16565.mbellion@hipac.org> X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.5.1 (triton.rz.uni-saarland.de [134.96.7.25]); Mon, 26 Sep 2005 15:16:17 +0200 (CEST) X-AntiVirus: checked by AntiVir Milter 1.0.6; AVE 6.32.0.6; VDF 6.32.0.43 X-archive-position: 3667 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mbellion@hipac.org Precedence: bulk X-list: netdev Hi, > Can you post some numbers relative to iptables? 
We have some performance tests available at: http://www.hipac.org/performance_tests/overview.html We also have a list of the independent performance tests we know of: http://www.hipac.org/performance_tests/independent.html > Some tests with the following parameters would be helpful: > - Variable incoming packet rate (in packets per second) > - Variable packet sizes > - Variable number of users/filters > - Effect of adding/removing/modifying policies while under different > incoming traffic rates. Most of these parameters are covered by the performance tests above. The effect of adding/removing/modifying policies under different incoming traffic rates has not been tested there. nf-HiPAC is based on a completely dynamic approach. This means that the algorithm used in HiPAC does not rebuild the lookup data structure from scratch whenever you update the rule set. Instead, during an update of the policies only the required changes to the lookup data structure are made. This guarantees that packet processing is affected as little as possible during updates. It would certainly be nice to see some benchmark results for this case. nf-HiPAC is expected to handle this very well, because it was designed with this case in mind.
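To make the idea of "only the required changes" concrete, here is a toy sketch of an incrementally updated lookup structure. It is explicitly NOT the HiPAC algorithm: it is a hypothetical one-dimensional range classifier (think of a single port field) in which inserting a rule touches only the elementary intervals that the rule overlaps, instead of recomputing everything. The class name, the priority convention (lower number wins) and the missing delete operation are all assumptions made for illustration.

```python
import bisect


class RangeClassifier:
    """Toy 1-D classifier over integer keys 0..max_key (hypothetical)."""

    def __init__(self, max_key=65535):
        self.max_key = max_key
        self.starts = [0]     # sorted start points of elementary intervals
        self.best = [None]    # winning (priority, action) per interval

    def _split(self, point):
        """Make sure an interval starts at `point`; return its index."""
        i = bisect.bisect_left(self.starts, point)
        if i < len(self.starts) and self.starts[i] == point:
            return i
        # The new piece inherits the winner of the interval it was cut from.
        self.starts.insert(i, point)
        self.best.insert(i, self.best[i - 1])
        return i

    def insert(self, lo, hi, priority, action):
        """Add rule [lo, hi]; return how many intervals were touched."""
        first = self._split(lo)
        last = self._split(hi + 1) if hi < self.max_key else len(self.starts)
        for i in range(first, last):
            current = self.best[i]
            if current is None or priority < current[0]:
                self.best[i] = (priority, action)
        return last - first

    def lookup(self, key):
        """Binary search for the interval containing `key`."""
        i = bisect.bisect_right(self.starts, key) - 1
        hit = self.best[i]
        return hit[1] if hit else None


if __name__ == "__main__":
    c = RangeClassifier(max_key=99)
    c.insert(0, 99, 3, "LOG")
    c.insert(10, 50, 2, "ACCEPT")
    c.insert(20, 30, 1, "DROP")
    print(c.lookup(25), c.lookup(40), c.lookup(70))
    print("intervals touched by a narrow rule:", c.insert(40, 45, 0, "REJECT"))
```

The design choice worth noting is that lookup cost stays a binary search over interval start points, while the cost of an insert depends on how many existing intervals the new rule spans rather than on the total size of the rule set.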
Regards +---------------------------+ | Michael Bellion | | | +---------------------------+ From hadi@cyberus.ca Mon Sep 26 06:34:30 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 06:34:38 -0700 (PDT) Received: from mx03.cybersurf.com (mx03.cybersurf.com [209.197.145.106]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QDYTiL029931 for ; Mon, 26 Sep 2005 06:34:29 -0700 Received: from mail.cyberus.ca ([209.197.145.21]) by mx03.cybersurf.com with esmtp (Exim 4.30) id 1EJt55-00037X-Ig for netdev@oss.sgi.com; Mon, 26 Sep 2005 09:31:43 -0400 Received: from cpe0030ab124d2f-cm014500000962.cpe.net.cable.rogers.com ([24.103.96.183] helo=[10.0.0.229]) by mail.cyberus.ca with esmtp (Exim 4.20) id 1EJt53-0005Bv-Ng; Mon, 26 Sep 2005 09:31:41 -0400 Subject: Re: [ANNOUNCE] Release of nf-HiPAC 0.9.0 From: jamal Reply-To: hadi@cyberus.ca To: Michael Bellion Cc: linux-kernel@vger.kernel.org, linux-net@vger.kernel.org, netdev@oss.sgi.com In-Reply-To: <200509261516.16565.mbellion@hipac.org> References: <200509260445.46740.mbellion@hipac.org> <1127733492.6215.274.camel@localhost.localdomain> <200509261516.16565.mbellion@hipac.org> Content-Type: text/plain Organization: unknown Date: Mon, 26 Sep 2005 09:31:37 -0400 Message-Id: <1127741497.6215.345.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Evolution 2.2.1.1 Content-Transfer-Encoding: 7bit X-archive-position: 3668 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: hadi@cyberus.ca Precedence: bulk X-list: netdev On Mon, 2005-26-09 at 15:16 +0200, Michael Bellion wrote: > Hi, > > > Can you post some numbers relative to iptables? 
> > We have some performance tests available at: > http://www.hipac.org/performance_tests/overview.html > > We also have a list of the independent performance tests we know of: > http://www.hipac.org/performance_tests/independent.html > Can you please post results against the kernels you are patching against _today_? I recall these same graphs from a few years back, but even iptables has improved since. Any issues you may find can only help you improve. BTW, your tests were unfair to iptables; you should have optimized the rules, on the assumption that someone needing that many rules would probably have had to do some optimization even with iptables. Yes, it would only have taken one year to load 256K rules, but it would have loaded eventually. > > Some tests with the following parameters would be helpful: > > - Variable incoming packet rate (in packets per second) > > - Variable packet sizes > > - Variable number of users/filters > > - Effect of adding/removing/modifying policies while under different > > incoming traffic rates. > > Most of this parameters are used in the performance tests above. > > The effect of adding/removing/modifying policies while under different > incoming traffic rates has not been tested in the above tests. > > nf-HiPAC is based on a completely dynamic approach. Very good. Please do more up-to-date testing and try to include tc filter as well.
cheers, jamal From liuk001@gmail.com Mon Sep 26 06:54:52 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 06:55:07 -0700 (PDT) Received: from zproxy.gmail.com (zproxy.gmail.com [64.233.162.195]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QDspiL002256 for ; Mon, 26 Sep 2005 06:54:52 -0700 Received: by zproxy.gmail.com with SMTP id o1so1396211nzf for ; Mon, 26 Sep 2005 06:52:03 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:reply-to:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=a/rEtxmdV4UgtHx8k/T5gSZHgkdFPQ55WyhQgZ9BUfOz87t7FM/jgQeVM9m0q4ErxYt85b8IHy696PN7/xaIDGmsGgFTAAGOIDb7fUU66uTMhXZy6+tY8HJkFAa+6t73JWVsiydnAXrkflM/ra0Sn6mslVLcEDbvdqcpjkP2Eos= Received: by 10.54.52.30 with SMTP id z30mr956822wrz; Mon, 26 Sep 2005 06:52:02 -0700 (PDT) Received: by 10.54.70.6 with HTTP; Mon, 26 Sep 2005 06:52:02 -0700 (PDT) Message-ID: <68559cef05092606521cc13f9a@mail.gmail.com> Date: Mon, 26 Sep 2005 15:52:02 +0200 From: Luca Maranzano Reply-To: Luca Maranzano To: Nishanth Aravamudan Subject: Re: ipvs_syncmaster brings cpu to 100% Cc: "LinuxVirtualServer.org users mailing list." 
, Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com In-Reply-To: <20050926131104.GA7532@us.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Disposition: inline References: <68559cef050908090657fc2599@mail.gmail.com> <498263350509081605956a771@mail.gmail.com> <68559cef05092207022f1f0df4@mail.gmail.com> <498263350509230815eb08a73@mail.gmail.com> <20050926032807.GI18357@verge.net.au> <20050926043400.GD5079@us.ibm.com> <20050926080508.GF11027@verge.net.au> <20050926081229.GA23755@verge.net.au> <20050926131104.GA7532@us.ibm.com> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j8QDspiL002256 X-archive-position: 3669 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: liuk001@gmail.com Precedence: bulk X-list: netdev Just to add more info, please note the output of "ps": debld1:~# ps aux|grep ipvs root 3748 0.0 0.0 0 0 ? D 12:09 0:00 [ipvs_syncmaster] root 3757 0.0 0.0 0 0 ? D 12:09 0:00 [ipvs_syncbackup] Note the D status, i.e. (from ps(1) man page): Uninterruptible sleep (usually IO) I hope to have a Xeon machine to make some more tests in the next days, in the mean time I'll try to reproduce my setup on a couple of VMWare Workstation machines. More later. Thank you all. Luca On 26/09/05, Nishanth Aravamudan wrote: > On 26.09.2005 [17:12:32 +0900], Horms wrote: > > On Mon, Sep 26, 2005 at 05:05:10PM +0900, Horms wrote: > > > > [snip] > > > > > > > > > Furthermore, if I make an "rgrep" in the source tree of kernel 2.6.12 > > > > > > > the function schedule_timeout() is more used than the ssleep() (517 > > > > > > > occurrencies vs. 43), so why in ip_vs_sync.c there was this change? > > > > > > > > > > > > > > The other oddity is that Horms reported on this list that on non Xeon > > > > > > > CPU the same version of kernel of mine does not present the problem. 
> > > > > > > > > > > > > > I'm getting crazy :-) > > > > > > > > > > I've prepared a patch, which reverts the change which was introduced > > > > > by Nishanth Aravamudan in February. > > > > > > > > Was the 100% cpu utilization only occurring on Xeon processors? > > > > > > That seems to be the only case where were this problem has been > > > observed. I don't have such a processor myself, so I haven't actually > > > been able to produce the problem locally. > > > > > > One reason I posted this issue to netdev was to get some more > > > eyes on the problem as it is puzzling to say the least. > > > > > > > Care to try to use msleep_interruptible() instead of ssleep(), as > > > > opposed to schedule_timeout()? > > > > > > I will send a version that does that shortly, Luca, can > > > you plase check that too? > > > > Here is that version of the patch. Nishanth, I take it that I do not > > need to set TASK_INTERRUPTABLE before calling msleep_interruptible(), > > please let me know if I am wrong. > > Yes, exactly. I'm just trying to narrow it down to see if it's the task > state that's causing the issue (which, to be honest, doesn't make a lot > of sense to me -- with ssleep() your load average will go up as the task > will be UNINTERRUPTIBLE state, but I am not sure why utilisation would > rise, as you are still sleeping...) 
> > Thanks, > Nish > From nacc@us.ibm.com Mon Sep 26 07:24:16 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 07:24:24 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QEO7iL005610 for ; Mon, 26 Sep 2005 07:24:16 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e1.ny.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8QELBUd017230 for ; Mon, 26 Sep 2005 10:21:11 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8QELBEZ107232 for ; Mon, 26 Sep 2005 10:21:11 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11/8.13.3) with ESMTP id j8QELAwo003910 for ; Mon, 26 Sep 2005 10:21:11 -0400 Received: from arkanoid (sig-9-49-133-243.mts.ibm.com [9.49.133.243]) by d01av01.pok.ibm.com (8.12.11/8.12.11) with ESMTP id j8QELAUY003852; Mon, 26 Sep 2005 10:21:10 -0400 Received: by arkanoid (Postfix, from userid 1000) id C296BEACF4; Mon, 26 Sep 2005 07:21:09 -0700 (PDT) Date: Mon, 26 Sep 2005 07:21:09 -0700 From: Nishanth Aravamudan To: Luca Maranzano Cc: Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com Subject: Re: ipvs_syncmaster brings cpu to 100% Message-ID: <20050926142109.GD7532@us.ibm.com> References: <68559cef050908090657fc2599@mail.gmail.com> <498263350509081605956a771@mail.gmail.com> <68559cef05092207022f1f0df4@mail.gmail.com> <498263350509230815eb08a73@mail.gmail.com> <20050926032807.GI18357@verge.net.au> <20050926043400.GD5079@us.ibm.com> <20050926080508.GF11027@verge.net.au> <20050926081229.GA23755@verge.net.au> <20050926131104.GA7532@us.ibm.com> <68559cef05092606521cc13f9a@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <68559cef05092606521cc13f9a@mail.gmail.com> X-Operating-System: Linux 2.6.14-rc2 (x86_64) 
User-Agent: Mutt/1.5.9i X-archive-position: 3670 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nacc@us.ibm.com Precedence: bulk X-list: netdev On 26.09.2005 [15:52:02 +0200], Luca Maranzano wrote: > On 26/09/05, Nishanth Aravamudan wrote: > > On 26.09.2005 [17:12:32 +0900], Horms wrote: > > > On Mon, Sep 26, 2005 at 05:05:10PM +0900, Horms wrote: > > > > > > [snip] > > > > > > > > > > > Furthermore, if I make an "rgrep" in the source tree of kernel 2.6.12 > > > > > > > > the function schedule_timeout() is more used than the ssleep() (517 > > > > > > > > occurrencies vs. 43), so why in ip_vs_sync.c there was this change? > > > > > > > > > > > > > > > > The other oddity is that Horms reported on this list that on non Xeon > > > > > > > > CPU the same version of kernel of mine does not present the problem. > > > > > > > > > > > > > > > > I'm getting crazy :-) > > > > > > > > > > > > I've prepared a patch, which reverts the change which was introduced > > > > > > by Nishanth Aravamudan in February. > > > > > > > > > > Was the 100% cpu utilization only occurring on Xeon processors? > > > > > > > > That seems to be the only case where were this problem has been > > > > observed. I don't have such a processor myself, so I haven't actually > > > > been able to produce the problem locally. > > > > > > > > One reason I posted this issue to netdev was to get some more > > > > eyes on the problem as it is puzzling to say the least. > > > > > > > > > Care to try to use msleep_interruptible() instead of ssleep(), as > > > > > opposed to schedule_timeout()? > > > > > > > > I will send a version that does that shortly, Luca, can > > > > you plase check that too? > > > > > > Here is that version of the patch. Nishanth, I take it that I do not > > > need to set TASK_INTERRUPTABLE before calling msleep_interruptible(), > > > please let me know if I am wrong. > > > > Yes, exactly. 
I'm just trying to narrow it down to see if it's the task > > state that's causing the issue (which, to be honest, doesn't make a lot > > of sense to me -- with ssleep() your load average will go up as the task > > will be UNINTERRUPTIBLE state, but I am not sure why utilisation would > > rise, as you are still sleeping...) [trimmed lvs-users from my reply, as it is a closed list] > Just to add more info, please note the output of "ps": > > debld1:~# ps aux|grep ipvs > root 3748 0.0 0.0 0 0 ? D 12:09 0:00 > [ipvs_syncmaster] > root 3757 0.0 0.0 0 0 ? D 12:09 0:00 > [ipvs_syncbackup] > > Note the D status, i.e. (from ps(1) man page): Uninterruptible sleep > (usually IO) The msleep_interruptible() change should fix that. But that output does not show 100% CPU utilisation at all; it shows 0. Did you mean to say your load average increases? I'm still unclear what the problem is. Horms' initial Cc trimmed some important information. It would be very useful to "start over" -- at least from the perspective of what the problem actually is. > I hope to have a Xeon machine to make some more tests in the next > days, in the mean time I'll try to reproduce my setup on a couple of > VMWare Workstation machines. Please don't top-post. It makes it really hard to write sane replies...
Thanks, Nish From mbellion@hipac.org Mon Sep 26 07:41:07 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 07:41:12 -0700 (PDT) Received: from triton.rz.uni-saarland.de (triton.rz.uni-saarland.de [134.96.7.25]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QEf6iL008413 for ; Mon, 26 Sep 2005 07:41:07 -0700 Received: from e135.stw.stud.uni-saarland.de (e135.stw.stud.uni-saarland.de [134.96.65.150]) by triton.rz.uni-saarland.de (8.12.10/8.12.10) with ESMTP id j8QEcCn83680544; Mon, 26 Sep 2005 16:38:12 +0200 (CEST) From: Michael Bellion To: Emmanuel Fleury Subject: Re: [ANNOUNCE] Release of nf-HiPAC 0.9.0 Date: Mon, 26 Sep 2005 16:38:12 +0200 User-Agent: KMail/1.8.1 Cc: linux-kernel@vger.kernel.org, linux-net@vger.kernel.org, netdev@oss.sgi.com, jamal References: <200509260445.46740.mbellion@hipac.org> <4337DA7C.2000804@cs.aau.dk> In-Reply-To: <4337DA7C.2000804@cs.aau.dk> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200509261638.12731.mbellion@hipac.org> X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.5.1 (triton.rz.uni-saarland.de [134.96.7.25]); Mon, 26 Sep 2005 16:38:12 +0200 (CEST) X-AntiVirus: checked by AntiVir Milter 1.0.6; AVE 6.32.0.6; VDF 6.32.0.43 X-archive-position: 3673 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mbellion@hipac.org Precedence: bulk X-list: netdev Content-Length: 3940 Lines: 91 Hi > I'm still not convinced by your approach. :-/ You really should have a closer look at nf-HiPAC so that you know what you are talking about! Your Compact Filter takes a completely different approach than nf-HiPAC to build the data structure used in the kernel for the packet classification lookup. Your Compact Filter uses a static compiler in user space. 
That compiler transforms the rule set into boolean expressions and then uses operations from predicate logic to optimize the rule set. This has the big drawback that whenever even a single rule changes you have to recompile the complete lookup data structure. So this approach is clearly not suitable for scenarios depending on dynamic rule sets. nf-HiPAC uses a completely different approach to build the lookup data structure in the kernel. It is based on geometry. This approach allows completely dynamic updates. During an update of the rules only the required changes to the lookup data structure are made. The data structure is NOT rebuilt from scratch. This guarantees that packet processing is affected as little as possible during updates. Although nf-HiPAC and Compact Filter use completely different approaches and algorithms to build the lookup data structure, it is important that you understand the following: nf-HiPAC and Compact Filter end up with a very similar lookup data structure in the kernel. > These experiments have to be updated but can you comment on this: > http://www.cs.aau.dk/~mixxel/cf/experiments.html The current version of the algorithm used in nf-HiPAC does not optimize certain aspects of the lookup data structure, in order to increase the speed of dynamic rule set updates. This means that the lookup data structure is larger than it really needs to be because it contains some unnecessary redundancy. This explains your test results. Compact Filter and nf-HiPAC perform the same when both are able to keep their lookup data structure in the CPU caches, and again when neither is able to do so anymore. Compact Filter currently performs better in the range where it can still keep its data structure in the caches while nf-HiPAC no longer can. Most aspects of your performance tests are quite nice (e.g. generating the traffic by replaying a packet header trace).
But your performance tests have a serious flaw: You construct your rule set by creating one rule for each entry in your packet header trace. This results in a completely artificial rule set that creates a lot of redundancy in the nf-HiPAC lookup data structure, making it much larger than the Compact Filter data structure. You have to understand that with real-world rule sets the size of the computed lookup data structure will not be much different for Compact Filter and nf-HiPAC. This means that when you use real-world rule sets there shouldn't be any noticeable difference in lookup performance between Compact Filter and nf-HiPAC. ----------------- I am currently working on a new, improved version of the algorithm used in nf-HiPAC. The new algorithmic core will reduce memory usage while at the same time improving the running time of insert and delete operations. The lookup performance will be improved too, especially for bigger rule sets. The concepts and the design are already developed, but the implementation is still in its early stages. The new algorithmic core will make sure that the lookup data structure in the kernel is always fully optimized while at the same time allowing very fast dynamic updates. At that point Compact Filter will not be able to win in any performance test against nf-HiPAC anymore, simply because there is no way to optimize the lookup data structure any further.
Regards, +---------------------------+ | Michael Bellion | | | +---------------------------+ From liuk001@gmail.com Mon Sep 26 07:46:59 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 07:47:17 -0700 (PDT) Received: from zproxy.gmail.com (zproxy.gmail.com [64.233.162.200]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QEkwiL009467 for ; Mon, 26 Sep 2005 07:46:59 -0700 Received: by zproxy.gmail.com with SMTP id r28so153115nza for ; Mon, 26 Sep 2005 07:44:09 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:reply-to:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=LRXXqlgjE8Yitpr/M/iZB8TIglwVVm2XmXOakLMuXWdOj8tU6pb7FGQPTgZ95Z187Zsk9SBrd19GXcnDWnNdgYdbt6CXyKVV2jZvC93WawEofsrKxue26cudG/ZAEX3WGvIT0ZAY95q2otpX+YlgDYMTCLufeeQTC0aCywmTgC4= Received: by 10.54.18.74 with SMTP id 74mr1728785wrr; Mon, 26 Sep 2005 07:44:09 -0700 (PDT) Received: by 10.54.70.6 with HTTP; Mon, 26 Sep 2005 07:44:09 -0700 (PDT) Message-ID: <68559cef05092607441dd8e961@mail.gmail.com> Date: Mon, 26 Sep 2005 16:44:09 +0200 From: Luca Maranzano Reply-To: Luca Maranzano To: Nishanth Aravamudan , "LinuxVirtualServer.org users mailing list." 
Subject: Re: ipvs_syncmaster brings cpu to 100% Cc: netdev@oss.sgi.com In-Reply-To: <20050926142109.GD7532@us.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Disposition: inline References: <68559cef050908090657fc2599@mail.gmail.com> <68559cef05092207022f1f0df4@mail.gmail.com> <498263350509230815eb08a73@mail.gmail.com> <20050926032807.GI18357@verge.net.au> <20050926043400.GD5079@us.ibm.com> <20050926080508.GF11027@verge.net.au> <20050926081229.GA23755@verge.net.au> <20050926131104.GA7532@us.ibm.com> <68559cef05092606521cc13f9a@mail.gmail.com> <20050926142109.GD7532@us.ibm.com> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j8QEkwiL009467 X-archive-position: 3674 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: liuk001@gmail.com Precedence: bulk X-list: netdev Content-Length: 4264 Lines: 103 [trimmed Cc to avoid spamming...] Ok, just to summarize the long thread from the beginning: The goal: setting up a Local Director with IPVS with state synchronization, failover and failback. The hardware: 1 CPU Intel Xeon 3.4 GHz - HP DL380G4 on 2 identical boxes The problems (please note that all kernel versions are *Debian* kernels): 1. Kernel 2.6.8: got a system lock of the standby node when simulating a failover. The load average as reported by "top" or "w" is always 0.00. 2. Kernel 2.6.11 and Kernel 2.6.12: failover and failback work fine, but the load average as reported by "top" or "w" is systematically at 2.00 or more with both sync threads started (ipvs_syncmaster and ipvs_syncbackup). The load average from top is 1.00 or more with only one thread (i.e. ipvs_syncmaster). Horms reported that he was not able to reproduce this on a non-Xeon system. That's all, let me know if you need more info.
Regards, Luca On 26/09/05, Nishanth Aravamudan wrote: > On 26.09.2005 [15:52:02 +0200], Luca Maranzano wrote: > > On 26/09/05, Nishanth Aravamudan wrote: > > > On 26.09.2005 [17:12:32 +0900], Horms wrote: > > > > On Mon, Sep 26, 2005 at 05:05:10PM +0900, Horms wrote: > > > > > > > > [snip] > > > > > > > > > > > > > Furthermore, if I make an "rgrep" in the source tree of kernel 2.6.12 > > > > > > > > > the function schedule_timeout() is more used than the ssleep() (517 > > > > > > > > > occurrencies vs. 43), so why in ip_vs_sync.c there was this change? > > > > > > > > > > > > > > > > > > The other oddity is that Horms reported on this list that on non Xeon > > > > > > > > > CPU the same version of kernel of mine does not present the problem. > > > > > > > > > > > > > > > > > > I'm getting crazy :-) > > > > > > > > > > > > > > I've prepared a patch, which reverts the change which was introduced > > > > > > > by Nishanth Aravamudan in February. > > > > > > > > > > > > Was the 100% cpu utilization only occurring on Xeon processors? > > > > > > > > > > That seems to be the only case where were this problem has been > > > > > observed. I don't have such a processor myself, so I haven't actually > > > > > been able to produce the problem locally. > > > > > > > > > > One reason I posted this issue to netdev was to get some more > > > > > eyes on the problem as it is puzzling to say the least. > > > > > > > > > > > Care to try to use msleep_interruptible() instead of ssleep(), as > > > > > > opposed to schedule_timeout()? > > > > > > > > > > I will send a version that does that shortly, Luca, can > > > > > you plase check that too? > > > > > > > > Here is that version of the patch. Nishanth, I take it that I do not > > > > need to set TASK_INTERRUPTABLE before calling msleep_interruptible(), > > > > please let me know if I am wrong. > > > > > > Yes, exactly. 
I'm just trying to narrow it down to see if it's the task > > > state that's causing the issue (which, to be honest, doesn't make a lot > > > of sense to me -- with ssleep() your load average will go up as the task > > > will be UNINTERRUPTIBLE state, but I am not sure why utilisation would > > > rise, as you are still sleeping...) > > [trimmed lvs-users from my reply, as it is a closed list] > > > Just to add more info, please note the output of "ps": > > > > debld1:~# ps aux|grep ipvs > > root 3748 0.0 0.0 0 0 ? D 12:09 0:00 > > [ipvs_syncmaster] > > root 3757 0.0 0.0 0 0 ? D 12:09 0:00 > > [ipvs_syncbackup] > > > > Note the D status, i.e. (from ps(1) man page): Uninterruptible sleep > > (usually IO) > > The msleep_interruptible() change should fix that. > > But that does not show 100% CPU utilisation at all, it shows 0. Did you > mean to say your load increases? > > I'm still unclear what the problem is. Horms initial Cc trimmed some > important information. It would be very useful to "start over" -- at > least from the perspective of what the problem actually is. > > > I hope to have a Xeon machine to make some more tests in the next > > days, in the mean time I'll try to reproduce my setup on a couple of > > VMWare Workstation machines. > > Please don't top-most. It makes it really hard to write sane replies... 
> > Thanks, > Nish > From fleury@cs.aau.dk Mon Sep 26 08:10:30 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 08:10:45 -0700 (PDT) Received: from smtp.cs.aau.dk (smtp.cs.aau.dk [130.225.194.6]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QFASiL013004 for ; Mon, 26 Sep 2005 08:10:29 -0700 Received: from [130.225.195.123] (fleury@rade7.e.cs.auc.dk [130.225.195.123]) by smtp.cs.aau.dk (8.13.4/8.13.4) with ESMTP id j8QF7ZrD002077; Mon, 26 Sep 2005 17:07:35 +0200 Message-ID: <43380E4A.1060604@cs.aau.dk> Date: Mon, 26 Sep 2005 17:05:46 +0200 From: Emmanuel Fleury User-Agent: Debian Thunderbird 1.0.6 (X11/20050802) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Michael Bellion , linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [ANNOUNCE] Release of nf-HiPAC 0.9.0 References: <200509260445.46740.mbellion@hipac.org> <4337DA7C.2000804@cs.aau.dk> <200509261638.12731.mbellion@hipac.org> In-Reply-To: <200509261638.12731.mbellion@hipac.org> X-Enigmail-Version: 0.92.0.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.52 on 130.225.194.6 X-archive-position: 3675 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: fleury@cs.aau.dk Precedence: bulk X-list: netdev Content-Length: 3040 Lines: 66 Michael Bellion wrote: > > The current version of the algorithm used in nf-HiPAC does not optimize > certain aspects of the lookup data structure in order to increase the speed > of dynamic rule set updates. > This means that the lookup data structure is larger than it really needs to be > because it contains some unnecessary redundancy. Could you quantify how much this "unnecessary redundancy" adds to the size of the filter? Last time I looked it was quite huge (you may have improved it since). And having a fat kernel does not help in backbones.
> But your performance tests have a serious flaw: > You construct your rule set by creating one rule for each entry in your packet > header trace. This results in an completely artificial rule set that creates > a lot of redundancy in the nf-HiPAC lookup data structure making it much > larger than the Compact Filter data structure. Yes, it was intended to be a worst case for our scheme (not realistic but worst case). We were more interested in comparing the complexity of the different algorithms than the efficiency of particular implementations. I don't consider this a flaw in our experiment, because our goal was not a real-world proof of concept but rather empirical evidence for a theoretical result. > You have to understand that with real world rule sets the size of the computed > lookup data structure will not be much different for Compact Filter and > nf-HiPAC. This means that when you use real world rule sets there shouldn't > be any noticeable difference in lookup performance betweeen Compact Filter > and nf-HiPAC. You might be right, but admit that the big problem with your algorithm is the size of your data structure in kernel space. What you gain in speed, you lose in memory. And this IS an issue on routers (IMHO). > I am currently working on a new improved version of the algorithm used in > nf-HiPAC. The new algorithmic core will reduce memory usage while at the same > time improving the running time of insert and delete operations. The lookup > performance will be improved too, especially for bigger rulesets. The > concepts and the design are already developed, but the implementation is > still in its early stages. > > The new algorithmic core will make sure that the lookup data structure in the > kernel is always fully optimized while at the same time allowing very fast > dynamic updates.
> > At that point Compact Filter will not be able to win in any performance test > against nf-HiPAC anymore, simply because there is no way to optimize the > lookup data structure any further. Well, you already said this last time we had exchanged some mails (it was more than one year ago if I count well). Anyway, I doubt you can get something that you can update dynamically AND small in size following your way of doing. But, prove me wrong and I'll be happy. :) Regards -- Emmanuel Fleury Ideals are dangerous things. Realities are better. They wound but they are better. -- Oscar Wilde From mbellion@hipac.org Mon Sep 26 09:06:26 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 09:06:40 -0700 (PDT) Received: from justus.rz.uni-saarland.de (justus.rz.uni-saarland.de [134.96.7.31]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QG6PiL023249 for ; Mon, 26 Sep 2005 09:06:25 -0700 Received: from e135.stw.stud.uni-saarland.de (e135.stw.stud.uni-saarland.de [134.96.65.150]) by justus.rz.uni-saarland.de (8.12.10/8.12.10) with ESMTP id j8QG3S2611491260; Mon, 26 Sep 2005 18:03:28 +0200 (CEST) From: Michael Bellion To: Emmanuel Fleury Subject: Re: [ANNOUNCE] Release of nf-HiPAC 0.9.0 Date: Mon, 26 Sep 2005 18:03:27 +0200 User-Agent: KMail/1.8.1 Cc: linux-kernel@vger.kernel.org, netdev@oss.sgi.com References: <200509260445.46740.mbellion@hipac.org> <200509261638.12731.mbellion@hipac.org> <43380E4A.1060604@cs.aau.dk> In-Reply-To: <43380E4A.1060604@cs.aau.dk> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200509261803.28150.mbellion@hipac.org> X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.5.1 (justus.rz.uni-saarland.de [134.96.7.31]); Mon, 26 Sep 2005 18:03:28 +0200 (CEST) X-AntiVirus: checked by AntiVir Milter 1.0.6; AVE 6.32.0.6; VDF 6.32.0.43 X-archive-position: 3676 X-ecartis-version: Ecartis v1.0.0 Sender: 
netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: mbellion@hipac.org Precedence: bulk X-list: netdev Content-Length: 2594 Lines: 58 Hi, > > But your performance tests have a serious flaw: > > You construct your rule set by creating one rule for each entry in your > > packet header trace. This results in an completely artificial rule set > > that creates a lot of redundancy in the nf-HiPAC lookup data structure > > making it much larger than the Compact Filter data structure. > > Yes, it was intended to be a worst case for our scheme (not realistic > but worst case).. Sorry, but this is far away from the worst case for your scheme. Actually it is a quite good case for your compiler, because every rule is fully specified (meaning there are no wildcards in any rule) and there are no ranges or masks involved. Try using a mixed rule set that contains rules that only specify certain dimensions and have wildcards on the other dimensions. Try using rules with ranges and masks. Try using overlapping rules, meaning rules that completely or partly overlap other rules in certain dimensions. This will make your data structure grow! > > I am currently working on a new improved version of the algorithm used in > > nf-HiPAC. The new algorithmic core will reduce memory usage while at the > > same time improving the running time of insert and delete operations. The > > lookup performance will be improved too, especially for bigger rulesets. > > The concepts and the design are already developed, but the implementation > > is still in its early stages. > > > > The new algorithmic core will make sure that the lookup data structure in > > the kernel is always fully optimized while at the same time allowing very > > fast dynamic updates. > > > > At that point Compact Filter will not be able to win in any performance > > test against nf-HiPAC anymore, simply because there is no way to > > optimize the lookup data structure any further. 
> > Well, you already said this last time we had exchanged some mails > (it was more than one year ago if I count well). Yes, you are right. The HiPAC project has gone through some tough times over the last 2 years. With MARA Systems the HiPAC Project has finally found a strong partner that is fully committed to the concept of Open Source Software. This allows me to continue the development of HiPAC under the GNU GPL license. > Anyway, I doubt you can get something that you can update dynamically > AND small in size following your way of doing. But, prove me wrong and > I'll be happy. :) Ok, I'll do that :) Regards, +---------------------------+ | Michael Bellion | | | +---------------------------+ From fleury@cs.aau.dk Mon Sep 26 09:36:30 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 09:36:43 -0700 (PDT) Received: from smtp.cs.aau.dk (smtp.cs.aau.dk [130.225.194.6]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QGaTiL027852 for ; Mon, 26 Sep 2005 09:36:29 -0700 Received: from [130.225.195.123] (fleury@rade7.e.cs.auc.dk [130.225.195.123]) by smtp.cs.aau.dk (8.13.4/8.13.4) with ESMTP id j8QGXX0a009078; Mon, 26 Sep 2005 18:33:33 +0200 Message-ID: <43382271.90400@cs.aau.dk> Date: Mon, 26 Sep 2005 18:31:45 +0200 From: Emmanuel Fleury User-Agent: Debian Thunderbird 1.0.6 (X11/20050802) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Michael Bellion CC: linux-kernel@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [ANNOUNCE] Release of nf-HiPAC 0.9.0 References: <200509260445.46740.mbellion@hipac.org> <200509261638.12731.mbellion@hipac.org> <43380E4A.1060604@cs.aau.dk> <200509261803.28150.mbellion@hipac.org> In-Reply-To: <200509261803.28150.mbellion@hipac.org> X-Enigmail-Version: 0.92.0.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.52 on 130.225.194.6 X-archive-position: 3677 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: 
netdev-bounce@oss.sgi.com X-original-sender: fleury@cs.aau.dk Precedence: bulk X-list: netdev Michael Bellion wrote: > > Sorry, but this is far away from the worst case for your scheme. Actually it > is a quite good case for your compiler, because every rule is fully specified > (meaning there are no wildcards in any rule) and there are no ranges or masks > involved. > Try using a mixed rule set that contains rules that only specify certain > dimensions and have wildcards on the other dimensions. Try using rules with > ranges and masks. > Try using overlapping rules, meaning rules that completely or partly overlap > other rules in certain dimensions. > This will make your data structure grow! I think you misunderstood our experiment. In fact, we were trying to generate as many distinct singletons as possible on the domain (each of our rules was the header of a packet which had not been seen before), because if we can group these rules into intervals, then our scheme has some advantages. We were using IDDs (Interval Decision Diagrams), which are a kind of extended BDD (Binary Decision Diagram) where you take your decision by looking at a partition of the possible values of the variable. For example, looking at the value x in [0,1024] where [0,128] leads to one node in the decision tree, [129,256] to another and [257,1024] to a last one. The more this partition is fragmented, the more you increase the size of the structure. Having a lot of overlap certainly does increase the number of partitions, but adding singletons is the simplest way to increase the number of partitions. Take a look at this paper, maybe you can get some ideas for your scheme (it might be that some hybrid between your ideas and ours could work): http://www.cs.aau.dk/~fleury/download/papers/tc04.pdf > Yes, you are right. The HiPAC project has gone through some tough times over > the last 2 years.
With MARA Systems the HiPAC Project has finally found a > strong partner that is fully committed to the concept of Open Source > Software. This allows me to continue the development of HiPAC under the GNU > GPL license. I'm always happy to see a firm funding some Open Source project. So, I can do nothing else but wish you good luck for the future. :) > Ok, I'll do that :) Good. :) Regards -- Emmanuel Fleury As usual, goodness hardly puts up a fight. -- Calvin & Hobbes (Bill Watterson) From nacc@us.ibm.com Mon Sep 26 10:54:24 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 26 Sep 2005 10:54:34 -0700 (PDT) Received: from e2.ny.us.ibm.com (e2.ny.us.ibm.com [32.97.182.142]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8QHsHiL006697 for ; Mon, 26 Sep 2005 10:54:24 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e2.ny.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8QHpNfa025213 for ; Mon, 26 Sep 2005 13:51:23 -0400 Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217]) by d01relay04.pok.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8QHpNEZ082418 for ; Mon, 26 Sep 2005 13:51:23 -0400 Received: from d01av03.pok.ibm.com (loopback [127.0.0.1]) by d01av03.pok.ibm.com (8.12.11/8.13.3) with ESMTP id j8QHpMES010185 for ; Mon, 26 Sep 2005 13:51:23 -0400 Received: from arkanoid (sig-9-49-133-243.mts.ibm.com [9.49.133.243]) by d01av03.pok.ibm.com (8.12.11/8.12.11) with ESMTP id j8QHpMcd009573; Mon, 26 Sep 2005 13:51:22 -0400 Received: by arkanoid (Postfix, from userid 1000) id 184EEEACF4; Mon, 26 Sep 2005 10:51:12 -0700 (PDT) Date: Mon, 26 Sep 2005 10:51:12 -0700 From: Nishanth Aravamudan To: Luca Maranzano Cc: "LinuxVirtualServer.org users mailing list."
, netdev@oss.sgi.com Subject: Re: ipvs_syncmaster brings cpu to 100% Message-ID: <20050926175112.GF7532@us.ibm.com> References: <68559cef05092207022f1f0df4@mail.gmail.com> <498263350509230815eb08a73@mail.gmail.com> <20050926032807.GI18357@verge.net.au> <20050926043400.GD5079@us.ibm.com> <20050926080508.GF11027@verge.net.au> <20050926081229.GA23755@verge.net.au> <20050926131104.GA7532@us.ibm.com> <68559cef05092606521cc13f9a@mail.gmail.com> <20050926142109.GD7532@us.ibm.com> <68559cef05092607441dd8e961@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <68559cef05092607441dd8e961@mail.gmail.com> X-Operating-System: Linux 2.6.14-rc2 (x86_64) User-Agent: Mutt/1.5.9i X-archive-position: 3678 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nacc@us.ibm.com Precedence: bulk X-list: netdev On 26.09.2005 [16:44:09 +0200], Luca Maranzano wrote: > On 26/09/05, Nishanth Aravamudan wrote: > > On 26.09.2005 [15:52:02 +0200], Luca Maranzano wrote: > > > On 26/09/05, Nishanth Aravamudan wrote: > > > > On 26.09.2005 [17:12:32 +0900], Horms wrote: > > > > > On Mon, Sep 26, 2005 at 05:05:10PM +0900, Horms wrote: > > > > > > > > > > [snip] > > > > > > > > > > > > > > > Furthermore, if I make an "rgrep" in the source tree of kernel 2.6.12 > > > > > > > > > > the function schedule_timeout() is more used than the ssleep() (517 > > > > > > > > > > occurrencies vs. 43), so why in ip_vs_sync.c there was this change? > > > > > > > > > > > > > > > > > > > > The other oddity is that Horms reported on this list that on non Xeon > > > > > > > > > > CPU the same version of kernel of mine does not present the problem. > > > > > > > > > > > > > > > > > > > > I'm getting crazy :-) > > > > > > > > > > > > > > > > I've prepared a patch, which reverts the change which was introduced > > > > > > > > by Nishanth Aravamudan in February. 
> > > > > > > > > > > > > > Was the 100% cpu utilization only occurring on Xeon processors? > > > > > > > > > > > > That seems to be the only case where were this problem has been > > > > > > observed. I don't have such a processor myself, so I haven't actually > > > > > > been able to produce the problem locally. > > > > > > > > > > > > One reason I posted this issue to netdev was to get some more > > > > > > eyes on the problem as it is puzzling to say the least. > > > > > > > > > > > > > Care to try to use msleep_interruptible() instead of ssleep(), as > > > > > > > opposed to schedule_timeout()? > > > > > > > > > > > > I will send a version that does that shortly, Luca, can > > > > > > you plase check that too? > > > > > > > > > > Here is that version of the patch. Nishanth, I take it that I do not > > > > > need to set TASK_INTERRUPTABLE before calling msleep_interruptible(), > > > > > please let me know if I am wrong. > > > > > > > > Yes, exactly. I'm just trying to narrow it down to see if it's the task > > > > state that's causing the issue (which, to be honest, doesn't make a lot > > > > of sense to me -- with ssleep() your load average will go up as the task > > > > will be UNINTERRUPTIBLE state, but I am not sure why utilisation would > > > > rise, as you are still sleeping...) > > > > [trimmed lvs-users from my reply, as it is a closed list] > > > > > Just to add more info, please note the output of "ps": > > > > > > debld1:~# ps aux|grep ipvs > > > root 3748 0.0 0.0 0 0 ? D 12:09 0:00 > > > [ipvs_syncmaster] > > > root 3757 0.0 0.0 0 0 ? D 12:09 0:00 > > > [ipvs_syncbackup] > > > > > > Note the D status, i.e. (from ps(1) man page): Uninterruptible sleep > > > (usually IO) > > > > The msleep_interruptible() change should fix that. > > > > But that does not show 100% CPU utilisation at all, it shows 0. Did you > > mean to say your load increases? > > > > I'm still unclear what the problem is. Horms initial Cc trimmed some > > important information. 
It would be very useful to "start over" -- at > > least from the perspective of what the problem actually is. > > > > > I hope to have a Xeon machine to make some more tests in the next > > > days, in the mean time I'll try to reproduce my setup on a couple of > > > VMWare Workstation machines. > > > > Please don't top-post. It makes it really hard to write sane replies... > > [trimmed Cc to avoid spamming...] > > Ok, just to summarize the long thread from the beginning: > > The goal: setting up a Local Director with IPVS with state > synchronization, failover and failback. > > The hardware: 1 CPU Intel Xeon 3.4 GHz - HP DL380G4 on 2 identical boxes > > The problems (please note that all kernel versions are *Debian* kernels): > 1. Kernel 2.6.8: got a system lock of the standby node when simulating > a failover. The load average as reported from "top" or "w" is always > 0.00. > > 2. Kernel 2.6.11 and Kernel 2.6.12: failover and failback work fine, > but the load average as reported from "top" or "w" is always > systematically at 2.00 or more with both sync threads started > (ipvs_syncmaster and ipvs_syncbackup). Load average from top is 1.00 > or more with only one thread (i.e. ipvs_syncmaster). Horms reported > that he was not able to reproduce this on a non-Xeon system. Ok, so whoever mentioned "CPU utilisation" was mistaken. The load average being 2 is due to ssleep(). The msleep_interruptible() version of the patch should fix that up. It really doesn't make any difference in the code, except that your load average will go back to 0.00 and the ipvs threads can be interrupted by signals. I would expect the load average to be 2.00 for all systems, not just Xeon. The system lock has nothing to do with the patch, though. Something else fixed it. Thanks, Nish P.S. Again, please don't top-post, it makes it harder for me to reply (and disinclines me to do so).
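The load-average point above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: it assumes the usual Linux behaviour that tasks in the runnable (R) and uninterruptible-sleep (D) states count towards the load average while interruptible sleepers (S) do not, and it approximates the 1-minute figure with an exponentially damped average sampled every 5 seconds. The names `active_tasks` and `load_average` are made up for this illustration.

```python
import math

# Task states as shown in the STAT column of ps(1).
RUNNING = "R"          # runnable or currently on a CPU
UNINTERRUPTIBLE = "D"  # e.g. a kernel thread sleeping in ssleep()
INTERRUPTIBLE = "S"    # e.g. a kernel thread sleeping in msleep_interruptible()

SAMPLE_INTERVAL = 5.0  # seconds between samples (assumed, roughly what the kernel uses)
WINDOW = 60.0          # models the 1-minute load average


def active_tasks(states):
    """Count tasks that contribute to the load: R and D, but not S."""
    return sum(1 for s in states if s in (RUNNING, UNINTERRUPTIBLE))


def load_average(states, seconds=600.0):
    """Exponentially damped average of active_tasks() over `seconds`."""
    decay = math.exp(-SAMPLE_INTERVAL / WINDOW)
    load = 0.0
    for _ in range(int(seconds / SAMPLE_INTERVAL)):
        load = load * decay + active_tasks(states) * (1.0 - decay)
    return load


# Hypothetical scenario: ipvs_syncmaster and ipvs_syncbackup, both idle.
with_ssleep = [UNINTERRUPTIBLE, UNINTERRUPTIBLE]
with_msleep_interruptible = [INTERRUPTIBLE, INTERRUPTIBLE]

print(f"ssleep():               {load_average(with_ssleep):.2f}")                # 2.00
print(f"msleep_interruptible(): {load_average(with_msleep_interruptible):.2f}")  # 0.00
```

In this model neither variant uses any CPU time; the only difference is whether the two sleeping threads are counted, which matches the "D" state in the ps output and the reported load of 2.00 with both sync threads started.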
From jordi@baylina.org Tue Sep 27 09:16:58 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Sep 2005 09:17:05 -0700 (PDT) Received: from powy.masterasp.com (ns2.masterasp.com [217.75.228.66]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id j8RGGuiL022875 for ; Tue, 27 Sep 2005 09:16:57 -0700 Received: (qmail 20230 invoked by uid 89); 27 Sep 2005 16:17:59 -0000 Message-ID: <20050927161759.20229.qmail@powy.masterasp.com> From: jordi@baylina.org To: linux-kernel@vger.kernel.org, linux-net@vger.kernel.org, netdev@oss.sgi.com Subject: Idea for packet classification. Date: Tue, 27 Sep 2005 18:17:59 +0200 Mime-Version: 1.0 Content-Type: text/plain; format=flowed; charset="utf-8" Content-Transfer-Encoding: 8bit X-archive-position: 3680 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: jordi@baylina.org Precedence: bulk X-list: netdev Content-Length: 2016 Lines: 62 The idea is to create a set of iptables TARGETS that classify packets. When a packet is classified, a classification/value pair is associated with the packet. These classifications can then be used in an iptables filter rule, in a routing table selection rule or in a tc classification filter. For example:
#iptables -A INPUT -j CLS user --classifier tcfilter --filtername u32 …
#iptables -A INPUT -j CLS quota_plan --classifier hash --table user_to_quota --input cls user
#iptables -A INPUT -j CLS tos --classifier tos
#iptables -A FORWARD -p tcp -port 5343 -cls quota_plan=1 -j DROP
So in this example, when a packet arrives, the source address is taken and translated directly to a user, and the packet is marked with the user id, i.e. the packet has an associated classification user=23. In the second line a hash table classifies the packet. The user is taken as input and a quota plan is produced as output.
So after the second rule, the packet has 2 associated classifications: user=23 quota_plan=2 The 3rd line classifies the packet by TOS, so the packet has 3 classifications: user=23 quota_plan=2 tos=0 Once a packet is classified, those classifications can be used in a filter rule, in a routing rule or in a traffic shaping queue classification. A packet can have many classifications. Those classifications can be used at any time in the packet's life. In the 4th line of the example, the rule drops all TCP packets on port 5343 that have been classified with quota_plan=1. The 1st line of the example uses a tc filter wrapper to classify the packet. This idea would be an extension of the MARK target. I am planning to write a patch that implements a couple of functions to attach classifications to the sk_buff structure and to look up the classifications of an sk_buff. Do you think this is interesting, or are you planning to do packet classification in another way, so that I would be wasting my time? Thank you, Jordi From herbert@gondor.apana.org.au Tue Sep 27 19:59:22 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Sep 2005 19:59:36 -0700 (PDT) Received: from arnor.apana.org.au (22.107.233.220.exetel.com.au [220.233.107.22]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8S2xKiL017793 for ; Tue, 27 Sep 2005 19:59:21 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.36 #1 (Debian)) id 1EKS6q-0007IR-00; Wed, 28 Sep 2005 12:55:52 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1EKS6j-0006s4-00; Wed, 28 Sep 2005 12:55:45 +1000 From: Herbert Xu To: davem@davemloft.net (David S.
Miller) Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Cc: suzannew@cs.pdx.edu, linux-kernel@vger.kernel.org, Robert.Olsson@data.slu.se, paulmck@us.ibm.com, walpole@cs.pdx.edu, netdev@oss.sgi.com Organization: Core In-Reply-To: <20050927.135626.88296134.davem@davemloft.net> X-Newsgroups: apana.lists.os.linux.kernel User-Agent: tin/1.7.4-20040225 ("Benbecula") (UNIX) (Linux/2.4.27-hx-1-686-smp (i686)) Message-Id: Date: Wed, 28 Sep 2005 12:55:45 +1000 X-archive-position: 3681 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Content-Length: 998 Lines: 26 David S. Miller wrote: > > I agree with the changes to add rcu_dereference() use. > Those were definitely lacking and needed. Actually I'm not so sure that they are all needed. I only looked at the very first one in the patch which is in in_dev_get(). That one certainly isn't necessary because the old value of ip_ptr is valid as long as the reference count does not hit zero. The latter is guaranteed by the increment in in_dev_get(). Because of the pervasiveness of reference counting in the network stack, I believe that we should scrutinise the other bits in the patch too to make sure that they are all needed. In general, using rcu_dereference/rcu_assign_pointer does not guarantee correct code. We really need to look at each case individually.
Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From horms@koto.vergenet.net Tue Sep 27 23:03:05 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 27 Sep 2005 23:03:23 -0700 (PDT) Received: from koto.vergenet.net (koto.vergenet.net [210.128.90.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8S634iL003931 for ; Tue, 27 Sep 2005 23:03:05 -0700 Received: by koto.vergenet.net (Postfix, from userid 7100) id AA45834028; Wed, 28 Sep 2005 15:00:09 +0900 (JST) Date: Wed, 28 Sep 2005 11:23:09 +0900 From: Horms To: Nishanth Aravamudan Cc: Luca Maranzano , Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com Subject: Re: ipvs_syncmaster brings cpu to 100% Message-ID: <20050928022307.GK18765@verge.net.au> Mail-Followup-To: Nishanth Aravamudan , Luca Maranzano , Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com References: <498263350509081605956a771@mail.gmail.com> <68559cef05092207022f1f0df4@mail.gmail.com> <498263350509230815eb08a73@mail.gmail.com> <20050926032807.GI18357@verge.net.au> <20050926043400.GD5079@us.ibm.com> <20050926080508.GF11027@verge.net.au> <20050926081229.GA23755@verge.net.au> <20050926131104.GA7532@us.ibm.com> <68559cef05092606521cc13f9a@mail.gmail.com> <20050926142109.GD7532@us.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050926142109.GD7532@us.ibm.com> X-Cluestick: seven User-Agent: Mutt/1.5.10i X-archive-position: 3682 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: horms@verge.net.au Precedence: bulk X-list: netdev Content-Length: 3075 Lines: 71 On Mon, Sep 26, 2005 at 07:21:09AM -0700, Nishanth Aravamudan wrote: > On 26.09.2005 [15:52:02 +0200], Luca Maranzano wrote: > > On 26/09/05, Nishanth Aravamudan wrote: > > > 
On 26.09.2005 [17:12:32 +0900], Horms wrote: > > > > On Mon, Sep 26, 2005 at 05:05:10PM +0900, Horms wrote: > > > > > > > > [snip] > > > > > > > > > > > > > Furthermore, if I make an "rgrep" in the source tree of kernel 2.6.12 > > > > > > > > > the function schedule_timeout() is more used than the ssleep() (517 > > > > > > > > > occurrencies vs. 43), so why in ip_vs_sync.c there was this change? > > > > > > > > > > > > > > > > > > The other oddity is that Horms reported on this list that on non Xeon > > > > > > > > > CPU the same version of kernel of mine does not present the problem. > > > > > > > > > > > > > > > > > > I'm getting crazy :-) > > > > > > > > > > > > > > I've prepared a patch, which reverts the change which was introduced > > > > > > > by Nishanth Aravamudan in February. > > > > > > > > > > > > Was the 100% cpu utilization only occurring on Xeon processors? > > > > > > > > > > That seems to be the only case where were this problem has been > > > > > observed. I don't have such a processor myself, so I haven't actually > > > > > been able to produce the problem locally. > > > > > > > > > > One reason I posted this issue to netdev was to get some more > > > > > eyes on the problem as it is puzzling to say the least. > > > > > > > > > > > Care to try to use msleep_interruptible() instead of ssleep(), as > > > > > > opposed to schedule_timeout()? > > > > > > > > > > I will send a version that does that shortly, Luca, can > > > > > you plase check that too? > > > > > > > > Here is that version of the patch. Nishanth, I take it that I do not > > > > need to set TASK_INTERRUPTABLE before calling msleep_interruptible(), > > > > please let me know if I am wrong. > > > > > > Yes, exactly. 
I'm just trying to narrow it down to see if it's the task > > > state that's causing the issue (which, to be honest, doesn't make a lot > > > of sense to me -- with ssleep() your load average will go up as the task > > > will be UNINTERRUPTIBLE state, but I am not sure why utilisation would > > > rise, as you are still sleeping...) > > [trimmed lvs-users from my reply, as it is a closed list] > > > Just to add more info, please note the output of "ps": > > > > debld1:~# ps aux|grep ipvs > > root 3748 0.0 0.0 0 0 ? D 12:09 0:00 > > [ipvs_syncmaster] > > root 3757 0.0 0.0 0 0 ? D 12:09 0:00 > > [ipvs_syncbackup] > > > > Note the D status, i.e. (from ps(1) man page): Uninterruptible sleep > > (usually IO) > > The msleep_interruptible() change should fix that. > > But that does not show 100% CPU utilisation at all, it shows 0. Did you > mean to say your load increases? The full discussion is available online at the following URL. I can get that information and post it all here if that is desirable.
http://archive.linuxvirtualserver.org/html/lvs-users/2005-09/msg00031.html -- Horms From nacc@us.ibm.com Wed Sep 28 06:29:38 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 28 Sep 2005 06:29:41 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8SDTViL016541 for ; Wed, 28 Sep 2005 06:29:38 -0700 Received: from westrelay02.boulder.ibm.com (westrelay02.boulder.ibm.com [9.17.195.11]) by e35.co.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8SDOI6s015802 for ; Wed, 28 Sep 2005 09:24:18 -0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by westrelay02.boulder.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8SDQen9544982 for ; Wed, 28 Sep 2005 07:26:40 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11/8.13.3) with ESMTP id j8SDQd38024905 for ; Wed, 28 Sep 2005 07:26:39 -0600 Received: from arkanoid (sig-9-65-36-182.mts.ibm.com [9.65.36.182]) by d03av03.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j8SDQcX8024887; Wed, 28 Sep 2005 07:26:38 -0600 Received: by arkanoid (Postfix, from userid 1000) id BEC1AEB39E; Wed, 28 Sep 2005 06:26:39 -0700 (PDT) Date: Wed, 28 Sep 2005 06:26:39 -0700 From: Nishanth Aravamudan To: Luca Maranzano , Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com Subject: Re: ipvs_syncmaster brings cpu to 100% Message-ID: <20050928132639.GA5791@us.ibm.com> References: <68559cef05092207022f1f0df4@mail.gmail.com> <498263350509230815eb08a73@mail.gmail.com> <20050926032807.GI18357@verge.net.au> <20050926043400.GD5079@us.ibm.com> <20050926080508.GF11027@verge.net.au> <20050926081229.GA23755@verge.net.au> <20050926131104.GA7532@us.ibm.com> <68559cef05092606521cc13f9a@mail.gmail.com> <20050926142109.GD7532@us.ibm.com> <20050928022307.GK18765@verge.net.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 
<20050928022307.GK18765@verge.net.au> X-Operating-System: Linux 2.6.14-rc2 (x86_64) User-Agent: Mutt/1.5.9i X-archive-position: 3683 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: nacc@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 3633 Lines: 79 On 28.09.2005 [11:23:09 +0900], Horms wrote: > On Mon, Sep 26, 2005 at 07:21:09AM -0700, Nishanth Aravamudan wrote: > > On 26.09.2005 [15:52:02 +0200], Luca Maranzano wrote: > > > On 26/09/05, Nishanth Aravamudan wrote: > > > > On 26.09.2005 [17:12:32 +0900], Horms wrote: > > > > > On Mon, Sep 26, 2005 at 05:05:10PM +0900, Horms wrote: > > > > > > > > > > [snip] > > > > > > > > > > > > > > > Furthermore, if I make an "rgrep" in the source tree of kernel 2.6.12 > > > > > > > > > > the function schedule_timeout() is more used than the ssleep() (517 > > > > > > > > > > occurrencies vs. 43), so why in ip_vs_sync.c there was this change? > > > > > > > > > > > > > > > > > > > > The other oddity is that Horms reported on this list that on non Xeon > > > > > > > > > > CPU the same version of kernel of mine does not present the problem. > > > > > > > > > > > > > > > > > > > > I'm getting crazy :-) > > > > > > > > > > > > > > > > I've prepared a patch, which reverts the change which was introduced > > > > > > > > by Nishanth Aravamudan in February. > > > > > > > > > > > > > > Was the 100% cpu utilization only occurring on Xeon processors? > > > > > > > > > > > > That seems to be the only case where were this problem has been > > > > > > observed. I don't have such a processor myself, so I haven't actually > > > > > > been able to produce the problem locally. > > > > > > > > > > > > One reason I posted this issue to netdev was to get some more > > > > > > eyes on the problem as it is puzzling to say the least. > > > > > > > > > > > > > Care to try to use msleep_interruptible() instead of ssleep(), as > > > > > > > opposed to schedule_timeout()? 
> > > > > > > > > > > > I will send a version that does that shortly, Luca, can > > > > > > you please check that too? > > > > > > > > > > Here is that version of the patch. Nishanth, I take it that I do not > > > > > need to set TASK_INTERRUPTABLE before calling msleep_interruptible(), > > > > > please let me know if I am wrong. > > > > > > > > Yes, exactly. I'm just trying to narrow it down to see if it's the task > > > > state that's causing the issue (which, to be honest, doesn't make a lot > > > > of sense to me -- with ssleep() your load average will go up as the task > > > > will be UNINTERRUPTIBLE state, but I am not sure why utilisation would > > > > rise, as you are still sleeping...) > > > > [trimmed lvs-users from my reply, as it is a closed list] > > > > > Just to add more info, please note the output of "ps": > > > > > > debld1:~# ps aux|grep ipvs > > > root 3748 0.0 0.0 0 0 ? D 12:09 0:00 > > > [ipvs_syncmaster] > > > root 3757 0.0 0.0 0 0 ? D 12:09 0:00 > > > [ipvs_syncbackup] > > > > > > Note the D status, i.e. (from ps(1) man page): Uninterruptible sleep > > > (usually IO) > > > > The msleep_interruptible() change should fix that. > > > > But that does not show 100% CPU utilisation at all, it shows 0. Did you > > mean to say your load increases? > > The full discussion is available online at the following URL. > I can get that information and post it all here if that is > desirable. > > http://archive.linuxvirtualserver.org/html/lvs-users/2005-09/msg00031.html Yes, the information in that thread is the same as what Luca said. It's a load average problem, not a CPU utilisation problem (those threads are sleeping!) If Luca could test the msleep_interruptible() version of the patch and it works (like I said, performance should not change, but the load average will drop by 2), then I will ACK the patch for mainline acceptance.
Thanks, Nish From paulmck@us.ibm.com Wed Sep 28 07:53:45 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 28 Sep 2005 07:53:51 -0700 (PDT) Received: from e34.co.us.ibm.com (e34.co.us.ibm.com [32.97.110.152]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8SErdiL023845 for ; Wed, 28 Sep 2005 07:53:45 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e34.co.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8SEnBZw021835 for ; Wed, 28 Sep 2005 10:49:11 -0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay04.boulder.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8SEpLct522202 for ; Wed, 28 Sep 2005 08:51:21 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11/8.13.3) with ESMTP id j8SEog5Q004906 for ; Wed, 28 Sep 2005 08:50:43 -0600 Received: from linux.local ([9.47.22.63]) by d03av03.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j8SEoTBG003303; Wed, 28 Sep 2005 08:50:41 -0600 Received: by linux.local (Postfix on SuSE Linux 7.3 (i386), from userid 500) id DE562148809; Wed, 28 Sep 2005 07:51:10 -0700 (PDT) Date: Wed, 28 Sep 2005 07:51:10 -0700 From: "Paul E. McKenney" To: Herbert Xu Cc: "David S. 
Miller" , suzannew@cs.pdx.edu, linux-kernel@vger.kernel.org, Robert.Olsson@data.slu.se, walpole@cs.pdx.edu, netdev@oss.sgi.com Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Message-ID: <20050928145110.GA4925@us.ibm.com> Reply-To: paulmck@us.ibm.com References: <20050927.135626.88296134.davem@davemloft.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.1i X-archive-position: 3685 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: paulmck@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 2027 Lines: 50 On Wed, Sep 28, 2005 at 12:55:45PM +1000, Herbert Xu wrote: > David S. Miller wrote: > > > > I agree with the changes to add rcu_dereference() use. > > Those were definitely lacking and needed. > > Actually I'm not so sure that they are all needed. I only looked > at the very first one in the patch which is in in_dev_get(). That > one certainly isn't necessary because the old value of ip_ptr > is valid as long as the reference count does not hit zero. > > The latter is guaranteed by the increment in in_dev_get(). > > Because of the pervasiveness of reference counting in the network stack, > I believe that we should scrutinise the other bits in the patch too > to make sure that they are all needed. > > In general, using rcu_dereference/rcu_assign_pointer does not > guarantee correct code. We really need to look at each case > individually. Yep, these two APIs are only part of the solution. The reference-count approach is only guaranteed to work if the kernel thread that did the reference-count increment is later referencing that same data element. Otherwise, one has the following possible situation on DEC Alpha: o CPU 0 initializes and inserts a new element into the data structure, using rcu_assign_pointer() to provide any needed memory barriers.
(Or, if RCU is not being used, under the appropriate update-side lock.) o CPU 1 acquires a reference to this new element, presumably using either a lock or rcu_read_lock() and rcu_dereference() in order to do so safely. CPU 1 then increments the reference count. o CPU 2 picks up a pointer to this new element, but in a way that relies on the reference count having been incremented, without using locking, rcu_read_lock(), rcu_dereference(), and so on. This CPU can then see the pre-initialized contents of the newly inserted data structure (again, but only on DEC Alpha). Again, if the same kernel thread that incremented the reference count is later accessing it, no problem, even on Alpha. Thanx, Paul From pp@ee.oulu.fi Wed Sep 28 08:01:10 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 28 Sep 2005 08:01:14 -0700 (PDT) Received: from ee.oulu.fi (ee.oulu.fi [130.231.61.23]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8SF18iL024854 for ; Wed, 28 Sep 2005 08:01:09 -0700 Received: from tk28.oulu.fi (tk28 [130.231.48.68]) by ee.oulu.fi (8.13.3/8.13.3) with ESMTP id j8SEwGcx003697; Wed, 28 Sep 2005 17:58:16 +0300 (EEST) Received: (from pp@localhost) by tk28.oulu.fi (8.13.3/8.13.3/Submit) id j8SEwF4d000752; Wed, 28 Sep 2005 17:58:15 +0300 (EEST) Date: Wed, 28 Sep 2005 17:58:15 +0300 From: Pekka Pietikainen To: Harald Welte Cc: netdev@oss.sgi.com Subject: Re: rwlock recursion on CPU#0, netfilter related? 
Message-ID: <20050928145815.GA421@ee.oulu.fi> References: <20050925105834.GA15243@ee.oulu.fi> <20050925134344.GJ731@sunbeam.de.gnumonks.org> <20050925201945.GA21176@ee.oulu.fi> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline In-Reply-To: <20050925201945.GA21176@ee.oulu.fi> User-Agent: Mutt/1.4.2i X-archive-position: 3686 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pp@ee.oulu.fi Precedence: bulk X-list: netdev Content-Length: 1174 Lines: 19 On Sun, Sep 25, 2005 at 11:19:45PM +0300, Pekka Pietikainen wrote: > Enabled, so this could be it. But 2.6.14-rc2-git4 did crash too (although > it did take a bit longer for that to happen), and the changelog does state: Ok, it looks like that patch was the thing after all. I now tried the latest fedora-devel kernel (1.1582, based on 2.6.14-rc2-git6) and the box has been running for a few hours happily. Could be the fedora kernel that claimed to be git4 actually wasn't, or the git4 changelog was really a post-git4 changelog :). But anyway, bug is gone. > > But only in 1 out of ten cases on average (when starting ping, ctrl+c, > > pin, ctrl+c, ...). I've always assumed it's some 64bit problem in > > "ping" itself. > Happens for all packets on the "broken" kernels, and works a-ok (few ms > latencies to the same box) on the 2.6.13-era ones that don't crash. > Could be a different bug, sure. This one is still around, so it's a different bug. Looks like it's a 64-bit issue, a 32-bit ping gives realistic ping times. tcpdump timestamps are also affected, they're completely off too. So looks like someone broke packet timestamps on 64-bit some time after 2.6.13. 
From herbert@gondor.apana.org.au Wed Sep 28 15:14:56 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 28 Sep 2005 15:15:17 -0700 (PDT) Received: from arnor.apana.org.au (22.107.233.220.exetel.com.au [220.233.107.22]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8SMEtO0031850 for ; Wed, 28 Sep 2005 15:14:56 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.36 #1 (Debian)) id 1EKk91-0000PD-00; Thu, 29 Sep 2005 08:11:19 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1EKk8s-0005jQ-00; Thu, 29 Sep 2005 08:11:10 +1000 Date: Thu, 29 Sep 2005 08:11:10 +1000 To: "Paul E. McKenney" Cc: "David S. Miller" , suzannew@cs.pdx.edu, linux-kernel@vger.kernel.org, Robert.Olsson@data.slu.se, walpole@cs.pdx.edu, netdev@oss.sgi.com Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Message-ID: <20050928221110.GA22018@gondor.apana.org.au> References: <20050927.135626.88296134.davem@davemloft.net> <20050928145110.GA4925@us.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050928145110.GA4925@us.ibm.com> User-Agent: Mutt/1.5.9i From: Herbert Xu X-archive-position: 3687 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Content-Length: 685 Lines: 18 On Wed, Sep 28, 2005 at 07:51:10AM -0700, Paul E. McKenney wrote: > > The reference-count approach is only guaranteed to work if the kernel > thread that did the reference-count increment is later referencing that > same data element. Otherwise, one has the following possible situation > on DEC Alpha: You're quite right. Without the rcu_dereference users of in_dev_get() may see pre-initialisation contents of in_dev. So these barriers are definitely needed. 
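The publish/read ordering at issue can be sketched in userspace C11, purely as an illustration. This is not kernel code: the release store below only stands in for rcu_assign_pointer() and the acquire load for rcu_dereference(), and every name ending in _sketch is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct in_device; fields are illustrative. */
struct in_dev_sketch {
    int mtu;     /* must be initialised before the pointer is published */
    int refcnt;
};

/* Hypothetical stand-in for dev->ip_ptr; zero-initialised, i.e. NULL. */
static _Atomic(struct in_dev_sketch *) ip_ptr_sketch;

/* Writer: initialise the element fully, then publish it with release
 * ordering (the role rcu_assign_pointer() plays in the thread). */
static void publish_sketch(struct in_dev_sketch *d, int mtu)
{
    assert(d != NULL);
    d->mtu = mtu;
    d->refcnt = 1;
    atomic_store_explicit(&ip_ptr_sketch, d, memory_order_release);
}

/* Reader: the acquire load (the role of rcu_dereference()) guarantees
 * that a non-NULL result points at fully initialised contents. */
static struct in_dev_sketch *read_sketch(void)
{
    return atomic_load_explicit(&ip_ptr_sketch, memory_order_acquire);
}
```

A reader that fetched the pointer with a plain, unordered load would have no such guarantee in the C11 model, which is the analogue of CPU 2 in the DEC Alpha scenario quoted above.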
Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From ja@ssi.bg Wed Sep 28 23:57:13 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 28 Sep 2005 23:57:34 -0700 (PDT) Received: from u.domain.uli (ja.ssi.bg [217.79.71.194]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8T6vBO0015203 for ; Wed, 28 Sep 2005 23:57:12 -0700 Received: from localhost (localhost [127.0.0.1]) by u.domain.uli (8.12.10/8.12.10) with ESMTP id j8T70CpC006693; Thu, 29 Sep 2005 10:00:18 +0300 Date: Thu, 29 Sep 2005 10:00:12 +0300 (EEST) From: Julian Anastasov X-X-Sender: ja@u.domain.uli To: Nishanth Aravamudan cc: Luca Maranzano , Dave Miller , Wensong Zhang , netdev@oss.sgi.com Subject: Re: ipvs_syncmaster brings cpu to 100% In-Reply-To: <20050928132639.GA5791@us.ibm.com> Message-ID: References: <68559cef05092207022f1f0df4@mail.gmail.com> <498263350509230815eb08a73@mail.gmail.com> <20050926032807.GI18357@verge.net.au> <20050926043400.GD5079@us.ibm.com> <20050926080508.GF11027@verge.net.au> <20050926081229.GA23755@verge.net.au> <20050926131104.GA7532@us.ibm.com> <68559cef05092606521cc13f9a@mail.gmail.com> <20050926142109.GD7532@us.ibm.com> <20050928022307.GK18765@verge.net.au> <20050928132639.GA5791@us.ibm.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 3689 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ja@ssi.bg Precedence: bulk X-list: netdev Content-Length: 928 Lines: 27 Hello, On Wed, 28 Sep 2005, Nishanth Aravamudan wrote: > Yes, the information in that thread is the same as what Luca said. It's > a load average problem, not a CPU utilisation problem (those threads are > sleeping!) 
If Luca could test the msleep_interruptible() version of the > patch and it works (like I said, performance should not change, but the > load average will drop by 2), then I will ACK the patch for mainline > acceptance. Agreed. It seems your initial conversion was based on wrong assumptions, quoting you: > Description: Use ssleep() instead of schedule_timeout() to guarantee the task > delays as expected. The first two replacements use TASK_INTERRUPTIBLE but do not > check for signals, so ssleep() should be appropriate. As all signals are blocked by daemonize(), and later even explicitly, it was not necessary to convert to the non-interruptible variant. Regards -- Julian Anastasov From laforge@gnumonks.org Thu Sep 29 05:08:45 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 05:08:49 -0700 (PDT) Received: from ganesha.gnumonks.org (ganesha.gnumonks.org [213.95.27.120]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TC8cO0014424 for ; Thu, 29 Sep 2005 05:08:45 -0700 Received: from berligate.hmw-consulting.de ([83.236.178.202] helo=sunbeam.hmw-consulting.de) by ganesha.gnumonks.org with esmtpsa (TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50) id 1EKxAS-0000tr-1w; Thu, 29 Sep 2005 14:05:40 +0200 Received: from laforge by sunbeam.hmw-consulting.de with local (Exim 4.52) id 1EKxAQ-00053X-Uc; Thu, 29 Sep 2005 14:05:39 +0200 Date: Thu, 29 Sep 2005 14:05:38 +0200 From: Harald Welte To: Pekka Pietikainen Cc: netdev@oss.sgi.com Subject: Re: rwlock recursion on CPU#0, netfilter related?
Message-ID: <20050929120538.GT4168@sunbeam.de.gnumonks.org> References: <20050925105834.GA15243@ee.oulu.fi> <20050925134344.GJ731@sunbeam.de.gnumonks.org> <20050925201945.GA21176@ee.oulu.fi> <20050928145815.GA421@ee.oulu.fi> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="NEaRsfQExFH3jWtg" Content-Disposition: inline In-Reply-To: <20050928145815.GA421@ee.oulu.fi> User-Agent: mutt-ng devel-20050619 (Debian) X-archive-position: 3691 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: laforge@gnumonks.org Precedence: bulk X-list: netdev Content-Length: 1900 Lines: 54 --NEaRsfQExFH3jWtg Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Sep 28, 2005 at 05:58:15PM +0300, Pekka Pietikainen wrote: > On Sun, Sep 25, 2005 at 11:19:45PM +0300, Pekka Pietikainen wrote: > > Enabled, so this could be it. But 2.6.14-rc2-git4 did crash too (although > > it did take a bit longer for that to happen), and the changelog does state: > Ok, it looks like that patch was the thing after all. I now tried the latest > fedora-devel kernel (1.1582, based on 2.6.14-rc2-git6) and the box has been > running for a few hours happily. Could be the fedora kernel that claimed to > be git4 actually wasn't, or the git4 changelog was really a post-git4 > changelog :). But anyway, bug is gone. great news. > This one is still around, so it's a different bug. Looks like it's a 64-bit > issue, a 32-bit ping gives realistic ping times. tcpdump timestamps are also > affected, they're completely off too. So looks like someone broke packet > timestamps on 64-bit some time after 2.6.13.
luckily I'm not the core network maintainer ;) -- - Harald Welte http://gnumonks.org/ ============================================================================ "Privacy in residential applications is a desirable marketing option." (ETSI EN 300 175-7 Ch. A6) --NEaRsfQExFH3jWtg Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFDO9iSXaXGVTD0i/8RApLKAJ9/16WAsgolDwJ3+niYa/fTSc5tHwCfUozU G3CSD+pENTT+gp8tvfY1+2g= =CtHq -----END PGP SIGNATURE----- --NEaRsfQExFH3jWtg-- From suzannew@cs.pdx.edu Thu Sep 29 09:06:09 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 09:06:22 -0700 (PDT) Received: from iron.cat.pdx.edu (iron.cat.pdx.edu [131.252.208.92]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TG69O0004708 for ; Thu, 29 Sep 2005 09:06:09 -0700 Received: from rastaban.cs.pdx.edu (root@rastaban.cs.pdx.edu [131.252.209.214]) by iron.cat.pdx.edu (8.13.1/8.13.1) with ESMTP id j8TG2bOk026907 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Thu, 29 Sep 2005 09:02:43 -0700 (PDT) Received: from rastaban.cs.pdx.edu (suzannew@localhost [127.0.0.1]) by rastaban.cs.pdx.edu (8.12.10/8.12.6) with ESMTP id j8TG2bUL015923; Thu, 29 Sep 2005 09:02:37 -0700 (PDT) Received: (from suzannew@localhost) by rastaban.cs.pdx.edu (8.12.10/8.12.6/Submit) id j8TG2TuI015920; Thu, 29 Sep 2005 09:02:29 -0700 (PDT) Date: Thu, 29 Sep 2005 09:02:29 -0700 (PDT) From: Suzanne Wood Message-Id: <200509291602.j8TG2TuI015920@rastaban.cs.pdx.edu> To: paulmck@us.ibm.com Cc: Robert.Olsson@data.slu.se, davem@davemloft.net, herbert@gondor.apana.org.au, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, suzannew@cs.pdx.edu, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical
sections X-archive-position: 3692 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: suzannew@cs.pdx.edu Precedence: bulk X-list: netdev Content-Length: 2808 Lines: 75 The original motivation for this patch was in __in_dev_get usage. I'll try to test a build, but should submittals be incremental? first addressing in_dev_get, then __in_dev_get? What seems resolved so far follows. The exchange below suggests that it is equally important to have the rcu_dereference() in __in_dev_get(), so the idea of the only difference between in_dev_get and __in_dev_get being the refcnt may be accepted. Correct usage may be a question with the mismatched definitions (in terms of refcnt) of __in_dev_get() and __in_dev_put() that superficially appear paired and this may merit a comment. If interested, examples are mentioned in www.uwsg.iu.edu/hypermail/linux/kernel/0509.1/0184.html and www.ussg.iu.edu/hypermail/linux/kernel/0509.3/0757.html But when the refcnt is employed for the DEC Alpha, rcu-protection or other locking must be in place for multiple CPUs, which apparently affirms the value of the marking of an rcu read-side critical section done by the calling function which has the vision of the extent of use of the protected dereference. Is this all reasonable to you? Thank you very much. ----- Original Message ----- From: Paul E. McKenney Sent: Wednesday, September 28, 2005 7:51 AM > On Wed, Sep 28, 2005 at 12:55:45PM +1000, Herbert Xu wrote: >> David S. Miller wrote: >> > >> > I agree with the changes to add rcu_dereference() use. >> > Those were definitely lacking and needed. >> >> Actually I'm not so sure that they are all needed. I only looked >> at the > guarantee correct code. We really need to look at each case >> individually. > > Yep, these two APIs are only part of the solution. 
> > The reference-count approach is only guaranteed to work if the kernel > thread that did the reference-count increment is later referencing that > same data element. Otherwise, one has the following possible situation > on DEC Alpha: > > o CPU 0 initializes and inserts a new element into the data > structure, using rcu_assign_pointer() to provide any needed > memory barriers. (Or, if RCU is not being used, under the > appropriate update-side lock.) > > o CPU 1 acquires a reference to this new element, presumably > using either a lock or rcu_read_lock() and rcu_dereference() > in order to do so safely. CPU 1 then increments the reference > count. > > o CPU 2 picks up a pointer to this new element, but in a way > that relies on the reference count having been incremented, > without using locking, rcu_read_lock(), rcu_dereference(), > and so on. > > This CPU can then see the pre-initialized contents of the > newly inserted data structure (again, but only on DEC Alpha). > > Again, if the same kernel thread that incremented the reference count > is later accessing it, no problem, even on Alpha. 
> > Thanx, Paul > From shekhark@juniper.net Thu Sep 29 11:28:05 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 11:28:10 -0700 (PDT) Received: from kremlin.juniper.net (kremlin.juniper.net [207.17.137.120]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TIS4O0020112 for ; Thu, 29 Sep 2005 11:28:04 -0700 Received: from unknown (HELO beta.jnpr.net) (172.24.18.109) by kremlin.juniper.net with ESMTP; 29 Sep 2005 11:25:14 -0700 X-BrightmailFiltered: true X-Brightmail-Tracker: AAAAAA== X-IronPort-AV: i="3.97,147,1125903600"; d="scan'208"; a="484436685:sNHT19826072" Received: from gluon.jnpr.net ([172.24.15.23]) by beta.jnpr.net with Microsoft SMTPSVC(6.0.3790.211); Thu, 29 Sep 2005 11:25:13 -0700 X-MimeOLE: Produced By Microsoft Exchange V6.5.7226.0 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Subject: netfilter hooks for ESP over UDP packets are not invoked Date: Thu, 29 Sep 2005 11:25:13 -0700 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: netfilter hooks for ESP over UDP packets are not invoked Thread-Index: AcW+8tPJQ1m7c1t9Rx6lXWklRfkieQGL9cVg From: "Shekhar Kshirsagar" To: X-OriginalArrivalTime: 29 Sep 2005 18:25:13.0645 (UTC) FILETIME=[2217F5D0:01C5C523] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j8TIS4O0020112 X-archive-position: 3693 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shekhark@juniper.net Precedence: bulk X-list: netdev Content-Length: 290 Lines: 10 iptables rules to match ESP packets work fine for raw ESP, but they do not work for ESP over UDP packets. Looking into the code, it seems that netfilter hooks are not invoked for ESP packets that came over UDP. Does somebody already have a patch to resolve this issue? 
Thanks, Shekhar From herbert@gondor.apana.org.au Thu Sep 29 14:32:26 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 14:32:39 -0700 (PDT) Received: from arnor.apana.org.au (22.107.233.220.exetel.com.au [220.233.107.22]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TLWPO0004450 for ; Thu, 29 Sep 2005 14:32:26 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.36 #1 (Debian)) id 1EL5xK-0003hq-00; Fri, 30 Sep 2005 07:28:42 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1EL5xE-0003jQ-00; Fri, 30 Sep 2005 07:28:36 +1000 Date: Fri, 30 Sep 2005 07:28:36 +1000 To: Suzanne Wood Cc: paulmck@us.ibm.com, Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Message-ID: <20050929212836.GA14323@gondor.apana.org.au> References: <200509291602.j8TG2TuI015920@rastaban.cs.pdx.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200509291602.j8TG2TuI015920@rastaban.cs.pdx.edu> User-Agent: Mutt/1.5.9i From: Herbert Xu X-archive-position: 3694 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Content-Length: 909 Lines: 24 On Thu, Sep 29, 2005 at 09:02:29AM -0700, Suzanne Wood wrote: > > The exchange below suggests that it is equally important > to have the rcu_dereference() in __in_dev_get(), so the > idea of the only difference between in_dev_get and > __in_dev_get being the refcnt may be accepted. With __in_dev_get() it's the caller's responsibility to ensure that RCU works correctly. Therefore if any rcu_dereference is needed it should be done by the caller. 
Some callers of __in_dev_get() don't need rcu_dereference at all because they're protected by the rtnl. BTW, could you please move the rcu_dereference in in_dev_get() into the if clause? The barrier is not needed when ip_ptr is NULL. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From suzannew@cs.pdx.edu Thu Sep 29 16:33:52 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 16:34:05 -0700 (PDT) Received: from lead.cat.pdx.edu (lead.cat.pdx.edu [131.252.208.91]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TNXqO0017030 for ; Thu, 29 Sep 2005 16:33:52 -0700 Received: from rastaban.cs.pdx.edu (root@rastaban.cs.pdx.edu [131.252.209.214]) by lead.cat.pdx.edu (8.13.1/8.13.1) with ESMTP id j8TNUTAE022598 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Thu, 29 Sep 2005 16:30:35 -0700 (PDT) Received: from rastaban.cs.pdx.edu (suzannew@localhost [127.0.0.1]) by rastaban.cs.pdx.edu (8.12.10/8.12.6) with ESMTP id j8TNUTUL019573; Thu, 29 Sep 2005 16:30:29 -0700 (PDT) Received: (from suzannew@localhost) by rastaban.cs.pdx.edu (8.12.10/8.12.6/Submit) id j8TNUSmH019572; Thu, 29 Sep 2005 16:30:28 -0700 (PDT) Date: Thu, 29 Sep 2005 16:30:28 -0700 (PDT) From: Suzanne Wood Message-Id: <200509292330.j8TNUSmH019572@rastaban.cs.pdx.edu> To: herbert@gondor.apana.org.au Cc: Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, paulmck@us.ibm.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections X-archive-position: 3695 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: suzannew@cs.pdx.edu Precedence: bulk X-list: netdev Content-Length: 1098 Lines: 29 > Date: Fri, 30 Sep 2005 07:28:36 +1000 > From: Herbert Xu > On Thu, Sep 29, 2005 at 
09:02:29AM -0700, Suzanne Wood wrote: > > > > The exchange below suggests that it is equally important > > to have the rcu_dereference() in __in_dev_get(), so the > > idea of the only difference between in_dev_get and > > __in_dev_get being the refcnt may be accepted. > With __in_dev_get() it's the caller's responsibility to ensure > that RCU works correctly. Therefore if any rcu_dereference is > needed it should be done by the caller. This sounds reasonable to me. Does everyone agree? > Some callers of __in_dev_get() don't need rcu_dereference at all > because they're protected by the rtnl. > BTW, could you please move the rcu_dereference in in_dev_get() > into the if clause? The barrier is not needed when ip_ptr is > NULL. The trouble with that may be that there are three events, the dereference, the assignment, and the conditional test. The rcu_dereference() is meant to assure deferred destruction throughout. Thank you very much for your suggestions. From suzannew@cs.pdx.edu Thu Sep 29 16:43:17 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 16:43:23 -0700 (PDT) Received: from lead.cat.pdx.edu (lead.cat.pdx.edu [131.252.208.91]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8TNhHO0017848 for ; Thu, 29 Sep 2005 16:43:17 -0700 Received: from rastaban.cs.pdx.edu (root@rastaban.cs.pdx.edu [131.252.209.214]) by lead.cat.pdx.edu (8.13.1/8.13.1) with ESMTP id j8TNdwBh022927 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Thu, 29 Sep 2005 16:40:04 -0700 (PDT) Received: from rastaban.cs.pdx.edu (suzannew@localhost [127.0.0.1]) by rastaban.cs.pdx.edu (8.12.10/8.12.6) with ESMTP id j8TNdwUL019658; Thu, 29 Sep 2005 16:39:58 -0700 (PDT) Received: (from suzannew@localhost) by rastaban.cs.pdx.edu (8.12.10/8.12.6/Submit) id j8TNdvKc019657; Thu, 29 Sep 2005 16:39:57 -0700 (PDT) Date: Thu, 29 Sep 2005 16:39:57 -0700 (PDT) From: Suzanne Wood Message-Id: <200509292339.j8TNdvKc019657@rastaban.cs.pdx.edu> To: 
herbert@gondor.apana.org.au Cc: Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, paulmck@us.ibm.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections X-archive-position: 3696 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: suzannew@cs.pdx.edu Precedence: bulk X-list: netdev Content-Length: 519 Lines: 14 > From suzannew Thu Sep 29 16:30:28 2005 > > From: Herbert Xu 30 Sep 2005 07:28 > > BTW, could you please move the rcu_dereference in in_dev_get() > > into the if clause? The barrier is not needed when ip_ptr is > > NULL. > The trouble with that may be that there are three events, the > dereference, the assignment, and the conditional test. The > rcu_dereference() is meant to assure deferred destruction > throughout. Sorry, I was thinking in terms of the rcu_read_lock, so this is misstated. From suzannew@cs.pdx.edu Thu Sep 29 17:03:24 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 17:03:29 -0700 (PDT) Received: from mournblade.cat.pdx.edu (mournblade.cat.pdx.edu [131.252.208.27]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8U03OO0019482 for ; Thu, 29 Sep 2005 17:03:24 -0700 Received: from rastaban.cs.pdx.edu (root@rastaban.cs.pdx.edu [131.252.209.214]) by mournblade.cat.pdx.edu (8.13.1/8.13.1) with ESMTP id j8U001S5018681 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Thu, 29 Sep 2005 17:00:06 -0700 (PDT) Received: from rastaban.cs.pdx.edu (suzannew@localhost [127.0.0.1]) by rastaban.cs.pdx.edu (8.12.10/8.12.6) with ESMTP id j8U001UL019839; Thu, 29 Sep 2005 17:00:01 -0700 (PDT) Received: (from suzannew@localhost) by rastaban.cs.pdx.edu (8.12.10/8.12.6/Submit) id j8TNxuxD019838; Thu, 29 Sep 2005 16:59:56 -0700 (PDT) Date: Thu, 29 Sep 2005 16:59:56 -0700 (PDT) From: Suzanne Wood Message-Id: 
<200509292359.j8TNxuxD019838@rastaban.cs.pdx.edu> To: herbert@gondor.apana.org.au Cc: Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, paulmck@us.ibm.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections X-archive-position: 3697 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: suzannew@cs.pdx.edu Precedence: bulk X-list: netdev Content-Length: 645 Lines: 21 Sorry to be thinking on-line, but if you mean this: if (in_dev = rcu_dereference(dev->ip_ptr)) I think that's fine. > From suzannew Thu Sep 29 16:39:57 2005 > > From suzannew Thu Sep 29 16:30:28 2005 > > > From: Herbert Xu 30 Sep 2005 07:28 > > > BTW, could you please move the rcu_dereference in in_dev_get() > > > into the if clause? The barrier is not needed when ip_ptr is > > > NULL. > > The trouble with that may be that there are three events, the > > dereference, the assignment, and the conditional test. The > > rcu_dereference() is meant to assure deferred destruction > > throughout. 
From paulmck@us.ibm.com Thu Sep 29 17:25:58 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 17:26:01 -0700 (PDT) Received: from e31.co.us.ibm.com (e31.co.us.ibm.com [32.97.110.149]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8U0PwO0021173 for ; Thu, 29 Sep 2005 17:25:58 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e31.co.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8U0MT5T029699 for ; Thu, 29 Sep 2005 20:22:29 -0400 Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d03relay04.boulder.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8U0NjXk464946 for ; Thu, 29 Sep 2005 18:23:45 -0600 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.12.11/8.13.3) with ESMTP id j8U0N6YW018022 for ; Thu, 29 Sep 2005 18:23:06 -0600 Received: from linux.local ([9.47.22.63]) by d03av02.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id j8U0N49v017999; Thu, 29 Sep 2005 18:23:05 -0600 Received: by linux.local (Postfix on SuSE Linux 7.3 (i386), from userid 500) id 59D691486D1; Thu, 29 Sep 2005 17:23:46 -0700 (PDT) Date: Thu, 29 Sep 2005 17:23:46 -0700 From: "Paul E. 
McKenney" To: Suzanne Wood Cc: herbert@gondor.apana.org.au, Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Message-ID: <20050930002346.GP8177@us.ibm.com> Reply-To: paulmck@us.ibm.com References: <200509292330.j8TNUSmH019572@rastaban.cs.pdx.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200509292330.j8TNUSmH019572@rastaban.cs.pdx.edu> User-Agent: Mutt/1.4.1i X-archive-position: 3699 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: paulmck@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 2040 Lines: 48 On Thu, Sep 29, 2005 at 04:30:28PM -0700, Suzanne Wood wrote: > > Date: Fri, 30 Sep 2005 07:28:36 +1000 > > From: Herbert Xu > > > On Thu, Sep 29, 2005 at 09:02:29AM -0700, Suzanne Wood wrote: > > > > > > The exchange below suggests that it is equally important > > > to have the rcu_dereference() in __in_dev_get(), so the > > > idea of the only difference between in_dev_get and > > > __in_dev_get being the refcnt may be accepted. > > > With __in_dev_get() it's the caller's responsibility to ensure > > that RCU works correctly. Therefore if any rcu_dereference is > > needed it should be done by the caller. > > This sounds reasonable to me. Does everyone agree? Is there any case where __in_dev_get() might be called without needing to be wrapped with rcu_dereference()? If so, then I agree (FWIW, given my meagre knowledge of Linux networking). If all __in_dev_get() invocations need to be wrapped in rcu_dereference(), then it seems to me that there would be motivation to bury rcu_dereference() in __in_dev_get(). > > Some callers of __in_dev_get() don't need rcu_dereference at all > > because they're protected by the rtnl. 
> > > BTW, could you please move the rcu_dereference in in_dev_get() > > into the if clause? The barrier is not needed when ip_ptr is > > NULL. > > The trouble with that may be that there are three events, the > dereference, the assignment, and the conditional test. The > rcu_dereference() is meant to assure deferred destruction > throughout. One only needs an rcu_dereference() once on the data-flow path from fetching the RCU-protected pointer to dereferencing that pointer. If the pointer is NULL, there is no way you can dereference it, so, technically, Herbert is quite correct. However, rcu_dereference() only generates a memory barrier on DEC Alpha, so there is normally no penalty for using it in the NULL-pointer case. So, when using rcu_dereference() unconditionally simplifies the code, it may make sense to "just do it". Thanx, Paul From herbert@gondor.apana.org.au Thu Sep 29 17:25:11 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 17:25:23 -0700 (PDT) Received: from arnor.apana.org.au (22.107.233.220.exetel.com.au [220.233.107.22]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8U0PAO0021037 for ; Thu, 29 Sep 2005 17:25:11 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.36 #1 (Debian)) id 1EL8er-0005sg-00; Fri, 30 Sep 2005 10:21:49 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1EL8en-0005U1-00; Fri, 30 Sep 2005 10:21:45 +1000 Date: Fri, 30 Sep 2005 10:21:44 +1000 To: Suzanne Wood Cc: Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, paulmck@us.ibm.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Message-ID: <20050930002144.GA21062@gondor.apana.org.au> References: <200509292330.j8TNUSmH019572@rastaban.cs.pdx.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 
<200509292330.j8TNUSmH019572@rastaban.cs.pdx.edu> User-Agent: Mutt/1.5.9i From: Herbert Xu X-archive-position: 3698 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Content-Length: 912 Lines: 24 On Thu, Sep 29, 2005 at 04:30:28PM -0700, Suzanne Wood wrote: > > > BTW, could you please move the rcu_dereference in in_dev_get() > > into the if clause? The barrier is not needed when ip_ptr is > > NULL. > > The trouble with that may be that there are three events, the > dereference, the assignment, and the conditional test. The > rcu_dereference() is meant to assure deferred destruction > throughout. The deferred destruction is guaranteed here by the reference count. The only purpose served by rcu_dereference() in in_dev_get() is to prevent the user from seeing pre-initialisation data. When the pointer is NULL, you can't see any data at all, let alone pre-initialisation data. 
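To make the "pre-initialisation data" point concrete, here is a minimal user-space sketch using C11 atomics instead of the kernel primitives. The names fake_in_device, ip_ptr, publish() and get_ref() are made up for illustration. The acquire fence is only a stand-in for rcu_dereference(), and it is stronger than the dependency ordering the kernel macro relies on. The sketch does not model rcu_read_lock() or grace periods; it only shows that the ordering matters once the pointer is known to be non-NULL.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct in_device; not the kernel definition. */
struct fake_in_device {
	int mtu;           /* payload a reader must never see uninitialised */
	atomic_int refcnt; /* keeps the object alive once a reader holds it */
};

/* Hypothetical stand-in for dev->ip_ptr; starts out NULL. */
static _Atomic(struct fake_in_device *) ip_ptr;

/* Writer: initialise the object fully, then publish the pointer. */
static void publish(struct fake_in_device *d, int mtu)
{
	d->mtu = mtu;
	atomic_init(&d->refcnt, 1);
	atomic_store_explicit(&ip_ptr, d, memory_order_release);
}

/* Reader: ordering is requested only when there is something to read. */
static struct fake_in_device *get_ref(void)
{
	struct fake_in_device *d =
		atomic_load_explicit(&ip_ptr, memory_order_relaxed);

	if (d) {
		/* Assumed analogue of rcu_dereference(): later reads of *d
		 * are ordered after the pointer load above. */
		atomic_thread_fence(memory_order_acquire);
		atomic_fetch_add(&d->refcnt, 1);
	}
	return d; /* NULL: nothing was dereferenced, so no fence was needed */
}
```

Calling get_ref() before publish() returns NULL without executing the fence; after publish(&d, 1500) it returns &d with the refcnt raised from 1 to 2 and d.mtu guaranteed to read as 1500.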
Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert@gondor.apana.org.au Thu Sep 29 17:27:00 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 17:27:07 -0700 (PDT) Received: from arnor.apana.org.au (22.107.233.220.exetel.com.au [220.233.107.22]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8U0QxO0021459 for ; Thu, 29 Sep 2005 17:27:00 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.36 #1 (Debian)) id 1EL8gf-0005t6-00; Fri, 30 Sep 2005 10:23:41 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1EL8gd-0005Ul-00; Fri, 30 Sep 2005 10:23:39 +1000 Date: Fri, 30 Sep 2005 10:23:39 +1000 To: Suzanne Wood Cc: Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, paulmck@us.ibm.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Message-ID: <20050930002339.GB21062@gondor.apana.org.au> References: <200509292359.j8TNxuxD019838@rastaban.cs.pdx.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <200509292359.j8TNxuxD019838@rastaban.cs.pdx.edu> User-Agent: Mutt/1.5.9i From: Herbert Xu X-archive-position: 3700 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Content-Length: 605 Lines: 24 On Thu, Sep 29, 2005 at 04:59:56PM -0700, Suzanne Wood wrote: > Sorry to be thinking on-line, but if you mean this: > > if (in_dev = rcu_dereference(dev->ip_ptr)) > > I think that's fine. Close. 
What I had in mind is rcu_read_lock(); in_dev = dev->ip_ptr; if (in_dev) { in_dev = rcu_dereference(in_dev); atomic_inc(&in_dev->refcnt); } rcu_read_unlock(); return in_dev; Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert@gondor.apana.org.au Thu Sep 29 17:30:36 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 17:30:40 -0700 (PDT) Received: from arnor.apana.org.au (22.107.233.220.exetel.com.au [220.233.107.22]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8U0UYO0022552 for ; Thu, 29 Sep 2005 17:30:35 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.36 #1 (Debian)) id 1EL8kD-0005v8-00; Fri, 30 Sep 2005 10:27:21 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1EL8kB-0005VU-00; Fri, 30 Sep 2005 10:27:19 +1000 Date: Fri, 30 Sep 2005 10:27:19 +1000 To: "Paul E. McKenney" Cc: Suzanne Wood , Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Message-ID: <20050930002719.GC21062@gondor.apana.org.au> References: <200509292330.j8TNUSmH019572@rastaban.cs.pdx.edu> <20050930002346.GP8177@us.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050930002346.GP8177@us.ibm.com> User-Agent: Mutt/1.5.9i From: Herbert Xu X-archive-position: 3701 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Content-Length: 1109 Lines: 32 On Thu, Sep 29, 2005 at 05:23:46PM -0700, Paul E. 
McKenney wrote: > > Is there any case where __in_dev_get() might be called without > needing to be wrapped with rcu_dereference()? If so, then I > agree (FWIW, given my meagre knowledge of Linux networking). Yes. All paths that call __in_dev_get() under the rtnl do not need rcu_dereference (or any RCU at all) since the rtnl prevents any ip_ptr modification from occurring. > However, rcu_dereference() only generates a memory barrier on DEC > Alpha, so there is normally no penalty for using it in the NULL-pointer > case. So, when using rcu_dereference() unconditionally simplifies > the code, it may make sense to "just do it". Here is what the code would look like: rcu_read_lock(); in_dev = dev->ip_ptr; if (in_dev) { in_dev = rcu_dereference(in_dev); atomic_inc(&in_dev->refcnt); } rcu_read_unlock(); return in_dev; Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From paulmck@us.ibm.com Thu Sep 29 17:39:00 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 17:39:05 -0700 (PDT) Received: from e35.co.us.ibm.com (e35.co.us.ibm.com [32.97.110.153]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8U0crO0023286 for ; Thu, 29 Sep 2005 17:39:00 -0700 Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106]) by e35.co.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8U0Xa8b017596 for ; Thu, 29 Sep 2005 20:33:36 -0400 Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay04.boulder.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8U0aeXk535238 for ; Thu, 29 Sep 2005 18:36:40 -0600 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.12.11/8.13.3) with ESMTP id j8U0a15R024715 for ; Thu, 29 Sep 2005 18:36:02 -0600 Received: from linux.local ([9.47.22.63]) by d03av03.boulder.ibm.com (8.12.11/8.12.11) with ESMTP id
j8U0a0cs024677; Thu, 29 Sep 2005 18:36:01 -0600 Received: by linux.local (Postfix on SuSE Linux 7.3 (i386), from userid 500) id 35CCC1486D1; Thu, 29 Sep 2005 17:36:42 -0700 (PDT) Date: Thu, 29 Sep 2005 17:36:42 -0700 From: "Paul E. McKenney" To: Herbert Xu Cc: Suzanne Wood , Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Message-ID: <20050930003642.GQ8177@us.ibm.com> Reply-To: paulmck@us.ibm.com References: <200509292330.j8TNUSmH019572@rastaban.cs.pdx.edu> <20050930002346.GP8177@us.ibm.com> <20050930002719.GC21062@gondor.apana.org.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050930002719.GC21062@gondor.apana.org.au> User-Agent: Mutt/1.4.1i X-archive-position: 3702 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: paulmck@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 1216 Lines: 40 On Fri, Sep 30, 2005 at 10:27:19AM +1000, Herbert Xu wrote: > On Thu, Sep 29, 2005 at 05:23:46PM -0700, Paul E. McKenney wrote: > > > > Is there any case where __in_dev_get() might be called without > > needing to be wrapped with rcu_dereference()? If so, then I > > agree (FWIW, given my meagre knowledge of Linux networking). > > Yes. All paths that call __in_dev_get() under the rtnl do not > need rcu_dereference (or any RCU at all) since the rtnl prevents > any ip_ptr modification from occuring. > > > However, rcu_dereference() only generates a memory barrier on DEC > > Alpha, so there is normally no penalty for using it in the NULL-pointer > > case. So, when using rcu_dereference() unconditionally simplifies > > the code, it may make sense to "just do it". 
> > Here is what the code would look like: > > rcu_read_lock(); > in_dev = dev->ip_ptr; > if (in_dev) { > in_dev = rcu_dereference(in_dev); > atomic_inc(&in_dev->refcnt); > } > rcu_read_unlock(); > return in_dev; How about: rcu_read_lock(); in_dev = dev->ip_ptr; if (rcu_dereference(in_dev)) { atomic_inc(&in_dev->refcnt); } rcu_read_unlock(); return in_dev; Admittedly only saves one line, but... Thanx, Paul From herbert@gondor.apana.org.au Thu Sep 29 18:07:30 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 18:07:50 -0700 (PDT) Received: from arnor.apana.org.au (22.107.233.220.exetel.com.au [220.233.107.22]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8U17TO0025390 for ; Thu, 29 Sep 2005 18:07:30 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.36 #1 (Debian)) id 1EL9Jp-0006FF-00; Fri, 30 Sep 2005 11:04:09 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1EL9Jk-0005Zx-00; Fri, 30 Sep 2005 11:04:04 +1000 Date: Fri, 30 Sep 2005 11:04:04 +1000 To: "Paul E. 
McKenney" Cc: Suzanne Wood , Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Message-ID: <20050930010404.GA21429@gondor.apana.org.au> References: <200509292330.j8TNUSmH019572@rastaban.cs.pdx.edu> <20050930002346.GP8177@us.ibm.com> <20050930002719.GC21062@gondor.apana.org.au> <20050930003642.GQ8177@us.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050930003642.GQ8177@us.ibm.com> User-Agent: Mutt/1.5.9i From: Herbert Xu X-archive-position: 3703 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Content-Length: 946 Lines: 34 On Thu, Sep 29, 2005 at 05:36:42PM -0700, Paul E. McKenney wrote: > > > rcu_read_lock(); > > in_dev = dev->ip_ptr; > > if (in_dev) { > > in_dev = rcu_dereference(in_dev); > > atomic_inc(&in_dev->refcnt); > > } > > rcu_read_unlock(); > > return in_dev; > > How about: > > rcu_read_lock(); > in_dev = dev->ip_ptr; > if (rcu_dereference(in_dev)) { > atomic_inc(&in_dev->refcnt); > } > rcu_read_unlock(); > return in_dev; With this the barrier will be taken even when in_dev is NULL. I agree this isn't such a big deal since it only impacts Alpha and then only when in_dev is NULL. But as we already do the branch anyway to increment the reference count, we might as well make things a little better for Alpha.
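A rough way to see the difference between the two variants is to count how often the barrier would run. In this sketch fake_rcu_dereference() is a hypothetical stand-in that just bumps a counter at the point where Alpha would issue its barrier. The struct and function names are likewise assumptions, and there is no real RCU, locking or atomic refcount here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct in_device. */
struct fake_in_device {
	int refcnt;
};

static int barriers_taken; /* counts simulated Alpha barriers */

/* Assumed stand-in for rcu_dereference(): record the barrier, return p. */
static struct fake_in_device *fake_rcu_dereference(struct fake_in_device *p)
{
	barriers_taken++;
	return p;
}

/* Dereference in the condition: the barrier runs on every call. */
static struct fake_in_device *get_unconditional(struct fake_in_device *ip_ptr)
{
	struct fake_in_device *in_dev = ip_ptr;

	if (fake_rcu_dereference(in_dev))
		in_dev->refcnt++;
	return in_dev;
}

/* Dereference inside the branch: no barrier when the pointer is NULL. */
static struct fake_in_device *get_conditional(struct fake_in_device *ip_ptr)
{
	struct fake_in_device *in_dev = ip_ptr;

	if (in_dev) {
		in_dev = fake_rcu_dereference(in_dev);
		in_dev->refcnt++;
	}
	return in_dev;
}
```

With a NULL pointer, get_unconditional() leaves barriers_taken at 1 while get_conditional() leaves it at 0; with a non-NULL pointer both take exactly one barrier and one reference.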
Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From suzannew@cs.pdx.edu Thu Sep 29 18:10:05 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 18:10:08 -0700 (PDT) Received: from mournblade.cat.pdx.edu (mournblade.cat.pdx.edu [131.252.208.27]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8U1A4O0025841 for ; Thu, 29 Sep 2005 18:10:05 -0700 Received: from rastaban.cs.pdx.edu (root@rastaban.cs.pdx.edu [131.252.209.214]) by mournblade.cat.pdx.edu (8.13.1/8.13.1) with ESMTP id j8U16ru0023134 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Thu, 29 Sep 2005 18:06:53 -0700 (PDT) Received: from rastaban.cs.pdx.edu (suzannew@localhost [127.0.0.1]) by rastaban.cs.pdx.edu (8.12.10/8.12.6) with ESMTP id j8U16qUL021065; Thu, 29 Sep 2005 18:06:52 -0700 (PDT) Received: (from suzannew@localhost) by rastaban.cs.pdx.edu (8.12.10/8.12.6/Submit) id j8U16obP021064; Thu, 29 Sep 2005 18:06:50 -0700 (PDT) Date: Thu, 29 Sep 2005 18:06:50 -0700 (PDT) From: Suzanne Wood Message-Id: <200509300106.j8U16obP021064@rastaban.cs.pdx.edu> To: paulmck@us.ibm.com, suzannew@cs.pdx.edu Cc: Robert.Olsson@data.slu.se, davem@davemloft.net, herbert@gondor.apana.org.au, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections X-archive-position: 3704 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: suzannew@cs.pdx.edu Precedence: bulk X-list: netdev Content-Length: 2167 Lines: 49 In reviewing the 44 kernel uses of __in_dev_get, I see many cases, in 13 of the 20 C source files, where rcu_read_lock has been inserted with and without the rcu_dereference that is indicated, so RCU protection does often appear to be the programmer's intent.
Of the programs using __in_dev_get that don't include rcu, devinet.c and igmp.c use an rtnl lock. Five other programs that use __in_dev_get without rcu have rtnl locking in the program source code, but I need to actually look further into the call tree to say more. Are there three cases then? RCU protection with refcnt, RCU without refcnt, and the bare cast of the dereference? Thank you very much for getting it back on track. > From paulmck@us.ibm.com Thu Sep 29 17:23:18 2005 > Is there any case where __in_dev_get() might be called without > needing to be wrapped with rcu_dereference()? If so, then I > agree (FWIW, given my meagre knowledge of Linux networking). > If all __in_dev_get() invocations need to be wrapped in > rcu_dereference(), then it seems to me that there would be > motivation to bury rcu_dereference() in __in_dev_get(). > > > Some callers of __in_dev_get() don't need rcu_dereference at all > > > because they're protected by the rtnl. > > > > > BTW, could you please move the rcu_dereference in in_dev_get() > > > into the if clause? The barrier is not needed when ip_ptr is > > > NULL. > > > > The trouble with that may be that there are three events, the > > dereference, the assignment, and the conditional test. The > > rcu_dereference() is meant to assure deferred destruction > > throughout. > One only needs an rcu_dereference() once on the data-flow path from > fetching the RCU-protected pointer to dereferencing that pointer. > If the pointer is NULL, there is no way you can dereference it, > so, technically, Herbert is quite correct. > However, rcu_dereference() only generates a memory barrier on DEC > Alpha, so there is normally no penalty for using it in the NULL-pointer > case. So, when using rcu_dereference() unconditionally simplifies > the code, it may make sense to "just do it". 
> Thanx, Paul From paulmck@us.ibm.com Thu Sep 29 18:18:24 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 18:18:29 -0700 (PDT) Received: from e1.ny.us.ibm.com (e1.ny.us.ibm.com [32.97.182.141]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8U1IHO0026778 for ; Thu, 29 Sep 2005 18:18:24 -0700 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e1.ny.us.ibm.com (8.12.11/8.12.11) with ESMTP id j8U1FOeJ014990 for ; Thu, 29 Sep 2005 21:15:24 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay02.pok.ibm.com (8.12.10/NCO/VERS6.7) with ESMTP id j8U1FOVU099390 for ; Thu, 29 Sep 2005 21:15:24 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11/8.13.3) with ESMTP id j8U1FNOF032063 for ; Thu, 29 Sep 2005 21:15:24 -0400 Received: from linux.local ([9.47.22.63]) by d01av01.pok.ibm.com (8.12.11/8.12.11) with ESMTP id j8U1FM2f032020; Thu, 29 Sep 2005 21:15:23 -0400 Received: by linux.local (Postfix on SuSE Linux 7.3 (i386), from userid 500) id C44AF1486D1; Thu, 29 Sep 2005 18:16:03 -0700 (PDT) Date: Thu, 29 Sep 2005 18:16:03 -0700 From: "Paul E. 
McKenney" To: Herbert Xu Cc: Suzanne Wood , Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Message-ID: <20050930011603.GT8177@us.ibm.com> Reply-To: paulmck@us.ibm.com References: <200509292330.j8TNUSmH019572@rastaban.cs.pdx.edu> <20050930002346.GP8177@us.ibm.com> <20050930002719.GC21062@gondor.apana.org.au> <20050930003642.GQ8177@us.ibm.com> <20050930010404.GA21429@gondor.apana.org.au> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050930010404.GA21429@gondor.apana.org.au> User-Agent: Mutt/1.4.1i X-archive-position: 3705 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: paulmck@us.ibm.com Precedence: bulk X-list: netdev Content-Length: 1033 Lines: 40 On Fri, Sep 30, 2005 at 11:04:04AM +1000, Herbert Xu wrote: > On Thu, Sep 29, 2005 at 05:36:42PM -0700, Paul E. McKenney wrote: > > > > > rcu_read_lock(); > > > in_dev = dev->ip_ptr; > > > if (in_dev) { > > > in_dev = rcu_dereference(in_dev); > > > atomic_inc(&in_dev->refcnt); > > > } > > > rcu_read_unlock(); > > > return in_dev; > > > > How about: > > > > rcu_read_lock(); > > in_dev = dev->ip_ptr; > > if (rcu_dereference(in_dev)) { > > atomic_inc(&in_dev->refcnt); > > } > > rcu_read_unlock(); > > return in_dev; > > With this the barrier will taken even when in_dev is NULL. > > I agree this isn't such a big deal since it only impacts Alpha and then > only when in_dev is NULL. But as we already do the branch anyway to > increment the reference count, we might as well make things a little > better for Alpha. OK, how about this instead? 
rcu_read_lock(); in_dev = dev->ip_ptr; if (in_dev) { atomic_inc(&rcu_dereference(in_dev)->refcnt); } rcu_read_unlock(); return in_dev; Thanx, Paul From herbert@gondor.apana.org.au Thu Sep 29 18:22:47 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 18:22:51 -0700 (PDT) Received: from arnor.apana.org.au (22.107.233.220.exetel.com.au [220.233.107.22]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8U1MgO0027356 for ; Thu, 29 Sep 2005 18:22:46 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.36 #1 (Debian)) id 1EL9YM-0006LJ-00; Fri, 30 Sep 2005 11:19:10 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1EL9YJ-0005cJ-00; Fri, 30 Sep 2005 11:19:07 +1000 Date: Fri, 30 Sep 2005 11:19:07 +1000 To: "Paul E. McKenney" Cc: Suzanne Wood , Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Message-ID: <20050930011907.GA21579@gondor.apana.org.au> References: <200509292330.j8TNUSmH019572@rastaban.cs.pdx.edu> <20050930002346.GP8177@us.ibm.com> <20050930002719.GC21062@gondor.apana.org.au> <20050930003642.GQ8177@us.ibm.com> <20050930010404.GA21429@gondor.apana.org.au> <20050930011603.GT8177@us.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050930011603.GT8177@us.ibm.com> User-Agent: Mutt/1.5.9i From: Herbert Xu X-archive-position: 3706 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Content-Length: 497 Lines: 18 On Thu, Sep 29, 2005 at 06:16:03PM -0700, Paul E. McKenney wrote: > > OK, how about this instead? 
> > rcu_read_lock(); > in_dev = dev->ip_ptr; > if (in_dev) { > atomic_inc(&rcu_dereference(in_dev)->refcnt); > } > rcu_read_unlock(); > return in_dev; Looks great. Thanks Paul. -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From pp@ee.oulu.fi Thu Sep 29 23:52:15 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 29 Sep 2005 23:52:22 -0700 (PDT) Received: from ee.oulu.fi (ee.oulu.fi [130.231.61.23]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8U6qEO0017670 for ; Thu, 29 Sep 2005 23:52:15 -0700 Received: from tk28.oulu.fi (tk28 [130.231.48.68]) by ee.oulu.fi (8.13.3/8.13.3) with ESMTP id j8U6nKcj021657 for ; Fri, 30 Sep 2005 09:49:20 +0300 (EEST) Received: (from pp@localhost) by tk28.oulu.fi (8.13.3/8.13.3/Submit) id j8U6nJj5021881 for netdev@oss.sgi.com; Fri, 30 Sep 2005 09:49:19 +0300 (EEST) Date: Fri, 30 Sep 2005 09:49:19 +0300 From: Pekka Pietikainen To: netdev@oss.sgi.com Subject: Funny timestamps (Was: Re: rwlock recursion on CPU#0, netfilter related?) Message-ID: <20050930064919.GA21573@ee.oulu.fi> References: <20050925105834.GA15243@ee.oulu.fi> <20050925134344.GJ731@sunbeam.de.gnumonks.org> <20050925201945.GA21176@ee.oulu.fi> <20050928145815.GA421@ee.oulu.fi> <20050929120538.GT4168@sunbeam.de.gnumonks.org> Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline In-Reply-To: <20050929120538.GT4168@sunbeam.de.gnumonks.org> User-Agent: Mutt/1.4.2i X-archive-position: 3707 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pp@ee.oulu.fi Precedence: bulk X-list: netdev Content-Length: 822 Lines: 16 On Thu, Sep 29, 2005 at 02:05:38PM +0200, Harald Welte wrote: > > This one is still around, so it's a different bug. Looks like it's a 64-bit > > issue, a 32-bit ping gives realistic ping times. 
tcpdump timestamps are also > > affected, they're completely off too. So looks like someone broke packet > > timestamps on 64-bit some time after 2.6.13. > > luckily I'm not the core network maintainer ;) Here's the actual bug report: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=168166 Ended up being a userspace thing, maybe. But still makes me wonder what change actually broke things. It must have been something soon after 2.6.13. And there's still tcpdump, which doesn't seem to go into the problem mode when I test it now, except that nothing should have changed in the kernel/tcpdump/libpcap versions. Blah. From laforge@gnumonks.org Fri Sep 30 05:36:45 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 30 Sep 2005 05:36:50 -0700 (PDT) Received: from ganesha.gnumonks.org (ganesha.gnumonks.org [213.95.27.120]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8UCaiO0020630 for ; Fri, 30 Sep 2005 05:36:45 -0700 Received: from berligate.hmw-consulting.de ([83.236.178.202] helo=sunbeam.hmw-consulting.de) by ganesha.gnumonks.org with esmtpsa (TLS-1.0:RSA_AES_256_CBC_SHA:32) (Exim 4.50) id 1ELK53-0007Ip-Oz; Fri, 30 Sep 2005 14:33:37 +0200 Received: from laforge by sunbeam.hmw-consulting.de with local (Exim 4.52) id 1ELK51-00086s-KE; Fri, 30 Sep 2005 14:33:35 +0200 Date: Fri, 30 Sep 2005 14:33:35 +0200 From: Harald Welte To: Michael Bellion Cc: linux-kernel@vger.kernel.org, linux-net@vger.kernel.org, netdev@oss.sgi.com Subject: Re: [ANNOUNCE] Release of nf-HiPAC 0.9.0 Message-ID: <20050930123334.GW4168@sunbeam.de.gnumonks.org> References: <200509260445.46740.mbellion@hipac.org> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="VIdDLDeyEAhJQ0DY" Content-Disposition: inline In-Reply-To: <200509260445.46740.mbellion@hipac.org> User-Agent: mutt-ng devel-20050619 (Debian) X-archive-position: 3708 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: 
netdev-bounce@oss.sgi.com X-original-sender: laforge@gnumonks.org Precedence: bulk X-list: netdev Content-Length: 1380 Lines: 41 --VIdDLDeyEAhJQ0DY Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Sep 26, 2005 at 04:45:46AM +0200, Michael Bellion wrote: > Hi > > I am happy to announce the release of nf-HiPAC version 0.9.0 I'm happy to hear this, especially in the advent of the netfilter developer workshop next week, and after a very long period of silence from the nf-hipac project. I'll make sure to have read through your 0.9.0 version source code until then, to be able to give some feedback asap. Looking forward to talking to you about it next week! -- - Harald Welte http://gnumonks.org/ ============================================================================ "Privacy in residential applications is a desirable marketing option." (ETSI EN 300 175-7 Ch.
A6) --VIdDLDeyEAhJQ0DY Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFDPTCeXaXGVTD0i/8RAk+6AJ9yRqzLPHAw8y6hBTdgM3IvtVcHTQCdFKhz g3w+aCg9EjwtfBdsYvfo3fU= =rmvC -----END PGP SIGNATURE----- --VIdDLDeyEAhJQ0DY-- From liuk001@gmail.com Fri Sep 30 09:02:36 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 30 Sep 2005 09:02:48 -0700 (PDT) Received: from wproxy.gmail.com (wproxy.gmail.com [64.233.184.202]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8UG2aO0008150 for ; Fri, 30 Sep 2005 09:02:36 -0700 Received: by wproxy.gmail.com with SMTP id i31so570wra for ; Fri, 30 Sep 2005 08:59:44 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:reply-to:to:subject:cc:in-reply-to:mime-version:content-type:content-transfer-encoding:content-disposition:references; b=Kbzfo5WFz1h/j0UppCHiL43XWDewMEHgO2Ky4XS6GqOvrL0u1KsVuWyOfdDQfRZqiMMW1efi+aDwYWI0Q6oS5tICa3EAeglPr9gphl2lNUP6PVW3cLxMJT3QgWMlUVdHDziwbiZ7S7/Z4erZo+i8FFTNRfZ0irIcnEBap4BZN3U= Received: by 10.54.86.7 with SMTP id j7mr97349wrb; Fri, 30 Sep 2005 08:59:44 -0700 (PDT) Received: by 10.54.70.6 with HTTP; Fri, 30 Sep 2005 08:59:44 -0700 (PDT) Message-ID: <68559cef0509300859s38cf42bn@mail.gmail.com> Date: Fri, 30 Sep 2005 17:59:44 +0200 From: Luca Maranzano Reply-To: Luca Maranzano To: Nishanth Aravamudan Subject: Re: ipvs_syncmaster brings cpu to 100% Cc: Dave Miller , Wensong Zhang , Julian Anastasov , netdev@oss.sgi.com, horms@verge.net.au In-Reply-To: <20050928132639.GA5791@us.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Disposition: inline References: <68559cef05092207022f1f0df4@mail.gmail.com> <20050926032807.GI18357@verge.net.au> <20050926043400.GD5079@us.ibm.com> <20050926080508.GF11027@verge.net.au> <20050926081229.GA23755@verge.net.au> <20050926131104.GA7532@us.ibm.com> <68559cef05092606521cc13f9a@mail.gmail.com> 
<20050926142109.GD7532@us.ibm.com> <20050928022307.GK18765@verge.net.au> <20050928132639.GA5791@us.ibm.com> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j8UG2aO0008150 X-archive-position: 3709 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: liuk001@gmail.com Precedence: bulk X-list: netdev Content-Length: 685 Lines: 23 First of all thank you all for your precious support! :-) The two machines on which I discovered the problem are now in production and I cannot for the moment make tests, but I hope to have some other hardware to try in the next week. I'll let you know ASAP. Thanks, Luca > > Yes, the information in that thread is the same as what Luca said. It's > a load average problem, not a CPU utilisation problem (those threads are > sleeping!) If Luca could test the msleep_interruptible() version of the > patch and it works (like I said, performance should not change, but the > load average will drop by 2), then I will ACK the patch for mainline > acceptance.
> > Thanks, > Nish > From herbert@gondor.apana.org.au Fri Sep 30 18:17:13 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 30 Sep 2005 18:17:28 -0700 (PDT) Received: from arnor.apana.org.au (22.107.233.220.exetel.com.au [220.233.107.22]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j911H9O0014812 for ; Fri, 30 Sep 2005 18:17:10 -0700 Received: from gondolin.me.apana.org.au ([192.168.0.6] ident=mail) by arnor.apana.org.au with esmtp (Exim 3.36 #1 (Debian)) id 1ELVwG-0000Cs-00; Sat, 01 Oct 2005 11:13:20 +1000 Received: from herbert by gondolin.me.apana.org.au with local (Exim 3.36 #1 (Debian)) id 1ELVw8-0007LY-00; Sat, 01 Oct 2005 11:13:12 +1000 Date: Sat, 1 Oct 2005 11:13:12 +1000 To: Suzanne Wood Cc: paulmck@us.ibm.com, Robert.Olsson@data.slu.se, davem@davemloft.net, linux-kernel@vger.kernel.org, netdev@oss.sgi.com, walpole@cs.pdx.edu Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections Message-ID: <20051001011312.GA28204@gondor.apana.org.au> References: <200509300106.j8U16obP021064@rastaban.cs.pdx.edu> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="6c2NcOVqGQ03X4Wi" Content-Disposition: inline In-Reply-To: <200509300106.j8U16obP021064@rastaban.cs.pdx.edu> User-Agent: Mutt/1.5.9i From: Herbert Xu X-archive-position: 3710 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: herbert@gondor.apana.org.au Precedence: bulk X-list: netdev Content-Length: 18253 Lines: 601 --6c2NcOVqGQ03X4Wi Content-Type: text/plain; charset=us-ascii Content-Disposition: inline On Thu, Sep 29, 2005 at 06:06:50PM -0700, Suzanne Wood wrote: > > Are there three cases then? RCU protection with refcnt, RCU without refcnt, > and the bare cast of the dereference? Correct. The following patch renames __in_dev_get() to __in_dev_get_rtnl() and introduces __in_dev_get_rcu() to cover the second case. 1) RCU with refcnt should use in_dev_get(). 
2) RCU without refcnt should use __in_dev_get_rcu().
3) All others must hold RTNL and use __in_dev_get_rtnl().

There is one exception in net/ipv4/route.c which is in fact a pre-existing
race condition.  I've marked it as such so that we remember to fix it.

This patch is based on suggestions and prior work by Suzanne Wood and
Paul McKenney.

Signed-off-by: Herbert Xu

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~}
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

--6c2NcOVqGQ03X4Wi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=p

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2776,7 +2776,7 @@ static u32 bond_glean_dev_ip(struct net_
 		return 0;

 	rcu_read_lock();
-	idev = __in_dev_get(dev);
+	idev = __in_dev_get_rcu(dev);
 	if (!idev)
 		goto out;

diff --git a/drivers/net/wan/sdlamain.c b/drivers/net/wan/sdlamain.c
--- a/drivers/net/wan/sdlamain.c
+++ b/drivers/net/wan/sdlamain.c
@@ -57,6 +57,7 @@
 #include /* request_region(), release_region() */
 #include /* WAN router definitions */
 #include /* WANPIPE common user API definitions */
+#include

 #include
 #include /* phys_to_virt() */
@@ -1268,37 +1269,41 @@ unsigned long get_ip_address(struct net_

 	struct in_ifaddr *ifaddr;
 	struct in_device *in_dev;
+	unsigned long addr = 0;

-	if ((in_dev = __in_dev_get(dev)) == NULL){
-		return 0;
+	rcu_read_lock();
+	if ((in_dev = __in_dev_get_rcu(dev)) == NULL){
+		goto out;
 	}

 	if ((ifaddr = in_dev->ifa_list)== NULL ){
-		return 0;
+		goto out;
 	}

 	switch (option){

 	case WAN_LOCAL_IP:
-		return ifaddr->ifa_local;
+		addr = ifaddr->ifa_local;
 		break;

 	case WAN_POINTOPOINT_IP:
-		return ifaddr->ifa_address;
+		addr = ifaddr->ifa_address;
 		break;

 	case WAN_NETMASK_IP:
-		return ifaddr->ifa_mask;
+		addr = ifaddr->ifa_mask;
 		break;

 	case WAN_BROADCAST_IP:
-		return ifaddr->ifa_broadcast;
+		addr = ifaddr->ifa_broadcast;
 		break;
 	default:
-		return 0;
+		break;
 	}

-	return 0;
+out:
+	rcu_read_unlock();
+	return addr;
 }

 void add_gateway(sdla_t *card, struct net_device *dev)
diff --git a/drivers/net/wan/syncppp.c b/drivers/net/wan/syncppp.c
--- a/drivers/net/wan/syncppp.c
+++ b/drivers/net/wan/syncppp.c
@@ -769,7 +769,7 @@ static void sppp_cisco_input (struct spp
 		u32 addr = 0, mask = ~0; /* FIXME: is the mask correct? */
 #ifdef CONFIG_INET
 		rcu_read_lock();
-		if ((in_dev = __in_dev_get(dev)) != NULL)
+		if ((in_dev = __in_dev_get_rcu(dev)) != NULL)
 		{
 			for (ifa=in_dev->ifa_list; ifa != NULL;
 				ifa=ifa->ifa_next) {
diff --git a/drivers/net/wireless/strip.c b/drivers/net/wireless/strip.c
--- a/drivers/net/wireless/strip.c
+++ b/drivers/net/wireless/strip.c
@@ -1352,7 +1352,7 @@ static unsigned char *strip_make_packet(
 		struct in_device *in_dev;

 		rcu_read_lock();
-		in_dev = __in_dev_get(strip_info->dev);
+		in_dev = __in_dev_get_rcu(strip_info->dev);
 		if (in_dev == NULL) {
 			rcu_read_unlock();
 			return NULL;
@@ -1508,7 +1508,7 @@ static void strip_send(struct strip *str

 		brd = addr = 0;
 		rcu_read_lock();
-		in_dev = __in_dev_get(strip_info->dev);
+		in_dev = __in_dev_get_rcu(strip_info->dev);
 		if (in_dev) {
 			if (in_dev->ifa_list) {
 				brd = in_dev->ifa_list->ifa_broadcast;
diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c
--- a/drivers/parisc/led.c
+++ b/drivers/parisc/led.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -358,9 +359,10 @@ static __inline__ int led_get_net_activi
 	/* we are running as tasklet, so locking dev_base
 	 * for reading should be OK */
 	read_lock(&dev_base_lock);
+	rcu_read_lock();
 	for (dev = dev_base; dev; dev = dev->next) {
 		struct net_device_stats *stats;
-		struct in_device *in_dev = __in_dev_get(dev);
+		struct in_device *in_dev = __in_dev_get_rcu(dev);
 		if (!in_dev || !in_dev->ifa_list)
 			continue;
 		if (LOOPBACK(in_dev->ifa_list->ifa_local))
@@ -371,6 +373,7 @@ static __inline__ int led_get_net_activi
 		rx_total += stats->rx_packets;
 		tx_total += stats->tx_packets;
 	}
+	rcu_read_unlock();
 	read_unlock(&dev_base_lock);
 	retval = 0;

diff --git a/drivers/s390/net/qeth_main.c b/drivers/s390/net/qeth_main.c
--- a/drivers/s390/net/qeth_main.c
+++ b/drivers/s390/net/qeth_main.c
@@ -5200,7 +5200,7 @@ qeth_free_vlan_addresses4(struct qeth_ca
 	if (!card->vlangrp)
 		return;
 	rcu_read_lock();
-	in_dev = __in_dev_get(card->vlangrp->vlan_devices[vid]);
+	in_dev = __in_dev_get_rcu(card->vlangrp->vlan_devices[vid]);
 	if (!in_dev)
 		goto out;
 	for (ifa = in_dev->ifa_list; ifa; ifa = ifa->ifa_next) {
@@ -7725,7 +7725,7 @@ qeth_arp_constructor(struct neighbour *n
 		goto out;

 	rcu_read_lock();
-	in_dev = rcu_dereference(__in_dev_get(dev));
+	in_dev = __in_dev_get_rcu(dev);
 	if (in_dev == NULL) {
 		rcu_read_unlock();
 		return -EINVAL;
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -142,13 +142,21 @@ static __inline__ int bad_mask(u32 mask,

 #define endfor_ifa(in_dev) }

+static inline struct in_device *__in_dev_get_rcu(const struct net_device *dev)
+{
+	struct in_device *in_dev = dev->ip_ptr;
+	if (in_dev)
+		in_dev = rcu_dereference(in_dev);
+	return in_dev;
+}
+
 static __inline__ struct in_device *
 in_dev_get(const struct net_device *dev)
 {
 	struct in_device *in_dev;

 	rcu_read_lock();
-	in_dev = dev->ip_ptr;
+	in_dev = __in_dev_get_rcu(dev);
 	if (in_dev)
 		atomic_inc(&in_dev->refcnt);
 	rcu_read_unlock();
@@ -156,7 +164,7 @@ in_dev_get(const struct net_device *dev)
 }

 static __inline__ struct in_device *
-__in_dev_get(const struct net_device *dev)
+__in_dev_get_rtnl(const struct net_device *dev)
 {
 	return (struct in_device*)dev->ip_ptr;
 }
diff --git a/net/atm/clip.c b/net/atm/clip.c
--- a/net/atm/clip.c
+++ b/net/atm/clip.c
@@ -310,7 +310,7 @@ static int clip_constructor(struct neigh
 	if (neigh->type != RTN_UNICAST) return -EINVAL;

 	rcu_read_lock();
-	in_dev = rcu_dereference(__in_dev_get(dev));
+	in_dev = __in_dev_get_rcu(dev);
 	if (!in_dev) {
 		rcu_read_unlock();
 		return -EINVAL;
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -703,7 +703,7 @@ int netpoll_setup(struct netpoll *np)

 	if (!np->local_ip) {
 		rcu_read_lock();
-		in_dev = __in_dev_get(ndev);
+		in_dev = __in_dev_get_rcu(ndev);

 		if (!in_dev || !in_dev->ifa_list) {
 			rcu_read_unlock();
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -1667,7 +1667,7 @@ static void pktgen_setup_inject(struct p
 			struct in_device *in_dev;

 			rcu_read_lock();
-			in_dev = __in_dev_get(pkt_dev->odev);
+			in_dev = __in_dev_get_rcu(pkt_dev->odev);
 			if (in_dev) {
 				if (in_dev->ifa_list) {
 					pkt_dev->saddr_min = in_dev->ifa_list->ifa_address;
diff --git a/net/econet/af_econet.c b/net/econet/af_econet.c
--- a/net/econet/af_econet.c
+++ b/net/econet/af_econet.c
@@ -406,7 +406,7 @@ static int econet_sendmsg(struct kiocb *
 		unsigned long network = 0;

 		rcu_read_lock();
-		idev = __in_dev_get(dev);
+		idev = __in_dev_get_rcu(dev);
 		if (idev) {
 			if (idev->ifa_list)
 				network = ntohl(idev->ifa_list->ifa_address) &
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -241,7 +241,7 @@ static int arp_constructor(struct neighb
 	neigh->type = inet_addr_type(addr);

 	rcu_read_lock();
-	in_dev = rcu_dereference(__in_dev_get(dev));
+	in_dev = __in_dev_get_rcu(dev);
 	if (in_dev == NULL) {
 		rcu_read_unlock();
 		return -EINVAL;
@@ -990,8 +990,8 @@ static int arp_req_set(struct arpreq *r,
 				ipv4_devconf.proxy_arp = 1;
 				return 0;
 			}
-			if (__in_dev_get(dev)) {
-				__in_dev_get(dev)->cnf.proxy_arp = 1;
+			if (__in_dev_get_rtnl(dev)) {
+				__in_dev_get_rtnl(dev)->cnf.proxy_arp = 1;
 				return 0;
 			}
 			return -ENXIO;
@@ -1096,8 +1096,8 @@ static int arp_req_delete(struct arpreq
 				ipv4_devconf.proxy_arp = 0;
 				return 0;
 			}
-			if (__in_dev_get(dev)) {
-				__in_dev_get(dev)->cnf.proxy_arp = 0;
+			if (__in_dev_get_rtnl(dev)) {
+				__in_dev_get_rtnl(dev)->cnf.proxy_arp = 0;
 				return 0;
 			}
 			return -ENXIO;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -351,7 +351,7 @@ static int inet_insert_ifa(struct in_ifa

 static int inet_set_ifa(struct net_device *dev, struct in_ifaddr *ifa)
 {
-	struct in_device *in_dev = __in_dev_get(dev);
+	struct in_device *in_dev = __in_dev_get_rtnl(dev);

 	ASSERT_RTNL();

@@ -449,7 +449,7 @@ static int inet_rtm_newaddr(struct sk_bu
 		goto out;

 	rc = -ENOBUFS;
-	if ((in_dev = __in_dev_get(dev)) == NULL) {
+	if ((in_dev = __in_dev_get_rtnl(dev)) == NULL) {
 		in_dev = inetdev_init(dev);
 		if (!in_dev)
 			goto out;
@@ -584,7 +584,7 @@ int devinet_ioctl(unsigned int cmd, void
 	if (colon)
 		*colon = ':';

-	if ((in_dev = __in_dev_get(dev)) != NULL) {
+	if ((in_dev = __in_dev_get_rtnl(dev)) != NULL) {
 		if (tryaddrmatch) {
 			/* Matthias Andree */
 			/* compare label and address (4.4BSD style) */
@@ -748,7 +748,7 @@ rarok:

 static int inet_gifconf(struct net_device *dev, char __user *buf, int len)
 {
-	struct in_device *in_dev = __in_dev_get(dev);
+	struct in_device *in_dev = __in_dev_get_rtnl(dev);
 	struct in_ifaddr *ifa;
 	struct ifreq ifr;
 	int done = 0;
@@ -791,7 +791,7 @@ u32 inet_select_addr(const struct net_de
 	struct in_device *in_dev;

 	rcu_read_lock();
-	in_dev = __in_dev_get(dev);
+	in_dev = __in_dev_get_rcu(dev);
 	if (!in_dev)
 		goto no_in_dev;

@@ -818,7 +818,7 @@ no_in_dev:
 	read_lock(&dev_base_lock);
 	rcu_read_lock();
 	for (dev = dev_base; dev; dev = dev->next) {
-		if ((in_dev = __in_dev_get(dev)) == NULL)
+		if ((in_dev = __in_dev_get_rcu(dev)) == NULL)
 			continue;

 		for_primary_ifa(in_dev) {
@@ -887,7 +887,7 @@ u32 inet_confirm_addr(const struct net_d

 	if (dev) {
 		rcu_read_lock();
-		if ((in_dev = __in_dev_get(dev)))
+		if ((in_dev = __in_dev_get_rcu(dev)))
 			addr = confirm_addr_indev(in_dev, dst, local, scope);
 		rcu_read_unlock();

@@ -897,7 +897,7 @@ u32 inet_confirm_addr(const struct net_d
 	read_lock(&dev_base_lock);
 	rcu_read_lock();
 	for (dev = dev_base; dev; dev = dev->next) {
-		if ((in_dev = __in_dev_get(dev))) {
+		if ((in_dev = __in_dev_get_rcu(dev))) {
 			addr = confirm_addr_indev(in_dev, dst, local, scope);
 			if (addr)
 				break;
@@ -957,7 +957,7 @@ static int inetdev_event(struct notifier
 			 void *ptr)
 {
 	struct net_device *dev = ptr;
-	struct in_device *in_dev = __in_dev_get(dev);
+	struct in_device *in_dev = __in_dev_get_rtnl(dev);

 	ASSERT_RTNL();

@@ -1078,7 +1078,7 @@ static int inet_dump_ifaddr(struct sk_bu
 		if (idx > s_idx)
 			s_ip_idx = 0;
 		rcu_read_lock();
-		if ((in_dev = __in_dev_get(dev)) == NULL) {
+		if ((in_dev = __in_dev_get_rcu(dev)) == NULL) {
 			rcu_read_unlock();
 			continue;
 		}
@@ -1149,7 +1149,7 @@ void inet_forward_change(void)
 	for (dev = dev_base; dev; dev = dev->next) {
 		struct in_device *in_dev;
 		rcu_read_lock();
-		in_dev = __in_dev_get(dev);
+		in_dev = __in_dev_get_rcu(dev);
 		if (in_dev)
 			in_dev->cnf.forwarding = on;
 		rcu_read_unlock();
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -173,7 +173,7 @@ int fib_validate_source(u32 src, u32 dst

 	no_addr = rpf = 0;
 	rcu_read_lock();
-	in_dev = __in_dev_get(dev);
+	in_dev = __in_dev_get_rcu(dev);
 	if (in_dev) {
 		no_addr = in_dev->ifa_list == NULL;
 		rpf = IN_DEV_RPFILTER(in_dev);
@@ -607,7 +607,7 @@ static int fib_inetaddr_event(struct not
 static int fib_netdev_event(struct notifier_block *this, unsigned long event, void *ptr)
 {
 	struct net_device *dev = ptr;
-	struct in_device *in_dev = __in_dev_get(dev);
+	struct in_device *in_dev = __in_dev_get_rtnl(dev);

 	if (event == NETDEV_UNREGISTER) {
 		fib_disable_ip(dev, 2);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1087,7 +1087,7 @@ fib_convert_rtentry(int cmd, struct nlms
 		rta->rta_oif = &dev->ifindex;
 		if (colon) {
 			struct in_ifaddr *ifa;
-			struct in_device *in_dev = __in_dev_get(dev);
+			struct in_device *in_dev = __in_dev_get_rtnl(dev);
 			if (!in_dev)
 				return -ENODEV;
 			*colon = ':';
@@ -1268,7 +1268,7 @@ int fib_sync_up(struct net_device *dev)
 			}
 			if (nh->nh_dev == NULL || !(nh->nh_dev->flags&IFF_UP))
 				continue;
-			if (nh->nh_dev != dev || __in_dev_get(dev) == NULL)
+			if (nh->nh_dev != dev || !__in_dev_get_rtnl(dev))
 				continue;
 			alive++;
 			spin_lock_bh(&fib_multipath_lock);
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -1323,7 +1323,7 @@ static struct in_device * ip_mc_find_dev
 	}
 	if (dev) {
 		imr->imr_ifindex = dev->ifindex;
-		idev = __in_dev_get(dev);
+		idev = __in_dev_get_rtnl(dev);
 	}
 	return idev;
 }
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -1104,10 +1104,10 @@ static int ipgre_open(struct net_device
 			return -EADDRNOTAVAIL;
 		dev = rt->u.dst.dev;
 		ip_rt_put(rt);
-		if (__in_dev_get(dev) == NULL)
+		if (__in_dev_get_rtnl(dev) == NULL)
 			return -EADDRNOTAVAIL;
 		t->mlink = dev->ifindex;
-		ip_mc_inc_group(__in_dev_get(dev), t->parms.iph.daddr);
+		ip_mc_inc_group(__in_dev_get_rtnl(dev), t->parms.iph.daddr);
 	}
 	return 0;
 }
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -149,7 +149,7 @@ struct net_device *ipmr_new_tunnel(struc
 		if (err == 0 && (dev = __dev_get_by_name(p.name)) != NULL) {
 			dev->flags |= IFF_MULTICAST;

-			in_dev = __in_dev_get(dev);
+			in_dev = __in_dev_get_rtnl(dev);
 			if (in_dev == NULL && (in_dev = inetdev_init(dev)) == NULL)
 				goto failure;
 			in_dev->cnf.rp_filter = 0;
@@ -278,7 +278,7 @@ static int vif_delete(int vifi)

 	dev_set_allmulti(dev, -1);

-	if ((in_dev = __in_dev_get(dev)) != NULL) {
+	if ((in_dev = __in_dev_get_rtnl(dev)) != NULL) {
 		in_dev->cnf.mc_forwarding--;
 		ip_rt_multicast_event(in_dev);
 	}
@@ -421,7 +421,7 @@ static int vif_add(struct vifctl *vifc,
 		return -EINVAL;
 	}

-	if ((in_dev = __in_dev_get(dev)) == NULL)
+	if ((in_dev = __in_dev_get_rtnl(dev)) == NULL)
 		return -EADDRNOTAVAIL;
 	in_dev->cnf.mc_forwarding++;
 	dev_set_allmulti(dev, +1);
diff --git a/net/ipv4/netfilter/ip_conntrack_netbios_ns.c b/net/ipv4/netfilter/ip_conntrack_netbios_ns.c
--- a/net/ipv4/netfilter/ip_conntrack_netbios_ns.c
+++ b/net/ipv4/netfilter/ip_conntrack_netbios_ns.c
@@ -58,7 +58,7 @@ static int help(struct sk_buff **pskb,
 		goto out;

 	rcu_read_lock();
-	in_dev = __in_dev_get(rt->u.dst.dev);
+	in_dev = __in_dev_get_rcu(rt->u.dst.dev);
 	if (in_dev != NULL) {
 		for_primary_ifa(in_dev) {
 			if (ifa->ifa_broadcast == iph->daddr) {
diff --git a/net/ipv4/netfilter/ipt_REDIRECT.c b/net/ipv4/netfilter/ipt_REDIRECT.c
--- a/net/ipv4/netfilter/ipt_REDIRECT.c
+++ b/net/ipv4/netfilter/ipt_REDIRECT.c
@@ -93,7 +93,7 @@ redirect_target(struct sk_buff **pskb,
 		newdst = 0;

 		rcu_read_lock();
-		indev = __in_dev_get((*pskb)->dev);
+		indev = __in_dev_get_rcu((*pskb)->dev);
 		if (indev && (ifa = indev->ifa_list))
 			newdst = ifa->ifa_local;
 		rcu_read_unlock();
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2128,7 +2128,7 @@ int ip_route_input(struct sk_buff *skb,
 		struct in_device *in_dev;

 		rcu_read_lock();
-		if ((in_dev = __in_dev_get(dev)) != NULL) {
+		if ((in_dev = __in_dev_get_rcu(dev)) != NULL) {
 			int our = ip_check_mc(in_dev, daddr, saddr,
 				skb->nh.iph->protocol);
 			if (our
@@ -2443,7 +2443,9 @@ static int ip_route_output_slow(struct r
 		err = -ENODEV;
 		if (dev_out == NULL)
 			goto out;
-		if (__in_dev_get(dev_out) == NULL) {
+
+		/* RACE: Check return value of inet_select_addr instead. */
+		if (__in_dev_get_rtnl(dev_out) == NULL) {
 			dev_put(dev_out);
 			goto out;	/* Wrong error code */
 		}
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1806,7 +1806,7 @@ static void sit_add_v4_addrs(struct inet
 	}

 	for (dev = dev_base; dev != NULL; dev = dev->next) {
-		struct in_device * in_dev = __in_dev_get(dev);
+		struct in_device * in_dev = __in_dev_get_rtnl(dev);
 		if (in_dev && (dev->flags & IFF_UP)) {
 			struct in_ifaddr * ifa;

diff --git a/net/irda/irlan/irlan_eth.c b/net/irda/irlan/irlan_eth.c
--- a/net/irda/irlan/irlan_eth.c
+++ b/net/irda/irlan/irlan_eth.c
@@ -310,7 +310,7 @@ void irlan_eth_send_gratuitous_arp(struc
 #ifdef CONFIG_INET
 	IRDA_DEBUG(4, "IrLAN: Sending gratuitous ARP\n");
 	rcu_read_lock();
-	in_dev = __in_dev_get(dev);
+	in_dev = __in_dev_get_rcu(dev);
 	if (in_dev == NULL)
 		goto out;
 	if (in_dev->ifa_list)
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -147,7 +147,7 @@ static void sctp_v4_copy_addrlist(struct
 	struct sctp_sockaddr_entry *addr;

 	rcu_read_lock();
-	if ((in_dev = __in_dev_get(dev)) == NULL) {
+	if ((in_dev = __in_dev_get_rcu(dev)) == NULL) {
 		rcu_read_unlock();
 		return;
 	}

--6c2NcOVqGQ03X4Wi--
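
Illustration only, not part of the patch: the three access rules from the
message above can be modelled in user space roughly as follows. This is a
minimal sketch under loud assumptions. The structs are cut-down stand-ins,
the RCU and RTNL "locks" are plain counters, the assert() checks exist only
in this model, and in_dev_put() here is a mock release helper. The function
names drop the leading underscores of __in_dev_get_rcu() and
__in_dev_get_rtnl() because such names are reserved in user-space C.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins for the kernel types; only the fields used here. */
struct in_device { int refcnt; };
struct net_device { struct in_device *ip_ptr; };

/* Mock lock state, so the usage rules can be checked with assert(). */
static int rcu_depth;	/* nesting depth of the mock RCU read side */
static int rtnl_held;	/* non-zero while the mock RTNL is held */

static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { assert(rcu_depth > 0); rcu_depth--; }
static void rtnl_lock(void)       { rtnl_held = 1; }
static void rtnl_unlock(void)     { rtnl_held = 0; }

/* Case 2: RCU without refcnt. Result is valid only until rcu_read_unlock(). */
static struct in_device *in_dev_get_rcu(const struct net_device *dev)
{
	assert(rcu_depth > 0);	/* model-only check, not in the patch */
	return dev->ip_ptr;	/* the patch applies rcu_dereference() here */
}

/* Case 1: RCU with refcnt. The caller owns a reference afterwards. */
static struct in_device *in_dev_get(const struct net_device *dev)
{
	struct in_device *in_dev;

	rcu_read_lock();
	in_dev = in_dev_get_rcu(dev);
	if (in_dev)
		in_dev->refcnt++;	/* atomic_inc() in the patch */
	rcu_read_unlock();
	return in_dev;
}

/* Mock release helper: drops the reference taken by in_dev_get(). */
static void in_dev_put(struct in_device *in_dev) { in_dev->refcnt--; }

/* Case 3: everything else. The caller must hold RTNL. */
static struct in_device *in_dev_get_rtnl(const struct net_device *dev)
{
	assert(rtnl_held);	/* model-only check, not in the patch */
	return dev->ip_ptr;
}

/* Exercises the three cases; returns 0 if each behaves as described. */
int demo(void)
{
	struct in_device idev = { .refcnt = 1 };
	struct net_device dev = { .ip_ptr = &idev };
	struct in_device *p;
	int ok;

	p = in_dev_get(&dev);			/* case 1 */
	if (p != &idev || idev.refcnt != 2)
		return 1;
	in_dev_put(p);

	rcu_read_lock();			/* case 2 */
	p = in_dev_get_rcu(&dev);
	ok = (p == &idev && idev.refcnt == 1);
	rcu_read_unlock();
	if (!ok)
		return 2;

	rtnl_lock();				/* case 3 */
	p = in_dev_get_rtnl(&dev);
	rtnl_unlock();
	if (p != &idev || idev.refcnt != 1)
		return 3;

	return 0;
}
```

In this model, only case 1 changes the reference count. Cases 2 and 3 rely
on the surrounding lock to keep the pointer valid, so the result should not
be used after the lock is dropped.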