From amine@anevia.com Thu Sep 1 08:09:59 2005 Received: with ECARTIS (v1.0.0; list netdev); Thu, 01 Sep 2005 08:10:03 -0700 (PDT) Received: from smtp8.wanadoo.fr (smtp8.wanadoo.fr [193.252.22.23]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j81F9wiL015566 for ; Thu, 1 Sep 2005 08:09:59 -0700 Received: from me-wanadoo.net (localhost [127.0.0.1]) by mwinf0808.wanadoo.fr (SMTP Server) with ESMTP id ECFD61C00223 for ; Thu, 1 Sep 2005 17:07:28 +0200 (CEST) Received: from goliath.anevia.com (LSt-Amand-152-31-11-137.w82-127.abo.wanadoo.fr [82.127.10.137]) by mwinf0808.wanadoo.fr (SMTP Server) with ESMTP id D326B1C0021C for ; Thu, 1 Sep 2005 17:07:28 +0200 (CEST) X-ME-UUID: 20050901150728864.D326B1C0021C@mwinf0808.wanadoo.fr Received: from therese.anevia.com (therese.anevia.com [10.0.1.33]) by goliath.anevia.com (Postfix) with ESMTP id D0DE91300048 for ; Thu, 1 Sep 2005 17:07:31 +0200 (CEST) From: amine To: netdev@oss.sgi.com Subject: Linux multicast support Date: Thu, 1 Sep 2005 17:05:47 +0200 User-Agent: KMail/1.7.2 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200509011705.47623.amine@anevia.com> X-archive-position: 3588 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: amine@anevia.com Precedence: bulk X-list: netdev Hi, I have a question about Multicast in Linux IP stack. I need to know why are loking the " dev->xmit_lock" when mading change in device multicast list? Is it required to suppress parallel execution of that handler and set_multicast_list? Thank in advance -- EL HEDADI Amine R&D phone : Email : amine@anevia.com From manfred@colorfullife.com Sun Sep 4 05:36:49 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 04 Sep 2005 05:36:57 -0700 (PDT) Received: from dbl.q-ag.de (dbl.q-ag.de [213.172.117.3]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j84CaiiL027751 for ; Sun, 4 Sep 2005 05:36:47 -0700 Received: from [127.0.0.2] (dbl [127.0.0.1]) by dbl.q-ag.de (8.13.3/8.13.3/Debian-6) with ESMTP id j84CeUvV015789; Sun, 4 Sep 2005 14:40:31 +0200 Message-ID: <431AE9B7.2040300@colorfullife.com> Date: Sun, 04 Sep 2005 14:33:59 +0200 From: Manfred Spraul User-Agent: Mozilla/5.0 (X11; U; Linux i686; fr-FR; rv:1.7.10) Gecko/20050719 Fedora/1.7.10-1.5.1 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Linux Kernel Mailing List , Netdev CC: Ayaz Abdulla Subject: [CFT] forcedeth backport to 2.4 Content-Type: multipart/mixed; boundary="------------040003030102010107040708" X-archive-position: 3591 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: manfred@colorfullife.com Precedence: bulk X-list: netdev Content-Length: 51824 Lines: 1630 This is a multi-part message in MIME format. --------------040003030102010107040708 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Hi, Attached is a backport of the latest forcedeth version to 2.4. It includes lots of changes, among them: - a critical bugfix for nv_open(): ifdown/ifup cycles resulted in an incomplete initialization that causes hangs after a few MB network traffic. - jumbo frame support - far better ethtool support - 64-bit dma support - support for additional nforce versions. It compiles and boots, but I can't test it properly. Could you give it a try? -- Manfred --------------040003030102010107040708 Content-Type: text/plain; name="patch-forcedeth-backport" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="patch-forcedeth-backport" --- 2.4/drivers/net/forcedeth.c 2005-01-19 15:09:56.000000000 +0100 +++ build-2.4/drivers/net/forcedeth.c 2005-09-04 13:58:07.000000000 +0200 @@ -79,6 +79,22 @@ * 0.30: 25 Sep 2004: rx checksum support for nf 250 Gb. Add rx reset * into nv_close, otherwise reenabling for wol can * cause DMA to kfree'd memory. + * 0.31: 14 Nov 2004: ethtool support for getting/setting link + * capabilities. + * 0.32: 16 Apr 2005: RX_ERROR4 handling added. + * 0.33: 16 May 2005: Support for MCP51 added. + * 0.34: 18 Jun 2005: Add DEV_NEED_LINKTIMER to all nForce nics. + * 0.35: 26 Jun 2005: Support for MCP55 added. + * 0.36: 28 Jun 2005: Add jumbo frame support. + * 0.37: 10 Jul 2005: Additional ethtool support, cleanup of pci id list + * 0.38: 16 Jul 2005: tx irq rewrite: Use global flags instead of + * per-packet flags. + * 0.39: 18 Jul 2005: Add 64bit descriptor support. + * 0.40: 19 Jul 2005: Add support for mac address change. + * 0.41: 30 Jul 2005: Write back original MAC in nv_close instead + * of nv_remove + * 0.42: 06 Aug 2005: Fix lack of link speed initialization + * in the second (and later) nv_open call * * Known bugs: * We suspect that on some hardware no TX done interrupts are generated. @@ -90,7 +106,7 @@ * DEV_NEED_TIMERIRQ will not harm you on sane hardware, only generating a few * superfluous timer interrupts from the nic. */ -#define FORCEDETH_VERSION "0.30" +#define FORCEDETH_VERSION "0.42" #define DRV_NAME "forcedeth" #include @@ -108,6 +124,7 @@ #include #include #include +#include #include #include @@ -125,11 +142,10 @@ * Hardware access: */ -#define DEV_NEED_LASTPACKET1 0x0001 /* set LASTPACKET1 in tx flags */ -#define DEV_IRQMASK_1 0x0002 /* use NVREG_IRQMASK_WANTED_1 for irq mask */ -#define DEV_IRQMASK_2 0x0004 /* use NVREG_IRQMASK_WANTED_2 for irq mask */ -#define DEV_NEED_TIMERIRQ 0x0008 /* set the timer irq flag in the irq mask */ -#define DEV_NEED_LINKTIMER 0x0010 /* poll link settings. Relies on the timer irq */ +#define DEV_NEED_TIMERIRQ 0x0001 /* set the timer irq flag in the irq mask */ +#define DEV_NEED_LINKTIMER 0x0002 /* poll link settings. Relies on the timer irq */ +#define DEV_HAS_LARGEDESC 0x0004 /* device supports jumbo frames and needs packet format 2 */ +#define DEV_HAS_HIGH_DMA 0x0008 /* device supports 64bit dma */ enum { NvRegIrqStatus = 0x000, @@ -140,13 +156,16 @@ #define NVREG_IRQ_RX 0x0002 #define NVREG_IRQ_RX_NOBUF 0x0004 #define NVREG_IRQ_TX_ERR 0x0008 -#define NVREG_IRQ_TX2 0x0010 +#define NVREG_IRQ_TX_OK 0x0010 #define NVREG_IRQ_TIMER 0x0020 #define NVREG_IRQ_LINK 0x0040 +#define NVREG_IRQ_TX_ERROR 0x0080 #define NVREG_IRQ_TX1 0x0100 -#define NVREG_IRQMASK_WANTED_1 0x005f -#define NVREG_IRQMASK_WANTED_2 0x0147 -#define NVREG_IRQ_UNKNOWN (~(NVREG_IRQ_RX_ERROR|NVREG_IRQ_RX|NVREG_IRQ_RX_NOBUF|NVREG_IRQ_TX_ERR|NVREG_IRQ_TX2|NVREG_IRQ_TIMER|NVREG_IRQ_LINK|NVREG_IRQ_TX1)) +#define NVREG_IRQMASK_WANTED 0x00df + +#define NVREG_IRQ_UNKNOWN (~(NVREG_IRQ_RX_ERROR|NVREG_IRQ_RX|NVREG_IRQ_RX_NOBUF|NVREG_IRQ_TX_ERR| \ + NVREG_IRQ_TX_OK|NVREG_IRQ_TIMER|NVREG_IRQ_LINK|NVREG_IRQ_TX_ERROR| \ + NVREG_IRQ_TX1)) NvRegUnknownSetupReg6 = 0x008, #define NVREG_UNKSETUP6_VAL 3 @@ -211,6 +230,7 @@ #define NVREG_LINKSPEED_10 1000 #define NVREG_LINKSPEED_100 100 #define NVREG_LINKSPEED_1000 50 +#define NVREG_LINKSPEED_MASK (0xFFF) NvRegUnknownSetupReg5 = 0x130, #define NVREG_UNKSETUP5_BIT31 (1<<31) NvRegUnknownSetupReg3 = 0x13c, @@ -279,6 +299,18 @@ u32 FlagLen; }; +struct ring_desc_ex { + u32 PacketBufferHigh; + u32 PacketBufferLow; + u32 Reserved; + u32 FlagLen; +}; + +typedef union _ring_type { + struct ring_desc* orig; + struct ring_desc_ex* ex; +} ring_type; + #define FLAG_MASK_V1 0xffff0000 #define FLAG_MASK_V2 0xffffc000 #define LEN_MASK_V1 (0xffffffff ^ FLAG_MASK_V1) @@ -286,7 +318,7 @@ #define NV_TX_LASTPACKET (1<<16) #define NV_TX_RETRYERROR (1<<19) -#define NV_TX_LASTPACKET1 (1<<24) +#define NV_TX_FORCED_INTERRUPT (1<<24) #define NV_TX_DEFERRED (1<<26) #define NV_TX_CARRIERLOST (1<<27) #define NV_TX_LATECOLLISION (1<<28) @@ -296,7 +328,7 @@ #define NV_TX2_LASTPACKET (1<<29) #define NV_TX2_RETRYERROR (1<<18) -#define NV_TX2_LASTPACKET1 (1<<23) +#define NV_TX2_FORCED_INTERRUPT (1<<30) #define NV_TX2_DEFERRED (1<<25) #define NV_TX2_CARRIERLOST (1<<26) #define NV_TX2_LATECOLLISION (1<<27) @@ -362,7 +394,7 @@ #define RX_RING 128 #define TX_RING 64 -/* +/* * If your nic mysteriously hangs then try to reduce the limits * to 1/0: It might be required to set NV_TX_LASTPACKET in the * last valid ring entry. But this would be impossible to @@ -372,15 +404,19 @@ #define TX_LIMIT_START 62 /* rx/tx mac addr + type + vlan + align + slack*/ -#define RX_NIC_BUFSIZE (ETH_DATA_LEN + 64) -/* even more slack */ -#define RX_ALLOC_BUFSIZE (ETH_DATA_LEN + 128) +#define NV_RX_HEADERS (64) +/* even more slack. */ +#define NV_RX_ALLOC_PAD (64) + +/* maximum mtu size */ +#define NV_PKTLIMIT_1 ETH_DATA_LEN /* hard limit not known */ +#define NV_PKTLIMIT_2 9100 /* Actual limit according to NVidia: 9202 */ #define OOM_REFILL (1+HZ/20) #define POLL_WAIT (1+HZ/100) #define LINK_TIMEOUT (3*HZ) -/* +/* * desc_ver values: * This field has two purposes: * - Newer nics uses a different ring layout. The layout is selected by @@ -389,6 +425,7 @@ */ #define DESC_VER_1 0x0 #define DESC_VER_2 (0x02100|NVREG_TXRXCTL_RXCHECK) +#define DESC_VER_3 (0x02200|NVREG_TXRXCTL_RXCHECK) /* PHY defines */ #define PHY_OUI_MARVELL 0x5043 @@ -442,6 +479,8 @@ int in_shutdown; u32 linkspeed; int duplex; + int autoneg; + int fixed_mode; int phyaddr; int wolenabled; unsigned int phy_oui; @@ -454,14 +493,17 @@ u32 irqmask; u32 desc_ver; + void __iomem *base; + /* rx specific fields. * Locking: Within irq hander or disable_irq+spin_lock(&np->lock); */ - struct ring_desc *rx_ring; + ring_type rx_ring; unsigned int cur_rx, refill_rx; struct sk_buff *rx_skbuff[RX_RING]; dma_addr_t rx_dma[RX_RING]; unsigned int rx_buf_sz; + unsigned int pkt_limit; struct timer_list oom_kick; struct timer_list nic_poll; @@ -473,7 +515,7 @@ /* * tx specific fields. */ - struct ring_desc *tx_ring; + ring_type tx_ring; unsigned int next_tx, nic_tx; struct sk_buff *tx_skbuff[TX_RING]; dma_addr_t tx_dma[TX_RING]; @@ -488,15 +530,15 @@ static inline struct fe_priv *get_nvpriv(struct net_device *dev) { - return (struct fe_priv *) dev->priv; + return netdev_priv(dev); } -static inline u8 *get_hwbase(struct net_device *dev) +static inline u8 __iomem *get_hwbase(struct net_device *dev) { - return (u8 *) dev->base_addr; + return get_nvpriv(dev)->base; } -static inline void pci_push(u8 * base) +static inline void pci_push(u8 __iomem *base) { /* force out pending posted writes */ readl(base); @@ -508,10 +550,15 @@ & ((v == DESC_VER_1) ? LEN_MASK_V1 : LEN_MASK_V2); } +static inline u32 nv_descr_getlength_ex(struct ring_desc_ex *prd, u32 v) +{ + return le32_to_cpu(prd->FlagLen) & LEN_MASK_V2; +} + static int reg_delay(struct net_device *dev, int offset, u32 mask, u32 target, int delay, int delaymax, const char *msg) { - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); pci_push(base); do { @@ -533,7 +580,7 @@ */ static int mii_rw(struct net_device *dev, int addr, int miireg, int value) { - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); u32 reg; int retval; @@ -604,7 +651,7 @@ static int phy_init(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); u32 phyinterface, phy_reserved, mii_status, mii_control, mii_control_1000,reg; /* set advertise register */ @@ -681,7 +728,7 @@ static void nv_start_rx(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); dprintk(KERN_DEBUG "%s: nv_start_rx\n", dev->name); /* Already running? Stop it. */ @@ -699,7 +746,7 @@ static void nv_stop_rx(struct net_device *dev) { - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); dprintk(KERN_DEBUG "%s: nv_stop_rx\n", dev->name); writel(0, base + NvRegReceiverControl); @@ -713,7 +760,7 @@ static void nv_start_tx(struct net_device *dev) { - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); dprintk(KERN_DEBUG "%s: nv_start_tx\n", dev->name); writel(NVREG_XMITCTL_START, base + NvRegTransmitterControl); @@ -722,7 +769,7 @@ static void nv_stop_tx(struct net_device *dev) { - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); dprintk(KERN_DEBUG "%s: nv_stop_tx\n", dev->name); writel(0, base + NvRegTransmitterControl); @@ -737,7 +784,7 @@ static void nv_txrx_reset(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); dprintk(KERN_DEBUG "%s: nv_txrx_reset\n", dev->name); writel(NVREG_TXRXCTL_BIT2 | NVREG_TXRXCTL_RESET | np->desc_ver, base + NvRegTxRxControl); @@ -764,50 +811,6 @@ return &np->stats; } -static void nv_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info) -{ - struct fe_priv *np = get_nvpriv(dev); - strcpy(info->driver, "forcedeth"); - strcpy(info->version, FORCEDETH_VERSION); - strcpy(info->bus_info, pci_name(np->pci_dev)); -} - -static void nv_get_wol(struct net_device *dev, struct ethtool_wolinfo *wolinfo) -{ - struct fe_priv *np = get_nvpriv(dev); - wolinfo->supported = WAKE_MAGIC; - - spin_lock_irq(&np->lock); - if (np->wolenabled) - wolinfo->wolopts = WAKE_MAGIC; - spin_unlock_irq(&np->lock); -} - -static int nv_set_wol(struct net_device *dev, struct ethtool_wolinfo *wolinfo) -{ - struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); - - spin_lock_irq(&np->lock); - if (wolinfo->wolopts == 0) { - writel(0, base + NvRegWakeUpFlags); - np->wolenabled = 0; - } - if (wolinfo->wolopts & WAKE_MAGIC) { - writel(NVREG_WAKEUPFLAGS_ENABLE, base + NvRegWakeUpFlags); - np->wolenabled = 1; - } - spin_unlock_irq(&np->lock); - return 0; -} - -static struct ethtool_ops ops = { - .get_drvinfo = nv_get_drvinfo, - .get_link = ethtool_op_get_link, - .get_wol = nv_get_wol, - .set_wol = nv_set_wol, -}; - /* * nv_alloc_rx: fill rx ring entries. * Return 1 if the allocations for the skbs failed and the @@ -825,7 +828,7 @@ nr = refill_rx % RX_RING; if (np->rx_skbuff[nr] == NULL) { - skb = dev_alloc_skb(RX_ALLOC_BUFSIZE); + skb = dev_alloc_skb(np->rx_buf_sz + NV_RX_ALLOC_PAD); if (!skb) break; @@ -836,9 +839,16 @@ } np->rx_dma[nr] = pci_map_single(np->pci_dev, skb->data, skb->len, PCI_DMA_FROMDEVICE); - np->rx_ring[nr].PacketBuffer = cpu_to_le32(np->rx_dma[nr]); - wmb(); - np->rx_ring[nr].FlagLen = cpu_to_le32(RX_NIC_BUFSIZE | NV_RX_AVAIL); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) { + np->rx_ring.orig[nr].PacketBuffer = cpu_to_le32(np->rx_dma[nr]); + wmb(); + np->rx_ring.orig[nr].FlagLen = cpu_to_le32(np->rx_buf_sz | NV_RX_AVAIL); + } else { + np->rx_ring.ex[nr].PacketBufferHigh = cpu_to_le64(np->rx_dma[nr]) >> 32; + np->rx_ring.ex[nr].PacketBufferLow = cpu_to_le64(np->rx_dma[nr]) & 0x0FFFFFFFF; + wmb(); + np->rx_ring.ex[nr].FlagLen = cpu_to_le32(np->rx_buf_sz | NV_RX2_AVAIL); + } dprintk(KERN_DEBUG "%s: nv_alloc_rx: Packet %d marked as Available\n", dev->name, refill_rx); refill_rx++; @@ -864,19 +874,37 @@ enable_irq(dev->irq); } -static int nv_init_ring(struct net_device *dev) +static void nv_init_rx(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); int i; - np->next_tx = np->nic_tx = 0; - for (i = 0; i < TX_RING; i++) - np->tx_ring[i].FlagLen = 0; - np->cur_rx = RX_RING; np->refill_rx = 0; for (i = 0; i < RX_RING; i++) - np->rx_ring[i].FlagLen = 0; + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + np->rx_ring.orig[i].FlagLen = 0; + else + np->rx_ring.ex[i].FlagLen = 0; +} + +static void nv_init_tx(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + int i; + + np->next_tx = np->nic_tx = 0; + for (i = 0; i < TX_RING; i++) + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + np->tx_ring.orig[i].FlagLen = 0; + else + np->tx_ring.ex[i].FlagLen = 0; +} + +static int nv_init_ring(struct net_device *dev) +{ + nv_init_tx(dev); + nv_init_rx(dev); return nv_alloc_rx(dev); } @@ -885,7 +913,10 @@ struct fe_priv *np = get_nvpriv(dev); int i; for (i = 0; i < TX_RING; i++) { - np->tx_ring[i].FlagLen = 0; + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + np->tx_ring.orig[i].FlagLen = 0; + else + np->tx_ring.ex[i].FlagLen = 0; if (np->tx_skbuff[i]) { pci_unmap_single(np->pci_dev, np->tx_dma[i], np->tx_skbuff[i]->len, @@ -902,7 +933,10 @@ struct fe_priv *np = get_nvpriv(dev); int i; for (i = 0; i < RX_RING; i++) { - np->rx_ring[i].FlagLen = 0; + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + np->rx_ring.orig[i].FlagLen = 0; + else + np->rx_ring.ex[i].FlagLen = 0; wmb(); if (np->rx_skbuff[i]) { pci_unmap_single(np->pci_dev, np->rx_dma[i], @@ -933,11 +967,19 @@ np->tx_dma[nr] = pci_map_single(np->pci_dev, skb->data,skb->len, PCI_DMA_TODEVICE); - np->tx_ring[nr].PacketBuffer = cpu_to_le32(np->tx_dma[nr]); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + np->tx_ring.orig[nr].PacketBuffer = cpu_to_le32(np->tx_dma[nr]); + else { + np->tx_ring.ex[nr].PacketBufferHigh = cpu_to_le64(np->tx_dma[nr]) >> 32; + np->tx_ring.ex[nr].PacketBufferLow = cpu_to_le64(np->tx_dma[nr]) & 0x0FFFFFFFF; + } spin_lock_irq(&np->lock); wmb(); - np->tx_ring[nr].FlagLen = cpu_to_le32( (skb->len-1) | np->tx_flags ); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + np->tx_ring.orig[nr].FlagLen = cpu_to_le32( (skb->len-1) | np->tx_flags ); + else + np->tx_ring.ex[nr].FlagLen = cpu_to_le32( (skb->len-1) | np->tx_flags ); dprintk(KERN_DEBUG "%s: nv_start_xmit: packet packet %d queued for transmission.\n", dev->name, np->next_tx); { @@ -975,7 +1017,10 @@ while (np->nic_tx != np->next_tx) { i = np->nic_tx % TX_RING; - Flags = le32_to_cpu(np->tx_ring[i].FlagLen); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + Flags = le32_to_cpu(np->tx_ring.orig[i].FlagLen); + else + Flags = le32_to_cpu(np->tx_ring.ex[i].FlagLen); dprintk(KERN_DEBUG "%s: nv_tx_done: looking at packet %d, Flags 0x%x.\n", dev->name, np->nic_tx, Flags); @@ -1024,11 +1069,58 @@ static void nv_tx_timeout(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); - dprintk(KERN_DEBUG "%s: Got tx_timeout. irq: %08x\n", dev->name, + printk(KERN_INFO "%s: Got tx_timeout. irq: %08x\n", dev->name, readl(base + NvRegIrqStatus) & NVREG_IRQSTAT_MASK); + { + int i; + + printk(KERN_INFO "%s: Ring at %lx: next %d nic %d\n", + dev->name, (unsigned long)np->ring_addr, + np->next_tx, np->nic_tx); + printk(KERN_INFO "%s: Dumping tx registers\n", dev->name); + for (i=0;i<0x400;i+= 32) { + printk(KERN_INFO "%3x: %08x %08x %08x %08x %08x %08x %08x %08x\n", + i, + readl(base + i + 0), readl(base + i + 4), + readl(base + i + 8), readl(base + i + 12), + readl(base + i + 16), readl(base + i + 20), + readl(base + i + 24), readl(base + i + 28)); + } + printk(KERN_INFO "%s: Dumping tx ring\n", dev->name); + for (i=0;idesc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) { + printk(KERN_INFO "%03x: %08x %08x // %08x %08x // %08x %08x // %08x %08x\n", + i, + le32_to_cpu(np->tx_ring.orig[i].PacketBuffer), + le32_to_cpu(np->tx_ring.orig[i].FlagLen), + le32_to_cpu(np->tx_ring.orig[i+1].PacketBuffer), + le32_to_cpu(np->tx_ring.orig[i+1].FlagLen), + le32_to_cpu(np->tx_ring.orig[i+2].PacketBuffer), + le32_to_cpu(np->tx_ring.orig[i+2].FlagLen), + le32_to_cpu(np->tx_ring.orig[i+3].PacketBuffer), + le32_to_cpu(np->tx_ring.orig[i+3].FlagLen)); + } else { + printk(KERN_INFO "%03x: %08x %08x %08x // %08x %08x %08x // %08x %08x %08x // %08x %08x %08x\n", + i, + le32_to_cpu(np->tx_ring.ex[i].PacketBufferHigh), + le32_to_cpu(np->tx_ring.ex[i].PacketBufferLow), + le32_to_cpu(np->tx_ring.ex[i].FlagLen), + le32_to_cpu(np->tx_ring.ex[i+1].PacketBufferHigh), + le32_to_cpu(np->tx_ring.ex[i+1].PacketBufferLow), + le32_to_cpu(np->tx_ring.ex[i+1].FlagLen), + le32_to_cpu(np->tx_ring.ex[i+2].PacketBufferHigh), + le32_to_cpu(np->tx_ring.ex[i+2].PacketBufferLow), + le32_to_cpu(np->tx_ring.ex[i+2].FlagLen), + le32_to_cpu(np->tx_ring.ex[i+3].PacketBufferHigh), + le32_to_cpu(np->tx_ring.ex[i+3].PacketBufferLow), + le32_to_cpu(np->tx_ring.ex[i+3].FlagLen)); + } + } + } + spin_lock_irq(&np->lock); /* 1) stop tx engine */ @@ -1042,7 +1134,10 @@ printk(KERN_DEBUG "%s: tx_timeout: dead entries!\n", dev->name); nv_drain_tx(dev); np->next_tx = np->nic_tx = 0; - writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc)), base + NvRegTxRingPhysAddr); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc)), base + NvRegTxRingPhysAddr); + else + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc_ex)), base + NvRegTxRingPhysAddr); netif_wake_queue(dev); } @@ -1051,6 +1146,59 @@ spin_unlock_irq(&np->lock); } +/* + * Called when the nic notices a mismatch between the actual data len on the + * wire and the len indicated in the 802 header + */ +static int nv_getlen(struct net_device *dev, void *packet, int datalen) +{ + int hdrlen; /* length of the 802 header */ + int protolen; /* length as stored in the proto field */ + + /* 1) calculate len according to header */ + if ( ((struct vlan_ethhdr *)packet)->h_vlan_proto == __constant_htons(ETH_P_8021Q)) { + protolen = ntohs( ((struct vlan_ethhdr *)packet)->h_vlan_encapsulated_proto ); + hdrlen = VLAN_HLEN; + } else { + protolen = ntohs( ((struct ethhdr *)packet)->h_proto); + hdrlen = ETH_HLEN; + } + dprintk(KERN_DEBUG "%s: nv_getlen: datalen %d, protolen %d, hdrlen %d\n", + dev->name, datalen, protolen, hdrlen); + if (protolen > ETH_DATA_LEN) + return datalen; /* Value in proto field not a len, no checks possible */ + + protolen += hdrlen; + /* consistency checks: */ + if (datalen > ETH_ZLEN) { + if (datalen >= protolen) { + /* more data on wire than in 802 header, trim of + * additional data. + */ + dprintk(KERN_DEBUG "%s: nv_getlen: accepting %d bytes.\n", + dev->name, protolen); + return protolen; + } else { + /* less data on wire than mentioned in header. + * Discard the packet. + */ + dprintk(KERN_DEBUG "%s: nv_getlen: discarding long packet.\n", + dev->name); + return -1; + } + } else { + /* short packet. Accept only if 802 values are also short */ + if (protolen > ETH_ZLEN) { + dprintk(KERN_DEBUG "%s: nv_getlen: discarding short packet.\n", + dev->name); + return -1; + } + dprintk(KERN_DEBUG "%s: nv_getlen: accepting %d bytes.\n", + dev->name, datalen); + return datalen; + } +} + static void nv_rx_process(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); @@ -1064,8 +1212,13 @@ break; /* we scanned the whole ring - do not continue */ i = np->cur_rx % RX_RING; - Flags = le32_to_cpu(np->rx_ring[i].FlagLen); - len = nv_descr_getlength(&np->rx_ring[i], np->desc_ver); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) { + Flags = le32_to_cpu(np->rx_ring.orig[i].FlagLen); + len = nv_descr_getlength(&np->rx_ring.orig[i], np->desc_ver); + } else { + Flags = le32_to_cpu(np->rx_ring.ex[i].FlagLen); + len = nv_descr_getlength_ex(&np->rx_ring.ex[i], np->desc_ver); + } dprintk(KERN_DEBUG "%s: nv_rx_process: looking at packet %d, Flags 0x%x.\n", dev->name, np->cur_rx, Flags); @@ -1102,7 +1255,7 @@ np->stats.rx_errors++; goto next_pkt; } - if (Flags & (NV_RX_ERROR1|NV_RX_ERROR2|NV_RX_ERROR3|NV_RX_ERROR4)) { + if (Flags & (NV_RX_ERROR1|NV_RX_ERROR2|NV_RX_ERROR3)) { np->stats.rx_errors++; goto next_pkt; } @@ -1116,22 +1269,24 @@ np->stats.rx_errors++; goto next_pkt; } - if (Flags & NV_RX_ERROR) { - /* framing errors are soft errors, the rest is fatal. */ - if (Flags & NV_RX_FRAMINGERR) { - if (Flags & NV_RX_SUBSTRACT1) { - len--; - } - } else { + if (Flags & NV_RX_ERROR4) { + len = nv_getlen(dev, np->rx_skbuff[i]->data, len); + if (len < 0) { np->stats.rx_errors++; goto next_pkt; } } + /* framing errors are soft errors. */ + if (Flags & NV_RX_FRAMINGERR) { + if (Flags & NV_RX_SUBSTRACT1) { + len--; + } + } } else { if (!(Flags & NV_RX2_DESCRIPTORVALID)) goto next_pkt; - if (Flags & (NV_RX2_ERROR1|NV_RX2_ERROR2|NV_RX2_ERROR3|NV_RX2_ERROR4)) { + if (Flags & (NV_RX2_ERROR1|NV_RX2_ERROR2|NV_RX2_ERROR3)) { np->stats.rx_errors++; goto next_pkt; } @@ -1145,17 +1300,19 @@ np->stats.rx_errors++; goto next_pkt; } - if (Flags & NV_RX2_ERROR) { - /* framing errors are soft errors, the rest is fatal. */ - if (Flags & NV_RX2_FRAMINGERR) { - if (Flags & NV_RX2_SUBSTRACT1) { - len--; - } - } else { + if (Flags & NV_RX2_ERROR4) { + len = nv_getlen(dev, np->rx_skbuff[i]->data, len); + if (len < 0) { np->stats.rx_errors++; goto next_pkt; } } + /* framing errors are soft errors */ + if (Flags & NV_RX2_FRAMINGERR) { + if (Flags & NV_RX2_SUBSTRACT1) { + len--; + } + } Flags &= NV_RX2_CHECKSUMMASK; if (Flags == NV_RX2_CHECKSUMOK1 || Flags == NV_RX2_CHECKSUMOK2 || @@ -1183,15 +1340,133 @@ } } +static void set_bufsize(struct net_device *dev) +{ + struct fe_priv *np = netdev_priv(dev); + + if (dev->mtu <= ETH_DATA_LEN) + np->rx_buf_sz = ETH_DATA_LEN + NV_RX_HEADERS; + else + np->rx_buf_sz = dev->mtu + NV_RX_HEADERS; +} + /* * nv_change_mtu: dev->change_mtu function * Called with dev_base_lock held for read. */ static int nv_change_mtu(struct net_device *dev, int new_mtu) { - if (new_mtu > ETH_DATA_LEN) + struct fe_priv *np = get_nvpriv(dev); + int old_mtu; + + if (new_mtu < 64 || new_mtu > np->pkt_limit) return -EINVAL; + + old_mtu = dev->mtu; dev->mtu = new_mtu; + + /* return early if the buffer sizes will not change */ + if (old_mtu <= ETH_DATA_LEN && new_mtu <= ETH_DATA_LEN) + return 0; + if (old_mtu == new_mtu) + return 0; + + /* synchronized against open : rtnl_lock() held by caller */ + if (netif_running(dev)) { + u8 *base = get_hwbase(dev); + /* + * It seems that the nic preloads valid ring entries into an + * internal buffer. The procedure for flushing everything is + * guessed, there is probably a simpler approach. + * Changing the MTU is a rare event, it shouldn't matter. + */ + disable_irq(dev->irq); + spin_lock_bh(&dev->xmit_lock); + spin_lock(&np->lock); + /* stop engines */ + nv_stop_rx(dev); + nv_stop_tx(dev); + nv_txrx_reset(dev); + /* drain rx queue */ + nv_drain_rx(dev); + nv_drain_tx(dev); + /* reinit driver view of the rx queue */ + nv_init_rx(dev); + nv_init_tx(dev); + /* alloc new rx buffers */ + set_bufsize(dev); + if (nv_alloc_rx(dev)) { + if (!np->in_shutdown) + mod_timer(&np->oom_kick, jiffies + OOM_REFILL); + } + /* reinit nic view of the rx queue */ + writel(np->rx_buf_sz, base + NvRegOffloadConfig); + writel((u32) np->ring_addr, base + NvRegRxRingPhysAddr); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc)), base + NvRegTxRingPhysAddr); + else + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc_ex)), base + NvRegTxRingPhysAddr); + writel( ((RX_RING-1) << NVREG_RINGSZ_RXSHIFT) + ((TX_RING-1) << NVREG_RINGSZ_TXSHIFT), + base + NvRegRingSizes); + pci_push(base); + writel(NVREG_TXRXCTL_KICK|np->desc_ver, get_hwbase(dev) + NvRegTxRxControl); + pci_push(base); + + /* restart rx engine */ + nv_start_rx(dev); + nv_start_tx(dev); + spin_unlock(&np->lock); + spin_unlock_bh(&dev->xmit_lock); + enable_irq(dev->irq); + } + return 0; +} + +static void nv_copy_mac_to_hw(struct net_device *dev) +{ + u8 *base = get_hwbase(dev); + u32 mac[2]; + + mac[0] = (dev->dev_addr[0] << 0) + (dev->dev_addr[1] << 8) + + (dev->dev_addr[2] << 16) + (dev->dev_addr[3] << 24); + mac[1] = (dev->dev_addr[4] << 0) + (dev->dev_addr[5] << 8); + + writel(mac[0], base + NvRegMacAddrA); + writel(mac[1], base + NvRegMacAddrB); +} + +/* + * nv_set_mac_address: dev->set_mac_address function + * Called with rtnl_lock() held. + */ +static int nv_set_mac_address(struct net_device *dev, void *addr) +{ + struct fe_priv *np = get_nvpriv(dev); + struct sockaddr *macaddr = (struct sockaddr*)addr; + + if(!is_valid_ether_addr(macaddr->sa_data)) + return -EADDRNOTAVAIL; + + /* synchronized against open : rtnl_lock() held by caller */ + memcpy(dev->dev_addr, macaddr->sa_data, ETH_ALEN); + + if (netif_running(dev)) { + spin_lock_bh(&dev->xmit_lock); + spin_lock_irq(&np->lock); + + /* stop rx engine */ + nv_stop_rx(dev); + + /* set mac address */ + nv_copy_mac_to_hw(dev); + + /* restart rx engine */ + nv_start_rx(dev); + spin_unlock_irq(&np->lock); + spin_unlock_bh(&dev->xmit_lock); + } else { + nv_copy_mac_to_hw(dev); + } return 0; } @@ -1202,7 +1477,7 @@ static void nv_set_multicast(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); u32 addr[2]; u32 mask[2]; u32 pff; @@ -1262,7 +1537,7 @@ static int nv_update_linkspeed(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); int adv, lpa; int newls = np->linkspeed; int newdup = np->duplex; @@ -1285,6 +1560,25 @@ goto set_speed; } + if (np->autoneg == 0) { + dprintk(KERN_DEBUG "%s: nv_update_linkspeed: autoneg off, PHY set to 0x%04x.\n", + dev->name, np->fixed_mode); + if (np->fixed_mode & LPA_100FULL) { + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_100; + newdup = 1; + } else if (np->fixed_mode & LPA_100HALF) { + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_100; + newdup = 0; + } else if (np->fixed_mode & LPA_10FULL) { + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_10; + newdup = 1; + } else { + newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_10; + newdup = 0; + } + retval = 1; + goto set_speed; + } /* check auto negotiation is complete */ if (!(mii_status & BMSR_ANEGCOMPLETE)) { /* still in autonegotiation - configure nic for 10 MBit HD and wait. */ @@ -1302,7 +1596,7 @@ if ((control_1000 & ADVERTISE_1000FULL) && (status_1000 & LPA_1000FULL)) { - dprintk(KERN_DEBUG "%s: nv_update_linkspeed: GBit ethernet detected.\n", + dprintk(KERN_DEBUG "%s: nv_update_linkspeed: GBit ethernet detected.\n", dev->name); newls = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_1000; newdup = 1; @@ -1361,9 +1655,9 @@ phyreg &= ~(PHY_HALF|PHY_100|PHY_1000); if (np->duplex == 0) phyreg |= PHY_HALF; - if ((np->linkspeed & 0xFFF) == NVREG_LINKSPEED_100) + if ((np->linkspeed & NVREG_LINKSPEED_MASK) == NVREG_LINKSPEED_100) phyreg |= PHY_100; - else if ((np->linkspeed & 0xFFF) == NVREG_LINKSPEED_1000) + else if ((np->linkspeed & NVREG_LINKSPEED_MASK) == NVREG_LINKSPEED_1000) phyreg |= PHY_1000; writel(phyreg, base + NvRegPhyInterface); @@ -1397,7 +1691,7 @@ static void nv_link_irq(struct net_device *dev) { - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); u32 miistat; miistat = readl(base + NvRegMIIStatus); @@ -1413,7 +1707,7 @@ { struct net_device *dev = (struct net_device *) data; struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); u32 events; int i; @@ -1427,7 +1721,7 @@ if (!(events & np->irqmask)) break; - if (events & (NVREG_IRQ_TX1|NVREG_IRQ_TX2|NVREG_IRQ_TX_ERR)) { + if (events & (NVREG_IRQ_TX1|NVREG_IRQ_TX_OK|NVREG_IRQ_TX_ERROR|NVREG_IRQ_TX_ERR)) { spin_lock(&np->lock); nv_tx_done(dev); spin_unlock(&np->lock); @@ -1485,7 +1779,7 @@ { struct net_device *dev = (struct net_device *) data; struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); disable_irq(dev->irq); /* FIXME: Do we need synchronize_irq(dev->irq) here? */ @@ -1499,10 +1793,285 @@ enable_irq(dev->irq); } +#ifdef CONFIG_NET_POLL_CONTROLLER +static void nv_poll_controller(struct net_device *dev) +{ + nv_do_nic_poll((unsigned long) dev); +} +#endif + +static void nv_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info) +{ + struct fe_priv *np = get_nvpriv(dev); + strcpy(info->driver, "forcedeth"); + strcpy(info->version, FORCEDETH_VERSION); + strcpy(info->bus_info, pci_name(np->pci_dev)); +} + +static void nv_get_wol(struct net_device *dev, struct ethtool_wolinfo *wolinfo) +{ + struct fe_priv *np = get_nvpriv(dev); + wolinfo->supported = WAKE_MAGIC; + + spin_lock_irq(&np->lock); + if (np->wolenabled) + wolinfo->wolopts = WAKE_MAGIC; + spin_unlock_irq(&np->lock); +} + +static int nv_set_wol(struct net_device *dev, struct ethtool_wolinfo *wolinfo) +{ + struct fe_priv *np = get_nvpriv(dev); + u8 __iomem *base = get_hwbase(dev); + + spin_lock_irq(&np->lock); + if (wolinfo->wolopts == 0) { + writel(0, base + NvRegWakeUpFlags); + np->wolenabled = 0; + } + if (wolinfo->wolopts & WAKE_MAGIC) { + writel(NVREG_WAKEUPFLAGS_ENABLE, base + NvRegWakeUpFlags); + np->wolenabled = 1; + } + spin_unlock_irq(&np->lock); + return 0; +} + +static int nv_get_settings(struct net_device *dev, struct ethtool_cmd *ecmd) +{ + struct fe_priv *np = netdev_priv(dev); + int adv; + + spin_lock_irq(&np->lock); + ecmd->port = PORT_MII; + if (!netif_running(dev)) { + /* We do not track link speed / duplex setting if the + * interface is disabled. Force a link check */ + nv_update_linkspeed(dev); + } + switch(np->linkspeed & (NVREG_LINKSPEED_MASK)) { + case NVREG_LINKSPEED_10: + ecmd->speed = SPEED_10; + break; + case NVREG_LINKSPEED_100: + ecmd->speed = SPEED_100; + break; + case NVREG_LINKSPEED_1000: + ecmd->speed = SPEED_1000; + break; + } + ecmd->duplex = DUPLEX_HALF; + if (np->duplex) + ecmd->duplex = DUPLEX_FULL; + + ecmd->autoneg = np->autoneg; + + ecmd->advertising = ADVERTISED_MII; + if (np->autoneg) { + ecmd->advertising |= ADVERTISED_Autoneg; + adv = mii_rw(dev, np->phyaddr, MII_ADVERTISE, MII_READ); + } else { + adv = np->fixed_mode; + } + if (adv & ADVERTISE_10HALF) + ecmd->advertising |= ADVERTISED_10baseT_Half; + if (adv & ADVERTISE_10FULL) + ecmd->advertising |= ADVERTISED_10baseT_Full; + if (adv & ADVERTISE_100HALF) + ecmd->advertising |= ADVERTISED_100baseT_Half; + if (adv & ADVERTISE_100FULL) + ecmd->advertising |= ADVERTISED_100baseT_Full; + if (np->autoneg && np->gigabit == PHY_GIGABIT) { + adv = mii_rw(dev, np->phyaddr, MII_1000BT_CR, MII_READ); + if (adv & ADVERTISE_1000FULL) + ecmd->advertising |= ADVERTISED_1000baseT_Full; + } + + ecmd->supported = (SUPPORTED_Autoneg | + SUPPORTED_10baseT_Half | SUPPORTED_10baseT_Full | + SUPPORTED_100baseT_Half | SUPPORTED_100baseT_Full | + SUPPORTED_MII); + if (np->gigabit == PHY_GIGABIT) + ecmd->supported |= SUPPORTED_1000baseT_Full; + + ecmd->phy_address = np->phyaddr; + ecmd->transceiver = XCVR_EXTERNAL; + + /* ignore maxtxpkt, maxrxpkt for now */ + spin_unlock_irq(&np->lock); + return 0; +} + +static int nv_set_settings(struct net_device *dev, struct ethtool_cmd *ecmd) +{ + struct fe_priv *np = netdev_priv(dev); + + if (ecmd->port != PORT_MII) + return -EINVAL; + if (ecmd->transceiver != XCVR_EXTERNAL) + return -EINVAL; + if (ecmd->phy_address != np->phyaddr) { + /* TODO: support switching between multiple phys. Should be + * trivial, but not enabled due to lack of test hardware. */ + return -EINVAL; + } + if (ecmd->autoneg == AUTONEG_ENABLE) { + u32 mask; + + mask = ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full | + ADVERTISED_100baseT_Half | ADVERTISED_100baseT_Full; + if (np->gigabit == PHY_GIGABIT) + mask |= ADVERTISED_1000baseT_Full; + + if ((ecmd->advertising & mask) == 0) + return -EINVAL; + + } else if (ecmd->autoneg == AUTONEG_DISABLE) { + /* Note: autonegotiation disable, speed 1000 intentionally + * forbidden - noone should need that. */ + + if (ecmd->speed != SPEED_10 && ecmd->speed != SPEED_100) + return -EINVAL; + if (ecmd->duplex != DUPLEX_HALF && ecmd->duplex != DUPLEX_FULL) + return -EINVAL; + } else { + return -EINVAL; + } + + spin_lock_irq(&np->lock); + if (ecmd->autoneg == AUTONEG_ENABLE) { + int adv, bmcr; + + np->autoneg = 1; + + /* advertise only what has been requested */ + adv = mii_rw(dev, np->phyaddr, MII_ADVERTISE, MII_READ); + adv &= ~(ADVERTISE_ALL | ADVERTISE_100BASE4); + if (ecmd->advertising & ADVERTISED_10baseT_Half) + adv |= ADVERTISE_10HALF; + if (ecmd->advertising & ADVERTISED_10baseT_Full) + adv |= ADVERTISE_10FULL; + if (ecmd->advertising & ADVERTISED_100baseT_Half) + adv |= ADVERTISE_100HALF; + if (ecmd->advertising & ADVERTISED_100baseT_Full) + adv |= ADVERTISE_100FULL; + mii_rw(dev, np->phyaddr, MII_ADVERTISE, adv); + + if (np->gigabit == PHY_GIGABIT) { + adv = mii_rw(dev, np->phyaddr, MII_1000BT_CR, MII_READ); + adv &= ~ADVERTISE_1000FULL; + if (ecmd->advertising & ADVERTISED_1000baseT_Full) + adv |= ADVERTISE_1000FULL; + mii_rw(dev, np->phyaddr, MII_1000BT_CR, adv); + } + + bmcr = mii_rw(dev, np->phyaddr, MII_BMCR, MII_READ); + bmcr |= (BMCR_ANENABLE | BMCR_ANRESTART); + mii_rw(dev, np->phyaddr, MII_BMCR, bmcr); + + } else { + int adv, bmcr; + + np->autoneg = 0; + + adv = mii_rw(dev, np->phyaddr, MII_ADVERTISE, MII_READ); + adv &= ~(ADVERTISE_ALL | ADVERTISE_100BASE4); + if (ecmd->speed == SPEED_10 && ecmd->duplex == DUPLEX_HALF) + adv |= ADVERTISE_10HALF; + if (ecmd->speed == SPEED_10 && ecmd->duplex == DUPLEX_FULL) + adv |= ADVERTISE_10FULL; + if (ecmd->speed == SPEED_100 && ecmd->duplex == DUPLEX_HALF) + adv |= ADVERTISE_100HALF; + if (ecmd->speed == SPEED_100 && ecmd->duplex == DUPLEX_FULL) + adv |= ADVERTISE_100FULL; + mii_rw(dev, np->phyaddr, MII_ADVERTISE, adv); + np->fixed_mode = adv; + + if (np->gigabit == PHY_GIGABIT) { + adv = mii_rw(dev, np->phyaddr, MII_1000BT_CR, MII_READ); + adv &= ~ADVERTISE_1000FULL; + mii_rw(dev, np->phyaddr, MII_1000BT_CR, adv); + } + + bmcr = mii_rw(dev, np->phyaddr, MII_BMCR, MII_READ); + bmcr |= ~(BMCR_ANENABLE|BMCR_SPEED100|BMCR_FULLDPLX); + if (adv & (ADVERTISE_10FULL|ADVERTISE_100FULL)) + bmcr |= BMCR_FULLDPLX; + if (adv & (ADVERTISE_100HALF|ADVERTISE_100FULL)) + bmcr |= BMCR_SPEED100; + mii_rw(dev, np->phyaddr, MII_BMCR, bmcr); + + if (netif_running(dev)) { + /* Wait a bit and then reconfigure the nic. */ + udelay(10); + nv_linkchange(dev); + } + } + spin_unlock_irq(&np->lock); + + return 0; +} + +#define FORCEDETH_REGS_VER 1 +#define FORCEDETH_REGS_SIZE 0x400 /* 256 32-bit registers */ + +static int nv_get_regs_len(struct net_device *dev) +{ + return FORCEDETH_REGS_SIZE; +} + +static void nv_get_regs(struct net_device *dev, struct ethtool_regs *regs, void *buf) +{ + struct fe_priv *np = get_nvpriv(dev); + u8 __iomem *base = get_hwbase(dev); + u32 *rbuf = buf; + int i; + + regs->version = FORCEDETH_REGS_VER; + spin_lock_irq(&np->lock); + for (i=0;ilock); +} + +static int nv_nway_reset(struct net_device *dev) +{ + struct fe_priv *np = get_nvpriv(dev); + int ret; + + spin_lock_irq(&np->lock); + if (np->autoneg) { + int bmcr; + + bmcr = mii_rw(dev, np->phyaddr, MII_BMCR, MII_READ); + bmcr |= (BMCR_ANENABLE | BMCR_ANRESTART); + mii_rw(dev, np->phyaddr, MII_BMCR, bmcr); + + ret = 0; + } else { + ret = -EINVAL; + } + spin_unlock_irq(&np->lock); + + return ret; +} + +static struct ethtool_ops ops = { + .get_drvinfo = nv_get_drvinfo, + .get_link = ethtool_op_get_link, + .get_wol = nv_get_wol, + .set_wol = nv_set_wol, + .get_settings = nv_get_settings, + .set_settings = nv_set_settings, + .get_regs_len = nv_get_regs_len, + .get_regs = nv_get_regs, + .nway_reset = nv_nway_reset, +}; + static int nv_open(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); + u8 __iomem *base = get_hwbase(dev); int ret, oom, i; dprintk(KERN_DEBUG "nv_open: begin\n"); @@ -1521,6 +2090,7 @@ writel(0, base + NvRegAdapterControl); /* 2) initialize descriptor rings */ + set_bufsize(dev); oom = nv_init_ring(dev); writel(0, base + NvRegLinkSpeed); @@ -1531,27 +2101,18 @@ np->in_shutdown = 0; /* 3) set mac address */ - { - u32 mac[2]; - - mac[0] = (dev->dev_addr[0] << 0) + (dev->dev_addr[1] << 8) + - (dev->dev_addr[2] << 16) + (dev->dev_addr[3] << 24); - mac[1] = (dev->dev_addr[4] << 0) + (dev->dev_addr[5] << 8); - - writel(mac[0], base + NvRegMacAddrA); - writel(mac[1], base + NvRegMacAddrB); - } + nv_copy_mac_to_hw(dev); /* 4) give hw rings */ writel((u32) np->ring_addr, base + NvRegRxRingPhysAddr); - writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc)), base + NvRegTxRingPhysAddr); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc)), base + NvRegTxRingPhysAddr); + else + writel((u32) (np->ring_addr + RX_RING*sizeof(struct ring_desc_ex)), base + NvRegTxRingPhysAddr); writel( ((RX_RING-1) << NVREG_RINGSZ_RXSHIFT) + ((TX_RING-1) << NVREG_RINGSZ_TXSHIFT), base + NvRegRingSizes); /* 5) continue setup */ - np->linkspeed = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_10; - np->duplex = 0; - writel(np->linkspeed, base + NvRegLinkSpeed); writel(NVREG_UNKSETUP3_VAL1, base + NvRegUnknownSetupReg3); writel(np->desc_ver, base + NvRegTxRxControl); @@ -1569,7 +2130,7 @@ writel(NVREG_MISC1_FORCE | NVREG_MISC1_HD, base + NvRegMisc1); writel(readl(base + NvRegTransmitterStatus), base + NvRegTransmitterStatus); writel(NVREG_PFF_ALWAYS, base + NvRegPacketFilterFlags); - writel(NVREG_OFFLOAD_NORMAL, base + NvRegOffloadConfig); + writel(np->rx_buf_sz, base + NvRegOffloadConfig); writel(readl(base + NvRegReceiverStatus), base + NvRegReceiverStatus); get_random_bytes(&i, sizeof(i)); @@ -1620,6 +2181,9 @@ writel(NVREG_MIISTAT_MASK, base + NvRegMIIStatus); dprintk(KERN_INFO "startup: got 0x%08x.\n", miistat); } + /* set linkspeed to invalid value, thus force nv_update_linkspeed + * to init hw */ + np->linkspeed = 0; ret = nv_update_linkspeed(dev); nv_start_rx(dev); nv_start_tx(dev); @@ -1643,7 +2207,7 @@ static int nv_close(struct net_device *dev) { struct fe_priv *np = get_nvpriv(dev); - u8 *base; + u8 __iomem *base; spin_lock_irq(&np->lock); np->in_shutdown = 1; @@ -1674,6 +2238,12 @@ if (np->wolenabled) nv_start_rx(dev); + /* special op: write back the misordered MAC address - otherwise + * the next nv_probe would see a wrong address. + */ + writel(np->orig_mac[0], base + NvRegMacAddrA); + writel(np->orig_mac[1], base + NvRegMacAddrB); + /* FIXME: power down nic */ return 0; @@ -1684,7 +2254,7 @@ struct net_device *dev; struct fe_priv *np; unsigned long addr; - u8 *base; + u8 __iomem *base; int err, i; dev = alloc_etherdev(sizeof(struct fe_priv)); @@ -1738,30 +2308,59 @@ } /* handle different descriptor versions */ - if (pci_dev->device == PCI_DEVICE_ID_NVIDIA_NVENET_1 || - pci_dev->device == PCI_DEVICE_ID_NVIDIA_NVENET_2 || - pci_dev->device == PCI_DEVICE_ID_NVIDIA_NVENET_3) - np->desc_ver = DESC_VER_1; - else + if (id->driver_data & DEV_HAS_HIGH_DMA) { + /* packet format 3: supports 40-bit addressing */ + np->desc_ver = DESC_VER_3; + if (pci_set_dma_mask(pci_dev, 0x0000007fffffffffULL)) { + printk(KERN_INFO "forcedeth: 64-bit DMA failed, using 32-bit addressing for device %s.\n", + pci_name(pci_dev)); + } + } else if (id->driver_data & DEV_HAS_LARGEDESC) { + /* packet format 2: supports jumbo frames */ np->desc_ver = DESC_VER_2; + } else { + /* original packet format */ + np->desc_ver = DESC_VER_1; + } + + np->pkt_limit = NV_PKTLIMIT_1; + if (id->driver_data & DEV_HAS_LARGEDESC) + np->pkt_limit = NV_PKTLIMIT_2; err = -ENOMEM; - dev->base_addr = (unsigned long) ioremap(addr, NV_PCI_REGSZ); - if (!dev->base_addr) + np->base = ioremap(addr, NV_PCI_REGSZ); + if (!np->base) goto out_relreg; + dev->base_addr = (unsigned long)np->base; + dev->irq = pci_dev->irq; - np->rx_ring = pci_alloc_consistent(pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING), - &np->ring_addr); - if (!np->rx_ring) - goto out_unmap; - np->tx_ring = &np->rx_ring[RX_RING]; + + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) { + np->rx_ring.orig = pci_alloc_consistent(pci_dev, + sizeof(struct ring_desc) * (RX_RING + TX_RING), + &np->ring_addr); + if (!np->rx_ring.orig) + goto out_unmap; + np->tx_ring.orig = &np->rx_ring.orig[RX_RING]; + } else { + np->rx_ring.ex = pci_alloc_consistent(pci_dev, + sizeof(struct ring_desc_ex) * (RX_RING + TX_RING), + &np->ring_addr); + if (!np->rx_ring.ex) + goto out_unmap; + np->tx_ring.ex = &np->rx_ring.ex[RX_RING]; + } dev->open = nv_open; dev->stop = nv_close; dev->hard_start_xmit = nv_start_xmit; dev->get_stats = nv_get_stats; dev->change_mtu = nv_change_mtu; + dev->set_mac_address = nv_set_mac_address; dev->set_multicast_list = nv_set_multicast; +#ifdef CONFIG_NET_POLL_CONTROLLER + dev->poll_controller = nv_poll_controller; +#endif SET_ETHTOOL_OPS(dev, &ops); dev->tx_timeout = nv_tx_timeout; dev->watchdog_timeo = NV_WATCHDOG_TIMEO; @@ -1806,17 +2405,10 @@ if (np->desc_ver == DESC_VER_1) { np->tx_flags = NV_TX_LASTPACKET|NV_TX_VALID; - if (id->driver_data & DEV_NEED_LASTPACKET1) - np->tx_flags |= NV_TX_LASTPACKET1; } else { np->tx_flags = NV_TX2_LASTPACKET|NV_TX2_VALID; - if (id->driver_data & DEV_NEED_LASTPACKET1) - np->tx_flags |= NV_TX2_LASTPACKET1; } - if (id->driver_data & DEV_IRQMASK_1) - np->irqmask = NVREG_IRQMASK_WANTED_1; - if (id->driver_data & DEV_IRQMASK_2) - np->irqmask = NVREG_IRQMASK_WANTED_2; + np->irqmask = NVREG_IRQMASK_WANTED; if (id->driver_data & DEV_NEED_TIMERIRQ) np->irqmask |= NVREG_IRQ_TIMER; if (id->driver_data & DEV_NEED_LINKTIMER) { @@ -1864,6 +2456,11 @@ phy_init(dev); } + /* set default link speed settings */ + np->linkspeed = NVREG_LINKSPEED_FORCE|NVREG_LINKSPEED_10; + np->duplex = 0; + np->autoneg = 1; + err = register_netdev(dev); if (err) { printk(KERN_INFO "forcedeth: unable to register netdev: %d\n", err); @@ -1876,8 +2473,12 @@ return 0; out_freering: - pci_free_consistent(np->pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING), - np->rx_ring, np->ring_addr); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + pci_free_consistent(np->pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING), + np->rx_ring.orig, np->ring_addr); + else + pci_free_consistent(np->pci_dev, sizeof(struct ring_desc_ex) * (RX_RING + TX_RING), + np->rx_ring.ex, np->ring_addr); pci_set_drvdata(pci_dev, NULL); out_unmap: iounmap(get_hwbase(dev)); @@ -1895,18 +2496,14 @@ { struct net_device *dev = pci_get_drvdata(pci_dev); struct fe_priv *np = get_nvpriv(dev); - u8 *base = get_hwbase(dev); unregister_netdev(dev); - /* special op: write back the misordered MAC address - otherwise - * the next nv_probe would see a wrong address. - */ - writel(np->orig_mac[0], base + NvRegMacAddrA); - writel(np->orig_mac[1], base + NvRegMacAddrB); - /* free all structures */ - pci_free_consistent(np->pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING), np->rx_ring, np->ring_addr); + if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) + pci_free_consistent(np->pci_dev, sizeof(struct ring_desc) * (RX_RING + TX_RING), np->rx_ring.orig, np->ring_addr); + else + pci_free_consistent(np->pci_dev, sizeof(struct ring_desc_ex) * (RX_RING + TX_RING), np->rx_ring.ex, np->ring_addr); iounmap(get_hwbase(dev)); pci_release_regions(pci_dev); pci_disable_device(pci_dev); @@ -1916,81 +2513,64 @@ static struct pci_device_id pci_tbl[] = { { /* nForce Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_1, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_IRQMASK_1|DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_1), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER, }, { /* nForce2 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_2, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_2), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER, }, { /* nForce3 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_3, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_3), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER, }, { /* nForce3 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_4, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_4), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC, }, { /* nForce3 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_5, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_5), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC, }, { /* nForce3 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_6, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_6), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC, }, { /* nForce3 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_7, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_7), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC, }, { /* CK804 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_8, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_8), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, }, { /* CK804 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_9, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_9), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, }, { /* MCP04 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_10, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_10), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, }, { /* MCP04 Ethernet Controller */ - .vendor = PCI_VENDOR_ID_NVIDIA, - .device = PCI_DEVICE_ID_NVIDIA_NVENET_11, - .subvendor = PCI_ANY_ID, - .subdevice = PCI_ANY_ID, - .driver_data = DEV_NEED_LASTPACKET1|DEV_IRQMASK_2|DEV_NEED_TIMERIRQ, + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_11), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, + }, + { /* MCP51 Ethernet Controller */ + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_12), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_HIGH_DMA, + }, + { /* MCP51 Ethernet Controller */ + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_13), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_HIGH_DMA, + }, + { /* MCP55 Ethernet Controller */ + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_14), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, + }, + { /* MCP55 Ethernet Controller */ + PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_NVENET_15), + .driver_data = DEV_NEED_TIMERIRQ|DEV_NEED_LINKTIMER|DEV_HAS_LARGEDESC|DEV_HAS_HIGH_DMA, }, {0,}, }; @@ -2016,7 +2596,7 @@ module_param(max_interrupt_work, int, 0); MODULE_PARM_DESC(max_interrupt_work, "forcedeth maximum events handled per interrupt"); - + MODULE_AUTHOR("Manfred Spraul "); MODULE_DESCRIPTION("Reverse Engineered nForce ethernet driver"); MODULE_LICENSE("GPL"); --- 2.4/include/linux/pci_ids.h 2005-06-01 02:56:56.000000000 +0200 +++ build-2.4/include/linux/pci_ids.h 2005-09-04 13:55:37.000000000 +0200 @@ -1034,6 +1034,10 @@ #define PCI_DEVICE_ID_NVIDIA_GEFORCE3_1 0x0201 #define PCI_DEVICE_ID_NVIDIA_GEFORCE3_2 0x0202 #define PCI_DEVICE_ID_NVIDIA_QUADRO_DDC 0x0203 +#define PCI_DEVICE_ID_NVIDIA_NVENET_12 0x0268 +#define PCI_DEVICE_ID_NVIDIA_NVENET_13 0x0269 +#define PCI_DEVICE_ID_NVIDIA_NVENET_14 0x0372 +#define PCI_DEVICE_ID_NVIDIA_NVENET_15 0x0373 #define PCI_VENDOR_ID_IMS 0x10e0 #define PCI_DEVICE_ID_IMS_8849 0x8849 --------------040003030102010107040708-- From pravin.shelar@gmail.com Sun Sep 4 13:13:37 2005 Received: with ECARTIS (v1.0.0; list netdev); Sun, 04 Sep 2005 13:13:39 -0700 (PDT) Received: from rproxy.gmail.com (rproxy.gmail.com [64.233.170.192]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j84KDaiL023807 for ; Sun, 4 Sep 2005 13:13:36 -0700 Received: by rproxy.gmail.com with SMTP id 34so188837rns for ; Sun, 04 Sep 2005 13:11:02 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:reply-to:to:subject:mime-version:content-type:content-transfer-encoding:content-disposition; b=aQmsjwuk1upKhLuHC+eyBNwT6pIYz64Q4hMgITJ7Uc8uDg9ovFQcgCTo83Y1KgzmziwAynfmRFf8dJVkaQ1yPawqGcLx2Fs0A1JJoZXelW6ccyKt0bFodzQegW7cflR/hgVG16TWHQxy5IrnmOsNDfnuQ3qUxSvfqG+z5X4VBac= Received: by 10.11.119.4 with SMTP id r4mr72988cwc; Sun, 04 Sep 2005 13:11:02 -0700 (PDT) Received: by 10.11.117.12 with HTTP; Sun, 4 Sep 2005 13:11:02 -0700 (PDT) Message-ID: Date: Mon, 5 Sep 2005 01:41:02 +0530 From: pravin Reply-To: pravin.shelar@gmail.com To: netdev@oss.sgi.com Subject: question abt equal cost multipath networking Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j84KDaiL023807 X-archive-position: 3592 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: pravin.shelar@gmail.com Precedence: bulk X-list: netdev Content-Length: 754 Lines: 18 Hello everyone, I am working on equal cost multipath networking code in Linux kernel. I studied the device round robin algorithm for the same. The drr algorithm examines use count of devices to select outgoing device. The use count is defined as number of sessions opened on that device up till now. But this does not necessarily give us current load on a device. We can use some other metric to select the outgoing device e.g. current device packet-queue length. So is there any specific reason for choosing use count as a metric for this algorithm. Can I change this metric to some different parameter e.g. device queue length or number of open sessions on a device at present? Thanks, Pravin. PS. I'm not on the list, so please CC me. From ravinandan.arakali@neterion.com Tue Sep 6 14:56:12 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 06 Sep 2005 14:56:15 -0700 (PDT) Received: from ns1.s2io.com (ns1.s2io.com [142.46.200.198]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j86LuBiL012791 for ; Tue, 6 Sep 2005 14:56:12 -0700 Received: from guinness.s2io.com (sentry.s2io.com [142.46.200.199]) by ns1.s2io.com (8.12.10/8.12.10) with ESMTP id j86LrWcx028112; Tue, 6 Sep 2005 17:53:32 -0400 (EDT) Received: from localhost.localdomain ([10.16.16.97]) by guinness.s2io.com (8.12.6/8.12.6) with ESMTP id j86LrTlb004377; Tue, 6 Sep 2005 17:53:30 -0400 (EDT) Received: (from root@localhost) by localhost.localdomain (8.13.1/8.13.1/Submit) id j874au8m004304; Tue, 6 Sep 2005 21:36:56 -0700 Date: Tue, 6 Sep 2005 21:36:56 -0700 Message-Id: <200509070436.j874au8m004304@localhost.localdomain> To: jgarzik@pobox.com, netdev@oss.sgi.com CC: raghavendra.koushik@neterion.com, ravinandan.arakali@neterion.com, leonid.grossman@neterion.com, rapuru.sriram@neterion.com, ananda.raju@neterion.com From: ravinandan.arakali@neterion.com Subject: [PATCH 2.6.13] S2io: Hardware and miscellaneous fixes X-Scanned-By: MIMEDefang 2.34 X-archive-position: 3595 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ravinandan.arakali@neterion.com Precedence: bulk X-list: netdev Content-Length: 11705 Lines: 307 Hi, This patch contains the following hardware related fixes and other miscellaneous bug fixes. 1. Updated the definition of single and double-bit ECC errors 2. Earlier we were allocating Transmit descriptors equal to MAX_SKB_FRAGS. This was causing a boundary condition failure. Need to allocate MAX_SKB_FRAGS+1 descriptors. 3. On some platforms(like PPC), pci_alloc_consistent() can return a zero DMA address. Since the NIC cannot handle zero-addresses, a workaround has been provided. Basically, we don't use such that page. We reallocate. 4. If list_info allocation failed during driver load, check for it during driver exit and return instead of trying to dereference NULL pointer. 5. Increase the debug level of few non-critical debug messages. 6. Reset the card on critical ECC double errors only in case of XframeI since XframeII can recover from such errors. 7. Print copyright message on driver load. 8. Bumped up the driver version no. to 2.0.8.1 Signed-off-by: Ravinandan Arakali --- diff -urpN old/drivers/net/s2io-regs.h new/drivers/net/s2io-regs.h --- old/drivers/net/s2io-regs.h 2005-09-06 04:51:44.000000000 -0700 +++ new/drivers/net/s2io-regs.h 2005-09-06 04:52:08.000000000 -0700 @@ -1,5 +1,5 @@ /************************************************************************ - * regs.h: A Linux PCI-X Ethernet driver for S2IO 10GbE Server NIC + * regs.h: A Linux PCI-X Ethernet driver for Neterion 10GbE Server NIC * Copyright(c) 2002-2005 Neterion Inc. * This software may be used and distributed according to the terms of @@ -713,13 +713,16 @@ typedef struct _XENA_dev_config { u64 mc_err_reg; #define MC_ERR_REG_ECC_DB_ERR_L BIT(14) #define MC_ERR_REG_ECC_DB_ERR_U BIT(15) +#define MC_ERR_REG_MIRI_ECC_DB_ERR_0 BIT(18) +#define MC_ERR_REG_MIRI_ECC_DB_ERR_1 BIT(20) #define MC_ERR_REG_MIRI_CRI_ERR_0 BIT(22) #define MC_ERR_REG_MIRI_CRI_ERR_1 BIT(23) #define MC_ERR_REG_SM_ERR BIT(31) -#define MC_ERR_REG_ECC_ALL_SNG (BIT(6) | \ - BIT(7) | BIT(17) | BIT(19)) -#define MC_ERR_REG_ECC_ALL_DBL (BIT(14) | \ - BIT(15) | BIT(18) | BIT(20)) +#define MC_ERR_REG_ECC_ALL_SNG (BIT(2) | BIT(3) | BIT(4) | BIT(5) |\ + BIT(6) | BIT(7) | BIT(17) | BIT(19)) +#define MC_ERR_REG_ECC_ALL_DBL (BIT(10) | BIT(11) | BIT(12) |\ + BIT(13) | BIT(14) | BIT(15) |\ + BIT(18) | BIT(20)) u64 mc_err_mask; u64 mc_err_alarm; diff -urpN old/drivers/net/s2io.c new/drivers/net/s2io.c --- old/drivers/net/s2io.c 2005-09-06 04:51:44.000000000 -0700 +++ new/drivers/net/s2io.c 2005-09-06 04:52:08.000000000 -0700 @@ -1,5 +1,5 @@ /************************************************************************ - * s2io.c: A Linux PCI-X Ethernet driver for S2IO 10GbE Server NIC + * s2io.c: A Linux PCI-X Ethernet driver for Neterion 10GbE Server NIC * Copyright(c) 2002-2005 Neterion Inc. * This software may be used and distributed according to the terms of @@ -28,7 +28,7 @@ * explaination of all the variables. * rx_ring_num : This can be used to program the number of receive rings used * in the driver. - * rx_ring_len: This defines the number of descriptors each ring can have. This + * rx_ring_sz: This defines the number of descriptors each ring can have. This * is also an array of size 8. * tx_fifo_num: This defines the number of Tx FIFOs thats used int the driver. * tx_fifo_len: This too is an array of 8. Each element defines the number of @@ -67,7 +67,7 @@ /* S2io Driver name & version. */ static char s2io_driver_name[] = "Neterion"; -static char s2io_driver_version[] = "Version 2.0.3.1"; +static char s2io_driver_version[] = "Version 2.0.8.1"; static inline int RXD_IS_UP2DT(RxD_t *rxdp) { @@ -404,7 +404,7 @@ static int init_shared_mem(struct s2io_n config->tx_cfg[i].fifo_len - 1; mac_control->fifos[i].fifo_no = i; mac_control->fifos[i].nic = nic; - mac_control->fifos[i].max_txds = MAX_SKB_FRAGS; + mac_control->fifos[i].max_txds = MAX_SKB_FRAGS + 1; for (j = 0; j < page_num; j++) { int k = 0; @@ -418,6 +418,26 @@ static int init_shared_mem(struct s2io_n DBG_PRINT(ERR_DBG, "failed for TxDL\n"); return -ENOMEM; } + /* If we got a zero DMA address(can happen on + * certain platforms like PPC), reallocate. + * Store virtual address of page we don't want, + * to be freed later. + */ + if (!tmp_p) { + mac_control->zerodma_virt_addr = tmp_v; + DBG_PRINT(INIT_DBG, + "%s: Zero DMA address for TxDL. ", dev->name); + DBG_PRINT(INIT_DBG, + "Virtual address %llx\n", (u64)tmp_v); + tmp_v = pci_alloc_consistent(nic->pdev, + PAGE_SIZE, &tmp_p); + if (!tmp_v) { + DBG_PRINT(ERR_DBG, + "pci_alloc_consistent "); + DBG_PRINT(ERR_DBG, "failed for TxDL\n"); + return -ENOMEM; + } + } while (k < lst_per_page) { int l = (j * lst_per_page) + k; if (l == config->tx_cfg[i].fifo_len) @@ -600,7 +620,7 @@ static void free_shared_mem(struct s2io_ mac_info_t *mac_control; struct config_param *config; int lst_size, lst_per_page; - + struct net_device *dev = nic->dev; if (!nic) return; @@ -616,9 +636,10 @@ static void free_shared_mem(struct s2io_ lst_per_page); for (j = 0; j < page_num; j++) { int mem_blks = (j * lst_per_page); - if ((!mac_control->fifos[i].list_info) || - (!mac_control->fifos[i].list_info[mem_blks]. - list_virt_addr)) + if (!mac_control->fifos[i].list_info) + return; + if (!mac_control->fifos[i].list_info[mem_blks]. + list_virt_addr) break; pci_free_consistent(nic->pdev, PAGE_SIZE, mac_control->fifos[i]. @@ -628,6 +649,18 @@ static void free_shared_mem(struct s2io_ list_info[mem_blks]. list_phy_addr); } + /* If we got a zero DMA address during allocation, + * free the page now + */ + if (mac_control->zerodma_virt_addr) { + pci_free_consistent(nic->pdev, PAGE_SIZE, + mac_control->zerodma_virt_addr, + (dma_addr_t)0); + DBG_PRINT(INIT_DBG, + "%s: Freeing TxDL with zero DMA addr. ", dev->name); + DBG_PRINT(INIT_DBG, "Virtual address %llx\n", + (u64)(mac_control->zerodma_virt_addr)); + } kfree(mac_control->fifos[i].list_info); } @@ -2479,9 +2512,10 @@ static void rx_intr_handler(ring_info_t #endif spin_lock(&nic->rx_lock); if (atomic_read(&nic->card_state) == CARD_DOWN) { - DBG_PRINT(ERR_DBG, "%s: %s going down for reset\n", + DBG_PRINT(INTR_DBG, "%s: %s going down for reset\n", __FUNCTION__, dev->name); spin_unlock(&nic->rx_lock); + return; } get_info = ring_data->rx_curr_get_info; @@ -2596,8 +2630,14 @@ static void tx_intr_handler(fifo_info_t if (txdlp->Control_1 & TXD_T_CODE) { unsigned long long err; err = txdlp->Control_1 & TXD_T_CODE; - DBG_PRINT(ERR_DBG, "***TxD error %llx\n", - err); + if ((err >> 48) == 0xA) { + DBG_PRINT(TX_DBG, "TxD returned due \ + to loss of link\n"); + } + else { + DBG_PRINT(ERR_DBG, "***TxD error \ + %llx\n", err); + } } skb = (struct sk_buff *) ((unsigned long) @@ -2689,12 +2729,16 @@ static void alarm_intr_handler(struct s2 if (val64 & MC_ERR_REG_ECC_ALL_DBL) { nic->mac_control.stats_info->sw_stat. double_ecc_errs++; - DBG_PRINT(ERR_DBG, "%s: Device indicates ", + DBG_PRINT(INIT_DBG, "%s: Device indicates ", dev->name); - DBG_PRINT(ERR_DBG, "double ECC error!!\n"); + DBG_PRINT(INIT_DBG, "double ECC error!!\n"); if (nic->device_type != XFRAME_II_DEVICE) { - netif_stop_queue(dev); - schedule_work(&nic->rst_timer_task); + /* Reset XframeI only if critical error */ + if (val64 & (MC_ERR_REG_MIRI_ECC_DB_ERR_0 | + MC_ERR_REG_MIRI_ECC_DB_ERR_1)) { + netif_stop_queue(dev); + schedule_work(&nic->rst_timer_task); + } } } else { nic->mac_control.stats_info->sw_stat. @@ -2706,7 +2750,8 @@ static void alarm_intr_handler(struct s2 val64 = readq(&bar0->serr_source); if (val64 & SERR_SOURCE_ANY) { DBG_PRINT(ERR_DBG, "%s: Device indicates ", dev->name); - DBG_PRINT(ERR_DBG, "serious error!!\n"); + DBG_PRINT(ERR_DBG, "serious error %llx!!\n", + (unsigned long long)val64); netif_stop_queue(dev); schedule_work(&nic->rst_timer_task); } @@ -3130,7 +3175,7 @@ int s2io_xmit(struct sk_buff *skb, struc queue_len = mac_control->fifos[queue].tx_curr_put_info.fifo_len + 1; /* Avoid "put" pointer going beyond "get" pointer */ if (txdp->Host_Control || (((put_off + 1) % queue_len) == get_off)) { - DBG_PRINT(ERR_DBG, "Error in xmit, No free TXDs.\n"); + DBG_PRINT(TX_DBG, "Error in xmit, No free TXDs.\n"); netif_stop_queue(dev); dev_kfree_skb(skb); spin_unlock_irqrestore(&sp->tx_lock, flags); @@ -3528,7 +3573,7 @@ static void s2io_set_multicast(struct ne val64 = readq(&bar0->mac_cfg); sp->promisc_flg = 1; - DBG_PRINT(ERR_DBG, "%s: entered promiscuous mode\n", + DBG_PRINT(INFO_DBG, "%s: entered promiscuous mode\n", dev->name); } else if (!(dev->flags & IFF_PROMISC) && (sp->promisc_flg)) { /* Remove the NIC from promiscuous mode */ @@ -3543,7 +3588,7 @@ static void s2io_set_multicast(struct ne val64 = readq(&bar0->mac_cfg); sp->promisc_flg = 0; - DBG_PRINT(ERR_DBG, "%s: left promiscuous mode\n", + DBG_PRINT(INFO_DBG, "%s: left promiscuous mode\n", dev->name); } @@ -5325,7 +5370,7 @@ s2io_init_nic(struct pci_dev *pdev, cons break; } } - config->max_txds = MAX_SKB_FRAGS; + config->max_txds = MAX_SKB_FRAGS + 1; /* Rx side parameters. */ if (rx_ring_sz[0] == 0) @@ -5525,9 +5570,14 @@ s2io_init_nic(struct pci_dev *pdev, cons if (sp->device_type & XFRAME_II_DEVICE) { DBG_PRINT(ERR_DBG, "%s: Neterion Xframe II 10GbE adapter ", dev->name); - DBG_PRINT(ERR_DBG, "(rev %d), Driver %s\n", + DBG_PRINT(ERR_DBG, "(rev %d), %s", get_xena_rev_id(sp->pdev), s2io_driver_version); +#ifdef CONFIG_2BUFF_MODE + DBG_PRINT(ERR_DBG, ", Buffer mode %d",2); +#endif + + DBG_PRINT(ERR_DBG, "\nCopyright(c) 2002-2005 Neterion Inc.\n"); DBG_PRINT(ERR_DBG, "MAC ADDR: %02x:%02x:%02x:%02x:%02x:%02x\n", sp->def_mac_addr[0].mac_addr[0], sp->def_mac_addr[0].mac_addr[1], @@ -5544,9 +5594,13 @@ s2io_init_nic(struct pci_dev *pdev, cons } else { DBG_PRINT(ERR_DBG, "%s: Neterion Xframe I 10GbE adapter ", dev->name); - DBG_PRINT(ERR_DBG, "(rev %d), Driver %s\n", + DBG_PRINT(ERR_DBG, "(rev %d), %s", get_xena_rev_id(sp->pdev), s2io_driver_version); +#ifdef CONFIG_2BUFF_MODE + DBG_PRINT(ERR_DBG, ", Buffer mode %d",2); +#endif + DBG_PRINT(ERR_DBG, "\nCopyright(c) 2002-2005 Neterion Inc.\n"); DBG_PRINT(ERR_DBG, "MAC ADDR: %02x:%02x:%02x:%02x:%02x:%02x\n", sp->def_mac_addr[0].mac_addr[0], sp->def_mac_addr[0].mac_addr[1], diff -urpN old/drivers/net/s2io.h new/drivers/net/s2io.h --- old/drivers/net/s2io.h 2005-09-06 04:51:44.000000000 -0700 +++ new/drivers/net/s2io.h 2005-09-06 04:52:08.000000000 -0700 @@ -1,5 +1,5 @@ /************************************************************************ - * s2io.h: A Linux PCI-X Ethernet driver for S2IO 10GbE Server NIC + * s2io.h: A Linux PCI-X Ethernet driver for Neterion 10GbE Server NIC * Copyright(c) 2002-2005 Neterion Inc. * This software may be used and distributed according to the terms of @@ -622,6 +622,9 @@ typedef struct mac_info { /* Fifo specific structure */ fifo_info_t fifos[MAX_TX_FIFOS]; + /* Save virtual address of TxD page with zero DMA addr(if any) */ + void *zerodma_virt_addr; + /* rx side stuff */ /* Ring specific structure */ ring_info_t rings[MAX_RX_RINGS]; From sim@netnation.com Tue Sep 6 16:59:34 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 06 Sep 2005 16:59:40 -0700 (PDT) Received: from peace.netnation.com (newpeace.netnation.com [204.174.223.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j86NxYiL025151 for ; Tue, 6 Sep 2005 16:59:34 -0700 Received: from sim by peace.netnation.com with local (Exim 4.50) id 1ECnJE-0008MJ-B2; Tue, 06 Sep 2005 16:57:00 -0700 Date: Tue, 6 Sep 2005 16:57:00 -0700 From: Simon Kirby To: Robert Olsson Cc: Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050906235700.GA31820@netnation.com> References: <20050815213855.GA17832@netnation.com> <43014E27.1070104@cosmosbay.com> <20050823190852.GA20794@netnation.com> <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17167.29239.469711.847951@robur.slu.se> User-Agent: Mutt/1.5.9i X-archive-position: 3596 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sim@netnation.com Precedence: bulk X-list: netdev Content-Length: 959 Lines: 35 On Fri, Aug 26, 2005 at 09:49:11PM +0200, Robert Olsson wrote: > Hello! > > This thread seems familar :) > > I think Simon uses UP and it could be idea to check if the RCU deferred > deletion causes the problem. >... > --- a/net/ipv4/route.c > +++ b/net/ipv4/route.c > @@ -485,7 +485,11 @@ static struct file_operations rt_cpu_seq > static __inline__ void rt_free(struct rtable *rt) > { > multipath_remove(rt); > +#ifdef CONFIG_SMP > call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free); > +#else > + dst_free((struct dst_entry *)rt); > +#endif > } > > static __inline__ void rt_drop(struct rtable *rt) Woot! Yes, this is the difference. With the patch applied (ajust directly freeing the dst_entry), everything balances easily, there are no overflows, and the result of rt_may_expire() looks very reasonable. (Yay!) So, this seems to be the culprit. Is NAPI supposed to allow the queued bh to run or should we just not be queuing this? Simon- From kuznet@yakov.inr.ac.ru Tue Sep 6 18:23:03 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 06 Sep 2005 18:23:06 -0700 (PDT) Received: from yakov.inr.ac.ru (yakov.inr.ac.ru [194.67.69.111]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id j871N0iL028695 for ; Tue, 6 Sep 2005 18:23:03 -0700 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=ms2.inr.ac.ru; b=iFjczkWuIH1P4zF6UygCD2wzbbcJ7WITdAhBmRUQ9ETt3lt//K7Sjn1QOXU1fAJ91I3u8LhAQxu6xpPTJwtAhuOX3Wn9rmJ2yEFQFHB7x2OtmRDysvqyIUM/KE1tU/8dL7fH1JbGmaTMhwKMQ4jkUK/oH1xYP5J84Z3RRr64Dbg=; Received: (from kuznet@localhost) envelope-from=kuznet by yakov.inr.ac.ru (8.6.13/ANK) id FAA25774; Wed, 7 Sep 2005 05:19:59 +0400 Date: Wed, 7 Sep 2005 05:19:59 +0400 From: Alexey Kuznetsov To: Simon Kirby Cc: Robert Olsson , Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050907011959.GA25725@yakov.inr.ac.ru> References: <43014E27.1070104@cosmosbay.com> <20050823190852.GA20794@netnation.com> <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050906235700.GA31820@netnation.com> User-Agent: Mutt/1.5.6i X-archive-position: 3597 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Content-Length: 848 Lines: 23 Hello! On Tue, Sep 06, 2005 at 04:57:00PM -0700, Simon Kirby wrote: > On Fri, Aug 26, 2005 at 09:49:11PM +0200, Robert Olsson wrote: ... > > I think Simon uses UP and it could be idea to check if the RCU deferred > > deletion causes the problem. .. > Yes, this is the difference. With the patch applied (ajust directly > freeing the dst_entry), everything balances easily, there are no > overflows, and the result of rt_may_expire() looks very reasonable. > (Yay!) > > So, this seems to be the culprit. Is NAPI supposed to allow the > queued bh to run or should we just not be queuing this? It is supposed to work. :-) The problem is like an unkillable zombie. Robert, have you seen this pehonomenon already? Did you mean that SMP works or that it never works (but this patch is valid only for UP)? Did it become worse after 2.6.9? Alexey From Robert.Olsson@data.slu.se Wed Sep 7 07:48:38 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 07:48:52 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87EmUiL030813 for ; Wed, 7 Sep 2005 07:48:38 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j87Ej5u6023315; Wed, 7 Sep 2005 16:45:05 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id EE4CEEC3CC; Wed, 7 Sep 2005 16:45:03 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17182.64751.340488.996748@robur.slu.se> Date: Wed, 7 Sep 2005 16:45:03 +0200 To: Simon Kirby Cc: Robert Olsson , Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance In-Reply-To: <20050906235700.GA31820@netnation.com> References: <20050815213855.GA17832@netnation.com> <43014E27.1070104@cosmosbay.com> <20050823190852.GA20794@netnation.com> <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> X-Mailer: VM 7.19 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-archive-position: 3598 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Content-Length: 998 Lines: 31 Simon Kirby writes: > Woot! > > Yes, this is the difference. With the patch applied (ajust directly > freeing the dst_entry), everything balances easily, there are no > overflows, and the result of rt_may_expire() looks very reasonable. > (Yay!) > > So, this seems to be the culprit. Is NAPI supposed to allow the > queued bh to run or should we just not be queuing this? Packet processing happens in RX_SOFIRQ. NAPI or non-NAPI is no difference with RCU deferred delete this should happen by the RCU-tasklet when tasklets are run after the real SOFTIRQ's. There is a limit for RCU work... maxbatch it's set to 10 you could back out the patch and try increase it 1000/10000 so we know this not prevent the freeing of entries. module_param(maxbatch, int, 0); /* rcupdate.c */ Also RCU clearly states that is should be used in read-mostly situations rDoS is outside this scope. Anyway it would be interesting to understand what's going on. Cheers. --ro From Robert.Olsson@data.slu.se Wed Sep 7 08:06:29 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 08:06:32 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87F6SiL000360 for ; Wed, 7 Sep 2005 08:06:28 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j87F3Hxk025818; Wed, 7 Sep 2005 17:03:17 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id 57D39EC3CC; Wed, 7 Sep 2005 17:03:17 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17183.309.317160.103056@robur.slu.se> Date: Wed, 7 Sep 2005 17:03:17 +0200 To: Alexey Kuznetsov Cc: Simon Kirby , Robert Olsson , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance In-Reply-To: <20050907011959.GA25725@yakov.inr.ac.ru> References: <43014E27.1070104@cosmosbay.com> <20050823190852.GA20794@netnation.com> <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <20050907011959.GA25725@yakov.inr.ac.ru> X-Mailer: VM 7.19 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-archive-position: 3599 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Content-Length: 710 Lines: 20 Alexey Kuznetsov writes: > Robert, have you seen this pehonomenon already? Did you mean that SMP works > or that it never works (but this patch is valid only for UP)? Did it > become worse after 2.6.9? It was quite some time since I saw dst cache overflow and we use 2.6 in infrastructure. Anyway I was able to "tune" route cache so I see in our lab system on a SMP box. I think UP and SMP behaves the same but with UP we could disable the deferred delete as Simon tested. I don't know if anything happen in 2.6.9 I don't think so. But any improvement in drivers or FIB lookup may increase the burden so we get overflows. We had some code that checked the RCU latency. Cheers. --ro From sim@netnation.com Wed Sep 7 09:31:28 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 09:31:33 -0700 (PDT) Received: from peace.netnation.com (newpeace.netnation.com [204.174.223.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87GVSiL009195 for ; Wed, 7 Sep 2005 09:31:28 -0700 Received: from sim by peace.netnation.com with local (Exim 4.50) id 1ED2n8-0007sq-E6; Wed, 07 Sep 2005 09:28:54 -0700 Date: Wed, 7 Sep 2005 09:28:54 -0700 From: Simon Kirby To: Robert Olsson Cc: Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050907162854.GB24735@netnation.com> References: <20050823190852.GA20794@netnation.com> <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17182.64751.340488.996748@robur.slu.se> User-Agent: Mutt/1.5.9i X-archive-position: 3600 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sim@netnation.com Precedence: bulk X-list: netdev Content-Length: 1029 Lines: 23 On Wed, Sep 07, 2005 at 04:45:03PM +0200, Robert Olsson wrote: > Packet processing happens in RX_SOFIRQ. NAPI or non-NAPI is no difference > with RCU deferred delete this should happen by the RCU-tasklet when > tasklets are run after the real SOFTIRQ's. > > There is a limit for RCU work... maxbatch it's set to 10 you could back > out the patch and try increase it 1000/10000 so we know this not prevent > the freeing of entries. Yes, setting maxbatch to 10000 also results in working gc, though routing throughput is about 5.7% higher when just calling dst_free directly. > Also RCU clearly states that is should be used in read-mostly situations > rDoS is outside this scope. Anyway it would be interesting to understand > what's going on. There was discussion about this before (recycling of existing entries is also now impossible, as compared with 2.4). It's a shame that this win for the normal case also hurts the DoS case...and it really hurts when the when the DoS case is the normal case. Simon- From Robert.Olsson@data.slu.se Wed Sep 7 09:52:02 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 09:52:06 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87GpwiL010409 for ; Wed, 7 Sep 2005 09:52:01 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j87Gn4SX010775; Wed, 7 Sep 2005 18:49:04 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id 03AC3EC3CC; Wed, 7 Sep 2005 18:49:03 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17183.6655.977975.249491@robur.slu.se> Date: Wed, 7 Sep 2005 18:49:03 +0200 To: Simon Kirby Cc: Robert Olsson , Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance In-Reply-To: <20050907162854.GB24735@netnation.com> References: <20050823190852.GA20794@netnation.com> <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> X-Mailer: VM 7.19 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-archive-position: 3601 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Content-Length: 748 Lines: 21 Simon Kirby writes: > Yes, setting maxbatch to 10000 also results in working gc, though routing > throughput is about 5.7% higher when just calling dst_free directly. Oh that's good news... You loose 5.7% for rDoS but should benefit in normal conditions. > There was discussion about this before (recycling of existing entries is > also now impossible, as compared with 2.4). It's a shame that this win > for the normal case also hurts the DoS case...and it really hurts when > the when the DoS case is the normal case. It's called trade-off's :) rDoS is hardly nomal case? But maybe it's time to compare routing via route hash vs FIB lookup directly again now when we have RCU with some FIB lookup's too. Cheers. --ro From sim@netnation.com Wed Sep 7 09:58:02 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 09:58:05 -0700 (PDT) Received: from peace.netnation.com (newpeace.netnation.com [204.174.223.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87Gw2iL011020 for ; Wed, 7 Sep 2005 09:58:02 -0700 Received: from sim by peace.netnation.com with local (Exim 4.50) id 1ED3Cq-0008UA-KF; Wed, 07 Sep 2005 09:55:28 -0700 Date: Wed, 7 Sep 2005 09:55:28 -0700 From: Simon Kirby To: Robert Olsson Cc: Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050907165528.GC24735@netnation.com> References: <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <20050907011959.GA25725@yakov.inr.ac.ru> <17183.309.317160.103056@robur.slu.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17183.309.317160.103056@robur.slu.se> User-Agent: Mutt/1.5.9i X-archive-position: 3602 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sim@netnation.com Precedence: bulk X-list: netdev Content-Length: 1346 Lines: 28 On Wed, Sep 07, 2005 at 05:03:17PM +0200, Robert Olsson wrote: > It was quite some time since I saw dst cache overflow and we use 2.6 > in infrastructure. Anyway I was able to "tune" route cache so I see > in our lab system on a SMP box. I think UP and SMP behaves the same > but with UP we could disable the deferred delete as Simon tested. > > I don't know if anything happen in 2.6.9 I don't think so. But any > improvement in drivers or FIB lookup may increase the burden so we get > overflows. I believe what I've been seeing is a _reduction_ in performance in both the e1000 driver and other parts of the kernel that result in it handling these packets much more slowly than in 2.4. The dst cache only overflows when the thing is completely pegged, so earlier 2.6 versions that were a little faster (eg: 2.6.11) were only overflowing occasionally depending on the speed of the input traffic. I've only been able to send 179 Mbps from one box, so that's what has been killing it. On the receiving end, 2.6.13-rc6 with the direct dst_free now drops a bunch but stays responsive with working GC, routing through about 69.6 Mbps, while 2.4.27 routes 103 Mbps worth. If it would be helpful, I can build some scripts to do benchmarks with different kernel combinations, and run it on a bunch of different kernel versions. Simon- From sim@netnation.com Wed Sep 7 10:00:32 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 10:00:36 -0700 (PDT) Received: from peace.netnation.com (newpeace.netnation.com [204.174.223.7]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87H0WiL011509 for ; Wed, 7 Sep 2005 10:00:32 -0700 Received: from sim by peace.netnation.com with local (Exim 4.50) id 1ED3FG-0008W1-Lw; Wed, 07 Sep 2005 09:57:58 -0700 Date: Wed, 7 Sep 2005 09:57:58 -0700 From: Simon Kirby To: Robert Olsson Cc: Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050907165758.GD24735@netnation.com> References: <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> <17183.6655.977975.249491@robur.slu.se> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <17183.6655.977975.249491@robur.slu.se> User-Agent: Mutt/1.5.9i X-archive-position: 3603 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: sim@netnation.com Precedence: bulk X-list: netdev Content-Length: 389 Lines: 10 On Wed, Sep 07, 2005 at 06:49:03PM +0200, Robert Olsson wrote: > It's called trade-off's :) rDoS is hardly nomal case? But maybe it's time > to compare routing via route hash vs FIB lookup directly again now when > we have RCU with some FIB lookup's too. I haven't even filled the route tables yet. I've just been testing with a bog standard table (three /24s and one /0). Simon- From Robert.Olsson@data.slu.se Wed Sep 7 10:24:04 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 10:24:08 -0700 (PDT) Received: from mx1.slu.se (mx1.slu.se [130.238.96.70]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j87HO1iL013264 for ; Wed, 7 Sep 2005 10:24:04 -0700 Received: from robur.slu.se (robur.slu.se [130.238.98.12]) by mx1.slu.se (8.13.1/8.13.1) with ESMTP id j87HLEIG015282; Wed, 7 Sep 2005 19:21:14 +0200 Received: by robur.slu.se (Postfix, from userid 1000) id 1468AEC3CC; Wed, 7 Sep 2005 19:21:14 +0200 (CEST) From: Robert Olsson MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <17183.8586.47462.585303@robur.slu.se> Date: Wed, 7 Sep 2005 19:21:14 +0200 To: Simon Kirby Cc: Robert Olsson , Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance In-Reply-To: <20050907165528.GC24735@netnation.com> References: <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <20050907011959.GA25725@yakov.inr.ac.ru> <17183.309.317160.103056@robur.slu.se> <20050907165528.GC24735@netnation.com> X-Mailer: VM 7.19 under Emacs 21.4.1 X-Scanned-By: MIMEDefang 2.48 on 130.238.96.70 X-archive-position: 3604 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: Robert.Olsson@data.slu.se Precedence: bulk X-list: netdev Content-Length: 618 Lines: 15 Simon Kirby writes: > I've only been able to send 179 Mbps from one box, so that's what has > been killing it. On the receiving end, 2.6.13-rc6 with the direct > dst_free now drops a bunch but stays responsive with working GC, > routing through about 69.6 Mbps, while 2.4.27 routes 103 Mbps worth. If route hash setup is identical, buckets etc and HZ is same etc. I have no idea about the performance difference. Somebody else? In other case you need to compare (o)profiles and see if this can give us any hints. To test drivers etc you might also want to test with a single flow. Cheers. --ro From kuznet@yakov.inr.ac.ru Wed Sep 7 13:02:13 2005 Received: with ECARTIS (v1.0.0; list netdev); Wed, 07 Sep 2005 13:02:16 -0700 (PDT) Received: from yakov.inr.ac.ru (yakov.inr.ac.ru [194.67.69.111]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id j87K2BiL026980 for ; Wed, 7 Sep 2005 13:02:12 -0700 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=ms2.inr.ac.ru; b=hHAZRrUeIVToG8JeQmW1EGBVbwL0xYDvpoGcUwjBB5LOaERi2V3eM1KiJTlFkMfeThMO6Cm7K1LENi3nphqLVHN+cYKWwc2hnV1QlUq9jLyVBIrEQ0plFjvUJd8kCaK4waAJJGj99P0UAtAH5USEH3llGKCF32xZSk3BBWWqwVM=; Received: (from kuznet@localhost) envelope-from=kuznet by yakov.inr.ac.ru (8.6.13/ANK) id XAA08451; Wed, 7 Sep 2005 23:59:11 +0400 Date: Wed, 7 Sep 2005 23:59:11 +0400 From: Alexey Kuznetsov To: Simon Kirby Cc: Robert Olsson , Alexey Kuznetsov , Eric Dumazet , netdev@oss.sgi.com Subject: Re: Route cache performance Message-ID: <20050907195911.GA8382@yakov.inr.ac.ru> References: <17163.32645.202453.145416@robur.slu.se> <20050824000158.GA8137@netnation.com> <20050825181111.GB14336@netnation.com> <20050825200543.GA6612@yakov.inr.ac.ru> <20050825212211.GA23384@netnation.com> <20050826115520.GA12351@yakov.inr.ac.ru> <17167.29239.469711.847951@robur.slu.se> <20050906235700.GA31820@netnation.com> <17182.64751.340488.996748@robur.slu.se> <20050907162854.GB24735@netnation.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20050907162854.GB24735@netnation.com> User-Agent: Mutt/1.5.6i X-archive-position: 3605 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kuznet@ms2.inr.ac.ru Precedence: bulk X-list: netdev Content-Length: 341 Lines: 14 Hello! > Yes, setting maxbatch to 10000 also results in working gc, Could you try lower values? F.e. I guess 300 or a little more (it is netdev_max_backlog) should be enough. > for the normal case also hurts the DoS case...and it really hurts when > the when the DoS case is the normal case. 5.7% is not "really hurts" yet. :-) Alexey From bernd-schubert@gmx.de Fri Sep 9 10:37:36 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 10:37:39 -0700 (PDT) Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j89HbWiL014953 for ; Fri, 9 Sep 2005 10:37:35 -0700 Received: from hamilton1.pci.uni-heidelberg.de (hamilton1.pci.uni-heidelberg.de [129.206.21.201]) by relay.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id j89HYmq6005840; Fri, 9 Sep 2005 19:34:48 +0200 Received: from lanczos.pci.uni-heidelberg.de ([129.206.21.135] helo=lanczos ident=foobar) by hamilton1.pci.uni-heidelberg.de with smtp (Exim 3.36 #1 (Debian)) id 1EDmm3-0001Xg-00; Fri, 09 Sep 2005 19:34:51 +0200 Received: by lanczos (sSMTP sendmail emulation); Fri, 9 Sep 2005 19:34:51 +0200 From: Bernd Schubert To: netdev@oss.sgi.com Subject: skge: reboot on sysfs resource0 access Date: Fri, 9 Sep 2005 19:34:50 +0200 User-Agent: KMail/1.7.2 Cc: Stephen Hemminger MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Disposition: inline Message-Id: <200509091934.51301.bernd-schubert@gmx.de> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j89HbWiL014953 X-archive-position: 3606 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bernd-schubert@gmx.de Precedence: bulk X-list: netdev Content-Length: 1404 Lines: 37 Hello, today we tried 2.6.13 on our server and also tried to use the skge driver. Well, in principle it works fine, until I became curious about the sysfs values. Stupid me, I was using the midnight commander to read the values. When I opened "/sys/bus/pci/drivers/skge/0000:01:01.0/resource0", the system immediately rebooted. After the reboot we tested using cat to the resource0 file, which gave an input/output error. Using again the mc, the system again immediately rebooted. Well, I guess I better don't use the midnight commander in the future, but somehow I think it shouldn't cause the system to reboot, should it? Is the i/o error of cat supposed to happen? Maybe it helps, here is a strace of mc's open for a normal file: open("/home/bernd/notes", O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 6 fstat64(6, {st_mode=S_IFREG|0644, st_size=96, ...}) = 0 fcntl64(102, F_GETFL) = -1 EBADF (Bad file descriptor) read(6, "http", 4) = 4 mmap2(NULL, 96, PROT_READ, MAP_SHARED, 6, 0) = 0x402fe000 select(5, [4], NULL, NULL, {0, 0}) = 0 (Timeout) select(5, [4], NULL, NULL, {0, 0}) = 0 (Timeout) write(1, "\33[1;1H\33[m\17\33[30m\33[46mFile: notes "..., 4019) = 4019 Thanks, Bernd -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: bernd.schubert@pci.uni-heidelberg.de From kas@fi.muni.cz Fri Sep 9 10:42:10 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 10:42:19 -0700 (PDT) Received: from tirith.ics.muni.cz (tirith.ics.muni.cz [147.251.4.36]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j89Hg8iL015660 for ; Fri, 9 Sep 2005 10:42:09 -0700 Received: from anxur.fi.muni.cz (anxur.fi.muni.cz [147.251.48.3]) by tirith.ics.muni.cz (8.13.2/8.13.2) with ESMTP id j89HdSww006438; Fri, 9 Sep 2005 19:39:30 +0200 Received: by anxur.fi.muni.cz (Postfix, from userid 11561) id EA97922AF67; Fri, 9 Sep 2005 19:39:28 +0200 (CEST) Date: Fri, 9 Sep 2005 19:39:28 +0200 From: Jan Kasprzak To: linux-kernel@vger.kernel.org Cc: netdev@oss.sgi.com Subject: TCP segmentation offload performance Message-ID: <20050909173928.GI4823@fi.muni.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.1i X-Muni-Spam-TestIP: 147.251.48.3 X-Muni-Envelope-From: kas@fi.muni.cz X-Muni-Virus-Test: Clean X-archive-position: 3608 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: kas@fi.muni.cz Precedence: bulk X-list: netdev Content-Length: 2474 Lines: 54 Hello, world! I tried to find out whether the TCP segmentation offload can perform better on my server than no TSO at all. My server is dual Opteron 244 with Tyan S2882 board with the following NIC: eth0: Tigon3 [partno(BCM95704A7) rev 2003 PHY(5704)] (PCIX:100MHz:64-bit) 10/100/1000BaseT Ethernet 00:e0:81:27:de:17 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1] eth0: dma_rwctrl[769f4000] The server runs ProFTPd with sendfile(2) enabled (and I have verified that it is being used with strace(8)). The kernel is 2.6.12.2. I have found that according to ethtool -k eth0 the TSO is switched off by default. So I tried to switch it on (altough I wondered why it is not switched on by default, provided that the hardware supports this feature). I tried to measure the difference by downloading an ISO image of FC4 i386 CD1 (665434112 bytes) from two hosts connected to the same switch. I did 10 transfers of the same file with each settings, and took the average and maximum of the last five transfers only (to avoid any start-up temporary conditions). The client Alpha was dual Opteron 248 with Tyan S2882 board, and the client Beta was quad Opteron 848 on HP DL-585 board. Client TSO Average speed Max speed Alpha off 108.7 MB/s 110.5 MB/s Alpha on 100.9 MB/s 101.2 MB/s Beta off 102.1 MB/s 102.4 MB/s Beta on 93.2 MB/s 95.5 MB/s Surprisingly enough, the tests without TSO were faster than with TSO enabled. Looking at tcpdump it seems that the system with TSO enabled sends only a 15 KB-sized frames to the NIC instead of full 64 KB-sized ones: 18:45:38.993150 IP odysseus.ftp-data > alpha.33125: P 127424:143352(15928) ack 1 win 1460 18:45:38.993203 IP odysseus.ftp-data > alpha.33125: P 143352:159280(15928) ack 1 win 1460 So I wonder what is wrong with TSO on my hardware and whether the TSO is expected to be faster than generating MTU-sized packets in the TCP stack. I did not measure the CPU usage on the server, only the network speed. Thanks! -Yenya -- | Jan "Yenya" Kasprzak | | GPG: ID 1024/D3498839 Fingerprint 0D99A7FB206605D7 8B35FCDE05B18A5E | | http://www.fi.muni.cz/~kas/ Journal: http://www.fi.muni.cz/~kas/blog/ | >>> $ cd my-kernel-tree-2.6 <<< >>> $ dotest /path/to/mbox # yes, Linus has no taste in naming scripts <<< From bernd.schubert@pci.uni-heidelberg.de Fri Sep 9 10:41:00 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 10:41:04 -0700 (PDT) Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j89HewiL015306 for ; Fri, 9 Sep 2005 10:41:00 -0700 Received: from hamilton1.pci.uni-heidelberg.de (hamilton1.pci.uni-heidelberg.de [129.206.21.201]) by relay.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id j89HcEqR007542; Fri, 9 Sep 2005 19:38:14 +0200 Received: from lanczos.pci.uni-heidelberg.de ([129.206.21.135] helo=lanczos ident=foobar) by hamilton1.pci.uni-heidelberg.de with smtp (Exim 3.36 #1 (Debian)) id 1EDmpO-0001Z3-00; Fri, 09 Sep 2005 19:38:18 +0200 Received: by lanczos (sSMTP sendmail emulation); Fri, 9 Sep 2005 19:38:18 +0200 From: Bernd Schubert To: netdev@oss.sgi.com Subject: skge: reboot on sysfs resource0 access User-Agent: KMail/1.7.2 Cc: Stephen Hemminger MIME-Version: 1.0 Content-Disposition: inline Date: Fri, 9 Sep 2005 19:38:17 +0200 Reply-To: bernd-schubert@gmx.de Content-Type: text/plain; charset="iso-8859-1" Message-Id: <200509091938.18079.bernd.schubert@pci.uni-heidelberg.de> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j89HewiL015306 X-archive-position: 3607 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bernd.schubert@pci.uni-heidelberg.de Precedence: bulk X-list: netdev Content-Length: 1572 Lines: 45 Hello, today we tried 2.6.13 on our server and also tried to use the skge driver. Well, in principle it works fine, until I became curious about the sysfs values. Stupid me, I was using the midnight commander to read the values. When I opened "/sys/bus/pci/drivers/skge/0000:01:01.0/resource0", the system immediately rebooted. After the reboot we tested using cat to the resource0 file, which gave an input/output error. Using again the mc, the system again immediately rebooted. Well, I guess I better don't use the midnight commander in the future, but somehow I think it shouldn't cause the system to reboot, should it? Is the i/o error of cat supposed to happen? Maybe it helps, here is a strace of mc's open for a normal file: open("/home/bernd/notes", O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 6 fstat64(6, {st_mode=S_IFREG|0644, st_size=96, ...}) = 0 fcntl64(102, F_GETFL) = -1 EBADF (Bad file descriptor) read(6, "http", 4) = 4 mmap2(NULL, 96, PROT_READ, MAP_SHARED, 6, 0) = 0x402fe000 select(5, [4], NULL, NULL, {0, 0}) = 0 (Timeout) select(5, [4], NULL, NULL, {0, 0}) = 0 (Timeout) write(1, "\33[1;1H\33[m\17\33[30m\33[46mFile: notes "..., 4019) = 4019 Thanks, Bernd -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: bernd.schubert@pci.uni-heidelberg.de -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: bernd.schubert@pci.uni-heidelberg.de From shemminger@osdl.org Fri Sep 9 11:04:15 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 11:04:20 -0700 (PDT) Received: from smtp.osdl.org (smtp.osdl.org [65.172.181.4]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j89I4FiL018777 for ; Fri, 9 Sep 2005 11:04:15 -0700 Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id j89I1aBo029724 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Fri, 9 Sep 2005 11:01:37 -0700 Received: from localhost.localdomain (dxpl.pdx.osdl.net [10.8.0.74]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with ESMTP id j89I1as4028712; Fri, 9 Sep 2005 11:01:36 -0700 Date: Fri, 9 Sep 2005 11:01:53 -0700 From: Stephen Hemminger To: bernd-schubert@gmx.de Cc: bernd.schubert@pci.uni-heidelberg.de, netdev@oss.sgi.com Subject: Re: skge: reboot on sysfs resource0 access Message-ID: <20050909110153.5a2e2e90@localhost.localdomain> In-Reply-To: <200509091938.18079.bernd.schubert@pci.uni-heidelberg.de> References: <200509091938.18079.bernd.schubert@pci.uni-heidelberg.de> X-Mailer: Sylpheed-Claws 1.9.13 (GTK+ 2.6.7; x86_64-redhat-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-MIMEDefang-Filter: osdl$Revision: 1.115 $ X-Scanned-By: MIMEDefang 2.36 X-archive-position: 3609 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: shemminger@osdl.org Precedence: bulk X-list: netdev Content-Length: 1106 Lines: 24 On Fri, 9 Sep 2005 19:38:17 +0200 Bernd Schubert wrote: > Hello, > > today we tried 2.6.13 on our server and also tried to use the skge driver. > Well, in principle it works fine, until I became curious about the sysfs > values. Stupid me, I was using the midnight commander to read the values. > When I opened "/sys/bus/pci/drivers/skge/0000:01:01.0/resource0", the system > immediately rebooted. After the reboot we tested using cat to the resource0 > file, which gave an input/output error. Using again the mc, the system again > immediately rebooted. > Well, I guess I better don't use the midnight commander in the future, but > somehow I think it shouldn't cause the system to reboot, should it? Is the > i/o error of cat supposed to happen? > Don't do that! resource0 is the pci space for the card and reading it directly accesses the memory mapped space. The register is sparse, and some places are unaccessable. Accessing non-existent memory will cause system to hang and if you are lucky a timeout and reboot. Sorry, this is not a driver bug. From bernd.schubert@pci.uni-heidelberg.de Fri Sep 9 11:12:23 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 11:12:27 -0700 (PDT) Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j89ICLiL019925 for ; Fri, 9 Sep 2005 11:12:22 -0700 Received: from hamilton1.pci.uni-heidelberg.de (hamilton1.pci.uni-heidelberg.de [129.206.21.201]) by relay.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id j89I9cKN018393; Fri, 9 Sep 2005 20:09:38 +0200 Received: from lanczos.pci.uni-heidelberg.de ([129.206.21.135] helo=lanczos ident=foobar) by hamilton1.pci.uni-heidelberg.de with smtp (Exim 3.36 #1 (Debian)) id 1EDnJl-0001ei-00; Fri, 09 Sep 2005 20:09:41 +0200 Received: by lanczos (sSMTP sendmail emulation); Fri, 9 Sep 2005 20:09:41 +0200 From: Bernd Schubert Reply-To: bernd-schubert@gmx.de To: Stephen Hemminger Subject: Re: skge: reboot on sysfs resource0 access Date: Fri, 9 Sep 2005 20:09:40 +0200 User-Agent: KMail/1.7.2 References: <200509091938.18079.bernd.schubert@pci.uni-heidelberg.de> <20050909110153.5a2e2e90@localhost.localdomain> In-Reply-To: <20050909110153.5a2e2e90@localhost.localdomain> Cc: netdev@oss.sgi.com MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Disposition: inline Message-Id: <200509092009.41497.bernd.schubert@pci.uni-heidelberg.de> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j89ICLiL019925 X-archive-position: 3610 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bernd.schubert@pci.uni-heidelberg.de Precedence: bulk X-list: netdev Content-Length: 1584 Lines: 41 On Friday 09 September 2005 20:01, you wrote: > On Fri, 9 Sep 2005 19:38:17 +0200 > > Bernd Schubert wrote: > > Hello, > > > > today we tried 2.6.13 on our server and also tried to use the skge > > driver. Well, in principle it works fine, until I became curious about > > the sysfs values. Stupid me, I was using the midnight commander to read > > the values. When I opened > > "/sys/bus/pci/drivers/skge/0000:01:01.0/resource0", the system > > immediately rebooted. After the reboot we tested using cat to the > > resource0 file, which gave an input/output error. Using again the mc, the > > system again immediately rebooted. > > Well, I guess I better don't use the midnight commander in the future, > > but somehow I think it shouldn't cause the system to reboot, should it? > > Is the i/o error of cat supposed to happen? > > Don't do that! resource0 is the pci space for the card and > reading it directly accesses the memory mapped space. The > register is sparse, and some places are unaccessable. > Accessing non-existent memory will cause system to hang and if you > are lucky a timeout and reboot. > > Sorry, this is not a driver bug. Thanks, I better also won't read the resource values of the other pci-devices. And I think I will search for some documentation of sysfs to know in the future which values one should read and which not. Thanks again, Bernd -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: bernd.schubert@pci.uni-heidelberg.de From greearb@candelatech.com Fri Sep 9 11:24:43 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 11:24:50 -0700 (PDT) Received: from www.lanforge.com (ns1.lanforge.com [66.165.47.210]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j89IOhiL021299 for ; Fri, 9 Sep 2005 11:24:43 -0700 Received: from [71.112.207.5] (pool-71-112-207-5.sttlwa.dsl-w.verizon.net [71.112.207.5]) (authenticated bits=0) by www.lanforge.com (8.12.8/8.12.8) with ESMTP id j89IROo6003730; Fri, 9 Sep 2005 11:27:25 -0700 Message-ID: <4321D2C0.10800@candelatech.com> Date: Fri, 09 Sep 2005 11:21:52 -0700 From: Ben Greear Organization: Candela Technologies User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.7.10) Gecko/20050719 Fedora/1.7.10-1.3.1 X-Accept-Language: en-us, en MIME-Version: 1.0 To: Stephen Hemminger CC: bernd-schubert@gmx.de, bernd.schubert@pci.uni-heidelberg.de, netdev@oss.sgi.com Subject: Re: skge: reboot on sysfs resource0 access References: <200509091938.18079.bernd.schubert@pci.uni-heidelberg.de> <20050909110153.5a2e2e90@localhost.localdomain> In-Reply-To: <20050909110153.5a2e2e90@localhost.localdomain> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 3611 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: greearb@candelatech.com Precedence: bulk X-list: netdev Content-Length: 1605 Lines: 46 Stephen Hemminger wrote: > On Fri, 9 Sep 2005 19:38:17 +0200 > Bernd Schubert wrote: > > >>Hello, >> >>today we tried 2.6.13 on our server and also tried to use the skge driver. >>Well, in principle it works fine, until I became curious about the sysfs >>values. Stupid me, I was using the midnight commander to read the values. >>When I opened "/sys/bus/pci/drivers/skge/0000:01:01.0/resource0", the system >>immediately rebooted. After the reboot we tested using cat to the resource0 >>file, which gave an input/output error. Using again the mc, the system again >>immediately rebooted. >>Well, I guess I better don't use the midnight commander in the future, but >>somehow I think it shouldn't cause the system to reboot, should it? Is the >>i/o error of cat supposed to happen? >> > > > Don't do that! resource0 is the pci space for the card and > reading it directly accesses the memory mapped space. The > register is sparse, and some places are unaccessable. > Accessing non-existent memory will cause system to hang and if you > are lucky a timeout and reboot. > > Sorry, this is not a driver bug. Does that mean if you do this: find /sys -name "*" -print|xargs grep foo that the system will crash? I certainly would consider that a bug, and even if that somehow works, I'd think that at the least you should be able to read every file in the file system without crashing the system! Do you at least have to be root to cause this crash? Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com From ananda.raju@neterion.com Fri Sep 9 18:40:47 2005 Received: with ECARTIS (v1.0.0; list netdev); Fri, 09 Sep 2005 18:40:49 -0700 (PDT) Received: from ns1.s2io.com (ns1.s2io.com [142.46.200.198]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8A1ekiL032452 for ; Fri, 9 Sep 2005 18:40:47 -0700 Received: from guinness.s2io.com (sentry.s2io.com [142.46.200.199]) by ns1.s2io.com (8.12.10/8.12.10) with ESMTP id j8A1c9cx010078 for ; Fri, 9 Sep 2005 21:38:09 -0400 (EDT) Received: from rkoushik ([10.16.16.56]) by guinness.s2io.com (8.12.6/8.12.6) with ESMTP id j8A1c8lb003006; Fri, 9 Sep 2005 21:38:08 -0400 (EDT) Message-Id: <200509100138.j8A1c8lb003006@guinness.s2io.com> From: "Ananda Raju" To: Cc: "'Leonid Grossman'" , Subject: clarification required on UDP sendfile() Date: Fri, 9 Sep 2005 18:37:16 -0700 MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook, Build 11.0.5510 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2180 Thread-Index: AcW1qC02FbvWbLv0RS6gMj3bdhNA7g== X-Scanned-By: MIMEDefang 2.34 X-archive-position: 3612 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: ananda.raju@neterion.com Precedence: bulk X-list: netdev Content-Length: 2259 Lines: 88 Hi, We are implementing UDP Large send offload (USO) feature for our Xframe-II 10g Ethernet adapter. We are facing problem in using sendfile(). we have written a client server program which uses sendfile() over udp. We are facing a problem in which the last sendfile() operation fails to reach server. This behavior is irrespective of USO feature. Client.c has following code. Size of file iperf-2.0.1.tar.gz used for transfer is 222957 -------------------------------------------------- Main() { portno=16000; fd1 = open("iperf-2.0.1.tar.gz",O_RDWR); fd = socket(AF_INET,SOCK_DGRAM,0); bzero((char *) &serv_addr, sizeof(serv_addr)); serv_addr.sin_family = AF_INET; serv_addr.sin_port = htons(portno); serv_addr.sin_addr.s_addr = inet_addr("172.10.1.227"); ret = connect(fd,&serv_addr,sizeof(serv_addr)); len = sizeof(client_addr); off=0; while (1) { size = 40*1024; ret = sendfile(fd,fd1,NULL,size); printf("size %d \n",ret); sleep(1); if (ret<=0) exit(0); } close(fd); close(fd1); } -------------------------------------------- Server.c has following code -------------------------------------------- int portno=16000; char buf[65000]; main() { fd = socket(AF_INET,SOCK_DGRAM,0); bzero((char *) &serv_addr, sizeof(serv_addr)); serv_addr.sin_family = AF_INET; serv_addr.sin_port = htons(portno); ret = bind(fd,(struct sockaddr*)&serv_addr,sizeof(serv_addr)); len = sizeof(client_addr); while (1){ ret = recvfrom(fd,&buf,sizeof(buf),0,(struct sockaddr*)&client_addr,&len); printf("size %d \n",ret); } } ------------------------------------------- # ls -l |grep iperf-2.0.1.tar.gz -rw-r--r-- 1 root root 222957 Sep 8 08:31 iperf-2.0.1.tar.gz # #./client size 40960 size 40960 size 40960 size 40960 size 40960 size 18157 <<< Didn't reach server size 0 # #./server size 40960 size 40960 size 40960 size 40960 size 40960 The last transmit of 18157 bytes didn't reach the server, any reason why this happens. Also some time the middle frames also won't reach the server. We did tcpdump and observed that the packets are not put on the wire. The packets are getting lost in the host network stack. Is this behavior is expected or We are doing wrong somewhere? Regards, Ananda From bernd.schubert@pci.uni-heidelberg.de Mon Sep 12 04:04:41 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 12 Sep 2005 04:04:46 -0700 (PDT) Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8CB4diL027930 for ; Mon, 12 Sep 2005 04:04:40 -0700 Received: from hamilton1.pci.uni-heidelberg.de (hamilton1.pci.uni-heidelberg.de [129.206.21.201]) by relay.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id j8CB1akx018239; Mon, 12 Sep 2005 13:01:44 +0200 Received: from lanczos.pci.uni-heidelberg.de ([129.206.21.135] helo=lanczos ident=foobar) by hamilton1.pci.uni-heidelberg.de with smtp (Exim 3.36 #1 (Debian)) id 1EEm4C-00044a-00; Mon, 12 Sep 2005 13:01:40 +0200 Received: by lanczos (sSMTP sendmail emulation); Mon, 12 Sep 2005 13:01:40 +0200 From: Bernd Schubert Reply-To: bernd-schubert@gmx.de To: Ben Greear Subject: Re: skge: reboot on sysfs resource0 access Date: Mon, 12 Sep 2005 13:01:39 +0200 User-Agent: KMail/1.7.2 Cc: Stephen Hemminger , netdev@oss.sgi.com References: <200509091938.18079.bernd.schubert@pci.uni-heidelberg.de> <20050909110153.5a2e2e90@localhost.localdomain> <4321D2C0.10800@candelatech.com> In-Reply-To: <4321D2C0.10800@candelatech.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200509121301.39924.bernd.schubert@pci.uni-heidelberg.de> X-archive-position: 3615 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bernd.schubert@pci.uni-heidelberg.de Precedence: bulk X-list: netdev Content-Length: 537 Lines: 22 > > Sorry, this is not a driver bug. > > Does that mean if you do this: > > find /sys -name "*" -print|xargs grep foo > > that the system will crash? I would also guess it would happen, though I won't try that now. > > I certainly would consider that a bug, and even if that somehow works, I'd > think that at the least you should be able to read every file in the file > system without crashing the system! > > Do you at least have to be root to cause this crash? Yes, the resource0 file has rw access to root only. Cheers, Bernd From bernd.schubert@pci.uni-heidelberg.de Mon Sep 12 08:42:27 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 12 Sep 2005 08:42:32 -0700 (PDT) Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8CFgQiL020242 for ; Mon, 12 Sep 2005 08:42:27 -0700 Received: from hamilton1.pci.uni-heidelberg.de (hamilton1.pci.uni-heidelberg.de [129.206.21.201]) by relay.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id j8CFdhup030146; Mon, 12 Sep 2005 17:39:43 +0200 Received: from lanczos.pci.uni-heidelberg.de ([129.206.21.135] helo=lanczos ident=foobar) by hamilton1.pci.uni-heidelberg.de with smtp (Exim 3.36 #1 (Debian)) id 1EEqPL-0002dI-00; Mon, 12 Sep 2005 17:39:47 +0200 Received: by lanczos (sSMTP sendmail emulation); Mon, 12 Sep 2005 17:39:47 +0200 From: Bernd Schubert Reply-To: bernd-schubert@gmx.de To: netdev@oss.sgi.com Subject: 2.613: network write socket problems Date: Mon, 12 Sep 2005 17:39:45 +0200 User-Agent: KMail/1.7.2 Cc: linux-kernel@vger.kernel.org MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Disposition: inline Message-Id: <200509121739.46172.bernd.schubert@pci.uni-heidelberg.de> Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by oss.sgi.com id j8CFgQiL020242 X-archive-position: 3616 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bernd.schubert@pci.uni-heidelberg.de Precedence: bulk X-list: netdev Content-Length: 1694 Lines: 45 Hello, on last Friday we switched on our server to 2.6.13 and today we are experiencing problems with our nfs clients. In particular I'm talking about the unfs3 daemon, not the kernel nfs daemon. Both are running on the server but on different ports, of course. Both are also serving to the same clients, but different directories. Today it already several times happend that the unfs3 daemon stalled. Ethereal showed no network packages on the unfs3 daemon port during this time. A strace to the proc-id of the daemon clearly shows that *some* writes to some network sockets will take ages to finish write(37, "\200\0\0x\203\326(\5\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 124) = 124 This kind of writes can take between seconds and minutes, while it usually happens much faster than I can count. After the write() to the network socket, other operations happen rather fast, until the next write to a network socket. (I identified the troublesome filedescriptors by looking to /proc/procid/fd). After restarting the unfs3 daemon everything goes smooth for some time (approximately 20min to 2h), until the next write to a filedescriptor stalls. Any idea whats going on? Until today this never happend before, neither with 2.6.x nor 2.4.x. As I wrote, on Friday we replaced 2.6.11.12 by 2.6.13, the configuration should be similar, only changes should be HZ set to 250 and additionally the skge driver. We already switched back from skge to sk98lin, but the problem seems to remain. Thanks, Bernd -- Bernd Schubert Physikalisch Chemisches Institut / Theoretische Chemie Universität Heidelberg INF 229 69120 Heidelberg e-mail: bernd.schubert@pci.uni-heidelberg.de From y_h_lee@yahoo.com Mon Sep 12 09:54:31 2005 Received: with ECARTIS (v1.0.0; list netdev); Mon, 12 Sep 2005 09:54:37 -0700 (PDT) Received: from web34211.mail.mud.yahoo.com (web34211.mail.mud.yahoo.com [66.163.178.126]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with SMTP id j8CGsViL024788 for ; Mon, 12 Sep 2005 09:54:31 -0700 Received: (qmail 51507 invoked by uid 60001); 12 Sep 2005 16:51:53 -0000 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=Message-ID:Received:Date:From:Subject:To:MIME-Version:Content-Type:Content-Transfer-Encoding; b=6fG8hOMm9Mgz2Qo7i0ys2zCYXeq0Inn6MfDv1yq4okgD7AAdaxKCJVJpB25xi29VjjSCDakPmTec+UY4NXqIEDYOR2opTftiWsUQullCtgxoKYgbC1poXx2QF3brq1CeedQnou5FDrVA3B2kmd2e/v0ORRC3qN/zUmJZOsem6xY= ; Message-ID: <20050912165153.51505.qmail@web34211.mail.mud.yahoo.com> Received: from [192.35.17.30] by web34211.mail.mud.yahoo.com via HTTP; Mon, 12 Sep 2005 09:51:53 PDT Date: Mon, 12 Sep 2005 09:51:53 -0700 (PDT) From: YongHan Lee Subject: Writing Kernel Module to get Kernel Routing Table Information To: y_h_lee@yahoo.com MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-archive-position: 3617 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: y_h_lee@yahoo.com Precedence: bulk X-list: netdev Content-Length: 2195 Lines: 72 Dear Linux-Networking Maintainer, I am a student of EPFL in Switzerland and do a semester project in networking. My subject is "ipv6 multipath AODV routing implementation". To achieve my goal I would like to program a kernel module which allows/enables multipath routing. This means, if I have several routes to one destination (but different next hops) in the kernel routing table, then I want to choose a route entry with the desired next hop and not necessary the first route entry. Since I do not know well the kernel architecture, especially the kernel networking section, I do not know if and how it is possible. My first idea was to use netfilter for ipv6, but the multiple routing table with marks are only implemented for ipv4. https://lists.netfilter.org/pipermail/netfilter/2005-August/062252.html And now I try to write a Kernel module which should retrieve all route entries from the kernel routing table (fib6_node, rt6_info or dst_entry struct) by comparing the destination address, source address and next hop address with those of the ip packet. This happens at POST ROUTING hook. Afterwards, I would change the destination (struct dst_entry) of the sk_buff struct of the ip packet. The problem is not to get the next hop address of the ip packet. I have started to write my kernel module, but it was not able to get the route entries from the kernel routing table, because a lot of the functions are static. I wanted to iterate the fib6_nodes from the root (like fib6_lookup(&ip6_routing_table, daddr, saddr) [from ip6_fib.c]), but the kernel returns me: Sep 12 17:31:34 m66533pp kernel: kaodv: Unknown symbol fib6_lookup Sep 12 17:31:34 m66533pp kernel: kaodv: Unknown symbol ip6_routing_table even they are defined in kallsyms and I included the ip6_fib.h file. I would be very glad if you could give me some helps or advises (how it is possible or my idea is totally impossible). Some links to architecture of kernel routing table would be already a great help for me. I would like to thank you in advance for your help. yours faithfully, Yong-Han Lee __________________________________ Yahoo! Mail - PC Magazine Editors' Choice 2005 http://mail.yahoo.com From bernd.schubert@pci.uni-heidelberg.de Tue Sep 13 02:25:38 2005 Received: with ECARTIS (v1.0.0; list netdev); Tue, 13 Sep 2005 02:25:44 -0700 (PDT) Received: from relay.uni-heidelberg.de (relay.uni-heidelberg.de [129.206.100.212]) by oss.sgi.com (8.12.10/8.12.10/SuSE Linux 0.7) with ESMTP id j8D9PaiL032121 for ; Tue, 13 Sep 2005 02:25:37 -0700 Received: from hamilton1.pci.uni-heidelberg.de (hamilton1.pci.uni-heidelberg.de [129.206.21.201]) by relay.uni-heidelberg.de (8.13.4/8.13.1) with ESMTP id j8D9MnL3004747; Tue, 13 Sep 2005 11:22:49 +0200 Received: from lanczos.pci.uni-heidelberg.de ([129.206.21.135] helo=lanczos ident=foobar) by hamilton1.pci.uni-heidelberg.de with smtp (Exim 3.36 #1 (Debian)) id 1EF709-0001oQ-00; Tue, 13 Sep 2005 11:22:53 +0200 Received: by lanczos (sSMTP sendmail emulation); Tue, 13 Sep 2005 11:22:53 +0200 From: Bernd Schubert Reply-To: TC-ADMIN@listserv.uni-heidelberg.de To: netdev@oss.sgi.com Subject: Re: 2.613: network write socket problems Date: Tue, 13 Sep 2005 11:22:52 +0200 User-Agent: KMail/1.7.2 Cc: linux-kernel@vger.kernel.org References: <200509121739.46172.bernd.schubert@pci.uni-heidelberg.de> In-Reply-To: <200509121739.46172.bernd.schubert@pci.uni-heidelberg.de> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200509131122.53286.bernd.schubert@pci.uni-heidelberg.de> X-archive-position: 3618 X-ecartis-version: Ecartis v1.0.0 Sender: netdev-bounce@oss.sgi.com Errors-to: netdev-bounce@oss.sgi.com X-original-sender: bernd.schubert@pci.uni-heidelberg.de Precedence: bulk X-list: netdev Content-Length: 1120 Lines: 24 On Monday 12 September 2005 17:39, Bernd Schubert wrote: > Hello, > > on last Friday we switched on our server to 2.6.13 and today we are > experiencing problems with our nfs clients. > In particular I'm talking about the unfs3 daemon, not the kernel nfs > daemon. Both are running on the server but on different ports, of course