Message-Id: <200505162034.j4GKYbch022992@death.nxdomain.ibm.com>
To: Eric Paris
cc: netdev@oss.sgi.com, jgarzik@pobox.com, bonding-devel@lists.sourceforge.net
Subject: Re: [PATCH] bonding using arp_ip_target may stay down with active path
In-Reply-To: Message from Eric Paris of "Mon, 16 May 2005 14:41:25 EDT."
	<1116268885.3738.19.camel@dhcp59-180.rdu.redhat.com>
Date: Mon, 16 May 2005 13:34:37 -0700
From: Jay Vosburgh

Eric Paris wrote:

>[...] Bring back up the interface connected to
>eth1.  At this point we have a "valid" connection since eth1 can talk to
>one of the arp targets.  But we are only sending arp requests on eth0
>(verify with tcpdump)

	The trick is to have a situation with a partitioned network and
a failure such that the device still has link, but does not respond to
the ARP queries.  That's not an unreasonable failure if there's a switch
in each path to the arp_ip_target peers (which is how I set it up
locally).

>The patch below has been tested by me and appears to fix the problem.
>All of the failover tests I performed seem to work including pulling
>cables and stopping responses from the arp_ip_target entries.

	The patch looks good to me, also (although I made the change by
hand instead of via patch).

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

Signed-off-by: Jay Vosburgh

--- linux-2.6.11/drivers/net/bonding/bond_main.c.orig	2005-05-12 12:22:52.000000000 -0400
+++ linux-2.6.11/drivers/net/bonding/bond_main.c	2005-05-12 15:13:53.000000000 -0400
@@ -3046,7 +3046,7 @@ static void bond_activebackup_arp_mon(st
 			bond_set_slave_inactive_flags(bond->current_arp_slave);
 
 			/* search for next candidate */
-			bond_for_each_slave_from(bond, slave, i, bond->current_arp_slave) {
+			bond_for_each_slave_from(bond, slave, i, bond->current_arp_slave->next) {
 				if (IS_UP(slave->dev)) {
 					slave->link = BOND_LINK_BACK;
 					bond_set_slave_active_flags(slave);
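
	For anyone following along, here is a toy userspace sketch of
why the starting point of that search matters.  It is not driver code:
struct toy_slave and first_up_from() are made-up names, the "up" field
merely stands in for IS_UP(slave->dev), and the loop assumes that
bond_for_each_slave_from visits each slave once around the ring,
beginning with the slave it is handed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the bonding slave structure. */
struct toy_slave {
	const char *name;
	int up;			/* plays the role of IS_UP(slave->dev) */
	struct toy_slave *next;	/* circular singly linked list */
};

/*
 * Walk the ring once, beginning at 'start', and return the first
 * slave that is up.  Returns NULL if none of the 'count' slaves is up.
 */
static struct toy_slave *first_up_from(struct toy_slave *start, int count)
{
	struct toy_slave *s = start;
	int i;

	assert(start != NULL);

	for (i = 0; i < count; i++, s = s->next) {
		if (s->up)
			return s;
	}
	return NULL;
}
```

	With a two-slave ring in which eth0 still has link (up) but
gets no ARP replies, first_up_from(&eth0, 2) hands back eth0 again,
so the probing never moves on; first_up_from(eth0.next, 2) picks eth1,
which is the effect of starting at current_arp_slave->next.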