| To: | sfeldma@xxxxxxxxx |
|---|---|
| Subject: | Re: [PATCH 2.6] e100: use NAPI mode all the time |
| From: | Tim Mattox <tmattox@xxxxxxxxxxxx> |
| Date: | Sun, 6 Jun 2004 21:51:23 -0400 |
| Cc: | netdev@xxxxxxxxxxx, bonding-devel@xxxxxxxxxxxxxxxxxxxxx, Scott Feldman <scott.feldman@xxxxxxxxx>, jgarzik@xxxxxxxxx |
| In-reply-to: | <1086566591.3721.54.camel@localhost.localdomain> |
| References: | <Pine.LNX.4.58.0406041727160.2662@sfeldma-ich5.jf.intel.com> <DC71FD1C-B80C-11D8-9557-000393652100@engr.uky.edu> <1086566591.3721.54.camel@localhost.localdomain> |
| Sender: | netdev-bounce@xxxxxxxxxxx |
Please excuse the length of this e-mail. I will attempt to explain the potential problem between NAPI and bonding with an example below. The only reason I say "potential" is that I have deliberately avoided building clusters with this configuration and have not seen it "in the wild" personally. I've read about this problem on the beowulf mailing list, usually in conjunction with people trying to bond GigE NICs. I will soon have a cluster that can be easily switched to various modes on its network, including simple bonding, and I should be able to directly test this myself in my lab.

The problem is caused by the order in which packets are delivered to the TCP stack on the receiving machine. In normal round-robin bonding mode, packets are sent out one per NIC in the bond. For simplicity's sake, let's say we have two NICs in a bond, eth0 and eth1. When sending, eth0 will handle all the even packets and eth1 all the odd packets. Similarly, when receiving, eth0 would get all the even packets and eth1 all the odd packets from a particular TCP stream.

With NAPI (or other interrupt mitigation techniques), the receiving machine will process multiple packets in a row from a single NIC before getting packets from another NIC. In the above example, eth0 would receive packets 0, 2, 4, 6, etc. and pass them to the TCP layer, followed by eth1's packets 1, 3, 5, 7, etc. The specific number of out-of-order packets received in a row would depend on many factors. The TCP layer would need to reorder the packets from something like 0, 2, 4, 6, 1, 3, 5, 7 or 0, 2, 4, 1, 3, 5, 6, 7, with many possible variations (see the sketch after this message).

Before NAPI (and hardware interrupt mitigation schemes), bonding would work without causing this re-ordering, since each packet would be enqueued to the TCP stack in its order of arrival, which in a well-designed network would match the transmission order. Sure, if your network randomly delayed packets then things would get out of order, but in the HPC community, which uses bonding, the two network paths would normally be made identical, possibly with only a single switch between the source and destination NICs. If there were congestion delays in one path and not the other, then the HPC network or program would have more serious problems.

I don't want to slow the progress of Linux networking development. I was objecting to the removal of a feature from e100 that already has working code and that was, AFAIK, necessary for the performance enhancement of bonding. If the overhead of re-ordering the packets is not significant, and if simply increasing the value of /proc/sys/net/ipv4/tcp_reordering will allow TCP to "chill" and not send negative ACKs when it sees packets this much out of order, then sure, remove the non-NAPI support.

I will attempt to re-locate the specific examples discussed on the beowulf mailing list, but I don't have those URLs handy.

On Jun 6, 2004, at 8:03 PM, Scott Feldman wrote:

Have you considered how this interacts with multiple e100's bonded together with Linux channel bonding?

I've CC'd the bonding developer mailing list to flush out any more opinions on this.
I have yet to set up a good test system, but my impression has been that NAPI and channel bonding would lead to lots of packet re-ordering load for the CPU that could outweigh the interrupt load savings. Does anyone have experience with this?
Also, depending on the setting of /proc/sys/net/ipv4/tcp_reordering, the TCP stack might send aggressive NACKs because of false positives on dropped packets, due to the large reordering that could occur with NAPI and bonding combined.
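For reference, the knob in question is a plain text file under /proc. The following is a minimal sketch (an editorial addition, not code from this thread) that reads the current value; raising it, e.g. with `sysctl -w net.ipv4.tcp_reordering=<N>` as root or by writing to the same file, is the "let TCP chill" adjustment discussed in this message.

```c
/*
 * Editorial sketch: read the current value of the tcp_reordering sysctl.
 * The value is the kernel's estimate of how much reordering to tolerate
 * on a path before duplicate ACKs are treated as evidence of loss; the
 * traditional default is 3.
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_reordering", "r");
    int val;

    if (!f || fscanf(f, "%d", &val) != 1) {
        perror("tcp_reordering");
        return 1;
    }
    fclose(f);

    printf("net.ipv4.tcp_reordering = %d\n", val);
    return 0;
}
```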
In short, unless there has been study on this, I would suggest not yet removing support for non-NAPI mode from any network driver. I have NO problems with NAPI itself; I think it's a wonderful development. I would even advocate making NAPI the default across the board. But for bonding, until I see otherwise, I want to be able to not use NAPI. As I indicated, I will have a new cluster on which I can directly test this NAPI vs. bonding issue very soon.

--
Tim Mattox - tmattox@xxxxxxxxxxxx - http://homepage.mac.com/tmattox/
http://aggregate.org/KAOS/ - http://advogato.org/person/tmattox/
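To make the delivery-order example in this message concrete, here is a minimal user-space sketch (an editorial illustration, not code from this thread or from the e100/bonding drivers). Two per-NIC queues stand in for a two-NIC round-robin bond, and a `batch` parameter stands in for how many frames a NAPI poll drains from one NIC before moving to the next; a batch of 1 approximates per-packet interrupt delivery, while a larger batch reproduces the 0, 2, 4, 6, 1, 3, 5, 7 ordering described above.

```c
/*
 * Editorial sketch: toy model of the delivery orders described in the
 * message.  Packet i of a stream is carried by NIC (i % NUM_NICS), as in
 * round-robin bonding; deliver() drains up to "batch" packets from each
 * NIC per polling pass, mimicking NAPI-style batching.
 */
#include <stdio.h>

#define NUM_PACKETS 8
#define NUM_NICS    2

static void deliver(int batch)
{
    /* Per-NIC receive queues: packet i lands on NIC (i % NUM_NICS). */
    int queue[NUM_NICS][NUM_PACKETS];
    int qlen[NUM_NICS] = { 0 };
    int head[NUM_NICS] = { 0 };
    int remaining = NUM_PACKETS;

    for (int i = 0; i < NUM_PACKETS; i++)
        queue[i % NUM_NICS][qlen[i % NUM_NICS]++] = i;

    printf("batch=%d:", batch);

    /* Round-robin polling: drain up to "batch" packets per NIC per pass. */
    while (remaining > 0) {
        for (int nic = 0; nic < NUM_NICS; nic++) {
            for (int k = 0; k < batch && head[nic] < qlen[nic]; k++) {
                printf(" %d", queue[nic][head[nic]++]);
                remaining--;
            }
        }
    }
    printf("\n");
}

int main(void)
{
    deliver(1);  /* per-packet interrupts: 0 1 2 3 4 5 6 7 */
    deliver(4);  /* NAPI-style batching:   0 2 4 6 1 3 5 7 */
    return 0;
}
```

Compiled with any C compiler, the two calls print `0 1 2 3 4 5 6 7` and `0 2 4 6 1 3 5 7` respectively; the second sequence is the kind of gap that TCP's reordering logic (and the tcp_reordering setting above) would have to absorb on the receiver.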