Date: Sun, 15 May 2005 15:24:41 -0700 (PDT)
Message-Id: <20050515.152441.106262257.davem@davemloft.net>
To: herbert@gondor.apana.org.au
Cc: johnpol@2ka.mipt.ru, netdev@oss.sgi.com
Subject: Re: [IPV4/IPV6] Keep wmem accounting separate in ip*_push_pending_frames
From: "David S. Miller"
In-Reply-To: <20050515114121.GA4830@gondor.apana.org.au>
References: <20050515104016.GA24344@gondor.apana.org.au> <20050515114121.GA4830@gondor.apana.org.au>

From: Herbert Xu
Subject: Re: [IPV4/IPV6] Keep wmem accounting separate in ip*_push_pending_frames
Date: Sun, 15 May 2005 21:41:21 +1000

> So let's go the other way and make this an invariant:
>
> 	For any skb on a frag_list, skb->sk must be NULL.
>
> That is, the socket ownership always belongs to the head skb.
> It turns out that the implementation is actually pretty simple.

Yes, I was just about to inform you that when SKBs are free'd up,
only the head SKB's sk gets the destructor run on it.

Please refer to this patch from July of 2004:

[IPV4]: Fix multicast socket hangs.

If a multicast packet gets looped back, the sending socket can hang
if a local read just sits and does not empty its receive queue.
The problem is that when an SKB clone is freed up, the destructor is
only invoked for the head SKB when there is a fraglist (which is
created for fragmentation).

The solution is to account the fragment list SKB lengths in the
top-level head SKB, then it all works out.

Signed-off-by: David S. Miller

--- 1/net/ipv4/ip_output.c	2005-05-15 15:23:19 -07:00
+++ 2/net/ipv4/ip_output.c	2005-05-15 15:23:19 -07:00
@@ -498,10 +498,6 @@
 			    skb_headroom(frag) < hlen)
 			    goto slow_path;
 
-			/* Correct socket ownership. */
-			if (frag->sk == NULL && skb->sk)
-				goto slow_path;
-
 			/* Partially cloned skb? */
 			if (skb_shared(frag))
 				goto slow_path;
@@ -1113,12 +1109,10 @@
 		tail_skb = &(tmp_skb->next);
 		skb->len += tmp_skb->len;
 		skb->data_len += tmp_skb->len;
-#if 0 /* Logically correct, but useless work, ip_fragment() will have to undo */
 		skb->truesize += tmp_skb->truesize;
 		__sock_put(tmp_skb->sk);
 		tmp_skb->destructor = NULL;
 		tmp_skb->sk = NULL;
-#endif
 	}
 
 	/* Unless user demanded real pmtu discovery (IP_PMTUDISC_DO), we allow