Received: by oss.sgi.com id ; Sat, 19 Aug 2000 07:28:16 -0700 Received: from smtprch2.nortelnetworks.com ([192.135.215.15]:55969 "EHLO smtprch2.nortel.com") by oss.sgi.com with ESMTP id ; Sat, 19 Aug 2000 07:27:57 -0700 Received: from zrchb213.us.nortel.com (actually zrchb213) by smtprch2.nortel.com; Sat, 19 Aug 2000 06:00:54 -0500 Received: from zctwb003.asiapac.nortel.com ([47.152.32.111]) by zrchb213.us.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2652.39) id QYZF72F4; Sat, 19 Aug 2000 06:04:25 -0500 Received: from uow.edu.au (47.181.194.64 [47.181.194.64]) by zctwb003.asiapac.nortel.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2652.39) id RBSBA8K7; Sat, 19 Aug 2000 21:04:25 +1000 Message-ID: <399E6A64.CB616120@uow.edu.au> Date: Sat, 19 Aug 2000 21:07:16 +1000 X-Sybari-Space: 00000000 00000000 00000000 From: Andrew Morton X-Mailer: Mozilla 4.7 [en] (X11; I; Linux 2.2.14-15mdk i586) X-Accept-Language: en MIME-Version: 1.0 To: "David S. Miller" , Alexey Kuznetosv CC: "netdev@oss.sgi.com" Subject: [patch] IP_FRAG_TIME versus unregister_netdevice Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Orig: Sender: owner-netdev@oss.sgi.com Precedence: bulk Return-Path: X-Orcpt: rfc822;netdev-outgoing There's a piece of code at the end of unregister_netdevice() which waits for up to ten seconds waiting for the device to become un-busy. If the timeout is expired it prints the "Wait for crash" message and returns. The problem is, orphan ipv4 fragments have a reference to the device and they have an arbitrarily long in-kernel lifetime. Default is 30 seconds. So if you send an orphan fragment to your linux box and then do an ifdown/rmmod, you'll get the "Wait for crash" message. Then, typically, the net_device is kfree'ed and the driver module is unloaded. Then, at some time in the future, the fragments start expiring. This causes atomic_decs of dev->refcnt (it has been kfree'ed) and, possibly, a call to dev->destructor (its text has been unmapped). The kernel will crash. I see several ways to fix this: 1: Make unregister_netdevice hunt down all the skbuffs associated with this device and release them. Sounds hard. 2: Make the 10 second delay equal to the max of sysctl_ipfrag_time, sysctl_ip6frag_time, and who knows what else. This isn't attractive because it'll require us to be able to predict the future lifetime of all skbuffs for all protocols. 3: Make unregister_netdevice return -EAGAIN, so the caller (usually /sbin/rmmod) can make a policy decision what to do. Long term, this is the correct architecture. The problem with this is that the net drivers will then need to be taught that unregister_netdevice can fail. They will ALL need changing. It's not a big change, but frankly, if we're going to churn each and every driver then we may as well get it right. And at present the Linux netdevice lifecycle is far from right. The lack or userland access to [un]register_netdevice and the unhealthy linkage between sys_ins/del_module and [un]register_netdevice is the source of a lot of bugs, races and general bogosity. We need `ifconfig plumb' and `ifconfig unplumb', like Solaris. I'll write a rant about this sometime, but I suggest it's 2.5-food. 4: Make unregister_netdevice() wait until the device is free, as Alexey suggested in his comment. That's what this patch does. It waits indefinitely for the device to become free and prints a message every ten seconds while waiting. --- ../linux-2.4.0-test7-pre4/net/core/dev.c Wed Aug 16 23:54:29 2000 +++ net/core/dev.c Sat Aug 19 20:24:36 2000 @@ -58,6 +58,7 @@ * the backlog queue. * Paul Rusty Russell : SIOCSIFNAME * Pekka Riikonen : Netdev boot-time settings code + * Andrew Morton : Make unregister_netdevice wait indefinitely on dev->refcnt */ #include @@ -2311,7 +2312,7 @@ int unregister_netdevice(struct net_device *dev) { - unsigned long now; + unsigned long now, warning_time; struct net_device *d, **dp; /* If device is running, close it first. */ @@ -2379,31 +2380,30 @@ printk("unregister_netdevice: waiting %s refcnt=%d\n", dev->name, atomic_read(&dev->refcnt)); #endif - /* EXPLANATION. If dev->refcnt is not 1 now (1 is our own reference) - it means that someone in the kernel still has reference + /* EXPLANATION. If dev->refcnt is not now 1 (our own reference) + it means that someone in the kernel still has a reference to this device and we cannot release it. "New style" devices have destructors, hence we can return from this - function and destructor will do all the work later. + function and destructor will do all the work later. As of kernel 2.4.0 + there are very few "New Style" devices. - "Old style" devices expect that device is free of any references - upon exit from this function. WE CANNOT MAKE such release - without delay. Note that it is not new feature. Referencing devices - after they are released occured in 2.0 and 2.2. - Now we just can know about each fact of illegal usage. - - So, we linger for 10*HZ (it is an arbitrary number) + "Old style" devices expect that the device is free of any references + upon exit from this function. + We cannot return from this function until all such references have + fallen away. This is because the caller of this function will probably + immediately kfree(*dev) and then be unloaded via sys_delete_module. + + So, we linger until all references fall away. The duration of the + linger is basically unbounded! It is driven by, for example, the + current setting of sysctl_ipfrag_time. After 1 second, we start to rebroadcast unregister notifications in hope that careless clients will release the device. - If timeout expired, we have no choice how to cross fingers - and return. Real alternative would be block here forever - and we will make it eventually, when all peaceful citizens - will be notified and repaired. */ - now = jiffies; + now = warning_time = jiffies; while (atomic_read(&dev->refcnt) != 1) { if ((jiffies - now) > 1*HZ) { /* Rebroadcast unregister notification */ @@ -2412,12 +2412,13 @@ current->state = TASK_INTERRUPTIBLE; schedule_timeout(HZ/4); current->state = TASK_RUNNING; - if ((jiffies - now) > 10*HZ) - break; + if ((jiffies - warning_time) > 10*HZ) { + printk(KERN_EMERG "unregister_netdevice: waiting for %s to " + "become free. Usage count = %d\n", + dev->name, atomic_read(&dev->refcnt)); + warning_time = jiffies; + } } - - if (atomic_read(&dev->refcnt) != 1) - printk("unregister_netdevice: Old style device %s leaked(refcnt=%d). Wait for crash.\n", dev->name, atomic_read(&dev->refcnt)-1); dev_put(dev); return 0; }