There's a piece of code at the end of unregister_netdevice()
which waits for up to ten seconds waiting for the device to
become un-busy. If the timeout is expired it prints the
"Wait for crash" message and returns.
The problem is, orphan ipv4 fragments have a reference to
the device and they have an arbitrarily long in-kernel
lifetime. Default is 30 seconds.
So if you send an orphan fragment to your linux box and
then do an ifdown/rmmod, you'll get the "Wait for crash"
message. Then, typically, the net_device is kfree'ed
and the driver module is unloaded. Then, at some time
in the future, the fragments start expiring. This causes
atomic_decs of dev->refcnt (it has been kfree'ed) and,
possibly, a call to dev->destructor (its text has been
unmapped). The kernel will crash.
I see several ways to fix this:
1: Make unregister_netdevice hunt down all the skbuffs associated
with this device and release them. Sounds hard.
2: Make the 10 second delay equal to the max of sysctl_ipfrag_time,
sysctl_ip6frag_time, and who knows what else. This isn't
attractive because it'll require us to be able to predict
the future lifetime of all skbuffs for all protocols.
3: Make unregister_netdevice return -EAGAIN, so the caller
(usually /sbin/rmmod) can make a policy decision what to do.
Long term, this is the correct architecture.
The problem with this is that the net drivers will then
need to be taught that unregister_netdevice can fail.
They will ALL need changing.
It's not a big change, but frankly, if we're going to
churn each and every driver then we may as well get
it right. And at present the Linux netdevice lifecycle is
far from right. The lack or userland access to
[un]register_netdevice and the unhealthy linkage
between sys_ins/del_module and [un]register_netdevice
is the source of a lot of bugs, races and general
bogosity. We need `ifconfig plumb' and `ifconfig
unplumb', like Solaris. I'll write a rant about this
sometime, but I suggest it's 2.5-food.
4: Make unregister_netdevice() wait until the device is
free, as Alexey suggested in his comment.
That's what this patch does. It waits indefinitely for the
device to become free and prints a message every ten seconds
while waiting.
--- ../linux-2.4.0-test7-pre4/net/core/dev.c Wed Aug 16 23:54:29 2000
+++ net/core/dev.c Sat Aug 19 20:24:36 2000
@@ -58,6 +58,7 @@
* the backlog queue.
* Paul Rusty Russell : SIOCSIFNAME
* Pekka Riikonen : Netdev boot-time settings code
+ * Andrew Morton : Make unregister_netdevice wait
indefinitely on dev->refcnt
*/
#include <asm/uaccess.h>
@@ -2311,7 +2312,7 @@
int unregister_netdevice(struct net_device *dev)
{
- unsigned long now;
+ unsigned long now, warning_time;
struct net_device *d, **dp;
/* If device is running, close it first. */
@@ -2379,31 +2380,30 @@
printk("unregister_netdevice: waiting %s refcnt=%d\n", dev->name,
atomic_read(&dev->refcnt));
#endif
- /* EXPLANATION. If dev->refcnt is not 1 now (1 is our own reference)
- it means that someone in the kernel still has reference
+ /* EXPLANATION. If dev->refcnt is not now 1 (our own reference)
+ it means that someone in the kernel still has a reference
to this device and we cannot release it.
"New style" devices have destructors, hence we can return from this
- function and destructor will do all the work later.
+ function and destructor will do all the work later. As of kernel
2.4.0
+ there are very few "New Style" devices.
- "Old style" devices expect that device is free of any references
- upon exit from this function. WE CANNOT MAKE such release
- without delay. Note that it is not new feature. Referencing devices
- after they are released occured in 2.0 and 2.2.
- Now we just can know about each fact of illegal usage.
-
- So, we linger for 10*HZ (it is an arbitrary number)
+ "Old style" devices expect that the device is free of any references
+ upon exit from this function.
+ We cannot return from this function until all such references have
+ fallen away. This is because the caller of this function will
probably
+ immediately kfree(*dev) and then be unloaded via sys_delete_module.
+
+ So, we linger until all references fall away. The duration of the
+ linger is basically unbounded! It is driven by, for example, the
+ current setting of sysctl_ipfrag_time.
After 1 second, we start to rebroadcast unregister notifications
in hope that careless clients will release the device.
- If timeout expired, we have no choice how to cross fingers
- and return. Real alternative would be block here forever
- and we will make it eventually, when all peaceful citizens
- will be notified and repaired.
*/
- now = jiffies;
+ now = warning_time = jiffies;
while (atomic_read(&dev->refcnt) != 1) {
if ((jiffies - now) > 1*HZ) {
/* Rebroadcast unregister notification */
@@ -2412,12 +2412,13 @@
current->state = TASK_INTERRUPTIBLE;
schedule_timeout(HZ/4);
current->state = TASK_RUNNING;
- if ((jiffies - now) > 10*HZ)
- break;
+ if ((jiffies - warning_time) > 10*HZ) {
+ printk(KERN_EMERG "unregister_netdevice: waiting for %s
to "
+ "become free. Usage count = %d\n",
+ dev->name, atomic_read(&dev->refcnt));
+ warning_time = jiffies;
+ }
}
-
- if (atomic_read(&dev->refcnt) != 1)
- printk("unregister_netdevice: Old style device %s
leaked(refcnt=%d). Wait for crash.\n", dev->name, atomic_read(&dev->refcnt)-1);
dev_put(dev);
return 0;
}
|