[Top] [All Lists]

[patch] IP_FRAG_TIME versus unregister_netdevice

To: "David S. Miller" <davem@xxxxxxxxxx>, Alexey Kuznetosv <kuznet@xxxxxxxxxxxxx>
Subject: [patch] IP_FRAG_TIME versus unregister_netdevice
From: Andrew Morton <andrewm@xxxxxxxxxx>
Date: Sat, 19 Aug 2000 21:07:16 +1000
Cc: "netdev@xxxxxxxxxxx" <netdev@xxxxxxxxxxx>
Sender: owner-netdev@xxxxxxxxxxx
There's a piece of code at the end of unregister_netdevice()
which waits for up to ten seconds waiting for the device to
become un-busy.  If the timeout is expired it prints the
"Wait for crash" message and returns.

The problem is, orphan ipv4 fragments have a reference to
the device and they have an arbitrarily long in-kernel
lifetime.  Default is 30 seconds.

So if you send an orphan fragment to your linux box and
then do an ifdown/rmmod, you'll get the "Wait for crash"
message.  Then, typically, the net_device is kfree'ed
and the driver module is unloaded.  Then, at some time
in the future, the fragments start expiring.  This causes
atomic_decs of dev->refcnt (it has been kfree'ed) and,
possibly, a call to dev->destructor (its text has been
unmapped).  The kernel will crash.

I see several ways to fix this:

1: Make unregister_netdevice hunt down all the skbuffs associated
   with this device and release them.  Sounds hard.

2: Make the 10 second delay equal to the max of sysctl_ipfrag_time,
   sysctl_ip6frag_time, and who knows what else.  This isn't
   attractive because it'll require us to be able to predict
   the future lifetime of all skbuffs for all protocols.

3: Make unregister_netdevice return -EAGAIN, so the caller
   (usually /sbin/rmmod) can make a policy decision what to do.
   Long term, this is the correct architecture.

   The problem with this is that the net drivers will then
   need to be taught that  unregister_netdevice can fail.
   They will ALL need changing.

   It's not a big change, but frankly, if we're going to
   churn each and every driver then we may as well get
   it right.  And at present the Linux netdevice lifecycle is
   far from right.  The lack or userland access to
   [un]register_netdevice and the unhealthy linkage
    between sys_ins/del_module and [un]register_netdevice
   is the source of a lot of bugs, races and general
   bogosity.  We need `ifconfig plumb' and `ifconfig
   unplumb', like Solaris.  I'll write a rant about this
   sometime, but I suggest it's 2.5-food.

4: Make unregister_netdevice() wait until the device is
   free, as Alexey suggested in his comment.

   That's what this patch does.  It waits indefinitely for the
   device to become free and prints a message every ten seconds
   while waiting.

--- ../linux-2.4.0-test7-pre4/net/core/dev.c    Wed Aug 16 23:54:29 2000
+++ net/core/dev.c      Sat Aug 19 20:24:36 2000
@@ -58,6 +58,7 @@
  *                                     the backlog queue.
  *     Paul Rusty Russell      :       SIOCSIFNAME
  *              Pekka Riikonen  :      Netdev boot-time settings code
+ *              Andrew Morton   :       Make unregister_netdevice wait 
indefinitely on dev->refcnt
 #include <asm/uaccess.h>
@@ -2311,7 +2312,7 @@
 int unregister_netdevice(struct net_device *dev)
-       unsigned long now;
+       unsigned long now, warning_time;
        struct net_device *d, **dp;
        /* If device is running, close it first. */
@@ -2379,31 +2380,30 @@
        printk("unregister_netdevice: waiting %s refcnt=%d\n", dev->name, 
-       /* EXPLANATION. If dev->refcnt is not 1 now (1 is our own reference)
-          it means that someone in the kernel still has reference
+       /* EXPLANATION. If dev->refcnt is not now 1 (our own reference)
+          it means that someone in the kernel still has a reference
           to this device and we cannot release it.
           "New style" devices have destructors, hence we can return from this
-          function and destructor will do all the work later.
+          function and destructor will do all the work later.  As of kernel 
+          there are very few "New Style" devices.
-          "Old style" devices expect that device is free of any references
-          upon exit from this function. WE CANNOT MAKE such release
-          without delay. Note that it is not new feature. Referencing devices
-          after they are released occured in 2.0 and 2.2.
-          Now we just can know about each fact of illegal usage.
-          So, we linger for 10*HZ (it is an arbitrary number)
+          "Old style" devices expect that the device is free of any references
+          upon exit from this function.
+          We cannot return from this function until all such references have
+          fallen away.  This is because the caller of this function will 
+          immediately kfree(*dev) and then be unloaded via sys_delete_module.
+          So, we linger until all references fall away.  The duration of the
+          linger is basically unbounded! It is driven by, for example, the
+          current setting of sysctl_ipfrag_time.
           After 1 second, we start to rebroadcast unregister notifications
           in hope that careless clients will release the device.
-          If timeout expired, we have no choice how to cross fingers
-          and return. Real alternative would be block here forever
-          and we will make it eventually, when all peaceful citizens
-          will be notified and repaired.
-       now = jiffies;
+       now = warning_time = jiffies;
        while (atomic_read(&dev->refcnt) != 1) {
                if ((jiffies - now) > 1*HZ) {
                        /* Rebroadcast unregister notification */
@@ -2412,12 +2412,13 @@
                current->state = TASK_INTERRUPTIBLE;
                current->state = TASK_RUNNING;
-               if ((jiffies - now) > 10*HZ)
-                       break;
+               if ((jiffies - warning_time) > 10*HZ) {
+                       printk(KERN_EMERG "unregister_netdevice: waiting for %s 
to "
+                                       "become free. Usage count = %d\n",
+                                       dev->name, atomic_read(&dev->refcnt));
+                       warning_time = jiffies;
+               }
-       if (atomic_read(&dev->refcnt) != 1)
-               printk("unregister_netdevice: Old style device %s 
leaked(refcnt=%d). Wait for crash.\n", dev->name, atomic_read(&dev->refcnt)-1);
        return 0;

<Prev in Thread] Current Thread [Next in Thread>