[cc:ed to netdev] Well this doesn't look unreasonable, but I haven't run into it with the NICs I've tested. Nor have I seen this reported before. Which NICs is this with? I wonder if netif_running is
Hi, Sungem. I didn't find anything strange in sungem, but it may be... Probably... I wanted to make as few modifications as possible. -- Colin http://colino.net/323/ - Presenting the Mazda 323 Rouge
I think it is very strange that only sungem behaves this way. I don't think netpoll is doing anything different than what would happen, f.e., when bringing an interface up using dhcp. That should c
Hi, Thanks for your help. You meant zero here? (Or did I misunderstand something?) Btw, shouldn't gem_poll() check for gp->hw_running, too? I added a printk at the end of gem_open(), it's 1 even when
Hi, My girlfriend told me what my infinite loop printed: netpoll: netif_running 1, netif_carrier_ok 0, netif_queue_stopped 0. Shouldn't sungem stop its queue when there's no carrier? (I rea
Hi again, I looked a bit more at the code and found a possible problem. However, it doesn't fix the hang, so either it's not what I found, or there's something else. First, my newbie question: is
Oh yes, it appears that netpoll doesn't support NETIF_F_LLTX locking, crap :( When a device has NETIF_F_LLTX set, it means that the driver's the caller one level up. Andi Kleen didn't fix up netpoll
Colin, feeling adventurous enough to take a stab at this? It looks pretty straightforward but I'm going to be even more useless than usual for the next two weeks. -- Mathematics is the supreme nostal
It takes its own lock, not xmit_lock. It's fine to ignore it completely. In the worst case the poll will not be retried, but netpoll has no way to do that anyways I think. -Andi
Hi again, This patch should do that. It works OK for me, but I'd like it checked before it is sent upstream... However, it doesn't fix the hang. It looks like this hang is really coming from sungem. -- a/n
IMHO it's not needed. Taking xmit_lock is harmless even when the NETIF_F_LLTX flag is set. (or at least it was with my original patchkit. In theory it's possible someone changed their driver to take
If another thread on another cpu is in the dev->hard_start_xmit() routine, then it will have its tx device lock held, and netpoll will simply get an immediate return from ->hard_start_xmit() with er
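For illustration only, the immediate-return behaviour described above can be modelled in user space. This is a rough sketch, not driver code: `lltx_dev`, `lltx_xmit()` and the `LLTX_*` status names are hypothetical stand-ins, and a C11 `atomic_flag` plays the role of the driver's private tx lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch only. */
enum lltx_status { LLTX_OK = 0, LLTX_LOCKED = -1 };

/* Hypothetical device: a private tx lock plus a counter of sent packets. */
struct lltx_dev {
    atomic_flag tx_lock;
    int sent;
};

/*
 * Stand-in for a driver transmit routine that does its own locking.
 * It never spins: if the tx lock is already held, it returns at once
 * and leaves the retry decision to the caller.
 */
static enum lltx_status lltx_xmit(struct lltx_dev *dev)
{
    assert(dev != NULL);

    /* test_and_set returns the previous value: true means "already held". */
    if (atomic_flag_test_and_set(&dev->tx_lock))
        return LLTX_LOCKED;

    dev->sent++;                      /* pretend the packet was queued */
    atomic_flag_clear(&dev->tx_lock); /* release the tx lock */
    return LLTX_OK;
}
```

The caller sees `LLTX_LOCKED` only while someone else holds the flag, so what it does with that status (retry, drop, or give up after a while) decides whether the send path can spin.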
Is it hanging inside of the ->hard_start_xmit() call or somewhere else? Do you have a way to determine this without adding printk()'s and thus causing recursion as you mentioned earlier? :-)
Deadlocks from recursion, presumably? We could probably throw in a max retry count, as ugly as that is.. -- Mathematics is the supreme nostalgia of our time.
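A max retry count along those lines might look roughly like the following user-space sketch. Everything here is assumed for illustration: `send_with_retry_cap()`, `busy_dev` and the `TX_*` names are made up, and the callback merely stands in for dev->hard_start_xmit(); it is not the netpoll code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status codes, local to this sketch. */
enum tx_status { TX_OK = 0, TX_BUSY = 1, TX_LOCKED = -1 };

/* Hypothetical transmit callback standing in for dev->hard_start_xmit(). */
typedef enum tx_status (*xmit_fn)(void *ctx);

/*
 * Call xmit() at most max_tries times instead of looping forever.
 * Returns true once a call reports TX_OK, false if the cap is reached.
 * *tries_used is set to the number of calls actually made.
 */
static bool send_with_retry_cap(xmit_fn xmit, void *ctx,
                                int max_tries, int *tries_used)
{
    assert(max_tries > 0);

    for (int i = 1; i <= max_tries; i++) {
        *tries_used = i;
        if (xmit(ctx) == TX_OK)
            return true;
        /* TX_BUSY or TX_LOCKED: try again until the cap is hit. */
    }
    return false; /* give up and let the caller drop the packet */
}

/* Fake device that reports busy for a fixed number of calls. */
struct busy_dev {
    int busy_calls_left;
};

static enum tx_status busy_dev_xmit(void *ctx)
{
    struct busy_dev *d = ctx;

    if (d->busy_calls_left > 0) {
        d->busy_calls_left--;
        return TX_BUSY;
    }
    return TX_OK;
}
```

With a device that is busy for two calls and a cap of five, the send succeeds on the third try; a device that never becomes ready makes the function return false after five tries rather than hanging.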
There should not be any recursion, no. The problem is that the poll is effectively a spinlock. But when another CPU takes a long interrupt while holding the lock it could take quite a long time to
Hi, If printk() is synchronous, there could be, if there's a printk() in the codepath taken by dev->hard_start_xmit()... But I don't know if it is... Second newbie question: how are the interrupts disable
Hi, I think so, but my way of discovering it may not be very good: I tested by replacing status = np->dev->hard_start_xmit(...); by status = NETDEV_TX_OK, then status = NETDEV_TX_BUSY, then status =
Hi, Should that be completely dropped, or is it still OK? (I think differentiating action based on hard_start_xmit status, that is, don't goto repeat indefinitely when NETDEV_TX_BUSY, could be a goo