I have already responded to this analysis. In any case, here it is again:
Later versions of e100 (3.4.8, for instance) include a call to
netif_poll_disable in e100_down. That call is supposed to wait until
polling has stopped, and once it returns we are guaranteed that
e100_poll will no longer be called. In addition, if an interrupt does
arrive, our call to netif_rx_schedule() will not add our poll routine to
the poll list, since polling is disabled. So with those versions this
race cannot happen.
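
To make the argument concrete, here is a small user-space model of the
mechanism. It is only a sketch and not the driver or kernel source: it
assumes, as I understand the 2.6 helpers, that netif_poll_disable and
netif_rx_schedule contend for the same "scheduled" bit. Every model_*
name is hypothetical, and the kernel sleeps where this model spins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the RX "scheduled" bit in the device state. */
struct model_dev {
    atomic_flag rx_sched;  /* set: poll is queued, running, or disabled */
    int polls_queued;      /* times the poll routine was put on the list */
};

/* Models netif_rx_schedule() as called from the interrupt handler. */
bool model_rx_schedule(struct model_dev *dev)
{
    if (atomic_flag_test_and_set(&dev->rx_sched))
        return false;      /* bit already held: nothing is queued */
    dev->polls_queued++;
    return true;
}

/* Models the poll routine finishing and releasing the bit. */
void model_rx_complete(struct model_dev *dev)
{
    atomic_flag_clear(&dev->rx_sched);
}

/* Models netif_poll_disable(): wait until we own the bit ourselves. */
void model_poll_disable(struct model_dev *dev)
{
    while (atomic_flag_test_and_set(&dev->rx_sched))
        ;                  /* the kernel would sleep here, not spin */
}

/* Models netif_poll_enable(): give the bit back. */
void model_poll_enable(struct model_dev *dev)
{
    atomic_flag_clear(&dev->rx_sched);
}

/* Models the shape of e100_down(): disable polling, then free the ring. */
void model_down(struct model_dev *dev)
{
    model_poll_disable(dev);
    /* A late interrupt at this point cannot queue the poll routine. */
    assert(!model_rx_schedule(dev));
    /* ... only now would the RX buffers be freed ... */
}
```

In this model, once model_down() has returned, model_rx_schedule()
keeps returning false until model_poll_enable() is called, which is the
property the argument above relies on.
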
>From: Andrew Morton [mailto:akpm@xxxxxxxx]
>Sent: Thursday, May 26, 2005 1:31 PM
>To: Jian Jun He
>Cc: Venkatesan, Ganesh; anton@xxxxxxxxx; rende@xxxxxxxxxx;
>ganesh.venkatesan@xxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx; Brandeburg,
>Jesse; jgarzik@xxxxxxxxx; wangjs@xxxxxxxxxx; Ronciak, John;
>Subject: Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while
>rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
>Jian Jun He <hejianj@xxxxxxxxxx> wrote:
>> 2. Download rhr2-rhel4-1.0-14a.noarch.rpm from rhn.redhat.com and
>> install it on the test machine.
>> 3. Configure and run the rhr test via invoking redhat-ready.
>This is the problematic bit.
>- Please provide a full URL which can be used to obtain rhr.
> rhn.redhat.com is subscription-based.
>- Please describe the hardware setup - surely the test requires at
> least two machines. How are they configured?
>- Provide an exact transcript of the commands which are to be used. Is
> it just
> with no arguments?
>All that being said, we already have a quite specific diagnosis via
>inspection, from Herbert:
>Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx> wrote:
>> Andrew Morton <akpm@xxxxxxxx> wrote:
>> > Might be a bug in the e100 driver, might not be.
>> > I assume this is the
>> > BUG_ON(skb->list != NULL);
>> It certainly is a bug in e100.
>> e100_tx_timeout -> e100_down -> e100_rx_clean_list
>> is racing against
>> e100_poll -> e100_rx_clean -> e100_rx_indicate
>> e100_rx_clean/e100_rx_indicate takes an skb off the RX ring and
>> while it's being processed e100_rx_clean_list comes along and
>> frees it.
>> From a quick check similar problems may exist in other drivers that
>> have lockless ->poll() functions with RX rings.
>Do the e100 maintainers agree with this diagnosis? If so then more
>isn't required at this stage - the next step is to fix the above bug,