netdev


To: "Andrew Morton" <akpm@xxxxxxxx>, "Jian Jun He" <hejianj@xxxxxxxxxx>
Subject: RE: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
From: "Venkatesan, Ganesh" <ganesh.venkatesan@xxxxxxxxx>
Date: Thu, 26 May 2005 13:41:53 -0700
Cc: <anton@xxxxxxxxx>, <rende@xxxxxxxxxx>, <ganesh.venkatesan@xxxxxxxxx>, <herbert@xxxxxxxxxxxxxxxxxxx>, "Brandeburg, Jesse" <jesse.brandeburg@xxxxxxxxx>, <jgarzik@xxxxxxxxx>, <wangjs@xxxxxxxxxx>, "Ronciak, John" <john.ronciak@xxxxxxxxx>, <cdlwangl@xxxxxxxxxx>, <linuxppc64-dev@xxxxxxxxxxxxxxxxxxxxxxxxxx>, <netdev@xxxxxxxxxxx>
Sender: netdev-bounce@xxxxxxxxxxx
Thread-index: AcViMig1KCQsma4eQ5CKxhlgMWr2RQAAJVPw
Thread-topic: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
Andrew:

I have already responded to this analysis. In any case, here it is again:

Later versions of e100 (3.4.8, for instance) include a call to
netif_poll_disable in e100_down. That call is supposed to wait for any
pending poll to finish, and once it returns we are guaranteed that
e100_poll will no longer be called. In addition, if an interrupt does
arrive, our call to netif_rx_schedule() will not add our poll routine to
the poll-list, since polling is disabled. So this race cannot happen in
those versions.
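
The argument above can be illustrated with a small user-space model. This
is only a sketch, not driver or kernel code: it assumes that disabling
poll works by claiming the same "scheduled" flag that the interrupt path
tests before queueing the poll routine. Every name below (model_dev,
model_rx_schedule and so on) is hypothetical and exists only for this
illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a net device; not the kernel's structure. */
struct model_dev {
    atomic_bool rx_sched;     /* true while poll is queued, running, or disabled */
    bool        running;      /* interface is up */
    int         polls_queued; /* times the poll routine was put on the poll-list */
};

static void model_init(struct model_dev *d)
{
    atomic_init(&d->rx_sched, false);
    d->running = true;
    d->polls_queued = 0;
}

/* Interrupt path: queue the poll routine only if we can claim the flag. */
static bool model_rx_schedule(struct model_dev *d)
{
    if (!d->running || atomic_exchange(&d->rx_sched, true))
        return false;         /* already queued, or poll is disabled */
    d->polls_queued++;
    return true;
}

/* The poll routine has finished its work: release the flag. */
static void model_poll_complete(struct model_dev *d)
{
    atomic_store(&d->rx_sched, false);
}

/*
 * Down path: spin until we own the flag ourselves. On return no poll is
 * queued or running, and none can be queued until model_poll_enable().
 * (This model busy-waits; a real implementation would sleep instead.)
 */
static void model_poll_disable(struct model_dev *d)
{
    while (atomic_exchange(&d->rx_sched, true))
        ;
}

/* Release the flag so that interrupts can schedule the poll again. */
static void model_poll_enable(struct model_dev *d)
{
    atomic_store(&d->rx_sched, false);
}
```

In this model, once model_poll_disable() has returned, every call to
model_rx_schedule() returns false until model_poll_enable() is called, so
a teardown step placed between the two cannot overlap with a poll.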

Ganesh.

>-----Original Message-----
>From: Andrew Morton [mailto:akpm@xxxxxxxx]
>Sent: Thursday, May 26, 2005 1:31 PM
>To: Jian Jun He
>Cc: Venkatesan, Ganesh; anton@xxxxxxxxx; rende@xxxxxxxxxx;
>ganesh.venkatesan@xxxxxxxxx; herbert@xxxxxxxxxxxxxxxxxxx; Brandeburg,
>Jesse; jgarzik@xxxxxxxxx; wangjs@xxxxxxxxxx; Ronciak, John;
>cdlwangl@xxxxxxxxxx; linuxppc64-dev@xxxxxxxxxxxxxxxxxxxxxxxxxx;
>netdev@xxxxxxxxxxx
>Subject: Re: Fw: [Bugme-new] [Bug 4628] New: Test server hang while running
>rhr (network) test on RHEL4 with kernel 2.6.12-rc1-mm4
>
>Jian Jun He <hejianj@xxxxxxxxxx> wrote:
>>
>>  2. Download rhr2-rhel4-1.0-14a.noarch.rpm from rhn.redhat.com and install
>>  it on the test machine.
>>  3. Configure and run the rhr test via invoking redhat-ready.
>
>This is the problematic bit.
>
>- Please provide a full URL which can be used to obtain rhr.
>  rhn.redhat.com is subscription-based.
>
>- Please describe the hardware setup - surely the test requires at least
>  two machines.  How are they configured?
>
>- Provide an exact transcript of the commands which are to be used.  Is
>  it just
>
>       redhat-ready
>
>  with no arguments?
>
>
>
>All that being said, we already have a quite specific diagnosis via code
>inspection, from Herbert:
>
>
>Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx> wrote:
>>
>> Andrew Morton <akpm@xxxxxxxx> wrote:
>> >
>> > Might be a bug in the e100 driver, might not be.
>> >
>> > I assume this is the
>> >
>> >        BUG_ON(skb->list != NULL);
>>
>> It certainly is a bug in e100.
>>
>> e100_tx_timeout -> e100_down -> e100_rx_clean_list
>>
>> is racing against
>>
>> e100_poll -> e100_rx_clean -> e100_rx_indicate
>>
>> e100_rx_clean/e100_rx_indicate takes an skb off the RX ring and
>> while it's being processed e100_rx_clean_list comes along and
>> frees it.
>>
>> From a quick check similar problems may exist in other drivers that
>> have lockless ->poll() functions with RX rings.
>
>Do the e100 maintainers agree with this diagnosis?  If so then more testing
>isn't required at this stage - the next step is to fix the above bug, no?


