Re: BUG: soft lockup detected on CPU#5 / xfsdatad

To: Christian Røsnes <christian.rosnes@xxxxxxxxx>
Subject: Re: BUG: soft lockup detected on CPU#5 / xfsdatad
From: David Chinner <dgc@xxxxxxx>
Date: Mon, 10 Apr 2006 11:47:01 +1000
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <443916EC.10809@gmail.com>
References: <443916EC.10809@gmail.com>
Sender: linux-xfs-bounce@xxxxxxxxxxx
User-agent: Mutt/1.4.2.1i
On Sun, Apr 09, 2006 at 04:15:08PM +0200, Christian Røsnes wrote:
> Hi
> 
> I'm not sure if this is XFS specific, but here goes:

Not directly an XFS problem.

__wake_up_bit() turns off interrupts on the cpu it runs on, and if
there are a lot of processes to be woken up, then it can take some
time to run. The buffer bitlock uses the generic waittable hashes,
so this may simply be caused by hash collisions due to lots of
processes sleeping in non-exclusive wait states (e.g.
io_schedule()).  If it's your flush process that triggers this, then
you probably have lots of I/O going on at the time.

I guess this comment in the waittable hash sizing code is relevant:

        /*
         * Once we have dozens or even hundreds of threads sleeping
         * on IO we've got bigger problems than wait queue collision.
         * Limit the size of the wait table to a reasonable size.
         */
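
A toy model of that collision effect, as a rough sketch only: this is
not the kernel's code. The function names, the use of Python's hash()
and the table sizes are all assumptions made for illustration. It
counts how many queue entries a single wake-up has to walk when many
waiters share a small hashed wait table.

```python
from collections import defaultdict

def bucket_of(key, table_size):
    # Stand-in for the kernel's hash of (address, bit).
    # Python's hash() is NOT the kernel's hash function.
    return hash(key) % table_size

def entries_walked_on_wake(waiters, table_size, woken_key):
    """Count the queue entries a wake-up for woken_key has to walk."""
    buckets = defaultdict(list)
    for key in waiters:
        buckets[bucket_of(key, table_size)].append(key)
    # Every waiter in the bucket is visited, even those waiting on
    # a different key that merely collided into the same bucket.
    return len(buckets[bucket_of(woken_key, table_size)])
```

With 1000 waiters on distinct keys and only 8 buckets, a wake-up for
one key walks a queue of 125 entries in this model. All of that walk
happens with interrupts off in the real code path described above.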

FWIW:

> =====================================================
> Running: cat /proc/interrupts
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   0:        226          0          0          0          0  277717661          0          0    IO-APIC-edge  timer
>   1:          0          0          0          0          0          9          0          0    IO-APIC-edge  i8042
>   9:          0          0          0          0          0          0          0          0   IO-APIC-level  acpi
>  14:          0          0          0          0          0         13          0          0    IO-APIC-edge  ide0
> 169:          0          0          0          0          0    5649134          0          0   IO-APIC-level  uhci_hcd:usb1, qla2400
> 185:          0          0          0          0          0  545423724          0          0   IO-APIC-level  eth0
> 201:          0          0          0          0          0  215363073          0          0   IO-APIC-level  megaraid
> 209:          0          0          0          0          0         63          0          0   IO-APIC-level  ide2, ehci_hcd:usb4
> 217:          0          0          0          0          0          0          0          0   IO-APIC-level  uhci_hcd:usb2
> 225:          0          0          0          0          0          0          0          0   IO-APIC-level  uhci_hcd:usb3
> NMI:      62452     249384      63225      62699      62361     249362      63203      62677
> LOC:  277687412  277301786  277700824  277701522  277687319  277301693  277700731  277701429
> ERR:          7
> MIS:          0

CPU 5 is doing all your interrupt work for timers, USB, Ethernet and
your RAID. You might want to try to spread these interrupts to
different CPUs to reduce the interrupt load on this one CPU. That
may improve the situation.
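
As a rough sketch of how to check the distribution and work out a new
affinity value: the helper names below (per_cpu_totals, affinity_mask)
are made up for illustration and assume input in the same layout as
the /proc/interrupts output quoted above.

```python
def per_cpu_totals(text):
    """Sum numbered-IRQ counts per CPU from /proc/interrupts-style text."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()              # ['CPU0', 'CPU1', ...]
    totals = [0] * len(cpus)
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():  # skip NMI, LOC, ERR, MIS
            continue
        counts = rest.split()[:len(cpus)]
        for cpu, value in enumerate(counts):
            totals[cpu] += int(value)
    return totals

def affinity_mask(cpu):
    """Hex bitmask with only `cpu` set, the format smp_affinity takes."""
    return format(1 << cpu, "x")
```

The mask is what gets written (as root) to /proc/irq/<irq>/smp_affinity,
so affinity_mask(0) gives "1", and writing that to
/proc/irq/185/smp_affinity would ask for eth0 to be serviced by CPU 0.
Whether the request is honoured depends on the kernel and the
interrupt controller, and a running irqbalance daemon may change the
setting again.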

Cheers,

Dave.
-- 
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

