On Sun, Apr 09, 2006 at 04:15:08PM +0200, Christian Røsnes wrote:
> Hi
>
> I'm not sure if this is XFS specific, but here goes:
Not directly an XFS problem.
__wake_up_bit() turns off interrupts on the cpu it runs on, and if
there are a lot of processes to be woken up, then it can take some
time to run. The buffer bitlock uses the generic waittable hashes,
so this may simply be caused by hash collisions due to lots of
processes sleeping in non-exclusive wait states (e.g.
io_schedule()). If it is your flush process that triggers this, then
you probably have lots of I/O going on at the time.
I guess this comment in the waittable hash sizing code is relevant:
/*
 * Once we have dozens or even hundreds of threads sleeping
 * on IO we've got bigger problems than wait queue collision.
 * Limit the size of the wait table to a reasonable size.
 */
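To illustrate the collision effect described above, here is a rough
sketch in Python. It is not the kernel code: the table size, the hash
function and all of the names are made up for illustration. The only
point is that unrelated waiters sharing a bucket all have to be walked
on a wakeup, even if just one of them is actually woken.

```python
from collections import defaultdict

TABLE_SIZE = 8  # hypothetical; the real table is sized by the kernel


def bucket_for(key: int) -> int:
    # Stand-in hash for illustration, not the kernel's hash function.
    return key % TABLE_SIZE


def build_table(waiters):
    """waiters: iterable of (task_name, key) pairs sleeping on a bit."""
    table = defaultdict(list)
    for task, key in waiters:
        table[bucket_for(key)].append((task, key))
    return table


def wake_up_bit(table, key):
    """Walk the whole bucket; wake only the waiters whose key matches.

    Returns (woken_tasks, entries_scanned). A long bucket means a long
    walk, and per the description above that walk runs with interrupts
    off on the waking CPU.
    """
    bucket = table[bucket_for(key)]
    woken = [task for task, k in bucket if k == key]
    return woken, len(bucket)


# 100 tasks sleeping on 100 different keys share only 8 buckets.
waiters = [(f"task{i}", i) for i in range(100)]
table = build_table(waiters)
woken, scanned = wake_up_bit(table, 42)
print(woken, scanned)
```

With these made-up numbers, one wakeup has to scan every waiter that
happens to hash to the same bucket, not only the task it wakes.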
FWIW:
> =====================================================
> Running: cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5
> CPU6 CPU7
> 0: 226 0 0 0 0 277717661
> 0 0 IO-APIC-edge timer
> 1: 0 0 0 0 0 9
> 0 0 IO-APIC-edge i8042
> 9: 0 0 0 0 0 0
> 0 0 IO-APIC-level acpi
> 14: 0 0 0 0 0 13
> 0 0 IO-APIC-edge ide0
> 169: 0 0 0 0 0 5649134
> 0 0 IO-APIC-level uhci_hcd:usb1, qla2400
> 185: 0 0 0 0 0 545423724
> 0 0 IO-APIC-level eth0
> 201: 0 0 0 0 0 215363073
> 0 0 IO-APIC-level megaraid
> 209: 0 0 0 0 0 63
> 0 0 IO-APIC-level ide2, ehci_hcd:usb4
> 217: 0 0 0 0 0 0
> 0 0 IO-APIC-level uhci_hcd:usb2
> 225: 0 0 0 0 0 0
> 0 0 IO-APIC-level uhci_hcd:usb3
> NMI: 62452 249384 63225 62699 62361 249362
> 63203 62677
> LOC: 277687412 277301786 277700824 277701522 277687319 277301693
> 277700731 277701429
> ERR: 7
> MIS: 0
CPU 5 is doing all your interrupt work for timers, USB, Ethernet and
your RAID. You might want to try to spread these interrupts across
different CPUs to reduce the interrupt load on this one CPU. That
may improve the situation.
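For reference, a small Python sketch of how the masks could be
computed. On Linux, IRQ affinity is normally set by writing a hex CPU
bitmask to /proc/irq/<irq>/smp_affinity as root. The IRQ-to-CPU
assignment below is only an example picked from the table above, not a
recommendation, and the script just prints the commands rather than
running them.

```python
def cpu_mask(*cpus: int) -> str:
    """Return the hex bitmask (no 0x prefix) selecting the given CPUs."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


# Example assignment only (an assumption): IRQ numbers are taken from
# the /proc/interrupts output above, target CPUs are chosen arbitrarily.
plan = {185: 1, 201: 2, 169: 3}  # eth0, megaraid, qla2400


def affinity_commands(plan):
    """Build one 'echo MASK > /proc/irq/N/smp_affinity' line per IRQ."""
    return [
        f"echo {cpu_mask(cpu)} > /proc/irq/{irq}/smp_affinity"
        for irq, cpu in sorted(plan.items())
    ]


for cmd in affinity_commands(plan):
    print(cmd)
```

Check /proc/interrupts again afterwards to confirm the counts are
actually moving to the CPUs you picked.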
Cheers,
Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group