On Mon, Dec 18, 2006 at 12:56:41AM +0100, Haar János wrote:
> > On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar János wrote:
> > > I don't know whether there is a connection between the 2 messages, but I
> > > can see that the spinlock bug always occurs on CPU #3.
> > >
> > > Does anybody have any idea?
> > Your disk interrupts are directed to CPU 3, and so log I/O completion
> > occurs on that CPU.
>            CPU0       CPU1       CPU2       CPU3
>   0:        100          0          0    4583704   IO-APIC-edge     timer
>   1:          0          0          0          2   IO-APIC-edge     i8042
>   4:          0          0          0    3878668   IO-APIC-edge     serial
>  14:    3072118          0          0        181   IO-APIC-edge     ide0
>  52:          0          0          0  213052723   IO-APIC-fasteoi  eth1
>  53:          0          0          0   91913759   IO-APIC-fasteoi  eth2
> 100:          0          0          0   16776910   IO-APIC-fasteoi  eth0
> I have 3 XFS filesystems on this system, on 3 sources:
> 1. 200G, one IDE hdd.
> 2. 2x200G mirror on 1 IDE + 1 SATA hdd.
> 3. 4x3.3TB stripe on NBD.
> The NBD serves through eth1, and it is on the CPU3, but the ide0 is on the
I'd say your NBD based XFS filesystem is having trouble.
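A quick way to read a table like the one quoted above is to pick, for each IRQ, the CPU with the largest count. The following is only a minimal sketch: the helper name `busiest_cpu_per_irq` is hypothetical, and `SAMPLE` is just the figures quoted above pasted in as text, not a fresh capture from the machine.

```python
# Sketch: summarise which CPU services each interrupt line.
# SAMPLE holds the counts quoted in this thread; on a live system the
# same text could be read from /proc/interrupts instead.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  0:        100          0          0    4583704   IO-APIC-edge     timer
  1:          0          0          0          2   IO-APIC-edge     i8042
  4:          0          0          0    3878668   IO-APIC-edge     serial
 14:    3072118          0          0        181   IO-APIC-edge     ide0
 52:          0          0          0  213052723   IO-APIC-fasteoi  eth1
 53:          0          0          0   91913759   IO-APIC-fasteoi  eth2
100:          0          0          0   16776910   IO-APIC-fasteoi  eth0
"""


def busiest_cpu_per_irq(text):
    """Map IRQ label -> (device name, busiest CPU, count on that CPU)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()  # header row, e.g. ['CPU0', 'CPU1', ...]
    result = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = fields[:len(cpus)]
        # Skip rows that do not carry one numeric counter per CPU.
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        counts = [int(c) for c in counts]
        top = max(range(len(cpus)), key=counts.__getitem__)
        name = fields[-1] if len(fields) > len(cpus) else ""
        result[label.strip()] = (name, cpus[top], counts[top])
    return result


for irq, (name, cpu, count) in busiest_cpu_per_irq(SAMPLE).items():
    print(f"IRQ {irq:>3} ({name}): mostly {cpu}, {count} interrupts")
```

On the quoted figures this reports eth1 (the NBD link) on CPU3 and ide0 on CPU0.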
> > Are you using XFS on a NBD?
> Yes, on the 3rd source.
Ok, I've never heard of a problem like this before and you are doing
something that very few ppl are doing (i.e. XFS on NBD). Hence I'd
start by suspecting a bug in the NBD driver.
> > > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0 EFLAGS: 00010002
> > > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
> > ^^^^^^^^^^^^^^^^
> > Anyone recognise that pattern?
> I think I have one idea.
> This issue can sometimes stop the 5sec automatic restart on crash, and this
> shows possible memory corruption, and if the bug occurs in the IRQ
> handling.... :-)
> I have a lot of logs about this issue, and RAX and RBX are always the same.
And is this the only place where you see the problem? Or are there
other stack traces that you see this in as well?
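When going through many oops logs, a register holding one byte repeated across the whole word (like the RBX value quoted above) can be picked out mechanically. A minimal sketch follows; the helper name `repeated_byte` is hypothetical. `POISON_FREE` (0x6b) is the value the kernel's slab debugging writes over freed objects, so a match is a hint of use-after-free and not proof of it.

```python
# Sketch: detect a register value made of a single repeated byte.
POISON_FREE = 0x6B  # slab use-after-free poison byte (include/linux/poison.h)


def repeated_byte(value, width=8):
    """Return the byte if `value` is that byte repeated `width` times, else None."""
    raw = value.to_bytes(width, "big")
    return raw[0] if len(set(raw)) == 1 else None


rbx = 0x6B6B6B6B6B6B6B6B  # RBX from the quoted oops
rax = 0x0000000000000033  # RAX from the quoted oops

for name, reg in (("RAX", rax), ("RBX", rbx)):
    byte = repeated_byte(reg)
    if byte == POISON_FREE:
        print(f"{name} looks like slab free poison (0x{byte:02x} repeated)")
    else:
        print(f"{name} is not a poison pattern")
```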
> > This implies a spinlock inside a wait_queue_head_t is corrupt.
> > What type of system do you have, and what sort of
> > workload are you running?
> OS: Fedora 5, 64bit.
> HW: dual xeon, with HT, ram 4GB.
> (the min_free_kbytes limit is set to 128000, because sometimes the e1000
> driver runs out of the reserved memory during irq handling.)
That does not sound good. What happens when it does run out of memory?
Is that when you start to see the above corruptions?
SGI Australian Software Group