On Thu, Sep 08, 2011 at 05:13:05PM +0200, Lars Ellenberg wrote:
> Sorry for double posting on drbd-dev, I managed to strip the other lists from
> > We upgraded from 2.6.36 which seemed to have a page leak (file pages left
> > on the LRU) and so would eventually perform very poorly. 2.6.37 and
> > 2.6.38 seemed to have some unix socket issue that caused heartbeat to
> > wedge. Shall we enable lock debugging or something here?
> That could help us understand that stack trace.
> It looks like cpu 1 blocks in
> > [ 1532.427149] [<ffffffff8103d512>] ? try_to_wake_up+0xc2/0x270
> > [ 1532.427149] <<EOE>> <IRQ> [<ffffffff8103d6cd>] default_wake_function+0xd/0x10
> Which does not make sense to me at all.
Well, good news, I think. I believe this may be related to
"PCI: Set PCI-E Max Payload Size on fabric", added by b03e7495a862b02829.
3.1-rc5 is running now with a patch to basically disable those changes,
and has been stable for 12 hours. Previously it usually hung within a few
minutes.
The XFS people say it was very likely not 58d84c4ee0389ddeb86238d5, which
is the only other thing that changed between these versions that seems to
be at all in the hang path.
Also, when the thing hangs, it stops pinging immediately, and with the
PCI-E max payload thing active, the device that raises a bus error is
actually the PCI-E to PCI-X bridge chip used to support the BCM5708 NICs,
so that all seems related.
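
For anyone who wants to compare payload settings along the path to the NICs,
here is a rough sketch of how the Max Payload Size fields decode. The bit
positions and the 128 << n encoding come from the PCI Express spec (Device
Capabilities bits 2:0, Device Control bits 7:5). The function names and the
sample register values are made up for illustration, and the idea that a
sender programmed above what a receiver supports is the failure here is only
an assumption, not something established in this thread.

```python
# Field layout per the PCI Express Base Specification:
#   Device Capabilities, bits 2:0 = Max_Payload_Size Supported
#   Device Control,      bits 7:5 = Max_Payload_Size (currently programmed)
# Both fields share one encoding: 0 -> 128 bytes, 1 -> 256, ..., 5 -> 4096.
# Codes 6 and 7 are reserved.

def _mps_bytes(code):
    """Translate a 3-bit MPS code into a payload size in bytes."""
    if code > 5:
        raise ValueError("reserved Max_Payload_Size encoding: %d" % code)
    return 128 << code


def mps_supported(devcap):
    """Largest payload the device can handle, from Device Capabilities."""
    return _mps_bytes(devcap & 0x7)


def mps_programmed(devctl):
    """Payload size currently programmed, from Device Control."""
    return _mps_bytes((devctl >> 5) & 0x7)


def sender_exceeds_receiver(sender_devctl, receiver_devcap):
    """True if the sender may emit payloads larger than the receiver supports.

    Hypothetical helper: flags the kind of mismatch one might look for
    between an endpoint and a bridge on its path.
    """
    return mps_programmed(sender_devctl) > mps_supported(receiver_devcap)


if __name__ == "__main__":
    # Illustrative register values only, not read from real hardware.
    print(mps_programmed(0x0020))                   # code 1
    print(mps_supported(0x0000))                    # code 0
    print(sender_exceeds_receiver(0x0020, 0x0000))
```

The raw register values would have to come from the devices' PCI Express
capability structure in config space; the sketch only does the decoding.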