I have a rather interesting old box, a 6 x PPro ALR beast with two primary
PCI buses and a truly immense number of disk bays. At the moment it's got
Linux 2.4.18 with XFS 1.1 on it; the disks are four IDE disks in a software
RAID and five IBM SCSI disks on an Adaptec 3210S I2O RAID controller. The
only other thing of note in the machine is a Netgear GA-620 (Tigon-II)
ethernet adapter.

I have, unfortunately, discovered that though I can't provoke the problem
any other way, if I copy many large files onto the box using Samba, I get
an almost instant hard hang. No network I/O, no keyboard input; I can't
even drop to the debugger.

I figured the problem was with the software RAID (even though I'm using an
external log on a partition of the Adaptec's "disk") but copying onto the
Adaptec RAID volume, as it turns out, has the same issue. So, I assume
the likely culprit is a locking botch somewhere in the acenic or dpti
drivers or in XFS. I assume everyone in the world would know if acenic
or dpti were broken (they have many more users than XFS, I've got to guess)
so I tend to blame XFS...

I note that Linux spinlocks seem to use cli/sti to disable all interrupts
so a locking botch does seem like a likely cause of a total, irrecoverable
hang. That leaves me with little or no idea how to debug this, but I'd be
glad to give it a shot if someone could make suggestions. I work at a
router vendor that ships a Linux-based product so I can handle the kernel
debugger fairly well; I just don't know where to start with this kind of
problem, since we ship only uniprocessor machines and locking issues aren't
exactly common. :-)

I'd be perfectly willing to arrange login or serial console access to the
box for anyone from SGI who cared to look at this; or just let me know
what you want me to look at and I'll be glad to report back.

Thor