I have a box running XFS over md (RAID5) on Fedora Core 5 with a
2.6.17-1 kernel. The box contains 16x750GB SATA drives combined into a
single 11TB RAID5 partition using md, and this partition contains a
single XFS filesystem.
I can consistently crash the box within about ten minutes with a simple
Perl script that spawns 25 processes, each of which loops writing random
files to the filesystem.
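
For anyone wanting to generate a similar load, here is a rough Python
sketch of the same idea. It is not the actual Perl script: the function
names (write_random_files, run_stress), the file naming and the size
limit are all assumptions made for illustration, and each worker is
bounded to a fixed number of files, whereas the original workers simply
keep looping.

```python
import os


def write_random_files(target_dir, n_files, max_bytes):
    """Write n_files files of random size and content into target_dir."""
    for i in range(n_files):
        # Random size between 1 and max_bytes (hypothetical limit).
        size = int.from_bytes(os.urandom(4), "big") % max_bytes + 1
        # The pid in the name keeps concurrent writers from colliding.
        path = os.path.join(target_dir, "stress-%d-%d.dat" % (os.getpid(), i))
        with open(path, "wb") as fh:
            fh.write(os.urandom(size))


def run_stress(target_dir, n_procs=25, n_files=100, max_bytes=1 << 20):
    """Fork n_procs writers and wait for them; return how many failed."""
    pids = []
    for _ in range(n_procs):
        pid = os.fork()
        if pid == 0:
            # Child: do the writes, then exit without returning to the caller.
            status = 1
            try:
                write_random_files(target_dir, n_files, max_bytes)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)

    failures = 0
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if status != 0:
            failures += 1
    return failures
```

Calling run_stress() with the mount point of the filesystem under test
as target_dir starts 25 concurrent writers; raising n_procs or n_files
increases the pressure.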
The only message I get on the console is something like this:
do_IRQ: stack overflow: 492
<c0406460>
Once crashed, the box requires a hard reboot to rescue it (and needs to
resync the RAID array).
As the box is to be used as a production upload fileserver receiving
several hundred simultaneous uploads, I would most likely see this
problem often.
So, questions:
1. How much is known about this problem? Seeing as it is 100% reproducible,
is there any active development underway to fix it?
2. I have seen postings that say compiling a kernel with 8K stacks will
fix the problem. Is this the case? Or will I be able to trigger it again
by running 100 or 200 simultaneous writes?
3. Any suggestions as to what I should try? At present it looks like I
am stuck between finding a fix for XFS and splitting the box into 2 or 3
EXT3 partitions (which I really don't want to do). I have tried ReiserFS
(max FS size is 8TB even though the FAQ says 16), and JFS (jfs_fsck
segfaults, which doesn't fill me with confidence).
Many thanks for any suggestions,
Chris Allen.