Andrew Morton wrote:
> This person's stack overrun can be seen at:
> http://www.icglink.com/cluster-debug-info.html
> (search for dm_table_unplug_all+0x41/0x43)
> It seems mainly to be due to XFS. Is there anything we can do about that?
His summary suggests it is not just XFS he sees it on:
  "Kernel oops during heavy I/O on Software RAID device with LVM and XFS (or JFS or reiser)"
I have always thought 4K stacks were going to bring things like this
out of the woodwork; he could have been doing his I/O through NFS on
top of this too. Once you start building complex I/O systems with
layers of drivers and filesystems, 4K starts to look very small.
In general, if you give people lego bricks they will eventually
build something very tall out of them.
Try NFS -> journalled filesystem -> LVM/MD -> Fiber Channel (throw
in multiple paths to the devices here as well).
And yes I know the standard argument about interrupts could kill
you anyway in the 2.4 kernel.
A quick summary of what is happening in that stack is:
Write call wants to allocate pages
pages get flushed into xfs
xfs needs to allocate an extent for the pages
In order to allocate the extent, xfs needs to read some metadata
Boom
I am sure there is some dieting that could be done in some xfs functions
to prune a few bytes here and there, which will stave off the inevitable until
someone else decides to insert an encryption layer into the stack somewhere.
Moving the actual flushing of the dirty pages off to another thread so there
is more stack to play with is another option.
Steve