On Tue, Apr 13, 2004 at 01:32:36PM -0500, Michael E. Thomadakis wrote:
> I've also noticed that the FC adapter driver threads are running at
> 100% CPU utilization, when they are pumping data to the RAID for
> long time. Is there any data copy taking place at the drivers? The
> HBAs are from QLogic.
I would bitch to your SGI support channel about this. Off the top of
your head, do you have any idea where in the driver the cycles are
being spent?
> A more disturbing issue is that the system does NOT clean up the
> file cache and eventually all memory gets occupied by FS pages. Then
> the system simply hangs.
Are you sure about this? How can you tell? I guess it's redundant to
mention that Linux will cache all FS pages it can and release them as
required, but not before then. A umount will force them to be
released, however.
Are you perhaps seeing bad slab behaviour (unbounded growth with weak
pressure to shrink it) instead, I wonder? Looking at /proc/slabinfo
will give you some idea of how much slab is being used and by what.
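If you want something a little more digestible than the raw file, a
quick C hack along these lines is roughly what I have in mind. The
column layout of /proc/slabinfo varies between kernel versions, so the
parse here (name, active objects, total objects, object size) is an
assumption you may need to adjust:

/* Rough sketch: summarise /proc/slabinfo by estimated memory per cache.
 * Assumes the "name active-objs total-objs objsize ..." column layout,
 * so treat the numbers as a hint, not an accounting. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/slabinfo", "r");
    char line[256], name[64];
    unsigned long active, total, objsize;

    if (!f) { perror("/proc/slabinfo"); return 1; }

    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%63s %lu %lu %lu",
                   name, &active, &total, &objsize) != 4)
            continue;                   /* header or malformed line */
        unsigned long kb = total * objsize / 1024;
        if (kb >= 1024)                 /* only show caches over 1 MB */
            printf("%-24s %8lu KB (%lu/%lu objs, %lu bytes each)\n",
                   name, kb, active, total, objsize);
    }
    fclose(f);
    return 0;
}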
> One of our main objectives at our center is to maximize file
> throughput for our systems. We are a medium-size Supercomputing
> Center where compute- and I/O-intensive numerical computation code
> runs in batch sub-systems. Several programs often expect and
> generate very large files, on the order of 10-70GB.
Are these C or Fortran programs? SGI has a Fortran library that's
supposed to do good stuff for file I/O using O_DIRECT and other
smarts. I forget what it's called, but the SGI support people should
know.
> Another common problem is the competition between file cache and
> computation pages. We definitely do NOT want file cache pages being
> kept in memory while computation pages are reclaimed.
Known problem that's especially bad with 2.4.x; it's even apparent
when doing a backup on a live system, as that will cause swapping.
It's actually a VM balancing problem and not specific to XFS.
There are a couple of ways you can 'hack' around this that I can
think of right now: either mlock your application's pages or use
O_DIRECT I/O in your applications. For the latter you want to be a
little clever and do write-behind and read-ahead sort of stuff in
another thread, or just use really large I/O sizes and assume that's
good enough.
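For the O_DIRECT route, the sketch below is roughly what I mean. The
file name, the 4 MiB request size and the page alignment are just
values I picked for illustration, not something tuned for your HBAs
or RAID:

/* Minimal O_DIRECT sketch: large, aligned writes that bypass the page
 * cache entirely, so streaming data never competes with computation
 * pages for memory. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define IO_SIZE (4UL << 20)             /* 4 MiB per request */
#define NREQS   16                      /* 64 MiB total for the demo */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "bigfile.dat";
    void *buf;
    int fd, i;

    /* O_DIRECT buffers must be aligned; page alignment is safe. */
    if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), IO_SIZE)) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0xab, IO_SIZE);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }

    for (i = 0; i < NREQS; i++) {
        if (write(fd, buf, IO_SIZE) != (ssize_t)IO_SIZE) {
            perror("write");
            return 1;
        }
    }
    close(fd);
    free(buf);
    return 0;
}

The mlock route is simpler still: a single
mlockall(MCL_CURRENT | MCL_FUTURE) at startup pins the application's
pages, at the cost of making them unswappable for its whole lifetime.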
Also, do current Propack 2.4.x kernels use rmap?
> 1. Set an upper bound on the number of memory pages ever caching FS
> blocks.
Presently not possible. Discussed a few times with various vm people
but nothing ever came of it as far as I know.
> 2. Control the amount of data flushed out to disk in set time
> periods; that is we need to be able to match the long term flushing
> rate with the service rate that the I/O subsystem is capable of
> delivering, tolerating possible transient spikes. We also need to be
> able to control the amount of read-ahead, write behind or even hint
> that data are only being streamed through, never to be reused again.
I think with some care you should be able to tune that a little
better.
> 3. Specify different parameters for 2., above, per file system: we
> have file systems that are meant to transfer wide stripes of
> sequential data, vs. file systems that need to perform well with
> smaller block, random I/O, vs. ones that need to provide access to
> numerous smaller files.
You can tune the fs parameters to some extent, which may help here.
You might also want to look at using a real-time subvolume if you
have lots of streaming data (again, this implies O_DIRECT).
> Also, cache percentages per file system would be useful.
That's starting to sound pretty complex to manage and tune.
> 4. Specify, if all else fails, what parts of the FS cache should be
> flushed in the near future.
Does madvise suffice? Actually, I'm not sure that it will; I'd have
to check how much of it is actually implemented in a useful way, but
I recall noise about it not being very useful at one point.
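If it is usable on your kernels, I'd expect the call sequence to look
roughly like the sketch below. The path and the whole-file range are
made up, and note the msync first, since MADV_DONTNEED makes no
promise about writing dirty pages back:

/* Sketch: hint to the VM that a file's cached pages can go away soon.
 * A real application would advise only the region it has finished
 * with, not the whole file. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/scratch/output.dat";      /* placeholder */
    struct stat st;
    int fd = open(path, O_RDWR);

    if (fd < 0 || fstat(fd, &st) < 0) { perror(path); return 1; }

    void *map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... application reads/writes the mapping here ... */

    /* Push dirty pages out first, then tell the VM we're done with them. */
    if (msync(map, st.st_size, MS_SYNC) < 0)
        perror("msync");
    if (madvise(map, st.st_size, MADV_DONTNEED) < 0)
        perror("madvise");

    munmap(map, st.st_size);
    close(fd);
    return 0;
}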
> 5. Provide in-depth technical documentation on the internal workings
> of the file system cache, its interaction with the VM and the
> interaction of XFS/LVM with the VM.
This is starting to sound really complicated. The page-cache
semantics are pretty clear, but when it comes to interactions with
slab and slab pressure it gets a little more muddy and I'm not sure.
There is also an XFS-specific buffer layer for metadata.
> 6. We do operate IRIX Origins and IBM Regatta SMPs where all these
> issues have been addressed to a far more satisfying degree than on
> Linux. Is the IRIX file system cache going to be ported to ALTIX
> Linux?
I seriously doubt such a thing is possible in any reasonable time
frame. Or desirable.
I'm almost tempted to suggest you try mainline 2.6.x and see if that
behaves any better. Normally I would expect Propack's XFS performance
to be much better than 2.6.x but I wonder if you're not hitting 2.4.x
VM suckage.
--cw