Hi there,
Recently I observed some annoying errors:
"Corruption of in-memory data detected" along with page allocation failures in
other applications (smbd or nfsd).
Basically I'm running SUSE 9.1 (2.6.5-7.104-smp) on a dual Athlon with 4 GB
of RAM and "some" storage devices attached. A Gbit network interface is also
part of the system, and all filesystems run on an encrypted loop device.
Well, fortunately it is currently only partially in production, so I
had the chance to do some investigation.
In every case the failures occurred under load from the file server application.
Therefore I started observing the MemFree state of the machine and found that
it goes down (as expected) to the limit defined in vm.min_free_kbytes, which is
in this case set to 1914.
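
For anyone who wants to reproduce the observation, here is a minimal sketch of
how MemFree can be compared against vm.min_free_kbytes. It assumes Python 3 and
the usual "Name:  value kB" layout of /proc/meminfo; the function names
(parse_meminfo, free_headroom_kb, watch) are made up for illustration and this
is not the exact procedure I used.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def free_headroom_kb(meminfo_text, min_free_kb):
    """Return how many kB MemFree currently sits above the min_free limit."""
    return parse_meminfo(meminfo_text)["MemFree"] - min_free_kb


def watch(samples=5, interval=1.0):
    """Print the headroom a few times; reads the live /proc files (Linux only)."""
    with open("/proc/sys/vm/min_free_kbytes") as f:
        min_free_kb = int(f.read())
    for _ in range(samples):
        with open("/proc/meminfo") as f:
            headroom = free_headroom_kb(f.read(), min_free_kb)
        print(f"MemFree headroom over vm.min_free_kbytes: {headroom} kB")
        time.sleep(interval)
```

Calling watch() while the file server is under load should show the headroom
shrinking towards zero shortly before the allocation failures appear, if the
behaviour matches what I saw.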
Every failure started with a "page allocation failure", mostly from smbd. This
process was dead afterwards. This was then followed by in-memory data corruption
reported by xfs.
Finally (after a handful of tests) this resulted in a 50% data loss on a
completely garbled 1.8 TB xfs partition.
After some further investigation the following behaviour was observed:
- changing the eth interface speed from 1 Gbit to 100 Mbit: the errors occurred
less often
- changing the sync behaviour (strict sync, etc.) in smbd: the errors occurred
less often
- Nevertheless there was no clear picture under what circumstances these errors
can occur
- As mentioned above, vm.min_free_kbytes is set to about 2 MB (default SUSE
or 2.6 setting), so the idea was to raise this to a higher value to give the
system a little more space for its buffer handling. And it has worked so far:
after setting this value to about 20 MB, all problems were gone, even under
full load up to the system's limit.
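
As a rough sketch of the change described in the last item: the value is given
in kB, so "about 20 MB" corresponds to roughly 20480 (the exact figure is an
assumption, I only state "about 20M" above). The helper names below are
hypothetical; writing to /proc/sys/vm/min_free_kbytes needs root and only lasts
until the next reboot, while the sysctl.conf-style line would be needed to make
it persistent.

```python
def mb_to_kbytes(megabytes):
    """Convert a size in MB to the kB unit used by vm.min_free_kbytes."""
    return int(megabytes * 1024)


def sysctl_conf_line(kbytes):
    """Return a sysctl.conf-style line for the given min_free value in kB."""
    return f"vm.min_free_kbytes = {kbytes}"


def set_min_free_kbytes(kbytes, path="/proc/sys/vm/min_free_kbytes"):
    """Write the new limit to the running kernel (requires root privileges)."""
    with open(path, "w") as f:
        f.write(f"{kbytes}\n")


# Example values only: 20 MB expressed in kB, as a config line.
new_limit = mb_to_kbytes(20)
print(sysctl_conf_line(new_limit))
```

set_min_free_kbytes(new_limit) is deliberately not called here, since it
changes the running system.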
The behaviour I observed after setting vm.min_free_kbytes to 20 MB was that some
processes (for the most part smbd, xfs) were allocating memory really quickly.
And it seems (unfortunately I have no proof of it) that there can be a race
condition when concurrent applications allocate memory buffers. I can't state
clearly yet where the problem originally comes from (kernel, samba, xfs or
Intel e100/e1000 driver), but the ones mentioned before are my favorites.
The questions I have now are:
- Is there a "known" problem with xfs and memory allocation?
- Even if this bug does not originate in xfs itself, is there, or can there
be, a mechanism to avoid damage to the filesystems in case another app goes
"wild"?
- Can this issue be used by an attacker to damage a system?
- Is there a table or list stating some basic (known to be good or best choice)
(kernel|fs|application) parameters for given filesystem (sizes)?
If useful, I can provide more info and do some more tests. I can do
this until the end of August, then the machine will go into production.
Ciao
Andi
Andreas Hümmer
IT-Service
Mobile: +49 (0) 1 60.90 53 02 04
Mailto:andreas.huemmer@xxxxxxxxx
_____________________________________
ELAXY GmbH
Spitalgasse 23
D - 96450 Coburg
Phone: +49 (0) 95 61.5543.0
FAX: +49 (0) 95 61.5543.344
http://www.elaxy.com
_____________________________________