On 11/28/2011 10:26 AM, Jon Marshall wrote:
> Hi Joe,
> Thanks for the rapid response.
> Is this something that has been reported often in relation to AoE? Is
We've experienced it in the past when we supported our customers with
Coraid gear. Most of that is gone now, so we haven't seen much AoE
stuff as of late (last 2 years or so).
This said, the AoE stack depends critically upon the network stack, and
between AoE and the network stack (or possibly something else), you ran
out of memory for use in the kernel. In our experience the cause is
usually a leaky network driver. The e1000 and similar Intel drivers
shipped by default with RHEL 5/CentOS 5 are highly problematic. AoE
itself could be leaking (early versions were pretty bad in this regard;
I haven't looked at the driver in the last few years, so hopefully they
have improved it).
The xfs connection to this (to stay relevant to this group) is that xfs
is ok atop this, as long as the other layers don't go away. If you can
detect problems like this in advance, you might be able to issue an
xfs_freeze, and preserve the integrity of the underlying filesystem
(obviating the need for an xfs_repair). The hard part would be an
accurate prediction, but if your drivers are grabbing memory and not
releasing it, or you have a runaway memory-consuming process, yeah,
you could potentially predict the onset.
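
A rough sketch of what such a prediction could look like, in Python. It samples /proc/meminfo, treats a steadily growing Slab value as a sign of a kernel-side leak, and then runs xfs_freeze -f on the mount point. The function names, the choice of the Slab field, the 1024 kB threshold, and the window and interval values are all assumptions for illustration, not tuned numbers, and a box that is already starved of memory may not be able to run this at all.

```python
import subprocess
import time


def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo-style text (values mostly in kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def looks_like_leak(samples, field="Slab", min_growth_kb=1024):
    """True if `field` rose in every successive sample and grew by at
    least min_growth_kb overall. Both defaults are hypothetical."""
    series = [s[field] for s in samples if field in s]
    if len(series) < 3:
        return False
    rising = all(b > a for a, b in zip(series, series[1:]))
    return rising and (series[-1] - series[0]) >= min_growth_kb


def freeze(mountpoint):
    """Suspend writes to the filesystem; 'xfs_freeze -u' thaws it again."""
    subprocess.run(["xfs_freeze", "-f", mountpoint], check=True)


def watch(mountpoint, interval_s=60, window=10):
    """Poll /proc/meminfo and freeze once the last `window` samples look leaky."""
    samples = []
    while True:
        with open("/proc/meminfo") as f:
            samples.append(parse_meminfo(f.read()))
        samples = samples[-window:]
        if len(samples) == window and looks_like_leak(samples):
            freeze(mountpoint)
            return
        time.sleep(interval_s)
```

Calling watch() with the XFS mount point (as root) would start the loop. Remember that a frozen filesystem blocks writers until someone runs xfs_freeze -u on it.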
> there any chance you could point us in the direction of some more
> background on the issue? I am checking the AoE mailing list, but if you know
> of something specific that would be very helpful.
Not really, we aren't doing much with AoE anymore. This may or may not
be an AoE issue per se. Likely AoE crashed, and the reason for the
crash is very probably the same reason that xfs crashed: it ran out of
memory. If AoE is the culprit, you might find some sort of imprint of
this in the logs, though in our experience it has usually been a
runaway network driver. Since AoE carries its block device traffic in
raw Ethernet frames, it doesn't take very long for a leaky driver to
crash such a system under load.
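
If it helps with the log hunt, here is a minimal Python sketch that pulls kernel memory-exhaustion messages out of a log. The three strings it matches ("page allocation failure", "Out of memory", "oom-killer") are ones the kernel does emit. The function name is made up, and the /var/log/messages path in the usage comment is an assumption about your syslog setup.

```python
import re

# Kernel messages that typically accompany memory exhaustion.
MEMORY_EVENT = re.compile(
    r"page allocation failure|Out of memory|oom-killer", re.IGNORECASE
)


def find_memory_events(log_lines):
    """Return the lines that mention a kernel memory-exhaustion event."""
    return [line.rstrip("\n") for line in log_lines if MEMORY_EVENT.search(line)]


# Usage (path is an assumption; adjust for your syslog configuration):
#   with open("/var/log/messages") as f:
#       for hit in find_memory_events(f):
#           print(hit)
```

The process name at the start of a "page allocation failure" line and the stack trace that follows it are the places to look for whether the aoe module or the NIC driver was doing the allocating.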
> I am also looking into the ethernet drivers we have in place on the
> system in question.
> Again, thanks for the quick and informative response.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@xxxxxxxxxxxxxxxxxxxxxxx
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615