XFS on CoRAID errors with SMB

Joe Landman landman at scalableinformatics.com
Mon Nov 28 09:36:48 CST 2011


On 11/28/2011 10:26 AM, Jon Marshall wrote:
> Hi Joe,
>
> Thanks for the rapid response.
>
> Is this something that has been reported often in relation to AoE? Is

We've experienced it in the past when we supported our customers with 
Coraid gear.  Most of that is gone now, so we haven't seen much AoE 
stuff as of late (last 2 years or so).

This said, the AoE stack depends critically upon the network stack, and
between AoE and the network stack (or possibly something else), you ran
out of memory for use in the kernel.  In our experience the usual cause
is a leaky network driver.  The e1000 and similar Intel drivers shipped
by default with RHEL5/CentOS 5 are highly problematic.  AoE itself could
be leaking (early versions were pretty bad in this regard, though I
haven't looked at the driver in the last few years; hopefully they have
improved it).

The xfs connection to this (to stay relevant to this group) is that xfs 
is ok atop this, as long as the other layers don't go away.  If you can 
detect problems like this in advance, you might be able to issue an 
xfs_freeze, and preserve the integrity of the underlying filesystem 
(obviating the need for an xfs_repair).  The hard part would be an
accurate prediction, but if your drivers are grabbing memory and not
releasing it back, or you have a runaway memory-consuming process,
yeah, you could potentially predict its onset.
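
A rough sketch of what such a predictor might look like, assuming you
poll /proc/meminfo and treat either very low reclaimable memory or
steadily growing Slab usage as the warning sign.  The function names,
thresholds and sample counts here are hypothetical and would need tuning
for a real system; the only external command used is xfs_freeze -f on
the mount point (undo it with xfs_freeze -u), and it needs root.

```python
import subprocess
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def low_memory(info, min_free_kb=65536):
    """True when free plus easily reclaimed memory is below an assumed threshold."""
    available = info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)
    return available < min_free_kb


def slab_growing(samples, min_growth_kb=102400):
    """True when Slab never shrank across the samples and grew by min_growth_kb or more."""
    slabs = [s.get("Slab", 0) for s in samples]
    if len(slabs) < 2:
        return False
    monotonic = all(b >= a for a, b in zip(slabs, slabs[1:]))
    return monotonic and (slabs[-1] - slabs[0]) >= min_growth_kb


def freeze(mount_point):
    """Suspend writes to the XFS filesystem; requires root."""
    subprocess.check_call(["xfs_freeze", "-f", mount_point])


def watch(mount_point, interval_s=30, window=10):
    """Poll memory statistics and freeze the filesystem once a warning sign appears."""
    samples = []
    while True:
        with open("/proc/meminfo") as f:
            samples.append(parse_meminfo(f.read()))
        samples = samples[-window:]  # keep only the most recent samples
        if low_memory(samples[-1]) or slab_growing(samples):
            freeze(mount_point)
            return
        time.sleep(interval_s)
```

Something like watch("/export/data") (a made-up mount point) would run
this as a small root daemon.  Whether Slab growth is the right signal
for a given leaky driver is an assumption; a leak could show up
elsewhere in the memory statistics.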

> there any chance you could point us in the direction of some more
> background on the issue? I am checking the AoE mailing list, but if you know
> of something specific that would be very helpful.

Not really, we aren't doing much with AoE anymore.  This may or may not
be an AoE issue per se.  Likely AoE crashed, and the reason for the
crash is very probably the same reason that xfs crashed: it ran out of
memory.  If AoE is the culprit, you might find some sort of imprint of
this in the logs, though in our experience it has usually been a
runaway network driver.  Since AoE runs its block devices over raw
ethernet packets, it doesn't take very long for a leaky driver to crash
such a system under load.
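
If you want to look for that imprint, here is a minimal sketch that
sorts kernel log lines into a few buckets.  The patterns are my
assumptions: "page allocation failure" and the oom-killer messages are
what the kernel normally prints when it runs short of memory, and the
"aoe" match simply assumes the driver tags its own messages with that
name.  The function name and the bucket names are hypothetical.

```python
import re

# Assumed signatures of kernel memory exhaustion, plus any line mentioning aoe.
PATTERNS = {
    "page_alloc_failure": re.compile(r"page allocation failure"),
    "oom_killer": re.compile(r"Out of memory|oom-killer"),
    "aoe": re.compile(r"\baoe\b", re.IGNORECASE),
}


def scan_log(lines):
    """Return a dict mapping each pattern name to the log lines that matched it."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits
```

Feeding it the lines of /var/log/messages (or saved dmesg output) and
checking whether the allocation failures start before the first aoe
complaints gives a rough idea of which layer ran dry first.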

>
> I am also looking into the ethernet drivers we have in place on the
> system in question.
>
> Again, thanks for the quick and informative response.


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615



