Re: XFS filesystem corruption

To: Julien FERRERO <jferrero06@xxxxxxxxx>
Subject: Re: XFS filesystem corruption
From: Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx>
Date: Thu, 07 Mar 2013 07:32:15 -0600
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <CAPcwv6yAHAsmwgROs12gRtCbqTXBvTPrx8F-e4kYab2YApsobg@xxxxxxxxxxxxxx>
References: <CAPcwv6wZJSBtgF-L6KNSn6N6Y+wUZJFXdbcg+zYRwoaB2sDdjw@xxxxxxxxxxxxxx> <51380FD3.5010302@xxxxxxxxxxxxxxxxx> <CAPcwv6yAHAsmwgROs12gRtCbqTXBvTPrx8F-e4kYab2YApsobg@xxxxxxxxxxxxxx>
Reply-to: stan@xxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20130215 Thunderbird/17.0.3
On 3/7/2013 7:04 AM, Julien FERRERO wrote:
>> It may be unrelated to your corruption problem, but I'm curious why you
>> are specifying a 32MB log section instead of letting mkfs.xfs make the
>> log size decision.
> I honestly don't know, the rebuild script was written 8 years ago by an
> engineer who has since left the company.
> Is 32MB too small a log for 1.5 TB of data?

The log journals metadata changes, not file data.  So if you're
capturing a frame of video per file, or 24 or 60 frames per file, and
thus are writing lots of files, 32MB may be too small.  I'm not an
expert here.  Dave C. would be better able to answer this.  But this is
a very minor problem compared to...

> Moreover, the common usage is to power off all the equipment (including
> ours) from a general power switch.

this.  Have the crews been hard cutting power to these XFS boxen for the
8 years you mention above?  And this filesystem corruption problem,
and/or corrupted files, is just now cropping up?  That's hard to
believe.  There may be a bug in 2.6.35 that exacerbates this and that
has been fixed in later versions.  2.6.35 is not a long term stable
kernel, so it's odd that a vendor would choose it for long term use.

If you never had this problem before, I can only guess that previously
you were using hardware RAID controllers with BBWC that had enough
battery life to keep the cache powered until the next power on, at
which point the controller flushed the cached writes to the disks.  If
you switched from that solution to non BBWC RAID, or to Linux software
RAID, that might explain why you're seeing corruption now and did not
previously.  And even with BBWC RAID, hard cutting power to the system
is still not a smart thing to do.

For this kind of environment, if field techs are going to hard cut power
no matter what you tell them, then you simply MUST get LSI (or possibly
other) RAID cards with the flash backed write cache.  This doesn't rely
on batteries so the cache is never volatile, and can sit overnight, or
for days or weeks, without losing the data in the write cache.
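
Back on the log size question: here is a minimal Python sketch, not
taken from your rebuild script, of one way to compare what mkfs.xfs
would pick by default against an explicit 32MB log.  I'm assuming the
old script passes something like "-l size=32m", and /dev/sdX is just a
placeholder device name.  The helper name is hypothetical.  It only
builds the command lines and uses -N, which prints the filesystem
parameters without creating anything.

```python
import shlex


def mkfs_xfs_command(device, log_size=None, dry_run=True):
    """Build an mkfs.xfs argument list. Nothing is executed here."""
    cmd = ["mkfs.xfs"]
    if dry_run:
        # -N prints the parameters without actually creating the filesystem
        cmd.append("-N")
    if log_size is not None:
        # Explicit log size (assumed form of what the old script does)
        cmd += ["-l", f"size={log_size}"]
    cmd.append(device)
    return cmd


# /dev/sdX is a placeholder, substitute the real device before running
explicit = mkfs_xfs_command("/dev/sdX", log_size="32m")
default = mkfs_xfs_command("/dev/sdX")

print(shlex.join(explicit))
print(shlex.join(default))
```

Running both printed commands by hand and comparing the "log" line of
the output should show whether 32MB is smaller than what mkfs.xfs would
choose on its own for that device.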

