
Re: XFS File System Monitor

To: Rotem Ben Arye <rotem.benarye@xxxxxxxxx>, "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Subject: Re: XFS File System Monitor
From: Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx>
Date: Sat, 04 Jan 2014 09:45:06 -0600
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <CA+apj_iOy2dqyPGunKe91WLCqy71uE1uq2HZQ_v=+QHewCymeA@xxxxxxxxxxxxxx>
References: <CA+apj_hV4HvZxfASF7JJ1k6mmvio3cRWAHj-S1V=Vm8X_RWA=Q@xxxxxxxxxxxxxx> <52C58095.70408@xxxxxxxxxxxxxxxxx> <CA+apj_iOy2dqyPGunKe91WLCqy71uE1uq2HZQ_v=+QHewCymeA@xxxxxxxxxxxxxx>
Reply-to: stan@xxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:24.0) Gecko/20100101 Thunderbird/24.2.0

On 1/4/2014 2:21 AM, Rotem Ben Arye wrote:
> Hi Stan,
> Thank you for the focused answer. To be clear, I'm aware that an SNMP
> monitor that checks /var/log/messages on a production server after a
> power failure, only to tell us that "a power outage caused file system
> corruption", is useless.
> 
> But for the other cases you mention, "bugs in the XFS code or
> elsewhere in the Linux kernel, transient or permanent hardware failures",
> is there no suitable log that we can track to get an indication of
> the kind of event you described?

Errors due to such problems are logged in dmesg.  Hardware problems will
usually show up as IO errors generated from the device on which XFS
resides.  When these occur XFS will typically initiate automatic
shutdown of the filesystem to prevent (further) corruption.  In this
case the log entry occurs simultaneously with the shutdown, so
monitoring logs won't notify you in advance of this problem.  Monitoring
your hardware may.
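
If you still want to be alerted after the fact, something along these
lines can pull the XFS-related error lines out of the kernel log. This is
only a rough sketch in Python: the keyword list is my assumption, not an
official catalogue of XFS messages, and the function names are made up
for illustration.

```python
import re
import subprocess

# Assumed keywords; adjust to the messages you actually see in your logs.
SUSPECT = re.compile(r"error|corrupt|shutdown|shutting down", re.IGNORECASE)


def xfs_problem_lines(lines):
    """Return the lines that mention XFS together with a suspect keyword."""
    hits = []
    for line in lines:
        text = line.rstrip("\n")
        if "xfs" in text.lower() and SUSPECT.search(text):
            hits.append(text)
    return hits


def scan_dmesg():
    """Run dmesg and filter its output (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return xfs_problem_lines(result.stdout.splitlines())
```

Calling scan_dmesg() from cron or from your monitoring agent would report
a shutdown once it has happened. It cannot warn you beforehand.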

If you get corruption due to a software bug, you may not see an error in
the log until after the filesystem suffers the corruption event.
Usually when you see errors of this nature in the log it is because
corruption has already occurred, possibly long ago, but is just now
being detected by code specifically added to XFS to detect such things.
For example, say your filesystem is 3 years old, corruption occurred in
year one, and an update to XFS 2 years later looks for such corruption
whereas before it did not.  Depending on the severity of the corruption,
xfs_repair may be able to fix it, or it may not.  If not, ask for
help here.
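
Before attempting a repair you can ask xfs_repair what it would do
without changing anything, using its no-modify mode (-n) on the unmounted
device. A minimal Python sketch follows. The function name and the
injectable "run" parameter are hypothetical, added only so the wrapper
can be exercised without a real device. The man page documents status 0
(no corruption found) and 1 (corruption detected) for -n; treating any
other status as an error is my assumption.

```python
import subprocess


def xfs_needs_repair(device, run=subprocess.run):
    """Dry-run xfs_repair on an unmounted device; True if corruption is reported."""
    # -n: no-modify mode, xfs_repair only scans and reports.
    result = run(
        ["xfs_repair", "-n", device],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode == 0:
        return False
    if result.returncode == 1:
        return True
    # Anything else is treated as "could not check" (an assumption).
    raise RuntimeError(
        f"xfs_repair -n exited with status {result.returncode}: {result.stderr}"
    )
```

The filesystem must be unmounted when you run this, so it is a
maintenance-window check rather than a runtime monitor.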

So again, I'm not aware of any proactive monitoring that would help in
these situations.  Of course it would be nice to know if something is
going to fail beforehand, but this isn't always possible, unfortunately.

--
Stan


> Thank you.
> 
> 
> 
> On Thu, Jan 2, 2014 at 5:07 PM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
> 
>> On 1/2/2014 6:16 AM, Rotem Ben Arye wrote:
>>> Hi, SGI Support Team.
>>> My name is Rotem. I am a Linux/Unix system administrator at a web
>>> company in Israel.
>>> I have a question and would like your advice.
>>>
>>> Last weekend we had a crisis on one of the company's production
>>> servers. The integrators defined the problem as "xfs file system
>>> corrupted".
>>> My question is: what open source tools can we use at runtime in a
>>> production environment to monitor and sample a mounted XFS filesystem
>>> and get an indication that something is not healthy and could lead
>>> to a problem?
>>>
>>> We are working in a Linux environment on CentOS servers.
>>
>> So in a nutshell you're looking for a monitor application that will in
>> essence give you a green, yellow, or red light informing you of the
>> filesystem's health.  Or some kind of SNMP logging that suggests a
>> corruption is imminent.
>>
>> There is no such tool, and never will be.  Nearly all XFS corruption
>> events are caused by either software bugs in the XFS code or elsewhere
>> in the Linux kernel, transient or permanent hardware failures, or power
>> failures, at some layer in the storage stack.  It is not feasible to
>> predict such events.
>>
>> When an XFS corruption occurs, one should report all related log
>> information and errors to this list so that the problem may be analyzed
>> and the root cause identified.  Then the proper corrective action can be
>> identified and implemented to fix the problem and hopefully prevent it
>> from reoccurring.
>>
>> --
>> Stan
>>
> 
