
xfs file system corruption

To: "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Subject: xfs file system corruption
From: Allan Haywood <Allan.Haywood@xxxxxxxxxxxxx>
Date: Tue, 7 Oct 2008 16:18:57 -0700
Accept-language: en-US
Thread-index: Acko0xKYd9DT9xw6TmWc3+7lvxwu0g==
Thread-topic: xfs file system corruption
To All:

On a production system we are running into an interesting situation where we
hit a corrupt file system. Let me outline the timeline as best I know it:

We have a failover process where two servers are connected to fiber storage.
If the active server goes into failover (for numerous reasons), an automatic
process kicks in that makes it inactive and then makes the backup server
active. Here are the details (a rough sketch in script form follows the list):


1. On the failed server, the database and other processes are shut down (attempted)

2. The fiber-attached file system is unmounted (attempted)

3. The fiber ports for that server are turned off

4. On the backup server, the fiber ports are turned on

5. The fiber-attached file system is mounted (the same filesystems that were on
the previous server)

6. The database and other processes are started

7. The backup server is now active and processing queries
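
To make the sequence concrete, here is a rough, illustrative sketch of the
failover steps in script form. It is not our actual script; the device, mount
point, and service names are placeholders, and the out-of-band fencing of the
fiber ports is switch/HBA specific and not shown:

    #!/usr/bin/env python
    # Illustrative sketch of the failover sequence described above.
    # Device, mount point, and service names are placeholders.
    import subprocess

    MOUNT_POINT = "/data"        # placeholder mount point
    FC_DEVICE = "/dev/dm-31"     # placeholder device (name taken from the log below)
    SERVICES = ["exampledb"]     # placeholder list of services

    def run(cmd):
        # Run a command and return True on success; failures are tolerated
        # on the failed node, hence the "(attempted)" wording above.
        return subprocess.call(cmd) == 0

    def deactivate_failed_node():
        for svc in SERVICES:                 # step 1: stop database and other processes
            run(["service", svc, "stop"])
        run(["umount", MOUNT_POINT])         # step 2: unmount the fiber-attached filesystem
        # step 3: fence the node by turning its fiber ports off (out-of-band, not shown)

    def activate_backup_node():
        # step 4: fiber ports are turned on (out-of-band, not shown)
        run(["mount", "-t", "xfs", FC_DEVICE, MOUNT_POINT])   # step 5: mount the same filesystem
        for svc in SERVICES:                 # step 6: start database and other processes
            run(["service", svc, "start"])
        # step 7: the backup server is now active and processing queries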

Here is where it got interesting. When recovering from the backup server back
to the main server, we pretty much just reverse the steps above. The file
systems unmounted cleanly on the backup server; however, when we went to mount
them on the main server, file system corruption was detected on one of them
(xfs_check indicated a repair was needed, so xfs_repair was then run on the
filesystem). xfs_repair proceeded to "fix" the filesystem, at which point we
lost files that the database needed for one of its tables.
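
For reference, the check/repair sequence amounted to something like the
following sketch (the device name is a placeholder; both tools were run only
against the unmounted filesystem, and I am assuming here that a non-zero exit
status from xfs_check means problems were found):

    # Sketch of the check/repair sequence; device name is a placeholder.
    import subprocess

    DEVICE = "/dev/dm-31"   # placeholder

    # xfs_check only reports inconsistencies; it does not modify the filesystem.
    if subprocess.call(["xfs_check", DEVICE]) != 0:
        # xfs_repair rewrites metadata to make the filesystem consistent;
        # this is the step after which the files were gone.
        subprocess.call(["xfs_repair", DEVICE])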

What I am curious about is the following message in the system log:

Oct  2 08:15:09 arch-node4 kernel: Device dm-31, XFS metadata write error block 
0x40 in dm-31

This was logged when the main node was fenced (its fiber ports turned off). I
am wondering whether pending XFS metadata writes can still exist on the fenced
node and then be flushed to disk later, once the fiber is unfenced. I could see
this being an issue: there are pending metadata writes for a filesystem;
through the failover, that filesystem is mounted on another server, used as
normal, and then unmounted cleanly; then, when the ports are re-activated on
the server that still holds the pending metadata, is it possible that metadata
does get flushed to disk? Because the disk has since been in use on the other
server, the stale metadata would no longer match the filesystem and could
overwrite or change it in a way that causes corruption.
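
One precaution that might help confirm or rule this out (an illustrative
sketch, not something we currently run): before mounting the filesystem back on
the recovered node, check it in no-modify mode so nothing on disk is changed
while we look:

    # Sketch of a pre-mount sanity check on the recovered node.
    # xfs_repair -n runs in no-modify mode and only reports problems;
    # the device name is a placeholder.
    import subprocess
    import sys

    DEVICE = "/dev/dm-31"   # placeholder

    rc = subprocess.call(["xfs_repair", "-n", DEVICE])
    if rc != 0:
        sys.stderr.write("%s looks inconsistent; investigate before mounting\n" % DEVICE)
        sys.exit(1)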

Any thoughts would be great.

If there is any more info I can provide, let me know.

Thanks.


--

Allan Haywood, Systems Engineering Program Manager II
SQL Server Data Warehousing Product Unit
allan.haywood@xxxxxxxxxxxxx



