Re: Sudden File System Corruption

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: Sudden File System Corruption
From: Mike Dacre <mike.dacre@xxxxxxxxx>
Date: Wed, 04 Dec 2013 19:46:06 -0800
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20131205034034.GI8803@dastard>
References: <CAPd9ww_qT9J_Rt04g7+OApoBeggNOyWNwD+57DiDTuUvz-O-0g@xxxxxxxxxxxxxx> <20131205034034.GI8803@dastard>
User-agent: K-9 Mail for Android
Hi Dave,

My apologies, I completely miscommunicated. The drive dying was unrelated, it happened two months ago. I mentioned it only as background info, but I realize now that was stupid. There were no drive or RAID problems at all at the time the XFS mount died today. The drives are all fine and the RAID log shows nothing significant.



From: Dave Chinner <david@xxxxxxxxxxxxx>
Sent: Wed Dec 04 19:40:34 PST 2013
To: Mike Dacre <mike.dacre@xxxxxxxxx>
Cc: xfs@xxxxxxxxxxx
Subject: Re: Sudden File System Corruption

On Wed, Dec 04, 2013 at 06:55:05PM -0800, Mike Dacre wrote:
> Hi Folks,
> 
> Apologies if this is the wrong place to post or if this has been answered.
> 
> I have a 16 x 2TB drive RAID6 array powered by an LSI 9240-4i. It has an XFS
> filesystem and has been online for over a year. It is accessed by 23
> different machines connected via Infiniband over NFS v3. I haven't had any
> major problems yet; one drive failed, but it was easily replaced.
> 
> However, today the drive suddenly stopped responding and started returning
> IO errors when any requests were made. This happened while it was being
> accessed by 5 different users, one of whom was doing a very large rm operation
> (rm *sh on thousands of files in a directory). Also, about 30 minutes before,
> we had connected the globus connect endpoint to allow easy file transfers
> to SDSC.

So, you had a drive die and at roughly the same time XFS started
reporting corruption problems and shut down? Chances are that the
drive returned garbage to XFS before it died completely, and that's what
XFS detected and shut down on. If you are unlucky in this situation,
the corruption can get propagated into the log by changes that are
adjacent to the corrupted region, and then you have problems with log
recovery failing because the corruption gets replayed....
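[As a side note for anyone hitting this thread later: the usual escalation path when log recovery fails is to inspect before destroying anything. A minimal sketch, assuming the device name `/dev/sda1` from the original report and a scratch mount point of your choosing:]

```shell
# 1. Try a read-only mount first -- log replay happens at mount time:
mount -o ro /dev/sda1 /mnt/recovery

# 2. If the mount fails during log recovery, inspect the log contents:
xfs_logprint -t /dev/sda1     # transaction summary of what's in the log

# 3. Dry-run repair: report what would be fixed without touching the disk:
xfs_repair -n /dev/sda1

# 4. Only as a last resort, zero the log and repair. This discards any
#    in-flight transactions, which is exactly why -L is not step one:
xfs_repair -L /dev/sda1
```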

> I have attached the complete log from the time it died until now.
> 
> In the end, I successfully repaired the filesystem with `xfs_repair -L
> /dev/sda1`. However, I am nervous that some files may have been corrupted.
> 
> Do any of you have any idea what could have caused this problem?

When corruption appears at roughly the same time a drive dies, it's
almost always caused by the drive that failed. RAID doesn't prevent
disks from returning crap to the OS, because nobody configures the
arrays to do read-verify cycles that would catch such a condition.
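[For the "nervous that some files may have been corrupted" concern: after an `xfs_repair -L`, files whose directory entries were lost are reconnected under `lost+found` at the filesystem root, and a second no-modify pass should come back clean. A minimal check, again assuming `/dev/sda1` and a mount point of `/mnt/array`:]

```shell
# Orphaned inodes recovered by xfs_repair land in lost+found, named by
# inode number; anything in there lost its directory entry during repair:
mount /dev/sda1 /mnt/array
ls -l /mnt/array/lost+found

# A second no-modify pass should now report no remaining inconsistencies:
umount /mnt/array
xfs_repair -n /dev/sda1 && echo "no inconsistencies found"
```

Note that a clean `xfs_repair -n` only confirms metadata consistency; silently corrupted file *contents* can only be caught by comparing against checksums or backups made before the event.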

