Re: Crash recovery/zero-byte file question

To: Josh Endries <endries@xxxxxxxxxxxxxx>
Subject: Re: Crash recovery/zero-byte file question
From: Eric Sandeen <sandeen@xxxxxxxxxxx>
Date: Sun, 19 May 2013 21:22:55 -0500
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <76015885.11204.1369015296087.JavaMail.root@xxxxxxxxxxxxxxxxxx>
References: <731755347.10846.1368808589019.JavaMail.root@xxxxxxxxxxxxxxxxxx> <5196A4A7.6010805@xxxxxxxxxxx> <76015885.11204.1369015296087.JavaMail.root@xxxxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:17.0) Gecko/20130509 Thunderbird/17.0.6

On 5/19/13 9:01 PM, Josh Endries wrote:
> Hello,
> Thanks for the reply!
>>> We have a RHEL 6.3 machine with a large XFS mount that suffered a
>>> power outage.
>> For starters, have you engaged your RH support folks?
> Unfortunately we don't have support for these machines. We have tons of RH 
> machines and licenses, but only a few with paid support. Generally the 
> (grant-funded) research machines don't include RH support. (And generally we 
> don't run into problems like this. :))


>>> When it came back up, it allegedly fixed itself, but
>>> now many files are zero bytes. I found a bug report/errata fix at RH
>>> that mentions something similar, which might be what we ran into.
>> Which one?  RH support can probably help you decide if that bug report
>> applies, and where/when it was fixed.
> This one: https://access.redhat.com/site/solutions/272673

well, that's a "solution" ;)

> You need a login to view that, though... I think this is the same one, which 
> I just found today:
> https://bugzilla.redhat.com/show_bug.cgi?id=845233
> That URL is currently broken for me, so here is a cache of it:
> http://webcache.googleusercontent.com/search?q=cache:3OjuPDd8A1AJ:https://bugzilla.redhat.com/show_bug.cgi%3Fid%3D845233+&cd=2&hl=en&ct=clnk&gl=us&client=firefox-a
> Reading this, I'm no longer sure we have a kernel with the fix. That machine 
> is running:
> 2.6.32-279.el6.x86_64

Right, and:  "Fixed In Version:         kernel-2.6.32-328.el6"

So this is a known bug with a fix, but it seems you're not running a
kernel that includes it.
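
For comparing the running kernel against the "Fixed In Version" string,
a minimal Python sketch follows. The helper names (kernel_release_key,
has_fix) are made up for illustration, and the comparison is deliberately
naive: it only looks at the leading numeric part of the release, so it
may not account for fixes backported into an errata kernel with a lower
build number. Treat it as a rough check, not an authoritative one.

```python
import re


def kernel_release_key(release):
    """Turn e.g. '2.6.32-279.el6.x86_64' into ((2, 6, 32), (279,))."""
    # Accept both `uname -r` output and package-style "kernel-..." names.
    if release.startswith("kernel-"):
        release = release[len("kernel-"):]
    version, _, rest = release.partition("-")
    upstream = tuple(int(part) for part in version.split("."))
    # Keep only the leading numeric components of the build/release field
    # ("279" from "279.el6.x86_64", "279.5.2" from "279.5.2.el6").
    match = re.match(r"\d+(?:\.\d+)*", rest)
    build = tuple(int(part) for part in match.group(0).split(".")) if match else ()
    return upstream, build


def has_fix(running, fixed_in):
    """Naive check: is the running release at or beyond the fixed one?"""
    return kernel_release_key(running) >= kernel_release_key(fixed_in)
```

With the two strings from this thread,
has_fix("2.6.32-279.el6.x86_64", "kernel-2.6.32-328.el6") returns False.
On a live machine the first argument could come from platform.release().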

> I'm not really sure when the files were created or how long it was
> idle before the crash... I wonder if ctime/mtime would be reliable
> for the files. I also don't know how to reproduce the situation in
> order to test if it's fixed in a later kernel. I can pull the power
> out to test if I knew how to modify files ahead of time such that
> they would zero themselves out.

I think you can be fairly certain that it's resolved in the kernel
version above.
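
On the ctime/mtime question: whether those timestamps can be trusted
after this bug is not settled in this thread, but it is cheap to collect
them and look for a pattern. A minimal sketch, assuming Python is
available on the machine; zero_byte_files is a hypothetical helper name,
and the root directory is whatever mount point is affected.

```python
import os
import stat


def zero_byte_files(root):
    """Yield (path, mtime, ctime) for every empty regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are not followed
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_size == 0:
                yield path, st.st_mtime, st.st_ctime
```

Printing each tuple with time.ctime() on the two timestamps gives a list
the researchers can check against what they remember working on. The
walk only reads metadata, so it does not modify the filesystem.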

>>> We
>>> are running a kernel that should have the fix as far as I can tell,
>>> but we definitely have zero byte files that shouldn't be.
>> shouldn't be because they had all been properly synced to disk
>> before the power loss, or?  (just in general, files not fsynced
>> aren't guaranteed to be in any particular state if you lose power,
>> though of course there are certain expectations of timely flushing).
> No, I mean they shouldn't be zero normally. They weren't zero a week
> ago. In other words, the files definitely changed unexpectedly, I'm
> assuming due to the power outage. The files had not been touched in
> at least a few days before the crash, according to the researcher
> working on those files. If I read the report correctly, though, that
> might not matter much.
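
For reference on the fsync point quoted above, this is roughly what
"properly synced to disk" looks like from an application. It is a
general sketch of the guarantee being discussed, not a workaround for
the bug in the bugzilla; write_durably is a hypothetical name, and the
directory fsync is an extra step assumed here so that a newly created
file's directory entry is flushed as well.

```python
import os


def write_durably(path, data):
    """Write bytes to path and ask the kernel to flush them before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Flush file data and metadata to stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Flush the containing directory so the file's entry is durable too.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Files written without a step like this are the ones that "aren't
guaranteed to be in any particular state" after a power loss.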


>>> My question is: is there a way to restore this or fix it before going
>>> to backups? Is it worth it to unmount and run xfs_check or similar?
>>> Unfortunately, since the system came up and appeared to be working,
>>> some users have been using that mount point.
>> If you have backups that's probably the best option.
> There aren't any backups of these files. The researchers should be
> able to recreate them (I hope so); the data sets come from various
> places. It's a lot of data, so I was hoping I could recover something
> to lessen the downtime. They opted not to back up that directory
> because it's just too many TBs for normal backups.
> I'm not really expecting to be able to restore everything, I just
> want to put some effort in to getting back what I can before telling
> them they need to start over...

Dave is more familiar with that bug than I am, but short of some serious
forensics & luck, I don't think you'll be able to get things back.

I'd update to the kernel mentioned above soon, though, and sorry
about the hassle.  :(


> Thanks,
> Josh
