
Re: EFSCORRUPTED on mount?

To: xfs@xxxxxxxxxxx
Subject: Re: EFSCORRUPTED on mount?
From: Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>
Date: Tue, 22 Nov 2011 16:03:15 -0800
Cc: Eric Sandeen <sandeen@xxxxxxxxxxx>, Christoph Hellwig <hch@xxxxxxxxxxxxx>
In-reply-to: <4ECC19C3.5070905@xxxxxxxxxxx>
References: <CAF3hT9B8-ou-4RhfCkfFWTwwB_tb7nWSP-5pgP3G6oTE+1gAvA@xxxxxxxxxxxxxx> <CAF3hT9AurrVi7xosauVmhQcsbqJgLsxkNYm6dWDNCpB+GR69=w@xxxxxxxxxxxxxx> <20111122014114.GJ2386@dastard> <CAF3hT9Cu+56WJqzHVu+fMgquddwoMsfW-WC8CvRRBtC4PA3uHw@xxxxxxxxxxxxxx> <4ECBEF83.5020607@xxxxxxxxxxx> <CAF3hT9CA23aDfYTF__mVEM7jRq=ZEgqhfX49hK9kcBqM_+h0CQ@xxxxxxxxxxxxxx> <4ECC19C3.5070905@xxxxxxxxxxx>
On Tue, Nov 22, 2011 at 1:53 PM, Eric Sandeen <sandeen@xxxxxxxxxxx> wrote:
> And this was the first indication of trouble.
>
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.214692]
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.227313] Pid: 11196, comm:
>> ceph-osd Not tainted 3.1.0-dho-00004-g1ffcb5c-dirty #1
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.235056] Call Trace:
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.237530]
>> [<ffffffff811d606e>] ? xfs_free_ag_extent+0x4e3/0x698
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.243717]
>> [<ffffffff811d6b71>] ? xfs_free_extent+0xb6/0xf9
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.249468]
>> [<ffffffff811d3034>] ? kmem_zone_alloc+0x58/0x9e
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.255220]
>> [<ffffffff812095f9>] ? xfs_trans_get_efd+0x21/0x2a
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.261159]
>> [<ffffffff811e2011>] ? xfs_bmap_finish+0xeb/0x160
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.266993]
>> [<ffffffff811f8634>] ? xfs_itruncate_extents+0xe8/0x1d0
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.273361]
>> [<ffffffff811f879f>] ? xfs_itruncate_data+0x83/0xee
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.279362]
>> [<ffffffff811cb0a2>] ? xfs_setattr_size+0x246/0x36c
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.285363]
>> [<ffffffff811cb1e3>] ? xfs_vn_setattr+0x1b/0x2f
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.291031]
>> [<ffffffff810e7875>] ? notify_change+0x16d/0x23e
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.296776]
>> [<ffffffff810d2982>] ? do_truncate+0x68/0x86
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.302172]
>> [<ffffffff810d2b11>] ? sys_truncate+0x171/0x173
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.307846]
>> [<ffffffff8166c07b>] ? system_call_fastpath+0x16/0x1b
>> Nov 17 16:01:01 cephstore6358 kernel: [  214.314031] XFS (sdg1):
>> xfs_do_force_shutdown(0x8) called from line 3864 of file
>> fs/xfs/xfs_bmap.c.  Return address = 0xffffffff811e2046
>
> by here it had shut down, and you were just riding along when
> it went kablooey.  Any non-xfs error just before this point?

Nope, nothing from anybody else.


On Tue, Nov 22, 2011 at 2:11 PM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> On Tue, Nov 22, 2011 at 10:47:24AM -0800, Gregory Farnum wrote:
>> Barriers on (at least, nobody turned them off); the RAID card is
>> battery-backed; here are megacli dumps:
>> http://pastebin.com/yTskgzWG
>> http://pastebin.com/ekhczycy
>
> I had a lot of issues with megaraid cards and their unsafe caching
> settings, up to the point that I'd recommend staying away from them
> now.  Can you check in the megacli config whether the _disk_ write
> caches are enabled?  megaraid adapters used to do that a lot, and given
> that the disk cache isn't battery-backed it's fairly fatal.
>
> I think in your dump this one might be the culprit given that SATA
> disks outside of a few niches come with a writeback cache policy:
>
> Disk Cache Policy: Disk's Default
>
> try changing that to an explicit writethrough mode - and maybe try
> running a crash data integrity test like
>
>        http://www.complang.tuwien.ac.at/anton/hdtest/
>
> on this controller.
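[Editor's note: the cache-policy check Christoph suggests can be done with MegaCli. The commands below are an illustrative sketch only; the binary name (MegaCli vs. MegaCli64) and flag spellings vary between MegaCli releases, and they require the controller hardware to run.]

```shell
# Show the controller's cache policy (WriteBack vs. WriteThrough)
# for all logical drives on all adapters:
MegaCli64 -LDGetProp -Cache -LAll -aAll

# Show the per-disk write cache setting ("Disk's Default" means the
# drive's own volatile cache may be enabled):
MegaCli64 -LDGetProp -DskCache -LAll -aAll

# Explicitly disable the (non-battery-backed) disk write caches:
MegaCli64 -LDSetProp -DisDskCache -LAll -aAll

# Optionally force the controller itself to writethrough mode:
MegaCli64 -LDSetProp -WT -LAll -aAll
```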
We're going to look into this in more detail very shortly. Right now
all I can tell you is that none of the drives ever actually lost
power, so unless something explicitly told them to drop their caches,
I don't see how the drives could have lost their cache contents and
caused a problem like this.
But for now I'll just see what I can get by zeroing out the log, and
we'll get back to you again if we manage to reproduce this in a
situation where we can tell you more definitively about the caching
and barriers.
Thanks guys,
-Greg
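[Editor's note: "zeroing out the log" here refers to clearing the XFS journal with xfs_repair so the filesystem can be mounted again. A minimal sketch, assuming the affected device is /dev/sdg1 as in the shutdown message above; -L discards unreplayed log transactions and can lose recent metadata updates, so the dry run comes first.]

```shell
# Dry run: report what xfs_repair would do, without modifying anything.
xfs_repair -n /dev/sdg1

# Destructive: zero the log and repair. Only use this when the log
# cannot be replayed (e.g. the mount fails with a corruption error).
xfs_repair -L /dev/sdg1
```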
