Re: XFS filesystem on EC2 instance corrupts and shuts down

To: Shrinath M <shrinath.m@xxxxxxxxxx>
Subject: Re: XFS filesystem on EC2 instance corrupts and shuts down
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Fri, 15 Mar 2013 09:02:44 +1100
Cc: Eric Sandeen <sandeen@xxxxxxxxxxx>, Sabyasachi Ruj <sabyasachi.ruj@xxxxxxxxxx>, Vivek Goel <vivek.goel@xxxxxxxxxx>, Supratik Goswami <supratik.goswami@xxxxxxxxxx>, Ric Wheeler <rwheeler@xxxxxxxxxx>, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <CAOdS1hnVoMtXnOrECrU8xUyRn82UUJ=jMzX0_odnAs0GH8V-yA@xxxxxxxxxxxxxx>
References: <CAOdS1h=7X4O1O7X8YOwxtLm7G=fc+J+6hJxJ1RKbDmfTZXTpeg@xxxxxxxxxxxxxx> <51373DB8.2020707@xxxxxxxxxx> <CAOdS1hnXGj9puaHxeToqmpK40A-3WvJnM7=5HckpyyZYqZTvEQ@xxxxxxxxxxxxxx> <51373FC1.6010101@xxxxxxxxxx> <CAOurMUeasru6ekDYcvVR1QnaWVJFV+-coZsUG5SgG6LnENBvXg@xxxxxxxxxxxxxx> <513751F2.2060109@xxxxxxxxxx> <CAOdS1hngSuHn_HiremLyUS7Qd9eZ68=8arfBuHnEpwXQaBw9Wg@xxxxxxxxxxxxxx> <5140CBE3.80705@xxxxxxxxxxx> <20130313234213.GW21651@dastard> <CAOdS1hnVoMtXnOrECrU8xUyRn82UUJ=jMzX0_odnAs0GH8V-yA@xxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Thu, Mar 14, 2013 at 06:58:19AM +0530, Shrinath M wrote:
> Thanks Ben, Dave and Eric.
> 
> Eric,
> >> but I am wondering if there might be more information before this which
> >> is not in your trimmed logs.
> No, this was the first entry every time we have it in /var/log/messages.
> dmesg also holds the same. After reboot, it simply fixes without anyone
> doing anything.
> 
> The Linux we are running is definitely amazon baked one, looks like this -
> $~: uname -a
> Linux ip-100-0-100-1 3.2.34-55.46.amzn1.x86_64 #1 SMP Tue Nov 20 10:06:15 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

So, this is an Amazon special kernel by the looks of it. I think
that only Amazon can really help you track down the problem...

>  - dmesg shows something like this after repairing/rebooting -
> 
> [    8.414176] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
> [    8.415342] SGI XFS Quota Management subsystem
> [    8.417664] XFS (md0): Mounting Filesystem
> [    8.771553] XFS (md0): Starting recovery (logdev: internal)
> [    9.977325] XFS (md0): Ending recovery (logdev: internal)
> 
> Check the first line there, it says no debug enabled. How good/bad is this
> debug mode in production environments? We are not getting any corruption in
> our local/test environments, in production, we are getting it once on every
> third day.

Debug should not be used in production environments. It'll cause
panics in situations where production kernels continue just
fine, and it changes the allocation algorithms to give better code
coverage for testing rather than optimal layout.
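
One way to check which kind of build a machine is running is to read the
XFS banner from dmesg, as in the log quoted above. A minimal Python sketch,
assuming a debug build prints "debug enabled" where this kernel prints
"no debug enabled"; the function name xfs_debug_enabled is made up for
illustration and is not part of any XFS tooling:

```python
import re

# Matches the XFS banner printed when the module loads, e.g.
# "SGI XFS with ACLs, security attributes, ..., no debug enabled".
# Assumption: the option list always ends in "debug" or "no debug".
BANNER = re.compile(r"SGI XFS with (?P<options>.+?) enabled")


def xfs_debug_enabled(log_text):
    """Return True or False for the debug state, or None if undetermined."""
    # Collapse whitespace so a banner wrapped across lines still matches.
    flat = " ".join(log_text.split())
    match = BANNER.search(flat)
    if match is None:
        return None
    # The debug flag is the last comma-separated option in the banner.
    last_option = match.group("options").split(",")[-1].strip()
    if last_option == "debug":
        return True
    if last_option == "no debug":
        return False
    return None
```

Passing it the text of `dmesg` or /var/log/messages from the kernel in this
thread should give False, since the banner there ends in "no debug enabled".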

> Dave,
> You say unlinked inode list, but if that, it should have an entry in
> /var/log/messages, right?

It did - the error that Eric pointed out.

> Anyway, how can we create this situation?

If I knew, I would have fixed the bug already. You need to work out
what in your production environment is triggering it...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx