
Re: XFS filesystem on EC2 instance corrupts and shuts down

To: Shrinath M <shrinath.m@xxxxxxxxxx>
Subject: Re: XFS filesystem on EC2 instance corrupts and shuts down
From: Eric Sandeen <sandeen@xxxxxxxxxxx>
Date: Wed, 13 Mar 2013 13:56:35 -0500
Cc: Ric Wheeler <rwheeler@xxxxxxxxxx>, Sabyasachi Ruj <sabyasachi.ruj@xxxxxxxxxx>, xfs@xxxxxxxxxxx, Supratik Goswami <supratik.goswami@xxxxxxxxxx>, Vivek Goel <vivek.goel@xxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <CAOdS1hngSuHn_HiremLyUS7Qd9eZ68=8arfBuHnEpwXQaBw9Wg@xxxxxxxxxxxxxx>
References: <CAOdS1h=7X4O1O7X8YOwxtLm7G=fc+J+6hJxJ1RKbDmfTZXTpeg@xxxxxxxxxxxxxx> <51373DB8.2020707@xxxxxxxxxx> <CAOdS1hnXGj9puaHxeToqmpK40A-3WvJnM7=5HckpyyZYqZTvEQ@xxxxxxxxxxxxxx> <51373FC1.6010101@xxxxxxxxxx> <CAOurMUeasru6ekDYcvVR1QnaWVJFV+-coZsUG5SgG6LnENBvXg@xxxxxxxxxxxxxx> <513751F2.2060109@xxxxxxxxxx> <CAOdS1hngSuHn_HiremLyUS7Qd9eZ68=8arfBuHnEpwXQaBw9Wg@xxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:17.0) Gecko/20130216 Thunderbird/17.0.3
On 3/13/13 1:07 PM, Shrinath M wrote:
> Sorry to be asking in dev thread, but Amazon seems to be clueless in this 
> case :(
> Can someone tell me where can we find the logs/output of xfs repair
> after this runs? We just reboot the machine when we see this and the
> /var/log/messages or dmesg seems to know nothing about what it
> repaired.

xfs_repair does not run automatically at boot on any OS I know of; xfs simply
replays the log.  But then I don't know what OS you are running, looks like
an amazon special?  It's a pity they can't support the OS they provide you,
because on an older kernel like this, upstream developers will be less
interested unless the problem persists in upstream kernels.  This sort
of support is usually best left to an OS vendor.

But all that aside, you list this as the first error:

    Mar  5 01:14:33 ip-100-0-100-1 kernel: [14139930.248619] XFS (md0): Corruption detected. Unmount and run xfs_repair

but I am wondering if there might be more information before this which is not 
in your trimmed logs.

The text above is from xfs_corruption_error(), which calls xfs_error_report() 
before printing the above message, and that call should normally tell us a lot 
more about what went wrong, for example something like 
"Internal error %s at line %d of file %s.  Caller 0x%"
and possibly a hexdump or stack trace.
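
If it helps to check the untrimmed logs for this, here is a minimal Python sketch that pulls out the lines immediately preceding each corruption message and tests whether any of them looks like xfs_error_report() output. The function names and the 30-line window are my own choices for illustration, not anything XFS provides; the two marker strings are taken from the messages quoted above.

```python
# Marker copied from the log line quoted above.
CORRUPTION_MARKER = "Corruption detected. Unmount and run xfs_repair"
# Start of the xfs_error_report() message format quoted above.
REPORT_MARKER = "Internal error"


def context_before_corruption(log_lines, before=30):
    """Yield (line_index, preceding_lines) for every corruption message.

    `before` is an arbitrary window size (an assumption); widen it if the
    log is noisy.
    """
    for i, line in enumerate(log_lines):
        if CORRUPTION_MARKER in line:
            yield i, log_lines[max(0, i - before):i]


def has_error_report(lines):
    """True if any line looks like xfs_error_report() output."""
    return any(REPORT_MARKER in line for line in lines)
```

Passing in the lines of the full /var/log/messages (or saved dmesg output) and calling has_error_report() on each returned window would show whether the verbose report was logged at all or was trimmed away.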

One of the things in

" dmesg output showing all error messages and stack traces "

If you really didn't get anything else before this, try:

echo 11 > /proc/sys/fs/xfs/error_level

to capture the one instance where a corruption does not trigger verbose logs. 
That actually might be what you hit.
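
If you would rather do this from a script and be able to put the old value back afterwards, a rough Python sketch follows. The helper name is hypothetical; the path is the one from the command above. It assumes you are root and that the file holds a single integer, and the setting will not survive a reboot.

```python
from pathlib import Path

# Path taken from the echo command above.
ERROR_LEVEL_PATH = Path("/proc/sys/fs/xfs/error_level")


def set_xfs_error_level(level, path=ERROR_LEVEL_PATH):
    """Write a new error level and return the previous one.

    `path` is a parameter only so the helper can be tried against an
    ordinary file; on a real system it needs root.
    """
    path = Path(path)
    previous = int(path.read_text().strip())
    path.write_text(f"{level}\n")
    return previous
```

Calling set_xfs_error_level(11) before trying to reproduce, and then calling it again with the returned value afterwards, keeps the extra verbosity limited to the test run.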

It's a little odd that you get:

    Feb 12 19:47:18 ip-100-0-100-1 kernel: [2541168.014259] XFS (md0): xfs_iunlink_remove: xfs_itobp() returned error 117.

because 117 is not an obvious error number (it is not xfs's old 
EFSCORRUPTED value, which was 990).  On Linux, though, errno 117 is EUCLEAN 
("Structure needs cleaning"), which XFS uses as EFSCORRUPTED there, and that 
fits the other references I see in various places to this error number coming 
from XFS.
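
One way to check what a numeric error such as 117 maps to on the machine in question is to ask the local errno tables. This is a minimal Python sketch using only the standard errno and os modules; the helper name is mine. On Linux I would expect 117 to come back as EUCLEAN, but treat that as something to verify on your own system, since errno numbering is platform-specific.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic_name, message) for an errno value on this platform."""
    name = errno.errorcode.get(code, "unknown")
    try:
        message = os.strerror(code)
    except ValueError:
        # Some platforms reject numbers they have no message for.
        message = "unknown"
    return name, message
```

Running describe_errno(117) on the affected instance shows how the C library there names the value that xfs_itobp() returned.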


> On Wed, Mar 6, 2013 at 7:55 PM, Ric Wheeler <rwheeler@xxxxxxxxxx> wrote:
>     I would suggest contacting Amazon's customer support channel (or the
>     vendor you paid for the Linux instance you are running).
>     XFS developer list is probably not the correct forum to help you debug
>     this :)
>     Good luck!
>     Ric
>     On 03/06/2013 08:12 AM, Supratik Goswami wrote:
>         Have we created a ticket with AWS ?
>         It could be an EBS issue who knows, we need to confirm that first.
>         --
>         Warm Regards
>         Supratik
>         On Wed, Mar 6, 2013 at 6:38 PM, Ric Wheeler <rwheeler@xxxxxxxxxx> wrote:
>             On 03/06/2013 08:03 AM, Shrinath M wrote:
>                 On Wed, Mar 6, 2013 at 6:29 PM, Ric Wheeler <rwheeler@xxxxxxxxxx> wrote:
>                     I think that you would need to verify that the Amazon storage is not
>                     throwing errors - do your logs show IO errors or issues before XFS
>                     hits an issue?
>                 No IO errors in /var/log/messages.
>                 Where else should I be looking?
>             Feb 12 19:47:18 ip-100-0-100-1 kernel: [2541168.023638] XFS (md0): I/O Error Detected. Shutting down filesystem
>             Is an IO error from MD.
>             I would suggest trying to reproduce without MD in the picture first -
>             always best to try to reproduce with the simplest setup first and work
>             your way up the complexity ladder,
>             Ric
>         _________________________________________________
>         xfs mailing list
>         xfs@xxxxxxxxxxx
>         http://oss.sgi.com/mailman/listinfo/xfs
> -- 
> Regards
> *Shrinath.M*
