
Re: storage, libaio, or XFS problem? 3.4.26

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: storage, libaio, or XFS problem? 3.4.26
From: stan hoeppner <stan@xxxxxxxxxxxxxxxxx>
Date: Sat, 20 Sep 2014 14:47:20 -0500
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <540DC78C.4010607@xxxxxxxxxxxxxxxxx>
References: <20140828230817.GU20518@dastard> <2d2ce7bb38c00a7d35f4a324f6a36cbb@localhost> <20140829235538.GF20518@dastard> <d20fe777ec1fd318ae5d4054dffda3f4@localhost> <20140831235749.GH20518@dastard> <5403E9B9.7040608@xxxxxxxxxxxxxxxxx> <20140901234529.GI20518@dastard> <5405FB19.2020208@xxxxxxxxxxxxxxxxx> <20140902221915.GK20518@dastard> <540BEBB7.7020306@xxxxxxxxxxxxxxxxx> <20140907233910.GA30012@dastard> <540DC78C.4010607@xxxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Icedove/24.7.0
On 09/08/2014 10:13 AM, stan hoeppner wrote:
> On 09/07/2014 06:39 PM, Dave Chinner wrote:
>> On Sun, Sep 07, 2014 at 12:23:03AM -0500, stan hoeppner wrote:
>>> I have some more information regarding the AIO issue.  I fired up the
>>> test harness and it ran for 30 hours at 706 MB/s avg write rate, 303
>>> MB/s per LUN, nearly flawlessly, less than 0.01% buffer loss, and avg IO
>>> times were less than 0.5 seconds.  Then the app crashed and I found the
>>> following in dmesg.  I had to "hard reset" the box due to the shrapnel.
>>>  There are no IO errors of any kind leading up to the forced shutdown.
>>> I assume the inode update and streamRT-sa hung task traces are a result
>>> of the forced shutdown, not a cause of it.  In lieu of an xfs_repair
>>> with a version newer than I'm able to install, any ideas what caused the
>>> forced shutdown after 30 hours, given there are no errors preceding it?
>>> Sep  6 06:33:33 Anguish-ssu-1 kernel: [288087.334863] XFS (dm-5):
>>> xfs_do_force_shutdown(0x8) called from line 3732 of file
>>> fs/xfs/xfs_bmap.c.  Return address = 0xffffffffa02009a6
>>> Sep  6 06:33:42 Anguish-ssu-1 kernel: [288096.220920] XFS (dm-5): failed
>>> to update timestamps for inode 0x2ffc9caae
>> Hi Stan, can you turn off line wrapping for stuff you paste
>> in? It's all but unreadable when it line wraps like this.
> Sorry.  I switched my daily desktop from Windows/Tbird to Wheezy/Icedove
> and I haven't tweaked it out much yet.  I set hard wrap at 72 and that's
> the problem.  I'll set flowed format and see if that helps.
>> Next, you need to turn /proc/sys/fs/xfs/error_level up to 11 so that
>> it dumps a stack trace on corruption events. I don't have a (I can't
>> remember what kernel version you are running) tree in front of me to
>> convert that line number to something meaningful, so it's not a
>> great help...
> error_level is now 11 on both systems and will survive reboots.  It's
> kernel 3.4.26.
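For anyone following along, here is a minimal sketch of checking the current value and producing a sysctl line that would persist it.  The /proc path is the one Dave names above; the helper names are made up for illustration, and how the setting was actually made persistent on these boxes is not stated in this thread, so treat the sysctl.conf approach as an assumption.

```python
from pathlib import Path

# Path named earlier in the thread; it only exists when XFS is loaded.
PROC_PATH = Path("/proc/sys/fs/xfs/error_level")


def sysctl_key(proc_path):
    """Convert a /proc/sys path into the dotted sysctl key name."""
    return ".".join(Path(proc_path).relative_to("/proc/sys").parts)


def sysctl_conf_line(proc_path, value):
    """Render a line suitable for a sysctl configuration file."""
    return f"{sysctl_key(proc_path)} = {value}"


def read_level(proc_path=PROC_PATH):
    """Return the current level, or None if the file is absent."""
    try:
        return int(Path(proc_path).read_text().strip())
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    print("current error_level:", read_level())
    # Assumption: the line below goes into whatever sysctl config file
    # the distribution reads at boot.
    print("persist with:", sysctl_conf_line(PROC_PATH, 11))
```

Reading the value needs no special privileges; changing it at runtime requires root.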
>> Was there anything in the logs before the shutdown?  i.e. can you
>> paste the dmesg output from the start of the test (i.e. the mount of
>> the fs) to the end?
> They have this setup in a quasi production/test manner, which is
> frustrating.  The two test rigs PXE/tftp boot and mount rootfs on NFS.
> Both systems remote-log kern.log into a single file on the boot
> server, so I grep for hostname.  dmesg isn't logged remotely, and is
> lost after a reboot.  So I don't have the mount entries for some reason.
>  It seems kern.log doesn't get populated with all the stuff that goes
> into dmesg.  I'll be sure to grab all of dmesg next time before
> rebooting.  However, I don't recall any errors of any kind prior to the
> shutdown, which in itself is strange.
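Since both rigs log into one kern.log, a rough sketch of pulling out one host's lines and picking apart the forced-shutdown message may be useful.  It assumes the usual syslog layout seen in the lines pasted above, with each message unwrapped onto a single line; the function names are illustrative only and not part of any existing tool.

```python
import re

# Matches the forced-shutdown message as it appears in the pasted log,
# once the hard-wrapped pieces are joined back into a single line.
SHUTDOWN_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*(?P<uptime>[\d.]+)\]\s+XFS \((?P<dev>[^)]+)\):\s+"
    r"xfs_do_force_shutdown\((?P<flags>0x[0-9a-fA-F]+)\) called from line "
    r"(?P<line>\d+) of file (?P<file>\S+?)\.\s+"
    r"Return address = (?P<ret>0x[0-9a-fA-F]+)"
)


def lines_for_host(lines, host):
    """Yield only the syslog lines whose hostname field equals host."""
    for line in lines:
        fields = line.split(None, 4)  # month, day, time, host, rest
        if len(fields) > 3 and fields[3] == host:
            yield line


def parse_shutdown(line):
    """Return the shutdown fields as a dict, or None if no match."""
    m = SHUTDOWN_RE.match(line)
    return m.groupdict() if m else None


if __name__ == "__main__":
    sample = (
        "Sep  6 06:33:33 Anguish-ssu-1 kernel: [288087.334863] XFS (dm-5): "
        "xfs_do_force_shutdown(0x8) called from line 3732 of file "
        "fs/xfs/xfs_bmap.c.  Return address = 0xffffffffa02009a6"
    )
    for hit in lines_for_host([sample], "Anguish-ssu-1"):
        print(parse_shutdown(hit))
```

This only helps with what actually reaches kern.log; anything that lives solely in dmesg still has to be saved before the reboot.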

Hi Dave,

Long story short, I was able to get 3.12.26 installed and ran the test
harness for 96 hours without problems.  Previously it usually puked
within 30 hours or much sooner.  Just prior to this run, new firmware
was loaded onto the controllers, which decreased throughput by ~35% and
increased IO latency by ~25x.  So I'm not sure whether the new kernel
fixed this problem or the new controller firmware did.  The old firmware
(one of about 50 previous binary loads) is due to be reloaded Monday so
we can test against it with 3.12.26 and hopefully put this AIO issue to
rest.

