Failing XFS filesystem underlying Ceph OSDs
Alex Gorbachev
ag at iss-integration.com
Thu Mar 10 21:26:52 CST 2016
Final update: no issues for many months since setting vm.min_free_kbytes to 1 GB!
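
For anyone wanting to try the same thing, this is roughly what we run now (a
sketch of our own settings, not a general recommendation; 1 GB expressed in
kilobytes is 1048576, and the right value will depend on your RAM and workload):

sysctl vm.swappiness=1
sysctl vm.min_free_kbytes=1048576

To persist across reboots, the same values can also go into /etc/sysctl.conf
(or a file under /etc/sysctl.d/) as "vm.swappiness = 1" and
"vm.min_free_kbytes = 1048576", then applied with "sysctl -p".
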
On Thursday, August 13, 2015, Alex Gorbachev <ag at iss-integration.com> wrote:
> Good morning,
>
> We have experienced one more failure like the ones originally described.
> I am assuming that vm.min_free_kbytes at 256 MB helped (only one hit: an OSD
> went down, but the rest of the cluster stayed up, unlike the previous massive
> storms). So I went ahead and increased vm.min_free_kbytes to 1 GB.
>
> I do not know of any way to reproduce the problem, or what causes it; we are
> not aware of any unusual IO pattern at the time.
>
> Thanks,
> Alex
>
> On Wed, Jul 22, 2015 at 8:23 AM, Alex Gorbachev <ag at iss-integration.com> wrote:
>
>> Hi Dave,
>>
>> On Mon, Jul 6, 2015 at 8:35 PM, Dave Chinner <david at fromorbit.com> wrote:
>>
>>> On Mon, Jul 06, 2015 at 03:20:19PM -0400, Alex Gorbachev wrote:
>>> > > On Sun, Jul 5, 2015 at 7:24 PM, Dave Chinner <david at fromorbit.com> wrote:
>>> > > On Sun, Jul 05, 2015 at 12:25:47AM -0400, Alex Gorbachev wrote:
>>> > > > > > sysctl vm.swappiness=20 (can probably be 1 as per article)
>>> > > > > >
>>> > > > > > sysctl vm.min_free_kbytes=262144
>>> > > > >
>>> > > [...]
>>> > > >
>>> > > > We have experienced the problem in various guises with kernels 3.14,
>>> > > > 3.19, 4.1-rc2 and now 4.1, so it's not new to us, just a different
>>> > > > error stack. Below are some other stack dumps of what manifested as
>>> > > > the same error.
>>> > > >
>>> > > > [<ffffffff817cf4b9>] schedule+0x29/0x70
>>> > > > [<ffffffffc07caee7>] _xfs_log_force+0x187/0x280 [xfs]
>>> > > > [<ffffffff810a4150>] ? try_to_wake_up+0x2a0/0x2a0
>>> > > > [<ffffffffc07cb019>] xfs_log_force+0x39/0xc0 [xfs]
>>> > > > [<ffffffffc07d6542>] xfsaild_push+0x552/0x5a0 [xfs]
>>> > > > [<ffffffff817d2264>] ? schedule_timeout+0x124/0x210
>>> > > > [<ffffffffc07d662f>] xfsaild+0x9f/0x140 [xfs]
>>> > > > [<ffffffffc07d6590>] ? xfsaild_push+0x5a0/0x5a0 [xfs]
>>> > > > [<ffffffff81095e29>] kthread+0xc9/0xe0
>>> > > > [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
>>> > > > [<ffffffff817d3718>] ret_from_fork+0x58/0x90
>>> > > > [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
>>> > > > INFO: task xfsaild/sdg1:2606 blocked for more than 120 seconds.
>>> > > > Not tainted 3.19.4-031904-generic #201504131440
>>> > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>> > > message.
>>> > >
>>> > > That's indicative of IO completion problems, but not a crash.
>>> > >
>>> > > > BUG: unable to handle kernel NULL pointer dereference at (null)
>>> > > > IP: [<ffffffffc04be80f>] xfs_count_page_state+0x3f/0x70 [xfs]
>>> > > ....
>>> > > > [<ffffffffc04be880>] xfs_vm_releasepage+0x40/0x120 [xfs]
>>> > > > [<ffffffff8118a7d2>] try_to_release_page+0x32/0x50
>>> > > > [<ffffffff8119fe6d>] shrink_page_list+0x69d/0x720
>>> > > > [<ffffffff811a058d>] shrink_inactive_list+0x1dd/0x5d0
>>> > > ....
>>> > >
>>> > > Again, this is indicative of a page cache issue: a page without
>>> > > buffers has been passed to xfs_vm_releasepage(), which implies the
>>> > > page flags are not correct, i.e. PAGE_FLAGS_PRIVATE is set but
>>> > > page->private is null...
>>> > >
>>> > > Again, this is unlikely to be an XFS issue.
>>> > >
>>> >
>>> > Sorry for my ignorance, but would this likely come from Ceph code or a
>>> > hardware issue of some kind, such as a disk drive? I have reached out to
>>> > RedHat and the Ceph community on that as well.
>>>
>>> More likely a kernel bug somewhere in the page cache or memory
>>> reclaim paths. The issue is that we only notice the problem long
>>> after it has occurred. i.e. when XFS goes to tear down the page it has
>>> been handed, the page is already in a bad state and so it doesn't
>>> really tell us anything about the cause of the problem.
>>>
>>> Realistically, we need a script that reproduces the problem (one that
>>> doesn't require a Ceph cluster) to be able to isolate the cause.
>>> In the meantime, you can always try running with CONFIG_XFS_WARN=y to
>>> see if that catches problems earlier, and you might also want to do
>>> things like turn on memory poisoning and other kernel debugging
>>> options to try to isolate the cause of the issue....
>>>
>>
>> We have been error-free for almost 3 weeks now with these changes:
>>
>> vm.swappiness=1
>> vm.min_free_kbytes=262144
>>
>> I wonder if this is related to us using high-speed Areca HBAs with RAM
>> writeback cache and having had vm.swappiness=0 previously. Possibly the
>> HBA was handing down a large chunk of IO very fast and the page cache was
>> not able to handle it with swappiness=0. I will keep monitoring, but thank
>> you very much for the analysis and info.
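>>
>> If the problem recurs, this is roughly the debug build we would try per your
>> suggestion (a sketch only, not something we have run yet; the exact option
>> names should be double-checked against the specific kernel version):
>>
>> CONFIG_XFS_WARN=y          # ASSERT-style checks as warnings, without full CONFIG_XFS_DEBUG
>> CONFIG_DEBUG_PAGEALLOC=y   # unmap freed pages so stray accesses fault sooner
>> CONFIG_SLUB_DEBUG_ON=y     # slab poisoning/red-zoning (or boot with slub_debug=FZP)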
>>
>> Alex
>>
>>
>>
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> david at fromorbit.com
>>> <javascript:_e(%7B%7D,'cvml','david at fromorbit.com');>
>>>
>>
>>
>
--
Alex Gorbachev
Storcium