Failing XFS filesystem underlying Ceph OSDs
Alex Gorbachev
ag at iss-integration.com
Mon Jul 6 14:20:19 CDT 2015
Thank you Dave,
On Sun, Jul 5, 2015 at 7:24 PM, Dave Chinner <david at fromorbit.com> wrote:
> [ Please turn off line wrap when pasting kernel traces ]
>
Noted, sorry, I thought wrap was off, but Google must have it on by default.
>
> On Sun, Jul 05, 2015 at 12:25:47AM -0400, Alex Gorbachev wrote:
> > > > sysctl vm.swappiness=20 (can probably be 1 as per article)
> > > >
> > > > sysctl vm.min_free_kbytes=262144
> > >
> [...]
> >
> > We have experienced the problem in various guises with kernels 3.14,
> > 3.19, 4.1-rc2 and now 4.1, so it's not new to us, just different error stack.
> > Below are some other stack dumps of what manifested as the same error.
> >
> > [<ffffffff817cf4b9>] schedule+0x29/0x70
> > [<ffffffffc07caee7>] _xfs_log_force+0x187/0x280 [xfs]
> > [<ffffffff810a4150>] ? try_to_wake_up+0x2a0/0x2a0
> > [<ffffffffc07cb019>] xfs_log_force+0x39/0xc0 [xfs]
> > [<ffffffffc07d6542>] xfsaild_push+0x552/0x5a0 [xfs]
> > [<ffffffff817d2264>] ? schedule_timeout+0x124/0x210
> > [<ffffffffc07d662f>] xfsaild+0x9f/0x140 [xfs]
> > [<ffffffffc07d6590>] ? xfsaild_push+0x5a0/0x5a0 [xfs]
> > [<ffffffff81095e29>] kthread+0xc9/0xe0
> > [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
> > [<ffffffff817d3718>] ret_from_fork+0x58/0x90
> > [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
> > INFO: task xfsaild/sdg1:2606 blocked for more than 120 seconds.
> > Not tainted 3.19.4-031904-generic #201504131440
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>
> That's indicative of IO completion problems, but not a crash.
>
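For anyone triaging similar logs, here is a minimal sketch (Python, not something that was run on these systems) that pulls the hung-task warnings out of saved kernel log text so they can be counted per task. The function names and the sample string are illustrative only; the pattern assumes nothing beyond the message format shown in the trace above, and it will miss task names that contain spaces.

```python
import re
from collections import Counter

# Matches the kernel's hung-task warning, e.g.
# "INFO: task xfsaild/sdg1:2606 blocked for more than 120 seconds."
# Assumption: the task name contains no whitespace.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a list of (task_name, pid, seconds) for each warning found."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]


def count_by_task(log_text):
    """Count hung-task warnings per task name."""
    return Counter(comm for comm, _pid, _secs in find_hung_tasks(log_text))


# Illustrative input copied from the trace quoted in this thread.
sample = """\
INFO: task xfsaild/sdg1:2606 blocked for more than 120 seconds.
Not tainted 3.19.4-031904-generic #201504131440
"""

hung = find_hung_tasks(sample)
counts = count_by_task(sample)
```

Feeding it the output of dmesg saved to a file would show whether the stalls cluster on particular xfsaild threads (and so particular devices), which may help narrow down an IO completion problem.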
> > BUG: unable to handle kernel NULL pointer dereference at (null)
> > IP: [<ffffffffc04be80f>] xfs_count_page_state+0x3f/0x70 [xfs]
> ....
> > [<ffffffffc04be880>] xfs_vm_releasepage+0x40/0x120 [xfs]
> > [<ffffffff8118a7d2>] try_to_release_page+0x32/0x50
> > [<ffffffff8119fe6d>] shrink_page_list+0x69d/0x720
> > [<ffffffff811a058d>] shrink_inactive_list+0x1dd/0x5d0
> ....
>
> Again, this is indicative of a page cache issue: a page without
> buffers has been passed to xfs_vm_releasepage(), which implies the
> page flags are not correct. i.e PAGE_FLAGS_PRIVATE is set but
> page->private is null...
>
> Again, this is unlikely to be an XFS issue.
>
Sorry for my ignorance, but would this more likely come from Ceph code or from
a hardware issue of some kind, such as a disk drive? I have reached out to
Red Hat and the Ceph community about that as well.
Thank you,
Alex
>
> > Do you think we need to look at RAM handling by this Supermicro machine
> > type?
>
> Not sure what you mean by that. Problems like this can be caused by
> bad hardware, but it's unusual for a machine using ECC memory to
> have undetected RAM problems...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david at fromorbit.com
>