"XFS: possible memory allocation deadlock in kmem_alloc" on high memory machine
Dave Chinner
david at fromorbit.com
Wed Jun 3 18:11:00 CDT 2015
On Wed, Jun 03, 2015 at 09:07:25AM +0200, Anders Ossowicki wrote:
> On Wed, Jun 03, 2015 at 03:52:45AM +0200, Dave Chinner wrote:
> > On Tue, Jun 02, 2015 at 02:06:48PM +0200, Anders Ossowicki wrote:
> >
> > > Slab: 79729144 kB
> > > SReclaimable: 79040008 kB
> >
> > 80GB of slab caches as well - what is the output of /proc/slabinfo?
>
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
...
> xfs_ili 1066228 1066625 152 53 2 : tunables 0 0 0 : slabdata 20125 20125 0
> xfs_inode 2522728 2523172 1024 32 8 : tunables 0 0 0 : slabdata 78857 78857 0
> dentry 3217866 3702384 192 42 2 : tunables 0 0 0 : slabdata 88152 88152 0
> buffer_head 370050715 400741536 104 39 1 : tunables 0 0 0 : slabdata 10275424 10275424 0
> radix_tree_node 64025078 64148728 584 56 8 : tunables 0 0 0 : slabdata 1145751 1145751 0
.....
> Slab: 82699516 kB
> SReclaimable: 81921588 kB
....
So 400 million bufferheads (consuming ~40GB RAM) and 64 million radix
tree nodes (consuming ~35GB RAM) are where all that memory is. That's
being used to track the 2.8TB of page cache data (roughly 3% memory
overhead).
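(For anyone following along: those figures fall straight out of the
num_objs and objsize columns quoted above - e.g. 400741536 * 104 bytes
is roughly 40GB for buffer_head. A rough awk sketch along these lines,
assuming the standard slabinfo 2.1 column order and a readable
/proc/slabinfo, will reproduce them:

  # back-of-the-envelope slab footprint: num_objs ($3) * objsize ($4)
  awk '/^buffer_head|^radix_tree_node|^xfs_inode/ \
       { printf "%-16s %8.1f GiB\n", $1, $3 * $4 / 2^30 }' /proc/slabinfo
)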
Ok, nothing unusual there, but it demonstrates why I want to get rid
of bufferheads.....
> > > We have three hardware raid'ed disks with XFS on them, one of which receives
> > > the bulk of the load. This is a raid 50 volume on SSDs with the raid controller
> > > running in writethrough mode.
> >
> > It doesn't seem like writeback of dirty pages is the problem; more
> > the case that the page cache is ridiculously huge and not being
> > reclaimed in a sane manner. Do you really need 2.8TB of cached file
> > data in memory for performance?
>
> Yeah, disk cache is the primary reason for stuffing memory into that machine.
Hmmmm. I don't think anyone has considered using the page cache for
caching at this scale before. Normally this amount of memory is
needed by applications in their process space, not as a disk buffer
to avoid disk IO. You've only got a 12TB filesystem, which means
you're keeping roughly 25% of it in the page cache at any given time,
so I'm not surprised that the page cache reclaim algorithms are
having trouble....
I don't think there's anything on the XFS side we can do here to
improve the situation you are in - it appears that it's memory
reclaim and compaction that aren't working well enough to sustain
your workload on that platform....
OTOH, have you considered using something like dm-cache with a huge
ramdisk as the cache device and running it in write-through mode so
that power failure doesn't result in data loss or filesystem
corruption?
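Roughly, that would look something like the sketch below - a brd
ramdisk split into a small metadata device and a large cache device,
stacked under the origin disk with a dm-cache table in writethrough
mode. The device names, sizes and block size here are purely
illustrative, not a tested recipe:

  # illustrative only - names, sizes and layout are made up
  modprobe brd rd_nr=1 rd_size=268435456   # 256GiB ramdisk -> /dev/ram0
  # carve /dev/ram0 into cache-meta and cache-data (partitions or
  # dm-linear targets), then stack the cache over the origin (/dev/sdb here):
  dmsetup create fastcache --table \
    "0 $(blockdev --getsz /dev/sdb) cache /dev/mapper/cache-meta /dev/mapper/cache-data /dev/sdb 512 1 writethrough default 0"
  mount /dev/mapper/fastcache /data

The cache contents evaporate on power loss, but with writethrough every
write also hits the origin device, so the filesystem stays consistent.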
Cheers,
Dave.
--
Dave Chinner
david at fromorbit.com