[BUG REPORT] missing memory counter introduced by xfs
Lin Feng
linf at chinanetcenter.com
Fri Sep 9 01:32:18 CDT 2016
Hi Dave,
One final concept about XFS is still not clear to me; please see below.
On 09/09/2016 04:44 AM, Dave Chinner wrote:
> On Thu, Sep 08, 2016 at 06:07:45PM +0800, Lin Feng wrote:
>> Hi Dave,
>>
>> Thank you for your fast reply; please see below.
>>
>> On 09/08/2016 05:22 AM, Dave Chinner wrote:
>>> On Wed, Sep 07, 2016 at 06:36:19PM +0800, Lin Feng wrote:
>>>> Hi all nice xfs folks,
>>>>
>>>> I'm still quite new to XFS, and I have run into the same issue as the
>>>> one described in the following thread:
>>>> http://oss.sgi.com/archives/xfs/2014-04/msg00058.html
>>>>
>>>> On my box (a cephfs OSD on XFS, kernel 2.6.32-358) I summed every
>>>> memory counter I could find, but nearly 26GB of memory seemed to have
>>>> gone missing; it comes back after I echo 2 >
>>>> /proc/sys/vm/drop_caches, so it seems this memory can be reclaimed by
>>>> slab.
>>>
>>> It isn't "reclaimed by slab". The XFS metadata buffer cache is
>>> reclaimed by a memory shrinker, which are for reclaiming objects
>>> from caches that aren't the page cache. "echo 2 >
>>> /proc/sys/vm/drop_caches" runs the memory shrinkers rather than page
>>> cache reclaim. Many slab caches are backed by memory shrinkers,
>>> which is why it is thought that "2" is "slab reclaim"....
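(For my own record, this matches what I observed: a rough before/after check
of /proc/meminfo around the drop_caches write -- just the commands I use, no
output pasted here, and the exact fields may vary a little by kernel:)

  # grep -E '^(MemFree|Cached|Slab|SReclaimable):' /proc/meminfo
  # echo 2 > /proc/sys/vm/drop_caches
  # grep -E '^(MemFree|Cached|Slab|SReclaimable):' /proc/meminfo

(After the write, MemFree jumps by far more than Slab shrinks, which I think
fits the explanation that the buffer pages themselves are not accounted
anywhere.)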
>>>
>>>> And according to what David said replying in the list:
>>> ..
>>>> That's where your memory is - in metadata buffers. The xfs_buf slab
>>>> entries are just the handles - the metadata pages in the buffers
>>>> usually take much more space and it's not accounted to the slab
>>>> cache nor the page cache.
>>>
>>> That's exactly the case.
>>>
>>>> Minimum / Average / Maximum Object : 0.02K / 0.33K / 4096.00K
>>>>
>>>> OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
>>>> 4383036 4383014 99% 1.00K 1095759 4 4383036K xfs_inode
>>>> 5394610 5394544 99% 0.38K 539461 10 2157844K xfs_buf
>>>
>>> So, you have *5.4 million* active metadata buffers. Each buffer will
>>> hold 1 or 2 4k pages on your kernel, so simple math says 4M * 4k +
>>> 1.4M * 8k = 26G. There's no missing counter here....
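(Writing the arithmetic out for myself: roughly 4,000,000 single-page
buffers * 4 KiB ≈ 16.4 GB, plus roughly 1,400,000 two-page buffers * 8 KiB
≈ 11.5 GB, i.e. about 28 GB ≈ 26 GiB in total -- almost exactly the amount
I could not account for.)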
>>
>> Do xattrs contribute to such metadata buffers, or is there something else?
>
> xattrs are metadata, so if they don't fit in line in the inode
> (typical for ceph because it uses xattrs larger than 256 bytes) then
> they are held in external blocks which are cached in the buffer
> cache.
>
So the 'buffer cache' you mean here is the set of pages held by xfs_buf
structures, used to hold the xattrs when the inode's inline space
overflows, and not the 'buffers/cache' shown by the free command; those
pages won't be reflected in free's cache field, right?
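(If I have that right, the check I can do on my side is to sum the usual
counters from /proc/meminfo and confirm they stay roughly 26GB short of
MemTotal until the shrinkers run -- a rough sum only, it ignores the
smaller counters:)

  # awk '/^(MemFree|Buffers|Cached|AnonPages|Slab|PageTables):/ {s += $2}
         END {printf "%.1f GiB accounted\n", s/1024/1024}' /proc/meminfo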
>> After consulting my teammate, I learned that in our case the small files
>> (there are a lot of them, see below) always carry xattrs.
>
> Which means that if you have 4.4M cached inodes, you probably have
> ~4.4M xattr metadata buffers in cache for those inodes, too.
>
>> Another question: do we need to export such a counter, or do we have to
>> redo this computation every time to work out whether we are leaking
>> memory? More importantly, it seems this memory has a low priority for
>> the memory reclaim mechanism; is that because most of the slab objects
>> are active?
>
> "active" slab objects simply mean they are allocated. It does not
> mean they are cached or imply anything else about the object's life
> cycle.
Sorry, I misunderstood what 'active' means for slab objects; thanks for the explanation.
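(Noting it down for anyone else who misreads it like I did: the ACTIVE
column in the slabtop output above is just the count of allocated objects
per cache; the raw numbers come from /proc/slabinfo, whose header line
documents the columns:)

  # head -2 /proc/slabinfo
  # grep -E '^xfs_(inode|buf) ' /proc/slabinfo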
>
>>>> OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
>>>> 4383036 4383014 99% 1.00K 1095759 4 4383036K xfs_inode
>>>> 5394610 5394544 99% 0.38K 539461 10 2157844K xfs_buf
>>
>> In fact XFS eats a lot of my RAM, and I would never have known where it
>> went without diving into the XFS source; at least I'm only the second
>> such extreme user ;-)
>>
>>>
>>> Obviously your workload is doing something extremely metadata
>>> intensive to have a cache footprint like this - you have more cached
>>> buffers than inodes, dentries, etc. That in itself is very unusual -
>>> can you describe what is stored on that filesystem and how large the
>>> attributes being stored in each inode are?
>>
>> The filesystem-user behaviour is that the ceph-osd daemon intensively
>> pulls/synchronizes/updates files from other OSDs when the server comes
>> up. In our case the cephfs OSDs store a lot of small pictures; from some
>> simple analysis there are nearly 3,000,000 files on each disk, and there
>> are 10 such disks.
>> [root@wzdx49 osd.670]# find current -type f -size -512k | wc -l
>> 2668769
>> [root@wzdx49 ~]# find /data/osd/osd.67 -type f | wc -l
>> 2682891
>> [root@wzdx49 ~]# find /data/osd/osd.67 -type d | wc -l
>> 109760
>
> Yup, that's a pretty good indication that you have a high metadata
> to data ratio in each filesystem, and that ceph is accessing the
> metadata more intensively than the data. The fact that the metadata
> buffer count roughly matches the cached inode count tells me that
> the memory reclaim code is being fairly balanced about what it
> reclaims under memory pressure - I think the problem here is more
> that you didn't know where the memory was being used than anything
> else....
Yes, that's exactly why I sent this mail.
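As a follow-up to the earlier question about how large the stored
attributes are, I will sample a few objects roughly like this and report
back -- the file is just whatever find returns first, and the byte count of
the hex dump is only a rough upper bound on the per-file xattr payload:

  # f=$(find /data/osd/osd.67 -type f | head -n 1)
  # getfattr -d -m - -e hex "$f" | wc -c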
Again, thanks for your detailed explanation.
Best regards,
linfeng