Is XFS suitable for 350 million files on 20TB storage?
Stefan Priebe
s.priebe at profihost.ag
Sat Sep 6 02:35:15 CDT 2014
Hi Dave,
Am 06.09.2014 01:05, schrieb Dave Chinner:
> On Fri, Sep 05, 2014 at 02:40:32PM +0200, Stefan Priebe - Profihost AG wrote:
>>
>> Am 05.09.2014 um 14:30 schrieb Brian Foster:
>>> On Fri, Sep 05, 2014 at 11:47:29AM +0200, Stefan Priebe - Profihost AG wrote:
>>>> Hi,
>>>>
>>>> I have a backup system running 20TB of storage holding 350 million
>>>> files. This was working fine for months.
>>>>
>>>> But now the free space is so heavily fragmented that I only see
>>>> kworker threads at 4x 100% CPU and write speed being very slow.
>>>> 15TB of the 20TB are in use.
>
> What does perf tell you about the CPU being burnt? (i.e run perf top
> for 10-20s while that CPU burn is happening and paste the top 10 CPU
> consuming functions).
here we go:
 15,79%  [kernel]  [k] xfs_inobt_get_rec
 14,57%  [kernel]  [k] xfs_btree_get_rec
 10,37%  [kernel]  [k] xfs_btree_increment
  7,20%  [kernel]  [k] xfs_btree_get_block
  6,13%  [kernel]  [k] xfs_btree_rec_offset
  4,90%  [kernel]  [k] xfs_dialloc_ag
  3,53%  [kernel]  [k] xfs_btree_readahead
  2,87%  [kernel]  [k] xfs_btree_rec_addr
  2,80%  [kernel]  [k] _xfs_buf_find
  1,94%  [kernel]  [k] intel_idle
  1,49%  [kernel]  [k] _raw_spin_lock
  1,13%  [kernel]  [k] copy_pte_range
  1,10%  [kernel]  [k] unmap_single_vma
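For anyone wanting to reproduce this kind of profile non-interactively,
something along these lines should give the same picture (a sketch; the
15s sampling window is arbitrary):

  perf record -a -g -- sleep 15   # sample all CPUs with call graphs
  perf report --stdio             # print the top consuming functions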
>>>>
>>>> Overall there are 350 million files - all in different
>>>> directories, with at most 5000 per dir.
>>>>
>>>> Kernel is 3.10.53 and mount options are:
>>>> noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota
>>>>
>>>> # xfs_db -r -c freesp /dev/sda1
>>>>  from     to    extents      blocks    pct
>>>>     1      1   29484138    29484138   2,16
>>>>     2      3   16930134    39834672   2,92
>>>>     4      7   16169985    87877159   6,45
>>>>     8     15   78202543   999838327  73,41
>
> With an inode size of 256 bytes, this is going to be your real
> problem soon - most of the free space is smaller than an inode
> chunk so soon you won't be able to allocate new inodes, even though
> there is free space on disk.
>
> Unfortunately, there's not much we can do about this right now - we
> need development in both user and kernel space to mitigate this
> issue: sparse inode chunk allocation in kernel space, and free space
> defragmentation in userspace. Both are on the near term development
> list....
>
> Also, the fact that there are almost 80 million 8-15 block extents
> indicates that the CPU burn is likely coming from the by-size free
> space search. We look up the first extent of the correct size, and
> then do a linear search for a nearest extent of that size to the
> target. Hence we could be searching millions of extents to find the
> "nearest"....
>
>>>>    16     31    3562456    83746085   6,15
>>>>    32     63    2370812   102124143   7,50
>>>>    64    127     280885    18929867   1,39
>>>>   256    511          2         827   0,00
>>>>   512   1023         65       35092   0,00
>>>>  2048   4095          2        6561   0,00
>>>> 16384  32767          1       23951   0,00
>>>>
>>>> Is there anything i can optimize? Or is it just a bad idea to do this
>>>> with XFS?
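As a side note, the freesp histogram above can be condensed, and the
geometry discussed in this thread (inode size, AG count) inspected,
with something like the following - the mount point is made up:

  xfs_db -r -c 'freesp -s' /dev/sda1   # histogram plus summary totals
  xfs_info /backup                     # shows isize=..., agcount=..., etc.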
>
> No, it's not a bad idea. In fact, if you have this sort of use case,
> XFS is really your only choice. In terms of optimisation, the only
> thing that will really help performance is the new finobt structure.
> That's a mkfs option and not an in-place change, though, so it's
> unlikely to help.
I've no problem with reformatting the array. I have other backups.
> FWIW, it may also help the aging characteristics of this sort of
> workload by improving inode allocation layout. That would be a side
> effect of being able to search the entire free inode tree extremely
> quickly, rather than allocating new chunks just to keep down the CPU
> time spent searching the allocated inode tree for free inodes. Hence
> it would tend to pack inode chunks more tightly when they are
> allocated on disk, as it will fill partially used chunks completely
> before allocating new ones elsewhere.
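For what it's worth: whether an existing filesystem was made with
finobt can be checked after the fact, assuming an xfsprogs recent
enough to know about the feature (mount point made up):

  xfs_info /backup | head -3   # meta-data lines should show crc=1 finobt=1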
>
>>>> Any other options? Maybe rsync options like --inplace /
>>>> --no-whole-file?
>
> For 350M files? I doubt there's much you can really do. Any sort of
> large scale re-organisation is going to take a long, long time and
> require lots of IO. If you are going to take that route, you'd do
> better to upgrade kernel and xfsprogs, then dump/mkfs.xfs -m
> crc=1,finobt=1/restore. And you'd probably want to use a
> multi-stream dump/restore so it can run operations concurrently and
> hence at storage speed rather than being CPU bound....
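A rough sketch of that dump/mkfs/restore cycle, assuming two dump
streams and made-up paths - see xfsdump(8)/xfsrestore(8) for the exact
multi-stream options:

  # level-0 dump split across two streams (one -f/-M pair per stream)
  xfsdump -l 0 -L bkp -M str1 -M str2 \
          -f /mnt/spare/stream1 -f /mnt/spare/stream2 /backup
  # remake with CRCs and the free inode btree, then restore
  mkfs.xfs -f -m crc=1,finobt=1 /dev/sda1
  mount /dev/sda1 /backup
  xfsrestore -f /mnt/spare/stream1 -f /mnt/spare/stream2 /backup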
I don't need a backup, so reformatting is possible, but I'd really like
to stay on 3.10. Is there anything I can backport, or do I really need
to upgrade? If so, which version at least?
> Also, if the problem really is the number of identically sized free
> space fragments in the freespace btrees, then the initial solution
> is, again, a mkfs one. i.e. remake the filesystem with more, smaller
> AGs to keep the number of extents the btrees need to index down to a
> reasonable level. Say a couple of hundred AGs rather than 21?
mkfs chose 21 automagically - it's nothing I set. Is this a bug, or do
I just need more AGs because of my special use case?
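If I understand the mkfs defaults correctly, that's not a bug: an
allocation group is capped at 1TB, so a ~20TB device gets about 21 AGs
unless told otherwise. A remake along the lines Dave suggests might
look like this (agcount=200 is only illustrative):

  mkfs.xfs -f -d agcount=200 -m crc=1,finobt=1 /dev/sda1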
Thanks!
Stefan
>>> If so, I wonder if something like the
>>> following commit introduced in 3.12 would help:
>>>
>>> 133eeb17 xfs: don't use speculative prealloc for small files
>>
>> Looks interesting.
>
> Probably won't make any difference because backups via rsync do
> open/write/close and don't touch the file data again, so the close
> will be removing speculative preallocation before the data is
> written and extents are allocated by background writeback....
>
> Cheers,
>
> Dave.
>