On Tue, Nov 13, 2012 at 11:13:55AM +0200, Linas Jankauskas wrote:
> trace-cmd output was about 300mb, so im pasting first 100 lines of
> it, is it enough?:
>
> /usr/bin/rsync -e ssh -c blowfish -a --inplace --numeric-ids
> --hard-links --ignore-errors --delete --force
Ok, so you are overwriting in place and deleting files/dirs that
don't exist anymore. And they are all small files.
> xfs_bmap on one random file:
>
> EXT: FILE-OFFSET  BLOCK-RANGE               AG AG-OFFSET
>   0: [0..991]:    26524782560..26524783551  12 (754978880..754979871)

Not too bad.

> xfs_db -r -c "frag" /dev/sda5
> actual 81347252, ideal 80737778, fragmentation factor 0.75%

And that indicates file fragmentation is not an issue.
> from   to  extents   blocks     pct
>    1    1    74085     74085   0.05
>    2    3    97017    237788   0.15
>    4    7   165766    918075   0.59
>    8   15  2557055  35731152  22.78
And there's the problem. Free space is massively fragmented in the
8-16 block size (32-64k) range. All the other AGs show the same
pattern:
>    8   15  2477693  34631683  18.51
>    8   15  2479273  34656696  20.37
>    8   15  2440290  34132542  20.51
>    8   15  2461646  34419704  20.38
>    8   15  2463571  34439233  21.06
>    8   15  2487324  34785498  19.92
>    8   15  2474275  34589732  19.85
>    8   15  2438528  34100460  20.69
>    8   15  2467056  34493555  20.04
>    8   15  2457983  34364055  20.14
>    8   15  2438076  34112592  22.48
>    8   15  2465147  34481897  19.79
>    8   15  2466844  34492253  21.44
>    8   15  2445986  34205258  21.35
>    8   15  2436154  34060275  19.60
>    8   15  2438373  34082653  20.59
>    8   15  2435860  34057838  21.01
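For reference, the per-AG histograms excerpted above can be regenerated with xfs_db's freesp command. The device name is the one from this mail; the AG count in the loop (18) is an assumption based on the number of AGs shown above.

```shell
# Summarise free space extents for a single allocation group (AG 3 here).
# -r opens the device read-only; "freesp -s -a 3" prints the size
# histogram plus summary totals for that AG.
xfs_db -r -c "freesp -s -a 3" /dev/sda5

# Or loop over every AG to get the full per-AG picture
# (assumes 18 AGs, numbered 0-17):
for ag in $(seq 0 17); do
    echo "AG $ag:"
    xfs_db -r -c "freesp -s -a $ag" /dev/sda5
done
```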
Given the uniform distribution of the freespace fragmentation, the
problem is most likely the fact you are using the inode32 allocator.
What it does is keep inodes in AG 0 (below 1TB) and rotors data
extents across all other AGs. Hence AG 0 has a different freespace
pattern because it mainly contains metadata. The data AGs are
showing the signs of files with no reference locality being packed
adjacent to each other when written, then randomly removed, which
leaves a swiss-cheese style of freespace fragmentation.
The result is freespace btrees that are much, much larger than
usual, and each AG is being randomly accessed by each userspace
process. This leads to long lock hold times during searches, and
access from multiple CPUs at once slows things down and adds to lock
contention.
It appears that the threshold that limits performance for your
workload and configuration is around 2.5 million freespace extents
in a single size range. Most likely it is a linear scan of duplicate
sizes trying to find the best block number match that is chewing up
all the CPU. That's roughly what the event trace shows.
I don't think you can fix a filesystem once it's got into this
state. It's aged severely and the only way to fix freespace
fragmentation is to remove files from the filesystem. In this case,
mkfs.xfs is going to be the only sane way to do that, because it's
much faster than removing 90 million inodes...
So, how to prevent it from happening again on a new filesystem?
Using the inode64 allocator should prevent this freespace
fragmentation from happening. It allocates file data in the same AG
as the inode and inodes are grouped in an AG based on the parent
directory location. Directory inodes are rotored across AGs to
spread them out. The way it searches for free space for new files is
different, too, and will tend to fill holes near to the inode before
searching wider. Hence it's a much more local search, and it will
fill holes created by deleting files/dirs much faster, leaving less
swiss-cheese freespace fragmentation around.
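As a sketch, enabling the inode64 allocator is just a mount option (the mount point below is illustrative, not from this mail):

```shell
# Mount the filesystem with the inode64 allocator enabled.
# Caveat: once 64-bit inode numbers exist on disk, old 32-bit-only
# applications may fail to stat those files, so check your tools first.
mount -o inode64 /dev/sda5 /backup
```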
The other thing you can do, given you have lots of rsyncs running at
once, is increase the number of AGs to reduce their size. More AGs
will increase allocation parallelism, reducing contention, and also
reduce the size of each AG's free space trees if freespace
fragmentation does occur. Given you are tracking lots of small
files (90 million inodes so far), I'd suggest increasing the number
of AGs by an order of magnitude so that their size drops from 1TB
down to 100GB. Even if freespace fragmentation then does occur, it
will be spread over 10x the number of freespace trees, and hence will
have significantly less effect on performance.
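A minimal mkfs sketch of that AG sizing (device name from this mail; the 100GB figure is the target suggested above — this will, of course, destroy the existing filesystem):

```shell
# Recreate the filesystem with ~100GB allocation groups instead of 1TB.
# -d agsize= caps the AG size directly; mkfs.xfs derives the AG count
# from it. Alternatively, -d agcount=N sets the count explicitly.
mkfs.xfs -d agsize=100g /dev/sda5
```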
FWIW, you probably also want to set allocsize=4k as well, as you
don't need speculative EOF preallocation on your workload to avoid
fragmentation.
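Putting both mount-option suggestions together, a hypothetical fstab entry might look like this (the mount point is illustrative, not from this mail):

```shell
# /etc/fstab entry combining the inode64 allocator with allocsize=4k,
# which disables speculative EOF preallocation for this small-file
# workload.
/dev/sda5  /backup  xfs  inode64,allocsize=4k  0  0
```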