Slow performance after ~4.5TB
Linas Jankauskas
linas.j at iv.lt
Thu Nov 15 02:34:02 CST 2012
Ok,
thanks for your help.
We will try making 200 allocation groups and enabling the inode64 option.
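
Something like:

  mkfs.xfs -f -d agcount=200 /dev/sda5
  mount -o inode64 /dev/sda5 /backup

(the mount point above is just an example)
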
Hope it will solve the problem.
Thanks
Linas
On 11/14/2012 11:13 PM, Dave Chinner wrote:
> On Tue, Nov 13, 2012 at 11:13:55AM +0200, Linas Jankauskas wrote:
>> trace-cmd output was about 300MB, so I'm pasting the first 100
>> lines of it; is that enough?:
> ....
>>
>> Rsync command:
>>
>> /usr/bin/rsync -e ssh -c blowfish -a --inplace --numeric-ids
>> --hard-links --ignore-errors --delete --force
>
> Ok, so you are overwriting in place and deleting files/dirs that
> no longer exist on the source. And they are all small files.
>
>> xfs_bmap on one random file:
>>
>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
>> 0: [0..991]: 26524782560..26524783551 12 (754978880..754979871) 992 00000
>>
>> xfs_db -r -c "frag" /dev/sda5
>> actual 81347252, ideal 80737778, fragmentation factor 0.75%
>
> And that indicates file fragmentation is not an issue.
>>
>>
>> agno: 0
>
> Not too bad.
>
>> agno: 1
>>
>> from to extents blocks pct
>> 1 1 74085 74085 0.05
>> 2 3 97017 237788 0.15
>> 4 7 165766 918075 0.59
>> 8 15 2557055 35731152 22.78
>
> And there's the problem. Free space is massively fragmented in the
> 8-16 block size (32-64k) range. All the other AGs show the same
> pattern:
>
>> 8 15 2477693 34631683 18.51
>> 8 15 2479273 34656696 20.37
>> 8 15 2440290 34132542 20.51
>> 8 15 2461646 34419704 20.38
>> 8 15 2463571 34439233 21.06
>> 8 15 2487324 34785498 19.92
>> 8 15 2474275 34589732 19.85
>> 8 15 2438528 34100460 20.69
>> 8 15 2467056 34493555 20.04
>> 8 15 2457983 34364055 20.14
>> 8 15 2438076 34112592 22.48
>> 8 15 2465147 34481897 19.79
>> 8 15 2466844 34492253 21.44
>> 8 15 2445986 34205258 21.35
>> 8 15 2436154 34060275 19.60
>> 8 15 2438373 34082653 20.59
>> 8 15 2435860 34057838 21.01
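>
> FWIW, those histograms look like xfs_db "freesp" output, so on the
> new filesystem you can watch for this pattern re-forming with
> something like:
>
>   # read-only open of the block device; -s adds a summary
>   xfs_db -r -c "freesp -s" /dev/sda5
>
> and "freesp -a <agno>" will report a single AG.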
>
> Given the uniform distribution of the freespace fragmentation, the
> problem is most likely the fact you are using the inode32 allocator.
>
> What it does is keep inodes in AG 0 (below 1TB) and rotors data
> extents across all other AGs. Hence AG 0 has a different freespace
> pattern because it mainly contains metadata. The data AGs are
> showing the signs of files with no reference locality being packed
> adjacent to each other when written, then randomly removed, which
> leaves a swiss-cheese style of freespace fragmentation.
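>
> You can see the inode side of this from userspace, too: with
> inode32 every inode number fits in 32 bits, so something like
>
>   # path is just an example
>   ls -i /path/on/that/filesystem
>
> will never show an inode number above 2^32 - which is another way
> of saying all the inodes live in the low AGs.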
>
> The result is freespace btrees that are much, much larger than
> usual, and each AG is being randomly accessed by each userspace
> process. This leads to long lock hold times during searches, and
> access from multiple CPUs at once slows things down and adds to lock
> contention.
>
> It appears that the threshold that limits performance for your
> workload and configuration is around 2.5 million freespace extents
> in a single size range. Most likely it is a linear scan of duplicate
> sizes trying to find the best block number match that is chewing up
> all the CPU. That's roughly what the event trace shows.
>
> I don't think you can fix a filesystem once it's got into this
> state. It's aged severely and the only way to fix freespace
> fragmentation is to remove files from the filesystem. In this case,
> mkfs.xfs is going to be the only sane way to do that, because it's
> much faster than removing 90million inodes...
>
> So, how to prevent it from happening again on a new filesystem?
>
> Using the inode64 allocator should prevent this freespace
> fragmentation from happening. It allocates file data in the same AG
> as the inode and inodes are grouped in an AG based on the parent
> directory location. Directory inodes are rotored across AGs to
> spread them out. The way it searches for free space for new files is
> different, too, and will tend to fill holes near to the inode before
> searching wider. Hence it's a much more local search, and it will
> fill holes created by deleting files/dirs much faster, leaving less
> swiss cheese freespace fragmentation around.
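>
> It's just a mount option, i.e. something like:
>
>   mount -o inode64 /dev/sda5 /backup
>
> (the mount point here is a placeholder - add inode64 to the fstab
> entry so it persists across reboots)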
>
> The other thing, if you have lots of rsyncs running at once, is to
> increase the number of AGs to reduce their size. More AGs will
> increase allocation parallelism, reducing contention, and also
> reduce the size of each AG's free space trees if freespace
> fragmentation does occur. Given you are tracking lots of small
> files (90 million inodes so far), I'd suggest increasing the number
> of AGs by an order of magnitude so that the AG size drops from 1TB
> down to 100GB. Even if freespace fragmentation then does occur, it
> is spread over 10x the number of freespace trees, and hence will
> have significantly less effect on performance.
>
> FWIW, you probably want to set allocsize=4k as well, as your
> workload doesn't need speculative EOF preallocation to avoid
> file fragmentation....
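>
> So the end result in fstab would look something like this (mount
> point again a placeholder):
>
>   /dev/sda5  /backup  xfs  inode64,allocsize=4k  0 0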
>
> Cheers,
>
> Dave.