
Re: bad performance on touch/cp file on XFS system

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: bad performance on touch/cp file on XFS system
From: Zhang Qiang <zhangqiang.buaa@xxxxxxxxx>
Date: Wed, 27 Aug 2014 16:53:17 +0800
Cc: Greg Freemyer <greg.freemyer@xxxxxxxxx>, xfs-oss <xfs@xxxxxxxxxxx>

2014-08-26 21:13 GMT+08:00 Dave Chinner <david@xxxxxxxxxxxxx>:
On Tue, Aug 26, 2014 at 06:04:52PM +0800, Zhang Qiang wrote:
> Thanks Dave/Greg for your analysis and suggestions.
> I can summarize what I should do next:
> - backup my data using xfsdump
> - rebuilt filesystem using mkfs with options: agcount=32 for 2T disk
> - mount filesystem with option inode64,nobarrier

Ok up to here.

> - applied patches about adding free list inode on disk structure

No, don't do that. You're almost certain to get it wrong and corrupt
your filesystems and lose data.

> As we have about ~100 servers need back up, so that will take much effort,
> do you have any other suggestion?

Just remount them with inode64. Nothing else. Over time as you add
and remove files the inodes will redistribute across all 4 AGs.

How can I see the number of inodes in each AG? Here are my checking steps:

1) Check unmounted file system first:
[root@fstest data1]# xfs_db -c "sb 0" -c "p" /dev/sdb1 |egrep 'icount|ifree'
icount = 421793920
ifree = 41
[root@fstest data1]# xfs_db -c "sb 1" -c "p" /dev/sdb1 |egrep 'icount|ifree'
icount = 0
ifree = 0
[root@fstest data1]# xfs_db -c "sb 2" -c "p" /dev/sdb1 |egrep 'icount|ifree'
icount = 0
ifree = 0
[root@fstest data1]# xfs_db -c "sb 3" -c "p" /dev/sdb1 |egrep 'icount|ifree'
icount = 0
ifree = 0
2) mount it with inode64 and create many files:

[root@fstest /]# mount -o inode64,nobarrier /dev/sdb1 /data
[root@fstest /]# cd /data/tmp/
[root@fstest tmp]# fdtree.bash -d 16 -l 2 -f 100 -s 1
[root@fstest /]# umount /data

3) Check with xfs_db again:

[root@fstest data1]# xfs_db -f -c "sb 0" -c "p" /dev/sdb1 |egrep 'icount|ifree'
icount = 421821504
ifree = 52
[root@fstest data1]# xfs_db -f -c "sb 1" -c "p" /dev/sdb1 |egrep 'icount|ifree'
icount = 0
ifree = 0

So it seems that the inodes are only in the first AG. Or are icount/ifree not the right values to check? If not, how should I check how many inodes are in each AG?
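One possible way to answer the per-AG question is to read each AG's inode header (AGI) instead of the superblocks. The icount/ifree values in superblock 0 are filesystem-wide totals, and the zeros printed for superblocks 1 to 3 suggest the secondary superblocks do not carry live counters. The following is a minimal sketch, assuming xfs_db's `agi` command and its `count` and `freecount` fields; `per_ag_inodes` is a hypothetical helper name, and the device path and AG count are passed in by the caller.

```shell
#!/bin/sh
# Print allocated and free inode counts for each AG of an XFS device.
# Assumption: xfs_db supports "agi N" and the AGI header exposes the
# fields "count" and "freecount". Run against an unmounted filesystem.
# per_ag_inodes is a hypothetical helper, not part of xfsprogs.
per_ag_inodes() {
    dev=$1
    agcount=$2
    ag=0
    while [ "$ag" -lt "$agcount" ]; do
        # -r opens the device read-only so nothing can be modified.
        xfs_db -r -c "agi $ag" -c "print count freecount" "$dev" |
            awk -v ag="$ag" '
                $1 == "count"     { count = $3 }
                $1 == "freecount" { free = $3 }
                END { printf "AG %d: count=%s freecount=%s\n", ag, count, free }'
        ag=$((ag + 1))
    done
}

# Example, using the device from the session above and its 4 AGs:
#   per_ag_inodes /dev/sdb1 4
```

If the inodes really are confined to the first AG, only AG 0 should report a non-zero count.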

I am looking for a way to improve performance on the current filesystem and kernel by just remounting with inode64, and I am trying to work out how to get the inodes redistributed evenly across all AGs.

Is there a good way to do this? For example, back up half of the data to another device, remove it, and then copy it back.

> What I am testing (ongoing):
>  - created a new 2T partition filesystem
>  - try to create small files and fill whole spaces then remove some of them
> randomly
>  - check the performance of touch/cp files
>  - apply patches and verify it.
> I have got more data from server:
> 1) flush all cache(echo 3 > /proc/sys/vm/drop_caches), and umount filesystem
> 2) mount filesystem and testing with touch command
>   * The first touch new file command take about ~23s
>   * second touch command take about ~0.1s.

So it's cache population that is your issue. You didn't say that
first time around, which means the diagnosis was wrong. Again, it's having to
search a btree with 220 million inodes in it to find the first free
inode, and that btree has to be pulled in from disk and searched.
Once it's cached, then each subsequent allocation will be much
faster because the majority of the tree being searched will already
be in cache...
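The cold-versus-warm difference described above can be measured with a small timing helper. This is only a sketch: `time_touch` is a hypothetical function name, it assumes GNU date (for `%N` nanoseconds), and the cache-dropping and remount steps in the comments need root and use the device and mount point from this thread.

```shell
#!/bin/sh
# Time how long it takes to create one new empty file in a directory.
# time_touch is a hypothetical helper; it prints elapsed milliseconds
# and leaves the created file in place.
# Assumption: GNU date, which supports %N (nanoseconds).
time_touch() {
    dir=$1
    start=$(date +%s%N)
    touch "$dir/touch-test.$$.$start"
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Cold-cache comparison (run as root):
#   sync; echo 3 > /proc/sys/vm/drop_caches
#   umount /data && mount -o inode64,nobarrier /dev/sdb1 /data
#   time_touch /data/tmp    # first allocation: inode btree read from disk
#   time_touch /data/tmp    # second allocation: btree mostly cached
```

Each call creates a uniquely named file, so both runs go through inode allocation rather than just updating timestamps on an existing file.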

> I have compared the memory used, it seems that xfs try to load inode bmap
> block for the first time, which take much time, is that the reason to take
> so much time for the first touch operation?

No. Reading the AGI btree to find the first free inode to allocate
is what is taking the time. If you spread the inodes out over 4 AGs
(using inode64) then the overhead of the first read will go down
proportionally. Indeed, that is one of the reasons for using more
AGs than 4 for filesystems like this.
OK, I see.
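To put rough numbers on "proportionally", the arithmetic can be sketched as below. It assumes a perfectly even spread of inodes across AGs (an idealisation), uses the icount value that xfs_db printed earlier in this message, and relies on XFS tracking inodes in 64-inode chunks with one inode btree record per chunk. `per_ag_estimate` is a hypothetical helper name.

```shell
#!/bin/sh
# per_ag_estimate ICOUNT AGCOUNT
# Hypothetical helper: estimate inodes and inode btree records per AG,
# assuming the inodes are spread evenly (an idealisation).
per_ag_estimate() {
    icount=$1
    agcount=$2
    inodes_per_chunk=64   # one inode btree record covers a 64-inode chunk
    per_ag=$((icount / agcount))
    records=$((per_ag / inodes_per_chunk))   # integer division, rounds down
    echo "agcount=$agcount: $per_ag inodes, about $records btree records per AG"
}

# icount is the value reported for this filesystem by xfs_db above.
for n in 1 4 32; do
    per_ag_estimate 421793920 "$n"
done
```

The three cases correspond to everything in one AG (the current state), an even spread over the existing 4 AGs, and the agcount=32 layout mentioned for a rebuilt filesystem.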

Still, I can't help but wonder why you are using a filesystem to
store hundreds of millions of tiny files, when a database is far
better suited to storing and indexing this type and quantity of data?

OK, these are the back-end servers of a social networking website, actually its CDN infrastructure, and the servers are located in different cities.
We have a global sync script that keeps the same data on all of these 100 servers.

For each server we use RAID10 and XFS (CentOS6.3).

About 3M files (50K in size) are generated every day, and we track the path of each file in a database.

Do you have any suggestions to improve our solution?


Dave Chinner
