bad performance on touch/cp file on XFS system

Zhang Qiang zhangqiang.buaa at gmail.com
Wed Aug 27 03:53:17 CDT 2014


2014-08-26 21:13 GMT+08:00 Dave Chinner <david at fromorbit.com>:

> On Tue, Aug 26, 2014 at 06:04:52PM +0800, Zhang Qiang wrote:
> > Thanks Dave/Greg for your analysis and suggestions.
> >
> > I can summarize what I should do next:
> >
> > - backup my data using xfsdump
> > - rebuilt filesystem using mkfs with options: agcount=32 for 2T disk
> > - mount filesystem with option inode64,nobarrier
>
> Ok up to here.
>
> > - applied patches about adding free list inode on disk structure
>
> No, don't do that. You're almost certain to get it wrong and corrupt
> your filesystems and lose data.
>
> > As we have about ~100 servers that need to be backed up, that will
> > take a lot of effort; do you have any other suggestion?
>
> Just remount them with inode64. Nothing else. Over time as you add
> and remove files the inodes will redistribute across all 4 AGs.
>
OK.

How can I see the distribution of inodes across the AGs? Here are my
checking steps:

1) Check unmounted file system first:
[root@fstest data1]# xfs_db -c "sb 0" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 421793920
ifree = 41
[root@fstest data1]# xfs_db -c "sb 1" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 0
ifree = 0
[root@fstest data1]# xfs_db -c "sb 2" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 0
ifree = 0
[root@fstest data1]# xfs_db -c "sb 3" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 0
ifree = 0
2) Mount it with inode64 and create many files:

[root@fstest /]# mount -o inode64,nobarrier /dev/sdb1 /data
[root@fstest /]# cd /data/tmp/
[root@fstest tmp]# fdtree.bash -d 16 -l 2 -f 100 -s 1
[root@fstest /]# umount /data

3) Check with xfs_db again:

[root@fstest data1]# xfs_db -f -c "sb 0" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 421821504
ifree = 52
[root@fstest data1]# xfs_db -f -c "sb 1" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 0
ifree = 0

So it seems that the inodes are only in the first AG. Or is icount/ifree
not the right value to check? How should I check how many inodes are in
each AG?
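
One thing I am thinking of trying, in case icount/ifree in the superblocks
is not the right place to look: dump each AG's AGI header with xfs_db and
read its count/freecount fields. I am assuming those are the per-AG inode
totals, so please correct me if that is wrong:

# guess: per-AG inode counts live in the AGI headers (the secondary
# superblocks do not appear to be updated during normal operation)
for ag in 0 1 2 3; do
    echo "AG $ag:"
    xfs_db -r -c "agi $ag" -c "p count" -c "p freecount" /dev/sdb1
done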


I am looking for a way to improve performance on the current filesystem and
kernel just by remounting with inode64, and I am trying to work out how to
redistribute the existing inodes evenly across all the AGs.

Is there a good way to do that? For example, backing up half of the data to
another device, removing it, and then copying it back.
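
Roughly what I have in mind (just a sketch; the paths and the scratch mount
below are made up, and I am not sure the extra I/O is worth it):

# hypothetical example: move one subtree off the filesystem, then copy it
# back, hoping the re-created inodes spread across the other AGs now that
# inode64 is enabled
rsync -a /data/tmp/old_batch/ /mnt/scratch/old_batch/
rm -rf /data/tmp/old_batch
rsync -a /mnt/scratch/old_batch/ /data/tmp/old_batch/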


> > What I am testing (ongoing):
> >  - created a new 2T partition filesystem
> >  - try to create small files to fill the whole space, then remove some
> >    of them randomly
> >  - check the performance of touch/cp on files
> >  - apply the patches and verify them.
> >
> > I have got more data from the server:
> >
> > 1) flush all caches (echo 3 > /proc/sys/vm/drop_caches) and umount the
> >    filesystem
> > 2) mount the filesystem and test with the touch command
> >   * the first touch of a new file takes about ~23s
> >   * the second touch takes about ~0.1s.
>
> So it's cache population that is your issue. You didn't say that
> first time around, which means the diagnosis was wrong. Again, it's
> having to search a btree with 220 million inodes in it to find the
> first free inode, and that btree has to be pulled in from disk and
> searched. Once it's cached, then each subsequent allocation will be
> much faster because the majority of the tree being searched will
> already be in cache...
>
> > I have compared the memory used; it seems that XFS tries to load the
> > inode bmap block for the first time, which takes a long time. Is that
> > the reason the first touch operation takes so much time?
>
> No. reading the AGI btree to find the first free inode to allocate
> is what is taking the time. If you spread the inodes out over 4 AGs
> (using inode64) then the overhead of the first read will go down
> proportionally. Indeed, that is one of the reasons for using more
> AGs than 4 for filesystems like this.
>
OK, I see.
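
To confirm that on my side after remounting with inode64, I will repeat the
same cold/warm comparison as before (written out here just to be sure I am
measuring the right thing):

sync
echo 3 > /proc/sys/vm/drop_caches          # drop page/dentry/inode caches
umount /data
mount -o inode64,nobarrier /dev/sdb1 /data
time touch /data/tmp/new_file_1            # cold cache: took ~23s before
time touch /data/tmp/new_file_2            # warm cache: took ~0.1s before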


> Still, I can't help but wonder why you are using a filesystem to
> store hundreds of millions of tiny files, when a database is far
> better suited to storing and indexing this type and quantity of
> data....
>

OK, these are the back-end servers of a social networking website, actually
the CDN infrastructure, with different servers located in different cities.
We have a global sync script to make all of these ~100 servers hold the same
data.

For each server we use RAID10 and XFS (CentOS 6.3).

About 3M files (~50K each in size) are generated every day, and we track the
path of each file in a database.

Do you have any suggestions to improve our solution?



> Cheers,
>
> Dave.
> --
> Dave Chinner
> david at fromorbit.com
>