On Tue, Aug 26, 2014 at 06:04:52PM +0800, Zhang Qiang wrote:
> Thanks Dave/Greg for your analysis and suggestions.
> I can summarize what I should do next:
> - backup my data using xfsdump
> - rebuilt filesystem using mkfs with options: agcount=32 for 2T disk
> - mount filesystem with option inode64,nobarrier
Ok up to here.
> - applied patches about adding free list inode on disk structure
No, don't do that. You're almost certain to get it wrong and corrupt
your filesystems and lose data.
> As we have about ~100 servers need back up, so that will take much effort,
> do you have any other suggestion?
Just remount them with inode64. Nothing else. Over time as you add
and remove files the inodes will redistribute across all 4 AGs.
> What I am testing (ongoing):
> - created a new 2T partition filesystem
> - try to create small files and fill whole spaces then remove some of them
> - check the performance of touch/cp files
> - apply patches and verify it.
> I have got more data from server:
> 1) flush all cache(echo 3 > /proc/sys/vm/drop_caches), and umount filesystem
> 2) mount filesystem and testing with touch command
> * The first touch new file command take about ~23s
> * second touch command take about ~0.1s.
So it's cache population that is your issue. You didn't say that the
first time around, which means the diagnosis was wrong. Again, it's
having to search a btree with 220 million inodes in it to find the
first free inode, and that btree has to be pulled in from disk and
searched. Once it's cached, each subsequent allocation will be much
faster because the majority of the tree being searched will already
be in cache...
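
A minimal sketch of how the first-versus-second create timing could be
measured, assuming Python is available on the server. The function
names and the "touch-test-N" file names are made up for illustration.
It does not drop caches or remount anything; that still has to be done
by hand beforehand, as in the steps quoted above.

```python
import os
import tempfile
import time


def time_create(path):
    """Return the seconds taken to create an empty file at path (like touch)."""
    start = time.perf_counter()
    # O_CREAT | O_EXCL forces a brand new file, so an inode must be allocated.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)
    return time.perf_counter() - start


def compare_first_and_second(directory):
    """Create two new files in directory and return (first, second) timings."""
    first = time_create(os.path.join(directory, "touch-test-1"))
    second = time_create(os.path.join(directory, "touch-test-2"))
    return first, second


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for the real mount point.
    with tempfile.TemporaryDirectory() as d:
        first, second = compare_first_and_second(d)
        print(f"first create:  {first:.4f}s")
        print(f"second create: {second:.4f}s")
```

To reproduce the numbers above, point compare_first_and_second() at a
directory on the freshly mounted filesystem instead of the temporary
directory.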
> I have compared the memory used, it seems that xfs try to load inode bmap
> block for the first time, which take much time, is that the reason to take
> so much time for the first touch operation?
No. Reading the AGI btree to find the first free inode to allocate
is what is taking the time. If you spread the inodes out over 4 AGs
(using inode64) then the overhead of the first read will go down
proportionally. Indeed, that is one of the reasons for using more
than 4 AGs for filesystems like this.
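
As a rough back-of-the-envelope sketch of "proportionally", using the
220 million inodes and ~23s first touch from this thread: the code
below assumes the inodes end up spread perfectly evenly, that the
first-allocation cost scales linearly with the size of the tree being
read, and that the 23s baseline corresponds to a single tree holding
everything. None of those assumptions is guaranteed in practice, and
per_ag_estimate() is a made-up helper, not anything from XFS.

```python
def per_ag_estimate(total_inodes, first_alloc_seconds, ag_count):
    """Split the inode count and first-allocation cost evenly over ag_count AGs.

    Assumes a perfectly even spread and linear scaling; real numbers will differ.
    """
    inodes_per_ag = total_inodes / ag_count
    seconds = first_alloc_seconds / ag_count
    return inodes_per_ag, seconds


if __name__ == "__main__":
    # Figures from this thread: 220 million inodes, ~23s for the first touch.
    for ags in (1, 4, 32):
        inodes, seconds = per_ag_estimate(220_000_000, 23.0, ags)
        print(f"{ags:2d} AGs: ~{inodes:,.0f} inodes per AG, ~{seconds:.2f}s first read")
```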
Still, I can't help but wonder why you are using a filesystem to
store hundreds of millions of tiny files, when a database is far
better suited to storing and indexing this type and quantity of