On Wed, Dec 07, 2011 at 01:35:08AM -0500, Christoph Hellwig wrote:
> On Wed, Dec 07, 2011 at 05:18:11PM +1100, Dave Chinner wrote:
> > The series passes xfstests on 4k/4k, 4k/512b, 64k/4k and 64k/512b
> > (dirblksz/fsblksz) configurations without any new regressions, and
> > survives 100 million inode fs_mark benchmarks on a 17TB filesystem
> > using 4k/4k, 64k/512b and 64k/512b configurations.
> Do you have any benchmark numbers showing performance improvements
> for the large directory block case?
I haven't run real comparisons yet (it hasn't been working for long
enough for me to do so), but I suspect that the gains are lost
in the amount of CPU overhead the buffer formatting code is
consuming - it's around 40-50% of the entire CPU time on the
parallel create tests:
+ 13.10% [kernel] [k] memcpy
+ 7.94% [kernel] [k] xfs_next_bit
+ 7.63% [kernel] [k] xfs_buf_find_irec.isra.11
+ 5.86% [kernel] [k] xfs_buf_offset
+ 4.36% [kernel] [k] xfs_buf_item_format_segment
+ 4.11% [kernel] [k] xfs_buf_item_size_segment.isra.0
That's all CPU usage under the transaction commit path.
Basically I'm getting 100-110k files/s with 4k directory block sizes, and
70-80k files/s with 64k dirs for the same workload, while consuming
roughly the same amount of CPU time. Killing the buffer logging
overhead (which barely registers at the 4k directory block size)
looks like it will bring large directory block size performance to
parity with 4k block size performance, because the amount
written to the log (~30MB/s) is identical for both configurations...
It might well be as simple as checking the hamming weight of the dirty
bitmap, and if it is over a certain threshold just logging the buffer in
its entirety, skipping the bitmap-based dirty region processing.