On Thu, Jun 04, 2015 at 10:35:47AM +1000, Dave Chinner wrote:
> > - Trace cmd report
> > Too big to attach. Here's a link:
> > https://www.dropbox.com/s/3xxe2chsv4fsrv8/trace_report.txt.zip?dl=0
> Downloading now.
AIL pushing is occurring every 30s, yes. Across all filesystems, there
are roughly 23-25,000 metadata objects being pushed every 30s flush.
Think about that for a moment.
You have a write once workload, so inode metadata is journalled and
written only once. Hence if you are creating 1000 files/s, then you
have at least 30,000 inodes to push every 30s.
But that's not actually the big problem. Of the two AIL push events
in the trace, there are this many objects that we attempt to push:
$ wc -l t.t
And this many inodes:
$ grep INODE t.t | wc -l
Now, XFS has inode clustering on writeback and that is active; it is
reducing the number of inode IOs by a factor of roughly 10. So that
means that every 30s, we've only got ~600 IOs across 8 disks
to write back dirty inodes. i.e. less than a second worth of random
IO. That's not the problem we are looking for.
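
As an aside, the per-type counts in this mail can be taken in a single
pass over the file rather than one grep per type. A minimal sketch,
assuming only what the greps themselves assume: each line of t.t is one
object the AIL attempted to push, inode items contain the string INODE
and buffer items contain BUF. The function name is made up for
illustration.

```shell
# Hypothetical helper: one-pass equivalent of
#   wc -l FILE; grep INODE FILE | wc -l; grep BUF FILE | wc -l
# The patterns are counted independently, exactly as the greps do.
tally_pushes() {
    awk '
        /INODE/ { inodes++ }
        /BUF/   { bufs++ }
        END     { printf "total=%d inodes=%d bufs=%d\n", NR, inodes, bufs }
    ' "$1"
}

# usage: tally_pushes t.t
```
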
And this many buffers:
$ grep BUF t.t | wc -l
So call it 17,000 every 30 seconds. That requires 17,000 4k IOs.
Across 8 disks at 170 IOPS each, that is *exactly* 12.5 seconds worth
of IO.
Looks to me like the buffers are mostly inode btree, free space
btree and directory buffers.
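
The back-of-the-envelope numbers above can be reproduced as follows.
This is only a sketch of the arithmetic: it assumes every IO is a
random IO, that the load spreads evenly over the disks, and that the
170 IOPS figure is per disk. The function name is hypothetical.

```shell
# Seconds of random IO = IOs / (disks * IOPS per disk)
io_seconds() {
    awk -v ios="$1" -v disks="$2" -v iops="$3" \
        'BEGIN { printf "%.1f\n", ios / (disks * iops) }'
}

io_seconds 600 8 170      # clustered inode writeback
io_seconds 17000 8 170    # metadata buffer writeback
```

This prints 0.4 and 12.5: the inode writeback is well under a second
per flush, while the buffers account for 12.5 seconds of it.
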
Directory buffers, well, that's where increasing the directory block
size might help (e.g. to 8k). That may well reduce the number of
directory buffers by more than a factor of 2 due to the structure of
the directories. Depends on how many files you have in each
directory.
The number of inode and alloc btree buffers can be reduced by
reducing the number of AGs - probably by a factor of 10 by bringing
the AG count down to 4. And, because the active inode and freespace
btree buffers will be hotter, they are more likely just to be
relogged than written back, further reducing IOs.
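
Both the AG count and the directory block size are chosen at mkfs
time. A sketch of what that could look like, assuming mkfs.xfs from
xfsprogs and its "-d agcount=" and "-n size=" options; /dev/sdX is a
placeholder device and the helper is hypothetical. It only prints the
command line, so nothing gets formatted by accident.

```shell
# Hypothetical helper: print (do not run) an mkfs.xfs command line.
#   $1 = AG count, $2 = directory block size in bytes, $3 = device
mkfs_cmd() {
    printf 'mkfs.xfs -d agcount=%s -n size=%s %s\n' "$1" "$2" "$3"
}

mkfs_cmd 4 8192 /dev/sdX
```

This prints "mkfs.xfs -d agcount=4 -n size=8192 /dev/sdX", i.e. 4 AGs
and an 8k directory block size as discussed above.
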
Indeed, this looks to me like the smoking gun. To allocate a block,
you have to lock the AGF buffer that the allocation is going to take
place in. Problem is, when the xfsaild pushes the AGF buffers to the
writeback queue, they sit there with the buffer locked until the IO
completes.
In the traces, the xfsailds all run at 509385s, and immediately I
see a ~10s gap in the trace where almost no xfs_read_agf() traces
occur. It's not until 509396s that the traces really start to appear
at normal speed again.
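
A gap like that can be pulled out of the trace report mechanically. A
minimal sketch, assuming the usual trace-cmd report text layout where
each event line carries a "seconds.microseconds:" timestamp field
somewhere before the event name; the function name and the threshold
are made up for illustration.

```shell
# Hypothetical helper: report gaps between consecutive xfs_read_agf
# events that are at least $1 seconds long. Reads trace text on stdin
# and prints "gap_start gap_end gap_length" for each gap found.
find_gaps() {
    awk -v min_gap="$1" '
        /xfs_read_agf/ {
            ts = ""
            # the timestamp is the first field shaped like 509385.123456:
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+\.[0-9]+:$/) {
                    ts = $i
                    sub(/:$/, "", ts)
                    break
                }
            }
            if (ts == "") next
            ts += 0
            if (seen && ts - prev >= min_gap)
                printf "%.3f %.3f %.3f\n", prev, ts, ts - prev
            prev = ts
            seen = 1
        }
    '
}

# usage: find_gaps 5 < trace_report.txt
```

It only looks at xfs_read_agf events, so a stall shows up as one line
per gap rather than having to eyeball the trace.
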
Again, reducing the number of AGs will help with this problem,
simply because the AG headers are more likely to be locked or
pinned when the xfsaild sweep runs because they are active rather
than sitting idle waiting for the next operation in that AG to
occur.
Remember, a single AG can sustain thousands of allocations every
second - if you are only creating a few thousand files every second,
you don't need tens of AGs to sustain that - the default of 4 AGs
will do that just fine...