On Thu, Jun 04, 2015 at 11:25:30AM +1000, Dave Chinner wrote:
> On Thu, Jun 04, 2015 at 10:35:47AM +1000, Dave Chinner wrote:
> > > - Trace cmd report
> > > Too big to attach. Here's a link:
> > > https://www.dropbox.com/s/3xxe2chsv4fsrv8/trace_report.txt.zip?dl=0
> > Downloading now.
> AIL pushing is occurring every 30s, yes. Across all filesystems, there
> are roughly 23-25,000 metadata objects being pushed every 30s flush.
> Indeed, this looks to me like the smoking gun. To allocate a block,
> you have to lock the AGF buffer that the allocation is going to take
> place in. Problem is, when the xfsaild pushes the AGF buffers to the
> writeback queue, they sit there with the buffer locked until the IO
> completes.
> In the traces, the xfsailds all run at 509385s, and immediately I
> see a ~10s gap in the trace where almost no xfs_read_agf() traces
> occur. It's not until 509396s that the traces really start to appear
> at normal speed again.
> Again, reducing the number of AGs will help with this problem,
> simply because the AG headers are more likely to be locked or
> pinned when the xfsaild sweep runs because they are active rather
> than sitting idle waiting for the next operation in that AG to
> require allocation....
> Remember, a single AG can sustain thousands of allocations every
> second - if you are only creating a few thousand files every second,
> you don't need tens of AGs to sustain that - the default of 4 AGs
> will do that just fine...
And in looking deeper into the issue, I think there are some code
changes we need to make to minimise this issue.
Allocation requires a locked AGF buffer, but AGF buffers also need to
be locked for IO. The underlying issue looks like we hold the lock for
too long during IO submission. i.e. a list gets passed to the
delayed write submission code, which then walks the list locking the
buffers, then we sort and issue the IO on the list. If the writeback
queue is long enough, submission is getting blocked on the request
queue and we wait with locked buffers and hence don't allow
modifications to take place on the buffers while we are waiting for
submission to complete.
Fixing this requires a tweak to the algorithm in
__xfs_buf_delwri_submit() so that we don't lock an entire list of
thousands of IOs before starting submission. In the meantime,
reducing the number of AGs will reduce the impact of this because
the delayed write submission code will skip buffers that are already
locked or pinned in memory, and hence an AG under modification at
the time submission occurs will be skipped by the delwri code.
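
To make the lock hold time argument concrete, here is a toy model in
Python. It is not kernel code and not the actual change to
__xfs_buf_delwri_submit(); Buf, delwri_submit, batch_size and the
"tick" clock are all made-up names for illustration. It assumes that
locking is instantaneous and that every IO issue costs one tick because
the request queue is full, and it only measures how long a buffer sits
locked before its IO is issued.

```python
from dataclasses import dataclass


@dataclass
class Buf:
    """Hypothetical stand-in for a metadata buffer on the delwri list."""
    blkno: int
    locked: bool = False  # already locked by someone else, e.g. an allocation
    pinned: bool = False  # pinned in memory by the log


def delwri_submit(bufs, batch_size):
    """Return {blkno: ticks the buffer sat locked before its IO was issued}.

    batch_size == len(bufs) models locking the whole list up front;
    a smaller batch_size models locking and submitting in chunks.
    """
    tick = 0
    waits = {}
    # Buffers that are already locked or pinned are skipped entirely.
    todo = [b for b in bufs if not (b.locked or b.pinned)]
    for start in range(0, len(todo), batch_size):
        batch = todo[start:start + batch_size]
        lock_tick = tick  # the whole batch is locked at this point
        # Sort within the batch only, then issue one IO per tick.
        for b in sorted(batch, key=lambda b: b.blkno):
            waits[b.blkno] = tick - lock_tick
            tick += 1
    return waits


def make_bufs(n, busy=()):
    """Build n buffers; those whose blkno is in busy are already locked."""
    return [Buf(blkno=i, locked=(i in busy)) for i in range(n)]


whole = delwri_submit(make_bufs(1000, busy={7}), batch_size=1000)
batched = delwri_submit(make_bufs(1000, busy={7}), batch_size=32)
print(max(whole.values()), max(batched.values()))
```

In this model the worst-case time a buffer is locked before issue grows
with the length of the list when everything is locked up front, and is
bounded by the batch size otherwise. The cost is that sorting only
happens within a batch, so the IO ordering is less global. Buffer 7,
which was busy when the sweep ran, is skipped in both cases.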