On Fri, Oct 17, 2014 at 04:20:20PM -0400, Brian Foster wrote:
> It appears that the tp->t_blk_res overrun assert is also related to
> delalloc punching. I'm basically seeing a writepage attempted for a page
> over a delalloc extent (imap in writepage reflects this), but at some
> point before we get into the allocation (I suspect between where we
> release and reacquire ilock in xfs_map_blocks() to
> iomap_write_allocate()) the delalloc blocks that relate this particular
> page to the extent disappear. An in-core extent dump after doing the
> allocation but before inserting into the extent tree shows the range to
> be allocated as a hole. Indeed, the allocator saw a hole (wasdel == 0)
> and ended up doing the full allocation while the transaction had a
> reservation for a delalloc conversion, which we end up overrunning.
> It's not quite clear to me how this is happening. I would expect
> truncate_pagecache_range() to serialize against writeback via page lock.
It does. But the problem won't be truncate_pagecache_range(); it
will be that punching delalloc blocks doesn't serialise against the
page lock. That is, we run truncate_pagecache_range/truncate_setsize
without holding inode locks, and they serialise against writeback
via page locks. However, we run xfs_bmap_punch_delalloc_range()
holding inode locks but no page locks. Hence each serialises against
writeback, but not against the other. Therefore:
	writeback			zero range

	page is delalloc
	map delalloc
	<blocks>
					punch delalloc range
	allocate HOLE!!!!
					page_lock
	starts page writeback
					waits on writeback
					removes page from cache
Basically, we're back to the fundamental problem that we can't
manipulate extent state directly if there are *unlocked* pages in
the page cache over that range because we hold meaningful writeback
state on the page.
> Regardless, the scenario of writing back pages into unallocated space
> that writeback thinks is covered by delayed allocation suggests
> something racing with delalloc punching. The problem seems to disappear
> if I comment it out from zero range, so we might want to kill the above
> hunk entirely for the time being.
Yes, but I think I see the way forward now. xfs_vm_writepage() needs
to unconditionally map the page it is being passed rather than
relying on the state of bufferheads attached to the page. We can't
serialise the page cache state against direct extent manipulations,
therefore we have to make all decisions based on one set of state we
can serialise access to sanely.
The issue is that we can't currently serialise extent manipulations
against writeback easily, but I think I see a neat and easy way to
do this, and it mirrors a technique we already use(*): add a cursor
to track the active map we are operating on so that we know if we
are about to modify an extent we are currently writing to.
Actually, a cursor solves a couple of other problems in this
mapping code - see the big comment in xfs_iomap_write_allocate()
about using a single map to avoid lookup/unlock/lock/allocate race
conditions with truncate/hole-punch.
So I think I have a solution that moves us towards the goal of "die,
bufferheads, die" and solves these writeback vs page cache vs extent
manipulation issues - let me do a bit more thinking about it and
I'll write something up.
(*) we use a cursor for a similar purpose during AIL traversal: to
detect item deletion while the AIL lock was dropped that would cause
us to follow an invalid list pointer.