On Mon, Oct 20, 2014 at 12:25:16PM +1100, Dave Chinner wrote:
> On Fri, Oct 17, 2014 at 04:20:20PM -0400, Brian Foster wrote:
> > It appears that the tp->t_blk_res overrun assert is also related to
> > delalloc punching. I'm basically seeing a writepage attempted for a page
> > over a delalloc extent (imap in writepage reflects this), but at some
> > point before we get into the allocation (I suspect between where we
> > release and reacquire the ilock, from xfs_map_blocks() to
> > xfs_iomap_write_allocate()) the delalloc blocks that relate this particular
> > page to the extent disappear. An in-core extent dump after doing the
> > allocation but before inserting into the extent tree shows the range to
> > be allocated as a hole. Indeed, the allocator saw a hole (wasdel == 0)
> > and ended up doing the full allocation while the transaction had a
> > reservation for a delalloc conversion, which we end up overrunning.
> > It's not quite clear to me how this is happening. I would expect
> > truncate_pagecache_range() to serialize against writeback via page lock.
> It does. But the problem won't be truncate_pagecache_range(), it will be
> that punching delalloc blocks doesn't serialise against the page lock,
> i.e. we run truncate_pagecache_range/truncate_setsize without
> holding inode locks, and they serialise against writeback via page
> locks. However, we run xfs_bmap_punch_delalloc_range() holding inode
> locks, but no page locks. Hence both can serialise against
> writeback, but not against each other. Therefore:
> writeback                          zero range
>
> page is delalloc
> XFS_ILOCK_SHARED                   XFS_ILOCK_EXCL
> map delalloc                         <blocks>
This pretty much matches the observed behavior and what I suspect is
happening at a basic level (i.e., this ilock cycle opens a race window).
What still isn't clear is how we get past the pagecache truncate in
such a state. The zero range path truncates the pagecache before the
delalloc punch, and that does serialize on the page lock. If it gets the
lock, it tosses the page completely, which the mapping checks in
writepage should handle. If it doesn't get the lock, or the page is
under writeback, it waits for writeback to complete.

This is an fsx workload, so the I/O is single threaded. I'm not quite
seeing how we get through the writeback/pagecache truncate with a
delalloc page still in place.
>                                    punch delalloc range
> XFS_ILOCK_EXCL                     pagecache_truncate_range
> allocate HOLE!!!!                    page_lock
> xfs_iunlock                            <blocks>
Ah, so this makes sense according to the code as it is today, because we
punch the delalloc blocks and then truncate the pagecache. Note that this
zero range rework does the pagecache truncate first and then punches out
the delalloc blocks; hence my expectation of seeing either the page
tossed or the extent converted before we get to the delalloc punch.
> starts page writeback
> unlocks page
>                                    waits on writeback
>                                    removes page from cache
> Basically, we're back to the fundamental problem that we can't
> manipulate extent state directly if there are *unlocked* pages in
> the page cache over that range because we hold meaningful writeback
> state on the page.
> > Regardless, the scenario of writing back pages into unallocated space
> > that writeback thinks is covered by delayed allocation suggests
> > something racing with delalloc punching. The problem seems to disappear
> > if I comment it out from zero range, so we might want to kill the above
> > hunk entirely for the time being.
> Yes, but I think I see the way forward now. xfs_vm_writepage() needs
> to unconditionally map the page it is being passed rather than
> relying on the state of the bufferheads attached to the page. We can't
> serialise the page cache state against direct extent manipulations,
> therefore we have to make all decisions based on one set of state we
> can serialise access to sanely.
Sounds interesting. I'm all for condensing our page state management
towards killing buffer_head. Tracking down some of these issues has
helped clarify the general problem for me, and it also shows that
individual instances are quite difficult to root cause. This one in
particular is a zero range operation that presumably works (no fsx
failures), only to cause an implicit error in an asynchronous path
(writeback) some time later.
That makes it difficult to detect and equally difficult to try and
reproduce.
> The issue is that we can't currently serialise extent manipulations
> against writeback easily, but I think I see a neat and easy way to
> do this, and it mirrors a technique we already use(*): add a cursor
> to track the active map we are operating on so that we know if we
> are about to modify an extent we are currently writing to.
Ok, so I take it this would allow us to mark an extent as "under
modification" or "in-use," in the sense that its state is significant
across a code sequence larger than a single ilock critical section
(i.e., an imap, extent-conversion, writeback sequence). Then one side
can invalidate the other when a modification is made, or conversely the
other side could simply block on the operation if that is more
appropriate. I _think_ I get where you're going, but I await more
details. ;)
I'll post a new version of this patch regardless because, as mentioned
previously, I'd like to get some kind of checkpoint into the tree that
works correctly (no asserts), even if the implementation is not ideal.
I've not been able to reproduce any asserts so far without the delalloc
punch, but I'd also like to see if any of your subsequent testing
uncovers anything else.
> Actually, a cursor solves a couple of other problems in this
> mapping code - see the big comment in xfs_iomap_write_allocate()
> about using a single map to avoid lookup/unlock/lock/allocate race
> conditions with truncate/hole-punch.
> So I think I have a solution that moves us towards the goal of "die,
> bufferheads, die" and solves these writeback vs page cache vs extent
> manipulation issues - let me do a bit more thinking about it and
> I'll write something up.
> (*) we use a cursor for a similar purpose during AIL traversal: to
> detect item deletion while the AIL lock was dropped that would cause
> us to follow an invalid list pointer.
> Dave Chinner