On Thu, Nov 17, 2011 at 03:03:04AM -0500, Christoph Hellwig wrote:
> On Thu, Nov 17, 2011 at 11:29:15AM +1100, Dave Chinner wrote:
> > Actually, that is not true: we know that transaction reservation in
> > Io completion can hang systems. That's why we specifically shifted
> > unwritten extent conversion to it's own workqueue.
> > The condition that I saw most frequently was transaction reservation
> > blocking and preventing IO from completing and therefore marking
> > memory clean and freeable, whilst the unwritten extent conversion
> > transaction reservation was blocked because the tail of the log
> > could not be moved due to memory allocation for inode buffers to
> > write back the inodes pinning the tail of the log failing and hence
> > looping endlessly in the buffer cache. If we do transaction
> > reservations for every single extending IO completion during Io
> > completion, we are effectively making the deadlock possible for most
> > extending delayed allocation write IOs (i.e. the majority of write
> > IO in a typical system).
> That fact has changed:
> commit 80641dc66a2d6dfb22af4413227a92b8ab84c7bb
> Author: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Date: Mon Oct 19 04:00:03 2009 +0000
> xfs: I/O completion handlers must use NOFS allocations
> When completing I/O requests we must not allow the memory allocator to
> recurse into the filesystem, as we might deadlock on waiting for the
> I/O completion otherwise. The only thing currently allocating normal
> GFP_KERNEL memory is the allocation of the transaction structure for
> the unwritten extent conversion. Add a memflags argument to
> _xfs_trans_alloc to allow controlling the allocator behaviour.
Right, but that isn't the problem I've seen. That's memory reclaim
recursion - The situation I've seen is simply not having the
available memory available to flush a dirty inode when a buffer
needs to be allocated.
IOWs, flushing a dirty inode can require a metadata buffer to be
allocated. The problem is that allocating pages for a metadata
buffer can get stuck in the allocation loop in
xfs_buf_allocate_memory() precisely because GFP_NOFS page allocation
is failing. Then no further pages can be cleaned/freed because IO
completion is held up waiting for this allocation loop to succeed to
allow log space to be made available...
> So unwritten extent conversion will never recurse back into the
> filesystem any more.
It often needs to read metadata buffers from disk, so can deadlock
as per above. The moving of unwritten extent conversion into it's
own workqueue didn't prevent that deadlock - it just made it
extremely unlikely because normal IO could still complete and hence
the system could make progress in almost all cases it previously
> If we still care about blocking on other things
> (and I've hit this code fairly hard over the last months) we can
> increase the per-cpu workqueue concurrency limit. Currently xfs
> limits each of the workqueues to one item, but the default (when
> passing 0) would be 512. With that we could probably even get away
> with one per-filesystem workqueue if we want, but I'd really prefer
> to get the size logging fix out to users ASAP instead of doing all
> the fine-tuning now.
> > > In addition to that implementing log reservations from ->writepage that
> > > are only released by a different thread requires a lot of work,
> > That's trivial to do - there's nothing thread specific about a
> > transaction. All you need to do is ensure that everything is logged
> > before you pass the struct xfs_trans off to another thread. We
> > can do that simply by adding it to the ioend.
> The transaction set PF_FSTRANS in current->flags, which also includes
> saving and restoring the current flags in (I can't figure out exactly
> why we do the later). Without this we trivially deadlock - I tried
> to fix it, but it either didn't quite work or I ran into more deadlocks
Oh, I forgot about that. That should still be able to be done,
though - when the transaction is attached to the ioend, remove the
PF_FSTRANS flag from that thread, and set it when we get the ioend
I did exactly this for handing allocation off to a workqueue - the
work done by the workqueue is completely within transaction context,
so the worker thread needs to have that flag set when the work
starts, and cleared when the work completes. I didn't see any
problems from doing this.
Hence, AFAICT, the handing off of such state to other threads along
with the transaction shouldn't be problematic if it matches the way
the transaction is passed around. Without seeing you patches to do
this, I can't really comment any further....