On Wed, 18 Mar 2009, Dave Chinner wrote:
> On Tue, Mar 17, 2009 at 09:08:36AM -0400, Mikulas Patocka wrote:
> > On Sun, 15 Mar 2009, Dave Chinner wrote:
> > > From: Dave Chinner <dgc@xxxxxxxxxxx>
> > >
> > > Currently xfs_device_flush calls sync_blockdev() which is
> > > a no-op for XFS as all it's metadata is held in a different
> > > address to the one sync_blockdev() works on.
> > >
> > > Call xfs_sync_inodes() instead to flush all the delayed
> > > allocation blocks out. To do this as efficiently as possible,
> > > do it via two passes - one to do an async flush of all the
> > > dirty blocks and a second to wait for all the IO to complete.
> > > This requires some modification to the xfs-sync_inodes_ag()
> > > flush code to do efficiently.
> > I tested the code and it works fine.
> > But I'm not somehow convinced that that TRYLOCK is right ... that it will
> > avoid that pagelock-deadlock that I found before.
> Care to expand on why?
> > What is the ordering of pagelock, ILOCK and IOLOCK?
> iolock, pagelock, and ilock is always the innermost of the three.
> > If you keep the
> > ordering right, you don't need trylock.
> In this case, we are doing:
> iolock exclusive
> Clearly we can't take the iolock during <flush> (nor any other
> several other places in XFS that might try to take the iolock), so
> trylock-and-fail is the usual method used in XFS for avoiding a
> deadlock in this scenario.
> > And if the ordering is wrong,
> > trylock will only lower the probability of a deadlock, not avoid it
> > completely.
> Details of why you think this? XFS uses trylocks extensively to
> avoid deadlocks in these sorts of situations, so I'm curious to
> know why you think this technique provide an absolute guarantee in
> this case....
What I find scary is those two recursive pagelocks being held without
trylock. The sequence is like:
lock iolock 1
lock pagelock 1
--- submit request to xfssyncd, that does
trylock iolock 2
lock pagelock 2
... now suppose that this is racing with another process that takes
pagelock without taking iolock first (memory manager trying to flush files
mmaped for write or so). It can do
lock pagelock 2
... and submit flush request to a thread that is actually waiting to get
--- I don't know --- is there a possibility to reserve filesystem space
for write-mapped files at the time of the first page fault? (so that the
space won't be allocated at the time of a flush?).