

To: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Subject: Re: [PATCH 7/7] xfs: xfs_fs_write_inode() can fail to write inodes synchronously
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Tue, 26 Jan 2010 15:51:49 +1100
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20100125160354.GA30227@xxxxxxxxxxxxx>
References: <1264400564-19704-1-git-send-email-david@xxxxxxxxxxxxx> <1264400564-19704-8-git-send-email-david@xxxxxxxxxxxxx> <20100125160354.GA30227@xxxxxxxxxxxxx>
User-agent: Mutt/1.5.18 (2008-05-17)
On Mon, Jan 25, 2010 at 11:03:54AM -0500, Christoph Hellwig wrote:
> On Mon, Jan 25, 2010 at 05:22:44PM +1100, Dave Chinner wrote:
> > When an inode has already been flushed by a delayed write,
> > xfs_inode_clean() returns true and hence xfs_fs_write_inode() can
> > return on a synchronous inode write without having written the
> > inode. Currently these synchronous writes only come from the unmount
> > path or the nfsd on a synchronous export, so they should be fairly rare.
> They also come from sync_filesystem, which is used by the sync system
> call, the unmount code and cachefiles.

True - I'll update the comment - but I still think it'll be fairly
rare.
> > Realistically, a synchronous inode write is not necessary here; we
> > can treat this like fsync where we either force the log if there are
> > no unlogged changes, or do a sync transaction if there are unlogged
> > changes. This will result in real synchronous semantics as the fsync
> > will issue barriers, but may slow down the above two configurations
> > as a result. However, if the inode is not pinned and has no unlogged
> > changes, then the fsync code is a no-op and hence it may be faster
> > than the existing code.
> If we get a lot of cases where we need to write out the inode
> synchronously the barrier might hit us really hard, though.

No different to running wsync, though, where all transactions
are synchronous and will issue barriers all the time.

> If
> we have a lot of delalloc I/O outstanding I fear this might actually
> happen in practice as the inode gets modified between the first
> ->write_inode with wait == 0 and I/O completion.

So far in my testing I haven't seen a big hit - the performance
tests I've done are on filesystems with barriers enabled. I just
checked barrier vs nobarrier sync times after creating 400,000
single block files in parallel - nobarrier = 27s, barrier = 29s.

> > +   error = EAGAIN;
> > +   if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
> > +           goto out;
> > +   if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
> > +           goto out_unlock;
> So if we make this non-blocking even for the wait case, don't we
> still have a race window there bulkstat could miss the updates, even
> after a sync?

Yes, you're right. But even if we lock here properly, a delwri flush
is non-blocking and hence can still return EAGAIN. We really only need
this if a newly allocated inode has not been previously flushed for
bulkstat to work correctly. We would need to race with a concurrent
transaction between the fsync call and the below checks for this
flush to fail, which I think should be a relatively rare occurrence.

What I will look at is whether I can get xfs_fsync() to take a
locked inode and return with it still locked. Then this race
condition will go away completely and hence the delwri flush
will only occur if the inode has not been flushed yet (based
on the flock).


Dave Chinner
