<div class="gmail_quote">Am 21.11.2012 02:38 schrieb <<a href="mailto:xfs-request@oss.sgi.com">xfs-request@oss.sgi.com</a>>:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Send xfs mailing list submissions to<br>
<a href="mailto:xfs@oss.sgi.com">xfs@oss.sgi.com</a><br>
<br>
To subscribe or unsubscribe via the World Wide Web, visit<br>
<a href="http://oss.sgi.com/mailman/listinfo/xfs" target="_blank">http://oss.sgi.com/mailman/listinfo/xfs</a><br>
or, via email, send a message with subject or body 'help' to<br>
<a href="mailto:xfs-request@oss.sgi.com">xfs-request@oss.sgi.com</a><br>
<br>
You can reach the person managing the list at<br>
<a href="mailto:xfs-owner@oss.sgi.com">xfs-owner@oss.sgi.com</a><br>
<br>
When replying, please edit your Subject line so it is more specific<br>
than "Re: Contents of xfs digest..."<br>
<br>Today's Topics:<br>
<br>
1. [PATCH 0/2] discontiguous buffer patches (Mark Tinguely)<br>
2. Re: [PATCH 0/2] xfsprogs: fixes for release candidate.<br>
(Mark Tinguely)<br>
3. Re: [PATCH] xfs: Don't flush inodes when project quota<br>
exceeded (Jan Kara)<br>
4. Re: [PATCH 2/9] ext4: honor the O_SYNC flag for aysnchronous<br>
direct I/O requests (Jan Kara)<br>
5. [PATCH] ext4: Reduce i_mutex usage in ext4_file_sync() (Jan Kara)<br>
6. Re: [PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync<br>
part correctly (Jan Kara)<br>
7. Re: [PATCH] xfs: Don't flush inodes when project quota<br>
exceeded (Dave Chinner)<br>
<br><br>---------- Weitergeleitete Nachricht ----------<br>From: Mark Tinguely <<a href="mailto:tinguely@sgi.com">tinguely@sgi.com</a>><br>To: <a href="mailto:xfs@oss.sgi.com">xfs@oss.sgi.com</a><br>Cc: <br>Date: Tue, 20 Nov 2012 16:41:20 -0600<br>
Subject: [PATCH 0/2] discontiguous buffer patches<br>Eric Sundeen's "userspace vs. fragmented multiblock dir2", xfstest 291<br>
commit 2a4ed, causes a xfs_buf lock hang:<br>
<br>
[<ffffffff81065d87>] down+0x47/0x50<br>
[<ffffffffa03a25c6>] xfs_buf_lock+0x66/0xe0 [xfs]<br>
[<ffffffffa03a33c9>] _xfs_buf_find+0x1f9/0x320 [xfs]<br>
[<ffffffffa03a393f>] xfs_buf_get_map+0x2f/0x170 [xfs]<br>
[<ffffffffa03a3f7a>] xfs_buf_read_map+0x2a/0x100 [xfs]<br>
[<ffffffffa0411630>] xfs_trans_read_buf_map+0x3b0/0x590 [xfs]<br>
[<ffffffffa03e1c5e>] xfs_da_read_buf+0xbe/0x230 [xfs]<br>
[<ffffffffa03e78dc>] xfs_dir2_block_addname+0x7c/0x980 [xfs]<br>
[<ffffffffa03f1468>] xfs_dir2_sf_addname+0x3e8/0x450 [xfs]<br>
[<ffffffffa03e634c>] xfs_dir_createname+0x17c/0x1e0 [xfs]<br>
[<ffffffffa03b9fe2>] xfs_create+0x4c2/0x5f0 [xfs]<br>
[<ffffffffa03b0c8a>] xfs_vn_mknod+0x8a/0x1a0 [xfs]<br>
[<ffffffffa03b0dce>] xfs_vn_create+0xe/0x10 [xfs]<br>
[<ffffffff8115695c>] vfs_create+0xac/0xd0<br>
[<ffffffff811587de>] do_last+0x8be/0x960<br>
[<ffffffff8115940c>] path_openat+0xdc/0x410<br>
[<ffffffff81159853>] do_filp_open+0x43/0xa0<br>
[<ffffffff8114a502>] do_sys_open+0x152/0x1e0<br>
[<ffffffff8114a5cc>] sys_open+0x1c/0x20<br>
[<ffffffff81423df9>] system_call_fastpath+0x16/0x1b<br>
[<ffffffffffffffff>] 0xffffffffffffffff<br>
<br>
That bisect a problem to:<br>
<br>
commit 3605431fb9739a30ccd0c6380ae8e3c6f8e670a5<br>
Author: Dave Chinner <<a href="mailto:dchinner@redhat.com">dchinner@redhat.com</a>><br>
Date: Fri Jun 22 18:50:13 2012 +1000<br>
xfs: use discontiguous xfs_buf support in dabuf wrappers<br>
<br>
xfs_trans_buf_item_match() is looking for the block number of the<br>
buffer in the single segment map area.<br>
<br>
Futhermore, there are a couple issue with multi-segment buffer log<br>
format.<br>
<br>
Patch 1 cleans up the buffer map so that XFS always uses b_maps[].<br>
<br>
Patch 2 fixes the buffer log format issues.<br>
<br>
--Mark.<br>
<br>
<br>
<br>
<br><br>---------- Weitergeleitete Nachricht ----------<br>From: Mark Tinguely <<a href="mailto:tinguely@sgi.com">tinguely@sgi.com</a>><br>To: Dave Chinner <<a href="mailto:david@fromorbit.com">david@fromorbit.com</a>><br>
Cc: <a href="mailto:xfs@oss.sgi.com">xfs@oss.sgi.com</a><br>Date: Tue, 20 Nov 2012 17:17:36 -0600<br>Subject: Re: [PATCH 0/2] xfsprogs: fixes for release candidate.<br>On 11/09/12 01:02, Dave Chinner wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi Ben,<br>
<br>
These are two fixes for bugs that have shown up during the -rc<br>
phase of the release. Please consider them for inclusion.<br>
<br>
Cheers,<br>
<br>
Dave.<br>
<br>
______________________________<u></u>_________________<br>
xfs mailing list<br>
<a href="mailto:xfs@oss.sgi.com" target="_blank">xfs@oss.sgi.com</a><br>
<a href="http://oss.sgi.com/mailman/listinfo/xfs" target="_blank">http://oss.sgi.com/mailman/<u></u>listinfo/xfs</a><br>
</blockquote>
<br>
This series has been committed to git://<a href="http://oss.sgi.com/xfs/cmds/xfsprogs" target="_blank">oss.sgi.com/xfs/cmds/<u></u>xfsprogs</a> master branch, commit 1adfff, 19473a and 36298.<br>
<br>
--Mark.<br>
<br>
<br>
<br><br>---------- Weitergeleitete Nachricht ----------<br>From: Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>><br>To: Dave Chinner <<a href="mailto:david@fromorbit.com">david@fromorbit.com</a>><br>
Cc: Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>>, <a href="mailto:xfs@oss.sgi.com">xfs@oss.sgi.com</a><br>Date: Wed, 21 Nov 2012 01:24:59 +0100<br>Subject: Re: [PATCH] xfs: Don't flush inodes when project quota exceeded<br>
On Tue 20-11-12 11:04:28, Dave Chinner wrote:<br>
> On Mon, Nov 19, 2012 at 10:39:13PM +0100, Jan Kara wrote:<br>
> > On Tue 13-11-12 01:36:13, Jan Kara wrote:<br>
> > > When project quota gets exceeded xfs_iomap_write_delay() ends up flushing<br>
> > > inodes because ENOSPC gets returned from xfs_bmapi_delay() instead of EDQUOT.<br>
> > > This makes handling of writes over project quota rather slow as a simple test<br>
> > > program shows:<br>
> > > fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);<br>
> > > for (i = 0; i < 50000; i++)<br>
> > > pwrite(fd, buf, 4096, i*4096);<br>
> > ><br>
> > > Writing 200 MB like this into a directory with 100 MB project quota takes<br>
> > > around 6 minutes while it takes about 2 seconds with this patch applied. This<br>
> > > actually happens in a real world load when nfs pushes data into a directory<br>
> > > which is over project quota.<br>
> > ><br>
> > > Fix the problem by replacing XFS_QMOPT_ENOSPC flag with XFS_QMOPT_EPDQUOT.<br>
> > > That makes xfs_trans_reserve_quota_bydquots() return new error EPDQUOT when<br>
> > > project quota is exceeded. xfs_bmapi_delay() then uses this flag so that<br>
> > > xfs_iomap_write_delay() can distinguish real ENOSPC (requiring flushing)<br>
> > > from exceeded project quota (not requiring flushing).<br>
> > ><br>
> > > As a side effect this patch fixes inconsistency where e.g. xfs_create()<br>
> > > returned EDQUOT even when project quota was exceeded.<br>
> > Ping? Any opinions?<br>
><br>
> FWIW, it doesn't look like it'll apply to a current XFs tree:<br>
><br>
> > > @@ -441,8 +442,11 @@ retry:<br>
> > > */<br>
> > > if (nimaps == 0) {<br>
> > > trace_xfs_delalloc_enospc(ip, offset, count);<br>
> > > - if (flushed)<br>
> > > - return XFS_ERROR(error ? error : ENOSPC);<br>
> > > + if (flushed) {<br>
> > > + if (error == 0 || error == EPDQUOT)<br>
> > > + error = ENOSPC;<br>
> > > + return XFS_ERROR(error);<br>
> > > + }<br>
> > ><br>
> > > if (error == ENOSPC) {<br>
> > > xfs_iunlock(ip, XFS_ILOCK_EXCL);<br>
><br>
> This xfs_iomap_write_delay() looks like this now:<br>
><br>
> /*<br>
> * If bmapi returned us nothing, we got either ENOSPC or EDQUOT. Retry<br>
> * without EOF preallocation.<br>
> */<br>
> if (nimaps == 0) {<br>
> trace_xfs_delalloc_enospc(ip, offset, count);<br>
> if (prealloc) {<br>
> prealloc = 0;<br>
> error = 0;<br>
> goto retry;<br>
> }<br>
> return XFS_ERROR(error ? error : ENOSPC);<br>
> }<br>
><br>
> The flushing is now way up in xfs_file_buffered_aio_write(), and the<br>
> implementation of xfs_flush_inodes() has changed as well. Hence it<br>
> may or may not behave differently not....<br>
OK, so I tested latest XFS tree and changes by commit 9aa05000 (changing<br>
xfs_flush_inodes()) indeed improve the performace from those ~6 minutes to<br>
~6 seconds which is good enough I believe. Thanks for the pointer! I was<br>
thinking for a while why sync_inodes_sb() is so much faster than the<br>
original XFS implementation and I believe it's because we don't force the<br>
log on each sync now.<br>
<br>
Honza<br>
--<br>
Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>><br>
SUSE Labs, CR<br>
<br>
<br>
<br><br>---------- Weitergeleitete Nachricht ----------<br>From: Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>><br>To: Jeff Moyer <<a href="mailto:jmoyer@redhat.com">jmoyer@redhat.com</a>><br>Cc: <a href="mailto:axboe@kernel.dk">axboe@kernel.dk</a>, <a href="mailto:tytso@mit.edu">tytso@mit.edu</a>, "Darrick J. Wong" <<a href="mailto:darrick.wong@oracle.com">darrick.wong@oracle.com</a>>, <a href="mailto:linux-kernel@vger.kernel.org">linux-kernel@vger.kernel.org</a>, <a href="mailto:xfs@oss.sgi.com">xfs@oss.sgi.com</a>, <a href="mailto:hch@infradead.org">hch@infradead.org</a>, <a href="mailto:bpm@sgi.com">bpm@sgi.com</a>, <a href="mailto:viro@zeniv.linux.org.uk">viro@zeniv.linux.org.uk</a>, <a href="mailto:linux-fsdevel@vger.kernel.org">linux-fsdevel@vger.kernel.org</a>, Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>>, <a href="mailto:linux-ext4@vger.kernel.org">linux-ext4@vger.kernel.org</a><br>
Date: Wed, 21 Nov 2012 01:56:26 +0100<br>Subject: Re: [PATCH 2/9] ext4: honor the O_SYNC flag for aysnchronous direct I/O requests<br>On Tue 20-11-12 15:02:15, Jeff Moyer wrote:<br>
> Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>> writes:<br>
><br>
> >> @@ -1279,6 +1280,9 @@ struct ext4_sb_info {<br>
> >> /* workqueue for dio unwritten */<br>
> >> struct workqueue_struct *dio_unwritten_wq;<br>
> >><br>
> >> + /* workqueue for aio+dio+o_sync disk cache flushing */<br>
> >> + struct workqueue_struct *aio_dio_flush_wq;<br>
> >> +<br>
> > Umm, I'm not completely decided whether we really need a separate<br>
> > workqueue. But it doesn't cost too much so I guess it makes some sense -<br>
> > fsync() is rather heavy so syncing won't starve extent conversion...<br>
><br>
> I'm assuming you'd like me to convert the names from flush to fsync,<br>
> yes?<br>
Would be nicer, yes.<br>
<br>
> >> +<br>
> >> + /*<br>
> >> + * If we are running in nojournal mode, just flush the disk<br>
> >> + * cache and return.<br>
> >> + */<br>
> >> + if (!journal)<br>
> >> + return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);<br>
> > And this is wrong as well - you need to do work similar to what<br>
> > ext4_sync_file() does. Actually it would be *much* better if these two<br>
> > sites used the same helper function. Which also poses an interesting<br>
> > question about locking - do we need i_mutex or not? Forcing a transaction<br>
> > commit is definitely OK without it, similarly as grabbing transaction ids<br>
> > from inode or ext4_should_journal_data() test. __sync_inode() call seems<br>
> > to be OK without i_mutex as well so I believe we can just get rid of it<br>
> > (getting i_mutex from the workqueue is a locking nightmare we don't want to<br>
> > return to).<br>
><br>
> Just to be clear, are you saying you would like me to remove the<br>
> mutex_lock/unlock pair from ext4_sync_file? (I had already factored out<br>
> the common code between this new code path and the fsync path in my tree.)<br>
Yes, after some thinking I came to that conclusion. We actually need to<br>
keep i_mutex around ext4_flush_unwritten_io() to avoid livelocks but the<br>
rest doesn't need it. The change should be definitely a separate patch just<br>
in case there's something subtle I missed and we need to bisect in<br>
future... I've attached a patch for that so that blame for bugs goes my way<br>
;) Compile tested only so far. I'll give it some more testing overnight.<br>
<br>
Honza<br>
--<br>
Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>><br>
SUSE Labs, CR<br>
<br><br>---------- Weitergeleitete Nachricht ----------<br>From: Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>><br>To: <br>Cc: <br>Date: Wed, 21 Nov 2012 01:46:51 +0100<br>Subject: [PATCH] ext4: Reduce i_mutex usage in ext4_file_sync()<br>
ext4_file_sync() needs i_mutex only to avoid livelocks of<br>
ext4_flush_unwritten_io() all other code doesn't need it. In particular<br>
syncing of inode & metadata in non-journal case is safe (writeback doesn't<br>
hold i_mutex either) and forcing of transaction commits doesn't need i_mutex<br>
either (there's nothing inode specific in that code apart from grabbing<br>
transaction ids from the inode). So shorten the span where i_mutex is held.<br>
<br>
Signed-off-by: Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>><br>
---<br>
fs/ext4/fsync.c | 6 ++----<br>
1 files changed, 2 insertions(+), 4 deletions(-)<br>
<br>
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c<br>
index be1d89f..2268114 100644<br>
--- a/fs/ext4/fsync.c<br>
+++ b/fs/ext4/fsync.c<br>
@@ -113,8 +113,6 @@ static int __sync_inode(struct inode *inode, int datasync)<br>
*<br>
* What we do is just kick off a commit and wait on it. This will snapshot the<br>
* inode to disk.<br>
- *<br>
- * i_mutex lock is held when entering and exiting this function<br>
*/<br>
<br>
int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)<br>
@@ -133,12 +131,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)<br>
ret = filemap_write_and_wait_range(inode->i_mapping, start, end);<br>
if (ret)<br>
return ret;<br>
- mutex_lock(&inode->i_mutex);<br>
<br>
if (inode->i_sb->s_flags & MS_RDONLY)<br>
goto out;<br>
<br>
+ mutex_lock(&inode->i_mutex);<br>
ret = ext4_flush_unwritten_io(inode);<br>
+ mutex_unlock(&inode->i_mutex);<br>
if (ret < 0)<br>
goto out;<br>
<br>
@@ -180,7 +179,6 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)<br>
ret = err;<br>
}<br>
out:<br>
- mutex_unlock(&inode->i_mutex);<br>
trace_ext4_sync_file_exit(inode, ret);<br>
return ret;<br>
}<br>
--<br>
1.7.1<br>
<br>
<br>
--k+w/mQv8wyuph6w0--<br>
<br>
<br>
<br><br>---------- Weitergeleitete Nachricht ----------<br>From: Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>><br>To: Jeff Moyer <<a href="mailto:jmoyer@redhat.com">jmoyer@redhat.com</a>><br>Cc: <a href="mailto:axboe@kernel.dk">axboe@kernel.dk</a>, <a href="mailto:tytso@mit.edu">tytso@mit.edu</a>, "Darrick J. Wong" <<a href="mailto:darrick.wong@oracle.com">darrick.wong@oracle.com</a>>, <a href="mailto:linux-kernel@vger.kernel.org">linux-kernel@vger.kernel.org</a>, <a href="mailto:xfs@oss.sgi.com">xfs@oss.sgi.com</a>, <a href="mailto:hch@infradead.org">hch@infradead.org</a>, <a href="mailto:bpm@sgi.com">bpm@sgi.com</a>, <a href="mailto:viro@zeniv.linux.org.uk">viro@zeniv.linux.org.uk</a>, <a href="mailto:linux-fsdevel@vger.kernel.org">linux-fsdevel@vger.kernel.org</a>, Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>>, <a href="mailto:linux-ext4@vger.kernel.org">linux-ext4@vger.kernel.org</a><br>
Date: Wed, 21 Nov 2012 01:57:39 +0100<br>Subject: Re: [PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync part correctly<br>On Tue 20-11-12 15:47:54, Jeff Moyer wrote:<br>
> Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>> writes:<br>
><br>
> > On Mon 19-11-12 23:51:15, Darrick J. Wong wrote:<br>
> >> When performing O_SYNC+AIO+DIO writes to block devices, use the DIO_SYNC_WRITES<br>
> >> flag so that flushes are issued /after/ the write completes, not before.<br>
> >><br>
> >> Note, however, that for block devices, the DIO setup code ensures that a flush<br>
> >> wq is attached to the superblock of the bdevfs filesystem, not the filesystem<br>
> >> that the device node happens to reside in. This means that unlike regular<br>
> >> files, iocb->ki_filp->f_mapping->host->i_sb != inode->i_sb. Therefore, adjust<br>
> >> Jeff's earlier patch to keep the pointer use consistent and avoid a NULL deref.<br>
> >><br>
> >> Signed-off-by: Darrick J. Wong <<a href="mailto:darrick.wong@oracle.com">darrick.wong@oracle.com</a>><br>
> >> ---<br>
> >> fs/block_dev.c | 5 +++--<br>
> >> fs/direct-io.c | 3 ++-<br>
> >> 2 files changed, 5 insertions(+), 3 deletions(-)<br>
> >><br>
> >><br>
> >> diff --git a/fs/block_dev.c b/fs/block_dev.c<br>
> >> index 1a1e5e3..05ff33a 100644<br>
> >> --- a/fs/block_dev.c<br>
> >> +++ b/fs/block_dev.c<br>
> >> @@ -235,7 +235,8 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,<br>
> >> struct inode *inode = file->f_mapping->host;<br>
> >><br>
> >> return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,<br>
> >> - nr_segs, blkdev_get_blocks, NULL, NULL, 0);<br>
> >> + nr_segs, blkdev_get_blocks, NULL, NULL,<br>
> >> + DIO_SYNC_WRITES);<br>
> >> }<br>
> >><br>
> >> int __sync_blockdev(struct block_device *bdev, int wait)<br>
> >> @@ -1631,7 +1632,7 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,<br>
> >> percpu_down_read(&bdev->bd_block_size_semaphore);<br>
> >><br>
> >> ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);<br>
> >> - if (ret > 0 || ret == -EIOCBQUEUED) {<br>
> >> + if (ret > 0) {<br>
> >> ssize_t err;<br>
> >><br>
> >> err = generic_write_sync(file, pos, ret);<br>
> >> diff --git a/fs/direct-io.c b/fs/direct-io.c<br>
> >> index b7391d4..c626c43 100644<br>
> >> --- a/fs/direct-io.c<br>
> >> +++ b/fs/direct-io.c<br>
> >> @@ -258,7 +258,8 @@ void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,<br>
> >> work->ret = ret;<br>
> >> work->offset = offset;<br>
> >> work->len = bytes;<br>
> >> - queue_work(inode->i_sb->s_dio_flush_wq, &work->work);<br>
> >> + queue_work(iocb->ki_filp->f_mapping->host->i_sb->s_dio_flush_wq,<br>
> >> + &work->work);<br>
> > This should be folded into the original patch introducing the<br>
> > s_dio_flush_wq. And please add a comment before this line saying that block<br>
> > devices need a dereference exactly like this... Otherwise the patch looks<br>
> > good so you can add:<br>
> > Reviewed-by: Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>><br>
><br>
> When you say, "This," do you mean that one change, or the whole patch?<br>
I meant the whole patch after that hung is folded ;)<br>
<br>
Honza<br>
--<br>
Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>><br>
SUSE Labs, CR<br>
<br>
<br>
<br><br>---------- Weitergeleitete Nachricht ----------<br>From: Dave Chinner <<a href="mailto:david@fromorbit.com">david@fromorbit.com</a>><br>To: Jan Kara <<a href="mailto:jack@suse.cz">jack@suse.cz</a>><br>
Cc: <a href="mailto:xfs@oss.sgi.com">xfs@oss.sgi.com</a><br>Date: Wed, 21 Nov 2012 12:38:21 +1100<br>Subject: Re: [PATCH] xfs: Don't flush inodes when project quota exceeded<br>On Wed, Nov 21, 2012 at 01:24:59AM +0100, Jan Kara wrote:<br>
> On Tue 20-11-12 11:04:28, Dave Chinner wrote:<br>
> > On Mon, Nov 19, 2012 at 10:39:13PM +0100, Jan Kara wrote:<br>
> > > On Tue 13-11-12 01:36:13, Jan Kara wrote:<br>
> > > > When project quota gets exceeded xfs_iomap_write_delay() ends up flushing<br>
> > > > inodes because ENOSPC gets returned from xfs_bmapi_delay() instead of EDQUOT.<br>
> > > > This makes handling of writes over project quota rather slow as a simple test<br>
> > > > program shows:<br>
> > > > fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);<br>
> > > > for (i = 0; i < 50000; i++)<br>
> > > > pwrite(fd, buf, 4096, i*4096);<br>
> > > ><br>
> > > > Writing 200 MB like this into a directory with 100 MB project quota takes<br>
> > > > around 6 minutes while it takes about 2 seconds with this patch applied. This<br>
> > > > actually happens in a real world load when nfs pushes data into a directory<br>
> > > > which is over project quota.<br>
> > > ><br>
> > > > Fix the problem by replacing XFS_QMOPT_ENOSPC flag with XFS_QMOPT_EPDQUOT.<br>
> > > > That makes xfs_trans_reserve_quota_bydquots() return new error EPDQUOT when<br>
> > > > project quota is exceeded. xfs_bmapi_delay() then uses this flag so that<br>
> > > > xfs_iomap_write_delay() can distinguish real ENOSPC (requiring flushing)<br>
> > > > from exceeded project quota (not requiring flushing).<br>
> > > ><br>
> > > > As a side effect this patch fixes inconsistency where e.g. xfs_create()<br>
> > > > returned EDQUOT even when project quota was exceeded.<br>
> > > Ping? Any opinions?<br>
> ><br>
> > FWIW, it doesn't look like it'll apply to a current XFs tree:<br>
> ><br>
> > > > @@ -441,8 +442,11 @@ retry:<br>
> > > > */<br>
> > > > if (nimaps == 0) {<br>
> > > > trace_xfs_delalloc_enospc(ip, offset, count);<br>
> > > > - if (flushed)<br>
> > > > - return XFS_ERROR(error ? error : ENOSPC);<br>
> > > > + if (flushed) {<br>
> > > > + if (error == 0 || error == EPDQUOT)<br>
> > > > + error = ENOSPC;<br>
> > > > + return XFS_ERROR(error);<br>
> > > > + }<br>
> > > ><br>
> > > > if (error == ENOSPC) {<br>
> > > > xfs_iunlock(ip, XFS_ILOCK_EXCL);<br>
> ><br>
> > This xfs_iomap_write_delay() looks like this now:<br>
> ><br>
> > /*<br>
> > * If bmapi returned us nothing, we got either ENOSPC or EDQUOT. Retry<br>
> > * without EOF preallocation.<br>
> > */<br>
> > if (nimaps == 0) {<br>
> > trace_xfs_delalloc_enospc(ip, offset, count);<br>
> > if (prealloc) {<br>
> > prealloc = 0;<br>
> > error = 0;<br>
> > goto retry;<br>
> > }<br>
> > return XFS_ERROR(error ? error : ENOSPC);<br>
> > }<br>
> ><br>
> > The flushing is now way up in xfs_file_buffered_aio_write(), and the<br>
> > implementation of xfs_flush_inodes() has changed as well. Hence it<br>
> > may or may not behave differently not....<br>
> OK, so I tested latest XFS tree and changes by commit 9aa05000 (changing<br>
> xfs_flush_inodes()) indeed improve the performace from those ~6 minutes to<br>
> ~6 seconds which is good enough I believe. Thanks for the pointer! I was<br>
> thinking for a while why sync_inodes_sb() is so much faster than the<br>
> original XFS implementation and I believe it's because we don't force the<br>
> log on each sync now.<br>
<br>
I think it's detter because we now don't scan every inode in the<br>
inode cache doing a mapping_tagged(PAGECACHE_TAG_DIRTY) check to see<br>
if they are dirty or not to decide whether it needs writeback. The<br>
overhead of doing that adds up very quickly when you have lots of<br>
cached inodes and you are scanning them for every page that is<br>
written to....<br>
<br>
Cheers,<br>
<br>
Dave.<br>
--<br>
Dave Chinner<br>
<a href="mailto:david@fromorbit.com">david@fromorbit.com</a><br>
<br>
<br>
<br>_______________________________________________<br>
xfs mailing list<br>
<a href="mailto:xfs@oss.sgi.com">xfs@oss.sgi.com</a><br>
<a href="http://oss.sgi.com/mailman/listinfo/xfs" target="_blank">http://oss.sgi.com/mailman/listinfo/xfs</a><br>
<br></blockquote></div>