Thanks, Dave. I'll also try this patch.<div><br></div><div>Da<br><div><br><div class="gmail_quote">On Thu, Feb 9, 2012 at 1:09 AM, Dave Chinner <span dir="ltr"><<a href="mailto:david@fromorbit.com">david@fromorbit.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On Tue, Jan 24, 2012 at 02:54:31PM +1100, Dave Chinner wrote:<br>
> On Mon, Jan 23, 2012 at 03:51:43PM -0500, Zheng Da wrote:<br>
> > Hello<br>
> ><br>
> > On Mon, Jan 23, 2012 at 2:34 PM, Zheng Da <<a href="mailto:zhengda1936@gmail.com">zhengda1936@gmail.com</a>> wrote:<br>
</div><div><div class="h5">> > >> > So the test case is pretty simple and I think it's easy to reproduce it.<br>
> > >> > It'll be great if you can try the test case.<br>
> > >><br>
> > >> Can you post your test code so I know what I test is exactly what<br>
> > >> you are running?<br>
> > >><br>
> > > I can do that. My test code gets very complicated now. I need to simplify<br>
> > > it.<br>
> > ><br>
> > Here is the code. It's still a bit long. I hope it's OK.<br>
> > You can run the code like "rand-read file option=direct pages=1048576<br>
> > threads=8 access=write/read".<br>
><br>
> With 262144 pages on a 2Gb ramdisk, the results I get on 3.2.0 are<br>
><br>
> Threads Read Write<br>
> 1 0.92s 1.49s<br>
> 2 0.51s 1.20s<br>
> 4 0.31s 1.34s<br>
> 8 0.22s 1.59s<br>
> 16 0.23s 2.24s<br>
><br>
> the contention is on the ip->i_ilock, and the newsize update is one<br>
> of the offenders It probably needs this change to<br>
> xfs_aio_write_newsize_update():<br>
><br>
> - if (new_size == ip->i_new_size) {<br>
> + if (new_size && new_size == ip->i_new_size) {<br>
><br>
> to avoid the lock being taken here.<br>
><br>
> But all that newsize crap is gone in the current git Linus tree,<br>
> so how much would that gains us:<br>
><br>
> Threads Read Write<br>
> 1 0.88s 0.85s<br>
> 2 0.54s 1.20s<br>
> 4 0.31s 1.23s<br>
> 8 0.27s 1.40s<br>
> 16 0.25s 2.36s<br>
><br>
> Pretty much nothing. IOWs, it's just like I suspected - you are<br>
> doing so many write IOs that you are serialising on the extent<br>
> lookup and write checks which use exclusive locking..<br>
><br>
> Given that it is 2 lock traversals per write IO, we're limiting at<br>
> about 4-500,000 exclusive lock grabs per second and decreasing as<br>
> contention goes up.<br>
><br>
> For reads, we are doing 2 shared (nested) lookups per read IO, we<br>
> appear to be limiting at around 2,000,000 shared lock grabs per<br>
> second. Ahmdals law is kicking in here, but it means if we could<br>
> make the writes to use a shared lock, it would at least scale like<br>
> the reads for this "no metadata modification except for mtime"<br>
> overwrite case.<br>
><br>
> I don't think that the generic write checks absolutely need<br>
> exclusive locking - we probably could get away with a shared lock<br>
> and only fall back to exclusive when we need to do EOF zeroing.<br>
> Similarly, for the block mapping code if we don't need to do<br>
> allocation, a shared lock is all we need. So maybe in that case for<br>
> direct IO when create == 1, we can do a read lookup first and only<br>
> grab the lock exclusively if that falls in a hole and requires<br>
> allocation.....<br>
<br>
</div></div>So, I have a proof of concept patch that gets rid of the exclusive<br>
locking for the overwrite case.<br>
<br>
Results are:<br>
<br>
Writes<br>
Threads vanilla patched read<br>
1 0.85s 0.93s 0.88s<br>
2 1.20s 0.58s 0.54s<br>
4 1.23s 0.32s 0.31s<br>
8 1.40s 0.27s 0.27s<br>
16 2.36s 0.23s 0.25s<br>
<br>
So overwrites scale pretty much like reads now: ~1,000,000 overwrite<br>
IOs per second to that one file with 8-16 threads. Given these tests<br>
are running on an 8p VM, it's not surprising it doesn't go any<br>
faster than that as the thread count goes up.<br>
<br>
The patch hacks in some stuff that Christoph's transactional size<br>
and timestamp update patches do correctly, so these changes would<br>
need to wait for that series to be finalised. As it is, this patch<br>
doesn't appear to cause any new xfstests regressions, so it's good<br>
for discussion and testing....<br>
<br>
Anyway, for people to comment on, the patch is below.<br>
<div class="im"><br>
Cheers,<br>
<br>
Dave.<br>
--<br>
Dave Chinner<br>
<a href="mailto:david@fromorbit.com">david@fromorbit.com</a><br>
<br>
<br>
</div>xfs: use shared ilock mode for direct IO writes by default<br>
<br>
From: Dave Chinner <<a href="mailto:dchinner@redhat.com">dchinner@redhat.com</a>><br>
<br>
For the direct IO write path, we only really need the ilock to be<br>
taken in exclusive mode during IO submission if we need to do extent<br>
allocation. We currently take it in exclusive mode for both the<br>
write sanity checks and for block mapping and allocation.<br>
<br>
In the case of the write sanity checks, we only need to protect the<br>
inode from change while this is occurring, and hence we can use a<br>
shared lock for this. We still need to provide exclusion for EOF<br>
zeroing, so we need to detect that case and upgrade the locking to<br>
exclusive in that case. This is a simple extension of the existing<br>
iolock upgrade case.<br>
<br>
We also have the case of timestamp updates occurring inside the<br>
ilock. however, we don't really care if timestamp update races occur<br>
as they are going to end up with the same timestamp anyway. Further,<br>
as we move to transactional timestamp updates, we can't do the<br>
update from within the ilock at all. Hence move the timestamp<br>
update outside the ilock altogether as we don't need or want it to<br>
be protected there.<br>
<br>
For block mapping, the direct IO case has to drop the ilock after<br>
the initial read mapping for transaction reservation before the<br>
allocation can be done. This means that the mapping to determine if<br>
allocation is needed simply requires a "don't change" locking<br>
semantic. i.e. a shared lock.<br>
<br>
Hence we can safely change the xfs_get_blocks() code to use a shared<br>
lock for the initial mapping lookup because that provides the same<br>
guarantees but doesn't introduce new race conditions due to the<br>
allocation having to upgrade the lock - it already has those race<br>
conditions and has to handle them. This means that overwrite and<br>
write into preallocated space direct IO will be mapped with just a<br>
shared lock.<br>
<br>
Finally, we only take the ilock during IO completion if the current<br>
ioend is beyond the file size. This is a quick hack that will be<br>
fixed properly by transactional size updates.<br>
<br>
In combination, these three changes remove all exclusive<br>
serialisation points in the direct IO write path for single file<br>
overwrite workloads.<br>
<br>
Signed-off-by: Dave Chinner <<a href="mailto:dchinner@redhat.com">dchinner@redhat.com</a>><br>
---<br>
fs/xfs/xfs_aops.c | 22 ++++++++++++++++++++--<br>
fs/xfs/xfs_file.c | 35 +++++++++++++++++++++--------------<br>
fs/xfs/xfs_iomap.c | 10 +++++++---<br>
fs/xfs/xfs_iomap.h | 2 +-<br>
4 files changed, 49 insertions(+), 20 deletions(-)<br>
<br>
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c<br>
index 74b9baf..4c84508 100644<br>
--- a/fs/xfs/xfs_aops.c<br>
+++ b/fs/xfs/xfs_aops.c<br>
@@ -138,6 +138,9 @@ xfs_setfilesize(<br>
xfs_inode_t *ip = XFS_I(ioend->io_inode);<br>
xfs_fsize_t isize;<br>
<br>
+ if (!xfs_ioend_new_eof(ioend))<br>
+ return 0;<br>
+<br>
if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))<br>
return EAGAIN;<br>
<br>
@@ -1121,7 +1124,22 @@ __xfs_get_blocks(<br>
return 0;<br>
<br>
if (create) {<br>
- lockmode = XFS_ILOCK_EXCL;<br>
+ /*<br>
+ * For direct IO, we lock in shared mode so that write<br>
+ * operations that don't require allocation can occur<br>
+ * concurrently. The ilock has to be dropped over the allocation<br>
+ * transaction reservation, so the only thing the ilock is<br>
+ * providing here is modification exclusion. i.e. there is no<br>
+ * need to hold the lock exclusive.<br>
+ *<br>
+ * For buffered IO, if we need to do delayed allocation then<br>
+ * hold the ilock exclusive so that the lookup and delalloc<br>
+ * reservation is atomic.<br>
+ */<br>
+ if (direct)<br>
+ lockmode = XFS_ILOCK_SHARED;<br>
+ else<br>
+ lockmode = XFS_ILOCK_EXCL;<br>
xfs_ilock(ip, lockmode);<br>
} else {<br>
lockmode = xfs_ilock_map_shared(ip);<br>
@@ -1144,7 +1162,7 @@ __xfs_get_blocks(<br>
imap.br_startblock == DELAYSTARTBLOCK))) {<br>
if (direct) {<br>
error = xfs_iomap_write_direct(ip, offset, size,<br>
- &imap, nimaps);<br>
+ &imap, nimaps, &lockmode);<br>
} else {<br>
error = xfs_iomap_write_delay(ip, offset, size, &imap);<br>
}<br>
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c<br>
index 10ec272..c74e28c 100644<br>
--- a/fs/xfs/xfs_file.c<br>
+++ b/fs/xfs/xfs_file.c<br>
@@ -650,41 +650,48 @@ xfs_file_aio_write_checks(<br>
struct inode *inode = file->f_mapping->host;<br>
struct xfs_inode *ip = XFS_I(inode);<br>
int error = 0;<br>
+ int ilock = XFS_ILOCK_SHARED;<br>
<br>
- xfs_rw_ilock(ip, XFS_ILOCK_EXCL);<br>
+ xfs_rw_ilock(ip, ilock);<br>
restart:<br>
error = generic_write_checks(file, pos, count, S_ISBLK(inode->i_mode));<br>
if (error) {<br>
- xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);<br>
+ xfs_rw_iunlock(ip, ilock);<br>
return error;<br>
}<br>
<br>
- if (likely(!(file->f_mode & FMODE_NOCMTIME)))<br>
- file_update_time(file);<br>
-<br>
/*<br>
* If the offset is beyond the size of the file, we need to zero any<br>
* blocks that fall between the existing EOF and the start of this<br>
- * write. If zeroing is needed and we are currently holding the<br>
- * iolock shared, we need to update it to exclusive which involves<br>
- * dropping all locks and relocking to maintain correct locking order.<br>
- * If we do this, restart the function to ensure all checks and values<br>
- * are still valid.<br>
+ * write. If zeroing is needed and we are currently holding shared<br>
+ * locks, we need to update it to exclusive which involves dropping all<br>
+ * locks and relocking to maintain correct locking order. If we do<br>
+ * this, restart the function to ensure all checks and values are still<br>
+ * valid.<br>
*/<br>
if (*pos > i_size_read(inode)) {<br>
- if (*iolock == XFS_IOLOCK_SHARED) {<br>
- xfs_rw_iunlock(ip, XFS_ILOCK_EXCL | *iolock);<br>
+ if (*iolock == XFS_IOLOCK_SHARED || ilock == XFS_ILOCK_SHARED) {<br>
+ xfs_rw_iunlock(ip, ilock | *iolock);<br>
*iolock = XFS_IOLOCK_EXCL;<br>
- xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);<br>
+ ilock = XFS_ILOCK_EXCL;<br>
+ xfs_rw_ilock(ip, ilock | *iolock);<br>
goto restart;<br>
}<br>
error = -xfs_zero_eof(ip, *pos, i_size_read(inode));<br>
}<br>
- xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);<br>
+ xfs_rw_iunlock(ip, ilock);<br>
if (error)<br>
return error;<br>
<br>
/*<br>
+ * we can't do any operation that might call .dirty_inode under the<br>
+ * ilock when we move to completely transactional updates. Hence this<br>
+ * timestamp must sit outside the ilock.<br>
+ */<br>
+ if (likely(!(file->f_mode & FMODE_NOCMTIME)))<br>
+ file_update_time(file);<br>
+<br>
+ /*<br>
* If we're writing the file then make sure to clear the setuid and<br>
* setgid bits if the process is not being run by root. This keeps<br>
* people from modifying setuid and setgid binaries.<br>
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c<br>
index 246c7d5..792c81d 100644<br>
--- a/fs/xfs/xfs_iomap.c<br>
+++ b/fs/xfs/xfs_iomap.c<br>
@@ -123,7 +123,8 @@ xfs_iomap_write_direct(<br>
xfs_off_t offset,<br>
size_t count,<br>
xfs_bmbt_irec_t *imap,<br>
- int nmaps)<br>
+ int nmaps,<br>
+ int *lockmode)<br>
{<br>
xfs_mount_t *mp = ip->i_mount;<br>
xfs_fileoff_t offset_fsb;<br>
@@ -189,7 +190,8 @@ xfs_iomap_write_direct(<br>
/*<br>
* Allocate and setup the transaction<br>
*/<br>
- xfs_iunlock(ip, XFS_ILOCK_EXCL);<br>
+ xfs_iunlock(ip, *lockmode);<br>
+<br>
tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);<br>
error = xfs_trans_reserve(tp, resblks,<br>
XFS_WRITE_LOG_RES(mp), resrtextents,<br>
@@ -200,7 +202,9 @@ xfs_iomap_write_direct(<br>
*/<br>
if (error)<br>
xfs_trans_cancel(tp, 0);<br>
- xfs_ilock(ip, XFS_ILOCK_EXCL);<br>
+<br>
+ *lockmode = XFS_ILOCK_EXCL;<br>
+ xfs_ilock(ip, *lockmode);<br>
if (error)<br>
goto error_out;<br>
<br>
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h<br>
index 8061576..21e3398 100644<br>
--- a/fs/xfs/xfs_iomap.h<br>
+++ b/fs/xfs/xfs_iomap.h<br>
@@ -22,7 +22,7 @@ struct xfs_inode;<br>
struct xfs_bmbt_irec;<br>
<br>
extern int xfs_iomap_write_direct(struct xfs_inode *, xfs_off_t, size_t,<br>
- struct xfs_bmbt_irec *, int);<br>
+ struct xfs_bmbt_irec *, int, int *);<br>
extern int xfs_iomap_write_delay(struct xfs_inode *, xfs_off_t, size_t,<br>
struct xfs_bmbt_irec *);<br>
extern int xfs_iomap_write_allocate(struct xfs_inode *, xfs_off_t, size_t,<br>
</blockquote></div><br></div></div>