concurrent direct IO write in xfs
Zheng Da
zhengda1936 at gmail.com
Thu Feb 9 00:44:33 CST 2012
Thanks, Dave. I'll also try this patch.
Da
On Thu, Feb 9, 2012 at 1:09 AM, Dave Chinner <david at fromorbit.com> wrote:
> On Tue, Jan 24, 2012 at 02:54:31PM +1100, Dave Chinner wrote:
> > On Mon, Jan 23, 2012 at 03:51:43PM -0500, Zheng Da wrote:
> > > Hello
> > >
> > > On Mon, Jan 23, 2012 at 2:34 PM, Zheng Da <zhengda1936 at gmail.com> wrote:
> > > >> > So the test case is pretty simple and I think it's easy to
> > > >> > reproduce it.
> > > >> > It'll be great if you can try the test case.
> > > >>
> > > >> Can you post your test code so I know what I test is exactly what
> > > >> you are running?
> > > >>
> > > > I can do that. My test code gets very complicated now. I need to
> > > > simplify it.
> > > >
> > > Here is the code. It's still a bit long. I hope it's OK.
> > > You can run the code like "rand-read file option=direct pages=1048576
> > > threads=8 access=write/read".
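The original test program is not reproduced in this archived message. Purely as a rough illustration of what such a tester does, here is a minimal user-space sketch along the lines of the command above; the constants, names and option handling are hypothetical simplifications, not the actual rand-read code.

/*
 * Minimal sketch of a concurrent direct IO exerciser: NTHREADS threads
 * issue page-sized random reads or writes to one O_DIRECT file.
 *
 * Build:  gcc -O2 -o rand-io rand-io.c -lpthread
 * Usage:  ./rand-io <preallocated file> read|write
 */
#define _GNU_SOURCE			/* O_DIRECT */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ		4096
#define NTHREADS	8
#define NPAGES		1048576		/* pages covered by the file */

static int fd;
static int do_write;

static void *worker(void *arg)
{
	unsigned int seed = 0x5eed + (unsigned int)(long)arg;
	void *buf;
	long i;

	/* O_DIRECT needs an aligned buffer */
	if (posix_memalign(&buf, PAGE_SZ, PAGE_SZ))
		return NULL;
	memset(buf, 0xa5, PAGE_SZ);

	/* each thread issues its share of the total IO count */
	for (i = 0; i < NPAGES / NTHREADS; i++) {
		off_t off = (off_t)(rand_r(&seed) % NPAGES) * PAGE_SZ;
		ssize_t ret;

		ret = do_write ? pwrite(fd, buf, PAGE_SZ, off)
			       : pread(fd, buf, PAGE_SZ, off);
		if (ret != PAGE_SZ) {
			perror("io");
			break;
		}
	}
	free(buf);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NTHREADS];
	long i;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <file> read|write\n", argv[0]);
		return 1;
	}
	do_write = !strcmp(argv[2], "write");

	/* the file is assumed to exist and be at least NPAGES pages long */
	fd = open(argv[1], O_RDWR | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	close(fd);
	return 0;
}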
> >
> > With 262144 pages on a 2GB ramdisk, the results I get on 3.2.0 are:
> >
> > Threads    Read     Write
> >       1    0.92s    1.49s
> >       2    0.51s    1.20s
> >       4    0.31s    1.34s
> >       8    0.22s    1.59s
> >      16    0.23s    2.24s
> >
> > The contention is on the ip->i_ilock, and the newsize update is one
> > of the offenders. It probably needs this change to
> > xfs_aio_write_newsize_update():
> >
> > - if (new_size == ip->i_new_size) {
> > + if (new_size && new_size == ip->i_new_size) {
> >
> > to avoid the lock being taken here.
> >
> > But all that newsize crap is gone in the current git Linus tree,
> > so how much does that gain us:
> >
> > Threads    Read     Write
> >       1    0.88s    0.85s
> >       2    0.54s    1.20s
> >       4    0.31s    1.23s
> >       8    0.27s    1.40s
> >      16    0.25s    2.36s
> >
> > Pretty much nothing. IOWs, it's just like I suspected - you are
> > doing so many write IOs that you are serialising on the extent
> > lookup and write checks, which use exclusive locking.
> >
> > Given that it is 2 lock traversals per write IO, we're limiting at
> > about 400,000-500,000 exclusive lock grabs per second, and decreasing as
> > contention goes up.
> >
> > For reads, we are doing 2 shared (nested) lookups per read IO, and we
> > appear to be limiting at around 2,000,000 shared lock grabs per
> > second. Amdahl's law is kicking in here, but it means that if we could
> > make the writes use a shared lock, it would at least scale like
> > the reads for this "no metadata modification except for mtime"
> > overwrite case.
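As a rough cross-check of those rates against the timings quoted above, assuming each run issues one IO per page (262144 IOs in total): writes with 2 threads complete in 1.20s, i.e. about 218,000 IOs/s, or roughly 440,000 exclusive lock grabs/s at 2 grabs per IO; reads with 8 threads complete in 0.22s, i.e. about 1.2 million IOs/s, or roughly 2.4 million shared lock grabs/s. These back-of-the-envelope figures are consistent with the limits quoted.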
> >
> > I don't think that the generic write checks absolutely need
> > exclusive locking - we probably could get away with a shared lock
> > and only fall back to exclusive when we need to do EOF zeroing.
> > Similarly, for the block mapping code if we don't need to do
> > allocation, a shared lock is all we need. So maybe in that case for
> > direct IO when create == 1, we can do a read lookup first and only
> > grab the lock exclusively if that falls in a hole and requires
> > allocation.....
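As a generic illustration of that last idea (this is not XFS code; the real implementation is the patch further down in this mail), the "look up under a shared lock, take the exclusive lock only if a modification is needed" pattern can be sketched in user space with a pthread rwlock, where the upgrade is done by dropping and retaking the lock and then revalidating. The struct and names below are hypothetical.

/*
 * Illustrative only: a shared lookup that falls back to an exclusive
 * lock when the "extent" turns out to need allocation.
 */
#include <pthread.h>
#include <stdbool.h>

struct mapping {
	pthread_rwlock_t	lock;
	bool			allocated;	/* stand-in for "blocks exist here" */
};

/* returns true if an allocation had to be performed */
static bool map_block(struct mapping *m)
{
	bool did_alloc = false;

	pthread_rwlock_rdlock(&m->lock);
	if (m->allocated) {
		/* overwrite case: the shared lock is all that is needed */
		pthread_rwlock_unlock(&m->lock);
		return false;
	}
	pthread_rwlock_unlock(&m->lock);

	/*
	 * "Upgrade" by dropping the shared lock and taking it exclusive.
	 * Another thread may have allocated in the window, so the state
	 * must be rechecked under the exclusive lock.
	 */
	pthread_rwlock_wrlock(&m->lock);
	if (!m->allocated) {
		m->allocated = true;		/* the allocation itself */
		did_alloc = true;
	}
	pthread_rwlock_unlock(&m->lock);
	return did_alloc;
}

The recheck after relocking is the same kind of race the direct IO allocation path already has to handle, which is why doing the initial lookup under a shared lock does not introduce anything new.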
>
> So, I have a proof of concept patch that gets rid of the exclusive
> locking for the overwrite case.
>
> Results are:
>
>                  Writes
> Threads    vanilla    patched    read
>       1    0.85s      0.93s      0.88s
>       2    1.20s      0.58s      0.54s
>       4    1.23s      0.32s      0.31s
>       8    1.40s      0.27s      0.27s
>      16    2.36s      0.23s      0.25s
>
> So overwrites scale pretty much like reads now: ~1,000,000 overwrite
> IOs per second to that one file with 8-16 threads. Given these tests
> are running on an 8p VM, it's not surprising it doesn't go any
> faster than that as the thread count goes up.
>
> The patch hacks in some stuff that Christoph's transactional size
> and timestamp update patches do correctly, so these changes would
> need to wait for that series to be finalised. As it is, this patch
> doesn't appear to cause any new xfstests regressions, so it's good
> for discussion and testing....
>
> Anyway, for people to comment on, the patch is below.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david at fromorbit.com
>
>
> xfs: use shared ilock mode for direct IO writes by default
>
> From: Dave Chinner <dchinner at redhat.com>
>
> For the direct IO write path, we only really need the ilock to be
> taken in exclusive mode during IO submission if we need to do extent
> allocation. We currently take it in exclusive mode for both the
> write sanity checks and for block mapping and allocation.
>
> In the case of the write sanity checks, we only need to protect the
> inode from change while this is occurring, and hence we can use a
> shared lock for this. We still need to provide exclusion for EOF
> zeroing, so we need to detect that case and upgrade the locking to
> exclusive in that case. This is a simple extension of the existing
> iolock upgrade case.
>
> We also have the case of timestamp updates occurring inside the
> > ilock. However, we don't really care if timestamp update races occur
> as they are going to end up with the same timestamp anyway. Further,
> as we move to transactional timestamp updates, we can't do the
> update from within the ilock at all. Hence move the timestamp
> update outside the ilock altogether as we don't need or want it to
> be protected there.
>
> For block mapping, the direct IO case has to drop the ilock after
> the initial read mapping for transaction reservation before the
> allocation can be done. This means that the mapping to determine if
> allocation is needed simply requires a "don't change" locking
> semantic. i.e. a shared lock.
>
> Hence we can safely change the xfs_get_blocks() code to use a shared
> lock for the initial mapping lookup because that provides the same
> guarantees but doesn't introduce new race conditions due to the
> allocation having to upgrade the lock - it already has those race
> conditions and has to handle them. This means that overwrite and
> write into preallocated space direct IO will be mapped with just a
> shared lock.
>
> Finally, we only take the ilock during IO completion if the current
> ioend is beyond the file size. This is a quick hack that will be
> fixed properly by transactional size updates.
>
> In combination, these three changes remove all exclusive
> serialisation points in the direct IO write path for single file
> overwrite workloads.
>
> Signed-off-by: Dave Chinner <dchinner at redhat.com>
> ---
> fs/xfs/xfs_aops.c | 22 ++++++++++++++++++++--
> fs/xfs/xfs_file.c | 35 +++++++++++++++++++++--------------
> fs/xfs/xfs_iomap.c | 10 +++++++---
> fs/xfs/xfs_iomap.h | 2 +-
> 4 files changed, 49 insertions(+), 20 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 74b9baf..4c84508 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -138,6 +138,9 @@ xfs_setfilesize(
> xfs_inode_t *ip = XFS_I(ioend->io_inode);
> xfs_fsize_t isize;
>
> + if (!xfs_ioend_new_eof(ioend))
> + return 0;
> +
> if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> return EAGAIN;
>
> @@ -1121,7 +1124,22 @@ __xfs_get_blocks(
> return 0;
>
> if (create) {
> - lockmode = XFS_ILOCK_EXCL;
> + /*
> + * For direct IO, we lock in shared mode so that write
> + * operations that don't require allocation can occur
> + * concurrently. The ilock has to be dropped over the allocation
> + * transaction reservation, so the only thing the ilock is
> + * providing here is modification exclusion. i.e. there is no
> + * need to hold the lock exclusive.
> + *
> + * For buffered IO, if we need to do delayed allocation then
> + * hold the ilock exclusive so that the lookup and delalloc
> + * reservation is atomic.
> + */
> + if (direct)
> + lockmode = XFS_ILOCK_SHARED;
> + else
> + lockmode = XFS_ILOCK_EXCL;
> xfs_ilock(ip, lockmode);
> } else {
> lockmode = xfs_ilock_map_shared(ip);
> @@ -1144,7 +1162,7 @@ __xfs_get_blocks(
> imap.br_startblock == DELAYSTARTBLOCK))) {
> if (direct) {
> error = xfs_iomap_write_direct(ip, offset, size,
> - &imap, nimaps);
> + &imap, nimaps, &lockmode);
> } else {
> error = xfs_iomap_write_delay(ip, offset, size, &imap);
> }
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 10ec272..c74e28c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -650,41 +650,48 @@ xfs_file_aio_write_checks(
> struct inode *inode = file->f_mapping->host;
> struct xfs_inode *ip = XFS_I(inode);
> int error = 0;
> + int ilock = XFS_ILOCK_SHARED;
>
> - xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
> + xfs_rw_ilock(ip, ilock);
> restart:
> error = generic_write_checks(file, pos, count, S_ISBLK(inode->i_mode));
> if (error) {
> - xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
> + xfs_rw_iunlock(ip, ilock);
> return error;
> }
>
> - if (likely(!(file->f_mode & FMODE_NOCMTIME)))
> - file_update_time(file);
> -
> /*
> * If the offset is beyond the size of the file, we need to zero any
> * blocks that fall between the existing EOF and the start of this
> - * write. If zeroing is needed and we are currently holding the
> - * iolock shared, we need to update it to exclusive which involves
> - * dropping all locks and relocking to maintain correct locking order.
> - * If we do this, restart the function to ensure all checks and values
> - * are still valid.
> + * write. If zeroing is needed and we are currently holding shared
> + * locks, we need to update it to exclusive which involves dropping all
> + * locks and relocking to maintain correct locking order. If we do
> + * this, restart the function to ensure all checks and values are still
> + * valid.
> */
> if (*pos > i_size_read(inode)) {
> - if (*iolock == XFS_IOLOCK_SHARED) {
> - xfs_rw_iunlock(ip, XFS_ILOCK_EXCL | *iolock);
> + if (*iolock == XFS_IOLOCK_SHARED || ilock == XFS_ILOCK_SHARED) {
> + xfs_rw_iunlock(ip, ilock | *iolock);
> *iolock = XFS_IOLOCK_EXCL;
> - xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);
> + ilock = XFS_ILOCK_EXCL;
> + xfs_rw_ilock(ip, ilock | *iolock);
> goto restart;
> }
> error = -xfs_zero_eof(ip, *pos, i_size_read(inode));
> }
> - xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
> + xfs_rw_iunlock(ip, ilock);
> if (error)
> return error;
>
> /*
> + * we can't do any operation that might call .dirty_inode under the
> + * ilock when we move to completely transactional updates. Hence this
> + * timestamp must sit outside the ilock.
> + */
> + if (likely(!(file->f_mode & FMODE_NOCMTIME)))
> + file_update_time(file);
> +
> + /*
> * If we're writing the file then make sure to clear the setuid and
> * setgid bits if the process is not being run by root. This keeps
> * people from modifying setuid and setgid binaries.
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 246c7d5..792c81d 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -123,7 +123,8 @@ xfs_iomap_write_direct(
> xfs_off_t offset,
> size_t count,
> xfs_bmbt_irec_t *imap,
> - int nmaps)
> + int nmaps,
> + int *lockmode)
> {
> xfs_mount_t *mp = ip->i_mount;
> xfs_fileoff_t offset_fsb;
> @@ -189,7 +190,8 @@ xfs_iomap_write_direct(
> /*
> * Allocate and setup the transaction
> */
> - xfs_iunlock(ip, XFS_ILOCK_EXCL);
> + xfs_iunlock(ip, *lockmode);
> +
> tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
> error = xfs_trans_reserve(tp, resblks,
> XFS_WRITE_LOG_RES(mp), resrtextents,
> @@ -200,7 +202,9 @@ xfs_iomap_write_direct(
> */
> if (error)
> xfs_trans_cancel(tp, 0);
> - xfs_ilock(ip, XFS_ILOCK_EXCL);
> +
> + *lockmode = XFS_ILOCK_EXCL;
> + xfs_ilock(ip, *lockmode);
> if (error)
> goto error_out;
>
> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> index 8061576..21e3398 100644
> --- a/fs/xfs/xfs_iomap.h
> +++ b/fs/xfs/xfs_iomap.h
> @@ -22,7 +22,7 @@ struct xfs_inode;
> struct xfs_bmbt_irec;
>
> extern int xfs_iomap_write_direct(struct xfs_inode *, xfs_off_t, size_t,
> - struct xfs_bmbt_irec *, int);
> + struct xfs_bmbt_irec *, int, int *);
> extern int xfs_iomap_write_delay(struct xfs_inode *, xfs_off_t, size_t,
> struct xfs_bmbt_irec *);
> extern int xfs_iomap_write_allocate(struct xfs_inode *, xfs_off_t, size_t,
>