xfs
[Top] [All Lists]

Re: XFS metadata flushing design - current and future

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: XFS metadata flushing design - current and future
From: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date: Mon, 29 Aug 2011 08:33:18 -0400
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>, xfs@xxxxxxxxxxx
In-reply-to: <20110829010149.GE3162@dastard>
References: <20110827080321.GA16661@xxxxxxxxxxxxx> <20110829010149.GE3162@dastard>
User-agent: Mutt/1.5.21 (2010-09-15)
On Mon, Aug 29, 2011 at 11:01:49AM +1000, Dave Chinner wrote:
> Another thing I've noticed is that AIL pushing of dirty inodes can
> be quite inefficient from a CPU usage perspective. Inodes that have
> already been flushed to their backing buffer results in a
> IOP_PUSHBUF call when the AIL tries to push them. Pushing the buffer
> requires a buffer cache search, followed by a delwri list promotion.
> However, the initial xfs_iflush() call on a dirty inode also
> clusters all the other remaining dirty inodes in the buffer to the
> buffer. When the AIl hits those other dirty inodes, they are already
> locked and so we do a IOP_PUSHBUF call. On every other dirty inode.
> So on a completely dirty inode cluster, we do ~30 needless buffer
> cache searches and buffer delwri promotions all for the same buffer.
> That's a lot of extra work we don't need to be doing - ~10% of the
> buffer cache lookups come from IOP_PUSHBUF under inode intensive
> metadata workloads:

One really stupid thing we do in that area is that the xfs_iflush from
xfs_inode_item_push puts the buffer at the end of the delwri list and
expects it to be aged, just so that the first xfs_inode_item_pushbuf
can promote it to the front of the list.  Now that we mostly write
metadata from AIL pushing we should not do an additional pass of aging
on that - that's what we already the AIL for.  Once we did that we
should be able to remove the buffer promotion and make the pushuf a
no-op.  The only thing this might interact with in a not so nice way
would be inode reclaim if it still did delwri writes with the delay
period, but we might be able to get away without that one as well.

> Also, larger inode buffers to reduce the amount of IO we do to both
> read and write inodes might also provide significant benefits by
> reducing the amount of IO and number of buffers we need to track in
> the cache...

We could try to get for large in-core clusters.  That is try to always
allocate N aligned inode clusters together, and always read/write
clusters in that alignment together if possible.

<Prev in Thread] Current Thread [Next in Thread>