On Wed, Nov 28, 2007 at 03:18:54PM +1100, Lachlan McIlroy wrote:
> David Chinner wrote:
> >>>[ I do know the answer to this question and there's a day of kdb tracing
> >>>behind the answer. I wrote a 15 line comment to explain what was going
> >>>on in one of my patches. ]
> >>Are you referring to the !(inode->i_state & I_LOCK) check?
> I've never liked that check, can we just get rid of it?
No - removing the check is what lead me to understand why it
> >Sorry, I missed an important work there - could not be implemented
> >Basically, you are logging the inode, then call xfs_iflush, which
> >immediately sees it pinned and forces the log. That's an extra
> >transaction *and* log I/O for every inode we write. That defeats all
> >inode clustering and and will seriously harm performance.
> I didn't see another way around it. We only need to force the log for
> pinned inodes if it is a sync writeback, otherwise we can just try again
But we don't do that right now - we call xfs_ipinwait() in xfs_iflush()
which forces the log.
> >Also, the change fails to log changes to inodes in the same
> >cluster that get written out because they are dirty.
> That's where it all sort of falls apart. I didn't want to log the inode
> in xfs_iflush_int() because we have the flush lock held and I was pretty
> sure logging a transaction with the flush lock held would be a bad idea.
> That's why I specifically removed the code that resets i_update_core in
> xfs_iflush_int() - so that other inodes in the same cluster will still be
> flagged as having unlogged changes even after the inodes have been synced
> to disk. But as I said it was an idea that needed some polishing.
It's messy, and if we are logging changes then we should never write
to disk unlogged changes....
> >>>So, the cleaner fix is to do this - change the xfs_inode_flush()
> >>>just to unconditionally log the inode and don't do inode writeback *at
> >>>all* from there. That will catch all cases of unlogged changes and leave
> >>>inode writeback to tail-pushing or xfssyncd which can be driven by
> >>>the radix tree.
> >>Huh? Aren't we trying to minimize the number of transactions we do? My
> >>changes introduce new transactions but only when we have to. You're
> >>here that we log the inode unconditionally - how is that better? I'm not
> >>trying to defend my changes here (I don't care how the problem gets fixed)
> >>- I'm just trying to understand why your suggestions are a good idea.
> >Because we can log entire inode cluster's worth of changes in a single
> >transaction. One transaction vs one I/O - it's a decent tradeoff to
> >avoid this problem, esp. as we'll get improved inode writeback clustering
> >if we flush from the radix tree (i.e. clusters get flushed in ascending
> >inode number order).....
> That should help a lot but it will use even more space in the log - quite a
> lot more if just one inode in the cluster needs to be logged.
32 inodes * 100 bytes for the inode core - 3k per inode is we log the
entire cluster. But we know what is dirty and what isn't, so that's worst
case. i.e. we only log those that are dirty. It's not log space I'm
worried about here - it's the transaction overhead....
Still, doing it as a cluster is probably premature optimisation. Lets
see what logging only the dirty inodes gets us and go from there.
> Do you plan
> to do this in the write_inode path? If so we'll have inodes that have been
> logged (with a previous cluster) that still have I_DIRTY set.
I'll cross that one if we need to - we can probably just clear it if
the inode is not otherwise dirty...
SGI Australian Software Group