On Fri 15-02-13 18:16:07, Dave Chinner wrote:
> On Thu, Feb 14, 2013 at 02:14:52PM +0100, Jan Kara wrote:
> > Hi,
> > this is a follow up on a discussion started here:
> > http://www.spinics.net/lists/xfs/msg14999.html
> > To just quickly sum up the issue:
> > When project quota gets exceeded XFS ends up flushing inodes using
> > sync_inodes_sb(). I've tested (in 3.8-rc4) that if one writes 200 MB to a
> > directory with 100 MB project quota like:
> > fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
> > for (i = 0; i < 50000; i++)
> >         pwrite(fd, buf, 4096, i*4096);
> > it takes about 3 s to finish, which is OK. But when there are lots of
> > inodes cached (I've tried with 10000 inodes cached on the fs), the same
> > test program runs ~140 s.
> So, you're testing the overhead of ~25,000 ENOSPC flushes. I could
> brush this off and say "stupid application" but I won't....
Yes, stupid... NFS. This is what happens when an NFS client writes to a
directory over project quota. So as much as I agree the workload is
braindamaged, we don't have the option of fixing the client, and we didn't
find a reasonable fix on the NFS server side either. So that's why we ended
up with this:
> > This is because sync_inodes_sb() iterates over
> > all inodes in superblock and waits for IO and this iteration eats CPU
> > cycles.
> Yup, exactly what I said here:
> Iterating inodes takes a lot of CPU.
> I think the difference in the old method and the current one is that
> we only do one inode cache iteration per write(), not one per
> get_blocks() call. Hence we've removed the per-page overhead of
> flushing, and now we just have the inode cache iteration overhead.
> The fix to that problem is mentioned here:
> Which is to:
> a) throttle speculative allocation as EDQUOT approaches; and
> b) efficiently track speculative preallocation
> for all the inodes in the given project, and write back
> and trim those inodes on ENOSPC.
> both of those are still a work in progress. I was hoping that we'd
> have a) in 3.9, but that doesn't seem likely now the merge window is
> just about upon us....
Yeah, I know someone is working on a better solution. I was mostly
wondering why writeback_inodes_sb() isn't enough - i.e., why you really
need to wait for IO completion. And you explained that below. Thanks!
> It's trying to prevent the filesystem from falling into worst-case IO
> patterns at ENOSPC when there are hundreds of threads banging on the
> filesystem and essentially locking up the system. i.e. we have to
> throttle the IO submission rate from ENOSPC flushing artificially -
> we really only need one thread doing the submission work, so we need
> to throttle all concurrent callers while we are doing that work.
> sync_inodes_sb() does that. And then we need to prevent individual
> callers from trying to allocate space too frequently, which we do by
> waiting for IO that was submitted in the flush. sync_inodes_sb()
> does that, too.
I see, so the waiting for IO isn't really a correctness thing, just a way
to slow down processes driving the fs out of space so that they don't lock
up the system. Thanks for the explanation!
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR