
Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering

To: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Subject: Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Thu, 30 Jun 2011 12:48:45 +1000
Cc: xfs@xxxxxxxxxxx
In-reply-to: <20110630020013.GX561@dastard>
References: <20110629140109.003209430@xxxxxxxxxxxxxxxxxxxxxx> <20110629140336.950805096@xxxxxxxxxxxxxxxxxxxxxx> <20110630020013.GX561@dastard>
User-agent: Mutt/1.5.20 (2009-06-14)
On Thu, Jun 30, 2011 at 12:00:13PM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2011 at 10:01:12AM -0400, Christoph Hellwig wrote:
> > Instead of implementing our own writeback clustering, use write_cache_pages
> > to do it for us.  This means the guts of the current writepage implementation
> > become a new helper used both for implementing ->writepage and as a callback
> > to write_cache_pages for ->writepages.  A new struct xfs_writeback_ctx
> > is used to track block mapping state and the ioend chain over multiple
> > invocations of it.
> > 
> > The advantage over the old code is that we avoid a double pagevec lookup
> > and handle extent boundaries inside a page more efficiently for small
> > blocksize filesystems, as well as having less XFS-specific code.
> 
> Yes, it should be, but I can't actually measure any noticeable CPU
> usage difference @800MB/s writeback. The profiles change shape
> around the changed code, but overall cpu usage does not change. I
> think this is because the second pagevec lookup is pretty much free
> because the radix tree is already hot in cache when we do the second
> lookup...
> 
> > The downside is that we don't do writeback clustering when called from
> > kswapd anymore, but that is a case that should be avoided anyway.  Note
> > that we still convert the whole delalloc range from ->writepage, so
> > the on-disk allocation pattern is not affected.
> 
> All the more reason to ensure the mm subsystem doesn't do this....
> 
> .....
> >  error:
> > -   if (iohead)
> > -           xfs_cancel_ioend(iohead);
> > -
> > -   if (err == -EAGAIN)
> > -           goto redirty;
> > -
> 
> Should this EAGAIN handling be dealt with in the removing-the-non-
> blocking-mode patch?
> 
> > +STATIC int
> >  xfs_vm_writepages(
> >     struct address_space    *mapping,
> >     struct writeback_control *wbc)
> >  {
> > +   struct xfs_writeback_ctx ctx = { };
> > +   int ret;
> > +
> >     xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
> > -   return generic_writepages(mapping, wbc);
> > +
> > +   ret = write_cache_pages(mapping, wbc, __xfs_vm_writepage, &ctx);
> > +
> > +   if (ctx.iohead) {
> > +           if (ret)
> > +                   xfs_cancel_ioend(ctx.iohead);
> > +           else
> > +                   xfs_submit_ioend(wbc, ctx.iohead);
> > +   }
> 
> I think this error handling does not work. If we have put pages into
> the ioend (i.e. successful ->writepage calls) and then have a
> ->writepage call fail, all the pages under writeback (i.e. those on
> the ioend) will remain in that state forever: they are never written
> back (and so moved to the clean state) and never redirtied (and so
> written again later).
> 
> xfs_cancel_ioend() was only ever called for the first page sent down
> to ->writepage, and on error that page was redirtied separately.
> Hence it doesn't handle this case at all as it never occurs in the
> existing code.
> 
> I'd suggest that regardless of whether an error is returned here,
> the existence of ctx.iohead indicates a valid ioend that needs to be
> submitted....

I think I just tripped this. I'm running a 1k block size filesystem,
and test 224 has hung waiting on IO completion after ->writepage
errors:

[ 2850.300979] XFS (vdb): Mounting Filesystem
[ 2850.310069] XFS (vdb): Ending clean mount
[ 2867.246341] Filesystem "vdb": reserve blocks depleted! Consider increasing reserve pool size.
[ 2867.247652] XFS (vdb): page discard on page ffffea0000257b40, inode 0x1c6, offset 1187840.
[ 2867.254135] XFS (vdb): page discard on page ffffea0000025f40, inode 0x423, offset 1839104.
[ 2867.256289] XFS (vdb): page discard on page ffffea0000a21aa0, inode 0x34e, offset 28672.
[ 2867.258845] XFS (vdb): page discard on page ffffea00001830d0, inode 0xe5, offset 3637248.
[ 2867.260637] XFS (vdb): page discard on page ffffea0000776af8, inode 0x132, offset 6283264.
[ 2867.269380] XFS (vdb): page discard on page ffffea00009d5d38, inode 0xf1, offset 5632000.
[ 2867.277851] XFS (vdb): page discard on page ffffea0000017e60, inode 0x27a, offset 32768.
[ 2867.281165] XFS (vdb): page discard on page ffffea0000258278, inode 0x274, offset 32768.
[ 2867.282802] XFS (vdb): page discard on page ffffea00009a3c60, inode 0x48a, offset 32768.
[ 2867.284166] XFS (vdb): page discard on page ffffea0000cc7808, inode 0x42e, offset 32768.
[ 2867.287138] XFS (vdb): page discard on page ffffea00004d4440, inode 0x4e0, offset 32768.
[ 2867.288500] XFS (vdb): page discard on page ffffea0000b34978, inode 0x4cd, offset 32768.
[ 2867.289381] XFS (vdb): page discard on page ffffea00003f40f8, inode 0x4c4, offset 155648.
[ 2867.291536] XFS (vdb): page discard on page ffffea0000023578, inode 0x4c7, offset 32768.
[ 2867.300880] XFS (vdb): page discard on page ffffea00005276e8, inode 0x4cc, offset 32768.
[ 2867.318819] XFS (vdb): page discard on page ffffea0000777230, inode 0x449, offset 8581120.
[ 4701.141666] SysRq : Show Blocked State
[ 4701.142093]   task                        PC stack   pid father
[ 4701.142707] dd              D ffff8800076edbe8     0 14211   8946 0x00000000
[ 4701.143509]  ffff88002b03fa58 0000000000000086 ffffea00002db598 ffffea0000000000
[ 4701.144009]  ffff88002b03f9d8 ffffffff81113a35 ffff8800076ed860 0000000000010f80
[ 4701.144009]  ffff88002b03ffd8 ffff88002b03e010 ffff88002b03ffd8 0000000000010f80
[ 4701.144009] Call Trace:
[ 4701.144009]  [<ffffffff81113a35>] ? __free_pages+0x35/0x40
[ 4701.144009]  [<ffffffff81062f69>] ? default_spin_lock_flags+0x9/0x10
[ 4701.144009]  [<ffffffff8110b520>] ? __lock_page+0x70/0x70
[ 4701.144009]  [<ffffffff81afe2d0>] io_schedule+0x60/0x80
[ 4701.144009]  [<ffffffff8110b52e>] sleep_on_page+0xe/0x20
[ 4701.144009]  [<ffffffff81afec2f>] __wait_on_bit+0x5f/0x90
[ 4701.144009]  [<ffffffff8110b773>] wait_on_page_bit+0x73/0x80
[ 4701.144009]  [<ffffffff810a4110>] ? autoremove_wake_function+0x40/0x40
[ 4701.144009]  [<ffffffff81116365>] ? pagevec_lookup_tag+0x25/0x40
[ 4701.144009]  [<ffffffff8110bbc2>] filemap_fdatawait_range+0x112/0x1a0
[ 4701.144009]  [<ffffffff8145f469>] xfs_wait_on_pages+0x59/0x80
[ 4701.144009]  [<ffffffff8145f51d>] xfs_flush_pages+0x8d/0xb0
[ 4701.144009]  [<ffffffff8145f084>] xfs_file_buffered_aio_write+0x104/0x190
[ 4701.144009]  [<ffffffff81b03a98>] ? do_page_fault+0x1e8/0x450
[ 4701.144009]  [<ffffffff8145f2cf>] xfs_file_aio_write+0x1bf/0x300
[ 4701.144009]  [<ffffffff81160844>] ? path_openat+0x104/0x3f0
[ 4701.144009]  [<ffffffff8115251a>] do_sync_write+0xda/0x120
[ 4701.144009]  [<ffffffff816488b3>] ? security_file_permission+0x23/0x90
[ 4701.144009]  [<ffffffff81152a88>] vfs_write+0xc8/0x180
[ 4701.144009]  [<ffffffff81152c31>] sys_write+0x51/0x90
[ 4701.144009]  [<ffffffff81b07ec2>] system_call_fastpath+0x16/0x1b

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
