> On Thu, 8 Jun 2000, Andi Kleen wrote:
> > Please note that 2.3 itself has significant performance
> > regressions for huge bulk writes (there were several threads on
> > linux-kernel about that). Partly the still broken page cache
> > balance is probably to blame,
> Definitely. Following age-old Linux tradition I'm currently
> re-writing the VM subsystem at the end of the code freeze
> period, heavily counting on code beautification and tons of
> obviousness to make Linus accept the change.
> One interesting thing that's going on is the split in active,
> inactive and scavenge queues, where dirty pages will be flushed
> when we want to take them off of the inactive list. We're planning
> a block->mapping->flush() function for that...
> This should be interesting to XFS because it means that you'll
> be able to implement allocate-on-flush as a relatively simple
> and independent patch that'll just "plug in" to the MM subsystem.
> Basically my not-yet-compiling code tree that's sitting on my
> disk now (need to write a few more functions and then I can
> compile and boot it) is ready for allocate-on-flush. The only
> thing that needs to be done is the reservation system for
> pinned buffers.
Perhaps some expansion on what happens with XFS would be good here:
1. During a write call we create delayed allocate space for the write
(presuming it does not exist already). This is cheap as we just mess
with some in core counters and hang the in core extent on the inode.
Pages are set up as Uptodate but have no buffer heads attached to them;
instead they have a special PG_delalloc bit set.
2. We currently have a page cleaner daemon which walks around the page tables
looking for pages with the PG_delalloc bit set. The daemon then calls into
the filesystem to ask it to allocate real extents for the data. Since the
filesystem knows which byte range in the file gets allocated contiguous
with the requested page we can hang buffer heads off the requested page
and all those which are in the same real extent on disk. We could also
initiate an I/O at this point to write all these pages out - using the
buffer heads or a direct kiobuf I/O.
Currently we are not triggering the I/O in this daemon; we let bdflush
come along and write the data.
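
To make steps 1 and 2 concrete, here is a toy user-space model in Python. It
is only a sketch of the idea, not XFS code: ToyFS, Inode, Page, write() and
allocate_for() are names made up for the illustration, one page is assumed to
map to one disk block, and the allocator just hands out consecutive block
numbers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Page:
    index: int                        # page offset within the file
    uptodate: bool = True             # page contents are valid in memory
    delalloc: bool = True             # stands in for the PG_delalloc bit
    disk_block: Optional[int] = None  # stands in for an attached buffer head


@dataclass
class Inode:
    pages: Dict[int, Page] = field(default_factory=dict)
    reserved_blocks: int = 0          # in-core delayed-allocate reservation


class ToyFS:
    """Hypothetical filesystem: one page maps to one disk block."""

    def __init__(self, free_blocks: int) -> None:
        self.free_blocks = free_blocks  # in-core free space counter
        self.next_block = 0             # trivial contiguous allocator

    def write(self, inode: Inode, first_page: int, npages: int) -> None:
        # Step 1: only touch in-core counters and flag the pages.
        # Pages that already exist need no new reservation.
        new = [i for i in range(first_page, first_page + npages)
               if i not in inode.pages]
        if len(new) > self.free_blocks:
            raise OSError("no space left for delayed allocation")
        self.free_blocks -= len(new)
        inode.reserved_blocks += len(new)
        for i in new:
            inode.pages[i] = Page(index=i)

    def allocate_for(self, inode: Inode, page_index: int) -> List[int]:
        # Step 2: the cleaner asks for real space for one delalloc page.
        # Convert the whole contiguous delalloc run around it in one go.
        page = inode.pages.get(page_index)
        if page is None or not page.delalloc:
            return []
        lo = hi = page_index
        while lo - 1 in inode.pages and inode.pages[lo - 1].delalloc:
            lo -= 1
        while hi + 1 in inode.pages and inode.pages[hi + 1].delalloc:
            hi += 1
        start = self.next_block
        for i in range(lo, hi + 1):
            inode.pages[i].disk_block = start + (i - lo)  # "attach buffer head"
            inode.pages[i].delalloc = False
        count = hi - lo + 1
        self.next_block += count
        inode.reserved_blocks -= count
        return list(range(lo, hi + 1))
```

Asking for page 3 of an eight-page write converts all eight pages at once,
which is the point of waiting: the allocator sees the whole range before it
has to pick disk blocks. Issuing the actual I/O is left out, as it is in our
daemon today.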
Now comes the tricky part - the big selling point of a journalled filesystem
is that it comes back after a crash quickly and in a consistent state. XFS
does not journal file data, so until we have done the real allocate, and
written the data out, it can go away after a crash. Of course, O_SYNC or
fsync fixes that, but only if the app wants to pay the extra cost.
So allocating extents because of memory pressure alone is not really the
best solution. You could write out some important data and walk away from
your machine; after a day of being idle the power goes out, and because
nothing was pushing on memory, your data goes bye-bye.
So I suspect even with a flush callout we still need another mechanism
to go around pushing on delayed allocate pages.
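
As a rough illustration of what such a second mechanism might look like, the
Python sketch below picks delayed allocate pages to push based on age as well
as on memory pressure. Everything in it is an assumption for the sake of the
example: DirtyPage, pages_to_push() and the max_age threshold are invented
names, and pushing every delalloc page under memory pressure is a deliberate
oversimplification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DirtyPage:
    index: int
    dirtied_at: float      # time of the write, in seconds
    delalloc: bool = True  # still waiting for a real extent


def pages_to_push(pages: List[DirtyPage], now: float, max_age: float,
                  under_memory_pressure: bool) -> List[DirtyPage]:
    """Pick the delalloc pages that should be allocated and written now."""
    candidates = [p for p in pages if p.delalloc]
    if under_memory_pressure:
        # The VM flush callout case: the memory is wanted back.
        return candidates
    # The idle machine case: push anything that has sat around too long,
    # even though nothing is pushing on memory.
    return [p for p in candidates if now - p.dirtied_at >= max_age]
```

With a max_age of, say, 30 seconds, data written before you walk away from
the machine gets real extents and goes to disk shortly afterwards instead of
sitting in memory until something else wants the pages.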
As for reservation, we do have a scheme in place at the moment, but it
needs some work. Probably when requesting a new page we need to tell the
VM system that it will be allocated delayed alloc.
> > for other things the elevator (Jens Axboe's per device elevator
> > patches seem to cause a huge speedup)
> Yup, according to Jens and a bunch of other people this
> seems to be sorted out.
> These changes should help XFS performance quite a bit. Tuning for
> the small changes may want to wait until after the big stuff is
> done. Btw, anybody here interested in doing some IO clustering
> stuff for the VM subsystem? ;)
> The Internet is not a network of computers. It is a network
> of people. That is its real strength.
> Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
> http://www.conectiva.com/ http://www.surriel.com/