
Re: lmdd performance results XFS vs. Ext2

To: Rik van Riel <riel@xxxxxxxxxxxxxxxx>
Subject: Re: lmdd performance results XFS vs. Ext2
From: Steve Lord <lord@xxxxxxx>
Date: Thu, 08 Jun 2000 11:28:09 -0500
Cc: Andi Kleen <ak@xxxxxxx>, Rajagopal Ananthanarayanan <ananth@xxxxxxx>, linux-xfs@xxxxxxxxxxx
In-reply-to: Message from Rik van Riel <riel@xxxxxxxxxxxxxxxx> of "Thu, 08 Jun 2000 09:48:46 -0300." <Pine.LNX.4.21.0006080938080.21898-100000@xxxxxxxxxxxxxxxxxxxxxxxx>
Sender: owner-linux-xfs@xxxxxxxxxxx
> On Thu, 8 Jun 2000, Andi Kleen wrote:
> > Please note that 2.3 itself has significant performance
> > regressions for huge bulk writes (there were several threads on
> > linux-kernel about that). The still-broken page cache
> > balancing is probably partly to blame,
> Definitely. Following age-old Linux tradition I'm currently
> re-writing the VM subsystem at the end of the code freeze
> period, heavily counting on code beautification and tons of
> obviousness to make Linus accept the change.
> One interesting thing that's going on is the split in active,
> inactive and scavenge queues, where dirty pages will be flushed
> when we want to take them off of the inactive list. We're planning
> a block->mapping->flush() function for that...
> This should be interesting to XFS because it means that you'll
> be able to implement allocate-on-flush as a relatively simple
> and independent patch that'll just "plug in" to the MM subsystem.
> Basically my not-yet-compiling code tree that's sitting on my
> disk now (need to write a few more functions and then I can
> compile and boot it) is ready for allocate-on-flush. The only
> thing that needs to be done is the reservation system for
> pinned buffers.

Perhaps some expansion on what happens with XFS would be good here:

1. During a write call we create delayed allocate space for the write
   (presuming it does not exist already). This is cheap: we just adjust
   some in-core counters and hang the in-core extent on the inode.
   Pages are set up as Uptodate but have no buffer heads attached to
   them; instead they have a special PG_delalloc bit set.

2. We currently have a page cleaner daemon which walks around the page tables
   looking for pages with the PG_delalloc bit set. The daemon then calls into
   the filesystem to ask it to allocate real extents for the data. Since the
   filesystem knows which byte range of the file is allocated contiguously
   with the requested page, we can hang buffer heads off that page and off
   all the others which land in the same real extent on disk. We could also
   initiate an I/O at this point to write all these pages out - using the
   buffer heads or a direct kiobuf I/O.

   Currently we are not triggering the I/O in this daemon, we let bdflush
   come along and write the data.

Now comes the tricky part - the big selling point of a journalled filesystem
is that it comes back after a crash quickly and in a consistent state. XFS
does not journal file data, so until we have done the real allocation and
written the data out, it can go away after a crash. Of course, O_SYNC or
fsync fixes that, but only if the app wants to pay the extra cost.

So allocating extents because of memory pressure alone is not really the
best solution - you could write some important data and walk away from
your machine; after a day of sitting idle the power goes out, and because
nothing was pushing on memory your data goes bye-bye.

So I suspect even with a flush callout we still need another mechanism
to go around pushing on delayed allocate pages.

As for reservation, we do have a scheme in place at the moment, but it
needs some work. Probably, when requesting a new page, we need to tell the
VM system that it will be delayed-allocated.


> > for other things the elevator (Jens Axboe's per device elevator
> > patches seem to cause a huge speedup)
> Yup, according to Jens and a bunch of other people this
> seems to be sorted out.
> These changes should help XFS performance quite a bit. Tuning for
> the small changes may want to wait until after the big stuff is
> done. Btw, anybody here interested in doing some IO clustering
> stuff for the VM subsystem? ;)
> regards,
> Rik
> --
> The Internet is not a network of computers. It is a network
> of people. That is its real strength.
> Wanna talk about the kernel?  irc.openprojects.net / #kernelnewbies
> http://www.conectiva.com/             http://www.surriel.com/
