From: Larry McVoy
Newsgroups: mail.linux-kernel
Subject: Re: [OT] SGI to OpenSource XFS
Date: 26 May 1999 15:10:29 -0700
Organization: bitmover.com
Lines: 33
Sender: root@fido.engr.sgi.com
Approved: mailnews@fido.engr.sgi.com
Distribution: sgi
Message-ID: <7ihrgl$fuhho@fido.engr.sgi.com>
NNTP-Posting-Host: fido.engr.sgi.com

: XFS has some nice features such as journalling, dynamically managed inodes,
: B-tree directories, and real-time features for multimedia streams (looks like
: this last one will not be in the open-sourced code).
: But how about performance?  Does anybody have comparisons of various
: filesystems in terms of performance?

I know a lot about XFS performance.  Unfortunately, it's hard to split out
which parts are XFS and which parts are IRIX infrastructure.

Some parts of XFS are amazing (actually, it isn't XFS that's so fast, it's
XLV - the volume manager underneath - XFS does its part by getting out of
the way and letting XLV set up all the DMAs in parallel).  The I/O bandwidth
that you can get out of XFS/XLV is limited only by the hardware.  When I was
at SGI they demoed XFS doing 7GByte/second, and there is no reason why that
number couldn't be 7TByte/second.

The journalling is nice - it's nowhere near as fast as ext2, but it is safe:
you can turn off the machine in the middle of an untar and things are in a
sane state when you reboot.  I strongly suspect that Stephen's journalling
work will be lighter weight.

XFS is extent based, so you could have a 10TB file that is made up of a
small number of extents - very nice.

I suspect that what will happen is that we'll get XFS, take a while to
understand it, and then migrate the ideas that we want into ext3 or whatever
Stephen is calling his thing.  For a lot of stuff, XFS is overkill and it
comes at a non-zero cost.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

From: Larry McVoy
Newsgroups: mail.linux-kernel
Subject: Re: XFS and journalling filesystems
Date: 27 May 1999 08:52:03 -0700
Organization: bitmover.com
Lines: 63
Sender: root@fido.engr.sgi.com
Approved: mailnews@fido.engr.sgi.com
Distribution: sgi
Message-ID: <7ijpn3$g7nk5@fido.engr.sgi.com>
NNTP-Posting-Host: fido.engr.sgi.com

I'm in 100% agreement with Ted's general line of reasoning, but there is one
area that could use some clarification:

: (And there are certain features of XFS, such as the features that allow
: Irix to tell the disk controller to send disk blocks directly to the
: ethernet controller, which then slaps on the TCP header and calculates
: the TCP checksum without the disk data ever hitting memory

This is not quite right; in fact, it is a little unfair to IRIX.  There is
no interaction between the file system and/or the block device system and
the networking stack.  The way it works is this (I know this code extremely
well since I'm the guy that originally made NFS use both the networking and
the file system to go at 94MByte/sec - Ethan Solomita is the guy who made it
go at 640MByte/second over Super HIPPI):

The short summary is that you DMA from disks to user VM and then from user
VM to networking (or the other way).  But user VM is the currency.
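To make that concrete, here is a minimal user-space sketch of that
disk -> user VM -> network path, written against modern POSIX interfaces
rather than the 1999 IRIX ones (send_file, CHUNK, and the use of
posix_memalign are invented for illustration; alignment constraints and
error handling are simplified):

    /* Sketch only (invented names): direct I/O from a file into a
     * page-aligned user buffer, then a plain write() on a socket.  The
     * O_DIRECT read DMAs into user VM, bypassing the buffer cache; the
     * socket write is where the COW/page-flipping tricks described
     * below would apply on a system that implements them. */
    #define _GNU_SOURCE             /* O_DIRECT on Linux/glibc */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (1024 * 1024)     /* multiple of the device block size */

    int send_file(const char *path, int sock)
    {
        void *buf;
        ssize_t n;
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0)
            return -1;
        if (posix_memalign(&buf, getpagesize(), CHUNK)) {
            close(fd);
            return -1;
        }
        while ((n = read(fd, buf, CHUNK)) > 0)
            if (write(sock, buf, n) != n)
                break;
        free(buf);
        close(fd);
        return n < 0 ? -1 : 0;
    }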
The longer summary is that in order to get the file system to go fast, you
open up files with O_DIRECT, and that tells the file system to lock down the
user pages and DMA directly to/from them, bypassing the buffer cache
completely.  (There is interaction with the VM layer here, but it has two
parts: the locking-down part and the invalidating part; the latter is so
that if there was unflushed data in the buffer cache, it got flushed out
before the direct I/O occurred.)

For the networking part, it works by page flipping on receive and COW pages
on send.  Again, the currency is user VM.

So if you were going disk -> network, then the pages would be in your VM and
you would do a write(sock, buf, some_big_size).  The socket layer would get
all the way down to sosend() and decide that this wad of data was a good
candidate for VM tricks.  It calls out to the VM layer and asks that these
pages be marked COW.  There was a lot of discussion about whether it would
be smart to set up the COW fault handler to sleep the faulting process until
the data had moved out - then naive processes would get slept and smart
processes - those which flip-flopped between two buffers - would stream.
This optimization was never done.

If you were going network -> disk, then the data would be coming in; if the
pages were nicely aligned - which can be done by doing something Vernon
Shriver calls "tail aligning": you line up the end of the message at the end
of a page so that if the message is page sized it is page aligned - then you
could work them up the stack, and when they hit the top of the stack, you
"page flip" them into user space.

So while it is true there are VM interactions, they are all pretty much in
the networking stack.

As a result of my experience with this stuff, I got pretty disgusted with
the "design", even though the results were quite good.  The obvious problem
was that the "currency" was user virtual memory, not physical memory.  So I
wrote up (or started to write up) a design based on physical memory, which I
called splice.  There is a short paper about it on my ftp site somewhere.

Stephen Tweedie is currently implementing splice() semantics for Linux (I'm
very happy with his design, as usual - sometimes I think that guy will solve
world peace next, but when I asked about that, he said that was a user space
problem :-)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
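For reference, a splice(2) interface did eventually appear in mainline Linux
(2.6.17, years after this thread, and not the implementation being discussed
above).  It moves data between file descriptors through a pipe in kernel
page units, never passing through user memory.  A hedged sketch, with
copy_file_to_sock an invented helper name:

    /* Sketch of the splice(2) call that shipped in Linux 2.6.17:
     * data moves file -> pipe -> socket as kernel pages, never
     * touching a user buffer.  copy_file_to_sock is an invented name. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    int copy_file_to_sock(int file_fd, int sock_fd, size_t len)
    {
        int p[2];

        if (pipe(p) < 0)
            return -1;
        while (len > 0) {
            /* file -> pipe */
            ssize_t n = splice(file_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
            if (n <= 0)
                break;
            len -= n;
            /* pipe -> socket, draining what was just spliced in */
            while (n > 0) {
                ssize_t m = splice(p[0], NULL, sock_fd, NULL, n, SPLICE_F_MOVE);
                if (m <= 0) {
                    close(p[0]);
                    close(p[1]);
                    return -1;
                }
                n -= m;
            }
        }
        close(p[0]);
        close(p[1]);
        return len == 0 ? 0 : -1;
    }

The pipe is the "currency" here: its buffer holds references to kernel
pages rather than copies, which is the physical-memory model the post above
argues for.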