<html>
<head>
<title>Scalability in the XFS File System</title>
</head>
<body>
<center>
<h2>Scalability in the XFS File System<a href="#fn1">¹</a></h2>
<p>
Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto,
and Geoff Peck
<p>
<i>Silicon Graphics, Inc.</i>
</center>
<pre>
</pre>
<center>
<h3>
Abstract
</h3>
</center>
<p>
In this paper we describe the architecture and design of
a new file system, XFS, for SGI's IRIX
operating system. It is a general purpose file system for
use on both workstations and servers. The focus of the
paper is on the mechanisms used by XFS to scale capacity
and performance in supporting
very large file systems. The large file system support
includes mechanisms for managing large files, large
numbers of files, large directories, and very high performance
I/O.
<p>
In discussing the mechanisms used for scalability we
include both descriptions of the XFS on-disk data structures
and analyses of why they were chosen. We discuss
in detail our use of B+ trees in place of many of the
more traditional linear file system structures.
<p>
XFS has been shipping to customers since December of
1994 in a version of IRIX 5.3, and we are continuing to
improve its performance and add features in upcoming
releases. We include performance results from running
on the latest version of XFS to demonstrate the viability
of our design.
<pre>
</pre>
<h3>
1. Introduction
</h3>
<p>
XFS is the next generation local file system for SGI's workstations and servers. It is a general purpose
Unix file system that runs on workstations with 16
megabytes of memory and a single disk drive and also
on large SMP network servers with gigabytes of memory
and terabytes of disk capacity. In this paper we
describe the XFS file system with a focus on the mechanisms
it uses to manage large file systems on large computer
systems.
<p>
The most notable mechanism used by XFS to
increase the scalability of the file system is the pervasive
use of B+ trees [Comer79]. B+ trees are used for tracking free
extents in the file system rather than bitmaps. B+ trees
are used to index directory entries rather than using linear
lookup structures. B+ trees are used to manage file
extent maps that overflow the number of direct pointers
kept in the inodes. Finally, B+ trees are used to keep
track of dynamically allocated inodes scattered throughout
the file system. In addition, XFS uses an asynchronous
write ahead logging scheme for protecting
complex metadata updates and allowing fast file system
recovery after a crash. We also support very high
throughput file I/O using large, parallel I/O requests and
DMA to transfer data directly between user buffers and
the underlying disk drives. These mechanisms allow us
to recover even very large file systems after a crash in
typically less than 15 seconds, to manage very large file
systems efficiently, and to perform file I/O at hardware
speeds that can exceed 300 MB/sec.
<p>
XFS has been shipping to customers since December of
1994 in a version of IRIX 5.3, and XFS will be the
default file system installed on all SGI systems starting
with the release of IRIX 6.2 in early 1996. The file system
is stable and is being used on production servers
throughout SGI and at many of our
customers' sites.
<p>
In the rest of this paper we describe why we chose to
focus on scalability in the design of XFS and the mechanisms
that are the result of that focus. We start with an
explanation of why we chose to start from scratch rather
than enhancing the old IRIX file system. We next
describe the overall architecture of XFS, followed by
the specific mechanisms of XFS which allow it to scale
in both capacity and performance. Finally, we present
performance results from running on real systems to
demonstrate the success of the XFS design.
<pre>
</pre>
<h3>
2. Why a New File System?
</h3>
<p>
The file system literature began predicting the coming
of the "I/O bottleneck" years ago [Ousterhout90], and
we experienced it first hand at SGI. The problem was
not the I/O performance of our hardware, but the
limitations imposed by the old IRIX file system, EFS [SGI92].
EFS is similar to the Berkeley Fast File System [McKusick84]
in structure, but it uses extents rather than individual blocks
for file space allocation and I/O.
EFS could not support file systems greater than 8
gigabytes in size, files greater than 2 gigabytes in size,
or give applications access to the full I/O bandwidth of
the hardware on which they were running. EFS was not
designed with large systems and file systems in mind,
and it was faltering under the pressure coming from the
needs of new applications and the capabilities of new
hardware. While we considered enhancing EFS to meet these
new demands, the required changes were so great that we
decided it would be better to replace EFS with an entirely
new file system designed with these new demands in mind.
<p>
One example of a new application that places new demands on
the file system is the storage and retrieval of uncompressed video.
This requires approximately 30 MB/sec of I/O throughput
for a single stream, and just one hour of such video
requires 108 gigabytes of disk storage. While most people
work with compressed video to make the I/O throughput
and capacity requirements easier to manage, many customers,
for example those involved in professional
video editing, come to SGI looking to work with
uncompressed video. Another video storage example
that influenced the design of XFS is a video on demand
server, such as the one being deployed by Time Warner
in Orlando. These servers store thousands of compressed
movies. One thousand typical movies take up
around 2.7 terabytes of disk space. Playing two hundred
high quality 0.5 MB/sec MPEG streams concurrently
uses 100 MB/sec of I/O bandwidth. Applications with
similar requirements are appearing in database and scientific
computing, where file system scalability and performance
are sometimes more important than CPU
performance. The requirements we derived from
these applications were support for terabytes of disk
space, huge files, and hundreds of megabytes per second
of I/O bandwidth.
<p>
We also needed to ensure that the file system could provide
access to the full capabilities and
capacity of our hardware, and to do so with a minimal
amount of overhead. This meant supporting systems
with multiple terabytes of disk capacity. With today's 9
gigabyte disk drives it only takes 112 disk drives to surpass 1
terabyte of storage capacity. These requirements also
meant providing access to the large amount of disk
bandwidth that is available in such high capacity systems.
Today, SGI's high end systems have demonstrated
sustained disk bandwidths in excess of 500 megabytes
per second. We needed to make that bandwidth available
to applications using the file system. Finally, these
requirements meant doing all of this without using
unreasonable portions of the available CPU and memory
on the systems.
<pre>
</pre>
<h3>
3. Scalability Problems Addressed by XFS
</h3>
<p>
In designing XFS, we focused on the specific problems
with EFS and other existing file systems that we felt
we needed to address. In this section we consider several
of the specific scalability
problems addressed in the design of XFS and why the
mechanisms used in other file systems are not sufficient.
<p>
<h3>
Slow Crash Recovery
</h3>
<p>
A file system with a crash recovery procedure that is
dependent on the file system size cannot be practically
used on large systems, because the data on the system is
unavailable for an unacceptably long period after a
crash. EFS and file systems based on the BSD Fast File
System [McKusick84] falter in this area due to their
dependence on a file system scavenger program to
restore the file system to a consistent state after a crash.
Running fsck
over an 8 gigabyte file system with a few hundred
thousand inodes today takes a few minutes. This is
already too slow to satisfy modern availability requirements,
and the time it takes to recover in this way only
gets worse when applied to larger file systems with
more files. Most recently designed file systems apply
database recovery techniques to their metadata recovery
procedure to avoid this pitfall.
<p>
<h3>
Inability to Support Large File Systems
</h3>
<p>
We needed a file system that could manage even petabytes of storage,
but all of the file systems we know of are limited to either a
few gigabytes or a few terabytes in size. EFS is limited to only
8 gigabytes in size. These limitations
stem from the use of data structures that don't
scale, for example the bitmap in EFS, and from the use
of 32 bit block pointers throughout the on-disk structures
of the file system. The 32 bit block pointers can
address at most 4 billion blocks, so even with an 8 kilobyte
block size the file system is limited to a theoretical maximum
of 32 terabytes in size.
<p>
<h3>
Inability to Support Large, Sparse Files
</h3>
<p>
None of the file systems we looked at support full 64 bit,
sparse files. EFS did not support sparse files at all. Most
others use the block mapping scheme created for FFS.
We decided early on that we
would manage space in files with variable length extents
(which we will describe later), and the FFS style scheme
does not work with variable length extents.
Entries in the FFS block map point to individual blocks in
the file, and up to three levels of indirect blocks can be
used to track blocks throughout the file. This scheme
requires that all entries in the map point to
extents of the same size. This is because it does not store
the offset of each entry in the map with the entry, and
thus forces each entry to be in a fixed location in the tree
so that it can be found. Also, a 64
bit file address space cannot be supported at all without adding
more levels of indirection to the FFS block map.
<p>
<h3>
Inability to Support Large, Contiguous Files
</h3>
<p>
Another problem is that the mechanisms in many other file
systems for allocating large, contiguous files do not
scale well. Most, including EFS, use linear bitmap structures
for tracking free and allocated blocks in the file
system. Finding large regions of contiguous space in
such bitmaps in large file systems is not efficient. For
EFS this has become a significant bottleneck in the performance
of writing newly allocated files. For other file
systems, for example FFS, this has not been a problem
up to this point, because they do not try very hard to
allocate files contiguously. Not doing so, however, can hurt
the I/O performance of accessing
files in those file systems [Seltzer95].
<p>
<h3>
Inability to Support Large Directories
</h3>
<p>
Another area which has not been addressed by other
Unix file systems is support for directories with more than
a few thousand entries. While some, for example Episode
[Chutani92] and VxFS [Veritas95], at least speed up searching
for entries within a directory block via hashing, most
file systems use directory structures which require a
linear scan of the directory blocks in searching for a particular
file. The lookup and update performance of these
unindexed formats degrades linearly with the size of the
directory. Others use in-memory hashing schemes layered
over simple on-disk structures [Hitz94]. These in
memory schemes work well to a point, but in very large
directories they require a large amount of memory. This
problem has been addressed in some non-Unix file systems,
like NTFS [Custer94] and Cedar [Hagmann87],
by using B trees to index the entries in the directory.
<p>
<h3>
Inability to Support Large Numbers of Files
</h3>
<p>
While EFS and other file systems can theoretically support
very large numbers of files in a file system, in practice
they do not. The reason is that the number of inodes allocated
in these file systems is fixed at the time the file system is
created. Choosing a very large number of inodes up front
wastes the space allocated to those inodes when they are not
actually used. The real number of files that will reside in
a file system is rarely known at the time the file system
is created. Being forced to choose makes the management
of large file systems more difficult than it should
be. Episode [Chutani92] and VxFS [Veritas95] both solve this
problem by allowing the number of inodes in the file system to be
increased dynamically.
<p>
In summary, there are several problems with EFS and other file
systems that we wanted to address in the design of XFS. While
these problems may not have been important in the past, we believe
the rules of file system design have changed.
The rest of this paper
describes XFS and the ways in which it solves the
scalability problems described here.
<pre>
</pre>
<h3>
4. XFS Architecture
</h3>
<p>
Figure 1 gives a block diagram of the general structure
of the XFS file system.
<a href="figure1.gif"><img alt="Figure 1. XFS Architecture"
align="left" vspace="10" hspace="20"
width="295" height="421" src="figure1z.gif"></a>
<p>
The high level structure of XFS is similar to a conventional
file system with the addition of a transaction manager
and a volume manager. XFS supports all of the
standard Unix file interfaces and is entirely POSIX and
XPG4 compliant. It sits below the vnode interface
[Kleiman86] in the IRIX kernel and takes full advantage
of services provided by the kernel, including the buffer/page
cache, the directory name lookup cache, and the
dynamic vnode cache.
<p>
XFS is modularized into several parts, each of which is
responsible for a separate piece of the file system's functionality.
The central and most important piece of the
file system is the space manager. This module manages
the file system free space, the allocation of inodes, and
the allocation of space within individual files. The I/O
manager is responsible for satisfying file I/O requests
and depends on the space manager for allocating and
keeping track of space for files. The directory manager
implements the XFS file system name space. The buffer
cache is used by all of these pieces to cache the contents
of frequently accessed blocks from the underlying volume
in memory. It is an integrated page and file cache
shared by all file systems in the kernel. The transaction
manager is used by the other pieces of the file system to
make all updates to the metadata of the file system
atomic. This enables the quick recovery of the file system
after a crash. While the XFS implementation is modular,
it is also large and complex. The current implementation
is over 50,000 lines of C code, while the
EFS implementation is approximately 12,000 lines.
<p>
The volume manager used by XFS, known as XLV, provides
a layer of abstraction between XFS and its underlying
disk devices. XLV provides all of the disk striping,
concatenation, and mirroring used by XFS. XFS itself
knows nothing of the layout of the devices upon which
it is stored. This separation of disk management from
the file system simplifies the file system implementation,
its application interfaces, and the management of
the file system.
<pre>
</pre>
<h3>
5. Storage Scalability
</h3>
<p>
XFS goes to great lengths to efficiently support large
files, large file systems, large numbers of files, and large
directories. This section describes the mechanisms used
to achieve such scalability in size.
<pre>
</pre>
<h3>
5.1. Allocation Groups
</h3>
<p>
XFS supports full 64 bit file systems. All of the global
counters in the system are 64 bits in length. Block
addresses and inode numbers are also 64 bits in length.
To avoid requiring all structures in XFS to scale to the 64
bit size of the file system, the file system is partitioned
into regions called allocation groups (AGs). These are
somewhat similar to the cylinder groups in FFS, but
AGs are used for scalability and parallelism rather than
disk locality.
<p>
Allocation groups keep the size of the XFS data structures
in a range where they can operate efficiently without
breaking the file system into an unmanageable
number of pieces. Allocation groups are typically 0.5
to 4 gigabytes in size. Each AG has its own separate
data structures for managing the free space and
inodes within its boundaries. Partitioning the file system
into AGs limits the size of the individual structures used
for tracking free space and inodes. The partitioning
also allows the per-AG data structures to use AG
relative block and inode pointers. Doing so reduces the
size of those pointers from 64 to 32 bits. Like the limitations
on the size of the region managed by the AG, this
helps to keep the per-AG data structures to an optimal
size.
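<p>
To illustrate the effect of AG relative addressing, the following
sketch shows how a 32 bit AG relative block number and an AG number
might be combined into a single 64 bit file system block address.
The field split and names are illustrative assumptions; the actual
XFS layout derives the split from the AG size chosen when the file
system is made.
<pre>
/* Illustrative sketch: composing a global 64-bit block address from an
 * allocation group number and a 32-bit AG-relative block number.  The
 * fixed field split below is an assumption for illustration; the real
 * XFS layout derives the split from the AG size chosen at mkfs time. */
#include <stdint.h>
#include <stdio.h>

#define AG_BLOCK_BITS 32   /* bits reserved for the AG-relative block number */

typedef uint64_t fs_block_t;   /* file-system-global block number */
typedef uint32_t ag_block_t;   /* AG-relative block number */

static fs_block_t global_block(uint32_t agno, ag_block_t agbno)
{
    return ((fs_block_t)agno << AG_BLOCK_BITS) | agbno;
}

static uint32_t block_to_agno(fs_block_t bno)
{
    return (uint32_t)(bno >> AG_BLOCK_BITS);
}

static ag_block_t block_to_agbno(fs_block_t bno)
{
    return (ag_block_t)(bno & ((1ULL << AG_BLOCK_BITS) - 1));
}

int main(void)
{
    fs_block_t bno = global_block(7, 123456);
    printf("AG %u, AG-relative block %u\n",
           block_to_agno(bno), block_to_agbno(bno));
    return 0;
}
</pre>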
<p>
Allocation groups are only occasionally used for disk
locality. They are generally far too large to be of much
use in this respect. Instead, we establish locality
around individual files and directories. As in FFS, each
time a new directory is created, we place it in a different
AG from its parent. Once we've allocated a directory, we
try to cluster the inodes in that directory and the blocks
for those inodes around the directory itself. This works
well for keeping directories of small files clustered
together on disk. For large files, we try to allocate
extents initially near the inode and afterwards near the
existing block in the file which is closest to the offset in
the file for which we are allocating space. That implies
blocks will be allocated
near the last block in the file for sequential writers and
near blocks in the middle of the file for processes writing
into holes. Files and directories are not limited to
allocating space within a single allocation group, however.
While the structures maintained within an AG use
AG relative pointers, files and directories are file system
global structures that can reference inodes and blocks
anywhere in the file system.
<p>
The other purpose of allocation groups is to allow for
parallelism in the management of free space and inode
allocation. Previous file systems, like SGI's EFS,
have single threaded block allocation and freeing mechanisms.
On a large file system with large numbers of processes
running, this can be a significant bottleneck. By
making the structures in each AG independent of those
in the other AGs, XFS enables free space and inode
management operations to proceed in parallel throughout
the file system. Thus, processes running concurrently
can allocate space in the file system concurrently
without interfering with each other.
<pre>
</pre>
<h3>
5.2. Managing Free Space
</h3>
<p>
Space management is key to good file system performance
and scalability. Efficiently allocating and freeing
space and keeping the file system from becoming fragmented
are essential to good file system performance.
XFS has replaced the block oriented bitmaps of other
file systems with an extent oriented structure consisting
of a pair of B+ trees for each allocation group. The
entries in the B+ trees are descriptors of the free extents
in the AG. Each descriptor consists of an AG relative
starting block and a length. One of the B+ trees is
indexed by the starting block of the free extents, and the
other is indexed by the length of the free extents. This
double indexing allows for very flexible and efficient
searching for free extents based on the type of allocation
being performed.
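<p>
The following sketch illustrates the free extent descriptor and the
dual indexing described above. Sorted arrays stand in for the two
per-AG B+ trees, and all names are illustrative rather than taken
from the XFS source.
<pre>
/* Illustrative sketch (not the XFS code): the per-AG free-extent
 * descriptor and the two orderings used to index it.  Sorted arrays
 * stand in for the two B+ trees: one ordered by starting block (for
 * allocations near a given block) and one ordered by length (for
 * best fit by size). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct free_extent {
    uint32_t start;  /* AG-relative starting block of the free extent */
    uint32_t len;    /* length of the free extent in blocks */
};

static int cmp_by_start(const void *a, const void *b)
{
    const struct free_extent *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

static int cmp_by_len(const void *a, const void *b)
{
    const struct free_extent *x = a, *y = b;
    return (x->len > y->len) - (x->len < y->len);
}

/* Best fit by size: first entry in the length-ordered index that is at
 * least want_len blocks long.  (A linear scan here; an O(log N) descent
 * in the real B+ tree.) */
static const struct free_extent *
alloc_by_size(const struct free_extent *by_len, size_t n, uint32_t want_len)
{
    for (size_t i = 0; i < n; i++)
        if (by_len[i].len >= want_len)
            return &by_len[i];
    return NULL;
}

int main(void)
{
    struct free_extent free_list[] = { {500, 16}, {10, 4}, {100, 64} };
    size_t n = sizeof(free_list) / sizeof(free_list[0]);
    struct free_extent by_start[3], by_len[3];

    memcpy(by_start, free_list, sizeof(free_list));
    memcpy(by_len, free_list, sizeof(free_list));
    qsort(by_start, n, sizeof(*by_start), cmp_by_start); /* "allocate near" index */
    qsort(by_len, n, sizeof(*by_len), cmp_by_len);       /* "allocate by size" index */

    const struct free_extent *e = alloc_by_size(by_len, n, 20);
    if (e)
        printf("best fit for 20 blocks: start %u, len %u\n", e->start, e->len);
    return 0;
}
</pre>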
<p>
Searching an extent based tree is more efficient than a
linear bitmap scan, especially for large, contiguous allocations.
In searching a tree describing only the free
extents, no time is wasted scanning bits for allocated
blocks or determining the length of a given extent.
According to our simulations, the extent based trees are
just as efficient as, and more flexible than, hierarchical bitmap
schemes such as binary buddy bitmaps. Unfortunately,
the results of those simulations have been lost, so
we will have to settle here for an analytical explanation.
Unlike binary buddy schemes, there are no restrictions
on the alignment or size of the extents which can be
allocated. This is why we consider the B+ trees more
flexible. Finding an extent of a given size with the B+
tree indexed by free extent size, and finding an extent
near a given block with the B+ tree indexed by extent
starting block are both O(log N) operations. This is why
we feel that the B+ trees are just as efficient as binary
buddy schemes. The implementation of the allocation
B+ trees is certainly more complex than normal or
binary buddy bitmap schemes, but we believe that the
combination of flexibility and performance we get from
the B+ trees is worth the complexity.
<pre>
</pre>
<h3>
5.3. Supporting Large Files
</h3>
<p>
XFS provides a 64 bit, sparse address space for each
file. The support for sparse files allows files to have
holes in them for which no disk space is allocated. The
support for 64 bit files means that there are potentially a
very large number of blocks to be indexed for every file.
In order to keep the number of entries in the file allocation
map small, XFS uses an extent map rather than a
block map for each file. Entries in the extent map are
ranges of contiguous blocks allocated to the file. Each
entry consists of the block offset of the entry in the file,
the length of the extent in blocks, and the starting block
of the extent in the file system. In addition to saving
space over a block map by compressing the allocation
map entries for up to two million contiguous blocks into
a single extent map entry, using extent descriptors
makes the management of contiguous space within a file
efficient.
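<p>
A minimal sketch of such an extent map entry, and of the file-to-disk
block translation it provides, is shown below; the field names are
ours, not those of the XFS source.
<pre>
/* Illustrative sketch of an in-memory file extent map entry: each entry
 * maps a contiguous range of a file's address space onto a contiguous
 * range of file system blocks.  Field names are ours, not XFS's. */
#include <stdint.h>
#include <stdio.h>

struct extent_entry {
    uint64_t file_offset;   /* block offset of this extent within the file */
    uint64_t start_block;   /* starting file system block of the extent */
    uint32_t block_count;   /* length of the extent in blocks */
};

/* Map a file block to a disk block through one entry, or return 0 if the
 * block falls outside the entry (for example, in a hole). */
static int extent_lookup(const struct extent_entry *e, uint64_t file_block,
                         uint64_t *disk_block)
{
    if (file_block < e->file_offset ||
        file_block >= e->file_offset + e->block_count)
        return 0;
    *disk_block = e->start_block + (file_block - e->file_offset);
    return 1;
}

int main(void)
{
    struct extent_entry e = { .file_offset = 128, .start_block = 90000,
                              .block_count = 1024 };
    uint64_t disk;

    if (extent_lookup(&e, 200, &disk))
        printf("file block 200 -> disk block %llu\n",
               (unsigned long long)disk);
    return 0;
}
</pre>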
<p>
Even with the space compression provided by an extent
map, sparse files may still require large numbers of
entries in the file allocation map. When the number of
extents allocated to a file overflows the number that can
fit immediately within an XFS inode, we use a B+ tree
rooted within the inode to manage the extent descriptors.
The B+ tree is indexed by the block offset field of
the extent descriptors, and the data stored within the B+
tree are the extent descriptors. The B+ tree structure
allows us to keep track of millions of extent descriptors,
and, unlike an FFS style solution, it allows us to do so
without forcing all extents to be of the same size. By
storing the offset and length of each entry in the extent
map in the entry, we gain the benefit of entries in the
map which can point to variable length extents in
exchange for a more complicated map implementation
and less fan out at each level of the mapping tree (since
our individual entries are larger we fit fewer of them in
each indirect block).
<pre>
</pre>
<h3>
5.4. Supporting Large Numbers of Files
</h3>
<p>
In addition to supporting very large files, XFS supports
very large numbers of files. The number of files in a file
system is limited only by the amount of space in the file
system to hold them. Rather than statically pre-allocating
the inodes for all of these files, XFS dynamically
allocates inodes as needed. This relieves the system
administrator of having to guess the number of files
that will be created in a given file system and of having
to recreate the file system when that guess is wrong.
<p>
With dynamically allocated inodes, it is necessary to use
some data structure for keeping track of where the
inodes are located. In XFS, each allocation group manages
the inodes allocated within its confines. Each AG
uses a B+ tree to index the locations of the inodes within
it. Inodes are allocated in chunks of sixty-four inodes.
The inode allocation B+ tree in each AG keeps track of
the locations of these chunks and whether each inode
within a chunk is in use. The inodes themselves are not
actually contained in the B+ tree. The B+ tree records
only indicate where each chunk of inodes is located
within the AG.
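<p>
The following sketch illustrates what an inode allocation B+ tree
record might contain: the starting inode number of a sixty-four inode
chunk and a mask recording which inodes in the chunk are free. The
layout is an assumption for illustration, not the XFS on-disk format.
<pre>
/* Illustrative sketch of an inode-allocation B+ tree record: a chunk of
 * 64 dynamically allocated inodes described by its starting inode number
 * and a bitmask of which inodes in the chunk are free.  The layout is an
 * assumption for illustration, not the XFS on-disk format. */
#include <stdint.h>
#include <stdio.h>

#define INODES_PER_CHUNK 64

struct inode_chunk_rec {
    uint64_t start_ino;   /* inode number of the first inode in the chunk */
    uint64_t free_mask;   /* bit i set => inode start_ino + i is free */
};

static int chunk_inode_is_free(const struct inode_chunk_rec *rec, uint64_t ino)
{
    if (ino < rec->start_ino || ino >= rec->start_ino + INODES_PER_CHUNK)
        return -1;                     /* not described by this record */
    return (int)((rec->free_mask >> (ino - rec->start_ino)) & 1);
}

int main(void)
{
    /* Chunk starting at inode 128 with only inode 128 in use. */
    struct inode_chunk_rec rec = { .start_ino = 128, .free_mask = ~0ULL ^ 1 };

    printf("inode 128 free? %d\n", chunk_inode_is_free(&rec, 128));  /* 0 */
    printf("inode 129 free? %d\n", chunk_inode_is_free(&rec, 129));  /* 1 */
    return 0;
}
</pre>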
<p>
The inode allocation B+ trees, containing only the offset
of each inode chunk along with a bit for each inode in
the chunk, can each manage millions of inodes. This
flexibility comes at the cost of additional complexity in
the implementation of the file system. Deciding where
to allocate new chunks of inodes and keeping track of
them requires complexity that does not exist in other file
systems. File system backup programs that need to
traverse the inodes of the file system are also more complex,
because they need to be able to traverse the inode B+
trees in order to find all of the inodes. Finally, having a
sparse inode numbering space forced us to use 64 bit inode
numbers, and this introduces a whole series of system
interface issues for returning file identifiers to programs.
<pre>
</pre>
<h3>
5.5. Supporting Large Directories
</h3>
<p>
The millions of files in an XFS file system need to be
represented in the file system name space. XFS implements
the traditional Unix hierarchical name space.
Unlike existing Unix file systems, XFS can efficiently support
large numbers of files in a single directory. XFS uses an
on-disk B+ tree structure for its directories.
<p>
The directory B+ tree is a bit different from the other B+
trees in XFS. The difference is that keys for the entries
in the directory B+ tree, the names of the files in the
directory, vary in length from 1 to 255 bytes. To hide
this fact from the B+ tree index management code, the
directory entry names are hashed to four byte values
which are used as the keys in the B+ tree for the entries.
The directory entries are kept in the leaves of the B+
tree. Each stores the full name of the entry along with
the inode number for that entry. Since the hash function
used for the directories is not perfect, the directory code
must manage entries in the directory with duplicate
keys. This is done by keeping entries with duplicate
hash values next to each other in the tree. The use of
fixed size keys in the interior nodes of the directory B+
tree simplifies the code for managing the B+ tree, but by
making the keys non-unique via hashing we add significant
complexity. We feel that the increased performance
of the directory B+ trees that results from having fixed
size keys, described below, is worth the increased complexity
of the implementation.
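<p>
The sketch below illustrates the idea of reducing a variable length
name to a fixed four byte key. The hash shown is a generic rolling
hash, not the function XFS uses, and, as noted above, any such hash
can produce duplicate keys that must be resolved by comparing full
names.
<pre>
/* Illustrative sketch: hashing a variable-length directory entry name
 * down to a fixed four-byte B+ tree key.  This is a generic rolling
 * hash, not the hash function XFS uses; because any such hash can
 * collide, entries with equal keys are kept adjacent in the tree and
 * distinguished by comparing their full names. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t name_hash(const char *name, size_t len)
{
    uint32_t hash = 0;

    for (size_t i = 0; i < len; i++)
        hash = hash * 33 + (unsigned char)name[i];
    return hash;
}

int main(void)
{
    const char *name = "report-1995.txt";

    printf("key for \"%s\": 0x%08x\n", name, name_hash(name, strlen(name)));
    return 0;
}
</pre>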
<p>
Hashing the potentially long, variable length names down to
small, constant size keys increases the breadth of the directory
B+ tree and thereby reduces its height. Because the interior
nodes are fixed in size at a single file system block, the
compact keys allow more of them, and therefore more children,
to fit into each interior node. Reducing the height of the tree
reduces the number of levels that must be examined in searching
for a given entry in the directory. The B+ tree structure makes
lookup, create, and remove operations in directories
with millions of entries practical. However, listing the
contents of a directory with a million entries is still
impractical due to the size of the resulting output.
<pre>
</pre>
<h3>
5.6. Supporting Fast Crash Recovery
</h3>
<p>
File systems of the size and complexity of XFS cannot
be practically recovered by a process which examines
the file system metadata to reconstruct the file system. In
a large file system, examining the large amount of metadata
will take too long. In a complex file system, piecing
the on-disk data structures back together will take even
longer. An example is the recovery of the XFS inode
table. Since our inodes are not located in a fixed location,
finding all of the inodes in the worst case where the
inode B+ trees have been trashed can require scanning
the entire disk for inodes. To avoid these problems, XFS
uses a write ahead logging scheme that enables atomic
updates of the file system. This scheme is very similar to
the one described very thoroughly in [Hisgen93].
<p>
XFS logs all structural
updates to the file system metadata. This includes inodes,
directory blocks, free extent tree blocks, inode allocation
tree blocks, file extent map blocks, AG header blocks, and
the superblock. XFS does not log user data. For example,
creating a file requires logging the directory block containing
the new entry, the newly allocated inode, the inode allocation
tree block describing the allocated inode, the allocation group
header block containing the count of free inodes, and the superblock
to record the change in its count of free inodes. The entry in the
log for each of these items consists of header information describing
which block or inode this is and a copy of the new image of the
item as it should exist on disk.
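<p>
A minimal sketch of the shape of such a log entry is shown below; the
type codes and layout are illustrative assumptions, not the XFS log
format.
<pre>
/* Illustrative sketch of a write-ahead log item: a small header saying
 * which metadata object this is (by type and disk address), followed by
 * a copy of the new image of that object as it should appear on disk.
 * The layout and type codes are ours, not the XFS log format. */
#include <stdint.h>
#include <stdio.h>

enum log_item_type {
    LOG_ITEM_INODE,
    LOG_ITEM_DIR_BLOCK,
    LOG_ITEM_FREE_TREE_BLOCK,
    LOG_ITEM_INODE_TREE_BLOCK,
    LOG_ITEM_AG_HEADER,
    LOG_ITEM_SUPERBLOCK
};

struct log_item_header {
    uint32_t type;        /* which kind of metadata object follows */
    uint32_t image_len;   /* length of the image in bytes */
    uint64_t disk_addr;   /* where the image belongs in the file system */
};

/* In the log stream, each header is followed immediately by image_len
 * bytes holding the new contents of the object. */
struct log_item {
    struct log_item_header hdr;
    unsigned char image[];    /* flexible array member: the new disk image */
};

int main(void)
{
    printf("log item header is %zu bytes\n", sizeof(struct log_item_header));
    return 0;
}
</pre>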
<p>
Logging new copies of the
modified items makes recovering the XFS log independent of both
the size and complexity of the file system.
Recovering the data structures from the log requires
nothing but replaying the block and inode images in the
log out to their real locations in the file system. The log
recovery does not know that it is recovering a B+
tree. It only knows that it is restoring the latest images
of some file system blocks.
<p>
Unfortunately, using a transaction log does not entirely
obsolete the use of file system scavenger programs.
Hardware and software errors which corrupt random
blocks in the file system are not generally recoverable
with the transaction log, yet these errors can make the
contents of the file system inaccessible. We did not provide
such a repair program in the initial release of XFS,
naively thinking that it would not be necessary, but our
customers have convinced us that we were wrong. Without
one, the only way to bring a corrupted file system
back on line is to re-create it with mkfs and restore it
from backups. We will be providing a scavenger program
for all versions of XFS in the near future.
<pre>
</pre>
<h3>
6. Performance Scalability
</h3>
<p>
In addition to managing large amounts of disk space,
XFS is designed for high performance file and file system
access. XFS is designed to run well over large,
striped disk arrays where the aggregate bandwidth of the
underlying drives ranges in the tens to hundreds of
megabytes per second.
<p>
The keys to performance in these arrays are I/O request
size and I/O request parallelism. Modern disk drives
have much higher bandwidth when requests are made in
large chunks. With a striped disk array, this need for
large requests is increased as individual requests are
broken up into smaller requests to the individual drives.
Since there are practical limits to individual request
sizes, it is important to issue many requests in parallel in
order to keep all of the drives in a striped array busy.
The aggregate bandwidth of a disk array can only be
achieved if all of the drives in the array are constantly
busy.
<p>
In this section, we describe how XFS makes that full
aggregate bandwidth available to applications. We start
with how XFS works to allocate large contiguous files.
Next we describe how XFS performs I/O to those files.
Finally, we describe how XFS manages its metadata for
high performance.
<pre>
</pre>
<h3>
6.1. Allocating Files Contiguously
</h3>
<p>
The first step in allowing large I/O requests to a file is to
allocate the file as contiguously as possible. This is
because the size of a request to the underlying drives is
limited by the range of contiguous blocks in the file
being read or written. The space manager in XFS goes
to great lengths to ensure that files are allocated contiguously.
<pre>
</pre>
<h3>
Delaying Allocation
</h3>
<p>
One of the key features of XFS in allocating files contiguously
is delayed file extent allocation. Delayed allocation
applies lazy evaluation techniques to file allocation.
Rather than allocating specific blocks to a file as it is
written in the buffer cache, XFS simply reserves blocks
in the file system for the data buffered in memory. A virtual
extent is built up in memory for the reserved blocks.
Only when the buffered data is flushed to disk are real
blocks allocated for the virtual extent. Delaying the
decision of which and how many blocks to allocate to a
file as it is written provides the allocator with much better
knowledge of the eventual size of the file when it makes its
decision. When the
entire file can be buffered in memory, the entire file can
usually be allocated in a single extent if the contiguous
space to hold it is available. For files that cannot be
entirely buffered in memory, delayed allocation allows
the files to be allocated in much larger extents than
would otherwise be possible.
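<p>
The following sketch illustrates the two phase reserve-then-allocate
flow of delayed allocation. The types, counters, and functions are
illustrative stand-ins, not the XFS implementation.
<pre>
/* Illustrative sketch of delayed allocation: a buffered write only
 * reserves space and extends a "virtual" (delayed) extent; real blocks
 * are chosen later, when the dirty data is flushed, so the allocator
 * sees the full size of what was written.  Types and functions here
 * are ours, not XFS's. */
#include <stdint.h>
#include <stdio.h>

struct delayed_extent {
    uint64_t file_offset;    /* block offset in the file */
    uint64_t block_count;    /* blocks reserved but not yet placed on disk */
};

static uint64_t fs_free_blocks = 1000000;   /* global free-space counter */

/* Phase 1, at write() time: just reserve space and extend the in-memory
 * virtual extent covering the buffered data. */
static int delay_allocate(struct delayed_extent *de, uint64_t blocks)
{
    if (fs_free_blocks < blocks)
        return -1;                  /* would over-commit the file system */
    fs_free_blocks -= blocks;
    de->block_count += blocks;
    return 0;
}

/* Phase 2, at flush time: convert the whole virtual extent into one real
 * allocation request, now that its final size is known. */
static void flush_delayed(const struct delayed_extent *de)
{
    printf("allocating %llu contiguous blocks at file offset %llu\n",
           (unsigned long long)de->block_count,
           (unsigned long long)de->file_offset);
    /* ...call into the space manager to pick real blocks here... */
}

int main(void)
{
    struct delayed_extent de = { .file_offset = 0, .block_count = 0 };

    for (int i = 0; i < 16; i++)     /* sixteen small buffered writes */
        delay_allocate(&de, 8);
    flush_delayed(&de);              /* one 128-block allocation request */
    return 0;
}
</pre>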
<p>
Delayed allocation fits well in modern file system
design in that its effectiveness increases with the size of
the memory of the system. As more data is buffered in
memory, the allocator is provided with better and better
information for making its decisions. Also, with delayed
allocation, short lived files which can be buffered in
memory are often never allocated any real disk blocks.
The files are removed and purged from the file cache
before they are pushed to disk. Such short lived files
appear to be relatively common in Unix systems
[Ousterhout85, Baker91], and delayed allocation
reduces both the number of metadata updates caused by
such files and the impact of such files on file system
fragmentation.
<p>
Another benefit of delayed allocation is that files which
are written randomly but have no holes can often be
allocated contiguously. If all of the dirty data can be
buffered in memory, the space for the randomly written
data can be allocated contiguously when the dirty data is
flushed out to disk. This is especially important for
applications writing data with mapped files where random
access is the norm rather than the exception.
<pre>
</pre>
<h3>
Supporting Large Extents
</h3>
<p>
To make the management of large amounts of contiguous
space in a file efficient, XFS uses very large extent
descriptors in the file extent map. Each descriptor can
describe up to two million file system blocks, because
we use 21 bits in the extent descriptor to store the length
of the extent. Describing large numbers of blocks with a
single extent descriptor eliminates the CPU overhead of scanning
entries in the extent map to determine whether
blocks in the file are contiguous. We can simply
read the length of the extent rather than looking at each
entry to see if it is contiguous with the previous entry.
<p>
The extent descriptors used by XFS are 16 bytes in
length. This is actually their compressed size, as the in
memory extent descriptor needs 20 bytes (8 for file offset,
8 for the block number, and 4 for the extent length).
Having such large extent descriptors forces us to have a
smaller number of direct extent pointers in the inode
than we would with smaller extent descriptors like those
used by EFS (8 bytes total). We feel that this is a reasonable
trade-off for XFS because of our focus on contiguous
file allocation and the good performance of the
indirect extent maps even when we do overflow the
direct extents.
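<p>
The sketch below shows one way a file offset, a starting block, and a
21 bit length could be packed into a 16 byte descriptor. The
particular bit split is an assumption for illustration; it is not the
XFS on-disk encoding.
<pre>
/* Illustrative sketch of packing an extent descriptor into 16 bytes: a
 * file offset, a starting block, and a 21-bit length (up to about two
 * million blocks per extent) squeezed into 128 bits.  The bit layout
 * below (54-bit offset, 53-bit block, 21-bit length) is an assumption
 * for illustration, not the XFS on-disk encoding. */
#include <stdint.h>
#include <stdio.h>

#define EXTENT_LEN_BITS 21
#define EXTENT_MAX_LEN  ((1u << EXTENT_LEN_BITS) - 1)

struct extent_unpacked {          /* the roughly 20-byte in-memory form */
    uint64_t file_offset;
    uint64_t start_block;
    uint32_t len;                 /* at most EXTENT_MAX_LEN blocks */
};

struct extent_packed {            /* the 16-byte on-disk form */
    uint64_t word0;               /* offset (54 bits) | block high 10 bits */
    uint64_t word1;               /* block low 43 bits | length (21 bits) */
};

static struct extent_packed pack(const struct extent_unpacked *u)
{
    /* Assumes file_offset fits in 54 bits, start_block in 53 bits, and
     * len is at most EXTENT_MAX_LEN. */
    struct extent_packed p;
    p.word0 = (u->file_offset << 10) | (u->start_block >> 43);
    p.word1 = (u->start_block << 21) | u->len;
    return p;
}

static struct extent_unpacked unpack(const struct extent_packed *p)
{
    struct extent_unpacked u;
    u.file_offset = p->word0 >> 10;
    u.start_block = ((p->word0 & 0x3FFULL) << 43) | (p->word1 >> 21);
    u.len = (uint32_t)(p->word1 & EXTENT_MAX_LEN);
    return u;
}

int main(void)
{
    struct extent_unpacked in = {
        .file_offset = 123456,
        .start_block = 0x123456789ABULL,   /* fits in 53 bits */
        .len         = 2048
    };
    struct extent_packed   p   = pack(&in);
    struct extent_unpacked out = unpack(&p);

    printf("packed size: %zu bytes\n", sizeof(p));   /* 16 */
    printf("roundtrip ok: %d\n",
           in.file_offset == out.file_offset &&
           in.start_block == out.start_block &&
           in.len == out.len);
    return 0;
}
</pre>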
<pre>
</pre>
<h3>
Supporting Variable Block Sizes
</h3>
<p>
In addition to the above features for keeping disk space
contiguous, XFS allows the file system block size to
range from 512 bytes to 64 kilobytes on a per file system
basis. The file system block size is the minimum unit of
allocation and I/O request size. Thus, setting the block
size sets the minimum unit of fragmentation in the file
system. Of course, this must be balanced against the
large amount of internal fragmentation that is caused by
using very large block sizes. File systems with large
numbers of small files, for example news servers, typically
use smaller block sizes in order to avoid wasting
space via internal fragmentation. File systems with large
files tend to make the opposite choice and use large
block sizes in order to reduce external fragmentation of
the file system and their files' extents.
<pre>
</pre>
<h3>
Avoiding File System Fragmentation
</h3>
<p>
The work by Seltzer and Smith [Seltzer95] shows that long term
file system fragmentation can degrade the performance of FFS
file systems by between 5% and 15%. This fragmentation is the
result of creating and removing many files over time. Even if
all of the files are allocated contiguously, eventually, the remaining
files are scattered about the disk. This fragments the file system's
free space. Given the propensity of
XFS for doing large I/O to contiguously allocated files, we
could expect the degradation of XFS from its optimum performance
to be even worse.
<p>
While XFS cannot completely avoid this problem, there are a few reasons
why its impact on XFS file systems is not as severe as it could be.
The first is the combination of delayed allocation and the allocation
B+ trees. In using the two together XFS makes requests for larger
allocations to the allocator and is able to efficiently determine one
of the best fitting extents in the file system for that allocation.
This helps in delaying the onset of the fragmentation problem and reducing
its performance impact once it occurs. A second reason is that XFS file
systems are typically larger than EFS and FFS file systems. In a large
file system, there is typically a larger amount of free space for the
allocator to work with. In such a file system it takes much longer for
the file system to become fragmented. Another reason
is that file systems tend to be used to store
either a small number of large files or a large number of small files.
In a file system with a smaller number of large files, fragmentation will
not be a problem, because allocating and deleting large files still leaves
large regions of contiguous free space in the file system. In a file system
containing mostly small files, fragmentation is not a big problem, because
small files have no need for large regions of contiguous space.
However, in the long term we still expect fragmentation to degrade the
performance of XFS file systems, so we intend to add an on-line file system
defragmentation utility to optimize the file system in the future.
<pre>
</pre>
<h3>
6.2. Performing File I/O
</h3>
<p>
Given a contiguously allocated file, it is the job of the
XFS I/O manager to read and write the file in large
enough requests to drive the underlying disk drives at
full speed. XFS uses a combination of clustering, read
ahead, write behind, and request parallelism in order to
exploit its underlying disk array. For high performance
I/O, XFS allows applications to use direct I/O to move
data directly between application memory and the disk
array using DMA. Each of these is described in detail
below.
<pre>
</pre>
<h3>
Handling Read Requests
</h3>
<p>
To obtain good sequential read performance, XFS uses
large read buffers and multiple read ahead buffers. By
large read buffers, we mean that for sequential reads we
use a large minimum I/O buffer size (typically 64 kilobytes).
Of course, for files smaller than the minimum buffer
size, we reduce the size of the buffers to match the files.
Using a large minimum I/O size ensures that even when
applications issue reads in small units the file system
feeds the disk array requests that are large enough for
good disk I/O performance. For larger application reads,
XFS increases the read buffer size to match the application's
request. This is very similar to the read clustering
scheme in SunOS [McVoy90], but it is more aggressive
in using memory to improve I/O performance.
<p>
While large read buffers satisfy the need for large
request sizes, XFS uses multiple read ahead buffers to
increase the parallelism in accessing the underlying disk
array. Traditional Unix systems have used only a single
read ahead buffer at a time [McVoy90]. For sequential
reads, XFS keeps two to three requests outstanding, each of
the same size as the primary I/O buffer. The number varies
because we try to keep three read ahead requests outstanding,
but we wait until the process catches up a bit
with the read ahead before issuing more. The multiple
read ahead requests keep the drives in the array busy
while the application processes the data being read. The
larger number of read ahead buffers allows us to keep a
larger number of underlying drives busy at once. Not
issuing read ahead blindly, but instead waiting until the
application catches up a bit, helps to keep sequential
readers from flooding the drive with read ahead requests
when the application is not keeping up with the I/O anyway.
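<p>
The read ahead policy described above can be sketched as follows. The
structure, window size, and the issue_read stub are illustrative
assumptions, not the XFS code.
<pre>
/* Illustrative sketch of the read-ahead policy described above: keep up
 * to three large read-ahead requests outstanding, but only issue more
 * once the application's read offset has caught up to within a window
 * of the data already requested.  All names and issue_read() are ours. */
#include <stdint.h>
#include <stdio.h>

#define READAHEAD_MAX 3              /* target outstanding requests */
#define BUFFER_SIZE   (64 * 1024)    /* bytes per read-ahead request */

struct readahead_state {
    uint64_t next_offset;    /* next file offset to read ahead */
    int      outstanding;    /* read-ahead requests in flight */
};

static void issue_read(uint64_t offset, uint64_t len)
{
    printf("read ahead: offset %llu, %llu bytes\n",
           (unsigned long long)offset, (unsigned long long)len);
}

/* Called as the application reads sequentially at app_offset. */
static void maybe_readahead(struct readahead_state *ra, uint64_t app_offset)
{
    /* Don't run too far ahead of the reader: only issue more requests
     * while the reader is within the read-ahead window. */
    while (ra->outstanding < READAHEAD_MAX &&
           ra->next_offset < app_offset +
                             (uint64_t)READAHEAD_MAX * BUFFER_SIZE) {
        issue_read(ra->next_offset, BUFFER_SIZE);
        ra->next_offset += BUFFER_SIZE;
        ra->outstanding++;
    }
}

/* Called when a read-ahead request completes. */
static void readahead_done(struct readahead_state *ra)
{
    if (ra->outstanding > 0)
        ra->outstanding--;
}

int main(void)
{
    struct readahead_state ra = { .next_offset = 0, .outstanding = 0 };

    maybe_readahead(&ra, 0);            /* reader at 0: three requests issued */
    readahead_done(&ra);
    maybe_readahead(&ra, BUFFER_SIZE);  /* reader caught up a bit: one more */
    return 0;
}
</pre>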
<pre>
</pre>
<h3>
Handling Write Requests
</h3>
<p>
To get good write performance, XFS uses aggressive
write clustering [McVoy90]. Dirty file data is buffered
in memory in chunks of 64 kilobytes, and when a chunk is
chosen to be flushed from memory it is clustered with
other contiguous chunks to form a larger I/O request.
These I/O clusters are written to disk asynchronously, so
as data is written into the file cache many such clusters
will be sent to the underlying disk array concurrently.
This keeps the underlying disk array busy with a stream
of large write requests.
<p>
The write behind used by XFS is tightly integrated with
the delayed allocation mechanism described earlier. The
more dirty data we can buffer in memory for a newly
written file, the better the allocation for that file will be.
This is balanced with the need to keep memory from
being flooded with dirty pages and the need to keep I/O
requests streaming out to the underlying disk array. This
is mostly an issue for the file cache, however, so it will
not be discussed in this paper.
<pre>
</pre>
<h3>
Using Direct I/O
</h3>
<p>
With very large disk arrays, it is often the case that the
underlying I/O hardware can move data faster than the
system's CPUs can copy that data into or out of the
buffer cache. On these systems, the CPU is the bottleneck
in moving data between a file and an application.
For these situations, XFS provides what we call direct I/O.
Direct I/O allows a program to read or write a file
without first passing the data through the system buffer
cache. The data is moved directly between the user program's
buffer and the disk array using DMA. This
avoids the overhead of copying the data into or out of
the buffer cache, and it also allows the program to better
control the size of the requests made to the underlying
devices. In the initial implementation of XFS, direct I/O
was not kept coherent with buffered I/O, but this has
been fixed in the latest version. Direct I/O is very similar
to traditional Unix raw disk access, but it differs in that
the disk addressing is indirected through the file extent
map.
<p>
Direct I/O provides applications with access to the full
bandwidth of the underlying disk array without the complexity
of managing raw disk devices. Applications processing
files much larger than the system's memory can
avoid using the buffer cache since they get no benefit
from it. Applications like databases that consider the
Unix buffer cache a nuisance can avoid it entirely while
still reaping the benefits of working with normal files.
Applications with real-time I/O requirements can use
direct I/O to gain fine grained control over the I/O they do
to files.
<p>
The downsides of direct I/O are that it is more restrictive
than traditional Unix file I/O and that it requires more
sophistication from the application using it. It is more
restrictive in that it requires the application to align its
requests on block boundaries and to keep the requests a
multiple of the block size in length. This often requires
more complicated buffering techniques in the application
that are normally handled by the Unix file cache.
Direct I/O also requires more of the application in that it
places the burden of making efficient I/O requests on the
application. If the application writes a file using direct
I/O and makes individual 4 kilobyte requests, the application
will run much slower than if it made those same
requests into the file cache where they could be clustered
into larger requests. While direct I/O will never
entirely replace traditional Unix file I/O, it is a useful
alternative for sophisticated applications that need high
performance file I/O.
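<p>
The alignment constraints are illustrated by the sketch below. The
Linux style O_DIRECT open flag and the file name are used only as
stand-ins; IRIX exposed direct I/O through its own interface, which
the sketch does not reproduce.
<pre>
/* Illustrative sketch of the alignment constraints direct I/O places on
 * an application: buffer address, file offset, and request length all
 * aligned to the block size.  The Linux-style O_DIRECT flag and the
 * file name "datafile" are stand-ins for illustration only; IRIX used
 * its own interface for direct I/O. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE   4096u
#define REQUEST_SIZE (256u * 1024u)     /* a large, block-aligned request */

int main(void)
{
    void *buf;
    int fd;
    ssize_t n;

    /* The user buffer must itself be block aligned. */
    if (posix_memalign(&buf, BLOCK_SIZE, REQUEST_SIZE) != 0) {
        perror("posix_memalign");
        return 1;
    }

    fd = open("datafile", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        free(buf);
        return 1;
    }

    /* Offset and length are both multiples of the block size, so the
     * data can move by DMA without passing through the file cache. */
    n = pread(fd, buf, REQUEST_SIZE, 0);
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes directly\n", n);

    close(fd);
    free(buf);
    return 0;
}
</pre>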
<pre>
</pre>
<h3>
Using Multiple Processes
</h3>
<p>
Another barrier to high performance file I/O in many
Unix file systems is the single threading inode lock used
for each file. This lock ensures that only one process at a
time may have I/O outstanding for a single file. This
lock thwarts applications trying to increase the rate at
which they can read or write a file using multiple processes
to access the file at once.
<p>
XFS uses a more flexible locking scheme that allows
multiple processes to read and write a file at once. When
using normal, buffered I/O, multiple readers can access
the file concurrently, but only a single writer is allowed
access to the file at a time. The single writer restriction
is due to implementation rather than architectural
restrictions and will eventually be removed. When using
direct I/O, multiple readers and writers can all access the
file simultaneously. Currently, when using direct I/O and
multiple writers, we place the burden of serializing
writes to the same region of the file on the application.
This differs from traditional Unix file I/O where file
writes are atomic with respect to other file accesses, and
it is one of the main reasons why we do not yet support
multiple writers using traditional Unix file I/O.
<p>
Allowing parallel access to a file can make a significant
difference in the performance of access to the file. When
the bottleneck in accessing the file is the speed at which
the CPU can move data between the application buffer
and the buffer cache, parallel access to the file allows
multiple CPUs to be applied to the data movement.
When using direct I/O to drive a large disk array, parallel
access to the file allows requests to be pipelined to
the disk array using multiple processes to issue multiple
requests. This feature is especially important for systems
like IRIX that implement asynchronous I/O using
threads. Without parallel file access, the asynchronous
requests would be serialized by the inode lock and
would therefore provide almost no performance benefit.
<pre>
</pre>
<h3>
6.3. Accessing and Updating Metadata
</h3>
<p>
The other side of file system performance is that of
manipulating the file system metadata. For many applications,
the speed at which files and directories can be
created, destroyed, and traversed is just as important as
file I/O rates. XFS attacks the problem of metadata performance
on three fronts. The first is to use a transaction
log to make metadata updates fast. The second is to use
advanced data structures to change searches and updates
from linear to logarithmic in complexity. The third is to
allow parallelism in the search and update of different
parts of the file system. We have already discussed the
XFS data structures in detail, so this section will focus
on the XFS transaction log and file system parallelism.
<pre>
</pre>
<h3>
Logging Transactions
</h3>
<p>
A problem that has plagued traditional Unix file systems
is their use of ordered, synchronous updates to on-disk
data structures in order to make those updates recoverable
by a scavenger program like fsck. The synchronous
writes slow the performance of the metadata updates
down to the performance of disk writes rather than the
speed of today's fast CPUs [Rosenblum92].
<p>
XFS uses a write ahead transaction log to gather all the
writes of an update into a single disk I/O, and it writes
the transaction log asynchronously in order to decouple
the metadata update rate from the speed of the disks.
Other schemes such as log structured file systems
[Rosenblum92], shadow paging [Hitz94], and soft
updates [Ganger94] have been proposed to solve this
problem, but we feel that write ahead logging provides
the best trade-off among flexibility, performance, and
reliability. This is because it provides us with the fast
metadata updates and crash recovery we need without
sacrificing our ability to efficiently support synchronous
writing workloads, for example that of an NFS server
[Sandberg85], and without sacrificing our desire for
large, contiguous file support. However, an in depth analysis of
write ahead logging or the tradeoffs among these schemes
is beyond the scope of this paper.
<pre>
</pre>
<h3>
Logging Transactions Asynchronously
</h3>
<p>
Traditional write ahead logging schemes write the log
synchronously to disk before declaring a transaction
committed and unlocking its resources. While this provides
concrete guarantees about the permanence of an
update, it restricts the update rate of the file system to
the rate at which it can write the log. While XFS provides
a mode for making file system updates synchronous
for use when the file system is exported via NFS,
the normal mode of operation for XFS is to use an asynchronously
written log. We still ensure that the write
ahead logging protocol is followed in that modified data
cannot be flushed to disk until after the data is committed
to the on-disk log. Rather than keeping the modified
resources locked until the transaction is committed to
disk, however, we instead unlock the resources and pin
them in memory until the transaction commit is written
to the on-disk log. The resources can be unlocked once
the transaction is committed to the in-memory log buffers,
because the log itself preserves the order of the
updates to the file system.
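<p>
The commit rule described above can be sketched as follows; the buffer
structure and flags are simplified stand-ins, not the XFS
implementation.
<pre>
/* Illustrative sketch of the asynchronous commit rule described above:
 * a modified metadata buffer is unlocked as soon as its transaction
 * commits to the in-memory log, but it stays "pinned" and cannot be
 * written back to its disk location until the commit has reached the
 * on-disk log.  The structure and flags are simplified stand-ins. */
#include <stdbool.h>
#include <stdio.h>

struct meta_buffer {
    bool locked;   /* held by the transaction while it modifies the buffer */
    bool pinned;   /* must not be flushed until the log commit is on disk */
    bool dirty;
};

/* Transaction commit: copy the new image into the in-memory log buffer
 * (not shown), then unlock but pin the metadata buffer. */
static void transaction_commit(struct meta_buffer *b)
{
    b->dirty = true;
    b->pinned = true;    /* write-ahead rule: the log must reach disk first */
    b->locked = false;   /* other transactions may now use the buffer */
}

/* Called when the asynchronous log write containing the commit completes. */
static void log_write_done(struct meta_buffer *b)
{
    b->pinned = false;   /* the buffer may now go to its real disk location */
}

/* The buffer cache may only write back unpinned, unlocked dirty buffers. */
static bool can_flush(const struct meta_buffer *b)
{
    return b->dirty && !b->pinned && !b->locked;
}

int main(void)
{
    struct meta_buffer b = { .locked = true, .pinned = false, .dirty = false };

    transaction_commit(&b);
    printf("flushable right after commit? %d\n", can_flush(&b));  /* 0 */
    log_write_done(&b);
    printf("flushable after log write?    %d\n", can_flush(&b));  /* 1 */
    return 0;
}
</pre>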
<p>
XFS gains two things by writing the log asynchronously.
First, multiple updates can be batched into a single
log write. This increases the efficiency of the log
writes with respect to the underlying disk array
[Hagmann87, Rosenblum92]. Second, the performance
of metadata updates is normally made independent of
the speed of the underlying drives. This independence is
limited by the amount of buffering dedicated to the log,
but it is far better than the synchronous updates of older
file systems.
<pre>
</pre>
<h3>
Using a Separate Log Device
</h3>
<p>
Under very intense metadata update workloads, the performance
of the updates can still become limited by the
speed at which the log buffers can be written to disk.
This occurs when updates are being written into the
buffers faster than the buffers can be written into the
log. For these cases, XFS allows the log to be placed on
a separate device from the rest of the file system. It can
be stored on a dedicated disk or non-volatile memory
device. Using non-volatile memory devices for the
transaction log has proven very effective in high end
OLTP systems [Dimino94]. It can be especially useful
with XFS on an NFS server, where updates must be synchronous,
in both increasing the throughput and
decreasing the latency of metadata update operations.
<pre>
</pre>
<h3>
Exploiting Parallelism
</h3>
<p>
XFS is designed to run well on large scale shared memory
multiprocessors. In order to support the parallelism
of such a machine, XFS has only one centralized
resource: the transaction log. All other resources in the
file system are made independent either across allocation
groups or across individual inodes. This allows
inodes and blocks to be allocated and freed in parallel
throughout the file system.
<p>
The transaction log is the most contentious resource in
XFS. All updates to the XFS metadata pass through the
log. However, the job of the log manager is very simple.
It provides buffer space into which transactions can
copy their updates, it writes those updates out to disk,
and it notifies the transactions when the log writes complete.
The copying of data into the log is easily parallelized
by making the processor performing the transaction
do the copy. As long as the log can be written fast
enough to keep up with the transaction load, the fact that
it is centralized is not a problem. However, under workloads
which modify large amounts of metadata without
pausing to do anything else, like a program constantly
linking and unlinking a file in a directory, the metadata
update rate will be limited to the speed at which we can
write the log to disk.
<pre>
</pre>
<h3>
7. Experience and Performance Results
</h3>
<p>
In this section we present results demonstrating the scalability
and performance of the XFS file system. These
results are not meant as a rigorous investigation of the
performance of XFS, but only as a demonstration of
XFS's capabilities. We are continuing to measure and
improve the performance of XFS as development of the
file system proceeds.
<pre>
</pre>
<h3>
7.1. I/O Throughput Test Results
</h3>
<p>
Figures 2 and 3 contain the results of some I/O throughput
tests run on a raw volume, XFS, and EFS. The results
<a href="figure2.gif"><img alt="Figure 2. Read Throughput."
align="left" vspace="10" hspace="20"
width="295" height="358" src="figure2z.gif"></a>
come from a test which measures the rate at which we
can write a previously empty file (create), read it back (read),
and overwrite the existing file (write). The number of drives
over which the underlying volume is striped ranges
from 3 to 57 in the test. The test system is an 8 CPU
Challenge with 512 megabytes of memory. The test is run with
three disks per SCSI channel, and each disk is capable
of reading data sequentially at approximately 7 MB/sec
and writing data sequentially at approximately 5.5
MB/sec. All tests are run on newly created file systems in
order to measure the optimal performance of the file
systems. All EFS and XFS tests use direct I/O and
large I/O requests, and the tests with multiple threads
use the IRIX asynchronous I/O library
with the given number of threads. Measurements for multiple,
asynchronous threads with EFS
are not given, because the performance of EFS with multiple
threads is the same as or worse than with one thread due to its
single threaded (per file) I/O path.
The test files are sized at approximately 30 megabytes per
disk in the volume, and for the raw volume tests we write the same
amount of data to the volume itself.
The stripe unit for the
volumes is 84 kilobytes for the single threaded cases and 256
kilobytes for the multi-threaded cases. We have found these
stripe units to provide the best performance for each of
the cases in our experimentation.
<p>
<a href="figure3.gif"><img alt="Figure 3. Write/Create Throughput."
align="left" vspace="10" hspace="20"
width="295" height="358" src="figure3z.gif"></a>
We can draw several conclusions from this data. One is
that XFS is capable of reading a file at nearly the full
speed of the underlying volume. We manage to stay
within 5-10% of the raw volume performance in all disk
configurations when using an equivalent number of
asynchronous I/O threads.
Another interesting result is the parity
of the create and write results for XFS versus the large
disparity of the results for EFS. We believe that this
demonstrates the efficiency of the XFS space allocator.
Finally, the benefits of parallel file access are clearly
demonstrated in these results. At the high end this
makes a 55 MB/sec difference in the XFS read results.
For writing and creating files it makes a 125 MB/sec difference.
This is entirely because the parallel cases are
capable of pipelining the drives with requests to keep
them constantly busy whereas the single threaded cases
are not.
<pre>
</pre>
<h3>
7.2. Database Sort Benchmark Results
</h3>
<p>
Using XFS, SGI recently achieved record
breaking performance on the Datamation sort [Anon85] and Indy
MinuteSort [Nyberg94] benchmarks. The Datamation sort benchmark
measures how fast the system can sort 100 megabytes of
100 byte records. The MinuteSort benchmark measures
how much data the system can sort in one minute. This
includes start-up, reading the data in from disk, sorting
it in memory, and writing the sorted data back out to
disk. On a 12 CPU 200 MHz Challenge system with 2.25
gigabytes of memory and a striped volume of 96 disk
drives, we performed the Datamation sort in 3.52 seconds
and sorted 1.6 gigabytes of data in 56 seconds for
the MinuteSort. The previous records of 7 seconds and
1.08 gigabytes of data were achieved on a DEC Alpha
system running VMS.
<p>
Achieving this level of results requires high memory
bandwidth, high file system and I/O bandwidth, scalable
multiprocessing, and a sophisticated multiprocessing
sort package. The key contribution of XFS to these
results is the ability to create and read files at 170 MB/sec.
This actually moved the bottleneck in the system
from the file system to the allocation of zeroed pages for
the sort processes.
<pre>
</pre>
<h3>
7.3. LADDIS Benchmark Results
</h3>
<p>
The results of the SPEC SFS (a.k.a. LADDIS) benchmark
using XFS are encouraging as well. On a 12 CPU
250 MHz Challenge XL with 1 gigabyte of memory, 4
FDDI networks, 16 SCSI channels, and 121 disks, we
achieved a maximum throughput of 8806 SPECnfs
operations per second. While XFS plays only a part in
achieving such outstanding performance, these results
exceed our previous results using the EFS file
system. On a slightly less powerful machine using EFS,
we originally reported a result of 7023 SPECnfs operations
per second. We estimate that the difference in
hardware accounts for approximately 800 of the operations,
leaving XFS approximately 1000 operations per
second ahead of EFS. The difference is that EFS
achieves 65 operations per second per disk, while XFS
achieves 73 operations per second per disk. While this
12% increase might not seem like much, the LADDIS
workload is dominated by small, synchronous writes,
whose performance is often very difficult to improve
without better disk hardware.
We believe that the improvement with XFS is
the result of the high performance directory structures,
better file allocations, and synchronous metadata update
batching of the transaction log provided by XFS.
<pre>
</pre>
<h3>
7.4. Directory Lookups in Large Directories
</h3>
<p>
<table border width="40%" align="left" vspace="10" hspace="30">
<tr valign="top">
<th><b>Directory <br>entries</b></th> <th><b>EFS</b></th> <th><b>XFS</b></th>
</tr>
<tr align="center">
<td align="right">100 </td> <td align="right">5,970 </td> <td align="right">8,972 </td>
</tr>
<tr align="center">
<td align="right">500 </td> <td align="right">2,833 </td> <td align="right">6,600 </td>
</tr>
<tr align="center">
<td align="right">1,000 </td> <td align="right">1,596 </td> <td align="right">7,089 </td>
</tr>
<tr align="center">
<td align="right">10,000 </td> <td align="right">169 </td> <td align="right">6,716 </td>
</tr>
<tr align="center">
<td align="right">30,000 </td> <td align="right">43 </td> <td align="right">6,522 </td>
</tr>
<tr align="center">
<td align="right">50,000 </td> <td align="right">27 </td> <td align="right">6,497 </td>
</tr>
<tr align="center">
<td align="right">70,000 </td> <td align="right">- </td> <td align="right">5,661 </td>
</tr>
<tr align="center">
<td align="right">90,000 </td> <td align="right">- </td> <td align="right">5,497 </td>
</tr>
<tr align="center">
<td align="right">150,000 </td> <td align="right">- </td> <td align="right">177 </td>
</tr>
<tr align="center">
<td align="right">250,000 </td> <td align="right">- </td> <td align="right">102 </td>
</tr>
<tr align="center">
<td align="right">350,000 </td> <td align="right">- </td> <td align="right">90 </td>
</tr>
<tr align="center">
<td align="right">450,000 </td> <td align="right">- </td> <td align="right">79 </td>
</tr>
<tr align="center">
<td align="right">1,000,000 </td> <td align="right">- </td> <td align="right">66 </td>
</tr>
<tr>
<th colspan="3"><b>Figure 4. Lookup Operations Per Second</b></th>
</tr>
</table>
Figure 4 contains the results of a test measuring the
performance of random lookups in directories of various
sizes for EFS and XFS. The results are the average of
several iterations of the test. The test machine has 4
CPUs and 128 megabytes of memory. Each file system
was created on a single 2 gigabyte disk with nothing else
on it. To make sure that we are measuring the performance
of the file system directory structures, the test is run with
the directory name lookup cache turned off. Also, the entries
in the directories are all links to just a few real files,
with 20,000 links per real file. The test performs lookups
using the stat(2) system call, so making most of the entries
links to just a few files removes the size of the inode cache
as a source of variability in the test.
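<p>
As a rough illustration of this workload (not the actual test program),
the sketch below simply calls stat(2) on randomly chosen names in one
large directory. It omits the setup described above, such as creating
the hard links and disabling the name lookup cache, and the directory
path, name format, and counts are illustrative assumptions.
<pre>
/*
 * Sketch only: random lookups in a single large directory via stat(2),
 * in the spirit of the Figure 4 test.  The directory path, the name
 * format "entryNNNNNNN", and the counts are illustrative assumptions.
 */
#include &lt;sys/stat.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;time.h&gt;

#define NUM_ENTRIES 1000000   /* entries assumed to exist in the directory */
#define NUM_LOOKUPS 100000    /* random lookups to perform                 */

int main(void)
{
    char path[256];
    struct stat st;
    long found = 0;

    srand((unsigned)time(NULL));
    for (long i = 0; i &lt; NUM_LOOKUPS; i++) {
        /* Each lookup of a random name forces a directory search; with
         * the name lookup cache off, that search is what is measured.  */
        snprintf(path, sizeof(path), "/xfs/bigdir/entry%07d",
                 rand() % NUM_ENTRIES);
        if (stat(path, &amp;st) == 0)
            found++;
    }
    printf("%ld of %d lookups succeeded\n", found, NUM_LOOKUPS);
    return 0;
}
</pre>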
<p>
It is clear from this test that lookups in medium to large
directories are much more efficient using XFS. EFS uses
a linear directory format similar to that used by BSD
FFS. It degrades severely between 1,000 and 10,000
entries, at which point the test is entirely CPU bound
scanning the cached file blocks for the entries being
looked up. For XFS, the test is entirely CPU bound, but
still very fast, until the size of the directory overflows
the number of blocks that can be cached in memory. While there
is a large amount of memory in the machine, only a limited portion
of it can be used to cache directory blocks due to limitations
of the IRIX metadata block cache. At the point where we overflow
the cache, the interior nodes of the directory B+ tree are
still cached, but most leaf nodes in the tree need to be
read in from disk when they are accessed. This reduces
the performance of the test to the performance of directory
block sized I/O operations to the single underlying
disk drive. The performance continues to degrade as the
directory size increases most likely because the effectiveness
of the leaf block caching continues to decrease as the
directory grows.
<pre>
</pre>
<h3>
8. Conclusion
</h3>
<p>
The main idea behind the design of XFS is very simple:
think big. This idea drives the requirements for large file
systems, large files, large numbers of files, large directories,
and large I/O that are addressed in the design and
implementation of XFS. We believe that by satisfying
these requirements, XFS will satisfy the needs of the next
generation of applications and systems, so that we will not
be back where we are today in just a few years.
<p>
The mechanisms in XFS for satisfying the requirements
of big systems also make it a high performance general
purpose file system. The pervasive use of B+ trees
throughout the file system reduces many of the algorithms
in the file system from linear to logarithmic. The
use of asynchronous transaction logging eliminates
many of the metadata update performance problems in
previous file systems. Also, the use of delayed allocation
improves the performance of all file allocations,
especially those of small files. XFS is designed to perform
well on both the desktop and the server, and it is
this focus on scalability that distinguishes XFS from the
rest of the file system crowd.
<pre>
</pre>
<h3>
9. Acknowledgments
</h3>
<p>
We would like to thank John Ousterhout, Bob Gray, and Ray Chen
for their help in reviewing and improving this paper;
Chuck Bullis, Ray Chen, Tin Le, James Leong, Jim Orosz, Tom Phelan,
and Supriya Wickrematillake,
the other members of the XFS/XLV team, for helping to make XFS and XLV
real, commercial products; and Larry McVoy for his magic
troff incantations that made this paper presentable.
<pre>
</pre>
<h3>
10. References
</h3>
<p>
[Anon85] Anonymous, "A Measure of Transaction Processing Power,"
Datamation, Vol. 31, No. 7, 112-118.
<p>
[Baker91] Baker, M., Hartman, J., Kupfer, M., Shirriff,
K., Ousterhout, J., "Measurements of a Distributed File
System," Proceedings of the 13th Symposium on Operating
System Principles, Pacific Grove, CA, October
1991, 192-212.
<p>
[Chutani92] Chutani, S., Anderson, O., et al., "The Episode
File System," Proceedings of the 1992 Winter
Usenix, San Francisco, CA, 1992, 43-60.
<p>
[Comer79] Comer, D., "The Ubiquitous B-Tree," Computing Surveys,
Vol. 11, No. 2, June 1979, 121-137.
<p>
[Dimino94] Dimino, L., Mediouni, R., Rengarajan, T.,
Rubino, M., Spiro, P., "Performance of DEC Rdb Version
6.0 on AXP Systems," Digital Technical Journal,
Vol. 6, No. 1, Winter 1994, 23-35.
<p>
[Ganger94] Ganger, G., Patt, Y., "Metadata Update Performance
in File Systems," Proceedings of the First
Usenix Symposium on Operating System Design and
Implementation, Monterey, CA, November, 1994, 49-60.
<p>
[Hagmann87] Hagmann, R., "Reimplementing the
Cedar File System Using Logging and Group Commit,"
Proceedings of the 10th Symposium on Operating System
Principles, November, 1987.
<p>
[Hisgen93] Hisgen, A., Birrell, A., Jerian, C., Mann, T.,
Swart, G., "New-Value Logging in the Echo Replicated
File System," Research Report 104, Systems Research
Center, Digital Equipment Corporation, 1993.
<p>
[Hitz94] Hitz, D., Lau, J., Malcolm, M., "File System
Design for an NFS File Server Appliance," Proceedings
of the 1994 Winter Usenix, San Francisco, CA, 1994,
235-246.
<p>
[Kleiman86] Kleiman, S., "Vnodes: An Architecture for
Multiple File System Types in Sun Unix," Proceedings
of the 1986 Summer Usenix, Summer 1986.
<p>
[McKusick84] McKusick, M., Joy, W., Leffler, S.,
Fabry, R. "A Fast File System for UNIX," ACM Transactions
on Computer Systems Vol. 2, No. 3, August 1984, 181-197.
<p>
[McVoy90] McVoy, L., Kleiman, S., "Extent-like Performance
from a UNIX File System," Proceedings of
the 1991 Winter Usenix, Dallas, Texas, January 1991, 33-43.
<p>
[Nyberg94] Nyberg, C., Barclay, T., Cvetanovic, Z., Gray, J.,
Lomet, D., "AlphaSort: A RISC Machine Sort," Proceedings of the
1994 SIGMOD International Conference on Management of Data,
Minneapolis, 1994.
<p>
[Ousterhout85] Ousterhout, J., Da Costa, H., Harrison,
D., Kunze, J., Kupfer, M., Thompson, J., "A Trace-Driven
Analysis of the UNIX 4.2 BSD File System,"
Proceedings of the 10th Symposium on Operating System
Principles, Orcas Island, WA, December 1985, 15-24.
<p>
[Ousterhout90] Ousterhout, J. "Why Aren't Operating
Systems Getting Faster As Fast as Hardware?" Proceedings
of the 1990 Summer Usenix, Anaheim, CA, June,
1990, 247-256.
<p>
[Rosenblum92] Rosenblum, M., Ousterhout, J., "The
Design and Implementation of a Log-Structured File
System," ACM Transactions on Computer Systems Vol 10,
No. 1, February 1992, 26-52.
<p>
[Sandberg85] Sandberg, R., et al., "Design and Implementation
of the Sun Network File System," Proceedings
of the 1985 Summer Usenix, June, 1985, 119-130.
<p>
[Seltzer95] Seltzer, M., Smith, K., Balakrishnan, H., Chang, J.,
McMains, S., Padmanabhan, V., "File System Logging Versus
Clustering: A Performance Comparison," Proceedings of the 1995
Usenix Technical Conference, January 1995, 249-264.
<p>
[SGI92] IRIX Advanced Site and Server Administration
Guide, Silicon Graphics, Inc., Chapter 8, 241-288.
<p>
[Veritas95] Veritas Software, http://www.veritas.com
<p>
<b>Adam Sweeney</b>, <b>Doug Doucette</b>, <b>Wei Hu</b>,
<b>Curtis Anderson</b>, <b>Michael Nishimoto</b>, and
<b>Geoff Peck</b> are all members of the Server Technology
group at SGI. Adam went to Stanford, Doug to
NYU and Berkeley, Wei to MIT, Curtis to Cal Poly, Michael
to Berkeley and Stanford, and Geoff to Harvard and
Berkeley. None of them holds a Ph.D. Altogether they
have worked at somewhere around 27 companies, on projects
including secure operating systems, distributed operating
systems, fault tolerant systems, and plain old Unix
systems. None of them intends to make a career out of
building file systems, but they all enjoyed building one.
<a name="fn1">
<pre>
</pre>
</a>
<hr size="3" align="left" width="30%">
<ol>
<li>
This paper was presented at the January 1996 USENIX conference in
San Diego, California. Information on future plans and performance
data in this paper are not to be construed as commitments by SGI.
</ol>
<pre>
</pre>
</body>
</html>