Re: howto keep xfs directory searches fast for a long time

To: Linux fs XFS <xfs@xxxxxxxxxxx>
Subject: Re: howto keep xfs directory searches fast for a long time
From: pg_xf2@xxxxxxxxxxxxxxxxxx (Peter Grandi)
Date: Sun, 12 Aug 2012 20:05:42 +0100
In-reply-to: <6344220.LKveJofnHA@saturn>
References: <6344220.LKveJofnHA@saturn>
Resent-date: Sun, 12 Aug 2012 23:41:59 +0100
Resent-from: pg_mh@xxxxxxxxxxxxxx
Resent-message-id: <20520.12624.397768.450417@xxxxxxxxxxxxxxxxxx>
Resent-to: xfs@xxxxxxxxxxx
> I need a VMware VM that has 8TB storage. As I can at max
> create a 2TB disk, I need to add 4 disks, and use lvm to
> concat these. All is on top of a RAID5 or RAID6 store.

Ah the usual goal of a single large storage pool for cheap.

> The workload will be storage of mostly large media files (5TB
> mkv Video + 1TB mp3), plus backup of normal documents (1TB
> .odt,.doc,.pdf etc).

Probably 2-3MB per MP3 thus 300-500k MP3, 1-2MB per document.
For videos hard to guess how long, but picking an arbitrary
number with 100MB videos it would be around 50,000 files.

Overall 1 million files, still within plausibility.
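
As a rough sketch of the arithmetic above: the average sizes are the
guesses already stated (plus the 100MB placeholder for videos), taken
in decimal units, and `count_range` is a helper name made up for this
illustration. None of this is measured data.

```python
# Rough file-count estimate from guessed average file sizes.
# All sizes are assumptions (decimal units), not measurements.
TB = 10**12
MB = 10**6


def count_range(total_bytes, min_avg, max_avg):
    """Return (fewest, most) files that fit for a range of average sizes."""
    return total_bytes // max_avg, total_bytes // min_avg


mp3 = count_range(1 * TB, 2 * MB, 3 * MB)        # 2-3MB per MP3
docs = count_range(1 * TB, 1 * MB, 2 * MB)       # 1-2MB per document
video = count_range(5 * TB, 100 * MB, 100 * MB)  # arbitrary 100MB per video

low = mp3[0] + docs[0] + video[0]
high = mp3[1] + docs[1] + video[1]
print(f"MP3: {mp3}, documents: {docs}, videos: {video}")
print(f"total: between {low:,} and {high:,} files")
```

With these guesses the total lands somewhere around a million files;
changing the assumed averages moves it a lot.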

> The server should be able to find files quickly, transfer speed
> is not important. There won't be many deletes to media files,
> mostly uploads and searching for files. Only when it grows
> full, old files will be removed. But normal documents will be
> rsynced (used as backup destination) regularly.

> I will set vm.vfs_cache_pressure = 10, this helps at least
> keeping inodes cached when they were read once.

That may be a workaround (see below) in your specific case to
the default answer to this question:

> - What is the best setup to get high speed on directory
>   searches? Find, ls, du, etc. should be quick.

None. If you are thinking of inode-accessing searches, they just
won't be fast on large filetrees where accesses need long strokes.

  Note: in principle 'find' and 'ls' need not access inodes,
  as they could be just about names, but most uses of 'find' and
  'ls' do access inode fields. 'du' obviously does, and so does
  'rsync' which you intend to use for backups.
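
To make the distinction in the note concrete, here is a minimal Python
sketch. The function names are made up for the illustration. The first
function only reads directory entries; the second also asks for a size,
which on Unix means a stat call per entry and therefore a read of each
inode unless it is already cached.

```python
import os


def names_only(path):
    """List entry names; needs only the directory contents."""
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries)


def names_and_sizes(path):
    """List names with sizes; each size needs a stat of that entry's inode."""
    with os.scandir(path) as entries:
        return sorted(
            (entry.name, entry.stat(follow_symlinks=False).st_size)
            for entry in entries
        )
```

The second form is roughly what 'ls -l', 'du' and 'rsync' have to do
for every file, which is why they are the seek-bound cases.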

Especially as most filesystems, including XFS (in at least some
versions and configurations) aim to keep metadata (directories,
inodes) close to file data rather than each other, because that
is what typical workloads are supposed to require.

Perhaps you could change the intended storage layer to favour
clustering of metadata, however difficult it is to get
filesystems to go against the grain of their usual design.

> - Should I use inode64 or not?

That's a very difficult question, as 'inode64' has two different
effects:

* Allows inodes to be stored beyond the first 1TiB (with 512B
  sectors) of the filetree space.
* Distributes directories across AGs, and attempts to put
  *data* in the same AG as the directory they are linked from.

In your case perhaps it is best not to distribute directories
across AGs, and to keep all inodes in the first 1TiB. But it is a
very difficult tradeoff as you may run out of space for inodes in
the first 1TiB, even if you don't have that many inodes.

> - If that's an 8 disk RAID-6, should I mkfs.xfs with 6*4 AGs?
>   Or what would be a good start, or wouldn't it matter at all?

Difficult to say ahead of time. RAID6 can be a very bad choice
for metadata intensive accesses, but only for updating the
metadata, and it seems that there won't be a lot of that in your
workload.

> And as it'll be mostly big media files, should I use
> sunit/swidth set to 64KB/6*64KB, does that make sense?

Whatever the size of files, 'sw' should be the size of the RMW
block of the blockdevice containing the filesystem, and 'su'
should be the size of contiguous data on each member blockdevice.
The difficult question is the best '--chunksize' for the RAID set,
and that depends a lot on how multithreaded and random the workload
is.
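
As a sketch of the geometry for the setup asked about, assuming an
8-disk RAID6 (6 data disks) whose chunk size really is 64KiB and is
what the guest ends up sitting on: 'mkfs.xfs' takes 'su' as a byte size
and 'sw' as a multiple of 'su' (so the full stripe, the RMW block, is
su * sw), while 'sunit'/'swidth' express the same two sizes in 512-byte
units. The helper name `xfs_stripe_options` is invented for this
example.

```python
def xfs_stripe_options(total_disks, parity_disks, chunk_kib):
    """Derive mkfs.xfs stripe settings from an assumed RAID geometry."""
    data_disks = total_disks - parity_disks
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    su_bytes = chunk_kib * 1024
    return {
        "su": f"{chunk_kib}k",                   # contiguous data per member
        "sw": data_disks,                        # stripe width as multiple of su
        "sunit": su_bytes // 512,                # same, in 512-byte units
        "swidth": su_bytes * data_disks // 512,  # full stripe, 512-byte units
        "full_stripe_kib": chunk_kib * data_disks,
    }


opts = xfs_stripe_options(total_disks=8, parity_disks=2, chunk_kib=64)
print(f"mkfs.xfs -d su={opts['su']},sw={opts['sw']} <device>")
print(f"full stripe (RMW block): {opts['full_stripe_kib']} KiB")
```

If the real chunk size of the RAID set differs, or is hidden behind the
virtual disks, these numbers are only as good as the assumption.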

> I'm asking because I had such a VM setup once, and while it
> was fairly quick in the beginning, over time it felt much
> slower on traversing directories, very seek bound.

The definition of a database is something like "a set of data
whose working set cannot be cached in memory". If you want to
store a database consider using a DBMS. But perhaps your
(meta)data set can be cached in memory, see below.

It may be worthwhile to consider a few large directories as in XFS
they are implemented as fairly decent trees, for random access,
but large directories don't work so well for linear scans (inode
enumeration issues), especially with apps that are not careful.

Also depending on filesystem used and parameters, things get
slower with time the more space is used in a partition, because
most filesystems tend to allocate in clumps, starting with the
low address blocks on the outer tracks, thus implicitly short
stroking the block device at the beginning.

> That xfs was only 80% filled, so shouldn't have had a
> fragmentation problem.

Perhaps 80% is not enough for fragmentation of file contents,
but it can be a big issue for keeping metadata together.

> And I know nothing to fix that apart from backup/restore, so
> maybe there's something to prevent that?

No. Even backup/restore may not be good enough once the filetree
block device has filled up and accesses often need long strokes.

Filesystems are designed for "average" performance on "average"
workloads more than peak performance on custom workloads, no
matter the commitment to denial of so many posters to this list.
In your case you are trying to bend a filesystem aimed at
high parallel throughput over large sequential streams into doing
low latency access to widely scattered small metadata...

Given your requirements it might be better for you to have a
filesystem that clusters all metadata together and far away from
the data it describes, as your 1M inodes might take all together
around 1GiB of space.

Or you could implement a pre-service phase where all inodes are
scanned at system startup (I think it would be best with 'du'),
and then ensure that they rarely get written back to storage
(which by default XFS rarely does as in effect it defaults to
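
A pre-service scan of the kind described could also be sketched in
Python, as an alternative to running 'du' at startup. This is only an
illustration: `warm_inode_cache` is a made-up name, and whether the
inodes then stay cached depends on available RAM and on the cache
pressure setting.

```python
import os


def warm_inode_cache(root):
    """Walk `root` and lstat every entry so its inode is read once.

    Returns (entries_seen, errors). Nothing on disk is modified.
    """
    seen = 0
    errors = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                seen += 1
            except OSError:
                # Entry vanished or is unreadable; count it and continue.
                errors += 1
    return seen, errors

# Example call from a boot-time script (the path is hypothetical):
#     seen, errors = warm_inode_cache("/srv/media")
```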

For example on my laptop I have two filetrees with around 700,000
inodes, and with 4GiB of RAM when I 'rsync' either of them for
backups, further passes cause almost no disk IO, because that many
inodes do get cached. These are some lines from 'slabtop' after
such an 'rsync':

665193 665193 100%    0.94K  39129       17    626064K xfs_inode
601377 601377 100%    0.19K  28637       21    114548K dentry

This is cheating, because it uses the in-memory inode and dentry
caches as a DBMS, but in your case you might get away with cheating.
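
The 'slabtop' lines give a way to guess how much RAM such cheating
needs. The sketch below takes the object counts and cache sizes from
those two lines and scales them to 1M files. It assumes one dentry per
inode and the same slab sizes as on that laptop, which may not hold for
other kernels or XFS versions.

```python
# Figures copied from the two 'slabtop' lines: (objects, cache size in KiB).
XFS_INODE = (665193, 626064)
DENTRY = (601377, 114548)


def kib_per_object(objects, cache_kib):
    """Average slab memory used per cached object, in KiB."""
    return cache_kib / objects


def cache_estimate_gib(n_files):
    """Estimated RAM to keep n_files inodes plus dentries cached, in GiB."""
    per_file_kib = kib_per_object(*XFS_INODE) + kib_per_object(*DENTRY)
    return n_files * per_file_kib / (1024 * 1024)


print(f"{kib_per_object(*XFS_INODE):.2f} KiB per cached inode")
print(f"{kib_per_object(*DENTRY):.2f} KiB per cached dentry")
print(f"about {cache_estimate_gib(1_000_000):.2f} GiB for 1M files")
```

That comes to a little over 1GiB for a million files, in line with the
rough 1GiB figure mentioned earlier.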

Setting 'vm/vfs_cache_pressure=0' might even be a sensible option
as the number of inodes in your situation has an upper bound
which is likely to be below the maximum RAM you can give to your
VM.

Finally I am rather perplexed when a VM and SAN are used in a
situation where performance, and in particular low latency disk
and network access, is important. VMs perform well for CPU bound
loads, not so well for network loads, and even less for IO loads,
and even less when latency matters more than throughput.
