On Wed, 2002-04-24 at 13:29, Thor Lancelot Simon wrote:
> On Wed, Apr 24, 2002 at 09:44:11AM -0500, Steve Lord wrote:
> > Very interesting, I will take a look at this some more. One initial
> > comment is that optimizing for bonnie is not necessarily the correct
> > thing to do - not many real world loads create thousands of files and
> > then immediately delete them. Plus, once you are on a raid device,
> > logical and physical closeness on the volume no longer mean very much.
> >
> > Having said that we need to think some more about the underlying
> > allocation policy of inodes vs file data here.
>
> There's been some amount of academic research on this with LFS (both the
> BSD and Sprite variants), which suffers particularly badly from this
> problem because inode rewrites can cause inodes to migrate away from their
> related data blocks in the log over time. One interesting result is that
> the original Sprite policy, with all inodes stored at the head of the disk,
> isn't nearly as bad as you'd think; it keeps the seek time down and the
> inodes end up pretty much pinned in the cache anyway. This isn't too far
> off "allocate inodes from the front of the allocation group", I think. To
> allocate them at opposite ends, as the average percentage of free space in
> filesystems with large numbers of inodes grows (as I believe research would
> currently show to be the case, due to increasing per-spindle capacity esp.
> when compared to per-spindle seek performance) probably is about as bad as
> you can get; allocation groups will *not* fill up, so this really does
> maximize seek time in the case in which the inode is not in the cache. On
> the other hand, if you assume that allocation groups *will* fill up, and
> are willing to make the additional assumption that data written last is
> most likely to be read (questionable, I think, in general, but true for some
> database workloads) then as the group fills up, the data blocks you read
> most frequently will turn out to be closest to the inodes you need to get at
> them. However, in this case the inode is almost sure to be in cache, no?
>
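To make the seek argument above concrete, here is a toy model of a
single allocation group, with data filling the group from the front:
it compares the average inode-to-data seek when inodes sit at the
front of the group against inodes kept at the opposite end. The AG
size and fill fractions are made up; this is only an illustration of
the point, not anything resembling real allocator code.

/*
 * Toy model of inode-to-data seek distance within one allocation
 * group.  Data blocks fill the AG from block 0 upward to a fraction
 * "fill"; policy A keeps inodes at the front of the AG, policy B at
 * the far end.  With a mostly empty group, policy B's expected seek
 * approaches the full width of the AG.
 */
#include <stdio.h>

#define AG_BLOCKS 1000000.0     /* blocks per allocation group (made up) */

int main(void)
{
        double fills[] = { 0.05, 0.25, 0.50, 0.90 };
        int i;

        for (i = 0; i < 4; i++) {
                double f = fills[i];
                /* the average data block sits halfway into the filled region */
                double front_seek = f * AG_BLOCKS / 2.0;
                double far_seek   = AG_BLOCKS - f * AG_BLOCKS / 2.0;

                printf("fill %3.0f%%: front-of-AG seek ~%8.0f blocks, "
                       "far-end seek ~%8.0f blocks\n",
                       f * 100.0, front_seek, far_seek);
        }
        return 0;
}
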
> Another interesting take on it is to think about locality of reference
> from an LFS-like temporal point of view. If inode blocks were simply
> allocated with no constraint other than that they be in the same allocation
> group as the first data block of the file, in the absence of fragmentation
> inode and data blocks that were written at about the same time would tend
> to be in about the same part of the disk -- indeed, since inodes in XFS
> are allocated in 64K extents (aren't they?) you could turn the allocation
> policy on its head and say "put the data block as close as possible to
> the extent with the inode in it". This would produce the effect that in a
> filesystem with many small files written at the same time, you'd get
> temporal locality of reference on reads, even across multiple files -- which
> the LFS work shows to be quite good: data written at the same time generally
> is read at the same time. I believe NetApp's WAFL does this, as well: pick
> where the metadata goes, then use that to place the data. Of course, there
> are pathological cases for this kind of filesystem structure, too, but the
> existence of the allocation groups, and the presence of read caching, would
> at least reduce them.
>
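The "put the data block as close as possible to the extent with the
inode in it" policy can be sketched as a nearest-fit search over an
allocation group's free extents. The types and names below
(free_extent_t, alloc_near_inode) are invented for illustration; a
real allocator would index free space by block number and by size
with btrees rather than walk a list.

/*
 * Rough sketch of "place the data as close as possible to the extent
 * holding the inode".  Everything here is hypothetical, not XFS
 * internals.
 */
#include <stddef.h>

typedef unsigned long long fsblock_t;

typedef struct free_extent {
        fsblock_t               start;  /* first free block of the extent */
        fsblock_t               len;    /* length in blocks */
        struct free_extent      *next;
} free_extent_t;

static fsblock_t absdiff(fsblock_t a, fsblock_t b)
{
        return a > b ? a - b : b - a;
}

/*
 * Pick the free extent of at least "want" blocks whose start is
 * closest to the block containing the file's inode chunk, so data
 * written at about the same time as the inode lands near it on disk.
 */
free_extent_t *alloc_near_inode(free_extent_t *free_list,
                                fsblock_t inode_chunk_block,
                                fsblock_t want)
{
        free_extent_t *best = NULL;
        free_extent_t *fe;

        for (fe = free_list; fe != NULL; fe = fe->next) {
                if (fe->len < want)
                        continue;
                if (best == NULL ||
                    absdiff(fe->start, inode_chunk_block) <
                    absdiff(best->start, inode_chunk_block))
                        best = fe;
        }
        return best;            /* NULL if nothing big enough is free */
}
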
> I'm not sure how clear the above was (and maybe anything of sense in it
> was already obvious to you) but it seemed like it might be worth pointing
> out.
>
> Thor

I am not ignoring this thread, I just have a few dozen other things I
need to get done first. XFS is supposed to put file data near inode
data (by default, inodes come in chunks of two filesystem blocks,
which is 8K on Linux, or 32 inodes). We need to study that code path
and make sure it is behaving as designed.
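As a rough illustration only (the helper name and the layout constants
below are made up for the example, not the real XFS macros), that
placement decision boils down to deriving the block holding the file's
inode chunk from the inode number and handing it to the allocator as
the target block:

#include <stdio.h>

typedef unsigned long long ino_sketch_t;
typedef unsigned long long fsblock_sketch_t;

#define INODES_PER_BLOCK_LOG    4       /* 16 x 256-byte inodes per 4K block (example) */
#define BLOCKS_PER_AG_LOG       20      /* 2^20 blocks per allocation group (example) */

/* Block (in a flat block numbering) holding this inode's chunk. */
static fsblock_sketch_t inode_to_fsblock(ino_sketch_t ino)
{
        return ino >> INODES_PER_BLOCK_LOG;
}

int main(void)
{
        ino_sketch_t ino = 123456;
        fsblock_sketch_t target = inode_to_fsblock(ino);

        printf("inode %llu: aim the data allocation at block %llu "
               "(AG %llu, block %llu within that AG)\n",
               ino, target,
               target >> BLOCKS_PER_AG_LOG,
               target & ((1ULL << BLOCKS_PER_AG_LOG) - 1));
        return 0;
}
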
Steve
--
Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: lord@xxxxxxx