I/O hang, possibly XFS, possibly general
Phil Karn
karn at philkarn.net
Thu Jun 2 21:11:15 CDT 2011
On Thu, Jun 2, 2011 at 5:39 PM, Dave Chinner <david at fromorbit.com> wrote:
>
> You're ignoring the fact that delayed allocation effectively does
> this for you without needing to physically allocate the blocks.
> So when you have files that are short lived, you don't actually do
> any allocation at all. Further, delayed allocation results in
> allocation order according to writeback order rather than write()
> order, so I/O patterns are much nicer when using delayed allocation.
>
Oh, I'm well aware of delayed allocation. I've just noticed that, in my
experience, it doesn't seem to work nearly as well as fallocate(). And why
should it? If you know in advance how big a file you're writing, how can it
hurt to inform your file system? I suppose the FS implementer could always
ignore that information if he felt he could somehow do a better job, but
it's hard to see how. Isn't it always better to know than to guess?
I'm talking here about the genuine fallocate() system call, not the POSIX
hack that falls back to first conventionally writing zeroes over the file.
The true fallocate() call seems very fast, and if your file system doesn't
support it then it will simply fail without harm. I still can't see any
reason not to use it.
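
To make the distinction concrete, here is a minimal sketch of calling the real fallocate() and treating "not supported" as harmless. It assumes Linux with glibc, where off_t is a C long for the plain fallocate symbol. The helper name try_preallocate is made up for illustration. Python's os.posix_fallocate() is deliberately avoided here, since that is the POSIX variant with the zero-writing fallback.

```python
import ctypes
import ctypes.util
import errno
import os

# Assumption: Linux with glibc, where the plain fallocate symbol takes
# off_t as a C long.
_libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_long, ctypes.c_long]
_libc.fallocate.restype = ctypes.c_int


def try_preallocate(fd, size):
    """Ask the file system to reserve `size` bytes starting at offset 0.

    Calls fallocate(2) directly, so there is no zero-writing fallback.
    Returns True on success and False if the file system does not
    support the call; any other error is raised as OSError.
    """
    if _libc.fallocate(fd, 0, 0, size) == 0:
        return True
    err = ctypes.get_errno()
    if err in (errno.EOPNOTSUPP, errno.ENOSYS):
        return False  # unsupported: the caller simply writes as usual
    raise OSError(err, os.strerror(err))
```

A copy tool would call try_preallocate(fd, source_size) right after opening the destination and then write the data normally whether or not the call succeeded.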
I did know that xfs can avoid the disk allocation and writes entirely when
the files are short-lived, but Paul was talking about writing large,
long-lived files so that's what I had in mind. And when I use fallocate(),
my files are not likely to be short-lived either. Like most people I write
the vast majority of my short-lived files to /tmp, which is tmpfs, not xfs.
But you do raise an interesting point -- is there any serious performance
degradation from using fallocate() on a short-lived file? The written data
still lives in the buffer cache for a while, so if you delete the file
before it gets flushed the disk writes will still be avoided. The file
system may have a little extra work to undo the unnecessary allocation but
that doesn't seem to be a big deal.
> Basically you are removing one of the major IO optimisation
> capabilities of XFS by preallocating everything like this.
>
"Remove" it? How is giving it the correct answer worse than letting it guess
-- even if it usually guesses correctly?
I still rely on preallocation to keep log files and mailboxes from getting
too badly fragmented.
> So you don't have any idea of how well XFS minimises fragmentation
> without needing to use preallocation? Sounds like you have a classic
> case of premature optimisation. ;)
>
>
As I said, I've tried it both ways. I found that the simple act of adding
fallocate() to rsync (which I use for practically all copying) vastly
reduces xfs fragmentation. Just as I expected it would.
Maybe I'm a little more sensitive to fragmentation than most because I've
been experimenting with storing SHA1 hashes of all my files in extended
attributes. This grew out of a data deduplication tool; at first I simply
cached the hashes so I wouldn't have to recompute them on another run, but
then I just added them to every file. This lets me get a warm and fuzzy
feeling by periodically verifying that my files haven't been corrupted,
especially now that I've begun to use SSDs with trim tools.
XFS stores both attributes and extent lists directly in the inode when
there's room, and it turns out that a default-sized xfs inode can store my
hashes provided that the extent list is small. So now when I walk through
my file system statting everything I can read the hashes too at absolutely
no extra cost. This makes deduplication really fast.
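
For anyone who wants to try the same thing, a rough sketch of the hash caching might look like the following. The attribute name user.sha1, the choice to store the raw 20-byte digest, and the function names are all assumptions made for illustration, not details of the actual tool.

```python
import hashlib
import os

# Hypothetical attribute name; the post does not say which one was used.
HASH_ATTR = "user.sha1"


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA-1 digest of a file's contents as raw bytes."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()


def store_hash(path):
    """Compute the file's hash and cache it in an extended attribute."""
    digest = sha1_of(path)
    os.setxattr(path, HASH_ATTR, digest)  # raises OSError if xattrs are unsupported
    return digest


def verify_hash(path):
    """Return True or False if a stored hash exists, or None if none can be read."""
    try:
        stored = os.getxattr(path, HASH_ATTR)
    except OSError:
        return None  # attribute missing, or xattrs not supported here
    return stored == sha1_of(path)
```

The raw digest is only 20 bytes, which keeps the attribute as small as possible. Whether it actually stays inside the inode still depends on the inode size and on how long the extent list is.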
I haven't experimented to see how many extents a file can have before the
attributes get pushed out of the inode, but by keeping most everything
contiguous I simply avoid the problem.