
Re: I/O hang, possibly XFS, possibly general

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: I/O hang, possibly XFS, possibly general
From: Phil Karn <karn@xxxxxxxxxxxx>
Date: Thu, 2 Jun 2011 19:11:15 -0700
Cc: Paul Anderson <pha@xxxxxxxxx>, Linux fs XFS <xfs@xxxxxxxxxxx>
In-reply-to: <20110603003907.GW561@dastard>
References: <BANLkTim_BCiKeqi5gY_gXAcmg7JgrgJCxQ@xxxxxxxxxxxxxx> <19943.56524.969126.59978@xxxxxxxxxxxxxxxxxx> <BANLkTim978GhfamN=TEFULP5GdfMu02-7w@xxxxxxxxxxxxxx> <4DE823DD.7060600@xxxxxxxxxxxx> <20110603003907.GW561@dastard>
Reply-to: karn@xxxxxxxx
On Thu, Jun 2, 2011 at 5:39 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:

> You're ignoring the fact that delayed allocation effectively does
> this for you without needing to physically allocate the blocks.
> So when you have files that are short lived, you don't actually do
> any allocation at all. Further, delayed allocation results in
> allocation order according to writeback order rather than write()
> order, so I/O patterns are much nicer when using delayed allocation.

Oh, I'm well aware of delayed allocation. I've just noticed that, in my experience, it doesn't seem to work nearly as well as fallocate(). And why should it? If you know in advance how big a file you're writing, how can it hurt to inform your file system? I suppose the FS implementer could always ignore that information if he felt he could somehow do a better job, but it's hard to see how. Isn't it always better to know than to guess?

I'm talking here about the genuine fallocate() system call, not the posix_fallocate() wrapper, which falls back to conventionally writing zeroes over the file when the file system can't preallocate. The true fallocate() call seems very fast, and if your file system doesn't support it then it will simply fail without harm. I still can't see any reason not to use it.
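To make the "fail without harm" behaviour concrete, here is a minimal sketch of calling the native fallocate() and carrying on if it is unavailable. It assumes Linux and reaches the C library through ctypes; the helper name `try_fallocate` is made up for illustration, and preferring the `fallocate64` symbol is an assumption intended to keep the 64-bit offsets correct on 32-bit glibc.

```python
import ctypes
import os
import tempfile


def try_fallocate(fd: int, length: int) -> bool:
    """Ask the file system to preallocate `length` bytes for `fd`.

    Returns True on success and False if the call is missing or fails
    (for example EOPNOTSUPP on a file system without support). Nothing
    is written to the file in either case.
    """
    try:
        # Assumption: a C library is loadable this way (true on Linux).
        libc = ctypes.CDLL(None, use_errno=True)
    except (OSError, TypeError):
        return False

    # Prefer the explicit 64-bit entry point where the C library exports it.
    native = getattr(libc, "fallocate64", None)
    if native is None:
        native = getattr(libc, "fallocate", None)
    if native is None:
        return False

    # int fallocate(int fd, int mode, off_t offset, off_t len)
    native.argtypes = [ctypes.c_int, ctypes.c_int,
                       ctypes.c_int64, ctypes.c_int64]
    native.restype = ctypes.c_int

    # mode 0: allocate the range and extend the file size if needed.
    return native(fd, 0, 0, length) == 0


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        if try_fallocate(tmp.fileno(), 1 << 20):
            print("preallocated, size is now", os.fstat(tmp.fileno()).st_size)
        else:
            print("no native fallocate here; writing normally instead")
```

The caller writes the file the same way whether or not the call succeeded, so the preallocation is purely a hint.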

I did know that xfs can avoid the disk allocation and writes entirely when the files are short-lived, but Paul was talking about writing large, long-lived files so that's what I had in mind. And when I use fallocate(), my files are not likely to be short-lived either. Like most people I write the vast majority of my short-lived files to /tmp, which is tmpfs, not xfs.

But you do raise an interesting point -- is there any serious performance degradation from using fallocate() on a short-lived file? The written data still lives in the buffer cache for a while, so if you delete the file before it gets flushed the disk writes will still be avoided. The file system may have a little extra work to undo the unnecessary allocation but that doesn't seem to be a big deal.

> Basically you are removing one of the major IO optimisation
> capabilities of XFS by preallocating everything like this.

"Remove" it? How is giving it the correct answer worse than letting it guess -- even if it usually guesses correctly?

I still rely on preallocation to keep log files and mailboxes from getting too badly fragmented.

> So you don't have any idea of how well XFS minimises fragmentation
> without needing to use preallocation? Sounds like you have a classic
> case of premature optimisation. ;)

As I said, I've tried it both ways. I found that the simple act of adding fallocate() to rsync (which I use for practically all copying) vastly reduces xfs fragmentation. Just as I expected it would.
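One way to check a fragmentation claim like this on your own files is to count extents before and after. The post does not say how fragmentation was measured; the sketch below assumes Linux and uses the FIEMAP ioctl with an extent count of zero, which asks the kernel only for the number of extents. The helper name `count_extents` is hypothetical.

```python
import fcntl
import struct
import tempfile

FS_IOC_FIEMAP = 0xC020660B           # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x1               # flush dirty data before mapping
FIEMAP_MAX_OFFSET = 0xFFFFFFFFFFFFFFFF

# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved
FIEMAP_HEADER = "=QQIIII"


def count_extents(path):
    """Return the number of extents backing `path`, or None if the
    file system (or platform) does not support FIEMAP."""
    request = bytearray(struct.pack(
        FIEMAP_HEADER, 0, FIEMAP_MAX_OFFSET, FIEMAP_FLAG_SYNC, 0, 0, 0))
    try:
        with open(path, "rb") as f:
            # fm_extent_count == 0: the kernel fills in fm_mapped_extents
            # with the extent count and returns no extent records.
            fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, request)
    except OSError:
        return None
    return struct.unpack(FIEMAP_HEADER, request)[3]


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 65536)
        tmp.flush()
        print("extents:", count_extents(tmp.name))
```

Copying the same file with and without preallocation and comparing the two counts gives a rough, file-by-file view of the difference.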

Maybe I'm a little more sensitive to fragmentation than most because I've been experimenting with storing SHA1 hashes of all my files in extended attributes. This grew out of a data deduplication tool; at first I simply cached the hashes so I wouldn't have to recompute them on another run, but then I just added them to every file. This lets me get a warm and fuzzy feeling by periodically verifying that my files haven't been corrupted, especially since I began to use SSDs with trim tools.

XFS stores both attributes and extent lists directly in the inode when there's room, and it turns out that a default-sized xfs inode can store my hashes provided that the extent list is small. So now when I walk through my file system statting everything, I can read the hashes too at absolutely no extra cost. This makes deduplication really fast.
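The hash-caching scheme described above might look roughly like the following sketch. It is not the actual tool: the attribute name `user.sha1`, the choice to store the raw 20-byte digest, and the function names are all assumptions made for illustration, and it relies on the Linux-only `os.getxattr` and `os.setxattr` calls.

```python
import hashlib
import os
import tempfile

XATTR_NAME = "user.sha1"   # hypothetical attribute name, not from the post


def compute_sha1(path):
    """Hash the file contents in 1 MiB chunks and return the raw digest."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()


def cached_sha1(path):
    """Return the stored digest if present, else compute and try to store it."""
    try:
        digest = os.getxattr(path, XATTR_NAME)
        if len(digest) == hashlib.sha1().digest_size:
            return digest
    except (OSError, AttributeError):
        pass  # no attribute yet, or xattrs unsupported on this platform
    digest = compute_sha1(path)
    try:
        os.setxattr(path, XATTR_NAME, digest)
    except (OSError, AttributeError):
        pass  # caching is best-effort; the digest is still returned
    return digest


def verify(path):
    """True if the stored digest matches the contents, False if it does
    not, None if no digest is stored."""
    try:
        stored = os.getxattr(path, XATTR_NAME)
    except (OSError, AttributeError):
        return None
    return stored == compute_sha1(path)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example contents")
    print(cached_sha1(tmp.name).hex(), verify(tmp.name))
    os.unlink(tmp.name)
```

A deduplication pass would call `cached_sha1` on each file during the walk and group paths by digest. Note that this sketch trusts a stored digest even if the file was modified afterwards, so a real tool would need some way to detect intentional changes, such as recording the modification time alongside the hash.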

I haven't experimented to see how many extents a file can have before the attributes get pushed out of the inode, but by keeping most everything contiguous I simply avoid the problem.
