On 10/10/2012 3:51 AM, Marcin Deranek wrote:
> We are running XFS filesystem on one of our machines which is a big
> store (~3TB) of different data files (mostly images). Quite recently we
> experienced some performance problems - machine wasn't able to keep up
> with updates. After some investigation it turned out that open()
> syscalls (open for writing) were taking significantly more time than
> they should eg. 15-20ms vs 100-150us.
> Some more info about our workload as I think it's important here:
> our XFS filesystem is exclusively used as data store, so we only
> read and write our data (we mostly write). When new update comes it's
> written to a temporary file eg.
> When file is completely stored we move it to final location eg.
> That means that we create lots of files in /mountpoint/some/path/.tmp
> directory, but directory is empty as they are moved (rename() syscall)
> shortly after file creation to a different directory on the same
> The workaround which I found so far is to remove that directory
> (/mountpoint/some/path/.tmp in our case) with its content and re-create
> it. After this operation open() syscall goes down to 100-150us again.
> Is this a known problem ?
> Information regarding our system:
> CentOS 5.8 / kernel 2.6.18-308.el5 / kmod-xfs-0.4-2
> Let me know if you need to know anything more.
I'll begin where you ended: kmod-xfs. DO NOT USE THAT. Use the kernel
driver. Eric Sandeen can point you to the why. AIUI that XFS module
hasn't been supported for many many years.
Regarding your problem, I can't state some of the following with
authority, though it might read that way. I'm making an educated guess
based on what I do know of XFS and the behavior you're seeing. Dave
will clobber and correct me if I'm wrong here. ;)
XFS filesystems are divided into multiple equal sized allocation groups
on the underlying storage device (single disk, RAID, LVM volume, etc).
With inode32 each directory that is created has its files stored in only
one AG, with some exceptions, which you appear to be bumping up against.
If you're using inode64 the directories, along with their files, go into
the AGs round robin.
Educated guessing: When you use rename(2) to move the files, the file
contents are not being moved, only the directory entry, as with EXTx
etc. Thus the file data is still in the ".tmp" directory AG, but that
AG is no longer its home. Once this temp dir AG gets full of these
"phantom" file contents (you can only see them with XFS tools), the AG
spills over. At that point XFS starts moving the phantom contents of
the rename(2) files into the AG which owns the directory of the
rename(2) target. I believe this is the source of your additional
latency. Each time you do an open(2) call to write a new file, XFS is
moving a file's contents (extents) to its new/correct parent AG, causing
much additional IO, especially if these are large files.
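If you want numbers to go with this, something like the following might
help. It is only a minimal sketch of the create-then-rename pattern you
described, timing each open() for writing. The function name, file names,
file count and payload size are all made up for illustration, and it needs
Python 3, which a stock CentOS 5 box won't have. It says nothing about
which AG anything lands in; it only shows whether open() latency in a
given directory is in the 100us or the 15ms range.

```python
import os
import tempfile
import time


def time_create_and_rename(tmp_dir, final_dir, count=100, payload=b"x" * 4096):
    """Create files in tmp_dir, rename each into final_dir, and return the
    latency of every creating open() call in microseconds."""
    os.makedirs(tmp_dir, exist_ok=True)
    os.makedirs(final_dir, exist_ok=True)
    latencies_us = []
    for i in range(count):
        tmp_path = os.path.join(tmp_dir, "file%d" % i)      # hypothetical names
        final_path = os.path.join(final_dir, "file%d" % i)
        start = time.perf_counter()
        # Time only the open-for-writing call, as in the original report.
        fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        latencies_us.append((time.perf_counter() - start) * 1e6)
        try:
            os.write(fd, payload)
        finally:
            os.close(fd)
        # Same-filesystem move: only the directory entry changes.
        os.rename(tmp_path, final_path)
    return latencies_us


if __name__ == "__main__":
    # Assumption: replace this with a directory on the XFS mount under test.
    base = tempfile.mkdtemp()
    samples = sorted(time_create_and_rename(os.path.join(base, ".tmp"),
                                            os.path.join(base, "final")))
    print("median open(): %.0f us" % samples[len(samples) // 2])
    print("max open():    %.0f us" % samples[-1])
```

Run it once against the existing .tmp directory and once against a freshly
created one, and compare the medians.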
As you are witnessing, if XFS did the move to the new AG in real time,
the performance of rename(2) would be horrible on the front end. I'd
guess the developers never imagined that a user would fill an entire AG
using rename(2) calls. Your deleting and recreating of the .tmp
directory which fixes the performance seems to be evidence of this.
Each time you delete/create that directory it is put into a different AG
in the filesystem, in a round robin fashion. If you do this enough
times, you should eventually create the directory in the original AG
that's full of the rename(2) file extents, and performance will suffer
One of the devs probably has some tricks/tools up his sleeve to force
those extents to their new parent AG. You might be able to run a
nightly script to do this housekeeping. Or you could always put the
.tmp directory on a different filesystem on a scratch disk.
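Until someone offers a better tool, the nightly housekeeping could simply
automate the workaround you already found. A rough sketch, assuming
Python 3 and assuming nothing is writing into .tmp while it runs (there is
a short window where the directory does not exist, and any leftover files
in it are deleted). The function name and the ".old.<pid>" suffix are
hypothetical. Whether the new directory actually ends up in a different AG
is my guess from above; the script does not check it.

```python
import os
import shutil
import tempfile


def recreate_tmp_dir(tmp_dir, mode=0o755):
    """Replace tmp_dir with a freshly created empty directory of the same
    name, then delete the old directory and anything left inside it."""
    tmp_dir = os.path.abspath(tmp_dir)
    old_dir = "%s.old.%d" % (tmp_dir, os.getpid())  # hypothetical side name
    # Move the old directory aside first so the new one gets a new inode.
    os.rename(tmp_dir, old_dir)
    os.mkdir(tmp_dir, mode)
    # Assumption: leftover temporary files are disposable.
    shutil.rmtree(old_dir)


if __name__ == "__main__":
    # Demonstration on a throwaway directory; point it at the real
    # .tmp directory only once writers are stopped.
    base = tempfile.mkdtemp()
    demo = os.path.join(base, ".tmp")
    os.mkdir(demo)
    recreate_tmp_dir(demo)
    print("recreated", demo)
```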
This problem could also be a free space fragmentation issue, but given
that recreating the .tmp directory fixes it, I doubt free space frag is