On Tue, Apr 28, 2015 at 05:17:14PM -0700, Shrinand Javadekar wrote:
> I will look at the hardware. But, I think, there's also a possible
> software problem here.
> If you look at the sequence of events, first a tmp file is created in
> <mount-point>/tmp/tmp_blah. After a few writes, this file is renamed
> to a different path in the filesystem.
> The "tmp" directory above is created only once. Temp files get created
> inside it and then get renamed. We wondered if this causes disk layout
> issues resulting in slower performance. And then we stumbled upon
> this: someone complaining about the exact same problem.
That's pretty braindead behaviour. That will screw performance and
locality on any filesystem you do that on, not to mention age the
filesystem prematurely.
In the case of XFS, it forces allocation of all the inodes in one
AG, rather than allowing XFS to distribute and balance inode
allocation around the filesystem and keep good
directory/inode/data locality for all your data.
The best way to do this is to create your tmp files using O_TMPFILE,
with the source directory being the destination directory, and then
use linkat() rather than rename() to make them visible in the
filesystem namespace.
> One quick way to validate this was to delete the "tmp" directory
> periodically and see what numbers we get. And the numbers do
> improve. With 15 runs of writing 80K objects in each run, our
> performance was dropping from ~100MB/s to 30MB/s. With deleting the
> tmp directory after each run, we saw the performance only drop from
> ~100MB/s to 80MB/s.
> The explanation in the link below says that when xfs does not find
> free extents in an existing allocation group, it frees up the extents
> by copying data from existing extents to their target allocation group
> (which happens because of renames). Is that explanation still valid?
No, it wasn't correct even back then. XFS does not move data around
once it has been allocated and is on disk. Indeed, rename() does not
move data, it only modifies directory entries.
The problem is that the locality of a new inode is determined by the
parent inode, and so if all new inodes are created in the same
directory, then they are all created in the same AG. If you have
millions of inodes, then you have a btree with millions of inodes in
it in one AG, and pretty much none in any other AG. Hence inode
allocation, which has to search for free inodes in a btree
containing millions of records, can be extremely IO and CPU
intensive and therefore slow. And the larger the number of inodes,
the slower it will go....
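If you want to see the imbalance directly, xfs_db can dump the
allocated-inode count from each AG's inode header (the device path and
the AG range below are placeholders; get the real agcount from
"xfs_info <mountpoint>", and note this needs read access to the block
device):

```shell
# Print allocated-inode counts for the first four AGs of an XFS device.
# /dev/sdb1 is a placeholder for your actual filesystem device.
for ag in 0 1 2 3; do
    xfs_db -r -c "agi $ag" -c "print count" /dev/sdb1
done
```

On a filesystem aged by the single-tmp-directory pattern, you'd expect
one AG's count to dwarf the others.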