
Re: Inode and dentry cache behavior

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: Inode and dentry cache behavior
From: Shrinand Javadekar <shrinand@xxxxxxxxxxxxxx>
Date: Mon, 11 May 2015 14:07:39 -0700
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <CABppvi63mY=-UFEyupF2-TQux+arOaCy+B-rtckvZziuachCbA@xxxxxxxxxxxxxx>
References: <CABppvi55C+vE7Ei8u=_ntC_heDQb4HwUcKom-_9hGkunk84Sfw@xxxxxxxxxxxxxx> <20150423224324.GM15810@dastard> <CABppvi7+Mu78FAM75YvJvekX2CHtKk4yeMrU7j35fvvWRb923Q@xxxxxxxxxxxxxx> <20150424061554.GN15810@dastard> <CABppvi6N6McmfLgAPcP9cxXxPrBMaD81UyeiVHWOaxrJisSN=g@xxxxxxxxxxxxxx> <20150429013024.GU15810@dastard> <CABppvi63mY=-UFEyupF2-TQux+arOaCy+B-rtckvZziuachCbA@xxxxxxxxxxxxxx>
In case you're curious, the solutions to this problem are being
discussed at [1] on the Swift side.

O_TMPFILE and linkat() aren't available in all kernels (or Python
versions), so they are not a viable option right away. Among a few
others, one of the proposals is to shard the tmp directory into, say,
256 dirs and create these files inside those dirs. That way the files
won't all get created in a single AG. Are there any other problems if
this is done?
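
To make the sharding proposal concrete, here is a rough sketch in
Python. The names NUM_SHARDS, shard_dir and write_via_sharded_tmp are
made up for illustration and are not Swift code. It also assumes XFS
will place the shard directories in different AGs, which nobody has
confirmed in this thread yet.

```python
import os
import tempfile
import zlib

NUM_SHARDS = 256  # assumption: the "say 256 dirs" figure from the proposal


def shard_dir(tmp_root, dest_path, num_shards=NUM_SHARDS):
    """Pick one of num_shards subdirectories of tmp_root for dest_path."""
    index = zlib.crc32(dest_path.encode("utf-8")) % num_shards
    return os.path.join(tmp_root, "%02x" % index)


def write_via_sharded_tmp(tmp_root, dest_path, data):
    """Write data to a temp file in a shard dir, then rename it to dest_path.

    tmp_root and dest_path must be on the same filesystem for rename().
    Returns the shard directory that was used.
    """
    shard = shard_dir(tmp_root, dest_path)
    os.makedirs(shard, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=shard)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.makedirs(os.path.dirname(dest_path), exist_ok=True)
        os.rename(tmp_path, dest_path)
    except BaseException:
        # Do not leave a stray temp file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return shard
```

If I read the explanation quoted below correctly, the inode still stays
in the AG of the shard directory it was created in, since rename() only
changes directory entries. So this would spread inode allocation, but
it would not put the inode near its final directory.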

[1] https://bugs.launchpad.net/swift/+bug/1450656
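
For comparison, here is roughly what the O_TMPFILE + linkat() approach
suggested in the quoted mail below might look like from Python. This is
only a sketch: publish_file and _publish_with_o_tmpfile are hypothetical
names, it assumes Linux with /proc mounted, and it falls back to a named
temp file in the destination directory when O_TMPFILE is missing or
unsupported.

```python
import errno
import os
import tempfile


def _publish_with_o_tmpfile(dest_dir, name, data, o_tmpfile):
    dir_fd = os.open(dest_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # Anonymous file whose inode is allocated relative to dest_dir.
        fd = os.open(dest_dir, o_tmpfile | os.O_WRONLY, 0o600)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
            # Passing dst_dir_fd makes Python use linkat(); the default
            # follow_symlinks=True resolves the /proc symlink to the file.
            os.link("/proc/self/fd/%d" % f.fileno(), name, dst_dir_fd=dir_fd)
    finally:
        os.close(dir_fd)
    return "o_tmpfile"


def publish_file(dest_path, data):
    """Create dest_path with data; refuses to overwrite an existing file.

    Returns "o_tmpfile" or "mkstemp" to report which path was taken.
    """
    dest_dir, name = os.path.split(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)
    o_tmpfile = getattr(os, "O_TMPFILE", None)  # absent on older Pythons
    if o_tmpfile is not None:
        try:
            return _publish_with_o_tmpfile(dest_dir, name, data, o_tmpfile)
        except OSError as exc:
            if exc.errno == errno.EEXIST:
                raise
            # Kernel or filesystem lacks support: use the fallback below.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.link(tmp_path, dest_path)  # same no-overwrite behaviour as above
    finally:
        os.unlink(tmp_path)
    return "mkstemp"
```

Either way the temp file is created in the destination directory, so
there is no single shared tmp directory involved. The fallback uses
link() followed by unlink() instead of rename() only so that both paths
fail the same way when the destination already exists.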

On Wed, Apr 29, 2015 at 10:46 AM, Shrinand Javadekar
<shrinand@xxxxxxxxxxxxxx> wrote:
> Awesome!! Thanks Dave!
> On Tue, Apr 28, 2015 at 6:30 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> On Tue, Apr 28, 2015 at 05:17:14PM -0700, Shrinand Javadekar wrote:
>>> I will look at the hardware. But, I think, there's also a possible
>>> software problem here.
>>> If you look at the sequence of events, first a tmp file is created in
>>> <mount-point>/tmp/tmp_blah. After a few writes, this file is renamed
>>> to a different path in the filesystem.
>>> rename(<mount-point>/tmp/tmp_blah,
>>> <mount-point>/objects/1004/eef/deadbeef/foo.data).
>>> The "tmp" directory above is created only once. Temp files get created
>>> inside it and then get renamed. We wondered if this causes disk layout
>>> issues resulting in slower performance. And then, we stumbled upon
>>> this[1]. Someone complaining about the exact same problem.
>> That's pretty braindead behaviour. That will screw performance and
>> locality on any filesystem you do that on, not to mention age it
>> extremely quickly.
>> In the case of XFS, it forces allocation of all the inodes in one
>> AG, rather than allowing XFS to distribute and balance inode
>> allocation around the filesystem and keeping good
>> directory/inode/data locality for all your data.
>> Best way to do this is to create your tmp files using O_TMPFILE,
>> with the source directory being the destination directory and then
>> use linkat() rather than rename to make them visible in the
>> directory.
>>> One quick way to validate this was to delete the "tmp" directory
>>> periodically and see what numbers we get. And they do improve. With 15 runs of
>>> writing 80K objects in each run, our performance was dropping from
>>> ~100MB/s to 30MB/s. With deleting the tmp directory after each run, we
>>> saw the performance only drop from ~100MB/s to 80MB/s.
>>>  The explanation in the link below says that when xfs does not find
>>> free extents in an existing allocation group, it frees up the extents
>>> by copying data from existing extents to their target allocation group
>>> (which happens because of renames). Is that explanation still valid?
>> No, it wasn't correct even back then.  XFS does not move data around
>> once it has been allocated and is on disk. Indeed, rename() does not
>> move data, it only modifies directory entries.
>> The problem is that the locality of a new inode is determined by the
>> parent inode, and so if all new inodes are created in the same
>> directory, then they are all created in the same AG. If you have
>> millions of inodes, then you have a btree with millions of inodes in
>> it in one AG, and pretty much none in any other AG. Hence inode
>> allocation, which has to search for free inodes in a btree
>> containing millions of records, can be extremely IO and CPU
>> intensive and therefore slow. And the larger the number of inodes,
>> the slower it will go....
>> Cheers,
>> Dave.
>> --
>> Dave Chinner
>> david@xxxxxxxxxxxxx
