sleeps and waits during io_submit
Avi Kivity
avi at scylladb.com
Tue Dec 1 15:38:29 CST 2015
On 12/01/2015 11:19 PM, Dave Chinner wrote:
> On Tue, Dec 01, 2015 at 09:07:14PM +0200, Avi Kivity wrote:
>> On 12/01/2015 08:03 PM, Carlos Maiolino wrote:
>>> Hi Avi,
>>>
>>>>> else is going to execute in our place until this thread can make
>>>>> progress.
>>>> For us, nothing else can execute in our place: we usually have exactly
>>>> one thread per logical core. So we are heavily dependent on io_submit
>>>> not sleeping.
>>>>
>>>> The case of a contended lock is, to me, less worrying. It can be reduced
>>>> by using more allocation groups, since the allocation group is apparently
>>>> the shared resource under contention.
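(To make the constraint concrete: the submission path in question looks
roughly like the sketch below, using the libaio wrappers around
io_setup()/io_submit(); the function and its arguments are illustrative,
not our actual code.)

    #include <stddef.h>
    #include <libaio.h>   /* link with -laio */

    /* One AIO context per core; the iocb must stay valid until its
     * completion is reaped with io_getevents(). */
    static int submit_write(io_context_t ctx, struct iocb *cb, int fd,
                            void *buf, size_t len, long long off)
    {
        struct iocb *cbs[1] = { cb };

        io_prep_pwrite(cb, fd, buf, len, off);
        /* If io_submit() sleeps here (lock contention, block
         * allocation), the whole core stalls: there is no other
         * thread to run in its place. */
        return io_submit(ctx, 1, cbs);
    }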
>>>>
>>> I apologize if I misread your previous comments, but, IIRC you said you can't
>>> change the directory structure your application is using, and IIRC your
>>> application does not spread files across several directories.
>> I miswrote somewhat: the application writes data files and commitlog
>> files. The data file directory structure is fixed due to
>> compatibility concerns (it is not a single directory, but some
>> workloads will see most access on files in a single directory). The
>> commitlog directory structure is more relaxed, and we can split it
>> into a directory per shard (= cpu) or something else.
>>
>> If worst comes to worst, we'll hack around this and distribute the
>> data files into more directories, and provide some hack for
>> compatibility.
>>
>>> XFS spreads files across the allocation groups based on the directory
>>> in which the files are created,
>> Idea: create the files in some subdirectory, and immediately move
>> them to their required location.
> See xfs_fsr.
Can you elaborate? I don't see how it is applicable.
My hack involves creating the file in a random directory and, while it
is still zero-sized, moving it to its final directory. This is simply to
defeat the AG selection heuristic. No data is copied.
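In code, the hack looks roughly like this (the scratch and final paths
are illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Create the file in a randomly chosen scratch directory, so the
     * AG is picked based on *that* directory, then rename the still
     * zero-sized file to its real location. The rename moves no data;
     * extents are allocated later, when the file is written. */
    int create_spread(const char *scratch_dir, const char *final_path)
    {
        char tmp[4096];
        int fd;

        snprintf(tmp, sizeof(tmp), "%s/.tmp.%d", scratch_dir, getpid());
        fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0)
            return -1;
        if (rename(tmp, final_path) < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        return fd;
    }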
>>> trying to keep files as close as possible to their
>>> metadata.
>> This is pointless for an SSD. Perhaps XFS should randomize the AG on
>> non-rotational media instead.
> Actually, no, it is not pointless. SSDs do not require optimisation
> for minimal seek time, but data locality is still just as important
> as on spinning disks, if not more so. Why? Because the garbage
> collection routines in the SSDs are all about locality, and we can't
> drive garbage collection effectively via discard operations if the
> filesystem is not keeping temporally related files close together in
> its block address space.
In my case, files in the same directory are not temporally related. But
I understand where the heuristic comes from.
Maybe an ioctl to set a directory attribute saying "the files in this
directory are not temporally related"? I imagine this would be useful
for many server applications.
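Something modeled on the existing fsxattr ioctls would do; note the flag
below is purely hypothetical, nothing like it exists today:

    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* struct fsxattr, FS_IOC_FS[GS]ETXATTR */

    /* Hypothetical flag: set on a directory, it would tell the
     * allocator not to assume its files are temporally related. */
    #define FS_XFLAG_NOTEMPORAL 0x80000000

    static int mark_not_temporally_related(int dirfd)
    {
        struct fsxattr fsx;

        if (ioctl(dirfd, FS_IOC_FSGETXATTR, &fsx) < 0)
            return -1;
        fsx.fsx_xflags |= FS_XFLAG_NOTEMPORAL;   /* hypothetical */
        return ioctl(dirfd, FS_IOC_FSSETXATTR, &fsx);
    }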
> e.g. If the files in a directory are all close together, and the
> directory is removed, we then leave a big empty contiguous region in
> the filesystem free space map, and when we send discards over that
> we end up with a single big trim and the drive handles that far more
> effectively than lots of little trims (i.e. one per file) that the
> drive cannot do anything useful with because they are all smaller
> than the internal SSD page/block sizes and so get ignored. This is
> one of the reasons fstrim is so much more efficient and effective
> than using the discard mount option.
Would this not be defeated if a directory that happens to share the
allocation group gets populated simultaneously?
In my use case, the files are fairly large, and there is constant
rewriting (not in-place: files are read, merged, and written back). So
I'm worried that an fstrim might happen too late.
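One possible mitigation (a sketch; the mount point path is illustrative)
is to issue FITRIM from the application itself, right after a merge pass
unlinks the old files, instead of waiting for a periodic fstrim:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>   /* FITRIM, struct fstrim_range */

    /* Trim all free space on the filesystem containing 'mountpoint':
     * the same operation fstrim(8) performs (requires CAP_SYS_ADMIN). */
    static int trim_now(const char *mountpoint)
    {
        struct fstrim_range r = { .start = 0, .len = ~0ULL, .minlen = 0 };
        int fd = open(mountpoint, O_RDONLY | O_DIRECTORY);
        int ret;

        if (fd < 0)
            return -1;
        ret = ioctl(fd, FITRIM, &r);
        close(fd);
        return ret;
    }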
>
> And, well, XFS is designed to operate on storage devices made up of
> more than one drive, so the way AGs are selected is designed to
> give long term load balancing (both for space usage and
> instantaneous performance). With the existing algorithms we've not
> had any issues with SSD lifetimes, long term performance
> degradation, etc, so there's no evidence that we actually need to
> change the fundamental allocation algorithms specially for SSDs.
>
OK. Maybe the SSDs can deal with untrimmed overwrites efficiently,
provided the I/O sizes are large enough.
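That is, keep each write large and aligned, something like the following
(sizes are illustrative):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Write in large, aligned chunks so each overwrite covers whole
     * SSD pages/erase blocks, letting the FTL reclaim the old copy
     * without read-modify-write even when nothing was trimmed. */
    enum { CHUNK = 1 << 20 };   /* 1 MiB per write, illustrative */

    static int write_big(int fd, off_t off)   /* fd opened with O_DIRECT */
    {
        void *buf;
        ssize_t n;

        if (posix_memalign(&buf, 4096, CHUNK))
            return -1;
        /* ... fill buf with the merged data ... */
        n = pwrite(fd, buf, CHUNK, off);
        free(buf);
        return n == CHUNK ? 0 : -1;
    }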