sleeps and waits during io_submit
Avi Kivity
avi at scylladb.com
Thu Dec 3 06:52:08 CST 2015
On 12/03/2015 01:19 AM, Dave Chinner wrote:
> On Wed, Dec 02, 2015 at 11:02:08AM +0200, Avi Kivity wrote:
>> On 12/02/2015 01:06 AM, Dave Chinner wrote:
>>> On Tue, Dec 01, 2015 at 11:38:29PM +0200, Avi Kivity wrote:
>>>> On 12/01/2015 11:19 PM, Dave Chinner wrote:
>>>>>>> XFS spreads files across the allocation groups, based on the directory
>>>>>>> these files are created in,
>>>>>> Idea: create the files in some subdirectory, and immediately move
>>>>>> them to their required location.
> ....
>>>> My hack involves creating the file in a random directory and, while
>>>> it is still zero sized, moving it to its final directory. This is
>>>> simply to defeat the AG selection heuristic.
>>> Which you really don't want to do.
>> Why not? For my directory structure, files in the same directory do
>> not share temporal locality. What does the AG selection heuristic
>> give me?
> Wrong question. The right question is this: what problems does
> subverting the AG selection heuristic cause me?
>
> If you can't answer that question, then you can't quantify the risks
> involved with making such a behavioural change.
Okay. Any hint about the answer to that question?
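For concreteness, the hack I described above boils down to something like
the sketch below (a rough sketch only: the scratch directory layout, the
temporary name scheme, and the minimal error handling are made up for
illustration, not what our engine actually does):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Create the file under a randomly chosen scratch directory, so the new
 * inode (and hence its AG) is picked based on that directory, then rename
 * it, still zero length, into its final directory. */
int create_spread(const char *scratch_root, int nr_scratch,
                  const char *final_path)
{
        char tmp[4096];
        int bucket = rand() % nr_scratch;  /* scratch dirs 0..nr_scratch-1 exist */

        snprintf(tmp, sizeof(tmp), "%s/%d/tmp.%d", scratch_root, bucket,
                 (int)getpid());

        int fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0)
                return -1;

        /* no data has been written yet, so there are no extents to move */
        if (rename(tmp, final_path) < 0) {
                unlink(tmp);
                close(fd);
                return -1;
        }
        return fd;  /* append through this fd as usual */
}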
>
>>>>>>> trying to keep files as close as possible to their
>>>>>>> metadata.
>>>>>> This is pointless for an SSD. Perhaps XFS should randomize the AG on
>>>>>> non-rotational media instead.
>>>>> Actually, no, it is not pointless. SSDs do not require optimisation
>>>>> for minimal seek time, but data locality is still just as important
>>>>> as on spinning disks, if not more so. Why? Because the garbage
>>>>> collection routines in the SSDs are all about locality and we can't
>>>>> drive garbage collection effectively via discard operations if the
>>>>> filesystem is not keeping temporally related files close together in
>>>>> its block address space.
>>>> In my case, files in the same directory are not temporally related.
>>>> But I understand where the heuristic comes from.
>>>>
>>>> Maybe an ioctl to set a directory attribute "the files in this
>>>> directory are not temporally related"?
>>> And exactly what does that gain us?
>> I have a directory with commitlog files that are constantly and
>> rapidly being created, appended to, and removed, from all logical
>> cores in the system. Does this not put pressure on that allocation
>> group's locks?
> Not usually, because if an AG is contended, the allocation algorithm
> skips the contended AG and selects the next uncontended AG to
> allocate in. And given that the append algorithm used by the
> allocator attempts to use the last block of the last extent as the
> target for the new extent (i.e. contiguous allocation) once a file
> has skipped to a different AG all allocations will continue in that
> new AG until it is either full or it becomes contended....
>
> IOWs, when AG contention occurs, the filesystem automatically
> spreads out the load over multiple AGs. Put simply, we optimise for
> locality first, but we're willing to compromise on locality to
> minimise contention when it occurs. But, also, keep in mind that
> in minimising contention we are still selecting the most local of
> possible alternatives, and that's something you can't do in
> userspace....
Cool. I don't think "nearly-local" matters much for an SSD (it's either
contiguous or it is not), but it's good to know that it's self-tuning
with respect to contention.
In some good news, Glauber hacked our I/O engine not to throw so many
concurrent I/Os at the filesystem, and indeed the contention was reduced.
So it's likely we were pushing the fs so hard that all the AGs were
contended, but this is no longer the case.
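For what it's worth, the change amounts to capping the number of requests
we keep in flight, along the lines of the sketch below (the limit of 128
and the bare libaio calls are illustrative; the real I/O engine is
structured quite differently):

#include <libaio.h>
#include <stddef.h>

#define MAX_INFLIGHT 128

static io_context_t ctx;   /* zero-initialised, as io_setup() requires */
static int inflight;

int aio_init(void)
{
        return io_setup(MAX_INFLIGHT, &ctx);
}

/* Submit one request, but reap completions first if we are already at the
 * cap, so the filesystem never sees more than MAX_INFLIGHT concurrent
 * requests from us. */
int submit_throttled(struct iocb *cb)
{
        struct io_event events[MAX_INFLIGHT];

        while (inflight >= MAX_INFLIGHT) {
                int n = io_getevents(ctx, 1, MAX_INFLIGHT, events, NULL);
                if (n < 0)
                        return n;
                inflight -= n;
        }

        struct iocb *cbs[1] = { cb };
        int rc = io_submit(ctx, 1, cbs);
        if (rc == 1)
                inflight++;
        return rc;
}

(Compile with -laio.)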
>
>>> Exactly what problem are you
>>> trying to solve by manipulating file locality that can't be solved
>>> by existing knobs and config options?
>> I admit I don't know much about the existing knobs and config
>> options. Pointers are appreciated.
> You can find some work in progress here:
>
> https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/
>
> looks like there's some problem with xfs.org wiki, so the links
> to the user/training info on this page:
>
> http://xfs.org/index.php/XFS_Papers_and_Documentation
>
> aren't working.
>
>>> Perhaps you'd like to read up on how the inode32 allocator behaves?
>> Indeed I would, pointers are appreciated.
> Inode allocation section here:
>
> https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc
Thanks for all the links; I'll study them and see what we can do to tune
for our workload.
>>> Once we know which of the different algorithms is causing the
>>> blocking issues, we'll know a lot more about why we're having
>>> problems and a better idea of what problems we actually need to
>>> solve.
>> I'm happy to hack off the lowest hanging fruit and then go after the
>> next one. I understand you're annoyed at having to defend against
>> what may be non-problems; but for me it is an opportunity to learn
>> about the file system.
> No, I'm not annoyed. I just don't want to be chasing ghosts and so
> we need to be on the same page about how to track down these issues.
> And, believe me, you'll learn a lot about how the filesystem behaves
> just by watching how the different configs react to the same
> input...
Ok. Looks like I have a lot of homework.
>
>> For us it is the weakest spot in our system,
>> because on the one hand we heavily depend on async behavior and on
>> the other hand Linux is notoriously bad at it. So we are very
>> nervous when blocking happens.
> I can't disagree with you there - we really need to fix what we can
> within the constraints of the OS first; then, once we have it
> working as well as we can, we can look to solving the remaining
> "notoriously bad" AIO problems...
There are lots of users who will be eternally grateful to you if you can
get this fixed. Linux has a very bad reputation in this area; the accepted
wisdom is that you can only use AIO reliably against block devices. XFS
comes very close; it will make a huge impact if it can be used to do AIO
reliably, without a lot of constraints on the application.
>
>>>>> effectively than lots of little trims (i.e. one per file) that the
>>>>> drive cannot do anything useful with because they are all smaller
>>>>> than the internal SSD page/block sizes and so get ignored. This is
>>>>> one of the reasons fstrim is so much more efficient and effective
>>>>> than using the discard mount option.
>>>> In my use case, the files are fairly large, and there is constant
>>>> rewriting (not in-place: files are read, merged, and written back).
>>>> So I'm worried an fstrim can happen too late.
>>> Have you measured the SSD performance degradation over time due to
>>> large overwrites? If not, then again there is a good chance you are
>>> trying to solve a theoretical problem rather than a real problem....
>>>
>> I'm not worried about that (maybe I should be) but about the SSD
>> reaching internal ENOSPC due to the fstrim happening too late.
>>
>> Consider this scenario, which is quite typical for us:
>>
>> 1. Fill 1/3rd of the disk with a few large files.
>> 2. Copy/merge the data into a new file, occupying another 1/3rd of the disk.
>> 3. Repeat 1+2.
>>
>> If this is repeated a few times, the disk can see 100% of its space
>> occupied (depending on how free space is allocated), even if from a
>> user's perspective it is never more than 2/3rds full.
> I don't think that's true. SSD behaviour largely depends on how much
> of the LBA space has been written to (i.e. marked used) and so that
> metric tends to determine how the SSD behaves under such workloads.
> This is one of the reasons that overprovisioning SSD space (e.g.
> leaving 25% of the LBA space completely unused) results in better
> performance under overwrite workloads - there's lots more scratch
> space for the garbage collector to work with...
>
> Hence as long as the filesystem is reusing the same LBA regions for
> the files, TRIM will probably not make a significant difference to
> performance because there's still 1/3rd of the LBA region that is
> "unused". Hence the overwrites go into the unused 1/3rd of the SSD,
> and the underlying SSD blocks associated with the "overwritten" LBA
> region are immediately marked free, just like if you issued a trim
> for that region before you start the overwrite.
>
> With the way the XFS allocator works, it fills AGs from lowest to
> highest blocks, and if you free lots of space down low in the AG
> then that tends to get reused before the higher offset free space.
> Hence the way XFS allocates space in the above workload would result in
> roughly 1/3rd of the LBA space associated with the filesystem
> remaining unused. This is another allocator behaviour designed for
> spinning disks (to keep the data on the faster outer edges of
> drives) that maps very well to internal SSD allocation/reclaim
> algorithms....
Cool. So we'll keep fstrim usage to a daily run, or something similarly
infrequent.
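If we end up wanting to trigger the trim ourselves between scheduled runs,
I assume we can just issue the same FITRIM ioctl that fstrim(8) uses; a
minimal sketch (the 64 MiB minimum extent length is a made-up threshold):

#include <fcntl.h>
#include <limits.h>
#include <linux/fs.h>      /* FITRIM, struct fstrim_range */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Ask the filesystem mounted at 'mountpoint' to discard all of its free
 * space, skipping free extents smaller than 64 MiB. */
int trim_fs(const char *mountpoint)
{
        int fd = open(mountpoint, O_RDONLY);
        if (fd < 0)
                return -1;

        struct fstrim_range r = {
                .start  = 0,
                .len    = ULLONG_MAX,   /* the whole filesystem */
                .minlen = 64ULL << 20,
        };

        int rc = ioctl(fd, FITRIM, &r);
        if (rc == 0)
                printf("discarded %llu bytes\n", (unsigned long long)r.len);

        close(fd);
        return rc;
}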
>
> FWIW, did you know that TRIM generally doesn't return the disk to
> the performance of a pristine, empty disk? Generally only a secure
> erase will guarantee that a SSD returns to "empty disk" performance,
> but that also removes all data from the entire SSD. Hence the
> baseline "sustained performance" you should be using is not "empty
> disk" performance, but the performance once the disk has been
> overwritten completely at least once. Only then will you tend to see
> what effect TRIM will actually have.
I did not know that. Maybe that's another factor in why cloud SSDs are
so slow.
>
>> Maybe a simple countermeasure is to issue an fstrim every time we
>> write 10%-20% of the disk's capacity.
> Run the workload to steady state performance and measure the
> degradation as it continues to run and overwrite the SSDs
> repeatedly. To do this properly you are going to have to sacrifice
> some SSDs, because you're going to need to overwrite them quite a
> few times to get an idea of the degradation characteristics and
> whether a periodic trim makes any difference or not.
Enterprise SSDs are guaranteed for something like N full writes / day
for several years, are they not? So such a test can take weeks or
months, depending on the ratio between disk size and bandwidth.
Still, I guess it has to be done.
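Back of the envelope, with purely hypothetical numbers for the drive size,
sustained write bandwidth, and number of full overwrites needed:

#include <stdio.h>

int main(void)
{
        double drive_bytes = 2e12;   /* 2 TB drive (hypothetical) */
        double write_bw    = 500e6;  /* 500 MB/s sustained writes (hypothetical) */
        double overwrites  = 300;    /* full-device overwrites for the test */

        double days = drive_bytes / write_bw * overwrites / 86400.0;
        printf("~%.0f days of continuous writing\n", days);
        /* about 14 days with these numbers; a larger or slower disk, or
         * more overwrites, pushes it into months */
        return 0;
}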