sleeps and waits during io_submit
Avi Kivity
avi at scylladb.com
Tue Dec 1 13:07:14 CST 2015
On 12/01/2015 08:03 PM, Carlos Maiolino wrote:
> Hi Avi,
>
>>> else is going to execute in our place until this thread can make
>>> progress.
>> For us, nothing else can execute in our place, we usually have exactly one
>> thread per logical core. So we are heavily dependent on io_submit not
>> sleeping.
>>
>> The case of a contended lock is, to me, less worrying. It can be reduced by
>> using more allocation groups, which is apparently the shared resource under
>> contention.
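To make that concrete, here is roughly the submission path we depend on, as a
minimal sketch using the libaio interface (function and variable names are just
illustrative, and error handling is elided):

    #include <libaio.h>
    #include <sys/types.h>

    /* One io_context_t per reactor thread; we expect io_submit() to queue
     * the request and return without sleeping.  buf must stay valid until
     * the completion arrives. */
    int submit_write(io_context_t ctx, int fd, void *buf, size_t len, off_t off)
    {
        struct iocb cb;
        struct iocb *cbs[1] = { &cb };

        /* buf, len and off must be properly aligned for O_DIRECT */
        io_prep_pwrite(&cb, fd, buf, len, off);
        return io_submit(ctx, 1, cbs);  /* any sleep here stalls the whole core */
    }

If io_submit() sleeps on a lock or on metadata I/O inside the filesystem, the
whole core stalls, since there is no other thread to run in its place.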
>>
> I apologize if I misread your previous comments, but, IIRC you said you can't
> change the directory structure your application is using, and IIRC your
> application does not spread files across several directories.
I miswrote somewhat: the application writes data files and commitlog
files. The data file directory structure is fixed due to compatibility
concerns (it is not a single directory, but some workloads will see most
access on files in a single directory). The commitlog directory
structure is more relaxed, and we can split it into a directory per shard
(=cpu) or something else.
If worst comes to worst, we'll work around this by distributing the data
files into more directories, and provide some hack for compatibility.
> XFS spreads files across the allocation groups, based on the directory these
> files are created in,
Idea: create the files in some subdirectory, and immediately move them
to their required location.
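A rough sketch of what I have in mind (hypothetical paths; it assumes a set of
pre-created spread directories, so each new file picks up a different parent and
therefore a different AG):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Create the file under a per-shard "spread" directory so XFS picks that
     * directory's AG, then immediately rename it into the fixed layout.
     * The inode (and hence its AG) travels with the file. */
    int create_spread(const char *spread_dir, unsigned shard, const char *final_path)
    {
        char tmp[4096];
        snprintf(tmp, sizeof(tmp), "%s/%u/tmp.%d", spread_dir, shard, (int)getpid());

        int fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0)
            return -1;
        if (rename(tmp, final_path) < 0) {      /* the one extra system call */
            close(fd);
            unlink(tmp);
            return -1;
        }
        return fd;
    }

The inode keeps its AG across the rename, so the fixed layout is preserved at
the cost of one extra system call.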
> trying to keep files as close as possible to their
> metadata.
This is pointless for an SSD. Perhaps XFS should randomize the AG on
non-rotational media instead.
> Directories are spread across the AGs in a 'round-robin' way: each
> new directory will be created in the next allocation group, and XFS will try
> to allocate files in the same AG as their parent directory. (Take a look at
> the 'rotorstep' sysctl option for xfs).
>
> So, unless you have the files distributed across enough directories, increasing
> the number of allocation groups may not change the lock contention you're
> facing in this case.
>
> I really don't remember if it has been mentioned already, but if not, it might
> be worth taking this point into consideration.
Thanks. I think you should really consider randomizing the AG for SSDs;
meanwhile, we can just use the creation-directory hack to get the
same effect, at the cost of an extra system call. So at least for this
problem, there is a solution.
> anyway, just my 0.02
>
>> The case of waiting for I/O is much more worrying, because I/O latencies are
>> much higher. But it seems like most of the DIO path does not trigger
>> locking around I/O (and we are careful to avoid the ones that do, like
>> writing beyond eof).
>>
>> (sorry for repeating myself, I have the feeling we are talking past each
>> other and want to be on the same page)
>>
>>>>> We submit an I/O which is
>>>>> asynchronous in nature and wait on a completion, which causes the cpu to
>>>>> schedule and execute another task until the completion is set by I/O
>>>>> completion (via an async callback). At that point, the issuing thread
>>>>> continues where it left off. I suspect I'm missing something... can you
>>>>> elaborate on what you'd do differently here (and how it helps)?
>>>> Just apply the same technique everywhere: convert locks to trylock +
>>>> schedule a continuation on failure.
>>>>
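(Not the XFS code itself, just a userspace sketch of the trylock-plus-continuation
pattern; the continuation type and defer hook are hypothetical:)

    #include <pthread.h>

    /* Hypothetical continuation: a callback that is re-queued once the
     * lock becomes available, instead of sleeping on it. */
    struct continuation {
        void (*fn)(void *arg);
        void *arg;
    };

    /* Returns 0 if the lock was taken and the caller may proceed inline;
     * otherwise the continuation is parked and the caller must treat the
     * operation as queued (the EIOCBQUEUED-style path). */
    int lock_or_defer(pthread_mutex_t *lock, struct continuation *cont,
                      void (*defer)(struct continuation *))
    {
        if (pthread_mutex_trylock(lock) == 0)
            return 0;           /* fast path: no sleep */
        defer(cont);            /* resumed when the current holder unlocks */
        return 1;
    }

The point is that the submitter never sleeps on the lock; the work resumes
asynchronously once the lock is released.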
>>> I'm certainly not an expert on the kernel scheduling, locking and
>>> serialization mechanisms, but my understanding is that most things
>>> outside of spin locks are reschedule points. For example, the
>>> wait_for_completion() calls XFS uses to wait on I/O boil down to
>>> schedule_timeout() calls. Buffer locks are implemented as semaphores and
>>> down() can end up in the same place.
>> But, for the most part, XFS seems to be able to avoid sleeping. The call to
>> __blockdev_direct_IO only launches the I/O, so any locking is only around
>> cpu operations and, unless there is contention, won't cause us to sleep in
>> io_submit().
>>
>> Trying to follow the code, it looks like xfs_get_blocks_direct (and
>> __blockdev_direct_IO's get_block parameter in general) is synchronous, so
>> we're just lucky to have everything in cache. If it isn't, we block right
>> there. I really hope I'm misreading this and some other magic is happening
>> elsewhere.
>>
>>> Brian
>>>
>>>>>> Seastar (the async user framework which we use to drive xfs) makes writing
>>>>>> code like this easy, using continuations; but of course from ordinary
>>>>>> threaded code it can be quite hard.
>>>>>>
>>>>>> btw, there was an attempt to make ext[34] async using this method, but I
>>>>>> think it was ripped out. Yes, the mortal remains can still be seen with
>>>>>> 'git grep EIOCBQUEUED'.
>>>>>>
>>>>>>>>> It sounds to me that first and foremost you want to make sure you don't
>>>>>>>>> have however many parallel operations you typically have running
>>>>>>>>> contending on the same inodes or AGs. Hint: creating files under
>>>>>>>>> separate subdirectories is a quick and easy way to allocate inodes under
>>>>>>>>> separate AGs (the agno is encoded into the upper bits of the inode
>>>>>>>>> number).
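(For reference, decoding the AG from a stat() inode number looks roughly like
this; sb_agblklog and sb_inopblog are per-filesystem values that would have to
be read from the superblock, e.g. with xfs_db, so treat them as assumed inputs:)

    #include <stdint.h>
    #include <sys/stat.h>

    /* The AG number sits above the per-AG block number (sb_agblklog bits)
     * and the inode-within-block index (sb_inopblog bits); both depend on
     * the mkfs geometry. */
    static uint32_t ino_to_agno(uint64_t ino, unsigned agblklog, unsigned inopblog)
    {
        return (uint32_t)(ino >> (agblklog + inopblog));
    }

    uint32_t file_agno(const struct stat *st, unsigned agblklog, unsigned inopblog)
    {
        return ino_to_agno((uint64_t)st->st_ino, agblklog, inopblog);
    }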
>>>>>>>> Unfortunately our directory layout cannot be changed. And doesn't this
>>>>>>>> require having agcount == O(number of active files)? That is easily in the
>>>>>>>> thousands.
>>>>>>>>
>>>>>>> I think Glauber's O(nr_cpus) comment is probably the more likely
>>>>>>> ballpark, but really it's something you'll probably just need to test to
>>>>>>> see how far you need to go to avoid AG contention.
>>>>>>>
>>>>>>> I'm primarily throwing the subdir thing out there for testing purposes.
>>>>>>> It's just an easy way to create inodes in a bunch of separate AGs so you
>>>>>>> can determine whether/how much it really helps with modified AG counts.
>>>>>>> I don't know enough about your application design to really comment on
>>>>>>> that...
>>>>>> We have O(cpus) shards that operate independently. Each shard writes 32MB
>>>>>> commitlog files (that are pre-truncated to 32MB to allow concurrent writes
>>>>>> without blocking); the files are then flushed and closed, and later removed.
>>>>>> In parallel there are sequential writes and reads of large files (using 128kB
>>>>>> buffers), as well as random reads. Files are immutable (append-only), and
>>>>>> if a file is being written, it is not concurrently read. In general files
>>>>>> are not shared across shards. All I/O is async and O_DIRECT. open(),
>>>>>> truncate(), fdatasync(), and friends are called from a helper thread.
>>>>>>
>>>>>> As far as I can tell it should be a very friendly load for XFS and SSDs.
>>>>>>
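Roughly, a commitlog file is set up like this (hypothetical names; the blocking
calls stay on the helper thread, and only io_submit() runs on the shard):

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <unistd.h>

    #define COMMITLOG_SIZE (32UL << 20)     /* 32MB */

    /* Runs on the helper thread: open() and ftruncate() may block, so they
     * never run on a reactor core.  The fd is then handed to the shard,
     * which only issues aligned O_DIRECT writes against it via io_submit(). */
    int prepare_commitlog(const char *path)
    {
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY | O_DIRECT, 0644);
        if (fd < 0)
            return -1;
        if (ftruncate(fd, COMMITLOG_SIZE) < 0) {    /* pre-size to 32MB up front */
            close(fd);
            unlink(path);
            return -1;
        }
        return fd;
    }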
>>>>>>>>> Reducing the frequency of block allocations/frees might also
>>>>>>>>> help (e.g., preallocate and reuse files,
>>>>>>>> Isn't that discouraged for SSDs?
>>>>>>>>
>>>>>>> Perhaps, if you're referring to the fact that the blocks are never freed
>>>>>>> and thus never discarded...? Are you running fstrim?
>>>>>> mount -o discard. And yes, overwrites are supposedly more expensive than
>>>>>> trimming old data + allocating new data, but if you compare that with the work
>>>>>> XFS has to do, perhaps the tradeoff is bad.
>>>>>>
>>>>> Ok, my understanding is that '-o discard' is not recommended in favor of
>>>>> periodic fstrim for performance reasons, but that may or may not still
>>>>> be the case.
>>>> I understand that most SSDs have queued trim these days, but maybe I'm
>>>> optimistic.
>>>>
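(For completeness, periodic fstrim amounts to the FITRIM ioctl on the mountpoint;
a minimal sketch of that, as opposed to discarding on every free with -o discard:)

    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    /* Discard all free space of the filesystem containing 'mountpoint' in
     * one batch, instead of discarding synchronously on every free. */
    int trim_fs(const char *mountpoint)
    {
        struct fstrim_range range;
        memset(&range, 0, sizeof(range));
        range.len = UINT64_MAX;             /* whole filesystem */

        int fd = open(mountpoint, O_RDONLY);
        if (fd < 0)
            return -1;
        int ret = ioctl(fd, FITRIM, &range);
        close(fd);
        return ret;
    }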