sleeps and waits during io_submit

Avi Kivity avi at scylladb.com
Tue Dec 1 11:09:29 CST 2015



On 12/01/2015 06:29 PM, Brian Foster wrote:
> On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote:
>>
>> On 12/01/2015 06:01 PM, Brian Foster wrote:
>>> On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
>>>> On 12/01/2015 04:56 PM, Brian Foster wrote:
>>>>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
>>>>>> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>>>>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>>>>>>>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>>>>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>>>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
>>>>>>> ...
>>>>>>>>> The agsize/agcount mkfs-time heuristics change depending on the type of
>>>>>>>>> storage. A single AG can be up to 1TB and if the fs is not considered
>>>>>>>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
>>>>>>>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is
>>>>>>>>> adjusted depending on the size of the overall volume (see
>>>>>>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
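A compact paraphrase of the non-multidisk rule described above, just to
make the numbers concrete.  This is not the actual
calc_default_ag_geometry() logic, only the rule as stated here (an AG is
at most 1TB, and up to 4TB the default is 4 AGs):

#include <stdio.h>

/* Paraphrase of the default AG heuristic quoted above, for the
 * non-multidisk case only (no stripe unit/width).  NOT the real
 * xfsprogs code, just the stated rule. */
#define TB (1ULL << 40)

static unsigned long long default_agcount(unsigned long long fs_bytes)
{
    if (fs_bytes <= 4 * TB)
        return 4;
    return (fs_bytes + TB - 1) / TB;   /* keep every AG at or below 1TB */
}

int main(void)
{
    printf("2TB  -> %llu AGs\n", default_agcount(2 * TB));   /* 4 */
    printf("10TB -> %llu AGs\n", default_agcount(10 * TB));  /* 10 */
    return 0;
}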
>>>>>>>> We'll experiment with this.  Surely it depends on more than the amount of
>>>>>>>> storage?  If you have a high op rate you'll be more likely to excite
>>>>>>>> contention, no?
>>>>>>>>
>>>>>>> Sure. The absolute optimal configuration for your workload probably
>>>>>>> depends on more than storage size, but mkfs doesn't have that
>>>>>>> information. In general, it tries to use the most reasonable
>>>>>>> configuration based on the storage and expected workload. If you want to
>>>>>>> tweak it beyond that, indeed, the best bet is to experiment with what
>>>>>>> works.
>>>>>> We will do that.
>>>>>>
>>>>>>>>>> Are those locks held around I/O, or just CPU operations, or a mix?
>>>>>>>>> I believe it's a mix of modifications and I/O, though it looks like some
>>>>>>>>> of the I/O cases don't necessarily wait on the lock. E.g., the AIL
>>>>>>>>> pushing case will trylock and defer to the next list iteration if the
>>>>>>>>> buffer is busy.
>>>>>>>>>
>>>>>>>> Ok.  For us sleeping in io_submit() is death because we have no other thread
>>>>>>>> on that core to take its place.
>>>>>>>>
>>>>>>> The above is with regard to metadata I/O, whereas io_submit() is
>>>>>>> obviously for user I/O.
>>>>>> Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
>>>>>> async tasks?  I don't mind them blocking each other as long as they let my
>>>>>> io_submit alone.
>>>>>>
>>>>> Yeah, it can trigger metadata reads, force the log (the stale buffer
>>>>> example) or push the AIL (wait on log space). Metadata changes made
>>>>> directly via your I/O request are logged/committed via transactions,
>>>>> which are generally processed asynchronously from that point on.
>>>>>
>>>>>>>   io_submit() can probably block in a variety of
>>>>>>> places afaict... it might have to read in the inode extent map, allocate
>>>>>>> blocks, take inode/ag locks, reserve log space for transactions, etc.
>>>>>> Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
>>>>>> if somebody else has to do it.
>>>>>>
>>>>> I'm not following... if the fs needs to read in the inode extent map to
>>>>> prepare for an allocation, what else can the thread do but wait? Are you
>>>>> suggesting the request kick off whatever the blocking action happens to
>>>>> be asynchronously and return with an error such that the request can be
>>>>> retried later?
>>>> Not quite, it should be invisible to the caller.
>>>>
>>>> That is, the code called by io_submit() (file_operations::write_iter, it
>>>> seems to be called today) can kick off this operation and have it continue
>>>> from where it left off.
>>> Isn't that generally what happens today?
>> You tell me.  According to $subject, apparently not enough.  Maybe we're
>> triggering it more often, or we suffer more when it does trigger (the latter
>> probably more likely).
>>
> The original mail describes looking at the sched:sched_switch tracepoint
> which on a quick look, appears to fire whenever a cpu context switch
> occurs. This likely triggers any time we wait on an I/O or a contended
> lock (among other situations I'm sure), and it signifies that something
> else is going to execute in our place until this thread can make
> progress.

For us, nothing else can execute in our place: we usually run exactly
one thread per logical core, so we are heavily dependent on io_submit()
not sleeping.

The case of a contended lock is, to me, less worrying.  It can be
reduced by using more allocation groups, since the allocation group is
apparently the shared resource under contention.

The case of waiting for I/O is much more worrying, because I/O
latencies are much higher.  But it seems like most of the DIO path does
not hold locks across the I/O itself (and we are careful to avoid the
cases that do, like writing beyond EOF).

(Sorry for repeating myself; I have the feeling we are talking past
each other and I want to make sure we are on the same page.)
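To make the constraint concrete, here is a minimal, self-contained
sketch of the kind of submission we do on each core, using plain libaio
rather than Seastar (build with -laio).  The path, the 32MB
pre-truncation and the 128kB buffer are illustrative, and error handling
is minimal:

/* One thread submits an O_DIRECT write with io_submit() and reaps the
 * completion with io_getevents().  The file is truncated up front so
 * the write never extends EOF (one of the blocking cases we avoid). */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    enum { BUF_SIZE = 128 * 1024, FILE_SIZE = 32 * 1024 * 1024 };
    int fd = open("/var/tmp/commitlog-0.log",
                  O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;
    if (ftruncate(fd, FILE_SIZE) < 0)   /* pre-extend: writes stay within EOF */
        return 1;

    void *buf;
    if (posix_memalign(&buf, 4096, BUF_SIZE))  /* O_DIRECT needs alignment */
        return 1;
    memset(buf, 0, BUF_SIZE);

    io_context_t ctx = 0;
    if (io_setup(128, &ctx) < 0)
        return 1;

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, BUF_SIZE, 0);

    /* This is the call that must not sleep: with one thread per core,
     * any block here stalls the whole core. */
    if (io_submit(ctx, 1, cbs) != 1)
        return 1;

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);    /* reap the completion */

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}

The point is that io_submit() is the only place this thread can afford
to spend time; anything that sleeps inside it stalls the whole core.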

>
>>>   We submit an I/O which is
>>> asynchronous in nature and wait on a completion, which causes the cpu to
>>> schedule and execute another task until the completion is set by I/O
>>> completion (via an async callback). At that point, the issuing thread
>>> continues where it left off. I suspect I'm missing something... can you
>>> elaborate on what you'd do differently here (and how it helps)?
>> Just apply the same technique everywhere: convert locks to trylock +
>> schedule a continuation on failure.
>>
> I'm certainly not an expert on the kernel scheduling, locking and
> serialization mechanisms, but my understanding is that most things
> outside of spin locks are reschedule points. For example, the
> wait_for_completion() calls XFS uses to wait on I/O boil down to
> schedule_timeout() calls. Buffer locks are implemented as semaphores and
> down() can end up in the same place.

But, for the most part, XFS seems to be able to avoid sleeping.  The
call to __blockdev_direct_IO only launches the I/O, so any locking is
held only around CPU operations and, unless there is contention, won't
cause us to sleep in io_submit().

Trying to follow the code, it looks like xfs_get_blocks_direct (and
__blockdev_direct_IO's get_block parameter in general) is synchronous,
so we only avoid blocking when we are lucky enough to have everything
in cache.  If it isn't, we block right there.  I really hope I'm
misreading this and some other magic is happening elsewhere instead.
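For what it's worth, here is a userspace analogy of the
trylock-plus-continuation pattern I'm suggesting (this is not kernel
code; the lock, queue and task names are invented purely for
illustration, and it compiles with -pthread):

/* Userspace analogy of "trylock + schedule a continuation on failure":
 * instead of sleeping on a contended lock, the submitter queues the
 * work to be retried later and returns immediately. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct task {
    void (*fn)(void *);
    void *arg;
    struct task *next;
};

static pthread_mutex_t resource_lock = PTHREAD_MUTEX_INITIALIZER;
static struct task *deferred;          /* single-threaded continuation queue */

static void defer(void (*fn)(void *), void *arg)
{
    struct task *t = malloc(sizeof(*t));   /* error handling trimmed */
    t->fn = fn; t->arg = arg; t->next = deferred;
    deferred = t;
}

static void do_allocation(void *arg)
{
    if (pthread_mutex_trylock(&resource_lock) != 0) {
        defer(do_allocation, arg);     /* busy: retry later, don't sleep */
        return;
    }
    printf("allocation work for request %ld\n", (long)arg);
    pthread_mutex_unlock(&resource_lock);
}

int main(void)
{
    do_allocation((void *)1L);         /* may complete or get deferred */

    while (deferred) {                 /* event loop drains continuations */
        struct task *t = deferred;
        deferred = t->next;
        t->fn(t->arg);
        free(t);
    }
    return 0;
}

In the kernel the continuation would presumably be re-driven from I/O
completion or a work item rather than a hand-rolled loop, but the shape
is the same: never sleep in the submission path, defer and return.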

> Brian
>
>>>> Seastar (the async user framework which we use to drive xfs) makes writing
>>>> code like this easy, using continuations; but of course from ordinary
>>>> threaded code it can be quite hard.
>>>>
>>>> btw, there was an attempt to make ext[34] async using this method, but I
>>>> think it was ripped out.  Yes, the mortal remains can still be seen with
>>>> 'git grep EIOCBQUEUED'.
>>>>
>>>>>>> It sounds to me that first and foremost you want to make sure you don't
>>>>>>> have however many parallel operations you typically have running
>>>>>>> contending on the same inodes or AGs. Hint: creating files under
>>>>>>> separate subdirectories is a quick and easy way to allocate inodes under
>>>>>>> separate AGs (the agno is encoded into the upper bits of the inode
>>>>>>> number).
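For the testing suggestion, a minimal sketch of what that would look
like; the directory names and shard count are purely illustrative:

/* Sketch of the "one subdirectory per shard" hint above: inodes
 * created under different directories tend to land in different AGs,
 * so per-shard files spread allocation across AGs. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    char path[64];

    for (int shard = 0; shard < 8; shard++) {
        snprintf(path, sizeof(path), "shard-%d", shard);
        mkdir(path, 0755);             /* new dir -> likely a new AG */

        snprintf(path, sizeof(path), "shard-%d/commitlog-0.log", shard);
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd >= 0)
            close(fd);                 /* this inode inherits the dir's AG */
    }
    return 0;
}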
>>>>>> Unfortunately our directory layout cannot be changed.  And doesn't this
>>>>>> require having agcount == O(number of active files)?  That is easily in the
>>>>>> thousands.
>>>>>>
>>>>> I think Glauber's O(nr_cpus) comment is probably the more likely
>>>>> ballpark, but really it's something you'll probably just need to test to
>>>>> see how far you need to go to avoid AG contention.
>>>>>
>>>>> I'm primarily throwing the subdir thing out there for testing purposes.
>>>>> It's just an easy way to create inodes in a bunch of separate AGs so you
>>>>> can determine whether/how much it really helps with modified AG counts.
>>>>> I don't know enough about your application design to really comment on
>>>>> that...
>>>> We have O(cpus) shards that operate independently.  Each shard writes 32MB
>>>> commitlog files (that are pre-truncated to 32MB to allow concurrent writes
>>>> without blocking); the files are then flushed and closed, and later removed.
>>>> In parallel there are sequential writes and reads of large files (using
>>>> 128kB buffers), as well as random reads.  Files are immutable (append-only), and
>>>> if a file is being written, it is not concurrently read.  In general files
>>>> are not shared across shards.  All I/O is async and O_DIRECT.  open(),
>>>> truncate(), fdatasync(), and friends are called from a helper thread.
>>>>
>>>> As far as I can tell it should be a very friendly load for XFS and SSDs.
>>>>
>>>>>>>   Reducing the frequency of block allocation/frees might also be
>>>>>>> another help (e.g., preallocate and reuse files,
>>>>>> Isn't that discouraged for SSDs?
>>>>>>
>>>>> Perhaps, if you're referring to the fact that the blocks are never freed
>>>>> and thus never discarded..? Are you running fstrim?
>>>> mount -o discard.  And yes, overwrites are supposedly more expensive than
>>>> trim old data + allocate new data, but if you compare that with the work
>>>> XFS has to do, perhaps the tradeoff is bad.
>>>>
>>> Ok, my understanding is that '-o discard' is not recommended in favor of
>>> periodic fstrim for performance reasons, but that may or may not still
>>> be the case.
>> I understand that most SSDs have queued trim these days, but maybe I'm
>> optimistic.
>>


