sleeps and waits during io_submit

Avi Kivity avi at scylladb.com
Wed Dec 2 02:34:14 CST 2015



On 12/02/2015 02:13 AM, Brian Foster wrote:
> On Tue, Dec 01, 2015 at 09:26:42PM +0200, Avi Kivity wrote:
>> On 12/01/2015 08:51 PM, Brian Foster wrote:
>>> On Tue, Dec 01, 2015 at 07:09:29PM +0200, Avi Kivity wrote:
>>>> On 12/01/2015 06:29 PM, Brian Foster wrote:
>>>>> On Tue, Dec 01, 2015 at 06:08:51PM +0200, Avi Kivity wrote:
>>>>>> On 12/01/2015 06:01 PM, Brian Foster wrote:
>>>>>>> On Tue, Dec 01, 2015 at 05:22:38PM +0200, Avi Kivity wrote:
>>>>>>>> On 12/01/2015 04:56 PM, Brian Foster wrote:
>>>>>>>>> On Tue, Dec 01, 2015 at 03:58:28PM +0200, Avi Kivity wrote:
>>>>>>>>>> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>>>>>>>>>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>>>>>>>>>>>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>>>>>>>>>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>>>>>>>>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
> ...
>>>> The case of waiting for I/O is much more worrying, because I/O latencies are
>>>> much higher.  But it seems like most of the DIO path does not trigger
>>>> locking around I/O (and we are careful to avoid the ones that do, like
>>>> writing beyond eof).
>>>>
>>>> (sorry for repeating myself, I have the feeling we are talking past each
>>>> other and want to be on the same page)
>>>>
>>> Yeah, my point is that just because the thread blocked on I/O
>>> doesn't mean the cpu can't carry on with some useful work for another
>>> task.
>> In our case, there is no other task.  We run one thread per logical core, so
>> if that thread gets blocked, the cpu idles.
>>
>> The whole point of io_submit() is to issue an I/O and let the caller
>> continue processing immediately.  It is the equivalent of O_NONBLOCK for
>> networking code.  If O_NONBLOCK did block from time to time, practically all
>> modern network applications would see a huge performance drop.
>>
> Ok, but my understanding is that O_NONBLOCK would return an error code
> in the blocking case such that userspace can do something else or retry
> from a blockable context.

I did not mean an exact equivalent, but something in the same spirit: 
allowing a thread to perform an I/O task (networking or file I/O) in 
parallel with computation.

For networking, returning an error is fine because there exists a 
notification (epoll) to tell userspace when a retry would succeed. For 
file I/O, there isn't one.  Still, returning an error is better than 
nothing because then, as you say, you can retry in a blockable context.
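
To make that concrete, here is a minimal sketch (not lifted from our 
code) of why an error return is enough on the networking side: a 
nonblocking send that hits EAGAIN is simply parked until epoll reports 
the socket writable, then retried from the event loop.

#include <sys/epoll.h>
#include <sys/socket.h>
#include <cerrno>
#include <cstddef>

// Try to send without blocking; on EAGAIN, arm epoll so the event loop
// retries once the socket becomes writable again.
bool send_or_defer(int epfd, int sock, const char* buf, size_t len)
{
    if (send(sock, buf, len, MSG_DONTWAIT) >= 0)
        return true;                              // sent (possibly short)
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        epoll_event ev{};
        ev.events = EPOLLOUT | EPOLLONESHOT;      // wake us when writable
        ev.data.fd = sock;
        epoll_ctl(epfd, EPOLL_CTL_MOD, sock, &ev);
        return false;                             // retried from the loop
    }
    return false;                                 // hard error
}

For a regular file there is no EPOLLOUT equivalent to wait on, which is 
why we lean so heavily on io_submit() itself not blocking.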

>   I think this is similar to what hch posted wrt the pwrite2() bits
> for nonblocking buffered I/O or what I was asking
> about earlier on with regard to returning an error if some blocking
> would otherwise occur.

Yes.  Anything except silently blocking!
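
For files the shape would be the same, assuming the interface returns 
an error such as EAGAIN rather than blocking.  Sketch only; both 
helpers below are hypothetical, and the real flag is whatever the 
pwrite2() proposal ends up defining:

#include <sys/types.h>
#include <cerrno>

// Hypothetical: a pwrite() variant that fails with EAGAIN instead of blocking.
ssize_t pwrite_nonblocking(int fd, const void* buf, size_t len, off_t off);
// Hypothetical: queue the write on a thread pool where blocking is fine.
ssize_t submit_to_blocking_worker(int fd, const void* buf, size_t len, off_t off);

// Try the nonblocking variant first; if the kernel reports it would
// have to block, hand the request to one of our own worker threads,
// where sleeping is acceptable.
ssize_t write_without_stalling(int fd, const void* buf, size_t len, off_t off)
{
    ssize_t n = pwrite_nonblocking(fd, buf, len, off);
    if (n >= 0 || errno != EAGAIN)
        return n;
    return submit_to_blocking_worker(fd, buf, len, off);
}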

>
>>>>>>>   We submit an I/O which is
>>>>>>> asynchronous in nature and wait on a completion, which causes the cpu to
>>>>>>> schedule and execute another task until the completion is set by I/O
>>>>>>> completion (via an async callback). At that point, the issuing thread
>>>>>>> continues where it left off. I suspect I'm missing something... can you
>>>>>>> elaborate on what you'd do differently here (and how it helps)?
>>>>>> Just apply the same technique everywhere: convert locks to trylock +
>>>>>> schedule a continuation on failure.
>>>>>>
>>>>> I'm certainly not an expert on the kernel scheduling, locking and
>>>>> serialization mechanisms, but my understanding is that most things
>>>>> outside of spin locks are reschedule points. For example, the
>>>>> wait_for_completion() calls XFS uses to wait on I/O boil down to
>>>>> schedule_timeout() calls. Buffer locks are implemented as semaphores and
>>>>> down() can end up in the same place.
>>>> But, for the most part, XFS seems to be able to avoid sleeping.  The call to
>>>> __blockdev_direct_IO only launches the I/O, so any locking is only around
>>>> cpu operations and, unless there is contention, won't cause us to sleep in
>>>> io_submit().
>>>>
>>>> Trying to follow the code, it looks like xfs_get_blocks_direct (and
>>>> __blockdev_direct_IO's get_block parameter in general) is synchronous, so
>>>> we're simply relying on everything being in cache.  If it isn't, we block
>>>> right there.  I really hope I'm misreading this and some other magic is
>>>> happening elsewhere.
>>>>
>>> Nope, it's synchronous from a code perspective. The
>>> xfs_bmapi_read()->xfs_iread_extents() path could have to read in the
>>> inode bmap metadata if it hasn't been done already. Note that this
>>> should only happen once as everything is stored in-core, so in most
>>> cases this is skipped. It's also possible extents are read in via some
>>> other path/operation on the inode before an async I/O happens to be
>>> submitted (e.g., see some of the other xfs_bmapi_read() callers).
>> Is there (could we add) some ioctl to prime this cache?  We could call it
>> from a worker thread where we don't mind blocking during open.
>>
> I suppose that's possible, or the worker thread could perform some
> existing operation known to prime the cache. I don't think it's worth
> getting into without a concrete example, however. The extent read
> example we're batting around might not ever be a problem (as you've
> noted due to file size), if files are truncated and recycled, for
> example.
>
>> What is the eviction policy for this cache?   Is it simply the block
>> device's page cache?
>>
> IIUC the extent list stays around until the inode is reclaimed. There's
> a separate buffer cache for metadata buffers. Both types of objects
> would be reclaimed based on memory pressure.

It comes down to size of disk, size of memory, and average file size.  I 
expect that with current disk and memory sizes the metadata is quite 
small, so this might not be a problem, and even a cold start would 
self-prime in a reasonably short time.
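
If it does turn out to matter, one candidate for "an existing operation 
known to prime the cache" might be FIEMAP: I assume (not verified) that 
an FS_IOC_FIEMAP ioctl has to walk the inode's extent map and therefore 
pulls it in-core.  Something like this from a worker thread at open 
time, where blocking is fine:

#include <linux/fiemap.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <cstring>

// Force the extent map to be read in while we are still allowed to block,
// so that a later io_submit() on this fd hopefully finds it cached.
void prime_extent_cache(int fd)
{
    struct fiemap fm;
    std::memset(&fm, 0, sizeof(fm));
    fm.fm_length = FIEMAP_MAX_OFFSET;   // cover the whole file
    fm.fm_extent_count = 0;             // just count extents; no output array
    ioctl(fd, FS_IOC_FIEMAP, &fm);      // may block on metadata I/O here
}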

>
>> What about the write path, will we see the same problems there?  I would
>> guess the problem is less severe there if the metadata is written with
>> writeback policy.
>>
> Metadata is modified in-core and handed off to the logging
> infrastructure via a transaction. The log is flushed to disk some time
> later and metadata writeback occurs asynchronously via the xfsaild
> thread.

Unless, I expect, the log is full.  Since we're hammering on the disk 
quite heavily, the log would be fighting with user I/O and possibly losing.

Does XFS throttle user I/O in order to get the log buffers recycled faster?

Is there any way for us to keep track of the log fill level, and reduce 
disk pressure when it gets full?

Oh, you answered that already: /sys/fs/xfs/<device>/log/*.
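
Assuming those attributes include log_head_lsn and log_tail_lsn (each 
formatted as "cycle:block" -- my assumption), a monitoring thread could 
estimate how far the head has run ahead of the tail and back off 
flush-heavy work when the gap grows.  Rough sketch:

#include <fstream>
#include <string>
#include <utility>

// Parse a "cycle:block" pair from one of the sysfs log attributes.
static std::pair<long, long> read_lsn(const std::string& path)
{
    std::ifstream f(path);
    long cycle = 0, block = 0;
    char colon = ':';
    f >> cycle >> colon >> block;
    return {cycle, block};
}

// Approximate head/tail distance in log blocks; a large value means the
// log is filling up and metadata flushing will compete with user I/O.
long log_gap_blocks(const std::string& dev, long log_size_blocks)
{
    const std::string base = "/sys/fs/xfs/" + dev + "/log/";
    auto head = read_lsn(base + "log_head_lsn");
    auto tail = read_lsn(base + "log_tail_lsn");
    return (head.first - tail.first) * log_size_blocks
           + (head.second - tail.second);
}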

>
> Brian
>
>>> Either way, the extents have to be read in at some point and I'd expect
>>> that cpu to schedule onto some other task while that thread waits on I/O
>>> to complete (read-ahead could also be a factor here, but I haven't
>>> really dug into how that is triggered for buffers).
>> To provide an example, our application, which is a database, faces this
>> exact problem at a higher level.  Data is stored in data files, and data
>> items' locations are stored in index files.  When we read a bit of data, we
>> issue an index read and pass it a continuation to be executed when the read
>> completes.  That continuation parses the data and passes it to the code that
>> prepares it for merging with data from other data files, and for an eventual
>> return to the user.
>>
>> Having written code for over a year in this style, I've come to expect it to
>> be used everywhere asynchronous I/O is used, but I realize it is fairly hard
>> without good support from a framework that allows continuations to be
>> composed in a natural way.
>>
>>
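
For anyone unfamiliar with that style, a minimal sketch of its shape 
(not our framework's actual API; all names below are made up): every 
read takes a continuation, so the issuing thread never blocks and 
simply moves on to other work.

#include <cstdint>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using buffer = std::vector<uint8_t>;
using continuation = std::function<void(buffer)>;

// Hypothetical async read: submits via io_submit() under the hood and
// runs 'k' from the event loop when the completion arrives.
void async_read(int fd, uint64_t offset, size_t len, continuation k);

// Hypothetical parser for a fixed-size index entry.
uint64_t parse_index_entry(const buffer& idx);

// Read an index entry, then chain the data read off its completion.
void read_item(int index_fd, int data_fd, uint64_t index_off, continuation done)
{
    async_read(index_fd, index_off, 16, [=](buffer idx) {
        uint64_t data_off = parse_index_entry(idx);
        async_read(data_fd, data_off, 4096, [done](buffer item) {
            done(std::move(item));     // on to merging, then back to the user
        });
    });
}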


