
Re: sleeps and waits during io_submit

To: Avi Kivity <avi@xxxxxxxxxxxx>
Subject: Re: sleeps and waits during io_submit
From: Glauber Costa <glauber@xxxxxxxxxxxx>
Date: Tue, 1 Dec 2015 09:01:13 -0500
Cc: Brian Foster <bfoster@xxxxxxxxxx>, xfs@xxxxxxxxxxx
In-reply-to: <565DA784.5080003@xxxxxxxxxxxx>
References: <CAD-J=zZh1dtJsfrW_Gwxjg+qvkZMu7ED-QOXrMMO6B-G0HY2-A@xxxxxxxxxxxxxx> <20151130141000.GC24765@xxxxxxxxxxxxxxx> <565C5D39.8080300@xxxxxxxxxxxx> <20151130161438.GD24765@xxxxxxxxxxxxxxx> <565D639F.8070403@xxxxxxxxxxxx> <20151201131114.GA26129@xxxxxxxxxxxxxxx> <565DA784.5080003@xxxxxxxxxxxx>
On Tue, Dec 1, 2015 at 8:58 AM, Avi Kivity <avi@xxxxxxxxxxxx> wrote:
>
>
> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>
>> On Tue, Dec 01, 2015 at 11:08:47AM +0200, Avi Kivity wrote:
>>>
>>> On 11/30/2015 06:14 PM, Brian Foster wrote:
>>>>
>>>> On Mon, Nov 30, 2015 at 04:29:13PM +0200, Avi Kivity wrote:
>>>>>
>>>>> On 11/30/2015 04:10 PM, Brian Foster wrote:
>>
>> ...
>>>>
>>>> The agsize/agcount mkfs-time heuristics change depending on the type of
>>>> storage. A single AG can be up to 1TB and if the fs is not considered
>>>> "multidisk" (e.g., no stripe unit/width is defined), 4 AGs is the
>>>> default up to 4TB. If a stripe unit is set, the agsize/agcount is
>>>> adjusted depending on the size of the overall volume (see
>>>> xfsprogs-dev/mkfs/xfs_mkfs.c:calc_default_ag_geometry() for details).
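
To restate that heuristic as code, a rough sketch only: the 32-AG
multidisk default is from memory of xfsprogs and should be treated as
an assumption, and the real calc_default_ag_geometry() also adjusts
for stripe geometry and small filesystems.

    /* Rough sketch of the default AG count heuristic quoted above.
     * Simplified for illustration; see xfsprogs
     * mkfs/xfs_mkfs.c:calc_default_ag_geometry() for the real logic. */
    static unsigned int default_agcount(unsigned long long fs_bytes,
                                        int multidisk)
    {
        const unsigned long long TB = 1ULL << 40;

        if (multidisk)
            return 32;          /* assumed "multidisk" default */
        if (fs_bytes <= 4 * TB)
            return 4;           /* 4 AGs of up to 1TB each */
        return (unsigned int)((fs_bytes + TB - 1) / TB);  /* 1TB per AG */
    }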
>>>
>>> We'll experiment with this.  Surely it depends on more than the amount of
>>> storage?  If you have a high op rate you'll be more likely to excite
>>> contention, no?
>>>
>> Sure. The absolute optimal configuration for your workload probably
>> depends on more than storage size, but mkfs doesn't have that
>> information. In general, it tries to use the most reasonable
>> configuration based on the storage and expected workload. If you want to
>> tweak it beyond that, indeed, the best bet is to experiment with what
>> works.
>
>
> We will do that.
>
>>>>> Are those locks held around I/O, or just CPU operations, or a mix?
>>>>
>>>> I believe it's a mix of modifications and I/O, though it looks like some
>>>> of the I/O cases don't necessarily wait on the lock. E.g., the AIL
>>>> pushing case will trylock and defer to the next list iteration if the
>>>> buffer is busy.
>>>>
>>> Ok.  For us, sleeping in io_submit() is death because we have no other
>>> thread on that core to take its place.
>>>
>> The above is with regard to metadata I/O, whereas io_submit() is
>> obviously for user I/O.
>
>
> Won't io_submit() also trigger metadata I/O?  Or is that all deferred to
> async tasks?  I don't mind them blocking each other as long as they leave
> my io_submit() alone.
>
>>   io_submit() can probably block in a variety of
>> places afaict... it might have to read in the inode extent map, allocate
>> blocks, take inode/ag locks, reserve log space for transactions, etc.
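
For reference, the submission path being discussed looks roughly like
this (a minimal libaio sketch, error handling trimmed, link with
-laio); every blocking point listed above happens underneath the
io_submit() call:

    #define _GNU_SOURCE         /* O_DIRECT */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Submit one 4KB O_DIRECT write and reap it.  Sketch only: real
     * code would keep the context and buffer long-lived. */
    int submit_one(const char *path)
    {
        io_context_t ctx = 0;   /* must be zeroed before io_setup() */
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;
        int fd = open(path, O_WRONLY | O_DIRECT);

        if (fd < 0 || io_setup(1, &ctx) < 0)
            return -1;
        if (posix_memalign(&buf, 4096, 4096))   /* O_DIRECT alignment */
            return -1;
        io_prep_pwrite(&cb, fd, buf, 4096, 0);
        if (io_submit(ctx, 1, cbs) != 1)        /* may sleep in the kernel */
            return -1;
        io_getevents(ctx, 1, 1, &ev, NULL);     /* wait for completion */
        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
    }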
>
>
> Any chance of changing all that to be asynchronous?  Doesn't sound too hard,
> if somebody else has to do it.
>
>>
>> It sounds to me that first and foremost you want to make sure you don't
>> have however many parallel operations you typically have running
>> contending on the same inodes or AGs. Hint: creating files under
>> separate subdirectories is a quick and easy way to allocate inodes under
>> separate AGs (the agno is encoded into the upper bits of the inode
>> number).
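
To make that concrete, the AG number can be read back out of an inode
number like this (a sketch: the shift widths come from the superblock
fields sb_agblklog and sb_inopblog, visible via xfs_db; the constants
below are placeholders, not real values):

    #include <sys/stat.h>

    #define AGBLKLOG 21   /* log2(blocks per AG): placeholder, fs-specific */
    #define INOPBLOG 4    /* log2(inodes per block): placeholder */

    /* An XFS inode number packs, low to high: inode-within-block,
     * block-within-AG, and the AG number in the remaining high bits. */
    unsigned long ino_to_agno(const char *path)
    {
        struct stat st;

        if (stat(path, &st) < 0)
            return (unsigned long)-1;
        return st.st_ino >> (AGBLKLOG + INOPBLOG);
    }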
>
>
> Unfortunately our directory layout cannot be changed.  And doesn't this
> require having agcount == O(number of active files)?  That is easily in the
> thousands.

Actually, wouldn't agcount == O(nr_cpus) be good enough?

>
>>   Reducing the frequency of block allocations and frees might also
>> help (e.g., preallocate and reuse files,
>
>
> Isn't that discouraged for SSDs?
>
> We can do that for a subset of our files.
>
> We do use XFS_IOC_FSSETXATTR though.
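
For the archives, the shape of it is roughly this (a sketch: assumes
the xfsprogs headers for struct fsxattr, and the file must still be
empty, otherwise XFS refuses the extent size hint):

    #define _GNU_SOURCE          /* fallocate() */
    #include <xfs/xfs.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>

    /* Set an extent size hint and preallocate, so that io_submit()
     * rarely needs to allocate blocks.  extsize_bytes is workload
     * specific (an assumption here, not a recommendation). */
    int prep_file(int fd, unsigned int extsize_bytes, off_t prealloc_bytes)
    {
        struct fsxattr fsx;

        if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) < 0)
            return -1;
        fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;    /* honor fsx_extsize */
        fsx.fsx_extsize = extsize_bytes;
        if (ioctl(fd, XFS_IOC_FSSETXATTR, &fsx) < 0)
            return -1;
        return fallocate(fd, 0, 0, prealloc_bytes);
    }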
>
>> 'mount -o ikeep,'
>
>
> Interesting.  Our files are large so we could try this.
>
>> etc.). Beyond that, you probably want to make sure the log is large
>> enough to support all concurrent operations. See the xfs_log_grant_*
>> tracepoints for a window into if/how long transaction reservations might
>> be waiting on the log.
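
For anyone following along, those events can also be flipped on from a
monitoring process (a sketch: assumes debugfs mounted at
/sys/kernel/debug and that the xfs_log_grant_* names match; check
events/xfs/ on your kernel first):

    #include <glob.h>
    #include <stdio.h>

    /* Enable every xfs_log_grant_* tracepoint; needs root.  Results
     * then show up in /sys/kernel/debug/tracing/trace_pipe. */
    static int enable_log_grant_events(void)
    {
        glob_t g;
        size_t i;

        if (glob("/sys/kernel/debug/tracing/events/xfs/"
                 "xfs_log_grant_*/enable", 0, NULL, &g))
            return -1;
        for (i = 0; i < g.gl_pathc; i++) {
            FILE *f = fopen(g.gl_pathv[i], "w");

            if (f) {
                fputs("1", f);   /* "1" enables the event */
                fclose(f);
            }
        }
        globfree(&g);
        return 0;
    }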
>
>
> I see that on a 400G fs, the log is 180MB.  That seems plenty large for
> write operations that are mostly large and sequential, though I have no
> real feel for the numbers.  Will keep an eye on this.
>
> Thanks for all the info.
>
>
>> Brian
>>
