sleeps and waits during io_submit
Avi Kivity
avi at scylladb.com
Tue Dec 1 14:56:01 CST 2015
On 12/01/2015 10:45 PM, Dave Chinner wrote:
> On Tue, Dec 01, 2015 at 09:01:13AM -0500, Glauber Costa wrote:
>> On Tue, Dec 1, 2015 at 8:58 AM, Avi Kivity <avi at scylladb.com> wrote:
>>> On 12/01/2015 03:11 PM, Brian Foster wrote:
>>>> It sounds to me like, first and foremost, you want to make sure that
>>>> however many parallel operations you typically have running aren't
>>>> contending on the same inodes or AGs. Hint: creating files under
>>>> separate subdirectories is a quick and easy way to allocate inodes under
>>>> separate AGs (the agno is encoded into the upper bits of the inode
>>>> number).
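For reference, XFS lays out the inode number as
(agno << (agblklog + inopblog)) | (agbno << inopblog) | offset, so the AG an
existing file landed in can be read straight off st_ino. A minimal C sketch
along those lines, assuming you look up the filesystem's agblklog and
inopblog values yourself (both show up in xfs_db's superblock print) and
pass them in by hand:

/* agno.c -- sketch: report which AG an inode lives in */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <file> <agblklog> <inopblog>\n", argv[0]);
        return 1;
    }

    struct stat st;
    if (stat(argv[1], &st) < 0) {
        perror("stat");
        return 1;
    }

    /* the AG index lives in the upper bits of the inode number */
    unsigned shift = atoi(argv[2]) + atoi(argv[3]);
    printf("inode %llu -> AG %llu\n",
           (unsigned long long)st.st_ino,
           (unsigned long long)st.st_ino >> shift);
    return 0;
}

Running that over a sample of the workload's files shows quickly whether
they are all piling into one or two AGs.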
>>>
>>> Unfortunately our directory layout cannot be changed. And doesn't this
>>> require having agcount == O(number of active files)? That is easily in the
>>> thousands.
>> Actually, wouldn't agcount == O(nr_cpus) be good enough?
> Not quite. What you need is agcount ~= O(nr_active_allocations).
Yes, this is what I mean by "active files".
>
> The difference is an allocation can block waiting on IO, and the
> CPU can then go off and run another process, which then tries to do
> an allocation. So you might only have 4 CPUs, but a workload
> can have a hundred active allocations at once (not uncommon in
> file server workloads).
But for us it's probably not much more than that. We try to restrict active I/Os to
the effective disk queue depth (more than that and they just turn sour
waiting in the disk queue).
> On workloads that are roughly 1 process per CPU, it's typical that
> agcount = 2 * N cpus gives pretty good results on large filesystems.
This is probably using sync calls. Using async calls you can have many
more I/Os in progress (but still limited by the effective disk queue depth).
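As a rough sketch of that submission pattern (the file name, QUEUE_DEPTH and
TOTAL_IOS below are made-up placeholders, with QUEUE_DEPTH standing in for
the effective disk queue depth), a libaio loop that never keeps more than
QUEUE_DEPTH writes in flight:

/* aio_depth.c -- sketch: cap in-flight async writes; build with -laio */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_DEPTH 128    /* placeholder for the effective disk queue depth */
#define BLOCK       4096
#define TOTAL_IOS   1024

int main(void)
{
    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(QUEUE_DEPTH, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK)) { fprintf(stderr, "memalign failed\n"); return 1; }
    memset(buf, 0xab, BLOCK);

    struct io_event events[QUEUE_DEPTH];
    long submitted = 0, inflight = 0;

    while (submitted < TOTAL_IOS || inflight > 0) {
        /* top up to QUEUE_DEPTH; this io_submit is the call that can
         * sleep in the kernel when the write needs a block allocation */
        while (inflight < QUEUE_DEPTH && submitted < TOTAL_IOS) {
            struct iocb cb, *cbp = &cb;   /* the kernel copies the iocb at submit */
            io_prep_pwrite(&cb, fd, buf, BLOCK, submitted * (long long)BLOCK);
            if (io_submit(ctx, 1, &cbp) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }
            submitted++;
            inflight++;
        }
        /* reap at least one completion before submitting more */
        int n = io_getevents(ctx, 1, QUEUE_DEPTH, events, NULL);
        if (n < 0) { fprintf(stderr, "io_getevents failed\n"); return 1; }
        inflight -= n;
    }

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}

Whether each of those submissions actually blocks comes down to whether the
write needs an allocation, which is where the AG layout discussed above
matters.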
> If you've got 400GB filesystems or you are using spinning disks,
> then you probably don't want to go above 16 AGs, because then you
> have problems with maintaining contiguous free space and you'll
> seek the spinning disks to death....
We're concentrating on SSDs for now.
>
>>>> 'mount -o ikeep,'
>>>
>>> Interesting. Our files are large so we could try this.
> Keep in mind that ikeep means that inode allocation permanently
> fragments free space, which can affect how large files are allocated
> once you truncate/rm the original files.
>
>
We can try to prime the filesystem by allocating a lot of inodes up front,
then removing them, so that inode allocation doesn't carve up free space
later, once the large files are in place.
Hurray ext2.
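A sketch of that priming pass (NFILES and the file naming are placeholders;
the point is only that, with ikeep, the inode chunks allocated by the
creates stay allocated after the unlinks, so later inode allocation can
reuse them instead of carving up free space):

/* prime.c -- sketch: allocate a lot of inodes up front, then remove them */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NFILES 100000L    /* placeholder: size this to the expected file population */

int main(void)
{
    char name[64];

    for (long i = 0; i < NFILES; i++) {
        snprintf(name, sizeof(name), "prime-%ld", i);
        int fd = open(name, O_CREAT | O_EXCL | O_WRONLY, 0600);
        if (fd < 0) { perror("open"); return 1; }
        close(fd);
    }
    for (long i = 0; i < NFILES; i++) {
        snprintf(name, sizeof(name), "prime-%ld", i);
        if (unlink(name) < 0) { perror("unlink"); return 1; }
    }
    return 0;
}

Presumably this would run once per target directory, so the retained chunks
end up in the AGs the real files will later use.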