
Re: sleeps and waits during io_submit

To: Brian Foster <bfoster@xxxxxxxxxx>, Glauber Costa <glauber@xxxxxxxxxxxx>
Subject: Re: sleeps and waits during io_submit
From: Avi Kivity <avi@xxxxxxxxxxxx>
Date: Mon, 30 Nov 2015 16:29:13 +0200
Cc: xfs@xxxxxxxxxxx, david@xxxxxxxxxxxxx
In-reply-to: <20151130141000.GC24765@xxxxxxxxxxxxxxx>
References: <CAD-J=zZh1dtJsfrW_Gwxjg+qvkZMu7ED-QOXrMMO6B-G0HY2-A@xxxxxxxxxxxxxx> <20151130141000.GC24765@xxxxxxxxxxxxxxx>


On 11/30/2015 04:10 PM, Brian Foster wrote:
>> 2) xfs_buf_lock -> down
>> This is one I truly don't understand. What can be causing contention
>> in this lock? We never have two different cores writing to the same
>> buffer, nor should we have the same core doing so.

> This is not one single lock. An XFS buffer is the data structure used to
> modify/log/read-write metadata on-disk and each buffer has its own lock
> to prevent corruption. Buffer lock contention is possible because the
> filesystem has bits of "global" metadata that have to be updated via
> buffers.

> For example, usually one has multiple allocation groups to maximize
> parallelism, but we still have per-ag metadata that has to be tracked
> globally with respect to each AG (e.g., free space trees, inode
> allocation trees, etc.). Any operation that affects this metadata (e.g.,
> block/inode allocation) has to lock the agi/agf buffers along with any
> buffers associated with the modified btree leaf/node blocks, etc.

> One example in your attached perf traces has several threads looking to
> acquire the AGF, which is a per-AG data structure for tracking free
> space in the AG. One thread looks like the inode eviction case noted
> above (freeing blocks), another looks like a file truncate (also freeing
> blocks), and yet another is a block allocation due to a direct I/O
> write. Were any of these operations directed to an inode in a separate
> AG, they would be able to proceed in parallel (but I believe they would
> still hit the same codepaths as far as perf can tell).

I guess we can mitigate (but not eliminate) this by creating more allocation groups. What is the default value for agsize? Are there any downsides to decreasing it, besides consuming more memory?

Are those locks held around I/O, or just CPU operations, or a mix?
