xfs

sleeps and waits during io_submit

To: xfs@xxxxxxxxxxx, Avi Kivity <avi@xxxxxxxxxxxx>, david@xxxxxxxxxxxxx
Subject: sleeps and waits during io_submit
From: Glauber Costa <glauber@xxxxxxxxxxxx>
Date: Fri, 27 Nov 2015 21:43:50 -0500
Hello my dear XFSers,

For those of you who don't know, we at ScyllaDB produce a modern NoSQL
data store that, at the moment, runs on top of XFS only. We deal
exclusively with asynchronous and direct IO, due to our
thread-per-core architecture. Due to that, we avoid issuing any
operation that will sleep.

While debugging an extreme case of bad performance (most likely
related to a not-so-great disk), I have found a variety of cases in
which XFS blocks. To find those, I have used perf record -e
sched:sched_switch -p <pid_of_db>, and I am attaching the perf report
as xfs-sched_switch.log. Please note that this doesn't tell me for how
long we block, but as mentioned before, blocking operations outside
our control are detrimental to us regardless of the elapsed time.
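
Since the sched_switch trace does not say how long each block lasts, one rough way to get a number is to time the io_submit call itself. What follows is only a minimal sketch in C, assuming raw syscalls rather than libaio. The names elapsed_ns and timed_io_submit are hypothetical, not part of any library, and a long wall-clock time only suggests that the call slept; it does not prove it.

```c
#define _GNU_SOURCE
#include <linux/aio_abi.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC samples. */
int64_t elapsed_ns(const struct timespec *start, const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (int64_t)(end->tv_nsec - start->tv_nsec);
}

/*
 * Hypothetical wrapper: issue io_submit(2) and report in *ns_out how much
 * wall-clock time the syscall took. Returns whatever the syscall returns
 * (number of iocbs submitted, or -1 with errno set).
 */
long timed_io_submit(aio_context_t ctx, long nr, struct iocb **iocbs,
                     int64_t *ns_out)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    long ret = syscall(SYS_io_submit, ctx, nr, iocbs);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *ns_out = elapsed_ns(&t0, &t1);
    return ret;
}
```

Logging the calls whose elapsed time exceeds some threshold would give an approximate distribution of the stalls to set beside the perf report.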

For those who are not acquainted with our internals, please ignore
everything in that file but the xfs functions. For the xfs symbols,
there are two kinds of events: the ones that are children of
io_submit, where we don't tolerate blocking, and the ones that are
children of our helper IO thread, to which we push big operations that
we know will block until we can get rid of them all. We care about the
former and ignore the latter.

Please allow me to ask you a couple of questions about those findings.
If we are doing anything wrong, advice on best practices is truly
welcome.

1) xfs_buf_lock -> xfs_log_force.

I started by wondering what would make xfs_log_force sleep. But then I
noticed that xfs_log_force will only be called when a buffer is
marked stale. Most of the time, a buffer seems to be marked stale
because of errors. Although that is not my case (more on that below),
it got me thinking that maybe the right thing to do would be to avoid
hitting this case altogether?

The file example-stale.txt contains a backtrace of the case where we
are being marked as stale. It seems to be happening when we convert
the inode's extents from unwritten to real. Can this case be
avoided? I won't pretend I know the intricacies of this, but couldn't
we be keeping extents from the very beginning to avoid creating stale
buffers?

2) xfs_buf_lock -> down

This is one I truly don't understand. What can be causing contention
in this lock? We never have two different cores writing to the same
buffer, nor should we have the same core doing so.

3) xfs_file_aio_write_checks -> file_update_time -> xfs_vn_update_time

You guys seem to have an interface to avoid that, by setting the
FMODE_NOCMTIME flag. This is done by issuing the open by handle ioctl,
which will set this flag for all regular files. That's great, but that
ioctl requires CAP_SYS_ADMIN, which is a big no for us, since we run
our server as an unprivileged user. I don't understand, however, why
such a strict check is needed. If we have full rights on the
filesystem, why can't we issue this operation? In my view, CAP_FOWNER
should already be enough. I do understand the handles have to be stable
and a file can have its ownership changed, in which case the previous
owner would keep the handle valid. Is that the reason you went with
the most restrictive capability?
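
To make the capability question concrete, here is a minimal sketch in C of how a process could check which capabilities it actually holds before attempting the open by handle path. It assumes the raw capget(2) syscall and the macros from <linux/capability.h>. The name has_effective_cap is a hypothetical helper, not an existing API.

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the calling thread holds `cap` in its
 * effective set, 0 if it does not, and -1 if capget(2) fails.
 */
int has_effective_cap(int cap)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

    memset(&hdr, 0, sizeof(hdr));
    memset(data, 0, sizeof(data));
    hdr.version = _LINUX_CAPABILITY_VERSION_3;
    hdr.pid = 0; /* 0 means the calling thread */

    if (syscall(SYS_capget, &hdr, data) != 0)
        return -1;

    /* Capabilities are spread over 32-bit words; pick the word and bit. */
    return (data[CAP_TO_INDEX(cap)].effective & CAP_TO_MASK(cap)) != 0;
}
```

An unprivileged server like ours would be expected to see has_effective_cap(CAP_SYS_ADMIN) return 0, which is exactly why the ioctl is out of reach today.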

Attachment: xfs-sched_switch.log
Description: Text document

Attachment: example-stale.txt
Description: Text document
