[Top] [All Lists]

Re: XFS Syncd

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: XFS Syncd
From: Shrinand Javadekar <shrinand@xxxxxxxxxxxxxx>
Date: Tue, 2 Jun 2015 11:43:30 -0700
Cc: xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20150410131245.GK15810@dastard>
References: <CABppvi6pC4qEFZUTesbT0v5agbd67MP4dEoUbaVFwEyCv4h21g@xxxxxxxxxxxxxx> <20150410063210.GJ15810@dastard> <CABppvi4e_xEMY7tDHtEo6miZcN2AZ-mFMHXKaUS0hfpx6AMt0w@xxxxxxxxxxxxxx> <20150410072100.GL13731@dastard> <CABppvi437S9e+DEFOi6ECPu8=AnEK0V=5rRmU5Of1_XtWiQbfA@xxxxxxxxxxxxxx> <20150410131245.GK15810@dastard>
Sorry, I dropped the ball on this one. We found some other problems
and I was busy fixing them.

So, the xfsaild thread/s that kick in every 30 seconds are hitting us
pretty badly. Here's a graph with the latest tests I ran. We get great
throughput for ~18 seconds but then the world pretty much stops for
the next ~12 seconds or so making the final numbers look pretty bad.
This particular graph was plotted when the disk had ~150GB of data
(total capacity of 3TB).

I am using a 3.16.0-38-generic kernel (upgraded since the time I wrote
the first email on this thread).

I know fs.xfs.xfssyncd_centisecs controls this interval of 30 seconds.
What other options can I tune for making this work better?

We have 8 disks. And unfortunately, all 8 disks are brought to a halt
every 30 seconds. Does XFS have options to only work on a subset of
disks at a time?

Also, what does XFS exactly do every 30 seconds? If I understand it
right, metadata can be 3 locations:

1. Memory
2. Log buffer on disk
3. Final location on disk.

Every 30 seconds, from where to where is this metadata being copied?
Are there ways to just disable this to avoid the stop-of-the-world
pauses (at the cost of lower but sustained performance)?

Thanks in advance.

On Fri, Apr 10, 2015 at 6:12 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Fri, Apr 10, 2015 at 12:29:34AM -0700, Shrinand Javadekar wrote:
>> On Fri, Apr 10, 2015 at 12:21 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > On Thu, Apr 09, 2015 at 11:51:17PM -0700, Shrinand Javadekar wrote:
>> >> Thanks for the reply Dave!
>> >>
>> >> >
>> >> > Oh, right, it's that workqueue we removed in late 2012 (in the 3.7
>> >> > cycle) because it was redundant. The only remaining fragment of it
>> >> > is the xfslogd. What kernel are you running?
>> >>
>> >> I am running 3.13.0-39-generic on Ubuntu 14.04.
>> >
>> > You can't be running that kernel if you are seeing a process called
>> > xfssyncd in your traces.
>> I don't see a process called xfssyncd. I started investigating the 30
>> second pauses but looking for xfs config options in sysctl. We found
>> the option "fs.xfs.xfssyncd_centisecs" whose documentation[1] says it
>> is the interval in which the "filesystem flushes metadata out to disk
>> and runs internal cache cleanup routines".
> Right, that's what it does, but even though xfssynd has been
> removed, we can't remove or rename the sysctl because it's part
> of the userspace ABI.
>> I tweaked this setting and saw the corresponding changes in the
>> performance. Bumping this value up saw pauses at longer interval,
>> decreasing this interval saw pauses more frequently.
> Ok, so it's not speculative preallocation that is the problem,
> it's metadata writeback that is causing the stalls. I forgot the log
> worker is also triggered of that sysctl, and so...
>> >> >> I am seeing a behavior where the system pretty much stalls for ~5
>> >> >> seconds after every 30 seconds. I see that the # of ios goes up but
>> >> >> the actual write bandwidth during this 5 second period is very low
>> >> >> (see attached images). After a fair bit of investigation, we've
>> >> >> narrowed down the problem to XFS's syncd (fs.xfs.xfssyncd_centisecs).
>> >> >> This runs at a default interval of 30 seconds.
>> >> >
>> >> > It's doing background inode reclaim which, under some circumstances,
>> >> > involves truncating specualtive allocation beyond EOF before reclaim
>> >> > occurs, which results in transactions and inode writeback. It was
>> >> > highly inefficient, which is why we replaced it.
>> >>
>> >> Oh.. I see. So, this isn't even actual filesystem metadata. And there
>> >> is no option to turn the speculative allocation on/off?
>> >
>> > You can turn it off, but now you're jumping to conclusions that this
>> > is the cause of your problems. Perhaps you should do some
>> > tracing/profiling whenthe system goes through these stalls to see
>> > what is actually happening? "perf top" and trace-cmd are very useful
>> > for this sort of investigation...
>> Let me dig deeper here using "perf top" and see what's running during
>> these stalls.
> ... it's much more likely that filesystem metadata writeback is
> being run every 30s, and that's what is causing the issue. i.e. you
> should see the xfsaild issuing lots of IO very quickly.
> See, fsync() doesn't cause metadata writeback; only data writeback.
> The metadata is written to the log, not it's final place on disk
> during fsync. So some time later it's got to me written back because
> it is still dirty in memory, and that's most likely what is
> happening.
> My guess is you have RAID5 or RAID6 and the partial stripe writes
> are causing it to do RMW cycles and hence it's really, really slow
> when metadata gets written...
> Probably too late now as I've now basically asked for all this info,
> but:
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> Cheers,
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

Attachment: xfs_pauses.png
Description: PNG image

<Prev in Thread] Current Thread [Next in Thread>