
Re: [PATCH v2] [RFC] xfs: allocate log vector buffers outside CIL context lock

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH v2] [RFC] xfs: allocate log vector buffers outside CIL context lock
From: Jens Rosenboom <j.rosenboom@xxxxxxxx>
Date: Mon, 15 Feb 2016 12:57:51 +0100
Cc: Brian Foster <bfoster@xxxxxxxxxx>, xfs@xxxxxxxxxxx
In-reply-to: <20160214001645.GF14668@dastard>
References: <1453177919-17849-1-git-send-email-david@xxxxxxxxxxxxx> <20160120015853.GU6033@dastard> <20160126141733.GA48264@xxxxxxxxxxxxxxx> <CADr68Wa9vG=ZOn4gbHezEyOeM+tCo69s3WVqcVXnZrn+=DdoVA@xxxxxxxxxxxxxx> <20160214001645.GF14668@dastard>
2016-02-14 1:16 GMT+01:00 Dave Chinner <david@xxxxxxxxxxxxx>:
> On Sat, Feb 13, 2016 at 06:09:17PM +0100, Jens Rosenboom wrote:
>> 2016-01-26 15:17 GMT+01:00 Brian Foster <bfoster@xxxxxxxxxx>:
>> > On Wed, Jan 20, 2016 at 12:58:53PM +1100, Dave Chinner wrote:
>> >> From: Dave Chinner <dchinner@xxxxxxxxxx>
>> >>
>> >> One of the problems we currently have with delayed logging is that
>> >> under serious memory pressure we can deadlock memory reclaim. This
>> >> occurs when memory reclaim (such as run by kswapd) is reclaiming XFS
>> >> inodes and issues a log force to unpin inodes that are dirty in the
>> >> CIL.
> ....
>> >> That said, I don't have a reliable deadlock reproducer in the first
>> >> place, so I'm interested in hearing what people think about this
>> >> approach to solve the problem and ways to test and improve it.
>> >>
>> >> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
>> >> ---
>> >
>> > This seems reasonable to me in principle. It would be nice to have some
>> > kind of feedback in terms of effectiveness resolving the original
>> > deadlock report. I can't think of a good way of testing short of
>> > actually instrumenting the deadlock one way or another, unfortunately.
>> > Was there a user that might be willing to test or had a detailed enough
>> > description of the workload/environment?
>>
>> We have seen this issue on our production Ceph cluster sporadically
>> and have spent a long time trying to reproduce it in a lab environment.
> ....
>> kmem_alloc (mode:0x2408240)
>> Feb 13 10:51:57 storage-node35 kernel: [10562.614089] XFS:
>> ceph-osd(10078) possible memory allocation deadlock size 32856 in
>> kmem_alloc (mode:0x2408240)
>
> High order allocation of 32k. That implies a buffer size of at least
> 32k is in use. Can you tell me what the output of xfs_info <mntpt>
> is for one of your filesystems?

$ xfs_info /tmp/cbt/mnt/osd-device-0-data/
meta-data=/dev/sda2              isize=2048   agcount=4, agsize=97370688 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=389482752, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=190177, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

> I suspect you are using a 64k directory block size, in which case
> I'll ask "are you storing millions of files in a single directory"?
> If your answer is no, then "don't do that" is an appropriate
> solution because large directory block sizes are slower than the
> default (4k) for almost all operations until you get up into the
> millions of files per directory range.

These options are kind of standard folklore for setting up Ceph
clusters; I must admit that I have delayed testing their performance
implications until now. So many knobs to turn, so little time.

  mkfs_opts: '-f -i size=2048 -n size=64k'
  mount_opts: '-o inode64,noatime,logbsize=256k'
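
For reference, the re-test mentioned below only changed the directory
block size; roughly, with the device path taken from the xfs_info
output above:

  # same options as before, but with the default 4k directory block size
  mkfs.xfs -f -i size=2048 -n size=4k /dev/sda2
  mount -o inode64,noatime,logbsize=256k /dev/sda2 \
        /tmp/cbt/mnt/osd-device-0-data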

It turns out that when running with '-n size=4k' I indeed do not get
any warnings during a 10h test run. I'll try to come up with some more
detailed benchmarking of the possible performance impacts, too.
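
Something like this is what I have in mind, assuming fs_mark is
available on the test nodes (flags and numbers from memory, so treat it
as a sketch rather than a recipe):

  # create a large number of small files in a single directory tree,
  # once on a '-n size=4k' and once on a '-n size=64k' filesystem,
  # and compare files/sec as the directories fill up
  fs_mark -d /tmp/cbt/mnt/osd-device-0-data/bench \
          -n 100000 -t 16 -s 4096 -S0 -L 10 -k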

Am I right in assuming that this parameter cannot be changed after the
initial mkfs? In that case, getting a production-ready version of your
patch would probably still be valuable for cluster admins who want to
avoid having to move all of their data to new filesystems.
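
For other cluster admins wondering whether an existing OSD filesystem
is affected, the directory block size can be read back from xfs_info,
e.g.

  xfs_info <mntpt> | grep '^naming'
  # bsize=65536 means 64k directory blocks; the default is bsize=4096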

>> Soon after this, operations get so slow that the OSDs die because of
>> their suicide timeouts.
>>
>> Then I installed this patch (applied on top of kernel v4.4.1) onto 3
>> servers. The bad news is that I am still getting the kernel messages
>> on these machines. The good news, though, is that they appear at a
>> much lower frequency and the impact on performance also seems to be
>> lower, so the OSD processes on these three nodes did not get killed.
>
> Right, the patch doesn't fix the underlying issue that memory
> fragmentation can prevent high order allocation from succeeding for
> long periods.  However, it does ensure that the filesystem does not
> immediately deadlock memory reclaim when it happens so the system
> has a chance to recover. It still can deadlock the filesystem,
> because if we can't commit the transaction we can't unlock the
> objects in the transaction and everything can get stuck behind that
> if there's something sufficiently important in the blocked
> transaction.

So what would your success criteria for getting this patch into
upstream look like? Would a benchmark of the 64k directory block size
case on machines all running with patched kernels be interesting? Or
would that scenario disqualify itself as being mistuned in the first
place?
