
Re: How to handle TIF_MEMDIE stalls?

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: How to handle TIF_MEMDIE stalls?
From: Brian Foster <bfoster@xxxxxxxxxx>
Date: Mon, 2 Mar 2015 07:46:26 -0500
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>, "Theodore Ts'o" <tytso@xxxxxxx>, Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>, dchinner@xxxxxxxxxx, oleg@xxxxxxxxxx, xfs@xxxxxxxxxxx, mhocko@xxxxxxx, linux-mm@xxxxxxxxx, mgorman@xxxxxxx, rientjes@xxxxxxxxxx, akpm@xxxxxxxxxxxxxxxxxxxx, torvalds@xxxxxxxxxxxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20150302001723.GO4251@dastard>
References: <20150217225430.GJ4251@dastard> <20150219102431.GA15569@xxxxxxxxxxxxxxxxxxxxxx> <20150219225217.GY12722@dastard> <20150221235227.GA25079@xxxxxxxxxxxxxxxxxxxxxx> <20150223004521.GK12722@dastard> <20150228162943.GA17989@xxxxxxxxxxxxxxxxxxxxxx> <20150228164158.GE5404@xxxxxxxxx> <20150228221558.GA23028@xxxxxxxxxxxxxxxxxxxxxx> <20150301214805.GN4251@dastard> <20150302001723.GO4251@dastard>
User-agent: Mutt/1.5.23 (2014-03-12)
On Mon, Mar 02, 2015 at 11:17:23AM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2015 at 08:48:05AM +1100, Dave Chinner wrote:
> > On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> > > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> > > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > > > > 
> > > > > I'm trying to figure out if the current nofail allocators can get
> > > > > their memory needs figured out beforehand.  And reliably so - what
> > > > > good are estimates that are right 90% of the time, when failing the
> > > > > allocation means corrupting user data?  What is the contingency plan?
> > > > 
> > > > In the ideal world, we can figure out the exact memory needs
> > > > beforehand.  But we live in an imperfect world, and given that block
> > > > devices *also* need memory, the answer is "of course not".  We can't
> > > > be perfect.  But we can least give some kind of hint, and we can offer
> > > > to wait before we get into a situation where we need to loop in
> > > > GFP_NOWAIT --- which is the contingency/fallback plan.
> > > 
> > > Overestimating should be fine; the result would be a bit of false memory
> > > pressure.  But underestimating and looping can't be an option or the
> > > original lockups will still be there.  We need to guarantee forward
> > > progress or the problem is somewhat mitigated at best - only now with
> > > quite a bit more complexity in the allocator and the filesystems.
> > 
> > The additional complexity in XFS is actually quite minor, and
> > initial "rough worst case" memory usage estimates are not that hard
> > to measure....
> 
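
As an aside, the reservation model being discussed here is simple enough
to sketch. The toy userspace illustration below is not a proposed API,
and every name in it (mem_resv, resv_*) is made up; it only shows the
shape of the idea: a worst-case estimate is reserved up front,
allocations made in transaction context are charged against it, and
whatever is left of an overestimate is handed back at commit. Only an
underestimate would force us back to looping.

/*
 * Toy userspace model of a transaction memory reservation; a sketch of
 * the scheme under discussion, not kernel code, and all names are
 * hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>

struct mem_resv {
	size_t	granted;	/* worst-case estimate granted up front */
	size_t	used;		/* charged by allocations so far */
};

/* Grant the estimate; a real implementation would block until it can. */
static int resv_acquire(struct mem_resv *r, size_t worst_case)
{
	r->granted = worst_case;
	r->used = 0;
	return 0;
}

/* Allocations made in transaction context draw from the grant. */
static void *resv_alloc(struct mem_resv *r, size_t size)
{
	if (r->used + size > r->granted)
		return NULL;	/* underestimate: the case that must not happen */
	r->used += size;
	return malloc(size);
}

/* At commit the unused part of an overestimate is simply returned. */
static void resv_release(struct mem_resv *r)
{
	printf("returning %zu of %zu reserved bytes\n",
	       r->granted - r->used, r->granted);
	r->granted = 0;
	r->used = 0;
}

int main(void)
{
	struct mem_resv r;
	void *p;

	resv_acquire(&r, 4096);	/* e.g. the max_size seen for a trans type */
	p = resv_alloc(&r, 1024);
	resv_release(&r);
	free(p);
	return 0;
}
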
> And, just to point out that the OOM killer can be invoked without a
> single transaction-based filesystem ENOMEM failure, here's what
> xfs/084 does on 4.0-rc1:
> 
> [  148.820369] resvtest invoked oom-killer: gfp_mask=0x201da, order=0, 
> oom_score_adj=0
> [  148.822113] resvtest cpuset=/ mems_allowed=0
> [  148.823124] CPU: 0 PID: 4342 Comm: resvtest Not tainted 4.0.0-rc1-dgc+ #825
> [  148.824648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Bochs 01/01/2011
> [  148.826471]  0000000000000000 ffff88003ba2b988 ffffffff81dcb570 
> 000000000000000c
> [  148.828220]  ffff88003bb06380 ffff88003ba2ba08 ffffffff81dc5c2f 
> 0000000000000000
> [  148.829958]  0000000000000000 ffff88003ba2b9a8 0000000000000206 
> ffff88003ba2b9d8
> [  148.831734] Call Trace:
> [  148.832325]  [<ffffffff81dcb570>] dump_stack+0x4c/0x65
> [  148.833493]  [<ffffffff81dc5c2f>] dump_header.isra.12+0x79/0x1cb
> [  148.834855]  [<ffffffff8117db69>] oom_kill_process+0x1c9/0x3b0
> [  148.836195]  [<ffffffff810a7105>] ? has_capability_noaudit+0x25/0x40
> [  148.837633]  [<ffffffff8117e0c5>] __out_of_memory+0x315/0x500
> [  148.838925]  [<ffffffff8117e44b>] out_of_memory+0x5b/0x80
> [  148.840162]  [<ffffffff811830d9>] __alloc_pages_nodemask+0x7d9/0x810
> [  148.841592]  [<ffffffff811c0531>] alloc_pages_current+0x91/0x100
> [  148.842950]  [<ffffffff8117a427>] __page_cache_alloc+0xa7/0xc0
> [  148.844286]  [<ffffffff8117c688>] filemap_fault+0x1b8/0x420
> [  148.845545]  [<ffffffff811a05ed>] __do_fault+0x3d/0x70
> [  148.846706]  [<ffffffff811a4478>] handle_mm_fault+0x988/0x1230
> [  148.848042]  [<ffffffff81090305>] __do_page_fault+0x1a5/0x460
> [  148.849333]  [<ffffffff81090675>] trace_do_page_fault+0x45/0x130
> [  148.850681]  [<ffffffff8108b8ce>] do_async_page_fault+0x1e/0xd0
> [  148.852025]  [<ffffffff81dd1567>] ? schedule+0x37/0x90
> [  148.853187]  [<ffffffff81dd8b88>] async_page_fault+0x28/0x30
> [  148.854456] Mem-Info:
> [  148.854986] Node 0 DMA per-cpu:
> [  148.855727] CPU    0: hi:    0, btch:   1 usd:   0
> [  148.856820] Node 0 DMA32 per-cpu:
> [  148.857600] CPU    0: hi:  186, btch:  31 usd:   0
> [  148.858688] active_anon:119251 inactive_anon:119329 isolated_anon:0
> [  148.858688]  active_file:19 inactive_file:2 isolated_file:0
> [  148.858688]  unevictable:0 dirty:0 writeback:0 unstable:0
> [  148.858688]  free:1965 slab_reclaimable:2816 slab_unreclaimable:2184
> [  148.858688]  mapped:3 shmem:2 pagetables:1259 bounce:0
> [  148.858688]  free_cma:0
> [  148.865606] Node 0 DMA free:3916kB min:60kB low:72kB high:88kB 
> active_anon:5100kB inactive_anon:5324kB active_file:0kB inactive_file:8kB 
> unevictable:0kB isolated(as
> [  148.874431] lowmem_reserve[]: 0 966 966 966
> [  148.875504] Node 0 DMA32 free:3944kB min:3944kB low:4928kB high:5916kB 
> active_anon:471904kB inactive_anon:471992kB active_file:76kB 
> inactive_file:0kB unevictable:0s
> [  148.884817] lowmem_reserve[]: 0 0 0 0
> [  148.885770] Node 0 DMA: 1*4kB (M) 1*8kB (U) 2*16kB (UM) 3*32kB (UM) 1*64kB 
> (M) 1*128kB (M) 0*256kB 1*512kB (M) 1*1024kB (M) 1*2048kB (R) 0*4096kB = 
> 3916kB
> [  148.889385] Node 0 DMA32: 8*4kB (UEM) 2*8kB (UR) 3*16kB (M) 1*32kB (M) 
> 2*64kB (MR) 1*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 
> 0*4096kB = 3968kB
> [  148.893068] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 
> hugepages_size=2048kB
> [  148.894949] 47361 total pagecache pages
> [  148.895816] 47334 pages in swap cache
> [  148.896657] Swap cache stats: add 124669, delete 77335, find 83/169
> [  148.898057] Free swap  = 0kB
> [  148.898714] Total swap = 497976kB
> [  148.899470] 262044 pages RAM
> [  148.900145] 0 pages HighMem/MovableOnly
> [  148.901006] 10253 pages reserved
> [  148.901735] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents 
> oom_score_adj name
> [  148.903637] [ 1204]     0  1204     6039        1      15       3      163 
>         -1000 udevd
> [  148.905571] [ 1323]     0  1323     6038        1      14       3      165 
>         -1000 udevd
> [  148.907499] [ 1324]     0  1324     6038        1      14       3      164 
>         -1000 udevd
> [  148.909439] [ 2176]     0  2176     2524        0       6       2      571 
>             0 dhclient
> [  148.911427] [ 2227]     0  2227     9267        0      22       3       95 
>             0 rpcbind
> [  148.913392] [ 2632]     0  2632    64981       30      29       3      136 
>             0 rsyslogd
> [  148.915391] [ 2686]     0  2686     1062        1       6       3       36 
>             0 acpid
> [  148.917325] [ 2826]     0  2826     4753        0      12       2       44 
>             0 atd
> [  148.919209] [ 2877]     0  2877     6473        0      17       3       66 
>             0 cron
> [  148.921120] [ 2911]   104  2911     7078        1      17       3       81 
>             0 dbus-daemon
> [  148.923150] [ 3591]     0  3591    13731        0      28       2      165 
>         -1000 sshd
> [  148.925073] [ 3603]     0  3603    22024        0      43       2      215 
>             0 winbindd
> [  148.927066] [ 3612]     0  3612    22024        0      42       2      216 
>             0 winbindd
> [  148.929062] [ 3636]     0  3636     3722        1      11       3       41 
>             0 getty
> [  148.930981] [ 3637]     0  3637     3722        1      11       3       40 
>             0 getty
> [  148.932915] [ 3638]     0  3638     3722        1      11       3       39 
>             0 getty
> [  148.934835] [ 3639]     0  3639     3722        1      11       3       40 
>             0 getty
> [  148.936789] [ 3640]     0  3640     3722        1      11       3       40 
>             0 getty
> [  148.938704] [ 3641]     0  3641     3722        1      10       3       38 
>             0 getty
> [  148.940635] [ 3642]     0  3642     3677        1      11       3       40 
>             0 getty
> [  148.942550] [ 3643]     0  3643    25894        2      52       2      248 
>             0 sshd
> [  148.944469] [ 3649]     0  3649   146652        1      35       4      320 
>             0 console-kit-dae
> [  148.946578] [ 3716]     0  3716    48287        1      31       4      171 
>             0 polkitd
> [  148.948552] [ 3722]  1000  3722    25894        0      51       2      250 
>             0 sshd
> [  148.950457] [ 3723]  1000  3723     5435        3      15       3      495 
>             0 bash
> [  148.952375] [ 3742]     0  3742    17157        1      37       2      160 
>             0 sudo
> [  148.954275] [ 3743]     0  3743     3365        1      11       3      516 
>             0 check
> [  148.956229] [ 4130]     0  4130     3334        1      11       3      484 
>             0 084
> [  148.958108] [ 4342]     0  4342   314556   191159     619       4   119808 
>             0 resvtest
> [  148.960104] [ 4343]     0  4343     3334        0      11       3      485 
>             0 084
> [  148.961990] [ 4344]     0  4344     3334        0      11       3      485 
>             0 084
> [  148.963876] [ 4345]     0  4345     3305        0      11       3       36 
>             0 sed
> [  148.965766] [ 4346]     0  4346     3305        0      11       3       37 
>             0 sed
> [  148.967652] Out of memory: Kill process 4342 (resvtest) score 803 or 
> sacrifice child
> [  148.969390] Killed process 4342 (resvtest) total-vm:1258224kB, 
> anon-rss:764636kB, file-rss:0kB
> [  149.415288] XFS (vda): Unmounting Filesystem
> [  150.211229] XFS (vda): Mounting V5 Filesystem
> [  150.292092] XFS (vda): Ending clean mount
> [  150.342307] XFS (vda): Unmounting Filesystem
> [  150.346522] XFS (vdb): Unmounting Filesystem
> [  151.264135] XFS: kmalloc allocations by trans type
> [  151.265195] XFS: 3: count 7, bytes 3992, fails 0, max_size 1024
> [  151.266479] XFS: 4: count 3, bytes 400, fails 0, max_size 144
> [  151.267735] XFS: 7: count 9, bytes 2784, fails 0, max_size 536
> [  151.269022] XFS: 16: count 1, bytes 696, fails 0, max_size 696
> [  151.270286] XFS: 26: count 1, bytes 384, fails 0, max_size 384
> [  151.271550] XFS: 35: count 1, bytes 696, fails 0, max_size 696
> [  151.272833] XFS: slab allocations by trans type
> [  151.273818] XFS: 3: count 22, bytes 0, fails 0, max_size 0
> [  151.275010] XFS: 4: count 13, bytes 0, fails 0, max_size 0
> [  151.276212] XFS: 7: count 12, bytes 0, fails 0, max_size 0
> [  151.277406] XFS: 15: count 2, bytes 0, fails 0, max_size 0
> [  151.278595] XFS: 16: count 10, bytes 0, fails 0, max_size 0
> [  151.279854] XFS: 18: count 2, bytes 0, fails 0, max_size 0
> [  151.281080] XFS: 26: count 3, bytes 0, fails 0, max_size 0
> [  151.282275] XFS: 35: count 2, bytes 0, fails 0, max_size 0
> [  151.283476] XFS: vmalloc allocations by trans type
> [  151.284535] XFS: page allocations by trans type
> 
> Those XFS allocation stats are the largest measured allocations done
> under transaction context, broken down by allocation and transaction
> type.  No failures that would result in looping, even though the
> system invoked the OOM killer on a filesystem workload....
> 
> I need to break the slab allocations down further by cache (other
> workloads are generating over 50 slab allocations per transaction),
> but another hour's work and a few days of observation of the stats
> in my normal day-to-day work will get me all the information I need
> to do a decent first pass at memory reservation requirements for
> XFS.
> 
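
Roughly, I'd guess the accounting behind numbers like the above looks
something like the sketch below (a guess at the shape of the
instrumentation, not the actual patch; the names and the transaction
type bound are made up): allocations reached under transaction context
bump a per-type counter tracking count, bytes, failures and the largest
single request seen.

/*
 * Userspace sketch of per-transaction-type allocation accounting.
 * An assumption about what such instrumentation might look like,
 * not the real XFS code; all names are hypothetical.
 */
#include <stddef.h>
#include <stdio.h>

#define TRANS_TYPE_MAX	50		/* hypothetical upper bound */

struct trans_alloc_stat {
	unsigned long	count;
	unsigned long	bytes;
	unsigned long	fails;
	size_t		max_size;
};

static struct trans_alloc_stat kmalloc_stats[TRANS_TYPE_MAX];

/* Called from each allocation site reached under transaction context. */
static void account_trans_kmalloc(int trans_type, size_t size, int failed)
{
	struct trans_alloc_stat *s = &kmalloc_stats[trans_type];

	if (failed) {
		s->fails++;
		return;
	}
	s->count++;
	s->bytes += size;
	if (size > s->max_size)
		s->max_size = size;
}

/* Emit lines in the same form as the log output above. */
static void dump_trans_kmalloc_stats(void)
{
	int i;

	for (i = 0; i < TRANS_TYPE_MAX; i++) {
		struct trans_alloc_stat *s = &kmalloc_stats[i];

		if (!s->count && !s->fails)
			continue;
		printf("XFS: %d: count %lu, bytes %lu, fails %lu, max_size %zu\n",
		       i, s->count, s->bytes, s->fails, s->max_size);
	}
}

int main(void)
{
	account_trans_kmalloc(3, 1024, 0);
	account_trans_kmalloc(3, 144, 0);
	dump_trans_kmalloc_stats();
	return 0;
}
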

This sounds like something that would serve us well under sysfs,
particularly if we do adopt the kind of reservation model being
discussed in this thread. I wouldn't expect these values to change
drastically or that often, but they could certainly drift over time to
the point of being out of line with a reservation. A tool like this,
combined with Johannes' idea of a warning (or something along those
lines) for a reservation overrun, should give us everything we need to
identify that something is wrong and the ability to fix it.
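
Something as simple as a read-only attribute that dumps the per-type
table would probably do. A minimal kernel-side sketch using the generic
kobject attribute plumbing rather than the existing XFS sysfs code,
with placeholder names throughout (the stats table mirrors the earlier
sketch):

/*
 * Rough sketch: a read-only sysfs attribute dumping per-transaction-
 * type allocation stats. Placeholder names throughout; not the
 * existing XFS sysfs code.
 */
#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <linux/mm.h>		/* PAGE_SIZE */

#define TRANS_TYPE_MAX	50	/* hypothetical upper bound */

struct trans_alloc_stat {
	unsigned long	count;
	unsigned long	bytes;
	unsigned long	fails;
	unsigned long	max_size;
};

static struct trans_alloc_stat kmalloc_stats[TRANS_TYPE_MAX];

static ssize_t trans_kmalloc_show(struct kobject *kobj,
				  struct kobj_attribute *attr, char *buf)
{
	ssize_t len = 0;
	int i;

	for (i = 0; i < TRANS_TYPE_MAX; i++) {
		struct trans_alloc_stat *s = &kmalloc_stats[i];

		if (!s->count && !s->fails)
			continue;
		len += scnprintf(buf + len, PAGE_SIZE - len,
				 "%d count %lu bytes %lu fails %lu max_size %lu\n",
				 i, s->count, s->bytes, s->fails, s->max_size);
	}
	return len;
}

/*
 * Shows up as a "trans_kmalloc" file under whatever kobject it gets
 * registered against, e.g. with sysfs_create_file().
 */
static struct kobj_attribute trans_kmalloc_attr = __ATTR_RO(trans_kmalloc);

A monitoring script (or xfstests) reading that file and comparing
max_size against the static reservation for each transaction type would
be enough to catch the kind of drift described above.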

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 
