Hi,
Thanks for having a look.
On 02/23/2016 11:45 PM, Dave Chinner wrote:
> On Tue, Feb 23, 2016 at 05:13:35PM +0100, Jean-Tiare Le Bigot wrote:
>> Hi,
>>
>> We've hit kernel hang related to XFS reclaim under heavy I/O load on a
>> couple of storage servers using XFS over flashcache over a 3.13.y kernel.
>>
>> On the crash dumps, kthreadd is blocked, waiting for XFS to reclaim some
>> memory but the related reclaim job is queued on a worker_pool stuck
>> waiting for some I/O, itself depending on other jobs on other queues
>> which would require additional threads to go forward. Unfortunately
>> kthreadd is blocked.
>> The host has plenty of memory (~128GB), about 80% of which being used
>> for the page cache.
>>
>> It looks like this is fixed by commit
>> 7a29ac474a47eb8cf212b45917683ae89d6fa13b.
>
> That commit fixed a regression introduced, IIRC, in 3.19. The
> problem it fixed didn't exist before then, so I doubt you are seeing
> the problem that the above commit fixed. Perhaps you'd like to
> describe your problem along with the stack traces, etc so we have
> some idea of what you are trying to fix?
In the dump we investigated, kthreadd is stuck waiting for some memory:
crash> bt 2
PID: 2 TASK: ffff881fd2a39800 CPU: 0 COMMAND: "kthreadd"
#0 [ffff881fd2ab37c8] __schedule at ffffffff81724e19
#1 [ffff881fd2ab3830] io_schedule at ffffffff817255fd
#2 [ffff881fd2ab3848] __xfs_iunpin_wait at ffffffffa06cf269 [xfs]
#3 [ffff881fd2ab38c0] xfs_iunpin_wait at ffffffffa06d2629 [xfs]
#4 [ffff881fd2ab38d0] xfs_reclaim_inode at ffffffffa068ed4c [xfs]
#5 [ffff881fd2ab3910] xfs_reclaim_inodes_ag at ffffffffa068f267 [xfs]
#6 [ffff881fd2ab3aa0] xfs_reclaim_inodes_nr at ffffffffa068fd73 [xfs]
#7 [ffff881fd2ab3ac0] xfs_fs_free_cached_objects at ffffffffa069a3a5 [xfs]
#8 [ffff881fd2ab3ad0] super_cache_scan at ffffffff811c13e9
#9 [ffff881fd2ab3b18] shrink_slab at ffffffff81160f27
#10 [ffff881fd2ab3bc0] do_try_to_free_pages at ffffffff8116405d
#11 [ffff881fd2ab3c38] try_to_free_pages at ffffffff8116429c
#12 [ffff881fd2ab3cd8] __alloc_pages_nodemask at ffffffff81158f15
#13 [ffff881fd2ab3e10] copy_process at ffffffff810652b3
#14 [ffff881fd2ab3e90] do_fork at ffffffff810669f5
#15 [ffff881fd2ab3ef8] kernel_thread at ffffffff81066c86
#16 [ffff881fd2ab3f08] kthreadd at ffffffff8108beea
#17 [ffff881fd2ab3f50] ret_from_fork at ffffffff817318bc
This triggered the insertion of an xfs-reclaim job for device dm-46 on
the worker pool bound to CPU #0.
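(For reference, as far as I can read the 3.13 code, the "insertion" is
the shrinker path kicking the background reclaim work before doing its
own synchronous pass. I am quoting from memory, so names and the exact
delay may be slightly off. Since m_reclaim_workqueue is a regular bound
workqueue, the delayed work lands on the worker pool of the CPU doing
the queuing, CPU #0 here.)

/*
 * My reading of fs/xfs/xfs_icache.c in 3.13, reproduced from memory,
 * so treat it as approximate.  xfs_reclaim_inodes_nr() is frame #6 in
 * the backtrace above.
 */
static void xfs_reclaim_work_queue(struct xfs_mount *mp)
{
	rcu_read_lock();
	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG))
		queue_delayed_work(mp->m_reclaim_workqueue,
				   &mp->m_reclaim_work,
				   msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10));
	rcu_read_unlock();
}

long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan)
{
	/* kick background reclaimer and push the AIL */
	xfs_reclaim_work_queue(mp);
	xfs_ail_push_all(mp->m_ail);

	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
}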
This xfs_reclaim job is the first *pending* job on that worker pool. The
pool has 2 active workers stuck in "xfs_log_worker", both blocked in
"xlog_state_get_iclog_space". My guess is that they are waiting for some
underlying flashcache I/O which will never complete (see the flashcache
part below).
--> busy_hash aka busy workers:
crash> struct worker_pool.busy_hash 0xffff88207fc12d40 | grep -o 0xf.*
0xffff8806dd3b5780 --> 132626
--> in xfs_log_worker
--> in xlog_state_get_iclog_space
--> dm-40
0xffff88142d06ae80 --> 129142
--> in xfs_log_worker
--> in xlog_state_get_iclog_space
--> dm-63
--> pending jobs:
crash> list -o work_struct.entry -s work_struct.func -H ffff88207fc12d58
-x | awk 'BEGIN{w="";c=0} {if($1=="func") printf("%2d %s --> %s\n", c,
w, $4); else {w=$1;c++}}'
WORK FUNC DEVICE
1 ffff881fceda6ca8 --> <xfs_reclaim_worker> dm-46
2 ffff881fcd51bc28 --> <xfs_log_worker> dm-46
3 ffff88207fc0f3a0 --> <vmstat_update> N/A
...
To make progress, this queue needs either:
- the I/O to complete, or
- a new thread to handle the xfs_reclaim job
(which does not trigger I/O in this context, if I got the code right).
The I/O is stuck, so we need a new thread. The pool manager requested
this thread, BUT the request is pending in kthreadd's queue. It is #5 in
the list below:
crash> list -o kthread_create_info.list -s kthread_create_info -H
kthread_create_list | grep threadfn | sort | uniq -c
The pending requests all have worker_thread as their 'threadfn', so the
'data' field is a struct worker, which gives the target pool:
WORKER POOL ID
1 0xffff881851c50780 0xffff881fff011c00 4 unbound pool 24
2 0xffff881a08c5b680 0xffff881fd1cb4c00 0 unbound pool 25
3 0xffff881295297700 0xffff88207fd72d40 0 cpu 11 pool 22
4 0xffff88131ed9c580 0xffff88207fc72d40 1 cpu 3 pool 6
5 0xffff88062e8e9b80 0xffff88207fc12d40 0 cpu 0 pool 0
6 0xffff8819440d1500 0xffff88207fc92d40 2 cpu 4 pool 8
7 0xffff8802288aa280 0xffff88207fc52d40 0 cpu 2 pool 4
8 0xffff88120d711200 0xffff88207fcb2d40 0 cpu 5 pool 10
9 0xffff881cf648fb00 0xffff88207fc32d40 0 cpu 1 pool 2
10 0xffff8819c1471a00 0xffff88207fd32d40 2 cpu 9 pool 18
Hence, the "mayday" mode should trigger and move the xfs_reclaim job to
the rescuer worker.
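As far as I can tell, a workqueue only gets a rescuer thread when it is
allocated with WQ_MEM_RECLAIM, and that is what the commit mentioned at
the top adds to the XFS workqueues. A minimal, self-contained sketch of
the idea (the demo_* names are mine, this is not the XFS code):

#include <linux/module.h>
#include <linux/workqueue.h>

/*
 * Sketch only: a workqueue allocated with WQ_MEM_RECLAIM gets its own
 * rescuer kthread at alloc_workqueue() time.  When a worker pool can
 * no longer make progress and cannot get a new worker out of kthreadd
 * (our case), the mayday timer hands the pending items of
 * WQ_MEM_RECLAIM queues over to their rescuer, so they still run.
 * A queue created without the flag has no rescuer to fall back to and
 * its pending items just wait.
 */
static struct workqueue_struct *demo_reclaim_wq;

static void demo_reclaim_fn(struct work_struct *work)
{
	/* the forward-progress-critical work would go here */
}
static DECLARE_WORK(demo_reclaim_work, demo_reclaim_fn);

static int __init demo_init(void)
{
	demo_reclaim_wq = alloc_workqueue("demo-reclaim",
					  WQ_MEM_RECLAIM | WQ_FREEZABLE, 0);
	if (!demo_reclaim_wq)
		return -ENOMEM;
	queue_work(demo_reclaim_wq, &demo_reclaim_work);
	return 0;
}

static void __exit demo_exit(void)
{
	destroy_workqueue(demo_reclaim_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

The important part is that the rescuer exists from init time, so no
allocation and no kthreadd round trip is needed once memory is tight.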
On the other hand, we also hit a similar issue in the underlying
flashcache. If that had not happened, I bet the system would not have
locked up.
Anyway, as I understand the code, flashcache queues a single job on the
default bound system workqueue. When this job runs, it processes
flashcache's own internal queues sequentially. So basically, if this
single job is stuck, all I/O to flashcache devices is stuck (!).
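Here is a rough, self-contained sketch of the pattern I mean (invented
names, this is not the flashcache source): one work item on the shared
bound system_wq whose handler drains the driver's internal job list one
entry at a time.

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct cache_job {
	struct list_head list;
	/* per-I/O state would live here */
};

static LIST_HEAD(pending_jobs);
static DEFINE_SPINLOCK(job_lock);

/* the single handler: drains the internal queue(s) sequentially */
static void cache_do_work(struct work_struct *work)
{
	struct cache_job *job, *tmp;
	LIST_HEAD(batch);

	spin_lock_irq(&job_lock);
	list_splice_init(&pending_jobs, &batch);
	spin_unlock_irq(&job_lock);

	list_for_each_entry_safe(job, tmp, &batch, list) {
		list_del(&job->list);
		/*
		 * Submit/complete the cached I/O here.  If this ever
		 * blocks, every job behind it, for every cached device,
		 * blocks with it.
		 */
		kfree(job);
	}
}
static DECLARE_WORK(cache_work, cache_do_work);

/* every producer just re-kicks the single work item on system_wq */
static void cache_queue_job(struct cache_job *job)
{
	unsigned long flags;

	spin_lock_irqsave(&job_lock, flags);
	list_add_tail(&job->list, &pending_jobs);
	spin_unlock_irqrestore(&job_lock, flags);

	queue_work(system_wq, &cache_work);	/* default bound queue */
}

static int __init cache_demo_init(void)
{
	struct cache_job *job = kzalloc(sizeof(*job), GFP_KERNEL);

	if (!job)
		return -ENOMEM;
	cache_queue_job(job);
	return 0;
}

static void __exit cache_demo_exit(void)
{
	flush_work(&cache_work);
}

module_init(cache_demo_init);
module_exit(cache_demo_exit);
MODULE_LICENSE("GPL");

In our dump, this single do_work item is itself pending behind blocked
workers, so every flashcache device stalls at once.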
This is on the worker pool bound to CPU #6:
--> busy_hash aka busy workers:
0xffff881330823080 --> 139598
--> in xfs_end_io
--> in rwsem_down_write_failed
0xffff881330823c80 --> 139599
--> in xfs_end_io
--> in ???
0xffff880e1a890e80 --> 127402
--> in xfs_end_io
--> rwsem_down_write_failed
--> pending jobs:
crash> list -o work_struct.entry -s work_struct.func -H ffff88207fcd2d58
-x | awk 'BEGIN{w="";c=0} {if($1=="func") printf(" %2d %s --> %s\n",
c, w, $4); else {w=$1;c++}}'
WORK FUNC NOTES
1 ffffffffa062e780 --> <do_work> FLASHCACHE JOB
2 ffff881f582dd350 --> <flashcache_clean_all_sets>
3 ffff88207fccf3a0 --> <vmstat_update>
4 ffff881ed1d05350 --> <flashcache_clean_all_sets>
5 ffff881fff010d10 --> <vmpressure_work_fn>
6 ffff881d9998bca8 --> <xfs_reclaim_worker> dm-4
7 ffffffff81ca8880 --> <push_to_pool>
8 ffff881fcc425ca8 --> <xfs_reclaim_worker> dm-45
9 ffff88207fccf180 --> <lru_add_drain_per_cpu>
The flashcache job is #1 (yes, the function is literally called
'do_work'). All of this pool's workers are waiting for I/O. The pool is
waiting for a new thread. Which is waiting for memory. Which ... OK,
we're stuck.
Some of these assumptions may be wrong; several stack traces appear to
be truncated, like the one below, which does not help much :s
crash> bt 139599
PID: 139599 TASK: ffff8816c7970000 CPU: 6 COMMAND: "kworker/6:1"
#0 [ffff88011c407700] __schedule at ffffffff81724e19
#1 [ffff88011c407768] schedule at ffffffff817252d9
#2 [ffff88011c407778] schedule_timeout at ffffffff81724529
Long story short, my understanding is that the hang is the result of a
design issue in flashcache, which could be worked around in such a
not-so-low-memory situation (80% of memory is page cache) if the reclaim
job had a rescuer thread. As it happens, the quick production fix was a
clever brute-force hack in flashcache.
Regards,
>
> Cheers,
>
> Dave.
>
--
Jean-Tiare Le Bigot, OVH