On 09/26/2013 02:49 PM, Jeff Liu wrote:
> From: Jie Liu <jeff.liu@xxxxxxxxxx>
>
> Write a file with an offset greater than 16TB on 32-bit system and
> then trigger page write-back via sync(1) as below will cause the
> task hang in a little while:
>
> INFO: task sync:2590 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> sync D c1064a28 0 2590 2097 0x00000000
> .....
> Call Trace:
> [<c1064a28>] ? ttwu_do_wakeup+0x18/0x130
> [<c1066d0e>] ? try_to_wake_up+0x1ce/0x220
> [<c1066dbf>] ? wake_up_process+0x1f/0x40
> [<c104fc2e>] ? wake_up_worker+0x1e/0x30
> [<c15b6083>] schedule+0x23/0x60
> [<c15b3c2d>] schedule_timeout+0x18d/0x1f0
> [<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90
> [<c10515f1>] ? __queue_delayed_work+0x91/0x150
> [<c12a12ef>] ? do_raw_spin_lock+0x3f/0x100
> [<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90
> [<c15b5b5d>] wait_for_completion+0x7d/0xc0
> [<c1066d60>] ? try_to_wake_up+0x220/0x220
> [<c116a4d2>] sync_inodes_sb+0x92/0x180
> [<c116fb05>] sync_inodes_one_sb+0x15/0x20
> [<c114a8f8>] iterate_supers+0xb8/0xc0
> [<c116faf0>] ? fdatawrite_one_bdev+0x20/0x20
> [<c116fc21>] sys_sync+0x31/0x80
> [<c15be18d>] sysenter_do_call+0x12/0x28
>
> The reason is that the end_index is unsigned long with maximum value
> '2^32-1=4294967295' on 32-bit platform, and the given offset cause it
> wrapped to 0, so that the following codes will repeat again and again
> until the task schedule time out:
>
> end_index = offset >> PAGE_CACHE_SHIFT;
> last_index = (offset - 1) >> PAGE_CACHE_SHIFT;
> if (page->index >= end_index) {
> unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1);
> /*
> * Just skip the page if it is fully outside i_size, e.g. due
> * to a truncate operation that is in progress.
> */
> if (page->index >= end_index + 1 || offset_into_page == 0) {
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> unlock_page(page);
> return 0;
> }
>
> To check a page is fully outsids i_size or not, we can change the
> logic to:
> if (page->index > end_index ||
> (page->index == end_index && offset_into_page == 0))
>
> Secondly, there still has another similar issue when calculating the
> end offset for mapping the filesystem blocks to the file blocks for
> delalloc. With the same tests to above, run unmount(8) will cause
> kernel panic if CONFIG_XFS_DEBUG is enabled:
>
> XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || \
> ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 964
>
> kernel BUG at fs/xfs/xfs_message.c:108!
> invalid opcode: 0000 [#1] SMP
> task: edddc100 ti: ec6ee000 task.ti: ec6ee000
> EIP: 0060:[<f83d87cb>] EFLAGS: 00010296 CPU: 1
> EIP is at assfail+0x2b/0x30 [xfs]
> ..............
> Call Trace:
> [<f83d9cd4>] xfs_fs_destroy_inode+0x74/0x120 [xfs]
> [<c115ddf1>] destroy_inode+0x31/0x50
> [<c115deff>] evict+0xef/0x170
> [<c115dfb2>] dispose_list+0x32/0x40
> [<c115ea3a>] evict_inodes+0xca/0xe0
> [<c1149706>] generic_shutdown_super+0x46/0xd0
> [<c11497b9>] kill_block_super+0x29/0x70
> [<c1149a14>] deactivate_locked_super+0x44/0x70
> [<c114a427>] deactivate_super+0x47/0x60
> [<c1161c3d>] mntput_no_expire+0xcd/0x120
> [<c1162ae8>] SyS_umount+0xa8/0x370
> [<c1162dce>] SyS_oldumount+0x1e/0x20
> [<c15be18d>] sysenter_do_call+0x12/0x28
>
> That because the end_offset is evaluated to 0 same to above, hence the
> mapping and covertion for dealloc file blocks to file system blocks did
> not happened.
>
> This patch just fixed both issues.
>
> Reported-by: Michael L. Semon <mlsemon35@xxxxxxxxx>
> Signed-off-by: Jie Liu <jeff.liu@xxxxxxxxxx>
> ---
> v3: Add code comments to reflect this change.
> v2: don't reset the s_max_bytes to MAX_LFS_FILESIZE, instead, revise the page
> offset
> check up strategy to avoid the potential overflow.
> v1: http://oss.sgi.com/archives/xfs/2013-07/msg00154.html
>
> fs/xfs/xfs_aops.c | 49 +++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 43 insertions(+), 6 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e51e581..541c837 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -960,7 +960,32 @@ xfs_vm_writepage(
> offset = i_size_read(inode);
> end_index = offset >> PAGE_CACHE_SHIFT;
> last_index = (offset - 1) >> PAGE_CACHE_SHIFT;
> - if (page->index >= end_index) {
> +
> + /*
> + * The page index is less than the end_index, adjust the end_offset
> + * to the highest offset that this page should represent.
> + * -----------------------------------------------------
> + * | file mapping | <EOF> |
> + * -----------------------------------------------------
> + * | Page ... | Page N-2 | Page N-1 | Page N | |
> + * ^--------------------------------^----------|--------
> + * | desired writeback range | see else |
> + * ---------------------------------^------------------|
> + */
> + if (page->index < end_index)
> + end_offset = (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT;
> + else {
> + /*
> + * Check whether the page to write out is beyond or straddles
> + * i_size or not.
> + * -------------------------------------------------------
> + * | file mapping | <EOF> |
> + * -------------------------------------------------------
> + * | Page ... | Page N-2 | Page N-1 | Page N | Beyond |
> + * ^--------------------------------^-----------|---------
> + * | | Straddles |
> + * ---------------------------------^-----------|--------|
> + */
> unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1);
>
> /*
> @@ -968,24 +993,36 @@ xfs_vm_writepage(
> * truncate operation that is in progress. We must redirty the
> * page so that reclaim stops reclaiming it. Otherwise
> * xfs_vm_releasepage() is called on it and gets confused.
> + *
> + * Note that the end_index is unsigned long, it would cause it
> + * overflow if the given offset is greater than 16TB on 32-bit
> + * system if we do check the page is fully outside i_size or not
> + * via "if (page->index >= end_index + 1)" as "end_idex + 1"
> + * will be evaluated to 0. Hence this page will be redirtied
> + * and be written out again and again which would result in
> + * dead loop, the user program that perform this operation will
^^^^^^^^^^^^^
Should I say this is an infinite or endless loop rather than dead loop, although
I was intended to comments the same behavior, but someone was kindly reminded me
of that the infinite loop is more suitable in English world just now.
Thanks,
-Jeff
> + * hang. Instead, we can verify this situation by checking if
> + * the page to write if totally beyond the i_size or if it's
> + * offset just equal to the EOF.
> */
> - if (page->index >= end_index + 1 || offset_into_page == 0)
> + if (page->index > end_index ||
> + (page->index == end_index && offset_into_page == 0))
> goto redirty;
>
> /*
> * The page straddles i_size. It must be zeroed out on each
> * and every writepage invocation because it may be mmapped.
> * "A file is mapped in multiples of the page size. For a file
> - * that is not a multiple of the page size, the remaining
> + * that is not a multiple of the page size, the remaining
> * memory is zeroed when mapped, and writes to that region are
> * not written out to the file."
> */
> zero_user_segment(page, offset_into_page, PAGE_CACHE_SIZE);
> +
> + /* Adjust the end_offset to the end of file */
> + end_offset = offset;
> }
>
> - end_offset = min_t(unsigned long long,
> - (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT,
> - offset);
> len = 1 << inode->i_blkbits;
>
> bh = head = page_buffers(page);
|