xfs
[Top] [All Lists]

Re: [PATCH v3] xfs: fix dead loop at xfs_vm_writepage() on 32bit machine

To: "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Subject: Re: [PATCH v3] xfs: fix dead loop at xfs_vm_writepage() on 32bit machine
From: Jeff Liu <jeff.liu@xxxxxxxxxx>
Date: Thu, 26 Sep 2013 15:37:30 +0800
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <5243D8DE.8090003@xxxxxxxxxx>
References: <5243D8DE.8090003@xxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20120410 Thunderbird/11.0.1
On 09/26/2013 02:49 PM, Jeff Liu wrote:

> From: Jie Liu <jeff.liu@xxxxxxxxxx>
> 
> Write a file with an offset greater than 16TB on 32-bit system and
> then trigger page write-back via sync(1) as below will cause the
> task hang in a little while:
> 
> INFO: task sync:2590 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> sync            D c1064a28     0  2590   2097 0x00000000
> .....
> Call Trace:
> [<c1064a28>] ? ttwu_do_wakeup+0x18/0x130
> [<c1066d0e>] ? try_to_wake_up+0x1ce/0x220
> [<c1066dbf>] ? wake_up_process+0x1f/0x40
> [<c104fc2e>] ? wake_up_worker+0x1e/0x30
> [<c15b6083>] schedule+0x23/0x60
> [<c15b3c2d>] schedule_timeout+0x18d/0x1f0
> [<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90
> [<c10515f1>] ? __queue_delayed_work+0x91/0x150
> [<c12a12ef>] ? do_raw_spin_lock+0x3f/0x100
> [<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90
> [<c15b5b5d>] wait_for_completion+0x7d/0xc0
> [<c1066d60>] ? try_to_wake_up+0x220/0x220
> [<c116a4d2>] sync_inodes_sb+0x92/0x180
> [<c116fb05>] sync_inodes_one_sb+0x15/0x20
> [<c114a8f8>] iterate_supers+0xb8/0xc0
> [<c116faf0>] ? fdatawrite_one_bdev+0x20/0x20
> [<c116fc21>] sys_sync+0x31/0x80
> [<c15be18d>] sysenter_do_call+0x12/0x28
> 
> The reason is that the end_index is unsigned long with maximum value
> '2^32-1=4294967295' on 32-bit platform, and the given offset cause it
> wrapped to 0, so that the following codes will repeat again and again
> until the task schedule time out:
> 
> end_index = offset >> PAGE_CACHE_SHIFT;
> last_index = (offset - 1) >> PAGE_CACHE_SHIFT;
> if (page->index >= end_index) {
>       unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1);
>         /*
>          * Just skip the page if it is fully outside i_size, e.g. due
>          * to a truncate operation that is in progress.
>          */
>         if (page->index >= end_index + 1 || offset_into_page == 0) {
>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>               unlock_page(page);
>               return 0;
>       }
> 
> To check a page is fully outsids i_size or not, we can change the
> logic to:
>       if (page->index > end_index ||
>           (page->index == end_index && offset_into_page == 0))
> 
> Secondly, there still has another similar issue when calculating the
> end offset for mapping the filesystem blocks to the file blocks for
> delalloc.  With the same tests to above, run unmount(8) will cause
> kernel panic if CONFIG_XFS_DEBUG is enabled:
> 
> XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || \
>       ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 964
> 
> kernel BUG at fs/xfs/xfs_message.c:108!
> invalid opcode: 0000 [#1] SMP
> task: edddc100 ti: ec6ee000 task.ti: ec6ee000
> EIP: 0060:[<f83d87cb>] EFLAGS: 00010296 CPU: 1
> EIP is at assfail+0x2b/0x30 [xfs]
> ..............
> Call Trace:
> [<f83d9cd4>] xfs_fs_destroy_inode+0x74/0x120 [xfs]
> [<c115ddf1>] destroy_inode+0x31/0x50
> [<c115deff>] evict+0xef/0x170
> [<c115dfb2>] dispose_list+0x32/0x40
> [<c115ea3a>] evict_inodes+0xca/0xe0
> [<c1149706>] generic_shutdown_super+0x46/0xd0
> [<c11497b9>] kill_block_super+0x29/0x70
> [<c1149a14>] deactivate_locked_super+0x44/0x70
> [<c114a427>] deactivate_super+0x47/0x60
> [<c1161c3d>] mntput_no_expire+0xcd/0x120
> [<c1162ae8>] SyS_umount+0xa8/0x370
> [<c1162dce>] SyS_oldumount+0x1e/0x20
> [<c15be18d>] sysenter_do_call+0x12/0x28
> 
> That because the end_offset is evaluated to 0 same to above, hence the
> mapping and covertion for dealloc file blocks to file system blocks did
> not happened.
> 
> This patch just fixed both issues.
> 
> Reported-by: Michael L. Semon <mlsemon35@xxxxxxxxx>
> Signed-off-by: Jie Liu <jeff.liu@xxxxxxxxxx>
> ---
> v3: Add code comments to reflect this change.
> v2: don't reset the s_max_bytes to MAX_LFS_FILESIZE, instead, revise the page 
> offset
>     check up strategy to avoid the potential overflow.
> v1: http://oss.sgi.com/archives/xfs/2013-07/msg00154.html
> 
>  fs/xfs/xfs_aops.c |   49 +++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 43 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e51e581..541c837 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -960,7 +960,32 @@ xfs_vm_writepage(
>       offset = i_size_read(inode);
>       end_index = offset >> PAGE_CACHE_SHIFT;
>       last_index = (offset - 1) >> PAGE_CACHE_SHIFT;
> -     if (page->index >= end_index) {
> +
> +     /*
> +      * The page index is less than the end_index, adjust the end_offset
> +      * to the highest offset that this page should represent.
> +      * -----------------------------------------------------
> +      * |                    file mapping           | <EOF> |
> +      * -----------------------------------------------------
> +      * | Page ... | Page N-2 | Page N-1 |  Page N  |       |
> +      * ^--------------------------------^----------|--------
> +      * |     desired writeback range    |      see else    |
> +      * ---------------------------------^------------------|
> +      */
> +     if (page->index < end_index)
> +             end_offset = (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT;
> +     else {
> +             /*
> +              * Check whether the page to write out is beyond or straddles
> +              * i_size or not.
> +              * -------------------------------------------------------
> +              * |            file mapping                    | <EOF>  |
> +              * -------------------------------------------------------
> +              * | Page ... | Page N-2 | Page N-1 |  Page N   | Beyond |
> +              * ^--------------------------------^-----------|---------
> +              * |                                |      Straddles     |
> +              * ---------------------------------^-----------|--------|
> +              */
>               unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1);
>  
>               /*
> @@ -968,24 +993,36 @@ xfs_vm_writepage(
>                * truncate operation that is in progress. We must redirty the
>                * page so that reclaim stops reclaiming it. Otherwise
>                * xfs_vm_releasepage() is called on it and gets confused.
> +              *
> +              * Note that the end_index is unsigned long, it would cause it
> +              * overflow if the given offset is greater than 16TB on 32-bit
> +              * system if we do check the page is fully outside i_size or not
> +              * via "if (page->index >= end_index + 1)" as "end_idex + 1"
> +              * will be evaluated to 0.  Hence this page will be redirtied
> +              * and be written out again and again which would result in
> +              * dead loop, the user program that perform this operation will

                ^^^^^^^^^^^^^
Should I say this is an infinite or endless loop rather than dead loop, although
I was intended to comments the same behavior, but someone was kindly reminded me
of that the infinite loop is more suitable in English world just now.

Thanks,
-Jeff

> +              * hang.  Instead, we can verify this situation by checking if
> +              * the page to write if totally beyond the i_size or if it's
> +              * offset just equal to the EOF.
>                */
> -             if (page->index >= end_index + 1 || offset_into_page == 0)
> +             if (page->index > end_index ||
> +                 (page->index == end_index && offset_into_page == 0))
>                       goto redirty;
>  
>               /*
>                * The page straddles i_size.  It must be zeroed out on each
>                * and every writepage invocation because it may be mmapped.
>                * "A file is mapped in multiples of the page size.  For a file
> -              * that is not a multiple of the  page size, the remaining
> +              * that is not a multiple of the page size, the remaining
>                * memory is zeroed when mapped, and writes to that region are
>                * not written out to the file."
>                */
>               zero_user_segment(page, offset_into_page, PAGE_CACHE_SIZE);
> +
> +             /* Adjust the end_offset to the end of file */
> +             end_offset = offset;
>       }
>  
> -     end_offset = min_t(unsigned long long,
> -                     (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT,
> -                     offset);
>       len = 1 << inode->i_blkbits;
>  
>       bh = head = page_buffers(page);


<Prev in Thread] Current Thread [Next in Thread>