
Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Tue, 10 Mar 2015 16:55:52 -0700
Cc: Ingo Molnar <mingo@xxxxxxxxxx>, Mel Gorman <mgorman@xxxxxxx>, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Aneesh Kumar <aneesh.kumar@xxxxxxxxxxxxxxxxxx>, Linux Kernel Mailing List <linux-kernel@xxxxxxxxxxxxxxx>, Linux-MM <linux-mm@xxxxxxxxx>, xfs@xxxxxxxxxxx, ppc-dev <linuxppc-dev@xxxxxxxxxxxxxxxx>
In-reply-to: <20150309191943.GF26657@destitution>
References: <1425741651-29152-1-git-send-email-mgorman@xxxxxxx> <1425741651-29152-5-git-send-email-mgorman@xxxxxxx> <20150307163657.GA9702@xxxxxxxxx> <CA+55aFwDuzpL-k8LsV3touhNLh+TFSLKP8+-nPwMXkWXDYPhrg@xxxxxxxxxxxxxx> <20150308100223.GC15487@xxxxxxxxx> <CA+55aFyQyZXu2fi7X9bWdSX0utk8=sccfBwFaSoToROXoE_PLA@xxxxxxxxxxxxxx> <20150309112936.GD26657@destitution> <CA+55aFywW5JLq=BU_qb2OG5+pJ-b1v9tiS5Ygi-vtEKbEZ_T5Q@xxxxxxxxxxxxxx> <20150309191943.GF26657@destitution>
On Mon, Mar 9, 2015 at 12:19 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Mar 09, 2015 at 09:52:18AM -0700, Linus Torvalds wrote:
>>
>> What's your virtual environment setup? Kernel config, and
>> virtualization environment to actually get that odd fake NUMA thing
>> happening?
>
> I don't have the exact .config with me (test machines at home
> are shut down because I'm half a world away), but it's pretty much
> this (copied and munged from a similar test vm on my laptop):

[ snip snip ]

Ok, I hate debugging by symptoms anyway, so I didn't do any of this,
and went back to actually *thinking* about the code instead of trying
to reproduce this and figure things out by trial and error.

And I think I figured it out. Of course, since I didn't actually test
anything, what do I know, but I feel good about it, because I think I
can explain why a patch that on the face of it shouldn't change
anything actually did.

So, the old code just did all those manual page table changes,
clearing the present bit and setting the NUMA bit instead.

The new code _ostensibly_ does the same, except it clears the present
bit and sets the PROTNONE bit instead.

However, rather than playing special games with just those two bits,
it uses the normal pte accessor functions, and in particular uses
vma->vm_page_prot to reset the protections back. Which is a nice
cleanup and really makes the code look saner, and does the same thing.
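
To make that concrete, here's a toy round-trip of the *old* scheme in
plain userspace C. The bit names are made up for illustration - they
are not the kernel's actual layout:

#include <stdio.h>

/* Illustrative bit layout, not the kernel's actual one. */
#define PRESENT 0x1
#define WRITE   0x2
#define NUMA    0x4     /* the old hinting bit */

int main(void)
{
        unsigned long pte = PRESENT | WRITE;

        /* Old scheme: flip exactly two bits, touch nothing else... */
        unsigned long numa_pte = (pte & ~PRESENT) | NUMA;

        /* ...so undoing it trivially restores the exact old bits. */
        unsigned long undone = (numa_pte & ~NUMA) | PRESENT;

        printf("old scheme round-trips exactly: %s\n",
               undone == pte ? "yes" : "no");   /* prints "yes" */
        return 0;
}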

Except it really isn't the same thing at all.

Why?

The protection bits in the page tables are *not* the same as
vma->vm_page_prot. Yes, they start out that way, but they don't stay
that way. And no, I'm not talking about dirty and accessed bits.

The difference? COW. Any private mapping is marked read-only in
vma->vm_page_prot, and then the COW (or the initial write) makes it
read-write.
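
And here's the same kind of toy model for the *new* scheme, showing
exactly where the write bit of a COW'ed page goes missing. Again, this
pte_modify() and the bit names are simplified stand-ins for the real
helpers, nothing more:

#include <stdio.h>

/* Illustrative bit layout, not the kernel's actual one. */
#define PRESENT  0x1
#define WRITE    0x2
#define PROTNONE 0x4

/*
 * Stand-in for pte_modify(): rebuild the protection bits from the
 * given prot, rather than restoring whatever was there before.
 */
static unsigned long pte_modify(unsigned long pte, unsigned long prot)
{
        return (pte & ~(PRESENT | WRITE | PROTNONE)) | prot;
}

int main(void)
{
        /* Private mapping: the vma default is read-only, for COW. */
        unsigned long vm_page_prot = PRESENT;

        /* The pte starts out with the default protections... */
        unsigned long pte = vm_page_prot;

        /* ...then the first write (the COW fault) makes it writable. */
        pte |= WRITE;

        /* NUMA hinting: clear the present bit, set PROTNONE... */
        pte = (pte & ~PRESENT) | PROTNONE;

        /* ...and the hinting fault "makes it present again". */
        pte = pte_modify(pte, vm_page_prot);

        printf("still writable after the NUMA fault? %s\n",
               (pte & WRITE) ? "yes" : "no");   /* prints "no" */
        return 0;
}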

And so, when we did

-       pte = pte_mknonnuma(pte);
+       /* Make it present again */
+       pte = pte_modify(pte, vma->vm_page_prot);
+       pte = pte_mkyoung(pte);

that isn't equivalent at all - it makes the page read-only, because it
resets the protections back to the vma defaults, and for a private
mapping the default is read-only.
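
If that's right, an equivalent-preserving version would have to
remember whether the pte was writable and put the write bit back by
hand. Roughly like this - an untested sketch, which also assumes the
write bit actually survives until the hinting fault, which may need a
matching change on the side that makes the pte prot-none:

+       bool was_writable = pte_write(pte);
        /* Make it present again */
        pte = pte_modify(pte, vma->vm_page_prot);
        pte = pte_mkyoung(pte);
+       if (was_writable)
+               pte = pte_mkwrite(pte);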

Now, that isn't actually what hurts most, I suspect. Judging by the
profiles, we don't suddenly take a lot of new COW faults. No, what
hurts most is that the NUMA balancing code does this:

        /*
         * Avoid grouping on DSO/COW pages in specific and RO pages
         * in general, RO pages shouldn't hurt as much anyway since
         * they can be in shared cache state.
         */
        if (!pte_write(pte))
                flags |= TNF_NO_GROUP;

and that "!pte_write(pte)" is basically now *always* true for private
mappings (which is 99% of all mappings).

In other words, I think the patch unintentionally made the NUMA code
basically always do the TNF_NO_GROUP case.

I think that a quick hack for testing might be to just replace that
"!pte_write()" with "!pte_dirty()", and see how that acts.

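IOW, something like this - untested, and pte_dirty() is of course only
an approximation of "this process actually wrote to the page":

        /*
         * Avoid grouping on DSO/COW pages in specific and RO pages
         * in general, RO pages shouldn't hurt as much anyway since
         * they can be in shared cache state.
         */
-       if (!pte_write(pte))
+       if (!pte_dirty(pte))
                flags |= TNF_NO_GROUP;
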
Comments?

                      Linus
