Discussion: [PATCH v7 00/11] Speedup mremap on ppc64
Aneesh Kumar K.V
2021-06-07 05:51:20 UTC
This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
the platform to support updating higher-level page tables without
updating page table entries. It also requires invalidating the Page Walk
Cache on architectures that have one.
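
For context, the generic PMD-level move that HAVE_MOVE_PMD enables looks roughly
like the sketch below (trimmed from mm/mremap.c's move_normal_pmd(), comments
abridged): the whole PMD entry is relocated instead of copying the individual
PTEs, which is why the architecture must be able to rewrite the higher-level
entry directly and must flush any page walk cache that may still point at the
old PTE page.

static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
		unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
{
	spinlock_t *old_ptl, *new_ptl;
	struct mm_struct *mm = vma->vm_mm;
	pmd_t pmd;

	/* The destination pmd should have been released by free_pgtables() */
	if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
		return false;

	/* Exclusive mmap_lock prevents src/dst ptlock ordering deadlocks */
	old_ptl = pmd_lock(mm, old_pmd);
	new_ptl = pmd_lockptr(mm, new_pmd);
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

	/* Move the whole PMD entry rather than the PTEs under it */
	pmd = *old_pmd;
	pmd_clear(old_pmd);
	/* a later patch in this series switches this to pmd_populate() */
	set_pmd_at(mm, new_addr, new_pmd, pmd);
	flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);

	if (new_ptl != old_ptl)
		spin_unlock(new_ptl);
	spin_unlock(old_ptl);

	return true;
}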

Changes from v6:
* Update ppc64 flush_tlb_range to invalidate page walk cache.
* Add patches to fix race between mremap and page out
* Add patch to fix build error with page table levels 2

Changes from v5:
* Drop patch mm/mremap: Move TLB flush outside page table lock
* Add fixes for race between optimized mremap and page out

Changes from v4:
* Change function name and arguments based on review feedback.

Changes from v3:
* Fix build error reported by kernel test robot
* Address review feedback.

Changes from v2:
* switch from using mmu_gather to flush_pte_tlb_pwc_range()

Changes from v1:
* Rebase to recent upstream
* Fix build issues with tlb_gather_mmu changes


Aneesh Kumar K.V (11):
mm/mremap: Fix race between MOVE_PMD mremap and pageout
mm/mremap: Fix race between MOVE_PUD mremap and pageout
selftest/mremap_test: Update the test to handle pagesize other than 4K
selftest/mremap_test: Avoid crash with static build
mm/mremap: Convert huge PUD move to separate helper
mm/mremap: Don't enable optimized PUD move if page table levels is 2
mm/mremap: Use pmd/pud_populate to update page table entries
powerpc/mm/book3s64: Fix possible build error
mm/mremap: Allow arch runtime override
powerpc/book3s64/mm: Update flush_tlb_range to flush page walk cache
powerpc/mm: Enable HAVE_MOVE_PMD support

.../include/asm/book3s/64/tlbflush-radix.h | 2 +
arch/powerpc/include/asm/tlb.h | 6 +
arch/powerpc/mm/book3s64/radix_hugetlbpage.c | 8 +-
arch/powerpc/mm/book3s64/radix_tlb.c | 70 +++++++----
arch/powerpc/platforms/Kconfig.cputype | 2 +
include/linux/rmap.h | 13 +-
mm/mremap.c | 104 +++++++++++++--
mm/page_vma_mapped.c | 43 ++++---
tools/testing/selftests/vm/mremap_test.c | 118 ++++++++++--------
9 files changed, 251 insertions(+), 115 deletions(-)
--
2.31.1
Aneesh Kumar K.V
2021-06-07 05:51:22 UTC
CPU 1                           CPU 2                           CPU 3

mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one

mmap_write_lock_killable()

                                addr = old_addr
                                lock(pte_ptl)
lock(pud_ptl)
pud = *old_pud
pud_clear(old_pud)
flush_tlb_range(old_addr)

*new_pud = pud
                                                                *new_addr = 10; and fills
                                                                TLB with new addr
                                                                and old pfn

unlock(pud_ptl)
                                ptep_clear_flush()
                                old pfn is free.
                                                                Stale TLB entry

Fix this race by holding pud lock in pageout.

Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")
Signed-off-by: Aneesh Kumar K.V <***@linux.ibm.com>
---
include/linux/rmap.h | 4 ++++
mm/page_vma_mapped.c | 13 ++++++++++---
2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 272ab0c2b60b..491c65ce1d46 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -209,6 +209,7 @@ struct page_vma_mapped_walk {
pte_t *pte;
spinlock_t *pte_ptl;
spinlock_t *pmd_ptl;
+ spinlock_t *pud_ptl;
unsigned int flags;
};

@@ -221,6 +222,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
spin_unlock(pvmw->pte_ptl);
if (pvmw->pmd_ptl)
spin_unlock(pvmw->pmd_ptl);
+ if (pvmw->pud_ptl)
+ spin_unlock(pvmw->pud_ptl);
+
}

bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 87a2c94c7e27..c913bc34b1d3 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -180,8 +180,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
pud = pud_offset(p4d, pvmw->address);
if (!pud_present(*pud))
return false;
+
+ pvmw->pud_ptl = pud_lock(mm, pud);
pvmw->pmd = pmd_offset(pud, pvmw->address);
- pvmw->pmd_ptl = pmd_lock(mm, pvmw->pmd);
+ if (USE_SPLIT_PMD_PTLOCKS)
+ pvmw->pmd_ptl = pmd_lock(mm, pvmw->pmd);
/*
* Make sure the pmd value isn't cached in a register by the
* compiler and used as a stale value after we've observed a
@@ -235,8 +238,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
spin_unlock(pvmw->pte_ptl);
pvmw->pte_ptl = NULL;
}
- spin_unlock(pvmw->pmd_ptl);
- pvmw->pmd_ptl = NULL;
+ if (pvmw->pmd_ptl) {
+ spin_unlock(pvmw->pmd_ptl);
+ pvmw->pmd_ptl = NULL;
+ }
+ spin_unlock(pvmw->pud_ptl);
+ pvmw->pud_ptl = NULL;
goto restart;
} else {
pvmw->pte++;
--
2.31.1
Aneesh Kumar K.V
2021-06-07 05:51:21 UTC
CPU 1                           CPU 2                           CPU 3

mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one

mmap_write_lock_killable()

                                addr = old_addr
                                lock(pte_ptl)
lock(pmd_ptl)
pmd = *old_pmd
pmd_clear(old_pmd)
flush_tlb_range(old_addr)

*new_pmd = pmd
                                                                *new_addr = 10; and fills
                                                                TLB with new addr
                                                                and old pfn

unlock(pmd_ptl)
                                ptep_clear_flush()
                                old pfn is free.
                                                                Stale TLB entry

Fix this race by holding pmd lock in pageout. This still doesn't handle the race
between MOVE_PUD and pageout.

Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
Link: https://lore.kernel.org/linux-mm/CAHk-=wgXVR04eBNtxQfevontWnP6FDm+oj5vauQXP3S-***@mail.gmail.com
Signed-off-by: Aneesh Kumar K.V <***@linux.ibm.com>
---
include/linux/rmap.h | 9 ++++++---
mm/page_vma_mapped.c | 36 ++++++++++++++++++------------------
2 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index def5c62c93b3..272ab0c2b60b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -207,7 +207,8 @@ struct page_vma_mapped_walk {
unsigned long address;
pmd_t *pmd;
pte_t *pte;
- spinlock_t *ptl;
+ spinlock_t *pte_ptl;
+ spinlock_t *pmd_ptl;
unsigned int flags;
};

@@ -216,8 +217,10 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
/* HugeTLB pte is set to the relevant page table entry without pte_mapped. */
if (pvmw->pte && !PageHuge(pvmw->page))
pte_unmap(pvmw->pte);
- if (pvmw->ptl)
- spin_unlock(pvmw->ptl);
+ if (pvmw->pte_ptl)
+ spin_unlock(pvmw->pte_ptl);
+ if (pvmw->pmd_ptl)
+ spin_unlock(pvmw->pmd_ptl);
}

bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 2cf01d933f13..87a2c94c7e27 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -47,8 +47,10 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
return false;
}
}
- pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
- spin_lock(pvmw->ptl);
+ if (USE_SPLIT_PTE_PTLOCKS) {
+ pvmw->pte_ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
+ spin_lock(pvmw->pte_ptl);
+ }
return true;
}

@@ -162,8 +164,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pvmw->pte)
return false;

- pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
- spin_lock(pvmw->ptl);
+ pvmw->pte_ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
+ spin_lock(pvmw->pte_ptl);
if (!check_pte(pvmw))
return not_found(pvmw);
return true;
@@ -179,6 +181,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pud_present(*pud))
return false;
pvmw->pmd = pmd_offset(pud, pvmw->address);
+ pvmw->pmd_ptl = pmd_lock(mm, pvmw->pmd);
/*
* Make sure the pmd value isn't cached in a register by the
* compiler and used as a stale value after we've observed a
@@ -186,7 +189,6 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
*/
pmde = READ_ONCE(*pvmw->pmd);
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
- pvmw->ptl = pmd_lock(mm, pvmw->pmd);
if (likely(pmd_trans_huge(*pvmw->pmd))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
@@ -206,14 +208,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
}
}
return not_found(pvmw);
- } else {
- /* THP pmd was split under us: handle on pte level */
- spin_unlock(pvmw->ptl);
- pvmw->ptl = NULL;
}
- } else if (!pmd_present(pmde)) {
- return false;
- }
+ } else if (!pmd_present(pmde))
+ return not_found(pvmw);
+
if (!map_pte(pvmw))
goto next_pte;
while (1) {
@@ -233,19 +231,21 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
/* Did we cross page table boundary? */
if (pvmw->address % PMD_SIZE == 0) {
pte_unmap(pvmw->pte);
- if (pvmw->ptl) {
- spin_unlock(pvmw->ptl);
- pvmw->ptl = NULL;
+ if (pvmw->pte_ptl) {
+ spin_unlock(pvmw->pte_ptl);
+ pvmw->pte_ptl = NULL;
}
+ spin_unlock(pvmw->pmd_ptl);
+ pvmw->pmd_ptl = NULL;
goto restart;
} else {
pvmw->pte++;
}
} while (pte_none(*pvmw->pte));

- if (!pvmw->ptl) {
- pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
- spin_lock(pvmw->ptl);
+ if (USE_SPLIT_PTE_PTLOCKS && !pvmw->pte_ptl) {
+ pvmw->pte_ptl = pte_lockptr(mm, pvmw->pmd);
+ spin_lock(pvmw->pte_ptl);
}
}
}
--
2.31.1
Hugh Dickins
2021-06-08 00:06:28 UTC
Post by Aneesh Kumar K.V
CPU 1                           CPU 2                           CPU 3

mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one

mmap_write_lock_killable()

                                addr = old_addr
                                lock(pte_ptl)
lock(pmd_ptl)
pmd = *old_pmd
pmd_clear(old_pmd)
flush_tlb_range(old_addr)

*new_pmd = pmd
                                                                *new_addr = 10; and fills
                                                                TLB with new addr
                                                                and old pfn

unlock(pmd_ptl)
                                ptep_clear_flush()
                                old pfn is free.
                                                                Stale TLB entry

Fix this race by holding pmd lock in pageout. This still doesn't handle the race
between MOVE_PUD and pageout.
Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
This seems very wrong to me, to require another level of locking in the
rmap lookup, just to fix some new pagetable games in mremap.

But Linus asked "Am I missing something?": neither of you have mentioned
mremap's take_rmap_locks(), so I hope that already meets your need. And
if it needs to be called more often than before (see "need_rmap_locks"),
that's probably okay.
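
For reference, those helpers in mm/mremap.c are roughly the following sketch;
move_ptes() and move_pgt_entry() call them around the move when need_rmap_locks
is set, and holding the write side of these locks blocks try_to_unmap_one()'s
rmap walk of the VMA entirely:

static void take_rmap_locks(struct vm_area_struct *vma)
{
	/* Block rmap walkers (file and anon) for the duration of the move */
	if (vma->vm_file)
		i_mmap_lock_write(vma->vm_file->f_mapping);
	if (vma->anon_vma)
		anon_vma_lock_write(vma->anon_vma);
}

static void drop_rmap_locks(struct vm_area_struct *vma)
{
	if (vma->anon_vma)
		anon_vma_unlock_write(vma->anon_vma);
	if (vma->vm_file)
		i_mmap_unlock_write(vma->vm_file->f_mapping);
}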

Hugh
Post by Aneesh Kumar K.V
---
include/linux/rmap.h | 9 ++++++---
mm/page_vma_mapped.c | 36 ++++++++++++++++++------------------
2 files changed, 24 insertions(+), 21 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index def5c62c93b3..272ab0c2b60b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -207,7 +207,8 @@ struct page_vma_mapped_walk {
unsigned long address;
pmd_t *pmd;
pte_t *pte;
- spinlock_t *ptl;
+ spinlock_t *pte_ptl;
+ spinlock_t *pmd_ptl;
unsigned int flags;
};
@@ -216,8 +217,10 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
/* HugeTLB pte is set to the relevant page table entry without pte_mapped. */
if (pvmw->pte && !PageHuge(pvmw->page))
pte_unmap(pvmw->pte);
- if (pvmw->ptl)
- spin_unlock(pvmw->ptl);
+ if (pvmw->pte_ptl)
+ spin_unlock(pvmw->pte_ptl);
+ if (pvmw->pmd_ptl)
+ spin_unlock(pvmw->pmd_ptl);
}
bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 2cf01d933f13..87a2c94c7e27 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -47,8 +47,10 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
return false;
}
}
- pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
- spin_lock(pvmw->ptl);
+ if (USE_SPLIT_PTE_PTLOCKS) {
+ pvmw->pte_ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
+ spin_lock(pvmw->pte_ptl);
+ }
return true;
}
@@ -162,8 +164,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pvmw->pte)
return false;
- pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
- spin_lock(pvmw->ptl);
+ pvmw->pte_ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
+ spin_lock(pvmw->pte_ptl);
if (!check_pte(pvmw))
return not_found(pvmw);
return true;
@@ -179,6 +181,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pud_present(*pud))
return false;
pvmw->pmd = pmd_offset(pud, pvmw->address);
+ pvmw->pmd_ptl = pmd_lock(mm, pvmw->pmd);
/*
* Make sure the pmd value isn't cached in a register by the
* compiler and used as a stale value after we've observed a
@@ -186,7 +189,6 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
*/
pmde = READ_ONCE(*pvmw->pmd);
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
- pvmw->ptl = pmd_lock(mm, pvmw->pmd);
if (likely(pmd_trans_huge(*pvmw->pmd))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
@@ -206,14 +208,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
}
}
return not_found(pvmw);
- } else {
- /* THP pmd was split under us: handle on pte level */
- spin_unlock(pvmw->ptl);
- pvmw->ptl = NULL;
}
- } else if (!pmd_present(pmde)) {
- return false;
- }
+ } else if (!pmd_present(pmde))
+ return not_found(pvmw);
+
if (!map_pte(pvmw))
goto next_pte;
while (1) {
@@ -233,19 +231,21 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
/* Did we cross page table boundary? */
if (pvmw->address % PMD_SIZE == 0) {
pte_unmap(pvmw->pte);
- if (pvmw->ptl) {
- spin_unlock(pvmw->ptl);
- pvmw->ptl = NULL;
+ if (pvmw->pte_ptl) {
+ spin_unlock(pvmw->pte_ptl);
+ pvmw->pte_ptl = NULL;
}
+ spin_unlock(pvmw->pmd_ptl);
+ pvmw->pmd_ptl = NULL;
goto restart;
} else {
pvmw->pte++;
}
} while (pte_none(*pvmw->pte));
- if (!pvmw->ptl) {
- pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
- spin_lock(pvmw->ptl);
+ if (USE_SPLIT_PTE_PTLOCKS && !pvmw->pte_ptl) {
+ pvmw->pte_ptl = pte_lockptr(mm, pvmw->pmd);
+ spin_lock(pvmw->pte_ptl);
}
}
}
--
2.31.1
Aneesh Kumar K.V
2021-06-07 05:51:23 UTC
Instead of hardcoding a 4K page size, fetch it using sysconf(). For the
performance measurements, the test still assumes 2M and 1G are hugepage sizes.

Reviewed-by: Kalesh Singh <***@google.com>
Signed-off-by: Aneesh Kumar K.V <***@linux.ibm.com>
---
tools/testing/selftests/vm/mremap_test.c | 113 ++++++++++++-----------
1 file changed, 61 insertions(+), 52 deletions(-)

diff --git a/tools/testing/selftests/vm/mremap_test.c b/tools/testing/selftests/vm/mremap_test.c
index 9c391d016922..c9a5461eb786 100644
--- a/tools/testing/selftests/vm/mremap_test.c
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -45,14 +45,15 @@ enum {
_4MB = 4ULL << 20,
_1GB = 1ULL << 30,
_2GB = 2ULL << 30,
- PTE = _4KB,
PMD = _2MB,
PUD = _1GB,
};

+#define PTE page_size
+
#define MAKE_TEST(source_align, destination_align, size, \
overlaps, should_fail, test_name) \
-{ \
+(struct test){ \
.name = test_name, \
.config = { \
.src_alignment = source_align, \
@@ -252,12 +253,17 @@ static int parse_args(int argc, char **argv, unsigned int *threshold_mb,
return 0;
}

+#define MAX_TEST 13
+#define MAX_PERF_TEST 3
int main(int argc, char **argv)
{
int failures = 0;
int i, run_perf_tests;
unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
unsigned int pattern_seed;
+ struct test test_cases[MAX_TEST];
+ struct test perf_test_cases[MAX_PERF_TEST];
+ int page_size;
time_t t;

pattern_seed = (unsigned int) time(&t);
@@ -268,56 +274,59 @@ int main(int argc, char **argv)
ksft_print_msg("Test configs:\n\tthreshold_mb=%u\n\tpattern_seed=%u\n\n",
threshold_mb, pattern_seed);

- struct test test_cases[] = {
- /* Expected mremap failures */
- MAKE_TEST(_4KB, _4KB, _4KB, OVERLAPPING, EXPECT_FAILURE,
- "mremap - Source and Destination Regions Overlapping"),
- MAKE_TEST(_4KB, _1KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE,
- "mremap - Destination Address Misaligned (1KB-aligned)"),
- MAKE_TEST(_1KB, _4KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE,
- "mremap - Source Address Misaligned (1KB-aligned)"),
-
- /* Src addr PTE aligned */
- MAKE_TEST(PTE, PTE, _8KB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "8KB mremap - Source PTE-aligned, Destination PTE-aligned"),
-
- /* Src addr 1MB aligned */
- MAKE_TEST(_1MB, PTE, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2MB mremap - Source 1MB-aligned, Destination PTE-aligned"),
- MAKE_TEST(_1MB, _1MB, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2MB mremap - Source 1MB-aligned, Destination 1MB-aligned"),
-
- /* Src addr PMD aligned */
- MAKE_TEST(PMD, PTE, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "4MB mremap - Source PMD-aligned, Destination PTE-aligned"),
- MAKE_TEST(PMD, _1MB, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "4MB mremap - Source PMD-aligned, Destination 1MB-aligned"),
- MAKE_TEST(PMD, PMD, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "4MB mremap - Source PMD-aligned, Destination PMD-aligned"),
-
- /* Src addr PUD aligned */
- MAKE_TEST(PUD, PTE, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination PTE-aligned"),
- MAKE_TEST(PUD, _1MB, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination 1MB-aligned"),
- MAKE_TEST(PUD, PMD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination PMD-aligned"),
- MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "2GB mremap - Source PUD-aligned, Destination PUD-aligned"),
- };
-
- struct test perf_test_cases[] = {
- /*
- * mremap 1GB region - Page table level aligned time
- * comparison.
- */
- MAKE_TEST(PTE, PTE, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "1GB mremap - Source PTE-aligned, Destination PTE-aligned"),
- MAKE_TEST(PMD, PMD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "1GB mremap - Source PMD-aligned, Destination PMD-aligned"),
- MAKE_TEST(PUD, PUD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
- "1GB mremap - Source PUD-aligned, Destination PUD-aligned"),
- };
+ page_size = sysconf(_SC_PAGESIZE);
+
+ /* Expected mremap failures */
+ test_cases[0] = MAKE_TEST(page_size, page_size, page_size,
+ OVERLAPPING, EXPECT_FAILURE,
+ "mremap - Source and Destination Regions Overlapping");
+
+ test_cases[1] = MAKE_TEST(page_size, page_size/4, page_size,
+ NON_OVERLAPPING, EXPECT_FAILURE,
+ "mremap - Destination Address Misaligned (1KB-aligned)");
+ test_cases[2] = MAKE_TEST(page_size/4, page_size, page_size,
+ NON_OVERLAPPING, EXPECT_FAILURE,
+ "mremap - Source Address Misaligned (1KB-aligned)");
+
+ /* Src addr PTE aligned */
+ test_cases[3] = MAKE_TEST(PTE, PTE, PTE * 2,
+ NON_OVERLAPPING, EXPECT_SUCCESS,
+ "8KB mremap - Source PTE-aligned, Destination PTE-aligned");
+
+ /* Src addr 1MB aligned */
+ test_cases[4] = MAKE_TEST(_1MB, PTE, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "2MB mremap - Source 1MB-aligned, Destination PTE-aligned");
+ test_cases[5] = MAKE_TEST(_1MB, _1MB, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "2MB mremap - Source 1MB-aligned, Destination 1MB-aligned");
+
+ /* Src addr PMD aligned */
+ test_cases[6] = MAKE_TEST(PMD, PTE, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "4MB mremap - Source PMD-aligned, Destination PTE-aligned");
+ test_cases[7] = MAKE_TEST(PMD, _1MB, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "4MB mremap - Source PMD-aligned, Destination 1MB-aligned");
+ test_cases[8] = MAKE_TEST(PMD, PMD, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "4MB mremap - Source PMD-aligned, Destination PMD-aligned");
+
+ /* Src addr PUD aligned */
+ test_cases[9] = MAKE_TEST(PUD, PTE, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "2GB mremap - Source PUD-aligned, Destination PTE-aligned");
+ test_cases[10] = MAKE_TEST(PUD, _1MB, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "2GB mremap - Source PUD-aligned, Destination 1MB-aligned");
+ test_cases[11] = MAKE_TEST(PUD, PMD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "2GB mremap - Source PUD-aligned, Destination PMD-aligned");
+ test_cases[12] = MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "2GB mremap - Source PUD-aligned, Destination PUD-aligned");
+
+ perf_test_cases[0] = MAKE_TEST(page_size, page_size, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "1GB mremap - Source PTE-aligned, Destination PTE-aligned");
+ /*
+ * mremap 1GB region - Page table level aligned time
+ * comparison.
+ */
+ perf_test_cases[1] = MAKE_TEST(PMD, PMD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "1GB mremap - Source PMD-aligned, Destination PMD-aligned");
+ perf_test_cases[2] = MAKE_TEST(PUD, PUD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+ "1GB mremap - Source PUD-aligned, Destination PUD-aligned");

run_perf_tests = (threshold_mb == VALIDATION_NO_THRESHOLD) ||
(threshold_mb * _1MB >= _1GB);
--
2.31.1
Aneesh Kumar K.V
2021-06-07 05:51:24 UTC
With a large mmap size, the mapping can overlap the text area, and using
MAP_FIXED results in unmapping that area. Switch to MAP_FIXED_NOREPLACE
and handle the EEXIST error.

Reviewed-by: Kalesh Singh <***@google.com>
Signed-off-by: Aneesh Kumar K.V <***@linux.ibm.com>
---
tools/testing/selftests/vm/mremap_test.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/mremap_test.c b/tools/testing/selftests/vm/mremap_test.c
index c9a5461eb786..0624d1bd71b5 100644
--- a/tools/testing/selftests/vm/mremap_test.c
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -75,9 +75,10 @@ static void *get_source_mapping(struct config c)
retry:
addr += c.src_alignment;
src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
- MAP_FIXED | MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+ MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+ -1, 0);
if (src_addr == MAP_FAILED) {
- if (errno == EPERM)
+ if (errno == EPERM || errno == EEXIST)
goto retry;
goto error;
}
--
2.31.1
Aneesh Kumar K.V
2021-06-07 05:51:25 UTC
With TRANSPARENT_HUGEPAGE_PUD enabled the kernel can find huge PUD entries.
Add a helper to move huge PUD entries on mremap().

This will be used by a later patch to optimize mremap of PUD_SIZE-aligned,
level-4 PTE-mapped address ranges.

This also makes sure we support mremap on huge PUD entries even with
CONFIG_HAVE_MOVE_PUD disabled.

Signed-off-by: Aneesh Kumar K.V <***@linux.ibm.com>
---
mm/mremap.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 73 insertions(+), 7 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 47c255b60150..92ab7d24a587 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -324,10 +324,62 @@ static inline bool move_normal_pud(struct vm_area_struct *vma,
}
#endif

+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PUD
+static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+{
+ spinlock_t *old_ptl, *new_ptl;
+ struct mm_struct *mm = vma->vm_mm;
+ pud_t pud;
+
+ /*
+ * The destination pud shouldn't be established, free_pgtables()
+ * should have released it.
+ */
+ if (WARN_ON_ONCE(!pud_none(*new_pud)))
+ return false;
+
+ /*
+ * We don't have to worry about the ordering of src and dst
+ * ptlocks because exclusive mmap_lock prevents deadlock.
+ */
+ old_ptl = pud_lock(vma->vm_mm, old_pud);
+ new_ptl = pud_lockptr(mm, new_pud);
+ if (new_ptl != old_ptl)
+ spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+ /* Clear the pud */
+ pud = *old_pud;
+ pud_clear(old_pud);
+
+ VM_BUG_ON(!pud_none(*new_pud));
+
+ /* Set the new pud */
+ /* mark soft_dirty when we add pud level soft dirty support */
+ set_pud_at(mm, new_addr, new_pud, pud);
+ flush_pud_tlb_range(vma, old_addr, old_addr + HPAGE_PUD_SIZE);
+ if (new_ptl != old_ptl)
+ spin_unlock(new_ptl);
+ spin_unlock(old_ptl);
+
+ return true;
+}
+#else
+static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+{
+ WARN_ON_ONCE(1);
+ return false;
+
+}
+#endif
+
enum pgt_entry {
NORMAL_PMD,
HPAGE_PMD,
NORMAL_PUD,
+ HPAGE_PUD,
};

/*
@@ -347,6 +399,7 @@ static __always_inline unsigned long get_extent(enum pgt_entry entry,
mask = PMD_MASK;
size = PMD_SIZE;
break;
+ case HPAGE_PUD:
case NORMAL_PUD:
mask = PUD_MASK;
size = PUD_SIZE;
@@ -395,6 +448,11 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
move_huge_pmd(vma, old_addr, new_addr, old_entry,
new_entry);
break;
+ case HPAGE_PUD:
+ moved = move_huge_pud(vma, old_addr, new_addr, old_entry,
+ new_entry);
+ break;
+
default:
WARN_ON_ONCE(1);
break;
@@ -414,6 +472,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long extent, old_end;
struct mmu_notifier_range range;
pmd_t *old_pmd, *new_pmd;
+ pud_t *old_pud, *new_pud;

old_end = old_addr + len;
flush_cache_range(vma, old_addr, old_end);
@@ -429,15 +488,22 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
* PUD level if possible.
*/
extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
- if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
- pud_t *old_pud, *new_pud;

- old_pud = get_old_pud(vma->vm_mm, old_addr);
- if (!old_pud)
+ old_pud = get_old_pud(vma->vm_mm, old_addr);
+ if (!old_pud)
+ continue;
+ new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
+ if (!new_pud)
+ break;
+ if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
+ if (extent == HPAGE_PUD_SIZE) {
+ move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
+ old_pud, new_pud, need_rmap_locks);
+ /* We ignore and continue on error? */
continue;
- new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
- if (!new_pud)
- break;
+ }
+ } else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
+
if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
old_pud, new_pud, need_rmap_locks))
continue;
--
2.31.1
Aneesh Kumar K.V
2021-06-07 05:51:26 UTC
With a two-level page table, don't enable move_normal_pud.

Signed-off-by: Aneesh Kumar K.V <***@linux.ibm.com>
---
mm/mremap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 92ab7d24a587..795a7d628b53 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -276,7 +276,7 @@ static inline bool move_normal_pmd(struct vm_area_struct *vma,
}
#endif

-#ifdef CONFIG_HAVE_MOVE_PUD
+#if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_HAVE_MOVE_PUD)
static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
{
--
2.31.1
Aneesh Kumar K.V
2021-06-07 05:51:27 UTC
pmd/pud_populate is the right interface to use to set the respective
page table entries. Some architectures, like ppc64, assume that set_pmd/pud_at
can only be used to set a hugepage PTE. Since we are not setting up a hugepage
PTE here, use the pmd/pud_populate interface.
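
For context, ppc64's set_pmd_at() treats its argument as a hugepage PTE; trimmed
down, the book3s64 implementation of that era looks roughly like:

void set_pmd_at(struct mm_struct *mm, unsigned long addr,
		pmd_t *pmdp, pmd_t pmd)
{
#ifdef CONFIG_DEBUG_VM
	/* expects a leaf (hugepage) entry, not a pointer to a PTE page */
	WARN_ON(!(pmd_large(pmd)));
#endif
	trace_hugepage_set_pmd(addr, pmd_val(pmd));
	return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
}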

Signed-off-by: Aneesh Kumar K.V <***@linux.ibm.com>
---
mm/mremap.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 795a7d628b53..dacfa9111ab1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -26,6 +26,7 @@

#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#include <asm/pgalloc.h>

#include "internal.h"

@@ -258,8 +259,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,

VM_BUG_ON(!pmd_none(*new_pmd));

- /* Set the new pmd */
- set_pmd_at(mm, new_addr, new_pmd, pmd);
+ pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
@@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,

VM_BUG_ON(!pud_none(*new_pud));

- /* Set the new pud */
- set_pud_at(mm, new_addr, new_pud, pud);
+ pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
if (new_ptl != old_ptl)
spin_unlock(new_ptl);
--
2.31.1
Aneesh Kumar K.V
2021-06-07 05:51:28 UTC
Update _tlbiel_pid() such that we can avoid build errors like below when
using this function in other places.

arch/powerpc/mm/book3s64/radix_tlb.c: In function ‘__radix__flush_tlb_range_psize’:
arch/powerpc/mm/book3s64/radix_tlb.c:114:2: warning: ‘asm’ operand 3 probably does not match constraints
114 | asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
| ^~~
arch/powerpc/mm/book3s64/radix_tlb.c:114:2: error: impossible constraint in ‘asm’
make[4]: *** [scripts/Makefile.build:271: arch/powerpc/mm/book3s64/radix_tlb.o] Error 1

With this fix, we can also drop the __always_inline on __radix__flush_tlb_range_psize(),
which was added by commit e12d6d7d46a6 ("powerpc/mm/radix: mark __radix__flush_tlb_range_psize() as __always_inline").

Reviewed-by: Christophe Leroy <***@csgroup.eu>
Acked-by: Michael Ellerman <***@ellerman.id.au>
Signed-off-by: Aneesh Kumar K.V <***@linux.ibm.com>
---
arch/powerpc/mm/book3s64/radix_tlb.c | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 409e61210789..817a02ef6032 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -291,22 +291,30 @@ static inline void fixup_tlbie_lpid(unsigned long lpid)
/*
* We use 128 set in radix mode and 256 set in hpt mode.
*/
-static __always_inline void _tlbiel_pid(unsigned long pid, unsigned long ric)
+static inline void _tlbiel_pid(unsigned long pid, unsigned long ric)
{
int set;

asm volatile("ptesync": : :"memory");

- /*
- * Flush the first set of the TLB, and if we're doing a RIC_FLUSH_ALL,
- * also flush the entire Page Walk Cache.
- */
- __tlbiel_pid(pid, 0, ric);
+ switch (ric) {
+ case RIC_FLUSH_PWC:

- /* For PWC, only one flush is needed */
- if (ric == RIC_FLUSH_PWC) {
+ /* For PWC, only one flush is needed */
+ __tlbiel_pid(pid, 0, RIC_FLUSH_PWC);
ppc_after_tlbiel_barrier();
return;
+ case RIC_FLUSH_TLB:
+ __tlbiel_pid(pid, 0, RIC_FLUSH_TLB);
+ break;
+ case RIC_FLUSH_ALL:
+ default:
+ /*
+ * Flush the first set of the TLB, and if
+ * we're doing a RIC_FLUSH_ALL, also flush
+ * the entire Page Walk Cache.
+ */
+ __tlbiel_pid(pid, 0, RIC_FLUSH_ALL);
}

if (!cpu_has_feature(CPU_FTR_ARCH_31)) {
@@ -1176,7 +1184,7 @@ void radix__tlb_flush(struct mmu_gather *tlb)
}
}

-static __always_inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
+static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
unsigned long start, unsigned long end,
int psize, bool also_pwc)
{
--
2.31.1
Aneesh Kumar K.V
2021-06-07 05:51:29 UTC
Architectures like ppc64 support faster mremap only with radix
translation. Hence allow a runtime check for fast-mremap support.

Signed-off-by: Aneesh Kumar K.V <***@linux.ibm.com>
---
arch/powerpc/include/asm/tlb.h | 6 ++++++
mm/mremap.c | 15 ++++++++++++++-
2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 160422a439aa..09a9ae5f3656 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -83,5 +83,11 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
}
#endif

+#define arch_supports_page_table_move arch_supports_page_table_move
+static inline bool arch_supports_page_table_move(void)
+{
+ return radix_enabled();
+}
+
#endif /* __KERNEL__ */
#endif /* __ASM_POWERPC_TLB_H */
diff --git a/mm/mremap.c b/mm/mremap.c
index dacfa9111ab1..9cd352fb9cf8 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -25,7 +25,7 @@
#include <linux/userfaultfd_k.h>

#include <asm/cacheflush.h>
-#include <asm/tlbflush.h>
+#include <asm/tlb.h>
#include <asm/pgalloc.h>

#include "internal.h"
@@ -210,6 +210,15 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
drop_rmap_locks(vma);
}

+#ifndef arch_supports_page_table_move
+#define arch_supports_page_table_move arch_supports_page_table_move
+static inline bool arch_supports_page_table_move(void)
+{
+ return IS_ENABLED(CONFIG_HAVE_MOVE_PMD) ||
+ IS_ENABLED(CONFIG_HAVE_MOVE_PUD);
+}
+#endif
+
#ifdef CONFIG_HAVE_MOVE_PMD
static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
@@ -218,6 +227,8 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
struct mm_struct *mm = vma->vm_mm;
pmd_t pmd;

+ if (!arch_supports_page_table_move())
+ return false;
/*
* The destination pmd shouldn't be established, free_pgtables()
* should have released it.
@@ -284,6 +295,8 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
struct mm_struct *mm = vma->vm_mm;
pud_t pud;

+ if (!arch_supports_page_table_move())
+ return false;
/*
* The destination pud shouldn't be established, free_pgtables()
* should have released it.
--
2.31.1
Aneesh Kumar K.V
2021-06-07 05:51:30 UTC
flush_tlb_range is special in that we don't specify the page size used
for the translation. Hence when flushing the TLB we flush the translation cache
for all possible page sizes. The kernel also uses the same interface when
moving page tables around. Such a move requires us to flush the page walk cache.

Instead of adding another interface to force a page walk cache flush,
update flush_tlb_range to flush the page walk cache if the flushed range
is at least PMD_SIZE. A page table move will always involve an
invalidate range of at least PMD_SIZE.

Running a microbenchmark with mprotect and parallel memory access
didn't show any observable performance impact.

Signed-off-by: Aneesh Kumar K.V <***@linux.ibm.com>
---
.../include/asm/book3s/64/tlbflush-radix.h | 2 +
arch/powerpc/mm/book3s64/radix_hugetlbpage.c | 8 +++-
arch/powerpc/mm/book3s64/radix_tlb.c | 44 ++++++++++++-------
3 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 8b33601cdb9d..ab9d5e535000 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -60,6 +60,8 @@ extern void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
extern void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long start,
unsigned long end, int psize);
+void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long start,
+ unsigned long end, int psize);
extern void radix__flush_pmd_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
extern void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
diff --git a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
index cb91071eef52..23d3e08911d3 100644
--- a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
@@ -32,7 +32,13 @@ void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma, unsigned long st
struct hstate *hstate = hstate_file(vma->vm_file);

psize = hstate_get_psize(hstate);
- radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
+ /*
+ * Flush PWC even if we get PUD_SIZE hugetlb invalidate to keep this simpler.
+ */
+ if (end - start >= PUD_SIZE)
+ radix__flush_tlb_pwc_range_psize(vma->vm_mm, start, end, psize);
+ else
+ radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
}

/*
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 817a02ef6032..35c5eb23bfaf 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -997,14 +997,13 @@ static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_

static inline void __radix__flush_tlb_range(struct mm_struct *mm,
unsigned long start, unsigned long end)
-
{
unsigned long pid;
unsigned int page_shift = mmu_psize_defs[mmu_virtual_psize].shift;
unsigned long page_size = 1UL << page_shift;
unsigned long nr_pages = (end - start) >> page_shift;
bool fullmm = (end == TLB_FLUSH_ALL);
- bool flush_pid;
+ bool flush_pid, flush_pwc = false;
enum tlb_flush_type type;

pid = mm->context.id;
@@ -1023,8 +1022,16 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
flush_pid = nr_pages > tlb_single_page_flush_ceiling;
else
flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
+ /*
+ * A full PID flush already does the PWC flush. If it is not a full PID
+ * flush, check whether the range is at least PMD_SIZE and force a PWC flush;
+ * mremap() depends on this behaviour.
+ */
+ if (!flush_pid && (end - start) >= PMD_SIZE)
+ flush_pwc = true;

if (!mmu_has_feature(MMU_FTR_GTSE) && type == FLUSH_TYPE_GLOBAL) {
+ unsigned long type = H_RPTI_TYPE_TLB;
unsigned long tgt = H_RPTI_TARGET_CMMU;
unsigned long pg_sizes = psize_to_rpti_pgsize(mmu_virtual_psize);

@@ -1032,19 +1039,20 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
pg_sizes |= psize_to_rpti_pgsize(MMU_PAGE_2M);
if (atomic_read(&mm->context.copros) > 0)
tgt |= H_RPTI_TARGET_NMMU;
- pseries_rpt_invalidate(pid, tgt, H_RPTI_TYPE_TLB, pg_sizes,
- start, end);
+ if (flush_pwc)
+ type |= H_RPTI_TYPE_PWC;
+ pseries_rpt_invalidate(pid, tgt, type, pg_sizes, start, end);
} else if (flush_pid) {
+ /*
+ * We are now flushing a range larger than PMD size, so force a RIC_FLUSH_ALL
+ */
if (type == FLUSH_TYPE_LOCAL) {
- _tlbiel_pid(pid, RIC_FLUSH_TLB);
+ _tlbiel_pid(pid, RIC_FLUSH_ALL);
} else {
if (cputlb_use_tlbie()) {
- if (mm_needs_flush_escalation(mm))
- _tlbie_pid(pid, RIC_FLUSH_ALL);
- else
- _tlbie_pid(pid, RIC_FLUSH_TLB);
+ _tlbie_pid(pid, RIC_FLUSH_ALL);
} else {
- _tlbiel_pid_multicast(mm, pid, RIC_FLUSH_TLB);
+ _tlbiel_pid_multicast(mm, pid, RIC_FLUSH_ALL);
}
}
} else {
@@ -1060,6 +1068,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,

if (type == FLUSH_TYPE_LOCAL) {
asm volatile("ptesync": : :"memory");
+ if (flush_pwc)
+ /* For PWC, only one flush is needed */
+ __tlbiel_pid(pid, 0, RIC_FLUSH_PWC);
__tlbiel_va_range(start, end, pid, page_size, mmu_virtual_psize);
if (hflush)
__tlbiel_va_range(hstart, hend, pid,
@@ -1067,6 +1078,8 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
ppc_after_tlbiel_barrier();
} else if (cputlb_use_tlbie()) {
asm volatile("ptesync": : :"memory");
+ if (flush_pwc)
+ __tlbie_pid(pid, RIC_FLUSH_PWC);
__tlbie_va_range(start, end, pid, page_size, mmu_virtual_psize);
if (hflush)
__tlbie_va_range(hstart, hend, pid,
@@ -1074,10 +1087,10 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
asm volatile("eieio; tlbsync; ptesync": : :"memory");
} else {
_tlbiel_va_range_multicast(mm,
- start, end, pid, page_size, mmu_virtual_psize, false);
+ start, end, pid, page_size, mmu_virtual_psize, flush_pwc);
if (hflush)
_tlbiel_va_range_multicast(mm,
- hstart, hend, pid, PMD_SIZE, MMU_PAGE_2M, false);
+ hstart, hend, pid, PMD_SIZE, MMU_PAGE_2M, flush_pwc);
}
}
out:
@@ -1151,9 +1164,6 @@ void radix__flush_all_lpid_guest(unsigned int lpid)
_tlbie_lpid_guest(lpid, RIC_FLUSH_ALL);
}

-static void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long start,
- unsigned long end, int psize);
-
void radix__tlb_flush(struct mmu_gather *tlb)
{
int psize = 0;
@@ -1260,8 +1270,8 @@ void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long start,
return __radix__flush_tlb_range_psize(mm, start, end, psize, false);
}

-static void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long start,
- unsigned long end, int psize)
+void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long start,
+ unsigned long end, int psize)
{
__radix__flush_tlb_range_psize(mm, start, end, psize, true);
}
--
2.31.1
Aneesh Kumar K.V
2021-06-07 05:51:31 UTC
mremap HAVE_MOVE_PMD/PUD optimization time comparison for 1GB region:
1GB mremap - Source PTE-aligned, Destination PTE-aligned
mremap time: 2292772ns
1GB mremap - Source PMD-aligned, Destination PMD-aligned
mremap time: 1158928ns
1GB mremap - Source PUD-aligned, Destination PUD-aligned
mremap time: 63886ns

Signed-off-by: Aneesh Kumar K.V <***@linux.ibm.com>
---
arch/powerpc/platforms/Kconfig.cputype | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index f998e655b570..be8ceb5bece4 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -101,6 +101,8 @@ config PPC_BOOK3S_64
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
select ARCH_SUPPORTS_HUGETLBFS
select ARCH_SUPPORTS_NUMA_BALANCING
+ select HAVE_MOVE_PMD
+ select HAVE_MOVE_PUD
select IRQ_WORK
select PPC_MM_SLICES
select PPC_HAVE_KUEP
--
2.31.1
Nick Piggin
2021-06-07 10:10:32 UTC
Post by Aneesh Kumar K.V
This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
the platform to support updating higher-level page tables without
updating page table entries. It also requires invalidating the Page Walk
Cache on architectures that have one.
* Update ppc64 flush_tlb_range to invalidate page walk cache.
I'd really rather not do this; I'm not sure the microbenchmark captures
everything.

Page tables coming from L2/L3 probably aren't the primary purpose or
biggest benefit of intermediate level caches.

The situation on POWER with the nest MMU (coherent accelerators) is magnified.
They have huge page walk caches to make up for the fact that they don't have
data caches for walking page tables, which makes the invalidation more
painful in terms of subsequent misses, but also in latency to invalidate (it can
be on the order of microseconds, whereas a page invalidate is a couple of orders
of magnitude faster).

Yes, it is a deficiency of the ppc invalidation architecture; we are
aware and would like to improve it, but for now this is what we have.

Thanks,
Nick
Post by Aneesh Kumar K.V
* Add patches to fix race between mremap and page out
* Add patch to fix build error with page table levels 2
* Drop patch mm/mremap: Move TLB flush outside page table lock
* Add fixes for race between optimized mremap and page out
* Change function name and arguments based on review feedback.
* Fix build error reported by kernel test robot
* Address review feedback.
* switch from using mmu_gather to flush_pte_tlb_pwc_range()
* Rebase to recent upstream
* Fix build issues with tlb_gather_mmu changes
mm/mremap: Fix race between MOVE_PMD mremap and pageout
mm/mremap: Fix race between MOVE_PUD mremap and pageout
selftest/mremap_test: Update the test to handle pagesize other than 4K
selftest/mremap_test: Avoid crash with static build
mm/mremap: Convert huge PUD move to separate helper
mm/mremap: Don't enable optimized PUD move if page table levels is 2
mm/mremap: Use pmd/pud_populate to update page table entries
powerpc/mm/book3s64: Fix possible build error
mm/mremap: Allow arch runtime override
powerpc/book3s64/mm: Update flush_tlb_range to flush page walk cache
powerpc/mm: Enable HAVE_MOVE_PMD support
.../include/asm/book3s/64/tlbflush-radix.h | 2 +
arch/powerpc/include/asm/tlb.h | 6 +
arch/powerpc/mm/book3s64/radix_hugetlbpage.c | 8 +-
arch/powerpc/mm/book3s64/radix_tlb.c | 70 +++++++----
arch/powerpc/platforms/Kconfig.cputype | 2 +
include/linux/rmap.h | 13 +-
mm/mremap.c | 104 +++++++++++++--
mm/page_vma_mapped.c | 43 ++++---
tools/testing/selftests/vm/mremap_test.c | 118 ++++++++++--------
9 files changed, 251 insertions(+), 115 deletions(-)
--
2.31.1
Aneesh Kumar K.V
2021-06-08 04:39:36 UTC
This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
the platform to support updating higher-level page tables without
updating page table entries. It also requires invalidating the Page Walk
Cache on architectures that have one.
* Update ppc64 flush_tlb_range to invalidate page walk cache.
I'd really rather not do this; I'm not sure the microbenchmark captures
everything.
Page tables coming from L2/L3 probably aren't the primary purpose or
biggest benefit of intermediate level caches.
The situation on POWER with the nest MMU (coherent accelerators) is
magnified. They have huge page walk caches to make up for the fact that they
don't have data caches for walking page tables, which makes the
invalidation more painful in terms of subsequent misses, but also in
latency to invalidate (it can be on the order of microseconds, whereas a page
invalidate is a couple of orders of magnitude faster).
If we are using NestMMU, we already upgrade that flush to invalidate the
page walk cache, right? i.e., if we have a > PMD_SIZE range, we would upgrade
the invalidate to a PID flush via

flush_pid = nr_pages > tlb_single_page_flush_ceiling;

and if it is a PID flush, and we are using NestMMU, we already upgrade a
RIC_FLUSH_TLB to RIC_FLUSH_ALL?
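
(The upgrade referred to here is mm_needs_flush_escalation() in
arch/powerpc/mm/book3s64/radix_tlb.c, roughly sketched below: a full-PID
RIC_FLUSH_TLB is escalated to RIC_FLUSH_ALL when the mm has nest-MMU/coprocessor
users attached.)

static inline bool mm_needs_flush_escalation(struct mm_struct *mm)
{
	/*
	 * The P9 nest MMU has issues with the page walk cache caching PTEs
	 * and not flushing them properly on a RIC=0 PID/LPID invalidate.
	 */
	if (atomic_read(&mm->context.copros) > 0)
		return true;
	return false;
}
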
Yes, it is a deficiency of the ppc invalidation architecture; we are
aware and would like to improve it, but for now this is what we have.
-aneesh
Nicholas Piggin
2021-06-08 05:03:22 UTC
Post by Aneesh Kumar K.V
This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
the platform to support updating higher-level page tables without
updating page table entries. It also requires invalidating the Page Walk
Cache on architectures that have one.
* Update ppc64 flush_tlb_range to invalidate page walk cache.
I'd really rather not do this; I'm not sure the microbenchmark captures
everything.
Page tables coming from L2/L3 probably aren't the primary purpose or
biggest benefit of intermediate level caches.
The situation on POWER with the nest MMU (coherent accelerators) is
magnified. They have huge page walk caches to make up for the fact that they
don't have data caches for walking page tables, which makes the
invalidation more painful in terms of subsequent misses, but also in
latency to invalidate (it can be on the order of microseconds, whereas a page
invalidate is a couple of orders of magnitude faster).
If we are using NestMMU, we already upgrade that flush to invalidate the
page walk cache, right? i.e., if we have a > PMD_SIZE range, we would upgrade
the invalidate to a PID flush via
flush_pid = nr_pages > tlb_single_page_flush_ceiling;
Not that we've tuned that parameter in a long time, and probably not with
the nMMU at all. Quite possibly it should be higher for the nMMU because of
the big TLBs they have. (And what about == PMD_SIZE?)
Post by Aneesh Kumar K.V
and if it is a PID flush, and we are using NestMMU, we already upgrade a
RIC_FLUSH_TLB to RIC_FLUSH_ALL?
Does P10 still have that bug?

At any rate, I think the core MMU still has the same issues, just less
pronounced. PWC invalidates take longer, and the PWC should have the most
benefit when CPU data caches are heavily used and aren't filled with
page table entries.

Thanks,
Nick
