Discussion:
[PATCH v4 0/4] shoot lazy tlbs
Nicholas Piggin
2021-06-05 01:42:12 UTC
The additional unused config option was a valid criticism, so this series
now just toggles refcounting of the lazy tlb mm.

Thanks,
Nick

Since v3:
- Removed the extra config option, MMU_LAZY_TLB=n. This can be
resurrected if an arch wants it.

Nicholas Piggin (4):
lazy tlb: introduce lazy mm refcount helper functions
lazy tlb: allow lazy tlb mm refcounting to be configurable
lazy tlb: shoot lazies, a non-refcounting lazy tlb option
powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

arch/Kconfig | 17 ++++++++++
arch/arm/mach-rpc/ecard.c | 2 +-
arch/powerpc/Kconfig | 1 +
arch/powerpc/kernel/smp.c | 2 +-
arch/powerpc/mm/book3s64/radix_tlb.c | 4 +--
fs/exec.c | 4 +--
include/linux/sched/mm.h | 20 +++++++++++
kernel/cpu.c | 2 +-
kernel/exit.c | 2 +-
kernel/fork.c | 51 ++++++++++++++++++++++++++++
kernel/kthread.c | 11 +++---
kernel/sched/core.c | 35 +++++++++++++------
kernel/sched/sched.h | 4 ++-
13 files changed, 132 insertions(+), 23 deletions(-)
--
2.23.0
Nicholas Piggin
2021-06-05 01:42:13 UTC
Add explicit _lazy_tlb annotated functions for lazy mm refcounting.
This makes lazy mm references more obvious, and allows explicit
refcounting to be removed if it is not used.
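
As an illustration only (not part of the patch), the reference accounting that the new helpers annotate can be modelled in user space. The sketch below assumes a toy mm with a plain integer count; every toy_* name is hypothetical. It follows the four cases listed in the context_switch() comment further down in this patch: user -> kernel takes a lazy reference, kernel -> kernel transfers it, kernel -> user drops it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct mm_struct: only an mm_count-style pin count. */
struct toy_mm {
	int count;
};

/* Toy stand-in for a task: mm is NULL for kernel threads. */
struct toy_task {
	struct toy_mm *mm;        /* owned address space, NULL for kthreads */
	struct toy_mm *active_mm; /* address space currently loaded on the CPU */
};

static void toy_mmgrab(struct toy_mm *mm)
{
	mm->count++;
}

static void toy_mmdrop(struct toy_mm *mm)
{
	assert(mm->count > 0);
	mm->count--;
}

/* As in this patch, the lazy variants are plain wrappers for now. */
static void toy_mmgrab_lazy_tlb(struct toy_mm *mm)
{
	toy_mmgrab(mm);
}

static void toy_mmdrop_lazy_tlb(struct toy_mm *mm)
{
	toy_mmdrop(mm);
}

/* Models only the reference handling of context_switch(). */
static void toy_context_switch(struct toy_task *prev, struct toy_task *next)
{
	struct toy_mm *prev_lazy = NULL;

	if (!next->mm) {                 /* to kernel: run lazily on prev's mm */
		next->active_mm = prev->active_mm;
		if (prev->mm)            /* from user: take a lazy reference */
			toy_mmgrab_lazy_tlb(prev->active_mm);
		else                     /* from kernel: transfer the reference */
			prev->active_mm = NULL;
	} else if (!prev->mm) {          /* kernel -> user: lazy reference ends */
		prev_lazy = prev->active_mm;
		prev->active_mm = NULL;
	}

	if (prev_lazy)                   /* the kernel does this in finish_task_switch() */
		toy_mmdrop_lazy_tlb(prev_lazy);
}
```

Switching user -> kthread -> kthread -> user with this model raises the count by one on the first switch, leaves it alone on the second, and returns it to its starting value on the third.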

Signed-off-by: Nicholas Piggin <***@gmail.com>
---
arch/arm/mach-rpc/ecard.c | 2 +-
arch/powerpc/kernel/smp.c | 2 +-
arch/powerpc/mm/book3s64/radix_tlb.c | 4 ++--
fs/exec.c | 4 ++--
include/linux/sched/mm.h | 11 +++++++++++
kernel/cpu.c | 2 +-
kernel/exit.c | 2 +-
kernel/kthread.c | 11 +++++++----
kernel/sched/core.c | 15 ++++++++-------
9 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index 827b50f1c73e..1b4a41aad793 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -253,7 +253,7 @@ static int ecard_init_mm(void)
current->mm = mm;
current->active_mm = mm;
activate_mm(active_mm, mm);
- mmdrop(active_mm);
+ mmdrop_lazy_tlb(active_mm);
ecard_init_pgtables(mm);
return 0;
}
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 2e05c783440a..fb0bdfc67366 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1541,7 +1541,7 @@ void start_secondary(void *unused)
{
unsigned int cpu = raw_smp_processor_id();

- mmgrab(&init_mm);
+ mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;

smp_store_cpu_info(cpu);
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 409e61210789..2962082787c0 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -663,10 +663,10 @@ void exit_lazy_flush_tlb(struct mm_struct *mm, bool always_flush)
if (current->active_mm == mm) {
WARN_ON_ONCE(current->mm != NULL);
/* Is a kernel thread and is using mm as the lazy tlb */
- mmgrab(&init_mm);
+ mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;
switch_mm_irqs_off(mm, &init_mm, current);
- mmdrop(mm);
+ mmdrop_lazy_tlb(mm);
}

/*
diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..ca0f8b1af23a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1029,9 +1029,9 @@ static int exec_mmap(struct mm_struct *mm)
setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
mm_update_next_owner(old_mm);
mmput(old_mm);
- return 0;
+ } else {
+ mmdrop_lazy_tlb(active_mm);
}
- mmdrop(active_mm);
return 0;
}

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e24b1fe348e3..bfd1baca5266 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,17 @@ static inline void mmdrop(struct mm_struct *mm)
__mmdrop(mm);
}

+/* Helpers for lazy TLB mm refcounting */
+static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
+{
+ mmgrab(mm);
+}
+
+static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
+{
+ mmdrop(mm);
+}
+
/**
* mmget() - Pin the address space associated with a &struct mm_struct.
* @mm: The address space to pin.
diff --git a/kernel/cpu.c b/kernel/cpu.c
index e538518556f4..e87a89824e6c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -602,7 +602,7 @@ static int finish_cpu(unsigned int cpu)
*/
if (mm != &init_mm)
idle->active_mm = &init_mm;
- mmdrop(mm);
+ mmdrop_lazy_tlb(mm);
return 0;
}

diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..8e87ec5f6be2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -476,7 +476,7 @@ static void exit_mm(void)
__set_current_state(TASK_RUNNING);
mmap_read_lock(mm);
}
- mmgrab(mm);
+ mmgrab_lazy_tlb(mm);
BUG_ON(mm != current->active_mm);
/* more a memory barrier than a real lock */
task_lock(current);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index fe3f2a40d61e..b70e28431a01 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1314,14 +1314,14 @@ void kthread_use_mm(struct mm_struct *mm)
WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
WARN_ON_ONCE(tsk->mm);

+ mmgrab(mm);
+
task_lock(tsk);
/* Hold off tlb flush IPIs while switching mm's */
local_irq_disable();
active_mm = tsk->active_mm;
- if (active_mm != mm) {
- mmgrab(mm);
+ if (active_mm != mm)
tsk->active_mm = mm;
- }
tsk->mm = mm;
membarrier_update_current_mm(mm);
switch_mm_irqs_off(active_mm, mm, tsk);
@@ -1341,7 +1341,7 @@ void kthread_use_mm(struct mm_struct *mm)
* mmdrop(), or explicitly with smp_mb().
*/
if (active_mm != mm)
- mmdrop(active_mm);
+ mmdrop_lazy_tlb(active_mm);
else
smp_mb();

@@ -1375,10 +1375,13 @@ void kthread_unuse_mm(struct mm_struct *mm)
local_irq_disable();
tsk->mm = NULL;
membarrier_update_current_mm(NULL);
+ mmgrab_lazy_tlb(mm);
/* active_mm is still 'mm' */
enter_lazy_tlb(mm, tsk);
local_irq_enable();
task_unlock(tsk);
+
+ mmdrop(mm);
}
EXPORT_SYMBOL_GPL(kthread_unuse_mm);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5226cc26a095..e359c76ea2e2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4229,13 +4229,14 @@ static struct rq *finish_task_switch(struct task_struct *prev)
* rq->curr, before returning to userspace, so provide them here:
*
* - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
- * provided by mmdrop(),
+ * provided by mmdrop_lazy_tlb(),
* - a sync_core for SYNC_CORE.
*/
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
- mmdrop(mm);
+ mmdrop_lazy_tlb(mm);
}
+
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
@@ -4299,9 +4300,9 @@ context_switch(struct rq *rq, struct task_struct *prev,

/*
* kernel -> kernel lazy + transfer active
- * user -> kernel lazy + mmgrab() active
+ * user -> kernel lazy + mmgrab_lazy_tlb() active
*
- * kernel -> user switch + mmdrop() active
+ * kernel -> user switch + mmdrop_lazy_tlb() active
* user -> user switch
*/
if (!next->mm) { // to kernel
@@ -4309,7 +4310,7 @@ context_switch(struct rq *rq, struct task_struct *prev,

next->active_mm = prev->active_mm;
if (prev->mm) // from user
- mmgrab(prev->active_mm);
+ mmgrab_lazy_tlb(prev->active_mm);
else
prev->active_mm = NULL;
} else { // to user
@@ -4325,7 +4326,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
switch_mm_irqs_off(prev->active_mm, next->mm, next);

if (!prev->mm) { // from kernel
- /* will mmdrop() in finish_task_switch(). */
+ /* will mmdrop_lazy_tlb() in finish_task_switch(). */
rq->prev_mm = prev->active_mm;
prev->active_mm = NULL;
}
@@ -8239,7 +8240,7 @@ void __init sched_init(void)
/*
* The boot idle thread does lazy MMU switching as well:
*/
- mmgrab(&init_mm);
+ mmgrab_lazy_tlb(&init_mm);
enter_lazy_tlb(&init_mm, current);

/*
--
2.23.0
Andrew Morton
2021-06-07 23:49:34 UTC
Post by Nicholas Piggin
Add explicit _lazy_tlb annotated functions for lazy mm refcounting.
This makes lazy mm references more obvious, and allows explicit
refcounting to be removed if it is not used.
...
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1314,14 +1314,14 @@ void kthread_use_mm(struct mm_struct *mm)
WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
WARN_ON_ONCE(tsk->mm);
+ mmgrab(mm);
+
task_lock(tsk);
/* Hold off tlb flush IPIs while switching mm's */
local_irq_disable();
active_mm = tsk->active_mm;
- if (active_mm != mm) {
- mmgrab(mm);
+ if (active_mm != mm)
tsk->active_mm = mm;
- }
Looks like a functional change. What's happening here?
Nicholas Piggin
2021-06-08 01:39:56 UTC
Post by Andrew Morton
Post by Nicholas Piggin
Add explicit _lazy_tlb annotated functions for lazy mm refcounting.
This makes lazy mm references more obvious, and allows explicit
refcounting to be removed if it is not used.
...
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1314,14 +1314,14 @@ void kthread_use_mm(struct mm_struct *mm)
WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
WARN_ON_ONCE(tsk->mm);
+ mmgrab(mm);
+
task_lock(tsk);
/* Hold off tlb flush IPIs while switching mm's */
local_irq_disable();
active_mm = tsk->active_mm;
- if (active_mm != mm) {
- mmgrab(mm);
+ if (active_mm != mm)
tsk->active_mm = mm;
- }
Looks like a functional change. What's happening here?
That's kthread_use_mm being clever about the lazy tlb mm. If it happened
that the kthread had inherited a lazy tlb mm that happens to be the
one we want to use here, then we already have a refcount to it via the
lazy tlb ref.

So then it doesn't have to touch the refcount, but rather just converts
it from the lazy tlb ref to the returned reference. If the lazy tlb mm
doesn't get a reference, we can't do that.

Thanks,
Nick
Andrew Morton
2021-06-08 01:48:05 UTC
Post by Nicholas Piggin
Post by Andrew Morton
Looks like a functional change. What's happening here?
That's kthread_use_mm being clever about the lazy tlb mm. If it happened
that the kthread had inherited a lazy tlb mm that happens to be the
one we want to use here, then we already have a refcount to it via the
lazy tlb ref.
So then it doesn't have to touch the refcount, but rather just converts
it from the lazy tlb ref to the returned reference. If the lazy tlb mm
doesn't get a reference, we can't do that.
Please cover this in the changelog and perhaps a code comment.
Nicholas Piggin
2021-06-08 04:11:04 UTC
Post by Andrew Morton
Post by Nicholas Piggin
Post by Andrew Morton
Looks like a functional change. What's happening here?
That's kthread_use_mm being clever about the lazy tlb mm. If it happened
that the kthread had inherited a lazy tlb mm that happens to be the
one we want to use here, then we already have a refcount to it via the
lazy tlb ref.
So then it doesn't have to touch the refcount, but rather just converts
it from the lazy tlb ref to the returned reference. If the lazy tlb mm
doesn't get a reference, we can't do that.
Please cover this in the changelog and perhaps a code comment.
Yeah fair enough, I'll even throw in a bug fix as well (your nose was right,
and it was too clever for me by half...)

Thanks,
Nick

--
Fix a refcounting bug in kthread_use_mm (the mm reference is increased
unconditionally now, but the lazy tlb refcount is still dropped only
if mm != active_mm).

And an update for the changelog:

If a kernel thread's current lazy tlb mm happens to be the one it wants to
use, then kthread_use_mm() cleverly transfers the mm refcount from the
lazy tlb mm reference to the returned reference. If the lazy tlb mm
reference is no longer identical to a normal reference, this trick does not
work, so that is changed to be explicit about the two references.
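
To make the accounting concrete, here is a small user-space sketch (an illustration, not kernel code) of the three variants of kthread_use_mm() discussed in this thread: the original transfer trick, the version as first posted in this series, and the fixed version. The toy_* names and the scheme enum are hypothetical, and the count is a plain int rather than the kernel's atomic mm_count.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mm {
	int count;                 /* models mm_count */
};

struct toy_kthread {
	struct toy_mm *mm;         /* NULL until the kthread uses an mm */
	struct toy_mm *active_mm;  /* lazy tlb mm while mm is NULL */
};

enum toy_scheme {
	TOY_TRANSFER_TRICK,  /* old code: reuse the lazy ref if active_mm == mm */
	TOY_POSTED_V4,       /* grab always, drop lazy ref only if active_mm != mm */
	TOY_FIXED            /* grab always, drop lazy ref always */
};

static void toy_kthread_use_mm(struct toy_kthread *t, struct toy_mm *mm,
			       enum toy_scheme s)
{
	struct toy_mm *active_mm = t->active_mm;

	/* mmgrab(mm): conditional in the old code, unconditional afterwards. */
	if (s != TOY_TRANSFER_TRICK || active_mm != mm)
		mm->count++;

	t->active_mm = mm;
	t->mm = mm;

	/* mmdrop_lazy_tlb(active_mm): only the fixed version always drops. */
	if (s == TOY_FIXED || active_mm != mm)
		active_mm->count--;
}
```

Starting from a count of 2 (one owner reference plus the kthread's lazy reference) with active_mm == mm, the transfer trick and the fixed version both end at 2, while the version as first posted ends at 3, which is the leaked reference the fix above removes.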

Signed-off-by: Nicholas Piggin <***@gmail.com>
---
kernel/kthread.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index b70e28431a01..5e9797b2d06e 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1314,6 +1314,11 @@ void kthread_use_mm(struct mm_struct *mm)
WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
WARN_ON_ONCE(tsk->mm);

+ /*
+ * It's possible that tsk->active_mm == mm here, but we must
+ * still mmgrab(mm) and mmdrop_lazy_tlb(active_mm), because lazy
+ * mm may not have its own refcount (see mmgrab/drop_lazy_tlb()).
+ */
mmgrab(mm);

task_lock(tsk);
@@ -1338,12 +1343,9 @@ void kthread_use_mm(struct mm_struct *mm)
* memory barrier after storing to tsk->mm, before accessing
* user-space memory. A full memory barrier for membarrier
* {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
- * mmdrop(), or explicitly with smp_mb().
+ * mmdrop_lazy_tlb().
*/
- if (active_mm != mm)
- mmdrop_lazy_tlb(active_mm);
- else
- smp_mb();
+ mmdrop_lazy_tlb(active_mm);

to_kthread(tsk)->oldfs = force_uaccess_begin();
}
--
2.23.0
Nicholas Piggin
2021-06-05 01:42:14 UTC
Add CONFIG_MMU_LAZY_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
when it is context switched. This can be disabled by architectures that
don't require this refcounting if they clean up lazy tlb mms when the
last refcount is dropped. Currently this is always enabled, which is
what existing code does, so the patch is effectively a no-op.
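
A rough user-space analogue of the toggled helpers is sketched below, assuming C11 atomics in place of the kernel's atomic_t and smp_mb(). TOY_LAZY_TLB_REFCOUNT and the toy_* functions are hypothetical names standing in for the config option and the real helpers. The point it illustrates is that the drop path keeps providing a full barrier even when the refcount update is compiled out.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for CONFIG_MMU_LAZY_TLB_REFCOUNT (default on). */
#ifndef TOY_LAZY_TLB_REFCOUNT
#define TOY_LAZY_TLB_REFCOUNT 1
#endif

struct toy_mm {
	atomic_int count;
};

static void toy_mmgrab(struct toy_mm *mm)
{
	atomic_fetch_add(&mm->count, 1);
}

/* Returns 1 if this call dropped the last reference. */
static int toy_mmdrop(struct toy_mm *mm)
{
	/* A seq_cst read-modify-write, standing in for the full barrier
	 * that the kernel's mmdrop() implies. */
	return atomic_fetch_sub(&mm->count, 1) == 1;
}

static void toy_mmgrab_lazy_tlb(struct toy_mm *mm)
{
	if (TOY_LAZY_TLB_REFCOUNT)
		toy_mmgrab(mm);
}

static void toy_mmdrop_lazy_tlb(struct toy_mm *mm)
{
	if (TOY_LAZY_TLB_REFCOUNT) {
		(void)toy_mmdrop(mm);
	} else {
		/* No refcount update, but callers still rely on a full barrier. */
		atomic_thread_fence(memory_order_seq_cst);
	}
}
```

Building with -DTOY_LAZY_TLB_REFCOUNT=0 turns both helpers into count-neutral operations, with the drop side reduced to the fence alone.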

Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

Signed-off-by: Nicholas Piggin <***@gmail.com>
---
arch/Kconfig | 4 ++++
include/linux/sched/mm.h | 13 +++++++++++--
kernel/sched/core.c | 22 ++++++++++++++++++----
kernel/sched/sched.h | 4 +++-
4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index c45b770d3579..1cff045cdde6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -418,6 +418,10 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
irqs disabled over activate_mm. Architectures that do IPI based TLB
shootdowns should enable this.

+# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
+config MMU_LAZY_TLB_REFCOUNT
+ def_bool y
+
config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index bfd1baca5266..29e4638ad124 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -52,12 +52,21 @@ static inline void mmdrop(struct mm_struct *mm)
/* Helpers for lazy TLB mm refcounting */
static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
- mmgrab(mm);
+ if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+ mmgrab(mm);
}

static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
{
- mmdrop(mm);
+ if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
+ mmdrop(mm);
+ } else {
+ /*
+ * mmdrop_lazy_tlb must provide a full memory barrier, see the
+ * membarrier comment in finish_task_switch, which relies on this.
+ */
+ smp_mb();
+ }
}

/**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e359c76ea2e2..5e10cb712be3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4171,7 +4171,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
__releases(rq->lock)
{
struct rq *rq = this_rq();
- struct mm_struct *mm = rq->prev_mm;
+ struct mm_struct *mm = NULL;
long prev_state;

/*
@@ -4190,7 +4190,10 @@ static struct rq *finish_task_switch(struct task_struct *prev)
current->comm, current->pid, preempt_count()))
preempt_count_set(FORK_PREEMPT_COUNT);

- rq->prev_mm = NULL;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+ mm = rq->prev_lazy_mm;
+ rq->prev_lazy_mm = NULL;
+#endif

/*
* A task struct has one reference for the use as "current".
@@ -4326,9 +4329,20 @@ context_switch(struct rq *rq, struct task_struct *prev,
switch_mm_irqs_off(prev->active_mm, next->mm, next);

if (!prev->mm) { // from kernel
- /* will mmdrop_lazy_tlb() in finish_task_switch(). */
- rq->prev_mm = prev->active_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+ /* Will mmdrop_lazy_tlb() in finish_task_switch(). */
+ rq->prev_lazy_mm = prev->active_mm;
prev->active_mm = NULL;
+#else
+ /*
+ * Without MMU_LAZY_TLB_REFCOUNT there is no lazy
+ * tracking (because no rq->prev_lazy_mm) in
+ * finish_task_switch, so no mmdrop_lazy_tlb(), so no
+ * memory barrier for membarrier (see the membarrier
+ * comment in finish_task_switch()). Do it here.
+ */
+ smp_mb();
+#endif
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a189bec13729..0729cf19a987 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -961,7 +961,9 @@ struct rq {
struct task_struct *idle;
struct task_struct *stop;
unsigned long next_balance;
- struct mm_struct *prev_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+ struct mm_struct *prev_lazy_mm;
+#endif

unsigned int clock_update_flags;
u64 clock;
--
2.23.0
Nicholas Piggin
2021-06-08 03:11:29 UTC
Post by Nicholas Piggin
Add CONFIG_MMU_LAZY_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
when it is context switched. This can be disabled by architectures that
don't require this refcounting if they clean up lazy tlb mms when the
last refcount is dropped. Currently this is always enabled, which is
what existing code does, so the patch is effectively a no-op.
Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
Can I give you a couple of incremental patches for 2/4 and 3/4 to
improve the implementation requirement comments a bit, for the benefit
of other archs?

Thanks,
Nick
--

Explain the requirements for lazy tlb mm refcounting in the comment,
to help with archs that may want to disable this by some means other
than MMU_LAZY_TLB_SHOOTDOWN.

Signed-off-by: Nicholas Piggin <***@gmail.com>
---
arch/Kconfig | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 1cff045cdde6..39d8c7dcf220 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -419,6 +419,16 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
shootdowns should enable this.

# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
+# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
+# to/from kernel threads when the same mm is running on a lot of CPUs (a large
+# multi-threaded application), by reducing contention on the mm refcount.
+#
+# This can be disabled if the architecture ensures no CPUs are using an mm as a
+# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
+# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
+# final exit(2) TLB flush, for example. arch code must also ensure the
+# _lazy_tlb variants of mmgrab/mmdrop are used when dropping the lazy reference
+# to a kthread ->active_mm (non-arch code has been converted already).
config MMU_LAZY_TLB_REFCOUNT
def_bool y
--
2.23.0
Nicholas Piggin
2021-06-05 01:42:15 UTC
On big systems, the mm refcount can become highly contended when doing
a lot of context switching with threaded applications (particularly
switching between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme
that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
any remaining lazy ones.

Shootdown IPIs are some concern, but they have not been observed to be
a big problem with this scheme (the powerpc implementation generated
314 additional interrupts on a 144 CPU system during a kernel compile).
There are a number of strategies that could be employed to reduce IPIs
if they turn out to be a problem for some workload.
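
The shootdown idea can be sketched in user space as follows. This is only a model under stated assumptions: CPUs are array slots, mm_cpumask is a plain bitmask, and an "IPI" is a direct function call. All toy_* names are hypothetical; the real code in the diff below uses on_each_cpu_mask() and switch_mm().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 8

struct toy_mm {
	unsigned int cpumask;  /* bit n set: CPU n may have this mm loaded */
};

static struct toy_mm toy_init_mm;
/* What each modelled CPU currently has as its active (possibly lazy) mm. */
static struct toy_mm *toy_cpu_active_mm[TOY_NR_CPUS];

/* Stand-in for do_shoot_lazy_tlb(): runs "on" one CPU. */
static int toy_shoot_one(int cpu, struct toy_mm *mm)
{
	if (toy_cpu_active_mm[cpu] != mm)
		return 0;                       /* stale mask bit, nothing to do */
	toy_cpu_active_mm[cpu] = &toy_init_mm;  /* switch away to init_mm */
	return 1;
}

/*
 * Stand-in for shoot_lazy_tlbs(): visits every CPU in the mask.
 * Returns the number of "IPIs" sent; *switched counts CPUs that
 * really were lazy on this mm.
 */
static int toy_shoot_lazy_tlbs(struct toy_mm *mm, int *switched)
{
	int ipis = 0;

	*switched = 0;
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (!(mm->cpumask & (1u << cpu)))
			continue;
		ipis++;
		*switched += toy_shoot_one(cpu, mm);
	}
	return ipis;
}

/* Stand-in for the DEBUG_VM check: is any CPU still using mm? */
static int toy_mm_is_lazy_anywhere(const struct toy_mm *mm)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_cpu_active_mm[cpu] == mm)
			return 1;
	return 0;
}
```

The gap between IPIs sent and CPUs actually switched corresponds to false positives in the cpumask, which is what the IPI-reduction strategies listed in the patch comment aim to filter out.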

Signed-off-by: Nicholas Piggin <***@gmail.com>
---
arch/Kconfig | 13 +++++++++++++
kernel/fork.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 64 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 1cff045cdde6..f8136c893991 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -421,6 +421,19 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
config MMU_LAZY_TLB_REFCOUNT
def_bool y
+ depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+# Instead of refcounting the lazy mm struct for kernel thread references
+# (which can cause contention with multi-threaded apps on large multiprocessor
+# systems), this option causes __mmdrop to IPI all CPUs in the mm_cpumask and
+# switch to init_mm if they were using the to-be-freed mm as the lazy tlb. To
+# implement this, architectures must use _lazy_tlb variants of mm refcounting
+# when releasing kernel thread mm references, and mm_cpumask must include at
+# least all possible CPUs in which the mm might be lazy, at the time of the
+# final mmdrop. mmgrab/mmdrop in arch/ code must be switched to _lazy_tlb
+# postfix as necessary.
+config MMU_LAZY_TLB_SHOOTDOWN
+ bool

config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
diff --git a/kernel/fork.c b/kernel/fork.c
index dc06afd725cb..8085ff33c7f6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -674,6 +674,53 @@ static void check_mm(struct mm_struct *mm)
#define allocate_mm() (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
#define free_mm(mm) (kmem_cache_free(mm_cachep, (mm)))

+static void do_shoot_lazy_tlb(void *arg)
+{
+ struct mm_struct *mm = arg;
+
+ if (current->active_mm == mm) {
+ WARN_ON_ONCE(current->mm);
+ current->active_mm = &init_mm;
+ switch_mm(mm, &init_mm, current);
+ }
+}
+
+static void do_check_lazy_tlb(void *arg)
+{
+ struct mm_struct *mm = arg;
+
+ WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void shoot_lazy_tlbs(struct mm_struct *mm)
+{
+ if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+ /*
+ * IPI overheads have not been found to be expensive, but they could
+ * be reduced in a number of possible ways, for example (in
+ * roughly increasing order of complexity):
+ * - A batch of mms requiring IPIs could be gathered and freed
+ * at once.
+ * - CPUs could store their active mm somewhere that can be
+ * remotely checked without a lock, to filter out
+ * false-positives in the cpumask.
+ * - After mm_users or mm_count reaches zero, switching away
+ * from the mm could clear mm_cpumask to reduce some IPIs
+ * (some batching or delaying would help).
+ * - A delayed freeing and RCU-like quiescing sequence based on
+ * mm switching to avoid IPIs completely.
+ */
+ on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+ if (IS_ENABLED(CONFIG_DEBUG_VM))
+ on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+ } else {
+ /*
+ * In this case, lazy tlb mms are refcounted and would not reach
+ * __mmdrop until all CPUs have switched away and mmdrop()ed.
+ */
+ }
+}
+
/*
* Called when the last reference to the mm
* is dropped: either by a lazy thread or by
@@ -683,6 +730,10 @@ void __mmdrop(struct mm_struct *mm)
{
BUG_ON(mm == &init_mm);
WARN_ON_ONCE(mm == current->mm);
+
+ /* Ensure no CPUs are using this as their lazy tlb mm */
+ shoot_lazy_tlbs(mm);
+
WARN_ON_ONCE(mm == current->active_mm);
mm_free_pgd(mm);
destroy_context(mm);
--
2.23.0
Nicholas Piggin
2021-06-08 03:15:32 UTC
Nicholas Piggin
2021-06-05 01:42:16 UTC
On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second. After this patch it goes up to 118 million.

Signed-off-by: Nicholas Piggin <***@gmail.com>
---
arch/powerpc/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 088dd2afcfe4..8a092eedc692 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -252,6 +252,7 @@ config PPC
select IRQ_FORCED_THREADING
select MMU_GATHER_PAGE_SIZE
select MMU_GATHER_RCU_TABLE_FREE
+ select MMU_LAZY_TLB_SHOOTDOWN if PPC_BOOK3S_64
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE if PPC64 || NOT_COHERENT_CACHE
select NEED_SG_DMA_LENGTH
--
2.23.0
Andrew Morton
2021-06-07 23:52:41 UTC
Post by Nicholas Piggin
On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second. After this patch it goes up to 118 million.
Nice. Do we have a feel for the benefit on any real-world workloads?

Could any other architectures benefit from these changes?
Nicholas Piggin
2021-06-08 02:13:44 UTC
Post by Andrew Morton
Post by Nicholas Piggin
On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second. After this patch it goes up to 118 million.
Nice. Do we have a feel for the benefit on any real-world workloads?
Not really unfortunately. I think it's always been a "known" cacheline,
it just showed up badly on will-it-scale tests recently when Anton was
doing a sweep of low hanging scalability issues on big systems.

We have some very big systems running certain in-memory databases that
get into very high contention conditions on mutexes. That pushes context
switch rates right up while idle times stay pretty high, which would give
a lot of parallel context switching between user and idle threads, so we
might be getting a bit of this contention there.

It's not something at the top of profiles though. And on multi-threaded
workloads like this, the normal refcounting of the user mm still has
fundamental contention. It's tricky to get the change tested on these
workloads (machine time is very limited and I can't drive the software).

I suspect it could also show in things that do high net or disk IO rates
(enough to need a lot of cores), and do some user processing steps along
the way. You'd potentially get a lot of idle switching.
Post by Andrew Morton
Could any other architectures benefit from these changes?
The cacheline is going to bounce in the same situations on other archs,
so I would say yes. Rik at one stage had some patches to try to avoid it
for x86 some years ago; I don't know what happened to those.

The way powerpc has to maintain mm_cpumask for its TLB flushing makes it
relatively easy to do this shootdown, and we decided the additional IPIs
were less of a concern than the bouncing. Others have different concerns,
but I tried to make it generic and add comments explaining what other
archs can do, or possibly different ways it might be achieved.

Thanks,
Nick