Discussion:
[PATCH] powerpc/stacktrace: fix raise_backtrace_ipi() logic
Nathan Lynch
2021-05-27 01:15:50 UTC
When smp_send_safe_nmi_ipi() indicates that the target CPU has
responded to the IPI, skip the remote paca inspection
fallback. Otherwise both the sending and target CPUs attempt the
backtrace, usually creating a misleading ("didn't respond to backtrace
IPI" is wrong) and interleaved mess:

[ 1658.929157][ C7] rcu: Stack dump where RCU GP kthread last ran:
[ 1658.929223][ C7] Sending NMI from CPU 7 to CPUs 1:
[ 1658.929303][ C1] NMI backtrace for cpu 1
[ 1658.929303][ C7] CPU 1 didn't respond to backtrace IPI, inspecting paca.
[ 1658.929362][ C1] CPU: 1 PID: 325 Comm: kworker/1:1H Tainted: G W E 5.13.0-rc2+ #46
[ 1658.929405][ C7] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 325 (kworker/1:1H)
[ 1658.929465][ C1] Workqueue: events_highpri test_work_fn [test_lockup]
[ 1658.929549][ C7] Back trace of paca->saved_r1 (0xc0000000057fb400) (possibly stale):
[ 1658.929592][ C1] NIP: c00000000002cf50 LR: c008000000820178 CTR: c00000000002cfa0

Verified using the test_lockup module, e.g.

$ echo 5 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
$ insmod test_lockup.ko time_secs=1 iterations=10 state=R lock_rcu \
touch_softlockup all_cpus

Fixes: 5cc05910f26e ("powerpc/64s: Wire up arch_trigger_cpumask_backtrace()")
Signed-off-by: Nathan Lynch <***@linux.ibm.com>
---
arch/powerpc/kernel/stacktrace.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index 1deb1bf331dd..e0ccc5a46d7e 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -174,11 +174,14 @@ static void raise_backtrace_ipi(cpumask_t *mask)
 {
         unsigned int cpu;
 
+        if (cpumask_test_cpu(smp_processor_id(), mask)) {
+                handle_backtrace_ipi(NULL);
+                cpumask_clear_cpu(smp_processor_id(), mask);
+        }
+
         for_each_cpu(cpu, mask) {
-                if (cpu == smp_processor_id())
-                        handle_backtrace_ipi(NULL);
-                else
-                        smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC);
+                if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
+                        cpumask_clear_cpu(cpu, mask);
         }
 
         for_each_cpu(cpu, mask) {
--
2.31.1
Michael Ellerman
2021-06-02 11:08:20 UTC
Post by Nathan Lynch
When smp_send_safe_nmi_ipi() indicates that the target CPU has
responded to the IPI, skip the remote paca inspection
fallback. Otherwise both the sending and target CPUs attempt the
backtrace, usually creating a misleading ("didn't respond to backtrace
Thanks for fixing my bugs for me :)
Post by Nathan Lynch
[ 1658.929303][ C1] NMI backtrace for cpu 1
[ 1658.929303][ C7] CPU 1 didn't respond to backtrace IPI, inspecting paca.
[ 1658.929362][ C1] CPU: 1 PID: 325 Comm: kworker/1:1H Tainted: G W E 5.13.0-rc2+ #46
[ 1658.929405][ C7] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 325 (kworker/1:1H)
[ 1658.929465][ C1] Workqueue: events_highpri test_work_fn [test_lockup]
[ 1658.929592][ C1] NIP: c00000000002cf50 LR: c008000000820178 CTR: c00000000002cfa0
Verified using the test_lockup module, e.g.
$ echo 5 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
$ insmod test_lockup.ko time_secs=1 iterations=10 state=R lock_rcu \
touch_softlockup all_cpus
Fixes: 5cc05910f26e ("powerpc/64s: Wire up arch_trigger_cpumask_backtrace()")
---
arch/powerpc/kernel/stacktrace.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index 1deb1bf331dd..e0ccc5a46d7e 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -174,11 +174,14 @@ static void raise_backtrace_ipi(cpumask_t *mask)
 {
         unsigned int cpu;
 
+        if (cpumask_test_cpu(smp_processor_id(), mask)) {
+                handle_backtrace_ipi(NULL);
+                cpumask_clear_cpu(smp_processor_id(), mask);
+        }
+
         for_each_cpu(cpu, mask) {
-                if (cpu == smp_processor_id())
-                        handle_backtrace_ipi(NULL);
-                else
-                        smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC);
+                if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
+                        cpumask_clear_cpu(cpu, mask);
I think there's still a race here, but instead of causing us to emit a
spurious "didn't respond" trace, it could lead to us failing to emit a
proper trace when we should.

It's hard to follow this code, but mask above is backtrace_mask from
lib/nmi_backtrace.c, because of:

void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
                                   bool exclude_self,
                                   void (*raise)(cpumask_t *mask))
{
        int i, this_cpu = get_cpu();

        cpumask_copy(to_cpumask(backtrace_mask), mask);
                                ^^^^^^^^^^^^^^
        ...

        if (!cpumask_empty(to_cpumask(backtrace_mask))) {
                pr_info("Sending NMI from CPU %d to CPUs %*pbl:\n",
                        this_cpu, nr_cpumask_bits, to_cpumask(backtrace_mask));
                raise(to_cpumask(backtrace_mask));
                                 ^^^^^^^^^^^^^^

And raise there is raise_backtrace_ipi() (the function we're patching).

On the receiving CPU we end up executing:

int smp_handle_nmi_ipi(struct pt_regs *regs)
{
        ...
        nmi_ipi_lock_start(&flags);
        if (cpumask_test_cpu(me, &nmi_ipi_pending_mask)) {
                cpumask_clear_cpu(me, &nmi_ipi_pending_mask);
                fn = READ_ONCE(nmi_ipi_function);
                ...
        }
        nmi_ipi_unlock_end(&flags);

        if (fn)
                fn(regs);


The key detail is that we drop the nmi lock before calling fn, which
means the calling CPU can return to raise_backtrace_ipi() before fn is
called.

In our case fn is handle_backtrace_ipi() which just calls nmi_cpu_backtrace().
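
For reference, the handler itself is only a thin wrapper:

static void handle_backtrace_ipi(struct pt_regs *regs)
{
        nmi_cpu_backtrace(regs);
}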

And nmi_cpu_backtrace() does:

bool nmi_cpu_backtrace(struct pt_regs *regs)
{
        int cpu = smp_processor_id();

        if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
                ...
                pr_warn("NMI backtrace for cpu %d\n", cpu);
                if (regs)
                        show_regs(regs);
                else
                        dump_stack();


i.e. if the CPU has been cleared from backtrace_mask, it doesn't emit a
stack trace.

So we could end up with the following interleaving:


CPU0                                            CPU1
====                                            ====
if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, ...
                                                // smp_handle_nmi_ipi()
                                                fn = READ_ONCE(nmi_ipi_function);
                                                ...
                                                nmi_ipi_unlock_end(&flags);
// in smp_send_safe_nmi_ipi()
nmi_ipi_lock();
while (!cpumask_empty(&nmi_ipi_pending_mask)) {
        ...
nmi_ipi_unlock_end(&flags);
return ret;

cpumask_clear_cpu(cpu, mask);
                                                fn(regs)
                                                // -> nmi_cpu_backtrace()
                                                if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {

But like I said, it's not easy to follow, so maybe I missed something
along the way.


To solve it I think we want to avoid clearing a CPU from the mask unless
we know that the IPI failed for that CPU. That way there's no risk of
suppressing a trace from a CPU that successfully handles the IPI, and we
know we've waited 5 seconds for CPUs that fail to handle the IPI.

I don't think we want to allocate a whole new cpumask to track which
CPUs have failed to respond, but I don't think we need to. We can just
synchronously handle them.

Something like below.

cheers


diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index 1deb1bf331dd..980e87f7ae7a 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -172,17 +172,19 @@ static void handle_backtrace_ipi(struct pt_regs *regs)
 
 static void raise_backtrace_ipi(cpumask_t *mask)
 {
+        struct paca_struct *p;
         unsigned int cpu;
 
         for_each_cpu(cpu, mask) {
-                if (cpu == smp_processor_id())
+                if (cpu == smp_processor_id()) {
                         handle_backtrace_ipi(NULL);
-                else
-                        smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC);
-        }
+                        continue;
+                }
 
-        for_each_cpu(cpu, mask) {
-                struct paca_struct *p = paca_ptrs[cpu];
+                if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
+                        continue;
+
+                p = paca_ptrs[cpu];
 
                 cpumask_clear_cpu(cpu, mask);
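
Folding that hunk into its surrounding context, the resulting function
would read roughly as below. This is a sketch, not the exact committed
patch: the diff above is trimmed by the archive, and the pre-existing
paca inspection code is elided after the pr_warn().

static void raise_backtrace_ipi(cpumask_t *mask)
{
        struct paca_struct *p;
        unsigned int cpu;

        for_each_cpu(cpu, mask) {
                /* Backtrace ourselves directly. */
                if (cpu == smp_processor_id()) {
                        handle_backtrace_ipi(NULL);
                        continue;
                }

                /*
                 * A non-zero return means the target CPU handled the IPI
                 * within the 5s window (and cleared itself from the mask
                 * in nmi_cpu_backtrace()), so there is nothing more to do
                 * for it.
                 */
                if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
                        continue;

                /*
                 * The IPI failed or timed out: fall back to inspecting
                 * the remote paca synchronously, from this CPU.
                 */
                p = paca_ptrs[cpu];

                cpumask_clear_cpu(cpu, mask);

                pr_warn("CPU %d didn't respond to backtrace IPI, inspecting paca.\n", cpu);
                /* ... existing paca / saved_r1 inspection, unchanged ... */
        }
}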
Nathan Lynch
2021-06-02 19:44:39 UTC
Hi Michael,
Post by Michael Ellerman
Post by Nathan Lynch
When smp_send_safe_nmi_ipi() indicates that the target CPU has
responded to the IPI, skip the remote paca inspection
fallback. Otherwise both the sending and target CPUs attempt the
backtrace, usually creating a misleading ("didn't respond to backtrace
Thanks for fixing my bugs for me :)
Thanks for your review! I was beginning to think I had missed some
subtlety here; thanks for illustrating it.

I'll run with your proposed change below for the problem I'm working.
Post by Michael Ellerman
To solve it I think we want to avoid clearing a CPU from the mask unless
we know that the IPI failed for that CPU. That way there's no risk of
suppressing a trace from a CPU that successfully handles the IPI, and we
know we've waited 5 seconds for CPUs that fail to handle the IPI.
I don't think we want to allocate a whole new cpumask to track which
CPUs have failed to respond, but I don't think we need to. We can just
synchronously handle them.
diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index 1deb1bf331dd..980e87f7ae7a 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -172,17 +172,19 @@ static void handle_backtrace_ipi(struct pt_regs *regs)
 
 static void raise_backtrace_ipi(cpumask_t *mask)
 {
+        struct paca_struct *p;
         unsigned int cpu;
 
         for_each_cpu(cpu, mask) {
-                if (cpu == smp_processor_id())
+                if (cpu == smp_processor_id()) {
                         handle_backtrace_ipi(NULL);
-                else
-                        smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC);
-        }
+                        continue;
+                }
 
-        for_each_cpu(cpu, mask) {
-                struct paca_struct *p = paca_ptrs[cpu];
+                if (smp_send_safe_nmi_ipi(cpu, handle_backtrace_ipi, 5 * USEC_PER_SEC))
+                        continue;
+
+                p = paca_ptrs[cpu];
 
                 cpumask_clear_cpu(cpu, mask);
Michael Ellerman
2021-06-03 05:34:11 UTC
Post by Nathan Lynch
Hi Michael,
Post by Michael Ellerman
Post by Nathan Lynch
When smp_send_safe_nmi_ipi() indicates that the target CPU has
responded to the IPI, skip the remote paca inspection
fallback. Otherwise both the sending and target CPUs attempt the
backtrace, usually creating a misleading ("didn't respond to backtrace
Thanks for fixing my bugs for me :)
Thanks for your review! I was beginning to think I had missed some
subtlety here; thanks for illustrating it.
I'll run with your proposed change below for the problem I'm working.
Thanks. I did test it a bit with the test_lockup module, but some real
world testing would be good too.

cheers
Nathan Lynch
2021-06-07 21:50:44 UTC
Post by Michael Ellerman
Post by Nathan Lynch
Hi Michael,
Post by Michael Ellerman
Post by Nathan Lynch
When smp_send_safe_nmi_ipi() indicates that the target CPU has
responded to the IPI, skip the remote paca inspection
fallback. Otherwise both the sending and target CPUs attempt the
backtrace, usually creating a misleading ("didn't respond to backtrace
Thanks for fixing my bugs for me :)
Thanks for your review! I was beginning to think I had missed some
subtlety here; thanks for illustrating it.
I'll run with your proposed change below for the problem I'm working.
Thanks. I did test it a bit with the test_lockup module, but some real
world testing would be good too.
Been running with this while working some LPM issues and can confirm it
waits the intended amount of time before falling back to a remote stack
walk, avoiding interleaved traces from the source and target CPUs. You
can add my Tested-by, thanks.
