[Bug 213079] New: IRQ problems and crashes on a PowerMac G5 with 5.13-rc1
b***@bugzilla.kernel.org
2021-05-15 11:58:01 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=213079

Bug ID: 213079
Summary: IRQ problems and crashes on a PowerMac G5 with 5.13-rc1
Product: Platform Specific/Hardware
Version: 2.5
Kernel Version: 5.13-rc1
Hardware: PPC-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: PPC-64
Assignee: platform_ppc-***@kernel-bugs.osdl.org
Reporter: ***@mailbox.org
Regression: No

Created attachment 296759
--> https://bugzilla.kernel.org/attachment.cgi?id=296759&action=edit
dmesg (5.13-rc1, PowerMac G5 11,2)

With v5.13-rc1 I sooner or later get IRQ problems and crashes on my G5. IRQ 63
is the interrupt line of my NVMe SSD.

[...]
irq 63: nobody cared (try booting with the "irqpoll" option)
CPU: 1 PID: 11783 Comm: emerge Tainted: G W
5.13.0-rc1-PowerMacG5 #3
Call Trace:
[c00000000ffefae0] [c000000000549790] .dump_stack+0xe0/0x13c (unreliable)
[c00000000ffefb80] [c0000000000def44] .__report_bad_irq+0x34/0xf0
[c00000000ffefc20] [c0000000000dee2c] .note_interrupt+0x258/0x300
[c00000000ffefce0] [c0000000000db0a8] .handle_irq_event_percpu+0x64/0x90
[c00000000ffefd70] [c0000000000db118] .handle_irq_event+0x44/0x70
[c00000000ffefe00] [c0000000000e0530] .handle_fasteoi_irq+0xac/0x158
[c00000000ffefea0] [c0000000000da164] .generic_handle_irq+0x38/0x58
[c00000000ffeff10] [c000000000011674] .__do_irq+0x15c/0x238
[c00000000ffeff90] [c000000000012068] .do_IRQ+0x180/0x188
[c00000014d357d70] [c000000000011f88] .do_IRQ+0xa0/0x188
[c00000014d357e10] [c000000000007f94] hardware_interrupt_common_virt+0x1a4/0x1b0
--- interrupt: 500 at 0x3fffb07a1a9c
NIP: 00003fffb07a1a9c LR: 00003fffb07a3d08 CTR: 00003fffb074cb30
REGS: c00000014d357e80 TRAP: 0500 Tainted: G W
(5.13.0-rc1-PowerMacG5)
MSR: 900000000000f032 <SF,HV,EE,PR,FP,ME,IR,DR,RI> CR: 22482820 XER:
20000000
IRQMASK: 0
GPR00: 00003fffb07a3d08 00003fffe84d07a0 00003fffb0ad1200 00003fffa8131100
GPR04: 00003fffa9ea4bd0 a5a8b016e7fdc57d 00003fffe84d0810 00003fffb0aa7ac0
GPR08: 00003fffb0ab3708 00003fffab4eb870 0000000000000000 0000000000000000
GPR12: 00003fffb07b92a0 00003fffb0b8e850 00003fffe84d0a58 000000014df42388
GPR16: 00003fffe84d0a70 ffffffffffffffff 00003fffafbf54c0 ffffffffffffffff
GPR20: 0000000000000000 000000014df42338 000000014c677878 0000000000000000
GPR24: 00003fffafc0b5b0 000000014c677830 00003fffafcc8a50 a5a8b016e7fdc57d
GPR28: 00003fffa863bcc0 00003fffa8131100 00003fffa9ea4bd0 00003fffa8131100
NIP [00003fffb07a1a9c] 0x3fffb07a1a9c
LR [00003fffb07a3d08] 0x3fffb07a3d08
--- interrupt: 500
handlers:
[<00000000370eb0ba>] .nvme_irq
[<00000000370eb0ba>] .nvme_irq
Disabling IRQ #63
Call Trace:
Kernel panic - not syncing: corrupted stack end detected inside scheduler
CPU: 0 PID: 814 Comm: kworker/u4:2 Tainted: G W
5.13.0-rc1-PowerMacG5 #3
Workqueue: writeback .wb_workfn (flush-254:1)
[c00000007db5ab40] [c000000000549790] .dump_stack+0xe0/0x13c (unreliable)
[c00000007db5abe0] [c0000000000680dc] .panic+0x168/0x430
[c00000007db5ac90] [c000000000811e40] .__schedule+0x80/0x840
[c00000007db5ad70] [c00000000081274c] .preempt_schedule_common+0x28/0x48
[c00000007db5adf0] [c00000000081279c] .__cond_resched+0x30/0x4c
[c00000007db5ae70] [c0000000001c6a98] .mempool_alloc+0x38/0x1a4
[c00000007db5af50] [c0000000004a1a70] .bio_alloc_bioset+0x94/0x174
[c00000007db5b000] [c000000000354840] .ext4_bio_write_page+0x314/0x480
[c00000007db5b0c0] [c0000000003334d4] .mpage_submit_page+0x70/0xa0
[c00000007db5b140] [c000000000333630] .mpage_process_page_bufs+0x12c/0x18c
[c00000007db5b1d0] [c0000000003338b8] .mpage_prepare_extent_to_map+0x1f8/0x228
[c00000007db5b320] [c000000000339088] .ext4_writepages+0x360/0xe5c
[c00000007db5b5d0] [c0000000001cee84] .do_writepages+0x54/0xa0
[c00000007db5b650] [c0000000002a49bc] .__writeback_single_inode+0x100/0x560
[c00000007db5b700] [c0000000002a53d8] .writeback_sb_inodes+0x2dc/0x4c8
[c00000007db5b880] [c0000000002a5654] .__writeback_inodes_wb+0x90/0xcc
[c00000007db5b930] [c0000000002a58c0] .wb_writeback+0x230/0x3dc
[c00000007db5ba50] [c0000000002a6790] .wb_workfn+0x380/0x460
[c00000007db5bbb0] [c0000000000890a0] .process_one_work+0x318/0x4dc
[c00000007db5bca0] [c000000000089730] .worker_thread+0x224/0x290
[c00000007db5bd60] [c000000000091200] .kthread+0x134/0x13c
[c00000007db5be10] [c00000000000bbf4] .ret_from_kernel_thread+0x58/0x64
Rebooting in 120 seconds..


# lspci -vv -s 0001:08:00.0
0001:08:00.0 Non-Volatile memory controller: Intel Corporation SSD Pro
7600p/760p/E 6100p Series (rev 03) (prog-if 02 [NVM Express])
Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [NVM
Express]
Device tree node:
/sys/firmware/devicetree/base/***@0,f2000000/***@5/pci8086,***@0
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx+
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 63
NUMA node: 0
Region 0: Memory at a0000000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+
TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency
L1 <8us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s (downgraded), Width x4 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP-
LTR+
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt-
EETLPPrefix-
EmergencyPowerReduction Not Supported,
EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR-
OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer-
2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB,
EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3-
LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00002100
Kernel driver in use: nvme
--
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.
b***@bugzilla.kernel.org
2021-05-15 11:58:27 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=213079

--- Comment #1 from Erhard F. (***@mailbox.org) ---
Created attachment 296761
--> https://bugzilla.kernel.org/attachment.cgi?id=296761&action=edit
kernel .config (5.13-rc1, PowerMac G5 11,2)
--
b***@bugzilla.kernel.org
2021-05-15 12:30:16 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=213079

--- Comment #2 from Erhard F. (***@mailbox.org) ---
Hmm... The same thing just happened on 5.12.3, but without the kernel panic
(yet).

[...]
irq 63: nobody cared (try booting with the "irqpoll" option)
Call Trace:
CPU: 1 PID: 43491 Comm: emerge Tainted: G W
5.12.3-gentoo-PowerMacG5 #2
[c00000000ffefae0] [c00000000053950c] .dump_stack+0xe0/0x13c (unreliable)
[c00000000ffefb80] [c0000000000ddb68] .__report_bad_irq+0x34/0xf0
[c00000000ffefc20] [c0000000000dda50] .note_interrupt+0x250/0x2f8
[c00000000ffefce0] [c0000000000d9cf8] .handle_irq_event_percpu+0x64/0x90
[c00000000ffefd70] [c0000000000d9d68] .handle_irq_event+0x44/0x70
[c00000000ffefe00] [c0000000000df164] .handle_fasteoi_irq+0xac/0x158
[c00000000ffefea0] [c0000000000d8db8] .generic_handle_irq+0x38/0x58
[c00000000ffeff10] [c000000000011314] .__do_irq+0x15c/0x238
[c00000000ffeff90] [c00000000001fe04] .call_do_irq+0x14/0x24
[c000000056e2fd70] [c00000000001154c] .do_IRQ+0x15c/0x164
[c000000056e2fe10] [c000000000007d38] hardware_interrupt_common_virt+0x158/0x160
--- interrupt: 500 at 0x3fffb8a21520
handlers:
NIP: 00003fffb8a21520 LR: 00003fffb8a214a0 CTR: 00003fffb8ae6d20
REGS: c000000056e2fe80 TRAP: 0500 Tainted: G W
(5.12.3-gentoo-PowerMacG5)
MSR: 900000000200f032 <SF,HV,VEC,EE,PR,FP,ME,IR,DR,RI> CR: 42482824 XER:
20000000
IRQMASK: 0
GPR00: 00003fffb8a214a0 00003fffdb199650 00003fffb8df7200 000000014e8ddc60
GPR04: 00003fffb210e000 95bfd31b66b69e10 00003fffdb199478 0000000000024d50
GPR08: 000000014cb987c0 0000000000000002 0000000000000000 0000000000000000
GPR12: 00003fffb8ae0e50 00003fffb8eb4850 00003fffdb199a58 000000014e8ddf60
GPR16: 00003fffdb199a70 ffffffffffffffff 0000000000000001 000000014b5d8460
GPR20: 0000000000000000 0000000000000002 000000014e8ddf38 00003fffb6b176e8
GPR24: 000000014c126958 00003fffb2030390 000000014b94c380 000000014b5d8460
GPR28: 000000014c1267f0 000000014c126a60 000000014c1267f0 0000000000000000
NIP [00003fffb8a21520] 0x3fffb8a21520
LR [00003fffb8a214a0] 0x3fffb8a214a0
--- interrupt: 500
[<000000000e5af612>] .nvme_irq
[<000000000e5af612>] .nvme_irq
Disabling IRQ #63
--
b***@bugzilla.kernel.org
2021-05-15 14:50:23 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=213079

Erhard F. (***@mailbox.org) changed:

What |Removed |Added
----------------------------------------------------------------------------
Kernel Version|5.13-rc1 |5.12.3
Summary|IRQ problems and crashes on |IRQ problems and crashes on
|a PowerMac G5 with 5.13-rc1 |a PowerMac G5 with 5.12.3

--- Comment #3 from Erhard F. (***@mailbox.org) ---
Some time after the "irq 63: nobody cared" on 5.12.3:

[...]
--- interrupt: 500
[<000000000e5af612>] .nvme_irq
[<000000000e5af612>] .nvme_irq
Disabling IRQ #63
Call Trace:
Kernel panic - not syncing: corrupted stack end detected inside scheduler
CPU: 0 PID: 105549 Comm: kworker/u4:1 Tainted: G W
5.12.3-gentoo-PowerMacG5 #2
Workqueue: 0x0 (flush-259:0)
[c000000078dc79f0] [c00000000053950c] .dump_stack+0xe0/0x13c (unreliable)
[c000000078dc7a90] [c000000000066074] .panic+0x168/0x430
[c000000078dc7b40] [c0000000007f19f0] .__schedule+0x80/0x848
[c000000078dc7c20] [c0000000007f2270] .schedule+0xb8/0x110
[c000000078dc7ca0] [c000000000086d18] .worker_thread+0x278/0x290
[c000000078dc7d60] [c00000000008e75c] .kthread+0x134/0x13c
[c000000078dc7e10] [c00000000000b1f4] .ret_from_kernel_thread+0x58/0x64
Rebooting in 120 seconds..
--
b***@bugzilla.kernel.org
2021-06-06 18:14:30 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=213079

--- Comment #4 from Erhard F. (***@mailbox.org) ---
Created attachment 297191
--> https://bugzilla.kernel.org/attachment.cgi?id=297191&action=edit
bisect.log

Turns out the problem was introduced between v5.11 and v5.12 by the following
commit:

# git bisect good
fbbefb320214db14c3e740fce98e2c95c9d0669b is the first bad commit
commit fbbefb320214db14c3e740fce98e2c95c9d0669b
Author: Oliver O'Halloran <***@gmail.com>
Date: Tue Nov 3 15:35:07 2020 +1100

powerpc/pci: Move PHB discovery for PCI_DN using platforms

Make powernv, pseries, powermac and maple use ppc_mc.discover_phbs.
These platforms need to be done together because they all depend on
pci_dn's being created from the DT. The pci_dn contains a pointer to
the relevant pci_controller so they need to be created after the
pci_controller structures are available, but before PCI devices are
scanned. Currently this ordering is provided by initcalls and the
sequence is:

1. PHBs are discovered (setup_arch) (early boot, pre-initcalls)
2. pci_dn are created from the unflattended DT (core initcall)
3. PHBs are scanned pcibios_init() (subsys initcall)

The new ppc_md.discover_phbs() function is also a core_initcall so we
can't guarantee ordering between the creation of pci_controllers and
the creation of pci_dn's which require a pci_controller. We could use
the postcore, or core_sync initcall levels, but it's cleaner to just
move the pci_dn setup into the per-PHB inits which occur inside of
.discover_phb() for these platforms. This brings the boot-time path in
line with the PHB hotplug path that is used for pseries DLPAR
operations too.

Signed-off-by: Oliver O'Halloran <***@gmail.com>
[mpe: Squash powermac & maple in to avoid breakage those platforms,
convert memblock allocs to use kmalloc to avoid warnings]
Signed-off-by: Michael Ellerman <***@ellerman.id.au>
Link: https://lore.kernel.org/r/20201103043523.916109-2-***@gmail.com
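
The ordering argument in the commit message above can be illustrated with a toy
model. This is only a sketch in Python, not kernel code: the level list is
abbreviated, and run(), create_pci_dn(), discover_phbs_with_dn() and
any_ordering_fails() are hypothetical names invented for the illustration. It
assumes, as the commit message says, that two initcalls at the same level have
no guaranteed relative order, and models that by trying every permutation.

```python
import itertools

# Toy model only: real initcalls are C functions registered with macros such
# as core_initcall() or subsys_initcall(). Order between levels is fixed;
# order within a level is treated here as arbitrary.
LEVELS = ["core", "postcore", "arch", "subsys"]


def run(initcalls):
    """Run (level, name, fn) tuples level by level, in list order within a level."""
    state = {"phbs": [], "pci_dn": [], "errors": []}
    for level in LEVELS:
        for lvl, _name, fn in initcalls:
            if lvl == level:
                fn(state)
    return state


def discover_phbs(state):
    # Stands in for creating the pci_controller structures.
    state["phbs"].append("phb0")


def create_pci_dn(state):
    # A pci_dn points at its pci_controller, so a controller must already exist.
    if not state["phbs"]:
        state["errors"].append("pci_dn created before any PHB")
        return
    state["pci_dn"].append(("dn0", state["phbs"][0]))


def discover_phbs_with_dn(state):
    # The approach described in the commit message: the per-PHB init also
    # sets up the pci_dn, so the ordering is explicit.
    discover_phbs(state)
    create_pci_dn(state)


def any_ordering_fails(initcalls):
    """True if at least one same-level ordering creates a pci_dn too early."""
    return any(run(list(p))["errors"] for p in itertools.permutations(initcalls))


separate = [("core", "discover_phbs", discover_phbs),
            ("core", "create_pci_dn", create_pci_dn)]
merged = [("core", "discover_phbs", discover_phbs_with_dn)]

print(any_ordering_fails(separate))  # True
print(any_ordering_fails(merged))    # False
```

With both steps registered separately at the same level, one of the two
possible orderings creates the pci_dn before any controller exists; folding the
pci_dn setup into the PHB discovery step removes that dependency on ordering.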
--
b***@bugzilla.kernel.org
2021-06-07 04:13:16 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=213079

--- Comment #5 from Oliver O'Halloran (***@gmail.com) ---
Hmm, it's pretty weird to see an NVMe drive using LSIs (level-signalled
interrupts) instead of MSIs. I'm not sure what to make of that. I suspect
something screwy is going on with interrupt routing, but I don't have any G5
hardware to reproduce this on.
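
A rough way to read the interrupt mode off the lspci -vv output quoted in the
original report is sketched below. The helper interrupt_mode() is a
hypothetical name, and the string checks are assumptions based only on the
lines shown in that report (MSI and MSI-X both "Enable-", "DisINTx-", and
"pin A routed to IRQ 63"); it is not a general lspci parser.

```python
import re


def interrupt_mode(lspci_text):
    """Guess the active interrupt mode from `lspci -vv` text (heuristic)."""
    if re.search(r"MSI-X: Enable\+", lspci_text):
        return "MSI-X"
    if re.search(r"MSI: Enable\+", lspci_text):
        return "MSI"
    m = re.search(r"Interrupt: pin (\w) routed to IRQ (\d+)", lspci_text)
    # "DisINTx-" means legacy INTx signalling has not been disabled.
    if m and "DisINTx-" in lspci_text:
        return f"INTx pin {m.group(1)} (IRQ {m.group(2)})"
    return "unknown"


# Lines copied from the lspci output in the report.
sample = """\
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 63
Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
"""

print(interrupt_mode(sample))  # INTx pin A (IRQ 63)
```

For the reported device this yields legacy INTx on IRQ 63, which matches the
"irq 63: nobody cared" messages naming .nvme_irq as the handler.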

Could you add "debug" to the kernel command line and post the dmesg output for
two boots, one with the patch applied and one with it reverted?
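
Once both logs exist, comparing them is easier with the timestamps stripped. A
minimal sketch, assuming each log has been read into a list of lines; the
function names and the sample lines are made up for illustration and are not
real kernel output:

```python
import difflib
import re

# Matches the "[   12.345678] " prefix that dmesg puts on each line.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamps(lines):
    """Drop the timestamp prefix so identical messages compare equal."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_dmesg(applied, reverted):
    """Unified diff (no context lines) from the reverted boot to the applied boot."""
    return list(difflib.unified_diff(
        strip_timestamps(reverted), strip_timestamps(applied),
        fromfile="reverted", tofile="applied", lineterm="", n=0))


# Placeholder lines, not taken from any real boot log.
reverted = ["[    0.100000] same line\n",
            "[    0.250000] only in reverted boot\n"]
applied = ["[    0.130000] same line\n",
           "[    0.300000] only in applied boot\n"]

for line in diff_dmesg(applied, reverted):
    print(line)
```

Only the messages that differ between the two boots are printed; lines that
differ solely in their timestamp are treated as equal.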
--
b***@bugzilla.kernel.org
2021-06-07 06:49:37 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=213079

--- Comment #6 from Erhard F. (***@mailbox.org) ---
This is already a custom-built kernel with lots of debugging options turned on
(see the kernel .config attached to this bug). But of course I can add "debug"
to the other kernel command line parameters.

I'll report back the next time I have access to this G5, in about 2-3 weeks.
--