igc/igb: Fix missing time sync events by vcgomes · Pull Request #51 · intel/linux-intel-lts

vcgomes · 2024-02-21T01:29:03Z

This is a backport to kernel 6.1 of the fix proposed here:
https://lore.kernel.org/all/20240220235712.241552-1-vinicius.gomes@intel.com/

In short, the drivers (igc and igb) were double clearing the Time Sync Interrupt Cause register, which can cause events to be missed.

Fix "double" clearing of interrupts, which can cause external events or timestamps to be missed. The IGC_TSIRC Time Sync Interrupt Cause register can be cleared in two ways, by either reading it or by writing '1' into the specific cause bit. This is documented in section 8.16.1. The following flow was used: 1. read IGC_TSIRC into 'tsicr'; 2. handle the interrupts present in 'tsirc' and mark them in 'ack'; 3. write 'ack' into IGC_TSICR; As both (1) and (3) will clear the interrupt cause, if the same interrupt happens again between (1) and (3) it will be ignored, causing events to be missed. Remove the extra clear in (3). Fixes: 2c344ae ("igc: Add support for TX timestamping") Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Tested-by: Kurt Kanzenbach <kurt@linutronix.de> # Intel i225 Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>

Fix "double" clearing of interrupts, which can cause external events or timestamps to be missed. The E1000_TSIRC Time Sync Interrupt Cause register can be cleared in two ways, by either reading it or by writing '1' into the specific cause bit. This is documented in section 8.16.1. The following flow was used: 1. read E1000_TSIRC into 'tsicr'; 2. handle the interrupts present into 'tsirc' and mark them in 'ack'; 3. write 'ack' into E1000_TSICR; As both (1) and (3) will clear the interrupt cause, if the same interrupt happens again between (1) and (3) it will be ignored, causing events to be missed. Remove the extra clear in (3). Fixes: 00c6557 ("igb: enable internal PPS for the i210") Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>

…t handling [ Upstream commit 815a8d2feb5615ae7f0b5befd206af0b0160614c ] The recent commit 1010b4c ("powerpc/eeh: Make EEH driver device hotplug safe") restructured the EEH driver to improve synchronization with the PCI hotplug layer. However, it inadvertently moved pci_lock_rescan_remove() outside its intended scope in eeh_handle_normal_event(), leading to broken PCI error reporting and improper EEH event triggering. Specifically, eeh_handle_normal_event() acquired pci_lock_rescan_remove() before calling eeh_pe_bus_get(), but eeh_pe_bus_get() itself attempts to acquire the same lock internally, causing nested locking and disrupting normal EEH event handling paths. This patch adds a boolean parameter do_lock to _eeh_pe_bus_get(), with two public wrappers: eeh_pe_bus_get() with locking enabled. eeh_pe_bus_get_nolock() that skips locking. Callers that already hold pci_lock_rescan_remove() now use eeh_pe_bus_get_nolock() to avoid recursive lock acquisition. Additionally, pci_lock_rescan_remove() calls are restored to the correct position—after eeh_pe_bus_get() and immediately before iterating affected PEs and devices. This ensures EEH-triggered PCI removes occur under proper bus rescan locking without recursive lock contention. The eeh_pe_loc_get() function has been split into two functions: eeh_pe_loc_get(struct eeh_pe *pe) which retrieves the loc for given PE. eeh_pe_loc_get_bus(struct pci_bus *bus) which retrieves the location code for given bus. This resolves lockdep warnings such as: <snip> [ 84.964298] [ T928] ============================================ [ 84.964304] [ T928] WARNING: possible recursive locking detected [ 84.964311] [ T928] 6.18.0-rc3 #51 Not tainted [ 84.964315] [ T928] -------------------------------------------- [ 84.964320] [ T928] eehd/928 is trying to acquire lock: [ 84.964324] [ T928] c000000003b29d58 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_lock_rescan_remove+0x28/0x40 [ 84.964342] [ T928] but task is already holding lock: [ 84.964347] [ T928] c000000003b29d58 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_lock_rescan_remove+0x28/0x40 [ 84.964357] [ T928] other info that might help us debug this: [ 84.964363] [ T928] Possible unsafe locking scenario: [ 84.964367] [ T928] CPU0 [ 84.964370] [ T928] ---- [ 84.964373] [ T928] lock(pci_rescan_remove_lock); [ 84.964378] [ T928] lock(pci_rescan_remove_lock); [ 84.964383] [ T928] *** DEADLOCK *** [ 84.964388] [ T928] May be due to missing lock nesting notation [ 84.964393] [ T928] 1 lock held by eehd/928: [ 84.964397] [ T928] #0: c000000003b29d58 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_lock_rescan_remove+0x28/0x40 [ 84.964408] [ T928] stack backtrace: [ 84.964414] [ T928] CPU: 2 UID: 0 PID: 928 Comm: eehd Not tainted 6.18.0-rc3 #51 VOLUNTARY [ 84.964417] [ T928] Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_022) hv:phyp pSeries [ 84.964419] [ T928] Call Trace: [ 84.964420] [ T928] [c0000011a7157990] [c000000001705de4] dump_stack_lvl+0xc8/0x130 (unreliable) [ 84.964424] [ T928] [c0000011a71579d0] [c0000000002f66e0] print_deadlock_bug+0x430/0x440 [ 84.964428] [ T928] [c0000011a7157a70] [c0000000002fd0c0] __lock_acquire+0x1530/0x2d80 [ 84.964431] [ T928] [c0000011a7157ba0] [c0000000002fea54] lock_acquire+0x144/0x410 [ 84.964433] [ T928] [c0000011a7157cb0] [c0000011a7157cb0] __mutex_lock+0xf4/0x1050 [ 84.964436] [ T928] [c0000011a7157e00] [c000000000de21d8] pci_lock_rescan_remove+0x28/0x40 [ 84.964439] [ T928] [c0000011a7157e20] [c00000000004ed98] eeh_pe_bus_get+0x48/0xc0 [ 84.964442] [ T928] [c0000011a7157e50] [c000000000050434] eeh_handle_normal_event+0x64/0xa60 [ 84.964446] [ T928] [c0000011a7157f30] [c000000000051de8] eeh_event_handler+0xf8/0x190 [ 84.964450] [ T928] [c0000011a7157f90] [c0000000002747ac] kthread+0x16c/0x180 [ 84.964453] [ T928] [c0000011a7157fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18 </snip> Fixes: 1010b4c ("powerpc/eeh: Make EEH driver device hotplug safe") Signed-off-by: Narayana Murty N <nnmlinux@linux.ibm.com> Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20251210142559.8874-1-nnmlinux@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit f3a9c95a15d2f4466acad5c68faeff79ca5e9f47 ] Since commit 15f73f5 ("blk-mq: move failure injection out of blk_mq_complete_request"), drivers are responsible for calling blk_should_fake_timeout() at appropriate code paths and opportunities. However, the dm driver does not implement its own timeout handler and relies on the timeout handling of its slave devices. If an io-timeout-fail error is injected to a dm device, the request will be leaked and never completed, causing tasks to hang indefinitely. Reproduce: 1. prepare dm which has iscsi slave device 2. inject io-timeout-fail to dm echo 1 >/sys/class/block/dm-0/io-timeout-fail echo 100 >/sys/kernel/debug/fail_io_timeout/probability echo 10 >/sys/kernel/debug/fail_io_timeout/times 3. read/write dm 4. iscsiadm -m node -u Result: hang task like below [ 862.243768] INFO: task kworker/u514:2:151 blocked for more than 122 seconds. [ 862.244133] Tainted: G E 6.19.0-rc1+ #51 [ 862.244337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 862.244718] task:kworker/u514:2 state:D stack:0 pid:151 tgid:151 ppid:2 task_flags:0x4288060 flags:0x00080000 [ 862.245024] Workqueue: iscsi_ctrl_3:1 __iscsi_unbind_session [scsi_transport_iscsi] [ 862.245264] Call Trace: [ 862.245587] <TASK> [ 862.245814] __schedule+0x810/0x15c0 [ 862.246557] schedule+0x69/0x180 [ 862.246760] blk_mq_freeze_queue_wait+0xde/0x120 [ 862.247688] elevator_change+0x16d/0x460 [ 862.247893] elevator_set_none+0x87/0xf0 [ 862.248798] blk_unregister_queue+0x12e/0x2a0 [ 862.248995] __del_gendisk+0x231/0x7e0 [ 862.250143] del_gendisk+0x12f/0x1d0 [ 862.250339] sd_remove+0x85/0x130 [sd_mod] [ 862.250650] device_release_driver_internal+0x36d/0x530 [ 862.250849] bus_remove_device+0x1dd/0x3f0 [ 862.251042] device_del+0x38a/0x930 [ 862.252095] __scsi_remove_device+0x293/0x360 [ 862.252291] scsi_remove_target+0x486/0x760 [ 862.252654] __iscsi_unbind_session+0x18a/0x3e0 [scsi_transport_iscsi] [ 862.252886] process_one_work+0x633/0xe50 [ 862.253101] worker_thread+0x6df/0xf10 [ 862.253647] kthread+0x36d/0x720 [ 862.254533] ret_from_fork+0x2a6/0x470 [ 862.255852] ret_from_fork_asm+0x1a/0x30 [ 862.256037] </TASK> Remove the blk_should_fake_timeout() check from dm, as dm has no native timeout handling and should not attempt to fake timeouts. Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

vcgomes added 2 commits February 20, 2024 17:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

igc/igb: Fix missing time sync events#51

igc/igb: Fix missing time sync events#51
vcgomes wants to merge 2 commits intointel:6.1/linuxfrom
vcgomes:igc-igb-fix-pps-interrupts

vcgomes commented Feb 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

vcgomes commented Feb 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant