[linux-nvidia-6.18-next] linux-nvidia-6.18: SMT-aware asymmetric CPU capacity idle selection by arighi · Pull Request #441 · NVIDIA/NV-Kernels

arighi · 2026-05-27T08:37:33Z

On Vera Rubin, the firmware exposes CPUs with different capacities through ACPI/CPPC. Unlike Grace systems, Vera Rubin also supports SMT. As a result, the Linux scheduler enables the asymmetric CPU capacity idle selection policy, but the current implementation is not SMT-aware. This can lead to suboptimal task placement, where tasks are scheduled on both SMT siblings of the same core even when fully idle SMT cores are available elsewhere in the system.

This behavior can significantly reduce performance: slowdowns of up to 2x have been observed in certain CPU-intensive workloads.

This series is a backport of the upstream patch series available at the following URL (currently applied to linux-next):
https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com

NOTE: the original series includes additional patches that are not needed in linux-nvidia-6.18:

PATCH 1/6 is a refactoring that is valid only in kernels >= 7.0, because it requires 71fedc4 ("sched/fair: Switch to rcu_dereference_all()"); it is not worth backporting.
PATCH 6/6 is incorrect and has been dropped.

Given the potential impact on Vera Rubin performance, it seems reasonable to backport and apply these patches to the linux-nvidia kernels and carry them as NVIDIA SAUCE for now, until the upstream solution becomes available.

The patch series has been tested on both Vera and Grace, running the benchblas (NVBLAS) benchmark.

NOTE: the same series has been applied to the linux-nvidia-6.17 kernel (see also #395), linux-nvidia-7.0 (see #405) and linux-nvidia-7.0-bos (#406).

NOTE: The previous PRs linked the LKML email threads; now that these patches are in linux-next, I’ve updated the cherry-picked/backported lines to reference the corresponding linux-next commits instead.

LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-bos/+bug/2150671

nirmoy · 2026-05-27T08:47:01Z

Boro review

Latest watcher review: open review

Head: 127bee3d0f7b

This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher publishes a newer review.

nirmoy · 2026-05-27T10:42:44Z

LGTM just one nit: The first patch is missing Link: https://patch.msgid.link/20260516055850.1345932-1-arighi@nvidia.com

sforshee

I've got a couple of questions about the backport in the first patch. Additionally, looking through the history it appears that we don't typically add NVIDIA: SAUCE: for commits from linux-next.

sforshee · 2026-05-27T13:53:11Z

+		}
+
+		/* First, find the topmost SD_SHARE_LLC domain */
+		sd = *per_cpu_ptr(d.sd, i);


Isn't this line redundant with line 2559?

arighi · 2026-05-27

Yes, it is redundant, I'll remove it. Thanks!

…apacity

BugLink: https://bugs.launchpad.net/bugs/2150671

On asymmetric CPU capacity systems, the wakeup path uses select_idle_capacity(), which scans the span of sd_asym_cpucapacity rather than sd_llc. The has_idle_cores hint however lives on sd_llc->shared, so the wakeup-time read of has_idle_cores operates on an LLC-scoped blob while the actual scan/decision spans the wider asym domain; nr_busy_cpus also lives in the same shared sched_domain data, but it's never used in the asym CPU capacity scenario.

Therefore, move the sched_domain_shared object to sd_asym_cpucapacity whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case the scope of has_idle_cores matches the scope of the wakeup scan.

Fall back to attaching the shared object to sd_llc in three cases:

1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);
2) CPUs in an exclusive cpuset that carves out a symmetric capacity island: has_asym is system-wide but those CPUs have no SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow the symmetric LLC path in select_idle_sibling();
3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an SD_NUMA-built domain. init_sched_domain_shared() keys the shared blob off cpumask_first(span), which on overlapping NUMA domains would alias unrelated spans onto the same blob. Keep the shared object on the LLC there; select_idle_capacity() gracefully skips the has_idle_cores preference when sd->shared is NULL.

While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared, as it is no longer strictly tied to the LLC.

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260516055850.1345932-1-arighi@nvidia.com
(backported from fdfe5a8cd8731dd81840f26abfb6527edd27b0cb linux-next)
[ arighi:
  - backport full logic to attach sd->shared in build_sched_domains()
  - do not rename sd_llc_shared to reduce the risk of conflicts ]
Signed-off-by: Andrea Righi <arighi@nvidia.com>
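The attachment rule described in this commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `struct model_domain` and `pick_shared_owner()` are hypothetical names, and the boolean fields merely stand in for the SD_SHARE_LLC, SD_ASYM_CPUCAPACITY_FULL and SD_NUMA flags.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sched_domain level (not the kernel struct). */
struct model_domain {
	bool share_llc;              /* models SD_SHARE_LLC */
	bool asym_full;              /* models SD_ASYM_CPUCAPACITY_FULL */
	bool numa;                   /* models SD_NUMA (overlapping spans) */
	struct model_domain *parent; /* next wider level, NULL at the top */
};

/*
 * Return the level that should own the shared object for a CPU whose
 * lowest domain is 'base', following the rule in the commit message.
 */
static struct model_domain *pick_shared_owner(struct model_domain *base)
{
	struct model_domain *llc = NULL;
	struct model_domain *sd;

	/* Topmost level that still shares the LLC. */
	for (sd = base; sd && sd->share_llc; sd = sd->parent)
		llc = sd;

	/* Lowest ancestor with full capacity asymmetry, if any. */
	for (sd = base; sd; sd = sd->parent) {
		if (sd->asym_full)
			/* Overlapping (NUMA-built) domain: stay on the LLC. */
			return sd->numa ? llc : sd;
	}

	/* Symmetric hierarchy: keep the shared object on the LLC. */
	return llc;
}
```

In this model, a hierarchy whose package level carries the asymmetry flag yields the package level, while a symmetric hierarchy, or one where the flag first appears on a NUMA-built level, yields the LLC level.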

…ty idle selection

BugLink: https://bugs.launchpad.net/bugs/2150671

On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting different per-core frequencies), the wakeup path uses select_idle_capacity() and prioritizes idle CPUs with higher capacity for better task placement. However, when those CPUs belong to SMT cores, their effective capacity can be much lower than the nominal capacity when the sibling thread is busy: SMT siblings compete for shared resources, so a "high capacity" CPU that is idle but whose sibling is busy does not deliver its full capacity. This effective capacity reduction cannot be modeled by the static capacity value alone.

Introduce SMT awareness in the asym-capacity idle selection policy: when SMT is active, always prefer fully-idle SMT cores over partially-idle ones.

Prioritizing fully-idle SMT cores yields better task placement because the effective capacity of partially-idle SMT cores is reduced; always preferring them when available leads to more accurate capacity usage on task wakeup.

On an SMT system with asymmetric CPU capacities (NVIDIA Vera Rubin), SMT-aware idle selection has been shown to improve throughput by around 15-18% over NO_ASYM mainline and by around 60% over ASYM mainline, for CPU-bound workloads (NVBLAS) running an amount of tasks equal to the amount of SMT cores.

Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260511142502.3873984-1-arighi@nvidia.com
(cherry picked from commit 25a32e400a14009601c0a727643057f5515152df linux-next)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
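The preference order described above (fully-idle SMT cores first, capacity second) can be sketched as a standalone toy model. It is not the kernel's select_idle_capacity(): `struct model_cpu`, `core_fully_idle()` and `pick_idle_cpu()` are hypothetical names, and the model ignores task utilization and cpumask handling entirely.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state for the model (not a kernel structure). */
struct model_cpu {
	int core;               /* SMT core this hardware thread belongs to */
	unsigned long capacity; /* nominal capacity of the CPU */
	bool idle;              /* true if nothing runs on this thread */
};

/* A core is fully idle only if every thread on it is idle. */
static bool core_fully_idle(const struct model_cpu *cpus, int n, int core)
{
	for (int i = 0; i < n; i++)
		if (cpus[i].core == core && !cpus[i].idle)
			return false;
	return true;
}

/*
 * Pick an idle CPU, preferring CPUs on fully-idle cores over CPUs whose
 * sibling is busy, and higher capacity within the same class.
 * Returns the CPU index, or -1 if no CPU is idle.
 */
static int pick_idle_cpu(const struct model_cpu *cpus, int n)
{
	int best = -1;
	bool best_full = false;

	for (int i = 0; i < n; i++) {
		if (!cpus[i].idle)
			continue;

		bool full = core_fully_idle(cpus, n, cpus[i].core);

		if (best < 0 || (full && !best_full) ||
		    (full == best_full && cpus[i].capacity > cpus[best].capacity)) {
			best = i;
			best_full = full;
		}
	}
	return best;
}
```

With two cores where the higher-capacity core already has one busy thread, this model picks a thread on the lower-capacity but fully-idle core, which is the placement the commit message argues for.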

arighi · 2026-05-27T15:11:09Z

> LGTM just one nit: The first patch is missing Link: https://patch.msgid.link/20260516055850.1345932-1-arighi@nvidia.com

Ah yes, all the patches are missing all the latest acks from linux-next, the text is still based on the LKML email thread, I'll update them. Thanks!

… on asym-capacity

BugLink: https://bugs.launchpad.net/bugs/2150671

When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task, capacity_of(dst_cpu) can overstate available compute if the SMT sibling is busy: the core does not deliver its full nominal capacity.

If SMT is active and dst_cpu is not on a fully idle core, skip this destination so we do not migrate a misfit expecting a capacity upgrade we cannot actually provide.

Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260509180955.1840064-5-arighi@nvidia.com
(cherry picked from commit bf6aa722198d3c06e4236e8c5a480f30a64e1513 linux-next)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
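As a rough illustration of the check this commit describes, the following sketch rejects a destination whose SMT sibling is busy before comparing capacities. The names `struct dst_state` and `misfit_dst_ok()` are hypothetical, and the final capacity comparison is an assumption added to make the example complete; the actual load-balancing logic in the kernel is considerably more involved.

```c
#include <stdbool.h>

/* Hypothetical view of a candidate destination CPU (not a kernel type). */
struct dst_state {
	unsigned long capacity; /* nominal capacity of the destination */
	bool sibling_busy;      /* true if another thread on its core is busy */
};

/*
 * Decide whether pulling a misfit task to 'dst' is worthwhile.
 * With SMT active, a destination on a partially busy core is skipped
 * because its nominal capacity overstates what it can deliver.
 */
static bool misfit_dst_ok(const struct dst_state *dst,
			  unsigned long src_capacity, bool smt_active)
{
	if (smt_active && dst->sibling_busy)
		return false;

	/* Assumption for this sketch: only move to strictly higher capacity. */
	return dst->capacity > src_capacity;
}
```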

…ty()

BugLink: https://bugs.launchpad.net/bugs/2150671

Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan mechanism, already used by select_idle_cpu(): when sched_feat(SIS_UTIL) is enabled and the LLC domain has sched_domain_shared data, derive the per-attempt scan limit from sd->shared->nr_idle_scan. That bounds the walk on large LLCs: once nr_idle_scan is exhausted, return the best CPU seen so far.

The early exit is gated on !has_idle_core so an active idle-core search (SMT with idle cores reported by test_idle_cores()) isn't cut short before it gets a chance to find one.

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260509180955.1840064-6-arighi@nvidia.com
(backported from commit 61ea17a63719bac51e1bc50eb39fc637f0fdc06e linux-next)
[ arighi: choose_idle_cpu() not available in v6.18 ]
Signed-off-by: Andrea Righi <arighi@nvidia.com>
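The bounded scan can be modeled in a few lines of user-space C. This is a sketch under assumptions rather than the backported code: `struct scan_cpu` and `bounded_capacity_scan()` are hypothetical, "best CPU seen so far" is simplified to the highest-capacity idle CPU, and the budget is taken as `nr_idle_scan + 1` with a pre-decrement check before each CPU is examined.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state for the model (not a kernel structure). */
struct scan_cpu {
	bool idle;
	unsigned long capacity;
};

/*
 * Walk the CPUs in order, remembering the best idle candidate.
 * Unless an idle-core search is active (has_idle_core), stop once the
 * scan budget derived from nr_idle_scan is used up and return the best
 * CPU seen so far (-1 if none was found).
 */
static int bounded_capacity_scan(const struct scan_cpu *cpus, int n,
				 int nr_idle_scan, bool has_idle_core)
{
	int nr = nr_idle_scan + 1;
	int best = -1;

	for (int cpu = 0; cpu < n; cpu++) {
		/* Budget check is skipped while searching for an idle core. */
		if (!has_idle_core && --nr <= 0)
			return best;

		if (!cpus[cpu].idle)
			continue;

		if (best < 0 || cpus[cpu].capacity > cpus[best].capacity)
			best = cpu;
	}
	return best;
}
```

With a budget that covers only the first two CPUs, the model returns the best of those two even if a better CPU exists further along; with `has_idle_core` set, it scans the whole set.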

arighi · 2026-05-27T15:16:49Z

> I've got a couple of questions about the backport in the first patch. Additionally, looking through the history it appears that we don't typically add NVIDIA: SAUCE: for commits from linux-next.

I missed your comment about using NVIDIA: SAUCE: for linux-next commits. I don't have a strong opinion on that. In this case I left NVIDIA: SAUCE: just to be consistent with the other kernels (the patch wasn't in linux-next yet when the series was applied to linux-nvidia-6.17, 7.0 and 7.0-bos).

sforshee

Backport looks good. I defer on the question of whether to use NVIDIA: SAUCE: here to someone who's been doing this longer.

Acked-by: Seth Forshee <sforshee@nvidia.com>

nirmoy · 2026-05-27T16:19:48Z

> LGTM just one nit: The first patch is missing Link: https://patch.msgid.link/20260516055850.1345932-1-arighi@nvidia.com

> Ah yes, all the patches are missing all the latest acks from linux-next, the text is still based on the LKML email thread, I'll update them. Thanks!

Thanks
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>

jamieNguyenNVIDIA · 2026-05-28T00:26:20Z

> Backport looks good. I defer on the question of whether to use NVIDIA: SAUCE: here to someone who's been doing this longer.
> Acked-by: Seth Forshee <sforshee@nvidia.com>

@arighi: I think it's better to drop the NVIDIA: SAUCE: tag.

nvmochs · 2026-05-28T15:21:05Z

@arighi Jamie, Nirmoy and I discussed this and our preference is for the NVIDIA:SAUCE tag to be dropped for these patches since they are picked from upstream (-next). Please update the PR and then we can merge these in.

arighi requested review from clsotog, jamieNguyenNVIDIA, ltrager, nvidia-bfigg, nvmochs and sforshee May 27, 2026 08:37

nirmoy added the help wanted label May 27, 2026

sforshee reviewed May 27, 2026


arighi added 2 commits May 27, 2026 17:09

arighi and others added 2 commits May 27, 2026 17:11

arighi force-pushed the linux-nvidia-6.18 branch from 8acbf1e to 127bee3 Compare May 27, 2026 15:12

sforshee reviewed May 27, 2026


nirmoy added the has_1_ack label May 27, 2026

nirmoy added the has_2_acks label and removed the help wanted and has_1_ack labels May 27, 2026

arighi wants to merge 4 commits into NVIDIA:linux-nvidia-6.18-next from arighi:linux-nvidia-6.18

6 participants
