24.04_linux-nvidia-6.17-next: MPAM: Please pull arm_mpam: Consider overflow in bandwidth counter state#446
Use the overflow status bit to track overflow on each bandwidth counter read and add the counter size to the correction when overflow is detected.

This assumes that only a single overflow has occurred since the last read of the counter. Overflow interrupts, on hardware that supports them could be used to remove this limitation.

Cc: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(backported from commit b353637)
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
[fenghuay: Fix mem bw monitoring counter overflow issue.
 - Resolve conflict in mpam_msmon_overflow_val();
 - Resolve conflict in __ris_msmon_read();
]
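The read-side bookkeeping described in the commit message can be sketched roughly as follows. This is a standalone userspace illustration, not the driver code: `struct bw_state`, `bw_read()` and its parameters are hypothetical names, and the sketch assumes the overflow status has already been sampled and cleared by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-monitor state; not the driver's struct msmon_mbwu_state. */
struct bw_state {
	uint64_t correction;	/* sum of all wraps accounted for so far */
};

/*
 * Fold one hardware read into a running total.
 * raw:        value read from the counter register
 * overflow:   sticky overflow status sampled together with this read
 * width_bits: counter width in bits (31, 44 or 63 in this discussion)
 */
static uint64_t bw_read(struct bw_state *st, uint64_t raw, bool overflow,
			unsigned int width_bits)
{
	assert(width_bits > 0 && width_bits < 64);

	/* A single status bit only says "wrapped at least once"; count one wrap. */
	if (overflow)
		st->correction += UINT64_C(1) << width_bits;

	return st->correction + raw;
}
```

If the counter wraps twice between two reads, this sketch still adds only one counter size, which is the single-overflow limitation the commit message calls out.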
PR Validation Report
Patchscan ✅ No Missing Fixes
All cherry-picked commits checked: no missing upstream fixes found.
PR Lint ❌ Errors found
Checking 1 commits...
Cherry-pick digest:
E: 65dbf0f26154 ("arm_mpam: Consider overflow in bandwidth"): patch-ID mismatch with upstream b35363793291
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local │ Referenced upstream / Patch subject │ Patch-ID │ Subject │ SoB chain │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 65dbf0f26154 │ b35363793291 arm_mpam: Consider overflow in bandwidth counter st │ MISMATCH │ match │ preserved + fenghuay adde │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘
Lint: all checks passed.
PR metadata:
W: PR title missing [<branch>] prefix: "24.04_linux-nvidia-6.17-next: MPAM: Please pull arm_mpam: Consider overflow in b"
E: PR targets 24.04_linux-nvidia-6.17-next but body has no https://bugs.launchpad.net/... link
To clarify, this is a patch from v6.19 (not v6.18) and was part of the "[PATCH v6 00/34] arm_mpam: Add basic mpam driver". @fyu1 Are there other patches from this series that are missing from the 6.17-HWE kernel?
@fyu1 Codex is calling out these findings... Line numbers below are from remotes/fyu1/24.04_linux-nvidia-6.17-next.mpam.extras.fixes3 at 65dbf0f.
1. At drivers/resctrl/mpam_devices.c:1383-1392:

    case mpam_feat_msmon_mbwu_63counter:
    [...]

Those are maximum counter values: 2^63 - 1, 2^44 - 1, 2^31 - 1. But after this patch, the value is added directly on overflow at :1488-1489:

    if (overflow)
    [...]

The upstream commit adds the counter modulus instead: for the 31-bit counter, upstream computes BIT_ULL(31), i.e. 2^31, not GENMASK_ULL(30, 0). So every detected overflow undercounts by one counter unit. For 44/63-bit [...]
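To make the off-by-one concrete, here is a small userspace sketch of the two candidate correction values. `counter_max()` and `counter_modulus()` are hypothetical helpers written for this comment; they stand in for the GENMASK_ULL(width - 1, 0) and BIT_ULL(width) expressions and are not the driver's functions.

```c
#include <assert.h>
#include <stdint.h>

/* Largest value a width-bit counter can hold: 2^width - 1. */
static uint64_t counter_max(unsigned int width)
{
	assert(width > 0 && width < 64);
	return (UINT64_C(1) << width) - 1;
}

/* Number of counts consumed by one full wrap: 2^width. */
static uint64_t counter_modulus(unsigned int width)
{
	assert(width > 0 && width < 64);
	return UINT64_C(1) << width;
}
```

Suppose a 31-bit counter has really counted 2^31 + 5 units. It reads back 5 with the overflow bit set. Adding `counter_modulus(31)` recovers 2^31 + 5, while adding `counter_max(31)` gives 2^31 + 4, so each detected wrap loses one unit.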
2. Existing downstream code scales the sampled MBWU value at :1480-1481:

    if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc))
    [...]

Before this backport, overflow correction also applied the same scale:

    overflow_val = mpam_msmon_overflow_val(m->type);
    [...]

The patch removed that path and now adds the unscaled helper result directly. That mixes units: `now` is bytes on T241, while correction is raw counter ticks. A 31-bit overflow should add roughly 2^31 * 64 bytes, but the [...]

The fix should make the overflow correction use the same unit as `now`.
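One way to keep the units consistent is sketched below as standalone C. The function name `mbwu_bytes()` and its parameters are hypothetical; the factor of 64 is taken from the quirk name and the 2^31 * 64 figure above, and the sketch assumes the scale applies equally to the sampled value and to the wrap correction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Accumulate a byte total from a raw counter read.
 * correction_bytes: running wrap correction, kept in bytes
 * raw:              counter value in hardware ticks
 * scale:            bytes per tick (64 for the quirk discussed, 1 otherwise)
 */
static uint64_t mbwu_bytes(uint64_t *correction_bytes, uint64_t raw,
			   bool overflow, unsigned int width_bits,
			   uint64_t scale)
{
	uint64_t now;

	assert(width_bits > 0 && width_bits < 64);

	/* Convert the sample to bytes first ... */
	now = raw * scale;

	/* ... and convert the wrap correction with the same factor. */
	if (overflow)
		*correction_bytes += (UINT64_C(1) << width_bits) * scale;

	return *correction_bytes + now;
}
```

With a 31-bit counter reading 5 ticks after one wrap, this returns (2^31 + 5) * 64 bytes. For wide counters the scaled products can exceed 64 bits and wrap; the sketch does not try to handle that case.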
3. This branch has long MBWU counter support and defines two relevant status bits:

    MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L /* bit 15 */

clean_msmon_ctl_val() already knows about the long-counter bit and clears it for config comparison at :1337-1343:

    *cur_ctl &= ~MSMON_CFG_x_CTL_OFLOW_STATUS;
    if (FIELD_GET(MSMON_CFG_x_CTL_TYPE, *cur_ctl) == MSMON_CFG_MBWU_CTL_TYPE_MBWU)
    [...]

But the new overflow detection only checks bit 26 at :1435:

    overflow = cur_ctl & MSMON_CFG_x_CTL_OFLOW_STATUS;

So a long-counter overflow signaled only by MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L will not increment correction, and because overflow is false, the code also will not write the cleaned control value back to clear the sticky [...]

The backport needs to treat OFLOW_STATUS_L as overflow for the 44/63-bit MBWU paths.
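A possible shape for the combined check, as a userspace sketch only. The `EX_` macros and both helpers are hypothetical stand-ins that reuse the bit positions quoted above (26 and 15); whether bit 26 should also count for long counters is an assumption here, not something this review has confirmed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the two status bits named in the review. */
#define EX_OFLOW_STATUS		(UINT32_C(1) << 26)
#define EX_OFLOW_STATUS_L	(UINT32_C(1) << 15)

/* True if the control value reports a wrap for this counter flavour. */
static bool ctl_overflowed(uint32_t ctl, bool long_counter)
{
	uint32_t mask = EX_OFLOW_STATUS;

	/* Assumption: long (44/63-bit) counters may signal via either bit. */
	if (long_counter)
		mask |= EX_OFLOW_STATUS_L;

	return (ctl & mask) != 0;
}

/* Value to write back so that both sticky bits end up cleared. */
static uint32_t ctl_clear_overflow(uint32_t ctl)
{
	return ctl & ~(EX_OFLOW_STATUS | EX_OFLOW_STATUS_L);
}
```

The point of pairing the two helpers is that the write-back only happens when overflow is seen, so a check that misses bit 15 would also leave that bit set.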
4. At :1405:

    u64 now, overflow_val = 0;

overflow_val is no longer used after the patch. That should produce an unused-variable warning. Also, struct msmon_mbwu_state still says prev_val is “Used to detect overflow” at drivers/resctrl/mpam_internal.h:325-326, and write_msmon_ctl_flt_vals() still resets it at mpam_devices.c:1374-1375. But this patch removed [...]
Claude missed 1, but also flagged 2-4. Additionally, the commit message ought to have your SOB go after the backport notes:
Backport this commit to fix a bug: https://nvbugspro.nvidia.com/bug/6207279
Since the commit is in 6.18 upstream already, 7.0 bos and lts don't have this issue.
Use the overflow status bit to track overflow on each bandwidth counter read and add the counter size to the correction when overflow is detected.
This assumes that only a single overflow has occurred since the last read of the counter. Overflow interrupts, on hardware that supports them could be used to remove this limitation.
Cc: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
(backported from commit b353637)
[fenghuay: Fix mem bw monitoring counter overflow issue.