Skip to content

24.04_linux-nvidia-6.17-next: MPAM: Please pull arm_mpam: Consider overflow in bandwidth counter state#446

Open
fyu1 wants to merge 1 commit into
NVIDIA:24.04_linux-nvidia-6.17-nextfrom
fyu1:24.04_linux-nvidia-6.17-next.mpam.extras.fixes3
Open

24.04_linux-nvidia-6.17-next: MPAM: Please pull arm_mpam: Consider overflow in bandwidth counter state#446
fyu1 wants to merge 1 commit into
NVIDIA:24.04_linux-nvidia-6.17-nextfrom
fyu1:24.04_linux-nvidia-6.17-next.mpam.extras.fixes3

Conversation

@fyu1
Copy link
Copy Markdown
Collaborator

@fyu1 fyu1 commented May 28, 2026

Backport this commit to fix a bug: https://nvbugspro.nvidia.com/bug/6207279

Since the commit is in 6.18 upstream already, 7.0 bos and lts don't have this issue.

Use the overflow status bit to track overflow on each bandwidth counter read and add the counter size to the correction when overflow is detected.

This assumes that only a single overflow has occurred since the last read of the counter. Overflow interrupts, on hardware that supports them could be used to remove this limitation.

Cc: Zeng Heng zengheng4@huawei.com
Reviewed-by: Gavin Shan gshan@redhat.com
Reviewed-by: Zeng Heng zengheng4@huawei.com
Reviewed-by: Jonathan Cameron jonathan.cameron@huawei.com
Reviewed-by: Shaopeng Tan tan.shaopeng@jp.fujitsu.com
Reviewed-by: Fenghua Yu fenghuay@nvidia.com
Tested-by: Carl Worth carl@os.amperecomputing.com
Tested-by: Gavin Shan gshan@redhat.com
Tested-by: Zeng Heng zengheng4@huawei.com
Tested-by: Shaopeng Tan tan.shaopeng@jp.fujitsu.com
Tested-by: Hanjun Guo guohanjun@huawei.com

(backported from commit b353637)

[fenghuay: Fix mem bw monitoring counter overflow issue.

  • Resolve conflict in mpam_msmon_overflow_val();
  • Resolve conflict in __ris_msmon_read(); ]

Use the overflow status bit to track overflow on each bandwidth counter
read and add the counter size to the correction when overflow is detected.

This assumes that only a single overflow has occurred since the last read
of the counter. Overflow interrupts, on hardware that supports them could
be used to remove this limitation.

Cc: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
(backported from commit b353637)
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
[fenghuay: Fix mem bw monitoring counter overflow issue.
 - Resolve conflict in mpam_msmon_overflow_val();
 - Resolve conflict in __ris_msmon_read();
]
@fyu1 fyu1 requested a review from jamieNguyenNVIDIA May 28, 2026 01:42
@fyu1 fyu1 requested a review from nvmochs May 28, 2026 01:43
@nirmoy nirmoy added the help wanted Extra attention is needed label May 28, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ❌ Errors found

Details
Checking 1 commits...

Cherry-pick digest:
E: 65dbf0f26154 ("arm_mpam: Consider overflow in bandwidth"): patch-ID mismatch with upstream b35363793291
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject                              │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 65dbf0f26154 │ b35363793291 arm_mpam: Consider overflow in bandwidth counter st │ MISMATCH   │ match   │ preserved + fenghuay adde │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

PR metadata:
W: PR title missing [<branch>] prefix: "24.04_linux-nvidia-6.17-next: MPAM: Please pull arm_mpam: Consider overflow in b"
E: PR targets 24.04_linux-nvidia-6.17-next but body has no https://bugs.launchpad.net/... link

@nirmoy
Copy link
Copy Markdown
Collaborator

nirmoy commented May 28, 2026

Boro review

Latest watcher review: open review

Head: 65dbf0f26154

This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher publishes a newer review.

@nvmochs
Copy link
Copy Markdown
Collaborator

nvmochs commented May 28, 2026

To clarify, this is a patch from v6.19 (not v6.18) and was part of the "[PATCH v6 00/34] arm_mpam: Add basic mpam driver".

@fyu1 Are there other patches from this series that are missing from the 6.17-HWE kernel?

@nvmochs
Copy link
Copy Markdown
Collaborator

nvmochs commented May 28, 2026

@fyu1 Codex is calling out these findings...

Line numbers below are from remotes/fyu1/24.04_linux-nvidia-6.17-next.mpam.extras.fixes3 at 65dbf0f.

  1. Overflow increment is off by one

At drivers/resctrl/mpam_devices.c:1383-1392:

case mpam_feat_msmon_mbwu_63counter:
return GENMASK_ULL(62, 0);
case mpam_feat_msmon_mbwu_44counter:
return GENMASK_ULL(43, 0);
case mpam_feat_msmon_mbwu_31counter:
return GENMASK_ULL(30, 0);

Those are maximum counter values: 2^63 - 1, 2^44 - 1, 2^31 - 1.

But after this patch, the value is added directly on overflow at :1488-1489:

if (overflow)
mbwu_state->correction += mpam_msmon_overflow_val(m->type);

The upstream commit adds the counter modulus instead: for the 31-bit counter, upstream computes BIT_ULL(31), i.e. 2^31, not GENMASK_ULL(30, 0). So every detected overflow undercounts by one counter unit. For 44/63-bit
downstream counters, the same logic should be BIT_ULL(44) and BIT_ULL(63).


  1. T241 scaling was lost for overflow correction

Existing downstream code scales the sampled MBWU value at :1480-1481:

if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc))
now *= 64;

Before this backport, overflow correction also applied the same scale:

overflow_val = mpam_msmon_overflow_val(m->type);
if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc))
overflow_val *= 64;
overflow_val -= mbwu_state->prev_val;

The patch removed that path and now adds the unscaled helper result directly. That mixes units: now is bytes on T241, while correction is raw counter ticks. A 31-bit overflow should add roughly 2^31 * 64 bytes, but the
current code adds only about 2^31.

The fix should make the overflow correction use the same unit as now.


  1. Long MBWU overflow status is ignored

This branch has long MBWU counter support and defines two relevant status bits:

MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L /* bit 15 /
MSMON_CFG_x_CTL_OFLOW_STATUS /
bit 26 */

clean_msmon_ctl_val() already knows about the long-counter bit and clears it for config comparison at :1337-1343:

*cur_ctl &= ~MSMON_CFG_x_CTL_OFLOW_STATUS;

if (FIELD_GET(MSMON_CFG_x_CTL_TYPE, *cur_ctl) == MSMON_CFG_MBWU_CTL_TYPE_MBWU)
*cur_ctl &= ~MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L;

But the new overflow detection only checks bit 26 at :1435:

overflow = cur_ctl & MSMON_CFG_x_CTL_OFLOW_STATUS;

So a long-counter overflow signaled only by MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L will not increment correction, and because overflow is false, the code also will not write the cleaned control value back to clear the sticky
status bit.

The backport needs to treat OFLOW_STATUS_L as overflow for the 44/63-bit MBWU paths.


  1. Dead/stale state remains after removing prev-value overflow detection

At :1405:

u64 now, overflow_val = 0;

overflow_val is no longer used after the patch. That should produce an unused-variable warning.

Also, struct msmon_mbwu_state still says prev_val is “Used to detect overflow” at drivers/resctrl/mpam_internal.h:325-326, and write_msmon_ctl_flt_vals() still resets it at mpam_devices.c:1374-1375. But this patch removed
the only read of prev_val. That is no longer functionally harmful by itself, but it is stale backport debris and makes the state model misleading.

@jamieNguyenNVIDIA
Copy link
Copy Markdown
Collaborator

@fyu1 Codex is calling out these findings...

Line numbers below are from remotes/fyu1/24.04_linux-nvidia-6.17-next.mpam.extras.fixes3 at 65dbf0f.

  1. Overflow increment is off by one

At drivers/resctrl/mpam_devices.c:1383-1392:

case mpam_feat_msmon_mbwu_63counter: return GENMASK_ULL(62, 0); case mpam_feat_msmon_mbwu_44counter: return GENMASK_ULL(43, 0); case mpam_feat_msmon_mbwu_31counter: return GENMASK_ULL(30, 0);

Those are maximum counter values: 2^63 - 1, 2^44 - 1, 2^31 - 1.

But after this patch, the value is added directly on overflow at :1488-1489:

if (overflow) mbwu_state->correction += mpam_msmon_overflow_val(m->type);

The upstream commit adds the counter modulus instead: for the 31-bit counter, upstream computes BIT_ULL(31), i.e. 2^31, not GENMASK_ULL(30, 0). So every detected overflow undercounts by one counter unit. For 44/63-bit downstream counters, the same logic should be BIT_ULL(44) and BIT_ULL(63).

  1. T241 scaling was lost for overflow correction

Existing downstream code scales the sampled MBWU value at :1480-1481:

if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc)) now *= 64;

Before this backport, overflow correction also applied the same scale:

overflow_val = mpam_msmon_overflow_val(m->type); if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc)) overflow_val *= 64; overflow_val -= mbwu_state->prev_val;

The patch removed that path and now adds the unscaled helper result directly. That mixes units: now is bytes on T241, while correction is raw counter ticks. A 31-bit overflow should add roughly 2^31 * 64 bytes, but the current code adds only about 2^31.

The fix should make the overflow correction use the same unit as now.

  1. Long MBWU overflow status is ignored

This branch has long MBWU counter support and defines two relevant status bits:

MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L /* bit 15 / MSMON_CFG_x_CTL_OFLOW_STATUS / bit 26 */

clean_msmon_ctl_val() already knows about the long-counter bit and clears it for config comparison at :1337-1343:

*cur_ctl &= ~MSMON_CFG_x_CTL_OFLOW_STATUS;

if (FIELD_GET(MSMON_CFG_x_CTL_TYPE, *cur_ctl) == MSMON_CFG_MBWU_CTL_TYPE_MBWU) *cur_ctl &= ~MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L;

But the new overflow detection only checks bit 26 at :1435:

overflow = cur_ctl & MSMON_CFG_x_CTL_OFLOW_STATUS;

So a long-counter overflow signaled only by MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L will not increment correction, and because overflow is false, the code also will not write the cleaned control value back to clear the sticky status bit.

The backport needs to treat OFLOW_STATUS_L as overflow for the 44/63-bit MBWU paths.

  1. Dead/stale state remains after removing prev-value overflow detection

At :1405:

u64 now, overflow_val = 0;

overflow_val is no longer used after the patch. That should produce an unused-variable warning.

Also, struct msmon_mbwu_state still says prev_val is “Used to detect overflow” at drivers/resctrl/mpam_internal.h:325-326, and write_msmon_ctl_flt_vals() still resets it at mpam_devices.c:1374-1375. But this patch removed the only read of prev_val. That is no longer functionally harmful by itself, but it is stale backport debris and makes the state model misleading.

Claude missed 1, but also flagged 2-4.

Additionally, the commit message ought to have your SOB go after the backport notes:

...
[fenghuay: Fix mem bw monitoring counter overflow issue.
 - Resolve conflict in mpam_msmon_overflow_val();
 - Resolve conflict in __ris_msmon_read();
]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

help wanted Extra attention is needed pending_review_comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants