[CI] CPU kernel benchmark for ngram_match — DO NOT MERGE #7203
cloudforge1 wants to merge 13 commits into PaddlePaddle:develop from
Conversation
Provides the missing 'CPU compute' column for ngram_match benchmarks. The GPU PR (PaddlePaddle#7136) only measured D2H/H2D transfer overhead, not actual CPU computation. Uses the same 5-group experiment dimensions so results are directly comparable. NOT FOR MERGE — benchmark-only PR for reference data.
Thanks for your contribution!
Codecov Report ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## develop #7203 +/- ##
==========================================
Coverage ? 74.19%
==========================================
Files ? 376
Lines ? 52941
Branches ? 8260
==========================================
Hits ? 39279
Misses ? 10910
Partials ? 2752
fastdeploy-bot left a comment
🤖 AI Code Review | 2026-04-07 14:27 CST
📋 Review Summary
PR overview: adds a CPU benchmark file to measure the actual CPU compute performance of ngram_match, for comparison against the GPU benchmark
Scope of change: tests/spec_decode/
Impact tag: CI Benchmark
Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | test_benchmark_ngram_cpu.py:258 | test_group4_threshold is missing environment-variable restore logic |
| 🟡 Suggestion | test_benchmark_ngram_cpu.py:278 | test_group5_threshold_x_batch is missing environment-variable restore logic |
Overall assessment
This is a benchmark file intended for performance comparison and is marked "NOT FOR MERGE". The code is well structured and the test cases mirror the GPU benchmark. The main suggestion is to unify the handling of environment variables so the code style stays consistent.
    rows = []
    for ht in hit_types:
        cpu_us = _time_cpu(self.ngram_match, batch_size, seq_len, ht, n_runs)
        rows.append(f"  hit={ht:<12s} batch={batch_size:<4d} " f"CPU: {cpu_us:>10.1f} µs (n={n_runs})")
🟡 Suggestion: environment variable modified but never restored
This method modifies os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] directly, but unlike test_group1_seq_len, test_group2_batch_size, and test_group3_hit_type it does not restore the original value in a try...finally block.
If the test fails midway or is interrupted, the environment variable keeps the last value that was set, which can break the isolation of subsequent tests.
Consider following the other test methods and adding a try...finally block:
def test_group4_threshold(self):
    old_env = os.environ.get("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD")
    try:
        rows = []
        for thr in thresholds:
            os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = str(thr)
            # ... benchmark code ...
        _print_table(...)
    finally:
        if old_env is None:
            os.environ.pop("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD", None)
        else:
            os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = old_env

    seq_len = 32768
    hit_type = "low_input"

    rows = []
🟡 Suggestion: as above, test_group5_threshold_x_batch has the same environment-variable restore issue.
Consider adding a try...finally block here as well, to keep the code style consistent with the other test methods.
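As an alternative to a manual try...finally, the same save-and-restore behavior can be had from the standard library's unittest.mock.patch.dict, which restores os.environ even when the body raises. A minimal sketch, assuming the threshold key from this PR; the run_one_threshold helper is hypothetical and stands in for one benchmark iteration:

```python
import os
from unittest.mock import patch

KEY = "INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"

def run_one_threshold(thr):
    """Run one benchmark configuration with the threshold env var set.

    patch.dict saves and restores os.environ automatically, even if the
    body raises, so no explicit try...finally is needed.
    """
    with patch.dict(os.environ, {KEY: str(thr)}):
        # ... benchmark code would go here; this sketch just reads it back ...
        return os.environ[KEY]
```

patch.dict works as a plain context manager outside of test cases too, so the same pattern would fit both the group4 and group5 loops.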
Motivation
PR #7136 benchmarks the GPU ngram_match kernel, but its "CPU path" column only measures D2H/H2D tensor-copy overhead, not the actual C++ kernel computation. This makes the reported speedup (14×–1,700×) misleading — the real GPU-vs-CPU-compute speedup is much more modest (~0.3×–5.8× per NKNaN's profiling data).
This PR adds a standalone CPU benchmark that calls the production C++ kernel (ngram_match.cc / find_candidate_pred_tokens) with CPU-placed tensors, using the same 5-group experiment dimensions so the numbers are directly comparable.
Note: the .cc file is deleted in the GPU kernel branch; this benchmark exists on develop, where both .cc and .cu coexist.
Modifications
tests/spec_decode/test_benchmark_ngram_cpu.py (354 lines): builds paddle.CPUPlace() tensors → dispatches to the .cc C++ kernel
Usage or Command
cd FastDeploy
python tests/spec_decode/test_benchmark_ngram_cpu.py
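For reference, a wall-clock microbenchmark helper in the spirit of the _time_cpu call shown in the review diff might look like the sketch below. The helper name, warmup count, and median aggregation are assumptions, not necessarily what the PR's benchmark implements:

```python
import time

def time_cpu_us(fn, *args, n_runs=100, warmup=5):
    """Median wall-clock time of fn(*args) in microseconds.

    Sketch only: the PR's _time_cpu helper may use a different
    aggregation (mean, min) or warmup policy.
    """
    for _ in range(warmup):  # warm caches / allocator before measuring
        fn(*args)
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e6)
    samples.sort()
    return samples[len(samples) // 2]  # median is robust to outliers
```

Reporting the median over many runs keeps one-off scheduler hiccups from skewing the per-configuration numbers that feed the comparison tables.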
Not applicable — benchmark-only, no functional changes.
Checklist