[CI] CPU kernel benchmark for ngram_match — DO NOT MERGE #7203
cloudforge1 wants to merge 13 commits into PaddlePaddle:develop from
Conversation
Provides the missing 'CPU compute' column for ngram_match benchmarks. The GPU PR (PaddlePaddle#7136) only measured D2H/H2D transfer overhead, not actual CPU computation. Uses the same 5-group experiment dimensions so results are directly comparable. NOT FOR MERGE — benchmark-only PR for reference data.
Thanks for your contribution!
Codecov Report ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## develop #7203 +/- ##
==========================================
Coverage ? 74.19%
==========================================
Files ? 376
Lines ? 52941
Branches ? 8260
==========================================
Hits ? 39279
Misses ? 10910
Partials ? 2752
fastdeploy-bot left a comment
🤖 AI Code Review | 2026-04-07 14:27 CST
📋 Review Summary
PR overview: adds a CPU benchmark file to measure the actual CPU compute performance of ngram_match, for comparison against the GPU benchmark
Scope of change: tests/spec_decode/
Impact tag: CI Benchmark
Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | test_benchmark_ngram_cpu.py:258 | test_group4_threshold is missing environment-variable restore logic |
| 🟡 Suggestion | test_benchmark_ngram_cpu.py:278 | test_group5_threshold_x_batch is missing environment-variable restore logic |
Overall assessment
This is a benchmark file intended for performance comparison and is marked "NOT FOR MERGE". The code is well structured and the test cases mirror the GPU benchmark. The main suggestion is to unify the handling of environment variables so the code style stays consistent.
    rows = []
    for ht in hit_types:
        cpu_us = _time_cpu(self.ngram_match, batch_size, seq_len, ht, n_runs)
        rows.append(f"  hit={ht:<12s} batch={batch_size:<4d} " f"CPU: {cpu_us:>10.1f} µs (n={n_runs})")
🟡 Suggestion: environment variable modified but never restored
This method modifies os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] directly, but unlike test_group1_seq_len, test_group2_batch_size, and test_group3_hit_type it does not restore the original value in a try...finally block.
If the test fails midway or is interrupted, the environment variable keeps the last value that was set, which can break the isolation of subsequent tests.
Consider following the other test methods and adding a try...finally block:
def test_group4_threshold(self):
    old_env = os.environ.get("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD")
    try:
        rows = []
        for thr in thresholds:
            os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = str(thr)
            # ... benchmark code ...
        _print_table(...)
    finally:
        if old_env is None:
            os.environ.pop("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD", None)
        else:
            os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = old_env

    seq_len = 32768
    hit_type = "low_input"

    rows = []
🟡 Suggestion: as above, test_group5_threshold_x_batch has the same environment-variable restore issue.
Consider adding a try...finally block here as well, to keep the code style consistent with the other test methods.
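As an alternative to a manual try...finally, the same save-and-restore behavior can be had from the standard library's unittest.mock.patch.dict, which restores os.environ even when the body raises. A minimal sketch, assuming the threshold key from this PR; the run_one_threshold helper is hypothetical and stands in for one benchmark iteration:

```python
import os
from unittest.mock import patch

KEY = "INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"

def run_one_threshold(thr):
    """Run one benchmark configuration with the threshold env var set.

    patch.dict saves and restores os.environ automatically, even if the
    body raises, so no explicit try...finally is needed.
    """
    with patch.dict(os.environ, {KEY: str(thr)}):
        # ... benchmark code would go here; this sketch just reads it back ...
        return os.environ[KEY]
```

patch.dict works as a plain context manager outside of test cases too, so the same pattern would fit both the group4 and group5 loops.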
Motivation
PR #7136 benchmarks the GPU ngram_match kernel, but its "CPU path" column only measures D2H/H2D tensor-copy overhead, not the actual C++ kernel computation. This makes the reported speedup (14×–1,700×) misleading — the real GPU-vs-CPU-compute speedup is much more modest (~0.3×–5.8× per NKNaN's profiling data).
This PR adds a standalone CPU benchmark that calls the production C++ kernel (ngram_match.cc / find_candidate_pred_tokens) with CPU-placed tensors, using the same 5-group experiment dimensions so the numbers are directly comparable.
Note: the .cc file is deleted in the GPU kernel branch; this benchmark exists on develop, where both .cc and .cu coexist.
Modifications
tests/spec_decode/test_benchmark_ngram_cpu.py (354 lines): builds paddle.CPUPlace() tensors → dispatches to the .cc C++ kernel
Usage or Command
cd FastDeploy
python tests/spec_decode/test_benchmark_ngram_cpu.py
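For reference, a wall-clock microbenchmark helper in the spirit of the _time_cpu call shown in the review diff might look like the sketch below. The helper name, warmup count, and median aggregation are assumptions, not necessarily what the PR's benchmark implements:

```python
import time

def time_cpu_us(fn, *args, n_runs=100, warmup=5):
    """Median wall-clock time of fn(*args) in microseconds.

    Sketch only: the PR's _time_cpu helper may use a different
    aggregation (mean, min) or warmup policy.
    """
    for _ in range(warmup):  # warm caches / allocator before measuring
        fn(*args)
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e6)
    samples.sort()
    return samples[len(samples) // 2]  # median is robust to outliers
```

Reporting the median over many runs keeps one-off scheduler hiccups from skewing the per-configuration numbers that feed the comparison tables.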
Not applicable — benchmark-only, no functional changes.
Checklist