
[Graph Optimization] Add blockwise cuda graph#7173

Open
RichardWooSJTU wants to merge 1 commit intoPaddlePaddle:developfrom
RichardWooSJTU:blockwise_cg_0402

Conversation

@RichardWooSJTU (Collaborator) commented Apr 3, 2026

Motivation

FastDeploy currently captures CUDA Graphs at the whole-model level, which is coarse-grained and limits flexibility. This PR introduces a block-wise CUDA Graph mechanism that captures and replays CUDA Graphs independently at the level of individual operators/layers (e.g. Linear, RMSNorm), enabling finer-grained graph optimization and improving decode-phase inference performance.

Modifications

  1. New block-wise CUDA Graph core module (fastdeploy/model_executor/graph_optimization/cuda_graph_op.py):

    • Implements the block_wise_cuda_graph_wrap decorator, supporting CUDA Graph capture/replay for any forward method
    • Supports sharing the graph cache across instances (via the self_attrs parameter): different layers with the same shapes share one captured graph, reducing the number of graphs from O(num_layers) to O(num_unique_shapes)
    • Supports capture-phase control: graphs are captured only during warmup; at runtime, uncached keys fall back to eager execution
    • Provides a clear_all_block_wise_graphs() interface for clearing the graph cache in scenarios such as RL weight updates
  2. Decorate the Linear and RMSNorm layers

    • linear.py: add the @block_wise_cuda_graph_wrap decorator to forward_cuda
    • normalization.py: add the @block_wise_cuda_graph_wrap decorator to RMSNorm.forward
  3. Pre-capture during warmup (gpu_model_runner.py):

    • Add a capture_block_wise_graphs() method that pre-captures graphs for the list of token counts given by the FD_BLOCK_WISE_CUDA_GRAPH_SIZES environment variable
    • Invoked from the warmup flow in gpu_worker.py
  4. Custom op adaptation (custom_ops/gpu_ops/helper.h):

    • GetEmptyTensor allocates memory through AllocatorFacade while in CUDA Graph capture mode, avoiding allocator compatibility issues during capture
  5. New environment variables (envs.py):

    • FD_USE_BLOCK_WISE_CUDA_GRAPH: whether to enable block-wise CUDA Graph (disabled by default)
    • FD_BLOCK_WISE_CUDA_GRAPH_SIZES: list of token counts to pre-capture (default "128,256,512,1024,2048")
  6. Core optimization points

Usage or Command

# Enable block-wise CUDA Graph
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1

# Customize the pre-captured token counts (optional)
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048"
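A parser for the sizes variable could look like the sketch below. This is a hedged illustration; the actual parsing in fastdeploy/envs.py may differ, and the deduplication/sorting behavior here is an assumption.

```python
import os

# Default mirrors the documented default for FD_BLOCK_WISE_CUDA_GRAPH_SIZES.
DEFAULT_SIZES = "128,256,512,1024,2048"

def get_block_wise_graph_sizes():
    """Parse the comma-separated token-count list from the environment.
    Deduplicates and sorts so the pre-capture order is deterministic."""
    raw = os.environ.get("FD_BLOCK_WISE_CUDA_GRAPH_SIZES", DEFAULT_SIZES)
    return sorted({int(tok) for tok in raw.split(",") if tok.strip()})
```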

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Suggested PR title: [Graph Optimization][Feature] Add block-wise CUDA Graph support for layer-level capture/replay
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Apr 3, 2026

Thanks for your contribution!


@fastdeploy-bot fastdeploy-bot left a comment


🤖 AI Code Review | 2026-04-03 11:07 CST

📋 Review Summary

PR overview: introduces a block-wise CUDA Graph mechanism that supports independent capture and replay of CUDA Graphs at the operator/layer level (Linear, RMSNorm), enabling finer-grained graph optimization.

Change scope: model_executor/graph_optimization/, model_executor/layers/, worker/, custom_ops/gpu_ops/

Impact tags: Graph Optimization, OP, Engine

Issues

Level File Summary
🟡 Suggestion cuda_graph_op.py:243 The graph should be stored in the cache only after capture succeeds
🟡 Suggestion cuda_graph_op.py:189 Missing validation when a self_attrs attribute does not exist
🟡 Suggestion linear.py:251 weight_scale_inv exists only in quantized scenarios, which affects graph sharing
🟡 Suggestion gpu_model_runner.py:3150 Missing num_tokens boundary check
❓ Question cuda_graph_op.py:29 Global-state isolation in multi-GPU scenarios

Overall Assessment

The PR design is clear, and sharing graphs across instances via self_attrs is an elegant optimization. The code is well structured, with a clean separation between the capture and replay paths. The main suggestions concern the robustness of exception handling and attribute-existence validation, all of which are non-blocking. Consider adding unit tests covering both the quantized and non-quantized scenarios.

_kp.append(True)
# Include self_attrs shapes/dtypes in key
for attr_name in _self_attr_names:
attr = getattr(self, attr_name, None)

🟡 Suggestion: when an attribute listed in self_attrs does not exist on the self object, the cache key will contain (attr_name, None); in the subsequent replay path, the pointer replacement is skipped when getattr(self, attr_name, None) returns None.

This does not cause a functional error, but it can make the same key behave inconsistently across instances (one has the attribute, one does not). It is recommended to validate, at decoration time or on first call, that the attributes listed in self_attrs actually exist on the object.

# Suggested validation in the capture phase
for attr_name in _self_attr_names:
    if not hasattr(self, attr_name):
        raise AttributeError(f"Attribute '{attr_name}' listed in self_attrs does not exist on {type(self).__name__}")

bias_tensor = paddle.to_tensor(get_tensor(state_dict.pop(self.bias_key)))
self.bias.set_value(bias_tensor)

@block_wise_cuda_graph_wrap(inputs=["x"], self_attrs=["weight", "weight_scale_inv", "bias"])

🟡 Suggestion: self_attrs includes weight_scale_inv, but that attribute exists only in the quantized scenario (created by QuantMethodBase.create_weights).

In non-quantized mode, getattr(self, 'weight_scale_inv', None) returns None and the decorator adds (attr_name, None) to the key. This does not itself cause an error, but:

  1. Quantized and non-quantized Linear layers will be unable to share the same captured graph (even when shapes match)
  2. Semantically, self_attrs should contain only attributes that actually exist

Consider using a separate decorator configuration for quantized layers, or dynamically filtering out missing attributes at runtime.
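The runtime-filtering option suggested above could be as small as the following sketch (`resolve_self_attrs` is a hypothetical helper, not part of the PR):

```python
def resolve_self_attrs(obj, attr_names):
    """Keep only the attributes that actually exist on obj, so a
    non-quantized Linear (which has no weight_scale_inv) does not
    put (attr_name, None) placeholders into its cache key."""
    return [name for name in attr_names if hasattr(obj, name)]
```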


# When True, the wrapper is in the capture phase (during dummy_run) and
# will capture new graphs. When False, uncached keys fall back to eager.
_BLOCK_WISE_CAPTURING: bool = False

❓ Question: the global variables _BLOCK_WISE_CAPTURING and _ALL_SHARED_CACHES are module-level state; in multi-process scenarios (such as Tensor Parallel), each process gets its own copy, which is the expected behavior.

But please confirm:

  1. Is there a scenario where a single process drives multiple GPUs?
  2. If so, is per-device state isolation needed?

The current implementation assumes each process is responsible for exactly one GPU; if that is a design constraint, it should be stated explicitly in the documentation.
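If per-device isolation turned out to be needed, one minimal approach would be to key the module-level caches by device id instead of using a flat global. A hypothetical sketch (names are illustrative, not from the PR):

```python
# Per-device graph caches: device id -> that device's shared cache dict.
_PER_DEVICE_CACHES = {}

def get_device_cache(device_id):
    """Return the graph cache owned by one device, creating it on demand,
    so graphs captured on one GPU are never replayed on another."""
    return _PER_DEVICE_CACHES.setdefault(device_id, {})
```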

self._dummy_run(
num_tokens=num_tokens,
batch_size=batch_size,
in_capturing=False,

🟡 Suggestion: in_capturing=False is passed to _dummy_run here, meaning whole-model CUDA Graph capture is not used.

Block-wise capture, however, relies on the global state set by set_block_wise_capturing(True). If _dummy_run raises, the try/finally does protect the flag, but the logs should explicitly mark this as the "block-wise capture phase" rather than an ordinary warmup, to ease debugging.

Also, when num_tokens > max_num_seqs * max_seq_len, _dummy_run may trigger larger-than-expected memory allocations; consider adding a boundary check:

max_valid_tokens = self.scheduler_config.max_num_seqs * self.scheduler_config.max_seq_len
if num_tokens > max_valid_tokens:
    logger.warning(f"Skipping block-wise capture for {num_tokens} > max_valid_tokens({max_valid_tokens})")
    continue

result = method(self, *args, **kwargs)
graph.capture_end()

graph.replay()

🟡 Suggestion: calling replay() immediately after capture_end() ensures the output tensors hold valid data.

However, if a CUDA error occurs during capture (e.g. OOM), capture_end() may raise, leaving the graph in an inconsistent state even though it has already been stored in graphs[key].

It is recommended to move the cache insertion to after capture_end() succeeds:

graph.capture_begin()
result = method(self, *args, **kwargs)
graph.capture_end()
graph.replay()
# Store in the cache only after success
graphs[key] = graph
cinputs[key] = ci
coutputs[key] = result

@codecov-commenter

Codecov Report

❌ Patch coverage is 24.11765% with 129 lines in your changes missing coverage. Please review.
⚠️ No coverage report was uploaded for BASE (develop@98f3fc9), so the coverage diff below cannot be computed against the base.

Files with missing lines Patch % Lines
...model_executor/graph_optimization/cuda_graph_op.py 23.74% 103 Missing and 3 partials ⚠️
fastdeploy/worker/gpu_model_runner.py 11.53% 22 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7173   +/-   ##
==========================================
  Coverage           ?   73.74%           
==========================================
  Files              ?      377           
  Lines              ?    53043           
  Branches           ?     8287           
==========================================
  Hits               ?    39119           
  Misses             ?    11192           
  Partials           ?     2732           
Flag Coverage Δ
GPU 73.74% <24.11%> (?)


☔ View full report in Codecov by Sentry.
