
[Graph Optimization] Add blockwise cuda graph#7173

Open
RichardWooSJTU wants to merge 1 commit intoPaddlePaddle:developfrom
RichardWooSJTU:blockwise_cg_0402

Conversation

@RichardWooSJTU (Collaborator) commented Apr 3, 2026

Motivation

FastDeploy currently captures CUDA Graphs at the whole-model level, which is coarse-grained and limits flexibility. This PR introduces a block-wise CUDA Graph mechanism that captures and replays CUDA Graphs independently at the level of individual operators/layers (e.g. Linear, RMSNorm), enabling finer-grained graph optimization and improving decode-phase inference performance.

Modifications

  1. New block-wise CUDA Graph core module (fastdeploy/model_executor/graph_optimization/cuda_graph_op.py):

    • Implements the block_wise_cuda_graph_wrap decorator, supporting CUDA Graph capture/replay for any forward method
    • Supports sharing the graph cache across instances (via the self_attrs parameter): different layers with the same shapes share one captured graph, reducing the number of graphs from O(num_layers) to O(num_unique_shapes)
    • Supports capture-phase control: graphs are captured only during warmup; at runtime, uncached keys fall back to eager execution
    • Provides a clear_all_block_wise_graphs() interface for clearing the graph cache in scenarios such as RL weight updates
  2. Decorate the Linear and RMSNorm layers

    • linear.py: add the @block_wise_cuda_graph_wrap decorator to forward_cuda
    • normalization.py: add the @block_wise_cuda_graph_wrap decorator to RMSNorm.forward
  3. Pre-capture during warmup (gpu_model_runner.py):

    • Add a capture_block_wise_graphs() method that pre-captures graphs for the list of token counts given by the FD_BLOCK_WISE_CUDA_GRAPH_SIZES environment variable
    • Invoked from the warmup flow in gpu_worker.py
  4. Custom op adaptation (custom_ops/gpu_ops/helper.h):

    • GetEmptyTensor allocates memory through AllocatorFacade while in CUDA Graph capture mode, avoiding allocator compatibility issues during capture
  5. New environment variables (envs.py):

    • FD_USE_BLOCK_WISE_CUDA_GRAPH: whether to enable block-wise CUDA Graph (disabled by default)
    • FD_BLOCK_WISE_CUDA_GRAPH_SIZES: list of token counts to pre-capture (default "128,256,512,1024,2048")
  6. Core optimization points

Usage or Command

# Enable block-wise CUDA Graph
export FD_USE_BLOCK_WISE_CUDA_GRAPH=1

# Customize the pre-captured token counts (optional)
export FD_BLOCK_WISE_CUDA_GRAPH_SIZES="1,2,4,8,16,32,64,128,256,512,1024,2048"
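A parser for the sizes variable could look like the sketch below. This is a hedged illustration; the actual parsing in fastdeploy/envs.py may differ, and the deduplication/sorting behavior here is an assumption.

```python
import os

# Default mirrors the documented default for FD_BLOCK_WISE_CUDA_GRAPH_SIZES.
DEFAULT_SIZES = "128,256,512,1024,2048"

def get_block_wise_graph_sizes():
    """Parse the comma-separated token-count list from the environment.
    Deduplicates and sorts so the pre-capture order is deterministic."""
    raw = os.environ.get("FD_BLOCK_WISE_CUDA_GRAPH_SIZES", DEFAULT_SIZES)
    return sorted({int(tok) for tok in raw.split(",") if tok.strip()})
```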

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Suggested PR title: [Graph Optimization][Feature] Add block-wise CUDA Graph support for layer-level capture/replay
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Apr 3, 2026

Thanks for your contribution!


@fastdeploy-bot fastdeploy-bot left a comment


🤖 AI Code Review | 2026-04-03 11:07 CST

📋 Review Summary

PR overview: introduces a block-wise CUDA Graph mechanism that supports independent capture and replay of CUDA Graphs at the operator/layer level (Linear, RMSNorm), enabling finer-grained graph optimization.

Change scope: model_executor/graph_optimization/, model_executor/layers/, worker/, custom_ops/gpu_ops/

Impact tags: Graph Optimization, OP, Engine

Issues

Level File Summary
🟡 Suggestion cuda_graph_op.py:243 The graph should be stored in the cache only after capture succeeds
🟡 Suggestion cuda_graph_op.py:189 Missing validation when a self_attrs attribute does not exist
🟡 Suggestion linear.py:251 weight_scale_inv exists only in quantized scenarios, which affects graph sharing
🟡 Suggestion gpu_model_runner.py:3150 Missing num_tokens boundary check
❓ Question cuda_graph_op.py:29 Global-state isolation in multi-GPU scenarios

Overall Assessment

The PR design is clear, and sharing graphs across instances via self_attrs is an elegant optimization. The code is well structured, with a clean separation between the capture and replay paths. The main suggestions concern the robustness of exception handling and attribute-existence validation, all of which are non-blocking. Consider adding unit tests covering both the quantized and non-quantized scenarios.

_kp.append(True)
# Include self_attrs shapes/dtypes in key
for attr_name in _self_attr_names:
attr = getattr(self, attr_name, None)

🟡 Suggestion: when an attribute listed in self_attrs does not exist on the self object, the cache key will contain (attr_name, None); in the subsequent replay path, the pointer replacement is skipped when getattr(self, attr_name, None) returns None.

This does not cause a functional error, but it can make the same key behave inconsistently across instances (one has the attribute, one does not). It is recommended to validate, at decoration time or on first call, that the attributes listed in self_attrs actually exist on the object.

# Suggested validation in the capture phase
for attr_name in _self_attr_names:
    if not hasattr(self, attr_name):
        raise AttributeError(f"Attribute '{attr_name}' listed in self_attrs does not exist on {type(self).__name__}")

bias_tensor = paddle.to_tensor(get_tensor(state_dict.pop(self.bias_key)))
self.bias.set_value(bias_tensor)

@block_wise_cuda_graph_wrap(inputs=["x"], self_attrs=["weight", "weight_scale_inv", "bias"])

🟡 Suggestion: self_attrs includes weight_scale_inv, but that attribute exists only in the quantized scenario (created by QuantMethodBase.create_weights).

In non-quantized mode, getattr(self, 'weight_scale_inv', None) returns None and the decorator adds (attr_name, None) to the key. This does not itself cause an error, but:

  1. Quantized and non-quantized Linear layers will be unable to share the same captured graph (even when shapes match)
  2. Semantically, self_attrs should contain only attributes that actually exist

Consider using a separate decorator configuration for quantized layers, or dynamically filtering out missing attributes at runtime.
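The runtime-filtering option suggested above could be as small as the following sketch (`resolve_self_attrs` is a hypothetical helper, not part of the PR):

```python
def resolve_self_attrs(obj, attr_names):
    """Keep only the attributes that actually exist on obj, so a
    non-quantized Linear (which has no weight_scale_inv) does not
    put (attr_name, None) placeholders into its cache key."""
    return [name for name in attr_names if hasattr(obj, name)]
```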


# When True, the wrapper is in the capture phase (during dummy_run) and
# will capture new graphs. When False, uncached keys fall back to eager.
_BLOCK_WISE_CAPTURING: bool = False

❓ Question: the global variables _BLOCK_WISE_CAPTURING and _ALL_SHARED_CACHES are module-level state; in multi-process scenarios (such as Tensor Parallel), each process gets its own copy, which is the expected behavior.

But please confirm:

  1. Is there a scenario where a single process drives multiple GPUs?
  2. If so, is per-device state isolation needed?

The current implementation assumes each process is responsible for exactly one GPU; if that is a design constraint, it should be stated explicitly in the documentation.
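If per-device isolation turned out to be needed, one minimal approach would be to key the module-level caches by device id instead of using a flat global. A hypothetical sketch (names are illustrative, not from the PR):

```python
# Per-device graph caches: device id -> that device's shared cache dict.
_PER_DEVICE_CACHES = {}

def get_device_cache(device_id):
    """Return the graph cache owned by one device, creating it on demand,
    so graphs captured on one GPU are never replayed on another."""
    return _PER_DEVICE_CACHES.setdefault(device_id, {})
```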

self._dummy_run(
num_tokens=num_tokens,
batch_size=batch_size,
in_capturing=False,

🟡 Suggestion: in_capturing=False is passed to _dummy_run here, meaning whole-model CUDA Graph capture is not used.

Block-wise capture, however, relies on the global state set by set_block_wise_capturing(True). If _dummy_run raises, the try/finally does protect the flag, but the logs should explicitly mark this as the "block-wise capture phase" rather than an ordinary warmup, to ease debugging.

Also, when num_tokens > max_num_seqs * max_seq_len, _dummy_run may trigger larger-than-expected memory allocations; consider adding a boundary check:

max_valid_tokens = self.scheduler_config.max_num_seqs * self.scheduler_config.max_seq_len
if num_tokens > max_valid_tokens:
    logger.warning(f"Skipping block-wise capture for {num_tokens} > max_valid_tokens({max_valid_tokens})")
    continue

result = method(self, *args, **kwargs)
graph.capture_end()

graph.replay()

🟡 Suggestion: calling replay() immediately after capture_end() ensures the output tensors hold valid data.

However, if a CUDA error occurs during capture (e.g. OOM), capture_end() may raise, leaving the graph in an inconsistent state even though it has already been stored in graphs[key].

It is recommended to move the cache insertion to after capture_end() succeeds:

graph.capture_begin()
result = method(self, *args, **kwargs)
graph.capture_end()
graph.replay()
# Store in the cache only after success
graphs[key] = graph
cinputs[key] = ci
coutputs[key] = result

@codecov-commenter

Codecov Report

❌ Patch coverage is 24.11765% with 129 lines in your changes missing coverage. Please review.
⚠️ No coverage report was uploaded for BASE (develop@98f3fc9), so the coverage diff below cannot be computed against the base.

Files with missing lines Patch % Lines
...model_executor/graph_optimization/cuda_graph_op.py 23.74% 103 Missing and 3 partials ⚠️
fastdeploy/worker/gpu_model_runner.py 11.53% 22 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7173   +/-   ##
==========================================
  Coverage           ?   73.74%           
==========================================
  Files              ?      377           
  Lines              ?    53043           
  Branches           ?     8287           
==========================================
  Hits               ?    39119           
  Misses             ?    11192           
  Partials           ?     2732           
Flag Coverage Δ
GPU 73.74% <24.11%> (?)


☔ View full report in Codecov by Sentry.
