fix(windows): handle D3D11 device-lost (TDR) in AMD capture and encoding pipeline #596

Open
qiin2333 wants to merge 1 commit into master from fix/amd-capture-device-lost-recovery

Conversation

@qiin2333
Collaborator

Problem

When a GPU TDR (Timeout Detection and Recovery) occurs during AMD DirectCapture streaming, the capture pipeline either hangs indefinitely or crashes. Users see the stream freeze; they can still switch to the desktop, but the wallpaper turns black.

Root Cause

Five locations in the capture/encoding pipeline failed to detect and propagate D3D11 device-lost errors:

  1. DXGI Desktop Duplication returned capture_e::error (fatal) instead of capture_e::reinit (recoverable) on device removal
  2. Encoding thread called return (permanent exit) on convert/encode failure instead of break (allow reinit)
  3. AMF encoder didn't check device state after surface creation/submission failures
  4. AMD DirectCapture masked device-lost as capture_e::timeout, causing infinite retry on dead device
  5. VRAM convert keyed mutex failure didn't log device removal reason
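The first root cause can be illustrated with a minimal, portable sketch of the status mapping. It is not the code from display_base.cpp: `capture_e` here is a cut-down stand-in, `classify_hresult` and the `k...` constants are hypothetical names, and the HRESULT values are copied from the Windows SDK so the sketch compiles without Windows headers.

```cpp
#include <cassert>
#include <cstdint>

// Cut-down stand-in for the capture status enum discussed above.
enum class capture_e { ok, reinit, timeout, error };

// HRESULT values as defined in the Windows SDK, redefined locally
// (hypothetical names) so this compiles on any platform.
using hresult_t = std::uint32_t;
constexpr hresult_t kOk            = 0x00000000u;  // S_OK
constexpr hresult_t kFail          = 0x80004005u;  // E_FAIL
constexpr hresult_t kDeviceRemoved = 0x887A0005u;  // DXGI_ERROR_DEVICE_REMOVED
constexpr hresult_t kDeviceReset   = 0x887A0007u;  // DXGI_ERROR_DEVICE_RESET
constexpr hresult_t kWaitTimeout   = 0x887A0027u;  // DXGI_ERROR_WAIT_TIMEOUT

// Map a duplication-API result to a capture status. Device removal and
// reset are treated as recoverable (reinit) instead of fatal (error).
constexpr capture_e classify_hresult(hresult_t hr) {
  switch (hr) {
    case kOk:
      return capture_e::ok;
    case kWaitTimeout:
      return capture_e::timeout;
    case kDeviceRemoved:
    case kDeviceReset:
      return capture_e::reinit;
    default:
      return capture_e::error;
  }
}

// Usage: the asserts document the intended mapping.
inline void demo_classify() {
  assert(classify_hresult(kDeviceRemoved) == capture_e::reinit);
  assert(classify_hresult(kDeviceReset) == capture_e::reinit);
  assert(classify_hresult(kWaitTimeout) == capture_e::timeout);
  assert(classify_hresult(kFail) == capture_e::error);
}
```

The real function may classify further codes; the sketch only covers the two named in this PR plus a timeout and a generic failure for contrast.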

Changes

| File | Change | Impact |
| --- | --- | --- |
| display_base.cpp | Handle DXGI_ERROR_DEVICE_REMOVED/RESET → return reinit | All capture backends (DXGI) |
| video.cpp | convert/encode failure: return → break | All encoders |
| amf_d3d11.cpp | Log device-lost after AMF failures | AMD only (diagnostic) |
| display_amd.cpp | Detect device-lost in QueryOutput → return reinit | AMD DirectCapture |
| display_vram.cpp | Log GetDeviceRemovedReason on AcquireSync failure | All GPU encode (diagnostic) |
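As a rough illustration of the display_amd.cpp change, the sketch below shows how a failed query can be split into "device is dead" and "no frame yet". `fake_device` and `on_query_output_failure` are hypothetical names; the only real API assumed is that `ID3D11Device::GetDeviceRemovedReason()` returns `S_OK` while the device is healthy.

```cpp
#include <cassert>
#include <cstdint>

enum class capture_e { ok, reinit, timeout, error };

using hresult_t = std::uint32_t;
constexpr hresult_t kOk         = 0x00000000u;  // S_OK
constexpr hresult_t kDeviceHung = 0x887A0006u;  // DXGI_ERROR_DEVICE_HUNG

// Hypothetical stand-in for ID3D11Device, reduced to the one call used here.
struct fake_device {
  hresult_t removed_reason = kOk;
  hresult_t GetDeviceRemovedReason() const { return removed_reason; }
};

// Called after a capture query fails. A non-S_OK removal reason means the
// device is gone, so retrying would never succeed: request a reinit.
// Otherwise keep the previous behaviour and report a timeout.
inline capture_e on_query_output_failure(const fake_device &dev) {
  if (dev.GetDeviceRemovedReason() != kOk) {
    return capture_e::reinit;
  }
  return capture_e::timeout;
}

// Usage: a healthy device times out, a hung device asks for reinit.
inline void demo_query_output_failure() {
  fake_device healthy;
  fake_device hung;
  hung.removed_reason = kDeviceHung;
  assert(on_query_output_failure(healthy) == capture_e::timeout);
  assert(on_query_output_failure(hung) == capture_e::reinit);
}
```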

Impact Analysis

  • NVIDIA/Intel: Benefit from Fix 1 (DXGI reinit) and Fix 2 (encode resilience). No regressions.
  • AMD: Full TDR recovery chain - capture detects device-lost → signals reinit → encoding breaks inner loop → outer loop rebuilds pipeline → streaming resumes.
  • Infinite loop risk: Low. Multiple circuit breakers present (reinit_event + sleep, display_wp expiry check, shutdown_event).
  • Thread safety: All cross-thread communication uses existing safe::signal_t and sync_util::sync_t.
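The recovery chain and its circuit breakers can be sketched as two nested loops. This is a simplified model, not the code in video.cpp: `run_session`, `step`, and `run_stats` are hypothetical, failures are scripted rather than coming from a GPU, and a plain `max_rebuilds` cap stands in for the real breakers (reinit_event + sleep, display_wp expiry check, shutdown_event).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scripted outcome of one convert/encode attempt.
enum class step { ok, failed };

struct run_stats {
  int pipelines_built = 0;
  int frames_encoded = 0;
};

// The outer loop (re)builds the pipeline; the inner loop encodes frames.
// On failure the inner loop uses `break`, so control returns to the outer
// loop and the pipeline is rebuilt. A `return` at that point would end the
// whole session instead.
inline run_stats run_session(const std::vector<step> &script, int max_rebuilds) {
  run_stats stats;
  std::size_t next = 0;
  while (next < script.size() && stats.pipelines_built < max_rebuilds) {
    ++stats.pipelines_built;
    while (next < script.size()) {
      const step s = script[next++];
      if (s == step::failed) {
        break;  // recoverable: let the outer loop rebuild
      }
      ++stats.frames_encoded;
    }
  }
  return stats;
}

// Usage: two failures cost two rebuilds, and the cap bounds the retries.
inline void demo_run_session() {
  const std::vector<step> script = {step::ok, step::ok, step::failed,
                                    step::ok, step::failed, step::ok};
  const run_stats recovered = run_session(script, 8);
  assert(recovered.pipelines_built == 3);
  assert(recovered.frames_encoded == 4);

  const run_stats capped = run_session(script, 1);
  assert(capped.pipelines_built == 1);
  assert(capped.frames_encoded == 2);
}
```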

Testing

Requires AMD GPU with DirectCapture enabled. Simulate TDR by triggering GPU timeout during gameplay.

…ing pipeline

When a GPU TDR occurs during AMD DirectCapture streaming, the capture
pipeline would either hang or crash because device-lost errors were not
properly detected and propagated.

Changes:
- display_base.cpp: Handle DXGI_ERROR_DEVICE_REMOVED/RESET in
  AcquireNextFrame and ReleaseFrame, returning capture_e::reinit
  instead of capture_e::error
- video.cpp: Change convert/encode failures from return (permanent
  exit) to break (allows outer loop to reinit pipeline)
- amf_d3d11.cpp: Add device-lost detection after AMF
  CreateSurfaceFromDX11Native and SubmitInput failures
- display_amd.cpp: Detect device-lost in AMD DirectCapture QueryOutput,
  return capture_e::reinit instead of masking as timeout
- display_vram.cpp: Log device removed reason on AcquireSync failure
  for better TDR diagnostics
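
For the diagnostic logging in amf_d3d11.cpp and display_vram.cpp, a removal reason is easiest to look up when printed as an eight-digit hex HRESULT. The helper below is a hypothetical sketch using only the standard library; the PR itself uses the project's own `util::hex(...)`, whose output format is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Format a device-removed reason as a fixed-width, upper-case hex HRESULT.
// The message text is illustrative, not the log line used by the project.
inline std::string format_removed_reason(std::uint32_t reason) {
  std::ostringstream out;
  out << "device removed, reason: 0x" << std::hex << std::uppercase
      << std::setw(8) << std::setfill('0') << reason;
  return out.str();
}

// Usage: 0x887A0006 is DXGI_ERROR_DEVICE_HUNG, a typical TDR reason.
inline void demo_format_removed_reason() {
  assert(format_removed_reason(0x887A0006u) ==
         "device removed, reason: 0x887A0006");
}
```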
@coderabbitai

coderabbitai bot commented Apr 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: acd8eba2-f12f-43a9-b13b-4c2ce30b2602

📥 Commits

Reviewing files that changed from the base of the PR and between 7f11979 and ddfb661.

📒 Files selected for processing (5)
  • src/amf/amf_d3d11.cpp
  • src/platform/windows/display_amd.cpp
  • src/platform/windows/display_base.cpp
  • src/platform/windows/display_vram.cpp
  • src/video.cpp
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Windows
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.{cpp,c,h}

⚙️ CodeRabbit configuration file

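src/**/*.{cpp,c,h}: Sunshine core C++ source code, a self-hosted game streaming server. Review focus: memory safety, thread safety, RAII resource management, and security vulnerabilities. Pay attention to platform-specific code controlled by preprocessor macros.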

Files:

  • src/platform/windows/display_vram.cpp
  • src/platform/windows/display_amd.cpp
  • src/platform/windows/display_base.cpp
  • src/video.cpp
  • src/amf/amf_d3d11.cpp
src/platform/**

⚙️ CodeRabbit configuration file

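src/platform/**: Platform abstraction layer code (Windows/Linux/macOS). Ensure the platform implementations are consistent, and pay attention to error handling and resource release for Windows API calls.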

Files:

  • src/platform/windows/display_vram.cpp
  • src/platform/windows/display_amd.cpp
  • src/platform/windows/display_base.cpp
🔇 Additional comments (6)
src/platform/windows/display_vram.cpp (1)

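449-458: Device-lost diagnostics are now complete

Logging GetDeviceRemovedReason() when acquiring the encoding mutex fails is very valuable: it directly distinguishes an ordinary synchronization failure from a TDR or driver reset, without changing the existing failure semantics.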

src/amf/amf_d3d11.cpp (2)

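20-20: Header addition is correct

The new src/utility.h include matches the later util::hex(...) calls, so the dependency is complete.


608-614: The device-lost logging added to the AMF failure paths is reasonable

Adding GetDeviceRemovedReason() diagnostics after CreateSurfaceFromDX11Native/SubmitInput failures should significantly speed up troubleshooting in AMD TDR scenarios.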

Also applies to: 713-719

src/video.cpp (1)

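2887-2888: Replacing return with break gives the correct recovery semantics

Lines 2887 and 2910 now break on convert/encode failure, so the encoding loop exits in a recoverable way and the outer rebuild flow takes over. This matches the goal of the device-lost recovery chain in this PR.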

Also applies to: 2910-2911

src/platform/windows/display_amd.cpp (1)

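80-92: Device-lost detection and status mapping are well implemented

Lines 80-88 add a GetDeviceRemovedReason() check and return capture_e::reinit, which avoids misclassifying a dead device as a timeout and correctly triggers the upper-layer rebuild flow.

As per coding guidelines src/platform/**: Platform abstraction layer code (Windows/Linux/macOS). Ensure the platform implementations are consistent, and pay attention to error handling and resource release for Windows API calls.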

src/platform/windows/display_base.cpp (1)

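157-160: Classifying DXGI_ERROR_DEVICE_REMOVED/RESET as reinit is the correct fix

Lines 157-160 and 195-198 consistently map device loss to capture_e::reinit on both the Acquire and Release paths. The recovery chain is complete and the diagnostic logs are clear.

As per coding guidelines src/platform/**: Platform abstraction layer code (Windows/Linux/macOS). Ensure the platform implementations are consistent, and pay attention to error handling and resource release for Windows API calls.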

Also applies to: 195-198


Summary by CodeRabbit

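Release Notes

  • Bug Fixes
    • Improved error handling and recovery when the graphics device is lost or removed
    • Enhanced device-failure diagnostic logging, recording detailed error codes for troubleshooting
    • Improved the error recovery flow during encoding, increasing system reliability

Walkthrough

This PR strengthens detection and handling of DirectX device-lost/reset errors. It adds device state queries and logging on several failure paths in the encoding and capture modules, and adjusts control flow to support device reinitialization.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Encoder error diagnostics: src/amf/amf_d3d11.cpp, src/platform/windows/display_vram.cpp | Query the D3D11 device removed reason when CreateSurfaceFromDX11Native, SubmitInput, or mutex acquisition fails. If the device has been removed, log the removal reason code in hexadecimal. |
| Capture error handling: src/platform/windows/display_amd.cpp, src/platform/windows/display_base.cpp | Strengthen failure handling for QueryOutput, AcquireNextFrame, and ReleaseFrame. Detect device-lost/reset state, return capture_e::reinit instead of capture_e::timeout, and log the corresponding DXGI error code. |
| Encoding loop control flow: src/video.cpp | Change two early exit points in encode_run from return to break, allowing the outer reinitialization logic to handle convert and encode failures. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes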

🚥 Pre-merge checks | ✅ 2 | ❌ 1

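❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.67%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately reflects the main change: fixing the handling of D3D11 device-lost (TDR) in the AMD capture and encoding pipeline on Windows. |
| Description check | ✅ Passed | The description relates fully to the code changes, detailing the problem, root cause, solution, and impact analysis, and corresponds exactly to the modifications in the five files. |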
