feat: TensorRT FP16 depth estimation backend#158

Closed
solderzzc wants to merge 3 commits into develop from feature/depth-estimation-cuda-improvements

Conversation


@solderzzc solderzzc commented Mar 15, 2026

TensorRT FP16 Backend for Depth Anything v2

Benchmark (RTX 4070 Laptop GPU, 518x518)

| Backend | Avg (ms) | FPS | Speedup |
| --- | --- | --- | --- |
| PyTorch CUDA FP32 | 36.48 | 27.4 | 1x |
| TensorRT FP16 | 5.29 | 189.0 | 6.9x |

Changes — purely additive, no existing code modified

transform.py (497 additions, 0 existing lines changed)

  • _load_tensorrt(), _build_trt_engine(), _infer_tensorrt()
  • Engine caching at ~/.aegis-ai/models/feature-extraction/trt_engines/
  • GPU-specific engine filenames (prevents cross-GPU issues)
  • bytes() wrapper for TRT 10.15 IHostMemory API
  • --backend CLI arg (auto/tensorrt/pytorch/coreml)
  • Graceful fallback: TRT → PyTorch CUDA → CPU
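The GPU-keyed engine cache and the TRT → PyTorch CUDA → CPU fallback above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `engine_cache_path`, `select_backend`, and the exact filename scheme are assumptions; only the cache directory comes from the PR description.

```python
from pathlib import Path
from importlib import util

# Cache location from the PR description
CACHE_DIR = Path.home() / ".aegis-ai" / "models" / "feature-extraction" / "trt_engines"

def engine_cache_path(gpu_name: str, model: str = "depth_anything_v2",
                      h: int = 518, w: int = 518) -> Path:
    """TensorRT engines are not portable across GPUs, so the filename
    encodes the device that built the engine (scheme is hypothetical)."""
    tag = gpu_name.lower().replace(" ", "_")
    return CACHE_DIR / f"{model}_{tag}_{h}x{w}_fp16.engine"

def select_backend(requested: str = "auto") -> str:
    """Degrade gracefully when a backend's runtime is not installed:
    TensorRT, then PyTorch CUDA, then CPU."""
    have_trt = util.find_spec("tensorrt") is not None
    have_torch = util.find_spec("torch") is not None
    if requested in ("auto", "tensorrt") and have_trt:
        return "tensorrt"
    if have_torch:
        # In the real backend, torch.cuda.is_available() would further
        # split the PyTorch path into CUDA vs CPU.
        return "pytorch"
    return "cpu"
```

A cache keyed this way means a laptop 4070 and a desktop 4090 sharing a home directory never load each other's engines.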

trt_benchmark.py (new file)

  • Standalone PyTorch vs TRT benchmark script
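The core of a standalone latency benchmark like trt_benchmark.py is warmup followed by averaged wall-clock timing. A backend-agnostic sketch (helper name and iteration counts are assumptions; the real script presumably also synchronizes the GPU before reading the clock):

```python
import time

def benchmark(fn, warmup: int = 10, iters: int = 100) -> tuple[float, float]:
    """Return (average latency in ms, FPS) for a zero-arg callable.
    Warmup iterations let JIT/engine initialization settle first."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    avg_ms = (time.perf_counter() - start) * 1000.0 / iters
    return avg_ms, 1000.0 / avg_ms
```

Running this once around the PyTorch path and once around the TRT path yields the ms/FPS pairs in the table above.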

Config files

  • models.json: TRT FP16 variant for win32
  • requirements.txt: tensorrt>=10.0, onnxruntime-gpu (non-Darwin)
  • deploy.bat: TRT verification step
  • SKILL.md: Updated hardware backends table
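The PR does not show the models.json schema, so the shape of the win32 TRT FP16 variant below is purely hypothetical; every key name is an assumption made only to illustrate what a per-platform model registry entry might carry:

```json
{
  "win32": {
    "depth-estimation": {
      "backend": "tensorrt",
      "precision": "fp16",
      "input_size": [518, 518],
      "engine_cache": "~/.aegis-ai/models/feature-extraction/trt_engines/"
    }
  }
}
```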

What is NOT changed

  • CoreML backend (macOS) — untouched
  • PyTorch inference path — untouched
  • benchmark.py — untouched
  • No existing control flow modified

- Upgrade PyTorch CUDA wheels from cu124 to cu126 (RTX 4090/5090)
- Fix _load_config() dropping CLI args (--model, --colormap, --blend-mode)
- Add deploy.bat for Windows venv + CUDA setup
- Add cross-platform benchmark.py (CoreML + PyTorch/CUDA/MPS/CPU)
- Track models.json (platform model registry)
- Bump depth-estimation version 1.1.0 → 1.2.0 in skills.json
@Intersteller-Apex Intersteller-Apex changed the title feat: cross-platform CUDA depth estimation improvements feat: TensorRT FP16 + cross-platform CUDA depth estimation Mar 16, 2026
- _load_tensorrt(), _build_trt_engine(), _infer_tensorrt() methods
- Engine caching at ~/.aegis-ai/models/feature-extraction/trt_engines/
- GPU-specific engine filenames (prevents cross-GPU issues)
- IHostMemory bytes() fix for TRT 10.15+
- Graceful fallback: TRT > PyTorch CUDA > CPU
- Added --backend CLI arg
- Added trt_benchmark.py for standalone benchmarking

Benchmark: RTX 4070 Laptop GPU 518x518
  PyTorch CUDA FP32: 36.48ms (27.4 FPS)
  TensorRT FP16:      5.29ms (189 FPS) — 6.9x faster
@Intersteller-Apex Intersteller-Apex force-pushed the feature/depth-estimation-cuda-improvements branch from bbbc9fa to 26b9ff2 on March 16, 2026 01:49
@Intersteller-Apex Intersteller-Apex changed the title feat: TensorRT FP16 + cross-platform CUDA depth estimation feat: TensorRT FP16 depth estimation backend Mar 16, 2026
